[mpir-devel] Re: K8 mul_basecase

jason Wed, 24 Dec 2008 08:15:25 -0800

On Wednesday 24 December 2008 09:39:20 Bill Hart wrote:
> Merry Christmas Brian (and all).
>
> I used to have an Athlon XP but no longer. However I do have an AMD
> Turion 64 x2 and of course have access to Opterons. I'm betting the
> tuning parameters are nearly the same for these machines. We can give
> it a go. I'll send a file hopefully later today. It can't hurt to try
> anyway. Of course I'll need to get make tune working again, which
> might be a mission.


I have just run tune on my Linux box and committed the gmp-param file to svn;
it should be a fair match for Windows.


>
> Well done to Jason Moxham with the assembly improvements. That's
> pretty amazing. Loving the enthusiasm.
>
> Bill.
>
> 2008/12/24 Cactus <rieman...@googlemail.com>:
> > On Dec 24, 8:50 am, Cactus <rieman...@googlemail.com> wrote:
> >> On Dec 23, 11:31 pm, ja...@njkfrudils.plus.com wrote:
> >> > On Tuesday 23 December 2008 22:52:10 Cactus wrote:
> >> > > On Dec 22, 11:55 pm, jason <ja...@njkfrudils.plus.com> wrote:
> >> > > > On Dec 20, 1:13 pm, Cactus <rieman...@googlemail.com> wrote:
> >> > > > > On Dec 20, 10:49 am, Cactus <rieman...@googlemail.com> wrote:
> >> > > > > > On Dec 20, 3:56 am, "Bill Hart" <goodwillh...@googlemail.com>
> >> > > > > > wrote:
> >> > > > >
> >> > > > > Following up my earlier results, I have now played with
> >> > > > > alignment and jump decisions and I find that:
> >> > > > >
> >> > > > >     jc      .1
> >> > > > >     jmp     .2
> >> > > > >
> >> > > > >     align   16
> >> > > > > .1:mov     rax, [r10+r8*8]
> >> > > > >
> >> > > > > in which there is a jump to aligned code (rather than falling
> >> > > > > through and hence executing the padding code) gives
> >> > > > > significantly better results:
> >> > > > >
> >> > > > >  Jason's Code (mp_add_n and mp_sub_n):
> >> > > > > Jason's Code (mp_addmul_n and mp_submul_n):
> >> > > > > Jason's Code (mp_mul_1):
> >> > > > >
> >> > > > > Running benchmarks
> >> > > > >   Category base
> >> > > > >     Program multiply
> >> > > > >       multiply 128 128
> >> > > > >       MPIRbench.base.multiply.128.128 result: 26701842
> >> > > > >       multiply 512 512
> >> > > > >       MPIRbench.base.multiply.512.512 result: 6455010
> >> > > > >       multiply 8192 8192
> >> > > > >       MPIRbench.base.multiply.8192.8192 result: 61537
> >> > > > >       multiply 131072 131072
> >> > > > >       MPIRbench.base.multiply.131072.131072 result: 938
> >> > > > >       multiply 2097152 2097152
> >> > > > >       MPIRbench.base.multiply.2097152.2097152 result: 23.0
> >> > > > >     MPIRbench.base.multiply result: 46978.70
> >> > > > >     Program divide
> >> > > > >       divide 8192 32
> >> > > > >       MPIRbench.base.divide.8192.32 result: 677900
> >> > > > >       divide 8192 64
> >> > > > >       MPIRbench.base.divide.8192.64 result: 689331
> >> > > > >       divide 8192 128
> >> > > > >       MPIRbench.base.divide.8192.128 result: 269308
> >> > > > >       divide 8192 4096
> >> > > > >       MPIRbench.base.divide.8192.4096 result: 116612
> >> > > > >       divide 8192 8064
> >> > > > >       MPIRbench.base.divide.8192.8064 result: 1027764
> >> > > > >       divide 131072 8192
> >> > > > >       MPIRbench.base.divide.131072.8192 result: 2667
> >> > > > >       divide 131072 65536
> >> > > > >       MPIRbench.base.divide.131072.65536 result: 1249
> >> > > > >       divide 8388608 4194304
> >> > > > >       MPIRbench.base.divide.8388608.4194304 result: 2.56
> >> > > > >     MPIRbench.base.divide result: 24471.64
> >> > > > >   MPIRbench.base result 33906.43
> >> > > > >   Category app
> >> > > > >     Program rsa
> >> > > > >       rsa 512
> >> > > > >       MPIRbench.app.rsa.512 result: 14055
> >> > > > >       rsa 1024
> >> > > > >       MPIRbench.app.rsa.1024 result: 2735
> >> > > > >       rsa 2048
> >> > > > >       MPIRbench.app.rsa.2048 result: 498
> >> > > > >     MPIRbench.app.rsa result: 2675.09
> >> > > > >   MPIRbench.app result 2675.09
> >> > > > > MPIRbench result: 9523.81
> >> > > > >
> >> > > > > This is about 8% faster than my original Windows code.
> >> > > > >
> >> > > > > Well done Jason!
> >> > > > >
> >> > > > >      Brian
> >> > > >
> >> > > > I've put the mpn_mul_basecase in the mpir development branch,
> >> > > > ready for conversion to Windows.
> >> > > > http://www.digitalmischief.co.uk/fruitbowl/ is the latest, with a
> >> > > > new mpn_sqr_basecase and mpn_redc_basecase, which overall gives me
> >> > > > a 60% improvement over gmp-4.2.4 (which by coincidence is the same
> >> > > > ratio as 4/2.5, the addmul ratio!!!). They are very much still
> >> > > > cut&paste, so expect a few more % in time. I'm going to try a
> >> > > > division_basecase and a mullow and mulhigh basecase next. There
> >> > > > is also an addmul loop in bdivmod.c which does something, and may
> >> > > > be worth doing.
> >> > >
> >> > > Hi Jason,
> >> > >
> >> > > Thanks for the mpn_mul_basecase code.
> >> > >
> >> > > I have converted this to Windows and it is slower than my old code -
> >> > > the mpirbench score with the new code is 9350 whereas the current
> >> > > code is 9550, which is a 2% performance loss. Only the
> >> > > mpn_mul_basecase code is different - I have kept your other routines
> >> > > in place in making this comparison.
> >> >
> >> > Odd!!!
> >> > Did you run tune? I assume your old code is the Gaudry code.
> >> > It doesn't even sound like it's running!!
> >> >
> >> > > In this case there is about the same prologue/epilogue overhead in
> >> > > both versions so it will be interesting to see how it compares on
> >> > > Linux.
> >> > >
> >> > >     Brian
> >>
> >> I can't run tune under Windows so, no, the same tuning parameters are
> >> being used in both runs.
> >>
> >> If you have tuning parameters that might be appropriate for an AMD
> >> Athlon X2, I can try them.
> >>
> >> I am confident that the right code is being used in these comparisons
> >> because I use a debugger to check this (I have been caught by this
> >> sort of problem previously).
> >>
> >> My existing code is basically a translation of Pierrick Gaudry's code
> >> for YASM with Intel syntax.
> >>
> >>     Brian
> >
> > Hi All
> >
> > I have tracked down a problem in my Windows conversion of the
> > mul_basecase code and this is now showing a good performance gain,
> > from 9,520 to 10,100.
> >
> > In overall terms Jason's work takes my original Windows code from
> > 8,800 to 10,100 - a 15% gain in performance. This is without any
> > tuning so there may be more to be gained if the tuning parameters are
> > adjusted.
> >
> > Does anyone have a comparison between the tuning needed for our old
> > code and that for Jason's code?  This would help me as I can then try
> > out new parameters on Windows.  I could try to get tune working on
> > Windows but I am fearful that this is likely to be a big job.
> >
> > A happy Christmas to all.
> >
> >   Brian
>
> 
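
For anyone comparing the figures quoted in this thread, here is a minimal
sketch of the arithmetic in Python. The scores are copied from the messages
above; the helper names (geometric_mean, percent_gain) are made up for
illustration. The quoted category results happen to agree with an unweighted
geometric mean of their sub-results to the printed precision. That is an
observation from these numbers only, not a statement about how MPIRbench is
implemented.

```python
import math


def geometric_mean(scores):
    """Unweighted geometric mean of a list of positive scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def percent_gain(old, new):
    """Relative change from old to new, in percent (negative means a loss)."""
    return (new / old - 1.0) * 100.0


# Scores copied from the quoted MPIRbench run.
multiply, divide = 46978.70, 24471.64   # MPIRbench.base.multiply / .divide
base, app = 33906.43, 2675.09           # MPIRbench.base / MPIRbench.app

# Assumption: each level combines the level below as a plain geometric mean.
base_check = geometric_mean([multiply, divide])
overall_check = geometric_mean([base, app])

print(f"base    reported 33906.43, recomputed {base_check:.2f}")
print(f"overall reported 9523.81,  recomputed {overall_check:.2f}")

# Overall scores mentioned in the thread.
print(f"9550 -> 9350:  {percent_gain(9550, 9350):+.1f}%")
print(f"9520 -> 10100: {percent_gain(9520, 10100):+.1f}%")
print(f"8800 -> 10100: {percent_gain(8800, 10100):+.1f}%")
```

Because the overall score behaves like a geometric mean here, a gain in a
single routine such as mul_basecase shows up much diluted in the final
number, which is worth keeping in mind when reading the percentages above.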


