Re: [mpir-devel] Re: New assembler

Jason Sat, 04 Dec 2010 09:13:18 -0800

On Saturday 04 December 2010 10:46:35 Jason wrote:
> On Saturday 04 December 2010 09:35:02 Cactus wrote:
> > On Dec 4, 3:20 am, Jason <ja...@njkfrudils.plus.com> wrote:
> > > On Saturday 04 December 2010 02:01:48 Jason wrote:
> > > > On Saturday 04 December 2010 01:40:13 Bill Hart wrote:
> > > > > On 4 December 2010 00:52, Jason <ja...@njkfrudils.plus.com> wrote:
> > > > > > Hi
> > > > > > 
> > > > > > Here's the first lot of new assembler code for x64 (in trunk).
> > > > > > 
> > > > > > popcount/hamdist are not terribly useful for MPIR , but they do
> > > > > > offer a simple way to practice stuff.
> > > > > > 
> > > > > > K8 popcount was 5.5c/l with 2way unroll now 4.66c/l with 3way
> > > > > > K8 hamdist was 5.5c/l with 2way unroll now 5.0c/l with 3way
> > > > > > 
> > > > > > The above was just practice for the core2 version, which uses
> > > > > > SSE. If I'm going to try to use SSE for anything other than
> > > > > > trivial copies/logic then I need the practice.
> > > > > > 
> > > > > > core2/penryn popcount was 6.5c/l with 4way unroll now 2.75c/l
> > > > > > with 4way
> > > > > > 
> > > > > > The hamdist shows similar improvements , just have to write the
> > > > > > horrible SSE alignment stuff , yuck..
> > > > > > 
> > > > > > K10 popcount was 1.5c/l with 4way unroll now 1.0c/l with 2way
> > > > > > K10 hamdist was 1.9c/l with 4way unroll now 1.5c/l with 4way
> > > > > 
> > > > > Wow, sounds like a lot of great work Jason.
> > > > > 
> > > > > > The above are "optimal", although with very large unrolls
> > > > > > (28-way; 10-way is probably the minimum) we could get down to
> > > > > > 0.87c/l for popcount because we do have a spare ALU slot.
> > > > > > The above is more interesting than that, as it's very similar
> > > > > > to the limits of addmul.
> > > > > 
> > > > > Not sure what you mean. Do you mean that the point at which it
> > > > > drops to the lower time is the same as for addmul?
> > > > 
> > > > No , it's just the way the code is arranged.
> > > > 
> > > > > > Should be able to get the nehalem to run at the same speed as the
> > > > > > K10 but so far a conflict of scheduling with the jcc inst is
> > > > > > preventing this. Best so far(and current code in trunk) is 1.25
> > > > > > and 1.9 c/l
> > > > > > 
> > > > > > I'll see if I can come up with the Windows version tomorrow.
> > > > > 
> > > > > Sounds good. I'm not sure where these get used, but if it gives you
> > > > > practice for other things then it's pretty valuable.
> > > > 
> > > > Doing the nehalem for the same reason I did the K10.
> > > > 
> > > > > Bill.
> > > 
> > > I got the nehalem popcount at 1.0c/l , it's quite interesting
> > > 
> > > The nehalem, like the core2, has a loopback buffer (like a level 0
> > > cache; see Agner Fog's manuals) which it uses for small loops.
> > > On the nehalem, though, there is a 1 cycle penalty, so it may
> > > appear that we can never get 1.0c/l, however...
> > > 
> > > Each limb requires 1 load, 1 pop and 1 add,
> > > plus an add and a jmp for loop control,
> > > so for an N-way unroll we need 3N+2 micro-ops.
> > > We can do at most 4 micro-ops per cycle, plus the 1 cycle delay:
> > > N    3N+2    ceil((3N+2)/4)    ceil((3N+2)/4)+1    c/l
> > > 1    5       2                 3                   3
> > > 2    8       2                 3                   1.5
> > > 3    11      3                 4                   1.333
> > > 4    14      4                 5                   1.25
> > > 5    17      5                 6                   1.20
> > > 6    20      5                 6                   1.0
> > > 
> > > In most tables like this the c/l just approaches the asymptote,
> > > but in this case we reach it :), and it matches what we already have
> > > for 2,3,4-way unroll.
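
The micro-op counting above can be checked with a few lines of arithmetic. What follows is only a sketch of that counting argument in Python, not the assembler in trunk; the function name and its parameter names are made up for illustration, and it models nothing beyond the 4 micro-ops per cycle limit and the 1 cycle loop penalty described above.

```python
def cycles_per_limb(n, uops_per_limb=3, loop_uops=2, issue_width=4, loop_penalty=1):
    """Estimated cycles per limb for an n-way unrolled popcount loop.

    Hypothetical model: 3 micro-ops per limb (load, pop, add), 2 for loop
    control (add, jmp), at most 4 micro-ops per cycle, plus a 1 cycle
    penalty per loop iteration.
    """
    uops = uops_per_limb * n + loop_uops          # 3N+2
    issue_cycles = -(-uops // issue_width)        # integer ceil((3N+2)/4)
    return (issue_cycles + loop_penalty) / n      # cycles per iteration / limbs


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, 3 * n + 2, round(cycles_per_limb(n), 3))
```

Under these assumptions the 6-way unroll is the first one where the cycles per iteration equal the number of limbs per iteration.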
> > > 
> > > So now all I have to do is reduce the number of registers I used(always
> > > use as many as you can to make the scheduling easier)
> > > 
> > > Jason
> > 
> > Hi Jason,
> > 
> > Is this stuff that it is worthwhile to translate for Windows?
> > 
> > Writing one version of a new routine and then making a small number of
> > changes to produce N versions for different architectures makes
> > translation difficult.
> > 
> > If there are only minor changes, it makes much more sense to translate
> > a single 'master' version and then make the small number of changes
> > needed after translation.
> > 
> > On the other hand if the changes are extensive it makes more sense to
> > translate each of the multiple versions.
> > 
> > If we are in the first situation and translating for Windows makes
> > sense, can you let me have (or designate) a 'master' file for popcount
> > and hamdist?
> > 
> > I also need to know when a file has become stable as I don't want to
> > translate anything that is likely to change.
> > 
> >    Brian
> 
> Yep , I would wait until I have finished , although I was going to do the
> translation myself , it's good practice and these are easy examples.
> 
> The new nehalem hamdist is now in trunk, and using the same arguments
> as for popcount we can do it with a 2-way unroll (before it was 4-way) to
> get the optimal speed of 2.0c/l (bound by ld/st). Note: the timings on my
> nehalem are always slightly under when I measure it, i.e. it reads as
> 1.940c/l. It's like there is a negative overhead on all measurements;
> it must be the OOO putting the rdtsc ahead ;)
> 
> Using a mixed int/SSE we could beat this ld/st bound of 2.0c/l , I may try
> it as this is an easy example (if it works) . Quite a few other functions
> could benefit from a mixed int/SSE if we can do it efficiently , I have
> tried before but never got anywhere , and with this being such an easy
> function I should be able to figure out where I was going wrong.
>


That was easy. With a 4-way unroll (limbs) (i.e. 2-way SSE) I can get 1.75c/l
for hamdist (with no pipelining) but with the arguments aligned; I should be
able to get the same with the arguments unaligned. With an 8-way unroll I should
be able to get down to 1.5c/l and 1.625c/l for aligned/unaligned. Of course
hamdist is easy as there are no dependencies between the limbs. This suggests
that my failure before was lack of unrolling.
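
For anyone unsure what hamdist computes, here is a minimal Python sketch, assuming 64-bit limbs held as non-negative integers in a list. It is a model of the arithmetic only, not the SSE code; the names LIMB_BITS, popcount_limb, hamdist and hamdist_packed are invented for this example. The second function XORs two limbs at a time as one 128-bit value, roughly what one SSE register holds, to show that grouping the limbs does not change the result because no limb depends on another.

```python
LIMB_BITS = 64                       # assumed limb size
LIMB_MASK = (1 << LIMB_BITS) - 1


def popcount_limb(x):
    """Number of set bits in one limb."""
    return bin(x & LIMB_MASK).count("1")


def hamdist(a, b):
    """Hamming distance: total set bits in a[i] XOR b[i] over all limbs."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of limbs")
    return sum(popcount_limb(x ^ y) for x, y in zip(a, b))


def hamdist_packed(a, b):
    """Same result, XORing two limbs at a time as one 128-bit value."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of limbs")
    total = 0
    i = 0
    while i + 2 <= len(a):
        va = ((a[i + 1] & LIMB_MASK) << LIMB_BITS) | (a[i] & LIMB_MASK)
        vb = ((b[i + 1] & LIMB_MASK) << LIMB_BITS) | (b[i] & LIMB_MASK)
        total += bin(va ^ vb).count("1")
        i += 2
    if i < len(a):                   # odd limb left over
        total += popcount_limb(a[i] ^ b[i])
    return total
```

Both functions agree on any input, whether the limb count is even or odd.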


> The nehalem popcount is interesting, as again it is a simple example of a
> situation I have come across before. When we have a large unroll (given an
> inner loop), what is the best way to do the feed-in and wind-down?
> I could jump into the middle of the loop, but we would need to calculate
> size%6 (though I could change the unroll to 8-way), or, as there are no
> dependencies between limbs, just jump into a 6-part wind-down (like for logic).
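
The second option (unrolled main loop followed by a separate wind-down) can be sketched as below. This is Python standing in for the structure only, assuming 64-bit limbs in a list; UNROLL, popcount_limb and popcount_unrolled are hypothetical names, and a real version would jump into a straight-line wind-down rather than run a second loop.

```python
UNROLL = 6                           # unroll factor discussed above
LIMB_MASK = (1 << 64) - 1            # assumed 64-bit limbs


def popcount_limb(x):
    """Number of set bits in one limb."""
    return bin(x & LIMB_MASK).count("1")


def popcount_unrolled(limbs):
    """Popcount with a 6-way unrolled main loop and a wind-down."""
    n = len(limbs)
    main = n - n % UNROLL            # limbs handled by the unrolled loop
    total = 0
    for i in range(0, main, UNROLL):
        # Six independent pops per iteration; order does not matter.
        total += (popcount_limb(limbs[i])
                  + popcount_limb(limbs[i + 1])
                  + popcount_limb(limbs[i + 2])
                  + popcount_limb(limbs[i + 3])
                  + popcount_limb(limbs[i + 4])
                  + popcount_limb(limbs[i + 5]))
    # Wind-down: the remaining size % 6 limbs, at most five of them.
    for i in range(main, n):
        total += popcount_limb(limbs[i])
    return total
```

Because the limbs are independent, the leftover limbs can be handled before or after the main loop with the same result.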
> 
> Jason

-- 
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-de...@googlegroups.com.
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.
