Re: [mpir-devel] Re: New assembler

Jason Sat, 04 Dec 2010 02:46:48 -0800

On Saturday 04 December 2010 09:35:02 Cactus wrote:
> On Dec 4, 3:20 am, Jason <ja...@njkfrudils.plus.com> wrote:
> > On Saturday 04 December 2010 02:01:48 Jason wrote:
> > > On Saturday 04 December 2010 01:40:13 Bill Hart wrote:
> > > > On 4 December 2010 00:52, Jason <ja...@njkfrudils.plus.com> wrote:
> > > > > Hi
> > > > > 
> > > > > Here's the first lot of new assembler code for x64 (in trunk)
> > > > > 
> > > > > popcount/hamdist are not terribly useful for MPIR , but they do
> > > > > offer a simple way to practice stuff.
> > > > > 
> > > > > K8 popcount was 5.5c/l with 2way unroll now 4.66c/l with 3way
> > > > > K8 hamdist was 5.5c/l with 2way unroll now 5.0c/l with 3way
> > > > > 
> > > > > The above was just practice for the core2 version which uses SSE ,
> > > > > if I'm going to try to use SSE for anything other than trivial
> > > > > copies/logic then I need the practice.
> > > > > 
> > > > > core2/penryn popcount was 6.5c/l with 4way unroll now 2.75c/l with
> > > > > 4way
> > > > > 
> > > > > The hamdist shows similar improvements , just have to write the
> > > > > horrible SSE alignment stuff , yuck..
> > > > > 
> > > > > K10 popcount was 1.5c/l with 4way unroll now 1.0c/l with 2way
> > > > > K10 hamdist was 1.9c/l with 4way unroll now 1.5c/l with 4way
> > > > 
> > > > Wow, sounds like a lot of great work Jason.
> > > > 
> > > > > The above are "optimal" , although for very large unrolls
> > > > > 28way(10way is probably the minimum) we could get down to 0.87c/l
> > > > > for popcount because we do have a spare ALU slot.
> > > > > The above is more interesting than that as it's very similar to the
> > > > > limits of addmul
> > > > 
> > > > Not sure what you mean. Do you mean that the point at which it drops
> > > > to the lower time is the same as for addmul.
> > > 
> > > No , it's just the way the code is arranged.
> > > 
> > > > > Should be able to get the nehalem to run at the same speed as the
> > > > > K10 but so far a conflict of scheduling with the jcc inst is
> > > > > preventing this. Best so far(and current code in trunk) is 1.25
> > > > > and 1.9 c/l
> > > > > 
> > > > > I'll see if I can come up with the Windows version tomorrow.
> > > > 
> > > > Sounds good. I'm not sure where these get used, but if it gives you
> > > > practice for other things then its pretty valuable.
> > > 
> > > Doing the nehalem for the same reason I did the K10.
> > > 
> > > > Bill.
> > 
> > I got the nehalem popcount at 1.0c/l , it's quite interesting
> > 
> > The nehalem , like the core2 , has a loopback buffer (like a level 0
> > cache , see Agner Fog's manuals) which it uses for small loops ,
> > however on the nehalem there is a 1 cycle penalty , so it may
> > appear that we can never get 1.0c/l however...
> > 
> > each limb requires 1load , 1pop 1add
> > plus add and jmp for loop control
> > so for an N-way unroll we need 3N+2 micro-ops
> > as we can do at most 4 micro-ops per cycle plus the 1 cycle delay
> > 
> > N    3N+2    ceil((3N+2)/4)    ceil((3N+2)/4)+1    c/l
> > 1    5       2                 3                   3
> > 2    8       2                 3                   1.5
> > 3    11      3                 4                   1.333
> > 4    14      4                 5                   1.25
> > 5    17      5                 6                   1.20
> > 6    20      5                 6                   1.0
> > 
> > most of these tables like this the c/l just approaches the asymptote ,
> > but in this case we reach it :) , and it matches what we already have
> > for 2,3,4-way unroll.
> > 
> > So now all I have to do is reduce the number of registers I used(always
> > use as many as you can to make the scheduling easier)
> > 
> > Jason
> 
> Hi Jason,
> 
> Is this stuff that it is worthwhile to translate for Windows?
> 
> Writing one version of a new routine and then making a small number of
> changes to produce N versions for different architectures makes
> translation difficult.
> 
> If there are only minor changes, it makes much more sense to translate
> a single 'master' version and then make the small number of changes
> needed after translation.
> 
> On the other hand if the changes are extensive it makes more sense to
> translate each of the multiple versions.
> 
> If we are in the first situation and translating for Windows makes
> sense, can you let me have (or designate) a 'master' file for popcount
> and hamdist?
> 
> I also need to know when a file has become stable as I don't want to
> translate anything that is likely to change.
> 
>    Brian


Yep , I would wait until I have finished , although I was going to do the 
translation myself , it's good practice and these are easy examples.

The new nehalem hamdist is now in trunk , and using the same arguments as
for popcount we can do it with a 2-way unroll (before was 4-way) to get the
optimal speed of 2.0c/l (bound by ld/st) . Note: the timings on my nehalem
always read slightly under , i.e. it measures as 1.940c/l , as if there were
a negative overhead on all measurements , must be the OOO putting the
rdtsc ahead ;)
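
As a rough check of the popcount unroll argument quoted above (3N+2 micro-ops
for an N-way unroll , at most 4 micro-ops per cycle , plus the 1 cycle loop
buffer penalty) , here is a small C sketch of the model. The function names are
made up for illustration and are not part of MPIR , and it only reproduces the
arithmetic , it does not measure anything.

```c
#include <stdio.h>

/* Micro-ops in one pass of an N-way unrolled popcount loop:
   per limb 1 load + 1 popcnt + 1 add, plus an add and a jump
   for loop control. */
static unsigned popcount_uops(unsigned n)
{
    return 3 * n + 2;
}

/* Cycles for one pass of the loop under the assumed model:
   at most 4 micro-ops issue per cycle (round up), plus the
   1 cycle loop buffer penalty. */
static unsigned loop_cycles(unsigned n)
{
    return (popcount_uops(n) + 3) / 4 + 1;
}

/* Cycles per limb for an N-way unroll. */
static double cycles_per_limb(unsigned n)
{
    return (double)loop_cycles(n) / n;
}

/* Print the model for unrolls 1..max_n, one row per unroll. */
static void print_unroll_table(unsigned max_n)
{
    unsigned n;

    printf("N\t3N+2\tcycles\tc/l\n");
    for (n = 1; n <= max_n; n++)
        printf("%u\t%u\t%u\t%.3f\n",
               n, popcount_uops(n), loop_cycles(n), cycles_per_limb(n));
}
```

Calling print_unroll_table(6) from a main() lists the six unrolls , and under
this model the 6-way unroll is the first to reach 1.0c/l.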

Using a mixed int/SSE we could beat this ld/st bound of 2.0c/l , I may try it 
as this is an easy example (if it works) . Quite a few other functions could 
benefit from a mixed int/SSE if we can do it efficiently , I have tried before 
but never got anywhere , and with this being such an easy function I should be 
able to figure out where I was going wrong.

The nehalem popcount is interesting , as again it is a simple example of a
situation I have come across before. When we have a large unroll (given an
inner loop) , what is the best way to do the feed-in and wind-down?
I could jump into the middle of the loop , but then we would need to calculate
size%6 (or change the unroll to 8-way) , or , as there are no dependencies
between limbs , just jump into a 6-part wind-down (like for logic).
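
To make the second option concrete , here is a minimal C sketch , assuming
64-bit limbs and a 6-way unrolled main loop. The names limb_popcount and
popcount_limbs are hypothetical , and a portable bit trick stands in for the
popcnt instruction , so this is only an outline of the control flow and not the
assembler in trunk.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable 64-bit population count (stand-in for the popcnt instruction). */
static unsigned limb_popcount(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Count the set bits in n limbs starting at p. */
static uint64_t popcount_limbs(const uint64_t *p, size_t n)
{
    uint64_t total = 0;

    /* Main loop: 6 limbs per pass, no dependencies between limbs. */
    while (n >= 6) {
        total += limb_popcount(p[0]);
        total += limb_popcount(p[1]);
        total += limb_popcount(p[2]);
        total += limb_popcount(p[3]);
        total += limb_popcount(p[4]);
        total += limb_popcount(p[5]);
        p += 6;
        n -= 6;
    }

    /* Wind-down: jump into a straight-line tail for the 0..5 limbs left.
       Order does not matter because the limbs are independent. */
    switch (n) {
    case 5: total += limb_popcount(p[4]); /* fall through */
    case 4: total += limb_popcount(p[3]); /* fall through */
    case 3: total += limb_popcount(p[2]); /* fall through */
    case 2: total += limb_popcount(p[1]); /* fall through */
    case 1: total += limb_popcount(p[0]); /* fall through */
    default: break;
    }
    return total;
}
```

Here the leftover count falls out of the main loop for free , so there is no
separate size%6 calculation , at the cost of a computed jump into the tail.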

Jason









