[mpir-devel] Re: New assembler

Cactus Sat, 04 Dec 2010 06:36:45 -0800


On Dec 4, 12:56 pm, Cactus <rieman...@gmail.com> wrote:
> On Dec 4, 10:46 am, Jason <ja...@njkfrudils.plus.com> wrote:
>
> > On Saturday 04 December 2010 09:35:02 Cactus wrote:
>
> > > On Dec 4, 3:20 am, Jason <ja...@njkfrudils.plus.com> wrote:
> > > > On Saturday 04 December 2010 02:01:48 Jason wrote:
> > > > > On Saturday 04 December 2010 01:40:13 Bill Hart wrote:
> > > > > > On 4 December 2010 00:52, Jason <ja...@njkfrudils.plus.com> wrote:
> > > > > > > Hi
>
> > > > > > > > Here's the first lot of new assembler code for the x64 (in trunk)
>
> > > > > > > popcount/hamdist are not terribly useful for MPIR , but they do
> > > > > > > offer a simple way to practice stuff.
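
For reference, a minimal Python sketch of what the two routines compute. It assumes the limbs are non-negative integers held in a list; the function names are illustrative only and are not MPIR's actual entry points.

```python
def popcount(limbs):
    """Total number of set bits across all limbs."""
    return sum(bin(limb).count("1") for limb in limbs)


def hamdist(a, b):
    """Number of bit positions in which two equal-length limb arrays differ."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of limbs")
    # XOR leaves a 1 wherever the operands differ, so hamdist is popcount(a ^ b)
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

The only per-limb difference is the extra load and XOR in hamdist, which is consistent with it timing a little slower than popcount in the figures quoted here.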
>
> > > > > > > K8 popcount was 5.5c/l with 2way unroll now 4.66c/l with 3way
> > > > > > > K8 hamdist was 5.5c/l with 2way unroll now 5.0c/l with 3way
>
> > > > > > > The above was just practice for the core2 version which uses SSE ,
> > > > > > > if I'm going to try to use SSE for anything other than trivial
> > > > > > > > copies/logic then I need the practice.
>
> > > > > > > core2/penryn popcount was 6.5c/l with 4way unroll now 2.75c/l with
> > > > > > > 4way
>
> > > > > > > The hamdist shows similar improvements , just have to write the
> > > > > > > horrible SSE alignment stuff , yuck..
>
> > > > > > > K10 popcount was 1.5c/l with 4way unroll now 1.0c/l with 2way
> > > > > > > K10 hamdist was 1.9c/l with 4way unroll now 1.5c/l with 4way
>
> > > > > > Wow, sounds like a lot of great work Jason.
>
> > > > > > > The above are "optimal" , although for very large unrolls
> > > > > > > 28way(10way is probably the minimum) we could get down to 0.87c/l
> > > > > > > for popcount because we do have a spare ALU slot.
> > > > > > > > The above is more interesting than that as it's very similar to
> > > > > > > > the limits of addmul
>
> > > > > > Not sure what you mean. Do you mean that the point at which it drops
> > > > > > to the lower time is the same as for addmul.
>
> > > > > No , it's just the way the code is arranged.
>
> > > > > > > Should be able to get the nehalem to run at the same speed as the
> > > > > > > K10 but so far a conflict of scheduling with the jcc inst is
> > > > > > > preventing this. Best so far(and current code in trunk) is 1.25
> > > > > > > and 1.9 c/l
>
> > > > > > > I'll see if I can come up with the Windows version tomorrow.
>
> > > > > > Sounds good. I'm not sure where these get used, but if it gives you
> > > > > > practice for other things then its pretty valuable.
>
> > > > > Doing the nehalem for the same reason I did the K10.
>
> > > > > > Bill.
>
> > > > I got the nehalem popcount at 1.0c/l , it's quite interesting
>
> > > > > The nehalem like the core2 has a loopback buffer (like a level 0
> > > > > cache)(see Agner Fog's manuals) which is used for small loops ,
> > > > > however on the nehalem there is a 1 cycle penalty , so it may
> > > > > appear that we can never get 1.0c/l however...
>
> > > > each limb requires 1load , 1pop 1add
> > > > plus add and jmp for loop control
> > > > so for an N-way unroll we need 3N+2 micro-ops
> > > > as we can do at most 4 micro-ops per cycle plus the 1 cycle delay
>
> > > > > N    3N+2    ceil((3N+2)/4)    ceil((3N+2)/4)+1    c/l
> > > > > 1    5       2                 3                   3
> > > > > 2    8       2                 3                   1.5
> > > > > 3    11      3                 4                   1.333
> > > > > 4    14      4                 5                   1.25
> > > > > 5    17      5                 6                   1.20
> > > > > 6    20      5                 6                   1.0
>
> > > > > in most tables like this the c/l just approaches the asymptote ,
> > > > > but in this case we reach it :) , and it matches what we already have
> > > > > for 2,3,4-way unroll.
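
The arithmetic behind the table can be sketched in a few lines of Python. This is only the simple model described above (3 micro-ops per limb, 2 for loop control, at most 4 micro-ops per cycle, plus the 1 cycle loop penalty); the function and parameter names are made up for illustration, and it is not a simulation of the real pipeline.

```python
import math


def cycles_per_limb(n, uops_per_limb=3, loop_uops=2, issue_width=4, loop_penalty=1):
    """Estimated c/l for an n-way unrolled loop under the simple model."""
    uops = uops_per_limb * n + loop_uops                      # 3N+2
    cycles = math.ceil(uops / issue_width) + loop_penalty     # ceil((3N+2)/4)+1
    return cycles / n


for n in range(1, 7):
    print(n, 3 * n + 2, round(cycles_per_limb(n), 3))
```

At N=6 the 20 micro-ops fill exactly 5 issue cycles, so the extra penalty cycle is amortised over six limbs and the model lands on 1.0c/l.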
>
> > > > So now all I have to do is reduce the number of registers I used(always
> > > > use as many as you can to make the scheduling easier)
>
> > > > Jason
>
> > > Hi Jason,
>
> > > Is this stuff that it is worthwhile to translate for Windows?
>
> > > Writing one version of a new routine and then making a small number of
> > > changes to produce N versions for different architectures makes
> > > translation difficult.
>
> > > If there are only minor changes, it makes much more sense to translate
> > > a single 'master' version and then make the small number of changes
> > > needed after translation.
>
> > > On the other hand if the changes are extensive it makes more sense to
> > > translate each of the multiple versions.
>
> > > If we are in the first situation and translating for Windows makes
> > > sense, can you let me have (or designate) a 'master' file for popcount
> > > and hamdist?
>
> > > I also need to know when a file has become stable as I don't want to
> > > translate anything that is likely to change.
>
> > >    Brian
>
> > Yep , I would wait until I have finished , although I was going to do the
> > translation myself , it's good practice and these are easy examples.
>
> > The new nehalem hamdist is now in trunk , and using the same arguments as
> > for popcount we can do it with a 2-way unroll (before was 4-way) to get the
> > optimal speed of 2.0c/l (bound by ld/st) . Note: the timings on my nehalem
> > are always slightly under when I measure it ie it reads as 1.940c/l , it's
> > like there is a negative overhead on all measurements , must be the OOO
> > putting the rdtsc ahead ;)
>
> > Using a mixed int/SSE we could beat this ld/st bound of 2.0c/l , I may try
> > it as this is an easy example (if it works) . Quite a few other functions
> > could benefit from a mixed int/SSE if we can do it efficiently , I have
> > tried before but never got anywhere , and with this being such an easy
> > function I should be able to figure out where I was going wrong.
>
> > The nehalem popcount is interesting , as again it is a simple example of a
> > situation I have come across before. When we have a large unroll (given an
> > inner loop) , what is the best way to do the feed-in and wind-down?
> > I could jump into the middle of the loop but we would need to calculate
> > size%6 but could change the unroll to 8-way , or as there are no
> > dependencies between limbs just jump into a 6part winddown(like for logic)
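
As a rough illustration of the second option, here is a Python sketch of a 6-way unrolled main loop followed by a separate wind-down for the leftover limbs. The name popcount_unrolled and the list-of-integers representation are assumptions made for the example; the real code is assembler and would enter the wind-down with a computed jump rather than a second loop.

```python
UNROLL = 6  # unroll factor discussed in the thread


def popcount_unrolled(limbs):
    """Popcount with a 6-way 'unrolled' main loop and a separate wind-down."""
    n = len(limbs)
    main = n - n % UNROLL  # number of limbs handled by full iterations
    total = 0
    for i in range(0, main, UNROLL):
        # stands in for six independent load/popcnt/add groups
        total += sum(bin(x).count("1") for x in limbs[i:i + UNROLL])
    # wind-down: the remaining 0..5 limbs do not depend on each other,
    # so they can be handled in any order after the main loop
    for x in limbs[main:]:
        total += bin(x).count("1")
    return total
```

With an 8-way unroll the remainder would be size & 7 instead of size % 6, which is probably why changing the unroll is mentioned as an alternative.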
>
> A point worth noting in using the XMM registers is that XMM6 to XMM15
> have to be preserved on Windows if a function uses them.
>
> Hence the advantages of leaf functions will be lost if too many xmm
> registers are brought into use.
>
> With six available this should not be a problem but I notice that you
> are using xmm6, xmm7 and xmm8 in places.
>
>    Brian
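
A small Python helper can make the register point concrete. It relies on the Windows x64 calling convention treating XMM6 to XMM15 as callee-saved, as noted above; the helper name and the idea of passing register names as strings are purely illustrative.

```python
# Windows x64 calling convention: XMM6..XMM15 must be preserved by the callee.
WIN64_CALLEE_SAVED_XMM = {f"xmm{i}" for i in range(6, 16)}


def xmm_to_preserve(used):
    """Return the registers in 'used' that a Windows x64 routine must save."""
    return sorted(set(used) & WIN64_CALLEE_SAVED_XMM, key=lambda r: int(r[3:]))
```

For a routine that touches xmm0 to xmm8 this returns xmm6, xmm7 and xmm8, which are the three registers mentioned above; a routine that stays within xmm0 to xmm5 needs no save/restore code at all.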


Since it was trivial, I added your latest stuff to Windows.

   Brian

