On Saturday 04 December 2010 02:01:48 Jason wrote:
> On Saturday 04 December 2010 01:40:13 Bill Hart wrote:
> > On 4 December 2010 00:52, Jason <ja...@njkfrudils.plus.com> wrote:
> > > Hi
> > > 
> > > Heres the first lot of new assembler code for the x64 (in trunk)
> > > 
> > > popcount/hamdist are not terribly useful for MPIR , but they do offer a
> > > simple way to practice stuff.
> > > 
> > > K8 popcount was 5.5c/l with 2way unroll now 4.66c/l with 3way
> > > K8 hamdist was 5.5c/l with 2way unroll now 5.0c/l with 3way
> > > 
> > > The above was just practice for the core2 version which uses SSE , if
> > > I'm going to try to use SSE for anything other than trivial
> > > copys/logic then I need the practice.
> > > 
> > > core2/penryn popcount was 6.5c/l with 4way unroll now 2.75c/l with 4way
> > > 
> > > The hamdist shows similar improvements , just have to write the
> > > horrible SSE alignment stuff , yuck..
> > > 
> > > K10 popcount was 1.5c/l with 4way unroll now 1.0c/l with 2way
> > > K10 hamdist was 1.9c/l with 4way unroll now 1.5c/l with 4way
> > 
> > Wow, sounds like a lot of great work Jason.
> > 
> > > The above are "optimal" , although for very large unrolls 28way(10way
> > > is probably the minimum) we could get down to 0.87c/l for popcount
> > > because we do have a spare ALU slot.
> > > The above is more interesting than that as it's very similar to the
> > > limits of addmul
> > 
> > Not sure what you mean. Do you mean that the point at which it drops
> > to the lower time is the same as for addmul.
> 
> No , it's just the way the code is arranged.
> 
> > > Should be able to get the nehalem to run at the same speed as the K10
> > > but so far a conflict of scheduling with the jcc inst is preventing
> > > this. Best so far(and current code in trunk) is 1.25 and 1.9 c/l
> > > 
> > > I'll see if I can come up with the Windows version tomorrow.
> > 
> > Sounds good. I'm not sure where these get used, but if it gives you
> > practice for other things then its pretty valuable.
> 
> Doing the nehalem for the same reason I did the K10.
> 
> > Bill.

I got the Nehalem popcount down to 1.0 c/l, and it's quite interesting.

The Nehalem, like the Core 2, has a loopback buffer (something like a level 0
cache; see Agner Fog's manuals) that it uses for small loops. On the Nehalem,
however, there is a 1-cycle penalty per loop iteration, so it may appear that
we can never get 1.0 c/l. However...

Each limb requires 1 load, 1 popcount and 1 add,
plus an add and a jmp for loop control,
so an N-way unroll needs 3N+2 micro-ops.
We can issue at most 4 micro-ops per cycle, and then there is the 1-cycle delay:

N    3N+2    ceil((3N+2)/4)    ceil((3N+2)/4)+1    c/l
1    5       2                 3                   3
2    8       2                 3                   1.5
3    11      3                 4                   1.333
4    14      4                 5                   1.25
5    17      5                 6                   1.20
6    20      5                 6                   1.0

In most tables like this the c/l just approaches the asymptote, but in
this case we reach it :), and it matches what we already have for the 2-, 3-
and 4-way unrolls.
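
As a sanity check, the table can be reproduced with a short script. This is
only a sketch of the simple counting model above, not the assembler itself:
the function name cycles_per_limb and its parameter names are made up for
illustration, and it models nothing beyond the 4 micro-ops per cycle limit
and the 1-cycle loop penalty.

```python
def cycles_per_limb(n, uops_per_limb=3, loop_uops=2,
                    issue_width=4, loop_penalty=1):
    """Return (micro-ops, cycles per iteration, cycles per limb) for an n-way unroll.

    Assumed model, taken from the counting argument in this mail:
      - each limb costs 3 micro-ops (load, popcount, add)
      - loop control costs 2 micro-ops (add, jmp)
      - at most 4 micro-ops are issued per cycle
      - every loop iteration pays 1 extra cycle
    """
    uops = uops_per_limb * n + loop_uops
    # Integer ceiling division: cycles needed to issue all micro-ops.
    issue_cycles = -(-uops // issue_width)
    cycles = issue_cycles + loop_penalty
    return uops, cycles, cycles / n


if __name__ == "__main__":
    print("N    3N+2    cycles    c/l")
    for n in range(1, 7):
        uops, cycles, cpl = cycles_per_limb(n)
        print(f"{n:<5}{uops:<8}{cycles:<10}{cpl:.3f}")
```

For N = 1 to 6 this gives the same cycle counts and c/l figures as the table.
Other limits of the core are not modelled, so the numbers it produces for
larger unrolls should not be trusted without measuring.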

So now all I have to do is reduce the number of registers I use (always use as
many as you can at first, to make the scheduling easier).

Jason
