On Saturday 04 December 2010 02:01:48 Jason wrote:
> On Saturday 04 December 2010 01:40:13 Bill Hart wrote:
> > On 4 December 2010 00:52, Jason <ja...@njkfrudils.plus.com> wrote:
> > > Hi
> > >
> > > Heres the first lot of new assembler code for the x64 (in trunk)
> > >
> > > popcount/hamdist are not terribly useful for MPIR , but they do offer a
> > > simple way to practice stuff.
> > >
> > > K8 popcount was 5.5c/l with 2way unroll now 4.66c/l with 3way
> > > K8 hamdist was 5.5c/l with 2way unroll now 5.0c/l with 3way
> > >
> > > The above was just practice for the core2 version which uses SSE , if
> > > I'm going to try to use SSE for anything other than trivial
> > > copys/logic then I need the practice.
> > >
> > > core2/penryn popcount was 6.5c/l with 4way unroll now 2.75c/l with 4way
> > >
> > > The hamdist shows similar improvements , just have to write the
> > > horrible SSE alignment stuff , yuck..
> > >
> > > K10 popcount was 1.5c/l with 4way unroll now 1.0c/l with 2way
> > > K10 hamdist was 1.9c/l with 4way unroll now 1.5c/l with 4way
> >
> > Wow, sounds like a lot of great work Jason.
> >
> > > The above are "optimal" , although for very large unrolls 28way(10way
> > > is probably the minimum) we could get down to 0.87c/l for popcount
> > > because we do have a spare ALU slot.
> > > The above is more interesting than that as it's very similar to the
> > > limits of addmul
> >
> > Not sure what you mean. Do you mean that the point at which it drops
> > to the lower time is the same as for addmul.
>
> No , it's just the way the code is arranged.
>
> > > Should be able to get the nehalem to run at the same speed as the K10
> > > but so far a conflict of scheduling with the jcc inst is preventing
> > > this. Best so far(and current code in trunk) is 1.25 and 1.9 c/l
> > >
> > > I'll see if I can come up with the Windows version tomorrow.
> >
> > Sounds good. I'm not sure where these get used, but if it gives you
> > practice for other things then its pretty valuable.
>
> Doing the nehalem for the same reason I did the K10.
>
> > Bill.

I got the nehalem popcount down to 1.0c/l, and how it gets there is quite interesting.

The nehalem, like the core2, has a loopback buffer (like a level 0 cache, see Agner Fog's manuals) which it uses for small loops. However, on the nehalem there is a 1 cycle penalty on each pass through the loop, so it may appear that we can never get 1.0c/l. However...

Each limb requires 1 load, 1 pop and 1 add, plus an add and a jmp for loop control, so an N-way unroll needs 3N+2 micro-ops. We can do at most 4 micro-ops per cycle, and then there is the 1 cycle delay:

 N   3N+2   ceil((3N+2)/4)   ceil((3N+2)/4)+1   c/l
 1     5          2                 3           3
 2     8          2                 3           1.5
 3    11          3                 4           1.333
 4    14          4                 5           1.25
 5    17          5                 6           1.20
 6    20          5                 6           1.0

In most tables like this the c/l just approaches the asymptote, but in this case we reach it :) , and it matches what we already have for the 2,3,4-way unrolls.

So now all I have to do is reduce the number of registers I used (always use as many as you can to make the scheduling easier).

Jason
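
For anyone who wants to check the arithmetic in the table, here is a rough sketch in Python. The function name and parameter names are made up for illustration and are not anything in MPIR. It only models the counting argument above (3 micro-ops per limb, 2 for loop control, 4 micro-ops per cycle, 1 cycle loop penalty) and ignores every other limit in the core, so it should not be read as a timing model.

```python
def cycles_per_limb(n, uops_per_limb=3, loop_uops=2, issue_width=4, loop_penalty=1):
    """Return (micro-ops, cycles per iteration, cycles per limb) for an n-way unroll.

    All defaults are the assumptions stated in the post: load + pop + add per
    limb, add + jmp for loop control, 4 micro-ops per cycle, and 1 extra cycle
    per pass through the loop.
    """
    uops = uops_per_limb * n + loop_uops
    # Integer ceiling division, then the per-iteration loop penalty.
    cycles = -(-uops // issue_width) + loop_penalty
    return uops, cycles, cycles / n


if __name__ == "__main__":
    print(" N  3N+2  cycles  c/l")
    for n in range(1, 7):
        uops, cycles, cpl = cycles_per_limb(n)
        print(f"{n:2d}  {uops:4d}  {cycles:6d}  {cpl:.3f}")
```

Changing `loop_penalty` to 0 gives the numbers you would expect without the loopback delay, which makes it easy to see how much of each row is due to the penalty.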