Hi

Here's the first lot of new assembler code for x64 (in trunk).

popcount/hamdist are not terribly useful for MPIR, but they do offer a simple
way to practice.
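
For reference, here is a minimal C sketch of what the two functions compute.
It is only an illustration of the semantics, not the assembler in trunk. The
names limb_t, ref_popcount and ref_hamdist are made up for this example (the
real entry points are mpn_popcount and mpn_hamdist on mp_limb_t), and a
64-bit limb is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_limb_t; assumes 64-bit limbs. */
typedef uint64_t limb_t;

/* Count the set bits of one limb by clearing the lowest set bit each pass. */
static unsigned limb_popcount(limb_t x)
{
    unsigned c = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit */
        c++;
    }
    return c;
}

/* popcount: total number of set bits in the n limbs at p. */
uint64_t ref_popcount(const limb_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += limb_popcount(p[i]);
    return sum;
}

/* hamdist: number of bit positions where a and b differ,
   i.e. the popcount of a XOR b, limb by limb. */
uint64_t ref_hamdist(const limb_t *a, const limb_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += limb_popcount(a[i] ^ b[i]);
    return sum;
}
```

hamdist is just popcount with one extra load and an xor per limb, which is
why the two are tuned side by side.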

K8 popcount was 5.5 c/l (cycles per limb) with a 2-way unroll, now 4.66 c/l with 3-way
K8 hamdist was 5.5 c/l with a 2-way unroll, now 5.0 c/l with 3-way
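
To show what "3-way unroll" means here, a rough C sketch follows. It is not a
translation of the K8 assembler: the function names are hypothetical, 64-bit
limbs are assumed, and the per-limb count uses the usual mask-and-add
reduction on the assumption that no popcnt instruction is available (K8 does
not have one).

```c
#include <stddef.h>
#include <stdint.h>

/* Mask-and-add bit count of one 64-bit word, no popcnt instruction needed. */
static uint64_t popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           /* 2-bit sums */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* 8-bit sums */
    return (x * 0x0101010101010101ULL) >> 56;                             /* add the bytes */
}

/* Hypothetical 3-way unrolled popcount over n limbs. */
uint64_t popcount_unroll3(const uint64_t *p, size_t n)
{
    /* Three independent accumulators so the three chains can overlap. */
    uint64_t s0 = 0, s1 = 0, s2 = 0;
    size_t i = 0;

    for (; i + 3 <= n; i += 3) {
        s0 += popcount64(p[i]);
        s1 += popcount64(p[i + 1]);
        s2 += popcount64(p[i + 2]);
    }
    /* Wind-down: zero, one or two leftover limbs. */
    for (; i < n; i++)
        s0 += popcount64(p[i]);

    return s0 + s1 + s2;
}
```

The unroll spreads the loop overhead (pointer update, compare, branch) over
three limbs and gives the scheduler three independent chains to interleave.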

The above was just practice for the core2 version, which uses SSE. If I'm
going to try to use SSE for anything other than trivial copies/logic then I
need the practice.

core2/penryn popcount was 6.5 c/l with a 4-way unroll, now 2.75 c/l with 4-way

hamdist shows similar improvements; I just have to write the horrible SSE
alignment stuff, yuck.
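
The general shape of that alignment handling can be sketched in plain C. This
is only an outline under stated assumptions, not the core2 code: the function
name is hypothetical, limbs are 64-bit and 8-byte aligned, and the body
stands in for a loop that would use 16-byte aligned SSE loads on the first
operand.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask-and-add bit count of one 64-bit word. */
static uint64_t popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

/* Hypothetical hamdist with head/body/tail split around 16-byte alignment. */
uint64_t hamdist_peeled(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t sum = 0;

    /* Head: peel limbs one at a time until a sits on a 16-byte boundary
       (at most one limb when a is 8-byte aligned). */
    while (n > 0 && ((uintptr_t)a & 15) != 0) {
        sum += popcount64(*a++ ^ *b++);
        n--;
    }

    /* Body: two limbs (16 bytes) per pass. a is aligned here; b may not be,
       so an SSE version would still need an unaligned load or a shuffle
       for the second operand. */
    for (; n >= 2; n -= 2, a += 2, b += 2)
        sum += popcount64(a[0] ^ b[0]) + popcount64(a[1] ^ b[1]);

    /* Tail: one leftover limb if the remaining count was odd. */
    if (n != 0)
        sum += popcount64(a[0] ^ b[0]);

    return sum;
}
```

With two source operands only one of them can be forced onto a 16-byte
boundary by peeling, which is what makes hamdist more awkward than popcount.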

K10 popcount was 1.5 c/l with a 4-way unroll, now 1.0 c/l with 2-way
K10 hamdist was 1.9 c/l with a 4-way unroll, now 1.5 c/l with 4-way

The above are "optimal", although with a very large unroll (28-way; 10-way is
probably the minimum) we could get down to 0.87 c/l for popcount because we do
have a spare ALU slot.
This is more interesting than it sounds, as it's very similar to the limits of
addmul.

I should be able to get Nehalem to run at the same speed as the K10, but so
far a scheduling conflict with the jcc instruction is preventing this.
Best so far (and the current code in trunk) is 1.25 and 1.9 c/l for popcount
and hamdist respectively.

I'll see if I can come up with the Windows version tomorrow.

Jason
