Hi,

Here's the first lot of new assembler code for x64 (in trunk).
popcount/hamdist are not terribly useful for MPIR, but they do offer a simple way to practice. All timings are in c/l (cycles per limb).

K8 popcount was 5.5c/l with 2-way unroll, now 4.66c/l with 3-way
K8 hamdist was 5.5c/l with 2-way unroll, now 5.0c/l with 3-way

The above was just practice for the core2 version, which uses SSE. If I'm going to try to use SSE for anything other than trivial copies/logic then I need the practice.

core2/penryn popcount was 6.5c/l with 4-way unroll, now 2.75c/l with 4-way

The hamdist shows similar improvements, I just have to write the horrible SSE alignment stuff, yuck.

K10 popcount was 1.5c/l with 4-way unroll, now 1.0c/l with 2-way
K10 hamdist was 1.9c/l with 4-way unroll, now 1.5c/l with 4-way

The above are "optimal", although with very large unrolls, 28-way (10-way is probably the minimum), we could get down to 0.87c/l for popcount because we do have a spare ALU slot. The above is more interesting than it looks, as it's very similar to the limits of addmul.

I should be able to get the nehalem to run at the same speed as the K10, but so far a scheduling conflict with the jcc instruction is preventing this. Best so far (and current code in trunk) is 1.25 and 1.9 c/l.

I'll see if I can come up with the Windows version tomorrow.

Jason
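For anyone not familiar with the two routines, here is a rough C sketch of what they compute: popcount counts the set bits across an array of limbs, and hamdist counts the bits that differ between two arrays of the same length. This is only a portable reference, not the assembler in trunk. The names `limb_t`, `popcount_limb`, `popcount_limbs` and `hamdist_limbs` are made up for this example, and a limb is assumed to be 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limb type for this sketch: one 64-bit word. */
typedef uint64_t limb_t;

/* Count the set bits in one limb using the usual bit-parallel
   reduction (pairs, then nibbles, then bytes, then a byte sum). */
static unsigned popcount_limb(limb_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply adds all eight byte counts into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Total number of set bits in the n limbs starting at p. */
size_t popcount_limbs(const limb_t *p, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcount_limb(p[i]);
    return total;
}

/* Hamming distance: number of bit positions in which a and b differ.
   It is just the popcount of the XOR of corresponding limbs. */
size_t hamdist_limbs(const limb_t *a, const limb_t *b, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcount_limb(a[i] ^ b[i]);
    return total;
}
```

The only difference between the two loops is the extra load and XOR per limb in hamdist, which fits with the hamdist timings above being equal to or a little slower than the popcount ones.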