ni...@lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <t...@gmplib.org> writes:

  > * The code is no win for AMD k10/k8 (although close to 10 c/l might well be
  >   possible)

  I tried replacing one masking op by cmov, as you suggested. We then get
  down to 11.25 c/l on K10. I put this modified version in the k10
  subdirectory, since it was a significant slowdown on some other
  processors.

Nice speedup! It is not too far from decoder saturated now, I presume.
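For readers less familiar with the masking-versus-cmov trade-off discussed above, here is a minimal C sketch of the two forms for a generic conditional subtract. It is not the GMP assembly under discussion; the function names `cond_sub_mask` and `cond_sub_select` are hypothetical, and whether a compiler actually emits cmov for the second form is an assumption that depends on the compiler and flags.

```c
#include <stdint.h>

/* Masking form: turn the comparison result (0 or 1) into an all-zeros or
   all-ones mask, then AND it with d so the subtraction is a no-op when
   r < d.  No conditional instruction is needed. */
uint64_t cond_sub_mask(uint64_t r, uint64_t d)
{
    uint64_t mask = -(uint64_t)(r >= d);   /* 0 or 0xFFFFFFFFFFFFFFFF */
    return r - (mask & d);
}

/* Select form: compute the subtracted value unconditionally and pick one
   of the two results.  On x86-64 a compiler may translate the select
   into a cmov, but that is not guaranteed. */
uint64_t cond_sub_select(uint64_t r, uint64_t d)
{
    uint64_t t = r - d;                    /* wraps harmlessly if r < d */
    return (r >= d) ? t : r;
}
```

Both functions compute the same value. Which one is faster depends on the pipeline, which fits the observation above that the cmov variant helped on K10 but was a slowdown on some other processors.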
I think the right place for the file is the k8 subdir, not the k10 subdir.
Their pipelines are almost identical, so the k10 subdir is used just for
code which uses instructions not available on k8.

  Next thing to try is to delay the Q1 store, but that's a bit more work.
  After that, I guess I should try the loop mixer.

I think k8-k10 are losing importance, since they have not been made for
several years. AMD bulldozer/piledriver are not terribly important GMP
targets either, since they have a hopelessly slow integer multiply unit.

The most important targets are sandybridge/ivybridge (similar pipelines)
and haswell. Less important are nehalem/westmere (very similar pipelines).
Conroe and the other core2 processors are not important, except for your
laptop. :-)

I think haswell code could be made a few cycles faster by using the mulx
instruction. That will avoid copying rax back and forth.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel