ni...@lysator.liu.se (Niels Möller) writes:

  Gave it a run on my closest x86_64 (intel broadwell, no mulx), and
  numbers for mpn_addaddmul_1msb0 are not impressive. Also, it appears
  mpn_addmul_2 is significantly slower than two addmul_1.
I believe addmul_2 is inhibited for that CPU. It might still appear in
the compiled library, though. :-(

  79   #1.5617   1.8006   4.3277   4.6949
  86   #1.5702   1.7883   4.3290   4.7031
  94   #1.5441   1.7743   4.3321   4.7018

So there's definitely some room for improvement.

The odd instruction order of the present loop suggests it was optimised
for K8. In fact, it runs almost optimally there. (32 loop instructions,
the 6 muls need a double slot, so 38. 3-way issue, 6 way unrolled gives
(32+6)/3/6 = 2.111... Very close to the stated 2.167.)

Beating mul_1 + addmul_1 elsewhere without loopmixing will probably be
hard.

We should probably move the present code into the k8 subdir.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
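
For anyone wanting to redo the K8 issue-slot estimate from the message, here is a minimal Python sketch of the arithmetic. The function name and its parameters are made up for illustration and are not part of GMP. The only inputs are the figures quoted in the message: 32 loop instructions, 6 muls that each take an extra slot, 3-way issue, 6-way unrolling.

```python
from fractions import Fraction
from math import ceil

def issue_bound(instructions, double_slot_instructions, issue_width, unroll):
    """Return (ideal, whole_cycle) cycles-per-limb bounds for one loop.

    Hypothetical helper: counts one issue slot per instruction plus one
    extra slot per double-slot instruction, as in the estimate above.
    """
    slots = instructions + double_slot_instructions
    # Ideal bound: slots spread evenly over the issue width.
    ideal = Fraction(slots, issue_width * unroll)
    # Bound if each loop iteration takes a whole number of cycles.
    whole = Fraction(ceil(Fraction(slots, issue_width)), unroll)
    return ideal, whole

ideal, whole = issue_bound(32, 6, 3, 6)
print(f"ideal: {float(ideal):.3f} c/l, whole-cycle: {float(whole):.3f} c/l")
```

The ideal figure is 38/18 = 2.111 c/l. Rounding 38/3 up to 13 whole cycles per iteration gives 13/6 = 2.167 c/l, which equals the stated number. Whether that rounding is actually where the stated 2.167 comes from is an assumption on my part.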