"Marco Bodrato" <bodr...@mail.dm.unipi.it> writes: If I'm not reading this timings wrongly, this means that with the current code (disregarding the overhead, for those 64-bits limbs) the bits in the limb 1 require 4 cycles each; the bits in the limb 2 require 8 cycles each; the bits in the limb 3 require 54 cycles each; the bits in the limb 4 require 33 cycles each...
The timing is confusing, and I cannot tell if you're right. Perhaps Niels can?

For good measure, I wrote a gcd_33 in the style of gcd_22. It runs about 20% slower than gcd_22 on AMD Ryzen, so it does really well! The code runs on Intel Haswell and later and on AMD Excavator and later. Attached below.

I agree with Niels that we should optimise hgcd2 and its div1 and div2 callees. They are chock-full of unpredictable branches. But it might also make sense to provide a set of gcd_kk for small k, as gcd is an important operation which is also very slow compared to most other GMP operations.
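
To illustrate what avoiding unpredictable branches can look like for a small fixed-size gcd, here is a minimal one-limb sketch in C. It is not the attached gcd_33 code and not GMP's own gcd_11 or gcd_22. The name gcd_11_sketch is hypothetical, both inputs are assumed to be odd, and __builtin_ctzll is assumed to be available (GCC or Clang). The only branch left is the loop condition; the choice between u - v and v - u is made with a mask.

```c
#include <stdint.h>

/* Hypothetical helper, not a GMP function.
   Binary gcd of two odd 64-bit values. The inner step picks
   min(u, v) and |u - v| with a mask instead of an if/else. */
static uint64_t gcd_11_sketch(uint64_t u, uint64_t v)
{
    /* Assumed precondition: u and v are both odd. */
    while (u != v) {
        uint64_t t = u - v;                  /* wraps around when u < v */
        uint64_t mask = -(uint64_t)(u < v);  /* all ones when u < v, else zero */

        v += t & mask;                       /* v becomes min(u, v) */
        u = (t ^ mask) - mask;               /* u becomes |u - v|, even and nonzero */

        /* Assumes a GCC/Clang builtin; strips the factors of two so u is odd again. */
        u >>= __builtin_ctzll(u);
    }
    return u;
}
```

For example, gcd_11_sketch(21, 35) evaluates to 7. A multi-limb variant in the same style would need the subtraction, the masked select and the shift carried across limbs, which this sketch does not attempt.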
x64-zny-gcd_33.asm
Description: Binary data
-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel