Ciao, Il Dom, 25 Agosto 2019 9:34 am, Niels Möller ha scritto: > t...@gmplib.org (Torbjörn Granlund) writes:
>> Should we have gcd_33, gcd_44, and gcd_55 also? > I think it would be more beneficial to optimize hgcd2. > E.g., 3-limb gcd should typically boil down to one hgcd2, a couple of > 3:1 mul_1/addmul_1, and one call to gcd_22. It's not obvious that a > gcd_33 using the binary algorithm will be faster. On shell I get: $ tune/speed -CD -s16-64 -t48 mpn_gcd_11 mpn_gcd_22 overhead 6.84 cycles, precision 10000 units of 2.86e-10 secs, CPU freq 3500.09 MHz mpn_gcd_11 mpn_gcd_22 16 (#78.3223) (134.4004) 64 #4.0877 7.8639 $ tune/speed -CD -s1-4 mpn_gcd overhead 6.83 cycles, precision 10000 units of 2.86e-10 secs, CPU freq 3500.09 MHz mpn_gcd 1 (350.7188) 2 519.4961 3 3465.1028 4 2168.9167 If I'm not reading this timings wrongly, this means that with the current code (disregarding the overhead, for those 64-bits limbs) the bits in the limb 1 require 4 cycles each; the bits in the limb 2 require 8 cycles each; the bits in the limb 3 require 54 cycles each; the bits in the limb 4 require 33 cycles each... On the same machine, with ABI=32, gcd_22.c is used: $ tune/speed -CD -s8-32 -t24 mpn_gcd_11 mpn_gcd_22 overhead 6.86 cycles, precision 10000 units of 2.86e-10 secs, CPU freq 3500.09 MHz mpn_gcd_11 mpn_gcd_22 8 (#45.9492) (126.3672) 32 #4.5485 17.2290 $ tune/speed -CD -s1-4 mpn_gcd overhead 6.85 cycles, precision 10000 units of 2.86e-10 secs, CPU freq 3500.09 MHz mpn_gcd 1 (270.6172) 2 538.6328 3 1797.6500 4 1258.4594 the bits in the limb 1 require 4.5 cycles each; the bits in the limb 2 require 17 cycles each; the bits in the limb 3 require 56 cycles each; the bits in the limb 4 require 39 cycles each... It really seem that the _33 case need some work. Ĝis, m -- http://bodrato.it/papers/ _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel