Re: gcd_22

Torbjörn Granlund Mon, 26 Aug 2019 02:16:32 -0700

"Marco Bodrato" <bodr...@mail.dm.unipi.it> writes:

  If I'm not reading these timings wrongly, this means that with the current
  code (disregarding the overhead, for those 64-bit limbs)
  the bits in limb 1 require 4 cycles each;
  the bits in limb 2 require 8 cycles each;
  the bits in limb 3 require 54 cycles each;
  the bits in limb 4 require 33 cycles each...


Timing is confusing, and I cannot tell if you're right.  Perhaps Niels
can?!

For good measure, I wrote a gcd_33 in the style of gcd_22.  It runs
only about 20% slower than gcd_22 on AMD Ryzen, which is really good!
The code runs on Intel Haswell and later and on AMD Excavator and later.
Attached below.

I agree with Niels that we should optimise hgcd2 and its div1 and div2
callees.  They are chock full of unpredictable branches.  But it might
also make sense to provide a set of gcd_kk for small k, as gcd is an
important operation which is also very slow compared to most other GMP
operations.
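
To illustrate what a gcd_kk routine without unpredictable branches
might look like, here is a minimal C sketch for the single-limb case
(k = 1).  It is not the attached assembly and not GMP's own code; the
name gcd_11_sketch is hypothetical, a 64-bit limb is assumed, and
__builtin_ctzll is a GCC/Clang builtin.  The comparison of u and v is
turned into a mask, so the only branch left is the loop exit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a GMP function): binary gcd of two odd
   64-bit values, with the u < v decision folded into a mask. */
static uint64_t gcd_11_sketch(uint64_t u, uint64_t v)
{
    /* Assumption: both inputs are odd, hence nonzero. */
    assert((u & 1) && (v & 1));

    while (u != v) {
        uint64_t t = u - v;                  /* wraps around if u < v */
        uint64_t mask = -(uint64_t)(u < v);  /* all ones if u < v, else 0 */

        v += t & mask;            /* v = min(u, v) */
        u = (t ^ mask) - mask;    /* u = |u - v|, nonzero and even */
        u >>= __builtin_ctzll(u); /* strip factors of 2, u is odd again */
    }
    return u;
}
```

For example, gcd_11_sketch(21, 35) returns 7.  A gcd_22 or gcd_33 in
this style would carry the subtraction and the shift across 2 or 3
limbs, which is where most of the extra work goes.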

x64-zny-gcd_33.asm
Description: Binary data


-- 
Torbjörn
Please encrypt, key id 0xC8601622

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
