ni...@lysator.liu.se (Niels Möller) writes:

  With evaluation at +1, (a1+a0) and (b1+b0) are each n limbs + a bit, and
  the product is 2n limbs + 2 bits. If we handle the extra bits
  separately, rather than doing it as an (n+1)x(n+1) multiply, I think that
  would be two conditional n-limb additions, at weight B^2 in the final
  product, and a single bit (the and of the two extra bits) at weight B^3.
  I see that the last bit has some potential for folding with some other
  carry propagation pass over the highest n limbs, but I don't see how to
  do it cheaper than 2n carry propagations.

  So +1 sounds better than -1. But it were still surprisingly many
  operations on the n=1 case.

Yes, we trade a multiply for a bunch of additions.

For some archs, notably ARM inc's Arm64 cores, we have at least 15 free
insn slots available for plain operations (add, sub, and, etc).  (This is
because the 3 muls need 21 cycles.)

  As far as I remember, I ran tests, and they failed...

Then, it might perhaps be buggy.  ;-)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel

Reply via email to