ni...@lysator.liu.se (Niels Möller) writes: With evaluation at +1, (a1+a0) and (b1+b0) are each n limbs + a bit, and the product is 2n limbs + 2 bits. If we handle the extra bits separately, rather than doing it as an (n+1)x(n+1) multiply, I think that would be two conditional n-limb additions, at weight B^2 in the final product, and a single bit (the and of the two extra bits) at weight B^3. I see that the last bit has some potential for folding with some other carry propagation pass over the highest n limbs, but I don't see how to do it cheaper than 2n carry propagations.
So +1 sounds better than -1. But it were still surprisingly many operations on the n=1 case. Yes, we trade a multiply for a bunch of additions. For some archs, notably ARM inc's Arm64 cores, we have at least 15 free insn slots available for plain operations (add, sub, and, etc). (This is because the 3 muls need 21 cycles.) As far as I remember, I ran tests, and they failed... Then, it might perhaps be buggy. ;-) -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel