ni...@lysator.liu.se (Niels Möller) writes:

> If we have adox/adcx, use the same strategy as suggested for
> addaddmul_1msb0, but subtract rather than add in the chain with the
> long-lived carry.
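To spell out the operation I have in mind: {rp,n} = {ap,n}*u + {bp,n}*v,
with the most significant bit of both u and v zero, returning the high
limb (addsubmul_1msb0 being the same with the b term subtracted). A
plain C sketch, just for illustration and not the actual code, assuming
64-bit limbs (names are made up for the sketch):

  #include <stdint.h>

  uint64_t
  addaddmul_1msb0_ref (uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                       long n, uint64_t u, uint64_t v)
  {
    unsigned __int128 c = 0;	/* carry, always fits in one limb */
    for (long i = 0; i < n; i++)
      {
        /* Each product has its high limb < 2^63 (msb0), so the sum of
           the two products plus a one-limb carry cannot overflow 128
           bits.  */
        unsigned __int128 t = (unsigned __int128) ap[i] * u
          + (unsigned __int128) bp[i] * v + c;
        rp[i] = (uint64_t) t;
        c = t >> 64;
      }
    return (uint64_t) c;	/* high limb, fits thanks to msb0 */
  }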
Here's a sketch of a loop that should work for both addaddmul_1msb0 and
addsubmul_1msb0:

L(top):
	mov	(ap, n, 8), %rdx
	mulx	%r8, alo, hi
	adox	ahi, alo
	mov	hi, ahi			C 2-way unroll.
	adox	zero, ahi		C Clears O
	mov	(bp, n, 8), %rdx
	mulx	%r9, blo, hi
	adox	bhi, blo
	mov	hi, bhi
	adox	zero, bhi		C Clears O
	adc	blo, alo		C Or sbb, for addsubmul_1msb0
	mov	alo, (rp, n, 8)
	inc	n
	jnz	L(top)
L(done):
	adc	bhi, ahi		C No carry out, thanks to msb0
	mov	ahi, %rax		C Return value

(BTW, do I get the operand order right for mulx? I'm confused by the
docs, which use the generally different Intel conventions.)

Note that in this form, I think we could allow full limb inputs (%r8,
%r9), except that we would get a final carry, and we'd need to return a
65-bit value.

For the addadd case, this could be simplified by adding ahi and bhi
together early (since there can be no overflow), eliminating a few of
the adox instructions.

Now, the question is if it can beat mul_1 + addmul_1. I don't know.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
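P.S. For concreteness, the mul_1 + addmul_1 combination I'd be
comparing against is just the obvious two-pass version, using the
documented mpn entry points (illustrative only):

  #include <gmp.h>

  /* hi:{rp,n} = {ap,n}*u + {bp,n}*v; the sum of the two high limbs
     cannot overflow when u and v have their most significant bit
     clear.  */
  static mp_limb_t
  addaddmul_1msb0_baseline (mp_ptr rp, mp_srcptr ap, mp_srcptr bp,
                            mp_size_t n, mp_limb_t u, mp_limb_t v)
  {
    mp_limb_t hi = mpn_mul_1 (rp, ap, n, u);
    hi += mpn_addmul_1 (rp, bp, n, v);
    return hi;
  }

Here {rp,n} is written by mpn_mul_1 and then read and rewritten by
mpn_addmul_1; avoiding that second pass over rp is what the fused loop
above buys, if anything.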