ni...@lysator.liu.se (Niels Möller) writes:

> Here's a sketch of a loop that should work for both addaddmul_1msb0
> and addsubmul_1msb0:
>
> L(top):
>         mov     (ap, n, 8), %rdx
>         mulx    %r8, alo, hi
>         adox    ahi, alo
>         mov     hi, ahi                 C 2-way unroll.
>         adox    zero, ahi               C Clears O
>         mov     (bp, n, 8), %rdx
>         mulx    %r9, blo, hi
>         adox    bhi, blo
>         mov     hi, bhi
>         adox    zero, bhi               C Clears O
>         adc     blo, alo                C Or sbb, for addsubmul_1msb0
>         mov     alo, (rp, n, 8)
>         inc     n
>         jnz     L(top)
> L(done):
>         adc     bhi, ahi                C No carry out, thanks to msb0
>         mov     ahi, %rax               C Return value

Neat!

Some unrolling would save several instructions: Put r8 (aka u0) in rdx
over k ways of unrolling. Supply the ap[...] limbs to mulx directly. Then
accumulate with an adox chain, reducing the need for adox zero. Same for
r9/v0.

> (BTW, do I get the operand order right for mulx? I'm confused by the
> docs, which use the generally different Intel conventions.)

Your use looks right.

Now, the question is whether it can beat mul_1 + addmul_1. I don't know.
It surely has potential.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
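For reference, the arithmetic that the add variant of the loop is meant to perform can be modelled in C. This is only a sketch under stated assumptions: the names limb_t and ref_addaddmul_1msb0 are hypothetical (they are not GMP's), limbs are assumed to be 64 bits, the function is assumed to compute rp[] = ap[]*u0 + bp[]*v0 and return the high limb, and unsigned __int128 is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for a 64-bit limb type */

/* Hypothetical reference model of the add variant:
   {rp, n} = {ap, n} * u0 + {bp, n} * v0, returning the high limb.
   Assumes u0 and v0 both have their most significant bit clear (msb0). */
static limb_t
ref_addaddmul_1msb0 (limb_t *rp, const limb_t *ap, const limb_t *bp,
                     size_t n, limb_t u0, limb_t v0)
{
  limb_t ahi = 0, bhi = 0;   /* high product limbs carried to the next step */
  unsigned carry = 0;        /* plays the role of the C flag (adc chain) */

  assert ((u0 >> 63) == 0 && (v0 >> 63) == 0);

  for (size_t i = 0; i < n; i++)
    {
      /* mulx plus the two adox steps: product plus previous high limb.
         This cannot overflow 128 bits. */
      unsigned __int128 a = (unsigned __int128) ap[i] * u0 + ahi;
      unsigned __int128 b = (unsigned __int128) bp[i] * v0 + bhi;
      limb_t alo = (limb_t) a;
      limb_t blo = (limb_t) b;
      ahi = (limb_t) (a >> 64);
      bhi = (limb_t) (b >> 64);

      /* adc blo, alo: add the two low limbs with the running carry. */
      unsigned __int128 s = (unsigned __int128) alo + blo + carry;
      rp[i] = (limb_t) s;
      carry = (unsigned) (s >> 64);
    }

  /* With msb0, ahi and bhi are each below 2^63, so ahi + bhi + carry
     fits in one limb and there is no carry out. */
  return ahi + bhi + carry;
}
```

A model like this might serve as an oracle when testing an assembly candidate against random inputs. The addsubmul_1msb0 variant would presumably replace the carry addition with a borrow subtraction, which is not modelled here.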