Your version is faster than my versions (where I tested them). I made some minor changes to your code.
1. Got rid of c1 by moving the two adox instructions earlier. That also gave a speedup.
2. Simplified the feed-in code by jumping into the loop for the odd n case.
3. Used rbx for the bp variable, as rbp is not a great base register (yes, x86 coding is absurd).
4. Used some 32-bit operations to reduce code size. (More could be done along those lines, e.g. use the 8-bit test $1,R8(n), or add n instead of $0 for the final carry add.)

Note that the loop now contains two identical copies of the same code block. One might unroll more or less with quite limited effort. :-)
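To illustrate point 2 and the note about the two identical blocks, here is a rough C sketch of the loop structure only. It is not the attached assembly. It assumes the function computes {rp,n} = {ap,n}*u + {bp,n}*v with u, v < 2^63 and returns the carry limb. The names addaddmul_1msb0_sketch and step are made up for this example, and unsigned __int128 is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* One limb step: rp[i] = low 64 bits of ap[i]*u + bp[i]*v + carry.
   Returns the new carry. With u, v < 2^63 the sum cannot overflow 128 bits. */
static inline uint64_t step(uint64_t *rp, const uint64_t *ap,
                            const uint64_t *bp, size_t i,
                            uint64_t u, uint64_t v, uint64_t carry)
{
    u128 t = (u128)ap[i] * u + (u128)bp[i] * v + carry;
    rp[i] = (uint64_t)t;
    return (uint64_t)(t >> 64);
}

/* Hypothetical sketch: 2x unrolled loop, entered in the middle when n is odd,
   so no separate feed-in code is needed for the odd case. */
uint64_t addaddmul_1msb0_sketch(uint64_t *rp, const uint64_t *ap,
                                const uint64_t *bp, size_t n,
                                uint64_t u, uint64_t v)
{
    uint64_t carry = 0;
    size_t i = 0;

    if (n & 1)
        goto odd;                 /* odd n: jump to the second copy */

    while (i < n) {
        carry = step(rp, ap, bp, i, u, v, carry);   /* first copy */
        i++;
    odd:
        carry = step(rp, ap, bp, i, u, v, carry);   /* identical second copy */
        i++;
    }
    return carry;
}
```

Unrolling more or less then amounts to changing the number of copies of the step and adjusting the entry point for n modulo the unroll factor.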
Attachment: addaddmul_1msb0-mulx-bynisse.asm
--
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel