Your version is faster than my versions (where I tested them).

I made some minor changes to your code.

1. Got rid of c1 by moving two adox earlier.  That also made for a speedup.

2. Simplified the feed-in code by jumping into the loop for the odd n case.

3. Used rbx for the bp variable, as rbp is not a great base register (yes,
   x86 coding is absurd).

4. Used some 32-bit operations for code size.  (More could be done along
   those lines, e.g. use the 8-bit test $1,R8(n), and add n instead of $0
   for the final carry add.)

Note that the loop now contains two identical copies of the same code
block.  One might unroll more or less with quite limited effort.  :-)
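
For readers without the attachment, here is a rough C model of the loop
shape described above: two identical copies of the per-limb block, with the
odd n case handled by jumping into the middle of the loop.  This is only a
sketch and not the attached assembly.  The name addaddmul_1msb0_ref, the
32-bit limb_t, and the assumed semantics (rp = ap*u + bp*v with the high
bit of u and v clear, returning the carry limb) are illustrative
assumptions; real GMP limbs are normally 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* toy limb size, chosen so uint64_t holds a product */

/* Hypothetical reference model: {rp,n} = {ap,n}*u + {bp,n}*v, returns the
   carry-out limb.  Assumes u, v < 2^31 (msb clear), so that
   a*u + b*v + carry always fits in 64 bits. */
static limb_t
addaddmul_1msb0_ref (limb_t *rp, const limb_t *ap, const limb_t *bp,
                     size_t n, limb_t u, limb_t v)
{
  uint64_t acc = 0;   /* running carry plus current limb sum */
  size_t i = 0;

  if (n & 1)
    goto odd;         /* odd n: enter at the second copy of the block */

  while (i < n)
    {
      /* first copy of the block */
      acc += (uint64_t) ap[i] * u + (uint64_t) bp[i] * v;
      rp[i++] = (limb_t) acc;
      acc >>= 32;
    odd:
      /* second, identical copy of the block */
      acc += (uint64_t) ap[i] * u + (uint64_t) bp[i] * v;
      rp[i++] = (limb_t) acc;
      acc >>= 32;
    }
  return (limb_t) acc;
}
```

Unrolling further in this model would just mean pasting more copies of the
block and picking the entry point from n modulo the unroll factor.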

Attachment: addaddmul_1msb0-mulx-bynisse.asm
Description: Binary data

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
