Plain, non-pipelined versions of bdiv_dbm1c.asm, mod_1_4.asm, mode1o.asm,
dive_1.asm, and invert_limb.asm.

I wrote this with the help of gcc, having first told longlong.h about
umulxhi and addxc.  Then I hand-optimised the result to varying degrees.
I did not software-pipeline any of the loops, so these will rely on OoO
execution for good speed.
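
For readers who do not have the instruction set at hand, here is a rough
C sketch of what the two instructions are taken to compute.  It assumes
that umulxhi returns the high 64 bits of an unsigned 64x64-bit product and
that addxc adds two registers plus a carry-in.  The names umulxhi_ref and
addc_ref are hypothetical, and the carry-out returned by addc_ref is only
there for testing; none of this is taken from longlong.h or from the
attached files.

```c
#include <stdint.h>

/* Hypothetical reference: high 64 bits of the unsigned product a * b,
   built from four 32x32->64 partial products. */
static uint64_t umulxhi_ref(uint64_t a, uint64_t b)
{
    uint64_t al = a & 0xffffffffu, ah = a >> 32;
    uint64_t bl = b & 0xffffffffu, bh = b >> 32;

    uint64_t ll = al * bl;
    uint64_t lh = al * bh;
    uint64_t hl = ah * bl;
    uint64_t hh = ah * bh;

    /* Sum of everything that lands in bits 32..63 of the low limb;
       its upper half is the carry into the high limb. */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + (hl & 0xffffffffu);

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* Hypothetical reference: a + b + cin, with the carry-out stored
   through cout (assumed semantics of an add-with-carry step). */
static uint64_t addc_ref(uint64_t a, uint64_t b, unsigned cin, unsigned *cout)
{
    uint64_t s = a + b;
    unsigned c = s < a;          /* carry from a + b */
    uint64_t r = s + cin;
    c |= r < s;                  /* carry from adding cin */
    *cout = c;
    return r;
}
```

A two-limb product then takes one ordinary multiply for the low limb and
one umulxhi_ref call for the high limb.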

I believe this code is correct.  If you could provide T3 and T4 timing
numbers, that would be welcome.  Or if you would optimise the lot, that
would also be welcome.

The code uses lzcnt, which I hope is implemented in T3 and T4.  I added
it to the missing.m4 file, so that I could test the code on my old
sparcs.
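
As a point of reference when checking such an emulation, the following C
sketch shows the behaviour assumed for lzcnt here: the number of leading
zero bits in a 64-bit word, with 64 returned for a zero input.  The name
lzcnt64_ref is hypothetical, and this is not the missing.m4 code.

```c
#include <stdint.h>

/* Hypothetical reference: count leading zero bits of a 64-bit word.
   A zero input is assumed to yield 64. */
static unsigned lzcnt64_ref(uint64_t x)
{
    unsigned n = 0;

    if (x == 0)
        return 64;

    /* Shift left until the top bit is set, counting the shifts. */
    while ((x >> 63) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}
```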

More work is needed on loading table symbols.  I think most files do it
properly, but at least sparct34-invert_limb.asm simply assumes that a
locally defined table lives at a 32-bit address that is known statically.

Feedback welcome.

Attachment: sparct34-bdiv_dbm1c.asm
Description: Binary data

Attachment: sparct34-mod_1_4.asm
Description: Binary data

Attachment: sparct34-mode1o.asm
Description: Binary data

Attachment: sparct34-dive_1.asm
Description: Binary data

Attachment: sparct34-invert_limb.asm
Description: Binary data

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
