Plain, non-pipelined versions of bdiv_dbm1c.asm, mod_1_4.asm, mode1o.asm, dive_1.asm, invert_limb.asm.
I wrote these with the help of gcc, having first told longlong.h about umulxhi and addxc. Then I hand-optimised the results to varying degrees. In no case did I software pipeline the loops, so these will rely on OoO execution for good speed.

I believe this code is correct. If you could provide T3 and T4 timing numbers, that would be welcome. Or if you would optimise the lot, that would also be welcome. The code uses lzcnt, which I hope is implemented in T3 and T4. I added it to the missing.m4 file, so that I could test the code on my old sparcs.

More work is needed for loading table symbols. I think most files do it properly, but at least sparct34-invert_limb.asm simply assumes that a locally defined table sits at a statically known 32-bit address. Feedback welcome.
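For readers without the instruction set manual at hand, here is a minimal C sketch of what I understand the three instructions mentioned above to compute. It is not taken from the attached files or from longlong.h. The function names (umulxhi_ref, addxc_ref, lzcnt_ref, check_primitives) are made up for illustration, and returning 64 from lzcnt for a zero input is my assumption about the hardware behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the unsigned 64x64-bit product, built from
   32-bit halves so that it runs on any C99 compiler. */
static uint64_t umulxhi_ref(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;

    /* Sum the three contributions to bits 32..63; the part of this
       sum above bit 31 is the carry into the high word. */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + (hl & 0xffffffffu);

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* a + b + carry_in on 64-bit operands. The carry out is reported
   as well, so that calls can be chained in a multi-limb addition. */
static uint64_t addxc_ref(uint64_t a, uint64_t b, unsigned carry_in,
                          unsigned *carry_out)
{
    uint64_t s = a + b;
    unsigned c = s < a;          /* carry from a + b */
    uint64_t r = s + carry_in;
    c |= r < s;                  /* carry from adding carry_in */
    *carry_out = c;
    return r;
}

/* Number of leading zero bits in a 64-bit word.
   Assumption: a zero input yields 64. */
static unsigned lzcnt_ref(uint64_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 64;
    while ((x >> 63) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* A few sanity checks on the reference functions. */
static void check_primitives(void)
{
    unsigned c;
    assert(umulxhi_ref(UINT64_MAX, UINT64_MAX) == UINT64_MAX - 1);
    assert(addxc_ref(UINT64_MAX, 0, 1, &c) == 0 && c == 1);
    assert(lzcnt_ref(1) == 63);
}
```

Something along these lines is also roughly what an emulation for older sparcs has to provide, although the missing.m4 entry itself is of course written as m4/asm and not C.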
sparct34-bdiv_dbm1c.asm
sparct34-mod_1_4.asm
sparct34-mode1o.asm
sparct34-dive_1.asm
sparct34-invert_limb.asm
-- Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel