One of my previous proposal looks like this, when put into its right form: mp_limb_t div1 (mp_limb_t n0, mp_limb_t d0) { if (UNLIKELY ((d0 >> 61) != 0)) return n0 / d0;
if (UNLIKELY (n0 >= (d0 << 3))) return n0 / d0; else { mp_limb_t q, mask; d0 <<= 2; mask = -(mp_limb_t) (n0 >= d0); q = 4 & mask; n0 -= d0 & mask; d0 >>= 1; mask = -(mp_limb_t) (n0 >= d0); q += 2 & mask; n0 -= d0 & mask; d0 >>= 1; mask = -(mp_limb_t) (n0 >= d0); q += 1 & mask; n0 -= d0 & mask; return q; } } Some timing tests indicate that it is 6 times faster on 'shell' for quotients < 8. For hgcd, that would mean almost all quotients. I expect an asm version would be 10x faster. I attached a full test-and-timing program.
div1.c
Description: Binary data
-- Torbjörn Please encrypt, key id 0xC8601622
_______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel