ni...@lysator.liu.se (Niels Möller) writes: Sounds good. Give it a few days, and delete if it still looks slow everywhere.
I cooked a modern alternative: static mp_double_limb_t div1 (mp_limb_t n0, mp_limb_t d0) { mp_double_limb_t res; int ncnt, dcnt, cnt; mp_limb_t q; mp_limb_t mask; ASSERT (n0 >= d0); count_leading_zeros (ncnt, n0); count_leading_zeros (dcnt, d0); cnt = dcnt - ncnt; d0 <<= cnt; q = -(mp_limb_t) (n0 >= d0); n0 -= d0 & q; d0 >>= 1; q = -q; while (--cnt >= 0) { mask = -(mp_limb_t) (n0 >= d0); n0 -= d0 & mask; d0 >>= 1; q = (q << 1) - mask; } res.d0 = n0; res.d1 = q; return res; } That should use the same choices for div1 and div2. Is it important to inline some or all of the div1/div2 code? If we move them out to separate .c files, it gets easier to share between hgcd2 and hgcd2_jacobi, and easier to experiment with assembly div1. But it gets a bit more complex to set things up for tuneup. Not considering tuneup hardness, I suppose inlining makes some sense, at least of _METHOD 1 which often ends up as a single instruction. I've never been a big fan of inlining. Small operands GCD in GMP is special in many ways, as it is hundreds of times slower than multiplication. We really need to try hard to improve it. -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel