Another trick we've been discussing is to use u+v rather than |u-v|, in the case that u+v is divisible by 4.
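For odd u and v, the two candidates u - v and u + v are both even and
differ by 2v ≡ 2 (mod 4), so exactly one of them is divisible by 4:
whenever |u - v| only gives us a single bit to shift out, u + v gives
at least two. A quick standalone brute-force check of that claim, just
for illustration (plain C, nothing GMP-specific):

#include <assert.h>
#include <stdio.h>

/* For distinct odd u and v, exactly one of u - v and u + v is
   divisible by 4.  Unsigned wraparound preserves the residue mod 4,
   so the order of the subtraction doesn't matter.  */
int
main (void)
{
  unsigned u, v;

  for (u = 1; u < 1000; u += 2)
    for (v = 1; v < 1000; v += 2)
      if (u != v)
        assert (((u - v) % 4 == 0) != ((u + v) % 4 == 0));

  printf ("checked all odd pairs below 1000\n");
  return 0;
}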
I've not yet tried it out for gcd_22 (and I don't yet have any timing
code), but I've tried it for gcd_11. It's slower (not that
surprising), but it may still have some potential for gcd_22. We need
masking tricks to select between the three values u-v, v-u and u+v,
depending on which is both positive and divisible by 4. And
regardless, we still need u-v (or at least the comparison) for the
computation of min(u,v).

The below seems to work. Looking at the interesting operations, it
uses

  d = u - v;
  vgtu = LIMB_HIGHBIT_TO_MASK (d);
  v += (vgtu & d);

to get v <-- min (u, v). This is unchanged from the regular gcd_11,
except that t was renamed to d, for "difference". To get the other
value, always divisible by 4 (or by 2 in the implicit-bit
representation), the code uses

  t = u ^ v;
  use_add = -(t & 1);     /* The mask value */
  t = u - (use_add ^ v);  /* I'm clever here */
  vgtu &= ~use_add;       /* Cancel negation if we use add */
  u = (t ^ vgtu) - vgtu;

The cleverness is that in the add case we need u + v + 1, due to the
carry from the implicit bits, but we get by with just a complement,
thanks to the relation between complement and negation, -x = ~x + 1,
or equivalently -(x+1) = ~x.

So if we try something similar for gcd_22, we'll get two sub_ddmmss,
but as they're independent, we don't add any carry propagation to the
critical path (a rough sketch of that step follows after the code
below).

I see some possible micro-optimizations in the below code, but I
doubt it can be made faster than the regular gcd_11.

mp_limb_t
mpn_gcd_11 (mp_limb_t u, mp_limb_t v)
{
  ASSERT (u & v & 1);

  /* In this loop, we represent the odd numbers u and v without the
     redundant least significant one bit. This reduction in size by
     one bit ensures that the high bit of d, below, is set if and
     only if v > u.  */
  u >>= 1;
  v >>= 1;

  while (u != v)
    {
      mp_limb_t t, d;
      mp_limb_t vgtu, use_add;
      int c;

      t = u ^ v;
      /* If the least significant bit of t is 1, then
         u + v = ...01 + ...11 = ...00, i.e., divisible by 4.  */
      use_add = -(t & 1);

      d = u - v;
      /* u - ~v = u + v + 1 */
      t = u - (use_add ^ v);
      vgtu = LIMB_HIGHBIT_TO_MASK (d);

      /* v <-- min (u, v) */
      v += (vgtu & d);

      /* u <-- |u - v|, but only if we're not doing u + v */
      vgtu &= ~use_add;
      u = (t ^ vgtu) - vgtu;

      count_trailing_zeros (c, t);
      /* We have c <= GMP_LIMB_BITS - 2 here, so that

           u >>= (c + 1);

         would be safe. But unlike the addition c + 1, a separate
         shift by 1 is independent of c, and can be executed in
         parallel with count_trailing_zeros.  */
      u = (u >> 1) >> c;
    }
  return (u << 1) + 1;
}
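For concreteness, a rough sketch of how the corresponding gcd_22
inner step might look. This is untested, and it assumes gcd_22 would
use the same implicit-bit representation, with (u1,u0) and (v1,v0)
holding the odd operands shifted right by one bit (so both fit in
2*GMP_LIMB_BITS - 1 bits, and the high bit of the double-limb
difference gives the comparison). sub_ddmmss, add_ssaaaa and
LIMB_HIGHBIT_TO_MASK are the usual longlong.h / gmp-impl.h macros;
dh, dl, th, tl are just placeholder names, and the trailing-zero
count and the final double-limb right shift are left out:

  mp_limb_t dh, dl, th, tl, vgtu, use_add;

  use_add = -((u0 ^ v0) & 1);

  /* Two independent double-limb subtractions; neither depends on
     the other, so no extra carry propagation lands on the critical
     path.  */
  sub_ddmmss (dh, dl, u1, u0, v1, v0);  /* d = u - v */
  sub_ddmmss (th, tl, u1, u0, use_add ^ v1, use_add ^ v0);
                                        /* u - ~v = u + v + 1
                                           in the add case */

  vgtu = LIMB_HIGHBIT_TO_MASK (dh);

  /* v <-- min (u, v) */
  add_ssaaaa (v1, v0, v1, v0, vgtu & dh, vgtu & dl);

  /* u <-- |u - v|, or u + v + 1 if we use the add */
  vgtu &= ~use_add;
  sub_ddmmss (u1, u0, th ^ vgtu, tl ^ vgtu, vgtu, vgtu);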