After discussing several interesting gcd algorithms with Torbjörn, I've made this observation that I think is kind-of interesting.
Lehmer code depends on the fact that if M is a 2x2 integer matrix with determinant ±1, and (a';b') = M (a;b), then gcd (a,b) = gcd (a', b'). But for binary gcd, we can relax this: If the matrix determinant is a power of two (in absolute value), then gcd(a,b) and gcd(a',b') can differ only by a power of two. And in the binary algorithm, we discard powers of two as we go, so they're no problem. In addition, many small matrices have power-of-two determinant, e.g., det(1,1;1,-3) = -4. So we can choose a', b' quite freely among various linear combinations, e.g., a+b, a-b, a+3b; we can chose any pair as long as the corresponding determinant is a power of two. One way to arrange it that I think may be simple and efficient is to chose constants m, k so that a' = a + kb = 0 (mod 8) b' = a + mb = 4 (mod 8) This can be done by first choosing a linear combination that is divisible by 4, and set t = (a + b) / 4 or t = (a + 3b) / 4 (that's pretty standard). But next, form a', b' as t and t - b, one of which is even and the other odd. Corresponding matrices seem to satisfy the requirement: since det (1,k;1,m) = m - k, we can use any distinct pair of numbers in {-3, -1, 1, 3 } except the pair ±3. Algorithm (which is made a bit more complicated by the implicit lsb trick): mp_limb_t mpn_gcd_11 (mp_limb_t u, mp_limb_t v) { ASSERT (u & v & 1); /* In this loop, we represent the odd numbers ulimb and vlimb without the redundant least significant one bit. This reduction in size by one bit avoids overflow. */ u >>= 1; v >>= 1; for (;;) { mp_limb_t n0, n1; int c; /* In this iteration, we set u' = u + k v = 0 (mod 8), v' = u + m v = 4 (mod 8), with k, m chosen from the set { -3, -1, 1, 3 }. It's arranged bitwise; first set t = (a + {1,3} b) / 4, then use t and t - b for the next iteration, keeping track of which one is odd so that we only need one count_trailing_zeros. */ mp_limb_t t = u + v + 1; /* (u + v) / 2, explicit lsb */ mp_limb_t m0 = - (t & 1); t = (t >> 1) + (m0 & (v + 1)); /* ((u + v)/ 2 + m0 v) / 2 */ mp_limb_t m1 = - (t & 1); u = (t >> 1) - (m1 & v); /* Make even, then shift out lsb 0 */ if (u == 0) return (v << 1) + 1; count_trailing_zeros (c, u); v = (t >> 1) + (~m1 & ~v); /* Make odd, shift out lsb 1 */ /* Take absolute values */ n0 = LIMB_HIGHBIT_TO_MASK (u); n1 = LIMB_HIGHBIT_TO_MASK (v); u = ((u^n0) - n0) >> (c+1); /* Negation only happens when ~m1, no -n1 since that's canceled by the implicit bit. */ v ^= n1; } } Seems to work, but not particularly fast :-( It's unclear to me what progress actually is made in one iteration. There's also a long dependency chain, it could probably be shortened a bit by computing the m1 mask directly from the inputs, so it could be scheduled in parallel with theoperations involving m0. I also wonder if it can be extended to larger matrices; if one can find matrices with k bit elements that lets us clear 2k bits most of the time? Power-of-two determinant is a constraint, but there are still lots of matrices. The idea is a bit similar to the binary recursive gcd by Stehlé and Zimmermann, but here we're aiming for fairly small numbers with matrix selected by table lookup, with no need for a divide-and-conquer algorithm. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel