ni...@lysator.liu.se (Niels Möller) writes: For small gcdext (one or two limbs), it seems likely that a branch-free binary algorithm is a good choice. The binary algorithm in mpn/generic/gcdext_1.c seems to work now, after a recent bugfix, but it's not that fast. One reason is that the loop updates both cofactors, which costs both instructions and registers.
If we do a gcdext computing only one cofactor, I'd like to think about it as a modular inversion function. For u, v, with v odd, attempt to compute u^{-1} (mod v). More precisely, always compute gcd(u,v). In addition, compute a cofactor s according to these rules: A modular inersion function also when the inverse does not exist. :-) If u = 0 (mod v), equivalently, gcd (u,v) = v, set s = 0 If gcd (u,v) = 1, set s = u^{-1} (mod v) If gcd (u,v) > 1, set s = (u/g)^{-1} (mod v/g) Below are three implementations of such a function, one based on euclid's algorithm, one plain binary, and one binary using same masking tricks as in gcdext_1.c. You've been busy! I believe two things will make the final, branch-reduced variant slow. 1. The explicit division, and 2. the shift loop. It is not clear that postponing the work done in the shift loop from the main loopis a good idea (even if it is done with a faster bmod op). For plain gcd_11 we typically are latency limited. Perhaps some indepndent work such as shifting s0 could be done for free in gcdext_11? Using the main loop might also reduce register pressure. PS. Why is the sign logic for the cofactor sign different in the two binary method implementations? -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel