I agree with Niels (don't understand + not appropriate for low-level...). We should replace mpn_divexact_1 with code that:
(1) Uses Jebelean's trick with a Euclidean division working left-to- right and a simultaneous Hensel division working right-to-left. This is faster in the single-limb divisor case since both division algorithms (Euclidean and Hensel) are typically limited by multiply instruction latency. (2) Uses a 2-limb modular inverse but only uses Hensel norm. Optimal code without divisor invariance would probably look like this: if (n < THRESHOLD_1) binv = 1/d mod B mpn_bdiv_1_pi1 (...) else if (n < THRESHOLD_2) binv = 1/d mod B² mpn_bdiv_1_pi2 (...) else binv = 1/d mod B inv = (B²-1)/d mpn_div_1_pi1_bdiv_1_pi1 (...) (This code assumes the divisor d is odd and the limb base B has no common factor...) For CPUs with pipelined multiply insns, I'd expect THRESHOLD_1 to be small, perhaps 3 to 4. THRESHOLD_2 will be much larger, perhaps around 15. (Without pipelined multiply, mpn_bdiv_1_pi1 will always be faster.) -- Torbjörn _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel