I added data for the new code at <http://gmplib.org/devel/asm.html>.

There is a line for div_qr_1u_pi1 as well, since that will also be needed; it may actually be more common for the divisor to be unnormalised. I should try to wrap up div_qr_1n_pi2 and div_qr_1u_pi2 as well, and then add a threshold for the non-invariant case. If my old data for those are correct, they are always faster for large enough operands.

I expect div_qr_1u_pi1 to be no slower than div_qr_1n_pi1 on some machines, just as divrem_1 often runs at the same speed for normalised and unnormalised divisors (sometimes using one loop, sometimes two). To use just one loop, we probably need an efficient shrd, since the normalised case then just means a shift count of 0. (Only Intel's high-end processors run shrd well.) Where a general loop is fast, I suppose we should provide just div_qr_1_pi1.

-- 
Torbjörn

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
There is a line for div_qr_1u_pi1 as well, since that will also be needed. It might actually be more common that the divisor is not normalised. I should try to wrap up div_qr_1n_pi2 and div_qr_1u_pi2 as well, and then add threshold for the non-invariant case. If my old data for those are correct, then it is always faster for large enough operands. I expect div_qr_1u_pi1 to be no slower than div_qr_1n_pi1 on some machines, just like divrem_1 is often the same speed for normalised and unnormalised divisors (sometimes using one loop, sometimes using two). To use just one loop, we probably need an efficient shrd, since then the normalised case just mean a shift count of 0. (Only Intel's high-end processors run shrd well.) I suppose we should provide just div_qr_1_pi1 when a general loop is fast. -- Torbjörn _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel