ni...@lysator.liu.se (Niels Möller) writes: I've tried it out. Works nicely, but no speedup on my machine. I'm attaching another patch. There are then 4 methods:
method 1: Old loop around udiv_qrnnd_preinv. method 2: The clever code from 10 years ago, with the microoptimization I commited the other day. method 3: More or less the same as I posted a few days ago. method 4: Postpones the update u1 -= u2 d, off the critical recurrency chain. Instead, conditionally adds in the constant B2 (B - d) to the lower u limbs. Let me throw some additional complication into this project: The fastest way to do a division by a single limb will involve more precomputation, assuming we have good throughput of the multiply instruction. Even in some scenario where one has a 1-limb inverse precomputed, it will be faster to first do a Newton round to get a 2-limb inverse, and then run a loop which develops two quotient limbs at a time. I expect the dividend limb count cutoff point to be small, surely less than 10. With full precomputation, the dividend limb count cutoff point would be 2, I presume. The function div_qr_1n_pi1 is useful mainly for CPUs without a pipelined multiply instruction. And of course also useful until we have great div_qr_1n_piN for N >= 2. Improvements to div_qr_1n_piN are of course theoretically interesting too. -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel