I haven't followed this discussion very closely, and did not see if you have considered the following.

OK, so the code is 3-way unrolled. That's always a bit inconvenient and tends to cause some code bloat. I am pretty sure we have that in at least some other place, but still do all the work in one loop, switching into the appropriate places from the feed-in code.

It is not expensive to compute something like

  (3^(-1) mod 2^32)*n mod 2^32 / 2^30

in the feed-in code. (3^(-1) mod 2^32) = 0xaaaaaaab, so we can do the above with two instructions (imul and shr). The latency of imul+shr is <= 4 on modern architectures. Since addaddmul_1msb0 is strictly internal, and since it presumably is used for very limited values of n, I assume 32-bit arithmetic on n is sufficient.

(Note that the tricky mod computation above "maps" the remainder 1 to 2 and the remainder 2 to 1.)

Other ideas:

* Use xor r,r instead of mov $0,r (considering that xor messes with the carry bit).

* Use one more register for accumulation, with 4x unrolling. That would save the 0xaaaaaaab magic mul.

* Provide a variant with mulx.

* Accumulate differently, say 4 consecutive limbs at a time, with carry being alive. That will require more registers for sure. By using adcx and adox, one may accumulate to the same registers in two chains semi-simultaneously.

* Use rbx instead of r12 to save a byte or two.

I suspect the present code is far from optimal on modern x86 CPUs which can sustain 1 64x64->128 multiply per cycle. I feel confident that we could reach close to 1 c/l.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
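
As an illustration of the feed-in computation described in the message above, here is a minimal C sketch of the multiply-and-shift trick. The helper names (feed_in_index, expected_index, check_feed_in_index) are hypothetical and not part of GMP; the sketch only models the arithmetic that the imul and shr pair would perform, assuming n is small (the mapping holds for n below 2^30).

```c
#include <assert.h>
#include <stdint.h>

/* 3^(-1) mod 2^32, as given in the message: 3 * 0xaaaaaaab == 2^33 + 1. */
#define INV3_MOD_2_32 UINT32_C(0xaaaaaaab)

/* Hypothetical helper: top two bits of n * 3^(-1) mod 2^32.
   In assembly this would be one imul followed by one shr. */
static uint32_t feed_in_index(uint32_t n)
{
    return (uint32_t)(n * INV3_MOD_2_32) >> 30;
}

/* Expected result for n < 2^30: remainder 0 stays 0,
   remainder 1 maps to 2, and remainder 2 maps to 1. */
static uint32_t expected_index(uint32_t n)
{
    static const uint32_t map[3] = { 0, 2, 1 };
    return map[n % 3];
}

/* Assumed self-check: compare both computations for every n below limit. */
static void check_feed_in_index(uint32_t limit)
{
    for (uint32_t n = 0; n < limit; n++)
        assert(feed_in_index(n) == expected_index(n));
}
```

The result is a value in {0, 1, 2} that the feed-in code could use to select an entry point into a single 3-way unrolled loop; because remainders 1 and 2 are swapped, any jump table would need to be ordered accordingly.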