Torbjorn Granlund <t...@gmplib.org> writes:

> Have you analysed the register needs? Pushing all callee-saves
> registers is quite expensive.

Per the FIXME-comment, we could avoid saving them for the n == 2 case
(which I think corresponds to n == 3 for the mpn_div_qr_1 caller, so it
might help that regression), but we do need a lot of registers for the
actual loop.

> For the mul insn, it is usually better to copy the invariant/noncritical
> operand to rax, and use the critical operand explicitly in the mul insn.

Will try that. I think one could also try to delay the quotient store by
one iteration, keeping "Q1" in a register until the next iteration. Then
one gets rid of the adc Q2,8(QP, UN, 8) in the loop, using only a single
store per iteration in the likely case. May need yet another register,
though.

> I suspect one or two of the register-to-register copy insns could be
> optimised out.

Maybe. And it would be easier to avoid moves if one unrolls the loop
twice, switching roles U0<->U1 and Q0<->Q1. But that makes it a bit more
bloated, of course.

> In order to run this through the loopmixer, you need to setup data in
> the prologue which makes the adjustment branch to never be taken.
> Letting the inverse be 0 or else B-1 might work...

I vaguely recall some previous attempt at loopmixing this, but I don't
remember any success.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
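
For illustration only, a rough C sketch of the delayed quotient store
discussed above. It is not the assembly loop. The names limb_t,
store_immediate and store_delayed are hypothetical, and the sketch
assumes each iteration hands over one new quotient limb plus a 0/1 carry
that belongs to the limb produced by the previous iteration (here simply
passed in as arrays, most significant limb first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical limb type, 64 bits assumed */

/* Store each limb at once, then add the carry into the limb that is
   already in memory (the read-modify-write that the adc performs). */
static void store_immediate(limb_t *qp, const limb_t *limb,
                            const unsigned char *carry, size_t n)
{
    assert(n == 0 || carry[0] == 0);   /* no earlier limb to adjust */
    for (size_t i = 0; i < n; i++) {
        size_t j = n - 1 - i;          /* limbs are produced high to low */
        qp[j] = limb[i];
        limb_t c = carry[i];
        for (size_t k = j + 1; c != 0 && k < n; k++) {
            qp[k] += c;
            c = (qp[k] == 0);          /* wrapped to zero: keep carrying */
        }
    }
}

/* Keep the newest limb in a local ("register") for one iteration, so
   the carry is added before the limb is written. Memory that was
   already written is touched again only if the pending limb wraps. */
static void store_delayed(limb_t *qp, const limb_t *limb,
                          const unsigned char *carry, size_t n)
{
    if (n == 0)
        return;
    assert(carry[0] == 0);
    limb_t pending = limb[0];          /* produced but not yet stored */
    for (size_t i = 1; i < n; i++) {
        size_t j = n - i;              /* slot of the pending limb */
        limb_t t = pending + carry[i];
        if (t < pending) {             /* unlikely: pending limb wrapped */
            for (size_t k = j + 1; k < n; k++)
                if (++qp[k] != 0)
                    break;
        }
        qp[j] = t;                     /* the only store in the likely case */
        pending = limb[i];
    }
    qp[0] = pending;                   /* flush the last limb */
}
```

Both functions produce the same quotient limbs; the second trades the
memory adc for one extra live register, which is the cost mentioned
above.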