I took a brief look at the loop of the new assembly code. Have you analysed the register needs? Pushing all callee-saves registers is quite expensive.
For the mul insn, it is usually better to copy the invariant/noncritical operand to rax, and use the critical operand explicitly in the mul insn. I suspect one or two of the register-to-register copy insns could be optimised out.
In order to run this through the loopmixer, you need to set up data in the prologue that ensures the adjustment branch is never taken. Letting the inverse be 0 or else B-1 might work...
-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel