I looked at the logic following this:

        sbb     U2, U2                  C 7 13

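A minimal C model of what that sbb computes, and of why a later neg followed by an add could be folded into a subtract. This is only a sketch of the arithmetic: the function names are made up for illustration and are not taken from the code under discussion, and the carry flag produced by the real adc/sbb instructions is not modelled, so the flag chain still needs to be checked by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Models `sbb r, r`: r - r - CF, which is 0 when CF = 0 and
   all ones (-1) when CF = 1.  The name is hypothetical. */
static uint64_t carry_mask(unsigned cf)
{
    return (uint64_t)0 - (uint64_t)(cf & 1);
}

/* Variant with an explicit neg: turn the mask into 0 or 1, then add it. */
static uint64_t add_negated(uint64_t x, uint64_t mask)
{
    uint64_t negated = (uint64_t)0 - mask;  /* models `neg`: yields 0 or 1 */
    return x + negated;
}

/* Variant without the neg: subtract the mask directly. */
static uint64_t sub_mask(uint64_t x, uint64_t mask)
{
    return x - mask;
}

/* Both variants agree modulo 2^64 for either value of the carry. */
static void check_equivalence(uint64_t x)
{
    for (unsigned cf = 0; cf <= 1; cf++) {
        uint64_t m = carry_mask(cf);
        assert(add_negated(x, m) == sub_mask(x, m));
    }
}
```

Calling check_equivalence on a few values, including 0 and UINT64_MAX, exercises the wrap-around cases. The identity x + (-m) = x - m holds for all 64-bit values; whether it applies cleanly to all three adc instructions depends on how the carry is consumed afterwards.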
You negate the U2 copy in Q2.  It seems that replacing three adc by sbb
could avoid the neg.

It might also be possible to replace the early loop "and" stuff by cmov.
Note that the carry flag survives dec, although that causes a pipeline
replay on older Intel chips.  (IIRC, only Sandy Bridge, Ivy Bridge, and
Haswell run that well.)

But one variable must be moved out of the registers.  Maybe B2md (used
once) is the best candidate.  Then

        lea     (U0, B2md), U1O

would have to be replaced by

        mov     (%rsp), U1O             C Can be done very early
        ...
        add     U0, U1O

We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations.

Or possibly DINV could be the one to move, if one thinks the quotient
logic is less critical.  Reading from a stack slot costs nothing under
ideal circumstances.

To optimise register usage, I sometimes annotate the code with live
ranges for each register.  That will help with register coalescing.
(T is rather short-lived, perhaps its register could serve two usages?)

-- 
Torbjörn