Torbjorn Granlund <t...@gmplib.org> writes:

> On Intel chips, op-to-mem is expensive. Even op-from-memory is often
> slower than load+op. (I understand the register shortage problem.)

The following (untested) variant needs one register too many.

  UP, QP, UN: Load, store, loop counter.
  DINV, B2, B2md: Loop invariant constants.
  U2, U1I, U0, Q1I, Q0: Inputs.
  U1O, Q1O: Outputs.
  Q2, %rax, %rdx: Locals.

That is 16 values, but only 15 general-purpose registers are free once
%rsp is reserved.

The U1I -> U1O recurrence chain is also marked (with Opteron cycle
counts).

        mov     U2, Q2
        mov     U2, Q1O
        neg     Q2
        mov     DINV, %rax
        and     DINV, Q1O
        mul     U1I
        add     Q0, Q1O
        adc     $0, Q2
        mov     %rax, Q0
        add     %rdx, Q1O
        adc     $0, Q2
        mov     B2, %rax
        and     B2, U2
        mul     U1I                     C 0 6
        lea     (U0, B2md), U1O
        add     U0, U2
        cmovnc  U2, U1O
        adc     U1I, Q1O
        adc     Q1I, Q2
        mov     Q2, (QP, UN, 8)
        jc      L(incr)
L(incr_done):
        mov     (UP, UN, 8), U0
        add     %rax, U0                C 4
        adc     %rdx, U1O               C 5
        sbb     U2, U2                  C 6

25 instructions (27 K10 decoder slots) excluding loop overhead. But one
variable must be moved out of the registers. Maybe B2md (used once) is
the best candidate. Then

        lea     (U0, B2md), U1O

would have to be replaced by

        mov     (%rsp), U1O             C Can be done very early
        ...
        add     U0, U1O

We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations.

Or possibly DINV, if one thinks the quotient logic is less critical.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
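
A note on the idiom the listing leans on: `sbb U2, U2` leaves either all zeros or all ones in U2, depending on the carry. `and DINV, Q1O` and `and B2, U2` then use that mask to pick the constant or zero without a branch, and `neg Q2` turns the mask back into 0 or 1. A minimal C sketch of just that masking step follows. The function names are hypothetical (not from GMP), and it models neither the multiplications nor the carry chain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: carry (0 or 1) -> mask (0 or all ones),
   the value that `sbb U2, U2` leaves in U2. */
static inline uint64_t carry_to_mask(unsigned carry)
{
    return (uint64_t)0 - (uint64_t)(carry != 0);
}

/* Branch-free select, as in `and DINV, Q1O` or `and B2, U2`:
   yields `value` when the mask is all ones, 0 when it is zero. */
static inline uint64_t mask_select(uint64_t mask, uint64_t value)
{
    return mask & value;
}

/* As in `neg Q2`: negating the mask recovers the carry as 0 or 1. */
static inline uint64_t mask_to_bit(uint64_t mask)
{
    return (uint64_t)0 - mask;
}

/* Small self-check of the three helpers for both carry values. */
static inline void check_mask_idiom(void)
{
    uint64_t m0 = carry_to_mask(0);
    uint64_t m1 = carry_to_mask(1);

    assert(m0 == 0 && m1 == UINT64_MAX);
    assert(mask_select(m0, 42) == 0 && mask_select(m1, 42) == 42);
    assert(mask_to_bit(m0) == 0 && mask_to_bit(m1) == 1);
}
```

Unsigned arithmetic is used throughout so that the wraparound in `0 - 1` is well defined in C. The assembly gets the same effect directly from the flags.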