Torbjörn Granlund <t...@gmplib.org> writes: > The cycle counts are per bit eliminated, it is not cycle counts.
Ok, so they can't be compared directly with "expected" cycles per iteration. > I never bothered to try to measure cycle counts. One would need to > measure with specially selected operands with known behaviour for the > algorothm, but that would be fragile. Or let the benchmark program also call a reference implementation that produces the iteration count. Cycles per iteration would be helpful in this context. > I wonder if it's possible to get down to 3 cycles. Scheduling the second > sub instruction in parallell with the first would be a start. > > It might be. Good luck! :-) Don't hold your breath... > The last few generations of x86 processors have free mov insns. Sounds good. A move instruction only needs to update the register-renaming state, it shouldn't issue to any execution unit. > One thing which I never really explored is to evaluate at least a-b and > a+b and choose the one which is divisible by 4. If that cost just one > extra critical insn, it should be faster. Hmm. One should aim to have only a single added cmov on the critical path, and do the add and divisibility test in parallel. One difficulty is that the shift count will depend on the choice (while current code takes advantage of the count being the same for the other two choices, a-b and b-a). Moving the ctz after the cmovs makes the critical path longer. To avoid that, one would need to compute both ctz(a-b) and ctz(a+b). Or even shift two candidate values, |a-b| and (a+b) in parallel. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel