ni...@lysator.liu.se (Niels Möller) writes: The recurrency depth seems to be the same in all cases, though, with a latency of add + adc + adc from cin to cout. If that's what's killing performance, maybe this would be better, The x86_64/mul_1.asm code has adc+adc with register operands on the recurrency chain, i.e., 2 cycles on any AMD chip and 4 cycles on any Intel chip (except Pentium4 which has about 20 cycles).
mov (up), %rax mul v xor cout, cout add (rp), %rax adc %rdx, cout add %rax, cin adc $0, cout mov cin, (rp) with a recurrency latency of only add + adc (where the adc in question has a $0 source operand). If I understand you correctly, that would be only two cycles on sandybridge. It seems the current sbr code does something similar? Then we have to rely on scheduling or out-of-order execution to not get a useless wait between the mul and the first add. I think your code has indeed a 2c recurrency path. It is similar to the sandybridge code, except that it turned out to be better to use adc $0 in both places possible. Is instruction issue still limited to three instructions per cycle? Then that's a more narrow bottleneck than both latency and multiplication throughput. The above is eight instructions per limb, excluding looping. So one could at least *hope* to get down to 3 cycles with a moderate level of unrolling. I don't think Intel's processors like "op mem,reg" very much. I tried the code above (with the loop mixer) and its runs slower than the code I checked in (which runs slightly faster than claimed, 3.25 c/l, mot 3.4 c/l). We still have 3 insn/c, except that some adjacent insn pairs are fused. I don't know exactly which, but test+jcc and cmp+jcc might be the only ones. My new addmul_1 has 38 insn, no fusing expected so 3.17 c/l is a decoder imposed limit. Maybe one could try a variant updating (rp) twice, to save an instruction: mov (up), %rax mul v xor cout, cout add %rax, (rp) adc %rdx, cout add cin, (rp) adc $0, cout Is it possible to squeeze those three memory instructions in less than 3 cycles? Intel processors likes "op reg,mem" even less... If we try to get the instruction count below 9 (like above), then there's not much room for moving around %rax and %rdx. But the critical recurrency, "add cin, ...; adc $0, cout", doesn't involve %rax and %rdx, leaving a bit freedom to move it between iterations if we unroll and use multiple registers for cin and cout. Hmm, let me give one more variant, which moves %rax and %rdx out of the way, at the cost of one more instruction (8 instructions per limb): mov (up), %rax mul v mov %rax, pl mov %rdx, cout add pl, (rp) adc $0, cout add cin, (rp) adc $0, cout I guess one ought to bring out the loop mixer to find out if any of this really can run at 3 cycles. The sandybridge machine tom behind shell is waiting for you. A nehalem machine is biko* (lots of Xen machines on the same hardware) and repentium is a Conroe ("core2"). I recently played with karatsuba-based addmul_2 for s390. For MUL- challenged machines, that might be an option. Sandybridge is OK there, but Bull-dozer is not. A disadvantage is that such code would be unsuitable for code which wants to be "side-channel silent". -- Torbjörn _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel