ni...@lysator.liu.se (Niels Möller) writes: Here's a sketch of an adddmul_2 iteration using Karatsuba. I assume we have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers before the loop. Carry input in c0, c1, carry out in r2, r3. mov (up), %rax mov %rax, ul mul vl C low product mov 8(up), uh mov %rax, r0 mov %rdx, r1 lea (uh, ul), %rax sub uh, ul cmovnc ul, %rax sbb r3, r3 mul vd C Middle product mov r1, r2 C Add shifted low product add r0, r1 adc $0, r2 add (rp), r0 C Add rp limbs adc 8(rp), r1 adc $0, r2 mov %rax, p0 mov %rdx, p1 mov uh, %rax mul vh C High product xor vmask, r3 xor r3, p0 C Conditionally negate, and add, middle product xor r3, p1 bt $0, r3 adc p0, r1 adc p1, r2 adc $0, r3 add %rax, r1 C Add shifted high product adc %rdx, r2 adc $0, r2 add c0, r0 C Add input carry limbs adc c1, r1 mov c0, (rp) mov c1, 8(rp) adc %rax, r2 C Add high product adc %rdx, r3 37 instructions, or 12.25 instructions per limb, excluding looping logic (and it has to be unrolled twice, to use separate registers for input and output carries). How did you arrive to 12.25 insns/limb? I have not tried to understand the code, but doesn't it perform a 2x2 limb multiply with accumulation? That's 9.25 insn/limb product.
I very much doubt this will win for Sandybridge, unless you can decrease the insn count with several instructions. Unfortunately it has no chance on Bull-dozer, since the latter has a 2 issue pipeline; you need to beat 32 insns per 2x2 accumulation block there. I think the instruction count can be reduced a bit, at the cost of higher pressure on the out-of-order execution. Perhaps some of the adc $0 could be eliminated with 2x unrolling? What do you think? If one can get one iteration to run at 12 cycles, that's 3 c/l and an improvement over addmul_1. If one can get it down to 11 or 11.5, one beats 3 c/l. If that is possible, it might not be enough... See below. For a "vertical" mul_basecase, the quadratic work would be an iteration of mov (up), %rax mul (vp) add %rax, r0 adc %rax, r1 adc $0, r3 So there's potential for that to run at 2 cycles per limb product. But then there's also a significant linear cost for accumulation and carrry propagation, and possible bad branch-prediction due to loops of varying lenghts. Exactly. (But the branch misprediction problem would not happen for for David's mulmid_basecase, I suppose.) Some ways to deal with the branch misprediction problem: * Have straight line code for the corners, up to a limit. This gets rid of the really high relative branch misprediction for these small areas. * Handle two (or more) columns in parallel, and separately for the low-significant right triangle, any middle rectangular part, and the left triangle. This doubles (or more) the amount of useful work per branch misprediction. * I suppose that making full use of out-of-order execution just before a mispredicted branch would make sense. I played a bit with mul_2 yesternight. I am not 100% the code is correct, but I think it is. The loopmixer found a 2.5 c/l version of it. I started with genxmul.c (from the loopmixer repo) using these args: "-n2 -w4 --mul". I then analysed the critical path and determined that it is about 24. The problem is adc feeding other adc feeding other adc though a register (remember that pure carry deps are fast on Sandybridge). So I mindlessly introduced 4 new registers, then using alternating accumulation registers. I needed 8 extra insn (in total, corresponding to 2 per way) to pairwise sum accumulation registers. I am sure this was not done optimally, but (assuming my code is sound) it proves that there is a lot of performance headroom, as expected. I conjecture that we could create an addmul_N for Sandybridge that runs at <= 2.5 c/l. I think this will be possible already for N=2. Perhaps we could arrive to 2.25 for N=4, matching the K8-K10 performance. -- Torbjörn _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel