Re: Sandybridge addmul_N challenge
Torbjorn Granlund writes:

> If we slightly restrict the operand range, we could remove the
> indicated carry propagation insn.

Wrong.  Those carry propagation insns are needed.

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
Re: Sandybridge addmul_N challenge
ni...@lysator.liu.se (Niels Möller) writes:

> For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
> Let's write the interesting instructions out and unroll twice (using
> different registers),
>
>     add    c0, r0        C 0  6
>     adc    c1, r1        C 1  7
>     adc    $0, r2        C 3  9
>     add    r1, c2        C 3  9
>     adc    r2, c0        C 4 10
>     adc    $0, c1        C 6 12
>
> So the recurrency, for one iteration, seems to be just 3 cycles.  But
> the loop mixer doesn't find anything faster than 6.36 cycles for one
> iteration, or 3.18 per limb product.  Which isn't too bad (a slight
> improvement over 3.24, which I think is the best reported earlier),
> but stubbornly above 3 c/l.

I am playing with this block:

    C carry-in lo in r14, carry-in hi in rcx
    mov    0(up), %rax
    mul    v1
    mov    8(rp), %r8
    add    %rax, %r8
    mov    %rdx, %r9
    adc    $0, %r9
    mov    8(up), %rax
    mul    v0
    add    %rax, %r8
    adc    %rdx, %r9
    mov    $0, R32(%rbx)
    adc    $0, R32(%rbx)
    add    %r14, %r8        C 0
    adc    %rcx, %r9        C 1
    adc    $0, R32(%rbx)    C might be removed
    mov    %r8, 8(rp)
    C carry-out lo in r9, carry-out hi in rbx

This is not identical to your block, I think.  It runs at exactly
3 c/l.  The recurrency path is extremely shallow, at 1.5 c/l.

If we slightly restrict the operand range, we could remove the
indicated carry propagation insn.  Then the code runs at 2.8 c/l.
Neither variant is limited by decoding bandwidth.

It is further possible to supplant the 'mov $0,reg' and following
'adc $0,reg' with 'setc reg'.  This creates a false dependency (on the
upper 56 bits) and seems to run at about the same speed.

The plain code (i.e., the code which runs at 3.0 c/l) runs at
3-epsilon if the lea pointer update insns are removed.  This is a good
sign, proving there is no magic stopping us at 3 c/l...

>> My experience of Sandybridge is that with load/store coding style,
>> the CPU typically executes 3 insn/cycle unless there is a recurrency
>> dependency stopping that.
>
> If we could get there, the above loop should run just below 3 c/l.

I was obviously wrong.  :-(

> Hmm.  Last time I looked at that was in a 32-bit context.
> There's a 32x32->64 instruction which might be useful for a 32-bit
> build, at least in theory, but as far as I can find in the manual,
> the latest ss*-extensions don't provide any wider multiplication
> than that.

I believe that insn is used for 32-bit builds, where it helps.  Much
improvement could be done for 32-bit builds, if one cares.

(I see a new mail has arrived, will read now.)

-- 
Torbjörn
Re: Sandybridge addmul_N challenge
ni...@lysator.liu.se (Niels Möller) writes:

> So the recurrency, for one iteration, seems to be just 3 cycles.  But
> the loop mixer doesn't find anything faster than 6.36 cycles for one
> iteration, or 3.18 per limb product.  Which isn't too bad (a slight
> improvement over 3.24, which I think is the best reported earlier),
> but stubbornly above 3 c/l.

One update: I have now tried unrolling four times.  Then I've seen one
sequence running at 6.16 cycles per iteration, or 3.08 c/l.  See
shell:~nisse/hack/loopmix/lms/addmul_2-nisse-2.nlms.

Regards,
/Niels

-- 
Niels Möller.  PGP-encrypted email is preferred.  Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
Re: Sandybridge addmul_N challenge
Torbjorn Granlund writes:

> In loopmixer or manually?  I wouldn't draw any conclusions without
> mixing the code first...

With the loop mixer.

> Meaning evaluating in +1 instead of -1, I assume.

Exactly.

> Did you compute the recurrency chain?  Annotating the instructions on
> the recurrency chain helps understanding the problem.

I tried.  I use this iteration:

    mov    (up, n, 8), %rax
    mov    %rax, u
    mul    v0
    mov    %rax, r0
    mov    %rdx, r1
    mov    u, %rax
    mul    v1
    mov    (rp, n, 8), t
    add    t, r0
    add    %rax, r1
    mov    %rdx, r2
    adc    $0, r2
    add    c0, r0
    mov    r0, (rp, n, 8)
    adc    c1, r1
    adc    $0, r2

For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
Let's write the interesting instructions out and unroll twice (using
different registers),

    add    c0, r0        C 0  6
    adc    c1, r1        C 1  7
    adc    $0, r2        C 3  9
    add    r1, c2        C 3  9
    adc    r2, c0        C 4 10
    adc    $0, c1        C 6 12

So the recurrency, for one iteration, seems to be just 3 cycles.  But
the loop mixer doesn't find anything faster than 6.36 cycles for one
iteration, or 3.18 per limb product.  Which isn't too bad (a slight
improvement over 3.24, which I think is the best reported earlier),
but stubbornly above 3 c/l.

> My experience of Sandybridge is that with load/store coding style,
> the CPU typically executes 3 insn/cycle unless there is a recurrency
> dependency stopping that.

If we could get there, the above loop should run just below 3 c/l.

> I don't think there are.  These instructions are mostly FP plus
> narrow integer ops.

Hmm.  Last time I looked at that was in a 32-bit context.  There's a
32x32->64 instruction which might be useful for a 32-bit build, at
least in theory, but as far as I can find in the manual, the latest
ss*-extensions don't provide any wider multiplication than that.

Regards,
/Niels
Re: Sandybridge addmul_N challenge
ni...@lysator.liu.se (Niels Möller) writes:

> One can decrease it a bit by adding c0, c1 earlier (do you think
> recurrency can be a problem if we add c0, c1 to the first product?)
> and doing an in-place add to (rp) and 8(rp) at the end.  I could get
> it down to 30 instructions with a deep carry recurrency, or 34 with a
> short one.  I can get neither variant to run faster than 4 c/l.

In loopmixer or manually?  I wouldn't draw any conclusions without
mixing the code first...

> I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).

Meaning evaluating in +1 instead of -1, I assume.

> It's about the same number of instructions, but the updates from
> carries are independent of all products, so there's more freedom in
> where to move them around.  I think this idea may be more useful for
> other processors, without the awkward hardwired mul registers.

True.

>> I played a bit with mul_2 yesternight.  I am not 100% sure the code
>> is correct, but I think it is.  The loopmixer found a 2.5 c/l
>> version of it.
>
> Nice.  I've now wasted quite some time...  It seems really difficult.

It is challenging, but I am getting convinced we can really speed
things up a lot.

> Now I also tried a very basic variant of addmul_2, doing only one u
> limb per iteration and multiplying it by the two v limbs.  Even
> though I have a very nice carry recurrence between iterations (add,
> adc, adc $0: four cycles) and a small number of instructions (15 per
> iteration, 32 for the twice unrolled loop), which one might think
> could be executed in 11 cycles, or 5.5 per iteration, or 2.75 cycles
> per limb product, it won't run faster than 6.5 cycles per iteration,
> or 3.25 c/l.  So it just seems very difficult to convince the cpu to
> really execute the independent operations, outside of the recurrency,
> in parallel.

Did you compute the recurrency chain?  Annotating the instructions on
the recurrency chain helps understanding the problem.
My experience of Sandybridge is that with load/store coding style, the
CPU typically executes 3 insn/cycle unless there is a recurrency
dependency stopping that.

> BTW, are any of the SSE3 etc instructions useful here?

I don't think there are.  These instructions are mostly FP plus narrow
integer ops.

-- 
Torbjörn
Re: Sandybridge addmul_N challenge
Torbjorn Granlund writes:

> How did you arrive at 12.25 insns/limb?

I did something wrong, I guess...  9.25 instructions / limb, or 3.1
cycles if we can issue 3 instructions per cycle.

> I very much doubt this will win for Sandybridge, unless you can
> decrease the insn count with several instructions.

One can decrease it a bit by adding c0, c1 earlier (do you think
recurrency can be a problem if we add c0, c1 to the first product?)
and doing an in-place add to (rp) and 8(rp) at the end.  I could get
it down to 30 instructions with a deep carry recurrency, or 34 with a
short one.  I can get neither variant to run faster than 4 c/l.

I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).
It's about the same number of instructions, but the updates from
carries are independent of all products, so there's more freedom in
where to move them around.  I think this idea may be more useful for
other processors, without the awkward hardwired mul registers.

For documentation, this is what the iteration should do:

  vd = |v1 - v0|, sign in vs   (outside loop)
  ud = |u1 - u0|, sign in us
  s = us ^ vs ^ 1
  <p3,p2> = u1 * v1
  <m1,m0> = ud * vd ^ <-s,-s>
  <p1,p0> = u0 * v0

  +--+--+--+--+
  |p3 p2 p1 p0|
  +--+--+--+--+
  |      c1 c0|
  |      r1 r0|
  |   p1 p0   |
  |   p3 p2   |
  |-s  0  s   |
  |   m1 m0   |
  +--+--+--+--+
  |c3 c2 r1 r0|
  +--+--+--+--+

or

  <vc,vs> = v1 + v0   (outside loop)
  <uc,us> = u1 + u0
  <p3,p2> = u1 * v1
  <m1,m0> = us * vs
  <p1,p0> = u0 * v0

  +--+--+--+--+
  |p3 p2 p1 p0|
  +--+--+--+--+
  |      c1 c0|
  |      r1 r0|
  |  -p1-p0   |
  |  -p3-p2   |
  |   m1 m0   |
  |vc vs      | if uc
  |   us      | if vc
  +--+--+--+--+
  |c3 c2 r1 r0|
  +--+--+--+--+

> Perhaps some of the adc $0 could be eliminated with 2x unrolling?

In effect, that would be a kind of 4x2 multiply.  Which would then be
done as two 2x2 multiplies (I think the high limbs one gets from
evaluation rule out using toom32 or toom42).  Haven't tried that.  I
suspect one will run out of registers.

> I played a bit with mul_2 yesternight.  I am not 100% sure the code
> is correct, but I think it is.  The loopmixer found a 2.5 c/l
> version of it.

Nice.
I've now wasted quite some time...  It seems really difficult.

Now I also tried a very basic variant of addmul_2, doing only one u
limb per iteration and multiplying it by the two v limbs.  Even though
I have a very nice carry recurrence between iterations (add, adc,
adc $0: four cycles) and a small number of instructions (15 per
iteration, 32 for the twice unrolled loop), which one might think
could be executed in 11 cycles, or 5.5 per iteration, or 2.75 cycles
per limb product, it won't run faster than 6.5 cycles per iteration,
or 3.25 c/l.  So it just seems very difficult to convince the cpu to
really execute the independent operations, outside of the recurrency,
in parallel.

BTW, are any of the SSE3 etc instructions useful here?

Regards,
/Niels
Re: Sandybridge addmul_N challenge
ni...@lysator.liu.se (Niels Möller) writes:

> Here's a sketch of an addmul_2 iteration using Karatsuba.  I assume
> we have vl, vh, vd = |vl - vh| and an appropriate sign vmask in
> registers before the loop.  Carry input in c0, c1, carry out in r2,
> r3.
>
>     mov    (up), %rax
>     mov    %rax, ul
>     mul    vl            C low product
>     mov    8(up), uh
>     mov    %rax, r0
>     mov    %rdx, r1
>     lea    (uh, ul), %rax
>     sub    uh, ul
>     cmovnc ul, %rax
>     sbb    r3, r3
>     mul    vd            C middle product
>     mov    r1, r2        C add shifted low product
>     add    r0, r1
>     adc    $0, r2
>     add    (rp), r0      C add rp limbs
>     adc    8(rp), r1
>     adc    $0, r2
>     mov    %rax, p0
>     mov    %rdx, p1
>     mov    uh, %rax
>     mul    vh            C high product
>     xor    vmask, r3
>     xor    r3, p0        C conditionally negate, and add, middle product
>     xor    r3, p1
>     bt     $0, r3
>     adc    p0, r1
>     adc    p1, r2
>     adc    $0, r3
>     add    %rax, r1      C add shifted high product
>     adc    %rdx, r2
>     adc    $0, r2
>     add    c0, r0        C add input carry limbs
>     adc    c1, r1
>     mov    c0, (rp)
>     mov    c1, 8(rp)
>     adc    %rax, r2      C add high product
>     adc    %rdx, r3
>
> 37 instructions, or 12.25 instructions per limb, excluding looping
> logic (and it has to be unrolled twice, to use separate registers for
> input and output carries).

How did you arrive at 12.25 insns/limb?  I have not tried to
understand the code, but doesn't it perform a 2x2 limb multiply with
accumulation?  That's 9.25 insn/limb product.

I very much doubt this will win for Sandybridge, unless you can
decrease the insn count with several instructions.  Unfortunately it
has no chance on Bulldozer, since the latter has a 2-issue pipeline;
you need to beat 32 insns per 2x2 accumulation block there.

> I think the instruction count can be reduced a bit, at the cost of
> higher pressure on the out-of-order execution.

Perhaps some of the adc $0 could be eliminated with 2x unrolling?

> What do you think?  If one can get one iteration to run at 12 cycles,
> that's 3 c/l and an improvement over addmul_1.  If one can get it
> down to 11 or 11.5, one beats 3 c/l.

If that is possible, it might not be enough...  See below.
> For a "vertical" mul_basecase, the quadratic work would be an
> iteration of
>
>     mov    (up), %rax
>     mul    (vp)
>     add    %rax, r0
>     adc    %rdx, r1
>     adc    $0, r3
>
> So there's potential for that to run at 2 cycles per limb product.
> But then there's also a significant linear cost for accumulation and
> carry propagation, and possible bad branch-prediction due to loops of
> varying lengths.

Exactly.  (But the branch misprediction problem would not happen for
David's mulmid_basecase, I suppose.)

Some ways to deal with the branch misprediction problem:

* Have straight-line code for the corners, up to a limit.  This gets
  rid of the really high relative branch misprediction for these
  small areas.

* Handle two (or more) columns in parallel, and separately for the
  low-significant right triangle, any middle rectangular part, and
  the left triangle.  This doubles (or more) the amount of useful
  work per branch misprediction.

* I suppose that making full use of out-of-order execution just
  before a mispredicted branch would make sense.

I played a bit with mul_2 yesternight.  I am not 100% sure the code is
correct, but I think it is.  The loopmixer found a 2.5 c/l version of
it.

I started with genxmul.c (from the loopmixer repo) using these args:
"-n2 -w4 --mul".  I then analysed the critical path and determined
that it is about 24.  The problem is adc feeding other adc feeding
other adc through a register (remember that pure carry deps are fast
on Sandybridge).  So I mindlessly introduced 4 new registers, then
using alternating accumulation registers.  I needed 8 extra insns (in
total, corresponding to 2 per way) to pairwise sum the accumulation
registers.  I am sure this was not done optimally, but (assuming my
code is sound) it proves that there is a lot of performance headroom,
as expected.

I conjecture that we could create an addmul_N for Sandybridge that
runs at <= 2.5 c/l.  I think this will be possible already for N=2.
Perhaps we could arrive at 2.25 for N=4, matching the K8-K10
performance.
-- 
Torbjörn
Re: Sandybridge addmul_N challenge
Torbjorn Granlund writes:

> I think we should focus not on addmul_1 but on mul_basecase,
> sqr_basecase, redc_1, perhaps redc_2.  I.e., please focus on addmul_2
> (or addmul_N, N > 2) or vertical multiplication primitives.

Here's a sketch of an addmul_2 iteration using Karatsuba.  I assume we
have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
before the loop.  Carry input in c0, c1, carry out in r2, r3.

    mov    (up), %rax
    mov    %rax, ul
    mul    vl            C low product
    mov    8(up), uh
    mov    %rax, r0
    mov    %rdx, r1
    lea    (uh, ul), %rax
    sub    uh, ul
    cmovnc ul, %rax
    sbb    r3, r3
    mul    vd            C middle product
    mov    r1, r2        C add shifted low product
    add    r0, r1
    adc    $0, r2
    add    (rp), r0      C add rp limbs
    adc    8(rp), r1
    adc    $0, r2
    mov    %rax, p0
    mov    %rdx, p1
    mov    uh, %rax
    mul    vh            C high product
    xor    vmask, r3
    xor    r3, p0        C conditionally negate, and add, middle product
    xor    r3, p1
    bt     $0, r3
    adc    p0, r1
    adc    p1, r2
    adc    $0, r3
    add    %rax, r1      C add shifted high product
    adc    %rdx, r2
    adc    $0, r2
    add    c0, r0        C add input carry limbs
    adc    c1, r1
    mov    c0, (rp)
    mov    c1, 8(rp)
    adc    %rax, r2      C add high product
    adc    %rdx, r3

37 instructions, or 12.25 instructions per limb, excluding looping
logic (and it has to be unrolled twice, to use separate registers for
input and output carries).

I think the instruction count can be reduced a bit, at the cost of
higher pressure on the out-of-order execution.

* At least the moves to p0, p1 can be eliminated.

* One could also save some instructions by adding in c0, c1 earlier,
  and doing an in-place add to (rp) at the end, on the theory that
  the recurrency is less tight.

* I'm also not sure if the order of the three multiplications is the
  best one.

* I don't try to optimize the add HIGH(ul * vl) + LOW(uh * vh), which
  (if additions are organized in the right way) is done twice; I
  suspect it's going to be a bit painful since the carry has to be
  applied at two places.

What do you think?  If one can get one iteration to run at 12 cycles,
that's 3 c/l and an improvement over addmul_1.
If one can get it down to 11 or 11.5, one beats 3 c/l.

For a "vertical" mul_basecase, the quadratic work would be an
iteration of

    mov    (up), %rax
    mul    (vp)
    add    %rax, r0
    adc    %rdx, r1
    adc    $0, r3

So there's potential for that to run at 2 cycles per limb product.
But then there's also a significant linear cost for accumulation and
carry propagation, and possible bad branch-prediction due to loops of
varying lengths.

Regards,
/Niels