Re: div_qr_1 interface
ni...@lysator.liu.se (Niels Möller) writes:

> Will try that. I think one could also try to delay the quotient store
> one iteration, keeping Q1 in a register until the next iteration.
> Then one gets rid of the adc Q2, 8(QP, UN, 8) in the loop, using only
> a single store per iteration in the likely case. May need yet another
> register, though.

On Intel chips, op-to-mem is expensive. Even op-from-memory is often
slower than load+op. (I understand the register shortage problem.)

> I suspect one or two of the register-to-register copy insns could be
> optimised out. Maybe. And it would be easier to avoid moves if one
> unrolls the loop twice, switching roles U0-U1 and Q0-Q1. But that
> makes it a bit more bloated, of course.

It might be worth it, since this is an important operation.

In order to run this through the loopmixer, you need to set up data in
the prologue which makes sure the adjustment branch is never taken.
Letting the inverse be 0 or else B-1 might work... I vaguely recall
some previous attempt at loopmixing this, but I don't remember any
success.

Let's take a look at current performance on all amd64 CPUs except
nocona (=pentium4). I compare the pi variants here. Conclusions:

* The code is no win for AMD k10/k8 (although close to 10 c/l might
  well be possible)
* The code is a big win for AMD bulldozer and also for piledriver
* The code is a big win for Intel core2 (alias conroe)
* The code is a cycle slower for Intel sandybridge
* The code is a cycle faster on Intel nehalem, ivybridge, haswell
* The code is a big win for VIA nano

In ~tege/GMP/newdiv/div_1n_pi2-x86_64.asm I claim 9.75 c/l (and that
7 c/l is possible) for k10/k8, 16 c/l for core2, and 13.25 c/l for
nehalem. Of course, the precomputation cost there is much higher.

k10
overhead 6.00 cycles, precision 100 units of 3.12e-10 secs, CPU freq 3200.35 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
   1       #12.0018       26.0043
   2        19.5030      #19.5024
   3        17.6695      #17.3362
   5        15.8019      #15.6019
   8        14.7518      #14.6267
  13       #14.0895       15.0901
  22       #14.1463       14.2366
  37       #13.6849       13.7393
  62       #13.4139       13.4445
 105       #13.2498       13.2589
 178       #13.1524       13.1632
 302       #13.0952       13.1011
 513       #13.0607       13.0642
 872       #13.0302       13.0325

bulldozer
overhead 6.00 cycles, precision 100 units of 2.77e-10 secs, CPU freq 3612.09 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
   1       #13.9118       30.8628
   2       #22.5047       24.0040
   3        20.6703      #20.3110
   5       #18.0036       20.0033
   8       #17.2535       19.2532
  13       #16.7725       19.8804
  22       #17.0943       20.5489
  37       #16.6519       20.4899
  62       #16.3905       20.2277
 105       #16.2322       20.1748
 178       #16.1383       20.0710
 302       #16.0895       20.0499
 513       #16.0513       20.0218
 872       #16.0337       20.0186

piledriver
overhead 6.00 cycles, precision 100 units of 7.14e-10 secs, CPU freq 4000.00 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
   1       #13.4460       27.7072
   2       #21.2536       22.0034
   3       #19.1284       20.6698
   5       #17.6283       19.6027
   8       #16.8365       19.1819
  13       #16.7634       19.3874
  22       #16.5480       19.1393
  37       #16.5433       18.9761
  62       #16.3419       18.8095
 105       #16.2121       18.6698
 178       #16.0991       18.6101
 302       #16.0503       18.5661
 513       #16.2965       18.5379
 872       #16.3580       19.0568

core2
overhead 6.01 cycles, precision 100 units of 4.69e-10 secs, CPU freq 2132.93 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
   1       #15.7048       28.7024
   2       #26.5272       26.5408
   3       #24.7981       25.2783
   5       #21.9089       24.9270
   8       #20.9994       24.4683
  13       #20.4778       24.1549
  22       #20.0956       23.8461
  37       #19.7079       23.8088
  62       #19.6855       23.8366
 105       #19.5935       23.9688
 178       #19.3434       23.8856
 302       #19.3213       23.8421
 513       #19.4093       23.8145
 872       #19.3424       23.8016

nehalem
overhead 6.00 cycles, precision 100 units of 3.75e-10 secs, CPU freq 2670.00 MHz
        mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
   1       #12.1014       24.6814
   2       #21.0024       21.8684
   3        20.9170      #20.5440
   5       #19.6475       20.3452
   8
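For orientation, the routines being measured divide an n-limb number by
one normalized limb using a precomputed inverse. Below is a minimal C
sketch of the underlying 2/1 division step, following the published
Möller/Granlund division-by-invariant-integers algorithm rather than
GMP's actual sources; the function names and the use of unsigned
__int128 are my own, and the final conditional fix-ups are the kind of
adjustment branch referred to above.

#include <stddef.h>
#include <stdint.h>

/* 2/1 limb division with a precomputed inverse; a sketch of the
   published algorithm, not GMP's code.  Preconditions: d normalized
   (top bit set), u1 < d, dinv = floor((B^2-1)/d) - B, with B = 2^64. */
static inline uint64_t
div_2_1_preinv (uint64_t *r, uint64_t u1, uint64_t u0,
                uint64_t d, uint64_t dinv)
{
  unsigned __int128 t = (unsigned __int128) dinv * u1;
  t += ((unsigned __int128) u1 << 64) | u0;
  uint64_t q1 = (uint64_t) (t >> 64) + 1;   /* candidate quotient (mod B) */
  uint64_t q0 = (uint64_t) t;
  uint64_t rem = u0 - q1 * d;               /* candidate remainder (mod B) */

  if (rem > q0)                             /* adjustment branch */
    {
      q1--;
      rem += d;
    }
  if (rem >= d)                             /* second adjustment, very unlikely */
    {
      q1++;
      rem -= d;
    }
  *r = rem;
  return q1;
}

/* The plain limb loop; the assembly under discussion restructures this
   recurrence so that consecutive iterations overlap.  */
uint64_t
div_qr_1_preinv (uint64_t *qp, const uint64_t *up, size_t n,
                 uint64_t d, uint64_t dinv)
{
  uint64_t r = 0;
  for (size_t j = n; j-- > 0;)
    qp[j] = div_2_1_preinv (&r, r, up[j], d, dinv);
  return r;
}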
Re: div_qr_1 interface
Torbjorn Granlund <t...@gmplib.org> writes:

> On Intel chips, op-to-mem is expensive. Even op-from-memory is often
> slower than load+op. (I understand the register shortage problem.)

The following (untested) variant needs one register too many.

  UP, QP, UN:            Load pointer, store pointer, loop counter.
  DINV, B2, B2md:        Loop-invariant constants.
  U2, U1I, U0, Q1I, Q0:  Inputs.
  U1O, Q1O:              Outputs.
  Q2, %rax, %rdx:        Locals.

The U1I -> U1O recurrence chain is annotated (with Opteron cycle
counts).

        mov     U2, Q2
        mov     U2, Q1O
        neg     Q2
        mov     DINV, %rax
        and     DINV, Q1O
        mul     U1I
        add     Q0, Q1O
        adc     $0, Q2
        mov     %rax, Q0
        add     %rdx, Q1O
        adc     $0, Q2
        mov     B2, %rax
        and     B2, U2
        mul     U1I                     C 0 6
        lea     (U0, B2md), U1O
        add     U0, U2
        cmovnc  U2, U1O
        adc     U1I, Q1O
        adc     Q1I, Q2
        mov     Q2, (QP, UN, 8)
        jc      L(incr)
L(incr_done):
        mov     (UP, UN, 8), U0
        add     %rax, U0                C 4
        adc     %rdx, U1O               C 5
        sbb     U2, U2                  C 6

25 instructions (27 K10 decoder slots) excluding loop overhead. But one
variable must be moved out of the registers. Maybe B2md (used once) is
the best candidate. Then

        lea     (U0, B2md), U1O

would have to be replaced by

        mov     (%rsp), U1O     C Can be done very early
        ...
        add     U0, U1O

We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations. Or possibly DINV, if one thinks the quotient logic is less
critical.

Regards,
/Niels
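A note on the constants, as I read them (this interpretation is mine,
not spelled out in the thread): B2 stands for B^2 mod d with B = 2^64.
Since B^2 is congruent to B2 (mod d), a carry out of the two-limb
remainder window can be folded back in as an addition of B2 instead of
forcing an immediate reduction, which is what the and/cmovnc sequence
above appears to exploit. For a normalized d the constant is cheap to
obtain; a hedged C sketch:

#include <assert.h>
#include <stdint.h>

/* Compute B2 = B^2 mod d for normalized d, with B = 2^64.
   Illustration only: the 128-bit modulo is slow but simple; a real
   precomputation could presumably derive B2 from the inverse.  */
static uint64_t
compute_b2 (uint64_t d)
{
  assert (d >> 63);             /* normalized: B/2 <= d < B */
  uint64_t b_mod_d = -d;        /* B mod d = B - d, valid since d > B/2 */
  /* B^2 mod d = (B mod d)^2 mod d */
  return (uint64_t) ((unsigned __int128) b_mod_d * b_mod_d % d);
}

The sbb U2, U2 at the bottom of the loop then materializes the fold
condition as a 0/~0 mask for the next iteration's and B2, U2.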
Re: div_qr_1 interface
I looked at the logic following this:

        sbb     U2, U2                  C 7 13

You negate the U2 copy in Q2. It seems that replacing three adc by sbb
could avoid the neg.

It might also be possible to replace the early loop and stuff by cmov.
Note that the carry flag survives dec, although that causes a pipeline
replay on older Intel chips. (IIRC, only sandybridge, ivybridge, and
haswell run that well.)

> But one variable must be moved out of the registers. Maybe B2md (used
> once) is the best candidate. Then
>
>        lea     (U0, B2md), U1O
>
> would have to be replaced by
>
>        mov     (%rsp), U1O     C Can be done very early
>        ...
>        add     U0, U1O
>
> We then have 26 instructions + loop overhead, or 54 instructions for
> 2 iterations. Or possibly DINV, if one thinks the quotient logic is
> less critical.

Reading from a stack slot costs nothing under ideal circumstances.

To optimise register usage, I sometimes annotate the code with live
ranges for each register. That will help with register coalescing.
(T is rather short-lived, perhaps its register could serve two usages?)

-- 
Torbjörn
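To make the carry bookkeeping concrete (my illustration, not code from
the thread): sbb treats CF as a borrow, while adc on a complemented
operand produces the same result bits with the opposite carry sense.
That inversion is the thing to watch when trading adc for sbb. An
exhaustive 8-bit model check:

#include <assert.h>

/* sbb computes a - b - cf, with CF out meaning borrow; adding the
   complement computes a + ~b + (1 - cf), with CF out being the
   complement of that borrow.  Checked for all 8-bit operands.  */
int
main (void)
{
  for (unsigned a = 0; a < 256; a++)
    for (unsigned b = 0; b < 256; b++)
      for (unsigned cf = 0; cf < 2; cf++)
        {
          unsigned borrow = a < b + cf;               /* sbb's carry out */
          unsigned sum = a + (~b & 0xff) + (1 - cf);  /* adc on ~b, inverted cf in */
          assert (((a - b - cf) & 0xff) == (sum & 0xff));
          assert ((sum >> 8) == !borrow);             /* carry = NOT borrow */
        }
  return 0;
}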
Re: div_qr_1 interface
Torbjorn Granlund <t...@gmplib.org> writes:

> I looked at the logic following this:
>
>        sbb     U2, U2                  C 7 13
>
> You negate the U2 copy in Q2. It seems that replacing three adc by
> sbb could avoid the neg.

The problem is the final use, where Q2 is added, with carry, to a
different register. It's tempting to replace

        adc     Q1I, Q2

with

        sbb     Q2, Q1I

and a negated Q2, but I'm afraid that will get the sense of the carry
wrong. Do you see any trick to get that right without negating Q2
somewhere along the way?

> It might also be possible to replace the early loop and stuff by
> cmov.

Maybe, but the simple way to do conditional addition with lea + cmov
won't do, since we also need carry out. Does it matter if we do

        mov     B2, r
        and     mask, r

or

        mov     $0, r
        cmovc   B2, r

?

> To optimise register usage, I sometimes annotate the code with live
> ranges for each register. That will help with register coalescing.

There are lots of possibilities, since the computations for Q and U are
mostly independent. The data flow is something like

                     load U limb
                          |
                         _V_
        U2, U1I, U0 --> |___| --> U2, U1O, U0
                             \
                              \ cy
                           ____V____
           Q1I, Q0 -----> |_________| --> Q1O, Q0
                               |
                               V
                          store Q limb

> (T is rather short-lived, perhaps its register could serve two
> usages?)

It could perhaps be eliminated.

Regards,
/Niels
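At the value level the two sequences asked about are interchangeable;
what C cannot express is the difference in flag usage and scheduling,
which is presumably what the question is really about. A small sketch
of the equivalence (hypothetical helper, my names), with the needed
carry out made explicit:

#include <assert.h>
#include <stdint.h>

/* Conditional add of b2, selected by a carry flag cf, with carry out.
   Both addend forms compute the same value; a compiler may emit
   either the and-mask or the cmov sequence for them.  */
static inline uint64_t
cond_add (uint64_t u0, unsigned cf, uint64_t b2, unsigned *carry_out)
{
  uint64_t a_mask = (-(uint64_t) cf) & b2;   /* mov B2, r; and mask, r */
  uint64_t a_cmov = cf ? b2 : 0;             /* mov $0, r; cmovc B2, r */
  assert (a_mask == a_cmov);

  /* Either way, the loop still needs the carry out of the addition
     itself, which is why plain lea + cmov (lea sets no flags) is not
     enough on its own.  */
  unsigned __int128 s = (unsigned __int128) u0 + a_mask;
  *carry_out = (unsigned) (s >> 64);
  return (uint64_t) s;
}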