ni...@lysator.liu.se (Niels Möller) writes:

Will try that. I think one could also try to delay the quotient store one iteration, keeping "Q1" in a register until the next iteration. Then one gets rid of the adc Q2,8(QP, UN, 8) in the loop, using only a single store per iteration in the likely case. May need yet another register, though.

On Intel chips, op-to-mem is expensive. Even op-from-memory is often slower than load+op. (I understand the register shortage problem.)

> I suspect one or two of the register-to-register copy insns could be
> optimised out.

Maybe. And it would be easier to avoid moves if one unrolls the loop twice, switching roles U0<->U1 and Q0<->Q1. But that makes it a bit more bloated, of course.

It might be worth it, since this is an important operation.

> In order to run this through the loopmixer, you need to setup data in
> the prologue which makes the adjustment branch to never be taken.
> Letting the inverse be 0 or else B-1 might work...

I vaguely recall some previous attempt at loopmixing this, but I don't remember any success.

Let's take a look at current performance on all amd64 CPUs except nocona (=pentium4). I compare the pi variants here. Conclusions:

* The code is no win for AMD k10/k8 (although close to 10 c/l might well be possible)
* The code is a big win for AMD bulldozer and also for piledriver
* The code is a big win for Intel core2 (alias conroe)
* The code is a cycle slower for Intel sandybridge
* The code is a cycle faster on Intel nehalem, ivybridge, haswell
* The code is a big win for VIA nano

In ~tege/GMP/newdiv/div_1n_pi2-x86_64.asm I claim 9.75 c/l (and that 7 c/l is possible) for k10/k8, 16 c/l for core2, and 13.25 c/l for nehalem. Of course, the precomputation cost there is much higher.

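For readers who want a concrete picture of the kind of operation being timed, here is a minimal Python sketch of dividing a multi-limb number by a single normalized limb using a precomputed reciprocal. It follows the textbook 2-by-1 division with a reciprocal as described by Möller and Granlund, which is an assumption on my part: it is not the assembly loop discussed in this thread, and it does not model the delayed quotient store or the unrolling. The names invert_limb, div_2by1_preinv and div_qr_1n are hypothetical, and 64-bit limbs are assumed.

```python
B = 1 << 64          # limb base; 64-bit limbs are an assumption of this sketch
MASK = B - 1


def invert_limb(d):
    """Return v = floor((B^2 - 1) / d) - B for a normalized d (top bit set)."""
    assert d >> 63 == 1, "d must be normalized"
    return (B * B - 1) // d - B


def div_2by1_preinv(u1, u0, d, v):
    """Divide u1*B + u0 by d, with u1 < d and v = invert_limb(d); return (q, r)."""
    q = v * u1 + ((u1 << 64) | u0)   # two-limb quotient candidate <q1, q0>
    q1 = ((q >> 64) + 1) & MASK
    q0 = q & MASK
    r = (u0 - q1 * d) & MASK         # candidate remainder, computed mod B
    if r > q0:                       # first adjustment, decided by a compare
        q1 = (q1 - 1) & MASK
        r = (r + d) & MASK
    if r >= d:                       # second adjustment: the rarely taken branch
        q1 = (q1 + 1) & MASK
        r -= d
    return q1, r


def div_qr_1n(limbs, d):
    """Divide little-endian 64-bit limbs by normalized d.

    Returns (quotient limbs, remainder). The reciprocal is computed once,
    which is the precomputation cost; the loop then does one 2-by-1
    division per limb, from the most significant limb down.
    """
    v = invert_limb(d)
    q = [0] * len(limbs)
    r = 0
    for i in reversed(range(len(limbs))):
        q[i], r = div_2by1_preinv(r, limbs[i], d, v)
    return q, r
```

Calling div_qr_1n(limbs, 0xcafebabedeadbeef) uses the same divisor as the measurements in this post. In this sketch the second adjustment is the data-dependent branch that is almost never taken, which is the sort of branch that has to be kept quiet when timing a loop in isolation.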
******** k10 ********
overhead 6.00 cycles, precision 1000000 units of 3.12e-10 secs, CPU freq 3200.35 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1      #12.0018   26.0043
2       19.5030  #19.5024
3       17.6695  #17.3362
5       15.8019  #15.6019
8       14.7518  #14.6267
13     #14.0895   15.0901
22     #14.1463   14.2366
37     #13.6849   13.7393
62     #13.4139   13.4445
105    #13.2498   13.2589
178    #13.1524   13.1632
302    #13.0952   13.1011
513    #13.0607   13.0642
872    #13.0302   13.0325

******** bulldozer ********
overhead 6.00 cycles, precision 1000000 units of 2.77e-10 secs, CPU freq 3612.09 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1      #13.9118   30.8628
2      #22.5047   24.0040
3       20.6703  #20.3110
5      #18.0036   20.0033
8      #17.2535   19.2532
13     #16.7725   19.8804
22     #17.0943   20.5489
37     #16.6519   20.4899
62     #16.3905   20.2277
105    #16.2322   20.1748
178    #16.1383   20.0710
302    #16.0895   20.0499
513    #16.0513   20.0218
872    #16.0337   20.0186

******** piledriver ********
overhead 6.00 cycles, precision 1000000 units of 7.14e-10 secs, CPU freq 4000.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1      #13.4460   27.7072
2      #21.2536   22.0034
3      #19.1284   20.6698
5      #17.6283   19.6027
8      #16.8365   19.1819
13     #16.7634   19.3874
22     #16.5480   19.1393
37     #16.5433   18.9761
62     #16.3419   18.8095
105    #16.2121   18.6698
178    #16.0991   18.6101
302    #16.0503   18.5661
513    #16.2965   18.5379
872    #16.3580   19.0568

******** core2 ********
overhead 6.01 cycles, precision 1000000 units of 4.69e-10 secs, CPU freq 2132.93 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1      #15.7048   28.7024
2      #26.5272   26.5408
3      #24.7981   25.2783
5      #21.9089   24.9270
8      #20.9994   24.4683
13     #20.4778   24.1549
22     #20.0956   23.8461
37     #19.7079   23.8088
62     #19.6855   23.8366
105    #19.5935   23.9688
178    #19.3434   23.8856
302    #19.3213   23.8421
513    #19.4093   23.8145
872    #19.3424   23.8016

******** nehalem ********
overhead 6.00 cycles, precision 1000000 units of 3.75e-10 secs, CPU freq 2670.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1      #12.1014   24.6814
2      #21.0024   21.8684
3       20.9170  #20.5440
5      #19.6475   20.3452
8      #19.1692   20.0459
13     #19.1643   19.7841
22     #18.9714   19.4242
37     #18.9281   19.6363
62     #18.7318   19.3491
105    #18.9929   19.2355
178    #18.7822   19.2779
302    #18.7368   19.1683
513    #18.7251   19.1364
872    #18.6993   19.1451

******** sandybridge ********
overhead 6.00 cycles, precision 1000000 units of 3.02e-10 secs, CPU freq 3311.22 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1      #11.0014   19.0029
2       14.5490  #14.0978
3       15.2221  #13.5232
5       15.0096  #13.5554
8       14.9704  #13.7121
13      15.0515  #13.8339
22      15.1180  #13.9051
37      15.6663  #14.3060
62      15.1635  #14.3427
105     15.2652  #14.3665
178     15.3321  #14.3720
302     15.2939  #14.3763
513     15.2368  #14.3822
872     15.1984  #14.3878

******** ivybridge ********
overhead 6.56 cycles, precision 1000000 units of 2.86e-10 secs, CPU freq 3500.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1      #11.0014   18.0029
2      #12.9980   13.2798
3       13.4954  #12.3369
5       13.2389  #12.7242
8      #13.2280   13.2300
13     #13.2331   13.6025
22     #13.1963   13.8519
37     #13.5290   13.9966
62     #13.1636   14.0779
105    #13.1274   14.2256
178    #13.1143   14.2060
302    #13.0608   14.2141
513    #13.1540   14.2050
872    #13.1764   14.2006

******** haswell ********
overhead 5.00 cycles, precision 1000000 units of 3.46e-10 secs, CPU freq 2893.21 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1       #9.0015   17.0025
2       12.0021  #11.7526
3       11.5992  #11.0628
5       11.7255  #11.5849
8      #11.7214   12.3431
13     #11.7310   12.8741
22     #11.7497   13.2291
37     #12.0599   13.7739
62     #12.0945   13.7338
105    #12.0774   13.7221
178    #12.0205   13.7119
302    #12.0197   13.7050
513    #12.0365   13.7028
872    #12.0426   13.6965

******** vianano ********
overhead 9.01 cycles, precision 1000000 units of 6.25e-10 secs, CPU freq 1600.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef mpn_preinv_divrem_1.0xcafebabedeadbeef
1      #22.5221   41.0437
2      #31.5313   32.5324
3      #28.0280   29.6950
5      #24.4255   27.4286
8      #22.3986   26.1515
13     #21.1205   26.6420
22     #21.0681   25.5705
37     #20.2399   24.9434
62     #19.7464   24.5723
105    #19.4484   24.3500
178    #19.2731   24.2168
302    #19.1679   24.1375
513    #19.1063   24.0905
872    #19.0714   24.0626
shell$

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel