ni...@lysator.liu.se (Niels Möller) writes:

On my core2 laptop:

    $ ./speed -s 2-10,100,500 -C mpn_divrem_1.0x9999999999999999 mpn_div_qr_1.0x9999999999999999
    overhead 6.13 cycles, precision 10000 units of 8.33e-10 secs, CPU freq 1200.00 MHz
            mpn_divrem_1.0x9999999999999999 mpn_div_qr_1.0x9999999999999999
    2              60.6420                        #39.9427
    3             #40.9839                         55.0469
    4             #43.7667                         44.4534
    5              44.6333                        #38.9055
    6              39.6259                        #34.4167
    7              34.0063                        #32.4018
    8              30.1364                        #28.5745
    9              29.6472                        #27.4599
    10             29.1270                        #26.7300
    100            24.7920                        #20.6700
    500            24.4400                        #19.7600

So here it's a clear win, except an ugly regression for n = 3.

You might want to pass -p1000000 or something, to avoid startup noise.
On shell, the same command gives:

    2             #37.4379                         51.1157
    3             #30.0256                         61.0904
    4             #25.8058                         27.0781
    5             #23.2717                         24.2831
    6             #21.7520                         22.4346
    7             #20.5219                         21.1111
    8             #19.4783                         20.1101
    9             #18.7726                         19.3369
    10            #18.3271                         18.7228
    100           #13.8063                         13.8175
    500           #13.2670                         13.2750

So here the new code is epsilon slower for the larger sizes.

Maybe the loopmixer can help.

The old code runs optimally, given its instructions. What is the latency critical path of the new code?

The performance for n=3 is poor for both processors. Do you understand the reason?

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
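To make the comparison in the two tables easier to read, here is a small sketch that turns the figures quoted above into new/old ratios. The numbers are copied from this message; the variable and function names are made up for illustration, and treating the first column as the old code (mpn_divrem_1) and the second as the new code (mpn_div_qr_1) is my reading of the speed output, not something the program itself reports.

```python
# Timing figures copied from the two tables in this message.
# Assumption: column 1 is the old code (mpn_divrem_1), column 2 the new
# code (mpn_div_qr_1), and smaller numbers are better.
SIZES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 500]

CORE2_OLD = [60.6420, 40.9839, 43.7667, 44.6333, 39.6259, 34.0063,
             30.1364, 29.6472, 29.1270, 24.7920, 24.4400]
CORE2_NEW = [39.9427, 55.0469, 44.4534, 38.9055, 34.4167, 32.4018,
             28.5745, 27.4599, 26.7300, 20.6700, 19.7600]

SHELL_OLD = [37.4379, 30.0256, 25.8058, 23.2717, 21.7520, 20.5219,
             19.4783, 18.7726, 18.3271, 13.8063, 13.2670]
SHELL_NEW = [51.1157, 61.0904, 27.0781, 24.2831, 22.4346, 21.1111,
             20.1101, 19.3369, 18.7228, 13.8175, 13.2750]


def ratios(old, new):
    """Return new/old for each size; a value above 1.0 means the new code is slower."""
    return [n / o for o, n in zip(old, new)]


def slower_sizes(old, new):
    """Return the sizes at which the new code is slower than the old code."""
    return [size for size, o, n in zip(SIZES, old, new) if n > o]


def report(name, old, new):
    """Print one line per size with the new/old ratio."""
    print(name)
    for size, r in zip(SIZES, ratios(old, new)):
        print(f"  n={size:<4} new/old = {r:.3f}")


report("core2", CORE2_OLD, CORE2_NEW)
report("shell", SHELL_OLD, SHELL_NEW)
```

On these figures the new code is slower on core2 only for n = 3 and, marginally, n = 4, while on shell it is slower at every size, with the gap shrinking to well under 1% at n = 100 and n = 500.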