Niels Möller <ni...@lysator.liu.se> writes: > Nice speedup for some sices, in particular 15. Some notable regressions, > in particular size 30x24 and 42x21, if I read the table correctly. > > To understand what's going on, one might need to separate the rather > small changes to algorithm selection, to the reduced toom32 overhead.
Since this was so noisy, I've made a new try, taking out (most of) the mpn_mul algorithm selection from picture. So let's compare mpn_mul_basecase and mpn_toom32_mul for a few ratios. I tried, 0.5, 0.7 and 0.8. Before: $ ./speed-nohack -p 100000 -c -s 2-100 -f 1.2 mpn_mul_basecase.-50 mpn_toom32_mul.-50 mpn_mul_basecase.-70 mpn_toom32_mul.-70 mpn_mul_basecase.-80 mpn_toom32_mul.-80 overhead 5.94 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz mpn_mul_basecase.-50 mpn_toom32_mul.-50 mpn_mul_basecase.-70 mpn_toom32_mul.-70 mpn_mul_basecase.-80 mpn_toom32_mul.-80 2 15.85 n/a 15.85 n/a #15.85 n/a 3 #17.83 n/a 24.76 n/a 24.76 n/a 4 28.89 n/a #28.84 n/a 53.46 n/a 5 #32.85 n/a 61.39 n/a 72.28 n/a 6 #68.32 n/a 81.19 n/a 81.19 n/a 7 #75.27 n/a 90.12 n/a 112.00 n/a 8 #98.07 n/a 122.02 n/a 142.78 n/a 9 #108.10 n/a 159.69 n/a 189.20 n/a 10 #152.87 463.13 211.76 463.23 235.97 463.86 12 #203.81 501.84 270.97 502.65 306.06 502.62 14 #283.98 614.32 364.14 615.33 445.40 614.30 16 #352.81 696.92 488.13 697.71 529.72 699.03 19 #479.76 884.33 691.06 883.58 797.68 883.27 22 #671.82 1034.74 916.42 1034.20 1037.16 1033.89 26 #930.34 1306.73 1279.71 1303.43 1422.11 1303.24 31 #1266.87 1750.73 1814.02 1749.58 2038.85 2281.88 37 #1785.48 2306.52 2519.19 2305.17 2919.34 2304.54 44 #2593.88 3037.51 3531.03 3032.28 4122.22 3035.03 52 #3600.48 4006.00 4994.00 4003.96 5688.95 4001.57 62 #5147.14 5370.00 7127.25 5365.57 8147.50 5363.33 74 7315.00 6952.75 10098.10 6953.31 11673.44 #6949.69 88 10236.73 9273.75 14214.00 9278.67 16295.00 #9265.75 $ ./speed-hack -p 100000 -c -s 2-100 -f 1.2 mpn_mul_basecase.-50 mpn_toom32_mul.-50 mpn_mul_basecase.-70 mpn_toom32_mul.-70 mpn_mul_basecase.-80 mpn_toom32_mul.-80 overhead 5.94 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz mpn_mul_basecase.-50 mpn_toom32_mul.-50 mpn_mul_basecase.-70 mpn_toom32_mul.-70 mpn_mul_basecase.-80 mpn_toom32_mul.-80 2 15.85 n/a 15.84 n/a #15.84 n/a 3 #17.82 n/a 24.76 n/a 24.76 n/a 4 28.91 n/a #28.86 n/a 53.45 n/a 5 #32.85 n/a 61.38 n/a 72.26 n/a 6 #68.31 n/a 81.18 n/a 81.18 n/a 7 #75.25 n/a 90.10 n/a 112.05 n/a 8 #98.06 n/a 122.04 n/a 142.86 n/a 9 #108.15 n/a 159.92 n/a 189.21 n/a 10 #152.76 420.43 211.70 420.25 236.25 419.95 12 #203.69 463.06 270.84 464.07 306.00 463.74 14 #283.98 586.27 363.69 586.42 444.86 586.57 16 #352.48 673.81 488.66 673.32 529.77 673.20 19 #479.49 853.71 691.75 853.47 796.95 853.06 22 #671.75 996.61 915.84 997.88 1037.51 995.13 26 #929.26 1294.79 1279.71 1291.70 1421.53 1290.95 31 #1264.93 1717.77 1813.25 1716.57 2038.21 1717.83 37 #1785.33 2276.68 2518.91 2274.46 2918.13 2275.00 44 #2593.69 3005.06 3531.81 3002.97 4121.48 3005.35 52 #3599.65 3977.39 4993.23 3979.29 5687.50 3978.75 62 #5149.77 5268.05 7125.44 5262.10 8149.14 5267.90 74 #7313.47 8835.50 10096.90 8835.88 11668.33 8835.88 88 10232.73 #9199.17 14206.37 9199.42 16287.14 11831.92 It looks like the new toom32 is generally faster than the old (except for the larger sizes where we should be using higher toom, and then the old code wins because it recurses to mpn_mul which can call one of those functions), which is nice. But another observation that I think is interesting, is that it has a really hard time beating mul_basecase at ratio 0.5. So we probably should never call toom32 close to this ratio (on shell, it seems toom42 starts beating basecase around size 55 limbs). Which brings me somewhat back to the drawing board, where the idea was that the unbalanced subproduct in toom22 should always recurse to one of toom22, toom32 or schoolbook. Maybe we need some other alternative. Could consider toom42, but my gut feeling is that's no use for smallish sizes (although we don't tune toom32 vs toom42, so I don't know where the threshold might be). Perhaps we should have a 2:1 algorithm which splits the larger number in two, and multiplies each part with the smaller one using toom22. What would be a name for that? I'm having some difficulty with the intuition, perhaps it would help to draw the diagrams of ratio for unbalanced sub product as a function of the input ratio, for both toom22 and toom32, as well as the new candidate algorithm. Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel