ni...@lysator.liu.se (Niels Möller) writes:

  ni...@lysator.liu.se (Niels Möller) writes:

  > I'm not yet using GMP's mpn_cnd_{add,sub}_n, that's the next thing I'd
  > like to try.

  That wasn't a clear win... I use addmul_1 and submul_1 as a fallback
  (and I always do in-place operation, so that works).

  Now, cnd_sub_n beats submul_1 (except for n == 2, which I don't use):

  $ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_submul_1.1 mpn_cnd_sub_n
  clock_gettime is 1.000ns accurate
  overhead 8.87 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz
        mpn_submul_1.1  mpn_cnd_sub_n
  1           #19.8927        21.6831
  2           #10.9752        12.4106
  3             9.5514        #8.9371
  4             8.5227        #6.6696
  5             7.8316        #6.7412
  6             7.1571        #6.0339
  7             7.2859        #5.3320
  8             6.8553        #4.8715
  9             6.6945        #5.0376
  10            6.3129        #4.8351
  100           5.5065        #3.2110

This is perhaps not thanks to the speed of mpn_cnd_sub_n, but due to
mpn_submul_1's slowness.  I have a new A15 submul_1, I know of no A9
improvement.

  But for addition, mpn_addmul_1 beats mpn_cnd_add_n for many small sizes,

  $ GMP_CPU_FREQUENCY=1e9 ./speed -C -s 1-10,100 mpn_addmul_1.1 mpn_cnd_add_n
  clock_gettime is 1.000ns accurate
  overhead 8.94 cycles, precision 1000 units of 1.00e-06 secs, CPU freq 1000.00 MHz
        mpn_addmul_1.1  mpn_cnd_add_n
  1           #19.8927        21.2256
  2           #10.8574        11.6940
  3            #8.0235         8.5240
  4            #6.4561         6.5216
  5            #6.0308         6.5071
  6            #5.4937         5.9282
  7            #5.2063         5.3603
  8             4.8838        #4.7493
  9            #4.9249         4.9533
  10           #4.5364         4.8244
  100           3.4846        #3.2842

Not an alarming difference.

  Some questions:

  1. I guess one can expect submul_1 to always be a bit slower than
     addmul_1, since submul_1 needs additional arithmetic besides the
     umaal? One could perhaps do some negations on the fly, a - b*C =
     -((-a) + b*C), maybe that would be advantageous?

I encourage you to work on that; 3.25 c/l vs 5.25 c/l seems like a very
large difference between addmul_1 and submul_1.

  2. cnd_add_n should be at least as fast as addmul_1, shouldn't it? It
     appears to be 0.25 c/l faster for larger operands, so maybe it's
     "only" a question of optimizing loop setup and feed-in?

I suppose I've given addmul_1 much more attention.  And the focus on any
cnd_ functions is side channel silence, not ultimate speed.

I've never considered addmul_1/submul_1 as alternatives to
cnd_add_n/cnd_sub_n.  We might very well have cases where the former is
faster, as per http://gmplib.org/devel/asm.html.  A similar situation is
that addmul_1/submul_1 is sometimes faster than addlsh_1/sublsh_1.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel