Hi. I implemented basecase multiplication and squaring for x86 using SSE2 instructions and the Comba column-wise multiplication method. On Ivy Bridge (Intel Core i7 3517U), multiplication is 10-20% faster than the present GMP basecase MMX multiplication, and squaring is 5-10% faster than the GMP MMX version. However, on an older CPU (Core 2 Duo E7500, Wolfdale), the same code is 15-30% slower than the GMP MMX version.
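As a rough illustration of the column-wise (Comba) scheme, here is a minimal scalar sketch in portable C. It is not the SSE2 code discussed in this message: the names comba_mul and limb_t are made up for the example, 32-bit limbs are assumed, and the low and high halves of each product are summed in two separate 64-bit accumulators, which is one common way to avoid propagating a carry after every multiplication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; 32-bit limbs assumed */

/* r[0 .. an+bn-1] = a[0 .. an-1] * b[0 .. bn-1], one result column at a time.
   Requires an >= 1, bn >= 1, and r must not overlap a or b. */
static void comba_mul(limb_t *r, const limb_t *a, size_t an,
                      const limb_t *b, size_t bn)
{
    assert(an >= 1 && bn >= 1);

    uint64_t lo = 0;   /* carry-in plus the sum of the low product halves */

    for (size_t k = 0; k + 1 < an + bn; k++) {
        uint64_t hi = 0;   /* sum of the high product halves of column k */

        /* Column k collects every a[i] * b[k - i] with both indices valid. */
        size_t i_min = (k >= bn) ? k - bn + 1 : 0;
        size_t i_max = (k < an) ? k : an - 1;

        for (size_t i = i_min; i <= i_max; i++) {
            uint64_t p = (uint64_t)a[i] * b[k - i];
            lo += (uint32_t)p;   /* low 32 bits of the product  */
            hi += p >> 32;       /* high 32 bits of the product */
        }

        r[k] = (limb_t)lo;       /* the finished limb of column k */
        lo = hi + (lo >> 32);    /* what remains carries into column k + 1 */
    }
    r[an + bn - 1] = (limb_t)lo; /* final carry is the top limb */
}
```

Within a column the two accumulators only ever receive plain 64-bit additions, so the products of a column can be computed and added in any order, or several at a time.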
What is good about SIMD and Comba is that they match each other perfectly: it is easy to do 2 or more multiplications in parallel. But, to my surprise, the gain was not that good; I expected at least 50% over the MMX code. Moreover, 32-bit x86 is nearly obsolete.

I'm going to try the same approach for ARM NEON. The NEON instruction set is more elegant than SSE2, so I'm rather optimistic about beating the current ARM GMP multiplication...

If anybody is interested, I will share my present implementation of basecase MUL and SQR (GNU assembly and C/C++ SSE intrinsics).

Here are the results (Windows, MINGW32):

~~~~~~~~~~~~~~~~~
/d/prj/3rdparty/gmp-32/tune$> ./speed.exe -Cr -s2-64 mpn_mul_basecase mpn_mul_basecase_vb
overhead 5.29 cycles, precision 10000 units of 3.58e-010 secs, CPU freq 2794.00 MHz
        mpn_mul_basecase  mpn_mul_basecase_vb
 2          13.2628           #0.7281
 3          12.8256           #0.8463
 4          14.7880           #0.8549
 5          15.7579           #0.9141
 6          16.9703           #0.9563
 7          18.8107           #0.9472
 8          20.7070           #0.9374
 9          22.4533           #0.9253
10          24.2333           #0.9179
11          26.1746           #0.9048
12          28.0606           #0.8986
13          30.0165           #0.8832
14          32.2114           #0.8684
15          33.6000           #0.8764
16          35.6812           #0.8637
17          37.6875           #0.8777
18          39.8452           #0.8755
19          41.4899           #0.8687
20          44.2875           #0.8339
21          46.4935           #0.8316
22          48.4636           #0.8322
23          50.3333           #0.8283
24          52.2222           #0.8210
25          53.5350           #0.8298
26          55.8297           #0.8223
27          57.7619           #0.8173
28          59.5867           #0.8148
29          62.5862           #0.8038
30          63.4333           #0.8126
31          64.8548           #0.8156
32          67.5938           #0.8037

/d/prj/3rdparty/gmp-32/tune$> ./speed.exe -Cr -s2-32 mpn_sqr_basecase mpn_sqr_basecase_vb
overhead 5.29 cycles, precision 10000 units of 3.58e-010 secs, CPU freq 2794.00 MHz
        mpn_sqr_basecase  mpn_sqr_basecase_vb
 2          11.0504           #0.7342
 3          12.0921           #0.6959
 4          12.2253           #0.7774
 5          13.4742           #0.8283
 6          14.0847           #0.8827
 7          14.8528           #0.9195
 8          15.6207           #0.9451
 9          16.5070           #0.9704
10          17.5082           #0.9794
11          18.3551           #0.9722
12          19.5435           #0.9555
13          20.4577           #0.9553
14          21.4923           #0.9464
15          22.4375           #0.9387
16          23.4310           #0.9345
17          24.3137           #0.9337
18          25.4167           #0.9304
19          26.3612           #0.9284
20          27.3075           #0.9253
21          28.2782           #0.9174
22          29.6847           #0.8977
23          30.2853           #0.9642
24          31.2333           #0.9028
25          32.2523           #0.8975
26          33.8609           #0.8748
27          34.6944           #0.8780
28          35.8929           #0.8622
29          36.1448           #0.8828
30          36.9900           #0.8757
31          38.1935           #0.8705
32          39.1875           #0.8672

Some implementation details:
- Only SSE2 instructions are used.
- Column multiplication is unrolled to 4 multiplications per loop.
- All data loads are done with the unaligned load instruction "movdqu", which loads 4 limbs at a time.
- Computation of the first 4 and the last 4 columns is unrolled, and they share input data (4 limbs).

The main unit of computation does 2 multiplications at a time. The "4 multiplications" loop unit is just 2 interleaved instances of the code below:
---------------------------------------
pshufd    $0xXX,SRC_A,VA0   # [?,a1,?,a0]
pshufd    $0xXX,SRC_B,VB0   # [?,b1,?,b0]
pmuludq   VB0,VA0           # [(r11,r10)=a1*b1,(r01,r00)=a0*b0]
movdqa    VA0,VB0
punpckhdq ZERO,VA0          # [0,r11,0,r10]
punpckldq ZERO,VB0          # [0,r01,0,r00]
paddq     VA0,ACC           # [(p11,p10),(p01,p00)] += [(0,r11),(0,r10)]
paddq     VB0,ACC           # [(p11,p10),(p01,p00)] += [(0,r01),(0,r00)]
---------------------------------------

Column product save & shift looks like this:
---------------------------------------
movd      ACC,(memory)      # [p11,p10,p01,p00]; p00 ==> memory
movq      ACC,SCRATCH       # [0,0,p01,p00]
psrldq    $8,ACC            # [0,0,p11,p10]
psrldq    $4,SCRATCH        # [0,0,0,p01]
paddq     SCRATCH,ACC       # [(0,0),(p11,p10)] += [(0,0),(0,p01)]
---------------------------------------

Thanks,
Vasili
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel