I've been busy improving addmul_1 and submul_1 for Cortex-A15 lately. It turned out to be possible to reach 2 c/l for addmul_1 using plain (non-SIMD) operations; such code has been in the repo for a few days. The trick was to move the recurrence path away from the multiply-accumulate instructions, leaving just adcs (add-with-carry) on the path, and to schedule for latency by hand.
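For readers who do not have the operation in their head, here is a minimal C sketch of what addmul_1 computes on 32-bit limbs. It is not the code in the repo; the name ref_addmul_1 is made up for illustration, and the n >= 1 check is an assumption of the sketch. The point is that the only value carried from one limb to the next is the carry limb, which is the recurrence the assembly keeps on adcs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not the repo code):
   {rp, n} += {up, n} * v, returning the carry-out limb.  */
uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    assert(n >= 1);                 /* assumption: at least one limb */
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so this cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;            /* low half is the result limb */
        carry = (uint32_t)(t >> 32);    /* high half feeds the next limb */
    }
    return carry;
}
```

The product up[i] * v does not depend on the previous iteration, so only the additions need to sit on the loop-carried path.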
Ten days ago I posted about a 1.83 c/l mul_1 using Neon (i.e., SIMD) instructions. By moving from str to strd (thanks to Richard Henderson for the hint!) the code now runs at 1.48 c/l. This is still not optimal; I expect 1.25 c/l to be possible. This code performs just the multiplies on the SIMD side, and then laboriously copies things to the core side, where they are added up using adcs.

While mul_1 is important, it is nowhere near as important as addmul_1. Can the new non-SIMD 2 c/l be beaten using SIMD ops? It turns out that it can. I've now reached 1.65 c/l with a loop similar to the one used for mul_1, adjoining one vld1.32 and two vaddw.u32 for each 4 limbs. This is by far the best addmul_1 performance we have seen on any CPU. The code is attached. Note that it works only for n = 0 (mod 4).
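The structure of the SIMD loop can be modelled in plain C roughly as follows. This is only a sketch of the dataflow, not the attached assembly: sketch_addmul_1_by4 is a made-up name, the inner loops merely stand in for what vmull.u32, vaddw.u32 and adcs do, and no attempt is made to mirror the actual register allocation or scheduling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the 4-limbs-per-iteration loop:
   {rp, n} += {up, n} * v, returning the carry-out limb.  */
uint32_t sketch_addmul_1_by4(uint32_t *rp, const uint32_t *up, size_t n,
                             uint32_t v)
{
    assert(n % 4 == 0);             /* same restriction as the attached code */
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint64_t wide[4];
        /* "SIMD side": four independent 32x32->64 products. */
        for (int j = 0; j < 4; j++)
            wide[j] = (uint64_t)up[i + j] * v;
        /* Widening add of the destination limbs; still no carry chain.
           (2^32-1)^2 + (2^32-1) < 2^64, so this cannot overflow. */
        for (int j = 0; j < 4; j++)
            wide[j] += rp[i + j];
        /* "Core side": the only serial part, a chain of add-with-carry. */
        for (int j = 0; j < 4; j++) {
            uint64_t t = wide[j] + carry;
            rp[i + j] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
    }
    return carry;
}
```

The first two inner loops have no dependency between limbs, which is what lets them be done four at a time; only the last loop is inherently sequential.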
Attachments: arm-neon-skel-mul_1-v2.asm, arm-neon-skel-addmul_1-v2.asm
I expect there to still be much performance headroom for multiplication on A15. This CPU can do 2.5 32x32->64 multiplies per cycle by running SIMD and non-SIMD multiplies in parallel. The A9 can do 1.5 multiplies per cycle. Our latest and greatest code has very poor multiply utilisation, doing 0.61 and 0.48 multiply operations per cycle for A15 and A9, respectively. (A15 using the new SIMD addmul_1, A9 using an older non-SIMD addmul_3.) I have not looked into a mixed SIMD + non-SIMD addmul_k yet, but I wouldn't be at all surprised if that turns out to be feasible.

The next step might be to look at a SIMD addmul_2. If I am not much mistaken, adjoining one vmull.u32 and two vaddw.u32 per two limbs to the addmul_1 loop could do it, but there are many possibilities. That would almost certainly slow down the loop by at most one cycle per limb, and thus result in something considerably quicker than 1.65 c/l...

(An interesting conclusion from these experiments is that A57 addmul_1 should stay away from its fancy new 64-bit multiply instructions. If my information is correct, umul-hi has a throughput of 1/4 per cycle, while mul-lo has a throughput of 1/3 per cycle, and furthermore these instructions conflict with each other. Therefore, the best we can hope for from any addmul_k using these instructions is 4 + 3 = 7 cycles per 64-bit limb. That's slower than our new 32-bit addmul_1, which corresponds to 1.65*4 = 6.6 cycles per 64-bit limb.)

-- Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel