ni...@lysator.liu.se (Niels Möller) writes:

> I tried that, and I ended up with something *very* similar to your addmul_8 (after first writing addmul_4 and addmul_6). The following loop runs at 3.24 c/l on my A9 (according to the addmul_N program):

The A9 is quite an uninteresting core for Neon. Please make your experiments on an A15 instead.

> I tried moving things around to interleave independent operations, but I only managed to slow it down. I initially used separate mul and add, but vmlal was faster. The recurrence (for each carry register in parallel) is vmlal, vext, vpaddl, which might be the killer.
>
> 3.25 c/l means that one iteration in this loop takes 26 cycles, for 17 instructions. To me, that's surprisingly slow; the instruction sequence looks very friendly, with few dependencies and ample opportunities for executing two instructions in parallel.

We should probably work out the latencies for the interesting instructions. That's not hard to do.

> I also tried reversing the order of operations, doing Qc67 first and Qc01 last (on the theory that this matches the natural dependencies in the vext shifting), using additional registers to keep the values between vext and vpaddl, but that was a slowdown.
>
> Here are cycle numbers for my attempts:
>
>   addmul_2: 8.95 c/l
>   addmul_4: 4.49 c/l
>   addmul_6: 3.66 c/l
>   addmul_8: 3.24 c/l
>
> Untried tricks:
>
> One could try to use vuzp to separate high and low parts of the products. Then only the low parts need shifting around. I guess I'll try that with addmul_4 first, to see if it makes for any improvement.
>
> One could maybe use vaddw, to delay adding in one of the carry limbs, reducing the recurrence to only vuzp, vaddw (but if the recurrence isn't the bottleneck, that won't help).
>
> Anyway, it seems very challenging to make this neon code competitive on cortex-a9. I really wonder where the bottleneck might be for the above loop.

You could try a dependency-breaking trick I tend to use in situations such as this:

Make every instruction write to a unique register. (To get away with that, you might need to save/restore the callee-saves registers. See Richard's previous short message about that.)

Make every instruction read the same register, which is never written.

Now there are no dependencies at all (no RAW, no WAW, and no WAR (anti-dependencies)). Does the sequence run faster?
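
A minimal sketch of how such a dependency-free sequence could be generated, written in Python purely for illustration. The helper names, the choice of q0 as the read-only register, and the operand forms are assumptions and not taken from the loop discussed in this thread (which is not reproduced here); the opcode list is a stand-in for the real instruction mix.

```python
# Hypothetical helper, not from the thread: emit a dependency-free variant
# of a Neon instruction mix for timing experiments.

# Operand shapes for the three opcodes named in the thread. The concrete
# operands (vector form of vmlal, vext immediate #1) are assumptions.
TEMPLATES = {
    "vmlal.u32": "{dst}, {lo}, {hi}",
    "vext.32": "{dst}, {src}, {src}, #1",
    "vpaddl.u32": "{dst}, {src}",
}

# q4-q7 (d8-d15) are callee-saved under the ARM procedure call standard.
CALLEE_SAVED = {"q4", "q5", "q6", "q7"}


def break_dependencies(opcodes, src_num=0):
    """Give every instruction its own destination register and make all of
    them read only q<src_num>, which is never written.

    Note: vmlal also reads its destination as the accumulator, so each
    vmlal still depends on its own result from the previous loop iteration.
    """
    src = f"q{src_num}"
    lo, hi = f"d{2 * src_num}", f"d{2 * src_num + 1}"  # the two halves of src
    free = [f"q{i}" for i in range(16) if i != src_num]
    if len(opcodes) > len(free):
        raise ValueError("only 15 q registers are available as destinations")
    return [
        f"{op} " + TEMPLATES[op].format(dst=dst, src=src, lo=lo, hi=hi)
        for op, dst in zip(opcodes, free)
    ]


def needs_save(lines):
    """True if any emitted instruction writes a callee-saved q register."""
    return any(line.split()[1].rstrip(",") in CALLEE_SAVED for line in lines)


if __name__ == "__main__":
    mix = ["vmlal.u32", "vext.32", "vpaddl.u32"] * 2  # stand-in mix
    body = break_dependencies(mix)
    if needs_save(body):
        print("@ writes q4-q7: save/restore d8-d15 around the loop")
    print("\n".join(body))
```

The emitted lines would replace the loop body in a timing harness. Comparing the cycle count against the original loop suggests whether dependencies or raw issue bandwidth is the limit.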

Then rearrange the insns into something more balanced: does it run faster?

This is a 10 minute experiment which gives a lot of information about the potential of the instruction mix.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel