Richard Henderson <r...@twiddle.net> writes:

> Indeed, the last version that Niels posted doesn't pass this test. Oops.
>
> The following does pass, but if I'm to believe the arithmetic it's still fairly slow -- around 12cyc/sec.

12cyc/sec is a poor clock frequency. :-)

> If one is even more clever than I, one could do a 4x unroll, making best use of vld4. But when you do that, getting the carries right becomes even more tricky. But I think any correct solution will involve chains of vsra to shift and add up the chain.

Perhaps addmul_2 might not be easy to make fast for this target. I think a mul_basecase could be made to run at awesome speed. We might need a building block of at least addmul_4, more likely something larger.

Neon has SIMD 32+32 -> 64 bit add. Assume we want to do (32+32)+32 or ((32+32)+32)+32 [the latter possibly arranged as (32+32)+(32+32)], is there good ISA support for that too? It might require an insn that does 32+64 -> 64.

The key here is accumulation support, as always with SIMD. Without such good ISA support, we probably need more right shift operations, which will damage performance.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
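
To make the accumulation question above concrete, here is a minimal scalar sketch in plain C (not Neon code, and not the code posted in this thread). It models one 64-bit lane: three 32-bit quantities are summed as (32+32)+32 without overflow, and the carry is then passed on by a shift-right-and-accumulate step in the style of vsra. The function name addmul_1_model is hypothetical, and 32-bit limbs are an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model: rp[0..n-1] += up[0..n-1] * v, 32-bit limbs.
   Returns the carry-out limb.  Each column is summed in a 64-bit
   accumulator, the way a 64-bit SIMD lane would hold it. */
uint32_t addmul_1_model(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    uint64_t carry = 0;    /* previous accumulator shifted right by 32 */
    uint32_t prev_hi = 0;  /* high half of the previous 32x32 product */

    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v;   /* 32x32 -> 64 product */

        /* (32+32)+32: three 32-bit terms cannot overflow 64 bits. */
        uint64_t acc = (uint64_t)rp[i] + (uint32_t)p + prev_hi;

        /* vsra-like step: add in the previous column shifted right by 32. */
        acc += carry;

        rp[i] = (uint32_t)acc;
        carry = acc >> 32;
        prev_hi = (uint32_t)(p >> 32);
    }
    /* rp + up*v < 2^(32(n+1)), so the final limb fits in 32 bits. */
    return (uint32_t)(carry + prev_hi);
}
```

In this model every column costs one right shift on the carry chain; the more 32-bit terms that can be added into a 64-bit lane before shifting, the fewer such shifts are needed per limb.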