I decided to play a bit with Neon, but instead of doing something hard like addmul_k, I wrote an mpn_popcount. :-)
The code runs well on A15 at about 0.56 c/l, but much worse on A9 at about 2.8 c/l. (The inner loop's heavy hammering on q8 is a problem on A9; alternating between q8 and q9 shaves off about 0.4 c/l. Still unimpressive.)

I am a novice at Neon hacking, so I am sure this can be improved in various ways. Specific questions:

* I completely ignore alignment. Is that bad?
* Can 32 bits be read into a dN register with zeroing of the other 32 bits? (See comment "surely we can read...".)
* Could one shave off an instruction in the final accumulation? We don't really need 64-bit accumulators.
* Can one read four 128-bit values using just one insn (for the inner loop)?
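
For anyone who wants something scalar to check an assembly routine against, here is a minimal C sketch. It is not the attached code and it is not GMP's implementation; the names ref_popcount and bytewise_popcount32 are made up for illustration, and 32-bit limbs are assumed. It keeps per-byte counts in a narrow accumulator and only widens every 31 limbs, which is the same idea as the remark above that 64-bit accumulators are not really needed. Whether that translates into a saved instruction in the Neon loop is a separate question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns x with each byte replaced by the number
   of set bits in that byte (0..8), similar in spirit to a byte-wise
   count instruction. */
static uint32_t bytewise_popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0f0f0f0fu;                 /* 8-bit sums */
    return x;
}

/* Hypothetical reference routine: population count of n 32-bit limbs.
   Assumes 32-bit limbs; names and interface are illustrative only. */
static uint64_t ref_popcount(const uint32_t *up, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;

    while (i < n) {
        /* At most 31 limbs per chunk: 31 * 8 = 248 fits in a byte lane,
           so the four byte-wide accumulators cannot overflow. */
        size_t chunk = (n - i < 31) ? (n - i) : 31;
        uint32_t lanes = 0;

        for (size_t j = 0; j < chunk; j++)
            lanes += bytewise_popcount32(up[i + j]);

        /* Horizontal add of the four byte lanes, then widen once. */
        lanes = (lanes & 0x00ff00ffu) + ((lanes >> 8) & 0x00ff00ffu);
        total += (lanes & 0xffffu) + (lanes >> 16);

        i += chunk;
    }
    return total;
}
```

Calling ref_popcount(up, n) on the same limb array as the assembly routine and comparing the two results is enough for a quick sanity check. The chunk size of 31 is specific to byte-wide lanes; wider lanes would allow longer runs between flushes.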
Attachment: arm-popcount.asm (binary data)
-- Torbjörn