ni...@lysator.liu.se (Niels Möller) writes:

> No speedup for addmul_1, unfortunately, but a saving for submul_1. Here are new versions of both files (for mpn/arm/v6).
I sometimes get better A9 performance with *discrete* pointer updates, not the one-out-of-four autoincrement pointer updates used here. I think the code you started with had that one-out-of-four trick for str, already?

> I wonder if this submul_1 complement trick is useful on some other platforms too, e.g., 64-bit sparc?

Possibly. This is a trick I actually realised many years ago, so it might very well already be used someplace in GMP. I had, on the other hand, not realised David's one's complement + pre-invert carry trick. I think that trick and this trick will result in the same loop insn count on most subtraction-challenged machines.

It is possible that this or similar tricks could be useful in other contexts, such as the 2/1 or 3/2 quotient approximation primitives.

> Running at 3.25 and 3.9 c/l on A9:

Cool! Looks like it is actually faster than 3.9 for some alignments/sizes.

Did you time this on some other CPU too?

I have new submul_1 code for A15 which runs at 2.75 c/l. It runs at 6.25 c/l on A9...

Fewer variants are always good, so it'd be nice if your code is faster everywhere.

To squeeze the last out of this code there are a few things you might want to try:

1. Use discrete ptr updates for up and/or rp.
2. Move the one-out-of-four autoincrement updates to other ldr/str insns.
3. Use ldm/stm. Often an A9 win.

--
Torbjörn
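
For readers following the thread, the complement idea for submul_1 can be illustrated in portable C. This is only a sketch of one plausible reading of the trick, not the assembly from the posted files: it assumes 32-bit limbs, and the names `limb_t`, `ref_submul_1`, `compl_submul_1` and `complement_trick_agrees` are made up for illustration. It relies on the identity r - u*v = ~(~r + u*v), so the loop only needs a multiply-accumulate plus two complements, and the carry out of the accumulation is the borrow out of the subtraction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* assumption: 32-bit limbs, as on 32-bit ARM */

/* Straightforward version: {rp,n} -= {up,n} * v, returns the borrow limb. */
static limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n >= 1);
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v + cy;   /* product plus carry-in */
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 32);
        limb_t r = rp[i];
        rp[i] = r - lo;                          /* wraps modulo 2^32 */
        cy = hi + (r < lo);                      /* propagate the borrow */
    }
    return cy;
}

/* Complement version: accumulate onto ~rp[i], then complement the low limb.
   The high limb of the accumulation is used directly as the next carry. */
static limb_t compl_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n >= 1);
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t)up[i] * v + (limb_t)~rp[i] + cy;
        rp[i] = ~(limb_t)t;
        cy = (limb_t)(t >> 32);
    }
    return cy;
}

/* Returns 1 if both routines agree on a small fixed operand set. */
int complement_trick_agrees(void)
{
    const limb_t u[4] = {0xffffffffu, 0x00000001u, 0x89abcdefu, 0u};
    limb_t a[4] = {0u, 0xffffffffu, 0x12345678u, 0x80000000u};
    limb_t b[4];
    for (size_t i = 0; i < 4; i++)
        b[i] = a[i];

    limb_t ca = ref_submul_1(a, u, 4, 0xfffffffeu);
    limb_t cb = compl_submul_1(b, u, 4, 0xfffffffeu);

    if (ca != cb)
        return 0;
    for (size_t i = 0; i < 4; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Calling `complement_trick_agrees()` compares the two loops on one hand-picked input. Whether the complement form saves instructions depends on the target; the sketch only shows that the two loops compute the same result and the same borrow.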