Richard Henderson <r...@twiddle.net> writes:

  Building on the copyi that tege committed the other day, use neon for
  the logical operations too.

  I did both a 128-bit aligned version,

  > $ ./speed-128 -p 1000000000 -C -s 10,50,100,500,1000,5000,10000 mpn_and_n mpn_nand_n
  > clock_gettime is 1.000ns accurate
  > overhead 6.00 cycles, precision 1000000000 units of 1.00e-09 secs, CPU freq 1694.10 MHz
  >           mpn_and_n   mpn_nand_n
  > 10          #1.7987       1.8986
  > 50          #0.9393       1.0692
  > 100         #1.2491       1.3890
  > 500         #0.8154       0.9753
  > 1000        #0.7786       0.9435
  > 5000        #1.4955       1.5765
  > 10000       #1.6532       1.7415

  and a 256-bit aligned version, just to see if having a higher ratio of
  operation insns to memory insns would help,

  > $ ./speed-256 -p 1000000000 -C -s 10,50,100,500,1000,5000,10000 mpn_and_n mpn_nand_n
  > clock_gettime is 1.000ns accurate
  > overhead 6.00 cycles, precision 1000000000 units of 1.00e-09 secs, CPU freq 1694.10 MHz
  >           mpn_and_n   mpn_nand_n
  > 10          #1.5989       1.6988
  > 50          #1.0992       1.1592
  > 100         #1.0393       1.0593
  > 500         #1.0373       1.0413
  > 1000        #1.0303       1.0313
  > 5000        #1.5914       1.6003
  > 10000        1.6824      #1.6768

  It's a bit curious how the later is less "jaggy", but slightly slower.

I assume you mean that the destination ptr is naturally aligned, while
the source ptrs are 32-bit aligned?
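To make "N-bit aligned" concrete: a pointer's alignment is just its address modulo a power of two. Below is a minimal C sketch of that check. It is not code from GMP or from the speed program, and the helper name offset_mod_pow2 is made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper, not part of GMP: byte offset of p within a
   block of 2^log2_bytes bytes.  A result of 0 means p is aligned to
   that boundary (log2_bytes = 4 for 128-bit, 5 for 256-bit). */
static unsigned
offset_mod_pow2 (const void *p, unsigned log2_bytes)
{
  /* Keep the shift below the width of uintptr_t. */
  assert (log2_bytes < sizeof (uintptr_t) * CHAR_BIT);
  uintptr_t mask = ((uintptr_t) 1 << log2_bytes) - 1;
  return (unsigned) ((uintptr_t) p & mask);
}
```

Assuming 4-byte limbs, successive limb pointers starting from a 32-byte aligned base give offsets 0, 4, 8, ..., 28 mod 32. So offset_mod_pow2 (p, 5) is 0 for one pointer in eight, while offset_mod_pow2 (p, 4) is 0 for one in four.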

My guess for the "jaggyness" is that, with two src ptrs, you rarely hit
a case where they are 256-bit aligned, and in particular not one where
both are 256-bit aligned.  That happens much more often for 128-bit
alignment.

My copy was alignment insensitive, perhaps thanks to scheduling, or
because it stresses the unaligned load logic less, with its one load
per store?

You can play with -x -y -w -W to force alignment.  They are for src1,
src2, dst1, dst2, respectively, IIRC.  0 would mean "aligned", except
that's not too well-defined.  1 means the pointer mod 2^something = 1,
etc.

-- 
Torbjörn