On 2013-03-08 03:46, Torbjorn Granlund wrote:
> I assume you mean that the destination ptrs are naturally aligned, while the source ptrs are 32-bit aligned?
Yes.
> My guess for the "jaggyness" is that of two src ptrs, you rarely strike a case where they are 256-bit aligned, in particular not when both are 256-bit aligned. But that happens much more often for 128-bit alignment. My copy was alignment insensitive, perhaps thanks to scheduling, or that it stresses the unaligned load logic less, with its one load-per-store?
I don't know. I do know there's something bizarre going on that probably needs some chip knowledge to figure out.
For instance, testing the -128 patch I posted here, and making no other change except *adding* :128 markers to both source operands, I hoped to determine what effect source alignment has on the loop. (This change is not generally correct, but does work for the case of speed with specified alignment.)
The peak result is slightly *slower* than before.

                 with align              without align
          mpn_and_n  mpn_nand_n     mpn_and_n  mpn_nand_n
     10     #1.7989      1.8987        1.7990      1.8989
     50     #0.9393      1.0693        0.9395      1.0694
    100     #1.2491      1.3891        1.2496      1.3893
    500     #0.8154      0.9753        0.8156      0.9756
   1000      0.8746      1.0642       #0.7787      0.9435
   5000     #1.4067      1.4939        1.5012      1.5577
  10000     #1.5454      1.6702        1.5521      1.5926