Benjamin Redelings I wrote:
Thanks for the information!

Here are several reasons (there are more) why gcc uses 64-bit loads by default:

1) For a single dot product, the rate of 64-bit data loads roughly balances the latency of adds to the same register. Parallel dot products (using 2 accumulators) would take advantage of faster 128-bit loads.
2) Run-time checks to adjust alignment, where possible, don't pay off for loop counts below about 40.
3) Several obsolete CPU architectures implemented 128-bit loads as pairs of 64-bit loads.
4) Before Barcelona, 64-bit loads were generally more efficient than movupd.

In the case you quote, with parallel dot products, 128-bit loads would be needed to show much performance gain over x87.
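
To make the single-accumulator versus two-accumulator distinction concrete, here is a minimal C sketch. It is not taken from gcc or from the code under discussion. The function names dot_single and dot_two_acc are hypothetical, and it reads "2 accumulators" as two independent partial sums of one dot product, which is an assumption about what was meant.

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: every add depends on the previous add to the same
   variable, so the loop tends to be limited by add latency rather than
   by how fast the data can be loaded. */
double dot_single(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Two accumulators (assumed reading of "parallel dot products"): s0 and s1
   form independent dependency chains. A vectorizing compiler may keep them
   in the two halves of one 128-bit register and fetch a[i], a[i+1] with a
   single 128-bit load; whether it does depends on target and flags. */
double dot_two_acc(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];         /* even elements */
        s1 += a[i + 1] * b[i + 1]; /* odd elements */
    }
    if (i < n)                     /* leftover element when n is odd */
        s0 += a[i] * b[i];
    return s0 + s1;
}
```

Both functions compute the same mathematical sum, but the second one adds the terms in a different order, so for general floating-point data the results can differ in the last bits. That reordering is also why gcc typically needs something like -ffast-math before it will make this transformation on its own.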
