Don wrote:
bearophile wrote:
Robert Jacques:

Yes, but the unaligned version is slower, even for aligned data.

This is true today, but in future it may become a little less true, thanks to improvements in the CPUs.

The problem is that the difference today is so extreme. On core2:
 movaps [mem128], xmm0; // aligned,   1 micro-op
 movups [mem128], xmm0; // unaligned, 9 micro-ops, even on aligned data!
In practice it's about an 8X speed difference!

On AMD K8, it's only 2 vs 5 ops, and on K10 it's 2 vs 3 ops.
On i7, movups on aligned data is the same speed as movaps. It's still slower if it's an unaligned access.

It all depends on how important you think performance on Core2 and earlier Intel processors is.
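
For illustration, one common way to deal with this is to check the pointer once and use the aligned store only when it is safe. The sketch below is in C with SSE intrinsics rather than raw assembly. It assumes an x86 target with SSE enabled, and the helper names is_aligned16 and fill_floats are made up for this example. _mm_store_ps normally compiles to movaps and _mm_storeu_ps to movups, though the compiler is free to choose otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set1_ps, _mm_store_ps, _mm_storeu_ps */

/* Hypothetical helper: returns 1 if p is 16-byte aligned, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: fills dst[0..n) with value.
   Returns 1 if the aligned store path was taken, 0 if the unaligned one was. */
static int fill_floats(float *dst, size_t n, float value)
{
    const __m128 v = _mm_set1_ps(value);
    const int aligned = is_aligned16(dst);
    size_t i = 0;

    if (aligned) {
        for (; i + 4 <= n; i += 4)
            _mm_store_ps(dst + i, v);    /* aligned store, normally movaps */
    } else {
        for (; i + 4 <= n; i += 4)
            _mm_storeu_ps(dst + i, v);   /* unaligned store, normally movups */
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        dst[i] = value;

    return aligned;
}
```

Since dst + i advances 16 bytes per iteration, checking the alignment of dst once is enough for the whole loop. Whether the extra branch pays off depends on the CPU: going by the numbers above, it matters most on Core2 and much less on K10 or i7.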

I wasn't aware of that. Here I was wondering why my SSE code was slower than the FPU in certain places on my core2 quad; I now recall using a lot of movups instructions. Thanks for the tip.
