On 11/12/12 09:45, Richard Biener wrote:
On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <a...@firstfloor.org> wrote:
Jan Hubicka <hubi...@ucw.cz> writes:

Note that I think Core has similar characteristics - at least for string
operations it fares well with unaligned accesses.

Nehalem and later have very fast unaligned vector loads. There's still some
penalty when they cross cache lines, however.

iirc the rule of thumb is to do unaligned for 128-bit vectors,
but avoid it for 256-bit vectors because the cache-line-crossing
penalty is larger on Sandy Bridge and crossings are more likely with
the larger vectors.
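
To make the "more likely with the larger vectors" point concrete, here is a minimal sketch that counts how many start offsets within a cache line make an access straddle two lines. The 64-byte line size and the helper names crosses_cache_line and crossing_offsets are assumptions for illustration, not anything taken from GCC or from the hardware manuals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size. 64 bytes is typical for the x86 cores
   discussed here, but it is hard-coded, not queried from the CPU. */
#define CACHE_LINE_BYTES 64u

/* Return 1 if an access of `size` bytes starting at `addr` touches
   two cache lines, 0 if it stays inside one. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_BYTES);
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}

/* Count how many of the CACHE_LINE_BYTES possible byte offsets make
   an access of `size` bytes cross a line boundary. */
static unsigned crossing_offsets(size_t size)
{
    unsigned count = 0;
    for (uintptr_t off = 0; off < CACHE_LINE_BYTES; off++)
        count += (unsigned)crosses_cache_line(off, size);
    return count;
}
```

Under these assumptions, a 16-byte access crosses a line for 15 of the 64 byte offsets and a 32-byte access for 31 of them, so a uniformly random byte offset roughly doubles the crossing rate when the vector width doubles. Real data is usually at least element-aligned, which changes the counts but not the trend.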

Yes, I think the rule was that using the unaligned instruction variants carries
no penalty when the actual access is aligned but that aligned accesses are
still faster than unaligned accesses.  Thus peeling for alignment _is_ a win.
I also seem to remember that the story for unaligned stores vs. unaligned loads
is usually different.
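
As a rough illustration of what "peeling for alignment" means, the sketch below splits a loop into a scalar prologue that runs until the pointer reaches a 16-byte boundary, a main loop whose accesses are then all aligned, and a scalar epilogue for the leftover elements. This is hand-written C under assumptions (128-bit vectors, 4-byte float, hypothetical names peel_count and scale); it is not the code the GCC vectorizer emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u                          /* assumed 128-bit vectors */
#define VEC_LANES (VEC_BYTES / sizeof(float))  /* floats per vector */

/* Number of leading elements to process in a scalar prologue so that
   the next element starts on a VEC_BYTES boundary (capped at n). */
static size_t peel_count(const float *p, size_t n)
{
    uintptr_t misalign = (uintptr_t)p % VEC_BYTES;
    assert(misalign % sizeof(float) == 0);  /* p must be float-aligned */
    size_t peel = misalign ? (VEC_BYTES - misalign) / sizeof(float) : 0;
    return peel < n ? peel : n;
}

/* Multiply every element of a[0..n) by k, with the loop peeled. */
static void scale(float *a, size_t n, float k)
{
    size_t i = 0;
    size_t peel = peel_count(a, n);

    /* Prologue: scalar iterations up to the alignment boundary. */
    for (; i < peel; i++)
        a[i] *= k;

    /* Main loop: each block starts at an aligned address, so a
       compiler is free to use aligned vector accesses here. */
    for (; i + VEC_LANES <= n; i += VEC_LANES)
        for (size_t j = 0; j < VEC_LANES; j++)
            a[i + j] *= k;

    /* Epilogue: remaining elements that do not fill a vector. */
    for (; i < n; i++)
        a[i] *= k;
}
```

With a single array the prologue can always reach alignment. With several arrays of differing misalignment only one of them can be aligned this way, and the others still need unaligned accesses, which is where the relative cost of unaligned loads and stores matters.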

Yes, it's generally the case that unaligned loads are slightly more expensive than unaligned stores, since the stores can often merge in a store buffer with little or no penalty.

R.

