On 12/11/2012 5:14 AM, Richard Earnshaw wrote:
On 11/12/12 09:56, Richard Biener wrote:
On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw <rearn...@arm.com> wrote:
On 11/12/12 09:45, Richard Biener wrote:

On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <a...@firstfloor.org> wrote:

Jan Hubicka <hubi...@ucw.cz> writes:

Note that I think Core has similar characteristics - at least for string
operations it fares well with unaligned accesses.


Nehalem and later have very fast unaligned vector loads. There's still
some penalty when they cross cache lines, however.

iirc the rule of thumb is to do unaligned for 128 bit vectors,
but avoid it for 256bit vectors because the cache line cross
penalty is larger on Sandy Bridge and more likely with the larger
vectors.


Yes, I think the rule was that using the unaligned instruction variants
carries no penalty when the actual access is aligned, but that aligned
accesses are still faster than unaligned accesses. Thus peeling for
alignment _is_ a win.
I also seem to remember that the story for unaligned stores vs. unaligned
loads is usually different.


Yes, it's generally the case that unaligned loads are slightly more
expensive than unaligned stores, since the stores can often merge in a store
buffer with little or no penalty.

It was the other way around on AMD CPUs AFAIK - unaligned stores forced
flushes of the store buffers.  Which is why the vectorizer first and
foremost tries to align stores.


In which case, which to align should be a question that the ME asks the BE.

R.


I see that this thread is no longer about ARM.
Yes, when peeling for alignment, aligned stores should take precedence over aligned loads. "ivy bridge" corei7-3 is supposed to have corrected the situation on "sandy bridge" corei7-2, where an unaligned 256-bit load is more expensive than explicitly split (128-bit) loads. There aren't yet any production multi-socket corei7-3 platforms.

It seems difficult to make the best decision between 128-bit unaligned access without peeling and 256-bit access with peeling for alignment (unless the loop count is known to be too small for the latter to come up to speed). The facilities that various compilers offer to let the programmer guide this choice are rather strange and probably not to be counted on.

In my experience, "westmere" unaligned 128-bit loads are more expensive than explicitly split (64-bit) loads, but the architecture manuals disagree with this finding. gcc already does a good job for corei7[-1] in such situations.
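
For readers who want to see what "peeling for alignment with stores taking precedence" looks like, here is a minimal scalar sketch in C. It is not what the gcc vectorizer emits; the helper names peel_count and saxpy_peeled, the 16-byte boundary, and the 4-float block width are assumptions chosen for illustration. The prologue runs scalar iterations until the store pointer is aligned, the main body then works in 16-byte blocks whose stores are aligned (the loads from src may still be unaligned), and the epilogue handles the remainder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading scalar iterations needed before
   dst reaches an `align`-byte boundary. `align` must be a power of two,
   and dst is assumed to be at least float-aligned. The result is capped
   at n so that short loops never overrun. */
static size_t peel_count(const float *dst, size_t n, size_t align)
{
    size_t misalign = (size_t)((uintptr_t)dst & (align - 1));
    if (misalign == 0)
        return 0;
    size_t peel = (align - misalign) / sizeof(float);
    return peel < n ? peel : n;
}

/* dst[i] += a * src[i], structured as prologue / aligned body / epilogue.
   Alignment is chosen for the store (dst), not for the load (src). */
static void saxpy_peeled(float *dst, const float *src, float a, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(dst, n, 16);

    /* Prologue: peeled scalar iterations until dst + i is 16-byte aligned. */
    for (; i < peel; i++)
        dst[i] += a * src[i];

    /* Main body: each block of 4 floats starts at an aligned dst address,
       so a vectorized version could use aligned stores here. */
    for (; i + 4 <= n; i += 4) {
        for (size_t j = 0; j < 4; j++)
            dst[i + j] += a * src[i + j];
    }

    /* Epilogue: leftover elements that do not fill a whole block. */
    for (; i < n; i++)
        dst[i] += a * src[i];
}
```

When the trip count is small, the prologue and epilogue can consume most of the iterations, which is the case mentioned above where the peeled wide-vector loop never comes up to speed.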

--
Tim Prince
