On 12/11/2012 5:14 AM, Richard Earnshaw wrote:
On 11/12/12 09:56, Richard Biener wrote:
On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw <rearn...@arm.com> wrote:
On 11/12/12 09:45, Richard Biener wrote:
On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <a...@firstfloor.org> wrote:
Jan Hubicka <hubi...@ucw.cz> writes:
Note that I think Core has similar characteristics - at least for
string operations it fares well with unaligned accesses.
Nehalem and later have very fast unaligned vector loads. There's still
some penalty when they cross cache lines, however.
iirc the rule of thumb is to do unaligned for 128 bit vectors,
but avoid it for 256-bit vectors because the cache line cross
penalty is larger on Sandy Bridge and more likely with the larger
vectors.
Yes, I think the rule was that using the unaligned instruction variants
carries no penalty when the actual access is aligned but that aligned
accesses are still faster than unaligned accesses. Thus peeling for
alignment _is_ a win.
I also seem to remember that the story for unaligned stores vs.
unaligned loads is usually different.
Yes, it's generally the case that unaligned loads are slightly more
expensive than unaligned stores, since the stores can often merge in
a store buffer with little or no penalty.
It was the other way around on AMD CPUs AFAIK - unaligned stores forced
flushes of the store buffers. Which is why the vectorizer first and
foremost tries to align stores.
In which case, which to align should be a question that the middle end
(ME) asks the back end (BE).
R.
I see that this thread is no longer about ARM.
Yes, when peeling for alignment, aligned stores should take precedence
over aligned loads.
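
As a rough illustration of peeling with the store side taking precedence,
here is a minimal C sketch. It is hand-written, not what the gcc vectorizer
actually emits. The names peel_count and scale_peeled are made up for this
example, 16-byte vectors of 4 floats are assumed, and dst is assumed to be
at least float-aligned. __builtin_assume_aligned is the GCC builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading scalar iterations needed before
   dst reaches a 16-byte boundary. Assumes dst is at least float-aligned,
   so the misalignment is a multiple of sizeof(float). */
static size_t peel_count(const float *dst, size_t n)
{
    uintptr_t misalign = (uintptr_t)dst & 15u;
    size_t peel = misalign ? (16u - misalign) / sizeof(float) : 0;
    return peel < n ? peel : n;
}

/* dst[i] = src[i] * k, peeled so that the main loop stores are 16-byte
   aligned. The loads from src may still be unaligned. */
static void scale_peeled(float *dst, const float *src, float k, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(dst, n);

    /* Prologue: scalar iterations until dst + i is 16-byte aligned. */
    for (; i < peel; i++)
        dst[i] = src[i] * k;

    /* Main loop: groups of 4 floats; the store address is aligned here. */
    for (; i + 4 <= n; i += 4) {
        float *d = __builtin_assume_aligned(dst + i, 16);
        for (size_t j = 0; j < 4; j++)
            d[j] = src[i + j] * k;
    }

    /* Epilogue: leftover elements. */
    for (; i < n; i++)
        dst[i] = src[i] * k;
}
```

Peeling on dst rather than src is the design choice under discussion: only
one of the two streams can be aligned by the prologue when their relative
misalignment differs, and this sketch spends it on the stores.
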
"ivy bridge" corei7-3 is supposed to have corrected the situation on
"sandy bridge" corei7-2 where unaligned 256-bit load is more expensive
than explicitly split (128-bit) loads. There aren't yet any production
multi-socket corei7-3 platforms.
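
To make "explicitly split" concrete, here is a small sketch with AVX
intrinsics showing the two forms side by side: one unaligned 256-bit load
versus two 128-bit loads combined into a 256-bit register. The function
names load8_unaligned and load8_split are invented for this example, and
the target attribute is used only so that the file builds without -mavx;
which form is faster depends on the CPU, as described above. gcc can make
the same choice itself via -mavx256-split-unaligned-load.

```c
#include <immintrin.h>

/* Single unaligned 256-bit load of 8 floats (the form reported as slow
   on "sandy bridge" when the address is not 32-byte aligned). */
__attribute__((target("avx")))
static void load8_unaligned(const float *p, float out[8])
{
    __m256 v = _mm256_loadu_ps(p);
    _mm256_storeu_ps(out, v);
}

/* Same 8 floats fetched as two 128-bit unaligned loads, then merged:
   the low half is widened and the high half inserted into lane 1. */
__attribute__((target("avx")))
static void load8_split(const float *p, float out[8])
{
    __m128 lo = _mm_loadu_ps(p);
    __m128 hi = _mm_loadu_ps(p + 4);
    __m256 v = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    _mm256_storeu_ps(out, v);
}
```

Both functions produce identical results; only the instruction sequence
for the load differs, so they can be swapped in a benchmark loop to
compare the two on a given machine.
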
It seems difficult to make the best decision between 128-bit unaligned
access without peeling and 256-bit access with peeling for alignment
(unless the loop count is known to be too small for the latter to come
up to speed). Facilities afforded by various compilers to allow the
programmer to guide this choice are rather strange and probably not to
be counted on.
In my experience, "westmere" unaligned 128-bit loads are more expensive
than explicitly split (64-bit) loads, but the architecture manuals
disagree with this finding. gcc already does a good job for corei7[-1]
in such situations.
--
Tim Prince