Re: RFC: [ARM] Disable peeling

Tim Prince Tue, 11 Dec 2012 04:26:51 -0800

On 12/11/2012 5:14 AM, Richard Earnshaw wrote:

On 11/12/12 09:56, Richard Biener wrote:
On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw <[email protected]> wrote:
On 11/12/12 09:45, Richard Biener wrote:
On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <[email protected]> wrote:
Jan Hubicka <[email protected]> writes:
Note that I think Core has similar characteristics - at least for string
operations it fares well with unaligned accesses.
Nehalem and later has very fast unaligned vector loads. There's still
some penalty when they cross cache lines however.

iirc the rule of thumb is to do unaligned for 128 bit vectors,
but avoid it for 256bit vectors because the cache line cross
penalty is larger on Sandy Bridge and more likely with the larger
vectors.
Yes, I think the rule was that using the unaligned instruction variants
carries no penalty when the actual access is aligned but that aligned
accesses are still faster than unaligned accesses. Thus peeling for
alignment _is_ a win.
I also seem to remember that the story for unaligned stores vs. unaligned
loads is usually different.
Yes, it's generally the case that unaligned loads are slightly more
expensive than unaligned stores, since the stores can often merge in a
store buffer with little or no penalty.
It was the other way around on AMD CPUs AFAIK - unaligned stores forced
flushes of the store buffers.  Which is why the vectorizer first and
foremost tries to align stores.
In which case, which to align should be a question that the ME asks the BE.
R.

I see that this thread is no longer about ARM.

Yes, when peeling for alignment, aligned stores should take precedence over aligned loads.

"ivy bridge" corei7-3 is supposed to have corrected the situation on "sandy bridge" corei7-2 where unaligned 256-bit load is more expensive than explicitly split (128-bit) loads. There aren't yet any production multi-socket corei7-3 platforms.

It seems difficult to make the best decision between 128-bit unaligned access without peeling and 256-bit access with peeling for alignment (unless the loop count is known to be too small for the latter to come up to speed). Facilities afforded by various compilers to allow the programmer to guide this choice are rather strange and probably not to be counted on.

In my experience, "westmere" unaligned 128-bit loads are more expensive than explicitly split (64-bit) loads, but the architecture manuals disagree with this finding. gcc already does a good job for corei7[-1] in such situations.
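
For readers unfamiliar with the term, here is a rough sketch in plain C of what "peeling for alignment, stores first" amounts to. It is only an illustration, not what gcc's vectorizer actually emits: the names peel_count and saxpy_peeled are made up, the 16-byte (128-bit) vector width is an assumption, and the inner "vector" body is written as a scalar loop that a compiler would be expected to vectorize. The prologue runs scalar iterations until the store pointer is aligned; the load pointer is left wherever it happens to fall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of scalar iterations to run before `p`
   reaches a vec_bytes boundary. Assumes vec_bytes is a power of two
   and that `p` is at least aligned to sizeof(float). */
static size_t peel_count(const float *p, size_t vec_bytes)
{
    assert((vec_bytes & (vec_bytes - 1)) == 0);
    uintptr_t misalign = (uintptr_t)p & (vec_bytes - 1);
    return misalign ? (vec_bytes - misalign) / sizeof(float) : 0;
}

/* Hypothetical example loop: dst[i] += a * src[i], peeled so that the
   store side (dst) is aligned in the main loop. */
static void saxpy_peeled(float *dst, const float *src, float a, size_t n)
{
    enum { VEC_BYTES = 16, LANES = VEC_BYTES / sizeof(float) };
    size_t i = 0;
    size_t peel = peel_count(dst, VEC_BYTES);
    if (peel > n)
        peel = n;

    /* Prologue: scalar iterations until dst + i is 16-byte aligned. */
    for (; i < peel; i++)
        dst[i] += a * src[i];

    /* Main loop: one vector's worth of elements per trip. The stores to
       dst + i are aligned; the loads from src + i may still be unaligned. */
    for (; i + LANES <= n; i += LANES)
        for (size_t k = 0; k < LANES; k++)
            dst[i + k] += a * src[i + k];

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; i < n; i++)
        dst[i] += a * src[i];
}
```

With a 256-bit target the same shape applies with VEC_BYTES set to 32, at the cost of a longer prologue (up to 7 floats instead of 3), which is part of why short loop counts favor the unpeeled 128-bit version.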


--
Tim Prince
