On 11 December 2012 13:26, Tim Prince <n...@aol.com> wrote:
> On 12/11/2012 5:14 AM, Richard Earnshaw wrote:
>> On 11/12/12 09:56, Richard Biener wrote:
>>> On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw <rearn...@arm.com> wrote:
>>>> On 11/12/12 09:45, Richard Biener wrote:
>>>>> On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <a...@firstfloor.org> wrote:
>>>>>> Jan Hubicka <hubi...@ucw.cz> writes:
>>>>>>
>>>>>>> Note that I think Core has similar characteristics - at least for
>>>>>>> string operations it fares well with unaligned accesses.
>>>>>>
>>>>>> Nehalem and later has very fast unaligned vector loads. There's still
>>>>>> some penalty when they cross cache lines however.
>>>>>>
>>>>>> iirc the rule of thumb is to do unaligned for 128 bit vectors,
>>>>>> but avoid it for 256bit vectors because the cache line cross
>>>>>> penalty is larger on Sandy Bridge and more likely with the larger
>>>>>> vectors.
>>>>>
>>>>> Yes, I think the rule was that using the unaligned instruction variants
>>>>> carries no penalty when the actual access is aligned but that aligned
>>>>> accesses are still faster than unaligned accesses. Thus peeling for
>>>>> alignment _is_ a win.
>>>>> I also seem to remember that the story for unaligned stores vs.
>>>>> unaligned loads is usually different.
>>>>
>>>> Yes, it's generally the case that unaligned loads are slightly more
>>>> expensive than unaligned stores, since the stores can often merge in a
>>>> store buffer with little or no penalty.
>>>
>>> It was the other way around on AMD CPUs AFAIK - unaligned stores forced
>>> flushes of the store buffers. Which is why the vectorizer first and
>>> foremost tries to align stores.
>>
>> In which case, which to align should be a question that the ME asks the
>> BE.
>>
>> R.
>
> I see that this thread is no longer about ARM.
> Yes, when peeling for alignment, aligned stores should take precedence over
> aligned loads.
> "ivy bridge" corei7-3 is supposed to have corrected the situation on "sandy
> bridge" corei7-2 where unaligned 256-bit load is more expensive than
> explicitly split (128-bit) loads. There aren't yet any production
> multi-socket corei7-3 platforms.
> It seems difficult to make the best decision between 128-bit unaligned
> access without peeling and 256-bit access with peeling for alignment (unless
> the loop count is known to be too small for the latter to come up to speed).
> Facilities afforded by various compilers to allow the programmer to guide
> this choice are rather strange and probably not to be counted on.
> In my experience, "westmere" unaligned 128-bit loads are more expensive than
> explicitly split (64-bit) loads, but the architecture manuals disagree with
> this finding. gcc already does a good job for corei7[-1] in such
> situations.
>
> --
> Tim Prince

Since this thread is also about x86 now, I have tried to look at how
things are implemented on this target.

People have mentioned nehalem, sandy bridge, ivy bridge and westmere. I
searched for occurrences of these strings in GCC, and I couldn't find
anything that would imply different behavior for unaligned loads on
128-bit versus 256-bit vectors.

Is it still unimplemented?

Thanks,

Christophe.
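
For readers following the quoted discussion, "peeling for alignment" with
stores given priority can be sketched roughly as below. This is only an
illustration in plain C, not what GCC's vectorizer emits: the names
peel_count, scale and VEC_BYTES are hypothetical, a 128-bit (16-byte)
vector width is assumed, and the inner per-lane loop merely stands in for
a single vector operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16  /* assumed vector width: 128 bits */

/* Hypothetical helper: how many leading elements must be handled one at
   a time before dst reaches a VEC_BYTES boundary (capped at n). */
static size_t peel_count(const float *dst, size_t n)
{
    /* Assumption: dst is at least float-aligned. */
    assert((uintptr_t)dst % sizeof(float) == 0);

    size_t misalign = (size_t)((uintptr_t)dst % VEC_BYTES);
    if (misalign == 0)
        return 0;

    size_t peel = (VEC_BYTES - misalign) / sizeof(float);
    return peel < n ? peel : n;
}

/* dst[i] = src[i] * k, with the loop peeled so that the stores to dst
   are aligned in the main loop; loads from src may remain unaligned. */
void scale(float *dst, const float *src, float k, size_t n)
{
    const size_t lanes = VEC_BYTES / sizeof(float);
    size_t peel = peel_count(dst, n);
    size_t i = 0;

    /* Prologue: scalar iterations until dst + i is aligned. */
    for (; i < peel; i++)
        dst[i] = src[i] * k;

    /* Main loop: one "vector" per iteration, aligned store to dst. */
    for (; i + lanes <= n; i += lanes)
        for (size_t j = 0; j < lanes; j++)
            dst[i + j] = src[i + j] * k;

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; i < n; i++)
        dst[i] = src[i] * k;
}
```

The sketch aligns the store side because, as noted in the quoted
messages, that is what the vectorizer tries first. Whether the remaining
unaligned loads are then cheap depends on the target, which is the
question raised above.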