On 11 December 2012 13:26, Tim Prince <n...@aol.com> wrote:
> On 12/11/2012 5:14 AM, Richard Earnshaw wrote:
>>
>> On 11/12/12 09:56, Richard Biener wrote:
>>>
>>> On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw <rearn...@arm.com>
>>> wrote:
>>>>
>>>> On 11/12/12 09:45, Richard Biener wrote:
>>>>>
>>>>>
>>>>> On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <a...@firstfloor.org>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Jan Hubicka <hubi...@ucw.cz> writes:
>>>>>>
>>>>>>> Note that I think Core has similar characteristics - at least for
>>>>>>> string operations it fares well with unaligned accesses.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nehalem and later have very fast unaligned vector loads. There's still
>>>>>> some penalty when they cross cache lines, however.
>>>>>>
>>>>>> iirc the rule of thumb is to do unaligned for 128-bit vectors,
>>>>>> but avoid it for 256-bit vectors because the cache line cross
>>>>>> penalty is larger on Sandy Bridge and more likely with the larger
>>>>>> vectors.
>>>>>
>>>>>
>>>>>
>>>>> Yes, I think the rule was that using the unaligned instruction variants
>>>>> carries no penalty when the actual access is aligned, but that aligned
>>>>> accesses are still faster than unaligned accesses.  Thus peeling for
>>>>> alignment _is_ a win.
>>>>> I also seem to remember that the story for unaligned stores vs.
>>>>> unaligned loads is usually different.
>>>>
>>>>
>>>>
>>>> Yes, it's generally the case that unaligned loads are slightly more
>>>> expensive than unaligned stores, since the stores can often merge in a
>>>> store buffer with little or no penalty.
>>>
>>>
>>> It was the other way around on AMD CPUs AFAIK - unaligned stores forced
>>> flushes of the store buffers.  Which is why the vectorizer first and
>>> foremost tries to align stores.
>>>
>>
>> In which case, which to align should be a question that the middle end (ME)
>> asks the back end (BE).
>>
>> R.
>>
>>
> I see that this thread is no longer about ARM.
> Yes, when peeling for alignment, aligned stores should take precedence over
> aligned loads.
> "ivy bridge" corei7-3 is supposed to have corrected the situation on "sandy
> bridge" corei7-2 where unaligned 256-bit load is more expensive than
> explicitly split (128-bit) loads.  There aren't yet any production
> multi-socket corei7-3 platforms.
> It seems difficult to make the best decision between 128-bit unaligned
> access without peeling and 256-bit access with peeling for alignment (unless
> the loop count is known to be too small for the latter to come up to speed).
> Facilities afforded by various compilers to allow the programmer to guide
> this choice are rather strange and probably not to be counted on.
> In my experience, "westmere" unaligned 128-bit loads are more expensive than
> explicitly split (64-bit) loads, but the architecture manuals disagree with
> this finding.  gcc already does a good job for corei7[-1] in such
> situations.
>
> --
> Tim Prince
>

Since this thread is also about x86 now, I have tried to look at how
things are implemented on this target.
People have mentioned nehalem, sandy bridge, ivy bridge and westmere;
I have searched for occurrences of these strings in GCC, and I
couldn't find anything that would imply different behavior wrt
unaligned loads on 128-bit vs. 256-bit vectors. Is it still unimplemented?

Thanks,

Christophe.
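
As an illustration of the remark quoted above that a cache line crossing is
"more likely with the larger vectors", here is a minimal arithmetic sketch.
It is not taken from GCC. The 64-byte line size, the 4-byte element size used
in the usage note, and the names crosses_line and count_crossings are all
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u  /* assumed cache line size in bytes */

/* Nonzero if an access of `size` bytes that starts `offset` bytes into a
   cache line spills over into the following line. */
static int crosses_line(size_t offset, size_t size)
{
    return (offset % CACHE_LINE) + size > CACHE_LINE;
}

/* Count the start offsets within one line (multiples of `elem` bytes) at
   which an access of `size` bytes crosses a line boundary. */
static unsigned count_crossings(size_t size, size_t elem)
{
    unsigned n = 0;

    assert(elem > 0);  /* a zero step would never terminate */
    for (size_t off = 0; off < CACHE_LINE; off += elem)
        n += crosses_line(off, size) ? 1u : 0u;
    return n;
}
```

With float-aligned start addresses there are 16 possible offsets per line:
count_crossings(16, 4) counts the crossing offsets for a 128-bit access and
count_crossings(32, 4) does the same for a 256-bit access, so the two results
can be compared directly. This says nothing about the size of the penalty
itself, which depends on the microarchitecture.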

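
For readers unfamiliar with the term, "peeling for alignment" with stores
taking precedence, as discussed in the quoted messages, can be sketched in
portable C as below. This is a hand-written illustration, not the code the
GCC vectorizer emits. The 16-byte vector width and the names VEC_BYTES,
peel_count and add_peeled are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u  /* assumed vector width: 128 bits */

/* Number of leading scalar iterations needed before `p` reaches a
   VEC_BYTES boundary, capped at n. Assumes p is aligned for float. */
static size_t peel_count(const float *p, size_t n)
{
    uintptr_t misalign = (uintptr_t)p % VEC_BYTES;
    size_t peel;

    assert((uintptr_t)p % sizeof(float) == 0);
    peel = misalign ? (VEC_BYTES - misalign) / sizeof(float) : 0;
    return peel < n ? peel : n;
}

/* dst[i] = a[i] + b[i], peeled so that the stores to dst are aligned in
   the main loop; the loads from a and b may still be unaligned there. */
static void add_peeled(float *dst, const float *a, const float *b, size_t n)
{
    const size_t lanes = VEC_BYTES / sizeof(float);
    size_t peel = peel_count(dst, n);
    size_t i = 0;

    /* Prologue: scalar iterations until dst + i is vector-aligned. */
    for (; i < peel; i++)
        dst[i] = a[i] + b[i];

    /* Main loop: one vector's worth of elements per outer iteration. */
    for (; i + lanes <= n; i += lanes)
        for (size_t k = 0; k < lanes; k++)
            dst[i + k] = a[i + k] + b[i + k];

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The peel count is derived from dst only, which mirrors the point made in the
thread that aligned stores should take precedence over aligned loads. Whether
the prologue pays off depends on the trip count and on the target.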