https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68928
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
Richard wrote:
> [...] avoid peeling for alignment on x86_64 and just use unaligned ops

Yeah, that's what clang does, and may be optimal. Certainly it's easy, and gives optimal performance when buffers *are* in fact aligned, even when the programmer has neglected to inform the compiler of any guarantee.

However, with vector sizes getting closer to the cache-line size, unaligned accesses will cross cache lines more of the time. (e.g. an AVX loop over an unaligned buffer will have a cacheline split on every other iteration). Iff we can *cheaply* avoid this, it may be worth it.

IIRC, all modern x86 / x86-64 CPUs have no penalty for unaligned loads, as long as they don't actually cross a cache-line boundary. (True for Intel since Nehalem). Store-forwarding doesn't work well if the stores don't line up with the loads, though.
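
To make the "every other iteration" arithmetic concrete, here is a rough sketch that counts how many vector-sized accesses straddle a cache-line boundary when walking a buffer from a given starting offset. The helper name `count_line_splits` and the 64-byte line size are assumptions for illustration only; nothing here comes from GCC itself, and it models addresses only, not actual load latency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is typical for current x86 CPUs. */
#define CACHE_LINE 64u

/* Hypothetical helper: count how many accesses of `vec_bytes` bytes,
   starting at address `start` and advancing by `vec_bytes` per iteration,
   touch two different cache lines. */
static size_t count_line_splits(uintptr_t start, size_t vec_bytes, size_t iters)
{
    assert(vec_bytes > 0 && vec_bytes <= CACHE_LINE);

    size_t splits = 0;
    for (size_t i = 0; i < iters; i++) {
        uintptr_t first = start + i * vec_bytes;   /* first byte accessed */
        uintptr_t last  = first + vec_bytes - 1;   /* last byte accessed  */
        if (first / CACHE_LINE != last / CACHE_LINE)
            splits++;                              /* access spans two lines */
    }
    return splits;
}
```

With a buffer that starts 4 bytes past a line boundary, `count_line_splits(4, 32, 100)` returns 50 (32-byte AVX vectors split on every other iteration), `count_line_splits(4, 16, 100)` returns 25 (16-byte SSE vectors split one time in four), and `count_line_splits(4, 64, 100)` returns 100 (a 64-byte vector splits every time). An aligned start such as `count_line_splits(0, 32, 100)` returns 0.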