https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
Yann Collet <yann.collet.73 at gmail dot com> changed:

           What            |Removed |Added
----------------------------------------------------------------------------
                         CC|        |yann.collet.73 at gmail dot com

--- Comment #9 from Yann Collet <yann.collet.73 at gmail dot com> ---
While the issue can easily be fixed from an LZ4 perspective, the main topic here is to analyze a GCC 4.9+ vectorizer choice.

The piece of code it tries to optimize can be summarized as follows (once all the surrounding details are removed):

    static void LZ4_copy8(void* dstPtr, const void* srcPtr)
    {
        *(U64*)dstPtr = *(U64*)srcPtr;
    }

Pretty simple. Let's assume for the rest of this post that both pointers are correctly aligned for 8-byte access, so that is not a problem.

Looking at the generated assembly, we see that GCC emits a MOVDQA instruction for it:

> movdqa (%rdi,%rax,1),%xmm0
> $rdi=0x7fffea4b53e6
> $rax=0x0

This seems wrong on two levels:

- The function only wants to copy 8 bytes. MOVDQA works on a full SSE register, which is 16 bytes. This spells trouble, if only for buffer boundary checks: the algorithm copies 8 bytes because it knows it can safely read/write that much without crossing buffer limits. With 16 bytes, there is no such guarantee.

- MOVDQA requires its memory operand to be aligned. I read that as SSE-size aligned, which means 16-byte aligned. But these pointers are not: they are only supposed to be 8-byte aligned.

(A bit off topic, but from a general perspective, I don't understand the use of MOVDQA, which imposes such a strong alignment requirement, when MOVDQU is also available: it works at any memory address and suffers no performance penalty on aligned addresses. MOVDQU looks like the better choice in every circumstance.)

Anyway, the core of the issue remains the one above: this is just an 8-byte copy operation, and replacing it with a 16-byte one looks suspicious. It may deserve a look.
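As an aside, here is a minimal sketch of one way the copy could be expressed so the compiler cannot assume more than the code actually guarantees. This is only an illustration on my side, not necessarily the fix applied in LZ4, and the function name LZ4_copy8_alt is made up for the sketch:

    #include <string.h>

    /* Sketch only: a fixed-size memcpy says "copy exactly 8 bytes" without
     * letting the compiler derive any 16-byte alignment assumption from the
     * pointer type, and without an aliasing-unsafe cast. At -O2, GCC compiles
     * this to a single 8-byte load/store on x86-64. */
    static void LZ4_copy8_alt(void* dstPtr, const void* srcPtr)
    {
        memcpy(dstPtr, srcPtr, 8);
    }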
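And, to illustrate the MOVDQA/MOVDQU remark with SSE2 intrinsics (function names are mine, for the example only): _mm_loadu_si128 maps to MOVDQU and accepts any address, while _mm_load_si128 maps to MOVDQA and faults if the address is not 16-byte aligned.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    static __m128i load16_any(const void* p)
    {
        return _mm_loadu_si128((const __m128i*)p);   /* MOVDQU: no alignment requirement */
    }

    static __m128i load16_aligned(const void* p)
    {
        return _mm_load_si128((const __m128i*)p);    /* MOVDQA: requires 16-byte alignment */
    }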