https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
Yann Collet <yann.collet.73 at gmail dot com> changed:

           What            |Removed |Added
----------------------------------------------------------------------------
                         CC|        |yann.collet.73 at gmail dot com

--- Comment #9 from Yann Collet <yann.collet.73 at gmail dot com> ---
While the issue can easily be fixed from an LZ4 perspective, the main topic here is to analyze a GCC 4.9+ vectorizer choice.

The piece of code it tries to optimize can be summarized as follows (once all the surrounding details are removed):

    static void LZ4_copy8(void* dstPtr, const void* srcPtr)
    {
        *(U64*)dstPtr = *(U64*)srcPtr;
    }

Pretty simple. Let's assume for the rest of this post that both pointers are correctly aligned for 8-byte access, so that is not a problem.

Looking at the generated assembly, we see that GCC emits a MOVDQA instruction for it:

> movdqa (%rdi,%rax,1),%xmm0
> $rdi=0x7fffea4b53e6
> $rax=0x0

This seems wrong on two levels:

- The function only wants to copy 8 bytes. MOVDQA works on a full SSE register, which is 16 bytes. This spells trouble, if only for buffer boundary checks: the algorithm copies 8 bytes because it knows it can safely read/write that much without crossing buffer limits. With 16 bytes, there is no such guarantee.

- MOVDQA requires its memory operand to be aligned. I read that as SSE-size aligned, which means 16-byte aligned. But these pointers are not: they are only supposed to be 8-byte aligned.

(A bit off topic, but from a general perspective, I don't understand the use of MOVDQA, which imposes such a strong alignment requirement, when MOVDQU is also available: it works at any memory address and suffers no performance penalty on aligned addresses. MOVDQU looks like the better choice in every circumstance.)

Anyway, the core of the issue remains the one above: this is just an 8-byte copy operation, and replacing it with a 16-byte one looks suspicious. It may deserve a look.
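As an aside, here is a minimal sketch of one way the copy could be expressed so the compiler cannot assume more than the code actually guarantees. This is only an illustration on my side, not necessarily the fix applied in LZ4, and the function name LZ4_copy8_alt is made up for the sketch:

    #include <string.h>

    /* Sketch only: a fixed-size memcpy says "copy exactly 8 bytes" without
     * letting the compiler derive any 16-byte alignment assumption from the
     * pointer type, and without an aliasing-unsafe cast. At -O2, GCC compiles
     * this to a single 8-byte load/store on x86-64. */
    static void LZ4_copy8_alt(void* dstPtr, const void* srcPtr)
    {
        memcpy(dstPtr, srcPtr, 8);
    }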
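And, to illustrate the MOVDQA/MOVDQU remark with SSE2 intrinsics (function names are mine, for the example only): _mm_loadu_si128 maps to MOVDQU and accepts any address, while _mm_load_si128 maps to MOVDQA and faults if the address is not 16-byte aligned.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    static __m128i load16_any(const void* p)
    {
        return _mm_loadu_si128((const __m128i*)p);   /* MOVDQU: no alignment requirement */
    }

    static __m128i load16_aligned(const void* p)
    {
        return _mm_load_si128((const __m128i*)p);    /* MOVDQA: requires 16-byte alignment */
    }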