https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97482
Bug ID: 97482
Summary: Optimized (-O3) XMM register load incorrectly uses movdqu
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: vkkerrata at gmail dot com
Target Milestone: ---
Created attachment 49396
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49396&action=edit
preprocessed f.c
The code pasted below reproduces this bug. f.c contains a commented-out line
that can replace the builtin function call; it exhibits the same bug. I have
only encountered this bug with -O3. At lower optimization levels, the load of
two 64-bit values into a 128-bit register is a two-step process, with movq and
movhps instructions each handling one 64-bit half. With gcc 10.1.0 at -O3,
this sequence can instead be replaced by a single movdqu, which loads the 16
bytes in memory order and so puts the halves in "backwards" from what's
intended: in[0] ends up in the low half and in[1] in the high half.
Because the optimizer doesn't always choose movdqu, the issue may disappear
with seemingly unrelated changes. The code provided below is in two files
because I was unable to create a reproducer inside a single translation unit.
System Type: Linux, Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
Build Options: Default; package installed from ppa:ubuntu-toolchain-r/test
Compile Line: gcc-10 main.c f.c -O3 -save-temps -o movdqu-bug
Compiler Output: None ($? == 0)
$ gcc-10 -v
Using built-in specs.
COLLECT_GCC=gcc-10
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
10.1.0-2ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-10
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-10-eDoCEC/gcc-10-10.1.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-10-eDoCEC/gcc-10-10.1.0/debian/tmp-gcn/usr,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.1.0 (Ubuntu 10.1.0-2ubuntu1~18.04)
$ cat f.c
#include <emmintrin.h>
#include <stdint.h>
uint64_t
f(const uint64_t *in)
{
    // load two 64-bit halves
    // bug: incorrect use of movdqu under -O3
    // Both versions below do the wrong thing.
    __m128i x = _mm_set_epi64x(in[0], in[1]);
    //__m128i x = {in[1], in[0]};
    // permute to illustrate change
    x = _mm_shuffle_epi32(x, _MM_SHUFFLE(1,2,3,0));
    // extract and return the low 64 bits
    return _mm_cvtsi128_si64x(x);
}
$ cat main.c
#include <inttypes.h>
#include <stdio.h>
uint64_t f(const uint64_t *);
int
main(void)
{
    // correct output: 4444444411111111
    // bug output: 2222222233333333
    uint64_t vec[2] = { 0x4444444433333333, 0x2222222211111111 };
    printf("%016"PRIx64"\n", f(vec));
    return 0;
}
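
The two values quoted in the main.c comments can be derived without SSE by
modelling the register as four 32-bit lanes. The sketch below is only an
illustration and not part of the reproducer; the helper names shuffle_low64,
expected_f and miscompiled_f are hypothetical. It assumes the documented
behaviour of the intrinsics: _mm_set_epi64x(hi, lo) places its second argument
in the low half, and _MM_SHUFFLE(1,2,3,0) selects source lane 0 for result
lane 0 and source lane 3 for result lane 1.

```c
#include <stdint.h>

/* Scalar model of the low 64 bits of
   _mm_shuffle_epi32(x, _MM_SHUFFLE(1,2,3,0)),
   where x has low half `lo` and high half `hi`.
   Result lane 0 = source lane 0, result lane 1 = source lane 3. */
static uint64_t shuffle_low64(uint64_t lo, uint64_t hi)
{
    uint64_t lane0 = lo & 0xffffffffu;  /* source lane 0: low 32 bits of lo */
    uint64_t lane3 = hi >> 32;          /* source lane 3: high 32 bits of hi */
    return (lane3 << 32) | lane0;
}

/* Intended semantics: _mm_set_epi64x(in[0], in[1]) puts in[1] in the
   low half and in[0] in the high half. */
static uint64_t expected_f(const uint64_t *in)
{
    return shuffle_low64(in[1], in[0]);
}

/* Model of the reported miscompile: a plain 16-byte load in memory order
   puts in[0] in the low half and in[1] in the high half. */
static uint64_t miscompiled_f(const uint64_t *in)
{
    return shuffle_low64(in[0], in[1]);
}
```

Calling both helpers on the vec array from main.c gives 0x4444444411111111 for
expected_f and 0x2222222233333333 for miscompiled_f, which matches the
"correct output" and "bug output" comments above.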