[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 Jakub Jelinek changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |DUPLICATE --- Comment #9 from Jakub Jelinek --- Actually managed to reproduce today using godbolt's -E output. This is something that got fixed with r247104 on the trunk and backported to 7.2 as r248724. So I believe this is fixed. The choice to emit __atomic_load_16 calls is intentional, see the overlong thread about it from the last week or so on gcc at gcc.gnu.org mailing list. *** This bug has been marked as a duplicate of bug 80293 ***
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 Richard Biener changed: What|Removed |Added Target Milestone|7.3 |7.4 --- Comment #8 from Richard Biener --- GCC 7.3 is being released, adjusting target milestone.
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 Matt Godbolt changed: What|Removed |Added CC||matt at godbolt dot org --- Comment #7 from Matt Godbolt --- The pedigree of the GCC 7.1 is as Peter says. The compiler isn't bootstrapped, and uses an in-tree binutils, should that be important. Other pertinent info from gcc -v Configured with: ../gcc-7.1.0/configure --prefix /root/staging --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-multilib --disable-bootstrap --disable-multiarch --with-arch-32=i586 --enable-clocale=gnu --enable-languages=c,c++,go,fortran --enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto --enable-plugins --enable-threads=posix --with-pkgversion=GCC-Explorer-Build Thread model: posix gcc version 7.1.0 (GCC-Explorer-Build) GNU C++11 (GCC-Explorer-Build) version 7.1.0 (x86_64-linux-gnu) compiled by GNU C version 5.4.0 20160609, GMP version 6.1.0, MPFR version 3.1.4, MPC version 1.0.3, isl version isl-0.16.1-GMP GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072 The binaries used on CE are here: https://s3.amazonaws.com/compiler-explorer/opt/gcc-7.1.0.tar.xz
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #6 from Peter Cordes --- (In reply to Jakub Jelinek from comment #4) > But have just tried gcc 7.1.0 release and can't reproduce even there. Matt says the Compiler Explorer backend uses upstream release tarballs like `URL=ftp://ftp.gnu.org/gnu/gcc/gcc-${VERSION}/${TARBALL}`. (where TARBALL is `gcc-${VERSION}.tar.xz` for recent gcc where .xz is available). The compiler config used is https://github.com/mattgodbolt/compiler-explorer-image/blob/master/gcc/build/build.sh#L78: CONFIG="" CONFIG+=" --build=x86_64-linux-gnu" CONFIG+=" --host=x86_64-linux-gnu" CONFIG+=" --target=x86_64-linux-gnu" CONFIG+=" --disable-bootstrap" CONFIG+=" --enable-multiarch" CONFIG+=" --with-abi=m64" CONFIG+=" --with-multilib-list=m32,m64,mx32" CONFIG+=" --enable-multilib" CONFIG+=" --enable-clocale=gnu" CONFIG+=" --enable-languages=c,c++,fortran" # used to have go, but is incompatible with m32/mx32 CONFIG+=" --enable-ld=yes" CONFIG+=" --enable-gold=yes" CONFIG+=" --enable-libstdcxx-debug" CONFIG+=" --enable-libstdcxx-time=yes" CONFIG+=" --enable-linker-build-id" CONFIG+=" --enable-lto" CONFIG+=" --enable-plugins" CONFIG+=" --enable-threads=posix" CONFIG+=" --with-pkgversion=GCC-Explorer-Build" BINUTILS_VERSION=2.29.1 Does that help figure out how to build a gcc7.1.0 that can repro this?
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #5 from Peter Cordes --- (In reply to Jakub Jelinek from comment #4) > Can't reproduce. It is true that we now emit the __atomic_load_16 call, but > that was intentional change Yup. >, and it can't be easily tail call, because the > tailcall pass doesn't understand that the low 8 bytes of the 16 byte > structure are returned the same as the whole structure Ok that's disappointing, but hopefully is very rare after inlining. > But I certainly can't reproduce any significant value masking etc., tried > r235002 (+- gcc 6 branchpoint), r247000 (+- gcc 7 branchpoint) as well as > current trunk. > Unless it is something that has been broken on the 7 branch and later fixed. > > But have just tried gcc 7.1.0 release and can't reproduce even there. I can't repro it locally with gcc7.1.1 either. This is the version info from -fverbose-asm on the godbolt.org link (which does still repro it) # GNU C++11 (GCC-Explorer-Build) version 7.1.0 (x86_64-linux-gnu) # compiled by GNU C version 5.4.0 20160609, GMP version 6.1.0, MPFR version 3.1.4, MPC version 1.0.3, isl version isl-0.16.1-GMP It's not present in the gcc7.2 build on Godbolt.org either. I asked Matt Godbolt what exact version the compiler explorer site is using for the gcc7.1.0 dropdown (https://github.com/mattgodbolt/compiler-explorer/issues/684). Hopefully he can help us track down a gcc SVN revision to repro it, or confirm that it was a misconfigured or buggy gcc or something. Just to rule out the possibility of a now-dormant bug lurking somewhere.
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #4 from Jakub Jelinek --- Can't reproduce. It is true that we now emit the __atomic_load_16 call, but that was intentional change, and it can't be easily tail call, because the tailcall pass doesn't understand that the low 8 bytes of the 16 byte structure are returned the same as the whole structure; I think that is pretty obscure detail of the ABI, so I think it isn't worth teaching the tailcall pass about such arch specific complexities. But I certainly can't reproduce any significant value masking etc., tried r235002 (+- gcc 6 branchpoint), r247000 (+- gcc 7 branchpoint) as well as current trunk. Unless it is something that has been broken on the 7 branch and later fixed. But have just tried gcc 7.1.0 release and can't reproduce even there.
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #3 from Peter Cordes --- Seems to be fixed in gcc7.2.0: https://godbolt.org/g/jRwtZN gcc7.2 is fine with -m32, -mx32, and -m64, but x32 is the most compact. -m64 just calls __atomic_load_16 gcc7.2 -O3 -mx32 output: follow_nounion(std::atomic*): movq(%edi), %rax movl%eax, %eax ret vs. gcc7.1 -O3 -mx32 follow_nounion(std::atomic*): movq(%edi), %rcx xorl%edx, %edx movzbl %ch, %eax movb%cl, %dl movq%rcx, %rsi movb%al, %dh andl$16711680, %esi andl$4278190080, %ecx movzwl %dx, %eax orq %rsi, %rax orq %rcx, %rax ret --- gcc7.2 -O3 -m64 just forwards its arg to __atomic_load_16 and then returns: follow_nounion(std::atomic*): subq$8, %rsp movl$2, %esi call__atomic_load_16 addq$8, %rsp ret It unfortunately doesn't optimize the tail-call to movl$2, %esi jmp __atomic_load_16 presumably because it hasn't realized early enough that it takes zero instructions to extract the 8-byte low half of the 16-byte __atomic_load_16 return value.
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 Richard Biener changed: What|Removed |Added Target Milestone|7.2 |7.3
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 Richard Biener changed: What|Removed |Added Target Milestone|7.2 |7.3 --- Comment #3 from Richard Biener --- GCC 7.2 is being released, adjusting target milestone. --- Comment #4 from Richard Biener --- GCC 7.2 is being released, adjusting target milestone.
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 Richard Biener changed: What|Removed |Added Target Milestone|7.2 |7.3 --- Comment #3 from Richard Biener --- GCC 7.2 is being released, adjusting target milestone. --- Comment #4 from Richard Biener --- GCC 7.2 is being released, adjusting target milestone.
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #2 from Peter Cordes --- (In reply to Richard Biener from comment #1) > GCC 8 generates a __atomic_load_16 call for me while GCC 6 does > > lock cmpxchg16b (%rdi) That's expected. See https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html for the reason. (Apparently the plan is for gcc7 to call libatomic, to make it possible to change the implementation sooner in the future if that's deemed appropriate, even in ways that breaks compat with code that inlines lock cmpxchg16b.) The bug here is that gcc7.1.0 goes berserk after the library call, but gcc8 doesn't. I haven't tested on any other gcc7 versions.
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 Richard Biener changed: What|Removed |Added Priority|P3 |P2 Status|UNCONFIRMED |NEW Last reconfirmed||2017-05-23 Target Milestone|--- |7.2 Summary|[7.1.0 regression] x86 |[7/8 regression] x86 |accessing a member of a |accessing a member of a |16-byte atomic object |16-byte atomic object |generates terrible code:|generates terrible code: |splitting/merging the bytes |splitting/merging the bytes Ever confirmed|0 |1 --- Comment #1 from Richard Biener --- GCC 8 generates a __atomic_load_16 call for me while GCC 6 does lock cmpxchg16b (%rdi)