https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113468
Bug ID: 113468 Summary: copy of large struct violates semantics of 'volatile' Product: gcc Version: 13.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: gnu at kosak dot com Target Milestone: --- Created attachment 57132 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57132&action=edit the .ii file from --save-temps Hello, I'm not 100% certain, but I'm pretty sure that the assembly generated for this code violates the semantics for 'volatile'. The reason I think so is that it reads and writes the same memory locations more than once, and also reads and writes them out of order. I believe that both of these actions violate the rules for 'volatile'. ``` struct Foo { volatile int values[256]; }; void copy(const Foo *src, Foo *dest) { *dest = *src; } ``` Relevant assembly on x86_64, invoked with g++ -S -O3 test.cc ``` _Z4copyPK3FooPS_: .LFB0: .cfi_startproc endbr64 movq (%rdi), %rdx movq %rdi, %rax movq %rsi, %rcx movq %rdx, (%rsi) movq 1016(%rdi), %rdx leaq 8(%rsi), %rdi andq $-8, %rdi subq %rdi, %rcx movq %rdx, 1016(%rsi) subq %rcx, %rax addl $1024, %ecx movq %rax, %rsi shrl $3, %ecx rep movsq ret .cfi_endproc ``` Analysis: The compiler wants to use "rep movsq", but Foo is only aligned to 4 bytes, so the generated code takes extra pains to make sure that the destination register rdi is aligned to an 8 byte boundary so that "rep movsq" can be efficient. The trick used is to first move the first and last quadwords (regardless of alignment), and then do a "rep movsq" of the interior, but with care taken so that the destination is aligned to an 8-byte boundary. There is a 2x2 matrix of cases here: src will be aligned to either 4 or 8; likewise dest will be aligned to 4 or 8. To illustrate the problem, consider when src and dest are aligned to 4 but not 8. For the sake of the example, assume that src is 0x50000004 and dest is 0x60000004 First, the code moves a quadword from [0x50000004] to [0x60000004]. Call this [FIRST STEP] because we reference it below. Then it moves a quadword from [0x50000004+1016] to [0x60000004+1016]. To my mind this already is a problem because it violates the ordering assumptions of 'volatile'. But that's not the only problem. Finally the code does a bit of arithmetic and masking, ultimately ending up with rsi = 0x50000008 and rdi = 0x60000008 and then it does the "rep movsq". My main objection is that this is a partially overlapping re-read (and re-write) of one-half of the same quadword that was copied in [FIRST STEP], which I believe violates the rules for volatile. If Foo were a bank of hardware registers this might be an unwelcome access pattern. Similar violations happen for the other cases in the 2x2 matrix. Apologies if I'm wrong about the above. I'm certainly not a 'volatile' expert but this strikes me as rather hinky. I would stress that I have no objection to the code for the non-volatile case. Output from g++ -v: ``` Using built-in specs. COLLECT_GCC=g++ COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-linux-gnu/13/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 13.2.0-4ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-13 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-13-XYspKM/gcc-13-13.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-13-XYspKM/gcc-13-13.2.0/debian/tmp-gcn/usr --enable-offload-defaulted --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2 Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3) ```