https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116383

            Bug ID: 116383
           Summary: Value from __atomic_store not forwarded to non-atomic
                    load at same address
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/1bbjoc87n

int test(int* i, int val) {
    __atomic_store_n(i, val, __ATOMIC_RELAXED);
    return *i;
}

The non-atomic load should be able to directly use the value stored by the
atomic store, but instead GCC issues a new load:

        mov     DWORD PTR [rdi], esi
        mov     eax, DWORD PTR [rdi]
        ret

Clang recognizes that the load is unnecessary and propagates the value:

        mov     eax, esi
        mov     dword ptr [rdi], esi
        ret

In addition to simply being an unnecessary load, there is an additional
penalty on most CPUs from reading a value that is still in the CPU's store
buffer, which it almost certainly would be in this case. And of course this also
blocks further optimizations, e.g. DSE and value propagation, in cases where the
compiler knows something about the stored value.

void blocking_further_optimizations(int* i) {
    if (test(i, 1) == 0) {
        __builtin_abort();
    }
}

generates the following with GCC:

        mov     DWORD PTR [rdi], 1
        mov     edx, DWORD PTR [rdi]
        test    edx, edx
        je      .L5
        ret
blocking_further_optimizations(int*) [clone .cold]:
.L5:
        push    rax
        call    abort

and this much better output with Clang:

        mov     dword ptr [rdi], 1
        ret

While I'm using a relaxed store here to show that GCC doesn't apply the
optimization even in that case, I think the optimization should apply regardless
of memory ordering (and Clang seems to agree). Also, while the minimal example
code is contrived, there are several real-world use cases where this pattern
comes up. I would expect it wherever there is a single writer thread and many
reader threads: the writes and off-thread reads need to use __atomic operations
to avoid data races, but on-thread reads are safe using ordinary loads, and you
would want them to be optimized as much as possible. A rough sketch of that
pattern follows below.
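As a rough illustration (hypothetical type and function names, not taken from a
real code base), the single-writer pattern might look like this, where bump()
runs only on the writer thread and peek() runs on any other thread:

struct counter { int value; };

/* Writer thread only: plain reads of value are fine on this thread, but the
   store must be atomic so concurrent peek() calls don't race. Ideally the
   compiler forwards 'next' to the plain read in the return statement instead
   of reloading from memory. */
int bump(struct counter* c) {
    int next = c->value + 1;
    __atomic_store_n(&c->value, next, __ATOMIC_RELAXED);
    return c->value;
}

/* Any reader thread. */
int peek(const struct counter* c) {
    return __atomic_load_n(&c->value, __ATOMIC_RELAXED);
}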
