https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
Thanks for correcting my mistake in tagging this bug, but this got me thinking
it's not just a C++ issue.

This also applies to GNU C __atomic_load_n(), and ISO C11 stdatomic code like

#include <stdatomic.h>
#include <stdint.h>
uint32_t load(atomic_uint_fast64_t *p) {  // https://godbolt.org/g/CXuiPO
  return *p;   // only the low 32 bits are actually needed
}
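
The GNU C builtin version looks the same (my sketch, same headers as above;
the function name is made up):

uint32_t load_builtin(uint64_t *p) {
  // Only the low 32 bits of the 8-byte atomic load are actually used.
  return __atomic_load_n(p, __ATOMIC_RELAXED);
}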

With -m32, it's definitely useful to only do a 32-bit load.  With -m64, that
might let us fold the load as a memory operand, like `add (%rdi), %eax` or
something.

In 64-bit code, it just compiles to movq (%rdi), %rax; ret.  But if we'd wanted
the high half, it would have been a load and shift instead of just a 32-bit
load.  Or shift+zero-extend if we'd wanted bytes 1 through 4 with (*p) >> 8.
We know that an atomic 64-bit load of *p doesn't cross a cache-line boundary
(otherwise it wouldn't be atomic), so neither will any narrower load that's
fully contained within it.
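
For concreteness, these are the cases I mean (my examples, not in the
original testcase); with the proposed narrowing each one could become a
single 32-bit load from the right byte offset:

#include <stdatomic.h>
#include <stdint.h>

uint32_t load_high(atomic_uint_fast64_t *p) {
  return *p >> 32;    // could be a 32-bit load from byte offset 4
                      // instead of movq + shr
}

uint32_t load_mid(atomic_uint_fast64_t *p) {
  return *p >> 8;     // bytes 1..4: a 32-bit load at byte offset 1 stays
                      // inside the aligned 8-byte object, so it can't
                      // split a cache line either
}

uint32_t add_low(atomic_uint_fast64_t *p, uint32_t x) {
  return x + *p;      // -m64: the narrowed load could fold into something
                      // like  addl (%rdi), %eax
}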

This optimization should be disabled for `volatile`, if that's supposed to make
stdatomic usable for MMIO registers.
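
Something like this (my sketch, same headers as above) should keep the full
64-bit access:

uint32_t load_mmio(volatile atomic_uint_fast64_t *p) {
  return *p;   // must stay one full-width access if volatile atomics are
               // meant to be usable for MMIO, even though only the low 32
               // bits are used
}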

------------------

It's common to use a load as setup for a CAS.  For cmpxchg8b/cmpxchg16b, it's
most efficient to do that setup with two separate narrower loads.  gcc even
does this for us with code like *p |= 3, but we can't get that behaviour if we
write a CAS loop ourselves.  This makes the un-contended case slower, even when
the extra cost is just an xmm->int delay for an 8-byte load rather than a
function call + CMPXCHG16B.


void cas_compiler(atomic_uint_fast64_t *p) {
  *p |= 3;  // separate 32-bit loads before loop
}
        #gcc8 snapshot 20170522 -m32, and clang does the same
        ... intro, including pushing 4 regs, two of which aren't used :(
        movl    (%esi), %eax    #* p, tmp90
        movl    4(%esi), %edx   #,
        ... (loop with MOV, OR, and CMPXCHG8B)


void cas_explicit(atomic_uint_fast64_t *p) {
  // AFAIK, it would be data-race UB to cast to regular uint64_t* for a
  // non-atomic load
  uint_fast64_t expect = atomic_load_explicit(p, memory_order_relaxed);
  _Bool done = 0;
  do {
    uint_fast64_t desired = expect | 3;
    done = atomic_compare_exchange_weak(p, &expect, desired);
  } while (!done);
}
        ... similar setup, but also reserve some stack space
        movq    (%esi), %xmm0   #* p, tmp102
        movq    %xmm0, (%esp)   # tmp102,
        movl    (%esp), %eax    #, expect
        movl    4(%esp), %edx   #, expect
        ...  (then the same loop)

I think it's legal to split an atomic load into two halves if the value can't
escape and is only feeding the old/new args of a CAS:

If the cmpxchg8/16b succeeds, that means the "expected" value was there in
memory at that point.  Seeing that value earlier than it was actually there
(because of tearing between two previous values) is indistinguishable from a
memory ordering that's already possible for a full-width atomic load: we could
have got the same result if everything in this thread had happened after the
store that made the value seen by cmpxchg8/16b globally visible.  And if the
cmpxchg8/16b fails, it rewrites "expected" with a full-width atomic copy of the
current value anyway, so a torn initial load only costs one extra trip around
the loop.  So this also requires that there are no other synchronization points
between the load and the CAS.
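
To show the shape of the transformation I'm asking for (my sketch:
little-endian, GNU C builtins, made-up function name; mixing access sizes
like this in source code is outside what ISO C guarantees, which is exactly
why I'd like the compiler to do it internally):

#include <stdint.h>

void or3_split_setup(uint64_t *p) {
  // Two 32-bit loads instead of one 64-bit atomic load for the CAS setup.
  uint32_t lo = __atomic_load_n((uint32_t *)p, __ATOMIC_RELAXED);
  uint32_t hi = __atomic_load_n((uint32_t *)p + 1, __ATOMIC_RELAXED);
  uint64_t expect = ((uint64_t)hi << 32) | lo;
  uint64_t desired;
  do {
    desired = expect | 3;
    // On failure the CAS rewrites expect with a full-width atomic copy of
    // *p, so a torn initial value only costs one extra iteration.
  } while (!__atomic_compare_exchange_n(p, &expect, desired, 1,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
}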


For 16-byte objects in 64-bit code, this saves a CMPXCHG16B-load, so it's about
half the cost in the no-contention case.
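
The 16-byte version of the same pattern (my sketch):

#include <stdatomic.h>

void or3_u128(_Atomic unsigned __int128 *p) {
  // The pure-load setup currently costs a function call + CMPXCHG16B just
  // to read the old value; two 8-byte loads would do, by the same argument
  // as above.
  unsigned __int128 expect = atomic_load_explicit(p, memory_order_relaxed);
  while (!atomic_compare_exchange_weak(p, &expect, expect | 3))
    ;   // on failure, expect is updated with the current value
}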

For 8-byte objects with -m32, it's smaller, but probably not totally irrelevant
in the un-contended case.  An SSE2 load and xmm->int (via ALU or store/reload)
might be the same order of magnitude in cost as lock cmpxchg8b on AMD
Bulldozer.  As far as the latency chain for operations on a single atomic
variable goes, lock cmpxchg8b has ~42 cycle latency on Piledriver (according to
Agner Fog), while an extra xmm->int has about 8c latency beyond directly
loading into integer regs.  (The throughput costs of lock cmpxchg8b are vastly
higher, though: 18 m-ops instead of 3 for movd/pextrd.)
