https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---

Thanks for correcting my mistake in tagging this bug, but this got me thinking: it's not just a C++ issue. It also applies to GNU C __atomic_load_n(), and to ISO C11 stdatomic code like

    #include <stdatomic.h>
    #include <stdint.h>

    uint32_t load(atomic_uint_fast64_t *p) {
        // https://godbolt.org/g/CXuiPO
        return *p;
    }

With -m32, it's definitely useful to do only a 32-bit load. With -m64, it might let us fold the load into a memory operand, like `add (%rdi), %eax` or something. As it stands, 64-bit code just compiles to movq (%rdi), %rax / ret. But if we'd wanted the high half, it would have been a 64-bit load plus a shift instead of just a 32-bit load; or a shift plus zero-extend if we'd wanted bytes 1 through 4 with (*p) >> 8. (A sketch of the high-half case is at the end of this comment.)

We know that a 64-bit load of *p doesn't cross a cache-line boundary (otherwise it wouldn't be atomic), so neither does any narrower load that's fully contained within it. This optimization should be disabled for `volatile`, if that's supposed to make stdatomic usable for MMIO registers.

------------------

It's common to use a load as setup for a CAS. With cmpxchg8b/16b, it's most efficient to use two separate half-width loads as that setup. gcc even does this for us with code like *p |= 3, but we can't get that behaviour if we write a CAS loop ourselves. That makes the un-contended case slower, even when the penalty is just an xmm->int delay for an 8-byte load rather than a function call plus CMPXCHG16B.

    void cas_compiler(atomic_uint_fast64_t *p) {
        *p |= 3;   // compiler emits separate 32-bit loads before the loop
    }

    # gcc8 snapshot 20170522 -m32; clang does the same
    ... intro, including pushing 4 regs, two of which aren't used :(
        movl    (%esi), %eax    #* p, tmp90
        movl    4(%esi), %edx   #,
    ... (loop with MOV, OR, and CMPXCHG8B)

    void cas_explicit(atomic_uint_fast64_t *p) {
        // AFAIK, it would be data-race UB to cast to a plain uint64_t* for a non-atomic load
        uint_fast64_t expect = atomic_load_explicit(p, memory_order_relaxed);
        _Bool done = 0;
        do {
            uint_fast64_t desired = expect | 3;
            done = atomic_compare_exchange_weak(p, &expect, desired);
        } while (!done);
    }

    ... similar setup, but also reserves some stack space
        movq    (%esi), %xmm0   #* p, tmp102
        movq    %xmm0, (%esp)   # tmp102,
        movl    (%esp), %eax    #, expect
        movl    4(%esp), %edx   #, expect
    ... (then the same loop)

I think it's legal to split an atomic load into two halves when the value can't escape and only feeds the old/new args of a CAS: if the cmpxchg8/16b succeeds, the "expected" value really was in memory at that point. Seeing it earlier than it was actually there, because of tearing between two previous values, is indistinguishable from a possible memory ordering for a full-width atomic load: we could have got the same result if everything in this thread had happened after the store that made the value seen by cmpxchg8/16b globally visible. So this also requires that there are no other synchronization points between the load and the CAS. (A source-level sketch of such a split load is also at the end of this comment.)

For 16-byte objects in 64-bit code, this saves a CMPXCHG16B load, so it's about half the cost in the no-contention case. For 8-byte objects with -m32, the saving is smaller, but probably not totally irrelevant in the un-contended case. An SSE2 load plus an xmm->int transfer (via ALU or store/reload) might be the same order of magnitude in cost as lock cmpxchg8b on AMD Bulldozer. As far as the latency chain for operations on a single atomic variable goes, lock cmpxchg8b has ~42 cycle latency on Piledriver (according to Agner Fog), while the extra xmm->int step costs about 8c of latency beyond loading directly into integer regs.
(The throughput costs of lock cmpxchg8b are vastly higher, though: 18 m-ops instead of 3 for movd/pextrd.)
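
To make the narrowed-load case above concrete, here's a minimal sketch (same includes as the first example; the function name and the suggested asm are mine, not actual compiler output):

    uint32_t load_high(atomic_uint_fast64_t *p) {
        // Today this compiles to a 64-bit atomic load plus a shift.
        // A narrowed load, e.g.  movl 4(%rdi), %eax  in 64-bit code, would be
        // equally atomic: the 4-byte load is fully contained in the 8-byte object.
        return *p >> 32;
    }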
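And here's a hedged source-level sketch of the split-load CAS setup, illustrating what the compiler could do internally rather than a recommended user idiom: the casts alias an _Atomic object and assume x86's little-endian layout, and cas_split_setup plus the use of the GNU __atomic_load_n builtin are just for demonstration.

    void cas_split_setup(atomic_uint_fast64_t *p) {
        // Two relaxed 32-bit loads instead of one 64-bit atomic load (-m32 case).
        // If 'expect' is torn between two older values, it only feeds the CAS:
        // the worst case is one extra retry, same as losing a CAS race.
        uint32_t lo = __atomic_load_n((uint32_t *)p, __ATOMIC_RELAXED);
        uint32_t hi = __atomic_load_n((uint32_t *)p + 1, __ATOMIC_RELAXED);
        uint_fast64_t expect = ((uint_fast64_t)hi << 32) | lo;
        _Bool done = 0;
        do {
            uint_fast64_t desired = expect | 3;
            done = atomic_compare_exchange_weak(p, &expect, desired);
        } while (!done);
    }

Since tearing can only cause a spurious first-iteration failure (and there's no other synchronization between the load and the CAS, per the argument above), the observable behaviour matches some legal ordering of the full-width load.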