https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659
Bug ID: 108659
Summary: Suboptimal 128 bit atomics codegen on AArch64 and x64
Product: gcc
Version: 12.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: s_gccbugzilla at nedprod dot com
Target Milestone: ---

Related:
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94649
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

I got bitten by this again: the latest GCC still does not emit single-instruction 128-bit atomics, even when -march is set to an architecture easily new enough to support them. Here is a Godbolt link comparing the latest MSVC, GCC, and clang for the skylake-avx512 architecture, which unquestionably supports cmpxchg16b. Only clang emits the single-instruction atomic:

https://godbolt.org/z/EnbeeW4az

I gather from the comments on this issue and from the comments at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 that you are going to wait for AMD to guarantee atomicity of SSE instructions before changing the codegen here, which makes sense.

However, I also wanted to raise potentially suboptimal 128-bit atomic codegen by GCC for AArch64 as compared to clang:

https://godbolt.org/z/oKv4o81nv

GCC emits `dmb` to force a global memory fence, whereas clang does not. I think clang is in the right here: seq_cst atomic semantics are not supposed to require a global memory fence.
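For reference, a minimal sketch of the kind of code under discussion (illustrative only, not necessarily the exact Godbolt source; the type name U128 and its layout are my own):

```cpp
// A 16-byte trivially copyable type wrapped in std::atomic, loaded
// with seq_cst ordering. Illustrative sketch, not the exact Godbolt
// source referenced above.
#include <atomic>
#include <cstdint>

struct alignas(16) U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

std::atomic<U128> value;

U128 load_value() {
    // On x64 with -march=skylake-avx512, clang compiles this to a single
    // 16-byte load, while GCC emits a call to __atomic_load_16 instead.
    // On AArch64, GCC's sequence includes a dmb; clang's does not.
    return value.load(std::memory_order_seq_cst);
}
```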