https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105929
Bug ID: 105929
Summary: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: arm64-*-*

void foo(unsigned long *p) {
    *p = 0xdeadbeefdeadbeef;
}

// compiles nicely: https://godbolt.org/z/8zf8ns14K
        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But even with -Os -march=armv8.4-a the following doesn't:

void foo_atomic(_Atomic unsigned long *p) {
    __atomic_store_n(p, 0xdeadbeefdeadbeef, __ATOMIC_RELAXED);
}

        mov     x1, 48879
        movk    x1, 0xdead, lsl 16
        movk    x1, 0xbeef, lsl 32
        movk    x1, 0xdead, lsl 48
        stlr    x1, [x0]
        ret

ARMv8.4-a and later guarantees atomicity for aligned ldp/stp, according to ARM's architecture reference manual (ARM DDI 0487H.a, ID020222), so we could use the same asm as the non-atomic version:

> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access
> fewer than 16 bytes are single-copy atomic when all of the following
> conditions are true:
> • All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
> • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory

(FEAT_LSE2 is the same CPU feature that gives 128-bit atomicity for aligned ldp/stp x,x,mem.)

Prior to that, it apparently wasn't guaranteed that an stp of two 32-bit halves would merge into a single 64-bit store. So without -march=armv8.4-a it wasn't a missed optimization to construct the constant in a single register for _Atomic or volatile. But with ARMv8.4-a, we should use MOV/MOVK + STP.

Since there doesn't seem to be a release-store version of STP, 64-bit release and seq_cst stores should still generate the full constant in a register, instead of using STP + barriers.
(Without ARMv8.4-a, or with a memory order other than relaxed, see PR105928 for generating 64-bit constants in 3 instructions instead of 4, at least at -Os, with add x0, x0, x0, lsl 32.)