https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069
Bug ID: 103069
Summary: cmpxchg isn't optimized
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: hjl.tools at gmail dot com
CC: crazylht at gmail dot com, wwwhhhyyy333 at gmail dot com
Blocks: 103065
Target Milestone: ---
Target: i386,x86-64

From the CPU's point of view, getting a cache line for writing is more
expensive than reading it.  See Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

The full compare and swap grabs the cache line exclusive and causes excessive
cache line bouncing.

[hjl@gnu-cfl-2 pr102566]$ cat e.c
int
f3 (int *a)
{
  return __atomic_fetch_or (a, 0x40000000, __ATOMIC_RELAXED);
}
[hjl@gnu-cfl-2 pr102566]$ gcc -S -O2 x.c
[hjl@gnu-cfl-2 pr102566]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	movl	v(%rip), %eax
.L2:
	movl	%eax, %ecx
	movl	%eax, %edx
	orl	$1, %ecx
	lock cmpxchgl	%ecx, v(%rip)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GCC should first emit a normal load and a check, and jump back to .L2 only if
cmpxchgl may fail.  Before jumping back to .L2, a PAUSE should be inserted to
yield the CPU to the other hyperthread and to save power.  It also serves to
slightly limit the rate of accesses on the processor interconnect.
	jne	.L2
	movl	%edx, %eax
	andl	$1, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 11.2.1 20211019 (Red Hat 11.2.1-6)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-2 pr102566]$

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
[Bug 103065] [meta] atomic operations aren't optimized
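A minimal C sketch of the loop shape the report asks GCC to generate: a plain
(non-locked) load first, then the lock cmpxchg, with a PAUSE on the failure
path before retrying.  The helper name `fetch_or_relaxed` and the use of
`_mm_pause` are illustrative assumptions, not anything from the bug report;
the PAUSE placement follows the suggestion above.

```c
#include <immintrin.h>  /* _mm_pause, x86 only -- matches the bug's target */

/* Hypothetical hand-written equivalent of the code the report wants GCC
   to emit for __atomic_fetch_or: start with a normal load so the cache
   line is first requested shared, not exclusive, and PAUSE before each
   retry to yield to the sibling hyperthread and save power.  */
int
fetch_or_relaxed (int *a, int mask)
{
  /* Plain load: does not grab the cache line for writing.  */
  int old = __atomic_load_n (a, __ATOMIC_RELAXED);

  /* On failure, __atomic_compare_exchange_n reloads *a into old.  */
  while (!__atomic_compare_exchange_n (a, &old, old | mask,
                                       /*weak=*/1,
                                       __ATOMIC_RELAXED, __ATOMIC_RELAXED))
    _mm_pause ();  /* rate-limit retries on the interconnect */

  return old;  /* previous value, as __atomic_fetch_or returns */
}
```

Under contention this only takes the cache line exclusive when the cmpxchg is
actually attempted, instead of on every iteration of the loop.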