https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86693
Bug ID: 86693
Summary: inefficient atomic_fetch_xor
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: nruslan_devel at yahoo dot com
Target Milestone: ---

(Compiled with -O2 on x86-64)

Consider the following example:

void func1();

void func(unsigned long *counter)
{
    if (__atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) == 1) {
        func1();
    }
}

The code can be optimized to a single 'lock xorq' rather than a cmpxchg loop: xor with a constant is invertible (1 ^ 1 = 0, so the old value was 1 exactly when the new value is 0), which means the condition can be tested directly from the flags, just like the similar fetch_sub and fetch_add cases that gcc already optimizes well. However, gcc currently generates a cmpxchg loop:

func:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
.L2:
        movq    %rax, %rcx
        movq    %rax, %rdx
        xorq    $1, %rcx
        lock cmpxchgq   %rcx, (%rdi)
        jne     .L2
        cmpq    $1, %rdx
        je      .L7
        rep ret

Compare this with fetch_sub instead of fetch_xor:

func:
.LFB0:
        .cfi_startproc
        lock subq       $1, (%rdi)
        je      .L4
        rep ret
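For reference, a sketch of the fetch_sub variant that produces the efficient sequence above; it is the same function with only the builtin swapped:

void func1();

void func(unsigned long *counter)
{
    /* old == 1 iff old - 1 == 0, so gcc can test ZF set by 'lock subq'
       instead of comparing the fetched value.  */
    if (__atomic_fetch_sub(counter, 1, __ATOMIC_ACQ_REL) == 1) {
        func1();
    }
}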
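A possible source-level workaround is to phrase the test in terms of the post-operation value, using the same equivalence (old == 1 iff old ^ 1 == 0). This is only a sketch: the function name func_alt is hypothetical, and whether a given gcc version folds __atomic_xor_fetch compared against zero into 'lock xorq' plus a flags test is an assumption, not something this report confirms.

void func1();

/* Hypothetical rewrite with identical semantics to func(): the
   comparison is against the *new* value, which is zero exactly when
   the old value was 1.  */
void func_alt(unsigned long *counter)
{
    if (__atomic_xor_fetch(counter, 1, __ATOMIC_ACQ_REL) == 0) {
        func1();
    }
}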