https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70825
Bug ID: 70825
Summary: x86_64: __atomic_compare_exchange_n() accesses stack unnecessarily
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: dhowells at redhat dot com
Target Milestone: ---

Created attachment 38349
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38349&action=edit
Example code

__atomic_compare_exchange_n() stores the 'expected' value and, if not
discarded, the old value on the stack rather than in a register, and then may
immediately retrieve the old value from there back into the register from
which it just came.

For example:

    static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
    {
            int cur = old;
            if (__atomic_compare_exchange_n(&v->counter, &cur, new, false,
                                            __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
                    return cur;
            return cur;
    }

    void test_atomic_cmpxchg(atomic_t *counter)
    {
            atomic_cmpxchg(counter, 23, 42);
    }

produces:

       0:   ba 2a 00 00 00          mov    $0x2a,%edx
       5:   b8 17 00 00 00          mov    $0x17,%eax
       a:   c7 44 24 fc 17 00 00    movl   $0x17,-0x4(%rsp)
      11:   00
      12:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
      16:   c3                      retq

Note line a, where 0x17 is stored at -0x4(%rsp) unnecessarily (the value is
discarded).  Note also that this value is in %eax at the time and could be
saved from there rather than using an immediate operand.

And then:

    int test_atomic_cmpxchg_2(atomic_t *counter)
    {
            return atomic_cmpxchg(counter, 23, 42);
    }

where the value is returned, we get something like:

      20:   ba 2a 00 00 00          mov    $0x2a,%edx
      25:   b8 17 00 00 00          mov    $0x17,%eax
      2a:   c7 44 24 fc 17 00 00    movl   $0x17,-0x4(%rsp)
      31:   00
      32:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
      36:   74 04                   je     3c <test_atomic_cmpxchg_2+0x1c>
      38:   89 44 24 fc             mov    %eax,-0x4(%rsp)
      3c:   8b 44 24 fc             mov    -0x4(%rsp),%eax
      40:   c3                      retq

In this case, the return value is being constructed on the stack, whereas the
value we actually want to return is already in %eax at the conclusion of the
CMPXCHG, so lines 2a and 36-3c are redundant.
Changing atomic_cmpxchg() slightly:

    static __always_inline int atomic_cmpxchg_B(atomic_t *v, int old, int new)
    {
            int cur = old;
            if (__atomic_compare_exchange_n(&v->counter, &cur, new, false,
                                            __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
-->                 return old;
            return cur;
    }

does generate code that's somewhat better:

      a0:   ba 2a 00 00 00          mov    $0x2a,%edx
      a5:   b8 17 00 00 00          mov    $0x17,%eax
      aa:   c7 44 24 fc 17 00 00    movl   $0x17,-0x4(%rsp)
      b1:   00
      b2:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
      b6:   ba 17 00 00 00          mov    $0x17,%edx
      bb:   0f 45 d0                cmovne %eax,%edx
      be:   89 d0                   mov    %edx,%eax
      c0:   c3                      retq

However, given that old == cur must hold true if the CMPXCHG succeeded, lines
b6-be are redundant.  Note that the unnecessary store (line aa) is still
there.

Note that clang does somewhat better than gcc:

      50:   b9 2a 00 00 00          mov    $0x2a,%ecx
      55:   b8 17 00 00 00          mov    $0x17,%eax
      5a:   f0 0f b1 0f             lock cmpxchg %ecx,(%rdi)
      5e:   c3                      retq
      5f:   90                      nop

See the attached example code, which should be compiled to a .s file.

This is the case for all of:

    gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC)
    gcc version 6.0.0 20160219 (Red Hat Cross 6.0.0-0.1) (GCC)
    gcc version 4.8.5 20150623 (Red Hat 4.8.5-2.x) (GCC)