http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48987
Summary: Atomic update merging
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: piotr.wyder...@gmail.com

Not being a GCC developer, I am not sure whether this is an RTL or a tree-level optimization, so the component selection is a guess.

The problem: GCC does not merge atomic modifications of the same location. An example is given below:

    void yyy(int* p)
    {
        __sync_fetch_and_add(p, 1);
        __sync_fetch_and_add(p, 1);
    }

On x86, GCC emits multiple independent modifications:

    0040edd0 <__Z3yyyPi>:
      40edd0: 8b 44 24 04    mov    0x4(%esp),%eax
      40edd4: f0 ff 00       lock incl (%eax)
      40edd7: f0 ff 00       lock incl (%eax)
      40edda: c3             ret

The lock prefix implies a full memory barrier, so if there are no memory references in between, the code is equivalent to:

    lock addl $0x02, (%eax)

The example above is purely artificial, but real C++ programs using smart pointers generate similar patterns. Atomic operations are expensive, so GCC should minimize their count: it should check the invocation graph/tree and try to merge subsequent modifications of the same location.

A pattern with a high probability of occurrence is equivalent to:

    void yyy(int* p)
    {
        __sync_fetch_and_add(p, 1);
        __sync_fetch_and_add(p, -1);
    }

On most platforms this is technically a NOP with memory-barrier semantics and can be implemented as such. On x86/x64 processors supporting SSE3, the update interferes with the monitor/mwait mechanism (i.e. the processor wakes up after a modification of the specified cache line), so it should not be replaced by mfence. A dummy store should be performed instead, and the correct pattern is then:

    lock add $0x00, (addr)

In fact, this can be the only pattern used, since mfence and lock add have comparable performance and neither wastes a register, as xadd, for example, does.
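To make the request concrete, here is a minimal sketch of the two merges written out by hand, using the same `__sync_fetch_and_add` builtin as the examples above. The function names (`add_twice`, `add_merged`, `inc_then_dec`, `add_zero`) are hypothetical and chosen only for illustration. The sketch shows only that each pair leaves the same final value in memory. It does not show what code GCC emits for any of them, and whether the compiler may perform the merge on its own is exactly the open question of this report.

```cpp
// Two separate atomic increments, as in the first example of the report.
void add_twice(int* p)
{
    __sync_fetch_and_add(p, 1);
    __sync_fetch_and_add(p, 1);
}

// Hand-merged form: a single atomic add of 2.
// Another thread can no longer observe the intermediate value (old + 1).
void add_merged(int* p)
{
    __sync_fetch_and_add(p, 2);
}

// Increment followed by decrement, as in the second example.
void inc_then_dec(int* p)
{
    __sync_fetch_and_add(p, 1);
    __sync_fetch_and_add(p, -1);
}

// Hand-merged form: an atomic add of 0, i.e. a dummy read-modify-write
// that keeps the full-barrier semantics but leaves the value unchanged.
void add_zero(int* p)
{
    __sync_fetch_and_add(p, 0);
}
```

In a single-threaded check, `add_twice` and `add_merged` both turn 5 into 7, while `inc_then_dec` and `add_zero` both leave the value as it was. The difference between the pairs is visible only to concurrent observers and in the number of locked instructions executed.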