[Bug rtl-optimization/48987] New: Atomic update merging

piotr.wyderski at gmail dot com Fri, 13 May 2011 03:32:25 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48987


           Summary: Atomic update merging
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: piotr.wyder...@gmail.com


Not being a GCC developer, I am not sure whether this is an RTL-level
or a tree-level optimization, so the component selection is a guess.

The problem: GCC does not merge atomic modifications of the
same location. An example is given below:

void yyy(int* p) {

    __sync_fetch_and_add(p, 1);
    __sync_fetch_and_add(p, 1);
}

On x86 GCC emits multiple independent modifications:

0040edd0 <__Z3yyyPi>:
  40edd0:    8b 44 24 04              mov    0x4(%esp),%eax
  40edd4:    f0 ff 00                 lock incl (%eax)
  40edd7:    f0 ff 00                 lock incl (%eax)
  40edda:    c3                       ret     

The lock prefix implies a full memory barrier, so if there
are no memory references in between, the code is equivalent to:

    lock addl $0x02, (%eax)
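
As a minimal sketch of the merge described above, the two increments
can be written by hand as a single atomic addition of 2. The function
names add_twice, add_merged and check_same_result are hypothetical,
and the check is single-threaded: it only shows that both forms leave
the same final value and says nothing about what GCC actually emits.

```c
#include <assert.h>

/* Two separate atomic increments, as in the report. */
void add_twice(int *p) {
    __sync_fetch_and_add(p, 1);
    __sync_fetch_and_add(p, 1);
}

/* Hand-merged form: one atomic read-modify-write that adds 2. */
void add_merged(int *p) {
    __sync_fetch_and_add(p, 2);
}

/* Single-threaded check that both forms end with the same value. */
void check_same_result(void) {
    int a = 40, b = 40;
    add_twice(&a);
    add_merged(&b);
    assert(a == 42);
    assert(b == 42);
}
```

With other threads running, the unmerged form briefly exposes the
intermediate value (here 41), while the merged form never does.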

The example above is purely artificial, but real C++ programs
using smart pointers generate similar patterns. Atomic operations
are expensive, so GCC should minimize their count: it should
check the invocation graph/tree and try to merge subsequent
modifications. A pattern with a high probability of occurrence is
equivalent to:

void yyy(int* p) {

    __sync_fetch_and_add(p, 1);
    __sync_fetch_and_add(p, -1);
}

which on most platforms is technically a NOP with memory barrier
semantics and can be implemented as such. On x86/x64 processors
supporting SSE3, the update interferes with the monitor/mwait
mechanism (i.e. the processor wakes up after a modification of the
specified cache line), so it shouldn't be replaced by mfence. A dummy
store should be performed instead, and the correct pattern is then:

    lock add $0x00, (addr)

In fact it can be the only pattern, as mfence and lock add are
of comparable performance and neither of them wastes a register,
as xadd, for example, does.
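
The two candidate replacements discussed above can be sketched at the
source level with the same builtins. The function names are
hypothetical, and the instructions actually emitted for each variant
depend on the GCC version and target, so this is an illustration of
the intent rather than a guarantee of the generated code.

```c
#include <assert.h>

/* Pattern from the report: an increment followed by a decrement. */
void inc_then_dec(int *p) {
    __sync_fetch_and_add(p, 1);
    __sync_fetch_and_add(p, -1);
}

/* Candidate 1: one atomic add of zero. This is still a
   read-modify-write on *p, so the location itself is touched. */
void barrier_with_dummy_store(int *p) {
    __sync_fetch_and_add(p, 0);
}

/* Candidate 2: a full barrier only, with no access to *p. */
void barrier_only(int *p) {
    (void)p;
    __sync_synchronize();
}

/* Single-threaded check: none of the variants changes the value. */
void check_value_unchanged(void) {
    int a = 7, b = 7, c = 7;
    inc_then_dec(&a);
    barrier_with_dummy_store(&b);
    barrier_only(&c);
    assert(a == 7);
    assert(b == 7);
    assert(c == 7);
}
```

The first candidate corresponds to the dummy-store pattern, the second
to the barrier-only replacement that the monitor/mwait argument rules
out on x86.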
