https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71191
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |aarch64, arm
                 CC|                            |ramana at gcc dot gnu.org

--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to dhowe...@redhat.com from comment #0)
> In the kernel, we have various bits of code that boil down to:
>
>         int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
>         int new;
>         do {
>                 new = do_something_to cur;
>         } while (!__atomic_compare_exchange_n(&v->counter,
>                                               &cur, new,
>                                               false,
>                                               __ATOMIC_SEQ_CST,
>                                               __ATOMIC_RELAXED));
>
> and:
>
>         int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
>         int new;
>         do {
>                 if (cur == not_this_value)
>                         break;
>                 new = do_something_to cur;
>         } while (!__atomic_compare_exchange_n(&v->counter,
>                                               &cur, new,
>                                               false,
>                                               __ATOMIC_SEQ_CST,
>                                               __ATOMIC_RELAXED));
>
> For example:
>
>         static __always_inline int __atomic_add_unless(atomic_t *v,
>                                                        int addend, int unless)
>         {
>                 int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
>                 int new;
>
>                 do {
>                         if (__builtin_expect(cur == unless, 0))
>                                 break;
>                         new = cur + addend;
>                 } while (!__atomic_compare_exchange_n(&v->counter,
>                                                       &cur, new,
>                                                       false,
>                                                       __ATOMIC_SEQ_CST,
>                                                       __ATOMIC_RELAXED));
>                 return cur;
>         }
>
>         int test_atomic_add_unless(atomic_t *counter)
>         {
>                 return __atomic_add_unless(counter, 0x56, 0x23);
>         }
>
> If the CPU has LL/SC constructs available, something like this is probably
> better implemented using that than a CMPXCHG loop - _provided_ the bit
> between the __atomic_load and the __atomic_compare_exchange doesn't resolve
> to more than a few instructions.

Making the compiler deal with "doesn't resolve to more than a few
instructions" and with the architecture restrictions will be the fun part
here. What can or cannot go into the loop is architecture specific.

On AArch64 there are restrictions on the kinds of instructions that may
appear inside these LL/SC loops built from the exclusive instructions,
i.e. LDAXR / STLXR. For example, these loops cannot contain other loads and
stores, and we work pretty hard to make sure that there is no accidental
spilling inside them. See PR69904 for an example of where an optimization
was skirting the edges of the architecture - aarch32 is quite similar to
aarch64 in this respect.

Doing this in a generic manner is probably harder than just adding combine
patterns, but otherwise we will be adding such patterns in the backend till
kingdom come. I'm leaving this as a target bug for now and marking it as
relevant to both backends, but I suspect this also affects other
architectures that have LL/SC-style atomics.
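To make the restriction concrete, here are two hypothetical variants of the
same kernel-style pattern (the function names are made up for illustration).
The first keeps only register arithmetic between the relaxed load and the
compare-exchange, so it is the kind of loop that could plausibly be fused
into a single LDAXR/STLXR sequence; the second does an extra load in the loop
body, so under the "no other loads or stores between the exclusives" rule it
has to stay a compare-exchange loop:

        #include <stdbool.h>

        /* Plausible candidate for fusing into one LL/SC loop: only ALU work
           between the relaxed load and the compare-exchange.  */
        int inc_unless_zero(int *counter)
        {
                int cur = __atomic_load_n(counter, __ATOMIC_RELAXED);
                int new;

                do {
                        if (cur == 0)
                                break;
                        new = cur + 1;
                } while (!__atomic_compare_exchange_n(counter, &cur, new,
                                                      false,
                                                      __ATOMIC_SEQ_CST,
                                                      __ATOMIC_RELAXED));
                return cur;
        }

        /* Not a candidate: the table lookup is an extra load in the loop
           body, which is exactly what may not sit between LDAXR and STLXR. */
        int remap_counter(int *counter, const int *table)
        {
                int cur = __atomic_load_n(counter, __ATOMIC_RELAXED);
                int new;

                do {
                        new = table[cur & 0xff];
                } while (!__atomic_compare_exchange_n(counter, &cur, new,
                                                      false,
                                                      __ATOMIC_SEQ_CST,
                                                      __ATOMIC_RELAXED));
                return cur;
        }

Deciding whether the first form is small enough to sit safely between the
exclusives is exactly the "doesn't resolve to more than a few instructions"
judgement from the report.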
> So, using armv8-a as an example, the test_atomic_add_unless() function above
> compiles to:
>
>         test_atomic_add_unless:
>                 sub     sp, sp, #16         # unnecessary
>                 ldr     w1, [x0]            # __atomic_load_n()
>                 str     w1, [sp, 12]        # bug 70825
>         .L5:
>                 ldr     w1, [sp, 12]        # bug 70825
>                 cmp     w1, 35
>                 beq     .L4
>                 ldr     w3, [sp, 12]        # bug 70825
>                 add     w1, w1, 86
>         .L7:
>                 ldaxr   w2, [x0]            # }
>                 cmp     w2, w3              # } __atomic_compare_exchange()
>                 bne     .L8                 # }
>                 stlxr   w4, w1, [x0]        # }
>                 cbnz    w4, .L7             # }
>         .L8:
>                 beq     .L4
>                 str     w2, [sp, 12]        # bug 70825
>                 b       .L5
>         .L4:
>                 ldr     w0, [sp, 12]        # bug 70825
>                 add     sp, sp, 16          # unnecessary
>                 ret
>
> Removing the bits added by gcc because of bug 70825 and using registers
> throughout, this can be reduced to:
>
>         test_atomic_add_unless:
>                 ldr     w1, [x0]            # __atomic_load_n()
>         .L5:
>                 cmp     w1, 35
>                 beq     .L4
>                 add     w2, w1, 86
>                 mov     w3, w1
>         .L7:
>                 ldaxr   w1, [x0]            # }
>                 cmp     w1, w3              # } __atomic_compare_exchange()
>                 bne     .L5                 # }
>                 stlxr   w4, w2, [x0]        # }
>                 cbnz    w4, .L7             # }
>         .L4:
>                 mov     w0, w1
>                 ret
>
> However, the atomic load, the arithmetic and the compare-exchange can be
> condensed further by moving the arithmetic inside the LL/SC section:
>
>         test_atomic_add_unless:
>         .L7:
>                 ldaxr   w1, [x0]            # __atomic_load_n()
>                 cmp     w1, 35              # } if (cur == unless)
>                 beq     .L4                 # }     break
>                 add     w2, w1, 86          # new = cur + addend
>                 stlxr   w4, w2, [x0]
>                 cbnz    w4, .L7
>         .L4:
>                 mov     w0, w1
>                 ret
>
> When compiled for -march=armv8-a+lse, however, the code resolves to a
> compare-exchange loop using the CASAL instruction:
>
>         test_atomic_add_unless:
>                 sub     sp, sp, #16         # unnecessary
>                 ldr     w1, [x0]            # __atomic_load_n()
>                 str     w1, [sp, 12]        # bug 70825
>         .L5:
>                 ldr     w1, [sp, 12]        # bug 70825
>                 cmp     w1, 35
>                 beq     .L4
>                 ldr     w3, [sp, 12]        # bug 70825
>                 add     w1, w1, 86
>                 mov     w2, w3
>                 casal   w2, w1, [x0]        # __atomic_compare_exchange_n()
>                 cmp     w2, w3
>                 beq     .L4
>                 str     w2, [sp, 12]        # bug 70825
>                 b       .L5
>         .L4:
>                 ldr     w0, [sp, 12]        # bug 70825
>                 add     sp, sp, 16          # unnecessary
>                 ret
>
> but LDAXR/STLXR might actually be better.
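For reference, here is a hand-written sketch of the condensed LDAXR/STLXR
form above, expressed as kernel-style GCC extended inline assembly. The
helper and its operand names are purely illustrative - this is not code GCC
emits today, nor an existing kernel or libatomic interface - but it shows the
single loop the compiler would ideally produce from the C in comment #0:

        /* Illustrative only; assumes AArch64 and GNU extended asm.  */
        static inline int atomic_add_unless_llsc(int *counter, int addend,
                                                 int unless)
        {
                int cur, new, fail;

                __asm__ __volatile__(
                "1:     ldaxr   %w[cur], %[mem]\n"      /* load-exclusive (acquire)  */
                "       cmp     %w[cur], %w[unless]\n"  /* if (cur == unless)        */
                "       b.eq    2f\n"                   /*         break             */
                "       add     %w[new], %w[cur], %w[addend]\n"
                "       stlxr   %w[fail], %w[new], %[mem]\n" /* store-exclusive (release) */
                "       cbnz    %w[fail], 1b\n"         /* retry if the monitor was lost */
                "2:"
                : [cur] "=&r" (cur), [new] "=&r" (new),
                  [fail] "=&r" (fail), [mem] "+Q" (*counter)
                : [unless] "r" (unless), [addend] "r" (addend)
                : "cc", "memory");

                return cur;
        }

Everything between the exclusives here is register-only (a compare, a branch
and an add), which is precisely the property the compiler would have to prove
about the user's loop body before it could legally generate this shape
instead of the compare-exchange loop.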