https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71191

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |aarch64, arm
                 CC|                            |ramana at gcc dot gnu.org

--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
(In reply to dhowe...@redhat.com from comment #0)
> In the kernel, we have various bits of code that boil down to:
> 
>       int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
>       int new;
>       do {
>               new = do_something_to cur;
>       } while (!__atomic_compare_exchange_n(&v->counter,
>                                               &cur, new,
>                                               false,
>                                               __ATOMIC_SEQ_CST,
>                                               __ATOMIC_RELAXED));
> 
> and:
> 
>       int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
>       int new;
>       do {
>               if (cur == not_this_value)
>                       break;
>               new = do_something_to cur;
>       } while (!__atomic_compare_exchange_n(&v->counter,
>                                               &cur, new,
>                                               false,
>                                               __ATOMIC_SEQ_CST,
>                                               __ATOMIC_RELAXED));
> 
> For example:
> 
>       static __always_inline int __atomic_add_unless(atomic_t *v,
>                                                      int addend, int unless)
>       {
>               int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
>               int new;
> 
>               do {
>                       if (__builtin_expect(cur == unless, 0))
>                               break;
>                       new = cur + addend;
>               } while (!__atomic_compare_exchange_n(&v->counter,
>                                                     &cur, new,
>                                                     false,
>                                                     __ATOMIC_SEQ_CST,
>                                                     __ATOMIC_RELAXED));
>               return cur;
>       }
> 
>       int test_atomic_add_unless(atomic_t *counter)
>       {
>               return __atomic_add_unless(counter, 0x56, 0x23);
>       }
> 
> If the CPU has LL/SC constructs available, something like this is probably
> better implemented using that than a CMPXCHG loop - _provided_ the bit
> between the __atomic_load and the __atomic_compare_exchange doesn't resolve
> to more than a few instructions

Making the compiler reason about "doesn't resolve to more than a few instructions"
and cope with the architecture's restrictions will be the fun part here.

What can or cannot go into the loop is architecture specific. On AArch64 there
are restrictions on which instructions may appear inside these LL/SC loops
built from the exclusive instructions, i.e. LDAXR / STLXR. For example, the
loops cannot contain other loads and stores, and we work pretty hard to make
sure that there is no accidental spilling inside them. See PR69904 for an
example of an optimization skirting the edges of the architecture - aarch32 is
quite similar to aarch64 in this respect.
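
To make the constraint concrete, this is roughly the shape of loop that has to
be hand-written today - a minimal sketch using GCC extended asm, not the
kernel's actual implementation - with everything between the exclusive load
and the exclusive store kept in registers:

        /* Minimal sketch, not the kernel's actual code: add-unless written
           directly as an LDAXR/STLXR loop.  Nothing between the exclusive
           load and the exclusive store touches memory. */
        static inline int ll_sc_add_unless(int *counter, int addend, int unless)
        {
                int cur, newval;
                unsigned int fail;

                asm volatile(
                "1:     ldaxr   %w[cur], %[ctr]\n"           /* exclusive load */
                "       cmp     %w[cur], %w[unless]\n"
                "       b.eq    2f\n"                        /* bail out, no store */
                "       add     %w[new], %w[cur], %w[addend]\n"
                "       stlxr   %w[fail], %w[new], %[ctr]\n" /* exclusive store */
                "       cbnz    %w[fail], 1b\n"              /* lost exclusivity, retry */
                "2:"
                : [cur] "=&r" (cur), [new] "=&r" (newval),
                  [fail] "=&r" (fail), [ctr] "+Q" (*counter)
                : [addend] "r" (addend), [unless] "r" (unless)
                : "cc", "memory");

                return cur;
        }

Any compiler-inserted spill between the LDAXR and the STLXR would put an extra
store inside the monitored region - exactly the accidental spilling mentioned
above.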

Doing this in a generic manner is probably a bit harder than just adding
combine patterns, but otherwise we will be adding such patterns in the backend
till kingdom come.

I'm leaving this as a target bug for now and marking it as relevant to both
backends, but I suspect it also affects other architectures that have
LL/SC-style atomics.
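
For reference, the example from comment #0 reduces to the following
self-contained test case outside the kernel tree (my own reduction, not the
kernel's code - atomic_t is modelled as the kernel's struct-wrapped int, and
the exact assembly will of course depend on the optimization level and -march
flags used):

        /* Self-contained reduction of the example from comment #0.
           Compare the output of e.g. -march=armv8-a vs -march=armv8-a+lse. */
        #include <stdbool.h>

        typedef struct { int counter; } atomic_t;

        static inline int atomic_add_unless(atomic_t *v, int addend, int unless)
        {
                int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
                int new;

                do {
                        if (__builtin_expect(cur == unless, 0))
                                break;
                        new = cur + addend;
                } while (!__atomic_compare_exchange_n(&v->counter, &cur, new,
                                                      false,
                                                      __ATOMIC_SEQ_CST,
                                                      __ATOMIC_RELAXED));
                return cur;
        }

        int test_atomic_add_unless(atomic_t *counter)
        {
                return atomic_add_unless(counter, 0x56, 0x23);
        }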

> 
> So, using armv8-a as an example, the test_atomic_add_unless() function above
> compiles to:
> 
>       test_atomic_add_unless:
>               sub     sp, sp, #16             # unnecessary
>               ldr     w1, [x0]                # __atomic_load_n()
>               str     w1, [sp, 12]            # bug 70825
>       .L5:
>               ldr     w1, [sp, 12]            # bug 70825
>               cmp     w1, 35
>               beq     .L4
>               ldr     w3, [sp, 12]            # bug 70825
>               add     w1, w1, 86
>       .L7:
>               ldaxr   w2, [x0]                # }
>               cmp     w2, w3                  # } __atomic_compare_exchange()
>               bne     .L8                     # }
>               stlxr   w4, w1, [x0]            # }
>               cbnz    w4, .L7                 # }
>       .L8:
>               beq     .L4
>               str     w2, [sp, 12]            # bug 70825
>               b       .L5
>       .L4:
>               ldr     w0, [sp, 12]            # bug 70825
>               add     sp, sp, 16              # unnecessary
>               ret
> 
> Removing the bits added by gcc by bug 70825 and using registers throughout,
> this can be reduced to:
> 
>       test_atomic_add_unless:
>               ldr     w1, [x0]                # __atomic_load_n()
>       .L5:
>               cmp     w1, 35
>               beq     .L4
>               add     w2, w1, 86
>               mov     w3, w1
>       .L7:
>               ldaxr   w1, [x0]                # }
>               cmp     w1, w3                  # } __atomic_compare_exchange()
>               bne     .L5                     # }
>               stlxr   w4, w2, [x0]            # }
>               cbnz    w4, .L7                 # }
>       .L4:
>               mov     w0, w1                  # return cur
>               ret
> 
> However, the atomic load, the arithmetic and the compare exchange can be
> condensed further by moving the arithmetic inside the LL/SC section:
> 
>       test_atomic_add_unless:
>       .L7:
>               ldaxr   w1, [x0]                # __atomic_load_n()
>               cmp     w1, 35                  # } if (cur == unless)
>               beq     .L4                     # }     break
>               add     w2, w1, 86              # new = cur + addend
>               stlxr   w4, w2, [x0]
>               cbnz    w4, .L7
>       .L4:
>               mov     w0, w1                  # return cur
>               ret
> 
> When compiled for -march=armv8-a+lse, however, the code resolves to a
> compare-exchange loop using the CASAL instruction:
> 
>       test_atomic_add_unless:
>               sub     sp, sp, #16             # unnecessary
>               ldr     w1, [x0]                # __atomic_load_n()
>               str     w1, [sp, 12]            # bug 70825
>       .L5:
>               ldr     w1, [sp, 12]            # bug 70825
>               cmp     w1, 35
>               beq     .L4
>               ldr     w3, [sp, 12]            # bug 70825
>               add     w1, w1, 86
>               mov     w2, w3
>               casal   w2, w1, [x0]            # __atomic_compare_exchange_n()
>               cmp     w2, w3
>               beq     .L4
>               str     w2, [sp, 12]            # bug 70825
>               b       .L5
>       .L4:
>               ldr     w0, [sp, 12]            # bug 70825
>               add     sp, sp, 16              # unnecessary
>               ret
> 
> but LDAXR/STLXR might actually be better.
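
For what it's worth, the CASAL vs. LDAXR/STLXR choice can also be seen with
the builtin in isolation (a quick test snippet of mine, not from the report):

        /* Quick test snippet, not from the report: the compare-exchange
           builtin on its own.  Plain -march=armv8-a expands this to an
           LDAXR/STLXR loop, while -march=armv8-a+lse should use a single
           CASAL instead. */
        int cas_seq_cst(int *p, int expected, int desired)
        {
                __atomic_compare_exchange_n(p, &expected, desired, 0,
                                            __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
                return expected;
        }

Whether the CAS form or the exclusive-load/store loop wins in practice is, as
noted above, not obvious.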
