On 23/02/21 13:57 -0800, Thomas Rodgers wrote:
From: Thomas Rodgers <rodg...@twrodgers.com>

* This revises the previous version to fix std::__condvar::wait_until() usage.

This is a substantial rewrite of the atomic wait/notify (and timed wait
counterparts) implementation.

The previous __platform_wait looped on EINTR; however, this behavior
is not required by the standard. A new _GLIBCXX_HAVE_PLATFORM_WAIT
macro now controls whether wait/notify are implemented using a
platform-specific primitive or with a platform-agnostic mutex/condvar.
This patch only supplies a definition for Linux futexes. A future
update could add support for __ulock_wait/wake on Darwin, for
instance.

The members of __waiters were lifted to a new base class. The members
are now arranged such that the overall sizeof(__waiters_base) fits in
two cache lines (on platforms with at least 64-byte cache lines). The
definition will also use hardware_destructive_interference_size for
this, if it is available.

The __waiters type is now specific to untimed waits. Timed waits have a
corresponding __timed_waiters type. Much of the code has been moved from
the previous __atomic_wait() free function to the __waiter_base template
and a __waiter derived type is provided to implement the un-timed wait
operations. A similar change has been made to the timed wait
implementation.

The __atomic_spin code has been extended to take a spin policy which is
invoked after the initial busy wait loop. The default policy is to
return from the spin. The timed wait code adds a timed backoff spinning
policy. The code from <thread> which implements this_thread::sleep_for
and sleep_until has been moved to a new <bits/std_thread_sleep.h>
header, which allows the thread sleep code to be consumed without
pulling in the whole of <thread>.
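
(For illustration only, a sketch of how the timed-wait code can then
pick up just the sleep functions; the exact include is an assumption,
not quoted from the patch:)

  // In <bits/atomic_timed_wait.h> (sketch): only this_thread::sleep_for
  // and sleep_until are needed for the backoff policy, not all of <thread>.
  #include <bits/std_thread_sleep.h>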

The entry points into the wait/notify code have been restructured to
support either -
  * Testing the current value of the atomic stored at the given address
    and waiting on a notification.
  * Applying a predicate to determine if the wait was satisfied.
The entry points were renamed to make it clear that the wait and wake
operations operate on addresses. The first variant takes the expected
value and a function which returns the current value that should be used
in comparison operations; these operations are named with a _v suffix
(e.g. 'value'). All atomic<_Tp> wait/notify operations use the first
variant. Barriers, latches and semaphores use the predicate variant.
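
Roughly, the two entry-point shapes look like this (a sketch inferred
from the naming described above and the __atomic_wait_address call
seen later in the patch; the exact signatures may differ):

  // Value variant ("_v"): wait until the value returned by __vfn no
  // longer compares equal to __old.
  template<typename _Tp, typename _ValFn>
    void __atomic_wait_address_v(const _Tp* __addr, _Tp __old,
                                 _ValFn __vfn) noexcept;

  // Predicate variant: wait until __pred() returns true.
  template<typename _Tp, typename _Pred>
    void __atomic_wait_address(const _Tp* __addr, _Pred __pred) noexcept;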

This change also centralizes what it means to compare values for the
purposes of atomic<T>::wait, rather than scattering that logic through
individual predicates.

This change also centralizes the repetitive code which adjusts for
different user supplied clocks (this should be moved elsewhere
and all such adjustments should use a common implementation).

libstdc++-v3/ChangeLog:
        * include/Makefile.am: Add new <bits/std_thread_sleep.h> header.
        * include/Makefile.in: Regenerate.
        * include/bits/atomic_base.h: Adjust all calls
        to __atomic_wait/__atomic_notify for new call signatures.
        * include/bits/atomic_wait.h: Extensive rewrite.
        * include/bits/atomic_timed_wait.h: Likewise.
        * include/bits/semaphore_base.h: Adjust all calls
        to __atomic_wait/__atomic_notify for new call signatures.
        * include/bits/std_thread_sleep.h: New file.
        * include/std/atomic: Likewise.
        * include/std/barrier: Likewise.
        * include/std/latch: Likewise.
        * testsuite/29_atomics/atomic/wait_notify/bool.cc: Simplify
        test.
        * testsuite/29_atomics/atomic/wait_notify/generic.cc: Likewise.
        * testsuite/29_atomics/atomic/wait_notify/pointers.cc: Likewise.
        * testsuite/29_atomics/atomic_flag/wait_notify.cc: Likewise.
        * testsuite/29_atomics/atomic_float/wait_notify.cc: Likewise.
        * testsuite/29_atomics/atomic_integral/wait_notify.cc: Likewise.
        * testsuite/29_atomics/atomic_ref/wait_notify.cc: Likewise.

Some of this diff is very confusing: in places the context shown as
removed is actually from a completely different function. Please try
--diff-algorithm=histogram for the next version of this patch. It
might make it easier to read.

+    struct __timed_backoff_spin_policy
+    {
+      __wait_clock_t::time_point _M_deadline;
+      __wait_clock_t::time_point _M_t0;
+
+      template<typename _Clock, typename _Dur>
+       __timed_backoff_spin_policy(chrono::time_point<_Clock, _Dur>
+                                     __deadline = _Clock::time_point::max(),
+                                   chrono::time_point<_Clock, _Dur>
+                                     __t0 = _Clock::now()) noexcept
+         : _M_deadline(__to_wait_clock(__deadline))
+         , _M_t0(__to_wait_clock(__t0))

If this policy object is constructed with a time_point using the
steady_clock then it will still call __to_wait_clock to convert it to
the steady_clock, making multiple unnecessary (and expensive) calls to
steady_clock::now().

I think you either need to overload the constructor or overload
__to_wait_clock.
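
For example (a sketch, assuming __wait_clock_t is chrono::steady_clock),
an overload that passes wait-clock time_points through unchanged would
avoid the extra now() calls:

    template<typename _Dur>
      chrono::time_point<__wait_clock_t, _Dur>
      __to_wait_clock(const chrono::time_point<__wait_clock_t, _Dur>&
                        __atime) noexcept
      { return __atime; }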

+       { }
+
+      bool
+      operator()() noexcept

This can be const.

      {
-       static_assert(sizeof(__timed_waiters) == sizeof(__waiters));
-       return static_cast<__timed_waiters&>(__waiters::_S_for(__t));
+       using namespace literals::chrono_literals;
+       auto __now = __wait_clock_t::now();
+       if (_M_deadline <= __now)
+         return false;
+
+       auto __elapsed = __now - _M_t0;
+       if (__elapsed > 128ms)
+         {
+           this_thread::sleep_for(64ms);
+         }
+       else if (__elapsed > 64us)
+         {
+           this_thread::sleep_for(__elapsed / 2);
+         }
+       else if (__elapsed > 4us)
+         {
+           __thread_yield();
+         }
+       else
+         return false;
      }
    };
-  } // namespace __detail




+      template<typename _Tp, typename _ValFn,
+              typename _Rep, typename _Period>
+       bool
+       _M_do_wait_for_v(_Tp __old, _ValFn __vfn,
+                        const chrono::duration<_Rep, _Period>&
+                                                             __rtime) noexcept
+       {
+         __platform_wait_t __val;
+         if (_M_do_spin_v(__old, move(__vfn), __val))

This should be std::move (there's another case of this in the patch
too).

+           return true;
+
+         if (!__rtime.count())
+           return false; // no rtime supplied, and spin did not acquire
+
+         using __dur = chrono::steady_clock::duration;
+         auto __reltime = chrono::duration_cast<__dur>(__rtime);
+         if (__reltime < __rtime)
+           ++__reltime;

This is C++20 code so it can use chrono::ceil here instead of
duration_cast, then you don't need the increment.
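
i.e. something like:

          auto __reltime = chrono::ceil<__dur>(__rtime);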

+         return _M_w._M_do_wait_until(_M_addr, __val,
+                                      chrono::steady_clock::now() + __reltime);
        }
-      while (!__pred() && __atime < _Clock::now());
-      __w._M_leave_wait();

-      // if timed out, return false
-      return (_Clock::now() < __atime);
+      template<typename _Pred,
+              typename _Rep, typename _Period>
+       bool
+       _M_do_wait_for(_Pred __pred,
+                      const chrono::duration<_Rep, _Period>& __rtime) noexcept
+       {
+         __platform_wait_t __val;
+         if (_M_do_spin(__pred, __val))
+           return true;
+
+         if (!__rtime.count())
+           return false; // no rtime supplied, and spin did not acquire
+
+         using __dur = chrono::steady_clock::duration;
+         auto __reltime = chrono::duration_cast<__dur>(__rtime);
+         if (__reltime < __rtime)
+           ++__reltime;

chrono::ceil here too.

+  template<typename _Tp>
+    inline constexpr bool __platform_wait_uses_type
+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
+      = is_same_v<remove_cv_t<_Tp>, __detail::__platform_wait_t>

This is_same check seems redundant, as the following will be true
anyway.

+       || ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
+           && (alignof(_Tp*) == alignof(__detail::__platform_wait_t)));

This should be alignof(_Tp) not alignof(_Tp*) shouldn't it?

And alignof(_Tp) > alignof(__platform_wait_t) is OK too, so >= not ==.

We need the is_scalar check from Thiago's patch. We don't want to try
and use a futex for something like:

struct S { short s; char c; /* padding */ };
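
Putting those suggestions together, the condition might end up looking
something like this (a sketch, not the exact wording for the patch):

  template<typename _Tp>
    inline constexpr bool __platform_wait_uses_type
      = is_scalar_v<_Tp>
        && ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
            && (alignof(_Tp) >= alignof(__detail::__platform_wait_t)));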


#else
-       = false;
+      = false;
#endif

+  namespace __detail
+  {
#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
+#define _GLIBCXX_HAVE_PLATFORM_WAIT

Redefinition, as I pointed out in my earlier mail.




+    struct __default_spin_policy
+    {
+      bool
+      operator()() noexcept

This can be const.

+      { return false; }
+    };
+
+    template<typename _Pred,
+            typename _Spin = __default_spin_policy>
+      bool
+      __atomic_spin(_Pred& __pred, _Spin __spin = _Spin{ }) noexcept
      {
-       __platform_wait_t __res;
-       __atomic_load(&_M_ver, &__res, __ATOMIC_ACQUIRE);
-       __atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL);
-       return __res;
+       for (auto __i = 0; __i < __detail::__atomic_spin_count_1; ++__i)
+         {
+           if (__pred())
+             return true;
+
+           if (__i < __detail::__atomic_spin_count_2)
+             __detail::__thread_relax();
+           else
+             __detail::__thread_yield();
+         }

I keep wondering (and not bothering to check) whether having two loops
(for counts of 12 and then 4) would make more sense than this branch
in each loop. It doesn't matter though.
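
For the record, the two-loop version would look roughly like this (a
sketch; it assumes the same spin counts as above):

       // Relax for the first __atomic_spin_count_2 iterations ...
       for (auto __i = 0; __i < __detail::__atomic_spin_count_2; ++__i)
         {
           if (__pred())
             return true;
           __detail::__thread_relax();
         }
       // ... then yield for the remainder up to __atomic_spin_count_1.
       for (auto __i = __detail::__atomic_spin_count_2;
            __i < __detail::__atomic_spin_count_1; ++__i)
         {
           if (__pred())
             return true;
           __detail::__thread_yield();
         }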

+       while (__spin())
+         {
+           if (__pred())
+             return true;
+         }
+
+       return false;
      }

-      void
-      _M_leave_wait() noexcept
+    template<typename _Tp>
+      bool __atomic_compare(const _Tp& __a, const _Tp& __b)
      {
-       __atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL);
+       // TODO make this do the correct padding bit ignoring comparison
+       return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) != 0;
      }

-      void
-      _M_do_wait(__platform_wait_t __version) noexcept
-      {
-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-       __platform_wait(&_M_ver, __version);
+#ifdef __cpp_lib_hardware_interference_size
+    struct alignas(hardware_destructive_interference_size)
#else
-       __platform_wait_t __cur = 0;
-       while (__cur <= __version)
-         {
-           __waiters::__lock_t __l(_M_mtx);
-           _M_cv.wait(_M_mtx);
-           __platform_wait_t __last = __cur;
-           __atomic_load(&_M_ver, &__cur, __ATOMIC_ACQUIRE);
-           if (__cur < __last)
-             break; // break the loop if version overflows
-         }
+    struct alignas(64)
+#endif
+    __waiters_base
+    {
+      __platform_wait_t _M_wait = 0;
+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
+      mutex _M_mtx;
#endif
-      }
+
+#ifdef __cpp_lib_hardware_interference_size
+      alignas(hardware_destructive_interference_size)
+#else
+      alignas(64)
+#endif

Please do this #ifdef dance once and define a constant that can be
used in both places, instead of repeating the #ifdef.

e.g.

    struct __waiters_base
    {
#ifdef __cpp_lib_hardware_interference_size
      static constexpr size_t _S_align = hardware_destructive_interference_size;
#else
      static constexpr size_t _S_align = 64;
#endif

      alignas(_S_align) __platform_wait_t _M_wait = 0;
#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
      mutex _M_mtx;
#endif

      alignas(_S_align) __platform_wait_t _M_ver = 0;
#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
      __condvar _M_cond;
#endif

      __waiters_base() = default;

      // ...

+      __platform_wait_t _M_ver = 0;
+
+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
+      __condvar _M_cv;
+
+      __waiters_base() noexcept = default;

Should this be outside the #ifdef block?

I think the noexcept is redundant, but harmless.

+#endif
+
+      void
+      _M_enter_wait() noexcept
+      { __atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL); }
+
+      void
+      _M_leave_wait() noexcept
+      { __atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL); }

      bool
      _M_waiting() const noexcept
      {
        __platform_wait_t __res;
        __atomic_load(&_M_wait, &__res, __ATOMIC_ACQUIRE);
-       return __res;
+       return __res > 0;
      }

      void
-      _M_notify(bool __all) noexcept
+      _M_notify(const __platform_wait_t* __addr, bool __all) noexcept
      {
-       __atomic_fetch_add(&_M_ver, 1, __ATOMIC_ACQ_REL);
+       if (!_M_waiting())
+         return;
+
#ifdef _GLIBCXX_HAVE_LINUX_FUTEX

Should this check HAVE_PLATFORM_WAIT instead?

-       __platform_notify(&_M_ver, __all);
+       __platform_notify(__addr, __all);
#else
        if (__all)
          _M_cv.notify_all();


+    struct __waiter : __waiter_base<__waiters>
    {
-      using namespace __detail;
-      if (std::__atomic_spin(__pred))
-       return;
+      template<typename _Tp>
+       __waiter(const _Tp* __addr, bool __waiting = true) noexcept

Make this constructor explicit please.

+         : __waiter_base(__addr, __waiting)
+       { }


diff --git a/libstdc++-v3/include/bits/semaphore_base.h b/libstdc++-v3/include/bits/semaphore_base.h
index b65717e64d7..95d5414ff80 100644
--- a/libstdc++-v3/include/bits/semaphore_base.h
+++ b/libstdc++-v3/include/bits/semaphore_base.h
@@ -181,40 +181,32 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
      __atomic_semaphore(const __atomic_semaphore&) = delete;
      __atomic_semaphore& operator=(const __atomic_semaphore&) = delete;

+      static _GLIBCXX_ALWAYS_INLINE bool
+      _S_do_try_acquire(_Tp* __counter) noexcept
+      {
+       auto __old = __atomic_impl::load(__counter, memory_order::acquire);
+
+       if (__old == 0)
+         return false;
+
+       return __atomic_impl::compare_exchange_strong(__counter,
+                                                     __old, __old - 1,
+                                                     memory_order::acquire,
+                                                     memory_order::release);
+      }

If we keep calling this in a loop it means that we reload the value
every time using atomic_load, despite the compare_exchange telling us
that value. Can't we reuse that value returned from the CAS?

If the caller provides it by reference:

      static _GLIBCXX_ALWAYS_INLINE bool
      _S_do_try_acquire(_Tp* __counter, _Tp& __old) noexcept
      {
        if (__old == 0)
          return false;
        return __atomic_impl::compare_exchange_strong(__counter,
                                                      __old, __old - 1,
                                                      memory_order::acquire,
                                                      memory_order::release);
      }


+
      _GLIBCXX_ALWAYS_INLINE void
      _M_acquire() noexcept
      {
-       auto const __pred = [this]
-         {
-           auto __old = __atomic_impl::load(&this->_M_counter,
-                           memory_order::acquire);
-           if (__old == 0)
-             return false;
-           return __atomic_impl::compare_exchange_strong(&this->_M_counter,
-                     __old, __old - 1,
-                     memory_order::acquire,
-                     memory_order::release);
-         };
-       auto __old = __atomic_impl::load(&_M_counter, memory_order_relaxed);
-       std::__atomic_wait(&_M_counter, __old, __pred);
+       auto const __pred = [this] { return _S_do_try_acquire(&this->_M_counter); };

Then the predicate can maintain that state:

        auto __old = __atomic_impl::load(&_M_counter, memory_order::acquire);
        auto const __pred = [this, __old] () mutable {
          return _S_do_try_acquire(&this->_M_counter, __old);
        };

Or is reloading it every time needed, because we do a
yield/relax/spin after the CAS and so the value it returns might be
stale before the next CAS?

+       std::__atomic_wait_address(&_M_counter, __pred);
