Re: [PATCH] libstdc++: Future-proof C++20 atomic wait/notify

Tomasz Kaminski Mon, 17 Nov 2025 05:11:56 -0800

On Mon, Nov 17, 2025 at 1:50 PM Jonathan Wakely <[email protected]> wrote:


> On Mon, 17 Nov 2025 at 10:10 +0100, Tomasz Kaminski wrote:
> >On Sun, Nov 16, 2025 at 1:56 AM Jonathan Wakely <[email protected]>
> wrote:
> >
> >> This will allow us to extend atomic waiting functions to support a
> >> possible future 64-bit version of futex, as well as supporting
> >> futex-like wait/wake primitives on other targets (e.g. macOS has
> >> os_sync_wait_on_address and FreeBSD has _umtx_op).
> >>
> >> Before this change, the decision of whether to do a proxy wait or to
> >> wait on the atomic variable itself was made in the header at
> >> compile-time, which makes it an ABI property that would not have been
> >> possible to change later.  That would have meant that
> >> std::atomic<uint64_t> would always have to do a proxy wait even if Linux
> >> gains support for 64-bit futex2(2) calls at some point in the future.
> >> The disadvantage of proxy waits is that several distinct atomic objects
> >> can share the same proxy state, leading to contention between threads
> >> even when they are not waiting on the same atomic object, similar to
> >> false sharing. It also result in spurious wake-ups because doing a
> >> notify on an atomic object that uses a proxy wait will wake all waiters
> >> sharing the proxy.
> >>
> >> For types that are known to definitely not need a proxy wait (e.g. int
> >> on Linux) the header can still choose a more efficient path at
> >> compile-time. But for other types, the decision of whether to do a proxy
> >> wait is deferred to runtime, inside the library internals. This will
> >> make it possible for future versions of libstdc++.so to extend the set
> >> of types which don't need to use proxy waits, without ABI changes.
> >>
> >> The way the change works is to stop using the __proxy_wait flag that was
> >> set by the inline code in the headers. Instead the __wait_args struct
> >> has an extra pointer member which the library internals populate with
> >> either the address of the atomic object or the _M_ver counter in the
> >> proxy state. There is also a new _M_obj_size member which stores the
> >> size of the atomic object, so that the library can decide whether a
> >> proxy is needed. So for example if linux gains 64-bit futex support then
> >> the library can decide not to use a proxy when _M_obj_size == 8.
> >> Finally, the _M_old member of the __wait_args struct is changed to
> >> uint64_t so that it has room to store 64-bit values, not just whatever
> >> size the __platform_wait_t type is (which is a 32-bit int on Linux).
> >> Similarly, the _M_val member of __wait_result_type changes to uint64_t
> >> too.
> >>
> >> libstdc++-v3/ChangeLog:
> >>
> >>         * config/abi/pre/gnu.ver: Adjust exports.
> >>         * include/bits/atomic_timed_wait.h
> >> (_GLIBCXX_HAVE_PLATFORM_TIMED_WAIT):
> >>         Do not define this macro.
> >>         (__atomic_wait_address_until_v, __atomic_wait_address_for_v):
> >>         Guard assertions with #ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT.
> >>         * include/bits/atomic_wait.h (__platform_wait_uses_type):
> >>         Different separately for platforms with and without platform
> >>         wait.
> >>         (_GLIBCXX_HAVE_PLATFORM_WAIT): Do not define this macro.
> >>         (_GLIBCXX_UNKNOWN_PLATFORM_WAIT): Define new macro.
> >>         (__wait_value_type): New typedef.
> >>         (__wait_result_type): Change _M_val to __wait_value_type.
> >>         (__wait_args_base::_M_old): Change to __wait_args_base.
> >>         (__wait_args_base::_M_obg, __wait_args_base::_M_obj_size): New
> >>         data members.
> >>         (__wait_args::__wait_args): Set _M_obj and _M_obj_size on
> >>         construction.
> >>         (__wait_args::_M_setup_wait): Change void* parameter to deduced
> >>         type. Use _S_bit_cast instead of __builtin_bit_cast.
> >>         (__wait_args::_M_load_proxy_wait_val): Remove function, replace
> >>         with ...
> >>         (__wait_args::_M_setup_wait_impl): New function.
> >>         (__wait_args::_S_bit_cast): Wrapper for __builtin_bit_cast which
> >>         also supports conversion from 32-bit values.
> >>         (__wait_args::_S_flags_for): Do not set __proxy_wait flag.
> >>         (__atomic_wait_address_v): Guard assertions with #ifdef
> >>         _GLIBCXX_UNKNOWN_PLATFORM_WAIT.
> >>         * src/c++20/atomic.cc (_GLIBCXX_HAVE_PLATFORM_WAIT): Define here
> >>         instead of in header. Check _GLIBCXX_HAVE_PLATFORM_WAIT instead
> >>         of _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT.
> >>         (__spin_impl): Adjust for 64-bit __wait_args_base::_M_old.
> >>         (use_proxy_wait): New function.
> >>         (__wait_args::_M_load_proxy_wait_val): Replace with ...
> >>         (__wait_args::_M_setup_wait_impl): New function. Call
> >>         use_proxy_wait to decide at runtime whether to wait on the
> >>         pointer directly instead of using a proxy. If a proxy is needed,
> >>         set _M_obj to point to its _M_ver member. Adjust for change to
> >>         type of _M_old.
> >>         (__wait_impl): Wait on _M_obj unconditionally.
> >>         (__notify_impl): Call use_proxy_wait to decide whether to notify
> >>         on the address parameter or a proxy
> >>         (__spin_until_impl): Adjust for change to type of _M_val.
> >>         (__wait_until_impl): Wait on _M_obj unconditionally.
> >> ---
> >>
> >> Tested x86_64-linux, powerpc64le-linux, sparc-solaris.
> >>
> >A lot of comments below.
> >
> >>
> >> I think this is an imporant change which I unfortunately didn't think of
> >> until recently.
> >>
> >> This changes the exports from the shared library, but we're still in
> >> stage 1 so I think that should be allowed (albeit unfortunate). Nobody
> >> should be expecting GCC 16 to be stable yet.
> >>
> >> The __proxy_wait enumerator is now unused and could be removed. The
> >> __abi_version enumerator could also be bumped to indicate the
> >> incompatibility with earlier snapshots of GCC 16, but I don't think that
> >> is needed. We could in theory keep the old symbol export
> >> (__wait_args::_M_load_proxy_wait) and make it trap/abort if called, but
> >> I'd prefer to just remove it and cause dynamic linker errors instead.
> >>
> >> There's a TODO in the header about which types should be allowed to take
> >> the optimized paths (see the __waitable concept). For types where that's
> >> true, if the size matches a futex then we'll use a futex, even if it's
> >> actually an enum or floating-point type (or pointer on 32-bit targets).
> >> I'm not sure if that's safe.
> >>
> >>
> >>  libstdc++-v3/config/abi/pre/gnu.ver           |   3 +-
> >>  libstdc++-v3/include/bits/atomic_timed_wait.h |  12 +-
> >>  libstdc++-v3/include/bits/atomic_wait.h       | 109 +++++++++-----
> >>  libstdc++-v3/src/c++20/atomic.cc              | 140 +++++++++++-------
> >>  4 files changed, 166 insertions(+), 98 deletions(-)
> >>
> >> diff --git a/libstdc++-v3/config/abi/pre/gnu.ver
> >> b/libstdc++-v3/config/abi/pre/gnu.ver
> >> index 2e48241d51f9..3c2bd4921730 100644
> >> --- a/libstdc++-v3/config/abi/pre/gnu.ver
> >> +++ b/libstdc++-v3/config/abi/pre/gnu.ver
> >> @@ -2553,7 +2553,8 @@ GLIBCXX_3.4.35 {
> >>      _ZNSt8__detail11__wait_implEPKvRNS_16__wait_args_baseE;
> >>      _ZNSt8__detail13__notify_implEPKvbRKNS_16__wait_args_baseE;
> >>
> >>
> _ZNSt8__detail17__wait_until_implEPKvRNS_16__wait_args_baseERKNSt6chrono8durationI[lx]St5ratioIL[lx]1EL[lx]1000000000EEEE;
> >> -    _ZNSt8__detail11__wait_args22_M_load_proxy_wait_valEPKv;
> >> +    _ZNSt8__detail11__wait_args18_M_setup_wait_implEPKv;
> >> +    _ZNSt8__detail11__wait_args20_M_setup_notify_implEPKv;
> >>
> >>      # std::chrono::gps_clock::now, tai_clock::now
> >>      _ZNSt6chrono9gps_clock3nowEv;
> >> diff --git a/libstdc++-v3/include/bits/atomic_timed_wait.h
> >> b/libstdc++-v3/include/bits/atomic_timed_wait.h
> >> index 30f7ff616840..918a267d10eb 100644
> >> --- a/libstdc++-v3/include/bits/atomic_timed_wait.h
> >> +++ b/libstdc++-v3/include/bits/atomic_timed_wait.h
> >> @@ -75,14 +75,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>           return chrono::ceil<__w_dur>(__atime);
> >>        }
> >>
> >> -#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
> >> -#define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> -#else
> >> -// define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT and implement
> >> __platform_wait_until
> >> -// if there is a more efficient primitive supported by the platform
> >> -// (e.g. __ulock_wait) which is better than pthread_cond_clockwait.
> >> -#endif // ! HAVE_LINUX_FUTEX
> >> -
> >>      __wait_result_type
> >>      __wait_until_impl(const void* __addr, __wait_args_base& __args,
> >>                       const __wait_clock_t::duration& __atime);
> >> @@ -156,7 +148,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>                                   const chrono::time_point<_Clock,
> _Dur>&
> >> __atime,
> >>                                   bool __bare_wait = false) noexcept
> >>      {
> >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
> >>        __glibcxx_assert(false); // This function can't be used for proxy
> >> wait.
> >>  #endif
> >>        __detail::__wait_args __args{ __addr, __old, __order,
> __bare_wait };
> >> @@ -208,7 +200,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>                                 const chrono::duration<_Rep, _Period>&
> >> __rtime,
> >>                                 bool __bare_wait = false) noexcept
> >>      {
> >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
> >>
> >This name really reads strange, and sounds like something with "TODO".
> >I think  _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT was just OK name, even
> >if it was not used directly.
>
> The rationale for the new name is that when the header gets
> preprocessed we don't yet know if the libstdc++ shared lib that will
> be used at runtime has a platform wait function available. A
> translation unit might get compiled with GCC 16 and so there is no
> known platform wait for macOS, but then at runtime it might link to
> the libstdc++.dylib from GCC 17 which uses ulock_wait.
>
> And the reason for getting rid of HAVE_PLATFORM_TIMED_WAIT is twofold:
> - I don't think it makes sense to have separate HAVE_PLATFORM_WAIT and
>    HAVE_PLATFORM_TIMED_WAIT macros. I doubt any target is going to
>    support a futex-like operation that doesn't support a timeout (and
>    if there is such an operation, we should just not use it at all).
> - The macro saying whether we have a platform wait operation is now
>    only defined inside the library, not in the header. It's not
>    something that the header needs to know (or can know).
>
> But I will try to improve the name of UNKNOWN_PLATFORM_WAIT.
>
> >>        __glibcxx_assert(false); // This function can't be used for proxy
> >> wait.
> >>  #endif
> >>        __detail::__wait_args __args{ __addr, __old, __order,
> __bare_wait };
> >> diff --git a/libstdc++-v3/include/bits/atomic_wait.h
> >> b/libstdc++-v3/include/bits/atomic_wait.h
> >> index 95151479c120..49369419d6a6 100644
> >> --- a/libstdc++-v3/include/bits/atomic_wait.h
> >> +++ b/libstdc++-v3/include/bits/atomic_wait.h
> >> @@ -45,35 +45,34 @@
> >>  namespace std _GLIBCXX_VISIBILITY(default)
> >>  {
> >>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> +#if defined _GLIBCXX_HAVE_LINUX_FUTEX
> >>    namespace __detail
> >>    {
> >> -#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
> >> -#define _GLIBCXX_HAVE_PLATFORM_WAIT 1
> >>      using __platform_wait_t = int;
> >>      inline constexpr size_t __platform_wait_alignment = 4;
> >> +  }
> >> +  template<typename _Tp>
> >> +    inline constexpr bool __platform_wait_uses_type
> >> +      = is_scalar_v<_Tp> && sizeof(_Tp) == sizeof(int) && alignof(_Tp)
> >=
> >> 4;
> >>  #else
> >> +# define _GLIBCXX_UNKNOWN_PLATFORM_WAIT 1
> >>  // define _GLIBCX_HAVE_PLATFORM_WAIT and implement __platform_wait()
> >>  // and __platform_notify() if there is a more efficient primitive
> >> supported
> >>  // by the platform (e.g. __ulock_wait()/__ulock_wake()) which is better
> >> than
> >>  // a mutex/condvar based wait.
> >> +  namespace __detail
> >> +  {
> >>  # if ATOMIC_LONG_LOCK_FREE == 2
> >>      using __platform_wait_t = unsigned long;
> >>  # else
> >>      using __platform_wait_t = unsigned int;
> >>  # endif
> >>      inline constexpr size_t __platform_wait_alignment
> >> -      = __alignof__(__platform_wait_t);
> >> -#endif
> >> +      = sizeof(__platform_wait_t) < __alignof__(__platform_wait_t)
> >> +         ? __alignof__(__platform_wait_t) : sizeof(__platform_wait_t);
> >>    } // namespace __detail
> >> -
> >> -  template<typename _Tp>
> >> -    inline constexpr bool __platform_wait_uses_type
> >> -#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
> >> -      = is_scalar_v<_Tp>
> >> -       && ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
> >> -       && (alignof(_Tp) >= __detail::__platform_wait_alignment));
> >> -#else
> >> -      = false;
> >> +  template<typename>
> >> +    inline constexpr bool __platform_wait_uses_type = false;
> >>  #endif
> >>
> >>    namespace __detail
> >> @@ -105,10 +104,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>         return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) == 0;
> >>        }
> >>
> >> -    // lightweight std::optional<__platform_wait_t>
> >> +    // TODO: this needs to be false for types with padding, e.g.
> __int20.
> >>
> >I do not understand why this needs to be required. This funciton is used
> >only via atomic
> >or atomic_ref. For atomic, we can guarantee that the type has padding
> bytes
> >cleared..
>
> When we pass the "old" value to the kernel, I don't know if we can
> guarantee that cleared padding bits remain cleared. We can be sure
> that the value inside the std::atomic has padding bits cleared to
> zero, but the std::atomic<T>::wait(T) function takes its argument by
> value, and also passes that to the kernel by value.
>
> I think it's safer to just do proxy waits for types with padding.
>
> >> +    // TODO: should this be true only for integral, enum, and pointer
> >> types?
> >>
> >What I think is missing here is alignment. I assume that any platform wait
> >may use
> >bits that are clear due the alignment of platform wait type for some
> >internal state.
>
> I don't understand what you mean here. The platform wait type is just
> an integer, so all bits are part of the value representation (if we
> ignore weird types like __int20 on msp430). A pointer to that type
> might have "unused" low bits due to alignment, but we form that
> pointer by taking the address of the atomic object, and we pass it to
> the library, and we pass it to the OS wake/wait operation. I don't
> think there's any opportunity for anything to ever set (or read) bits
> for internal state.
>
> >Or we are going to check is_sufficiently_aligment in cc file, and use
> >different kind of
> >wait depending on the object?
>
> Yes, we can check the alignment of the __addr pointer when deciding
> whether to use the platform wait operation. But the objects that
> __addr points to are always the value inside a std::atomic or the
> object that std::atomic_ref::_M_ptr points to, and we already know
> that those objects are suitably aligned for atomic ops. It's possible
> that some platform wait has stricter alignment requirements (e.g. you
> can do atomic add/subtract if it's 4-byte aligned by you can't
> wait/notify unless it's 16-byte aligned) but I don't know of any cases
> where that's true (for macOS the object only needs to be aligned to
> its size, which is already guaranteed by std::atomic and required by
> std::atomic_ref, for FreeBSD it must be aligned to "ABI-mandated
> alignment" which isn't very clear to me, but I don't think it means
> "aligned more than the type usually requires").
>
> If we want to use some new wait/wake primitive in future and it
> requires stricter alignment, we can just check the alignment of __addr
> in the new use_proxy_wait function in src/c++20/atomic.cc. That
> function doesn't currently take the original __addr pointer, because
> it doesn't currently need it, but it's not a public function so we can
> just change it any time we need to. I can change it to take the
> __addr argument now, marked [[maybe_unused]], if that would make it
> clearer that the decision can (in theory) depend on the address.
>
> But I don't think the __waitable concept should consider alignment.
>
> The objects we do atomic wait/notify on are already aligned suitably
> for atomic ops, and the address of any individual object can be used
> to decide whether it's suitably aligned for wait/notify, it doesn't
> need to be a static property of the type.
>
> The point of this __waitable check is "does this type look like
> something that we should consider using as a futex", which for linux
> today means "can it safely be reinterpret_cast to int& for the
> kernel's purposes" (where I'm relying on the fact that the kernel can
> happily ignore type-based alias analysis and just cares about the
> bits!)
>
> We should probably not treat struct { char c; /*padding;*/ short i; }
> as a futex, but we can treat an enum as a futex, because as far as the
> kernel's futex syscall is concerned, it's just some bits and it either
> equals some value or it doesn't.
>
> >But I think, we can later safely extend or change what is waitable (except
> >extending it past 8 bytes),
> >as if we start putting _M_obj_size to non zero, impl may use platform
> wait.
> >
> >So, I will go with safe option is integral, pointer or enum, This would
>
> That means always using a proxy wait for float and double, but I think
> that's OK.
>
> >also give
> >us no padding guarantee, I assume?
>
> __int20 is integral and has padding bits. But it's a very special case
> and only present on one or two non-mainstream targets.
>
> >> +    template<typename _Tp>
> >> +      concept __waitable
> >> +       = is_scalar_v<_Tp> && (sizeof(_Tp) <= sizeof(__UINT64_TYPE__));
> >> +
> >> +    // Storage for up to 64 bits of value, should be considered opaque
> >> bits.
> >> +    using __wait_value_type = __UINT64_TYPE__;
> >> +
> >> +    // lightweight std::optional<__wait_value_type>
> >>      struct __wait_result_type
> >>      {
> >> -      __platform_wait_t _M_val;
> >> +      __wait_value_type _M_val;
> >>        unsigned char _M_has_val : 1; // _M_val value was loaded before
> >> return.
> >>        unsigned char _M_timeout : 1; // Waiting function ended with
> >> timeout.
> >>        unsigned char _M_unused : 6;  // padding
> >> @@ -143,8 +151,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>      {
> >>        __wait_flags _M_flags;
> >>        int _M_order = __ATOMIC_ACQUIRE;
> >> -      __platform_wait_t _M_old = 0;
> >> +      __wait_value_type _M_old{};
> >>        void* _M_wait_state = nullptr;
> >> +      const void* _M_obj = nullptr;  // The address of the object to
> wait
> >> on.
> >> +      unsigned char _M_obj_size = 0; // The size of that object.
> >>
> >>        // Test whether _M_flags & __flags is non-zero.
> >>        bool
> >> @@ -162,36 +172,48 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>         explicit
> >>         __wait_args(const _Tp* __addr, bool __bare_wait = false)
> noexcept
> >>         : __wait_args_base{ _S_flags_for(__addr, __bare_wait) }
> >> -       { }
> >> +       {
> >> +         _M_obj = __addr; // Might be replaced by _M_setup_wait
> >> +         if constexpr (__waitable<_Tp>)
> >> +           // __wait_impl might be able to wait directly on __addr
> >> +           // instead of using a proxy, depending on its size.
> >> +           _M_obj_size = sizeof(_Tp);
> >> +       }
> >>
> >>        __wait_args(const __platform_wait_t* __addr, __platform_wait_t
> >> __old,
> >>                   int __order, bool __bare_wait = false) noexcept
> >> -      : __wait_args_base{ _S_flags_for(__addr, __bare_wait), __order,
> >> __old }
> >> -      { }
> >> +      : __wait_args(__addr, __bare_wait)
> >> +      {
> >> +       _M_order = __order;
> >> +       _M_old = __old;
> >> +      }
> >>
> >>        __wait_args(const __wait_args&) noexcept = default;
> >>        __wait_args& operator=(const __wait_args&) noexcept = default;
> >>
> >> -      template<typename _ValFn,
> >> -              typename _Tp =
> decay_t<decltype(std::declval<_ValFn&>()())>>
> >> +      template<typename _Tp, typename _ValFn>
> >>         _Tp
> >> -       _M_setup_wait(const void* __addr, _ValFn __vfn,
> >> +       _M_setup_wait(const _Tp* __addr, _ValFn __vfn,
> >>                       __wait_result_type __res = {})
> >>         {
> >> +         static_assert(is_same_v<_Tp, decay_t<decltype(__vfn())>>);
> >> +
> >>           if constexpr (__platform_wait_uses_type<_Tp>)
> >>             {
> >> -             // If the wait is not proxied, the value we check when
> >> waiting
> >> -             // is the value of the atomic variable itself.
> >> +             // If we know for certain that this type can be waited on
> >> +             // efficiently using something like a futex syscall,
> >> +             // then we can avoid the overhead of _M_setup_wait_impl
> >> +             // and just load the value and store it into _M_old.
> >>
> >> -             if (__res._M_has_val) // The previous wait loaded a recent
> >> value.
> >> +             if (__res._M_has_val) // A previous wait loaded a recent
> >> value.
> >>                 {
> >>                   _M_old = __res._M_val;
> >> -                 return __builtin_bit_cast(_Tp, __res._M_val);
> >> +                 return _S_bit_cast<_Tp>(_M_old);
> >>
> >I am not sure if I understand which branch of _S_bit_cast this would use,
> >we have neither (sizeof(To) == sizeof(From) i.e. 4 vs 8) and
> >neither sizeof(_From) == sizeof(__UINT32_TYPE__).
>
> Oops, yes, that's a bug in _S_bit_cast. It was supposed to also handle
> the case where we only want some of the bits in __from. Apparently I
> lost that branch of _S_bit_cast in some revision of the patch.
>
> >I would much more prefer if this would be dome as:
> >return __builtin_bit_cast(_Tp, static_cast<__UINT32_TYPE__>(_M_old));
>
> That assumes that only 32 bits of the values are used, but the header
> doesn't know that.
>
> >
> >>                 }
> >>               else // Load the value from __vfn
> >>                 {
> >> -                 _Tp __val = __vfn();
> >> -                 _M_old = __builtin_bit_cast(__platform_wait_t, __val);
> >> +                 auto __val = __vfn();
> >> +                 _M_old = _S_bit_cast<__wait_value_type>(__val);
> >>
> >And here:
> >      _M_ old = __builtin_bit_cast(__UINT32_TYPE__, __val);
> >However, instead of __UINT32_TYPE__, we should use
> >make_unsinged<__platform_wait_t>.
>
> No, again, this assumes that only sizeof(__platform_wait_t) bytes are
> useful. That is the design mistake in the current code. It assumes
> that we will never do non-proxy wait on something larger than today's
> __platform_wait_t. For a future version of linux, we will not change
> __platform_wait_t (it will still be int) but we might start doing
> non-proxy wait on 64-bit values. In that case, all bits of _M_val and
> _M_old would be used, and casting to uint32_t in the header would
> truncate the values.
>
Yes, but allo of this code is inside __platform_wait_uses_type<_Tp>,
constexpr branch, where we skip overhead of _M_setup_wait impl.
The __platform_wait_uses_type can be only true for types equal to
__platform_wait_t, and if we do not change that value, we will not change
what types go into this branch, so it will always be 4 bytes.
>
>
>
> >>                   return __val;
> >>                 }
> >>             }
> >> @@ -199,16 +221,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>             {
> >>               if (__res._M_has_val) // The previous wait loaded a recent
> >> value.
> >>                 _M_old = __res._M_val;
> >> -             else // Load _M_ver from the proxy (must happen before
> >> __vfn()).
> >> -               _M_load_proxy_wait_val(__addr);
> >> +             else // Let the library decide how to setup the wait.
> >> +               {
> >> +                 // Set _M_obj to the address to be waited on (either
> >> __addr
> >> +                 // or a proxy) and load its current value into _M_old.
> >> +                 _M_setup_wait_impl(__addr);
> >> +               }
> >>               return __vfn();
> >>             }
> >>         }
> >>
> >>      private:
> >> -      // Populates _M_wait_state and _M_old from the proxy for __addr.
> >> +      template<typename _To, typename _From>
> >> +       [[__gnu__::__always_inline__]]
> >> +       static _To
> >> +       _S_bit_cast(_From __from)
> >> +       {
> >> +         if constexpr (sizeof(_To) == sizeof(_From))
> >> +           return __builtin_bit_cast(_To, __from);
> >> +         else if constexpr (sizeof(_From) == sizeof(__UINT32_TYPE__))
> >> +           return __builtin_bit_cast(__UINT32_TYPE__, __from);
> >> +         else
> >> +           __builtin_unreachable();
> >> +       }
> >> +
> >> +      // Populates _M_wait_state and _M_old appropriately for __addr.
> >>        void
> >> -      _M_load_proxy_wait_val(const void* __addr);
> >> +      _M_setup_wait_impl(const void* __addr);
> >>
> >>        template<typename _Tp>
> >>         static constexpr __wait_flags
> >> @@ -218,8 +257,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>           __wait_flags __res = __abi_version | __do_spin;
> >>           if (!__bare_wait)
> >>             __res |= __track_contention;
> >> -         if constexpr (!__platform_wait_uses_type<_Tp>)
> >> -           __res |= __proxy_wait;
> >> +         // if constexpr (!__platform_wait_uses_type<_Tp>)
> >> +         // __res |= __proxy_wait;
> >>
> >Remove this simply.
>
> Yes, I wanted to get the patch sent in stage 1 so there is some
> clean-up still needed. The __proxy_wait enumerator should be removed
> entirely.
>
> >>           return __res;
> >>         }
> >>      };
> >> @@ -255,7 +294,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>                           __detail::__platform_wait_t __old,
> >>                           int __order, bool __bare_wait = false)
> >>    {
> >> -#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
> >> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
> >>      __glibcxx_assert(false); // This function can't be used for proxy
> >> wait.
> >>  #endif
> >>      __detail::__wait_args __args{ __addr, __old, __order, __bare_wait
> };
> >> diff --git a/libstdc++-v3/src/c++20/atomic.cc
> >> b/libstdc++-v3/src/c++20/atomic.cc
> >> index e280045b619d..d5fbd10ed11f 100644
> >> --- a/libstdc++-v3/src/c++20/atomic.cc
> >> +++ b/libstdc++-v3/src/c++20/atomic.cc
> >> @@ -38,14 +38,7 @@
> >>  # include <syscall.h>
> >>  # include <bits/functexcept.h>
> >>  # include <sys/time.h>
> >> -#endif
> >> -
> >> -#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
> >> -# ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> -// __waitable_state assumes that we consistently use the same
> >> implementation
> >> -// (i.e. futex vs mutex+condvar) for timed and untimed waiting.
> >> -#  error "This configuration is not currently supported"
> >> -# endif
> >> +# define _GLIBCXX_HAVE_PLATFORM_WAIT 1
> >>  #endif
> >>
> >>  #pragma GCC diagnostic ignored "-Wmissing-field-initializers"
> >> @@ -107,7 +100,7 @@ namespace
> >>      // Count of threads blocked waiting on this state.
> >>      alignas(_S_align) __platform_wait_t _M_waiters = 0;
> >>
> >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> +#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
> >>      mutex _M_mtx;
> >>
> >>      // This type meets the Cpp17BasicLockable requirements.
> >> @@ -123,7 +116,7 @@ namespace
> >>      // use this for waiting and notifying functions instead.
> >>      alignas(_S_align) __platform_wait_t _M_ver = 0;
> >>
> >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> +#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
> >>      __condvar _M_cv;
> >>  #endif
> >>
> >> @@ -216,17 +209,19 @@ namespace
> >>    __spin_impl(const __platform_wait_t* __addr, const __wait_args_base&
> >> __args)
> >>    {
> >>      __platform_wait_t __val{};
> >> +    __wait_value_type wval;
> >>      for (auto __i = 0; __i < __atomic_spin_count; ++__i)
> >>        {
> >>         __atomic_load(__addr, &__val, __args._M_order);
> >> -       if (__val != __args._M_old)
> >> -         return { ._M_val = __val, ._M_has_val = true, ._M_timeout =
> >> false };
> >> +       wval = static_cast<__wait_value_type>(__val);
> >> +       if (wval != __args._M_old)
> >> +         return { ._M_val = wval, ._M_has_val = true, ._M_timeout =
> false
> >> };
> >>         if (__i < __atomic_spin_count_relax)
> >>           __thread_relax();
> >>         else
> >>           __thread_yield();
> >>        }
> >> -    return { ._M_val = __val, ._M_has_val = true, ._M_timeout = true };
> >> +    return { ._M_val = wval, ._M_has_val = true, ._M_timeout = true };
> >>    }
> >>
> >>    inline __waitable_state*
> >> @@ -237,32 +232,58 @@ namespace
> >>      return static_cast<__waitable_state*>(args._M_wait_state);
> >>    }
> >>
> >> +  [[gnu::always_inline]]
> >> +  inline bool
> >> +  use_proxy_wait(const __wait_args_base& args)
> >>
> >This needs to take the address that we are waiting on and check if it's
> >sufficiently aligned
> >to __platform_wait_aligment.
>
> Maybe in future yes, but today the only platform where use_proxy_wait
> returns false is Linux, and we know that all 4-byte objects in
> std::atomic or std::atomic_ref are also 4-byte aligned (or should be,
> once PR 122410 gets fixed). So I don't agree that it *needs* to take
> the address. It doesn't need to today, because it doesn't use it, and
> if it did a runtime check that the pointer is aligned to at least
> 4-bytes, that would be wasted cycles today because it's already
> guaranteed by std::atomic and std::atomic_ref.
>
> But I can add the address argument, marked [[maybe_unused]] for now.
>
> >However, I do not think we should check unit64_t at all, as we do not have
> >such platform
> >wait yet, and we will not change __platform_wait_uses_type in that case.
>
> Yes, the 64-bit check could be removed and added back when macOS
> support gets added here (I am already working on that patch as a proof
> of concept for Iain to test, but it might not be ready today).
>
> >>
> >> +  {
> >> +    if constexpr (__platform_wait_uses_type<uint32_t>)
> >> +      if (args._M_obj_size == sizeof(uint32_t))
> >> +       return false;
> >> +
> >> +    if constexpr (__platform_wait_uses_type<uint64_t>)
> >> +      if (args._M_obj_size == sizeof(uint64_t))
> >> +       return false;
> >> +
> >> +    // Use proxy wait for everything else:
> >> +    return true;
> >> +  }
> >> +
> >>  } // namespace
> >>
> >> -// Called for a proxy wait
> >> +// Set _M_wait_state if using proxy wait, or caller wants contention
> >> tracking.
> >> +// Set _M_obj to &_M_wait_state->_M_ver if using proxy wait.
> >> +// Load the current value from _M_obj and store in _M_val.
> >>  void
> >> -__wait_args::_M_load_proxy_wait_val(const void* addr)
> >> +__wait_args::_M_setup_wait_impl(const void* addr)
> >>
> >I would use _M_obj_size to indicate if proxy wait is used for the actual
> >await.
>
> As you noted in the follow-up mail, that doesn't work, because
> _M_setup_wait_impl is never called in the thread that calls
> __notify_impl.
>
> I did consider adding a _M_setup_notify_impl function, which would
> adjust _M_obj and could set _M_obj_size=0 for proxy waits. But there
> would be no benefit. Unlike std::atomic::wait, there is no loop in
> std::atomic::notify, it just calls __notify_impl and returns. So if we
> added a separate _M_setup_notify step that needs to be run before
> __notify_impl then we're just doing two PLT calls into libstdc++.so
> instead of one PLT call. That would just makes things slower.
>
> So __notify_impl needs to repeat the "does this combination of __addr
> and _M_obj_size use a proxy wait" check directly, and that's what the
> use_proxy_wait function does. There's no point modifying the
> __wait_args to "remember" that decision for subsequent calls to
> __notify_impl, because there will never be another call to
> __notify_impl using the same __wait_args. The next call will use a new
> __wait_args object, constructed by a different caller.
>
> >>  {
> >> -  // __glibcxx_assert( *this & __wait_flags::__proxy_wait );
> >> +  if (!use_proxy_wait(*this))
> >
> >Do; _M_obj_size > 0 &&  !use_proxy_wait(*this)
>
> _M_obj_size is available in use_proxy_wait, so I would prefer to check
> it in use_proxy_wait.
>
> >> +    {
> >> +      // We can wait on this address directly.
> >> +      __glibcxx_assert(_M_obj == addr);
> >>
> >This funciton is one that is expected to set _M_obj, so it seems incorrect
> >to assert it here.
>
> As noted in your follow-up mail, that's not true. It's set in the
> __wait_args ctor, and only set here for a proxy wait. This assertion
> is ensuring that it really was set in the ctor.
>
> >>
> >> -  // We always need a waitable state for proxy waits.
> >> +      int val;
> >> +      __atomic_load(static_cast<const int*>(addr), &val,
> >> __ATOMIC_ACQUIRE);
> >> +      _M_old = val;
> >> +
> >> +      return;
> >> +    }
> >> +
> >> +  // This will be a proxy wait, so get a waitable state.
> >>    auto state = set_wait_state(addr, *this);
> >>
> >> +  // The address we will wait on is the version count of the waitable
> >> state:
> >> +  _M_obj = &state->_M_ver;
> >>
> >+
> >>    // Read the value of the _M_ver counter.
> >> -  __atomic_load(&state->_M_ver, &_M_old, __ATOMIC_ACQUIRE);
> >> +  __platform_wait_t __old;
> >> +  __atomic_load(&state->_M_ver, &__old, __ATOMIC_ACQUIRE);
> >>
> >Nit pick, but I think this could be just:
> >    __platform_wait_t __old = __atomic_load_n(&state->_M_ver,
> >__ATOMIC_ACQUIRE);
>
> Yes, I think I got confused by the fact that Clang doesn't support
> __atomic_load_4 and decided that it doesn't support __atomic_load_n
> either.
>
> >
> >> +  _M_old = __old;
> >>
> >Or just:
> >  _M_old = __atomic_load_n(&state->_M_ver, __ATOMIC_ACQUIRE);
>
> Yes, I'll do that.
>
> >>  }
> >>
> >>  __wait_result_type
> >> -__wait_impl(const void* __addr, __wait_args_base& __args)
> >> +__wait_impl([[maybe_unused]] const void* __addr, __wait_args_base&
> __args)
> >>  {
> >> -  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
> >> -
> >> -  const __platform_wait_t* __wait_addr;
> >> -
> >> -  if (__args & __wait_flags::__proxy_wait)
> >> -    __wait_addr = &__state->_M_ver;
> >> -  else
> >> -    __wait_addr = static_cast<const __platform_wait_t*>(__addr);
> >> +  auto* __wait_addr = static_cast<const
> >> __platform_wait_t*>(__args._M_obj);
> >>
> >>    if (__args & __wait_flags::__do_spin)
> >>      {
> >> @@ -286,6 +307,7 @@ __wait_impl(const void* __addr, __wait_args_base&
> >> __args)
> >>    __atomic_load(__wait_addr, &__val, __args._M_order);
> >>    if (__val == __args._M_old)
> >>      {
> >> +      auto __state =
> static_cast<__waitable_state*>(__args._M_wait_state);
> >>        __state->_M_cv.wait(__state->_M_mtx);
> >>        return { ._M_val = __val, ._M_has_val = false, ._M_timeout =
> false
> >> };
> >>      }
> >> @@ -294,24 +316,40 @@ __wait_impl(const void* __addr, __wait_args_base&
> >> __args)
> >>  }
> >>
> >>  void
> >> -__notify_impl(const void* __addr, [[maybe_unused]] bool __all,
> >> +__notify_impl([[maybe_unused]] const void* __addr, [[maybe_unused]]
> bool
> >> __all,
> >>               const __wait_args_base& __args)
> >>  {
> >> -  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
> >> -  if (!__state)
> >> -    __state = &__waitable_state::_S_state_for(__addr);
> >> +  const bool __track_contention = __args &
> >> __wait_flags::__track_contention;
> >> +  const bool proxy_wait = use_proxy_wait(__args);
> >>
> >And here we could just check if _M_obj_size == 0, as this would indication
> >of proxy wait.
> >
> >>
> >> -  [[maybe_unused]] const __platform_wait_t* __wait_addr;
> >> +  [[maybe_unused]] auto* __wait_addr
> >> +    = static_cast<const __platform_wait_t*>(__addr);
> >> +
> >> +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
> >> +  // Check whether it would be a non-proxy wait for this object.
> >> +  // This condition must match the one in _M_setup_wait_impl to ensure
> >> that
> >> +  // the address used for the notify matches the one used for the wait.
> >> +  if (!proxy_wait)
> >> +    {
> >> +      if (__track_contention)
> >> +       if (!__waitable_state::_S_state_for(__addr)._M_waiting())
> >> +         return;
> >> +
> >> +      __platform_notify(__wait_addr, __all);
> >> +      return;
> >> +    }
> >> +#endif
> >> +
> >> +  // Either a proxy wait or we don't have platform wait/wake
> primitives.
> >> +
> >> +  auto __state = &__waitable_state::_S_state_for(__addr);
> >>
> >>    // Lock mutex so that proxied waiters cannot race with incrementing
> >> _M_ver
> >>    // and see the old value, then sleep after the increment and
> >> notify_all().
> >>    lock_guard __l{ *__state };
> >>
> >> -  if (__args & __wait_flags::__proxy_wait)
> >> +  if (proxy_wait)
> >>      {
> >> -      // Waiting for *__addr is actually done on the proxy's _M_ver.
> >> -      __wait_addr = &__state->_M_ver;
> >> -
> >>        // Increment _M_ver so that waiting threads see something
> changed.
> >>        // This has to be atomic because the load in
> _M_load_proxy_wait_val
> >>        // is done without the mutex locked.
> >> @@ -322,11 +360,11 @@ __notify_impl(const void* __addr, [[maybe_unused]]
> >> bool __all,
> >>        // they can re-evaluate their conditions to see if they should
> >>        // stop waiting or should wait again.
> >>        __all = true;
> >> -    }
> >> -  else // Use the atomic variable's own address.
> >> -    __wait_addr = static_cast<const __platform_wait_t*>(__addr);
> >>
> >> -  if (__args & __wait_flags::__track_contention)
> >> +      __wait_addr = &__state->_M_ver;
> >> +    }
> >> +
> >> +  if (__track_contention)
> >>      {
> >>        if (!__state->_M_waiting())
> >>         return;
> >> @@ -366,7 +404,7 @@ __platform_wait_until(const __platform_wait_t*
> __addr,
> >>  }
> >>  #endif // HAVE_LINUX_FUTEX
> >>
> >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> +#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
> >>  bool
> >>  __cond_wait_until(__condvar& __cv, mutex& __mx,
> >>                   const __wait_clock_t::time_point& __atime)
> >> @@ -381,7 +419,7 @@ __cond_wait_until(__condvar& __cv, mutex& __mx,
> >>      __cv.wait_until(__mx, __ts);
> >>    return __wait_clock_t::now() < __atime;
> >>  }
> >> -#endif // ! HAVE_PLATFORM_TIMED_WAIT
> >> +#endif // ! HAVE_PLATFORM_WAIT
> >>
> >>  // Unlike __spin_impl, does not always return _M_has_val == true.
> >>  // If the deadline has already passed then no fresh value is loaded.
> >> @@ -414,7 +452,9 @@ __spin_until_impl(const __platform_wait_t* __addr,
> >>             return __res;
> >>         }
> >>
> >> -      __atomic_load(__addr, &__res._M_val, __args._M_order);
> >> +      __platform_wait_t val;
> >> +      __atomic_load(__addr, &val, __args._M_order);
> >>
> >Same comment on __atomic_load_n here.
>
> Agreed.
>
> >> +      __res._M_val = val;
> >>        __res._M_has_val = true;
> >>        if (__res._M_val != __args._M_old)
> >>         {
> >> @@ -428,16 +468,11 @@ __spin_until_impl(const __platform_wait_t* __addr,
> >>  } // namespace
> >>
> >>  __wait_result_type
> >> -__wait_until_impl(const void* __addr, __wait_args_base& __args,
> >> +__wait_until_impl([[maybe_unused]] const void* __addr,
> __wait_args_base&
> >> __args,
> >>                   const __wait_clock_t::duration& __time)
> >>  {
> >>    const __wait_clock_t::time_point __atime(__time);
> >> -  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
> >> -  const __platform_wait_t* __wait_addr;
> >> -  if (__args & __wait_flags::__proxy_wait)
> >> -    __wait_addr = &__state->_M_ver;
> >> -  else
> >> -    __wait_addr = static_cast<const __platform_wait_t*>(__addr);
> >> +  auto* __wait_addr = static_cast<const
> >> __platform_wait_t*>(__args._M_obj);
> >>
> >>    if (__args & __wait_flags::__do_spin)
> >>      {
> >> @@ -448,9 +483,9 @@ __wait_until_impl(const void* __addr,
> >> __wait_args_base& __args,
> >>         return __res;
> >>      }
> >>
> >> -#ifdef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
> >>    if (__args & __wait_flags::__track_contention)
> >> -    set_wait_state(__addr, __args);
> >> +    set_wait_state(__addr, __args); // scoped_wait needs a
> >> __waitable_state
> >>    scoped_wait s(__args);
> >>    bool timeout = !__platform_wait_until(__wait_addr, __args._M_old,
> >> __atime);
> >>    return { ._M_val = __args._M_old, ._M_has_val = false, ._M_timeout =
> >> timeout };
> >> @@ -460,6 +495,7 @@ __wait_until_impl(const void* __addr,
> >> __wait_args_base& __args,
> >>    __atomic_load(__wait_addr, &__val, __args._M_order);
> >>    if (__val == __args._M_old)
> >>      {
> >> +      auto __state =
> static_cast<__waitable_state*>(__args._M_wait_state);
> >>        bool timeout = !__cond_wait_until(__state->_M_cv,
> __state->_M_mtx,
> >> __atime);
> >>        return { ._M_val = __val, ._M_has_val = false, ._M_timeout =
> >> timeout };
> >>      }
> >> --
> >> 2.51.1
> >>
> >>
>
>

Re: [PATCH] libstdc++: Future-proof C++20 atomic wait/notify

Reply via email to