On Mon, Nov 17, 2025 at 1:50 PM Jonathan Wakely <[email protected]> wrote:
> On Mon, 17 Nov 2025 at 10:10 +0100, Tomasz Kaminski wrote: > >On Sun, Nov 16, 2025 at 1:56 AM Jonathan Wakely <[email protected]> > wrote: > > > >> This will allow us to extend atomic waiting functions to support a > >> possible future 64-bit version of futex, as well as supporting > >> futex-like wait/wake primitives on other targets (e.g. macOS has > >> os_sync_wait_on_address and FreeBSD has _umtx_op). > >> > >> Before this change, the decision of whether to do a proxy wait or to > >> wait on the atomic variable itself was made in the header at > >> compile-time, which makes it an ABI property that would not have been > >> possible to change later. That would have meant that > >> std::atomic<uint64_t> would always have to do a proxy wait even if Linux > >> gains support for 64-bit futex2(2) calls at some point in the future. > >> The disadvantage of proxy waits is that several distinct atomic objects > >> can share the same proxy state, leading to contention between threads > >> even when they are not waiting on the same atomic object, similar to > >> false sharing. It also results in spurious wake-ups because doing a > >> notify on an atomic object that uses a proxy wait will wake all waiters > >> sharing the proxy. > >> > >> For types that are known to definitely not need a proxy wait (e.g. int > >> on Linux) the header can still choose a more efficient path at > >> compile-time. But for other types, the decision of whether to do a proxy > >> wait is deferred to runtime, inside the library internals. This will > >> make it possible for future versions of libstdc++.so to extend the set > >> of types which don't need to use proxy waits, without ABI changes. > >> > >> The way the change works is to stop using the __proxy_wait flag that was > >> set by the inline code in the headers. 
Instead the __wait_args struct > >> has an extra pointer member which the library internals populate with > >> either the address of the atomic object or the _M_ver counter in the > >> proxy state. There is also a new _M_obj_size member which stores the > >> size of the atomic object, so that the library can decide whether a > >> proxy is needed. So for example if Linux gains 64-bit futex support then > >> the library can decide not to use a proxy when _M_obj_size == 8. > >> Finally, the _M_old member of the __wait_args struct is changed to > >> uint64_t so that it has room to store 64-bit values, not just whatever > >> size the __platform_wait_t type is (which is a 32-bit int on Linux). > >> Similarly, the _M_val member of __wait_result_type changes to uint64_t > >> too. > >> > >> libstdc++-v3/ChangeLog: > >> > >> * config/abi/pre/gnu.ver: Adjust exports. > >> * include/bits/atomic_timed_wait.h > >> (_GLIBCXX_HAVE_PLATFORM_TIMED_WAIT): > >> Do not define this macro. > >> (__atomic_wait_address_until_v, __atomic_wait_address_for_v): > >> Guard assertions with #ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT. > >> * include/bits/atomic_wait.h (__platform_wait_uses_type): > >> Define separately for platforms with and without platform > >> wait. > >> (_GLIBCXX_HAVE_PLATFORM_WAIT): Do not define this macro. > >> (_GLIBCXX_UNKNOWN_PLATFORM_WAIT): Define new macro. > >> (__wait_value_type): New typedef. > >> (__wait_result_type): Change _M_val to __wait_value_type. > >> (__wait_args_base::_M_old): Change to __wait_value_type. > >> (__wait_args_base::_M_obj, __wait_args_base::_M_obj_size): New > >> data members. > >> (__wait_args::__wait_args): Set _M_obj and _M_obj_size on > >> construction. > >> (__wait_args::_M_setup_wait): Change void* parameter to deduced > >> type. Use _S_bit_cast instead of __builtin_bit_cast. > >> (__wait_args::_M_load_proxy_wait_val): Remove function, replace > >> with ... > >> (__wait_args::_M_setup_wait_impl): New function. 
> >> (__wait_args::_S_bit_cast): Wrapper for __builtin_bit_cast which > >> also supports conversion from 32-bit values. > >> (__wait_args::_S_flags_for): Do not set __proxy_wait flag. > >> (__atomic_wait_address_v): Guard assertions with #ifdef > >> _GLIBCXX_UNKNOWN_PLATFORM_WAIT. > >> * src/c++20/atomic.cc (_GLIBCXX_HAVE_PLATFORM_WAIT): Define here > >> instead of in header. Check _GLIBCXX_HAVE_PLATFORM_WAIT instead > >> of _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT. > >> (__spin_impl): Adjust for 64-bit __wait_args_base::_M_old. > >> (use_proxy_wait): New function. > >> (__wait_args::_M_load_proxy_wait_val): Replace with ... > >> (__wait_args::_M_setup_wait_impl): New function. Call > >> use_proxy_wait to decide at runtime whether to wait on the > >> pointer directly instead of using a proxy. If a proxy is needed, > >> set _M_obj to point to its _M_ver member. Adjust for change to > >> type of _M_old. > >> (__wait_impl): Wait on _M_obj unconditionally. > >> (__notify_impl): Call use_proxy_wait to decide whether to notify > >> on the address parameter or a proxy. > >> (__spin_until_impl): Adjust for change to type of _M_val. > >> (__wait_until_impl): Wait on _M_obj unconditionally. > >> --- > >> > >> Tested x86_64-linux, powerpc64le-linux, sparc-solaris. > >> > >A lot of comments below. > > > >> > >> I think this is an important change which I unfortunately didn't think of > >> until recently. > >> > >> This changes the exports from the shared library, but we're still in > >> stage 1 so I think that should be allowed (albeit unfortunate). Nobody > >> should be expecting GCC 16 to be stable yet. > >> > >> The __proxy_wait enumerator is now unused and could be removed. The > >> __abi_version enumerator could also be bumped to indicate the > >> incompatibility with earlier snapshots of GCC 16, but I don't think that > >> is needed. 
We could in theory keep the old symbol export > >> (__wait_args::_M_load_proxy_wait) and make it trap/abort if called, but > >> I'd prefer to just remove it and cause dynamic linker errors instead. > >> > >> There's a TODO in the header about which types should be allowed to take > >> the optimized paths (see the __waitable concept). For types where that's > >> true, if the size matches a futex then we'll use a futex, even if it's > >> actually an enum or floating-point type (or pointer on 32-bit targets). > >> I'm not sure if that's safe. > >> > >> > >> libstdc++-v3/config/abi/pre/gnu.ver | 3 +- > >> libstdc++-v3/include/bits/atomic_timed_wait.h | 12 +- > >> libstdc++-v3/include/bits/atomic_wait.h | 109 +++++++++----- > >> libstdc++-v3/src/c++20/atomic.cc | 140 +++++++++++------- > >> 4 files changed, 166 insertions(+), 98 deletions(-) > >> > >> diff --git a/libstdc++-v3/config/abi/pre/gnu.ver > >> b/libstdc++-v3/config/abi/pre/gnu.ver > >> index 2e48241d51f9..3c2bd4921730 100644 > >> --- a/libstdc++-v3/config/abi/pre/gnu.ver > >> +++ b/libstdc++-v3/config/abi/pre/gnu.ver > >> @@ -2553,7 +2553,8 @@ GLIBCXX_3.4.35 { > >> _ZNSt8__detail11__wait_implEPKvRNS_16__wait_args_baseE; > >> _ZNSt8__detail13__notify_implEPKvbRKNS_16__wait_args_baseE; > >> > >> > _ZNSt8__detail17__wait_until_implEPKvRNS_16__wait_args_baseERKNSt6chrono8durationI[lx]St5ratioIL[lx]1EL[lx]1000000000EEEE; > >> - _ZNSt8__detail11__wait_args22_M_load_proxy_wait_valEPKv; > >> + _ZNSt8__detail11__wait_args18_M_setup_wait_implEPKv; > >> + _ZNSt8__detail11__wait_args20_M_setup_notify_implEPKv; > >> > >> # std::chrono::gps_clock::now, tai_clock::now > >> _ZNSt6chrono9gps_clock3nowEv; > >> diff --git a/libstdc++-v3/include/bits/atomic_timed_wait.h > >> b/libstdc++-v3/include/bits/atomic_timed_wait.h > >> index 30f7ff616840..918a267d10eb 100644 > >> --- a/libstdc++-v3/include/bits/atomic_timed_wait.h > >> +++ b/libstdc++-v3/include/bits/atomic_timed_wait.h > >> @@ -75,14 +75,6 @@ 
_GLIBCXX_BEGIN_NAMESPACE_VERSION > >> return chrono::ceil<__w_dur>(__atime); > >> } > >> > >> -#ifdef _GLIBCXX_HAVE_LINUX_FUTEX > >> -#define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > >> -#else > >> -// define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT and implement > >> __platform_wait_until > >> -// if there is a more efficient primitive supported by the platform > >> -// (e.g. __ulock_wait) which is better than pthread_cond_clockwait. > >> -#endif // ! HAVE_LINUX_FUTEX > >> - > >> __wait_result_type > >> __wait_until_impl(const void* __addr, __wait_args_base& __args, > >> const __wait_clock_t::duration& __atime); > >> @@ -156,7 +148,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> const chrono::time_point<_Clock, > _Dur>& > >> __atime, > >> bool __bare_wait = false) noexcept > >> { > >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > >> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT > >> __glibcxx_assert(false); // This function can't be used for proxy > >> wait. > >> #endif > >> __detail::__wait_args __args{ __addr, __old, __order, > __bare_wait }; > >> @@ -208,7 +200,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> const chrono::duration<_Rep, _Period>& > >> __rtime, > >> bool __bare_wait = false) noexcept > >> { > >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > >> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT > >> > >This name really reads strange, and sounds like something with "TODO". > >I think _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT was just OK name, even > >if it was not used directly. > > The rationale for the new name is that when the header gets > preprocessed we don't yet know if the libstdc++ shared lib that will > be used at runtime has a platform wait function available. A > translation unit might get compiled with GCC 16 and so there is no > known platform wait for macOS, but then at runtime it might link to > the libstdc++.dylib from GCC 17 which uses ulock_wait. 
> > And the reason for getting rid of HAVE_PLATFORM_TIMED_WAIT is twofold: > - I don't think it makes sense to have separate HAVE_PLATFORM_WAIT and > HAVE_PLATFORM_TIMED_WAIT macros. I doubt any target is going to > support a futex-like operation that doesn't support a timeout (and > if there is such an operation, we should just not use it at all). > - The macro saying whether we have a platform wait operation is now > only defined inside the library, not in the header. It's not > something that the header needs to know (or can know). > > But I will try to improve the name of UNKNOWN_PLATFORM_WAIT. > > >> __glibcxx_assert(false); // This function can't be used for proxy > >> wait. > >> #endif > >> __detail::__wait_args __args{ __addr, __old, __order, > __bare_wait }; > >> diff --git a/libstdc++-v3/include/bits/atomic_wait.h > >> b/libstdc++-v3/include/bits/atomic_wait.h > >> index 95151479c120..49369419d6a6 100644 > >> --- a/libstdc++-v3/include/bits/atomic_wait.h > >> +++ b/libstdc++-v3/include/bits/atomic_wait.h > >> @@ -45,35 +45,34 @@ > >> namespace std _GLIBCXX_VISIBILITY(default) > >> { > >> _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> +#if defined _GLIBCXX_HAVE_LINUX_FUTEX > >> namespace __detail > >> { > >> -#ifdef _GLIBCXX_HAVE_LINUX_FUTEX > >> -#define _GLIBCXX_HAVE_PLATFORM_WAIT 1 > >> using __platform_wait_t = int; > >> inline constexpr size_t __platform_wait_alignment = 4; > >> + } > >> + template<typename _Tp> > >> + inline constexpr bool __platform_wait_uses_type > >> + = is_scalar_v<_Tp> && sizeof(_Tp) == sizeof(int) && alignof(_Tp) > >= > >> 4; > >> #else > >> +# define _GLIBCXX_UNKNOWN_PLATFORM_WAIT 1 > >> // define _GLIBCX_HAVE_PLATFORM_WAIT and implement __platform_wait() > >> // and __platform_notify() if there is a more efficient primitive > >> supported > >> // by the platform (e.g. __ulock_wait()/__ulock_wake()) which is better > >> than > >> // a mutex/condvar based wait. 
> >> + namespace __detail > >> + { > >> # if ATOMIC_LONG_LOCK_FREE == 2 > >> using __platform_wait_t = unsigned long; > >> # else > >> using __platform_wait_t = unsigned int; > >> # endif > >> inline constexpr size_t __platform_wait_alignment > >> - = __alignof__(__platform_wait_t); > >> -#endif > >> + = sizeof(__platform_wait_t) < __alignof__(__platform_wait_t) > >> + ? __alignof__(__platform_wait_t) : sizeof(__platform_wait_t); > >> } // namespace __detail > >> - > >> - template<typename _Tp> > >> - inline constexpr bool __platform_wait_uses_type > >> -#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > >> - = is_scalar_v<_Tp> > >> - && ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t)) > >> - && (alignof(_Tp) >= __detail::__platform_wait_alignment)); > >> -#else > >> - = false; > >> + template<typename> > >> + inline constexpr bool __platform_wait_uses_type = false; > >> #endif > >> > >> namespace __detail > >> @@ -105,10 +104,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) == 0; > >> } > >> > >> - // lightweight std::optional<__platform_wait_t> > >> + // TODO: this needs to be false for types with padding, e.g. > __int20. > >> > >I do not understand why this needs to be required. This function is used > >only via atomic > >or atomic_ref. For atomic, we can guarantee that the type has padding > bytes > >cleared. > > When we pass the "old" value to the kernel, I don't know if we can > guarantee that cleared padding bits remain cleared. We can be sure > that the value inside the std::atomic has padding bits cleared to > zero, but the std::atomic<T>::wait(T) function takes its argument by > value, and also passes that to the kernel by value. > > I think it's safer to just do proxy waits for types with padding. > > >> + // TODO: should this be true only for integral, enum, and pointer > >> types? > >> > >What I think is missing here is alignment. 
I assume that any platform wait > >may use > >bits that are clear due to the alignment of the platform wait type for some > >internal state. > > I don't understand what you mean here. The platform wait type is just > an integer, so all bits are part of the value representation (if we > ignore weird types like __int20 on msp430). A pointer to that type > might have "unused" low bits due to alignment, but we form that > pointer by taking the address of the atomic object, and we pass it to > the library, and we pass it to the OS wake/wait operation. I don't > think there's any opportunity for anything to ever set (or read) bits > for internal state. > > >Or are we going to check if it's sufficiently aligned in the cc file, and use > >a different kind of > >wait depending on the object? > > Yes, we can check the alignment of the __addr pointer when deciding > whether to use the platform wait operation. But the objects that > __addr points to are always the value inside a std::atomic or the > object that std::atomic_ref::_M_ptr points to, and we already know > that those objects are suitably aligned for atomic ops. It's possible > that some platform wait has stricter alignment requirements (e.g. you > can do atomic add/subtract if it's 4-byte aligned but you can't > wait/notify unless it's 16-byte aligned) but I don't know of any cases > where that's true (for macOS the object only needs to be aligned to > its size, which is already guaranteed by std::atomic and required by > std::atomic_ref, for FreeBSD it must be aligned to "ABI-mandated > alignment" which isn't very clear to me, but I don't think it means > "aligned more than the type usually requires"). 
That > function doesn't currently take the original __addr pointer, because > it doesn't currently need it, but it's not a public function so we can > just change it any time we need to. I can change it to take the > __addr argument now, marked [[maybe_unused]], if that would make it > clearer that the decision can (in theory) depend on the address. > > But I don't think the __waitable concept should consider alignment. > > The objects we do atomic wait/notify on are already aligned suitably > for atomic ops, and the address of any individual object can be used > to decide whether it's suitably aligned for wait/notify, it doesn't > need to be a static property of the type. > > The point of this __waitable check is "does this type look like > something that we should consider using as a futex", which for Linux > today means "can it safely be reinterpret_cast to int& for the > kernel's purposes" (where I'm relying on the fact that the kernel can > happily ignore type-based alias analysis and just cares about the > bits!) > > We should probably not treat struct { char c; /*padding;*/ short i; } > as a futex, but we can treat an enum as a futex, because as far as the > kernel's futex syscall is concerned, it's just some bits and it either > equals some value or it doesn't. > > >But I think we can later safely extend or change what is waitable (except > >extending it past 8 bytes), > >as if we start setting _M_obj_size to non-zero, the impl may use platform > wait. > > > >So, I will go with the safe option: integral, pointer or enum. This would > > That means always using a proxy wait for float and double, but I think > that's OK. > > >also give > >us no padding guarantee, I assume? > > __int20 is integral and has padding bits. But it's a very special case > and only present on one or two non-mainstream targets. 
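For the "integral, enum, pointer only" option being discussed, a sketch of what the restricted check could look like (untested; hypothetical name __waitable_strict to avoid confusion with the patch's __waitable, written as a variable template rather than a concept so it's not tied to C++20):

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stricter form of the __waitable check: only integral,
// enum, and pointer types of at most 8 bytes. This excludes float and
// double (they would always take the proxy-wait path) and, on mainstream
// targets, types with padding bits (ignoring oddities like __int20).
template<typename _Tp>
inline constexpr bool __waitable_strict
  = (std::is_integral_v<_Tp> || std::is_enum_v<_Tp> || std::is_pointer_v<_Tp>)
    && sizeof(_Tp) <= sizeof(std::uint64_t);

enum class Color : int { red }; // hypothetical example type

static_assert(__waitable_strict<int>);
static_assert(__waitable_strict<Color>);   // enums are just bits to futex
static_assert(__waitable_strict<void*>);
static_assert(!__waitable_strict<float>);  // proxy wait for floating-point
static_assert(!__waitable_strict<double>);

int main() { }
```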
> > >> + template<typename _Tp> > >> + concept __waitable > >> + = is_scalar_v<_Tp> && (sizeof(_Tp) <= sizeof(__UINT64_TYPE__)); > >> + > >> + // Storage for up to 64 bits of value, should be considered opaque > >> bits. > >> + using __wait_value_type = __UINT64_TYPE__; > >> + > >> + // lightweight std::optional<__wait_value_type> > >> struct __wait_result_type > >> { > >> - __platform_wait_t _M_val; > >> + __wait_value_type _M_val; > >> unsigned char _M_has_val : 1; // _M_val value was loaded before > >> return. > >> unsigned char _M_timeout : 1; // Waiting function ended with > >> timeout. > >> unsigned char _M_unused : 6; // padding > >> @@ -143,8 +151,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> { > >> __wait_flags _M_flags; > >> int _M_order = __ATOMIC_ACQUIRE; > >> - __platform_wait_t _M_old = 0; > >> + __wait_value_type _M_old{}; > >> void* _M_wait_state = nullptr; > >> + const void* _M_obj = nullptr; // The address of the object to > wait > >> on. > >> + unsigned char _M_obj_size = 0; // The size of that object. > >> > >> // Test whether _M_flags & __flags is non-zero. > >> bool > >> @@ -162,36 +172,48 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> explicit > >> __wait_args(const _Tp* __addr, bool __bare_wait = false) > noexcept > >> : __wait_args_base{ _S_flags_for(__addr, __bare_wait) } > >> - { } > >> + { > >> + _M_obj = __addr; // Might be replaced by _M_setup_wait > >> + if constexpr (__waitable<_Tp>) > >> + // __wait_impl might be able to wait directly on __addr > >> + // instead of using a proxy, depending on its size. 
> >> + _M_obj_size = sizeof(_Tp); > >> + } > >> > >> __wait_args(const __platform_wait_t* __addr, __platform_wait_t > >> __old, > >> int __order, bool __bare_wait = false) noexcept > >> - : __wait_args_base{ _S_flags_for(__addr, __bare_wait), __order, > >> __old } > >> - { } > >> + : __wait_args(__addr, __bare_wait) > >> + { > >> + _M_order = __order; > >> + _M_old = __old; > >> + } > >> > >> __wait_args(const __wait_args&) noexcept = default; > >> __wait_args& operator=(const __wait_args&) noexcept = default; > >> > >> - template<typename _ValFn, > >> - typename _Tp = > decay_t<decltype(std::declval<_ValFn&>()())>> > >> + template<typename _Tp, typename _ValFn> > >> _Tp > >> - _M_setup_wait(const void* __addr, _ValFn __vfn, > >> + _M_setup_wait(const _Tp* __addr, _ValFn __vfn, > >> __wait_result_type __res = {}) > >> { > >> + static_assert(is_same_v<_Tp, decay_t<decltype(__vfn())>>); > >> + > >> if constexpr (__platform_wait_uses_type<_Tp>) > >> { > >> - // If the wait is not proxied, the value we check when > >> waiting > >> - // is the value of the atomic variable itself. > >> + // If we know for certain that this type can be waited on > >> + // efficiently using something like a futex syscall, > >> + // then we can avoid the overhead of _M_setup_wait_impl > >> + // and just load the value and store it into _M_old. > >> > >> - if (__res._M_has_val) // The previous wait loaded a recent > >> value. > >> + if (__res._M_has_val) // A previous wait loaded a recent > >> value. > >> { > >> _M_old = __res._M_val; > >> - return __builtin_bit_cast(_Tp, __res._M_val); > >> + return _S_bit_cast<_Tp>(_M_old); > >> > >I am not sure if I understand which branch of _S_bit_cast this would use, > >we have neither (sizeof(To) == sizeof(From) i.e. 4 vs 8) and > >neither sizeof(_From) == sizeof(__UINT32_TYPE__). > > Oops, yes, that's a bug in _S_bit_cast. It was supposed to also handle > the case where we only want some of the bits in __from. 
Apparently I > lost that branch of _S_bit_cast in some revision of the patch. > > >I would much more prefer if this would be dome as: > >return __builtin_bit_cast(_Tp, static_cast<__UINT32_TYPE__>(_M_old)); > > That assumes that only 32 bits of the values are used, but the header > doesn't know that. > > > > >> } > >> else // Load the value from __vfn > >> { > >> - _Tp __val = __vfn(); > >> - _M_old = __builtin_bit_cast(__platform_wait_t, __val); > >> + auto __val = __vfn(); > >> + _M_old = _S_bit_cast<__wait_value_type>(__val); > >> > >And here: > > _M_ old = __builtin_bit_cast(__UINT32_TYPE__, __val); > >However, instead of __UINT32_TYPE__, we should use > >make_unsinged<__platform_wait_t>. > > No, again, this assumes that only sizeof(__platform_wait_t) bytes are > useful. That is the design mistake in the current code. It assumes > that we will never do non-proxy wait on something larger than today's > __platform_wait_t. For a future version of linux, we will not change > __platform_wait_t (it will still be int) but we might start doing > non-proxy wait on 64-bit values. In that case, all bits of _M_val and > _M_old would be used, and casting to uint32_t in the header would > truncate the values. > Yes, but allo of this code is inside __platform_wait_uses_type<_Tp>, constexpr branch, where we skip overhead of _M_setup_wait impl. The __platform_wait_uses_type can be only true for types equal to __platform_wait_t, and if we do not change that value, we will not change what types go into this branch, so it will always be 4 bytes. > > > > >> return __val; > >> } > >> } > >> @@ -199,16 +221,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> { > >> if (__res._M_has_val) // The previous wait loaded a recent > >> value. > >> _M_old = __res._M_val; > >> - else // Load _M_ver from the proxy (must happen before > >> __vfn()). > >> - _M_load_proxy_wait_val(__addr); > >> + else // Let the library decide how to setup the wait. 
> >> + { > >> + // Set _M_obj to the address to be waited on (either > >> __addr > >> + // or a proxy) and load its current value into _M_old. > >> + _M_setup_wait_impl(__addr); > >> + } > >> return __vfn(); > >> } > >> } > >> > >> private: > >> - // Populates _M_wait_state and _M_old from the proxy for __addr. > >> + template<typename _To, typename _From> > >> + [[__gnu__::__always_inline__]] > >> + static _To > >> + _S_bit_cast(_From __from) > >> + { > >> + if constexpr (sizeof(_To) == sizeof(_From)) > >> + return __builtin_bit_cast(_To, __from); > >> + else if constexpr (sizeof(_From) == sizeof(__UINT32_TYPE__)) > >> + return __builtin_bit_cast(__UINT32_TYPE__, __from); > >> + else > >> + __builtin_unreachable(); > >> + } > >> + > >> + // Populates _M_wait_state and _M_old appropriately for __addr. > >> void > >> - _M_load_proxy_wait_val(const void* __addr); > >> + _M_setup_wait_impl(const void* __addr); > >> > >> template<typename _Tp> > >> static constexpr __wait_flags > >> @@ -218,8 +257,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> __wait_flags __res = __abi_version | __do_spin; > >> if (!__bare_wait) > >> __res |= __track_contention; > >> - if constexpr (!__platform_wait_uses_type<_Tp>) > >> - __res |= __proxy_wait; > >> + // if constexpr (!__platform_wait_uses_type<_Tp>) > >> + // __res |= __proxy_wait; > >> > >Remove this simply. > > Yes, I wanted to get the patch sent in stage 1 so there is some > clean-up still needed. The __proxy_wait enumerator should be removed > entirely. > > >> return __res; > >> } > >> }; > >> @@ -255,7 +294,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >> __detail::__platform_wait_t __old, > >> int __order, bool __bare_wait = false) > >> { > >> -#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT > >> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT > >> __glibcxx_assert(false); // This function can't be used for proxy > >> wait. 
> >> #endif > >> __detail::__wait_args __args{ __addr, __old, __order, __bare_wait > }; > >> diff --git a/libstdc++-v3/src/c++20/atomic.cc > >> b/libstdc++-v3/src/c++20/atomic.cc > >> index e280045b619d..d5fbd10ed11f 100644 > >> --- a/libstdc++-v3/src/c++20/atomic.cc > >> +++ b/libstdc++-v3/src/c++20/atomic.cc > >> @@ -38,14 +38,7 @@ > >> # include <syscall.h> > >> # include <bits/functexcept.h> > >> # include <sys/time.h> > >> -#endif > >> - > >> -#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > >> -# ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > >> -// __waitable_state assumes that we consistently use the same > >> implementation > >> -// (i.e. futex vs mutex+condvar) for timed and untimed waiting. > >> -# error "This configuration is not currently supported" > >> -# endif > >> +# define _GLIBCXX_HAVE_PLATFORM_WAIT 1 > >> #endif > >> > >> #pragma GCC diagnostic ignored "-Wmissing-field-initializers" > >> @@ -107,7 +100,7 @@ namespace > >> // Count of threads blocked waiting on this state. > >> alignas(_S_align) __platform_wait_t _M_waiters = 0; > >> > >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > >> +#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT > >> mutex _M_mtx; > >> > >> // This type meets the Cpp17BasicLockable requirements. > >> @@ -123,7 +116,7 @@ namespace > >> // use this for waiting and notifying functions instead. 
alignas(_S_align) __platform_wait_t _M_ver = 0; > >> > >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > >> +#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT > >> __condvar _M_cv; > >> #endif > >> > >> @@ -216,17 +209,19 @@ namespace > >> __spin_impl(const __platform_wait_t* __addr, const __wait_args_base& > >> __args) > >> { > >> __platform_wait_t __val{}; > >> + __wait_value_type wval; > >> for (auto __i = 0; __i < __atomic_spin_count; ++__i) > >> { > >> __atomic_load(__addr, &__val, __args._M_order); > >> - if (__val != __args._M_old) > >> - return { ._M_val = __val, ._M_has_val = true, ._M_timeout = > >> false }; > >> + wval = static_cast<__wait_value_type>(__val); > >> + if (wval != __args._M_old) > >> + return { ._M_val = wval, ._M_has_val = true, ._M_timeout = > false > >> }; > >> if (__i < __atomic_spin_count_relax) > >> __thread_relax(); > >> else > >> __thread_yield(); > >> } > >> - return { ._M_val = __val, ._M_has_val = true, ._M_timeout = true }; > >> + return { ._M_val = wval, ._M_has_val = true, ._M_timeout = true }; > >> } > >> > >> inline __waitable_state* > >> @@ -237,32 +232,58 @@ namespace > >> return static_cast<__waitable_state*>(args._M_wait_state); > >> } > >> > >> + [[gnu::always_inline]] > >> + inline bool > >> + use_proxy_wait(const __wait_args_base& args) > >> > >This needs to take the address that we are waiting on and check if it's > >sufficiently aligned > >to __platform_wait_alignment. > > Maybe in future yes, but today the only platform where use_proxy_wait > returns false is Linux, and we know that all 4-byte objects in > std::atomic or std::atomic_ref are also 4-byte aligned (or should be, > once PR 122410 gets fixed). So I don't agree that it *needs* to take > the address. It doesn't need to today, because it doesn't use it, and > if it did a runtime check that the pointer is aligned to at least > 4 bytes, that would be wasted cycles today because it's already > guaranteed by std::atomic and std::atomic_ref. 
> > But I can add the address argument, marked [[maybe_unused]] for now. > > >However, I do not think we should check uint64_t at all, as we do not have > >such platform > >wait yet, and we will not change __platform_wait_uses_type in that case. > > Yes, the 64-bit check could be removed and added back when macOS > support gets added here (I am already working on that patch as a proof > of concept for Iain to test, but it might not be ready today). > > >> > >> + { > >> + if constexpr (__platform_wait_uses_type<uint32_t>) > >> + if (args._M_obj_size == sizeof(uint32_t)) > >> + return false; > >> + > >> + if constexpr (__platform_wait_uses_type<uint64_t>) > >> + if (args._M_obj_size == sizeof(uint64_t)) > >> + return false; > >> + > >> + // Use proxy wait for everything else: > >> + return true; > >> + } > >> + > >> } // namespace > >> > >> -// Called for a proxy wait > >> +// Set _M_wait_state if using proxy wait, or caller wants contention > >> tracking. > >> +// Set _M_obj to &_M_wait_state->_M_ver if using proxy wait. > >> +// Load the current value from _M_obj and store in _M_val. > >> void > >> -__wait_args::_M_load_proxy_wait_val(const void* addr) > >> +__wait_args::_M_setup_wait_impl(const void* addr) > >> > >I would use _M_obj_size to indicate if proxy wait is used for the actual > >wait. > > As you noted in the follow-up mail, that doesn't work, because > _M_setup_wait_impl is never called in the thread that calls > __notify_impl. > > I did consider adding a _M_setup_notify_impl function, which would > adjust _M_obj and could set _M_obj_size=0 for proxy waits. But there > would be no benefit. Unlike std::atomic::wait, there is no loop in > std::atomic::notify, it just calls __notify_impl and returns. So if we > added a separate _M_setup_notify step that needs to be run before > __notify_impl then we're just doing two PLT calls into libstdc++.so > instead of one PLT call. That would just make things slower. 
> > So __notify_impl needs to repeat the "does this combination of __addr > and _M_obj_size use a proxy wait" check directly, and that's what the > use_proxy_wait function does. There's no point modifying the > __wait_args to "remember" that decision for subsequent calls to > __notify_impl, because there will never be another call to > __notify_impl using the same __wait_args. The next call will use a new > __wait_args object, constructed by a different caller. > > >> { > >> - // __glibcxx_assert( *this & __wait_flags::__proxy_wait ); > >> + if (!use_proxy_wait(*this)) > > > >Do: _M_obj_size > 0 && !use_proxy_wait(*this) > > _M_obj_size is available in use_proxy_wait, so I would prefer to check > it in use_proxy_wait. > > >> + { > >> + // We can wait on this address directly. > >> + __glibcxx_assert(_M_obj == addr); > >> > >This function is one that is expected to set _M_obj, so it seems incorrect > >to assert it here. > > As noted in your follow-up mail, that's not true. It's set in the > __wait_args ctor, and only set here for a proxy wait. This assertion > is ensuring that it really was set in the ctor. > > >> > >> - // We always need a waitable state for proxy waits. > >> + int val; > >> + __atomic_load(static_cast<const int*>(addr), &val, > >> __ATOMIC_ACQUIRE); > >> + _M_old = val; > >> + > >> + return; > >> + } > >> + > >> + // This will be a proxy wait, so get a waitable state. > >> auto state = set_wait_state(addr, *this); > >> > >> + // The address we will wait on is the version count of the waitable > >> state: > >> + _M_obj = &state->_M_ver; > >> > >+ > >> // Read the value of the _M_ver counter. 
> >> -  __atomic_load(&state->_M_ver, &_M_old, __ATOMIC_ACQUIRE);
> >> +  __platform_wait_t __old;
> >> +  __atomic_load(&state->_M_ver, &__old, __ATOMIC_ACQUIRE);
> >
> >Nit pick, but I think this could be just:
> >  __platform_wait_t __old = __atomic_load_n(&state->_M_ver, __ATOMIC_ACQUIRE);
>
> Yes, I think I got confused by the fact that Clang doesn't support
> __atomic_load_4 and decided that it doesn't support __atomic_load_n
> either.
>
> >> +  _M_old = __old;
> >
> >Or just:
> >  _M_old = __atomic_load_n(&state->_M_ver, __ATOMIC_ACQUIRE);
>
> Yes, I'll do that.
>
> >> }
> >>
> >> __wait_result_type
> >> -__wait_impl(const void* __addr, __wait_args_base& __args)
> >> +__wait_impl([[maybe_unused]] const void* __addr, __wait_args_base& __args)
> >> {
> >> -  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
> >> -
> >> -  const __platform_wait_t* __wait_addr;
> >> -
> >> -  if (__args & __wait_flags::__proxy_wait)
> >> -    __wait_addr = &__state->_M_ver;
> >> -  else
> >> -    __wait_addr = static_cast<const __platform_wait_t*>(__addr);
> >> +  auto* __wait_addr = static_cast<const __platform_wait_t*>(__args._M_obj);
> >>
> >>   if (__args & __wait_flags::__do_spin)
> >>     {
> >> @@ -286,6 +307,7 @@ __wait_impl(const void* __addr, __wait_args_base& __args)
> >>       __atomic_load(__wait_addr, &__val, __args._M_order);
> >>       if (__val == __args._M_old)
> >>         {
> >> +          auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
> >>           __state->_M_cv.wait(__state->_M_mtx);
> >>           return { ._M_val = __val, ._M_has_val = false, ._M_timeout = false };
> >>         }
> >> @@ -294,24 +316,40 @@ __wait_impl(const void* __addr, __wait_args_base& __args)
> >>     }
> >>
> >> void
> >> -__notify_impl(const void* __addr, [[maybe_unused]] bool __all,
> >> +__notify_impl([[maybe_unused]] const void* __addr, [[maybe_unused]] bool __all,
> >>               const __wait_args_base& __args)
> >> {
> >> -  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
> >> -  if (!__state)
> >> -    __state = &__waitable_state::_S_state_for(__addr);
> >> +  const bool __track_contention = __args & __wait_flags::__track_contention;
> >> +  const bool proxy_wait = use_proxy_wait(__args);
> >
> >And here we could just check if _M_obj_size == 0, as this would be an
> >indication of a proxy wait.
>
> >>
> >> -  [[maybe_unused]] const __platform_wait_t* __wait_addr;
> >> +  [[maybe_unused]] auto* __wait_addr
> >> +    = static_cast<const __platform_wait_t*>(__addr);
> >> +
> >> +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
> >> +  // Check whether it would be a non-proxy wait for this object.
> >> +  // This condition must match the one in _M_setup_wait_impl to ensure that
> >> +  // the address used for the notify matches the one used for the wait.
> >> +  if (!proxy_wait)
> >> +    {
> >> +      if (__track_contention)
> >> +        if (!__waitable_state::_S_state_for(__addr)._M_waiting())
> >> +          return;
> >> +
> >> +      __platform_notify(__wait_addr, __all);
> >> +      return;
> >> +    }
> >> +#endif
> >> +
> >> +  // Either a proxy wait or we don't have platform wait/wake primitives.
> >> +
> >> +  auto __state = &__waitable_state::_S_state_for(__addr);
> >>
> >>   // Lock mutex so that proxied waiters cannot race with incrementing _M_ver
> >>   // and see the old value, then sleep after the increment and notify_all().
> >>   lock_guard __l{ *__state };
> >>
> >> -  if (__args & __wait_flags::__proxy_wait)
> >> +  if (proxy_wait)
> >>     {
> >> -      // Waiting for *__addr is actually done on the proxy's _M_ver.
> >> -      __wait_addr = &__state->_M_ver;
> >> -
> >>       // Increment _M_ver so that waiting threads see something changed.
> >>       // This has to be atomic because the load in _M_load_proxy_wait_val
> >>       // is done without the mutex locked.
> >> @@ -322,11 +360,11 @@ __notify_impl(const void* __addr, [[maybe_unused]] bool __all,
> >>       // they can re-evaluate their conditions to see if they should
> >>       // stop waiting or should wait again.
> >>       __all = true;
> >> -    }
> >> -  else // Use the atomic variable's own address.
> >> -    __wait_addr = static_cast<const __platform_wait_t*>(__addr);
> >>
> >> -  if (__args & __wait_flags::__track_contention)
> >> +      __wait_addr = &__state->_M_ver;
> >> +    }
> >> +
> >> +  if (__track_contention)
> >>     {
> >>       if (!__state->_M_waiting())
> >>         return;
> >> @@ -366,7 +404,7 @@ __platform_wait_until(const __platform_wait_t* __addr,
> >> }
> >> #endif // HAVE_LINUX_FUTEX
> >>
> >> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> +#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
> >> bool
> >> __cond_wait_until(__condvar& __cv, mutex& __mx,
> >>                   const __wait_clock_t::time_point& __atime)
> >> @@ -381,7 +419,7 @@ __cond_wait_until(__condvar& __cv, mutex& __mx,
> >>   __cv.wait_until(__mx, __ts);
> >>   return __wait_clock_t::now() < __atime;
> >> }
> >> -#endif // ! HAVE_PLATFORM_TIMED_WAIT
> >> +#endif // ! HAVE_PLATFORM_WAIT
> >>
> >> // Unlike __spin_impl, does not always return _M_has_val == true.
> >> // If the deadline has already passed then no fresh value is loaded.
> >> @@ -414,7 +452,9 @@ __spin_until_impl(const __platform_wait_t* __addr,
> >>       return __res;
> >>     }
> >>
> >> -  __atomic_load(__addr, &__res._M_val, __args._M_order);
> >> +  __platform_wait_t val;
> >> +  __atomic_load(__addr, &val, __args._M_order);
> >
> >Same comment on __atomic_load_n here.
>
> Agreed.
> >> +  __res._M_val = val;
> >>   __res._M_has_val = true;
> >>   if (__res._M_val != __args._M_old)
> >>     {
> >> @@ -428,16 +468,11 @@ __spin_until_impl(const __platform_wait_t* __addr,
> >> } // namespace
> >>
> >> __wait_result_type
> >> -__wait_until_impl(const void* __addr, __wait_args_base& __args,
> >> +__wait_until_impl([[maybe_unused]] const void* __addr, __wait_args_base& __args,
> >>                   const __wait_clock_t::duration& __time)
> >> {
> >>   const __wait_clock_t::time_point __atime(__time);
> >> -  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
> >> -  const __platform_wait_t* __wait_addr;
> >> -  if (__args & __wait_flags::__proxy_wait)
> >> -    __wait_addr = &__state->_M_ver;
> >> -  else
> >> -    __wait_addr = static_cast<const __platform_wait_t*>(__addr);
> >> +  auto* __wait_addr = static_cast<const __platform_wait_t*>(__args._M_obj);
> >>
> >>   if (__args & __wait_flags::__do_spin)
> >>     {
> >> @@ -448,9 +483,9 @@ __wait_until_impl(const void* __addr, __wait_args_base& __args,
> >>       return __res;
> >>     }
> >>
> >> -#ifdef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
> >> +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
> >>   if (__args & __wait_flags::__track_contention)
> >> -    set_wait_state(__addr, __args);
> >> +    set_wait_state(__addr, __args); // scoped_wait needs a __waitable_state
> >>   scoped_wait s(__args);
> >>   bool timeout = !__platform_wait_until(__wait_addr, __args._M_old, __atime);
> >>   return { ._M_val = __args._M_old, ._M_has_val = false, ._M_timeout = timeout };
> >> @@ -460,6 +495,7 @@ __wait_until_impl(const void* __addr, __wait_args_base& __args,
> >>   __atomic_load(__wait_addr, &__val, __args._M_order);
> >>   if (__val == __args._M_old)
> >>     {
> >> +      auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
> >>       bool timeout = !__cond_wait_until(__state->_M_cv, __state->_M_mtx, __atime);
> >>       return { ._M_val = __val, ._M_has_val = false, ._M_timeout = timeout };
> >>     }
> >> --
> >> 2.51.1
> >>
>> > >
