On Tue, 27 Jan 2026 at 18:44, Jay Feldblum <[email protected]> wrote:
>
> On Tue, Jan 27, 2026 at 7:55 AM Jonathan Wakely <[email protected]> wrote:
> >
> > On Tue, 27 Jan 2026 at 07:29, Jay Feldblum <[email protected]> wrote:
> > >
> > > Only some operations on std::shared_ptr and std::weak_ptr must
> > > synchronize with each other, namely, destructions and calls to member
> > > reset(). In the atomic implementation, synchronization is accomplished
> > > with release/acquire memory-ordering operations on refcount decrements.
> > > Copies, involving refcount increments, do not synchronize.
> > >
> > > On TSO architectures like x86-64, there is little observable difference.
> > > But on architectures with weaker memory models like aarch64, the
> > > compiler may emit faster instruction sequences for atomic increments
> > > without memory ordering than with.
> >
> > Thanks for working on this. I think it's a useful optimization,
> > although it's too late to make the cut for GCC 16 now.
>
> C'est la vie.
>
> > >
> > >         PR libstdc++/111589
> > >
> > > libstdc++-v3/ChangeLog:
> > >
> > >         PR libstdc++/111589
> > >         * include/bits/shared_ptr_base.h
> > >         (_Sp_counted_base::_M_add_ref_copy):
> > >         Use __exchange_and_add_relaxed_dispatch instead of
> > >         __exchange_and_add_dispatch.
> > >         (_Sp_counted_base::_M_weak_add_ref): Likewise.
> > >         (_Sp_counted_base::_M_add_ref_lock_nothrow): Use __ATOMIC_RELAXED
> > >         instead of __ATOMIC_ACQ_REL for the success ordering.
> > >         * include/ext/atomicity.h (__exchange_and_add_relaxed): New
> > >         function.
> > >         (__atomic_add_relaxed): New function.
> > >         (__exchange_and_add_relaxed_dispatch): New function.
> > >         (__atomic_add_relaxed_dispatch): New function.
> > > ---
> > >  libstdc++-v3/include/bits/shared_ptr_base.h | 11 ++++--
> > >  libstdc++-v3/include/ext/atomicity.h        | 43 +++++++++++++++++++++
> > >  2 files changed, 51 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
> > > index 88e0f4d58c6..1aa0d4faa85 100644
> > > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
> > > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
> > > @@ -148,7 +148,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >      // Increment the use count (used when the count is greater than zero).
> > >      void
> > >      _M_add_ref_copy()
> > > -    { _S_chk(__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, 1)); }
> > > +    {
> > > +      _S_chk(__gnu_cxx::__exchange_and_add_relaxed_dispatch(&_M_use_count,
> > > +                                                            1));
> > > +    }
> > >
> > >      // Increment the use count if it is non-zero, throw otherwise.
> > >      void
> > > @@ -204,7 +207,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >        // _M_weak_count can always use negative values because it cannot be
> > >        // observed by users (unlike _M_use_count). See _S_chk for details.
> > >        constexpr _Atomic_word __max = -1;
> > > -      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, 1) == __max)
> > > +      auto __r
> > > +        = __gnu_cxx::__exchange_and_add_relaxed_dispatch(&_M_weak_count, 1);
> > > +      if (__r == __max)
> > >          [[__unlikely__]] __builtin_trap();
> > >      }
> > >
> > > @@ -358,7 +363,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >            // long as it's not changed meanwhile.
> > >          }
> > >        while (!__atomic_compare_exchange_n(&_M_use_count, &__count, __count + 1,
> > > -                                          true, __ATOMIC_ACQ_REL,
> > > +                                          true, __ATOMIC_RELAXED,
> >
> > Can this really be relaxed?
> >
> > I guess it already has no ordering guarantees for the case where we
> > return false, and when we successfully increment a non-zero count this
> > is equivalent to _M_add_ref_copy(), so relaxed must be fine. OK, I'm
> > happy.
>
> That's how I see it too.
>
> > >                                           __ATOMIC_RELAXED));
> > >        _S_chk(__count);
> > >        return true;
> > > diff --git a/libstdc++-v3/include/ext/atomicity.h b/libstdc++-v3/include/ext/atomicity.h
> > > index ecbe376a687..6ecda307990 100644
> > > --- a/libstdc++-v3/include/ext/atomicity.h
> > > +++ b/libstdc++-v3/include/ext/atomicity.h
> > > @@ -74,12 +74,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >    __attribute__((__always_inline__))
> > >    __atomic_add(volatile _Atomic_word* __mem, int __val)
> > >    { __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
> > > +
> > > +  inline _Atomic_word
> > > +  __attribute__((__always_inline__))
> > > +  __exchange_and_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > +  { return __atomic_fetch_add(__mem, __val, __ATOMIC_RELAXED); }
> > > +
> > > +  inline void
> > > +  __attribute__((__always_inline__))
> > > +  __atomic_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > +  { __atomic_fetch_add(__mem, __val, __ATOMIC_RELAXED); }
> >
> > Do we need __atomic_add_relaxed? You don't seem to use it above.
> >
> > >  #else // Defined in config/cpu/.../atomicity.h
> > >    _Atomic_word
> > >    __exchange_and_add(volatile _Atomic_word*, int) _GLIBCXX_NOTHROW;
> > >
> > >    void
> > >    __atomic_add(volatile _Atomic_word*, int) _GLIBCXX_NOTHROW;
> > > +
> > > +  // Fallback to acq_rel when builtins are not available.
> > > +  inline _Atomic_word
> > > +  __attribute__((__always_inline__))
> > > +  __exchange_and_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > +  { return __exchange_and_add(__mem, __val); }
> > > +
> > > +  inline void
> > > +  __attribute__((__always_inline__))
> > > +  __atomic_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > +  { __atomic_add(__mem, __val); }
> > >  #endif
> > >
> > >  #if __cplusplus < 201103L
> > > @@ -154,6 +175,28 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > >        __atomic_add(__mem, __val);
> > >      }
> > >
> > > +  // Relaxed variants for reference count increments where acquire/release
> > > +  // semantics are not required. Decrements still need stronger ordering.
> > > +  inline _Atomic_word
> > > +  __attribute__ ((__always_inline__))
> > > +  __exchange_and_add_relaxed_dispatch(_Atomic_word* __mem, int __val)
> > > +  {
> > > +    if (__is_single_threaded())
> > > +      return __exchange_and_add_single(__mem, __val);
> > > +    else
> > > +      return __exchange_and_add_relaxed(__mem, __val);
> > > +  }
> >
> > I wonder if we really need the new dispatch function, or should just
> > use __exchange_and_add_relaxed directly in the shared_ptr code.
> >
> > The point of the dispatch functions is to use unsynchronized
> > operations when we know the program is single-threaded, to avoid the
> > additional overhead of atomics with acq_rel ordering. But for some
> > targets, a relaxed atomic is almost as cheap as the non-atomic op
> > anyway. At what point does checking the __is_single_threaded()
> > function outweigh the cost of a relaxed RMW? It certainly increases
> > code size.
> >
> > It depends on the processor, and for some arches I expect it's worth
> > keeping the dispatch. But maybe not on all arches.
>
> Interesting point. I took a look at __is_single_threaded and
> __gthread_active_p. The cost of the dispatch looks low across all
> implementations: the hot path is just an unordered load, a compare, and
> a branch, plus a larger cold path and the single-threaded code. So there
> is some performance overhead, but not much, and some unnecessary code
> size. But whether it is better to remove the dispatch unconditionally,
> or how to decide when to remove it, is a larger question than I can
> answer in this patch.
>
> Nevertheless, across all processors this patch either reduces CPU cost
> or leaves it unchanged, and it leaves code size unaffected (or gives
> the compiler room for optimizations that would be forbidden in the
> presence of memory-ordering calls, even if those might increase code
> size).
>
> So this patch is an improvement over the current state, even if there
> is room for further improvement later by adjusting the dispatch.
Yes, I agree. Determining whether to skip the dispatch for specific arches can be done later, or not at all.
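
For anyone skimming the thread, here is a standalone sketch of the
ordering argument. It is illustrative only: the names (toy_counted,
add_ref, release) are invented for the example, and this is not the
libstdc++ implementation, which goes through _Atomic_word and the
__exchange_and_add dispatch helpers rather than std::atomic.

#include <atomic>

struct toy_counted
{
  std::atomic<long> count{1};

  void add_ref() noexcept
  {
    // A copy can never be the last owner, so the increment neither
    // publishes nor consumes anything: relaxed is sufficient.
    count.fetch_add(1, std::memory_order_relaxed);
  }

  // Assumes the object was allocated with new.
  void release() noexcept
  {
    // The decrement is where owners synchronize: the release makes this
    // owner's writes available, and the acquire fence in the last owner
    // makes every owner's writes visible before the object is destroyed.
    if (count.fetch_sub(1, std::memory_order_release) == 1)
      {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete this;
      }
  }
};

Note that libstdc++ keeps __ATOMIC_ACQ_REL on the decrement rather than
the release-plus-acquire-fence split shown above; the sketch only makes
the point that the increments need no ordering, which is exactly what
the patch changes.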
