On Tue, 3 Feb 2026 at 13:47, Jay Feldblum <[email protected]> wrote:
>
> What would be the next step for this? Note that I do not have write
> access to the repository.
It needs to wait until after gcc-16 branches from trunk, some time in
April or early May. If I forget to come back to it then, please send a
reminder to me or to the list, thanks.

>
> On Tue, Jan 27, 2026 at 1:54 PM Jonathan Wakely <[email protected]> wrote:
> >
> > On Tue, 27 Jan 2026 at 18:44, Jay Feldblum <[email protected]> wrote:
> > >
> > > On Tue, Jan 27, 2026 at 7:55 AM Jonathan Wakely <[email protected]>
> > > wrote:
> > > >
> > > > On Tue, 27 Jan 2026 at 07:29, Jay Feldblum <[email protected]> wrote:
> > > > >
> > > > > Only some operations on std::shared_ptr and std::weak_ptr must
> > > > > synchronize with each other, namely, destructions and calls to member
> > > > > reset(). In the atomic implementation, synchronization is accomplished
> > > > > with release/acquire memory-ordering operations on refcount
> > > > > decrements.
> > > > > Copies, involving refcount increments, do not synchronize.
> > > > >
> > > > > On TSO architectures like x86-64, there is little observable
> > > > > difference.
> > > > > But on architectures with weaker memory models like aarch64, the
> > > > > compiler may emit faster instruction sequences for atomic increments
> > > > > without memory ordering than with.
> > > >
> > > > Thanks for working on this. I think it's a useful optimization,
> > > > although it's too late to make the cut for GCC 16 now.
> > >
> > > C'est la vie.
> > >
> > > > >
> > > > > PR libstdc++/111589
> > > > >
> > > > > libstdc++-v3/ChangeLog:
> > > > >
> > > > > PR libstdc++/111589
> > > > > * include/bits/shared_ptr_base.h
> > > > > (_Sp_counted_base::_M_add_ref_copy):
> > > > > Use __exchange_and_add_relaxed_dispatch instead of
> > > > > __exchange_and_add_dispatch.
> > > > > (_Sp_counted_base::_M_weak_add_ref): Likewise.
> > > > > (_Sp_counted_base::_M_add_ref_lock_nothrow): Use
> > > > > __ATOMIC_RELAXED
> > > > > instead of __ATOMIC_ACQ_REL for the success ordering.
> > > > > * include/ext/atomicity.h (__exchange_and_add_relaxed): New
> > > > > function.
> > > > > (__atomic_add_relaxed): New function.
> > > > > (__exchange_and_add_relaxed_dispatch): New function.
> > > > > (__atomic_add_relaxed_dispatch): New function.
> > > > > ---
> > > > > libstdc++-v3/include/bits/shared_ptr_base.h | 11 ++++--
> > > > > libstdc++-v3/include/ext/atomicity.h | 43
> > > > > +++++++++++++++++++++
> > > > > 2 files changed, 51 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
> > > > > b/libstdc++-v3/include/bits/shared_ptr_base.h
> > > > > index 88e0f4d58c6..1aa0d4faa85 100644
> > > > > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
> > > > > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
> > > > > @@ -148,7 +148,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > > > // Increment the use count (used when the count is greater
> > > > > than zero).
> > > > > void
> > > > > _M_add_ref_copy()
> > > > > - { _S_chk(__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count,
> > > > > 1)); }
> > > > > + {
> > > > > +
> > > > > _S_chk(__gnu_cxx::__exchange_and_add_relaxed_dispatch(&_M_use_count,
> > > > > + 1));
> > > > > + }
> > > > >
> > > > > // Increment the use count if it is non-zero, throw otherwise.
> > > > > void
> > > > > @@ -204,7 +207,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > > > // _M_weak_count can always use negative values because it
> > > > > cannot be
> > > > > // observed by users (unlike _M_use_count). See _S_chk for
> > > > > details.
> > > > > constexpr _Atomic_word __max = -1;
> > > > > - if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, 1)
> > > > > == __max)
> > > > > + auto __r
> > > > > + =
> > > > > __gnu_cxx::__exchange_and_add_relaxed_dispatch(&_M_weak_count, 1);
> > > > > + if (__r == __max)
> > > > > [[__unlikely__]] __builtin_trap();
> > > > > }
> > > > >
> > > > > @@ -358,7 +363,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > > > // long as it's not changed meanwhile.
> > > > > }
> > > > > while (!__atomic_compare_exchange_n(&_M_use_count, &__count,
> > > > > __count + 1,
> > > > > - true, __ATOMIC_ACQ_REL,
> > > > > + true, __ATOMIC_RELAXED,
> > > >
> > > > Can this really be relaxed?
> > > >
> > > > I guess it already has no ordering guarantees for the case where we
> > > > return false, and when we successfully increment a non-zero count this
> > > > is equivalent to _M_add_ref_copy() so relaxed must be fine. OK, I'm
> > > > happy.
> > >
> > > That's how I see it too.
> > >
> > > >
> > > > > __ATOMIC_RELAXED));
> > > > > _S_chk(__count);
> > > > > return true;
> > > > > diff --git a/libstdc++-v3/include/ext/atomicity.h
> > > > > b/libstdc++-v3/include/ext/atomicity.h
> > > > > index ecbe376a687..6ecda307990 100644
> > > > > --- a/libstdc++-v3/include/ext/atomicity.h
> > > > > +++ b/libstdc++-v3/include/ext/atomicity.h
> > > > > @@ -74,12 +74,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > > > __attribute__((__always_inline__))
> > > > > __atomic_add(volatile _Atomic_word* __mem, int __val)
> > > > > { __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
> > > > > +
> > > > > + inline _Atomic_word
> > > > > + __attribute__((__always_inline__))
> > > > > + __exchange_and_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > > > + { return __atomic_fetch_add(__mem, __val, __ATOMIC_RELAXED); }
> > > > > +
> > > > > + inline void
> > > > > + __attribute__((__always_inline__))
> > > > > + __atomic_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > > > + { __atomic_fetch_add(__mem, __val, __ATOMIC_RELAXED); }
> > > >
> > > > Do we need __atomic_add_relaxed? You don't seem to use it above.
> > > >
> > > > > #else // Defined in config/cpu/.../atomicity.h
> > > > > _Atomic_word
> > > > > __exchange_and_add(volatile _Atomic_word*, int) _GLIBCXX_NOTHROW;
> > > > >
> > > > > void
> > > > > __atomic_add(volatile _Atomic_word*, int) _GLIBCXX_NOTHROW;
> > > > > +
> > > > > + // Fallback to acq_rel when builtins are not available.
> > > > > + inline _Atomic_word
> > > > > + __attribute__((__always_inline__))
> > > > > + __exchange_and_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > > > + { return __exchange_and_add(__mem, __val); }
> > > > > +
> > > > > + inline void
> > > > > + __attribute__((__always_inline__))
> > > > > + __atomic_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > > > + { __atomic_add(__mem, __val); }
> > > > > #endif
> > > > >
> > > > > #if __cplusplus < 201103L
> > > > > @@ -154,6 +175,28 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > > > __atomic_add(__mem, __val);
> > > > > }
> > > > >
> > > > > + // Relaxed variants for reference count increments where
> > > > > acquire/release
> > > > > + // semantics are not required. Decrements still need stronger
> > > > > ordering.
> > > > > + inline _Atomic_word
> > > > > + __attribute__ ((__always_inline__))
> > > > > + __exchange_and_add_relaxed_dispatch(_Atomic_word* __mem, int __val)
> > > > > + {
> > > > > + if (__is_single_threaded())
> > > > > + return __exchange_and_add_single(__mem, __val);
> > > > > + else
> > > > > + return __exchange_and_add_relaxed(__mem, __val);
> > > > > + }
> > > >
> > > > I wonder if we really need the new dispatch function, or should just
> > > > use __exchange_and_add_relaxed directly in the shared_ptr code.
> > > >
> > > > The point of the dispatch functions is to use unsynchronized
> > > > operations when we know the program is single-threaded, to avoid the
> > > > additional overhead of atomics with acq_rel ordering. But for some
> > > > targets, a relaxed atomic is almost as cheap as the non-atomic op
> > > > anyway. At what point does checking the __is_single_threaded()
> > > > function outweigh the cost of a relaxed RMW? It certainly increases
> > > > code size.
> > > >
> > > > It depends on the processor, and for some arches I expect it's worth
> > > > keeping the dispatch. But maybe not on all arches.
> > >
> > > Interesting point. I took a look at __is_single_threaded and
> > > __gthread_active_p. It looks like the cost of dispatch is low across all
> > > implementations - a hot path with just an unordered load, compare, and
> > > branch, a larger cold path, and the single-threaded code. So there is
> > > some performance overhead, but not much, and some unnecessary code size.
> > > But whether it is better to remove the dispatch unconditionally, or how
> > > to condition removal of the dispatch, seems like a larger question than
> > > I can answer in this patch.
> > >
> > > Nevertheless, this patch strictly removes cpu cost or leaves cpu cost
> > > unaffected, and leaves code size unaffected (or provides room for
> > > hypothetical compiler optimizations that increase code size but that
> > > would be forbidden in the presence of memory-ordering calls), across
> > > all processors.
> > >
> > > So this patch is an improvement over the current state, even if there is
> > > room for further improvement later on by adjusting the dispatch.
> >
> > Yes, I agree. Determining whether to skip the dispatch for specific
> > arches can be done later, or not at all.
> >
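
To make the ordering argument at the top of the thread concrete, here is a
minimal standalone sketch of the same split in terms of std::atomic. It is
illustrative only: the type and member names are invented, and this is not
the libstdc++ _Sp_counted_base code.

  #include <atomic>

  struct refcounted                     // invented type, for illustration
  {
    std::atomic<int> count{1};

    void add_ref()
    {
      // A copy only has to bump the counter atomically; it neither
      // publishes nor consumes other data, so relaxed ordering suffices.
      count.fetch_add(1, std::memory_order_relaxed);
    }

    void release()
    {
      // The decrement keeps acquire/release ordering so that every
      // owner's prior writes happen-before the last owner's delete.
      if (count.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete this;
    }
  };

The patch applies exactly this split to the control block: copy increments
become relaxed, while the decrement paths keep their release/acquire
ordering.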
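
The _M_add_ref_lock_nothrow hunk discussed above ("Can this really be
relaxed?") is the same idea inside a compare-exchange loop. A sketch of the
pattern, again with invented names and the standard atomics API rather than
the internal builtins:

  #include <atomic>

  // Try to take a reference only if the count is still non-zero, as a
  // weak_ptr::lock() style operation must.
  inline bool
  try_add_ref_if_nonzero(std::atomic<int>& use_count)
  {
    int count = use_count.load(std::memory_order_relaxed);
    do
      {
        if (count == 0)
          return false;                 // object already expired
        // On success this is just another copy increment, so relaxed is
        // enough; on failure no ordering was ever promised.
      }
    while (!use_count.compare_exchange_weak(count, count + 1,
                                            std::memory_order_relaxed,
                                            std::memory_order_relaxed));
    return true;
  }

That mirrors the reasoning quoted above: the successful increment is
equivalent to _M_add_ref_copy(), and the failed case never had ordering
guarantees to begin with.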
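
For the dispatch question at the end of the thread, the shape of the
trade-off looks roughly like this. The helper below is a placeholder for
the real __is_single_threaded() / __gthread_active_p check, not the
libstdc++ implementation:

  #include <atomic>

  // Placeholder: a real check would consult the threading runtime.
  inline bool single_threaded() noexcept { return false; }

  inline int
  counter_add(std::atomic<int>* mem, int val)
  {
    if (single_threaded())
      {
        // Single-threaded hot path: plain load and store, no RMW at all.
        int old = mem->load(std::memory_order_relaxed);
        mem->store(old + val, std::memory_order_relaxed);
        return old;
      }
    // Multi-threaded path: a relaxed RMW, which on some targets is
    // already close in cost to the load/compare/branch above.
    return mem->fetch_add(val, std::memory_order_relaxed);
  }

The single-threaded branch avoids the RMW entirely, but every caller pays
the load, compare, and branch; whether that wins against a bare relaxed RMW
is the per-target question raised above.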
