What would be the next step for this? Note that I do not have write
access to the repository.

On Tue, Jan 27, 2026 at 1:54 PM Jonathan Wakely <[email protected]> wrote:
>
> On Tue, 27 Jan 2026 at 18:44, Jay Feldblum <[email protected]> wrote:
> >
> > On Tue, Jan 27, 2026 at 7:55 AM Jonathan Wakely <[email protected]> wrote:
> > >
> > > On Tue, 27 Jan 2026 at 07:29, Jay Feldblum <[email protected]> wrote:
> > > >
> > > > Only some operations on std::shared_ptr and std::weak_ptr must
> > > > synchronize with each other, namely, destructions and calls to member
> > > > reset(). In the atomic implementation, synchronization is accomplished
> > > > with release/acquire memory-ordering operations on refcount decrements.
> > > > Copies, which involve refcount increments, do not need to synchronize.
> > > >
> > > > On TSO architectures like x86-64, there is little observable difference.
> > > > But on architectures with weaker memory models, such as aarch64, the
> > > > compiler can emit faster instruction sequences for relaxed atomic
> > > > increments than for increments with acquire/release ordering.
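
For illustration, a minimal standalone sketch of the two orderings being
compared (not part of the patch; on x86-64 both typically lower to the same
locked RMW instruction, while on aarch64 the relaxed form can be lowered to
a cheaper sequence):

  inline int
  increment_acq_rel(volatile int* __mem)
  { return __atomic_fetch_add(__mem, 1, __ATOMIC_ACQ_REL); }

  inline int
  increment_relaxed(volatile int* __mem)
  { return __atomic_fetch_add(__mem, 1, __ATOMIC_RELAXED); }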
> > >
> > > Thanks for working on this. I think it's a useful optimization,
> > > although it's too late to make the cut for GCC 16 now.
> >
> > C'est la vie.
> >
> > >
> > > >
> > > >         PR libstdc++/111589
> > > >
> > > > libstdc++-v3/ChangeLog:
> > > >
> > > >         PR libstdc++/111589
> > > >         * include/bits/shared_ptr_base.h (_Sp_counted_base::_M_add_ref_copy):
> > > >         Use __exchange_and_add_relaxed_dispatch instead of
> > > >         __exchange_and_add_dispatch.
> > > >         (_Sp_counted_base::_M_weak_add_ref): Likewise.
> > > >         (_Sp_counted_base::_M_add_ref_lock_nothrow): Use __ATOMIC_RELAXED
> > > >         instead of __ATOMIC_ACQ_REL for the success ordering.
> > > >         * include/ext/atomicity.h (__exchange_and_add_relaxed): New function.
> > > >         (__atomic_add_relaxed): New function.
> > > >         (__exchange_and_add_relaxed_dispatch): New function.
> > > >         (__atomic_add_relaxed_dispatch): New function.
> > > > ---
> > > >  libstdc++-v3/include/bits/shared_ptr_base.h | 11 ++++--
> > > >  libstdc++-v3/include/ext/atomicity.h        | 43 +++++++++++++++++++++
> > > >  2 files changed, 51 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
> > > > index 88e0f4d58c6..1aa0d4faa85 100644
> > > > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
> > > > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
> > > > @@ -148,7 +148,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >        // Increment the use count (used when the count is greater than zero).
> > > >        void
> > > >        _M_add_ref_copy()
> > > > -      { _S_chk(__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, 1)); }
> > > > +      {
> > > > +       _S_chk(__gnu_cxx::__exchange_and_add_relaxed_dispatch(&_M_use_count,
> > > > +                                                             1));
> > > > +      }
> > > >
> > > >        // Increment the use count if it is non-zero, throw otherwise.
> > > >        void
> > > > @@ -204,7 +207,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >         // _M_weak_count can always use negative values because it cannot be
> > > >         // observed by users (unlike _M_use_count). See _S_chk for details.
> > > >         constexpr _Atomic_word __max = -1;
> > > > -       if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, 1) == __max)
> > > > +       auto __r
> > > > +         = __gnu_cxx::__exchange_and_add_relaxed_dispatch(&_M_weak_count, 1);
> > > > +       if (__r == __max)
> > > >           [[__unlikely__]] __builtin_trap();
> > > >        }
> > > >
> > > > @@ -358,7 +363,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >           // long as it's not changed meanwhile.
> > > >         }
> > > >        while (!__atomic_compare_exchange_n(&_M_use_count, &__count, __count + 1,
> > > > -                                         true, __ATOMIC_ACQ_REL,
> > > > +                                         true, __ATOMIC_RELAXED,
> > >
> > > Can this really be relaxed?
> > >
> > > I guess it already has no ordering guarantees for the case where we
> > > return false, and when we successfully increment a non-zero count this
> > > is equivalent to _M_add_ref_copy(), so relaxed must be fine. OK, I'm
> > > happy.
> > >
> >
> > That's how I see it too.
> >
> > >
> > > >                                           __ATOMIC_RELAXED));
> > > >        _S_chk(__count);
> > > >        return true;
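
To make the reasoning above concrete, here is a standalone sketch of the
increment-if-non-zero pattern (simplified and renamed; not the actual
libstdc++ code). A failed exchange performs no increment and needs no
ordering, and a successful one is just a refcount increment like
_M_add_ref_copy(), so relaxed suffices for both:

  inline bool
  try_add_ref_if_nonzero(int* use_count)
  {
    int count = __atomic_load_n(use_count, __ATOMIC_RELAXED);
    do
      {
        if (count == 0)
          return false;   // already expired; do not revive it
      }
    while (!__atomic_compare_exchange_n(use_count, &count, count + 1,
                                        /* weak = */ true,
                                        __ATOMIC_RELAXED, __ATOMIC_RELAXED));
    return true;
  }
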
> > > > diff --git a/libstdc++-v3/include/ext/atomicity.h b/libstdc++-v3/include/ext/atomicity.h
> > > > index ecbe376a687..6ecda307990 100644
> > > > --- a/libstdc++-v3/include/ext/atomicity.h
> > > > +++ b/libstdc++-v3/include/ext/atomicity.h
> > > > @@ -74,12 +74,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >    __attribute__((__always_inline__))
> > > >    __atomic_add(volatile _Atomic_word* __mem, int __val)
> > > >    { __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
> > > > +
> > > > +  inline _Atomic_word
> > > > +  __attribute__((__always_inline__))
> > > > +  __exchange_and_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > > +  { return __atomic_fetch_add(__mem, __val, __ATOMIC_RELAXED); }
> > > > +
> > > > +  inline void
> > > > +  __attribute__((__always_inline__))
> > > > +  __atomic_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > > +  { __atomic_fetch_add(__mem, __val, __ATOMIC_RELAXED); }
> > >
> > > Do we need __atomic_add_relaxed? You don't seem to use it above.
> > >
> > > >  #else // Defined in config/cpu/.../atomicity.h
> > > >    _Atomic_word
> > > >    __exchange_and_add(volatile _Atomic_word*, int) _GLIBCXX_NOTHROW;
> > > >
> > > >    void
> > > >    __atomic_add(volatile _Atomic_word*, int) _GLIBCXX_NOTHROW;
> > > > +
> > > > +  // Fallback to acq_rel when builtins are not available.
> > > > +  inline _Atomic_word
> > > > +  __attribute__((__always_inline__))
> > > > +  __exchange_and_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > > +  { return __exchange_and_add(__mem, __val); }
> > > > +
> > > > +  inline void
> > > > +  __attribute__((__always_inline__))
> > > > +  __atomic_add_relaxed(volatile _Atomic_word* __mem, int __val)
> > > > +  { __atomic_add(__mem, __val); }
> > > >  #endif
> > > >
> > > >  #if __cplusplus < 201103L
> > > > @@ -154,6 +175,28 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> > > >        __atomic_add(__mem, __val);
> > > >    }
> > > >
> > > > +  // Relaxed variants for reference count increments where acquire/release
> > > > +  // semantics are not required. Decrements still need stronger ordering.
> > > > +  inline _Atomic_word
> > > > +  __attribute__ ((__always_inline__))
> > > > +  __exchange_and_add_relaxed_dispatch(_Atomic_word* __mem, int __val)
> > > > +  {
> > > > +    if (__is_single_threaded())
> > > > +      return __exchange_and_add_single(__mem, __val);
> > > > +    else
> > > > +      return __exchange_and_add_relaxed(__mem, __val);
> > > > +  }
> > >
> > > I wonder if we really need the new dispatch function, or should just
> > > use __exchange_and_add_relaxed directly in the shared_ptr code.
> > >
> > > The point of the dispatch functions is to use unsynchronized
> > > operations when we know the program is single-threaded, to avoid the
> > > additional overhead of atomics with acq_rel ordering. But for some
> > > targets, a relaxed atomic is almost as cheap as the non-atomic op
> > > anyway. At what point does the cost of the __is_single_threaded()
> > > check outweigh the cost of a relaxed RMW? It certainly increases
> > > code size.
> > >
> > > It depends on the processor, and for some arches I expect it's worth
> > > keeping the dispatch. But maybe not on all arches.
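
For illustration only, the two shapes being weighed here might look roughly
like this (names and the single_threaded flag are placeholders, not the real
__is_single_threaded() machinery):

  inline int
  add_ref_with_dispatch(int* counter, bool single_threaded)
  {
    if (single_threaded)
      {
        // Hot path guarded by a load, a compare, and a branch: no atomic
        // RMW at all.
        int old = *counter;
        *counter = old + 1;
        return old;
      }
    // Multi-threaded path: relaxed atomic RMW.
    return __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
  }

  inline int
  add_ref_without_dispatch(int* counter)
  {
    // Always pays for the relaxed RMW, but with no branch and less code.
    return __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
  }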
> >
> > Interesting point. I took a look at __is_single_threaded and
> > __gthread_active_p. The cost of the dispatch looks low across all
> > implementations: a hot path with just an unordered load, a compare, and
> > a branch, plus a larger cold path and the single-threaded code. So there
> > is some performance overhead, but not much, and some unnecessary code
> > size. But whether to remove the dispatch unconditionally, or under what
> > conditions to remove it, seems like a larger question than I can answer
> > in this patch.
> >
> > Nevertheless, across all processors, this patch either reduces CPU cost
> > or leaves it unchanged, and it leaves code size unchanged (or at most
> > opens room for hypothetical compiler optimizations that would increase
> > code size but are forbidden in the presence of memory-ordering calls).
> >
> > So this patch is an improvement over the current state, even if there is
> > room for further improvement later on by adjusting the dispatch.
>
> Yes, I agree. Determining whether to skip the dispatch for specific
> arches can be done later, or not at all.
>
