Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-12-08 Thread Jonathan Wakely via Gcc-patches
On Wed, 8 Dec 2021 at 19:21, Jonathan Wakely wrote:
>
> On Wed, 8 Dec 2021 at 19:17, Rainer Orth wrote:
> >
> > Hi Jonathan,
> >
> > > I've pushed this change to trunk now (it was posted and reviewed in
> > > stage 1, I just didn't get around to pushing it until now).
> > >
> > > The final version of the patch is attached to this mail.
> >
> > unfortunately, it breaks Solaris/SPARC bootstrap:
> >
> > In file included from 
> > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53,
> >  from 
> > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77,
> >  from 
> > /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82:
> > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:
> >  In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with 
> > __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]':
> > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26:
> >  error: right operand of shift expression '(1 << 64)' is greater than or 
> > equal to the precision 64 of the left operand [-fpermissive]
> >   329 | = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
> >   | ~^
> > make[9]: *** [Makefile:1875: 
> > sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1
> >
> > For 64-bit SPARC, _Atomic_word is long.
>
> Ah yes, so we need to disable this optimization. Patch coming up ...

Gah, I remembered to check that:

  constexpr bool __double_word
= sizeof(long long) == 2 * sizeof(_Atomic_word);
  // The ref-count members follow the vptr, so are aligned to
  // alignof(void*).
  constexpr bool __aligned = __alignof(long long) <= alignof(void*);
  if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)


But for C++11 and C++14 that is a normal runtime condition, not an
if-constexpr condition, so the undefined shift still gets compiled, even
though it can never be reached at runtime.
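
One way to make the dead branch well-formed for C++11/14 is to fold the
shift amount behind the same constant condition, so it is always smaller
than the width of long long. A minimal sketch of that idea (not
necessarily the fix that was actually committed):

  constexpr bool __double_word
    = sizeof(long long) == 2 * sizeof(_Atomic_word);
  // When the optimization cannot apply, use a shift of zero so the
  // expression stays well-formed even if _Atomic_word is 64 bits wide.
  constexpr int __shift
    = __double_word ? __CHAR_BIT__ * sizeof(_Atomic_word) : 0;
  constexpr long long __unique_ref = 1LL + (1LL << __shift);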



Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-12-08 Thread Jonathan Wakely via Gcc-patches
On Wed, 8 Dec 2021 at 19:17, Rainer Orth wrote:
>
> Hi Jonathan,
>
> > I've pushed this change to trunk now (it was posted and reviewed in
> > stage 1, I just didn't get around to pushing it until now).
> >
> > The final version of the patch is attached to this mail.
>
> unfortunately, it breaks Solaris/SPARC bootstrap:
>
> In file included from 
> /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53,
>  from 
> /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77,
>  from 
> /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82:
> /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:
>  In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with 
> __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]':
> /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26:
>  error: right operand of shift expression '(1 << 64)' is greater than or 
> equal to the precision 64 of the left operand [-fpermissive]
>   329 | = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
>   | ~^
> make[9]: *** [Makefile:1875: 
> sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1
>
> For 64-bit SPARC, _Atomic_word is long.

Ah yes, so we need to disable this optimization. Patch coming up ...



Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-12-08 Thread Rainer Orth
Hi Jonathan,

> I've pushed this change to trunk now (it was posted and reviewed in
> stage 1, I just didn't get around to pushing it until now).
>
> The final version of the patch is attached to this mail.

unfortunately, it breaks Solaris/SPARC bootstrap:

In file included from 
/var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53,
 from 
/var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77,
 from 
/vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82:
/var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:
 In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with 
__gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]':
/var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26:
 error: right operand of shift expression '(1 << 64)' is greater than or equal 
to the precision 64 of the left operand [-fpermissive]
  329 | = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
  | ~^
make[9]: *** [Makefile:1875: 
sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1

For 64-bit SPARC, _Atomic_word is long.

Rainer

-- 
Rainer Orth, Center for Biotechnology, Bielefeld University


Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-12-08 Thread Jonathan Wakely via Gcc-patches
I've pushed this change to trunk now (it was posted and reviewed in
stage 1, I just didn't get around to pushing it until now).

The final version of the patch is attached to this mail.

Thanks for the nice optimization, Maged!


On Wed, 4 Aug 2021 at 20:49, Maged Michael via Libstdc++
 wrote:
>
> On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely 
> wrote:
>
> > On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
> > >
> > > Sorry. I totally missed the rest of your message and the patch. My fuzzy
> > eyesight, which usually guesses correctly 90% of the time, mistook
> > "Secondly" on a line by itself for "Sincerely" :-)
> >
> > :-)
> >
> > > The noinlining was based on looking at generated code. That was for
> > clang. It was inlining the _M_last_use function for every instance of
> > _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> > _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> > (several megabytes), which was an extra benefit. Without the noinline we
> > saw code size increase.
> >
> > Wow, that is a convincing argument for making it not inline, thanks.
> >
> > > IIUC, we can use the following. Right?
> > >
> > > __attribute__((__noinline__))
> >
> > Right.
> >
> > > I didn't understand the part about programmers doing #define noinline 1.
> > I don't see code in the patch that uses noinline.
> >
> > This is a valid C++ program:
> >
> > #define noinline 1
> > #include <memory>
> > int main() { }
> >
> > But if anything in <memory> uses "noinline" then this valid program
> > will not compile. Which is why we must use ((__noinline__)) instead of
> > ((noinline)).
> >
> > Thanks. Now I get it.
>
>
> >
> >
> > >
> > > How about something like this comment?
> > >
> > > // Noinline to avoid code size increase.
> >
> > Great, thanks.
> >
> > On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> > > Actually I take back what I said. Sorry. I think the logic in your patch
> > is correct. I missed some of the atomic decrements.
> > > But I'd be concerned about performance. If we make _M_release_last_use
> > noinline then we are adding overhead to the fast path of the original logic
> > (where both counts are 1).
> >
> > Oh, I see. So the code duplication serves a purpose. We want the
> > _M_release_last_use() code to be non-inline for the new logic, because
> > in the common case we never call it (we either detect that both counts
> > are 1 and do the dispose & destroy without atomic decrements, or we do
> > a single decrement and don't dispose or destroy anything). But for the
> > old logic, we do want that code inlined into _M_release (or
> > _M_release_orig as it was called in your patch). Have I got that right
> > now?
> >
> > Yes. Completely right.
>
>
> > What if we remove the __noinline__ from _M_release_last_use() so that
> > it can be inlined, but then add a noinline wrapper that can be called
> > when we don't want to inline it?
> >
> > So:
> >   // Called by _M_release() when the use count reaches zero.
> >   void
> >   _M_release_last_use() noexcept
> >   {
> > // unchanged from previous patch, but without the attribute.
> > // ...
> >   }
> >
> >   // As above, but 'noinline' to reduce code size on the cold path.
> >   __attribute__((__noinline__))
> >   void
> >   _M_release_last_use_cold() noexcept
> >   { _M_release_last_use(); }
> >
> >
> > And then:
> >
> >   template<>
> > inline void
> > _Sp_counted_base<_S_atomic>::_M_release() noexcept
> > {
> >   _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> > #if ! _GLIBCXX_TSAN
> >   constexpr bool __lock_free
> > = __atomic_always_lock_free(sizeof(long long), 0)
> > && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
> >   constexpr bool __double_word
> > = sizeof(long long) == 2 * sizeof(_Atomic_word);
> >   // The ref-count members follow the vptr, so are aligned to
> >   // alignof(void*).
> >   constexpr bool __aligned = __alignof(long long) <= alignof(void*);
> >   if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
> > {
> >   constexpr long long __unique_ref
> > = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
> >   auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
> >
> >   _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> >   if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) ==
> > __unique_ref)
> > {
> >   // Both counts are 1, so there are no weak references and
> >   // we are releasing the last strong reference. No other
> >   // threads can observe the effects of this _M_release()
> >   // call (e.g. calling use_count()) without a data race.
> >   *(long long*)(&_M_use_count) = 0;
> >   _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> >   

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Maged Michael via Gcc-patches
On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely 
wrote:

> On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
> >
> > Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)
>
> :-)
>
> > The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_last_use function for every instance of
> _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> (several megabytes), which was an extra benefit. Without the noinline we
> saw code size increase.
>
> Wow, that is a convincing argument for making it not inline, thanks.
>
> > IIUC, we can use the following. Right?
> >
> > __attribute__((__noinline__))
>
> Right.
>
> > I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.
>
> This is a valid C++ program:
>
> #define noinline 1
> #include <memory>
> int main() { }
>
> But if anything in <memory> uses "noinline" then this valid program
> will not compile. Which is why we must use ((__noinline__)) instead of
> ((noinline)).
>
> Thanks. Now I get it.


>
>
> >
> > How about something like this comment?
> >
> > // Noinline to avoid code size increase.
>
> Great, thanks.
>
> On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> > Actually I take back what I said. Sorry. I think the logic in your patch
> is correct. I missed some of the atomic decrements.
> > But I'd be concerned about performance. If we make _M_release_last_use
> noinline then we are adding overhead to the fast path of the original logic
> (where both counts are 1).
>
> Oh, I see. So the code duplication serves a purpose. We want the
> _M_release_last_use() code to be non-inline for the new logic, because
> in the common case we never call it (we either detect that both counts
> are 1 and do the dispose & destroy without atomic decrements, or we do
> a single decrement and don't dispose or destroy anything). But for the
> old logic, we do want that code inlined into _M_release (or
> _M_release_orig as it was called in your patch). Have I got that right
> now?
>
> Yes. Completely right.


> What if we remove the __noinline__ from _M_release_last_use() so that
> it can be inlined, but then add a noinline wrapper that can be called
> when we don't want to inline it?
>
> So:
>   // Called by _M_release() when the use count reaches zero.
>   void
>   _M_release_last_use() noexcept
>   {
> // unchanged from previous patch, but without the attribute.
> // ...
>   }
>
>   // As above, but 'noinline' to reduce code size on the cold path.
>   __attribute__((__noinline__))
>   void
>   _M_release_last_use_cold() noexcept
>   { _M_release_last_use(); }
>
>
> And then:
>
>   template<>
> inline void
> _Sp_counted_base<_S_atomic>::_M_release() noexcept
> {
>   _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> #if ! _GLIBCXX_TSAN
>   constexpr bool __lock_free
> = __atomic_always_lock_free(sizeof(long long), 0)
> && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>   constexpr bool __double_word
> = sizeof(long long) == 2 * sizeof(_Atomic_word);
>   // The ref-count members follow the vptr, so are aligned to
>   // alignof(void*).
>   constexpr bool __aligned = __alignof(long long) <= alignof(void*);
>   if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
> {
>   constexpr long long __unique_ref
> = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
>   auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
>
>   _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>   if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) ==
> __unique_ref)
> {
>   // Both counts are 1, so there are no weak references and
>   // we are releasing the last strong reference. No other
>   // threads can observe the effects of this _M_release()
>   // call (e.g. calling use_count()) without a data race.
>   *(long long*)(&_M_use_count) = 0;
>   _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>   _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>   _M_dispose();
>   _M_destroy();
>   return;
> }
>   if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) ==
> 1)
> {
>   _M_release_last_use_cold();
>   return;
> }
> }
>   else
> #endif
>   if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
> {
>   _M_release_last_use();
> }
> }
>
>
> So we use the noinline version for the else branch in the new 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Jonathan Wakely via Gcc-patches
On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
>
> Sorry. I totally missed the rest of your message and the patch. My fuzzy 
> eyesight, which usually guesses correctly 90% of the time, mistook "Secondly" 
> on a line by itself for "Sincerely" :-)

:-)

> The noinlining was based on looking at generated code. That was for clang. It 
> was inlining the _M_last_use function for every instance of _M_release (e.g., 
> ~shared_ptr). This optimization with the noinline for _M_release_last_use 
> ended up reducing massive binary text sizes by 1.5% (several megabytes), 
> which was an extra benefit. Without the noinline we saw code size increase.

Wow, that is a convincing argument for making it not inline, thanks.

> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))

Right.

> I didn't understand the part about programmers doing #define noinline 1. I 
> don't see code in the patch that uses noinline.

This is a valid C++ program:

#define noinline 1
#include <memory>
int main() { }

But if anything in <memory> uses "noinline" then this valid program
will not compile. Which is why we must use ((__noinline__)) instead of
((noinline)).



>
> How about something like this comment?
>
> // Noinline to avoid code size increase.

Great, thanks.

On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> Actually I take back what I said. Sorry. I think the logic in your patch is 
> correct. I missed some of the atomic decrements.
> But I'd be concerned about performance. If we make _M_release_last_use 
> noinline then we are adding overhead to the fast path of the original logic 
> (where both counts are 1).

Oh, I see. So the code duplication serves a purpose. We want the
_M_release_last_use() code to be non-inline for the new logic, because
in the common case we never call it (we either detect that both counts
are 1 and do the dispose & destroy without atomic decrements, or we do
a single decrement and don't dispose or destroy anything). But for the
old logic, we do want that code inlined into _M_release (or
_M_release_orig as it was called in your patch). Have I got that right
now?

What if we remove the __noinline__ from _M_release_last_use() so that
it can be inlined, but then add a noinline wrapper that can be called
when we don't want to inline it?

So:
  // Called by _M_release() when the use count reaches zero.
  void
  _M_release_last_use() noexcept
  {
// unchanged from previous patch, but without the attribute.
// ...
  }

  // As above, but 'noinline' to reduce code size on the cold path.
  __attribute__((__noinline__))
  void
  _M_release_last_use_cold() noexcept
  { _M_release_last_use(); }


And then:

  template<>
inline void
_Sp_counted_base<_S_atomic>::_M_release() noexcept
{
  _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
#if ! _GLIBCXX_TSAN
  constexpr bool __lock_free
= __atomic_always_lock_free(sizeof(long long), 0)
&& __atomic_always_lock_free(sizeof(_Atomic_word), 0);
  constexpr bool __double_word
= sizeof(long long) == 2 * sizeof(_Atomic_word);
  // The ref-count members follow the vptr, so are aligned to
  // alignof(void*).
  constexpr bool __aligned = __alignof(long long) <= alignof(void*);
  if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
{
  constexpr long long __unique_ref
= 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
  auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);

  _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
  if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
{
  // Both counts are 1, so there are no weak references and
  // we are releasing the last strong reference. No other
  // threads can observe the effects of this _M_release()
  // call (e.g. calling use_count()) without a data race.
  *(long long*)(&_M_use_count) = 0;
  _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
  _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
  _M_dispose();
  _M_destroy();
  return;
}
  if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
{
  _M_release_last_use_cold();
  return;
}
}
  else
#endif
  if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
{
  _M_release_last_use();
}
}


So we use the noinline version for the else branch in the new logic,
but the can-inline version for the old logic. Would that work?
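
As an aside on the __unique_ref constant in the snippet above: with a
32-bit _Atomic_word, both counts being 1 corresponds to the combined
64-bit value shown below (each 32-bit half holds 1, so the comparison is
endianness-independent). This is just a worked example of the arithmetic,
not additional code from the patch:

  // 1LL + (1LL << 32) == 0x100000001:
  //   one 32-bit half == 1  -> _M_use_count  == 1
  //   other half      == 1  -> _M_weak_count == 1
  static_assert(1LL + (1LL << 32) == 0x100000001LL, "combined value");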

We could also consider adding __attribute__((__cold__)) to the
_M_release_last_use_cold() function, and/or add [[__unlikely__]] to
the 'if' that calls it, but maybe that's overkill.
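
For illustration, those optional annotations could look roughly like the
sketch below; neither was part of the patch as posted, and [[__unlikely__]]
on the branch needs C++20 (or a compiler extension):

  // As above, but 'noinline' and 'cold' to keep it off the hot path.
  __attribute__((__noinline__, __cold__))
  void
  _M_release_last_use_cold() noexcept
  { _M_release_last_use(); }

  // ... and at the call site:
  if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
    [[__unlikely__]]
    {
      _M_release_last_use_cold();
      return;
    }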

It seems like this will slightly pessimize the case where the last use

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Maged Michael via Gcc-patches
On Wed, Aug 4, 2021 at 1:19 PM Maged Michael 
wrote:

> Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)
>
> On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely 
> wrote:
>
>> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>> >
>> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>> > >
>> > > This is the right patch. The previous one is missing noexcept. Sorry.
>> > >
>> > >
>> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael 
>> wrote:
>> > >>
>> > >> Please find attached an updated patch after incorporating Jonathan's
>> suggestions.
>> > >>
>> > >> Changes from the last patch include:
>> > >> - Add a TSAN macro to bits/c++config.
>> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
>> double-width and alignment.
>> > >> - Move the code in the optimized path to a separate function
>> _M_release_double_width_cas.
>> >
>> > Thanks for the updated patch. At a quick glance it looks great. I'll
>> > apply it locally and test it tomorrow.
>>
>>
>> It occurs to me now that _M_release_double_width_cas is the wrong
>> name, because we're not doing a DWCAS, just a double-width load. What
>> do you think about renaming it to _M_release_combined instead? Since
>> it does a combined load of the two counts.
>>
>> I'm no longer confident in the alignof suggestion I gave you.
>>
>> +constexpr bool __double_word
>> +  = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> +// The ref-count members follow the vptr, so are aligned to
>> +// alignof(void*).
>> +constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> For IA32 (i.e. 32-bit x86) this constant will be true, because
>> alignof(long long) and alignof(void*) are both 4, even though
>> sizeof(long long) is 8. So in theory the _M_use_count and
>> _M_weak_count members could be in different cache lines, and the
>> atomic load will not be atomic. I think we want to use __alignof(long
>> long) here to get the "natural" alignment, not the smaller 4B
>> alignment allowed by SysV ABI. That means that we won't do this
>> optimization on IA32, but I think that's acceptable.
>>
>> Alternatively, we could keep using alignof, but with an additional
>> run-time check something like (uintptr_t)&_M_use_count / 64 ==
>> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
>> line. I think for now we should just use __alignof and live without
>> the optimization on IA32.
>>
>> Secondly:
>>
>> +  void
>> +  __attribute__ ((noinline))
>>
>> This needs to be __noinline__ because noinline is not a reserved word,
>> so users can do:
>> #define noinline 1
>> #include <memory>
>>
>> Was it performance profiling, or code-size measurements (or something
>> else?) that showed this should be non-inline?
>> I'd like to add a comment explaining why we want it to be noinline.
>>
>> The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_last_use function for every instance of
> _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> (several megabytes), which was an extra benefit. Without the noinline we
> saw code size increase.
>
> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))
>
>
> I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.
>
> How about something like this comment?
>
> // Noinline to avoid code size increase.
>
>
>
> In that function ...
>>
>> +  _M_release_last_use() noexcept
>> +  {
>> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> +_M_dispose();
>> +if (_Mutex_base<_Lp>::_S_need_barriers)
>> +  {
>> +__atomic_thread_fence (__ATOMIC_ACQ_REL);
>> +  }
>>
>> I think we can remove this fence. The _S_need_barriers constant is
>> only true for the _S_mutex policy, and that just calls
>> _M_release_orig(), so never uses this function. I'll remove it and add
>> a comment noting that the lack of barrier is intentional.
>>
>> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>> +   -1) == 1)
>> +  {
>> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> +_M_destroy();
>> +  }
>> +  }
>>
>> Alternatively, we could keep the fence in _M_release_last_use() and
>> refactor the other _M_release functions, so that we have explicit
>> specializations for each of:
>> _Sp_counted_base<_S_single>::_M_release() (already present)
>> _Sp_counted_base<_S_mutex>::_M_release()
>> _Sp_counted_base<_S_atomic>::_M_release()
>>
>> The second and third would be new, as currently they both 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Maged Michael via Gcc-patches
Sorry. I totally missed the rest of your message and the patch. My fuzzy
eyesight, which usually guesses correctly 90% of the time, mistook
"Secondly" on a line by itself for "Sincerely" :-)

On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely 
wrote:

> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
> >
> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> > >
> > > This is the right patch. The previous one is missing noexcept. Sorry.
> > >
> > >
> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael 
> wrote:
> > >>
> > >> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
> > >>
> > >> Changes from the last patch include:
> > >> - Add a TSAN macro to bits/c++config.
> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> > >> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
> >
> > Thanks for the updated patch. At a quick glance it looks great. I'll
> > apply it locally and test it tomorrow.
>
>
> It occurs to me now that _M_release_double_width_cas is the wrong
> name, because we're not doing a DWCAS, just a double-width load. What
> do you think about renaming it to _M_release_combined instead? Since
> it does a combined load of the two counts.
>
> I'm no longer confident in the alignof suggestion I gave you.
>
> +constexpr bool __double_word
> +  = sizeof(long long) == 2 * sizeof(_Atomic_word);
> +// The ref-count members follow the vptr, so are aligned to
> +// alignof(void*).
> +constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
> For IA32 (i.e. 32-bit x86) this constant will be true, because
> alignof(long long) and alignof(void*) are both 4, even though
> sizeof(long long) is 8. So in theory the _M_use_count and
> _M_weak_count members could be in different cache lines, and the
> atomic load will not be atomic. I think we want to use __alignof(long
> long) here to get the "natural" alignment, not the smaller 4B
> alignment allowed by SysV ABI. That means that we won't do this
> optimization on IA32, but I think that's acceptable.
>
> Alternatively, we could keep using alignof, but with an additional
> run-time check something like (uintptr_t)&_M_use_count / 64 ==
> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
> line. I think for now we should just use __alignof and live without
> the optimization on IA32.
>
> Secondly:
>
> +  void
> +  __attribute__ ((noinline))
>
> This needs to be __noinline__ because noinline is not a reserved word,
> so users can do:
> #define noinline 1
> #include <memory>
>
> Was it performance profiling, or code-size measurements (or something
> else?) that showed this should be non-inline?
> I'd like to add a comment explaining why we want it to be noinline.
>
> The noinlining was based on looking at generated code. That was for clang.
It was inlining the _M_last_use function for every instance of _M_release
(e.g., ~shared_ptr). This optimization with the noinline for
_M_release_last_use ended up reducing massive binary text sizes by 1.5%
(several megabytes), which was an extra benefit. Without the noinline we
saw code size increase.

IIUC, we can use the following. Right?

__attribute__((__noinline__))


I didn't understand the part about programmers doing #define noinline 1. I
don't see code in the patch that uses noinline.

How about something like this comment?

// Noinline to avoid code size increase.



In that function ...
>
> +  _M_release_last_use() noexcept
> +  {
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> +_M_dispose();
> +if (_Mutex_base<_Lp>::_S_need_barriers)
> +  {
> +__atomic_thread_fence (__ATOMIC_ACQ_REL);
> +  }
>
> I think we can remove this fence. The _S_need_barriers constant is
> only true for the _S_mutex policy, and that just calls
> _M_release_orig(), so never uses this function. I'll remove it and add
> a comment noting that the lack of barrier is intentional.
>
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> +   -1) == 1)
> +  {
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> +_M_destroy();
> +  }
> +  }
>
> Alternatively, we could keep the fence in _M_release_last_use() and
> refactor the other _M_release functions, so that we have explicit
> specializations for each of:
> _Sp_counted_base<_S_single>::_M_release() (already present)
> _Sp_counted_base<_S_mutex>::_M_release()
> _Sp_counted_base<_S_atomic>::_M_release()
>
> The second and third would be new, as currently they both use the
> definition in the primary template. The _S_mutex one would just
> decrement _M_use_count then call _M_release_last_use() (which still
> has the barrier needed for the _S_mutex policy). The 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Jonathan Wakely via Gcc-patches
On Wed, 4 Aug 2021 at 16:47, Maged Michael  wrote:
>
> Thanks, Jonathan!
>
> On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely  wrote:
>>
>> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>> >
>> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>> > >
>> > > This is the right patch. The previous one is missing noexcept. Sorry.
>> > >
>> > >
>> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael  
>> > > wrote:
>> > >>
>> > >> Please find attached an updated patch after incorporating Jonathan's 
>> > >> suggestions.
>> > >>
>> > >> Changes from the last patch include:
>> > >> - Add a TSAN macro to bits/c++config.
>> > >> - Use separate constexpr bool-s for the conditions for lock-freedom, 
>> > >> double-width and alignment.
>> > >> - Move the code in the optimized path to a separate function 
>> > >> _M_release_double_width_cas.
>> >
>> > Thanks for the updated patch. At a quick glance it looks great. I'll
>> > apply it locally and test it tomorrow.
>>
>>
>> It occurs to me now that _M_release_double_width_cas is the wrong
>> name, because we're not doing a DWCAS, just a double-width load. What
>> do you think about renaming it to _M_release_combined instead? Since
>> it does a combined load of the two counts.
>
>
> Yes definitely _M_release_combined makes sense. Will do.

See the patch I attached to my previous mail, which has a refactored
version that gets rid of that function entirely.

>
>>
>> I'm no longer confident in the alignof suggestion I gave you.
>>
>> +constexpr bool __double_word
>> +  = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> +// The ref-count members follow the vptr, so are aligned to
>> +// alignof(void*).
>> +constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> For IA32 (i.e. 32-bit x86) this constant will be true, because
>> alignof(long long) and alignof(void*) are both 4, even though
>> sizeof(long long) is 8. So in theory the _M_use_count and
>> _M_weak_count members could be in different cache lines, and the
>> atomic load will not be atomic. I think we want to use __alignof(long
>> long) here to get the "natural" alignment, not the smaller 4B
>> alignment allowed by SysV ABI. That means that we won't do this
>> optimization on IA32, but I think that's acceptable.
>>
>> Alternatively, we could keep using alignof, but with an additional
>> run-time check something like (uintptr_t)&_M_use_count / 64 ==
>> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
>> line. I think for now we should just use __alignof and live without
>> the optimization on IA32.
>>
> I'd rather avoid any runtime checks because they may negate the speed 
> rationale for doing this optimization.
> I'd be OK with not doing this optimization for any 32-bit architecture.
>
> Is it OK to change the __aligned condition to the following?
> constexpr bool __aligned =
>   (alignof(long long) <= alignof(void*))
>   && (sizeof(long long) == sizeof(void*));

Yes, that will work fine.


Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Maged Michael via Gcc-patches
Thanks, Jonathan!

On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely 
wrote:

> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
> >
> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> > >
> > > This is the right patch. The previous one is missing noexcept. Sorry.
> > >
> > >
> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael 
> wrote:
> > >>
> > >> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
> > >>
> > >> Changes from the last patch include:
> > >> - Add a TSAN macro to bits/c++config.
> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> > >> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
> >
> > Thanks for the updated patch. At a quick glance it looks great. I'll
> > apply it locally and test it tomorrow.
>
>
> It occurs to me now that _M_release_double_width_cas is the wrong
> name, because we're not doing a DWCAS, just a double-width load. What
> do you think about renaming it to _M_release_combined instead? Since
> it does a combined load of the two counts.
>

Yes definitely _M_release_combined makes sense. Will do.


> I'm no longer confident in the alignof suggestion I gave you.
>
> +constexpr bool __double_word
> +  = sizeof(long long) == 2 * sizeof(_Atomic_word);
> +// The ref-count members follow the vptr, so are aligned to
> +// alignof(void*).
> +constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
> For IA32 (i.e. 32-bit x86) this constant will be true, because
> alignof(long long) and alignof(void*) are both 4, even though
> sizeof(long long) is 8. So in theory the _M_use_count and
> _M_weak_count members could be in different cache lines, and the
> atomic load will not be atomic. I think we want to use __alignof(long
> long) here to get the "natural" alignment, not the smaller 4B
> alignment allowed by SysV ABI. That means that we won't do this
> optimization on IA32, but I think that's acceptable.
>
> Alternatively, we could keep using alignof, but with an additional
> run-time check something like (uintptr_t)&_M_use_count / 64 ==
> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
> line. I think for now we should just use __alignof and live without
> the optimization on IA32.
>
> I'd rather avoid any runtime checks because they may negate the speed
rationale for doing this optimization.
I'd be OK with not doing this optimization for any 32-bit architecture.

Is it OK to change the __aligned condition to the following?
constexpr bool __aligned =
  (alignof(long long) <= alignof(void*))
  && (sizeof(long long) == sizeof(void*));

Thanks,
Maged



> Secondly:
>
> +  void
> +  __attribute__ ((noinline))
>
> This needs to be __noinline__ because noinline is not a reserved word,
> so users can do:
> #define noinline 1
> #include <memory>
>
> Was it performance profiling, or code-size measurements (or something
> else?) that showed this should be non-inline?
> I'd like to add a comment explaining why we want it to be noinline.
>
> In that function ...
>
> +  _M_release_last_use() noexcept
> +  {
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> +_M_dispose();
> +if (_Mutex_base<_Lp>::_S_need_barriers)
> +  {
> +__atomic_thread_fence (__ATOMIC_ACQ_REL);
> +  }
>
> I think we can remove this fence. The _S_need_barriers constant is
> only true for the _S_mutex policy, and that just calls
> _M_release_orig(), so never uses this function. I'll remove it and add
> a comment noting that the lack of barrier is intentional.
>
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> +   -1) == 1)
> +  {
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> +_M_destroy();
> +  }
> +  }
>
> Alternatively, we could keep the fence in _M_release_last_use() and
> refactor the other _M_release functions, so that we have explicit
> specializations for each of:
> _Sp_counted_base<_S_single>::_M_release() (already present)
> _Sp_counted_base<_S_mutex>::_M_release()
> _Sp_counted_base<_S_atomic>::_M_release()
>
> The second and third would be new, as currently they both use the
> definition in the primary template. The _S_mutex one would just
> decrement _M_use_count then call _M_release_last_use() (which still
> has the barrier needed for the _S_mutex policy). The _S_atomic one
> would have your new optimization. See the attached patch showing what
> I mean. I find this version a bit simpler to understand, as we just
> have _M_release and _M_release_last_use, without
> _M_release_double_width_cas and _M_release_orig. What do you think of
> this version? Does it lose any important properties of your version
> which I've failed to notice?
>


Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-04 Thread Jonathan Wakely via Gcc-patches
On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>
> On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> >
> > This is the right patch. The previous one is missing noexcept. Sorry.
> >
> >
> > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael  
> > wrote:
> >>
> >> Please find attached an updated patch after incorporating Jonathan's 
> >> suggestions.
> >>
> >> Changes from the last patch include:
> >> - Add a TSAN macro to bits/c++config.
> >> - Use separate constexpr bool-s for the conditions for lock-freedom, 
> >> double-width and alignment.
> >> - Move the code in the optimized path to a separate function 
> >> _M_release_double_width_cas.
>
> Thanks for the updated patch. At a quick glance it looks great. I'll
> apply it locally and test it tomorrow.


It occurs to me now that _M_release_double_width_cas is the wrong
name, because we're not doing a DWCAS, just a double-width load. What
do you think about renaming it to _M_release_combined instead? Since
it does a combined load of the two counts.

I'm no longer confident in the alignof suggestion I gave you.

+constexpr bool __double_word
+  = sizeof(long long) == 2 * sizeof(_Atomic_word);
+// The ref-count members follow the vptr, so are aligned to
+// alignof(void*).
+constexpr bool __aligned = alignof(long long) <= alignof(void*);

For IA32 (i.e. 32-bit x86) this constant will be true, because
alignof(long long) and alignof(void*) are both 4, even though
sizeof(long long) is 8. So in theory the _M_use_count and
_M_weak_count members could be in different cache lines, and the
atomic load will not be atomic. I think we want to use __alignof(long
long) here to get the "natural" alignment, not the smaller 4B
alignment allowed by SysV ABI. That means that we won't do this
optimization on IA32, but I think that's acceptable.

Alternatively, we could keep using alignof, but with an additional
run-time check something like (uintptr_t)&_M_use_count / 64 ==
(uintptr_t)&_M_weak_count / 64 to check they're on the same cache
line. I think for now we should just use __alignof and live without
the optimization on IA32.
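
A minimal sketch of that run-time check, assuming a 64-byte cache line
(the constant and the spelling here are illustrative, not part of the
posted patch):

  const auto __use_addr  = reinterpret_cast<__UINTPTR_TYPE__>(&_M_use_count);
  const auto __weak_addr = reinterpret_cast<__UINTPTR_TYPE__>(&_M_weak_count);
  // Only use the combined 64-bit load if both counts share a cache line.
  const bool __same_line = (__use_addr / 64) == (__weak_addr / 64);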

Secondly:

+  void
+  __attribute__ ((noinline))

This needs to be __noinline__ because noinline is not a reserved word,
so users can do:
#define noinline 1
#include <memory>

Was it performance profiling, or code-size measurements (or something
else?) that showed this should be non-inline?
I'd like to add a comment explaining why we want it to be noinline.

In that function ...

+  _M_release_last_use() noexcept
+  {
+_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+_M_dispose();
+if (_Mutex_base<_Lp>::_S_need_barriers)
+  {
+__atomic_thread_fence (__ATOMIC_ACQ_REL);
+  }

I think we can remove this fence. The _S_need_barriers constant is
only true for the _S_mutex policy, and that just calls
_M_release_orig(), so never uses this function. I'll remove it and add
a comment noting that the lack of barrier is intentional.

+_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+   -1) == 1)
+  {
+_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+_M_destroy();
+  }
+  }

Alternatively, we could keep the fence in _M_release_last_use() and
refactor the other _M_release functions, so that we have explicit
specializations for each of:
_Sp_counted_base<_S_single>::_M_release() (already present)
_Sp_counted_base<_S_mutex>::_M_release()
_Sp_counted_base<_S_atomic>::_M_release()

The second and third would be new, as currently they both use the
definition in the primary template. The _S_mutex one would just
decrement _M_use_count then call _M_release_last_use() (which still
has the barrier needed for the _S_mutex policy). The _S_atomic one
would have your new optimization. See the attached patch showing what
I mean. I find this version a bit simpler to understand, as we just
have _M_release and _M_release_last_use, without
_M_release_double_width_cas and _M_release_orig. What do you think of
this version? Does it lose any important properties of your version
which I've failed to notice?
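
For reference, the _S_mutex specialization described above would look
roughly like this sketch (the attached diff is the authoritative version):

  template<>
    inline void
    _Sp_counted_base<_S_mutex>::_M_release() noexcept
    {
      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
        {
          // _M_release_last_use() keeps the acq_rel fence that the
          // _S_mutex policy needs (_S_need_barriers is true here).
          _M_release_last_use();
        }
    }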
diff --git a/libstdc++-v3/include/bits/c++config 
b/libstdc++-v3/include/bits/c++config
index 32b8957f814..07465f7ecd5 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -143,6 +143,15 @@
 # define _GLIBCXX_NODISCARD
 #endif
 
+// Macro for TSAN.
+#if __SANITIZE_THREAD__
+#  define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 
 
 #if __cplusplus
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h 
b/libstdc++-v3/include/bits/shared_ptr_base.h
index 5be935d174d..b2397c8fddb 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-03 Thread Jonathan Wakely via Gcc-patches
On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>
> This is the right patch. The previous one is missing noexcept. Sorry.
>
>
> On Mon, Aug 2, 2021 at 9:23 AM Maged Michael  wrote:
>>
>> Please find attached an updated patch after incorporating Jonathan's 
>> suggestions.
>>
>> Changes from the last patch include:
>> - Add a TSAN macro to bits/c++config.
>> - Use separate constexpr bool-s for the conditions for lock-freedom, 
>> double-width and alignment.
>> - Move the code in the optimized path to a separate function 
>> _M_release_double_width_cas.

Thanks for the updated patch. At a quick glance it looks great. I'll
apply it locally and test it tomorrow.


Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-02 Thread Maged Michael via Gcc-patches
This is the right patch. The previous one is missing noexcept. Sorry.


On Mon, Aug 2, 2021 at 9:23 AM Maged Michael 
wrote:

> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
>
> Changes from the last patch include:
> - Add a TSAN macro to bits/c++config.
> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
>
> Thanks,
> Maged
>
>
> On Fri, Jul 16, 2021 at 11:55 AM Maged Michael 
> wrote:
>
>> Thank you, Jonathan, for the detailed comments! I'll update the patch
>> accordingly.
>>
>> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely 
>> wrote:
>>
>>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>>> >
>>> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
>>> the
>>> > two atomic instructions that decrement each of the use count and the
>>> weak
>>> > count when both are 1. I proposed the general idea in an earlier
>>> thread (
>>> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html)
>>> and got
>>> > useful feedback on a draft patch and responses to related questions
>>> about
>>> > multi-granular atomicity and alignment. This patch is based on that
>>> > feedback.
>>> >
>>> >
>>> > I added a check for thread sanitizer to use the current algorithm in
>>> that
>>> > case because TSAN does not support multi-granular atomicity. I'd like
>>> to
>>> > add a check of __has_feature(thread_sanitizer) for building using
>>> LLVM. I
>>> > found examples of __has_feature in libstdc++
>>>
>>> There are no uses of __has_feature in libstdc++. We do use
>>> __has_builtin (which GCC also supports) and Clang's __is_identifier
>>> (which GCC doesn't support) to work around some weird semantics of
>>> __has_builtin in older versions of Clang.
>>>
>>>
>>> > but it doesn't seem to be
>>> > recognized in shared_ptr_base.h. Any guidance on how to check
>>> > __has_feature(thread_sanitizer) in this patch?
>>>
>>> I think we want to do something like this in include/bits/c++config
>>>
>>> #if __SANITIZE_THREAD__
>>> #  define _GLIBCXX_TSAN 1
>>> #elif defined __has_feature
>>> # if __has_feature(thread_sanitizer)
>>> #  define _GLIBCXX_TSAN 1
>>> # endif
>>> #endif
>>>
>>> Then in bits/shared_ptr_base.h
>>>
>>> #if _GLIBCXX_TSAN
>>> _M_release_orig();
>>> return;
>>> #endif
>>>
>>>
>>>
>>> > GCC generates code for _M_release that is larger and more complex than
>>> that
>>> > generated by LLVM. I'd like to file a bug report about that. Jonathan,
>>>
>>> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>>>
>>> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
>> generated less code than gcc. I see in the response to the issue that the
>> new glibc is expected to optimize better. So maybe this will eliminate the
>> issue.
>>
>>
>>> > would you please create a bugzilla account for me (
>>> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>>>
>>> Done (sorry, I didn't notice the request in this mail until coming
>>> back to it to review the patch properly).
>>>
>>> Thank you!
>>
>>
>>>
>>>
>>> >
>>> >
>>> > Information about the patch:
>>> >
>>> > - Benefits of the patch: Save the cost of the last atomic decrements of
>>> > each of the use count and the weak count in _Sp_counted_base. Atomic
>>> > instructions are significantly slower than regular loads and stores
>>> across
>>> > major architectures.
>>> >
>>> > - How current code works: _M_release() atomically decrements the use
>>> count,
>>> > checks if it was 1, if so calls _M_dispose(), atomically decrements the
>>> > weak count, checks if it was 1, and if so calls _M_destroy().
>>> >
>>> > - How the proposed patch works: _M_release() loads both use count and
>>> weak
>>> > count together atomically (when properly aligned), checks if the value
>>> is
>>> > equal to the value of both counts equal to 1 (e.g., 0x10001), and
>>> if so
>>> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
>>> > algorithm.
>>> >
>>> > - Why it works: When the current thread executing _M_release() finds
>>> each
>>> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no
>>> other
>>> > threads could possibly hold use or weak references to this control
>>> block.
>>> > That is, no other threads could possibly access the counts or the
>>> protected
>>> > object.
>>> >
>>> > - The proposed patch is intended to interact correctly with current
>>> code
>>> > (under certain conditions: _Lock_policy is _S_atomic, proper
>>> alignment, and
>>> > native lock-free support for atomic operations). That is, multiple
>>> threads
>>> > using different versions of the code with and without the patch
>>> operating
>>> > on the same objects should always interact correctly. The intent for
>>> the
>>> > patch is to be ABI compatible with the current implementation.
>>> >
>>> > - The proposed patch 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-08-02 Thread Maged Michael via Gcc-patches
Please find attached an updated patch after incorporating Jonathan's
suggestions.

Changes from the last patch include:
- Add a TSAN macro to bits/c++config.
- Use separate constexpr bool-s for the conditions for lock-freedom,
double-width and alignment.
- Move the code in the optimized path to a separate function
_M_release_double_width_cas.

Thanks,
Maged


On Fri, Jul 16, 2021 at 11:55 AM Maged Michael 
wrote:

> Thank you, Jonathan, for the detailed comments! I'll update the patch
> accordingly.
>
> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely 
> wrote:
>
>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>> >
>> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
>> the
>> > two atomic instructions that decrement each of the use count and the
>> weak
>> > count when both are 1. I proposed the general idea in an earlier thread
>> (
>> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and
>> got
>> > useful feedback on a draft patch and responses to related questions
>> about
>> > multi-granular atomicity and alignment. This patch is based on that
>> > feedback.
>> >
>> >
>> > I added a check for thread sanitizer to use the current algorithm in
>> that
>> > case because TSAN does not support multi-granular atomicity. I'd like to
>> > add a check of __has_feature(thread_sanitizer) for building using LLVM.
>> I
>> > found examples of __has_feature in libstdc++
>>
>> There are no uses of __has_feature in libstdc++. We do use
>> __has_builtin (which GCC also supports) and Clang's __is_identifier
>> (which GCC doesn't support) to work around some weird semantics of
>> __has_builtin in older versions of Clang.
>>
>>
>> > but it doesn't seem to be
>> > recognized in shared_ptr_base.h. Any guidance on how to check
>> > __has_feature(thread_sanitizer) in this patch?
>>
>> I think we want to do something like this in include/bits/c++config
>>
>> #if __SANITIZE_THREAD__
>> #  define _GLIBCXX_TSAN 1
>> #elif defined __has_feature
>> # if __has_feature(thread_sanitizer)
>> #  define _GLIBCXX_TSAN 1
>> # endif
>> #endif
>>
>> Then in bits/shared_ptr_base.h
>>
>> #if _GLIBCXX_TSAN
>> _M_release_orig();
>> return;
>> #endif
>>
>>
>>
>> > GCC generates code for _M_release that is larger and more complex than
>> that
>> > generated by LLVM. I'd like to file a bug report about that. Jonathan,
>>
>> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>>
>> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
> generated less code than gcc. I see in the response to the issue that the
> new glibc is expected to optimize better. So maybe this will eliminate the
> issue.
>
>
>> > would you please create a bugzilla account for me (
>> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>>
>> Done (sorry, I didn't notice the request in this mail until coming
>> back to it to review the patch properly).
>>
>> Thank you!
>
>
>>
>>
>> >
>> >
>> > Information about the patch:
>> >
>> > - Benefits of the patch: Save the cost of the last atomic decrements of
>> > each of the use count and the weak count in _Sp_counted_base. Atomic
>> > instructions are significantly slower than regular loads and stores
>> across
>> > major architectures.
>> >
>> > - How current code works: _M_release() atomically decrements the use
>> count,
>> > checks if it was 1, if so calls _M_dispose(), atomically decrements the
>> > weak count, checks if it was 1, and if so calls _M_destroy().
>> >
>> > - How the proposed patch works: _M_release() loads both use count and
>> weak
>> > count together atomically (when properly aligned), checks if the value
>> is
>> > equal to the value of both counts equal to 1 (e.g., 0x10001), and
>> if so
>> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
>> > algorithm.
>> >
>> > - Why it works: When the current thread executing _M_release() finds
>> each
>> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no
>> other
>> > threads could possibly hold use or weak references to this control
>> block.
>> > That is, no other threads could possibly access the counts or the
>> protected
>> > object.
>> >
>> > - The proposed patch is intended to interact correctly with current code
>> > (under certain conditions: _Lock_policy is _S_atomic, proper alignment,
>> and
>> > native lock-free support for atomic operations). That is, multiple
>> threads
>> > using different versions of the code with and without the patch
>> operating
>> > on the same objects should always interact correctly. The intent for the
>> > patch is to be ABI compatible with the current implementation.
>> >
>> > - The proposed patch involves a performance trade-off between saving the
>> > costs of two atomic instructions when the counts are both 1 vs adding
>> the
>> > cost of loading the combined counts and comparison with two ones (e.g.,
>> > 0x10001).
>> >
>> > - The patch has been in use (built using LLVM) in a 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-07-16 Thread Maged Michael via Gcc-patches
Thank you, Jonathan, for the detailed comments! I'll update the patch
accordingly.

On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely 
wrote:

> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
> >
> > Please find a proposed patch for _Sp_counted_base::_M_release to skip the
> > two atomic instructions that decrement each of the use count and the weak
> > count when both are 1. I proposed the general idea in an earlier thread (
> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and
> got
> > useful feedback on a draft patch and responses to related questions about
> > multi-granular atomicity and alignment. This patch is based on that
> > feedback.
> >
> >
> > I added a check for thread sanitizer to use the current algorithm in that
> > case because TSAN does not support multi-granular atomicity. I'd like to
> > add a check of __has_feature(thread_sanitizer) for building using LLVM. I
> > found examples of __has_feature in libstdc++
>
> There are no uses of __has_feature in libstdc++. We do use
> __has_builtin (which GCC also supports) and Clang's __is_identifier
> (which GCC doesn't support) to work around some weird semantics of
> __has_builtin in older versions of Clang.
>
>
> > but it doesn't seem to be
> > recognized in shared_ptr_base.h. Any guidance on how to check
> > __has_feature(thread_sanitizer) in this patch?
>
> I think we want to do something like this in include/bits/c++config
>
> #if __SANITIZE_THREAD__
> #  define _GLIBCXX_TSAN 1
> #elif defined __has_feature
> # if __has_feature(thread_sanitizer)
> #  define _GLIBCXX_TSAN 1
> # endif
> #endif
>
> Then in bits/shared_ptr_base.h
>
> #if _GLIBCXX_TSAN
> _M_release_orig();
> return;
> #endif
>
>
>
> > GCC generates code for _M_release that is larger and more complex than
> that
> > generated by LLVM. I'd like to file a bug report about that. Jonathan,
>
> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>
> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
generated less code than gcc. I see in the response to the issue that the
new glibc is expected to optimize better. So maybe this will eliminate the
issue.


> > would you please create a bugzilla account for me (
> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>
> Done (sorry, I didn't notice the request in this mail until coming
> back to it to review the patch properly).
>
> Thank you!


>
>
> >
> >
> > Information about the patch:
> >
> > - Benefits of the patch: Save the cost of the last atomic decrements of
> > each of the use count and the weak count in _Sp_counted_base. Atomic
> > instructions are significantly slower than regular loads and stores
> across
> > major architectures.
> >
> > - How current code works: _M_release() atomically decrements the use
> count,
> > checks if it was 1, if so calls _M_dispose(), atomically decrements the
> > weak count, checks if it was 1, and if so calls _M_destroy().
> >
> > - How the proposed patch works: _M_release() loads both use count and weak
> > count together atomically (when properly aligned), checks whether the value
> > corresponds to both counts being equal to 1 (e.g., 0x10001), and if so
> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
> > algorithm.
> >
> > - Why it works: When the current thread executing _M_release() finds each
> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no other
> > threads could possibly hold use or weak references to this control block.
> > That is, no other threads could possibly access the counts or the protected
> > object.
> >
> > - The proposed patch is intended to interact correctly with current code
> > (under certain conditions: _Lock_policy is _S_atomic, proper alignment, and
> > native lock-free support for atomic operations). That is, multiple threads
> > using different versions of the code with and without the patch operating
> > on the same objects should always interact correctly. The intent for the
> > patch is to be ABI compatible with the current implementation.
> >
> > - The proposed patch involves a performance trade-off between saving the
> > costs of two atomic instructions when the counts are both 1 vs adding the
> > cost of loading the combined counts and comparison with two ones (e.g.,
> > 0x10001).
> >
> > - The patch has been in use (built using LLVM) in a large environment for
> > many months. The performance gains outweigh the losses (roughly 10 to 1)
> > across a large variety of workloads.
> >
> >
> > I'd appreciate feedback on the patch and any suggestions for checking
> > __has_feature(thread_sanitizer).
>
> N.B. gmail completely mangles patches unless you send them as attachments.
>
>
> > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
> > b/libstdc++-v3/include/bits/shared_ptr_base.h
> > index 368b2d7379a..a8fc944af5f 100644
> > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
> > +++ 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-07-16 Thread Jonathan Wakely via Gcc-patches
On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>
> Please find a proposed patch for _Sp_counted_base::_M_release to skip the
> two atomic instructions that decrement each of the use count and the weak
> count when both are 1. I proposed the general idea in an earlier thread (
> https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and got
> useful feedback on a draft patch and responses to related questions about
> multi-granular atomicity and alignment. This patch is based on that
> feedback.
>
>
> I added a check for thread sanitizer to use the current algorithm in that
> case because TSAN does not support multi-granular atomicity. I'd like to
> add a check of __has_feature(thread_sanitizer) for building using LLVM. I
> found examples of __has_feature in libstdc++

There are no uses of __has_feature in libstdc++. We do use
__has_builtin (which GCC also supports) and Clang's __is_identifier
(which GCC doesn't support) to work around some weird semantics of
__has_builtin in older versions of Clang.


> but it doesn't seem to be
> recognized in shared_ptr_base.h. Any guidance on how to check
> __has_feature(thread_sanitizer) in this patch?

I think we want to do something like this in include/bits/c++config

#if __SANITIZE_THREAD__
#  define _GLIBCXX_TSAN 1
#elif defined __has_feature
# if __has_feature(thread_sanitizer)
#  define _GLIBCXX_TSAN 1
# endif
#endif

Then in bits/shared_ptr_base.h

#if _GLIBCXX_TSAN
_M_release_orig();
return;
#endif
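
Putting those two pieces together, the guard in _M_release would presumably
look something like the sketch below; this is only an illustration of the
intent, not the final committed form.

void
_M_release() noexcept
{
#if _GLIBCXX_TSAN
  // TSan cannot reason about the mixed-size atomic accesses used by the
  // fast path, so always take the original two-decrement algorithm.
  _M_release_orig();
  return;
#endif
  // ... combined-count fast path, falling back to _M_release_orig() ...
}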



> GCC generates code for _M_release that is larger and more complex than that
> generated by LLVM. I'd like to file a bug report about that. Jonathan,

Is this the same issue as https://gcc.gnu.org/PR101406 ?

> would you please create a bugzilla account for me (
> https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.

Done (sorry, I didn't notice the request in this mail until coming
back to it to review the patch properly).



>
>
> Information about the patch:
>
> - Benefits of the patch: Save the cost of the last atomic decrements of
> each of the use count and the weak count in _Sp_counted_base. Atomic
> instructions are significantly slower than regular loads and stores across
> major architectures.
>
> - How current code works: _M_release() atomically decrements the use count,
> checks if it was 1 and, if so, calls _M_dispose(); it then atomically
> decrements the weak count, checks if it was 1, and if so calls _M_destroy()
> (a rough sketch of this path follows after this quoted list).
>
> - How the proposed patch works: _M_release() loads both use count and weak
> count together atomically (when properly aligned), checks whether the value
> corresponds to both counts being equal to 1 (e.g., 0x10001), and if so
> calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
> algorithm.
>
> - Why it works: When the current thread executing _M_release() finds each
> of the counts is equal to 1, then (when _lock_policy is _S_atomic) no other
> threads could possibly hold use or weak references to this control block.
> That is, no other threads could possibly access the counts or the protected
> object.
>
> - The proposed patch is intended to interact correctly with current code
> (under certain conditions: _Lock_policy is _S_atomic, proper alignment, and
> native lock-free support for atomic operations). That is, multiple threads
> using different versions of the code with and without the patch operating
> on the same objects should always interact correctly. The intent for the
> patch is to be ABI compatible with the current implementation.
>
> - The proposed patch involves a performance trade-off between saving the
> costs of two atomic instructions when the counts are both 1 vs adding the
> cost of loading the combined counts and comparison with two ones (e.g.,
> 0x10001).
>
> - The patch has been in use (built using LLVM) in a large environment for
> many months. The performance gains outweigh the losses (roughly 10 to 1)
> across a large variety of workloads.
>
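
For reference, the two-decrement path described above corresponds roughly to
the following sketch (simplified; the real implementation also carries the
TSan happens-before annotations and memory-barrier handling):

void
_M_release() noexcept
{
  // First atomic RMW: drop the strong reference.
  if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
    {
      _M_dispose();  // destroy the managed object
      // Second atomic RMW: drop the weak reference held on behalf of the use count.
      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, -1) == 1)
        _M_destroy();  // destroy *this and deallocate the control block
    }
}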
>
> I'd appreciate feedback on the patch and any suggestions for checking
> __has_feature(thread_sanitizer).

N.B. gmail completely mangles patches unless you send them as attachments.


> diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
> b/libstdc++-v3/include/bits/shared_ptr_base.h
> index 368b2d7379a..a8fc944af5f 100644
> --- a/libstdc++-v3/include/bits/shared_ptr_base.h
> +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
> @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> if (!_M_add_ref_lock_nothrow())
>   __throw_bad_weak_ptr();
>}
>
>bool
>_M_add_ref_lock_nothrow() noexcept;
>
>void
>_M_release() noexcept
>{
> +#if __SANITIZE_THREAD__
> +_M_release_orig();
> +return;
> +#endif
> +if (!__atomic_always_lock_free(sizeof(long long), 0) ||

The line break should come before the logical operator, not after.
This makes it easier to see which operator it is, because it's at 
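
In other words, something like this re-wrapping of the quoted condition
(illustrative only; the exact final form may differ):

if (!__atomic_always_lock_free(sizeof(long long), 0)
    || !__atomic_always_lock_free(sizeof(_Atomic_word), 0)
    || sizeof(long long) < (2 * sizeof(_Atomic_word))
    || sizeof(long long) > (sizeof(void*)))
  {
    _M_release_orig();
    return;
  }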

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-06-10 Thread Maged Michael via Gcc-patches
Would appreciate any comments on this proposed patch.

Thanks,
Maged


On Thu, Dec 17, 2020 at 3:49 PM Maged Michael 
wrote:

> Please find a proposed patch for _Sp_counted_base::_M_release to skip the
> two atomic instructions that decrement each of the use count and the weak
> count when both are 1. I proposed the general idea in an earlier thread (
> https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and
> got useful feedback on a draft patch and responses to related questions
> about multi-granular atomicity and alignment. This patch is based on that
> feedback.
>
>
> I added a check for thread sanitizer to use the current algorithm in that
> case because TSAN does not support multi-granular atomicity. I'd like to
> add a check of __has_feature(thread_sanitizer) for building using LLVM. I
> found examples of __has_feature in libstdc++ but it doesn't seem to be
> recognized in shared_ptr_base.h. Any guidance on how to check
> __has_feature(thread_sanitizer) in this patch?
>
>
> GCC generates code for _M_release that is larger and more complex than
> that generated by LLVM. I'd like to file a bug report about that. Jonathan,
> would you please create a bugzilla account for me (
> https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>
>
> Information about the patch:
>
> - Benefits of the patch: Save the cost of the last atomic decrements of
> each of the use count and the weak count in _Sp_counted_base. Atomic
> instructions are significantly slower than regular loads and stores across
> major architectures.
>
> - How current code works: _M_release() atomically decrements the use
> count, checks if it was 1, if so calls _M_dispose(), atomically decrements
> the weak count, checks if it was 1, and if so calls _M_destroy().
>
> - How the proposed patch works: _M_release() loads both use count and weak
> count together atomically (when properly aligned), checks whether the value
> corresponds to both counts being equal to 1 (e.g., 0x10001), and if so
> calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
> algorithm.
>
> - Why it works: When the current thread executing _M_release() finds each
> of the counts is equal to 1, then (when _lock_policy is _S_atomic) no other
> threads could possibly hold use or weak references to this control block.
> That is, no other threads could possibly access the counts or the protected
> object.
>
> - The proposed patch is intended to interact correctly with current code
> (under certain conditions: _Lock_policy is _S_atomic, proper alignment, and
> native lock-free support for atomic operations). That is, multiple threads
> using different versions of the code with and without the patch operating
> on the same objects should always interact correctly. The intent for the
> patch is to be ABI compatible with the current implementation.
>
> - The proposed patch involves a performance trade-off between saving the
> costs of two atomic instructions when the counts are both 1 vs adding the
> cost of loading the combined counts and comparison with two ones (e.g.,
> 0x10001).
>
> - The patch has been in use (built using LLVM) in a large environment for
> many months. The performance gains outweigh the losses (roughly 10 to 1)
> across a large variety of workloads.
>
>
> I'd appreciate feedback on the patch and any suggestions for checking
> __has_feature(thread_sanitizer).
>
>
> Maged
>
>
>
> diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
> b/libstdc++-v3/include/bits/shared_ptr_base.h
>
> index 368b2d7379a..a8fc944af5f 100644
>
> --- a/libstdc++-v3/include/bits/shared_ptr_base.h
>
> +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
>
> @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>
> if (!_M_add_ref_lock_nothrow())
>
>   __throw_bad_weak_ptr();
>
>}
>
>
>bool
>
>_M_add_ref_lock_nothrow() noexcept;
>
>
>void
>
>_M_release() noexcept
>
>{
>
> +#if __SANITIZE_THREAD__
>
> +_M_release_orig();
>
> +return;
>
> +#endif
>
> +if (!__atomic_always_lock_free(sizeof(long long), 0) ||
>
> +!__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
>
> +sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
>
> +sizeof(long long) > (sizeof(void*)))
>
> +  {
>
> +_M_release_orig();
>
> +return;
>
> +  }
>
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>
> +if (__atomic_load_n((long long*)(&_M_use_count),
> __ATOMIC_ACQUIRE)
>
> +== (1LL + (1LL << (8 * sizeof(_Atomic_word)
>
> +  {
>
> +// Both counts are 1, so there are no weak references and
>
> +// we are releasing the last strong reference. No other
>
> +// threads can observe the effects of this _M_release()
>
> +// call (e.g. calling 

Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2021-01-05 Thread Maged Michael via Gcc-patches
Please let me know if more information about this patch is needed. Thank
you.

Maged



On Thu, Dec 17, 2020 at 3:49 PM Maged Michael 
wrote:

> Please find a proposed patch for _Sp_counted_base::_M_release to skip the
> two atomic instructions that decrement each of the use count and the weak
> count when both are 1. I proposed the general idea in an earlier thread (
> https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and
> got useful feedback on a draft patch and responses to related questions
> about multi-granular atomicity and alignment. This patch is based on that
> feedback.
>
>
> I added a check for thread sanitizer to use the current algorithm in that
> case because TSAN does not support multi-granular atomicity. I'd like to
> add a check of __has_feature(thread_sanitizer) for building using LLVM. I
> found examples of __has_feature in libstdc++ but it doesn't seem to be
> recognized in shared_ptr_base.h. Any guidance on how to check
> __has_feature(thread_sanitizer) in this patch?
>
>
> GCC generates code for _M_release that is larger and more complex than
> that generated by LLVM. I'd like to file a bug report about that. Jonathan,
> would you please create a bugzilla account for me (
> https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>
>
> Information about the patch:
>
> - Benefits of the patch: Save the cost of the last atomic decrements of
> each of the use count and the weak count in _Sp_counted_base. Atomic
> instructions are significantly slower than regular loads and stores across
> major architectures.
>
> - How current code works: _M_release() atomically decrements the use
> count, checks if it was 1, if so calls _M_dispose(), atomically decrements
> the weak count, checks if it was 1, and if so calls _M_destroy().
>
> - How the proposed patch works: _M_release() loads both use count and weak
> count together atomically (when properly aligned), checks whether the value
> corresponds to both counts being equal to 1 (e.g., 0x10001), and if so
> calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
> algorithm.
>
> - Why it works: When the current thread executing _M_release() finds each
> of the counts is equal to 1, then (when _lock_policy is _S_atomic) no other
> threads could possibly hold use or weak references to this control block.
> That is, no other threads could possibly access the counts or the protected
> object.
>
> - The proposed patch is intended to interact correctly with current code
> (under certain conditions: _Lock_policy is _S_atomic, proper alignment, and
> native lock-free support for atomic operations). That is, multiple threads
> using different versions of the code with and without the patch operating
> on the same objects should always interact correctly. The intent for the
> patch is to be ABI compatible with the current implementation.
>
> - The proposed patch involves a performance trade-off between saving the
> costs of two atomic instructions when the counts are both 1 vs adding the
> cost of loading the combined counts and comparison with two ones (e.g.,
> 0x10001).
>
> - The patch has been in use (built using LLVM) in a large environment for
> many months. The performance gains outweigh the losses (roughly 10 to 1)
> across a large variety of workloads.
>
>
> I'd appreciate feedback on the patch and any suggestions for checking
> __has_feature(thread_sanitizer).
>
>
> Maged
>
>
>
> diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
> b/libstdc++-v3/include/bits/shared_ptr_base.h
>
> index 368b2d7379a..a8fc944af5f 100644
>
> --- a/libstdc++-v3/include/bits/shared_ptr_base.h
>
> +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
>
> @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>
> if (!_M_add_ref_lock_nothrow())
>
>   __throw_bad_weak_ptr();
>
>}
>
>
>bool
>
>_M_add_ref_lock_nothrow() noexcept;
>
>
>void
>
>_M_release() noexcept
>
>{
>
> +#if __SANITIZE_THREAD__
>
> +_M_release_orig();
>
> +return;
>
> +#endif
>
> +if (!__atomic_always_lock_free(sizeof(long long), 0) ||
>
> +!__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
>
> +sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
>
> +sizeof(long long) > (sizeof(void*)))
>
> +  {
>
> +_M_release_orig();
>
> +return;
>
> +  }
>
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>
> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>
> +if (__atomic_load_n((long long*)(&_M_use_count),
> __ATOMIC_ACQUIRE)
>
> +== (1LL + (1LL << (8 * sizeof(_Atomic_word)
>
> +  {
>
> +// Both counts are 1, so there are no weak references and
>
> +// we are releasing the last strong reference. No other
>
> +// threads can observe the effects of this _M_release()
>
> +// call (e.g. 

[PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1

2020-12-17 Thread Maged Michael via Gcc-patches
Please find a proposed patch for _Sp_counted_base::_M_release to skip the
two atomic instructions that decrement each of the use count and the weak
count when both are 1. I proposed the general idea in an earlier thread (
https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and got
useful feedback on a draft patch and responses to related questions about
multi-granular atomicity and alignment. This patch is based on that
feedback.


I added a check for thread sanitizer to use the current algorithm in that
case because TSAN does not support multi-granular atomicity. I'd like to
add a check of __has_feature(thread_sanitizer) for building using LLVM. I
found examples of __has_feature in libstdc++ but it doesn't seem to be
recognized in shared_ptr_base.h. Any guidance on how to check
__has_feature(thread_sanitizer) in this patch?


GCC generates code for _M_release that is larger and more complex than that
generated by LLVM. I'd like to file a bug report about that. Jonathan,
would you please create a bugzilla account for me (
https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.


Information about the patch:

- Benefits of the patch: Save the cost of the last atomic decrements of
each of the use count and the weak count in _Sp_counted_base. Atomic
instructions are significantly slower than regular loads and stores across
major architectures.

- How current code works: _M_release() atomically decrements the use count,
checks if it was 1, if so calls _M_dispose(), atomically decrements the
weak count, checks if it was 1, and if so calls _M_destroy().

- How the proposed patch works: _M_release() loads both use count and weak
count together atomically (when properly aligned), checks whether the value
corresponds to both counts being equal to 1 (e.g., 0x10001; see the worked
example after this list), and if so calls _M_dispose() and _M_destroy().
Otherwise, it follows the original algorithm.

- Why it works: When the current thread executing _M_release() finds each
of the counts is equal to 1, then (when _lock_policy is _S_atomic) no other
threads could possibly hold use or weak references to this control block.
That is, no other threads could possibly access the counts or the protected
object.

- The proposed patch is intended to interact correctly with current code
(under certain conditions: _Lock_policy is _S_atomic, proper alignment, and
native lock-free support for atomic operations). That is, multiple threads
using different versions of the code with and without the patch operating
on the same objects should always interact correctly. The intent for the
patch is to be ABI compatible with the current implementation.

- The proposed patch involves a performance trade-off between saving the
costs of two atomic instructions when the counts are both 1 vs adding the
cost of loading the combined counts and comparison with two ones (e.g.,
0x10001).

- The patch has been in use (built using LLVM) in a large environment for
many months. The performance gains outweigh the losses (roughly 10 to 1)
across a large variety of workloads.
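
As a concrete illustration of the single comparison mentioned above: with the
two _Atomic_word counters adjacent in memory and loaded together as one long
long, the value meaning "use count == 1 and weak count == 1" is
1LL + (1LL << (8 * sizeof(_Atomic_word))). The 0x10001 shorthand above
corresponds to 16-bit counters; for the common case of a 32-bit _Atomic_word
the value is 0x100000001. A minimal, self-contained sketch (not part of the
patch, and assuming _Atomic_word is a 32-bit int):

// Illustrative only: compute the combined value that the patched
// _M_release() compares against, assuming 32-bit counters.
#include <cstdio>

int main()
{
    const long long both_counts_one
        = 1LL + (1LL << (8 * sizeof(int)));   // each 32-bit half holds 1,
                                              // so endianness does not matter
    std::printf("%#llx\n", both_counts_one);  // prints 0x100000001
    return 0;
}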


I'd appreciate feedback on the patch and any suggestions for checking
__has_feature(thread_sanitizer).


Maged



diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
b/libstdc++-v3/include/bits/shared_ptr_base.h
index 368b2d7379a..a8fc944af5f 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 if (!_M_add_ref_lock_nothrow())
   __throw_bad_weak_ptr();
   }

   bool
   _M_add_ref_lock_nothrow() noexcept;

   void
   _M_release() noexcept
   {
+#if __SANITIZE_THREAD__
+_M_release_orig();
+return;
+#endif
+if (!__atomic_always_lock_free(sizeof(long long), 0) ||
+!__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
+sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
+sizeof(long long) > (sizeof(void*)))
+  {
+_M_release_orig();
+return;
+  }
+_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+== (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
+  {
+// Both counts are 1, so there are no weak references and
+// we are releasing the last strong reference. No other
+// threads can observe the effects of this _M_release()
+// call (e.g. calling use_count()) without a data race.
+*(long long*)(&_M_use_count) = 0;
+_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+_M_dispose();
+_M_destroy();
+  }
+else
+  {
+if