[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread rodgertq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #25 from Thomas Rodgers  ---
(In reply to Mkkt Bkkt from comment #24)
> (In reply to Thomas Rodgers from comment #22)
> > Your example of '100+ core' systems, especially on NUMA, is certainly a
> > valid one. I would ask, at what point do those collisions and the
> > resulting cache invalidation traffic swamp the cost of just making the
> > syscall? I do plan to put these tests together, because there is another
> > algorithm I am exploring that I believe will reduce the likelihood of
> > spurious wakeups and achieve the same result as this particular approach
> > without generating the same invalidation traffic. At this point, I don't
> > anticipate doing that work until after GCC 13 stage 1 closes.
> 
> Let me try to explain:
> 
> syscall overhead is roughly constant, commonly around 10-30ns (a futex
> syscall can be more expensive, around 100ns as in your example)
> 
> But core counts keep growing, and ARM is getting more popular too
> (fetch_add/sub cost more there compared to x86)
> And people have already run into situations where a fetch_add costs more
> than the syscall overhead:
> 
> https://pkolaczk.github.io/server-slower-than-a-laptop/
> https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html
> 
> I don't think we will hit problems like the ones in these links with
> atomic::wait/notify in real code, but I'm pretty sure in some cases it can
> be more expensive than the syscall part of atomic::wait/notify
> 
> Of course it would be better to prove it; maybe someday I will :(

So, to your previous comment: I don't think the discussion is at all
pointless. I plan to raise some of these issues at the next SG1 meeting in
November. Sure, that doesn't help *you* or any developer with your specific
intent until C++26, and maybe Boost's implementation is a better choice; I
also get how unsatisfying an answer that is.

I'm well aware of the potential scalability problems, and I have a longer-term
plan to get concrete data on how different implementation choices impact
scalability. The barrier implementation (which uses the same algorithm as
libc++'s), for example, spreads this traffic over 64 individual atomic_refs
for this very reason, and that implementation has been shown to scale quite
well on ORNL's Summit. But not all users of libstdc++ have those sorts of
problems, either.
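
As an aside for readers, here is a minimal sketch of that traffic-spreading
idea, assuming a simple striped counter with invented names (the actual
barrier algorithm is more involved):

```
#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch of "spread the traffic over N atomics" (illustrative only, not the
// libstdc++ barrier code): stripe a logical counter across 64 cache-line
// aligned slots so concurrent updates rarely collide on one line.
struct StripedCounter {
  static constexpr std::size_t kStripes = 64;
  struct alignas(64) Slot { std::atomic<std::uint64_t> v{0}; };
  Slot slots[kStripes];

  void add(std::size_t thread_id) {
    // Each thread hashes to its own slot; no shared line in the fast path.
    slots[thread_id % kStripes].v.fetch_add(1, std::memory_order_relaxed);
  }

  std::uint64_t total() const {
    // Summing touches every line, so it should be called rarely.
    std::uint64_t sum = 0;
    for (const Slot& s : slots) sum += s.v.load(std::memory_order_relaxed);
    return sum;
  }
};
```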

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #24 from Mkkt Bkkt  ---
(In reply to Thomas Rodgers from comment #22)
> Your example of '100+ core' systems, especially on NUMA, is certainly a
> valid one. I would ask, at what point do those collisions and the resulting
> cache invalidation traffic swamp the cost of just making the syscall? I do
> plan to put these tests together, because there is another algorithm I am
> exploring that I believe will reduce the likelihood of spurious wakeups and
> achieve the same result as this particular approach without generating the
> same invalidation traffic. At this point, I don't anticipate doing that
> work until after GCC 13 stage 1 closes.

Let me try to explain:

syscall overhead is roughly constant, commonly around 10-30ns (a futex syscall
can be more expensive, around 100ns as in your example)

But core counts keep growing, and ARM is getting more popular too
(fetch_add/sub cost more there compared to x86)
And people have already run into situations where a fetch_add costs more than
the syscall overhead:

https://pkolaczk.github.io/server-slower-than-a-laptop/
https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

I don't think we will hit problems like the ones in these links with
atomic::wait/notify in real code, but I'm pretty sure in some cases it can be
more expensive than the syscall part of atomic::wait/notify

Of course it would be better to prove it; maybe someday I will :(
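
To make the cost comparison concrete, here is a rough, self-contained
measurement sketch (my illustration, not a benchmark from this thread): it
times fetch_add on one shared atomic as the thread count grows, which on
large (especially NUMA) machines can climb past the 10-100ns syscall figures
above:

```
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Rough sketch: per-operation cost of fetch_add on a single shared atomic
// as the number of contending threads grows.
int main() {
  constexpr int kOpsPerThread = 1'000'000;
  for (unsigned n : {1u, 4u, std::thread::hardware_concurrency()}) {
    std::atomic<std::uint64_t> counter{0};
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
      threads.emplace_back([&] {
        for (int op = 0; op < kOpsPerThread; ++op)
          counter.fetch_add(1, std::memory_order_relaxed);
      });
    for (auto& t : threads) t.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%u threads: %.1f ns per fetch_add\n", n,
                double(ns) / (double(n) * kOpsPerThread));
  }
}
```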

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #23 from Mkkt Bkkt  ---
(In reply to Thomas Rodgers from comment #21)
> I wasn't sarcastic in what I posted. As I noted, this question has come up
> before in different contexts, Bugzilla is a useful historical archive, so
> updating this issue with my reasoning and a bit of data was primarily a
> capture task.

Ok, sorry

> This test is the example Jonathan cited in comment #14: incrementing an
> atomic int and calling notify.

Incrementing the atomic int here doesn't touch the waiter-pool atomic.
I think in reality it's possible to have contention between the load in notify
and the fetch_add/sub in some wait (maybe on other atomics)

> This isn't about semaphore or any other synchronization primitive. Those
> types are free to make different choices that may be more appropriate to the
> constrained usage of the type just like Lewis' lightweight manual reset
> event does (I will also note that Lewis has reviewed this implementation,
> and has written a paper to be discussed at the Fall meeting, p2616). 

OK, but that just means that, at least before C++26, developers have to use
their own kludge.

> There are, as Jonathan has pointed out, use cases where notify can and will
> be called without the notifier having any way to determine whether it will
> wake a waiter. One example that I, as the person who is going to have to
> implement C++26 executors, care about is a wait-free work-stealing deque. It
> certainly doesn't require anything more than spinning for work on an empty
> queue to be algorithmically correct. But after spinning on an empty queue,
> making the rounds trying to steal work from other deques, and maybe spinning
> a bit more just to be sure, the dequeuing thread that hasn't been able to
> acquire more work is probably going to want to enter a wait until such time
> as it knows it can do productive work. Another thread pushing work into that
> queue won't be able to determine whether the dequeuing thread is spinning
> for new work, work stealing, or has entered a wait, but atomic::notify()
> does know that, and can avoid penalizing the thread submitting work with a
> syscall if there is no thread in a wait on the other end of the deque, which
> is the expected case for this algorithm.

It's a good example, but I think it would be better to have your own "waiter
pool" in the executor for that (see the sketch after this list).
Let me explain why:
1) In that case you control the size of the waiter pool; for example, you can
just create a separate atomic per thread-pool thread.
2) You avoid contention and false positives with other atomics, because it's
not global
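
A minimal sketch of such an executor-owned waiter slot, with invented names
and one atomic per worker thread so no shared waiter-pool bucket is touched:

```
#include <atomic>

// Hypothetical per-worker parking slot (names invented): each pool thread
// sleeps on its own atomic, so wakeups never contend on a global table.
struct WorkerSlot {
  std::atomic<int> asleep{0};

  void park() {                      // called by the owning worker thread
    asleep.store(1, std::memory_order_seq_cst);
    asleep.wait(1);                  // platform wait on our private word
  }

  void unpark() {                    // called by a thread submitting work
    if (asleep.exchange(0, std::memory_order_seq_cst) == 1)
      asleep.notify_one();           // syscall only if the worker was parked
  }
};
```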

> p1135 was the paper that added atomic wait/notify. One of the co-authors of
> that paper wrote the libc++ implementation. That implementation, as with
> libstdc++, is not simply a wrapper over the underlying platform's wait
> primitive. It has two 'backends', an exponentially timed backoff and ulock
> wait/wake. libstdc++ currently has a Futex and condvar backend. Both
> implementations make the choice of having a short-term spinning strategy and
> long term waiting strategy (spinning, futex/ulock, condvar).

I know (the one point where I was wrong: libc++ also uses a waiter pool in the
platform case), and I don't think that's good.
From my point of view it would be better to have something like:
std::wait_spin waiter;
bool std::wait_spin::wait(Condition condition_func);
void std::wait_spin::reset();

Also, the libc++ implementation does a timed wait on the futex (but not on
ulock); I don't know why, do you?
https://discourse.llvm.org/t/why-do-atomic-int-wait-use-timed-futex/65341

But in general that's beside the point of this thread.

> I have confirmed with the libc++ implementation's author (who also chairs
> the C++ committee's Concurrency and Parallelism study group) that it was
> never the intention of p1135, or of the subsequently standardized language
> in C++20, to imply that wait/notify are direct portable analogs of the
> platform's waiting primitives. There are users, such as yourself, who want
> exactly that; there are other users (like in my prior industry) for whom
> busy waiting is the desired strategy; and in between those two choices are
> people who want it to work as advertised in the standard, and to do so
> 'efficiently'. Both libc++ and libstdc++ take a balanced approach somewhere
> between always going to the OS and always spinning.

It's a good point, although in my opinion atomic wait/notify is too low-level
for that to make sense.
As a result, it seems to me that in this form they aren't very suitable for
developers who know what they want (a syscall).

And at the same time, developers who write business code don't particularly
need them (they generally use some library for coroutines/callbacks, for
example something like p2300)


> There is an open question here that your original issue raises -
> 
> * At what point do collisions on the waiter pool, with the cache
> invalidation traffic and spurious wakeups that result, swamp the gain of
> doing this empty waiter check on notify?

Just start threadpool with 

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread rodgertq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #22 from Thomas Rodgers  ---
(In reply to Mkkt Bkkt from comment #20)
> My main concern with this optimization is that it's not zero-overhead.
> 
> It's not necessary when we expect to have some waiters; in that case it is
> just additional synchronization and contention in the waiter pool (which
> has a small fixed size; just imagine a system with 100+ cores: if we have
> > 16 waiting threads, some of them do fetch_add/sub on the same atomic,
> which can be expensive, especially on NUMA)
> 
> And at the same time, I don't understand when I need to notify but cannot
> know the notification isn't needed.
> I don't understand when it is useful.

You are correct, it is not zero-overhead, but it also isn't clear what those
overheads are. As I noted in comment #21, there is no test over a variety of
workloads to inform this discussion.

Your example of '100+ core' systems, especially on NUMA, is certainly a valid
one. I would ask, at what point do those collisions and the resulting cache
invalidation traffic swamp the cost of just making the syscall? I do plan to
put these tests together, because there is another algorithm I am exploring
that I believe will reduce the likelihood of spurious wakeups and achieve the
same result as this particular approach without generating the same
invalidation traffic. At this point, I don't anticipate doing that work until
after GCC 13 stage 1 closes.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread rodgertq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #21 from Thomas Rodgers  ---
(In reply to Mkkt Bkkt from comment #16)
> > it with sarcasm
> 
> I started with sarcasm because you restarted this thread with some doubtful
> benchmarks without code for them.
> 
> I think that's very disrespectful.

I wasn't sarcastic in what I posted. As I noted, this question has come up
before in different contexts, Bugzilla is a useful historical archive, so
updating this issue with my reasoning and a bit of data was primarily a capture
task.

So, let's try this again.

I did not post the original source because it required hacking on the libstdc++
headers. I have posted a version that does not require that; the results are
identical.

This test is the example Jonathan cited in comment #14: incrementing an atomic
int and calling notify.

This isn't about semaphore or any other synchronization primitive. Those types
are free to make different choices that may be more appropriate to the
constrained usage of the type just like Lewis' lightweight manual reset event
does (I will also note that Lewis has reviewed this implementation, and has
written a paper to be discussed at the Fall meeting, p2616). 

There are, as Jonathan has pointed out, use cases where notify can and will be
called without the notifier having any way to determine whether it will wake a
waiter. One example that I, as the person who is going to have to implement
C++26 executors, care about is a wait-free work-stealing deque. It certainly
doesn't require anything more than spinning for work on an empty queue to be
algorithmically correct. But after spinning on an empty queue, making the
rounds trying to steal work from other deques, and maybe spinning a bit more
just to be sure, the dequeuing thread that hasn't been able to acquire more
work is probably going to want to enter a wait until such time as it knows it
can do productive work. Another thread pushing work into that queue won't be
able to determine whether the dequeuing thread is spinning for new work, work
stealing, or has entered a wait, but atomic::notify() does know that, and can
avoid penalizing the thread submitting work with a syscall if there is no
thread in a wait on the other end of the deque, which is the expected case for
this algorithm.
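
A hedged sketch of that handoff (invented names; a real work-stealing deque is
far more involved). The producer always calls notify_one(), and the library's
internal empty-waiter check is what keeps the common no-waiter case
syscall-free:

```
#include <atomic>
#include <cstdint>

// Sketch only: an epoch counter the idle worker parks on. Pushing work bumps
// the epoch and notifies; if no worker is parked, notify_one() is just a load.
std::atomic<std::uint32_t> work_epoch{0};

void submit_work(/* ...push the item into the deque first... */) {
  work_epoch.fetch_add(1, std::memory_order_release);
  work_epoch.notify_one();  // cheap when the worker is spinning or stealing
}

void worker_idle_path() {
  std::uint32_t seen = work_epoch.load(std::memory_order_acquire);
  // ...spin, try to steal from other deques, maybe spin a bit more...
  work_epoch.wait(seen, std::memory_order_acquire);  // finally park
}
```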

p1135 was the paper that added atomic wait/notify. One of the co-authors of
that paper wrote the libc++ implementation. That implementation, as with
libstdc++, is not simply a wrapper over the underlying platform's wait
primitive. It has two 'backends', an exponentially timed backoff and ulock
wait/wake. libstdc++ currently has a Futex and condvar backend. Both
implementations make the choice of having a short-term spinning strategy and
long term waiting strategy (spinning, futex/ulock, condvar).

I have confirmed with the libc++ implementation's author (who also chairs the
C++ committee's Concurrency and Parallelism study group) that it was never the
intention of p1135, or of the subsequently standardized language in C++20, to
imply that wait/notify are direct portable analogs of the platform's waiting
primitives. There are users, such as yourself, who want exactly that; there
are other users (like in my prior industry) for whom busy waiting is the
desired strategy; and in between those two choices are people who want it to
work as advertised in the standard, and to do so 'efficiently'. Both libc++
and libstdc++ take a balanced approach somewhere between always going to the
OS and always spinning.

There is an open question here that your original issue raises -

* At what point do collisions on the waiter pool, with the cache invalidation
traffic and spurious wakeups that result, swamp the gain of doing this empty
waiter check on notify?

I also made a comment about the 'size of the waiter pool notwithstanding'. I
chose a smaller size than libc++ chose, in part because Jonathan and I did not
want to make any sort of ABI commitment until this had been in a few GCC
releases. This implementation is header-only at present and still considered
experimental. libc++ committed to an ABI early.

In the sizing of the libc++ waiter pool there is the comment that 'there is no
magic in this value'. 

Not only is there no magic, there is no test of any sort that I have done, or
that has been done on libc++, to determine what effect different pool sizes
have under different load scenarios. So all of it is a guess at this point. I
will likely match libc++ when I do move this into the .so.

Finally, in the most recent mailing there is p2643, which proposes additional
changes to atomic waiting. One proposal is to add a 'hinted' wait that can
allow the caller to steer the choices atomic wait/notify makes. I have
conferred with the other authors of the paper; this latter option is not
without controversy, and likely has some sharp edges for the user, but I plan
to raise the discussion at the Fall WG21 meeting to see what the other members
of SG1 think.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #20 from Mkkt Bkkt  ---
My main concern with this optimization is that it's not zero-overhead.

It's not necessary when we expect to have some waiters; in that case it is just
additional synchronization and contention in the waiter pool (which has a small
fixed size; just imagine a system with 100+ cores: if we have > 16 waiting
threads, some of them do fetch_add/sub on the same atomic, which can be
expensive, especially on NUMA)

And at the same time, I don't understand when I need to notify but cannot know
the notification isn't needed.
I don't understand when it is useful.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #19 from Mkkt Bkkt  ---
An example with a futex or other syscall would also work, because atomic::wait
is still not widely used.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #18 from Mkkt Bkkt  ---
Thanks for providing the source of your benchmarks!

Can you also provide some example of atomic::notify usage where this
optimization is needed?

If one exists, I think you're right and it is necessary.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-28 Thread rodgertq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #17 from Thomas Rodgers  ---
Created attachment 53638
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53638&action=edit
benchmark main.cc

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #16 from Mkkt Bkkt  ---
> it with sarcasm

I started with sarcasm because you restarted this thread with some doubtful
benchmarks without code for them.

I think that's very disrespectful.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #15 from Mkkt Bkkt  ---
>  If you have an atomic counter and want to signal when it has been 
> incremented, you cannot tell from the previous value whether another thread 
> is waiting.

I wrote an example of that.

Are you talking about something like semaphore usage?

As far as I know, in libstdc++ semaphore uses a bare wait without touching the
waiter pool. Am I wrong?

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #14 from Jonathan Wakely  ---
(In reply to Mkkt Bkkt from comment #12)
> First off, I was toxic, sorry.
> But you started this first; maybe it's allowed for a maintainer, I don't
> know.

No, you started it with sarcasm and disparaging remarks in comment 4, and
continued the aggression in comment 5.

Your example demonstrates a state machine where the only possible values of
the atomic are known (e.g. a locked state and an unlocked state). That is not
the only way to use atomics. If you have an atomic counter and want to signal
when it has been incremented, you cannot tell from the previous value whether
another thread is waiting.
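
A minimal sketch of that counter scenario (my illustration, not code from the
thread):

```
#include <atomic>

// An atomic counter used as a signal: the incrementing side has no way to
// tell from the previous value whether anyone is waiting, so it must always
// call notify_one(); only the library can elide the syscall when it is safe.
std::atomic<int> counter{0};

void producer() {
  counter.fetch_add(1, std::memory_order_release);
  counter.notify_one();  // a waiter may or may not exist
}

void consumer(int last_seen) {
  counter.wait(last_seen, std::memory_order_acquire);  // returns on change
}
```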

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #13 from Mkkt Bkkt  ---
Here is an example of simple mutex pseudocode that doesn't need to call notify
on every unlock. The same is possible for other cases; if it isn't, please
share an example with me:

```
#include <atomic>

constexpr int kUnlocked = 0;
constexpr int kLocked = 1;
constexpr int kLockedWithWaiters = 2;

std::atomic<int> state{kUnlocked};

void lock() {
  // (optionally spin a little first)
  int expected = kUnlocked;
  if (state.compare_exchange_strong(expected, kLocked)) {
    return;  // fast path: uncontended acquire
  }
  do {
    // Mark the lock contended before sleeping, so unlock knows to notify.
    expected = kLocked;
    if (state.load() == kLockedWithWaiters ||
        state.compare_exchange_strong(expected, kLockedWithWaiters)) {
      state.wait(kLockedWithWaiters);  // sleep while the value is unchanged
    }
    expected = kUnlocked;
  } while (!state.compare_exchange_strong(expected, kLockedWithWaiters));
}

void unlock() {
  // Notify only when a waiter may exist.
  if (state.exchange(kUnlocked) == kLockedWithWaiters) {
    state.notify_one();
  }
}
```
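
The key design point in this sketch is the third state: unlock's exchange
reveals, at no extra cost, whether any thread could be sleeping, and only in
that case does it pay for notify_one (and the FUTEX_WAKE behind it).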

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #12 from Mkkt Bkkt  ---
First off, I was toxic, sorry.
But you started this first; maybe it's allowed for a maintainer, I don't know.

But I'm still waiting for the benchmark code.

I also want to see an example of usage where we notify an atomic that is in a
non-deterministic wait state, and where we cannot make this optimization
outside of libstdc++.

Because honestly I don't understand the purpose of this optimization except
for benchmarks like:

```
std::atomic<int> x{0};
while (...) {
  x.notify_one();
}
```

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread rodgertq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #11 from Thomas Rodgers  ---
(In reply to Mkkt Bkkt from comment #9)
> Why do you think you are smarter than the MSVC STL, libc++, and
> boost::atomic developers?
> 
> Maybe it's about your ego?

I should ignore this (see jwakely's response), but -

I very much don't think I am smarter than the person (ogir...@apple.com) who
implemented libc++'s implementation. There are minor differences between
libstdc++'s details and libc++'s, but in this regard in particular, they
behave the same.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #10 from Jonathan Wakely  ---
Please try to be civil, or your requests will simply be ignored.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #9 from Mkkt Bkkt  ---
Why do you think you are smarter than the MSVC STL, libc++, and boost::atomic
developers?

Maybe it's about your ego?

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #8 from Mkkt Bkkt  ---
> I have every confidence that Lewis knows how to bring a paper for a
> 'lightweight manual reset event' to SG1; I suspect it will be well received
> when he does.

So at least before C++26, I and any other developer who knows that notify
won't be called on a non-waiting atomic should use ifdefs for
Linux/Windows/macOS etc. syscalls?

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #7 from Mkkt Bkkt  ---
Can you give an example where this optimization is needed and cannot be done
on the user side rather than in the stdlib?

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread rodgertq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #6 from Thomas Rodgers  ---
(In reply to Mkkt Bkkt from comment #5)
> The single reason I want to use atomic::wait/notify is as a cross-platform
> API for futex, wait/wake on address, ulock, etc.
> Not because I need YOU to decide for me how many spins or which other
> optimizations I need.
> 
> boost::atomic already does this.
> 
> Why can't you do the same?

(In reply to Mkkt Bkkt from comment #4)
> Nice, thanks for the benchmarks without code.
> 
> Also, why do I need to call notify when I don't have a wait?
> 
> Common usage, from my point of view, looks like:
> 
> atomic.wait(value, mo):
> 
> while (atomic.load(mo) == value) {
>   futex_wait();
> }
> 
> The notify side then looks like:
> 
> atomic.store(1, mo)
> atomic.notify_one();
> 
> See:
> 
> https://github.com/lewissbaker/cppcoro/blob/master/lib/lightweight_manual_reset_event.cpp
> 

I have every confidence that Lewis knows how to bring a paper for a
'lightweight manual reset event' to SG1; I suspect it will be well received
when he does.

> https://github.com/YACLib/YACLib/blob/main/src/util/atomic_event.cpp
> 
> and others
> 
> So your optimization is pretty useless.
> 
> Also, if for some reason multiple notifications are possible, or a notify
> before the wait, then the code can look like:
> 
> if (atomic.exchange(non_wait_value, mo) == wait_value) {
>   atomic.notify_one();
> }

I

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #5 from Mkkt Bkkt  ---
The single reason I want to use atomic::wait/notify is as a cross-platform API
for futex, wait/wake on address, ulock, etc.
Not because I need YOU to decide for me how many spins or which other
optimizations I need.

boost::atomic already does this.

Why can't you do the same?

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-20 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #4 from Mkkt Bkkt  ---
Nice, thanks for the benchmarks without code.

Also, why do I need to call notify when I don't have a wait?

Common usage, from my point of view, looks like:

atomic.wait(value, mo):

while (atomic.load(mo) == value) {
  futex_wait();
}

The notify side then looks like:

atomic.store(1, mo)
atomic.notify_one();

See:

https://github.com/lewissbaker/cppcoro/blob/master/lib/lightweight_manual_reset_event.cpp

https://github.com/YACLib/YACLib/blob/main/src/util/atomic_event.cpp

and others

So your optimization is pretty useless.

Also, if for some reason multiple notifications are possible, or a notify
before the wait, then the code can look like:

if (atomic.exchange(non_wait_value, mo) == wait_value) {
  atomic.notify_one();
}

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-09-19 Thread rodgertq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #3 from Thomas Rodgers  ---
Since this latter point has come up before, I want to additionally note that
the optimization of using an atomic count of waiters per waiter-pool bucket
means that, based on my testing, a call to notify_one/notify_all is roughly
25x faster than naively issuing a FUTEX_WAKE syscall when there is no
possibility of the wake reaching a waiter.

2022-09-19T20:34:28-07:00
Running ./benchmark
Run on (20 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 0.69, 0.61, 1.30
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
-------------------------------------------------------------------------
BM_empty_notify_checked                3.79 ns         3.79 ns    179929051
BM_empty_notify_syscall                94.1 ns         93.9 ns      7477997

For types that can use a futex directly (e.g. int) there is no place to put
that extra atomic to perform this check, so we can either make the type that
is directly usable by the underlying platform significantly more expensive to
call, or use the waiter count in the waiter_pool.
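
For illustration, a minimal sketch of the check being measured above, assuming
a single waiter-count entry (not the actual libstdc++ code; the real waiter
pool is a small hashed table of such entries):

```
#include <atomic>
#include <cstdint>

// Illustration of the "empty waiter check" (not the real libstdc++ code):
// wait registers itself in a counter next to the futex word; notify loads
// that counter and skips the FUTEX_WAKE syscall when it is zero.
struct WaiterEntry {
  std::atomic<std::uint32_t> waiters{0};
};

void checked_wait(std::atomic<int>& a, int old, WaiterEntry& e) {
  e.waiters.fetch_add(1, std::memory_order_seq_cst);
  while (a.load(std::memory_order_acquire) == old)
    a.wait(old);                       // stands in for futex_wait
  e.waiters.fetch_sub(1, std::memory_order_release);
}

void checked_notify_one(std::atomic<int>& a, WaiterEntry& e) {
  // ~4 ns load in the benchmark above vs ~94 ns for an unconditional wake.
  if (e.waiters.load(std::memory_order_seq_cst) != 0)
    a.notify_one();                    // stands in for futex_wake
}
```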

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-08-29 Thread rodgertq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

Thomas Rodgers  changed:

           What    |Removed     |Added
----------------------------------------
 Resolution|---         |INVALID
 Status    |UNCONFIRMED |RESOLVED

--- Comment #2 from Thomas Rodgers  ---
This is not a bug. The size of the waiter pool notwithstanding, the point is
that you want to determine whether a notify is likely to notify anything, to
avoid the (by comparison) very expensive syscall to wake the futex.

[Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait

2022-08-29 Thread valera.mironow at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #1 from Mkkt Bkkt  ---
The libc++ implementation does this.
MSVC STL too, of course.