[Bug libstdc++/122878] [16 Regression] std::counting_semaphore performance

redi at gcc dot gnu.org via Gcc-bugs Thu, 27 Nov 2025 05:23:00 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122878


--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
With this test:

#include <semaphore>
#include <chrono>

int main()
{
  using namespace std::chrono;
  std::counting_semaphore sem(0);
  sem.try_acquire_for(2s);
}

Compiling with GCC 15 and running under strace shows:

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x7fff39d775c4, FUTEX_WAIT_BITSET, 0, {tv_sec=271721, tv_nsec=979839514}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

This is because __waiter_base::_M_do_wait_for first calls _M_do_spin with the
default spin policy, which does an atomic load and checks the value 16 times
(yielding for the last 4 iterations). It then calls _M_do_wait_until, which
does a futex wait. If that times out, it calls _M_do_spin again with the timed
backoff policy, which loops 16 times again. This is not ideal, but at least it
blocks on the futex.
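
To make the spin phase concrete, here is a minimal model of a bounded spin that
checks a predicate 16 times and yields on the last 4 iterations. It is only a
sketch: bounded_spin and SpinStats are hypothetical names, and this is not the
libstdc++ code.

```cpp
#include <cassert>
#include <thread>

// Hypothetical model of a bounded spin; not the libstdc++ implementation.
struct SpinStats {
  int checks = 0;  // times the predicate was evaluated
  int yields = 0;  // times the thread yielded
};

template<typename Pred>
bool bounded_spin(Pred pred, SpinStats& stats,
                  int iterations = 16, int yielding = 4)
{
  assert(yielding <= iterations);
  for (int i = 0; i < iterations; ++i) {
    ++stats.checks;
    if (pred())
      return true;                  // value changed, no need to block
    if (i >= iterations - yielding) {
      ++stats.yields;
      std::this_thread::yield();    // typically a sched_yield call on Linux
    }
  }
  return false;                     // caller should now block
}
```

With a predicate that never becomes true, this model performs 16 checks and 4
yields, which matches the four sched_yield calls at the start of the trace.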

(ltrace shows that GCC 15 also makes four calls to chrono::steady_clock::now()
in the timed spinloop.)

With GCC 16 we get:

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=157999}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=320854}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=573062}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=972695}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1619661}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=2652507}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=4179607}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=6470909}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=9973633}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=15209308}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=22992500}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=34698072}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=52193686}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=64000000}, 0x7ffcbac6bf40) = 0
futex(0x7ffcbac6c194, FUTEX_WAIT_BITSET, 0, {tv_sec=273517, tv_nsec=267249297}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

This is because __wait_until_impl calls __spin_until_impl, which first calls
__spin_impl (equivalent to the default spin policy in GCC 15, so it loops 16
times). It then sleeps repeatedly, for increasing intervals capped at 64ms,
until the timeout is reached (which requires calling clock_gettime on every
iteration). Only once the timeout has already passed does it do a futex wait,
which returns immediately. This is just a busy loop, and the futex wait is a
redundant syscall.
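
To get a feel for how many wake-ups such a polling wait costs, here is a small
simulation. It assumes each sleep lasts about half of the time elapsed so far,
capped at a fixed maximum. That rule is a guess that roughly fits the trace
above, not something taken from the libstdc++ source, and simulate_polling and
PollModel are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical model, not libstdc++ code: counts how many sleeps a polling
// wait performs before the timeout expires. The "sleep for half the elapsed
// time, capped" rule is an assumption that roughly fits the trace.
struct PollModel {
  int sleeps = 0;                        // number of simulated sleeps
  std::chrono::nanoseconds longest{0};   // longest single sleep
};

inline PollModel
simulate_polling(std::chrono::nanoseconds timeout,
                 std::chrono::nanoseconds initial_elapsed,
                 std::chrono::nanoseconds cap)
{
  // Need at least 2ns elapsed so that every simulated sleep is non-zero.
  assert(initial_elapsed.count() > 1 && cap.count() > 0);
  PollModel m;
  std::chrono::nanoseconds elapsed = initial_elapsed;
  while (elapsed < timeout) {
    const std::chrono::nanoseconds nap =
      std::min<std::chrono::nanoseconds>(elapsed / 2, cap);
    elapsed += nap;                      // pretend the sleep took exactly 'nap'
    ++m.sleeps;
    m.longest = std::max(m.longest, nap);
  }
  return m;
}
```

Called with a 2s timeout, a 64ms cap and an assumed starting point of 316us,
the model reports a few dozen sleeps, the longest being 64ms. A blocking futex
wait would instead be a single syscall for the whole timeout.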

With the patch above, GCC 16 does:

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x7ffdab57f934, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=273918, tv_nsec=511961780}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

So we use __spin_impl then go straight to the futex wait and let that time out.
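
The spin-then-block shape can be sketched with a toy semaphore. This is an
illustration only, under the assumption that a mutex and condition variable are
an acceptable stand-in for the futex. toy_semaphore and try_take are
hypothetical names, and none of this is the patch itself.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy semaphore for illustration only; a mutex and condition variable stand
// in for the futex. This is not how libstdc++ implements counting_semaphore.
class toy_semaphore {
  std::atomic<int> count_;
  std::mutex mtx_;
  std::condition_variable cv_;

  // Decrement the counter if it is positive.
  bool try_take() {
    int c = count_.load(std::memory_order_relaxed);
    while (c > 0)
      if (count_.compare_exchange_weak(c, c - 1,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed))
        return true;
    return false;
  }

public:
  explicit toy_semaphore(int initial) : count_(initial) {
    assert(initial >= 0);
  }

  void release() {
    {
      // Incrementing under the lock avoids a lost wake-up.
      std::lock_guard<std::mutex> lock(mtx_);
      count_.fetch_add(1, std::memory_order_release);
    }
    cv_.notify_one();
  }

  template<typename Rep, typename Period>
  bool try_acquire_for(std::chrono::duration<Rep, Period> rel_time) {
    const auto deadline = std::chrono::steady_clock::now() + rel_time;
    // Phase 1: short bounded spin, yielding on the last 4 of 16 iterations.
    for (int i = 0; i < 16; ++i) {
      if (try_take())
        return true;
      if (i >= 12)
        std::this_thread::yield();
    }
    // Phase 2: one blocking wait that is left to time out by itself.
    std::unique_lock<std::mutex> lock(mtx_);
    return cv_.wait_until(lock, deadline, [this] { return try_take(); });
  }
};
```

The point of the design is that a release during phase 2 wakes the waiter
directly, instead of the waiter only noticing after its current sleep ends.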
