https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772
--- Comment #3 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
Since this latter point has come up before, I want to additionally note that the optimization of keeping an atomic count of waiters per waiter-pool bucket means that a call to notify_one/notify_all is, based on my testing, roughly 25x faster than naively issuing a FUTEX_WAKE syscall when there is no possibility of the wake being issued to a waiter.

2022-09-19T20:34:28-07:00
Running ./benchmark
Run on (20 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 0.69, 0.61, 1.30
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_empty_notify_checked       3.79 ns         3.79 ns    179929051
BM_empty_notify_syscall       94.1 ns         93.9 ns      7477997

For types that can use a FUTEX directly (e.g. int), there is no place to put that extra atomic to perform this check, so we can either have the type that is directly usable by the underlying platform be significantly more expensive to call, or we can use the waiter count in the waiter_pool.
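
To make the trade-off concrete, here is a minimal sketch of the checked-notify idea. It is not the libstdc++ implementation: WaiterBucket, enter_wait, leave_wait, fake_futex_wake and g_wake_syscalls are hypothetical names, the wake is a counting stand-in rather than a real FUTEX_WAKE syscall, and the default sequentially consistent ordering is a conservative assumption.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the FUTEX_WAKE syscall. It only counts how often
// the "syscall" would have been issued, so the sketch stays portable.
static std::atomic<std::uint64_t> g_wake_syscalls{0};

static void fake_futex_wake(const void* /*addr*/) {
    g_wake_syscalls.fetch_add(1);
}

// Hypothetical per-bucket state. The waiter count lives beside the user's
// object rather than inside it, which is why a bare int used directly as a
// futex word has no room for it.
struct WaiterBucket {
    std::atomic<std::uint32_t> waiters{0};

    // A thread about to block registers itself first ...
    void enter_wait() { waiters.fetch_add(1); }
    // ... and deregisters after it has been woken.
    void leave_wait() { waiters.fetch_sub(1); }

    // Checked notify: one atomic load decides whether the wake is needed.
    // Returns true only if the (stand-in) wake was actually issued.
    bool notify_all(const void* addr) {
        if (waiters.load() == 0)
            return false;          // nobody can be waiting: skip the syscall
        fake_futex_wake(addr);     // at least one registered waiter
        return true;
    }
};
```

With no registered waiter, notify_all costs a single atomic load and returns false. After enter_wait() it issues the wake. That corresponds to the BM_empty_notify_checked versus BM_empty_notify_syscall cases above.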