I've noticed that even a simple hello world does a lot of gsync calls: $ rpctrace echo hello world |& grep -c gsync 53
These are in fact all identical gsync_wake () calls on a single address: $ rpctrace echo hello world |& grep gsync | uniq task132(pid18549)->gsync_wake (202820 0 0); This immediately screams that something is wrong to me, for two reasons: * Surely there's just a single thread in /bin/echo! Well, there's always the implicit signal/msg thread, but it shouldn't be creating *that* much contention. * gsync_wake () calls are useless unless paired with gsync_wait (). I traced the gsync_wake () calls to the THREAD_GSCOPE macros in glibc (sysdeps/mach/hurd/tls.h). As I understand it from looking at the Linux version, THREAD_GSCOPE is supposed to be an RCU-like mechanism, where a thread can wait until all the other threads have left the "global scope" at least once. The Hurd version is different, it waits for all the other threads to be outside the "global scope" *at the same time*. Which also sounds wrong to me, because this way threads constantly entering and exiting the "global scope" at the same time can completely starve a waiting thread, unlike with RCU. I'm probably misunderstanding something though. Anyway, the Hurd implementation seems pretty straightforward: it keeps a single global counter, increments it when entering "global scope", decrements it when exiting it. If this was the last thread exiting, it calls gsync_wake (), and this is where the excessive gsync_wake () calls come from. In my experience building synchronization primitives on top of futexes, one of the most important things to do is make it so that the fast (uncontended) path never requires calling into the kernel. Any (reasonably small) amount of userspace logic is far cheaper than calling into the kernel. Which is why you add various "have waiters" flags, to only call futex(FUTEX_WAKE_PRIVATE) if there are, in fact, any waiters. So I took a stab at implementing this for THREAD_GSCOPE. I'm currently building glibc to see if it works; but I might have easily messed up ("futexes are tricky"!). Please tell me if I did :) This is not supposed to be a patch ready for merging (I don't even know if this is the right list to send glibc patches to); just an RFC. Sergey -- >8 -- diff --git a/sysdeps/mach/hurd/tls.h b/sysdeps/mach/hurd/tls.h index f83956d3..9fccd9f2 100644 --- a/sysdeps/mach/hurd/tls.h +++ b/sysdeps/mach/hurd/tls.h @@ -55,22 +55,46 @@ /* Global scope switch support. */ #define THREAD_GSCOPE_IN_TCB 0 #define THREAD_GSCOPE_GLOBAL + +#define THREAD_GSCOPE_WAITERS (1 << 31) + #define THREAD_GSCOPE_SET_FLAG() \ atomic_exchange_and_add_acq (&GL(dl_thread_gscope_count), 1) + #define THREAD_GSCOPE_RESET_FLAG() \ - do \ - if (atomic_exchange_and_add_rel (&GL(dl_thread_gscope_count), -1) == 1) \ - lll_wake (GL(dl_thread_gscope_count), 0); \ + do \ + { \ + int count; \ + count = atomic_exchange_and_add_rel (&GL(dl_thread_gscope_count), -1); \ + if (count == (1 | THREAD_GSCOPE_WAITERS)) \ + { \ + /* We're going to wake everyone; clear the waiters flag. */ \ + count = atomic_and_val (&GL(dl_thread_gscope_count), ~THREAD_GSCOPE_WAITERS); \ + if (count & THREAD_GSCOPE_WAITERS) \ + lll_wake (GL(dl_thread_gscope_count), GSYNC_BROADCAST); \ + } \ + } \ while (0) + #define THREAD_GSCOPE_WAIT() \ - do \ - { \ - int count; \ - atomic_write_barrier (); \ - while ((count = GL(dl_thread_gscope_count))) \ - lll_wait (GL(dl_thread_gscope_count), count, 0); \ - } \ - while (0) + do \ + { \ + int count; \ + count = atomic_load_acquire (&GL(dl_thread_gscope_count));\ + if (count == 0) \ + break; \ + if (!(count & THREAD_GSCOPE_WAITERS)) \ + { \ + /* Record that we're going to wait. */ \ + if (atomic_compare_and_exchange_bool_acq ( \ + &GL(dl_thread_gscope_count), \ + count | THREAD_GSCOPE_WAITERS, count)) \ + continue; \ + count |= THREAD_GSCOPE_WAITERS; \ + } \ + lll_wait (GL(dl_thread_gscope_count), count, 0); \ + } \ + while (1) #endif /* !ASSEMBLER */