Hi, thanks for your bug report.
On Thu, Sep 21, 2017 at 04:37:17PM +0300, Gregory Potamianos wrote:
> Package: ruby2.3
> Version: 2.3.3-1+deb9u1
> Severity: important
> Tags: upstream patch
> Forwarded: https://bugs.ruby-lang.org/issues/13794
>
> Hello,
>
> After the upgrade to stretch we keep finding ruby processes (puppet
> agents in particular) stuck in a sched_yield busyloop. The stuck process
> is always a forked child of the main puppet agent running inside a
> timeout block.
>
> The backtrace of the process is the following:
>
> (gdb) thread apply all bt
>
> Thread 2 (Thread 0x7f2dc7904700 (LWP 11226)):
> #0 0x00007f2dc63bb6ad in poll () from /lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007f2dc73fba62 in timer_thread_sleep (gvl=0x5628917b3f28) at thread_pthread.c:1455
> #2 thread_timer (p=0x5628917b3f28) at thread_pthread.c:1563
> #3 0x00007f2dc7045494 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #4 0x00007f2dc63c4aff in clone () from /lib/x86_64-linux-gnu/libc.so.6
>
> Thread 1 (Thread 0x7f2dc78fc700 (LWP 11224)):
> #0 0x00007f2dc63adca7 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007f2dc73fbac5 in native_stop_timer_thread () at thread_pthread.c:1664
> #2 rb_thread_stop_timer_thread () at thread.c:3902
> #3 0x00007f2dc7341c42 in before_exec_non_async_signal_safe () at process.c:1175
> #4 before_exec () at process.c:1181
> #5 rb_f_exec (argc=<optimized out>, argv=<optimized out>) at process.c:2576
>
> And the offending part of the code is this:
>
> native_stop_timer_thread(void)
> {
>     int stopped;
>     stopped = --system_working <= 0;
>
>     if (TT_DEBUG) fprintf(stderr, "stop timer thread\n");
> #if USE_SLEEPY_TIMER_THREAD
>     if (stopped) {
>         /* prevent wakeups from signal handler ASAP */
>         timer_thread_pipe.owner_process = 0;
>
>         /*
>          * however, the above was not enough: the FD may already be
>          * captured and in the middle of a write while we are running,
>          * so wait for that to finish:
>          */
>         while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
>             native_thread_yield();
>         }
> [..]
> }
>
> Thread 1 is spinning around `timer_thread_pipe.writing` because someone has
> erroneously bumped it to 1.
>
> (gdb) print timer_thread_pipe
> $1 = {normal = {3, 4}, low = {5, 6}, owner_process = 0, writing = 1}
>
>
> Our case seems identical to this [1] bug report. We have applied the patch [2]
> by Eric Wong and the problem seems resolved without causing any other
> problems.
>
> [1] https://bugs.ruby-lang.org/issues/13794
> [2] https://80x24.org/spew/20170809232533.14932-...@80x24.org/raw

Can you provide a minimal test case that reproduces the issue without
taking hours or days?

Also, it would be nice to have some feedback from upstream about whether
one of those patches is going to be applied. I would not like to carry
such a patch indefinitely.