Hi,

thanks for your bug report.

On Thu, Sep 21, 2017 at 04:37:17PM +0300, Gregory Potamianos wrote:
> Package: ruby2.3
> Version: 2.3.3-1+deb9u1
> Severity: important
> Tags: upstream patch
> Forwarded: https://bugs.ruby-lang.org/issues/13794
> 
> Hello,
> 
> After the upgrade to stretch we keep finding ruby processes (puppet
> agents in particular) stuck in a sched_yield busyloop. The stuck process
> is always a forked child of the main puppet agent running inside a
> timeout block.
> 
> The backtrace of the process is the following:
> 
> (gdb) thread apply all bt
> 
> Thread 2 (Thread 0x7f2dc7904700 (LWP 11226)):
> #0  0x00007f2dc63bb6ad in poll () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007f2dc73fba62 in timer_thread_sleep (gvl=0x5628917b3f28) at
> thread_pthread.c:1455
> #2  thread_timer (p=0x5628917b3f28) at thread_pthread.c:1563
> #3  0x00007f2dc7045494 in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x00007f2dc63c4aff in clone () from /lib/x86_64-linux-gnu/libc.so.6
> 
> Thread 1 (Thread 0x7f2dc78fc700 (LWP 11224)):
> #0  0x00007f2dc63adca7 in sched_yield () from
> /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007f2dc73fbac5 in native_stop_timer_thread () at
> thread_pthread.c:1664
> #2  rb_thread_stop_timer_thread () at thread.c:3902
> #3  0x00007f2dc7341c42 in before_exec_non_async_signal_safe () at
> process.c:1175
> #4  before_exec () at process.c:1181
> #5  rb_f_exec (argc=<optimized out>, argv=<optimized out>) at
> process.c:2576
> 
> And the offending part of the code is this:
> 
> native_stop_timer_thread(void)
> {
>     int stopped;
>     stopped = --system_working <= 0;
> 
>     if (TT_DEBUG) fprintf(stderr, "stop timer thread\n");
> #if USE_SLEEPY_TIMER_THREAD
>     if (stopped) {
>         /* prevent wakeups from signal handler ASAP */
>         timer_thread_pipe.owner_process = 0;  
> 
>         /*   
>          * however, the above was not enough: the FD may already be
>          * captured and in the middle of a write while we are running,
>          * so wait for that to finish:
>          */  
>         while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
>             native_thread_yield();
>         }   
> [..]
> }
> 
> Thread 1 is spinning around `timer_thread_pipe.writing` because someone has
>  erroneously bumped it to 1.
> 
> (gdb) print timer_thread_pipe
> $1 = {normal = {3, 4}, low = {5, 6}, owner_process = 0, writing = 1}
> 
> 
> Our case seems identical to this [1] bug report. We have applied the patch [2]
> by Eric Wong and the problem seems resolved without causing any other 
> problems.
> 
> [1] https://bugs.ruby-lang.org/issues/13794
> [2] https://80x24.org/spew/20170809232533.14932-...@80x24.org/raw

Can you provide a minimal test case that reproduces the issue without
taking hours or days?

Also, it would be nice to have some feedback from upstream about whether
one of those patches is going to be applied. I would not like to
carry such a patch indefinitely.
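
For reference, here is how I read the wait loop quoted above, as a
minimal standalone sketch in C. It is not the Ruby source. It assumes
ATOMIC_CAS returns the previous value of the variable, as GCC's
__sync_val_compare_and_swap builtin does. The names SKETCH_CAS,
atomic_word_t, writing and wait_for_writers are made up for
illustration, and the max_spins bound exists only so that the sketch
terminates.

```c
/* Stand-in for Ruby's rb_atomic_t (hypothetical name for this sketch). */
typedef unsigned int atomic_word_t;

/*
 * Assumption: ATOMIC_CAS(var, old, new) evaluates to the value that was
 * in var before the operation, like the GCC builtin used here.
 */
#define SKETCH_CAS(var, oldval, newval) \
    __sync_val_compare_and_swap(&(var), (oldval), (newval))

/* Hypothetical model of timer_thread_pipe.writing. */
static atomic_word_t writing;

/*
 * Wait the way the quoted loop does: CAS 0 -> 0 never changes the
 * variable, it only reads it atomically, so the loop keeps going for as
 * long as writing is non-zero. Unlike the original, this version gives
 * up after max_spins iterations and returns how many it spent waiting.
 */
static unsigned long wait_for_writers(unsigned long max_spins)
{
    unsigned long spins = 0;

    while (SKETCH_CAS(writing, (atomic_word_t)0, 0) && spins < max_spins) {
        spins++; /* the original calls native_thread_yield() here */
    }
    return spins;
}
```

With writing set to 0, wait_for_writers(1000) returns 0 immediately.
With writing set to 1, the value in the gdb dump, it uses up all 1000
iterations. In the unbounded original, that means the loop can only end
if something else resets the counter to 0.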
