On Sun, 12 November 2017 "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> On Sun, Nov 12, 2017 at 12:09:28PM +0100, Bruno Prémont wrote: > > On Sat, 11 November 2017 "Paul E. McKenney" <paul...@linux.vnet.ibm.com> > > wrote: > > > On Sat, Nov 11, 2017 at 08:38:32PM +0100, Bruno Prémont wrote: > > > > Hi, > > > > > > > > On a single-CPU KVM-based virtual machine I'm suffering from RCU stall > > > > and soft-lockup. 4.10.x kernels run fine (4.10.12) but starting with > > > > 4.11.x (4.11.3, 4.13.11) I'm getting system freezes for no apparent > > > > reason. > > > > > > > > All info I have is following console dump (from 4.13.11): > > > > [526415.290012] INFO: rcu_sched self-detected stall on CPU > > > > [526415.290012] o0-...: (745847 ticks this GP) idle=ba2/2/0 > > > > softirq=37393463/37393463 fqs=0 > > > > [526415.290012] o (t=745854 jiffies g=23779976 c=23779975 q=32) > > > > [526415.290012] rcu_sched kthread starved for 745854 jiffies! g23779976 > > > > c23779975 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 > > > > > > The above line says that the rcu_sched kthread asked to sleep for three > > > jiffies, but ended up sleeping for more than 745,854 jiffies. > > > > > > If your system does not let the RCU's kernel threads run, RCU cannot > > > help you much. > > > > > > The ->state of 0x0 indicates that the kthread is in fact runnable, but > > > did not get a chance to run. Was the system heavily loaded to the > > > point where you would expect a kthread to remain preempted for many > > > minutes? > > > > > > I am guessing that the answer is no, given that CPU 0 is actually idle > > > (idle=ba2/2/0). Seems unlikely, but I have to ask: Did you bind the > > > kthread to a specific CPU? > > > > The system should be lightly loaded (about 5-10% CPU usage on average), so > > plenty of time for RCU to do its work. > > > > I didn't bind processes (be it userspace process or kthread) to a specific > > CPU, thus it's all auto-configured. > > > > I guess the question then is what is the system busy with or waiting for > > that prevents RCU to get its work done... > > Shouldn't the watchdog print a trace of where CPU#0 is stuck? If so I might > > need > > to check at which log level and make sure that loglevel reaches console. > > Nothing did hit the disk though. > > Do you have a high-speed interface to capture and store console output? > (As in something that can handle, say, 50MB in a reasonable period of > time.) The only option I could have is netconsole, in the hope it does not cause issues by itself! Other than that I have a poor man's web-VNC kind of view on VGA-sized text console, which is slow and small (it's a virtual server at a hosting company). Thanks, Bruno > Thanx, Paul > > > Thanks, > > Bruno > > > > > Thanx, Paul > > > > > > > [526440.020015] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! > > > > [swapper/0:0] > > > > [526468.020005] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! > > > > [swapper/0:0] > > > > [526478.320009] INFO: rcu_sched self-detected stall on CPU > > > > [526478.320009] o0-...: (752143 ticks this GP) idle=ba2/2/0 > > > > softirq=37393463/37393463 fqs=0 > > > > [526478.320009] o (t=752157 jiffies g=23779976 c=23779975 q=32) > > > > [526478.320009] rcu_sched kthread starved for 752157 jiffies! g23779976 > > > > c23779975 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 > > > > [526504.020016] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! > > > > [swapper/0:0] > > > > [526532.020007] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! > > > > [swapper/0:0] > > > > ... > > > > > > > > Attached is kernel config (4.13.11). > > > > > > > > > > > > The output obtained with 4.11.3 was: > > > > [ 280.680010] INFO: rcu_sched self-detected stall on CPU > > > > [ 280.680021] o0-...: (27312 ticks this GP) dile=b11/2/0 > > > > softirq=6119/6119 fqs=0 > > > > [ 280.680021] o (t=27312 jiffies g=441 c=440 q=0) > > > > [ 280.680021] rcu_sched_kthread starved for 27312 jiffies! g441 c440 > > > > f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 > > > > ... > > > > > > > > > > > > As it's a remote VM for which I don't have access to the host I have > > > > little > > > > options for further digging (can't trigger sysrq's). > > > > > > > > > > > > Same kernel (4.13.11) seems to be running just fine on another KVM-base > > > > VM that > > > > has two CPUs. > > > > > > > > > > > > Does it ring a bell or is there some info that might be of any use, > > > > assuming I can obtain it? > > > > > > > > Bruno > > > > > > > > >