On Thu, Apr 30, 2020 at 06:47:20PM +0000, Atul Kulkarni wrote:
> Dear Sir,
>
> Hope you are doing well. I have watched your various conference videos and
> have read technical papers.
> We are facing an issue with CPU stall on our systems and I felt like there is
> no one better who can guide us on how we can deal with it.
>
> I have attached logs for your reference. Towards end I have run couple of
> sysreq commands and have taken crash dump using sysreq which may help provide
> additional information.
> Could you please guide us on how we could fix this issue or identify what is
> going wrong here?

Let's focus on the first few lines of your console message:

[20526.345089] INFO: rcu_preempt self-detected stall on CPU
[20526.351110]  0-...: (1051 ticks this GP) idle=1fe/140000000000002/0 softirq=146268/146268 fqs=0
[20526.360163]   (t=2101 jiffies g=96468 c=96467 q=2)
[20526.365535] rcu_preempt kthread starved for 2101 jiffies! g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0

The last line contains the hint, namely "rcu_preempt kthread starved for
2101 jiffies!"  If you don't let RCU's kernel threads run, then RCU CPU
stall warnings are expected behavior.

The "RCU_GP_WAIT_FQS(3)" means that this kthread's last act was to sleep
for a short timeout of a few jiffies while waiting to force quiescent
states (the "3" is the numeric value of that grace-period state).  As you
can see from earlier in that same line, that was 2101 jiffies ago.  The
"->state=0x402" means that the scheduler believes that this kthread is
blocked, that is, not yet runnable.

The usual way this sort of thing happens is a timer problem, be it a
hardware configuration problem, a timer-driver bug, an interrupt-handling
problem, and so on.  This sort of problem is especially common when
bringing up new hardware, when modifying timer code, or when modifying
code on the interrupt/exception paths.

So the question to ask yourself is "Why is the timer wakeup not reaching
this kthread?", with special attention to changed code and new hardware.

							Thanx, Paul
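
For anyone who wants to pull the fields discussed above out of a console
log mechanically, here is a minimal sketch.  It is not part of the kernel
or of any RCU tooling.  The regular expression assumes the exact layout of
the "kthread starved" line quoted above, which differs between kernel
versions, and the function names are made up for illustration.  The
task-state table lists only bits whose values are stable across the
kernels I am aware of, so treat anything reported as "unknown" with care.

```python
import re

# Assumed layout, copied from the "kthread starved" line in the log above.
# Other kernel versions print different fields, so this may need adjusting.
STARVED_RE = re.compile(
    r"(?P<flavor>\S+) kthread starved for (?P<jiffies>\d+) jiffies! "
    r".*?(?P<gp_state>RCU_GP_\w+)\((?P<gp_state_num>\d+)\) "
    r"->state=(?P<task_state>0x[0-9a-fA-F]+) ->cpu=(?P<cpu>\d+)"
)

# Deliberately partial table of task_struct ->state bits.  Several other
# bits have changed value between kernel versions, so they are left out.
TASK_STATE_BITS = {
    0x001: "TASK_INTERRUPTIBLE",
    0x002: "TASK_UNINTERRUPTIBLE",
    0x400: "TASK_NOLOAD",
}


def decode_task_state(state):
    """Return the names of the known bits set in a ->state value."""
    if state == 0:
        return ["TASK_RUNNING"]
    names = []
    remaining = state
    for bit in sorted(TASK_STATE_BITS):
        if state & bit:
            names.append(TASK_STATE_BITS[bit])
            remaining &= ~bit
    if remaining:
        # Bits this sketch does not know how to name.
        names.append("unknown:%#x" % remaining)
    return names


def parse_starved_line(line):
    """Parse one 'kthread starved' line, or return None if it does not match."""
    m = STARVED_RE.search(line)
    if m is None:
        return None
    state = int(m.group("task_state"), 16)
    return {
        "flavor": m.group("flavor"),
        "starved_jiffies": int(m.group("jiffies")),
        "gp_state": m.group("gp_state"),
        "gp_state_num": int(m.group("gp_state_num")),
        "task_state": state,
        "task_state_names": decode_task_state(state),
        "cpu": int(m.group("cpu")),
    }


LINE = ("[20526.365535] rcu_preempt kthread starved for 2101 jiffies! "
        "g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0")

info = parse_starved_line(LINE)
for key in sorted(info):
    print(key, "=", info[key])
```

For the line above, 0x402 decodes to TASK_UNINTERRUPTIBLE plus TASK_NOLOAD,
which is an idle-style sleep.  That is consistent with a kthread that is
blocked waiting for a timer wakeup that never arrived.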