[ Adding LKML on CC so that others can find this. ]

On Wed, Oct 11, 2017 at 12:21:39PM +0800, Wang YanQing wrote:
> Hi, Paul McKenney.
>
> I have received many machine-stopped-response reports, after reboot and
> inspect message, all of them show RCU stalled, but I can't figure out
> how to fix it. I can't update the kernel, it is the painful point, so I
> need to fix it in 3.10. I have attached four messages come from different
> cpu and boards (so I guess it is a BUG instead of hardware fault), any
> suggestion is welcome.

The first step is of course to report this to your distro, as they are
the ones who do the care and feeding of such old kernels. Please include
the information below in that report, as it might help your distro find
and fix the problem.

It looks like the stalled CPU is idle, and that the activity resulting
from the stall-warning message gets things going again. Callbacks are
being processed, so no OOM. But you are getting the splat every 60
seconds. The system has only two CPUs, and is x86.

If you cannot upgrade the kernel, my ability to help is limited. And the
diagnostics printed with the v3.10 CPU stall warnings are also quite
limited. However, there are some things you could try as workarounds:

1.  Check to make sure that the rcu_sched kthread is getting the
    CPU time that it needs. Preventing this kthread from running
    would create exactly this output, assuming that the stall
    warning got it going again temporarily.

2.  It looks like the disturbance of the RCU CPU stall warning
    is getting things going again. Try artificially providing
    this disturbance, for example, by running a usermode program
    or script that runs on each CPU in turn, then sleeps for (say)
    five seconds.

3.  If you can reconfigure your kernel, try building with
    CONFIG_RCU_FAST_NO_HZ=n.

4.  Was the system running reliably on some earlier version?
    If so, consider reverting back to that version, and include
    the version information in your report to your distro.
    If your distro provides individual patches, you should consider
    bisecting so as to locate the offending patch.

Good luck with it!

                                                        Thanx, Paul
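
The usermode program suggested in workaround 2 could look roughly like the sketch below. It is only an illustration, assuming Python 3.3 or later on Linux (for os.sched_setaffinity); the function name poke_cpus and its parameters are hypothetical, and whether this kind of disturbance is enough to avoid the stalls on the affected v3.10 systems is not known.

```python
import os
import time


def poke_cpus(rounds=None, sleep_seconds=5.0):
    """Run briefly on each allowed CPU in turn, then sleep, and repeat.

    Hypothetical helper: rounds=None loops forever. Returns the list of
    CPUs visited during the last completed pass.
    """
    allowed = os.sched_getaffinity(0)  # CPUs this process may run on
    visited = []
    done = 0
    try:
        while rounds is None or done < rounds:
            visited = []
            for cpu in sorted(allowed):
                # Restricting the affinity mask to a single CPU makes the
                # kernel migrate this process there before the call returns.
                os.sched_setaffinity(0, {cpu})
                visited.append(cpu)
            # Widen the mask again so the sleep is not pinned to one CPU.
            os.sched_setaffinity(0, allowed)
            time.sleep(sleep_seconds)
            done += 1
    finally:
        # Always restore the original affinity, even on Ctrl-C.
        os.sched_setaffinity(0, allowed)
    return visited
```

Calling poke_cpus() with no arguments from a script started at boot would visit every allowed CPU once per pass and sleep five seconds between passes, as in the suggestion above. The five-second default comes from the "(say) five seconds" figure and may need tuning.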