On Thu, Oct 12, 2017 at 01:38:24PM -0700, Paul E. McKenney wrote:
> [ Adding LKML on CC so that others can find this. ]
> 
> On Wed, Oct 11, 2017 at 12:21:39PM +0800, Wang YanQing wrote:
> > Hi, Paul McKenney.
> > 
> > I have received many reports of machines that stopped responding.
> > After rebooting and inspecting the logs, all of them show RCU stalls,
> > but I can't figure out how to fix them. The painful point is that I
> > can't update the kernel, so I need a fix for 3.10. I have attached
> > four messages that come from different CPUs and boards (so I guess it
> > is a bug rather than a hardware fault). Any suggestion is welcome.
> 
> The first step is of course to report this to your distro, as they are
> the ones who do the care and feeding of such old kernels.  Please include
> the information below in that report, as it might help your distro find
> and fix the problem.
> 
> It looks like the stalled CPU is idle, and that the activity resulting
> from the stall-warning message gets things going again.  Callbacks are
> being processed, so no OOM.  But you are getting the splat every 60
> seconds.  The system has only two CPUs, and is x86.
> 
> If you cannot upgrade the kernel, my ability to help is limited.  And the
> diagnostics printed with the v3.10 CPU stall warnings are also quite
> limited.  However, there are some things you could try as workarounds:
> 
> 1.    Check to make sure that the rcu_sched kthread is getting
>       the CPU time that it needs.  Preventing this kthread from
>       running would create exactly this output, assuming that
>       the stall warning got it going again temporarily.
> 
> 2.    It looks like the disturbance of the RCU CPU stall warning
>       is getting things going again.  Try artificially providing
>       this disturbance, for example, by running a usermode program
>       or script that runs on each CPU in turn, then sleeps for
>       (say) five seconds.
> 
> 3.    If you can reconfigure your kernel, try building with
>       CONFIG_RCU_FAST_NO_HZ=n.

And if you can reconfigure your v3.10 kernel, building with
CONFIG_RCU_CPU_STALL_INFO=y and CONFIG_RCU_CPU_STALL_VERBOSE=y will
provide more information on the CPUs and tasks stalling the grace period.

                                                        Thanx, Paul

> 4.    Was the system running reliably on some earlier version?
>       If so, consider reverting back to that version, and include
>       the version information in your report to your distro.  If
>       your distro provides individual patches, you should consider
>       bisecting so as to locate the offending patch.
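
For workaround 1, one rough way to see whether the rcu_sched kthread is
getting CPU time is to sample its utime and stime from /proc over an
interval. The sketch below is only an illustration: the function names are
made up for this example, and it assumes that the kthread shows up under
the name "rcu_sched" in /proc/<pid>/comm on your kernel.

```python
import os
import time


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so utime (field 14) and stime (field 15)
    # are at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def find_pids_by_comm(name, proc_root="/proc"):
    """Return the PIDs whose /proc/<pid>/comm equals name."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the task exited while we were scanning
    return pids


def report_kthread_progress(name="rcu_sched", interval=10.0):
    """Print how many ticks of CPU time each matching task gained."""
    # Assumption: the kthread is named "rcu_sched" on the kernel in question.
    for pid in find_pids_by_comm(name):
        path = "/proc/%d/stat" % pid
        with open(path) as f:
            before = cpu_ticks_from_stat(f.read())
        time.sleep(interval)
        with open(path) as f:
            after = cpu_ticks_from_stat(f.read())
        print("%s (pid %d): +%d ticks in %.0f s"
              % (name, pid, after - before, interval))
```

Calling report_kthread_progress() while the stall warnings are appearing
shows whether the tick count moves at all. A count that stays flat for a
long time would be consistent with the kthread being kept off the CPU, but
it does not prove it.
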
> 
> Good luck with it!
> 
>                                                       Thanx, Paul
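
For workaround 2, the usermode program might look something like the
following. This is a minimal sketch rather than a tested fix: the function
names and the default five-second period are illustrative, and it assumes
Python 3.3 or later on Linux, which provides os.sched_getaffinity() and
os.sched_setaffinity().

```python
import os
import time


def visit_each_cpu(pause_seconds=0.0):
    """Run briefly on every CPU this process may use; return the CPUs visited."""
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    visited = []
    try:
        for cpu in sorted(allowed):
            # Restrict ourselves to a single CPU, which forces a migration
            # onto it if we are currently running somewhere else.
            os.sched_setaffinity(0, {cpu})
            visited.append(cpu)
            time.sleep(pause_seconds)
    finally:
        # Restore the original affinity mask, even on error.
        os.sched_setaffinity(0, allowed)
    return visited


def disturb_forever(period_seconds=5.0):
    """Hop across all allowed CPUs, sleep, and repeat until interrupted."""
    while True:
        visit_each_cpu()
        time.sleep(period_seconds)
```

Running disturb_forever() from a script started at boot provides the
periodic disturbance. If the stall warnings go away while it runs, please
include that observation in the report to your distro, because it is a
workaround and not a fix.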
