On Sat, Jun 27, 2020 at 03:14:14PM -0700, Andy Lutomirski wrote:
> 
> > On Jun 27, 2020, at 2:46 PM, Paul E. McKenney <[email protected]> wrote:
> > 
> > On Sat, Jun 27, 2020 at 02:02:15PM -0700, Andy Lutomirski wrote:
> >>> On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <[email protected]> wrote:
> >>> 
> >>> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
> >>> (where "HH" is the hexadecimal softirq vector number) when one or more
> >>> non-RCU softirq handlers are still enabled when checking to stop the
> >>> scheduler-tick interrupt.  This message is not as enlightening as one
> >>> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
> >>> local softirq work is pending, handler #HH".
> >> 
> >> Thank you!  It would be even better if it would explain *why* the
> >> problem happened, but I suppose this code doesn't actually know.
> > 
> > Glad to help!
> > 
> > To your point, is it possible to bisect the appearance of this message,
> > or is it as usual non-reproducible?  (Hey, had to ask!)
> > 
> 
> In this particular case, I tracked it down by good old-fashioned sleuthing 
> for bugs, but it’s still unclear to me precisely how NOHZ gets involved. The 
> bug is that we were entering the kernel from usermode, doing nmi_enter(), 
> turning on interrupts, maybe getting a page fault, raising a signal, turning 
> off interrupts, nmi_exit(), and back to usermode, with the signal still 
> queued and undelivered.  This is all kinds of bad, but I still don’t 
> understand what softirqs or idle have to do with it.
> 
> But I have the bug fixed now!

Glad you found it!

                                                        Thanx, Paul
