Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Paul E. McKenney
On Thu, Sep 12, 2013 at 11:59:36AM -0700, Mike Travis wrote:
> On 9/12/2013 10:27 AM, Paul E. McKenney wrote:
> > On Tue, Sep 10, 2013 at 11:03:49AM +0200, Peter Zijlstra wrote:
> >> On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
> >>> On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
> >>>> On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
> >>>>> For performance reasons, the NMI handler may be disabled to lessen the
> >>>>> performance impact caused by the multiple perf tools running concurrently.
> >>>>> If the system nmi command is issued when the UV NMI handler is disabled,
> >>>>> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
> >>>>> disabled by setting the nmi disabled variable to '1'.  Setting it back to
> >>>>> '0' will re-enable the NMI handler.
> >>>>
> >>>> I'm not entirely sure why this is still needed now that you've moved all
> >>>> really expensive bits into the UNKNOWN handler.
> >>>>
> >>>
> >>> Yes, it could be considered optional.  My primary use was to isolate
> >>> new bugs I found to see if my NMI changes were causing them.  But it
> >>> appears that they are not since the problems occur with or without
> >>> using the NMI entry into KDB.  So it can be safely removed.
> >>
> >> OK, as a debug option it might make sense, but removing it is (of course)
> >> fine with me ;-)
> >>
> >>> (The basic problem is that if you hang out in KDB too long the machine
> >>> locks up.  
> >>
> >> Yeah, known issue. Not much you can do about it either I suspect. The
> >> system generally isn't built for things like that.
> >>
> >>> Other problems like the rcu stall detector does not have a
> >>> means to be "touched" like the nmi_watchdog_timer so it fires off a
> >>> few to many, many messages.  
> >>
> >> That however might be easily cured if you ask Paul nicely ;-)
> > 
> > RCU's grace-period mechanism is supposed to be what touches it.  ;-)
> > 
> > But what is it that you are looking for?  If you want to silence it
> > completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
> > you want to use.
> 
> We have by default rcutree.rcu_cpu_stall_suppress=1 on the kernel
> cmdline.  I'll double check if it was set during my testing.
> 
> > 
> >>> Another, any network connections will time
> >>> out if you are in KDB more than say 20 or 30 seconds.)
> > 
> > Ah, you are looking for RCU to refrain from complaining about grace
> > periods that have been delayed by breakpoints in the kernel?  Is there
> > some way that RCU can learn that a breakpoint has happened?  If so,
> > this should not be hard.
> 
> Yes, exactly.  After a UV NMI event which might or might not call KDB,
> but definitely can consume some time with the system stopped, I have
> these notifications:
> 
> static void uv_nmi_touch_watchdogs(void)
> {
> 	touch_softlockup_watchdog_sync();
> 	clocksource_touch_watchdog();
> 	rcu_cpu_stall_reset();

This function effectively disables RCU CPU stall warnings for the current
set of grace periods.  Or is supposed to do so, anyway.  I won't guarantee
that this avoids false positives in the face of all possible races
between grace-period initialization, calls to rcu_cpu_stall_reset(),
and stall warnings.

So how often are you seeing RCU CPU stall warnings?

> 	touch_nmi_watchdog();
> }
> 
> 
> In all the cases I checked, I had all the cpus in the NMI event so
> I don't think it was a straggler who triggered the problem.  One
> question though, the above is called by all cpus exiting the NMI
> event.  Should I limit that to only one cpu?

You should only need to invoke rcu_cpu_stall_reset() from a single CPU.
That said, I would not expect problems from concurrent invocations,
unless your compiler stores to a long with a pair of smaller stores
or something.
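
For illustration only, here is a minimal user-space sketch of letting just
one CPU per NMI event do the reset.  Everything in it is hypothetical:
stall_reset_stub() merely stands in for rcu_cpu_stall_reset(), and the
flag-based gate is one possible approach, not what the UV patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag reset_claimed = ATOMIC_FLAG_INIT;
static int reset_calls;	/* how many times the stub actually ran */

/* Hypothetical stand-in for rcu_cpu_stall_reset(); it only counts calls. */
static void stall_reset_stub(void)
{
	reset_calls++;
}

/*
 * Run by every CPU leaving the NMI event.  Only the first caller since
 * the last re-arm performs the reset; returns true for that caller.
 */
static bool nmi_exit_touch_once(void)
{
	if (atomic_flag_test_and_set(&reset_claimed))
		return false;	/* some other CPU already did it */
	stall_reset_stub();
	return true;
}

/* Re-arm the gate before the next NMI event. */
static void nmi_event_rearm(void)
{
	atomic_flag_clear(&reset_claimed);
}

/* Usage: pretend ncpus CPUs leave one NMI event; returns resets done. */
static int simulate_nmi_exit(int ncpus)
{
	int winners = 0;

	assert(ncpus > 0);
	nmi_event_rearm();
	for (int i = 0; i < ncpus; i++)
		winners += nmi_exit_touch_once();
	return winners;
}
```

The gate only saves redundant stores; as noted above, concurrent
invocations are not expected to cause problems in the first place.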

> Note btw, that this also happens when KGDB/KDB is entered via the
> sysrq-trigger 'g' event.
> 
> Perhaps there is some other timer that is going off?

Is uv_nmi_touch_watchdogs() invoked on the way in to the breakpoint?
On the way out?  Both?  Either way, what software environment does it
run in?  The only environment completely safe against races on the way
in would be stop_machine() -- otherwise, a grace period might start just
after uv_nmi_touch_watchdogs() returned, which would cause a normal RCU
CPU stall timeout to be in effect.
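
To make that race concrete, here is a small user-space model, under the
assumption that the stall check compares a tick counter against a
per-grace-period deadline and that a reset simply pushes the deadline far
into the future.  The names and the STALL_TIMEOUT value are made up for
the sketch; this is not the kernel's code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define STALL_TIMEOUT 21000UL	/* hypothetical timeout, in ticks */

struct stall_state {
	unsigned long deadline;	/* tick after which a warning is due */
};

/* Wraparound-safe "a is after b", in the style of time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* A new grace period arms the normal timeout. */
static void grace_period_start(struct stall_state *s, unsigned long now)
{
	/* The comparison above only works within half the counter range. */
	assert(STALL_TIMEOUT < ULONG_MAX / 2);
	s->deadline = now + STALL_TIMEOUT;
}

/* A reset moves the deadline as far away as the comparison allows. */
static void stall_reset(struct stall_state *s, unsigned long now)
{
	s->deadline = now + ULONG_MAX / 2;
}

static bool stall_due(const struct stall_state *s, unsigned long now)
{
	return tick_after(now, s->deadline);
}
```

In this model a reset silences the current deadline, but a grace period
that starts right afterwards calls grace_period_start() again and puts
the normal timeout back in effect.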

> > If not, I must fall back on the rcu_cpu_stall_suppress that I mentioned
> > earlier.
> > 
> >>> One other problem is with the perf tool.  It seems running more than
> >>> about 2 or 3 perf top instances on a medium (1k cpu threads) sized
> >>> system, they start behaving badly with a bunch of NMI stackdumps
> >>> appearing on the console.  Eventually the system becomes unusable.
> >>
> >> Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/
> > 
> > Indeed, with that definition of "medium", large must be truly impressive!
> 
> I say medium because it's only one rack w/~4TB of 

Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Paul E. McKenney
On Thu, Sep 12, 2013 at 08:48:33PM +0100, Hedi Berriche wrote:
> On Thu, Sep 12, 2013 at 19:59 Mike Travis wrote:
> | On 9/12/2013 10:27 AM, Paul E. McKenney wrote:
> |
> | > But what is it that you are looking for?  If you want to silence it
> | > completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
> | > you want to use.
> | 
> | We have by default rcutree.rcu_cpu_stall_suppress=1 on the kernel
> | cmdline.  I'll double check if it was set during my testing.
> 
> FWIW, for recent enough kernels the correct boot parameter is
> rcupdate.rcu_cpu_stall_suppress.
> 
> It used to be rcutree.rcu_cpu_stall_suppress, but that has changed after
> commit 6bfc09e.

Good point, Hedi!  That change happened when rcutiny gained RCU CPU
stall warning capability.
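
So the two spellings on the kernel command line would look roughly as
follows.  The "#" lines are annotations only and are not part of the
command line; which spelling applies depends on whether the kernel
contains the commit Hedi mentions.

```
# kernels that include commit 6bfc09e
rcupdate.rcu_cpu_stall_suppress=1

# kernels from before that commit
rcutree.rcu_cpu_stall_suppress=1
```

Assuming the usual module-parameter layout, the runtime knob should then
show up under /sys/module/rcupdate/parameters/ (or under
/sys/module/rcutree/parameters/ on the older kernels).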

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Hedi Berriche
On Thu, Sep 12, 2013 at 19:59 Mike Travis wrote:
| On 9/12/2013 10:27 AM, Paul E. McKenney wrote:
|
| > But what is it that you are looking for?  If you want to silence it
| > completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
| > you want to use.
| 
| We have by default rcutree.rcu_cpu_stall_suppress=1 on the kernel
| cmdline.  I'll double check if it was set during my testing.

FWIW, for recent enough kernels the correct boot parameter is
rcupdate.rcu_cpu_stall_suppress.

It used to be rcutree.rcu_cpu_stall_suppress, but that has changed after
commit 6bfc09e.

Cheers,
Hedi.
-- 
Be careful of reading health books, you might die of a misprint.
-- Mark Twain


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Mike Travis


On 9/12/2013 10:27 AM, Paul E. McKenney wrote:
> On Tue, Sep 10, 2013 at 11:03:49AM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
>>> On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
>>>> On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
>>>>> For performance reasons, the NMI handler may be disabled to lessen the
>>>>> performance impact caused by the multiple perf tools running concurrently.
>>>>> If the system nmi command is issued when the UV NMI handler is disabled,
>>>>> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
>>>>> disabled by setting the nmi disabled variable to '1'.  Setting it back to
>>>>> '0' will re-enable the NMI handler.
>>>>
>>>> I'm not entirely sure why this is still needed now that you've moved all
>>>> really expensive bits into the UNKNOWN handler.
>>>>
>>>
>>> Yes, it could be considered optional.  My primary use was to isolate
>>> new bugs I found to see if my NMI changes were causing them.  But it
>>> appears that they are not since the problems occur with or without
>>> using the NMI entry into KDB.  So it can be safely removed.
>>
>> OK, as a debug option it might make sense, but removing it is (of course)
>> fine with me ;-)
>>
>>> (The basic problem is that if you hang out in KDB too long the machine
>>> locks up.  
>>
>> Yeah, known issue. Not much you can do about it either I suspect. The
>> system generally isn't built for things like that.
>>
>>> Other problems like the rcu stall detector does not have a
>>> means to be "touched" like the nmi_watchdog_timer so it fires off a
>>> few to many, many messages.  
>>
>> That however might be easily cured if you ask Paul nicely ;-)
> 
> RCU's grace-period mechanism is supposed to be what touches it.  ;-)
> 
> But what is it that you are looking for?  If you want to silence it
> completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
> you want to use.

We have by default rcutree.rcu_cpu_stall_suppress=1 on the kernel
cmdline.  I'll double check if it was set during my testing.

> 
>>> Another, any network connections will time
>>> out if you are in KDB more than say 20 or 30 seconds.)
> 
> Ah, you are looking for RCU to refrain from complaining about grace
> periods that have been delayed by breakpoints in the kernel?  Is there
> some way that RCU can learn that a breakpoint has happened?  If so,
> this should not be hard.

Yes, exactly.  After a UV NMI event which might or might not call KDB,
but definitely can consume some time with the system stopped, I have
these notifications:

static void uv_nmi_touch_watchdogs(void)
{
	touch_softlockup_watchdog_sync();
	clocksource_touch_watchdog();
	rcu_cpu_stall_reset();
	touch_nmi_watchdog();
}


In all the cases I checked, I had all the cpus in the NMI event so
I don't think it was a straggler who triggered the problem.  One
question though, the above is called by all cpus exiting the NMI
event.  Should I limit that to only one cpu?

Note btw, that this also happens when KGDB/KDB is entered via the
sysrq-trigger 'g' event.

Perhaps there is some other timer that is going off?

> If not, I must fall back on the rcu_cpu_stall_suppress that I mentioned
> earlier.
> 
>>> One other problem is with the perf tool.  It seems running more than
>>> about 2 or 3 perf top instances on a medium (1k cpu threads) sized
>>> system, they start behaving badly with a bunch of NMI stackdumps
>>> appearing on the console.  Eventually the system becomes unusable.
>>
>> Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/
> 
> Indeed, with that definition of "medium", large must be truly impressive!

I say medium because it's only one rack w/~4TB of memory (and quite
popular).  Large would be 4k cpus/64TB.  Not sure yet what is "huge",
at least in terms of an SSI system.

> 
>   Thanx, Paul
> 
>>> On a large system (4k), the perf tools get an error message (sorry
>>> don't have it handy at the moment) that basically implies that the
>>> perf config option is not set.  Again, I wanted to remove the new
>>> NMI handler to ensure that it wasn't doing something weird, and
>>> it wasn't.
>>
>> Cute.. 


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Mike Travis


On 9/12/2013 11:35 AM, Paul E. McKenney wrote:
> On Thu, Sep 12, 2013 at 10:27:31AM -0700, Paul E. McKenney wrote:
>> On Tue, Sep 10, 2013 at 11:03:49AM +0200, Peter Zijlstra wrote:
>>> On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
>>>> On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
>>>>> On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
>>>>>> For performance reasons, the NMI handler may be disabled to lessen the
>>>>>> performance impact caused by the multiple perf tools running concurrently.
>>>>>> If the system nmi command is issued when the UV NMI handler is disabled,
>>>>>> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
>>>>>> disabled by setting the nmi disabled variable to '1'.  Setting it back to
>>>>>> '0' will re-enable the NMI handler.
>>>>>
>>>>> I'm not entirely sure why this is still needed now that you've moved all
>>>>> really expensive bits into the UNKNOWN handler.
>>>>>
>>>>
>>>> Yes, it could be considered optional.  My primary use was to isolate
>>>> new bugs I found to see if my NMI changes were causing them.  But it
>>>> appears that they are not since the problems occur with or without
>>>> using the NMI entry into KDB.  So it can be safely removed.
>>>
>>> OK, as a debug option it might make sense, but removing it is (of course)
>>> fine with me ;-)
>>>
>>>> (The basic problem is that if you hang out in KDB too long the machine
>>>> locks up.
>>>
>>> Yeah, known issue. Not much you can do about it either I suspect. The
>>> system generally isn't built for things like that.
>>>
>>>> Other problems like the rcu stall detector does not have a
>>>> means to be "touched" like the nmi_watchdog_timer so it fires off a
>>>> few to many, many messages.
>>>
>>> That however might be easily cured if you ask Paul nicely ;-)
>>
>> RCU's grace-period mechanism is supposed to be what touches it.  ;-)
>>
>> But what is it that you are looking for?  If you want to silence it
>> completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
>> you want to use.
>>
>>>> Another, any network connections will time
>>>> out if you are in KDB more than say 20 or 30 seconds.)
>>
>> Ah, you are looking for RCU to refrain from complaining about grace
>> periods that have been delayed by breakpoints in the kernel?  Is there
>> some way that RCU can learn that a breakpoint has happened?  If so,
>> this should not be hard.
> 
> But wait...  RCU relies on the jiffies counter for RCU CPU stall warnings.
> Doesn't the jiffies counter stop during breakpoints?
> 
>   Thanx, Paul

All cpus entering the UV NMI event use local_irq_save (as does the
entry into KGDB/KDB).  So the question becomes more what happens
after all the cpus do the local_irq_restore?  The hardware clocks
are of course still running.

> 
>> If not, I must fall back on the rcu_cpu_stall_suppress that I mentioned
>> earlier.
>>
>>>> One other problem is with the perf tool.  It seems running more than
>>>> about 2 or 3 perf top instances on a medium (1k cpu threads) sized
>>>> system, they start behaving badly with a bunch of NMI stackdumps
>>>> appearing on the console.  Eventually the system becomes unusable.
>>>
>>> Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/
>>
>> Indeed, with that definition of "medium", large must be truly impressive!
>>
>>  Thanx, Paul
>>
>>>> On a large system (4k), the perf tools get an error message (sorry
>>>> don't have it handy at the moment) that basically implies that the
>>>> perf config option is not set.  Again, I wanted to remove the new
>>>> NMI handler to ensure that it wasn't doing something weird, and
>>>> it wasn't.
>>>
>>> Cute.. 


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Paul E. McKenney
On Thu, Sep 12, 2013 at 10:27:31AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 10, 2013 at 11:03:49AM +0200, Peter Zijlstra wrote:
> > On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
> > > On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
> > > > On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
> > > >> For performance reasons, the NMI handler may be disabled to lessen the
> > > >> performance impact caused by the multiple perf tools running concurrently.
> > > >> If the system nmi command is issued when the UV NMI handler is disabled,
> > > >> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
> > > >> disabled by setting the nmi disabled variable to '1'.  Setting it back to
> > > >> '0' will re-enable the NMI handler.
> > > > 
> > > > I'm not entirely sure why this is still needed now that you've moved all
> > > > really expensive bits into the UNKNOWN handler.
> > > > 
> > > 
> > > Yes, it could be considered optional.  My primary use was to isolate
> > > new bugs I found to see if my NMI changes were causing them.  But it
> > > appears that they are not since the problems occur with or without
> > > using the NMI entry into KDB.  So it can be safely removed.
> > 
> > OK, as a debug option it might make sense, but removing it is (of course)
> > fine with me ;-)
> > 
> > > (The basic problem is that if you hang out in KDB too long the machine
> > > locks up.  
> > 
> > Yeah, known issue. Not much you can do about it either I suspect. The
> > system generally isn't built for things like that.
> > 
> > > Other problems like the rcu stall detector does not have a
> > > means to be "touched" like the nmi_watchdog_timer so it fires off a
> > > few to many, many messages.  
> > 
> > That however might be easily cured if you ask Paul nicely ;-)
> 
> RCU's grace-period mechanism is supposed to be what touches it.  ;-)
> 
> But what is it that you are looking for?  If you want to silence it
> completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
> you want to use.
> 
> > > Another, any network connections will time
> > > out if you are in KDB more than say 20 or 30 seconds.)
> 
> Ah, you are looking for RCU to refrain from complaining about grace
> periods that have been delayed by breakpoints in the kernel?  Is there
> some way that RCU can learn that a breakpoint has happened?  If so,
> this should not be hard.

But wait...  RCU relies on the jiffies counter for RCU CPU stall warnings.
Doesn't the jiffies counter stop during breakpoints?

Thanx, Paul

> If not, I must fall back on the rcu_cpu_stall_suppress that I mentioned
> earlier.
> 
> > > One other problem is with the perf tool.  It seems running more than
> > > about 2 or 3 perf top instances on a medium (1k cpu threads) sized
> > > system, they start behaving badly with a bunch of NMI stackdumps
> > > appearing on the console.  Eventually the system becomes unusable.
> > 
> > Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/
> 
> Indeed, with that definition of "medium", large must be truly impressive!
> 
>   Thanx, Paul
> 
> > > On a large system (4k), the perf tools get an error message (sorry
> > > don't have it handy at the moment) that basically implies that the
> > > perf config option is not set.  Again, I wanted to remove the new
> > > NMI handler to ensure that it wasn't doing something weird, and
> > > it wasn't.
> > 
> > Cute.. 


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Paul E. McKenney
On Tue, Sep 10, 2013 at 11:03:49AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
> > On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
> > > On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
> > >> For performance reasons, the NMI handler may be disabled to lessen the
> > >> performance impact caused by the multiple perf tools running concurrently.
> > >> If the system nmi command is issued when the UV NMI handler is disabled,
> > >> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
> > >> disabled by setting the nmi disabled variable to '1'.  Setting it back to
> > >> '0' will re-enable the NMI handler.
> > > 
> > > I'm not entirely sure why this is still needed now that you've moved all
> > > really expensive bits into the UNKNOWN handler.
> > > 
> > 
> > Yes, it could be considered optional.  My primary use was to isolate
> > new bugs I found to see if my NMI changes were causing them.  But it
> > appears that they are not since the problems occur with or without
> > using the NMI entry into KDB.  So it can be safely removed.
> 
> OK, as a debug option it might make sense, but removing it is (of course)
> fine with me ;-)
> 
> > (The basic problem is that if you hang out in KDB too long the machine
> > locks up.  
> 
> Yeah, known issue. Not much you can do about it either I suspect. The
> system generally isn't built for things like that.
> 
> > Other problems like the rcu stall detector does not have a
> > means to be "touched" like the nmi_watchdog_timer so it fires off a
> > few to many, many messages.  
> 
> That however might be easily cured if you ask Paul nicely ;-)

RCU's grace-period mechanism is supposed to be what touches it.  ;-)

But what is it that you are looking for?  If you want to silence it
completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
you want to use.

> > Another, any network connections will time
> > out if you are in KDB more than say 20 or 30 seconds.)

Ah, you are looking for RCU to refrain from complaining about grace
periods that have been delayed by breakpoints in the kernel?  Is there
some way that RCU can learn that a breakpoint has happened?  If so,
this should not be hard.

If not, I must fall back on the rcu_cpu_stall_suppress that I mentioned
earlier.

> > One other problem is with the perf tool.  It seems running more than
> > about 2 or 3 perf top instances on a medium (1k cpu threads) sized
> > system, they start behaving badly with a bunch of NMI stackdumps
> > appearing on the console.  Eventually the system becomes unusable.
> 
> Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/

Indeed, with that definition of "medium", large must be truly impressive!

Thanx, Paul

> > On a large system (4k), the perf tools get an error message (sorry
> > don't have it handy at the moment) that basically implies that the
> > perf config option is not set.  Again, I wanted to remove the new
> > NMI handler to ensure that it wasn't doing something weird, and
> > it wasn't.
> 
> Cute.. 


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Mike Travis


On 9/12/2013 11:35 AM, Paul E. McKenney wrote:
> On Thu, Sep 12, 2013 at 10:27:31AM -0700, Paul E. McKenney wrote:
>> On Tue, Sep 10, 2013 at 11:03:49AM +0200, Peter Zijlstra wrote:
>>> On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
>>>> On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
>>>>> On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
>>>>>> For performance reasons, the NMI handler may be disabled to lessen the
>>>>>> performance impact caused by the multiple perf tools running concurently.
>>>>>> If the system nmi command is issued when the UV NMI handler is disabled,
>>>>>> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
>>>>>> disabled by setting the nmi disabled variable to '1'.  Setting it back to
>>>>>> '0' will re-enable the NMI handler.
>>>>>
>>>>> I'm not entirely sure why this is still needed now that you've moved all
>>>>> really expensive bits into the UNKNOWN handler.
>>>>>
>>>>
>>>> Yes, it could be considered optional.  My primary use was to isolate
>>>> new bugs I found to see if my NMI changes were causing them.  But it
>>>> appears that they are not since the problems occur with or without
>>>> using the NMI entry into KDB.  So it can be safely removed.
>>>
>>> OK, as a debug option it might make sense, but removing it is (of course)
>>> fine with me ;-)
>>>
>>>> (The basic problem is that if you hang out in KDB too long the machine
>>>> locks up.
>>>
>>> Yeah, known issue. Not much you can do about it either I suspect. The
>>> system generally isn't build for things like that.
>>>
>>>> Other problems like the rcu stall detector does not have a
>>>> means to be "touched" like the nmi_watchdog_timer so it fires off a
>>>> few to many, many messages.
>>>
>>> That however might be easily cured if you ask Paul nicely ;-)
>>
>> RCU's grace-period mechanism is supposed to be what touches it.  ;-)
>>
>> But what is it that you are looking for?  If you want to silence it
>> completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
>> you want to use.
>>
>>>> Another, any network connections will time
>>>> out if you are in KDB more than say 20 or 30 seconds.)
>>
>> Ah, you are looking for RCU to refrain from complaining about grace
>> periods that have been delayed by breakpoints in the kernel?  Is there
>> some way that RCU can learn that a breakpoint has happened?  If so,
>> this should not be hard.
>
> But wait...  RCU relies on the jiffies counter for RCU CPU stall warnings.
> Doesn't the jiffies counter stop during breakpoints?
>
>   Thanx, Paul

All cpus entering the UV NMI event use local_irq_save (as does the
entry into KGDB/KDB).  So the question becomes more what happens
after all the cpus do the local_irq_restore?  The hardware clocks
are of course still running.

>>
>> If not, I must fall back on the rcu_cpu_stall_suppress that I mentioned
>> earlier.
>>
>>>> One other problem is with the perf tool.  It seems running more than
>>>> about 2 or 3 perf top instances on a medium (1k cpu threads) sized
>>>> system, they start behaving badly with a bunch of NMI stackdumps
>>>> appearing on the console.  Eventually the system become unusable.
>>>
>>> Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/
>>
>> Indeed, with that definition of medium, large must be truly impressive!
>>
>>   Thanx, Paul
>>
>>>> On a large system (4k), the perf tools get an error message (sorry
>>>> don't have it handy at the moment) the basically implies that the
>>>> perf config option is not set.  Again, I wanted to remove the new
>>>> NMI handler to insure that it wasn't doing something weird, and
>>>> it wasn't.
>>>
>>> Cute..


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Mike Travis


On 9/12/2013 10:27 AM, Paul E. McKenney wrote:
> On Tue, Sep 10, 2013 at 11:03:49AM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
>>> On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
>>>> On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
>>>>> For performance reasons, the NMI handler may be disabled to lessen the
>>>>> performance impact caused by the multiple perf tools running concurently.
>>>>> If the system nmi command is issued when the UV NMI handler is disabled,
>>>>> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
>>>>> disabled by setting the nmi disabled variable to '1'.  Setting it back to
>>>>> '0' will re-enable the NMI handler.
>>>>
>>>> I'm not entirely sure why this is still needed now that you've moved all
>>>> really expensive bits into the UNKNOWN handler.
>>>>
>>>
>>> Yes, it could be considered optional.  My primary use was to isolate
>>> new bugs I found to see if my NMI changes were causing them.  But it
>>> appears that they are not since the problems occur with or without
>>> using the NMI entry into KDB.  So it can be safely removed.
>>
>> OK, as a debug option it might make sense, but removing it is (of course)
>> fine with me ;-)
>>
>>> (The basic problem is that if you hang out in KDB too long the machine
>>> locks up.
>>
>> Yeah, known issue. Not much you can do about it either I suspect. The
>> system generally isn't build for things like that.
>>
>>> Other problems like the rcu stall detector does not have a
>>> means to be "touched" like the nmi_watchdog_timer so it fires off a
>>> few to many, many messages.
>>
>> That however might be easily cured if you ask Paul nicely ;-)
>
> RCU's grace-period mechanism is supposed to be what touches it.  ;-)
>
> But what is it that you are looking for?  If you want to silence it
> completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
> you want to use.

We have by default rcutree.rcu_cpu_stall_suppress=1 on the kernel
cmdline.  I'll double check if it was set during my testing.

 
>>> Another, any network connections will time
>>> out if you are in KDB more than say 20 or 30 seconds.)
>
> Ah, you are looking for RCU to refrain from complaining about grace
> periods that have been delayed by breakpoints in the kernel?  Is there
> some way that RCU can learn that a breakpoint has happened?  If so,
> this should not be hard.

Yes, exactly.  After a UV NMI event which might or might not call KDB,
but definitely can consume some time with the system stopped, I have
these notifications:

static void uv_nmi_touch_watchdogs(void)
{
	touch_softlockup_watchdog_sync();
	clocksource_touch_watchdog();
	rcu_cpu_stall_reset();
	touch_nmi_watchdog();
}


In all the cases I checked, I had all the cpus in the NMI event so
I don't think it was a straggler who triggered the problem.  One
question though, the above is called by all cpus exiting the NMI
event.  Should I limit that to only one cpu?

Note btw, that this also happens when KGDB/KDB is entered via the
sysrq-trigger 'g' event.

Perhaps there is some other timer that is going off?

> If not, I must fall back on the rcu_cpu_stall_suppress that I mentioned
> earlier.
>
>>> One other problem is with the perf tool.  It seems running more than
>>> about 2 or 3 perf top instances on a medium (1k cpu threads) sized
>>> system, they start behaving badly with a bunch of NMI stackdumps
>>> appearing on the console.  Eventually the system become unusable.
>>
>> Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/
>
> Indeed, with that definition of medium, large must be truly impressive!

I say medium because it's only one rack w/~4TB of memory (and quite
popular).  Large would be 4k cpus/64TB.  Not sure yet what is huge,
at least in terms of an SSI system.

 
>   Thanx, Paul
>
>>> On a large system (4k), the perf tools get an error message (sorry
>>> don't have it handy at the moment) the basically implies that the
>>> perf config option is not set.  Again, I wanted to remove the new
>>> NMI handler to insure that it wasn't doing something weird, and
>>> it wasn't.
>>
>> Cute..


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Hedi Berriche
On Thu, Sep 12, 2013 at 19:59 Mike Travis wrote:
| On 9/12/2013 10:27 AM, Paul E. McKenney wrote:
|
| > But what is it that you are looking for?  If you want to silence it
| > completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
| > you want to use.
| 
| We have by default rcutree.rcu_cpu_stall_suppress=1 on the kernel
| cmdline.  I'll double check if it was set during my testing.

FWIW, for recent enough kernels the correct boot parameter is
rcupdate.rcu_cpu_stall_suppress.

It used to be rcutree.rcu_cpu_stall_suppress, but that has changed after
commit 6bfc09e.
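
To make the rename concrete, here is a minimal userspace sketch that scans a kernel command line (for example the contents of /proc/cmdline) and reports which spelling, if any, sets the parameter. Only the two parameter names come from this thread; the helper name stall_suppress_from_cmdline() and the enum are hypothetical and are not part of any kernel or libc API.

```c
#include <string.h>

/* Which module prefix carried rcu_cpu_stall_suppress on the command line. */
enum stall_prefix { STALL_NONE = 0, STALL_RCUTREE, STALL_RCUPDATE };

/*
 * Hypothetical helper: look for either spelling of the parameter.
 * On a match, copy the text after '=' (up to the next space or newline)
 * into value[] and return the prefix that was used.  The newer
 * "rcupdate." spelling is checked first.  len must be at least 1.
 */
static enum stall_prefix stall_suppress_from_cmdline(const char *cmdline,
						     char *value, size_t len)
{
	static const struct {
		const char *key;
		enum stall_prefix prefix;
	} keys[] = {
		{ "rcupdate.rcu_cpu_stall_suppress=", STALL_RCUPDATE },
		{ "rcutree.rcu_cpu_stall_suppress=", STALL_RCUTREE },
	};

	for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
		const char *p = cmdline;
		size_t klen = strlen(keys[i].key);

		while ((p = strstr(p, keys[i].key)) != NULL) {
			/* Accept only whole words: line start or after a space. */
			if (p == cmdline || p[-1] == ' ') {
				size_t n = strcspn(p + klen, " \n");

				if (n >= len)
					n = len - 1;
				memcpy(value, p + klen, n);
				value[n] = '\0';
				return keys[i].prefix;
			}
			p += klen;
		}
	}
	return STALL_NONE;
}
```

A result of STALL_RCUTREE on a kernel that already contains the rename would suggest the command line is still using the old spelling.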

Cheers,
Hedi.
-- 
Be careful of reading health books, you might die of a misprint.
-- Mark Twain


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Paul E. McKenney
On Thu, Sep 12, 2013 at 08:48:33PM +0100, Hedi Berriche wrote:
> On Thu, Sep 12, 2013 at 19:59 Mike Travis wrote:
> | On 9/12/2013 10:27 AM, Paul E. McKenney wrote:
> |
> | > But what is it that you are looking for?  If you want to silence it
> | > completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
> | > you want to use.
> |
> | We have by default rcutree.rcu_cpu_stall_suppress=1 on the kernel
> | cmdline.  I'll double check if it was set during my testing.
>
> FWIW, for recent enough kernels the correct boot parameter is
> rcupdate.rcu_cpu_stall_suppress.
>
> It used to be rcutree.rcu_cpu_stall_suppress, but that has changed after
> commit 6bfc09e.

Good point, Hedi!  That change happened when rcutiny gained RCU CPU
stall warning capability.

Thanx, Paul



Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-12 Thread Paul E. McKenney
On Thu, Sep 12, 2013 at 11:59:36AM -0700, Mike Travis wrote:
> On 9/12/2013 10:27 AM, Paul E. McKenney wrote:
> > On Tue, Sep 10, 2013 at 11:03:49AM +0200, Peter Zijlstra wrote:
> >> On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
> >>> On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
> >>>> On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
> >>>>> For performance reasons, the NMI handler may be disabled to lessen the
> >>>>> performance impact caused by the multiple perf tools running concurently.
> >>>>> If the system nmi command is issued when the UV NMI handler is disabled,
> >>>>> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
> >>>>> disabled by setting the nmi disabled variable to '1'.  Setting it back to
> >>>>> '0' will re-enable the NMI handler.
> >>>>
> >>>> I'm not entirely sure why this is still needed now that you've moved all
> >>>> really expensive bits into the UNKNOWN handler.
> >>>>
> >>>
> >>> Yes, it could be considered optional.  My primary use was to isolate
> >>> new bugs I found to see if my NMI changes were causing them.  But it
> >>> appears that they are not since the problems occur with or without
> >>> using the NMI entry into KDB.  So it can be safely removed.
> >>
> >> OK, as a debug option it might make sense, but removing it is (of course)
> >> fine with me ;-)
> >>
> >>> (The basic problem is that if you hang out in KDB too long the machine
> >>> locks up.
> >>
> >> Yeah, known issue. Not much you can do about it either I suspect. The
> >> system generally isn't build for things like that.
> >>
> >>> Other problems like the rcu stall detector does not have a
> >>> means to be "touched" like the nmi_watchdog_timer so it fires off a
> >>> few to many, many messages.
> >>
> >> That however might be easily cured if you ask Paul nicely ;-)
> >
> > RCU's grace-period mechanism is supposed to be what touches it.  ;-)
> >
> > But what is it that you are looking for?  If you want to silence it
> > completely, the rcu_cpu_stall_suppress boot/sysfs parameter is what
> > you want to use.
>
> We have by default rcutree.rcu_cpu_stall_suppress=1 on the kernel
> cmdline.  I'll double check if it was set during my testing.
>
> >
> >>> Another, any network connections will time
> >>> out if you are in KDB more than say 20 or 30 seconds.)
> >
> > Ah, you are looking for RCU to refrain from complaining about grace
> > periods that have been delayed by breakpoints in the kernel?  Is there
> > some way that RCU can learn that a breakpoint has happened?  If so,
> > this should not be hard.
>
> Yes, exactly.  After a UV NMI event which might or might not call KDB,
> but definitely can consume some time with the system stopped, I have
> these notifications:
>
> static void uv_nmi_touch_watchdogs(void)
> {
> 	touch_softlockup_watchdog_sync();
> 	clocksource_touch_watchdog();
> 	rcu_cpu_stall_reset();

This function effectively disables RCU CPU stall warnings for the current
set of grace periods.  Or is supposed to do so, anyway.  I won't guarantee
that this avoids false positives in the face of all possible races
between grace-period initialization, calls to rcu_cpu_stall_reset(),
and stall warnings.

So how often are you seeing RCU CPU stall warnings?

> 	touch_nmi_watchdog();
> }
>
>
> In all the cases I checked, I had all the cpus in the NMI event so
> I don't think it was a straggler who triggered the problem.  One
> question though, the above is called by all cpus exiting the NMI
> event.  Should I limit that to only one cpu?

You should only need to invoke rcu_cpu_stall_reset() from a single CPU.
That said, I would not expect problems from concurrent invocations,
unless your compiler stores to a long with a pair of smaller stores
or something.
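
One way to limit the reset to a single CPU can be sketched in userspace C11 as follows. This is an illustration under stated assumptions, not the UV NMI code: fake_rcu_cpu_stall_reset() stands in for the kernel's rcu_cpu_stall_reset(), and nmi_enter_sim(), nmi_exit_sim() and the two counters are hypothetical names. Every CPU decrements a shared counter on the way out, and only the one that sees the count drop from 1 to 0 does the global reset.

```c
#include <stdatomic.h>

/* Stand-in for the kernel's rcu_cpu_stall_reset(); it only counts calls. */
static atomic_int reset_calls;

static void fake_rcu_cpu_stall_reset(void)
{
	atomic_fetch_add(&reset_calls, 1);
}

/* Number of CPUs currently inside the simulated NMI rendezvous. */
static atomic_int cpus_in_nmi;

/* Each CPU announces itself when it enters the rendezvous. */
static void nmi_enter_sim(void)
{
	atomic_fetch_add(&cpus_in_nmi, 1);
}

/*
 * Called by every CPU on the way out.  atomic_fetch_sub() returns the
 * value held before the subtraction, so exactly one caller observes 1:
 * the last CPU to leave.  Only that CPU performs the global reset.
 * Returns 1 if this caller did the reset, 0 otherwise.
 */
static int nmi_exit_sim(void)
{
	if (atomic_fetch_sub(&cpus_in_nmi, 1) == 1) {
		fake_rcu_cpu_stall_reset();
		return 1;
	}
	return 0;
}
```

Choosing the last CPU out, rather than the first, means the reset happens as late as possible. The sketch covers only the RCU reset; whether the other watchdog touches need to run on every CPU is a separate question that it does not address.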

> Note btw, that this also happens when KGDB/KDB is entered via the
> sysrq-trigger 'g' event.
>
> Perhaps there is some other timer that is going off?

Is uv_nmi_touch_watchdogs() invoked on the way in to the breakpoint?
On the way out?  Both?  Either way, what software environment does it
run in?  The only environment completely safe against races on the way
in would be stop_machine() -- otherwise, a grace period might start just
after uv_nmi_touch_watchdogs() returned, which would cause a normal RCU
CPU stall timeout to be in effect.
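
The race described above can be illustrated with a toy model. It assumes, as a simplification that should be checked against the actual kernel source, that a stall warning is due once the tick counter passes a per-grace-period deadline, that starting a grace period arms a normal timeout, and that the reset pushes the deadline about half the counter range into the future. The struct, the function names and the STALL_TIMEOUT value are all hypothetical.

```c
#include <limits.h>

/* Toy model of the stall-warning deadline; not the kernel's structures. */
struct gp_model {
	unsigned long jiffies;       /* simulated tick counter */
	unsigned long jiffies_stall; /* warn once jiffies passes this */
};

#define STALL_TIMEOUT 21000UL /* illustrative timeout, in ticks */

/* Wraparound-safe "a is after b", in the style of time_after(). */
static int after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* A new grace period arms a normal timeout. */
static void gp_start(struct gp_model *m)
{
	m->jiffies_stall = m->jiffies + STALL_TIMEOUT;
}

/* Model of the reset: move the deadline as far away as possible. */
static void stall_reset(struct gp_model *m)
{
	m->jiffies_stall = m->jiffies + ULONG_MAX / 2;
}

/* Would the stall detector complain right now? */
static int stall_would_warn(const struct gp_model *m)
{
	return after(m->jiffies, m->jiffies_stall);
}
```

In this model, calling gp_start(), then stall_reset(), then advancing jiffies by 100000 ticks produces no warning. Calling stall_reset() followed by gp_start() and the same long stop does produce one, because the new grace period has overwritten the far-future deadline with a normal timeout.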

> > If not, I must fall back on the rcu_cpu_stall_suppress that I mentioned
> > earlier.
> >
> >>> One other problem is with the perf tool.  It seems running more than
> >>> about 2 or 3 perf top instances on a medium (1k cpu threads) sized
> >>> system, they start behaving badly with a bunch of NMI stackdumps
> >>> appearing on the console.  Eventually the system become unusable.
> >>
> >> Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/
> >
> > Indeed, with that definition of medium, large must be truly impressive!
>
> I say medium because it's only one rack w/~4TB of memory (and quite
> popular).  Large would be 4k cpus/64TB.  Not sure yet what is huge,
> at least in terms of an SSI system.

Well, when I tell people that someone reported a bug running on a
4K-CPU system, they look at me funny.  ;-)

  

Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-10 Thread Peter Zijlstra
On Mon, Sep 09, 2013 at 10:07:03AM -0700, Mike Travis wrote:
> 
> 
> On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
> >> For performance reasons, the NMI handler may be disabled to lessen the
> >> performance impact caused by the multiple perf tools running concurently.
> >> If the system nmi command is issued when the UV NMI handler is disabled,
> >> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
> >> disabled by setting the nmi disabled variable to '1'.  Setting it back to
> >> '0' will re-enable the NMI handler.
> > 
> > I'm not entirely sure why this is still needed now that you've moved all
> > really expensive bits into the UNKNOWN handler.
> > 
> 
> Yes, it could be considered optional.  My primary use was to isolate
> new bugs I found to see if my NMI changes were causing them.  But it
> appears that they are not since the problems occur with or without
> using the NMI entry into KDB.  So it can be safely removed.

OK, as a debug option it might make sense, but removing it is (of course)
fine with me ;-)

> (The basic problem is that if you hang out in KDB too long the machine
> locks up.  

Yeah, known issue. Not much you can do about it either I suspect. The
system generally isn't build for things like that.

> Other problems like the rcu stall detector does not have a
> means to be "touched" like the nmi_watchdog_timer so it fires off a
> few to many, many messages.  

That however might be easily cured if you ask Paul nicely ;-)

> Another, any network connections will time
> out if you are in KDB more than say 20 or 30 seconds.)
> 
> One other problem is with the perf tool.  It seems running more than
> about 2 or 3 perf top instances on a medium (1k cpu threads) sized
> system, they start behaving badly with a bunch of NMI stackdumps
> appearing on the console.  Eventually the system become unusable.

Yuck.. I haven't seen anything like that on the 'tiny' systems I have :/

> On a large system (4k), the perf tools get an error message (sorry
> don't have it handy at the moment) the basically implies that the
> perf config option is not set.  Again, I wanted to remove the new
> NMI handler to insure that it wasn't doing something weird, and
> it wasn't.

Cute.. 


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-09 Thread Mike Travis


On 9/9/2013 5:43 AM, Peter Zijlstra wrote:
> On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
>> For performance reasons, the NMI handler may be disabled to lessen the
>> performance impact caused by the multiple perf tools running concurently.
>> If the system nmi command is issued when the UV NMI handler is disabled,
>> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
>> disabled by setting the nmi disabled variable to '1'.  Setting it back to
>> '0' will re-enable the NMI handler.
> 
> I'm not entirely sure why this is still needed now that you've moved all
> really expensive bits into the UNKNOWN handler.
> 

Yes, it could be considered optional.  My primary use was to isolate
new bugs I found to see if my NMI changes were causing them.  But it
appears that they are not since the problems occur with or without
using the NMI entry into KDB.  So it can be safely removed.

(The basic problem is that if you hang out in KDB too long the machine
locks up.  Another problem is that the rcu stall detector does not have
a means to be "touched" like the nmi_watchdog_timer, so it fires off
anywhere from a few to many, many messages.  Also, any network
connections will time out if you are in KDB more than, say, 20 or 30
seconds.)

One other problem is with the perf tool.  It seems that when running more
than about 2 or 3 perf top instances on a medium-sized (1k cpu threads)
system, they start behaving badly, with a bunch of NMI stackdumps
appearing on the console.  Eventually the system becomes unusable.

On a large system (4k), the perf tools get an error message (sorry,
don't have it handy at the moment) that basically implies that the
perf config option is not set.  Again, I wanted to remove the new
NMI handler to ensure that it wasn't doing something weird, and
it wasn't.

Thanks,
Mike


Re: [PATCH 9/9] x86/UV: Add ability to disable UV NMI handler

2013-09-09 Thread Peter Zijlstra
On Thu, Sep 05, 2013 at 05:50:41PM -0500, Mike Travis wrote:
> For performance reasons, the NMI handler may be disabled to lessen the
> performance impact caused by the multiple perf tools running concurently.
> If the system nmi command is issued when the UV NMI handler is disabled,
> the "Dazed and Confused" messages occur for all cpus.  The NMI handler is
> disabled by setting the nmi disabled variable to '1'.  Setting it back to
> '0' will re-enable the NMI handler.

I'm not entirely sure why this is still needed now that you've moved all
really expensive bits into the UNKNOWN handler.

