Re: [patch] serial console vs NMI watchdog

2001-03-12 Thread Keith Owens

On Mon, 12 Mar 2001 00:27:14 -0800, 
george anzinger <[EMAIL PROTECTED]> wrote:
>Keith Owens wrote:
>> kdb uses NMI IPI to get the other cpu's attention.  One cpu is in
>> control and may or may not be accepting NMI, it depends on the event
>> that entered kdb.  The other cpus end up in kdb code, spinning waiting
>> for a cpu switch.  Initially they are not receiving NMI because they
>> were invoked via NMI which is masked until they exit.
>
>Are you actually twiddling the hardware, or just changing what happens
>on NMI?

No hardware twiddling.  One cpu gets an event which triggers kdb, that
event may or may not be via NMI.  kdb on ix86 then uses NMI to get the
attention of the other cpus, even if they are in a disabled spinloop.
ix86 hardware will not deliver another NMI to a cpu until the cpu
issues iret to return from the NMI handler.

Initially all but one cpu is forced into kdb via the NMI handler so no
more NMI events will occur on those cpus.  The first cpu may or may not
be receiving NMI, it depends on how kdb was invoked on the first cpu.
To do single stepping of code, kdb allows one cpu out of kdb state so
it can execute one instruction at the point it was interrupted.  If the
cpu was entered via NMI then that means exiting from the NMI handler
back to the original code, do single step then back into kdb again.

Having exited the NMI handler, that cpu will start receiving NMI events
again, even though it is still under kdb control.  So some cpus will
get NMI, some will not, depending on user actions.  kdb uses a software
mechanism to selectively disable the NMI watchdog.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-12 Thread george anzinger

Keith Owens wrote:
> 
> On Sun, 11 Mar 2001 20:43:16 -0800,
> george anzinger <[EMAIL PROTECTED]> wrote:
> >Consider this.  Why not use the NMI to sync the cpus.  Kdb would have a
> >function that is called each NMI.
> 
> kdb uses NMI IPI to get the other cpu's attention.  One cpu is in
> control and may or may not be accepting NMI, it depends on the event
> that entered kdb.  The other cpus end up in kdb code, spinning waiting
> for a cpu switch.  Initially they are not receiving NMI because they
> were invoked via NMI which is masked until they exit.  However if the
> user does a cpu switch then single steps the interrupted code, the cpu
> has to return from the NMI handler to the interrupted code at which
> time this cpu starts receiving NMI again.

Are you actually twiddling the hardware, or just changing what happens
on NMI?
> 
> The kdb context can change from ignoring NMI to accepting NMI.  It is
> easier to bring all the cpus into kdb and let the kdb code decide if it
> ignores any NMI that is being received.

Yes. Exactly.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-11 Thread Keith Owens

On Sun, 11 Mar 2001 20:43:16 -0800, 
george anzinger <[EMAIL PROTECTED]> wrote:
>Consider this.  Why not use the NMI to sync the cpus.  Kdb would have a
>function that is called each NMI.

kdb uses NMI IPI to get the other cpu's attention.  One cpu is in
control and may or may not be accepting NMI, it depends on the event
that entered kdb.  The other cpus end up in kdb code, spinning waiting
for a cpu switch.  Initially they are not receiving NMI because they
were invoked via NMI which is masked until they exit.  However if the
user does a cpu switch then single steps the interrupted code, the cpu
has to return from the NMI handler to the interrupted code at which
time this cpu starts receiving NMI again.

The kdb context can change from ignoring NMI to accepting NMI.  It is
easier to bring all the cpus into kdb and let the kdb code decide if it
ignores any NMI that is being received.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-11 Thread george anzinger

Keith Owens wrote:
> 
> On Sun, 11 Mar 2001 08:44:24 +0100 (CET),
> Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >Andrew,
> >
> >your patch looks too complex, and doesnt cover the case of the serial
> >driver deadlocking. Why not add a "touch_nmi_watchdog_counter()" function
> >that just changes last_irq_sums instead of adding locking? This way
> >deadlocks will be caught in the serial code too. (because touch_nmi() will
> >only "postpone" the NMI watchdog lockup event, not disable it.)
> 
> kdb has to completely disable the nmi counter while it is in control.
> All interrupts are disabled, all but one cpus are spinning, the control
> cpu does busy wait while it polls the input devices.  With that model
> there is no alternative to a complete disable.
> 
Consider this.  Why not use the NMI to sync the cpus.  Kdb would have a
function that is called each NMI.  If it is doing nothing, just return
false, else, if waiting for this cpu, well here it is, put it in spin
AFTER saving where it came from so the operator can figure out what it
is doing.  In kgdb I just put the interrupt registers in the task_struct
where they are put when a context switch is done.  Then the debugger can
do a trace, etc. on that task.  A global var that the debugger can see
is also set to the cpus, "current".  

If the cpu is already spinning, return to the nmi code with a true flag
which will cause it to ignore the nmi.  Same thing if it is the cpu that
is doing debug i/o.

I went to this for kgdb after the system failed to return from the call
to force the other cpus to execute a function (which means they have to
be alive).  For extra safety I also time the sync.  If one or more
expected cpus, don't show while looping reading the cycle counter, the
code just continues with out the sync.

George
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-11 Thread Andrew Morton

Keith Owens wrote:
> 
> On Sun, 11 Mar 2001 08:53:40 +0100 (CET),
> Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >it sure has an alternative. The 'cpus spinning' code calls touch_nmi()
> >within the busy loop, the polling code on the control CPU too. This is
> >sure more robust and catches lockup bugs in kdb too ...
> 
> Works for me.  It even makes kdb simpler.

humm...

OK, this seems doable in the case of serial console.
For CONFIG_LP_CONSOLE (which has the same problem) it
looks like we can just call touch_nmi() in lp_console_write().

I'll see what Tim has to say.

-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-10 Thread Ingo Molnar


On Sun, 11 Mar 2001, Keith Owens wrote:

> kdb has to completely disable the nmi counter while it is in control.
> All interrupts are disabled, all but one cpus are spinning, the
> control cpu does busy wait while it polls the input devices.  With
> that model there is no alternative to a complete disable.

it sure has an alternative. The 'cpus spinning' code calls touch_nmi()
within the busy loop, the polling code on the control CPU too. This is
sure more robust and catches lockup bugs in kdb too ...

Ingo


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-10 Thread Keith Owens

On Sun, 11 Mar 2001 08:44:24 +0100 (CET), 
Ingo Molnar <[EMAIL PROTECTED]> wrote:
>Andrew,
>
>your patch looks too complex, and doesnt cover the case of the serial
>driver deadlocking. Why not add a "touch_nmi_watchdog_counter()" function
>that just changes last_irq_sums instead of adding locking? This way
>deadlocks will be caught in the serial code too. (because touch_nmi() will
>only "postpone" the NMI watchdog lockup event, not disable it.)

kdb has to completely disable the nmi counter while it is in control.
All interrupts are disabled, all but one cpus are spinning, the control
cpu does busy wait while it polls the input devices.  With that model
there is no alternative to a complete disable.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-10 Thread Keith Owens

On Sun, 11 Mar 2001 08:53:40 +0100 (CET), 
Ingo Molnar <[EMAIL PROTECTED]> wrote:
>it sure has an alternative. The 'cpus spinning' code calls touch_nmi()
>within the busy loop, the polling code on the control CPU too. This is
>sure more robust and catches lockup bugs in kdb too ...

Works for me.  It even makes kdb simpler.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-10 Thread Ingo Molnar


Andrew,

your patch looks too complex, and doesnt cover the case of the serial
driver deadlocking. Why not add a "touch_nmi_watchdog_counter()" function
that just changes last_irq_sums instead of adding locking? This way
deadlocks will be caught in the serial code too. (because touch_nmi() will
only "postpone" the NMI watchdog lockup event, not disable it.) This
should be a matter of 5 lines ...

Ingo


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-09 Thread Andrew Morton

Ion Badulescu wrote:
> 
> On Sat, 10 Mar 2001 01:21:25 +1100, Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > +/**
> > + * enable_nmi_watchdog - enables/disables NMI watchdog checking.
> > + * @yes: If zero, disable
> 
> Ugh. I have a feeling that your chances to get Linus to accept this are
> extremely slim.
> 
> Just have two functions, enable_nmi_watchdog and disable_nmi_watchdog.
> You can make them inline, or even macros...

You're right.



--- linux-2.4.2-ac16/include/linux/irq.hFri Mar  9 17:11:17 2001
+++ linux-ac/include/linux/irq.hSat Mar 10 13:34:22 2001
@@ -56,6 +56,21 @@
 
 #include  /* the arch dependent stuff */
 
+/**
+ * nmi_watchdog_disable - disable NMI watchdog checking.
+ * 
+ * If the architecture supports the NMI watchdog, nmi_watchdog_disable() may be used
+ * to temporarily disable it.  Use nmi_watchdog_enable() later on.  It is implemented
+ * via an up/down counter, so you must keep the calls balanced.
+ */
+#ifdef ARCH_HAS_NMI_WATCHDOG
+extern void nmi_watchdog_disable(void);
+extern void nmi_watchdog_enable(void);
+#else
+#define nmi_watchdog_disable() do{} while(0)
+#define nmi_watchdog_enable() do{} while(0)
+#endif
+
 extern int handle_IRQ_event(unsigned int, struct pt_regs *, struct irqaction *);
 extern int setup_irq(unsigned int , struct irqaction * );
 
--- linux-2.4.2-ac16/include/asm-i386/irq.h Fri Oct  8 03:17:09 1999
+++ linux-ac/include/asm-i386/irq.h Sat Mar 10 02:17:47 2001
@@ -32,4 +32,8 @@
 extern void disable_irq_nosync(unsigned int);
 extern void enable_irq(unsigned int);
 
+#ifdef CONFIG_X86_LOCAL_APIC
+#define ARCH_HAS_NMI_WATCHDOG  /* See include/linux/irq.h */
+#endif
+
 #endif /* _ASM_IRQ_H */
--- linux-2.4.2-ac16/drivers/char/sysrq.c   Sun Feb 25 17:37:04 2001
+++ linux-ac/drivers/char/sysrq.c   Sat Mar 10 13:07:46 2001
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -69,6 +70,11 @@
if (!key)
return;
 
+   /*
+* Interrupts are disabled, and serial consoles are slow. So
+* Let's suspend the NMI watchdog.
+*/
+   nmi_watchdog_disable();
console_loglevel = 7;
printk(KERN_INFO "SysRq: ");
switch (key) {
@@ -152,6 +158,7 @@
/* Don't use 'A' as it's handled specially on the Sparc */
}
 
+   nmi_watchdog_enable();
console_loglevel = orig_log_level;
 }
 
--- linux-2.4.2-ac16/arch/i386/kernel/nmi.c Fri Mar  9 17:10:51 2001
+++ linux-ac/arch/i386/kernel/nmi.c Sat Mar 10 13:21:59 2001
@@ -226,6 +226,17 @@
 }
 
 static spinlock_t nmi_print_lock = SPIN_LOCK_UNLOCKED;
+static atomic_t nmi_watchdog_disabled = ATOMIC_INIT(0);
+
+void nmi_watchdog_disable(void)
+{
+   atomic_inc(&nmi_watchdog_disabled);
+}
+
+void nmi_watchdog_enable(void)
+{
+   atomic_dec(&nmi_watchdog_disabled);
+}
 
 void nmi_watchdog_tick (struct pt_regs * regs)
 {
@@ -255,7 +266,7 @@
 
sum = apic_timer_irqs[cpu];
 
-   if (last_irq_sums[cpu] == sum) {
+   if (last_irq_sums[cpu] == sum && atomic_read(&nmi_watchdog_disabled) == 0) {
/*
 * Ayiee, looks like this CPU is stuck ...
 * wait a few IRQs (5 seconds) before doing the oops ...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-09 Thread Robert Read

On Sat, Mar 10, 2001 at 01:21:25AM +1100, Andrew Morton wrote:
> +static atomic_t nmi_watchdog_enabled = ATOMIC_INIT(0);   /* 0 == enabled */
> +
> +void enable_nmi_watchdog(int yes)
> +{
> + if (yes)
> + atomic_inc(&nmi_watchdog_enabled);
> + else
> + atomic_dec(&nmi_watchdog_enabled);
> +}
>  
>  void nmi_watchdog_tick (struct pt_regs * regs)
>  {
> @@ -255,7 +264,7 @@
>  
>   sum = apic_timer_irqs[cpu];
>  
> - if (last_irq_sums[cpu] == sum) {
> + if (last_irq_sums[cpu] == sum && atomic_read(&nmi_watchdog_enabled) == 0) {

Shouldn't that be atomic_read(&nmi_watchdog_enabled) != 0?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] serial console vs NMI watchdog

2001-03-09 Thread Ion Badulescu

On Sat, 10 Mar 2001 01:21:25 +1100, Andrew Morton <[EMAIL PROTECTED]> wrote:

> +/**
> + * enable_nmi_watchdog - enables/disables NMI watchdog checking.
> + * @yes: If zero, disable

Ugh. I have a feeling that your chances to get Linus to accept this are
extremely slim.

Just have two functions, enable_nmi_watchdog and disable_nmi_watchdog.
You can make them inline, or even macros...

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[patch] serial console vs NMI watchdog

2001-03-09 Thread Andrew Morton

SYSRQ-T on serial console can crash the machine.  This
is because a large amount of output is sent to a slow
device while interrupts are disabled.  The NMI
watchdog triggers.

The interrupt disabling happens in pc_keyb.c:keyboard_interrupt().
Changing this code to *not* disable interrupts looks complex.

I see two ways of fixing this.  One is to do the sysrq
stuff outside the spin_lock_irq(), with:

static void keyboard_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
+   extern void (*sysrq_handler)(void);
+   void (*my_sysrq_handler)(void);

spin_lock_irq(&kbd_controller_lock);
handle_kbd_event();
+   my_sysrq_handler = sysrq_handler;
+   sysrq_handler = 0;
spin_unlock_irq(&kbd_controller_lock);
+   if (my_sysrq_handler)
+   (*my_sysrq_handler)();
}

But I didn't do that, because I suspect there are other
places in the kernel (development and debug stuff) where
we want to turn the NMI watchdog handler off for a while.

So this patch creates a new API function

enable_nmi_watchdog(int yes);

and uses it within the sysrq code.

BTW: NMI watchdog is now disabled by default in 2.4.3-pre3.
The `nmi_watchdog=1' boot option is needed to enable it.



--- linux-2.4.2-ac16/include/linux/irq.hFri Mar  9 17:11:17 2001
+++ linux-ac/include/linux/irq.hSat Mar 10 01:02:12 2001
@@ -56,6 +56,20 @@
 
 #include  /* the arch dependent stuff */
 
+/**
+ * enable_nmi_watchdog - enables/disables NMI watchdog checking.
+ * @yes: If zero, disable
+ * 
+ * If the architecture supports the NMI watchdog, enable_nmi_watchdog() may be used
+ * to temporarily disable it.  Calls to enable_nmi_watchdog() may be nested - it is
+ * implemented as an up/down counter, so the calls must be balanced.
+ */
+#ifdef ARCH_HAS_NMI_WATCHDOG
+extern void enable_nmi_watchdog(int yes);
+#else
+#define enable_nmi_watchdog(yes) do{} while(0)
+#endif
+
 extern int handle_IRQ_event(unsigned int, struct pt_regs *, struct irqaction *);
 extern int setup_irq(unsigned int , struct irqaction * );
 
--- linux-2.4.2-ac16/include/asm-i386/irq.h Fri Oct  8 03:17:09 1999
+++ linux-ac/include/asm-i386/irq.h Fri Mar  9 22:59:15 2001
@@ -32,4 +32,8 @@
 extern void disable_irq_nosync(unsigned int);
 extern void enable_irq(unsigned int);
 
+#ifdef CONFIG_X86_LOCAL_APIC
+#define ARCH_HAS_NMI_WATCHDOG  /* See include/linux/irq.h */
+#endif
+
 #endif /* _ASM_IRQ_H */
--- linux-2.4.2-ac16/drivers/char/sysrq.c   Sun Feb 25 17:37:04 2001
+++ linux-ac/drivers/char/sysrq.c   Fri Mar  9 23:00:39 2001
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -69,6 +70,11 @@
if (!key)
return;
 
+   /*
+* Interrupts are disabled, and serial consoles are slow. So
+* Let's suspend the NMI watchdog.
+*/
+   enable_nmi_watchdog(0);
console_loglevel = 7;
printk(KERN_INFO "SysRq: ");
switch (key) {
@@ -152,6 +158,7 @@
/* Don't use 'A' as it's handled specially on the Sparc */
}
 
+   enable_nmi_watchdog(1);
console_loglevel = orig_log_level;
 }
 
--- linux-2.4.2-ac16/arch/i386/kernel/nmi.c Fri Mar  9 17:10:51 2001
+++ linux-ac/arch/i386/kernel/nmi.c Sat Mar 10 01:10:50 2001
@@ -226,6 +226,15 @@
 }
 
 static spinlock_t nmi_print_lock = SPIN_LOCK_UNLOCKED;
+static atomic_t nmi_watchdog_enabled = ATOMIC_INIT(0); /* 0 == enabled */
+
+void enable_nmi_watchdog(int yes)
+{
+   if (yes)
+   atomic_inc(&nmi_watchdog_enabled);
+   else
+   atomic_dec(&nmi_watchdog_enabled);
+}
 
 void nmi_watchdog_tick (struct pt_regs * regs)
 {
@@ -255,7 +264,7 @@
 
sum = apic_timer_irqs[cpu];
 
-   if (last_irq_sums[cpu] == sum) {
+   if (last_irq_sums[cpu] == sum && atomic_read(&nmi_watchdog_enabled) == 0) {
/*
 * Ayiee, looks like this CPU is stuck ...
 * wait a few IRQs (5 seconds) before doing the oops ...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/