On Sun, 11 Mar 2001, Keith Owens wrote: > Works for me. It even makes kdb simpler. agreed. Also, touch_nmi_watchdog() is stateless and is thus much less prone to locking bugs. i've attached nmi-watchdog-2.4.2-A1 that implements touch_nmi_watchdog(), ontop of 2.4.2-ac18, and changes show_state() to use this interface. (the patch also takes the NMI counters out of the obscure in-function place they used to be.) the patch compiles & boots just fine on 2.4.2-ac18 in both SMP and APIC-less-UP mode. The NMI watchdog is functional, and SysRq-T does not cause a lockup if used with a slow serial console that takes more than 5 seconds to output the tasklist. Ingo
--- linux/include/linux/irq.h.orig Sun Mar 11 11:20:21 2001 +++ linux/include/linux/irq.h Sun Mar 11 11:22:27 2001 @@ -57,18 +57,16 @@ #include <asm/hw_irq.h> /* the arch dependent stuff */ /** - * nmi_watchdog_disable - disable NMI watchdog checking. + * touch_nmi_watchdog - restart NMI watchdog timeout. * - * If the architecture supports the NMI watchdog, nmi_watchdog_disable() may be used - * to temporarily disable it. Use nmi_watchdog_enable() later on. It is implemented - * via an up/down counter, so you must keep the calls balanced. + * If the architecture supports the NMI watchdog, touch_nmi_watchdog() + * may be used to reset the timeout - for code which intentionally + * disables interrupts for a long time. This call is stateless. */ #ifdef ARCH_HAS_NMI_WATCHDOG -extern void nmi_watchdog_disable(void); -extern void nmi_watchdog_enable(void); +extern void touch_nmi_watchdog(void); #else -#define nmi_watchdog_disable() do{} while(0) -#define nmi_watchdog_enable() do{} while(0) +# define touch_nmi_watchdog() do { } while(0) #endif extern int handle_IRQ_event(unsigned int, struct pt_regs *, struct irqaction *); --- linux/drivers/char/sysrq.c.orig Sun Mar 11 11:30:46 2001 +++ linux/drivers/char/sysrq.c Sun Mar 11 11:31:12 2001 @@ -72,9 +72,9 @@ /* * Interrupts are disabled, and serial consoles are slow. So - * Let's suspend the NMI watchdog. + * let's re-start the NMI watchdog timeout. */ - nmi_watchdog_disable(); + touch_nmi_watchdog(); console_loglevel = 7; printk(KERN_INFO "SysRq: "); switch (key) { @@ -158,7 +158,6 @@ /* Don't use 'A' as it's handled specially on the Sparc */ } - nmi_watchdog_enable(); console_loglevel = orig_log_level; } --- linux/arch/i386/kernel/nmi.c.orig Sun Mar 11 11:24:34 2001 +++ linux/arch/i386/kernel/nmi.c Sun Mar 11 11:30:06 2001 @@ -226,37 +226,36 @@ } static spinlock_t nmi_print_lock = SPIN_LOCK_UNLOCKED; -static atomic_t nmi_watchdog_disabled = ATOMIC_INIT(0); -void nmi_watchdog_disable(void) -{ - atomic_inc(&nmi_watchdog_disabled); -} +/* + * the best way to detect wether a CPU has a 'hard lockup' problem + * is to check it's local APIC timer IRQ counts. If they are not + * changing then that CPU has some problem. + * + * as these watchdog NMI IRQs are generated on every CPU, we only + * have to check the current processor. + * + * since NMIs dont listen to _any_ locks, we have to be extremely + * careful not to rely on unsafe variables. The printk might lock + * up though, so we have to break up any console locks first ... + * [when there will be more tty-related locks, break them up + * here too!] + */ + +static unsigned int + last_irq_sums [NR_CPUS], + alert_counter [NR_CPUS]; -void nmi_watchdog_enable(void) +void touch_nmi_watchdog (void) { - atomic_dec(&nmi_watchdog_disabled); + /* + * Just reset the alert counter: + */ + alert_counter[smp_processor_id()] = 0; } void nmi_watchdog_tick (struct pt_regs * regs) { - /* - * the best way to detect wether a CPU has a 'hard lockup' problem - * is to check it's local APIC timer IRQ counts. If they are not - * changing then that CPU has some problem. - * - * as these watchdog NMI IRQs are broadcasted to every CPU, here - * we only have to check the current processor. - * - * since NMIs dont listen to _any_ locks, we have to be extremely - * careful not to rely on unsafe variables. The printk might lock - * up though, so we have to break up any console locks first ... - * [when there will be more tty-related locks, break them up - * here too!] - */ - - static unsigned int last_irq_sums [NR_CPUS], - alert_counter [NR_CPUS]; /* * Since current-> is always on the stack, and we always switch @@ -266,7 +265,7 @@ sum = apic_timer_irqs[cpu]; - if (last_irq_sums[cpu] == sum && atomic_read(&nmi_watchdog_disabled) == 0) { + if (last_irq_sums[cpu] == sum) { /* * Ayiee, looks like this CPU is stuck ... * wait a few IRQs (5 seconds) before doing the oops ...