Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0

2007-09-22 Thread Fengguang Wu
On Sat, Sep 22, 2007 at 10:35:45PM -0700, Andrew Morton wrote:
> On Sun, 23 Sep 2007 13:30:40 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > > That's interensting.  serial_in().  We have had NMI watchdog expiries when
> > > the kernel is printing a large amount of stuff out a slow serial port with
> > > interrutps disabled.  But I thought we'd pretty much plugged those 
> > > problems
> > > by sprinkling touch_nmi_watchdog() in various places.
> > > 
> > > Do you think this is what was happening on your system?
> > 
> > Very likely. I'm running linux with cmdline
> > "root=/dev/sda1 ro nmi_watchdog=1 console=ttyS0,115200 console=tty0",
> > and doing a lot of printks at the time ;-)
> 
> OK.  We need to find a suitable place to poke yet another
> touch_nmi_watchdog().  Maybe we should give up and put one in
> printk().
> 
> And you oopsed for different reasons in the nmi-watchdog handling
> code too.  I think I'll pretend I didn't see that.

Let's forget it for now.
I can try Ingo's latency tracing patches at some convenient time.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0

2007-09-22 Thread Andrew Morton
On Sun, 23 Sep 2007 13:30:40 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> > That's interensting.  serial_in().  We have had NMI watchdog expiries when
> > the kernel is printing a large amount of stuff out a slow serial port with
> > interrutps disabled.  But I thought we'd pretty much plugged those problems
> > by sprinkling touch_nmi_watchdog() in various places.
> > 
> > Do you think this is what was happening on your system?
> 
> Very likely. I'm running linux with cmdline
> "root=/dev/sda1 ro nmi_watchdog=1 console=ttyS0,115200 console=tty0",
> and doing a lot of printks at the time ;-)

OK.  We need to find a suitable place to poke yet another
touch_nmi_watchdog().  Maybe we should give up and put one in
printk().

And you oopsed for different reasons in the nmi-watchdog handling
code too.  I think I'll pretend I didn't see that.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0

2007-09-22 Thread Fengguang Wu
On Sat, Sep 22, 2007 at 09:22:36PM -0700, Andrew Morton wrote:
> On Sun, 23 Sep 2007 09:42:14 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, Sep 18, 2007 at 01:18:41AM -0700, Andrew Morton wrote:
> > > 
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc6/2.6.23-rc6-mm1/
> > > 
> > > 2.6.23-rc6-mm1 is a 29MB diff against 2.6.23-rc6.
> > 
> > 
> > This bug appears in 2.6.23-rc3-mm1, too.
> 
> hm, there isn't much info here.
> 
> > The message:
> > 
> > [ 3267.844826] NMI Watchdog detected LOCKUP on CPU 0
> > [ 3267.849515] CPU 0 
> > [ 3267.851525] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle 
> > iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp 
> > nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink 
> > fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm 
> > snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore 
> > snd_page_alloc thermal sr_mod pcspkr evdev button processor cdrom
> > [ 3267.889547] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4
> > [ 3267.895442] RIP: 0033:[<2ab84e34cd44>]  [<2ab84e34cd44>]
> > [ 3267.901438] RSP: 002b:7fff5c9e03f8  EFLAGS: 0287
> > [ 3267.906726] RAX:  RBX: 7fff5c9e0580 RCX: 
> > 
> > [ 3267.913833] RDX: 0013 RSI: 7fff5c9e0680 RDI: 
> > 012a7010
> > [ 3267.920939] RBP: 7fff5c9e0550 R08: 0050 R09: 
> > 
> > [ 3267.928045] R10:  R11: 012a7410 R12: 
> > 0002
> > [ 3267.935151] R13: 0003 R14: 0005 R15: 
> > 001f
> > [ 3267.942258] FS:  2ab84f144170() GS:814f3000() 
> > knlGS:
> > [ 3267.950317] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 3267.956038] CR2: 2ab84e3a7430 CR3: 0d618000 CR4: 
> > 06e0
> > [ 3267.963144] DR0:  DR1:  DR2: 
> > 
> > [ 3267.970250] DR3:  DR6: 0ff0 DR7: 
> > 0400
> > [ 3267.977357] Process gcc (pid: 13507, threadinfo 81000ebe6000, task 
> > 810008b849d0)
> > [ 3267.985416] 
> > [ 3267.997480] Unable to handle kernel paging request at fffe 
> > RIP: 
> > [ 3268.002082]  []
> > [ 3268.007827] PGD ea85067 PUD 0 
> 
> Looks like it oopsed in the middle of handling an NMI watchdog expiry,
> perhaps.
> 
> > [ 3268.010887] Oops: 0010 [1] SMP 
> > [ 3268.014035] last sysfs file: 
> > /devices/pci:00/:00:1e.0/:05:04.0/resource
> > [ 3268.021662] CPU 0 
> > [ 3268.023674] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle 
> > iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp 
> > nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink 
> > fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm 
> > snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore 
> > snd_page_alloc thermal sr_mod pcspkr evdev button processor cdrom
> > [ 3268.061688] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4
> > [ 3268.067584] RIP: 0010:[]  []
> > [ 3268.073578] RSP: :8157ce38  EFLAGS: 00010296
> > [ 3268.078867] RAX: 2710 RBX: 810009787050 RCX: 
> > 8100036788e0
> > [ 3268.085973] RDX: 018d RSI: 810ba000 RDI: 
> > 810009787080
> > [ 3268.093080] RBP:  R08:  R09: 
> > 
> > [ 3268.100185] R10:  R11: 0001 R12: 
> > 810008b849d0
> > [ 3268.107293] R13: 810008b850d0 R14: 0001 R15: 
> > 8157cf58
> > [ 3268.114399] FS:  2ab84f144170() GS:814f3000() 
> > knlGS:
> > [ 3268.122455] CS:  0010 DS:  ES:  CR0: 8005003b
> > [ 3268.128178] CR2: fffe CR3: 06bfd000 CR4: 
> > 06e0
> > [ 3268.135283] DR0:  DR1:  DR2: 
> > 
> > [ 3268.142388] DR3:  DR6: 0ff0 DR7: 
> > 0400
> > [ 3268.149495] Process gcc (pid: 13507, threadinfo 81000ebe6000, task 
> > 810008b849d0)
> > [ 3268.157552] last branch before last exception/interrupt
> > [ 3268.162753]  from  [] serial_in+0x23/0x80
> > [ 3268.168316]  to  [] serial_in+0x12/0x80
> 
> That's interensting.  serial_in().  We have had NMI watchdog expiries when
> the kernel is printing a large amount of stuff out a slow serial port with
> interrutps disabled.  But I thought we'd pretty much plugged those problems
> by sprinkling touch_nmi_watchdog() in various places.
> 
> Do you think this is what was happening on your system?

Very likely. I'm running linux with cmdline
"root=/dev/sda1 ro nmi_watchdog=1 console=ttyS0,115200 console=tty0",
and doing a lot of printks at the time 

Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0

2007-09-22 Thread Andrew Morton
On Sun, 23 Sep 2007 09:42:14 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Tue, Sep 18, 2007 at 01:18:41AM -0700, Andrew Morton wrote:
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc6/2.6.23-rc6-mm1/
> > 
> > 2.6.23-rc6-mm1 is a 29MB diff against 2.6.23-rc6.
> 
> 
> This bug appears in 2.6.23-rc3-mm1, too.

hm, there isn't much info here.

> The message:
> 
> [ 3267.844826] NMI Watchdog detected LOCKUP on CPU 0
> [ 3267.849515] CPU 0 
> [ 3267.851525] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle 
> iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp 
> nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink 
> fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm snd_hda_intel 
> snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore snd_page_alloc 
> thermal sr_mod pcspkr evdev button processor cdrom
> [ 3267.889547] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4
> [ 3267.895442] RIP: 0033:[<2ab84e34cd44>]  [<2ab84e34cd44>]
> [ 3267.901438] RSP: 002b:7fff5c9e03f8  EFLAGS: 0287
> [ 3267.906726] RAX:  RBX: 7fff5c9e0580 RCX: 
> 
> [ 3267.913833] RDX: 0013 RSI: 7fff5c9e0680 RDI: 
> 012a7010
> [ 3267.920939] RBP: 7fff5c9e0550 R08: 0050 R09: 
> 
> [ 3267.928045] R10:  R11: 012a7410 R12: 
> 0002
> [ 3267.935151] R13: 0003 R14: 0005 R15: 
> 001f
> [ 3267.942258] FS:  2ab84f144170() GS:814f3000() 
> knlGS:
> [ 3267.950317] CS:  0010 DS:  ES:  CR0: 80050033
> [ 3267.956038] CR2: 2ab84e3a7430 CR3: 0d618000 CR4: 
> 06e0
> [ 3267.963144] DR0:  DR1:  DR2: 
> 
> [ 3267.970250] DR3:  DR6: 0ff0 DR7: 
> 0400
> [ 3267.977357] Process gcc (pid: 13507, threadinfo 81000ebe6000, task 
> 810008b849d0)
> [ 3267.985416] 
> [ 3267.997480] Unable to handle kernel paging request at fffe 
> RIP: 
> [ 3268.002082]  []
> [ 3268.007827] PGD ea85067 PUD 0 

Looks like it oopsed in the middle of handling an NMI watchdog expiry,
perhaps.

> [ 3268.010887] Oops: 0010 [1] SMP 
> [ 3268.014035] last sysfs file: 
> /devices/pci:00/:00:1e.0/:05:04.0/resource
> [ 3268.021662] CPU 0 
> [ 3268.023674] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle 
> iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp 
> nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink 
> fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm snd_hda_intel 
> snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore snd_page_alloc 
> thermal sr_mod pcspkr evdev button processor cdrom
> [ 3268.061688] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4
> [ 3268.067584] RIP: 0010:[]  []
> [ 3268.073578] RSP: :8157ce38  EFLAGS: 00010296
> [ 3268.078867] RAX: 2710 RBX: 810009787050 RCX: 
> 8100036788e0
> [ 3268.085973] RDX: 018d RSI: 810ba000 RDI: 
> 810009787080
> [ 3268.093080] RBP:  R08:  R09: 
> 
> [ 3268.100185] R10:  R11: 0001 R12: 
> 810008b849d0
> [ 3268.107293] R13: 810008b850d0 R14: 0001 R15: 
> 8157cf58
> [ 3268.114399] FS:  2ab84f144170() GS:814f3000() 
> knlGS:
> [ 3268.122455] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 3268.128178] CR2: fffe CR3: 06bfd000 CR4: 
> 06e0
> [ 3268.135283] DR0:  DR1:  DR2: 
> 
> [ 3268.142388] DR3:  DR6: 0ff0 DR7: 
> 0400
> [ 3268.149495] Process gcc (pid: 13507, threadinfo 81000ebe6000, task 
> 810008b849d0)
> [ 3268.157552] last branch before last exception/interrupt
> [ 3268.162753]  from  [] serial_in+0x23/0x80
> [ 3268.168316]  to  [] serial_in+0x12/0x80

That's interensting.  serial_in().  We have had NMI watchdog expiries when
the kernel is printing a large amount of stuff out a slow serial port with
interrutps disabled.  But I thought we'd pretty much plugged those problems
by sprinkling touch_nmi_watchdog() in various places.

Do you think this is what was happening on your system?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0

2007-09-22 Thread Fengguang Wu
On Sun, Sep 23, 2007 at 09:42:14AM +0800, Fengguang Wu wrote:
> On Tue, Sep 18, 2007 at 01:18:41AM -0700, Andrew Morton wrote:
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc6/2.6.23-rc6-mm1/
> > 
> > 2.6.23-rc6-mm1 is a 29MB diff against 2.6.23-rc6.
> 
> 
> This bug appears in 2.6.23-rc3-mm1, too.
> 
> The message:
> 
> [ 3267.844826] NMI Watchdog detected LOCKUP on CPU 0
> [ 3267.849515] CPU 0 
> [ 3267.851525] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle 
> iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp 
> nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink 
> fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm snd_hda_intel 
> snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore snd_page_alloc 
> thermal sr_mod pcspkr evdev button processor cdrom
> [ 3267.889547] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4
> [ 3267.895442] RIP: 0033:[<2ab84e34cd44>]  [<2ab84e34cd44>]
> [ 3267.901438] RSP: 002b:7fff5c9e03f8  EFLAGS: 0287
> [ 3267.906726] RAX:  RBX: 7fff5c9e0580 RCX: 
> 
> [ 3267.913833] RDX: 0013 RSI: 7fff5c9e0680 RDI: 
> 012a7010
> [ 3267.920939] RBP: 7fff5c9e0550 R08: 0050 R09: 
> 
> [ 3267.928045] R10:  R11: 012a7410 R12: 
> 0002
> [ 3267.935151] R13: 0003 R14: 0005 R15: 
> 001f
> [ 3267.942258] FS:  2ab84f144170() GS:814f3000() 
> knlGS:
> [ 3267.950317] CS:  0010 DS:  ES:  CR0: 80050033
> [ 3267.956038] CR2: 2ab84e3a7430 CR3: 0d618000 CR4: 
> 06e0
> [ 3267.963144] DR0:  DR1:  DR2: 
> 
> [ 3267.970250] DR3:  DR6: 0ff0 DR7: 
> 0400
> [ 3267.977357] Process gcc (pid: 13507, threadinfo 81000ebe6000, task 
> 810008b849d0)
> [ 3267.985416] 
> [ 3267.997480] Unable to handle kernel paging request at fffe 
> RIP: 
> [ 3268.002082]  []
> [ 3268.007827] PGD ea85067 PUD 0 
> [ 3268.010887] Oops: 0010 [1] SMP 
> [ 3268.014035] last sysfs file: 
> /devices/pci:00/:00:1e.0/:05:04.0/resource
> [ 3268.021662] CPU 0 
> [ 3268.023674] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle 
> iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp 
> nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink 
> fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm snd_hda_intel 
> snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore snd_page_alloc 
> thermal sr_mod pcspkr evdev button processor cdrom
> [ 3268.061688] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4
> [ 3268.067584] RIP: 0010:[]  []
> [ 3268.073578] RSP: :8157ce38  EFLAGS: 00010296
> [ 3268.078867] RAX: 2710 RBX: 810009787050 RCX: 
> 8100036788e0
> [ 3268.085973] RDX: 018d RSI: 810ba000 RDI: 
> 810009787080
> [ 3268.093080] RBP:  R08:  R09: 
> 
> [ 3268.100185] R10:  R11: 0001 R12: 
> 810008b849d0
> [ 3268.107293] R13: 810008b850d0 R14: 0001 R15: 
> 8157cf58
> [ 3268.114399] FS:  2ab84f144170() GS:814f3000() 
> knlGS:
> [ 3268.122455] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 3268.128178] CR2: fffe CR3: 06bfd000 CR4: 
> 06e0
> [ 3268.135283] DR0:  DR1:  DR2: 
> 
> [ 3268.142388] DR3:  DR6: 0ff0 DR7: 
> 0400
> [ 3268.149495] Process gcc (pid: 13507, threadinfo 81000ebe6000, task 
> 810008b849d0)
> [ 3268.157552] last branch before last exception/interrupt
> [ 3268.162753]  from  [] serial_in+0x23/0x80
> [ 3268.168316]  to  [] serial_in+0x12/0x80
> [ 3268.173701] Stack:  8157ce78 812e214f 81000ebe7fd8 
> 
> [ 3268.181728]   0ebe7f2d 003d 
> 8157cf58
> [ 3268.189133]  8157ce88 812e218d 8157ce98 
> 812e21a1
> [ 3268.196358] Call Trace:
> [ 3268.198974] Inexact backtrace:
> [ 3268.202014][] notifier_call_chain+0x3f/0x70
> [ 3268.208531]  [] __atomic_notifier_call_chain+0xd/0x10
> [ 3268.215118]  [] atomic_notifier_call_chain+0x11/0x20
> [ 3268.221619]  [] notify_die+0x2e/0x30
> [ 3268.226736]  [] nmi_watchdog_tick+0x4c/0x1e0
> [ 3268.232545]  [] default_do_nmi+0x67/0x1e0
> [ 3268.238093]  [] do_nmi+0x2f/0x50
> [ 3268.242863]  [] nmi+0x7f/0x90
> [ 3268.247377]  [] __delay+0xe/0x20
> [ 3268.252147]  <> 
> [ 3268.254416] 
> [ 3268.254416] Code:  Bad RIP value.
> [ 3268.259216] RIP  []
>