Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0
On Sat, Sep 22, 2007 at 10:35:45PM -0700, Andrew Morton wrote: > On Sun, 23 Sep 2007 13:30:40 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote: > > > > That's interensting. serial_in(). We have had NMI watchdog expiries when > > > the kernel is printing a large amount of stuff out a slow serial port with > > > interrutps disabled. But I thought we'd pretty much plugged those > > > problems > > > by sprinkling touch_nmi_watchdog() in various places. > > > > > > Do you think this is what was happening on your system? > > > > Very likely. I'm running linux with cmdline > > "root=/dev/sda1 ro nmi_watchdog=1 console=ttyS0,115200 console=tty0", > > and doing a lot of printks at the time ;-) > > OK. We need to find a suitable place to poke yet another > touch_nmi_watchdog(). Maybe we should give up and put one in > printk(). > > And you oopsed for different reasons in the nmi-watchdog handling > code too. I think I'll pretend I didn't see that. Let's forget it for now. I can try Ingo's latency tracing patches at some convenient time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0
On Sun, 23 Sep 2007 13:30:40 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote: > > That's interensting. serial_in(). We have had NMI watchdog expiries when > > the kernel is printing a large amount of stuff out a slow serial port with > > interrutps disabled. But I thought we'd pretty much plugged those problems > > by sprinkling touch_nmi_watchdog() in various places. > > > > Do you think this is what was happening on your system? > > Very likely. I'm running linux with cmdline > "root=/dev/sda1 ro nmi_watchdog=1 console=ttyS0,115200 console=tty0", > and doing a lot of printks at the time ;-) OK. We need to find a suitable place to poke yet another touch_nmi_watchdog(). Maybe we should give up and put one in printk(). And you oopsed for different reasons in the nmi-watchdog handling code too. I think I'll pretend I didn't see that. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0
On Sat, Sep 22, 2007 at 09:22:36PM -0700, Andrew Morton wrote: > On Sun, 23 Sep 2007 09:42:14 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote: > > > On Tue, Sep 18, 2007 at 01:18:41AM -0700, Andrew Morton wrote: > > > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc6/2.6.23-rc6-mm1/ > > > > > > 2.6.23-rc6-mm1 is a 29MB diff against 2.6.23-rc6. > > > > > > This bug appears in 2.6.23-rc3-mm1, too. > > hm, there isn't much info here. > > > The message: > > > > [ 3267.844826] NMI Watchdog detected LOCKUP on CPU 0 > > [ 3267.849515] CPU 0 > > [ 3267.851525] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle > > iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp > > nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink > > fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm > > snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore > > snd_page_alloc thermal sr_mod pcspkr evdev button processor cdrom > > [ 3267.889547] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4 > > [ 3267.895442] RIP: 0033:[<2ab84e34cd44>] [<2ab84e34cd44>] > > [ 3267.901438] RSP: 002b:7fff5c9e03f8 EFLAGS: 0287 > > [ 3267.906726] RAX: RBX: 7fff5c9e0580 RCX: > > > > [ 3267.913833] RDX: 0013 RSI: 7fff5c9e0680 RDI: > > 012a7010 > > [ 3267.920939] RBP: 7fff5c9e0550 R08: 0050 R09: > > > > [ 3267.928045] R10: R11: 012a7410 R12: > > 0002 > > [ 3267.935151] R13: 0003 R14: 0005 R15: > > 001f > > [ 3267.942258] FS: 2ab84f144170() GS:814f3000() > > knlGS: > > [ 3267.950317] CS: 0010 DS: ES: CR0: 80050033 > > [ 3267.956038] CR2: 2ab84e3a7430 CR3: 0d618000 CR4: > > 06e0 > > [ 3267.963144] DR0: DR1: DR2: > > > > [ 3267.970250] DR3: DR6: 0ff0 DR7: > > 0400 > > [ 3267.977357] Process gcc (pid: 13507, threadinfo 81000ebe6000, task > > 810008b849d0) > > [ 3267.985416] > > [ 3267.997480] Unable to handle kernel paging request at fffe > > RIP: > > [ 3268.002082] [] > > [ 3268.007827] PGD ea85067 PUD 0 > > Looks like it oopsed in the middle of handling an NMI watchdog expiry, > perhaps. > > > [ 3268.010887] Oops: 0010 [1] SMP > > [ 3268.014035] last sysfs file: > > /devices/pci:00/:00:1e.0/:05:04.0/resource > > [ 3268.021662] CPU 0 > > [ 3268.023674] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle > > iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp > > nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink > > fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm > > snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore > > snd_page_alloc thermal sr_mod pcspkr evdev button processor cdrom > > [ 3268.061688] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4 > > [ 3268.067584] RIP: 0010:[ ] [ ] > > [ 3268.073578] RSP: :8157ce38 EFLAGS: 00010296 > > [ 3268.078867] RAX: 2710 RBX: 810009787050 RCX: > > 8100036788e0 > > [ 3268.085973] RDX: 018d RSI: 810ba000 RDI: > > 810009787080 > > [ 3268.093080] RBP: R08: R09: > > > > [ 3268.100185] R10: R11: 0001 R12: > > 810008b849d0 > > [ 3268.107293] R13: 810008b850d0 R14: 0001 R15: > > 8157cf58 > > [ 3268.114399] FS: 2ab84f144170() GS:814f3000() > > knlGS: > > [ 3268.122455] CS: 0010 DS: ES: CR0: 8005003b > > [ 3268.128178] CR2: fffe CR3: 06bfd000 CR4: > > 06e0 > > [ 3268.135283] DR0: DR1: DR2: > > > > [ 3268.142388] DR3: DR6: 0ff0 DR7: > > 0400 > > [ 3268.149495] Process gcc (pid: 13507, threadinfo 81000ebe6000, task > > 810008b849d0) > > [ 3268.157552] last branch before last exception/interrupt > > [ 3268.162753] from [] serial_in+0x23/0x80 > > [ 3268.168316] to [] serial_in+0x12/0x80 > > That's interensting. serial_in(). We have had NMI watchdog expiries when > the kernel is printing a large amount of stuff out a slow serial port with > interrutps disabled. But I thought we'd pretty much plugged those problems > by sprinkling touch_nmi_watchdog() in various places. > > Do you think this is what was happening on your system? Very likely. I'm running linux with cmdline "root=/dev/sda1 ro nmi_watchdog=1 console=ttyS0,115200 console=tty0", and doing a lot of printks at the time
Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0
On Sun, 23 Sep 2007 09:42:14 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote: > On Tue, Sep 18, 2007 at 01:18:41AM -0700, Andrew Morton wrote: > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc6/2.6.23-rc6-mm1/ > > > > 2.6.23-rc6-mm1 is a 29MB diff against 2.6.23-rc6. > > > This bug appears in 2.6.23-rc3-mm1, too. hm, there isn't much info here. > The message: > > [ 3267.844826] NMI Watchdog detected LOCKUP on CPU 0 > [ 3267.849515] CPU 0 > [ 3267.851525] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle > iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp > nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink > fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm snd_hda_intel > snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore snd_page_alloc > thermal sr_mod pcspkr evdev button processor cdrom > [ 3267.889547] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4 > [ 3267.895442] RIP: 0033:[<2ab84e34cd44>] [<2ab84e34cd44>] > [ 3267.901438] RSP: 002b:7fff5c9e03f8 EFLAGS: 0287 > [ 3267.906726] RAX: RBX: 7fff5c9e0580 RCX: > > [ 3267.913833] RDX: 0013 RSI: 7fff5c9e0680 RDI: > 012a7010 > [ 3267.920939] RBP: 7fff5c9e0550 R08: 0050 R09: > > [ 3267.928045] R10: R11: 012a7410 R12: > 0002 > [ 3267.935151] R13: 0003 R14: 0005 R15: > 001f > [ 3267.942258] FS: 2ab84f144170() GS:814f3000() > knlGS: > [ 3267.950317] CS: 0010 DS: ES: CR0: 80050033 > [ 3267.956038] CR2: 2ab84e3a7430 CR3: 0d618000 CR4: > 06e0 > [ 3267.963144] DR0: DR1: DR2: > > [ 3267.970250] DR3: DR6: 0ff0 DR7: > 0400 > [ 3267.977357] Process gcc (pid: 13507, threadinfo 81000ebe6000, task > 810008b849d0) > [ 3267.985416] > [ 3267.997480] Unable to handle kernel paging request at fffe > RIP: > [ 3268.002082] [] > [ 3268.007827] PGD ea85067 PUD 0 Looks like it oopsed in the middle of handling an NMI watchdog expiry, perhaps. > [ 3268.010887] Oops: 0010 [1] SMP > [ 3268.014035] last sysfs file: > /devices/pci:00/:00:1e.0/:05:04.0/resource > [ 3268.021662] CPU 0 > [ 3268.023674] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle > iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp > nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink > fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm snd_hda_intel > snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore snd_page_alloc > thermal sr_mod pcspkr evdev button processor cdrom > [ 3268.061688] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4 > [ 3268.067584] RIP: 0010:[ ] [ ] > [ 3268.073578] RSP: :8157ce38 EFLAGS: 00010296 > [ 3268.078867] RAX: 2710 RBX: 810009787050 RCX: > 8100036788e0 > [ 3268.085973] RDX: 018d RSI: 810ba000 RDI: > 810009787080 > [ 3268.093080] RBP: R08: R09: > > [ 3268.100185] R10: R11: 0001 R12: > 810008b849d0 > [ 3268.107293] R13: 810008b850d0 R14: 0001 R15: > 8157cf58 > [ 3268.114399] FS: 2ab84f144170() GS:814f3000() > knlGS: > [ 3268.122455] CS: 0010 DS: ES: CR0: 8005003b > [ 3268.128178] CR2: fffe CR3: 06bfd000 CR4: > 06e0 > [ 3268.135283] DR0: DR1: DR2: > > [ 3268.142388] DR3: DR6: 0ff0 DR7: > 0400 > [ 3268.149495] Process gcc (pid: 13507, threadinfo 81000ebe6000, task > 810008b849d0) > [ 3268.157552] last branch before last exception/interrupt > [ 3268.162753] from [] serial_in+0x23/0x80 > [ 3268.168316] to [] serial_in+0x12/0x80 That's interensting. serial_in(). We have had NMI watchdog expiries when the kernel is printing a large amount of stuff out a slow serial port with interrutps disabled. But I thought we'd pretty much plugged those problems by sprinkling touch_nmi_watchdog() in various places. Do you think this is what was happening on your system? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG 2.6.23-rc6-mm1] NMI Watchdog detected LOCKUP on CPU 0
On Sun, Sep 23, 2007 at 09:42:14AM +0800, Fengguang Wu wrote: > On Tue, Sep 18, 2007 at 01:18:41AM -0700, Andrew Morton wrote: > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc6/2.6.23-rc6-mm1/ > > > > 2.6.23-rc6-mm1 is a 29MB diff against 2.6.23-rc6. > > > This bug appears in 2.6.23-rc3-mm1, too. > > The message: > > [ 3267.844826] NMI Watchdog detected LOCKUP on CPU 0 > [ 3267.849515] CPU 0 > [ 3267.851525] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle > iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp > nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink > fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm snd_hda_intel > snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore snd_page_alloc > thermal sr_mod pcspkr evdev button processor cdrom > [ 3267.889547] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4 > [ 3267.895442] RIP: 0033:[<2ab84e34cd44>] [<2ab84e34cd44>] > [ 3267.901438] RSP: 002b:7fff5c9e03f8 EFLAGS: 0287 > [ 3267.906726] RAX: RBX: 7fff5c9e0580 RCX: > > [ 3267.913833] RDX: 0013 RSI: 7fff5c9e0680 RDI: > 012a7010 > [ 3267.920939] RBP: 7fff5c9e0550 R08: 0050 R09: > > [ 3267.928045] R10: R11: 012a7410 R12: > 0002 > [ 3267.935151] R13: 0003 R14: 0005 R15: > 001f > [ 3267.942258] FS: 2ab84f144170() GS:814f3000() > knlGS: > [ 3267.950317] CS: 0010 DS: ES: CR0: 80050033 > [ 3267.956038] CR2: 2ab84e3a7430 CR3: 0d618000 CR4: > 06e0 > [ 3267.963144] DR0: DR1: DR2: > > [ 3267.970250] DR3: DR6: 0ff0 DR7: > 0400 > [ 3267.977357] Process gcc (pid: 13507, threadinfo 81000ebe6000, task > 810008b849d0) > [ 3267.985416] > [ 3267.997480] Unable to handle kernel paging request at fffe > RIP: > [ 3268.002082] [] > [ 3268.007827] PGD ea85067 PUD 0 > [ 3268.010887] Oops: 0010 [1] SMP > [ 3268.014035] last sysfs file: > /devices/pci:00/:00:1e.0/:05:04.0/resource > [ 3268.021662] CPU 0 > [ 3268.023674] Modules linked in: binfmt_misc ipt_MASQUERADE iptable_mangle > iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables nf_nat_tftp > nf_nat_ftp nf_nat nf_conntrack_tftp nf_conntrack_ftp nf_conntrack nfnetlink > fan ac battery ipv6 eeprom lm85 hwmon_vid i2c_core tun fuse kvm snd_hda_intel > snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd sg soundcore snd_page_alloc > thermal sr_mod pcspkr evdev button processor cdrom > [ 3268.061688] Pid: 13507, comm: gcc Not tainted 2.6.23-rc6-mm1 #4 > [ 3268.067584] RIP: 0010:[ ] [ ] > [ 3268.073578] RSP: :8157ce38 EFLAGS: 00010296 > [ 3268.078867] RAX: 2710 RBX: 810009787050 RCX: > 8100036788e0 > [ 3268.085973] RDX: 018d RSI: 810ba000 RDI: > 810009787080 > [ 3268.093080] RBP: R08: R09: > > [ 3268.100185] R10: R11: 0001 R12: > 810008b849d0 > [ 3268.107293] R13: 810008b850d0 R14: 0001 R15: > 8157cf58 > [ 3268.114399] FS: 2ab84f144170() GS:814f3000() > knlGS: > [ 3268.122455] CS: 0010 DS: ES: CR0: 8005003b > [ 3268.128178] CR2: fffe CR3: 06bfd000 CR4: > 06e0 > [ 3268.135283] DR0: DR1: DR2: > > [ 3268.142388] DR3: DR6: 0ff0 DR7: > 0400 > [ 3268.149495] Process gcc (pid: 13507, threadinfo 81000ebe6000, task > 810008b849d0) > [ 3268.157552] last branch before last exception/interrupt > [ 3268.162753] from [] serial_in+0x23/0x80 > [ 3268.168316] to [] serial_in+0x12/0x80 > [ 3268.173701] Stack: 8157ce78 812e214f 81000ebe7fd8 > > [ 3268.181728] 0ebe7f2d 003d > 8157cf58 > [ 3268.189133] 8157ce88 812e218d 8157ce98 > 812e21a1 > [ 3268.196358] Call Trace: > [ 3268.198974] Inexact backtrace: > [ 3268.202014][] notifier_call_chain+0x3f/0x70 > [ 3268.208531] [] __atomic_notifier_call_chain+0xd/0x10 > [ 3268.215118] [] atomic_notifier_call_chain+0x11/0x20 > [ 3268.221619] [] notify_die+0x2e/0x30 > [ 3268.226736] [] nmi_watchdog_tick+0x4c/0x1e0 > [ 3268.232545] [] default_do_nmi+0x67/0x1e0 > [ 3268.238093] [] do_nmi+0x2f/0x50 > [ 3268.242863] [] nmi+0x7f/0x90 > [ 3268.247377] [] __delay+0xe/0x20 > [ 3268.252147] <> > [ 3268.254416] > [ 3268.254416] Code: Bad RIP value. > [ 3268.259216] RIP [ ] >