Re: [BUG] NMI watchdog lockups caused by mwait_idle
Pallipadi, Venkatesh wrote: > Darrick, > > I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked > fine. No watchdog lockups. > Can you try idle routine with hlt instead of mwait. There is no boot > option for this in x86_64, but you can change > arch/x86_64/kernel/process.c:select_idle_routine() not to enable mwait. > With that default kernel should use hlt based idle. > > Also, worth seeing will be, what happens when nmi_watchdog=0, > nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us > whether nmi_watchdog is raising some false alarm or the CPUs are indeed > getting locked up here.. > Locks up with hlt-based idle too. :( Here's what I get with nmi_watchdog=0: [ 206.088703] BUG: soft lockup detected on CPU#0! [ 206.093284] [ 206.093286] Call Trace: [ 206.097324][] softlockup_tick+0xd4/0xe9 [ 206.103618] [] do_flush_tlb_all+0x0/0x68 [ 206.109238] [] run_local_timers+0x13/0x15 [ 206.114949] [] update_process_times+0x4c/0x78 [ 206.121008] [] smp_local_timer_interrupt+0x34/0x51 [ 206.127498] [] smp_apic_timer_interrupt+0x49/0x60 [ 206.133901] [] apic_timer_interrupt+0x66/0x70 [ 206.139956][] __smp_call_function+0x66/0x87 [ 206.146594] [] __smp_call_function+0x62/0x87 [ 206.152564] [] do_flush_tlb_all+0x0/0x68 [ 206.158188] [] do_flush_tlb_all+0x0/0x68 [ 206.163813] [] smp_call_function+0x32/0x49 [ 206.169611] [] do_flush_tlb_all+0x0/0x68 [ 206.175236] [] on_each_cpu+0x30/0x67 [ 206.180514] [] flush_tlb_all+0x1c/0x1e [ 206.185965] [] unmap_vm_area+0x1c3/0x265 [ 206.191590] [] init_level4_pgt+0xc20/0x1000 [ 206.197474] [] remove_vm_area+0x41/0x67 [ 206.203010] [] iounmap+0x8e/0xc8 [ 206.207933] [] acpi_os_unmap_memory+0x9/0xb [ 206.213810] [] acpi_ev_system_memory_region_setup+0x52/0x105 [ 206.221174] [] acpi_ut_delete_internal_obj+0x2c4/0x3b2 [ 206.228012] [] acpi_ut_update_ref_count+0x180/0x1d2 [ 206.234587] [] acpi_ut_update_object_reference+0x160/0x207 [ 206.241770] [] acpi_ut_remove_reference+0xb5/0xd5 [ 206.248173] [] acpi_ns_detach_object+0xca/0xee [ 206.254318] [] acpi_ns_delete_namespace_by_owner+0xcf/0x154 [ 206.261597] [] acpi_ds_terminate_control_method+0xb5/0x14f [ 206.268779] [] acpi_ps_parse_aml+0x242/0x3a0 [ 206.274750] [] acpi_ps_execute_pass+0xd5/0x10b [ 206.280895] [] acpi_ps_execute_method+0x1bf/0x2cb [ 206.287298] [] acpi_ns_evaluate+0x1f8/0x315 [ 206.293180] [] acpi_evaluate_object+0x1d9/0x2fa [ 206.299411] [] kmem_cache_alloc+0xce/0xda [ 206.305125] [] :processor:acpi_processor_start+0x656/0x6fd [ 206.312307] [] kmem_cache_zalloc+0xce/0xf4 [ 206.318103] [] acpi_start_single_object+0x2a/0x54 [ 206.324509] [] acpi_bus_register_driver+0xcd/0x14c [ 206.331001] [] :processor:acpi_processor_init+0x61/0xb7 [ 206.337923] [] sys_init_module+0xac/0x16c [ 206.343630] [] system_call+0x7e/0x83 nmi_watchdog={1,2} produce the same errors. --D - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [BUG] NMI watchdog lockups caused by mwait_idle
Darrick, I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked fine. No watchdog lockups. Can you try idle routine with hlt instead of mwait. There is no boot option for this in x86_64, but you can change arch/x86_64/kernel/process.c:select_idle_routine() not to enable mwait. With that default kernel should use hlt based idle. Also, worth seeing will be, what happens when nmi_watchdog=0, nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us whether nmi_watchdog is raising some false alarm or the CPUs are indeed getting locked up here.. Thanks, Venki >-Original Message- >From: Darrick J. Wong [mailto:[EMAIL PROTECTED] >Sent: Friday, January 12, 2007 1:01 PM >To: Pallipadi, Venkatesh >Cc: Linux Kernel Mailing List >Subject: [BUG] NMI watchdog lockups caused by mwait_idle > >Hi Venkatesh, > >I have an IBM IntelliStation Z30 with two Dempsey CPUs. When I try to >boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog >lockups. git-bisect determined that the patch "[PATCH] x86-64: Fix >interrupt race in idle callback (3rd try)" was the source of these >problems, and I can work around the problem either by passing >"idle=poll" to get avoid mwait_idle or by reverting the patch. > >Other non-Dempsey Xeon machines with mwait support do not exhibit these >symptoms. I will try to determine if this is a bug specific to Dempsey >CPUs or this particular type of machine. I suspect the latter, but I >don't know enough about monitor/mwait to pursue this much further. > >What else can I do to diagnose this? > >--D > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[BUG] NMI watchdog lockups caused by mwait_idle
Hi Venkatesh, I have an IBM IntelliStation Z30 with two Dempsey CPUs. When I try to boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog lockups. git-bisect determined that the patch "[PATCH] x86-64: Fix interrupt race in idle callback (3rd try)" was the source of these problems, and I can work around the problem either by passing "idle=poll" to get avoid mwait_idle or by reverting the patch. Other non-Dempsey Xeon machines with mwait support do not exhibit these symptoms. I will try to determine if this is a bug specific to Dempsey CPUs or this particular type of machine. I suspect the latter, but I don't know enough about monitor/mwait to pursue this much further. What else can I do to diagnose this? --D - [ 81.794792] Parsing all Control Methods: [ 81.798710] Table [SSDT](id 002F) - 5 Objects with 0 Devices 2 Methods 0 Regions [ 81.806410] ACPI (exconfig-0455): Dynamic SSDT Load - OemId [ PmRef] OemTableId [ Cpu0Ist] [20060707] [ 81.815967] ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3]) [ 81.821837] ACPI: Processor [CPU0] (supports 8 throttling states) [ 81.831290] Parsing all Control Methods: [ 81.835283] Table [SSDT](id 0032) - 3 Objects with 0 Devices 2 Methods 0 Regions [ 81.842988] ACPI (exconfig-0455): Dynamic SSDT Load - OemId [ PmRef] OemTableId [ Cpu1Ist] [20060707] [ 87.276183] NMI Watchdog detected LOCKUP on CPU 3 [ 87.280944] CPU 3 [ 87.283081] Modules linked in: processor fan unix [ 87.288109] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0 [ 87.294253] RIP: 0010:[] [] cpu_idle+0x61/0xc7 [ 87.302039] RSP: 0018:8100059ebed8 EFLAGS: 0086 [ 87.307398] RAX: RBX: 88015daf RCX: 80156564 [ 87.314574] RDX: 8100059ebec8 RSI: 0002 RDI: 0001 [ 87.321760] RBP: 8100059ebee8 R08: 0001 R09: 0001 [ 87.328942] R10: 80160cff R11: 0246 R12: fff7 [ 87.336125] R13: 0040 R14: 0246 R15: [ 87.343309] FS: () GS:8100059970a0() knlGS: [ 87.351453] CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b [ 87.357249] CR2: 00602148 CR3: 00101000 CR4: 06e0 [ 87.364435] Process swapper (pid: 0, threadinfo 8100059ea000, task 8100059a7040) [ 87.372577] Stack: 8100059ebee8 8100052a5600 8100059ebf48 80174dc2 [ 87.380966] 80502e38 80503038 037d [ 87.388684] 06e8 [ 87.396148] Call Trace: [ 87.398896] [] start_secondary+0x46f/0x47e [ 87.404695] [ 87.406245] [ 87.406247] Code: 75 25 e8 dc 06 04 00 0f 09 0f ae f0 65 48 8b 14 25 08 00 00 [ 87.416355] <3>BUG: sleeping function called from invalid context at /home/djwong/linux-2.6.20-rc4-dic94xx/kernel/rwsem.c:20 [ 87.427809] in_atomic():1, irqs_disabled():0 [ 87.432130] no locks held by swapper/0. [ 87.436015] [ 87.436017] Call Trace: [ 87.440072][] debug_show_held_locks+0x9/0xb [ 87.446718] [] __might_sleep+0xc6/0xc8 [ 87.452170] [] down_read+0x1d/0x45 [ 87.457276] [] blocking_notifier_call_chain+0x1b/0x41 [ 87.464029] [] profile_task_exit+0x15/0x17 [ 87.469824] [] do_exit+0x25/0x870 [ 87.474844] [] oops_end+0x42/0x62 [ 87.479863] [] sync_regs+0x0/0x71 [ 87.484883] [] nmi_watchdog_tick+0x156/0x240 [ 87.490856] [] default_do_nmi+0x81/0x1c6 [ 87.496480] [] do_nmi+0x2c/0x40 [ 87.501326] [] nmi+0x7f/0x90 [ 87.505920] [] :processor:acpi_processor_idle+0x0/0x4ad [ 87.512852] [] __sched_text_start+0xa7f/0xaba [ 87.518905] [] mwait_idle+0x47/0x4c [ 87.524097] [] cpu_idle+0x61/0xc7 [ 87.529116] <> [] start_secondary+0x46f/0x47e [ 87.535770] [ 87.537315] Kernel panic - not syncing: Attempted to kill the idle task! [ 87.537315] Kernel panic - not syncing: Attempted to kill the idle task! [ 87.544064] NMI Watchdog detected LOCKUP on CPU 2 [ 92.609986] CPU 2 [ 92.612110] Modules linked in: processor fan unix [ 92.617096] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0 [ 92.623230] RIP: 0010:[] [] cpu_idle+0x95/0xc7 [ 92.630998] RSP: 0018:8100059b9ed8 EFLAGS: 0046 [ 92.636358] RAX: RBX: 88015daf RCX: 80156564 [ 92.643533] RDX: RSI: 0002 RDI: 0001 [ 92.650708] RBP: 8100059b9ee8 R08: 0001 R09: 0001 [ 92.657883] R10: 80166fbd R11: 0070 R12: ffe6 [ 92.665057] R13: 0040 R14: 0246 R15: [ 92.672233] FS: () GS:810005997858() knlGS: [ 92.680368] CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b [ 92.686156] CR2: 0060c270 CR3: 00101000 CR4: 06e0
[BUG] NMI watchdog lockups caused by mwait_idle
Hi Venkatesh, I have an IBM IntelliStation Z30 with two Dempsey CPUs. When I try to boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog lockups. git-bisect determined that the patch [PATCH] x86-64: Fix interrupt race in idle callback (3rd try) was the source of these problems, and I can work around the problem either by passing idle=poll to get avoid mwait_idle or by reverting the patch. Other non-Dempsey Xeon machines with mwait support do not exhibit these symptoms. I will try to determine if this is a bug specific to Dempsey CPUs or this particular type of machine. I suspect the latter, but I don't know enough about monitor/mwait to pursue this much further. What else can I do to diagnose this? --D - [ 81.794792] Parsing all Control Methods: [ 81.798710] Table [SSDT](id 002F) - 5 Objects with 0 Devices 2 Methods 0 Regions [ 81.806410] ACPI (exconfig-0455): Dynamic SSDT Load - OemId [ PmRef] OemTableId [ Cpu0Ist] [20060707] [ 81.815967] ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3]) [ 81.821837] ACPI: Processor [CPU0] (supports 8 throttling states) [ 81.831290] Parsing all Control Methods: [ 81.835283] Table [SSDT](id 0032) - 3 Objects with 0 Devices 2 Methods 0 Regions [ 81.842988] ACPI (exconfig-0455): Dynamic SSDT Load - OemId [ PmRef] OemTableId [ Cpu1Ist] [20060707] [ 87.276183] NMI Watchdog detected LOCKUP on CPU 3 [ 87.280944] CPU 3 [ 87.283081] Modules linked in: processor fan unix [ 87.288109] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0 [ 87.294253] RIP: 0010:[80149c75] [80149c75] cpu_idle+0x61/0xc7 [ 87.302039] RSP: 0018:8100059ebed8 EFLAGS: 0086 [ 87.307398] RAX: RBX: 88015daf RCX: 80156564 [ 87.314574] RDX: 8100059ebec8 RSI: 0002 RDI: 0001 [ 87.321760] RBP: 8100059ebee8 R08: 0001 R09: 0001 [ 87.328942] R10: 80160cff R11: 0246 R12: fff7 [ 87.336125] R13: 0040 R14: 0246 R15: [ 87.343309] FS: () GS:8100059970a0() knlGS: [ 87.351453] CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b [ 87.357249] CR2: 00602148 CR3: 00101000 CR4: 06e0 [ 87.364435] Process swapper (pid: 0, threadinfo 8100059ea000, task 8100059a7040) [ 87.372577] Stack: 8100059ebee8 8100052a5600 8100059ebf48 80174dc2 [ 87.380966] 80502e38 80503038 037d [ 87.388684] 06e8 [ 87.396148] Call Trace: [ 87.398896] [80174dc2] start_secondary+0x46f/0x47e [ 87.404695] [ 87.406245] [ 87.406247] Code: 75 25 e8 dc 06 04 00 0f 09 0f ae f0 65 48 8b 14 25 08 00 00 [ 87.416355] 3BUG: sleeping function called from invalid context at /home/djwong/linux-2.6.20-rc4-dic94xx/kernel/rwsem.c:20 [ 87.427809] in_atomic():1, irqs_disabled():0 [ 87.432130] no locks held by swapper/0. [ 87.436015] [ 87.436017] Call Trace: [ 87.440072] NMI [8019f84e] debug_show_held_locks+0x9/0xb [ 87.446718] [8010ba32] __might_sleep+0xc6/0xc8 [ 87.452170] [8019da22] down_read+0x1d/0x45 [ 87.457276] [8019556d] blocking_notifier_call_chain+0x1b/0x41 [ 87.464029] [8018cabc] profile_task_exit+0x15/0x17 [ 87.469824] [80114b87] do_exit+0x25/0x870 [ 87.474844] [801646ff] oops_end+0x42/0x62 [ 87.479863] [80164a05] sync_regs+0x0/0x71 [ 87.484883] [801650ac] nmi_watchdog_tick+0x156/0x240 [ 87.490856] [80164c07] default_do_nmi+0x81/0x1c6 [ 87.496480] [801651c2] do_nmi+0x2c/0x40 [ 87.501326] [801645ef] nmi+0x7f/0x90 [ 87.505920] [88015daf] :processor:acpi_processor_idle+0x0/0x4ad [ 87.512852] [80160cff] __sched_text_start+0xa7f/0xaba [ 87.518905] [80156564] mwait_idle+0x47/0x4c [ 87.524097] [80149c75] cpu_idle+0x61/0xc7 [ 87.529116] EOE [80174dc2] start_secondary+0x46f/0x47e [ 87.535770] [ 87.537315] Kernel panic - not syncing: Attempted to kill the idle task! [ 87.537315] Kernel panic - not syncing: Attempted to kill the idle task! [ 87.544064] NMI Watchdog detected LOCKUP on CPU 2 [ 92.609986] CPU 2 [ 92.612110] Modules linked in: processor fan unix [ 92.617096] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0 [ 92.623230] RIP: 0010:[80149ca9] [80149ca9] cpu_idle+0x95/0xc7 [ 92.630998] RSP: 0018:8100059b9ed8 EFLAGS: 0046 [ 92.636358] RAX: RBX: 88015daf RCX: 80156564 [ 92.643533] RDX: RSI: 0002 RDI: 0001 [ 92.650708] RBP: 8100059b9ee8 R08: 0001 R09: 0001 [ 92.657883] R10: 80166fbd R11:
RE: [BUG] NMI watchdog lockups caused by mwait_idle
Darrick, I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked fine. No watchdog lockups. Can you try idle routine with hlt instead of mwait. There is no boot option for this in x86_64, but you can change arch/x86_64/kernel/process.c:select_idle_routine() not to enable mwait. With that default kernel should use hlt based idle. Also, worth seeing will be, what happens when nmi_watchdog=0, nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us whether nmi_watchdog is raising some false alarm or the CPUs are indeed getting locked up here.. Thanks, Venki -Original Message- From: Darrick J. Wong [mailto:[EMAIL PROTECTED] Sent: Friday, January 12, 2007 1:01 PM To: Pallipadi, Venkatesh Cc: Linux Kernel Mailing List Subject: [BUG] NMI watchdog lockups caused by mwait_idle Hi Venkatesh, I have an IBM IntelliStation Z30 with two Dempsey CPUs. When I try to boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog lockups. git-bisect determined that the patch [PATCH] x86-64: Fix interrupt race in idle callback (3rd try) was the source of these problems, and I can work around the problem either by passing idle=poll to get avoid mwait_idle or by reverting the patch. Other non-Dempsey Xeon machines with mwait support do not exhibit these symptoms. I will try to determine if this is a bug specific to Dempsey CPUs or this particular type of machine. I suspect the latter, but I don't know enough about monitor/mwait to pursue this much further. What else can I do to diagnose this? --D - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] NMI watchdog lockups caused by mwait_idle
Pallipadi, Venkatesh wrote: Darrick, I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked fine. No watchdog lockups. Can you try idle routine with hlt instead of mwait. There is no boot option for this in x86_64, but you can change arch/x86_64/kernel/process.c:select_idle_routine() not to enable mwait. With that default kernel should use hlt based idle. Also, worth seeing will be, what happens when nmi_watchdog=0, nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us whether nmi_watchdog is raising some false alarm or the CPUs are indeed getting locked up here.. Locks up with hlt-based idle too. :( Here's what I get with nmi_watchdog=0: [ 206.088703] BUG: soft lockup detected on CPU#0! [ 206.093284] [ 206.093286] Call Trace: [ 206.097324] IRQ [801b1f89] softlockup_tick+0xd4/0xe9 [ 206.103618] [80173c55] do_flush_tlb_all+0x0/0x68 [ 206.109238] [8014d8f8] run_local_timers+0x13/0x15 [ 206.114949] [80192844] update_process_times+0x4c/0x78 [ 206.121008] [80174fcd] smp_local_timer_interrupt+0x34/0x51 [ 206.127498] [801756b1] smp_apic_timer_interrupt+0x49/0x60 [ 206.133901] [8015cd16] apic_timer_interrupt+0x66/0x70 [ 206.139956] EOI [80173baa] __smp_call_function+0x66/0x87 [ 206.146594] [80173ba6] __smp_call_function+0x62/0x87 [ 206.152564] [80173c55] do_flush_tlb_all+0x0/0x68 [ 206.158188] [80173c55] do_flush_tlb_all+0x0/0x68 [ 206.163813] [80173cef] smp_call_function+0x32/0x49 [ 206.169611] [80173c55] do_flush_tlb_all+0x0/0x68 [ 206.175236] [8018e117] on_each_cpu+0x30/0x67 [ 206.180514] [80173d46] flush_tlb_all+0x1c/0x1e [ 206.185965] [80150f2a] unmap_vm_area+0x1c3/0x265 [ 206.191590] [80101c20] init_level4_pgt+0xc20/0x1000 [ 206.197474] [801bfc47] remove_vm_area+0x41/0x67 [ 206.203010] [8017c33c] iounmap+0x8e/0xc8 [ 206.207933] [80230032] acpi_os_unmap_memory+0x9/0xb [ 206.213810] [8023aaff] acpi_ev_system_memory_region_setup+0x52/0x105 [ 206.221174] [80259465] acpi_ut_delete_internal_obj+0x2c4/0x3b2 [ 206.228012] [802596d3] acpi_ut_update_ref_count+0x180/0x1d2 [ 206.234587] [80259885] acpi_ut_update_object_reference+0x160/0x207 [ 206.241770] [802599e1] acpi_ut_remove_reference+0xb5/0xd5 [ 206.248173] [8024da8a] acpi_ns_detach_object+0xca/0xee [ 206.254318] [8024b08a] acpi_ns_delete_namespace_by_owner+0xcf/0x154 [ 206.261597] [80234481] acpi_ds_terminate_control_method+0xb5/0x14f [ 206.268779] [8024ef7c] acpi_ps_parse_aml+0x242/0x3a0 [ 206.274750] [80250a00] acpi_ps_execute_pass+0xd5/0x10b [ 206.280895] [80250c3c] acpi_ps_execute_method+0x1bf/0x2cb [ 206.287298] [8024b4da] acpi_ns_evaluate+0x1f8/0x315 [ 206.293180] [8024abf1] acpi_evaluate_object+0x1d9/0x2fa [ 206.299411] [8010ab03] kmem_cache_alloc+0xce/0xda [ 206.305125] [880146a9] :processor:acpi_processor_start+0x656/0x6fd [ 206.312307] [801cc2a0] kmem_cache_zalloc+0xce/0xf4 [ 206.318103] [80261097] acpi_start_single_object+0x2a/0x54 [ 206.324509] [8026192d] acpi_bus_register_driver+0xcd/0x14c [ 206.331001] [88022061] :processor:acpi_processor_init+0x61/0xb7 [ 206.337923] [801a4d6e] sys_init_module+0xac/0x16c [ 206.343630] [8015c11e] system_call+0x7e/0x83 nmi_watchdog={1,2} produce the same errors. --D - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/