Re: [BUG] NMI watchdog lockups caused by mwait_idle

2007-01-12 Thread Darrick J. Wong
Pallipadi, Venkatesh wrote:
> Darrick,
> 
> I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked
> fine. No watchdog lockups.
> Can you try the idle routine with hlt instead of mwait? There is no boot
> option for this on x86_64, but you can change
> arch/x86_64/kernel/process.c:select_idle_routine() so that it does not
> enable mwait. With that change, the kernel should default to hlt-based idle.
> 
> It would also be worth seeing what happens with the nmi_watchdog=0,
> nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us
> whether nmi_watchdog is raising a false alarm or the CPUs are indeed
> getting locked up here.
> 

Locks up with hlt-based idle too. :(

Here's what I get with nmi_watchdog=0:

[  206.088703] BUG: soft lockup detected on CPU#0!
[  206.093284] 
[  206.093286] Call Trace:
[  206.097324]  <IRQ>  [801b1f89] softlockup_tick+0xd4/0xe9
[  206.103618]  [80173c55] do_flush_tlb_all+0x0/0x68
[  206.109238]  [8014d8f8] run_local_timers+0x13/0x15
[  206.114949]  [80192844] update_process_times+0x4c/0x78
[  206.121008]  [80174fcd] smp_local_timer_interrupt+0x34/0x51
[  206.127498]  [801756b1] smp_apic_timer_interrupt+0x49/0x60
[  206.133901]  [8015cd16] apic_timer_interrupt+0x66/0x70
[  206.139956]  <EOI>  [80173baa] __smp_call_function+0x66/0x87
[  206.146594]  [80173ba6] __smp_call_function+0x62/0x87
[  206.152564]  [80173c55] do_flush_tlb_all+0x0/0x68
[  206.158188]  [80173c55] do_flush_tlb_all+0x0/0x68
[  206.163813]  [80173cef] smp_call_function+0x32/0x49
[  206.169611]  [80173c55] do_flush_tlb_all+0x0/0x68
[  206.175236]  [8018e117] on_each_cpu+0x30/0x67
[  206.180514]  [80173d46] flush_tlb_all+0x1c/0x1e
[  206.185965]  [80150f2a] unmap_vm_area+0x1c3/0x265
[  206.191590]  [80101c20] init_level4_pgt+0xc20/0x1000
[  206.197474]  [801bfc47] remove_vm_area+0x41/0x67
[  206.203010]  [8017c33c] iounmap+0x8e/0xc8
[  206.207933]  [80230032] acpi_os_unmap_memory+0x9/0xb
[  206.213810]  [8023aaff] acpi_ev_system_memory_region_setup+0x52/0x105
[  206.221174]  [80259465] acpi_ut_delete_internal_obj+0x2c4/0x3b2
[  206.228012]  [802596d3] acpi_ut_update_ref_count+0x180/0x1d2
[  206.234587]  [80259885] acpi_ut_update_object_reference+0x160/0x207
[  206.241770]  [802599e1] acpi_ut_remove_reference+0xb5/0xd5
[  206.248173]  [8024da8a] acpi_ns_detach_object+0xca/0xee
[  206.254318]  [8024b08a] acpi_ns_delete_namespace_by_owner+0xcf/0x154
[  206.261597]  [80234481] acpi_ds_terminate_control_method+0xb5/0x14f
[  206.268779]  [8024ef7c] acpi_ps_parse_aml+0x242/0x3a0
[  206.274750]  [80250a00] acpi_ps_execute_pass+0xd5/0x10b
[  206.280895]  [80250c3c] acpi_ps_execute_method+0x1bf/0x2cb
[  206.287298]  [8024b4da] acpi_ns_evaluate+0x1f8/0x315
[  206.293180]  [8024abf1] acpi_evaluate_object+0x1d9/0x2fa
[  206.299411]  [8010ab03] kmem_cache_alloc+0xce/0xda
[  206.305125]  [880146a9] :processor:acpi_processor_start+0x656/0x6fd
[  206.312307]  [801cc2a0] kmem_cache_zalloc+0xce/0xf4
[  206.318103]  [80261097] acpi_start_single_object+0x2a/0x54
[  206.324509]  [8026192d] acpi_bus_register_driver+0xcd/0x14c
[  206.331001]  [88022061] :processor:acpi_processor_init+0x61/0xb7
[  206.337923]  [801a4d6e] sys_init_module+0xac/0x16c
[  206.343630]  [8015c11e] system_call+0x7e/0x83

nmi_watchdog={1,2} produce the same errors.

--D
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
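
When comparing traces like the soft lockup above across boots (for example
nmi_watchdog=0 versus nmi_watchdog=1), it can help to reduce each trace to
its chain of function names. The following is a minimal sketch and not part
of the original thread: the regular expression and the names trace_symbols
and SAMPLE are assumptions made for illustration, and the sample frames are
copied from the trace above.

```python
import re

# Matches one stack frame of the form "symbol+0xOFFSET/0xSIZE",
# optionally preceded by a ":module:" prefix as printed for module code.
FRAME_RE = re.compile(
    r"(?::(?P<module>\w+):)?(?P<symbol>[A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_symbols(log_text):
    """Return the function names found in a call trace, in printed order.

    The search runs over the whole text rather than line by line, so frames
    that a mail client wrapped onto the next line are still picked up.
    """
    return [match.group("symbol") for match in FRAME_RE.finditer(log_text)]


# A few frames copied from the soft lockup trace (hypothetical selection).
SAMPLE = """\
[  206.146594]  [] __smp_call_function+0x62/0x87
[  206.163813]  [] smp_call_function+0x32/0x49
[  206.175236]  [] on_each_cpu+0x30/0x67
[  206.180514]  [] flush_tlb_all+0x1c/0x1e
[  206.305125]  [] :processor:acpi_processor_start+0x656/0x6fd
"""

if __name__ == "__main__":
    # Print the call chain, innermost frame first.
    print(" <- ".join(trace_symbols(SAMPLE)))
```

Read this way, the trace appears to show CPU#0 sitting in
__smp_call_function, reached through flush_tlb_all from the ACPI processor
driver's start-up path, when the soft lockup check fired.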


RE: [BUG] NMI watchdog lockups caused by mwait_idle

2007-01-12 Thread Pallipadi, Venkatesh

Darrick,

I tried 2.6.20-rc4 on a Dempsey system here in my lab and it worked
fine. No watchdog lockups.
Can you try the idle routine with hlt instead of mwait? There is no boot
option for this on x86_64, but you can change
arch/x86_64/kernel/process.c:select_idle_routine() so that it does not
enable mwait. With that change, the kernel should default to hlt-based idle.

It would also be worth seeing what happens with the nmi_watchdog=0,
nmi_watchdog=1, and nmi_watchdog=2 boot options. That should tell us
whether nmi_watchdog is raising a false alarm or the CPUs are indeed
getting locked up here.

Thanks,
Venki


>-----Original Message-----
>From: Darrick J. Wong [mailto:[EMAIL PROTECTED] 
>Sent: Friday, January 12, 2007 1:01 PM
>To: Pallipadi, Venkatesh
>Cc: Linux Kernel Mailing List
>Subject: [BUG] NMI watchdog lockups caused by mwait_idle
>
>Hi Venkatesh,
>
>I have an IBM IntelliStation Z30 with two Dempsey CPUs.  When I try to
>boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog
>lockups.  git-bisect determined that the patch "[PATCH] x86-64: Fix
>interrupt race in idle callback (3rd try)" was the source of these
>problems, and I can work around the problem either by passing
>"idle=poll" to avoid mwait_idle or by reverting the patch.
>
>Other non-Dempsey Xeon machines with mwait support do not exhibit these
>symptoms.  I will try to determine if this is a bug specific to Dempsey
>CPUs or this particular type of machine.  I suspect the latter, but I
>don't know enough about monitor/mwait to pursue this much further.
>
>What else can I do to diagnose this?
>
>--D
>
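
The three idle configurations discussed in this thread (idle=poll, mwait,
and hlt) can be summarized as a small decision function. This is a
simplified Python model under stated assumptions, not kernel code: the
function name pick_idle, its signature, and the allow_mwait flag are
hypothetical, with allow_mwait standing in for the source edit to
select_idle_routine() suggested above. The returned strings are only labels
for the idle method.

```python
def pick_idle(cpu_has_mwait, idle_param=None, allow_mwait=True):
    """Model which idle method a CPU ends up using.

    cpu_has_mwait: whether the CPU advertises monitor/mwait support.
    idle_param:    value of the "idle=" boot option, or None if not given.
    allow_mwait:   hypothetical switch modelling a kernel patched so that
                   select_idle_routine() no longer enables mwait.
    """
    if idle_param == "poll":
        # "idle=poll" overrides everything else and busy-waits.
        return "poll"
    if cpu_has_mwait and allow_mwait:
        return "mwait"
    # Fallback when mwait is unavailable or has been disabled.
    return "hlt"


if __name__ == "__main__":
    # The configurations tried in the thread on the Dempsey system.
    print(pick_idle(True))                      # stock kernel
    print(pick_idle(True, idle_param="poll"))   # workaround from the report
    print(pick_idle(True, allow_mwait=False))   # suggested hlt experiment
```

The point of the hlt experiment is to separate the two variables: if the
lockups persist with hlt, mwait itself is less likely to be the cause.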


[BUG] NMI watchdog lockups caused by mwait_idle

2007-01-12 Thread Darrick J. Wong
Hi Venkatesh,

I have an IBM IntelliStation Z30 with two Dempsey CPUs.  When I try to
boot 2.6.20-rc4 on it, the system prints messages about NMI watchdog
lockups.  git-bisect determined that the patch "[PATCH] x86-64: Fix
interrupt race in idle callback (3rd try)" was the source of these
problems, and I can work around the problem either by passing
"idle=poll" to avoid mwait_idle or by reverting the patch.

Other non-Dempsey Xeon machines with mwait support do not exhibit these
symptoms.  I will try to determine if this is a bug specific to Dempsey
CPUs or this particular type of machine.  I suspect the latter, but I
don't know enough about monitor/mwait to pursue this much further.

What else can I do to diagnose this?

--D

-

[   81.794792] Parsing all Control Methods:
[   81.798710] Table [SSDT](id 002F) - 5 Objects with 0 Devices 2 Methods 0 Regions
[   81.806410] ACPI (exconfig-0455): Dynamic SSDT Load - OemId [ PmRef] OemTableId [ Cpu0Ist] [20060707]
[   81.815967] ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
[   81.821837] ACPI: Processor [CPU0] (supports 8 throttling states)
[   81.831290] Parsing all Control Methods:
[   81.835283] Table [SSDT](id 0032) - 3 Objects with 0 Devices 2 Methods 0 Regions
[   81.842988] ACPI (exconfig-0455): Dynamic SSDT Load - OemId [ PmRef] OemTableId [ Cpu1Ist] [20060707]
[   87.276183] NMI Watchdog detected LOCKUP on CPU 3
[   87.280944] CPU 3 
[   87.283081] Modules linked in: processor fan unix
[   87.288109] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0
[   87.294253] RIP: 0010:[80149c75]  [80149c75] cpu_idle+0x61/0xc7
[   87.302039] RSP: 0018:8100059ebed8  EFLAGS: 0086
[   87.307398] RAX:  RBX: 88015daf RCX: 80156564
[   87.314574] RDX: 8100059ebec8 RSI: 0002 RDI: 0001
[   87.321760] RBP: 8100059ebee8 R08: 0001 R09: 0001
[   87.328942] R10: 80160cff R11: 0246 R12: fff7
[   87.336125] R13: 0040 R14: 0246 R15: 
[   87.343309] FS:  () GS:8100059970a0() knlGS:
[   87.351453] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[   87.357249] CR2: 00602148 CR3: 00101000 CR4: 06e0
[   87.364435] Process swapper (pid: 0, threadinfo 8100059ea000, task 8100059a7040)
[   87.372577] Stack:  8100059ebee8 8100052a5600 8100059ebf48 80174dc2
[   87.380966]   80502e38 80503038 037d
[   87.388684]  06e8   
[   87.396148] Call Trace:
[   87.398896]  [80174dc2] start_secondary+0x46f/0x47e
[   87.404695] 
[   87.406245] 
[   87.406247] Code: 75 25 e8 dc 06 04 00 0f 09 0f ae f0 65 48 8b 14 25 08 00 00 
[   87.416355]  <3>BUG: sleeping function called from invalid context at /home/djwong/linux-2.6.20-rc4-dic94xx/kernel/rwsem.c:20
[   87.427809] in_atomic():1, irqs_disabled():0
[   87.432130] no locks held by swapper/0.
[   87.436015] 
[   87.436017] Call Trace:
[   87.440072]  <NMI>  [8019f84e] debug_show_held_locks+0x9/0xb
[   87.446718]  [8010ba32] __might_sleep+0xc6/0xc8
[   87.452170]  [8019da22] down_read+0x1d/0x45
[   87.457276]  [8019556d] blocking_notifier_call_chain+0x1b/0x41
[   87.464029]  [8018cabc] profile_task_exit+0x15/0x17
[   87.469824]  [80114b87] do_exit+0x25/0x870
[   87.474844]  [801646ff] oops_end+0x42/0x62
[   87.479863]  [80164a05] sync_regs+0x0/0x71
[   87.484883]  [801650ac] nmi_watchdog_tick+0x156/0x240
[   87.490856]  [80164c07] default_do_nmi+0x81/0x1c6
[   87.496480]  [801651c2] do_nmi+0x2c/0x40
[   87.501326]  [801645ef] nmi+0x7f/0x90
[   87.505920]  [88015daf] :processor:acpi_processor_idle+0x0/0x4ad
[   87.512852]  [80160cff] __sched_text_start+0xa7f/0xaba
[   87.518905]  [80156564] mwait_idle+0x47/0x4c
[   87.524097]  [80149c75] cpu_idle+0x61/0xc7
[   87.529116]  <EOE>  [80174dc2] start_secondary+0x46f/0x47e
[   87.535770] 
[   87.537315] Kernel panic - not syncing: Attempted to kill the idle task!
[   87.537315] Kernel panic - not syncing: Attempted to kill the idle task!
[   87.544064]  NMI Watchdog detected LOCKUP on CPU 2
[   92.609986] CPU 2 
[   92.612110] Modules linked in: processor fan unix
[   92.617096] Pid: 0, comm: swapper Not tainted 2.6.20-rc4-dic64 #0
[   92.623230] RIP: 0010:[80149ca9]  [80149ca9] cpu_idle+0x95/0xc7
[   92.630998] RSP: 0018:8100059b9ed8  EFLAGS: 0046
[   92.636358] RAX:  RBX: 88015daf RCX: 80156564
[   92.643533] RDX:  RSI: 0002 RDI: 0001
[   92.650708] RBP: 8100059b9ee8 R08: 0001 R09: 0001
[   92.657883] R10: 80166fbd R11: 0070 R12: ffe6
[   92.665057] R13: 0040 R14: 0246 R15: 
[   92.672233] FS:  () GS:810005997858() knlGS:
[   92.680368] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[   92.686156] CR2: 0060c270 CR3: 00101000 CR4: 06e0
