e1000e driver - hang after 4 hours of uptime - finally bisected!

2015-06-18 Thread Valdis Kletnieks
(follow up to a report from last week - bisecting took a while as I could
only do 1 or 2 tests an evening)

My Dell Latitude E6530 crashes with a specific kernel lockup almost
exactly 4 hours after boot if there isn't a cable connected to the
Ethernet port:

[14508.846327] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[14468.229720] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[14463.254791] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[14491.134413] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 
1
[14463.396593] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 
2
[14490.390223] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 
1
[14494.680591] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[14513.365378] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 
1
[14482.271716] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 
3
[14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0

As far as I can tell, the timestamp jitter is just how long it takes me to
enter the cryptLUKS passphrase for the hard drive at boot...

lspci tells me:

lspci -vvv -s 00:19.0
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network 
Connection (rev 04)
DeviceName:  Onboard LAN
Subsystem: Dell Device 0535
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- 
MAbort- SERR- PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 28
Region 0: Memory at f770 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at f7739000 (32-bit, non-prefetchable) [size=4K]
Region 2: I/O ports at f040 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee00318  Data: 
Capabilities: [e0] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: e1000e


The traceback always looks like:

[14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0

[14479.906908] Call Trace:
[14479.906914]  NMI  [ba94db16] dump_stack+0x50/0xa8
[14479.906930]  [ba948bb9] panic+0xcd/0x1e4
[14479.906940]  [ba166a60] ? perf_event_task_disable+0xc0/0xc0
[14479.906952]  [ba125d8b] watchdog_overflow_callback+0x9b/0xa0
[14479.906959]  [ba16a684] __perf_event_overflow+0xc4/0x1f0
[14479.906968]  [ba16b3a4] perf_event_overflow+0x14/0x20
[14479.906976]  [ba022271] intel_pmu_handle_irq+0x1e1/0x430
[14479.906990]  [ba01a0f6] perf_event_nmi_handler+0x26/0x40
[14479.906999]  [ba0085b3] nmi_handle+0x103/0x340
[14479.907005]  [ba0084b5] ? nmi_handle+0x5/0x340
[14479.907017]  [ba008a53] default_do_nmi+0xc3/0x120
[14479.907032]  [ba008b98] do_nmi+0xe8/0x130
[14479.907044]  [ba95c9a8] end_repeat_nmi+0x1e/0x2e
[14479.907055]  [ba529886] ? e1000e_cyclecounter_read+0x16/0xc0
[14479.907061]  [ba529886] ? e1000e_cyclecounter_read+0x16/0xc0
[14479.907069]  [ba529886] ? e1000e_cyclecounter_read+0x16/0xc0
[14479.907075]  EOE  [ba0e9529] timecounter_read+0x19/0x60
[14479.907088]  [ba53687e] e1000e_phc_gettime+0x2e/0x60
[14479.907098]  [ba536a31] e1000e_systim_overflow_work+0x31/0x70
[14479.907105]  [ba07ad19] process_one_work+0x3c9/0x980
[14479.907115]  [ba07ac62] ? process_one_work+0x312/0x980
[14479.907125]  [ba07b348] ? worker_thread+0x78/0x760
[14479.907134]  [ba07b59c] worker_thread+0x2cc/0x760
[14479.907144]  [ba07b2d0] ? process_one_work+0x980/0x980
[14479.907154]  [ba082a5e] kthread+0xfe/0x120
[14479.907163]  [ba08ca50] ? finish_task_switch+0x50/0x1c0
[14479.907173]  [ba082960] ? kthread_create_on_node+0x270/0x270
[14479.907179]  [ba95ae4f] ret_from_fork+0x3f/0x70
[14479.907188]  [ba082960] ? kthread_create_on_node+0x270/0x270
[14479.907243] Kernel Offset: 0x3900 from 0x8100 (relocation 
range: 0x8000-0xbfff)

Bisection tells me it's this commit:

commit 83129b37ef35bb6a7f01c060129736a8db5d31c4
Author: Yanir Lubetkin yanirx.lubet...@intel.com
Date:   Tue Jun 2 17:05:45 2015 +0300

e1000e: fix systim issues

Two issues involving systim were reported.
1. Clock is not running in the correct frequency
2. In some situations, systim values were not incremented linearly
This patch fixes the hardware clock configuration and the 

Re: e1000e driver - hang after 4 hours of uptime - finally bisected!

2015-06-18 Thread Jeff Kirsher
On Thu, 2015-06-18 at 12:46 -0400, Valdis Kletnieks wrote:
 (follow up to a report from last week - bisecting took a while as I could
 only do 1 or 2 tests an evening)
 
 My Dell Latitude E6530 crashes with a specific kernel lockup almost
 exactly 4 hours after boot if there isn't a cable connected to the
 Ethernet port:
 
 [14508.846327] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 0
 [14468.229720] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 0
 [14463.254791] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 0
 [14491.134413] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 1
 [14463.396593] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 2
 [14490.390223] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 1
 [14494.680591] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 0
 [14513.365378] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 1
 [14482.271716] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 3
 [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 0
 
 As far as I can tell, the timestamp jitter is just how long it takes me to
 enter the cryptLUKS passphrase for the hard drive at boot...
 
 lspci tells me:
 
 lspci -vvv -s 00:19.0
 00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network 
 Connection (rev 04)
 DeviceName:  Onboard LAN
 Subsystem: Dell Device 0535
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
 Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- 
 TAbort- MAbort- SERR- PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 28
 Region 0: Memory at f770 (32-bit, non-prefetchable) [size=128K]
 Region 1: Memory at f7739000 (32-bit, non-prefetchable) [size=4K]
 Region 2: I/O ports at f040 [size=32]
 Capabilities: [c8] Power Management version 2
 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
 PME(D0+,D1-,D2-,D3hot+,D3cold+)
 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
 Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
 Address: fee00318  Data: 
 Capabilities: [e0] PCI Advanced Features
 AFCap: TP+ FLR+
 AFCtrl: FLR-
 AFStatus: TP-
 Kernel driver in use: e1000e
 
 
 The traceback always looks like:
 
 [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on 
 cpu 0
 
 [14479.906908] Call Trace:
 [14479.906914]  NMI  [ba94db16] dump_stack+0x50/0xa8
 [14479.906930]  [ba948bb9] panic+0xcd/0x1e4
 [14479.906940]  [ba166a60] ? perf_event_task_disable+0xc0/0xc0
 [14479.906952]  [ba125d8b] watchdog_overflow_callback+0x9b/0xa0
 [14479.906959]  [ba16a684] __perf_event_overflow+0xc4/0x1f0
 [14479.906968]  [ba16b3a4] perf_event_overflow+0x14/0x20
 [14479.906976]  [ba022271] intel_pmu_handle_irq+0x1e1/0x430
 [14479.906990]  [ba01a0f6] perf_event_nmi_handler+0x26/0x40
 [14479.906999]  [ba0085b3] nmi_handle+0x103/0x340
 [14479.907005]  [ba0084b5] ? nmi_handle+0x5/0x340
 [14479.907017]  [ba008a53] default_do_nmi+0xc3/0x120
 [14479.907032]  [ba008b98] do_nmi+0xe8/0x130
 [14479.907044]  [ba95c9a8] end_repeat_nmi+0x1e/0x2e
 [14479.907055]  [ba529886] ? e1000e_cyclecounter_read+0x16/0xc0
 [14479.907061]  [ba529886] ? e1000e_cyclecounter_read+0x16/0xc0
 [14479.907069]  [ba529886] ? e1000e_cyclecounter_read+0x16/0xc0
 [14479.907075]  EOE  [ba0e9529] timecounter_read+0x19/0x60
 [14479.907088]  [ba53687e] e1000e_phc_gettime+0x2e/0x60
 [14479.907098]  [ba536a31] e1000e_systim_overflow_work+0x31/0x70
 [14479.907105]  [ba07ad19] process_one_work+0x3c9/0x980
 [14479.907115]  [ba07ac62] ? process_one_work+0x312/0x980
 [14479.907125]  [ba07b348] ? worker_thread+0x78/0x760
 [14479.907134]  [ba07b59c] worker_thread+0x2cc/0x760
 [14479.907144]  [ba07b2d0] ? process_one_work+0x980/0x980
 [14479.907154]  [ba082a5e] kthread+0xfe/0x120
 [14479.907163]  [ba08ca50] ? finish_task_switch+0x50/0x1c0
 [14479.907173]  [ba082960] ? kthread_create_on_node+0x270/0x270
 [14479.907179]  [ba95ae4f] ret_from_fork+0x3f/0x70
 [14479.907188]  [ba082960] ? kthread_create_on_node+0x270/0x270
 [14479.907243] Kernel Offset: 0x3900 from 0x8100 (relocation 
 range: 0x8000-0xbfff)
 
 Bisection tells me it's this commit:
 
 commit 83129b37ef35bb6a7f01c060129736a8db5d31c4
 Author: Yanir Lubetkin yanirx.lubet...@intel.com
 Date:   Tue Jun 2 17:05:45 2015 +0300
 
 e1000e: fix systim issues
 
 Two issues involving systim were reported.
 1. Clock