[E1000-devel] 82571EB: Detected Hardware Unit Hang

Flavio Leitner Fri, 14 Oct 2011 10:07:43 -0700

Hi,

I have received a few reports so far of 82571EB models hitting the
"Detected Hardware Unit Hang" issue after a kernel upgrade.


Further debugging with an instrumented kernel revealed that the
socket buffer time stamp matches the last time e1000_xmit_frame()
was called. It also showed that the time stamp of the last
e1000_clean_tx_irq() run is earlier than the one in the socket buffer.

However, ~1 second later, an interrupt is fired and the old entry
is found. Sometimes, the scheduled print_hang_task dumps the
information _after_ the old entry is sent (shows empty ring),
indicating that the HW TX unit isn't really stuck and apparently
just missed the signal to initiate the transmission.

Order of events:
 (1) skb is pushed down
 (2) e1000_xmit_frame() is called
 (3) ring is filled with one entry
 (4) TDT is updated
>(5) nothing happens for a little more than 1 second
 (6) interrupt is fired
 (7) e1000_clean_tx_irq() is called
 (8) finds the entry not ready with an old time stamp,
     schedules print_hang_task and stops the TX queue.
 (9) print_hang_task runs and dumps the info, but the old entry is now sent
(10) apparently the TX queue is back.
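
The check in step (8) can be pictured with a small user-space model.
This is only a sketch: the struct, the function names and the tick rate
are hypothetical and are not taken from the e1000e source. The one
assumption carried over from the sequence above is that an entry is
reported as hung when it is still pending after its time stamp plus a
timeout has passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are the driver's own. */
#define MODEL_HZ 1000u /* assumed ticks per second */

struct model_tx_entry {
	uint32_t time_stamp; /* tick at which the entry was queued, 0 = unused */
	bool done;           /* true once the hardware reported completion */
};

/* Wrap-safe "a is later than b", in the spirit of the kernel's time_after(). */
static bool model_time_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Returns true when the oldest pending entry looks hung: it was queued,
 * is still not done, and has waited longer than timeout_ticks.
 */
static bool model_tx_looks_hung(const struct model_tx_entry *oldest,
				uint32_t now, uint32_t timeout_ticks)
{
	assert(oldest != NULL);
	if (oldest->time_stamp == 0 || oldest->done)
		return false;
	return model_time_after(now, oldest->time_stamp + timeout_ticks);
}
```

In the sequence above the entry does complete shortly afterwards, which
is why a check of this kind can fire even though the transmit unit is
not permanently stuck.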

The following commit seems to be related to the symptoms seen above:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3a3b75860527a11ba5035c6aa576079245d09e2a

 From: Jesse Brandeburg <jesse.brandeb...@intel.com>
 Date: Wed, 29 Sep 2010 21:38:49 +0000 (+0000)
 Subject: e1000e: use hardware writeback batching
 X-Git-Tag: v2.6.37-rc1~147^2~299
 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=3a3b75860527a11ba5035c6aa576079245d09e2a
 

 e1000e: use hardware writeback batching

 Most e1000e parts support batching writebacks.  The problem with this is
 that when some of the TADV or TIDV timers are not set, Tx can sit forever.

 This is solved in this patch with write flushes using the Flush Partial
 Descriptors (FPD) bit in TIDV and RDTR.

 This improves bus utilization and removes partial writes on e1000e,
 particularly from 82571 parts in S5500 chipset based machines.

 Only ES2LAN and 82571/2 parts are included in this optimization, to reduce
 testing load.

We have modified the instrumented kernel to include the following patch
disabling writeback batching feature to narrow down the problem:

--- debug/drivers/net/e1000e/82571.c.orig      2011-10-11 14:00:44.000000000 -0300
+++ debug/drivers/net/e1000e/82571.c   2011-10-11 15:02:51.000000000 -0300
@@ -2028,8 +2028,7 @@ struct e1000_info e1000_82571_info = {
                                 | FLAG_RESET_OVERWRITES_LAA /* errata */
                                 | FLAG_TARC_SPEED_MODE_BIT /* errata */
                                 | FLAG_APME_CHECK_PORT_B,
-      .flags2                 = FLAG2_DISABLE_ASPM_L1 /* errata 13 */
-                                | FLAG2_DMA_BURST,
+      .flags2                 = FLAG2_DISABLE_ASPM_L1, /* errata 13 */
       .pba                    = 38,
       .max_hw_frame_size      = DEFAULT_JUMBO,


and the customer confirmed that the issue has disappeared since then.
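
For readers unfamiliar with the flags2 field, the effect of the debug
patch can be sketched as a plain bit test. The bit positions below are
hypothetical (the real values are not shown in this thread); the only
point is that dropping FLAG2_DMA_BURST from the mask makes every
"is batching enabled" test come out false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this illustration. */
#define MODEL_FLAG2_DISABLE_ASPM_L1 (UINT32_C(1) << 0)
#define MODEL_FLAG2_DMA_BURST       (UINT32_C(1) << 1)

/* True when the writeback batching path would be taken. */
static bool model_dma_burst_enabled(uint32_t flags2)
{
	return (flags2 & MODEL_FLAG2_DMA_BURST) != 0;
}
```

With the original flags2 value the test is true; with the patched value
(only the ASPM L1 flag) it is false, so batching is never configured.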

Board info:
1e:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)

1e:00.0 0200: 8086:10bc (rev 06)
        Subsystem: 103c:704b
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 224
        Region 0: Memory at fd4e0000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fd400000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at 7000 [size=32]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 4073
        Capabilities: [e0] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <4us, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Kernel driver in use: e1000e
        Kernel modules: e1000e


I am checking the NIC dev specs again, but so far I have not found
a reason for this to happen.

Any ideas? It happens with the RHEL 5.7 kernel (2.6.18-274.el5) and it
seems to be happening with 6.1 as well, though I am still waiting for the
instrumented kernel output to confirm.

This is related to 
https://bugzilla.redhat.com/show_bug.cgi?id=746272

thanks in advance!
fbl
