Hi, I got few reports so far that 82571EB models are having the "Detected Hardware Unit Hang" issue after upgrading the kernel.
Further debugging with an instrumented kernel revealed that the socket buffer time stamp matches with the last time e1000_xmit_frame() was called. Also that the time stamp of e1000_clean_tx_irq() last run is prior to the one in socket buffer. However, ~1 second later, an interrupt is fired and the old entry is found. Sometimes, the scheduled print_hang_task dumps the information _after_ the old entry is sent (shows empty ring), indicating that the HW TX unit isn't really stuck and apparently just missed the signal to initiate the transmission. Order of events: (1) skb is pushed down (2) e1000_xmit_frame() is called (3) ring is filled with one entry (4) TDT is updated >(5) nothing happens for little more than 1 second (6) interrupt is fired (7) e1000_clean_tx_irq() is called (8) finds the entry not ready with an old time stamp, schedules print_hang_task and stops the TX queue. (9) print_hang_task runs, dump the info but the old entry is now sent (10) apparently the TX queue is back. The following commit seems to be related to the symptoms seen above: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3a3b75860527a11ba5035c6aa576079245d09e2a From: Jesse Brandeburg <jesse.brandeb...@intel.com> Date: Wed, 29 Sep 2010 21:38:49 +0000 (+0000) Subject: e1000e: use hardware writeback batching X-Git-Tag: v2.6.37-rc1~147^2~299 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=3a3b75860527a11ba5035c6aa576079245d09e2a e1000e: use hardware writeback batching Most e1000e parts support batching writebacks. The problem with this is that when some of the TADV or TIDV timers are not set, Tx can sit forever. This is solved in this patch with write flushes using the Flush Partial Descriptors (FPD) bit in TIDV and RDTR. This improves bus utilization and removes partial writes on e1000e, particularly from 82571 parts in S5500 chipset based machines. Only ES2LAN and 82571/2 parts are included in this optimization, to reduce testing load. We have modified the instrumented kernel to include the following patch disabling writeback batching feature to narrow down the problem: --- debug/drivers/net/e1000e/82571.c.orig 2011-10-11 14:00:44.000000000 -0300 +++ debug/drivers/net/e1000e/82571.c 2011-10-11 15:02:51.000000000 -0300 @@ -2028,8 +2028,7 @@ struct e1000_info e1000_82571_info = { | FLAG_RESET_OVERWRITES_LAA /* errata */ | FLAG_TARC_SPEED_MODE_BIT /* errata */ | FLAG_APME_CHECK_PORT_B, - .flags2 = FLAG2_DISABLE_ASPM_L1 /* errata 13 */ - | FLAG2_DMA_BURST, + .flags2 = FLAG2_DISABLE_ASPM_L1, /* errata 13 */ .pba = 38, .max_hw_frame_size = DEFAULT_JUMBO, and the customer confirmed that the issue has disappeared since then. Board info: 1e:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06) 1e:00.0 0200: 8086:10bc (rev 06) Subsystem: 103c:704b Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin B routed to IRQ 224 Region 0: Memory at fd4e0000 (32-bit, non-prefetchable) [size=128K] Region 1: Memory at fd400000 (32-bit, non-prefetchable) [size=512K] Region 2: I/O ports at 7000 [size=32] Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee00000 Data: 4073 Capabilities: [e0] Express (v1) Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 4096 bytes DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <4us, L1 <64us ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Kernel driver in use: e1000e Kernel modules: e1000e I am checking the NIC dev specs again but so far I couldn't find a reason for this to happen yet. Any ideas? It happens with 5.7 kernel (2.6.18-274.el5) and it seems to be happening with 6.1 as well, though I am waiting the instrumented kernel outputs to confirm. This is related to https://bugzilla.redhat.com/show_bug.cgi?id=746272 thanks in advance! fbl ------------------------------------------------------------------------------ All the data continuously generated in your IT infrastructure contains a definitive record of customers, application performance, security threats, fraudulent activity and more. Splunk takes this data and makes sense of it. Business sense. IT sense. Common sense. http://p.sf.net/sfu/splunk-d2d-oct _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired