On Fri, 12 Feb 2010, Nishit Shah wrote:
> Hi,
>
> I am getting Tx hangs with e1000e-1.0.15 driver. Attached
> logs below.

Is there a chance you can try 1.1.2? Do you have jumbo frames enabled?

> Feb 10 06:05:11 1265762111 kernel: e1000: eth4: e1000_clean_tx_irq: Detected Tx Unit Hang
> Feb 10 06:05:11 1265762111 kernel: Tx Queue <0>
> Feb 10 06:05:11 1265762111 kernel: TDH <e1>
> Feb 10 06:05:11 1265762111 kernel: TDT <cc>
> Feb 10 06:05:11 1265762111 kernel: next_to_use <cc>
> Feb 10 06:05:11 1265762111 kernel: next_to_clean <e0>
> Feb 10 06:05:11 1265762111 kernel: buffer_info[next_to_clean]
> Feb 10 06:05:11 1265762111 kernel: time_stamp <56300a18>
> Feb 10 06:05:11 1265762111 kernel: next_to_watch <e4>
> Feb 10 06:05:11 1265762111 kernel: jiffies <56300b51>
> Feb 10 06:05:11 1265762111 kernel: next_to_watch.status <0>
> Feb 10 06:05:13 1265762113 kernel: e1000: eth4: e1000_clean_tx_irq: Detected Tx Unit Hang

Looks like something is really hanging. If you turn off UDP checksum
offload (and maybe scatter gather) with ethtool, does it start working?

If this is reproducible, I would like to see the output of the e1000_dump
routine at the time of the hang, but with 2048 descriptors it will be
really huge (and probably overrun syslog). I would need to prepare a
version (or patch) of 1.0.15 or 1.1.2 with the e1000_dump code enabled.

Is it always the same interface?

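As a rough aid for reading the dump above, here is a small sketch that does the ring arithmetic on the logged values. It is not driver code: the helper name ring_distance is made up for illustration, the ring size of 2048 is taken from the ethtool -g output in the report, and the interpretation in the comments (next_to_watch as the last descriptor of the packet at next_to_clean, status 0 meaning the hardware has not marked it done) is my reading of the log and should be treated as an assumption.

```python
RING_SIZE = 2048  # TX ring size reported by `ethtool -g eth4`


def ring_distance(start, end, size=RING_SIZE):
    """Descriptors from start up to (not including) end, with wraparound."""
    return (end - start) % size


# Values copied from the first "Detected Tx Unit Hang" dump (all hex).
tdh = 0xE1             # hardware head pointer
tdt = 0xCC             # hardware tail pointer
next_to_use = 0xCC     # where the driver will queue the next descriptor
next_to_clean = 0xE0   # oldest descriptor the driver has not reclaimed
next_to_watch = 0xE4   # assumed: last descriptor of the packet at next_to_clean
time_stamp = 0x56300A18
jiffies = 0x56300B51

# Descriptors queued by the driver but not yet reclaimed.
uncleaned = ring_distance(next_to_clean, next_to_use)
# Descriptors between head and tail, i.e. still waiting on the hardware.
hw_pending = ring_distance(tdh, tdt)
# Descriptors spanned by the packet the driver is waiting on.
stuck_packet_len = ring_distance(next_to_clean, next_to_watch) + 1
# Age of that packet in jiffies (values are printed as 32-bit).
age_jiffies = (jiffies - time_stamp) & 0xFFFFFFFF

print(uncleaned, hw_pending, stuck_packet_len, age_jiffies)
```

With these numbers the ring is almost completely full, and the head pointer sits one descriptor into a packet that appears to span five descriptors. HZ is not given in the report, so the age is left in jiffies rather than converted to seconds.
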
> [r...@manage1 /root]# lspci_ether
>
> 05:00.0 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06)
> - (E1000_DEV_ID_82571EB_COPPER)
>
> 05:00.1 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06)
> - (E1000_DEV_ID_82571EB_COPPER)
>
> 06:00.0 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06)
> - (E1000_DEV_ID_82571EB_COPPER)
>
> 06:00.1 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06)
> - (E1000_DEV_ID_82571EB_COPPER)
>
> 07:00.0 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06)
> - (E1000_DEV_ID_82571EB_COPPER)
>
> 07:00.1 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06)
> - (E1000_DEV_ID_82571EB_COPPER)
>
> 08:00.0 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06)
> - (E1000_DEV_ID_82571EB_COPPER)
>
> 08:00.1 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06)
> - (E1000_DEV_ID_82571EB_COPPER)
>
> 0d:00.0 Ethernet controller: Intel Corporation: Unknown device 1096 (rev 01)
> - (E1000_DEV_ID_80003ES2LAN_COPPER_DPT)
>
> 0d:00.1 Ethernet controller: Intel Corporation: Unknown device 1096 (rev 01)
> - (E1000_DEV_ID_80003ES2LAN_COPPER_DPT)
>
> 0f:00.0 Ethernet controller: Intel Corporation: Unknown device 105f (rev 06)
> - (E1000_DEV_ID_82571EB_FIBER)
>
> 0f:00.1 Ethernet controller: Intel Corporation: Unknown device 105f (rev 06)
> - (E1000_DEV_ID_82571EB_FIBER)

You have a lot of ports in this machine, but that should be fine.

> ethtool -g eth4
> Ring parameters for eth4:
>
> Pre-set maximums:
> RX: 4096
> RX Mini: 0
> RX Jumbo: 0
> TX: 4096
> Current hardware settings:
> RX: 2048
> RX Mini: 0
> RX Jumbo: 0
> TX: 2048
>
>
> ethtool -k eth4
>
> Offload parameters for eth4:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on

I know it will use more cpu but does the problem repro if you turn off
the above two?

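For reference, the experiment amounts to an `ethtool -K` call on the affected port. The sketch below only builds and prints the command lines so they can be reviewed before running them as root; the helper name offload_cmd is hypothetical, and eth4 is assumed to be the interface under test.

```python
import shlex


def offload_cmd(iface, **features):
    """Build an `ethtool -K` argv toggling the given features; does not run it."""
    argv = ["ethtool", "-K", iface]
    for name, enabled in features.items():
        argv += [name, "on" if enabled else "off"]
    return argv


# Turn off tx checksumming and scatter-gather for the test...
disable = offload_cmd("eth4", tx=False, sg=False)
# ...and the matching command to put them back afterwards.
restore = offload_cmd("eth4", tx=True, sg=True)

print(shlex.join(disable))
print(shlex.join(restore))
```

It is worth re-running `ethtool -k eth4` after the change, since other offloads (TSO in particular) may be switched off along with these two.
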
> tcp segmentation offload: on
> udp fragmentation offload: off
> generic segmentation offload: off
>
> System Info:
>
> Running kernel - 2.6.16.-13-1
> Openswan - 2.4.9 with klips
> cat /proc/interrupts
>
>            CPU0      CPU1      CPU2      CPU3      CPU4      CPU5      CPU6      CPU7
>    0:  40087329       273       274       274       274       273       273       241  IO-APIC-edge  timer
>    2:         0         0         0         0         0         0         0         0  XT-PIC  cascade
>    4:        10         0         0         0         1         0         0         0  IO-APIC-edge  serial
>    8:      3393         1         0         0         0         0         0         0  IO-APIC-edge  rtc
>   66:        63         0         0     80096         0         0         0         0  PCI-MSI  eth0
>   74:        63         0         0     80096         0         0         0         0  PCI-MSI  eth1
>   82:     80158         0         0         0         0         0         0         0  PCI-MSI  eth2
>   90:     80158         0         0         0         0         0         0         0  PCI-MSI  eth3
>   98:       256         0   5594913         0 168731027         0         0         0  PCI-MSI  eth4
>  106:       130         0   6517103         0         0 255948447         0         0  PCI-MSI  eth5
>  114:        64         0    100789         0         0         0         0         0  PCI-MSI  eth6
>  122:        68         0     87466         0         0         0         0         0  PCI-MSI  eth7
>  130:       252         0         0    466626         0         0         0         0  PCI-MSI  eth8
>  138:     30033         0         0   4989635         0         0         0         0  PCI-MSI  eth9
>  146:        62         0         0     80096         0         0         0         0  PCI-MSI  eth10
>  153:    557669         0         1         0         0         0         0         0  IO-APIC-level  libata
>  154:        62         0         0     80096         0         0         0         0  PCI-MSI  eth11
>  NMI:         0         0         0         0         0         0         0         0
>  LOC:  40086777  40087580  40087468  40087495  40083411  40083410  40086663  40086021
>  ERR:         0
>  MIS:         0
>
> This machine is a IPSEC Gateway and we are using openswan
> 2.4.9 with klips for VPN.
>
> Possible suspect for this Hang is a Fragmented UDP packet
> coming/going on eth4 with datasize 32560 size over VPN tunnel. (eth4 <->
> ipsec0 <-> eth5)
>
> Without VPN tunnel, I am not observing the hangs with same
> size of UDP packets.
>
> Let me know if you need more information on this.

I think that is an extremely good clue. Please try the experiment
mentioned above with disabling tx csum offload and tx sg. The stack
could be handing down a packet that is unusually long or formatted
strangely that could hang up our offload setup for tx csum.

Also, are you running any traffic shaping via tc or netfilter rules?

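To get a feel for what a 32560-byte UDP datagram looks like once fragmented, here is a back-of-the-envelope sketch. Everything beyond the 32560 figure is an assumption: a 1500-byte MTU, a 20-byte IPv4 header with no options, "datasize" meaning UDP payload, and no allowance for the ESP/tunnel overhead on ipsec0 (which would lower the effective MTU). The function name ipv4_fragments is made up for illustration.

```python
def ipv4_fragments(udp_payload, mtu=1500, ip_header=20, udp_header=8):
    """Return (fragment count, bytes per full fragment, bytes in last fragment)."""
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    # The UDP header travels inside the first fragment's data.
    total = udp_payload + udp_header
    count = -(-total // per_fragment)  # ceiling division
    last = total - (count - 1) * per_fragment
    return count, per_fragment, last


count, per_fragment, last = ipv4_fragments(32560)
print(count, per_fragment, last)
```

Under those assumptions 32560 is exactly 22 full fragments' worth of data, so the 8-byte UDP header pushes the datagram into a 23rd fragment carrying only 8 bytes. Whether that has anything to do with the hang is not established; it just describes the traffic the test is generating.
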
_______________________________________________
E1000-devel mailing list
https://lists.sourceforge.net/lists/listinfo/e1000-devel
