On Fri, 12 Feb 2010, Nishit Shah wrote:

> Hi,
> 
>                 I am getting Tx hangs with e1000e-1.0.15 driver. Attached
> logs below.

Is there a chance you can try 1.1.2?  Do you have jumbo frames enabled?

> Feb 10 06:05:11 1265762111 kernel: e1000: eth4: e1000_clean_tx_irq: Detected Tx Unit Hang
> Feb 10 06:05:11 1265762111 kernel:   Tx Queue             <0>
> Feb 10 06:05:11 1265762111 kernel:   TDH                  <e1>
> Feb 10 06:05:11 1265762111 kernel:   TDT                  <cc>
> Feb 10 06:05:11 1265762111 kernel:   next_to_use          <cc>
> Feb 10 06:05:11 1265762111 kernel:   next_to_clean        <e0>
> Feb 10 06:05:11 1265762111 kernel: buffer_info[next_to_clean]
> Feb 10 06:05:11 1265762111 kernel:   time_stamp           <56300a18>
> Feb 10 06:05:11 1265762111 kernel:   next_to_watch        <e4>
> Feb 10 06:05:11 1265762111 kernel:   jiffies              <56300b51>
> Feb 10 06:05:11 1265762111 kernel:   next_to_watch.status <0>
> Feb 10 06:05:13 1265762113 kernel: e1000: eth4: e1000_clean_tx_irq: Detected Tx Unit Hang

Looks like something is really hanging.  If you turn off UDP checksum 
offload (tx-checksumming, and maybe scatter-gather) with ethtool, does it 
start working?
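
For reference, the register values in the log above can be decoded with a 
little shell arithmetic.  This is only a sketch: it assumes the values are 
hex as printed and that the Tx ring size is the 2048 reported by ethtool -g 
for eth4.  The variable names are illustrative, not the driver's.

```bash
#!/usr/bin/env bash
# Values copied from the "Detected Tx Unit Hang" log (all printed in hex).
tdh=0xe1; tdt=0xcc
next_to_clean=0xe0; next_to_watch=0xe4
time_stamp=0x56300a18; jiffies_now=0x56300b51
ring_size=2048   # "TX: 2048" from the ethtool -g output (assumed to apply here)

# Descriptors between hardware head (TDH) and tail (TDT), modulo the ring size.
pending=$(( (tdt - tdh + ring_size) % ring_size ))

# Descriptors from next_to_clean through next_to_watch, inclusive.
span=$(( (next_to_watch - next_to_clean + ring_size) % ring_size + 1 ))

# Age of the oldest uncleaned buffer, in jiffies.
age=$(( jiffies_now - time_stamp ))

echo "pending=$pending span=$span age_jiffies=$age"
```

This prints pending=2027 span=5 age_jiffies=313, so the ring is nearly full 
behind the head, and the descriptor the driver is watching (0xe4) still has 
a status of 0.  How long 313 jiffies is in seconds depends on the kernel's 
HZ setting, which is not shown in the report.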

If this is reproducible, I would like to see the output of the e1000_dump 
routine at the time of the hang, but with 2048 descriptors it will be 
really huge (and will probably overrun syslog).  I would need to prepare a 
version (or patch) of 1.0.15 or 1.1.2 with the e1000_dump code enabled.

Is it always the same interface?

>                 [r...@manage1 /root]# lspci_ether
> 
> 05:00.0 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06) - (E1000_DEV_ID_82571EB_COPPER)
> 05:00.1 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06) - (E1000_DEV_ID_82571EB_COPPER)
> 06:00.0 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06) - (E1000_DEV_ID_82571EB_COPPER)
> 06:00.1 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06) - (E1000_DEV_ID_82571EB_COPPER)
> 07:00.0 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06) - (E1000_DEV_ID_82571EB_COPPER)
> 07:00.1 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06) - (E1000_DEV_ID_82571EB_COPPER)
> 08:00.0 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06) - (E1000_DEV_ID_82571EB_COPPER)
> 08:00.1 Ethernet controller: Intel Corporation: Unknown device 105e (rev 06) - (E1000_DEV_ID_82571EB_COPPER)
> 0d:00.0 Ethernet controller: Intel Corporation: Unknown device 1096 (rev 01) - (E1000_DEV_ID_80003ES2LAN_COPPER_DPT)
> 0d:00.1 Ethernet controller: Intel Corporation: Unknown device 1096 (rev 01) - (E1000_DEV_ID_80003ES2LAN_COPPER_DPT)
> 0f:00.0 Ethernet controller: Intel Corporation: Unknown device 105f (rev 06) - (E1000_DEV_ID_82571EB_FIBER)
> 0f:00.1 Ethernet controller: Intel Corporation: Unknown device 105f (rev 06) - (E1000_DEV_ID_82571EB_FIBER)

You have a lot of ports in this machine, but that should be fine.

>                 ethtool -g eth4
>                                 Ring parameters for eth4:
> 
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096
> Current hardware settings:
> RX:             2048
> RX Mini:        0
> RX Jumbo:       0
> TX:             2048
> 
> 
>                 ethtool -k eth4
> 
> Offload parameters for eth4:
> rx-checksumming: on

> tx-checksumming: on
> scatter-gather: on

I know it will use more CPU, but does the problem reproduce if you turn 
off the above two?
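
A minimal sketch of that experiment is below.  The helper function names 
and the ETHTOOL override variable are hypothetical conveniences for a dry 
run; only the ethtool -K and -k options themselves are standard.  On many 
kernels, turning scatter-gather off also turns TSO off, so expect that line 
to change too.

```bash
#!/usr/bin/env bash
# Hypothetical helpers. Set ETHTOOL=echo to print the arguments
# instead of changing the NIC (a dry run).

# Turn off Tx checksum offload and scatter-gather on one interface.
disable_tx_offloads() {
    local iface=$1
    "${ETHTOOL:-ethtool}" -K "$iface" tx off sg off
}

# Show the current offload settings so the change can be confirmed.
show_offloads() {
    local iface=$1
    "${ETHTOOL:-ethtool}" -k "$iface"
}

# Usage, as root:
#   disable_tx_offloads eth4 && show_offloads eth4
```

The settings do not persist across a driver reload or reboot, so they can 
be restored with "ethtool -K eth4 tx on sg on" once the test is done.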

> tcp segmentation offload: on
> udp fragmentation offload: off
> generic segmentation offload: off

> 
>                 System Info:
> 
>                                 Running kernel - 2.6.16.-13-1
>                                 Openswan - 2.4.9 with klips
>                                 cat /proc/interrupts
> 
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:   40087329        273        274        274        274        273        273        241      IO-APIC-edge  timer
>   2:          0          0          0          0          0          0          0          0            XT-PIC  cascade
>   4:         10          0          0          0          1          0          0          0      IO-APIC-edge  serial
>   8:       3393          1          0          0          0          0          0          0      IO-APIC-edge  rtc
>  66:         63          0          0      80096          0          0          0          0           PCI-MSI  eth0
>  74:         63          0          0      80096          0          0          0          0           PCI-MSI  eth1
>  82:      80158          0          0          0          0          0          0          0           PCI-MSI  eth2
>  90:      80158          0          0          0          0          0          0          0           PCI-MSI  eth3
>  98:        256          0    5594913          0  168731027          0          0          0           PCI-MSI  eth4
> 106:        130          0    6517103          0          0  255948447          0          0           PCI-MSI  eth5
> 114:         64          0     100789          0          0          0          0          0           PCI-MSI  eth6
> 122:         68          0      87466          0          0          0          0          0           PCI-MSI  eth7
> 130:        252          0          0     466626          0          0          0          0           PCI-MSI  eth8
> 138:      30033          0          0    4989635          0          0          0          0           PCI-MSI  eth9
> 146:         62          0          0      80096          0          0          0          0           PCI-MSI  eth10
> 153:     557669          0          1          0          0          0          0          0     IO-APIC-level  libata
> 154:         62          0          0      80096          0          0          0          0           PCI-MSI  eth11
> NMI:          0          0          0          0          0          0          0          0
> LOC:   40086777   40087580   40087468   40087495   40083411   40083410   40086663   40086021
> ERR:          0
> MIS:          0
> 
> 
> 
>                 This machine is a IPSEC Gateway and we are using openswan
> 2.4.9 with klips for VPN.
> 
>                 Possible suspect for this Hang is a Fragmented UDP packet
> coming/going on eth4 with datasize 32560 size over VPN tunnel. (eth4 <->
> ipsec0 <-> eth5)
> 
>                 Without VPN tunnel, I am not observing the hangs with same
> size of UDP packets.
> 
>                 Let me know if you need more information on this.

I think that is an extremely good clue.  Please try the experiment 
mentioned above of disabling tx csum offload and tx sg.  The stack 
could be handing down a packet that is unusually long or strangely 
formatted, which could hang up our offload setup for tx csum.
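
To get a rough feel for how such a datagram is split up before it reaches 
the driver, here is a fragment count.  None of the assumptions are 
confirmed by the report: that 32560 is the UDP payload in bytes, that the 
MTU is 1500, that the IPv4 header is 20 bytes with no options, and that 
ESP/tunnel overhead can be ignored.

```bash
#!/usr/bin/env bash
payload=32560   # "datasize 32560" from the report (assumed UDP payload bytes)
mtu=1500        # assumption: standard Ethernet MTU
ip_hdr=20       # assumption: IPv4 header without options
udp_hdr=8       # UDP header length

# Data bytes per fragment; fragment offsets are in units of 8 bytes.
per_frag=$(( (mtu - ip_hdr) / 8 * 8 ))

# Bytes to be fragmented: UDP header plus payload.
total=$(( payload + udp_hdr ))

# Number of fragments (ceiling division) and size of the last one.
frags=$(( (total + per_frag - 1) / per_frag ))
last=$(( total - (frags - 1) * per_frag ))

echo "per_frag=$per_frag total=$total frags=$frags last=$last"
```

Under those assumptions the datagram becomes 22 full fragments plus one 
trailing fragment carrying only 8 bytes of data.  The real sizes inside the 
tunnel will differ, so treat this as an illustration of the arithmetic, not 
a diagnosis.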

Also, are you running any traffic shaping via tc or netfilter rules?


_______________________________________________
E1000-devel mailing list
https://lists.sourceforge.net/lists/listinfo/e1000-devel