On Fri, 12/02/2010 at 07:59 -0800, Duyck, Alexander H wrote:
> Покотиленко Костик wrote:
> > On Tue, 09/02/2010 at 23:34 +0200, Покотиленко Костик wrote:
> >
> >>>> Also if ACPI is having an effect on the issue, one other thing you
> >>>> might try changing in the BIOS would be to disable all CPU
> >>>> C-states. The system will consume more power as a result, but the
> >>>> CPU also usually ends up being much more responsive, and we have
> >>>> seen in the past that this can sometimes resolve performance
> >>>> issues.
> >>>
> >>> I'll turn those off:
> >>>
> >>> CPU C State=1  ;Options: 1=Enabled; 0=Disabled
> >>> C1E=1          ;Options: 1=Enabled; 0=Disabled
> >>
> >> Turned off "CPU C State" and "Spread spectrum"; C1E turned off
> >> automatically.
> >
> > With "CPU C State" and "Spread spectrum" turned off, after 47 hours I
> > got:
> >
> > NETDEV WATCHDOG: eth1 (igb): transmit timed out
> > Modules linked in: ...
> > Call Trace:
> > ...
> >
> > To summarize:
> >
> > - No kernel (2.6.29, 2.6.30) and driver combination solved the problem
> > - None of the BIOS options helped
> > - I've figured out that when "Tx unit hang" hits the 2 configured
> >   ports, the loopback test fails on the 2 unconfigured/unused ports
> >   as well
> > - When the NIC stops working, the rest of the system feels OK
> >
> > So the problem is localized a bit, but its source is still not clear.
> > Is it hardware related or software...
> >
> > Also, the system is in use by ~300 customers, so more downtime than
> > we already have is not desirable.
> >
> > The server has 2 onboard NICs, with one of which we have had a
> > similar problem, and a PCI-e quad-port NIC.
> >
> > We can still live with 2 NICs, so one option for further testing I
> > see is to go back to using the onboard NICs, put the PCI-e quad-port
> > NIC into another server I support, and do a loopback (Port1<->Port2,
> > Port3<->Port4) stress test; but that server runs a 2.6.26 kernel
> > (changing it is not an option).
> >
> > Let me know what you think and what other options for further testing
> > there are. I'm going to try 2.6.32 before switching the NIC to
> > another server. I did not do this before because there were issues
> > backporting it to Lenny.
>
> At this point it feels like we have pretty much eliminated the drivers
> as being the issue, since the unused pair of ports is affected by
> whatever is causing the first pair to fail. The issue most likely
> resides somewhere in the path between the on-board PCIe bridge and the
> PCIe root complex on the system.
> I think testing the NIC in another system would be our best option for
> now. This will help to determine if the problem is something in the
> PCIe bridge on the adapter, or a problem in the root complex of the
> server. If the issue follows the adapter you will likely need to get
> it replaced, but if the issue disappears we will need to start
> investigating all BIOS options on the system related to PCIe.

I posted the BIOS options of the system along with the changes I was
making; I can repost the current BIOS config if you like.

Well, I switched to 2.6.32 today. It's too early to say anything, but
some interesting results are already available:

- The "Tx unit hang" has happened 4 times in 10 hours, and only on the
  one NIC facing the clients.
- At the time it appeared in the logs there was no noticeable traffic
  degradation.
- Noticeable traffic degradation came in some cases before the log
  entries, in some cases after. The delay is > 15 minutes.
- In one case the degradation was of an interesting kind:
  - there was a lot of traffic between the clients (eth1.11, eth1.12)
    and the servers (vlan eth0.23),
  - and no traffic at all between the clients (eth1.11, eth1.12) and
    the Internet (vlan eth0.31).

This is strange, because eth0.23 and eth0.31 are vlans on eth0, and if
eth0 had died there would be no vlan traffic at all. If eth1 had died
there would be no traffic on eth1.11 and eth1.12, and there would be
nobody left to generate traffic with the servers. It looks like just
one vlan died in that one case.

The behaviour has also changed: connectivity is now re-established
within 1-10 minutes, and the system doesn't freeze or reboot. We didn't
see a single case like that with the prior kernels.

I have RRD graphs saved for those several Tx hang cases (Internet
detailed, plus traffic on all interfaces) if anybody is interested.
Below my signature are two rough sketches: a watcher that timestamps
the hang messages against per-interface counters, and the loopback
stress test I have in mind.

-- 
Покотиленко Костик <[email protected]>
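
First, a rough sketch (Python, needs root for the log file on some
setups) of the watcher: it tails the kernel log for the hang messages
and snapshots per-interface byte counters from sysfs, so a hang can
later be lined up with the RRD graphs. The log path and the interface
list are assumptions from my setup; adjust as needed.

#!/usr/bin/env python
# Rough sketch: tail the kernel log for igb hang messages and snapshot
# per-interface byte counters, so hangs can be lined up with the RRD
# graphs afterwards. LOG path and IFACES are assumptions from my setup.
import time

LOG = "/var/log/kern.log"
IFACES = ["eth0", "eth0.23", "eth0.31", "eth1", "eth1.11", "eth1.12"]
PATTERNS = ("Tx unit hang", "NETDEV WATCHDOG")
INTERVAL = 60                         # periodic counter sample, seconds

def counters(iface):
    # Read rx/tx byte counters for one interface from sysfs.
    stats = {}
    for name in ("rx_bytes", "tx_bytes"):
        try:
            f = open("/sys/class/net/%s/statistics/%s" % (iface, name))
            stats[name] = int(f.read())
            f.close()
        except IOError:
            stats[name] = -1          # interface missing or renamed
    return stats

def snapshot(reason):
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print("%s  %s" % (stamp, reason))
    for iface in IFACES:
        print("    %-10s %s" % (iface, counters(iface)))

def main():
    log = open(LOG)
    log.seek(0, 2)                    # start at the end, like tail -f
    last = 0
    snapshot("baseline")
    while True:
        line = log.readline()
        if line:
            if any(p in line for p in PATTERNS):
                snapshot("hang: " + line.strip())
            continue
        if time.time() - last >= INTERVAL:
            snapshot("periodic")
            last = time.time()
        time.sleep(1)

if __name__ == "__main__":
    main()

With the periodic samples it should be easy to see whether only one
vlan's counters stop moving while the others keep going, which is
exactly the odd case described above.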

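And here is the kind of loopback stress test I have in mind for the
quad-port NIC (Port1 cabled to Port2, Port3 to Port4): flood UDP out
of one port and count what the cabled peer port receives. Interface
names and addresses below are placeholders, and there is the usual
caveat: on a single box Linux delivers locally-owned addresses over
lo, so for real wire traffic the two ports need separate network
namespaces (or the receiver runs on the other server).

#!/usr/bin/env python
# Rough sketch of the Port1<->Port2 loopback stress test. Needs root
# (SO_BINDTODEVICE). DST and the interface names are placeholders.
# Caveat: on one host, put sender and receiver in separate network
# namespaces, otherwise the kernel short-circuits the traffic via lo.
import socket, sys, time

SO_BINDTODEVICE = 25                  # value from <linux/socket.h>
DST = ("192.168.101.2", 9000)         # address configured on the peer port

def sender(iface, seconds=600, size=1400):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, (iface + "\0").encode())
    payload = b"x" * size
    sent = 0
    end = time.time() + seconds
    while time.time() < end:
        s.sendto(payload, DST)
        sent += 1
    print("%s: sent %d packets" % (iface, sent))

def receiver(iface, seconds=600):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, (iface + "\0").encode())
    s.bind(("", DST[1]))
    s.settimeout(1)
    got = 0
    end = time.time() + seconds
    while time.time() < end:
        try:
            s.recv(2048)
            got += 1
        except socket.timeout:
            pass
    print("%s: received %d packets" % (iface, got))

if __name__ == "__main__":
    # usage: lbtest.py send eth2    /    lbtest.py recv eth3
    role, iface = sys.argv[1], sys.argv[2]
    (sender if role == "send" else receiver)(iface)

Comparing sent vs. received counts per port pair over a long run should
show whether the adapter itself reproduces the hang outside this
system's PCIe path.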