On Tue, 09 Feb 2010 10:46:46 +0200, "Покотиленко Костик" wrote:
> On Mon, 08/02/2010 at 14:03 -0800, Duyck, Alexander H wrote:
>> Покотиленко Костик wrote:
>> > On Fri, 29 Jan 2010 01:29:05 +0200, "Покотиленко Костик" wrote:
>> >
>> >> On Thu, 28/01/2010 at 14:32 -0800, Alexander Duyck wrote:
>> >>> On Wed, 2010-01-27 at 04:14 -0800, Покотиленко Костик wrote:
>> >>>> Using serial console I've figured out:
>> >>>>
>> >>>> - system working fine except for the NIC
>> >>>> - ifconfig shows only RX dropped increasing on eth1 (client side),
>> >>>>   other counters stalled.
>> >>>> - ethtool -t eth0:
>> >>>>
>> >>>> The test result is FAIL
>> >>>> The test extra info:
>> >>>> Register test (offline) 0
>> >>>> Eeprom test (offline) 0
>> >>>> Interrupt test (offline) 0
>> >>>> Loopback test (offline) 13
>> >>>> Link test (on/offline) 0
>> >>>>
>> >>>> - ethtool -t eth1
>> >>>>
>> >>>> The test result is FAIL
>> >>>> The test extra info:
>> >>>> Register test (offline) 0
>> >>>> Eeprom test (offline) 0
>> >>>> Interrupt test (offline) 0
>> >>>> Loopback test (offline) 13
>> >>>> Link test (on/offline) 0
>> >>>>
>> >>>> - After doing:
>> >>>>
>> >>>> ifdown -a; rmmod igb; rmmod dca; modprobe igb; ifup -a
>> >>>>
>> >>>> both ethtool commands (The test result is FAIL) and ifconfig show
>> >>>> the same result
>> >>>>
>> >>>> So it seems like a NIC hardware hang.
>> >>>
>> >>> The next time this occurs could you go through and run the ethtool
>> >>> test on all of the network ports? I'm wondering if it is only
>> >>> eth0/1 that are blocked or if eth3/4 are stopped as well.
>> >>
>> >> Sure.
>> >
>> > Last time we changed some BIOS options to:
>> >
>> > Execute Disable Bit: Disabled
>> > ACPI 1.0 Support: Enabled (When Disabled it's 3.0(??))
>> >
>> > After which the system worked for almost 9 days with 2.6.30. Then the
>> > same problem.
>> >
>> > Forgot to do ethtool test for all ports :/

Well, it happened again, ethtool -t "Loopback test" failed for all 4 ports.
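
A minimal sketch of how the loopback result could be collected from every port in one pass the next time the hang occurs. The helper names `loopback_code` and `check_ports` and the interface names in the usage line are hypothetical, not from the thread; the parsing assumes the "Loopback test (offline) N" output format quoted above.

```shell
# Hypothetical helper: read `ethtool -t` output on stdin and print the
# numeric result of the "Loopback test" line (0 = pass, non-zero = fail).
loopback_code() {
    awk '/Loopback test/ { print $NF }'
}

# Hypothetical helper: run the offline self-test on each named interface
# and print one summary line per port.
# Note: the offline test takes the port out of normal operation while it runs.
check_ports() {
    for dev in "$@"; do
        code=$(ethtool -t "$dev" offline 2>/dev/null | loopback_code)
        echo "$dev loopback=${code:-unknown}"
    done
}
```

Usage, assuming the four ports are named eth0 through eth3: `check_ports eth0 eth1 eth2 eth3`. Pairing this output with the lspci -tv dump taken at the same moment would show whether all ports behind one bridge fail together.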
>> Based on the results it seems like what is failing is the hardware's
>> ability to handle DMA transactions. Ideally if possible it would be
>> best if you could do an lspci -t dump of the system and work your way
>> up until you find at which point in the tree we have the failure. The
>> ethtool -t test seems to show the failure as a loopback test so we
>> should be able to at least test this up to the PCIe bridge on the
>> adapter.
>
> lspci -tv attached. lspci -tv during failure doesn't differ.

Also, it seems that more load makes it happen sooner. Average load here is 55Mbit/s (combined throughput between 2 ports), maximum is ~150Mbit/s.

> During the last 2 days the system rebooted twice shortly after the problem
> occurred, so no ethtool tests yet.
>
> BTW, I have many "UDP: bad checksum" messages before the issue occurs
> like this:
>
> Feb 8 18:49:16 lan-r kernel: [99067.458074] UDP: bad checksum. From 95.169.150.116:48810 to 89.28.200.210:1126 ulen 181
> Feb 8 18:49:24 lan-r kernel: [99074.976709] __ratelimit: 29 callbacks suppressed
>
> Also today there was:
>
> Feb 9 09:57:33 lan-r kernel: [53517.383722] igb 0000:03:00.1: Detected Tx Unit Hang
> Feb 9 09:57:33 lan-r kernel: [53517.383725] Tx Queue <0>
> Feb 9 09:57:33 lan-r kernel: [53517.383729] TDH <aa>
> Feb 9 09:57:33 lan-r kernel: [53517.383730] TDT <e8>
> Feb 9 09:57:33 lan-r kernel: [53517.383730] next_to_use <e8>
> Feb 9 09:57:33 lan-r kernel: [53517.383731] next_to_clean <aa>
> Feb 9 09:57:33 lan-r kernel: [53517.383732] buffer_info[next_to_clean]
> Feb 9 09:57:33 lan-r kernel: [53517.383732] time_stamp <cb1921>
> Feb 9 09:57:33 lan-r kernel: [53517.383733] next_to_watch <ab>
> Feb 9 09:57:33 lan-r kernel: [53517.383734] jiffies <cb1c48>
> Feb 9 09:57:33 lan-r kernel: [53517.383734] desc.status <158000>
>
> But the system is still alive.
>
>> Also if ACPI is having an effect on the issue one other thing you
>> might try changing in the BIOS would be to disable all CPU C-states.
>> The system will consume more power as a result, but the CPU also ends
>> up usually being much more responsive as a result, and we have seen in
>> the past that this can sometimes resolve performance issues.
>
> I'll turn those off:
>
> CPU C State=1 ;Options: 1=Enabled: 0=Disabled
> C1E=1 ;Options: 1=Enabled: 0=Disabled

Turned off "CPU C State" and "Spread spectrum"; C1E turned off automatically.

> Full current BIOS config attached.
>
> --
> Покотиленко Костик <[email protected]>

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
