On Tue, 26/01/2010 at 20:32 +0200, Покотиленко Костик wrote:
> On Tue, 26/01/2010 at 09:35 -0800, Duyck, Alexander H wrote:
> > Покотиленко Костик wrote:
> > > Hi,
> > > 
> > > Can somebody investigate please? Bug posted 19.01.2010.
> > > 
> > > I have tried:
> > > - 2.6.29 + igb 2.0.6
> > > - 2.6.30 + igb 2.0.6
> > > - 2.6.30 + igb 2.1.9
> > > 
> > > all resulting in a hard hang, loss of network, or a reboot at random
> > > after 1-20 hours.
> > > 
> > > I have only 3 more variations to try:
> > > - 2.6.30 + in kernel igb
> > > - 2.6.32 + in kernel igb
> > > - 2.6.32 + igb 2.1.9
> > > 
> 
> Today I switched to 2.6.30 + in-kernel igb 1.3.16-k2. It has been working
> fine for 6+ hours so far. I noticed that by default it uses 4 rx-queues and
> 4 tx-queues for each NIC and uses all available cores. 2.0.6 and 2.1.9 used
> 1 core per NIC by default.

With 2.6.30 + in-kernel igb 1.3.16-k2, after ~22 hours I got this (I copied
some entries from before the problem occurred):

=====================================================================
Jan 27 12:25:40 lan-r kernel: [80225.568489] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:25:45 lan-r kernel: [80231.221090] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 37
Jan 27 12:25:52 lan-r kernel: [80237.967603] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:01 lan-r kernel: [80247.133922] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:05 lan-r kernel: [80251.084100] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:15 lan-r kernel: [80261.218534] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:18 lan-r kernel: [80263.956672] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:21 lan-r kernel: [80266.517065] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 37
Jan 27 12:30:04 lan-r kernel: [80489.341414] oSHAP_vl_gr unclassified:
IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=40 TOS=0x00 PREC=0x00 TTL=64
ID=159 DF PROTO=UDP SPT=33818 DPT=123 LEN=20 UID=65534 GID=110 
Jan 27 12:30:04 lan-r kernel: [80489.341428] iSHAP_vl_gr unclassified:
IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1
DST=127.0.0.1 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=159 DF PROTO=UDP
SPT=33818 DPT=123 LEN=20 
Jan 27 12:32:01 lan-r kernel: [80606.331316] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:32:40 lan-r kernel: [80645.582000] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:34:40 lan-r kernel: [80765.028307] UDP: bad checksum. From
87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:35:13 lan-r kernel: [80798.149872] ------------[ cut
here ]------------
Jan 27 12:35:13 lan-r kernel: [80798.149882] WARNING: at
net/sched/sch_generic.c:226 dev_watchdog+0xa8/0x135()
Jan 27 12:35:13 lan-r kernel: [80798.149885] Hardware name: S3420GP
Jan 27 12:35:13 lan-r kernel: [80798.149888] NETDEV WATCHDOG: eth1
(igb): transmit timed out
Jan 27 12:35:13 lan-r kernel: [80798.149891] Modules linked in: tcp_diag
inet_diag act_police cls_u32 tun xt_IMQ cls_fw sch_htb bridge ebt_arp
ebt_pkttype ebt_ip ebtable_filter ebtables xt_CLASSIFY xt_physdev
xt_MARK iptable_mangle ipt_LOG ipt_REJECT xt_mark xt_conntrack
xt_hashlimit iptable_filter ipt_set xt_tcpudp xt_connlimit iptable_nat
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables
ip_set_nethash ip_set_iphash ip_set dm_snapshot dm_mirror dm_region_hash
dm_log dm_mod via686a eeprom lm80 i2c_viapro 8021q garp stp sch_esfq
nf_conntrack imq psmouse ide_generic ide_gd_mod ide_cd_mod cdrom snd_pcm
snd_timer joydev snd soundcore i2c_i801 evdev pcspkr snd_page_alloc
i2c_core button processor ext3 jbd mbcache usbhid hid sd_mod crc_t10dif
ide_pci_generic ide_core ata_generic ehci_hcd ata_piix libata scsi_mod
usbcore igb dca thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Jan 27 12:35:13 lan-r kernel: [80798.149984] Pid: 0, comm: swapper Not
tainted 2.6.30-ipset+imq+esfq+ipmark+route01 #1
Jan 27 12:35:13 lan-r kernel: [80798.149986] Call Trace:
Jan 27 12:35:13 lan-r kernel: [80798.149991]  [<c0127010>] ?
warn_slowpath_common+0x5e/0x8a
Jan 27 12:35:13 lan-r kernel: [80798.149993]  [<c029edac>] ?
dev_watchdog+0x0/0x135
Jan 27 12:35:13 lan-r kernel: [80798.149996]  [<c012706e>] ?
warn_slowpath_fmt+0x26/0x2a
Jan 27 12:35:13 lan-r kernel: [80798.149999]  [<c029ee54>] ?
dev_watchdog+0xa8/0x135
Jan 27 12:35:13 lan-r kernel: [80798.150002]  [<c0133d49>] ? insert_work
+0x71/0x78
Jan 27 12:35:13 lan-r kernel: [80798.150004]  [<c0134361>] ?
delayed_work_timer_fn+0x0/0x28
Jan 27 12:35:13 lan-r kernel: [80798.150007]  [<c012e3a7>] ?
run_timer_softirq+0x13d/0x19d
Jan 27 12:35:13 lan-r kernel: [80798.150010]  [<c029edac>] ?
dev_watchdog+0x0/0x135
Jan 27 12:35:13 lan-r kernel: [80798.150013]  [<c012b0ef>] ?
__do_softirq+0x8e/0x135
Jan 27 12:35:13 lan-r kernel: [80798.150016]  [<c012b1c4>] ? do_softirq
+0x2e/0x38
Jan 27 12:35:13 lan-r kernel: [80798.150019]  [<c012b2a7>] ? irq_exit
+0x26/0x53
Jan 27 12:35:13 lan-r kernel: [80798.150022]  [<c01049ea>] ? do_IRQ
+0x65/0x76
Jan 27 12:35:13 lan-r kernel: [80798.150025]  [<c01036c9>] ?
common_interrupt+0x29/0x30
Jan 27 12:35:13 lan-r kernel: [80798.150028]  [<c0108a65>] ? mwait_idle
+0x67/0x83
Jan 27 12:35:13 lan-r kernel: [80798.150030]  [<c010239a>] ? cpu_idle
+0x46/0x60
Jan 27 12:35:13 lan-r kernel: [80798.150032] ---[ end trace
6f8676d5d18ecd07 ]---
Jan 27 12:35:14 lan-r kernel: [80799.420356] br0: port 4(eth1.19)
entering disabled state
Jan 27 12:35:14 lan-r kernel: [80799.432335] br0: port 3(eth1.13)
entering disabled state
Jan 27 12:35:14 lan-r kernel: [80799.444320] br0: port 2(eth1.12)
entering disabled state
Jan 27 12:35:14 lan-r kernel: [80799.456304] br0: port 1(eth1.11)
entering disabled state
Jan 27 12:35:16 lan-r kernel: [80801.357392] igb: eth1 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 27 12:35:16 lan-r kernel: [80801.357828] br0: port 4(eth1.19)
entering learning state
Jan 27 12:35:16 lan-r kernel: [80801.357974] br0: port 3(eth1.13)
entering learning state
Jan 27 12:35:16 lan-r kernel: [80801.358120] br0: port 2(eth1.12)
entering learning state
Jan 27 12:35:16 lan-r kernel: [80801.358265] br0: port 1(eth1.11)
entering learning state
Jan 27 12:35:31 lan-r kernel: [80816.333741] br0: port 4(eth1.19)
entering forwarding state
Jan 27 12:35:31 lan-r kernel: [80816.333745] br0: port 3(eth1.13)
entering forwarding state
Jan 27 12:35:31 lan-r kernel: [80816.333749] br0: port 2(eth1.12)
entering forwarding state
Jan 27 12:35:31 lan-r kernel: [80816.333752] br0: port 1(eth1.11)
entering forwarding state
Jan 27 12:35:56 lan-r kernel: [80841.231248] oSHAP_vl_gr unclassified:
IN= OUT=lo SRC=192.168.1.1 DST=192.168.1.1 LEN=183 TOS=0x00 PREC=0xC0
TTL=64 ID=4485 PROTO=ICMP TYPE=3 CODE=1 [SRC=192.168.1.1 DST=192.168.1.3
LEN=155 TOS=0x00 PREC=0x00 TTL=64 ID=54724 DF PROTO=TCP SPT=52837
DPT=999 WINDOW=92 RES=0x00 ACK PSH URGP=0 ] 
Jan 27 12:35:56 lan-r kernel: [80841.231286] iSHAP_vl_gr unclassified:
IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=192.168.1.1
DST=192.168.1.1 LEN=183 TOS=0x00 PREC=0xC0 TTL=64 ID=4485 PROTO=ICMP
TYPE=3 CODE=1 [SRC=192.168.1.1 DST=192.168.1.3 LEN=155 TOS=0x00
PREC=0x00 TTL=64 ID=54724 DF PROTO=TCP SPT=52837 DPT=999 WINDOW=92
RES=0x00 ACK PSH URGP=0 ] 
Jan 27 12:36:22 lan-r kernel: [80867.127931] oSHAP_vl_gr unclassified:
IN= OUT=lo SRC=89.28.200.210 DST=89.28.200.210 LEN=104 TOS=0x00
PREC=0xC0 TTL=64 ID=7716 PROTO=ICMP TYPE=3 CODE=1 [SRC=89.28.200.210
DST=82.207.71.2 LEN=76 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP
SPT=123 DPT=123 LEN=56 ] 
Jan 27 12:36:22 lan-r kernel: [80867.127967] iSHAP_vl_gr unclassified:
IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00
SRC=89.28.200.210 DST=89.28.200.210 LEN=104 TOS=0x00 PREC=0xC0 TTL=64
ID=7716 PROTO=ICMP TYPE=3 CODE=1 [SRC=89.28.200.210 DST=82.207.71.2
LEN=76 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP SPT=123 DPT=123
LEN=56 ] 
Jan 27 12:36:37 lan-r kernel: [80882.292148] br0: port 4(eth1.19)
entering disabled state
Jan 27 12:36:37 lan-r kernel: [80882.304129] br0: port 3(eth1.13)
entering disabled state
Jan 27 12:36:37 lan-r kernel: [80882.316111] br0: port 2(eth1.12)
entering disabled state
Jan 27 12:36:37 lan-r kernel: [80882.328087] br0: port 1(eth1.11)
entering disabled state
Jan 27 12:36:39 lan-r kernel: [80884.109371] igb: eth1 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 27 12:36:39 lan-r kernel: [80884.109806] br0: port 4(eth1.19)
entering learning state
Jan 27 12:36:39 lan-r kernel: [80884.109952] br0: port 3(eth1.13)
entering learning state
Jan 27 12:36:39 lan-r kernel: [80884.110098] br0: port 2(eth1.12)
entering learning state
Jan 27 12:36:39 lan-r kernel: [80884.110243] br0: port 1(eth1.11)
entering learning state
Jan 27 12:36:49 lan-r kernel: [80893.374576] oSHAP_vl_gr unclassified:
IN= OUT=lo SRC=192.168.1.1 DST=192.168.1.1 LEN=183 TOS=0x00 PREC=0xC0
TTL=64 ID=4486 PROTO=ICMP TYPE=3 CODE=1 [SRC=192.168.1.1 DST=192.168.1.3
LEN=155 TOS=0x00 PREC=0x00 TTL=64 ID=54725 DF PROTO=TCP SPT=52837
DPT=999 WINDOW=92 RES=0x00 ACK PSH URGP=0 ] 
Jan 27 12:36:49 lan-r kernel: [80893.374613] iSHAP_vl_gr unclassified:
IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=192.168.1.1
DST=192.168.1.1 LEN=183 TOS=0x00 PREC=0xC0 TTL=64 ID=4486 PROTO=ICMP
TYPE=3 CODE=1 [SRC=192.168.1.1 DST=192.168.1.3 LEN=155 TOS=0x00
PREC=0x00 TTL=64 ID=54725 DF PROTO=TCP SPT=52837 DPT=999 WINDOW=92
RES=0x00 ACK PSH URGP=0 ] 
Jan 27 12:36:54 lan-r kernel: [80899.085720] br0: port 4(eth1.19)
entering forwarding state
Jan 27 12:36:54 lan-r kernel: [80899.085725] br0: port 3(eth1.13)
entering forwarding state
Jan 27 12:36:54 lan-r kernel: [80899.085728] br0: port 2(eth1.12)
entering forwarding state
Jan 27 12:36:54 lan-r kernel: [80899.085732] br0: port 1(eth1.11)
entering forwarding state
=====================================================================
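
As a rough aid for reading wrapped logs like the one above, the transmit
timeouts could be pulled out with a few lines of Python. This is only a
sketch: the function names are hypothetical, and it assumes each record
starts with a syslog timestamp ("Jan 27 12:35:13") and that any other line
is a mail-wrapped continuation of the previous record.

```python
import re

# A syslog record starts with a timestamp such as "Jan 27 12:35:13 ";
# mail wrapping leaves continuation lines without that prefix (assumption).
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")
WATCHDOG = re.compile(r"NETDEV WATCHDOG: (\S+) \((\w+)\): transmit timed out")
UPTIME = re.compile(r"\[\s*(\d+\.\d+)\]")


def unwrap(text):
    """Join mail-wrapped continuation lines back into whole log records."""
    records = []
    for line in text.splitlines():
        if RECORD_START.match(line) or not records:
            records.append(line.rstrip())
        else:
            records[-1] += " " + line.strip()
    return records


def watchdog_events(text):
    """Return (uptime_seconds, interface, driver) for each transmit timeout."""
    events = []
    for record in unwrap(text):
        hit = WATCHDOG.search(record)
        stamp = UPTIME.search(record)
        if hit and stamp:
            events.append((float(stamp.group(1)), hit.group(1), hit.group(2)))
    return events
```

Fed the excerpt above, this should report one timeout on eth1 at about
80798 s of uptime, i.e. roughly 22.4 hours, in line with the ~22 hours
mentioned.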

Using the serial console I've figured out:

- the system is working fine except for the NIC
- ifconfig shows only RX dropped increasing on eth1 (client side); the other
counters have stalled.
- ethtool -t eth0:

The test result is FAIL
The test extra info:
Register test  (offline)         0
Eeprom test    (offline)         0
Interrupt test (offline)         0
Loopback test  (offline)         13
Link test   (on/offline)         0

- ethtool -t eth1

The test result is FAIL
The test extra info:
Register test  (offline)         0
Eeprom test    (offline)         0
Interrupt test (offline)         0
Loopback test  (offline)         13
Link test   (on/offline)         0

- After doing:

ifdown -a; rmmod igb; rmmod dca; modprobe igb; ifup -a

both ethtool commands (The test result is FAIL) and ifconfig show the same
result.
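
To compare self-test results before and after a driver reload, the
"ethtool -t" output shown above could be reduced to just the failing rows.
A minimal sketch, assuming the row layout seen in this message (test name,
mode in parentheses, numeric result); failed_tests is a hypothetical helper,
and the meaning of a non-zero code is driver-specific and not stated in this
thread.

```python
import re

# One result row looks like: "Loopback test  (offline)         13"
ROW = re.compile(r"^(?P<name>.+?)\s+\((?P<mode>[^)]*)\)\s+(?P<code>-?\d+)\s*$")


def failed_tests(output):
    """Return {test name: result code} for rows whose code is non-zero."""
    failures = {}
    for line in output.splitlines():
        row = ROW.match(line.strip())
        if row and int(row.group("code")) != 0:
            failures[row.group("name")] = int(row.group("code"))
    return failures
```

For both outputs above this would leave only the loopback test, with
result 13.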

So it seems like a NIC hardware hang.
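
The counter observation (only RX dropped moving on eth1) could be
double-checked without ifconfig by sampling the per-interface statistics
files twice. A sketch, assuming the usual Linux sysfs layout under
/sys/class/net/<iface>/statistics; both helper names are hypothetical.

```python
from pathlib import Path


def read_counters(iface, root="/sys/class/net"):
    """Read every numeric counter under <root>/<iface>/statistics."""
    stats_dir = Path(root) / iface / "statistics"
    return {p.name: int(p.read_text()) for p in stats_dir.iterdir() if p.is_file()}


def moving_counters(before, after):
    """Return {counter: delta} for counters that changed between two samples."""
    return {
        name: after[name] - before[name]
        for name in sorted(before)
        if name in after and after[name] != before[name]
    }
```

Calling read_counters("eth1") twice a few seconds apart and passing both
samples to moving_counters would show which counters are still advancing;
in the hung state described here only rx_dropped would be expected to
appear.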

I don't think this problem is related to anything other than the NIC / igb
driver. If there were HW problems such as memory or power, I would notice
other system problems, not just the NIC, wouldn't I?

If I can do more testing, let me know. Moving the NIC to another server
isn't an option for me.

The server is quite new; could it be an IRQ-related problem, i.e. the
motherboard not being fully supported by kernels <=2.6.30?

> > > And please can somebody tell me which of the drivers is
> > > considered more stable, the one in the kernel or the one from sf.net?
> 
> > I'm curious.  You say the device is causing reboots.  Is this due to a
> > kernel panic followed by a reboot or does the system just reboot?
> 
> Regarding the last bug, ID 2934941: the system becomes disconnected from
> the network and at the same time a lot of "Detected Tx Unit Hang" messages
> are printed to the console and logs. Sometimes it just stays in this state
> (disconnected + errors being printed, but the system is responding);
> sometimes, after being in this state for a few minutes, it just reboots.
> 
> I didn't have any chance to see a "kernel panic" message. Most of the time
> the system becomes disconnected when there is nobody around it, so we just
> remotely power it down/up through a CLI such as IPMI.
> 
> Today I've set up a serial console connected to a nearby router with an
> independent Internet connection, so I can "see" what happens when it gets
> disconnected, and if it is still alive I can do a clean reboot.
> 
> >  If the entire system is rebooting I would suspect a bigger issue such
> > as problems in the system memory, power issues, or an issue in the
> > kernel.
> 
> Good guess, but until "Detected Tx Unit Hang" there are no other signs of
> any instability. Everything works perfectly until then.
> 
> > In 2907473 you mentioned also having SATA issues.  This leads me to
> > wonder if there is a problem with the Mainboard or components in the
> > system you are currently using.
> 
> In this case everything also worked perfectly until the NIC problems. I
> would have noticed; we have Nagios and Munin. Also, I was working on the
> console while a few of those problems occurred.
> 
> >   In the bug you mentioned that you had recently upgraded to this
> > server.  Would it be possible to try installing the ET Quad port
> > server adapter in that system and run the same tests that you are
> > currently running in this system.
> 
> If you mean installing the ET Quad port server adapter in the old system,
> that's impossible; it had a PCI-only board.
> 
> >   My main concern is that this issue could be due to something outside
> > of our control since the SATA seemed to be experiencing an I/O stall
> > at the same time as the network adapter.
> 
> Well, first, SATA and NIC problems popped up at the same time only in
> the 2907473 case with the 82574L. Now with the ET Quad port I don't see
> anything except NIC problems. Also, this hardware successfully compiles
> the kernel with CONCURRENCY_LEVEL=10; this has been done many times.
> 
> >   If we can test this in a known good platform we might be able to
> > verify if the issue is a problem in the server or not.
> 
> Agreed, but we don't have any spare server with PCI-e x4 v2.0 :(
> 
> > In the bugs that you filed you mentioned that you have been putting
> > additional patches on top of the kernel.  In the tests you have
> > recently done have any of the kernels you tested not included the
> > patches you mentioned?  If not you may want to try running just a
> > plain kernel and see if the same issues occur.
> 
> I thought about that. But the router is closely interconnected with
> billing software, and the whole solution requires ipset and imq. So
> making such a test means leaving the network down. Also, the problem may
> not occur for more than 20 hours. With the ET Quad port the record is
> ~36 hours.
> 
> -- 
> Покотиленко Костик <[email protected]>
> 
> 
> ------------------------------------------------------------------------------
> The Planet: dedicated and managed hosting, cloud storage, colocation
> Stay online with enterprise data centers and the best network in the business
> Choose flexible plans and management services without long-term contracts
> Personal 24x7 support from experience hosting pros just a phone call away.
> http://p.sf.net/sfu/theplanet-com
> _______________________________________________
> E1000-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/e1000-devel
> To learn more about Intel® Ethernet, visit 
> http://communities.intel.com/community/wired
-- 
Покотиленко Костик <[email protected]>


