On Tue, 26/01/2010 at 20:32 +0200, Покотиленко Костик wrote:
> On Tue, 26/01/2010 at 09:35 -0800, Duyck, Alexander H wrote:
> > Покотиленко Костик wrote:
> > > Hi,
> > >
> > > Can somebody investigate please? Bug posted 19.01.2010.
> > >
> > > I have tried:
> > > - 2.6.29 + igb 2.0.6
> > > - 2.6.30 + igb 2.0.6
> > > - 2.6.30 + igb 2.1.9
> > >
> > > all resulting in a deep hang, network down, or reboot after 1-20 hours,
> > > randomly.
> > >
> > > I have only 3 more variations to try:
> > > - 2.6.30 + in kernel igb
> > > - 2.6.32 + in kernel igb
> > > - 2.6.32 + igb 2.1.9
> > >
>
> Today I switched to 2.6.30 + in-kernel igb 1.3.16-k2. It has been working
> fine for 6+ hours so far. I noticed that by default it uses 4 rx queues and
> 4 tx queues for each NIC and uses all available cores. 2.0.6 and 2.1.9 used
> 1 core per NIC by default.
2.6.30 + in-kernel igb 1.3.16-k2: after ~22 hours I got this (I copied some
entries from before the problem occurred):

=====================================================================
Jan 27 12:25:40 lan-r kernel: [80225.568489] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:25:45 lan-r kernel: [80231.221090] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 37
Jan 27 12:25:52 lan-r kernel: [80237.967603] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:01 lan-r kernel: [80247.133922] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:05 lan-r kernel: [80251.084100] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:15 lan-r kernel: [80261.218534] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:18 lan-r kernel: [80263.956672] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:26:21 lan-r kernel: [80266.517065] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 37
Jan 27 12:30:04 lan-r kernel: [80489.341414] oSHAP_vl_gr unclassified: IN= OUT=lo SRC=127.0.0.1 DST=127.0.0.1 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=159 DF PROTO=UDP SPT=33818 DPT=123 LEN=20 UID=65534 GID=110
Jan 27 12:30:04 lan-r kernel: [80489.341428] iSHAP_vl_gr unclassified: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=127.0.0.1 DST=127.0.0.1 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=159 DF PROTO=UDP SPT=33818 DPT=123 LEN=20
Jan 27 12:32:01 lan-r kernel: [80606.331316] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:32:40 lan-r kernel: [80645.582000] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:34:40 lan-r kernel: [80765.028307] UDP: bad checksum. From 87.185.189.190:35465 to 89.28.200.210:1178 ulen 31
Jan 27 12:35:13 lan-r kernel: [80798.149872] ------------[ cut here ]------------
Jan 27 12:35:13 lan-r kernel: [80798.149882] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xa8/0x135()
Jan 27 12:35:13 lan-r kernel: [80798.149885] Hardware name: S3420GP
Jan 27 12:35:13 lan-r kernel: [80798.149888] NETDEV WATCHDOG: eth1 (igb): transmit timed out
Jan 27 12:35:13 lan-r kernel: [80798.149891] Modules linked in: tcp_diag inet_diag act_police cls_u32 tun xt_IMQ cls_fw sch_htb bridge ebt_arp ebt_pkttype ebt_ip ebtable_filter ebtables xt_CLASSIFY xt_physdev xt_MARK iptable_mangle ipt_LOG ipt_REJECT xt_mark xt_conntrack xt_hashlimit iptable_filter ipt_set xt_tcpudp xt_connlimit iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables ip_set_nethash ip_set_iphash ip_set dm_snapshot dm_mirror dm_region_hash dm_log dm_mod via686a eeprom lm80 i2c_viapro 8021q garp stp sch_esfq nf_conntrack imq psmouse ide_generic ide_gd_mod ide_cd_mod cdrom snd_pcm snd_timer joydev snd soundcore i2c_i801 evdev pcspkr snd_page_alloc i2c_core button processor ext3 jbd mbcache usbhid hid sd_mod crc_t10dif ide_pci_generic ide_core ata_generic ehci_hcd ata_piix libata scsi_mod usbcore igb dca thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Jan 27 12:35:13 lan-r kernel: [80798.149984] Pid: 0, comm: swapper Not tainted 2.6.30-ipset+imq+esfq+ipmark+route01 #1
Jan 27 12:35:13 lan-r kernel: [80798.149986] Call Trace:
Jan 27 12:35:13 lan-r kernel: [80798.149991] [<c0127010>] ? warn_slowpath_common+0x5e/0x8a
Jan 27 12:35:13 lan-r kernel: [80798.149993] [<c029edac>] ? dev_watchdog+0x0/0x135
Jan 27 12:35:13 lan-r kernel: [80798.149996] [<c012706e>] ? warn_slowpath_fmt+0x26/0x2a
Jan 27 12:35:13 lan-r kernel: [80798.149999] [<c029ee54>] ? dev_watchdog+0xa8/0x135
Jan 27 12:35:13 lan-r kernel: [80798.150002] [<c0133d49>] ? insert_work+0x71/0x78
Jan 27 12:35:13 lan-r kernel: [80798.150004] [<c0134361>] ? delayed_work_timer_fn+0x0/0x28
Jan 27 12:35:13 lan-r kernel: [80798.150007] [<c012e3a7>] ? run_timer_softirq+0x13d/0x19d
Jan 27 12:35:13 lan-r kernel: [80798.150010] [<c029edac>] ? dev_watchdog+0x0/0x135
Jan 27 12:35:13 lan-r kernel: [80798.150013] [<c012b0ef>] ? __do_softirq+0x8e/0x135
Jan 27 12:35:13 lan-r kernel: [80798.150016] [<c012b1c4>] ? do_softirq+0x2e/0x38
Jan 27 12:35:13 lan-r kernel: [80798.150019] [<c012b2a7>] ? irq_exit+0x26/0x53
Jan 27 12:35:13 lan-r kernel: [80798.150022] [<c01049ea>] ? do_IRQ+0x65/0x76
Jan 27 12:35:13 lan-r kernel: [80798.150025] [<c01036c9>] ? common_interrupt+0x29/0x30
Jan 27 12:35:13 lan-r kernel: [80798.150028] [<c0108a65>] ? mwait_idle+0x67/0x83
Jan 27 12:35:13 lan-r kernel: [80798.150030] [<c010239a>] ? cpu_idle+0x46/0x60
Jan 27 12:35:13 lan-r kernel: [80798.150032] ---[ end trace 6f8676d5d18ecd07 ]---
Jan 27 12:35:14 lan-r kernel: [80799.420356] br0: port 4(eth1.19) entering disabled state
Jan 27 12:35:14 lan-r kernel: [80799.432335] br0: port 3(eth1.13) entering disabled state
Jan 27 12:35:14 lan-r kernel: [80799.444320] br0: port 2(eth1.12) entering disabled state
Jan 27 12:35:14 lan-r kernel: [80799.456304] br0: port 1(eth1.11) entering disabled state
Jan 27 12:35:16 lan-r kernel: [80801.357392] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 27 12:35:16 lan-r kernel: [80801.357828] br0: port 4(eth1.19) entering learning state
Jan 27 12:35:16 lan-r kernel: [80801.357974] br0: port 3(eth1.13) entering learning state
Jan 27 12:35:16 lan-r kernel: [80801.358120] br0: port 2(eth1.12) entering learning state
Jan 27 12:35:16 lan-r kernel: [80801.358265] br0: port 1(eth1.11) entering learning state
Jan 27 12:35:31 lan-r kernel: [80816.333741] br0: port 4(eth1.19) entering forwarding state
Jan 27 12:35:31 lan-r kernel: [80816.333745] br0: port 3(eth1.13) entering forwarding state
Jan 27 12:35:31 lan-r kernel: [80816.333749] br0: port 2(eth1.12) entering forwarding state
Jan 27 12:35:31 lan-r kernel: [80816.333752] br0: port 1(eth1.11) entering forwarding state
Jan 27 12:35:56 lan-r kernel: [80841.231248] oSHAP_vl_gr unclassified: IN= OUT=lo SRC=192.168.1.1 DST=192.168.1.1 LEN=183 TOS=0x00 PREC=0xC0 TTL=64 ID=4485 PROTO=ICMP TYPE=3 CODE=1 [SRC=192.168.1.1 DST=192.168.1.3 LEN=155 TOS=0x00 PREC=0x00 TTL=64 ID=54724 DF PROTO=TCP SPT=52837 DPT=999 WINDOW=92 RES=0x00 ACK PSH URGP=0 ]
Jan 27 12:35:56 lan-r kernel: [80841.231286] iSHAP_vl_gr unclassified: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=192.168.1.1 DST=192.168.1.1 LEN=183 TOS=0x00 PREC=0xC0 TTL=64 ID=4485 PROTO=ICMP TYPE=3 CODE=1 [SRC=192.168.1.1 DST=192.168.1.3 LEN=155 TOS=0x00 PREC=0x00 TTL=64 ID=54724 DF PROTO=TCP SPT=52837 DPT=999 WINDOW=92 RES=0x00 ACK PSH URGP=0 ]
Jan 27 12:36:22 lan-r kernel: [80867.127931] oSHAP_vl_gr unclassified: IN= OUT=lo SRC=89.28.200.210 DST=89.28.200.210 LEN=104 TOS=0x00 PREC=0xC0 TTL=64 ID=7716 PROTO=ICMP TYPE=3 CODE=1 [SRC=89.28.200.210 DST=82.207.71.2 LEN=76 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP SPT=123 DPT=123 LEN=56 ]
Jan 27 12:36:22 lan-r kernel: [80867.127967] iSHAP_vl_gr unclassified: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=89.28.200.210 DST=89.28.200.210 LEN=104 TOS=0x00 PREC=0xC0 TTL=64 ID=7716 PROTO=ICMP TYPE=3 CODE=1 [SRC=89.28.200.210 DST=82.207.71.2 LEN=76 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP SPT=123 DPT=123 LEN=56 ]
Jan 27 12:36:37 lan-r kernel: [80882.292148] br0: port 4(eth1.19) entering disabled state
Jan 27 12:36:37 lan-r kernel: [80882.304129] br0: port 3(eth1.13) entering disabled state
Jan 27 12:36:37 lan-r kernel: [80882.316111] br0: port 2(eth1.12) entering disabled state
Jan 27 12:36:37 lan-r kernel: [80882.328087] br0: port 1(eth1.11) entering disabled state
Jan 27 12:36:39 lan-r kernel: [80884.109371] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 27 12:36:39 lan-r kernel: [80884.109806] br0: port 4(eth1.19) entering learning state
Jan 27 12:36:39 lan-r kernel: [80884.109952] br0: port 3(eth1.13) entering learning state
Jan 27 12:36:39 lan-r kernel: [80884.110098] br0: port 2(eth1.12) entering learning state
Jan 27 12:36:39 lan-r kernel: [80884.110243] br0: port 1(eth1.11) entering learning state
Jan 27 12:36:49 lan-r kernel: [80893.374576] oSHAP_vl_gr unclassified: IN= OUT=lo SRC=192.168.1.1 DST=192.168.1.1 LEN=183 TOS=0x00 PREC=0xC0 TTL=64 ID=4486 PROTO=ICMP TYPE=3 CODE=1 [SRC=192.168.1.1 DST=192.168.1.3 LEN=155 TOS=0x00 PREC=0x00 TTL=64 ID=54725 DF PROTO=TCP SPT=52837 DPT=999 WINDOW=92 RES=0x00 ACK PSH URGP=0 ]
Jan 27 12:36:49 lan-r kernel: [80893.374613] iSHAP_vl_gr unclassified: IN=lo OUT= MAC=00:00:00:00:00:00:00:00:00:00:00:00:08:00 SRC=192.168.1.1 DST=192.168.1.1 LEN=183 TOS=0x00 PREC=0xC0 TTL=64 ID=4486 PROTO=ICMP TYPE=3 CODE=1 [SRC=192.168.1.1 DST=192.168.1.3 LEN=155 TOS=0x00 PREC=0x00 TTL=64 ID=54725 DF PROTO=TCP SPT=52837 DPT=999 WINDOW=92 RES=0x00 ACK PSH URGP=0 ]
Jan 27 12:36:54 lan-r kernel: [80899.085720] br0: port 4(eth1.19) entering forwarding state
Jan 27 12:36:54 lan-r kernel: [80899.085725] br0: port 3(eth1.13) entering forwarding state
Jan 27 12:36:54 lan-r kernel: [80899.085728] br0: port 2(eth1.12) entering forwarding state
Jan 27 12:36:54 lan-r kernel: [80899.085732] br0: port 1(eth1.11) entering forwarding state
=====================================================================

Using the serial console I've figured out:
- the system is working fine except for the NIC
- ifconfig shows only RX dropped increasing on eth1 (client side); the other
  counters have stalled.
- ethtool -t eth0:
  The test result is FAIL
  The test extra info:
  Register test  (offline)   0
  Eeprom test    (offline)   0
  Interrupt test (offline)   0
  Loopback test  (offline)   13
  Link test   (on/offline)   0
- ethtool -t eth1:
  The test result is FAIL
  The test extra info:
  Register test  (offline)   0
  Eeprom test    (offline)   0
  Interrupt test (offline)   0
  Loopback test  (offline)   13
  Link test   (on/offline)   0
- After doing: ifdown -a; rmmod igb; rmmod dca; modprobe igb; ifup -a
  both ethtool commands (The test result is FAIL) and ifconfig show the
  same result.

So it seems like a NIC hardware hang.

I don't think this problem is related to anything other than the NIC / igb
driver. If there were HW problems like memory or power, I would notice other
system problems, not just the NIC, wouldn't I?

If I can do more testing, let me know. Moving the NIC to another server isn't
an option for me.

The server is quite new; could it be an IRQ-related problem, i.e. the
motherboard not being fully supported by <=2.6.30?

> > > And please can somebody tell which one of the drivers is to be
> > > considered more stable, the one in kernel or the one from sf.net?
> >
> > I'm curious. You say the device is causing reboots. Is this due to a
> > kernel panic followed by a reboot or does the system just reboot?
>
> Regarding the last bug, ID 2934941: the system becomes disconnected from the
> network and at the same time a lot of "Detected Tx Unit Hang" messages are
> printed to the console and logs. Sometimes it just stays in this state
> (disconnected + error being printed, but the system is responding);
> sometimes, after being in this state for a few minutes, it just reboots.
>
> I didn't have any chance to see a "kernel panic" message. Most of the time
> the system becomes disconnected when there is nobody around it, so we just
> remotely power it down/up through a CLI like IPMI.
>
> Today I've set up a serial console connected to a router nearby with an
> independent Internet connection, so I can "see" what happens when it gets
> disconnected, and if it is still alive I can do a clean reboot.
>
> > If the entire system is rebooting I would suspect a bigger issue such
> > as problems in the system memory, power issues, or an issue in the
> > kernel.
>
> Good guess, but until "Detected Tx Unit Hang" there are no other signs of
> any instability. Everything works perfectly until then.
>
> > In 2907473 you mentioned also having SATA issues. This leads me to
> > wonder if there is a problem with the Mainboard or components in the
> > system you are currently using.
>
> In this case everything also worked perfectly until the NIC problems. I
> would have noticed; we have Nagios and Munin. Also, I was working on the
> console while a few of those problems occurred.
>
> > In the bug you mentioned that you had recently upgraded to this
> > server. Would it be possible to try installing the ET Quad port
> > server adapter in that system and run the same tests that you are
> > currently running in this system.
>
> If you mean installing the ET Quad port server adapter in the old system,
> it's impossible: that was a PCI-only board.
>
> > My main concern is that this issue could be due to something outside
> > of our control since the SATA seemed to be experiencing an I/O stall
> > at the same time as the network adapter.
>
> Well, first, SATA and NIC problems popped up at the same time only in the
> 2907473 case with the 82574L. Now with the ET Quad port I don't see anything
> except NIC problems. Also, this hardware successfully compiles the kernel
> with CONCURRENCY_LEVEL=10; this has been done many times.
>
> > If we can test this in a known good platform we might be able to
> > verify if the issue is a problem in the server or not.
>
> Agreed, but we don't have any spare server with PCI-e x4 v2.0 :(
>
> > In the bugs that you filed you mentioned that you have been putting
> > additional patches on top of the kernel. In the tests you have
> > recently done have any of the kernels you tested not included the
> > patches you mentioned?
> > If not you may want to try running just a
> > plain kernel and see if the same issues occur.
>
> I thought about that. But the router is closely interconnected with
> billing software, and the whole solution requires ipset and imq. So
> making such a test means leaving the network down. Also, the problem may
> not occur for more than 20 hours. With the ET Quad port the record is
> ~36 hours.
>
> --
> Покотиленко Костик <[email protected]>
>
> ------------------------------------------------------------------------------
> The Planet: dedicated and managed hosting, cloud storage, colocation
> Stay online with enterprise data centers and the best network in the business
> Choose flexible plans and management services without long-term contracts
> Personal 24x7 support from experience hosting pros just a phone call away.
> http://p.sf.net/sfu/theplanet-com
> _______________________________________________
> E1000-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/e1000-devel
> To learn more about Intel® Ethernet, visit
> http://communities.intel.com/community/wired

--
Покотиленко Костик <[email protected]>
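
For anyone trying to reproduce this, here is a minimal sketch (Python, not
part of the original report) that pulls "NETDEV WATCHDOG ... transmit timed
out" events out of a saved kernel log and converts the bracketed kernel
timestamp into hours of uptime, which is how the "~22 hours" figure above can
be read off the log. The function name, the regular expression, and the
sample list are illustrative assumptions; the message format is taken only
from the log lines quoted in this message and may differ on other kernels.

```python
import re

# Assumed format, based only on the log lines quoted in this thread:
#   "... kernel: [80798.149888] NETDEV WATCHDOG: eth1 (igb): transmit timed out"
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): transmit timed out"
)


def find_tx_timeouts(lines):
    """Return (interface, driver, uptime_hours) for each watchdog timeout line."""
    events = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            # The bracketed value is seconds since boot; convert to hours.
            hours = float(match.group("uptime")) / 3600.0
            events.append(
                (match.group("iface"), match.group("driver"), round(hours, 2))
            )
    return events


# Two lines copied from the log in this message, used as sample input.
sample = [
    "Jan 27 12:35:13 lan-r kernel: [80798.149885] Hardware name: S3420GP",
    "Jan 27 12:35:13 lan-r kernel: [80798.149888] NETDEV WATCHDOG: eth1 (igb): transmit timed out",
]
```

Calling find_tx_timeouts(sample) yields one event for eth1 (igb); in practice
the lines would come from reading a log file such as /var/log/kern.log, whose
path depends on the distribution.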
