Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
On Mon, 12 Mar 2007, Chris Stromsoe wrote:
> On Thu, 8 Mar 2007, Chris Stromsoe wrote:
>> On Thu, 8 Mar 2007, Jay Vosburgh wrote:
>>> Chris Stromsoe [EMAIL PROTECTED] wrote:
>>>> 1) ip link set mtu 9000 eth2
>>>>      -- eth2 is no longer responsive
>>>>    ip link set mtu 1500 eth2
>>>>      -- eth2 remains unresponsive
>>>>
>>>> 2) ifup eth2
>>>>    ifdown eth2
>>>>    perl -pi -e 's/eth2/eth3/' /etc/network/interfaces
>>>>    ifup eth3
>>>>      -- locks up here
>>>
>>> This would seem to suggest a problem with skge itself, although
>>> there might be some other interaction with bonding that causes the
>>> problems for that case.
>>
>> In both of the above mentioned cases, I was not using bonding. That
>> was with the skge driver only.
>
> The above tests both work fine with the 2.6.20.1 sk98lin driver loaded
> as modprobe sk98lin RlmtMode=DualNet. I can change the MTU, add and
> remove eth2/eth3 from the bond, and up and down the interface.
>
> It also works fine with different hardware (e100, e1000, tg3, bnx2).
> Running both interfaces alone without the bonding driver also works (I
> can up and down the interfaces with no side effects).

Just an update - it looks like 2.6.20.1 fixed the MTU problem (1 above), but not the other problem (where the machine locks up if the second port on the dual-port card is downed).

To recap: I am using SysKonnect SK-9843 cards. The sk98lin driver works fine; the skge driver does not. The following sequence of commands locks up the server. The lock is a hard lock; console is not responsive to keyboard input or to sysrq. Nothing is printed on the serial console.

  ip li set eth2 up
  ip li set eth2 down
  ip li set eth3 up

There are no addresses assigned to either interface. This was done after a fresh boot. It is repeatable. If I do not down eth2, I can up eth3, assign addresses, and use both interfaces. The kernel is fresh from kernel.org and does not have any third party patches.

lspci -vv output:

:01:0a.0 Ethernet controller: Syskonnect (Schneider Koch) SK-98xx Gigabit Ethernet Server Adapter (rev 12)
        Subsystem: Syskonnect (Schneider Koch) SK-9844 Gigabit Ethernet Server Adapter (SK-NET GE-SX dual link)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 0x08 (32 bytes)
        Interrupt: pin A routed to IRQ 10
        Region 0: Memory at ff8fc000 (32-bit, non-prefetchable) [size=16K]
        Region 1: I/O ports at d800 [size=256]
        Expansion ROM at ff40 [disabled] [size=128K]
        Capabilities: <available only to root>

-Chris
-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
On Tue, 8 May 2007, Chris Stromsoe wrote:
> [...]
> Just an update - it looks like 2.6.20.1 fixed the MTU problem (1
> above), but not the other problem (where the machine locks up if the
> second port on the dual-port card is downed).
>
> To recap: I am using SysKonnect SK-9843 cards. The sk98lin driver
> works fine; the skge

I should proof-read first. The cards are SK-9844s, not SK-9843s. The rest of the prior message is still correct.

-Chris

> driver does not. The following sequence of commands locks up the
> server. The lock is a hard lock; console is not responsive to
> keyboard input or to sysrq. Nothing is printed on the serial console.
>
>   ip li set eth2 up
>   ip li set eth2 down
>   ip li set eth3 up
>
> There are no addresses assigned to either interface. This was done
> after a fresh boot. It is repeatable. If I do not down eth2, I can up
> eth3, assign addresses, and use both interfaces. The kernel is fresh
> from kernel.org and does not have any third party patches.
> [...]
skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
Within 2 or 3 minutes after issuing

  ip link set bond1 mtu 9000

I get one "NETDEV WATCHDOG: eth2: transmit timed out" to the console, and then this starts to repeat:

BUG: soft lockup detected on CPU#0!
 [<c0103667>] show_trace_log_lvl+0x19/0x2e
 [<c010368e>] show_trace+0x12/0x14
 [<c010377a>] dump_stack+0x14/0x16
 [<c0134964>] softlockup_tick+0x9f/0xae
 [<c0120f38>] run_local_timers+0x12/0x14
 [<c0120d89>] update_process_times+0x3e/0x63
 [<c010ceb0>] smp_apic_timer_interrupt+0x6f/0x7e
 [<c010337c>] apic_timer_interrupt+0x28/0x30
 [<d088d5b3>] skge_tx_clean+0x1d/0x87 [skge]
 [<d088d663>] skge_tx_timeout+0x46/0x4c [skge]
 [<c02b0854>] dev_watchdog+0x79/0xb9
 [<c0120ec9>] run_timer_softirq+0x10e/0x16b
 [<c011cebb>] __do_softirq+0x65/0xc3
 [<c0104d55>] do_softirq+0x54/0xbb
 ===

Once the soft lockups start, the machine becomes unresponsive and has to be power cycled.

bond1 is a dual-port syskonnect sk-98xx with both ports bonded in active-backup mode. The kernel is 2.6.20.1 with the web100 kernel patches from http://web100.org/. I'm building a plain 2.6.20.1 without the patches right now and will test when it finishes compiling.

lspci for the card shows:

fresno:~ # lspci -vv -s 02:01.0
:02:01.0 Ethernet controller: Syskonnect (Schneider Koch) SK-98xx Gigabit Ethernet Server Adapter (rev 11)
        Subsystem: Syskonnect (Schneider Koch) SK-9844 Gigabit Ethernet Server Adapter (SK-NET GE-SX dual link)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 0x08 (32 bytes)
        Interrupt: pin A routed to IRQ 22
        Region 0: Memory at febfc000 (32-bit, non-prefetchable) [size=16K]
        Region 1: I/O ports at e800 [size=256]
        Expansion ROM at febc [disabled] [size=128K]
        Capabilities: [48] Power Management version 1
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data

-Chris
Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
[SKGE]: Fix deadlock in skge_tx_timeout

dev_watchdog() already holds the device lock, don't take it again
in skge_tx_clean().

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 0b1cfafa6f6b8a168d5811d1f65cf540942c52b1
tree 4d3f252d6618adfe812e9da95cd496bb798e7c7b
parent 1ca949299260aa49eeba34ff912e2321c8b1f647
author Patrick McHardy [EMAIL PROTECTED] Sat, 24 Feb 2007 20:05:39 +0100
committer Patrick McHardy [EMAIL PROTECTED] Sat, 24 Feb 2007 20:05:39 +0100

 drivers/net/skge.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/skge.c b/drivers/net/skge.c
index e482e7f..4a948c2 100644
--- a/drivers/net/skge.c
+++ b/drivers/net/skge.c
@@ -2575,7 +2575,9 @@ static int skge_down(struct net_device *
 	skge_led(skge, LED_MODE_OFF);
 
 	netif_poll_disable(dev);
+	netif_tx_lock_bh(dev);
 	skge_tx_clean(dev);
+	netif_tx_unlock_bh(dev);
 	skge_rx_clean(skge);
 
 	kfree(skge->rx_ring.start);
@@ -2720,7 +2722,6 @@ static void skge_tx_clean(struct net_dev
 	struct skge_port *skge = netdev_priv(dev);
 	struct skge_element *e;
 
-	netif_tx_lock_bh(dev);
 	for (e = skge->tx_ring.to_clean; e != skge->tx_ring.to_use; e = e->next) {
 		struct skge_tx_desc *td = e->desc;
 		skge_tx_free(skge, e, td->control);
@@ -2729,7 +2730,6 @@ static void skge_tx_clean(struct net_dev
 	skge->tx_ring.to_clean = e;
 
 	netif_wake_queue(dev);
-	netif_tx_unlock_bh(dev);
 }
 
 static void skge_tx_timeout(struct net_device *dev)
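For readers who want to see the shape of this deadlock outside the kernel, here is a rough user-space analogy. It is not skge code and uses no kernel API. All names (tx_lock, tx_clean_locking, tx_clean_unlocked, watchdog_timeout) are made up for illustration, and a POSIX error-checking mutex stands in for the device transmit lock so that the second lock attempt reports EDEADLK instead of spinning forever the way the kernel did.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the per-device transmit lock. */
static pthread_mutex_t tx_lock;

/* Set up tx_lock as an error-checking mutex: relocking it from the
 * owning thread fails with EDEADLK rather than hanging. */
void tx_lock_init(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&tx_lock, &attr);
	pthread_mutexattr_destroy(&attr);
}

/* Pre-patch shape: the cleanup helper takes the lock itself. */
int tx_clean_locking(void)
{
	int err = pthread_mutex_lock(&tx_lock);

	if (err)
		return err;	/* EDEADLK if the caller already holds tx_lock */
	/* ... free pending transmit descriptors here ... */
	pthread_mutex_unlock(&tx_lock);
	return 0;
}

/* Post-patch shape: the caller is responsible for holding tx_lock. */
int tx_clean_unlocked(void)
{
	/* ... free pending transmit descriptors here ... */
	return 0;
}

/* Watchdog-like path: the lock is already held when the cleanup runs. */
int watchdog_timeout(int (*clean)(void))
{
	int err;

	pthread_mutex_lock(&tx_lock);
	err = clean();
	pthread_mutex_unlock(&tx_lock);
	return err;
}
```

Called after tx_lock_init(), watchdog_timeout(tx_clean_locking) returns EDEADLK, while watchdog_timeout(tx_clean_unlocked) returns 0. The patch makes the same move: the lock is taken by the caller (skge_down) and not by the helper, so the watchdog path no longer tries to take a lock it already owns.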
Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
Thanks. That fixes the soft lockup.

I've got another problem now. The cards I'm using are dual port (sk-9844). I am bonding both ports together. The card presents as eth2 and eth3. If I remove eth2 from the bond so that eth3 is the active interface, I get a hard lock (nothing prints to serial console, sysrq isn't responsive) and have to power cycle.

This is with plain 2.6.20.1. I also tested using skge.[ch] from the current netdev git tree.

-Chris

On Thu, 8 Mar 2007, Stephen Hemminger wrote:
> [SKGE]: Fix deadlock in skge_tx_timeout
>
> dev_watchdog() already holds the device lock, don't take it again
> in skge_tx_clean().
>
> Signed-off-by: Patrick McHardy [EMAIL PROTECTED]
> [...]
Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
Chris Stromsoe [EMAIL PROTECTED] wrote:
> Within 2 or 3 minutes after issuing
>
>   ip link set bond1 mtu 9000
>
> I get one "NETDEV WATCHDOG: eth2: transmit timed out" to the console,
> and then this starts to repeat:
>
> BUG: soft lockup detected on CPU#0!

I believe this is the same bug that is fixed by this change:

commit c4f283b1f275e5528c13c119e5cfc80cdba55d00
Author: Jay Vosburgh [EMAIL PROTECTED]
Date:   Wed Feb 28 17:03:20 2007 -0800

    bonding: fix double dev_add_pack

    Bonding can erroneously register the same packet_type to receive
    ARPs (for use by ARP validation): once at device open time, and
    once via sysfs. Since sysfs can change the validate setting (and
    thus register or unregister) at any time, a flag is needed to
    synchronize with device open in order to avoid double
    registrations, and the simplest place is within the packet_type
    structure itself. Double unregister is not an issue.

    Bug reported by Ulrich Oelmann [EMAIL PROTECTED].

    Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]
    Signed-off-by: Jeff Garzik [EMAIL PROTECTED]

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ea73ebf..68afcb5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3423,6 +3423,9 @@ void bond_register_arp(struct bonding *b
 {
 	struct packet_type *pt = &bond->arp_mon_pt;
 
+	if (pt->type)
+		return;
+
 	pt->type = htons(ETH_P_ARP);
 	pt->dev = NULL; /*bond->dev;XXX*/
 	pt->func = bond_arp_rcv;
@@ -3431,7 +3434,10 @@ void bond_register_arp(struct bonding *b
 
 void bond_unregister_arp(struct bonding *bond)
 {
-	dev_remove_pack(&bond->arp_mon_pt);
+	struct packet_type *pt = &bond->arp_mon_pt;
+
+	dev_remove_pack(pt);
+	pt->type = 0;
 }
 
 /* Hashing Policies -*/

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
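The guard in that change (use a field of the object itself as the "already registered" flag) can be sketched in plain user-space C. This is only an illustration under assumed names: struct handler, handlers, handler_count, register_arp and unregister_arp are hypothetical and are not the bonding driver's data structures; the list is a simple singly linked one rather than the kernel's packet_type list, and the htons() conversion is left out.

```c
#include <assert.h>
#include <stddef.h>

#define PROTO_ARP 0x0806	/* numeric value of ETH_P_ARP */

/* Hypothetical handler record; type == 0 doubles as "not registered". */
struct handler {
	unsigned short type;
	struct handler *next;
};

static struct handler *handlers;	/* head of the registration list */

/* Count the registered handlers by walking the list. */
int handler_count(void)
{
	int n = 0;
	struct handler *h;

	for (h = handlers; h != NULL; h = h->next)
		n++;
	return n;
}

/* Register h once; a second call while registered is a no-op. */
void register_arp(struct handler *h)
{
	if (h->type)
		return;
	h->type = PROTO_ARP;
	h->next = handlers;
	handlers = h;
}

/* Unlink h if present and clear the flag; safe to call twice. */
void unregister_arp(struct handler *h)
{
	struct handler **pp;

	for (pp = &handlers; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == h) {
			*pp = h->next;
			h->next = NULL;
			break;
		}
	}
	h->type = 0;	/* allow a later re-registration */
}
```

In this simplified list, pushing the same node twice without the guard would make the node point at itself, so any walk of the list would never terminate. That is only an analogy for why registering the same packet_type twice is harmful, not a description of the kernel's list implementation.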
Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
On Thu, 8 Mar 2007 13:31:13 -0800 (PST) Chris Stromsoe [EMAIL PROTECTED] wrote:
> Thanks. That fixes the soft lockup.
>
> I've got another problem now. The cards I'm using are dual port
> (sk-9844). I am bonding both ports together. The card presents as
> eth2 and eth3. If I remove eth2 from the bond so that eth3 is the
> active interface, I get a hard lock (nothing prints to serial
> console, sysrq isn't responsive) and have to power cycle.
>
> This is with plain 2.6.20.1. I also tested using skge.[ch] from the
> current netdev git tree.

Which form of bonding failover? There are locking issues with some of the bonding modes. You should ask on the bonding mailing list.

--
Stephen Hemminger [EMAIL PROTECTED]
Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
On Thu, 8 Mar 2007, Stephen Hemminger wrote:
> On Thu, 8 Mar 2007 13:31:13 -0800 (PST) Chris Stromsoe [EMAIL PROTECTED] wrote:
>> Thanks. That fixes the soft lockup.
>>
>> I've got another problem now. The cards I'm using are dual port
>> (sk-9844). I am bonding both ports together. The card presents as
>> eth2 and eth3. If I remove eth2 from the bond so that eth3 is the
>> active interface, I get a hard lock (nothing prints to serial
>> console, sysrq isn't responsive) and have to power cycle.
>>
>> This is with plain 2.6.20.1. I also tested using skge.[ch] from the
>> current netdev git tree.
>
> Which form of bonding failover? There are locking issues with some of
> the bonding modes. You should ask on the bonding mailing list.

It's active-backup. Testing with the same setup and e100 works fine. I've done a few tests without the bonding module, using the dual-port separately.

Testing with bonding and skge:

1) ifenslave bond0 eth2 eth3
   ifenslave -d bond0 eth3
   ifenslave -d bond0 eth2
     -- locks up here

2) ifenslave bond0 eth2 eth3
   ifenslave -d bond0 eth2
     -- locks up here

3) ifenslave bond0 eth3 eth2
   ifenslave -d bond0 eth2
   ifenslave bond0 eth2
   ifenslave bond0 -d eth3
   ifenslave bond0 eth3
   ifenslave -d bond0 eth2
     -- locks up here

Testing without bonding:

1) ip link set mtu 9000 eth2
     -- eth2 is no longer responsive
   ip link set mtu 1500 eth2
     -- eth2 remains unresponsive

2) ifup eth2
   ifdown eth2
   perl -pi -e 's/eth2/eth3/' /etc/network/interfaces
   ifup eth3
     -- locks up here

-Chris
Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
Chris Stromsoe [EMAIL PROTECTED] wrote:
> It's active-backup. Testing with the same setup and e100 works fine.
> I've done a few tests without the bonding module, using the dual-port
> separately.

Somebody else a couple of weeks ago was having similar issues running bonding with skge (in 802.3ad mode, in his case) that also vanished with different hardware. I don't have any skge hardware, so I can't test it here. His problem was a failure in 802.3ad negotiation, not a system lockup, though.

If you're running active-backup and not using the ARP monitor (arp_interval), then I'm not aware of any possible locking problems in bonding for the kernel version you reference (2.6.20.1).

> 1) ip link set mtu 9000 eth2
>      -- eth2 is no longer responsive
>    ip link set mtu 1500 eth2
>      -- eth2 remains unresponsive
>
> 2) ifup eth2
>    ifdown eth2
>    perl -pi -e 's/eth2/eth3/' /etc/network/interfaces
>    ifup eth3
>      -- locks up here

This would seem to suggest a problem with skge itself, although there might be some other interaction with bonding that causes the problems for that case.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)
On Thu, 8 Mar 2007, Jay Vosburgh wrote:
> If you're running active-backup and not using the ARP monitor
> (arp_interval), then I'm not aware of any possible locking problems
> in bonding for the kernel version you reference (2.6.20.1).

I'm not using arp_interval.

On Thu, 8 Mar 2007, Jay Vosburgh wrote:
> Chris Stromsoe [EMAIL PROTECTED] wrote:
>> 1) ip link set mtu 9000 eth2
>>      -- eth2 is no longer responsive
>>    ip link set mtu 1500 eth2
>>      -- eth2 remains unresponsive
>>
>> 2) ifup eth2
>>    ifdown eth2
>>    perl -pi -e 's/eth2/eth3/' /etc/network/interfaces
>>    ifup eth3
>>      -- locks up here
>
> This would seem to suggest a problem with skge itself, although there
> might be some other interaction with bonding that causes the problems
> for that case.

In both of the above mentioned cases, I was not using bonding. That was with the skge driver only.

-Chris