Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-05-08 Thread Chris Stromsoe

On Mon, 12 Mar 2007, Chris Stromsoe wrote:

On Thu, 8 Mar 2007, Chris Stromsoe wrote:

On Thu, 8 Mar 2007, Jay Vosburgh wrote:

Chris Stromsoe [EMAIL PROTECTED] wrote:


1) ip link set mtu 9000 eth2  -- eth2 is no longer responsive
   ip link set mtu 1500 eth2  -- eth2 remains unresponsive

2) ifup eth2
   ifdown eth2

   perl -pi -e 's/eth2/eth3/' /etc/network/interfaces

   ifup eth3   --  locks up here


	This would seem to suggest a problem with skge itself, although 
there might be some other interaction with bonding that causes the 
problems for that case.


In both of the above mentioned cases, I was not using bonding.  That 
was with the skge driver only.


The above tests both work fine with the 2.6.20.1 sk98lin driver loaded 
as modprobe sk98lin RlmtMode=DualNet.


I can change the MTU, add and remove eth2/eth3 from the bond, and up and 
down the interface.  It also works fine with different hardware (e100, 
e1000, tg3, bnx2).  Running both interfaces alone without the bonding 
driver also works (I can up and down the interfaces with no 
side effects).


Just an update - it looks like 2.6.20.1 fixed the MTU problem (1 
above), but not the other problem (where the machine locks up if the 
second port on the dual-port card is downed).


To recap:

I am using SysKonnect SK-9843 cards.  The sk98lin driver works fine; the 
skge driver does not.  The following sequence of commands locks up the 
server.  The lock is a hard lock; console is not responsive to keyboard 
input or to sysrq.  Nothing is printed on the serial console.



  ip li set eth2 up
  ip li set eth2 down
  ip li set eth3 up


There are no addresses assigned to either interface.  This was done after 
a fresh boot.  It is repeatable.  If I do not down eth2, I can up eth3, 
assign addresses, and use both interfaces.


The kernel is fresh from kernel.org and does not have any third party 
patches.


lspci -vv output:

:01:0a.0 Ethernet controller: Syskonnect (Schneider & Koch) SK-98xx Gigabit Ethernet Server Adapter (rev 12)
	Subsystem: Syskonnect (Schneider & Koch) SK-9844 Gigabit Ethernet Server Adapter (SK-NET GE-SX dual link)
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 0x08 (32 bytes)
	Interrupt: pin A routed to IRQ 10
	Region 0: Memory at ff8fc000 (32-bit, non-prefetchable) [size=16K]
	Region 1: I/O ports at d800 [size=256]
	Expansion ROM at ff40 [disabled] [size=128K]
	Capabilities: <available only to root>




-Chris
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-05-08 Thread Chris Stromsoe

On Tue, 8 May 2007, Chris Stromsoe wrote:

On Mon, 12 Mar 2007, Chris Stromsoe wrote:

On Thu, 8 Mar 2007, Chris Stromsoe wrote:

On Thu, 8 Mar 2007, Jay Vosburgh wrote:

Chris Stromsoe [EMAIL PROTECTED] wrote:


1) ip link set mtu 9000 eth2  -- eth2 is no longer responsive
   ip link set mtu 1500 eth2  -- eth2 remains unresponsive

2) ifup eth2
   ifdown eth2

   perl -pi -e 's/eth2/eth3/' /etc/network/interfaces

   ifup eth3   --  locks up here


	This would seem to suggest a problem with skge itself, although there 
might be some other interaction with bonding that causes the problems for 
that case.


In both of the above mentioned cases, I was not using bonding.  That was 
with the skge driver only.


The above tests both work fine with the 2.6.20.1 sk98lin driver loaded as 
modprobe sk98lin RlmtMode=DualNet.


I can change the MTU, add and remove eth2/eth3 from the bond, and up and 
down the interface.  It also works fine with different hardware (e100, 
e1000, tg3, bnx2).  Running both interfaces alone without the bonding 
driver also works (I can up and down the interfaces with no side effects).


Just an update - it looks like 2.6.20.1 fixed the MTU problem (1 above), 
but not the other problem (where the machine locks up if the second port on 
the dual-port card is downed).


To recap:

I am using SysKonnect SK-9843 cards.  The sk98lin driver works fine; the skge


I should proof-read first.  The cards are SK-9844s, not SK-9843s.  The 
rest of the prior message is still correct.




-Chris

driver does not.  The following sequence of commands locks up the server. 
The lock is a hard lock; console is not responsive to keyboard input or to 
sysrq.  Nothing is printed on the serial console.



 ip li set eth2 up
 ip li set eth2 down
 ip li set eth3 up


There are no addresses assigned to either interface.  This was done after a 
fresh boot.  It is repeatable.  If I do not down eth2, I can up eth3, assign 
addresses, and use both interfaces.


The kernel is fresh from kernel.org and does not have any third party 
patches.


lspci -vv output:

:01:0a.0 Ethernet controller: Syskonnect (Schneider & Koch) SK-98xx Gigabit Ethernet Server Adapter (rev 12)
	Subsystem: Syskonnect (Schneider & Koch) SK-9844 Gigabit Ethernet Server Adapter (SK-NET GE-SX dual link)
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 0x08 (32 bytes)
	Interrupt: pin A routed to IRQ 10
	Region 0: Memory at ff8fc000 (32-bit, non-prefetchable) [size=16K]
	Region 1: I/O ports at d800 [size=256]
	Expansion ROM at ff40 [disabled] [size=128K]
	Capabilities: <available only to root>




-Chris




skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-03-08 Thread Chris Stromsoe

Within 2 or 3 minutes after issuing

ip link set bond1 mtu 9000

I get one "NETDEV WATCHDOG: eth2: transmit timed out" message on the 
console, and then this starts to repeat:


BUG: soft lockup detected on CPU#0!
 [<c0103667>] show_trace_log_lvl+0x19/0x2e
 [<c010368e>] show_trace+0x12/0x14
 [<c010377a>] dump_stack+0x14/0x16
 [<c0134964>] softlockup_tick+0x9f/0xae
 [<c0120f38>] run_local_timers+0x12/0x14
 [<c0120d89>] update_process_times+0x3e/0x63
 [<c010ceb0>] smp_apic_timer_interrupt+0x6f/0x7e
 [<c010337c>] apic_timer_interrupt+0x28/0x30
 [<d088d5b3>] skge_tx_clean+0x1d/0x87 [skge]
 [<d088d663>] skge_tx_timeout+0x46/0x4c [skge]
 [<c02b0854>] dev_watchdog+0x79/0xb9
 [<c0120ec9>] run_timer_softirq+0x10e/0x16b
 [<c011cebb>] __do_softirq+0x65/0xc3
 [<c0104d55>] do_softirq+0x54/0xbb
 ===

Once the soft lockups start, the machine becomes unresponsive and has to 
be power cycled.


bond1 is a dual-port syskonnect sk-98xx with both ports bonded in 
active-backup mode.


kernel is 2.6.20.1 with the web100 kernel patches from http://web100.org/. 
I'm building a plain 2.6.20.1 without the patches right now and will test 
when it finishes compiling.


lspci for the card shows:

fresno:~ # lspci -vv -s 02:01.0
:02:01.0 Ethernet controller: Syskonnect (Schneider & Koch) SK-98xx Gigabit Ethernet Server Adapter (rev 11)
	Subsystem: Syskonnect (Schneider & Koch) SK-9844 Gigabit Ethernet Server Adapter (SK-NET GE-SX dual link)
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 0x08 (32 bytes)
	Interrupt: pin A routed to IRQ 22
	Region 0: Memory at febfc000 (32-bit, non-prefetchable) [size=16K]
	Region 1: I/O ports at e800 [size=256]
	Expansion ROM at febc [disabled] [size=128K]
	Capabilities: [48] Power Management version 1
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data




-Chris


Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-03-08 Thread Stephen Hemminger

[SKGE]: Fix deadlock in skge_tx_timeout

dev_watchdog() already holds the device lock, don't take it again in
skge_tx_clean().

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 0b1cfafa6f6b8a168d5811d1f65cf540942c52b1
tree 4d3f252d6618adfe812e9da95cd496bb798e7c7b
parent 1ca949299260aa49eeba34ff912e2321c8b1f647
author Patrick McHardy [EMAIL PROTECTED] Sat, 24 Feb 2007 20:05:39 +0100
committer Patrick McHardy [EMAIL PROTECTED] Sat, 24 Feb 2007 20:05:39 +0100

 drivers/net/skge.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/skge.c b/drivers/net/skge.c
index e482e7f..4a948c2 100644
--- a/drivers/net/skge.c
+++ b/drivers/net/skge.c
@@ -2575,7 +2575,9 @@ static int skge_down(struct net_device *
 	skge_led(skge, LED_MODE_OFF);
 
 	netif_poll_disable(dev);
+	netif_tx_lock_bh(dev);
 	skge_tx_clean(dev);
+	netif_tx_unlock_bh(dev);
 	skge_rx_clean(skge);
 
 	kfree(skge->rx_ring.start);
@@ -2720,7 +2722,6 @@ static void skge_tx_clean(struct net_dev
 	struct skge_port *skge = netdev_priv(dev);
 	struct skge_element *e;
 
-	netif_tx_lock_bh(dev);
 	for (e = skge->tx_ring.to_clean; e != skge->tx_ring.to_use; e = e->next) {
 		struct skge_tx_desc *td = e->desc;
 		skge_tx_free(skge, e, td->control);
@@ -2729,7 +2730,6 @@ static void skge_tx_clean(struct net_dev
 
 	skge->tx_ring.to_clean = e;
 	netif_wake_queue(dev);
-	netif_tx_unlock_bh(dev);
 }
 
 static void skge_tx_timeout(struct net_device *dev)
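
The locking change in this patch can be illustrated outside the kernel. What
follows is only a user-space sketch under stated assumptions, not driver code:
the toy_* names are hypothetical, and the toy lock returns -1 at the point
where a real non-recursive lock would spin forever. It shows why a clean
routine that takes the lock itself fails when called from a path that already
holds it (the watchdog), and why moving the lock into the other caller (the
down path) keeps both paths safe.

```c
#include <assert.h>

/* Hypothetical stand-in for a device with a non-recursive transmit lock. */
struct toy_dev {
    int tx_locked;   /* 1 while the transmit lock is held */
    int tx_pending;  /* descriptors still waiting to be cleaned */
};

/* Returns 0 on success, -1 where a real spinlock would spin forever. */
int toy_tx_lock(struct toy_dev *dev)
{
    if (dev->tx_locked)
        return -1;
    dev->tx_locked = 1;
    return 0;
}

void toy_tx_unlock(struct toy_dev *dev)
{
    assert(dev->tx_locked);  /* unlocking an unheld lock is a bug */
    dev->tx_locked = 0;
}

/* Pre-patch shape: the clean routine takes the lock itself. */
int toy_tx_clean_old(struct toy_dev *dev)
{
    if (toy_tx_lock(dev) != 0)
        return -1;           /* caller already holds it: self-deadlock */
    dev->tx_pending = 0;
    toy_tx_unlock(dev);
    return 0;
}

/* Post-patch shape: the caller is required to hold the lock. */
int toy_tx_clean_new(struct toy_dev *dev)
{
    dev->tx_pending = 0;
    return 0;
}

/* Watchdog path: the lock is already held when the clean routine runs. */
int toy_watchdog(struct toy_dev *dev, int (*clean)(struct toy_dev *))
{
    int rc;

    if (toy_tx_lock(dev) != 0)
        return -1;
    rc = clean(dev);
    toy_tx_unlock(dev);
    return rc;
}

/* Down path after the patch: takes the lock around the clean call. */
int toy_down(struct toy_dev *dev)
{
    int rc;

    if (toy_tx_lock(dev) != 0)
        return -1;
    rc = toy_tx_clean_new(dev);
    toy_tx_unlock(dev);
    return rc;
}
```

In this sketch, toy_watchdog(&dev, toy_tx_clean_old) returns -1 and leaves
tx_pending untouched, while toy_watchdog(&dev, toy_tx_clean_new) and
toy_down(&dev) both succeed.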




Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-03-08 Thread Chris Stromsoe

Thanks.  That fixes the soft lockup.

I've got another problem now.  The cards I'm using are dual port 
(sk-9844).  I am bonding both ports together.


The card presents as eth2 and eth3.  If I remove eth2 from the bond so 
that eth3 is the active interface, I get a hard lock (nothing prints to 
serial console, sysrq isn't responsive) and have to power cycle.


This is with plain 2.6.20.1.  I also tested using skge.[ch] from the 
current netdev git tree.



-Chris

On Thu, 8 Mar 2007, Stephen Hemminger wrote:



[SKGE]: Fix deadlock in skge_tx_timeout

dev_watchdog() already holds the device lock, don't take it again in
skge_tx_clean().

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 0b1cfafa6f6b8a168d5811d1f65cf540942c52b1
tree 4d3f252d6618adfe812e9da95cd496bb798e7c7b
parent 1ca949299260aa49eeba34ff912e2321c8b1f647
author Patrick McHardy [EMAIL PROTECTED] Sat, 24 Feb 2007 20:05:39 +0100
committer Patrick McHardy [EMAIL PROTECTED] Sat, 24 Feb 2007 20:05:39 +0100

drivers/net/skge.c |4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/skge.c b/drivers/net/skge.c
index e482e7f..4a948c2 100644
--- a/drivers/net/skge.c
+++ b/drivers/net/skge.c
@@ -2575,7 +2575,9 @@ static int skge_down(struct net_device *
 	skge_led(skge, LED_MODE_OFF);
 
 	netif_poll_disable(dev);
+	netif_tx_lock_bh(dev);
 	skge_tx_clean(dev);
+	netif_tx_unlock_bh(dev);
 	skge_rx_clean(skge);
 
 	kfree(skge->rx_ring.start);
@@ -2720,7 +2722,6 @@ static void skge_tx_clean(struct net_dev
 	struct skge_port *skge = netdev_priv(dev);
 	struct skge_element *e;
 
-	netif_tx_lock_bh(dev);
 	for (e = skge->tx_ring.to_clean; e != skge->tx_ring.to_use; e = e->next) {
 		struct skge_tx_desc *td = e->desc;
 		skge_tx_free(skge, e, td->control);
@@ -2729,7 +2730,6 @@ static void skge_tx_clean(struct net_dev
 
 	skge->tx_ring.to_clean = e;
 	netif_wake_queue(dev);
-	netif_tx_unlock_bh(dev);
 }
 
 static void skge_tx_timeout(struct net_device *dev)




Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-03-08 Thread Jay Vosburgh
Chris Stromsoe [EMAIL PROTECTED] wrote:

Within 2 or 3 minutes after issuing

ip link set bond1 mtu 9000

I get one NETDEV WATCHDOG: eth2: transmit timed out to the console, and
then this starts to repeat:

BUG: soft lockup detected on CPU#0!

I believe this is the same bug that is fixed by this change:

commit c4f283b1f275e5528c13c119e5cfc80cdba55d00
Author: Jay Vosburgh [EMAIL PROTECTED]
Date:   Wed Feb 28 17:03:20 2007 -0800

bonding: fix double dev_add_pack

Bonding can erroneously register the same packet_type to receive
ARPs (for use by ARP validation): once at device open time, and once via
sysfs.  Since sysfs can change the validate setting (and thus register
or unregister) at any time, a flag is needed to synchronize with device
open in order to avoid double registrations, and the simplest place is
within the packet_type structure itself.  Double unregister is not an
issue.

Bug reported by Ulrich Oelmann [EMAIL PROTECTED].

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]
Signed-off-by: Jeff Garzik [EMAIL PROTECTED]

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ea73ebf..68afcb5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3423,6 +3423,9 @@ void bond_register_arp(struct bonding *b
 {
 	struct packet_type *pt = &bond->arp_mon_pt;
 
+	if (pt->type)
+		return;
+
 	pt->type = htons(ETH_P_ARP);
 	pt->dev = NULL; /*bond->dev;XXX*/
 	pt->func = bond_arp_rcv;
@@ -3431,7 +3434,10 @@ void bond_register_arp(struct bonding *b
 
 void bond_unregister_arp(struct bonding *bond)
 {
-	dev_remove_pack(&bond->arp_mon_pt);
+	struct packet_type *pt = &bond->arp_mon_pt;
+
+	dev_remove_pack(pt);
+	pt->type = 0;
 }
 
 /* Hashing Policies -*/
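
The guard this change adds (use the type field of the packet_type as an
"already registered" flag) can be sketched in user space. This is an
illustration under assumptions, not the bonding code: the toy_* names and the
singly linked registry are hypothetical, and the byte-order conversion done
by htons() in the real code is left out. Without the check in
toy_register_arp(), adding the same node twice would make it point at
itself and corrupt the list.

```c
#include <stddef.h>

/* Hypothetical miniature of a packet-type handler entry. */
struct toy_packet_type {
    unsigned short type;              /* 0 means "not registered" */
    struct toy_packet_type *next;
};

/* Hypothetical registry: a singly linked list plus an entry count. */
struct toy_registry {
    struct toy_packet_type *head;
    int count;
};

/* Unconditionally link the entry at the head of the list. */
void toy_add_pack(struct toy_registry *r, struct toy_packet_type *pt)
{
    pt->next = r->head;
    r->head = pt;
    r->count++;
}

/* Unlink the entry if it is present; do nothing otherwise. */
void toy_remove_pack(struct toy_registry *r, struct toy_packet_type *pt)
{
    struct toy_packet_type **pp;

    for (pp = &r->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == pt) {
            *pp = pt->next;
            pt->next = NULL;
            r->count--;
            return;
        }
    }
}

/* Register once: a non-zero type marks the entry as already registered. */
void toy_register_arp(struct toy_registry *r, struct toy_packet_type *pt)
{
    if (pt->type)
        return;
    pt->type = 0x0806;                /* ETH_P_ARP, host byte order here */
    toy_add_pack(r, pt);
}

/* Unregister and clear the flag so a later register works again. */
void toy_unregister_arp(struct toy_registry *r, struct toy_packet_type *pt)
{
    toy_remove_pack(r, pt);
    pt->type = 0;
}
```

Calling toy_register_arp() twice on the same entry leaves exactly one entry
in the registry, and a second toy_unregister_arp() is harmless, which matches
the commit message's note that double unregister is not an issue.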


-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-03-08 Thread Stephen Hemminger
On Thu, 8 Mar 2007 13:31:13 -0800 (PST)
Chris Stromsoe [EMAIL PROTECTED] wrote:

 Thanks.  That fixes the soft lockup.
 
 I've got another problem now.  The cards I'm using are dual port 
 (sk-9844).  I am bonding both ports together.
 
 The card presents as eth2 and eth3.  If I remove eth2 from the bond so 
 that eth3 is the active interface, I get a hard lock (nothing prints to 
 serial console, sysrq isn't responsive) and have to power cycle.
 
 This is with plain 2.6.20.1.  I also tested using skge.[ch] from the 
 current netdev git tree.

Which form of bonding failover are you using? There are locking issues
with some of the bonding modes. You should ask on the bonding mailing list.


-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-03-08 Thread Chris Stromsoe

On Thu, 8 Mar 2007, Stephen Hemminger wrote:

On Thu, 8 Mar 2007 13:31:13 -0800 (PST)
Chris Stromsoe [EMAIL PROTECTED] wrote:


Thanks.  That fixes the soft lockup.

I've got another problem now.  The cards I'm using are dual port
(sk-9844).  I am bonding both ports together.

The card presents as eth2 and eth3.  If I remove eth2 from the bond so
that eth3 is the active interface, I get a hard lock (nothing prints to
serial console, sysrq isn't responsive) and have to power cycle.

This is with plain 2.6.20.1.  I also tested using skge.[ch] from the
current netdev git tree.


Which form of bonding failover are you using? There are locking issues
with some of the bonding modes. You should ask on the bonding mailing list.


It's active-backup.  Testing with the same setup and e100 works fine. 
I've done a few tests without the bonding module, using the dual-port 
separately.


Testing with bonding and skge:

1) ifenslave bond0 eth2 eth3
   ifenslave -d bond0 eth3
   ifenslave -d bond0 eth2   -- locks up here

2) ifenslave bond0 eth2 eth3
   ifenslave -d bond0 eth2   -- locks up here

3) ifenslave bond0 eth3 eth2
   ifenslave -d bond0 eth2
   ifenslave bond0 eth2
   ifenslave bond0 -d eth3
   ifenslave bond0 eth3
   ifenslave -d bond0 eth2   -- locks up here


Testing without bonding:

1) ip link set mtu 9000 eth2  -- eth2 is no longer responsive
   ip link set mtu 1500 eth2  -- eth2 remains unresponsive

2) ifup eth2
   ifdown eth2

   perl -pi -e 's/eth2/eth3/' /etc/network/interfaces

   ifup eth3   --  locks up here




-Chris


Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-03-08 Thread Jay Vosburgh
Chris Stromsoe [EMAIL PROTECTED] wrote:

It's active-backup.  Testing with the same setup and e100 works fine. I've
done a few tests without the bonding module, using the dual-port
separately.

Somebody else a couple of weeks ago was having similar issues
running bonding with skge (in 802.3ad mode, in his case) that also
vanished with different hardware.  I don't have any skge hardware, so I
can't test it here.  His problem was a failure in 802.3ad negotiation,
not a system lockup, though.

If you're running active-backup and not using the ARP monitor
(arp_interval), then I'm not aware of any possible locking problems in
bonding for the kernel version you reference (2.6.20.1).

1) ip link set mtu 9000 eth2  -- eth2 is no longer responsive
   ip link set mtu 1500 eth2  -- eth2 remains unresponsive

2) ifup eth2
   ifdown eth2

   perl -pi -e 's/eth2/eth3/' /etc/network/interfaces

   ifup eth3   --  locks up here

This would seem to suggest a problem with skge itself, although
there might be some other interaction with bonding that causes the
problems for that case.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Re: skge- soft lockup on CPU#0 with mtu=9000 (2.6.20.1 + web100 patch)

2007-03-08 Thread Chris Stromsoe

On Thu, 8 Mar 2007, Jay Vosburgh wrote:

	If you're running active-backup and not using the ARP monitor 
(arp_interval), then I'm not aware of any possible locking problems in 
bonding for the kernel version you reference (2.6.20.1).


I'm not using arp_interval.

On Thu, 8 Mar 2007, Jay Vosburgh wrote:

Chris Stromsoe [EMAIL PROTECTED] wrote:


1) ip link set mtu 9000 eth2  -- eth2 is no longer responsive
   ip link set mtu 1500 eth2  -- eth2 remains unresponsive

2) ifup eth2
   ifdown eth2

   perl -pi -e 's/eth2/eth3/' /etc/network/interfaces

   ifup eth3   --  locks up here


	This would seem to suggest a problem with skge itself, although 
there might be some other interaction with bonding that causes the 
problems for that case.


In both of the above mentioned cases, I was not using bonding.  That was 
with the skge driver only.



-Chris