Re: sky2 0.11 instability

2006-01-23 Thread Carl-Daniel Hailfinger
Stephen Hemminger wrote:
 You might try adjusting the interrupt coalescing parameters with
   ethtool -C eth0 ...
 But I can't give you hard guidelines as to what would make it better.
 
 I have a debug patch, but it needs work still.

I don't care whether that debug patch will freeze the box or perform
other random funnies. All the debugging printks I added to the driver
did not trigger and I'd try anything. So yes, I'm desperate.

Does the sk98lin driver have any code for such problems?

Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sky2 0.11 instability

2006-01-23 Thread Stephen Hemminger
On Mon, 23 Jan 2006 20:57:10 +0100
Carl-Daniel Hailfinger [EMAIL PROTECTED] wrote:

 Stephen Hemminger wrote:
  You might try adjusting the interrupt coalescing parameters with
  ethtool -C eth0 ...
  But I can't give you hard guidelines as to what would make it better.
  
  I have a debug patch, but it needs work still.
 
 I don't care whether that debug patch will freeze the box or perform
 other random funnies. All the debugging printks I added to the driver
 did not trigger and I'd try anything. So yes, I'm desperate.
 
 Does the sk98lin driver have any code for such problems?

The sk98lin driver differs from sky2 in several ways:
* It programs some parts of the chip differently, but most
  of those differences are wrong. I started by copying it, but where
  it was wrong I didn't copy the mistakes.
* Sk98lin does NAPI wrong. It has interrupts disabled and runs
  packets through soft irq twice.
* Sk98lin does its own buggy rx checksum validation.
* Sk98lin does not do VLAN.
* Sk98lin programs PCI-Ex for 2K transfers, but that causes data
  corruption.

What is probably saving you with sk98lin is that it has a watchdog
routine that tries to work around all the possible driver hangs.
I prefer to find and fix these hangs, because a watchdog routine like that
just masks the problem and introduces a bunch of SMP race conditions which
the sk98lin author either didn't see or ignored.


-- 
Stephen Hemminger [EMAIL PROTECTED]
OSDL http://developer.osdl.org/~shemminger


Re: sky2 0.11 instability

2006-01-23 Thread Carl-Daniel Hailfinger
Stephen Hemminger wrote:
 What is probably saving you with sk98lin is that it has a watchdog
 routine that tries to work around all the possible driver hangs.
 I prefer to find and fix these hangs, because a watchdog routine like that
 just masks the problem and introduces a bunch of SMP race conditions which
 the sk98lin author either didn't see or ignored.

Oh. Now that is news to me. Glad I didn't have an SMP machine with the old
driver.

There is a bug in ethtool support in sky2. Namely, rx-frames{,-irq}=64 is
wrapped to zero. And rx-usecs-irq is 20 no matter what I set it to.

# ethtool -C bridgeint0 rx-frames 64 rx-frames-irq 64 rx-usecs 1 rx-usecs-irq 1 \
    tx-usecs 1 tx-frames 64
# ethtool -c bridgeint0
Coalesce parameters for bridgeint0:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 1
rx-frames: 0
rx-usecs-irq: 20
rx-frames-irq: 0

tx-usecs: 1
tx-frames: 64
tx-usecs-irq: 0
tx-frames-irq: 0

Will continue investigating.
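
One possible reading of the rx-frames result above, offered as a guess and not something confirmed in this thread: if only the low six bits of the frame count are kept somewhere between ethtool and the chip, then 64 reads back as 0 while 63 survives. The field width and the helper name wrap_frames are assumptions made for illustration; neither is taken from sky2.c.

```c
#include <stdint.h>

/* Hypothetical: assume the frame-count field is 6 bits wide.
 * Neither the width nor this helper comes from sky2.c. */
#define FRAMES_FIELD_BITS 6u
#define FRAMES_FIELD_MASK ((1u << FRAMES_FIELD_BITS) - 1u)

/* Value that would be read back if only the low bits were stored. */
static uint32_t wrap_frames(uint32_t requested)
{
	return requested & FRAMES_FIELD_MASK;
}
```

Under that assumption wrap_frames(64) is 0 and wrap_frames(63) is 63, which matches the readback shown above and suggests trying 63 as the largest setting.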


Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/


Re: sky2 0.11 instability

2006-01-23 Thread Carl-Daniel Hailfinger
Carl-Daniel Hailfinger wrote:
 Oh. Now that is news to me. Glad I didn't have an SMP machine with the old
 driver.
 
 There is a bug in ethtool support in sky2. Namely, rx-frames{,-irq}=64 is
 wrapped to zero. And rx-usecs-irq is 20 no matter what I set it to.

The following whitespace-damaged patch should help with the latter problem.
--- a/drivers/net/sky2.c  2006-01-23 23:41:35.0 +0100
+++ b/drivers/net/sky2.c  2006-01-24 03:41:21.0 +0100
@@ -2843,7 +2843,7 @@
        if (ecmd->rx_coalesce_usecs_irq == 0)
                sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_STOP);
        else {
-               sky2_write32(hw, STAT_TX_TIMER_INI,
+               sky2_write32(hw, STAT_ISR_TIMER_INI,
                             sky2_us2clk(hw, ecmd->rx_coalesce_usecs_irq));
                sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_START);
        }
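
To illustrate why rx-usecs-irq kept reading back as 20, here is a toy model, not driver code. It uses a fake register file in which the unpatched path stores the requested value in the TX timer slot, so the ISR timer keeps whatever it held before. The register indices, the struct, and the helper names are made up for this sketch, and the conversion from microseconds to clock ticks is left out.

```c
#include <stdint.h>

/* Toy register file; indices are made up for this sketch. */
enum { REG_TX_TIMER_INI, REG_ISR_TIMER_INI, REG_COUNT };

struct toy_hw {
	uint32_t regs[REG_COUNT];
};

/* Unpatched behaviour: the ISR value lands in the TX timer register. */
static void set_isr_usecs_buggy(struct toy_hw *hw, uint32_t usecs)
{
	hw->regs[REG_TX_TIMER_INI] = usecs;
}

/* Patched behaviour: the ISR value lands in the ISR timer register. */
static void set_isr_usecs_fixed(struct toy_hw *hw, uint32_t usecs)
{
	hw->regs[REG_ISR_TIMER_INI] = usecs;
}

/* What a later readback of rx-usecs-irq would see. */
static uint32_t get_isr_usecs(const struct toy_hw *hw)
{
	return hw->regs[REG_ISR_TIMER_INI];
}
```

Starting from an ISR timer value of 20, the buggy setter leaves get_isr_usecs() at 20 and overwrites the TX timer instead; the fixed setter makes the readback follow the requested value.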


Despite all the problems I'm having with sky2, I want to thank you
for writing it. The driver is easily readable and I can at least try
to get it running. With sk98lin I'm just stuck due to coding style
and general obfuscation.

Yeah!
I got the NIC to reproducibly auto-recover. With the following ethtool
settings it used to hang after a few minutes and not recover until an
rmmod/modprobe cycle. Now it comes back reliably.
# ethtool -C bridgeext0 rx-frames 63 rx-frames-irq 63 tx-frames 63 \
rx-usecs 250 rx-usecs-irq 250 tx-usecs 250

Patch follows:
--- a/drivers/net/sky2.c  2006-01-23 23:41:35.0 +0100
+++ b/drivers/net/sky2.c  2006-01-24 04:59:38.0 +0100
@@ -1623,6 +1623,12 @@
        unsigned txq = txqaddr[sky2->port];
        u16 ridx;

+       //sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
+       sky2_write8(hw, STAT_LEV_TIMER_CTRL, TIM_STOP);
+       //sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_STOP);
+       //sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
+       sky2_write8(hw, STAT_LEV_TIMER_CTRL, TIM_START);
+       //sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_START);
        /* Maybe we just missed an status interrupt */
        spin_lock(&sky2->tx_lock);
        ridx = sky2_read16(hw,
@@ -1639,6 +1645,7 @@
        if (netif_msg_timer(sky2))
                printk(KERN_ERR PFX "%s: tx timeout\n", dev->name);

+#if 0
        sky2_write32(hw, Q_ADDR(txq, Q_CSR), BMU_STOP);
        sky2_write32(hw, Y2_QADDR(txq, PREF_UNIT_CTRL), PREF_UNIT_RST_SET);

@@ -1646,6 +1653,7 @@

        sky2_qset(hw, txq);
        sky2_prefetch_init(hw, txq, sky2->tx_le_map, TX_RING_SIZE - 1);
+#endif
 }

Properties of the patch above: The device will fail after
some time, enter the tx_timeout handler, recover and continue.
Now if I could avoid entering the tx_timeout handler, I would
be happy because it triggers only after hanging for approx.
10 seconds.

Error log with my patch so far:
Jan 24 05:09:27 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:09:27 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:09:41 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:09:41 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:09:41 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 1312
Jan 24 05:11:12 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:11:12 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:11:12 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 592
Jan 24 05:11:42 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:11:42 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:11:42 switch kernel: sky2 

Re: sky2 0.11 instability

2006-01-22 Thread Carl-Daniel Hailfinger
Hi,

Carl-Daniel Hailfinger wrote:
 More results about the circumstances of this bug: It seems that
 it will only trigger under LOW load. As long as I keep the interface
 busy, it will have no problems at all.

OK, more info about the circumstances of the bug.
- happens with sky2 0.11 and 0.13
- with low load (100 kB/s) it triggers after 12 hours and then
  approx. every 50 minutes
- with medium load (100-1200 kB/s) it triggers after 30 minutes
  and then approx. every 70 minutes
- with high RX load (9-12 MB/s) it triggers every 8 hours
- with high TX load (9-12 MB/s) I can't get it to trigger
- with stock tx_timeout handler, it will stay dead and no interrupts
  are received from the nic once it hangs
- simply taking the interface down and up again doesn't help
- with my modified tx_timeout handler, taking the interface down and
  up again after the timeout helps
- with stock tx_timeout handler, I have to unload and reload the
  module to fix up the card
- general pattern seems to be medium interrupt load -> instability
- ah yes, and this is a production machine at a slightly remote
  location. Silly me.

If you want me to test any patch, tell me. It can only get better.


Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/


Re: sky2 0.11 instability

2006-01-22 Thread Stephen Hemminger
You might try adjusting the interrupt coalescing parameters with
ethtool -C eth0 ...
But I can't give you hard guidelines as to what would make it better.

I have a debug patch, but it needs work still.


Re: sky2 0.11 instability

2006-01-22 Thread Carl-Daniel Hailfinger
Stephen Hemminger wrote:
 You might try adjusting the interrupt coalescing parameters with
   ethtool -C eth0 ...
 But I can't give you hard guidelines as to what would make it better.
 
 I have a debug patch, but it needs work still.

ethtool -C bridgeint1 rx-frames 255 rx-frames-irq 255 rx-usecs 0 rx-usecs-irq 0 \
    tx-usecs 0 tx-frames 255

always results in a hang after less than 2 minutes if the network
activity is not too high (about 100-600 packets/s). So yes, I can
trigger this sucker on demand and give you all the debugging you
need.

Do you have any idea what the out-of-tree sk98lin did differently?


Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/


Re: sky2 0.11 instability

2006-01-22 Thread Carl-Daniel Hailfinger
Carl-Daniel Hailfinger wrote:
 Stephen Hemminger wrote:
 
You might try adjusting the interrupt coalescing parameters with
  ethtool -C eth0 ...
But I can't give you hard guidelines as to what would make it better.

I have a debug patch, but it needs work still.

After experimenting further, the following command will always hang
the card after 2-3 seconds:

ethtool -C bridgeint1 rx-frames 63 rx-frames-irq 63 rx-usecs 0 rx-usecs-irq 0 \
    tx-usecs 0 tx-frames 63

Crude activity log (1 second interval) follows:

interrupts   RX packets   TX packets

# normal activity
18225503  1828622  2084564
18225914  1828932  2084939
18226422  1829361  2085422
18226875  1829694  2085832
18227286  1830012  2086183
18227622  1830270  2086465
18227963  1830541  2086738
18228340  1830827  2087057
18228710  1831107  2087382
18229091  1831390  2087694
18229467  1831677  2088002
18229835  1831954  2088338
# ethtool starts now
18230143  1832249  2088647
18230146  1832434  2088799
18230146  1832462  2088799
18230146  1832462  2088799
18230146  1832462  2088799
18230146  1832462  2088799
18230146  1832462  2088799
18230146  1832462  2088799
18230146  1832462  2088799
18230146  1832462  2088799
18230146  1832462  2088799
18230146  1832462  2088799
# the netdev watchdog triggers now
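
The hang shows up in the log above as an interrupt counter that stops advancing. A minimal sketch of that check, assuming the per-second samples have already been collected into an array; the function name first_stall and the min_repeats threshold are hypothetical and not part of any existing tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first sample that starts a run of at least
 * `min_repeats` consecutive identical counter values, or -1 if the
 * counter never stalls for that long. */
static long first_stall(const uint64_t *samples, size_t n, size_t min_repeats)
{
	size_t run_start = 0;

	if (n == 0 || min_repeats == 0)
		return -1;

	for (size_t i = 1; i <= n; i++) {
		/* A run ends at the end of input or when the value changes. */
		if (i == n || samples[i] != samples[run_start]) {
			if (i - run_start >= min_repeats)
				return (long)run_start;
			run_start = i;
		}
	}
	return -1;
}
```

Fed the interrupt column from the last sample before ethtool starts through the end of the log, with a threshold of five repeats, this points at the first 18230146 sample, well before the roughly ten seconds the netdev watchdog needs to notice.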


 So yes, I can trigger this sucker on demand and give you all the
 debugging you need.
 
 Do you have any idea what the out-of-tree sk98lin v8.14.3.3 did
 differently?


Regards,
Carl-Daniel


Re: sky2 0.11 instability

2006-01-21 Thread Carl-Daniel Hailfinger
Hi,

Carl-Daniel Hailfinger wrote:
 
 after sending 259 GB and receiving 25 GB over my SysKonnect SK-9E21
 card (sky2 says it is a Yukon-EC (0xb6) rev 1), the card appears
 dead. Machine is an Athlon64 3200+ on an Asus A8N-SLI Deluxe board.
 
 sky2 v0.11 addr 0xc900 irq 74 Yukon-EC (0xb6) rev 1
 sky2 eth3: addr 00:00:5a:70:30:fb
 [...]
 sky2 eth3: enabling interface
 [...]
 sky2 eth3: phy interrupt status 0x1c40 0x7d0c
 sky2 eth3: Link is up at 100 Mbps, full duplex, flow control both
 [...]
 NETDEV WATCHDOG: eth3: transmit timed out
 sky2 eth3: tx timeout
 NETDEV WATCHDOG: eth3: transmit timed out
 sky2 eth3: tx timeout
 
 
 switch:~ # ifconfig eth3
 eth3   Link encap:Ethernet  HWaddr 00:00:5A:70:30:FB
   inet6 addr: fe80::200:5aff:fe70:30fb/64 Scope:Link
   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
   RX packets:130530358 errors:0 dropped:0 overruns:0 frame:0
   TX packets:209647800 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:1000
   RX bytes:25980735946 (24777.1 Mb)  TX bytes:259787058579 (247752.2 Mb)
   Interrupt:74
 
 switch:~ # cat /proc/interrupts
            CPU0
   0:   11213627    IO-APIC-edge  timer
   1:      24783    IO-APIC-edge  i8042
   8:          0    IO-APIC-edge  rtc
   9:          0   IO-APIC-level  acpi
  15:     401558    IO-APIC-edge  ide1
  50:  249384881   IO-APIC-level  eth0
  58:  179123938   IO-APIC-level  sky2
  66:          3   IO-APIC-level  sky2, ohci1394
  74:   98956955   IO-APIC-level  sky2
  82:      19952   IO-APIC-level  sky2
 217:       1865   IO-APIC-level  libata, NVidia CK804
 225:     263052   IO-APIC-level  libata, ehci_hcd:usb1
 NMI:      11098
 LOC:   11214113
 ERR:          0
 MIS:          0
 
 Not only will the card not transmit anymore, it also doesn't
 receive any packet at all. ethtool -r eth3 doesn't change
 anything, taking the interface down and up again also doesn't
 help. The interrupt count of interrupt 74 stays constant after
 failing.
 
 modprobe -r sky2; modprobe sky2
 fixes the problem for me, so maybe resetting the card on TX
 timeouts will help.
 
 The same problem appeared much earlier for another card which
 shared interrupt 58 with an onboard card driven by skge. After
 disabling the skge driver and rebooting, that card has been
 stable so far.
 
 The card is connected to a 100 MBit switch.
 
 These problems didn't appear with sk98lin v8.14.3.3 (that
 driver did survive about 10 TB of traffic before I rebooted).
 
 Register dumps are available on request (too big for this
 list).
 
 I will now try sky2 0.13 and report back.

And it hit the other interface after 200 MB transferred...
NETDEV WATCHDOG: bridgeext0: transmit timed out
sky2 bridgeext0: tx timeout
NETDEV WATCHDOG: bridgeext0: transmit timed out
sky2 transmit interrupt missed? recovered

Although the driver claims to recover, it doesn't recover at all.
What debug level would be advisable? It is now running with
modprobe sky2 debug=2, but I can't see more than the messages
above.

I have now added a hard reset routine to the tx timeout
path and hope it won't kill my machine.


Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/


Re: sky2 0.11 instability

2006-01-21 Thread Carl-Daniel Hailfinger
Carl-Daniel Hailfinger wrote:
 I have now added a hard reset routine to the tx timeout
 path and hope it won't kill my machine.

Apologies for mangled whitespace, this is just a rough cut'n'paste.
--- linux-2.6.15/drivers/net/sky2.c.orig 2006-01-21 16:00:15.0 +0100
+++ linux-2.6.15/drivers/net/sky2.c 2006-01-21 14:08:28.0 +0100
@@ -1565,6 +1565,7 @@ static int sky2_autoneg_done(struct sky2
        return 0;
 }

+static int sky2_reset(struct sky2_hw *hw);
 /*
  * Interrupt from PHY are handled outside of interrupt context
  * because accessing phy registers requires spin wait which might
@@ -1639,6 +1640,7 @@ static void sky2_tx_timeout(struct net_d
        if (netif_msg_timer(sky2))
                printk(KERN_ERR PFX "%s: tx timeout\n", dev->name);

+       if (0) {
        sky2_write32(hw, Q_ADDR(txq, Q_CSR), BMU_STOP);
        sky2_write32(hw, Y2_QADDR(txq, PREF_UNIT_CTRL), PREF_UNIT_RST_SET);

@@ -1646,6 +1648,12 @@ static void sky2_tx_timeout(struct net_d

        sky2_qset(hw, txq);
        sky2_prefetch_init(hw, txq, sky2->tx_le_map, TX_RING_SIZE - 1);
+       } else {
+       printk(KERN_ERR PFX "%s: recovering the HARD way...\n", dev->name);
+       sky2_down(dev);
+       sky2_reset(hw);
+       sky2_up(dev);
+       }
 }


And every time the kernel throws this message, I run the following
script:

#!/bin/bash
# Take the interface name from the most recent "recovering the HARD way" line.
deadinterface=`dmesg|grep HARD|tail -1|sed 's/.*sky2 //;s/:.*//'`
ip l s $deadinterface down
ip l s $deadinterface up

After that, everything continues to work until the next tx timeout
happens, and then the script again saves the day.

More results about the circumstances of this bug: It seems that
it will only trigger under LOW load. As long as I keep the interface
busy, it will have no problems at all.


Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/