Carl-Daniel Hailfinger wrote:
> Stephen Hemminger wrote:
> 
>>On Mon, 23 Jan 2006 20:57:10 +0100
>>Carl-Daniel Hailfinger <[EMAIL PROTECTED]> wrote:
>>
>>
>>>Stephen Hemminger wrote:
>>>
>>>>You might try adjusting the interrupt coalescing parameters with
>>>>    ethtool -C eth0 ...
>>>>But I can't give you hard guidelines as to what would make it better.
>>>>
>>>>I have a debug patch, but it needs work still.
>>>
>>>I don't care whether that debug patch will freeze the box or perform
>>>other random funnies. All the debugging printks I added to the driver
>>>did not trigger and I'd try anything. So yes, I'm desperate.
>>>
>>>Does the sk98lin driver have any code for such problems?
>>
>>
>>There are several differences that the sk98lin driver has.
>>* It programs some parts of the chip differently. But most
>>  of those are wrong. I started copying it, but where it was wrong
>>  I didn't copy the mistakes.
>>* Sk98lin does NAPI wrong. It has interrupts disabled and runs
>>  packets through soft irq twice.
>>* Sk98lin does its own buggy rx checksum validation.
>>* Sk98lin does not do VLAN
>>* Sk98lin programs PCI-Ex for 2K transfers, but that causes data
>>  corruption
>>
>>The one that probably is saving you with sk98lin, is it has a watchdog
>>routine that tries to work around all the possible driver hangs.
>>I prefer to find and fix these hangs, because a watchdog routine like that
>>just masks the problem and introduces a bunch of SMP race conditions which
>>the sk98lin author either didn't see or ignored.
> 
> 
> Oh. Now that is news to me. Glad I didn't have an SMP machine with the old
> driver.
> 
> There is a bug in ethtool support in sky2. Namely, rx-frames{,-irq}=64 is
> wrapped to zero. And rx-usecs-irq is 20 no matter what I set it to.

The following whitespace-damaged patch should help with the latter problem.
--- a/drivers/net/sky2.c  2006-01-23 23:41:35.000000000 +0100
+++ b/drivers/net/sky2.c  2006-01-24 03:41:21.000000000 +0100
@@ -2843,7 +2843,7 @@
        if (ecmd->rx_coalesce_usecs_irq == 0)
                sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_STOP);
        else {
-               sky2_write32(hw, STAT_TX_TIMER_INI,
+               sky2_write32(hw, STAT_ISR_TIMER_INI,
                             sky2_us2clk(hw, ecmd->rx_coalesce_usecs_irq));
                sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_START);
        }
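
To illustrate why rx-usecs-irq stayed at 20 before this fix, here is a minimal user-space sketch. It is not driver code: the register array, the 125 MHz clock, the us2clk() multiplication and the placeholder TX/LEV initial values are all assumptions made up for the illustration. Only the 20 usec ISR value comes from what ethtool reported.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three status timer init registers. */
enum timer_reg { TX_TIMER_INI, ISR_TIMER_INI, LEV_TIMER_INI, NUM_TIMER_REGS };

struct fake_hw {
    uint32_t regs[NUM_TIMER_REGS];
    uint32_t mhz;                   /* assumed core clock in MHz */
};

/* Assumed conversion from microseconds to clock ticks. */
static uint32_t us2clk(const struct fake_hw *hw, uint32_t us)
{
    return hw->mhz * us;
}

/* Assumed initial state: ISR timer at 20 us (the value ethtool kept
 * reporting); the TX and LEV values are placeholders. */
static void fake_hw_init(struct fake_hw *hw, uint32_t mhz)
{
    assert(mhz != 0);
    hw->mhz = mhz;
    hw->regs[TX_TIMER_INI]  = us2clk(hw, 1000);
    hw->regs[ISR_TIMER_INI] = us2clk(hw, 20);
    hw->regs[LEV_TIMER_INI] = us2clk(hw, 100);
}

/* Before the fix: the ISR setting lands in the TX timer register. */
static void set_isr_usecs_buggy(struct fake_hw *hw, uint32_t us)
{
    hw->regs[TX_TIMER_INI] = us2clk(hw, us);
}

/* After the fix: the ISR setting lands in the ISR timer register. */
static void set_isr_usecs_fixed(struct fake_hw *hw, uint32_t us)
{
    hw->regs[ISR_TIMER_INI] = us2clk(hw, us);
}

/* What a read-back of rx-usecs-irq would see. */
static uint32_t get_isr_usecs(const struct fake_hw *hw)
{
    return hw->regs[ISR_TIMER_INI] / hw->mhz;
}
```

In this model, set_isr_usecs_buggy(&hw, 250) leaves get_isr_usecs(&hw) at 20 and silently overwrites the TX timer value, while set_isr_usecs_fixed(&hw, 250) makes it read back 250.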


Despite all the problems I'm having with sky2, I want to thank you
for writing it. The driver is easily readable and I can at least try
to get it running. With sk98lin I'm just stuck due to coding style
and general obfuscation.

Yeeeeeaaaaaaaaaaaaahhhhhhhhhhhhh!
I got the nic to reproducibly auto-recover. With the following ethtool
settings it would hang after a few minutes and not recover until a
rmmod/modprobe cycle. Now it comes back reliably.
# ethtool -C bridgeext0 rx-frames 63 rx-frames-irq 63 tx-frames 63 \
rx-usecs 250 rx-usecs-irq 250 tx-usecs 250

Patch follows:
--- a/drivers/net/sky2.c  2006-01-23 23:41:35.000000000 +0100
+++ b/drivers/net/sky2.c  2006-01-24 04:59:38.000000000 +0100
@@ -1623,6 +1623,12 @@
        unsigned txq = txqaddr[sky2->port];
        u16 ridx;

+       //sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
+       sky2_write8(hw, STAT_LEV_TIMER_CTRL, TIM_STOP);
+       //sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_STOP);
+       //sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
+       sky2_write8(hw, STAT_LEV_TIMER_CTRL, TIM_START);
+       //sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_START);
        /* Maybe we just missed an status interrupt */
        spin_lock(&sky2->tx_lock);
        ridx = sky2_read16(hw,
@@ -1639,6 +1645,7 @@
        if (netif_msg_timer(sky2))
                printk(KERN_ERR PFX "%s: tx timeout\n", dev->name);

+#if 0
        sky2_write32(hw, Q_ADDR(txq, Q_CSR), BMU_STOP);
        sky2_write32(hw, Y2_QADDR(txq, PREF_UNIT_CTRL), PREF_UNIT_RST_SET);

@@ -1646,6 +1653,7 @@

        sky2_qset(hw, txq);
        sky2_prefetch_init(hw, txq, sky2->tx_le_map, TX_RING_SIZE - 1);
+#endif
 }

With the patch above, the device still hangs after some time, but it
now enters the tx_timeout handler, recovers, and continues.
I would be happier if I could avoid going through the tx_timeout
handler at all, because it only fires after the device has been hung
for approx. 10 seconds.

Error log with my patch so far:
Jan 24 05:09:27 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:09:27 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:09:41 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:09:41 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:09:41 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 1312
Jan 24 05:11:12 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:11:12 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:11:12 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 592
Jan 24 05:11:42 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:11:42 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:11:42 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 80
Jan 24 05:13:31 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:13:31 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:13:31 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 720
Jan 24 05:14:12 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:14:12 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:14:12 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 512
Jan 24 05:15:21 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:15:21 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:15:21 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 128
Jan 24 05:17:52 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:17:52 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:17:52 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 840
Jan 24 05:18:51 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:18:51 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:18:51 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 272
Jan 24 05:23:07 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:23:07 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:23:07 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 208
Jan 24 05:23:37 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:23:37 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:23:37 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 992
Jan 24 05:26:22 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:26:22 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:26:22 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 744
Jan 24 05:28:47 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:28:47 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:29:11 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:29:11 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:29:11 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 352
Jan 24 05:30:02 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:30:02 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:30:02 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 96
Jan 24 05:30:27 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:30:27 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:30:27 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 800
Jan 24 05:30:51 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:30:51 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:30:51 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 352
Jan 24 05:31:32 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:31:32 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:31:32 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 1344
Jan 24 05:34:17 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:34:17 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:35:36 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:35:36 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:35:36 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 128

Strange. Not every tx timeout is accompanied by an rx error. However,
that could be due to net_ratelimit suppressing some of the messages.
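
For anyone who wants to compare the two counts on a longer log, here is a rough user-space sketch that tallies both message types. The function and struct names are made up for this mail, and it assumes each log entry sits on a single line (i.e. the mailer's line wrapping has been undone).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: how often each driver message appeared. */
struct log_counts {
    size_t tx_timeouts;
    size_t rx_errors;
};

/* Return 1 if needle occurs within the first len bytes of line. */
static int line_contains(const char *line, size_t len, const char *needle)
{
    size_t nlen = strlen(needle);
    size_t i;

    for (i = 0; i + nlen <= len; i++)
        if (memcmp(line + i, needle, nlen) == 0)
            return 1;
    return 0;
}

/* Walk a NUL-terminated log buffer line by line and count the
 * "tx timeout" and "rx error" messages printed by the driver. */
static struct log_counts count_sky2_events(const char *log)
{
    struct log_counts c = { 0, 0 };
    const char *line = log;

    assert(log != NULL);
    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        if (line_contains(line, len, "tx timeout"))
            c.tx_timeouts++;
        if (line_contains(line, len, "rx error"))
            c.rx_errors++;

        line += len;
        if (*line == '\n')
            line++;     /* skip the newline, stop at the terminator */
    }
    return c;
}
```

The "NETDEV WATCHDOG ... transmit timed out" lines match neither string, so each timeout is counted once via the driver's own "tx timeout" message.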

I'm now trying to find out which timer is the problematic one.
Kicking STAT_TX_TIMER_CTRL alone has no effect.
Kicking STAT_LEV_TIMER_CTRL alone does help so far.
STAT_ISR_TIMER_CTRL was not tested yet.

...test...
Survived 22 hangs with the hand-edited patch above.

Stephen, do you know of any errata which could help explain this?


Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/