Re: [PATCH] net: adaptec: remove dead code in set_vlan_mode

2020-11-20 Thread Ion Badulescu

On 11/20/20 6:56 PM, Jakub Kicinski wrote:

On Fri, 20 Nov 2020 18:41:03 -0500 Ion Badulescu wrote:

Frankly, no, I don't know of any users, and that unfortunately includes
myself. I still have two cards in my stash, but they're 64-bit PCI-X, so
plugging them in would likely require taking a dremel to a 32-bit PCI
slot to make it open-ended. (They do work in a 32-bit slot.)

Anyway, that filter code could use some fixing in other regards. So
either we fix it properly (which I can submit a patch for), or clean it
out for good.


Entirely up to you.


All right then. I'll whip out the Dremel this weekend and hopefully get 
a test rig going... :)


-Ion


Re: [PATCH] net: adaptec: remove dead code in set_vlan_mode

2020-11-20 Thread Ion Badulescu

On 11/20/20 6:17 PM, Jakub Kicinski wrote:

On Fri, 20 Nov 2020 15:50:00 +0800 xiakaixu1...@gmail.com wrote:

From: Kaixu Xia 

The body of the if statement can be executed only when the variable
vlan_count equals 32, so the condition of the while statement can
not be true and the while statement is dead code. Remove it.

Reported-by: Tosk Robot 
Signed-off-by: Kaixu Xia 
---
  drivers/net/ethernet/adaptec/starfire.c | 9 ++---
  1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/adaptec/starfire.c b/drivers/net/ethernet/adaptec/starfire.c
index 555299737b51..ad27a9fa5e95 100644
--- a/drivers/net/ethernet/adaptec/starfire.c
+++ b/drivers/net/ethernet/adaptec/starfire.c
@@ -1754,14 +1754,9 @@ static u32 set_vlan_mode(struct netdev_private *np)
 		filter_addr += 16;
 		vlan_count++;
 	}
-	if (vlan_count == 32) {
+	if (vlan_count == 32)
 		ret |= PerfectFilterVlan;
-		while (vlan_count < 32) {
-			writew(0, filter_addr);
-			filter_addr += 16;
-			vlan_count++;
-		}
-	}
+
 	return ret;
 }
 #endif /* VLAN_SUPPORT */


This got broken back in 2011:

commit 5da96be53a16a62488316810d0c7c5d58ce3ee4f
Author: Jiri Pirko 
Date:   Wed Jul 20 04:54:31 2011 +

 starfire: do vlan cleanup
 
 - unify vlan and nonvlan rx path

 - kill np->vlgrp and netdev_vlan_rx_register
 
 Signed-off-by: Jiri Pirko 

 Signed-off-by: David S. Miller 

The comparison to 32 was on a different variable before that change.
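
The dead-code claim in the patch description can be checked with a small Python model of the control flow. This is only a sketch, not the driver code: the function name is made up for illustration, and the 0..32 range assumes the 32-entry limit implied by the comparison in the hunk.

```python
def dead_loop_iterations(vlan_count: int) -> int:
    """Count how many times the removed inner loop would run (hypothetical helper)."""
    iterations = 0
    if vlan_count == 32:
        # Mirrors the removed block: the loop condition contradicts the guard,
        # so the body can never execute.
        while vlan_count < 32:
            iterations += 1
            vlan_count += 1
    return iterations


# Assumption: vlan_count only takes values 0..32.
total = sum(dead_loop_iterations(n) for n in range(33))
print(total)
```

For every input the inner loop runs zero times, which is consistent with the observation that the guard and the loop were meant to test different variables.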

Ion, do you think anyone is still using this driver?

Maybe it's time we put it in the history book (by which I mean remove
from the kernel).


Frankly, no, I don't know of any users, and that unfortunately includes 
myself. I still have two cards in my stash, but they're 64-bit PCI-X, so 
plugging them in would likely require taking a dremel to a 32-bit PCI 
slot to make it open-ended. (They do work in a 32-bit slot.)


Anyway, that filter code could use some fixing in other regards. So 
either we fix it properly (which I can submit a patch for), or clean it 
out for good.


-Ion


Re: rcu stalls and soft lockups with recent kernels

2016-03-22 Thread Ion Badulescu

On 03/17/2016 10:28 PM, Mike Galbraith wrote:

On Wed, 2016-03-16 at 12:15 -0400, Ion Badulescu wrote:

Just following up to my own email:

It turns out that we can eliminate the RCU stalls by changing from
CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu
handle its own RCU callbacks completely fixes the problems for us.

Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL is the default config
for fedora and rhel7. Ho-humm...


All RCU offloaded to CPU0 of a big box seems like a very bad idea.



It's not offloaded to CPU0, is it? Those rcuo* threads are not cpu-bound 
and can run on any cpu the scheduler will put them on. In any case, 
there was no indication that the rcuo* threads wanted to run but 
couldn't get cpu time.


Anyway, looks like I spoke too soon. It's less often with 
RCU_NOCB_CPU_NONE than with RCU_NOCB_CPU_ALL, but the soft lockups and 
rcu stalls are still happening.


[44206.316711] clocksource: timekeeping watchdog: Marking clocksource 
'tsc' as unstable because the skew is too large:
[44206.328463] clocksource:   'hpet' wd_now: 
 wd_last: 5f03cdca mask: 
[44206.339037] clocksource:   'tsc' cs_now: 
64788b443c3a cs_last: 647840eea919 mask: 

[44206.351253] clocksource: Switched to clocksource hpet
[44922.301452] INFO: rcu_sched detected stalls on CPUs/tasks:
[44922.307644]  0-...: (1 GPs behind) idle=53d/141/0 
softirq=8515474/8515477 fqs=6994
[44922.317435]  (detected by 1, t=21019 jiffies, g=2011397, c=2011396, 
q=3263)

[44922.325274] Task dump for CPU 0:
[44922.325276] python  R  running task0 257113 257112 
0x00080088   0 FAIR10   15229448373
[44922.325283]  8152ca8e 881b7687 880e83669000 
7d54
[44922.333671]  881b1cdc7a48 880a58a57e58 0086 

[44922.342060]   3fa1 880a58a54000 
880a58a57e88

[44922.350446] Call Trace:
[44922.353215]  [] ? __schedule+0x38e/0xa90
[44922.359388]  [] ? rcu_eqs_enter_common+0x66/0x130
[44922.366437]  [] ? acct_account_cputime+0x1c/0x20
[44922.373388]  [] ? account_user_time+0x78/0x80
[44922.380045]  [] ? vtime_account_user+0x43/0x60
[44922.386801]  [] ? __context_tracking_exit+0x70/0xc0
[44922.394044]  [] ? enter_from_user_mode+0x1f/0x50
[44922.400994]  [] ? apic_timer_interrupt+0x69/0x90
[44923.210453] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! 
[python:257113]
[44923.218890] Modules linked in: nfsv3 nfs_acl nfs msr autofs4 lockd 
grace sunrpc cachefiles fscache binfmt_misc nls_iso8859_1 nls_cp437 vfat 
fat vhost_net vhost tun kvm irqbypass input_leds hid_generic iTCO_wdt 
iTCO_vendor_support pcspkr sfc mtd i2c_algo_bit sb_edac sg lpc_ich
mfd_core ehci_pci ehci_hcd xhci_pci xhci_hcd i2c_i801 i2c_core 
ixgbe ptp pps_core mdio ipmi_devintf ipmi_si ipmi_msghandler tpm_tis tpm 
acpi_power_meter hwmon ext4 jbd2 mbcache crc16 raid1 dm_mirror 
dm_region_hash dm_log dm_mod
[44923.269289] CPU: 0 PID: 257113 Comm: python Tainted: G  I 
4.4.5-el6.ia32e.lime.0 #1
[44923.279089] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
SE5C610.86B.01.01.0011.081020151200 08/10/2015
[44923.290824] task: 880e83669000 ti: 880a58a54000 task.ti: 
880a58a54000
[44923.299253] RIP: 0033:[<2b5ee4e0502d>]  [<2b5ee4e0502d>] 
0x2b5ee4e0502d

[44923.307508] RSP: 002b:7ffe4120b170  EFLAGS: 0212
[44923.313494] RAX: 0008 RBX: 03fe1480 RCX: 
03fc9b00
[44923.321513] RDX: 2b5ee34b4260 RSI: 2248 RDI: 
00d4
[44923.329530] RBP: 03fe1800 R08: 0078 R09: 
0800
[44923.337548] R10: 7ffe4120b550 R11: 00011240 R12: 
2b5ee34ba938
[44923.345566] R13: 2b5ee34c1010 R14: 7ffe4120b428 R15: 
0400
[44923.353580] FS:  2b5ec7fe98c0() GS:88103fc0() 
knlGS:

[44923.362689] CS:  0010 DS:  ES:  CR0: 80050033
[44923.369147] CR2: 2b5ee419 CR3: 0005ac81b000 CR4: 
001406f0


That rcu_eqs_enter_common function seems to be a fairly common 
occurrence in these stack traces. Not sure if it means anything, though.
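
One way to put a number on "fairly common" is to tally the symbols that appear in the Call Trace frames across a set of saved logs. The following is a minimal sketch, assuming the frame layout seen in the trace above (`[...] ? symbol+0xoff/0xlen`); the regular expression and the function name are illustrative, not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches "[...] ? symbol+0x66/0x130" as well as frames without the "?" marker.
FRAME_RE = re.compile(r"\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def count_symbols(log_text: str) -> Counter:
    """Return how often each symbol shows up in call-trace frames."""
    return Counter(FRAME_RE.findall(log_text))


sample = """\
[44922.350446] Call Trace:
[44922.353215]  [] ? __schedule+0x38e/0xa90
[44922.359388]  [] ? rcu_eqs_enter_common+0x66/0x130
[44922.366437]  [] ? acct_account_cputime+0x1c/0x20
"""

counts = count_symbols(sample)
for symbol, n in counts.most_common():
    print(symbol, n)
```

Fed with the concatenated dmesg output from several stalled machines, the same tally would show whether rcu_eqs_enter_common really stands out or simply appears on every path that leaves user mode.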


Also, this seems to be a soft lockup with RIP in userspace. Does it mean 
timer interrupts are disabled? Somehow it fails to reschedule the NMI timer.


We're at our wits' end here...

This, btw, is a 2x12 core Haswell box.

Thanks,
-Ion


Re: rcu stalls and soft lockups with recent kernels

2016-03-19 Thread Ion Badulescu

Just following up to my own email:

It turns out that we can eliminate the RCU stalls by changing from 
CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu 
handle its own RCU callbacks completely fixes the problems for us.


Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL is the default config 
for fedora and rhel7. Ho-humm... anybody interested in tracking this 
down further?


The bug is likely to be a missing wakeup for the rcuo* threads. It would 
explain why the stalls get resolved eventually, maybe when another RCU 
callback gets scheduled (properly this time!) and the threads wake up 
and process their entire queue. But it's just speculation at this point.


And one final point: the RCU stall stack traces are actually rather 
useless when the RCU callbacks are delegated to kernel threads. What one 
particular cpu may be doing when it's fallen behind on RCU grace periods 
isn't very interesting, given that its rcuo* threads could be running 
anywhere, on any cpu.


Thanks,
-Ion


On 02/04/2016 02:12 PM, Ion Badulescu wrote:

Hello,

We run a compute cluster of about 800 or so machines here, which makes
heavy use of NFS and fscache (on a dedicated local drive with an ext4
filesystem) and also exercises the other local drives pretty hard. All
the compute jobs run as unprivileged users with SCHED_OTHER scheduling,
nice level 1.

Over the past month or two, we've been seeing some strange and seemingly
random rcu stalls and soft cpu lockups when running jobs on the cluster.
The stalls are bad enough to cause hard drives to get kicked out of MD
arrays, and even to get offlined altogether by the SCSI layer. The
kernel running on these machines is based on 3.18.12 with some local
patches, which we've been running quite happily since early May 2015.
It's unclear what started triggering the stalls after all these months,
as we've been unable to correlate them to any particular jobs. The one
thing that's clear is that they tend to happen in bunches on multiple
machines at the same time, so whatever it is it's some centralized
condition that triggers them. It could be a certain type of job, or it
could be the state of the centralized NFS servers they access.

In an effort to solve the issue and isolate its cause, we upgraded the
kernel on about 80 of those machines to the latest 4.4.1 kernel, this
time keeping out most of the local patches we had. We also enabled some
more kernel debugging options, including lockdep. Only a few local
patches were kept in:
- printing some extra information in sysrq
- disabling audit-to-printk
- perf build fixes
- changing the slab calculate_order algorithm to favor small allocation
orders whenever possible.

The full kernel config is available at
http://www.badula.org/config-4.4.1 and the combined patch with our local
changes is at http://www.badula.org/linux-ion-git-4.4.1.diff

Unfortunately, the stalls did not go away. Many of the machines running
4.4.1 hit stalls with stack traces in the sunrpc transmit path, similar
to this one:


Feb  2 04:26:02 INFO: rcu_sched self-detected stall on CPUINFO:
rcu_sched detected stalls on CPUs/tasks:
Feb  2 04:26:02 6-...: (20999 ticks this GP) idle=cf1/141/0
softirq=9090641/9090641 fqs=7000
Feb  2 04:26:02 (detected by 5, t=21003 jiffies, g=1733453, c=1733452, q=0)
Feb  2 04:26:02   running task0 328895  2 0x00080088   6
FAIR  -206
Feb  2 04:26:02
Feb  2 04:26:02  ff10 8152abaf 
0004
Feb  2 04:26:02  00043121b8b0 881001149808 0296
02080020
Feb  2 04:26:02  0296 88003121b908 810968fb
8810010ed9c8
Feb  2 04:26:02 Call Trace:
Feb  2 04:26:02  [] ? _raw_spin_lock_irqsave+0xf/0x40
Feb  2 04:26:02  [] ? finish_wait+0x6b/0x90
Feb  2 04:26:02  [] ? sk_stream_wait_memory+0x24a/0x2d0
Feb  2 04:26:02  [] ? woken_wake_function+0x20/0x20
Feb  2 04:26:02  [] ? _raw_spin_unlock_bh+0x1a/0x20
Feb  2 04:26:02  [] ? release_sock+0xfd/0x150
Feb  2 04:26:02  [] ? tcp_sendpage+0xd6/0x5e0
Feb  2 04:26:02  [] ? inet_sendpage+0x50/0xe0
Feb  2 04:26:02  [] ? xs_nospace+0x75/0xf0 [sunrpc]
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb  2 04:26:02  [] ? read_hpet+0x16/0x20
Feb  2 04:26:02  [] ? ktime_get+0x52/0xc0
Feb  2 04:26:02  [] ? xprt_transmit+0x63/0x3a0 [sunrpc]
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb  2 04:26:02  [] ? _raw_spin_unlock_bh+0x1a/0x20
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb  2 04:26:02  [] ? call_transmit+0x1d8/0x2c0 [sunrpc]
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb  2 04:26:02  [] ? __rpc_execute+0x89/0x3c0 [sunrpc]
Feb  2 04:26:02  [] ? finish_task_switch+0x125/0x260
Feb  2 04:26:02  [] ? rpc_async_schedule+0x15/0x20
[sunrpc]
Feb  2 04:26:02  [] ? process_one_work+0x148/0x450
Feb  2 04:26:02  [] ? worker_thread+0x132/0x600

rcu stalls and soft lockups with recent kernels

2016-02-04 Thread Ion Badulescu

Hello,

We run a compute cluster of about 800 or so machines here, which makes
heavy use of NFS and fscache (on a dedicated local drive with an ext4
filesystem) and also exercises the other local drives pretty hard. All
the compute jobs run as unprivileged users with SCHED_OTHER scheduling,
nice level 1.

Over the past month or two, we've been seeing some strange and seemingly
random rcu stalls and soft cpu lockups when running jobs on the cluster.
The stalls are bad enough to cause hard drives to get kicked out of MD
arrays, and even to get offlined altogether by the SCSI layer. The
kernel running on these machines is based on 3.18.12 with some local
patches, which we've been running quite happily since early May 2015.
It's unclear what started triggering the stalls after all these months,
as we've been unable to correlate them to any particular jobs. The one
thing that's clear is that they tend to happen in bunches on multiple
machines at the same time, so whatever it is it's some centralized
condition that triggers them. It could be a certain type of job, or it
could be the state of the centralized NFS servers they access.

In an effort to solve the issue and isolate its cause, we upgraded the
kernel on about 80 of those machines to the latest 4.4.1 kernel, this
time keeping out most of the local patches we had. We also enabled some
more kernel debugging options, including lockdep. Only a few local
patches were kept in:
- printing some extra information in sysrq
- disabling audit-to-printk
- perf build fixes
- changing the slab calculate_order algorithm to favor small allocation
orders whenever possible.

The full kernel config is available at
http://www.badula.org/config-4.4.1 and the combined patch with our local
changes is at http://www.badula.org/linux-ion-git-4.4.1.diff

Unfortunately, the stalls did not go away. Many of the machines running
4.4.1 hit stalls with stack traces in the sunrpc transmit path, similar
to this one:


Feb  2 04:26:02 INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched 
detected stalls on CPUs/tasks:
Feb  2 04:26:02 6-...: (20999 ticks this GP) idle=cf1/141/0 
softirq=9090641/9090641 fqs=7000
Feb  2 04:26:02 (detected by 5, t=21003 jiffies, g=1733453, c=1733452, q=0)
Feb  2 04:26:02   running task0 328895  2 0x00080088   6 FAIR  -20  
  6
Feb  2 04:26:02
Feb  2 04:26:02  ff10 8152abaf  
0004
Feb  2 04:26:02  00043121b8b0 881001149808 0296 
02080020
Feb  2 04:26:02  0296 88003121b908 810968fb 
8810010ed9c8
Feb  2 04:26:02 Call Trace:
Feb  2 04:26:02  [] ? _raw_spin_lock_irqsave+0xf/0x40
Feb  2 04:26:02  [] ? finish_wait+0x6b/0x90
Feb  2 04:26:02  [] ? sk_stream_wait_memory+0x24a/0x2d0
Feb  2 04:26:02  [] ? woken_wake_function+0x20/0x20
Feb  2 04:26:02  [] ? _raw_spin_unlock_bh+0x1a/0x20
Feb  2 04:26:02  [] ? release_sock+0xfd/0x150
Feb  2 04:26:02  [] ? tcp_sendpage+0xd6/0x5e0
Feb  2 04:26:02  [] ? inet_sendpage+0x50/0xe0
Feb  2 04:26:02  [] ? xs_nospace+0x75/0xf0 [sunrpc]
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110 
[sunrpc]
Feb  2 04:26:02  [] ? read_hpet+0x16/0x20
Feb  2 04:26:02  [] ? ktime_get+0x52/0xc0
Feb  2 04:26:02  [] ? xprt_transmit+0x63/0x3a0 [sunrpc]
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110 
[sunrpc]
Feb  2 04:26:02  [] ? _raw_spin_unlock_bh+0x1a/0x20
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110 
[sunrpc]
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110 
[sunrpc]
Feb  2 04:26:02  [] ? call_transmit+0x1d8/0x2c0 [sunrpc]
Feb  2 04:26:02  [] ? call_transmit_status+0x110/0x110 
[sunrpc]
Feb  2 04:26:02  [] ? __rpc_execute+0x89/0x3c0 [sunrpc]
Feb  2 04:26:02  [] ? finish_task_switch+0x125/0x260
Feb  2 04:26:02  [] ? rpc_async_schedule+0x15/0x20 [sunrpc]
Feb  2 04:26:02  [] ? process_one_work+0x148/0x450
Feb  2 04:26:02  [] ? worker_thread+0x132/0x600
Feb  2 04:26:02  [] ? default_wake_function+0x12/0x20
Feb  2 04:26:02  [] ? __wake_up_common+0x56/0x90
Feb  2 04:26:02  [] ? process_one_work+0x450/0x450
Feb  2 04:26:02  [] ? process_one_work+0x450/0x450
Feb  2 04:26:02  [] ? kthread+0xcc/0xf0
Feb  2 04:26:02  [] ? kthread_freezable_should_stop+0x70/0x70
Feb  2 04:26:02  [] ? ret_from_fork+0x3f/0x70
Feb  2 04:26:02  [] ? kthread_freezable_should_stop+0x70/0x70


So we added a cond_resched() call inside the loop in __rpc_execute, if
the loop made more than 20 iterations. While clearly just a bandaid, and
maybe also in the wrong place, we just wanted to see if it would help.
And indeed it did, because the very common sunrpc-related stalls went
away. Instead, this ext4-related stall started showing up:


[91919.498218] INFO: rcu_sched self-detected stall on CPU
[91919.498221] INFO: rcu_sched detected stalls on CPUs/tasks:
[91919.498235]  15-...: (4 ticks this GP) idle=dbf/141/0 
softirq=16345141/16345141 fqs=1
[91919.498238]  (detected by 10, t=178100 jiffies, g=4061489, c=4061488, 
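
The cond_resched() band-aid described before this trace can be modelled in user space. The actual patch is not shown in the thread, so this is only a sketch of one plausible reading (yield once every time more than 20 iterations have run since the last yield); the function and parameter names are hypothetical.

```python
from typing import Callable, Iterable


def run_with_periodic_yield(steps: Iterable[Callable[[], None]],
                            yield_fn: Callable[[], None],
                            threshold: int = 20) -> int:
    """Run each step, calling yield_fn after more than `threshold` steps in a row.

    Returns the number of times yield_fn was called.
    """
    since_yield = 0
    yields = 0
    for step in steps:
        step()
        since_yield += 1
        if since_yield > threshold:
            # Stand-in for cond_resched(): give other work a chance to run.
            yield_fn()
            yields += 1
            since_yield = 0
    return yields


# 100 no-op steps with the assumed threshold of 20.
yield_count = run_with_periodic_yield([lambda: None] * 100, lambda: None)
print(yield_count)
```

The point of the counter is that a short loop never pays for a yield, while a long-running one is forced to give up the CPU at bounded intervals.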

Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels

2005-09-02 Thread Ion Badulescu

Hi Alexey,

On Sat, 3 Sep 2005, Alexey Kuznetsov wrote:


Well, take a look at the double acks for 84439343, 84440447 and 84441059,
they seem pretty much identical to me.


It is just a little tcpdump glitch.

19:34:54.532271 < 10.2.20.246.33060 > 65.171.224.182.8700: . 44:44(0) ack 84439343 
win 24544  (DF) (ttl 64, id 60946)
19:34:54.532432 < 10.2.20.246.33060 > 65.171.224.182.8700: . 44:44(0) ack 84439343 
win 24544  (DF) (ttl 64, id 60946)

It is one ACK (look at IP ID), shown twice. This happens sometimes
with our packet socket.


Ahh... ack. :) That explains it.


I understood. I expected 184*4 when you said 184. But minimum is
still 730 (unscaled 1460*2). If you really saw values lower than 730
(unscaled 1460*2), there is another more severe problem and the suggested
patch will not solve it.


I really did see very small values. This one is plucked from one of 
today's streams, after a full day's worth of data had passed through it:


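19:03:19.659454 10.1.12.11.8001 > 10.2.10.212.56690: P 3:6(3) ack 1 win 65529
<nop,nop,timestamp 27146219 3617561665> (DF)
19:03:19.659462 10.2.10.212.56690 > 10.1.12.11.8001: . ack 6 win 181
<nop,nop,timestamp 3617562713 27146219> (DF)
19:03:20.690719 10.1.12.11.8001 > 10.2.10.212.56690: P 6:9(3) ack 1 win 65529
<nop,nop,timestamp 27146230 3617562713> (DF)
19:03:20.690727 10.2.10.212.56690 > 10.1.12.11.8001: . ack 9 win 181
<nop,nop,timestamp 3617563744 27146230> (DF)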

10.1.12.11 is the Win2k box, 10.2.10.212 is the Linux box. The socket 
buffer sizes are the defaults, so the scaling is most likely 2^2. The 
packets being exchanged at this point are just heartbeats.
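
To put the 181 in perspective, here is a minimal sketch of the arithmetic. It assumes the window scale shift of 2 guessed above (it was not captured from the SYN) and the 1460-byte MSS behind the 2*1460 floor. The function name is made up for illustration and is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Convert the raw 16-bit window field printed by tcpdump into bytes.
 * wscale is the shift count from the SYN exchange (assumed here, not observed). */
static uint32_t scaled_window_bytes(uint16_t raw_win, unsigned int wscale)
{
	assert(wscale <= 14);	/* the window scale option caps the shift at 14 */
	return (uint32_t)raw_win << wscale;
}
```

With raw_win = 181 and wscale = 2 this gives 724 bytes, well under the 2*1460 = 2920 bytes (730 on the wire) quoted as the expected minimum.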


On Tuesday I can try to capture a full session from the very beginning, if 
you think it would help.


Thanks,
-Ion
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels

2005-09-02 Thread Ion Badulescu

Hi Alexey,

On Fri, 2 Sep 2005, Alexey Kuznetsov wrote:


>> This is where things start going bad. The window starts shrinking from
>> 15340 all the way down to 2355 over the course of 0.3 seconds. Notice the
>> many duplicate acks that serve no purpose

> These are not duplicate, TCP_NODELAY sender just starts flooding
> tiny segments, and those are normal ACKs acking those segments, note
> ACK field is not the same.


Well, take a look at the double acks for 84439343, 84440447 and 84441059, 
they seem pretty much identical to me.



> I still do not know how the value of 184 is possible in your case,
> I would expect 730 as an absolute possible minimum. I see 9420 (2355*4).


The numbers I mentioned are straight from the tcpdump and are not scaled, 
so they need to be multiplied by 4. But even 9420, combined with a RTT of 
20ms, results in a total usable bandwidth of about 3.75 Mbps, not enough 
for this real-time stream at peak times.
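
The 3.75 Mbps figure follows from the usual window/RTT bound. A minimal sketch, assuming at most one full window is in flight per round trip and ignoring header overhead; the function name is illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on throughput when at most one window is in flight per RTT.
 * window_bytes is the scaled window, rtt_ms the round-trip time in ms.
 * Returns bits per second. */
static uint64_t window_limited_bps(uint32_t window_bytes, uint32_t rtt_ms)
{
	assert(rtt_ms > 0);
	return (uint64_t)window_bytes * 8u * 1000u / rtt_ms;
}
```

For a 9420-byte window and a 20 ms RTT this returns 3768000, i.e. roughly 3.75 Mbit/s.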


Besides, it often gets even worse than 2355; all it takes is a few 
application slowdowns.



> Anyway, ignoring this puzzle, the following patch for 2.4 should help.
>
> --- net/ipv4/tcp_input.c.orig	2003-02-20 20:38:39.0 +0300
> +++ net/ipv4/tcp_input.c	2005-09-02 22:28:00.845952888 +0400
> @@ -343,8 +343,6 @@
>  			app_win -= tp->ack.rcv_mss;
>  		app_win = max(app_win, 2U*tp->advmss);
>
> -		if (!ofo_win)
> -			tp->window_clamp = min(tp->window_clamp, app_win);
>  		tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
>  	}
>  }


That makes perfect sense...

I'll test it out on Tuesday, when I can connect again to the real-time 
streams that we use.


Thanks a lot!
-Ion


Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels

2005-09-02 Thread Ion Badulescu

On Fri, 2 Sep 2005, John Heffner wrote:

> If it is window clamping, then you should be asymptotically approaching a
> ratio between receive buffer and window that corresponds (with a fudge
> factor) to the ratio between TCP segment data size and allocated packet size.
> If you make the receive buffer large enough, then the clamped window should
> still end up big enough.


For what it's worth, running with a 512k receive buffer still caused the 
clamping to occur, though it took longer than with the normal buffer size. 
The window went down from a maximum of 12291 (times 2^4 due to window 
scaling) to 3190 currently. That's still enough for our purposes, but I'll 
keep monitoring it to see if it shrinks any further. It could be a viable 
work-around for the time being.


Is this a bug, though, or a feature? :)

> Also, since you have "real time" data, a larger
> receive buffer should probably be adequate to eliminate this problem, since
> it only occurs when the receiving application falls behind for a while, and a
> bigger receive buffer allows it to fall behind more without triggering the
> window clamping.


Correct. I noticed too while experimenting that the clamping never occurs 
if the application is fast enough to keep the socket buffer empty. It's 
when data is allowed to accumulate in the buffer that the window shrinks, 
and then it never grows back, as if a portion of the buffer got lost 
permanently.
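
The "never grows back" behaviour is what a min()-only update produces. Below is a toy model, not the kernel code. It mirrors only the two lines touched by the patch quoted elsewhere in this thread, and it assumes no out-of-order data and that nothing else raises the clamp. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Toy receiver state; field names loosely follow the quoted 2.4 patch. */
struct toy_rx {
	uint32_t window_clamp;	/* upper bound on the advertised window */
	uint32_t advmss;	/* advertised MSS */
};

/* Called when the receive buffer is overcommitted; app_win is the amount
 * of data the application has not read yet. */
static void toy_clamp(struct toy_rx *rx, uint32_t app_win)
{
	assert(rx->advmss > 0);
	app_win = max_u32(app_win, 2u * rx->advmss);	/* floor of two segments */
	rx->window_clamp = min_u32(rx->window_clamp, app_win);	/* only ever lowers */
}
```

Starting from a clamp of 65535 with advmss 1460, calls with app_win of 20000, 9000 and then 60000 leave the clamp at 9000: each slowdown can lower it, and a later call with a larger backlog cannot raise it. Whether other code paths are supposed to restore the clamp is outside this sketch.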


-Ion


Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels

2005-09-02 Thread Ion Badulescu

On Fri, 2 Sep 2005, Guillaume Autran wrote:

> I experienced the very same problem but with window size going all the way
> down to just a few bytes (14 bytes). Dump files available upon request :)
> Ion, how were you able to reproduce the issue? Can the same type of traffic
> always reproduce the issue or is it more intermittent?


I have no problem whatsoever reproducing it, at least with the kind of 
traffic I described. I had 4 flows like that running yesterday, and all 4 
had TCP window sizes smaller than 500 bytes on the receiver by mid-day.


-Ion


Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels

2005-09-02 Thread Ion Badulescu

On Fri, 2 Sep 2005, Noritoshi Demizu wrote:


> By the way, if tcpdump does not track the window scale option, the right
> edge (ack + real win) does not change between the following two ACKs.
>
> 11:34:54.337167 10.2.20.246.33060 > 10.2.224.182.8700: . ack 84402527 win 15340
> <nop,nop,timestamp 226080473 99717814> (DF)
>
>   (259 ACKs are omitted here)
>
> 11:34:54.611769 10.2.20.246.33060 > 10.2.224.182.8700: . ack 84454467 win 2355
> <nop,nop,timestamp 226080721 99717841> (DF)
>
> The first line is the 37th ACK and the second line is the 295th ACK.
>
>   ACK#37:  ack=84402527 win=15340 right_edge=84463887 (= ack + win * 4)
>   ACK#295: ack=84454467 win=2355  right_edge=84463887 (= ack + win * 4)
>
> And all ACKs later than ACK#295 have win=2355 (2355*4=9420).
>
> This may be a hint.  But, sorry, I do not know the internals of Linux TCP.


Oh, it's absolutely possible (even likely) that the application was slow 
between 11:34:54.337167 and 11:34:54.611769 and data kept accumulating in 
the socket buffer. The real problem is not the shrinking of the window, 
but the fact that it never increases back to normal once the socket buffer 
is emptied.
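
The right-edge figures quoted above can be rechecked with a few lines of C. This is only a sketch of the arithmetic, assuming a window scale shift of 2 (the "win * 4" in the quoted text); the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Right edge of the advertised window: ack number plus the scaled window.
 * TCP sequence arithmetic is modulo 2^32, which uint32_t wraparound matches. */
static uint32_t right_edge(uint32_t ack, uint16_t raw_win, unsigned int wscale)
{
	assert(wscale <= 14);
	return ack + ((uint32_t)raw_win << wscale);
}
```

Both ACK#37 (ack 84402527, win 15340) and ACK#295 (ack 84454467, win 2355) give 84463887. The ack number advanced by 51940 bytes while the window shrank by the same amount, which fits data piling up unread in the socket buffer during that interval.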



> I think there is a possibility that some middle-box does something,
> for example, some middle-box between the two machines does kinda
> traffic-shaping by tweaking the TCP window size field.


Not really: the tcpdump is taken on the very box that generates the acks 
with the shrinking window, so it can't possibly be affected by any shaper. 
Unless the shaper is the Linux kernel itself...


-Ion


Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels

2005-09-01 Thread Ion Badulescu

Hi David,

On Thu, 1 Sep 2005, David S. Miller wrote:


> Thanks for the empty posting.  Please provide the content you
> intended to post, and furthermore please post it to the network
> developer mailing list, netdev@vger.kernel.org


First of all, thanks for the reply (even to an empty posting :).

The posting wasn't actually empty, it was probably too long (94K according 
to my sent-mail folder) and majordomo truncated it to zero. It has some 
tcpdump snippets, that's what made it so long... unfortunately, they're 
all necessary to understand the nature of the bug. I wasn't sure about 
netdev, that's why I posted it only to linux-kernel and linux-net.


I can provide the full tcpdump out-of-band to interested people, since I 
don't think I can get it past majordomo.


Here is the text of the message without the tcpdump inserts:

---
Hello,

I've been tracking down this bug for some time, and I'm fairly convinced 
at this point that it's a kernel bug.


Under certain conditions, the TCP stack starts shrinking the TCP window 
down to some ridiculously low values (hundreds of bytes, as low as 181) 
and never recovers. The certain conditions I mentioned are not well 
understood at this point, but they include a long-lived connection with 
very one-sided, fluctuating traffic flowing through it.


So far I've been able to reproduce it on plain-vanilla 2.4.9, 2.4.11.9, 
and 2.4.12.2, as well as on the RHEL3 kernels 2.4.21-20 and 2.4.21-31. The 
hardware is dual Opteron 250, running both 32- and 64-bit SMP kernels 
(seems to make no difference). I've also seen the bug occur on a single 
Athlon XP running 2.6.11.9 UP.


The bug occurs with all sysctl settings at their default values. I've 
tried enabling and disabling pretty much all the tcp-related sysctl's in 
/proc/sys/net/ipv4, to no visible improvement.


Here are a few tcpdump snippets of a TCP connection exhibiting the bug 
(the complete tcpdump is available upon request, but it's very large). 
10.2.20.246 is the data receiver and is the box exhibiting the bug (I'm 
not sure what 10.2.224.182 is running, I don't have access to it). The 
data being sent through is real-time financial data; the session begins by 
catching up (at line speed) to present time, then continues to receive 
real-time data as it is being generated. For what it's worth, we've never 
seen the bug occur while the session is still catching up (and 
receiving a few large packets at a time); it always seems to happen while 
receiving real-time data (many small packets, variably interspaced).


[I apologize for the amount of tcpdump data, but it's the only way to show 
the bug in action.]


[tcpdump output removed]

The connection is established and the receiver's TCP window quickly ramps 
up to 8192.


[tcpdump output removed]

Shortly thereafter the TCP window increases further to 16534. It remains 
around 16534 for the next 5 minutes or so.


[tcpdump output removed]

A few minutes later it has finally caught up to present time and it starts 
receiving smaller packets containing real-time data. The TCP window is 
still 16534 at this point.


[tcpdump output removed]

This is where things start going bad. The window starts shrinking from 
15340 all the way down to 2355 over the course of 0.3 seconds. Notice the 
many duplicate acks that serve no purpose (there are no lost packets and 
the tcpdump is taken on the receiver so there are no packets/acks crossing 
in flight).


[tcpdump output removed]

Five minutes later the TCP window is still at 2355, having never 
recovered. The window is so small that the available bandwidth for this 
connection is too small to keep up with the real-time data so it is 
falling behind, hence large packets are again being used. The application 
processing the data (Java-based) is mostly idle at this point, and netstat 
shows its recv queue to be empty. There is no apparent reason why the 
kernel shouldn't enlarge the window.


In fact, if I let it continue, it eventually shrinks the window even 
further (by 18:19:29, the time I'm writing this email, it's gone all the 
way down to 1373). As I mentioned earlier, I've seen it go as low as 181.


We are kind of stumped at this point, and it's proving to be a 
show-stopping bug for our purposes, especially over WAN links that have 
higher latency (for obvious reasons). Any kind of assistance would be 
greatly appreciated.


Thanks,
-Ion


[PATCH re-sent] one more starfire net driver fix for 2.4.7pre6+

2001-07-20 Thread Ion Badulescu

Hi,

This patch reverses the MII hunk from the previous patch (included in
2.4.7-pre6), which was apparently breaking some cards. It also fixes an
incorrect comment.

Please apply.

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
-
--- linux-2.4/drivers/net/starfire.c.orig	Thu Jul 12 10:15:18 2001
+++ linux-2.4/drivers/net/starfire.c	Thu Jul 12 10:17:30 2001
@@ -87,8 +87,7 @@
 
 	LK1.3.3 (Ion Badulescu)
 	- Initialize the TxMode register properly
-	- Set the MII registers _after_ resetting it
-	- Don't dereference dev->priv after unregister_netdev() has freed it
+	- Don't dereference dev->priv after freeing it
 
 TODO:
 	- implement tx_timeout() properly
@@ -987,12 +986,12 @@
 	struct netdev_private *np = dev->priv;
 	u16 reg0;
 
+	mdio_write(dev, np->phys[0], MII_ADVERTISE, np->advertising);
 	mdio_write(dev, np->phys[0], MII_BMCR, BMCR_RESET);
 	udelay(500);
 	while (mdio_read(dev, np->phys[0], MII_BMCR) & BMCR_RESET);
 
 	reg0 = mdio_read(dev, np->phys[0], MII_BMCR);
-	mdio_write(dev, np->phys[0], MII_ADVERTISE, np->advertising);
 
 	if (np->autoneg) {
 		reg0 |= BMCR_ANENABLE | BMCR_ANRESTART;
@@ -1939,12 +1938,12 @@
 	pci_free_consistent(pdev, PAGE_SIZE,
 			    np->rx_ring, np->rx_ring_dma);
 
-	unregister_netdev(dev); /* Will also free np!! */
+	unregister_netdev(dev);
 	iounmap((char *)dev->base_addr);
 	pci_release_regions(pdev);
 
 	pci_set_drvdata(pdev, NULL);
-	kfree(dev);
+	kfree(dev); /* Will also free np!! */
 }
 
 





Re: [PATCH] Minor cleanup and export three functions

2001-07-19 Thread Ion Badulescu

On Fri, 20 Jul 2001 03:03:58 +0100 (BST), Anton Altaparmakov <[EMAIL PROTECTED]> 
wrote:

> I will repost as soon as I manage to convince pine of its wrong ways...

You can't, so don't bother. Just inline it; ctrl-r should do the trick. However,
be careful: newer versions of pine like to strip trailing spaces even from
inlined files -- I've fixed mine but most distributions have the broken one.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: xircom_cb problems

2001-06-08 Thread Ion Badulescu

On Fri, 8 Jun 2001, Tom Sightler wrote:

> OK, I tried your patch, it did fix the problem where pump wouldn't
> pull an IP address, but I'm still having the problem where my ping
> times go nuts.  I've attached an example, it's 100% repeatable on my
> network at work.  It was so bad I couldn't get any benchmark numbers.

Just one more question: do you see the same bad ping times if you
completely comment out the call to set_half_duplex?

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: xircom_cb problems

2001-06-07 Thread Ion Badulescu

On Thu, 7 Jun 2001, Tom Sightler wrote:

> Transferring files between the eepro100 machine running 2.4.2-ac11 and my 
> laptop produced a result of 2.24MB/s for sending and 2.13MB/s recieving the 
> file.
> 
> Transfering files between the Alteon Gigabit machine running 2.2.19 and my 
> laptop resulted in the dismal numbers of 249KB/s sending and 185KB/s recieving, 
> close to the numbers you quoted above, but actually slightly worse.
> 
> I'm not sure what would explain the 2.2.19 1GB conencted box being 10x slower 
> than the 2.4.2-ac11 100MB machine.

Both of these are slow, actually. I'm getting 7.5-8MB/s when receiving 
from a 100Mbit box (tulip or starfire, doesn't seem to matter). 
Transmitting is still slow for me, but that is most likely a different 
problem -- and I'm looking into it.

Moreover, I'm getting 9+MB/s in both directions when using the other 
driver (xircom_tulip_cb), patched to do half-duplex only. So the card can 
definitely transfer at network speeds.

> I'll apply your patch with the change to MII handling and rerun some simple 
> file transfers and report the results soon.

Looking forward to seeing them...

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: xircom_cb problems

2001-06-07 Thread Ion Badulescu

On Wed, 6 Jun 2001, Tom Sightler wrote:

> At home where I have a 10Mb half-duplex hub connection all of the drivers work
> properly.

All right, that's expected.

> At work where I have a 10/100Mb full-duplex switch connection the drivers work
> exactly as I described before:
> 
> 2.4.4-ac11 -- mostly works fine -- minor problems awaking from sleep

Can you run some performance testing with this driver, though? The speed
of ftp transfers in both directions would be a good measure. The reason
I'm asking is because we saw really poor performance on 100Mb full-duplex,
something like 200-300KB/s when receiving.

> 2.4.5-ac9 -- keeps logging "Link is absent" then "Linux is 100 mbit" over and
> over when trying to pull an IP address via dhcp using pump or dhcpcd. 

pump likes to bring the interface up and down and up and down, so those 
messages are not necessarily unusual.

Hmm. I have an idea though. In set_half_duplex, we shouldn't touch the MII 
if the new autoneg value is the same as the old one. It should certainly 
help with things like pump. Arjan, what do you think?

> Interestingly manually setting an IP address seems to work fine with
> this driver.

That's very good to know. So most likely the repeated up/down that pump's 
doing is upsetting the card.

> I'll do this tomorrow morning when I get in and report back.  Thanks
> for the help, I'd really like to see this card get stable as we have
> it in a lot of our laptops here at work.

And we'd like to thank you for your patience and for your help diagnosing 
the problem. Let's hope we can solve it quickly..

I'm attaching a small patch that does what I proposed above -- can you 
give it a try as well?

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

--- linux-2.4-ac/drivers/net/pcmcia/xircom_cb.c.old Thu Jun  7 01:27:07 2001
+++ linux-2.4-ac/drivers/net/pcmcia/xircom_cb.c Thu Jun  7 01:28:13 2001
@@ -1092,13 +1092,15 @@
 
/* tell the MII not to advertise 10/100FDX */
tmp = mdio_read(card, 0, 4);
-   printk("xircom_cb: capabilities changed from %#x to %#x\n",
-  tmp, tmp & ~0x140);
-   tmp &= ~0x140;
-   mdio_write(card, 0, 4, tmp);
-   /* restart autonegotiation */
-   tmp = mdio_read(card, 0, 0);
-   mdio_write(card, 0, 0, tmp | 0x1200);
+   if (tmp != (tmp & ~0x140)) {
+   printk("xircom_cb: capabilities changed from %#x to %#x\n",
+  tmp, tmp & ~0x140);
+   tmp &= ~0x140;
+   mdio_write(card, 0, 4, tmp);
+   /* restart autonegotiation */
+   tmp = mdio_read(card, 0, 0);
+   mdio_write(card, 0, 0, tmp | 0x1200);
+   }
 
if (rx)
activate_receiver(card);
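
The idea in this message, leaving the MII alone when the advertisement register already has the full-duplex bits cleared, can be sketched outside the driver roughly as follows. This is only an illustration under stated assumptions: `struct fake_phy`, `fake_mdio_read`, `fake_mdio_write` and `set_half_duplex_sketch` are hypothetical stand-ins, not the xircom_cb code, and the simulated PHY merely stores values and counts writes. Note that if the comparison is written inline, `tmp & ~0x140` needs its own parentheses, because in C `!=` binds more tightly than `&`.

```c
#include <stdint.h>

/* Register numbers and masks taken from the values used in the message. */
#define MII_BMCR         0       /* basic mode control register */
#define MII_ADVERTISE    4       /* autonegotiation advertisement register */
#define ADV_FDX_MASK     0x0140  /* 10baseT-FD (0x0040) | 100baseTX-FD (0x0100) */
#define BMCR_AN_RESTART  0x1200  /* autoneg enable (0x1000) | restart (0x0200) */

/* Hypothetical stand-in for the PHY: a register file plus a write counter,
 * so the effect of the "skip if unchanged" check can be observed. */
struct fake_phy {
	uint16_t regs[32];
	unsigned int writes;
};

static uint16_t fake_mdio_read(const struct fake_phy *phy, int reg)
{
	return phy->regs[reg];
}

static void fake_mdio_write(struct fake_phy *phy, int reg, uint16_t val)
{
	phy->regs[reg] = val;
	phy->writes++;
}

/* Returns 1 if the PHY was reprogrammed, 0 if it was left alone. */
int set_half_duplex_sketch(struct fake_phy *phy)
{
	uint16_t old = fake_mdio_read(phy, MII_ADVERTISE);
	uint16_t wanted = old & (uint16_t)~ADV_FDX_MASK;

	/* Nothing to change: do not touch the MII, do not restart autoneg. */
	if (wanted == old)
		return 0;

	fake_mdio_write(phy, MII_ADVERTISE, wanted);
	/* Restart autonegotiation so the new advertisement takes effect. */
	fake_mdio_write(phy, MII_BMCR,
			fake_mdio_read(phy, MII_BMCR) | BMCR_AN_RESTART);
	return 1;
}
```

With this shape, a tool that brings the interface up and down repeatedly triggers at most one renegotiation: the first call reprograms the PHY, and later calls find the register already in the wanted state and return without any write.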




Re: Linux 2.4.5-ac9

2001-06-06 Thread Ion Badulescu

On Wed, 6 Jun 2001 13:20:41 -0400, Tom Sightler <[EMAIL PROTECTED]> wrote:
>> 2.4.5-ac9
> 
>> o Fix xircom_cb problems with some cisco kit (Ion Badulescu)
> 
> I'm not sure what this is supposed to fix, but it makes my Xircom
> RBEM56G-100 almost useless on my network at the office.  Actually, I can't
> quite blame just this patch, it only makes the problem worse, the driver
> from 2.4.5-ac3 worked, but with 1 second ping times, the new driver barely
> works at all, it seems to think the link is not there, at least not enough
> to pull an IP address.

The patch does only one thing: it instructs the card not to negotiate
full-duplex modes, because (for undocumented and yet unexplained reasons)
full-duplex modes don't work well on this card.
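
As a rough illustration of what "not negotiating full-duplex modes" means at the register level, the sketch below clears the two full-duplex ability bits in an MII advertisement word and then picks the best mode common to both link partners. The bit values are the standard MII advertisement bits; the helper names (`strip_full_duplex`, `best_common_mode`) are made up for this example, and the priority order is simplified (100BASE-T4 is ignored).

```c
#include <stdint.h>

/* Standard MII advertisement register ability bits. */
#define ADV_10HALF   0x0020
#define ADV_10FULL   0x0040
#define ADV_100HALF  0x0080
#define ADV_100FULL  0x0100

/* Clear both full-duplex abilities and leave every other bit untouched. */
uint16_t strip_full_duplex(uint16_t adv)
{
	return adv & (uint16_t)~(ADV_10FULL | ADV_100FULL);
}

/* Highest-priority mode advertised by both ends (simplified ordering). */
const char *best_common_mode(uint16_t local_adv, uint16_t partner_adv)
{
	uint16_t common = local_adv & partner_adv;

	if (common & ADV_100FULL)
		return "100baseTX-FD";
	if (common & ADV_100HALF)
		return "100baseTX-HD";
	if (common & ADV_10FULL)
		return "10baseT-FD";
	if (common & ADV_10HALF)
		return "10baseT-HD";
	return "none";
}
```

For example, a card advertising 0x01e1 against a partner advertising the same value would settle on 100baseTX-FD; after `strip_full_duplex()` the card advertises 0x00a1 and the best common mode becomes 100baseTX-HD. This assumes the partner actually autonegotiates; a port forced to a fixed mode is outside what the sketch models.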

If you had problems before, then their cause is most likely elsewhere.
1-second ping time is definitely wrong.

> The last driver that worked moderately well for me was the one from
> 2.4.4-ac11, it still had a few issues, mostly when resuming, but everything
> worked at home on my 10Mb hub, and at the office on my 10/100Mb FD Cisco
> 6509.  I must admist that I haven't tested every version in between.

The thing is, I don't really see any significant differences between the
2.4.4-ac11 driver and the 2.4.5-ac9 driver. I see lots of clean-ups, some
power management stuff, and the half-duplex stuff. None of them should
affect the core functionality directly..

Please do me a favor: comment out the call to set_half_duplex() (in
xircom_up), recompile and see if it makes a difference.

> One other note, the version in 2.4.4-ac11 is listed as 1.33 while the
> version in 2.4.5-ac9 is 1.11, why did we go backwards?  Were there
> significant problems with the newer version?  The 1.33 sure seems to work
> better for me.

The CVS version is almost irrelevant; I guess Arjan simply rebuilt his
repository.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: [PATCH] Proper perfect filter setup for xircom_tulip_cb.c

2001-06-06 Thread Ion Badulescu

On Wed, 6 Jun 2001, Keith Owens wrote:

> Nicely spotted.  The X3201-3 Software Specification says nothing about
> the segment bits for the filter, instead the information is tucked away
> in the 21143 PCI/CardBus 10/100Mb/s Ethernet LAN Controller Hardware
> Reference Manual.  So Xircom have a software specification manual that
> does not include the full software spec, oh the horrors.

If that were the only thing missing... At least I was able to catch it by 
simply being tidy -- I was going over the control bits to see which of 
them need to be/make sense to be set. 

But what about setting the hash (the imperfect filter)? The 21143 docs say
the *upper* 9 bits of the crc32 should be used; the Xircom docs are
completely silent, and the driver currently uses the *lower* 9 bits of the
crc32.

Oh, and for that matter, the hash is currently broken on the Xircom: 
unlike the 21143, the Xircom has 4 perfect filter slots in hash mode, not
one, but the driver ignores this fact. The hash layout is also severely
brain-damaged... I'll look into fixing it tomorrow -- by simply copying 
the fix from the patch I wrote for the other xircom driver (xircom_cb). :-)
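
To make the upper-versus-lower question concrete, here is a small sketch of the two candidate ways to derive a 9-bit hash index (one of 512 table bits) from a 32-bit CRC. It is not taken from either driver: `crc32_le_raw` is a plain bitwise CRC-32 written for this example (reflected polynomial 0xedb88320, initial value all ones, no final inversion), and which bit order and which 9 bits the hardware really expects is exactly what the message leaves open.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32: polynomial 0xedb88320, initial value ~0,
 * no final inversion. Illustrative only, not a driver routine. */
uint32_t crc32_le_raw(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xffffffffu;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= data[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1u) ? 0xedb88320u : 0u);
	}
	return crc;
}

/* Candidate 1: the low 9 bits of the CRC. */
unsigned int hash_index_lower9(uint32_t crc)
{
	return crc & 0x1ffu;
}

/* Candidate 2: the high 9 bits of the CRC. */
unsigned int hash_index_upper9(uint32_t crc)
{
	return crc >> 23;
}
```

The two helpers generally select different table bits for the same address, so a driver using one while the chip expects the other would program a filter that misses the intended multicast addresses. Keep in mind that the upper bits of a CRC computed in one bit order are not the same thing as the lower bits computed in the other, so the comparison only makes sense once the CRC variant is pinned down.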

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Starfire driver updates

2001-06-04 Thread Ion Badulescu

Hi Jeff,

The patch below updates the starfire driver to support zerocopy operations
and adds full ethtool support. It also adds a small perl utility (already
present in the -ac tree) people can use to generate the firmware header
file from Adaptec's own Netware drivers.

Please apply..

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
---
--- /src/vanilla/linux-2.4-jg/drivers/net/starfire.c	Thu May 31 23:38:19 2001
+++ linux-2.4/drivers/net/starfire.c	Mon Jun  4 19:12:05 2001
@@ -2,6 +2,10 @@
 /*
Written 1998-2000 by Donald Becker.
 
+   Current maintainer is Ion Badulescu <[EMAIL PROTECTED]>. Please
+   send all bug reports to me, and not to Donald Becker, as this code
+   has been modified quite a bit from Donald's original version.
+
This software may be used and distributed according to the terms of
the GNU General Public License (GPL), incorporated herein by reference.
Drivers based on or derived from this code fall under the GPL and must
@@ -70,15 +74,20 @@
LK1.2.9a (Ion Badulescu)
- More updates from Jeff Garzik
 
+   LK1.3.0 (Ion Badulescu)
+   - Merged zerocopy support
+
+   LK1.3.1 (Ion Badulescu)
+   - Added ethtool support
+   - Added GPIO (media change) interrupt support
+
 TODO:
- implement tx_timeout() properly
-   - support ethtool
 */
 
 #define DRV_NAME   "starfire"
-#define DRV_VERSION	"1.03+LK1.2.9"
-#define DRV_RELDATE	"April 19, 2001"
-
+#define DRV_VERSION	"1.03+LK1.3.1"
+#define DRV_RELDATE	"June 04, 2001"
 
 /*
  * Adaptec's license for their Novell drivers (which is where I got the
@@ -87,7 +96,7 @@
  *
  * However, an end-user is allowed to download and use it, after
  * converting it to C header files using starfire_firmware.pl.
- * Once that's done, the #undef must be changed into a #define
+ * Once that's done, the #undef below must be changed into a #define
  * for this driver to really use the firmware. Note that Rx/Tx
  * hardware TCP checksumming is not possible without the firmware.
  *
@@ -100,6 +109,12 @@
  * of length 1. If and when this is fixed, the #define below can be removed.
  */
 #define HAS_BROKEN_FIRMWARE
+/*
+ * Define this if using the driver with the zero-copy patch
+ */
+#if defined(HAS_FIRMWARE) && defined(MAX_SKB_FRAGS)
+#define ZEROCOPY
+#endif
 
 /* The user-configurable values.
These may be modified when a driver module is loaded.*/
@@ -138,8 +153,8 @@
The media type is usually passed in 'options[]'.
 */
 #define MAX_UNITS 8		/* More are supported, limit only on options */
-static int options[MAX_UNITS] = {-1, -1, -1, -1, -1, -1, -1, -1};
-static int full_duplex[MAX_UNITS] = {-1, -1, -1, -1, -1, -1, -1, -1};
+static int options[MAX_UNITS] = {0, };
+static int full_duplex[MAX_UNITS] = {0, };
 
 /* Operational parameters that are set at compile time. */
 
@@ -155,9 +170,23 @@
 
 /* Operational parameters that usually are not changed. */
 /* Time in jiffies before concluding the transmitter is hung. */
-#define TX_TIMEOUT (2*HZ)
+#define TX_TIMEOUT (2 * HZ)
 
+#ifdef ZEROCOPY
+#if MAX_SKB_FRAGS <= 6
+#define MAX_STARFIRE_FRAGS 6
+#else  /* MAX_SKB_FRAGS > 6 */
+#warning This driver will not work with more than 6 skb fragments.
+#warning Turning off zerocopy support.
+#undef ZEROCOPY
+#endif /* MAX_SKB_FRAGS > 6 */
+#endif /* ZEROCOPY */
+
+#ifdef ZEROCOPY
+#define skb_first_frag_len(skb)	skb_headlen(skb)
+#else  /* not ZEROCOPY */
 #define skb_first_frag_len(skb)	(skb->len)
+#endif /* not ZEROCOPY */
 
 #if !defined(__OPTIMIZE__)
 #warning  You must compile this file with the correct options!
@@ -180,22 +209,25 @@
 #include <linux/skbuff.h>
 #include <linux/init.h>
 #include <linux/delay.h>
-#include <linux/ethtool.h>
-#include <asm/uaccess.h>
 #include <asm/processor.h>		/* Processor type for cache alignment. */
 #include <asm/bitops.h>
 #include <asm/io.h>
+#include <asm/uaccess.h>
 
-/* These identify the driver base version and may not be removed. */
-static char version[] __devinitdata =
-KERN_INFO DRV_NAME ".c:v1.03 7/26/2000  Written by Donald Becker <[EMAIL PROTECTED]>\n"
-KERN_INFO " Updates and info at http://www.scyld.com/network/starfire.html\n"
-KERN_INFO " (unofficial 2.4.x kernel port, version " DRV_VERSION ", " DRV_RELDATE ")\n";
+#ifdef SIOCETHTOOL
+#include <linux/ethtool.h>
+#endif
 
 #ifdef HAS_FIRMWARE
 #include "starfire_firmware.h"
 #endif /* HAS_FIRMWARE */
 
+/* These identify the driver base version and may not be removed. */
+static char version[] __devinitdata =
+KERN_INFO "starfire.c:v1.03 7/26/2000  Written by Donald Becker <[EMAIL PROTECTED]>\n"
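+KERN_INFO " Updates and info at http://www.scyld.com/network/starfire.html\n"
+KERN_INFO " (unofficial 2.4.x kernel port, version " DRV_VERSION ", " DRV_RELDATE
+")\n";
+
 MODULE_AUTHOR("Donald Becker <[EMAIL PROTECTED]>");
 MODULE_DESCRIPTION("Adaptec Starfire Ethernet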

Re: 2.2.19 locks up on SMP - tcp-hang patch NOT fixed the problem!

2001-06-01 Thread Ion Badulescu

On Tue, 29 May 2001 15:50:22 +0200, [EMAIL PROTECTED] wrote:

> Today I tried to install freeswan1.9. After establishing ipsec tunnel with
> my peer I got the wait_on_bh message.
> (I cannot paste exactly because It is a production machine, and I restarted
> it as fast as I could)
> 
> So what to do?

Take it up with the freeswan people. It is very likely an SMP bug in their
code.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: Xircom RealPort versus 3COM 3C3FEM656C

2001-05-22 Thread Ion Badulescu

On Tue, 22 May 2001 [EMAIL PROTECTED] wrote:

> This sounds like a bug I have heard before: some switches don't work with
> the xircom card (well, our drivers for it) when doing full duplex.
> Could you try the latest driver from 
> 
> http://people.redhat.com/arjanv
> 
> which forces the card to half-duplex? 

It doesn't help; the switch still thinks it's running in full-duplex mode.
Performance is obviously the same.

The switch I have is not managed, so there is nothing I can do on that 
front. Any other suggestions?

[BTW, you've removed too many includes, the driver doesn't compile anymore 
in the 2.4.4-ac tree.]

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: Xircom RealPort versus 3COM 3C3FEM656C

2001-05-22 Thread Ion Badulescu

On Tue, 22 May 2001 20:10:41 +0100 (BST), Alan Cox <[EMAIL PROTECTED]> wrote:

> Before you give up on the xircom thing, try the -ac kernel and set the box
> up to use xircom_cb not xircom_tulip_cb
> 
> That might help a lot

It doesn't, it still performs poorly with any of the three available
drivers -- xircom_cb, xircom_tulip_cb, and tulip_cb (from the pcmcia package):

* Rx gets only about 1.8Mbit/s on a 100Base-TX network with any of the three
* Tx gets 80+Mb/s with xircom_tulip_cb and tulip_cb, and less than 30Mb/s
  with xircom_cb.

And no, promisc mode played no role in this experiment, because my test
network is switched and otherwise very quiet.

Windows drivers handle both Rx and Tx at full speed.

I have this feeling that we're handling the card in some (tulip) compat
mode, which severely cripples performance. It's hard to tell, without
docs...

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: eepro100 rev 12 problems

2001-05-17 Thread Ion Badulescu

On Fri, 18 May 2001, James Fidell wrote:

> > Is this a real card, or is it built-in on the motherboard?
> 
> It's a real card.

All right, that's good to know. Maybe I'll get one for myself, so I can 
test new code on it -- right now I only have rev 9 and earlier cards.

> For various reasons that are far too boring to go into here, I'm not entirely
> free in my choice of card.  What I'll probably do is try to get a rev 8 card
> swapped in for the rev 12 one.  If I can't get a rev 8 card for that machine,
> I'll go with the e100 driver and let you know what happens.

Yes, and do let us know what happens. Whatever the problem is, it wants 
fixing sooner than later.

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: eepro100 rev 12 problems

2001-05-17 Thread Ion Badulescu

On Thu, 17 May 2001 16:59:04 +0100, James Fidell <[EMAIL PROTECTED]> wrote:
> I have two eepro100 interfaces in a machine, one rev 8, which works just
> fine, and another rev 12, which appears as a device when the kernel boots
> and can be configured with an IP address etc., but I can't get any data
> in or out of it.  All the other hardware looks like it's working fine and
> all my rev 8 cards work, so I'm led to ask, are there any known problems
> with eepro100 rev 12 cards under 2.2.18?

Is this a real card, or is it built-in on the motherboard?

I don't think eepro100 has got much testing with rev > 9, though it should
have worked. All eepro100 chips are supposed to be backwards compatible with
the 82557, but maybe our driver initializes some registers in a way that
upsets newer chips. Not having docs for the newer chips doesn't help, either...

Intel's own e100 driver probably works, their code does things differently if
rev >= 12 (what they call the D102 revision). Give it a spin, I guess.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: 2.2.19 locks up on SMP

2001-05-01 Thread Ion Badulescu

On Tue, 1 May 2001, Trond Myklebust wrote:

>  > I'll give your patch a spin tomorrow, after I catch some
>  > zzz's. :-)
> 
> Right you are.

And indeed, the tcp-hang patch fixed the problem! Thanks a lot!

> FYI I've now put up those patches of which I am aware against 2.2.19
> on
> 
>   http://www.fys.uio.no/~trondmy/src/2.2.19
> 
> I'll try to keep that area updated with a brief explanation for each
> patch...

That's where I tried looking first, two days ago, but couldn't find 
anything, and I must have overlooked the patch you sent to the list.

Thanks for crediting me, btw. :-) Just one little nit: the readdir() 
problem appears only when using glibc-2.0, glibc-2.1 seems to be fine.

Thanks again to everybody,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: 2.2.19 locks up on SMP

2001-05-01 Thread Ion Badulescu

On Tue, 1 May 2001, Trond Myklebust wrote:

> Did you apply the following patch which I put out on the lists a
> couple of weeks ago?

No, I was testing with 2.2.19 and then I started going back into the 
2.2.19pre series until I found the culprit.

I'll give your patch a spin tomorrow, after I catch some zzz's. :-)

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: 2.2.19 locks up on SMP

2001-05-01 Thread Ion Badulescu

On Tue, 1 May 2001, Ion Badulescu wrote:

> I'll do another test, 2.2.18 + the NFS/SunRPC changes, and see how it 
> goes. Hopefully they'll apply easily...

As I suspected, 2.2.18 + all the NFS/NFSd/SunRPC changes present in 
2.2.19pre10 locks up with wait_on_bh as soon as I run ls -lR on a large 
NFS directory tree, while at the same time pummeling the network and the 
local disks.

NFS is not enough to trigger the bug, the extra disk/network stress *is*
necessary. The network stress actually seems to be enough, I just 
triggered the bug again...

2.2.18 vanilla is fine.

So I guess the next round is in Trond's court. :-)

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.






Re: 2.2.19 locks up on SMP

2001-05-01 Thread Ion Badulescu

On Tue, 1 May 2001, Ion Badulescu wrote:

> Right now I'm pretty sure it's the NFS/SunRPC changes, but I'll know for 
> sure in about 30 minutes.

As I suspected, 2.2.19pre9 + the NFS/SunRPC changes locked up under load 
with the now familiar:

wait_on_bh, CPU 2:
irq:  1 [0 0]
bh:   1 [0 1]
<[8010af71]> 

This time it happened precisely when I ran a ls -lR on a large tree over 
NFS, so I'm pretty sure this is it.

I'll do another test, 2.2.18 + the NFS/SunRPC changes, and see how it 
goes. Hopefully they'll apply easily...

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: 2.2.19 locks up on SMP

2001-05-01 Thread Ion Badulescu

On Tue, 1 May 2001, Ion Badulescu wrote:

> > aic7xxx
> 
> Loaded but not used, no devices attached to it.

Scratch that, I was confusing it with another box. There is no trace of 
aic7xxx on this system.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: 2.2.19 locks up on SMP

2001-05-01 Thread Ion Badulescu

On Tue, 1 May 2001, Alan Cox wrote:

> Ok the main candidates there would be:
> 
>   The sunrpc/nfs changes

I'm currently testing this one -- just preparing to reboot pre9 + these 
changes.

>   EEpro100/starfire

eepro100 is in use. But that patch is harmless.

>   aic7xxx

Loaded but not used, no devices attached to it.

Right now I'm pretty sure it's the NFS/SunRPC changes, but I'll know for 
sure in about 30 minutes.

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: 2.2.19 locks up on SMP

2001-04-30 Thread Ion Badulescu

On Mon, 30 Apr 2001, Ion Badulescu wrote:

> Ok, so onto the binary search through the 2.2.19pre series...

I think it started in 2.2.19pre10. I can reproduce the hang on pre10, 
quite easily, but I couldn't reproduce it on pre5, pre7 and pre9. I'll try 
a few other pre versions, just to make sure.
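The search described here can be sketched as a plain bisection. This is only an illustration under stated assumptions: the names `first_bad_pre`, `hangs_fn` and `hangs_from_pre10` are invented for the example, 0 stands for 2.2.18, the upper bound is whatever release is already known to hang, and the predicate is assumed to be deterministic, which a real intermittent lockup is not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: true if 2.2.19pre<pre> hangs under the stress test.
 * In practice each call is a rebuild, a reboot and a test run. */
typedef bool (*hangs_fn)(int pre);

/*
 * Return the first pre-release that hangs.
 * Assumes `lo` is known good, `hi` is known bad, and that every release
 * after the first bad one is bad as well.
 * If `tests_run` is not NULL it is incremented once per predicate call.
 */
int first_bad_pre(int lo, int hi, hangs_fn hangs, int *tests_run)
{
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;

        if (tests_run)
            (*tests_run)++;
        if (hangs(mid))
            hi = mid;   /* bug already present: look earlier */
        else
            lo = mid;   /* still good: look later */
    }
    return hi;
}

/* Stand-in for the real test, matching the result reported above. */
bool hangs_from_pre10(int pre)
{
    return pre >= 10;
}
```

With a known-good 0 and a known-bad 16, `first_bad_pre(0, 16, hangs_from_pre10, &n)` probes pre8, pre12, pre10 and pre9 before settling on pre10. A hang that only shows up some of the time needs several runs per probe before a release can be called good.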

One of the things that are different between pre10 and the others is NFS:
the client is broken in all versions except pre10. I'm not sure how much 
it matters, since I wasn't pounding on NFS. Anyway, that's why I want to 
try a few other kernels -- maybe it does matter after all.

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: 2.2.19 locks up on SMP

2001-04-30 Thread Ion Badulescu

On Mon, 30 Apr 2001, Mohammad A. Haque wrote:

> Just to give another data point...
> 
> 2.2.19 + LVM patches - dual P3 550
> 1 GB RAM
> eepro100
> ncr53c8xx scsi
> mylex accelRAID 1100 RAID controller
> 
> We've transferred around 1 GB of stuff over the network and about 200 GB
> between two raids w/o problems in a little under 3 days.
> 
> We've only scratched into swap. Free shows 128K being used.

Ok. Have you tried running a large bonnie (1GB) while at the same time 
pummeling the network? That's how I trigger it, quite reliably.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.





Re: 2.2.19 locks up on SMP

2001-04-30 Thread Ion Badulescu

On Mon, 30 Apr 2001, Alan Cox wrote:

> > I also have reports but related to the network driver updates. So I
> > suggest to try again with 2.2.19 but with the drivers/net/* of 2.2.18.
> 
> That's probably a better starting point. It's easier to back out than the VM
> changes and it would also explain the reports I saw.

Except that the only driver I'm using is eepro100, and the only change to 
that driver was the patch I submitted myself and which is also in 2.4.

Also, another data point: those two SMP boxes have been running 2.2.18 + 
Andrea's VM-global patch since January, without a hitch.

Ok, so onto the binary search through the 2.2.19pre series...

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: 2.2.19 locks up on SMP

2001-04-30 Thread Ion Badulescu

On Sun, 29 Apr 2001 01:16:04 +0200, bert hubert <[EMAIL PROTECTED]> wrote:
> On Sat, Apr 28, 2001 at 02:21:29PM -0700, Ion Badulescu wrote:
>> Hi Alan,
>> 
>> Over the last week I've tried to upgrade a 4-CPU Xeon box to 2.2.19, but
>> it keeps locking up whenever the disks are stressed a bit, e.g. when
>> updatedb is running. I get the following messages on the console:
>> 
>> wait_on_bh, CPU 1:
>> irq:  1 [1 0]
>> bh:   1 [1 0]
>> <[8010af71]>
> 
> Obvious question is, which compiler.

These are rh62 systems, the compiler is egcs-1.1.2. So that's not it.

I'd be willing to do the binary search through the 2.2.19pre series,
but I'd rather avoid it if it's a known bug. It's pretty painful, both
for myself and for the real users of this box, to go through the pains
of 10-20 cycles of reboot-crash-fsck_3_large_disks...

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



2.2.19 locks up on SMP

2001-04-28 Thread Ion Badulescu

Hi Alan,

Over the last week I've tried to upgrade a 4-CPU Xeon box to 2.2.19, but
it keeps locking up whenever the disks are stressed a bit, e.g. when
updatedb is running. I get the following messages on the console:

wait_on_bh, CPU 1:
irq:  1 [1 0]
bh:   1 [1 0]
<[8010af71]>

over and over again, until somebody pushes the reset button.  8010af71 is 
somewhere in the middle of synchronize_bh().

The hardware configuration is: 4 Xeon/500MHz, 1GB RAM, 3 SCSI disks
attached to a symbios controller, 2 eepro100 interfaces. The kernel is
compiled with support for SMP and 2GB of RAM (hence the kernel address
starting with 8 instead of c).  It was compiled from a pristine source
tree, no patches were applied.
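The remark about the address starting with 8 instead of c can be made concrete with a small check. This is a sketch that assumes the usual i386 split points, with kernel virtual addresses beginning at 0xC0000000 for the default layout and at 0x80000000 for the 2GB option; the macro and function names are made up for the example and are not taken from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed i386 split points: where kernel virtual addresses begin. */
#define PAGE_OFFSET_1GB UINT32_C(0xC0000000) /* default: 3GB user, 1GB kernel */
#define PAGE_OFFSET_2GB UINT32_C(0x80000000) /* 2GB option: 2GB user, 2GB kernel */

/* True if `addr` lies at or above the start of kernel space for this split. */
bool is_kernel_address(uint32_t addr, uint32_t page_offset)
{
    return addr >= page_offset;
}
```

Under these assumptions `is_kernel_address(0x8010af71, PAGE_OFFSET_2GB)` is true while the same address with `PAGE_OFFSET_1GB` is not, which is why the trace only makes sense as kernel text on a 2GB-configured kernel.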

I had more problems with 2.2.19 and another SMP box, which was also 
locking up under stress. I'm not sure if it had the same messages on the 
console, since it's headless, but it was running the same 2.2.19 kernel as 
the previous one and was locking up in a very similar fashion. The 
hardware in that box is 2 P-III/750MHz, 512MB RAM, 1 IDE disk on a PIIX 
controller, and an unused aic7xxx SCSI controller with no SCSI devices 
attached to it.

Both boxes are rock-solid when running 2.2.18-SMP.

Any ideas? Has anybody else reported this with 2.2.19?

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: 2.2.19 and file lock on NFS?

2001-04-26 Thread Ion Badulescu

On 25 Apr 2001 00:39:43 +0200, Trond Myklebust <[EMAIL PROTECTED]> wrote:
>> " " == apark  <[EMAIL PROTECTED]> writes:
> 
> > Hi, Recently upgraded to 2.2.19, along with new
> > nfs-utils(0.3.1).  But I have a program that requires a
> > exclusive write lock on a NFSed directory.  When I was using
> > 2.2.17 all was ok, but now it returns ENOLCK.  Does anybody
> > else have the same problem?  Thanks
> 
> Hi,
> 
> You are probably failing to run the statd daemon or you may have set
> up over-restrictive /etc/hosts.(allow|deny).

Or /var/lib/nfs/statd does not exist on your system, or it is not
writable by the uid rpc.statd runs under (rpcuser on Red Hat boxes
with all the upgrades).
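For illustration, the directory condition described above could be checked roughly as follows. This is a minimal sketch, not what rpc.statd itself does: the names `mode_allows_write` and `check_state_dir` are invented for the example, and the check looks only at the classic owner/group/other mode bits, ignoring ACLs, supplementary groups and root's override.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

enum dir_status {
    DIR_OK,
    DIR_MISSING,      /* stat() failed: missing or not reachable */
    DIR_NOT_A_DIR,
    DIR_NOT_WRITABLE
};

/* Classic Unix check: pick the owner, group or other write bit for (uid, gid). */
bool mode_allows_write(mode_t mode, uid_t owner, gid_t group,
                       uid_t uid, gid_t gid)
{
    if (uid == owner)
        return (mode & S_IWUSR) != 0;
    if (gid == group)
        return (mode & S_IWGRP) != 0;
    return (mode & S_IWOTH) != 0;
}

/* Report whether `path` exists, is a directory, and is writable by (uid, gid). */
enum dir_status check_state_dir(const char *path, uid_t uid, gid_t gid)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return DIR_MISSING;
    if (!S_ISDIR(st.st_mode))
        return DIR_NOT_A_DIR;
    if (!mode_allows_write(st.st_mode, st.st_uid, st.st_gid, uid, gid))
        return DIR_NOT_WRITABLE;
    return DIR_OK;
}
```

Calling `check_state_dir("/var/lib/nfs/statd", uid, gid)` with the uid and gid that rpc.statd runs under on the system in question returns `DIR_OK` only if the directory exists and its mode bits grant that user write access.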

I've had this problem myself, I don't think it's covered in the FAQ.

I haven't checked whether the rpcuser thing is a redhat-ism, but it
probably is: their script is not doing anything special with rpc.statd,
so they must have hacked it into the executable.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: [PATCH] Longstanding elf fix (2.4.3 fix)

2001-04-24 Thread Ion Badulescu

On 23 Apr 2001 12:54:22 -0600, Eric W. Biederman <[EMAIL PROTECTED]> wrote:

> I'll include it again.  I had it attached as a plain text attachment,
> I don't know if that is a problem or not.

Actually it was attached as text/x-patch, not as text/plain... so
pine certainly refused to display it inline.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-20 Thread Ion Badulescu

On Fri, 20 Apr 2001, Roberto Nibali wrote:

> No, it's not a bug but thank you for this tip. It's just a put-on limitation
> in the driver itself:
> 
> --- starfire.c~   Fri Apr 20 18:48:05 2001
> +++ starfire.c    Fri Apr 20 18:27:20 2001
> @@ -308,7 +308,7 @@
>   void (*resume)(struct pci_dev *dev);    /* Device woken up */
>  };
>  
> -#define PCI_MAX_MAPPINGS 16
> +#define PCI_MAX_MAPPINGS 32

Ehh.. yes, I forgot about this. It's a limitation in the 2.2 compatibility 
code, 2.4 is not affected.
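
To illustrate why a constant like this caps the number of cards, here is
a rough sketch of the kind of fixed-size lookup table a compatibility
layer might use to remember per-device driver data. This is not the
actual starfire or emu10k1 code: struct drv_map, compat_set_drvdata()
and compat_get_drvdata() are hypothetical names, and only
PCI_MAX_MAPPINGS and its value come from the diff quoted above.

```c
#include <stddef.h>

#define PCI_MAX_MAPPINGS 16	/* value from the quoted diff */

/* Hypothetical: one device-to-private-data association. */
struct drv_map {
	const void *dev;	/* device handle, NULL while the slot is free */
	void *drvdata;		/* driver-private pointer stored for dev */
};

static struct drv_map drvmap[PCI_MAX_MAPPINGS];

/* Store drvdata for dev. Returns 0 on success, -1 if the table is full. */
int compat_set_drvdata(const void *dev, void *drvdata)
{
	size_t i;

	if (dev == NULL)
		return -1;

	/* Update in place if this device already has a slot. */
	for (i = 0; i < PCI_MAX_MAPPINGS; i++) {
		if (drvmap[i].dev == dev) {
			drvmap[i].drvdata = drvdata;
			return 0;
		}
	}
	/* Otherwise claim the first free slot. */
	for (i = 0; i < PCI_MAX_MAPPINGS; i++) {
		if (drvmap[i].dev == NULL) {
			drvmap[i].dev = dev;
			drvmap[i].drvdata = drvdata;
			return 0;
		}
	}
	return -1;	/* one device more than PCI_MAX_MAPPINGS */
}

/* Look up the pointer stored for dev, or NULL if there is none. */
void *compat_get_drvdata(const void *dev)
{
	size_t i;

	if (dev == NULL)
		return NULL;
	for (i = 0; i < PCI_MAX_MAPPINGS; i++) {
		if (drvmap[i].dev == dev)
			return drvmap[i].drvdata;
	}
	return NULL;
}
```

With a table like this, the 17th port simply has nowhere to go, which
would match the symptom; bumping the constant costs only a few more
static slots.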

> This cures my problem. I've checked this and it seems as if Ion copied
> this from the sound/emu10k1/emu_wrapper.c code, where I understand that
> nobody will have more than 16 times the same soundcard. Ion, do I break
> something with this? If not, could you please adjust your driver?

Well, normally nobody will have more than 16 eth ports, either, because
net_init.c won't let them. So I'm not sure this is something *I* should fix.

I guess I'll send a patch to Alan that changes both the driver and 
net_init.c, once 2.2.20pre is started. If he takes it, great, otherwise 
you'll have to continue making this change for yourself.

> Thanks to all of you for your help. I learned a lot today.

You're welcome. :-)

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: starfire update for 2.4.4-pre5

2001-04-20 Thread Ion Badulescu

On Fri, 20 Apr 2001, Jeff Garzik wrote:

> alas:
> http://gtf.org/garzik/kernel/files/patches/2.4/2.4.4/net-version-2.4.4.5.patch.gz

Oh well. Another hour, another patch to be sent out. :-)

I'll deal with CVS tomorrow, when I figure out on which disk I have enough
space for yet another tree. So I can only hope the attached diff,
generated against 2.4.4-pre5 plus the above patch, will apply cleanly.

Once these changes are accepted, the next step will be to add zerocopy 
support. I have it all ready (since January), I was just waiting for the 
zerocopy framework to be included.

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
-
--- /mnt/3/linux-2.4/drivers/net/starfire.c Fri Apr 20 04:06:54 2001
+++ linux-2.4/drivers/net/starfire.c	Fri Apr 20 04:08:23 2001
@@ -20,7 +20,7 @@
---
 
Linux kernel-specific changes:
-   
+
LK1.1.1 (jgarzik):
- Use PCI driver interface
- Fix MOD_xxx races
@@ -31,27 +31,102 @@
 
LK1.1.3 (Andrew Morton)
- Timer cleanups
-   
+
LK1.1.4 (jgarzik):
- Merge Becker version 1.03
+
+   LK1.2.1 (Ion Badulescu <[EMAIL PROTECTED]>)
+   - Support hardware Rx/Tx checksumming
+   - Use the GFP firmware taken from Adaptec's Netware driver
+
+   LK1.2.2 (Ion Badulescu)
+   - Backported to 2.2.x
+
+   LK1.2.3 (Ion Badulescu)
+   - Fix the flaky mdio interface
+   - More compat clean-ups
+
+   LK1.2.4 (Ion Badulescu)
+   - More 2.2.x initialization fixes
+
+   LK1.2.5 (Ion Badulescu)
+   - Several fixes from Manfred Spraul
+
+   LK1.2.6 (Ion Badulescu)
+   - Fixed ifup/ifdown/ifup problem in 2.4.x
+
+   LK1.2.7 (Ion Badulescu)
+   - Removed unused code
+   - Made more functions static and __init
+
+   LK1.2.8 (Ion Badulescu)
+   - Quell bogus error messages, inform about the Tx threshold
+   - Removed #ifdef CONFIG_PCI, this driver is PCI only
+
+   LK1.2.9 (Ion Badulescu)
+   - Merged Jeff Garzik's changes from 2.4.4-pre5
+   - Added 2.2.x compatibility stuff required by the above changes
+
+   LK1.2.9a (Ion Badulescu)
+   - More updates from Jeff Garzik
+
+TODO:
+   - implement tx_timeout() properly
+   - support ethtool
 */
 
+/*
+ * Adaptec's license for their Novell drivers (which is where I got the
+ * firmware files) does not allow one to redistribute them. Thus, we can't
+ * include the firmware with this driver.
+ *
+ * However, an end-user is allowed to download and use it, after
+ * converting it to C header files using starfire_firmware.pl.
+ * Once that's done, the #undef must be changed into a #define
+ * for this driver to really use the firmware. Note that Rx/Tx
+ * hardware TCP checksumming is not possible without the firmware.
+ *
+ * I'm currently [Feb 2001] talking to Adaptec about this redistribution
+ * issue. Stay tuned...
+ */
+#undef HAS_FIRMWARE
+/*
+ * The current frame processor firmware fails to checksum a fragment
+ * of length 1. If and when this is fixed, the #define below can be removed.
+ */
+#define HAS_BROKEN_FIRMWARE
+
 /* The user-configurable values.
These may be modified when a driver module is loaded.*/
 
 /* Used for tuning interrupt latency vs. overhead. */
-static int interrupt_mitigation = 0x0;
+static int interrupt_mitigation;
 
 static int debug = 1;  /* 1 normal messages, 0 quiet .. 7 verbose. */
 static int max_interrupt_work = 20;
 static int mtu;
 /* Maximum number of multicast addresses to filter (vs. rx-all-multicast).
-   The Starfire has a 512 element hash table based on the Ethernet CRC.  */
-static int multicast_filter_limit = 32;
+   The Starfire has a 512 element hash table based on the Ethernet CRC. */
+static int multicast_filter_limit = 512;
 
-/* Set the copy breakpoint for the copy-only-tiny-frames scheme.
-   Setting to > 1518 effectively disables this feature. */
+#define PKT_BUF_SZ	1536	/* Size of each temporary Rx buffer.*/
+/*
+ * Set the copy breakpoint for the copy-only-tiny-frames scheme.
+ * Setting to > 1518 effectively disables this feature.
+ *
+ * NOTE:
+ * The ia64 doesn't allow for unaligned loads even of integers being
+ * misaligned on a 2 byte boundary. Thus always force copying of
+ * packets as the starfire doesn't allow for misaligned DMAs ;-(
+ * 23/10/2000 - Jes
+ *
+ * The Alpha and the Sparc don't allow unaligned loads, either. -Ion
+ */
+#if defined(__ia64__) || defined(__alpha__) || defined(__sparc__)
+static int rx_copybreak = PKT_BUF_SZ;
+#else
 static int rx_copybreak = 0;
+#endif
 
 /* Used to pass the media type, etc.
Both 'options[]' and 'full_duplex[]' exist for driver interoperability.
@@ -75,21 +150,9 @@
 
 /* Operational parameters that usually are not changed. */
 /* Time in jiffies before concluding the transmitter is hung. */

starfire update for 2.4.4-pre5

2001-04-20 Thread Ion Badulescu

Hi Jeff,

Here is the same starfire.c version I sent earlier, this time diff'ed 
against 2.4.4-pre5. It's essentially the version from 2.2.19 plus your 
2.4.4-pre5 changes minus the 2.2 compatibility stuff.

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
--
--- /mnt/3/linux-2.4/drivers/net/starfire.c Thu Apr 19 15:54:59 2001
+++ linux-2.4/drivers/net/starfire.c	Thu Apr 19 21:39:24 2001
@@ -20,7 +20,7 @@
---
 
Linux kernel-specific changes:
-   
+
LK1.1.1 (jgarzik):
- Use PCI driver interface
- Fix MOD_xxx races
@@ -31,9 +31,45 @@
 
LK1.1.3 (Andrew Morton)
- Timer cleanups
-   
+
LK1.1.4 (jgarzik):
- Merge Becker version 1.03
+
+   LK1.2.1 (Ion Badulescu <[EMAIL PROTECTED]>)
+   - Support hardware Rx/Tx checksumming
+   - Use the GFP firmware taken from Adaptec's Netware driver
+
+   LK1.2.2 (Ion Badulescu)
+   - Backported to 2.2.x
+
+   LK1.2.3 (Ion Badulescu)
+   - Fix the flaky mdio interface
+   - More compat clean-ups
+
+   LK1.2.4 (Ion Badulescu)
+   - More 2.2.x initialization fixes
+
+   LK1.2.5 (Ion Badulescu)
+   - Several fixes from Manfred Spraul
+
+   LK1.2.6 (Ion Badulescu)
+   - Fixed ifup/ifdown/ifup problem in 2.4.x
+
+   LK1.2.7 (Ion Badulescu)
+   - Removed unused code
+   - Made more functions static and __init
+
+   LK1.2.8 (Ion Badulescu)
+   - Quell bogus error messages, inform about the Tx threshold
+   - Removed #ifdef CONFIG_PCI, this driver is PCI only
+
+   LK1.2.9 (Ion Badulescu)
+   - Merged Jeff Garzik's changes from 2.4.4-pre5
+   - Added 2.2.x compatibility stuff required by the above changes
+
+TODO:
+   - implement tx_timeout() properly
+   - support ethtool
 */
 
 /* These identify the driver base version and may not be removed. */
@@ -43,24 +79,60 @@
 " Updates and info at http://www.scyld.com/network/starfire.html\n";
 
 static const char version3[] =
-" (unofficial 2.4.x kernel port, version 1.1.4, August 10, 2000)\n";
+" (unofficial 2.4.x kernel port, version 1.2.9, April 19, 2001)\n";
+
+/*
+ * Adaptec's license for their Novell drivers (which is where I got the
+ * firmware files) does not allow one to redistribute them. Thus, we can't
+ * include the firmware with this driver.
+ *
+ * However, an end-user is allowed to download and use it, after
+ * converting it to C header files using starfire_firmware.pl.
+ * Once that's done, the #undef must be changed into a #define
+ * for this driver to really use the firmware. Note that Rx/Tx
+ * hardware TCP checksumming is not possible without the firmware.
+ *
+ * I'm currently [Feb 2001] talking to Adaptec about this redistribution
+ * issue. Stay tuned...
+ */
+#undef HAS_FIRMWARE
+/*
+ * The current frame processor firmware fails to checksum a fragment
+ * of length 1. If and when this is fixed, the #define below can be removed.
+ */
+#define HAS_BROKEN_FIRMWARE
 
 /* The user-configurable values.
These may be modified when a driver module is loaded.*/
 
 /* Used for tuning interrupt latency vs. overhead. */
-static int interrupt_mitigation = 0x0;
+static int interrupt_mitigation;
 
 static int debug = 1;  /* 1 normal messages, 0 quiet .. 7 verbose. */
 static int max_interrupt_work = 20;
 static int mtu;
 /* Maximum number of multicast addresses to filter (vs. rx-all-multicast).
-   The Starfire has a 512 element hash table based on the Ethernet CRC.  */
-static int multicast_filter_limit = 32;
+   The Starfire has a 512 element hash table based on the Ethernet CRC. */
+static int multicast_filter_limit = 512;
 
-/* Set the copy breakpoint for the copy-only-tiny-frames scheme.
-   Setting to > 1518 effectively disables this feature. */
+#define PKT_BUF_SZ	1536	/* Size of each temporary Rx buffer.*/
+/*
+ * Set the copy breakpoint for the copy-only-tiny-frames scheme.
+ * Setting to > 1518 effectively disables this feature.
+ *
+ * NOTE:
+ * The ia64 doesn't allow for unaligned loads even of integers being
+ * misaligned on a 2 byte boundary. Thus always force copying of
+ * packets as the starfire doesn't allow for misaligned DMAs ;-(
+ * 23/10/2000 - Jes
+ *
+ * The Alpha and the Sparc don't allow unaligned loads, either. -Ion
+ */
+#if defined(__ia64__) || defined(__alpha__) || defined(__sparc__)
+static int rx_copybreak = PKT_BUF_SZ;
+#else
 static int rx_copybreak = 0;
+#endif
 
 /* Used to pass the media type, etc.
Both 'options[]' and 'full_duplex[]' exist for driver interoperability.
@@ -84,21 +156,9 @@
 
 /* Operational parameters that usually are not changed. */
 /* Time in jiffies before concluding the transmitter is hung. */
-#define TX_TIMEOUT  (2*HZ)
+#define TX_TIMEOUT	(2*HZ)

Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-20 Thread Ion Badulescu

On Fri, 20 Apr 2001, Jeff Garzik wrote:

> Sorry, I was talking about a local patch not a global patch.  If a user
> must patch their 2.2 kernel to get the starfire driver working anyway,
> then adding a change to do s/.a/.o/ on Makefiles would be simple.

People don't need to patch *anything* to get the starfire driver working --
it's included in 2.2.19 and working rather well I might add. :-)

This was a special case, which btw had nothing to do with the starfire 
driver itself. The user needed to support more than 8 eth ports, which 
2.2 complains about, and more than 16 eth ports, which 2.2 simply doesn't 
allow without further changes.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-20 Thread Ion Badulescu

On Fri, 20 Apr 2001, Jeff Garzik wrote:

> > Check again. drivers/net builds a .a, not a .o. Trust me, I've tried.
> 
> Sure, but if you are patching anyway, it's much better to fix that than
> hack space.c :)

Well, I remember asking Alan if he'd prefer it done that way, and not 
getting a reply back. So I didn't press further.

The change to support __init/__exit in drivers/net is a no-brainer, and I 
did test it at the time -- it worked as expected. But it's really up to 
Alan to decide, I couldn't care less to be quite honest.

In a way I think I understand why he's reluctant: it's very easy to end up
changing the initialization order by mistake and messing up people's 
network setups.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-20 Thread Ion Badulescu

On Fri, 20 Apr 2001, Roberto Nibali wrote:

> Hmm, but doesn't the code in 2.4.x improve the hard IRQ signal delivery
> even for UP systems with a local APIC table? I have an APIC aware board
> but I have only got 1 CPU on it and I currently need to run 2.2 kernel.
> But if you tell me that there is not much help, I'm ok with that, as 
> long as it wouldn't be better with APIC support :)

I think the UP-APIC support was added primarily to support the NMI oopser 
on UP systems. I might be wrong, though.

> > Well.. Space.c is a dinosaur. However, this is the 2.2 series and no more
> > surgery will happen on this kernel, at least normally.
> 
> So, what is your suggestion: Does this limitation do any harm or can I
> live with that and still run 16 eth devices and safely disregard the
> "early initialization ..." ?

You can safely disregard the "early initialization deferred" messages. 
They are essentially harmless.

As for the 16 eth ports limit, if you want to increase it, simply edit 
drivers/net/net_init.c and change the value of MAX_ETH_CARDS. This limit 
appears to also affect modules, so my earlier suggestion of using modules 
wouldn't have helped.
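
As a rough illustration of how a compile-time constant like this bounds
the number of ethN interfaces, consider the sketch below. It is not the
real net_init.c code: eth_slot_used[] and alloc_eth_name() are
hypothetical, and only the MAX_ETH_CARDS name and the limit of 16 come
from this thread.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_ETH_CARDS 16	/* the limit discussed above */

/* Hypothetical bookkeeping: one flag per possible ethN name. */
static int eth_slot_used[MAX_ETH_CARDS];

/*
 * Pick the lowest free ethN name and write it into buf.
 * Returns the index N on success, or -1 once all MAX_ETH_CARDS
 * names have been handed out.
 */
int alloc_eth_name(char *buf, size_t len)
{
	int i;

	for (i = 0; i < MAX_ETH_CARDS; i++) {
		if (!eth_slot_used[i]) {
			eth_slot_used[i] = 1;
			snprintf(buf, len, "eth%d", i);
			return i;
		}
	}
	return -1;	/* no room for a 17th interface */
}
```

Because the limit sits in the allocator rather than in how the driver
gets loaded, it would apply equally to built-in drivers and to modules,
which is consistent with what is described above.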

> > Because, again, this is legacy code. It works, it does the job, that's it.
> > All this crap is gone in 2.4.
> 
> I'll be porting my distribution to 2.4.x soon I think :)

If the only thing you need from your boxes is networking-related, then 
it's probably ok. Otherwise I'd wait a bit longer before putting 2.4 on 
production servers...

> Your driver works now and for me no need to mark it experimental. 

Yeah, I guess I'll submit a patch to remove the experimental bit, after 
the current code changes are accepted..

> It also works statically built into the kernel up to 4 quadboards. I
> hacked Space.c and enhanced the ``static struct device ethX_dev = { };''
> stuff.

You shouldn't need to do that, it's just wasted memory. The ethX_dev was
used mostly to avoid probing for ISA cards, which is completely irrelevant
when using PCI cards. As for the 4 quadboards limit, see above -- all you
need to change is MAX_ETH_CARDS.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.






Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-20 Thread Ion Badulescu

On Fri, 20 Apr 2001, Jeff Garzik wrote:

> > Have you tried loading the drivers as modules? You might have more luck
> > with that approach. Space.c was designed at a time when having 4 NIC's in
> > a PC was "pushing the limits"...
> 
> 2.2.recent has module_init/exit, so you don't even need Space.c.

Check again. drivers/net builds a .a, not a .o. Trust me, I've tried.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: starfire update for 2.4.4-pre5

2001-04-20 Thread Ion Badulescu

On Fri, 20 Apr 2001, Jeff Garzik wrote:

 alas:
 http://gtf.org/garzik/kernel/files/patches/2.4/2.4.4/net-version-2.4.4.5.patch.gz

Oh well. Another hour, another patch to be sent out. :-)

I'll deal with CVS tomorrow, when I figure out on which disk I have enough
space for yet another tree. So I can only hope the attached diff,
generated against 2.4.4-pre5 plus the above patch, will apply cleanly.

Once these changes are accepted, the next step will be to add zerocopy 
support. I have it all ready (since January), I was just waiting for the 
zerocopy framework to be included.

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
-
--- /mnt/3/linux-2.4/drivers/net/starfire.c Fri Apr 20 04:06:54 2001
+++ linux-2.4/drivers/net/starfire.cFri Apr 20 04:08:23 2001
@@ -20,7 +20,7 @@
---
 
Linux kernel-specific changes:
-   
+
LK1.1.1 (jgarzik):
- Use PCI driver interface
- Fix MOD_xxx races
@@ -31,27 +31,102 @@
 
LK1.1.3 (Andrew Morton)
- Timer cleanups
-   
+
LK1.1.4 (jgarzik):
- Merge Becker version 1.03
+
+   LK1.2.1 (Ion Badulescu [EMAIL PROTECTED])
+   - Support hardware Rx/Tx checksumming
+   - Use the GFP firmware taken from Adaptec's Netware driver
+
+   LK1.2.2 (Ion Badulescu)
+   - Backported to 2.2.x
+
+   LK1.2.3 (Ion Badulescu)
+   - Fix the flaky mdio interface
+   - More compat clean-ups
+
+   LK1.2.4 (Ion Badulescu)
+   - More 2.2.x initialization fixes
+
+   LK1.2.5 (Ion Badulescu)
+   - Several fixes from Manfred Spraul
+
+   LK1.2.6 (Ion Badulescu)
+   - Fixed ifup/ifdown/ifup problem in 2.4.x
+
+   LK1.2.7 (Ion Badulescu)
+   - Removed unused code
+   - Made more functions static and __init
+
+   LK1.2.8 (Ion Badulescu)
+   - Quell bogus error messages, inform about the Tx threshold
+   - Removed #ifdef CONFIG_PCI, this driver is PCI only
+
+   LK1.2.9 (Ion Badulescu)
+   - Merged Jeff Garzik's changes from 2.4.4-pre5
+   - Added 2.2.x compatibility stuff required by the above changes
+
+   LK1.2.9a (Ion Badulescu)
+   - More updates from Jeff Garzik
+
+TODO:
+   - implement tx_timeout() properly
+   - support ethtool
 */
 
+/*
+ * Adaptec's license for their Novell drivers (which is where I got the
+ * firmware files) does not allow one to redistribute them. Thus, we can't
+ * include the firmware with this driver.
+ *
+ * However, an end-user is allowed to download and use it, after
+ * converting it to C header files using starfire_firmware.pl.
+ * Once that's done, the #undef must be changed into a #define
+ * for this driver to really use the firmware. Note that Rx/Tx
+ * hardware TCP checksumming is not possible without the firmware.
+ *
+ * I'm currently [Feb 2001] talking to Adaptec about this redistribution
+ * issue. Stay tuned...
+ */
+#undef HAS_FIRMWARE
+/*
+ * The current frame processor firmware fails to checksum a fragment
+ * of length 1. If and when this is fixed, the #define below can be removed.
+ */
+#define HAS_BROKEN_FIRMWARE
+
 /* The user-configurable values.
These may be modified when a driver module is loaded.*/
 
 /* Used for tuning interrupt latency vs. overhead. */
-static int interrupt_mitigation = 0x0;
+static int interrupt_mitigation;
 
 static int debug = 1;  /* 1 normal messages, 0 quiet .. 7 verbose. */
 static int max_interrupt_work = 20;
 static int mtu;
 /* Maximum number of multicast addresses to filter (vs. rx-all-multicast).
-   The Starfire has a 512 element hash table based on the Ethernet CRC.  */
-static int multicast_filter_limit = 32;
+   The Starfire has a 512 element hash table based on the Ethernet CRC. */
+static int multicast_filter_limit = 512;
 
-/* Set the copy breakpoint for the copy-only-tiny-frames scheme.
-   Setting to  1518 effectively disables this feature. */
+#define PKT_BUF_SZ 1536/* Size of each temporary Rx buffer.*/
+/*
+ * Set the copy breakpoint for the copy-only-tiny-frames scheme.
+ * Setting to  1518 effectively disables this feature.
+ *
+ * NOTE:
+ * The ia64 doesn't allow for unaligned loads even of integers being
+ * misaligned on a 2 byte boundary. Thus always force copying of
+ * packets as the starfire doesn't allow for misaligned DMAs ;-(
+ * 23/10/2000 - Jes
+ *
+ * The Alpha and the Sparc don't allow unaligned loads, either. -Ion
+ */
+#if defined(__ia64__) || defined(__alpha__) || defined(__sparc__)
+static int rx_copybreak = PKT_BUF_SZ;
+#else
 static int rx_copybreak = 0;
+#endif
 
 /* Used to pass the media type, etc.
Both 'options[]' and 'full_duplex[]' exist for driver interoperability.
@@ -75,21 +150,9 @@
 
 /* Operational parameters that usually are not changed. */
 /* Time in jiffies before concluding the transmitter
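
The rx_copybreak comment in the hunk above describes a common receive-path trade-off: frames shorter than the threshold are copied into a right-sized buffer so the full-size Rx buffer can go straight back on the ring, while longer frames are handed up in place. What follows is a minimal user-space sketch of that decision, not the driver's code. The names `rx_deliver` and `enum rx_action` are hypothetical, and `malloc` stands in for skb allocation.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PKT_BUF_SZ 1536 /* size of each full Rx buffer, as in the patch */

enum rx_action { RX_COPIED, RX_PASSED_UP, RX_DROPPED };

/*
 * Hypothetical helper: decide how a received frame of pkt_len bytes,
 * sitting in ring_buf, is delivered.
 *   RX_COPIED:    *out is a fresh copy; ring_buf can be reused at once.
 *   RX_PASSED_UP: *out is ring_buf itself; the ring slot needs a new buffer.
 */
static enum rx_action rx_deliver(unsigned char *ring_buf, size_t pkt_len,
                                 size_t copybreak, unsigned char **out)
{
    if (pkt_len > PKT_BUF_SZ)
        return RX_DROPPED;              /* cannot fit in an Rx buffer */

    if (pkt_len < copybreak) {
        unsigned char *copy = malloc(pkt_len ? pkt_len : 1);
        if (!copy)
            return RX_DROPPED;
        memcpy(copy, ring_buf, pkt_len);
        *out = copy;
        return RX_COPIED;
    }

    *out = ring_buf;
    return RX_PASSED_UP;
}
```

With a threshold of 0 nothing is copied. With a threshold of PKT_BUF_SZ every frame that fits in an Rx buffer is copied, which is the behaviour the patch forces on ia64, Alpha and Sparc, where the copy is what gives the upper layers aligned data.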

Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-20 Thread Ion Badulescu

On Fri, 20 Apr 2001, Roberto Nibali wrote:

> No, it's not a bug but thank you for this tip. It's just a put-on limitation
> in the driver itself:
> 
> --- starfire.c~   Fri Apr 20 18:48:05 2001
> +++ starfire.c    Fri Apr 20 18:27:20 2001
> @@ -308,7 +308,7 @@
>   void (*resume)(struct pci_dev *dev);  /* Device woken up */
>  };
>  
> -#define PCI_MAX_MAPPINGS 16
> +#define PCI_MAX_MAPPINGS 32

Ehh.. yes, I forgot about this. It's a limitation in the 2.2 compatibility 
code, 2.4 is not affected.
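
For readers who have not met this kind of limit: the thread only shows the constant, so the sketch below is an assumption about its general shape, namely a fixed-size table whose size is set at compile time. `struct mapping_slot`, `mapping_register` and `mapping_release` are hypothetical names, not taken from the driver.

```c
#include <stddef.h>

#define PCI_MAX_MAPPINGS 16 /* the limit quoted in the diff above */

/* Hypothetical bookkeeping entry; the real compat code may differ. */
struct mapping_slot {
    void *cpu_addr;
    int in_use;
};

static struct mapping_slot slots[PCI_MAX_MAPPINGS];

/* Returns the slot index, or -1 once all PCI_MAX_MAPPINGS slots are taken. */
static int mapping_register(void *cpu_addr)
{
    for (int i = 0; i < PCI_MAX_MAPPINGS; i++) {
        if (!slots[i].in_use) {
            slots[i].cpu_addr = cpu_addr;
            slots[i].in_use = 1;
            return i;
        }
    }
    return -1;
}

static void mapping_release(int idx)
{
    if (idx >= 0 && idx < PCI_MAX_MAPPINGS)
        slots[idx].in_use = 0;
}
```

In a layout like this, raising the constant only enlarges a static array, which would explain why the one-line change is enough on a box with several quad boards.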

> This cures my problem. I've checked this and it seems as if Ion copied
> this from the sound/emu10k1/emu_wrapper.c code, where I understand that
> nobody will have more than 16 times the same soundcard. Ion, do I break
> something with this? If not, could you please adjust your driver?

Well, normally nobody will have more than 16 eth ports, either, because
net_init.c won't let them. So I'm not sure this is something *I* should fix.

I guess I'll send a patch to Alan that changes both the driver and 
net_init.c, once 2.2.20pre is started. If he takes it, great, otherwise 
you'll have to continue making this change for yourself.

> Thanks to all of you for your help. I learned a lot today.

You're welcome. :-)

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-19 Thread Ion Badulescu

On Thu, 19 Apr 2001, Roberto Nibali wrote:

> A 2.2.x UP-APIC patch would maybe improve things here while under
> heavy load. I'm using such boxes as packetfilters. All quadboards
> get IRQ 11 which is rather nasty considering a possible throughput
> of 40Mbit/s per NIC.

The UP-APIC wouldn't help much since there really aren't other processors 
available to share the load.

On the other hand, this is not as bad as it looks. In fact, it will
function rather well and with relatively little overhead if all configured
interfaces are seeing traffic on a regular basis. The IRQ dispatcher will
simply call all registered interrupt routines, and most of them will end
up doing something useful.
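
The shared-IRQ behaviour described above can be modelled in a few lines of user-space C. This is only an illustration under simplifying assumptions: `struct fake_nic`, `fake_nic_interrupt` and `shared_irq_dispatch` are hypothetical, and a real handler would read a device status register instead of a struct field.

```c
#include <stddef.h>

/* Hypothetical model of one NIC on a shared interrupt line. */
struct fake_nic {
    int pending; /* nonzero if this device raised the interrupt */
    int handled; /* number of events serviced so far */
};

/* Per-device handler: returns 1 if it had work to do, 0 otherwise. */
static int fake_nic_interrupt(struct fake_nic *nic)
{
    if (!nic->pending)
        return 0; /* not ours, return quickly */
    nic->pending = 0;
    nic->handled++;
    return 1;
}

/*
 * Dispatcher for the shared line: it cannot tell which device fired,
 * so it calls every registered handler. Returns how many of them did
 * useful work.
 */
static int shared_irq_dispatch(struct fake_nic *nics, size_t n)
{
    int useful = 0;
    for (size_t i = 0; i < n; i++)
        useful += fake_nic_interrupt(&nics[i]);
    return useful;
}
```

When most ports are busy, most calls find work, so the cost of the extra calls is amortized; when only one port is active, the other handlers return after a single check.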

> Would be nice if I could fix the "early initialization ..." problem
> too. I'm still checking the Space.c code:
[snip]

Well.. Space.c is a dinosaur. However, this is the 2.2 series and no more
surgery will happen on this kernel, at least normally.

Have you tried loading the drivers as modules? You might have more luck
with that approach. Space.c was designed at a time when having 4 NICs in
a PC was "pushing the limits"...

> Why isn't it possible to put the "probed" counter into the Space.c for all
> network drivers? So people would not need to care about and the driver
> code would yet be more generic (at least a little bit).

Because, again, this is legacy code. It works, it does the job, that's it. 
All this crap is gone in 2.4.

> Any pointers, sources are welcome and in hope for some further wisdom,

Like I said, try the modules approach. If that doesn't work, I'll take a
closer look (and maybe borrow a few quads from work so I can actually test
the code...)

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: Linux 2.4.3-ac10

2001-04-19 Thread Ion Badulescu

On Thu, 19 Apr 2001, Jeff Garzik wrote:

> I should have gotten off my butt and mentioned this...  I would prefer a
> patch without the 2.2.x compat stuff.  So instead of all that compat
> code, have
>   #include "starfire-2.2.h"
> or similar...
> 
> And then starfire-2.2.h would only exist on 2.2.x.

Hard to please, aren't we. :-)

All right, is this version more pleasant to the eye? It's identical to the 
previous one, but with all the 2.2 ifdef'ed code replaced with an #include 
"starfire-kcomp22.h".
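
As a rough picture of the layout being discussed, the driver source stays free of 2.2 code and pulls the compatibility layer in through one guarded include. The sketch below builds in user space. The version macros are stand-ins for <linux/version.h>, while the 2.3.0 cut-off and the `COMPAT_22` flag are assumptions made only for the demonstration.

```c
/* Stand-ins for <linux/version.h>, so the sketch builds in user space.
 * 2.2.19 is only an example value; change it to see the other branch. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#define LINUX_VERSION_CODE KERNEL_VERSION(2, 2, 19)

#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 3, 0)
/*
 * In the driver this branch would hold a single line:
 *     #include "starfire-kcomp22.h"
 * The flag below is a hypothetical placeholder for whatever that
 * header defines.
 */
#define COMPAT_22 1
#else
#define COMPAT_22 0
#endif

/* Reports whether the compatibility layer was compiled in. */
static int driver_uses_compat(void)
{
    return COMPAT_22;
}
```

On a 2.4 tree the guard is false, so the missing header is never opened by the preprocessor.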

[Now of course starfire-kcomp22.h doesn't exist on 2.4, which -- if you 
stick to your own principles from 2 months ago -- will naturally lead to 
the question, what is this doing in the tree, it's referencing code that 
doesn't exist in the tree, etc, etc. And round and round we go again. :-{

Or maybe not. Hopefully...]

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
---
--- /mnt/3/linux-2.4-ac/drivers/net/starfire.c  Thu Apr 19 15:58:57 2001
+++ linux-2.4/drivers/net/starfire.c    Thu Apr 19 21:39:24 2001
@@ -20,7 +20,7 @@
---
 
Linux kernel-specific changes:
-   
+
LK1.1.1 (jgarzik):
- Use PCI driver interface
- Fix MOD_xxx races
@@ -31,9 +31,45 @@
 
LK1.1.3 (Andrew Morton)
- Timer cleanups
-   
+
LK1.1.4 (jgarzik):
- Merge Becker version 1.03
+
+   LK1.2.1 (Ion Badulescu <[EMAIL PROTECTED]>)
+   - Support hardware Rx/Tx checksumming
+   - Use the GFP firmware taken from Adaptec's Netware driver
+
+   LK1.2.2 (Ion Badulescu)
+   - Backported to 2.2.x
+
+   LK1.2.3 (Ion Badulescu)
+   - Fix the flaky mdio interface
+   - More compat clean-ups
+
+   LK1.2.4 (Ion Badulescu)
+   - More 2.2.x initialization fixes
+
+   LK1.2.5 (Ion Badulescu)
+   - Several fixes from Manfred Spraul
+
+   LK1.2.6 (Ion Badulescu)
+   - Fixed ifup/ifdown/ifup problem in 2.4.x
+
+   LK1.2.7 (Ion Badulescu)
+   - Removed unused code
+   - Made more functions static and __init
+
+   LK1.2.8 (Ion Badulescu)
+   - Quell bogus error messages, inform about the Tx threshold
+   - Removed #ifdef CONFIG_PCI, this driver is PCI only
+
+   LK1.2.9 (Ion Badulescu)
+   - Merged Jeff Garzik's changes from 2.4.4-pre5
+   - Added 2.2.x compatibility stuff required by the above changes
+
+TODO:
+   - implement tx_timeout() properly
+   - support ethtool
 */
 
 /* These identify the driver base version and may not be removed. */
@@ -43,24 +79,60 @@
 " Updates and info at http://www.scyld.com/network/starfire.html\n";
 
 static const char version3[] =
-" (unofficial 2.4.x kernel port, version 1.1.4, August 10, 2000)\n";
+" (unofficial 2.4.x kernel port, version 1.2.9, April 19, 2001)\n";
+
+/*
+ * Adaptec's license for their Novell drivers (which is where I got the
+ * firmware files) does not allow one to redistribute them. Thus, we can't
+ * include the firmware with this driver.
+ *
+ * However, an end-user is allowed to download and use it, after
+ * converting it to C header files using starfire_firmware.pl.
+ * Once that's done, the #undef must be changed into a #define
+ * for this driver to really use the firmware. Note that Rx/Tx
+ * hardware TCP checksumming is not possible without the firmware.
+ *
+ * I'm currently [Feb 2001] talking to Adaptec about this redistribution
+ * issue. Stay tuned...
+ */
+#undef HAS_FIRMWARE
+/*
+ * The current frame processor firmware fails to checksum a fragment
+ * of length 1. If and when this is fixed, the #define below can be removed.
+ */
+#define HAS_BROKEN_FIRMWARE
 
 /* The user-configurable values.
These may be modified when a driver module is loaded.*/
 
 /* Used for tuning interrupt latency vs. overhead. */
-static int interrupt_mitigation = 0x0;
+static int interrupt_mitigation;
 
 static int debug = 1;  /* 1 normal messages, 0 quiet .. 7 verbose. */
 static int max_interrupt_work = 20;
-static int mtu = 0;
+static int mtu;
 /* Maximum number of multicast addresses to filter (vs. rx-all-multicast).
-   The Starfire has a 512 element hash table based on the Ethernet CRC.  */
-static int multicast_filter_limit = 32;
+   The Starfire has a 512 element hash table based on the Ethernet CRC. */
+static int multicast_filter_limit = 512;
 
-/* Set the copy breakpoint for the copy-only-tiny-frames scheme.
-   Setting to > 1518 effectively disables this feature. */
+#define PKT_BUF_SZ 1536 /* Size of each temporary Rx buffer. */
+/*
+ * Set the copy breakpoint for the copy-only-tiny-frames scheme.
+ * Setting to > 1518 effectively disables this feature.
+ *
+ * NOTE:
+ * The ia64 doesn't allow for unaligned loads even of integers being
+ * misali

Re: Linux 2.4.3-ac10

2001-04-19 Thread Ion Badulescu

On Thu, 19 Apr 2001 21:14:32 +0100 (BST), Alan Cox <[EMAIL PROTECTED]> wrote:

> 2.4.3-ac10
> o   Merge Linus 2.4.4pre4

Well, it seems you have backed out my starfire changes when you merged
Jeff Garzik's changes from 2.4.4pre4. So here's a new version, diff'ed
against 2.4.3-ac10, which includes all of Jeff's changes from 2.4.4pre[45].

BTW Jeff, do you want me to send these updates to you instead of Alan,
diff'ed against 2.4.x-pre_latest? Right now we're just wasting each
other's time by making conflicting changes to different trees.

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
--
--- /mnt/3/linux-2.4-ac/drivers/net/starfire.c  Thu Apr 19 15:58:57 2001
+++ linux-2.4/drivers/net/starfire.c    Thu Apr 19 17:41:01 2001
@@ -20,7 +20,7 @@
---
 
Linux kernel-specific changes:
-   
+
LK1.1.1 (jgarzik):
- Use PCI driver interface
- Fix MOD_xxx races
@@ -31,9 +31,45 @@
 
LK1.1.3 (Andrew Morton)
- Timer cleanups
-   
+
LK1.1.4 (jgarzik):
- Merge Becker version 1.03
+
+   LK1.2.1 (Ion Badulescu <[EMAIL PROTECTED]>)
+   - Support hardware Rx/Tx checksumming
+   - Use the GFP firmware taken from Adaptec's Netware driver
+
+   LK1.2.2 (Ion Badulescu)
+   - Backported to 2.2.x
+
+   LK1.2.3 (Ion Badulescu)
+   - Fix the flaky mdio interface
+   - More compat clean-ups
+
+   LK1.2.4 (Ion Badulescu)
+   - More 2.2.x initialization fixes
+
+   LK1.2.5 (Ion Badulescu)
+   - Several fixes from Manfred Spraul
+
+   LK1.2.6 (Ion Badulescu)
+   - Fixed ifup/ifdown/ifup problem in 2.4.x
+
+   LK1.2.7 (Ion Badulescu)
+   - Removed unused code
+   - Made more functions static and __init
+
+   LK1.2.8 (Ion Badulescu)
+   - Quell bogus error messages, inform about the Tx threshold
+   - Removed #ifdef CONFIG_PCI, this driver is PCI only
+
+   LK1.2.9 (Ion Badulescu)
+   - Merged Jeff Garzik's changes from 2.4.4-pre5
+   - Added 2.2.x compatibility stuff required by the above changes
+
+TODO:
+   - implement tx_timeout() properly
+   - support ethtool
 */
 
 /* These identify the driver base version and may not be removed. */
@@ -43,24 +79,60 @@
 " Updates and info at http://www.scyld.com/network/starfire.html\n";
 
 static const char version3[] =
-" (unofficial 2.4.x kernel port, version 1.1.4, August 10, 2000)\n";
+" (unofficial 2.4.x kernel port, version 1.2.9, April 19, 2001)\n";
+
+/*
+ * Adaptec's license for their Novell drivers (which is where I got the
+ * firmware files) does not allow one to redistribute them. Thus, we can't
+ * include the firmware with this driver.
+ *
+ * However, an end-user is allowed to download and use it, after
+ * converting it to C header files using starfire_firmware.pl.
+ * Once that's done, the #undef must be changed into a #define
+ * for this driver to really use the firmware. Note that Rx/Tx
+ * hardware TCP checksumming is not possible without the firmware.
+ *
+ * I'm currently [Feb 2001] talking to Adaptec about this redistribution
+ * issue. Stay tuned...
+ */
+#undef HAS_FIRMWARE
+/*
+ * The current frame processor firmware fails to checksum a fragment
+ * of length 1. If and when this is fixed, the #define below can be removed.
+ */
+#define HAS_BROKEN_FIRMWARE
 
 /* The user-configurable values.
These may be modified when a driver module is loaded.*/
 
 /* Used for tuning interrupt latency vs. overhead. */
-static int interrupt_mitigation = 0x0;
+static int interrupt_mitigation;
 
 static int debug = 1;  /* 1 normal messages, 0 quiet .. 7 verbose. */
 static int max_interrupt_work = 20;
-static int mtu = 0;
+static int mtu;
 /* Maximum number of multicast addresses to filter (vs. rx-all-multicast).
-   The Starfire has a 512 element hash table based on the Ethernet CRC.  */
-static int multicast_filter_limit = 32;
+   The Starfire has a 512 element hash table based on the Ethernet CRC. */
+static int multicast_filter_limit = 512;
 
-/* Set the copy breakpoint for the copy-only-tiny-frames scheme.
-   Setting to > 1518 effectively disables this feature. */
+#define PKT_BUF_SZ 1536 /* Size of each temporary Rx buffer. */
+/*
+ * Set the copy breakpoint for the copy-only-tiny-frames scheme.
+ * Setting to > 1518 effectively disables this feature.
+ *
+ * NOTE:
+ * The ia64 doesn't allow for unaligned loads even of integers being
+ * misaligned on a 2 byte boundary. Thus always force copying of
+ * packets as the starfire doesn't allow for misaligned DMAs ;-(
+ * 23/10/2000 - Jes
+ *
+ * The Alpha and the Sparc don't allow unaligned loads, either. -Ion
+ */
+#if defined(__ia64__) || defined(__alpha__) || defined(__sparc__)
+static int rx_copybreak = PK
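
The multicast_filter_limit comment in the hunk above mentions a 512-element hash table keyed on the Ethernet CRC. One way such a filter can be indexed is sketched below: run the little-endian CRC-32 over the six address bytes and keep the top nine bits. The CRC routine is the standard bitwise reflected CRC-32 without the final inversion; taking the top nine bits is an assumption about the hardware, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN  6
#define HASH_BITS 9
#define HASH_SIZE (1u << HASH_BITS) /* 512 buckets */

/* Bitwise little-endian CRC-32, initial value ~0, no final inversion. */
static uint32_t ether_crc_le_sketch(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xffffffffu;
    while (len--) {
        uint8_t byte = *data++;
        for (int bit = 0; bit < 8; bit++, byte >>= 1)
            crc = (crc >> 1) ^ (((crc ^ byte) & 1u) ? 0xedb88320u : 0u);
    }
    return crc;
}

/* Assumed bucket choice: the top HASH_BITS bits of the CRC. */
static unsigned mc_hash_index(const uint8_t *addr)
{
    return ether_crc_le_sketch(addr, ETH_ALEN) >> (32 - HASH_BITS);
}

/* filter is a HASH_SIZE-bit bitmap stored as HASH_SIZE / 8 bytes. */
static void mc_filter_add(uint8_t *filter, const uint8_t *addr)
{
    unsigned idx = mc_hash_index(addr);
    filter[idx >> 3] |= (uint8_t)(1u << (idx & 7));
}

static int mc_filter_match(const uint8_t *filter, const uint8_t *addr)
{
    unsigned idx = mc_hash_index(addr);
    return (filter[idx >> 3] >> (idx & 7)) & 1;
}
```

A hash filter can only say "maybe": two addresses that land in the same bucket are indistinguishable, so the stack still has to discard unwanted frames in software.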

Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-18 Thread Ion Badulescu

On Wed, 18 Apr 2001, Steve Hill wrote:

> Anyway, it wasn't me who wanted to use the starfire driver :)

True, I plead guilty to the "replying at 3:30am" sin. :-) I meant to reply
to Roberto's mail, and accidentally replied to yours..

Anyway, Roberto, if you could give the starfire driver in 2.2.19 a try, 
I'd appreciate it. You mentioned looking at the code, did you actually 
test it?

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: Fix for Donald Becker's DP83815 network driver (v1.07)

2001-04-18 Thread Ion Badulescu

On Tue, 17 Apr 2001 17:30:46 +0100 (BST), Steve Hill <[EMAIL PROTECTED]> wrote:

> Not sure - I've never tried initing more than 3 of the DP83815 cards in a
> single machine.  (I am using Cobalt Qube 3's, which have 2 DP83815's on
> the motherboard, and a single PCI slot which I have installed a DP83815 in
> for testing purposes).

Have you tried the starfire driver in 2.2.19 or 2.4.x-ac? Or, if you want to
use vanilla 2.4.x, you can simply copy drivers/net/starfire.c from the -ac
tree.

Ion
Starfire driver maintainer

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: VRRP related

2001-04-16 Thread Ion Badulescu

On Mon, 16 Apr 2001 15:06:44 +0100, [EMAIL PROTECTED] wrote:
> 
> Hi,
> I am trying to use a virtual MAC address in place of the physical MAC
> address; to do that I have overwritten the source hardware address with
> the virtual address. Now when I try to ping this machine from another
> machine, it says request timed out. Checking arp -a shows the virtual
> MAC address in the ARP table instead of the physical MAC address. I want
> it to respond to ping as well. What can I do?

1. Get a card that accepts non-multicast MAC addresses in its hardware
filter. eepro100, tulip, starfire will do. 3c59x won't (well newer cards
have the capability, but the driver doesn't support it).

2. Apply the attached patch and enable "Ethernet Virtual MAC support".

3. Tell the card about your VMAC using ipmaddr.

The patch slows down the fast receive path, I know, but I don't see
a way around it. It's against 2.4.recent, I haven't looked at 2.2.
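
For orientation, here is a user-space sketch of the receive-side decision this approach needs: a frame counts as local if its destination equals the device's own address or any extra address registered on the device. Comparing the frame's destination against each list entry is my reading of the intent, not a transcription of the patch, and `struct addr_node` and `frame_is_for_us` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical stand-in for the device's list of extra addresses. */
struct addr_node {
    unsigned char addr[ETH_ALEN];
    struct addr_node *next;
};

/*
 * Returns 1 if a frame addressed to h_dest should be treated as local:
 * it matches the device address or one of the extra addresses.
 */
static int frame_is_for_us(const unsigned char *h_dest,
                           const unsigned char *dev_addr,
                           const struct addr_node *extra)
{
    if (memcmp(h_dest, dev_addr, ETH_ALEN) == 0)
        return 1;
    for (; extra; extra = extra->next)
        if (memcmp(h_dest, extra->addr, ETH_ALEN) == 0)
            return 1;
    return 0; /* the kernel would mark this PACKET_OTHERHOST */
}
```

The list walk runs for every unicast frame that does not match the device address, which is where the extra cost on the receive path comes from.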

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
---
--- linux-2.4/net/ethernet/eth.c.old    Tue Nov 14 20:18:52 2000
+++ linux-2.4/net/ethernet/eth.c        Tue Nov 14 20:30:45 2000
@@ -203,8 +203,21 @@
 
else if(1 /*dev->flags&IFF_PROMISC*/)
{
-   if(memcmp(eth->h_dest,dev->dev_addr, ETH_ALEN))
+#ifdef CONFIG_NET_VMAC
+   if (memcmp(eth->h_dest,dev->dev_addr, ETH_ALEN)) {
+   struct dev_mc_list *mc_addr = dev->mc_list;
+   while (mc_addr) {
+   if (memcmp(mc_addr->dmi_addr, dev->dev_addr, ETH_ALEN))
+   goto loose_local;
+   mc_addr = mc_addr->next;
+   }
skb->pkt_type=PACKET_OTHERHOST;
+   loose_local:
+   }
+#else  /* not CONFIG_NET_VMAC */
+   if (memcmp(eth->h_dest,dev->dev_addr, ETH_ALEN))
+   skb->pkt_type=PACKET_OTHERHOST;
+#endif /* not CONFIG_NET_VMAC */
}

if (ntohs(eth->h_proto) >= 1536)
--- linux-2.4/net/Config.in.old Tue Nov 14 20:29:37 2000
+++ linux-2.4/net/Config.in Tue Nov 14 20:30:31 2000
@@ -64,6 +64,7 @@
tristate 'LAPB Data Link Driver (EXPERIMENTAL)' CONFIG_LAPB
bool '802.2 LLC (EXPERIMENTAL)' CONFIG_LLC
bool 'Frame Diverter (EXPERIMENTAL)' CONFIG_NET_DIVERT
+   bool 'Ethernet Virtual MAC support (EXPERIMENTAL)' CONFIG_NET_VMAC
 #   if [ "$CONFIG_LLC" = "y" ]; then
 #  bool '  Netbeui (EXPERIMENTAL)' CONFIG_NETBEUI
 #   fi



Re: VRRP related

2001-04-16 Thread Ion Badulescu

On Mon, 16 Apr 2001 15:06:44 +0100, [EMAIL PROTECTED] wrote:
 
 Hi,
 I am trying to put virtual mac address at the place of physical mac
 address , for that I have overwrite source hardware address with virtual
 address.Now when I try to ping to this machine with some other
 machine.It says request time out.While checking arp -a , gives me
 virtual mac address in ARP-Table instead of physical mac address.I want
 it should give response to ping  also.what I can do

1. Get a card that accepts non-multicast MAC addresses in its hardware
filter. eepro100, tulip, starfire will do. 3c59x won't (well newer cards
have the capability, but the driver doesn't support it).

2. Apply the attached patch and enable "Ethernet Virtual MAC support".

3. Tell the card about your VMAC using ipmaddr.

The patch slows down the fast receive patch, I know, but I don't see
a way around it. It's against 2.4.recent, I haven't looked at 2.2.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
---
--- linux-2.4/net/ethernet/eth.c.oldTue Nov 14 20:18:52 2000
+++ linux-2.4/net/ethernet/eth.cTue Nov 14 20:30:45 2000
@@ -203,8 +203,21 @@
 
else if(1 /*dev-flagsIFF_PROMISC*/)
{
-   if(memcmp(eth-h_dest,dev-dev_addr, ETH_ALEN))
+#ifdef CONFIG_NET_VMAC
+   if (memcmp(eth-h_dest,dev-dev_addr, ETH_ALEN)) {
+   struct dev_mc_list *mc_addr = dev-mc_list;
+   while (mc_addr) {
+   if (memcmp(mc_addr-dmi_addr, dev-dev_addr, ETH_ALEN))
+   goto loose_local;
+   mc_addr = mc_addr-next;
+   }
skb-pkt_type=PACKET_OTHERHOST;
+   loose_local:
+   }
+#else  /* not CONFIG_NET_VMAC */
+   if (memcmp(eth-h_dest,dev-dev_addr, ETH_ALEN))
+   skb-pkt_type=PACKET_OTHERHOST;
+#endif /* not CONFIG_NET_VMAC */
}

if (ntohs(eth-h_proto) = 1536)
--- linux-2.4/net/Config.in.old Tue Nov 14 20:29:37 2000
+++ linux-2.4/net/Config.in Tue Nov 14 20:30:31 2000
@@ -64,6 +64,7 @@
    tristate 'LAPB Data Link Driver (EXPERIMENTAL)' CONFIG_LAPB
    bool '802.2 LLC (EXPERIMENTAL)' CONFIG_LLC
    bool 'Frame Diverter (EXPERIMENTAL)' CONFIG_NET_DIVERT
+   bool 'Ethernet Virtual MAC support (EXPERIMENTAL)' CONFIG_NET_VMAC
 #   if [ "$CONFIG_LLC" = "y" ]; then
 #  bool '  Netbeui (EXPERIMENTAL)' CONFIG_NETBEUI
 #   fi
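
For readers who want to see the receive-side check in isolation, here is a minimal userspace sketch of the idea behind the CONFIG_NET_VMAC hunk: a frame counts as local if its destination matches either the device address or any address registered on the device's list. The names struct mc_entry and frame_is_for_us are hypothetical stand-ins, not kernel API. Note one deliberate difference: the hunk as posted compares each list entry against dev->dev_addr, while this sketch compares it against the frame's destination address, which is only an assumption about the intended behaviour.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical stand-in for the kernel's struct dev_mc_list. */
struct mc_entry {
    unsigned char addr[ETH_ALEN];
    const struct mc_entry *next;
};

/*
 * Return true if a frame addressed to dest should be treated as local:
 * either dest is the device's own address, or dest matches one of the
 * extra addresses registered on the list (assumed intent of the patch).
 */
bool frame_is_for_us(const unsigned char *dest,
                     const unsigned char *dev_addr,
                     const struct mc_entry *mc_list)
{
    const struct mc_entry *m;

    if (memcmp(dest, dev_addr, ETH_ALEN) == 0)
        return true;

    for (m = mc_list; m != NULL; m = m->next) {
        if (memcmp(m->addr, dest, ETH_ALEN) == 0)
            return true;
    }
    return false;
}
```

A frame for which frame_is_for_us() returns false corresponds to the case where the kernel code sets skb->pkt_type to PACKET_OTHERHOST. The list walk only happens when the destination differs from the device address, so unicast traffic to the card's own MAC pays nothing extra.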



Re: linux 2.4.3 crashed my hard disk

2001-04-06 Thread Ion Badulescu

On Fri, 6 Apr 2001, Andre Hedrick wrote:

> > You really ought to rename this parameter to pcibus. Even though it doesn't
> > do justice to the VLB bus, the potential for user error is much smaller.
> 
> > Until today you had a valid point!
> > 
> > Promise Ultra100TX2 (20268 chipset).
> > 
> > This is a 66MHz clocked Ultra100 Chipset released this week.

Ok... but how does this invalidate my point? The 66MHz still applies to 
the PCI bus, so pcibus is ok for the parameter name, no?

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: linux 2.4.3 crashed my hard disk

2001-04-06 Thread Ion Badulescu

On Fri, 06 Apr 2001 21:30:24 -0700, Andre Hedrick <[EMAIL PROTECTED]> wrote:
> 
> You killed yourself
> 
> You do not have a host that will do idebus=66

You really ought to rename this parameter to pcibus. Even though it doesn't
do justice to the VLB bus, the potential for user error is much smaller.

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.





