Re: Soft lockup in e100 driver ?

2005-08-11 Thread Stephen D. Williams

"noapic" didn't work, nor did "noacpi", etc.
Going to 2.6.13-rc6.2 solved the problem (once I integrated udev, etc.).

The chipset is an Intel 8x0 something.  Unfortunately, there is a 
heatsink semi-permanently installed over everything.  Is there a /proc 
pseudofile that will give me good identifying chipset info to report here?


If there is a FAQ for this, we should post a message about it once in a 
while.

Nothing here indicates chipset:
http://www.kernel.org/pub/linux/docs/lkml/reporting-bugs.html

The CPU is an Intel Celeron CPU 2.00GHz running at 1495.772 MHz, 128KB 
cache.


sdw

Matti Aarnio wrote:


On Wed, Aug 10, 2005 at 08:32:45PM -0400, Stephen D. Williams wrote:
 

I just noticed that the Ubuntu setup says "GSI 20(level,low) -> IRQ 20" 
whereas I remember my built kernels saying "No GSI..  IRQ 11".  I'll 
investigate what that means and how to enable it.  Pointers appreciated.
   



That is most likely unrelated, but I have had similar experiences
at times.  It turned out that a recent change in the APIC
management code broke things, but the latest version is working
again.  For a while I had to boot my home SMP box with the
"noapic" option to get the network card working.

On a UP box, booting with "noapic" makes little difference, but on
SMP it results in one CPU handling all interrupts.  (In some
rare cases that could even be desirable.)

  /Matti Aarnio


 


sdw

Stephen D. Williams wrote:

   

I have been working for days to get a recent kernel to work with these 
small-format UP Celeron 2GHz (running at 1.33GHz) motherboards that I 
am planning to use as thin clients.  I'm doing a PXE boot, loading 
kernels, and trying to get networking to come up.

I eventually realized that the problem is that the e100 driver loads 
but does not pass any packet traffic.  The system hasn't crashed, but 
I do get transmit timeouts.

I've used kernels 2.6.10, 2.6.11, and 2.6.12.4, stock with only the 
"squashfs" patch applied, compiled for 586.


The interesting thing is that Ubuntu 5.04, booted "Live" on the box, 
works just fine with the e100 driver with a kernel shown as: 
"2.6.10-5-386".  I'm going to work on pulling this kernel and its 
modules off to use.


Any help urgently appreciated.

sdw
 



 






Re: Soft lockup in e100 driver ?

2005-08-10 Thread Stephen D. Williams
I just noticed that the Ubuntu setup says "GSI 20(level,low) -> IRQ 20" 
whereas I remember my built kernels saying "No GSI..  IRQ 11".  I'll 
investigate what that means and how to enable it.  Pointers appreciated.


sdw

Stephen D. Williams wrote:

I have been working for days to get a recent kernel to work with these 
small-format UP Celeron 2GHz (running at 1.33GHz) motherboards that I 
am planning to use as thin clients.  I'm doing a PXE boot, loading 
kernels, and trying to get networking to come up.

I eventually realized that the problem is that the e100 driver loads 
but does not pass any packet traffic.  The system hasn't crashed, but 
I do get transmit timeouts.

I've used kernels 2.6.10, 2.6.11, and 2.6.12.4, stock with only the 
"squashfs" patch applied, compiled for 586.


The interesting thing is that Ubuntu 5.04, booted "Live" on the box, 
works just fine with the e100 driver with a kernel shown as: 
"2.6.10-5-386".  I'm going to work on pulling this kernel and its 
modules off to use.


Any help urgently appreciated.

sdw

Matti Aarnio wrote:


On Tue, Aug 09, 2005 at 09:16:21AM -0700, Daniel Walker wrote:
 


It looks like this might be an SMP race; it seems that both processors
are in e100_down(). There is a while loop in e100_clean_cbs() that
appears to have an unsafe looping condition.
It looks like cbs_avail might jump over params.cbs.count, and then you
would have to wait for a rollover. Is this a PREEMPT_NONE kernel?
  



 # CONFIG_PREEMPT is not set
 # CONFIG_PREEMPT_BKL is not set

which is probably the same as "NONE".

There is _one_ processor in down, but the other may be trying to send
some data out, or otherwise polling the card.

However...  while these are real bugs in their own right, none of them
is as important as the original "card dies" problem, during the recovery
from which all this soft-lockup merriment happens.

Also, as it happens only once a week or so (except when it happens
right after another), testing code patches is rather slow.
I can guess which things make it more likely, but I can't make it
happen at will.

 /Matti Aarnio


 


This patch may help, but it's not a complete fix.

--- linux-2.6.12.orig/drivers/net/e100.c	2005-08-05 16:45:59.000000000 +0000
+++ linux-2.6.12/drivers/net/e100.c	2005-08-09 16:14:45.000000000 +0000
@@ -1393,7 +1393,7 @@ static inline int e100_tx_clean(struct n
 static void e100_clean_cbs(struct nic *nic)
 {
 	if(nic->cbs) {
-		while(nic->cbs_avail != nic->params.cbs.count) {
+		while(nic->cbs_avail < nic->params.cbs.count) {
 			struct cb *cb = nic->cb_to_clean;
 			if(cb->skb) {
 				pci_unmap_single(nic->pdev,



On Tue, 2005-08-09 at 16:36 +0300, Matti Aarnio wrote:
  


Running a very recent Fedora Core Development kernel I get the following
soft-oops..   ( 2.6.12-1.1455_FC5smp )


e100: eth0: e100_watchdog: link up, 100Mbps, full-duplex
BUG: soft lockup detected on CPU#0!

Pid: 10743, comm: ifconfig
EIP: 0060:[<f88bf2f9>] CPU: 0
EIP is at e100_clean_cbs+0x2f/0x12b [e100]
EFLAGS: 0293    Not tainted  (2.6.12-1.1455_FC5smp)
EAX: 495c7c2b EBX: 495c7c2b ECX: f6c311a0 EDX: 
ESI: 0040 EDI: f6c3 EBP: f71a4b20 DS: 007b ES: 007b
CR0: 8005003b CR2: 0804a544 CR3: 01e9cd80 CR4: 06f0
[<f88c0708>] e100_down+0x66/0x9a [e100]
[<f88c1623>] e100_close+0xa/0xd [e100]
[<c02b7adb>] dev_close+0x40/0x7e
[<c02b8f59>] dev_change_flags+0x46/0xf5
[<c02f76b3>] devinet_ioctl+0x564/0x5df
[<c02af22c>] sock_ioctl+0xc3/0x250
[<c02af169>] sock_ioctl+0x0/0x250
[<c01762ef>] do_ioctl+0x1f/0x6d
[<c017648f>] vfs_ioctl+0x50/0x1c6
[<c0176662>] sys_ioctl+0x5d/0x6f
[<c010394d>] syscall_call+0x7/0xb
[<c014473f>] softlockup_tick+0x6f/0x80
[<c01085b8>] timer_interrupt+0x2d/0x75
[<c01448dd>] handle_IRQ_event+0x2e/0x5a
[<c01449cb>] __do_IRQ+0xc2/0x127
[<c0105f7e>] do_IRQ+0x4e/0x86
===
[<c01160cc>] smp_apic_timer_interrupt+0xc1/0xca
[<c0104382>] common_interrupt+0x1a/0x20
[<f88bf2f9>] e100_clean_cbs+0x2f/0x12b [e100]
[<f88c0708>] e100_down+0x66/0x9a [e100]
[<f88c1623>] e100_close+0xa/0xd [e100]
[<c02b7adb>] dev_close+0x40/0x7e
[<c02b8f59>] dev_change_flags+0x46/0xf5
[<c02f76b3>] devinet_ioctl+0x564/0x5df
[<c02af22c>] sock_ioctl+0xc3/0x250
[<c02af169>] sock_ioctl+0x0/0x250
[<c01762ef>] do_ioctl+0x1f/0x6d
[<c017648f>] vfs_ioctl+0x50/0x1c6
[<c0176662>] sys_ioctl+0x5d/0x6f
[<c010394d>] syscall_call+0x7/0xb



Preconditions for this are:

- The E100 card has stopped working for some reason (no idea why, it just
 does sometimes on this oldish 2x P-III machine)
- There are active datastreams running in and out
 (around 0.2 Mbps out, multiple megabits in.)
- Running "ifconfig eth0 down" then results in what feels like a
 system freeze, but it does recover in about 30-60 seconds
 (it takes long enough for me to sweat bullets...)
- While in the frozen state, the keyboard can go crazy, but the mouse does
 respond, and tvtime shows bt848 captured live video.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Soft lockup in e100 driver ?

2005-08-10 Thread Stephen D. Williams
I have been working for days to get a recent kernel to work with these 
small-format UP Celeron 2GHz (running at 1.33GHz) motherboards that I am 
planning to use as thin clients.  I'm doing a PXE boot, loading kernels, 
and trying to get networking to come up.

I eventually realized that the problem is that the e100 driver loads but 
does not pass any packet traffic.  The system hasn't crashed, but I do 
get transmit timeouts.

I've used kernels 2.6.10, 2.6.11, and 2.6.12.4, stock with only the 
"squashfs" patch applied, compiled for 586.


The interesting thing is that Ubuntu 5.04, booted "Live" on the box, 
works just fine with the e100 driver with a kernel shown as: 
"2.6.10-5-386".  I'm going to work on pulling this kernel and its 
modules off to use.


Any help urgently appreciated.

sdw

Matti Aarnio wrote:


On Tue, Aug 09, 2005 at 09:16:21AM -0700, Daniel Walker wrote:
 


It looks like this might be an SMP race; it seems that both processors
are in e100_down(). There is a while loop in e100_clean_cbs() that
appears to have an unsafe looping condition.
It looks like cbs_avail might jump over params.cbs.count, and then you
would have to wait for a rollover. Is this a PREEMPT_NONE kernel?
   



 # CONFIG_PREEMPT is not set
 # CONFIG_PREEMPT_BKL is not set

which is probably the same as "NONE".

There is _one_ processor in down, but the other may be trying to send
some data out, or otherwise polling the card.

However...  while these are real bugs in their own right, none of them
is as important as the original "card dies" problem, during the recovery
from which all this soft-lockup merriment happens.

Also, as it happens only once a week or so (except when it happens
right after another), testing code patches is rather slow.
I can guess which things make it more likely, but I can't make it
happen at will.

 /Matti Aarnio


 


This patch may help, but it's not a complete fix.

--- linux-2.6.12.orig/drivers/net/e100.c	2005-08-05 16:45:59.000000000 +0000
+++ linux-2.6.12/drivers/net/e100.c	2005-08-09 16:14:45.000000000 +0000
@@ -1393,7 +1393,7 @@ static inline int e100_tx_clean(struct n
 static void e100_clean_cbs(struct nic *nic)
 {
 	if(nic->cbs) {
-		while(nic->cbs_avail != nic->params.cbs.count) {
+		while(nic->cbs_avail < nic->params.cbs.count) {
 			struct cb *cb = nic->cb_to_clean;
 			if(cb->skb) {
 				pci_unmap_single(nic->pdev,



On Tue, 2005-08-09 at 16:36 +0300, Matti Aarnio wrote:
   


Running a very recent Fedora Core Development kernel I get the following
soft-oops..   ( 2.6.12-1.1455_FC5smp )


e100: eth0: e100_watchdog: link up, 100Mbps, full-duplex
BUG: soft lockup detected on CPU#0!

Pid: 10743, comm: ifconfig
EIP: 0060:[<f88bf2f9>] CPU: 0
EIP is at e100_clean_cbs+0x2f/0x12b [e100]
EFLAGS: 0293    Not tainted  (2.6.12-1.1455_FC5smp)
EAX: 495c7c2b EBX: 495c7c2b ECX: f6c311a0 EDX: 
ESI: 0040 EDI: f6c3 EBP: f71a4b20 DS: 007b ES: 007b
CR0: 8005003b CR2: 0804a544 CR3: 01e9cd80 CR4: 06f0
[<f88c0708>] e100_down+0x66/0x9a [e100]
[<f88c1623>] e100_close+0xa/0xd [e100]
[<c02b7adb>] dev_close+0x40/0x7e
[<c02b8f59>] dev_change_flags+0x46/0xf5
[<c02f76b3>] devinet_ioctl+0x564/0x5df
[<c02af22c>] sock_ioctl+0xc3/0x250
[<c02af169>] sock_ioctl+0x0/0x250
[<c01762ef>] do_ioctl+0x1f/0x6d
[<c017648f>] vfs_ioctl+0x50/0x1c6
[<c0176662>] sys_ioctl+0x5d/0x6f
[<c010394d>] syscall_call+0x7/0xb
[<c014473f>] softlockup_tick+0x6f/0x80
[<c01085b8>] timer_interrupt+0x2d/0x75
[<c01448dd>] handle_IRQ_event+0x2e/0x5a
[<c01449cb>] __do_IRQ+0xc2/0x127
[<c0105f7e>] do_IRQ+0x4e/0x86
===
[<c01160cc>] smp_apic_timer_interrupt+0xc1/0xca
[<c0104382>] common_interrupt+0x1a/0x20
[<f88bf2f9>] e100_clean_cbs+0x2f/0x12b [e100]
[<f88c0708>] e100_down+0x66/0x9a [e100]
[<f88c1623>] e100_close+0xa/0xd [e100]
[<c02b7adb>] dev_close+0x40/0x7e
[<c02b8f59>] dev_change_flags+0x46/0xf5
[<c02f76b3>] devinet_ioctl+0x564/0x5df
[<c02af22c>] sock_ioctl+0xc3/0x250
[<c02af169>] sock_ioctl+0x0/0x250
[<c01762ef>] do_ioctl+0x1f/0x6d
[<c017648f>] vfs_ioctl+0x50/0x1c6
[<c0176662>] sys_ioctl+0x5d/0x6f
[<c010394d>] syscall_call+0x7/0xb



Preconditions for this are:

- The E100 card has stopped working for some reason (no idea why, it just
 does sometimes on this oldish 2x P-III machine)
- There are active datastreams running in and out
 (around 0.2 Mbps out, multiple megabits in.)
- Running "ifconfig eth0 down" then results in what feels like a
 system freeze, but it does recover in about 30-60 seconds
 (it takes long enough for me to sweat bullets...)
- While in the frozen state, the keyboard can go crazy, but the mouse does
 respond, and tvtime shows bt848 captured live video.

Re: PROBLEM: select() on TCP socket sleeps for 1 tick even if data available

2001-04-11 Thread Stephen D. Williams

James Antill wrote:
...
> > The time went from 3.7 to 4.4 seconds per 100000.
> 
>  Ok here's a quick test that I've done. This passes data between 2
> processes. Obviously you can't compare this to your code or Michael's,
> however...

I've attached my version of his code with your suggested change. 
Possibly I didn't do it correctly.

>  The results with USE_DOUBLE_POLL on are...
> 
> % time ./pingpong
> ./pingpong  0.15s user 0.89s system 48% cpu 2.147 total
> % time ./pingpong
> ./pingpong  0.19s user 0.91s system 45% cpu 2.422 total
> % time ./pingpong
> ./pingpong  0.10s user 1.02s system 49% cpu 2.282 total
> 
>  The results with USE_DOUBLE_POLL off are...
> 
> % time ./pingpong
> ./pingpong  0.24s user 1.07s system 50% cpu 2.614 total
...

sdw
-- 
[EMAIL PROTECTED]  http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 
Dec2000


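/* Header names were lost in the archive; these cover what the code uses. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>


#ifndef INADDR_NONE
#define INADDR_NONE ~0
#endif

/* Print a formatted error message to stderr and exit. */
void
errexit(const char *format, ...)
{
	va_list args;

	va_start(args, format);
	vfprintf(stderr, format, args);
	va_end(args);
	exit(1);
}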

/*
 * passivesock - allocate & bind a server socket using TCP or UDP
 */

int
passivesock(
	const char *service,	/* service associated with the desired port */
	const char *protocol,	/* name of protocol to use ("tcp" or "udp") */
	int qlen)		/* maximum length of the server request queue */
{
	struct servent *pse;
	struct protoent *ppe;
	struct sockaddr_in sin;
	int s, type;
	int one = 1;
	int f = 1;

	bzero((char *) &sin, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = INADDR_ANY;

	/* Map service name to port number */
	if ((pse = getservbyname(service, protocol)) != NULL)
		sin.sin_port = htons(ntohs((u_short)pse->s_port));
	else if ((sin.sin_port = htons((u_short)atoi(service))) == 0)
		errexit("can't get \"%s\" service entry\n", service);

	/* Map protocol name to protocol number */
	if ((ppe = getprotobyname(protocol)) == 0)
		errexit("can't get \"%s\" protocol entry\n", protocol);

	/* Use protocol to choose a socket type */
	if (strcmp(protocol, "udp") == 0)
		type = SOCK_DGRAM;
	else
		type = SOCK_STREAM;

	/* Allocate a socket */
	s = socket(PF_INET, type, ppe->p_proto);
	if (s < 0)
		errexit("can't create socket: %s\n", strerror(errno));

	/* Allow quick rebinding and disable Nagle's algorithm */
	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	setsockopt(s, SOL_TCP, TCP_NODELAY, &f, sizeof(f));
	/* Bind the socket */
	if (bind(s, (struct sockaddr *) &sin, sizeof(sin)) < 0)
		errexit("can't bind to %s port: %s\n", service,
			strerror(errno));
	if (type == SOCK_STREAM && listen(s, qlen) < 0)
		errexit("can't listen on %s port: %s\n", service,
			strerror(errno));
	return s;
}

int
connectsock(const char *host, const char *service, const char *protocol)
{
	struct hostent  *phe;
	struct servent  *pse;
	struct protoent *ppe;
	struct sockaddr_in  sin;
	int s, type;
	int f = 1;

	memset(&sin, 0, sizeof(sin));
	/* Map service name (or numeric string) to port number */
	if ((pse = getservbyname(service, protocol)) != NULL)
		sin.sin_port = pse->s_port;
	else if ((sin.sin_port = htons((u_short) atoi(service))) == 0) {
		fprintf(stderr, "can't get '%s' service entry\n", service);
		exit(1);
	}
	/* Map host name (or dotted-quad string) to IP address */
	if ((phe = gethostbyname(host)) != NULL)
		memcpy((char *) &sin.sin_addr, phe->h_addr, phe->h_length);
	else if ((sin.sin_addr.s_addr = inet_addr(host)) == INADDR_NONE) {
		fprintf(stderr, "can't get '%s' host entry\n", host);
		exit(1);
	}
	/* if (ppe = getprotobyname(protocol)) {
		fprintf(stderr, "can't get '%s' protocol entry\n", protocol);
		exit(1);
	}
	if (strcmp(protocol, "udp") == 0)
		type = SOCK_DGRAM;
	else
		type = SOCK_STREAM;
	*/
	sin.sin_family = AF_INET;
	s = socket(AF_INET, SOCK_STREAM, 6);
	setsockopt(s, SOL_TCP, TCP_NODELAY, &f, sizeof(f));
	if (s < 0) {
		perror("can't create socket\n");
 

Re: No 100 HZ timer !

2001-04-10 Thread Stephen D. Williams

When this is rewritten, I would strongly suggest that we find a way to
make 'gettimeofday' nearly free.  Certain applications need to use this
frequently while still being portable.  One solution when you do have
clock ticks is a read-only mapped Int.  Another cheap solution is
library assembly that adds a cycle clock delta since last system call to
a 'gettimeofday' value set on every system call return.

sdw

Andi Kleen wrote:
> 
> On Tue, Apr 10, 2001 at 01:12:14PM +0100, Alan Cox wrote:
> > Measure the number of clocks executing a timer interrupt. rdtsc is fast. Now
> > consider the fact that out of this you get KHz or better scheduling
> > resolution required for games and midi. I'd say it looks good. I agree
> 
> And measure the number of cycles a gigahertz CPU can do between a 1ms timer.
> And then check how often the typical application executes something like
> gettimeofday.
> 
...

sdw
-- 
[EMAIL PROTECTED]  http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 
Dec2000



Re: PROBLEM: select() on TCP socket sleeps for 1 tick even if data available

2001-04-10 Thread Stephen D. Williams

James Antill wrote:
> 
> "Stephen D. Williams" <[EMAIL PROTECTED]> writes:
> 
> > An old thread, but important to get these fundamental performance
> > numbers up there:
> >
> > 2.4.2 on an 800mhz PIII Sceptre laptop w/ 512MB ram:
> >
> > elapsed time for 100000 pingpongs is
> > 3.81327
> > 100000/3.81256
> > ~26229.09541095746689888159
> > 10000/.379912
> > ~26321.88506812103855629724
...
>  I seemed to miss the original post, so I can't really comment on the
> tests. However...

It was a thread in January, but I just ran across it while looking for
something else.  See below for results.


> > Michael Lindner wrote:
...
> > >  0.052371 send(7, "\0\0\0
> > > \177\0\0\1\3243\0\0\0\2\4\236\216\341\0\0\v\277"..., 32, 0) = 32
> > > <0.000529>
> > >  0.000882 rt_sigprocmask(SIG_BLOCK, ~[], [RT_0], 8) = 0 <0.000021>
> > >  0.000242 rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
> > > <0.000021>
> > >  0.000173 select(8, [3 4 6 7], NULL, NULL, NULL) = 1 (in [6])
> > > <0.000047>
> > >  0.000328 read(6, "\0\0\0 ", 4) = 4 <0.000031>
> > >  0.000179 read(6,
> > > "\177\0\0\1\3242\0\0\0\2\4\236\216\341\0\0\7\327\177\0\0"..., 28) = 28
> > > <0.000075>
> 
>  The strace here shows select() with an infinite timeout, your
> numbers will be much better if you do (pseudo code)...
> 
>   struct timeval zerotime;
> 
>   zerotime.tv_sec = 0;
>   zerotime.tv_usec = 0;
> 
>  if (!(ret = select( ... , &zerotime)))
>   ret = select( ... , NULL);
> 
> ...basically you completely miss the function call for __pollwait()
> inside poll_wait (include/linux/poll.h in the linux sources, with
> __pollwait being in fs/select.c).

Apparently the extra system call overhead outweighs any benefit.  In any
case, what you suggest would be better done in the kernel anyway.  The
time went from 3.7 to 4.4 seconds per 100000.

> 
> --
> # James Antill -- [EMAIL PROTECTED]
> :0:
> * ^From: .*james@and\.org
> /dev/null

-- 
[EMAIL PROTECTED]  http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 
Dec2000



Re: No 100 HZ timer !

2001-04-10 Thread Stephen D. Williams

When this is rewritten, I would strongly suggest that we find a way to
make 'gettimeofday' nearly free.  Certain applications need to use this
frequently while still being portable.  One solution when you do have
clock ticks is a read-only mapped Int.  Another cheap solution is
library assembly that adds a cycle clock delta since last system call to
a 'gettimeofday' value set on every system call return.

sdw

Andi Kleen wrote:
> 
> On Tue, Apr 10, 2001 at 01:12:14PM +0100, Alan Cox wrote:
> > Measure the number of clocks executing a timer interrupt. rdtsc is fast. Now
> > consider the fact that out of this you get KHz or better scheduling
> > resolution required for games and midi. I'd say it looks good. I agree
> 
> And measure the number of cycles a gigahertz CPU can do between a 1ms timer.
> And then check how often the typical application executes something like
> gettimeofday.
> 
...

sdw
-- 
[EMAIL PROTECTED]  http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 
Dec2000



Re: PROBLEM: select() on TCP socket sleeps for 1 tick even if data available

2001-04-09 Thread Stephen D. Williams

An old thread, but important to get these fundamental performance
numbers up there:

2.4.2 on an 800MHz PIII Sceptre laptop w/ 512MB RAM:

elapsed time for 100000 pingpongs is
3.81327
100000/3.81256
~26229.09541095746689888159
10000/.379912
~26321.88506812103855629724

26300 compares to 8000/sec. quite well ;-)  You didn't give specs for
your test machine unfortunately.

Since this tests both 'sides' of an application communication, it
indicates a 'null transaction' rate of twice that.
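
The arithmetic, spelled out as a small sketch.  The function names are
made up for illustration, and the count is taken as 100000, which is the
value that reproduces the quotient quoted above (100000/3.81256).

```c
#include <assert.h>

/* Ping-pong round trips per second for a timed run. */
static double pingpongs_per_sec(long count, double elapsed_sec)
{
    assert(count > 0 && elapsed_sec > 0.0);
    return (double)count / elapsed_sec;
}

/* Each ping-pong exercises both sides of the conversation, so the
 * 'null transaction' rate is counted as twice the ping-pong rate. */
static double null_transactions_per_sec(long count, double elapsed_sec)
{
    return 2.0 * pingpongs_per_sec(count, elapsed_sec);
}
```

With count = 100000 and elapsed_sec = 3.81256 the first function gives
about 26229 per second and the second about 52458 per second.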

This was typical cpu usage on a triple run of 1:
CPU states:  7.2% user, 92.7% system,  0.0% nice,  0.0% idle  

sdw


Michael Lindner wrote:
> 
> OK, 2.4.0 kernel installed, and a new set of numbers:
> 
> test             kernel      ping-pongs/s. @ total CPU util   w/SOL_NDELAY
> sample (2 skts)  2.2.18      100 @ 0.1%                       800 @ 1%
> sample (1 skt)   2.2.18      8000 @ 100%                      8000 @ 50%
> real app         2.2.18      100 @ 0.1%                       800 @ 1%
> 
> sample (2 skts)  2.4.0       8000 @ 50%                       8000 @ 50%
> sample (1 skt)   2.4.0       10000 @ 50%                      10000 @ 50%
> real app         2.4.0       1200 @ 50%                       1200 @ 50%
> 
> real app         Windows 2K  4000 @ 100%
> 
> The two points that still seem strange to me are:
> 
> 1. The 1 socket case is still 25% faster than the 2 socket case in 2.4.0
> (in 2.2.18 the 1 socket case was 10x faster).
> 
> 2. Linux never devotes more than 50% of the CPU (average over a long
> run) to the two processes (25% to each process, with the rest of the
> time idle).
> 
> I'd really love to show that Linux is a viable platform for our SW, and
> I think it would be doable if I could figure out how to get the other
> 50% of my CPU involved. An "strace -rT" of the real app on 2.4.0 looks
> like this for each ping/pong.
> 
>  0.052371 send(7, "\0\0\0 \177\0\0\1\3243\0\0\0\2\4\236\216\341\0\0\v\277"..., 32, 0) = 32 <0.000529>
>  0.000882 rt_sigprocmask(SIG_BLOCK, ~[], [RT_0], 8) = 0 <0.000021>
>  0.000242 rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0 <0.000021>
>  0.000173 select(8, [3 4 6 7], NULL, NULL, NULL) = 1 (in [6]) <0.000047>
>  0.000328 read(6, "\0\0\0 ", 4) = 4 <0.000031>
>  0.000179 read(6, "\177\0\0\1\3242\0\0\0\2\4\236\216\341\0\0\7\327\177\0\0"..., 28) = 28 <0.000075>
> 
> --
> Mike Lindner

-- 
[EMAIL PROTECTED]  http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 
Dec2000


