Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-29 Thread Jarek Poplawski
On Wed, Dec 27, 2006 at 08:16:10AM -0800, Ben Greear wrote:
 Jarek Poplawski wrote:
 On Fri, Dec 22, 2006 at 06:05:18AM -0800, Ben Greear wrote:
 Jarek Poplawski wrote:
 On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
 On 20-12-2006 03:13, Ben Greear wrote:
 This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in 
 active use.
 From the backtrace, I am thinking this might be a generic problem, 
 however.
 
 Any ideas about what this could be?  It seems to be reproducible every 
 day or
 ...
 If it doesn't help, I hope lockdep will be more
 precise when you'll upgrade to 2.6.19 or higher.
 ... or when you enable lockdep in 2.6.18 (I've
 forgotten it's there already!).
 I got lucky..the system was available by ssh still.  I see this in the 
 boot logs..I assume
 this means lockdep is enabled?  Should I have expected to see a lockdep 
 trace in the case of
 this soft-lockup then?
 
 .
 Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
 Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_SUBCLASSES:8
 
 Yes, you got it enabled in the config.
 
 If there is no message later about validator
 turning off and no warnings which could point
 at lockdep then it is working.
 
 But then, IMHO, there is rather small probability
 this bug is really from lockup. Another possibility
 is hardware irqs (timer in particular) are turned
 off by something (maybe those hacks?) for extremely
 long time (~10 sec.). 
 
 The system hangs and does not recover (well, a few processes
 continue on the other processor for a few minutes before they
 too deadlock...)
 
 I am guessing this problem has been around for a while, but it
 is only triggered when interfaces are created, and probably only
 when UDP traffic is already running heavily on the system.  Most
 systems w/out virtual devices will not trigger this sort of
 race.

I had one more look at this, considering the info about
creating interfaces, and here are some of my doubts about
possible races (I hope you'll forgive me if I totally
miss some point):

- During the register procedure the real device seems to
be up and running; vlan_rx_register is used, but I see
drivers differ here: some of them do netif_stop and
disable irqs while others only lock. It seems they
can start calling vlan_hwaccel_rx directly after
this (sometimes even during registration, if an
irq happens).

- vlan_hwaccel_rx is checking skb_bond_should_drop
but I'm not sure it is really useful here, so
probably at least broadcasts and multicasts can
use netif_rx even before vlan_dev is up (and your
log accidentally shows multicast receive).

- Preemption is blocked for quite a long time in
vlan_skb_recv and during netif_receive; I guess
this could also be a possible reason for triggering
the softlockup bug. I wonder if lowering the
value of netdev_max_backlog wouldn't improve
scheduling times.
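
Before lowering netdev_max_backlog it helps to know its current value.
Here is a minimal sketch for reading it, assuming the usual procfs
location of the net.core.netdev_max_backlog sysctl. The helper name
read_long_from_file is made up for this example.

```c
#include <stdio.h>

/* Usual procfs location of the net.core.netdev_max_backlog sysctl. */
#define NETDEV_MAX_BACKLOG_PATH "/proc/sys/net/core/netdev_max_backlog"

/*
 * Hypothetical helper: read one decimal integer from a file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_long_from_file(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int parsed;

    if (f == NULL)
        return -1;
    parsed = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}
```

Calling read_long_from_file(NETDEV_MAX_BACKLOG_PATH, &v) fills v with the
current limit. Lowering it means writing a smaller number to the same file
as root, or using sysctl net.core.netdev_max_backlog.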

Happy New Year,

Jarek P.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-28 Thread Jarek Poplawski
On Wed, Dec 27, 2006 at 08:16:10AM -0800, Ben Greear wrote:
...
 The system hangs and does not recover (well, a few processes
 continue on the other processor for a few minutes before they
 too deadlock...)
 
 I am guessing this problem has been around for a while, but it
 is only triggered when interfaces are created, and probably only
 when UDP traffic is already running heavily on the system.  Most
 systems w/out virtual devices will not trigger this sort of
 race.

Considering your contributions to the kernel,
many people here would like to help, I hope,
but this information is probably not enough.

Maybe some more logs and dmesg? If it deadlocks
anyway, maybe adding panic() after dump_stack()
could tell something.

Jarek P.


Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-27 Thread Jarek Poplawski
On Fri, Dec 22, 2006 at 06:05:18AM -0800, Ben Greear wrote:
 Jarek Poplawski wrote:
 On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
 On 20-12-2006 03:13, Ben Greear wrote:
 This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in 
 active use.
  From the backtrace, I am thinking this might be a generic problem, 
 however.
 
 Any ideas about what this could be?  It seems to be reproducible every 
 day or
 ...
 If it doesn't help, I hope lockdep will be more
 precise when you'll upgrade to 2.6.19 or higher.
 
 ... or when you enable lockdep in 2.6.18 (I've
 forgotten it's there already!).
 
 I got lucky..the system was available by ssh still.  I see this in the boot 
 logs..I assume
 this means lockdep is enabled?  Should I have expected to see a lockdep 
 trace in the case of
 this soft-lockup then?
 
 .
 Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
 Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_SUBCLASSES:8

Yes, you got it enabled in the config.

If there is no later message about the validator
turning itself off, and no warnings that could point
at lockdep, then it is working.

But then, IMHO, there is a rather small probability
that this bug is really from a lockup. Another possibility
is that hardware irqs (the timer in particular) are turned
off by something (maybe those hacks?) for an extremely
long time (~10 sec.).
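
For reference on the ~10 sec. figure: as far as I understand the 2.6.18
detector, a per-CPU watchdog thread refreshes a timestamp, and
softlockup_tick() (visible in the trace) complains once that timestamp
is more than about 10 seconds old. Below is a small userspace model of
that check only, not the kernel code. The struct, the function names and
the MODEL_HZ value are made up for illustration.

```c
/* Model assumptions: time is counted in ticks, MODEL_HZ ticks per second. */
#define MODEL_HZ 250UL
#define MODEL_LOCKUP_SECS 10UL

struct model_watchdog {
    unsigned long touch_timestamp; /* last tick at which the watchdog thread ran */
    unsigned long print_timestamp; /* touch_timestamp of the stall already reported */
    int have_printed;              /* nonzero once any report was made */
};

/* Called when the watchdog thread gets to run. */
void model_touch(struct model_watchdog *w, unsigned long now)
{
    w->touch_timestamp = now;
}

/* Called from the timer tick; returns 1 if a soft lockup would be reported. */
int model_tick(struct model_watchdog *w, unsigned long now)
{
    /* Do not report the same stall twice. */
    if (w->have_printed && w->print_timestamp == w->touch_timestamp)
        return 0;

    if (now - w->touch_timestamp > MODEL_LOCKUP_SECS * MODEL_HZ) {
        w->print_timestamp = w->touch_timestamp;
        w->have_printed = 1;
        return 1;
    }
    return 0;
}
```

In this model a report only happens when the tick keeps running while
the watchdog thread has not been able to run for the whole interval.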
 
Regards,
Jarek P.


Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-27 Thread Ben Greear

Jarek Poplawski wrote:

On Fri, Dec 22, 2006 at 06:05:18AM -0800, Ben Greear wrote:

Jarek Poplawski wrote:

On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:

On 20-12-2006 03:13, Ben Greear wrote:
This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in 
active use.
From the backtrace, I am thinking this might be a generic problem, 
however.


Any ideas about what this could be?  It seems to be reproducible every 
day or

...

If it doesn't help, I hope lockdep will be more
precise when you'll upgrade to 2.6.19 or higher.

... or when you enable lockdep in 2.6.18 (I've
forgotten it's there already!).
I got lucky..the system was available by ssh still.  I see this in the boot 
logs..I assume
this means lockdep is enabled?  Should I have expected to see a lockdep 
trace in the case of

this soft-lockup then?

.
Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_SUBCLASSES:8


Yes, you got it enabled in the config.

If there is no message later about validator
turning off and no warnings which could point
at lockdep then it is working.

But then, IMHO, there is rather small probability
this bug is really from lockup. Another possibility
is hardware irqs (timer in particular) are turned
off by something (maybe those hacks?) for extremely
long time (~10 sec.). 


The system hangs and does not recover (well, a few processes
continue on the other processor for a few minutes before they
too deadlock...)

I am guessing this problem has been around for a while, but it
is only triggered when interfaces are created, and probably only
when UDP traffic is already running heavily on the system.  Most
systems w/out virtual devices will not trigger this sort of
race.

Ben

 
Regards,

Jarek P.



--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-22 Thread Jarek Poplawski
On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
 [PATCH] igmp: spin_lock_bh in timer
 
 igmp_timer_expire() uses spin_lock(&im->lock)
 but this lock is also taken by other igmp timers,
 so it should be changed to bh version.

... but in theory this doesn't matter.
I was misled by those other timers, which
probably use _bh unnecessarily.

Sorry for the confusion - I withdraw this patch.

Jarek P.


Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-22 Thread Herbert Xu
Jarek Poplawski [EMAIL PROTECTED] wrote:

 [PATCH] igmp: spin_lock_bh in timer
 
 igmp_timer_expire() uses spin_lock(&im->lock)
 but this lock is also taken by other igmp timers,
 so it should be changed to bh version.

When you're in a timer BH is already disabled.  So this patch
is redundant.
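
To illustrate: a timer handler is entered with bottom halves already
disabled, so the _bh lock variants only raise and restore a nesting
depth there. In process context they are what blocks BH in the first
place. The following is a toy userspace model of that nesting, not
kernel code. Every model_ name is hypothetical and no real lock is
taken.

```c
#include <assert.h>

/* Toy counter standing in for the kernel's BH-disable nesting depth. */
int model_bh_depth;

void model_bh_disable(void)
{
    model_bh_depth++;
}

void model_bh_enable(void)
{
    assert(model_bh_depth > 0);   /* an enable must pair with a disable */
    model_bh_depth--;
}

/*
 * Returns 1 if bottom halves are blocked while the (imaginary) lock is held.
 * in_timer: the caller is a timer handler, so BH is disabled on entry.
 * use_bh:   the caller uses the _bh lock variants instead of the plain ones.
 */
int model_bh_blocked_under_lock(int in_timer, int use_bh)
{
    int blocked;

    if (in_timer)
        model_bh_disable();       /* entering the timer handler */
    if (use_bh)
        model_bh_disable();       /* what spin_lock_bh() would add */

    blocked = model_bh_depth > 0; /* state while the lock is held */

    if (use_bh)
        model_bh_enable();        /* spin_unlock_bh() */
    if (in_timer)
        model_bh_enable();        /* leaving the timer handler */
    return blocked;
}
```

With in_timer set, the result is 1 whether or not use_bh is set. Only
with in_timer clear does use_bh change the outcome.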

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-22 Thread Jarek Poplawski
On Fri, Dec 22, 2006 at 10:16:30PM +1100, Herbert Xu wrote:
 Jarek Poplawski [EMAIL PROTECTED] wrote:
 
  [PATCH] igmp: spin_lock_bh in timer
  
  igmp_timer_expire() uses spin_lock(&im->lock)
  but this lock is also taken by other igmp timers,
  so it should be changed to bh version.
 
 When you're in a timer BH is already disabled.  So this patch
 is redundant.

Yes, I recognized this after the damage was done.

Thanks and regards,

Jarek P.


Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-22 Thread Ben Greear

Jarek Poplawski wrote:

On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:

On 20-12-2006 03:13, Ben Greear wrote:
This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in 
active use.
 From the backtrace, I am thinking this might be a generic problem, 
however.


Any ideas about what this could be?  It seems to be reproducible every 
day or

...

If it doesn't help, I hope lockdep will be more
precise when you'll upgrade to 2.6.19 or higher.


... or when you enable lockdep in 2.6.18 (I've
forgotten it's there already!).


I thought I had it enabled, but perhaps I do not.  I'll double check that
as soon as I'm back in the office after Christmas vacation.

Thanks for looking at this!

Ben



Jarek P.



--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-22 Thread Ben Greear

Jarek Poplawski wrote:

On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:

On 20-12-2006 03:13, Ben Greear wrote:
This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in 
active use.
 From the backtrace, I am thinking this might be a generic problem, 
however.


Any ideas about what this could be?  It seems to be reproducible every 
day or

...

If it doesn't help, I hope lockdep will be more
precise when you'll upgrade to 2.6.19 or higher.


... or when you enable lockdep in 2.6.18 (I've
forgotten it's there already!).


I got lucky..the system was available by ssh still.  I see this in the boot 
logs..I assume
this means lockdep is enabled?  Should I have expected to see a lockdep trace 
in the case of
this soft-lockup then?

.
Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_SUBCLASSES:8

Dec 19 04:33:48 localhost kernel: ... MAX_LOCK_DEPTH:  30
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_KEYS:2048
Dec 19 04:33:48 localhost kernel: ... CLASSHASH_SIZE:   1024
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_ENTRIES: 8192
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_CHAINS:  8192
Dec 19 04:33:48 localhost kernel: ... CHAINHASH_SIZE:  4096
Dec 19 04:33:48 localhost kernel:  memory used by lock dependency info: 696 kB
Dec 19 04:33:48 localhost kernel:  per task-struct memory footprint: 1200 bytes
Dec 19 04:33:48 localhost kernel: 
Dec 19 04:33:48 localhost kernel: | Locking API testsuite:





Jarek P.



--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



[PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-21 Thread Jarek Poplawski
On 20-12-2006 03:13, Ben Greear wrote:
 This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in 
 active use.
  From the backtrace, I am thinking this might be a generic problem, 
 however.
 
 Any ideas about what this could be?  It seems to be reproducible every 
 day or
 two, but no known way to make it happen quickly...
 
 Kernel is SMP, PREEMPT.
 
 
 Dec 19 04:49:33 localhost kernel: BUG: soft lockup detected on CPU#0!
 Dec 19 04:49:33 localhost kernel:  [<78104252>] show_trace+0x12/0x20
 Dec 19 04:49:33 localhost kernel:  [<78104929>] dump_stack+0x19/0x20
 Dec 19 04:49:33 localhost kernel:  [<7814c88b>] softlockup_tick+0x9b/0xd0
 Dec 19 04:49:33 localhost kernel:  [<7812a992>] run_local_timers+0x12/0x20
 Dec 19 04:49:33 localhost kernel:  [<7812ac08>] update_process_times+0x38/0x80
 Dec 19 04:49:33 localhost kernel:  [<78112796>] smp_apic_timer_interrupt+0x66/0x70
 Dec 19 04:49:33 localhost kernel:  [<78103baa>] apic_timer_interrupt+0x2a/0x30
 Dec 19 04:49:33 localhost kernel:  [<78354e8c>] _read_lock+0x3c/0x50
 Dec 19 04:49:33 localhost kernel:  [<78331f42>] ip_check_mc+0x22/0xb0
 Dec 19 04:49:33 localhost kernel:  [<783068bf>] ip_route_input+0x17f/0xef0
 Dec 19 04:49:33 localhost kernel:  [<78309c59>] ip_rcv+0x349/0x580

Hello,

This log probably isn't enough to tell with certainty
which lock is to blame. We can see it's taken from
some timer during ip_check_mc(), but this
read_lock(&in_dev->mc_list_lock) doesn't seem to
be used in timers for writing.
Maybe if you waited a few minutes or tried
SysRq, an oops could tell more.

Looking at igmp.c I've found one suspicious place,
and a patch proposal is included here, but it may
not be your case. Anyway, you could also try to
change the above-mentioned read_lock and
read_unlock to the _bh versions - maybe I missed
something.

If it doesn't help, I hope lockdep will be more
precise when you'll upgrade to 2.6.19 or higher.
 
Regards,
Jarek P.
---
[PATCH] igmp: spin_lock_bh in timer

igmp_timer_expire() uses spin_lock(&im->lock)
but this lock is also taken by other igmp timers,
so it should be changed to bh version.

Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
---

diff -Nurp linux-2.6.20-rc1-/net/ipv4/igmp.c linux-2.6.20-rc1/net/ipv4/igmp.c
--- linux-2.6.20-rc1-/net/ipv4/igmp.c   2006-12-16 20:37:18.0 +0100
+++ linux-2.6.20-rc1/net/ipv4/igmp.c    2006-12-21 22:57:30.0 +0100
@@ -727,7 +727,7 @@ static void igmp_timer_expire(unsigned l
 	struct ip_mc_list *im=(struct ip_mc_list *)data;
 	struct in_device *in_dev = im->interface;
 
-	spin_lock(&im->lock);
+	spin_lock_bh(&im->lock);
 	im->tm_running=0;
 
 	if (im->unsolicit_count) {
@@ -735,7 +735,7 @@ static void igmp_timer_expire(unsigned l
 		igmp_start_timer(im, IGMP_Unsolicited_Report_Interval);
 	}
 	im->reporter = 1;
-	spin_unlock(&im->lock);
+	spin_unlock_bh(&im->lock);
 
 	if (IGMP_V1_SEEN(in_dev))
 		igmp_send_report(in_dev, im, IGMP_HOST_MEMBERSHIP_REPORT);


Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)

2006-12-21 Thread Jarek Poplawski
On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
 On 20-12-2006 03:13, Ben Greear wrote:
  This is from 2.6.18.2 kernel with my patch set.  The MAC-VLANs are in 
  active use.
   From the backtrace, I am thinking this might be a generic problem, 
  however.
  
  Any ideas about what this could be?  It seems to be reproducible every 
  day or
...
 If it doesn't help, I hope lockdep will be more
 precise when you'll upgrade to 2.6.19 or higher.

... or when you enable lockdep in 2.6.18 (I've
forgotten it's there already!).

Jarek P.