Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
On Wed, Dec 27, 2006 at 08:16:10AM -0800, Ben Greear wrote:
> Jarek Poplawski wrote:
>> On Fri, Dec 22, 2006 at 06:05:18AM -0800, Ben Greear wrote:
>>> Jarek Poplawski wrote:
>>>> On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
>>>>> On 20-12-2006 03:13, Ben Greear wrote:
>>>>>> This is from 2.6.18.2 kernel with my patch set. The MAC-VLANs are in active use. From the backtrace, I am thinking this might be a generic problem, however. Any ideas about what this could be? It seems to be reproducible every day or ...
>>>>> If it doesn't help, I hope lockdep will be more precise when you'll upgrade to 2.6.19 or higher.
>>>> ... or when you enable lockdep in 2.6.18 (I've forgotten it's there already!).
>>> I got lucky..the system was available by ssh still. I see this in the boot logs..I assume this means lockdep is enabled? Should I have expected to see a lockdep trace in the case of this soft-lockup then?
>>>
>>> Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
>>> Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_SUBCLASSES:8
>> Yes, you got it enabled in the config. If there is no message later about validator turning off and no warnings which could point at lockdep then it is working. But then, IMHO, there is rather small probability this bug is really from lockup. Another possibility is hardware irqs (timer in particular) are turned off by something (maybe those hacks?) for extremely long time (~10 sec.).
> The system hangs and does not recover (well, a few processes continue on the other processor for a few minutes before they too deadlock...) I am guessing this problem has been around for a while, but it is only triggered when interfaces are created, and probably only when UDP traffic is already running heavily on the system. Most systems w/out virtual devices will not trigger this sort of race.
I had one more look at this, considering the info about creating interfaces, and here are some of my doubts about possible races (I hope you'll forgive me if I totally miss some point):

- During the register procedure the real device seems to be up and running; vlan_rx_register is used, but I see drivers differ here: some of them do netif_stop and disable irqs while others only lock. It seems they can start doing vlan_hwaccel_rx directly after this (sometimes even during registration, if an irq happens).
- vlan_hwaccel_rx is checking skb_bond_should_drop, but I'm not sure it is really useful here, so probably at least broadcasts and multicasts can use netif_rx even before vlan_dev is up (and your log accidentally shows multicast receive).
- Preemption is blocked for quite a long time in vlan_skb_recv and during netif_receive; I guess this could also be a possible reason for triggering the softlockup bug. I wonder if lowering the value of netdev_max_backlog wouldn't improve scheduling times.

Happy New Year,
Jarek P.

-
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
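For anyone wanting to try the netdev_max_backlog suggestion: the value is exposed as the net.core.netdev_max_backlog sysctl, i.e. the file /proc/sys/net/core/netdev_max_backlog. Below is a minimal userspace sketch in C for reading and writing such a file. The helper names are hypothetical (not from the thread), writing needs root, and no particular value is being recommended here.

```c
#include <stdio.h>

/* Path of the sysctl discussed above. */
#define BACKLOG_PATH "/proc/sys/net/core/netdev_max_backlog"

/* Hypothetical helper: read one integer from a sysctl file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int sysctl_read_long(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: write one integer to a sysctl file.
 * Returns 0 on success, -1 on any failure (e.g. not running as root). */
int sysctl_write_long(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fprintf(f, "%ld\n", value) > 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Usage would be along the lines of sysctl_read_long(BACKLOG_PATH, &old) to record the current value before experimenting, then sysctl_write_long(BACKLOG_PATH, new_value) and watching whether the soft lockup reports change.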
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
On Wed, Dec 27, 2006 at 08:16:10AM -0800, Ben Greear wrote:
...
> The system hangs and does not recover (well, a few processes continue on the other processor for a few minutes before they too deadlock...) I am guessing this problem has been around for a while, but it is only triggered when interfaces are created, and probably only when UDP traffic is already running heavily on the system. Most systems w/out virtual devices will not trigger this sort of race.

Considering your contributions to the kernel, many people here would like to help, I hope, but this information is probably not enough. Maybe some more logs (dmesg)? If it deadlocks anyway, maybe adding panic() after dump_stack() could tell something.

Jarek P.
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
On Fri, Dec 22, 2006 at 06:05:18AM -0800, Ben Greear wrote:
> Jarek Poplawski wrote:
>> On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
>>> On 20-12-2006 03:13, Ben Greear wrote:
>>>> This is from 2.6.18.2 kernel with my patch set. The MAC-VLANs are in active use. From the backtrace, I am thinking this might be a generic problem, however. Any ideas about what this could be? It seems to be reproducible every day or ...
>>> If it doesn't help, I hope lockdep will be more precise when you'll upgrade to 2.6.19 or higher.
>> ... or when you enable lockdep in 2.6.18 (I've forgotten it's there already!).
> I got lucky..the system was available by ssh still. I see this in the boot logs..I assume this means lockdep is enabled? Should I have expected to see a lockdep trace in the case of this soft-lockup then?
>
> Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
> Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_SUBCLASSES:8

Yes, you got it enabled in the config. If there is no later message about the validator turning off, and no warnings that could point at lockdep, then it is working. But then, IMHO, there is a rather small probability that this bug really comes from a lockup. Another possibility is that hardware irqs (the timer in particular) are turned off by something (maybe those hacks?) for an extremely long time (~10 sec.).

Regards,
Jarek P.
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
Jarek Poplawski wrote:
> On Fri, Dec 22, 2006 at 06:05:18AM -0800, Ben Greear wrote:
>> Jarek Poplawski wrote:
>>> On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
>>>> On 20-12-2006 03:13, Ben Greear wrote:
>>>>> This is from 2.6.18.2 kernel with my patch set. The MAC-VLANs are in active use. From the backtrace, I am thinking this might be a generic problem, however. Any ideas about what this could be? It seems to be reproducible every day or ...
>>>> If it doesn't help, I hope lockdep will be more precise when you'll upgrade to 2.6.19 or higher.
>>> ... or when you enable lockdep in 2.6.18 (I've forgotten it's there already!).
>> I got lucky..the system was available by ssh still. I see this in the boot logs..I assume this means lockdep is enabled? Should I have expected to see a lockdep trace in the case of this soft-lockup then?
>>
>> Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
>> Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_SUBCLASSES:8
> Yes, you got it enabled in the config. If there is no message later about validator turning off and no warnings which could point at lockdep then it is working. But then, IMHO, there is rather small probability this bug is really from lockup. Another possibility is hardware irqs (timer in particular) are turned off by something (maybe those hacks?) for extremely long time (~10 sec.).

The system hangs and does not recover (well, a few processes continue on the other processor for a few minutes before they too deadlock...) I am guessing this problem has been around for a while, but it is only triggered when interfaces are created, and probably only when UDP traffic is already running heavily on the system. Most systems w/out virtual devices will not trigger this sort of race.

Ben

> Regards,
> Jarek P.
--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
> [PATCH] igmp: spin_lock_bh in timer
>
> igmp_timer_expire() uses spin_lock(&im->lock) but this lock is also taken by other igmp timers, so it should be changed to bh version.

... but according to theory this doesn't matter. I was misled by these other timers, which probably use _bh unnecessarily. Sorry for the confusion - I withdraw this patch.

Jarek P.
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
Jarek Poplawski [EMAIL PROTECTED] wrote:
> [PATCH] igmp: spin_lock_bh in timer
>
> igmp_timer_expire() uses spin_lock(&im->lock) but this lock is also taken by other igmp timers, so it should be changed to bh version.

When you're in a timer BH is already disabled. So this patch is redundant.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
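The point about BH already being disabled in a timer can be illustrated with a toy userspace model in C. This is only a sketch under stated assumptions: the toy_* names are hypothetical, a plain integer stands in for the kernel's per-CPU softirq nesting count, and no real kernel API is being called. Timer handlers run from softirq context, so the count is already raised when the handler starts; the extra disable/enable pair that the _bh lock variants add just nests inside it and changes nothing observable.

```c
#include <stdbool.h>

/* Toy stand-in for the per-CPU "bottom halves disabled" nesting count. */
static int toy_bh_depth;

void toy_local_bh_disable(void) { toy_bh_depth++; }
void toy_local_bh_enable(void)  { toy_bh_depth--; }

/* In this model, softirqs may run only when nothing has disabled them. */
bool toy_softirqs_can_run(void) { return toy_bh_depth == 0; }

/* What the handler observed at each point. */
struct toy_observation {
    bool before_lock;
    bool inside_lock;
    bool after_unlock;
};

/* Hypothetical timer handler that uses the _bh lock variants. */
static void toy_timer_handler(struct toy_observation *obs)
{
    obs->before_lock = toy_softirqs_can_run();
    toy_local_bh_disable();              /* the part spin_lock_bh() adds */
    obs->inside_lock = toy_softirqs_can_run();
    toy_local_bh_enable();               /* the part spin_unlock_bh() adds */
    obs->after_unlock = toy_softirqs_can_run();
}

/* Timers are run from softirq context: the count is raised around them. */
struct toy_observation toy_run_timer_softirq(void)
{
    struct toy_observation obs;

    toy_local_bh_disable();
    toy_timer_handler(&obs);
    toy_local_bh_enable();
    return obs;
}
```

In this model all three observations come out false: softirqs cannot run before, during, or after the locked region, so the plain spin_lock() in the handler is already as protected against other softirqs on the same CPU as the _bh version would make it.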
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
On Fri, Dec 22, 2006 at 10:16:30PM +1100, Herbert Xu wrote:
> Jarek Poplawski [EMAIL PROTECTED] wrote:
>> [PATCH] igmp: spin_lock_bh in timer
>>
>> igmp_timer_expire() uses spin_lock(&im->lock) but this lock is also taken by other igmp timers, so it should be changed to bh version.
> When you're in a timer BH is already disabled. So this patch is redundant.

Yes, I recognized this after the damage was done.

Thanks and regards,
Jarek P.
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
Jarek Poplawski wrote:
> On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
>> On 20-12-2006 03:13, Ben Greear wrote:
>>> This is from 2.6.18.2 kernel with my patch set. The MAC-VLANs are in active use. From the backtrace, I am thinking this might be a generic problem, however. Any ideas about what this could be? It seems to be reproducible every day or ...
>> If it doesn't help, I hope lockdep will be more precise when you'll upgrade to 2.6.19 or higher.
> ... or when you enable lockdep in 2.6.18 (I've forgotten it's there already!).

I thought I had it enabled, but perhaps I do not. I'll double check that as soon as I'm back in the office after Christmas vacation.

Thanks for looking at this!

Ben

> Jarek P.

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
Jarek Poplawski wrote:
> On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
>> On 20-12-2006 03:13, Ben Greear wrote:
>>> This is from 2.6.18.2 kernel with my patch set. The MAC-VLANs are in active use. From the backtrace, I am thinking this might be a generic problem, however. Any ideas about what this could be? It seems to be reproducible every day or ...
>> If it doesn't help, I hope lockdep will be more precise when you'll upgrade to 2.6.19 or higher.
> ... or when you enable lockdep in 2.6.18 (I've forgotten it's there already!).

I got lucky..the system was available by ssh still. I see this in the boot logs..I assume this means lockdep is enabled? Should I have expected to see a lockdep trace in the case of this soft-lockup then?

Dec 19 04:33:48 localhost kernel: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_SUBCLASSES:8
Dec 19 04:33:48 localhost kernel: ... MAX_LOCK_DEPTH: 30
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_KEYS:2048
Dec 19 04:33:48 localhost kernel: ... CLASSHASH_SIZE: 1024
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_ENTRIES: 8192
Dec 19 04:33:48 localhost kernel: ... MAX_LOCKDEP_CHAINS: 8192
Dec 19 04:33:48 localhost kernel: ... CHAINHASH_SIZE: 4096
Dec 19 04:33:48 localhost kernel: memory used by lock dependency info: 696 kB
Dec 19 04:33:48 localhost kernel: per task-struct memory footprint: 1200 bytes
Dec 19 04:33:48 localhost kernel:
Dec 19 04:33:48 localhost kernel: | Locking API testsuite:

> Jarek P.

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc http://www.candelatech.com
[PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
On 20-12-2006 03:13, Ben Greear wrote:
> This is from 2.6.18.2 kernel with my patch set. The MAC-VLANs are in active use. From the backtrace, I am thinking this might be a generic problem, however. Any ideas about what this could be? It seems to be reproducible every day or two, but no known way to make it happen quickly... Kernel is SMP, PREEMPT.
>
> Dec 19 04:49:33 localhost kernel: BUG: soft lockup detected on CPU#0!
> Dec 19 04:49:33 localhost kernel:  [<78104252>] show_trace+0x12/0x20
> Dec 19 04:49:33 localhost kernel:  [<78104929>] dump_stack+0x19/0x20
> Dec 19 04:49:33 localhost kernel:  [<7814c88b>] softlockup_tick+0x9b/0xd0
> Dec 19 04:49:33 localhost kernel:  [<7812a992>] run_local_timers+0x12/0x20
> Dec 19 04:49:33 localhost kernel:  [<7812ac08>] update_process_times+0x38/0x80
> Dec 19 04:49:33 localhost kernel:  [<78112796>] smp_apic_timer_interrupt+0x66/0x70
> Dec 19 04:49:33 localhost kernel:  [<78103baa>] apic_timer_interrupt+0x2a/0x30
> Dec 19 04:49:33 localhost kernel:  [<78354e8c>] _read_lock+0x3c/0x50
> Dec 19 04:49:33 localhost kernel:  [<78331f42>] ip_check_mc+0x22/0xb0
> Dec 19 04:49:33 localhost kernel:  [<783068bf>] ip_route_input+0x17f/0xef0
> Dec 19 04:49:33 localhost kernel:  [<78309c59>] ip_rcv+0x349/0x580

Hello,

This log probably isn't enough to tell with certainty which lock is to blame. We can see it's taken from some timer during ip_check_mc(), but this read_lock(&in_dev->mc_list_lock) doesn't seem to be used in timers for writing. Maybe if you waited a few minutes or tried SysRq, an oops could tell more.

Looking at igmp.c I've found one suspicious place and a patch proposal is included here, but it may not be your case. Anyway, you could also try changing the above-mentioned read_lock and read_unlock to the _bh versions - maybe I missed something. If it doesn't help, I hope lockdep will be more precise when you upgrade to 2.6.19 or higher.

Regards,
Jarek P.
---

[PATCH] igmp: spin_lock_bh in timer

igmp_timer_expire() uses spin_lock(&im->lock) but this lock is also taken by other igmp timers, so it should be changed to bh version.

Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
---

diff -Nurp linux-2.6.20-rc1-/net/ipv4/igmp.c linux-2.6.20-rc1/net/ipv4/igmp.c
--- linux-2.6.20-rc1-/net/ipv4/igmp.c	2006-12-16 20:37:18.0 +0100
+++ linux-2.6.20-rc1/net/ipv4/igmp.c	2006-12-21 22:57:30.0 +0100
@@ -727,7 +727,7 @@ static void igmp_timer_expire(unsigned l
 	struct ip_mc_list *im=(struct ip_mc_list *)data;
 	struct in_device *in_dev = im->interface;
 
-	spin_lock(&im->lock);
+	spin_lock_bh(&im->lock);
 	im->tm_running=0;
 
 	if (im->unsolicit_count) {
@@ -735,7 +735,7 @@ static void igmp_timer_expire(unsigned l
 		igmp_start_timer(im, IGMP_Unsolicited_Report_Interval);
 	}
 	im->reporter = 1;
-	spin_unlock(&im->lock);
+	spin_unlock_bh(&im->lock);
 
 	if (IGMP_V1_SEEN(in_dev))
 		igmp_send_report(in_dev, im, IGMP_HOST_MEMBERSHIP_REPORT);
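As background for reading the trace in the original report (softlockup_tick called from the timer interrupt), here is a toy userspace model in C of how a soft lockup check of this kind can work. It is a sketch, not the kernel's code: the toy_* names and the TOY_HZ value are assumptions, and the 10 second threshold is taken from the "~10 sec." figure mentioned elsewhere in this thread. The idea is that a watchdog task refreshes a timestamp whenever it gets to run, and every timer tick compares the current time against that timestamp.

```c
#include <stdbool.h>

#define TOY_HZ           250UL  /* assumed tick rate, for illustration only */
#define TOY_LOCKUP_SECS  10UL   /* the "~10 sec." figure from the thread */

/* Per-CPU state in this toy model. */
struct toy_watchdog {
    unsigned long touch_ts;  /* tick count when the watchdog task last ran */
    bool reported;           /* avoid repeating the report for one lockup */
};

/* Called by the watchdog task each time the scheduler lets it run. */
void toy_touch_watchdog(struct toy_watchdog *w, unsigned long now)
{
    w->touch_ts = now;
    w->reported = false;
}

/* Called from every timer tick. Returns true when a new lockup should be
 * reported, i.e. the watchdog task has not run for more than the threshold. */
bool toy_softlockup_tick(struct toy_watchdog *w, unsigned long now)
{
    if (w->reported)
        return false;
    if (now - w->touch_ts > TOY_LOCKUP_SECS * TOY_HZ) {
        w->reported = true;
        return true;
    }
    return false;
}
```

Under this model a report means only that the watchdog task did not get scheduled on that CPU for the whole threshold period while timer ticks kept arriving; the stack below the timer interrupt in the trace shows what the CPU was doing at the moment of the check, not necessarily which lock or code path is at fault.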
Re: [PATCH] igmp: spin_lock_bh in timer (Re: BUG: soft lockup detected on CPU#0!)
On Fri, Dec 22, 2006 at 08:13:08AM +0100, Jarek Poplawski wrote:
> On 20-12-2006 03:13, Ben Greear wrote:
>> This is from 2.6.18.2 kernel with my patch set. The MAC-VLANs are in active use. From the backtrace, I am thinking this might be a generic problem, however. Any ideas about what this could be? It seems to be reproducible every day or ...
> If it doesn't help, I hope lockdep will be more precise when you'll upgrade to 2.6.19 or higher.

... or when you enable lockdep in 2.6.18 (I've forgotten it's there already!).

Jarek P.