Processed: Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume
Processing commands for cont...@bugs.debian.org:

> tags 542250 +patch
Bug #542250 [src:linux-2.6] repeatable crashes while copying 500G from NFS mount to local logical volume
Bug #516479 [src:linux-2.6] linux-image-2.6.26-1-xen-amd64: kernel-panic in xen_spin_wait an mutlicore dom0 with high load, not interruption save?
Ignoring request to alter tags of bug #542250 to the same tags previously set
Ignoring request to alter tags of bug #516479 to the same tags previously set
> thanks
Stopping processing here.

Please contact me if you need assistance.

Debian bug tracking system administrator
(administrator, Debian Bugs database)

-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume
> On Wed, 2009-08-19 at 22:36 +0400, Nikita V. Youshchenko wrote:
> > tags 542250 +patch
> > thanks
> >
> > > ... I may guess that line 74 should check for in_interrupt() instead
> > > of in_softirq().
> >
> > I've tried that and it really fixed the problem. Server already runs
> > the same backup procedure for several hours. Previously it crashed
> > within 15 minutes.
> >
> > Here is the patch I've applied:
> >
> > --- a/drivers/xen/core/spinlock.c 2009-08-19 16:20:17.0 +0400
> > +++ b/drivers/xen/core/spinlock.c 2009-08-19 17:36:55.0 +0400
> > @@ -71,7 +71,7 @@
> >              BUG_ON(__get_cpu_var(spinning_bh).lock == lock);
> >              spinning = &__get_cpu_var(spinning_irq);
> >          } else {
> > -            BUG_ON(!in_softirq());
> > +            BUG_ON(!in_interrupt());
> >              spinning = &__get_cpu_var(spinning_bh);
> >          }
> >          BUG_ON(spinning->lock);
>
> I'm glad it works for you, but it isn't a proper fix.

Could you please explain? How could that code line be hit if not in an
interrupt handler?

Here is my understanding of the logic of that code.

It tries to track the spinlocks that a CPU is currently spinning on.
A spinning CPU may be interrupted only by an irq. Two "normal" (not
SA_NODELAY) interrupt handlers can't be active on the same CPU at the
same time. That leads to a maximum of 3 nested spins:
- one from process context,
- one from a "normal" irq handler that interrupted that process context,
- and one from an SA_NODELAY irq handler that interrupted the normal irq
  handler. This one can't be interrupted since it runs with interrupts
  disabled.

If so, the code path in question corresponds to a "normal" interrupt
handler starting to spin. Thus the check should be in_interrupt().

How is this wrong?
Perhaps a softirq handler could be activated at exit of the "normal"
handler? Maybe a better check is BUG_ON(!in_interrupt() && !in_softirq()).
Need to check the code ...

Nikita
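To make the three-level argument above easier to follow, here is a rough user-space sketch in C. It is not the code from drivers/xen/core/spinlock.c: struct cpu_model, pick_slot() and the SLOT_* names are hypothetical, and the outer "process slot already taken" test is an assumption reconstructed from the description quoted in this thread. It only shows when the assertion would fire before and after the one-line change; it does not settle whether that change is a proper fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Which per-CPU tracking slot a new spin would use (hypothetical names). */
enum slot { SLOT_PROCESS, SLOT_BH, SLOT_IRQ, SLOT_BUG };

/* Simplified per-CPU state; the three pointers stand in for the
 * spinning, spinning_bh and spinning_irq variables seen in the patch. */
struct cpu_model {
    const void *process_lock;
    const void *bh_lock;
    const void *irq_lock;
    bool irqs_disabled;
    bool in_softirq;
    bool in_hardirq;
};

/* patched == false models BUG_ON(!in_softirq()),
 * patched == true  models BUG_ON(!in_interrupt()). */
static enum slot pick_slot(const struct cpu_model *cpu, bool patched)
{
    bool context_ok;

    if (cpu->process_lock == NULL)
        return SLOT_PROCESS;            /* nothing was interrupted */

    if (cpu->irqs_disabled)             /* innermost level */
        return cpu->irq_lock ? SLOT_BUG : SLOT_IRQ;

    /* Interrupts enabled, but an earlier spin was interrupted. */
    context_ok = patched ? (cpu->in_softirq || cpu->in_hardirq)
                         : cpu->in_softirq;
    if (!context_ok)
        return SLOT_BUG;                /* the assertion would fire */

    return cpu->bh_lock ? SLOT_BUG : SLOT_BH;
}
```

In this model, a hard irq handler running with interrupts enabled on top of a spinning process context yields SLOT_BUG with the original check and SLOT_BH with the changed one, which is the case described in the message above.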
Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume
On Wed, 2009-08-19 at 22:36 +0400, Nikita V. Youshchenko wrote:
> tags 542250 +patch
> thanks
>
> > ... I may guess that line 74 should check for in_interrupt() instead of
> > in_softirq().
>
> I've tried that and it really fixed the problem. Server already runs the
> same backup procedure for several hours. Previously it crashed within 15
> minutes.
>
> Here is the patch I've applied:
>
> --- a/drivers/xen/core/spinlock.c 2009-08-19 16:20:17.0 +0400
> +++ b/drivers/xen/core/spinlock.c 2009-08-19 17:36:55.0 +0400
> @@ -71,7 +71,7 @@
>              BUG_ON(__get_cpu_var(spinning_bh).lock == lock);
>              spinning = &__get_cpu_var(spinning_irq);
>          } else {
> -            BUG_ON(!in_softirq());
> +            BUG_ON(!in_interrupt());
>              spinning = &__get_cpu_var(spinning_bh);
>          }
>          BUG_ON(spinning->lock);

I'm glad it works for you, but it isn't a proper fix.

Ben.

-- 
Ben Hutchings
If at first you don't succeed, you're doing about average.
Processed: Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume
Processing commands for cont...@bugs.debian.org:

> tags 542250 +patch
Bug #542250 [linux-image-2.6.26-2-xen-amd64] repeatable crashes while copying 500G from NFS mount to local logical volume
Added tag(s) patch.
> thanks
Stopping processing here.

Please contact me if you need assistance.

Debian bug tracking system administrator
(administrator, Debian Bugs database)
Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume
tags 542250 +patch
thanks

> ... I may guess that line 74 should check for in_interrupt() instead of
> in_softirq().

I've tried that and it really fixed the problem. Server already runs the
same backup procedure for several hours. Previously it crashed within 15
minutes.

Here is the patch I've applied:

--- a/drivers/xen/core/spinlock.c 2009-08-19 16:20:17.0 +0400
+++ b/drivers/xen/core/spinlock.c 2009-08-19 17:36:55.0 +0400
@@ -71,7 +71,7 @@
             BUG_ON(__get_cpu_var(spinning_bh).lock == lock);
             spinning = &__get_cpu_var(spinning_irq);
         } else {
-            BUG_ON(!in_softirq());
+            BUG_ON(!in_interrupt());
             spinning = &__get_cpu_var(spinning_bh);
         }
         BUG_ON(spinning->lock);
Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume
> This asserts that if we spin on a lock after interrupting another spin,
> and interrupts are enabled, we must be in a softirq.

Looking at the bottom of the same file drivers/xen/core/spinlock.c:

void xen_spin_kick(raw_spinlock_t *lock, unsigned int token)
{
        unsigned int cpu;

        token &= (1U << TICKET_SHIFT) - 1;
        for_each_online_cpu(cpu) {
                if (spinning(&per_cpu(spinning, cpu), cpu, lock, token))
                        return;
                if (in_interrupt()
                    && spinning(&per_cpu(spinning_bh, cpu), cpu, lock, token))
                        return;
                if (raw_irqs_disabled()
                    && spinning(&per_cpu(spinning_irq, cpu), cpu, lock, token))
                        return;
        }
}
EXPORT_SYMBOL(xen_spin_kick);

... I may guess that line 74 should check for in_interrupt() instead of
in_softirq().

However it is just a guess based on analogy. I don't currently understand
the logic of that code.
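As a rough illustration of how the two checks compared above differ, here is a small user-space model in C. It is not kernel code: the model_* names are hypothetical, and the bit positions and widths are assumptions made for this sketch, loosely following how 2.6-era kernels pack softirq and hardirq nesting counts into preempt_count. What it shows is that in_softirq() looks only at the softirq bits, while in_interrupt() looks at both the hardirq and the softirq bits.

```c
#include <stdbool.h>

/* Assumed layout for this model only: softirq nesting count in
 * bits 8..15, hardirq nesting count in bits 16..27. */
#define MODEL_SOFTIRQ_SHIFT 8
#define MODEL_HARDIRQ_SHIFT 16
#define MODEL_SOFTIRQ_MASK  (0xffu  << MODEL_SOFTIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK  (0xfffu << MODEL_HARDIRQ_SHIFT)

/* Models in_softirq(): true only if a softirq bit is set. */
static bool model_in_softirq(unsigned int count)
{
    return (count & MODEL_SOFTIRQ_MASK) != 0;
}

/* Models in_interrupt(): true if a hardirq or a softirq bit is set. */
static bool model_in_interrupt(unsigned int count)
{
    return (count & (MODEL_HARDIRQ_MASK | MODEL_SOFTIRQ_MASK)) != 0;
}
```

With a count that has only a hardirq bit set, model_in_softirq() is false while model_in_interrupt() is true. That is the one situation in which BUG_ON(!in_softirq()) and BUG_ON(!in_interrupt()) behave differently.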