Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs

2012-05-15 Thread Ben Hutchings
On Wed, 2012-05-16 at 11:52 +1200, Quintin Russ wrote:
> Hi Ian,
> 
> After around 26 days of uptime, one of the first hosts on which we
> upgraded the linux-image-2.6.32-5-xen-amd64 package to 2.6.32-43 from
> proposed-updates crashed overnight in the same situation as originally
> detailed in this ticket. One new piece of information has come up since
> then: just prior to the crash, this was logged to syslog:
> 
> May 16 01:16:20 dom0 kernel: [2332024.578498] lvcreate: sending ioctl 1261 to a partition!

This warning was introduced as part of the fix for a security flaw,
CVE-2011-4127.  It probably doesn't make any difference to this bug.
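
For readers wondering what "ioctl 1261" refers to, here is a minimal sketch that splits an ioctl request number into its fields, using the generic Linux layout (8-bit number, 8-bit type, 14-bit size, 2-bit direction). It assumes the number in the warning is printed in hexadecimal; the function name `decode_ioctl` is made up for this illustration.

```python
def decode_ioctl(request):
    """Split a Linux ioctl request number into its generic bit fields."""
    return {
        "nr": request & 0xFF,             # command number within the type
        "type": (request >> 8) & 0xFF,    # "magic" type byte
        "size": (request >> 16) & 0x3FFF, # argument size in bytes
        "dir": (request >> 30) & 0x3,     # transfer direction bits
    }

# Assumption: the "1261" in the log line is hexadecimal.
fields = decode_ioctl(0x1261)
print(fields)
```

Under that assumption the request decodes to type 0x12, number 97, with no size or direction bits, which corresponds to BLKFLSBUF (flush the block device's buffer cache) in linux/fs.h.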

Ben.

> I am going to look into setting up a serial console for our nodes now,
> but just highlighting that I don't believe this update has fixed the
> issue we're seeing.
> 
> If you have any other ideas / thoughts on how to reliably reproduce this
> fault please let me know.


-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.




Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs

2012-05-15 Thread Quintin Russ
Hi Ian,

After around 26 days of uptime, one of the first hosts on which we
upgraded the linux-image-2.6.32-5-xen-amd64 package to 2.6.32-43 from
proposed-updates crashed overnight in the same situation as originally
detailed in this ticket. One new piece of information has come up since
then: just prior to the crash, this was logged to syslog:

May 16 01:16:20 dom0 kernel: [2332024.578498] lvcreate: sending ioctl 1261 to a partition!

I am going to look into setting up a serial console for our nodes now,
but just highlighting that I don't believe this update has fixed the
issue we're seeing.

If you have any other ideas / thoughts on how to reliably reproduce this
fault please let me know.

Best Regards,

Quintin



-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4fb2ec23.9080...@quintin.co.nz



Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs

2012-04-09 Thread Quintin Russ

Hi Ian,

On 05/04/12 21:04, Ian Campbell wrote:
> On Thu, 2012-04-05 at 09:00 +1200, Quintin Russ wrote:
>>> Those issues were believed to be fixed in 2.6.32-34 and you are running
>>> 2.6.32-39 so either this is a different issue (perhaps with similar
>>> symptoms) or the issue isn't really fixed. Either way I think we need to
>>> see your kernel logs containing the actual oops in order to make any
>>> progress.
>>
>> Yes, we have been having this problem since before 2.6.32-34 and were
>> very hopeful that change would fix it. This sadly was not the case.
>> Unfortunately there isn't anything in the logs for this, but I have a
>> screenshot from the console, which I have attached.
>
> Thanks.
>
> Googling around for issues with sync_super threw up
> https://bugzilla.redhat.com/show_bug.cgi?id=587265 and
> https://bugzilla.redhat.com/show_bug.cgi?id=550724. Comment 81 of the
> second one mentioned issues with IRQ handling which reminded me that a
> bunch of those were fixed in 2.6.32-40 whereas you are running -39 (which
> is fair enough since that is the version currently in stable). Could you
> try the kernel from stable-proposed-updates (now 2.6.32-43)?
>
> Also referenced was https://lkml.org/lkml/2010/9/1/178 which supports
> the interrupt problem theory.


Thanks for that, I think you could be on the right track here.

On another dom0, which crashed over the weekend, I observed the following
behaviour at least 6 times while doing a RAID re-sync. I am unsure whether
this is related at all, but disk utilisation is low and the server is
responsive.


Apr 10 03:30:47 dom0 kernel: [261639.807061] INFO: task umount:18216 blocked for more than 120 seconds.
Apr 10 03:30:47 dom0 kernel: [261639.807098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 03:30:47 dom0 kernel: [261639.807146] umount        D 0002 0 18216  18214 0x
Apr 10 03:30:47 dom0 kernel: [261639.807152]  88003d44cdb0 0286 00043ca5fd08 88003ca5fd98
Apr 10 03:30:47 dom0 kernel: [261639.807157]  f9e0 88003ca5ffd8
Apr 10 03:30:47 dom0 kernel: [261639.807161]  00015780 00015780 88003d44a350 88003d44a648
Apr 10 03:30:47 dom0 kernel: [261639.807165] Call Trace:
Apr 10 03:30:47 dom0 kernel: [261639.807177]  [] ? xen_force_evtchn_callback+0x9/0xa
Apr 10 03:30:47 dom0 kernel: [261639.807180]  [] ? check_events+0x12/0x20
Apr 10 03:30:47 dom0 kernel: [261639.807186]  [] ? check_preempt_wakeup+0x0/0x268
Apr 10 03:30:47 dom0 kernel: [261639.807191]  [] ? bdi_sched_wait+0x0/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807194]  [] ? bdi_sched_wait+0x9/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807201]  [] ? _spin_unlock_irqrestore+0xd/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807205]  [] ? __wait_on_bit+0x41/0x70
Apr 10 03:30:47 dom0 kernel: [261639.807208]  [] ? check_preempt_wakeup+0x0/0x268
Apr 10 03:30:47 dom0 kernel: [261639.807211]  [] ? bdi_sched_wait+0x0/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807214]  [] ? out_of_line_wait_on_bit+0x6b/0x77
Apr 10 03:30:47 dom0 kernel: [261639.807219]  [] ? wake_bit_function+0x0/0x23
Apr 10 03:30:47 dom0 kernel: [261639.807222]  [] ? sync_inodes_sb+0x73/0x12a
Apr 10 03:30:47 dom0 kernel: [261639.807227]  [] ? __sync_filesystem+0x4b/0x70
Apr 10 03:30:47 dom0 kernel: [261639.807239]  [] ? generic_shutdown_super+0x21/0xfa
Apr 10 03:30:47 dom0 kernel: [261639.807242]  [] ? xen_restore_fl_direct_end+0x0/0x1
Apr 10 03:30:47 dom0 kernel: [261639.807245]  [] ? kill_block_super+0x22/0x3a
Apr 10 03:30:47 dom0 kernel: [261639.807249]  [] ? deactivate_super+0x60/0x77
Apr 10 03:30:47 dom0 kernel: [261639.807254]  [] ? sys_umount+0x2dc/0x30b
Apr 10 03:30:47 dom0 kernel: [261639.807257]  [] ? system_call_fastpath+0x16/0x1b
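
To count how often this warning shows up in a syslog file, something along these lines might help. It is only a sketch: the regular expression is written against the "INFO: task ... blocked for more than ... seconds" line as it appears above, and the helper name `find_hung_tasks` is hypothetical.

```python
import re

# Matches the hung-task warning as it appears in the log excerpt above.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

def find_hung_tasks(lines):
    """Return (task name, pid, timeout in seconds) for each warning found."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits

sample = [
    "Apr 10 03:30:47 dom0 kernel: [261639.807061] INFO: task umount:18216 "
    "blocked for more than 120 seconds.",
    "Apr 10 03:30:47 dom0 kernel: [261639.807165] Call Trace:",
]
print(find_hung_tasks(sample))
```

In practice `lines` would be an open log file rather than the two-line sample used here.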


When reviewing our logs across multiple crashes affecting multiple
physical servers, I note that one process is unmounting the snapshot while
another process takes a new snapshot, and that this is the last log entry
on the server before the oops.
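
A rough sketch of how such a pattern could be checked for mechanically is below. It assumes the log entries have already been reduced to (timestamp, action) pairs, which is a hypothetical record format, and the 30-second window is an arbitrary choice. The timestamps in the example are made up.

```python
from datetime import datetime, timedelta

def find_overlaps(events, window=timedelta(seconds=30)):
    """Return (removal, creation) timestamp pairs where a snapshot creation
    starts within `window` after a snapshot removal or unmount starts.

    `events` is a list of (timestamp, action) tuples with action being
    "remove" or "create"; this record format is hypothetical.
    """
    removals = [t for t, action in events if action == "remove"]
    creations = [t for t, action in events if action == "create"]
    pairs = []
    for removed_at in removals:
        for created_at in creations:
            if timedelta(0) <= created_at - removed_at <= window:
                pairs.append((removed_at, created_at))
    return pairs

# Made-up timestamps, for illustration only.
events = [
    (datetime(2012, 1, 1, 1, 0, 10), "remove"),
    (datetime(2012, 1, 1, 1, 0, 15), "create"),
    (datetime(2012, 1, 1, 3, 0, 0), "create"),
]
print(find_overlaps(events))
```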


It will take us a few days (probably a week or so) to get this new
kernel rolled out, but I will post an update here on how that has changed
things once I know more.



> If there's any chance of setting up a serial console to catch this issue
> should it happen again then that would be very useful too.


We will also be looking into the possibility of setting this up, as it has
been happening fairly frequently for us.


Thanks for your help so far. :-)

Best Regards,

Quintin






Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs

2012-04-05 Thread Ian Campbell
On Thu, 2012-04-05 at 09:00 +1200, Quintin Russ wrote:
> Hi Ian,
> 
> On 05/04/12 01:00, Ian Campbell wrote:
> > Hi Quintin,
> >
> > Thanks for your report.
> >
> > On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
> >> Package: linux-image-2.6.32-5-xen-amd64
> >> Version: 2.6.32-39
> >> Severity: important
> >>
> >> We have observed an issue when a Xen dom0 is removing a snapshot for a
> >> logical volume and another process comes along to create a snapshot
> >> for that same device (different names) causing the server to Kernel
> >> Oops. According to my logs sometimes removing of the snapshot can
> >> pause or take a while contributing to the issue. Attempts to add
> >> locking code (using dotlockfile) have not so far been successful in
> >> mitigating this bug, but we are still exploring this option.
> >>
> >> The nodes are affected intermittently and we have been unable to
> >> reproduce this issue in the lab (on either the same model of hardware
> >> or hardware that has crashed in production). From our logs we can see
> >> that every time this issue occurs one process has been removing the
> >> snapshot while another has been creating a snapshot shortly after
> >> (seconds normally). We are currently seeing about a 5% chance of a
> >> crash per month (assuming our nodes are equal).
> >>
> >> This bug looks similar to a number of bugs that have already been
> >> filed related to this
> >> issue:http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400  A quick
> >> Google search shows many more (which have mostly been merged):
> >> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%
> >> 20snapshot%20kernel%20oops%20squeeze
> > Those issues were believed to be fixed in 2.6.32-34 and you are running
> > 2.6.32-39 so either this is a different issue (perhaps with similar
> > symptoms) or the issue isn't really fixed. Either way I think we need to
> > see your kernel logs containing the actual oops in order to make any
> > progress.
> 
> Yes, we have been having this problem since before 2.6.32-34 and were 
> very hopeful that change would fix it. This sadly was not the case. 
> Unfortunately there isn't anything in the logs for this, but I have a 
> screenshot from the console, which I have attached.

Thanks.

Googling around for issues with sync_super threw up
https://bugzilla.redhat.com/show_bug.cgi?id=587265 and
https://bugzilla.redhat.com/show_bug.cgi?id=550724. Comment 81 of the
second one mentioned issues with IRQ handling which reminded me that a
bunch of those were fixed in 2.6.32-40 whereas you are running -39 (which
is fair enough since that is the version currently in stable). Could you
try the kernel from stable-proposed-updates (now 2.6.32-43)?

Also referenced was https://lkml.org/lkml/2010/9/1/178 which supports
the interrupt problem theory.

If there's any chance of setting up a serial console to catch this issue
should it happen again then that would be very useful too.

Ian.

> 
> I also had an idle shell at the time the server crashed and this is what 
> I saw:
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.000629] Oops:  [#1] SMP
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.000661] last sysfs file: 
> /sys/devices/virtual/block/dm-49/removable
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.001891] Stack:
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.002101] Call Trace:
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.002540] Code: 66 ff 05 c9 83 58 00 48 89 ef e8 db 7a 
> f7 ff 48 89 df e8 7f fe ff ff e8 51 b0 21 00 48 c7 c7 e0 99 67 81 e8 3b 
> c0 21 00 48 8b 1b <48> 8b 03 48 81 fb 90 d1 48 81 0f 18 08 0f 85 64 ff 
> ff ff 66 ff
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.002901] CR2: 
> 
> Please let me know if there is anything further I can provide.

-- 
Ian Campbell
Current Noise: Crippled Black Phoenix - The Heart Of Every Country

Dealer prices may vary.







Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs

2012-04-04 Thread Quintin Russ

Hi Ian,

On 05/04/12 01:00, Ian Campbell wrote:
> Hi Quintin,
>
> Thanks for your report.
>
> On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
>> Package: linux-image-2.6.32-5-xen-amd64
>> Version: 2.6.32-39
>> Severity: important
>>
>> We have observed an issue when a Xen dom0 is removing a snapshot for a
>> logical volume and another process comes along to create a snapshot
>> for that same device (different names) causing the server to Kernel
>> Oops. According to my logs sometimes removing of the snapshot can
>> pause or take a while contributing to the issue. Attempts to add
>> locking code (using dotlockfile) have not so far been successful in
>> mitigating this bug, but we are still exploring this option.
>>
>> The nodes are affected intermittently and we have been unable to
>> reproduce this issue in the lab (on either the same model of hardware
>> or hardware that has crashed in production). From our logs we can see
>> that every time this issue occurs one process has been removing the
>> snapshot while another has been creating a snapshot shortly after
>> (seconds normally). We are currently seeing about a 5% chance of a
>> crash per month (assuming our nodes are equal).
>>
>> This bug looks similar to a number of bugs that have already been
>> filed related to this
>> issue: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400  A quick
>> Google search shows many more (which have mostly been merged):
>> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%
>> 20snapshot%20kernel%20oops%20squeeze
>
> Those issues were believed to be fixed in 2.6.32-34 and you are running
> 2.6.32-39 so either this is a different issue (perhaps with similar
> symptoms) or the issue isn't really fixed. Either way I think we need to
> see your kernel logs containing the actual oops in order to make any
> progress.


Yes, we have been having this problem since before 2.6.32-34 and were 
very hopeful that change would fix it. This sadly was not the case. 
Unfortunately there isn't anything in the logs for this, but I have a 
screenshot from the console, which I have attached.


I also had an idle shell at the time the server crashed and this is what 
I saw:


Message from syslogd@dom0 at Apr  4 01:37:22 ...
 kernel:[4805213.000629] Oops:  [#1] SMP

Message from syslogd@dom0 at Apr  4 01:37:22 ...
 kernel:[4805213.000661] last sysfs file: 
/sys/devices/virtual/block/dm-49/removable


Message from syslogd@dom0 at Apr  4 01:37:22 ...
 kernel:[4805213.001891] Stack:

Message from syslogd@dom0 at Apr  4 01:37:22 ...
 kernel:[4805213.002101] Call Trace:

Message from syslogd@dom0 at Apr  4 01:37:22 ...
 kernel:[4805213.002540] Code: 66 ff 05 c9 83 58 00 48 89 ef e8 db 7a 
f7 ff 48 89 df e8 7f fe ff ff e8 51 b0 21 00 48 c7 c7 e0 99 67 81 e8 3b 
c0 21 00 48 8b 1b <48> 8b 03 48 81 fb 90 d1 48 81 0f 18 08 0f 85 64 ff 
ff ff 66 ff


Message from syslogd@dom0 at Apr  4 01:37:22 ...
 kernel:[4805213.002901] CR2: 

Please let me know if there is anything further I can provide.

Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs

2012-04-04 Thread Ian Campbell
Hi Quintin,

Thanks for your report.

On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
> Package: linux-image-2.6.32-5-xen-amd64
> Version: 2.6.32-39
> Severity: important
> 
> We have observed an issue when a Xen dom0 is removing a snapshot for a
> logical volume and another process comes along to create a snapshot
> for that same device (different names) causing the server to Kernel
> Oops. According to my logs sometimes removing of the snapshot can
> pause or take a while contributing to the issue. Attempts to add
> locking code (using dotlockfile) have not so far been successful in
> mitigating this bug, but we are still exploring this option.
> 
> The nodes are affected intermittently and we have been unable to
> reproduce this issue in the lab (on either the same model of hardware
> or hardware that has crashed in production). From our logs we can see
> that every time this issue occurs one process has been removing the
> snapshot while another has been creating a snapshot shortly after
> (seconds normally). We are currently seeing about a 5% chance of a
> crash per month (assuming our nodes are equal).
> 
> This bug looks similar to a number of bugs that have already been
> filed related to this
> issue:http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400  A quick
> Google search shows many more (which have mostly been merged):
> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%
> 20snapshot%20kernel%20oops%20squeeze

Those issues were believed to be fixed in 2.6.32-34 and you are running
2.6.32-39 so either this is a different issue (perhaps with similar
symptoms) or the issue isn't really fixed. Either way I think we need to
see your kernel logs containing the actual oops in order to make any
progress.

Thanks,
Ian.

-- 
Ian Campbell
Current Noise: Cathedral - Carnival Bizarre

HOORAY, Ronald!!  Now YOU can marry LINDA RONSTADT too!!







Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs

2012-04-03 Thread Quintin Russ

Package: linux-image-2.6.32-5-xen-amd64
Version: 2.6.32-39
Severity: important

We have observed an issue where a Xen dom0 is removing a snapshot of a logical
volume and another process comes along to create a snapshot of that same
device (with a different name), causing a kernel oops on the server. According
to my logs, removal of the snapshot can sometimes pause or take a while, which
contributes to the issue. Attempts to add locking code (using dotlockfile) have
not so far been successful in mitigating this bug, but we are still exploring
this option.
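
For illustration only, the kind of serialisation described above can be sketched with an advisory lock taken around every snapshot operation. This is not the dotlockfile code used on the affected hosts: it swaps in flock(2) via Python's fcntl module. The lock path, function names and default snapshot size are all hypothetical, and nothing here is known to avoid the oops.

```python
import fcntl
import subprocess

LOCK_PATH = "/var/lock/lvm-snapshot.lock"  # hypothetical lock file location

def run_locked(cmd, lock_path=LOCK_PATH, runner=subprocess.check_call):
    """Run `cmd` while holding an exclusive advisory lock on `lock_path`."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return runner(cmd)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

def create_snapshot(vg, lv, snap, size="1G", **kwargs):
    """Create snapshot `snap` of /dev/<vg>/<lv> under the lock."""
    cmd = ["lvcreate", "--snapshot", "--size", size, "--name", snap,
           f"/dev/{vg}/{lv}"]
    return run_locked(cmd, **kwargs)

def remove_snapshot(vg, snap, **kwargs):
    """Remove snapshot /dev/<vg>/<snap> under the same lock."""
    return run_locked(["lvremove", "--force", f"/dev/{vg}/{snap}"], **kwargs)
```

The lock only helps if every script that creates or removes snapshots goes through the same lock file, and it does not cover a removal that is still completing inside the kernel after lvremove has returned.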

The nodes are affected intermittently, and we have been unable to reproduce 
this issue in the lab (on either the same model of hardware or hardware that has 
crashed in production). From our logs we can see that every time this issue occurs 
one process has been removing the snapshot while another has been creating a 
snapshot shortly after (seconds normally). We are currently seeing about a 5% 
chance of a crash per month (assuming our nodes are equal).
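
To put that figure in perspective, here is a back-of-the-envelope sketch. It assumes the 5% applies per node per month and that months are independent, neither of which is established above.

```python
def p_at_least_one_crash(p_month, months):
    """Probability of at least one crash over `months`, assuming an
    independent per-month crash probability `p_month`."""
    return 1.0 - (1.0 - p_month) ** months

# Assumption: 5% per node per month, months independent.
print(round(p_at_least_one_crash(0.05, 12), 3))
```

Under those assumptions a given node has roughly a 46% chance of crashing at least once in a year.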

This bug looks similar to a number of bugs that have already been filed related 
to this issue: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400. A quick 
Google search shows many more (which have mostly been merged): 
https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%20snapshot%20kernel%20oops%20squeeze

# lspci
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 1 (rev 22)
00:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 2 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 3 (rev 22)
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 
(rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root 
Port 7 (rev 22)
00:13.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt 
Controller (rev 22)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management 
Registers (rev 22)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad 
Registers (rev 22)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS 
Registers (rev 22)
00:14.3 PIC: Intel Corporation 5520/5500/X58 I/O Hub Throttle Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData 
Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData 
Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData 
Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData 
Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData 
Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData 
Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData 
Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData 
Technology Device (rev 22)
00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI 
Controller #4
00:1a.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI 
Controller #5
00:1a.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI 
Controller #6
00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI 
Controller #2
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI 
Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI 
Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI 
Controller #3
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI 
Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIB (ICH10) LPC Interface Controller
00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE 
Controller #1
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
00:1f.5 IDE interface: Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE 
Controller #2
01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
04:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 
PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02)
05:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
06:01.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 
(rev 0a)

# lsmod
Module                  Size  Used by
nf_conntrack_ipv4       9833  10
nf_defrag_ipv4          1139  1 nf_conntrack_ipv4
xt_state                1303  10
nf_conntrack           46519  2 nf_conntrack_ipv4,xt_state
xt_physdev