Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs
On Wed, 2012-05-16 at 11:52 +1200, Quintin Russ wrote:
> Hi Ian,
>
> After around 26 days of uptime one of the first hosts we upgraded the
> linux-image-2.6.32-5-xen-amd64 package to 2.6.32-43 from
> proposed-updates crashed overnight in the same situation as originally
> detailed in this ticket. One new piece of information has come up since
> then - just prior to it crashing this was logged into syslog:
>
> May 16 01:16:20 dom0 kernel: [2332024.578498] lvcreate: sending ioctl
> 1261 to a partition!

This warning was introduced as part of the fix for a security flaw,
CVE-2011-4127. It probably doesn't make any difference to this bug.

Ben.

> I am going to look into setting up a serial console for our nodes now,
> but just highlighting that I don't believe this update has fixed the
> issue we're seeing.
>
> If you have any other ideas / thoughts on how to reliably reproduce this
> fault please let me know.

--
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.
Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs
Hi Ian,

After around 26 days of uptime one of the first hosts we upgraded the
linux-image-2.6.32-5-xen-amd64 package to 2.6.32-43 from
proposed-updates crashed overnight in the same situation as originally
detailed in this ticket. One new piece of information has come up since
then - just prior to it crashing this was logged into syslog:

May 16 01:16:20 dom0 kernel: [2332024.578498] lvcreate: sending ioctl 1261 to a partition!

I am going to look into setting up a serial console for our nodes now,
but just highlighting that I don't believe this update has fixed the
issue we're seeing.

If you have any other ideas / thoughts on how to reliably reproduce this
fault please let me know.

Best Regards,
Quintin

--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4fb2ec23.9080...@quintin.co.nz
Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs
Hi Ian,

On 05/04/12 21:04, Ian Campbell wrote:
> On Thu, 2012-04-05 at 09:00 +1200, Quintin Russ wrote:
>>> Those issues were believed to be fixed in 2.6.32-34 and you are running
>>> 2.6.32-39 so either this is a different issue (perhaps with similar
>>> symptoms) or the issue isn't really fixed. Either way I think we need to
>>> see your kernel logs containing the actual oops in order to make any
>>> progress.
>>
>> Yes, we have been having this problem since before 2.6.32-34 and were
>> very hopeful that change would fix it. This sadly was not the case.
>> Unfortunately there isn't anything in the logs for this, but I have a
>> screenshot from the console, which I have attached.
>
> Thanks. Googling around for issues with sync_super threw up
> https://bugzilla.redhat.com/show_bug.cgi?id=587265 and
> https://bugzilla.redhat.com/show_bug.cgi?id=550724.
>
> Comment 81 of the second one mentioned issues with IRQ handling which
> reminded me that a bunch of those were fixed 2.6.32-40 whereas you are
> running -39 (which is fair enough since that is the version currently in
> stable). Could you try the kernel from stable-proposed-updates (now
> 2.6.32-43)?
>
> Also referenced was https://lkml.org/lkml/2010/9/1/178 which supports
> the interrupt problem theory.

Thanks for that, I think you could be on the right track here. On
another dom0 which crashed over the weekend I observed the following
behaviour at least 6 times while doing a raid re-sync, unsure if this is
related at all, but disk utilisation is low & the server is responsive.

Apr 10 03:30:47 dom0 kernel: [261639.807061] INFO: task umount:18216 blocked for more than 120 seconds.
Apr 10 03:30:47 dom0 kernel [261639.807098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 03:30:47 dom0 kernel: [261639.807146] umountD 0002 0 18216 18214 0x
Apr 10 03:30:47 dom0 kernel: [261639.807152] 88003d44cdb0 0286 00043ca5fd08 88003ca5fd98
Apr 10 03:30:47 dom0 kernel: [261639.807157] f9e0 88003ca5ffd8
Apr 10 03:30:47 dom0 kernel: [261639.807161] 00015780 00015780 88003d44a350 88003d44a648
Apr 10 03:30:47 dom0 kernel: [261639.807165] Call Trace:
Apr 10 03:30:47 dom0 kernel: [261639.807177] [] ? xen_force_evtchn_callback+0x9/0xa
Apr 10 03:30:47 dom0 kernel: [261639.807180] [] ? check_events+0x12/0x20
Apr 10 03:30:47 dom0 kernel: [261639.807186] [] ? check_preempt_wakeup+0x0/0x268
Apr 10 03:30:47 dom0 kernel: [261639.807191] [] ? bdi_sched_wait+0x0/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807194] [] ? bdi_sched_wait+0x9/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807201] [] ? _spin_unlock_irqrestore+0xd/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807205] [] ? __wait_on_bit+0x41/0x70
Apr 10 03:30:47 dom0 kernel: [261639.807208] [] ? check_preempt_wakeup+0x0/0x268
Apr 10 03:30:47 dom0 kernel: [261639.807211] [] ? bdi_sched_wait+0x0/0xe
Apr 10 03:30:47 dom0 kernel: [261639.807214] [] ? out_of_line_wait_on_bit+0x6b/0x77
Apr 10 03:30:47 dom0 kernel: [261639.807219] [] ? wake_bit_function+0x0/0x23
Apr 10 03:30:47 dom0 kernel: [261639.807222] [] ? sync_inodes_sb+0x73/0x12a
Apr 10 03:30:47 dom0 kernel: [261639.807227] [] ? __sync_filesystem+0x4b/0x70
Apr 10 03:30:47 dom0 kernel: [261639.807239] [] ? generic_shutdown_super+0x21/0xfa
Apr 10 03:30:47 dom0 kernel: [261639.807242] [] ? xen_restore_fl_direct_end+0x0/0x1
Apr 10 03:30:47 dom0 kernel: [261639.807245] [] ? kill_block_super+0x22/0x3a
Apr 10 03:30:47 dom0 kernel: [261639.807249] [] ? deactivate_super+0x60/0x77
Apr 10 03:30:47 dom0 kernel: [261639.807254] [] ? sys_umount+0x2dc/0x30b
Apr 10 03:30:47 dom0 kernel: [261639.807257] [] ? system_call_fastpath+0x16/0x1b

When reviewing our logs across multiple crashes affecting multiple
physical servers I note that 1 process is umounting the snapshot while
another process takes a new snapshot as the last log entry on the server
before the oops.

It will take us a few days (probably a week) or so to get this new
kernel rolled out, but will post an update here on how that has changed
things once I know more.

> If there's any chance of setting up a serial console to catch this issue
> should it happen again then that would be very useful too.

Will also be looking into the possibility of setting this up, as it has
been happening fairly frequently for us.

Thanks for your help so far. :-)

Best Regards,
Quintin
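The hung-task messages quoted above follow a fixed kernel format, so occurrences can be counted across a syslog file with a small script. The following is a minimal sketch only: the function and type names are made up for illustration, and the regular expression assumes the syslog prefix looks exactly like the lines in this message.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class HungTask(NamedTuple):
    timestamp: str   # syslog timestamp, e.g. "Apr 10 03:30:47"
    uptime: float    # kernel uptime in seconds, taken from the [...] field
    task: str        # name of the blocked task, e.g. "umount"
    pid: int
    seconds: int     # hung-task timeout that was exceeded


# Assumed line shape: "<Mon> <day> <time> <host> kernel: [<uptime>] INFO: task ..."
HUNG_TASK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel:? "
    r"\[\s*(?P<uptime>\d+\.\d+)\] "
    r"INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(lines: Iterable[str]) -> Iterator[HungTask]:
    """Yield one HungTask per 'blocked for more than N seconds' line."""
    for line in lines:
        m = HUNG_TASK_RE.match(line)
        if m:
            yield HungTask(
                m["ts"], float(m["uptime"]), m["task"],
                int(m["pid"]), int(m["secs"]),
            )


if __name__ == "__main__":
    sample = [
        "Apr 10 03:30:47 dom0 kernel: [261639.807061] INFO: task umount:18216 "
        "blocked for more than 120 seconds.",
        "Apr 10 03:30:47 dom0 kernel: [261639.807165] Call Trace:",
    ]
    for hit in find_hung_tasks(sample):
        print(hit)
```

Feeding it an open log file (for example `find_hung_tasks(open("/var/log/kern.log"))`, where the path is an assumption about the host's syslog layout) would list every blocked task with its PID.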
Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs
On Thu, 2012-04-05 at 09:00 +1200, Quintin Russ wrote:
> Hi Ian,
>
> On 05/04/12 01:00, Ian Campbell wrote:
> > Hi Quintin,
> >
> > Thanks for your report.
> >
> > On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
> >> Package: linux-image-2.6.32-5-xen-amd64
> >> Version: 2.6.32-39
> >> Severity: important
> >>
> >> We have observed an issue when a Xen dom0 is removing a snapshot for a
> >> logical volume and another process comes along to create a snapshot
> >> for that same device (different names) causing the server to Kernel
> >> Ooops. According to my logs sometimes removing of the snapshot can
> >> pause or take a while contributing to the issue. Attempts to add
> >> locking code (using dotlockfile) have not so far been successful in
> >> mitigating this bug, but we are still exploring this option.
> >>
> >> The nodes that are affected intermittently& we have been unable to
> >> reproduce this issue in the lab (on either the same model of hardware
> >> or hardware that has crashed in production). From our logs we can see
> >> that every time this issue occurs one process has been removing the
> >> snapshot while another has been creating a snapshot shortly after
> >> (seconds normally). We are currently seeing about a 5% chance of a
> >> crash per month (assuming our nodes are equal).
> >>
> >> This bug looks similar to a number of bugs that have already been
> >> filed related to this
> >> issue:http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400 A quick
> >> Google search shows many more (which have mostly been merged):
> >> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%20snapshot%20kernel%20oops%20squeeze
> > Those issues were believed to be fixed in 2.6.32-34 and you are running
> > 2.6.32-39 so either this is a different issue (perhaps with similar
> > symptoms) or the issue isn't really fixed. Either way I think we need to
> > see your kernel logs containing the actual oops in order to make any
> > progress.
>
> Yes, we have been having this problem since before 2.6.32-34 and were
> very hopeful that change would fix it. This sadly was not the case.
> Unfortunately there isn't anything in the logs for this, but I have a
> screenshot from the console, which I have attached.

Thanks. Googling around for issues with sync_super threw up
https://bugzilla.redhat.com/show_bug.cgi?id=587265 and
https://bugzilla.redhat.com/show_bug.cgi?id=550724.

Comment 81 of the second one mentioned issues with IRQ handling which
reminded me that a bunch of those were fixed 2.6.32-40 whereas you are
running -39 (which is fair enough since that is the version currently in
stable). Could you try the kernel from stable-proposed-updates (now
2.6.32-43)?

Also referenced was https://lkml.org/lkml/2010/9/1/178 which supports
the interrupt problem theory.

If there's any chance of setting up a serial console to catch this issue
should it happen again then that would be very useful too.

Ian.

>
> I also had an idle shell at the time the server crashed and this is what
> I saw:
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.000629] Oops: [#1] SMP
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.000661] last sysfs file:
> /sys/devices/virtual/block/dm-49/removable
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.001891] Stack:
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.002101] Call Trace:
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.002540] Code: 66 ff 05 c9 83 58 00 48 89 ef e8 db 7a
> f7 ff 48 89 df e8 7f fe ff ff e8 51 b0 21 00 48 c7 c7 e0 99 67 81 e8 3b
> c0 21 00 48 8b 1b <48> 8b 03 48 81 fb 90 d1 48 81 0f 18 08 0f 85 64 ff
> ff ff 66 ff
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.002901] CR2:
>
> Please let me know if there is anything further I can provide.

--
Ian Campbell
Current Noise: Crippled Black Phoenix - The Heart Of Every Country

Dealer prices may vary.
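The message above says a batch of IRQ-handling fixes landed in 2.6.32-40, so a host on -39 predates them. When auditing many hosts, that check can be scripted. The sketch below is a deliberately simplified assumption: it only handles versions of the form `<upstream>-<integer revision>` with the same upstream part, and the function names are illustrative. Real Debian version ordering has more rules; `dpkg --compare-versions` is the authoritative tool on a Debian system.

```python
def split_version(version: str) -> tuple[str, int]:
    """Split '2.6.32-39' into ('2.6.32', 39); reject anything more complex."""
    upstream, sep, revision = version.rpartition("-")
    if not sep or not upstream or not revision.isdigit():
        raise ValueError(f"unsupported version format: {version!r}")
    return upstream, int(revision)


def has_irq_fixes(installed: str, fixed_in: str = "2.6.32-40") -> bool:
    """Return True if `installed` is at or after the revision named in the thread.

    Only valid when both versions share the same upstream part.
    """
    inst_upstream, inst_rev = split_version(installed)
    fix_upstream, fix_rev = split_version(fixed_in)
    if inst_upstream != fix_upstream:
        raise ValueError("upstream versions differ; use dpkg --compare-versions")
    return inst_rev >= fix_rev


if __name__ == "__main__":
    for v in ("2.6.32-39", "2.6.32-43"):
        print(v, has_irq_fixes(v))
```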
Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs
Hi Ian,

On 05/04/12 01:00, Ian Campbell wrote:
> Hi Quintin,
>
> Thanks for your report.
>
> On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
>> Package: linux-image-2.6.32-5-xen-amd64
>> Version: 2.6.32-39
>> Severity: important
>>
>> We have observed an issue when a Xen dom0 is removing a snapshot for a
>> logical volume and another process comes along to create a snapshot
>> for that same device (different names) causing the server to Kernel
>> Ooops. According to my logs sometimes removing of the snapshot can
>> pause or take a while contributing to the issue. Attempts to add
>> locking code (using dotlockfile) have not so far been successful in
>> mitigating this bug, but we are still exploring this option.
>>
>> The nodes that are affected intermittently& we have been unable to
>> reproduce this issue in the lab (on either the same model of hardware
>> or hardware that has crashed in production). From our logs we can see
>> that every time this issue occurs one process has been removing the
>> snapshot while another has been creating a snapshot shortly after
>> (seconds normally). We are currently seeing about a 5% chance of a
>> crash per month (assuming our nodes are equal).
>>
>> This bug looks similar to a number of bugs that have already been
>> filed related to this
>> issue:http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400 A quick
>> Google search shows many more (which have mostly been merged):
>> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%20snapshot%20kernel%20oops%20squeeze
> Those issues were believed to be fixed in 2.6.32-34 and you are running
> 2.6.32-39 so either this is a different issue (perhaps with similar
> symptoms) or the issue isn't really fixed. Either way I think we need to
> see your kernel logs containing the actual oops in order to make any
> progress.

Yes, we have been having this problem since before 2.6.32-34 and were
very hopeful that change would fix it. This sadly was not the case.
Unfortunately there isn't anything in the logs for this, but I have a
screenshot from the console, which I have attached.

I also had an idle shell at the time the server crashed and this is what
I saw:

Message from syslogd@dom0 at Apr 4 01:37:22 ...
kernel:[4805213.000629] Oops: [#1] SMP

Message from syslogd@dom0 at Apr 4 01:37:22 ...
kernel:[4805213.000661] last sysfs file: /sys/devices/virtual/block/dm-49/removable

Message from syslogd@dom0 at Apr 4 01:37:22 ...
kernel:[4805213.001891] Stack:

Message from syslogd@dom0 at Apr 4 01:37:22 ...
kernel:[4805213.002101] Call Trace:

Message from syslogd@dom0 at Apr 4 01:37:22 ...
kernel:[4805213.002540] Code: 66 ff 05 c9 83 58 00 48 89 ef e8 db 7a f7 ff 48 89 df e8 7f fe ff ff e8 51 b0 21 00 48 c7 c7 e0 99 67 81 e8 3b c0 21 00 48 8b 1b <48> 8b 03 48 81 fb 90 d1 48 81 0f 18 08 0f 85 64 ff ff ff 66 ff

Message from syslogd@dom0 at Apr 4 01:37:22 ...
kernel:[4805213.002901] CR2:

Please let me know if there is anything further I can provide.
Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs
Hi Quintin,

Thanks for your report.

On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
> Package: linux-image-2.6.32-5-xen-amd64
> Version: 2.6.32-39
> Severity: important
>
> We have observed an issue when a Xen dom0 is removing a snapshot for a
> logical volume and another process comes along to create a snapshot
> for that same device (different names) causing the server to Kernel
> Ooops. According to my logs sometimes removing of the snapshot can
> pause or take a while contributing to the issue. Attempts to add
> locking code (using dotlockfile) have not so far been successful in
> mitigating this bug, but we are still exploring this option.
>
> The nodes that are affected intermittently& we have been unable to
> reproduce this issue in the lab (on either the same model of hardware
> or hardware that has crashed in production). From our logs we can see
> that every time this issue occurs one process has been removing the
> snapshot while another has been creating a snapshot shortly after
> (seconds normally). We are currently seeing about a 5% chance of a
> crash per month (assuming our nodes are equal).
>
> This bug looks similar to a number of bugs that have already been
> filed related to this
> issue:http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400 A quick
> Google search shows many more (which have mostly been merged):
> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%20snapshot%20kernel%20oops%20squeeze

Those issues were believed to be fixed in 2.6.32-34 and you are running
2.6.32-39 so either this is a different issue (perhaps with similar
symptoms) or the issue isn't really fixed. Either way I think we need to
see your kernel logs containing the actual oops in order to make any
progress.

Thanks,
Ian.

--
Ian Campbell
Current Noise: Cathedral - Carnival Bizarre

HOORAY, Ronald!! Now YOU can marry LINDA RONSTADT too!!
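The report mentions attempts to serialise snapshot creation and removal with a lock (using dotlockfile), which had not so far mitigated the crash. For readers unfamiliar with the approach, here is a minimal sketch of the same idea using `flock` from Python instead of dotlockfile. It is not the reporter's code: the lock path, the helper names, and the volume names in the usage note are all hypothetical, and the sketch covers only `lvcreate`/`lvremove`, not the `umount` step of a backup job.

```python
import fcntl
import subprocess
from contextlib import contextmanager

# Hypothetical lock file shared by every job that touches snapshots.
LOCK_PATH = "/var/lock/lvm-snapshot.lock"


@contextmanager
def exclusive_lock(path: str):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    with open(path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def create_snapshot_cmd(vg: str, origin: str, snapshot: str, size: str) -> list[str]:
    """Build the lvcreate command line for a snapshot of vg/origin."""
    return ["lvcreate", "--snapshot", "--size", size,
            "--name", snapshot, f"{vg}/{origin}"]


def remove_snapshot_cmd(vg: str, snapshot: str) -> list[str]:
    """Build the lvremove command line for vg/snapshot."""
    return ["lvremove", "--force", f"{vg}/{snapshot}"]


def run_serialized(cmd: list[str], lock_path: str = LOCK_PATH) -> subprocess.CompletedProcess:
    """Run `cmd` while holding the lock, so create and remove never overlap."""
    with exclusive_lock(lock_path):
        return subprocess.run(cmd, check=True)
```

A backup job would then call, for example, `run_serialized(create_snapshot_cmd("vg0", "guest-disk", "guest-disk-snap", "5G"))` and later `run_serialized(remove_snapshot_cmd("vg0", "guest-disk-snap"))`. Because the lock is advisory, it only helps if every process that creates or removes snapshots goes through the same lock file.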
Bug#667434: lvcreate / lvremove snapshot under Xen causes Kernel OOPs
Package: linux-image-2.6.32-5-xen-amd64
Version: 2.6.32-39
Severity: important

We have observed an issue when a Xen dom0 is removing a snapshot for a
logical volume and another process comes along to create a snapshot for
that same device (different names) causing the server to Kernel Ooops.
According to my logs sometimes removing of the snapshot can pause or
take a while contributing to the issue. Attempts to add locking code
(using dotlockfile) have not so far been successful in mitigating this
bug, but we are still exploring this option.

The nodes that are affected intermittently& we have been unable to
reproduce this issue in the lab (on either the same model of hardware or
hardware that has crashed in production). From our logs we can see that
every time this issue occurs one process has been removing the snapshot
while another has been creating a snapshot shortly after (seconds
normally). We are currently seeing about a 5% chance of a crash per
month (assuming our nodes are equal).

This bug looks similar to a number of bugs that have already been filed
related to this issue:http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400
A quick Google search shows many more (which have mostly been merged):
https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%20snapshot%20kernel%20oops%20squeeze

# lspci
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
00:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 2 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
00:13.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
00:14.3 PIC: Intel Corporation 5520/5500/X58 I/O Hub Throttle Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIB (ICH10) LPC Interface Controller
00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
00:1f.5 IDE interface: Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE Controller #2
01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
04:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02)
05:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
06:01.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 (rev 0a)

# lsmod
Module                  Size  Used by
nf_conntrack_ipv4       9833  10
nf_defrag_ipv4          1139  1 nf_conntrack_ipv4
xt_state                1303  10
nf_conntrack           46519  2 nf_conntrack_ipv4,xt_state
xt_physdev