On Thu, 2012-04-05 at 09:00 +1200, Quintin Russ wrote:
> Hi Ian,
>
> On 05/04/12 01:00, Ian Campbell wrote:
> > Hi Quintin,
> >
> > Thanks for your report.
> >
> > On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
> >> Package: linux-image-2.6.32-5-xen-amd64
> >> Version: 2.6.32-39
> >> Severity: important
> >>
> >> We have observed an issue when a Xen dom0 is removing a snapshot for a
> >> logical volume and another process comes along to create a snapshot
> >> for that same device (different names), causing the server to kernel
> >> oops. According to my logs, removal of the snapshot can sometimes
> >> pause or take a while, contributing to the issue. Attempts to add
> >> locking code (using dotlockfile) have not so far been successful in
> >> mitigating this bug, but we are still exploring this option.
> >>
> >> The nodes are affected intermittently, and we have been unable to
> >> reproduce this issue in the lab (on either the same model of hardware
> >> or hardware that has crashed in production). From our logs we can see
> >> that every time this issue occurs one process has been removing the
> >> snapshot while another has been creating a snapshot shortly after
> >> (seconds normally). We are currently seeing about a 5% chance of a
> >> crash per month (assuming our nodes are equal).
> >>
> >> This bug looks similar to a number of bugs that have already been
> >> filed related to this issue:
> >> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400
> >> A quick Google search shows many more (which have mostly been merged):
> >> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%20snapshot%20kernel%20oops%20squeeze
> >
> > Those issues were believed to be fixed in 2.6.32-34 and you are running
> > 2.6.32-39, so either this is a different issue (perhaps with similar
> > symptoms) or the issue isn't really fixed. Either way I think we need to
> > see your kernel logs containing the actual oops in order to make any
> > progress.
>
> Yes, we have been having this problem since before 2.6.32-34 and were
> very hopeful that change would fix it. This sadly was not the case.
> Unfortunately there isn't anything in the logs for this, but I have a
> screenshot from the console, which I have attached.

Thanks. Googling around for issues with sync_super threw up
https://bugzilla.redhat.com/show_bug.cgi?id=587265 and
https://bugzilla.redhat.com/show_bug.cgi?id=550724. Comment 81 of the
second one mentions issues with IRQ handling, which reminded me that a
bunch of those were fixed in 2.6.32-40, whereas you are running -39 (which
is fair enough, since that is the version currently in stable). Could you
try the kernel from stable-proposed-updates (now 2.6.32-43)?

Also referenced was https://lkml.org/lkml/2010/9/1/178, which supports the
interrupt problem theory.

If there's any chance of setting up a serial console to catch this issue
should it happen again, then that would be very useful too.

Ian.

>
> I also had an idle shell at the time the server crashed and this is what
> I saw:
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.000629] Oops: 0000 [#1] SMP
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.000661] last sysfs file:
> /sys/devices/virtual/block/dm-49/removable
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.001891] Stack:
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.002101] Call Trace:
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.002540] Code: 66 ff 05 c9 83 58 00 48 89 ef e8 db 7a
> f7 ff 48 89 df e8 7f fe ff ff e8 51 b0 21 00 48 c7 c7 e0 99 67 81 e8 3b
> c0 21 00 48 8b 1b <48> 8b 03 48 81 fb 90 d1 48 81 0f 18 08 0f 85 64 ff
> ff ff 66 ff
>
> Message from syslogd@dom0 at Apr 4 01:37:22 ...
> kernel:[4805213.002901] CR2: 0000000000000000
>
> Please let me know if there is anything further I can provide.

-- 
Ian Campbell