On Thu, 2012-04-05 at 09:00 +1200, Quintin Russ wrote:
> Hi Ian,
> 
> On 05/04/12 01:00, Ian Campbell wrote:
> > Hi Quintin,
> >
> > Thanks for your report.
> >
> > On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
> >> Package: linux-image-2.6.32-5-xen-amd64
> >> Version: 2.6.32-39
> >> Severity: important
> >>
> >> We have observed an issue where a Xen dom0 is removing a snapshot of a
> >> logical volume and another process comes along to create a snapshot
> >> of that same device (under a different name), causing the kernel to
> >> oops. According to my logs, removal of the snapshot can sometimes
> >> pause or take a while, which contributes to the issue. Attempts to add
> >> locking code (using dotlockfile) have not so far been successful in
> >> mitigating this bug, but we are still exploring this option.
> >>
> >> The nodes are affected intermittently and we have been unable to
> >> reproduce this issue in the lab (on either the same model of hardware
> >> or hardware that has crashed in production). From our logs we can see
> >> that every time this issue occurs one process has been removing the
> >> snapshot while another has been creating a snapshot shortly after
> >> (seconds normally). We are currently seeing about a 5% chance of a
> >> crash per month (assuming our nodes are equal).
> >>
> >> This bug looks similar to a number of bugs that have already been
> >> filed related to this issue:
> >> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400
> >> A quick Google search shows many more (which have mostly been merged):
> >> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%20snapshot%20kernel%20oops%20squeeze
> > Those issues were believed to be fixed in 2.6.32-34 and you are running
> > 2.6.32-39 so either this is a different issue (perhaps with similar
> > symptoms) or the issue isn't really fixed. Either way I think we need to
> > see your kernel logs containing the actual oops in order to make any
> > progress.
> 
> Yes, we have been having this problem since before 2.6.32-34 and were 
> very hopeful that change would fix it. This sadly was not the case. 
> Unfortunately there isn't anything in the logs for this, but I have a 
> screenshot from the console, which I have attached.

Thanks.

Googling around for issues with sync_super threw up
https://bugzilla.redhat.com/show_bug.cgi?id=587265 and
https://bugzilla.redhat.com/show_bug.cgi?id=550724. Comment 81 of the
second one mentioned issues with IRQ handling which reminded me that a
bunch of those were fixed in 2.6.32-40 whereas you are running -39 (which
is fair enough since that is the version currently in stable). Could you
try the kernel from stable-proposed-updates (now 2.6.32-43)?

Also referenced was https://lkml.org/lkml/2010/9/1/178 which supports
the interrupt problem theory.

If there's any chance of setting up a serial console to catch this issue
should it happen again then that would be very useful too.
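
On the dotlockfile attempt mentioned in the report: purely as an
illustration, a minimal sketch of serialising the snapshot create and
remove operations with an flock(2) advisory lock might look like the
following. The lock path, the helper names (snapshot_lock,
run_serialised) and the LVM invocations in the usage note are
hypothetical and not taken from your scripts. It assumes every process
that touches the volume goes through the same lock file, and at best it
narrows the window rather than fixing anything in the kernel.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager


@contextmanager
def snapshot_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the whole block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until any other holder of the lock has released it.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def run_serialised(lock_path, argv):
    """Run argv while holding the lock and return its exit status."""
    with snapshot_lock(lock_path):
        # The lock is only released after the command has returned.
        return subprocess.call(argv)
```

Both the removing and the creating job would then call something like
run_serialised("/var/lock/vg0-disk.lock", ["lvremove", "-f",
"/dev/vg0/disk-snap"]) with the same (hypothetical) lock path, so that
a create cannot start until the previous remove has returned.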

Ian.

> 
> I also had an idle shell at the time the server crashed and this is what 
> I saw:
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.000629] Oops: 0000 [#1] SMP
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.000661] last sysfs file: /sys/devices/virtual/block/dm-49/removable
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.001891] Stack:
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.002101] Call Trace:
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.002540] Code: 66 ff 05 c9 83 58 00 48 89 ef e8 db 7a f7 ff 48 89 df e8 7f fe ff ff e8 51 b0 21 00 48 c7 c7 e0 99 67 81 e8 3b c0 21 00 48 8b 1b <48> 8b 03 48 81 fb 90 d1 48 81 0f 18 08 0f 85 64 ff ff ff 66 ff
> 
> Message from syslogd@dom0 at Apr  4 01:37:22 ...
>   kernel:[4805213.002901] CR2: 0000000000000000
> 
> Please let me know if there is anything further I can provide.

-- 
Ian Campbell
Current Noise: Crippled Black Phoenix - The Heart Of Every Country

Dealer prices may vary.



