Hi Fernando,

On Tue, Jun 20, 2017 at 12:10 AM, Fernando Casas Schössow
<casasferna...@hotmail.com> wrote:
> Hi Ladi,
>
> Today two guests failed again, at different times of day.
> One of them was the one I switched from virtio_blk to virtio_scsi, so that
> change didn't solve the problem.
> Now in this guest I also disabled virtio_balloon, continuing with the
> elimination process.
>
> Also, this time I found a different error message on the guest console.
> In the guest already switched to virtio_scsi:
>
> virtio_scsi virtio2: request:id 44 is not a head!
>
> Followed by the usual "task blocked for more than 120 seconds." error.
>
> On the guest still running on virtio_blk the error was similar:
>
> virtio_blk virtio2: req.0:id 42 is not a head!
> blk_update_request: I/O error, dev vda, sector 645657736
> Buffer I/O error on dev dm-1, logical block 7413821, lost async page write
>
> Followed by the usual "task blocked for more than 120 seconds." error.
Honestly, this is starting to look more and more like memory corruption.
Two different virtio devices and two different guest operating systems:
that would have to be a bug in the common virtio code, and we would have
seen it somewhere else already. Would it be possible to run a thorough
memtest on the host, just in case?

> Do you think that the blk_update_request and buffer I/O errors may be a
> consequence of the previous "is not a head!" error, or should I be worried
> about a storage-level issue here?
>
> Now I will wait to see whether disabling virtio_balloon helps and report
> back.
>
> Thanks.
>
> Fer
>
> On vie, jun 16, 2017 at 12:25, Ladi Prosek <lpro...@redhat.com> wrote:
>
> > On Fri, Jun 16, 2017 at 12:11 PM, Fernando Casas Schössow
> > <casasferna...@hotmail.com> wrote:
> >
> > Hi Ladi,
> >
> > Thanks a lot for looking into this and replying. I will do my best to
> > rebuild and deploy Alpine's qemu packages with this patch included, but
> > I'm not sure that's feasible yet. In any case, would it be possible to
> > have this patch included in the next QEMU release?
>
> Yes, I have already added this to my todo list.
>
> > The current error message is helpful, but knowing which device was
> > involved will be much more helpful. Regarding the environment: I'm not
> > doing migrations, and a managed save is only done when the host needs
> > to be rebooted or shut down. The QEMU process has been running the VM
> > since the host started, and this failure occurs randomly without any
> > prior managed save. As part of troubleshooting, on one of the guests I
> > switched the guest disks from virtio_blk to virtio_scsi, but I will
> > need more time to see whether that helped. If I hit this problem again
> > I will follow your advice and remove virtio_balloon.
>
> Thanks, please keep us posted.
>
> > Another question: is there any way to monitor the virtqueue size,
> > either from the guest itself or from the host? Any file in sysfs or
> > proc?
> > This may help to understand in which conditions this is happening, and
> > to react faster to mitigate the problem.
>
> The problem is not in the virtqueue size but in one piece of internal
> state ("inuse") which is meant to track the number of buffers "checked
> out" by QEMU. It is compared to the virtqueue size merely as a sanity
> check. I'm afraid there's no way to expose this variable without
> rebuilding QEMU. The best you could do is attach gdb to the QEMU process
> and use some clever data access breakpoints to catch suspicious writes
> to the variable, although it's likely that it just creeps up slowly and
> you won't see anything interesting. It's probably beyond reasonable at
> this point anyway. I would continue with the elimination process
> (virtio_scsi instead of virtio_blk, no balloon, etc.), and then maybe
> once we know which device it is, we can add some instrumentation to the
> code.
>
> > Thanks again for your help with this!
> >
> > Fer
> >
> > On vie, jun 16, 2017 at 8:58, Ladi Prosek <lpro...@redhat.com> wrote:
> >
> > Hi,
> >
> > Would you be able to enhance the error message and rebuild QEMU?
> >
> > --- a/hw/virtio/virtio.c
> > +++ b/hw/virtio/virtio.c
> > @@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >      max = vq->vring.num;
> >
> >      if (vq->inuse >= vq->vring.num) {
> > -        virtio_error(vdev, "Virtqueue size exceeded");
> > +        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
> >          goto done;
> >      }
> >
> > This would at least confirm the theory that it's caused by
> > virtio-blk-pci. If rebuilding is not feasible I would start by removing
> > other virtio devices, particularly balloon, which has had quite a few
> > virtio-related bugs fixed recently.
> >
> > Does your environment involve VM migrations or saving/resuming, or does
> > the crashing QEMU process always run the VM from its boot?
> >
> > Thanks!
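For readers who want to follow the gdb route Ladi describes above, a
location watchpoint on the "inuse" field is the usual tool. This is only a
sketch: the breakpoint/watchpoint numbers and the exact way you reach the
VirtQueue pointer depend on your build and session, and debug symbols are
required.

```gdb
# Attach to the running QEMU process:
#   gdb -p $(pidof qemu-system-x86_64)

# Stop once inside virtqueue_pop to get hold of the VirtQueue pointer:
(gdb) break virtqueue_pop
(gdb) continue
(gdb) print vq->inuse

# Watch writes to that exact memory location; "-l" evaluates the
# expression to an address, so the watchpoint survives leaving the
# current frame:
(gdb) watch -l vq->inuse
(gdb) delete breakpoints 1
(gdb) continue
# gdb now stops on every write to inuse; "bt" shows who changed it.
```

As Ladi notes, if the counter merely creeps up one lost completion at a
time, the watchpoint will fire constantly on legitimate writes, so this is
mostly useful once you already suspect a particular code path.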