Hi Ladi,

In this case both guests are CentOS 7.3 running the same kernel, 3.10.0-514.21.1.
The guest that fails most frequently is also running Docker with 4 or 5
containers.

Another thing I would like to mention is that the host is running Alpine's
default grsec-patched kernel. I also have the option to install a vanilla
kernel. Would it make sense to switch the host to the vanilla kernel and see
if that helps?
And last but not least, KSM is enabled on the host. Should I disable it?
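(In case it is useful: my understanding is that KSM can be stopped at runtime by writing to /sys/kernel/mm/ksm/run, so I could try that without rebooting the host. A minimal, untested C sketch of the idea; running "echo 2 > /sys/kernel/mm/ksm/run" as root would do the same.)

/* Untested sketch: stop KSM at runtime via /sys/kernel/mm/ksm/run.
 * 0 = stop ksmd but keep merged pages, 1 = run ksmd,
 * 2 = stop ksmd and unmerge all currently merged pages.
 * Needs root. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");

    if (!f) {
        perror("fopen /sys/kernel/mm/ksm/run");
        return 1;
    }
    fputs("2\n", f);  /* 2 = stop ksmd and unmerge shared pages */
    return fclose(f) == 0 ? 0 : 1;
}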

Following your advice, I will run memtest on the host and report back. Just as a
side note, the host has ECC memory.

Thanks for all your help.

Fer.

On Tue, Jun 20, 2017 at 7:59, Ladi Prosek <lpro...@redhat.com> wrote:

Hi Fernando,

On Tue, Jun 20, 2017 at 12:10 AM, Fernando Casas Schössow <casasferna...@hotmail.com> wrote:
Hi Ladi,

Today two guests failed again at different times of day. One of them was the one I switched from virtio_blk to virtio_scsi, so that change didn't solve the problem. Now in this guest I have also disabled virtio_balloon, continuing with the elimination process.

Also, this time I found a different error message on the guest console. In the guest already switched to virtio_scsi:

virtio_scsi virtio2: request:id 44 is not a head!

Followed by the usual "task blocked for more than 120 seconds." error.

On the guest still running on virtio_blk the error was similar:

virtio_blk virtio2: req.0:id 42 is not a head!
blk_update_request: I/O error, dev vda, sector 645657736
Buffer I/O error on dev dm-1, logical block 7413821, lost async page write

Followed by the usual "task blocked for more than 120 seconds." error.
Honestly, this is starting to look more and more like memory corruption. With two different virtio devices and two different guest operating systems, it would have to be a bug in the common virtio code, and we would have seen it somewhere else already. Would it be possible to run a thorough memtest on the host just in case?
Do you think that the blk_update_request and buffer I/O errors may be a consequence of the previous "is not a head!" error, or should I be worried about a storage-level issue here? Now I will wait to see whether disabling virtio_balloon helps or not and report back.

Thanks.

Fer

On Fri, Jun 16, 2017 at 12:25, Ladi Prosek <lpro...@redhat.com> wrote:

On Fri, Jun 16, 2017 at 12:11 PM, Fernando Casas Schössow <casasferna...@hotmail.com> wrote:

Hi Ladi,
Thanks a lot for looking into this and replying. I will do my best to rebuild and deploy Alpine's qemu packages with this patch included, but I'm not sure it's feasible yet. In any case, would it be possible to have this patch included in the next qemu release?

Yes, I have already added this to my todo list. The current error message is helpful, but knowing which device was involved will be much more helpful.

Regarding the environment, I'm not doing migrations, and a managed save is only done in case the host needs to be rebooted or shut down. The QEMU process has been running the VM since the host was started, and this failure is occurring randomly without any previous managed save. As part of troubleshooting, on one of the guests I switched from virtio_blk to virtio_scsi for the guest disks, but I will need more time to see if that helped. If I have this problem again I will follow your advice and remove virtio_balloon.

Thanks, please keep us posted.

Another question: is there any way to monitor the virtqueue size, either from the guest itself or from the host? Any file in sysfs or proc? This may help to understand in which conditions this is happening and to react faster to mitigate the problem.

The problem is not in the virtqueue size but in one piece of internal state ("inuse") which is meant to track the number of buffers "checked out" by QEMU. It's being compared to the virtqueue size merely as a sanity check. I'm afraid there's no way to expose this variable without rebuilding QEMU. The best you could do is attach gdb to the QEMU process and use some clever data access breakpoints to catch suspicious writes to the variable, although it's likely that it just creeps up slowly and you won't see anything interesting. It's probably beyond what's reasonable at this point anyway. I would continue with the elimination process (virtio_scsi instead of virtio_blk, no balloon, etc.) and then, maybe once we know which device it is, we can add some instrumentation to the code.
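Conceptually, the accounting works roughly like this (a simplified, self-contained sketch, not the actual QEMU source): "inuse" goes up when QEMU pops a buffer from the queue and back down when the buffer is returned to the guest, so it should never reach the queue size unless completions are lost or an element is popped twice. (With a debug build and gdb attached, a data watchpoint such as "watch -l vq->inuse" set while stopped inside virtqueue_pop would catch writes to the real counter.)

/* Simplified model of the "inuse" sanity check in QEMU's virtqueue code.
 * Not the actual hw/virtio/virtio.c source; it only illustrates why inuse
 * reaching the queue size indicates corrupted or leaked bookkeeping. */
#include <stdio.h>

#define VRING_NUM 128

struct virtqueue_model {
    unsigned int inuse;   /* buffers currently "checked out" by QEMU */
};

/* Called when QEMU pops a buffer from the available ring. */
static int pop(struct virtqueue_model *vq)
{
    if (vq->inuse >= VRING_NUM) {
        /* Corresponds to virtio_error(vdev, "Virtqueue size exceeded") */
        fprintf(stderr, "Virtqueue size exceeded (inuse=%u)\n", vq->inuse);
        return -1;
    }
    vq->inuse++;
    return 0;
}

/* Called when QEMU pushes the completed buffer to the used ring. */
static void push(struct virtqueue_model *vq)
{
    vq->inuse--;
}

int main(void)
{
    struct virtqueue_model vq = { 0 };

    /* Balanced pop/push keeps inuse low; occasionally "losing" a completion
     * makes inuse creep up until the sanity check fires. */
    for (int i = 0; i < 1000; i++) {
        if (pop(&vq))
            break;
        if (i % 5 != 0)
            push(&vq);
    }
    printf("final inuse = %u\n", vq.inuse);
    return 0;
}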
Thanks again for your help with this!

Fer

On Fri, Jun 16, 2017 at 8:58, Ladi Prosek <lpro...@redhat.com> wrote:

Hi,

Would you be able to enhance the error message and rebuild QEMU?

--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;

     if (vq->inuse >= vq->vring.num) {
-        virtio_error(vdev, "Virtqueue size exceeded");
+        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
         goto done;
     }

This would at least confirm the theory that it's caused by virtio-blk-pci. If rebuilding is not feasible, I would start by removing other virtio devices -- particularly balloon, which has had quite a few virtio-related bugs fixed recently. Does your environment involve VM migrations or saving/resuming, or does the crashing QEMU process always run the VM from its boot?

Thanks!

