Hi Fernando,

On Tue, Jun 20, 2017 at 12:10 AM, Fernando Casas Schössow
<casasferna...@hotmail.com> wrote:
> Hi Ladi,
>
> Today two guests failed again, at different times of day.
> One of them was the one I switched from virtio_blk to virtio_scsi, so that
> change didn't solve the problem.
> Now in this guest I also disabled virtio_balloon, continuing with the
> elimination process.
>
> Also, this time I found a different error message on the guest console.
> In the guest already switched to virtio_scsi:
>
> virtio_scsi virtio2: request:id 44 is not a head!
>
> Followed by the usual "task blocked for more than 120 seconds." error.
>
> On the guest still running on virtio_blk the error was similar:
>
> virtio_blk virtio2: req.0:id 42 is not a head!
> blk_update_request: I/O error, dev vda, sector 645657736
> Buffer I/O error on dev dm-1, logical block 7413821, lost async page write
>
> Followed by the usual "task blocked for more than 120 seconds." error.
Honestly, this is starting to look more and more like memory corruption.
Two different virtio devices and two different guest operating systems:
that would have to be a bug in the common virtio code, and we would have
seen it somewhere else already. Would it be possible to run a thorough
memtest on the host, just in case?

> Do you think that the blk_update_request and buffer I/O errors may be a
> consequence of the previous "is not a head!" error, or should I be worried
> about a storage-level issue here?
>
> Now I will wait to see whether disabling virtio_balloon helps and report
> back.
>
> Thanks.
>
> Fer
>
> On vie, jun 16, 2017 at 12:25, Ladi Prosek <lpro...@redhat.com> wrote:
>
> > On Fri, Jun 16, 2017 at 12:11 PM, Fernando Casas Schössow
> > <casasferna...@hotmail.com> wrote:
> >
> > Hi Ladi,
> >
> > Thanks a lot for looking into this and replying. I will do my best to
> > rebuild and deploy Alpine's qemu packages with this patch included, but
> > I'm not sure that's feasible yet. In any case, would it be possible to
> > have this patch included in the next QEMU release?
>
> Yes, I have already added this to my todo list.
>
> > The current error message is helpful, but knowing which device was
> > involved will be much more helpful. Regarding the environment: I'm not
> > doing migrations, and a managed save is only done when the host needs
> > to be rebooted or shut down. The QEMU process has been running the VM
> > since the host started, and this failure occurs randomly without any
> > prior managed save. As part of troubleshooting, on one of the guests I
> > switched the guest disks from virtio_blk to virtio_scsi, but I will
> > need more time to see whether that helped. If I hit this problem again
> > I will follow your advice and remove virtio_balloon.
>
> Thanks, please keep us posted.
>
> > Another question: is there any way to monitor the virtqueue size,
> > either from the guest itself or from the host? Any file in sysfs or
> > proc?
> > This may help to understand in which conditions this is happening, and
> > to react faster to mitigate the problem.
>
> The problem is not in the virtqueue size but in one piece of internal
> state ("inuse") which is meant to track the number of buffers "checked
> out" by QEMU. It is compared to the virtqueue size merely as a sanity
> check. I'm afraid there's no way to expose this variable without
> rebuilding QEMU. The best you could do is attach gdb to the QEMU process
> and use some clever data access breakpoints to catch suspicious writes
> to the variable, although it's likely that it just creeps up slowly and
> you won't see anything interesting. It's probably beyond reasonable at
> this point anyway. I would continue with the elimination process
> (virtio_scsi instead of virtio_blk, no balloon, etc.), and then maybe
> once we know which device it is, we can add some instrumentation to the
> code.
>
> > Thanks again for your help with this!
> >
> > Fer
> >
> > On vie, jun 16, 2017 at 8:58, Ladi Prosek <lpro...@redhat.com> wrote:
> >
> > Hi,
> >
> > Would you be able to enhance the error message and rebuild QEMU?
> >
> > --- a/hw/virtio/virtio.c
> > +++ b/hw/virtio/virtio.c
> > @@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >      max = vq->vring.num;
> >
> >      if (vq->inuse >= vq->vring.num) {
> > -        virtio_error(vdev, "Virtqueue size exceeded");
> > +        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
> >          goto done;
> >      }
> >
> > This would at least confirm the theory that it's caused by
> > virtio-blk-pci. If rebuilding is not feasible I would start by removing
> > other virtio devices, particularly balloon, which has had quite a few
> > virtio-related bugs fixed recently.
> >
> > Does your environment involve VM migrations or saving/resuming, or does
> > the crashing QEMU process always run the VM from its boot?
> >
> > Thanks!
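For readers who want to follow the gdb route Ladi describes above, a
location watchpoint on the "inuse" field is the usual tool. This is only a
sketch: the breakpoint/watchpoint numbers and the exact way you reach the
VirtQueue pointer depend on your build and session, and debug symbols are
required.

```gdb
# Attach to the running QEMU process:
#   gdb -p $(pidof qemu-system-x86_64)

# Stop once inside virtqueue_pop to get hold of the VirtQueue pointer:
(gdb) break virtqueue_pop
(gdb) continue
(gdb) print vq->inuse

# Watch writes to that exact memory location; "-l" evaluates the
# expression to an address, so the watchpoint survives leaving the
# current frame:
(gdb) watch -l vq->inuse
(gdb) delete breakpoints 1
(gdb) continue
# gdb now stops on every write to inuse; "bt" shows who changed it.
```

As Ladi notes, if the counter merely creeps up one lost completion at a
time, the watchpoint will fire constantly on legitimate writes, so this is
mostly useful once you already suspect a particular code path.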