Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2019-01-31 Thread Fernando Casas Schössow
Hi,

Sorry for resurrecting this thread after so long, but I just upgraded the host
to Qemu 3.1 and libvirt 4.10 and I'm still facing this problem.
At the moment I cannot use virtio disks (neither virtio-blk nor virtio-scsi)
with my guests, so as a workaround to avoid this issue I'm using emulated SATA
storage, which is not ideal but is perfectly stable.
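
In qemu terms the workaround boils down to swapping the disk's front end from
virtio to an emulated AHCI/SATA one, roughly like this (a sketch with
illustrative drive and device ids, not my exact command line):

  # virtio front end (the configuration that triggers the problem here):
  -device virtio-blk-pci,drive=drive-disk0,id=virtio-disk0

  # emulated SATA front end instead (same backing drive):
  -device ahci,id=sata0
  -device ide-hd,drive=drive-disk0,bus=sata0.0,id=sata-disk0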

Do you have any suggestions on how I can make progress troubleshooting this?
Qemu is not crashing so I don't have any dumps that can be analyzed. The guest 
is just "stuck" and all I can do is destroy it and start it again.
It's really frustrating that after all this time I couldn't find the cause of
this issue, so any ideas are welcome.

Thanks.

Fernando




Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-24 Thread Fernando Casas Schössow
Hi Ladi,

After running for about 15 hours, two different guests (one Windows, one
Linux) crashed about an hour apart with the same error in the qemu log:
"Virtqueue size exceeded".

The Linux guest was already running on virtio_scsi and without virtio_balloon. 
:(
I compiled gdbserver and attached it to the qemu process for this guest, but 
when I did I got the following warning from gdbserver:

warning: Cannot call inferior functions, Linux kernel PaX protection forbids 
return to non-executable pages!

The default Alpine kernel is a grsec kernel. I'm not sure whether this will
interfere with debugging, but I suspect it will.
If you need me to replace the grsec kernel with a vanilla one (also available 
as an option in Alpine) let me know and I will do so.
Otherwise send me an email directly so I can share the host:port details with
you, so you can connect to gdbserver.

Thanks,

Fer







Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-22 Thread Fernando Casas Schössow
Hi Ladi,

Small update. Memtest86+ was running on the host for more than 54 hours. Eight
passes were completed and no memory errors were found. For now I think we can
assume that the host memory is OK.

I just started all the guests one hour ago. I will monitor them and once one 
fails I will attach the debugger and let you know.

Thanks.

Fer





Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-22 Thread Ladi Prosek
Hi Fernando,

On Wed, Jun 21, 2017 at 2:19 PM, Fernando Casas Schössow
 wrote:
> Hi Ladi,
>
> Sorry for the delay in my reply.
> I will leave the host kernel alone for now then.
> For the last 15 hours or so I've been running memtest86+ on the host. So far
> so good: two passes, no errors. I will try to leave it running for at least
> another 24 hours and report back the results. Hopefully we can rule out a
> memory issue at the hardware level.
>
> Regarding KSM, that will be the next thing I will disable if after removing
> the balloon device guests still crash.
>
> About leaving a guest in a failed state for you to debug it remotely, that's
> absolutely an option. We just need to coordinate so I can give you remote
> access to the host and so on. Let me know if any preparation is needed in
> advance and which tools you need installed on the host.

I think that gdbserver attached to the QEMU process should be enough.
When the VM gets into the broken state please do something like:

gdbserver --attach host:12345 <pid of the QEMU process>

and let me know the host name and port (12345 in the above example).
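
For reference, the full round trip would look something like this (host name,
port and pid are placeholders):

  # on the host, attach gdbserver to the stuck QEMU process
  gdbserver --attach host:12345 <pid>

  # on the machine doing the debugging, connect to it
  gdb /usr/bin/qemu-system-x86_64
  (gdb) target remote host:12345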

> Once again I would like to thank you for all your help and your great
> disposition!

You're absolutely welcome, I don't think I've done anything helpful so far :)




Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-21 Thread Fernando Casas Schössow
Hi Ladi,

Sorry for the delay in my reply.
I will leave the host kernel alone for now then.
For the last 15 hours or so I've been running memtest86+ on the host. So far
so good: two passes, no errors. I will try to leave it running for at least
another 24 hours and report back the results. Hopefully we can rule out a
memory issue at the hardware level.

Regarding KSM, that will be the next thing I disable if guests still crash
after removing the balloon device.

About leaving a guest in a failed state for you to debug it remotely, that's 
absolutely an option. We just need to coordinate so I can give you remote 
access to the host and so on. Let me know if any preparation is needed in 
advance and which tools you need installed on the host.

Once again I would like to thank you for all your help and your great
disposition!

Cheers,

Fer

On mar, jun 20, 2017 at 9:52 , Ladi Prosek  wrote:
The host kernel is less likely to be responsible for this, in my opinion. I'd 
hold off on that for now.
And last but not least KSM is enabled on the host. Should I disable it?
Could be worth the try.
Following your advice I will run memtest on the host and report back. Just as a 
side comment, the host is running on ECC memory.
I see. Would it be possible for you, once a guest is in the broken state, to 
make it available for debugging? By attaching gdb to the QEMU process for 
example and letting me poke around it remotely? Thanks!




Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-20 Thread Ladi Prosek
On Tue, Jun 20, 2017 at 8:30 AM, Fernando Casas Schössow
 wrote:
> Hi Ladi,
>
> In this case both guests are CentOS 7.3 running the same kernel
> 3.10.0-514.21.1.
> Also the guest that fails most frequently is running Docker with 4 or 5
> containers.
>
> Another thing I would like to mention is that the host is running Alpine's
> default grsec-patched kernel. I also have the option to install a vanilla
> kernel. Would it make sense to switch to the vanilla kernel on the host and
> see if that helps?

The host kernel is less likely to be responsible for this, in my
opinion. I'd hold off on that for now.

> And last but not least KSM is enabled on the host. Should I disable it?

Could be worth a try.
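
For reference, KSM can be switched off at runtime through sysfs (the standard
kernel interface; 1 = run, 0 = stop, 2 = stop and un-merge shared pages):

  echo 2 > /sys/kernel/mm/ksm/run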

> Following your advice I will run memtest on the host and report back. Just
> as a side comment, the host is running on ECC memory.

I see.

Would it be possible for you, once a guest is in the broken state, to
make it available for debugging? By attaching gdb to the QEMU process
for example and letting me poke around it remotely? Thanks!


Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-19 Thread Fernando Casas Schössow
Hi Ladi,

In this case both guests are CentOS 7.3 running the same kernel 3.10.0-514.21.1.
Also the guest that fails most frequently is running Docker with 4 or 5 
containers.

Another thing I would like to mention is that the host is running Alpine's
default grsec-patched kernel. I also have the option to install a vanilla
kernel. Would it make sense to switch to the vanilla kernel on the host and
see if that helps?
And last but not least KSM is enabled on the host. Should I disable it?

Following your advice I will run memtest on the host and report back. Just as a 
side comment, the host is running on ECC memory.

Thanks for all your help.

Fer.


Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-19 Thread Ladi Prosek
Hi Fernando,

On Tue, Jun 20, 2017 at 12:10 AM, Fernando Casas Schössow
 wrote:
> Hi Ladi,
>
> Today two guests failed again at different times of day.
> One of them was the one I switched from virtio_blk to virtio_scsi, so this
> change didn't solve the problem.
> Now in this guest I also disabled virtio_balloon, continuing with the
> elimination process.
>
> Also this time I found a different error message in the guest console.
> In the guest already switched to virtio_scsi:
>
> virtio_scsi virtio2: request:id 44 is not a head!
>
> Followed by the usual "task blocked for more than 120 seconds." error.
>
> On the guest still running on virtio_blk the error was similar:
>
> virtio_blk virtio2: req.0:id 42 is not a head!
> blk_update_request: I/O error, dev vda, sector 645657736
> Buffer I/O error on dev dm-1, logical block 7413821, lost async page write
>
> Followed by the usual "task blocked for more than 120 seconds." error.

Honestly this is starting to look more and more like memory corruption. Two
different virtio devices and two different guest operating systems; that
would have to be a bug in the common virtio code, and we would have seen it
somewhere else already.

Would it be possible to run a thorough memtest on the host just in case?




Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-19 Thread Fernando Casas Schössow
Hi Ladi,

Today two guests failed again at different times of day.
One of them was the one I switched from virtio_blk to virtio_scsi, so this
change didn't solve the problem.
Now in this guest I also disabled virtio_balloon, continuing with the 
elimination process.

Also this time I found a different error message in the guest console.
In the guest already switched to virtio_scsi:

virtio_scsi virtio2: request:id 44 is not a head!

Followed by the usual "task blocked for more than 120 seconds." error.

On the guest still running on virtio_blk the error was similar:

virtio_blk virtio2: req.0:id 42 is not a head!
blk_update_request: I/O error, dev vda, sector 645657736
Buffer I/O error on dev dm-1, logical block 7413821, lost async page write

Followed by the usual "task blocked for more than 120 seconds." error.
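
For context, that message comes from a consistency check in the guest's virtio
ring code; a simplified paraphrase (not the literal
drivers/virtio/virtio_ring.c source) looks like this:

  #include <stddef.h>

  /* Simplified paraphrase of the check behind "id %u is not a head!" in
   * the guest's drivers/virtio/virtio_ring.c; not the literal source. */
  struct vring_vq_sketch {
      unsigned int num;   /* ring size */
      void **data;        /* one cookie per outstanding request head */
  };

  static void *get_buf_sketch(struct vring_vq_sketch *vq, unsigned int id)
  {
      if (id >= vq->num)
          return NULL;        /* kernel logs "id %u out of range!" */
      if (vq->data[id] == NULL)
          return NULL;        /* kernel logs "id %u is not a head!" */
      void *buf = vq->data[id];
      vq->data[id] = NULL;    /* the descriptor chain is free again */
      return buf;
  }

In other words, the device side (QEMU) completed a descriptor id that the
guest driver never posted, which fits the corrupted-state theory.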

Do you think that the blk_update_request and buffer I/O errors may be a
consequence of the previous "is not a head!" error, or should I be worried
about a storage-level issue here?

Now I will wait to see if disabling virtio_balloon helps or not and report back.

Thanks.

Fer





Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-16 Thread Ladi Prosek
On Fri, Jun 16, 2017 at 12:11 PM, Fernando Casas Schössow
 wrote:
> Hi Ladi,
>
> Thanks a lot for looking into this and replying.
> I will do my best to rebuild and deploy Alpine's qemu packages with this
> patch included, but I'm not sure it's feasible yet.
> In any case, would it be possible to have this patch included in the next
> qemu release?

Yes, I have already added this to my todo list.

> The current error message is helpful but knowing which device was involved
> will be much more helpful.
>
> Regarding the environment, I'm not doing migrations, and a managed save is
> only done when the host needs to be rebooted or shut down. The QEMU process
> has been running the VM since the host started and this failure is occurring
> randomly without any previous managed save.
>
> As part of troubleshooting on one of the guests I switched from virtio_blk
> to virtio_scsi for the guest disks but I will need more time to see if that
> helped.
> If I have this problem again I will follow your advice and remove
> virtio_balloon.

Thanks, please keep us posted.

> Another question: is there any way to monitor the virtqueue size either from
> the guest itself or from the host? Any file in sysfs or proc?
> This may help to understand in which conditions this is happening and to
> react faster to mitigate the problem.

The problem is not in the virtqueue size but in one piece of internal
state ("inuse") which is meant to track the number of buffers "checked
out" by QEMU. It's being compared to virtqueue size merely as a sanity
check.
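
As a toy model of the invariant being checked (names mirror QEMU's
hw/virtio/virtio.c, but this is deliberately simplified and not the actual
source):

  #include <stdio.h>

  /* Toy model of QEMU's "inuse" accounting; simplified, not the real
   * hw/virtio/virtio.c source. */
  typedef struct VirtQueue {
      unsigned int num;    /* ring size negotiated with the guest */
      unsigned int inuse;  /* buffers popped by QEMU, not yet returned */
  } VirtQueue;

  /* QEMU takes a buffer from the ring (cf. virtqueue_pop). */
  static int pop(VirtQueue *vq)
  {
      if (vq->inuse >= vq->num) {
          /* More buffers in flight than the ring can hold: corrupted
           * state. This is where "Virtqueue size exceeded" fires. */
          return -1;
      }
      vq->inuse++;
      return 0;
  }

  /* QEMU returns completed buffers to the guest (cf. virtqueue_flush). */
  static void flush_used(VirtQueue *vq, unsigned int count)
  {
      vq->inuse -= count;
  }

  int main(void)
  {
      VirtQueue vq = { .num = 128, .inuse = 0 };
      /* Balanced pops and flushes keep inuse below num forever; a leaked
       * completion makes it creep up until the sanity check fires. */
      for (unsigned int i = 0; ; i++) {
          if (pop(&vq) < 0) {
              printf("Virtqueue size exceeded after %u requests\n", i);
              return 1;
          }
          if (i % 2 == 0) {
              flush_used(&vq, 1);  /* every other completion is "lost" */
          }
      }
  }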

I'm afraid that there's no way to expose this variable without
rebuilding QEMU. The best you could do is attach gdb to the QEMU
process and use some clever data access breakpoints to catch
suspicious writes to the variable. Although it's likely that it just
creeps up slowly and you won't see anything interesting. It's probably
beyond reasonable at this point anyway.
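
Purely as an illustration, such a session could look like this (assuming a
QEMU build with debug symbols; the pid is a placeholder):

  gdb -p <pid>
  (gdb) break virtqueue_pop
  (gdb) continue
  # once the breakpoint hits, watch the counter of that queue:
  (gdb) watch -l vq->inuse
  (gdb) continue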

I would continue with the elimination process (virtio_scsi instead of
virtio_blk, no balloon, etc.) and then maybe once we know which device
it is, we can add some instrumentation to the code.




Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-16 Thread Fernando Casas Schössow
Hi Ladi,

Thanks a lot for looking into this and replying.
I will do my best to rebuild and deploy Alpine's qemu packages with this patch
included, but I'm not sure it's feasible yet.
In any case, would it be possible to have this patch included in the next qemu 
release?
The current error message is helpful but knowing which device was involved will 
be much more helpful.
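
For the record, rebuilding the Alpine package with the patch applied would go
roughly like this (a sketch assuming a checked-out aports tree and a working
abuild setup):

  cd aports/main/qemu
  # drop the patch in this directory and add it to the source= list
  # in APKBUILD so the default prepare step applies it
  abuild checksum
  abuild -r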

Regarding the environment, I'm not doing migrations, and a managed save is
only done when the host needs to be rebooted or shut down. The QEMU process
has been running the VM since the host started and this failure is occurring
randomly without any previous managed save.

As part of troubleshooting, on one of the guests I switched from virtio_blk to
virtio_scsi for the guest disks, but I will need more time to see if that
helped.
If I have this problem again I will follow your advice and remove
virtio_balloon.

Another question: is there any way to monitor the virtqueue size either from 
the guest itself or from the host? Any file in sysfs or proc?
This may help to understand in which conditions this is happening and to react 
faster to mitigate the problem.

Thanks again for your help with this!

Fer

On vie, jun 16, 2017 at 8:58, Ladi Prosek wrote:
Hi,
Would you be able to enhance the error message and rebuild QEMU?

--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;
 
     if (vq->inuse >= vq->vring.num) {
-        virtio_error(vdev, "Virtqueue size exceeded");
+        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
         goto done;
     }

This would at least confirm the theory that it's caused by virtio-blk-pci. If
rebuilding is not feasible I would start by removing other virtio devices --
particularly balloon, which has had quite a few virtio related bugs fixed
recently. Does your environment involve VM migrations or saving/resuming, or
does the crashing QEMU process always run the VM from its boot? Thanks!




Re: [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-15 Thread Ladi Prosek
Hi,

On Wed, Jun 14, 2017 at 11:56 PM, Fernando Casas Schössow
 wrote:
> Almost on a daily basis at least one of the guests shows the following
> error in the log, and then it needs to be terminated and restarted to
> recover:
>
> qemu-system-x86_64: Virtqueue size exceeded

Would you be able to enhance the error message and rebuild QEMU?

--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -856,7 +856,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;
 
     if (vq->inuse >= vq->vring.num) {
-        virtio_error(vdev, "Virtqueue size exceeded");
+        virtio_error(vdev, "Virtqueue %u device %s size exceeded", vq->queue_index, vdev->name);
         goto done;
     }

This would at least confirm the theory that it's caused by virtio-blk-pci.

If rebuilding is not feasible I would start by removing other virtio devices
-- particularly balloon, which has had quite a few virtio related bugs fixed
recently.

Does your environment involve VM migrations or saving/resuming, or does the
crashing QEMU process always run the VM from its boot?

Thanks!

[Qemu-devel] Guest unresponsive after Virtqueue size exceeded error

2017-06-15 Thread Fernando Casas Schössow
Hi there,

I recently migrated a Hyper-V host to qemu/kvm running on Alpine Linux 3.6.1
(kernel 4.9.30 with grsec patches, and qemu 2.8.1).

Almost on a daily basis at least one of the guests shows the following error
in the log, and then it needs to be terminated and restarted to recover:

qemu-system-x86_64: Virtqueue size exceeded

It's not always the same guest, and the error appears for both Linux (CentOS
7.3) and Windows (2012R2) guests.
As soon as this error appears the guest is not really working anymore. It may
respond to ping, or you can even try to log in, but then everything is very
slow or completely unresponsive. Restarting the guest from within the guest OS
doesn't work either, and the only thing I can do is terminate it (virsh
destroy) and start it again until the next failure.

In Windows guests the error seems to be related to disk:
"Reset to device, \Device\RaidPort2, was issued" and the source is viostor

And in Linux guests the error is always (with the process and pid changing):

INFO: task <process>:<pid> blocked for more than 120 seconds

But unfortunately I was not able to find any other indication of a problem in
the guest logs or in the host logs, except for the error regarding the
virtqueue size. The problem is happening at different times of day and I
haven't found any pattern yet.

All the Windows guests are using virtio drivers version 126, and all Linux
guests are CentOS 7.3 using the latest kernel available in the distribution
(3.10.0-514.21.1). They all run the qemu guest agent as well.
All the guest disks are qcow2 images with cache=none and aio=threads (I tried
aio=native before, with the same results).

Example qemu command for a Linux guest:

/usr/bin/qemu-system-x86_64 -name guest=DOCKER01,debug-threads=on -S -object 
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-24-DOCKER01/master-key.aes
 -machine pc-i440fx-2.8,accel=kvm,usb=off,dump-guest-core=off -cpu 
IvyBridge,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+pcid,+osxsave,+arat,+xsaveopt
 -drive 
file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on
 -drive 
file=/var/lib/libvirt/qemu/nvram/DOCKER01_VARS.fd,if=pflash,format=raw,unit=1 
-m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 
4705b146-3b14-4c20-923c-42105d47e7fc -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-24-DOCKER01/monitor.sock,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew 
-global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global 
PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device 
ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 -device 
ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 
-device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 
-device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive 
file=/storage/storage-ssd-vms/virtual_machines_ssd/docker01.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=threads
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=45 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:af:ce,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-chardev 
socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-24-DOCKER01/org.qemu.guest_agent.0,server,nowait
 -device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
 -chardev spicevmc,id=charchannel1,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0
 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice 
port=5905,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device 
qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2
 -chardev spicevmc,id=charredir0,name=usbredir -device 
usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2 -chardev 
spicevmc,id=charredir1,name=usbredir -device 
usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -object 
rng-random,id=objrng0,filename=/dev/random -device 
virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -msg timestamp=on

For what it's worth, the same guests were working fine for years on Hyper-V on
the same hardware (Intel Xeon E3, 32GB RAM, Supermicro mainboard, 6x3TB Western
Digital Red disks and 6x120GB Kingston V300 SSDs, all connected to an LSI
LSISAS2008 controller).
Except for this stability issue, which I hope to solve, everything else is
working great and outperforming Hyper-V.

Any ideas, thoughts or suggestions are welcome. Thanks!