Re: [PATCH 0/5] Introduce 'yank' oob qmp command to recover from hanging qemu

Dr. David Alan Gilbert Mon, 11 May 2020 05:08:14 -0700

* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> > Hello Everyone,
> > In many cases, if qemu has a network connection (qmp, migration, chardev,
> > etc.) to some other server and that server dies or hangs, qemu hangs too.
> 
> If qemu as a whole hangs due to a stalled network connection, that is a
> bug in QEMU that we should be fixing IMHO. QEMU should be doing non-blocking
> I/O in general, such that if the network connection or remote server stalls,
> we simply stop sending I/O - we shouldn't ever hang the QEMU process or main
> loop.
> 
> There are places in QEMU code which are not well behaved in this respect,
> but many are, and others are getting fixed where found to be important.
> 
> Arguably any place in QEMU code which can result in a hang of QEMU in the
> event of a stalled network should be considered a security flaw, because
> the network is untrusted in general.


That's not really true of the 'management network' - people trust that,
and I don't see a lot of the qemu code getting fixed safely in all of
those places.

> > These patches introduce the new 'yank' out-of-band qmp command to recover
> > from these kinds of hangs. The different subsystems register callbacks which
> > get executed with the yank command. For example the callback can shutdown() a
> > socket. This is intended for the colo use-case, but it can be used for other
> > things too of course.
> 
> IIUC, invoking the "yank" command unconditionally kills every single
> network connection in QEMU that has registered with the "yank" subsystem.
> IMHO this is way too big of a hammer, even if we accept there are bugs in
> QEMU not handling stalled networking well.

But isn't this hammer conditional? I see that it's a migration
capability for the migration socket, and a flag in nbd - so it only
yanks things you've told it to.
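
To illustrate the opt-in behaviour, here is a rough sketch in Python
(not QEMU's C, and not the code from the patches). The names
YankRegistry, register and shutdown_cb are hypothetical; the only
assumption taken from the series is that a connection is yanked only if
it was registered, and that the callback can shutdown() the socket.

```python
import socket


class YankRegistry:
    """Hypothetical registry: only connections that opted in get yanked."""

    def __init__(self):
        self._callbacks = []

    def register(self, name, callback):
        # A subsystem calls this only when its capability/flag is enabled.
        self._callbacks.append((name, callback))

    def yank(self):
        # Run every registered callback and report which ones ran.
        yanked = []
        for name, callback in self._callbacks:
            callback()
            yanked.append(name)
        return yanked


def shutdown_cb(sock):
    # Callback that shuts the socket down in both directions.
    return lambda: sock.shutdown(socket.SHUT_RDWR)


registry = YankRegistry()
migration_a, migration_b = socket.socketpair()
chardev_a, chardev_b = socket.socketpair()

# Only the "migration" connection opted in; the "chardev" one did not.
registry.register("migration", shutdown_cb(migration_a))

yanked = registry.yank()

# The peer of the yanked connection sees end-of-file ...
migration_eof = migration_b.recv(1) == b""

# ... while the connection that never registered keeps working.
chardev_a.sendall(b"x")
chardev_ok = chardev_b.recv(1) == b"x"

for s in (migration_a, migration_b, chardev_a, chardev_b):
    s.close()
```

In this sketch the unregistered connection is untouched by yank(), which
is the sense in which the hammer is conditional.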

> eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> connection used for the guest disk, we needlessly break I/O.
> 
> eg doing this in the chardev backend is not desirable, because the bugs
> with hanging QEMU are typically caused by the way the frontend device
> uses the chardev blocking I/O calls, instead of non-blocking I/O calls.
> 

Having a way to get out of any of these problems from a single point is
quite nice.  To be useful in COLO you need to know for sure you can get
out of any network screwup.

We already use shutdown(2) in migrate_cancel and migrate-pause for
basically the same reason; I don't think we've got anything similar for
NBD, and we probably should have (I think I asked for it fairly
recently).
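
As a minimal sketch of why shutdown(2) helps (Python rather than QEMU's
C, and not the migrate_cancel code itself): a thread blocked in recv()
on a peer that never answers is released once another thread shuts the
socket down. The socketpair and the blocked_reader helper are stand-ins
invented for the example.

```python
import socket
import threading
import time

# 'b' plays the hung remote server: it never sends anything.
a, b = socket.socketpair()
result = {}


def blocked_reader():
    # Blocks until data arrives or the socket is shut down.
    result["data"] = a.recv(4096)


reader = threading.Thread(target=blocked_reader)
reader.start()

# Give the reader a moment to actually block in recv().
time.sleep(0.2)

# The recovery action: shut the socket down from another thread.
a.shutdown(socket.SHUT_RDWR)

reader.join(timeout=5)
reader_finished = not reader.is_alive()

a.close()
b.close()
```

After the shutdown() the reader returns with an empty read instead of
waiting on the dead peer forever.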

Dave



> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
