* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Mon, May 11, 2020 at 01:14:34PM +0200, Lukas Straub wrote:
> > Hello Everyone,
> > In many cases, if qemu has a network connection (qmp, migration, chardev, etc.)
> > to some other server and that server dies or hangs, qemu hangs too.
>
> If qemu as a whole hangs due to a stalled network connection, that is a
> bug in QEMU that we should be fixing IMHO. QEMU should be doing non-blocking
> I/O in general, such that if the network connection or remote server stalls,
> we simply stop sending I/O - we shouldn't ever hang the QEMU process or main
> loop.
>
> There are places in QEMU code which are not well behaved in this respect,
> but many are, and others are getting fixed where found to be important.
>
> Arguably any place in QEMU code which can result in a hang of QEMU in the
> event of a stalled network should be considered a security flaw, because
> the network is untrusted in general.

That's not really true of the 'management network' - people trust that,
and I don't see a lot of the qemu code getting fixed safely for all of them.

> > These patches introduce the new 'yank' out-of-band qmp command to recover from
> > these kinds of hangs. The different subsystems register callbacks which get
> > executed with the yank command. For example the callback can shutdown() a
> > socket. This is intended for the colo use-case, but it can be used for other
> > things too of course.
>
> IIUC, invoking the "yank" command unconditionally kills every single
> network connection in QEMU that has registered with the "yank" subsystem.
> IMHO this is way too big of a hammer, even if we accept there are bugs in
> QEMU not handling stalled networking well.

But isn't this hammer conditional? I see that it's a migration capability
for the migration socket, and a flag in nbd - so it only yanks things
you've told it to.

> eg if a chardev hangs QEMU, and we tear down everything, killing the NBD
> connection used for the guest disk, we needlessly break I/O.
>
> eg doing this in the chardev backend is not desirable, because the bugs
> with hanging QEMU are typically caused by the way the frontend device
> uses the chardev blocking I/O calls, instead of non-blocking I/O calls.
>

Having a way to get out of any of these problems from a single point is
quite nice. To be useful in COLO you need to know for sure that you can
get out of any network screwup.

We already use shutdown(2) in migrate_cancel and migrate-pause for
basically the same reason; I don't think we've got anything similar for
NBD, and we probably should have (I think I asked for it fairly
recently).

Dave

> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
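
To make the "conditional hammer" point concrete, here is a rough sketch of the idea in Python rather than QEMU's C. It is not the code from the patch series: the YankRegistry class, its method names and the "migration"/"nbd" instance names are all made up for illustration. The only real mechanism used is shutdown() on a socket, which is what wakes a thread stuck in a blocking read.

```python
import socket
import threading


class YankRegistry:
    """Hypothetical registry: instance name -> callbacks that unblock it."""

    def __init__(self):
        self._callbacks = {}

    def register(self, instance, callback):
        # A subsystem opts in by registering a callback for its connection.
        self._callbacks.setdefault(instance, []).append(callback)

    def yank(self, instances):
        # Only the named instances are torn down; everything else is left alone.
        for name in instances:
            for callback in self._callbacks.get(name, []):
                callback()


def demo():
    registry = YankRegistry()

    # Two independent connections. The peers never send anything on the
    # "migration" pair, which stands in for a hung remote server.
    mig_local, mig_peer = socket.socketpair()
    nbd_local, nbd_peer = socket.socketpair()

    registry.register("migration", lambda: mig_local.shutdown(socket.SHUT_RDWR))
    registry.register("nbd", lambda: nbd_local.shutdown(socket.SHUT_RDWR))

    result = {}

    def blocked_reader():
        # Blocking read with no timeout: this is the kind of call that hangs.
        result["data"] = mig_local.recv(4096)

    reader = threading.Thread(target=blocked_reader, daemon=True)
    reader.start()
    reader.join(timeout=0.2)
    still_blocked = reader.is_alive()

    # Yank only the migration connection.
    registry.yank(["migration"])
    reader.join(timeout=2.0)
    unblocked = not reader.is_alive()

    # The NBD-like connection was not yanked, so it should still carry data.
    nbd_peer.sendall(b"ping")
    nbd_ok = nbd_local.recv(4) == b"ping"

    for sock in (mig_local, mig_peer, nbd_local, nbd_peer):
        sock.close()

    return still_blocked, unblocked, result.get("data"), nbd_ok


if __name__ == "__main__":
    print(demo())
```

On Linux I'd expect the reader thread to return from recv() with an empty read once its socket is shut down, while the second connection keeps working because nobody asked for it to be yanked. Behaviour of shutdown() against a blocked reader on other platforms is something I have not checked.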