On Tue, Apr 24, 2018 at 07:24:05PM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 24, 2018 at 06:16:31PM +0100, Dr. David Alan Gilbert wrote:
> > * Wang Xin (wangxinxin.w...@huawei.com) wrote:
> > > If the fd socket peer closed shortly, ppoll may receive a POLLHUP
> > > event before the expected POLLIN event, and qemu will do nothing
> > > but goes into an infinite loop of the POLLHUP event.
> > >
> > > So, abort the migration if we receive a POLLHUP event.
> >
> > Hi Wang Xin,
> >   Can you explain how you manage to trigger this case; I've not hit it.
> >
> > > Signed-off-by: Wang Xin <wangxinxin.w...@huawei.com>
> > >
> > > diff --git a/migration/fd.c b/migration/fd.c
> > > index cd06182..5932c87 100644
> > > --- a/migration/fd.c
> > > +++ b/migration/fd.c
> > > @@ -15,6 +15,7 @@
> > >   */
> > >
> > >  #include "qemu/osdep.h"
> > > +#include "qemu/error-report.h"
> > >  #include "channel.h"
> > >  #include "fd.h"
> > >  #include "monitor/monitor.h"
> > > @@ -46,6 +47,11 @@ static gboolean fd_accept_incoming_migration(QIOChannel *ioc,
> > >                                               GIOCondition condition,
> > >                                               gpointer opaque)
> > >  {
> > > +    if (condition & G_IO_HUP) {
> > > +        error_report("The migration peer closed, job abort");
> > > +        exit(EXIT_FAILURE);
> > > +    }
> > > +
> >
> > OK, I wish we had a nicer way for failing; especially for the
> > multifd/postcopy recovery worlds where one failed connection might not
> > be fatal; but I don't see how to do that here.
>
> This doesn't feel right to me.
>
> We have passed in a pre-opened FD to QEMU, and we registered a watch
> on it to detect when there is data from the src QEMU that is available
> to read.  Normally the src will have sent something so we'll get G_IO_IN,
> but you're suggesting the client has quit immediately, so we're getting
> G_IO_HUP due to end of file.
>
> The migration_channel_process_incoming() method that we pass the ioc
> object to will be calling qio_channel_read(ioc) somewhere to try to
> read that data.
>
> For QEMU to spin in infinite loop there must be code in the
> migration_channel_process_incoming() that is ignoring the return
> value of qio_channel_read() in some manner causing it to retry
> the read again & again I presume.
>
> Putting this check for G_IO_HUP fixes your immediate problem scenario,
> but whatever code was spinning in infinite loop is still broken and
> I'd guess it was possible to still trigger the loop. eg by writing
> 1 single byte and then closing the socket.
>
> So, IMHO this fix is wrong - we need to find the root cause and fix
> that, not try to avoid calling the buggy code.

I agree.  AFAIU the first read should be in qemu_loadvm_state():

    v = qemu_get_be32(f);
    if (v != QEMU_VM_FILE_MAGIC) {
        error_report("Not a migration stream");
        return -EINVAL;
    }

So I would be curious to know more about how that infinite loop happened.

-- 
Peter Xu
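
For illustration only, here is a minimal sketch of the kind of read path
discussed above, written against plain POSIX read() rather than the QEMU
channel API.  It is not QEMU code: read_all(), read_be32_checked() and
EXAMPLE_MAGIC are hypothetical names and values invented for this example.
The point is that a return value of 0 (end of file, which is what a hung-up
peer produces) is treated as a hard failure instead of being retried, so the
"write 1 single byte and then close the socket" case fails cleanly rather
than spinning.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical magic value, used only by this example. */
#define EXAMPLE_MAGIC 0x12345678U

/*
 * Read exactly len bytes from fd.
 * Returns 0 on success, -1 on a read error or on EOF before len bytes.
 */
static int read_all(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);

        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted by a signal: safe to retry */
            }
            return -1;          /* genuine I/O error */
        }
        if (n == 0) {
            return -1;          /* EOF: peer closed, retrying would spin */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Read one big-endian 32-bit value from fd into *out.
 * Returns 0 on success, -1 if the stream ended early or failed.
 */
static int read_be32_checked(int fd, uint32_t *out)
{
    unsigned char b[4];

    assert(out != NULL);
    if (read_all(fd, b, sizeof(b)) < 0) {
        return -1;
    }
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
    return 0;
}
```

A caller would check the result of read_be32_checked() first and only then
compare the value against EXAMPLE_MAGIC, reporting "Not a migration stream"
on either failure.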