Peter Xu <[email protected]> writes: > Marc-André reported an issue on QEMU crash when retrying a cancelled > migration during early setup phase, see "Link:" for more information, and > also easy way to reproduce. > > This patch is a replacement of the prior fix proposed by not only switching > to migration_cleanup(), but also fixing it from CPR side, so that we track > hup_source properly to know if src QEMU is waiting or the HUP signal. > > To put it simple: this chunk of special casing in migration_cancel() should > not affect normal migration, but only cpr-transfer migration to cover the > small window when the src QEMU is waiting for a HUP signal on cpr > channel (so that src QEMU can continue the migration on the main channel). > > To achieve that, we'll also need to remember to detach the hup_source > whenenver invoked: after that point, we should always be able to cleanup > the migration. > > It's not a generic operation to explicitly detach a gsource from its > context while in its dispatch() function. But it should be safe, because > gsource disptch() will only happen with a boosted refcount for the > dispatcher so that the gsource will not be freed until the callback > completes. It's also safe to return G_SOURCE_REMOVE after the gsource is > detached, as glib will simply ignore the G_SOURCE_REMOVE. > > One can refer to latest 2.86.5 glib code in g_main_dispatch() for that: > > https://github.com/GNOME/glib/blob/2.86.5/glib/gmain.c#L3592 > > When at this, add a bunch of assertions to make sure nothing surprises us. > > After this patch applied, the 2nd migration will not crash QEMU, instead > it'll be in CANCELLING until the socket connection times out (it will take > ~2min on my Fedora default kernel). During this process no 2nd migration > will be allowed, and after it timed out migration can be restarted. > > It's because so far we don't have control over socket_connect_outgoing(), > or anything yet managed by a task executed in qio_task_run_in_thread(). > Speeding up the cancellation to be left for future. > > I also tested cpr-transfer by only providing cpr channel not the main > channel (with -incoming defer), kickoff migration on source, then cancel it > on source directly without providing the main channel. It keeps working. > > I wanted to add an unit test for that but it'll need to refactor current > cpr-transfer tests first; let's leave it for later. > > Link: > https://lore.kernel.org/r/[email protected] > Reported-by: Marc-André Lureau <[email protected]> > Signed-off-by: Peter Xu <[email protected]> > --- > include/migration/cpr.h | 1 + > migration/migration.h | 5 +++++ > migration/cpr-transfer.c | 10 ++++++++++ > migration/migration.c | 31 +++++++++++++++++++++++-------- > 4 files changed, 39 insertions(+), 8 deletions(-) > > diff --git a/include/migration/cpr.h b/include/migration/cpr.h > index 5850fd1788..ebf09a2f0a 100644 > --- a/include/migration/cpr.h > +++ b/include/migration/cpr.h > @@ -57,6 +57,7 @@ QEMUFile *cpr_transfer_input(MigrationChannel *channel, > Error **errp); > void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func, > void *opaque); > void cpr_transfer_source_destroy(MigrationState *s); > +bool cpr_transfer_source_active(MigrationState *s); > > void cpr_exec_init(void); > QEMUFile *cpr_exec_output(Error **errp); > diff --git a/migration/migration.h b/migration/migration.h > index b6888daced..2bc2787480 100644 > --- a/migration/migration.h > +++ b/migration/migration.h > @@ -514,6 +514,11 @@ struct MigrationState { > bool postcopy_package_loaded; > QemuEvent postcopy_package_loaded_event; > > + /* > + * When set, it means cpr-transfer is waiting for the HUP signal from > + * destination to continue the 2nd step of migration via the main > + * channel. > + */ > GSource *hup_source; > > /* > diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c > index 61d5c9dce2..9defe7bad7 100644 > --- a/migration/cpr-transfer.c > +++ b/migration/cpr-transfer.c > @@ -6,6 +6,7 @@ > */ > > #include "qemu/osdep.h" > +#include "qemu/main-loop.h" > #include "qapi/clone-visitor.h" > #include "qapi/error.h" > #include "qapi/qapi-visit-migration.h" > @@ -79,6 +80,7 @@ QEMUFile *cpr_transfer_input(MigrationChannel *channel, > Error **errp) > void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func, > void *opaque) > { > + assert(bql_locked()); > s->hup_source = qio_channel_create_watch(cpr_state_ioc(), G_IO_HUP); > g_source_set_callback(s->hup_source, > (GSourceFunc)func, > @@ -89,9 +91,17 @@ void cpr_transfer_add_hup_watch(MigrationState *s, > QIOChannelFunc func, > > void cpr_transfer_source_destroy(MigrationState *s) > { > + assert(bql_locked()); > if (s->hup_source) { > g_source_destroy(s->hup_source); > g_source_unref(s->hup_source); > s->hup_source = NULL; > } > } > + > +bool cpr_transfer_source_active(MigrationState *s) > +{ > + /* Whenever the HUP gsource is available, it's active. */ > + assert(bql_locked()); > + return s->hup_source; > +} > diff --git a/migration/migration.c b/migration/migration.c > index 5c9aaa6e58..58c1e56766 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -1469,14 +1469,19 @@ void migration_cancel(void) > } > > /* > - * If migration_connect_outgoing has not been called, then there > - * is no path that will complete the cancellation. Do it now. > - */ > - if (setup && !s->to_dst_file) { > - migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING, > - MIGRATION_STATUS_CANCELLED); > - cpr_state_close(); > - cpr_transfer_source_destroy(s); > + * This is cpr-transfer specific processing. > + * > + * If this is true, it means cpr-transfer migration is waiting for the > + * destination to send HUP event on CPR channel to continue the next > + * phase. If so, do the cleanup proactively to avoid get stuck in > + * CANCELLING state. > + */ > + if (cpr_transfer_source_active(s)) { > + assert(migrate_mode() == MIG_MODE_CPR_TRANSFER); > + assert(setup && !s->to_dst_file); > + migration_cleanup(s); > + /* Now all things should have been released */ > + assert(!cpr_transfer_source_active(s)); > } > } > > @@ -2009,12 +2014,22 @@ static gboolean > migration_connect_outgoing_cb(QIOChannel *channel, > MigrationState *s = migrate_get_current(); > Error *local_err = NULL; > > + /* > + * Detach and release the GSource right after use. We rely on this to > + * detect this small cpr-transfer window of "waiting for HUP event". > + */ > + cpr_transfer_source_destroy(s); > + > migration_connect_outgoing(s, opaque, &local_err); > > if (local_err) { > migration_connect_error_propagate(s, local_err); > } > > + /* > + * This is redundant as we do cpr_transfer_source_destroy() at the > + * entry, but it's benign; glib will just skip the detach. > + */ > return G_SOURCE_REMOVE; > }
Sorry for taking too long on this, it fell off my queue. Looks good to me. I reproduced the original issue and tested this patch under load with hacked tests: /mode/transfer/defer + early cancel /cancel/src/after/setup + 2nd migrate Tested-by: Fabiano Rosas <[email protected]> Reviewed-by: Fabiano Rosas <[email protected]> BTW, I found it weird that we cannot start a second migration on the incoming after the first one having never properly started. Trying to actually do it without shutting down both QEMUs was fun. A bunch of asserts on the incoming side due to yank. After hacking a little and using migrate_recover to start the second migration on the incoming, I managed to make it work, but it hanged mid-transfer for some reason. Probably some iochannel reset needed.
