On Wed, May 06, 2026 at 05:51:20PM -0300, Fabiano Rosas wrote: > Peter Xu <[email protected]> writes: > > > Marc-André reported an issue on QEMU crash when retrying a cancelled > > migration during early setup phase, see "Link:" for more information, and > > also easy way to reproduce. > > > > This patch is a replacement of the prior fix proposed by not only switching > > to migration_cleanup(), but also fixing it from CPR side, so that we track > > hup_source properly to know if src QEMU is waiting or the HUP signal. > > > > To put it simple: this chunk of special casing in migration_cancel() should > > not affect normal migration, but only cpr-transfer migration to cover the > > small window when the src QEMU is waiting for a HUP signal on cpr > > channel (so that src QEMU can continue the migration on the main channel). > > > > To achieve that, we'll also need to remember to detach the hup_source > > whenenver invoked: after that point, we should always be able to cleanup > > the migration. > > > > It's not a generic operation to explicitly detach a gsource from its > > context while in its dispatch() function. But it should be safe, because > > gsource disptch() will only happen with a boosted refcount for the > > dispatcher so that the gsource will not be freed until the callback > > completes. It's also safe to return G_SOURCE_REMOVE after the gsource is > > detached, as glib will simply ignore the G_SOURCE_REMOVE. > > > > One can refer to latest 2.86.5 glib code in g_main_dispatch() for that: > > > > https://github.com/GNOME/glib/blob/2.86.5/glib/gmain.c#L3592 > > > > When at this, add a bunch of assertions to make sure nothing surprises us. > > > > After this patch applied, the 2nd migration will not crash QEMU, instead > > it'll be in CANCELLING until the socket connection times out (it will take > > ~2min on my Fedora default kernel). During this process no 2nd migration > > will be allowed, and after it timed out migration can be restarted. > > > > It's because so far we don't have control over socket_connect_outgoing(), > > or anything yet managed by a task executed in qio_task_run_in_thread(). > > Speeding up the cancellation to be left for future. > > > > I also tested cpr-transfer by only providing cpr channel not the main > > channel (with -incoming defer), kickoff migration on source, then cancel it > > on source directly without providing the main channel. It keeps working. > > > > I wanted to add an unit test for that but it'll need to refactor current > > cpr-transfer tests first; let's leave it for later. > > > > Link: > > https://lore.kernel.org/r/[email protected] > > Reported-by: Marc-André Lureau <[email protected]> > > Signed-off-by: Peter Xu <[email protected]> > > --- > > include/migration/cpr.h | 1 + > > migration/migration.h | 5 +++++ > > migration/cpr-transfer.c | 10 ++++++++++ > > migration/migration.c | 31 +++++++++++++++++++++++-------- > > 4 files changed, 39 insertions(+), 8 deletions(-) > > > > diff --git a/include/migration/cpr.h b/include/migration/cpr.h > > index 5850fd1788..ebf09a2f0a 100644 > > --- a/include/migration/cpr.h > > +++ b/include/migration/cpr.h > > @@ -57,6 +57,7 @@ QEMUFile *cpr_transfer_input(MigrationChannel *channel, > > Error **errp); > > void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func, > > void *opaque); > > void cpr_transfer_source_destroy(MigrationState *s); > > +bool cpr_transfer_source_active(MigrationState *s); > > > > void cpr_exec_init(void); > > QEMUFile *cpr_exec_output(Error **errp); > > diff --git a/migration/migration.h b/migration/migration.h > > index b6888daced..2bc2787480 100644 > > --- a/migration/migration.h > > +++ b/migration/migration.h > > @@ -514,6 +514,11 @@ struct MigrationState { > > bool postcopy_package_loaded; > > QemuEvent postcopy_package_loaded_event; > > > > + /* > > + * When set, it means cpr-transfer is waiting for the HUP signal from > > + * destination to continue the 2nd step of migration via the main > > + * channel. > > + */ > > GSource *hup_source; > > > > /* > > diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c > > index 61d5c9dce2..9defe7bad7 100644 > > --- a/migration/cpr-transfer.c > > +++ b/migration/cpr-transfer.c > > @@ -6,6 +6,7 @@ > > */ > > > > #include "qemu/osdep.h" > > +#include "qemu/main-loop.h" > > #include "qapi/clone-visitor.h" > > #include "qapi/error.h" > > #include "qapi/qapi-visit-migration.h" > > @@ -79,6 +80,7 @@ QEMUFile *cpr_transfer_input(MigrationChannel *channel, > > Error **errp) > > void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func, > > void *opaque) > > { > > + assert(bql_locked()); > > s->hup_source = qio_channel_create_watch(cpr_state_ioc(), G_IO_HUP); > > g_source_set_callback(s->hup_source, > > (GSourceFunc)func, > > @@ -89,9 +91,17 @@ void cpr_transfer_add_hup_watch(MigrationState *s, > > QIOChannelFunc func, > > > > void cpr_transfer_source_destroy(MigrationState *s) > > { > > + assert(bql_locked()); > > if (s->hup_source) { > > g_source_destroy(s->hup_source); > > g_source_unref(s->hup_source); > > s->hup_source = NULL; > > } > > } > > + > > +bool cpr_transfer_source_active(MigrationState *s) > > +{ > > + /* Whenever the HUP gsource is available, it's active. */ > > + assert(bql_locked()); > > + return s->hup_source; > > +} > > diff --git a/migration/migration.c b/migration/migration.c > > index 5c9aaa6e58..58c1e56766 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -1469,14 +1469,19 @@ void migration_cancel(void) > > } > > > > /* > > - * If migration_connect_outgoing has not been called, then there > > - * is no path that will complete the cancellation. Do it now. > > - */ > > - if (setup && !s->to_dst_file) { > > - migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING, > > - MIGRATION_STATUS_CANCELLED); > > - cpr_state_close(); > > - cpr_transfer_source_destroy(s); > > + * This is cpr-transfer specific processing. > > + * > > + * If this is true, it means cpr-transfer migration is waiting for the > > + * destination to send HUP event on CPR channel to continue the next > > + * phase. If so, do the cleanup proactively to avoid get stuck in > > + * CANCELLING state. > > + */ > > + if (cpr_transfer_source_active(s)) { > > + assert(migrate_mode() == MIG_MODE_CPR_TRANSFER); > > + assert(setup && !s->to_dst_file); > > + migration_cleanup(s); > > + /* Now all things should have been released */ > > + assert(!cpr_transfer_source_active(s)); > > } > > } > > > > @@ -2009,12 +2014,22 @@ static gboolean > > migration_connect_outgoing_cb(QIOChannel *channel, > > MigrationState *s = migrate_get_current(); > > Error *local_err = NULL; > > > > + /* > > + * Detach and release the GSource right after use. We rely on this to > > + * detect this small cpr-transfer window of "waiting for HUP event". > > + */ > > + cpr_transfer_source_destroy(s); > > + > > migration_connect_outgoing(s, opaque, &local_err); > > > > if (local_err) { > > migration_connect_error_propagate(s, local_err); > > } > > > > + /* > > + * This is redundant as we do cpr_transfer_source_destroy() at the > > + * entry, but it's benign; glib will just skip the detach. > > + */ > > return G_SOURCE_REMOVE; > > } > > Sorry for taking too long on this, it fell off my queue. Looks good to > me. I reproduced the original issue and tested this patch under load > with hacked tests: > > /mode/transfer/defer + early cancel > /cancel/src/after/setup + 2nd migrate > > Tested-by: Fabiano Rosas <[email protected]> > Reviewed-by: Fabiano Rosas <[email protected]>
Thanks for double checking those! Then let me pick this patch up. > > BTW, I found it weird that we cannot start a second migration on the > incoming after the first one having never properly started. Trying to > actually do it without shutting down both QEMUs was fun. A bunch of > asserts on the incoming side due to yank. After hacking a little and > using migrate_recover to start the second migration on the incoming, I > managed to make it work, but it hanged mid-transfer for some > reason. Probably some iochannel reset needed. Yeah, I am guessing we need to first fill the gaps on destination side to properly manage those async tasks on port listens, etc.. and those can be the owners of dangling iochannels. -- Peter Xu
