Re: [PATCH] migration: Fix crash on second migration when cancel early

Peter Xu Thu, 07 May 2026 10:24:03 -0700

On Wed, May 06, 2026 at 05:51:20PM -0300, Fabiano Rosas wrote:
> Peter Xu <[email protected]> writes:
> 
> > Marc-André reported an issue on QEMU crash when retrying a cancelled
> > migration during early setup phase, see "Link:" for more information, and
> > also easy way to reproduce.
> >
> > This patch is a replacement of the prior fix proposed by not only switching
> > to migration_cleanup(), but also fixing it from CPR side, so that we track
> > hup_source properly to know if src QEMU is waiting or the HUP signal.
> >
> > To put it simple: this chunk of special casing in migration_cancel() should
> > not affect normal migration, but only cpr-transfer migration to cover the
> > small window when the src QEMU is waiting for a HUP signal on cpr
> > channel (so that src QEMU can continue the migration on the main channel).
> >
> > To achieve that, we'll also need to remember to detach the hup_source
> > whenenver invoked: after that point, we should always be able to cleanup
> > the migration.
> >
> > It's not a generic operation to explicitly detach a gsource from its
> > context while in its dispatch() function.  But it should be safe, because
> > gsource disptch() will only happen with a boosted refcount for the
> > dispatcher so that the gsource will not be freed until the callback
> > completes. It's also safe to return G_SOURCE_REMOVE after the gsource is
> > detached, as glib will simply ignore the G_SOURCE_REMOVE.
> >
> > One can refer to latest 2.86.5 glib code in g_main_dispatch() for that:
> >
> > https://github.com/GNOME/glib/blob/2.86.5/glib/gmain.c#L3592
> >
> > When at this, add a bunch of assertions to make sure nothing surprises us.
> >
> > After this patch applied, the 2nd migration will not crash QEMU, instead
> > it'll be in CANCELLING until the socket connection times out (it will take
> > ~2min on my Fedora default kernel).  During this process no 2nd migration
> > will be allowed, and after it timed out migration can be restarted.
> >
> > It's because so far we don't have control over socket_connect_outgoing(),
> > or anything yet managed by a task executed in qio_task_run_in_thread().
> > Speeding up the cancellation to be left for future.
> >
> > I also tested cpr-transfer by only providing cpr channel not the main
> > channel (with -incoming defer), kickoff migration on source, then cancel it
> > on source directly without providing the main channel.  It keeps working.
> >
> > I wanted to add an unit test for that but it'll need to refactor current
> > cpr-transfer tests first; let's leave it for later.
> >
> > Link: 
> > https://lore.kernel.org/r/[email protected]
> > Reported-by: Marc-André Lureau <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> >  include/migration/cpr.h  |  1 +
> >  migration/migration.h    |  5 +++++
> >  migration/cpr-transfer.c | 10 ++++++++++
> >  migration/migration.c    | 31 +++++++++++++++++++++++--------
> >  4 files changed, 39 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > index 5850fd1788..ebf09a2f0a 100644
> > --- a/include/migration/cpr.h
> > +++ b/include/migration/cpr.h
> > @@ -57,6 +57,7 @@ QEMUFile *cpr_transfer_input(MigrationChannel *channel, 
> > Error **errp);
> >  void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func,
> >                                  void *opaque);
> >  void cpr_transfer_source_destroy(MigrationState *s);
> > +bool cpr_transfer_source_active(MigrationState *s);
> >  
> >  void cpr_exec_init(void);
> >  QEMUFile *cpr_exec_output(Error **errp);
> > diff --git a/migration/migration.h b/migration/migration.h
> > index b6888daced..2bc2787480 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -514,6 +514,11 @@ struct MigrationState {
> >      bool postcopy_package_loaded;
> >      QemuEvent postcopy_package_loaded_event;
> >  
> > +    /*
> > +     * When set, it means cpr-transfer is waiting for the HUP signal from
> > +     * destination to continue the 2nd step of migration via the main
> > +     * channel.
> > +     */
> >      GSource *hup_source;
> >  
> >      /*
> > diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
> > index 61d5c9dce2..9defe7bad7 100644
> > --- a/migration/cpr-transfer.c
> > +++ b/migration/cpr-transfer.c
> > @@ -6,6 +6,7 @@
> >   */
> >  
> >  #include "qemu/osdep.h"
> > +#include "qemu/main-loop.h"
> >  #include "qapi/clone-visitor.h"
> >  #include "qapi/error.h"
> >  #include "qapi/qapi-visit-migration.h"
> > @@ -79,6 +80,7 @@ QEMUFile *cpr_transfer_input(MigrationChannel *channel, 
> > Error **errp)
> >  void cpr_transfer_add_hup_watch(MigrationState *s, QIOChannelFunc func,
> >                                  void *opaque)
> >  {
> > +    assert(bql_locked());
> >      s->hup_source = qio_channel_create_watch(cpr_state_ioc(), G_IO_HUP);
> >      g_source_set_callback(s->hup_source,
> >                            (GSourceFunc)func,
> > @@ -89,9 +91,17 @@ void cpr_transfer_add_hup_watch(MigrationState *s, 
> > QIOChannelFunc func,
> >  
> >  void cpr_transfer_source_destroy(MigrationState *s)
> >  {
> > +    assert(bql_locked());
> >      if (s->hup_source) {
> >          g_source_destroy(s->hup_source);
> >          g_source_unref(s->hup_source);
> >          s->hup_source = NULL;
> >      }
> >  }
> > +
> > +bool cpr_transfer_source_active(MigrationState *s)
> > +{
> > +    /* Whenever the HUP gsource is available, it's active. */
> > +    assert(bql_locked());
> > +    return s->hup_source;
> > +}
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5c9aaa6e58..58c1e56766 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1469,14 +1469,19 @@ void migration_cancel(void)
> >      }
> >  
> >      /*
> > -     * If migration_connect_outgoing has not been called, then there
> > -     * is no path that will complete the cancellation. Do it now.
> > -     */
> > -    if (setup && !s->to_dst_file) {
> > -        migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
> > -                          MIGRATION_STATUS_CANCELLED);
> > -        cpr_state_close();
> > -        cpr_transfer_source_destroy(s);
> > +     * This is cpr-transfer specific processing.
> > +     *
> > +     * If this is true, it means cpr-transfer migration is waiting for the
> > +     * destination to send HUP event on CPR channel to continue the next
> > +     * phase.  If so, do the cleanup proactively to avoid get stuck in
> > +     * CANCELLING state.
> > +     */
> > +    if (cpr_transfer_source_active(s)) {
> > +        assert(migrate_mode() == MIG_MODE_CPR_TRANSFER);
> > +        assert(setup && !s->to_dst_file);
> > +        migration_cleanup(s);
> > +        /* Now all things should have been released */
> > +        assert(!cpr_transfer_source_active(s));
> >      }
> >  }
> >  
> > @@ -2009,12 +2014,22 @@ static gboolean 
> > migration_connect_outgoing_cb(QIOChannel *channel,
> >      MigrationState *s = migrate_get_current();
> >      Error *local_err = NULL;
> >  
> > +    /*
> > +     * Detach and release the GSource right after use.  We rely on this to
> > +     * detect this small cpr-transfer window of "waiting for HUP event".
> > +     */
> > +    cpr_transfer_source_destroy(s);
> > +
> >      migration_connect_outgoing(s, opaque, &local_err);
> >  
> >      if (local_err) {
> >          migration_connect_error_propagate(s, local_err);
> >      }
> >  
> > +    /*
> > +     * This is redundant as we do cpr_transfer_source_destroy() at the
> > +     * entry, but it's benign; glib will just skip the detach.
> > +     */
> >      return G_SOURCE_REMOVE;
> >  }
> 
> Sorry for taking too long on this, it fell off my queue. Looks good to
> me. I reproduced the original issue and tested this patch under load
> with hacked tests:
> 
> /mode/transfer/defer + early cancel
> /cancel/src/after/setup + 2nd migrate
> 
> Tested-by: Fabiano Rosas <[email protected]>
> Reviewed-by: Fabiano Rosas <[email protected]>


Thanks for double checking those!  Then let me pick this patch up.

> 
> BTW, I found it weird that we cannot start a second migration on the
> incoming after the first one having never properly started. Trying to
> actually do it without shutting down both QEMUs was fun. A bunch of
> asserts on the incoming side due to yank. After hacking a little and
> using migrate_recover to start the second migration on the incoming, I
> managed to make it work, but it hanged mid-transfer for some
> reason. Probably some iochannel reset needed.

Yeah, I am guessing we need to first fill the gaps on destination side to
properly manage those async tasks on port listens, etc.. and those can be
the owners of dangling iochannels.

-- 
Peter Xu

Re: [PATCH] migration: Fix crash on second migration when cancel early

Reply via email to