On 23/10/2025 03:26, Peter Xu wrote:
> The migration module has been around for 10+ years.  Initially it was mostly
> based on coroutines.  As more features were added to the framework, like
> postcopy, multifd, etc., it became a mixture of threads and coroutines.
> 
> I'm guessing coroutines just can't solve all the issues that migration wants
> to resolve.
> 
> After all these years, migration is now heavily based on a threaded model.
> 
> There is still one major part of the migration framework that is not
> thread-based, which is precopy load.  Postcopy has done its load in a
> separate thread since the first day it was introduced, however that requires
> a separate state transition after precopy has first loaded all the devices,
> which still happens in a coroutine on the main thread.
> 
> This patch tries to move the migration incoming side to run inside a
> separate thread (mig/dst/main), just like the src side (mig/src/main).  The
> entry point is migration_incoming_thread().
> 
> Quite a few things are needed to make it fly.  One note here is that we need
> to change all of these things in one patch so as not to break anything in
> between.  The other way to do this would be to add code making all the paths
> this patch touches ready for either coroutine or thread context, but that
> could be confusing in its own way.  So reviewers, please accept my sincere
> apology for how hard this patch is to review: it covers a few modules at the
> same time, and with some risky changes.
> 
> BQL Analysis
> ============
> 
> Firstly, moving the load over to a thread means the thread cannot take the
> BQL for the whole loading process anymore, because otherwise it can block
> the main thread from using the BQL for all kinds of other concurrent tasks
> (for example, processing QMP / HMP commands).
> 
> Here the first question to ask is: what needs the BQL during precopy load,
> and what doesn't?
> 
> Most of the load process shouldn't need the BQL, especially when it's about
> RAM.  After all, RAM is still the major chunk of data to move in a live
> migration.  VFIO has started to change that, but VFIO is per-device, so it
> shouldn't need the BQL either in most cases.
> 
> Generic device loads will need the BQL, likely not when receiving VMSDs but
> when applying them.  One example is that any post_load() could potentially
> inject memory regions, causing memory transactions to happen; those need to
> update the global address spaces, hence require the BQL.  The other one is
> CPU sync operations: even if the sync alone may not need the BQL (which is
> still to be further justified), run_on_cpu() will need it.
> 
> For that, the qemu_loadvm_state() and qemu_loadvm_state_main() functions now
> need to take a "bql_held" parameter saying whether the BQL is held.  We
> could use things like BQL_LOCK_GUARD(), but this patch goes with explicit
> locking rather than relying on the bql_locked TLS variable.  In the case of
> migration, we always know whether the BQL is held in each context, as long
> as we keep passing that information downwards.
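> 
> As a rough sketch of the pattern (the helper name below is made up for this
> example, it is not code from this patch), the flag is passed down and the
> lock is only taken when the caller doesn't hold it yet:
> 
>     /* Illustrative only: conditional locking based on the explicit flag */
>     static void apply_device_state(bool bql_held)
>     {
>         if (!bql_held) {
>             bql_lock();
>         }
>         /* e.g. post_load() work that may trigger memory transactions */
>         if (!bql_held) {
>             bql_unlock();
>         }
>     }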
> 
> COLO
> ====
> 
> COLO assumed the dest VM load happens in a coroutine.  After this patch, it
> no longer does.  Adapt to that by invoking colo_incoming_co() directly from
> migration_incoming_thread().
> 
> The name (colo_incoming_co()) isn't proper anymore.  Change it to
> colo_incoming_wait(), dropping the coroutine annotation along the way.
> 
> Remove all the bql_lock() implications in COLO; e.g., colo_incoming_co()
> used to release the lock for a short period around the join().  That is no
> longer needed.  Instead, take the BQL only where it is needed
> (colo_release_ram_cache()).
> 
> Meanwhile, there was a colo_incoming_co variable that used to store the
> COLO incoming coroutine, only to be kicked when a secondary failover
> happens.
> 
> To recap, what should happen on such a failover is (taking the example of
> the QMP command x-colo-lost-heartbeat triggering on the dest QEMU):
> 
>    - The QMP command will kick off both the coroutine and the COLO
>      thread (colo_process_incoming_thread()), with something like:
> 
>      /* Notify COLO incoming thread that failover work is finished */
>      qemu_event_set(&mis->colo_incoming_event);
> 
>      qemu_coroutine_enter(mis->colo_incoming_co);
> 
>    - The coroutine, which yielded itself before, now resumes after enter(),
>      then it'll wait for the join():
> 
>      mis->colo_incoming_co = qemu_coroutine_self();
>      qemu_coroutine_yield();
>      mis->colo_incoming_co = NULL;
> 
>      /* Wait checkpoint incoming thread exit before free resource */
>      qemu_thread_join(&th);
> 
> Here, when switching to a thread model, it should be fine to remove the
> colo_incoming_co variable completely, because the incoming thread will
> (instead of yielding in a coroutine) wait at qemu_thread_join() until the
> COLO thread completes execution (after receiving colo_incoming_event).
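> 
> In other words, after this patch the failover path on the dest reduces to
> roughly the following (a sketch of the flow, not the literal code):
> 
>     /* QMP x-colo-lost-heartbeat (main thread): */
>     qemu_event_set(&mis->colo_incoming_event);
> 
>     /* mig/dst/main thread, inside colo_incoming_wait(): */
>     qemu_thread_join(&th);      /* returns once the COLO thread exits */
>     bql_lock();
>     colo_release_ram_cache();
>     bql_unlock();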
> 
> RDMA
> ====
> 
> With the prior patch making sure io_watch won't block for RDMA iochannels,
> RDMA threads should only block in their io_readv/io_writev functions.  When
> a disconnection is detected (as in rdma_cm_poll_handler()), the update to
> the "errored" field will be immediately reflected in the migration incoming
> thread.  Hence RDMA no longer needs the coroutine to kick the thread out.
> 
> Now that the thread is used, we also can't have rdma_cm_poll_handler() keep
> polling the fd and operating on it in the main thread.  Drop it completely;
> that should be fine because qemu_rdma_wait_comp_channel() should also be
> monitoring it.
> 
> This almost reverts commit 923709896b1b01fb982c93492ad01b233e6b6023.
> 
> Unfortunately, we need to make this change in the same patch that introduces
> the thread, otherwise there is a risk of racing.
> 
> TODO
> ====
> 
> Currently the BQL is taken while loading a START|FULL section.  When the IO
> hangs (e.g. due to a network issue) during this process, it could
> potentially block others, like the monitor servers.  One solution is
> breaking the BQL into smaller granularity and leaving the IO always
> BQL-free.  That will need more justification.
> 
> For example, there are at least four things that need some closer
> attention:
> 
>    - SaveVMHandlers' load_state(): this likely DOES NOT need the BQL, but we
>    need to justify all of them (not to mention, some of them look prone to
>    being rewritten as VMSDs..)
> 
>    - VMSD's pre_load(): in most cases this DOES NOT really need the BQL, but
>    sometimes maybe it will!  Double checking on this will be needed.
> 
>    - VMSD's post_load(): in many cases this DOES need the BQL, for example
>    for address space operations.  Likely we should just take it for any
>    post_load().
> 
>    - A VMSD field's get(): this is tricky!  It could internally do anything,
>    even if it is only a single field.  E.g. there are users that use a
>    SINGLE field to load a whole VMSD, which can further introduce more
>    possibilities.
> 
> In general, QEMUFile IO should not need the BQL, that is, when receiving the
> VMSD data and waiting for e.g. the socket buffer to be refilled.  But that's
> the easy part.
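> 
> A possible shape of that finer granularity (purely illustrative; the helper
> names below do not exist) would be to keep the receive path lock-free and
> only wrap the apply step:
> 
>     /* Receive the section data: no BQL, may block on the socket */
>     receive_section_data(f, &buf);
> 
>     /* Apply it: take the BQL only around the parts that need it */
>     bql_lock();
>     ret = apply_section_data(&buf);
>     bql_unlock();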
> 
> Signed-off-by: Peter Xu <[email protected]>


Both the COLO and RDMA changes look good to me.

Reviewed-by: Li Zhijian <[email protected]> # COLO and RDMA

And with an additional fix[1] for COLO, the whole patchset:

Tested-by: Li Zhijian <[email protected]> # COLO and RDMA

[1] https://lore.kernel.org/qemu-devel/[email protected]/


> ---
>   include/migration/colo.h |  6 ++--
>   migration/migration.h    | 14 +++-----
>   migration/colo-stubs.c   |  2 +-
>   migration/colo.c         | 24 ++++---------
>   migration/migration.c    | 77 +++++++++++++++++++++++++---------------
>   migration/rdma.c         | 34 +-----------------
>   migration/savevm.c       |  8 ++---
>   migration/trace-events   |  4 +--
>   8 files changed, 69 insertions(+), 100 deletions(-)
> 
> diff --git a/include/migration/colo.h b/include/migration/colo.h
> index d4fe422e4d..5de7d715a7 100644
> --- a/include/migration/colo.h
> +++ b/include/migration/colo.h
> @@ -44,12 +44,10 @@ void colo_do_failover(void);
>   void colo_checkpoint_delay_set(void);
>   
>   /*
> - * Starts COLO incoming process. Called from process_incoming_migration_co()
> + * Starts COLO incoming process. Called from migration_incoming_thread()
>    * after loading the state.
> - *
> - * Called with BQL locked, may temporary release BQL.
>    */
> -void coroutine_fn colo_incoming_co(void);
> +void colo_incoming_wait(void);
>   
>   void colo_shutdown(void);
>   #endif
> diff --git a/migration/migration.h b/migration/migration.h
> index e1c5029110..0d22dc8cc2 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -214,6 +214,10 @@ struct MigrationIncomingState {
>       bool           have_listen_thread;
>       QemuThread     listen_thread;
>   
> +    /* Migration main recv thread */
> +    bool           have_recv_thread;
> +    QemuThread     recv_thread;
> +
>       /* For the kernel to send us notifications */
>       int       userfault_fd;
>       /* To notify the fault_thread to wake, e.g., when need to quit */
> @@ -272,15 +276,7 @@ struct MigrationIncomingState {
>   
>       MigrationStatus state;
>   
> -    /*
> -     * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
> -     * Used to wake the migration incoming coroutine from rdma code. How much is
> -     * it safe - it's a question.
> -     */
> -    Coroutine *loadvm_co;
> -
> -    /* The coroutine we should enter (back) after failover */
> -    Coroutine *colo_incoming_co;
> +    /* Notify secondary VM to move on */
>       QemuEvent colo_incoming_event;
>   
>       /* Optional load threads pool and its thread exit request flag */
> diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c
> index e22ce65234..ef77d1ab4b 100644
> --- a/migration/colo-stubs.c
> +++ b/migration/colo-stubs.c
> @@ -9,7 +9,7 @@ void colo_shutdown(void)
>   {
>   }
>   
> -void coroutine_fn colo_incoming_co(void)
> +void colo_incoming_wait(void)
>   {
>   }
>   
> diff --git a/migration/colo.c b/migration/colo.c
> index 4fd586951a..81276a3e65 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void)
>       }
>       /* Notify COLO incoming thread that failover work is finished */
>       qemu_event_set(&mis->colo_incoming_event);
> -
> -    /* For Secondary VM, jump to incoming co */
> -    if (mis->colo_incoming_co) {
> -        qemu_coroutine_enter(mis->colo_incoming_co);
> -    }
>   }
>   
>   static void primary_vm_do_failover(void)
> @@ -848,10 +843,8 @@ static void *colo_process_incoming_thread(void *opaque)
>   
>       mis->to_src_file = qemu_file_get_return_path(mis->from_src_file);
>       /*
> -     * Note: the communication between Primary side and Secondary side
> -     * should be sequential, we set the fd to unblocked in migration incoming
> -     * coroutine, and here we are in the COLO incoming thread, so it is ok to
> -     * set the fd back to blocked.
> +     * Here we are in the COLO incoming thread, so it is ok to set the fd
> +     * to blocking.
>        */
>       if (!qemu_file_set_blocking(mis->from_src_file, true, &local_err)) {
>           error_report_err(local_err);
> @@ -927,27 +920,22 @@ out:
>       return NULL;
>   }
>   
> -void coroutine_fn colo_incoming_co(void)
> +/* Wait for failover */
> +void colo_incoming_wait(void)
>   {
>       MigrationIncomingState *mis = migration_incoming_get_current();
>       QemuThread th;
>   
> -    assert(bql_locked());
>       assert(migration_incoming_colo_enabled());
>   
>       qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
>                          colo_process_incoming_thread,
>                          mis, QEMU_THREAD_JOINABLE);
>   
> -    mis->colo_incoming_co = qemu_coroutine_self();
> -    qemu_coroutine_yield();
> -    mis->colo_incoming_co = NULL;
> -
> -    bql_unlock();
>       /* Wait checkpoint incoming thread exit before free resource */
>       qemu_thread_join(&th);
> -    bql_lock();
>   
> -    /* We hold the global BQL, so it is safe here */
> +    bql_lock();
>       colo_release_ram_cache();
> +    bql_unlock();
>   }
> diff --git a/migration/migration.c b/migration/migration.c
> index 38a584afae..728d02dbee 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -491,6 +491,11 @@ void migration_incoming_state_destroy(void)
>           mis->postcopy_qemufile_dst = NULL;
>       }
>   
> +    if (mis->have_recv_thread) {
> +        qemu_thread_join(&mis->recv_thread);
> +        mis->have_recv_thread = false;
> +    }
> +
>       cpr_set_incoming_mode(MIG_MODE_NONE);
>       yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>   }
> @@ -861,30 +866,46 @@ static void process_incoming_migration_bh(void *opaque)
>       migration_incoming_state_destroy();
>   }
>   
> -static void coroutine_fn
> -process_incoming_migration_co(void *opaque)
> +static void migration_incoming_state_destroy_bh(void *opaque)
> +{
> +    struct MigrationIncomingState *mis = opaque;
> +
> +    migration_incoming_state_destroy();
> +
> +    if (mis->exit_on_error) {
> +        /*
> +         * NOTE: this exit() should better happen in the main thread, as
> +         * the exit notifier may require BQL which can deadlock.  See
> +         * commit e7bc0204e57836 for example.
> +         */
> +        exit(EXIT_FAILURE);
> +    }
> +}
> +
> +static void *migration_incoming_thread(void *opaque)
>   {
>       MigrationState *s = migrate_get_current();
> -    MigrationIncomingState *mis = migration_incoming_get_current();
> +    MigrationIncomingState *mis = opaque;
>       PostcopyState ps;
>       int ret;
>       Error *local_err = NULL;
>   
> +    rcu_register_thread();
> +
>       assert(mis->from_src_file);
> +    assert(!bql_locked());
>   
>       mis->largest_page_size = qemu_ram_pagesize_largest();
>       postcopy_state_set(POSTCOPY_INCOMING_NONE);
>       migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
>                         MIGRATION_STATUS_ACTIVE);
>   
> -    mis->loadvm_co = qemu_coroutine_self();
> -    ret = qemu_loadvm_state(mis->from_src_file, true, &local_err);
> -    mis->loadvm_co = NULL;
> +    ret = qemu_loadvm_state(mis->from_src_file, false, &local_err);
>   
>       trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
>   
>       ps = postcopy_state_get();
> -    trace_process_incoming_migration_co_end(ret, ps);
> +    trace_process_incoming_migration_end(ret, ps);
>       if (ps != POSTCOPY_INCOMING_NONE) {
>           if (ps == POSTCOPY_INCOMING_ADVISE) {
>               /*
> @@ -898,7 +919,7 @@ process_incoming_migration_co(void *opaque)
>                * Postcopy was started, cleanup should happen at the end of the
>                * postcopy thread.
>                */
> -            trace_process_incoming_migration_co_postcopy_end_main();
> +            trace_process_incoming_migration_postcopy_end_main();
>               goto out;
>           }
>           /* Else if something went wrong then just fall out of the normal exit */
> @@ -911,8 +932,8 @@ process_incoming_migration_co(void *opaque)
>       }
>   
>       if (migration_incoming_colo_enabled()) {
> -        /* yield until COLO exit */
> -        colo_incoming_co();
> +        /* wait until COLO exits */
> +        colo_incoming_wait();
>       }
>   
>       migration_bh_schedule(process_incoming_migration_bh, mis);
> @@ -924,28 +945,22 @@ fail:
>       migrate_set_error(s, local_err);
>       error_free(local_err);
>   
> -    migration_incoming_state_destroy();
> +    WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> +        error_report_err(s->error);
> +        s->error = NULL;
> +    }
>   
> -    if (mis->exit_on_error) {
> -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> -            error_report_err(s->error);
> -            s->error = NULL;
> -        }
> +    /*
> +     * There's some step of the destroy process that will need to happen in
> +     * the main thread (e.g. joining this thread itself).  Leave to a BH.
> +     */
> +    migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis);
>   
> -        exit(EXIT_FAILURE);
> -    } else {
> -        /*
> -         * Report the error here in case that QEMU abruptly exits
> -         * when postcopy is enabled.
> -         */
> -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> -            error_report_err(s->error);
> -            s->error = NULL;
> -        }
> -    }
>   out:
>       /* Pairs with the refcount taken in qmp_migrate_incoming() */
>       migrate_incoming_unref_outgoing_state();
> +    rcu_unregister_thread();
> +    return NULL;
>   }
>   
>   /**
> @@ -963,8 +978,12 @@ static void migration_incoming_setup(QEMUFile *f)
>   
>   void migration_incoming_process(void)
>   {
> -    Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL);
> -    qemu_coroutine_enter(co);
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    mis->have_recv_thread = true;
> +    qemu_thread_create(&mis->recv_thread, "mig/dst/main",
> +                       migration_incoming_thread, mis,
> +                       QEMU_THREAD_JOINABLE);
>   }
>   
>   /* Returns true if recovered from a paused migration, otherwise false */
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 0e5e02cdca..3389f6448b 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3051,37 +3051,6 @@ int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>   
>   static void rdma_accept_incoming_migration(void *opaque);
>   
> -static void rdma_cm_poll_handler(void *opaque)
> -{
> -    RDMAContext *rdma = opaque;
> -    struct rdma_cm_event *cm_event;
> -    MigrationIncomingState *mis = migration_incoming_get_current();
> -
> -    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> -        error_report("get_cm_event failed %d", errno);
> -        return;
> -    }
> -
> -    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> -        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> -        if (!rdma->errored &&
> -            migration_incoming_get_current()->state !=
> -              MIGRATION_STATUS_COMPLETED) {
> -            error_report("receive cm event, cm event is %d", cm_event->event);
> -            rdma->errored = true;
> -            if (rdma->return_path) {
> -                rdma->return_path->errored = true;
> -            }
> -        }
> -        rdma_ack_cm_event(cm_event);
> -        if (mis->loadvm_co) {
> -            qemu_coroutine_enter(mis->loadvm_co);
> -        }
> -        return;
> -    }
> -    rdma_ack_cm_event(cm_event);
> -}
> -
>   static int qemu_rdma_accept(RDMAContext *rdma)
>   {
>       Error *err = NULL;
> @@ -3199,8 +3168,7 @@ static int qemu_rdma_accept(RDMAContext *rdma)
>                               NULL,
>                               (void *)(intptr_t)rdma->return_path);
>       } else {
> -        qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
> -                            NULL, rdma);
> +        qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
>       }
>   
>       ret = rdma_accept(rdma->cm_id, &conn_param);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 44aadc2f51..991f46593c 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2118,7 +2118,7 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       qemu_file_set_blocking(f, true, &error_fatal);
>   
>       /* TODO: sanity check that only postcopiable data will be loaded here */
> -    load_res = qemu_loadvm_state_main(f, mis, true, &local_err);
> +    load_res = qemu_loadvm_state_main(f, mis, false, &local_err);
>   
>       /*
>        * This is tricky, but, mis->from_src_file can change after it
> @@ -2415,11 +2415,11 @@ static void loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
>    * Immediately following this command is a blob of data containing an embedded
>    * chunk of migration stream; read it and load it.
>    *
> - * @mis: Incoming state
> - * @length: Length of packaged data to read
> + * @mis:      Incoming state
> + * @bql_held: Whether BQL is held already
> + * @errp:     The Error** to set when returning failures.
>    *
>    * Returns: Negative values on error
> - *
>    */
>   static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
>                                         bool bql_held, Error **errp)
> diff --git a/migration/trace-events b/migration/trace-events
> index e8edd1fbba..2b7b522e73 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>   source_return_path_thread_switchover_acked(void) ""
>   migration_thread_low_pending(uint64_t pending) "%" PRIu64
>   migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
> -process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> -process_incoming_migration_co_postcopy_end_main(void) ""
> +process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d"
> +process_incoming_migration_postcopy_end_main(void) ""
>   postcopy_preempt_enabled(bool value) "%d"
>   migration_precopy_complete(void) ""
>   
