During a CPR-style migration, we pass additional CPR state via an aux
migration channel (the cpr-transfer case). That's done in cpr_state_save().
The main bulk of the migration is only sent afterwards, which is why we
call cpr_state_save() early on in qmp_migrate().

The previous patch placed the SETUP event emission before cpr_state_save(),
since devices must be properly shut down before we send their FDs to the
CPR target. However, for a proper shutdown to take place, we should also
stop operating the devices beforehand, i.e. stop the VM.

Thus the desired order for the cpr-transfer case looks as follows:

    SOURCE                                  TARGET
    ------                                  ------
                                            cpr_state_load() blocks
      |                                       |
      | 1. migration_stop_vm()                |
      |    VM stopped, devices quiesced       |
      |                                       | Waiting for
      | 2. notifiers (SETUP)                  | FDs from source
      |    vhost_reset_owner() releases       |
      |    device ownership                   |
      |                                       |
      | 3. cpr_state_save() ---- FDs -------> |
      |                                       |
      v                                       v
    postmigrate                             Device init begins
                                              - cpr_find_fd()
                                              - vhost_dev_init()
                                              - VHOST_SET_OWNER

So step 3 is the synchronization/cut-over point. The target proceeds
immediately upon receiving the FDs, so steps 1-2 must have completed
successfully by then. Otherwise:

  * The target's VHOST_SET_OWNER fails with -EBUSY (the source still owns
    the device)
  * Source I/O races with target device init

Let's stop the VM early (before the FD transfer) to prevent this race.
Unlike regular migration, cpr-transfer passes memory via FD (memfd)
rather than copying RAM, so stopping the VM early should add minimal
downtime.

Since we now call migration_stop_vm() from qmp_migrate() (i.e. from the
main thread), the fallback resume must also be reachable outside of the
migration thread: resuming in migration_iteration_finish() alone is not
enough. One good place for this resume is migration_cleanup(). However,
in migration_iteration_finish() the resume is gated by successful block
activation, so we additionally traverse the block graph nodes to make
sure activation did take place before calling vm_resume().

This patch is a rework of the change originally proposed by Steve and
Ben at [0].

[0] https://lore.kernel.org/qemu-devel/[email protected]

Originally-by: Steve Sistare <[email protected]>
Originally-by: Ben Chaney <[email protected]>
Signed-off-by: Andrey Drobyshev <[email protected]>
---
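For illustration, the ordering and the failure-path resume gate described
in the commit message can be modelled with a toy program. This is only a
sketch: every toy_* name is hypothetical and none of them exist in QEMU.
They stand in for migration_stop_vm(), the SETUP notifiers, the target's
VHOST_SET_OWNER and the block-activation check, under the assumption that
device ownership is a single flag and that block activation can be
summarized as one "inactive" flag per node.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every toy_* name is hypothetical, not a QEMU API. */
struct toy_device {
    bool vm_running;       /* source may still issue device I/O */
    bool owned_by_source;  /* source has not released ownership yet */
};

/* Step 1: stand-in for migration_stop_vm(). */
static void toy_stop_vm(struct toy_device *d)
{
    d->vm_running = false;
}

/* Step 2: stand-in for the SETUP notifiers releasing device ownership. */
static void toy_release_owner(struct toy_device *d)
{
    assert(!d->vm_running);   /* release is only sane on a stopped VM */
    d->owned_by_source = false;
}

/* Target side, run as soon as the FDs arrive (i.e. after step 3). */
static int toy_target_set_owner(const struct toy_device *d)
{
    return d->owned_by_source ? -EBUSY : 0;
}

/*
 * Run the hand-off. With quiesce_first the source completes steps 1-2
 * before the FDs are "sent"; without it the target claims too early.
 */
static int toy_handoff(bool quiesce_first)
{
    struct toy_device d = { .vm_running = true, .owned_by_source = true };

    if (quiesce_first) {
        toy_stop_vm(&d);
        toy_release_owner(&d);
    }
    return toy_target_set_owner(&d);
}

/* Failure path: resume only for CPR and only if no node is inactive. */
static bool toy_should_resume(bool failed, bool is_cpr,
                              const bool *node_inactive, size_t n)
{
    if (!failed || !is_cpr) {
        return false;
    }
    for (size_t i = 0; i < n; i++) {
        if (node_inactive[i]) {
            return false;
        }
    }
    return true;
}
```

In this model toy_handoff(true) returns 0, while toy_handoff(false)
returns -EBUSY, mirroring the first failure mode listed in the commit
message.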
migration/migration.c | 74 +++++++++++++++++++++++++++++++++----------
1 file changed, 58 insertions(+), 16 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 20eff9dbdcb..4a854ade503 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1323,6 +1323,24 @@ static void migration_cleanup_json_writer(MigrationState *s)
     g_clear_pointer(&s->vmdesc, json_writer_free);
 }

+static bool migration_check_block_active(void)
+{
+    BlockDriverState *bs;
+    BdrvNextIterator it;
+
+    assert(bql_locked());
+    GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
+        if (bdrv_is_inactive(bs)) {
+            bdrv_next_cleanup(&it);
+            return false;
+        }
+    }
+
+    return true;
+}
+
 static void migration_cleanup(MigrationState *s)
 {
     QEMUFile *tmp = NULL;
@@ -1381,9 +1399,16 @@ static void migration_cleanup(MigrationState *s)
     /*
      * FAILED notification should have already happened. Notify DONE if
      * migration completed successfully.
+     *
+     * In case of a failed CPR migration, we want to resume VM. However,
+     * for a non-CPR migration resume is done in migration_iteration_finish()
+     * and is gated by a successful migration_block_activate(). So in the
+     * failed CPR case we only resume if that prior activation was successful.
      */
     if (!migration_has_failed(s)) {
         migration_call_notifiers(MIG_EVENT_DONE, NULL);
+    } else if (migrate_mode_is_cpr() && migration_check_block_active()) {
+        vm_resume(s->vm_old_state);
     }

     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
@@ -2130,6 +2155,26 @@ void qmp_migrate(const char *uri, bool has_channels,
      */
     Error *local_err = NULL;

+    /*
+     * For CPR migration the hand-off happens at cpr_state_save() below, which
+     * (for cpr-transfer case) transfers the device FDs to the target. The
+     * target starts claiming them as soon as they arrive. Everything the
+     * source must do to release the devices therefore has to happen before
+     * that cut-over - stop the vCPUs and run the SETUP notifiers (which
+     * release device ownership).
+     *
+     * So for CPR case stop the VM here, ahead of the SETUP notifiers and
+     * cpr_state_save(). Since we aren't copying any RAM (it stays in place)
+     * stopping early is cheap.
+     */
+    if (migrate_mode_is_cpr()) {
+        int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
+        if (ret < 0) {
+            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
+            goto out;
+        }
+    }
+
     /* Notify before starting migration thread and before starting CPR */
     if (!(has_resume && resume) &&
         migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
@@ -3483,13 +3528,19 @@ static void migration_iteration_finish(MigrationState *s)
          */
         migration_call_notifiers(MIG_EVENT_FAILED, NULL);

-        if (runstate_is_live(s->vm_old_state)) {
-            if (!runstate_check(RUN_STATE_SHUTDOWN)) {
-                vm_start();
-            }
-        } else {
-            if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
-                runstate_set(s->vm_old_state);
+        /*
+         * For cpr the VM resume on failure is centralized in
+         * migration_cleanup(), so don't resume here as well.
+         */
+        if (!migrate_mode_is_cpr()) {
+            if (runstate_is_live(s->vm_old_state)) {
+                if (!runstate_check(RUN_STATE_SHUTDOWN)) {
+                    vm_start();
+                }
+            } else {
+                if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
+                    runstate_set(s->vm_old_state);
+                }
             }
         }
         break;
@@ -3904,7 +3955,6 @@ void migration_start_outgoing(MigrationState *s)
     Error *local_err = NULL;
     uint64_t rate_limit;
     bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
-    int ret;

     if (resume) {
         /* This is a resumed migration */
@@ -3945,14 +3995,6 @@ void migration_start_outgoing(MigrationState *s)
         return;
     }

-    if (migrate_mode_is_cpr()) {
-        ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
-        if (ret < 0) {
-            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
-            goto fail;
-        }
-    }
-
     /*
      * Take a refcount to make sure the migration object won't get freed by
      * the main thread already in migration_shutdown().
--
2.47.1