Switchover-ack is a mechanism that synchronizes the source and destination
QEMU during migration to prevent the source from switching over
prematurely. VFIO uses switchover-ack to ensure that switchover happens
only after the destination side has loaded the precopy initial bytes.
This matters for VFIO because downtime could otherwise be higher.

In its current state, switchover-ack is a one-time mechanism: switchover
is acked only once, and after that another ACK cannot be requested.

This was sufficient until now, as VFIO precopy initial bytes were defined
to be monotonically decreasing. Thus, when precopy initial bytes reached
zero for all VFIO devices, a single ACK would be sent and it would remain
valid.

However, the new VFIO_PRECOPY_INFO_REINIT feature allows precopy initial
bytes to be re-initialized during precopy. Specifically, initial bytes
can grow after reaching zero, which would invalidate a previously sent
switchover ACK.

To solve this, make switchover-ack reusable and allow devices to request
switchover ACKs when needed via the save_query_pending SaveVMHandler.

Since a switchover ACK can now be requested for a specific device and at
different times, make switchover ACKs per-device (instead of a single ACK
for all devices) and let the source side do the pending ACKs accounting.

Keep the legacy switchover-ack mechanism for backward compatibility and
turn it on via a compatibility property for older machines. Enable the
property by default until VFIO implements the new switchover-ack.
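
For illustration only (not part of this patch), the source-side pending
ACKs accounting described above can be sketched in standalone C11. The
PendingAcks type and the pending_acks_* helpers are hypothetical names
made up for this sketch. It uses <stdatomic.h> in place of QEMU's qatomic
helpers and assumes the switchover-ack capability is enabled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MigrationState.switchover_ack_pending_num. */
typedef struct {
    _Atomic uint32_t pending;
} PendingAcks;

/* Legacy mode expects one ACK for all devices; new mode starts at zero. */
static void pending_acks_init(PendingAcks *pa, bool legacy)
{
    atomic_store(&pa->pending, legacy ? 1u : 0u);
}

/* Migration thread: add the ACKs that devices requested in this query. */
static void pending_acks_request(PendingAcks *pa, uint32_t requested)
{
    if (requested) {
        atomic_fetch_add(&pa->pending, requested);
    }
}

/*
 * Return-path thread: one ACK message arrived from the destination.
 * Returns false if the ACK was unexpected, i.e. the counter wrapped.
 */
static bool pending_acks_receive(PendingAcks *pa)
{
    uint32_t now = atomic_fetch_sub(&pa->pending, 1u) - 1u;

    return now != UINT32_MAX;
}

/* Switchover is allowed only while no ACK is outstanding. */
static bool pending_acks_can_switchover(PendingAcks *pa)
{
    return atomic_load(&pa->pending) == 0;
}

/* Walks through the REINIT scenario from the commit message. */
static void pending_acks_walkthrough(void)
{
    PendingAcks pa;
    bool ok;

    pending_acks_init(&pa, false);
    pending_acks_request(&pa, 2);      /* two devices request an ACK */
    ok = pending_acks_receive(&pa);
    assert(ok && !pending_acks_can_switchover(&pa));
    ok = pending_acks_receive(&pa);
    assert(ok && pending_acks_can_switchover(&pa));

    pending_acks_request(&pa, 1);      /* initial bytes grew again */
    assert(!pending_acks_can_switchover(&pa));
    ok = pending_acks_receive(&pa);
    assert(ok && pending_acks_can_switchover(&pa));

    ok = pending_acks_receive(&pa);    /* ACK nobody asked for */
    assert(!ok);
}
```

In this sketch an ACK that arrives with no request outstanding wraps the
counter to UINT32_MAX, which mirrors the underflow check that the patch
adds to the return path thread.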

Signed-off-by: Avihai Horon <[email protected]>
---
 qapi/migration.json                | 16 ++++-----
 include/migration/client-options.h |  1 +
 include/migration/register.h       |  2 ++
 migration/migration.h              | 32 ++++++++++++++++--
 migration/savevm.h                 |  6 ++--
 hw/core/machine.c                  |  1 +
 migration/migration.c              | 37 ++++++++++++++-------
 migration/options.c                | 10 ++++++
 migration/savevm.c                 | 53 +++++++++++++++++++++++-------
 migration/trace-events             |  5 +--
 10 files changed, 125 insertions(+), 38 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 27a7970556..550cb77762 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -507,15 +507,13 @@
 #     and should not affect the correctness of postcopy migration.
 #     (since 7.1)
 #
-# @switchover-ack: If enabled, migration will not stop the source VM
-#     and complete the migration until an ACK is received from the
-#     destination that it's OK to do so. Exactly when this ACK is
-#     sent depends on the migrated devices that use this feature. For
-#     example, a device can use it to make sure some of its data is
-#     sent and loaded in the destination before doing switchover.
-#     This can reduce downtime if devices that support this capability
-#     are present. 'return-path' capability must be enabled to use
-#     it. (since 8.1)
+# @switchover-ack: If enabled, migration will rely on destination side
+#     to acknowledge the source's switchover decision. The
+#     acknowledgement may depend, for example, on some device's data
+#     being loaded in the destination before doing switchover. This
+#     can reduce downtime if devices that support this capability are
+#     present. 'return-path' capability must be enabled to use it.
+#     (since 8.1)
 #
 # @dirty-limit: If enabled, migration will throttle vCPUs as needed to
 #     keep their dirty page rate within @vcpu-dirty-limit. This can
diff --git a/include/migration/client-options.h b/include/migration/client-options.h
index 289c9d7762..78b1daa1a6 100644
--- a/include/migration/client-options.h
+++ b/include/migration/client-options.h
@@ -13,6 +13,7 @@
 /* properties */

 bool migrate_send_switchover_start(void);
+bool migrate_switchover_ack_legacy(void);

 /* capabilities */

diff --git a/include/migration/register.h b/include/migration/register.h
index a61c4236d2..5825eb30cb 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -23,6 +23,8 @@ typedef struct MigPendingData {
     uint64_t postcopy_bytes;
     /* Amount of pending bytes can be transferred only in stopcopy */
     uint64_t stopcopy_bytes;
+    /* Number of new pending switchover ACKs */
+    uint32_t switchover_ack_pending;
     /*
      * Total pending data, modules do not need to update this field, it
      * will be automatically calculated by migration core API.
diff --git a/migration/migration.h b/migration/migration.h
index da45444f7b..086eb9a15d 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -494,6 +494,29 @@ struct MigrationState {
      */
     uint8_t clear_bitmap_shift;

+    /*
+     * This decides whether to use legacy switchover-ack or new switchover-ack.
+     * The main difference between them is that the former allows acknowledging
+     * switchover only once while the latter multiple times.
+     *
+     * In legacy, the destination keeps track of a pending ACKs counter. As
+     * migration progresses, the devices on the destination acknowledge
+     * switchover, decreasing the counter. When the counter reaches zero, a
+     * single ACK message is sent to the source via the return path, indicating
+     * that it's OK to switchover.
+     *
+     * In new switchover-ack, the source is the one that keeps track of a
+     * pending ACKs counter. As migration progresses, the destination sends ACK
+     * message per-device via the return path, which decrements the source
+     * counter. When the counter reaches zero, it's OK to switchover. During
+     * precopy, source-side devices may request additional ACKs, which increment
+     * the counter again.
+     *
+     * In both legacy and new schemes, we rely on per-device protocol to request
+     * switchover ACK from the destination-side counterpart.
+     */
+    bool switchover_ack_legacy;
+
     /*
      * This save hostname when out-going migration starts
      */
@@ -503,10 +526,13 @@ struct MigrationState {
     JSONWriter *vmdesc;

     /*
-     * Indicates whether an ACK from the destination that it's OK to do
-     * switchover has been received.
+     * Indicates the number of pending ACKs from the destination. The value may
+     * increase or decrease during precopy as new ACKs are requested or
+     * received. When zero is reached, it's OK to switchover. In legacy
+     * switchover-ack, it's initialized to 1 and decreased to zero upon ACK.
      */
-    bool switchover_acked;
+    uint32_t switchover_ack_pending_num;
+
     /* Is this a rdma migration */
     bool rdma_migration;

diff --git a/migration/savevm.h b/migration/savevm.h
index 0d732eb0f7..a049d695c9 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -45,8 +45,10 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
 void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(MigrationState *s);
-void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact);
-void qemu_savevm_query_pending_final(MigPendingData *pending);
+void qemu_savevm_query_pending_iter(MigrationState *s, MigPendingData *pending,
+                                    bool exact);
+void qemu_savevm_query_pending_final(MigrationState *s,
+                                     MigPendingData *pending);
 int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
 bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
 void qemu_savevm_state_end(QEMUFile *f);
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 63baff859f..8a39943d94 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@

 GlobalProperty hw_compat_11_0[] = {
     { "chardev-vc", "encoding", "cp437" },
+    { "migration", "switchover-ack-legacy", "on" },
 };
 const size_t hw_compat_11_0_len = G_N_ELEMENTS(hw_compat_11_0);

diff --git a/migration/migration.c b/migration/migration.c
index 59aff50d68..4bb649a467 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1707,7 +1707,9 @@ int migrate_init(MigrationState *s, Error **errp)
     s->vm_old_state = -1;
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
-    s->switchover_acked = false;
+    /* Legacy switchover-ack sends a single ACK for all devices */
+    qatomic_set(&s->switchover_ack_pending_num,
+                migrate_switchover_ack_legacy() ? 1 : 0);
     s->rdma_migration = false;

     /*
@@ -2201,7 +2203,7 @@ void migration_request_switchover_ack_legacy(const char *requester)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();

-    if (!migrate_switchover_ack()) {
+    if (!migrate_switchover_ack() || !migrate_switchover_ack_legacy()) {
         return;
     }

@@ -2457,9 +2459,18 @@ static void *source_return_path_thread(void *opaque)
             break;

         case MIG_RP_MSG_SWITCHOVER_ACK:
-            ms->switchover_acked = true;
-            trace_source_return_path_thread_switchover_acked();
+        {
+            uint32_t pending_num;
+
+            pending_num = qatomic_dec_fetch(&ms->switchover_ack_pending_num);
+            trace_source_return_path_thread_switchover_acked(pending_num);
+            if (pending_num == UINT32_MAX) {
+                error_setg(&err, "Switchover ack pending num underflowed");
+                goto out;
+            }
+
             break;
+        }

         default:
             break;
@@ -2809,7 +2820,7 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
         return false;
     }

-    qemu_savevm_query_pending_final(&pending);
+    qemu_savevm_query_pending_final(s, &pending);

     /* Inactivate disks except in COLO */
     if (!migrate_colo()) {
@@ -3259,7 +3270,7 @@ static bool migration_can_switchover(MigrationState *s)
         return true;
     }

-    return s->switchover_acked;
+    return qatomic_read(&s->switchover_ack_pending_num) == 0;
 }

 /* Migration thread iteration status */
@@ -3298,12 +3309,13 @@ static bool migration_iteration_next_ready(MigrationState *s,
     return false;
 }

-static void migration_iteration_go_next(MigPendingData *pending)
+static void migration_iteration_go_next(MigrationState *s,
+                                        MigPendingData *pending)
 {
     /*
      * Do a slow sync first before boosting the iteration count.
      */
-    qemu_savevm_query_pending_iter(pending, true);
+    qemu_savevm_query_pending_iter(s, pending, true);

     /*
      * Update the dirty information for the whole system for this
@@ -3349,12 +3361,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
     Error *local_err = NULL;
     bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
                         s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
-    bool can_switchover = migration_can_switchover(s);
+    bool can_switchover;
     MigPendingData pending = { };
     bool complete_ready;

     /* Fast path - get the estimated amount of pending data */
-    qemu_savevm_query_pending_iter(&pending, false);
+    qemu_savevm_query_pending_iter(s, &pending, false);

     if (in_postcopy) {
         /*
@@ -3395,9 +3407,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
          * during postcopy phase.
          */
         if (migration_iteration_next_ready(s, &pending)) {
-            migration_iteration_go_next(&pending);
+            migration_iteration_go_next(s, &pending);
         }

+        /* Check can switchover after qemu_savevm_query_pending() */
+        can_switchover = migration_can_switchover(s);
+
         /* Should we switch to postcopy now? */
         if (can_switchover && postcopy_should_start(s, &pending)) {
             if (postcopy_start(s, &local_err)) {
diff --git a/migration/options.c b/migration/options.c
index 5cbfd29099..4c9b25372e 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -110,6 +110,9 @@ const Property migration_properties[] = {
                      preempt_pre_7_2, false),
     DEFINE_PROP_BOOL("multifd-clean-tls-termination", MigrationState,
                      multifd_clean_tls_termination, true),
+    /* Use legacy until VFIO implements new switchover-ack */
+    DEFINE_PROP_BOOL("switchover-ack-legacy", MigrationState,
+                     switchover_ack_legacy, true),

     /* Migration parameters */
     DEFINE_PROP_UINT8("x-throttle-trigger-threshold", MigrationState,
@@ -467,6 +470,13 @@ bool migrate_rdma(void)
     return s->rdma_migration;
 }

+bool migrate_switchover_ack_legacy(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->switchover_ack_legacy;
+}
+
 typedef enum WriteTrackingSupport {
     WT_SUPPORT_UNKNOWN = 0,
     WT_SUPPORT_ABSENT,
diff --git a/migration/savevm.c b/migration/savevm.c
index fa188dd34f..0e487ea8ab 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1795,7 +1795,8 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
     return qemu_fflush(f);
 }

-static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
+static void qemu_savevm_query_pending(MigrationState *s,
+                                      MigPendingData *pending, bool exact,
                                       bool final)
 {
     SaveStateEntry *se;
@@ -1824,22 +1825,35 @@ static void qemu_savevm_query_pending(MigPendingData *pending, bool exact,
     if (!final) {
         mig_stats.dirty_bytes_total = pending->total_bytes;
     }
-    trace_qemu_savevm_query_pending(exact, final, pending->precopy_bytes,
-                                    pending->stopcopy_bytes,
-                                    pending->postcopy_bytes,
-                                    pending->total_bytes);
+
+    if (migrate_switchover_ack() && !migrate_switchover_ack_legacy() &&
+        pending->switchover_ack_pending) {
+        /*
+         * NOTE: Currently we rely on per-device protocol to request switchover
+         * ACK from the device on the destination side.
+         */
+        qatomic_add(&s->switchover_ack_pending_num,
+                    pending->switchover_ack_pending);
+    }
+
+    trace_qemu_savevm_query_pending(
+        exact, final, pending->precopy_bytes, pending->stopcopy_bytes,
+        pending->postcopy_bytes, pending->total_bytes,
+        pending->switchover_ack_pending,
+        qatomic_read(&s->switchover_ack_pending_num));
 }

-void qemu_savevm_query_pending_iter(MigPendingData *pending, bool exact)
+void qemu_savevm_query_pending_iter(MigrationState *s, MigPendingData *pending,
+                                    bool exact)
 {
-    qemu_savevm_query_pending(pending, exact, false);
+    qemu_savevm_query_pending(s, pending, exact, false);
 }

-void qemu_savevm_query_pending_final(MigPendingData *pending)
+void qemu_savevm_query_pending_final(MigrationState *s, MigPendingData *pending)
 {
     g_assert(bql_locked());

-    qemu_savevm_query_pending(pending, true, true);
+    qemu_savevm_query_pending(s, pending, true, true);
 }

 void qemu_savevm_state_cleanup(void)
@@ -2485,7 +2499,7 @@ static int loadvm_switchover_ack_no_users_legacy(MigrationIncomingState *mis,
 {
     int ret;

-    if (!migrate_switchover_ack()) {
+    if (!migrate_switchover_ack() || !migrate_switchover_ack_legacy()) {
         return 0;
     }

@@ -3167,7 +3181,7 @@ int qemu_load_device_state(QEMUFile *f, Error **errp)
     return 0;
 }

-int qemu_loadvm_approve_switchover(const char *approver)
+static int qemu_loadvm_approve_switchover_legacy(const char *approver)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();

@@ -3186,6 +3200,23 @@ int qemu_loadvm_approve_switchover(const char *approver)
     return migrate_send_rp_switchover_ack(mis);
 }

+int qemu_loadvm_approve_switchover(const char *approver)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    if (!migrate_switchover_ack()) {
+        return 0;
+    }
+
+    if (migrate_switchover_ack_legacy()) {
+        return qemu_loadvm_approve_switchover_legacy(approver);
+    }
+
+    trace_loadvm_approve_switchover(approver);
+
+    return migrate_send_rp_switchover_ack(mis);
+}
+
 bool qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
                                    char *buf, size_t len, Error **errp)
 {
diff --git a/migration/trace-events b/migration/trace-events
index a6b8c31ee1..f5339f4193 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
 qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 qemu_savevm_send_packaged(void) ""
-qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
+qemu_savevm_query_pending(bool exact, bool final, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total, uint32_t switchover_ack_pending, uint32_t total_switchover_ack_pending) "exact=%d, final=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64", collected switchover ack pending=%"PRIu32", total switchover ack pending=%"PRIu32
 loadvm_state_setup(void) ""
 loadvm_state_cleanup(void) ""
 loadvm_handle_cmd_packaged(unsigned int length) "%u"
@@ -24,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s:
 loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
 loadvm_process_command_ping(uint32_t val) "0x%x"
 loadvm_approve_switchover_legacy(const char *approver, unsigned int switchover_ack_pending_num_legacy) "Approver %s, switchover_ack_pending_num_legacy %u"
+loadvm_approve_switchover(const char *approver) "Approver %s"
 postcopy_ram_listen_thread_exit(void) ""
 postcopy_ram_listen_thread_start(void) ""
 qemu_savevm_send_postcopy_advise(void) ""
@@ -189,7 +190,7 @@ source_return_path_thread_loop_top(void) ""
 source_return_path_thread_pong(uint32_t val) "0x%x"
 source_return_path_thread_shut(uint32_t val) "0x%x"
 source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
-source_return_path_thread_switchover_acked(void) ""
+source_return_path_thread_switchover_acked(uint32_t pending_num) "switchover_ack_pending_num %" PRIu32
 source_return_path_thread_postcopy_package_loaded(void) ""
 migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
--
2.40.1
