On 19.06.2026 12:55, Andrey Drobyshev wrote:
> During a CPR-style migration, we pass additional CPR state via an aux
> migration channel (the cpr-transfer case). That's done in cpr_state_save().
> The main bulk of the migration is only sent afterwards - that's why we
> call cpr_state_save() early on in qmp_migrate().
>
> Previous patch placed emitting the SETUP event before cpr_state_save(),
> since devices must be properly shut down before we send their FDs to CPR
> target. However, for the proper shut down to take place, we should also
> stop operating them beforehand, i.e. stop the VM. Thus the desirable
> order for cpr-transfer case looks as follows:
>
>   SOURCE                                  TARGET
>   ------                                  ------
>                                           cpr_state_load() blocks
>     |                                       |
>     | 1. migration_stop_vm()                |
>     |    VM stopped, devices quiesced       |
>     |                                       | Waiting for
>     | 2. notifiers (SETUP)                  | FDs from source
>     |    vhost_reset_owner() releases       |
>     |    device ownership                   |
>     |                                       |
>     | 3. cpr_state_save() ---- FDs -------> |
>     |                                       |
>     v                                       v
>   postmigrate                             Device init begins
>                                           - cpr_find_fd()
>                                           - vhost_dev_init()
>                                           - VHOST_SET_OWNER
>
> So step 3 is the synchronization/cut-over point. Target proceeds
> immediately upon receiving FDs, so steps 1-2 must complete successfully.
> Otherwise:
>
>   * Target's VHOST_SET_OWNER fails with -EBUSY (source still owns)
>   * Race between source I/O and target device init
>
> Let's stop the VM early (before FD transfer) to prevent this race.
> Unlike regular migration, CPR-transfer passes memory via FD (memfd)
> rather than copying RAM, so early VM stop should have minimal downtime.
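To make sure I read the ordering constraint correctly, here is a toy model of the three source-side steps. It is only a sketch: `struct source_state`, `run_step()` and `run_sequence()` are hypothetical names made up for illustration, not QEMU functions, and the model only checks that FDs are never sent before the VM is stopped and ownership is released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical stand-ins, not the real QEMU code paths. */
enum step { STEP_STOP_VM, STEP_NOTIFY_SETUP, STEP_SEND_FDS };

struct source_state {
    bool vm_stopped;          /* step 1 done */
    bool ownership_released;  /* step 2 done */
    bool fds_sent;            /* step 3 done */
};

/* Apply one step; return 0 on success, -1 if it runs out of order. */
static int run_step(struct source_state *s, enum step st)
{
    switch (st) {
    case STEP_STOP_VM:
        s->vm_stopped = true;
        return 0;
    case STEP_NOTIFY_SETUP:
        /* Devices can only be shut down once the VM no longer drives them. */
        if (!s->vm_stopped) {
            return -1;
        }
        s->ownership_released = true;
        return 0;
    case STEP_SEND_FDS:
        /* Cut-over point: the target proceeds as soon as it gets the FDs. */
        if (!s->vm_stopped || !s->ownership_released) {
            return -1;
        }
        s->fds_sent = true;
        return 0;
    }
    return -1;
}

/* Run a whole sequence; succeed only if the FDs were sent in a valid order. */
static int run_sequence(const enum step *steps, size_t n)
{
    struct source_state s = { false, false, false };

    assert(steps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (run_step(&s, steps[i]) != 0) {
            return -1;
        }
    }
    return s.fds_sent ? 0 : -1;
}
```

In this model, stop, notify, send is the only accepted order; any sequence that sends FDs earlier is rejected.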
Was this "minimum downtime impact" assertion actually tested/benchmarked with a VM under some serious load? I am especially curious how this interacts with VFIO devices that have a lot of data.
> Since we call migration_stop_vm() from qmp_migrate() (i.e. from the main
> thread), we should also balance it out by fallback resume outside of the
> migration thread - i.e. resume in migration_iteration_finish() is not
> enough. One good place for this resume op is migration_cleanup().
> However, in migration_iteration_finish() resume is gated by successful
> block activation, so we additionally traverse the block graph nodes to
> make sure activation did take place before doing vm_resume().
>
> This patch is a rework of the change originally proposed by Steve and Ben
> at [0].
>
> [0] https://lore.kernel.org/qemu-devel/[email protected]
>
> Originally-by: Steve Sistare <[email protected]>
> Originally-by: Ben Chaney <[email protected]>
> Signed-off-by: Andrey Drobyshev <[email protected]>
> ---
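As I understand the fallback-resume condition described above, it boils down to something like the sketch below. This is my own simplification under stated assumptions: `struct block_node`, `all_nodes_active()` and `should_resume_in_cleanup()` are hypothetical names, not the helpers used in the patch, and a flat array stands in for the block graph traversal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block graph node; not a QEMU type. */
struct block_node {
    bool active;
};

/* True only if every node has been (re)activated. */
static bool all_nodes_active(const struct block_node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!nodes[i].active) {
            return false;
        }
    }
    return true;
}

/*
 * Decide whether the cleanup path may resume the VM: only when this
 * migration stopped it early, and only once block activation succeeded.
 */
static bool should_resume_in_cleanup(bool stopped_early,
                                     const struct block_node *nodes,
                                     size_t n)
{
    assert(nodes != NULL || n == 0);
    return stopped_early && all_nodes_active(nodes, n);
}
```

The point of the gate is that a single inactive node is enough to keep the VM stopped, mirroring the existing check in the migration thread path.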
Thanks,
Maciej
