On 6/16/26 9:05 AM, Dongli Zhang wrote:
>
>
> On 2026-06-12 9:51 AM, Andrey Drobyshev wrote:
>> Devices register their own callbacks to MIG_EVENT_* notifications and
>> process them accordingly. But some devices, especially during CPR-style
>> migration, might need to be aware and process SETUP event earlier on. One
>> such example is vhost-based devs, as they need to call RESET_OWNER ioctl
>> before the target (in the cpr-transfer case) tries to call SET_OWNER.
>>
>> Let's move SETUP notification to an earlier place in qmp_migrate(). Before
>> this change, SETUP was firing shortly before migration thread creation, and
>> FAILED notification balancing it was only fired from
>> migration_iteration_finish() (ran in the thread). But now by moving SETUP
>> to earlier we widen the window. Hence let's also add FAILED to
>> migration_connect_error_propagate() and the special CPR-related case in
>> migration_cancel(), to balance things out.
>>
>> This patch is a rework of the change originally proposed by Steve and
>> Ben at [0].
>>
>> [0]
>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/[email protected]__;!!ACWV5N9M2RV99hQ!PqTirI0SC9ktAapICP7w1YqQ7Fwlq0iPjHspIeoQU_M2xXPnQ1X5CJSCrLZEs5H-7rWU1DU_s-tNL6DD8riRAcMWPwp49pQVqQ$
>>
>>
>> Originally-by: Steve Sistare <[email protected]>
>> Originally-by: Ben Chaney <[email protected]>
>> Signed-off-by: Andrey Drobyshev <[email protected]>
>> ---
>> migration/migration.c | 23 ++++++++++++++++++-----
>> 1 file changed, 18 insertions(+), 5 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 074d3f2c697..3c70467d314 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1471,6 +1471,13 @@ void migration_connect_error_propagate(MigrationState
>> *s, Error *error)
>> migrate_error_propagate(s, error);
>>
>> if (!resume) {
>> + /*
>> + * Emit FAILED for every pre-thread failure/cancel that ends up
>> here,
>> + * to balance the SETUP sent in qmp_migrate().
>> + */
>> + if (migration_has_failed(s)) {
>> + migration_call_notifiers(MIG_EVENT_FAILED, NULL);
>> + }
>> migration_cleanup(s);
>> }
>> }
>> @@ -1530,6 +1537,12 @@ void migration_cancel(void)
>> if (cpr_transfer_source_active(s)) {
>> assert(migrate_mode() == MIG_MODE_CPR_TRANSFER);
>> assert(setup && !s->to_dst_file);
>> + /*
>> + * This path bypasses both migration_iteration_finish() and
>> + * migration_connect_error_propagate(), so emit FAILED here to
>> balance
>> + * the SETUP sent in qmp_migrate(), before the VM is resumed.
>> + */
>> + migration_call_notifiers(MIG_EVENT_FAILED, NULL);
>> migration_cleanup(s);
>> /* Now all things should have been released */
>> assert(!cpr_transfer_source_active(s));
>> @@ -2117,6 +2130,11 @@ void qmp_migrate(const char *uri, bool has_channels,
>> */
>> Error *local_err = NULL;
>>
>> + /* Notify before starting migration thread and before starting CPR */
>> + if (migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
>> + goto out;
>> + }
>> +
>
> This affects not only CPR, but also regular live migration and postcopy
> recovery
> (i.e. when resume=true). Even with resume=true, qmp_migrate() now triggers
> MIG_EVENT_SETUP unconditionally at the beginning. If recovery fails,
> migration_connect_error_propagate() treats it as the resume case and does not
> trigger MIG_EVENT_FAILED.
>
> Have you tested regular live migration with resume==true?
>
> Thank you very much!
>
> Dongli Zhang
>
Hello Dongli,
I just tried that exact case - postcopy recovery with resume=true.
Although this doesn't lead to immediate migration failure, I can see
with some extra instrumentation that we indeed have 2 SETUP events vs 1
DONE, i.e. events become unbalanced in this case.
Since in migration_start_outgoing() the SETUP event was gated by "if
(resume)", I believe the fix should be as simple as:
> /* Notify before starting migration thread and before starting CPR */
> - if (migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
> + if (!(has_resume && resume) &&
> + migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
> goto out;
> }
Thank you for pointing that out! Will adjust in v2.
Andrey
>> if (!cpr_state_save(cpr_ch, &local_err)) {
>> goto out;
>> }
>> @@ -3893,11 +3911,6 @@ void migration_start_outgoing(MigrationState *s)
>> } else {
>> /* This is a fresh new migration */
>> rate_limit = migrate_max_bandwidth();
>> -
>> - /* Notify before starting migration thread */
>> - if (migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
>> - goto fail;
>> - }
>> }
>>
>> migration_rate_set(rate_limit);
>