On 6/16/26 9:05 AM, Dongli Zhang wrote:
> 
> 
> On 2026-06-12 9:51 AM, Andrey Drobyshev wrote:
>> Devices register their own callbacks to MIG_EVENT_* notifications and
>> process them accordingly.  But some devices, especially during CPR-style
>> migration, might need to be aware and process SETUP event earlier on.  One
>> such example is vhost-based devs, as they need to call RESET_OWNER ioctl
>> before the target (in the cpr-transfer case) tries to call SET_OWNER.
>>
>> Let's move SETUP notification to an earlier place in qmp_migrate().  Before
>> this change, SETUP was firing shortly before migration thread creation, and
>> FAILED notification balancing it was only fired from
>> migration_iteration_finish() (ran in the thread).  But now by moving SETUP
>> to earlier we widen the window.  Hence let's also add FAILED to
>> migration_connect_error_propagate() and the special CPR-related case in
>> migration_cancel(), to balance things out.
>>
>> This patch is a rework of the change originally proposed by Steve and
>> Ben at [0].
>>
>> [0] 
>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/[email protected]__;!!ACWV5N9M2RV99hQ!PqTirI0SC9ktAapICP7w1YqQ7Fwlq0iPjHspIeoQU_M2xXPnQ1X5CJSCrLZEs5H-7rWU1DU_s-tNL6DD8riRAcMWPwp49pQVqQ$
>>  
>>
>> Originally-by: Steve Sistare <[email protected]>
>> Originally-by: Ben Chaney <[email protected]>
>> Signed-off-by: Andrey Drobyshev <[email protected]>
>> ---
>>  migration/migration.c | 23 ++++++++++++++++++-----
>>  1 file changed, 18 insertions(+), 5 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 074d3f2c697..3c70467d314 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1471,6 +1471,13 @@ void migration_connect_error_propagate(MigrationState 
>> *s, Error *error)
>>      migrate_error_propagate(s, error);
>>  
>>      if (!resume) {
>> +        /*
>> +         * Emit FAILED for every pre-thread failure/cancel that ends up 
>> here,
>> +         * to balance the SETUP sent in qmp_migrate().
>> +         */
>> +        if (migration_has_failed(s)) {
>> +            migration_call_notifiers(MIG_EVENT_FAILED, NULL);
>> +        }
>>          migration_cleanup(s);
>>      }
>>  }
>> @@ -1530,6 +1537,12 @@ void migration_cancel(void)
>>      if (cpr_transfer_source_active(s)) {
>>          assert(migrate_mode() == MIG_MODE_CPR_TRANSFER);
>>          assert(setup && !s->to_dst_file);
>> +        /*
>> +         * This path bypasses both migration_iteration_finish() and
>> +         * migration_connect_error_propagate(), so emit FAILED here to 
>> balance
>> +         * the SETUP sent in qmp_migrate(), before the VM is resumed.
>> +         */
>> +        migration_call_notifiers(MIG_EVENT_FAILED, NULL);
>>          migration_cleanup(s);
>>          /* Now all things should have been released */
>>          assert(!cpr_transfer_source_active(s));
>> @@ -2117,6 +2130,11 @@ void qmp_migrate(const char *uri, bool has_channels,
>>       */
>>      Error *local_err = NULL;
>>  
>> +    /* Notify before starting migration thread and before starting CPR */
>> +    if (migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
>> +        goto out;
>> +    }
>> +
> 
> This affects not only CPR, but also regular live migration and postcopy 
> recovery
> (i.e. when resume=true). Even with resume=true, qmp_migrate() now triggers
> MIG_EVENT_SETUP unconditionally at the beginning. If recovery fails,
> migration_connect_error_propagate() treats it as the resume case and does not
> trigger MIG_EVENT_FAILED.
> 
> Have you tested regular live migration with resume==true?
> 
> Thank you very much!
> 
> Dongli Zhang
>

Hello Dongli,

I just tried that exact case - postcopy recovery with resume=true.
Although this doesn't lead to immediate migration failure, I can see
with some extra instrumentation that we indeed have 2 SETUP events vs 1
DONE, i.e. events become unbalanced in this case.

Since in migration_start_outgoing() the SETUP event was gated by "if
(resume)", I believe the fix should be as simple as:

>      /* Notify before starting migration thread and before starting CPR */
> -    if (migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
> +    if (!(has_resume && resume) &&
> +        migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
>          goto out;
>      }


Thank you for pointing that out!  Will adjust in v2.

Andrey

>>      if (!cpr_state_save(cpr_ch, &local_err)) {
>>          goto out;
>>      }
>> @@ -3893,11 +3911,6 @@ void migration_start_outgoing(MigrationState *s)
>>      } else {
>>          /* This is a fresh new migration */
>>          rate_limit = migrate_max_bandwidth();
>> -
>> -        /* Notify before starting migration thread */
>> -        if (migration_call_notifiers(MIG_EVENT_SETUP, &local_err)) {
>> -            goto fail;
>> -        }
>>      }
>>  
>>      migration_rate_set(rate_limit);
> 


Reply via email to