On Wed, Feb 18, 2026 at 10:29:39PM +0100, Lukas Straub wrote:
> In the colo migration unit test, we shutdown all sockets with yank and
> then stop qemu with SIGTERM. During shutdown migration_shutdown() calls
> migration_cancel(). Now the colo thread frees s->rp_state.from_dst_file
> which races with migration_cancel() checking for NULL and potentially calling
> qemu_file_shutdown() on it.
> 
> Fix this by taking the s->qemu_file_lock.
> 
> Signed-off-by: Lukas Straub <[email protected]>

The whole patch taking the mutex is reasonable, but it didn't explain one
thing, on why COLO is special here on managing the return path channel..

colo_process_checkpoint() does re-opening of rp channel even if it was
partially shutdown before:

colo_process_checkpoint():
    s->rp_state.from_dst_file = qemu_file_get_return_path(s->to_dst_file);

Should we instead leave all these to migration core?
E.g. migration_completion() will do close_return_path_on_source(), but I
wonder if that should only shutdown & close the return path channel when
without COLO running.  Then IIUC we can also leave the cleanup of the
qemufiles to migration_cleanup(), as it's also not special to COLO IIUC.

What do you think?

Thanks,

> ---
>  migration/colo.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c
> index 
> ce02c71d8857d470be434bdf3a9cacad3baab0d5..180793fe3f25140fa10887acc3d87515ebf43ac9
>  100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -173,11 +173,13 @@ static void primary_vm_do_failover(void)
>       * The s->rp_state.from_dst_file and s->to_dst_file may use the
>       * same fd, but we still shutdown the fd for twice, it is harmless.
>       */
> -    if (s->to_dst_file) {
> -        qemu_file_shutdown(s->to_dst_file);
> -    }
> -    if (s->rp_state.from_dst_file) {
> -        qemu_file_shutdown(s->rp_state.from_dst_file);
> +    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
> +        if (s->to_dst_file) {
> +            qemu_file_shutdown(s->to_dst_file);
> +        }
> +        if (s->rp_state.from_dst_file) {
> +            qemu_file_shutdown(s->rp_state.from_dst_file);
> +        }
>      }
>  
>      old_state = failover_set_state(FAILOVER_STATUS_ACTIVE,
> @@ -544,11 +546,14 @@ static void colo_process_checkpoint(MigrationState *s)
>  
>      failover_init_state();
>  
> +    qemu_mutex_lock(&s->qemu_file_lock);
>      s->rp_state.from_dst_file = qemu_file_get_return_path(s->to_dst_file);
>      if (!s->rp_state.from_dst_file) {
> +        qemu_mutex_unlock(&s->qemu_file_lock);
>          error_report("Open QEMUFile from_dst_file failed");
>          goto out;
>      }
> +    qemu_mutex_unlock(&s->qemu_file_lock);
>  
>      packets_compare_notifier.notify = colo_compare_notify_checkpoint;
>      colo_compare_register_notifier(&packets_compare_notifier);
> @@ -640,9 +645,11 @@ out:
>       * Or the failover BH may shutdown the wrong fd that
>       * re-used by other threads after we release here.
>       */
> -    if (s->rp_state.from_dst_file) {
> -        qemu_fclose(s->rp_state.from_dst_file);
> -        s->rp_state.from_dst_file = NULL;
> +    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
> +        if (s->rp_state.from_dst_file) {
> +            qemu_fclose(s->rp_state.from_dst_file);
> +            s->rp_state.from_dst_file = NULL;
> +        }
>      }
>  }
>  
> 
> -- 
> 2.39.5
> 

-- 
Peter Xu


Reply via email to