On Thu, 19 Feb 2026 16:33:19 -0500
Peter Xu <[email protected]> wrote:

> On Wed, Feb 18, 2026 at 10:29:39PM +0100, Lukas Straub wrote:
> > In the colo migration unit test, we shutdown all sockets with yank and
> > then stop qemu with SIGTERM. During shutdown migration_shutdown() calls
> > migration_cancel(). Now the colo thread frees s->rp_state.from_dst_file
> > which races with migration_cancel() checking for NULL and potentially 
> > calling
> > qemu_file_shutdown() on it.
> > 
> > Fix this by taking the s->qemu_file_lock.
> > 
> > Signed-off-by: Lukas Straub <[email protected]>  
> 
> The whole patch taking the mutex is reasonable, but it didn't explain one
> thing, on why COLO is special here on managing the return path channel..
> 
> colo_process_checkpoint() does re-opening of rp channel even if it was
> partially shutdown before:
> 
> colo_process_checkpoint():
>     s->rp_state.from_dst_file = qemu_file_get_return_path(s->to_dst_file);
> 
> Should we instead leave all these to migration core?
> E.g. migration_completion() will do close_return_path_on_source(), but I
> wonder if that should only shutdown & close the return path channel when
> without COLO running.  Then IIUC we can also leave the cleanup of the
> qemufiles to migration_cleanup(), as it's also not special to COLO IIUC.
> 
> What do you think?

It is a bit more involved since the return path needs to be always
opened for this. I implemented this in my current patchset.

Regards,
Lukas Straub

> 
> Thanks,
> 
> > ---
> >  migration/colo.c | 23 +++++++++++++++--------
> >  1 file changed, 15 insertions(+), 8 deletions(-)
> > 
> > diff --git a/migration/colo.c b/migration/colo.c
> > index 
> > ce02c71d8857d470be434bdf3a9cacad3baab0d5..180793fe3f25140fa10887acc3d87515ebf43ac9
> >  100644
> > --- a/migration/colo.c
> > +++ b/migration/colo.c
> > @@ -173,11 +173,13 @@ static void primary_vm_do_failover(void)
> >       * The s->rp_state.from_dst_file and s->to_dst_file may use the
> >       * same fd, but we still shutdown the fd for twice, it is harmless.
> >       */
> > -    if (s->to_dst_file) {
> > -        qemu_file_shutdown(s->to_dst_file);
> > -    }
> > -    if (s->rp_state.from_dst_file) {
> > -        qemu_file_shutdown(s->rp_state.from_dst_file);
> > +    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
> > +        if (s->to_dst_file) {
> > +            qemu_file_shutdown(s->to_dst_file);
> > +        }
> > +        if (s->rp_state.from_dst_file) {
> > +            qemu_file_shutdown(s->rp_state.from_dst_file);
> > +        }
> >      }
> >  
> >      old_state = failover_set_state(FAILOVER_STATUS_ACTIVE,
> > @@ -544,11 +546,14 @@ static void colo_process_checkpoint(MigrationState *s)
> >  
> >      failover_init_state();
> >  
> > +    qemu_mutex_lock(&s->qemu_file_lock);
> >      s->rp_state.from_dst_file = qemu_file_get_return_path(s->to_dst_file);
> >      if (!s->rp_state.from_dst_file) {
> > +        qemu_mutex_unlock(&s->qemu_file_lock);
> >          error_report("Open QEMUFile from_dst_file failed");
> >          goto out;
> >      }
> > +    qemu_mutex_unlock(&s->qemu_file_lock);
> >  
> >      packets_compare_notifier.notify = colo_compare_notify_checkpoint;
> >      colo_compare_register_notifier(&packets_compare_notifier);
> > @@ -640,9 +645,11 @@ out:
> >       * Or the failover BH may shutdown the wrong fd that
> >       * re-used by other threads after we release here.
> >       */
> > -    if (s->rp_state.from_dst_file) {
> > -        qemu_fclose(s->rp_state.from_dst_file);
> > -        s->rp_state.from_dst_file = NULL;
> > +    WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) {
> > +        if (s->rp_state.from_dst_file) {
> > +            qemu_fclose(s->rp_state.from_dst_file);
> > +            s->rp_state.from_dst_file = NULL;
> > +        }
> >      }
> >  }
> >  
> > 
> > -- 
> > 2.39.5
> >   
> 

Attachment: pgpnC5jv4KRvR.pgp
Description: OpenPGP digital signature

Reply via email to