On Thu, 19 Feb 2026 16:33:19 -0500 Peter Xu <[email protected]> wrote:
> On Wed, Feb 18, 2026 at 10:29:39PM +0100, Lukas Straub wrote: > > In the colo migration unit test, we shutdown all sockets with yank and > > then stop qemu with SIGTERM. During shutdown migration_shutdown() calls > > migration_cancel(). Now the colo thread frees s->rp_state.from_dst_file > > which races with migration_cancel() checking for NULL and potentially > > calling > > qemu_file_shutdown() on it. > > > > Fix this by taking the s->qemu_file_lock. > > > > Signed-off-by: Lukas Straub <[email protected]> > > The whole patch taking the mutex is reasonable, but it didn't explain one > thing, on why COLO is special here on managing the return path channel.. > > colo_process_checkpoint() does re-opening of rp channel even if it was > partially shutdown before: > > colo_process_checkpoint(): > s->rp_state.from_dst_file = qemu_file_get_return_path(s->to_dst_file); > > Should we instead leave all these to migration core? > E.g. migration_completion() will do close_return_path_on_source(), but I > wonder if that should only shutdown & close the return path channel when > without COLO running. Then IIUC we can also leave the cleanup of the > qemufiles to migration_cleanup(), as it's also not special to COLO IIUC. > > What do you think? It is a bit more involved since the return path needs to be always opened for this. I implemented this in my current patchset. Regards, Lukas Straub > > Thanks, > > > --- > > migration/colo.c | 23 +++++++++++++++-------- > > 1 file changed, 15 insertions(+), 8 deletions(-) > > > > diff --git a/migration/colo.c b/migration/colo.c > > index > > ce02c71d8857d470be434bdf3a9cacad3baab0d5..180793fe3f25140fa10887acc3d87515ebf43ac9 > > 100644 > > --- a/migration/colo.c > > +++ b/migration/colo.c > > @@ -173,11 +173,13 @@ static void primary_vm_do_failover(void) > > * The s->rp_state.from_dst_file and s->to_dst_file may use the > > * same fd, but we still shutdown the fd for twice, it is harmless. > > */ > > - if (s->to_dst_file) { > > - qemu_file_shutdown(s->to_dst_file); > > - } > > - if (s->rp_state.from_dst_file) { > > - qemu_file_shutdown(s->rp_state.from_dst_file); > > + WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) { > > + if (s->to_dst_file) { > > + qemu_file_shutdown(s->to_dst_file); > > + } > > + if (s->rp_state.from_dst_file) { > > + qemu_file_shutdown(s->rp_state.from_dst_file); > > + } > > } > > > > old_state = failover_set_state(FAILOVER_STATUS_ACTIVE, > > @@ -544,11 +546,14 @@ static void colo_process_checkpoint(MigrationState *s) > > > > failover_init_state(); > > > > + qemu_mutex_lock(&s->qemu_file_lock); > > s->rp_state.from_dst_file = qemu_file_get_return_path(s->to_dst_file); > > if (!s->rp_state.from_dst_file) { > > + qemu_mutex_unlock(&s->qemu_file_lock); > > error_report("Open QEMUFile from_dst_file failed"); > > goto out; > > } > > + qemu_mutex_unlock(&s->qemu_file_lock); > > > > packets_compare_notifier.notify = colo_compare_notify_checkpoint; > > colo_compare_register_notifier(&packets_compare_notifier); > > @@ -640,9 +645,11 @@ out: > > * Or the failover BH may shutdown the wrong fd that > > * re-used by other threads after we release here. > > */ > > - if (s->rp_state.from_dst_file) { > > - qemu_fclose(s->rp_state.from_dst_file); > > - s->rp_state.from_dst_file = NULL; > > + WITH_QEMU_LOCK_GUARD(&s->qemu_file_lock) { > > + if (s->rp_state.from_dst_file) { > > + qemu_fclose(s->rp_state.from_dst_file); > > + s->rp_state.from_dst_file = NULL; > > + } > > } > > } > > > > > > -- > > 2.39.5 > > >
pgpnC5jv4KRvR.pgp
Description: OpenPGP digital signature
