Peter Xu <pet...@redhat.com> writes: > When a multifd sender thread hit errors, it always needs to kick the main > thread by kicking all the semaphores that it can be waiting upon. > > Provide a helper for it and deduplicate the code. > > Signed-off-by: Peter Xu <pet...@redhat.com> > --- > migration/multifd.c | 21 +++++++++++++++------ > 1 file changed, 15 insertions(+), 6 deletions(-) > > diff --git a/migration/multifd.c b/migration/multifd.c > index 4afdd88602..33fb21d0e4 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -374,6 +374,18 @@ struct { > MultiFDMethods *ops; > } *multifd_send_state; > > +/* > + * The migration thread can wait on either of the two semaphores. This > + * function can be used to kick the main thread out of waiting on either of > + * them. Should mostly only be called when something wrong happened with > + * the current multifd send thread. > + */ > +static void multifd_send_kick_main(MultiFDSendParams *p) > +{ > + qemu_sem_post(&p->sem_sync); > + qemu_sem_post(&multifd_send_state->channels_ready); > +} > + > /* > * How we use multifd_send_state->pages and channel->pages? > * > @@ -746,8 +758,7 @@ out: > assert(local_err); > trace_multifd_send_error(p->id); > multifd_send_terminate_threads(local_err); > - qemu_sem_post(&p->sem_sync); > - qemu_sem_post(&multifd_send_state->channels_ready); > + multifd_send_kick_main(p); > error_free(local_err); > } > > @@ -787,8 +798,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task, > * is not created, and then tell who pay attention to me. > */ > p->quit = true; > - qemu_sem_post(&multifd_send_state->channels_ready); > - qemu_sem_post(&p->sem_sync); > + multifd_send_kick_main(p);
There's a bug here in the original code: It's not really safe to call any of these outside of the channel lock because multifd_save_cleanup() could execute at the same time and call qemu_sem_destroy() -> qemu_mutex_destroy(), which can assert because we might be holding the sem_lock. It seems the reason we get away with this today is merely due to timing. A subset of this problem was already encountered here: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup https://lore.kernel.org/r/20230621081826.3203053-1-zhangjiangu...@huawei.com We could probably release the semaphores for all channels at once inside multifd_save_cleanup() in the main thread. We'd have a multifd_send_kick_main() in each channel when it fails and this: void multifd_save_cleanup(void) { ... for (i = 0; i < migrate_multifd_channels(); i++) { MultiFDSendParams *p = &multifd_send_state->params[i]; if (p->running) { qemu_thread_join(&p->thread); } else { multifd_send_kick_main(p); } } for (i = 0; i < migrate_multifd_channels(); i++) { qemu_sem_destroy, etc... } ... }