Ivan Ren <reny...@gmail.com> wrote: > When migrate_cancel a multifd migration, if run sequence like this: > > [source] [destination] > > multifd_send_sync_main[finish] > multifd_recv_thread wait &p->sem_sync > shutdown to_dst_file > detect error from_src_file > send RAM_SAVE_FLAG_EOS[fail] [no chance to run multifd_recv_sync_main] > multifd_load_cleanup > join multifd receive thread forever > > will lead destination qemu hung at following stack: > > pthread_join > qemu_thread_join > multifd_load_cleanup > process_incoming_migration_co > coroutine_trampoline > > Signed-off-by: Ivan Ren <ivan...@tencent.com>
I think this one is not enough. We need to set some error code, or disable the running bit at that point. > --- > migration/ram.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/migration/ram.c b/migration/ram.c > index e4eb9c441f..504c8ccb03 100644 > --- a/migration/ram.c > +++ b/migration/ram.c > @@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp) > MultiFDRecvParams *p = &multifd_recv_state->params[i]; > > if (p->running) { > + /* > + * multifd_recv_thread may hung at MULTIFD_FLAG_SYNC handle code, > + * however try to wakeup it without harm in cleanup phase. > + */ > + qemu_sem_post(&p->sem_sync); > qemu_thread_join(&p->thread); > } > object_unref(OBJECT(p->c)); Let's see where we wait for p->sem_sync: static void *multifd_recv_thread(void *opaque) { .... while (true) { uint32_t used; uint32_t flags; ret = qio_channel_read_all_eof(p->c, (void *)p->packet, p->packet_len, &local_err); ..... if (flags & MULTIFD_FLAG_SYNC) { qemu_sem_post(&multifd_recv_state->sem_sync); qemu_sem_wait(&p->sem_sync); } } if (local_err) { multifd_recv_terminate_threads(local_err); } qemu_mutex_lock(&p->mutex); p->running = false; qemu_mutex_unlock(&p->mutex); rcu_unregister_thread(); trace_multifd_recv_thread_end(p->id, p->num_packets, p->num_pages); return NULL; } If we just post it there, we get out of the wait (that bit is ok), but then we go back to the beggining of the bucle, we (probably) got one error on the qui_channel_read_all_eof(), and we go back to multifd_recv_terminate_threads(), or wait there. I think that it is better to *also* set an p->quit variable there, and not even try to receive anything for that channel? I will send a patch later. Good catch. Later, Juan.