> If we just post it there, we get out of the wait (that bit is ok), but > then we go back to the beggining of the bucle, we (probably) got one > error on the qui_channel_read_all_eof(), and we go back to > multifd_recv_terminate_threads(), or wait there. > > I think that it is better to *also* set an p->quit variable there, and > not even try to receive anything for that channel? > > I will send a patch later.
Yes, agree. Thanks for review. On Wed, Jul 24, 2019 at 5:01 PM Juan Quintela <quint...@redhat.com> wrote: > Ivan Ren <reny...@gmail.com> wrote: > > When migrate_cancel a multifd migration, if run sequence like this: > > > > [source] [destination] > > > > multifd_send_sync_main[finish] > > multifd_recv_thread wait &p->sem_sync > > shutdown to_dst_file > > detect error from_src_file > > send RAM_SAVE_FLAG_EOS[fail] [no chance to run > multifd_recv_sync_main] > > multifd_load_cleanup > > join multifd receive thread forever > > > > will lead destination qemu hung at following stack: > > > > pthread_join > > qemu_thread_join > > multifd_load_cleanup > > process_incoming_migration_co > > coroutine_trampoline > > > > Signed-off-by: Ivan Ren <ivan...@tencent.com> > > I think this one is not enough. We need to set some error code, or > disable the running bit at that point. > > > --- > > migration/ram.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/migration/ram.c b/migration/ram.c > > index e4eb9c441f..504c8ccb03 100644 > > --- a/migration/ram.c > > +++ b/migration/ram.c > > @@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp) > > MultiFDRecvParams *p = &multifd_recv_state->params[i]; > > > > if (p->running) { > > + /* > > + * multifd_recv_thread may hung at MULTIFD_FLAG_SYNC handle > code, > > + * however try to wakeup it without harm in cleanup phase. > > + */ > > + qemu_sem_post(&p->sem_sync); > > qemu_thread_join(&p->thread); > > } > > object_unref(OBJECT(p->c)); > > Let's see where we wait for p->sem_sync: > > > static void *multifd_recv_thread(void *opaque) > { > .... > while (true) { > uint32_t used; > uint32_t flags; > > ret = qio_channel_read_all_eof(p->c, (void *)p->packet, > p->packet_len, &local_err); > ..... > > if (flags & MULTIFD_FLAG_SYNC) { > qemu_sem_post(&multifd_recv_state->sem_sync); > qemu_sem_wait(&p->sem_sync); > } > } > if (local_err) { > multifd_recv_terminate_threads(local_err); > } > qemu_mutex_lock(&p->mutex); > p->running = false; > qemu_mutex_unlock(&p->mutex); > > rcu_unregister_thread(); > trace_multifd_recv_thread_end(p->id, p->num_packets, p->num_pages); > > return NULL; > } > > If we just post it there, we get out of the wait (that bit is ok), but > then we go back to the beggining of the bucle, we (probably) got one > error on the qui_channel_read_all_eof(), and we go back to > multifd_recv_terminate_threads(), or wait there. > > I think that it is better to *also* set an p->quit variable there, and > not even try to receive anything for that channel? > > I will send a patch later. > > Good catch. > > Later, Juan. >