> If we just post it there, we get out of the wait (that bit is ok), but
> then we go back to the beginning of the loop, we (probably) get an
> error on the qio_channel_read_all_eof(), and we go back to
> multifd_recv_terminate_threads(), or wait there.
>
> I think that it is better to *also* set a p->quit variable there, and
> not even try to receive anything for that channel?
>
> I will send a patch later.

Yes, agree.

Thanks for review.

On Wed, Jul 24, 2019 at 5:01 PM Juan Quintela <quint...@redhat.com> wrote:

> Ivan Ren <reny...@gmail.com> wrote:
> > When migrate_cancel is issued during a multifd migration, and a sequence like this runs:
> >
> >         [source]                              [destination]
> >
> > multifd_send_sync_main[finish]
> >                                     multifd_recv_thread wait &p->sem_sync
> > shutdown to_dst_file
> >                                     detect error from_src_file
> > send  RAM_SAVE_FLAG_EOS[fail]       [no chance to run multifd_recv_sync_main]
> >                                     multifd_load_cleanup
> >                                     join multifd receive thread forever
> >
> > it will leave the destination qemu hung at the following stack:
> >
> > pthread_join
> > qemu_thread_join
> > multifd_load_cleanup
> > process_incoming_migration_co
> > coroutine_trampoline
> >
> > Signed-off-by: Ivan Ren <ivan...@tencent.com>
>
> I think this one is not enough.  We need to set some error code, or
> disable the running bit at that point.
>
> > ---
> >  migration/ram.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index e4eb9c441f..504c8ccb03 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1291,6 +1291,11 @@ int multifd_load_cleanup(Error **errp)
> >          MultiFDRecvParams *p = &multifd_recv_state->params[i];
> >
> >          if (p->running) {
> > +            /*
> > +             * multifd_recv_thread may hang in the MULTIFD_FLAG_SYNC
> > +             * handling code; waking it here is harmless in cleanup.
> > +             */
> > +            qemu_sem_post(&p->sem_sync);
> >              qemu_thread_join(&p->thread);
> >          }
> >          object_unref(OBJECT(p->c));
>
> Let's see where we wait for p->sem_sync:
>
>
> static void *multifd_recv_thread(void *opaque)
> {
>     ....
>     while (true) {
>         uint32_t used;
>         uint32_t flags;
>
>         ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
>                                        p->packet_len, &local_err);
>         .....
>
>         if (flags & MULTIFD_FLAG_SYNC) {
>             qemu_sem_post(&multifd_recv_state->sem_sync);
>             qemu_sem_wait(&p->sem_sync);
>         }
>     }
>     if (local_err) {
>         multifd_recv_terminate_threads(local_err);
>     }
>     qemu_mutex_lock(&p->mutex);
>     p->running = false;
>     qemu_mutex_unlock(&p->mutex);
>
>     rcu_unregister_thread();
>     trace_multifd_recv_thread_end(p->id, p->num_packets, p->num_pages);
>
>     return NULL;
> }
>
> If we just post it there, we get out of the wait (that bit is ok), but
> then we go back to the beginning of the loop, we (probably) get an
> error on the qio_channel_read_all_eof(), and we go back to
> multifd_recv_terminate_threads(), or wait there.
>
> I think that it is better to *also* set a p->quit variable there, and
> not even try to receive anything for that channel?
>
> I will send a patch later.
>
> Good catch.
>
> Later, Juan.
>
