* Leonardo Bras (leob...@redhat.com) wrote:
> From the source host's viewpoint, losing a connection during migration
> will cause the sockets to get stuck in the sendmsg() syscall, waiting
> for the receiving side to reply.
> 
> In migration, yank works by shutting down the migration QIOChannel fd.
> This causes a failure in the next sendmsg() for that fd, and the whole
> migration gets cancelled.
> 
> In multifd, due to having multiple sockets in multiple threads,
> on a connection loss there will be extra sockets stuck in sendmsg(),
> and because they will be holding their own mutex, there is a good chance
> the main migration thread can get stuck in multifd_send_pages(),
> waiting for one of those mutexes.
> 
> While it's waiting, the main migration thread can't run sendmsg() on
> its fd, and therefore can't cause the migration to be cancelled, thus
> causing yank not to work.
> 
> Fix this by shutting down all migration fds (including multifd ones),
> so no thread gets stuck in sendmsg() while holding a lock, thus
> allowing the main migration thread to properly cancel migration when
> yank is used.
> 
> There is no need for the same procedure on the receiving host for yank
> to work, since ops->recv_pages() is kept outside the mutex-protected
> code in multifd_recv_thread().
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1970337
> Reported-by: Li Xiaohui <xiao...@redhat.com>
> Signed-off-by: Leonardo Bras <leob...@redhat.com>
> ---
>  migration/multifd.c        | 11 +++++++++++
>  migration/multifd.h        |  1 +
>  migration/yank_functions.c |  2 ++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 377da78f5b..744a180dfe 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1040,6 +1040,17 @@ void multifd_recv_sync_main(void)
>      trace_multifd_recv_sync_main(multifd_recv_state->packet_num);
>  }
> 
> +void multifd_shutdown(void)
> +{
> +    if (!migrate_use_multifd()) {
> +        return;
> +    }
> +
> +    if (multifd_send_state) {
> +        multifd_send_terminate_threads(NULL);
> +    }
That calls:

    for (i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        qemu_mutex_lock(&p->mutex);
        p->quit = true;
        qemu_sem_post(&p->sem);
        qemu_mutex_unlock(&p->mutex);
    }

so why doesn't this also get stuck on the same mutex you're trying to fix?
Does qio_channel_shutdown() actually cause a shutdown on all fds for the
multifd?

(I've just seen the multifd/cancel test fail, stuck in
multifd_send_sync_main() waiting on one of the locks.)

Dave

> +}
> +
>  static void *multifd_recv_thread(void *opaque)
>  {
>      MultiFDRecvParams *p = opaque;
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 8d6751f5ed..0517213bdf 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -22,6 +22,7 @@ bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>  void multifd_recv_sync_main(void);
>  void multifd_send_sync_main(QEMUFile *f);
>  int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
> +void multifd_shutdown(void);
> 
>  /* Multifd Compression flags */
>  #define MULTIFD_FLAG_SYNC (1 << 0)
> diff --git a/migration/yank_functions.c b/migration/yank_functions.c
> index 8c08aef14a..9335a64f00 100644
> --- a/migration/yank_functions.c
> +++ b/migration/yank_functions.c
> @@ -15,12 +15,14 @@
>  #include "io/channel-socket.h"
>  #include "io/channel-tls.h"
>  #include "qemu-file.h"
> +#include "multifd.h"
> 
>  void migration_yank_iochannel(void *opaque)
>  {
>      QIOChannel *ioc = QIO_CHANNEL(opaque);
> 
>      qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> +    multifd_shutdown();
>  }
> 
>  /* Return whether yank is supported on this ioc */
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK