When live migration runs with 8, 16, or more multifd channels, the guest
hangs if network errors occur, such as missing TCP ACKs.

On the sender's side:
When the network problem occurs, one send thread fails in
qio_channel_write_all, which triggers migrate_fd_cleanup. The other send
threads are blocked in sendmsg and cannot be terminated, so the main
thread blocks in qemu_thread_join waiting for them to exit.

(gdb) bt
0  0x00007f30c8dcffc0 in __pthread_clockjoin_ex () at /lib64/libpthread.so.0
1  0x000055cbb716084b in qemu_thread_join (thread=0x55cbb881f418) at ../util/qemu-thread-posix.c:627
2  0x000055cbb6b54e40 in multifd_save_cleanup () at ../migration/multifd.c:542
3  0x000055cbb6b4de06 in migrate_fd_cleanup (s=0x55cbb8024000) at ../migration/migration.c:1808
4  0x000055cbb6b4dfb4 in migrate_fd_cleanup_bh (opaque=0x55cbb8024000) at ../migration/migration.c:1850
5  0x000055cbb7173ac1 in aio_bh_call (bh=0x55cbb7eb98e0) at ../util/async.c:141
6  0x000055cbb7173bcb in aio_bh_poll (ctx=0x55cbb7ebba80) at ../util/async.c:169
7  0x000055cbb715ba4b in aio_dispatch (ctx=0x55cbb7ebba80) at ../util/aio-posix.c:381
8  0x000055cbb7173ffe in aio_ctx_dispatch (source=0x55cbb7ebba80, callback=0x0, user_data=0x0) at ../util/async.c:311
9  0x00007f30c9c8cdf4 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
10 0x000055cbb71851a2 in glib_pollfds_poll () at ../util/main-loop.c:232
11 0x000055cbb718521c in os_host_main_loop_wait (timeout=42251070366) at ../util/main-loop.c:255
12 0x000055cbb7185321 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:531
13 0x000055cbb6e6ba27 in qemu_main_loop () at ../softmmu/runstate.c:726
14 0x000055cbb6ad6fd7 in main (argc=68, argv=0x7ffc0c578888, envp=0x7ffc0c578ab0) at ../softmmu/main.c:50

On the receiver's side:
Some receive threads fail to be created, and the ones that were created
are blocked in qemu_sem_wait. Migration does not start unless all receive
threads are created, so multifd_recv_sync_main, which posts the semaphore
to the receive threads, is never called. The receive threads therefore
wait on the semaphore and never return. They should not wait forever.
Use qemu_sem_timedwait with a 1000 ms timeout instead, so that a receive
thread stops waiting, returns, and the channels can be closed. With this
change the guest no longer hangs.

(gdb) bt
0  0x00007fd61c43f064 in do_futex_wait.constprop () at /lib64/libpthread.so.0
1  0x00007fd61c43f158 in __new_sem_wait_slow.constprop.0 () at /lib64/libpthread.so.0
2  0x000056075916014a in qemu_sem_wait (sem=0x56075b6515f0) at ../util/qemu-thread-posix.c:358
3  0x0000560758b56643 in multifd_recv_thread (opaque=0x56075b651550) at ../migration/multifd.c:1112
4  0x0000560759160598 in qemu_thread_start (args=0x56075befad00) at ../util/qemu-thread-posix.c:556
5  0x00007fd61c43594a in start_thread () at /lib64/libpthread.so.0
6  0x00007fd61c158d0f in clone () at /lib64/libc.so.6

Signed-off-by: Li Zhang <lizh...@suse.de>
---
 migration/multifd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 7c9deb1921..656239ca2a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1109,7 +1109,7 @@ static void *multifd_recv_thread(void *opaque)
 
         if (flags & MULTIFD_FLAG_SYNC) {
             qemu_sem_post(&multifd_recv_state->sem_sync);
-            qemu_sem_wait(&p->sem_sync);
+            qemu_sem_timedwait(&p->sem_sync, 1000);
         }
     }
 
-- 
2.31.1

