* Li Zhang (lizh...@suse.de) wrote:
> When testing live migration with multifd channels (8, 16, or a bigger number)
> and using qemu -incoming (without "defer"), if a network error occurs
> (for example, triggering the kernel SYN flooding detection),
> the migration fails and the guest hangs forever.
>
> The test environment and the command lines are as follows:
>
> QEMU version: QEMU emulator version 6.2.91 (v6.2.0-rc1-47-gc5fbdd60cf)
> Host OS: SLE 15 with kernel: 5.14.5-1-default
> Network Card: mlx5 100Gbps
> Network card: Intel Corporation I350 Gigabit (1Gbps)
>
> Source:
> qemu-system-x86_64 -M q35 -smp 32 -nographic \
>     -serial telnet:10.156.208.153:4321,server,nowait \
>     -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
>     -monitor stdio
> Dest:
> qemu-system-x86_64 -M q35 -smp 32 -nographic \
>     -serial telnet:10.156.208.154:4321,server,nowait \
>     -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
>     -monitor stdio \
>     -incoming tcp:1.0.8.154:4000
>
> (qemu) migrate_set_parameter max-bandwidth 100G
> (qemu) migrate_set_capability multifd on
> (qemu) migrate_set_parameter multifd-channels 16
>
> The guest hangs when executing the command: migrate -d tcp:1.0.8.154:4000.
>
> If a network problem happens, TCP ACK is not received by destination
> and the destination resets the connection with RST.
>
> No.  Time      Source     Destination  Protocol  Length  Info
> 119  1.021169  1.0.8.153  1.0.8.154    TCP       1410    60166 → 4000 [PSH, ACK] Seq=65 Ack=1 Win=62720 Len=1344
>                                                          TSval=1338662881 TSecr=1399531897
> 125  1.021181  1.0.8.154  1.0.8.153    TCP       54      4000 → 60166 [RST] Seq=1 Win=0 Len=0
>
> kernel log:
> [334520.229445] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [334562.994919] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [334695.519927] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [334734.689511] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [335687.740415] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [335730.013598] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.

Should we document somewhere how to avoid that?
Is there something we should be doing in the connection code to avoid
it?

Dave

> There are two problems here:
> 1. On the send side, the main thread is blocked on qemu_thread_join and
>    send threads are blocked on sendmsg
> 2. On receive side, the receive threads are blocked on qemu_sem_wait to
>    wait for a semaphore.
>
> The patch is to fix the first problem, and the guest doesn't hang any more.
> But there is no better solution to fix the second problem yet.
>
> Li Zhang (1):
>   multifd: Shut down the QIO channels to avoid blocking the send threads
>     when they are terminated.
>
>  migration/multifd.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> --
> 2.31.1
>
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
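
For readers unfamiliar with the send-side problem quoted above (a thread stuck in sendmsg that the main thread then waits on forever), here is a minimal sketch of the general technique the patch title names: shut the channel down first, then join. This is not the patch itself, which is not shown in this thread. A local socket pair stands in for the migration TCP connection, the names blocked_sender and demo are made up for illustration, and the behaviour assumed is that of Linux sockets, where shutdown() wakes a thread blocked in send.

```python
import socket
import threading
import time


def blocked_sender(sock, result):
    """Send until the socket fails. The peer never reads, so once the
    socket buffers are full the send call blocks indefinitely."""
    chunk = b"\0" * 65536
    try:
        while True:
            sock.sendall(chunk)
    except OSError as exc:
        # Raised once the socket has been shut down (EPIPE on Linux).
        result["error"] = exc


def demo():
    # tx plays the role of a send channel; rx is a peer that never reads.
    tx, rx = socket.socketpair()
    result = {}
    worker = threading.Thread(target=blocked_sender, args=(tx, result))
    worker.start()

    # Give the sender time to fill the buffers and block in send.
    time.sleep(0.2)
    stuck = worker.is_alive()

    # Joining here without the shutdown would wait forever.
    # Shutting the socket down first makes the blocked send return.
    tx.shutdown(socket.SHUT_RDWR)
    worker.join(timeout=5)
    finished = not worker.is_alive()

    tx.close()
    rx.close()
    return stuck, finished, result.get("error")


if __name__ == "__main__":
    stuck, finished, error = demo()
    print("sender was still running before shutdown:", stuck)
    print("sender finished after shutdown:", finished)
    print("error seen by sender:", repr(error))
```

In QEMU terms the equivalent step would presumably be a qio_channel_shutdown() call on each multifd channel before the send threads are joined, but that mapping is an assumption based on the patch title only. The receive-side problem (threads waiting on a semaphore) is a different kind of blocking and is not addressed by this sketch.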