Thanks Peter
On 14/11/22 10:21 pm, Peter Xu wrote:
Manish,
On Thu, Nov 03, 2022 at 11:47:51PM +0530, manish.mishra wrote:
Yes, but reading early on the main channel is an issue when TLS is enabled.
Sorry, I may not have explained my comment above clearly. Let me lay out the
scenario step by step.
1. The main channel is created and the TLS handshake is done for it.
2. The destination side tries to read the magic early on the main channel in
migration_ioc_process_incoming(), but the source has not sent it yet.
3. The source has written the magic to the main channel's QEMUFile buffer, but
the buffer is not flushed yet; the first flush happens in ram_save_setup(). Data
is sent on the channel only when the QEMUFile buffer is full or is explicitly
flushed.
4. In ram_save_setup(), the source blocks in multifd_send_sync_main() before
flushing the QEMUFile, and multifd_send_sync_main() waits on sem_sync until the
handshake is done for the multifd channels.
5. The destination is still waiting to read the magic on the main channel.
Until migration_ioc_process_incoming() returns, it cannot accept new channels,
so the handshake of the multifd channels is blocked.
6. So the source is blocked on the multifd channel handshakes before it sends
any data on the main channel, while the destination is blocked waiting for that
data before it can accept the multifd channels and complete their handshakes.
The result is a deadlock.
Why is this issue only happening with TLS? It sounds like it'll happen as
long as multifd is enabled.
Actually, this happens only with TLS because of the TLS handshake: a connection
is considered established only after its TLS handshake completes, and the
source flushes data only after all channels are established. In normal live
migration there is no handshake, so the source can continue even if the
destination has not accepted the connection yet. There, a connection counts as
established as soon as the connect() call succeeds, even if the destination has
not accepted it, which is why this deadlock did not happen.
I'm also wondering whether we should flush in qemu_savevm_state_header(), so
that at least an upgraded source QEMU always flushes the headers, if it never
hurts.
Yes, sure, Peter.
Thanks
Manish Mishra