On Wed, Dec 01, 2021 at 02:42:04PM +0100, Li Zhang wrote:
> 
> On 12/1/21 1:22 PM, Daniel P. Berrangé wrote:
> > On Wed, Dec 01, 2021 at 01:11:13PM +0100, Li Zhang wrote:
> > > On 11/29/21 3:50 PM, Dr. David Alan Gilbert wrote:
> > > > * Li Zhang (lizh...@suse.de) wrote:
> > > > > On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
> > > > > > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > > > > > > On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
> > > > > > > > When doing live migration with multifd channels 8, 16 or larger number,
> > > > > > > > the guest hangs in the presence of the network errors such as missing TCP ACKs.
> > > > > > > > 
> > > > > > > > At sender's side:
> > > > > > > > The main thread is blocked on qemu_thread_join, migration_fd_cleanup
> > > > > > > > is called because one thread fails on qio_channel_write_all when
> > > > > > > > the network problem happens and other send threads are blocked on sendmsg.
> > > > > > > > They could not be terminated. So the main thread is blocked on qemu_thread_join
> > > > > > > > to wait for the threads terminated.
> > > > > > > Isn't the right answer here to ensure we've called 'shutdown' on
> > > > > > > all the FDs, so that the threads get kicked out of sendmsg, before
> > > > > > > trying to join the thread ?
> > > > > > I agree a timeout is wrong here; there is no way to get a good timeout
> > > > > > value.
> > > > > > However, I'm a bit confused - we should be able to try a shutdown on the
> > > > > > receive side using the 'yank' command. - that's what it's there for; Li
> > > > > > does this solve your problem?
> > > > > No, I tried to register 'yank' on the receive side, the receive threads are
> > > > > still waiting there.
> > > > > 
> > > > > It seems that on send side, 'yank' doesn't work either when the send threads
> > > > > are blocked.
> > > > > 
> > > > > This may be not the case to call yank. I am not quite sure about it.
> > > > We need to fix that; 'yank' should be able to recover from any network
> > > > issue. If it's not working we need to understand why.
> > > Hi Dr. David,
> > > 
> > > On the receive side, I register 'yank' and it is called. But it is just to
> > > shut down the channels,
> > > 
> > > it couldn't fix the problem of the receive threads which are waiting for the
> > > semaphore.
> > > 
> > > So the receive threads are still waiting there.
> > > 
> > > On the send side, the main process is blocked on qemu_thread_join(), when I
> > > tried the 'yank'
> > > 
> > > command with QMP, it is not handled. So the QMP doesn't work and yank
> > > doesn't work.
> > IOW, there is a bug in QEMU on the send side. It should not be calling
> > qemu_thread_join() from the main thread, unless it is extremely
> > confident that the thread in question has already finished.
> > 
> > You seem to be showing that the thread(s) are still running, so we
> > need to understand why that is the case, and why the main thread
> > still decided to try to join these threads which haven't finished.
> 
> Some threads are running. But there is one thread fails to
> qio_channel_write_all.
> 
> In migration_thread(), it detects an error here:
> 
>         thr_error = migration_detect_error(s);
>         if (thr_error == MIG_THR_ERR_FATAL) {
>             /* Stop migration */
>             break;
> 
> It will stop migration and cleanup.

Those threads which are still running need to be made to terminate
before trying to join them.

A quick glance at multifd_send_terminate_threads() makes me suspect
multifd shutdown is not reliable. It is merely setting some boolean
flags and posting to a semaphore. It is doing nothing to shut down
the socket associated with each thread, so the threads can still be
waiting in an I/O call. IMHO multifd_send_terminate_threads() needs
to call qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH)

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
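
To illustrate the shutdown-before-join point made above, here is a minimal
standalone sketch in plain POSIX C. It is not QEMU code: blocked_io_thread()
and shutdown_then_join() are hypothetical names, the worker blocks in recv()
rather than sendmsg() to keep the example short, and it assumes the usual
Linux behaviour where shutdown() on a socket wakes a thread blocked in I/O
on that socket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for one channel thread and its socket. */
struct worker {
    int fd;
    ssize_t result;
};

static void *blocked_io_thread(void *opaque)
{
    struct worker *w = opaque;
    char buf[64];

    /* The peer never sends anything, so this blocks until shutdown(). */
    w->result = recv(w->fd, buf, sizeof(buf), 0);
    return NULL;
}

/* Returns 0 if the worker was kicked out of recv() and joined cleanly. */
int shutdown_then_join(void)
{
    int fds[2];
    struct worker w;
    pthread_t tid;
    struct timespec delay = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        perror("socketpair");
        return -1;
    }
    w.fd = fds[0];
    w.result = -1;

    if (pthread_create(&tid, NULL, blocked_io_thread, &w) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Give the worker a moment to actually block inside recv(). */
    nanosleep(&delay, NULL);

    /*
     * Setting a flag or posting a semaphore would not help here: the
     * worker is inside a system call. Shutting the socket down makes
     * recv() return 0 (end of stream), so the join below can complete.
     */
    shutdown(fds[0], SHUT_RDWR);

    rc = pthread_join(tid, NULL);
    assert(rc == 0); /* joining a joinable thread we created */

    close(fds[0]);
    close(fds[1]);
    return w.result == 0 ? 0 : -1;
}
```

Calling shutdown_then_join() from a small main() should return 0. Without
the shutdown() call, the pthread_join() would wait forever, which is the
same shape as the hang described in this thread.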