On 12/1/21 3:09 PM, Daniel P. Berrangé wrote:
On Wed, Dec 01, 2021 at 02:42:04PM +0100, Li Zhang wrote:
On 12/1/21 1:22 PM, Daniel P. Berrangé wrote:
On Wed, Dec 01, 2021 at 01:11:13PM +0100, Li Zhang wrote:
On 11/29/21 3:50 PM, Dr. David Alan Gilbert wrote:
* Li Zhang (lizh...@suse.de) wrote:
On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
* Daniel P. Berrangé (berra...@redhat.com) wrote:
On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
When doing live migration with 8, 16 or more multifd channels,
the guest hangs in the presence of network errors such as missing TCP ACKs.
At sender's side:
The main thread is blocked in qemu_thread_join. migration_fd_cleanup
is called because one thread fails in qio_channel_write_all when
the network problem happens, while the other send threads are blocked in sendmsg.
They cannot be terminated, so the main thread stays blocked in qemu_thread_join,
waiting for the threads to terminate.
Isn't the right answer here to ensure we've called 'shutdown' on
all the FDs, so that the threads get kicked out of sendmsg, before
trying to join the thread ?
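
To illustrate the ordering suggested here, the following is a minimal standalone sketch using plain POSIX sockets and pthreads rather than QEMU's own channel and thread wrappers. The names sender() and shutdown_then_join() are hypothetical, and the sketch assumes a Linux-like system where shutdown() wakes a thread blocked in send() on the same socket.

```c
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sender thread: writes until the socket buffer fills and send() blocks,
 * much like a send thread stuck in sendmsg(). */
static void *sender(void *opaque)
{
    int fd = *(int *)opaque;
    char buf[4096];

    memset(buf, 0, sizeof(buf));
    for (;;) {
        if (send(fd, buf, sizeof(buf), 0) < 0) {
            break;              /* send() fails once the socket is shut down */
        }
    }
    return NULL;
}

/* Returns 0 if the sender thread could be joined after shutdown(). */
static int shutdown_then_join(void)
{
    int sv[2];
    pthread_t tid;

    signal(SIGPIPE, SIG_IGN);   /* get an error from send(), not a signal */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pthread_create(&tid, NULL, sender, &sv[0]) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    usleep(100 * 1000);         /* nobody reads sv[1], so the sender blocks */

    /* Kick the thread out of send() first; joining before this would hang. */
    shutdown(sv[0], SHUT_RDWR);
    pthread_join(tid, NULL);

    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

Calling shutdown_then_join() from main() should return promptly. If the shutdown() call is removed, pthread_join() never returns, which is the hang described above.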
I agree a timeout is wrong here; there is no way to get a good timeout
value.
However, I'm a bit confused - we should be able to try a shutdown on the
receive side using the 'yank' command; that's what it's there for. Li,
does this solve your problem?
No. I tried registering 'yank' on the receive side, but the receive threads are
still waiting there.
It seems that on the send side, 'yank' doesn't work either when the send threads
are blocked.
This may not be a case where yank applies. I am not quite sure about it.
We need to fix that; 'yank' should be able to recover from any network
issue. If it's not working we need to understand why.
Hi Dr. David,
On the receive side, I registered 'yank' and it is called. But it only
shuts down the channels; it cannot fix the problem of the receive threads
that are waiting on the semaphore.
So the receive threads are still waiting there.
On the send side, the main process is blocked in qemu_thread_join(). When I
tried the 'yank' command over QMP, it was not handled. So QMP doesn't work
and yank doesn't work.
IOW, there is a bug in QEMU on the send side. It should not be calling
qemu_thread_join() from the main thread, unless it is extremely
confident that the thread in question has already finished.
You seem to be showing that the thread(s) are still running, so we
need to understand why that is the case, and why the main thread
still decided to try to join these threads which haven't finished.
Some threads are running. But one thread fails in
qio_channel_write_all.
In migration_thread(), it detects an error here:
    thr_error = migration_detect_error(s);
    if (thr_error == MIG_THR_ERR_FATAL) {
        /* Stop migration */
        break;
    }
It will stop migration and clean up.
Those threads which are still running need to be made to
terminate before trying to join them
A quick glance at multifd_send_terminate_threads() makes me
suspect multifd shutdown is not reliable.
It is merely setting some boolean flags and posting to a
semaphore. It is doing nothing to shutdown the socket
associated with each thread, so the threads can still be
waiting in an I/O call. IMHO multifd_send_terminate_threads
needs to call qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH)
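
As a rough sketch of that idea, the code below tears down several worker threads, some parked on a semaphore and some blocked in send(), by setting a quit flag, posting the semaphore and shutting down the socket before joining. It is not QEMU code: Channel, worker(), terminate_and_join() and run_channels() are hypothetical names, plain POSIX calls stand in for QEMU's channel and semaphore wrappers, and Linux-style unnamed semaphores are assumed.

```c
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_CHANNELS 4          /* hypothetical channel count */

/* Hypothetical per-channel state; not QEMU's real structure. */
typedef struct {
    int fd;                     /* socket the worker sends on */
    int peer;                   /* other end; never read in this sketch */
    bool io_bound;              /* true: blocks in send(); false: in sem_wait() */
    atomic_bool quit;
    sem_t sem;
    pthread_t thread;
} Channel;

static void *worker(void *opaque)
{
    Channel *c = opaque;
    char buf[4096] = {0};

    while (!atomic_load(&c->quit)) {
        if (c->io_bound) {
            if (send(c->fd, buf, sizeof(buf), 0) < 0) {
                break;          /* woken by shutdown() */
            }
        } else {
            sem_wait(&c->sem);  /* woken by sem_post() */
        }
    }
    return NULL;
}

/* Wake every worker, whichever call it is blocked in, then join them all. */
static int terminate_and_join(Channel *ch, int n)
{
    int joined = 0;

    for (int i = 0; i < n; i++) {
        atomic_store(&ch[i].quit, true);
        sem_post(&ch[i].sem);           /* wakes workers parked on the semaphore */
        shutdown(ch[i].fd, SHUT_RDWR);  /* wakes workers blocked in send() */
    }
    for (int i = 0; i < n; i++) {
        if (pthread_join(ch[i].thread, NULL) == 0) {
            joined++;
        }
    }
    return joined;
}

/* Starts the workers, lets them block, then tears them down.
 * Returns the number of threads joined, or -1 on setup failure
 * (error paths are kept minimal for brevity). */
static int run_channels(void)
{
    static Channel ch[NUM_CHANNELS];
    int sv[2];

    signal(SIGPIPE, SIG_IGN);   /* get an error from send(), not a signal */
    for (int i = 0; i < NUM_CHANNELS; i++) {
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            return -1;
        }
        ch[i].fd = sv[0];
        ch[i].peer = sv[1];
        ch[i].io_bound = (i % 2 == 0);
        atomic_init(&ch[i].quit, false);
        if (sem_init(&ch[i].sem, 0, 0) != 0 ||
            pthread_create(&ch[i].thread, NULL, worker, &ch[i]) != 0) {
            return -1;
        }
    }

    usleep(100 * 1000);         /* give the workers time to block */

    int joined = terminate_and_join(ch, NUM_CHANNELS);
    for (int i = 0; i < NUM_CHANNELS; i++) {
        close(ch[i].fd);
        close(ch[i].peer);
        sem_destroy(&ch[i].sem);
    }
    return joined;
}
```

With the shutdown() line removed, the workers blocked in send() never notice the quit flag and the join loop hangs on the first of them, which matches the behaviour reported on the send side.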
Agree with you.
Regards,
Daniel