Fabiano Rosas <faro...@suse.de> wrote:
> Peter Xu <pet...@redhat.com> writes:
>
>> On Thu, Nov 09, 2023 at 01:58:56PM -0300, Fabiano Rosas wrote:
>>> We cannot operate on the multifd semaphores outside of the multifd
>>> channel thread
>>> because multifd_save_cleanup() can run in parallel and
>>> attempt to destroy the mutexes, which causes an assert.
>>> 
>>> Looking at the places where we use the semaphores aside from the
>>> migration thread, there's only the TLS handshake thread and the
>>> initial multifd_channel_connect() in the main thread. These are places
>>> where creating the multifd channel cannot fail, so releasing the
>>> semaphores at these places only serves the purpose of avoiding a
>>> deadlock when an error happens before the channel(s) have started, but
>>> after the migration thread has already called
>>> multifd_send_sync_main().
>>> 
>>> Instead of attempting to release the semaphores at those places, move
>>> the release into multifd_save_cleanup(). This puts the semaphore usage
>>> in the same thread that does the cleanup, eliminating the race.
>>> 
>>> Move the call to multifd_save_cleanup() to before the join of the
>>> migration thread so the semaphores are released first.
>>> 
>>> Signed-off-by: Fabiano Rosas <faro...@suse.de>
>>> ---
>>>  migration/migration.c |  4 +++-
>>>  migration/multifd.c   | 29 +++++++++++------------------
>>>  2 files changed, 14 insertions(+), 19 deletions(-)
>>> 
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index cca32c553c..52be20561b 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -1300,6 +1300,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>          QEMUFile *tmp;
>>>  
>>>          trace_migrate_fd_cleanup();
>>> +
>>> +        multifd_save_cleanup();
>>> +
>>>          qemu_mutex_unlock_iothread();
>>>          if (s->migration_thread_running) {
>>>              qemu_thread_join(&s->thread);
>>> @@ -1307,7 +1310,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>          }
>>>          qemu_mutex_lock_iothread();
>>>  
>>> -        multifd_save_cleanup();
>>>          qemu_mutex_lock(&s->qemu_file_lock);
>>>          tmp = s->to_dst_file;
>>>          s->to_dst_file = NULL;
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index ec58c58082..9ca482cfbe 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -527,6 +527,9 @@ void multifd_save_cleanup(void)
>>>  
>>>          if (p->running) {
>>>              qemu_thread_join(&p->thread);
>>> +        } else {
>>> +            qemu_sem_post(&p->sem_sync);
>>> +            qemu_sem_post(&multifd_send_state->channels_ready);
>>
>> I think relying on p->running to join the thread is already problematic.
>>
>
> Absolutely. I've already complained about this in another thread.

I haven't seen that yet.
But if there is a bug, of course we need to fix it.

>> Now all threads are created with JOINABLE, so we must join them to release
>> the thread resources.  Clearing "running" at the end of the thread is
>> already wrong to me, because it means if the thread quits before the main
>> thread reaching here, we will not join the thread, and the thread resource
>> will be leaked.
>>
>> Here IMHO we should set p->running=true right before thread created, and
>> never clear it.  We may even want to rename it to p->thread_created?
>>
>
> Ok, let me do that. I already have some patches touching
> multifd_new_send_channel_async() due to fixed-ram.

I still think this is a bad idea.  What that value means is whether we
have to send a post/whatever or not.
What I could agree with is that when we set ->running to false, we could
also shut down the QIO channel.
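A minimal sketch of the p->thread_created idea Peter suggests above, in
plain pthreads (all names here are hypothetical, not QEMU's actual
QemuThread API): the flag is set once at creation time and never cleared
by the thread itself, so cleanup always knows whether a join is needed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for one multifd channel's state. */
typedef struct {
    pthread_t thread;
    bool thread_created;   /* set once at creation, never cleared by the thread */
} Channel;

static void *channel_thread(void *opaque)
{
    /* ... do the channel's work ...  The thread does NOT touch
     * thread_created on exit, so cleanup can rely on it. */
    (void)opaque;
    return NULL;
}

static int channel_start(Channel *c)
{
    int ret = pthread_create(&c->thread, NULL, channel_thread, c);
    /* Only mark created on success, so cleanup never joins a
     * thread that was never started. */
    c->thread_created = (ret == 0);
    return ret;
}

static void channel_cleanup(Channel *c)
{
    if (c->thread_created) {
        /* Always join a created thread: joinable threads leak their
         * resources otherwise, even if they already exited. */
        pthread_join(c->thread, NULL);
        c->thread_created = false;
    }
}
```

Compile with -pthread. The point of the sketch is that whether the
thread has *exited* is irrelevant to cleanup; only whether it was
*created* decides if a join is required.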

>>>          }
>>>      }
>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>> @@ -751,8 +754,6 @@ out:
>>>          assert(local_err);
>>>          trace_multifd_send_error(p->id);
>>>          multifd_send_terminate_threads(local_err);
>>> -        qemu_sem_post(&p->sem_sync);
>>> -        qemu_sem_post(&multifd_send_state->channels_ready);
>>>          error_free(local_err);
>>>      }
>>>  
>>> @@ -780,20 +781,15 @@ static void multifd_tls_outgoing_handshake(QIOTask 
>>> *task,
>>>  
>>>      if (!qio_task_propagate_error(task, &err)) {
>>>          trace_multifd_tls_outgoing_handshake_complete(ioc);
>>> -        if (multifd_channel_connect(p, ioc, &err)) {
>>> -            return;
>>> -        }
>>> +        multifd_channel_connect(p, ioc, NULL);
>>
>> Ignoring Error** is not good..
>>
>> I think you meant to say "it should never fail", then we should put
>> &error_abort.  Another cleaner way to do this is split the current
>> multifd_channel_connect() into tls and non-tls helpers, then we can call
>> the non-tls helpers here (which may not need an Error**).
>
> I attempted the split and it looked awkward, so I let go. But I agree
> about the Error.

As said, that code is really weird because tcp+tls is basically two
listen() calls.  I never understood that part of the code; I think it
was Daniel who wrote it.


>> I may still need some more time to digest your whole solution, currently
>> not very clear to me.  It may or may not be the patch's problem,
>> though.
>
> The core assumption of this patch is that we currently only need to
> release the semaphores because the multifd_send_sync_main() is allowed
> to execute even before the multifd channels have started. If creation
> failed to even start, for instance because of a TLS handshake failure,
> the migration thread will hang waiting for channels_ready (or sem_sync).
>
> Then there are two premises:
>
>  - when an error happens, multifd_save_cleanup() is always called;

That should happen.  The problem is that nothing guarantees that we
only have one error.

Or that something, while we are sending, tests migration_has_error()
and decides to stop on the error path as well.

This needs to be audited.  I agree that we can do better here, but it
is not trivial.  The full change is 9.0 material.

>  - a hanging migration thread (at multifd_send_sync_main) will not stop
>    the main thread from reaching multifd_save_cleanup.

I don't remember, to be fair.  But I agree that we shouldn't wait.

> If this holds, then it's safe to release the semaphores right before
> destroying them at multifd_save_cleanup.

>> But let me also share how I see this..  I think we should rework the
>> multifd thread model on channel establishment.
>>
>> Firstly, as I mentioned above, we must always join() the threads if it's
>> JOINABLE or the resource is leaked, afaict.  That's the first thing to fix.
>
> I think we historically stumbled upon the fact that qemu_thread_join()
> is not the same as pthread_join(). The former takes a pointer and is not
> safe to call with a NULL QemuThread. That seems to be the reason for the
> p->running check before it.

As I pointed out on the other thread, joining a thread twice is
undefined behaviour, so we need to be careful here.
We return NULL, so it is not clear to me that we need to do anything
else there.

We could return an Error, but the problem here is that we want to tell
the user what the first error is, not any other error.

>> Then let's see when TLS is there and what we do: we'll create "two" threads
>> just for that, what's even worse:
>>
>>   - We'll create tls handshake thread in multifd_tls_channel_connect()
>>     first, setting &p->thread.
>>
>>   - Within the same thread, we do multifd_tls_outgoing_handshake() when
>>     handshake done -> multifd_channel_connect() -> we yet create the real
>>     multifd_send_thread(), setting &p->thread too.
>
> Hmm good point, we're calling qio_task_complete() from within the
> thread, that's misleading. I believe using qio_task_run_in_thread()
> would put the completion callback in the main thread, right? That would
> look a bit better I think because we could then join the handshake
> thread before multifd_channel_connect().

>> So AFAICT, the tls handshake thread is already leaked, got p->thread
>> overwritten by the new thread created by itself..
>>
>> I think we should fix this then.  I haven't figured the best way to do,
>> two things I can come up with now:
>>
>>   1) At least make the tls handshake thread detached.
>>
>>   2) Make the handshake done in multifd send thread directly; I don't yet
>>      see why we must create two threads..
>
> I'm (a little bit) against this. It's nice to know that a multifdsend_#
> thread will only be doing IO and no other task. I have the same concern
> about putting zero page detection in the multifd thread.
>
>> Then assuming we have a clear model with all these threads issue fixed (no
>> matter whether we'd shrink 2N threads into N threads), then what we need to
>> do, IMHO, is making sure to join() all of them before destroying anything
>> (say, per-channel MultiFDSendParams).  Then when we destroy everything
>> safely, either mutex/sem/etc..  Because no one will race us anymore.
>
> This doesn't address the race. There's a data dependency between the
> multifd channels and the migration thread around the channels_ready
> semaphore. So we cannot join the migration thread because it could be
> stuck waiting for the semaphore, which means we cannot join+cleanup the
> channel thread because the semaphore is still being used. That's why I'm
> moving the semaphores post to the point where we join the
> thread. Reading your email, now I think that should be:
>
> qemu_sem_post(&p->sem_sync);
> qemu_sem_post(&multifd_send_state->channels_ready);
> qemu_thread_join(&p->thread);
>
> Fundamentally, the issue is that we do create the multifd threads before
> the migration thread, but they might not be ready before the migration
> thread needs to use them.

Welcome to the async world.

But I agree that your idea could work.
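The ordering Fabiano proposes (post the semaphores, then join, then
destroy) can be sketched with plain POSIX primitives; names are
hypothetical and QEMU really uses QemuSemaphore/QemuThread, but the
shape of the deadlock avoidance is the same:

```c
#include <pthread.h>
#include <semaphore.h>

/* Stand-in for the migration thread blocking in
 * multifd_send_sync_main() on channels_ready. */
static sem_t channels_ready;

static void *migration_thread(void *opaque)
{
    (void)opaque;
    sem_wait(&channels_ready);   /* hangs forever if nobody posts */
    return NULL;
}

static int run_cleanup(void)
{
    pthread_t th;

    if (sem_init(&channels_ready, 0, 0) != 0) {
        return -1;
    }
    if (pthread_create(&th, NULL, migration_thread, NULL) != 0) {
        sem_destroy(&channels_ready);
        return -1;
    }

    /* Cleanup path: post *before* joining so the blocked thread can
     * wake up and exit; only after the join is it safe to destroy the
     * semaphore, because no one can be waiting on it anymore. */
    sem_post(&channels_ready);
    pthread_join(th, NULL);
    sem_destroy(&channels_ready);
    return 0;
}
```

Compile with -pthread. Reversing the first two cleanup steps (join
before post) deadlocks, which is exactly the hang described above when
the migration thread is stuck in multifd_send_sync_main().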

Later, Juan.

