Fabiano Rosas <faro...@suse.de> writes:

> Peter Xu <pet...@redhat.com> writes:
>
>> On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
>>> When waiting for the return path (RP) thread to finish, there is
>>> really nothing wrong in the RP if the destination end of the migration
>>> stops responding, leaving it stuck.
>>> 
>>> Stop returning an error at that point and leave it to other parts of
>>> the code to catch. One such part is the very next routine run by
>>> migration_completion() which checks 'to_dst_file' for an error and fails
>>> the migration. Another is the RP thread itself when the recvmsg()
>>> returns an error.
>>> 
>>> With this we stop marking RP bad from outside of the thread and can
>>> reuse await_return_path_close_on_source() in the next patches to wait
>>> on the thread during a paused migration.
>>> 
>>> Signed-off-by: Fabiano Rosas <faro...@suse.de>
>>> ---
>>>  migration/migration.c | 1 -
>>>  1 file changed, 1 deletion(-)
>>> 
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 91bba630a8..051067f8c5 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2049,7 +2049,6 @@ static int 
>>> await_return_path_close_on_source(MigrationState *ms)
>>>           * waiting for the destination.
>>>           */
>>>          qemu_file_shutdown(ms->rp_state.from_dst_file);
>>> -        mark_source_rp_bad(ms);
>>>      }
>>>      trace_await_return_path_close_on_source_joining();
>>>      qemu_thread_join(&ms->rp_state.rp_thread);
>>
>> The retval of await_return_path_close_on_source() relies on
>> ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
>> that it'll start to return succeed where it used to return failure?
>
> Yep, as described in the commit message, I think it's ok to do that. The
> critical part is doing the shutdown. Other instances of
> mark_source_rp_bad() continue existing and we continue returning
> rp_state.error.
>
>>
>> Maybe not a big deal: I see migration_completion() also has another
>> qemu_file_get_error() later to catch errors, but I don't know how solid
>> that is.
>
> That is the instance I refer to in the commit message. At
> await_return_path_close_on_source() we only call mark_source_rp_bad() if
> to_dst_file has an error. That will be caught by this
> qemu_file_get_error() anyway.

Actually, I can do better, I can merge this shutdown() into
migration_completion(). Then this dependency becomes explicit. Since you
suggested moving await_return_path_close_on_source() into
postcopy_pause(), it doesn't make sense to check to_dst_file anymore,
because when pausing we clear that file.


Reply via email to