Re: [PATCH 1/3] migration: Stop marking RP bad after shutdown

2023-07-31 Thread Fabiano Rosas
Fabiano Rosas  writes:

> Peter Xu  writes:
>
>> On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
>>> When waiting for the return path (RP) thread to finish, there is
>>> really nothing wrong in the RP if the destination end of the migration
>>> stops responding, leaving it stuck.
>>> 
>>> Stop returning an error at that point and leave it to other parts of
>>> the code to catch. One such part is the very next routine run by
>>> migration_completion() which checks 'to_dst_file' for an error and fails
>>> the migration. Another is the RP thread itself when the recvmsg()
>>> returns an error.
>>> 
>>> With this we stop marking RP bad from outside of the thread and can
>>> reuse await_return_path_close_on_source() in the next patches to wait
>>> on the thread during a paused migration.
>>> 
>>> Signed-off-by: Fabiano Rosas 
>>> ---
>>>  migration/migration.c | 1 -
>>>  1 file changed, 1 deletion(-)
>>> 
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 91bba630a8..051067f8c5 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2049,7 +2049,6 @@ static int 
>>> await_return_path_close_on_source(MigrationState *ms)
>>>   * waiting for the destination.
>>>   */
>>>  qemu_file_shutdown(ms->rp_state.from_dst_file);
>>> -mark_source_rp_bad(ms);
>>>  }
>>>  trace_await_return_path_close_on_source_joining();
>>>  qemu_thread_join(&ms->rp_state.rp_thread);
>>
>> The retval of await_return_path_close_on_source() relies on
>> ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
>> that it'll start to return succeed where it used to return failure?
>
> Yep, as described in the commit message, I think it's ok to do that. The
> critical part is doing the shutdown. Other instances of
> mark_source_rp_bad() continue existing and we continue returning
> rp_state.error.
>
>>
>> Maybe not a big deal: I see migration_completion() also has another
>> qemu_file_get_error() later to catch errors, but I don't know how solid
>> that is.
>
> That is the instance I refer to in the commit message. At
> await_return_path_close_on_source() we only call mark_source_rp_bad() if
> to_dst_file has an error. That will be caught by this
> qemu_file_get_error() anyway.

Actually, I can do better, I can merge this shutdown() into
migration_completion(). Then this dependency becomes explicit. Since you
suggested moving await_return_path_close_on_source() into
postcopy_pause(), it doesn't make sense to check to_dst_file anymore,
because when pausing we clear that file.




Re: [PATCH 1/3] migration: Stop marking RP bad after shutdown

2023-07-31 Thread Fabiano Rosas
Peter Xu  writes:

> On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
>> When waiting for the return path (RP) thread to finish, there is
>> really nothing wrong in the RP if the destination end of the migration
>> stops responding, leaving it stuck.
>> 
>> Stop returning an error at that point and leave it to other parts of
>> the code to catch. One such part is the very next routine run by
>> migration_completion() which checks 'to_dst_file' for an error and fails
>> the migration. Another is the RP thread itself when the recvmsg()
>> returns an error.
>> 
>> With this we stop marking RP bad from outside of the thread and can
>> reuse await_return_path_close_on_source() in the next patches to wait
>> on the thread during a paused migration.
>> 
>> Signed-off-by: Fabiano Rosas 
>> ---
>>  migration/migration.c | 1 -
>>  1 file changed, 1 deletion(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 91bba630a8..051067f8c5 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2049,7 +2049,6 @@ static int 
>> await_return_path_close_on_source(MigrationState *ms)
>>   * waiting for the destination.
>>   */
>>  qemu_file_shutdown(ms->rp_state.from_dst_file);
>> -mark_source_rp_bad(ms);
>>  }
>>  trace_await_return_path_close_on_source_joining();
>>  qemu_thread_join(&ms->rp_state.rp_thread);
>
> The retval of await_return_path_close_on_source() relies on
> ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
> that it'll start to return succeed where it used to return failure?

Yep, as described in the commit message, I think it's ok to do that. The
critical part is doing the shutdown. Other instances of
mark_source_rp_bad() continue existing and we continue returning
rp_state.error.

>
> Maybe not a big deal: I see migration_completion() also has another
> qemu_file_get_error() later to catch errors, but I don't know how solid
> that is.

That is the instance I refer to in the commit message. At
await_return_path_close_on_source() we only call mark_source_rp_bad() if
to_dst_file has an error. That will be caught by this
qemu_file_get_error() anyway.




Re: [PATCH 1/3] migration: Stop marking RP bad after shutdown

2023-07-28 Thread Peter Xu
On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
> When waiting for the return path (RP) thread to finish, there is
> really nothing wrong in the RP if the destination end of the migration
> stops responding, leaving it stuck.
> 
> Stop returning an error at that point and leave it to other parts of
> the code to catch. One such part is the very next routine run by
> migration_completion() which checks 'to_dst_file' for an error and fails
> the migration. Another is the RP thread itself when the recvmsg()
> returns an error.
> 
> With this we stop marking RP bad from outside of the thread and can
> reuse await_return_path_close_on_source() in the next patches to wait
> on the thread during a paused migration.
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  migration/migration.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 91bba630a8..051067f8c5 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2049,7 +2049,6 @@ static int 
> await_return_path_close_on_source(MigrationState *ms)
>   * waiting for the destination.
>   */
>  qemu_file_shutdown(ms->rp_state.from_dst_file);
> -mark_source_rp_bad(ms);
>  }
>  trace_await_return_path_close_on_source_joining();
>  qemu_thread_join(&ms->rp_state.rp_thread);

The retval of await_return_path_close_on_source() relies on
ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
that it'll start to return succeed where it used to return failure?

Maybe not a big deal: I see migration_completion() also has another
qemu_file_get_error() later to catch errors, but I don't know how solid
that is.

I think as long as after this patch we can fail properly on e.g. network
failures for precopy when cap return-path=on, then we should be good.

-- 
Peter Xu




[PATCH 1/3] migration: Stop marking RP bad after shutdown

2023-07-28 Thread Fabiano Rosas
When waiting for the return path (RP) thread to finish, there is
really nothing wrong in the RP if the destination end of the migration
stops responding, leaving it stuck.

Stop returning an error at that point and leave it to other parts of
the code to catch. One such part is the very next routine run by
migration_completion() which checks 'to_dst_file' for an error and fails
the migration. Another is the RP thread itself when the recvmsg()
returns an error.

With this we stop marking RP bad from outside of the thread and can
reuse await_return_path_close_on_source() in the next patches to wait
on the thread during a paused migration.

Signed-off-by: Fabiano Rosas 
---
 migration/migration.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 91bba630a8..051067f8c5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2049,7 +2049,6 @@ static int 
await_return_path_close_on_source(MigrationState *ms)
  * waiting for the destination.
  */
 qemu_file_shutdown(ms->rp_state.from_dst_file);
-mark_source_rp_bad(ms);
 }
 trace_await_return_path_close_on_source_joining();
 qemu_thread_join(&ms->rp_state.rp_thread);
-- 
2.35.3