Cédric Le Goater <c...@redhat.com> writes:

> On 2/8/24 14:57, Fabiano Rosas wrote:
>> Cédric Le Goater <c...@redhat.com> writes:
>> 
>>> On 2/8/24 14:07, Fabiano Rosas wrote:
>>>> Cédric Le Goater <c...@redhat.com> writes:
>>>>
>>>>> close_return_path_on_source() retrieves the migration error from the
>>>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
>>>>> shutdown is required to exit the return-path thread. However, in
>>>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
>>>>> close_return_path_on_source() and the shutdown is never performed,
>>>>> leaving the source and destination waiting for an event to occur.
>>>>>
>>>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
>>>>>
>>>>> Suggested-by: Peter Xu <pet...@redhat.com>
>>>>> Signed-off-by: Cédric Le Goater <c...@redhat.com>
>>>>> ---
>>>>>    migration/migration.c | 3 +--
>>>>>    1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>> index 
>>>>> d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d
>>>>>  100644
>>>>> --- a/migration/migration.c
>>>>> +++ b/migration/migration.c
>>>>> @@ -2372,8 +2372,7 @@ static bool 
>>>>> close_return_path_on_source(MigrationState *ms)
>>>>>         * cause it to unblock if it's stuck waiting for the destination.
>>>>>         */
>>>>>        WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>>>>> -        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>>>>> -            qemu_file_get_error(ms->to_dst_file)) {
>>>>> +        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
>>>>>                qemu_file_shutdown(ms->rp_state.from_dst_file);
>>>>>            }
>>>>>        }
>>>>
>>>> Hm, maybe Peter can help defend this, but this assumes that every
>>>> function that takes an 'f' and sets the file error also sets
>>>> migrate_set_error(). I'm not sure we have determined that, have we?
>>>
>>> How could we check all the code path ? I agree it is difficult when
>>> looking at the code :/
>> 
>> It would help if the thing wasn't called 'f' for the most part of the
>> code to begin with.
>> 
>> Whenever there's a file error at to_dst_file there's the chance that the
>> rp_state.from_dst_file got stuck. So we cannot ignore the file error.
>> 
>> Would it work if we checked it earlier during cleanup as you did
>> previously and then set the migration error?
>
> Do you mean doing something similar to what is done in
> source_return_path_thread() ?
>
>          if (qemu_file_get_error(s->to_dst_file)) {
>              qemu_file_get_error_obj(s->to_dst_file, &err);
>           if (err) {
>               migrate_set_error(ms, err);
>               error_free(err);
>       ...
>
> Yes. That would be safer I think.

Yes, something like that.

I wish we could make that return path cleanup more deterministic, but
currently it's just: "if something hangs, call shutdown()". We don't
have a way to detect a hang, we just look at the file error and hope it
works.

A crucial aspect here is that calling qemu_file_shutdown() itself sets
the file error. So there's not even a guarantee that an error is
actually an error.

>
>
> Nevertheless, I am struggling to understand how qemu_file_set_error()
> and migrate_set_error() fit together. I was expecting some kind of
> synchronization  routine but there isn't it seems. Are they completely
> orthogonal ? when should we use these routines and when not ?

We're trying to phase out the QEMUFile usage altogether. One thing that
is getting in the way is this dependency on the qemu_file_*_error
functions.

While we're not there yet, a good pattern is to find a
qemu_file_set|get_error() pair and replace it with
migrate_set|has_error(). Unfortunately the return path does not fit in
this, because we don't have a matching qemu_file_set_error, it could be
anywhere. As I said above, we're using that error as a heuristic for: "a
recvmsg() might be hanging".

>
> My initial goal was to modify some of the memory handlers (log_global*)
> and migration handlers to propagate errors at the QMP level and them
> report to the management layer. This is growing in something bigger
> and currently, I don't find a good approach to the problem.
>
> The last two patches of this series try to fix the return-path thread
> termination. Let's keep that for after.

I'll try to figure that out. I see you provided a reproducer.

>
> Thanks,
>
> C.

Reply via email to