On 2/14/24 17:00, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:

On 2/8/24 14:57, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:

On 2/8/24 14:07, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:

close_return_path_on_source() retrieves the migration error from the
the QEMUFile '->to_dst_file' to know if a shutdown is required. This
shutdown is required to exit the return-path thread. However, in
migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
close_return_path_on_source() and the shutdown is never performed,
leaving the source and destination waiting for an event to occur.

Avoid relying on '->to_dst_file' and use migrate_has_error() instead.

Suggested-by: Peter Xu <pet...@redhat.com>
Signed-off-by: Cédric Le Goater <c...@redhat.com>
---
    migration/migration.c | 3 +--
    1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 
d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d
 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState 
*ms)
         * cause it to unblock if it's stuck waiting for the destination.
         */
        WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
-        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
-            qemu_file_get_error(ms->to_dst_file)) {
+        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
                qemu_file_shutdown(ms->rp_state.from_dst_file);
            }
        }

Hm, maybe Peter can help defend this, but this assumes that every
function that takes an 'f' and sets the file error also sets
migrate_set_error(). I'm not sure we have determined that, have we?

How could we check all the code path ? I agree it is difficult when
looking at the code :/

It would help if the thing wasn't called 'f' for the most part of the
code to begin with.

Whenever there's a file error at to_dst_file there's the chance that the
rp_state.from_dst_file got stuck. So we cannot ignore the file error.

Would it work if we checked it earlier during cleanup as you did
previously and then set the migration error?

Do you mean doing something similar to what is done in
source_return_path_thread() ?

          if (qemu_file_get_error(s->to_dst_file)) {
              qemu_file_get_error_obj(s->to_dst_file, &err);
            if (err) {
                migrate_set_error(ms, err);
                error_free(err);
        ...

Yes. That would be safer I think.

Yes, something like that.

I wish we could make that return path cleanup more deterministic, but
currently it's just: "if something hangs, call shutdown()". We don't
have a way to detect a hang, we just look at the file error and hope it
works.

A crucial aspect here is that calling qemu_file_shutdown() itself sets
the file error. So there's not even a guarantee that an error is
actually an error.



Nevertheless, I am struggling to understand how qemu_file_set_error()
and migrate_set_error() fit together. I was expecting some kind of
synchronization  routine but there isn't it seems. Are they completely
orthogonal ? when should we use these routines and when not ?

We're trying to phase out the QEMUFile usage altogether. One thing that
is getting in the way is this dependency on the qemu_file_*_error
functions.

OK. the other changes, which add an Error** argument to various handlers,
reduce the use of qemu_file_*_error routines in VFIO.

While we're not there yet, a good pattern is to find a
qemu_file_set|get_error() pair and replace it with
migrate_set|has_error().

OK. I will keep that in mind for the other changes.

Thanks,

C.



Unfortunately the return path does not fit in
this, because we don't have a matching qemu_file_set_error, it could be
anywhere. As I said above, we're using that error as a heuristic for: "a
recvmsg() might be hanging".


My initial goal was to modify some of the memory handlers (log_global*)
and migration handlers to propagate errors at the QMP level and them
report to the management layer. This is growing in something bigger
and currently, I don't find a good approach to the problem.

The last two patches of this series try to fix the return-path thread
termination. Let's keep that for after.

I'll try to figure that out. I see you provided a reproducer.


Thanks,

C.



Reply via email to