Cédric Le Goater <c...@redhat.com> writes:

> Hello Fabiano
>
> On 2/8/24 14:29, Fabiano Rosas wrote:
>> Cédric Le Goater <c...@redhat.com> writes:
>> 
>>> In case of error, close_return_path_on_source() can perform a shutdown
>>> to exit the return-path thread.  However, in migrate_fd_cleanup(),
>>> 'to_dst_file' is closed before calling close_return_path_on_source()
>>> and the shutdown fails, leaving the source and destination waiting for
>>> an event to occur.
>> 
>> Hi, Cédric
>> 
>> Are you sure this is not caused by patch 13? 
>
> It happens with upstream QEMU without any patch.

I might have taken that "shutdown fails" in the commit message too
literaly. Anyway, I have a proposed solution:

-->8--
>From 729aa7b5b7f130f756d41649fdd0862bd2e90430 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <faro...@suse.de>
Date: Wed, 14 Feb 2024 16:45:43 -0300
Subject: [PATCH] migration: Join the return path thread before releasing
 to_dst_file
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The return path thread might hang at a blocking system call. Before
joining the thread we might need to issue a shutdown() on the socket
file descriptor to release it. To determine whether the shutdown() is
necessary we look at the QEMUFile error.

Make sure we only clean up the QEMUFile after the return path has been
waited for.

This fixes a hang when qemu_savevm_state_setup() produced an error
that was detected by migration_detect_error(). That skips
migration_completion() so close_return_path_on_source() would get
stuck waiting for the RP thread to terminate.

At migrate_fd_cleanup() I'm keeping the relative order of joining the
migration thread and the return path just in case.

Reported-by: Cédric Le Goater <c...@redhat.com>
Signed-off-by: Fabiano Rosas <faro...@suse.de>
---
 migration/migration.c | 36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index ab21de2cad..f0b70e8a9d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1326,17 +1326,19 @@ static void migrate_fd_cleanup(MigrationState *s)
 
     qemu_savevm_state_cleanup();
 
+    bql_unlock();
+    if (s->migration_thread_running) {
+        qemu_thread_join(&s->thread);
+        s->migration_thread_running = false;
+    }
+    bql_lock();
+
+    close_return_path_on_source(s);
+
     if (s->to_dst_file) {
         QEMUFile *tmp;
 
         trace_migrate_fd_cleanup();
-        bql_unlock();
-        if (s->migration_thread_running) {
-            qemu_thread_join(&s->thread);
-            s->migration_thread_running = false;
-        }
-        bql_lock();
-
         multifd_send_shutdown();
         qemu_mutex_lock(&s->qemu_file_lock);
         tmp = s->to_dst_file;
@@ -1350,12 +1352,6 @@ static void migrate_fd_cleanup(MigrationState *s)
         qemu_fclose(tmp);
     }
 
-    /*
-     * We already cleaned up to_dst_file, so errors from the return
-     * path might be due to that, ignore them.
-     */
-    close_return_path_on_source(s);
-
     assert(!migration_is_active(s));
 
     if (s->state == MIGRATION_STATUS_CANCELLING) {
@@ -2874,6 +2870,13 @@ static MigThrError postcopy_pause(MigrationState *s)
     while (true) {
         QEMUFile *file;
 
+        /*
+         * We're already pausing, so ignore any errors on the return
+         * path and just wait for the thread to finish. It will be
+         * re-created when we resume.
+         */
+        close_return_path_on_source(s);
+
         /*
          * Current channel is possibly broken. Release it.  Note that this is
          * guaranteed even without lock because to_dst_file should only be
@@ -2893,13 +2896,6 @@ static MigThrError postcopy_pause(MigrationState *s)
         qemu_file_shutdown(file);
         qemu_fclose(file);
 
-        /*
-         * We're already pausing, so ignore any errors on the return
-         * path and just wait for the thread to finish. It will be
-         * re-created when we resume.
-         */
-        close_return_path_on_source(s);
-
         migrate_set_state(&s->state, s->state,
                           MIGRATION_STATUS_POSTCOPY_PAUSED);
 
-- 
2.35.3

Reply via email to