On 5/24/2024 10:58 AM, Fabiano Rosas wrote:
Steve Sistare <steven.sist...@oracle.com> writes:

Add the cpr-exec migration mode.  Usage:
   qemu-system-$arch -machine memfd-alloc=on ...
   migrate_set_parameter mode cpr-exec
   migrate_set_parameter cpr-exec-args \
     <arg1> <arg2> ... -incoming <uri>
   migrate -d <uri>

The migrate command stops the VM, saves state to the URI,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from the URI.  Guest RAM is preserved in place,
albeit with new virtual addresses.

Arguments for the new QEMU process are taken from the
@cpr-exec-args parameter.  The first argument should be the
path of a new QEMU binary, or a prefix command that exec's the
new QEMU binary.

Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so the URI must be a type, such as
a file, that reads all data before old QEMU exits.

Memory backend objects must have the share=on attribute, and
must be mmap'able in the new QEMU process.  For example,
memory-backend-file is acceptable, but memory-backend-ram is
not.

The VM must be started with the '-machine memfd-alloc=on'
option.  This causes implicit ram blocks (those not explicitly
described by a memory-backend object) to be allocated by
mmap'ing a memfd.  Examples include VGA, ROM, and even guest
RAM when it is specified without a memory-backend object.

The implementation saves precreate vmstate at the end of normal
migration in migrate_fd_cleanup, and tells the main loop to call
cpr_exec.  Incoming qemu loads preceate state early, before objects
are created.  The memfds are kept open across exec by clearing the
close-on-exec flag, their values are saved in precreate vmstate,
and they are mmap'd in new qemu.

Note that the memfd-alloc option is not related to memory-backend-memfd.
Later patches add support for memory-backend-memfd, and for additional
devices, including vfio, chardev, and more.

Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
---
  include/migration/cpr.h  |  14 +++++
  include/migration/misc.h |   3 ++
  migration/cpr.c          | 131 +++++++++++++++++++++++++++++++++++++++++++++++
  migration/meson.build    |   1 +
  migration/migration.c    |  21 ++++++++
  migration/migration.h    |   5 +-
  migration/ram.c          |   1 +
  qapi/migration.json      |  30 ++++++++++-
  system/physmem.c         |   2 +
  system/vl.c              |   4 ++
  10 files changed, 210 insertions(+), 2 deletions(-)
  create mode 100644 include/migration/cpr.h
  create mode 100644 migration/cpr.c
[...]
diff --git a/migration/migration.c b/migration/migration.c
index b5af6b5..0d91531 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -239,6 +239,7 @@ void migration_object_init(void)
      blk_mig_init();
      ram_mig_init();
      dirty_bitmap_mig_init();
+    cpr_mig_init();
  }
typedef struct {
@@ -1395,6 +1396,15 @@ static void migrate_fd_cleanup(MigrationState *s)
          qemu_fclose(tmp);
      }
+ if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+        Error *err = NULL;
+        if (migration_precreate_save(&err)) {
+            migrate_set_error(s, err);
+            error_report_err(err);

There's an error_report_err() call already a few lines down.

+            migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+        }
+    }

Not a fan of saving state in the middle of the cleanup function. This
adds extra restrictions to migrate_fd_cleanup() which already tends to
be the source of a bunch of bugs.

Can this be done either entirely before or after migrate_fd_cleanup()?
There's only one callsite from which you actually want to do cpr-exec,
migration_iteration_finish(). It's no big deal if we have to play a bit
with the notifier call placement.

static void migration_iteration_finish(MigrationState *s)
{
...
     migration_bh_schedule(migrate_cpr_exec_bh, s);
     migration_bh_schedule(migrate_fd_cleanup_bh, s);
     bql_unlock();
}

IIUC, the BQL ensures the ordering here, but if that doesn't work we
could just call the cpr function right at the end of
migrate_fd_cleanup(). That would already be better than interleaving.

static void migrate_cpr_exec_bh(void *opaque)
{
     MigrationState *s = opaque;
     Error *err = NULL;

     if (migration_has_failed(s) || migrate_mode() != MIG_MODE_CPR_EXEC) {
         return;
     }

     assert(s->state == MIGRATION_STATUS_COMPLETED);

     if (migration_precreate_save(&err)) {
         migrate_set_error(s, err);
         error_report_err(err);
         migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
         migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);

         return;
     }

     qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_args);
}

No problem, I can hoist the cpr exec logic out of migrate_fd_cleanup.
I'll call migration_precreate_save prior, and I'll register a notifier
for MIG_EVENT_PRECOPY_DONE that calls qemu_system_exec_request.

BTW the following does not work because the order of bh execution is not defined
by the qemu_bh_schedule API (and in fact the current implementation prepends
to bh_list, executing these in reverse order):

  migration_bh_schedule(migrate_cpr_exec_bh, s);
  migration_bh_schedule(migrate_fd_cleanup_bh, s);

- Steve

Reply via email to