from:"Steven Sistare"

Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr

2024-07-20 Thread Steven Sistare


On 7/19/2024 12:28 PM, Peter Xu wrote:

On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:

For new cpr modes, ramblock_is_ignored will always be true, because the
memory is preserved in place rather than copied.  However, for an ignored
block, parse_ramblock currently requires that the received address of the
block must match the address of the statically initialized region on the
target.  This fails for a PCI rom block, because the memory region address
is set when the guest writes to a BAR on the source, which does not occur
on the target, causing a "Mismatched GPAs" error during cpr migration.


Is this a common fix with/without cpr mode?


It does not occur during normal migration.


It looks to me mr->addr (for these ROMs) should only be set in PCI config
region updates as you mentioned.  But then I didn't figure out when they're
updated on dest in live migration: the ramblock info was sent at the
beginning of migration, so it doesn't even have PCI config space migrated;
I thought the real mr->addr should be in there.

I also failed to understand yet on why the mr->addr check needs to be done
by ignore-shared only.  Some explanation would be greatly helpful around
this area..


I will continue this thread later and explain more fully.

- Steve

Re: [PATCH V2 00/11] Live update: cpr-exec

2024-07-20 Thread Steven Sistare


On 7/18/2024 11:56 AM, Peter Xu wrote:

Steve,

On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:

What?


Thanks for trying out with the cpr-transfer series.  I saw that that series
missed most of the cc list here, so I'm attaching the link here:

https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sist...@oracle.com

I think most of my previous questions for exec() solution still are there,
I'll try to summarize them all in this reply as much as I can.



This patch series adds the live migration cpr-exec mode, which allows
the user to update QEMU with minimal guest pause time, by preserving
guest RAM in place, albeit with new virtual addresses in new QEMU, and
by preserving device file descriptors.

The new user-visible interfaces are:
   * cpr-exec (MigMode migration parameter)
   * cpr-exec-command (migration parameter)


I really, really hope we can avoid this..

It's super cumbersome to pass in a qemu cmdline in a qemu migration
parameter.. if we can do that with generic live migration ways, I hope we
stick with the clean approach.


This is no different than live migration, requiring a management agent to
launch target qemu with all the arguments use to start source QEMU.  Now that
same agent will send the arguments via cpr-exec-command.


   * anon-alloc (command-line option for -machine)


Igor questioned this, and I second his opinion..  We can leave the
discussion there for this one.


Continued on the other thread.


The user sets the mode parameter before invoking the migrate command.
In this mode, the user issues the migrate command to old QEMU, which
stops the VM and saves state to the migration channels.  Old QEMU then
exec's new QEMU, replacing the original process while retaining its PID.
The user specifies the command to exec new QEMU in the migration parameter
cpr-exec-command.  The command must pass all old QEMU arguments to new
QEMU, plus the -incoming option.  Execution resumes in new QEMU.

Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported.  The VM must be started
with the '-machine anon-alloc=memfd' option, which allows anonymous
memory to be transferred in place to the new process.

Why?

This mode has less impact on the guest than any other method of updating
in place.


So I wonder whether there's comparison between exec() and transfer mode
that you recently proposed.


Not yet, but I will measure it.


I'm asking because exec() (besides all the rest of things that I dislike on
it in this approach..) should be simply slower, logically, due to the
serialized operation to (1) tearing down the old mm, (2) reload the new
ELF, then (3) runs through the QEMU init process.

If with a generic migration solution, the dest QEMU can start running (2+3)
concurrently without even need to run (1).

In this whole process, I doubt (2) could be relatively fast, (3) I donno,
maybe it could be slow but I never measured; Paolo may have good idea as I
know he used to work on qboot.


We'll see, but in any case these take < 100 msec, which is a wonderfully short
pause time unless your customer is doing high speed stock trading.  If 
cpr-transfer
is faster still, that's gravy, but cpr-exec is still great.


For (1), I also doubt in your test cases it's fast, but it may not always
be fast.  Consider the guest has a huge TBs of shared mem, even if the
memory will be completely shared between src/dst QEMUs, the pgtable won't!
It means if the TBs are mapped in PAGE_SIZE tearing down the src QEMU
pgtable alone can even take time, and that will be accounted in step (1)
and further in exec() request.


Yes, there is an O(n) effect here, but it is a fast O(n) when the memory is
backed by huge pages.  In UEK, we make it faster still by unmapping in parallel
with multiple threads.  I don't have the data handy but can share after running
some experiments.  Regardless, this time is negligible for small and medium
size guests, which form the majority of instances in a cloud.


All these fuss will be avoided if you use a generic live migration model
like cpr-transfer you proposed.  That's also cleaner.


The pause time is much lower, because devices need not be torn
down and recreated, DMA does not need to be drained and quiesced, and minimal
state is copied to new QEMU.  Further, there are no constraints on the guest.
By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
and suspending plus resuming vfio devices adds multiple seconds to the
guest pause time.  Lastly, there is no loss of connectivity to the guest,
because chardev descriptors remain open and connected.


Again, I raised the question on why this would matter, as after all mgmt
app will need to coop with reconnections due to the fact they'll need to
support a generic live migration, in which case reconnection is a must.

So far it doesn't sound like a performance critical path, for example, to
do the mgmt reconnects on the ports.  So this might be an

Re: [PATCH V2 01/11] machine: alloc-anon option

2024-07-20 Thread Steven Sistare


On 7/17/2024 3:24 PM, Peter Xu wrote:

On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote:

On Sun, 30 Jun 2024 12:40:24 -0700
Steve Sistare  wrote:


Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This affects
memory-backend-ram objects, guest RAM created with the global -m option
but without an associated memory-backend object and without the -mem-path
option

nowadays, all machines were converted to use memory backend for VM RAM.
so -m option implicitly creates memory-backend object,
which will be either MEMORY_BACKEND_FILE if -mem-path present
or MEMORY_BACKEND_RAM otherwise.



To access the same memory in the old and new QEMU processes, the memory
must be mapped shared.  Therefore, the implementation always sets



RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
user must explicitly specify the share option.  In lieu of defining a new

so statement at the top that memory-backend-ram is affected is not
really valid?


RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
as the condition for calling memfd_create.


In general I do dislike adding yet another option that will affect
guest RAM allocation (memory-backends  should be sufficient).


I shared the same concern when reviewing the previous version, and I keep
having so.



However I do see that you need memfd for device memory (vram, roms, ...).
Can we just use memfd/shared unconditionally for those and
avoid introducing a new confusing option?


ROMs should be fine IIUC, as they shouldn't be large, and they can be
migrated normally (because they're not DMA target from VFIO assigned
devices).  IOW, per my understanding what must be shared via memfd is
writable memories that can be DMAed from a VFIO device.

I raised such question on whether / why vram can be a DMA target, but I
didn't get a response.  So I would like to redo this comment: I think we
should figure out what is missing when we switch all backends to use
-object, rather than adding this flag easily.  When added, we should be
crystal clear on which RAM region will be applicable by this flag.


All RAM regions that are mapped by the guest are registered for vfio DMA by
a memory listener and could potentially be DMA'd, either read or written.
That is defined by the architecture.  We are not allowed to make value
judgements and decide to not support the architecture for some segments
such as ROM.

Alex Williamson, any comment here?

- Steve

Re: [PATCH V2 01/11] machine: alloc-anon option

2024-07-20 Thread Steven Sistare


On 7/16/2024 5:19 AM, Igor Mammedov wrote:

On Sun, 30 Jun 2024 12:40:24 -0700
Steve Sistare  wrote:


Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This affects
memory-backend-ram objects, guest RAM created with the global -m option
but without an associated memory-backend object and without the -mem-path
option

nowadays, all machines were converted to use memory backend for VM RAM.
so -m option implicitly creates memory-backend object,
which will be either MEMORY_BACKEND_FILE if -mem-path present
or MEMORY_BACKEND_RAM otherwise.


Yes.  I dropped an an important adjective, "implicit".

  "guest RAM created with the global -m option but without an explicit 
associated
  memory-backend object and without the -mem-path option"


To access the same memory in the old and new QEMU processes, the memory
must be mapped shared.  Therefore, the implementation always sets



RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
user must explicitly specify the share option.  In lieu of defining a new

so statement at the top that memory-backend-ram is affected is not
really valid?


memory-backend-ram is affected by alloc-anon.  But in addition, the user must
explicitly add the "share" option.  I don't implicitly set share in this case,
because I would be overriding the user's specification of the memory object's 
property,
which would be private if omitted.


RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
as the condition for calling memfd_create.


In general I do dislike adding yet another option that will affect
guest RAM allocation (memory-backends  should be sufficient).

However I do see that you need memfd for device memory (vram, roms, ...).
Can we just use memfd/shared unconditionally for those and
avoid introducing a new confusing option?


The Linux kernel has different tunables for backing memfd's with huge pages, so 
we
could hurt performance if we unconditionally change to memfd.  The user should 
have
a choice for any segment that is large enough for huge pages to improve 
performance,
which potentially is any memory-backend-object.  The non memory-backend objects 
are
small, and it would be OK to use memfd unconditionally for them.

- Steve

Re: [RFC V1 0/6] Live update: cpr-transfer

2024-07-20 Thread Steven Sistare


On 7/18/2024 11:36 AM, Peter Xu wrote:

On Sun, Jun 30, 2024 at 12:44:02PM -0700, Steve Sistare wrote:

What?

This patch series adds the live migration cpr-transfer mode, which
allows the user to transfer a guest to a new QEMU instance on the same
host.  It is identical to cpr-exec in most respects, except as described
below.


I definitely prefer this one more than the exec solution, thanks for trying
this out.  It's a matter of whether we'll need both, my answer would be
no..



The new user-visible interfaces are:
   * cpr-transfer (MigMode migration parameter)
   * cpr-uri (migration parameter)


I wonder whether this parameter can be avoided already, maybe we can let
cpr-transfer depend on unix socket in -incoming, then integrate fd sharing
in the same channel?


You saw the answer in another thread, but I repeat it here for others benefit:

  "CPR state cannot be sent over the normal migration channel, because devices
   and backends are created prior to reading the channel, so this mode sends
   CPR state over a second migration channel that is not visible to the user.
   New QEMU reads the second channel prior to creating devices or backends."

- Steve


   * cpr-uri (command-line argument)

In this mode, the user starts new QEMU on the same host as old QEMU, with
the same arguments as old QEMU, plus the -incoming and the -cpr-uri options.
The user issues the migrate command to old QEMU, which stops the VM, saves
state to the migration channels, and enters the postmigrate state.  Execution
resumes in new QEMU.

This mode requires a second migration channel, specified by the cpr-uri
migration property on the outgoing side, and by the cpr-uri QEMU command-line
option on the incoming side.  The channel must be a type, such as unix socket,
that supports SCM_RIGHTS.

This series depends on the series "Live update: cpr-exec mode".

Why?

cpr-transfer offers the same benefits as cpr-exec mode, but with a model
for launching new QEMU that may be more natural for some management packages.

How?

The file descriptors are kept open by sending them to new QEMU via the
cpr-uri, which must support SCM_RIGHTS.

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would use a new QEMU binary path in terminal 2.

   Terminal 1: start old QEMU
   # qemu-kvm -monitor stdio -object
   memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
   -m 4G -machine anon-alloc=memfd ...

   Terminal 2: start new QEMU
   # qemu-kvm ... -incoming unix:vm.sock -cpr-uri unix:cpr.sock

   Terminal 1:
   QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running
   (qemu) migrate_set_parameter mode cpr-transfer
   (qemu) migrate_set_parameter cpr-uri unix:cpr.sock
   (qemu) migrate -d unix:vm.sock
   (qemu) info status
   VM status: paused (postmigrate)

   Terminal 2:
   QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running

Steve Sistare (6):
   migration: SCM_RIGHTS for QEMUFile
   migration: VMSTATE_FD
   migration: cpr-transfer save and load
   migration: cpr-uri parameter
   migration: cpr-uri option
   migration: cpr-transfer mode

  include/migration/cpr.h|  4 ++
  include/migration/vmstate.h|  9 +
  migration/cpr-transfer.c   | 81 +
  migration/cpr.c| 16 +++-
  migration/meson.build  |  1 +
  migration/migration-hmp-cmds.c | 10 +
  migration/migration.c  | 37 +++
  migration/options.c| 29 +++
  migration/options.h|  1 +
  migration/qemu-file.c  | 83 --
  migration/qemu-file.h  |  2 +
  migration/ram.c|  1 +
  migration/trace-events |  2 +
  migration/vmstate-types.c  | 33 +
  qapi/migration.json| 38 ++-
  qemu-options.hx|  8 
  stubs/vmstate.c|  7 
  system/vl.c|  3 ++
  18 files changed, 359 insertions(+), 6 deletions(-)
  create mode 100644 migration/cpr-transfer.c

--
1.8.3.1

Re: [PATCH V2 04/11] migration: stop vm earlier for cpr

2024-07-20 Thread Steven Sistare


On 7/17/2024 2:59 PM, Fabiano Rosas wrote:

Steve Sistare  writes:


Stop the vm earlier for cpr, to guarantee consistent device state when
CPR state is saved.

Signed-off-by: Steve Sistare 
---
  migration/migration.c | 22 +-
  1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 0f47765..8a8e927 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2077,6 +2077,7 @@ void qmp_migrate(const char *uri, bool has_channels,
  MigrationState *s = migrate_get_current();
  g_autoptr(MigrationChannel) channel = NULL;
  MigrationAddress *addr = NULL;
+bool stopped = false;
  
  /*

   * Having preliminary checks for uri and channel
@@ -2120,6 +2121,15 @@ void qmp_migrate(const char *uri, bool has_channels,
  }
  }
  
+if (migrate_mode_is_cpr(s)) {

+int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
+if (ret < 0) {
+error_setg(_err, "migration_stop_vm failed, error %d", -ret);
+goto out;
+}
+stopped = true;
+}
+
  if (cpr_state_save(_err)) {
  goto out;
  }
@@ -2155,6 +2165,9 @@ out:
  }
  migrate_fd_error(s, local_err);
  error_propagate(errp, local_err);
+if (stopped && runstate_is_live(s->vm_old_state)) {
+vm_start();
+}


What about non-live states? Shouldn't this be:

if (stopped) {
vm_resume();
}


Not quite.  vm_old_state may be a stopped state, so we don't want to resume.
However, I should probably restore the old stopped state here.  I'll try some 
more
error recovery scenarios.

- Steve




  return;
  }
  }
@@ -3738,7 +3751,6 @@ void migrate_fd_connect(MigrationState *s, Error 
*error_in)
  Error *local_err = NULL;
  uint64_t rate_limit;
  bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
-int ret;
  
  /*

   * If there's a previous error, free it and prepare for another one.
@@ -3810,14 +3822,6 @@ void migrate_fd_connect(MigrationState *s, Error 
*error_in)
  return;
  }
  
-if (migrate_mode_is_cpr(s)) {

-ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
-if (ret < 0) {
-error_setg(_err, "migration_stop_vm failed, error %d", -ret);
-goto fail;
-}
-}
-
  if (migrate_background_snapshot()) {
  qemu_thread_create(>thread, "mig/snapshot",
  bg_migration_thread, s, QEMU_THREAD_JOINABLE);

Re: [PATCH V2 02/11] migration: cpr-state

2024-07-20 Thread Steven Sistare


On 7/19/2024 11:03 AM, Peter Xu wrote:

On Sun, Jun 30, 2024 at 12:40:25PM -0700, Steve Sistare wrote:

CPR must save state that is needed after QEMU is restarted, when devices
are realized.  Thus the extra state cannot be saved in the migration stream,
as objects must already exist before that stream can be loaded.  Instead,
define auxilliary state structures and vmstate descriptions, not associated
with any registered object, and serialize the aux state to a cpr-specific
stream in cpr_state_save.  Deserialize in cpr_state_load after QEMU
restarts, before devices are realized.

Provide accessors for clients to register file descriptors for saving.
The mechanism for passing the fd's to the new process will be specific
to each migration mode, and added in subsequent patches.

Signed-off-by: Steve Sistare 
---
  include/migration/cpr.h |  21 ++
  migration/cpr.c | 188 
  migration/meson.build   |   1 +
  migration/migration.c   |   6 ++
  migration/trace-events  |   5 ++
  system/vl.c |   3 +
  6 files changed, 224 insertions(+)
  create mode 100644 include/migration/cpr.h
  create mode 100644 migration/cpr.c

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 000..8e7e705
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+typedef int (*cpr_walk_fd_cb)(int fd);
+void cpr_save_fd(const char *name, int id, int fd);
+void cpr_delete_fd(const char *name, int id);
+int cpr_find_fd(const char *name, int id);
+int cpr_walk_fd(cpr_walk_fd_cb cb);
+void cpr_resave_fd(const char *name, int id, int fd);
+
+int cpr_state_save(Error **errp);
+int cpr_state_load(Error **errp);
+
+#endif
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 000..313e74e
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,188 @@
+/*
+ * Copyright (c) 2021-2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/misc.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/vmstate.h"
+#include "sysemu/runstate.h"
+#include "trace.h"
+
+/*/
+/* cpr state container for all information to be saved. */
+
+typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
+
+typedef struct CprState {
+CprFdList fds;
+} CprState;
+
+static CprState cpr_state;
+
+//
+
+typedef struct CprFd {
+char *name;
+unsigned int namelen;
+int id;
+int fd;


[1]


+QLIST_ENTRY(CprFd) next;
+} CprFd;
+
+static const VMStateDescription vmstate_cpr_fd = {
+.name = "cpr fd",
+.version_id = 1,
+.minimum_version_id = 1,
+.fields = (VMStateField[]) {
+VMSTATE_UINT32(namelen, CprFd),
+VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
+VMSTATE_INT32(id, CprFd),
+VMSTATE_INT32(fd, CprFd),
+VMSTATE_END_OF_LIST()
+}
+};
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+CprFd *elem = g_new0(CprFd, 1);
+
+trace_cpr_save_fd(name, id, fd);
+elem->name = g_strdup(name);
+elem->namelen = strlen(name) + 1;
+elem->id = id;
+elem->fd = fd;
+QLIST_INSERT_HEAD(_state.fds, elem, next);
+}
+
+static CprFd *find_fd(CprFdList *head, const char *name, int id)
+{
+CprFd *elem;
+
+QLIST_FOREACH(elem, head, next) {
+if (!strcmp(elem->name, name) && elem->id == id) {
+return elem;
+}
+}
+return NULL;
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+CprFd *elem = find_fd(_state.fds, name, id);
+
+if (elem) {
+QLIST_REMOVE(elem, next);
+g_free(elem->name);
+g_free(elem);
+}
+
+trace_cpr_delete_fd(name, id);
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+CprFd *elem = find_fd(_state.fds, name, id);
+int fd = elem ? elem->fd : -1;
+
+trace_cpr_find_fd(name, id, fd);
+return fd;
+}
+
+int cpr_walk_fd(cpr_walk_fd_cb cb)
+{
+CprFd *elem;
+
+QLIST_FOREACH(elem, _state.fds, next) {
+if (elem->fd >= 0 && cb(elem->fd)) {
+return 1;
+}
+}
+return 0;
+}
+
+void cpr_resave_fd(const char *name, int id, int fd)
+{
+CprFd *elem = find_fd(_state.fds, name, id);
+int old_fd = elem ? elem->fd : -1;
+
+if (old_fd < 0) {
+cpr_save_fd(name, id, fd);


I don't think I know well on when old_fd<0 would happen yet, as this series
doesn't look like

Re: [PATCH V2 03/11] migration: save cpr mode

2024-07-18 Thread Steven Sistare


On 7/17/2024 2:39 PM, Fabiano Rosas wrote:

Steve Sistare  writes:


Save the mode in CPR state, so the user does not need to explicitly specify
it for the target.  Modify migrate_mode() so it returns the incoming mode on
the target.

Signed-off-by: Steve Sistare 
---
  include/migration/cpr.h |  7 +++
  migration/cpr.c | 23 ++-
  migration/migration.c   |  1 +
  migration/options.c |  9 +++--
  4 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 8e7e705..42b4019 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -8,6 +8,13 @@
  #ifndef MIGRATION_CPR_H
  #define MIGRATION_CPR_H
  
+#include "qapi/qapi-types-migration.h"

+
+#define MIG_MODE_NONE MIG_MODE__MAX


What happens when a QEMU that knows about a new mode migrates into a
QEMU that doesn't know that mode, i.e. sees it as MIG_MODE__MAX?

I'd just use -1.


Good idea, thanks - steve


+
+MigMode cpr_get_incoming_mode(void);
+void cpr_set_incoming_mode(MigMode mode);
+
  typedef int (*cpr_walk_fd_cb)(int fd);
  void cpr_save_fd(const char *name, int id, int fd);
  void cpr_delete_fd(const char *name, int id);
diff --git a/migration/cpr.c b/migration/cpr.c
index 313e74e..1c296c6 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -21,10 +21,23 @@
  typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
  
  typedef struct CprState {

+MigMode mode;
  CprFdList fds;
  } CprState;
  
-static CprState cpr_state;

+static CprState cpr_state = {
+.mode = MIG_MODE_NONE,
+};
+
+MigMode cpr_get_incoming_mode(void)
+{
+return cpr_state.mode;
+}
+
+void cpr_set_incoming_mode(MigMode mode)
+{
+cpr_state.mode = mode;
+}
  
  //
  
@@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)

  /*/
  #define CPR_STATE "CprState"
  
+static int cpr_state_presave(void *opaque)

+{
+cpr_state.mode = migrate_mode();
+return 0;
+}
+
  static const VMStateDescription vmstate_cpr_state = {
  .name = CPR_STATE,
  .version_id = 1,
  .minimum_version_id = 1,
+.pre_save = cpr_state_presave,
  .fields = (VMStateField[]) {
+VMSTATE_UINT32(mode, CprState),
  VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
  VMSTATE_END_OF_LIST()
  }
diff --git a/migration/migration.c b/migration/migration.c
index e394ad7..0f47765 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -411,6 +411,7 @@ void migration_incoming_state_destroy(void)
  mis->postcopy_qemufile_dst = NULL;
  }
  
+cpr_set_incoming_mode(MIG_MODE_NONE);

  yank_unregister_instance(MIGRATION_YANK_INSTANCE);
  }
  
diff --git a/migration/options.c b/migration/options.c

index 645f550..305397a 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -22,6 +22,7 @@
  #include "qapi/qmp/qnull.h"
  #include "sysemu/runstate.h"
  #include "migration/colo.h"
+#include "migration/cpr.h"
  #include "migration/misc.h"
  #include "migration.h"
  #include "migration-stats.h"
@@ -758,8 +759,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
  
  MigMode migrate_mode(void)

  {
-MigrationState *s = migrate_get_current();
-MigMode mode = s->parameters.mode;
+MigMode mode = cpr_get_incoming_mode();
+
+if (mode == MIG_MODE_NONE) {
+MigrationState *s = migrate_get_current();
+mode = s->parameters.mode;
+}
  
  assert(mode >= 0 && mode < MIG_MODE__MAX);

  return mode;

Re: [PATCH V2 01/11] machine: alloc-anon option

2024-07-18 Thread Steven Sistare


On 7/17/2024 3:24 PM, Peter Xu wrote:
[...]


PS to Steve: and I think I left tons of other comments in previous version
outside this patch too, but I don't think they're fully discussed when this
series was sent.  I can re-read the series again, but I don't think it'll
work out if we keep skipping discussions..


Hi Peter, let me address this part first, because I don't want you to think
that I ignored your questions and concerns.  This V2 series tries to address
them.  The change log was intended to be my response, rather than responding
to each open question individually, but let me try again here with more detail.
I apologize if I don't summarize your concerns correctly or completely.

issue: discomfort with exec. why is it needed?
response: exec is just a transport mechanism to send fd's to new qemu.
  I refactored to separate core patches from exec-specific patches, submitted
  cpr-transfer patches to illustrate a non-exec method, and provided reasons
  why one vs the other would be desirable in the commit messages and cover
  letter.

issue: why do we need to preserve the ramblock fields and make them available
  prior to object creation?
response.  we don't need to preserve all of them, and we only need fd prior
  to object creation, so I deleted the precreate, factory, and named object
  patches, and added CprState to preserve fd's. used_length arrives in the
  normal migration stream.  max_length is recovered from the mfd using lseek.

issue: the series is too large, with too much change.
response: in addition to the deletions mentioned above, I simplified the
  functionality and tossed out style patches and nice-to-haves, so we can
  focus on core functionality.  V2 is much smaller.

issue: memfd_create option is oddly expressed and hard to understand.
response: I redefined the option, deleted all the stylistic ramblock patches
  to lay its workings bare, and explicitly documented its affect on all types
  of memory in the commit messages and qapi documentation.

issue: no need to preserve blocks like ROM for DMA (with memfd_create).
  Blocks that must be preserved should be surfaced to the user as objects.
response: I disagree, and will continue that conversation in this email thread.

issue: how will vfio be handled?
response: I submitted the vfio patches (non-trivial, because first I had to
  rework them without using precreate vmstate).

issue: how will fd's be preserved for chardevs?
response: via cpr_save_fd, CprState, and cpr_load_fd at device creation time,
  in each device's creation function, just like vfio.  Those primitives are
  defined in this V2 series.

- Steve

Re: [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma)

2024-07-16 Thread Steven Sistare


On 7/9/2024 4:58 PM, Steve Sistare wrote:

Enable vfio-pci devices to be saved and restored across a cpr-exec of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in CPR state.

In the container pre_save handler, suspend the use of virtual addresses
in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will
be remapped at a different VA after exec.  DMA to already-mapped pages
continues.  Save the msi message area as part of vfio-pci vmstate, and
save the interrupt and notifier eventfd's in vmstate.

On qemu restart, vfio_realize() finds the saved descriptors, uses the
descriptors, and notes that the device is being reused.  Device and iommu
state is already configured, so operations in vfio_realize that would
modify the configuration are skipped for a reused device, including vfio
ioctl's and writes to PCI configuration space.  Vfio PCI device reset
is also suppressed. The result is that vfio_realize constructs qemu
data structures that reflect the current state of the device.  However,
the reconstruction is not complete until migrate_incoming is called.
migrate_incoming loads the msi data, the vfio post_load handler finds
eventfds in CPR state, rebuilds vector data structures, and attaches the
interrupts to the new KVM instance.  The container post_load handler then
invokes the main vfio listener callback, which walks the flattened ranges
of the vfio address space and calls VFIO_DMA_MAP_FLAG_VADDR to inform the
kernel of the new VA's.  Lastly, migration resumes the VM.

This functionality is delivered by 3 patches for clarity.  Part 1 handles
device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
support.  Part 3 adds INTX support.
[...]
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
new file mode 100644
index 000..bc51ebe
--- /dev/null
+++ b/hw/vfio/cpr-legacy.c
[...]
+
+bool vfio_legacy_cpr_register_container(VFIOContainerBase *bcontainer,
+Error **errp)
+{
+VFIOContainer *container = VFIO_CONTAINER(bcontainer);
+
+if (!vfio_can_cpr_exec(container, >cpr_blocker)) {
+return migrate_add_blocker_modes(>cpr_blocker, errp,
+ MIG_MODE_CPR_EXEC, -1);


This is a bug.  With the change in cpr_register return type to bool, this
should be:
  return migrate_add_blocker_modes(...) == 0;

- Steve

Re: [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma)

2024-07-10 Thread Steven Sistare


On 7/10/2024 4:03 PM, Alex Williamson wrote:

On Tue,  9 Jul 2024 13:58:53 -0700
Steve Sistare  wrote:


Enable vfio-pci devices to be saved and restored across a cpr-exec of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in CPR state.

In the container pre_save handler, suspend the use of virtual addresses
in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will
be remapped at a different VA after exec.  DMA to already-mapped pages
continues.  Save the msi message area as part of vfio-pci vmstate, and
save the interrupt and notifier eventfd's in vmstate.

On qemu restart, vfio_realize() finds the saved descriptors, uses the
descriptors, and notes that the device is being reused.  Device and iommu
state is already configured, so operations in vfio_realize that would
modify the configuration are skipped for a reused device, including vfio
ioctl's and writes to PCI configuration space.  Vfio PCI device reset
is also suppressed. The result is that vfio_realize constructs qemu
data structures that reflect the current state of the device.  However,
the reconstruction is not complete until migrate_incoming is called.
migrate_incoming loads the msi data, the vfio post_load handler finds
eventfds in CPR state, rebuilds vector data structures, and attaches the
interrupts to the new KVM instance.  The container post_load handler then
invokes the main vfio listener callback, which walks the flattened ranges
of the vfio address space and calls VFIO_DMA_MAP_FLAG_VADDR to inform the
kernel of the new VA's.  Lastly, migration resumes the VM.


Hi Steve,

What's the iommufd plan for cpr?  Thanks,


I am working on vdpa and iommufd as we speak, with working prototypes for both.
I plan to submit the kernel and qemu RFC for vdpa next week, followed by 
vacation,
and iommfd in the weeks after that.

- Steve

Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-06-03 Thread Steven Sistare via


On 6/3/2024 6:17 AM, Daniel P. Berrangé wrote:

On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:

On 5/28/2024 5:12 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:26AM -0700, Steve Sistare wrote:

Allocate anonymous memory using memfd_create if the memfd-alloc machine
option is set.

Signed-off-by: Steve Sistare 
---
   hw/core/machine.c   | 22 ++
   include/hw/boards.h |  1 +
   qemu-options.hx |  6 ++
   system/memory.c |  9 ++---
   system/physmem.c| 18 +-
   system/trace-events |  1 +
   6 files changed, 53 insertions(+), 4 deletions(-)



diff --git a/qemu-options.hx b/qemu-options.hx
index cf61f6b..f0dfda5 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
   "vmport=on|off|auto controls emulation of vmport (default: 
auto)\n"
   "dump-guest-core=on|off include guest memory in a core dump 
(default=on)\n"
   "mem-merge=on|off controls memory merge support (default: 
on)\n"
+"memfd-alloc=on|off controls allocating anonymous guest RAM 
using memfd_create (default: off)\n"
   "aes-key-wrap=on|off controls support for AES key wrapping 
(default=on)\n"
   "dea-key-wrap=on|off controls support for DEA key wrapping 
(default=on)\n"
   "suppress-vmdesc=on|off disables self-describing migration 
(default=off)\n"
@@ -79,6 +80,11 @@ SRST
   supported by the host, de-duplicates identical memory pages
   among VMs instances (enabled by default).
+``memfd-alloc=on|off``
+Enables or disables allocation of anonymous guest RAM using
+memfd_create.  Any associated memory-backend objects are created with
+share=on.  The memfd-alloc default is off.
+
   ``aes-key-wrap=on|off``
   Enables or disables AES key wrapping support on s390-ccw hosts.
   This feature controls whether AES wrapping keys will be created
diff --git a/system/memory.c b/system/memory.c
index 49f1cb2..ca04a0e 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
 uint64_t size,
 Error **errp)
   {
+uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;


If there's a machine option to "use memfd for allocations", then it's
shared mem... Hmm..

It is a bit confusing to me in quite a few levels:

- Why memory allocation method will be defined by a machine property,
  even if we have memory-backend-* which should cover everything?


Some memory regions are implicitly created, and have no explicit representation
on the qemu command line.  memfd-alloc affects those.

More generally, memfd-alloc affects all ramblock allocations that are
not explicitly represented by memory-backend object.  Thus the simple
command line "qemu -m 1G" does not explicitly describe an object, so it
goes through the anonymous allocation path, and is affected by memfd-alloc.

Internally, create_default_memdev does create a memory-backend object.
That is what my doc comment above refers to:
   Any associated memory-backend objects are created with share=on

An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.

The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:

+# Memory backend objects must have the share=on attribute, and
+# must be mmap'able in the new QEMU process.  For example,
+# memory-backend-file is acceptable, but memory-backend-ram is
+# not.
+#
+# The VM must be started with the '-machine memfd-alloc=on'
+# option.  This causes implicit ram blocks -- those not explicitly
+# described by a memory-backend object -- to be allocated by
+# mmap'ing a memfd.  Examples include VGA, ROM, and even guest
+# RAM when it is specified without a memory-backend object.


- Even if we have such a machine property, why setting "memfd" will
  always imply shared?  why not private?  After all it's not called
  "memfd-shared-alloc", and we can create private mappings using
  e.g. memory-backend-memfd,share=off.


There is no use case for memfd-alloc with share=off, so no point IMO in
making the option more verbose.  For cpr, the mapping with all its modifications
must be visible to new qemu when qemu mmaps it.



So IIUC, cpr doesn't care about the use of 'memfd' as the specific impl,
it only cares that the memory is share=on.

Rather than having a machine type option "memfd-alloc" which is named after
a Linux specific impl detail, how about having a machine type option
"mem-share=on", which just happens to trig

Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-05-31 Thread Steven Sistare via


On 5/30/2024 2:39 PM, Peter Xu wrote:

On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:

On 5/29/2024 3:25 PM, Peter Xu wrote:

On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:

On 5/28/2024 5:44 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:

Preserve fields of RAMBlocks that allocate their host memory during CPR so
the RAM allocation can be recovered.


This sentence itself did not explain much, IMHO.  QEMU can share memory
using fd based memory already of all kinds, as long as the memory backend
is path-based it can be shared by sharing the same paths to dst.

This reads very confusing as a generic concept.  I mean, QEMU migration
relies on so many things to work right.  We mostly asks the users to "use
exactly the same cmdline for src/dst QEMU unless you know what you're
doing", otherwise many things can break.  That should also include ramblock
being matched between src/dst due to the same cmdlines provided on both
sides.  It'll be confusing to mention this when we thought the ramblocks
also rely on that fact.

So IIUC this sentence should be dropped in the real patch, and I'll try to
guess the real reason with below..


The properties of the implicitly created ramblocks must be preserved.
The defaults can and do change between qemu releases, even when the command-line
parameters do not change for the explicit objects that cause these implicit
ramblocks to be created.


AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
have an example?  Would that already cause issue when migrate?


Alignment has changed, and used_length vs max_length changed when
resizeable ramblocks were introduced.  I have dealt with these issues
while supporting cpr for our internal use, and the learned lesson is to
explicitly communicate the creation-time parameters to new qemu.


Why used_length can change?  I'm looking at ram_mig_ram_block_resized():

 if (!migration_is_idle()) {
 /*
  * Precopy code on the source cannot deal with the size of RAM blocks
  * changing at random points in time - especially after sending the
  * RAM block sizes in the migration stream, they must no longer change.
  * Abort and indicate a proper reason.
  */
 error_setg(, "RAM block '%s' resized during precopy.", rb->idstr);
 migration_cancel(err);
 error_free(err);
 }

We sent used_length upfront of a migration during SETUP phase.  Looks like
what you're describing can be something different, though?


I was imprecise.  used_length did not change; it was introduced as being
different than max_length when resizeable ramblocks were introduced.

The max_length is not sent.  It is an implicit property of the implementation,
and can change.  It is the size of the memfd mapping, so we need to know it
and preserve it.

used_length is indeed sent during SETUP.  We could also send max_length
at that time, and store both in the struct ramblock, and *maybe* that would
be safe, but that is more fragile and less future proof than setting both
properties to the correct value when the ramblock struct is created.

And BTW, the ramblock properties are sent using ad-hoc code in setup.
I send them using nice clean vmstate.


Regarding to rb->align: isn't that mostly a constant, reflecting the MR's
alignment?  It's set when ramblock is created IIUC:

 rb->align = mr->align;

When will the alignment change?


The alignment specified by the mr to allocate a new block is an implicit 
property
of the implementation, and has changed before, from one qemu release to another.
Not often, but it did, and could again in the future.  Communicating the 
alignment
from old qemu to new qemu is future proof.


These are not an issue for migration because the ramblock is re-created
and the data copied into the new memory.


Mirror the mr->align field in the RAMBlock to simplify the vmstate.
Preserve the old host address, even though it is immediately discarded,
as it will be needed in the future for CPR with iommufd.  Preserve
guest_memfd, even though CPR does not yet support it, to maintain vmstate
compatibility when it becomes supported.


.. It could be about the vfio vaddr update feature that you mentioned and
only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
help here)?

If so, IMHO we should have this patch (or any variance form) to be there
for your upcoming vfio support.  Keeping this around like this will make
the series harder to review.  Or is it needed even before VFIO?


This patch is needed independently of vfio or iommufd.

guest_memfd is independent of vfio or iommufd.  It is a recent addition
which I have not tried to support, but I added this placeholder field
to it can be supported in the future without adding a new field later
and maintaining backwards compatibility.


Is guest_memfd the only user so far, then?  If so, would it be

Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-05-31 Thread Steven Sistare via


On 5/30/2024 2:14 PM, Peter Xu wrote:

On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote:

On 5/29/2024 3:14 PM, Peter Xu wrote:

On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:

diff --git a/system/memory.c b/system/memory.c
index 49f1cb2..ca04a0e 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
  uint64_t size,
  Error **errp)
{
+uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;


If there's a machine option to "use memfd for allocations", then it's
shared mem... Hmm..

It is a bit confusing to me in quite a few levels:

 - Why memory allocation method will be defined by a machine property,
   even if we have memory-backend-* which should cover everything?


Some memory regions are implicitly created, and have no explicit representation
on the qemu command line.  memfd-alloc affects those.

More generally, memfd-alloc affects all ramblock allocations that are
not explicitly represented by memory-backend object.  Thus the simple
command line "qemu -m 1G" does not explicitly describe an object, so it
goes through the anonymous allocation path, and is affected by memfd-alloc.


Can we simply now allow "qemu -m 1G" to work for cpr-exec?


I assume you meant "simply not allow".

Yes, I could do that, but I would need to explicitly add code to exclude this
case, and add a blocker.  Right now it "just works" for all paths that lead to
ram_block_alloc_host, without any special logic at the memory-backend level.
And, I'm not convinced that simplifies the docs, as now I would need to tell
the user that "-m 1G" and similar constructions do not work with cpr.

I can try to clarify the doc for -memfd-alloc as currently defined.


Why do we need to keep cpr working for existing qemu cmdlines?  We'll
already need to add more new cmdline options already anyway, right?

cpr-reboot wasn't doing it, and that made sense to me, so that new features
will require the user to opt-in for it, starting with changing its
cmdlines.


I agree.  We need a new option to opt-in to cpr-friendly memory allocation, and 
I
am proposing -machine memfd-alloc. I am simply saying that I can try to do a 
better
job explaining the functionality in my proposed text for memfd-alloc, instead of
changing the functionality to exclude "-m 1G".  I believe excluding "-m 1G" is 
the
wrong approach, for the reasons I stated - messier implementation *and* 
documentation.

I am open to different syntax for opting in.


AFAIU that's
what we do with cpr-reboot: we ask the user to specify the right things to
make other thing work.  Otherwise it won't.



Internally, create_default_memdev does create a memory-backend object.
That is what my doc comment above refers to:
Any associated memory-backend objects are created with share=on

An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.

The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:

+# Memory backend objects must have the share=on attribute, and
+# must be mmap'able in the new QEMU process.  For example,
+# memory-backend-file is acceptable, but memory-backend-ram is
+# not.
+#
+# The VM must be started with the '-machine memfd-alloc=on'
+# option.  This causes implicit ram blocks -- those not explicitly
+# described by a memory-backend object -- to be allocated by
+# mmap'ing a memfd.  Examples include VGA, ROM, and even guest
+# RAM when it is specified without a memory-backend object.


VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
memory-backend-file,share=on propertly, these should be the only outliers?

Are these important enough for the downtime?  Can we put them into the
migrated image alongside with the rest device states?


It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
The pages must remain pinned during CPR to support ongoing DMA activity
which could target those pages (which we do not quiesce), and the same
physical pages must be used for the ramblocks in the new qemu process.


Ah ok, yes DMA can happen on the fly.

Guest mem is definitely the major DMA target and that can be covered by
-object memory-backend-*,shared=on cmdlines.

ROM is definitely not a DMA target.  So is VGA ram a target for, perhaps,
an assigned vGPU device?  Do we have a list of things that will need that?
Can we make them work somehow by sharing them like guest mem?


The pass-through devices map and pin all memory accessible to the guest.
We cannot make exceptions based on our intuition of how the memory will
and will not be used.

Also, we cannot simply abandon the old pinned ramblocks, owned by an mm_struct
that will become a zombie.  We would actually

Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-30 Thread Steven Sistare via


On 5/28/2024 12:42 PM, Peter Xu wrote:

On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:

On 5/27/2024 1:45 PM, Peter Xu wrote:

On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:

I understand, thanks.  If I can help with any of your todo list,
just ask - steve


Thanks for offering the help, Steve.  Started looking at this today, then I
found that I miss something high-level.  Let me ask here, and let me
apologize already for starting to throw multiple questions..

IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
this case not host kernel but QEMU-only, and/or upper.

Is there any justification on why the complexity is needed here?  It looks
to me this one is more involved than cpr-reboot, so I'm thinking how much
we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
the min support, and if we even expect more to come, that's really
important, IMHO.

For example, what's the major motivation of this whole work?  Is that more
on performance, or is it more for supporting the special devices like VFIO
which we used to not support, or something else?  I can't find them in
whatever cover letter I can find, including this one.

Firstly, regarding performance, IMHO it'll be always nice to share even
some very fundamental downtime measurement comparisons using the new exec
mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
have some number on hand when you started working on this feature years
ago?  Or maybe some old links on the list would help too, as I didn't
follow this work since the start.

On VFIO, IIUC you started out this project without VFIO migration being
there.  Now we have VFIO migration so not sure how much it would work for
the upgrade use case. Even with current VFIO migration, we may not want to
migrate device states for a local upgrade I suppose, as that can be a lot
depending on the type of device assigned.  However it'll be nice to discuss
this too if this is the major purpose of the series.

I think one other challenge on QEMU upgrade with VFIO devices is that the
dest QEMU won't be able to open the VFIO device when the src QEMU is still
using it as the owner.  IIUC this is a similar condition where QEMU wants
to have proper ownership transfer of a shared block device, and AFAIR right
now we resolved that issue using some form of file lock on the image file.
In this case it won't easily apply to a VFIO dev fd, but maybe we still
have other approaches, not sure whether you investigated any.  E.g. could
the VFIO handle be passed over using unix scm rights?  I think this might
remove one dependency of using exec which can cause quite some difference
v.s. a generic migration (from which regard, cpr-reboot is still a pretty
generic migration).

You also mentioned vhost/tap, is that also a major goal of this series in
the follow up patchsets?  Is this a problem only because this solution will
do exec?  Can it work if either the exec()ed qemu or dst qemu create the
vhost/tap fds when boot?

Meanwhile, could you elaborate a bit on the implication on chardevs?  From
what I read in the doc update it looks like a major part of work in the
future, but I don't yet understand the issue..  Is it also relevant to the
exec() approach?

In all cases, some of such discussion would be really appreciated.  And if
you used to consider other approaches to solve this problem it'll be great
to mention how you chose this way.  Considering this work contains too many
things, it'll be nice if such discussion can start with the fundamentals,
e.g. on why exec() is a must.


The main goal of cpr-exec is providing a fast and reliable way to update
qemu. cpr-reboot is not fast enough or general enough.  It requires the
guest to support suspend and resume for all devices, and that takes seconds.
If one actually reboots the host, that adds more seconds, depending on
system services.  cpr-exec takes 0.1 secs, and works every time, unlike
like migration which can fail to converge on a busy system.  Live migration
also consumes more system and network resources.


Right, but note that when I was thinking of a comparison between cpr-exec
v.s. normal migration, I didn't mean a "normal live migration".  I think
it's more of the case whether exec() can be avoided.  I had a feeling that
this exec() will cause a major part of work elsewhere but maybe I am wrong
as I didn't see the whole branch.


The only parts of this work that are specific to exec are these patches
and the qemu_clear_cloexec() calls in cpr.c.
  vl: helper to request re-exec
  migration: precreate vmstate for exec

The rest would be the same if some other mechanism were used to start
new qemu.   Additional code would be needed for the new mechanism, such
as SCM_RIGHTS sends.


AFAIU, "cpr-exec takes 0.1 secs" is a conditional result.  I think it at
least should be relevant to what devices are attached to the VM, right?

E.g., I observed at least

Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-05-30 Thread Steven Sistare via


On 5/29/2024 3:25 PM, Peter Xu wrote:

On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:

On 5/28/2024 5:44 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:

Preserve fields of RAMBlocks that allocate their host memory during CPR so
the RAM allocation can be recovered.


This sentence itself did not explain much, IMHO.  QEMU can share memory
using fd based memory already of all kinds, as long as the memory backend
is path-based it can be shared by sharing the same paths to dst.

This reads very confusing as a generic concept.  I mean, QEMU migration
relies on so many things to work right.  We mostly asks the users to "use
exactly the same cmdline for src/dst QEMU unless you know what you're
doing", otherwise many things can break.  That should also include ramblock
being matched between src/dst due to the same cmdlines provided on both
sides.  It'll be confusing to mention this when we thought the ramblocks
also rely on that fact.

So IIUC this sentence should be dropped in the real patch, and I'll try to
guess the real reason with below..


The properties of the implicitly created ramblocks must be preserved.
The defaults can and do change between qemu releases, even when the command-line
parameters do not change for the explicit objects that cause these implicit
ramblocks to be created.


AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
have an example?  Would that already cause issue when migrate?


Alignment has changed, and used_length vs max_length changed when
resizeable ramblocks were introduced.  I have dealt with these issues
while supporting cpr for our internal use, and the learned lesson is to
explicitly communicate the creation-time parameters to new qemu.

These are not an issue for migration because the ramblock is re-created
and the data copied into the new memory.


Mirror the mr->align field in the RAMBlock to simplify the vmstate.
Preserve the old host address, even though it is immediately discarded,
as it will be needed in the future for CPR with iommufd.  Preserve
guest_memfd, even though CPR does not yet support it, to maintain vmstate
compatibility when it becomes supported.


.. It could be about the vfio vaddr update feature that you mentioned and
only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
help here)?

If so, IMHO we should have this patch (or any variance form) to be there
for your upcoming vfio support.  Keeping this around like this will make
the series harder to review.  Or is it needed even before VFIO?


This patch is needed independently of vfio or iommufd.

guest_memfd is independent of vfio or iommufd.  It is a recent addition
which I have not tried to support, but I added this placeholder field
to it can be supported in the future without adding a new field later
and maintaining backwards compatibility.


Is guest_memfd the only user so far, then?  If so, would it be possible we
split it as a separate effort on top of the base cpr-exec support?


I don't understand the question.  I am indeed deferring support for guest_memfd
to a future time.  For now, I am adding a blocker, and reserving a field for
it in the preserved ramblock attributes, to avoid adding a subsection later.

- Steve

Re: [PATCH V1 07/26] migration: VMStateId

2024-05-30 Thread Steven Sistare via


On 5/29/2024 2:53 PM, Peter Xu wrote:

On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote:

How about a more general name for the type:

migration/misc.h
 typedef char (MigrationId)[256];


How about qemu/typedefs.h?  Not sure whether it's applicable. Markus (in
the loop) may have a better idea.

Meanwhile, s/MigrationID/IDString/?


typedefs.h has a different purpose; giving short names to types
defined in internal include files.

This id is specific to migration, so I still think its name should reflect
migration and it belongs in some include/migration/*.h file.

ramblocks and migration are already closely related.  There is nothing wrong
with including a migration header in ramblock.h so it can use a migration type.
We already have:
  include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h"
  include/hw/display/ramfb.h:#include "migration/vmstate.h"
  include/hw/input/pl050.h:#include "migration/vmstate.h"
  include/hw/pci/shpc.h:#include "migration/vmstate.h"
  include/hw/virtio/virtio.h:#include "migration/vmstate.h"
  include/hw/hyperv/vmbus.h:#include "migration/vmstate.h"

The 256 byte magic length already appears in too many places, and my code
would add more occurrences, so I really think that abstracting this type
would be cleaner.

- Steve


exec/ramblock.h
 struct RAMBlock {
 MigrationId idstr;

migration/savevm.c
 typedef struct CompatEntry {
 MigrationId idstr;

 typedef struct SaveStateEntry {
 MigrationId idstr;


- Steve

Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-05-30 Thread Steven Sistare via


On 5/29/2024 3:14 PM, Peter Xu wrote:

On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:

diff --git a/system/memory.c b/system/memory.c
index 49f1cb2..ca04a0e 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
 uint64_t size,
 Error **errp)
   {
+uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;


If there's a machine option to "use memfd for allocations", then it's
shared mem... Hmm..

It is a bit confusing to me in quite a few levels:

- Why memory allocation method will be defined by a machine property,
  even if we have memory-backend-* which should cover everything?


Some memory regions are implicitly created, and have no explicit representation
on the qemu command line.  memfd-alloc affects those.

More generally, memfd-alloc affects all ramblock allocations that are
not explicitly represented by memory-backend object.  Thus the simple
command line "qemu -m 1G" does not explicitly describe an object, so it
goes through the anonymous allocation path, and is affected by memfd-alloc.


Can we simply now allow "qemu -m 1G" to work for cpr-exec?  


I assume you meant "simply not allow".

Yes, I could do that, but I would need to explicitly add code to exclude this
case, and add a blocker.  Right now it "just works" for all paths that lead to
ram_block_alloc_host, without any special logic at the memory-backend level.
And, I'm not convinced that simplifies the docs, as now I would need to tell
the user that "-m 1G" and similar constructions do not work with cpr.

I can try to clarify the doc for -memfd-alloc as currently defined.


AFAIU that's
what we do with cpr-reboot: we ask the user to specify the right things to
make other thing work.  Otherwise it won't.



Internally, create_default_memdev does create a memory-backend object.
That is what my doc comment above refers to:
   Any associated memory-backend objects are created with share=on

An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.

The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:

+# Memory backend objects must have the share=on attribute, and
+# must be mmap'able in the new QEMU process.  For example,
+# memory-backend-file is acceptable, but memory-backend-ram is
+# not.
+#
+# The VM must be started with the '-machine memfd-alloc=on'
+# option.  This causes implicit ram blocks -- those not explicitly
+# described by a memory-backend object -- to be allocated by
+# mmap'ing a memfd.  Examples include VGA, ROM, and even guest
+# RAM when it is specified without a memory-backend object.


VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
memory-backend-file,share=on propertly, these should be the only outliers?

Are these important enough for the downtime?  Can we put them into the
migrated image alongside with the rest device states?


It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
The pages must remain pinned during CPR to support ongoing DMA activity
which could target those pages (which we do not quiesce), and the same
physical pages must be used for the ramblocks in the new qemu process.


- Even if we have such a machine property, why setting "memfd" will
  always imply shared?  why not private?  After all it's not called
  "memfd-shared-alloc", and we can create private mappings using
  e.g. memory-backend-memfd,share=off.


There is no use case for memfd-alloc with share=off, so no point IMO in
making the option more verbose.


Unfortunately this fact doesn't make the property easier to understand. :-( >

For cpr, the mapping with all its modifications must be visible to new
qemu when qemu mmaps it.


So this might be the important part - do you mean migrating
VGA/ROM/... small ramblocks won't work (besides any performance concerns)?
Could you elaborate?


Pinning.


Cpr-reboot already introduced lots of tricky knobs to QEMU.  We may need to
restrict that specialty to minimal, making the interfacing as clear as
possible, or (at least migration) maintainers will start to be soon scared
and running away, if such proposal was not shot down.

In short, I hope when we introduce new knobs for cpr, we shouldn't always
keep cpr-* modes in mind, but consider whenever the user can use it without
cpr-*.  I'm not sure whether it'll be always possible, but we should try.


I agree in principle.  FWIW, I have tried to generalize the functionality needed
by cpr so it can be used in other ways: per-mode blockers, per-mode notifiers,
precreate vmstate, factory objects; to base it on migration internals with
minimal change (vmstate); and to make minimal changes in the migration control
paths.

- Steve

Re: [PATCH V1 05/26] migration: precreate vmstate

2024-05-30 Thread Steven Sistare via


On 5/29/2024 2:39 PM, Peter Xu wrote:

On Tue, May 28, 2024 at 11:09:49AM -0400, Steven Sistare wrote:

On 5/27/2024 2:16 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:

Provide the VMStateDescription precreate field to mark objects that must
be loaded on the incoming side before devices have been created, because
they provide properties that will be needed at creation time.  They will
be saved to and loaded from their own QEMUFile, via
qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
functions are not yet called in this patch.  Allow them to be called
before or after normal migration is active, when current_migration and
current_incoming are not valid.

Signed-off-by: Steve Sistare 
---
   include/migration/vmstate.h |  6 
   migration/savevm.c  | 69 
+
   migration/savevm.h  |  3 ++
   3 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 294d2d8..4691334 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -198,6 +198,12 @@ struct VMStateDescription {
* a QEMU_VM_SECTION_START section.
*/
   bool early_setup;
+
+/*
+ * Send/receive this object in the precreate migration stream.
+ */
+bool precreate;
+
   int version_id;
   int minimum_version_id;
   MigrationPriority priority;
diff --git a/migration/savevm.c b/migration/savevm.c
index 9789823..a30bcd9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -239,6 +239,7 @@ static SaveState savevm_state = {
   #define SAVEVM_FOREACH(se, entry)\
   QTAILQ_FOREACH(se, _state.handlers, entry)\
+if (!se->vmsd || !se->vmsd->precreate)
   #define SAVEVM_FOREACH_ALL(se, entry)\
   QTAILQ_FOREACH(se, _state.handlers, entry)
@@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, 
SaveStateEntry *se,
   }
   }
+static bool send_section_footer(SaveStateEntry *se)
+{
+return (se->vmsd && se->vmsd->precreate) ||
+   migrate_get_current()->send_section_footer;
+}


Does the precreate vmsd "require" the footer?  Or it should also work?
IMHO it's less optimal to bind features without good reasons.


It is not required.  However, IMO we should not treat send-section-footer as
a fungible feature.  It is strictly an improvement, as was added to catch
misformated sections.  It is only registered as a feature for backwards
compatibility with qemu 2.3 and xen.

For a brand new data stream such as precreate, where we are not constrained
by backwards compatibility, we should unconditionally use the better protocol,
and always send the footer.


I see your point, but I still don't think we should mangle these things.
It makes future justification harder on whether section footer should be
sent.

Take example of whatever new feature added for migration like mapped-ram,
we might also want to enforce it by adding "return migrate_mapped_ram() ||
..." but it means we keep growing this with no benefit.

What you worry on "what if this is turned off" isn't a real one: nobody
will turn it off!  We started to deprecate machines, and I had a feeling
it will be enforced at some point by default.


That's fine, I'll delete the send_section_footer() function.

- Steve

Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-05-29 Thread Steven Sistare via


On 5/28/2024 5:12 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:26AM -0700, Steve Sistare wrote:

Allocate anonymous memory using memfd_create if the memfd-alloc machine
option is set.

Signed-off-by: Steve Sistare 
---
  hw/core/machine.c   | 22 ++
  include/hw/boards.h |  1 +
  qemu-options.hx |  6 ++
  system/memory.c |  9 ++---
  system/physmem.c| 18 +-
  system/trace-events |  1 +
  6 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 582c2df..9567b97 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -443,6 +443,20 @@ static void machine_set_mem_merge(Object *obj, bool value, 
Error **errp)
  ms->mem_merge = value;
  }
  
+static bool machine_get_memfd_alloc(Object *obj, Error **errp)

+{
+MachineState *ms = MACHINE(obj);
+
+return ms->memfd_alloc;
+}
+
+static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
+{
+MachineState *ms = MACHINE(obj);
+
+ms->memfd_alloc = value;
+}
+
  static bool machine_get_usb(Object *obj, Error **errp)
  {
  MachineState *ms = MACHINE(obj);
@@ -1044,6 +1058,11 @@ static void machine_class_init(ObjectClass *oc, void 
*data)
  object_class_property_set_description(oc, "mem-merge",
  "Enable/disable memory merge support");
  
+object_class_property_add_bool(oc, "memfd-alloc",

+machine_get_memfd_alloc, machine_set_memfd_alloc);
+object_class_property_set_description(oc, "memfd-alloc",
+"Enable/disable allocating anonymous memory using memfd_create");
+
  object_class_property_add_bool(oc, "usb",
  machine_get_usb, machine_set_usb);
  object_class_property_set_description(oc, "usb",
@@ -1387,6 +1406,9 @@ static bool create_default_memdev(MachineState *ms, const 
char *path, Error **er
  if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
  goto out;
  }
+if (!object_property_set_bool(obj, "share", ms->memfd_alloc, errp)) {
+goto out;
+}
  object_property_add_child(object_get_objects_root(), mc->default_ram_id,
obj);
  /* Ensure backend's memory region name is equal to mc->default_ram_id */
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 69c1ba4..96259c3 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -372,6 +372,7 @@ struct MachineState {
  bool dump_guest_core;
  bool mem_merge;
  bool require_guest_memfd;
+bool memfd_alloc;
  bool usb;
  bool usb_disabled;
  char *firmware;
diff --git a/qemu-options.hx b/qemu-options.hx
index cf61f6b..f0dfda5 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
  "vmport=on|off|auto controls emulation of vmport (default: 
auto)\n"
  "dump-guest-core=on|off include guest memory in a core dump 
(default=on)\n"
  "mem-merge=on|off controls memory merge support (default: 
on)\n"
+"memfd-alloc=on|off controls allocating anonymous guest RAM 
using memfd_create (default: off)\n"
  "aes-key-wrap=on|off controls support for AES key wrapping 
(default=on)\n"
  "dea-key-wrap=on|off controls support for DEA key wrapping 
(default=on)\n"
  "suppress-vmdesc=on|off disables self-describing migration 
(default=off)\n"
@@ -79,6 +80,11 @@ SRST
  supported by the host, de-duplicates identical memory pages
  among VMs instances (enabled by default).
  
+``memfd-alloc=on|off``

+Enables or disables allocation of anonymous guest RAM using
+memfd_create.  Any associated memory-backend objects are created with
+share=on.  The memfd-alloc default is off.
+
  ``aes-key-wrap=on|off``
  Enables or disables AES key wrapping support on s390-ccw hosts.
  This feature controls whether AES wrapping keys will be created
diff --git a/system/memory.c b/system/memory.c
index 49f1cb2..ca04a0e 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
uint64_t size,
Error **errp)
  {
+uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;


If there's a machine option to "use memfd for allocations", then it's
shared mem... Hmm..

It is a bit confusing to me in quite a few levels:

   - Why memory allocation method will be defined by a machine property,
 even if we have memory-backend-* which should cover everything?


Some memory regions are implicitly created, and have no explicit representation
on the qemu command line.  memfd-alloc affects those.

More generally, memfd-alloc affects all ramblock allocations that are
not explicitly represented by memory-backend object.  Thus the simple
command

Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-05-29 Thread Steven Sistare via


On 5/28/2024 5:44 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:

Preserve fields of RAMBlocks that allocate their host memory during CPR so
the RAM allocation can be recovered.


This sentence itself did not explain much, IMHO.  QEMU can share memory
using fd based memory already of all kinds, as long as the memory backend
is path-based it can be shared by sharing the same paths to dst.

This reads very confusing as a generic concept.  I mean, QEMU migration
relies on so many things to work right.  We mostly asks the users to "use
exactly the same cmdline for src/dst QEMU unless you know what you're
doing", otherwise many things can break.  That should also include ramblock
being matched between src/dst due to the same cmdlines provided on both
sides.  It'll be confusing to mention this when we thought the ramblocks
also rely on that fact.

So IIUC this sentence should be dropped in the real patch, and I'll try to
guess the real reason with below..


The properties of the implicitly created ramblocks must be preserved.
The defaults can and do change between qemu releases, even when the command-line
parameters do not change for the explicit objects that cause these implicit
ramblocks to be created.


Mirror the mr->align field in the RAMBlock to simplify the vmstate.
Preserve the old host address, even though it is immediately discarded,
as it will be needed in the future for CPR with iommufd.  Preserve
guest_memfd, even though CPR does not yet support it, to maintain vmstate
compatibility when it becomes supported.


.. It could be about the vfio vaddr update feature that you mentioned and
only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
help here)?

If so, IMHO we should have this patch (or any variance form) to be there
for your upcoming vfio support.  Keeping this around like this will make
the series harder to review.  Or is it needed even before VFIO?


This patch is needed independently of vfio or iommufd.

guest_memfd is independent of vfio or iommufd.  It is a recent addition
which I have not tried to support, but I added this placeholder field
to it can be supported in the future without adding a new field later
and maintaining backwards compatibility.


Another thing to ask: does this idea also need to rely on some future
iommufd kernel support?  If there's anything that's not merged in current
Linux upstream, this series needs to be marked as RFC, so it's not target
for merging.  This will also be true if this patch is "preparing" for that
work.  It means if this patch only services iommufd purpose, even if it
doesn't require any kernel header to be referenced, we should only merge it
together with the full iommufd support comes later (and that'll be after
iommufd kernel supports land).


It does not rely on future kernel support.

- Steve

Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr

2024-05-29 Thread Steven Sistare via


On 5/28/2024 2:21 PM, Peter Xu wrote:

On Tue, May 28, 2024 at 11:10:16AM -0400, Steven Sistare wrote:

On 5/27/2024 2:31 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:

Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
can be saved in the migration stream.  This will be needed for CPR.

Signed-off-by: Steve Sistare 


This is really tricky.

  From a first glance, I don't think migrating a VA is valid at all for
migration even if with exec.. and looks insane to me for a cross-process
migration, which seems to be allowed to use as a generic VMSD helper.. as
VA is the address space barrier for different processes and I think it
normally even apply to generic execve(), and we're trying to jailbreak for
some reason..

It definitely won't work for any generic migration as sizeof(void*) can be
different afaict between hosts, e.g. 32bit -> 64bit migrations.

Some description would be really helpful in this commit message,
e.g. explain the users and why.  Do we need to poison that for generic VMSD
use (perhaps with prefixed underscores)?  I think I'll need to read on the
rest to tell..


Short answer: we never dereference the void* in the new process.  And must not.

Longer answer:

During CPR for vfio, each mapped DMA region is re-registered in the new
process using the new VA.  The ioctl to re-register identifies the mapping
by IOVA and length.

The same requirement holds for CPR of iommufd devices.  However, in the
iommufd framework, IOVA does not uniquely identify a dma mapping, and we
need to use the old VA as the unique identifier.  The new process
re-registers each mapping, passing the old VA and new VA to the kernel.
The old VA is never dereferenced in the new process, we just need its value.

I suspected that the void* which must not be dereferenced might make people
uncomfortable.  I have an older version of my code which adds a uint64_t
field to RAMBlock for recording and migrating the old VA.  The saving and
loading code is slightly less elegant, but no big deal.  Would you prefer
that?


I see, thanks for explaining.  Yes that sounds better to me.  Re the
ugliness: is that about a pre_save() plus one extra uint64_t field?  In
that case it looks better comparing to migrating "void*".


Yes.  Will do.


I'm trying to read some context on the vaddr remap thing from you, and I
found this:

https://lore.kernel.org/all/y90bvbnrvraceq%2f...@nvidia.com/

So it will work with iommufd now?  Meanwhile, what's the status for mdev?
Looks like it isn't supported yet for both.


I am currently working on the kernel and qemu code to support iommufd.
It is an RFE on top of this initial cpr-exec work that I will post separately
later.  I do know that it will require the old VA, so I am proposing to
preserve old VA now in the migration stream to avoid adding extra backwards
compatibility code later.

I have prototyped userland code that supports mdev, based on jason's suggestion 
to
fork an extra process to handle mdev translations during the transition from old
to new qemu, but it is a work in progres and adds complexity, and I do not plan 
to
submit that any time soon.  Another RFE.  For now mdev is not supported.

- Steve

Re: [PATCH V1 07/26] migration: VMStateId

2024-05-29 Thread Steven Sistare via


On 5/28/2024 1:44 PM, Peter Xu wrote:

On Tue, May 28, 2024 at 11:10:03AM -0400, Steven Sistare via wrote:

On 5/27/2024 2:20 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote:

Define a type for the 256 byte id string to guarantee the same length is
used and enforced everywhere.

Signed-off-by: Steve Sistare 
---
   include/exec/ramblock.h | 3 ++-
   include/migration/vmstate.h | 2 ++
   migration/savevm.c  | 8 
   migration/vmstate.c | 3 ++-
   4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 0babd10..61deefe 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -23,6 +23,7 @@
   #include "cpu-common.h"
   #include "qemu/rcu.h"
   #include "exec/ramlist.h"
+#include "migration/vmstate.h"
   struct RAMBlock {
   struct rcu_head rcu;
@@ -35,7 +36,7 @@ struct RAMBlock {
   void (*resized)(const char*, uint64_t length, void *host);
   uint32_t flags;
   /* Protected by the BQL.  */
-char idstr[256];
+VMStateId idstr;
   /* RCU-enabled, writes protected by the ramlist lock */
   QLIST_ENTRY(RAMBlock) next;
   QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;


Hmm.. Don't look like a good idea to include a migration header in
ramblock.h?  Is this ramblock change needed for this work?


Well, entities that are migrated include migration headers, and now that
includes RAMBlock.  There is precedent:

0 include/exec/ramblock.h   26 #include "migration/vmstate.h"
1 include/hw/acpi/ich9_tco. 14 #include "migration/vmstate.h"
2 include/hw/display/ramfb.  4 #include "migration/vmstate.h"
3 include/hw/hyperv/vmbus.h 16 #include "migration/vmstate.h"
4 include/hw/input/pl050.h  14 #include "migration/vmstate.h"
5 include/hw/pci/shpc.h  7 #include "migration/vmstate.h"
6 include/hw/virtio/virtio. 20 #include "migration/vmstate.h"
7 include/migration/cpu.h8 #include "migration/vmstate.h"

Granted, only some of the C files that include ramblock.h need all of vmstate.h.
I could define VMStateId in a smaller file such as migration/misc.h, or a
new file migration/vmstateid.h, and include that in ramblock.h.
Any preference?


One issue here is currently the idstr[] of ramblock is a verbose name of
the ramblock, and logically it doesn't need to have anything to do with
vmstate.

I'll continue to read to see why we need VMStateID here for a ramblock.  So
if it is necessary, maybe the name VMStateID to be used here is misleading?
It'll also be nice to separate the changes, as ramblock.h doesn't belong to
migration subsystem but memory api.


cpr requires migrating vmstate for ramblock.  See the physmem patches for
why. vmstate requires a unique id, and ramblock already defines a unique
id which is used to identify the block and dirty bitmap in the migration
stream.

How about a more general name for the type:

migration/misc.h
typedef char (MigrationId)[256];

exec/ramblock.h
struct RAMBlock {
MigrationId idstr;

migration/savevm.c
typedef struct CompatEntry {
MigrationId idstr;

typedef struct SaveStateEntry {
MigrationId idstr;


- Steve

Re: [PATCH V1 05/26] migration: precreate vmstate

2024-05-28 Thread Steven Sistare via


On 5/27/2024 2:16 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:

Provide the VMStateDescription precreate field to mark objects that must
be loaded on the incoming side before devices have been created, because
they provide properties that will be needed at creation time.  They will
be saved to and loaded from their own QEMUFile, via
qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
functions are not yet called in this patch.  Allow them to be called
before or after normal migration is active, when current_migration and
current_incoming are not valid.

Signed-off-by: Steve Sistare 
---
  include/migration/vmstate.h |  6 
  migration/savevm.c  | 69 +
  migration/savevm.h  |  3 ++
  3 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 294d2d8..4691334 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -198,6 +198,12 @@ struct VMStateDescription {
   * a QEMU_VM_SECTION_START section.
   */
  bool early_setup;
+
+/*
+ * Send/receive this object in the precreate migration stream.
+ */
+bool precreate;
+
  int version_id;
  int minimum_version_id;
  MigrationPriority priority;
diff --git a/migration/savevm.c b/migration/savevm.c
index 9789823..a30bcd9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -239,6 +239,7 @@ static SaveState savevm_state = {
  
  #define SAVEVM_FOREACH(se, entry)\

  QTAILQ_FOREACH(se, _state.handlers, entry)\
+if (!se->vmsd || !se->vmsd->precreate)
  
  #define SAVEVM_FOREACH_ALL(se, entry)\

  QTAILQ_FOREACH(se, _state.handlers, entry)
@@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, 
SaveStateEntry *se,
  }
  }
  
+static bool send_section_footer(SaveStateEntry *se)

+{
+return (se->vmsd && se->vmsd->precreate) ||
+   migrate_get_current()->send_section_footer;
+}


Does the precreate vmsd "require" the footer?  Or it should also work?
IMHO it's less optimal to bind features without good reasons.


It is not required.  However, IMO we should not treat send-section-footer as
a fungible feature.  It is strictly an improvement, as was added to catch
misformated sections.  It is only registered as a feature for backwards
compatibility with qemu 2.3 and xen.

For a brand new data stream such as precreate, where we are not constrained
by backwards compatibility, we should unconditionally use the better protocol,
and always send the footer.

- Steve

Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-28 Thread Steven Sistare via


On 5/27/2024 1:45 PM, Peter Xu wrote:

On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:

I understand, thanks.  If I can help with any of your todo list,
just ask - steve


Thanks for offering the help, Steve.  Started looking at this today, then I
found that I miss something high-level.  Let me ask here, and let me
apologize already for starting to throw multiple questions..

IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
this case not host kernel but QEMU-only, and/or upper.

Is there any justification on why the complexity is needed here?  It looks
to me this one is more involved than cpr-reboot, so I'm thinking how much
we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
the min support, and if we even expect more to come, that's really
important, IMHO.

For example, what's the major motivation of this whole work?  Is that more
on performance, or is it more for supporting the special devices like VFIO
which we used to not support, or something else?  I can't find them in
whatever cover letter I can find, including this one.

Firstly, regarding performance, IMHO it'll be always nice to share even
some very fundamental downtime measurement comparisons using the new exec
mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
have some number on hand when you started working on this feature years
ago?  Or maybe some old links on the list would help too, as I didn't
follow this work since the start.

On VFIO, IIUC you started out this project without VFIO migration being
there.  Now we have VFIO migration so not sure how much it would work for
the upgrade use case. Even with current VFIO migration, we may not want to
migrate device states for a local upgrade I suppose, as that can be a lot
depending on the type of device assigned.  However it'll be nice to discuss
this too if this is the major purpose of the series.

I think one other challenge on QEMU upgrade with VFIO devices is that the
dest QEMU won't be able to open the VFIO device when the src QEMU is still
using it as the owner.  IIUC this is a similar condition where QEMU wants
to have proper ownership transfer of a shared block device, and AFAIR right
now we resolved that issue using some form of file lock on the image file.
In this case it won't easily apply to a VFIO dev fd, but maybe we still
have other approaches, not sure whether you investigated any.  E.g. could
the VFIO handle be passed over using unix scm rights?  I think this might
remove one dependency of using exec which can cause quite some difference
v.s. a generic migration (from which regard, cpr-reboot is still a pretty
generic migration).

You also mentioned vhost/tap, is that also a major goal of this series in
the follow up patchsets?  Is this a problem only because this solution will
do exec?  Can it work if either the exec()ed qemu or dst qemu create the
vhost/tap fds when boot?

Meanwhile, could you elaborate a bit on the implication on chardevs?  From
what I read in the doc update it looks like a major part of work in the
future, but I don't yet understand the issue..  Is it also relevant to the
exec() approach?

In all cases, some of such discussion would be really appreciated.  And if
you used to consider other approaches to solve this problem it'll be great
to mention how you chose this way.  Considering this work contains too many
things, it'll be nice if such discussion can start with the fundamentals,
e.g. on why exec() is a must.


The main goal of cpr-exec is providing a fast and reliable way to update
qemu. cpr-reboot is not fast enough or general enough.  It requires the
guest to support suspend and resume for all devices, and that takes seconds.
If one actually reboots the host, that adds more seconds, depending on
system services.  cpr-exec takes 0.1 secs, and works every time, unlike
like migration which can fail to converge on a busy system.  Live migration
also consumes more system and network resources.  cpr-exec seamlessly
preserves client connections by preserving chardevs, and overall provides
a much nicer user experience.

chardev's are preserved by keeping their fd open across the exec, and
remembering the value of the fd in precreate vmstate so that new qemu
can associate the fd with the chardev rather than opening a new one.

The approach of preserving open file descriptors is very general and applicable
to all kinds of devices, regardless of whether they support live migration
in hardware.  Device fd's are preserved using the same mechanism as for
chardevs.

Devices that support live migration in hardware do not like to live migrate
in place to the same node.  It is not what they are designed for, and some
implementations will flat out fail because the source and target interfaces
are the same.

For vhost/tap, sometimes the management layer opens the dev and passes an
fd to qemu, and sometimes qemu opens the dev.  The upcoming vhost/tap support
allows both

Re: [PATCH V1 07/26] migration: VMStateId

2024-05-28 Thread Steven Sistare via


On 5/27/2024 2:20 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote:

Define a type for the 256 byte id string to guarantee the same length is
used and enforced everywhere.

Signed-off-by: Steve Sistare 
---
  include/exec/ramblock.h | 3 ++-
  include/migration/vmstate.h | 2 ++
  migration/savevm.c  | 8 
  migration/vmstate.c | 3 ++-
  4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 0babd10..61deefe 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -23,6 +23,7 @@
  #include "cpu-common.h"
  #include "qemu/rcu.h"
  #include "exec/ramlist.h"
+#include "migration/vmstate.h"
  
  struct RAMBlock {

  struct rcu_head rcu;
@@ -35,7 +36,7 @@ struct RAMBlock {
  void (*resized)(const char*, uint64_t length, void *host);
  uint32_t flags;
  /* Protected by the BQL.  */
-char idstr[256];
+VMStateId idstr;
  /* RCU-enabled, writes protected by the ramlist lock */
  QLIST_ENTRY(RAMBlock) next;
  QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;


Hmm.. Don't look like a good idea to include a migration header in
ramblock.h?  Is this ramblock change needed for this work?


Well, entities that are migrated include migration headers, and now that
includes RAMBlock.  There is precedent:

0 include/exec/ramblock.h   26 #include "migration/vmstate.h"
1 include/hw/acpi/ich9_tco. 14 #include "migration/vmstate.h"
2 include/hw/display/ramfb.  4 #include "migration/vmstate.h"
3 include/hw/hyperv/vmbus.h 16 #include "migration/vmstate.h"
4 include/hw/input/pl050.h  14 #include "migration/vmstate.h"
5 include/hw/pci/shpc.h  7 #include "migration/vmstate.h"
6 include/hw/virtio/virtio. 20 #include "migration/vmstate.h"
7 include/migration/cpu.h8 #include "migration/vmstate.h"

Granted, only some of the C files that include ramblock.h need all of vmstate.h.
I could define VMStateId in a smaller file such as migration/misc.h, or a
new file migration/vmstateid.h, and include that in ramblock.h.
Any preference?

- Steve

Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr

2024-05-28 Thread Steven Sistare via


On 5/27/2024 2:31 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote:

Define VMSTATE_VOID_PTR so the value of a pointer (but not its target)
can be saved in the migration stream.  This will be needed for CPR.

Signed-off-by: Steve Sistare 


This is really tricky.

 From a first glance, I don't think migrating a VA is valid at all for
migration even if with exec.. and looks insane to me for a cross-process
migration, which seems to be allowed to use as a generic VMSD helper.. as
VA is the address space barrier for different processes and I think it
normally even apply to generic execve(), and we're trying to jailbreak for
some reason..

It definitely won't work for any generic migration as sizeof(void*) can be
different afaict between hosts, e.g. 32bit -> 64bit migrations.

Some description would be really helpful in this commit message,
e.g. explain the users and why.  Do we need to poison that for generic VMSD
use (perhaps with prefixed underscores)?  I think I'll need to read on the
rest to tell..


Short answer: we never dereference the void* in the new process.  And must not.

Longer answer:

During CPR for vfio, each mapped DMA region is re-registered in the new
process using the new VA.  The ioctl to re-register identifies the mapping
by IOVA and length.

The same requirement holds for CPR of iommufd devices.  However, in the
iommufd framework, IOVA does not uniquely identify a dma mapping, and we
need to use the old VA as the unique identifier.  The new process
re-registers each mapping, passing the old VA and new VA to the kernel.
The old VA is never dereferenced in the new process, we just need its value.

I suspected that the void* which must not be dereferenced might make people
uncomfortable.  I have an older version of my code which adds a uint64_t
field to RAMBlock for recording and migrating the old VA.  The saving and
loading code is slightly less elegant, but no big deal.  Would you prefer
that?

- Steve

Re: [PATCH V1 23/26] migration: misc cpr-exec blockers

2024-05-27 Thread Steven Sistare via


On 5/24/2024 8:40 AM, Fabiano Rosas wrote:

Steve Sistare  writes:


Add blockers for cpr-exec migration mode for devices and options that do
not support it.

Signed-off-by: Steve Sistare 
---
  accel/xen/xen-all.c|  5 +
  backends/hostmem-epc.c | 12 ++--
  hw/vfio/migration.c|  3 ++-
  replay/replay.c|  6 ++
  4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c
index 0bdefce..9a7ed0f 100644
--- a/accel/xen/xen-all.c
+++ b/accel/xen/xen-all.c


This file is missing the migration/blocker.h include.


Good eyes, will fix - steve

Re: [PATCH V1 20/26] migration: cpr-exec mode

2024-05-27 Thread Steven Sistare via


On 5/24/2024 10:58 AM, Fabiano Rosas wrote:

Steve Sistare  writes:


Add the cpr-exec migration mode.  Usage:
   qemu-system-$arch -machine memfd-alloc=on ...
   migrate_set_parameter mode cpr-exec
   migrate_set_parameter cpr-exec-args \
   ... -incoming 
   migrate -d 

The migrate command stops the VM, saves state to the URI,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from the URI.  Guest RAM is preserved in place,
albeit with new virtual addresses.

Arguments for the new QEMU process are taken from the
@cpr-exec-args parameter.  The first argument should be the
path of a new QEMU binary, or a prefix command that exec's the
new QEMU binary.

Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so the URI must be a type, such as
a file, that reads all data before old QEMU exits.

Memory backend objects must have the share=on attribute, and
must be mmap'able in the new QEMU process.  For example,
memory-backend-file is acceptable, but memory-backend-ram is
not.

The VM must be started with the '-machine memfd-alloc=on'
option.  This causes implicit ram blocks (those not explicitly
described by a memory-backend object) to be allocated by
mmap'ing a memfd.  Examples include VGA, ROM, and even guest
RAM when it is specified without a memory-backend object.

The implementation saves precreate vmstate at the end of normal
migration in migrate_fd_cleanup, and tells the main loop to call
cpr_exec.  Incoming qemu loads preceate state early, before objects
are created.  The memfds are kept open across exec by clearing the
close-on-exec flag, their values are saved in precreate vmstate,
and they are mmap'd in new qemu.

Note that the memfd-alloc option is not related to memory-backend-memfd.
Later patches add support for memory-backend-memfd, and for additional
devices, including vfio, chardev, and more.

Signed-off-by: Steve Sistare 
---
  include/migration/cpr.h  |  14 +
  include/migration/misc.h |   3 ++
  migration/cpr.c  | 131 +++
  migration/meson.build|   1 +
  migration/migration.c|  21 
  migration/migration.h|   5 +-
  migration/ram.c  |   1 +
  qapi/migration.json  |  30 ++-
  system/physmem.c |   2 +
  system/vl.c  |   4 ++
  10 files changed, 210 insertions(+), 2 deletions(-)
  create mode 100644 include/migration/cpr.h
  create mode 100644 migration/cpr.c
[...]
diff --git a/migration/migration.c b/migration/migration.c
index b5af6b5..0d91531 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -239,6 +239,7 @@ void migration_object_init(void)
  blk_mig_init();
  ram_mig_init();
  dirty_bitmap_mig_init();
+cpr_mig_init();
  }
  
  typedef struct {

@@ -1395,6 +1396,15 @@ static void migrate_fd_cleanup(MigrationState *s)
  qemu_fclose(tmp);
  }
  
+if (migrate_mode() == MIG_MODE_CPR_EXEC) {

+Error *err = NULL;
+if (migration_precreate_save()) {
+migrate_set_error(s, err);
+error_report_err(err);


There's an error_report_err() call already a few lines down.


+migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED);
+}
+}


Not a fan of saving state in the middle of the cleanup function. This
adds extra restrictions to migrate_fd_cleanup() which already tends to
be the source of a bunch of bugs.

Can this be done either entirely before or after migrate_fd_cleanup()?
There's only one callsite from which you actually want to do cpr-exec,
migration_iteration_finish(). It's no big deal if we have to play a bit
with the notifier call placement.

static void migration_iteration_finish(MigrationState *s)
{
...
 migration_bh_schedule(migrate_cpr_exec_bh, s);
 migration_bh_schedule(migrate_fd_cleanup_bh, s);
 bql_unlock();
}

IIUC, the BQL ensures the ordering here, but if that doesn't work we
could just call the cpr function right at the end of
migrate_fd_cleanup(). That would already be better than interleaving.

static void migrate_cpr_exec_bh(void *opaque)
{
 MigrationState *s = opaque;
 Error *err = NULL;

 if (migration_has_failed(s) || migrate_mode() != MIG_MODE_CPR_EXEC) {
 return;
 }

 assert(s->state == MIGRATION_STATUS_COMPLETED);

 if (migration_precreate_save()) {
 migrate_set_error(s, err);
 error_report_err(err);
 migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED);
 migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);

 return;
 }

 qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_args);
}


No problem, I can hoist the cpr exec logic out of migrate_fd_cleanup.
I'll call migration_precreate_save prior, and I'll register a notifier
for MIG_EVENT_PRECOPY_DONE that calls qemu_system_exec_request.

BTW the following does not work

Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-24 Thread Steven Sistare


On 5/24/2024 9:02 AM, Fabiano Rosas wrote:

Steve Sistare  writes:


This patch series adds the live migration cpr-exec mode.  In this mode, QEMU
stops the VM, writes VM state to the migration URI, and directly exec's a
new version of QEMU on the same host, replacing the original process while
retaining its PID.  Guest RAM is preserved in place, albeit with new virtual
addresses.  The user completes the migration by specifying the -incoming
option, and by issuing the migrate-incoming command if necessary.  This
saves and restores VM state, with minimal guest pause time, so that QEMU may
be updated to a new version in between.

The new interfaces are:
   * cpr-exec (MigMode migration parameter)
   * cpr-exec-args (migration parameter)
   * memfd-alloc=on (command-line option for -machine)
   * only-migratable-modes (command-line argument)

The caller sets the mode parameter before invoking the migrate command.

Arguments for the new QEMU process are taken from the cpr-exec-args parameter.
The first argument should be the path of a new QEMU binary, or a prefix
command that exec's the new QEMU binary, and the arguments should include
the -incoming option.

Memory backend objects must have the share=on attribute, and must be mmap'able
in the new QEMU process.  For example, memory-backend-file is acceptable,
but memory-backend-ram is not.

QEMU must be started with the '-machine memfd-alloc=on' option.  This causes
implicit RAM blocks (those not explicitly described by a memory-backend
object) to be allocated by mmap'ing a memfd.  Examples include VGA, ROM,
and even guest RAM when it is specified without without reference to a
memory-backend object.   The memfds are kept open across exec, their values
are saved in vmstate which is retrieved after exec, and they are re-mmap'd.

The '-only-migratable-modes cpr-exec' option guarantees that the
configuration supports cpr-exec.  QEMU will exit at start time if not.

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would set a new QEMU binary path in cpr-exec-args.

   # qemu-kvm -monitor stdio -object
   memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
   -m 4G -machine memfd-alloc=on ...

   QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running
   (qemu) migrate_set_parameter mode cpr-exec
   (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming 
file:vm.state
   (qemu) migrate -d file:vm.state
   (qemu) QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running

cpr-exec mode preserves attributes of outgoing devices that must be known
before the device is created on the incoming side, such as the memfd descriptor
number, but currently the migration stream is read after all devices are
created.  To solve this problem, I add two VMStateDescription options:
precreate and factory.  precreate objects are saved to their own migration
stream, distinct from the main stream, and are read early by incoming QEMU,
before devices are created.  Factory objects are allocated on demand, without
relying on a pre-registered object's opaque address, which is necessary
because the devices to which the state will apply have not been created yet
and hence have not registered an opaque address to receive the state.

This patch series implements a minimal version of cpr-exec.  Future series
will add support for:
   * vfio
   * chardev's without loss of connectivity
   * vhost
   * fine-grained seccomp controls
   * hostmem-memfd
   * cpr-exec migration test


Steve Sistare (26):
   oslib: qemu_clear_cloexec
   vl: helper to request re-exec
   migration: SAVEVM_FOREACH
   migration: delete unused parameter mis
   migration: precreate vmstate
   migration: precreate vmstate for exec
   migration: VMStateId
   migration: vmstate_info_void_ptr
   migration: vmstate_register_named
   migration: vmstate_unregister_named
   migration: vmstate_register at init time
   migration: vmstate factory object
   physmem: ram_block_create
   physmem: hoist guest_memfd creation
   physmem: hoist host memory allocation
   physmem: set ram block idstr earlier
   machine: memfd-alloc option
   migration: cpr-exec-args parameter
   physmem: preserve ram blocks for cpr
   migration: cpr-exec mode
   migration: migrate_add_blocker_mode
   migration: ram block cpr-exec blockers
   migration: misc cpr-exec blockers
   seccomp: cpr-exec blocker
   migration: fix mismatched GPAs during cpr-exec
   migration: only-migratable-modes

  accel/xen/xen-all.c|   5 +
  backends/hostmem-epc.c |  12 +-
  hmp-commands.hx|   2 +-
  hw/core/machine.c  |  22 +++
  hw/core/qdev.c |   1 +
  hw/intc/apic_common.c  |   2 +-
  hw/vfio/migration.c|   3 +-
  include/exec/cpu-common.h  |   3 +-
  include/exec/memory.h  |  15 ++
  include/exec/ramblock.h|  10 +-

Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-21 Thread Steven Sistare

I understand, thanks.  If I can help with any of your todo list,
just ask - steve

On 5/20/2024 10:31 PM, Peter Xu wrote:
Conference back then pto until today, so tomorrow will be my first working day 
after those. Sorry Steve, will try my best to read it before next week. I didn't 
dare to read too much my inbox yet.  A bit scared but need to face it tomorrow.

On Mon, May 20, 2024, 6:28 p.m. Fabiano Rosas <mailto:faro...@suse.de>> wrote:

    Steven Sistare mailto:steven.sist...@oracle.com>> writes:

 > Hi Peter, Hi Fabiano,
 >    Will you have time to review the migration guts of this series any
time soon?
 > In particular:
 >
 > [PATCH V1 05/26] migration: precreate vmstate
 > [PATCH V1 06/26] migration: precreate vmstate for exec
 > [PATCH V1 12/26] migration: vmstate factory object
 > [PATCH V1 18/26] migration: cpr-exec-args parameter
 > [PATCH V1 20/26] migration: cpr-exec mode
 >

I'll get to them this week. I'm trying to make some progress with my own
code before I forget how to program. I'm also trying to find some time
to implement the device options in the migration tests so we can stop
these virtio-* breakages that have been popping up.

Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-20 Thread Steven Sistare


Hi Peter, Hi Fabiano,
  Will you have time to review the migration guts of this series any time soon?
In particular:

[PATCH V1 05/26] migration: precreate vmstate
[PATCH V1 06/26] migration: precreate vmstate for exec
[PATCH V1 12/26] migration: vmstate factory object
[PATCH V1 18/26] migration: cpr-exec-args parameter
[PATCH V1 20/26] migration: cpr-exec mode

- Steve

On 4/29/2024 11:55 AM, Steve Sistare wrote:

This patch series adds the live migration cpr-exec mode.  In this mode, QEMU
stops the VM, writes VM state to the migration URI, and directly exec's a
new version of QEMU on the same host, replacing the original process while
retaining its PID.  Guest RAM is preserved in place, albeit with new virtual
addresses.  The user completes the migration by specifying the -incoming
option, and by issuing the migrate-incoming command if necessary.  This
saves and restores VM state, with minimal guest pause time, so that QEMU may
be updated to a new version in between.

The new interfaces are:
   * cpr-exec (MigMode migration parameter)
   * cpr-exec-args (migration parameter)
   * memfd-alloc=on (command-line option for -machine)
   * only-migratable-modes (command-line argument)

The caller sets the mode parameter before invoking the migrate command.

Arguments for the new QEMU process are taken from the cpr-exec-args parameter.
The first argument should be the path of a new QEMU binary, or a prefix
command that exec's the new QEMU binary, and the arguments should include
the -incoming option.

Memory backend objects must have the share=on attribute, and must be mmap'able
in the new QEMU process.  For example, memory-backend-file is acceptable,
but memory-backend-ram is not.

QEMU must be started with the '-machine memfd-alloc=on' option.  This causes
implicit RAM blocks (those not explicitly described by a memory-backend
object) to be allocated by mmap'ing a memfd.  Examples include VGA, ROM,
and even guest RAM when it is specified without without reference to a
memory-backend object.   The memfds are kept open across exec, their values
are saved in vmstate which is retrieved after exec, and they are re-mmap'd.

The '-only-migratable-modes cpr-exec' option guarantees that the
configuration supports cpr-exec.  QEMU will exit at start time if not.

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would set a new QEMU binary path in cpr-exec-args.

   # qemu-kvm -monitor stdio -object
   memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
   -m 4G -machine memfd-alloc=on ...

   QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running
   (qemu) migrate_set_parameter mode cpr-exec
   (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming 
file:vm.state
   (qemu) migrate -d file:vm.state
   (qemu) QEMU 9.1.50 monitor - type 'help' for more information
   (qemu) info status
   VM status: running

cpr-exec mode preserves attributes of outgoing devices that must be known
before the device is created on the incoming side, such as the memfd descriptor
number, but currently the migration stream is read after all devices are
created.  To solve this problem, I add two VMStateDescription options:
precreate and factory.  precreate objects are saved to their own migration
stream, distinct from the main stream, and are read early by incoming QEMU,
before devices are created.  Factory objects are allocated on demand, without
relying on a pre-registered object's opaque address, which is necessary
because the devices to which the state will apply have not been created yet
and hence have not registered an opaque address to receive the state.

This patch series implements a minimal version of cpr-exec.  Future series
will add support for:
   * vfio
   * chardev's without loss of connectivity
   * vhost
   * fine-grained seccomp controls
   * hostmem-memfd
   * cpr-exec migration test


Steve Sistare (26):
   oslib: qemu_clear_cloexec
   vl: helper to request re-exec
   migration: SAVEVM_FOREACH
   migration: delete unused parameter mis
   migration: precreate vmstate
   migration: precreate vmstate for exec
   migration: VMStateId
   migration: vmstate_info_void_ptr
   migration: vmstate_register_named
   migration: vmstate_unregister_named
   migration: vmstate_register at init time
   migration: vmstate factory object
   physmem: ram_block_create
   physmem: hoist guest_memfd creation
   physmem: hoist host memory allocation
   physmem: set ram block idstr earlier
   machine: memfd-alloc option
   migration: cpr-exec-args parameter
   physmem: preserve ram blocks for cpr
   migration: cpr-exec mode
   migration: migrate_add_blocker_mode
   migration: ram block cpr-exec blockers
   migration: misc cpr-exec blockers
   seccomp: cpr-exec blocker
   migration: fix mismatched GPAs during cpr-exec
   migration: only-migratable-modes

  accel/xen/xen-all.c|   5 +
  backends/hostmem-epc.c |

Re: CPR/liveupdate: test results using prior bug fix

2024-05-16 Thread Steven Sistare


On 5/16/2024 1:24 PM, Michael Galaxy wrote:

On 5/14/24 08:54, Michael Tokarev wrote:

On 5/14/24 16:39, Michael Galaxy wrote:

Steve,

OK, so it does not look like this bugfix you wrote was included in 8.2.4 
(which was released yesterday). Unfortunately, that means that anyone using 
CPR in that release will still (eventually) encounter the bug like I did.


8.2.4 is basically a "bugfix" release for 8.2.3 which I somewhat
screwed up (in a minor way), plus a few currently (at the time)
queued up changes.   8.2.3 was a big release though.

I would recommend that y'all consider cherry-picking, perhaps, the relevant 
commits for a possible 8.2.5 ?


Please Cc changes which are relevant for -stable to, well,
qemu-sta...@nongnu.org :)

Which changes needs to be picked up?

Steve, can you comment here, please? At a minimum, we have this one: [PULL 
20/25] migration: stop vm for cpr


But that pull came with a handful of other changes that are also not in QEMU v8, 
so I suspect I'm missing some other important changes that might be important 
for a stable release?


- Michael


Hi Michael, I sent the full list of commits to this distribution yesterday, and
I see it in my Sent email folder.  Copying verbatim:


Michael Galaxy, I'm afraid you are out of luck with respect to qemu 8.2.
It has some of the cpr reboot commits, but is missing the following:

87a2848 migration: massage cpr-reboot documentation
cbdafc1 migration: options incompatible with cpr
ce5db1c migration: update cpr-reboot description
9867d4d migration: stop vm for cpr
4af667f migration: notifier error checking
bf78a04 migration: refactor migrate_fd_connect failures
6835f5a migration: per-mode notifiers
5663dd3 migration: MigrationNotifyFunc
c763a23e migration: remove postcopy_after_devices
9d9babf migration: MigrationEvent for notifiers
3e77573 migration: convert to NotifierWithReturn
d91f33c migration: remove error from notifier data
be19d83 notify: pass error to notifier with return
b12635f migration: fix coverity migrate_mode finding
2b58a8b tests/qtest: postcopy migration with suspend
b1fdd21 tests/qtest: precopy migration with suspend
5014478 tests/qtest: option to suspend during migration
f064975 tests/qtest: migration events
49a5020 migration: preserve suspended for bg_migration
58b1057 migration: preserve suspended for snapshot
b4e9ddc migration: preserve suspended runstate
d3c86c99 migration: propagate suspended runstate
9ff5e79 cpus: vm_resume
0f1db06 cpus: check running not RUN_STATE_RUNNING
b9ae473 cpus: stop vm in suspended runstate
f06f316 cpus: vm_was_suspended

All of those landed in qemu 9.0.
---

- Steve

Re: CPR/liveupdate: test results using prior bug fix

2024-05-15 Thread Steven Sistare


On 5/14/2024 11:33 AM, Michael Tokarev wrote:

14.05.2024 16:54, Michael Tokarev пишет:

On 5/14/24 16:39, Michael Galaxy wrote:

Steve,

OK, so it does not look like this bugfix you wrote was included in 8.2.4 
(which was released yesterday). Unfortunately, that means that anyone using 
CPR in that release will still (eventually) encounter the bug like I did.


8.2.4 is basically a "bugfix" release for 8.2.3 which I somewhat
screwed up (in a minor way), plus a few currently (at the time)
queued up changes.   8.2.3 was a big release though.

I would recommend that y'all consider cherry-picking, perhaps, the relevant 
commits for a possible 8.2.5 ?


Please Cc changes which are relevant for -stable to, well,
qemu-sta...@nongnu.org :)

Which changes needs to be picked up?

Please also note this particular change does not apply cleanly to
stable-8.2 branch due to other changes in this area between 8.2
and 9.0, in particular, in postcopy_start().  So if this is to be
picked up for stable-8.2, I need help from someone who better
understands this code to select changes to pick.

Thanks,

/mjt


Michael Galaxy, I'm afraid you are out of luck with respect to qemu 8.2.
It has some of the cpr reboot commits, but is missing the following:

87a2848 migration: massage cpr-reboot documentation
cbdafc1 migration: options incompatible with cpr
ce5db1c migration: update cpr-reboot description
9867d4d migration: stop vm for cpr
4af667f migration: notifier error checking
bf78a04 migration: refactor migrate_fd_connect failures
6835f5a migration: per-mode notifiers
5663dd3 migration: MigrationNotifyFunc
c763a23e migration: remove postcopy_after_devices
9d9babf migration: MigrationEvent for notifiers
3e77573 migration: convert to NotifierWithReturn
d91f33c migration: remove error from notifier data
be19d83 notify: pass error to notifier with return
b12635f migration: fix coverity migrate_mode finding
2b58a8b tests/qtest: postcopy migration with suspend
b1fdd21 tests/qtest: precopy migration with suspend
5014478 tests/qtest: option to suspend during migration
f064975 tests/qtest: migration events
49a5020 migration: preserve suspended for bg_migration
58b1057 migration: preserve suspended for snapshot
b4e9ddc migration: preserve suspended runstate
d3c86c99 migration: propagate suspended runstate
9ff5e79 cpus: vm_resume
0f1db06 cpus: check running not RUN_STATE_RUNNING
b9ae473 cpus: stop vm in suspended runstate
f06f316 cpus: vm_was_suspended

All of those landed in qemu 9.0.

- Steve

Re: CPR/liveupdate: test results using prior bug fix

2024-05-13 Thread Steven Sistare


Hi Michael,
  No surprise here, I did see some of the same failure messages and they
prompted me to submit the fix.  They are all symptoms of "the possibility of
ram and device state being out of sync" as mentioned in the commit.

I am not familiar with the process for maintaining old releases for qemu.
Perhaps someone on this list can comment on 8.2.3.

- Steve

On 5/13/2024 2:22 PM, Michael Galaxy wrote:

Hi Steve,

We found that this specific change in particular ("migration: stop vm for cpr") 
fixes a bug that we've identified in testing back-to-back live updates in a lab 
environment.


More specifically, *without* this change (which is not available in 8.2.2, but 
*is* available in 9.0.0) causes the metadata save file to be corrupted when 
doing live updates one after another. Typically we see a corrupted save file 
somewhere in between 20 and 30 live updates and while doing a git bisect, we 
found that this change makes the problem go away.


Were you aware? Is there any plan in place to cherry pick this for 8.2.3, 
perhaps or a plan to release 8.2.3 at some point?


Here are some examples of how the bug manifests in different locations of the 
QEMU metadata save file:


2024-04-26T13:28:53Z qemu-system-x86_64: Failed to load mtrr_var:base
2024-04-26T13:28:53Z qemu-system-x86_64: Failed to load cpu:env.mtrr_var
2024-04-26T13:28:53Z qemu-system-x86_64: error while loading state for instance 
0x1b of device 'cpu'
2024-04-26T13:28:53Z qemu-system-x86_64: load of migration failed: Input/output 
error

And another:

2024-04-17T16:09:47Z qemu-system-x86_64: check_section_footer: Read section 
footer failed: -5
2024-04-17T16:09:47Z qemu-system-x86_64: load of migration failed: Invalid 
argument

And another:

2024-04-30T21:53:29Z qemu-system-x86_64: Unable to read ID string for section 
163
2024-04-30T21:53:29Z qemu-system-x86_64: load of migration failed: Invalid 
argument

And another:

2024-05-01T16:01:44Z qemu-system-x86_64: Unable to read ID string for section 
164
2024-05-01T16:01:44Z qemu-system-x86_64: load of migration failed: Invalid 
argument
  

As you can see, they occur quite randomly, but generally it takes at least 
20-30+ live updates before the problem occurs.


- Michael

On 2/27/24 23:13, pet...@redhat.com wrote:

From: Steve Sistare

When migration for cpr is initiated, stop the vm and set state
RUN_STATE_FINISH_MIGRATE before ram is saved.  This eliminates the
possibility of ram and device state being out of sync, and guarantees
that a guest in the suspended state remains suspended, because qmp_cont
rejects a cont command in the RUN_STATE_FINISH_MIGRATE state.

Signed-off-by: Steve Sistare
Reviewed-by: Peter Xu
Link:https://urldefense.com/v3/__https://lore.kernel.org/r/1708622920-68779-11-git-send-email-steven.sistare@oracle.com__;!!GjvTz_vk!QLsFOCX-x2U9bzAo98SdidKlomHrmf_t0UmQKtgudoIcaDVoAJOPm39ZqaNP_nT5I8QqVfSgwhDZmg$  
Signed-off-by: Peter Xu

---
  include/migration/misc.h |  1 +
  migration/migration.h|  2 --
  migration/migration.c| 51 
  3 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index e4933b815b..5d1aa593ed 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -60,6 +60,7 @@ void migration_object_init(void);
  void migration_shutdown(void);
  bool migration_is_idle(void);
  bool migration_is_active(MigrationState *);
+bool migrate_mode_is_cpr(MigrationState *);
  
  typedef enum MigrationEventType {

  MIG_EVENT_PRECOPY_SETUP,
diff --git a/migration/migration.h b/migration/migration.h
index aef8afbe1f..65c0b61cbd 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -541,6 +541,4 @@ int migration_rp_wait(MigrationState *s);
   */
  void migration_rp_kick(MigrationState *s);
  
-int migration_stop_vm(RunState state);

-
  #endif
diff --git a/migration/migration.c b/migration/migration.c
index 37c836b0b0..90a90947fb 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -167,11 +167,19 @@ static gint page_request_addr_cmp(gconstpointer ap, 
gconstpointer bp)
  return (a > b) - (a < b);
  }
  
-int migration_stop_vm(RunState state)

+static int migration_stop_vm(MigrationState *s, RunState state)
  {
-int ret = vm_stop_force_state(state);
+int ret;
+
+migration_downtime_start(s);
+
+s->vm_old_state = runstate_get();
+global_state_store();
+
+ret = vm_stop_force_state(state);
  
  trace_vmstate_downtime_checkpoint("src-vm-stopped");

+trace_migration_completion_vm_stop(ret);
  
  return ret;

  }
@@ -1602,6 +1610,11 @@ bool migration_is_active(MigrationState *s)
  s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
  }
  
+bool migrate_mode_is_cpr(MigrationState *s)

+{
+return s->parameters.mode == MIG_MODE_CPR_REBOOT;
+}
+
  int migrate_init(MigrationState *s, Error **errp)
  {
  int ret;
@@ -2454,10 +2467,7 @@ static int postcopy_start(MigrationState *ms,

Re: [PATCH V1 26/26] migration: only-migratable-modes

2024-05-13 Thread Steven Sistare


On 5/9/2024 3:14 PM, Fabiano Rosas wrote:

Steve Sistare  writes:


Add the only-migratable-modes option as a generalization of only-migratable.
Only devices that support all requested modes are allowed.

Signed-off-by: Steve Sistare 
---
  include/migration/misc.h   |  3 +++
  include/sysemu/sysemu.h|  1 -
  migration/migration-hmp-cmds.c | 26 +-
  migration/migration.c  | 22 +-
  migration/savevm.c |  2 +-
  qemu-options.hx| 16 ++--
  system/globals.c   |  1 -
  system/vl.c| 13 -
  target/s390x/cpu_models.c  |  4 +++-
  9 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 5b963ba..3ad2cd9 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -119,6 +119,9 @@ bool migration_incoming_postcopy_advised(void);
  /* True if background snapshot is active */
  bool migration_in_bg_snapshot(void);
  
+void migration_set_required_mode(MigMode mode);

+bool migration_mode_required(MigMode mode);
+
  /* migration/block-dirty-bitmap.c */
  void dirty_bitmap_mig_init(void);
  
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h

index 5b4397e..0a9c4b4 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -8,7 +8,6 @@
  
  /* vl.c */
  
-extern int only_migratable;

  extern const char *qemu_name;
  extern QemuUUID qemu_uuid;
  extern bool qemu_uuid_set;
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 414c7e8..ca913b7 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -16,6 +16,7 @@
  #include "qemu/osdep.h"
  #include "block/qapi.h"
  #include "migration/snapshot.h"
+#include "migration/misc.h"
  #include "monitor/hmp.h"
  #include "monitor/monitor.h"
  #include "qapi/error.h"
@@ -33,6 +34,28 @@
  #include "options.h"
  #include "migration.h"
  
+static void migration_dump_modes(Monitor *mon)

+{
+int mode, n = 0;
+
+monitor_printf(mon, "only-migratable-modes: ");
+
+for (mode = 0; mode < MIG_MODE__MAX; mode++) {
+if (migration_mode_required(mode)) {
+if (n++) {
+monitor_printf(mon, ",");
+}
+monitor_printf(mon, "%s", MigMode_str(mode));
+}
+}
+
+if (!n) {
+monitor_printf(mon, "none\n");
+} else {
+monitor_printf(mon, "\n");
+}
+}
+
  static void migration_global_dump(Monitor *mon)
  {
  MigrationState *ms = migrate_get_current();
@@ -41,7 +64,7 @@ static void migration_global_dump(Monitor *mon)
  monitor_printf(mon, "store-global-state: %s\n",
 ms->store_global_state ? "on" : "off");
  monitor_printf(mon, "only-migratable: %s\n",
-   only_migratable ? "on" : "off");
+   migration_mode_required(MIG_MODE_NORMAL) ? "on" : "off");
  monitor_printf(mon, "send-configuration: %s\n",
 ms->send_configuration ? "on" : "off");
  monitor_printf(mon, "send-section-footer: %s\n",
@@ -50,6 +73,7 @@ static void migration_global_dump(Monitor *mon)
 ms->decompress_error_check ? "on" : "off");
  monitor_printf(mon, "clear-bitmap-shift: %u\n",
 ms->clear_bitmap_shift);
+migration_dump_modes(mon);
  }
  
  void hmp_info_migrate(Monitor *mon, const QDict *qdict)

diff --git a/migration/migration.c b/migration/migration.c
index 4984dee..5535b84 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1719,17 +1719,29 @@ static bool is_busy(Error **reasonp, Error **errp)
  return false;
  }
  
-static bool is_only_migratable(Error **reasonp, Error **errp, int modes)

+static int migration_modes_required;
+
+void migration_set_required_mode(MigMode mode)
+{
+migration_modes_required |= BIT(mode);
+}
+
+bool migration_mode_required(MigMode mode)
+{
+return !!(migration_modes_required & BIT(mode));
+}
+
+static bool modes_are_required(Error **reasonp, Error **errp, int modes)
  {
  ERRP_GUARD();
  
-if (only_migratable && (modes & BIT(MIG_MODE_NORMAL))) {

+if (migration_modes_required & modes) {
  error_propagate_prepend(errp, *reasonp,
-"disallowing migration blocker "
-"(--only-migratable) for: ");
+"-only-migratable{-modes}  specified, but: ");


extra space before 'specified'


Will fix, thanks.


  *reasonp = NULL;
  return true;
  }
+
  return false;
  }
  
@@ -1783,7 +1795,7 @@ int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...)

  modes = get_modes(mode, ap);
  va_end(ap);
  
-if (is_only_migratable(reasonp, errp, modes)) {

+if (modes_are_required(reasonp, errp, modes)) {
  return -EACCES;
  } else if (is_busy(reasonp, errp)) {

Re: [PATCH V1 13/26] physmem: ram_block_create

2024-05-13 Thread Steven Sistare


On 5/13/2024 2:37 PM, Fabiano Rosas wrote:

Steve Sistare  writes:


Create a common subroutine to allocate a RAMBlock, de-duping the code to
populate its common fields.  Add a trace point for good measure.
No functional change.

Signed-off-by: Steve Sistare 
---
  system/physmem.c| 47 ++-
  system/trace-events |  3 +++
  2 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index c3d04ca..6216b14 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -52,6 +52,7 @@
  #include "sysemu/hw_accel.h"
  #include "sysemu/xen-mapcache.h"
  #include "trace/trace-root.h"
+#include "trace.h"
  
  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE

  #include 
@@ -1918,11 +1919,29 @@ out_free:
  }
  }
  
+static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size,

+  ram_addr_t max_size, uint32_t ram_flags)
+{
+RAMBlock *rb = g_malloc0(sizeof(*rb));
+
+rb->used_length = size;
+rb->max_length = max_size;
+rb->fd = -1;
+rb->flags = ram_flags;
+rb->page_size = qemu_real_host_page_size();
+rb->mr = mr;
+rb->guest_memfd = -1;
+trace_ram_block_create(rb->idstr, rb->flags, rb->fd, rb->used_length,


There's no idstr at this point, is there? I think this needs to be
memory_region_name(mr).


Thanks, will fix. That is a bug in my patch factoring.  I add the call to
qemu_ram_set_idstr in patch "physmem: set ram block idstr earlier".

- Steve


+   rb->max_length, mr->align);
+return rb;
+}
+
  #ifdef CONFIG_POSIX
  RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
   uint32_t ram_flags, int fd, off_t offset,
   Error **errp)
  {
+void *host;
  RAMBlock *new_block;
  Error *local_err = NULL;
  int64_t file_size, file_align;
@@ -1962,19 +1981,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
  return NULL;
  }
  
-new_block = g_malloc0(sizeof(*new_block));

-new_block->mr = mr;
-new_block->used_length = size;
-new_block->max_length = size;
-new_block->flags = ram_flags;
-new_block->guest_memfd = -1;
-new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
- errp);
-if (!new_block->host) {
+new_block = ram_block_create(mr, size, size, ram_flags);
+host = file_ram_alloc(new_block, size, fd, !file_size, offset, errp);
+if (!host) {
  g_free(new_block);
  return NULL;
  }
  
+new_block->host = host;

  ram_block_add(new_block, _err);
  if (local_err) {
  g_free(new_block);
@@ -1982,7 +1996,6 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, 
MemoryRegion *mr,
  return NULL;
  }
  return new_block;
-
  }
  
  
@@ -2054,18 +2067,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,

  align = MAX(align, TARGET_PAGE_SIZE);
  size = ROUND_UP(size, align);
  max_size = ROUND_UP(max_size, align);
-
-new_block = g_malloc0(sizeof(*new_block));
-new_block->mr = mr;
-new_block->resized = resized;
-new_block->used_length = size;
-new_block->max_length = max_size;
  assert(max_size >= size);
-new_block->fd = -1;
-new_block->guest_memfd = -1;
-new_block->page_size = qemu_real_host_page_size();
-new_block->host = host;
-new_block->flags = ram_flags;
+new_block = ram_block_create(mr, size, max_size, ram_flags);
+new_block->resized = resized;
+
  ram_block_add(new_block, _err);
  if (local_err) {
  g_free(new_block);
diff --git a/system/trace-events b/system/trace-events
index 69c9044..f0a80ba 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -38,3 +38,6 @@ dirtylimit_state_finalize(void)
  dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: 
%" PRIu64 ", throttle adjust time %"PRIi64 " us"
  dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate 
limit %"PRIu64
  dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 
" us"
+
+# physmem.c
+ram_block_create(const char *name, uint32_t flags, int fd, size_t used_length, size_t 
max_length, size_t align) "%s, flags %u, fd %d, len %lu, maxlen %lu, align %lu"

Re: [PATCH V1 24/26] seccomp: cpr-exec blocker

2024-05-13 Thread Steven Sistare


On 5/10/2024 3:54 AM, Daniel P. Berrangé wrote:

On Mon, Apr 29, 2024 at 08:55:33AM -0700, Steve Sistare wrote:

cpr-exec mode needs permission to exec.  Block it if permission is denied.

Signed-off-by: Steve Sistare 
---
  include/sysemu/seccomp.h |  1 +
  system/qemu-seccomp.c| 10 --
  system/vl.c  |  6 ++
  3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/sysemu/seccomp.h b/include/sysemu/seccomp.h
index fe85989..023c0a1 100644
--- a/include/sysemu/seccomp.h
+++ b/include/sysemu/seccomp.h
@@ -22,5 +22,6 @@
  #define QEMU_SECCOMP_SET_RESOURCECTL (1 << 4)
  
  int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp);

+uint32_t qemu_seccomp_get_opts(void);
  
  #endif

diff --git a/system/qemu-seccomp.c b/system/qemu-seccomp.c
index 5c20ac0..0d2a561 100644
--- a/system/qemu-seccomp.c
+++ b/system/qemu-seccomp.c
@@ -360,12 +360,18 @@ static int seccomp_start(uint32_t seccomp_opts, Error 
**errp)
  return rc < 0 ? -1 : 0;
  }
  
+static uint32_t seccomp_opts;

+
+uint32_t qemu_seccomp_get_opts(void)
+{
+return seccomp_opts;
+}
+
  int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp)
  {
  if (qemu_opt_get_bool(opts, "enable", false)) {
-uint32_t seccomp_opts = QEMU_SECCOMP_SET_DEFAULT
-| QEMU_SECCOMP_SET_OBSOLETE;
  const char *value = NULL;
+seccomp_opts = QEMU_SECCOMP_SET_DEFAULT | QEMU_SECCOMP_SET_OBSOLETE;
  
  value = qemu_opt_get(opts, "obsolete");

  if (value) {
diff --git a/system/vl.c b/system/vl.c
index 7252100..b76881e 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -76,6 +76,7 @@
  #include "hw/block/block.h"
  #include "hw/i386/x86.h"
  #include "hw/i386/pc.h"
+#include "migration/blocker.h"
  #include "migration/cpr.h"
  #include "migration/misc.h"
  #include "migration/snapshot.h"
@@ -2493,6 +2494,11 @@ static void qemu_process_early_options(void)
  QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL);
  if (olist) {
  qemu_opts_foreach(olist, parse_sandbox, NULL, _fatal);
+if (qemu_seccomp_get_opts() & QEMU_SECCOMP_SET_SPAWN) {
+Error *blocker = NULL;
+error_setg(, "-sandbox denies exec for cpr-exec");
+migrate_add_blocker_mode(, MIG_MODE_CPR_EXEC, 
_fatal);
+}
  }
  #endi


There are a whole pile of features that get blocked wehn -sandbox is
used. I'm not convinced we should be adding code to check for specific
blocked features, as such a list will always be incomplete at best, and
incorrectly block things at worst.

I view this primarily as a documentation task for the cpr-exec command.


For cpr and live migration, we do our best to prevent breaking the guest
for cases we know will fail.  Independently, a clear error message here
will reduce error reports for this new cpr feature.

Would it be more palatable if I move this blocker's creation to cpr_mig_init?

- Steve

Re: [PATCH V1 22/26] migration: ram block cpr-exec blockers

2024-05-13 Thread Steven Sistare


On 5/9/2024 2:01 PM, Fabiano Rosas wrote:

Steve Sistare  writes:


Unlike cpr-reboot mode, cpr-exec mode cannot save volatile ram blocks in the
migration stream file and recreate them later, because the physical memory for
the blocks is pinned and registered for vfio.  Add an exec-mode blocker for
volatile ram blocks.

Also add a blocker for RAM_GUEST_MEMFD.  Preserving guest_memfd may be
sufficient for cpr-exec, but it has not been tested yet.

- Steve


extra text here


Will fix, thanks - steve


Signed-off-by: Steve Sistare 


Reviewed-by: Fabiano Rosas

Re: [PATCH V1 09/26] migration: vmstate_register_named

2024-05-13 Thread Steven Sistare


On 5/9/2024 10:32 AM, Fabiano Rosas wrote:

Fabiano Rosas  writes:


Steve Sistare  writes:


Define vmstate_register_named which takes the instance name as its first
parameter, instead of generating the name from VMStateIf of the Object.
This will be needed to register objects that are not Objects.  Pass the
new name parameter to vmstate_register_with_alias_id.

Signed-off-by: Steve Sistare 


Reviewed-by: Fabiano Rosas 


Actually, can't we define a wrapper type just for this purpose? For
example, looking at dbus-vmstate.c:


One would need to provide a separate wrapper for each struct to be registered
as vmstate.  This patch set only has RAMBlock, but there are more coming in
my next patch sets.  vmstate_register_named avoids adding such boilerplate,
and makes it easier to add more cpr state in the future.

- Steve


static void dbus_vmstate_class_init(ObjectClass *oc, void *data)
{
...
 VMStateIfClass *vc = VMSTATE_IF_CLASS(oc);

 vc->get_id = dbus_vmstate_get_id;
...
}

static const TypeInfo dbus_vmstate_info = {
 .name = TYPE_DBUS_VMSTATE,
 .parent = TYPE_OBJECT,
 .instance_size = sizeof(DBusVMState),
 .instance_finalize = dbus_vmstate_finalize,
 .class_init = dbus_vmstate_class_init,
 .interfaces = (InterfaceInfo[]) {
 { TYPE_USER_CREATABLE },   // without this one
 { TYPE_VMSTATE_IF },
 { }
 }
};

static void register_types(void)
{
 type_register_static(_vmstate_info);
}
type_init(register_types);

Re: [PATCH V1 05/26] migration: precreate vmstate

2024-05-13 Thread Steven Sistare


On 5/7/2024 5:02 PM, Fabiano Rosas wrote:

Steve Sistare  writes:


Provide the VMStateDescription precreate field to mark objects that must
be loaded on the incoming side before devices have been created, because
they provide properties that will be needed at creation time.  They will
be saved to and loaded from their own QEMUFile, via


It's not obvious to me what the reason is to have a separate
QEMUFile. Could you expand on this?


The migration stream is read in the calling sequence at B below, but precreate
state is needed at A before chardev and memory backends are created.

main()
  qemu_init()
A:
qemu_create_early_backends()
qemu_create_late_backends()
migration_object_init()
qmp_x_exit_preconfig()
  qmp_migrate_incoming()

  qemu_default_main()
qemu_main_loop()
  fd_accept_incoming_migration()
migration_channel_process_incoming()
  migration_ioc_process_incoming()
migration_incoming_process()
  process_incoming_migration_co()
B:
qemu_loadvm_state()

precreate objects could be emitted first in the existing migration stream and
read at A, but this requires untangling numerous ordering dependencies amongst
migration_object_init, qemu_create_machine, configure_accelerators, monitor
init, and the main loop.

- Steve

Re: [PATCH V1 06/26] migration: precreate vmstate for exec

2024-05-13 Thread Steven Sistare


On 5/6/2024 7:34 PM, Fabiano Rosas wrote:

Steve Sistare  writes:


Provide migration_precreate_save for saving precreate vmstate across exec.
Create a memfd, save its value in the environment, and serialize state
to it.  Reverse the process in migration_precreate_load.

Signed-off-by: Steve Sistare 
---
  include/migration/misc.h |   5 ++
  migration/meson.build|   1 +
  migration/precreate.c| 139 +++
  3 files changed, 145 insertions(+)
  create mode 100644 migration/precreate.c

diff --git a/include/migration/misc.h b/include/migration/misc.h
index c9e200f..cf30351 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -56,6 +56,11 @@ AnnounceParameters *migrate_announce_params(void);
  
  void dump_vmstate_json_to_file(FILE *out_fp);
  
+/* migration/precreate.c */

+int migration_precreate_save(Error **errp);
+void migration_precreate_unsave(void);
+int migration_precreate_load(Error **errp);
+
  /* migration/migration.c */
  void migration_object_init(void);
  void migration_shutdown(void);
diff --git a/migration/meson.build b/migration/meson.build
index f76b1ba..50e7cb2 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -26,6 +26,7 @@ system_ss.add(files(
'ram-compress.c',
'options.c',
'postcopy-ram.c',
+  'precreate.c',
'savevm.c',
'socket.c',
'tls.c',
diff --git a/migration/precreate.c b/migration/precreate.c
new file mode 100644
index 000..0bf5e1f
--- /dev/null
+++ b/migration/precreate.c
@@ -0,0 +1,139 @@
+/*
+ * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/memfd.h"
+#include "qapi/error.h"
+#include "io/channel-file.h"
+#include "migration/misc.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+
+#define PRECREATE_STATE_NAME "QEMU_PRECREATE_STATE"
+
+static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
+{
+g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+QIOChannel *ioc = QIO_CHANNEL(fioc);
+qio_channel_set_name(ioc, name);
+return qemu_file_new_input(ioc);
+}
+
+static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
+{
+g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+QIOChannel *ioc = QIO_CHANNEL(fioc);
+qio_channel_set_name(ioc, name);
+return qemu_file_new_output(ioc);
+}
+
+static int memfd_create_named(const char *name, Error **errp)
+{
+int mfd;
+char val[16];
+
+mfd = memfd_create(name, 0);
+if (mfd < 0) {
+error_setg_errno(errp, errno, "memfd_create failed");
+return -1;
+}
+
+/* Remember mfd in environment for post-exec load */
+qemu_clear_cloexec(mfd);
+snprintf(val, sizeof(val), "%d", mfd);
+g_setenv(name, val, 1);
+
+return mfd;
+}
+
+static int memfd_find_named(const char *name, int *mfd_p, Error **errp)
+{
+const char *val = g_getenv(name);
+
+if (!val) {
+*mfd_p = -1;
+return 0;   /* No memfd was created, not an error */
+}
+g_unsetenv(name);
+if (qemu_strtoi(val, NULL, 10, mfd_p)) {
+error_setg(errp, "Bad %s env value %s", PRECREATE_STATE_NAME, val);
+return -1;
+}
+lseek(*mfd_p, 0, SEEK_SET);
+return 0;
+}
+
+static void memfd_delete_named(const char *name)
+{
+int mfd;
+const char *val = g_getenv(name);
+
+if (val) {
+g_unsetenv(name);
+if (!qemu_strtoi(val, NULL, 10, )) {
+close(mfd);
+}
+}
+}
+
+static QEMUFile *qemu_file_new_memfd_output(const char *name, Error **errp)
+{
+int mfd = memfd_create_named(name, errp);
+
+if (mfd < 0) {
+return NULL;
+}
+
+return qemu_file_new_fd_output(mfd, name);
+}
+
+static QEMUFile *qemu_file_new_memfd_input(const char *name, Error **errp)
+{
+int ret, mfd;
+
+ret = memfd_find_named(name, , errp);
+if (ret || mfd < 0) {
+return NULL;
+}
+
+return qemu_file_new_fd_input(mfd, name);
+}
+
+int migration_precreate_save(Error **errp)
+{
+QEMUFile *f = qemu_file_new_memfd_output(PRECREATE_STATE_NAME, errp);
+
+if (!f) {
+return -1;
+} else if (qemu_savevm_precreate_save(f, errp)) {
+memfd_delete_named(PRECREATE_STATE_NAME);
+return -1;
+} else {
+/* Do not close f, as mfd must remain open. */
+return 0;
+}
+}
+
+void migration_precreate_unsave(void)
+{
+memfd_delete_named(PRECREATE_STATE_NAME);
+}
+
+int migration_precreate_load(Error **errp)
+{
+int ret;
+QEMUFile *f = qemu_file_new_memfd_input(PRECREATE_STATE_NAME, errp);


Can we avoid the QEMUFile? I don't see it being exported from this file.


It is not exported, but within this file, it is the basis for all read and
write operations, via the existing functions

Re: [PATCH V1 03/26] migration: SAVEVM_FOREACH

2024-05-13 Thread Steven Sistare


On 5/6/2024 7:17 PM, Fabiano Rosas wrote:

Steve Sistare  writes:


Define an abstraction SAVEVM_FOREACH to loop over all savevm state
handlers, and replace QTAILQ_FOREACH.  Define variants for ALL so
we can loop over all handlers vs a subset of handlers in a subsequent
patch, but at this time there is no distinction between the two.
No functional change.

Signed-off-by: Steve Sistare 
---
  migration/savevm.c | 55 +++---
  1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 4509482..6829ba3 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -237,6 +237,15 @@ static SaveState savevm_state = {
  .global_section_id = 0,
  };
  
+#define SAVEVM_FOREACH(se, entry)\

+QTAILQ_FOREACH(se, _state.handlers, entry)\
+
+#define SAVEVM_FOREACH_ALL(se, entry)\
+QTAILQ_FOREACH(se, _state.handlers, entry)


This feels worse than SAVEVM_FOREACH_NOT_PRECREATED. We'll have to keep
coming back to the definition to figure out which FOREACH is the real
deal.


I take your point, but the majority of the loops do not care about precreated
objects, so it seems backwards to make them more verbose with 
SAVEVM_FOREACH_NOT_PRECREATE.  I can go either way, but we need

Peter's opinion also.


+
+#define SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se)   \
+QTAILQ_FOREACH_SAFE(se, _state.handlers, entry, new_se)
+
  static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id);
  
  static bool should_validate_capability(int capability)

@@ -674,7 +683,7 @@ static uint32_t calculate_new_instance_id(const char *idstr)
  SaveStateEntry *se;
  uint32_t instance_id = 0;
  
-QTAILQ_FOREACH(se, _state.handlers, entry) {

+SAVEVM_FOREACH_ALL(se, entry) {


In this patch we can't have both instances...


  if (strcmp(idstr, se->idstr) == 0
  && instance_id <= se->instance_id) {
  instance_id = se->instance_id + 1;
@@ -690,7 +699,7 @@ static int calculate_compat_instance_id(const char *idstr)
  SaveStateEntry *se;
  int instance_id = 0;
  
-QTAILQ_FOREACH(se, _state.handlers, entry) {

+SAVEVM_FOREACH(se, entry) {


...otherwise one of the two changes will go undocumented because the
actual reason for it will only be described in the next patch.


Sure, I'll move this to the precreate patch.

- Steve


  if (!se->compat) {
  continue;
  }
@@ -816,7 +825,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, 
void *opaque)
  }
  pstrcat(id, sizeof(id), idstr);
  
-QTAILQ_FOREACH_SAFE(se, _state.handlers, entry, new_se) {

+SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
  if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
  savevm_state_handler_remove(se);
  g_free(se->compat);
@@ -939,7 +948,7 @@ void vmstate_unregister(VMStateIf *obj, const 
VMStateDescription *vmsd,
  {
  SaveStateEntry *se, *new_se;
  
-QTAILQ_FOREACH_SAFE(se, _state.handlers, entry, new_se) {

+SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) {
  if (se->vmsd == vmsd && se->opaque == opaque) {
  savevm_state_handler_remove(se);
  g_free(se->compat);
@@ -1223,7 +1232,7 @@ bool qemu_savevm_state_blocked(Error **errp)
  {
  SaveStateEntry *se;
  
-QTAILQ_FOREACH(se, _state.handlers, entry) {

+SAVEVM_FOREACH(se, entry) {
  if (se->vmsd && se->vmsd->unmigratable) {
  error_setg(errp, "State blocked by non-migratable device '%s'",
 se->idstr);
@@ -1237,7 +1246,7 @@ void qemu_savevm_non_migratable_list(strList **reasons)
  {
  SaveStateEntry *se;
  
-QTAILQ_FOREACH(se, _state.handlers, entry) {

+SAVEVM_FOREACH(se, entry) {
  if (se->vmsd && se->vmsd->unmigratable) {
  QAPI_LIST_PREPEND(*reasons,
g_strdup_printf("non-migratable device: %s",
@@ -1276,7 +1285,7 @@ bool qemu_savevm_state_guest_unplug_pending(void)
  {
  SaveStateEntry *se;
  
-QTAILQ_FOREACH(se, _state.handlers, entry) {

+SAVEVM_FOREACH(se, entry) {
  if (se->vmsd && se->vmsd->dev_unplug_pending &&
  se->vmsd->dev_unplug_pending(se->opaque)) {
  return true;
@@ -1291,7 +1300,7 @@ int qemu_savevm_state_prepare(Error **errp)
  SaveStateEntry *se;
  int ret;
  
-QTAILQ_FOREACH(se, _state.handlers, entry) {

+SAVEVM_FOREACH(se, entry) {
  if (!se->ops || !se->ops->save_prepare) {
  continue;
  }
@@ -1321,7 +1330,7 @@ int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
  json_writer_start_array(ms->vmdesc, "devices");
  
  trace_savevm_state_setup();

-QTAILQ_FOREACH(se, _state.handlers, entry) {
+SAVEVM_FOREACH(se, entry) {
  if (se->vmsd && se->vmsd->early_setup) {

cpr-exec doc (was Re: [PATCH V1 00/26] Live update: cpr-exec)

2024-05-02 Thread Steven Sistare


On 4/29/2024 11:55 AM, Steve Sistare wrote:

This patch series adds the live migration cpr-exec mode.


Here is the text I plan to add to docs/devel/migration/CPR.rst.  It is
premature for me to submit this as a patch, because it includes all
the functionality I plan to add in this and future series, but it may
help you while reviewing this series.

- Steve



cpr-exec mode
---

In this mode, QEMU stops the VM, writes VM state to the migration
URI, and directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID.  Guest RAM is
preserved in place, albeit with new virtual addresses.  The user
completes the migration by specifying the ``-incoming`` option, and
by issuing the ``migrate-incoming`` command if necessary; see details
below.

This mode supports vfio devices by preserving device descriptors and
hence kernel state across the exec, even for devices that do not
support live migration, and preserves tap and vhost descriptors.

cpr-exec also preserves descriptors for a subset of chardevs,
including socket, file, parallel, pipe, serial, pty, stdio, and null.
chardevs that support cpr-exec have the QEMU_CHAR_FEATURE_CPR set in
the Chardev object.  The client side of a preserved chardev sees no
loss of connectivity during cpr-exec.  More chardevs could be
preserved with additional developement.

All chardevs have a ``reopen-on-cpr`` option which causes the chardev
to be closed and reopened during cpr-exec.  This can be set to allow
cpr-exec when the configuration includes a chardev (such as vc) that
does not have QEMU_CHAR_FEATURE_CPR.

Because the old and new QEMU instances are not active concurrently,
the URI cannot be a type that streams data from one instance to the
other.

Usage
^

Arguments for the new QEMU process are taken from the
@cpr-exec-args parameter.  The first argument should be the
path of a new QEMU binary, or a prefix command that exec's the
new QEMU binary, and the arguments should include the ''-incoming''
option.

Memory backend objects must have the ``share=on`` attribute, and
must be mmap'able in the new QEMU process.  For example,
memory-backend-file is acceptable, but memory-backend-ram is
not.

The VM must be started with the ``-machine memfd-alloc=on``
option.  This causes implicit RAM blocks (those not explicitly
described by a memory-backend object) to be allocated by
mmap'ing a memfd.  Examples include VGA, ROM, and even guest
RAM when it is specified without without reference to a
memory-backend object.

Add the ``-only-migratable-modes cpr-exec`` option to guarantee that
the configuration supports cpr-exec.  QEMU will exit at start time
if not.

Outgoing:
  * Set the migration mode parameter to ``cpr-exec``.
  * Set the ``cpr-exec-args`` parameter.
  * Issue the ``migrate`` command.  It is recommended the the URI be
a ``file`` type, but one can use other types such as ``exec``,
provided the command captures all the data from the outgoing side,
and provides all the data to the incoming side.

Incoming:
  * You do not need to explicitly start new QEMU.  It is started as
a side effect of the migrate command above.
  * If the VM was running when the outgoing ``migrate`` command was
issued, then QEMU automatically resumes VM execution.

Example 1: incoming URI
^^^

In these examples, we simply restart the same version of QEMU, but in
a real scenario one would set a new QEMU binary path in cpr-exec-args.

::

  # qemu-kvm -monitor stdio
  -object 
memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G

  -machine memfd-alloc=on
  ...

  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming 
file:vm.state

  (qemu) migrate -d file:vm.state
  (qemu) QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

Example 2: incoming defer
^
::

  # qemu-kvm -monitor stdio
  -object 
memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G

  -machine memfd-alloc=on
  ...

  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming defer
  (qemu) migrate -d file:vm.state
  (qemu) QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  status: paused (inmigrate)
  (qemu) migrate_incoming file:vm.state
  (qemu) info status
  VM status: running


Caveats
^^^

cpr-exec mode may not be used with postcopy, background-snapshot,
or COLO.

cpr-exec mode requires permission to use the exec system call, which
is denied by certain sandbox options, such as spawn.  Use finer
grained

Re: [PATCH V1 20/26] migration: cpr-exec mode

2024-05-02 Thread Steven Sistare


On 5/2/2024 8:23 AM, Markus Armbruster wrote:

Steve Sistare  writes:


Add the cpr-exec migration mode.  Usage:
   qemu-system-$arch -machine memfd-alloc=on ...
   migrate_set_parameter mode cpr-exec
   migrate_set_parameter cpr-exec-args \
   ... -incoming 
   migrate -d 

The migrate command stops the VM, saves state to the URI,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from the URI.  Guest RAM is preserved in place,
albeit with new virtual addresses.

Arguments for the new QEMU process are taken from the
@cpr-exec-args parameter.  The first argument should be the
path of a new QEMU binary, or a prefix command that exec's the
new QEMU binary.

Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so the URI must be a type, such as
a file, that reads all data before old QEMU exits.

Memory backend objects must have the share=on attribute, and
must be mmap'able in the new QEMU process.  For example,
memory-backend-file is acceptable, but memory-backend-ram is
not.

The VM must be started with the '-machine memfd-alloc=on'
option.  This causes implicit ram blocks (those not explicitly
described by a memory-backend object) to be allocated by
mmap'ing a memfd.  Examples include VGA, ROM, and even guest
RAM when it is specified without a memory-backend object.

The implementation saves precreate vmstate at the end of normal
migration in migrate_fd_cleanup, and tells the main loop to call
cpr_exec.  Incoming qemu loads preceate state early, before objects
are created.  The memfds are kept open across exec by clearing the
close-on-exec flag, their values are saved in precreate vmstate,
and they are mmap'd in new qemu.

Note that the memfd-alloc option is not related to memory-backend-memfd.
Later patches add support for memory-backend-memfd, and for additional
devices, including vfio, chardev, and more.

Signed-off-by: Steve Sistare 


[...]


diff --git a/qapi/migration.json b/qapi/migration.json
index 49710e7..7c5f45f 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -665,9 +665,37 @@
  # or COLO.
  #
  # (since 8.2)
+#
+# @cpr-exec: The migrate command stops the VM, saves state to the URI,
+# directly exec's a new version of QEMU on the same host,
+# replacing the original process while retaining its PID, and
+# loads state from the URI.  Guest RAM is preserved in place,
+# albeit with new virtual addresses.


Do you mean the virtual addresses of guest RAM may differ betwen old and
new QEMU process?


The VA at which a guest RAM segment is mapped in the QEMU process
changes.  The end user would not notice or care, so I'll drop that
detail here.


+#
+# Arguments for the new QEMU process are taken from the
+# @cpr-exec-args parameter.  The first argument should be the
+# path of a new QEMU binary, or a prefix command that exec's the
+# new QEMU binary.


What's a "prefix command"?  A wrapper script, perhaps?


A prefix command is any command of the form:
  command1 command1-args command2 command2-args
where command1 performs some set up before exec'ing command2.
However, I will drop the word "prefix", it adds no meaning here.


+#
+# Because old QEMU terminates when new QEMU starts, one cannot
+# stream data between the two, so the URI must be a type, such as
+# a file, that reads all data before old QEMU exits.


What happens when you specify a URI that doesn't?


Old QEMU will quietly block indefinitely writing to the URI.


+#
+# Memory backend objects must have the share=on attribute, and
+# must be mmap'able in the new QEMU process.  For example,
+# memory-backend-file is acceptable, but memory-backend-ram is
+# not.
+#
+# The VM must be started with the '-machine memfd-alloc=on'


What happens when you don't?


If '-only-migratable-modes cpr-exec' is specified, then QEMU will fail
to start, and print a clear error message.

Otherwise, a blocker is registered and any attempt to cpr-exec will fail
with a clear error message.

- Steve


+# option.  This causes implicit ram blocks -- those not explicitly
+# described by a memory-backend object -- to be allocated by
+# mmap'ing a memfd.  Examples include VGA, ROM, and even guest
+# RAM when it is specified without a memory-backend object.
+#
+# (since 9.1)
  ##
  { 'enum': 'MigMode',
-  'data': [ 'normal', 'cpr-reboot' ] }
+  'data': [ 'normal', 'cpr-reboot', 'cpr-exec' ] }
  
  ##

  # @ZeroPageDetection:


[...]

Re: [PATCH V1 18/26] migration: cpr-exec-args parameter

2024-05-02 Thread Steven Sistare


On 5/2/2024 8:23 AM, Markus Armbruster wrote:

Steve Sistare  writes:


Create the cpr-exec-args migration parameter, defined as a list of
strings.  It will be used for cpr-exec migration mode in a subsequent
patch.

No functional change, except that cpr-exec-args is shown by the
'info migrate' command.

Signed-off-by: Steve Sistare 


[...]


diff --git a/qapi/migration.json b/qapi/migration.json
index 8c65b90..49710e7 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -914,6 +914,9 @@
  # See description in @ZeroPageDetection.  Default is 'multifd'.
  # (since 9.0)
  #
+# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
+#See @cpr-exec for details.  (Since 9.1)
+#


You mean migration mode @cpr-exec, don't you?

If yes, dangling reference until PATCH 20 adds it.  Okay, but worth a
mention in the commit message.

Suggest "See MigMode @cpr-exec for details."


Yes to all.  Will update as you suggest.

- Steve


  # Features:
  #
  # @deprecated: Member @block-incremental is deprecated.  Use
@@ -948,7 +951,8 @@
 { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
 'vcpu-dirty-limit',
 'mode',
-   'zero-page-detection'] }
+   'zero-page-detection',
+   'cpr-exec-args'] }
  
  ##

  # @MigrateSetParameters:
@@ -1122,6 +1126,9 @@
  # See description in @ZeroPageDetection.  Default is 'multifd'.
  # (since 9.0)
  #
+# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
+#See @cpr-exec for details.  (Since 9.1)
+#
  # Features:
  #
  # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1176,7 +1183,8 @@
  'features': [ 'unstable' ] },
  '*vcpu-dirty-limit': 'uint64',
  '*mode': 'MigMode',
-'*zero-page-detection': 'ZeroPageDetection'} }
+'*zero-page-detection': 'ZeroPageDetection',
+'*cpr-exec-args': [ 'str' ]} }
  
  ##

  # @migrate-set-parameters:
@@ -1354,6 +1362,9 @@
  # See description in @ZeroPageDetection.  Default is 'multifd'.
  # (since 9.0)
  #
+# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode.
+#See @cpr-exec for details.  (Since 9.1)
+#
  # Features:
  #
  # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1405,7 +1416,8 @@
  'features': [ 'unstable' ] },
  '*vcpu-dirty-limit': 'uint64',
  '*mode': 'MigMode',
-'*zero-page-detection': 'ZeroPageDetection'} }
+'*zero-page-detection': 'ZeroPageDetection',
+'*cpr-exec-args': [ 'str' ]} }
  
  ##

  # @query-migrate-parameters:

Re: [PATCH V4 10/14] migration: stop vm for cpr

2024-03-13 Thread Steven Sistare


On 2/29/2024 8:28 PM, Peter Xu wrote:

On Thu, Feb 29, 2024 at 10:21:14AM -0500, Steven Sistare wrote:

On 2/25/2024 9:08 PM, Peter Xu wrote:

On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote:

When migration for cpr is initiated, stop the vm and set state
RUN_STATE_FINISH_MIGRATE before ram is saved.  This eliminates the
possibility of ram and device state being out of sync, and guarantees
that a guest in the suspended state remains suspended, because qmp_cont
rejects a cont command in the RUN_STATE_FINISH_MIGRATE state.

Signed-off-by: Steve Sistare 


Reviewed-by: Peter Xu 

cpr-reboot mode keeps changing behavior.

Could we declare it "experimental" until it's solid?  Maybe a patch to
document this?

Normally IMHO we shouldn't merge a feature if it's not complete, however
cpr-reboot is so special that the mode itself is already merged in 8.2
before I started to merge patches, and it keeps changing things.  I don't
know what else we can do here besides declaring it experimental and not
declare it a stable feature.


Hi Peter, the planned/committed functionality for cpr-reboot changed only once, 
in:
 migration: stop vm for cpr

Suspension to support vfio is an enhancement which adds to the basic 
functionality,
it does not change it.  This was planned all along, but submitted as a separate


If VFIO used to migrate without suspension and now it won't, it's a
behavior change?


VFIO could not cpr-reboot migrate without suspension.  The existing vfio 
migration blockers applied to all modes:

  Error: :3a:10.0: VFIO migration is not supported in kernel

Now, with suspension, it will.  An addition, not a change.


series to manage complexity, as I outlined in my qemu community presentation,
which I emailed you at the time.

Other "changes" that arose during review were just clarifications and 
explanations.

So, I don't think cpr-reboot deserves to be condemned to experimental limbo.


IMHO it's not about a feature being condemned, it's about a kindful
heads-up to the users that one needs to take risk on using it until it
becomes stable, it also makes developers easier because of no limitation on
behavior change.  If all the changes are landing, then there's no need for
such a patch.

If so, please propose the planned complete document patch. I had a feeling
that cpr is still not fully understood by even many developers on the list.
It'll be great that such document will contain all the details one needs to
know on the feature, etc. meaning of the name cpr-reboot (what is "cpr"),
when to use it, how to use it, how it differs from "normal" mode
(etc. lifted limitations on some devices that used to block migration?),
what is enforced (vm stop, suspension, etc.) and what is optionally offered
(VFIO, shared-mem, etc.), the expected behaviors, etc.

When send it, please copy relevant people (Alex & Cedric for VFIO, while
Markus could also be a good candidate considering the QMP involvement).


Submitted.

- Steve

Re: [PATCH V2 00/11] privatize migration.h

2024-03-11 Thread Steven Sistare


On 3/11/2024 4:28 PM, Peter Xu wrote:

On Mon, Mar 11, 2024 at 04:24:14PM -0400, Steven Sistare wrote:

On 3/11/2024 3:45 PM, Steven Sistare wrote:

On 3/11/2024 3:30 PM, Peter Xu wrote:

Steve,

On Mon, Mar 11, 2024 at 10:48:47AM -0700, Steve Sistare wrote:

Changes in V2:
    * rebase to migration-next, add RB


Not apply even to master branch.  Note that there're >=1 PULLs sent and
merged since my last reply..  Perhaps you rebased to the "old" next?


I pulled from branch migration-next in https://gitlab.com/peterx/qemu a
few hours ago, but I must have screwed up somewhere.  I'll figure it out
and post a V4.


My pull was a fiew hours old, but my patches still apply cleanly to the
most recent tip:
   a1bb5dd169f4 ("migration: Fix format in error message")

I can sent that as V3, but ...
Note that you must apply "migration: export fewer options" before
"privatize migration.h".  If that does not help, I will send V3.


Ouch, I forgot that dependency... Sorry.

Yeah it works now.  No need to resend for now.


Great! - steve

Re: [PATCH V2 00/11] privatize migration.h

2024-03-11 Thread Steven Sistare


On 3/11/2024 3:45 PM, Steven Sistare wrote:

On 3/11/2024 3:30 PM, Peter Xu wrote:

Steve,

On Mon, Mar 11, 2024 at 10:48:47AM -0700, Steve Sistare wrote:

Changes in V2:
   * rebase to migration-next, add RB


Not apply even to master branch.  Note that there're >=1 PULLs sent and
merged since my last reply..  Perhaps you rebased to the "old" next?


I pulled from branch migration-next in https://gitlab.com/peterx/qemu a 
few hours ago, but I must have screwed up somewhere.  I'll figure it out

and post a V4.


My pull was a fiew hours old, but my patches still apply cleanly to the
most recent tip:
  a1bb5dd169f4 ("migration: Fix format in error message")

I can sent that as V3, but ...
Note that you must apply "migration: export fewer options" before
"privatize migration.h".  If that does not help, I will send V3.

- Steve

Re: [PATCH V2 00/11] privatize migration.h

2024-03-11 Thread Steven Sistare


On 3/11/2024 3:30 PM, Peter Xu wrote:

Steve,

On Mon, Mar 11, 2024 at 10:48:47AM -0700, Steve Sistare wrote:

Changes in V2:
   * rebase to migration-next, add RB


Not apply even to master branch.  Note that there're >=1 PULLs sent and
merged since my last reply..  Perhaps you rebased to the "old" next?


I pulled from branch migration-next in https://gitlab.com/peterx/qemu a 
few hours ago, but I must have screwed up somewhere.  I'll figure it out

and post a V4.

- Steve

Re: [PATCH V2] migration: export fewer options

2024-03-11 Thread Steven Sistare


On 3/1/2024 2:47 AM, Peter Xu wrote:

On Thu, Feb 29, 2024 at 07:03:36AM +0100, Markus Armbruster wrote:

Steven Sistare  writes:


Just a reminder, after our further discussion in the V1 thread,
this patch is still what I propose, no updates needed.

Markus, I think Peter is looking for your blessing on the new
file name: include/migration/client-options.h.


Not my preference, but no objection.


There's yet one alternative, which is to put these exported option
functions into misc.h directly.  After all that's not so much, and misc.h
already hold random stuff from elsewhere.

Steve, would you repost this patch (with/without my above comment taken)
along with your other series with a rebase to migration-next?  It doesn't
apply there.  Re the other series: one nitpick comment on the last patch,
where you may consider splitting the removal of the unused 2 functions into
a standalone patch.  Other than that it looks good to me.

https://gitlab.com/peterx/qemu/-/tree/migration-next


Both are rebased and reposted, thanks - steve

Re: [PATCH V4 10/14] migration: stop vm for cpr

2024-02-29 Thread Steven Sistare

On 2/25/2024 9:08 PM, Peter Xu wrote:
> On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote:
>> When migration for cpr is initiated, stop the vm and set state
>> RUN_STATE_FINISH_MIGRATE before ram is saved.  This eliminates the
>> possibility of ram and device state being out of sync, and guarantees
>> that a guest in the suspended state remains suspended, because qmp_cont
>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state.
>>
>> Signed-off-by: Steve Sistare 
> 
> Reviewed-by: Peter Xu 
> 
> cpr-reboot mode keeps changing behavior.
> 
> Could we declare it "experimental" until it's solid?  Maybe a patch to
> document this?
> 
> Normally IMHO we shouldn't merge a feature if it's not complete, however
> cpr-reboot is so special that the mode itself is already merged in 8.2
> before I started to merge patches, and it keeps changing things.  I don't
> know what else we can do here besides declaring it experimental and not
> declare it a stable feature.

Hi Peter, the planned/committed functionality for cpr-reboot changed only once, 
in:
migration: stop vm for cpr

Suspension to support vfio is an enhancement which adds to the basic 
functionality,
it does not change it.  This was planned all along, but submitted as a separate 
series to manage complexity, as I outlined in my qemu community presentation,
which I emailed you at the time.

Other "changes" that arose during review were just clarifications and 
explanations.

So, I don't think cpr-reboot deserves to be condemned to experimental limbo.

- Steve

Re: [PATCH V4 14/14] migration: options incompatible with cpr

2024-02-29 Thread Steven Sistare

On 2/29/2024 12:40 AM, Peter Xu wrote:
> On Thu, Feb 29, 2024 at 06:31:26AM +0100, Markus Armbruster wrote:
>> Hmm, perhaps Peter can still squash in the corrections before posting
>> his PR.  Peter?
> 
> The PR was sent yesterday, it's already in PeterM's -staging.  I worry it's
> a bit late to call a stop now.
> 
> https://lore.kernel.org/qemu-devel/20240228051315.400759-23-pet...@redhat.com
> 
> Steve, could you provide a standalone patch for the update?  Then I'll do
> the best accordingly (e.g. if the PR failed to apply I'll squash when
> resend v2; or I'll pick it up for the next).

I sent the patch.  I also re-wrapped all cpr-reboot paragraphs to 70 columns
and addressed Markus' other comments on "migration: update cpr-reboot 
description".

Peter, if you squash it, the last sentence "cpr-reboot may not be used ..." 
squashes into
   migration: options incompatible with cpr
and everything else squashes into
   migration: update cpr-reboot description

- Steve

Re: [PATCH] migration: re-format cpr-reboot documentation

2024-02-29 Thread Steven Sistare

Please ignore this, I will send a V2 that incorporates additional comments from
Markus that I missed in my inbox.

- Steve

On 2/29/2024 9:15 AM, Steve Sistare wrote:
> Re-wrap the cpr-reboot documentation to 70 columns, use '@' for
> cpr-reboot references, and capitalize COLO.
> 
> Suggested-by: Markus Armbruster 
> Signed-off-by: Steve Sistare 
> ---
>  qapi/migration.json | 36 +++-
>  1 file changed, 19 insertions(+), 17 deletions(-)
> 
> diff --git a/qapi/migration.json b/qapi/migration.json
> index c6bfe2e..b69f62a 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -636,28 +636,30 @@
>  #
>  # @normal: the original form of migration. (since 8.2)
>  #
> -# @cpr-reboot: The migrate command stops the VM and saves state to the URI.
> -# After quitting qemu, the user resumes by running qemu -incoming.
> +# @cpr-reboot: The migrate command stops the VM and saves state to
> +# the URI.  After quitting qemu, the user resumes by running
> +# qemu -incoming.
>  #
> -# This mode allows the user to quit qemu, and restart an updated version
> -# of qemu.  The user may even update and reboot the OS before restarting,
> -# as long as the URI persists across a reboot.
> +# This mode allows the user to quit qemu and restart an updated
> +# version of qemu.  The user may even update and reboot the OS
> +# before restarting, as long as the URI persists across a reboot.
>  #
> -# Unlike normal mode, the use of certain local storage options does not
> -# block the migration, but the user must not modify guest block devices
> -# between the quit and restart.
> +# Unlike normal mode, the use of certain local storage options
> +# does not block the migration, but the user must not modify guest
> +# block devices between the quit and restart.
>  #
> -# This mode supports vfio devices provided the user first puts the guest
> -# in the suspended runstate, such as by issuing guest-suspend-ram to the
> -# qemu guest agent.
> +# This mode supports vfio devices provided the user first puts
> +# the guest in the suspended runstate, such as by issuing
> +# guest-suspend-ram to the qemu guest agent.
>  #
> -# Best performance is achieved when the memory backend is shared and the
> -# @x-ignore-shared migration capability is set, but this is not required.
> -# Further, if the user reboots before restarting such a configuration, 
> the
> -# shared backend must be be non-volatile across reboot, such as by 
> backing
> -# it with a dax device.
> +# Best performance is achieved when the memory backend is shared
> +# and the @x-ignore-shared migration capability is set, but this
> +# is not required.  Further, if the user reboots before restarting
> +# such a configuration, the shared backend must be be non-volatile
> +# across reboot, such as by backing it with a dax device.
>  #
> -# cpr-reboot may not be used with postcopy, colo, or background-snapshot.
> +# @cpr-reboot may not be used with postcopy, background-snapshot,
> +# or COLO.
>  #
>  # (since 8.2)
>  ##

Re: [PATCH V4 11/14] vfio: register container for cpr

2024-02-29 Thread Steven Sistare




On 2/29/2024 3:35 AM, Cédric Le Goater wrote:
> Hello Steve,
> 
> On 2/22/24 18:28, Steve Sistare wrote:
>> Define entry points to perform per-container cpr-specific initialization
>> and teardown.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>   hw/vfio/container.c   | 11 ++-
>>   hw/vfio/cpr.c | 19 +++
>>   hw/vfio/iommufd.c |  6 ++
>>   hw/vfio/meson.build   |  1 +
>>   include/hw/vfio/vfio-common.h |  3 +++
>>   5 files changed, 39 insertions(+), 1 deletion(-)
>>   create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index bd25b9f..096d77e 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -621,10 +621,15 @@ static int vfio_connect_container(VFIOGroup *group, 
>> AddressSpace *as,
>>   goto free_container_exit;
>>   }
>>   +    ret = vfio_cpr_register_container(bcontainer, errp);
>> +    if (ret) {
>> +    goto free_container_exit;
>> +    }
>> +
>>   ret = vfio_ram_block_discard_disable(container, true);
>>   if (ret) {
>>   error_setg_errno(errp, -ret, "Cannot set discarding of RAM 
>> broken");
>> -    goto free_container_exit;
>> +    goto unregister_container_exit;
>>   }
>>     assert(bcontainer->ops->setup);
>> @@ -667,6 +672,9 @@ listener_release_exit:
>>   enable_discards_exit:
>>   vfio_ram_block_discard_disable(container, false);
>>   +unregister_container_exit:
>> +    vfio_cpr_unregister_container(bcontainer);
>> +
>>   free_container_exit:
>>   g_free(container);
>>   @@ -710,6 +718,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>   vfio_container_destroy(bcontainer);
>>     trace_vfio_disconnect_container(container->fd);
>> +    vfio_cpr_unregister_container(bcontainer);
>>   close(container->fd);
>>   g_free(container);
>>   diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> new file mode 100644
>> index 000..3bede54
>> --- /dev/null
>> +++ b/hw/vfio/cpr.c
>> @@ -0,0 +1,19 @@
>> +/*
>> + * Copyright (c) 2021-2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "hw/vfio/vfio-common.h"
>> +#include "qapi/error.h"
>> +
>> +int vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp)
>> +{
>> +    return 0;
>> +}
>> +
>> +void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>> +{
>> +}
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 9bfddc1..e1be224 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -411,6 +411,11 @@ found_container:
>>   goto err_listener_register;
>>   }
>>   +    ret = vfio_cpr_register_container(bcontainer, errp);
>> +    if (ret) {
>> +    goto err_listener_register;
>> +    }
>> +
>>   /*
>>    * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
>>    * for discarding incompatibility check as well?
>> @@ -461,6 +466,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>>   iommufd_cdev_ram_block_discard_disable(false);
>>   }
>>   +    vfio_cpr_unregister_container(bcontainer);
>>   iommufd_cdev_detach_container(vbasedev, container);
>>   iommufd_cdev_container_destroy(container);
>>   vfio_put_address_space(space);
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index bb98493..bba776f 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>>     'container-base.c',
>>     'container.c',
>>     'migration.c',
>> +  'cpr.c',
>>   ))
>>   vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
>>   vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 4a6c262..b9da6c0 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -205,6 +205,9 @@ void vfio_detach_device(VFIODevice *vbasedev);
>>   int vfio_kvm_device_add_fd(int fd, Error **errp);
>>   int vfio_kvm_device_del_fd(int fd, Error **errp);
>>   +int vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error 
>> **errp);
> 
> Should we return bool since we have an errp ? the returned value
> is not an errno AFAICT.
> 
> Anyhow,
> 
> Reviewed-by: Cédric Le Goater 

Hi Cedric,  vfio_cpr_register_container sometimes returns the value returned 
from 
migrate_add_blocker_modes, which is an errno.

Thanks for reviewing these!

- Steve
 
>> +void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
>> +
>>   extern const MemoryRegionOps vfio_region_ops;
>>   typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>>   typedef QLIST_HEAD(VFIODeviceList, VFIODevice) VFIODeviceList;
>

Re: [PATCH V2] migration: export fewer options

2024-02-28 Thread Steven Sistare

Just a reminder, after our further discussion in the V1 thread, 
this patch is still what I propose, no updates needed.

Markus, I think Peter is looking for your blessing on the new
file name: include/migration/client-options.h.

- Steve

On 2/26/2024 5:16 PM, Steve Sistare wrote:
> A small number of migration options are accessed by migration clients,
> but to see them clients must include all of options.h, which is mostly
> for migration core code.  migrate_mode() in particular will be needed by
> multiple clients.
> 
> Refactor the option declarations so clients can see the necessary few via
> misc.h, which already exports a portion of the client API.
> 
> Signed-off-by: Steve Sistare 
> ---
> Changes in V2:
>   * renamed options-pub.h to client-options.h
> ---
> ---
>  hw/vfio/migration.c|  1 -
>  hw/virtio/virtio-balloon.c |  1 -
>  include/migration/client-options.h | 24 
>  include/migration/misc.h   |  1 +
>  migration/options.h|  6 +-
>  system/dirtylimit.c|  1 -
>  6 files changed, 26 insertions(+), 8 deletions(-)
>  create mode 100644 include/migration/client-options.h
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 50140ed..5d4a23c 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -18,7 +18,6 @@
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "migration/migration.h"
> -#include "migration/options.h"
>  #include "migration/savevm.h"
>  #include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 89f853f..a59ff17 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -32,7 +32,6 @@
>  #include "qemu/error-report.h"
>  #include "migration/misc.h"
>  #include "migration/migration.h"
> -#include "migration/options.h"
>  
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
> diff --git a/include/migration/client-options.h 
> b/include/migration/client-options.h
> new file mode 100644
> index 000..887fea1
> --- /dev/null
> +++ b/include/migration/client-options.h
> @@ -0,0 +1,24 @@
> +/*
> + * QEMU public migration capabilities
> + *
> + * Copyright (c) 2012-2023 Red Hat Inc
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef QEMU_MIGRATION_CLIENT_OPTIONS_H
> +#define QEMU_MIGRATION_CLIENT_OPTIONS_H
> +
> +/* capabilities */
> +
> +bool migrate_background_snapshot(void);
> +bool migrate_dirty_limit(void);
> +bool migrate_postcopy_ram(void);
> +bool migrate_switchover_ack(void);
> +
> +/* parameters */
> +
> +MigMode migrate_mode(void);
> +
> +#endif
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 5d1aa59..4c226a4 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -17,6 +17,7 @@
>  #include "qemu/notify.h"
>  #include "qapi/qapi-types-migration.h"
>  #include "qapi/qapi-types-net.h"
> +#include "migration/client-options.h"
>  
>  /* migration/ram.c */
>  
> diff --git a/migration/options.h b/migration/options.h
> index 246c160..964ebdd 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -16,6 +16,7 @@
>  
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
> +#include "migration/client-options.h"
>  
>  /* migration properties */
>  
> @@ -24,12 +25,10 @@ extern Property migration_properties[];
>  /* capabilities */
>  
>  bool migrate_auto_converge(void);
> -bool migrate_background_snapshot(void);
>  bool migrate_block(void);
>  bool migrate_colo(void);
>  bool migrate_compress(void);
>  bool migrate_dirty_bitmaps(void);
> -bool migrate_dirty_limit(void);
>  bool migrate_events(void);
>  bool migrate_ignore_shared(void);
>  bool migrate_late_block_activate(void);
> @@ -37,11 +36,9 @@ bool migrate_multifd(void);
>  bool migrate_pause_before_switchover(void);
>  bool migrate_postcopy_blocktime(void);
>  bool migrate_postcopy_preempt(void);
> -bool migrate_postcopy_ram(void);
>  bool migrate_rdma_pin_all(void);
>  bool migrate_release_ram(void);
>  bool migrate_return_path(void);
> -bool migrate_switchover_ack(void);
>  bool migrate_validate_uuid(void);
>  bool migrate_xbzrle(void);
>  bool migrate_zero_blocks(void);
> @@ -83,7 +80,6 @@ uint8_t migrate_max_cpu_throttle(void);
>  uint64_t migrate_max_bandwidth(void);
>  uint64_t migrate_avail_switchover_bandwidth(void);
>  uint64_t migrate_max_postcopy_bandwidth(void);
> -MigMode migrate_mode(void);
>  int migrate_multifd_channels(void);
>  MultiFDCompression migrate_multifd_compression(void);
>  int migrate_multifd_zlib_level(void);
> diff --git a/system/dirtylimit.c b/system/dirtylimit.c
> index b5607eb..774ff44 100644
> --- a/system/dirtylimit.c
> +++ b/system/dirtylimit.c
> @@ -26,7 +26,6 @@
>  #include "trace.h"
>  #include "migration/misc.h"
>

Re: [PATCH V4 14/14] migration: options incompatible with cpr

2024-02-28 Thread Steven Sistare

On 2/28/2024 11:05 AM, Markus Armbruster wrote:
> Steven Sistare  writes:
> 
>> On 2/28/2024 2:21 AM, Markus Armbruster wrote:
>>> Steve Sistare  writes:
>>>
>>>> Fail the migration request if options are set that are incompatible
>>>> with cpr.
>>>>
>>>> Signed-off-by: Steve Sistare 
> 
> [...]
> 
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index 0990297..c6bfe2e 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -657,6 +657,8 @@
>>>>  # shared backend must be be non-volatile across reboot, such as by 
>>>> backing
>>>>  # it with a dax device.
>>>>  #
>>>> +# cpr-reboot may not be used with postcopy, colo, or 
>>>> background-snapshot.
>>>> +#
>>>
>>> @cpr-reboot
>>>
>>> COLO
>>>
>>> Wrap the line:
>>>
>>># @cpr-reboot may not be used with postcopy, COLO, or
>>># background-snapshot.
>>>
>>> This doesn't tell the reader what settings exactly do not work with
>>> @cpr-reboot.
>>>
>>> For instance "background-snapshot" is about enabling migration
>>> capability @background-snapshot.  We could write something like "is
>>> incompatible with enabling migration capability @background-snapshot".
>>>
>>> Same for the other two.  Worthwhile?
>>
>> I initially was more precise exactly as you suggest, but I realized that
>> postcopy encompasses multiple capabilities, and I did not want to enumerate
>> them, nor require new capabilities related to these 3 to be listed here 
>> if/when they are created, so I generalized.  A keyword search in this file 
>> gives unambiguous matches.
>>
>> Peter pushed the patch a few hours before you sent this.
> 
> Okay.
> 
> A followup patch to correct @cpr-reboot, COLO and line wrapping would be
> nice.

Will do - steve

Re: [PATCH V4 14/14] migration: options incompatible with cpr

2024-02-28 Thread Steven Sistare

On 2/28/2024 2:21 AM, Markus Armbruster wrote:
> Steve Sistare  writes:
> 
>> Fail the migration request if options are set that are incompatible
>> with cpr.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  migration/migration.c | 17 +
>>  qapi/migration.json   |  2 ++
>>  2 files changed, 19 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 90a9094..7652fd4 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1953,6 +1953,23 @@ static bool migrate_prepare(MigrationState *s, bool 
>> blk, bool blk_inc,
>>  return false;
>>  }
>>  
>> +if (migrate_mode_is_cpr(s)) {
>> +const char *conflict = NULL;
>> +
>> +if (migrate_postcopy()) {
>> +conflict = "postcopy";
>> +} else if (migrate_background_snapshot()) {
>> +conflict = "background snapshot";
>> +} else if (migrate_colo()) {
>> +conflict = "COLO";
>> +}
>> +
>> +if (conflict) {
>> +error_setg(errp, "Cannot use %s with CPR", conflict);
>> +return false;
>> +}
>> +}
>> +
>>  if (blk || blk_inc) {
>>  if (migrate_colo()) {
>>  error_setg(errp, "No disk migration is required in COLO mode");
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 0990297..c6bfe2e 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -657,6 +657,8 @@
>>  # shared backend must be be non-volatile across reboot, such as by 
>> backing
>>  # it with a dax device.
>>  #
>> +# cpr-reboot may not be used with postcopy, colo, or 
>> background-snapshot.
>> +#
> 
> @cpr-reboot
> 
> COLO
> 
> Wrap the line:
> 
># @cpr-reboot may not be used with postcopy, COLO, or
># background-snapshot.
> 
> This doesn't tell the reader what settings exactly do not work with
> @cpr-reboot.
> 
> For instance "background-snapshot" is about enabling migration
> capability @background-snapshot.  We could write something like "is
> incompatible with enabling migration capability @background-snapshot".
> 
> Same for the other two.  Worthwhile?

I initially was more precise exactly as you suggest, but I realized that
postcopy encompasses multiple capabilities, and I did not want to enumerate
them, nor require new capabilities related to these 3 to be listed here 
if/when they are created, so I generalized.  A keyword search in this file 
gives unambiguous matches.

Peter pushed the patch a few hours before you sent this.

- Steve

Re: [PATCH V1 00/10] privatize migration.h

2024-02-27 Thread Steven Sistare

Please ignore this solo message, I accidentally sent it alone - steve

On 2/27/2024 12:42 PM, Steve Sistare wrote:
> migration/migration.h is the private interface for code in the migration
> sub-directory, but many other clients include it because they need accessors
> that are not exported by the publc interface in include/migration/misc.h.
> Fix that by refactoring accessors and defining new ones as needed.
> 
> After these fixes, no code outside of migration includes migration.h,
> and no code outside of migration uses MigrationState.
> 
> This series depends on the following:
>   * migration patches in the series "allow cpr-reboot for vfio"
>   * singleton patch "migration: export fewer options"
> 
> Steve Sistare (10):
>   migration: remove migration.h references
>   migration: export migration_is_setup_or_active
>   migration: export migration_is_active
>   migration: export migration_is_running
>   migration: export vcpu_dirty_limit_period
>   migration: migration_thread_is_self
>   migration: migration_is_device
>   migration: migration_file_set_error
>   migration: privatize colo interfaces
>   migration: purge MigrationState from public interface
> 
>  hw/vfio/common.c   | 17 +++---
>  hw/vfio/container.c|  1 -
>  hw/vfio/migration.c| 11 ++-
>  hw/virtio/vhost-user.c |  1 -
>  hw/virtio/virtio-balloon.c |  1 -
>  include/migration/client-options.h |  1 +
>  include/migration/misc.h   | 17 +-
>  migration/colo.c   | 17 ++
>  migration/migration.c  | 67 
> --
>  migration/migration.h  |  7 ++--
>  migration/options.c| 11 +--
>  migration/ram.c|  5 ++-
>  migration/savevm.c |  2 +-
>  net/colo-compare.c |  3 +-
>  net/vhost-vdpa.c   |  3 +-
>  stubs/colo.c   |  1 -
>  system/dirtylimit.c| 12 +++
>  system/qdev-monitor.c  |  1 -
>  target/loongarch/kvm/kvm.c |  1 -
>  target/riscv/kvm/kvm-cpu.c |  4 +--
>  tests/unit/test-vmstate.c  |  1 -
>  21 files changed, 96 insertions(+), 88 deletions(-)
>

Re: [PATCH v7 0/3] string list functions

2024-02-27 Thread Steven Sistare

All the changes look good - steve

On 2/27/2024 10:33 AM, Markus Armbruster wrote:
> This is Steve's work (v1 to v5), tweaked by Philippe (v6), and now by
> me.
> 
> v7:
> * Old PATCH 1 dropped
>   The proposed str_split() is a composition of g_strsplit() and a
>   conversion from char ** to strList *.  I'm willing to accept a
>   conversion function str_list_from_strv() next to
>   strv_from_str_list().  I doubt having a function for the composition
>   is worth the trouble.
> * Old PATCH 4 dropped
>   The tests mostly test g_strsplit().  I'm willing to accept focused
>   tests for strv_from_str_list() and str_list_from_strv().
> * PATCH 1-3: Commit messages tweaked
> * PATCH 2: strv_from_strList() renamed strv_from_str_list(), and moved
>   to qapi/qapi-type-helpers.c.  Function comment tweaked.  Local
>   variable @argv renamed to @strv.
> * PATCH 3: Adjust for the rename.
> 
> Steve Sistare (3):
>   qapi: New QAPI_LIST_LENGTH()
>   qapi: New strv_from_str_list()
>   migration: simplify exec migration functions
> 
>  include/qapi/type-helpers.h |  8 ++
>  include/qapi/util.h | 13 +
>  migration/exec.c| 57 ++---
>  qapi/qapi-type-helpers.c| 14 +
>  4 files changed, 43 insertions(+), 49 deletions(-)
>

Re: [PATCH v6 0/5] string list functions

2024-02-27 Thread Steven Sistare

On 2/27/2024 10:28 AM, Markus Armbruster wrote:
> Philippe Mathieu-Daudé  writes:
> 
>> Hi Markus,
>>
>> Here are the patches I queued until you told me you'd
>> object to the CamelCase filename strList.[ch].
>>
>> Steve, please take over ;)
> 
> I'm going to post the part of the series I'm ready to queue as v7, with
> minor modifications:
> 
> * Rename strv_from_strList() to strv_from_str_list(), and put it into
>   qapi/qapi-type-helpers.c.
> 
> * Tweak its function comment.
> 
> * Rename its local variable @argv to @strv.
> 
> * Cosmetic commit message tweaks.
> 
> We can then talk about the remainder.

Thanks Markus, that saves some time and email.

- Steve

Re: [PATCH V1] migration: export fewer options

2024-02-27 Thread Steven Sistare

On 2/27/2024 3:08 AM, Peter Xu wrote:
> On Mon, Feb 26, 2024 at 09:41:15AM -0500, Steven Sistare wrote:
>> On 2/26/2024 2:40 AM, Markus Armbruster wrote:
>>> Steve Sistare  writes:
>>>
>>>> A small number of migration options are accessed by migration clients,
>>>> but to see them clients must include all of options.h, which is mostly
>>>> for migration core code.  migrate_mode() in particular will be needed by
>>>> multiple clients.
>>>>
>>>> Refactor the option declarations so clients can see the necessary few via
>>>> misc.h, which already exports a portion of the client API.
>>>>
>>>> Signed-off-by: Steve Sistare 
>>>> ---
>>>> I suggest that eventually we should define a single file migration/client.h
>>>> which exports everything needed by the simpler clients: blockers, 
>>>> notifiers,
>>>> options, cpr, and state accessors.
>>>> ---
>>>> ---
>>>>  hw/vfio/migration.c |  1 -
>>>>  hw/virtio/virtio-balloon.c  |  1 -
>>>>  include/migration/misc.h|  1 +
>>>>  include/migration/options-pub.h | 24 
>>>>  migration/options.h |  6 +-
>>>
>>> Unusual naming.  We have zero headers named -pub.h or -public.h, and
>>> dozens named like -int.h or -internal.h.  Please stick to the existing
>>> convention.
>>
>> In the spirit of minimizing changes, I went that route to avoid renaming the 
>> existing migration/options.h and its references:
>>
>> 0 migration/block-dirty-bit 82 #include "options.h"
>> 1 migration/block.c 32 #include "options.h"
>> 2 migration/colo.c  37 #include "options.h"
>> 3 migration/migration-hmp-c 35 #include "options.h"
>> 4 migration/migration.c 68 #include "options.h"
>> 5 migration/multifd-zlib.c  21 #include "options.h"
>> 6 migration/multifd-zstd.c  21 #include "options.h"
>> 7 migration/multifd.c   29 #include "options.h"
>> 8 migration/options.c   30 #include "options.h"
>> 9 migration/postcopy-ram.c  40 #include "options.h"
>> a migration/qemu-file.c 33 #include "options.h"
>> b migration/ram-compress.c  37 #include "options.h"
>> c migration/ram.c   63 #include "options.h"
>> d migration/rdma.c  40 #include "options.h"
>> e migration/savevm.c71 #include "options.h"
>> f migration/socket.c30 #include "options.h"
>> g migration/tls.c   25 #include "options.h"
>>
>> But I take your point.
>>
>> Peter, which do you prefer?
> 
> From statistics, "-internal.h" wins "-int.h":
> 
> $ git grep "\-internal.h" | wc -l
> 135
> $ git grep "\-int.h" | wc -l
> 3

Yes, I did the same search to choose the name for option A below.
But in option B, I keep the existing name for the private file,
and choose a new name for the public file.  There is no suffix
in use in qemu to denote a public file; we just use names indicating
the functionality, so I chose client-options.h.

Let's use client-options.h and move on.  I am preparing another cleanup
series that I think you will like.

- Steve

>>   A. rename: migration/options.h -> migration/options-internal.h 
>>  rename: include/migration/options-pub.h -> include/migration/options.h
>>
>>   B. rename: include/migration/options.h -> 
>> include/migration/client-options.h
>>
>> I prefer B. If you prefer B, but want a different file name, please choose 
>> the
>> final name.
> 
> Personally I don't have a strong opinion on the name.  I'll see whether
> Markus has any comment.
> 
> [and of course, I removed this patch from -staging queue to keep the
>  discussion going..]
>

Re: [PATCH V4 00/14] allow cpr-reboot for vfio

2024-02-26 Thread Steven Sistare

On 2/26/2024 3:21 PM, Steven Sistare wrote:
> On 2/26/2024 4:01 AM, Peter Xu wrote:
>> On Mon, Feb 26, 2024 at 09:49:46AM +0100, Cédric Le Goater wrote:
>>> Go ahead. It will help me for the changes I am doing on error reporting
>>> for VFIO migration. I will rebase on top.
>>
>> Thanks for confirming.  I queued the migration patches then, but leave the
>> two vfio one for further discussion.
> 
> Very good, thanks.  I am always happy to move the ball a few yards closer to
> the goal line :)

Peter, beware that patch 3 needs an edit before being queued.
This hunk snuck in and should be deleted:

[PATCH V4 03/14] migration: convert to NotifierWithReturn
diff --git a/roms/seabios-hppa b/roms/seabios-hppa
index 03774ed..e4eac85 16
--- a/roms/seabios-hppa
+++ b/roms/seabios-hppa
@@ -1 +1 @@
-Subproject commit 03774edaad3bfae090ac96ca5450353c641637d1
+Subproject commit e4eac85880e8677f96d8b9e94de9f2eec9c0751f


- Steve

Re: [PATCH V4 00/14] allow cpr-reboot for vfio

2024-02-26 Thread Steven Sistare

On 2/26/2024 4:01 AM, Peter Xu wrote:
> On Mon, Feb 26, 2024 at 09:49:46AM +0100, Cédric Le Goater wrote:
>> Go ahead. It will help me for the changes I am doing on error reporting
>> for VFIO migration. I will rebase on top.
> 
> Thanks for confirming.  I queued the migration patches then, but leave the
> two vfio one for further discussion.

Very good, thanks.  I am always happy to move the ball a few yards closer to
the goal line :)

- Steve

Re: [PATCH V1] migration: export fewer options

2024-02-26 Thread Steven Sistare

On 2/25/2024 9:40 PM, Peter Xu wrote:
> On Fri, Feb 23, 2024 at 09:13:24AM -0800, Steve Sistare wrote:
>> A small number of migration options are accessed by migration clients,
>> but to see them clients must include all of options.h, which is mostly
>> for migration core code.  migrate_mode() in particular will be needed by
>> multiple clients.
>>
>> Refactor the option declarations so clients can see the necessary few via
>> misc.h, which already exports a portion of the client API.
>>
>> Signed-off-by: Steve Sistare 
> 
> Sounds reasonable, queued, thanks.
> 
>> ---
>> I suggest that eventually we should define a single file migration/client.h
>> which exports everything needed by the simpler clients: blockers, notifiers,
>> options, cpr, and state accessors.
> 
> What's the difference v.s. current migration/misc.h?

This file would be sufficient for most clients:

diff --git a/include/migration/client.h b/include/migration/client.h
new file mode 100644
index 000..a55e504
--- /dev/null
+++ b/include/migration/client.h
@@ -0,0 +1,6 @@
+#ifndef MIGRATION_CLIENT_H
+#define MIGRATION_CLIENT_H
+#include "migration/misc.h"
+#include "migration/blocker.h"
+#include "migration/client-options.h"
+#endif

Or, we could rename misc.h -> client.h and include blocker.h and 
client-options.h in it.

I just like the idea that most clients could include a single, obviously named 
file to
use the most-common exported API.  "misc.h" is not obvious, and not complete.

- Steve

Re: [PATCH v6 0/5] string list functions

2024-02-26 Thread Steven Sistare

Thanks for trying a V6 Philippe.  I'll take it from here.  It helps to know
that someone besides me thinks these functions are worth having.

- Steve

On 2/26/2024 9:11 AM, Philippe Mathieu-Daudé wrote:
> Hi Markus,
> 
> Here are the patches I queued until you told me you'd
> object to the CamelCase filename strList.[ch].
> 
> Steve, please take over ;)
> 
> Since v5:
> - Cover files in MAINTAINERS
> - Complete @docstring mentioning g_auto.
> 
> v5: 
> https://lore.kernel.org/qemu-devel/1708638470-114846-3-git-send-email-steven.sist...@oracle.com/
> 
> Steve Sistare (5):
>   util: str_split
>   qapi: QAPI_LIST_LENGTH
>   util: strv_from_strList
>   util: strList unit tests
>   migration: simplify exec migration functions
> 
>  MAINTAINERS   |  2 +
>  include/monitor/hmp.h |  1 -
>  include/qapi/util.h   | 13 +++
>  include/qemu/strList.h| 33 
>  migration/exec.c  | 57 
>  monitor/hmp-cmds.c| 19 --
>  net/net-hmp-cmds.c|  3 +-
>  stats/stats-hmp-cmds.c|  3 +-
>  tests/unit/test-strList.c | 80 +++
>  util/strList.c| 38 +++
>  tests/unit/meson.build|  1 +
>  util/meson.build  |  1 +
>  12 files changed, 180 insertions(+), 71 deletions(-)
>  create mode 100644 include/qemu/strList.h
>  create mode 100644 tests/unit/test-strList.c
>  create mode 100644 util/strList.c
>

Re: [PATCH V1] migration: export fewer options

2024-02-26 Thread Steven Sistare

On 2/26/2024 2:40 AM, Markus Armbruster wrote:
> Steve Sistare  writes:
> 
>> A small number of migration options are accessed by migration clients,
>> but to see them clients must include all of options.h, which is mostly
>> for migration core code.  migrate_mode() in particular will be needed by
>> multiple clients.
>>
>> Refactor the option declarations so clients can see the necessary few via
>> misc.h, which already exports a portion of the client API.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>> I suggest that eventually we should define a single file migration/client.h
>> which exports everything needed by the simpler clients: blockers, notifiers,
>> options, cpr, and state accessors.
>> ---
>> ---
>>  hw/vfio/migration.c |  1 -
>>  hw/virtio/virtio-balloon.c  |  1 -
>>  include/migration/misc.h|  1 +
>>  include/migration/options-pub.h | 24 
>>  migration/options.h |  6 +-
> 
> Unusual naming.  We have zero headers named -pub.h or -public.h, and
> dozens named like -int.h or -internal.h.  Please stick to the existing
> convention.

In the spirit of minimizing changes, I went that route to avoid renaming the 
existing migration/options.h and its references:

0 migration/block-dirty-bit 82 #include "options.h"
1 migration/block.c 32 #include "options.h"
2 migration/colo.c  37 #include "options.h"
3 migration/migration-hmp-c 35 #include "options.h"
4 migration/migration.c 68 #include "options.h"
5 migration/multifd-zlib.c  21 #include "options.h"
6 migration/multifd-zstd.c  21 #include "options.h"
7 migration/multifd.c   29 #include "options.h"
8 migration/options.c   30 #include "options.h"
9 migration/postcopy-ram.c  40 #include "options.h"
a migration/qemu-file.c 33 #include "options.h"
b migration/ram-compress.c  37 #include "options.h"
c migration/ram.c   63 #include "options.h"
d migration/rdma.c  40 #include "options.h"
e migration/savevm.c71 #include "options.h"
f migration/socket.c30 #include "options.h"
g migration/tls.c   25 #include "options.h"

But I take your point.

Peter, which do you prefer?

  A. rename: migration/options.h -> migration/options-internal.h 
 rename: include/migration/options-pub.h -> include/migration/options.h

  B. rename: include/migration/options.h -> include/migration/client-options.h

I prefer B. If you prefer B, but want a different file name, please choose the
final name.

- Steve

Re: [PATCH V5 1/5] util: str_split

2024-02-23 Thread Steven Sistare

On 2/23/2024 12:41 PM, Philippe Mathieu-Daudé wrote:
> On 23/2/24 15:01, Steven Sistare wrote:
>> On 2/23/2024 1:01 AM, Philippe Mathieu-Daudé wrote:
>>> On 22/2/24 22:47, Steve Sistare wrote:
>>>> Generalize hmp_split_at_comma() to take any delimiter string, rename
>>>> as str_split(), and move it to util/strList.c.
>>>>
>>>> No functional change.
>>>>
>>>> Signed-off-by: Steve Sistare 
>>>> ---
>>>>    include/monitor/hmp.h  |  1 -
>>>>    include/qemu/strList.h | 24 
>>>>    monitor/hmp-cmds.c | 19 ---
>>>>    net/net-hmp-cmds.c |  3 ++-
>>>>    stats/stats-hmp-cmds.c |  3 ++-
>>>>    util/meson.build   |  1 +
>>>>    util/strList.c | 24 
>>>>    7 files changed, 53 insertions(+), 22 deletions(-)
>>>>    create mode 100644 include/qemu/strList.h
>>>>    create mode 100644 util/strList.c
>>>
>>>
>>>> +#include "qapi/qapi-builtin-types.h"
>>>> +
>>>> +/*
>>>> + * Split @str into a strList using the delimiter string @delim.
>>>> + * The delimiter is not included in the result.
>>>> + * Return NULL if @str is NULL or an empty string.
>>>> + * A leading, trailing, or consecutive delimiter produces an
>>>> + * empty string at that position in the output.
>>>> + * All strings are g_strdup'd, and the result can be freed
>>>> + * using qapi_free_strList.
>>>
>>> Note "qapi/qapi-builtin-types.h" defines:
>>>
>>>    G_DEFINE_AUTOPTR_CLEANUP_FUNC(strList, qapi_free_strList)
>>>
>>> Maybe mention we can also use:
>>>
>>>    g_autoptr(strList)
>>
>> Thanks Philippe.  If we get to V6 for this series, I will mention this,
>> and also mention g_autoptr(GStrv) in the header comment for 
>> strv_from_strlist.
> 
> If there is no need for v6, do you mind sharing here what would be
> the resulting comment? Maybe Markus can update it directly...
> (assuming he takes your series).

Sure - steve


diff --git a/include/qemu/strList.h b/include/qemu/strList.h
index c1eb1dd..b13bd53 100644
--- a/include/qemu/strList.h
+++ b/include/qemu/strList.h
@@ -17,13 +17,16 @@
  * A leading, trailing, or consecutive delimiter produces an
  * empty string at that position in the output.
  * All strings are g_strdup'd, and the result can be freed
- * using qapi_free_strList.
+ * using qapi_free_strList, or by declaring a local variable
+ * with g_autoptr(strList).
  */
 strList *str_split(const char *str, const char *delim);

 /*
  * Produce and return a NULL-terminated array of strings from @list.
- * The result is g_malloc'd and all strings are g_strdup'd.
+ * The result is g_malloc'd and all strings are g_strdup'd.  The result
+ * can be freed using g_strfreev, or by declaring a local variable with
+ * g_auto(GStrv).
  */
 char **strv_from_strList(const strList *list);

Re: [PATCH V5 1/5] util: str_split

2024-02-23 Thread Steven Sistare

On 2/23/2024 1:01 AM, Philippe Mathieu-Daudé wrote:
> On 22/2/24 22:47, Steve Sistare wrote:
>> Generalize hmp_split_at_comma() to take any delimiter string, rename
>> as str_split(), and move it to util/strList.c.
>>
>> No functional change.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>   include/monitor/hmp.h  |  1 -
>>   include/qemu/strList.h | 24 
>>   monitor/hmp-cmds.c | 19 ---
>>   net/net-hmp-cmds.c |  3 ++-
>>   stats/stats-hmp-cmds.c |  3 ++-
>>   util/meson.build   |  1 +
>>   util/strList.c | 24 
>>   7 files changed, 53 insertions(+), 22 deletions(-)
>>   create mode 100644 include/qemu/strList.h
>>   create mode 100644 util/strList.c
> 
> 
>> +#include "qapi/qapi-builtin-types.h"
>> +
>> +/*
>> + * Split @str into a strList using the delimiter string @delim.
>> + * The delimiter is not included in the result.
>> + * Return NULL if @str is NULL or an empty string.
>> + * A leading, trailing, or consecutive delimiter produces an
>> + * empty string at that position in the output.
>> + * All strings are g_strdup'd, and the result can be freed
>> + * using qapi_free_strList.
> 
> Note "qapi/qapi-builtin-types.h" defines:
> 
>   G_DEFINE_AUTOPTR_CLEANUP_FUNC(strList, qapi_free_strList)
> 
> Maybe mention we can also use:
> 
>   g_autoptr(strList)

Thanks Philippe.  If we get to V6 for this series, I will mention this,
and also mention g_autoptr(GStrv) in the header comment for strv_from_strlist.

- Steve

>> + */
>> +strList *str_split(const char *str, const char *delim);
>> +
>> +#endif
>

Re: [PATCH V4 1/5] util: strList_from_string

2024-02-22 Thread Steven Sistare

On 2/21/2024 12:01 PM, Steven Sistare wrote:
> On 2/21/2024 8:29 AM, Markus Armbruster wrote:
>> I apologize for the lateness of my review.
> 
> No problem.  Thanks for the review.
> 
>> Steve Sistare  writes:
>>
>>> Generalize hmp_split_at_comma() to take any delimiter string, rename
>>> as strList_from_string(), and move it to util/strList.c.
>>>
>>> No functional change.
>>>
>>> Signed-off-by: Steve Sistare 
>>
>> I can't see an actual use of generalized delimiters outside tests in
>> this series.  Do you have uses?
> 
> In this series, it is called from hmp_announce_self and stats_filter; 
> those were formerly calls to hmp_split_at_comma.
> 
> In my live update cpr-exec series, there is an additional call site, with a
> space delimiter instead of comma.  Live update V9 is posted but is old and 
> obsolete.  
> I will post V10 soon, but I am hoping you can pull this series first, so I 
> can 
> whittle down my pending patches and omit these from V10.
> 
>>> ---
>>>  include/monitor/hmp.h  |  1 -
>>>  include/qemu/strList.h | 24 
>>>  monitor/hmp-cmds.c | 19 ---
>>>  net/net-hmp-cmds.c |  3 ++-
>>>  stats/stats-hmp-cmds.c |  3 ++-
>>>  util/meson.build   |  1 +
>>>  util/strList.c | 24 
>>>  7 files changed, 53 insertions(+), 22 deletions(-)
>>>  create mode 100644 include/qemu/strList.h
>>>  create mode 100644 util/strList.c
>>>
>>> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
>>> index 13f9a2d..2df661e 100644
>>> --- a/include/monitor/hmp.h
>>> +++ b/include/monitor/hmp.h
>>> @@ -19,7 +19,6 @@
>>>  
>>>  bool hmp_handle_error(Monitor *mon, Error *err);
>>>  void hmp_help_cmd(Monitor *mon, const char *name);
>>> -strList *hmp_split_at_comma(const char *str);
>>>  
>>>  void hmp_info_name(Monitor *mon, const QDict *qdict);
>>>  void hmp_info_version(Monitor *mon, const QDict *qdict);
>>> diff --git a/include/qemu/strList.h b/include/qemu/strList.h
>>> new file mode 100644
>>> index 000..010237f
>>> --- /dev/null
>>> +++ b/include/qemu/strList.h
>>> @@ -0,0 +1,24 @@
>>> +/*
>>> + * Copyright (c) 2022 - 2024 Oracle and/or its affiliates.
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or 
>>> later.
>>> + * See the COPYING file in the top-level directory.
>>> + */
>>> +
>>> +#ifndef QEMU_STR_LIST_H
>>> +#define QEMU_STR_LIST_H
>>> +
>>> +#include "qapi/qapi-builtin-types.h"
>>> +
>>> +/*
>>> + * Break @in into a strList using the delimiter string @delim.
>>> + * The delimiter is not included in the result.
>>> + * Return NULL if @in is NULL or an empty string.
>>> + * A leading, trailing, or consecutive delimiter produces an
>>> + * empty string at that position in the output.
>>> + * All strings are g_strdup'd, and the result can be freed
>>> + * using qapi_free_strList.
>>> + */
>>> +strList *strList_from_string(const char *in, const char *delim);
>>
>> The function name no longer tells us explicitly what the function does:
>> splitting the string.
> 
> The first sentence does not say it?
>   "Break @in into a strList using the delimiter string @delim"
> 
> Would you prefer this?
>   "Split string @in into a strList using the delimiter string @delim"

Sorry, I read your comment too quickly.  You want a different function name.
I propose:
  strList *str_split(const char *str, const char *delim);

- Steve

Re: [PATCH V4 00/14] allow cpr-reboot for vfio

2024-02-22 Thread Steven Sistare

Peter (and David if interested): these patches still need RB:
  migration: notifier error checking
  migration: stop vm for cpr
  migration: update cpr-reboot description
  migration: options incompatible with cpr

Alex, these patches still need RB:
  vfio: register container for cpr
  vfio: allow cpr-reboot migration if suspended

Thanks!

- Steve

On 2/22/2024 12:28 PM, Steve Sistare wrote:
> Allow cpr-reboot for vfio if the guest is in the suspended runstate.  The
> guest drivers' suspend methods flush outstanding requests and re-initialize
> the devices, and thus there is no device state to save and restore.  The
> user is responsible for suspending the guest before initiating cpr, such as
> by issuing guest-suspend-ram to the qemu guest agent.
> 
> Most of the patches in this series enhance migration notifiers so they can
> return an error status and message.  The last few patches register a notifier
> for vfio that returns an error if the guest is not suspended.
> 
> Changes in V3:
>   * update to tip, add RB's
>   * replace MigrationStatus with new enum MigrationEventType
>   * simplify migrate_fd_connect error recovery
>   * support vfio iommufd containers
>   * add patches:
>   migration: stop vm for cpr
>   migration: update cpr-reboot description
> 
> Changes in V4:
>   * rebase to tip, add RB's
>   * add patch to prevent options such as precopy from being used with cpr.
>   migration: options incompatible with cpr
>   * refactor subroutines in "stop vm for cpr"
>   * simplify "notifier error checking" patch by restricting that only
> notifiers for MIG_EVENT_PRECOPY_SETUP may return an error.
>   * doc that a fail event may be sent without a prior setup event
> 
> Steve Sistare (14):
>   notify: pass error to notifier with return
>   migration: remove error from notifier data
>   migration: convert to NotifierWithReturn
>   migration: MigrationEvent for notifiers
>   migration: remove postcopy_after_devices
>   migration: MigrationNotifyFunc
>   migration: per-mode notifiers
>   migration: refactor migrate_fd_connect failures
>   migration: notifier error checking
>   migration: stop vm for cpr
>   vfio: register container for cpr
>   vfio: allow cpr-reboot migration if suspended
>   migration: update cpr-reboot description
>   migration: options incompatible with cpr
> 
>  hw/net/virtio-net.c   |  13 +--
>  hw/vfio/common.c  |   2 +-
>  hw/vfio/container.c   |  11 ++-
>  hw/vfio/cpr.c |  39 +
>  hw/vfio/iommufd.c |   6 ++
>  hw/vfio/meson.build   |   1 +
>  hw/vfio/migration.c   |  15 ++--
>  hw/vfio/trace-events  |   2 +-
>  hw/virtio/vhost-user.c|  10 +--
>  hw/virtio/virtio-balloon.c|   3 +-
>  include/hw/vfio/vfio-common.h |   5 +-
>  include/hw/vfio/vfio-container-base.h |   1 +
>  include/hw/virtio/virtio-net.h|   2 +-
>  include/migration/misc.h  |  47 +--
>  include/qemu/notify.h |   8 +-
>  migration/migration.c | 148 
> +++---
>  migration/migration.h |   4 -
>  migration/postcopy-ram.c  |   3 +-
>  migration/postcopy-ram.h  |   1 -
>  migration/ram.c   |   3 +-
>  net/vhost-vdpa.c  |  14 ++--
>  qapi/migration.json   |  37 ++---
>  roms/seabios-hppa |   2 +-
>  ui/spice-core.c   |  17 ++--
>  util/notify.c |   5 +-
>  25 files changed, 275 insertions(+), 124 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
>

Re: [PATCH V3 10/13] migration: stop vm for cpr

2024-02-22 Thread Steven Sistare

On 2/22/2024 4:30 AM, Peter Xu wrote:
> On Thu, Feb 22, 2024 at 05:12:53PM +0800, Peter Xu wrote:
>> On Wed, Feb 21, 2024 at 04:20:07PM -0500, Steven Sistare wrote:
>>> On 2/20/2024 2:33 AM, Peter Xu wrote:
>>>> On Thu, Feb 08, 2024 at 10:54:03AM -0800, Steve Sistare wrote:
>>>>> When migration for cpr is initiated, stop the vm and set state
>>>>> RUN_STATE_FINISH_MIGRATE before ram is saved.  This eliminates the
>>>>> possibility of ram and device state being out of sync, and guarantees
>>>>> that a guest in the suspended state remains suspended, because qmp_cont
>>>>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state.
>>>>>
>>>>> Signed-off-by: Steve Sistare 
>>>>> ---
>>>>>  include/migration/misc.h |  1 +
>>>>>  migration/migration.c| 32 +---
>>>>>  2 files changed, 26 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>>>>> index 6dc234b..54c99a3 100644
>>>>> --- a/include/migration/misc.h
>>>>> +++ b/include/migration/misc.h
>>>>> @@ -60,6 +60,7 @@ void migration_object_init(void);
>>>>>  void migration_shutdown(void);
>>>>>  bool migration_is_idle(void);
>>>>>  bool migration_is_active(MigrationState *);
>>>>> +bool migrate_mode_is_cpr(MigrationState *);
>>>>>  
>>>>>  typedef enum MigrationEventType {
>>>>>  MIG_EVENT_PRECOPY_SETUP,
>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>> index d1fce9e..fc5c587 100644
>>>>> --- a/migration/migration.c
>>>>> +++ b/migration/migration.c
>>>>> @@ -1603,6 +1603,11 @@ bool migration_is_active(MigrationState *s)
>>>>>  s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>>>>>  }
>>>>>  
>>>>> +bool migrate_mode_is_cpr(MigrationState *s)
>>>>> +{
>>>>> +return s->parameters.mode == MIG_MODE_CPR_REBOOT;
>>>>> +}
>>>>> +
>>>>>  int migrate_init(MigrationState *s, Error **errp)
>>>>>  {
>>>>>  int ret;
>>>>> @@ -2651,13 +2656,14 @@ static int 
>>>>> migration_completion_precopy(MigrationState *s,
>>>>>  bql_lock();
>>>>>  migration_downtime_start(s);
>>>>>  
>>>>> -s->vm_old_state = runstate_get();
>>>>> -global_state_store();
>>>>> -
>>>>> -ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>>>>> -trace_migration_completion_vm_stop(ret);
>>>>> -if (ret < 0) {
>>>>> -goto out_unlock;
>>>>> +if (!migrate_mode_is_cpr(s)) {
>>>>> +s->vm_old_state = runstate_get();
>>>>> +global_state_store();
>>>>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>>>>> +trace_migration_completion_vm_stop(ret);
>>>>> +if (ret < 0) {
>>>>> +goto out_unlock;
>>>>> +}
>>>>>  }
>>>>>  
>>>>>  ret = migration_maybe_pause(s, current_active_state,
>>>>> @@ -3576,6 +3582,7 @@ void migrate_fd_connect(MigrationState *s, Error 
>>>>> *error_in)
>>>>>  Error *local_err = NULL;
>>>>>  uint64_t rate_limit;
>>>>>  bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
>>>>> +int ret;
>>>>>  
>>>>>  /*
>>>>>   * If there's a previous error, free it and prepare for another one.
>>>>> @@ -3651,6 +3658,17 @@ void migrate_fd_connect(MigrationState *s, Error 
>>>>> *error_in)
>>>>>  goto fail;
>>>>>  }
>>>>>  
>>>>> +if (migrate_mode_is_cpr(s)) {
>>>>> +s->vm_old_state = runstate_get();
>>>>> +global_state_store();
>>>>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>>>>> +trace_migration_completion_vm_stop(ret);
>>>>> +if (ret < 0) {
>>>>> +error_setg(_err, "migration_stop_vm failed, error %d", 
>>>>> -ret);
>>>>> +

Re: [PATCH V3 10/13] migration: stop vm for cpr

2024-02-22 Thread Steven Sistare

On 2/22/2024 4:03 AM, Peter Xu wrote:
> On Wed, Feb 21, 2024 at 04:23:07PM -0500, Steven Sistare wrote:
>>> How about postcopy?  I know it's nonsense to enable postcopy for cpr.. but
>>> iiuc we don't yet forbid an user doing so.  Maybe we should?
>>
>> How about this?
>>
>> ---
>> @@ -3600,6 +3600,11 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  return;
>>  }
>>
>> +if (migrate_mode_is_cpr(s) && migrate_postcopy()) {
>> +error_setg(_err, "cannot mix postcopy and cpr");
>> +goto fail;
>> +}
>> +
>>  if (resume) {
>>  /* This is a resumed migration */
>>  rate_limit = migrate_max_postcopy_bandwidth();
>> 
> 
> migrate_fd_connect() will be a bit late, the error won't be able to be
> attached in the "migrate" request.  Perhaps, migrate_prepare()?

Thank you, that is better - steve

Re: [PATCH V3 10/13] migration: stop vm for cpr

2024-02-21 Thread Steven Sistare

On 2/20/2024 2:33 AM, Peter Xu wrote:
> On Thu, Feb 08, 2024 at 10:54:03AM -0800, Steve Sistare wrote:
>> When migration for cpr is initiated, stop the vm and set state
>> RUN_STATE_FINISH_MIGRATE before ram is saved.  This eliminates the
>> possibility of ram and device state being out of sync, and guarantees
>> that a guest in the suspended state remains suspended, because qmp_cont
>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  include/migration/misc.h |  1 +
>>  migration/migration.c| 32 +---
>>  2 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 6dc234b..54c99a3 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -60,6 +60,7 @@ void migration_object_init(void);
>>  void migration_shutdown(void);
>>  bool migration_is_idle(void);
>>  bool migration_is_active(MigrationState *);
>> +bool migrate_mode_is_cpr(MigrationState *);
>>  
>>  typedef enum MigrationEventType {
>>  MIG_EVENT_PRECOPY_SETUP,
>> diff --git a/migration/migration.c b/migration/migration.c
>> index d1fce9e..fc5c587 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1603,6 +1603,11 @@ bool migration_is_active(MigrationState *s)
>>  s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>>  }
>>  
>> +bool migrate_mode_is_cpr(MigrationState *s)
>> +{
>> +return s->parameters.mode == MIG_MODE_CPR_REBOOT;
>> +}
>> +
>>  int migrate_init(MigrationState *s, Error **errp)
>>  {
>>  int ret;
>> @@ -2651,13 +2656,14 @@ static int 
>> migration_completion_precopy(MigrationState *s,
>>  bql_lock();
>>  migration_downtime_start(s);
>>  
>> -s->vm_old_state = runstate_get();
>> -global_state_store();
>> -
>> -ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> -trace_migration_completion_vm_stop(ret);
>> -if (ret < 0) {
>> -goto out_unlock;
>> +if (!migrate_mode_is_cpr(s)) {
>> +s->vm_old_state = runstate_get();
>> +global_state_store();
>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> +trace_migration_completion_vm_stop(ret);
>> +if (ret < 0) {
>> +goto out_unlock;
>> +}
>>  }
>>  
>>  ret = migration_maybe_pause(s, current_active_state,
>> @@ -3576,6 +3582,7 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  Error *local_err = NULL;
>>  uint64_t rate_limit;
>>  bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
>> +int ret;
>>  
>>  /*
>>   * If there's a previous error, free it and prepare for another one.
>> @@ -3651,6 +3658,17 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  goto fail;
>>  }
>>  
>> +if (migrate_mode_is_cpr(s)) {
>> +s->vm_old_state = runstate_get();
>> +global_state_store();
>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> +trace_migration_completion_vm_stop(ret);
>> +if (ret < 0) {
>> +error_setg(_err, "migration_stop_vm failed, error %d", 
>> -ret);
>> +goto fail;
>> +}
>> +}
> 
> Could we have a helper function for the shared codes?
> 
> How about postcopy?  I know it's nonsense to enable postcopy for cpr.. but
> iiuc we don't yet forbid an user doing so.  Maybe we should?

How about this?

---
@@ -3600,6 +3600,11 @@ void migrate_fd_connect(MigrationState *s, Error 
*error_in)
 return;
 }

+if (migrate_mode_is_cpr(s) && migrate_postcopy()) {
+error_setg(_err, "cannot mix postcopy and cpr");
+goto fail;
+}
+
 if (resume) {
 /* This is a resumed migration */
 rate_limit = migrate_max_postcopy_bandwidth();


- Steve

Re: [PATCH V3 10/13] migration: stop vm for cpr

2024-02-21 Thread Steven Sistare

On 2/20/2024 2:33 AM, Peter Xu wrote:
> On Thu, Feb 08, 2024 at 10:54:03AM -0800, Steve Sistare wrote:
>> When migration for cpr is initiated, stop the vm and set state
>> RUN_STATE_FINISH_MIGRATE before ram is saved.  This eliminates the
>> possibility of ram and device state being out of sync, and guarantees
>> that a guest in the suspended state remains suspended, because qmp_cont
>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  include/migration/misc.h |  1 +
>>  migration/migration.c| 32 +---
>>  2 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 6dc234b..54c99a3 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -60,6 +60,7 @@ void migration_object_init(void);
>>  void migration_shutdown(void);
>>  bool migration_is_idle(void);
>>  bool migration_is_active(MigrationState *);
>> +bool migrate_mode_is_cpr(MigrationState *);
>>  
>>  typedef enum MigrationEventType {
>>  MIG_EVENT_PRECOPY_SETUP,
>> diff --git a/migration/migration.c b/migration/migration.c
>> index d1fce9e..fc5c587 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1603,6 +1603,11 @@ bool migration_is_active(MigrationState *s)
>>  s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>>  }
>>  
>> +bool migrate_mode_is_cpr(MigrationState *s)
>> +{
>> +return s->parameters.mode == MIG_MODE_CPR_REBOOT;
>> +}
>> +
>>  int migrate_init(MigrationState *s, Error **errp)
>>  {
>>  int ret;
>> @@ -2651,13 +2656,14 @@ static int 
>> migration_completion_precopy(MigrationState *s,
>>  bql_lock();
>>  migration_downtime_start(s);
>>  
>> -s->vm_old_state = runstate_get();
>> -global_state_store();
>> -
>> -ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> -trace_migration_completion_vm_stop(ret);
>> -if (ret < 0) {
>> -goto out_unlock;
>> +if (!migrate_mode_is_cpr(s)) {
>> +s->vm_old_state = runstate_get();
>> +global_state_store();
>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> +trace_migration_completion_vm_stop(ret);
>> +if (ret < 0) {
>> +goto out_unlock;
>> +}
>>  }
>>  
>>  ret = migration_maybe_pause(s, current_active_state,
>> @@ -3576,6 +3582,7 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  Error *local_err = NULL;
>>  uint64_t rate_limit;
>>  bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
>> +int ret;
>>  
>>  /*
>>   * If there's a previous error, free it and prepare for another one.
>> @@ -3651,6 +3658,17 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  goto fail;
>>  }
>>  
>> +if (migrate_mode_is_cpr(s)) {
>> +s->vm_old_state = runstate_get();
>> +global_state_store();
>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> +trace_migration_completion_vm_stop(ret);
>> +if (ret < 0) {
>> +error_setg(_err, "migration_stop_vm failed, error %d", 
>> -ret);
>> +goto fail;
>> +}
>> +}
> 
> Could we have a helper function for the shared codes?

I propose to add code to migration_stop_vm to make it the helper.  Some call 
sites emit
more traces (via migration_stop_vm) as a result of my refactoring, and postcopy 
start sets 
vm_old_state, which is not used thereafter in that path.  Those changes seem 
harmless to me.
Tell me what you think:

---
diff --git a/migration/migration.c b/migration/migration.c
index fc5c587..30d2b08 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -107,12 +107,6 @@ static int migration_maybe_pause(MigrationState *s,
 static void migrate_fd_cancel(MigrationState *s);
 static bool close_return_path_on_source(MigrationState *s);

-static void migration_downtime_start(MigrationState *s)
-{
-trace_vmstate_downtime_checkpoint("src-downtime-start");
-s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
-}
-
 static void migration_downtime_end(MigrationState *s)
 {
 int64_t now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
@@ -161,11 +155,20 @@ static gint page_request_addr_cmp(gconstpointer ap, 
gconstpo
 return (a > b) - (a < b);
 }

-int migration_stop_vm(RunState state)
+static int migration_stop_vm(MigrationState *s, RunState state)
 {
-int ret = vm_stop_force_state(state);
+int ret;
+
+trace_vmstate_downtime_checkpoint("src-downtime-start");
+s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+s->vm_old_state = runstate_get();
+global_state_store();
+
+ret = vm_stop_force_state(state);

 trace_vmstate_downtime_checkpoint("src-vm-stopped");
+trace_migration_completion_vm_stop(ret);

 return ret;
 }
@@ -2454,10 +2457,7 @@ static int

Re: [PATCH V3 12/13] vfio: allow cpr-reboot migration if suspended

2024-02-21 Thread Steven Sistare

Hi Alex, any comments or RB on this or patch 11?  The last few changes I am 
making for Peter will not change this patch.

- Steve

On 2/8/2024 1:54 PM, Steve Sistare wrote:
> Allow cpr-reboot for vfio if the guest is in the suspended runstate.  The
> guest drivers' suspend methods flush outstanding requests and re-initialize
> the devices, and thus there is no device state to save and restore.  The
> user is responsible for suspending the guest before initiating cpr, such as
> by issuing guest-suspend-ram to the qemu guest agent.
> 
> Relax the vfio blocker so it does not apply to cpr, and add a notifier that
> verifies the guest is suspended.
> 
> Signed-off-by: Steve Sistare 
> ---
>  hw/vfio/common.c  |  2 +-
>  hw/vfio/cpr.c | 20 
>  hw/vfio/migration.c   |  2 +-
>  include/hw/vfio/vfio-container-base.h |  1 +
>  4 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 059bfdc..ff88c3f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -128,7 +128,7 @@ int vfio_block_multiple_devices_migration(VFIODevice 
> *vbasedev, Error **errp)
>  error_setg(_devices_migration_blocker,
> "Multiple VFIO devices migration is supported only if all of "
> "them support P2P migration");
> -ret = migrate_add_blocker(_devices_migration_blocker, errp);
> +ret = migrate_add_blocker_normal(_devices_migration_blocker, 
> errp);
>  
>  return ret;
>  }
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 3bede54..392c2dd 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -7,13 +7,33 @@
>  
>  #include "qemu/osdep.h"
>  #include "hw/vfio/vfio-common.h"
> +#include "migration/misc.h"
>  #include "qapi/error.h"
> +#include "sysemu/runstate.h"
> +
> +static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
> +MigrationEvent *e, Error **errp)
> +{
> +if (e->type == MIG_EVENT_PRECOPY_SETUP &&
> +!runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
> +
> +error_setg(errp,
> +"VFIO device only supports cpr-reboot for runstate suspended");
> +
> +return -1;
> +}
> +return 0;
> +}
>  
>  int vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp)
>  {
> +migration_add_notifier_mode(>cpr_reboot_notifier,
> +vfio_cpr_reboot_notifier,
> +MIG_MODE_CPR_REBOOT);
>  return 0;
>  }
>  
>  void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>  {
> +migration_remove_notifier(>cpr_reboot_notifier);
>  }
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 50140ed..2050ac8 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -889,7 +889,7 @@ static int vfio_block_migration(VFIODevice *vbasedev, 
> Error *err, Error **errp)
>  vbasedev->migration_blocker = error_copy(err);
>  error_free(err);
>  
> -return migrate_add_blocker(>migration_blocker, errp);
> +return migrate_add_blocker_normal(>migration_blocker, errp);
>  }
>  
>  /* -- */
> diff --git a/include/hw/vfio/vfio-container-base.h 
> b/include/hw/vfio/vfio-container-base.h
> index b2813b0..3582d5f 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -49,6 +49,7 @@ typedef struct VFIOContainerBase {
>  QLIST_ENTRY(VFIOContainerBase) next;
>  QLIST_HEAD(, VFIODevice) device_list;
>  GList *iova_ranges;
> +NotifierWithReturn cpr_reboot_notifier;
>  } VFIOContainerBase;
>  
>  typedef struct VFIOGuestIOMMU {

Re: [PATCH V4 2/5] qapi: QAPI_LIST_LENGTH

2024-02-21 Thread Steven Sistare

On 2/21/2024 8:29 AM, Markus Armbruster wrote:
> Steve Sistare  writes:
> 
>> Signed-off-by: Steve Sistare 
>> Reviewed-by: Marc-André Lureau 
>> ---
>>  include/qapi/util.h | 13 +
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/include/qapi/util.h b/include/qapi/util.h
>> index 81a2b13..e1b8b1d 100644
>> --- a/include/qapi/util.h
>> +++ b/include/qapi/util.h
>> @@ -56,4 +56,17 @@ int parse_qapi_name(const char *name, bool complete);
>>  (tail) = &(*(tail))->next; \
>>  } while (0)
>>  
>> +/*
>> + * For any GenericList @list, return its length.
>> + */
>> +#define QAPI_LIST_LENGTH(list) \
>> +({ \
>> +int len = 0; \
> 
> size_t

ok.

>> +typeof(list) elem; \
> 
> Name this @tail, please.

ok.

>> +for (elem = list; elem != NULL; elem = elem->next) { \
>> +len++; \
>> +} \
>> +len; \
>> +})
>> +
>>  #endif
> 
> This is a macro instead of a function so users don't have to cast their
> FooList * to GenericList *.
> 
> The only user outside tests is strv_from_strList().  I'd be tempted to
> open-code it there and call it a day.  Or do you have more users in
> mind?

That's the only use.  If I make it private, I would still define it as
a static subroutine in util/strList.c, because it simplifies strv_from_strList.
IMO providing a public list length function or macro is pretty basic, but
I am not wedded to it.  Your call.

- Steve

> If we keep the macro, please align the backslashes like this:
> 
>#define QAPI_LIST_LENGTH(list)  \
>({  \
>int len = 0;\
>typeof(list) elem;  \
>for (elem = list; elem != NULL; elem = elem->next) {\
>len++;  \
>}   \
>len;\
>})
>

Re: [PATCH V4 3/5] util: strv_from_strList

2024-02-21 Thread Steven Sistare

On 2/21/2024 8:14 AM, Markus Armbruster wrote:
> Steve Sistare  writes:
> 
>> Signed-off-by: Steve Sistare 
>> Reviewed-by: Marc-André Lureau 
>> ---
>>  include/qemu/strList.h |  6 ++
>>  util/strList.c | 14 ++
>>  2 files changed, 20 insertions(+)
>>
>> diff --git a/include/qemu/strList.h b/include/qemu/strList.h
>> index 010237f..4b86aa6 100644
>> --- a/include/qemu/strList.h
>> +++ b/include/qemu/strList.h
>> @@ -21,4 +21,10 @@
>>   */
>>  strList *strList_from_string(const char *in, const char *delim);
>>  
>> +/*
>> + * Produce and return a NULL-terminated array of strings from @args.
>> + * The result is g_malloc'd and all strings are g_strdup'd.
>> + */
>> +GStrv strv_from_strList(const strList *args);
>> +
>>  #endif
>> diff --git a/util/strList.c b/util/strList.c
>> index 7991de3..bad4187 100644
>> --- a/util/strList.c
>> +++ b/util/strList.c
>> @@ -22,3 +22,17 @@ strList *strList_from_string(const char *str, const char 
>> *delim)
>>  
>>  return res;
>>  }
>> +
>> +GStrv strv_from_strList(const strList *args)
> 
> Suggest to name the argument @list.
> 
>> +{
>> +const strList *arg;
> 
> Suggest to name this @tail.

ok.

>> +int i = 0;
>> +GStrv argv = g_new(char *, QAPI_LIST_LENGTH(args) + 1);
>> +
>> +for (arg = args; arg != NULL; arg = arg->next) {
>> +argv[i++] = g_strdup(arg->value);
>> +}
>> +argv[i] = NULL;
>> +
>> +return argv;
>> +}
> 
> Can we use char ** instread of GStrv?  I'd find that clearer.  For what
> it's worth, GLib documentation of functions like g_strsplit() doesn't
> use the GStrv typedef, either.

ok.

- Steve

Re: [PATCH V4 4/5] util: strList unit tests

2024-02-21 Thread Steven Sistare

On 2/21/2024 8:19 AM, Markus Armbruster wrote:
> Steve Sistare  writes:
> 
>> Signed-off-by: Steve Sistare 
>> Reviewed-by: Marc-André Lureau 
>> ---
>>  tests/unit/meson.build|  1 +
>>  tests/unit/test-strList.c | 80 
>> +++
>>  2 files changed, 81 insertions(+)
>>  create mode 100644 tests/unit/test-strList.c
>>
>> diff --git a/tests/unit/meson.build b/tests/unit/meson.build
>> index 69f9c05..113d12e 100644
>> --- a/tests/unit/meson.build
>> +++ b/tests/unit/meson.build
>> @@ -35,6 +35,7 @@ tests = {
>>'test-rcu-simpleq': [],
>>'test-rcu-tailq': [],
>>'test-rcu-slist': [],
>> +  'test-strList': [],
>>'test-qdist': [],
>>'test-qht': [],
>>'test-qtree': [],
>> diff --git a/tests/unit/test-strList.c b/tests/unit/test-strList.c
>> new file mode 100644
>> index 000..49a1cfd
>> --- /dev/null
>> +++ b/tests/unit/test-strList.c
>> @@ -0,0 +1,80 @@
>> +/*
>> + * Copyright (c) 2022 - 2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/strList.h"
>> +
>> +static strList *make_list(int length)
>> +{
>> +strList *head = 0, *list, **prev = 
>> +
>> +while (length--) {
>> +list = *prev = g_new0(strList, 1);
>> +list->value = g_strdup("aaa");
>> +prev = >next;
>> +}
>> +return head;
>> +}
>> +
>> +static void test_length(void)
>> +{
>> +strList *list;
>> +int i;
>> +
>> +for (i = 0; i < 5; i++) {
>> +list = make_list(i);
>> +g_assert_cmpint(i, ==, QAPI_LIST_LENGTH(list));
>> +qapi_free_strList(list);
>> +}
>> +}
>> +
>> +struct {
>> +const char *string;
>> +const char *delim;
>> +const char *args[5];
>> +} list_data[] = {
>> +{ 0, ",", { 0 } },
>> +{ "", ",", { 0 } },
>> +{ "a", ",", { "a", 0 } },
>> +{ "a,b", ",", { "a", "b", 0 } },
>> +{ "a,b,c", ",", { "a", "b", "c", 0 } },
>> +{ "first last", " ", { "first", "last", 0 } },
>> +{ "a:", ":", { "a", "", 0 } },
>> +{ "a::b", ":", { "a", "", "b", 0 } },
>> +{ ":", ":", { "", "", 0 } },
>> +{ ":a", ":", { "", "a", 0 } },
>> +{ "::a", ":", { "", "", "a", 0 } },
>> +};
> 
> NULL instead of 0, please.

ok.

>> +
>> +static void test_strv(void)
>> +{
>> +int i, j;
>> +const char **expect;
>> +strList *list;
>> +GStrv args;
> 
> I'd prefer char **argv;

ok.

>> +
>> +for (i = 0; i < ARRAY_SIZE(list_data); i++) {
>> +expect = list_data[i].args;
>> +list = strList_from_string(list_data[i].string, list_data[i].delim);
>> +args = strv_from_strList(list);
>> +qapi_free_strList(list);
>> +for (j = 0; expect[j] && args[j]; j++) {
> 
> Loop stops when either array element is null.  That's wrong, we need to
> exhaust both arrays: || instead of &&.

|| is not safe.  After one array is exhausted, [j] will be out of bounds for
the other.  The g_assert_null calls guarantee the arrays are the same length
and all elements have been compared.

- Steve

>> +g_assert_cmpstr(expect[j], ==, args[j]);
>> +}
>> +g_assert_null(expect[j]);
>> +g_assert_null(args[j]);
>> +g_strfreev(args);
>> +}
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +g_test_init(, , NULL);
>> +g_test_add_func("/test-string/length", test_length);
>> +g_test_add_func("/test-string/strv", test_strv);
>> +return g_test_run();
>> +}
>

Re: [PATCH V4 1/5] util: strList_from_string

2024-02-21 Thread Steven Sistare

On 2/21/2024 8:29 AM, Markus Armbruster wrote:
> I apologize for the lateness of my review.

No problem.  Thanks for the review.

> Steve Sistare  writes:
> 
>> Generalize hmp_split_at_comma() to take any delimiter string, rename
>> as strList_from_string(), and move it to util/strList.c.
>>
>> No functional change.
>>
>> Signed-off-by: Steve Sistare 
> 
> I can't see an actual use of generalized delimiters outside tests in
> this series.  Do you have uses?

In this series, it is called from hmp_announce_self and stats_filter; 
those were formerly calls to hmp_split_at_comma.

In my live update cpr-exec series, there is an additional call site, with a
space delimiter instead of comma.  Live update V9 is posted but is old and 
obsolete.  
I will post V10 soon, but I am hoping you can pull this series first, so I can 
whittle down my pending patches and omit these from V10.

>> ---
>>  include/monitor/hmp.h  |  1 -
>>  include/qemu/strList.h | 24 
>>  monitor/hmp-cmds.c | 19 ---
>>  net/net-hmp-cmds.c |  3 ++-
>>  stats/stats-hmp-cmds.c |  3 ++-
>>  util/meson.build   |  1 +
>>  util/strList.c | 24 
>>  7 files changed, 53 insertions(+), 22 deletions(-)
>>  create mode 100644 include/qemu/strList.h
>>  create mode 100644 util/strList.c
>>
>> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
>> index 13f9a2d..2df661e 100644
>> --- a/include/monitor/hmp.h
>> +++ b/include/monitor/hmp.h
>> @@ -19,7 +19,6 @@
>>  
>>  bool hmp_handle_error(Monitor *mon, Error *err);
>>  void hmp_help_cmd(Monitor *mon, const char *name);
>> -strList *hmp_split_at_comma(const char *str);
>>  
>>  void hmp_info_name(Monitor *mon, const QDict *qdict);
>>  void hmp_info_version(Monitor *mon, const QDict *qdict);
>> diff --git a/include/qemu/strList.h b/include/qemu/strList.h
>> new file mode 100644
>> index 000..010237f
>> --- /dev/null
>> +++ b/include/qemu/strList.h
>> @@ -0,0 +1,24 @@
>> +/*
>> + * Copyright (c) 2022 - 2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef QEMU_STR_LIST_H
>> +#define QEMU_STR_LIST_H
>> +
>> +#include "qapi/qapi-builtin-types.h"
>> +
>> +/*
>> + * Break @in into a strList using the delimiter string @delim.
>> + * The delimiter is not included in the result.
>> + * Return NULL if @in is NULL or an empty string.
>> + * A leading, trailing, or consecutive delimiter produces an
>> + * empty string at that position in the output.
>> + * All strings are g_strdup'd, and the result can be freed
>> + * using qapi_free_strList.
>> + */
>> +strList *strList_from_string(const char *in, const char *delim);
> 
> The function name no longer tells us explicitly what the function does:
> splitting the string.

The first sentence does not say it?
  "Break @in into a strList using the delimiter string @delim"

Would you prefer this?
  "Split string @in into a strList using the delimiter string @delim"

- Steve

>> +#endif
>> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
>> index 871898a..66b68a0 100644
>> --- a/monitor/hmp-cmds.c
>> +++ b/monitor/hmp-cmds.c
>> @@ -38,25 +38,6 @@ bool hmp_handle_error(Monitor *mon, Error *err)
>>  return false;
>>  }
>>  
>> -/*
>> - * Split @str at comma.
>> - * A null @str defaults to "".
>> - */
>> -strList *hmp_split_at_comma(const char *str)
>> -{
>> -char **split = g_strsplit(str ?: "", ",", -1);
>> -strList *res = NULL;
>> -strList **tail = 
>> -int i;
>> -
>> -for (i = 0; split[i]; i++) {
>> -QAPI_LIST_APPEND(tail, split[i]);
>> -}
>> -
>> -g_free(split);
>> -return res;
>> -}
>> -
>>  void hmp_info_name(Monitor *mon, const QDict *qdict)
>>  {
>>  NameInfo *info;
>> diff --git a/net/net-hmp-cmds.c b/net/net-hmp-cmds.c
>> index 41d326b..e893801 100644
>> --- a/net/net-hmp-cmds.c
>> +++ b/net/net-hmp-cmds.c
>> @@ -26,6 +26,7 @@
>>  #include "qemu/config-file.h"
>>  #include "qemu/help_option.h"
>>  #include "qemu/option.h"
>> +#include "qemu/strList.h"
>>  
>>  void hmp_info_network(Monitor *mon, const QDict *qdict)
>>  {
>> @@ -72,7 +73,7 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
>>  migrate_announce_params());
>>  
>>  qapi_free_strList(params->interfaces);
>> -params->interfaces = hmp_split_at_comma(interfaces_str);
>> +params->interfaces = strList_from_string(interfaces_str, ",");
>>  params->has_interfaces = params->interfaces != NULL;
>>  params->id = g_strdup(id);
>>  qmp_announce_self(params, NULL);
>> diff --git a/stats/stats-hmp-cmds.c b/stats/stats-hmp-cmds.c
>> index 1f91bf8..428c0e6 100644
>> --- a/stats/stats-hmp-cmds.c
>> +++ b/stats/stats-hmp-cmds.c
>> @@ -10,6 +10,7 @@
>>  #include "monitor/hmp.h"
>>  #include "monitor/monitor.h"
>>  #include "qemu/cutils.h"
>> +#include

Re: [PATCH V4 5/5] migration: simplify exec migration functions

2024-02-21 Thread Steven Sistare

On 2/21/2024 10:54 AM, Fabiano Rosas wrote:
> Fabiano Rosas  writes:
> 
>> Steve Sistare  writes:
>>
>>> Simplify the exec migration code by using list utility functions.
>>>
>>> As a side effect, this also fixes a minor memory leak.  On function return,
>>> "g_auto(GStrv) argv" frees argv and each element, which is wrong, because
>>> the function does not own the individual elements.  To compensate, the code
>>> uses g_steal_pointer which NULLs argv and prevents the destructor from
>>> running, but argv is leaked.
>>>
>>> Fixes: cbab4face57b ("migration: convert exec backend ...")
>>> Signed-off-by: Steve Sistare 
>>
>> Reviewed-by: Fabiano Rosas 
> 
> You'll have to reintroduce the qemu/cutils.h include:
> 
> ../migration/exec.c: In function 'exec_get_cmd_path':
> ../migration/exec.c:37:5: error: implicit declaration of function 'pstrcat'; 
> did you mean 'strcat'? [-Werror=implicit-function-declaration]
>37 | pstrcat(detected_path, MAX_PATH, "\\cmd.exe");
>   | ^~~
>   | strcat
> ../migration/exec.c:37:5: error: nested extern declaration of 'pstrcat' 
> [-Werror=nested-externs]

Thanks, I will rebase to the tip and verify all is well before I post V5.

- Steve

Re: [PATCH V3 00/13] allow cpr-reboot for vfio

2024-02-20 Thread Steven Sistare

On 2/20/2024 2:49 AM, Peter Xu wrote:
> On Thu, Feb 08, 2024 at 10:53:53AM -0800, Steve Sistare wrote:
>> Allow cpr-reboot for vfio if the guest is in the suspended runstate.  The
>> guest drivers' suspend methods flush outstanding requests and re-initialize
>> the devices, and thus there is no device state to save and restore.  The
>> user is responsible for suspending the guest before initiating cpr, such as
>> by issuing guest-suspend-ram to the qemu guest agent.
>>
>> Most of the patches in this series enhance migration notifiers so they can
>> return an error status and message.  The last few patches register a notifier
>> for vfio that returns an error if the guest is not suspended.
>>
>> Changes in V3:
>>   * update to tip, add RB's
>>   * replace MigrationStatus with new enum MigrationEventType
>>   * simplify migrate_fd_connect error recovery
>>   * support vfio iommufd containers
>>   * add patches:
>>   migration: stop vm for cpr
>>   migration: update cpr-reboot description
> 
> This doesn't apply to master anymore, please rebase when repost, thanks.

Will do.  Before I do, any comments on "migration: update cpr-reboot 
description"?
After we converge on that short description, I will submit a longer treatment in
docs/devel/migration, which I see you have recently populated.

- Steve

Re: [PATCH V3 10/13] migration: stop vm for cpr

2024-02-20 Thread Steven Sistare

On 2/20/2024 2:33 AM, Peter Xu wrote:
> On Thu, Feb 08, 2024 at 10:54:03AM -0800, Steve Sistare wrote:
>> When migration for cpr is initiated, stop the vm and set state
>> RUN_STATE_FINISH_MIGRATE before ram is saved.  This eliminates the
>> possibility of ram and device state being out of sync, and guarantees
>> that a guest in the suspended state remains suspended, because qmp_cont
>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  include/migration/misc.h |  1 +
>>  migration/migration.c| 32 +---
>>  2 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 6dc234b..54c99a3 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -60,6 +60,7 @@ void migration_object_init(void);
>>  void migration_shutdown(void);
>>  bool migration_is_idle(void);
>>  bool migration_is_active(MigrationState *);
>> +bool migrate_mode_is_cpr(MigrationState *);
>>  
>>  typedef enum MigrationEventType {
>>  MIG_EVENT_PRECOPY_SETUP,
>> diff --git a/migration/migration.c b/migration/migration.c
>> index d1fce9e..fc5c587 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1603,6 +1603,11 @@ bool migration_is_active(MigrationState *s)
>>  s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>>  }
>>  
>> +bool migrate_mode_is_cpr(MigrationState *s)
>> +{
>> +return s->parameters.mode == MIG_MODE_CPR_REBOOT;
>> +}
>> +
>>  int migrate_init(MigrationState *s, Error **errp)
>>  {
>>  int ret;
>> @@ -2651,13 +2656,14 @@ static int 
>> migration_completion_precopy(MigrationState *s,
>>  bql_lock();
>>  migration_downtime_start(s);
>>  
>> -s->vm_old_state = runstate_get();
>> -global_state_store();
>> -
>> -ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> -trace_migration_completion_vm_stop(ret);
>> -if (ret < 0) {
>> -goto out_unlock;
>> +if (!migrate_mode_is_cpr(s)) {
>> +s->vm_old_state = runstate_get();
>> +global_state_store();
>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> +trace_migration_completion_vm_stop(ret);
>> +if (ret < 0) {
>> +goto out_unlock;
>> +}
>>  }
>>  
>>  ret = migration_maybe_pause(s, current_active_state,
>> @@ -3576,6 +3582,7 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  Error *local_err = NULL;
>>  uint64_t rate_limit;
>>  bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
>> +int ret;
>>  
>>  /*
>>   * If there's a previous error, free it and prepare for another one.
>> @@ -3651,6 +3658,17 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  goto fail;
>>  }
>>  
>> +if (migrate_mode_is_cpr(s)) {
>> +s->vm_old_state = runstate_get();
>> +global_state_store();
>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE);
>> +trace_migration_completion_vm_stop(ret);
>> +if (ret < 0) {
>> +error_setg(_err, "migration_stop_vm failed, error %d", 
>> -ret);
>> +goto fail;
>> +}
>> +}
> 
> Could we have a helper function for the shared codes?

Will do.

> How about postcopy?  I know it's nonsense to enable postcopy for cpr.. but
> iiuc we don't yet forbid an user doing so.  Maybe we should?

I will assert that mode != cpr in the postcopy path instead, to prevent any 
nonsense.

- Steve

>> +
>>  if (migrate_background_snapshot()) {
>>  qemu_thread_create(>thread, "bg_snapshot",
>>  bg_migration_thread, s, QEMU_THREAD_JOINABLE);
>> -- 
>> 1.8.3.1
>>
>

Re: [PATCH V3 09/13] migration: notifier error checking

2024-02-20 Thread Steven Sistare

On 2/20/2024 2:12 AM, Peter Xu wrote:
> On Thu, Feb 08, 2024 at 10:54:02AM -0800, Steve Sistare wrote:
>> Check the status returned by migration notifiers and report errors.
>> If notifiers fail, call the notifiers again so they can clean up.
>> None of the notifiers return an error status at this time.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  include/migration/misc.h |  3 ++-
>>  migration/migration.c| 40 +---
>>  2 files changed, 31 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 0ea1902..6dc234b 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -82,7 +82,8 @@ void migration_add_notifier(NotifierWithReturn *notify,
>>  void migration_add_notifier_mode(NotifierWithReturn *notify,
>>   MigrationNotifyFunc func, MigMode mode);
>>  void migration_remove_notifier(NotifierWithReturn *notify);
>> -void migration_call_notifiers(MigrationState *s, MigrationEventType type);
>> +int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>> + Error **errp);
>>  bool migration_in_setup(MigrationState *);
>>  bool migration_has_finished(MigrationState *);
>>  bool migration_has_failed(MigrationState *);
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 01d8867..d1fce9e 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1318,6 +1318,8 @@ void migrate_set_state(int *state, int old_state, int 
>> new_state)
>>  
>>  static void migrate_fd_cleanup(MigrationState *s)
>>  {
>> +Error *local_err = NULL;
>> +
>>  g_free(s->hostname);
>>  s->hostname = NULL;
>>  json_writer_free(s->vmdesc);
>> @@ -1362,13 +1364,23 @@ static void migrate_fd_cleanup(MigrationState *s)
>>MIGRATION_STATUS_CANCELLED);
>>  }
>>  
>> +if (!migration_has_failed(s) &&
>> +migration_call_notifiers(s, MIG_EVENT_PRECOPY_DONE, _err)) {
>> +
>> +migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED);
>> +migrate_set_error(s, local_err);
>> +error_free(local_err);
>> +}
>> +
>>  if (s->error) {
>>  /* It is used on info migrate.  We can't free it */
>>  error_report_err(error_copy(s->error));
>>  }
>> -migration_call_notifiers(s, s->state == MIGRATION_STATUS_COMPLETED ?
>> - MIG_EVENT_PRECOPY_DONE :
>> - MIG_EVENT_PRECOPY_FAILED);
>> +
>> +if (migration_has_failed(s)) {
>> +migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>> +}
> 
> AFAIU, the whole point of such split is, allowing DONE notifies to fail too
> and then if that happens we can invoke FAIL notifiers again.

Correct.

> 
> Perhaps we can avoid that complexity, but rather document only SETUP
> notifiers can fail?
> 
> The problem is that failing a notifier at this stage (if migration already
> finished) can already be too late; dest QEMU can already have started
> running, so no way to roll back.  We can document that, check and assert
> for !SETUP cases to make sure error is never hit?

Makes sense. I will modify the patch as you suggest.

- Steve

>> +
>>  block_cleanup_parameters();
>>  yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>>  }
>> @@ -1481,13 +1493,15 @@ void migration_remove_notifier(NotifierWithReturn 
>> *notify)
>>  }
>>  }
>>  
>> -void migration_call_notifiers(MigrationState *s, MigrationEventType type)
>> +int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>> + Error **errp)
>>  {
>>  MigMode mode = s->parameters.mode;
>>  MigrationEvent e;
>>  
>>  e.type = type;
>> -notifier_with_return_list_notify(_state_notifiers[mode], , 
>> 0);
>> +return 
>> notifier_with_return_list_notify(_state_notifiers[mode],
>> +, errp);
>>  }
>>  
>>  bool migration_in_setup(MigrationState *s)
>> @@ -2535,7 +2549,9 @@ static int postcopy_start(MigrationState *ms, Error 
>> **errp)
>>   * at the transition to postcopy and after the device state; in 
>> particular
>>   * spice needs to trigger a transition now
>>   */
>> -migration_call_notifiers(ms, MIG_EVENT_PRECOPY_DONE);
>> +if (migration_call_notifiers(ms, MIG_EVENT_PRECOPY_DONE, errp)) {
>> +goto fail;
>> +}
>>  
>>  migration_downtime_end(ms);
>>  
>> @@ -2555,11 +2571,10 @@ static int postcopy_start(MigrationState *ms, Error 
>> **errp)
>>  
>>  ret = qemu_file_get_error(ms->to_dst_file);
>>  if (ret) {
>> -error_setg(errp, "postcopy_start: Migration stream errored");
>> -migrate_set_state(>state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>> -  MIGRATION_STATUS_FAILED);
>> +error_setg_errno(errp, -ret, "postcopy_start: Migration stream 
>> error");
>> +

Re: [PATCH V3 09/13] migration: notifier error checking

2024-02-12 Thread Steven Sistare

On 2/12/2024 4:24 AM, David Hildenbrand wrote:
> On 08.02.24 19:54, Steve Sistare wrote:
>> Check the status returned by migration notifiers and report errors.
>> If notifiers fail, call the notifiers again so they can clean up.
> 
> IIUC, if any of the notifiers will actually start to fail, say, during 
> MIG_EVENT_PRECOPY_SETUP, you will call MIG_EVENT_PRECOPY_FAILED on all 
> notifiers.
> 
> That will include notifiers that have never seen a MIG_EVENT_PRECOPY_SETUP 
> call.

Correct.

> Is that what we expect notifiers to be able to deal with? Can we document 
> that?

The notifiers have always needed to handle failure without knowing the previous 
migration
states, and are robust about unwinding their own internal state.  I will 
document it.

Thanks for all the RBs.

- Steve

Re: [PATCH V2 11/11] vfio: allow cpr-reboot migration if suspended

2024-01-17 Thread Steven Sistare

On 1/17/2024 2:12 AM, Peter Xu wrote:
> On Tue, Jan 16, 2024 at 03:37:33PM -0500, Steven Sistare wrote:
>> On 1/15/2024 2:33 AM, Peter Xu wrote:
>>> On Fri, Jan 12, 2024 at 07:05:10AM -0800, Steve Sistare wrote:
>>>> Allow cpr-reboot for vfio if the guest is in the suspended runstate.  The
>>>> guest drivers' suspend methods flush outstanding requests and re-initialize
>>>> the devices, and thus there is no device state to save and restore.  The
>>>> user is responsible for suspending the guest before initiating cpr, such as
>>>> by issuing guest-suspend-ram to the qemu guest agent.
>>>>
>>>> Relax the vfio blocker so it does not apply to cpr, and add a notifier that
>>>> verifies the guest is suspended.  Skip dirty page tracking, which is N/A 
>>>> for
>>>> cpr, to avoid ioctl errors.
>>>>
>>>> Signed-off-by: Steve Sistare 
>>>> ---
>>>>  hw/vfio/common.c  |  2 +-
>>>>  hw/vfio/cpr.c | 20 
>>>>  hw/vfio/migration.c   |  2 +-
>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>  migration/ram.c   |  9 +
>>>>  5 files changed, 28 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 0b3352f..09af934 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -128,7 +128,7 @@ int vfio_block_multiple_devices_migration(VFIODevice 
>>>> *vbasedev, Error **errp)
>>>>  error_setg(_devices_migration_blocker,
>>>> "Multiple VFIO devices migration is supported only if all 
>>>> of "
>>>> "them support P2P migration");
>>>> -ret = migrate_add_blocker(_devices_migration_blocker, errp);
>>>> +ret = migrate_add_blocker_normal(_devices_migration_blocker, 
>>>> errp);
>>>>  
>>>>  return ret;
>>>>  }
>>>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>>>> index bbd1c7a..9f4b1fe 100644
>>>> --- a/hw/vfio/cpr.c
>>>> +++ b/hw/vfio/cpr.c
>>>> @@ -7,13 +7,33 @@
>>>>  
>>>>  #include "qemu/osdep.h"
>>>>  #include "hw/vfio/vfio-common.h"
>>>> +#include "migration/misc.h"
>>>>  #include "qapi/error.h"
>>>> +#include "sysemu/runstate.h"
>>>> +
>>>> +static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>>>> +MigrationEvent *e, Error **errp)
>>>> +{
>>>> +if (e->state == MIGRATION_STATUS_SETUP &&
>>>> +!runstate_check(RUN_STATE_SUSPENDED)) {
>>>> +
>>>> +error_setg(errp,
>>>> +"VFIO device only supports cpr-reboot for runstate 
>>>> suspended");
>>>> +
>>>> +return -1;
>>>> +}
>>>
>>> What happens if the guest is suspended during SETUP, but then quickly waked
>>> up before CPR migration completes?
>>
>> That is a management layer bug -- we told them the VM must be suspended.
>> However, I will reject a wakeup request if migration is active and mode is 
>> cpr.
> 
> Yes it needs to be enforced if it is required by cpr-reboot.  QEMU
> hopefully should never rely on mgmt app behavior.
> 
>>
>>>> +return 0;
>>>> +}
>>>>  
>>>>  int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
>>>>  {
>>>> +migration_add_notifier_mode(>cpr_reboot_notifier,
>>>> +vfio_cpr_reboot_notifier,
>>>> +MIG_MODE_CPR_REBOOT);
>>>>  return 0;
>>>>  }
>>>>  
>>>>  void vfio_cpr_unregister_container(VFIOContainer *container)
>>>>  {
>>>> +migration_remove_notifier(>cpr_reboot_notifier);
>>>>  }
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index 534fddf..488905d 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -895,7 +895,7 @@ static int vfio_block_migration(VFIODevice *vbasedev, 
>>>> Error *err, Error **errp)
>>>>  vbasedev->migration_blocker = error_copy(err);
>>>>  error

Re: [PATCH V2 11/11] vfio: allow cpr-reboot migration if suspended

2024-01-16 Thread Steven Sistare




On 1/16/2024 3:37 PM, Steven Sistare wrote:
> On 1/15/2024 2:33 AM, Peter Xu wrote:
>> On Fri, Jan 12, 2024 at 07:05:10AM -0800, Steve Sistare wrote:
>>> Allow cpr-reboot for vfio if the guest is in the suspended runstate.  The
>>> guest drivers' suspend methods flush outstanding requests and re-initialize
>>> the devices, and thus there is no device state to save and restore.  The
>>> user is responsible for suspending the guest before initiating cpr, such as
>>> by issuing guest-suspend-ram to the qemu guest agent.
>>>
>>> Relax the vfio blocker so it does not apply to cpr, and add a notifier that
>>> verifies the guest is suspended.  Skip dirty page tracking, which is N/A for
>>> cpr, to avoid ioctl errors.
>>>
>>> Signed-off-by: Steve Sistare 
>>> ---
>>>  hw/vfio/common.c  |  2 +-
>>>  hw/vfio/cpr.c | 20 
>>>  hw/vfio/migration.c   |  2 +-
>>>  include/hw/vfio/vfio-common.h |  1 +
>>>  migration/ram.c   |  9 +
>>>  5 files changed, 28 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 0b3352f..09af934 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -128,7 +128,7 @@ int vfio_block_multiple_devices_migration(VFIODevice 
>>> *vbasedev, Error **errp)
>>>  error_setg(_devices_migration_blocker,
>>> "Multiple VFIO devices migration is supported only if all 
>>> of "
>>> "them support P2P migration");
>>> -ret = migrate_add_blocker(_devices_migration_blocker, errp);
>>> +ret = migrate_add_blocker_normal(_devices_migration_blocker, 
>>> errp);
>>>  
>>>  return ret;
>>>  }
>>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>>> index bbd1c7a..9f4b1fe 100644
>>> --- a/hw/vfio/cpr.c
>>> +++ b/hw/vfio/cpr.c
>>> @@ -7,13 +7,33 @@
>>>  
>>>  #include "qemu/osdep.h"
>>>  #include "hw/vfio/vfio-common.h"
>>> +#include "migration/misc.h"
>>>  #include "qapi/error.h"
>>> +#include "sysemu/runstate.h"
>>> +
>>> +static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>>> +MigrationEvent *e, Error **errp)
>>> +{
>>> +if (e->state == MIGRATION_STATUS_SETUP &&
>>> +!runstate_check(RUN_STATE_SUSPENDED)) {
>>> +
>>> +error_setg(errp,
>>> +"VFIO device only supports cpr-reboot for runstate suspended");
>>> +
>>> +return -1;
>>> +}
>>
>> What happens if the guest is suspended during SETUP, but then quickly waked
>> up before CPR migration completes?
> 
> That is a management layer bug -- we told them the VM must be suspended.
> However, I will reject a wakeup request if migration is active and mode is 
> cpr.
> 
>>> +return 0;
>>> +}
>>>  
>>>  int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
>>>  {
>>> +migration_add_notifier_mode(>cpr_reboot_notifier,
>>> +vfio_cpr_reboot_notifier,
>>> +MIG_MODE_CPR_REBOOT);
>>>  return 0;
>>>  }
>>>  
>>>  void vfio_cpr_unregister_container(VFIOContainer *container)
>>>  {
>>> +migration_remove_notifier(>cpr_reboot_notifier);
>>>  }
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 534fddf..488905d 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -895,7 +895,7 @@ static int vfio_block_migration(VFIODevice *vbasedev, 
>>> Error *err, Error **errp)
>>>  vbasedev->migration_blocker = error_copy(err);
>>>  error_free(err);
>>>  
>>> -return migrate_add_blocker(>migration_blocker, errp);
>>> +return migrate_add_blocker_normal(>migration_blocker, errp);
>>>  }
>>>  
>>>  /* -- 
>>> */
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index 1add5b7..7a46e24 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -78,6 +78,7 @@ struct VFIOGroup

Re: [PATCH V2 11/11] vfio: allow cpr-reboot migration if suspended

2024-01-16 Thread Steven Sistare

On 1/15/2024 2:33 AM, Peter Xu wrote:
> On Fri, Jan 12, 2024 at 07:05:10AM -0800, Steve Sistare wrote:
>> Allow cpr-reboot for vfio if the guest is in the suspended runstate.  The
>> guest drivers' suspend methods flush outstanding requests and re-initialize
>> the devices, and thus there is no device state to save and restore.  The
>> user is responsible for suspending the guest before initiating cpr, such as
>> by issuing guest-suspend-ram to the qemu guest agent.
>>
>> Relax the vfio blocker so it does not apply to cpr, and add a notifier that
>> verifies the guest is suspended.  Skip dirty page tracking, which is N/A for
>> cpr, to avoid ioctl errors.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  hw/vfio/common.c  |  2 +-
>>  hw/vfio/cpr.c | 20 
>>  hw/vfio/migration.c   |  2 +-
>>  include/hw/vfio/vfio-common.h |  1 +
>>  migration/ram.c   |  9 +
>>  5 files changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 0b3352f..09af934 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -128,7 +128,7 @@ int vfio_block_multiple_devices_migration(VFIODevice 
>> *vbasedev, Error **errp)
>>  error_setg(_devices_migration_blocker,
>> "Multiple VFIO devices migration is supported only if all of 
>> "
>> "them support P2P migration");
>> -ret = migrate_add_blocker(_devices_migration_blocker, errp);
>> +ret = migrate_add_blocker_normal(_devices_migration_blocker, 
>> errp);
>>  
>>  return ret;
>>  }
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index bbd1c7a..9f4b1fe 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -7,13 +7,33 @@
>>  
>>  #include "qemu/osdep.h"
>>  #include "hw/vfio/vfio-common.h"
>> +#include "migration/misc.h"
>>  #include "qapi/error.h"
>> +#include "sysemu/runstate.h"
>> +
>> +static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>> +MigrationEvent *e, Error **errp)
>> +{
>> +if (e->state == MIGRATION_STATUS_SETUP &&
>> +!runstate_check(RUN_STATE_SUSPENDED)) {
>> +
>> +error_setg(errp,
>> +"VFIO device only supports cpr-reboot for runstate suspended");
>> +
>> +return -1;
>> +}
> 
> What happens if the guest is suspended during SETUP, but then quickly waked
> up before CPR migration completes?

That is a management layer bug -- we told them the VM must be suspended.
However, I will reject a wakeup request if migration is active and mode is cpr.

>> +return 0;
>> +}
>>  
>>  int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
>>  {
>> +migration_add_notifier_mode(>cpr_reboot_notifier,
>> +vfio_cpr_reboot_notifier,
>> +MIG_MODE_CPR_REBOOT);
>>  return 0;
>>  }
>>  
>>  void vfio_cpr_unregister_container(VFIOContainer *container)
>>  {
>> +migration_remove_notifier(>cpr_reboot_notifier);
>>  }
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 534fddf..488905d 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -895,7 +895,7 @@ static int vfio_block_migration(VFIODevice *vbasedev, 
>> Error *err, Error **errp)
>>  vbasedev->migration_blocker = error_copy(err);
>>  error_free(err);
>>  
>> -return migrate_add_blocker(>migration_blocker, errp);
>> +return migrate_add_blocker_normal(>migration_blocker, errp);
>>  }
>>  
>>  /* -- */
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 1add5b7..7a46e24 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -78,6 +78,7 @@ struct VFIOGroup;
>>  typedef struct VFIOContainer {
>>  VFIOContainerBase bcontainer;
>>  int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> +NotifierWithReturn cpr_reboot_notifier;
>>  unsigned iommu_type;
>>  QLIST_HEAD(, VFIOGroup) group_list;
>>  } VFIOContainer;
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 1923366..44ad324 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -2392,8 +2392,8 @@ static void ram_save_cleanup(void *opaque)
>>  RAMState **rsp = opaque;
>>  RAMBlock *block;
>>  
>> -/* We don't use dirty log with background snapshots */
>> -if (!migrate_background_snapshot()) {
>> +/* We don't use dirty log with background snapshots or cpr */
>> +if (!migrate_background_snapshot() && migrate_mode() == 
>> MIG_MODE_NORMAL) {
> 
> Same question here, on what happens if the user resumes the VM before
> migration completes?  IIUC shared-ram is not required, then it means if
> that happens the cpr migration image can contain corrupted data, and that
> may be a problem.
> 
> Background snapshot is special in that it relies on totally different
> tracking facilities

Re: [PATCH V2 04/11] migration: remove migration_in_postcopy parameter

2024-01-16 Thread Steven Sistare

On 1/15/2024 1:48 AM, Peter Xu wrote:
> On Fri, Jan 12, 2024 at 07:05:03AM -0800, Steve Sistare wrote:
>>  bool migration_in_incoming_postcopy(void)
>> diff --git a/ui/spice-core.c b/ui/spice-core.c
>> index b3cd229..e43a93f 100644
>> --- a/ui/spice-core.c
>> +++ b/ui/spice-core.c
>> @@ -580,7 +580,7 @@ static int migration_state_notifier(NotifierWithReturn 
>> *notifier,
>>  if (migration_in_setup(s)) {
>>  spice_server_migrate_start(spice_server);
>>  } else if (migration_has_finished(s) ||
>> -   migration_in_postcopy_after_devices(s)) {
>> +   migration_in_postcopy_after_devices()) {
> 
> This can be a reply also to your other email: my previous suggestion of
> using PRECOPY_DONE should apply here, where we can convert this chunk into:
> 
>   } else if (event == MIG_EVENT_PRECOPY_DONE) {...}
> 
> Because PRECOPY_DONE should also cover the notification from
> postcopy_start(), then we can drop migration_in_postcopy_after_devices()
> completely, I think.

Yes, that works.  I will define and use MIG_EVENT_PRECOPY_DONE and friends in 
"MigrationEvent for notifiers".

- Steve

Re: [PATCH V2 00/11] allow cpr-reboot for vfio

2024-01-16 Thread Steven Sistare

On 1/12/2024 4:38 PM, Alex Williamson wrote:
> On Fri, 12 Jan 2024 07:04:59 -0800
> Steve Sistare  wrote:
> 
>> Allow cpr-reboot for vfio if the guest is in the suspended runstate.  The
>> guest drivers' suspend methods flush outstanding requests and re-initialize
>> the devices, and thus there is no device state to save and restore.  The
>> user is responsible for suspending the guest before initiating cpr, such as
>> by issuing guest-suspend-ram to the qemu guest agent.
>>
>> Most of the patches in this series enhance migration notifiers so they can
>> return an error status and message.  The last two patches register a notifier
>> for vfio that returns an error if the guest is not suspended.
> 
> Hi Steve,
> 
> This appears to only support legacy container mode and not cdev/iommufd
> mode.  Shouldn't this support both?  Thanks,

My bad, thanks! I'll move cpr_reboot_notifier to the base container, and also 
call
cpr register/unregister from iommufd_cdev_attach/iommufd_cdev_detach.

- Steve

>> Steve Sistare (11):
>>   notify: pass error to notifier with return
>>   migration: remove error from notifier data
>>   migration: convert to NotifierWithReturn
>>   migration: remove migration_in_postcopy parameter
>>   migration: MigrationEvent for notifiers
>>   migration: MigrationNotifyFunc
>>   migration: per-mode notifiers
>>   migration: refactor migrate_fd_connect failures
>>   migration: notifier error checking
>>   vfio: register container for cpr
>>   vfio: allow cpr-reboot migration if suspended
>>
>>  hw/net/virtio-net.c|  14 ++---
>>  hw/vfio/common.c   |   2 +-
>>  hw/vfio/container.c|  11 +++-
>>  hw/vfio/cpr.c  |  39 ++
>>  hw/vfio/meson.build|   1 +
>>  hw/vfio/migration.c|  13 +++--
>>  hw/virtio/vhost-user.c |  10 ++--
>>  hw/virtio/virtio-balloon.c |   3 +-
>>  include/hw/vfio/vfio-common.h  |   6 ++-
>>  include/hw/virtio/virtio-net.h |   2 +-
>>  include/migration/misc.h   |  21 +---
>>  include/qemu/notify.h  |   7 ++-
>>  migration/migration.c  | 117 
>> +++--
>>  migration/postcopy-ram.c   |   3 +-
>>  migration/postcopy-ram.h   |   1 -
>>  migration/ram.c|  12 ++---
>>  net/vhost-vdpa.c   |  15 +++---
>>  ui/spice-core.c|  19 +++
>>  util/notify.c  |   5 +-
>>  19 files changed, 206 insertions(+), 95 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>

Re: [PATCH V2 03/11] migration: convert to NotifierWithReturn

2024-01-16 Thread Steven Sistare

On 1/15/2024 1:44 AM, Peter Xu wrote:
> On Fri, Jan 12, 2024 at 07:05:02AM -0800, Steve Sistare wrote:
>> Change all migration notifiers to type NotifierWithReturn, so notifiers
>> can return an error status in a future patch.  For now, pass NULL for the
>> notifier error parameter, and do not check the return value.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  hw/net/virtio-net.c|  4 +++-
>>  hw/vfio/migration.c|  4 +++-
>>  include/hw/vfio/vfio-common.h  |  2 +-
>>  include/hw/virtio/virtio-net.h |  2 +-
>>  include/migration/misc.h   |  6 +++---
>>  migration/migration.c  | 16 
>>  net/vhost-vdpa.c   |  6 --
>>  ui/spice-core.c|  8 +---
>>  8 files changed, 28 insertions(+), 20 deletions(-)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 7a2846f..9570353 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -3529,11 +3529,13 @@ static void 
>> virtio_net_handle_migration_primary(VirtIONet *n, MigrationState *s)
>>  }
>>  }
>>  
>> -static void virtio_net_migration_state_notifier(Notifier *notifier, void 
>> *data)
>> +static int virtio_net_migration_state_notifier(NotifierWithReturn *notifier,
>> +   void *data, Error **errp)
>>  {
>>  MigrationState *s = data;
>>  VirtIONet *n = container_of(notifier, VirtIONet, migration_state);
>>  virtio_net_handle_migration_primary(n, s);
>> +return 0;
>>  }
> 
> I should have mentioned this earlier.. we have a trend recently to modify
> retval to boolean when Error** existed, e.g.:
> 
> https://lore.kernel.org/all/20231017202633.296756-5-pet...@redhat.com/
> 
> Let's start using boolean too here?  Previous patches may need touch-ups
> too for this.
> 
> Other than that it all looks good here.  Thanks,

Boolean makes sense when there is only one way to recover from failure.
However, when the notifier may fail in different ways, and recovery differs
for each, then the function should return an int errno.  NotifierWithReturn
could have future uses that need multiple return values and multiple recovery 
paths above the notifier_with_return_list_notify level, so IMO the function 
should continue to return int for generality.

- Steve

Re: [PATCH V2 08/11] migration: refactor migrate_fd_connect failures

2024-01-16 Thread Steven Sistare

On 1/15/2024 2:37 AM, Peter Xu wrote:
> On Fri, Jan 12, 2024 at 07:05:07AM -0800, Steve Sistare wrote:
>> Move common code for the error path in migrate_fd_connect to a shared
>> fail label.  No functional change.
>>
>> Signed-off-by: Steve Sistare 
> 
> Reviewed-by: Peter Xu 
> 
> One nitpick below:
> 
>> ---
>>  migration/migration.c | 22 +++---
>>  1 file changed, 11 insertions(+), 11 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index e9914aa..c828ba7 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -3549,6 +3549,7 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  Error *local_err = NULL;
>>  uint64_t rate_limit;
>>  bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
>> +int expect_state = s->state;
> 
> IMHO we can drop this.
> 
>>  
>>  /*
>>   * If there's a previous error, free it and prepare for another one.
>> @@ -3603,11 +3604,7 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  if (migrate_postcopy_ram() || migrate_return_path()) {
>>  if (open_return_path_on_source(s)) {
>>  error_setg(_err, "Unable to open return-path for 
>> postcopy");
>> -migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED);
>> -migrate_set_error(s, local_err);
>> -error_report_err(local_err);
>> -migrate_fd_cleanup(s);
>> -return;
>> +goto fail;
>>  }
>>  }
>>  
>> @@ -3629,12 +3626,8 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  }
>>  
>>  if (multifd_save_setup(_err) != 0) {
>> -migrate_set_error(s, local_err);
>> -error_report_err(local_err);
>> -migrate_set_state(>state, MIGRATION_STATUS_SETUP,
>> -  MIGRATION_STATUS_FAILED);
>> -migrate_fd_cleanup(s);
>> -return;
>> +expect_state = MIGRATION_STATUS_SETUP;
> 
> Then drop this.
> 
>> +goto fail;
>>  }
>>  
>>  if (migrate_background_snapshot()) {
>> @@ -3645,6 +3638,13 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  migration_thread, s, QEMU_THREAD_JOINABLE);
>>  }
>>  s->migration_thread_running = true;
>> +return;
>> +
>> +fail:
>> +migrate_set_error(s, local_err);
>> +migrate_set_state(>state, expect_state, MIGRATION_STATUS_FAILED);
> 
> Then constantly use s->state.

Yes, if you are OK with it, I also prefer to simplify here.

- Steve

>> +error_report_err(local_err);
>> +migrate_fd_cleanup(s);
>>  }
>>  
>>  static void migration_class_init(ObjectClass *klass, void *data)
>> -- 
>> 1.8.3.1
>>
>

Re: [PATCH V2 05/11] migration: MigrationEvent for notifiers

2024-01-12 Thread Steven Sistare

On 1/12/2024 10:05 AM, Steve Sistare wrote:
> Passing MigrationState to notifiers is unsound because they could access
> unstable migration state internals or even modify the state.  Instead, pass
> the minimal info needed in a new MigrationEvent struct, which could be
> extended in the future if needed.
> 
> Suggested-by: Peter Xu 
> Signed-off-by: Steve Sistare 
> ---
>  [...]
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index dcc98bb..0b4ce0f 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -60,6 +60,11 @@ void migration_object_init(void);
>  void migration_shutdown(void);
>  bool migration_is_idle(void);
>  bool migration_is_active(MigrationState *);
> +
> +typedef struct MigrationEvent {
> +MigrationStatus state;
> +} MigrationEvent;

Hi Peter, I chose to pass MigrationStatus rather than define a new enum and map
MigrationStatus to it.  IMO a new enum adds little value, yet it is more code 
and
another layer of abstraction for coders to grok.  

- Steve

Re: [PATCH V1 2/3] migration: notifier error reporting

2024-01-11 Thread Steven Sistare

On 1/10/2024 9:16 PM, Peter Xu wrote:
> On Wed, Jan 10, 2024 at 01:08:41PM -0500, Steven Sistare wrote:
>> On 1/10/2024 2:18 AM, Peter Xu wrote:
>>> On Wed, Dec 13, 2023 at 10:11:32AM -0800, Steve Sistare wrote:
>>>> After calling notifiers, check if an error has been reported via
>>>> migrate_set_error, and halt the migration.
>>>>
>>>> None of the notifiers call migrate_set_error at this time, so no
>>>> functional change.
>>>>
>>>> Signed-off-by: Steve Sistare 
>>>> ---
>>>>  include/migration/misc.h |  2 +-
>>>>  migration/migration.c| 26 ++
>>>>  2 files changed, 23 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>>>> index 901d117..231d7e4 100644
>>>> --- a/include/migration/misc.h
>>>> +++ b/include/migration/misc.h
>>>> @@ -65,7 +65,7 @@ MigMode migrate_mode_of(MigrationState *);
>>>>  void migration_add_notifier(Notifier *notify,
>>>>  void (*func)(Notifier *notifier, void *data));
>>>>  void migration_remove_notifier(Notifier *notify);
>>>> -void migration_call_notifiers(MigrationState *s);
>>>> +int migration_call_notifiers(MigrationState *s);
>>>>  bool migration_in_setup(MigrationState *);
>>>>  bool migration_has_finished(MigrationState *);
>>>>  bool migration_has_failed(MigrationState *);
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index d5bfe70..29a9a92 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -1280,6 +1280,8 @@ void migrate_set_state(int *state, int old_state, 
>>>> int new_state)
>>>>  
>>>>  static void migrate_fd_cleanup(MigrationState *s)
>>>>  {
>>>> +bool already_failed;
>>>> +
>>>>  qemu_bh_delete(s->cleanup_bh);
>>>>  s->cleanup_bh = NULL;
>>>>  
>>>> @@ -1327,11 +1329,20 @@ static void migrate_fd_cleanup(MigrationState *s)
>>>>MIGRATION_STATUS_CANCELLED);
>>>>  }
>>>>  
>>>> +already_failed = migration_has_failed(s);
>>>> +if (migration_call_notifiers(s)) {
>>>> +if (!already_failed) {
>>>> +migrate_set_state(>state, s->state, 
>>>> MIGRATION_STATUS_FAILED);
>>>> +/* Notify again to recover from this late failure. */
>>>> +migration_call_notifiers(s);
>>>> +}
>>>> +}
>>>> +
>>>>  if (s->error) {
>>>>  /* It is used on info migrate.  We can't free it */
>>>>  error_report_err(error_copy(s->error));
>>>>  }
>>>> -migration_call_notifiers(s);
>>>> +
>>>>  block_cleanup_parameters();
>>>>  yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>>>>  }
>>>> @@ -1450,9 +1461,10 @@ void migration_remove_notifier(Notifier *notify)
>>>>  }
>>>>  }
>>>>  
>>>> -void migration_call_notifiers(MigrationState *s)
>>>> +int migration_call_notifiers(MigrationState *s)
>>>>  {
>>>>  notifier_list_notify(_state_notifiers, s);
>>>> +return (s->error != NULL);
>>>
>>> Exporting more migration_*() functions is pretty ugly to me..
>>
>> I assume you mean migrate_set_error(), which is currently only called from
>> migration/*.c code.
>>
>> Instead, we could define a new function migrate_set_notifier_error(), defined
>> in the new file migration/notifier.h, so we clearly limit the migration 
>> functions which can be called from notifiers.  (Its implementation just calls
>> migrate_set_error)
> 
> Fundementally this allows another .c to change one more field of
> MigrationState (which is ->error) and I still want to avoid it.
> 
> I just replied in the other thread, but now with all these in mind I think
> I still prefer not passing in MigrationState* at all.  It's already kind of
> abused due to migrate_get_current(), and IMHO it's healthier to limit its
> usage to minimum to cover the core of migration states for migration/ use
> only.
> 
> Shrinking or even stop exporting migrate_get_current() is another more
> challenging task, but now what we can do is stop enlarging the direct use
> of Migrat

Re: [PATCH V1 2/3] migration: notifier error reporting

2024-01-10 Thread Steven Sistare

On 1/10/2024 2:18 AM, Peter Xu wrote:
> On Wed, Dec 13, 2023 at 10:11:32AM -0800, Steve Sistare wrote:
>> After calling notifiers, check if an error has been reported via
>> migrate_set_error, and halt the migration.
>>
>> None of the notifiers call migrate_set_error at this time, so no
>> functional change.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  include/migration/misc.h |  2 +-
>>  migration/migration.c| 26 ++
>>  2 files changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/migration/misc.h b/include/migration/misc.h
>> index 901d117..231d7e4 100644
>> --- a/include/migration/misc.h
>> +++ b/include/migration/misc.h
>> @@ -65,7 +65,7 @@ MigMode migrate_mode_of(MigrationState *);
>>  void migration_add_notifier(Notifier *notify,
>>  void (*func)(Notifier *notifier, void *data));
>>  void migration_remove_notifier(Notifier *notify);
>> -void migration_call_notifiers(MigrationState *s);
>> +int migration_call_notifiers(MigrationState *s);
>>  bool migration_in_setup(MigrationState *);
>>  bool migration_has_finished(MigrationState *);
>>  bool migration_has_failed(MigrationState *);
>> diff --git a/migration/migration.c b/migration/migration.c
>> index d5bfe70..29a9a92 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1280,6 +1280,8 @@ void migrate_set_state(int *state, int old_state, int 
>> new_state)
>>  
>>  static void migrate_fd_cleanup(MigrationState *s)
>>  {
>> +bool already_failed;
>> +
>>  qemu_bh_delete(s->cleanup_bh);
>>  s->cleanup_bh = NULL;
>>  
>> @@ -1327,11 +1329,20 @@ static void migrate_fd_cleanup(MigrationState *s)
>>MIGRATION_STATUS_CANCELLED);
>>  }
>>  
>> +already_failed = migration_has_failed(s);
>> +if (migration_call_notifiers(s)) {
>> +if (!already_failed) {
>> +migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED);
>> +/* Notify again to recover from this late failure. */
>> +migration_call_notifiers(s);
>> +}
>> +}
>> +
>>  if (s->error) {
>>  /* It is used on info migrate.  We can't free it */
>>  error_report_err(error_copy(s->error));
>>  }
>> -migration_call_notifiers(s);
>> +
>>  block_cleanup_parameters();
>>  yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>>  }
>> @@ -1450,9 +1461,10 @@ void migration_remove_notifier(Notifier *notify)
>>  }
>>  }
>>  
>> -void migration_call_notifiers(MigrationState *s)
>> +int migration_call_notifiers(MigrationState *s)
>>  {
>>  notifier_list_notify(_state_notifiers, s);
>> +return (s->error != NULL);
> 
> Exporting more migration_*() functions is pretty ugly to me..

I assume you mean migrate_set_error(), which is currently only called from
migration/*.c code.

Instead, we could define a new function migrate_set_notifier_error(), defined
in the new file migration/notifier.h, so we clearly limit the migration 
functions which can be called from notifiers.  (Its implementation just calls
migrate_set_error)

> Would it be better to pass in "Error** errp" into each notifiers?  That may
> need an open coded notifier_list_notify(), breaking the loop if "*errp".
> 
> And the notifier API currently only support one arg..  maybe we should
> implement the notifiers ourselves, ideally passing in "(int state, Error
> **errp)" instead of "(MigrationState *s)".
> 
> Ideally with that MigrationState* shouldn't be visible outside migration/.

I will regret saying this because of the amount of (mechanical) code change 
involved,
but the cleanest solution is:

* Pass errp to: 
  notifier_with_return_list_notify(NotifierWithReturnList *list, void *data, 
Error *errp)
* Pass errp to the NotifierWithReturn notifier:
  int (*notify)(NotifierWithReturn *notifier, void *data, Error **errp);
* Delete the errp member from struct PostcopyNotifyData and pass errp to the 
notifier function
  Ditto for PrecopyNotifyData.
* Convert all migration notifiers to NotifierWithReturn

- Steve

>>  }
>>  
>>  bool migration_in_setup(MigrationState *s)
>> @@ -2520,7 +2532,9 @@ static int postcopy_start(MigrationState *ms, Error 
>> **errp)
>>   * spice needs to trigger a transition now
>>   */
>>  ms->postcopy_after_devices = true;
>> -migration_call_notifiers(ms);
>> +if (migration_call_notifiers(ms)) {
>> +goto fail;
>> +}
>>  
>>  migration_downtime_end(ms);
>>  
>> @@ -3589,7 +3603,11 @@ void migrate_fd_connect(MigrationState *s, Error 
>> *error_in)
>>  rate_limit = migrate_max_bandwidth();
>>  
>>  /* Notify before starting migration thread */
>> -migration_call_notifiers(s);
>> +if (migration_call_notifiers(s)) {
>> +migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED);
>> +migrate_fd_cleanup(s);
>> +return;
>> +}
>>  }
>>  
>>  migration_rate_set(rate_limit);
>> -- 
>>

Re: [PATCH V1 1/3] migration: check mode in notifiers

2024-01-10 Thread Steven Sistare

On 1/10/2024 2:09 AM, Peter Xu wrote:
> On Wed, Dec 13, 2023 at 10:11:31AM -0800, Steve Sistare wrote:
>> The existing notifiers should only apply to normal mode.
>>
>> No functional change.
> 
> Instead of adding such check in every notifier, why not make CPR a separate
> list of notifiers?  Just like the blocker lists.

Sure.   I proposed minimal changes in this current series, but extending the 
api to take migration mode would be nicer.

> Aside of this patch, I just started to look at this "notifier" code, I
> really don't think we should pass in MigrationState* into the notifiers.
> IIUC we only need the "state" as an enum.  Then with two separate
> registers, the device code knows the migration mode.
> 
> What do you think?

If we pass state, the notifier must either compare to enum values such as
MIGRATION_STATUS_COMPLETED instead of calling migration_has_finished(s), or
we must define new accessors such as migration_state_is_finished(state).

IMO passing MigrationState is the best approach.
MigrationState is an incomplete type in most notifiers, and the client can
pass it to a limited set of accessors to get more information -- exactly what 
we want to hide migration internals.  However, we could further limit the
allowed accessors, eg move these to a new file "include/migration/notifier.h".

#include "qemu/notify.h"
void migration_add_notifier(Notifier *notify,
void (*func)(Notifier *notifier, void *data));
void migration_remove_notifier(Notifier *notify);
bool migration_is_active(MigrationState *);
bool migration_in_setup(MigrationState *);
bool migration_has_finished(MigrationState *);
bool migration_has_failed(MigrationState *);
---

- Steve

Re: [PATCH V6 03/14] cpus: stop vm in suspended runstate

2024-01-03 Thread Steven Sistare

On 12/23/2023 12:41 AM, Markus Armbruster wrote:
> Steven Sistare  writes:
> 
>> On 12/22/2023 7:20 AM, Markus Armbruster wrote:
>>> Steve Sistare  writes:
>>>
>>>> Currently, a vm in the suspended state is not completely stopped.  The 
>>>> VCPUs
>>>> have been paused, but the cpu clock still runs, and runstate notifiers for
>>>> the transition to stopped have not been called.  This causes problems for
>>>> live migration.  Stale cpu timers_state is saved to the migration stream,
>>>> causing time errors in the guest when it wakes from suspend, and state that
>>>> would have been modified by runstate notifiers is wrong.
>>>>
>>>> Modify vm_stop to completely stop the vm if the current state is suspended,
>>>> transition to RUN_STATE_PAUSED, and remember that the machine was 
>>>> suspended.
>>>> Modify vm_start to restore the suspended state.
>>>
>>> Can you explain this to me in terms of the @current_run_state state
>>> machine?  Like
>>>
>>> Before the patch, trigger X in state Y goes to state Z.
>>> Afterwards, it goes to ...
>>
>> Old behavior:
>>   RUN_STATE_SUSPENDED --> stop --> RUN_STATE_SUSPENDED
>>
>> New behavior:
>>   RUN_STATE_SUSPENDED --> stop --> RUN_STATE_PAUSED
>>   RUN_STATE_PAUSED--> cont --> RUN_STATE_SUSPENDED
> 
> This clarifies things quite a bit for me.  Maybe work it into the commit
> message?

Will do.

>>>> This affects all callers of vm_stop and vm_start, notably, the qapi stop 
>>>> and
>>>> cont commands.  For example:
>>>>
>>>> (qemu) info status
>>>> VM status: paused (suspended)
>>>>
>>>> (qemu) stop
>>>> (qemu) info status
>>>> VM status: paused
>>>>
>>>> (qemu) cont
>>>> (qemu) info status
>>>> VM status: paused (suspended)
>>>>
>>>> (qemu) system_wakeup
>>>> (qemu) info status
>>>> VM status: running
>>>>
>>>> Suggested-by: Peter Xu 
>>>> Signed-off-by: Steve Sistare 
>>>> ---
>>>>  include/sysemu/runstate.h |  5 +
>>>>  qapi/misc.json| 10 --
>>>>  system/cpus.c | 19 ++-
>>>>  system/runstate.c |  3 +++
>>>>  4 files changed, 30 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
>>>> index f6a337b..1d6828f 100644
>>>> --- a/include/sysemu/runstate.h
>>>> +++ b/include/sysemu/runstate.h
>>>> @@ -40,6 +40,11 @@ static inline bool 
>>>> shutdown_caused_by_guest(ShutdownCause cause)
>>>>  return cause >= SHUTDOWN_CAUSE_GUEST_SHUTDOWN;
>>>>  }
>>>>  
>>>> +static inline bool runstate_is_started(RunState state)
>>>> +{
>>>> +return state == RUN_STATE_RUNNING || state == RUN_STATE_SUSPENDED;
>>>> +}
>>>> +
>>>>  void vm_start(void);
>>>>  
>>>>  /**
>>>> diff --git a/qapi/misc.json b/qapi/misc.json
>>>> index cda2eff..efb8d44 100644
>>>> --- a/qapi/misc.json
>>>> +++ b/qapi/misc.json
>>>> @@ -134,7 +134,7 @@
>>>>  ##
>>>>  # @stop:
>>>>  #
>>>> -# Stop all guest VCPU execution.
>>>> +# Stop all guest VCPU and VM execution.
>>>
>>> Doesn't "stop all VM execution" imply "stop all guest vCPU execution"?
>>
>> Agreed, so we simply have:
>>
>> # @stop:
>> # Stop guest VM execution.
>>
>> # @cont:
>> # Resume guest VM execution.
> 
> Yes, please.

Will do.

>>>>  #
>>>>  # Since: 0.14
>>>>  #
>>>> @@ -143,6 +143,9 @@
>>># Notes: This function will succeed even if the guest is already in
>>># the stopped state.  In "inmigrate" state, it will ensure that
>>>>  # the guest remains paused once migration finishes, as if the -S
>>>>  # option was passed on the command line.
>>>>  #
>>>> +# In the "suspended" state, it will completely stop the VM and
>>>> +# cause a transition to the "paused" state. (Since 9.0)
>>>> +#
>>>
>>> What user-observable (with que

Re: [PATCH V6 03/14] cpus: stop vm in suspended runstate

2024-01-03 Thread Steven Sistare

On 1/3/2024 8:09 AM, Peter Xu wrote:
> Steven,
> 
> The discussions seem to still apply to the latest.  Do you plan to post a
> new version, with everything covered?

Yes, today, thanks - steve

Re: [PATCH V6 03/14] cpus: stop vm in suspended runstate

2023-12-22 Thread Steven Sistare

On 12/22/2023 7:20 AM, Markus Armbruster wrote:
> Steve Sistare  writes:
> 
>> Currently, a vm in the suspended state is not completely stopped.  The VCPUs
>> have been paused, but the cpu clock still runs, and runstate notifiers for
>> the transition to stopped have not been called.  This causes problems for
>> live migration.  Stale cpu timers_state is saved to the migration stream,
>> causing time errors in the guest when it wakes from suspend, and state that
>> would have been modified by runstate notifiers is wrong.
>>
>> Modify vm_stop to completely stop the vm if the current state is suspended,
>> transition to RUN_STATE_PAUSED, and remember that the machine was suspended.
>> Modify vm_start to restore the suspended state.
> 
> Can you explain this to me in terms of the @current_run_state state
> machine?  Like
> 
> Before the patch, trigger X in state Y goes to state Z.
> Afterwards, it goes to ...

Old behavior:
  RUN_STATE_SUSPENDED --> stop --> RUN_STATE_SUSPENDED

New behavior:
  RUN_STATE_SUSPENDED --> stop --> RUN_STATE_PAUSED
  RUN_STATE_PAUSED--> cont --> RUN_STATE_SUSPENDED

>> This affects all callers of vm_stop and vm_start, notably, the qapi stop and
>> cont commands.  For example:
>>
>> (qemu) info status
>> VM status: paused (suspended)
>>
>> (qemu) stop
>> (qemu) info status
>> VM status: paused
>>
>> (qemu) cont
>> (qemu) info status
>> VM status: paused (suspended)
>>
>> (qemu) system_wakeup
>> (qemu) info status
>> VM status: running
>>
>> Suggested-by: Peter Xu 
>> Signed-off-by: Steve Sistare 
>> ---
>>  include/sysemu/runstate.h |  5 +
>>  qapi/misc.json| 10 --
>>  system/cpus.c | 19 ++-
>>  system/runstate.c |  3 +++
>>  4 files changed, 30 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
>> index f6a337b..1d6828f 100644
>> --- a/include/sysemu/runstate.h
>> +++ b/include/sysemu/runstate.h
>> @@ -40,6 +40,11 @@ static inline bool shutdown_caused_by_guest(ShutdownCause 
>> cause)
>>  return cause >= SHUTDOWN_CAUSE_GUEST_SHUTDOWN;
>>  }
>>  
>> +static inline bool runstate_is_started(RunState state)
>> +{
>> +return state == RUN_STATE_RUNNING || state == RUN_STATE_SUSPENDED;
>> +}
>> +
>>  void vm_start(void);
>>  
>>  /**
>> diff --git a/qapi/misc.json b/qapi/misc.json
>> index cda2eff..efb8d44 100644
>> --- a/qapi/misc.json
>> +++ b/qapi/misc.json
>> @@ -134,7 +134,7 @@
>>  ##
>>  # @stop:
>>  #
>> -# Stop all guest VCPU execution.
>> +# Stop all guest VCPU and VM execution.
> 
> Doesn't "stop all VM execution" imply "stop all guest vCPU execution"?

Agreed, so we simply have:

# @stop:
# Stop guest VM execution.

# @cont:
# Resume guest VM execution.

>>  #
>>  # Since: 0.14
>>  #
>> @@ -143,6 +143,9 @@
># Notes: This function will succeed even if the guest is already in
># the stopped state.  In "inmigrate" state, it will ensure that
>>  # the guest remains paused once migration finishes, as if the -S
>>  # option was passed on the command line.
>>  #
>> +# In the "suspended" state, it will completely stop the VM and
>> +# cause a transition to the "paused" state. (Since 9.0)
>> +#
> 
> What user-observable (with query-status) state transitions are possible
> here?

{"status": "suspended", "singlestep": false, "running": false}
--> stop -->
{"status": "paused", "singlestep": false, "running": false}

>>  # Example:
>>  #
>>  # -> { "execute": "stop" }
>> @@ -153,7 +156,7 @@
>>  ##
>>  # @cont:
>>  #
>> -# Resume guest VCPU execution.
>> +# Resume guest VCPU and VM execution.
>>  #
>>  # Since: 0.14
>>  #
>> @@ -165,6 +168,9 @@
># Returns: If successful, nothing
>#
># Notes: This command will succeed if the guest is currently running.
># It will also succeed if the guest is in the "inmigrate" state;
># in this case, the effect of the command is to make sure the
>>  # guest starts once migration finishes, removing the effect of the
>>  # -S command line option if it was passed.
>>  #
>> +# If the VM was previously suspended, and not been reset or woken,
>> +# this command will transition back to the "suspended" state. (Since 
>> 9.0)
> 
> Long line.

It fits in 80 columns, but perhaps this looks nicer:

# If the VM was previously suspended, and not been reset or woken,
# this command will transition back to the "suspended" state.
# (Since 9.0)

> What user-observable state transitions are possible here?

{"status": "paused", "singlestep": false, "running": false}
--> cont -->
{"status": "suspended", "singlestep": false, "running": false}

>> +#
>>  # Example:
>>  #
>>  # -> { "execute": "cont" }
> 
> Should we update documentation of query-status, too?

IMO no. The new behavior changes the status/RunState field only, and the
domain of values does not change, only the transitions caused by the commands
described here.

Re: [PATCH V8 00/12] fix migration of suspended runstate

2023-12-20 Thread Steven Sistare

On 12/20/2023 9:52 AM, Anthony PERARD wrote:
> On Mon, Dec 18, 2023 at 01:14:51PM +0800, Peter Xu wrote:
>> On Wed, Dec 13, 2023 at 10:35:33AM -0500, Steven Sistare wrote:
>>> Hi Peter, all have RB's, with all i's dotted and t's crossed - steve
>>
>> Yes this seems to be more migration related so maybe good candidate for a
>> pull from migration submodule.
>>
>> But since this is still solving a generic issue, I'm copying a few more
>> people from get_maintainers.pl that this series touches, just in case
>> they'll have something to say before dev cycle starts.
> 
> I did a quick smoke test of migrating a Xen guest. It works fine for me.

Thanks very much, I appreciate it - steve

1 2 3 4 >

1 - 100 of 380 matches

Mail list logo