Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
On 7/19/2024 12:28 PM, Peter Xu wrote: On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote: For new cpr modes, ramblock_is_ignored will always be true, because the memory is preserved in place rather than copied. However, for an ignored block, parse_ramblock currently requires that the received address of the block must match the address of the statically initialized region on the target. This fails for a PCI rom block, because the memory region address is set when the guest writes to a BAR on the source, which does not occur on the target, causing a "Mismatched GPAs" error during cpr migration. Is this a common fix with/without cpr mode? It does not occur during normal migration. It looks to me mr->addr (for these ROMs) should only be set in PCI config region updates as you mentioned. But then I didn't figure out when they're updated on dest in live migration: the ramblock info was sent at the beginning of migration, so it doesn't even have PCI config space migrated; I thought the real mr->addr should be in there. I also failed to understand yet on why the mr->addr check needs to be done by ignore-shared only. Some explanation would be greatly helpful around this area.. I will continue this thread later and explain more fully. - Steve
Re: [PATCH V2 00/11] Live update: cpr-exec
On 7/18/2024 11:56 AM, Peter Xu wrote: Steve, On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote: What? Thanks for trying out with the cpr-transfer series. I saw that that series missed most of the cc list here, so I'm attaching the link here: https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sist...@oracle.com I think most of my previous questions for exec() solution still are there, I'll try to summarize them all in this reply as much as I can. This patch series adds the live migration cpr-exec mode, which allows the user to update QEMU with minimal guest pause time, by preserving guest RAM in place, albeit with new virtual addresses in new QEMU, and by preserving device file descriptors. The new user-visible interfaces are: * cpr-exec (MigMode migration parameter) * cpr-exec-command (migration parameter) I really, really hope we can avoid this.. It's super cumbersome to pass in a qemu cmdline in a qemu migration parameter.. if we can do that with generic live migration ways, I hope we stick with the clean approach. This is no different than live migration, requiring a management agent to launch target qemu with all the arguments use to start source QEMU. Now that same agent will send the arguments via cpr-exec-command. * anon-alloc (command-line option for -machine) Igor questioned this, and I second his opinion.. We can leave the discussion there for this one. Continued on the other thread. The user sets the mode parameter before invoking the migrate command. In this mode, the user issues the migrate command to old QEMU, which stops the VM and saves state to the migration channels. Old QEMU then exec's new QEMU, replacing the original process while retaining its PID. The user specifies the command to exec new QEMU in the migration parameter cpr-exec-command. The command must pass all old QEMU arguments to new QEMU, plus the -incoming option. Execution resumes in new QEMU. Memory-backend objects must have the share=on attribute, but memory-backend-epc is not supported. The VM must be started with the '-machine anon-alloc=memfd' option, which allows anonymous memory to be transferred in place to the new process. Why? This mode has less impact on the guest than any other method of updating in place. So I wonder whether there's comparison between exec() and transfer mode that you recently proposed. Not yet, but I will measure it. I'm asking because exec() (besides all the rest of things that I dislike on it in this approach..) should be simply slower, logically, due to the serialized operation to (1) tearing down the old mm, (2) reload the new ELF, then (3) runs through the QEMU init process. If with a generic migration solution, the dest QEMU can start running (2+3) concurrently without even need to run (1). In this whole process, I doubt (2) could be relatively fast, (3) I donno, maybe it could be slow but I never measured; Paolo may have good idea as I know he used to work on qboot. We'll see, but in any case these take < 100 msec, which is a wonderfully short pause time unless your customer is doing high speed stock trading. If cpr-transfer is faster still, that's gravy, but cpr-exec is still great. For (1), I also doubt in your test cases it's fast, but it may not always be fast. Consider the guest has a huge TBs of shared mem, even if the memory will be completely shared between src/dst QEMUs, the pgtable won't! It means if the TBs are mapped in PAGE_SIZE tearing down the src QEMU pgtable alone can even take time, and that will be accounted in step (1) and further in exec() request. Yes, there is an O(n) effect here, but it is a fast O(n) when the memory is backed by huge pages. In UEK, we make it faster still by unmapping in parallel with multiple threads. I don't have the data handy but can share after running some experiments. Regardless, this time is negligible for small and medium size guests, which form the majority of instances in a cloud. All these fuss will be avoided if you use a generic live migration model like cpr-transfer you proposed. That's also cleaner. The pause time is much lower, because devices need not be torn down and recreated, DMA does not need to be drained and quiesced, and minimal state is copied to new QEMU. Further, there are no constraints on the guest. By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram, and suspending plus resuming vfio devices adds multiple seconds to the guest pause time. Lastly, there is no loss of connectivity to the guest, because chardev descriptors remain open and connected. Again, I raised the question on why this would matter, as after all mgmt app will need to coop with reconnections due to the fact they'll need to support a generic live migration, in which case reconnection is a must. So far it doesn't sound like a performance critical path, for example, to do the mgmt reconnects on the ports. So this might be an
Re: [PATCH V2 01/11] machine: alloc-anon option
On 7/17/2024 3:24 PM, Peter Xu wrote: On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote: On Sun, 30 Jun 2024 12:40:24 -0700 Steve Sistare wrote: Allocate anonymous memory using mmap MAP_ANON or memfd_create depending on the value of the anon-alloc machine property. This affects memory-backend-ram objects, guest RAM created with the global -m option but without an associated memory-backend object and without the -mem-path option nowadays, all machines were converted to use memory backend for VM RAM. so -m option implicitly creates memory-backend object, which will be either MEMORY_BACKEND_FILE if -mem-path present or MEMORY_BACKEND_RAM otherwise. To access the same memory in the old and new QEMU processes, the memory must be mapped shared. Therefore, the implementation always sets RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the user must explicitly specify the share option. In lieu of defining a new so statement at the top that memory-backend-ram is affected is not really valid? RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1 as the condition for calling memfd_create. In general I do dislike adding yet another option that will affect guest RAM allocation (memory-backends should be sufficient). I shared the same concern when reviewing the previous version, and I keep having so. However I do see that you need memfd for device memory (vram, roms, ...). Can we just use memfd/shared unconditionally for those and avoid introducing a new confusing option? ROMs should be fine IIUC, as they shouldn't be large, and they can be migrated normally (because they're not DMA target from VFIO assigned devices). IOW, per my understanding what must be shared via memfd is writable memories that can be DMAed from a VFIO device. I raised such question on whether / why vram can be a DMA target, but I didn't get a response. So I would like to redo this comment: I think we should figure out what is missing when we switch all backends to use -object, rather than adding this flag easily. When added, we should be crystal clear on which RAM region will be applicable by this flag. All RAM regions that are mapped by the guest are registered for vfio DMA by a memory listener and could potentially be DMA'd, either read or written. That is defined by the architecture. We are not allowed to make value judgements and decide to not support the architecture for some segments such as ROM. Alex Williamson, any comment here? - Steve
Re: [PATCH V2 01/11] machine: alloc-anon option
On 7/16/2024 5:19 AM, Igor Mammedov wrote: On Sun, 30 Jun 2024 12:40:24 -0700 Steve Sistare wrote: Allocate anonymous memory using mmap MAP_ANON or memfd_create depending on the value of the anon-alloc machine property. This affects memory-backend-ram objects, guest RAM created with the global -m option but without an associated memory-backend object and without the -mem-path option nowadays, all machines were converted to use memory backend for VM RAM. so -m option implicitly creates memory-backend object, which will be either MEMORY_BACKEND_FILE if -mem-path present or MEMORY_BACKEND_RAM otherwise. Yes. I dropped an an important adjective, "implicit". "guest RAM created with the global -m option but without an explicit associated memory-backend object and without the -mem-path option" To access the same memory in the old and new QEMU processes, the memory must be mapped shared. Therefore, the implementation always sets RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the user must explicitly specify the share option. In lieu of defining a new so statement at the top that memory-backend-ram is affected is not really valid? memory-backend-ram is affected by alloc-anon. But in addition, the user must explicitly add the "share" option. I don't implicitly set share in this case, because I would be overriding the user's specification of the memory object's property, which would be private if omitted. RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1 as the condition for calling memfd_create. In general I do dislike adding yet another option that will affect guest RAM allocation (memory-backends should be sufficient). However I do see that you need memfd for device memory (vram, roms, ...). Can we just use memfd/shared unconditionally for those and avoid introducing a new confusing option? The Linux kernel has different tunables for backing memfd's with huge pages, so we could hurt performance if we unconditionally change to memfd. The user should have a choice for any segment that is large enough for huge pages to improve performance, which potentially is any memory-backend-object. The non memory-backend objects are small, and it would be OK to use memfd unconditionally for them. - Steve
Re: [RFC V1 0/6] Live update: cpr-transfer
On 7/18/2024 11:36 AM, Peter Xu wrote: On Sun, Jun 30, 2024 at 12:44:02PM -0700, Steve Sistare wrote: What? This patch series adds the live migration cpr-transfer mode, which allows the user to transfer a guest to a new QEMU instance on the same host. It is identical to cpr-exec in most respects, except as described below. I definitely prefer this one more than the exec solution, thanks for trying this out. It's a matter of whether we'll need both, my answer would be no.. The new user-visible interfaces are: * cpr-transfer (MigMode migration parameter) * cpr-uri (migration parameter) I wonder whether this parameter can be avoided already, maybe we can let cpr-transfer depend on unix socket in -incoming, then integrate fd sharing in the same channel? You saw the answer in another thread, but I repeat it here for others benefit: "CPR state cannot be sent over the normal migration channel, because devices and backends are created prior to reading the channel, so this mode sends CPR state over a second migration channel that is not visible to the user. New QEMU reads the second channel prior to creating devices or backends." - Steve * cpr-uri (command-line argument) In this mode, the user starts new QEMU on the same host as old QEMU, with the same arguments as old QEMU, plus the -incoming and the -cpr-uri options. The user issues the migrate command to old QEMU, which stops the VM, saves state to the migration channels, and enters the postmigrate state. Execution resumes in new QEMU. This mode requires a second migration channel, specified by the cpr-uri migration property on the outgoing side, and by the cpr-uri QEMU command-line option on the incoming side. The channel must be a type, such as unix socket, that supports SCM_RIGHTS. This series depends on the series "Live update: cpr-exec mode". Why? cpr-transfer offers the same benefits as cpr-exec mode, but with a model for launching new QEMU that may be more natural for some management packages. How? The file descriptors are kept open by sending them to new QEMU via the cpr-uri, which must support SCM_RIGHTS. Example: In this example, we simply restart the same version of QEMU, but in a real scenario one would use a new QEMU binary path in terminal 2. Terminal 1: start old QEMU # qemu-kvm -monitor stdio -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G -machine anon-alloc=memfd ... Terminal 2: start new QEMU # qemu-kvm ... -incoming unix:vm.sock -cpr-uri unix:cpr.sock Terminal 1: QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running (qemu) migrate_set_parameter mode cpr-transfer (qemu) migrate_set_parameter cpr-uri unix:cpr.sock (qemu) migrate -d unix:vm.sock (qemu) info status VM status: paused (postmigrate) Terminal 2: QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running Steve Sistare (6): migration: SCM_RIGHTS for QEMUFile migration: VMSTATE_FD migration: cpr-transfer save and load migration: cpr-uri parameter migration: cpr-uri option migration: cpr-transfer mode include/migration/cpr.h| 4 ++ include/migration/vmstate.h| 9 + migration/cpr-transfer.c | 81 + migration/cpr.c| 16 +++- migration/meson.build | 1 + migration/migration-hmp-cmds.c | 10 + migration/migration.c | 37 +++ migration/options.c| 29 +++ migration/options.h| 1 + migration/qemu-file.c | 83 -- migration/qemu-file.h | 2 + migration/ram.c| 1 + migration/trace-events | 2 + migration/vmstate-types.c | 33 + qapi/migration.json| 38 ++- qemu-options.hx| 8 stubs/vmstate.c| 7 system/vl.c| 3 ++ 18 files changed, 359 insertions(+), 6 deletions(-) create mode 100644 migration/cpr-transfer.c -- 1.8.3.1
Re: [PATCH V2 04/11] migration: stop vm earlier for cpr
On 7/17/2024 2:59 PM, Fabiano Rosas wrote: Steve Sistare writes: Stop the vm earlier for cpr, to guarantee consistent device state when CPR state is saved. Signed-off-by: Steve Sistare --- migration/migration.c | 22 +- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 0f47765..8a8e927 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -2077,6 +2077,7 @@ void qmp_migrate(const char *uri, bool has_channels, MigrationState *s = migrate_get_current(); g_autoptr(MigrationChannel) channel = NULL; MigrationAddress *addr = NULL; +bool stopped = false; /* * Having preliminary checks for uri and channel @@ -2120,6 +2121,15 @@ void qmp_migrate(const char *uri, bool has_channels, } } +if (migrate_mode_is_cpr(s)) { +int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); +if (ret < 0) { +error_setg(_err, "migration_stop_vm failed, error %d", -ret); +goto out; +} +stopped = true; +} + if (cpr_state_save(_err)) { goto out; } @@ -2155,6 +2165,9 @@ out: } migrate_fd_error(s, local_err); error_propagate(errp, local_err); +if (stopped && runstate_is_live(s->vm_old_state)) { +vm_start(); +} What about non-live states? Shouldn't this be: if (stopped) { vm_resume(); } Not quite. vm_old_state may be a stopped state, so we don't want to resume. However, I should probably restore the old stopped state here. I'll try some more error recovery scenarios. - Steve return; } } @@ -3738,7 +3751,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) Error *local_err = NULL; uint64_t rate_limit; bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP); -int ret; /* * If there's a previous error, free it and prepare for another one. @@ -3810,14 +3822,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) return; } -if (migrate_mode_is_cpr(s)) { -ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE); -if (ret < 0) { -error_setg(_err, "migration_stop_vm failed, error %d", -ret); -goto fail; -} -} - if (migrate_background_snapshot()) { qemu_thread_create(>thread, "mig/snapshot", bg_migration_thread, s, QEMU_THREAD_JOINABLE);
Re: [PATCH V2 02/11] migration: cpr-state
On 7/19/2024 11:03 AM, Peter Xu wrote: On Sun, Jun 30, 2024 at 12:40:25PM -0700, Steve Sistare wrote: CPR must save state that is needed after QEMU is restarted, when devices are realized. Thus the extra state cannot be saved in the migration stream, as objects must already exist before that stream can be loaded. Instead, define auxilliary state structures and vmstate descriptions, not associated with any registered object, and serialize the aux state to a cpr-specific stream in cpr_state_save. Deserialize in cpr_state_load after QEMU restarts, before devices are realized. Provide accessors for clients to register file descriptors for saving. The mechanism for passing the fd's to the new process will be specific to each migration mode, and added in subsequent patches. Signed-off-by: Steve Sistare --- include/migration/cpr.h | 21 ++ migration/cpr.c | 188 migration/meson.build | 1 + migration/migration.c | 6 ++ migration/trace-events | 5 ++ system/vl.c | 3 + 6 files changed, 224 insertions(+) create mode 100644 include/migration/cpr.h create mode 100644 migration/cpr.c diff --git a/include/migration/cpr.h b/include/migration/cpr.h new file mode 100644 index 000..8e7e705 --- /dev/null +++ b/include/migration/cpr.h @@ -0,0 +1,21 @@ +/* + * Copyright (c) 2021, 2024 Oracle and/or its affiliates. + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#ifndef MIGRATION_CPR_H +#define MIGRATION_CPR_H + +typedef int (*cpr_walk_fd_cb)(int fd); +void cpr_save_fd(const char *name, int id, int fd); +void cpr_delete_fd(const char *name, int id); +int cpr_find_fd(const char *name, int id); +int cpr_walk_fd(cpr_walk_fd_cb cb); +void cpr_resave_fd(const char *name, int id, int fd); + +int cpr_state_save(Error **errp); +int cpr_state_load(Error **errp); + +#endif diff --git a/migration/cpr.c b/migration/cpr.c new file mode 100644 index 000..313e74e --- /dev/null +++ b/migration/cpr.c @@ -0,0 +1,188 @@ +/* + * Copyright (c) 2021-2024 Oracle and/or its affiliates. + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "migration/cpr.h" +#include "migration/misc.h" +#include "migration/qemu-file.h" +#include "migration/savevm.h" +#include "migration/vmstate.h" +#include "sysemu/runstate.h" +#include "trace.h" + +/*/ +/* cpr state container for all information to be saved. */ + +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList; + +typedef struct CprState { +CprFdList fds; +} CprState; + +static CprState cpr_state; + +// + +typedef struct CprFd { +char *name; +unsigned int namelen; +int id; +int fd; [1] +QLIST_ENTRY(CprFd) next; +} CprFd; + +static const VMStateDescription vmstate_cpr_fd = { +.name = "cpr fd", +.version_id = 1, +.minimum_version_id = 1, +.fields = (VMStateField[]) { +VMSTATE_UINT32(namelen, CprFd), +VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen), +VMSTATE_INT32(id, CprFd), +VMSTATE_INT32(fd, CprFd), +VMSTATE_END_OF_LIST() +} +}; + +void cpr_save_fd(const char *name, int id, int fd) +{ +CprFd *elem = g_new0(CprFd, 1); + +trace_cpr_save_fd(name, id, fd); +elem->name = g_strdup(name); +elem->namelen = strlen(name) + 1; +elem->id = id; +elem->fd = fd; +QLIST_INSERT_HEAD(_state.fds, elem, next); +} + +static CprFd *find_fd(CprFdList *head, const char *name, int id) +{ +CprFd *elem; + +QLIST_FOREACH(elem, head, next) { +if (!strcmp(elem->name, name) && elem->id == id) { +return elem; +} +} +return NULL; +} + +void cpr_delete_fd(const char *name, int id) +{ +CprFd *elem = find_fd(_state.fds, name, id); + +if (elem) { +QLIST_REMOVE(elem, next); +g_free(elem->name); +g_free(elem); +} + +trace_cpr_delete_fd(name, id); +} + +int cpr_find_fd(const char *name, int id) +{ +CprFd *elem = find_fd(_state.fds, name, id); +int fd = elem ? elem->fd : -1; + +trace_cpr_find_fd(name, id, fd); +return fd; +} + +int cpr_walk_fd(cpr_walk_fd_cb cb) +{ +CprFd *elem; + +QLIST_FOREACH(elem, _state.fds, next) { +if (elem->fd >= 0 && cb(elem->fd)) { +return 1; +} +} +return 0; +} + +void cpr_resave_fd(const char *name, int id, int fd) +{ +CprFd *elem = find_fd(_state.fds, name, id); +int old_fd = elem ? elem->fd : -1; + +if (old_fd < 0) { +cpr_save_fd(name, id, fd); I don't think I know well on when old_fd<0 would happen yet, as this series doesn't look like
Re: [PATCH V2 03/11] migration: save cpr mode
On 7/17/2024 2:39 PM, Fabiano Rosas wrote: Steve Sistare writes: Save the mode in CPR state, so the user does not need to explicitly specify it for the target. Modify migrate_mode() so it returns the incoming mode on the target. Signed-off-by: Steve Sistare --- include/migration/cpr.h | 7 +++ migration/cpr.c | 23 ++- migration/migration.c | 1 + migration/options.c | 9 +++-- 4 files changed, 37 insertions(+), 3 deletions(-) diff --git a/include/migration/cpr.h b/include/migration/cpr.h index 8e7e705..42b4019 100644 --- a/include/migration/cpr.h +++ b/include/migration/cpr.h @@ -8,6 +8,13 @@ #ifndef MIGRATION_CPR_H #define MIGRATION_CPR_H +#include "qapi/qapi-types-migration.h" + +#define MIG_MODE_NONE MIG_MODE__MAX What happens when a QEMU that knows about a new mode migrates into a QEMU that doesn't know that mode, i.e. sees it as MIG_MODE__MAX? I'd just use -1. Good idea, thanks - steve + +MigMode cpr_get_incoming_mode(void); +void cpr_set_incoming_mode(MigMode mode); + typedef int (*cpr_walk_fd_cb)(int fd); void cpr_save_fd(const char *name, int id, int fd); void cpr_delete_fd(const char *name, int id); diff --git a/migration/cpr.c b/migration/cpr.c index 313e74e..1c296c6 100644 --- a/migration/cpr.c +++ b/migration/cpr.c @@ -21,10 +21,23 @@ typedef QLIST_HEAD(CprFdList, CprFd) CprFdList; typedef struct CprState { +MigMode mode; CprFdList fds; } CprState; -static CprState cpr_state; +static CprState cpr_state = { +.mode = MIG_MODE_NONE, +}; + +MigMode cpr_get_incoming_mode(void) +{ +return cpr_state.mode; +} + +void cpr_set_incoming_mode(MigMode mode) +{ +cpr_state.mode = mode; +} // @@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd) /*/ #define CPR_STATE "CprState" +static int cpr_state_presave(void *opaque) +{ +cpr_state.mode = migrate_mode(); +return 0; +} + static const VMStateDescription vmstate_cpr_state = { .name = CPR_STATE, .version_id = 1, .minimum_version_id = 1, +.pre_save = cpr_state_presave, .fields = (VMStateField[]) { +VMSTATE_UINT32(mode, CprState), VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next), VMSTATE_END_OF_LIST() } diff --git a/migration/migration.c b/migration/migration.c index e394ad7..0f47765 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -411,6 +411,7 @@ void migration_incoming_state_destroy(void) mis->postcopy_qemufile_dst = NULL; } +cpr_set_incoming_mode(MIG_MODE_NONE); yank_unregister_instance(MIGRATION_YANK_INSTANCE); } diff --git a/migration/options.c b/migration/options.c index 645f550..305397a 100644 --- a/migration/options.c +++ b/migration/options.c @@ -22,6 +22,7 @@ #include "qapi/qmp/qnull.h" #include "sysemu/runstate.h" #include "migration/colo.h" +#include "migration/cpr.h" #include "migration/misc.h" #include "migration.h" #include "migration-stats.h" @@ -758,8 +759,12 @@ uint64_t migrate_max_postcopy_bandwidth(void) MigMode migrate_mode(void) { -MigrationState *s = migrate_get_current(); -MigMode mode = s->parameters.mode; +MigMode mode = cpr_get_incoming_mode(); + +if (mode == MIG_MODE_NONE) { +MigrationState *s = migrate_get_current(); +mode = s->parameters.mode; +} assert(mode >= 0 && mode < MIG_MODE__MAX); return mode;
Re: [PATCH V2 01/11] machine: alloc-anon option
On 7/17/2024 3:24 PM, Peter Xu wrote: [...] PS to Steve: and I think I left tons of other comments in previous version outside this patch too, but I don't think they're fully discussed when this series was sent. I can re-read the series again, but I don't think it'll work out if we keep skipping discussions.. Hi Peter, let me address this part first, because I don't want you to think that I ignored your questions and concerns. This V2 series tries to address them. The change log was intended to be my response, rather than responding to each open question individually, but let me try again here with more detail. I apologize if I don't summarize your concerns correctly or completely. issue: discomfort with exec. why is it needed? response: exec is just a transport mechanism to send fd's to new qemu. I refactored to separate core patches from exec-specific patches, submitted cpr-transfer patches to illustrate a non-exec method, and provided reasons why one vs the other would be desirable in the commit messages and cover letter. issue: why do we need to preserve the ramblock fields and make them available prior to object creation? response. we don't need to preserve all of them, and we only need fd prior to object creation, so I deleted the precreate, factory, and named object patches, and added CprState to preserve fd's. used_length arrives in the normal migration stream. max_length is recovered from the mfd using lseek. issue: the series is too large, with too much change. response: in addition to the deletions mentioned above, I simplified the functionality and tossed out style patches and nice-to-haves, so we can focus on core functionality. V2 is much smaller. issue: memfd_create option is oddly expressed and hard to understand. response: I redefined the option, deleted all the stylistic ramblock patches to lay its workings bare, and explicitly documented its affect on all types of memory in the commit messages and qapi documentation. issue: no need to preserve blocks like ROM for DMA (with memfd_create). Blocks that must be preserved should be surfaced to the user as objects. response: I disagree, and will continue that conversation in this email thread. issue: how will vfio be handled? response: I submitted the vfio patches (non-trivial, because first I had to rework them without using precreate vmstate). issue: how will fd's be preserved for chardevs? response: via cpr_save_fd, CprState, and cpr_load_fd at device creation time, in each device's creation function, just like vfio. Those primitives are defined in this V2 series. - Steve
Re: [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma)
On 7/9/2024 4:58 PM, Steve Sistare wrote: Enable vfio-pci devices to be saved and restored across a cpr-exec of qemu. At vfio creation time, save the value of vfio container, group, and device descriptors in CPR state. In the container pre_save handler, suspend the use of virtual addresses in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped at a different VA after exec. DMA to already-mapped pages continues. Save the msi message area as part of vfio-pci vmstate, and save the interrupt and notifier eventfd's in vmstate. On qemu restart, vfio_realize() finds the saved descriptors, uses the descriptors, and notes that the device is being reused. Device and iommu state is already configured, so operations in vfio_realize that would modify the configuration are skipped for a reused device, including vfio ioctl's and writes to PCI configuration space. Vfio PCI device reset is also suppressed. The result is that vfio_realize constructs qemu data structures that reflect the current state of the device. However, the reconstruction is not complete until migrate_incoming is called. migrate_incoming loads the msi data, the vfio post_load handler finds eventfds in CPR state, rebuilds vector data structures, and attaches the interrupts to the new KVM instance. The container post_load handler then invokes the main vfio listener callback, which walks the flattened ranges of the vfio address space and calls VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's. Lastly, migration resumes the VM. This functionality is delivered by 3 patches for clarity. Part 1 handles device file descriptors and DMA. Part 2 adds eventfd and MSI/MSI-X vector support. Part 3 adds INTX support. [...] diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c new file mode 100644 index 000..bc51ebe --- /dev/null +++ b/hw/vfio/cpr-legacy.c [...] + +bool vfio_legacy_cpr_register_container(VFIOContainerBase *bcontainer, +Error **errp) +{ +VFIOContainer *container = VFIO_CONTAINER(bcontainer); + +if (!vfio_can_cpr_exec(container, >cpr_blocker)) { +return migrate_add_blocker_modes(>cpr_blocker, errp, + MIG_MODE_CPR_EXEC, -1); This is a bug. With the change in cpr_register return type to bool, this should be: return migrate_add_blocker_modes(...) == 0; - Steve
Re: [PATCH V1 4/8] vfio-pci: cpr part 1 (fd and dma)
On 7/10/2024 4:03 PM, Alex Williamson wrote: On Tue, 9 Jul 2024 13:58:53 -0700 Steve Sistare wrote: Enable vfio-pci devices to be saved and restored across a cpr-exec of qemu. At vfio creation time, save the value of vfio container, group, and device descriptors in CPR state. In the container pre_save handler, suspend the use of virtual addresses in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped at a different VA after exec. DMA to already-mapped pages continues. Save the msi message area as part of vfio-pci vmstate, and save the interrupt and notifier eventfd's in vmstate. On qemu restart, vfio_realize() finds the saved descriptors, uses the descriptors, and notes that the device is being reused. Device and iommu state is already configured, so operations in vfio_realize that would modify the configuration are skipped for a reused device, including vfio ioctl's and writes to PCI configuration space. Vfio PCI device reset is also suppressed. The result is that vfio_realize constructs qemu data structures that reflect the current state of the device. However, the reconstruction is not complete until migrate_incoming is called. migrate_incoming loads the msi data, the vfio post_load handler finds eventfds in CPR state, rebuilds vector data structures, and attaches the interrupts to the new KVM instance. The container post_load handler then invokes the main vfio listener callback, which walks the flattened ranges of the vfio address space and calls VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's. Lastly, migration resumes the VM. Hi Steve, What's the iommufd plan for cpr? Thanks, I am working on vdpa and iommufd as we speak, with working prototypes for both. I plan to submit the kernel and qemu RFC for vdpa next week, followed by vacation, and iommfd in the weeks after that. - Steve
Re: [PATCH V1 17/26] machine: memfd-alloc option
On 6/3/2024 6:17 AM, Daniel P. Berrangé wrote: On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote: On 5/28/2024 5:12 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:26AM -0700, Steve Sistare wrote: Allocate anonymous memory using memfd_create if the memfd-alloc machine option is set. Signed-off-by: Steve Sistare --- hw/core/machine.c | 22 ++ include/hw/boards.h | 1 + qemu-options.hx | 6 ++ system/memory.c | 9 ++--- system/physmem.c| 18 +- system/trace-events | 1 + 6 files changed, 53 insertions(+), 4 deletions(-) diff --git a/qemu-options.hx b/qemu-options.hx index cf61f6b..f0dfda5 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ "vmport=on|off|auto controls emulation of vmport (default: auto)\n" "dump-guest-core=on|off include guest memory in a core dump (default=on)\n" "mem-merge=on|off controls memory merge support (default: on)\n" +"memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n" "aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n" "dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n" "suppress-vmdesc=on|off disables self-describing migration (default=off)\n" @@ -79,6 +80,11 @@ SRST supported by the host, de-duplicates identical memory pages among VMs instances (enabled by default). +``memfd-alloc=on|off`` +Enables or disables allocation of anonymous guest RAM using +memfd_create. Any associated memory-backend objects are created with +share=on. The memfd-alloc default is off. + ``aes-key-wrap=on|off`` Enables or disables AES key wrapping support on s390-ccw hosts. This feature controls whether AES wrapping keys will be created diff --git a/system/memory.c b/system/memory.c index 49f1cb2..ca04a0e 100644 --- a/system/memory.c +++ b/system/memory.c @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr, uint64_t size, Error **errp) { +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0; If there's a machine option to "use memfd for allocations", then it's shared mem... Hmm.. It is a bit confusing to me in quite a few levels: - Why memory allocation method will be defined by a machine property, even if we have memory-backend-* which should cover everything? Some memory regions are implicitly created, and have no explicit representation on the qemu command line. memfd-alloc affects those. More generally, memfd-alloc affects all ramblock allocations that are not explicitly represented by memory-backend object. Thus the simple command line "qemu -m 1G" does not explicitly describe an object, so it goes through the anonymous allocation path, and is affected by memfd-alloc. Internally, create_default_memdev does create a memory-backend object. That is what my doc comment above refers to: Any associated memory-backend objects are created with share=on An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc. The qapi comments in patch "migration: cpr-exec mode" attempt to say all that: +# Memory backend objects must have the share=on attribute, and +# must be mmap'able in the new QEMU process. For example, +# memory-backend-file is acceptable, but memory-backend-ram is +# not. +# +# The VM must be started with the '-machine memfd-alloc=on' +# option. This causes implicit ram blocks -- those not explicitly +# described by a memory-backend object -- to be allocated by +# mmap'ing a memfd. Examples include VGA, ROM, and even guest +# RAM when it is specified without a memory-backend object. - Even if we have such a machine property, why setting "memfd" will always imply shared? why not private? After all it's not called "memfd-shared-alloc", and we can create private mappings using e.g. memory-backend-memfd,share=off. There is no use case for memfd-alloc with share=off, so no point IMO in making the option more verbose. For cpr, the mapping with all its modifications must be visible to new qemu when qemu mmaps it. So IIUC, cpr doesn't care about the use of 'memfd' as the specific impl, it only cares that the memory is share=on. Rather than having a machine type option "memfd-alloc" which is named after a Linux specific impl detail, how about having a machine type option "mem-share=on", which just happens to trig
Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
On 5/30/2024 2:39 PM, Peter Xu wrote: On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote: On 5/29/2024 3:25 PM, Peter Xu wrote: On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote: On 5/28/2024 5:44 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote: Preserve fields of RAMBlocks that allocate their host memory during CPR so the RAM allocation can be recovered. This sentence itself did not explain much, IMHO. QEMU can share memory using fd based memory already of all kinds, as long as the memory backend is path-based it can be shared by sharing the same paths to dst. This reads very confusing as a generic concept. I mean, QEMU migration relies on so many things to work right. We mostly asks the users to "use exactly the same cmdline for src/dst QEMU unless you know what you're doing", otherwise many things can break. That should also include ramblock being matched between src/dst due to the same cmdlines provided on both sides. It'll be confusing to mention this when we thought the ramblocks also rely on that fact. So IIUC this sentence should be dropped in the real patch, and I'll try to guess the real reason with below.. The properties of the implicitly created ramblocks must be preserved. The defaults can and do change between qemu releases, even when the command-line parameters do not change for the explicit objects that cause these implicit ramblocks to be created. AFAIU, QEMU relies on ramblocks to be the same before this series. Do you have an example? Would that already cause issue when migrate? Alignment has changed, and used_length vs max_length changed when resizeable ramblocks were introduced. I have dealt with these issues while supporting cpr for our internal use, and the learned lesson is to explicitly communicate the creation-time parameters to new qemu. Why used_length can change? I'm looking at ram_mig_ram_block_resized(): if (!migration_is_idle()) { /* * Precopy code on the source cannot deal with the size of RAM blocks * changing at random points in time - especially after sending the * RAM block sizes in the migration stream, they must no longer change. * Abort and indicate a proper reason. */ error_setg(, "RAM block '%s' resized during precopy.", rb->idstr); migration_cancel(err); error_free(err); } We sent used_length upfront of a migration during SETUP phase. Looks like what you're describing can be something different, though? I was imprecise. used_length did not change; it was introduced as being different than max_length when resizeable ramblocks were introduced. The max_length is not sent. It is an implicit property of the implementation, and can change. It is the size of the memfd mapping, so we need to know it and preserve it. used_length is indeed sent during SETUP. We could also send max_length at that time, and store both in the struct ramblock, and *maybe* that would be safe, but that is more fragile and less future proof than setting both properties to the correct value when the ramblock struct is created. And BTW, the ramblock properties are sent using ad-hoc code in setup. I send them using nice clean vmstate. Regarding to rb->align: isn't that mostly a constant, reflecting the MR's alignment? It's set when ramblock is created IIUC: rb->align = mr->align; When will the alignment change? The alignment specified by the mr to allocate a new block is an implicit property of the implementation, and has changed before, from one qemu release to another. Not often, but it did, and could again in the future. Communicating the alignment from old qemu to new qemu is future proof. These are not an issue for migration because the ramblock is re-created and the data copied into the new memory. Mirror the mr->align field in the RAMBlock to simplify the vmstate. Preserve the old host address, even though it is immediately discarded, as it will be needed in the future for CPR with iommufd. Preserve guest_memfd, even though CPR does not yet support it, to maintain vmstate compatibility when it becomes supported. .. It could be about the vfio vaddr update feature that you mentioned and only for iommufd (as IIUC vfio still relies on iova ranges, then it won't help here)? If so, IMHO we should have this patch (or any variance form) to be there for your upcoming vfio support. Keeping this around like this will make the series harder to review. Or is it needed even before VFIO? This patch is needed independently of vfio or iommufd. guest_memfd is independent of vfio or iommufd. It is a recent addition which I have not tried to support, but I added this placeholder field to it can be supported in the future without adding a new field later and maintaining backwards compatibility. Is guest_memfd the only user so far, then? If so, would it be
Re: [PATCH V1 17/26] machine: memfd-alloc option
On 5/30/2024 2:14 PM, Peter Xu wrote: On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote: On 5/29/2024 3:14 PM, Peter Xu wrote: On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote: diff --git a/system/memory.c b/system/memory.c index 49f1cb2..ca04a0e 100644 --- a/system/memory.c +++ b/system/memory.c @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr, uint64_t size, Error **errp) { +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0; If there's a machine option to "use memfd for allocations", then it's shared mem... Hmm.. It is a bit confusing to me in quite a few levels: - Why memory allocation method will be defined by a machine property, even if we have memory-backend-* which should cover everything? Some memory regions are implicitly created, and have no explicit representation on the qemu command line. memfd-alloc affects those. More generally, memfd-alloc affects all ramblock allocations that are not explicitly represented by memory-backend object. Thus the simple command line "qemu -m 1G" does not explicitly describe an object, so it goes through the anonymous allocation path, and is affected by memfd-alloc. Can we simply now allow "qemu -m 1G" to work for cpr-exec? I assume you meant "simply not allow". Yes, I could do that, but I would need to explicitly add code to exclude this case, and add a blocker. Right now it "just works" for all paths that lead to ram_block_alloc_host, without any special logic at the memory-backend level. And, I'm not convinced that simplifies the docs, as now I would need to tell the user that "-m 1G" and similar constructions do not work with cpr. I can try to clarify the doc for -memfd-alloc as currently defined. Why do we need to keep cpr working for existing qemu cmdlines? We'll already need to add more new cmdline options already anyway, right? cpr-reboot wasn't doing it, and that made sense to me, so that new features will require the user to opt-in for it, starting with changing its cmdlines. I agree. We need a new option to opt-in to cpr-friendly memory allocation, and I am proposing -machine memfd-alloc. I am simply saying that I can try to do a better job explaining the functionality in my proposed text for memfd-alloc, instead of changing the functionality to exclude "-m 1G". I believe excluding "-m 1G" is the wrong approach, for the reasons I stated - messier implementation *and* documentation. I am open to different syntax for opting in. AFAIU that's what we do with cpr-reboot: we ask the user to specify the right things to make other thing work. Otherwise it won't. Internally, create_default_memdev does create a memory-backend object. That is what my doc comment above refers to: Any associated memory-backend objects are created with share=on An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc. The qapi comments in patch "migration: cpr-exec mode" attempt to say all that: +# Memory backend objects must have the share=on attribute, and +# must be mmap'able in the new QEMU process. For example, +# memory-backend-file is acceptable, but memory-backend-ram is +# not. +# +# The VM must be started with the '-machine memfd-alloc=on' +# option. This causes implicit ram blocks -- those not explicitly +# described by a memory-backend object -- to be allocated by +# mmap'ing a memfd. Examples include VGA, ROM, and even guest +# RAM when it is specified without a memory-backend object. VGA is IIRC 16MB chunk, ROM is even smaller. If the user specifies -object memory-backend-file,share=on propertly, these should be the only outliers? Are these important enough for the downtime? Can we put them into the migrated image alongside with the rest device states? It's not about downtime. vfio, vdpa, and iommufd pin all guest pages. The pages must remain pinned during CPR to support ongoing DMA activity which could target those pages (which we do not quiesce), and the same physical pages must be used for the ramblocks in the new qemu process. Ah ok, yes DMA can happen on the fly. Guest mem is definitely the major DMA target and that can be covered by -object memory-backend-*,shared=on cmdlines. ROM is definitely not a DMA target. So is VGA ram a target for, perhaps, an assigned vGPU device? Do we have a list of things that will need that? Can we make them work somehow by sharing them like guest mem? The pass-through devices map and pin all memory accessible to the guest. We cannot make exceptions based on our intuition of how the memory will and will not be used. Also, we cannot simply abandon the old pinned ramblocks, owned by an mm_struct that will become a zombie. We would actually
Re: [PATCH V1 00/26] Live update: cpr-exec
On 5/28/2024 12:42 PM, Peter Xu wrote: On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote: On 5/27/2024 1:45 PM, Peter Xu wrote: On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote: I understand, thanks. If I can help with any of your todo list, just ask - steve Thanks for offering the help, Steve. Started looking at this today, then I found that I miss something high-level. Let me ask here, and let me apologize already for starting to throw multiple questions.. IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in this case not host kernel but QEMU-only, and/or upper. Is there any justification on why the complexity is needed here? It looks to me this one is more involved than cpr-reboot, so I'm thinking how much we can get from the complexity, and whether it's worthwhile. 1000+ LOC is the min support, and if we even expect more to come, that's really important, IMHO. For example, what's the major motivation of this whole work? Is that more on performance, or is it more for supporting the special devices like VFIO which we used to not support, or something else? I can't find them in whatever cover letter I can find, including this one. Firstly, regarding performance, IMHO it'll be always nice to share even some very fundamental downtime measurement comparisons using the new exec mode v.s. the old migration ways to upgrade QEMU binary. Do you perhaps have some number on hand when you started working on this feature years ago? Or maybe some old links on the list would help too, as I didn't follow this work since the start. On VFIO, IIUC you started out this project without VFIO migration being there. Now we have VFIO migration so not sure how much it would work for the upgrade use case. Even with current VFIO migration, we may not want to migrate device states for a local upgrade I suppose, as that can be a lot depending on the type of device assigned. However it'll be nice to discuss this too if this is the major purpose of the series. I think one other challenge on QEMU upgrade with VFIO devices is that the dest QEMU won't be able to open the VFIO device when the src QEMU is still using it as the owner. IIUC this is a similar condition where QEMU wants to have proper ownership transfer of a shared block device, and AFAIR right now we resolved that issue using some form of file lock on the image file. In this case it won't easily apply to a VFIO dev fd, but maybe we still have other approaches, not sure whether you investigated any. E.g. could the VFIO handle be passed over using unix scm rights? I think this might remove one dependency of using exec which can cause quite some difference v.s. a generic migration (from which regard, cpr-reboot is still a pretty generic migration). You also mentioned vhost/tap, is that also a major goal of this series in the follow up patchsets? Is this a problem only because this solution will do exec? Can it work if either the exec()ed qemu or dst qemu create the vhost/tap fds when boot? Meanwhile, could you elaborate a bit on the implication on chardevs? From what I read in the doc update it looks like a major part of work in the future, but I don't yet understand the issue.. Is it also relevant to the exec() approach? In all cases, some of such discussion would be really appreciated. And if you used to consider other approaches to solve this problem it'll be great to mention how you chose this way. Considering this work contains too many things, it'll be nice if such discussion can start with the fundamentals, e.g. on why exec() is a must. The main goal of cpr-exec is providing a fast and reliable way to update qemu. cpr-reboot is not fast enough or general enough. It requires the guest to support suspend and resume for all devices, and that takes seconds. If one actually reboots the host, that adds more seconds, depending on system services. cpr-exec takes 0.1 secs, and works every time, unlike like migration which can fail to converge on a busy system. Live migration also consumes more system and network resources. Right, but note that when I was thinking of a comparison between cpr-exec v.s. normal migration, I didn't mean a "normal live migration". I think it's more of the case whether exec() can be avoided. I had a feeling that this exec() will cause a major part of work elsewhere but maybe I am wrong as I didn't see the whole branch. The only parts of this work that are specific to exec are these patches and the qemu_clear_cloexec() calls in cpr.c. vl: helper to request re-exec migration: precreate vmstate for exec The rest would be the same if some other mechanism were used to start new qemu. Additional code would be needed for the new mechanism, such as SCM_RIGHTS sends. AFAIU, "cpr-exec takes 0.1 secs" is a conditional result. I think it at least should be relevant to what devices are attached to the VM, right? E.g., I observed at least
Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
On 5/29/2024 3:25 PM, Peter Xu wrote: On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote: On 5/28/2024 5:44 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote: Preserve fields of RAMBlocks that allocate their host memory during CPR so the RAM allocation can be recovered. This sentence itself did not explain much, IMHO. QEMU can share memory using fd based memory already of all kinds, as long as the memory backend is path-based it can be shared by sharing the same paths to dst. This reads very confusing as a generic concept. I mean, QEMU migration relies on so many things to work right. We mostly asks the users to "use exactly the same cmdline for src/dst QEMU unless you know what you're doing", otherwise many things can break. That should also include ramblock being matched between src/dst due to the same cmdlines provided on both sides. It'll be confusing to mention this when we thought the ramblocks also rely on that fact. So IIUC this sentence should be dropped in the real patch, and I'll try to guess the real reason with below.. The properties of the implicitly created ramblocks must be preserved. The defaults can and do change between qemu releases, even when the command-line parameters do not change for the explicit objects that cause these implicit ramblocks to be created. AFAIU, QEMU relies on ramblocks to be the same before this series. Do you have an example? Would that already cause issue when migrate? Alignment has changed, and used_length vs max_length changed when resizeable ramblocks were introduced. I have dealt with these issues while supporting cpr for our internal use, and the learned lesson is to explicitly communicate the creation-time parameters to new qemu. These are not an issue for migration because the ramblock is re-created and the data copied into the new memory. Mirror the mr->align field in the RAMBlock to simplify the vmstate. Preserve the old host address, even though it is immediately discarded, as it will be needed in the future for CPR with iommufd. Preserve guest_memfd, even though CPR does not yet support it, to maintain vmstate compatibility when it becomes supported. .. It could be about the vfio vaddr update feature that you mentioned and only for iommufd (as IIUC vfio still relies on iova ranges, then it won't help here)? If so, IMHO we should have this patch (or any variance form) to be there for your upcoming vfio support. Keeping this around like this will make the series harder to review. Or is it needed even before VFIO? This patch is needed independently of vfio or iommufd. guest_memfd is independent of vfio or iommufd. It is a recent addition which I have not tried to support, but I added this placeholder field to it can be supported in the future without adding a new field later and maintaining backwards compatibility. Is guest_memfd the only user so far, then? If so, would it be possible we split it as a separate effort on top of the base cpr-exec support? I don't understand the question. I am indeed deferring support for guest_memfd to a future time. For now, I am adding a blocker, and reserving a field for it in the preserved ramblock attributes, to avoid adding a subsection later. - Steve
Re: [PATCH V1 07/26] migration: VMStateId
On 5/29/2024 2:53 PM, Peter Xu wrote: On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote: How about a more general name for the type: migration/misc.h typedef char (MigrationId)[256]; How about qemu/typedefs.h? Not sure whether it's applicable. Markus (in the loop) may have a better idea. Meanwhile, s/MigrationID/IDString/? typedefs.h has a different purpose; giving short names to types defined in internal include files. This id is specific to migration, so I still think its name should reflect migration and it belongs in some include/migration/*.h file. ramblocks and migration are already closely related. There is nothing wrong with including a migration header in ramblock.h so it can use a migration type. We already have: include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h" include/hw/display/ramfb.h:#include "migration/vmstate.h" include/hw/input/pl050.h:#include "migration/vmstate.h" include/hw/pci/shpc.h:#include "migration/vmstate.h" include/hw/virtio/virtio.h:#include "migration/vmstate.h" include/hw/hyperv/vmbus.h:#include "migration/vmstate.h" The 256 byte magic length already appears in too many places, and my code would add more occurrences, so I really think that abstracting this type would be cleaner. - Steve exec/ramblock.h struct RAMBlock { MigrationId idstr; migration/savevm.c typedef struct CompatEntry { MigrationId idstr; typedef struct SaveStateEntry { MigrationId idstr; - Steve
Re: [PATCH V1 17/26] machine: memfd-alloc option
On 5/29/2024 3:14 PM, Peter Xu wrote: On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote: diff --git a/system/memory.c b/system/memory.c index 49f1cb2..ca04a0e 100644 --- a/system/memory.c +++ b/system/memory.c @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr, uint64_t size, Error **errp) { +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0; If there's a machine option to "use memfd for allocations", then it's shared mem... Hmm.. It is a bit confusing to me in quite a few levels: - Why memory allocation method will be defined by a machine property, even if we have memory-backend-* which should cover everything? Some memory regions are implicitly created, and have no explicit representation on the qemu command line. memfd-alloc affects those. More generally, memfd-alloc affects all ramblock allocations that are not explicitly represented by memory-backend object. Thus the simple command line "qemu -m 1G" does not explicitly describe an object, so it goes through the anonymous allocation path, and is affected by memfd-alloc. Can we simply now allow "qemu -m 1G" to work for cpr-exec? I assume you meant "simply not allow". Yes, I could do that, but I would need to explicitly add code to exclude this case, and add a blocker. Right now it "just works" for all paths that lead to ram_block_alloc_host, without any special logic at the memory-backend level. And, I'm not convinced that simplifies the docs, as now I would need to tell the user that "-m 1G" and similar constructions do not work with cpr. I can try to clarify the doc for -memfd-alloc as currently defined. AFAIU that's what we do with cpr-reboot: we ask the user to specify the right things to make other thing work. Otherwise it won't. Internally, create_default_memdev does create a memory-backend object. That is what my doc comment above refers to: Any associated memory-backend objects are created with share=on An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc. The qapi comments in patch "migration: cpr-exec mode" attempt to say all that: +# Memory backend objects must have the share=on attribute, and +# must be mmap'able in the new QEMU process. For example, +# memory-backend-file is acceptable, but memory-backend-ram is +# not. +# +# The VM must be started with the '-machine memfd-alloc=on' +# option. This causes implicit ram blocks -- those not explicitly +# described by a memory-backend object -- to be allocated by +# mmap'ing a memfd. Examples include VGA, ROM, and even guest +# RAM when it is specified without a memory-backend object. VGA is IIRC 16MB chunk, ROM is even smaller. If the user specifies -object memory-backend-file,share=on propertly, these should be the only outliers? Are these important enough for the downtime? Can we put them into the migrated image alongside with the rest device states? It's not about downtime. vfio, vdpa, and iommufd pin all guest pages. The pages must remain pinned during CPR to support ongoing DMA activity which could target those pages (which we do not quiesce), and the same physical pages must be used for the ramblocks in the new qemu process. - Even if we have such a machine property, why setting "memfd" will always imply shared? why not private? After all it's not called "memfd-shared-alloc", and we can create private mappings using e.g. memory-backend-memfd,share=off. There is no use case for memfd-alloc with share=off, so no point IMO in making the option more verbose. Unfortunately this fact doesn't make the property easier to understand. :-( > For cpr, the mapping with all its modifications must be visible to new qemu when qemu mmaps it. So this might be the important part - do you mean migrating VGA/ROM/... small ramblocks won't work (besides any performance concerns)? Could you elaborate? Pinning. Cpr-reboot already introduced lots of tricky knobs to QEMU. We may need to restrict that specialty to minimal, making the interfacing as clear as possible, or (at least migration) maintainers will start to be soon scared and running away, if such proposal was not shot down. In short, I hope when we introduce new knobs for cpr, we shouldn't always keep cpr-* modes in mind, but consider whenever the user can use it without cpr-*. I'm not sure whether it'll be always possible, but we should try. I agree in principle. FWIW, I have tried to generalize the functionality needed by cpr so it can be used in other ways: per-mode blockers, per-mode notifiers, precreate vmstate, factory objects; to base it on migration internals with minimal change (vmstate); and to make minimal changes in the migration control paths. - Steve
Re: [PATCH V1 05/26] migration: precreate vmstate
On 5/29/2024 2:39 PM, Peter Xu wrote: On Tue, May 28, 2024 at 11:09:49AM -0400, Steven Sistare wrote: On 5/27/2024 2:16 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote: Provide the VMStateDescription precreate field to mark objects that must be loaded on the incoming side before devices have been created, because they provide properties that will be needed at creation time. They will be saved to and loaded from their own QEMUFile, via qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these functions are not yet called in this patch. Allow them to be called before or after normal migration is active, when current_migration and current_incoming are not valid. Signed-off-by: Steve Sistare --- include/migration/vmstate.h | 6 migration/savevm.c | 69 + migration/savevm.h | 3 ++ 3 files changed, 73 insertions(+), 5 deletions(-) diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h index 294d2d8..4691334 100644 --- a/include/migration/vmstate.h +++ b/include/migration/vmstate.h @@ -198,6 +198,12 @@ struct VMStateDescription { * a QEMU_VM_SECTION_START section. */ bool early_setup; + +/* + * Send/receive this object in the precreate migration stream. + */ +bool precreate; + int version_id; int minimum_version_id; MigrationPriority priority; diff --git a/migration/savevm.c b/migration/savevm.c index 9789823..a30bcd9 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -239,6 +239,7 @@ static SaveState savevm_state = { #define SAVEVM_FOREACH(se, entry)\ QTAILQ_FOREACH(se, _state.handlers, entry)\ +if (!se->vmsd || !se->vmsd->precreate) #define SAVEVM_FOREACH_ALL(se, entry)\ QTAILQ_FOREACH(se, _state.handlers, entry) @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se, } } +static bool send_section_footer(SaveStateEntry *se) +{ +return (se->vmsd && se->vmsd->precreate) || + migrate_get_current()->send_section_footer; +} Does the precreate vmsd "require" the footer? Or it should also work? IMHO it's less optimal to bind features without good reasons. It is not required. However, IMO we should not treat send-section-footer as a fungible feature. It is strictly an improvement, as was added to catch misformated sections. It is only registered as a feature for backwards compatibility with qemu 2.3 and xen. For a brand new data stream such as precreate, where we are not constrained by backwards compatibility, we should unconditionally use the better protocol, and always send the footer. I see your point, but I still don't think we should mangle these things. It makes future justification harder on whether section footer should be sent. Take example of whatever new feature added for migration like mapped-ram, we might also want to enforce it by adding "return migrate_mapped_ram() || ..." but it means we keep growing this with no benefit. What you worry on "what if this is turned off" isn't a real one: nobody will turn it off! We started to deprecate machines, and I had a feeling it will be enforced at some point by default. That's fine, I'll delete the send_section_footer() function. - Steve
Re: [PATCH V1 17/26] machine: memfd-alloc option
On 5/28/2024 5:12 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:26AM -0700, Steve Sistare wrote: Allocate anonymous memory using memfd_create if the memfd-alloc machine option is set. Signed-off-by: Steve Sistare --- hw/core/machine.c | 22 ++ include/hw/boards.h | 1 + qemu-options.hx | 6 ++ system/memory.c | 9 ++--- system/physmem.c| 18 +- system/trace-events | 1 + 6 files changed, 53 insertions(+), 4 deletions(-) diff --git a/hw/core/machine.c b/hw/core/machine.c index 582c2df..9567b97 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -443,6 +443,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) ms->mem_merge = value; } +static bool machine_get_memfd_alloc(Object *obj, Error **errp) +{ +MachineState *ms = MACHINE(obj); + +return ms->memfd_alloc; +} + +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp) +{ +MachineState *ms = MACHINE(obj); + +ms->memfd_alloc = value; +} + static bool machine_get_usb(Object *obj, Error **errp) { MachineState *ms = MACHINE(obj); @@ -1044,6 +1058,11 @@ static void machine_class_init(ObjectClass *oc, void *data) object_class_property_set_description(oc, "mem-merge", "Enable/disable memory merge support"); +object_class_property_add_bool(oc, "memfd-alloc", +machine_get_memfd_alloc, machine_set_memfd_alloc); +object_class_property_set_description(oc, "memfd-alloc", +"Enable/disable allocating anonymous memory using memfd_create"); + object_class_property_add_bool(oc, "usb", machine_get_usb, machine_set_usb); object_class_property_set_description(oc, "usb", @@ -1387,6 +1406,9 @@ static bool create_default_memdev(MachineState *ms, const char *path, Error **er if (!object_property_set_int(obj, "size", ms->ram_size, errp)) { goto out; } +if (!object_property_set_bool(obj, "share", ms->memfd_alloc, errp)) { +goto out; +} object_property_add_child(object_get_objects_root(), mc->default_ram_id, obj); /* Ensure backend's memory region name is equal to mc->default_ram_id */ diff --git a/include/hw/boards.h b/include/hw/boards.h index 69c1ba4..96259c3 100644 --- a/include/hw/boards.h +++ b/include/hw/boards.h @@ -372,6 +372,7 @@ struct MachineState { bool dump_guest_core; bool mem_merge; bool require_guest_memfd; +bool memfd_alloc; bool usb; bool usb_disabled; char *firmware; diff --git a/qemu-options.hx b/qemu-options.hx index cf61f6b..f0dfda5 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -32,6 +32,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ "vmport=on|off|auto controls emulation of vmport (default: auto)\n" "dump-guest-core=on|off include guest memory in a core dump (default=on)\n" "mem-merge=on|off controls memory merge support (default: on)\n" +"memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n" "aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n" "dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n" "suppress-vmdesc=on|off disables self-describing migration (default=off)\n" @@ -79,6 +80,11 @@ SRST supported by the host, de-duplicates identical memory pages among VMs instances (enabled by default). +``memfd-alloc=on|off`` +Enables or disables allocation of anonymous guest RAM using +memfd_create. Any associated memory-backend objects are created with +share=on. The memfd-alloc default is off. + ``aes-key-wrap=on|off`` Enables or disables AES key wrapping support on s390-ccw hosts. This feature controls whether AES wrapping keys will be created diff --git a/system/memory.c b/system/memory.c index 49f1cb2..ca04a0e 100644 --- a/system/memory.c +++ b/system/memory.c @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr, uint64_t size, Error **errp) { +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0; If there's a machine option to "use memfd for allocations", then it's shared mem... Hmm.. It is a bit confusing to me in quite a few levels: - Why memory allocation method will be defined by a machine property, even if we have memory-backend-* which should cover everything? Some memory regions are implicitly created, and have no explicit representation on the qemu command line. memfd-alloc affects those. More generally, memfd-alloc affects all ramblock allocations that are not explicitly represented by memory-backend object. Thus the simple command
Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
On 5/28/2024 5:44 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote: Preserve fields of RAMBlocks that allocate their host memory during CPR so the RAM allocation can be recovered. This sentence itself did not explain much, IMHO. QEMU can share memory using fd based memory already of all kinds, as long as the memory backend is path-based it can be shared by sharing the same paths to dst. This reads very confusing as a generic concept. I mean, QEMU migration relies on so many things to work right. We mostly asks the users to "use exactly the same cmdline for src/dst QEMU unless you know what you're doing", otherwise many things can break. That should also include ramblock being matched between src/dst due to the same cmdlines provided on both sides. It'll be confusing to mention this when we thought the ramblocks also rely on that fact. So IIUC this sentence should be dropped in the real patch, and I'll try to guess the real reason with below.. The properties of the implicitly created ramblocks must be preserved. The defaults can and do change between qemu releases, even when the command-line parameters do not change for the explicit objects that cause these implicit ramblocks to be created. Mirror the mr->align field in the RAMBlock to simplify the vmstate. Preserve the old host address, even though it is immediately discarded, as it will be needed in the future for CPR with iommufd. Preserve guest_memfd, even though CPR does not yet support it, to maintain vmstate compatibility when it becomes supported. .. It could be about the vfio vaddr update feature that you mentioned and only for iommufd (as IIUC vfio still relies on iova ranges, then it won't help here)? If so, IMHO we should have this patch (or any variance form) to be there for your upcoming vfio support. Keeping this around like this will make the series harder to review. Or is it needed even before VFIO? This patch is needed independently of vfio or iommufd. guest_memfd is independent of vfio or iommufd. It is a recent addition which I have not tried to support, but I added this placeholder field to it can be supported in the future without adding a new field later and maintaining backwards compatibility. Another thing to ask: does this idea also need to rely on some future iommufd kernel support? If there's anything that's not merged in current Linux upstream, this series needs to be marked as RFC, so it's not target for merging. This will also be true if this patch is "preparing" for that work. It means if this patch only services iommufd purpose, even if it doesn't require any kernel header to be referenced, we should only merge it together with the full iommufd support comes later (and that'll be after iommufd kernel supports land). It does not rely on future kernel support. - Steve
Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr
On 5/28/2024 2:21 PM, Peter Xu wrote: On Tue, May 28, 2024 at 11:10:16AM -0400, Steven Sistare wrote: On 5/27/2024 2:31 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote: Define VMSTATE_VOID_PTR so the value of a pointer (but not its target) can be saved in the migration stream. This will be needed for CPR. Signed-off-by: Steve Sistare This is really tricky. From a first glance, I don't think migrating a VA is valid at all for migration even if with exec.. and looks insane to me for a cross-process migration, which seems to be allowed to use as a generic VMSD helper.. as VA is the address space barrier for different processes and I think it normally even apply to generic execve(), and we're trying to jailbreak for some reason.. It definitely won't work for any generic migration as sizeof(void*) can be different afaict between hosts, e.g. 32bit -> 64bit migrations. Some description would be really helpful in this commit message, e.g. explain the users and why. Do we need to poison that for generic VMSD use (perhaps with prefixed underscores)? I think I'll need to read on the rest to tell.. Short answer: we never dereference the void* in the new process. And must not. Longer answer: During CPR for vfio, each mapped DMA region is re-registered in the new process using the new VA. The ioctl to re-register identifies the mapping by IOVA and length. The same requirement holds for CPR of iommufd devices. However, in the iommufd framework, IOVA does not uniquely identify a dma mapping, and we need to use the old VA as the unique identifier. The new process re-registers each mapping, passing the old VA and new VA to the kernel. The old VA is never dereferenced in the new process, we just need its value. I suspected that the void* which must not be dereferenced might make people uncomfortable. I have an older version of my code which adds a uint64_t field to RAMBlock for recording and migrating the old VA. The saving and loading code is slightly less elegant, but no big deal. Would you prefer that? I see, thanks for explaining. Yes that sounds better to me. Re the ugliness: is that about a pre_save() plus one extra uint64_t field? In that case it looks better comparing to migrating "void*". Yes. Will do. I'm trying to read some context on the vaddr remap thing from you, and I found this: https://lore.kernel.org/all/y90bvbnrvraceq%2f...@nvidia.com/ So it will work with iommufd now? Meanwhile, what's the status for mdev? Looks like it isn't supported yet for both. I am currently working on the kernel and qemu code to support iommufd. It is an RFE on top of this initial cpr-exec work that I will post separately later. I do know that it will require the old VA, so I am proposing to preserve old VA now in the migration stream to avoid adding extra backwards compatibility code later. I have prototyped userland code that supports mdev, based on jason's suggestion to fork an extra process to handle mdev translations during the transition from old to new qemu, but it is a work in progres and adds complexity, and I do not plan to submit that any time soon. Another RFE. For now mdev is not supported. - Steve
Re: [PATCH V1 07/26] migration: VMStateId
On 5/28/2024 1:44 PM, Peter Xu wrote: On Tue, May 28, 2024 at 11:10:03AM -0400, Steven Sistare via wrote: On 5/27/2024 2:20 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote: Define a type for the 256 byte id string to guarantee the same length is used and enforced everywhere. Signed-off-by: Steve Sistare --- include/exec/ramblock.h | 3 ++- include/migration/vmstate.h | 2 ++ migration/savevm.c | 8 migration/vmstate.c | 3 ++- 4 files changed, 10 insertions(+), 6 deletions(-) diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h index 0babd10..61deefe 100644 --- a/include/exec/ramblock.h +++ b/include/exec/ramblock.h @@ -23,6 +23,7 @@ #include "cpu-common.h" #include "qemu/rcu.h" #include "exec/ramlist.h" +#include "migration/vmstate.h" struct RAMBlock { struct rcu_head rcu; @@ -35,7 +36,7 @@ struct RAMBlock { void (*resized)(const char*, uint64_t length, void *host); uint32_t flags; /* Protected by the BQL. */ -char idstr[256]; +VMStateId idstr; /* RCU-enabled, writes protected by the ramlist lock */ QLIST_ENTRY(RAMBlock) next; QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers; Hmm.. Don't look like a good idea to include a migration header in ramblock.h? Is this ramblock change needed for this work? Well, entities that are migrated include migration headers, and now that includes RAMBlock. There is precedent: 0 include/exec/ramblock.h 26 #include "migration/vmstate.h" 1 include/hw/acpi/ich9_tco. 14 #include "migration/vmstate.h" 2 include/hw/display/ramfb. 4 #include "migration/vmstate.h" 3 include/hw/hyperv/vmbus.h 16 #include "migration/vmstate.h" 4 include/hw/input/pl050.h 14 #include "migration/vmstate.h" 5 include/hw/pci/shpc.h 7 #include "migration/vmstate.h" 6 include/hw/virtio/virtio. 20 #include "migration/vmstate.h" 7 include/migration/cpu.h8 #include "migration/vmstate.h" Granted, only some of the C files that include ramblock.h need all of vmstate.h. I could define VMStateId in a smaller file such as migration/misc.h, or a new file migration/vmstateid.h, and include that in ramblock.h. Any preference? One issue here is currently the idstr[] of ramblock is a verbose name of the ramblock, and logically it doesn't need to have anything to do with vmstate. I'll continue to read to see why we need VMStateID here for a ramblock. So if it is necessary, maybe the name VMStateID to be used here is misleading? It'll also be nice to separate the changes, as ramblock.h doesn't belong to migration subsystem but memory api. cpr requires migrating vmstate for ramblock. See the physmem patches for why. vmstate requires a unique id, and ramblock already defines a unique id which is used to identify the block and dirty bitmap in the migration stream. How about a more general name for the type: migration/misc.h typedef char (MigrationId)[256]; exec/ramblock.h struct RAMBlock { MigrationId idstr; migration/savevm.c typedef struct CompatEntry { MigrationId idstr; typedef struct SaveStateEntry { MigrationId idstr; - Steve
Re: [PATCH V1 05/26] migration: precreate vmstate
On 5/27/2024 2:16 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote: Provide the VMStateDescription precreate field to mark objects that must be loaded on the incoming side before devices have been created, because they provide properties that will be needed at creation time. They will be saved to and loaded from their own QEMUFile, via qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these functions are not yet called in this patch. Allow them to be called before or after normal migration is active, when current_migration and current_incoming are not valid. Signed-off-by: Steve Sistare --- include/migration/vmstate.h | 6 migration/savevm.c | 69 + migration/savevm.h | 3 ++ 3 files changed, 73 insertions(+), 5 deletions(-) diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h index 294d2d8..4691334 100644 --- a/include/migration/vmstate.h +++ b/include/migration/vmstate.h @@ -198,6 +198,12 @@ struct VMStateDescription { * a QEMU_VM_SECTION_START section. */ bool early_setup; + +/* + * Send/receive this object in the precreate migration stream. + */ +bool precreate; + int version_id; int minimum_version_id; MigrationPriority priority; diff --git a/migration/savevm.c b/migration/savevm.c index 9789823..a30bcd9 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -239,6 +239,7 @@ static SaveState savevm_state = { #define SAVEVM_FOREACH(se, entry)\ QTAILQ_FOREACH(se, _state.handlers, entry)\ +if (!se->vmsd || !se->vmsd->precreate) #define SAVEVM_FOREACH_ALL(se, entry)\ QTAILQ_FOREACH(se, _state.handlers, entry) @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se, } } +static bool send_section_footer(SaveStateEntry *se) +{ +return (se->vmsd && se->vmsd->precreate) || + migrate_get_current()->send_section_footer; +} Does the precreate vmsd "require" the footer? Or it should also work? IMHO it's less optimal to bind features without good reasons. It is not required. However, IMO we should not treat send-section-footer as a fungible feature. It is strictly an improvement, as was added to catch misformated sections. It is only registered as a feature for backwards compatibility with qemu 2.3 and xen. For a brand new data stream such as precreate, where we are not constrained by backwards compatibility, we should unconditionally use the better protocol, and always send the footer. - Steve
Re: [PATCH V1 00/26] Live update: cpr-exec
On 5/27/2024 1:45 PM, Peter Xu wrote: On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote: I understand, thanks. If I can help with any of your todo list, just ask - steve Thanks for offering the help, Steve. Started looking at this today, then I found that I miss something high-level. Let me ask here, and let me apologize already for starting to throw multiple questions.. IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in this case not host kernel but QEMU-only, and/or upper. Is there any justification on why the complexity is needed here? It looks to me this one is more involved than cpr-reboot, so I'm thinking how much we can get from the complexity, and whether it's worthwhile. 1000+ LOC is the min support, and if we even expect more to come, that's really important, IMHO. For example, what's the major motivation of this whole work? Is that more on performance, or is it more for supporting the special devices like VFIO which we used to not support, or something else? I can't find them in whatever cover letter I can find, including this one. Firstly, regarding performance, IMHO it'll be always nice to share even some very fundamental downtime measurement comparisons using the new exec mode v.s. the old migration ways to upgrade QEMU binary. Do you perhaps have some number on hand when you started working on this feature years ago? Or maybe some old links on the list would help too, as I didn't follow this work since the start. On VFIO, IIUC you started out this project without VFIO migration being there. Now we have VFIO migration so not sure how much it would work for the upgrade use case. Even with current VFIO migration, we may not want to migrate device states for a local upgrade I suppose, as that can be a lot depending on the type of device assigned. However it'll be nice to discuss this too if this is the major purpose of the series. I think one other challenge on QEMU upgrade with VFIO devices is that the dest QEMU won't be able to open the VFIO device when the src QEMU is still using it as the owner. IIUC this is a similar condition where QEMU wants to have proper ownership transfer of a shared block device, and AFAIR right now we resolved that issue using some form of file lock on the image file. In this case it won't easily apply to a VFIO dev fd, but maybe we still have other approaches, not sure whether you investigated any. E.g. could the VFIO handle be passed over using unix scm rights? I think this might remove one dependency of using exec which can cause quite some difference v.s. a generic migration (from which regard, cpr-reboot is still a pretty generic migration). You also mentioned vhost/tap, is that also a major goal of this series in the follow up patchsets? Is this a problem only because this solution will do exec? Can it work if either the exec()ed qemu or dst qemu create the vhost/tap fds when boot? Meanwhile, could you elaborate a bit on the implication on chardevs? From what I read in the doc update it looks like a major part of work in the future, but I don't yet understand the issue.. Is it also relevant to the exec() approach? In all cases, some of such discussion would be really appreciated. And if you used to consider other approaches to solve this problem it'll be great to mention how you chose this way. Considering this work contains too many things, it'll be nice if such discussion can start with the fundamentals, e.g. on why exec() is a must. The main goal of cpr-exec is providing a fast and reliable way to update qemu. cpr-reboot is not fast enough or general enough. It requires the guest to support suspend and resume for all devices, and that takes seconds. If one actually reboots the host, that adds more seconds, depending on system services. cpr-exec takes 0.1 secs, and works every time, unlike like migration which can fail to converge on a busy system. Live migration also consumes more system and network resources. cpr-exec seamlessly preserves client connections by preserving chardevs, and overall provides a much nicer user experience. chardev's are preserved by keeping their fd open across the exec, and remembering the value of the fd in precreate vmstate so that new qemu can associate the fd with the chardev rather than opening a new one. The approach of preserving open file descriptors is very general and applicable to all kinds of devices, regardless of whether they support live migration in hardware. Device fd's are preserved using the same mechanism as for chardevs. Devices that support live migration in hardware do not like to live migrate in place to the same node. It is not what they are designed for, and some implementations will flat out fail because the source and target interfaces are the same. For vhost/tap, sometimes the management layer opens the dev and passes an fd to qemu, and sometimes qemu opens the dev. The upcoming vhost/tap support allows both
Re: [PATCH V1 07/26] migration: VMStateId
On 5/27/2024 2:20 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:16AM -0700, Steve Sistare wrote: Define a type for the 256 byte id string to guarantee the same length is used and enforced everywhere. Signed-off-by: Steve Sistare --- include/exec/ramblock.h | 3 ++- include/migration/vmstate.h | 2 ++ migration/savevm.c | 8 migration/vmstate.c | 3 ++- 4 files changed, 10 insertions(+), 6 deletions(-) diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h index 0babd10..61deefe 100644 --- a/include/exec/ramblock.h +++ b/include/exec/ramblock.h @@ -23,6 +23,7 @@ #include "cpu-common.h" #include "qemu/rcu.h" #include "exec/ramlist.h" +#include "migration/vmstate.h" struct RAMBlock { struct rcu_head rcu; @@ -35,7 +36,7 @@ struct RAMBlock { void (*resized)(const char*, uint64_t length, void *host); uint32_t flags; /* Protected by the BQL. */ -char idstr[256]; +VMStateId idstr; /* RCU-enabled, writes protected by the ramlist lock */ QLIST_ENTRY(RAMBlock) next; QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers; Hmm.. Don't look like a good idea to include a migration header in ramblock.h? Is this ramblock change needed for this work? Well, entities that are migrated include migration headers, and now that includes RAMBlock. There is precedent: 0 include/exec/ramblock.h 26 #include "migration/vmstate.h" 1 include/hw/acpi/ich9_tco. 14 #include "migration/vmstate.h" 2 include/hw/display/ramfb. 4 #include "migration/vmstate.h" 3 include/hw/hyperv/vmbus.h 16 #include "migration/vmstate.h" 4 include/hw/input/pl050.h 14 #include "migration/vmstate.h" 5 include/hw/pci/shpc.h 7 #include "migration/vmstate.h" 6 include/hw/virtio/virtio. 20 #include "migration/vmstate.h" 7 include/migration/cpu.h8 #include "migration/vmstate.h" Granted, only some of the C files that include ramblock.h need all of vmstate.h. I could define VMStateId in a smaller file such as migration/misc.h, or a new file migration/vmstateid.h, and include that in ramblock.h. Any preference? - Steve
Re: [PATCH V1 08/26] migration: vmstate_info_void_ptr
On 5/27/2024 2:31 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:17AM -0700, Steve Sistare wrote: Define VMSTATE_VOID_PTR so the value of a pointer (but not its target) can be saved in the migration stream. This will be needed for CPR. Signed-off-by: Steve Sistare This is really tricky. From a first glance, I don't think migrating a VA is valid at all for migration even if with exec.. and looks insane to me for a cross-process migration, which seems to be allowed to use as a generic VMSD helper.. as VA is the address space barrier for different processes and I think it normally even apply to generic execve(), and we're trying to jailbreak for some reason.. It definitely won't work for any generic migration as sizeof(void*) can be different afaict between hosts, e.g. 32bit -> 64bit migrations. Some description would be really helpful in this commit message, e.g. explain the users and why. Do we need to poison that for generic VMSD use (perhaps with prefixed underscores)? I think I'll need to read on the rest to tell.. Short answer: we never dereference the void* in the new process. And must not. Longer answer: During CPR for vfio, each mapped DMA region is re-registered in the new process using the new VA. The ioctl to re-register identifies the mapping by IOVA and length. The same requirement holds for CPR of iommufd devices. However, in the iommufd framework, IOVA does not uniquely identify a dma mapping, and we need to use the old VA as the unique identifier. The new process re-registers each mapping, passing the old VA and new VA to the kernel. The old VA is never dereferenced in the new process, we just need its value. I suspected that the void* which must not be dereferenced might make people uncomfortable. I have an older version of my code which adds a uint64_t field to RAMBlock for recording and migrating the old VA. The saving and loading code is slightly less elegant, but no big deal. Would you prefer that? - Steve
Re: [PATCH V1 23/26] migration: misc cpr-exec blockers
On 5/24/2024 8:40 AM, Fabiano Rosas wrote: Steve Sistare writes: Add blockers for cpr-exec migration mode for devices and options that do not support it. Signed-off-by: Steve Sistare --- accel/xen/xen-all.c| 5 + backends/hostmem-epc.c | 12 ++-- hw/vfio/migration.c| 3 ++- replay/replay.c| 6 ++ 4 files changed, 23 insertions(+), 3 deletions(-) diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c index 0bdefce..9a7ed0f 100644 --- a/accel/xen/xen-all.c +++ b/accel/xen/xen-all.c This file is missing the migration/blocker.h include. Good eyes, will fix - steve
Re: [PATCH V1 20/26] migration: cpr-exec mode
On 5/24/2024 10:58 AM, Fabiano Rosas wrote: Steve Sistare writes: Add the cpr-exec migration mode. Usage: qemu-system-$arch -machine memfd-alloc=on ... migrate_set_parameter mode cpr-exec migrate_set_parameter cpr-exec-args \ ... -incoming migrate -d The migrate command stops the VM, saves state to the URI, directly exec's a new version of QEMU on the same host, replacing the original process while retaining its PID, and loads state from the URI. Guest RAM is preserved in place, albeit with new virtual addresses. Arguments for the new QEMU process are taken from the @cpr-exec-args parameter. The first argument should be the path of a new QEMU binary, or a prefix command that exec's the new QEMU binary. Because old QEMU terminates when new QEMU starts, one cannot stream data between the two, so the URI must be a type, such as a file, that reads all data before old QEMU exits. Memory backend objects must have the share=on attribute, and must be mmap'able in the new QEMU process. For example, memory-backend-file is acceptable, but memory-backend-ram is not. The VM must be started with the '-machine memfd-alloc=on' option. This causes implicit ram blocks (those not explicitly described by a memory-backend object) to be allocated by mmap'ing a memfd. Examples include VGA, ROM, and even guest RAM when it is specified without a memory-backend object. The implementation saves precreate vmstate at the end of normal migration in migrate_fd_cleanup, and tells the main loop to call cpr_exec. Incoming qemu loads preceate state early, before objects are created. The memfds are kept open across exec by clearing the close-on-exec flag, their values are saved in precreate vmstate, and they are mmap'd in new qemu. Note that the memfd-alloc option is not related to memory-backend-memfd. Later patches add support for memory-backend-memfd, and for additional devices, including vfio, chardev, and more. Signed-off-by: Steve Sistare --- include/migration/cpr.h | 14 + include/migration/misc.h | 3 ++ migration/cpr.c | 131 +++ migration/meson.build| 1 + migration/migration.c| 21 migration/migration.h| 5 +- migration/ram.c | 1 + qapi/migration.json | 30 ++- system/physmem.c | 2 + system/vl.c | 4 ++ 10 files changed, 210 insertions(+), 2 deletions(-) create mode 100644 include/migration/cpr.h create mode 100644 migration/cpr.c [...] diff --git a/migration/migration.c b/migration/migration.c index b5af6b5..0d91531 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -239,6 +239,7 @@ void migration_object_init(void) blk_mig_init(); ram_mig_init(); dirty_bitmap_mig_init(); +cpr_mig_init(); } typedef struct { @@ -1395,6 +1396,15 @@ static void migrate_fd_cleanup(MigrationState *s) qemu_fclose(tmp); } +if (migrate_mode() == MIG_MODE_CPR_EXEC) { +Error *err = NULL; +if (migration_precreate_save()) { +migrate_set_error(s, err); +error_report_err(err); There's an error_report_err() call already a few lines down. +migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED); +} +} Not a fan of saving state in the middle of the cleanup function. This adds extra restrictions to migrate_fd_cleanup() which already tends to be the source of a bunch of bugs. Can this be done either entirely before or after migrate_fd_cleanup()? There's only one callsite from which you actually want to do cpr-exec, migration_iteration_finish(). It's no big deal if we have to play a bit with the notifier call placement. static void migration_iteration_finish(MigrationState *s) { ... migration_bh_schedule(migrate_cpr_exec_bh, s); migration_bh_schedule(migrate_fd_cleanup_bh, s); bql_unlock(); } IIUC, the BQL ensures the ordering here, but if that doesn't work we could just call the cpr function right at the end of migrate_fd_cleanup(). That would already be better than interleaving. static void migrate_cpr_exec_bh(void *opaque) { MigrationState *s = opaque; Error *err = NULL; if (migration_has_failed(s) || migrate_mode() != MIG_MODE_CPR_EXEC) { return; } assert(s->state == MIGRATION_STATUS_COMPLETED); if (migration_precreate_save()) { migrate_set_error(s, err); error_report_err(err); migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED); migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL); return; } qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_args); } No problem, I can hoist the cpr exec logic out of migrate_fd_cleanup. I'll call migration_precreate_save prior, and I'll register a notifier for MIG_EVENT_PRECOPY_DONE that calls qemu_system_exec_request. BTW the following does not work
Re: [PATCH V1 00/26] Live update: cpr-exec
On 5/24/2024 9:02 AM, Fabiano Rosas wrote: Steve Sistare writes: This patch series adds the live migration cpr-exec mode. In this mode, QEMU stops the VM, writes VM state to the migration URI, and directly exec's a new version of QEMU on the same host, replacing the original process while retaining its PID. Guest RAM is preserved in place, albeit with new virtual addresses. The user completes the migration by specifying the -incoming option, and by issuing the migrate-incoming command if necessary. This saves and restores VM state, with minimal guest pause time, so that QEMU may be updated to a new version in between. The new interfaces are: * cpr-exec (MigMode migration parameter) * cpr-exec-args (migration parameter) * memfd-alloc=on (command-line option for -machine) * only-migratable-modes (command-line argument) The caller sets the mode parameter before invoking the migrate command. Arguments for the new QEMU process are taken from the cpr-exec-args parameter. The first argument should be the path of a new QEMU binary, or a prefix command that exec's the new QEMU binary, and the arguments should include the -incoming option. Memory backend objects must have the share=on attribute, and must be mmap'able in the new QEMU process. For example, memory-backend-file is acceptable, but memory-backend-ram is not. QEMU must be started with the '-machine memfd-alloc=on' option. This causes implicit RAM blocks (those not explicitly described by a memory-backend object) to be allocated by mmap'ing a memfd. Examples include VGA, ROM, and even guest RAM when it is specified without without reference to a memory-backend object. The memfds are kept open across exec, their values are saved in vmstate which is retrieved after exec, and they are re-mmap'd. The '-only-migratable-modes cpr-exec' option guarantees that the configuration supports cpr-exec. QEMU will exit at start time if not. Example: In this example, we simply restart the same version of QEMU, but in a real scenario one would set a new QEMU binary path in cpr-exec-args. # qemu-kvm -monitor stdio -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G -machine memfd-alloc=on ... QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running (qemu) migrate_set_parameter mode cpr-exec (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state (qemu) migrate -d file:vm.state (qemu) QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running cpr-exec mode preserves attributes of outgoing devices that must be known before the device is created on the incoming side, such as the memfd descriptor number, but currently the migration stream is read after all devices are created. To solve this problem, I add two VMStateDescription options: precreate and factory. precreate objects are saved to their own migration stream, distinct from the main stream, and are read early by incoming QEMU, before devices are created. Factory objects are allocated on demand, without relying on a pre-registered object's opaque address, which is necessary because the devices to which the state will apply have not been created yet and hence have not registered an opaque address to receive the state. This patch series implements a minimal version of cpr-exec. Future series will add support for: * vfio * chardev's without loss of connectivity * vhost * fine-grained seccomp controls * hostmem-memfd * cpr-exec migration test Steve Sistare (26): oslib: qemu_clear_cloexec vl: helper to request re-exec migration: SAVEVM_FOREACH migration: delete unused parameter mis migration: precreate vmstate migration: precreate vmstate for exec migration: VMStateId migration: vmstate_info_void_ptr migration: vmstate_register_named migration: vmstate_unregister_named migration: vmstate_register at init time migration: vmstate factory object physmem: ram_block_create physmem: hoist guest_memfd creation physmem: hoist host memory allocation physmem: set ram block idstr earlier machine: memfd-alloc option migration: cpr-exec-args parameter physmem: preserve ram blocks for cpr migration: cpr-exec mode migration: migrate_add_blocker_mode migration: ram block cpr-exec blockers migration: misc cpr-exec blockers seccomp: cpr-exec blocker migration: fix mismatched GPAs during cpr-exec migration: only-migratable-modes accel/xen/xen-all.c| 5 + backends/hostmem-epc.c | 12 +- hmp-commands.hx| 2 +- hw/core/machine.c | 22 +++ hw/core/qdev.c | 1 + hw/intc/apic_common.c | 2 +- hw/vfio/migration.c| 3 +- include/exec/cpu-common.h | 3 +- include/exec/memory.h | 15 ++ include/exec/ramblock.h| 10 +-
Re: [PATCH V1 00/26] Live update: cpr-exec
I understand, thanks. If I can help with any of your todo list, just ask - steve On 5/20/2024 10:31 PM, Peter Xu wrote: Conference back then pto until today, so tomorrow will be my first working day after those. Sorry Steve, will try my best to read it before next week. I didn't dare to read too much my inbox yet. A bit scared but need to face it tomorrow. On Mon, May 20, 2024, 6:28 p.m. Fabiano Rosas <mailto:faro...@suse.de>> wrote: Steven Sistare mailto:steven.sist...@oracle.com>> writes: > Hi Peter, Hi Fabiano, > Will you have time to review the migration guts of this series any time soon? > In particular: > > [PATCH V1 05/26] migration: precreate vmstate > [PATCH V1 06/26] migration: precreate vmstate for exec > [PATCH V1 12/26] migration: vmstate factory object > [PATCH V1 18/26] migration: cpr-exec-args parameter > [PATCH V1 20/26] migration: cpr-exec mode > I'll get to them this week. I'm trying to make some progress with my own code before I forget how to program. I'm also trying to find some time to implement the device options in the migration tests so we can stop these virtio-* breakages that have been popping up.
Re: [PATCH V1 00/26] Live update: cpr-exec
Hi Peter, Hi Fabiano, Will you have time to review the migration guts of this series any time soon? In particular: [PATCH V1 05/26] migration: precreate vmstate [PATCH V1 06/26] migration: precreate vmstate for exec [PATCH V1 12/26] migration: vmstate factory object [PATCH V1 18/26] migration: cpr-exec-args parameter [PATCH V1 20/26] migration: cpr-exec mode - Steve On 4/29/2024 11:55 AM, Steve Sistare wrote: This patch series adds the live migration cpr-exec mode. In this mode, QEMU stops the VM, writes VM state to the migration URI, and directly exec's a new version of QEMU on the same host, replacing the original process while retaining its PID. Guest RAM is preserved in place, albeit with new virtual addresses. The user completes the migration by specifying the -incoming option, and by issuing the migrate-incoming command if necessary. This saves and restores VM state, with minimal guest pause time, so that QEMU may be updated to a new version in between. The new interfaces are: * cpr-exec (MigMode migration parameter) * cpr-exec-args (migration parameter) * memfd-alloc=on (command-line option for -machine) * only-migratable-modes (command-line argument) The caller sets the mode parameter before invoking the migrate command. Arguments for the new QEMU process are taken from the cpr-exec-args parameter. The first argument should be the path of a new QEMU binary, or a prefix command that exec's the new QEMU binary, and the arguments should include the -incoming option. Memory backend objects must have the share=on attribute, and must be mmap'able in the new QEMU process. For example, memory-backend-file is acceptable, but memory-backend-ram is not. QEMU must be started with the '-machine memfd-alloc=on' option. This causes implicit RAM blocks (those not explicitly described by a memory-backend object) to be allocated by mmap'ing a memfd. Examples include VGA, ROM, and even guest RAM when it is specified without without reference to a memory-backend object. The memfds are kept open across exec, their values are saved in vmstate which is retrieved after exec, and they are re-mmap'd. The '-only-migratable-modes cpr-exec' option guarantees that the configuration supports cpr-exec. QEMU will exit at start time if not. Example: In this example, we simply restart the same version of QEMU, but in a real scenario one would set a new QEMU binary path in cpr-exec-args. # qemu-kvm -monitor stdio -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G -machine memfd-alloc=on ... QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running (qemu) migrate_set_parameter mode cpr-exec (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state (qemu) migrate -d file:vm.state (qemu) QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running cpr-exec mode preserves attributes of outgoing devices that must be known before the device is created on the incoming side, such as the memfd descriptor number, but currently the migration stream is read after all devices are created. To solve this problem, I add two VMStateDescription options: precreate and factory. precreate objects are saved to their own migration stream, distinct from the main stream, and are read early by incoming QEMU, before devices are created. Factory objects are allocated on demand, without relying on a pre-registered object's opaque address, which is necessary because the devices to which the state will apply have not been created yet and hence have not registered an opaque address to receive the state. This patch series implements a minimal version of cpr-exec. Future series will add support for: * vfio * chardev's without loss of connectivity * vhost * fine-grained seccomp controls * hostmem-memfd * cpr-exec migration test Steve Sistare (26): oslib: qemu_clear_cloexec vl: helper to request re-exec migration: SAVEVM_FOREACH migration: delete unused parameter mis migration: precreate vmstate migration: precreate vmstate for exec migration: VMStateId migration: vmstate_info_void_ptr migration: vmstate_register_named migration: vmstate_unregister_named migration: vmstate_register at init time migration: vmstate factory object physmem: ram_block_create physmem: hoist guest_memfd creation physmem: hoist host memory allocation physmem: set ram block idstr earlier machine: memfd-alloc option migration: cpr-exec-args parameter physmem: preserve ram blocks for cpr migration: cpr-exec mode migration: migrate_add_blocker_mode migration: ram block cpr-exec blockers migration: misc cpr-exec blockers seccomp: cpr-exec blocker migration: fix mismatched GPAs during cpr-exec migration: only-migratable-modes accel/xen/xen-all.c| 5 + backends/hostmem-epc.c |
Re: CPR/liveupdate: test results using prior bug fix
On 5/16/2024 1:24 PM, Michael Galaxy wrote: On 5/14/24 08:54, Michael Tokarev wrote: On 5/14/24 16:39, Michael Galaxy wrote: Steve, OK, so it does not look like this bugfix you wrote was included in 8.2.4 (which was released yesterday). Unfortunately, that means that anyone using CPR in that release will still (eventually) encounter the bug like I did. 8.2.4 is basically a "bugfix" release for 8.2.3 which I somewhat screwed up (in a minor way), plus a few currently (at the time) queued up changes. 8.2.3 was a big release though. I would recommend that y'all consider cherry-picking, perhaps, the relevant commits for a possible 8.2.5 ? Please Cc changes which are relevant for -stable to, well, qemu-sta...@nongnu.org :) Which changes needs to be picked up? Steve, can you comment here, please? At a minimum, we have this one: [PULL 20/25] migration: stop vm for cpr But that pull came with a handful of other changes that are also not in QEMU v8, so I suspect I'm missing some other important changes that might be important for a stable release? - Michael Hi Michael, I sent the full list of commits to this distribution yesterday, and I see it in my Sent email folder. Copying verbatim: Michael Galaxy, I'm afraid you are out of luck with respect to qemu 8.2. It has some of the cpr reboot commits, but is missing the following: 87a2848 migration: massage cpr-reboot documentation cbdafc1 migration: options incompatible with cpr ce5db1c migration: update cpr-reboot description 9867d4d migration: stop vm for cpr 4af667f migration: notifier error checking bf78a04 migration: refactor migrate_fd_connect failures 6835f5a migration: per-mode notifiers 5663dd3 migration: MigrationNotifyFunc c763a23e migration: remove postcopy_after_devices 9d9babf migration: MigrationEvent for notifiers 3e77573 migration: convert to NotifierWithReturn d91f33c migration: remove error from notifier data be19d83 notify: pass error to notifier with return b12635f migration: fix coverity migrate_mode finding 2b58a8b tests/qtest: postcopy migration with suspend b1fdd21 tests/qtest: precopy migration with suspend 5014478 tests/qtest: option to suspend during migration f064975 tests/qtest: migration events 49a5020 migration: preserve suspended for bg_migration 58b1057 migration: preserve suspended for snapshot b4e9ddc migration: preserve suspended runstate d3c86c99 migration: propagate suspended runstate 9ff5e79 cpus: vm_resume 0f1db06 cpus: check running not RUN_STATE_RUNNING b9ae473 cpus: stop vm in suspended runstate f06f316 cpus: vm_was_suspended All of those landed in qemu 9.0. --- - Steve
Re: CPR/liveupdate: test results using prior bug fix
On 5/14/2024 11:33 AM, Michael Tokarev wrote: 14.05.2024 16:54, Michael Tokarev пишет: On 5/14/24 16:39, Michael Galaxy wrote: Steve, OK, so it does not look like this bugfix you wrote was included in 8.2.4 (which was released yesterday). Unfortunately, that means that anyone using CPR in that release will still (eventually) encounter the bug like I did. 8.2.4 is basically a "bugfix" release for 8.2.3 which I somewhat screwed up (in a minor way), plus a few currently (at the time) queued up changes. 8.2.3 was a big release though. I would recommend that y'all consider cherry-picking, perhaps, the relevant commits for a possible 8.2.5 ? Please Cc changes which are relevant for -stable to, well, qemu-sta...@nongnu.org :) Which changes needs to be picked up? Please also note this particular change does not apply cleanly to stable-8.2 branch due to other changes in this area between 8.2 and 9.0, in particular, in postcopy_start(). So if this is to be picked up for stable-8.2, I need help from someone who better understands this code to select changes to pick. Thanks, /mjt Michael Galaxy, I'm afraid you are out of luck with respect to qemu 8.2. It has some of the cpr reboot commits, but is missing the following: 87a2848 migration: massage cpr-reboot documentation cbdafc1 migration: options incompatible with cpr ce5db1c migration: update cpr-reboot description 9867d4d migration: stop vm for cpr 4af667f migration: notifier error checking bf78a04 migration: refactor migrate_fd_connect failures 6835f5a migration: per-mode notifiers 5663dd3 migration: MigrationNotifyFunc c763a23e migration: remove postcopy_after_devices 9d9babf migration: MigrationEvent for notifiers 3e77573 migration: convert to NotifierWithReturn d91f33c migration: remove error from notifier data be19d83 notify: pass error to notifier with return b12635f migration: fix coverity migrate_mode finding 2b58a8b tests/qtest: postcopy migration with suspend b1fdd21 tests/qtest: precopy migration with suspend 5014478 tests/qtest: option to suspend during migration f064975 tests/qtest: migration events 49a5020 migration: preserve suspended for bg_migration 58b1057 migration: preserve suspended for snapshot b4e9ddc migration: preserve suspended runstate d3c86c99 migration: propagate suspended runstate 9ff5e79 cpus: vm_resume 0f1db06 cpus: check running not RUN_STATE_RUNNING b9ae473 cpus: stop vm in suspended runstate f06f316 cpus: vm_was_suspended All of those landed in qemu 9.0. - Steve
Re: CPR/liveupdate: test results using prior bug fix
Hi Michael, No surprise here, I did see some of the same failure messages and they prompted me to submit the fix. They are all symptoms of "the possibility of ram and device state being out of sync" as mentioned in the commit. I am not familiar with the process for maintaining old releases for qemu. Perhaps someone on this list can comment on 8.2.3. - Steve On 5/13/2024 2:22 PM, Michael Galaxy wrote: Hi Steve, We found that this specific change in particular ("migration: stop vm for cpr") fixes a bug that we've identified in testing back-to-back live updates in a lab environment. More specifically, *without* this change (which is not available in 8.2.2, but *is* available in 9.0.0) causes the metadata save file to be corrupted when doing live updates one after another. Typically we see a corrupted save file somewhere in between 20 and 30 live updates and while doing a git bisect, we found that this change makes the problem go away. Were you aware? Is there any plan in place to cherry pick this for 8.2.3, perhaps or a plan to release 8.2.3 at some point? Here are some examples of how the bug manifests in different locations of the QEMU metadata save file: 2024-04-26T13:28:53Z qemu-system-x86_64: Failed to load mtrr_var:base 2024-04-26T13:28:53Z qemu-system-x86_64: Failed to load cpu:env.mtrr_var 2024-04-26T13:28:53Z qemu-system-x86_64: error while loading state for instance 0x1b of device 'cpu' 2024-04-26T13:28:53Z qemu-system-x86_64: load of migration failed: Input/output error And another: 2024-04-17T16:09:47Z qemu-system-x86_64: check_section_footer: Read section footer failed: -5 2024-04-17T16:09:47Z qemu-system-x86_64: load of migration failed: Invalid argument And another: 2024-04-30T21:53:29Z qemu-system-x86_64: Unable to read ID string for section 163 2024-04-30T21:53:29Z qemu-system-x86_64: load of migration failed: Invalid argument And another: 2024-05-01T16:01:44Z qemu-system-x86_64: Unable to read ID string for section 164 2024-05-01T16:01:44Z qemu-system-x86_64: load of migration failed: Invalid argument As you can see, they occur quite randomly, but generally it takes at least 20-30+ live updates before the problem occurs. - Michael On 2/27/24 23:13, pet...@redhat.com wrote: From: Steve Sistare When migration for cpr is initiated, stop the vm and set state RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the possibility of ram and device state being out of sync, and guarantees that a guest in the suspended state remains suspended, because qmp_cont rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. Signed-off-by: Steve Sistare Reviewed-by: Peter Xu Link:https://urldefense.com/v3/__https://lore.kernel.org/r/1708622920-68779-11-git-send-email-steven.sistare@oracle.com__;!!GjvTz_vk!QLsFOCX-x2U9bzAo98SdidKlomHrmf_t0UmQKtgudoIcaDVoAJOPm39ZqaNP_nT5I8QqVfSgwhDZmg$ Signed-off-by: Peter Xu --- include/migration/misc.h | 1 + migration/migration.h| 2 -- migration/migration.c| 51 3 files changed, 32 insertions(+), 22 deletions(-) diff --git a/include/migration/misc.h b/include/migration/misc.h index e4933b815b..5d1aa593ed 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -60,6 +60,7 @@ void migration_object_init(void); void migration_shutdown(void); bool migration_is_idle(void); bool migration_is_active(MigrationState *); +bool migrate_mode_is_cpr(MigrationState *); typedef enum MigrationEventType { MIG_EVENT_PRECOPY_SETUP, diff --git a/migration/migration.h b/migration/migration.h index aef8afbe1f..65c0b61cbd 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -541,6 +541,4 @@ int migration_rp_wait(MigrationState *s); */ void migration_rp_kick(MigrationState *s); -int migration_stop_vm(RunState state); - #endif diff --git a/migration/migration.c b/migration/migration.c index 37c836b0b0..90a90947fb 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -167,11 +167,19 @@ static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp) return (a > b) - (a < b); } -int migration_stop_vm(RunState state) +static int migration_stop_vm(MigrationState *s, RunState state) { -int ret = vm_stop_force_state(state); +int ret; + +migration_downtime_start(s); + +s->vm_old_state = runstate_get(); +global_state_store(); + +ret = vm_stop_force_state(state); trace_vmstate_downtime_checkpoint("src-vm-stopped"); +trace_migration_completion_vm_stop(ret); return ret; } @@ -1602,6 +1610,11 @@ bool migration_is_active(MigrationState *s) s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE); } +bool migrate_mode_is_cpr(MigrationState *s) +{ +return s->parameters.mode == MIG_MODE_CPR_REBOOT; +} + int migrate_init(MigrationState *s, Error **errp) { int ret; @@ -2454,10 +2467,7 @@ static int postcopy_start(MigrationState *ms,
Re: [PATCH V1 26/26] migration: only-migratable-modes
On 5/9/2024 3:14 PM, Fabiano Rosas wrote: Steve Sistare writes: Add the only-migratable-modes option as a generalization of only-migratable. Only devices that support all requested modes are allowed. Signed-off-by: Steve Sistare --- include/migration/misc.h | 3 +++ include/sysemu/sysemu.h| 1 - migration/migration-hmp-cmds.c | 26 +- migration/migration.c | 22 +- migration/savevm.c | 2 +- qemu-options.hx| 16 ++-- system/globals.c | 1 - system/vl.c| 13 - target/s390x/cpu_models.c | 4 +++- 9 files changed, 75 insertions(+), 13 deletions(-) diff --git a/include/migration/misc.h b/include/migration/misc.h index 5b963ba..3ad2cd9 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -119,6 +119,9 @@ bool migration_incoming_postcopy_advised(void); /* True if background snapshot is active */ bool migration_in_bg_snapshot(void); +void migration_set_required_mode(MigMode mode); +bool migration_mode_required(MigMode mode); + /* migration/block-dirty-bitmap.c */ void dirty_bitmap_mig_init(void); diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h index 5b4397e..0a9c4b4 100644 --- a/include/sysemu/sysemu.h +++ b/include/sysemu/sysemu.h @@ -8,7 +8,6 @@ /* vl.c */ -extern int only_migratable; extern const char *qemu_name; extern QemuUUID qemu_uuid; extern bool qemu_uuid_set; diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c index 414c7e8..ca913b7 100644 --- a/migration/migration-hmp-cmds.c +++ b/migration/migration-hmp-cmds.c @@ -16,6 +16,7 @@ #include "qemu/osdep.h" #include "block/qapi.h" #include "migration/snapshot.h" +#include "migration/misc.h" #include "monitor/hmp.h" #include "monitor/monitor.h" #include "qapi/error.h" @@ -33,6 +34,28 @@ #include "options.h" #include "migration.h" +static void migration_dump_modes(Monitor *mon) +{ +int mode, n = 0; + +monitor_printf(mon, "only-migratable-modes: "); + +for (mode = 0; mode < MIG_MODE__MAX; mode++) { +if (migration_mode_required(mode)) { +if (n++) { +monitor_printf(mon, ","); +} +monitor_printf(mon, "%s", MigMode_str(mode)); +} +} + +if (!n) { +monitor_printf(mon, "none\n"); +} else { +monitor_printf(mon, "\n"); +} +} + static void migration_global_dump(Monitor *mon) { MigrationState *ms = migrate_get_current(); @@ -41,7 +64,7 @@ static void migration_global_dump(Monitor *mon) monitor_printf(mon, "store-global-state: %s\n", ms->store_global_state ? "on" : "off"); monitor_printf(mon, "only-migratable: %s\n", - only_migratable ? "on" : "off"); + migration_mode_required(MIG_MODE_NORMAL) ? "on" : "off"); monitor_printf(mon, "send-configuration: %s\n", ms->send_configuration ? "on" : "off"); monitor_printf(mon, "send-section-footer: %s\n", @@ -50,6 +73,7 @@ static void migration_global_dump(Monitor *mon) ms->decompress_error_check ? "on" : "off"); monitor_printf(mon, "clear-bitmap-shift: %u\n", ms->clear_bitmap_shift); +migration_dump_modes(mon); } void hmp_info_migrate(Monitor *mon, const QDict *qdict) diff --git a/migration/migration.c b/migration/migration.c index 4984dee..5535b84 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -1719,17 +1719,29 @@ static bool is_busy(Error **reasonp, Error **errp) return false; } -static bool is_only_migratable(Error **reasonp, Error **errp, int modes) +static int migration_modes_required; + +void migration_set_required_mode(MigMode mode) +{ +migration_modes_required |= BIT(mode); +} + +bool migration_mode_required(MigMode mode) +{ +return !!(migration_modes_required & BIT(mode)); +} + +static bool modes_are_required(Error **reasonp, Error **errp, int modes) { ERRP_GUARD(); -if (only_migratable && (modes & BIT(MIG_MODE_NORMAL))) { +if (migration_modes_required & modes) { error_propagate_prepend(errp, *reasonp, -"disallowing migration blocker " -"(--only-migratable) for: "); +"-only-migratable{-modes} specified, but: "); extra space before 'specified' Will fix, thanks. *reasonp = NULL; return true; } + return false; } @@ -1783,7 +1795,7 @@ int migrate_add_blocker_modes(Error **reasonp, Error **errp, MigMode mode, ...) modes = get_modes(mode, ap); va_end(ap); -if (is_only_migratable(reasonp, errp, modes)) { +if (modes_are_required(reasonp, errp, modes)) { return -EACCES; } else if (is_busy(reasonp, errp)) {
Re: [PATCH V1 13/26] physmem: ram_block_create
On 5/13/2024 2:37 PM, Fabiano Rosas wrote: Steve Sistare writes: Create a common subroutine to allocate a RAMBlock, de-duping the code to populate its common fields. Add a trace point for good measure. No functional change. Signed-off-by: Steve Sistare --- system/physmem.c| 47 ++- system/trace-events | 3 +++ 2 files changed, 29 insertions(+), 21 deletions(-) diff --git a/system/physmem.c b/system/physmem.c index c3d04ca..6216b14 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -52,6 +52,7 @@ #include "sysemu/hw_accel.h" #include "sysemu/xen-mapcache.h" #include "trace/trace-root.h" +#include "trace.h" #ifdef CONFIG_FALLOCATE_PUNCH_HOLE #include @@ -1918,11 +1919,29 @@ out_free: } } +static RAMBlock *ram_block_create(MemoryRegion *mr, ram_addr_t size, + ram_addr_t max_size, uint32_t ram_flags) +{ +RAMBlock *rb = g_malloc0(sizeof(*rb)); + +rb->used_length = size; +rb->max_length = max_size; +rb->fd = -1; +rb->flags = ram_flags; +rb->page_size = qemu_real_host_page_size(); +rb->mr = mr; +rb->guest_memfd = -1; +trace_ram_block_create(rb->idstr, rb->flags, rb->fd, rb->used_length, There's no idstr at this point, is there? I think this needs to be memory_region_name(mr). Thanks, will fix. That is a bug in my patch factoring. I add the call to qemu_ram_set_idstr in patch "physmem: set ram block idstr earlier". - Steve + rb->max_length, mr->align); +return rb; +} + #ifdef CONFIG_POSIX RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr, uint32_t ram_flags, int fd, off_t offset, Error **errp) { +void *host; RAMBlock *new_block; Error *local_err = NULL; int64_t file_size, file_align; @@ -1962,19 +1981,14 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr, return NULL; } -new_block = g_malloc0(sizeof(*new_block)); -new_block->mr = mr; -new_block->used_length = size; -new_block->max_length = size; -new_block->flags = ram_flags; -new_block->guest_memfd = -1; -new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset, - errp); -if (!new_block->host) { +new_block = ram_block_create(mr, size, size, ram_flags); +host = file_ram_alloc(new_block, size, fd, !file_size, offset, errp); +if (!host) { g_free(new_block); return NULL; } +new_block->host = host; ram_block_add(new_block, _err); if (local_err) { g_free(new_block); @@ -1982,7 +1996,6 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr, return NULL; } return new_block; - } @@ -2054,18 +2067,10 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size, align = MAX(align, TARGET_PAGE_SIZE); size = ROUND_UP(size, align); max_size = ROUND_UP(max_size, align); - -new_block = g_malloc0(sizeof(*new_block)); -new_block->mr = mr; -new_block->resized = resized; -new_block->used_length = size; -new_block->max_length = max_size; assert(max_size >= size); -new_block->fd = -1; -new_block->guest_memfd = -1; -new_block->page_size = qemu_real_host_page_size(); -new_block->host = host; -new_block->flags = ram_flags; +new_block = ram_block_create(mr, size, max_size, ram_flags); +new_block->resized = resized; + ram_block_add(new_block, _err); if (local_err) { g_free(new_block); diff --git a/system/trace-events b/system/trace-events index 69c9044..f0a80ba 100644 --- a/system/trace-events +++ b/system/trace-events @@ -38,3 +38,6 @@ dirtylimit_state_finalize(void) dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us" dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64 dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us" + +# physmem.c +ram_block_create(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length, size_t align) "%s, flags %u, fd %d, len %lu, maxlen %lu, align %lu"
Re: [PATCH V1 24/26] seccomp: cpr-exec blocker
On 5/10/2024 3:54 AM, Daniel P. Berrangé wrote: On Mon, Apr 29, 2024 at 08:55:33AM -0700, Steve Sistare wrote: cpr-exec mode needs permission to exec. Block it if permission is denied. Signed-off-by: Steve Sistare --- include/sysemu/seccomp.h | 1 + system/qemu-seccomp.c| 10 -- system/vl.c | 6 ++ 3 files changed, 15 insertions(+), 2 deletions(-) diff --git a/include/sysemu/seccomp.h b/include/sysemu/seccomp.h index fe85989..023c0a1 100644 --- a/include/sysemu/seccomp.h +++ b/include/sysemu/seccomp.h @@ -22,5 +22,6 @@ #define QEMU_SECCOMP_SET_RESOURCECTL (1 << 4) int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp); +uint32_t qemu_seccomp_get_opts(void); #endif diff --git a/system/qemu-seccomp.c b/system/qemu-seccomp.c index 5c20ac0..0d2a561 100644 --- a/system/qemu-seccomp.c +++ b/system/qemu-seccomp.c @@ -360,12 +360,18 @@ static int seccomp_start(uint32_t seccomp_opts, Error **errp) return rc < 0 ? -1 : 0; } +static uint32_t seccomp_opts; + +uint32_t qemu_seccomp_get_opts(void) +{ +return seccomp_opts; +} + int parse_sandbox(void *opaque, QemuOpts *opts, Error **errp) { if (qemu_opt_get_bool(opts, "enable", false)) { -uint32_t seccomp_opts = QEMU_SECCOMP_SET_DEFAULT -| QEMU_SECCOMP_SET_OBSOLETE; const char *value = NULL; +seccomp_opts = QEMU_SECCOMP_SET_DEFAULT | QEMU_SECCOMP_SET_OBSOLETE; value = qemu_opt_get(opts, "obsolete"); if (value) { diff --git a/system/vl.c b/system/vl.c index 7252100..b76881e 100644 --- a/system/vl.c +++ b/system/vl.c @@ -76,6 +76,7 @@ #include "hw/block/block.h" #include "hw/i386/x86.h" #include "hw/i386/pc.h" +#include "migration/blocker.h" #include "migration/cpr.h" #include "migration/misc.h" #include "migration/snapshot.h" @@ -2493,6 +2494,11 @@ static void qemu_process_early_options(void) QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL); if (olist) { qemu_opts_foreach(olist, parse_sandbox, NULL, _fatal); +if (qemu_seccomp_get_opts() & QEMU_SECCOMP_SET_SPAWN) { +Error *blocker = NULL; +error_setg(, "-sandbox denies exec for cpr-exec"); +migrate_add_blocker_mode(, MIG_MODE_CPR_EXEC, _fatal); +} } #endi There are a whole pile of features that get blocked wehn -sandbox is used. I'm not convinced we should be adding code to check for specific blocked features, as such a list will always be incomplete at best, and incorrectly block things at worst. I view this primarily as a documentation task for the cpr-exec command. For cpr and live migration, we do our best to prevent breaking the guest for cases we know will fail. Independently, a clear error message here will reduce error reports for this new cpr feature. Would it be more palatable if I move this blocker's creation to cpr_mig_init? - Steve
Re: [PATCH V1 22/26] migration: ram block cpr-exec blockers
On 5/9/2024 2:01 PM, Fabiano Rosas wrote: Steve Sistare writes: Unlike cpr-reboot mode, cpr-exec mode cannot save volatile ram blocks in the migration stream file and recreate them later, because the physical memory for the blocks is pinned and registered for vfio. Add an exec-mode blocker for volatile ram blocks. Also add a blocker for RAM_GUEST_MEMFD. Preserving guest_memfd may be sufficient for cpr-exec, but it has not been tested yet. - Steve extra text here Will fix, thanks - steve Signed-off-by: Steve Sistare Reviewed-by: Fabiano Rosas
Re: [PATCH V1 09/26] migration: vmstate_register_named
On 5/9/2024 10:32 AM, Fabiano Rosas wrote: Fabiano Rosas writes: Steve Sistare writes: Define vmstate_register_named which takes the instance name as its first parameter, instead of generating the name from VMStateIf of the Object. This will be needed to register objects that are not Objects. Pass the new name parameter to vmstate_register_with_alias_id. Signed-off-by: Steve Sistare Reviewed-by: Fabiano Rosas Actually, can't we define a wrapper type just for this purpose? For example, looking at dbus-vmstate.c: One would need to provide a separate wrapper for each struct to be registered as vmstate. This patch set only has RAMBlock, but there are more coming in my next patch sets. vmstate_register_named avoids adding such boilerplate, and makes it easier to add more cpr state in the future. - Steve static void dbus_vmstate_class_init(ObjectClass *oc, void *data) { ... VMStateIfClass *vc = VMSTATE_IF_CLASS(oc); vc->get_id = dbus_vmstate_get_id; ... } static const TypeInfo dbus_vmstate_info = { .name = TYPE_DBUS_VMSTATE, .parent = TYPE_OBJECT, .instance_size = sizeof(DBusVMState), .instance_finalize = dbus_vmstate_finalize, .class_init = dbus_vmstate_class_init, .interfaces = (InterfaceInfo[]) { { TYPE_USER_CREATABLE }, // without this one { TYPE_VMSTATE_IF }, { } } }; static void register_types(void) { type_register_static(_vmstate_info); } type_init(register_types);
Re: [PATCH V1 05/26] migration: precreate vmstate
On 5/7/2024 5:02 PM, Fabiano Rosas wrote: Steve Sistare writes: Provide the VMStateDescription precreate field to mark objects that must be loaded on the incoming side before devices have been created, because they provide properties that will be needed at creation time. They will be saved to and loaded from their own QEMUFile, via It's not obvious to me what the reason is to have a separate QEMUFile. Could you expand on this? The migration stream is read in the calling sequence at B below, but precreate state is needed at A before chardev and memory backends are created. main() qemu_init() A: qemu_create_early_backends() qemu_create_late_backends() migration_object_init() qmp_x_exit_preconfig() qmp_migrate_incoming() qemu_default_main() qemu_main_loop() fd_accept_incoming_migration() migration_channel_process_incoming() migration_ioc_process_incoming() migration_incoming_process() process_incoming_migration_co() B: qemu_loadvm_state() precreate objects could be emitted first in the existing migration stream and read at A, but this requires untangling numerous ordering dependencies amongst migration_object_init, qemu_create_machine, configure_accelerators, monitor init, and the main loop. - Steve
Re: [PATCH V1 06/26] migration: precreate vmstate for exec
On 5/6/2024 7:34 PM, Fabiano Rosas wrote: Steve Sistare writes: Provide migration_precreate_save for saving precreate vmstate across exec. Create a memfd, save its value in the environment, and serialize state to it. Reverse the process in migration_precreate_load. Signed-off-by: Steve Sistare --- include/migration/misc.h | 5 ++ migration/meson.build| 1 + migration/precreate.c| 139 +++ 3 files changed, 145 insertions(+) create mode 100644 migration/precreate.c diff --git a/include/migration/misc.h b/include/migration/misc.h index c9e200f..cf30351 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -56,6 +56,11 @@ AnnounceParameters *migrate_announce_params(void); void dump_vmstate_json_to_file(FILE *out_fp); +/* migration/precreate.c */ +int migration_precreate_save(Error **errp); +void migration_precreate_unsave(void); +int migration_precreate_load(Error **errp); + /* migration/migration.c */ void migration_object_init(void); void migration_shutdown(void); diff --git a/migration/meson.build b/migration/meson.build index f76b1ba..50e7cb2 100644 --- a/migration/meson.build +++ b/migration/meson.build @@ -26,6 +26,7 @@ system_ss.add(files( 'ram-compress.c', 'options.c', 'postcopy-ram.c', + 'precreate.c', 'savevm.c', 'socket.c', 'tls.c', diff --git a/migration/precreate.c b/migration/precreate.c new file mode 100644 index 000..0bf5e1f --- /dev/null +++ b/migration/precreate.c @@ -0,0 +1,139 @@ +/* + * Copyright (c) 2022, 2024 Oracle and/or its affiliates. + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "qemu/osdep.h" +#include "qemu/cutils.h" +#include "qemu/memfd.h" +#include "qapi/error.h" +#include "io/channel-file.h" +#include "migration/misc.h" +#include "migration/qemu-file.h" +#include "migration/savevm.h" + +#define PRECREATE_STATE_NAME "QEMU_PRECREATE_STATE" + +static QEMUFile *qemu_file_new_fd_input(int fd, const char *name) +{ +g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd); +QIOChannel *ioc = QIO_CHANNEL(fioc); +qio_channel_set_name(ioc, name); +return qemu_file_new_input(ioc); +} + +static QEMUFile *qemu_file_new_fd_output(int fd, const char *name) +{ +g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd); +QIOChannel *ioc = QIO_CHANNEL(fioc); +qio_channel_set_name(ioc, name); +return qemu_file_new_output(ioc); +} + +static int memfd_create_named(const char *name, Error **errp) +{ +int mfd; +char val[16]; + +mfd = memfd_create(name, 0); +if (mfd < 0) { +error_setg_errno(errp, errno, "memfd_create failed"); +return -1; +} + +/* Remember mfd in environment for post-exec load */ +qemu_clear_cloexec(mfd); +snprintf(val, sizeof(val), "%d", mfd); +g_setenv(name, val, 1); + +return mfd; +} + +static int memfd_find_named(const char *name, int *mfd_p, Error **errp) +{ +const char *val = g_getenv(name); + +if (!val) { +*mfd_p = -1; +return 0; /* No memfd was created, not an error */ +} +g_unsetenv(name); +if (qemu_strtoi(val, NULL, 10, mfd_p)) { +error_setg(errp, "Bad %s env value %s", PRECREATE_STATE_NAME, val); +return -1; +} +lseek(*mfd_p, 0, SEEK_SET); +return 0; +} + +static void memfd_delete_named(const char *name) +{ +int mfd; +const char *val = g_getenv(name); + +if (val) { +g_unsetenv(name); +if (!qemu_strtoi(val, NULL, 10, )) { +close(mfd); +} +} +} + +static QEMUFile *qemu_file_new_memfd_output(const char *name, Error **errp) +{ +int mfd = memfd_create_named(name, errp); + +if (mfd < 0) { +return NULL; +} + +return qemu_file_new_fd_output(mfd, name); +} + +static QEMUFile *qemu_file_new_memfd_input(const char *name, Error **errp) +{ +int ret, mfd; + +ret = memfd_find_named(name, , errp); +if (ret || mfd < 0) { +return NULL; +} + +return qemu_file_new_fd_input(mfd, name); +} + +int migration_precreate_save(Error **errp) +{ +QEMUFile *f = qemu_file_new_memfd_output(PRECREATE_STATE_NAME, errp); + +if (!f) { +return -1; +} else if (qemu_savevm_precreate_save(f, errp)) { +memfd_delete_named(PRECREATE_STATE_NAME); +return -1; +} else { +/* Do not close f, as mfd must remain open. */ +return 0; +} +} + +void migration_precreate_unsave(void) +{ +memfd_delete_named(PRECREATE_STATE_NAME); +} + +int migration_precreate_load(Error **errp) +{ +int ret; +QEMUFile *f = qemu_file_new_memfd_input(PRECREATE_STATE_NAME, errp); Can we avoid the QEMUFile? I don't see it being exported from this file. It is not exported, but within this file, it is the basis for all read and write operations, via the existing functions
Re: [PATCH V1 03/26] migration: SAVEVM_FOREACH
On 5/6/2024 7:17 PM, Fabiano Rosas wrote: Steve Sistare writes: Define an abstraction SAVEVM_FOREACH to loop over all savevm state handlers, and replace QTAILQ_FOREACH. Define variants for ALL so we can loop over all handlers vs a subset of handlers in a subsequent patch, but at this time there is no distinction between the two. No functional change. Signed-off-by: Steve Sistare --- migration/savevm.c | 55 +++--- 1 file changed, 32 insertions(+), 23 deletions(-) diff --git a/migration/savevm.c b/migration/savevm.c index 4509482..6829ba3 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -237,6 +237,15 @@ static SaveState savevm_state = { .global_section_id = 0, }; +#define SAVEVM_FOREACH(se, entry)\ +QTAILQ_FOREACH(se, _state.handlers, entry)\ + +#define SAVEVM_FOREACH_ALL(se, entry)\ +QTAILQ_FOREACH(se, _state.handlers, entry) This feels worse than SAVEVM_FOREACH_NOT_PRECREATED. We'll have to keep coming back to the definition to figure out which FOREACH is the real deal. I take your point, but the majority of the loops do not care about precreated objects, so it seems backwards to make them more verbose with SAVEVM_FOREACH_NOT_PRECREATE. I can go either way, but we need Peter's opinion also. + +#define SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) \ +QTAILQ_FOREACH_SAFE(se, _state.handlers, entry, new_se) + static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id); static bool should_validate_capability(int capability) @@ -674,7 +683,7 @@ static uint32_t calculate_new_instance_id(const char *idstr) SaveStateEntry *se; uint32_t instance_id = 0; -QTAILQ_FOREACH(se, _state.handlers, entry) { +SAVEVM_FOREACH_ALL(se, entry) { In this patch we can't have both instances... if (strcmp(idstr, se->idstr) == 0 && instance_id <= se->instance_id) { instance_id = se->instance_id + 1; @@ -690,7 +699,7 @@ static int calculate_compat_instance_id(const char *idstr) SaveStateEntry *se; int instance_id = 0; -QTAILQ_FOREACH(se, _state.handlers, entry) { +SAVEVM_FOREACH(se, entry) { ...otherwise one of the two changes will go undocumented because the actual reason for it will only be described in the next patch. Sure, I'll move this to the precreate patch. - Steve if (!se->compat) { continue; } @@ -816,7 +825,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque) } pstrcat(id, sizeof(id), idstr); -QTAILQ_FOREACH_SAFE(se, _state.handlers, entry, new_se) { +SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) { if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) { savevm_state_handler_remove(se); g_free(se->compat); @@ -939,7 +948,7 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd, { SaveStateEntry *se, *new_se; -QTAILQ_FOREACH_SAFE(se, _state.handlers, entry, new_se) { +SAVEVM_FOREACH_SAFE_ALL(se, entry, new_se) { if (se->vmsd == vmsd && se->opaque == opaque) { savevm_state_handler_remove(se); g_free(se->compat); @@ -1223,7 +1232,7 @@ bool qemu_savevm_state_blocked(Error **errp) { SaveStateEntry *se; -QTAILQ_FOREACH(se, _state.handlers, entry) { +SAVEVM_FOREACH(se, entry) { if (se->vmsd && se->vmsd->unmigratable) { error_setg(errp, "State blocked by non-migratable device '%s'", se->idstr); @@ -1237,7 +1246,7 @@ void qemu_savevm_non_migratable_list(strList **reasons) { SaveStateEntry *se; -QTAILQ_FOREACH(se, _state.handlers, entry) { +SAVEVM_FOREACH(se, entry) { if (se->vmsd && se->vmsd->unmigratable) { QAPI_LIST_PREPEND(*reasons, g_strdup_printf("non-migratable device: %s", @@ -1276,7 +1285,7 @@ bool qemu_savevm_state_guest_unplug_pending(void) { SaveStateEntry *se; -QTAILQ_FOREACH(se, _state.handlers, entry) { +SAVEVM_FOREACH(se, entry) { if (se->vmsd && se->vmsd->dev_unplug_pending && se->vmsd->dev_unplug_pending(se->opaque)) { return true; @@ -1291,7 +1300,7 @@ int qemu_savevm_state_prepare(Error **errp) SaveStateEntry *se; int ret; -QTAILQ_FOREACH(se, _state.handlers, entry) { +SAVEVM_FOREACH(se, entry) { if (!se->ops || !se->ops->save_prepare) { continue; } @@ -1321,7 +1330,7 @@ int qemu_savevm_state_setup(QEMUFile *f, Error **errp) json_writer_start_array(ms->vmdesc, "devices"); trace_savevm_state_setup(); -QTAILQ_FOREACH(se, _state.handlers, entry) { +SAVEVM_FOREACH(se, entry) { if (se->vmsd && se->vmsd->early_setup) {
cpr-exec doc (was Re: [PATCH V1 00/26] Live update: cpr-exec)
On 4/29/2024 11:55 AM, Steve Sistare wrote: This patch series adds the live migration cpr-exec mode. Here is the text I plan to add to docs/devel/migration/CPR.rst. It is premature for me to submit this as a patch, because it includes all the functionality I plan to add in this and future series, but it may help you while reviewing this series. - Steve cpr-exec mode --- In this mode, QEMU stops the VM, writes VM state to the migration URI, and directly exec's a new version of QEMU on the same host, replacing the original process while retaining its PID. Guest RAM is preserved in place, albeit with new virtual addresses. The user completes the migration by specifying the ``-incoming`` option, and by issuing the ``migrate-incoming`` command if necessary; see details below. This mode supports vfio devices by preserving device descriptors and hence kernel state across the exec, even for devices that do not support live migration, and preserves tap and vhost descriptors. cpr-exec also preserves descriptors for a subset of chardevs, including socket, file, parallel, pipe, serial, pty, stdio, and null. chardevs that support cpr-exec have the QEMU_CHAR_FEATURE_CPR set in the Chardev object. The client side of a preserved chardev sees no loss of connectivity during cpr-exec. More chardevs could be preserved with additional developement. All chardevs have a ``reopen-on-cpr`` option which causes the chardev to be closed and reopened during cpr-exec. This can be set to allow cpr-exec when the configuration includes a chardev (such as vc) that does not have QEMU_CHAR_FEATURE_CPR. Because the old and new QEMU instances are not active concurrently, the URI cannot be a type that streams data from one instance to the other. Usage ^ Arguments for the new QEMU process are taken from the @cpr-exec-args parameter. The first argument should be the path of a new QEMU binary, or a prefix command that exec's the new QEMU binary, and the arguments should include the ''-incoming'' option. Memory backend objects must have the ``share=on`` attribute, and must be mmap'able in the new QEMU process. For example, memory-backend-file is acceptable, but memory-backend-ram is not. The VM must be started with the ``-machine memfd-alloc=on`` option. This causes implicit RAM blocks (those not explicitly described by a memory-backend object) to be allocated by mmap'ing a memfd. Examples include VGA, ROM, and even guest RAM when it is specified without without reference to a memory-backend object. Add the ``-only-migratable-modes cpr-exec`` option to guarantee that the configuration supports cpr-exec. QEMU will exit at start time if not. Outgoing: * Set the migration mode parameter to ``cpr-exec``. * Set the ``cpr-exec-args`` parameter. * Issue the ``migrate`` command. It is recommended the the URI be a ``file`` type, but one can use other types such as ``exec``, provided the command captures all the data from the outgoing side, and provides all the data to the incoming side. Incoming: * You do not need to explicitly start new QEMU. It is started as a side effect of the migrate command above. * If the VM was running when the outgoing ``migrate`` command was issued, then QEMU automatically resumes VM execution. Example 1: incoming URI ^^^ In these examples, we simply restart the same version of QEMU, but in a real scenario one would set a new QEMU binary path in cpr-exec-args. :: # qemu-kvm -monitor stdio -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G -machine memfd-alloc=on ... QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running (qemu) migrate_set_parameter mode cpr-exec (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming file:vm.state (qemu) migrate -d file:vm.state (qemu) QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running Example 2: incoming defer ^ :: # qemu-kvm -monitor stdio -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -m 4G -machine memfd-alloc=on ... QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status VM status: running (qemu) migrate_set_parameter mode cpr-exec (qemu) migrate_set_parameter cpr-exec-args qemu-kvm ... -incoming defer (qemu) migrate -d file:vm.state (qemu) QEMU 9.1.50 monitor - type 'help' for more information (qemu) info status status: paused (inmigrate) (qemu) migrate_incoming file:vm.state (qemu) info status VM status: running Caveats ^^^ cpr-exec mode may not be used with postcopy, background-snapshot, or COLO. cpr-exec mode requires permission to use the exec system call, which is denied by certain sandbox options, such as spawn. Use finer grained
Re: [PATCH V1 20/26] migration: cpr-exec mode
On 5/2/2024 8:23 AM, Markus Armbruster wrote: Steve Sistare writes: Add the cpr-exec migration mode. Usage: qemu-system-$arch -machine memfd-alloc=on ... migrate_set_parameter mode cpr-exec migrate_set_parameter cpr-exec-args \ ... -incoming migrate -d The migrate command stops the VM, saves state to the URI, directly exec's a new version of QEMU on the same host, replacing the original process while retaining its PID, and loads state from the URI. Guest RAM is preserved in place, albeit with new virtual addresses. Arguments for the new QEMU process are taken from the @cpr-exec-args parameter. The first argument should be the path of a new QEMU binary, or a prefix command that exec's the new QEMU binary. Because old QEMU terminates when new QEMU starts, one cannot stream data between the two, so the URI must be a type, such as a file, that reads all data before old QEMU exits. Memory backend objects must have the share=on attribute, and must be mmap'able in the new QEMU process. For example, memory-backend-file is acceptable, but memory-backend-ram is not. The VM must be started with the '-machine memfd-alloc=on' option. This causes implicit ram blocks (those not explicitly described by a memory-backend object) to be allocated by mmap'ing a memfd. Examples include VGA, ROM, and even guest RAM when it is specified without a memory-backend object. The implementation saves precreate vmstate at the end of normal migration in migrate_fd_cleanup, and tells the main loop to call cpr_exec. Incoming qemu loads preceate state early, before objects are created. The memfds are kept open across exec by clearing the close-on-exec flag, their values are saved in precreate vmstate, and they are mmap'd in new qemu. Note that the memfd-alloc option is not related to memory-backend-memfd. Later patches add support for memory-backend-memfd, and for additional devices, including vfio, chardev, and more. Signed-off-by: Steve Sistare [...] diff --git a/qapi/migration.json b/qapi/migration.json index 49710e7..7c5f45f 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -665,9 +665,37 @@ # or COLO. # # (since 8.2) +# +# @cpr-exec: The migrate command stops the VM, saves state to the URI, +# directly exec's a new version of QEMU on the same host, +# replacing the original process while retaining its PID, and +# loads state from the URI. Guest RAM is preserved in place, +# albeit with new virtual addresses. Do you mean the virtual addresses of guest RAM may differ betwen old and new QEMU process? The VA at which a guest RAM segment is mapped in the QEMU process changes. The end user would not notice or care, so I'll drop that detail here. +# +# Arguments for the new QEMU process are taken from the +# @cpr-exec-args parameter. The first argument should be the +# path of a new QEMU binary, or a prefix command that exec's the +# new QEMU binary. What's a "prefix command"? A wrapper script, perhaps? A prefix command is any command of the form: command1 command1-args command2 command2-args where command1 performs some set up before exec'ing command2. However, I will drop the word "prefix", it adds no meaning here. +# +# Because old QEMU terminates when new QEMU starts, one cannot +# stream data between the two, so the URI must be a type, such as +# a file, that reads all data before old QEMU exits. What happens when you specify a URI that doesn't? Old QEMU will quietly block indefinitely writing to the URI. +# +# Memory backend objects must have the share=on attribute, and +# must be mmap'able in the new QEMU process. For example, +# memory-backend-file is acceptable, but memory-backend-ram is +# not. +# +# The VM must be started with the '-machine memfd-alloc=on' What happens when you don't? If '-only-migratable-modes cpr-exec' is specified, then QEMU will fail to start, and print a clear error message. Otherwise, a blocker is registered and any attempt to cpr-exec will fail with a clear error message. - Steve +# option. This causes implicit ram blocks -- those not explicitly +# described by a memory-backend object -- to be allocated by +# mmap'ing a memfd. Examples include VGA, ROM, and even guest +# RAM when it is specified without a memory-backend object. +# +# (since 9.1) ## { 'enum': 'MigMode', - 'data': [ 'normal', 'cpr-reboot' ] } + 'data': [ 'normal', 'cpr-reboot', 'cpr-exec' ] } ## # @ZeroPageDetection: [...]
Re: [PATCH V1 18/26] migration: cpr-exec-args parameter
On 5/2/2024 8:23 AM, Markus Armbruster wrote: Steve Sistare writes: Create the cpr-exec-args migration parameter, defined as a list of strings. It will be used for cpr-exec migration mode in a subsequent patch. No functional change, except that cpr-exec-args is shown by the 'info migrate' command. Signed-off-by: Steve Sistare [...] diff --git a/qapi/migration.json b/qapi/migration.json index 8c65b90..49710e7 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -914,6 +914,9 @@ # See description in @ZeroPageDetection. Default is 'multifd'. # (since 9.0) # +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode. +#See @cpr-exec for details. (Since 9.1) +# You mean migration mode @cpr-exec, don't you? If yes, dangling reference until PATCH 20 adds it. Okay, but worth a mention in the commit message. Suggest "See MigMode @cpr-exec for details." Yes to all. Will update as you suggest. - Steve # Features: # # @deprecated: Member @block-incremental is deprecated. Use @@ -948,7 +951,8 @@ { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] }, 'vcpu-dirty-limit', 'mode', - 'zero-page-detection'] } + 'zero-page-detection', + 'cpr-exec-args'] } ## # @MigrateSetParameters: @@ -1122,6 +1126,9 @@ # See description in @ZeroPageDetection. Default is 'multifd'. # (since 9.0) # +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode. +#See @cpr-exec for details. (Since 9.1) +# # Features: # # @deprecated: Member @block-incremental is deprecated. Use @@ -1176,7 +1183,8 @@ 'features': [ 'unstable' ] }, '*vcpu-dirty-limit': 'uint64', '*mode': 'MigMode', -'*zero-page-detection': 'ZeroPageDetection'} } +'*zero-page-detection': 'ZeroPageDetection', +'*cpr-exec-args': [ 'str' ]} } ## # @migrate-set-parameters: @@ -1354,6 +1362,9 @@ # See description in @ZeroPageDetection. Default is 'multifd'. # (since 9.0) # +# @cpr-exec-args: Arguments passed to new QEMU for @cpr-exec mode. +#See @cpr-exec for details. (Since 9.1) +# # Features: # # @deprecated: Member @block-incremental is deprecated. Use @@ -1405,7 +1416,8 @@ 'features': [ 'unstable' ] }, '*vcpu-dirty-limit': 'uint64', '*mode': 'MigMode', -'*zero-page-detection': 'ZeroPageDetection'} } +'*zero-page-detection': 'ZeroPageDetection', +'*cpr-exec-args': [ 'str' ]} } ## # @query-migrate-parameters:
Re: [PATCH V4 10/14] migration: stop vm for cpr
On 2/29/2024 8:28 PM, Peter Xu wrote: On Thu, Feb 29, 2024 at 10:21:14AM -0500, Steven Sistare wrote: On 2/25/2024 9:08 PM, Peter Xu wrote: On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote: When migration for cpr is initiated, stop the vm and set state RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the possibility of ram and device state being out of sync, and guarantees that a guest in the suspended state remains suspended, because qmp_cont rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. Signed-off-by: Steve Sistare Reviewed-by: Peter Xu cpr-reboot mode keeps changing behavior. Could we declare it "experimental" until it's solid? Maybe a patch to document this? Normally IMHO we shouldn't merge a feature if it's not complete, however cpr-reboot is so special that the mode itself is already merged in 8.2 before I started to merge patches, and it keeps changing things. I don't know what else we can do here besides declaring it experimental and not declare it a stable feature. Hi Peter, the planned/committed functionality for cpr-reboot changed only once, in: migration: stop vm for cpr Suspension to support vfio is an enhancement which adds to the basic functionality, it does not change it. This was planned all along, but submitted as a separate If VFIO used to migrate without suspension and now it won't, it's a behavior change? VFIO could not cpr-reboot migrate without suspension. The existing vfio migration blockers applied to all modes: Error: :3a:10.0: VFIO migration is not supported in kernel Now, with suspension, it will. An addition, not a change. series to manage complexity, as I outlined in my qemu community presentation, which I emailed you at the time. Other "changes" that arose during review were just clarifications and explanations. So, I don't think cpr-reboot deserves to be condemned to experimental limbo. IMHO it's not about a feature being condemned, it's about a kindful heads-up to the users that one needs to take risk on using it until it becomes stable, it also makes developers easier because of no limitation on behavior change. If all the changes are landing, then there's no need for such a patch. If so, please propose the planned complete document patch. I had a feeling that cpr is still not fully understood by even many developers on the list. It'll be great that such document will contain all the details one needs to know on the feature, etc. meaning of the name cpr-reboot (what is "cpr"), when to use it, how to use it, how it differs from "normal" mode (etc. lifted limitations on some devices that used to block migration?), what is enforced (vm stop, suspension, etc.) and what is optionally offered (VFIO, shared-mem, etc.), the expected behaviors, etc. When send it, please copy relevant people (Alex & Cedric for VFIO, while Markus could also be a good candidate considering the QMP involvement). Submitted. - Steve
Re: [PATCH V2 00/11] privatize migration.h
On 3/11/2024 4:28 PM, Peter Xu wrote: On Mon, Mar 11, 2024 at 04:24:14PM -0400, Steven Sistare wrote: On 3/11/2024 3:45 PM, Steven Sistare wrote: On 3/11/2024 3:30 PM, Peter Xu wrote: Steve, On Mon, Mar 11, 2024 at 10:48:47AM -0700, Steve Sistare wrote: Changes in V2: * rebase to migration-next, add RB Not apply even to master branch. Note that there're >=1 PULLs sent and merged since my last reply.. Perhaps you rebased to the "old" next? I pulled from branch migration-next in https://gitlab.com/peterx/qemu a few hours ago, but I must have screwed up somewhere. I'll figure it out and post a V4. My pull was a fiew hours old, but my patches still apply cleanly to the most recent tip: a1bb5dd169f4 ("migration: Fix format in error message") I can sent that as V3, but ... Note that you must apply "migration: export fewer options" before "privatize migration.h". If that does not help, I will send V3. Ouch, I forgot that dependency... Sorry. Yeah it works now. No need to resend for now. Great! - steve
Re: [PATCH V2 00/11] privatize migration.h
On 3/11/2024 3:45 PM, Steven Sistare wrote: On 3/11/2024 3:30 PM, Peter Xu wrote: Steve, On Mon, Mar 11, 2024 at 10:48:47AM -0700, Steve Sistare wrote: Changes in V2: * rebase to migration-next, add RB Not apply even to master branch. Note that there're >=1 PULLs sent and merged since my last reply.. Perhaps you rebased to the "old" next? I pulled from branch migration-next in https://gitlab.com/peterx/qemu a few hours ago, but I must have screwed up somewhere. I'll figure it out and post a V4. My pull was a fiew hours old, but my patches still apply cleanly to the most recent tip: a1bb5dd169f4 ("migration: Fix format in error message") I can sent that as V3, but ... Note that you must apply "migration: export fewer options" before "privatize migration.h". If that does not help, I will send V3. - Steve
Re: [PATCH V2 00/11] privatize migration.h
On 3/11/2024 3:30 PM, Peter Xu wrote: Steve, On Mon, Mar 11, 2024 at 10:48:47AM -0700, Steve Sistare wrote: Changes in V2: * rebase to migration-next, add RB Not apply even to master branch. Note that there're >=1 PULLs sent and merged since my last reply.. Perhaps you rebased to the "old" next? I pulled from branch migration-next in https://gitlab.com/peterx/qemu a few hours ago, but I must have screwed up somewhere. I'll figure it out and post a V4. - Steve
Re: [PATCH V2] migration: export fewer options
On 3/1/2024 2:47 AM, Peter Xu wrote: On Thu, Feb 29, 2024 at 07:03:36AM +0100, Markus Armbruster wrote: Steven Sistare writes: Just a reminder, after our further discussion in the V1 thread, this patch is still what I propose, no updates needed. Markus, I think Peter is looking for your blessing on the new file name: include/migration/client-options.h. Not my preference, but no objection. There's yet one alternative, which is to put these exported option functions into misc.h directly. After all that's not so much, and misc.h already hold random stuff from elsewhere. Steve, would you repost this patch (with/without my above comment taken) along with your other series with a rebase to migration-next? It doesn't apply there. Re the other series: one nitpick comment on the last patch, where you may consider splitting the removal of the unused 2 functions into a standalone patch. Other than that it looks good to me. https://gitlab.com/peterx/qemu/-/tree/migration-next Both are rebased and reposted, thanks - steve
Re: [PATCH V4 10/14] migration: stop vm for cpr
On 2/25/2024 9:08 PM, Peter Xu wrote: > On Thu, Feb 22, 2024 at 09:28:36AM -0800, Steve Sistare wrote: >> When migration for cpr is initiated, stop the vm and set state >> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >> possibility of ram and device state being out of sync, and guarantees >> that a guest in the suspended state remains suspended, because qmp_cont >> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >> >> Signed-off-by: Steve Sistare > > Reviewed-by: Peter Xu > > cpr-reboot mode keeps changing behavior. > > Could we declare it "experimental" until it's solid? Maybe a patch to > document this? > > Normally IMHO we shouldn't merge a feature if it's not complete, however > cpr-reboot is so special that the mode itself is already merged in 8.2 > before I started to merge patches, and it keeps changing things. I don't > know what else we can do here besides declaring it experimental and not > declare it a stable feature. Hi Peter, the planned/committed functionality for cpr-reboot changed only once, in: migration: stop vm for cpr Suspension to support vfio is an enhancement which adds to the basic functionality, it does not change it. This was planned all along, but submitted as a separate series to manage complexity, as I outlined in my qemu community presentation, which I emailed you at the time. Other "changes" that arose during review were just clarifications and explanations. So, I don't think cpr-reboot deserves to be condemned to experimental limbo. - Steve
Re: [PATCH V4 14/14] migration: options incompatible with cpr
On 2/29/2024 12:40 AM, Peter Xu wrote: > On Thu, Feb 29, 2024 at 06:31:26AM +0100, Markus Armbruster wrote: >> Hmm, perhaps Peter can still squash in the corrections before posting >> his PR. Peter? > > The PR was sent yesterday, it's already in PeterM's -staging. I worry it's > a bit late to call a stop now. > > https://lore.kernel.org/qemu-devel/20240228051315.400759-23-pet...@redhat.com > > Steve, could you provide a standalone patch for the update? Then I'll do > the best accordingly (e.g. if the PR failed to apply I'll squash when > resend v2; or I'll pick it up for the next). I sent the patch. I also re-wrapped all cpr-reboot paragraphs to 70 columns and addressed Markus' other comments on "migration: update cpr-reboot description". Peter, if you squash it, the last sentence "cpr-reboot may not be used ..." squashes into migration: options incompatible with cpr and everything else squashes into migration: update cpr-reboot description - Steve
Re: [PATCH] migration: re-format cpr-reboot documentation
Please ignore this, I will send a V2 that incorporates additional comments from Markus that I missed in my inbox. - Steve On 2/29/2024 9:15 AM, Steve Sistare wrote: > Re-wrap the cpr-reboot documentation to 70 columns, use '@' for > cpr-reboot references, and capitalize COLO. > > Suggested-by: Markus Armbruster > Signed-off-by: Steve Sistare > --- > qapi/migration.json | 36 +++- > 1 file changed, 19 insertions(+), 17 deletions(-) > > diff --git a/qapi/migration.json b/qapi/migration.json > index c6bfe2e..b69f62a 100644 > --- a/qapi/migration.json > +++ b/qapi/migration.json > @@ -636,28 +636,30 @@ > # > # @normal: the original form of migration. (since 8.2) > # > -# @cpr-reboot: The migrate command stops the VM and saves state to the URI. > -# After quitting qemu, the user resumes by running qemu -incoming. > +# @cpr-reboot: The migrate command stops the VM and saves state to > +# the URI. After quitting qemu, the user resumes by running > +# qemu -incoming. > # > -# This mode allows the user to quit qemu, and restart an updated version > -# of qemu. The user may even update and reboot the OS before restarting, > -# as long as the URI persists across a reboot. > +# This mode allows the user to quit qemu and restart an updated > +# version of qemu. The user may even update and reboot the OS > +# before restarting, as long as the URI persists across a reboot. > # > -# Unlike normal mode, the use of certain local storage options does not > -# block the migration, but the user must not modify guest block devices > -# between the quit and restart. > +# Unlike normal mode, the use of certain local storage options > +# does not block the migration, but the user must not modify guest > +# block devices between the quit and restart. > # > -# This mode supports vfio devices provided the user first puts the guest > -# in the suspended runstate, such as by issuing guest-suspend-ram to the > -# qemu guest agent. > +# This mode supports vfio devices provided the user first puts > +# the guest in the suspended runstate, such as by issuing > +# guest-suspend-ram to the qemu guest agent. > # > -# Best performance is achieved when the memory backend is shared and the > -# @x-ignore-shared migration capability is set, but this is not required. > -# Further, if the user reboots before restarting such a configuration, > the > -# shared backend must be be non-volatile across reboot, such as by > backing > -# it with a dax device. > +# Best performance is achieved when the memory backend is shared > +# and the @x-ignore-shared migration capability is set, but this > +# is not required. Further, if the user reboots before restarting > +# such a configuration, the shared backend must be be non-volatile > +# across reboot, such as by backing it with a dax device. > # > -# cpr-reboot may not be used with postcopy, colo, or background-snapshot. > +# @cpr-reboot may not be used with postcopy, background-snapshot, > +# or COLO. > # > # (since 8.2) > ##
Re: [PATCH V4 11/14] vfio: register container for cpr
On 2/29/2024 3:35 AM, Cédric Le Goater wrote: > Hello Steve, > > On 2/22/24 18:28, Steve Sistare wrote: >> Define entry points to perform per-container cpr-specific initialization >> and teardown. >> >> Signed-off-by: Steve Sistare >> --- >> hw/vfio/container.c | 11 ++- >> hw/vfio/cpr.c | 19 +++ >> hw/vfio/iommufd.c | 6 ++ >> hw/vfio/meson.build | 1 + >> include/hw/vfio/vfio-common.h | 3 +++ >> 5 files changed, 39 insertions(+), 1 deletion(-) >> create mode 100644 hw/vfio/cpr.c >> >> diff --git a/hw/vfio/container.c b/hw/vfio/container.c >> index bd25b9f..096d77e 100644 >> --- a/hw/vfio/container.c >> +++ b/hw/vfio/container.c >> @@ -621,10 +621,15 @@ static int vfio_connect_container(VFIOGroup *group, >> AddressSpace *as, >> goto free_container_exit; >> } >> + ret = vfio_cpr_register_container(bcontainer, errp); >> + if (ret) { >> + goto free_container_exit; >> + } >> + >> ret = vfio_ram_block_discard_disable(container, true); >> if (ret) { >> error_setg_errno(errp, -ret, "Cannot set discarding of RAM >> broken"); >> - goto free_container_exit; >> + goto unregister_container_exit; >> } >> assert(bcontainer->ops->setup); >> @@ -667,6 +672,9 @@ listener_release_exit: >> enable_discards_exit: >> vfio_ram_block_discard_disable(container, false); >> +unregister_container_exit: >> + vfio_cpr_unregister_container(bcontainer); >> + >> free_container_exit: >> g_free(container); >> @@ -710,6 +718,7 @@ static void vfio_disconnect_container(VFIOGroup *group) >> vfio_container_destroy(bcontainer); >> trace_vfio_disconnect_container(container->fd); >> + vfio_cpr_unregister_container(bcontainer); >> close(container->fd); >> g_free(container); >> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c >> new file mode 100644 >> index 000..3bede54 >> --- /dev/null >> +++ b/hw/vfio/cpr.c >> @@ -0,0 +1,19 @@ >> +/* >> + * Copyright (c) 2021-2024 Oracle and/or its affiliates. >> + * >> + * This work is licensed under the terms of the GNU GPL, version 2 or later. >> + * See the COPYING file in the top-level directory. >> + */ >> + >> +#include "qemu/osdep.h" >> +#include "hw/vfio/vfio-common.h" >> +#include "qapi/error.h" >> + >> +int vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp) >> +{ >> + return 0; >> +} >> + >> +void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer) >> +{ >> +} >> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c >> index 9bfddc1..e1be224 100644 >> --- a/hw/vfio/iommufd.c >> +++ b/hw/vfio/iommufd.c >> @@ -411,6 +411,11 @@ found_container: >> goto err_listener_register; >> } >> + ret = vfio_cpr_register_container(bcontainer, errp); >> + if (ret) { >> + goto err_listener_register; >> + } >> + >> /* >> * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level >> * for discarding incompatibility check as well? >> @@ -461,6 +466,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev) >> iommufd_cdev_ram_block_discard_disable(false); >> } >> + vfio_cpr_unregister_container(bcontainer); >> iommufd_cdev_detach_container(vbasedev, container); >> iommufd_cdev_container_destroy(container); >> vfio_put_address_space(space); >> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build >> index bb98493..bba776f 100644 >> --- a/hw/vfio/meson.build >> +++ b/hw/vfio/meson.build >> @@ -5,6 +5,7 @@ vfio_ss.add(files( >> 'container-base.c', >> 'container.c', >> 'migration.c', >> + 'cpr.c', >> )) >> vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c')) >> vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files( >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >> index 4a6c262..b9da6c0 100644 >> --- a/include/hw/vfio/vfio-common.h >> +++ b/include/hw/vfio/vfio-common.h >> @@ -205,6 +205,9 @@ void vfio_detach_device(VFIODevice *vbasedev); >> int vfio_kvm_device_add_fd(int fd, Error **errp); >> int vfio_kvm_device_del_fd(int fd, Error **errp); >> +int vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error >> **errp); > > Should we return bool since we have an errp ? the returned value > is not an errno AFAICT. > > Anyhow, > > Reviewed-by: Cédric Le Goater Hi Cedric, vfio_cpr_register_container sometimes returns the value returned from migrate_add_blocker_modes, which is an errno. Thanks for reviewing these! - Steve >> +void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer); >> + >> extern const MemoryRegionOps vfio_region_ops; >> typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList; >> typedef QLIST_HEAD(VFIODeviceList, VFIODevice) VFIODeviceList; >
Re: [PATCH V2] migration: export fewer options
Just a reminder, after our further discussion in the V1 thread, this patch is still what I propose, no updates needed. Markus, I think Peter is looking for your blessing on the new file name: include/migration/client-options.h. - Steve On 2/26/2024 5:16 PM, Steve Sistare wrote: > A small number of migration options are accessed by migration clients, > but to see them clients must include all of options.h, which is mostly > for migration core code. migrate_mode() in particular will be needed by > multiple clients. > > Refactor the option declarations so clients can see the necessary few via > misc.h, which already exports a portion of the client API. > > Signed-off-by: Steve Sistare > --- > Changes in V2: > * renamed options-pub.h to client-options.h > --- > --- > hw/vfio/migration.c| 1 - > hw/virtio/virtio-balloon.c | 1 - > include/migration/client-options.h | 24 > include/migration/misc.h | 1 + > migration/options.h| 6 +- > system/dirtylimit.c| 1 - > 6 files changed, 26 insertions(+), 8 deletions(-) > create mode 100644 include/migration/client-options.h > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c > index 50140ed..5d4a23c 100644 > --- a/hw/vfio/migration.c > +++ b/hw/vfio/migration.c > @@ -18,7 +18,6 @@ > #include "sysemu/runstate.h" > #include "hw/vfio/vfio-common.h" > #include "migration/migration.h" > -#include "migration/options.h" > #include "migration/savevm.h" > #include "migration/vmstate.h" > #include "migration/qemu-file.h" > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > index 89f853f..a59ff17 100644 > --- a/hw/virtio/virtio-balloon.c > +++ b/hw/virtio/virtio-balloon.c > @@ -32,7 +32,6 @@ > #include "qemu/error-report.h" > #include "migration/misc.h" > #include "migration/migration.h" > -#include "migration/options.h" > > #include "hw/virtio/virtio-bus.h" > #include "hw/virtio/virtio-access.h" > diff --git a/include/migration/client-options.h > b/include/migration/client-options.h > new file mode 100644 > index 000..887fea1 > --- /dev/null > +++ b/include/migration/client-options.h > @@ -0,0 +1,24 @@ > +/* > + * QEMU public migration capabilities > + * > + * Copyright (c) 2012-2023 Red Hat Inc > + * > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > + * See the COPYING file in the top-level directory. > + */ > + > +#ifndef QEMU_MIGRATION_CLIENT_OPTIONS_H > +#define QEMU_MIGRATION_CLIENT_OPTIONS_H > + > +/* capabilities */ > + > +bool migrate_background_snapshot(void); > +bool migrate_dirty_limit(void); > +bool migrate_postcopy_ram(void); > +bool migrate_switchover_ack(void); > + > +/* parameters */ > + > +MigMode migrate_mode(void); > + > +#endif > diff --git a/include/migration/misc.h b/include/migration/misc.h > index 5d1aa59..4c226a4 100644 > --- a/include/migration/misc.h > +++ b/include/migration/misc.h > @@ -17,6 +17,7 @@ > #include "qemu/notify.h" > #include "qapi/qapi-types-migration.h" > #include "qapi/qapi-types-net.h" > +#include "migration/client-options.h" > > /* migration/ram.c */ > > diff --git a/migration/options.h b/migration/options.h > index 246c160..964ebdd 100644 > --- a/migration/options.h > +++ b/migration/options.h > @@ -16,6 +16,7 @@ > > #include "hw/qdev-properties.h" > #include "hw/qdev-properties-system.h" > +#include "migration/client-options.h" > > /* migration properties */ > > @@ -24,12 +25,10 @@ extern Property migration_properties[]; > /* capabilities */ > > bool migrate_auto_converge(void); > -bool migrate_background_snapshot(void); > bool migrate_block(void); > bool migrate_colo(void); > bool migrate_compress(void); > bool migrate_dirty_bitmaps(void); > -bool migrate_dirty_limit(void); > bool migrate_events(void); > bool migrate_ignore_shared(void); > bool migrate_late_block_activate(void); > @@ -37,11 +36,9 @@ bool migrate_multifd(void); > bool migrate_pause_before_switchover(void); > bool migrate_postcopy_blocktime(void); > bool migrate_postcopy_preempt(void); > -bool migrate_postcopy_ram(void); > bool migrate_rdma_pin_all(void); > bool migrate_release_ram(void); > bool migrate_return_path(void); > -bool migrate_switchover_ack(void); > bool migrate_validate_uuid(void); > bool migrate_xbzrle(void); > bool migrate_zero_blocks(void); > @@ -83,7 +80,6 @@ uint8_t migrate_max_cpu_throttle(void); > uint64_t migrate_max_bandwidth(void); > uint64_t migrate_avail_switchover_bandwidth(void); > uint64_t migrate_max_postcopy_bandwidth(void); > -MigMode migrate_mode(void); > int migrate_multifd_channels(void); > MultiFDCompression migrate_multifd_compression(void); > int migrate_multifd_zlib_level(void); > diff --git a/system/dirtylimit.c b/system/dirtylimit.c > index b5607eb..774ff44 100644 > --- a/system/dirtylimit.c > +++ b/system/dirtylimit.c > @@ -26,7 +26,6 @@ > #include "trace.h" > #include "migration/misc.h" >
Re: [PATCH V4 14/14] migration: options incompatible with cpr
On 2/28/2024 11:05 AM, Markus Armbruster wrote: > Steven Sistare writes: > >> On 2/28/2024 2:21 AM, Markus Armbruster wrote: >>> Steve Sistare writes: >>> >>>> Fail the migration request if options are set that are incompatible >>>> with cpr. >>>> >>>> Signed-off-by: Steve Sistare > > [...] > >>>> diff --git a/qapi/migration.json b/qapi/migration.json >>>> index 0990297..c6bfe2e 100644 >>>> --- a/qapi/migration.json >>>> +++ b/qapi/migration.json >>>> @@ -657,6 +657,8 @@ >>>> # shared backend must be be non-volatile across reboot, such as by >>>> backing >>>> # it with a dax device. >>>> # >>>> +# cpr-reboot may not be used with postcopy, colo, or >>>> background-snapshot. >>>> +# >>> >>> @cpr-reboot >>> >>> COLO >>> >>> Wrap the line: >>> >>># @cpr-reboot may not be used with postcopy, COLO, or >>># background-snapshot. >>> >>> This doesn't tell the reader what settings exactly do not work with >>> @cpr-reboot. >>> >>> For instance "background-snapshot" is about enabling migration >>> capability @background-snapshot. We could write something like "is >>> incompatible with enabling migration capability @background-snapshot". >>> >>> Same for the other two. Worthwhile? >> >> I initially was more precise exactly as you suggest, but I realized that >> postcopy encompasses multiple capabilities, and I did not want to enumerate >> them, nor require new capabilities related to these 3 to be listed here >> if/when they are created, so I generalized. A keyword search in this file >> gives unambiguous matches. >> >> Peter pushed the patch a few hours before you sent this. > > Okay. > > A followup patch to correct @cpr-reboot, COLO and line wrapping would be > nice. Will do - steve
Re: [PATCH V4 14/14] migration: options incompatible with cpr
On 2/28/2024 2:21 AM, Markus Armbruster wrote: > Steve Sistare writes: > >> Fail the migration request if options are set that are incompatible >> with cpr. >> >> Signed-off-by: Steve Sistare >> --- >> migration/migration.c | 17 + >> qapi/migration.json | 2 ++ >> 2 files changed, 19 insertions(+) >> >> diff --git a/migration/migration.c b/migration/migration.c >> index 90a9094..7652fd4 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1953,6 +1953,23 @@ static bool migrate_prepare(MigrationState *s, bool >> blk, bool blk_inc, >> return false; >> } >> >> +if (migrate_mode_is_cpr(s)) { >> +const char *conflict = NULL; >> + >> +if (migrate_postcopy()) { >> +conflict = "postcopy"; >> +} else if (migrate_background_snapshot()) { >> +conflict = "background snapshot"; >> +} else if (migrate_colo()) { >> +conflict = "COLO"; >> +} >> + >> +if (conflict) { >> +error_setg(errp, "Cannot use %s with CPR", conflict); >> +return false; >> +} >> +} >> + >> if (blk || blk_inc) { >> if (migrate_colo()) { >> error_setg(errp, "No disk migration is required in COLO mode"); >> diff --git a/qapi/migration.json b/qapi/migration.json >> index 0990297..c6bfe2e 100644 >> --- a/qapi/migration.json >> +++ b/qapi/migration.json >> @@ -657,6 +657,8 @@ >> # shared backend must be be non-volatile across reboot, such as by >> backing >> # it with a dax device. >> # >> +# cpr-reboot may not be used with postcopy, colo, or >> background-snapshot. >> +# > > @cpr-reboot > > COLO > > Wrap the line: > ># @cpr-reboot may not be used with postcopy, COLO, or ># background-snapshot. > > This doesn't tell the reader what settings exactly do not work with > @cpr-reboot. > > For instance "background-snapshot" is about enabling migration > capability @background-snapshot. We could write something like "is > incompatible with enabling migration capability @background-snapshot". > > Same for the other two. Worthwhile? I initially was more precise exactly as you suggest, but I realized that postcopy encompasses multiple capabilities, and I did not want to enumerate them, nor require new capabilities related to these 3 to be listed here if/when they are created, so I generalized. A keyword search in this file gives unambiguous matches. Peter pushed the patch a few hours before you sent this. - Steve
Re: [PATCH V1 00/10] privatize migration.h
Please ignore this solo message, I accidentally sent it alone - steve On 2/27/2024 12:42 PM, Steve Sistare wrote: > migration/migration.h is the private interface for code in the migration > sub-directory, but many other clients include it because they need accessors > that are not exported by the publc interface in include/migration/misc.h. > Fix that by refactoring accessors and defining new ones as needed. > > After these fixes, no code outside of migration includes migration.h, > and no code outside of migration uses MigrationState. > > This series depends on the following: > * migration patches in the series "allow cpr-reboot for vfio" > * singleton patch "migration: export fewer options" > > Steve Sistare (10): > migration: remove migration.h references > migration: export migration_is_setup_or_active > migration: export migration_is_active > migration: export migration_is_running > migration: export vcpu_dirty_limit_period > migration: migration_thread_is_self > migration: migration_is_device > migration: migration_file_set_error > migration: privatize colo interfaces > migration: purge MigrationState from public interface > > hw/vfio/common.c | 17 +++--- > hw/vfio/container.c| 1 - > hw/vfio/migration.c| 11 ++- > hw/virtio/vhost-user.c | 1 - > hw/virtio/virtio-balloon.c | 1 - > include/migration/client-options.h | 1 + > include/migration/misc.h | 17 +- > migration/colo.c | 17 ++ > migration/migration.c | 67 > -- > migration/migration.h | 7 ++-- > migration/options.c| 11 +-- > migration/ram.c| 5 ++- > migration/savevm.c | 2 +- > net/colo-compare.c | 3 +- > net/vhost-vdpa.c | 3 +- > stubs/colo.c | 1 - > system/dirtylimit.c| 12 +++ > system/qdev-monitor.c | 1 - > target/loongarch/kvm/kvm.c | 1 - > target/riscv/kvm/kvm-cpu.c | 4 +-- > tests/unit/test-vmstate.c | 1 - > 21 files changed, 96 insertions(+), 88 deletions(-) >
Re: [PATCH v7 0/3] string list functions
All the changes look good - steve On 2/27/2024 10:33 AM, Markus Armbruster wrote: > This is Steve's work (v1 to v5), tweaked by Philippe (v6), and now by > me. > > v7: > * Old PATCH 1 dropped > The proposed str_split() is a composition of g_strsplit() and a > conversion from char ** to strList *. I'm willing to accept a > conversion function str_list_from_strv() next to > strv_from_str_list(). I doubt having a function for the composition > is worth the trouble. > * Old PATCH 4 dropped > The tests mostly test g_strsplit(). I'm willing to accept focused > tests for strv_from_str_list() and str_list_from_strv(). > * PATCH 1-3: Commit messages tweaked > * PATCH 2: strv_from_strList() renamed strv_from_str_list(), and moved > to qapi/qapi-type-helpers.c. Function comment tweaked. Local > variable @argv renamed to @strv. > * PATCH 3: Adjust for the rename. > > Steve Sistare (3): > qapi: New QAPI_LIST_LENGTH() > qapi: New strv_from_str_list() > migration: simplify exec migration functions > > include/qapi/type-helpers.h | 8 ++ > include/qapi/util.h | 13 + > migration/exec.c| 57 ++--- > qapi/qapi-type-helpers.c| 14 + > 4 files changed, 43 insertions(+), 49 deletions(-) >
Re: [PATCH v6 0/5] string list functions
On 2/27/2024 10:28 AM, Markus Armbruster wrote: > Philippe Mathieu-Daudé writes: > >> Hi Markus, >> >> Here are the patches I queued until you told me you'd >> object to the CamelCase filename strList.[ch]. >> >> Steve, please take over ;) > > I'm going to post the part of the series I'm ready to queue as v7, with > minor modifications: > > * Rename strv_from_strList() to strv_from_str_list(), and put it into > qapi/qapi-type-helpers.c. > > * Tweak its function comment. > > * Rename its local variable @argv to @strv. > > * Cosmetic commit message tweaks. > > We can then talk about the remainder. Thanks Markus, that saves some time and email. - Steve
Re: [PATCH V1] migration: export fewer options
On 2/27/2024 3:08 AM, Peter Xu wrote: > On Mon, Feb 26, 2024 at 09:41:15AM -0500, Steven Sistare wrote: >> On 2/26/2024 2:40 AM, Markus Armbruster wrote: >>> Steve Sistare writes: >>> >>>> A small number of migration options are accessed by migration clients, >>>> but to see them clients must include all of options.h, which is mostly >>>> for migration core code. migrate_mode() in particular will be needed by >>>> multiple clients. >>>> >>>> Refactor the option declarations so clients can see the necessary few via >>>> misc.h, which already exports a portion of the client API. >>>> >>>> Signed-off-by: Steve Sistare >>>> --- >>>> I suggest that eventually we should define a single file migration/client.h >>>> which exports everything needed by the simpler clients: blockers, >>>> notifiers, >>>> options, cpr, and state accessors. >>>> --- >>>> --- >>>> hw/vfio/migration.c | 1 - >>>> hw/virtio/virtio-balloon.c | 1 - >>>> include/migration/misc.h| 1 + >>>> include/migration/options-pub.h | 24 >>>> migration/options.h | 6 +- >>> >>> Unusual naming. We have zero headers named -pub.h or -public.h, and >>> dozens named like -int.h or -internal.h. Please stick to the existing >>> convention. >> >> In the spirit of minimizing changes, I went that route to avoid renaming the >> existing migration/options.h and its references: >> >> 0 migration/block-dirty-bit 82 #include "options.h" >> 1 migration/block.c 32 #include "options.h" >> 2 migration/colo.c 37 #include "options.h" >> 3 migration/migration-hmp-c 35 #include "options.h" >> 4 migration/migration.c 68 #include "options.h" >> 5 migration/multifd-zlib.c 21 #include "options.h" >> 6 migration/multifd-zstd.c 21 #include "options.h" >> 7 migration/multifd.c 29 #include "options.h" >> 8 migration/options.c 30 #include "options.h" >> 9 migration/postcopy-ram.c 40 #include "options.h" >> a migration/qemu-file.c 33 #include "options.h" >> b migration/ram-compress.c 37 #include "options.h" >> c migration/ram.c 63 #include "options.h" >> d migration/rdma.c 40 #include "options.h" >> e migration/savevm.c71 #include "options.h" >> f migration/socket.c30 #include "options.h" >> g migration/tls.c 25 #include "options.h" >> >> But I take your point. >> >> Peter, which do you prefer? > > From statistics, "-internal.h" wins "-int.h": > > $ git grep "\-internal.h" | wc -l > 135 > $ git grep "\-int.h" | wc -l > 3 Yes, I did the same search to choose the name for option A below. But in option B, I keep the existing name for the private file, and choose a new name for the public file. There is no suffix in use in qemu to denote a public file; we just use names indicating the functionality, so I chose client-options.h. Let's use client-options.h and move on. I am preparing another cleanup series that I think you will like. - Steve >> A. rename: migration/options.h -> migration/options-internal.h >> rename: include/migration/options-pub.h -> include/migration/options.h >> >> B. rename: include/migration/options.h -> >> include/migration/client-options.h >> >> I prefer B. If you prefer B, but want a different file name, please choose >> the >> final name. > > Personally I don't have a strong opinion on the name. I'll see whether > Markus has any comment. > > [and of course, I removed this patch from -staging queue to keep the > discussion going..] >
Re: [PATCH V4 00/14] allow cpr-reboot for vfio
On 2/26/2024 3:21 PM, Steven Sistare wrote: > On 2/26/2024 4:01 AM, Peter Xu wrote: >> On Mon, Feb 26, 2024 at 09:49:46AM +0100, Cédric Le Goater wrote: >>> Go ahead. It will help me for the changes I am doing on error reporting >>> for VFIO migration. I will rebase on top. >> >> Thanks for confirming. I queued the migration patches then, but leave the >> two vfio one for further discussion. > > Very good, thanks. I am always happy to move the ball a few yards closer to > the goal line :) Peter, beware that patch 3 needs an edit before being queued. This hunk snuck in and should be deleted: [PATCH V4 03/14] migration: convert to NotifierWithReturn diff --git a/roms/seabios-hppa b/roms/seabios-hppa index 03774ed..e4eac85 16 --- a/roms/seabios-hppa +++ b/roms/seabios-hppa @@ -1 +1 @@ -Subproject commit 03774edaad3bfae090ac96ca5450353c641637d1 +Subproject commit e4eac85880e8677f96d8b9e94de9f2eec9c0751f - Steve
Re: [PATCH V4 00/14] allow cpr-reboot for vfio
On 2/26/2024 4:01 AM, Peter Xu wrote: > On Mon, Feb 26, 2024 at 09:49:46AM +0100, Cédric Le Goater wrote: >> Go ahead. It will help me for the changes I am doing on error reporting >> for VFIO migration. I will rebase on top. > > Thanks for confirming. I queued the migration patches then, but leave the > two vfio one for further discussion. Very good, thanks. I am always happy to move the ball a few yards closer to the goal line :) - Steve
Re: [PATCH V1] migration: export fewer options
On 2/25/2024 9:40 PM, Peter Xu wrote: > On Fri, Feb 23, 2024 at 09:13:24AM -0800, Steve Sistare wrote: >> A small number of migration options are accessed by migration clients, >> but to see them clients must include all of options.h, which is mostly >> for migration core code. migrate_mode() in particular will be needed by >> multiple clients. >> >> Refactor the option declarations so clients can see the necessary few via >> misc.h, which already exports a portion of the client API. >> >> Signed-off-by: Steve Sistare > > Sounds reasonable, queued, thanks. > >> --- >> I suggest that eventually we should define a single file migration/client.h >> which exports everything needed by the simpler clients: blockers, notifiers, >> options, cpr, and state accessors. > > What's the difference v.s. current migration/misc.h? This file would be sufficient for most clients: diff --git a/include/migration/client.h b/include/migration/client.h new file mode 100644 index 000..a55e504 --- /dev/null +++ b/include/migration/client.h @@ -0,0 +1,6 @@ +#ifndef MIGRATION_CLIENT_H +#define MIGRATION_CLIENT_H +#include "migration/misc.h" +#include "migration/blocker.h" +#include "migration/client-options.h" +#endif Or, we could rename misc.h -> client.h and include blocker.h and client-options.h in it. I just like the idea that most clients could include a single, obviously named file to use the most-common exported API. "misc.h" is not obvious, and not complete. - Steve
Re: [PATCH v6 0/5] string list functions
Thanks for trying a V6 Philippe. I'll take it from here. It helps to know that someone besides me thinks these functions are worth having. - Steve On 2/26/2024 9:11 AM, Philippe Mathieu-Daudé wrote: > Hi Markus, > > Here are the patches I queued until you told me you'd > object to the CamelCase filename strList.[ch]. > > Steve, please take over ;) > > Since v5: > - Cover files in MAINTAINERS > - Complete @docstring mentioning g_auto. > > v5: > https://lore.kernel.org/qemu-devel/1708638470-114846-3-git-send-email-steven.sist...@oracle.com/ > > Steve Sistare (5): > util: str_split > qapi: QAPI_LIST_LENGTH > util: strv_from_strList > util: strList unit tests > migration: simplify exec migration functions > > MAINTAINERS | 2 + > include/monitor/hmp.h | 1 - > include/qapi/util.h | 13 +++ > include/qemu/strList.h| 33 > migration/exec.c | 57 > monitor/hmp-cmds.c| 19 -- > net/net-hmp-cmds.c| 3 +- > stats/stats-hmp-cmds.c| 3 +- > tests/unit/test-strList.c | 80 +++ > util/strList.c| 38 +++ > tests/unit/meson.build| 1 + > util/meson.build | 1 + > 12 files changed, 180 insertions(+), 71 deletions(-) > create mode 100644 include/qemu/strList.h > create mode 100644 tests/unit/test-strList.c > create mode 100644 util/strList.c >
Re: [PATCH V1] migration: export fewer options
On 2/26/2024 2:40 AM, Markus Armbruster wrote: > Steve Sistare writes: > >> A small number of migration options are accessed by migration clients, >> but to see them clients must include all of options.h, which is mostly >> for migration core code. migrate_mode() in particular will be needed by >> multiple clients. >> >> Refactor the option declarations so clients can see the necessary few via >> misc.h, which already exports a portion of the client API. >> >> Signed-off-by: Steve Sistare >> --- >> I suggest that eventually we should define a single file migration/client.h >> which exports everything needed by the simpler clients: blockers, notifiers, >> options, cpr, and state accessors. >> --- >> --- >> hw/vfio/migration.c | 1 - >> hw/virtio/virtio-balloon.c | 1 - >> include/migration/misc.h| 1 + >> include/migration/options-pub.h | 24 >> migration/options.h | 6 +- > > Unusual naming. We have zero headers named -pub.h or -public.h, and > dozens named like -int.h or -internal.h. Please stick to the existing > convention. In the spirit of minimizing changes, I went that route to avoid renaming the existing migration/options.h and its references: 0 migration/block-dirty-bit 82 #include "options.h" 1 migration/block.c 32 #include "options.h" 2 migration/colo.c 37 #include "options.h" 3 migration/migration-hmp-c 35 #include "options.h" 4 migration/migration.c 68 #include "options.h" 5 migration/multifd-zlib.c 21 #include "options.h" 6 migration/multifd-zstd.c 21 #include "options.h" 7 migration/multifd.c 29 #include "options.h" 8 migration/options.c 30 #include "options.h" 9 migration/postcopy-ram.c 40 #include "options.h" a migration/qemu-file.c 33 #include "options.h" b migration/ram-compress.c 37 #include "options.h" c migration/ram.c 63 #include "options.h" d migration/rdma.c 40 #include "options.h" e migration/savevm.c71 #include "options.h" f migration/socket.c30 #include "options.h" g migration/tls.c 25 #include "options.h" But I take your point. Peter, which do you prefer? A. rename: migration/options.h -> migration/options-internal.h rename: include/migration/options-pub.h -> include/migration/options.h B. rename: include/migration/options.h -> include/migration/client-options.h I prefer B. If you prefer B, but want a different file name, please choose the final name. - Steve
Re: [PATCH V5 1/5] util: str_split
On 2/23/2024 12:41 PM, Philippe Mathieu-Daudé wrote: > On 23/2/24 15:01, Steven Sistare wrote: >> On 2/23/2024 1:01 AM, Philippe Mathieu-Daudé wrote: >>> On 22/2/24 22:47, Steve Sistare wrote: >>>> Generalize hmp_split_at_comma() to take any delimiter string, rename >>>> as str_split(), and move it to util/strList.c. >>>> >>>> No functional change. >>>> >>>> Signed-off-by: Steve Sistare >>>> --- >>>> include/monitor/hmp.h | 1 - >>>> include/qemu/strList.h | 24 >>>> monitor/hmp-cmds.c | 19 --- >>>> net/net-hmp-cmds.c | 3 ++- >>>> stats/stats-hmp-cmds.c | 3 ++- >>>> util/meson.build | 1 + >>>> util/strList.c | 24 >>>> 7 files changed, 53 insertions(+), 22 deletions(-) >>>> create mode 100644 include/qemu/strList.h >>>> create mode 100644 util/strList.c >>> >>> >>>> +#include "qapi/qapi-builtin-types.h" >>>> + >>>> +/* >>>> + * Split @str into a strList using the delimiter string @delim. >>>> + * The delimiter is not included in the result. >>>> + * Return NULL if @str is NULL or an empty string. >>>> + * A leading, trailing, or consecutive delimiter produces an >>>> + * empty string at that position in the output. >>>> + * All strings are g_strdup'd, and the result can be freed >>>> + * using qapi_free_strList. >>> >>> Note "qapi/qapi-builtin-types.h" defines: >>> >>> G_DEFINE_AUTOPTR_CLEANUP_FUNC(strList, qapi_free_strList) >>> >>> Maybe mention we can also use: >>> >>> g_autoptr(strList) >> >> Thanks Philippe. If we get to V6 for this series, I will mention this, >> and also mention g_autoptr(GStrv) in the header comment for >> strv_from_strlist. > > If there is no need for v6, do you mind sharing here what would be > the resulting comment? Maybe Markus can update it directly... > (assuming he takes your series). Sure - steve diff --git a/include/qemu/strList.h b/include/qemu/strList.h index c1eb1dd..b13bd53 100644 --- a/include/qemu/strList.h +++ b/include/qemu/strList.h @@ -17,13 +17,16 @@ * A leading, trailing, or consecutive delimiter produces an * empty string at that position in the output. * All strings are g_strdup'd, and the result can be freed - * using qapi_free_strList. + * using qapi_free_strList, or by declaring a local variable + * with g_autoptr(strList). */ strList *str_split(const char *str, const char *delim); /* * Produce and return a NULL-terminated array of strings from @list. - * The result is g_malloc'd and all strings are g_strdup'd. + * The result is g_malloc'd and all strings are g_strdup'd. The result + * can be freed using g_strfreev, or by declaring a local variable with + * g_auto(GStrv). */ char **strv_from_strList(const strList *list);
Re: [PATCH V5 1/5] util: str_split
On 2/23/2024 1:01 AM, Philippe Mathieu-Daudé wrote: > On 22/2/24 22:47, Steve Sistare wrote: >> Generalize hmp_split_at_comma() to take any delimiter string, rename >> as str_split(), and move it to util/strList.c. >> >> No functional change. >> >> Signed-off-by: Steve Sistare >> --- >> include/monitor/hmp.h | 1 - >> include/qemu/strList.h | 24 >> monitor/hmp-cmds.c | 19 --- >> net/net-hmp-cmds.c | 3 ++- >> stats/stats-hmp-cmds.c | 3 ++- >> util/meson.build | 1 + >> util/strList.c | 24 >> 7 files changed, 53 insertions(+), 22 deletions(-) >> create mode 100644 include/qemu/strList.h >> create mode 100644 util/strList.c > > >> +#include "qapi/qapi-builtin-types.h" >> + >> +/* >> + * Split @str into a strList using the delimiter string @delim. >> + * The delimiter is not included in the result. >> + * Return NULL if @str is NULL or an empty string. >> + * A leading, trailing, or consecutive delimiter produces an >> + * empty string at that position in the output. >> + * All strings are g_strdup'd, and the result can be freed >> + * using qapi_free_strList. > > Note "qapi/qapi-builtin-types.h" defines: > > G_DEFINE_AUTOPTR_CLEANUP_FUNC(strList, qapi_free_strList) > > Maybe mention we can also use: > > g_autoptr(strList) Thanks Philippe. If we get to V6 for this series, I will mention this, and also mention g_autoptr(GStrv) in the header comment for strv_from_strlist. - Steve >> + */ >> +strList *str_split(const char *str, const char *delim); >> + >> +#endif >
Re: [PATCH V4 1/5] util: strList_from_string
On 2/21/2024 12:01 PM, Steven Sistare wrote: > On 2/21/2024 8:29 AM, Markus Armbruster wrote: >> I apologize for the lateness of my review. > > No problem. Thanks for the review. > >> Steve Sistare writes: >> >>> Generalize hmp_split_at_comma() to take any delimiter string, rename >>> as strList_from_string(), and move it to util/strList.c. >>> >>> No functional change. >>> >>> Signed-off-by: Steve Sistare >> >> I can't see an actual use of generalized delimiters outside tests in >> this series. Do you have uses? > > In this series, it is called from hmp_announce_self and stats_filter; > those were formerly calls to hmp_split_at_comma. > > In my live update cpr-exec series, there is an additional call site, with a > space delimiter instead of comma. Live update V9 is posted but is old and > obsolete. > I will post V10 soon, but I am hoping you can pull this series first, so I > can > whittle down my pending patches and omit these from V10. > >>> --- >>> include/monitor/hmp.h | 1 - >>> include/qemu/strList.h | 24 >>> monitor/hmp-cmds.c | 19 --- >>> net/net-hmp-cmds.c | 3 ++- >>> stats/stats-hmp-cmds.c | 3 ++- >>> util/meson.build | 1 + >>> util/strList.c | 24 >>> 7 files changed, 53 insertions(+), 22 deletions(-) >>> create mode 100644 include/qemu/strList.h >>> create mode 100644 util/strList.c >>> >>> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h >>> index 13f9a2d..2df661e 100644 >>> --- a/include/monitor/hmp.h >>> +++ b/include/monitor/hmp.h >>> @@ -19,7 +19,6 @@ >>> >>> bool hmp_handle_error(Monitor *mon, Error *err); >>> void hmp_help_cmd(Monitor *mon, const char *name); >>> -strList *hmp_split_at_comma(const char *str); >>> >>> void hmp_info_name(Monitor *mon, const QDict *qdict); >>> void hmp_info_version(Monitor *mon, const QDict *qdict); >>> diff --git a/include/qemu/strList.h b/include/qemu/strList.h >>> new file mode 100644 >>> index 000..010237f >>> --- /dev/null >>> +++ b/include/qemu/strList.h >>> @@ -0,0 +1,24 @@ >>> +/* >>> + * Copyright (c) 2022 - 2024 Oracle and/or its affiliates. >>> + * >>> + * This work is licensed under the terms of the GNU GPL, version 2 or >>> later. >>> + * See the COPYING file in the top-level directory. >>> + */ >>> + >>> +#ifndef QEMU_STR_LIST_H >>> +#define QEMU_STR_LIST_H >>> + >>> +#include "qapi/qapi-builtin-types.h" >>> + >>> +/* >>> + * Break @in into a strList using the delimiter string @delim. >>> + * The delimiter is not included in the result. >>> + * Return NULL if @in is NULL or an empty string. >>> + * A leading, trailing, or consecutive delimiter produces an >>> + * empty string at that position in the output. >>> + * All strings are g_strdup'd, and the result can be freed >>> + * using qapi_free_strList. >>> + */ >>> +strList *strList_from_string(const char *in, const char *delim); >> >> The function name no longer tells us explicitly what the function does: >> splitting the string. > > The first sentence does not say it? > "Break @in into a strList using the delimiter string @delim" > > Would you prefer this? > "Split string @in into a strList using the delimiter string @delim" Sorry, I read your comment too quickly. You want a different function name. I propose: strList *str_split(const char *str, const char *delim); - Steve
Re: [PATCH V4 00/14] allow cpr-reboot for vfio
Peter (and David if interested): these patches still need RB: migration: notifier error checking migration: stop vm for cpr migration: update cpr-reboot description migration: options incompatible with cpr Alex, these patches still need RB: vfio: register container for cpr vfio: allow cpr-reboot migration if suspended Thanks! - Steve On 2/22/2024 12:28 PM, Steve Sistare wrote: > Allow cpr-reboot for vfio if the guest is in the suspended runstate. The > guest drivers' suspend methods flush outstanding requests and re-initialize > the devices, and thus there is no device state to save and restore. The > user is responsible for suspending the guest before initiating cpr, such as > by issuing guest-suspend-ram to the qemu guest agent. > > Most of the patches in this series enhance migration notifiers so they can > return an error status and message. The last few patches register a notifier > for vfio that returns an error if the guest is not suspended. > > Changes in V3: > * update to tip, add RB's > * replace MigrationStatus with new enum MigrationEventType > * simplify migrate_fd_connect error recovery > * support vfio iommufd containers > * add patches: > migration: stop vm for cpr > migration: update cpr-reboot description > > Changes in V4: > * rebase to tip, add RB's > * add patch to prevent options such as precopy from being used with cpr. > migration: options incompatible with cpr > * refactor subroutines in "stop vm for cpr" > * simplify "notifier error checking" patch by restricting that only > notifiers for MIG_EVENT_PRECOPY_SETUP may return an error. > * doc that a fail event may be sent without a prior setup event > > Steve Sistare (14): > notify: pass error to notifier with return > migration: remove error from notifier data > migration: convert to NotifierWithReturn > migration: MigrationEvent for notifiers > migration: remove postcopy_after_devices > migration: MigrationNotifyFunc > migration: per-mode notifiers > migration: refactor migrate_fd_connect failures > migration: notifier error checking > migration: stop vm for cpr > vfio: register container for cpr > vfio: allow cpr-reboot migration if suspended > migration: update cpr-reboot description > migration: options incompatible with cpr > > hw/net/virtio-net.c | 13 +-- > hw/vfio/common.c | 2 +- > hw/vfio/container.c | 11 ++- > hw/vfio/cpr.c | 39 + > hw/vfio/iommufd.c | 6 ++ > hw/vfio/meson.build | 1 + > hw/vfio/migration.c | 15 ++-- > hw/vfio/trace-events | 2 +- > hw/virtio/vhost-user.c| 10 +-- > hw/virtio/virtio-balloon.c| 3 +- > include/hw/vfio/vfio-common.h | 5 +- > include/hw/vfio/vfio-container-base.h | 1 + > include/hw/virtio/virtio-net.h| 2 +- > include/migration/misc.h | 47 +-- > include/qemu/notify.h | 8 +- > migration/migration.c | 148 > +++--- > migration/migration.h | 4 - > migration/postcopy-ram.c | 3 +- > migration/postcopy-ram.h | 1 - > migration/ram.c | 3 +- > net/vhost-vdpa.c | 14 ++-- > qapi/migration.json | 37 ++--- > roms/seabios-hppa | 2 +- > ui/spice-core.c | 17 ++-- > util/notify.c | 5 +- > 25 files changed, 275 insertions(+), 124 deletions(-) > create mode 100644 hw/vfio/cpr.c >
Re: [PATCH V3 10/13] migration: stop vm for cpr
On 2/22/2024 4:30 AM, Peter Xu wrote: > On Thu, Feb 22, 2024 at 05:12:53PM +0800, Peter Xu wrote: >> On Wed, Feb 21, 2024 at 04:20:07PM -0500, Steven Sistare wrote: >>> On 2/20/2024 2:33 AM, Peter Xu wrote: >>>> On Thu, Feb 08, 2024 at 10:54:03AM -0800, Steve Sistare wrote: >>>>> When migration for cpr is initiated, stop the vm and set state >>>>> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >>>>> possibility of ram and device state being out of sync, and guarantees >>>>> that a guest in the suspended state remains suspended, because qmp_cont >>>>> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >>>>> >>>>> Signed-off-by: Steve Sistare >>>>> --- >>>>> include/migration/misc.h | 1 + >>>>> migration/migration.c| 32 +--- >>>>> 2 files changed, 26 insertions(+), 7 deletions(-) >>>>> >>>>> diff --git a/include/migration/misc.h b/include/migration/misc.h >>>>> index 6dc234b..54c99a3 100644 >>>>> --- a/include/migration/misc.h >>>>> +++ b/include/migration/misc.h >>>>> @@ -60,6 +60,7 @@ void migration_object_init(void); >>>>> void migration_shutdown(void); >>>>> bool migration_is_idle(void); >>>>> bool migration_is_active(MigrationState *); >>>>> +bool migrate_mode_is_cpr(MigrationState *); >>>>> >>>>> typedef enum MigrationEventType { >>>>> MIG_EVENT_PRECOPY_SETUP, >>>>> diff --git a/migration/migration.c b/migration/migration.c >>>>> index d1fce9e..fc5c587 100644 >>>>> --- a/migration/migration.c >>>>> +++ b/migration/migration.c >>>>> @@ -1603,6 +1603,11 @@ bool migration_is_active(MigrationState *s) >>>>> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE); >>>>> } >>>>> >>>>> +bool migrate_mode_is_cpr(MigrationState *s) >>>>> +{ >>>>> +return s->parameters.mode == MIG_MODE_CPR_REBOOT; >>>>> +} >>>>> + >>>>> int migrate_init(MigrationState *s, Error **errp) >>>>> { >>>>> int ret; >>>>> @@ -2651,13 +2656,14 @@ static int >>>>> migration_completion_precopy(MigrationState *s, >>>>> bql_lock(); >>>>> migration_downtime_start(s); >>>>> >>>>> -s->vm_old_state = runstate_get(); >>>>> -global_state_store(); >>>>> - >>>>> -ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >>>>> -trace_migration_completion_vm_stop(ret); >>>>> -if (ret < 0) { >>>>> -goto out_unlock; >>>>> +if (!migrate_mode_is_cpr(s)) { >>>>> +s->vm_old_state = runstate_get(); >>>>> +global_state_store(); >>>>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >>>>> +trace_migration_completion_vm_stop(ret); >>>>> +if (ret < 0) { >>>>> +goto out_unlock; >>>>> +} >>>>> } >>>>> >>>>> ret = migration_maybe_pause(s, current_active_state, >>>>> @@ -3576,6 +3582,7 @@ void migrate_fd_connect(MigrationState *s, Error >>>>> *error_in) >>>>> Error *local_err = NULL; >>>>> uint64_t rate_limit; >>>>> bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED; >>>>> +int ret; >>>>> >>>>> /* >>>>> * If there's a previous error, free it and prepare for another one. >>>>> @@ -3651,6 +3658,17 @@ void migrate_fd_connect(MigrationState *s, Error >>>>> *error_in) >>>>> goto fail; >>>>> } >>>>> >>>>> +if (migrate_mode_is_cpr(s)) { >>>>> +s->vm_old_state = runstate_get(); >>>>> +global_state_store(); >>>>> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >>>>> +trace_migration_completion_vm_stop(ret); >>>>> +if (ret < 0) { >>>>> +error_setg(_err, "migration_stop_vm failed, error %d", >>>>> -ret); >>>>> +
Re: [PATCH V3 10/13] migration: stop vm for cpr
On 2/22/2024 4:03 AM, Peter Xu wrote: > On Wed, Feb 21, 2024 at 04:23:07PM -0500, Steven Sistare wrote: >>> How about postcopy? I know it's nonsense to enable postcopy for cpr.. but >>> iiuc we don't yet forbid an user doing so. Maybe we should? >> >> How about this? >> >> --- >> @@ -3600,6 +3600,11 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> return; >> } >> >> +if (migrate_mode_is_cpr(s) && migrate_postcopy()) { >> +error_setg(_err, "cannot mix postcopy and cpr"); >> +goto fail; >> +} >> + >> if (resume) { >> /* This is a resumed migration */ >> rate_limit = migrate_max_postcopy_bandwidth(); >> > > migrate_fd_connect() will be a bit late, the error won't be able to be > attached in the "migrate" request. Perhaps, migrate_prepare()? Thank you, that is better - steve
Re: [PATCH V3 10/13] migration: stop vm for cpr
On 2/20/2024 2:33 AM, Peter Xu wrote: > On Thu, Feb 08, 2024 at 10:54:03AM -0800, Steve Sistare wrote: >> When migration for cpr is initiated, stop the vm and set state >> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >> possibility of ram and device state being out of sync, and guarantees >> that a guest in the suspended state remains suspended, because qmp_cont >> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >> >> Signed-off-by: Steve Sistare >> --- >> include/migration/misc.h | 1 + >> migration/migration.c| 32 +--- >> 2 files changed, 26 insertions(+), 7 deletions(-) >> >> diff --git a/include/migration/misc.h b/include/migration/misc.h >> index 6dc234b..54c99a3 100644 >> --- a/include/migration/misc.h >> +++ b/include/migration/misc.h >> @@ -60,6 +60,7 @@ void migration_object_init(void); >> void migration_shutdown(void); >> bool migration_is_idle(void); >> bool migration_is_active(MigrationState *); >> +bool migrate_mode_is_cpr(MigrationState *); >> >> typedef enum MigrationEventType { >> MIG_EVENT_PRECOPY_SETUP, >> diff --git a/migration/migration.c b/migration/migration.c >> index d1fce9e..fc5c587 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1603,6 +1603,11 @@ bool migration_is_active(MigrationState *s) >> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE); >> } >> >> +bool migrate_mode_is_cpr(MigrationState *s) >> +{ >> +return s->parameters.mode == MIG_MODE_CPR_REBOOT; >> +} >> + >> int migrate_init(MigrationState *s, Error **errp) >> { >> int ret; >> @@ -2651,13 +2656,14 @@ static int >> migration_completion_precopy(MigrationState *s, >> bql_lock(); >> migration_downtime_start(s); >> >> -s->vm_old_state = runstate_get(); >> -global_state_store(); >> - >> -ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> -trace_migration_completion_vm_stop(ret); >> -if (ret < 0) { >> -goto out_unlock; >> +if (!migrate_mode_is_cpr(s)) { >> +s->vm_old_state = runstate_get(); >> +global_state_store(); >> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> +trace_migration_completion_vm_stop(ret); >> +if (ret < 0) { >> +goto out_unlock; >> +} >> } >> >> ret = migration_maybe_pause(s, current_active_state, >> @@ -3576,6 +3582,7 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> Error *local_err = NULL; >> uint64_t rate_limit; >> bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED; >> +int ret; >> >> /* >> * If there's a previous error, free it and prepare for another one. >> @@ -3651,6 +3658,17 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> goto fail; >> } >> >> +if (migrate_mode_is_cpr(s)) { >> +s->vm_old_state = runstate_get(); >> +global_state_store(); >> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> +trace_migration_completion_vm_stop(ret); >> +if (ret < 0) { >> +error_setg(_err, "migration_stop_vm failed, error %d", >> -ret); >> +goto fail; >> +} >> +} > > Could we have a helper function for the shared codes? > > How about postcopy? I know it's nonsense to enable postcopy for cpr.. but > iiuc we don't yet forbid an user doing so. Maybe we should? How about this? --- @@ -3600,6 +3600,11 @@ void migrate_fd_connect(MigrationState *s, Error *error_in) return; } +if (migrate_mode_is_cpr(s) && migrate_postcopy()) { +error_setg(_err, "cannot mix postcopy and cpr"); +goto fail; +} + if (resume) { /* This is a resumed migration */ rate_limit = migrate_max_postcopy_bandwidth(); - Steve
Re: [PATCH V3 10/13] migration: stop vm for cpr
On 2/20/2024 2:33 AM, Peter Xu wrote: > On Thu, Feb 08, 2024 at 10:54:03AM -0800, Steve Sistare wrote: >> When migration for cpr is initiated, stop the vm and set state >> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >> possibility of ram and device state being out of sync, and guarantees >> that a guest in the suspended state remains suspended, because qmp_cont >> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >> >> Signed-off-by: Steve Sistare >> --- >> include/migration/misc.h | 1 + >> migration/migration.c| 32 +--- >> 2 files changed, 26 insertions(+), 7 deletions(-) >> >> diff --git a/include/migration/misc.h b/include/migration/misc.h >> index 6dc234b..54c99a3 100644 >> --- a/include/migration/misc.h >> +++ b/include/migration/misc.h >> @@ -60,6 +60,7 @@ void migration_object_init(void); >> void migration_shutdown(void); >> bool migration_is_idle(void); >> bool migration_is_active(MigrationState *); >> +bool migrate_mode_is_cpr(MigrationState *); >> >> typedef enum MigrationEventType { >> MIG_EVENT_PRECOPY_SETUP, >> diff --git a/migration/migration.c b/migration/migration.c >> index d1fce9e..fc5c587 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1603,6 +1603,11 @@ bool migration_is_active(MigrationState *s) >> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE); >> } >> >> +bool migrate_mode_is_cpr(MigrationState *s) >> +{ >> +return s->parameters.mode == MIG_MODE_CPR_REBOOT; >> +} >> + >> int migrate_init(MigrationState *s, Error **errp) >> { >> int ret; >> @@ -2651,13 +2656,14 @@ static int >> migration_completion_precopy(MigrationState *s, >> bql_lock(); >> migration_downtime_start(s); >> >> -s->vm_old_state = runstate_get(); >> -global_state_store(); >> - >> -ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> -trace_migration_completion_vm_stop(ret); >> -if (ret < 0) { >> -goto out_unlock; >> +if (!migrate_mode_is_cpr(s)) { >> +s->vm_old_state = runstate_get(); >> +global_state_store(); >> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> +trace_migration_completion_vm_stop(ret); >> +if (ret < 0) { >> +goto out_unlock; >> +} >> } >> >> ret = migration_maybe_pause(s, current_active_state, >> @@ -3576,6 +3582,7 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> Error *local_err = NULL; >> uint64_t rate_limit; >> bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED; >> +int ret; >> >> /* >> * If there's a previous error, free it and prepare for another one. >> @@ -3651,6 +3658,17 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> goto fail; >> } >> >> +if (migrate_mode_is_cpr(s)) { >> +s->vm_old_state = runstate_get(); >> +global_state_store(); >> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> +trace_migration_completion_vm_stop(ret); >> +if (ret < 0) { >> +error_setg(_err, "migration_stop_vm failed, error %d", >> -ret); >> +goto fail; >> +} >> +} > > Could we have a helper function for the shared codes? I propose to add code to migration_stop_vm to make it the helper. Some call sites emit more traces (via migration_stop_vm) as a result of my refactoring, and postcopy start sets vm_old_state, which is not used thereafter in that path. Those changes seem harmless to me. Tell me what you think: --- diff --git a/migration/migration.c b/migration/migration.c index fc5c587..30d2b08 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -107,12 +107,6 @@ static int migration_maybe_pause(MigrationState *s, static void migrate_fd_cancel(MigrationState *s); static bool close_return_path_on_source(MigrationState *s); -static void migration_downtime_start(MigrationState *s) -{ -trace_vmstate_downtime_checkpoint("src-downtime-start"); -s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); -} - static void migration_downtime_end(MigrationState *s) { int64_t now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); @@ -161,11 +155,20 @@ static gint page_request_addr_cmp(gconstpointer ap, gconstpo return (a > b) - (a < b); } -int migration_stop_vm(RunState state) +static int migration_stop_vm(MigrationState *s, RunState state) { -int ret = vm_stop_force_state(state); +int ret; + +trace_vmstate_downtime_checkpoint("src-downtime-start"); +s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); + +s->vm_old_state = runstate_get(); +global_state_store(); + +ret = vm_stop_force_state(state); trace_vmstate_downtime_checkpoint("src-vm-stopped"); +trace_migration_completion_vm_stop(ret); return ret; } @@ -2454,10 +2457,7 @@ static int
Re: [PATCH V3 12/13] vfio: allow cpr-reboot migration if suspended
Hi Alex, any comments or RB on this or patch 11? The last few changes I am making for Peter will not change this patch. - Steve On 2/8/2024 1:54 PM, Steve Sistare wrote: > Allow cpr-reboot for vfio if the guest is in the suspended runstate. The > guest drivers' suspend methods flush outstanding requests and re-initialize > the devices, and thus there is no device state to save and restore. The > user is responsible for suspending the guest before initiating cpr, such as > by issuing guest-suspend-ram to the qemu guest agent. > > Relax the vfio blocker so it does not apply to cpr, and add a notifier that > verifies the guest is suspended. > > Signed-off-by: Steve Sistare > --- > hw/vfio/common.c | 2 +- > hw/vfio/cpr.c | 20 > hw/vfio/migration.c | 2 +- > include/hw/vfio/vfio-container-base.h | 1 + > 4 files changed, 23 insertions(+), 2 deletions(-) > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c > index 059bfdc..ff88c3f 100644 > --- a/hw/vfio/common.c > +++ b/hw/vfio/common.c > @@ -128,7 +128,7 @@ int vfio_block_multiple_devices_migration(VFIODevice > *vbasedev, Error **errp) > error_setg(_devices_migration_blocker, > "Multiple VFIO devices migration is supported only if all of " > "them support P2P migration"); > -ret = migrate_add_blocker(_devices_migration_blocker, errp); > +ret = migrate_add_blocker_normal(_devices_migration_blocker, > errp); > > return ret; > } > diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c > index 3bede54..392c2dd 100644 > --- a/hw/vfio/cpr.c > +++ b/hw/vfio/cpr.c > @@ -7,13 +7,33 @@ > > #include "qemu/osdep.h" > #include "hw/vfio/vfio-common.h" > +#include "migration/misc.h" > #include "qapi/error.h" > +#include "sysemu/runstate.h" > + > +static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, > +MigrationEvent *e, Error **errp) > +{ > +if (e->type == MIG_EVENT_PRECOPY_SETUP && > +!runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) { > + > +error_setg(errp, > +"VFIO device only supports cpr-reboot for runstate suspended"); > + > +return -1; > +} > +return 0; > +} > > int vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp) > { > +migration_add_notifier_mode(>cpr_reboot_notifier, > +vfio_cpr_reboot_notifier, > +MIG_MODE_CPR_REBOOT); > return 0; > } > > void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer) > { > +migration_remove_notifier(>cpr_reboot_notifier); > } > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c > index 50140ed..2050ac8 100644 > --- a/hw/vfio/migration.c > +++ b/hw/vfio/migration.c > @@ -889,7 +889,7 @@ static int vfio_block_migration(VFIODevice *vbasedev, > Error *err, Error **errp) > vbasedev->migration_blocker = error_copy(err); > error_free(err); > > -return migrate_add_blocker(>migration_blocker, errp); > +return migrate_add_blocker_normal(>migration_blocker, errp); > } > > /* -- */ > diff --git a/include/hw/vfio/vfio-container-base.h > b/include/hw/vfio/vfio-container-base.h > index b2813b0..3582d5f 100644 > --- a/include/hw/vfio/vfio-container-base.h > +++ b/include/hw/vfio/vfio-container-base.h > @@ -49,6 +49,7 @@ typedef struct VFIOContainerBase { > QLIST_ENTRY(VFIOContainerBase) next; > QLIST_HEAD(, VFIODevice) device_list; > GList *iova_ranges; > +NotifierWithReturn cpr_reboot_notifier; > } VFIOContainerBase; > > typedef struct VFIOGuestIOMMU {
Re: [PATCH V4 2/5] qapi: QAPI_LIST_LENGTH
On 2/21/2024 8:29 AM, Markus Armbruster wrote: > Steve Sistare writes: > >> Signed-off-by: Steve Sistare >> Reviewed-by: Marc-André Lureau >> --- >> include/qapi/util.h | 13 + >> 1 file changed, 13 insertions(+) >> >> diff --git a/include/qapi/util.h b/include/qapi/util.h >> index 81a2b13..e1b8b1d 100644 >> --- a/include/qapi/util.h >> +++ b/include/qapi/util.h >> @@ -56,4 +56,17 @@ int parse_qapi_name(const char *name, bool complete); >> (tail) = &(*(tail))->next; \ >> } while (0) >> >> +/* >> + * For any GenericList @list, return its length. >> + */ >> +#define QAPI_LIST_LENGTH(list) \ >> +({ \ >> +int len = 0; \ > > size_t ok. >> +typeof(list) elem; \ > > Name this @tail, please. ok. >> +for (elem = list; elem != NULL; elem = elem->next) { \ >> +len++; \ >> +} \ >> +len; \ >> +}) >> + >> #endif > > This is a macro instead of a function so users don't have to cast their > FooList * to GenericList *. > > The only user outside tests is strv_from_strList(). I'd be tempted to > open-code it there and call it a day. Or do you have more users in > mind? That's the only use. If I make it private, I would still define it as a static subroutine in util/strList.c, because it simplifies strv_from_strList. IMO providing a public list length function or macro is pretty basic, but I am not wedded to it. Your call. - Steve > If we keep the macro, please align the backslashes like this: > >#define QAPI_LIST_LENGTH(list) \ >({ \ >int len = 0;\ >typeof(list) elem; \ >for (elem = list; elem != NULL; elem = elem->next) {\ >len++; \ >} \ >len;\ >}) >
Re: [PATCH V4 3/5] util: strv_from_strList
On 2/21/2024 8:14 AM, Markus Armbruster wrote: > Steve Sistare writes: > >> Signed-off-by: Steve Sistare >> Reviewed-by: Marc-André Lureau >> --- >> include/qemu/strList.h | 6 ++ >> util/strList.c | 14 ++ >> 2 files changed, 20 insertions(+) >> >> diff --git a/include/qemu/strList.h b/include/qemu/strList.h >> index 010237f..4b86aa6 100644 >> --- a/include/qemu/strList.h >> +++ b/include/qemu/strList.h >> @@ -21,4 +21,10 @@ >> */ >> strList *strList_from_string(const char *in, const char *delim); >> >> +/* >> + * Produce and return a NULL-terminated array of strings from @args. >> + * The result is g_malloc'd and all strings are g_strdup'd. >> + */ >> +GStrv strv_from_strList(const strList *args); >> + >> #endif >> diff --git a/util/strList.c b/util/strList.c >> index 7991de3..bad4187 100644 >> --- a/util/strList.c >> +++ b/util/strList.c >> @@ -22,3 +22,17 @@ strList *strList_from_string(const char *str, const char >> *delim) >> >> return res; >> } >> + >> +GStrv strv_from_strList(const strList *args) > > Suggest to name the argument @list. > >> +{ >> +const strList *arg; > > Suggest to name this @tail. ok. >> +int i = 0; >> +GStrv argv = g_new(char *, QAPI_LIST_LENGTH(args) + 1); >> + >> +for (arg = args; arg != NULL; arg = arg->next) { >> +argv[i++] = g_strdup(arg->value); >> +} >> +argv[i] = NULL; >> + >> +return argv; >> +} > > Can we use char ** instread of GStrv? I'd find that clearer. For what > it's worth, GLib documentation of functions like g_strsplit() doesn't > use the GStrv typedef, either. ok. - Steve
Re: [PATCH V4 4/5] util: strList unit tests
On 2/21/2024 8:19 AM, Markus Armbruster wrote: > Steve Sistare writes: > >> Signed-off-by: Steve Sistare >> Reviewed-by: Marc-André Lureau >> --- >> tests/unit/meson.build| 1 + >> tests/unit/test-strList.c | 80 >> +++ >> 2 files changed, 81 insertions(+) >> create mode 100644 tests/unit/test-strList.c >> >> diff --git a/tests/unit/meson.build b/tests/unit/meson.build >> index 69f9c05..113d12e 100644 >> --- a/tests/unit/meson.build >> +++ b/tests/unit/meson.build >> @@ -35,6 +35,7 @@ tests = { >>'test-rcu-simpleq': [], >>'test-rcu-tailq': [], >>'test-rcu-slist': [], >> + 'test-strList': [], >>'test-qdist': [], >>'test-qht': [], >>'test-qtree': [], >> diff --git a/tests/unit/test-strList.c b/tests/unit/test-strList.c >> new file mode 100644 >> index 000..49a1cfd >> --- /dev/null >> +++ b/tests/unit/test-strList.c >> @@ -0,0 +1,80 @@ >> +/* >> + * Copyright (c) 2022 - 2024 Oracle and/or its affiliates. >> + * >> + * This work is licensed under the terms of the GNU GPL, version 2 or later. >> + * See the COPYING file in the top-level directory. >> + */ >> + >> +#include "qemu/osdep.h" >> +#include "qemu/strList.h" >> + >> +static strList *make_list(int length) >> +{ >> +strList *head = 0, *list, **prev = >> + >> +while (length--) { >> +list = *prev = g_new0(strList, 1); >> +list->value = g_strdup("aaa"); >> +prev = >next; >> +} >> +return head; >> +} >> + >> +static void test_length(void) >> +{ >> +strList *list; >> +int i; >> + >> +for (i = 0; i < 5; i++) { >> +list = make_list(i); >> +g_assert_cmpint(i, ==, QAPI_LIST_LENGTH(list)); >> +qapi_free_strList(list); >> +} >> +} >> + >> +struct { >> +const char *string; >> +const char *delim; >> +const char *args[5]; >> +} list_data[] = { >> +{ 0, ",", { 0 } }, >> +{ "", ",", { 0 } }, >> +{ "a", ",", { "a", 0 } }, >> +{ "a,b", ",", { "a", "b", 0 } }, >> +{ "a,b,c", ",", { "a", "b", "c", 0 } }, >> +{ "first last", " ", { "first", "last", 0 } }, >> +{ "a:", ":", { "a", "", 0 } }, >> +{ "a::b", ":", { "a", "", "b", 0 } }, >> +{ ":", ":", { "", "", 0 } }, >> +{ ":a", ":", { "", "a", 0 } }, >> +{ "::a", ":", { "", "", "a", 0 } }, >> +}; > > NULL instead of 0, please. ok. >> + >> +static void test_strv(void) >> +{ >> +int i, j; >> +const char **expect; >> +strList *list; >> +GStrv args; > > I'd prefer char **argv; ok. >> + >> +for (i = 0; i < ARRAY_SIZE(list_data); i++) { >> +expect = list_data[i].args; >> +list = strList_from_string(list_data[i].string, list_data[i].delim); >> +args = strv_from_strList(list); >> +qapi_free_strList(list); >> +for (j = 0; expect[j] && args[j]; j++) { > > Loop stops when either array element is null. That's wrong, we need to > exhaust both arrays: || instead of &&. || is not safe. After one array is exhausted, [j] will be out of bounds for the other. The g_assert_null calls guarantee the arrays are the same length and all elements have been compared. - Steve >> +g_assert_cmpstr(expect[j], ==, args[j]); >> +} >> +g_assert_null(expect[j]); >> +g_assert_null(args[j]); >> +g_strfreev(args); >> +} >> +} >> + >> +int main(int argc, char **argv) >> +{ >> +g_test_init(, , NULL); >> +g_test_add_func("/test-string/length", test_length); >> +g_test_add_func("/test-string/strv", test_strv); >> +return g_test_run(); >> +} >
Re: [PATCH V4 1/5] util: strList_from_string
On 2/21/2024 8:29 AM, Markus Armbruster wrote: > I apologize for the lateness of my review. No problem. Thanks for the review. > Steve Sistare writes: > >> Generalize hmp_split_at_comma() to take any delimiter string, rename >> as strList_from_string(), and move it to util/strList.c. >> >> No functional change. >> >> Signed-off-by: Steve Sistare > > I can't see an actual use of generalized delimiters outside tests in > this series. Do you have uses? In this series, it is called from hmp_announce_self and stats_filter; those were formerly calls to hmp_split_at_comma. In my live update cpr-exec series, there is an additional call site, with a space delimiter instead of comma. Live update V9 is posted but is old and obsolete. I will post V10 soon, but I am hoping you can pull this series first, so I can whittle down my pending patches and omit these from V10. >> --- >> include/monitor/hmp.h | 1 - >> include/qemu/strList.h | 24 >> monitor/hmp-cmds.c | 19 --- >> net/net-hmp-cmds.c | 3 ++- >> stats/stats-hmp-cmds.c | 3 ++- >> util/meson.build | 1 + >> util/strList.c | 24 >> 7 files changed, 53 insertions(+), 22 deletions(-) >> create mode 100644 include/qemu/strList.h >> create mode 100644 util/strList.c >> >> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h >> index 13f9a2d..2df661e 100644 >> --- a/include/monitor/hmp.h >> +++ b/include/monitor/hmp.h >> @@ -19,7 +19,6 @@ >> >> bool hmp_handle_error(Monitor *mon, Error *err); >> void hmp_help_cmd(Monitor *mon, const char *name); >> -strList *hmp_split_at_comma(const char *str); >> >> void hmp_info_name(Monitor *mon, const QDict *qdict); >> void hmp_info_version(Monitor *mon, const QDict *qdict); >> diff --git a/include/qemu/strList.h b/include/qemu/strList.h >> new file mode 100644 >> index 000..010237f >> --- /dev/null >> +++ b/include/qemu/strList.h >> @@ -0,0 +1,24 @@ >> +/* >> + * Copyright (c) 2022 - 2024 Oracle and/or its affiliates. >> + * >> + * This work is licensed under the terms of the GNU GPL, version 2 or later. >> + * See the COPYING file in the top-level directory. >> + */ >> + >> +#ifndef QEMU_STR_LIST_H >> +#define QEMU_STR_LIST_H >> + >> +#include "qapi/qapi-builtin-types.h" >> + >> +/* >> + * Break @in into a strList using the delimiter string @delim. >> + * The delimiter is not included in the result. >> + * Return NULL if @in is NULL or an empty string. >> + * A leading, trailing, or consecutive delimiter produces an >> + * empty string at that position in the output. >> + * All strings are g_strdup'd, and the result can be freed >> + * using qapi_free_strList. >> + */ >> +strList *strList_from_string(const char *in, const char *delim); > > The function name no longer tells us explicitly what the function does: > splitting the string. The first sentence does not say it? "Break @in into a strList using the delimiter string @delim" Would you prefer this? "Split string @in into a strList using the delimiter string @delim" - Steve >> +#endif >> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c >> index 871898a..66b68a0 100644 >> --- a/monitor/hmp-cmds.c >> +++ b/monitor/hmp-cmds.c >> @@ -38,25 +38,6 @@ bool hmp_handle_error(Monitor *mon, Error *err) >> return false; >> } >> >> -/* >> - * Split @str at comma. >> - * A null @str defaults to "". >> - */ >> -strList *hmp_split_at_comma(const char *str) >> -{ >> -char **split = g_strsplit(str ?: "", ",", -1); >> -strList *res = NULL; >> -strList **tail = >> -int i; >> - >> -for (i = 0; split[i]; i++) { >> -QAPI_LIST_APPEND(tail, split[i]); >> -} >> - >> -g_free(split); >> -return res; >> -} >> - >> void hmp_info_name(Monitor *mon, const QDict *qdict) >> { >> NameInfo *info; >> diff --git a/net/net-hmp-cmds.c b/net/net-hmp-cmds.c >> index 41d326b..e893801 100644 >> --- a/net/net-hmp-cmds.c >> +++ b/net/net-hmp-cmds.c >> @@ -26,6 +26,7 @@ >> #include "qemu/config-file.h" >> #include "qemu/help_option.h" >> #include "qemu/option.h" >> +#include "qemu/strList.h" >> >> void hmp_info_network(Monitor *mon, const QDict *qdict) >> { >> @@ -72,7 +73,7 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict) >> migrate_announce_params()); >> >> qapi_free_strList(params->interfaces); >> -params->interfaces = hmp_split_at_comma(interfaces_str); >> +params->interfaces = strList_from_string(interfaces_str, ","); >> params->has_interfaces = params->interfaces != NULL; >> params->id = g_strdup(id); >> qmp_announce_self(params, NULL); >> diff --git a/stats/stats-hmp-cmds.c b/stats/stats-hmp-cmds.c >> index 1f91bf8..428c0e6 100644 >> --- a/stats/stats-hmp-cmds.c >> +++ b/stats/stats-hmp-cmds.c >> @@ -10,6 +10,7 @@ >> #include "monitor/hmp.h" >> #include "monitor/monitor.h" >> #include "qemu/cutils.h" >> +#include
Re: [PATCH V4 5/5] migration: simplify exec migration functions
On 2/21/2024 10:54 AM, Fabiano Rosas wrote: > Fabiano Rosas writes: > >> Steve Sistare writes: >> >>> Simplify the exec migration code by using list utility functions. >>> >>> As a side effect, this also fixes a minor memory leak. On function return, >>> "g_auto(GStrv) argv" frees argv and each element, which is wrong, because >>> the function does not own the individual elements. To compensate, the code >>> uses g_steal_pointer which NULLs argv and prevents the destructor from >>> running, but argv is leaked. >>> >>> Fixes: cbab4face57b ("migration: convert exec backend ...") >>> Signed-off-by: Steve Sistare >> >> Reviewed-by: Fabiano Rosas > > You'll have to reintroduce the qemu/cutils.h include: > > ../migration/exec.c: In function 'exec_get_cmd_path': > ../migration/exec.c:37:5: error: implicit declaration of function 'pstrcat'; > did you mean 'strcat'? [-Werror=implicit-function-declaration] >37 | pstrcat(detected_path, MAX_PATH, "\\cmd.exe"); > | ^~~ > | strcat > ../migration/exec.c:37:5: error: nested extern declaration of 'pstrcat' > [-Werror=nested-externs] Thanks, I will rebase to the tip and verify all is well before I post V5. - Steve
Re: [PATCH V3 00/13] allow cpr-reboot for vfio
On 2/20/2024 2:49 AM, Peter Xu wrote: > On Thu, Feb 08, 2024 at 10:53:53AM -0800, Steve Sistare wrote: >> Allow cpr-reboot for vfio if the guest is in the suspended runstate. The >> guest drivers' suspend methods flush outstanding requests and re-initialize >> the devices, and thus there is no device state to save and restore. The >> user is responsible for suspending the guest before initiating cpr, such as >> by issuing guest-suspend-ram to the qemu guest agent. >> >> Most of the patches in this series enhance migration notifiers so they can >> return an error status and message. The last few patches register a notifier >> for vfio that returns an error if the guest is not suspended. >> >> Changes in V3: >> * update to tip, add RB's >> * replace MigrationStatus with new enum MigrationEventType >> * simplify migrate_fd_connect error recovery >> * support vfio iommufd containers >> * add patches: >> migration: stop vm for cpr >> migration: update cpr-reboot description > > This doesn't apply to master anymore, please rebase when repost, thanks. Will do. Before I do, any comments on "migration: update cpr-reboot description"? After we converge on that short description, I will submit a longer treatment in docs/devel/migration, which I see you have recently populated. - Steve
Re: [PATCH V3 10/13] migration: stop vm for cpr
On 2/20/2024 2:33 AM, Peter Xu wrote: > On Thu, Feb 08, 2024 at 10:54:03AM -0800, Steve Sistare wrote: >> When migration for cpr is initiated, stop the vm and set state >> RUN_STATE_FINISH_MIGRATE before ram is saved. This eliminates the >> possibility of ram and device state being out of sync, and guarantees >> that a guest in the suspended state remains suspended, because qmp_cont >> rejects a cont command in the RUN_STATE_FINISH_MIGRATE state. >> >> Signed-off-by: Steve Sistare >> --- >> include/migration/misc.h | 1 + >> migration/migration.c| 32 +--- >> 2 files changed, 26 insertions(+), 7 deletions(-) >> >> diff --git a/include/migration/misc.h b/include/migration/misc.h >> index 6dc234b..54c99a3 100644 >> --- a/include/migration/misc.h >> +++ b/include/migration/misc.h >> @@ -60,6 +60,7 @@ void migration_object_init(void); >> void migration_shutdown(void); >> bool migration_is_idle(void); >> bool migration_is_active(MigrationState *); >> +bool migrate_mode_is_cpr(MigrationState *); >> >> typedef enum MigrationEventType { >> MIG_EVENT_PRECOPY_SETUP, >> diff --git a/migration/migration.c b/migration/migration.c >> index d1fce9e..fc5c587 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1603,6 +1603,11 @@ bool migration_is_active(MigrationState *s) >> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE); >> } >> >> +bool migrate_mode_is_cpr(MigrationState *s) >> +{ >> +return s->parameters.mode == MIG_MODE_CPR_REBOOT; >> +} >> + >> int migrate_init(MigrationState *s, Error **errp) >> { >> int ret; >> @@ -2651,13 +2656,14 @@ static int >> migration_completion_precopy(MigrationState *s, >> bql_lock(); >> migration_downtime_start(s); >> >> -s->vm_old_state = runstate_get(); >> -global_state_store(); >> - >> -ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> -trace_migration_completion_vm_stop(ret); >> -if (ret < 0) { >> -goto out_unlock; >> +if (!migrate_mode_is_cpr(s)) { >> +s->vm_old_state = runstate_get(); >> +global_state_store(); >> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> +trace_migration_completion_vm_stop(ret); >> +if (ret < 0) { >> +goto out_unlock; >> +} >> } >> >> ret = migration_maybe_pause(s, current_active_state, >> @@ -3576,6 +3582,7 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> Error *local_err = NULL; >> uint64_t rate_limit; >> bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED; >> +int ret; >> >> /* >> * If there's a previous error, free it and prepare for another one. >> @@ -3651,6 +3658,17 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> goto fail; >> } >> >> +if (migrate_mode_is_cpr(s)) { >> +s->vm_old_state = runstate_get(); >> +global_state_store(); >> +ret = migration_stop_vm(RUN_STATE_FINISH_MIGRATE); >> +trace_migration_completion_vm_stop(ret); >> +if (ret < 0) { >> +error_setg(_err, "migration_stop_vm failed, error %d", >> -ret); >> +goto fail; >> +} >> +} > > Could we have a helper function for the shared codes? Will do. > How about postcopy? I know it's nonsense to enable postcopy for cpr.. but > iiuc we don't yet forbid an user doing so. Maybe we should? I will assert that mode != cpr in the postcopy path instead, to prevent any nonsense. - Steve >> + >> if (migrate_background_snapshot()) { >> qemu_thread_create(>thread, "bg_snapshot", >> bg_migration_thread, s, QEMU_THREAD_JOINABLE); >> -- >> 1.8.3.1 >> >
Re: [PATCH V3 09/13] migration: notifier error checking
On 2/20/2024 2:12 AM, Peter Xu wrote: > On Thu, Feb 08, 2024 at 10:54:02AM -0800, Steve Sistare wrote: >> Check the status returned by migration notifiers and report errors. >> If notifiers fail, call the notifiers again so they can clean up. >> None of the notifiers return an error status at this time. >> >> Signed-off-by: Steve Sistare >> --- >> include/migration/misc.h | 3 ++- >> migration/migration.c| 40 +--- >> 2 files changed, 31 insertions(+), 12 deletions(-) >> >> diff --git a/include/migration/misc.h b/include/migration/misc.h >> index 0ea1902..6dc234b 100644 >> --- a/include/migration/misc.h >> +++ b/include/migration/misc.h >> @@ -82,7 +82,8 @@ void migration_add_notifier(NotifierWithReturn *notify, >> void migration_add_notifier_mode(NotifierWithReturn *notify, >> MigrationNotifyFunc func, MigMode mode); >> void migration_remove_notifier(NotifierWithReturn *notify); >> -void migration_call_notifiers(MigrationState *s, MigrationEventType type); >> +int migration_call_notifiers(MigrationState *s, MigrationEventType type, >> + Error **errp); >> bool migration_in_setup(MigrationState *); >> bool migration_has_finished(MigrationState *); >> bool migration_has_failed(MigrationState *); >> diff --git a/migration/migration.c b/migration/migration.c >> index 01d8867..d1fce9e 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1318,6 +1318,8 @@ void migrate_set_state(int *state, int old_state, int >> new_state) >> >> static void migrate_fd_cleanup(MigrationState *s) >> { >> +Error *local_err = NULL; >> + >> g_free(s->hostname); >> s->hostname = NULL; >> json_writer_free(s->vmdesc); >> @@ -1362,13 +1364,23 @@ static void migrate_fd_cleanup(MigrationState *s) >>MIGRATION_STATUS_CANCELLED); >> } >> >> +if (!migration_has_failed(s) && >> +migration_call_notifiers(s, MIG_EVENT_PRECOPY_DONE, _err)) { >> + >> +migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED); >> +migrate_set_error(s, local_err); >> +error_free(local_err); >> +} >> + >> if (s->error) { >> /* It is used on info migrate. We can't free it */ >> error_report_err(error_copy(s->error)); >> } >> -migration_call_notifiers(s, s->state == MIGRATION_STATUS_COMPLETED ? >> - MIG_EVENT_PRECOPY_DONE : >> - MIG_EVENT_PRECOPY_FAILED); >> + >> +if (migration_has_failed(s)) { >> +migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL); >> +} > > AFAIU, the whole point of such split is, allowing DONE notifies to fail too > and then if that happens we can invoke FAIL notifiers again. Correct. > > Perhaps we can avoid that complexity, but rather document only SETUP > notifiers can fail? > > The problem is that failing a notifier at this stage (if migration already > finished) can already be too late; dest QEMU can already have started > running, so no way to roll back. We can document that, check and assert > for !SETUP cases to make sure error is never hit? Makes sense. I will modify the patch as you suggest. - Steve >> + >> block_cleanup_parameters(); >> yank_unregister_instance(MIGRATION_YANK_INSTANCE); >> } >> @@ -1481,13 +1493,15 @@ void migration_remove_notifier(NotifierWithReturn >> *notify) >> } >> } >> >> -void migration_call_notifiers(MigrationState *s, MigrationEventType type) >> +int migration_call_notifiers(MigrationState *s, MigrationEventType type, >> + Error **errp) >> { >> MigMode mode = s->parameters.mode; >> MigrationEvent e; >> >> e.type = type; >> -notifier_with_return_list_notify(_state_notifiers[mode], , >> 0); >> +return >> notifier_with_return_list_notify(_state_notifiers[mode], >> +, errp); >> } >> >> bool migration_in_setup(MigrationState *s) >> @@ -2535,7 +2549,9 @@ static int postcopy_start(MigrationState *ms, Error >> **errp) >> * at the transition to postcopy and after the device state; in >> particular >> * spice needs to trigger a transition now >> */ >> -migration_call_notifiers(ms, MIG_EVENT_PRECOPY_DONE); >> +if (migration_call_notifiers(ms, MIG_EVENT_PRECOPY_DONE, errp)) { >> +goto fail; >> +} >> >> migration_downtime_end(ms); >> >> @@ -2555,11 +2571,10 @@ static int postcopy_start(MigrationState *ms, Error >> **errp) >> >> ret = qemu_file_get_error(ms->to_dst_file); >> if (ret) { >> -error_setg(errp, "postcopy_start: Migration stream errored"); >> -migrate_set_state(>state, MIGRATION_STATUS_POSTCOPY_ACTIVE, >> - MIGRATION_STATUS_FAILED); >> +error_setg_errno(errp, -ret, "postcopy_start: Migration stream >> error"); >> +
Re: [PATCH V3 09/13] migration: notifier error checking
On 2/12/2024 4:24 AM, David Hildenbrand wrote: > On 08.02.24 19:54, Steve Sistare wrote: >> Check the status returned by migration notifiers and report errors. >> If notifiers fail, call the notifiers again so they can clean up. > > IIUC, if any of the notifiers will actually start to fail, say, during > MIG_EVENT_PRECOPY_SETUP, you will call MIG_EVENT_PRECOPY_FAILED on all > notifiers. > > That will include notifiers that have never seen a MIG_EVENT_PRECOPY_SETUP > call. Correct. > Is that what we expect notifiers to be able to deal with? Can we document > that? The notifiers have always needed to handle failure without knowing the previous migration states, and are robust about unwinding their own internal state. I will document it. Thanks for all the RBs. - Steve
Re: [PATCH V2 11/11] vfio: allow cpr-reboot migration if suspended
On 1/17/2024 2:12 AM, Peter Xu wrote: > On Tue, Jan 16, 2024 at 03:37:33PM -0500, Steven Sistare wrote: >> On 1/15/2024 2:33 AM, Peter Xu wrote: >>> On Fri, Jan 12, 2024 at 07:05:10AM -0800, Steve Sistare wrote: >>>> Allow cpr-reboot for vfio if the guest is in the suspended runstate. The >>>> guest drivers' suspend methods flush outstanding requests and re-initialize >>>> the devices, and thus there is no device state to save and restore. The >>>> user is responsible for suspending the guest before initiating cpr, such as >>>> by issuing guest-suspend-ram to the qemu guest agent. >>>> >>>> Relax the vfio blocker so it does not apply to cpr, and add a notifier that >>>> verifies the guest is suspended. Skip dirty page tracking, which is N/A >>>> for >>>> cpr, to avoid ioctl errors. >>>> >>>> Signed-off-by: Steve Sistare >>>> --- >>>> hw/vfio/common.c | 2 +- >>>> hw/vfio/cpr.c | 20 >>>> hw/vfio/migration.c | 2 +- >>>> include/hw/vfio/vfio-common.h | 1 + >>>> migration/ram.c | 9 + >>>> 5 files changed, 28 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>> index 0b3352f..09af934 100644 >>>> --- a/hw/vfio/common.c >>>> +++ b/hw/vfio/common.c >>>> @@ -128,7 +128,7 @@ int vfio_block_multiple_devices_migration(VFIODevice >>>> *vbasedev, Error **errp) >>>> error_setg(_devices_migration_blocker, >>>> "Multiple VFIO devices migration is supported only if all >>>> of " >>>> "them support P2P migration"); >>>> -ret = migrate_add_blocker(_devices_migration_blocker, errp); >>>> +ret = migrate_add_blocker_normal(_devices_migration_blocker, >>>> errp); >>>> >>>> return ret; >>>> } >>>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c >>>> index bbd1c7a..9f4b1fe 100644 >>>> --- a/hw/vfio/cpr.c >>>> +++ b/hw/vfio/cpr.c >>>> @@ -7,13 +7,33 @@ >>>> >>>> #include "qemu/osdep.h" >>>> #include "hw/vfio/vfio-common.h" >>>> +#include "migration/misc.h" >>>> #include "qapi/error.h" >>>> +#include "sysemu/runstate.h" >>>> + >>>> +static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, >>>> +MigrationEvent *e, Error **errp) >>>> +{ >>>> +if (e->state == MIGRATION_STATUS_SETUP && >>>> +!runstate_check(RUN_STATE_SUSPENDED)) { >>>> + >>>> +error_setg(errp, >>>> +"VFIO device only supports cpr-reboot for runstate >>>> suspended"); >>>> + >>>> +return -1; >>>> +} >>> >>> What happens if the guest is suspended during SETUP, but then quickly waked >>> up before CPR migration completes? >> >> That is a management layer bug -- we told them the VM must be suspended. >> However, I will reject a wakeup request if migration is active and mode is >> cpr. > > Yes it needs to be enforced if it is required by cpr-reboot. QEMU > hopefully should never rely on mgmt app behavior. > >> >>>> +return 0; >>>> +} >>>> >>>> int vfio_cpr_register_container(VFIOContainer *container, Error **errp) >>>> { >>>> +migration_add_notifier_mode(>cpr_reboot_notifier, >>>> +vfio_cpr_reboot_notifier, >>>> +MIG_MODE_CPR_REBOOT); >>>> return 0; >>>> } >>>> >>>> void vfio_cpr_unregister_container(VFIOContainer *container) >>>> { >>>> +migration_remove_notifier(>cpr_reboot_notifier); >>>> } >>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c >>>> index 534fddf..488905d 100644 >>>> --- a/hw/vfio/migration.c >>>> +++ b/hw/vfio/migration.c >>>> @@ -895,7 +895,7 @@ static int vfio_block_migration(VFIODevice *vbasedev, >>>> Error *err, Error **errp) >>>> vbasedev->migration_blocker = error_copy(err); >>>> error
Re: [PATCH V2 11/11] vfio: allow cpr-reboot migration if suspended
On 1/16/2024 3:37 PM, Steven Sistare wrote: > On 1/15/2024 2:33 AM, Peter Xu wrote: >> On Fri, Jan 12, 2024 at 07:05:10AM -0800, Steve Sistare wrote: >>> Allow cpr-reboot for vfio if the guest is in the suspended runstate. The >>> guest drivers' suspend methods flush outstanding requests and re-initialize >>> the devices, and thus there is no device state to save and restore. The >>> user is responsible for suspending the guest before initiating cpr, such as >>> by issuing guest-suspend-ram to the qemu guest agent. >>> >>> Relax the vfio blocker so it does not apply to cpr, and add a notifier that >>> verifies the guest is suspended. Skip dirty page tracking, which is N/A for >>> cpr, to avoid ioctl errors. >>> >>> Signed-off-by: Steve Sistare >>> --- >>> hw/vfio/common.c | 2 +- >>> hw/vfio/cpr.c | 20 >>> hw/vfio/migration.c | 2 +- >>> include/hw/vfio/vfio-common.h | 1 + >>> migration/ram.c | 9 + >>> 5 files changed, 28 insertions(+), 6 deletions(-) >>> >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>> index 0b3352f..09af934 100644 >>> --- a/hw/vfio/common.c >>> +++ b/hw/vfio/common.c >>> @@ -128,7 +128,7 @@ int vfio_block_multiple_devices_migration(VFIODevice >>> *vbasedev, Error **errp) >>> error_setg(_devices_migration_blocker, >>> "Multiple VFIO devices migration is supported only if all >>> of " >>> "them support P2P migration"); >>> -ret = migrate_add_blocker(_devices_migration_blocker, errp); >>> +ret = migrate_add_blocker_normal(_devices_migration_blocker, >>> errp); >>> >>> return ret; >>> } >>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c >>> index bbd1c7a..9f4b1fe 100644 >>> --- a/hw/vfio/cpr.c >>> +++ b/hw/vfio/cpr.c >>> @@ -7,13 +7,33 @@ >>> >>> #include "qemu/osdep.h" >>> #include "hw/vfio/vfio-common.h" >>> +#include "migration/misc.h" >>> #include "qapi/error.h" >>> +#include "sysemu/runstate.h" >>> + >>> +static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, >>> +MigrationEvent *e, Error **errp) >>> +{ >>> +if (e->state == MIGRATION_STATUS_SETUP && >>> +!runstate_check(RUN_STATE_SUSPENDED)) { >>> + >>> +error_setg(errp, >>> +"VFIO device only supports cpr-reboot for runstate suspended"); >>> + >>> +return -1; >>> +} >> >> What happens if the guest is suspended during SETUP, but then quickly waked >> up before CPR migration completes? > > That is a management layer bug -- we told them the VM must be suspended. > However, I will reject a wakeup request if migration is active and mode is > cpr. > >>> +return 0; >>> +} >>> >>> int vfio_cpr_register_container(VFIOContainer *container, Error **errp) >>> { >>> +migration_add_notifier_mode(>cpr_reboot_notifier, >>> +vfio_cpr_reboot_notifier, >>> +MIG_MODE_CPR_REBOOT); >>> return 0; >>> } >>> >>> void vfio_cpr_unregister_container(VFIOContainer *container) >>> { >>> +migration_remove_notifier(>cpr_reboot_notifier); >>> } >>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c >>> index 534fddf..488905d 100644 >>> --- a/hw/vfio/migration.c >>> +++ b/hw/vfio/migration.c >>> @@ -895,7 +895,7 @@ static int vfio_block_migration(VFIODevice *vbasedev, >>> Error *err, Error **errp) >>> vbasedev->migration_blocker = error_copy(err); >>> error_free(err); >>> >>> -return migrate_add_blocker(>migration_blocker, errp); >>> +return migrate_add_blocker_normal(>migration_blocker, errp); >>> } >>> >>> /* -- >>> */ >>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >>> index 1add5b7..7a46e24 100644 >>> --- a/include/hw/vfio/vfio-common.h >>> +++ b/include/hw/vfio/vfio-common.h >>> @@ -78,6 +78,7 @@ struct VFIOGroup
Re: [PATCH V2 11/11] vfio: allow cpr-reboot migration if suspended
On 1/15/2024 2:33 AM, Peter Xu wrote: > On Fri, Jan 12, 2024 at 07:05:10AM -0800, Steve Sistare wrote: >> Allow cpr-reboot for vfio if the guest is in the suspended runstate. The >> guest drivers' suspend methods flush outstanding requests and re-initialize >> the devices, and thus there is no device state to save and restore. The >> user is responsible for suspending the guest before initiating cpr, such as >> by issuing guest-suspend-ram to the qemu guest agent. >> >> Relax the vfio blocker so it does not apply to cpr, and add a notifier that >> verifies the guest is suspended. Skip dirty page tracking, which is N/A for >> cpr, to avoid ioctl errors. >> >> Signed-off-by: Steve Sistare >> --- >> hw/vfio/common.c | 2 +- >> hw/vfio/cpr.c | 20 >> hw/vfio/migration.c | 2 +- >> include/hw/vfio/vfio-common.h | 1 + >> migration/ram.c | 9 + >> 5 files changed, 28 insertions(+), 6 deletions(-) >> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >> index 0b3352f..09af934 100644 >> --- a/hw/vfio/common.c >> +++ b/hw/vfio/common.c >> @@ -128,7 +128,7 @@ int vfio_block_multiple_devices_migration(VFIODevice >> *vbasedev, Error **errp) >> error_setg(_devices_migration_blocker, >> "Multiple VFIO devices migration is supported only if all of >> " >> "them support P2P migration"); >> -ret = migrate_add_blocker(_devices_migration_blocker, errp); >> +ret = migrate_add_blocker_normal(_devices_migration_blocker, >> errp); >> >> return ret; >> } >> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c >> index bbd1c7a..9f4b1fe 100644 >> --- a/hw/vfio/cpr.c >> +++ b/hw/vfio/cpr.c >> @@ -7,13 +7,33 @@ >> >> #include "qemu/osdep.h" >> #include "hw/vfio/vfio-common.h" >> +#include "migration/misc.h" >> #include "qapi/error.h" >> +#include "sysemu/runstate.h" >> + >> +static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, >> +MigrationEvent *e, Error **errp) >> +{ >> +if (e->state == MIGRATION_STATUS_SETUP && >> +!runstate_check(RUN_STATE_SUSPENDED)) { >> + >> +error_setg(errp, >> +"VFIO device only supports cpr-reboot for runstate suspended"); >> + >> +return -1; >> +} > > What happens if the guest is suspended during SETUP, but then quickly waked > up before CPR migration completes? That is a management layer bug -- we told them the VM must be suspended. However, I will reject a wakeup request if migration is active and mode is cpr. >> +return 0; >> +} >> >> int vfio_cpr_register_container(VFIOContainer *container, Error **errp) >> { >> +migration_add_notifier_mode(>cpr_reboot_notifier, >> +vfio_cpr_reboot_notifier, >> +MIG_MODE_CPR_REBOOT); >> return 0; >> } >> >> void vfio_cpr_unregister_container(VFIOContainer *container) >> { >> +migration_remove_notifier(>cpr_reboot_notifier); >> } >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c >> index 534fddf..488905d 100644 >> --- a/hw/vfio/migration.c >> +++ b/hw/vfio/migration.c >> @@ -895,7 +895,7 @@ static int vfio_block_migration(VFIODevice *vbasedev, >> Error *err, Error **errp) >> vbasedev->migration_blocker = error_copy(err); >> error_free(err); >> >> -return migrate_add_blocker(>migration_blocker, errp); >> +return migrate_add_blocker_normal(>migration_blocker, errp); >> } >> >> /* -- */ >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >> index 1add5b7..7a46e24 100644 >> --- a/include/hw/vfio/vfio-common.h >> +++ b/include/hw/vfio/vfio-common.h >> @@ -78,6 +78,7 @@ struct VFIOGroup; >> typedef struct VFIOContainer { >> VFIOContainerBase bcontainer; >> int fd; /* /dev/vfio/vfio, empowered by the attached groups */ >> +NotifierWithReturn cpr_reboot_notifier; >> unsigned iommu_type; >> QLIST_HEAD(, VFIOGroup) group_list; >> } VFIOContainer; >> diff --git a/migration/ram.c b/migration/ram.c >> index 1923366..44ad324 100644 >> --- a/migration/ram.c >> +++ b/migration/ram.c >> @@ -2392,8 +2392,8 @@ static void ram_save_cleanup(void *opaque) >> RAMState **rsp = opaque; >> RAMBlock *block; >> >> -/* We don't use dirty log with background snapshots */ >> -if (!migrate_background_snapshot()) { >> +/* We don't use dirty log with background snapshots or cpr */ >> +if (!migrate_background_snapshot() && migrate_mode() == >> MIG_MODE_NORMAL) { > > Same question here, on what happens if the user resumes the VM before > migration completes? IIUC shared-ram is not required, then it means if > that happens the cpr migration image can contain corrupted data, and that > may be a problem. > > Background snapshot is special in that it relies on totally different > tracking facilities
Re: [PATCH V2 04/11] migration: remove migration_in_postcopy parameter
On 1/15/2024 1:48 AM, Peter Xu wrote: > On Fri, Jan 12, 2024 at 07:05:03AM -0800, Steve Sistare wrote: >> bool migration_in_incoming_postcopy(void) >> diff --git a/ui/spice-core.c b/ui/spice-core.c >> index b3cd229..e43a93f 100644 >> --- a/ui/spice-core.c >> +++ b/ui/spice-core.c >> @@ -580,7 +580,7 @@ static int migration_state_notifier(NotifierWithReturn >> *notifier, >> if (migration_in_setup(s)) { >> spice_server_migrate_start(spice_server); >> } else if (migration_has_finished(s) || >> - migration_in_postcopy_after_devices(s)) { >> + migration_in_postcopy_after_devices()) { > > This can be a reply also to your other email: my previous suggestion of > using PRECOPY_DONE should apply here, where we can convert this chunk into: > > } else if (event == MIG_EVENT_PRECOPY_DONE) {...} > > Because PRECOPY_DONE should also cover the notification from > postcopy_start(), then we can drop migration_in_postcopy_after_devices() > completely, I think. Yes, that works. I will define and use MIG_EVENT_PRECOPY_DONE and friends in "MigrationEvent for notifiers". - Steve
Re: [PATCH V2 00/11] allow cpr-reboot for vfio
On 1/12/2024 4:38 PM, Alex Williamson wrote: > On Fri, 12 Jan 2024 07:04:59 -0800 > Steve Sistare wrote: > >> Allow cpr-reboot for vfio if the guest is in the suspended runstate. The >> guest drivers' suspend methods flush outstanding requests and re-initialize >> the devices, and thus there is no device state to save and restore. The >> user is responsible for suspending the guest before initiating cpr, such as >> by issuing guest-suspend-ram to the qemu guest agent. >> >> Most of the patches in this series enhance migration notifiers so they can >> return an error status and message. The last two patches register a notifier >> for vfio that returns an error if the guest is not suspended. > > Hi Steve, > > This appears to only support legacy container mode and not cdev/iommufd > mode. Shouldn't this support both? Thanks, My bad, thanks! I'll move cpr_reboot_notifier to the base container, and also call cpr register/unregister from iommufd_cdev_attach/iommufd_cdev_detach. - Steve >> Steve Sistare (11): >> notify: pass error to notifier with return >> migration: remove error from notifier data >> migration: convert to NotifierWithReturn >> migration: remove migration_in_postcopy parameter >> migration: MigrationEvent for notifiers >> migration: MigrationNotifyFunc >> migration: per-mode notifiers >> migration: refactor migrate_fd_connect failures >> migration: notifier error checking >> vfio: register container for cpr >> vfio: allow cpr-reboot migration if suspended >> >> hw/net/virtio-net.c| 14 ++--- >> hw/vfio/common.c | 2 +- >> hw/vfio/container.c| 11 +++- >> hw/vfio/cpr.c | 39 ++ >> hw/vfio/meson.build| 1 + >> hw/vfio/migration.c| 13 +++-- >> hw/virtio/vhost-user.c | 10 ++-- >> hw/virtio/virtio-balloon.c | 3 +- >> include/hw/vfio/vfio-common.h | 6 ++- >> include/hw/virtio/virtio-net.h | 2 +- >> include/migration/misc.h | 21 +--- >> include/qemu/notify.h | 7 ++- >> migration/migration.c | 117 >> +++-- >> migration/postcopy-ram.c | 3 +- >> migration/postcopy-ram.h | 1 - >> migration/ram.c| 12 ++--- >> net/vhost-vdpa.c | 15 +++--- >> ui/spice-core.c| 19 +++ >> util/notify.c | 5 +- >> 19 files changed, 206 insertions(+), 95 deletions(-) >> create mode 100644 hw/vfio/cpr.c >> >
Re: [PATCH V2 03/11] migration: convert to NotifierWithReturn
On 1/15/2024 1:44 AM, Peter Xu wrote: > On Fri, Jan 12, 2024 at 07:05:02AM -0800, Steve Sistare wrote: >> Change all migration notifiers to type NotifierWithReturn, so notifiers >> can return an error status in a future patch. For now, pass NULL for the >> notifier error parameter, and do not check the return value. >> >> Signed-off-by: Steve Sistare >> --- >> hw/net/virtio-net.c| 4 +++- >> hw/vfio/migration.c| 4 +++- >> include/hw/vfio/vfio-common.h | 2 +- >> include/hw/virtio/virtio-net.h | 2 +- >> include/migration/misc.h | 6 +++--- >> migration/migration.c | 16 >> net/vhost-vdpa.c | 6 -- >> ui/spice-core.c| 8 +--- >> 8 files changed, 28 insertions(+), 20 deletions(-) >> >> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c >> index 7a2846f..9570353 100644 >> --- a/hw/net/virtio-net.c >> +++ b/hw/net/virtio-net.c >> @@ -3529,11 +3529,13 @@ static void >> virtio_net_handle_migration_primary(VirtIONet *n, MigrationState *s) >> } >> } >> >> -static void virtio_net_migration_state_notifier(Notifier *notifier, void >> *data) >> +static int virtio_net_migration_state_notifier(NotifierWithReturn *notifier, >> + void *data, Error **errp) >> { >> MigrationState *s = data; >> VirtIONet *n = container_of(notifier, VirtIONet, migration_state); >> virtio_net_handle_migration_primary(n, s); >> +return 0; >> } > > I should have mentioned this earlier.. we have a trend recently to modify > retval to boolean when Error** existed, e.g.: > > https://lore.kernel.org/all/20231017202633.296756-5-pet...@redhat.com/ > > Let's start using boolean too here? Previous patches may need touch-ups > too for this. > > Other than that it all looks good here. Thanks, Boolean makes sense when there is only one way to recover from failure. However, when the notifier may fail in different ways, and recovery differs for each, then the function should return an int errno. NotifierWithReturn could have future uses that need multiple return values and multiple recovery paths above the notifier_with_return_list_notify level, so IMO the function should continue to return int for generality. - Steve
Re: [PATCH V2 08/11] migration: refactor migrate_fd_connect failures
On 1/15/2024 2:37 AM, Peter Xu wrote: > On Fri, Jan 12, 2024 at 07:05:07AM -0800, Steve Sistare wrote: >> Move common code for the error path in migrate_fd_connect to a shared >> fail label. No functional change. >> >> Signed-off-by: Steve Sistare > > Reviewed-by: Peter Xu > > One nitpick below: > >> --- >> migration/migration.c | 22 +++--- >> 1 file changed, 11 insertions(+), 11 deletions(-) >> >> diff --git a/migration/migration.c b/migration/migration.c >> index e9914aa..c828ba7 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -3549,6 +3549,7 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> Error *local_err = NULL; >> uint64_t rate_limit; >> bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED; >> +int expect_state = s->state; > > IMHO we can drop this. > >> >> /* >> * If there's a previous error, free it and prepare for another one. >> @@ -3603,11 +3604,7 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> if (migrate_postcopy_ram() || migrate_return_path()) { >> if (open_return_path_on_source(s)) { >> error_setg(_err, "Unable to open return-path for >> postcopy"); >> -migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED); >> -migrate_set_error(s, local_err); >> -error_report_err(local_err); >> -migrate_fd_cleanup(s); >> -return; >> +goto fail; >> } >> } >> >> @@ -3629,12 +3626,8 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> } >> >> if (multifd_save_setup(_err) != 0) { >> -migrate_set_error(s, local_err); >> -error_report_err(local_err); >> -migrate_set_state(>state, MIGRATION_STATUS_SETUP, >> - MIGRATION_STATUS_FAILED); >> -migrate_fd_cleanup(s); >> -return; >> +expect_state = MIGRATION_STATUS_SETUP; > > Then drop this. > >> +goto fail; >> } >> >> if (migrate_background_snapshot()) { >> @@ -3645,6 +3638,13 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> migration_thread, s, QEMU_THREAD_JOINABLE); >> } >> s->migration_thread_running = true; >> +return; >> + >> +fail: >> +migrate_set_error(s, local_err); >> +migrate_set_state(>state, expect_state, MIGRATION_STATUS_FAILED); > > Then constantly use s->state. Yes, if you are OK with it, I also prefer to simplify here. - Steve >> +error_report_err(local_err); >> +migrate_fd_cleanup(s); >> } >> >> static void migration_class_init(ObjectClass *klass, void *data) >> -- >> 1.8.3.1 >> >
Re: [PATCH V2 05/11] migration: MigrationEvent for notifiers
On 1/12/2024 10:05 AM, Steve Sistare wrote: > Passing MigrationState to notifiers is unsound because they could access > unstable migration state internals or even modify the state. Instead, pass > the minimal info needed in a new MigrationEvent struct, which could be > extended in the future if needed. > > Suggested-by: Peter Xu > Signed-off-by: Steve Sistare > --- > [...] > diff --git a/include/migration/misc.h b/include/migration/misc.h > index dcc98bb..0b4ce0f 100644 > --- a/include/migration/misc.h > +++ b/include/migration/misc.h > @@ -60,6 +60,11 @@ void migration_object_init(void); > void migration_shutdown(void); > bool migration_is_idle(void); > bool migration_is_active(MigrationState *); > + > +typedef struct MigrationEvent { > +MigrationStatus state; > +} MigrationEvent; Hi Peter, I chose to pass MigrationStatus rather than define a new enum and map MigrationStatus to it. IMO a new enum adds little value, yet it is more code and another layer of abstraction for coders to grok. - Steve
Re: [PATCH V1 2/3] migration: notifier error reporting
On 1/10/2024 9:16 PM, Peter Xu wrote: > On Wed, Jan 10, 2024 at 01:08:41PM -0500, Steven Sistare wrote: >> On 1/10/2024 2:18 AM, Peter Xu wrote: >>> On Wed, Dec 13, 2023 at 10:11:32AM -0800, Steve Sistare wrote: >>>> After calling notifiers, check if an error has been reported via >>>> migrate_set_error, and halt the migration. >>>> >>>> None of the notifiers call migrate_set_error at this time, so no >>>> functional change. >>>> >>>> Signed-off-by: Steve Sistare >>>> --- >>>> include/migration/misc.h | 2 +- >>>> migration/migration.c| 26 ++ >>>> 2 files changed, 23 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/include/migration/misc.h b/include/migration/misc.h >>>> index 901d117..231d7e4 100644 >>>> --- a/include/migration/misc.h >>>> +++ b/include/migration/misc.h >>>> @@ -65,7 +65,7 @@ MigMode migrate_mode_of(MigrationState *); >>>> void migration_add_notifier(Notifier *notify, >>>> void (*func)(Notifier *notifier, void *data)); >>>> void migration_remove_notifier(Notifier *notify); >>>> -void migration_call_notifiers(MigrationState *s); >>>> +int migration_call_notifiers(MigrationState *s); >>>> bool migration_in_setup(MigrationState *); >>>> bool migration_has_finished(MigrationState *); >>>> bool migration_has_failed(MigrationState *); >>>> diff --git a/migration/migration.c b/migration/migration.c >>>> index d5bfe70..29a9a92 100644 >>>> --- a/migration/migration.c >>>> +++ b/migration/migration.c >>>> @@ -1280,6 +1280,8 @@ void migrate_set_state(int *state, int old_state, >>>> int new_state) >>>> >>>> static void migrate_fd_cleanup(MigrationState *s) >>>> { >>>> +bool already_failed; >>>> + >>>> qemu_bh_delete(s->cleanup_bh); >>>> s->cleanup_bh = NULL; >>>> >>>> @@ -1327,11 +1329,20 @@ static void migrate_fd_cleanup(MigrationState *s) >>>>MIGRATION_STATUS_CANCELLED); >>>> } >>>> >>>> +already_failed = migration_has_failed(s); >>>> +if (migration_call_notifiers(s)) { >>>> +if (!already_failed) { >>>> +migrate_set_state(>state, s->state, >>>> MIGRATION_STATUS_FAILED); >>>> +/* Notify again to recover from this late failure. */ >>>> +migration_call_notifiers(s); >>>> +} >>>> +} >>>> + >>>> if (s->error) { >>>> /* It is used on info migrate. We can't free it */ >>>> error_report_err(error_copy(s->error)); >>>> } >>>> -migration_call_notifiers(s); >>>> + >>>> block_cleanup_parameters(); >>>> yank_unregister_instance(MIGRATION_YANK_INSTANCE); >>>> } >>>> @@ -1450,9 +1461,10 @@ void migration_remove_notifier(Notifier *notify) >>>> } >>>> } >>>> >>>> -void migration_call_notifiers(MigrationState *s) >>>> +int migration_call_notifiers(MigrationState *s) >>>> { >>>> notifier_list_notify(_state_notifiers, s); >>>> +return (s->error != NULL); >>> >>> Exporting more migration_*() functions is pretty ugly to me.. >> >> I assume you mean migrate_set_error(), which is currently only called from >> migration/*.c code. >> >> Instead, we could define a new function migrate_set_notifier_error(), defined >> in the new file migration/notifier.h, so we clearly limit the migration >> functions which can be called from notifiers. (Its implementation just calls >> migrate_set_error) > > Fundementally this allows another .c to change one more field of > MigrationState (which is ->error) and I still want to avoid it. > > I just replied in the other thread, but now with all these in mind I think > I still prefer not passing in MigrationState* at all. It's already kind of > abused due to migrate_get_current(), and IMHO it's healthier to limit its > usage to minimum to cover the core of migration states for migration/ use > only. > > Shrinking or even stop exporting migrate_get_current() is another more > challenging task, but now what we can do is stop enlarging the direct use > of Migrat
Re: [PATCH V1 2/3] migration: notifier error reporting
On 1/10/2024 2:18 AM, Peter Xu wrote: > On Wed, Dec 13, 2023 at 10:11:32AM -0800, Steve Sistare wrote: >> After calling notifiers, check if an error has been reported via >> migrate_set_error, and halt the migration. >> >> None of the notifiers call migrate_set_error at this time, so no >> functional change. >> >> Signed-off-by: Steve Sistare >> --- >> include/migration/misc.h | 2 +- >> migration/migration.c| 26 ++ >> 2 files changed, 23 insertions(+), 5 deletions(-) >> >> diff --git a/include/migration/misc.h b/include/migration/misc.h >> index 901d117..231d7e4 100644 >> --- a/include/migration/misc.h >> +++ b/include/migration/misc.h >> @@ -65,7 +65,7 @@ MigMode migrate_mode_of(MigrationState *); >> void migration_add_notifier(Notifier *notify, >> void (*func)(Notifier *notifier, void *data)); >> void migration_remove_notifier(Notifier *notify); >> -void migration_call_notifiers(MigrationState *s); >> +int migration_call_notifiers(MigrationState *s); >> bool migration_in_setup(MigrationState *); >> bool migration_has_finished(MigrationState *); >> bool migration_has_failed(MigrationState *); >> diff --git a/migration/migration.c b/migration/migration.c >> index d5bfe70..29a9a92 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -1280,6 +1280,8 @@ void migrate_set_state(int *state, int old_state, int >> new_state) >> >> static void migrate_fd_cleanup(MigrationState *s) >> { >> +bool already_failed; >> + >> qemu_bh_delete(s->cleanup_bh); >> s->cleanup_bh = NULL; >> >> @@ -1327,11 +1329,20 @@ static void migrate_fd_cleanup(MigrationState *s) >>MIGRATION_STATUS_CANCELLED); >> } >> >> +already_failed = migration_has_failed(s); >> +if (migration_call_notifiers(s)) { >> +if (!already_failed) { >> +migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED); >> +/* Notify again to recover from this late failure. */ >> +migration_call_notifiers(s); >> +} >> +} >> + >> if (s->error) { >> /* It is used on info migrate. We can't free it */ >> error_report_err(error_copy(s->error)); >> } >> -migration_call_notifiers(s); >> + >> block_cleanup_parameters(); >> yank_unregister_instance(MIGRATION_YANK_INSTANCE); >> } >> @@ -1450,9 +1461,10 @@ void migration_remove_notifier(Notifier *notify) >> } >> } >> >> -void migration_call_notifiers(MigrationState *s) >> +int migration_call_notifiers(MigrationState *s) >> { >> notifier_list_notify(_state_notifiers, s); >> +return (s->error != NULL); > > Exporting more migration_*() functions is pretty ugly to me.. I assume you mean migrate_set_error(), which is currently only called from migration/*.c code. Instead, we could define a new function migrate_set_notifier_error(), defined in the new file migration/notifier.h, so we clearly limit the migration functions which can be called from notifiers. (Its implementation just calls migrate_set_error) > Would it be better to pass in "Error** errp" into each notifiers? That may > need an open coded notifier_list_notify(), breaking the loop if "*errp". > > And the notifier API currently only support one arg.. maybe we should > implement the notifiers ourselves, ideally passing in "(int state, Error > **errp)" instead of "(MigrationState *s)". > > Ideally with that MigrationState* shouldn't be visible outside migration/. I will regret saying this because of the amount of (mechanical) code change involved, but the cleanest solution is: * Pass errp to: notifier_with_return_list_notify(NotifierWithReturnList *list, void *data, Error *errp) * Pass errp to the NotifierWithReturn notifier: int (*notify)(NotifierWithReturn *notifier, void *data, Error **errp); * Delete the errp member from struct PostcopyNotifyData and pass errp to the notifier function Ditto for PrecopyNotifyData. * Convert all migration notifiers to NotifierWithReturn - Steve >> } >> >> bool migration_in_setup(MigrationState *s) >> @@ -2520,7 +2532,9 @@ static int postcopy_start(MigrationState *ms, Error >> **errp) >> * spice needs to trigger a transition now >> */ >> ms->postcopy_after_devices = true; >> -migration_call_notifiers(ms); >> +if (migration_call_notifiers(ms)) { >> +goto fail; >> +} >> >> migration_downtime_end(ms); >> >> @@ -3589,7 +3603,11 @@ void migrate_fd_connect(MigrationState *s, Error >> *error_in) >> rate_limit = migrate_max_bandwidth(); >> >> /* Notify before starting migration thread */ >> -migration_call_notifiers(s); >> +if (migration_call_notifiers(s)) { >> +migrate_set_state(>state, s->state, MIGRATION_STATUS_FAILED); >> +migrate_fd_cleanup(s); >> +return; >> +} >> } >> >> migration_rate_set(rate_limit); >> -- >>
Re: [PATCH V1 1/3] migration: check mode in notifiers
On 1/10/2024 2:09 AM, Peter Xu wrote: > On Wed, Dec 13, 2023 at 10:11:31AM -0800, Steve Sistare wrote: >> The existing notifiers should only apply to normal mode. >> >> No functional change. > > Instead of adding such check in every notifier, why not make CPR a separate > list of notifiers? Just like the blocker lists. Sure. I proposed minimal changes in this current series, but extending the api to take migration mode would be nicer. > Aside of this patch, I just started to look at this "notifier" code, I > really don't think we should pass in MigrationState* into the notifiers. > IIUC we only need the "state" as an enum. Then with two separate > registers, the device code knows the migration mode. > > What do you think? If we pass state, the notifier must either compare to enum values such as MIGRATION_STATUS_COMPLETED instead of calling migration_has_finished(s), or we must define new accessors such as migration_state_is_finished(state). IMO passing MigrationState is the best approach. MigrationState is an incomplete type in most notifiers, and the client can pass it to a limited set of accessors to get more information -- exactly what we want to hide migration internals. However, we could further limit the allowed accessors, eg move these to a new file "include/migration/notifier.h". #include "qemu/notify.h" void migration_add_notifier(Notifier *notify, void (*func)(Notifier *notifier, void *data)); void migration_remove_notifier(Notifier *notify); bool migration_is_active(MigrationState *); bool migration_in_setup(MigrationState *); bool migration_has_finished(MigrationState *); bool migration_has_failed(MigrationState *); --- - Steve
Re: [PATCH V6 03/14] cpus: stop vm in suspended runstate
On 12/23/2023 12:41 AM, Markus Armbruster wrote: > Steven Sistare writes: > >> On 12/22/2023 7:20 AM, Markus Armbruster wrote: >>> Steve Sistare writes: >>> >>>> Currently, a vm in the suspended state is not completely stopped. The >>>> VCPUs >>>> have been paused, but the cpu clock still runs, and runstate notifiers for >>>> the transition to stopped have not been called. This causes problems for >>>> live migration. Stale cpu timers_state is saved to the migration stream, >>>> causing time errors in the guest when it wakes from suspend, and state that >>>> would have been modified by runstate notifiers is wrong. >>>> >>>> Modify vm_stop to completely stop the vm if the current state is suspended, >>>> transition to RUN_STATE_PAUSED, and remember that the machine was >>>> suspended. >>>> Modify vm_start to restore the suspended state. >>> >>> Can you explain this to me in terms of the @current_run_state state >>> machine? Like >>> >>> Before the patch, trigger X in state Y goes to state Z. >>> Afterwards, it goes to ... >> >> Old behavior: >> RUN_STATE_SUSPENDED --> stop --> RUN_STATE_SUSPENDED >> >> New behavior: >> RUN_STATE_SUSPENDED --> stop --> RUN_STATE_PAUSED >> RUN_STATE_PAUSED--> cont --> RUN_STATE_SUSPENDED > > This clarifies things quite a bit for me. Maybe work it into the commit > message? Will do. >>>> This affects all callers of vm_stop and vm_start, notably, the qapi stop >>>> and >>>> cont commands. For example: >>>> >>>> (qemu) info status >>>> VM status: paused (suspended) >>>> >>>> (qemu) stop >>>> (qemu) info status >>>> VM status: paused >>>> >>>> (qemu) cont >>>> (qemu) info status >>>> VM status: paused (suspended) >>>> >>>> (qemu) system_wakeup >>>> (qemu) info status >>>> VM status: running >>>> >>>> Suggested-by: Peter Xu >>>> Signed-off-by: Steve Sistare >>>> --- >>>> include/sysemu/runstate.h | 5 + >>>> qapi/misc.json| 10 -- >>>> system/cpus.c | 19 ++- >>>> system/runstate.c | 3 +++ >>>> 4 files changed, 30 insertions(+), 7 deletions(-) >>>> >>>> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h >>>> index f6a337b..1d6828f 100644 >>>> --- a/include/sysemu/runstate.h >>>> +++ b/include/sysemu/runstate.h >>>> @@ -40,6 +40,11 @@ static inline bool >>>> shutdown_caused_by_guest(ShutdownCause cause) >>>> return cause >= SHUTDOWN_CAUSE_GUEST_SHUTDOWN; >>>> } >>>> >>>> +static inline bool runstate_is_started(RunState state) >>>> +{ >>>> +return state == RUN_STATE_RUNNING || state == RUN_STATE_SUSPENDED; >>>> +} >>>> + >>>> void vm_start(void); >>>> >>>> /** >>>> diff --git a/qapi/misc.json b/qapi/misc.json >>>> index cda2eff..efb8d44 100644 >>>> --- a/qapi/misc.json >>>> +++ b/qapi/misc.json >>>> @@ -134,7 +134,7 @@ >>>> ## >>>> # @stop: >>>> # >>>> -# Stop all guest VCPU execution. >>>> +# Stop all guest VCPU and VM execution. >>> >>> Doesn't "stop all VM execution" imply "stop all guest vCPU execution"? >> >> Agreed, so we simply have: >> >> # @stop: >> # Stop guest VM execution. >> >> # @cont: >> # Resume guest VM execution. > > Yes, please. Will do. >>>> # >>>> # Since: 0.14 >>>> # >>>> @@ -143,6 +143,9 @@ >>># Notes: This function will succeed even if the guest is already in >>># the stopped state. In "inmigrate" state, it will ensure that >>>> # the guest remains paused once migration finishes, as if the -S >>>> # option was passed on the command line. >>>> # >>>> +# In the "suspended" state, it will completely stop the VM and >>>> +# cause a transition to the "paused" state. (Since 9.0) >>>> +# >>> >>> What user-observable (with que
Re: [PATCH V6 03/14] cpus: stop vm in suspended runstate
On 1/3/2024 8:09 AM, Peter Xu wrote: > Steven, > > The discussions seem to still apply to the latest. Do you plan to post a > new version, with everything covered? Yes, today, thanks - steve
Re: [PATCH V6 03/14] cpus: stop vm in suspended runstate
On 12/22/2023 7:20 AM, Markus Armbruster wrote: > Steve Sistare writes: > >> Currently, a vm in the suspended state is not completely stopped. The VCPUs >> have been paused, but the cpu clock still runs, and runstate notifiers for >> the transition to stopped have not been called. This causes problems for >> live migration. Stale cpu timers_state is saved to the migration stream, >> causing time errors in the guest when it wakes from suspend, and state that >> would have been modified by runstate notifiers is wrong. >> >> Modify vm_stop to completely stop the vm if the current state is suspended, >> transition to RUN_STATE_PAUSED, and remember that the machine was suspended. >> Modify vm_start to restore the suspended state. > > Can you explain this to me in terms of the @current_run_state state > machine? Like > > Before the patch, trigger X in state Y goes to state Z. > Afterwards, it goes to ... Old behavior: RUN_STATE_SUSPENDED --> stop --> RUN_STATE_SUSPENDED New behavior: RUN_STATE_SUSPENDED --> stop --> RUN_STATE_PAUSED RUN_STATE_PAUSED--> cont --> RUN_STATE_SUSPENDED >> This affects all callers of vm_stop and vm_start, notably, the qapi stop and >> cont commands. For example: >> >> (qemu) info status >> VM status: paused (suspended) >> >> (qemu) stop >> (qemu) info status >> VM status: paused >> >> (qemu) cont >> (qemu) info status >> VM status: paused (suspended) >> >> (qemu) system_wakeup >> (qemu) info status >> VM status: running >> >> Suggested-by: Peter Xu >> Signed-off-by: Steve Sistare >> --- >> include/sysemu/runstate.h | 5 + >> qapi/misc.json| 10 -- >> system/cpus.c | 19 ++- >> system/runstate.c | 3 +++ >> 4 files changed, 30 insertions(+), 7 deletions(-) >> >> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h >> index f6a337b..1d6828f 100644 >> --- a/include/sysemu/runstate.h >> +++ b/include/sysemu/runstate.h >> @@ -40,6 +40,11 @@ static inline bool shutdown_caused_by_guest(ShutdownCause >> cause) >> return cause >= SHUTDOWN_CAUSE_GUEST_SHUTDOWN; >> } >> >> +static inline bool runstate_is_started(RunState state) >> +{ >> +return state == RUN_STATE_RUNNING || state == RUN_STATE_SUSPENDED; >> +} >> + >> void vm_start(void); >> >> /** >> diff --git a/qapi/misc.json b/qapi/misc.json >> index cda2eff..efb8d44 100644 >> --- a/qapi/misc.json >> +++ b/qapi/misc.json >> @@ -134,7 +134,7 @@ >> ## >> # @stop: >> # >> -# Stop all guest VCPU execution. >> +# Stop all guest VCPU and VM execution. > > Doesn't "stop all VM execution" imply "stop all guest vCPU execution"? Agreed, so we simply have: # @stop: # Stop guest VM execution. # @cont: # Resume guest VM execution. >> # >> # Since: 0.14 >> # >> @@ -143,6 +143,9 @@ ># Notes: This function will succeed even if the guest is already in ># the stopped state. In "inmigrate" state, it will ensure that >> # the guest remains paused once migration finishes, as if the -S >> # option was passed on the command line. >> # >> +# In the "suspended" state, it will completely stop the VM and >> +# cause a transition to the "paused" state. (Since 9.0) >> +# > > What user-observable (with query-status) state transitions are possible > here? {"status": "suspended", "singlestep": false, "running": false} --> stop --> {"status": "paused", "singlestep": false, "running": false} >> # Example: >> # >> # -> { "execute": "stop" } >> @@ -153,7 +156,7 @@ >> ## >> # @cont: >> # >> -# Resume guest VCPU execution. >> +# Resume guest VCPU and VM execution. >> # >> # Since: 0.14 >> # >> @@ -165,6 +168,9 @@ ># Returns: If successful, nothing ># ># Notes: This command will succeed if the guest is currently running. ># It will also succeed if the guest is in the "inmigrate" state; ># in this case, the effect of the command is to make sure the >> # guest starts once migration finishes, removing the effect of the >> # -S command line option if it was passed. >> # >> +# If the VM was previously suspended, and not been reset or woken, >> +# this command will transition back to the "suspended" state. (Since >> 9.0) > > Long line. It fits in 80 columns, but perhaps this looks nicer: # If the VM was previously suspended, and not been reset or woken, # this command will transition back to the "suspended" state. # (Since 9.0) > What user-observable state transitions are possible here? {"status": "paused", "singlestep": false, "running": false} --> cont --> {"status": "suspended", "singlestep": false, "running": false} >> +# >> # Example: >> # >> # -> { "execute": "cont" } > > Should we update documentation of query-status, too? IMO no. The new behavior changes the status/RunState field only, and the domain of values does not change, only the transitions caused by the commands described here.
Re: [PATCH V8 00/12] fix migration of suspended runstate
On 12/20/2023 9:52 AM, Anthony PERARD wrote: > On Mon, Dec 18, 2023 at 01:14:51PM +0800, Peter Xu wrote: >> On Wed, Dec 13, 2023 at 10:35:33AM -0500, Steven Sistare wrote: >>> Hi Peter, all have RB's, with all i's dotted and t's crossed - steve >> >> Yes this seems to be more migration related so maybe good candidate for a >> pull from migration submodule. >> >> But since this is still solving a generic issue, I'm copying a few more >> people from get_maintainers.pl that this series touches, just in case >> they'll have something to say before dev cycle starts. > > I did a quick smoke test of migrating a Xen guest. It works fine for me. Thanks very much, I appreciate it - steve