Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-19 Thread Peter Xu
On Fri, Jul 19, 2024 at 01:54:37PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, Jul 18, 2024 at 07:32:05PM -0300, Fabiano Rosas wrote:
> >> Peter Xu  writes:
> >> 
> >> > On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
> >> >> Peter Xu  writes:
> >> >> 
> >> >> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
> >> >> >> v2 is ready, but unfortunately this approach doesn't work. When 
> >> >> >> client A
> >> >> >> takes the payload, it fills it with its data, which may include
> >> >> >> allocating memory. MultiFDPages_t does that for the offset. This 
> >> >> >> means
> >> >> >> we need a round of free/malloc at every packet sent. For every client
> >> >> >> and every allocation they decide to do.
> >> >> >
> >> >> > Shouldn't be a blocker?  E.g. one option is:
> >> >> >
> >> >> > /* Allocate both the pages + offset[] */
> >> >> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
> >> >> >   sizeof(ram_addr_t) * n);
> >> >> > pages->allocated = n;
> >> >> > pages->offset = &pages[1];
> >> >> >
> >> >> > Or.. we can also make offset[] dynamic size, if that looks less 
> >> >> > tricky:
> >> >> >
> >> >> > typedef struct {
> >> >> > /* number of used pages */
> >> >> > uint32_t num;
> >> >> > /* number of normal pages */
> >> >> > uint32_t normal_num;
> >> >> > /* number of allocated pages */
> >> >> > uint32_t allocated;
> >> >> > RAMBlock *block;
> >> >> > /* offset of each page */
> >> >> > ram_addr_t offset[0];
> >> >> > } MultiFDPages_t;
> >> >> 
> >> >> I think you missed the point. If we hold a pointer inside the payload,
> >> >> we lose the reference when the other client takes the structure and puts
> >> >> its own data there. So we'll need to alloc/free everytime we send a
> >> >> packet.
> >> >
> >> > For option 1: when the buffer switch happens, MultiFDPages_t will switch 
> >> > as
> >> > a whole, including its offset[], because its offset[] always belong to 
> >> > this
> >> > MultiFDPages_t.  So yes, we want to lose that *offset reference together
> >> > with MultiFDPages_t here, so the offset[] always belongs to one single
> >> > MultiFDPages_t object for its lifetime.
> >> 
> >> MultiFDPages_t is part of MultiFDSendData, it doesn't get allocated
> >> individually:
> >> 
> >> struct MultiFDSendData {
> >> MultiFDPayloadType type;
> >> union {
> >> MultiFDPages_t ram_payload;
> >> } u;
> >> };
> >> 
> >> (and even if it did, then we'd lose the pointer to ram_payload anyway -
> >> or require multiple free/alloc)
> >
> > IMHO it's the same.
> >
> > The core idea is we allocate a buffer to put MultiFDSendData which may
> > contain either Pages_t or DeviceState_t, and the size of the buffer should
> > be MAX(A, B).
> >
> 
> Right, but with your zero-length array proposals we need to have a
> separate allocation for MultiFDPages_t because to expand the array we
> need to include the number of pages.

We need to fetch the max size we need and allocate one object that covers
all the sizes.  I sincerely don't understand why it's an issue..

> 
> Also, don't think only about MultiFDPages_t. With this approach we
> cannot have pointers to memory allocated by the client at all anywhere
> inside the union. Every pointer needs to have another reference
> somewhere else to ensure we don't leak it. That's an unnecessary
> restriction.

So even if there are multiple pointers we can definitely play the same
trick: allocate objects A+B+C+D in the same chunk and let A->b point to B,
A->c point to C, and so on.

Before that, my question is: do we really need that?

For device states, AFAIU it'll always be an opaque buffer..  VFIO needs
that, vDPA probably the same, and for VMSDs it'll be a temp buffer to put
the VMSD dump.

For multifd, I used offset[0] just to make sure things like "dynamic sized
multifd buffers" will easi

Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr

2024-07-19 Thread Peter Xu
On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
> For new cpr modes, ramblock_is_ignored will always be true, because the
> memory is preserved in place rather than copied.  However, for an ignored
> block, parse_ramblock currently requires that the received address of the
> block must match the address of the statically initialized region on the
> target.  This fails for a PCI rom block, because the memory region address
> is set when the guest writes to a BAR on the source, which does not occur
> on the target, causing a "Mismatched GPAs" error during cpr migration.

Is this a common fix with/without cpr mode?

It looks to me like mr->addr (for these ROMs) should only be set on PCI
config region updates, as you mentioned.  But then I couldn't figure out
when they're updated on the destination in live migration: the ramblock
info is sent at the beginning of migration, before the PCI config space is
migrated; I'd thought the real mr->addr would be in there.

I also don't yet understand why the mr->addr check needs to be done for
ignore-shared only.  Some explanation around this area would be greatly
helpful..

Thanks,

-- 
Peter Xu




Re: [PATCH V2 02/11] migration: cpr-state

2024-07-19 Thread Peter Xu
>  if (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
> @@ -2142,6 +2147,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>MIGRATION_STATUS_FAILED);
>  }
>  
> +out:
>  if (local_err) {
>  if (!resume_requested) {
>  yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> diff --git a/migration/trace-events b/migration/trace-events
> index 0b7c332..173f2c0 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -340,6 +340,11 @@ colo_receive_message(const char *msg) "Receive '%s' 
> message"
>  # colo-failover.c
>  colo_failover_set_state(const char *new_state) "new state %s"
>  
> +# cpr.c
> +cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
> +cpr_delete_fd(const char *name, int id) "%s, id %d"
> +cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
> +
>  # block-dirty-bitmap.c
>  send_bitmap_header_enter(void) ""
>  send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, 
> uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" 
> PRIu32 ", data_size: %" PRIu64
> diff --git a/system/vl.c b/system/vl.c
> index 03951be..6521ee3 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -77,6 +77,7 @@
>  #include "hw/block/block.h"
>  #include "hw/i386/x86.h"
>  #include "hw/i386/pc.h"
> +#include "migration/cpr.h"
>  #include "migration/misc.h"
>  #include "migration/snapshot.h"
>  #include "sysemu/tpm.h"
> @@ -3713,6 +3714,8 @@ void qemu_init(int argc, char **argv)
>  
>  qemu_create_machine(machine_opts_dict);
>  
> +cpr_state_load(&error_fatal);

Might be good to add a rich comment here explaining the decision on why
loading here; I think most of the tricks lie here.  E.g., it needs to be
before XXX and it needs to be after YYY.

Thanks,

> +
>  suspend_mux_open();
>  
>  qemu_disable_default_devices();
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-19 Thread Peter Xu
On Thu, Jul 18, 2024 at 07:32:05PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
> >> Peter Xu  writes:
> >> 
> >> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
> >> >> v2 is ready, but unfortunately this approach doesn't work. When client A
> >> >> takes the payload, it fills it with its data, which may include
> >> >> allocating memory. MultiFDPages_t does that for the offset. This means
> >> >> we need a round of free/malloc at every packet sent. For every client
> >> >> and every allocation they decide to do.
> >> >
> >> > Shouldn't be a blocker?  E.g. one option is:
> >> >
> >> > /* Allocate both the pages + offset[] */
> >> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
> >> >   sizeof(ram_addr_t) * n);
> >> > pages->allocated = n;
> >> > pages->offset = &pages[1];
> >> >
> >> > Or.. we can also make offset[] dynamic size, if that looks less tricky:
> >> >
> >> > typedef struct {
> >> > /* number of used pages */
> >> > uint32_t num;
> >> > /* number of normal pages */
> >> > uint32_t normal_num;
> >> > /* number of allocated pages */
> >> > uint32_t allocated;
> >> > RAMBlock *block;
> >> > /* offset of each page */
> >> > ram_addr_t offset[0];
> >> > } MultiFDPages_t;
> >> 
> >> I think you missed the point. If we hold a pointer inside the payload,
> >> we lose the reference when the other client takes the structure and puts
> >> its own data there. So we'll need to alloc/free everytime we send a
> >> packet.
> >
> > For option 1: when the buffer switch happens, MultiFDPages_t will switch as
> > a whole, including its offset[], because its offset[] always belong to this
> > MultiFDPages_t.  So yes, we want to lose that *offset reference together
> > with MultiFDPages_t here, so the offset[] always belongs to one single
> > MultiFDPages_t object for its lifetime.
> 
> MultiFDPages_t is part of MultiFDSendData, it doesn't get allocated
> individually:
> 
> struct MultiFDSendData {
> MultiFDPayloadType type;
> union {
> MultiFDPages_t ram_payload;
> } u;
> };
> 
> (and even if it did, then we'd lose the pointer to ram_payload anyway -
> or require multiple free/alloc)

IMHO it's the same.

The core idea is we allocate a buffer to put MultiFDSendData which may
contain either Pages_t or DeviceState_t, and the size of the buffer should
be MAX(A, B).

> 
> >
> > For option 2: I meant MultiFDPages_t will have no offset[] pointer anymore,
> > but make it part of the struct (MultiFDPages_t.offset[]).  Logically it's
> > the same as option 1 but maybe slight cleaner.  We just need to make it
> > sized 0 so as to be dynamic in size.
> 
> Seems like an undefined behavior magnet. If I sent this as the first
> version, you'd NACK me right away.
> 
> Besides, it's an unnecessary restriction to impose in the client
> code. And like above, we don't allocate the struct directly, it's part
> of MultiFDSendData, that's an advantage of using the union.
> 
> I think we've reached the point where I'd like to hear more concrete
> reasons for not going with the current proposal, except for the
> simplicity argument you already put. I like the union idea, but OTOH we
> already have a working solution right here.

I think the issue with the current proposal is that each client will need
to allocate (N+1)*buffer, so the more users there are the more buffers
we'll need (M users means M*(N+1)*buffer).  Currently it seems to me we
will have 3 users at least: RAM, VFIO, and some other VMSD devices TBD in
the mid-long term; the latter two will share the same DeviceState_t.
Maybe vDPA as well at some point?  Then 4.

I'd agree with this approach only if multifd were flexible enough to not
even know what the buffers are, but that's not the case, and we seem to
only care about two:

  if (type==RAM)
 ...
  else
 assert(type==DEVICE);
 ...

In this case I think it's easier to have multifd manage all the buffers
(after all, it knows them well...).  Then the consumption is not
M*(N+1)*buffer, but (M+N)*buffer.

Perhaps push your tree somewhere so we can have a quick look?  I'm totally
lost on why you said I'd NACK it.. maybe I didn't really get what you
meant.  The code may clarify that.

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-18 Thread Peter Xu
On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
> >> v2 is ready, but unfortunately this approach doesn't work. When client A
> >> takes the payload, it fills it with its data, which may include
> >> allocating memory. MultiFDPages_t does that for the offset. This means
> >> we need a round of free/malloc at every packet sent. For every client
> >> and every allocation they decide to do.
> >
> > Shouldn't be a blocker?  E.g. one option is:
> >
> > /* Allocate both the pages + offset[] */
> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
> >   sizeof(ram_addr_t) * n);
> > pages->allocated = n;
> > pages->offset = &pages[1];
> >
> > Or.. we can also make offset[] dynamic size, if that looks less tricky:
> >
> > typedef struct {
> > /* number of used pages */
> > uint32_t num;
> > /* number of normal pages */
> > uint32_t normal_num;
> > /* number of allocated pages */
> > uint32_t allocated;
> > RAMBlock *block;
> > /* offset of each page */
> > ram_addr_t offset[0];
> > } MultiFDPages_t;
> 
> I think you missed the point. If we hold a pointer inside the payload,
> we lose the reference when the other client takes the structure and puts
> its own data there. So we'll need to alloc/free every time we send a
> packet.

For option 1: when the buffer switch happens, MultiFDPages_t will switch as
a whole, including its offset[], because its offset[] always belongs to this
MultiFDPages_t.  So yes, we want to lose that *offset reference together
with MultiFDPages_t here, so the offset[] always belongs to one single
MultiFDPages_t object for its lifetime.

For option 2: I meant MultiFDPages_t will have no offset[] pointer anymore,
but make it part of the struct (MultiFDPages_t.offset[]).  Logically it's
the same as option 1 but maybe slightly cleaner.  We just need to make it
sized 0 so as to be dynamic in size.

Hmm.. is it the case?

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-18 Thread Peter Xu
On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
> v2 is ready, but unfortunately this approach doesn't work. When client A
> takes the payload, it fills it with its data, which may include
> allocating memory. MultiFDPages_t does that for the offset. This means
> we need a round of free/malloc at every packet sent. For every client
> and every allocation they decide to do.

Shouldn't be a blocker?  E.g. one option is:

/* Allocate both the pages + offset[] */
MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
  sizeof(ram_addr_t) * n);
pages->allocated = n;
pages->offset = &pages[1];

Or.. we can also make offset[] dynamic size, if that looks less tricky:

typedef struct {
/* number of used pages */
uint32_t num;
/* number of normal pages */
uint32_t normal_num;
/* number of allocated pages */
uint32_t allocated;
RAMBlock *block;
/* offset of each page */
ram_addr_t offset[0];
} MultiFDPages_t;

-- 
Peter Xu




Re: [PATCH V2 01/11] machine: alloc-anon option

2024-07-18 Thread Peter Xu
On Thu, Jul 18, 2024 at 11:43:54AM -0400, Steven Sistare wrote:
> On 7/17/2024 3:24 PM, Peter Xu wrote:
> [...]
> > 
> > PS to Steve: and I think I left tons of other comments in previous version
> > outside this patch too, but I don't think they're fully discussed when this
> > series was sent.  I can re-read the series again, but I don't think it'll
> > work out if we keep skipping discussions..
> 
> Hi Peter, let me address this part first, because I don't want you to think
> that I ignored your questions and concerns.  This V2 series tries to address
> them.  The change log was intended to be my response, rather than responding
> to each open question individually, but let me try again here with more 
> detail.
> I apologize if I don't summarize your concerns correctly or completely.
> 
> issue: discomfort with exec. why is it needed?
> response: exec is just a transport mechanism to send fd's to new qemu.
>   I refactored to separate core patches from exec-specific patches, submitted
>   cpr-transfer patches to illustrate a non-exec method, and provided reasons
>   why one vs the other would be desirable in the commit messages and cover
>   letter.
> 
> issue: why do we need to preserve the ramblock fields and make them available
>   prior to object creation?
> response.  we don't need to preserve all of them, and we only need fd prior
>   to object creation, so I deleted the precreate, factory, and named object
>   patches, and added CprState to preserve fd's. used_length arrives in the
>   normal migration stream.  max_length is recovered from the mfd using lseek.
> 
> issue: the series is too large, with too much change.
> response: in addition to the deletions mentioned above, I simplified the
>   functionality and tossed out style patches and nice-to-haves, so we can
>   focus on core functionality.  V2 is much smaller.
> 
> issue: memfd_create option is oddly expressed and hard to understand.
> response: I redefined the option, deleted all the stylistic ramblock patches
>   to lay its workings bare, and explicitly documented its affect on all types
>   of memory in the commit messages and qapi documentation.
> 
> issue: no need to preserve blocks like ROM for DMA (with memfd_create).
>   Blocks that must be preserved should be surfaced to the user as objects.
> response: I disagree, and will continue that conversation in this email 
> thread.
> 
> issue: how will vfio be handled?
> response: I submitted the vfio patches (non-trivial, because first I had to
>   rework them without using precreate vmstate).
> 
> issue: how will fd's be preserved for chardevs?
> response: via cpr_save_fd, CprState, and cpr_load_fd at device creation time,
>   in each device's creation function, just like vfio.  Those primitives are
>   defined in this V2 series.

Thanks for the answers.  I think I'll need to read more into the patches in
the next few days; it looks like I'll get more answers from there.

I just sent an email probably while you were drafting this one.. it may
have some questions that are not covered here.

I think a major issue with exec() is the (1-3) steps that I mentioned there
that needs to run sequentially, and IIUC all these steps can be completely
avoided in cpr-transfer, and it may matter a lot in huge VMs.  But maybe I
missed something.

Please have a look there.

-- 
Peter Xu




Re: [PATCH V2 00/11] Live update: cpr-exec

2024-07-18 Thread Peter Xu
ned resources.  By contrast, consider a design in
> which a new container is created on the same host as the target of the
> CPR operation.  Resources must be reserved for the new container, while
> the old container still reserves resources until the operation completes.

Note that if we need to share RAM anyway, the resource consumption should
be minimal, as memory should IMHO be the major concern (besides CPU, but
CPU isn't a concern in this scenario) in the container world, and here the
shared guest memory shouldn't be accounted to the dest container.  So IMHO
it's about the metadata QEMU/KVM needs to do the hypervisor work, and that
should be relatively small.

In that case I don't yet see it a huge improvement, if the dest container
is cheap to initiate.

> Avoiding over commitment requires extra work in the management layer.

So it would be nice to know what needs to be overcommitted here.  I confess
I don't know much about containerized VMs, so maybe the page cache can be a
problem even if shared.  But I hope we can spell that out.  Logically IIUC
memcg shouldn't account those page cache if preallocated, because memcg
accounting should be done at folio allocations, at least, where the page
cache should miss first (so not this case..).

> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
> is that the container may include agents with their own connections to the
> outside world, and such connections remain intact if the container is reused.
> 
> How?
> 
> All memory that is mapped by the guest is preserved in place.  Indeed,
> it must be, because it may be the target of DMA requests, which are not
> quiesced during cpr-exec.  All such memory must be mmap'able in new QEMU.
> This is easy for named memory-backend objects, as long as they are mapped
> shared, because they are visible in the file system in both old and new QEMU.
> Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
> so the memfd's can be sent to new QEMU.  Pages that were locked in memory
> for DMA in old QEMU remain locked in new QEMU, because the descriptor of
> the device that locked them remains open.
> 
> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> and by sending the unique name and value of each descriptor to new QEMU
> via CPR state.
> 
> For device descriptors, new QEMU reuses the descriptor when creating the
> device, rather than opening it again.  The same holds for chardevs.  For
> memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
> is created.
> 
> CPR state cannot be sent over the normal migration channel, because devices
> and backends are created prior to reading the channel, so this mode sends
> CPR state over a second migration channel that is not visible to the user.
> New QEMU reads the second channel prior to creating devices or backends.

Oh, maybe this is the reason that cpr-transfer will need a separate uri..

Thanks,

-- 
Peter Xu




Re: [RFC V1 0/6] Live update: cpr-transfer

2024-07-18 Thread Peter Xu
On Sun, Jun 30, 2024 at 12:44:02PM -0700, Steve Sistare wrote:
> What?
> 
> This patch series adds the live migration cpr-transfer mode, which
> allows the user to transfer a guest to a new QEMU instance on the same
> host.  It is identical to cpr-exec in most respects, except as described
> below. 

I definitely prefer this one more than the exec solution, thanks for trying
this out.  It's a matter of whether we'll need both; my answer would be
no..

> 
> The new user-visible interfaces are:
>   * cpr-transfer (MigMode migration parameter)
>   * cpr-uri (migration parameter)

I wonder whether this parameter can be avoided altogether: maybe we can let
cpr-transfer depend on a unix socket in -incoming, then integrate fd sharing
into the same channel?

>   * cpr-uri (command-line argument)
> 
> In this mode, the user starts new QEMU on the same host as old QEMU, with
> the same arguments as old QEMU, plus the -incoming and the -cpr-uri options.
> The user issues the migrate command to old QEMU, which stops the VM, saves
> state to the migration channels, and enters the postmigrate state.  Execution
> resumes in new QEMU.
> 
> This mode requires a second migration channel, specified by the cpr-uri
> migration property on the outgoing side, and by the cpr-uri QEMU command-line
> option on the incoming side.  The channel must be a type, such as unix socket,
> that supports SCM_RIGHTS.
> 
> This series depends on the series "Live update: cpr-exec mode".
> 
> Why?
> 
> cpr-transfer offers the same benefits as cpr-exec mode, but with a model
> for launching new QEMU that may be more natural for some management packages.
> 
> How?
> 
> The file descriptors are kept open by sending them to new QEMU via the
> cpr-uri, which must support SCM_RIGHTS.
> 
> Example:
> 
> In this example, we simply restart the same version of QEMU, but in
> a real scenario one would use a new QEMU binary path in terminal 2.
> 
>   Terminal 1: start old QEMU
>   # qemu-kvm -monitor stdio -object
>   memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
>   -m 4G -machine anon-alloc=memfd ...
> 
>   Terminal 2: start new QEMU
>   # qemu-kvm ... -incoming unix:vm.sock -cpr-uri unix:cpr.sock
> 
>   Terminal 1:
>   QEMU 9.1.50 monitor - type 'help' for more information
>   (qemu) info status
>   VM status: running
>   (qemu) migrate_set_parameter mode cpr-transfer
>   (qemu) migrate_set_parameter cpr-uri unix:cpr.sock
>   (qemu) migrate -d unix:vm.sock
>   (qemu) info status
>   VM status: paused (postmigrate)
> 
>   Terminal 2:
>   QEMU 9.1.50 monitor - type 'help' for more information
>   (qemu) info status
>   VM status: running
> 
> Steve Sistare (6):
>   migration: SCM_RIGHTS for QEMUFile
>   migration: VMSTATE_FD
>   migration: cpr-transfer save and load
>   migration: cpr-uri parameter
>   migration: cpr-uri option
>   migration: cpr-transfer mode
> 
>  include/migration/cpr.h|  4 ++
>  include/migration/vmstate.h|  9 +
>  migration/cpr-transfer.c   | 81 +
>  migration/cpr.c| 16 +++-
>  migration/meson.build  |  1 +
>  migration/migration-hmp-cmds.c | 10 +
>  migration/migration.c  | 37 +++
>  migration/options.c| 29 +++
>  migration/options.h|  1 +
>  migration/qemu-file.c  | 83 
> --
>  migration/qemu-file.h  |  2 +
>  migration/ram.c|  1 +
>  migration/trace-events |  2 +
>  migration/vmstate-types.c  | 33 +
>  qapi/migration.json| 38 ++++++-
>  qemu-options.hx|  8 
>  stubs/vmstate.c|  7 
>  system/vl.c|  3 ++
>  18 files changed, 359 insertions(+), 6 deletions(-)
>  create mode 100644 migration/cpr-transfer.c
> 
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-17 Thread Peter Xu
On Wed, Jul 17, 2024 at 11:07:17PM +0200, Maciej S. Szmigiero wrote:
> On 17.07.2024 21:00, Peter Xu wrote:
> > On Tue, Jul 16, 2024 at 10:10:25PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > The comment I removed is slightly misleading to me too, because 
> > > > > > > right now
> > > > > > > active_slot contains the data hasn't yet been delivered to 
> > > > > > > multifd, so
> > > > > > > we're "putting it back to free list" not because of it's free, 
> > > > > > > but because
> > > > > > > we know it won't get used until the multifd send thread consumes 
> > > > > > > it
> > > > > > > (because before that the thread will be busy, and we won't use 
> > > > > > > the buffer
> > > > > > > if so in upcoming send()s).
> > > > > > > 
> > > > > > > And then when I'm looking at this again, I think maybe it's a 
> > > > > > > slight
> > > > > > > overkill, and maybe we can still keep the "opaque data" managed 
> > > > > > > by multifd.
> > > > > > > One reason might be that I don't expect the "opaque data" payload 
> > > > > > > keep
> > > > > > > growing at all: it should really be either RAM or device state as 
> > > > > > > I
> > > > > > > commented elsewhere in a relevant thread, after all it's a thread 
> > > > > > > model
> > > > > > > only for migration purpose to move vmstates..
> > > > > > 
> > > > > > Some amount of flexibility needs to be baked in. For instance, what
> > > > > > about the handshake procedure? Don't we want to use multifd threads 
> > > > > > to
> > > > > > put some information on the wire for that as well?
> > > > > 
> > > > > Is this an orthogonal question?
> > > > 
> > > > I don't think so. You say the payload data should be either RAM or
> > > > device state. I'm asking what other types of data do we want the multifd
> > > > channel to transmit and suggesting we need to allow room for the
> > > > addition of that, whatever it is. One thing that comes to mind that is
> > > > neither RAM or device state is some form of handshake or capabilities
> > > > negotiation.
> > > 
> > > The RFC version of my multifd device state transfer patch set introduced
> > > a new migration channel header (by Avihai) for clean and extensible
> > > migration channel handshaking but people didn't like so it was removed in 
> > > v1.
> > 
> > Hmm, I'm not sure this is relevant to the context of discussion here, but I
> > confess I didn't notice the per-channel header thing in the previous RFC
> > series.  Link is here:
> > 
> > https://lore.kernel.org/r/636cec92eb801f13ba893de79d4872f5d8342097.1713269378.git.maciej.szmigi...@oracle.com
> 
> The channel header patches were dropped because Daniel didn't like them:
> https://lore.kernel.org/qemu-devel/zh-kf72fe9ov6...@redhat.com/
> https://lore.kernel.org/qemu-devel/zh_6w8u3h4fmg...@redhat.com/

Ah I missed that too when I quickly went over the old series, sorry.

I think what Dan meant was that we'd better do that with the handshake
work, which should cover more than this.  I've no problem with that.

It's just that sooner or later, we should provide something more solid than
commit 6720c2b327 ("migration: check magic value for deciding the mapping
of channels").

> 
> > Maciej, if you want, you can split that out of the seriess. So far it looks
> > like a good thing with/without how VFIO tackles it.
> 
> Unfortunately, these Avihai's channel header patches obviously impact wire
> protocol and are a bit of intermingled with the rest of the device state
> transfer patch set so it would be good to know upfront whether there is
> some consensus to (re)introduce this new channel header (CCed Daniel, too).

When I mentioned posting it separately, it'll still not be relevant to the
VFIO series. IOW, I think below is definitely not needed (and I think we're
on the same page now to reuse multifd threads as generic channels, so
there's no issue now):

https://lore.kernel.org/qemu-devel/027695db92ace07d2d6ee66da05f8e85959fd46a.1713269378.git.maciej.szmigi...@oracle.com/

So I assume we should leave that for later for whoever refactors the
handshake process.

Thanks,

-- 
Peter Xu




Re: [PATCH v1 00/13] Multifd  device state transfer support with VFIO consumer

2024-07-17 Thread Peter Xu
On Wed, Jul 17, 2024 at 11:07:43PM +0200, Maciej S. Szmigiero wrote:
> > Don't wait on me. I think I can make the changes Peter suggested without
> > affecting too much the interfaces used by this series. If it comes to
> > it, I can rebase this series "under" Maciej's.
> 
> So to be clear, I should base my series on top of your existing RFC patch set
> and then we'll swap these RFC patches for the updated versions, correct?

I'm not sure that's good.. since the VFIO series should depend heavily on
that RFC series IIUC, and if the RFC is prone to major changes, maybe we
should still work that out first (otherwise the next rebase can change a
lot, voiding again most of the VFIO testing already done)?

-- 
Peter Xu




Re: [PATCH V2 01/11] machine: alloc-anon option

2024-07-17 Thread Peter Xu
On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote:
> On Sun, 30 Jun 2024 12:40:24 -0700
> Steve Sistare  wrote:
> 
> > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > on the value of the anon-alloc machine property.  This affects
> > memory-backend-ram objects, guest RAM created with the global -m option
> > but without an associated memory-backend object and without the -mem-path
> > option
> nowadays, all machines were converted to use memory backend for VM RAM.
> so -m option implicitly creates memory-backend object,
> which will be either MEMORY_BACKEND_FILE if -mem-path present
> or MEMORY_BACKEND_RAM otherwise.
> 
> 
> > To access the same memory in the old and new QEMU processes, the memory
> > must be mapped shared.  Therefore, the implementation always sets
> 
> > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > user must explicitly specify the share option.  In lieu of defining a new
> so statement at the top that memory-backend-ram is affected is not
> really valid? 
> 
> > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > as the condition for calling memfd_create.
> 
> In general I do dislike adding yet another option that will affect
> guest RAM allocation (memory-backends  should be sufficient).

I shared the same concern when reviewing the previous version, and I keep
having so.

> 
> However I do see that you need memfd for device memory (vram, roms, ...).
> Can we just use memfd/shared unconditionally for those and
> avoid introducing a new confusing option?

ROMs should be fine IIUC, as they shouldn't be large, and they can be
migrated normally (because they're not DMA target from VFIO assigned
devices).  IOW, per my understanding what must be shared via memfd is
writable memories that can be DMAed from a VFIO device.

I raised the question of whether / why vram can be a DMA target, but I
didn't get a response.  So I would like to repeat that comment: I think we
should figure out what is missing when we switch all backends to use
-object, rather than adding this flag so easily.  When added, we should be
crystal clear on which RAM regions this flag applies to.

PS to Steve: I think I left tons of other comments in the previous version
outside this patch too, but I don't think they were fully discussed when this
series was sent.  I can re-read the series again, but I don't think it'll
work out if we keep skipping discussions..

Thanks,

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-17 Thread Peter Xu
On Tue, Jul 16, 2024 at 10:10:25PM +0200, Maciej S. Szmigiero wrote:
> > > > > The comment I removed is slightly misleading to me too, because right
> > > > > now active_slot contains data that hasn't yet been delivered to
> > > > > multifd, so we're "putting it back to the free list" not because it's
> > > > > free, but because we know it won't get used until the multifd send
> > > > > thread consumes it (because before that the thread will be busy, and
> > > > > we won't use the buffer in upcoming send()s).
> > > > > 
> > > > > And then when I'm looking at this again, I think maybe it's a slight
> > > > > overkill, and maybe we can still keep the "opaque data" managed by 
> > > > > multifd.
> > > > > One reason might be that I don't expect the "opaque data" payload to keep
> > > > > growing at all: it should really be either RAM or device state, as I
> > > > > commented elsewhere in a relevant thread; after all, it's a thread model
> > > > > only for migration purposes to move vmstates..
> > > > 
> > > > Some amount of flexibility needs to be baked in. For instance, what
> > > > about the handshake procedure? Don't we want to use multifd threads to
> > > > put some information on the wire for that as well?
> > > 
> > > Is this an orthogonal question?
> > 
> > I don't think so. You say the payload data should be either RAM or
> > device state. I'm asking what other types of data do we want the multifd
> > channel to transmit and suggesting we need to allow room for the
> > addition of that, whatever it is. One thing that comes to mind that is
> > neither RAM or device state is some form of handshake or capabilities
> > negotiation.
> 
> The RFC version of my multifd device state transfer patch set introduced
> a new migration channel header (by Avihai) for clean and extensible
> migration channel handshaking, but people didn't like it, so it was removed in v1.

Hmm, I'm not sure this is relevant to the context of discussion here, but I
confess I didn't notice the per-channel header thing in the previous RFC
series.  Link is here:

https://lore.kernel.org/r/636cec92eb801f13ba893de79d4872f5d8342097.1713269378.git.maciej.szmigi...@oracle.com

Maciej, if you want, you can split that out of the series. So far it looks
like a good thing regardless of how VFIO tackles it.

Thanks,

-- 
Peter Xu




Re: [PATCH v1 00/13] Multifd  device state transfer support with VFIO consumer

2024-07-17 Thread Peter Xu
On Tue, Jul 16, 2024 at 10:10:12PM +0200, Maciej S. Szmigiero wrote:
> On 27.06.2024 16:56, Peter Xu wrote:
> > On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
> > > On 26.06.2024 18:23, Peter Xu wrote:
> > > > On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 26.06.2024 03:51, Peter Xu wrote:
> > > > > > On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 25.06.2024 19:25, Peter Xu wrote:
> > > > > > > > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero 
> > > > > > > > wrote:
> > > > > > > > > Hi Peter,
> > > > > > > > 
> > > > > > > > Hi, Maciej,
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On 23.06.2024 22:27, Peter Xu wrote:
> > > > > > > > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. 
> > > > > > > > > > Szmigiero wrote:
> > > > > > > > > > > From: "Maciej S. Szmigiero" 
> > > > > > > > > > > 
> > > > > > > > > > > This is an updated v1 patch series of the RFC (v0) series 
> > > > > > > > > > > located here:
> > > > > > > > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/
> > > > > > > > > > 
> > > > > > > > > > OK I took some hours thinking about this today, and here's 
> > > > > > > > > > some high level
> > > > > > > > > > comments for this series.  I'll start with which are more 
> > > > > > > > > > relevant to what
> > > > > > > > > > Fabiano has already suggested in the other thread, then 
> > > > > > > > > > I'll add some more.
> > > > > > > > > > 
> > > > > > > > > > https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de
> > > > > > > > > 
> > > > > > > > > That's a long list, thanks for these comments.
> > > > > > > > > 
> > > > > > > > > I have responded to them inline below.
> > > > > > > > > (..)
> > > > > > > 
> > > > > > > 2) Submit this operation to the thread pool and wait for it to 
> > > > > > > complete,
> > > > > > 
> > > > > > VFIO doesn't need to have its own code waiting.  If this pool is for
> > > > > > migration purpose in general, qemu migration framework will need to 
> > > > > > wait at
> > > > > > some point for all jobs to finish before moving on.  Perhaps it 
> > > > > > should be
> > > > > > at the end of the non-iterative session.
> > > > > 
> > > > > So essentially, instead of calling save_live_complete_precopy_end 
> > > > > handlers
> > > > > from the migration code you would like to hard-code its current VFIO
> > > > > implementation of calling 
> > > > > vfio_save_complete_precopy_async_thread_thread_terminate().
> > > > > 
> > > > > Only it wouldn't be then called VFIO precopy async thread terminate 
> > > > > but some
> > > > > generic device state async precopy thread terminate function.
> > > > 
> > > > I don't understand what did you mean by "hard code".
> > > 
> > > "Hard code" wasn't maybe the best expression here.
> > > 
> > > I meant the move of the functionality that's provided by
> > > vfio_save_complete_precopy_async_thread_thread_terminate() in this patch 
> > > set
> > > to the common migration code.
> > 
> > I see.  That function only does a thread_join() so far.
> > 
> > So can I understand it as below [1] should work for us, and it'll be clean
> > too (with nothing to hard-code)?
> 
> It will need some signal to the worker thread pool to terminate before
> waiting for them to finish (as the code in [1] just waits).
> 
> In the case of current vfio_save_complete_precopy_async_thread() 
> implementation,
> this signal isn't necessary as this thread simply terminates when it has re

Re: [PATCH 0/2] Postcopy migration and vhost-user errors

2024-07-17 Thread Peter Xu
On Wed, Jul 17, 2024 at 09:40:06AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 17, 2024 at 09:33:01AM -0400, Peter Xu wrote:
> > Hi, Michael,
> > 
> > On Wed, Jul 17, 2024 at 04:55:52AM -0400, Michael S. Tsirkin wrote:
> > > I just want to understand how we managed to have two threads
> > > talking in parallel. BQL is normally enough, which path
> > > manages to invoke vhost-user with BQL not taken?
> > > Just check BQL taken on each vhost user invocation and
> > > you will figure it out.
> > 
> > Prasad mentioned how the race happened in the cover letter:
> > 
> > https://lore.kernel.org/r/20240711131424.181615-1-ppan...@redhat.com
> > 
> >  Thread-1                                Thread-2
> > 
> > vhost_dev_start                          postcopy_ram_incoming_cleanup
> >  vhost_device_iotlb_miss                  postcopy_notify
> >   vhost_backend_update_device_iotlb        vhost_user_postcopy_notifier
> >    vhost_user_send_device_iotlb_msg         vhost_user_postcopy_end
> >     process_message_reply                    process_message_reply
> >      vhost_user_read                          vhost_user_read
> >       vhost_user_read_header                   vhost_user_read_header
> >        "Fail to update device iotlb"           "Failed to receive reply to postcopy_end"
> > 
> > The normal case should be that thread-2 is postcopy_ram_listen_thread(),
> > and this happens when postcopy migration is close to the end.
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> 
> 
> OK, so postcopy_ram_ things run without the BQL?

There are a lot of postcopy_ram_* functions; I didn't check all of them, but
I think it's true in this case.  Thanks.

-- 
Peter Xu




Re: [PATCH 0/2] Postcopy migration and vhost-user errors

2024-07-17 Thread Peter Xu
Hi, Michael,

On Wed, Jul 17, 2024 at 04:55:52AM -0400, Michael S. Tsirkin wrote:
> I just want to understand how we managed to have two threads
> talking in parallel. BQL is normally enough, which path
> manages to invoke vhost-user with BQL not taken?
> Just check BQL taken on each vhost user invocation and
> you will figure it out.

Prasad mentioned how the race happened in the cover letter:

https://lore.kernel.org/r/20240711131424.181615-1-ppan...@redhat.com

 Thread-1                                Thread-2

 vhost_dev_start                          postcopy_ram_incoming_cleanup
  vhost_device_iotlb_miss                  postcopy_notify
   vhost_backend_update_device_iotlb        vhost_user_postcopy_notifier
    vhost_user_send_device_iotlb_msg         vhost_user_postcopy_end
     process_message_reply                    process_message_reply
      vhost_user_read                          vhost_user_read
       vhost_user_read_header                   vhost_user_read_header
        "Fail to update device iotlb"           "Failed to receive reply to postcopy_end"

The normal case should be that thread-2 is postcopy_ram_listen_thread(),
and this happens when postcopy migration is close to the end.

Thanks,

-- 
Peter Xu




Re: [PATCH 0/2] Postcopy migration and vhost-user errors

2024-07-16 Thread Peter Xu
On Tue, Jul 16, 2024 at 03:44:54PM +0530, Prasad Pandit wrote:
> Hello Peter,
> 
> On Mon, 15 Jul 2024 at 19:10, Peter Xu  wrote:
> > IMHO it's better we debug and fix all the issues before merging this one,
> > otherwise we may overlook something.
> 
> * Well, we don't know where the issue is, nor where the fix may go
> in; e.g. if the issue turns out to be in how virsh(1) invokes
> migrate-postcopy, the fix may go into virsh(1). The patches in this series
> don't help fix the migration convergence issue anyway, so they
> could be reviewed independently, I guess.

I still think we should find a complete solution before merging anything,
because I'm not 100% confident that the issue to be further investigated is
unrelated to this patch.

No strong opinions, I'll leave that to Michael to decide.

> 
> > You could pass over the patch to whoever is going to debug this, so it will be 
> > included in the whole set to be
> > posted when the bug is completely fixed.
> 
> * Yes, this patch series is linked there.
> 
> > The protocol should have no restriction on the thread model of a front-end.
> > It only describes the wire protocol.
> >
> > IIUC the protocol was designed to be serialized by nature (where there's no
> > request ID, so we can't match reply to any of the previous response), then
> > the front-end can manage the threads well to serialize all the requests,
> > like using this rwlock.
> 
> * I see, okay. The simple protocol definition seems to indicate that
> it is meant for one front-end/back-end pair. If we are dividing the
> front-end across multiple threads, maybe we need a document to
> describe those threads and how they work, at least for the QEMU
> (front-end) side. Because the back-end could be a non-QEMU process, we
> can not do much there. (just thinking)

IMHO that's not part of the protocol but implementation details, so the
current doc looks all fine to me.

Thanks,

-- 
Peter Xu




Re: [PATCH v6 1/5] docs/migration: add qatzip compression feature

2024-07-16 Thread Peter Xu
On Tue, Jul 16, 2024 at 02:34:07AM +, Liu, Yuan1 wrote:
> > -Original Message-
> > From: Yichen Wang 
> > Sent: Tuesday, July 16, 2024 6:13 AM
> > To: Peter Xu ; Fabiano Rosas ; Paolo
> > Bonzini ; Daniel P. Berrangé ;
> > Eduardo Habkost ; Marc-André Lureau
> > ; Thomas Huth ; Philippe
> > Mathieu-Daudé ; Eric Blake ; Markus
> > Armbruster ; Laurent Vivier ; qemu-
> > de...@nongnu.org
> > Cc: Hao Xiang ; Liu, Yuan1 ;
> > Zou, Nanhai ; Ho-Ren (Jack) Chuang
> > ; Wang, Yichen 
> > Subject: [PATCH v6 1/5] docs/migration: add qatzip compression feature
> > 
> > From: Yuan Liu 
> > 
> > add Intel QATzip compression method introduction
> > 
> > Signed-off-by: Yuan Liu 
> > Reviewed-by: Nanhai Zou 
> > Reviewed-by: Peter Xu 
> > Reviewed-by: Yichen Wang 
> > ---
> >  docs/devel/migration/features.rst   |   1 +
> >  docs/devel/migration/qatzip-compression.rst | 251 
> >  2 files changed, 252 insertions(+)
> >  create mode 100644 docs/devel/migration/qatzip-compression.rst
> > 
> > diff --git a/docs/devel/migration/features.rst
> > b/docs/devel/migration/features.rst
> > index 58f8fd9e16..8f431d52f9 100644
> > --- a/docs/devel/migration/features.rst
> > +++ b/docs/devel/migration/features.rst
> > @@ -14,3 +14,4 @@ Migration has plenty of features to support different
> > use cases.
> > CPR
> > qpl-compression
> > uadk-compression
> > +   qatzip-compression
> > diff --git a/docs/devel/migration/qatzip-compression.rst
> > b/docs/devel/migration/qatzip-compression.rst
> > new file mode 100644
> > index 00..72fa3e2826
> > --- /dev/null
> > +++ b/docs/devel/migration/qatzip-compression.rst
> > @@ -0,0 +1,251 @@
> > +==================
> > +QATzip Compression
> > +==================
> > +In scenarios with limited network bandwidth, the ``QATzip`` solution can
> > +help users save a lot of host CPU resources by accelerating compression
> > +and decompression through the Intel QuickAssist Technology (``QAT``)
> > +hardware.
> 
> Hi Yichen
> 
> Thanks for adding the part of Performance Testing with QATzip, I wonder if we
> can remove Performance Testing with QATzip part and directly add the following
> content. 
> 
> Here, we use a typical example of limited bandwidth to illustrate the
> advantages of QATzip. If the user is interested in QATzip, they still
> need to verify the performance themselves.
> 
> +The following test was conducted using 8 multifd channels and 10Gbps network
> +bandwidth. The results show that, compared to zstd, ``QATzip`` significantly
> +saves CPU resources on the sender and reduces migration time. Compared to the
> +uncompressed solution, ``QATzip`` greatly improves the dirty page processing
> +capability, indicated by the Pages per Second metric, and also reduces the
> +total migration time.
> +
> +::
> +
> +   VM Configuration: 16 vCPU and 64G memory
> +   VM Workload: all vCPUs are idle and 54G memory is filled with Silesia 
> data.
> +   QAT Devices: 4
> +   |-----------|--------|---------|----------|----------|------|------|
> +   |8 Channels |Total   |down     |throughput|pages per |send  |recv  |
> +   |           |time(ms)|time(ms) |(mbps)    |second    |cpu % |cpu % |
> +   |-----------|--------|---------|----------|----------|------|------|
> +   |qatzip     |   16630|      28 |    10467 |  2940235 |  160 |  360 |
> +   |-----------|--------|---------|----------|----------|------|------|
> +   |zstd       |   20165|      24 |     8579 |  2391465 |  810 |  340 |
> +   |-----------|--------|---------|----------|----------|------|------|
> +   |none       |   46063|      40 |    10848 |   330240 |   45 |   85 |
> +   |-----------|--------|---------|----------|----------|------|------|

Yes, this looks much simpler and better.  The 10Gbps test isn't that useful
by itself, especially with the no-compression numbers absent.  I didn't say
so when looking previously, but it'll be better to clarify the numbers.

Yuan, thanks so much for reviewing all the relevant patches.  It's very
helpful to us.

-- 
Peter Xu




Re: [PATCH 0/2] Postcopy migration and vhost-user errors

2024-07-15 Thread Peter Xu
On Mon, Jul 15, 2024 at 03:44:06PM +0530, Prasad Pandit wrote:
> > I remember after you added the rwlock, there's still a hang issue.
> > Did you investigated that?  Or do you mean this series will fix all the 
> > problems?
> 
> * No, this series does not fix the guest hang issue. Root cause of
> that is still a mystery. If migration is ending abruptly before all of
> the guest state is migrated, the guest hang scenario seems possible.
> Adding vhost-user-rw-lock does not address the issue of end of
> migration.

IMHO it's better we debug and fix all the issues before merging this one,
otherwise we may overlook something.  You could pass over the patch to
whoever is going to debug this, so it will be included in the whole set to be
posted when the bug is completely fixed.

> * From the protocol page above, it is not clear if the front-end
> should allow/have multiple threads talking to the same vhost-user
> device.

The protocol should have no restriction on the thread model of a front-end.
It only describes the wire protocol.

IIUC the protocol was designed to be serialized by nature (where there's no
request ID, so we can't match reply to any of the previous response), then
the front-end can manage the threads well to serialize all the requests,
like using this rwlock.

Thanks,

-- 
Peter Xu




Re: [PATCH 1/2] vhost-user: add a write-read lock

2024-07-15 Thread Peter Xu
On Mon, Jul 15, 2024 at 01:44:00PM +0530, Prasad Pandit wrote:
> On Thu, 11 Jul 2024 at 21:12, Peter Xu  wrote:
> > I apologize if I suggested WITH_QEMU_LOCK_GUARD when we talked.. I don't
> > remember which one I suggested, but in this case IIUC it'll be much easier
> > to review if you use the other sister function QEMU_LOCK_GUARD()
> > instead.. That should make the diff much, much smaller.
> 
> * Yes, QEMU_LOCK_GUARD simplifies the diff, but it may extend the time
> for which the lock is held, delaying other threads; is that okay?

I think it shouldn't be a major deal in most cases, if the extended critical
section only covers a bunch of instructions. In special cases we can still use
WITH_QEMU_LOCK_GUARD, but I'd start with the simple one first and only switch
if necessary.

Thanks,

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-12 Thread Peter Xu
On Fri, Jul 12, 2024 at 09:44:02AM -0300, Fabiano Rosas wrote:
> Do you have a reference for that kubevirt issue I could look at? It
> maybe interesting to investigate further. Where's the throttling coming
> from? And doesn't less vcpu time imply less dirtying and therefore
> faster convergence?

Sorry, I don't have a link on hand.. sometimes it's not about convergence;
it's about unintentionally impacting the guest workload too much, which is
not wanted, especially on a public cloud.

It's understandable to me since they're under the same cgroup with
throttled cpu resources applied to QEMU+Libvirt processes as a whole,
probably based on N_VCPUS with some tiny extra room for other stuff.

For example, I remember they also hit other threads contending with the
vcpu threads, like the block layer thread pools.

It's a separate issue here when talking about locked_vm, as kubevirt
probably needs to figure out a way to say "these are mgmt threads, and those
are vcpu threads", because mgmt threads can take quite some cpu resources
sometimes and that's not avoidable.  Page pinning will be another story, as
in many cases pinning should not be required, except for VFIO, zerocopy and
other special stuff.

-- 
Peter Xu




Re: [PATCH v2 16/19] system/cpus: Add cpu_pause() function

2024-07-12 Thread Peter Xu
On Fri, Jul 12, 2024 at 10:02:43PM +1000, Nicholas Piggin wrote:
> This factors the CPU pause function from pause_all_vcpus() into a
> new cpu_pause() function, similarly to cpu_resume(). cpu_resume()
> is moved to keep it next to cpu_pause().
> 
> Cc: Philippe Mathieu-Daudé 
> Cc: Peter Xu 
> Cc: Richard Henderson 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-11 Thread Peter Xu
On Thu, Jul 11, 2024 at 06:12:44PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, Jul 11, 2024 at 04:37:34PM -0300, Fabiano Rosas wrote:
> >
> > [...]
> >
> >> We also don't flush the iov at once, so f->buf seems redundant to
> >> me. But of course, if we touch any of that we must ensure we're not
> >> dropping any major optimization.
> >
> > Yes some tests over that would be more persuasive when it comes.
> >
> > Per my limited experience in the past few years: memcpy on chips nowadays
> > is pretty cheap.  You'll see very soon one more example of that when you
> > start to look at the qatzip series: that series decided to do one more
> > memcpy for all guest pages, to make it a larger chunk of buffer instead of
> > submitting the compression tasks in 4k chunks (while I thought 4k wasn't
> > too small itself).
> >
> > That may be more involved so may not be a great example (e.g. the
> > compression algo can be special in this case where it just likes larger
> > buffers), but it's not uncommon that I see people trade things with memcpy,
> > especially small buffers.
> >
> > [...]
> >
> >> Any piece of code that fills an iov with data is prone to be able to
> >> send that data through multifd. From this perspective, multifd is just a
> >> way to give work to an iochannel. We don't *need* to use it, but it
> >> might be simple enough to the point that the benefit of ditching
> >> QEMUFile can be reached without too much rework.
> >> 
> >> Say we provision multifd threads early and leave them waiting for any
> >> part of the migration code to send some data. We could have n-1 threads
> >> idle waiting for the bulk of the data and use a single thread for any
> >> early traffic that does not need to be parallel.
> >> 
> >> I'm not suggesting we do any of this right away or even that this is the
> >> correct way to go, I'm just letting you know some of my ideas and why I
> >> think ram + device state might not be the only data we put through
> >> multifd.
> >
> > We can wait and see whether that can be of any use in the future, even if
> > so, we still have chance to add more types into the union, I think.  But
> > again, I don't expect.
> >
> > My gut feeling: we shouldn't bother putting any (1) non-huge-chunk, or (2)
> > non-IO, data onto multifd.  Again, I would ask "why not the main channel",
> > otherwise.
> >
> > [...]
> >
> >> Just to be clear, do you want a thread-pool to replace multifd? Or would
> >> that be only used for concurrency on the producer side?
> >
> > Not replace multifd.  It's just that I was imagining multifd threads only
> > manage IO stuff, nothing else.
> >
> > I was indeed thinking whether we can reuse multifd threads, but then I
> > found there's risk mangling these two concepts, as: when we do more than IO
> > in multifd threads (e.g., talking to VFIO kernel fetching data which can
> > block), we have risk of blocking IO even if we can push more so the NICs
> > can be idle again.  There's also the complexity where the job fetches data
> > from the VFIO kernel and wants to enqueue again; a multifd task could
> > enqueue to itself, and circular enqueueing can be challenging: imagine 8
> > concurrent tasks (with a total of 8 multifd threads) trying to enqueue at
> > the same time; they starve themselves to death.  Things like that.  Then I
> > figured the remaining jobs are really fn(void*) type of things; they
> > deserve their own pool of threads.
> >
> > So the VFIO threads (used to be per-device) becomes migration worker
> > threads, we need them for both src/dst: on dst there's still pending work
> > to apply the continuous VFIO data back to the kernel driver, and that can't
> > be done by a multifd thread either, for a similar reason.  Then those dest
> > side worker threads can also do load() not only for VFIO but also other
> > device states if we can add more.
> >
> > So to summary, we'll have:
> >
> >   - 1 main thread (send / recv)
> >   - N multifd threads (IOs only)
> >   - M worker threads (jobs only)
> >
> > Of course, postcopy not involved..  How's that sound?
> 
> Looks good. There's a better divide between producer and consumer this
> way. I think it will help when designing new features.
> 
> One observation is that we'll still have two different entities doing IO
> (multifd threads and the migration thread), which I would prefer were
> using a common code at a 

Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-11 Thread Peter Xu
On Thu, Jul 11, 2024 at 04:37:34PM -0300, Fabiano Rosas wrote:

[...]

> We also don't flush the iov at once, so f->buf seems redundant to
> me. But of course, if we touch any of that we must ensure we're not
> dropping any major optimization.

Yes some tests over that would be more persuasive when it comes.

Per my limited experience in the past few years: memcpy on chips nowadays
is pretty cheap.  You'll see very soon one more example of that when you
start to look at the qatzip series: that series decided to do one more
memcpy for all guest pages, to make it a larger chunk of buffer instead of
submitting the compression tasks in 4k chunks (while I thought 4k wasn't
too small itself).

That may be more involved so may not be a great example (e.g. the
compression algo can be special in this case where it just likes larger
buffers), but it's not uncommon that I see people trade things with memcpy,
especially small buffers.

[...]

> Any piece of code that fills an iov with data is prone to be able to
> send that data through multifd. From this perspective, multifd is just a
> way to give work to an iochannel. We don't *need* to use it, but it
> might be simple enough to the point that the benefit of ditching
> QEMUFile can be reached without too much rework.
> 
> Say we provision multifd threads early and leave them waiting for any
> part of the migration code to send some data. We could have n-1 threads
> idle waiting for the bulk of the data and use a single thread for any
> early traffic that does not need to be parallel.
> 
> I'm not suggesting we do any of this right away or even that this is the
> correct way to go, I'm just letting you know some of my ideas and why I
> think ram + device state might not be the only data we put through
> multifd.

We can wait and see whether that can be of any use in the future, even if
so, we still have a chance to add more types into the union, I think.  But
again, I don't expect.

My gut feeling: we shouldn't bother putting any (1) non-huge-chunk, or (2)
non-IO, data onto multifd.  Again, I would ask "why not the main channel",
otherwise.

[...]

> Just to be clear, do you want a thread-pool to replace multifd? Or would
> that be only used for concurrency on the producer side?

Not replace multifd.  It's just that I was imagining multifd threads only
manage IO stuff, nothing else.

I was indeed thinking whether we can reuse multifd threads, but then I
found there's risk mangling these two concepts, as: when we do more than IO
in multifd threads (e.g., talking to VFIO kernel fetching data which can
block), we have risk of blocking IO even if we can push more so the NICs
can be idle again.  There's also the complexity where the job fetches data
from the VFIO kernel and wants to enqueue again; a multifd task could
enqueue to itself, and circular enqueueing can be challenging: imagine 8
concurrent tasks (with a total of 8 multifd threads) trying to enqueue at
the same time; they starve themselves to death.  Things like that.  Then I
figured the remaining jobs are really fn(void*) type of things; they
deserve their own pool of threads.

So the VFIO threads (used to be per-device) becomes migration worker
threads, we need them for both src/dst: on dst there's still pending work
to apply the continuous VFIO data back to the kernel driver, and that can't
be done by a multifd thread either, for a similar reason.  Then those dest
side worker threads can also do load() not only for VFIO but also other
device states if we can add more.

So to summary, we'll have:

  - 1 main thread (send / recv)
  - N multifd threads (IOs only)
  - M worker threads (jobs only)

Of course, postcopy not involved..  How's that sound?

-- 
Peter Xu




Re: [External] Re: [PATCH v5 0/5] Implement QATzip compression method

2024-07-11 Thread Peter Xu
On Thu, Jul 11, 2024 at 09:48:02AM -0700, Yichen Wang wrote:
> On Thu, Jul 11, 2024 at 8:45 AM Peter Xu  wrote:
> >
> > On Wed, Jul 10, 2024 at 07:52:24PM -0700, Yichen Wang wrote:
> > > v5:
> > > - Rebase changes on top of 59084feb256c617063e0dbe7e64821ae8852d7cf
> > > - Add documentation about migration with QATzip acceleration
> > > - Remove multifd-qatzip-sw-fallback option
> >
> > I think Yuan provided quite a few meaningful comments, did you address all
> > of them?
> Yes. I do.
> >
> > You didn't reply in the previous version, and you didn't add anything in
> > the changelog.  I suggest you at least do one of them in the future so that
> > reviewers can understand what happened.
> They are all very good comments; instead of replying I just fixed
> them all and included them in my next patch. My changelog does include
> all the changes and comments we discussed in v4. Sorry, I am new to the
> community, so in the future I will reply "fixed" in the previous email
> before pushing the next version. Thanks a lot, and sorry for that.

That's all fine!  You can definitely mention them too here in the changelog
if you think that's easier.

One last nitpick: in the major patch you duplicated part of the comment
when I requested a movement (the part explaining why you used a buffer
rather than submitting compression for each page without memcpy).  I
suggest you simply move the whole comment above, rather than copying it.

I don't have any further questions on this series.

Thanks,

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-11 Thread Peter Xu
endData, where N is still the n_multifd_channels, and M is
> > the number of users, in VFIO's case, VFIO allocates the cached SendData and
> > use that to enqueue, right after enqueue it'll get a free one by switching
> > it with another one in the multifd's array[N].  Same to RAM.  Then there'll
> > be N+2 SendData and VFIO/RAM needs to free their own SendData when cleanup
> > (multifd owns the N per-thread only).
> >
> 
> At first sight, that seems to work. It's similar to this series, but
> you're moving the free slots back into the channels. Should I keep
> SendData as an actual separate array instead of multiple p->data?

I don't know.. so far they look similar to me, as long as multifd is
managing the N buffers while the clients each manage one of their own.
There should be a helper to allocate/free the generic multifd buffers
(SendData in this case) so everyone uses the same code.

> 
> Let me know, I'll write some code and see what it looks like.

I think Maciej has been working on this too during your absence; I saw he
decided to base his work on top of yours and is preparing the new
version.  I hope you two won't conflict or duplicate the work.  It might
be good to ask / wait and see how far Maciej has gotten.

-- 
Peter Xu




Re: [PATCH v5 0/5] Implement QATzip compression method

2024-07-11 Thread Peter Xu
On Wed, Jul 10, 2024 at 07:52:24PM -0700, Yichen Wang wrote:
> v5:
> - Rebase changes on top of 59084feb256c617063e0dbe7e64821ae8852d7cf
> - Add documentation about migration with QATzip acceleration
> - Remove multifd-qatzip-sw-fallback option

I think Yuan provided quite a few meaningful comments, did you address all
of them?

You didn't reply in the previous version, and you didn't add anything in
the changelog.  I suggest you at least do one of them in the future so that
reviewers can understand what happened.

Thanks,

-- 
Peter Xu




Re: [PATCH 1/2] vhost-user: add a write-read lock

2024-07-11 Thread Peter Xu
On Thu, Jul 11, 2024 at 06:44:23PM +0530, Prasad Pandit wrote:
> From: Prasad Pandit 
> 
> QEMU threads use vhost_user_write/read calls to send
> and receive messages from a vhost-user device. When multiple
> threads communicate with the same vhost-user device, they can
> receive each other's messages, resulting in an erroneous state.
> 
>  vhost_user_read_header:
>   700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5.
>  vhost_device_iotlb_miss:
>   700871,700871: Fail to update device iotlb
>  vhost_user_postcopy_end:
>   700871,700900: Failed to receive reply to postcopy_end
>  vhost_user_read_header:
>   700871,700871: Failed to read msg header. Flags 0x0 instead of 0x5.
> 
> Here fault thread seems to end the postcopy migration while
> another thread is starting the vhost-user device.
> 
> Add a rw lock to hold for one vhost_user_write/read cycle
> and avoid such race conditions.
> 
> Suggested-by: Peter Xu 
> Signed-off-by: Prasad Pandit 
> ---
>  hw/virtio/vhost-user.c | 423 +++--
>  include/hw/virtio/vhost-user.h |   3 +

I apologize if I suggested WITH_QEMU_LOCK_GUARD when we talked.. I don't
remember which one I suggested, but in this case IIUC it'll be much easier
to review if you use the other sister function QEMU_LOCK_GUARD()
instead.  That should make the diff much, much smaller.

-- 
Peter Xu




Re: [PATCH 0/2] Postcopy migration and vhost-user errors

2024-07-11 Thread Peter Xu
On Thu, Jul 11, 2024 at 06:44:22PM +0530, Prasad Pandit wrote:
> From: Prasad Pandit 
> 
> Hello,
> 
> * virsh(1) offers multiple options to initiate Postcopy migration:
> 
> 1) virsh migrate --postcopy --postcopy-after-precopy
> 2) virsh migrate --postcopy + virsh migrate-postcopy
> 3) virsh migrate --postcopy --timeout  --timeout-postcopy
> 
> When Postcopy migration is invoked via method (2) or (3) above,
> the guest on the destination host seems to hang or get stuck sometimes.
> 
> * During Postcopy migration, multiple threads are spawned on the destination
> host to start the guest and setup devices. One such thread starts vhost
> device via vhost_dev_start() function and another called fault_thread handles

Hmm, I thought it was one of the vcpu threads that invoked
vhost_dev_start(), rather than any migration thread?

> page faults in user space using the kernel's userfaultfd(2) system call.
> 
> When fault_thread exits upon completion of Postcopy migration, it sends a
> 'postcopy_end' message to the vhost-user device. But sometimes 'postcopy_end'
> message is sent while vhost device is being setup via vhost_dev_start().
> 
>  Thread-1  Thread-2
> 
> vhost_dev_startpostcopy_ram_incoming_cleanup
>  vhost_device_iotlb_misspostcopy_notify
>   vhost_backend_update_device_iotlb  vhost_user_postcopy_notifier
>vhost_user_send_device_iotlb_msg   vhost_user_postcopy_end
> process_message_reply  process_message_reply
>  vhost_user_readvhost_user_read
>   vhost_user_read_header vhost_user_read_header
>"Fail to update device iotlb"  "Failed to receive reply to 
> postcopy_end"
> 
> This creates confusion when vhost device receives 'postcopy_end' message while
> it is still trying to update IOTLB entries.
> 
> This seems to leave the guest in a stranded/hung state because fault_thread
> has exited saying Postcopy migration has ended, but vhost-device is probably
> still expecting updates. QEMU logs following errors on the destination host
> ===
> ...
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. 
> Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_postcopy_end: 700871,700900: Failed to receive reply to 
> postcopy_end
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. 
> Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. 
> Flags 0x8 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. 
> Flags 0x16 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> qemu-kvm: vhost_user_read_header: 700871,700871: Failed to read msg header. 
> Flags 0x0 instead of 0x5.
> qemu-kvm: vhost_device_iotlb_miss: 700871,700871: Fail to update device iotlb
> ===
> 
> * Couple of patches here help to fix/handle these errors.

I remember after you added the rwlock, there was still a hang issue.

Did you investigate that?  Or do you mean this series will fix all the
problems?

Thanks,

> 
> Thank you.
> ---
> Prasad Pandit (2):
>   vhost-user: add a write-read lock
>   vhost: fail device start if iotlb update fails
> 
>  hw/virtio/vhost-user.c | 423 +++--
>  hw/virtio/vhost.c  |   6 +-
>  include/hw/virtio/vhost-user.h |   3 +
>  3 files changed, 259 insertions(+), 173 deletions(-)
> 
> --
> 2.45.2
> 

-- 
Peter Xu




Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)

2024-07-11 Thread Peter Xu
On Thu, Jul 11, 2024 at 11:44:08AM -0300, Fabiano Rosas wrote:
> But of course, that means we cannot claim to support all kinds of
> forward migrations anymore. Only those in the 6 year period.

That "6 years" comes from the machine type deprecation period; migration
compatibility is mostly only attached to machine types, and we only ever
allowed migration with the same machine type.

It means migration across >6 years will never work anyway as soon as we
start to deprecate machine types (regardless of any reuse of UNUSED),
because the machine types older than 6 years will simply be gone.  See
configuration_post_load(), where it'll throw an error upfront when the
machine type mismatches.

-- 
Peter Xu




Re: [PATCH v5 5/5] tests/migration: Add integration test for 'qatzip' compression method

2024-07-11 Thread Peter Xu
On Wed, Jul 10, 2024 at 07:52:29PM -0700, Yichen Wang wrote:
> From: Bryan Zhang 
> 
> Adds an integration test for 'qatzip'.
> 
> Signed-off-by: Bryan Zhang 
> Signed-off-by: Hao Xiang 
> Signed-off-by: Yichen Wang 
> Reviewed-by: Fabiano Rosas 
> Signed-off-by: Yichen Wang 
> ---
>  tests/qtest/migration-test.c | 35 +++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 70b606b888..b796dd21cb 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -32,6 +32,10 @@
>  # endif /* CONFIG_TASN1 */
>  #endif /* CONFIG_GNUTLS */
>  
> +#ifdef CONFIG_QATZIP
> +#include 
> +#endif /* CONFIG_QATZIP */
> +
>  /* For dirty ring test; so far only x86_64 is supported */
>  #if defined(__linux__) && defined(HOST_X86_64)
>  #include "linux/kvm.h"
> @@ -2992,6 +2996,22 @@ test_migrate_precopy_tcp_multifd_zstd_start(QTestState 
> *from,
>  }
>  #endif /* CONFIG_ZSTD */
>  
> +#ifdef CONFIG_QATZIP
> +static void *
> +test_migrate_precopy_tcp_multifd_qatzip_start(QTestState *from,
> +  QTestState *to)
> +{
> +migrate_set_parameter_int(from, "multifd-qatzip-level", 2);
> +migrate_set_parameter_int(to, "multifd-qatzip-level", 2);
> +
> +/* SW fallback is disabled by default, so enable it for testing. */
> +migrate_set_parameter_bool(from, "multifd-qatzip-sw-fallback", true);
> +migrate_set_parameter_bool(to, "multifd-qatzip-sw-fallback", true);

Shouldn't this already crash without the parameter?

> +
> +return test_migrate_precopy_tcp_multifd_start_common(from, to, "qatzip");
> +}
> +#endif
> +
>  #ifdef CONFIG_QPL
>  static void *
>  test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
> @@ -3089,6 +3109,17 @@ static void test_multifd_tcp_zstd(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_QATZIP
> +static void test_multifd_tcp_qatzip(void)
> +{
> +MigrateCommon args = {
> +.listen_uri = "defer",
> +.start_hook = test_migrate_precopy_tcp_multifd_qatzip_start,
> +};
> +test_precopy_common();
> +}
> +#endif
> +
>  #ifdef CONFIG_QPL
>  static void test_multifd_tcp_qpl(void)
>  {
> @@ -3992,6 +4023,10 @@ int main(int argc, char **argv)
>  migration_test_add("/migration/multifd/tcp/plain/zstd",
> test_multifd_tcp_zstd);
>  #endif
> +#ifdef CONFIG_QATZIP
> +migration_test_add("/migration/multifd/tcp/plain/qatzip",
> +test_multifd_tcp_qatzip);
> +#endif
>  #ifdef CONFIG_QPL
>  migration_test_add("/migration/multifd/tcp/plain/qpl",
> test_multifd_tcp_qpl);
> -- 
> Yichen Wang
> 

-- 
Peter Xu




Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)

2024-07-11 Thread Peter Xu
On Thu, Jul 11, 2024 at 10:34:12AM -0300, Fabiano Rosas wrote:
> Is there an easy way to look at a field and tell in which machine type's
> timeframe it was introduced?

I am not aware of any.

> If the machine type of that era has been removed, then the field is free
> to go as well. I'd prefer if we had a hard link instead of just counting
> years. Maybe we should to that mapping at the machine deprecation time?
> As in, "look at the unused fields introduced in that timeframe and mark
> them free".

We can do that, but it depends on how easy it would be. It can be overkill
to me if it's non-trivial.  When it becomes complicated, I'd rather make
machine compat properties easier to use so we always stick with those.
Currently they're not as easy to use.

Maybe we shouldn't make it a common rule to let people reuse the UNUSED
fields, even if in this case it's probably fine?

E.g. I don't think it's a huge deal to keep all UNUSED fields forever -
sending 512B zeros for only one specific device isn't an issue even if kept
forever.

If "over 6 years" is okay and simple enough, then maybe we can stick with
that (and only when people would like to reuse a field and ask; that's,
after all, not required).  If it takes more than that, I doubt whether we
should spend time working on covering all the fields.

-- 
Peter Xu




Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)

2024-07-10 Thread Peter Xu
On Wed, Jul 10, 2024 at 06:38:26PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Wed, Jul 10, 2024 at 04:48:23PM -0300, Fabiano Rosas wrote:
> >> Peter Xu  writes:
> >> 
> >> > On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote:
> >> >> It's not about trust, we simply don't support migrations other than
> >> >> n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included.
> >> >
> >> > Where does it come from?  I thought we support that..
> >> 
> >> I'm taking that from:
> >> 
> >> docs/devel/migration/main.rst:
> >>   "In general QEMU tries to maintain forward migration compatibility
> >>   (i.e. migrating from QEMU n->n+1) and there are users who benefit from
> >>   backward compatibility as well."
> >> 
> >> But of course it doesn't say whether that comes with a transitive rule
> >> allowing n->n+2 migrations.
> >
> > I'd say that "i.e." implies n->n+1 is not the only forward migration we
> > would support.
> >
> > I _think_ we should support all forward migration as long as the machine
> > type matches.
> >
> >> 
> >> >
> >> > The same question would be: are we requesting an OpenStack cluster to
> >> > always upgrade QEMU with +1 versions, otherwise migration will fail?
> >> 
> >> Will an OpenStack cluster be using upstream QEMU? If not, then that's a
> >
> > It's an example to show what I meant! :) Nothing else. Definitely not
> > saying that everyone should use an upstream released QEMU (but in reality,
> > it's not a problem, I think, and I do feel like people use them, perhaps
> > more with the stable releases).
> >
> >> question for the distro. In a very practical sense, we're not requesting
> >> anything. We barely test n->n+1/n->n-1, even if we had a strong support
> >> statement I wouldn't be confident saying migration from QEMU 2.7 -> QEMU
> >> 9.1 should succeed.
> >
> > No matter what we test in CI, I don't think we should break that for >1
> > versions..  I hope 2.7->9.1 keeps working, otherwise I think it's legal to
> > file a bug by anyone.
> >
> > For example, I randomly fetched a bug report:
> >
> > https://gitlab.com/qemu-project/qemu/-/issues/1937
> >
> > QEMU version:6.2 and 7.2.5
> >
> > And I believe that's the common case even for upstream.  If we don't do
> > that right for upstream, it can be impossible tasks for downstream and for
> > all of us to maintain.
> 
> But do we do that right currently? I have no idea. Have we ever done
> it? And we're here discussing a hypothetical 2.7->9.1 ...
> 
> So we cannot reuse the UNUSED field because QEMU from 2016 might send
> their data and QEMU from 2024 would interpret it wrong.
> 
> How do we proceed? Add a subsection. And make the code survive when
> receiving 0.
> 
> @Peter is that it? What about backwards-compat? We'll need a property as
> well it seems.

Compat property is definitely one way to go, but I think it's you who more
or less persuaded me that reusing it seems possible! At least I can't yet
think of anything bad if it's an ancient unused buffer.

And that's why I was asking about a sane way to describe the "magic
year".  I was very serious when I said "6 years" to follow the deprecation
of machine types, because it gives us something to follow when deciding
when an unused buffer can be considered obsolete, and it makes some sense:
if we start to deprecate machine types, it may not make sense to keep any
UNUSED field compatible forever, after all.

And we need that ruler to be as precise as "always 6 years, following the
machine type deprecation procedure", in case someone tomorrow asks us
about something that has only been UNUSED since 2018.  We need a rule of
thumb if we want to reuse such fields, and if all of you agree we can
start with this one, perhaps with a comment above the field (before we
think it all through and make it a formal rule for deprecating UNUSED)?

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-10 Thread Peter Xu
On Wed, Jul 10, 2024 at 05:16:36PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:
> >> Peter Xu  writes:
> >> 
> >> > On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
> >> >> > Or graphically:
> >> >> > 
> >> >> > 1) client fills the active slot with data. Channels point to nothing
> >> >> >at this point:
> >> >> >   [a]  <-- active slot
> >> >> >   [][][][] <-- free slots, one per-channel
> >> >> > 
> >> >> >   [][][][] <-- channels' p->data pointers
> >> >> > 
> >> >> > 2) multifd_send() swaps the pointers inside the client slot. Channels
> >> >> >still point to nothing:
> >> >> >   []
> >> >> >   [a][][][]
> >> >> > 
> >> >> >   [][][][]
> >> >> > 
> >> >> > 3) multifd_send() finds an idle channel and updates its pointer:
> >> >> 
> >> >> It seems the action "finds an idle channel" is in step 2 rather than 
> >> >> step 3,
> >> >> which means the free slot is selected based on the id of the channel 
> >> >> found, am I
> >> >> understanding correctly?
> >> >
> >> > I think you're right.
> >> >
> >> > Actually I also feel like the description here is ambiguous, even though I
> >> > think I get what Fabiano wanted to say.
> >> >
> >> > The free slot should be the first step of step 2+3, here what Fabiano
> >> > really wanted to suggest is we move the free buffer array from multifd
> >> > channels into the callers, then the caller can pass in whatever data to
> >> > send.
> >> >
> >> > So I think maybe it's cleaner to write it as this in code (note: I didn't
> >> > really change the code, just some ordering and comments):
> >> >
> >> > ===8<===
> >> > @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
> >> >   */
> >> >  active_slot = slots->active;
> >> >  slots->active = slots->free[p->id];
> >> > -p->data = active_slot;
> >> > -
> >> > -/*
> >> > - * By the next time we arrive here, the channel will certainly
> >> > - * have consumed the active slot. Put it back on the free list
> >> > - * now.
> >> > - */
> >> >  slots->free[p->id] = active_slot;
> >> >  
> >> > +/* Assign the current active slot to the chosen thread */
> >> > +p->data = active_slot;
> >> > ===8<===
> >> >
> >> > The comment I removed is slightly misleading to me too, because right
> >> > now active_slot contains data that hasn't yet been delivered to
> >> > multifd, so we're "putting it back on the free list" not because it's
> >> > free, but because we know it won't get used until the multifd send
> >> > thread consumes it (because before that the thread will be busy, and
> >> > we won't use the buffer if so in upcoming send()s).
> >> >
> >> > And then when I'm looking at this again, I think maybe it's a slight
> >> > overkill, and maybe we can still keep the "opaque data" managed by 
> >> > multifd.
> >> > One reason might be that I don't expect the "opaque data" payload to keep
> >> > growing at all: it should really be either RAM or device state as I
> >> > commented elsewhere in a relevant thread, after all it's a thread model
> >> > only for migration purpose to move vmstates..
> >> 
> >> Some amount of flexibility needs to be baked in. For instance, what
> >> about the handshake procedure? Don't we want to use multifd threads to
> >> put some information on the wire for that as well?
> >
> > Is this an orthogonal question?
> 
> I don't think so. You say the payload data should be either RAM or
> device state. I'm asking what other types of data do we want the multifd
> channel to transmit and suggesting we need to allow room for the
> addition of that, whatever it is. One thing that comes to mind that is
> neither RAM or device state is some form of handshake or capabilities
> negotiation.

Indeed what I thought was 

Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)

2024-07-10 Thread Peter Xu
On Wed, Jul 10, 2024 at 04:48:23PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote:
> >> It's not about trust, we simply don't support migrations other than
> >> n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included.
> >
> > Where does it come from?  I thought we support that..
> 
> I'm taking that from:
> 
> docs/devel/migration/main.rst:
>   "In general QEMU tries to maintain forward migration compatibility
>   (i.e. migrating from QEMU n->n+1) and there are users who benefit from
>   backward compatibility as well."
> 
> But of course it doesn't say whether that comes with a transitive rule
> allowing n->n+2 migrations.

I'd say that "i.e." implies n->n+1 is not the only forward migration we
would support.

I _think_ we should support all forward migration as long as the machine
type matches.

> 
> >
> > The same question would be: are we requesting an OpenStack cluster to
> > always upgrade QEMU with +1 versions, otherwise migration will fail?
> 
> Will an OpenStack cluster be using upstream QEMU? If not, then that's a

It's an example to show what I meant! :) Nothing else. Definitely not
saying that everyone should use an upstream released QEMU (but in reality,
it's not a problem, I think, and I do feel like people use them, perhaps
more with the stable releases).

> question for the distro. In a very practical sense, we're not requesting
> anything. We barely test n->n+1/n->n-1, even if we had a strong support
> statement I wouldn't be confident saying migration from QEMU 2.7 -> QEMU
> 9.1 should succeed.

No matter what we test in CI, I don't think we should break that for >1
versions.  I hope 2.7->9.1 keeps working; otherwise I think it's
legitimate for anyone to file a bug.

For example, I randomly fetched a bug report:

https://gitlab.com/qemu-project/qemu/-/issues/1937

QEMU version:6.2 and 7.2.5

And I believe that's the common case even for upstream.  If we don't do
that right for upstream, it becomes an impossible task for downstream and
for all of us to maintain.

-- 
Peter Xu




Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)

2024-07-10 Thread Peter Xu
On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote:
> It's not about trust, we simply don't support migrations other than
> n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included.

Where does it come from?  I thought we support that..

The same question would be: are we requesting an OpenStack cluster to
always upgrade QEMU with +1 versions, otherwise migration will fail?

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-10 Thread Peter Xu
On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
> >> > Or graphically:
> >> > 
> >> > 1) client fills the active slot with data. Channels point to nothing
> >> >at this point:
> >> >   [a]  <-- active slot
> >> >   [][][][] <-- free slots, one per-channel
> >> > 
> >> >   [][][][] <-- channels' p->data pointers
> >> > 
> >> > 2) multifd_send() swaps the pointers inside the client slot. Channels
> >> >still point to nothing:
> >> >   []
> >> >   [a][][][]
> >> > 
> >> >   [][][][]
> >> > 
> >> > 3) multifd_send() finds an idle channel and updates its pointer:
> >> 
> >> It seems the action "finds an idle channel" is in step 2 rather than step 
> >> 3,
> >> which means the free slot is selected based on the id of the channel 
> >> found, am I
> >> understanding correctly?
> >
> > I think you're right.
> >
> > Actually I also feel like the description here is ambiguous, even though I
> > think I get what Fabiano wanted to say.
> >
> > The free slot should be the first step of step 2+3, here what Fabiano
> > really wanted to suggest is we move the free buffer array from multifd
> > channels into the callers, then the caller can pass in whatever data to
> > send.
> >
> > So I think maybe it's cleaner to write it as this in code (note: I didn't
> > really change the code, just some ordering and comments):
> >
> > ===8<===
> > @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
> >   */
> >  active_slot = slots->active;
> >  slots->active = slots->free[p->id];
> > -p->data = active_slot;
> > -
> > -/*
> > - * By the next time we arrive here, the channel will certainly
> > - * have consumed the active slot. Put it back on the free list
> > - * now.
> > - */
> >  slots->free[p->id] = active_slot;
> >  
> > +/* Assign the current active slot to the chosen thread */
> > +p->data = active_slot;
> > ===8<===
> >
> > The comment I removed is slightly misleading to me too, because right now
> > active_slot contains data that hasn't yet been delivered to multifd, so
> > we're "putting it back on the free list" not because it's free, but because
> > we know it won't get used until the multifd send thread consumes it
> > (because before that the thread will be busy, and we won't use the buffer
> > if so in upcoming send()s).
> >
> > And then when I'm looking at this again, I think maybe it's a slight
> > overkill, and maybe we can still keep the "opaque data" managed by multifd.
> > One reason might be that I don't expect the "opaque data" payload to keep
> > growing at all: it should really be either RAM or device state as I
> > commented elsewhere in a relevant thread, after all it's a thread model
> > only for migration purpose to move vmstates..
> 
> Some amount of flexibility needs to be baked in. For instance, what
> about the handshake procedure? Don't we want to use multifd threads to
> put some information on the wire for that as well?

Is this an orthogonal question?

What I meant above is it looks fine to me to keep "device state" in
multifd.c, as long as it is not only about VFIO.

What you were saying seems to be about how to identify this is a device
state, then I just hope VFIO shares the same flag with any future device
that would also like to send its state via multifd, like:

#define MULTIFD_FLAG_DEVICE_STATE (32 << 1)

Then set it in MultiFDPacket_t.flags.  The dest qemu should route that
packet to the device vmsd / save_entry for parsing.

> 
> > Putting it managed by multifd thread should involve less change than this
> > series, but it could look like this:
> >
> > typedef enum {
> > MULTIFD_PAYLOAD_RAM = 0,
> > MULTIFD_PAYLOAD_DEVICE_STATE = 1,
> > } MultifdPayloadType;
> >
> > typedef union {
> > MultiFDPages_t ram_payload;
> > MultifdDeviceState_t device_payload;
> > } MultifdPayload;
> >
> > struct MultiFDSendData {
> > MultifdPayloadType type;
> > MultifdPayload data;
> > };
> 
> Is that a union up there? So you want to simply allocate in multifd the

Yes.

> max amount of memory between the two types of payload? But then we'll

Yes.

> need a memset(p

Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB

2024-07-10 Thread Peter Xu
On Wed, Jul 10, 2024 at 03:39:43PM +, Liu, Yuan1 wrote:
> > I don't think postcopy will trigger timeout failures - postcopy should use
> > constant time to complete a migration, that is guest memsize / bw.
> 
> Yes, the migration total time is predictable; "failure due to timeout" is
> incorrect, "migration taking a long time" may be more accurate.

It shouldn't: postcopy always runs together with precopy, so if you start
postcopy after one round of precopy, the total migration time should
always be smaller than if you run precopy for two rounds.

With postcopy, the migration completes after that; but with precopy, more
rounds of migration will follow, with a dirty sync which may say "there
are unfortunately more dirty pages, let's move on with the 3rd round and
more".

> 
> > The challenge is normally that the delay of page requests is higher than
> > in precopy, but in this case it might not be a big deal. And I wonder if on
> > 100G*2 cards it can also perform pretty well, as the delay might be minimal
> > even if bandwidth is throttled.
> 
> I got your point, I don't have much experience in this area.
> So you mean to reserve a small amount of bandwidth on a NIC for postcopy 
> migration, and compare the migration performance with and without traffic
> on the NIC? Will data plane traffic affect page request delays in postcopy?

I'm not sure what's the "data plane" you're describing here, but logically
VMs should be migrated using mgmt networks, and should be somehow separate
from IOs within the VMs.

I'm not really asking for another test, sorry for causing confusion; this
is purely discussion.  I just feel like postcopy wasn't really seriously
considered even for many valid cases, in some of which postcopy can
perform pretty well even without any modern hardware.  There's no need to
prove which is better for this series.

Thanks,

-- 
Peter Xu




Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB

2024-07-10 Thread Peter Xu
On Wed, Jul 10, 2024 at 01:55:23PM +, Liu, Yuan1 wrote:

[...]

> migrate_set_parameter max-bandwidth 1250M
> |---||-|--|--|--|--|
> |8 Channels |Total   |down |throughput|pages per | send | recv |
> |   |time(ms)|time(ms) |(mbps)|second| cpu %| cpu% |
> |---||-|--|--|--|--|
> |qatzip |   16630|   28| 10467|   2940235|   160|   360|
> |---||-|--|--|--|--|
> |zstd   |   20165|   24|  8579|   2391465|   810|   340|
> |---||-|--|--|--|--|
> |none   |   46063|   40| 10848|330240|45|85|
> |---||-|--|--|--|--|
> 
> QATzip's dirty page processing throughput is much higher than with no
> compression.
> In this test, the vCPUs are in an idle state, so the migration can succeed
> even without compression.

Thanks!  Maybe good material to be put into docs/ too, if Yichen's going
to pick up your doc patch when reposting.

[...]

> I don’t have much experience with postcopy, here are some of my thoughts
> 1. For write-intensive VMs, this solution can improve the migration success
>    rate, because in a limited bandwidth network scenario, the dirty page
>    processing throughput will be significantly reduced without compression;
>    the previous data includes this (pages_per_second). It means that in the
>    no-compression precopy, the dirty pages generated by the workload exceed
>    what migration can process, resulting in migration failure.

Yes.

> 
> 2. If the VM is read-intensive or has low vCPU utilization (for example, my 
>current test scenario is that the vCPUs are all idle). I think no 
> compression +
>precopy + postcopy also cannot improve the migration performance, and may 
> also
>cause timeout failure due to long migration time, same with no compression 
> precopy.

I don't think postcopy will trigger timeout failures - postcopy should use
constant time to complete a migration, that is guest memsize / bw.

The challenge is normally that the delay of page requests is higher than
in precopy, but in this case it might not be a big deal. And I wonder if on
100G*2 cards it can also perform pretty well, as the delay might be minimal
even if bandwidth is throttled.

> 
> 3. In my opinion, the postcopy is a good solution in this scenario(low 
> network bandwidth,
>VM is not critical), because even if compression is turned on, the 
> migration may still 
>fail(page_per_second may still less than the new dirty pages), and it is 
> hard to predict
>whether VM memory is compression-friendly.

Yes.

Thanks,

-- 
Peter Xu




Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)

2024-07-10 Thread Peter Xu
On Wed, Jul 10, 2024 at 11:08:20AM -0300, Fabiano Rosas wrote:
> >> I think it's ok:
> >> 
> >> {
> >>   "field": "unused",
> >>   "version_id": 1,
> >>   "field_exists": false,
> >>   "size": 512
> >> },
> >> 
> >> vs.
> >> 
> >> {
> >>   "field": "vendor_data",
> >>   "version_id": 0,
> >>   "field_exists": false,
> >>   "num": 512,
> >>   "size": 1
> >> },
> >> 
> >> The unused field was introduced in 2016 so there's no chance of
> >> migrating a QEMU that old to/from 9.1.
> >
> > What happens if an old qemu 9.0 sends rubbish here to a new QEMU, while the
> > new QEMU would consider it meaningful data?
> 
> It will send zeros, no? The code will have to cope with that. The
> alternative is to put the vendor_data in a subsection and the code will
> also have to cope with the lack of data when the old QEMU doesn't send
> it.

Ah indeed, that "static const uint8_t buf[1024]" has been there at least
since 2017.  So yes, probably always sending zeros.

Nothing else I can think of indeed, if we want to trust that nothing from
before 2016 will migrate.  It's just that we may want to know how that
"2016" is justified as safe if we would like to allow this in the
future.

One thing _could_ be that the "rule of thumb" is we plan to obsolete
machines within 6 years, so anything "UNUSED" older than 6 years can be
overwritten?

-- 
Peter Xu




Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)

2024-07-09 Thread Peter Xu
On Tue, Jul 09, 2024 at 05:38:54PM -0300, Fabiano Rosas wrote:
> Philippe Mathieu-Daudé  writes:
> 
> > "General command" (GEN_CMD, CMD56) is described as:
> >
> >   GEN_CMD is the same as the single block read or write
> >   commands (CMD24 or CMD17). The difference is that [...]
> >   the data block is not a memory payload data but has a
> >   vendor specific format and meaning.
> >
> > Thus this block must not be stored overwriting data block
> > on underlying storage drive. Keep it in a dedicated
> > 'vendor_data[]' array.
> >
> > Signed-off-by: Philippe Mathieu-Daudé 
> > Tested-by: Cédric Le Goater 
> > ---
> > RFC: Is it safe to reuse VMSTATE_UNUSED_V() (which happens
> > to be the same size)?
> 
> Hi, sorry it took some time to get to this, I had just left for vacation
> when you first posted.

And I totally overlooked that email until you replied.  Welcome
back.

> 
> I think it's ok:
> 
> {
>   "field": "unused",
>   "version_id": 1,
>   "field_exists": false,
>   "size": 512
> },
> 
> vs.
> 
> {
>   "field": "vendor_data",
>   "version_id": 0,
>   "field_exists": false,
>   "num": 512,
>   "size": 1
> },
> 
> The unused field was introduced in 2016 so there's no chance of
> migrating a QEMU that old to/from 9.1.

What happens if an old qemu 9.0 sends rubbish here to a new QEMU, while the
new QEMU would consider it meaningful data?

-- 
Peter Xu




Re: [PATCH] docs/migration: add qatzip compression feature

2024-07-09 Thread Peter Xu
On Tue, Jul 09, 2024 at 01:53:59AM +0800, Yuan Liu wrote:
> add Intel QATzip compression method introduction
> 
> Signed-off-by: Yuan Liu 
> Reviewed-by: Nanhai Zou 

Reviewed-by: Peter Xu 

Thanks.

-- 
Peter Xu




Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB

2024-07-09 Thread Peter Xu
On Tue, Jul 09, 2024 at 08:42:59AM +, Liu, Yuan1 wrote:
> > -Original Message-
> > From: Yichen Wang 
> > Sent: Saturday, July 6, 2024 2:29 AM
> > To: Paolo Bonzini ; Daniel P. Berrangé
> > ; Eduardo Habkost ; Marc-André
> > Lureau ; Thomas Huth ;
> > Philippe Mathieu-Daudé ; Peter Xu ;
> > Fabiano Rosas ; Eric Blake ; Markus
> > Armbruster ; Laurent Vivier ; qemu-
> > de...@nongnu.org
> > Cc: Hao Xiang ; Liu, Yuan1 ;
> > Zou, Nanhai ; Ho-Ren (Jack) Chuang
> > ; Wang, Yichen 
> > Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
> > 
> > v4:
> > - Rebase changes on top of 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768
> > - Move the IOV initialization to qatzip implementation
> > - Only use qatzip to compress normal pages
> > 
> > v3:
> > - Rebase changes on top of master
> > - Merge two patches per Fabiano Rosas's comment
> > - Add versions into comments and documentations
> > 
> > v2:
> > - Rebase changes on top of recent multifd code changes.
> > - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> > - Remove parameter tuning and use QATzip's defaults for better
> >   performance.
> > - Add parameter to enable QAT software fallback.
> > 
> > v1:
> > https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
> > 
> > * Performance
> > 
> > We present updated performance results. For circumstantial reasons, v1
> > presented performance on a low-bandwidth (1Gbps) network.
> > 
> > Here, we present updated results with a similar setup as before but with
> > two main differences:
> > 
> > 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> > 2. We had a bug in our memory allocation causing us to only use ~1/2 of
> > the VM's RAM. Now we properly allocate and fill nearly all of the VM's
> > RAM.
> > 
> > Thus, the test setup is as follows:
> > 
> > We perform multifd live migration over TCP using a VM with 64GB memory.
> > We prepare the machine's memory by powering it on, allocating a large
> > amount of memory (60GB) as a single buffer, and filling the buffer with
> > the repeated contents of the Silesia corpus[0]. This is in lieu of a more
> > realistic memory snapshot, which proved troublesome to acquire.
> > 
> > We analyze CPU usage by averaging the output of 'top' every second
> > during migration. This is admittedly imprecise, but we feel that it
> > accurately portrays the different degrees of CPU usage of varying
> > compression methods.
> > 
> > We present the latency, throughput, and CPU usage results for all of the
> > compression methods, with varying numbers of multifd threads (4, 8, and
> > 16).
> > 
> > [0] The Silesia corpus can be accessed here:
> > https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > 
> > ** Results
> > 
> > 4 multifd threads:
> > 
> > |---|---||-|-|
> > |method |time(sec)  |throughput(mbps)|send cpu%|recv cpu%|
> > |---|---||-|-|
> > |qatzip | 23.13 | 8749.94|117.50   |186.49   |
> > |---|---||-|-|
> > |zlib   |254.35 |  771.87|388.20   |144.40   |
> > |---|---||-|-|
> > |zstd   | 54.52 | 3442.59|414.59   |149.77   |
> > |---|---||-|-|
> > |none   | 12.45 |43739.60|159.71   |204.96   |
> > |---|---||-|-|
> > 
> > 8 multifd threads:
> > 
> > |---|---||-|-|
> > |method |time(sec)  |throughput(mbps)|send cpu%|recv cpu%|
> > |---|---||-|-|
> > |qatzip | 16.91 |12306.52|186.37   |391.84   |
> > |---|---||-|-|
> > |zlib   |130.11 | 1508.89|753.86   |289.35   |
> > |---|---||-|-|
> > |zstd   | 27.57 | 6823.23|786.83   |303.80   |
> > |---

Re: [PATCH v4 3/4] migration: Introduce 'qatzip' compression method

2024-07-08 Thread Peter Xu
> +uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
> +
> +if (in_size > q->in_len) {
> +error_setg(errp, "multifd %u: received unexpectedly large packet",
> +   p->id);
> +return -1;
> +}
> +
> +if (flags != MULTIFD_FLAG_QATZIP) {
> +error_setg(errp, "multifd %u: flags received %x flags expected %x",
> +   p->id, flags, MULTIFD_FLAG_QATZIP);
> +return -1;
> +}
> +
> +ret = qio_channel_read_all(p->c, (void *)q->in_buf, in_size, errp);
> +if (ret != 0) {
> +return ret;
> +}
> +
> +in_len = in_size;
> +out_len = q->out_len;
> +ret = qzDecompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len);
> +if (ret != QZ_OK) {
> +error_setg(errp, "multifd %u: qzDecompress failed", p->id);
> +return -1;
> +}
> +if (out_len != expected_size) {
> +error_setg(errp, "multifd %u: packet size received %u size expected %u",
> +   p->id, out_len, expected_size);
> +return -1;
> +}
> +
> +/* Copy each page to its appropriate location. */
> +for (int i = 0; i < p->normal_num; i++) {
> +memcpy(p->host + p->normal[i],
> +   q->out_buf + p->page_size * i,
> +   p->page_size);
> +}
> +return 0;
> +}
> +
> +static MultiFDMethods multifd_qatzip_ops = {
> +.send_setup = qatzip_send_setup,
> +.send_cleanup = qatzip_send_cleanup,
> +.send_prepare = qatzip_send_prepare,
> +.recv_setup = qatzip_recv_setup,
> +.recv_cleanup = qatzip_recv_cleanup,
> +.recv = qatzip_recv
> +};
> +
> +static void multifd_qatzip_register(void)
> +{
> +multifd_register_ops(MULTIFD_COMPRESSION_QATZIP, &multifd_qatzip_ops);
> +}
> +
> +migration_init(multifd_qatzip_register);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 0ecd6f47d7..adceb65050 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -34,14 +34,15 @@ MultiFDRecvData *multifd_get_recv_data(void);
>  /* Multifd Compression flags */
>  #define MULTIFD_FLAG_SYNC (1 << 0)
>  
> -/* We reserve 4 bits for compression methods */
> -#define MULTIFD_FLAG_COMPRESSION_MASK (0xf << 1)
> +/* We reserve 5 bits for compression methods */
> +#define MULTIFD_FLAG_COMPRESSION_MASK (0x1f << 1)
>  /* we need to be compatible. Before compression value was 0 */
>  #define MULTIFD_FLAG_NOCOMP (0 << 1)
>  #define MULTIFD_FLAG_ZLIB (1 << 1)
>  #define MULTIFD_FLAG_ZSTD (2 << 1)
>  #define MULTIFD_FLAG_QPL (4 << 1)
>  #define MULTIFD_FLAG_UADK (8 << 1)
> +#define MULTIFD_FLAG_QATZIP (16 << 1)

This is not a problem of this series alone, but it's sad to see it keeps
eating the flag bits with no good reasons..

Since it will never happen that we enable more than one compressor, we
should have made it a type field, not a bitmask, from the start..  We may
even have avoided carrying it in the packet flags at all, as migration so
far relies on both sides setting up the same migration parameters, and
that already includes the compressor method, or everything will go
chaotic.  So far this flag does mostly nothing good but waste some cycles
on both sides..

Sigh..  I think we can keep this for now, as we already have 5 others
anyway.. it's a pity we didn't notice this earlier, definitely not this
series's fault.

>  
>  /* This value needs to be a multiple of qemu_target_page_size() */
>  #define MULTIFD_PACKET_SIZE (512 * 1024)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 8c9f2a8aa7..ea62f983b1 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -558,6 +558,8 @@
>  #
>  # @zstd: use zstd compression method.
>  #
> +# @qatzip: use qatzip compression method. (Since 9.1)
> +#
>  # @qpl: use qpl compression method.  Query Processing Library(qpl) is
>  #   based on the deflate compression algorithm and use the Intel
>  #   In-Memory Analytics Accelerator(IAA) accelerated compression
> @@ -570,6 +572,7 @@
>  { 'enum': 'MultiFDCompression',
>'data': [ 'none', 'zlib',
>  { 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
> +{ 'name': 'qatzip', 'if': 'CONFIG_QATZIP'},
>  { 'name': 'qpl', 'if': 'CONFIG_QPL' },
>  { 'name': 'uadk', 'if': 'CONFIG_UADK' } ] }
>  
> diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
> index 12792948ff..23e46144d7 100644
> --- a/tests/qtest/meson.build
> +++ b/tests/qtest/meson.build
> @@ -324,6 +324,10 @@ if gnutls.found()
>endif
>  endif
>  
> +if qatzip.found()
> +  migration_files += [qatzip]
> +endif
> +
>  qtests = {
>'bios-tables-test': [io, 'boot-sector.c', 'acpi-utils.c', 'tpm-emu.c'],
>'cdrom-test': files('boot-sector.c'),
> -- 
> Yichen Wang
> 

-- 
Peter Xu




Re: [PATCH v4 2/4] migration: Add migration parameters for QATzip

2024-07-08 Thread Peter Xu
On Fri, Jul 05, 2024 at 11:28:59AM -0700, Yichen Wang wrote:
> +# @multifd-qatzip-sw-fallback: Enable software fallback if QAT hardware
> +# is unavailable. Defaults to false. Software fallback performance
> +# is very poor compared to regular zlib, so be cautious about
> +# enabling this option. (Since 9.1)

Could we avoid this parameter but always have the fallback?

IMHO anyone who is serious about using a HW-accelerated compression method
during migration should make sure that the HW is properly set up.

If you think such caution is required, would warn_report_once() work when
the fallback happens?

Thanks,

-- 
Peter Xu




Re: [PATCH v2 09/15] memory: Do not create circular reference with subregion

2024-07-08 Thread Peter Xu
On Mon, Jul 08, 2024 at 10:06:44AM +0200, Philippe Mathieu-Daudé wrote:
> On 6/7/24 13:59, Akihiko Odaki wrote:
> > On 2024/07/03 2:44, Peter Xu wrote:
> > > On Thu, Jun 27, 2024 at 10:37:52PM +0900, Akihiko Odaki wrote:
> > > > A memory region does not use its own reference counter, but instead
> > > > piggybacks on another QOM object, the "owner" (unless the owner is the
> > > > memory region itself). When creating a subregion, a new reference to the
> > > > owner of the container must be created. However, if the subregion is
> > > > owned by the same QOM object, this results in a self-reference, and makes
> > > > the owner immortal. Avoid such a self-reference.
> > > > 
> > > > Signed-off-by: Akihiko Odaki 
> > > > ---
> > > >   system/memory.c | 11 +--
> > > >   1 file changed, 9 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/system/memory.c b/system/memory.c
> > > > index 74cd73ebc78b..949f5016a68d 100644
> > > > --- a/system/memory.c
> > > > +++ b/system/memory.c
> > > > @@ -2638,7 +2638,10 @@ static void memory_region_update_container_subregions(MemoryRegion *subregion)
> > > >   memory_region_transaction_begin();
> > > > -    memory_region_ref(subregion);
> > > > +    if (mr->owner != subregion->owner) {
> > > > +    memory_region_ref(subregion);
> > > > +    }
> > > > +
> > > >   QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
> > > >   if (subregion->priority >= other->priority) {
> > > >   QTAILQ_INSERT_BEFORE(other, subregion, subregions_link);
> > > > @@ -2696,7 +2699,11 @@ void memory_region_del_subregion(MemoryRegion *mr,
> > > >   assert(alias->mapped_via_alias >= 0);
> > > >   }
> > > >   QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
> > > > -    memory_region_unref(subregion);
> > > > +
> > > > +    if (mr->owner != subregion->owner) {
> > > > +    memory_region_unref(subregion);
> > > > +    }
> > > > +
> > > >   memory_region_update_pending |= mr->enabled && subregion->enabled;
> > > >   memory_region_transaction_commit();
> > > >   }
> > > 
> > > This does look like a real issue.. the patch looks reasonable to me,
> > > but I
> > > wonder whether we should start to add some good comments in code to
> > > reflect
> > > that complexity starting from this one.  The MR refcount isn't easy to
> > > understand to me.
> > > 
> > > It also lets me start to wonder how MR refcount went through until
> > > it looks
> > > like today..  It's definitely not extremely intuitive to use mr->owner as
> > > the object to do refcounting if mr itself does have its own QObject,
> > > meanwhile it has other tricks around.
> > > 
> > > E.g. the first thing I stumbled over when looking was the optimization
> > > where we will avoid refcounting the mr when there's no owner, and IIUC it
> > > was for the case when the "guest memory" (which will never be freed) used
> > > to have no owner so we can speedup DMA if we know it won't go away.
> > > 
> > > https://lore.kernel.org/qemu-devel/1450263601-2828-5-git-send-email-pbonz...@redhat.com/
> > > 
> > > commit 612263cf33062f7441a5d0e3b37c65991fdc3210
> > > Author: Paolo Bonzini 
> > > Date:   Wed Dec 9 11:44:25 2015 +0100
> > > 
> > >  memory: avoid unnecessary object_ref/unref
> > >  For the common case of DMA into non-hotplugged RAM, it is
> > > unnecessary
> > >  but expensive to do object_ref/unref.  Add back an owner field to
> > >  MemoryRegion, so that these memory regions can skip the reference
> > >  counting.
> > > 
> > > If so, it looks like it will stop working with memory-backends get
> > > involved?  As I think those MRs will have owner set always, and I wonder
> > > whether memory-backends should be the major way to specify guest
> > > memory now
> > > and in the future.  So I'm not sure how important that optimization is as
> > > of now, and whether we could "simplify" it back to always do the refcount
> > > if the major scenarios will not adopt it.
> > > 
> 

Re: [External] [PATCH v3 0/4] Implement using Intel QAT to offload ZLIB

2024-07-08 Thread Peter Xu
On Fri, Jul 05, 2024 at 11:28:25AM -0700, Yichen Wang wrote:
> Just want to add some information here. So in ByteDance, the current
> generation server is equipped with 2*100Gb NICs. We reserve 10Gbps for
> control plane purposes which includes live migration here. So it is not
> about we are using “good network”, it is about not normal to use full
> bandwidth for control plane purposes. Hence we do have a requirements for
> QAT/IAA in these cases.

Yes, this makes sense.

But then you may also want to figure out the high CPU consumption of those
cards, so as not to interrupt more important workloads?

I saw there's a new version posted, but I didn't see an explanation of the
CPU consumption issue.  Meanwhile I also see that the docs/ update is still
missing.

Would you consider adding both by replying to the new version?

-- 
Peter Xu




Re: [PATCH] MAINTAINERS: Add myself as a VT-d reviewer

2024-07-08 Thread Peter Xu
On Mon, Jul 08, 2024 at 11:24:48AM +0800, Jason Wang wrote:
> On Mon, Jul 8, 2024 at 11:21 AM Yi Liu  wrote:
> >
> > Signed-off-by: Yi Liu 
> > ---
> >  MAINTAINERS | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 6725913c8b..61724b91d8 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3656,6 +3656,7 @@ F: tests/uefi-test-tools/
> >  VT-d Emulation
> >  M: Michael S. Tsirkin 
> >  R: Jason Wang 
> > +R: Yi Liu 
> >  S: Supported
> >  F: hw/i386/intel_iommu.c
> >  F: hw/i386/intel_iommu_internal.h
> > --
> > 2.34.1
> >
> 
> Acked-by: Jason Wang 
> 
> Thanks!

Acked-by: Peter Xu 

Thanks!!

-- 
Peter Xu




Re: [PATCH v1] target/i386: kvm: Block migration when enfore_cpuid is set to false

2024-07-05 Thread Peter Xu
On Fri, Jul 05, 2024 at 10:22:23AM +, Wang, Wei W wrote:
> On Thursday, July 4, 2024 11:59 PM, Peter Xu wrote:
> > On Thu, Jul 04, 2024 at 03:10:27PM +, Wang, Wei W wrote:
> > > > > diff --git a/target/i386/cpu.c b/target/i386/cpu.c index
> > > > > 4c2e6f3a71..7db4fe4ead 100644
> > > > > --- a/target/i386/cpu.c
> > > > > +++ b/target/i386/cpu.c
> > > > > @@ -8258,7 +8258,7 @@ static Property x86_cpu_properties[] = {
> > > > >  DEFINE_PROP_UINT32("hv-version-id-snumber", X86CPU,
> > > > > hyperv_ver_id_sn, 0),
> > > > >
> > > > >  DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, true),
> > > > > -DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
> > > > > +DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, true),
> > > >
> > > > I assume in many cases people can still properly migrate when the
> > > > hosts are similar or identical, so maybe we at least want the old
> > > > machine types keep working (by introducing a machine compat property)?
> > >
> > > You meant keeping "enforce_cpuid=false" for old machine types (e.g. before
> > 9.1)?
> > > This will make them non-migratable with this patch, but they were
> > > migratable (by
> > > default) as "migratable" wasn't enforced by "enforce_cpuid". Should we
> > > keep them being migratable by default (e.g. enforce_cpuid=true) as well?
> > 
> > Ah, this is trickier than I thought..
> > 
> > The issue is if we make them silently switch to enforce_cpuid=true on old
> > machines, there's a chance they start failing to boot, am I right?
> 
> Right for newly launched guests, regardless of whether they are new or old
> machine types, they will fail to boot when the host cannot afford the features
> for the configured vCPU models. This is expected, and actually part of the
> intentions of this patch.
> 
> When there is a need to boot a guest with reduced features, users need to
> explicitly add "enforce_cpuid=false", which marks the new booted guest as
> non-migratable, or a _better_ way, to identify the unsupported features from 
> the host first, and then get it booted with "-cpu CpuModel,-A,-B", this can 
> make
> it migratable with those known reduced features, and the destination guest is
> required to use the same QEMU commands (as usual) to reduce the same set
> of features as the source and get a enforced check by "enforce_cpuid".
> 
> For live update of QEMU for existing running guests (as you mentioned
> below), the impact is only on the running guests that have had features 
> reduced
> from vCPU models (at the time of their original launch). For this case, the
> recommended way to update them to the new QEMU is also to explicitly identify
> the reduced features and update them with "-cpu CpuModel,-A,-B".
> 
> The rationale behind this is that the features reduced from the guest needs to
> be explicitly determined and controllable. In terms of live migration, the
> destination is ensured to have the same set of reduced features as the source
> side.
> 
> > 
> > if (cpu->enforce_cpuid && x86_cpu_have_filtered_features(cpu)) {
> >     error_setg(&local_err,
> >                accel_uses_host_cpuid() ?
> >                "Host doesn't support requested features" :
> >                "TCG doesn't support requested features");
> >     goto out;
> > }
> > 
> > I suppose we still need to keep all the old worlds running all fine without
> > breaking them when people do a QEMU upgrade.  It needs to work both on
> > booting fine, and on allowing to migrate.
> > 
> > So maybe we actually need two things?
> > 
> >   - One patch introduces a forbid_migration_if_cpuid_mismatch property: when
> > set, block migration if not enforced, otherwise it should still allow
> > migration even if enforce_cpuid=false.  It should default to on, but off
> > on old machines.
> > 
> >   - One patch changes the default value of enforce_cpuid to on, but turns it off
> > on old machines.
> > 
> > Does that look right?
> 
> I think this can work. Not sure what you would think about the above 
> explanations.
> If agree, then probably we don’t need to add the extra complexity.
> 
> Also, the above two things seem to impede the upgrade for guests with older 
> machine
> types to incorporate this enforcement. I think the primary goal of l

Re: [PATCH v1] target/i386: kvm: Block migration when enfore_cpuid is set to false

2024-07-04 Thread Peter Xu
On Thu, Jul 04, 2024 at 03:10:27PM +, Wang, Wei W wrote:
> > > diff --git a/target/i386/cpu.c b/target/i386/cpu.c index
> > > 4c2e6f3a71..7db4fe4ead 100644
> > > --- a/target/i386/cpu.c
> > > +++ b/target/i386/cpu.c
> > > @@ -8258,7 +8258,7 @@ static Property x86_cpu_properties[] = {
> > >  DEFINE_PROP_UINT32("hv-version-id-snumber", X86CPU,
> > > hyperv_ver_id_sn, 0),
> > >
> > >  DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, true),
> > > -DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
> > > +DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, true),
> > 
> > I assume in many cases people can still properly migrate when the hosts are
> > similar or identical, so maybe we at least want the old machine types keep
> > working (by introducing a machine compat property)?
> 
> You meant keeping "enforce_cpuid=false" for old machine types (e.g. before 
> 9.1)?
> This will make them non-migratable with this patch, but they were migratable 
> (by
> default) as "migratable" wasn't enforced by "enforce_cpuid". Should we keep 
> them
> being migratable by default (e.g. enforce_cpuid=true) as well?

Ah, this is trickier than I thought..

The issue is if we make them silently switch to enforce_cpuid=true on old
machines, there's a chance they start failing to boot, am I right?

if (cpu->enforce_cpuid && x86_cpu_have_filtered_features(cpu)) {
    error_setg(&local_err,
               accel_uses_host_cpuid() ?
               "Host doesn't support requested features" :
               "TCG doesn't support requested features");
    goto out;
}

I suppose we still need to keep all the old worlds running all fine without
breaking them when people do a QEMU upgrade.  It needs to work both on
booting fine, and on allowing to migrate.

So maybe we actually need two things?

  - One patch introduces a forbid_migration_if_cpuid_mismatch property: when
set, block migration if not enforced, otherwise it should still allow
migration even if enforce_cpuid=false.  It should default to on, but off
on old machines.

  - One patch changes the default value of enforce_cpuid to on, but turns it off
on old machines.

Does that look right?

Thanks,

-- 
Peter Xu




Re: [PATCH v3 0/4] Implement using Intel QAT to offload ZLIB

2024-07-04 Thread Peter Xu
On Thu, Jul 04, 2024 at 03:15:51AM +, Liu, Yuan1 wrote:
> > -Original Message-
> > From: Peter Xu 
> > Sent: Wednesday, July 3, 2024 3:16 AM
> > To: Wang, Yichen 
> > Cc: Paolo Bonzini ; Daniel P. Berrangé
> > ; Eduardo Habkost ; Marc-André
> > Lureau ; Thomas Huth ;
> > Philippe Mathieu-Daudé ; Fabiano Rosas
> > ; Eric Blake ; Markus Armbruster
> > ; Laurent Vivier ; qemu-
> > de...@nongnu.org; Hao Xiang ; Liu, Yuan1
> > ; Zou, Nanhai ; Ho-Ren (Jack)
> > Chuang 
> > Subject: Re: [PATCH v3 0/4] Implement using Intel QAT to offload ZLIB
> > 
> > On Thu, Jun 27, 2024 at 03:34:41PM -0700, Yichen Wang wrote:
> > > v3:
> > > - Rebase changes on top of master
> > > - Merge two patches per Fabiano Rosas's comment
> > > - Add versions into comments and documentations
> > >
> > > v2:
> > > - Rebase changes on top of recent multifd code changes.
> > > - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> > > - Remove parameter tuning and use QATzip's defaults for better
> > >   performance.
> > > - Add parameter to enable QAT software fallback.
> > >
> > > v1:
> > > https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
> > >
> > > * Performance
> > >
> > > We present updated performance results. For circumstantial reasons, v1
> > > presented performance on a low-bandwidth (1Gbps) network.
> > >
> > > Here, we present updated results with a similar setup as before but with
> > > two main differences:
> > >
> > > 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> > > 2. We had a bug in our memory allocation causing us to only use ~1/2 of
> > > the VM's RAM. Now we properly allocate and fill nearly all of the VM's
> > > RAM.
> > >
> > > Thus, the test setup is as follows:
> > >
> > > We perform multifd live migration over TCP using a VM with 64GB memory.
> > > We prepare the machine's memory by powering it on, allocating a large
> > > amount of memory (60GB) as a single buffer, and filling the buffer with
> > > the repeated contents of the Silesia corpus[0]. This is in lieu of a
> > more
> > > realistic memory snapshot, which proved troublesome to acquire.
> > >
> > > We analyze CPU usage by averaging the output of 'top' every second
> > > during migration. This is admittedly imprecise, but we feel that it
> > > accurately portrays the different degrees of CPU usage of varying
> > > compression methods.
> > >
> > > We present the latency, throughput, and CPU usage results for all of the
> > > compression methods, with varying numbers of multifd threads (4, 8, and
> > > 16).
> > >
> > > [0] The Silesia corpus can be accessed here:
> > > https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > >
> > > ** Results
> > >
> > > 4 multifd threads:
> > >
> > > |---|---||-|
> > -|
> > > |method |time(sec)  |throughput(mbps)|send cpu%|recv
> > cpu%|
> > > |---|---||-|
> > -|
> > > |qatzip | 23.13 | 8749.94|117.50   |186.49
> > |
> > > |---|---||-|
> > -|
> > > |zlib   |254.35 |  771.87|388.20   |144.40
> > |
> > > |---|---||-|
> > -|
> > > |zstd   | 54.52 | 3442.59|414.59   |149.77
> > |
> > > |---|---||-|
> > -|
> > > |none   | 12.45 |43739.60|159.71   |204.96
> > |
> > > |---|---||-|
> > -|
> > >
> > > 8 multifd threads:
> > >
> > > |---|---||-|
> > -|
> > > |method |time(sec)  |throughput(mbps)|send cpu%|recv
> > cpu%|
> > > |---|---||-|
> > -|
> > > |qatzip | 16.91 |12306.52|186.37   |391.84
> > |
> > > |---|---||-|
> > -|
> > >  

Re: [PATCH v1] target/i386: kvm: Block migration when enfore_cpuid is set to false

2024-07-03 Thread Peter Xu
On Wed, Jul 03, 2024 at 10:49:12PM +0800, Wei Wang wrote:
> When enforce_cpuid is set to false, the guest is launched with a filtered
> set of features, meaning that unsupported features by the host are removed
> from the guest's vCPU model. This could cause issues for live migration.
> For example, a guest on the source is running with features A and B. If
> the destination host does not support feature B, the stub guest can still
> be launched on the destination with feature A only if enforce_cpuid=false.
> Live migration can start in this case, though it may fail later when the
> states of feature B are put to the destination side. This failure occurs
> in the late stage (i.e., stop phase) of the migration flow, where the
> source guest has already been paused. Tests show that in such cases the
> source guest does not recover, and the destination is unable to resume to
> run.
> 
> Make "enforce_cpuid=true" a hard requirement for a guest to be migratable,
> and change the default value of "enforce_cpuid" to true, making the guest
> vCPUs migratable by default. If the destination stub guest has inconsistent
> CPUIDs (i.e., destination host cannot support the features defined by the
> guest's vCPU model), it fails to boot (with enfore_cpuid=true by default),
> thereby preventing migration from occuring. If enfore_cpuid=false is
> explicitly added for the guest, the guest is deemed as non-migratable
> (via the migration blocker), so the above issue won't occur as the guest
> won't be migrated.
> 
> Tested-by: Lei Wang 
> Signed-off-by: Wei Wang 

[Copy Jiri and Dan for libvirt-side implications]

> ---
>  target/i386/cpu.c |  2 +-
>  target/i386/kvm/kvm.c | 25 +++--
>  2 files changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 4c2e6f3a71..7db4fe4ead 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -8258,7 +8258,7 @@ static Property x86_cpu_properties[] = {
>  DEFINE_PROP_UINT32("hv-version-id-snumber", X86CPU, hyperv_ver_id_sn, 0),
>  
>  DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, true),
> -DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
> +DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, true),

I assume in many cases people can still properly migrate when the hosts are
similar or identical, so maybe we at least want the old machine types keep
working (by introducing a machine compat property)?

>  DEFINE_PROP_BOOL("x-force-features", X86CPU, force_features, false),
>  DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
>  DEFINE_PROP_UINT32("phys-bits", X86CPU, phys_bits, 0),
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index dd8b0f3313..aee717c1cf 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -1741,7 +1741,7 @@ static int hyperv_init_vcpu(X86CPU *cpu)
>  return 0;
>  }
>  
> -static Error *invtsc_mig_blocker;
> +static Error *cpu_mig_blocker;
>  
>  #define KVM_MAX_CPUID_ENTRIES  100
>  
> @@ -2012,6 +2012,15 @@ full:
>  abort();
>  }
>  
> +static bool kvm_vcpu_need_block_migration(X86CPU *cpu)
> +{
> +CPUX86State *env = &cpu->env;
> +
> +return !cpu->enforce_cpuid ||
> +   (!env->user_tsc_khz && (env->features[FEAT_8000_0007_EDX] &
> +   CPUID_APM_INVTSC));
> +}

Nit: maybe it's nice this returns a "const char*" with detailed reasons to
be put into the error_setg(), so it dumps the same as before for the invtsc
blocker.

Thanks,

> +
>  int kvm_arch_init_vcpu(CPUState *cs)
>  {
>  struct {
> @@ -2248,18 +2257,14 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  has_msr_mcg_ext_ctl = has_msr_feature_control = true;
>  }
>  
> -if (!env->user_tsc_khz) {
> -if ((env->features[FEAT_8000_0007_EDX] & CPUID_APM_INVTSC) &&
> -invtsc_mig_blocker == NULL) {
> -error_setg(&invtsc_mig_blocker,
> -   "State blocked by non-migratable CPU device"
> -   " (invtsc flag)");
> -r = migrate_add_blocker(&invtsc_mig_blocker, &local_err);
> +if (!cpu_mig_blocker &&  kvm_vcpu_need_block_migration(cpu)) {
> +error_setg(&cpu_mig_blocker,
> +   "State blocked by non-migratable CPU device");
> +r = migrate_add_blocker(&cpu_mig_blocker, &local_err);
>  if (r < 0) {
>  error_report_err(local_err);
>  return r;
>  }
> -}
>  }
>  
>  if (cpu->vmware_cpuid_freq
> @@ -2312,7 +2317,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  return 0;
>  
>   fail:
> -migrate_del_blocker(&invtsc_mig_blocker);
> +migrate_del_blocker(&cpu_mig_blocker);
>  
>  return r;
>  }
> -- 
> 2.27.0
> 
> 

-- 
Peter Xu




Re: [PATCH v3 0/4] Implement using Intel QAT to offload ZLIB

2024-07-02 Thread Peter Xu
ent.
> - 'qatzip' outperforms other compression workers with 4 and 8 workers,
>   achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
> ~58% latency reduction over 'zstd' with 4 workers.
> - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
>   showing a ~32% increase in latency. This performance difference
> becomes more noticeable with more workers, as CPU compression is highly
> parallelizable.
> - 'qatzip' compression uses considerably less CPU than other compression
>   methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
> compression CPU usage compared to 'zstd' and 'zlib'.
> - 'qatzip' decompression CPU usage is less impressive, and is even
>   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.

Thanks for the results update.

It looks like the docs/migration/ file is still missing.  It'll be great to
have it in the next version or separately.

So how does it compare with QPL (which got merged already)?  They at least
look like they are both supported on Intel platforms, so a user who wants
to compress the RAM could start to look at both.  I'm utterly confused
about why Intel provides these two similar compressors.  It would be great
to have some answer and perhaps put it into the doc.

I am honestly curious too on whether are you planning to use it in
production.  It looks like when network resources are rich, no-comp is
almost always better than qatzip, whether on total migration time or CPU
consumption.  I'm pretty surprised that it takes that many resources even
though the work should have been offloaded to the QAT chips, IIUC.

I think it may not be a problem to merge this series even if it performs
slower on some criteria.. but I think we may still want to know when this
should be used, or the good reason this should be merged (if it's not that
it outperforms others).

Thanks,

> 
> 
> Bryan Zhang (4):
>   meson: Introduce 'qatzip' feature to the build system
>   migration: Add migration parameters for QATzip
>   migration: Introduce 'qatzip' compression method
>   tests/migration: Add integration test for 'qatzip' compression method
> 
>  hw/core/qdev-properties-system.c |   6 +-
>  meson.build  |  10 +
>  meson_options.txt|   2 +
>  migration/meson.build|   1 +
>  migration/migration-hmp-cmds.c   |   8 +
>  migration/multifd-qatzip.c   | 382 +++
>  migration/multifd.h  |   1 +
>  migration/options.c  |  57 +
>  migration/options.h  |   2 +
>  qapi/migration.json  |  38 +++
>  scripts/meson-buildoptions.sh|   6 +
>  tests/qtest/meson.build  |   4 +
>  tests/qtest/migration-test.c |  35 +++
>  13 files changed, 551 insertions(+), 1 deletion(-)
>  create mode 100644 migration/multifd-qatzip.c
> 
> -- 
> Yichen Wang
> 

-- 
Peter Xu




Re: [PATCH v2 09/15] memory: Do not create circular reference with subregion

2024-07-02 Thread Peter Xu
On Thu, Jun 27, 2024 at 10:37:52PM +0900, Akihiko Odaki wrote:
> A memory region does not use its own reference counter, but instead
> piggybacks on another QOM object, the "owner" (unless the owner is the
> memory region itself). When creating a subregion, a new reference to the
> owner of the container must be created. However, if the subregion is
> owned by the same QOM object, this results in a self-reference, and makes
> the owner immortal. Avoid such a self-reference.
> 
> Signed-off-by: Akihiko Odaki 
> ---
>  system/memory.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/system/memory.c b/system/memory.c
> index 74cd73ebc78b..949f5016a68d 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2638,7 +2638,10 @@ static void memory_region_update_container_subregions(MemoryRegion *subregion)
>  
>      memory_region_transaction_begin();
>  
> -    memory_region_ref(subregion);
> +    if (mr->owner != subregion->owner) {
> +        memory_region_ref(subregion);
> +    }
> +
>      QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
>          if (subregion->priority >= other->priority) {
>              QTAILQ_INSERT_BEFORE(other, subregion, subregions_link);
> @@ -2696,7 +2699,11 @@ void memory_region_del_subregion(MemoryRegion *mr,
>          assert(alias->mapped_via_alias >= 0);
>      }
>      QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
> -    memory_region_unref(subregion);
> +
> +    if (mr->owner != subregion->owner) {
> +        memory_region_unref(subregion);
> +    }
> +
>      memory_region_update_pending |= mr->enabled && subregion->enabled;
>      memory_region_transaction_commit();
>  }

This does look like a real issue.. the patch looks reasonable to me, but I
wonder whether we should start to add some good comments in the code to
reflect that complexity, starting from this one.  The MR refcount isn't
easy for me to understand.
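To make the self-reference concrete, here is a toy model (not the real
QOM/MemoryRegion API; all names are simplified stand-ins) of a refcount
that piggybacks on the owner, with and without the fix from the patch:

```c
#include <assert.h>

/* Toy model: a region's refcount piggybacks on its "owner" object. */
typedef struct Object { int ref; } Object;
typedef struct MemoryRegion { Object *owner; } MemoryRegion;

static void memory_region_ref(MemoryRegion *mr)   { mr->owner->ref++; }

/* Unpatched behavior: always take a ref on the subregion's owner. */
static void add_subregion(MemoryRegion *container, MemoryRegion *sub)
{
    (void)container;
    memory_region_ref(sub);
}

/* Patched behavior: skip the ref when container and subregion share the
 * same owner, avoiding a self-reference. */
static void add_subregion_fixed(MemoryRegion *container, MemoryRegion *sub)
{
    if (container->owner != sub->owner) {
        memory_region_ref(sub);
    }
}
```

With the unpatched version, an owner embedding both container and subregion
can never drop to zero references: its own subregion pins it forever.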

It also makes me start to wonder how the MR refcount evolved into what it
looks like today..  It's definitely not intuitive to use mr->owner as the
object to do refcounting when mr itself does have its own QOM object,
meanwhile it has other tricks around.

E.g. the first thing I stumbled over when looking was the optimization
where we avoid refcounting the mr when there's no owner, and IIUC it was
for the case when the "guest memory" (which will never be freed) used to
have no owner, so we can speed up DMA if we know it won't go away.

https://lore.kernel.org/qemu-devel/1450263601-2828-5-git-send-email-pbonz...@redhat.com/

commit 612263cf33062f7441a5d0e3b37c65991fdc3210
Author: Paolo Bonzini 
Date:   Wed Dec 9 11:44:25 2015 +0100

memory: avoid unnecessary object_ref/unref

For the common case of DMA into non-hotplugged RAM, it is unnecessary
but expensive to do object_ref/unref.  Add back an owner field to
MemoryRegion, so that these memory regions can skip the reference
counting.

If so, it looks like it will stop working when memory-backends get
involved?  As I think those MRs will always have an owner set, and I wonder
whether memory-backends should be the major way to specify guest memory now
and in the future.  So I'm not sure how important that optimization is as
of now, and whether we could "simplify" it back to always doing the
refcount if the major scenarios will not adopt it.

The other issue is we have used the owner refcount from the start, ever
since memory_region_ref() got introduced in:

commit 46637be269aaaceb9867ffdf176e906401138fff
Author: Paolo Bonzini 
Date:   Tue May 7 09:06:00 2013 +0200

memory: add ref/unref

And we still have that in our document, even though I don't think it's true
anymore:

 * ...  MemoryRegions actually do not have their
 * own reference count; they piggyback on a QOM object, their "owner".
 * This function adds a reference to the owner.

It looks like what happened is that when the change was introduced, MR was
not a QOM object yet.  But it later became one..

I mentioned all these only because I found that _if_ we can keep mr
refcounting as simple as other objects:

memory_region_ref(mr)
{
object_ref(OBJECT(mr));
}

Then it looks like this "recursive refcount" problem can also go away.  I'm
curious whether you or anyone has tried to explore that path, or whether
the above doesn't make sense at all.

Thanks,

-- 
Peter Xu




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-06-27 Thread Peter Xu
On Thu, Jun 27, 2024 at 10:40:11AM -0400, Peter Xu wrote:
> On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
> > > Or graphically:
> > > 
> > > 1) client fills the active slot with data. Channels point to nothing
> > >at this point:
> > >   [a]  <-- active slot
> > >   [][][][] <-- free slots, one per-channel
> > > 
> > >   [][][][] <-- channels' p->data pointers
> > > 
> > > 2) multifd_send() swaps the pointers inside the client slot. Channels
> > >still point to nothing:
> > >   []
> > >   [a][][][]
> > > 
> > >   [][][][]
> > > 
> > > 3) multifd_send() finds an idle channel and updates its pointer:
> > 
> > It seems the action "finds an idle channel" is in step 2 rather than step 3,
> > which means the free slot is selected based on the id of the channel found, 
> > am I
> > understanding correctly?
> 
> I think you're right.
> 
> Actually I also feel like the description here is ambiguous, even though I
> think I get what Fabiano wanted to say.
> 
> The free slot should be the first step of step 2+3, here what Fabiano
> really wanted to suggest is we move the free buffer array from multifd
> channels into the callers, then the caller can pass in whatever data to
> send.
> 
> So I think maybe it's cleaner to write it as this in code (note: I didn't
> really change the code, just some ordering and comments):
> 
> ===8<===
> @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
>       */
>      active_slot = slots->active;
>      slots->active = slots->free[p->id];
> -    p->data = active_slot;
> -
> -    /*
> -     * By the next time we arrive here, the channel will certainly
> -     * have consumed the active slot. Put it back on the free list
> -     * now.
> -     */
>      slots->free[p->id] = active_slot;
>  
> +    /* Assign the current active slot to the chosen thread */
> +    p->data = active_slot;
> ===8<===
> 
> The comment I removed is slightly misleading to me too, because right now
> active_slot contains data that hasn't yet been delivered to multifd, so
> we're "putting it back on the free list" not because it's free, but because
> we know it won't get used until the multifd send thread consumes it
> (because before that the thread will be busy, and we won't use the buffer
> in upcoming send()s if so).
> 
> And then when I'm looking at this again, I think maybe it's a slight
> overkill, and maybe we can still keep the "opaque data" managed by multifd.
> One reason might be that I don't expect the "opaque data" payload to keep
> growing at all: it should really be either RAM or device state as I
> commented elsewhere in a relevant thread, after all it's a thread model
> only for migration purposes to move vmstates..
>
> Keeping it managed by the multifd thread should involve less change than
> this series, but it could look like this:
> 
> typedef enum {
>     MULTIFD_PAYLOAD_RAM = 0,
>     MULTIFD_PAYLOAD_DEVICE_STATE = 1,
> } MultifdPayloadType;
> 
> typedef enum {
>     MultiFDPages_t ram_payload;
>     MultifdDeviceState_t device_payload;
> } MultifdPayload;

PS: please conditionally read "enum" as "union" throughout the previous
email of mine, sorry.

[I'll leave that to readers to decide when should do the replacement..]

> 
> struct MultiFDSendData {
>     MultifdPayloadType type;
>     MultifdPayload data;
> };
> 
> Then the "enum" makes sure the payload consumes only the max of both
> types; a side benefit to save some memory.
> 
> I think we need to make sure MultifdDeviceState_t is generic enough so that
> it will work for mostly everything (especially normal VMSDs).  In this case
> the VFIO series should be good as that was currently defined as:
> 
> typedef struct {
>     MultiFDPacketHdr_t hdr;
> 
>     char idstr[256] QEMU_NONSTRING;
>     uint32_t instance_id;
> 
>     /* size of the next packet that contains the actual data */
>     uint32_t next_packet_size;
> } __attribute__((packed)) MultiFDPacketDeviceState_t;
> 
> IIUC that was what we need exactly with idstr+instance_id, so as to nail
> down exactly where the "opaque device state" should go, then load it with
> a buffer-based loader when it's ready (starting from VFIO, to get rid of
> qemufile).  For VMSDs in the future, if ever possible, that should be a
> modified version of vmstate_load() where it may take buffers, not qemufiles.
> 
> To Maciej: please see whether above makes sense to you, and if you also
> agree please consider that with your VFIO work.
> 
> Thanks,
> 
> > 
> > >   []
> > >   [a][][][]
> > > 
> > >   [a][][][]
> > >   ^idle
> 
> -- 
> Peter Xu

-- 
Peter Xu




Re: [PATCH v1 00/13] Multifd  device state transfer support with VFIO consumer

2024-06-27 Thread Peter Xu
On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
> Having RAM sent in parallel with non-iterables would make sense to me,
> but I am not 100% sure this is a safe thing to do - after all, currently
> non-iterables can rely on the whole RAM being already transferred.

And I forgot to comment on this one.. but that's a good point.

I think we need further investigation indeed on this one.  Some devices may
have special dependencies like what you said, either on memory being fully
loaded, or on something else like the BQL, so at least concurrent load()
won't work for the latter.  What I was hoping is that we can start to
collect some time-consuming objects into the async model if they do not
have such dependencies.  The thing in my mind is still vcpus so far: that's
where I observed a major uncertainty causing major downtimes as well.  I
remember a vcpu only needs loaded memory once KVM_RUN triggers the loading
of CR3, so _maybe_ that'll be fine, but that needs some double-checking.

Thanks,

-- 
Peter Xu




Re: [PATCH v1 00/13] Multifd  device state transfer support with VFIO consumer

2024-06-27 Thread Peter Xu
On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:
> On 26.06.2024 18:23, Peter Xu wrote:
> > On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
> > > On 26.06.2024 03:51, Peter Xu wrote:
> > > > On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
> > > > > On 25.06.2024 19:25, Peter Xu wrote:
> > > > > > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > Hi Peter,
> > > > > > 
> > > > > > Hi, Maciej,
> > > > > > 
> > > > > > > 
> > > > > > > On 23.06.2024 22:27, Peter Xu wrote:
> > > > > > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero 
> > > > > > > > wrote:
> > > > > > > > > From: "Maciej S. Szmigiero" 
> > > > > > > > > 
> > > > > > > > > This is an updated v1 patch series of the RFC (v0) series 
> > > > > > > > > located here:
> > > > > > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/
> > > > > > > > 
> > > > > > > > OK I took some hours thinking about this today, and here's some 
> > > > > > > > high level
> > > > > > > > comments for this series.  I'll start with which are more 
> > > > > > > > relevant to what
> > > > > > > > Fabiano has already suggested in the other thread, then I'll 
> > > > > > > > add some more.
> > > > > > > > 
> > > > > > > > https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de
> > > > > > > 
> > > > > > > That's a long list, thanks for these comments.
> > > > > > > 
> > > > > > > I have responded to them inline below.
> > > > > > > 
> (..)
> > > > > > > > 4. Risk of OOM on unlimited VFIO buffering
> > > > > > > > ==
> > > > > > > > 
> > > > > > > > This follows with above bullet, but my pure question to ask 
> > > > > > > > here is how
> > > > > > > > does VFIO guarantees no OOM condition by buffering VFIO state?
> > > > > > > > 
> > > > > > > > I mean, currently your proposal used vfio_load_bufs_thread() as 
> > > > > > > > a separate
> > > > > > > > thread to only load the vfio states until sequential data is 
> > > > > > > > received,
> > > > > > > > however is there an upper limit of how much buffering it could 
> > > > > > > > do?  IOW:
> > > > > > > > 
> > > > > > > > vfio_load_state_buffer():
> > > > > > > > 
> > > > > > > >   if (packet->idx >= migration->load_bufs->len) {
> > > > > > > >   g_array_set_size(migration->load_bufs, packet->idx + 1);
> > > > > > > >   }
> > > > > > > > 
> > > > > > > >   lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
> > > > > > > >   ...
> > > > > > > >   lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
> > > > > > > >   lb->len = data_size - sizeof(*packet);
> > > > > > > >   lb->is_present = true;
> > > > > > > > 
> > > > > > > > What if garray keeps growing with lb->data allocated, which 
> > > > > > > > triggers the
> > > > > > > > memcg limit of the process (if QEMU is in such process)?  Or 
> > > > > > > > just deplete
> > > > > > > > host memory and causing OOM kill.
> > > > > > > > 
> > > > > > > > I think we may need to find a way to throttle max memory usage 
> > > > > > > > of such
> > > > > > > > buffering.
> > > > > > > > 
> > > > > > > 

Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-06-27 Thread Peter Xu
On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
> > Or graphically:
> > 
> > 1) client fills the active slot with data. Channels point to nothing
> >at this point:
> >   [a]  <-- active slot
> >   [][][][] <-- free slots, one per-channel
> > 
> >   [][][][] <-- channels' p->data pointers
> > 
> > 2) multifd_send() swaps the pointers inside the client slot. Channels
> >still point to nothing:
> >   []
> >   [a][][][]
> > 
> >   [][][][]
> > 
> > 3) multifd_send() finds an idle channel and updates its pointer:
> 
> It seems the action "finds an idle channel" is in step 2 rather than step 3,
> which means the free slot is selected based on the id of the channel found, 
> am I
> understanding correctly?

I think you're right.

Actually I also feel like the description here is ambiguous, even though I
think I get what Fabiano wanted to say.

The free slot should be the first step of step 2+3, here what Fabiano
really wanted to suggest is we move the free buffer array from multifd
channels into the callers, then the caller can pass in whatever data to
send.

So I think maybe it's cleaner to write it as this in code (note: I didn't
really change the code, just some ordering and comments):

===8<===
@@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
      */
     active_slot = slots->active;
     slots->active = slots->free[p->id];
-    p->data = active_slot;
-
-    /*
-     * By the next time we arrive here, the channel will certainly
-     * have consumed the active slot. Put it back on the free list
-     * now.
-     */
     slots->free[p->id] = active_slot;
 
+    /* Assign the current active slot to the chosen thread */
+    p->data = active_slot;
===8<===

The comment I removed is slightly misleading to me too, because right now
active_slot contains data that hasn't yet been delivered to multifd, so
we're "putting it back on the free list" not because it's free, but because
we know it won't get used until the multifd send thread consumes it
(because before that the thread will be busy, and we won't use the buffer
in upcoming send()s if so).
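The two pointer swaps can be modeled standalone like this (illustrative
names only; `Slots` here is a toy stand-in, not the real multifd
structures):

```c
#include <assert.h>

enum { NCHANNELS = 4 };

typedef struct Slot { int dummy; } Slot;

typedef struct Slots {
    Slot *active;            /* filled by the client */
    Slot *free[NCHANNELS];   /* one spare slot per channel */
} Slots;

/* Model of the swap in multifd_send(): channel `id` takes the current
 * active slot, and its spare becomes the next active slot. */
static Slot *multifd_send_swap(Slots *slots, int id)
{
    Slot *active_slot = slots->active;

    slots->active = slots->free[id];
    slots->free[id] = active_slot;

    /* Assign the current active slot to the chosen thread */
    return active_slot; /* becomes p->data */
}
```

After the swap, the old active slot is "parked" as the channel's spare
exactly because that channel is the one consuming it next.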

And then when I'm looking at this again, I think maybe it's a slight
overkill, and maybe we can still keep the "opaque data" managed by multifd.
One reason might be that I don't expect the "opaque data" payload to keep
growing at all: it should really be either RAM or device state as I
commented elsewhere in a relevant thread, after all it's a thread model
only for migration purposes to move vmstates..

Keeping it managed by the multifd thread should involve less change than
this series, but it could look like this:

typedef enum {
    MULTIFD_PAYLOAD_RAM = 0,
    MULTIFD_PAYLOAD_DEVICE_STATE = 1,
} MultifdPayloadType;

typedef union {
    MultiFDPages_t ram_payload;
    MultifdDeviceState_t device_payload;
} MultifdPayload;

struct MultiFDSendData {
    MultifdPayloadType type;
    MultifdPayload data;
};

Then the union makes sure the payload consumes only the max of both
types; a side benefit to save some memory.
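As a compilable sketch (with made-up payload layouts standing in for
MultiFDPages_t and MultifdDeviceState_t; the real layouts live in the
migration code), the union indeed costs only the larger of the two members:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins only, not the real migration structures. */
typedef struct { uint32_t num; uint64_t offset[512]; } RamPayload;
typedef struct { char idstr[256]; uint32_t instance_id; } DevicePayload;

typedef enum {
    MULTIFD_PAYLOAD_RAM = 0,
    MULTIFD_PAYLOAD_DEVICE_STATE = 1,
} MultifdPayloadType;

typedef union {
    RamPayload ram;
    DevicePayload device;
} MultifdPayload;

typedef struct {
    MultifdPayloadType type;  /* discriminates which union member is live */
    MultifdPayload data;
} MultiFDSendData;
```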

I think we need to make sure MultifdDeviceState_t is generic enough so that
it will work for mostly everything (especially normal VMSDs).  In this case
the VFIO series should be good as that was currently defined as:

typedef struct {
    MultiFDPacketHdr_t hdr;

    char idstr[256] QEMU_NONSTRING;
    uint32_t instance_id;

    /* size of the next packet that contains the actual data */
    uint32_t next_packet_size;
} __attribute__((packed)) MultiFDPacketDeviceState_t;

IIUC that was what we need exactly with idstr+instance_id, so as to nail
down exactly where the "opaque device state" should go, then load it with
a buffer-based loader when it's ready (starting from VFIO, to get rid of
qemufile).  For VMSDs in the future, if ever possible, that should be a
modified version of vmstate_load() where it may take buffers, not qemufiles.

To Maciej: please see whether above makes sense to you, and if you also
agree please consider that with your VFIO work.

Thanks,

> 
> >   []
> >   [a][][][]
> > 
> >   [a][][][]
> >   ^idle

-- 
Peter Xu




Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

2024-06-26 Thread Peter Xu
 the issue on inaccurate bw estimations, and that was why
I introduced the switchover-bandwidth parameter; I wished that after that
the result could be closer to downtime_limit, but we never tried to test
again.  I am not sure either whether that's the best way to address this.

But let's just ignore the iterable save() huge delays (which can be
explained, and hopefully will still be covered by the downtime_limit
calculations when they can get closer to right), and we can also see
at least a few things we didn't account for:

  - stop vm: 268ms
  - non-iterables: 270ms
  - dest load until complete: 144ms

For the last one, we did see another outlier where it can only be seen from
dest:

Non-iterable device analysis:

  Device LOAD of  kvm-tpr-opt:  0 took 123976 (us)  <- this one
  Device LOAD of  :00:02.0/pcie-root-port:  0 took   6362 (us)
  Device LOAD of :00:02.0:00.0/virtio-net:  0 took   4583 (us)
  Device LOAD of :00:02.0:00.1/virtio-net:  0 took   4440 (us)
  Device LOAD of :00:01.0/vga:  0 took   3740 (us)
  Device LOAD of :00:00.0/mch:  0 took   3557 (us)
  Device LOAD of :00:02.2:00.0/virtio-blk:  0 took   3530 (us)
  Device LOAD of   :00:02.1:00.0/xhci:  0 took   2712 (us)
  Device LOAD of  :00:02.1/pcie-root-port:  0 took   2046 (us)
  Device LOAD of  :00:02.2/pcie-root-port:  0 took   1890 (us)

So we found either cpu save() taking 100+ms, or kvm-tpr-opt load() taking
100+ms.  None of them sounds normal, and I didn't look into them.

Now, with a global ratio that starts to reflect "how much of the
downtime_limit we should account into data transfer", we'll also need to
answer how the user should set that ratio value, and maybe there's a sane
way to calculate it from the VM setup?  I'm not sure, but those questions
may need to be answered together in the next post, so that such a parameter
can be consumable.

The answer doesn't need to be accurate, but I hope it can be based on some
similar analysis like the above (where I didn't do it well, as I don't
think I looked into any of the issues, and maybe they're fixable), just to
show what I meant.  It'll also be great if, while doing the analysis, we
find the issues are fixable; then it would be great to fix the issues
instead.  That's the part where I mentioned "I still prefer fixing
downtime_limit itself".

Thanks,

-- 
Peter Xu




Re: [PATCH v1 00/13] Multifd  device state transfer support with VFIO consumer

2024-06-26 Thread Peter Xu
On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:
> On 26.06.2024 03:51, Peter Xu wrote:
> > On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
> > > On 25.06.2024 19:25, Peter Xu wrote:
> > > > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > Hi Peter,
> > > > 
> > > > Hi, Maciej,
> > > > 
> > > > > 
> > > > > On 23.06.2024 22:27, Peter Xu wrote:
> > > > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" 
> > > > > > > 
> > > > > > > This is an updated v1 patch series of the RFC (v0) series located 
> > > > > > > here:
> > > > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/
> > > > > > 
> > > > > > OK I took some hours thinking about this today, and here's some 
> > > > > > high level
> > > > > > comments for this series.  I'll start with which are more relevant 
> > > > > > to what
> > > > > > Fabiano has already suggested in the other thread, then I'll add 
> > > > > > some more.
> > > > > > 
> > > > > > https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de
> > > > > 
> > > > > That's a long list, thanks for these comments.
> > > > > 
> > > > > I have responded to them inline below.
> > > > > 
> > > (..)
> > > > > 
> > > > > > 3. load_state_buffer() and VFIODeviceStatePacket protocol
> > > > > > =
> > > > > > 
> > > > > > VFIODeviceStatePacket is the new protocol you introduced into 
> > > > > > multifd
> > > > > > packets, along with the new load_state_buffer() hook for loading 
> > > > > > such
> > > > > > buffers.  My question is whether it's needed at all, or.. whether 
> > > > > > it can be
> > > > > > more generic (and also easier) to just allow taking any device 
> > > > > > state in the
> > > > > > multifd packets, then load it with vmstate load().
> > > > > > 
> > > > > > I mean, the vmstate_load() should really have worked on these 
> > > > > > buffers, if
> > > > > > after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as 
> > > > > > the
> > > > > > first flag (uint64), size as the 2nd, then (2) load that rest 
> > > > > > buffer into
> > > > > > VFIO kernel driver.  That is the same to happen during the blackout 
> > > > > > window.
> > > > > > It's not clear to me why load_state_buffer() is needed.
> > > > > > 
> > > > > > I also see that you're also using exactly the same chunk size for 
> > > > > > such
> > > > > > buffering (VFIOMigration.data_buffer_size).
> > > > > > 
> > > > > > I think you have a "reason": VFIODeviceStatePacket and loading of 
> > > > > > the
> > > > > > buffer data resolved one major issue that wasn't there before but 
> > > > > > start to
> > > > > > have now: multifd allows concurrent arrivals of vfio buffers, even 
> > > > > > if the
> > > > > > buffer *must* be sequentially loaded.
> > > > > > 
> > > > > > That's a major pain for current VFIO kernel ioctl design, IMHO.  I 
> > > > > > think I
> > > > > > used to ask nVidia people on whether the VFIO get_state/set_state 
> > > > > > interface
> > > > > > can allow indexing or tagging of buffers but I never get a real 
> > > > > > response.
> > > > > > IMHO that'll be extremely helpful for migration purpose on 
> > > > > > concurrency if
> > > > > > it can happen, rather than using a serialized buffer.  It means
> > > > > > concurrently save/load one VFIO device could be extremely hard, if 
> > > > > > not
> > > > > > impossible.
> > > > > 
> > > > > I am pretty sure that the current kernel VFIO inte

Re: [PATCH 07/14] migration: Free removed SaveStateEntry

2024-06-26 Thread Peter Xu
On Wed, Jun 26, 2024 at 08:06:30PM +0900, Akihiko Odaki wrote:
> This suppresses LeakSanitizer warnings.
> 
> Signed-off-by: Akihiko Odaki 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v1 00/13] Multifd  device state transfer support with VFIO consumer

2024-06-25 Thread Peter Xu
On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:
> On 25.06.2024 19:25, Peter Xu wrote:
> > On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> > > Hi Peter,
> > 
> > Hi, Maciej,
> > 
> > > 
> > > On 23.06.2024 22:27, Peter Xu wrote:
> > > > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" 
> > > > > 
> > > > > This is an updated v1 patch series of the RFC (v0) series located 
> > > > > here:
> > > > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/
> > > > 
> > > > OK I took some hours thinking about this today, and here's some high 
> > > > level
> > > > comments for this series.  I'll start with which are more relevant to 
> > > > what
> > > > Fabiano has already suggested in the other thread, then I'll add some 
> > > > more.
> > > > 
> > > > https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de
> > > 
> > > That's a long list, thanks for these comments.
> > > 
> > > I have responded to them inline below.
> > > 
> (..)
> > > 
> > > > 3. load_state_buffer() and VFIODeviceStatePacket protocol
> > > > =
> > > > 
> > > > VFIODeviceStatePacket is the new protocol you introduced into multifd
> > > > packets, along with the new load_state_buffer() hook for loading such
> > > > buffers.  My question is whether it's needed at all, or.. whether it 
> > > > can be
> > > > more generic (and also easier) to just allow taking any device state in 
> > > > the
> > > > multifd packets, then load it with vmstate load().
> > > > 
> > > > I mean, the vmstate_load() should really have worked on these buffers, 
> > > > if
> > > > after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
> > > > first flag (uint64), size as the 2nd, then (2) load that rest buffer 
> > > > into
> > > > VFIO kernel driver.  That is the same to happen during the blackout 
> > > > window.
> > > > It's not clear to me why load_state_buffer() is needed.
> > > > 
> > > > I also see that you're also using exactly the same chunk size for such
> > > > buffering (VFIOMigration.data_buffer_size).
> > > > 
> > > > I think you have a "reason": VFIODeviceStatePacket and loading of the
> > > > buffer data resolved one major issue that wasn't there before but start 
> > > > to
> > > > have now: multifd allows concurrent arrivals of vfio buffers, even if 
> > > > the
> > > > buffer *must* be sequentially loaded.
> > > > 
> > > > That's a major pain for current VFIO kernel ioctl design, IMHO.  I 
> > > > think I
> > > > used to ask nVidia people on whether the VFIO get_state/set_state 
> > > > interface
> > > > can allow indexing or tagging of buffers but I never get a real 
> > > > response.
> > > > IMHO that'll be extremely helpful for migration purpose on concurrency 
> > > > if
> > > > it can happen, rather than using a serialized buffer.  It means
> > > > concurrently save/load one VFIO device could be extremely hard, if not
> > > > impossible.
> > > 
> > > I am pretty sure that the current kernel VFIO interface requires for the
> > > buffers to be loaded in-order - accidentally providing the out of order
> > > definitely breaks the restore procedure.
> > 
> > Ah, I didn't mean that we need to do it with the current API.  I'm talking
> > about whether it's possible to have a v2 that will support those otherwise
> > we'll need to do "workarounds" like what you're doing with "unlimited
> > buffer these on dest, until we receive continuous chunk of data" tricks.
> 
> Better kernel API might be possible in the long term but for now we have
> to live with what we have right now.
> 
> After all, adding true unordered loading - I mean not just moving the
> reassembly process from QEMU to the kernel but making the device itself
> accept buffers out of order - will likely be pretty complex (requiring
> adding such functionality to the device firmware, etc).

I would expect the device will need to be able to provision the device
states

Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

2024-06-25 Thread Peter Xu
On Tue, Jun 25, 2024 at 05:31:19PM +0100, Joao Martins wrote:
> The device-state multifd scaling is a take on improving switchover phase,
> and we will keep improving it whenever we find things... but the

That'll be helpful, thanks.  Just a quick note that "reducing downtime" is
a separate issue compared to "making downtime_limit accurate".

> switchover itself can't be 'precomputed' into a downtime number equation
> ahead of time to encompass all possible latencies/costs. Part of the
> reason that at least we couldn't think of a way besides this proposal
> here, which at the core it's meant to bounds check switchover. Even
> without taking into account VFs/HW[0], it is simply not considered how
> long it might take and giving some sort of downtime buffer coupled with
> enforcement that can be enforced helps not violating migration SLAs.

I agree such enforcement alone can be useful in general, to be able to fall
back.  That said, I think it would definitely be nice to attach more
information on the downtime analysis when reposting this series, if there
is any.

For example, regardless of whether QEMU can do proper predictions at all,
there can be data / results to show what the major parts missing from the
current calculations are, aka an expectation of when the fallback can
trigger, and some justification on why they can't be predicted.

IMHO the enforcement won't make much sense if it keeps triggering; in that
case people will simply not use it, as it stops migrations from happening.
Ultimately the work will still be needed to make downtime_limit accurate.
The fallback should only be a last fence to guard the promise, which should
cover the "corner cases".

-- 
Peter Xu




Re: [PATCH 1/2] migration: Implement dirty ring

2024-06-25 Thread Peter Xu
On Tue, Jun 25, 2024 at 08:10:23PM +0900, Shota Imamura wrote:
> Dear Peter Xu,
> 
> Thank you for your feedback.
> 
> > It looks like this patch will introduce a ring but still it keeps the
> > bitmaps around.
> >
> > Could you elaborate your motivation of this work? It’ll be interesting to
> > know whether you did any kind of measurement around it.
> 
> First of all, I apologize for the lack of explanation.
> To provide more details, the motivation for this work stems from the
> current QEMU implementation, where pages obtained from the KVM ring are set
> into the KVMSlot/RAMList/RAMBlock bitmaps. Consequently, when the migration
> thread sends data, it ends up scanning the bitmap (resulting in O(N) time
> complexity). I aimed to improve this situation.

So is this a research project?  Or do you have explicit goals, e.g. on
reducing migration time, or making migration easier to converge?

This information might be helpful for reviewers to understand the ultimate
goal of this series.

> 
> Here are the steps and considerations in my implementation plan:
> 
> 1. Call MigrationOps::ram_save_target_page inside kvm_dirty_ring_reap_one.
> 
> The approach involves QEMU holding neither bitmaps nor rings and sending
> immediately. However, this would require non-migration threads (like accel
> threads) to send pages, necessitating a synchronization mechanism with the
> migration thread, which I deemed would increase code complexity.
> Additionally, if future non-KVM accels provided their rings, we would have
> to write similar code in different places, increasing future maintenance
> costs. For these reasons, I decided against an implementation where page
> sending occurs within accel/kvm and opted for a separate ring within QEMU.

Yes this won't trivially work cleanly, as ram_save_target_page() requires
migration context.  It may not be thread-safe either.  E.g. qemufile isn't
thread-safe.

"Maybe" it'll be easier to do it the other way round: allowing migration to
fetch entries from a ring by calling a registered ring op.
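That inversion could look roughly like this.  Everything below is
hypothetical (none of these names are existing QEMU APIs): the accel
registers a fetch callback, and the migration thread drains dirty entries
from its own context, where calling into ram_save_target_page() would be
safe:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: lets migration pull dirty entries from an
 * accel-owned ring instead of the accel pushing pages out itself. */
typedef struct DirtyRingOps {
    bool (*fetch_one)(void *opaque, uint64_t *gfn);
    void *opaque;
} DirtyRingOps;

static DirtyRingOps ring_ops;

static void migration_register_ring_ops(const DirtyRingOps *ops)
{
    ring_ops = *ops;
}

/* Called from the migration thread: drain up to `budget` entries and
 * return how many were fetched.  The per-page send would happen here,
 * safely within migration context. */
static size_t migration_drain_ring(uint64_t *out, size_t budget)
{
    size_t n = 0;

    while (n < budget && ring_ops.fetch_one(ring_ops.opaque, &out[n])) {
        n++;
    }
    return n;
}

/* Toy backing ring standing in for a KVM-style dirty ring. */
typedef struct { uint64_t entries[8]; size_t head, tail; } ToyRing;

static bool toy_fetch_one(void *opaque, uint64_t *gfn)
{
    ToyRing *r = opaque;

    if (r->head == r->tail) {
        return false;           /* ring empty */
    }
    *gfn = r->entries[r->head++ % 8];
    return true;
}
```

The key property is that the consumer side runs entirely in the migration
thread, so qemufile never needs to be touched from accel threads.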

> 
> 2. Use a ring as an alternative to bitmaps.
> 
> The approach involves implementing a ring within QEMU and inserting pages
> into the ring instead of setting them into bitmaps in functions like
> kvm_dirty_ring_mark_page, cpu_physical_memory_set_dirty_lebitmap, and
> cpu_physical_memory_set_dirty_range. Then, pages are retrieved from the
> ring in ram_find_and_save_block.

(I think this might be what I mentioned above)

> However, this approach necessitates immediate sending of pages when the
> ring is full, which might involve non-migration threads sending pages,
> leading to the same issues as mentioned in step 1. Given the ring has a
> limited capacity, if there are enough dirty pages to fill the ring, the
> cost difference between operating the ring and scanning the entire bitmap
> would be negligible.  Hence, I decided to fall back to bitmaps when the
> ring is full.

Right, that was also what I found - the ring may not work as well when the
guest is very busy.  The catch is that migration is normally more
challenging exactly in that case.. while a pretty idle guest is easier to
migrate anyway, even with bitmap-only.  I think I noticed that pretty late.

> 
> 3. Use bitmaps when the ring is full.
> 
> The approach involves setting bitmaps while simultaneously inserting pages
> into the ring in functions like kvm_dirty_ring_mark_page,
> cpu_physical_memory_set_dirty_lebitmap, and
> cpu_physical_memory_set_dirty_range. If the ring is not full, it is used in
> ram_find_and_save_block; otherwise, bitmaps are used. This way, only the
> migration thread sends pages. Additionally, by checking if a page is
> already in the ring (O(1) complexity), redundant entries are avoided.
> However, enqueuing and dequeuing are handled by different threads, which
> could result in a situation where pages exist in the bitmap but not in the
> ring once it is full. Identifying these pages would require locking and
> scanning the entire bitmap.
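For reference, the step-3 scheme described above can be sketched as a
single-threaded toy like this (names and sizes are illustrative, not real
migration code); in_ring[] is what provides the O(1) duplicate check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8
#define NPAGES    64

static uint64_t ring[RING_SIZE];
static unsigned ring_head, ring_tail;   /* invariant: tail - head <= RING_SIZE */
static bool in_ring[NPAGES];            /* O(1) "already queued?" check */
static bool bitmap[NPAGES];             /* bitmap is always maintained */
static bool ring_overflowed;

/* Producer side (e.g. where kvm_dirty_ring_mark_page would sit). */
void mark_page_dirty(uint64_t page)
{
    bitmap[page] = true;
    if (in_ring[page]) {
        return;                         /* avoid a redundant ring entry */
    }
    if (ring_tail - ring_head == RING_SIZE) {
        ring_overflowed = true;         /* consumer falls back to the bitmap */
        return;
    }
    ring[ring_tail++ % RING_SIZE] = page;
    in_ring[page] = true;
}

/* Consumer side (e.g. ram_find_and_save_block): -1 means "use the bitmap". */
int64_t take_dirty_page(void)
{
    if (ring_overflowed || ring_head == ring_tail) {
        return -1;
    }
    uint64_t page = ring[ring_head++ % RING_SIZE];
    in_ring[page] = false;
    bitmap[page] = false;
    return (int64_t)page;
}
```

The toy also shows the problem raised above: once ring_overflowed is set,
pages can exist in the bitmap without being in the ring, and only a full
bitmap scan finds them.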

Yes, is this series using this approach?

I think the ring effect may be minimal if there is the bitmap already,
because as long as the ring can fallback (to avoid hanging a vcpu for a
long time) it means the bitmap can contain useful data and the bitmap scans
will be needed, then it kind of invalidates the ring's benefit, IMHO.

What I was thinking a long time ago was something like this: we only use
ring, as you said in (2) above, so no bitmap.  Then the migration core can
look like this:

  migration_thread():
count=0
while () {
count++
if ((count % ENFORCE_BACKGROUND_COUNT == 0) ||
ring_empty())
send_page(background.pop())
else
send_page(ring.pop())
}

To explain it a bit:

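The loop above can be turned into a small compilable toy to show the effect
of ENFORCE_BACKGROUND_COUNT: every Nth iteration (or whenever the ring is
empty) a page comes from the background bitmap-scan source, otherwise from
the ring.  The Source type is a stand-in, not real migration code.

```c
#include <assert.h>
#include <stdbool.h>

enum { ENFORCE_BACKGROUND_COUNT = 4 };

/* Stand-in for a page source: just counts how many pages it still holds. */
typedef struct Source {
    int remaining;
} Source;

static bool source_empty(Source *s)
{
    return s->remaining == 0;
}

static void source_pop(Source *s)
{
    if (s->remaining > 0) {
        s->remaining--;
    }
}

/* Returns how many of the 'iterations' sends came from the background. */
int run_scheduler(Source *ring, Source *background, int iterations)
{
    int from_background = 0;
    for (int count = 1; count <= iterations; count++) {
        if (count % ENFORCE_BACKGROUND_COUNT == 0 || source_empty(ring)) {
            source_pop(background);
            from_background++;
        } else {
            source_pop(ring);
        }
    }
    return from_background;
}
```

With a busy ring, 1/ENFORCE_BACKGROUND_COUNT of the bandwidth is reserved
for background pages, which keeps cold pages from starving forever.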

Re: [PATCH v1 00/13] Multifd  device state transfer support with VFIO consumer

2024-06-25 Thread Peter Xu
On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:
> Hi Peter,

Hi, Maciej,

> 
> On 23.06.2024 22:27, Peter Xu wrote:
> > On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" 
> > > 
> > > This is an updated v1 patch series of the RFC (v0) series located here:
> > > https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/
> > 
> > OK I took some hours thinking about this today, and here's some high level
> > comments for this series.  I'll start with which are more relevant to what
> > Fabiano has already suggested in the other thread, then I'll add some more.
> > 
> > https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de
> 
> That's a long list, thanks for these comments.
> 
> I have responded to them inline below.
> 
> > 1. Multifd device state support
> > ===
> > 
> > As Fabiano suggested in his RFC post, we may need one more layer of
> > abstraction to represent VFIO's demand on allowing multifd to send
> > arbitrary buffer to the wire.  This can be more than "how to pass the
> > device state buffer to the sender threads".
> > 
> > So far, MultiFDMethods is only about RAM.  If you pull the latest master
> > branch Fabiano just merged yet two more RAM compressors that are extended
> > on top of MultiFDMethods model.  However still they're all about RAM.  I
> > think it's better to keep it this way, so maybe MultiFDMethods should some
> > day be called MultiFDRamMethods.
> > 
> > multifd_send_fill_packet() may only be suitable for RAM buffers, not adhoc
> > buffers like what VFIO is using. multifd_send_zero_page_detect() may not be
> > needed either for arbitrary buffers.  Most of those are still page-based.
> > 
> > I think it also means we shouldn't call ->send_prepare() when multifd send
> > thread notices that it's going to send a VFIO buffer.  So it should look
> > like this:
> > 
> >int type = multifd_payload_type(p->data);
> >if (type == MULTIFD_PAYLOAD_RAM) {
> >multifd_send_state->ops->send_prepare(p, &local_err);
> >} else {
> >// VFIO buffers should belong here
> >assert(type == MULTIFD_PAYLOAD_DEVICE_STATE);
> >...
> >}
> > 
> > It also means it shouldn't contain code like:
> > 
> > nocomp_send_prepare():
> >  if (p->is_device_state_job) {
> >  return nocomp_send_prepare_device_state(p, errp);
> >  } else {
> >  return nocomp_send_prepare_ram(p, errp);
> >  }
> > 
> > nocomp should only exist in RAM world, not VFIO's.
> > 
> > And it looks like you agree with Fabiano's RFC proposal, please work on top
> > of that to provide that layer.  Please make sure it outputs the minimum in
> > "$ git grep device_state migration/multifd.c" when you work on the new
> > version.  Currently:
> > 
> > $ git grep device_state migration/multifd.c | wc -l
> > 59
> > 
> > The hope is zero, or at least a minimum with good reasons.
> 
> I guess you mean "grep -i" in the above example, since otherwise
> the above command will find only lowercase "device_state".
> 
> On the other hand, your example code above has uppercase
> "DEVICE_STATE", suggesting that it might be okay?

Yes that's definitely ok.  I could be over-cautious when I used the grep
example, but I hope you get my point where we should remove that
device_state pointer, rather than as generic as supporting multifd to take
device state buffers.

Especially, if that idea will also apply on top of non-VFIO devices,
e.g. when we extend that to a normal VMSD buffer to be delivered too using
that same mechanism, then I think it's even okay to have some
"device_state" there.  Personally I would still think those as justified
usages that are generic enough; after all I don't expect multifd to send
other things besides RAM and generic terms of device states.

> 
> Overall, using Fabiano's patch set as a base for mine makes sense to me.
> 
> > 2. Frequent mallocs/frees
> > =
> > 
> > Fabiano's series can also help to address some of these, but it looks like
> > this series used malloc/free more than the opaque data buffer.  This is not
> > required to get things merged, but it'll be nice to avoid those if possible.
> 
> Ack - as long as its not making the code messy/fragile, of course.
> 
> > 3. load_state_buffer() and VFIODeviceStatePacket protocol
> > 

Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

2024-06-25 Thread Peter Xu
On Tue, Jun 25, 2024 at 12:38:50PM +0100, Joao Martins wrote:
> On 24/06/2024 20:41, Peter Xu wrote:
> > On Fri, Jun 21, 2024 at 07:32:21AM -0700, Elena Ufimtseva wrote:
> >> @@ -2659,6 +2698,18 @@ qemu_loadvm_section_start_full(QEMUFile *f, 
> >> MigrationIncomingState *mis,
>  >>  if (!check_section_footer(f, se)) {
> >>  return -EINVAL;
> >> @@ -2714,6 +2765,19 @@ qemu_loadvm_section_part_end(QEMUFile *f, 
> >> MigrationIncomingState *mis,
> >>  se->instance_id, end_ts - start_ts);
> >>  }
> >>  
> >> +if (migrate_switchover_abort() && type == QEMU_VM_SECTION_END &&
> >> +mis->downtime_start) {
> >> +mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> >> +uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
> >> +if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
> >> +error_report("Shutdown destination migration, migration 
> >> abort_limit (%lu ms)"
> >> +  "was reached.", mis->abort_limit);
> >> +trace_qemu_loadvm_downtime_abort(mis->abort_limit, 
> >> dst_downtime,
> >> + mis->src_downtime);
> >> +return -EINVAL;
> >> +}
> >> +}
> > 
> > So this traps both iteration and non-iteration phase.  What if the downtime
> > was caused by after these, or unluckily at the last non-iterator device
> > state?
> > 
> > After trying to think about it, I figured this is not easy to do right.
> > Also, I start to doubt whether it's definitely a good idea on having this
> > in the first place.
> > 
> > Firstly, I'm wondering how we should treat this new feature
> > v.s. downtime_limit.  It's about how the user should set both.
> > 
> > If this is about "cancel migration if downtime more than xxx",
> > then.. AFAICT that's exactly what downtime_limit was "designed" to be..
> > It's just that downtime_limit says the other way round, as: "don't
> > switchover if the downtime will be more than xxx".
> > 
> > Then, if the user has option to set both these parameters, what would be
> > the right thing to do?  Shouldn't they simply always set both parameters to
> > be the same value already?  But if so, what's the point on having two?
> > 
> > This caused me to think whether the 2nd parameter is needed at all, instead
> > whether we should simply make downtime_limit more accurate, so that it will
> > start to account more things than before.  It won't be accurate, but the
> > same case here: making this new feature "accurate" can also be very hard.
> > 
> 
> The way I think about it is that current downtime-limit captures nicely the 
> data
> part as the only calculations it cares about is how much outstanding data it
> sends to the destination (be it VF device state or RAM). This second parameter
> captures what is *not* related to data, i.e. costs of hypervisor quiescing the
> VM or added latencies in hypervisor *in addition* to sending outstanding data 
> to
> destination.
> 
> If we were to merge this all into a single parameter (aka downtime-limit) we 
> are
> possibility artificially increasing the downtime thanks to relaxing the
> oustanding data part, as opposed to trying to capture the switchover cost 
> (hence
> the name the parameter) that can't be reflected in the former equation.
> 
> So I feel like this needs two parameters whereby this second new parameter
> covers 'switchover cost' (hence the name) with current migration algorithm.

Then the question is how should we suggest the user to specify these two
parameters.

The cover letter used:

  migrate_set_parameter downtime-limit 300
  migrate_set_parameter switchover-limit 10

But it does look like a testing sample only and not valid in a real-world
setup, as logically the new limit should be larger than the old one,
afaict.  If the new limit is larger, how much larger should it be?

So far I can understand how it works internally, but even with that I don't
know how I should set this parameter, e.g., when downtime_limit used to be
300ms, I'd say 500ms could be a suitable value, but is it?  In that case,
perhaps the 500ms is the "real" limit that I don't want a VM to be halted
for more than 500ms, but in that case the user should have already setup
downtime_limit to 500ms.

I also don't know how a user should understand all these details and
figure out how to set these.  Note that curr

Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

2024-06-24 Thread Peter Xu
+static int loadvm_handle_src_downtime(MigrationIncomingState *mis,
> +  uint16_t len)
> +{
> +uint64_t src_abort_limit = qemu_get_be64(mis->from_src_file);
> +uint64_t src_current_downtime = qemu_get_be64(mis->from_src_file);
> +
> +mis->abort_limit = src_abort_limit;
> +mis->src_downtime = src_current_downtime;
> +mis->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

I guess this didn't count in the delay of sending this packet.  Since the
whole point of this feature will be "making sure downtime is less than xxx
or cancel migration", I think this might cause the real downtime still more
than expected.  Not sure how big a concern this is.

E.g., have you measured how a packet can be delayed when the socket
pipeline is mostly full, then queue this SRC_DOWNTIME message after all
that?  I think that's very possible the case here at switchover where src
is trying to dump as fast as possible.  I'm not sure whether it's easy to
measure, either..

> +
> +trace_loadvm_handle_src_downtime(src_abort_limit, src_current_downtime);
> +return 0;
> +}
> +
>  /* After postcopy we will be told to throw some pages away since they're
>   * dirty and will have to be demand fetched.  Must happen before CPU is
>   * started.
> @@ -2540,6 +2576,9 @@ static int loadvm_process_command(QEMUFile *f)
>  
>  case MIG_CMD_ENABLE_COLO:
>  return loadvm_process_enable_colo(mis);
> +
> +case MIG_CMD_SEND_SRC_DOWNTIME:
> +return loadvm_handle_src_downtime(mis, len);
>  }
>  
>  return 0;
> @@ -2659,6 +2698,18 @@ qemu_loadvm_section_start_full(QEMUFile *f, 
> MigrationIncomingState *mis,
>  trace_vmstate_downtime_load("non-iterable", se->idstr,
>  se->instance_id, end_ts - start_ts);
>  }
> +if (migrate_switchover_abort() && type == QEMU_VM_SECTION_FULL &&
> +mis->downtime_start) {
> +mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
> +if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
> +error_report("Shutdown destination migration, migration 
> abort_limit"
> + "(%lu ms) was reached.", mis->abort_limit);
> +trace_qemu_loadvm_downtime_abort(mis->abort_limit, dst_downtime,
> + mis->src_downtime);
> +return -EINVAL;
> +}
> +}
>  
>  if (!check_section_footer(f, se)) {
>  return -EINVAL;
> @@ -2714,6 +2765,19 @@ qemu_loadvm_section_part_end(QEMUFile *f, 
> MigrationIncomingState *mis,
>  se->instance_id, end_ts - start_ts);
>  }
>  
> +if (migrate_switchover_abort() && type == QEMU_VM_SECTION_END &&
> +mis->downtime_start) {
> +mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
> +if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
> +error_report("Shutdown destination migration, migration 
> abort_limit (%lu ms)"
> +  "was reached.", mis->abort_limit);
> +trace_qemu_loadvm_downtime_abort(mis->abort_limit, dst_downtime,
> + mis->src_downtime);
> +return -EINVAL;
> +}
> +}

So this traps both iteration and non-iteration phase.  What if the downtime
was caused by after these, or unluckily at the last non-iterator device
state?

After trying to think about it, I figured this is not easy to do right.
Also, I start to doubt whether it's definitely a good idea on having this
in the first place.

Firstly, I'm wondering how we should treat this new feature
v.s. downtime_limit.  It's about how the user should set both.

If this is about "cancel migration if downtime more than xxx",
then.. AFAICT that's exactly what downtime_limit was "designed" to be..
It's just that downtime_limit says the other way round, as: "don't
switchover if the downtime will be more than xxx".

Then, if the user has option to set both these parameters, what would be
the right thing to do?  Shouldn't they simply always set both parameters to
be the same value already?  But if so, what's the point on having two?

This caused me to think whether the 2nd parameter is needed at all, instead
whether we should simply make downtime_limit more accurate, so that it will
start to account more things than before.  It won't be accurate, but the
same case here: making this new feature "accurate" can also be very hard.

Thanks,

-- 
Peter Xu




Re: [PATCH RFC 1/2] migration: abort when switchover limit exceeded

2024-06-24 Thread Peter Xu
On Fri, Jun 21, 2024 at 07:32:20AM -0700, Elena Ufimtseva wrote:
> @@ -16,6 +16,7 @@ bool migrate_background_snapshot(void);
>  bool migrate_dirty_limit(void);
>  bool migrate_postcopy_ram(void);
>  bool migrate_switchover_ack(void);
> +bool migrate_switchover_abort(void);

This does look like an interesting feature.  The challenge might reside in
the details.  For the interface part, one parameter should be enough,
since setting it to 0 can already mean disabled.

Thanks,

-- 
Peter Xu




Re: [PATCH 1/2] migration: Implement dirty ring

2024-06-24 Thread Peter Xu
Hi, Shota,

On Thu, Jun 20, 2024 at 06:47:13PM +0900, Shota Imamura wrote:
> This commit implements the dirty ring as an alternative dirty tracking
> method to the dirty bitmap.
> 
> While the dirty ring has already been implemented in accel/kvm using KVM's
> dirty ring, it was designed to set bits in the ramlist and ramblock bitmap.
> This commit introduces a new dirty ring to replace the bitmap, allowing the
> use of the dirty ring even without KVM. When using KVM's dirty ring, this
> implementation maximizes its effectiveness.

It looks like this patch will introduce a ring but still it keeps the
bitmaps around.

Could you elaborate your motivation of this work?  It'll be interesting to
know whether you did any kind of measurement around it.

> 
> To enable the dirty ring, specify the startup option
> "-migration dirty-logging=ring,dirty-ring-size=N". To use the bitmap,
> either specify nothing or "-migration dirty-logging=bitmap". If the dirty
> ring becomes full, it falls back to the bitmap for that round.

I remember adding such option is not suggested.  We may consider using
either QMP to set up a migration parameter, or something else.

Thanks,

-- 
Peter Xu




Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots

2024-06-23 Thread Peter Xu
On Sun, Jun 23, 2024 at 10:25:05PM +0200, Maciej S. Szmigiero wrote:
> > I appreciated a lot you worked out VFIO on top of multifd, because IMHO
> > that's really the right direction.  However even with that, I don't think
> > the whole design is yet fully settled, not to mention the details. And that
> > implies it may miss 9.1.
> 
> I appreciate your work and review Peter - it would be nice if we could
> at least make some progress on this subject for 9.1.

Let's try and see; I didn't mean it won't hit 9.1 at all, it just still
feels challenging, but who knows!  I just don't want to give you that
impression and then, right before soft freeze, start to say "it's not ready".

I think one reason it'll be more challenging is that Fabiano will be
away for weeks, and since you seem to agree on his RFC as a start, it
might be a good idea to wait for his return and see how that goes from
there, even if we can reach some consensus before then; it'll just be
challenging either way.

I also left my (slightly lengthy) comment on the high level design of that
series here (I know you'll see that, but just to keep a record of this
discussion and link them together):

https://lore.kernel.org/r/ZniFH14DT6ycjbrL@x1n

Let's discuss there; let me know if I got something wrong, and if
you're targeting landing part of the series, try to prioritize what can
already be landed more easily.

I actually have a wild idea where I don't know whether "switchover phase"
hooks are even needed.  I mean, currently save_live_iterate() allows
"returning 1 means all data ready".  Logically in switchover phase the
migration core could simply call save_live_iterate() until it returns 1.
Then switchover hooks do not need to exist at all.  I didn't check the
details, but if that works that'll simplify all register_savevm_live()
users, and also for VFIO it may mean iteration phase is more important too,
as when it's resolved it naturally works with switchover phase.  You can
ignore this idea because it can be too wild and definitely not helpful on
landing things fast, just in case it may bring some thoughts.
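That wild idea can be modeled in a few lines: the migration core simply
keeps calling save_live_iterate() until it returns 1 ("all data ready"),
so no dedicated switchover hook would be needed.  SaveHandler,
complete_switchover and fake_iterate below are made-up names for
illustration, not QEMU's real register_savevm_live() interface.

```c
#include <assert.h>

typedef struct SaveHandler {
    /* Returns 1 when the device has flushed everything, 0 otherwise. */
    int (*save_live_iterate)(void *opaque);
    void *opaque;
} SaveHandler;

/* Switchover phase: iterate until done, within a call budget. */
int complete_switchover(SaveHandler *h, int max_calls)
{
    for (int i = 0; i < max_calls; i++) {
        if (h->save_live_iterate(h->opaque) == 1) {
            return 0;           /* done, no switchover hook needed */
        }
    }
    return -1;                  /* budget exceeded */
}

/* Fake device: needs to send *chunks_left more chunks before it's done. */
static int fake_iterate(void *opaque)
{
    int *chunks_left = opaque;
    return --*chunks_left <= 0 ? 1 : 0;
}
```

Whether this holds up in practice depends on every save_live_iterate()
implementation honoring the "return 1 means fully drained" contract.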

Thanks,

-- 
Peter Xu




Re: [PATCH v1 00/13] Multifd  device state transfer support with VFIO consumer

2024-06-23 Thread Peter Xu
properly invoke vfio_load_state() with the right buffers.  It'll
just be nice if VFIO can keep its "load state" logic at one place.

One benefit of that is with such a more generic framework, QEMU can easily
extend this infra to other device states, so that logically we can consider
sending non-VFIO device states also in the multifd buffers.  However with
your current solution, new structures are needed, new hooks, a lot of new
code around, yet fewer problems solved..  That's not optimal.

4. Risk of OOM on unlimited VFIO buffering
==

This follows with above bullet, but my pure question to ask here is how
does VFIO guarantees no OOM condition by buffering VFIO state?

I mean, currently your proposal used vfio_load_bufs_thread() as a separate
thread to only load the vfio states until sequential data is received,
however is there an upper limit of how much buffering it could do?  IOW:

vfio_load_state_buffer():

  if (packet->idx >= migration->load_bufs->len) {
  g_array_set_size(migration->load_bufs, packet->idx + 1);
  }

  lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
  ...
  lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
  lb->len = data_size - sizeof(*packet);
  lb->is_present = true;

What if the GArray keeps growing with lb->data allocated, which triggers the
memcg limit of the process (if QEMU is in such process)?  Or just deplete
host memory and causing OOM kill.

I think we may need to find a way to throttle max memory usage of such
buffering.
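One simple shape such a throttle could take: a shared byte budget that the
receiving side must reserve from before buffering an out-of-order chunk,
and release once the chunk is consumed.  BufBudget and the function names
are hypothetical; a real version would block (or apply backpressure on the
channel) instead of just failing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct BufBudget {
    size_t limit;       /* max bytes buffered at once */
    size_t used;        /* bytes currently buffered */
} BufBudget;

/* Try to account 'len' bytes; false means the cap was hit and the
 * caller must wait (or throttle the sender) before buffering more. */
bool buffer_reserve(BufBudget *b, size_t len)
{
    if (b->used + len > b->limit) {
        return false;
    }
    b->used += len;
    return true;
}

/* Called after a buffered chunk was loaded into the device and freed. */
void buffer_release(BufBudget *b, size_t len)
{
    b->used -= len;
}
```

With something like this, the garray in vfio_load_state_buffer() could
never outgrow the configured cap, regardless of how out-of-order the
packets arrive.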

So far this will be more of a problem indeed if this will be done during
VFIO iteration phases, but I still hope a solution can work with both
iteration phase and the switchover phase, even if you only do that in
switchover phase (and I don't know why you don't care about VFIO iteration
phase, if you cared enough on how VFIO works now with migration.. literally
that should help VFIO migrates faster on 25G+ networks, with/without a
shorter blackout window).

5. Worker thread model
==

I'm so far not happy with what this proposal suggests on creating the
threads, also the two new hooks mostly just to create these threads..

I know I suggested that.. but that's comparing to what I read in the even
earlier version, and sorry I wasn't able to suggest something better at
that time because I simply thought less.

As I mentioned in the other reply elsewhere, I think we should firstly have
these threads ready to take data at the start of migration, so that it'll
work when someone wants to add vfio iteration support.  Then the jobs
(mostly what vfio_save_complete_precopy_async_thread() does now) can be
enqueued into the thread pools.

It's better to create the thread pool owned by migration, rather than
threads owned by VFIO, because it also paves way for non-VFIO device state
save()s, as I mentioned also above on the multifd packet header.  Maybe we
can have a flag in the packet header saying "this is device xxx's state,
just load it".

I'd start looking at util/thread-pool.c, removing all the AIO implications
but simply provide a raw thread pool for what thread_pool_submit() is
doing.
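A raw pool of that kind is small; the sketch below is one possible shape
(plain pthreads, illustrative names, no relation to QEMU's actual
util/thread-pool.c code): workers block on a condition variable, submit
appends to a queue, and join drains the queue before tearing down.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Task {
    void (*fn)(void *);
    void *arg;
    struct Task *next;
} Task;

typedef struct RawPool {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    Task *head, *tail;
    pthread_t *threads;
    int nthreads;
    bool stopping;
} RawPool;

static void *worker(void *opaque)
{
    RawPool *p = opaque;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (!p->head && !p->stopping) {
            pthread_cond_wait(&p->cond, &p->lock);
        }
        if (!p->head && p->stopping) {     /* queue drained, shut down */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        Task *t = p->head;
        p->head = t->next;
        if (!p->head) {
            p->tail = NULL;
        }
        pthread_mutex_unlock(&p->lock);
        t->fn(t->arg);
        free(t);
    }
}

RawPool *raw_pool_new(int nthreads)
{
    RawPool *p = calloc(1, sizeof(*p));
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->cond, NULL);
    p->nthreads = nthreads;
    p->threads = calloc(nthreads, sizeof(pthread_t));
    for (int i = 0; i < nthreads; i++) {
        pthread_create(&p->threads[i], NULL, worker, p);
    }
    return p;
}

void raw_pool_submit(RawPool *p, void (*fn)(void *), void *arg)
{
    Task *t = malloc(sizeof(*t));
    t->fn = fn;
    t->arg = arg;
    t->next = NULL;
    pthread_mutex_lock(&p->lock);
    if (p->tail) {
        p->tail->next = t;
    } else {
        p->head = t;
    }
    p->tail = t;
    pthread_cond_signal(&p->cond);
    pthread_mutex_unlock(&p->lock);
}

void raw_pool_join(RawPool *p)             /* drain queue, then destroy */
{
    pthread_mutex_lock(&p->lock);
    p->stopping = true;
    pthread_cond_broadcast(&p->cond);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < p->nthreads; i++) {
        pthread_join(p->threads[i], NULL);
    }
    free(p->threads);
    free(p);
}

/* Demo task: atomically bump a counter. */
static void incr_task(void *arg)
{
    __atomic_fetch_add((int *)arg, 1, __ATOMIC_SEQ_CST);
}
```

Migration would own one such pool for the whole session, and VFIO (or any
other device) would only enqueue jobs into it instead of spawning threads.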

I know this is a lot, but I really think this is the right thing.. but we
can discuss, and you can correct me on my mistakes if there is.

If you want I can have a look at this pool model and prepare a patch, so
you can work on other vfio relevant stuff and pick that up, if that helps
you trying to reach the goal of landing this whole stuff in 9.1.

But I hope I explained more or less in this email on why I think this
feature is more involved than it looks like, and not yet mature from
design. And I hope I'm not purely asking for too much: merging this VFIO
series first then refactor on top can mean dropping too much unneeded code
after adding them, not to mention the protocol going to need another break.

It just doesn't sound like the right thing to do.

Thanks,

-- 
Peter Xu




Re: [PATCH v3 02/16] migration: Fix file migration with fdset

2024-06-23 Thread Peter Xu
On Sat, Jun 22, 2024 at 07:21:52AM +0300, Michael Tokarev wrote:
> 17.06.2024 21:57, Fabiano Rosas wrote:
> > When the "file:" migration support was added we missed the special
> > case in the qemu_open_old implementation that allows for a particular
> > file name format to be used to refer to a set of file descriptors that
> > have been previously provided to QEMU via the add-fd QMP command.
> > 
> > When using this fdset feature, we should not truncate the migration
> > file because being given an fd means that the management layer is in
> > control of the file and will likely already have some data written to
> > it. This is further indicated by the presence of the 'offset'
> > argument, which indicates the start of the region where QEMU is
> > allowed to write.
> > 
> > Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
> > call, which will take the offset into consideration.
> > 
> > Fixes: 385f510df5 ("migration: file URI offset")
> > Suggested-by: Daniel P. Berrangé 
> > Reviewed-by: Prasad Pandit 
> > Reviewed-by: Peter Xu 
> > Reviewed-by: Daniel P. Berrangé 
> > Signed-off-by: Fabiano Rosas 
> > ---
> >   migration/file.c | 11 +--
> >   1 file changed, 9 insertions(+), 2 deletions(-)
> 
> Is it a stable material?

I suppose yes. Thanks.

-- 
Peter Xu




Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots

2024-06-21 Thread Peter Xu
On Fri, Jun 21, 2024 at 07:40:01PM +0200, Maciej S. Szmigiero wrote:
> On 21.06.2024 17:56, Peter Xu wrote:
> > On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
> > > On 21.06.2024 17:04, Fabiano Rosas wrote:
> > > > "Maciej S. Szmigiero"  writes:
> > > > 
> > > > > Hi Fabiano,
> > > > > 
> > > > > On 20.06.2024 23:21, Fabiano Rosas wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > First of all, apologies for the roughness of the series. I'm off for
> > > > > > the next couple of weeks and wanted to put something together early
> > > > > > for your consideration.
> > > > > > 
> > > > > > This series is a refactoring (based on an earlier, off-list
> > > > > > attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> > > > > > the multifd core. If we're going to add support for more data types 
> > > > > > to
> > > > > > multifd, we first need to clean that up.
> > > > > > 
> > > > > > This time around this work was prompted by Maciej's series[1]. I see
> > > > > > you're having to add a bunch of is_device_state checks to work 
> > > > > > around
> > > > > > the rigidity of the code.
> > > > > > 
> > > > > > Aside from the VFIO work, there is also the intent (coming back from
> > > > > > Juan's ideas) to make multifd the default code path for migration,
> > > > > > which will have to include the vmstate migration and anything else 
> > > > > > we
> > > > > > put on the stream via QEMUFile.
> > > > > > 
> > > > > > I have long since been bothered by having 'pages' sprinkled all over
> > > > > > the code, so I might be coming at this with a bit of a narrow focus,
> > > > > > but I believe in order to support more types of payloads in multifd,
> > > > > > we need to first allow the scheduling at multifd_send_pages() to be
> > > > > > independent of MultiFDPages_t. So here it is. Let me know what you
> > > > > > think.
> > > > > 
> > > > > Thanks for the patch set, I quickly glanced at these patches and they
> > > > > definitely make sense to me.
> > > > > 
> > > (..)
> > > > > > (as I said, I'll be off for a couple of weeks, so feel free to
> > > > > > incorporate any of this code if it's useful. Or to ignore it
> > > > > > completely).
> > > > > 
> > > > > I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
> > > > > feature freeze in about a month, correct?
> > > > > 
> > > > 
> > > > For general code improvements like this I'm not thinking about QEMU
> > > > releases at all. But this series is not super complex, so I could
> > > > imagine we merging it in time for 9.1 if we reach an agreement.
> > > > 
> > > > Are you thinking your series might miss the target? Or have concerns
> > > > over the stability of the refactoring? We can within reason merge code
> > > > based on the current framework and improve things on top, we already did
> > > > something similar when merging zero-page support. I don't have an issue
> > > > with that.
> > > 
> > > The reason that I asked whether you are targeting 9.1 is because my
> > > patch set is definitely targeting that release.
> > > 
> > > At the same time my patch set will need to be rebased/refactored on top
> > > of this patch set if it is supposed to be merged for 9.1 too.
> > > 
> > > If this patch set gets merged quickly that's not really a problem.
> > > 
> > > On the other hand, if another iteration(s) is/are needed AND you are
> > > not available in the coming weeks to work on them then there's a
> > > question whether we will make the required deadline.
> > 
> > I think it's a bit rush to merge the vfio series in this release.  I'm not
> > sure it has enough time to be properly reviewed, reposted, retested, etc.
> > 
> > I've already started looking at it, and so far I think I have doubt not
> > only on agreement with Fabiano on the device_state thing which I prefer to
> > avoid, but also I'm thinking of any possible way to at least make the
> > wo

Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots

2024-06-21 Thread Peter Xu
On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
> On 21.06.2024 17:04, Fabiano Rosas wrote:
> > "Maciej S. Szmigiero"  writes:
> > 
> > > Hi Fabiano,
> > > 
> > > On 20.06.2024 23:21, Fabiano Rosas wrote:
> > > > Hi folks,
> > > > 
> > > > First of all, apologies for the roughness of the series. I'm off for
> > > > the next couple of weeks and wanted to put something together early
> > > > for your consideration.
> > > > 
> > > > This series is a refactoring (based on an earlier, off-list
> > > > attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> > > > the multifd core. If we're going to add support for more data types to
> > > > multifd, we first need to clean that up.
> > > > 
> > > > This time around this work was prompted by Maciej's series[1]. I see
> > > > you're having to add a bunch of is_device_state checks to work around
> > > > the rigidity of the code.
> > > > 
> > > > Aside from the VFIO work, there is also the intent (coming back from
> > > > Juan's ideas) to make multifd the default code path for migration,
> > > > which will have to include the vmstate migration and anything else we
> > > > put on the stream via QEMUFile.
> > > > 
> > > > I have long since been bothered by having 'pages' sprinkled all over
> > > > the code, so I might be coming at this with a bit of a narrow focus,
> > > > but I believe in order to support more types of payloads in multifd,
> > > > we need to first allow the scheduling at multifd_send_pages() to be
> > > > independent of MultiFDPages_t. So here it is. Let me know what you
> > > > think.
> > > 
> > > Thanks for the patch set, I quickly glanced at these patches and they
> > > definitely make sense to me.
> > > 
> (..)
> > > > (as I said, I'll be off for a couple of weeks, so feel free to
> > > > incorporate any of this code if it's useful. Or to ignore it
> > > > completely).
> > > 
> > > I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
> > > feature freeze in about a month, correct?
> > > 
> > 
> > For general code improvements like this I'm not thinking about QEMU
> > releases at all. But this series is not super complex, so I could
> > imagine we merging it in time for 9.1 if we reach an agreement.
> > 
> > Are you thinking your series might miss the target? Or have concerns
> > over the stability of the refactoring? We can within reason merge code
> > based on the current framework and improve things on top, we already did
> > something similar when merging zero-page support. I don't have an issue
> > with that.
> 
> The reason that I asked whether you are targeting 9.1 is because my
> patch set is definitely targeting that release.
> 
> At the same time my patch set will need to be rebased/refactored on top
> of this patch set if it is supposed to be merged for 9.1 too.
> 
> If this patch set gets merged quickly that's not really a problem.
> 
> On the other hand, if another iteration(s) is/are needed AND you are
> not available in the coming weeks to work on them then there's a
> question whether we will make the required deadline.

I think it's a bit of a rush to merge the vfio series in this release.  I'm
not sure there is enough time for it to be properly reviewed, reposted,
retested, etc.

I've already started looking at it, and so far I have doubts: not only
about the device_state thing, which Fabiano and I haven't agreed on and
which I'd prefer to avoid, but I'm also thinking of any possible way to at
least make the worker threads generic too.  A direct impact could be vDPA
in the near future if anyone cared, and I don't want modules to create
threads randomly during migration.

Meanwhile I'm also thinking about whether "the thread needs to dump all
data, and during iteration we can't do that" is a good reason not to
support this during iterations.

I haven't replied yet because I don't think I've thought everything
through, but I'll get there.

So I'm not saying that the design is problematic, but IMHO it's just not
mature enough to assume it will land in 9.1, considering it's still a
large series and the first non-RFC version was posted just two days ago.

Thanks,

-- 
Peter Xu




Re: [RFC PATCH 1/7] migration/multifd: Reduce access to p->pages

2024-06-21 Thread Peter Xu
On Thu, Jun 20, 2024 at 06:21:05PM -0300, Fabiano Rosas wrote:
> I'm about to replace the p->pages pointer with an opaque pointer, so
> do a cleanup now to reduce direct accesses to p->pages, which makes the
> next diffs cleaner.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH] migration: Remove unused VMSTATE_ARRAY_TEST() macro

2024-06-21 Thread Peter Xu
On Fri, Jun 21, 2024 at 08:13:35AM +, Zhijian Li (Fujitsu) via wrote:
> 
> 
> On 21/06/2024 15:03, Philippe Mathieu-Daudé wrote:
> > Last use of VMSTATE_ARRAY_TEST() was removed in commit 46baa9007f
> > ("migration/i386: Remove old non-softfloat 64bit FP support"), we
> > can safely get rid of it.
> > 
> > Signed-off-by: Philippe Mathieu-Daudé 
> 
> Reviewed-by: Li Zhijian 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v3 16/16] tests/qtest/migration: Add a test for mapped-ram with passing of fds

2024-06-21 Thread Peter Xu
On Fri, Jun 21, 2024 at 09:33:48AM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Mon, Jun 17, 2024 at 03:57:31PM -0300, Fabiano Rosas wrote:
> >> Add a multifd test for mapped-ram with passing of fds into QEMU. This
> >> is how libvirt will consume the feature.
> >> 
> >> There are a couple of details to the fdset mechanism:
> >> 
> >> - multifd needs two distinct file descriptors (not duplicated with
> >>   dup()) so it can enable O_DIRECT only on the channels that do
> >>   aligned IO. The dup() system call creates file descriptors that
> >>   share status flags, of which O_DIRECT is one.
> >> 
> >> - the open() access mode flags used for the fds passed into QEMU need
> >>   to match the flags QEMU uses to open the file. Currently O_WRONLY
> >>   for src and O_RDONLY for dst.
> >> 
> >> Note that fdset code goes under _WIN32 because fd passing is not
> >> supported on Windows.
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >> ---
> >> - dropped Peter's r-b
> >> 
> >> - stopped removing the fdset at the end of the tests. The migration
> >> code should always cleanup after itself.
> >
> > Ah, that looks also ok.
> 
> I made a mistake here. We still need to require that the management
> layer explicitly removes the fds they added by calling qmp_remove_fd().
> 
> The reason I thought it was ok to not remove the fds was that after your
> suggestion to use monitor_fdset_free_if_empty() in patch 7, I mistakenly
> put monitor_fdset_free() instead. So the qmp_remove_fd() was not finding
> any fdsets and I thought that was QEMU removing everything. Which is
> silly, because the whole purpose of the patch is to not do that.
> 
> I think I'll just fix this in the migration tree. It's just a revert to
> v2 of this patch and the correction to patch 7.

Ah OK.  I thought you were talking about when QEMU exit()s, which should
clean up everything too.

Personally I guess I kind of ignored that remove-fd API anyway, present or
not, as it seems nobody understands why it needs to exist in the first
place..

-- 
Peter Xu




[PATCH v3 05/11] migration/postcopy: Add postcopy-recover-setup phase

2024-06-19 Thread Peter Xu
This patch adds a migration state on src called "postcopy-recover-setup".
The new state describes the intermediate step starting from when the src
QEMU receives a postcopy recovery request, until the migration channels are
properly established, but before the recovery process takes place.

The request came from Libvirt, which currently relies on the migration
state events to detect migration state changes.  That works for most of the
migration process, except for postcopy recovery failures at the beginning.

Currently postcopy recovery only has two major states:

  - postcopy-paused: this is the state that both sides of QEMU will be in
for a long time as long as the migration channel was interrupted.

  - postcopy-recover: this is the state where both sides of QEMU handshake
with each other, preparing for a continuation of postcopy which used to
be interrupted.

The issue here is that when the recovery port is invalid, the src QEMU will
take the URI/channels, notice the ports are not valid, and silently stay
in the postcopy-paused state, with no event sent to Libvirt.  In this case,
the only thing Libvirt can do is poll the migration status at a proper
interval, which is less than optimal.

Considering that this is the only case where Libvirt won't get a
notification from QEMU on such events, let's add postcopy-recover-setup
state to mimic what we have with the "setup" state of a newly initialized
migration, describing the phase of connection establishment.

With that, postcopy recovery will have two paths to go now, and either path
will guarantee an event generated.  Now the events will look like this
during a recovery process on src QEMU:

  - Initially when the recovery is initiated on src, QEMU will go from
"postcopy-paused" -> "postcopy-recover-setup".  Old QEMUs don't have
this event.

  - Depending on whether the channel re-establishment succeeds:

    - In the success case, src QEMU will move from "postcopy-recover-setup"
      to "postcopy-recover".  Old QEMUs also have this event.

    - In the failure case, src QEMU will move from "postcopy-recover-setup"
      to "postcopy-paused" again.  Old QEMUs don't have this event.

This guarantees that Libvirt will always receive a notification for the
recovery process properly.

One thing to mention is that the new status is only needed on src QEMU, not
both.  On dest QEMU, the state machine doesn't change, hence the events
don't change either.  It's done like so because dest QEMU may not have an
explicit point where setup starts.  E.g., it can happen that dest QEMU
doesn't use the migrate-recover command to provide a new URI/channel, but
the old URI/channels are reused in recovery, in which case the old ports
simply work again after the network routes are fixed up.

Add a new helper postcopy_is_paused() detecting whether postcopy is still
paused, taking RECOVER_SETUP into account too.  When using it on both
src/dst, a slight change is made to always wait for the semaphore before
checking the status, because on both sides a sem_post() will be required
for a recovery.

Cc: Jiri Denemark 
Cc: Prasad Pandit 
Reviewed-by: Fabiano Rosas 
Buglink: https://issues.redhat.com/browse/RHEL-38485
Signed-off-by: Peter Xu 
---
 qapi/migration.json  |  4 
 migration/postcopy-ram.h |  3 +++
 migration/migration.c| 40 ++--
 migration/postcopy-ram.c |  6 ++
 migration/savevm.c   |  4 ++--
 5 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index de6c8b0444..0f24206bce 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -142,6 +142,9 @@
 #
 # @postcopy-paused: during postcopy but paused.  (since 3.0)
 #
+# @postcopy-recover-setup: setup phase for a postcopy recovery process,
+# preparing for a recovery phase to start.  (since 9.1)
+#
 # @postcopy-recover: trying to recover from a paused postcopy.  (since
 # 3.0)
 #
@@ -166,6 +169,7 @@
 { 'enum': 'MigrationStatus',
   'data': [ 'none', 'setup', 'cancelling', 'cancelled',
 'active', 'postcopy-active', 'postcopy-paused',
+'postcopy-recover-setup',
 'postcopy-recover', 'completed', 'failed', 'colo',
 'pre-switchover', 'device', 'wait-unplug' ] }
 ##
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index ecae941211..a6df1b2811 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -13,6 +13,8 @@
 #ifndef QEMU_POSTCOPY_RAM_H
 #define QEMU_POSTCOPY_RAM_H
 
+#include "qapi/qapi-types-migration.h"
+
 /* Return true if the host supports everything we need to do postcopy-ram */
 bool postcopy_ram_supported_by_host(MigrationIncomingState *mis,
 Error **errp);
@@ -193,5 +195,6 @@ enum PostcopyChannels {
 void postcopy_preempt_new_channel(Migrat

[PATCH v3 08/11] tests/migration-tests: Always enable migration events

2024-06-19 Thread Peter Xu
Libvirt should always enable it, so it'd be nice for qtest to also cover
that in all tests on both sides.  migrate_incoming_qmp() used to enable it
only on dst; now we enable it on both, as we'll start to sanity check
events even on the src QEMU.

We'll need to leave the one in migrate_incoming_qmp(), because the
virtio-net-failover test uses only that one, and it relies on the events to
work.

Signed-off-by: Peter Xu 
---
 tests/qtest/migration-helpers.c | 1 +
 tests/qtest/migration-test.c| 7 +++
 2 files changed, 8 insertions(+)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 0ac49ceb54..2ca4425d71 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -258,6 +258,7 @@ void migrate_incoming_qmp(QTestState *to, const char *uri, 
const char *fmt, ...)
 g_assert(!qdict_haskey(args, "uri"));
 qdict_put_str(args, "uri", uri);
 
+/* This function relies on the event to work, make sure it's enabled */
 migrate_set_capability(to, "events", true);
 
 rsp = qtest_qmp(to, "{ 'execute': 'migrate-incoming', 'arguments': %p}",
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 640713bfd5..c015e801ac 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -851,6 +851,13 @@ static int test_migrate_start(QTestState **from, 
QTestState **to,
 unlink(shmem_path);
 }
 
+/*
+ * Always enable migration events.  Libvirt always uses them; let's
+ * mimic that as closely as we can.
+ */
+migrate_set_capability(*from, "events", true);
+migrate_set_capability(*to, "events", true);
+
 return 0;
 }
 
-- 
2.45.0




[PATCH v3 07/11] tests/migration-tests: Drop most WIN32 ifdefs for postcopy failure tests

2024-06-19 Thread Peter Xu
Most of them are not needed; we can stick with one ifdef inside
postcopy_recover_fail() so as to cover only the SCM_RIGHTS tricks.
The tests won't run on Windows anyway, due to has_uffd always being false.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Peter Xu 
---
 tests/qtest/migration-test.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 85a21ff5e9..640713bfd5 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1363,9 +1363,9 @@ static void wait_for_postcopy_status(QTestState *one, 
const char *status)
   "completed", NULL });
 }
 
-#ifndef _WIN32
 static void postcopy_recover_fail(QTestState *from, QTestState *to)
 {
+#ifndef _WIN32
 int ret, pair1[2], pair2[2];
 char c;
 
@@ -1427,8 +1427,8 @@ static void postcopy_recover_fail(QTestState *from, 
QTestState *to)
 close(pair1[1]);
 close(pair2[0]);
 close(pair2[1]);
+#endif
 }
-#endif /* _WIN32 */
 
 static void test_postcopy_recovery_common(MigrateCommon *args)
 {
@@ -1468,7 +1468,6 @@ static void test_postcopy_recovery_common(MigrateCommon 
*args)
 wait_for_postcopy_status(to, "postcopy-paused");
 wait_for_postcopy_status(from, "postcopy-paused");
 
-#ifndef _WIN32
 if (args->postcopy_recovery_test_fail) {
 /*
  * Test when a wrong socket specified for recover, and then the
@@ -1477,7 +1476,6 @@ static void test_postcopy_recovery_common(MigrateCommon 
*args)
 postcopy_recover_fail(from, to);
 /* continue with a good recovery */
 }
-#endif /* _WIN32 */
 
 /*
  * Create a new socket to emulate a new channel that is different
@@ -1506,7 +1504,6 @@ static void test_postcopy_recovery(void)
 test_postcopy_recovery_common();
 }
 
-#ifndef _WIN32
 static void test_postcopy_recovery_double_fail(void)
 {
 MigrateCommon args = {
@@ -1515,7 +1512,6 @@ static void test_postcopy_recovery_double_fail(void)
 
 test_postcopy_recovery_common();
 }
-#endif /* _WIN32 */
 
 #ifdef CONFIG_GNUTLS
 static void test_postcopy_recovery_tls_psk(void)
@@ -3693,10 +3689,8 @@ int main(int argc, char **argv)
test_postcopy_preempt);
 migration_test_add("/migration/postcopy/preempt/recovery/plain",
test_postcopy_preempt_recovery);
-#ifndef _WIN32
 migration_test_add("/migration/postcopy/recovery/double-failures",
test_postcopy_recovery_double_fail);
-#endif /* _WIN32 */
 if (is_x86) {
 migration_test_add("/migration/postcopy/suspend",
test_postcopy_suspend);
-- 
2.45.0




[PATCH v3 10/11] tests/migration-tests: Verify postcopy-recover-setup status

2024-06-19 Thread Peter Xu
Make sure the postcopy-recover-setup status is present in the postcopy
failure unit test.  Note that it only applies to src QEMU, not dest.

This also introduces the tiny but helpful migration_event_wait() helper.

Signed-off-by: Peter Xu 
---
 tests/qtest/migration-test.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index c015e801ac..de81e28088 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1412,6 +1412,12 @@ static void postcopy_recover_fail(QTestState *from, 
QTestState *to)
 migrate_recover(to, "fd:fd-mig");
 migrate_qmp(from, to, "fd:fd-mig", NULL, "{'resume': true}");
 
+/*
+ * Source QEMU has an extra RECOVER_SETUP phase, dest doesn't have it.
+ * Make sure it appears along the way.
+ */
+migration_event_wait(from, "postcopy-recover-setup");
+
 /*
  * Make sure both QEMU instances will go into RECOVER stage, then test
  * kicking them out using migrate-pause.
-- 
2.45.0




[PATCH v3 04/11] migration: Cleanup incoming migration setup state change

2024-06-19 Thread Peter Xu
Destination QEMU can setup incoming ports for two purposes: either a fresh
new incoming migration, in which QEMU will switch to SETUP for channel
establishment, or a paused postcopy migration, in which QEMU will stay in
POSTCOPY_PAUSED until kicking off the RECOVER phase.

Now, the state machine works on the dest node for the latter only because
migrate_set_state() implicitly becomes a noop if the current state check
fails.  That wasn't clear at all.

Clean it up by providing a helper migration_incoming_state_setup() that
does proper checks on the current status.  Postcopy-paused will now be
explicitly checked, and then we can bail out on unknown states.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Peter Xu 
---
 migration/migration.c | 28 ++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 795b30f0d0..41a88fc50a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -618,6 +618,29 @@ bool migrate_uri_parse(const char *uri, MigrationChannel 
**channel,
 return true;
 }
 
+static bool
+migration_incoming_state_setup(MigrationIncomingState *mis, Error **errp)
+{
+MigrationStatus current = mis->state;
+
+if (current == MIGRATION_STATUS_POSTCOPY_PAUSED) {
+/*
+ * Incoming postcopy migration will stay in PAUSED state even if
+ * reconnection happened.
+ */
+return true;
+}
+
+if (current != MIGRATION_STATUS_NONE) {
+error_setg(errp, "Illegal migration incoming state: %s",
+   MigrationStatus_str(current));
+return false;
+}
+
+migrate_set_state(&mis->state, current, MIGRATION_STATUS_SETUP);
+return true;
+}
+
 static void qemu_start_incoming_migration(const char *uri, bool has_channels,
   MigrationChannelList *channels,
   Error **errp)
@@ -656,8 +679,9 @@ static void qemu_start_incoming_migration(const char *uri, 
bool has_channels,
 return;
 }
 
-migrate_set_state(&mis->state, MIGRATION_STATUS_NONE,
-  MIGRATION_STATUS_SETUP);
+if (!migration_incoming_state_setup(mis, errp)) {
+return;
+}
 
 if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
SocketAddress *saddr = &addr->u.socket;
-- 
2.45.0




[PATCH v3 03/11] migration: Use MigrationStatus instead of int

2024-06-19 Thread Peter Xu
QEMU uses "int" in most cases even when it stores a MigrationStatus.  I
don't know why, so let's try to do it right and see what blows up..

Reviewed-by: Fabiano Rosas 
Signed-off-by: Peter Xu 
---
 migration/migration.h |  9 +
 migration/migration.c | 24 +++-
 2 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 6af01362d4..38aa1402d5 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -160,7 +160,7 @@ struct MigrationIncomingState {
 /* PostCopyFD's for external userfaultfds & handlers of shared memory */
 GArray   *postcopy_remote_fds;
 
-int state;
+MigrationStatus state;
 
 /*
  * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
@@ -301,7 +301,7 @@ struct MigrationState {
 /* params from 'migrate-set-parameters' */
 MigrationParameters parameters;
 
-int state;
+MigrationStatus state;
 
 /* State related to return path */
 struct {
@@ -459,7 +459,8 @@ struct MigrationState {
 bool rdma_migration;
 };
 
-void migrate_set_state(int *state, int old_state, int new_state);
+void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
+   MigrationStatus new_state);
 
 void migration_fd_process_incoming(QEMUFile *f);
 void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
@@ -479,7 +480,7 @@ int migrate_init(MigrationState *s, Error **errp);
 bool migration_is_blocked(Error **errp);
 /* True if outgoing migration has entered postcopy phase */
 bool migration_in_postcopy(void);
-bool migration_postcopy_is_alive(int state);
+bool migration_postcopy_is_alive(MigrationStatus state);
 MigrationState *migrate_get_current(void);
 bool migration_has_failed(MigrationState *);
 bool migrate_mode_is_cpr(MigrationState *);
diff --git a/migration/migration.c b/migration/migration.c
index f9b69af62f..795b30f0d0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -413,7 +413,7 @@ void migration_incoming_state_destroy(void)
 yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
-static void migrate_generate_event(int new_state)
+static void migrate_generate_event(MigrationStatus new_state)
 {
 if (migrate_events()) {
 qapi_event_send_migration(new_state);
@@ -1296,8 +1296,6 @@ static void fill_destination_migration_info(MigrationInfo 
*info)
 }
 
 switch (mis->state) {
-case MIGRATION_STATUS_NONE:
-return;
 case MIGRATION_STATUS_SETUP:
 case MIGRATION_STATUS_CANCELLING:
 case MIGRATION_STATUS_CANCELLED:
@@ -1313,6 +1311,8 @@ static void fill_destination_migration_info(MigrationInfo 
*info)
 info->has_status = true;
 fill_destination_postcopy_migration_info(info);
 break;
+default:
+return;
 }
 info->status = mis->state;
 
@@ -1360,7 +1360,8 @@ void qmp_migrate_start_postcopy(Error **errp)
 
 /* shared migration helpers */
 
-void migrate_set_state(int *state, int old_state, int new_state)
+void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
+   MigrationStatus new_state)
 {
 assert(new_state < MIGRATION_STATUS__MAX);
 if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
@@ -1567,7 +1568,7 @@ bool migration_in_postcopy(void)
 }
 }
 
-bool migration_postcopy_is_alive(int state)
+bool migration_postcopy_is_alive(MigrationStatus state)
 {
 switch (state) {
 case MIGRATION_STATUS_POSTCOPY_ACTIVE:
@@ -1612,20 +1613,9 @@ bool migration_is_idle(void)
 case MIGRATION_STATUS_COMPLETED:
 case MIGRATION_STATUS_FAILED:
 return true;
-case MIGRATION_STATUS_SETUP:
-case MIGRATION_STATUS_CANCELLING:
-case MIGRATION_STATUS_ACTIVE:
-case MIGRATION_STATUS_POSTCOPY_ACTIVE:
-case MIGRATION_STATUS_COLO:
-case MIGRATION_STATUS_PRE_SWITCHOVER:
-case MIGRATION_STATUS_DEVICE:
-case MIGRATION_STATUS_WAIT_UNPLUG:
+default:
 return false;
-case MIGRATION_STATUS__MAX:
-g_assert_not_reached();
 }
-
-return false;
 }
 
 bool migration_is_active(void)
-- 
2.45.0




[PATCH v3 06/11] migration/docs: Update postcopy recover session for SETUP phase

2024-06-19 Thread Peter Xu
Firstly, the "Paused" state was added in the wrong place before: the state
machine section was describing PostcopyState rather than MigrationStatus.
Drop the Paused state descriptions.

Then, in the postcopy recover session, add more information on the
MigrationStatus state machine along the way.  Add the new RECOVER_SETUP
phase.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Peter Xu 
---
 docs/devel/migration/postcopy.rst | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/docs/devel/migration/postcopy.rst 
b/docs/devel/migration/postcopy.rst
index 6c51e96d79..a15594e11f 100644
--- a/docs/devel/migration/postcopy.rst
+++ b/docs/devel/migration/postcopy.rst
@@ -99,17 +99,6 @@ ADVISE->DISCARD->LISTEN->RUNNING->END
 (although it can't do the cleanup it would do as it
 finishes a normal migration).
 
- - Paused
-
-Postcopy can run into a paused state (normally on both sides when
-happens), where all threads will be temporarily halted mostly due to
-network errors.  When reaching paused state, migration will make sure
-the qemu binary on both sides maintain the data without corrupting
-the VM.  To continue the migration, the admin needs to fix the
-migration channel using the QMP command 'migrate-recover' on the
-destination node, then resume the migration using QMP command 'migrate'
-again on source node, with resume=true flag set.
-
  - End
 
 The listen thread can now quit, and perform the cleanup of migration
@@ -221,7 +210,8 @@ paused postcopy migration.
 
 The recovery phase normally contains a few steps:
 
-  - When network issue occurs, both QEMU will go into PAUSED state
+  - When network issue occurs, both QEMU will go into **POSTCOPY_PAUSED**
+migration state.
 
   - When the network is recovered (or a new network is provided), the admin
 can setup the new channel for migration using QMP command
@@ -229,9 +219,20 @@ The recovery phase normally contains a few steps:
 
   - On source host, the admin can continue the interrupted postcopy
 migration using QMP command 'migrate' with resume=true flag set.
-
-  - After the connection is re-established, QEMU will continue the postcopy
-migration on both sides.
+Source QEMU will go into **POSTCOPY_RECOVER_SETUP** state trying to
+re-establish the channels.
+
+  - When both sides of QEMU successfully reconnects using a new or fixed up
+channel, they will go into **POSTCOPY_RECOVER** state, some handshake
+procedure will be needed to properly synchronize the VM states between
+the two QEMUs to continue the postcopy migration.  For example, there
+can be pages sent right during the window when the network is
+interrupted, then the handshake will guarantee pages lost in-flight
+will be resent again.
+
+  - After a proper handshake synchronization, QEMU will continue the
+postcopy migration on both sides and go back to **POSTCOPY_ACTIVE**
+state.  Postcopy migration will continue.
 
 During a paused postcopy migration, the VM can logically still continue
 running, and it will not be impacted from any page access to pages that
-- 
2.45.0




[PATCH v3 01/11] migration/multifd: Avoid the final FLUSH in complete()

2024-06-19 Thread Peter Xu
We always do the flush when finishing one round of scanning, and during the
complete() phase we should scan one more round making sure no dirty pages
remain.  In that case we shouldn't need an explicit FLUSH at the end of
complete(), as when reaching there all pages should have been flushed.

Reviewed-by: Fabiano Rosas 
Tested-by: Fabiano Rosas 
Signed-off-by: Peter Xu 
---
 migration/ram.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index ceea586b06..edec1a2d07 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3300,10 +3300,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 }
 }
 
-if (migrate_multifd() && !migrate_multifd_flush_after_each_section() &&
-!migrate_mapped_ram()) {
-qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
-}
 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
 return qemu_fflush(f);
 }
-- 
2.45.0




[PATCH v3 11/11] tests/migration-tests: Cover postcopy failure on reconnect

2024-06-19 Thread Peter Xu
Make sure there will be an event for postcopy recovery, regardless of
whether the reconnect succeeds, or of when the failure happens.

The newly added case fails early in postcopy recovery, in which case src
doesn't even reach the RECOVER stage (in real life it'll be the same for
dest, but the test case is just slightly more involved due to the dual
socketpair setup).

To do that, rename postcopy_recovery_test_fail to reflect which stage to
fail at, instead of being a boolean.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Peter Xu 
---
 tests/qtest/migration-test.c | 95 +---
 1 file changed, 77 insertions(+), 18 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index de81e28088..fe33b86783 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -72,6 +72,17 @@ static QTestMigrationState dst_state;
 #define QEMU_ENV_SRC "QTEST_QEMU_BINARY_SRC"
 #define QEMU_ENV_DST "QTEST_QEMU_BINARY_DST"
 
+typedef enum PostcopyRecoveryFailStage {
+/*
+ * "no failure" must be 0 as it's the default.  OTOH, real failure
+ * cases must be >0 to make sure they trigger by a "if" test.
+ */
+POSTCOPY_FAIL_NONE = 0,
+POSTCOPY_FAIL_CHANNEL_ESTABLISH,
+POSTCOPY_FAIL_RECOVERY,
+POSTCOPY_FAIL_MAX
+} PostcopyRecoveryFailStage;
+
 #if defined(__linux__)
 #include 
 #include 
@@ -692,7 +703,7 @@ typedef struct {
 /* Postcopy specific fields */
 void *postcopy_data;
 bool postcopy_preempt;
-bool postcopy_recovery_test_fail;
+PostcopyRecoveryFailStage postcopy_recovery_fail_stage;
 } MigrateCommon;
 
 static int test_migrate_start(QTestState **from, QTestState **to,
@@ -1370,12 +1381,16 @@ static void wait_for_postcopy_status(QTestState *one, 
const char *status)
   "completed", NULL });
 }
 
-static void postcopy_recover_fail(QTestState *from, QTestState *to)
+static void postcopy_recover_fail(QTestState *from, QTestState *to,
+  PostcopyRecoveryFailStage stage)
 {
 #ifndef _WIN32
+bool fail_early = (stage == POSTCOPY_FAIL_CHANNEL_ESTABLISH);
 int ret, pair1[2], pair2[2];
 char c;
 
+g_assert(stage > POSTCOPY_FAIL_NONE && stage < POSTCOPY_FAIL_MAX);
+
 /* Create two unrelated socketpairs */
 ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair1);
 g_assert_cmpint(ret, ==, 0);
@@ -1409,6 +1424,14 @@ static void postcopy_recover_fail(QTestState *from, 
QTestState *to)
ret = send(pair2[1], &c, 1, 0);
 g_assert_cmpint(ret, ==, 1);
 
+if (stage == POSTCOPY_FAIL_CHANNEL_ESTABLISH) {
+/*
+ * This will make src QEMU fail at an early stage when trying to
+ * resume later, where it shouldn't reach RECOVER stage at all.
+ */
+close(pair1[1]);
+}
+
 migrate_recover(to, "fd:fd-mig");
 migrate_qmp(from, to, "fd:fd-mig", NULL, "{'resume': true}");
 
@@ -1418,28 +1441,53 @@ static void postcopy_recover_fail(QTestState *from, 
QTestState *to)
  */
 migration_event_wait(from, "postcopy-recover-setup");
 
+if (fail_early) {
+/*
+ * When reconnection fails, src QEMU will automatically go
+ * back to PAUSED state.  Make sure there is an event in this
+ * case: Libvirt relies on this to detect early reconnection
+ * errors.
+ */
+migration_event_wait(from, "postcopy-paused");
+} else {
+/*
+ * We want to test "fail later" at RECOVER stage here.  Make sure
+ * both QEMU instances will go into RECOVER stage first, then test
+ * kicking them out using migrate-pause.
+ *
+ * Explicitly check the RECOVER event on src, that's what Libvirt
+ * relies on, rather than polling.
+ */
+migration_event_wait(from, "postcopy-recover");
+wait_for_postcopy_status(from, "postcopy-recover");
+
+/* Need an explicit kick on src QEMU in this case */
+migrate_pause(from);
+}
+
 /*
- * Make sure both QEMU instances will go into RECOVER stage, then test
- * kicking them out using migrate-pause.
+ * For all failure cases, we'll reach such states on both sides now.
+ * Check them.
  */
-wait_for_postcopy_status(from, "postcopy-recover");
+wait_for_postcopy_status(from, "postcopy-paused");
 wait_for_postcopy_status(to, "postcopy-recover");
 
 /*
- * This would be issued by the admin upon noticing the hang, we should
- * make sure we're able to kick this out.
+ * Kick dest QEMU out too. This is normally not needed in reality
+ * because when the channel is shutdown it should also happen on src.
+ * However here we used separate socket pairs so we need t

[PATCH v3 00/11] migration: New postcopy state, and some cleanups

2024-06-19 Thread Peter Xu
Based-on: <20240617185731.9725-1-faro...@suse.de>

v3:
- Added one comment in patch 8 explaining why migrate_incoming_qmp() needs
  to keep enabling "events" capability.
- Split patch 9 into two patches, which allows migration_event_wait() to
  also be used in migrate_incoming_qmp()
- Rename the tests in last patch, and a spell fix
- Rebased to "[PATCH v3 00/16] migration/mapped-ram: Add direct-io support"

v1: https://lore.kernel.org/r/20240612144228.1179240-1-pet...@redhat.com
v2: https://lore.kernel.org/r/20240617181534.1425179-1-pet...@redhat.com

The major goal of this patchset is patch 5, which introduces a new postcopy
state so that we will send an event on postcopy reconnect failures, which
Libvirt would prefer to have.  There's more information on that issue in
the commit message itself.

Patches 1-2 are cleanups that are not directly relevant, but ones I had
found/stored and that could be good to have.  I kept it simple by putting
them together in one thread to make patch management easier, but I can send
them separately when necessary.

Patch 3 is also a cleanup, but is needed by patch 4 as a dependency.

Patches 4-5 are the core patches.

Patch 6 updates doc for the new state.

Patch 7-11 adds a new test for the new state.

CI: https://gitlab.com/peterx/qemu/-/pipelines/1339544694

Comments welcomed, thanks.

Peter Xu (11):
  migration/multifd: Avoid the final FLUSH in complete()
  migration: Rename thread debug names
  migration: Use MigrationStatus instead of int
  migration: Cleanup incoming migration setup state change
  migration/postcopy: Add postcopy-recover-setup phase
  migration/docs: Update postcopy recover session for SETUP phase
  tests/migration-tests: Drop most WIN32 ifdefs for postcopy failure
tests
  tests/migration-tests: Always enable migration events
  tests/migration-tests: migration_event_wait()
  tests/migration-tests: Verify postcopy-recover-setup status
  tests/migration-tests: Cover postcopy failure on reconnect

 docs/devel/migration/postcopy.rst |  31 
 qapi/migration.json   |   4 ++
 migration/migration.h |   9 +--
 migration/postcopy-ram.h  |   3 +
 tests/qtest/migration-helpers.h   |   2 +
 migration/colo.c  |   2 +-
 migration/migration.c |  98 +
 migration/multifd.c   |   6 +-
 migration/postcopy-ram.c  |  10 ++-
 migration/ram.c   |   4 --
 migration/savevm.c|   6 +-
 tests/qtest/migration-helpers.c   |  32 ++---
 tests/qtest/migration-test.c  | 116 +++---
 13 files changed, 229 insertions(+), 94 deletions(-)

-- 
2.45.0




[PATCH v3 09/11] tests/migration-tests: migration_event_wait()

2024-06-19 Thread Peter Xu
Introduce a small helper to wait for a migration event, generalized from
the incoming migration path.  Make the helper easier to use by allowing it
to keep waiting until the expected event is received.

Signed-off-by: Peter Xu 
---
 tests/qtest/migration-helpers.h |  2 ++
 tests/qtest/migration-helpers.c | 31 ++-
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index 50095fca4a..72dba369fb 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -63,4 +63,6 @@ static inline bool probe_o_direct_support(const char *tmpfs)
 }
 #endif
 void migration_test_add(const char *path, void (*fn)(void));
+void migration_event_wait(QTestState *s, const char *target);
+
 #endif /* MIGRATION_HELPERS_H */
diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 2ca4425d71..84f49db85e 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -249,7 +249,7 @@ void migrate_set_capability(QTestState *who, const char *capability,
 void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
 {
 va_list ap;
-QDict *args, *rsp, *data;
+QDict *args, *rsp;
 
 va_start(ap, fmt);
 args = qdict_from_vjsonf_nofail(fmt, ap);
@@ -272,14 +272,7 @@ void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
 g_assert(qdict_haskey(rsp, "return"));
 qobject_unref(rsp);
 
-rsp = qtest_qmp_eventwait_ref(to, "MIGRATION");
-g_assert(qdict_haskey(rsp, "data"));
-
-data = qdict_get_qdict(rsp, "data");
-g_assert(qdict_haskey(data, "status"));
-g_assert_cmpstr(qdict_get_str(data, "status"), ==, "setup");
-
-qobject_unref(rsp);
+migration_event_wait(to, "setup");
 }
 
 /*
@@ -518,3 +511,23 @@ bool probe_o_direct_support(const char *tmpfs)
 return true;
 }
 #endif
+
+/*
+ * Wait for a "MIGRATION" event.  This is what Libvirt uses to track
+ * migration status changes.
+ */
+void migration_event_wait(QTestState *s, const char *target)
+{
+QDict *response, *data;
+const char *status;
+bool found;
+
+do {
+response = qtest_qmp_eventwait_ref(s, "MIGRATION");
+data = qdict_get_qdict(response, "data");
+g_assert(data);
+status = qdict_get_str(data, "status");
+found = (strcmp(status, target) == 0);
+qobject_unref(response);
+} while (!found);
+}
-- 
2.45.0




[PATCH v3 02/11] migration: Rename thread debug names

2024-06-19 Thread Peter Xu
The postcopy thread names on dest QEMU are slightly confusing, and I
partly have myself to blame, due to 36f62f11e4 ("migration: Postcopy
preemption preparation on channel creation").  E.g., "fault-fast" reads
like a fast version of "fault-default", but it's actually the fast version
of "postcopy/listen".

Taking this chance, rename all the migration threads with consistent
rules.  Considering we only have 15 usable chars, prefix all threads with
"mig/", and identify src/dst threads properly this time.  Now most thread
names will look like "mig/DIR/xxx", where DIR is "src" or "dst", except
the bg-snapshot thread which doesn't have a direction.

For multifd threads, make them "mig/{src|dst}/{send|recv}_%d".

We used to have "live_migration" thread for a very long time, now it's
called "mig/src/main".  We may hope to have "mig/dst/main" soon but not
yet.

Reviewed-by: Fabiano Rosas 
Reviewed-by: Zhijian Li (Fujitsu) 
Signed-off-by: Peter Xu 
---
 migration/colo.c | 2 +-
 migration/migration.c| 6 +++---
 migration/multifd.c  | 6 +++---
 migration/postcopy-ram.c | 4 ++--
 migration/savevm.c   | 2 +-
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index f96c2ee069..6449490221 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -935,7 +935,7 @@ void coroutine_fn colo_incoming_co(void)
 assert(bql_locked());
 assert(migration_incoming_colo_enabled());
 
-qemu_thread_create(, "COLO incoming", colo_process_incoming_thread,
+qemu_thread_create(, "mig/dst/colo", colo_process_incoming_thread,
mis, QEMU_THREAD_JOINABLE);
 
 mis->colo_incoming_co = qemu_coroutine_self();
diff --git a/migration/migration.c b/migration/migration.c
index e03c80b3aa..f9b69af62f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2431,7 +2431,7 @@ static int open_return_path_on_source(MigrationState *ms)
 
 trace_open_return_path_on_source();
 
-qemu_thread_create(>rp_state.rp_thread, "return path",
+qemu_thread_create(>rp_state.rp_thread, "mig/src/rp-thr",
source_return_path_thread, ms, QEMU_THREAD_JOINABLE);
 ms->rp_state.rp_thread_created = true;
 
@@ -3770,10 +3770,10 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
 }
 
 if (migrate_background_snapshot()) {
-qemu_thread_create(>thread, "bg_snapshot",
+qemu_thread_create(>thread, "mig/snapshot",
 bg_migration_thread, s, QEMU_THREAD_JOINABLE);
 } else {
-qemu_thread_create(>thread, "live_migration",
+qemu_thread_create(>thread, "mig/src/main",
 migration_thread, s, QEMU_THREAD_JOINABLE);
 }
 s->migration_thread_running = true;
diff --git a/migration/multifd.c b/migration/multifd.c
index d82885fdbb..0b4cbaddfe 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1069,7 +1069,7 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
 args->p = p;
 
 p->tls_thread_created = true;
-qemu_thread_create(>tls_thread, "multifd-tls-handshake-worker",
+qemu_thread_create(>tls_thread, "mig/src/tls",
multifd_tls_handshake_thread, args,
QEMU_THREAD_JOINABLE);
 return true;
@@ -1190,7 +1190,7 @@ bool multifd_send_setup(void)
 p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
 p->packet->version = cpu_to_be32(MULTIFD_VERSION);
 }
-p->name = g_strdup_printf("multifdsend_%d", i);
+p->name = g_strdup_printf("mig/src/send_%d", i);
 p->page_size = qemu_target_page_size();
 p->page_count = page_count;
 p->write_flags = 0;
@@ -1604,7 +1604,7 @@ int multifd_recv_setup(Error **errp)
 + sizeof(uint64_t) * page_count;
 p->packet = g_malloc0(p->packet_len);
 }
-p->name = g_strdup_printf("multifdrecv_%d", i);
+p->name = g_strdup_printf("mig/dst/recv_%d", i);
 p->normal = g_new0(ram_addr_t, page_count);
 p->zero = g_new0(ram_addr_t, page_count);
 p->page_count = page_count;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 3419779548..97701e6bb2 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1238,7 +1238,7 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
 return -1;
 }
 
-postcopy_thread_create(mis, >fault_thread, "fault-default",
+postcopy_thread_create(mis, >fault_thread, "mig/dst/fault",
postcopy_ram_fault_thread, QEM

Re: [PATCH v2 08/10] tests/migration-tests: Always enable migration events

2024-06-19 Thread Peter Xu
On Mon, Jun 17, 2024 at 05:23:24PM -0400, Peter Xu wrote:
> On Mon, Jun 17, 2024 at 04:51:32PM -0300, Fabiano Rosas wrote:
> > Peter Xu  writes:
> > 
> > > Libvirt should always enable it, so it'll be nice if qtest also covers that for
> > > all tests.  Though this patch only enables it, no extra tests are done on
> > > these events yet.
> > >
> > > Signed-off-by: Peter Xu 
> > > ---
> > >  tests/qtest/migration-test.c | 7 +++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > > index 13b59d4c10..9ae8892e26 100644
> > > --- a/tests/qtest/migration-test.c
> > > +++ b/tests/qtest/migration-test.c
> > > @@ -841,6 +841,13 @@ static int test_migrate_start(QTestState **from, QTestState **to,
> > >  unlink(shmem_path);
> > >  }
> > >  
> > > +/*
> > > + * Always enable migration events.  Libvirt always uses it, let's try
> > > + * to mimic that as closely as possible.
> > > + */
> > > +migrate_set_capability(*from, "events", true);
> > > +migrate_set_capability(*to, "events", true);
> > > +
> > 
> > What do we do with the one at migrate_incoming_qmp()?
> 
> Hmm missed that..  I'll drop that one in this same patch and rewrite the
> commit message.  New version attached:
> 
> ===8<===
> From 443fef4188d544362fc026b46784c15b82624642 Mon Sep 17 00:00:00 2001
> From: Peter Xu 
> Date: Mon, 17 Jun 2024 10:49:52 -0400
> Subject: [PATCH] tests/migration-tests: Always enable migration events
> 
> Libvirt should always enable it, so it'll be nice if qtest also covers that for
> all tests on both sides.  migrate_incoming_qmp() used to enable it only on
> dst, now we enable them on both, as we'll start to sanity check events even
> on the src QEMU.
> 
> Signed-off-by: Peter Xu 
> ---
>  tests/qtest/migration-helpers.c | 2 --
>  tests/qtest/migration-test.c| 7 +++
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
> index 0ac49ceb54..797b1e8c1c 100644
> --- a/tests/qtest/migration-helpers.c
> +++ b/tests/qtest/migration-helpers.c
> @@ -258,8 +258,6 @@ void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
>  g_assert(!qdict_haskey(args, "uri"));
>  qdict_put_str(args, "uri", uri);
>  
> -migrate_set_capability(to, "events", true);
> -

Unfortunately this will break virtio-net-failover test... as it uses
migrate_incoming_qmp() without using test_migrate_start().

I'll leave it there for now, perhaps adding a comment.

>  rsp = qtest_qmp(to, "{ 'execute': 'migrate-incoming', 'arguments': %p}",
>  args);
>  
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 640713bfd5..c015e801ac 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -851,6 +851,13 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>  unlink(shmem_path);
>  }
>  
> +/*
> + * Always enable migration events.  Libvirt always uses it, let's try
> + * to mimic that as closely as possible.
> + */
> +migrate_set_capability(*from, "events", true);
> +migrate_set_capability(*to, "events", true);
> +
>  return 0;
>  }
>  
> -- 
> 2.45.0
> 
> 
> -- 
> Peter Xu

-- 
Peter Xu




Re: [PATCH] MAINTAINERS: Cover all tests/qtest/migration-* files

2024-06-19 Thread Peter Xu
On Wed, Jun 19, 2024 at 07:54:47AM +0200, Thomas Huth wrote:
> Beside migration-test.c, there is nowadays migration-helpers.[ch],
> too, so update the entry in the migration section to also cover these
> files now.
> While we're at it, exclude these files in the common qtest section,
> since the migration test is well covered by the migration maintainers
> already. Since the test is under very active development, it was causing
> a lot of distraction to the generic qtest maintainers with regards to
> the patches that need to be reviewed by the migration maintainers anyway.
> 
> Signed-off-by: Thomas Huth 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 08/10] tests/migration-tests: Always enable migration events

2024-06-17 Thread Peter Xu
On Mon, Jun 17, 2024 at 04:51:32PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > Libvirt should always enable it, so it'll be nice if qtest also covers that for
> > all tests.  Though this patch only enables it, no extra tests are done on
> > these events yet.
> >
> > Signed-off-by: Peter Xu 
> > ---
> >  tests/qtest/migration-test.c | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > index 13b59d4c10..9ae8892e26 100644
> > --- a/tests/qtest/migration-test.c
> > +++ b/tests/qtest/migration-test.c
> > @@ -841,6 +841,13 @@ static int test_migrate_start(QTestState **from, QTestState **to,
> >  unlink(shmem_path);
> >  }
> >  
> > +/*
> > + * Always enable migration events.  Libvirt always uses it, let's try
> > + * to mimic that as closely as possible.
> > + */
> > +migrate_set_capability(*from, "events", true);
> > +migrate_set_capability(*to, "events", true);
> > +
> 
> What do we do with the one at migrate_incoming_qmp()?

Hmm missed that..  I'll drop that one in this same patch and rewrite the
commit message.  New version attached:

===8<===
From 443fef4188d544362fc026b46784c15b82624642 Mon Sep 17 00:00:00 2001
From: Peter Xu 
Date: Mon, 17 Jun 2024 10:49:52 -0400
Subject: [PATCH] tests/migration-tests: Always enable migration events

Libvirt should always enable it, so it'll be nice if qtest also covers that for
all tests on both sides.  migrate_incoming_qmp() used to enable it only on
dst, now we enable them on both, as we'll start to sanity check events even
on the src QEMU.

Signed-off-by: Peter Xu 
---
 tests/qtest/migration-helpers.c | 2 --
 tests/qtest/migration-test.c| 7 +++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 0ac49ceb54..797b1e8c1c 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -258,8 +258,6 @@ void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
 g_assert(!qdict_haskey(args, "uri"));
 qdict_put_str(args, "uri", uri);
 
-migrate_set_capability(to, "events", true);
-
 rsp = qtest_qmp(to, "{ 'execute': 'migrate-incoming', 'arguments': %p}",
 args);
 
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 640713bfd5..c015e801ac 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -851,6 +851,13 @@ static int test_migrate_start(QTestState **from, QTestState **to,
 unlink(shmem_path);
 }
 
+/*
+ * Always enable migration events.  Libvirt always uses it, let's try
+ * to mimic that as closely as possible.
+ */
+migrate_set_capability(*from, "events", true);
+migrate_set_capability(*to, "events", true);
+
 return 0;
 }
 
-- 
2.45.0


-- 
Peter Xu




Re: [PATCH v2 00/10] migration: New postcopy state, and some cleanups

2024-06-17 Thread Peter Xu
Hello,

On Mon, Jun 17, 2024 at 02:15:24PM -0400, Peter Xu wrote:
> v2:
> - Collect tags
> - Patch 3
>   - cover all states in migration_postcopy_is_alive()
> - Patch 4 (old)
>   - English changes [Fabiano]
>   - Split the migration_incoming_state_setup() cleanup into a new patch
> [Fabiano]
>   - Drop RECOVER_SETUP in fill_destination_migration_info() [Fabiano]
>   - Keep using explicit state check in migrate_fd_connect() for resume
> [Fabiano]
> - New patches
>   - New doc update: "migration/docs: Update postcopy recover session for
> SETUP phase"
>   - New test case: last four patches

I just found that this won't apply on top of latest master, and it also
has a trivial conflict against the direct-io work.  Fabiano, I'll wait a
few days for comments, then resend v3 on top of your direct-io series.

Meanwhile I also plan to squash the fixup below into the last test patch,
to fix a spelling error I just found, and to rename the test cases (as the
new test is actually also a "double failure" test, just at a different
phase).  Comments welcome on that fixup even before a repost.

===8<===
From 5b8fbc3a9d9e87ebfef1a3e5592fd196eecd5923 Mon Sep 17 00:00:00 2001
From: Peter Xu 
Date: Mon, 17 Jun 2024 14:40:15 -0400
Subject: [PATCH] fixup! tests/migration-tests: Cover postcopy failure on
 reconnect

Signed-off-by: Peter Xu 
---
 tests/qtest/migration-test.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index a4fed4cc6b..fe33b86783 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1474,7 +1474,7 @@ static void postcopy_recover_fail(QTestState *from, QTestState *to,
 
 /*
  * Kick dest QEMU out too. This is normally not needed in reality
- * because when the channel is shutdown it should also happens on src.
+ * because when the channel is shutdown it should also happen on src.
  * However here we used separate socket pairs so we need to do that
  * explicitly.
  */
@@ -1565,7 +1565,7 @@ static void test_postcopy_recovery(void)
 test_postcopy_recovery_common();
 }
 
-static void test_postcopy_recovery_double_fail(void)
+static void test_postcopy_recovery_fail_handshake(void)
 {
 MigrateCommon args = {
 .postcopy_recovery_fail_stage = POSTCOPY_FAIL_RECOVERY,
@@ -1574,7 +1574,7 @@ static void test_postcopy_recovery_double_fail(void)
 test_postcopy_recovery_common();
 }
 
-static void test_postcopy_recovery_channel_reconnect(void)
+static void test_postcopy_recovery_fail_reconnect(void)
 {
 MigrateCommon args = {
 .postcopy_recovery_fail_stage = POSTCOPY_FAIL_CHANNEL_ESTABLISH,
@@ -3759,10 +3759,10 @@ int main(int argc, char **argv)
test_postcopy_preempt);
 migration_test_add("/migration/postcopy/preempt/recovery/plain",
test_postcopy_preempt_recovery);
-migration_test_add("/migration/postcopy/recovery/double-failures",
-   test_postcopy_recovery_double_fail);
-migration_test_add("/migration/postcopy/recovery/channel-reconnect",
-   test_postcopy_recovery_channel_reconnect);
+migration_test_add("/migration/postcopy/recovery/double-failures/handshake",
+   test_postcopy_recovery_fail_handshake);
+migration_test_add("/migration/postcopy/recovery/double-failures/reconnect",
+   test_postcopy_recovery_fail_reconnect);
 if (is_x86) {
 migration_test_add("/migration/postcopy/suspend",
test_postcopy_suspend);
-- 
2.45.0


-- 
Peter Xu




Re: [PATCH v3 16/16] tests/qtest/migration: Add a test for mapped-ram with passing of fds

2024-06-17 Thread Peter Xu
On Mon, Jun 17, 2024 at 03:57:31PM -0300, Fabiano Rosas wrote:
> Add a multifd test for mapped-ram with passing of fds into QEMU. This
> is how libvirt will consume the feature.
> 
> There are a couple of details to the fdset mechanism:
> 
> - multifd needs two distinct file descriptors (not duplicated with
>   dup()) so it can enable O_DIRECT only on the channels that do
>   aligned IO. The dup() system call creates file descriptors that
>   share status flags, of which O_DIRECT is one.
> 
> - the open() access mode flags used for the fds passed into QEMU need
>   to match the flags QEMU uses to open the file. Currently O_WRONLY
>   for src and O_RDONLY for dst.
> 
> Note that fdset code goes under _WIN32 because fd passing is not
> supported on Windows.
> 
> Signed-off-by: Fabiano Rosas 
> ---
> - dropped Peter's r-b
> 
> - stopped removing the fdset at the end of the tests. The migration
> code should always cleanup after itself.

Ah, that looks also ok.

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v3 12/16] migration/multifd: Add direct-io support

2024-06-17 Thread Peter Xu
On Mon, Jun 17, 2024 at 03:57:27PM -0300, Fabiano Rosas wrote:
> When multifd is used along with mapped-ram, we can take benefit of a
> filesystem that supports the O_DIRECT flag and perform direct I/O in
> the multifd threads. This brings a significant performance improvement
> because direct-io writes bypass the page cache which would otherwise
> be thrashed by the multifd data which is unlikely to be needed again
> in a short period of time.
> 
> To be able to use a multifd channel opened with O_DIRECT, we must
> ensure that a certain aligment is used. Filesystems usually require a
> block-size alignment for direct I/O. The way to achieve this is by
> enabling the mapped-ram feature, which already aligns its I/O properly
> (see MAPPED_RAM_FILE_OFFSET_ALIGNMENT at ram.c).
> 
> By setting O_DIRECT on the multifd channels, all writes to the same
> file descriptor need to be aligned as well, even the ones that come
> from outside multifd, such as the QEMUFile I/O from the main migration
> code. This makes it impossible to use the same file descriptor for the
> QEMUFile and for the multifd channels. The various flags and metadata
> written by the main migration code will always be unaligned by virtue
> of their small size. To workaround this issue, we'll require a second
> file descriptor to be used exclusively for direct I/O.
> 
> The second file descriptor can be obtained by QEMU by re-opening the
> migration file (already possible), or by being provided by the user or
> management application (support to be added in future patches).
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v3 01/16] migration: Drop reference to QIOChannel if file seeking fails

2024-06-17 Thread Peter Xu
On Mon, Jun 17, 2024 at 03:57:16PM -0300, Fabiano Rosas wrote:
> We forgot to drop the reference to the QIOChannel in the error path of
> the offset adjustment. Do it now.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




[PATCH v2 04/10] migration: Cleanup incoming migration setup state change

2024-06-17 Thread Peter Xu
Destination QEMU can set up incoming ports for two purposes: either a
fresh incoming migration, in which QEMU will switch to SETUP for channel
establishment, or a paused postcopy migration, in which QEMU will stay in
POSTCOPY_PAUSED until kicking off the RECOVER phase.

Currently the state machine works on the dest node for the latter case
only because migrate_set_state() implicitly becomes a noop when the
current state check fails.  That wasn't clear at all.

Clean it up by providing a helper migration_incoming_state_setup() that
does proper checks on the current status.  Postcopy-paused is now checked
explicitly, and we bail out for unknown states.

Signed-off-by: Peter Xu 
---
 migration/migration.c | 28 ++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 75c9d80e8e..59442181a1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -595,6 +595,29 @@ bool migrate_uri_parse(const char *uri, MigrationChannel **channel,
 return true;
 }
 
+static bool
+migration_incoming_state_setup(MigrationIncomingState *mis, Error **errp)
+{
+MigrationStatus current = mis->state;
+
+if (current == MIGRATION_STATUS_POSTCOPY_PAUSED) {
+/*
+ * Incoming postcopy migration will stay in PAUSED state even if
+ * reconnection happened.
+ */
+return true;
+}
+
+if (current != MIGRATION_STATUS_NONE) {
+error_setg(errp, "Illegal migration incoming state: %s",
+   MigrationStatus_str(current));
+return false;
+}
+
+migrate_set_state(>state, current, MIGRATION_STATUS_SETUP);
+return true;
+}
+
 static void qemu_start_incoming_migration(const char *uri, bool has_channels,
   MigrationChannelList *channels,
   Error **errp)
@@ -633,8 +656,9 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
 return;
 }
 
-migrate_set_state(>state, MIGRATION_STATUS_NONE,
-  MIGRATION_STATUS_SETUP);
+if (!migration_incoming_state_setup(mis, errp)) {
+return;
+}
 
 if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
 SocketAddress *saddr = >u.socket;
-- 
2.45.0




[PATCH v2 07/10] tests/migration-tests: Drop most WIN32 ifdefs for postcopy failure tests

2024-06-17 Thread Peter Xu
Most of them are not needed; we can stick with one ifdef inside
postcopy_recover_fail() to cover only the SCM_RIGHTS tricks.
The tests won't run on Windows anyway, since has_uffd is always false
there.

Signed-off-by: Peter Xu 
---
 tests/qtest/migration-test.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index b7e3406471..13b59d4c10 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1353,9 +1353,9 @@ static void wait_for_postcopy_status(QTestState *one, const char *status)
   "completed", NULL });
 }
 
-#ifndef _WIN32
 static void postcopy_recover_fail(QTestState *from, QTestState *to)
 {
+#ifndef _WIN32
 int ret, pair1[2], pair2[2];
 char c;
 
@@ -1417,8 +1417,8 @@ static void postcopy_recover_fail(QTestState *from, QTestState *to)
 close(pair1[1]);
 close(pair2[0]);
 close(pair2[1]);
+#endif
 }
-#endif /* _WIN32 */
 
 static void test_postcopy_recovery_common(MigrateCommon *args)
 {
@@ -1458,7 +1458,6 @@ static void test_postcopy_recovery_common(MigrateCommon *args)
 wait_for_postcopy_status(to, "postcopy-paused");
 wait_for_postcopy_status(from, "postcopy-paused");
 
-#ifndef _WIN32
 if (args->postcopy_recovery_test_fail) {
 /*
  * Test when a wrong socket specified for recover, and then the
@@ -1467,7 +1466,6 @@ static void test_postcopy_recovery_common(MigrateCommon *args)
 postcopy_recover_fail(from, to);
 /* continue with a good recovery */
 }
-#endif /* _WIN32 */
 
 /*
  * Create a new socket to emulate a new channel that is different
@@ -1496,7 +1494,6 @@ static void test_postcopy_recovery(void)
 test_postcopy_recovery_common();
 }
 
-#ifndef _WIN32
 static void test_postcopy_recovery_double_fail(void)
 {
 MigrateCommon args = {
@@ -1505,7 +1502,6 @@ static void test_postcopy_recovery_double_fail(void)
 
 test_postcopy_recovery_common();
 }
-#endif /* _WIN32 */
 
 #ifdef CONFIG_GNUTLS
 static void test_postcopy_recovery_tls_psk(void)
@@ -3486,10 +3482,8 @@ int main(int argc, char **argv)
test_postcopy_preempt);
 migration_test_add("/migration/postcopy/preempt/recovery/plain",
test_postcopy_preempt_recovery);
-#ifndef _WIN32
 migration_test_add("/migration/postcopy/recovery/double-failures",
test_postcopy_recovery_double_fail);
-#endif /* _WIN32 */
 if (is_x86) {
 migration_test_add("/migration/postcopy/suspend",
test_postcopy_suspend);
-- 
2.45.0




[PATCH v2 10/10] tests/migration-tests: Cover postcopy failure on reconnect

2024-06-17 Thread Peter Xu
Make sure there will be an event for postcopy recovery, regardless of
whether the reconnect succeeds, or of when the failure happens.

The newly added case fails early in postcopy recovery, in which case src
doesn't even reach the RECOVER stage (in real life the same would apply
to dest, but the test case is slightly more involved due to the dual
socketpair setup).

To do that, rename postcopy_recovery_test_fail to reflect the stage at
which to fail, instead of a boolean.

Signed-off-by: Peter Xu 
---
 tests/qtest/migration-test.c | 89 ++--
 1 file changed, 74 insertions(+), 15 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index a16b1a4824..3e237a1499 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -70,6 +70,17 @@ static QTestMigrationState dst_state;
 #define QEMU_ENV_SRC "QTEST_QEMU_BINARY_SRC"
 #define QEMU_ENV_DST "QTEST_QEMU_BINARY_DST"
 
+typedef enum PostcopyRecoveryFailStage {
+/*
+ * "no failure" must be 0 as it's the default.  OTOH, real failure
+ * cases must be >0 to make sure they trigger by a "if" test.
+ */
+POSTCOPY_FAIL_NONE = 0,
+POSTCOPY_FAIL_CHANNEL_ESTABLISH,
+POSTCOPY_FAIL_RECOVERY,
+POSTCOPY_FAIL_MAX
+} PostcopyRecoveryFailStage;
+
 #if defined(__linux__)
 #include 
 #include 
@@ -680,7 +691,7 @@ typedef struct {
 /* Postcopy specific fields */
 void *postcopy_data;
 bool postcopy_preempt;
-bool postcopy_recovery_test_fail;
+PostcopyRecoveryFailStage postcopy_recovery_fail_stage;
 } MigrateCommon;
 
 static int test_migrate_start(QTestState **from, QTestState **to,
@@ -1360,12 +1371,16 @@ static void wait_for_postcopy_status(QTestState *one, const char *status)
   "completed", NULL });
 }
 
-static void postcopy_recover_fail(QTestState *from, QTestState *to)
+static void postcopy_recover_fail(QTestState *from, QTestState *to,
+  PostcopyRecoveryFailStage stage)
 {
 #ifndef _WIN32
+bool fail_early = (stage == POSTCOPY_FAIL_CHANNEL_ESTABLISH);
 int ret, pair1[2], pair2[2];
 char c;
 
+g_assert(stage > POSTCOPY_FAIL_NONE && stage < POSTCOPY_FAIL_MAX);
+
 /* Create two unrelated socketpairs */
 ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair1);
 g_assert_cmpint(ret, ==, 0);
@@ -1399,6 +1414,14 @@ static void postcopy_recover_fail(QTestState *from, QTestState *to)
 ret = send(pair2[1], , 1, 0);
 g_assert_cmpint(ret, ==, 1);
 
+if (stage == POSTCOPY_FAIL_CHANNEL_ESTABLISH) {
+/*
+ * This will make src QEMU fail at an early stage when trying to
+ * resume later, where it shouldn't reach RECOVER stage at all.
+ */
+close(pair1[1]);
+}
+
 migrate_recover(to, "fd:fd-mig");
 migrate_qmp(from, to, "fd:fd-mig", NULL, "{'resume': true}");
 
@@ -1408,28 +1431,53 @@ static void postcopy_recover_fail(QTestState *from, QTestState *to)
  */
 migration_event_wait(from, "postcopy-recover-setup");
 
+if (fail_early) {
+/*
+ * When the reconnection fails, src QEMU will automatically go
+ * back to PAUSED state.  Make sure there is an event in this
+ * case: Libvirt relies on this to detect early reconnection
+ * errors.
+ */
+migration_event_wait(from, "postcopy-paused");
+} else {
+/*
+ * We want to test "fail later" at RECOVER stage here.  Make sure
+ * both QEMU instances will go into RECOVER stage first, then test
+ * kicking them out using migrate-pause.
+ *
+ * Explicitly check the RECOVER event on src, that's what Libvirt
+ * relies on, rather than polling.
+ */
+migration_event_wait(from, "postcopy-recover");
+wait_for_postcopy_status(from, "postcopy-recover");
+
+/* Need an explicit kick on src QEMU in this case */
+migrate_pause(from);
+}
+
 /*
- * Make sure both QEMU instances will go into RECOVER stage, then test
- * kicking them out using migrate-pause.
+ * For all failure cases, we'll reach such states on both sides now.
+ * Check them.
  */
-wait_for_postcopy_status(from, "postcopy-recover");
+wait_for_postcopy_status(from, "postcopy-paused");
 wait_for_postcopy_status(to, "postcopy-recover");
 
 /*
- * This would be issued by the admin upon noticing the hang, we should
- * make sure we're able to kick this out.
+ * Kick dest QEMU out too. This is normally not needed in reality
+ * because when the channel is shutdown it should also happens on src.
+ * However here we used separate socket pairs so we need to do that
+ * explicitly.
 

[PATCH v2 05/10] migration/postcopy: Add postcopy-recover-setup phase

2024-06-17 Thread Peter Xu
This patch adds a migration state on src called "postcopy-recover-setup".
The new state describes the intermediate step starting from when the src
QEMU receives a postcopy recovery request, until the migration channels
are properly established, but before the recovery process takes place.

The request came from Libvirt, which currently relies on the migration
state events to detect migration state changes.  That works for most of
the migration process, except for postcopy recovery failures at the
beginning.

Currently postcopy recovery only has two major states:

  - postcopy-paused: this is the state that both sides of QEMU will be in
for a long time as long as the migration channel was interrupted.

  - postcopy-recover: this is the state where both sides of QEMU handshake
with each other, preparing for a continuation of postcopy which used to
be interrupted.

The issue is that when the recovery port is invalid, the src QEMU will
take the URI/channels, notice the ports are not valid, and silently stay
in the postcopy-paused state, with no event sent to Libvirt.  In this
case, the only thing Libvirt can do is poll the migration status at some
proper interval, which is suboptimal.

Considering that this is the only case where Libvirt won't get a
notification from QEMU on such events, let's add postcopy-recover-setup
state to mimic what we have with the "setup" state of a newly initialized
migration, describing the phase of connection establishment.

With that, postcopy recovery now has two paths to go, and either path
guarantees that an event is generated.  The events will look like this
during a recovery process on src QEMU:

  - Initially when the recovery is initiated on src, QEMU will go from
"postcopy-paused" -> "postcopy-recover-setup".  Old QEMUs don't have
this event.

  - Depending on whether the channel re-establishment is succeeded:

- In succeeded case, src QEMU will move from "postcopy-recover-setup"
  to "postcopy-recover".  Old QEMUs also have this event.

- In failure case, src QEMU will move from "postcopy-recover-setup" to
  "postcopy-paused" again.  Old QEMUs don't have this event.

This guarantees that Libvirt will always receive a notification for
recovery process properly.

One thing to mention is that the new status is only needed on src QEMU,
not both.  On dest QEMU, the state machine doesn't change, hence the
events don't change either.  This is because dest QEMU may not have an
explicit point where setup starts.  E.g., it can happen that dest QEMU
doesn't use the migrate-recover command to provide a new URI/channel, but
the old URI/channels are reused in recovery, in which case the old ports
simply work again after the network routes are fixed up.

Add a new helper postcopy_is_paused() to detect whether postcopy is still
paused, taking RECOVER_SETUP into account too.  When using it on both
src/dst, a slight change is made to always wait on the semaphore before
checking the status, because on both sides a sem_post() is required for
a recovery.

Cc: Jiri Denemark 
Cc: Fabiano Rosas 
Cc: Prasad Pandit 
Buglink: https://issues.redhat.com/browse/RHEL-38485
Signed-off-by: Peter Xu 
---
 qapi/migration.json  |  4 
 migration/postcopy-ram.h |  3 +++
 migration/migration.c| 40 ++--
 migration/postcopy-ram.c |  6 ++
 migration/savevm.c   |  4 ++--
 5 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index a351fd3714..565c40b637 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -142,6 +142,9 @@
 #
 # @postcopy-paused: during postcopy but paused.  (since 3.0)
 #
+# @postcopy-recover-setup: setup phase for a postcopy recovery process,
+# preparing for a recovery phase to start.  (since 9.1)
+#
 # @postcopy-recover: trying to recover from a paused postcopy.  (since
 # 3.0)
 #
@@ -166,6 +169,7 @@
 { 'enum': 'MigrationStatus',
   'data': [ 'none', 'setup', 'cancelling', 'cancelled',
 'active', 'postcopy-active', 'postcopy-paused',
+'postcopy-recover-setup',
 'postcopy-recover', 'completed', 'failed', 'colo',
 'pre-switchover', 'device', 'wait-unplug' ] }
 ##
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index ecae941211..a6df1b2811 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -13,6 +13,8 @@
 #ifndef QEMU_POSTCOPY_RAM_H
 #define QEMU_POSTCOPY_RAM_H
 
+#include "qapi/qapi-types-migration.h"
+
 /* Return true if the host supports everything we need to do postcopy-ram */
 bool postcopy_ram_supported_by_host(MigrationIncomingState *mis,
 Error **errp);
@@ -193,5 +195,6 @@ enum PostcopyChannels {
 void postcopy_preempt_new_channel(Migrat

  1   2   3   4   5   6   7   8   9   10   >