Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer

2024-07-17 Thread Maciej S. Szmigiero

On 17.07.2024 22:19, Fabiano Rosas wrote:

Peter Xu  writes:


On Tue, Jul 16, 2024 at 10:10:12PM +0200, Maciej S. Szmigiero wrote:

On 27.06.2024 16:56, Peter Xu wrote:

On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:

On 26.06.2024 18:23, Peter Xu wrote:

On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:

On 26.06.2024 03:51, Peter Xu wrote:

On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:

On 25.06.2024 19:25, Peter Xu wrote:

On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:

Hi Peter,


Hi, Maciej,



On 23.06.2024 22:27, Peter Xu wrote:

On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

This is an updated v1 patch series of the RFC (v0) series located here:
https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/


OK I took some hours thinking about this today, and here are some high-level
comments for this series.  I'll start with the ones more relevant to what
Fabiano has already suggested in the other thread, then I'll add some more.

https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de


That's a long list, thanks for these comments.

I have responded to them inline below.
(..)


2) Submit this operation to the thread pool and wait for it to complete,


VFIO doesn't need to have its own code waiting.  If this pool is for
migration purpose in general, qemu migration framework will need to wait at
some point for all jobs to finish before moving on.  Perhaps it should be
at the end of the non-iterative session.


So essentially, instead of calling save_live_complete_precopy_end handlers
from the migration code you would like to hard-code its current VFIO
implementation of calling 
vfio_save_complete_precopy_async_thread_thread_terminate().

Only then it wouldn't be called a VFIO precopy async thread terminate function
but some generic device state async precopy thread terminate function.


I don't understand what you meant by "hard code".


"Hard code" wasn't maybe the best expression here.

I meant the move of the functionality that's provided by
vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set
to the common migration code.


I see.  That function only does a thread_join() so far.

So can I understand it as below [1] should work for us, and it'll be clean
too (with nothing to hard-code)?


It will need some signal to the worker threads pool to terminate before
waiting for them to finish (as the code in [1] just waits).

In the case of the current vfio_save_complete_precopy_async_thread() implementation,
this signal isn't necessary, as that thread simply terminates when it has read
all the data it needs from the device.

In a worker threads pool case there will be some threads waiting for
jobs to be queued to them and so they will need to be somehow signaled
to exit.


Right.  We may need something like multifd_send_should_exit() +
MultiFDSendParams.sem.  It'll be nicer if we can generalize that part so
multifd threads can also rebase to that thread model, but maybe I'm asking
too much.
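
For illustration, the exit-signal pattern described here could look like the
sketch below; the DeviceStatePool type and function names are made up for this
example, only the multifd_send_should_exit()-style flag check plus the
per-thread semaphore kick are the point:

#include "qemu/osdep.h"
#include "qemu/atomic.h"
#include "qemu/thread.h"

typedef struct DeviceStatePool {
    int num_threads;
    bool exiting;            /* set once by the migration core */
    QemuSemaphore job_sem;   /* workers block here waiting for jobs */
} DeviceStatePool;

static void pool_request_exit(DeviceStatePool *pool)
{
    /* publish the exit request before waking anyone up */
    qatomic_set(&pool->exiting, true);

    /* kick every worker so none of them stays blocked on the semaphore */
    for (int i = 0; i < pool->num_threads; i++) {
        qemu_sem_post(&pool->job_sem);
    }
}

static void *pool_worker_thread(void *opaque)
{
    DeviceStatePool *pool = opaque;

    for (;;) {
        qemu_sem_wait(&pool->job_sem);
        if (qatomic_read(&pool->exiting)) {
            break;    /* the analogue of multifd_send_should_exit() */
        }
        /* ... dequeue and process one queued job here ... */
    }
    return NULL;
}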




The time to join() the worker threads can be even later, until
migrate_fd_cleanup() on the sender side.  You may have a better idea of the
best place to do it once you start working on it.




What I was saying is if we target the worker thread pool to be used for
"concurrently dump vmstates", then it'll make sense to make sure all the
jobs there were flushed after qemu dumps all non-iterables (because this
should be the last step of the switchover).

I expect it looks like this:

 while (pool->active_threads) {
     qemu_sem_wait(&pool->job_done);
 }


[1]


(..)

I think that with this thread pool introduction we'll unfortunately almost
certainly need to target this patch set at 9.2, since these overall changes
(and Fabiano's patches too) will need good testing, might uncover some
performance regressions (for example, related to the number-of-buffers limit
or Fabiano's multifd changes), bring some review comments from other
people, etc.

In addition to that, we are in the middle of the holiday season and a lot of
people aren't available - like Fabiano, who said he will be available only in a
few weeks.


Right, that's unfortunate.  Let's see, but still I really hope we can also
get some feedback from Fabiano before it lands; even with that we have a
chance for 9.1, but it's just challenging, and it's the same condition I
mentioned since the 1st email.  And before Fabiano's back (he's the active
maintainer for this release), I'm personally happy if you can propose
something that can partly land earlier in this release.  E.g., if you want,
we can at least upstream Fabiano's idea first, or some more on top.

For that, also feel free to have a look at my comment today:

https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n

Feel free to comment there too.  There's a tiny uncertainty there so far on
specifying "max size for a device state" if we do what I suggested, as multifd
setup will need to allocate an enum buffer suitable for both ram + device.
But I think that's not an issue and you'll tackle that properly when
working on it.  It's more about whether you agree on what I said as a
general concept.

Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-17 Thread Maciej S. Szmigiero

On 17.07.2024 21:00, Peter Xu wrote:

On Tue, Jul 16, 2024 at 10:10:25PM +0200, Maciej S. Szmigiero wrote:

The comment I removed is slightly misleading to me too, because right now
active_slot contains the data that hasn't yet been delivered to multifd, so
we're "putting it back to the free list" not because it's free, but because
we know it won't get used until the multifd send thread consumes it
(because before that the thread will be busy, and we won't use the buffer
in upcoming send()s if so).

And then when I'm looking at this again, I think maybe it's a slight
overkill, and maybe we can still keep the "opaque data" managed by multifd.
One reason might be that I don't expect the "opaque data" payload to keep
growing at all: it should really be either RAM or device state, as I
commented elsewhere in a relevant thread; after all, it's a thread model
only for migration purposes, to move vmstates.


Some amount of flexibility needs to be baked in. For instance, what
about the handshake procedure? Don't we want to use multifd threads to
put some information on the wire for that as well?


Is this an orthogonal question?


I don't think so. You say the payload data should be either RAM or
device state. I'm asking what other types of data do we want the multifd
channel to transmit and suggesting we need to allow room for the
addition of that, whatever it is. One thing that comes to mind that is
neither RAM nor device state is some form of handshake or capabilities
negotiation.


The RFC version of my multifd device state transfer patch set introduced
a new migration channel header (by Avihai) for clean and extensible
migration channel handshaking, but people didn't like it, so it was removed in v1.


Hmm, I'm not sure this is relevant to the context of the discussion here, but I
confess I didn't notice the per-channel header thing in the previous RFC
series.  Link is here:

https://lore.kernel.org/r/636cec92eb801f13ba893de79d4872f5d8342097.1713269378.git.maciej.szmigi...@oracle.com


The channel header patches were dropped because Daniel didn't like them:
https://lore.kernel.org/qemu-devel/zh-kf72fe9ov6...@redhat.com/
https://lore.kernel.org/qemu-devel/zh_6w8u3h4fmg...@redhat.com/


Maciej, if you want, you can split that out of the series. So far it looks
like a good thing with/without how VFIO tackles it.


Unfortunately, Avihai's channel header patches obviously impact the wire
protocol and are a bit intermingled with the rest of the device state
transfer patch set, so it would be good to know upfront whether there is
some consensus to (re)introduce this new channel header (CCed Daniel, too).


Thanks,



Thanks,
Maciej




Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters

2024-07-16 Thread Maciej S. Szmigiero

On 10.07.2024 22:16, Fabiano Rosas wrote:

Peter Xu  writes:


On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:

Peter Xu  writes:


On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:

Or graphically:

1) client fills the active slot with data. Channels point to nothing
at this point:
   [a]  <-- active slot
   [][][][] <-- free slots, one per-channel

   [][][][] <-- channels' p->data pointers

2) multifd_send() swaps the pointers inside the client slot. Channels
still point to nothing:
   []
   [a][][][]

   [][][][]

3) multifd_send() finds an idle channel and updates its pointer:


It seems the action "finds an idle channel" is in step 2 rather than step 3,
which means the free slot is selected based on the id of the channel found, am I
understanding correctly?


I think you're right.

Actually I also feel like the description here is ambiguous, even though I
think I get what Fabiano wanted to say.

Finding the free slot should be the first step of steps 2+3; what Fabiano
really wanted to suggest here is that we move the free buffer array from the
multifd channels into the callers, so the caller can pass in whatever data
to send.

So I think maybe it's cleaner to write it as this in code (note: I didn't
really change the code, just some ordering and comments):

===8<===
@@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
   */
  active_slot = slots->active;
  slots->active = slots->free[p->id];
-p->data = active_slot;
-
-/*
- * By the next time we arrive here, the channel will certainly
- * have consumed the active slot. Put it back on the free list
- * now.
- */
  slots->free[p->id] = active_slot;
  
+/* Assign the current active slot to the chosen thread */
+p->data = active_slot;
===8<===

The comment I removed is slightly misleading to me too, because right now
active_slot contains the data that hasn't yet been delivered to multifd, so
we're "putting it back to the free list" not because it's free, but because
we know it won't get used until the multifd send thread consumes it
(because before that the thread will be busy, and we won't use the buffer
in upcoming send()s if so).

And then when I'm looking at this again, I think maybe it's a slight
overkill, and maybe we can still keep the "opaque data" managed by multifd.
One reason might be that I don't expect the "opaque data" payload to keep
growing at all: it should really be either RAM or device state, as I
commented elsewhere in a relevant thread; after all, it's a thread model
only for migration purposes, to move vmstates.


Some amount of flexibility needs to be baked in. For instance, what
about the handshake procedure? Don't we want to use multifd threads to
put some information on the wire for that as well?


Is this an orthogonal question?


I don't think so. You say the payload data should be either RAM or
device state. I'm asking what other types of data do we want the multifd
channel to transmit and suggesting we need to allow room for the
addition of that, whatever it is. One thing that comes to mind that is
neither RAM nor device state is some form of handshake or capabilities
negotiation.


The RFC version of my multifd device state transfer patch set introduced
a new migration channel header (by Avihai) for clean and extensible
migration channel handshaking, but people didn't like it, so it was removed in v1.

Thanks,
Maciej




Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer

2024-07-16 Thread Maciej S. Szmigiero

On 27.06.2024 16:56, Peter Xu wrote:

On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote:

On 26.06.2024 18:23, Peter Xu wrote:

On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:

On 26.06.2024 03:51, Peter Xu wrote:

On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:

On 25.06.2024 19:25, Peter Xu wrote:

On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:

Hi Peter,


Hi, Maciej,



On 23.06.2024 22:27, Peter Xu wrote:

On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

This is an updated v1 patch series of the RFC (v0) series located here:
https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/


OK I took some hours thinking about this today, and here are some high-level
comments for this series.  I'll start with the ones more relevant to what
Fabiano has already suggested in the other thread, then I'll add some more.

https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de


That's a long list, thanks for these comments.

I have responded to them inline below.
(..)


2) Submit this operation to the thread pool and wait for it to complete,


VFIO doesn't need to have its own code waiting.  If this pool is for
migration purpose in general, qemu migration framework will need to wait at
some point for all jobs to finish before moving on.  Perhaps it should be
at the end of the non-iterative session.


So essentially, instead of calling save_live_complete_precopy_end handlers
from the migration code you would like to hard-code its current VFIO
implementation of calling 
vfio_save_complete_precopy_async_thread_thread_terminate().

Only then it wouldn't be called a VFIO precopy async thread terminate function
but some generic device state async precopy thread terminate function.


I don't understand what you meant by "hard code".


"Hard code" wasn't maybe the best expression here.

I meant the move of the functionality that's provided by
vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set
to the common migration code.


I see.  That function only does a thread_join() so far.

So can I understand it as below [1] should work for us, and it'll be clean
too (with nothing to hard-code)?


It will need some signal to the worker threads pool to terminate before
waiting for them to finish (as the code in [1] just waits).

In the case of the current vfio_save_complete_precopy_async_thread() implementation,
this signal isn't necessary, as that thread simply terminates when it has read
all the data it needs from the device.

In a worker threads pool case there will be some threads waiting for
jobs to be queued to them and so they will need to be somehow signaled
to exit.


The time to join() the worker threads can be even later, until
migrate_fd_cleanup() on the sender side.  You may have a better idea of the
best place to do it once you start working on it.




What I was saying is if we target the worker thread pool to be used for
"concurrently dump vmstates", then it'll make sense to make sure all the
jobs there were flushed after qemu dumps all non-iterables (because this
should be the last step of the switchover).

I expect it looks like this:

while (pool->active_threads) {
    qemu_sem_wait(&pool->job_done);
}


[1]


(..)

I think that with this thread pool introduction we'll unfortunately almost
certainly need to target this patch set at 9.2, since these overall changes
(and Fabiano's patches too) will need good testing, might uncover some
performance regressions (for example, related to the number-of-buffers limit
or Fabiano's multifd changes), bring some review comments from other
people, etc.

In addition to that, we are in the middle of the holiday season and a lot of
people aren't available - like Fabiano, who said he will be available only in a
few weeks.


Right, that's unfortunate.  Let's see, but still I really hope we can also
get some feedback from Fabiano before it lands; even with that we have a
chance for 9.1, but it's just challenging, and it's the same condition I
mentioned since the 1st email.  And before Fabiano's back (he's the active
maintainer for this release), I'm personally happy if you can propose
something that can partly land earlier in this release.  E.g., if you want,
we can at least upstream Fabiano's idea first, or some more on top.

For that, also feel free to have a look at my comment today:

https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n

Feel free to comment there too.  There's a tiny uncertainty there so far on
specifying "max size for a device state" if we do what I suggested, as multifd
setup will need to allocate an enum buffer suitable for both ram + device.
But I think that's not an issue and you'll tackle that properly when
working on it.  It's more about whether you agree on what I said as a
general concept.



Since it seems that the discussion on Fabiano's 

Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer

2024-06-27 Thread Maciej S. Szmigiero

On 26.06.2024 18:23, Peter Xu wrote:

On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote:

On 26.06.2024 03:51, Peter Xu wrote:

On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:

On 25.06.2024 19:25, Peter Xu wrote:

On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:

Hi Peter,


Hi, Maciej,



On 23.06.2024 22:27, Peter Xu wrote:

On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

This is an updated v1 patch series of the RFC (v0) series located here:
https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/


OK I took some hours thinking about this today, and here are some high-level
comments for this series.  I'll start with the ones more relevant to what
Fabiano has already suggested in the other thread, then I'll add some more.

https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de


That's a long list, thanks for these comments.

I have responded to them inline below.


(..)

4. Risk of OOM on unlimited VFIO buffering
==========================================

This follows from the above bullet, but my pure question to ask here is how
VFIO guarantees no OOM condition by buffering VFIO state.

I mean, currently your proposal uses vfio_load_bufs_thread() as a separate
thread to only load the vfio states once sequential data is received;
however, is there an upper limit to how much buffering it could do?  IOW:

vfio_load_state_buffer():

  if (packet->idx >= migration->load_bufs->len) {
      g_array_set_size(migration->load_bufs, packet->idx + 1);
  }

  lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
  ...
  lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
  lb->len = data_size - sizeof(*packet);
  lb->is_present = true;

What if the garray keeps growing with lb->data allocated, which triggers the
memcg limit of the process (if QEMU is in such a process)?  Or just depletes
host memory, causing an OOM kill.

I think we may need to find a way to throttle max memory usage of such
buffering.

So far this will indeed be more of a problem if it is done during the
VFIO iteration phases, but I still hope a solution can work with both the
iteration phase and the switchover phase, even if you only do that in the
switchover phase.


Unfortunately, this issue will be hard to fix since the source can
legitimately send the very first buffer (chunk) of data as the last one
(at the very end of the transmission).

In this case, the target will need to buffer nearly the whole data.

We can't stop the receive on any channel, either, since the next missing
buffer can arrive at that channel.
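
To illustrate the constraint: the load side can only ever advance over a
contiguous prefix of the received-buffer array, and everything after the
first hole must stay buffered.  A minimal sketch, with illustrative names
and locking/wakeup details omitted:

#include <glib.h>

typedef struct LoadBuf {
    bool is_present;
    void *data;
    size_t len;
} LoadBuf;

/*
 * Load buffers strictly in index order; stop at the first hole, since
 * the missing chunk may still arrive on any multifd channel.
 */
static int load_bufs_in_order(GArray *bufs, unsigned *next_idx,
                              int (*load_one)(void *data, size_t len))
{
    while (*next_idx < bufs->len) {
        LoadBuf *lb = &g_array_index(bufs, LoadBuf, *next_idx);

        if (!lb->is_present) {
            break;              /* hole: keep buffering */
        }
        if (load_one(lb->data, lb->len) < 0) {
            return -1;          /* error handling elided in this sketch */
        }
        g_free(lb->data);       /* chunk consumed, free it early */
        lb->data = NULL;
        (*next_idx)++;
    }
    return 0;
}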

However, I don't think purposely DoSing the target QEMU is a realistic
security concern in the typical live migration scenario.

I mean the source can easily force the target QEMU to exit just by
feeding it wrong migration data.

In case someone really wants to protect against the impact of
theoretically unbounded QEMU memory allocations during live migration
on the rest of the system, they can put the target QEMU process
(temporarily) into a memory-limited cgroup.


Note that I'm not worrying about DoS from a malicious src QEMU; I'm
talking exactly about the generic case where QEMU (either src or dest, in
that case normally both) is put into a memcg, and if QEMU uses too much
memory it'll literally get killed even with no DoS issue at all.

In short, we hopefully will have a design that will always work with QEMU
running in a container, without a 0.5% chance of the dest QEMU being killed, if
you see what I mean.

The upper bound of VFIO buffering will be needed so the admin can add that
on top of the memcg limit, and as long as QEMU keeps its word it'll always
work without sudden death.

I think I have some idea about resolving this problem.  That idea can
further complicate the protocol a little bit.  But before that let's see
whether we can reach an initial consensus on this matter first, on whether
this is a sane request.  In short, we'll need to start to have a
configurable size to say how much VFIO can buffer, maybe per-device, or
globally.  Then based on that we need some logic to guarantee that
over-mem won't happen, also without heavily affecting concurrency (e.g., a
single thread is definitely safe and without caching, but it can be
slower).


Here, I think I can add a per-device limit parameter on the number of
buffers received out-of-order or waiting to be loaded into the device -
with a reasonable default.
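
Such a guard could be as small as the following sketch; the
max_queued_buffers name and the error message are illustrative only, not
the final parameter from the series:

#include "qemu/osdep.h"
#include "qapi/error.h"

/* Refuse to queue more out-of-order buffers than the configured cap */
static bool buffer_limit_check(unsigned queued, unsigned max_queued_buffers,
                               Error **errp)
{
    if (queued >= max_queued_buffers) {
        error_setg(errp,
                   "too many device state buffers queued (%u >= %u)",
                   queued, max_queued_buffers);
        return false;
    }
    return true;
}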


Yes that should work.

I don't even expect people would change that, but this might be the
information people will need to know before putting it into a container if
it's larger than how qemu dynamically consumes memory here and there.
I'd expect it is still small enough so nobody will notice it (maybe a few
tens of MBs? but just wildly guessing, where tens of MBs could fall 

Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer

2024-06-26 Thread Maciej S. Szmigiero

On 26.06.2024 03:51, Peter Xu wrote:

On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote:

On 25.06.2024 19:25, Peter Xu wrote:

On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:

Hi Peter,


Hi, Maciej,



On 23.06.2024 22:27, Peter Xu wrote:

On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

This is an updated v1 patch series of the RFC (v0) series located here:
https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/


OK I took some hours thinking about this today, and here are some high-level
comments for this series.  I'll start with the ones more relevant to what
Fabiano has already suggested in the other thread, then I'll add some more.

https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de


That's a long list, thanks for these comments.

I have responded to them inline below.


(..)



3. load_state_buffer() and VFIODeviceStatePacket protocol
=========================================================

VFIODeviceStatePacket is the new protocol you introduced into multifd
packets, along with the new load_state_buffer() hook for loading such
buffers.  My question is whether it's needed at all, or whether it can be
more generic (and also easier) to just allow taking any device state in the
multifd packets, then load it with vmstate load().

I mean, vmstate_load() should really have worked on these buffers, if
after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
first flag (uint64), size as the 2nd, then (2) loading the rest of the buffer
into the VFIO kernel driver.  That is the same as what happens during the
blackout window.  It's not clear to me why load_state_buffer() is needed.
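
For reference, the stream framing being referred to is roughly what the
existing VFIO migration code emits (cf. vfio_save_block() in
hw/vfio/migration.c); a simplified sketch, with an illustrative helper name:

/* VFIO_MIG_FLAG_DEV_DATA_STATE is the existing marker from hw/vfio/migration.c */
static void vfio_put_data_chunk(QEMUFile *f, const uint8_t *buf,
                                uint64_t size)
{
    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);  /* chunk marker */
    qemu_put_be64(f, size);                          /* payload length */
    qemu_put_buffer(f, buf, size);                   /* raw device state */
}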

I also see that you're using exactly the same chunk size for such
buffering (VFIOMigration.data_buffer_size).

I think you have a "reason": VFIODeviceStatePacket and loading of the
buffer data resolved one major issue that wasn't there before but start to
have now: multifd allows concurrent arrivals of vfio buffers, even if the
buffer *must* be sequentially loaded.

That's a major pain for the current VFIO kernel ioctl design, IMHO.  I think I
used to ask nVidia people whether the VFIO get_state/set_state interface
can allow indexing or tagging of buffers, but I never got a real response.
IMHO that'll be extremely helpful for migration concurrency if
it can happen, rather than using a serialized buffer.  It means
concurrently saving/loading one VFIO device could be extremely hard, if not
impossible.


I am pretty sure that the current kernel VFIO interface requires the
buffers to be loaded in order - accidentally providing them out of order
definitely breaks the restore procedure.


Ah, I didn't mean that we need to do it with the current API.  I'm talking
about whether it's possible to have a v2 that will support those; otherwise
we'll need "workarounds" like what you're doing with the "buffer these on
dest without limit, until we receive a continuous chunk of data" trick.


Better kernel API might be possible in the long term but for now we have
to live with what we have right now.

After all, adding true unordered loading - I mean not just moving the
reassembly process from QEMU to the kernel but making the device itself
accept buffers out of order - will likely be pretty complex (requiring
adding such functionality to the device firmware, etc).


I would expect the device will need to be able to provision the device
state so it becomes smaller objects rather than one binary object, which
are then either taggable or addressable.




And even with that trick, it'll still need to be serialized on the read()
syscall so it won't scale either if the state is huge.  For that issue
there's no workaround we can do from userspace.


The read() calls for multiple VFIO devices can be issued in parallel,
and in fact they are in my patch set.


I was talking about concurrency for one device.


AFAIK with the current hardware the read speed is limited by the device
itself, so adding additional reading threads wouldn't help.

Once someone has hardware that is limited by a single reading thread,
that person can add the necessary kernel API (including unordered
loading) and then extend QEMU with such support.
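
As a sketch of that per-device parallelism (one saving thread per device,
each draining its own migration data fd; all names below are illustrative,
not from the series):

#include "qemu/osdep.h"
#include "qemu/thread.h"

typedef struct DeviceSaveCtx {
    int data_fd;        /* the device's migration data fd */
    QemuThread thread;
} DeviceSaveCtx;

static void *device_save_thread(void *opaque)
{
    DeviceSaveCtx *ctx = opaque;
    g_autofree uint8_t *buf = g_malloc(1024 * 1024);
    ssize_t n;

    /* each device drains its own state independently of the others */
    while ((n = read(ctx->data_fd, buf, 1024 * 1024)) > 0) {
        /* ... hand the chunk off to a multifd channel here ... */
    }
    return NULL;
}

static void start_device_save_threads(DeviceSaveCtx *devs, int ndevs)
{
    for (int i = 0; i < ndevs; i++) {
        qemu_thread_create(&devs[i].thread, "vfio-save",
                           device_save_thread, &devs[i],
                           QEMU_THREAD_JOINABLE);
    }
}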



(..)

4. Risk of OOM on unlimited VFIO buffering
==========================================

This follows from the above bullet, but my pure question to ask here is how
VFIO guarantees no OOM condition by buffering VFIO state.

I mean, currently your proposal uses vfio_load_bufs_thread() as a separate
thread to only load the vfio states once sequential data is received;
however, is there an upper limit to how much buffering it could do?  IOW:

vfio_load_state_buffer():

 if (packet->idx >= migration->load_bufs->len) {
     g_array_set_size(migration->load_bufs, packet->idx + 1);
 }

 lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);

Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer

2024-06-25 Thread Maciej S. Szmigiero

On 25.06.2024 19:25, Peter Xu wrote:

On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote:

Hi Peter,


Hi, Maciej,



On 23.06.2024 22:27, Peter Xu wrote:

On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

This is an updated v1 patch series of the RFC (v0) series located here:
https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/


OK I took some hours thinking about this today, and here are some high-level
comments for this series.  I'll start with the ones more relevant to what
Fabiano has already suggested in the other thread, then I'll add some more.

https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de


That's a long list, thanks for these comments.

I have responded to them inline below.


(..)



3. load_state_buffer() and VFIODeviceStatePacket protocol
=========================================================

VFIODeviceStatePacket is the new protocol you introduced into multifd
packets, along with the new load_state_buffer() hook for loading such
buffers.  My question is whether it's needed at all, or whether it can be
more generic (and also easier) to just allow taking any device state in the
multifd packets, then load it with vmstate load().

I mean, vmstate_load() should really have worked on these buffers, if
after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
first flag (uint64), size as the 2nd, then (2) loading the rest of the buffer
into the VFIO kernel driver.  That is the same as what happens during the
blackout window.  It's not clear to me why load_state_buffer() is needed.

I also see that you're using exactly the same chunk size for such
buffering (VFIOMigration.data_buffer_size).

I think you have a "reason": VFIODeviceStatePacket and loading of the
buffer data resolved one major issue that wasn't there before but start to
have now: multifd allows concurrent arrivals of vfio buffers, even if the
buffer *must* be sequentially loaded.

That's a major pain for the current VFIO kernel ioctl design, IMHO.  I think I
used to ask nVidia people whether the VFIO get_state/set_state interface
can allow indexing or tagging of buffers, but I never got a real response.
IMHO that'll be extremely helpful for migration concurrency if
it can happen, rather than using a serialized buffer.  It means
concurrently saving/loading one VFIO device could be extremely hard, if not
impossible.


I am pretty sure that the current kernel VFIO interface requires the
buffers to be loaded in order - accidentally providing them out of order
definitely breaks the restore procedure.


Ah, I didn't mean that we need to do it with the current API.  I'm talking
about whether it's possible to have a v2 that will support those; otherwise
we'll need "workarounds" like what you're doing with the "buffer these on
dest without limit, until we receive a continuous chunk of data" trick.


Better kernel API might be possible in the long term but for now we have
to live with what we have right now.

After all, adding true unordered loading - I mean not just moving the
reassembly process from QEMU to the kernel but making the device itself
accept buffers out of order - will likely be pretty complex (requiring
adding such functionality to the device firmware, etc).


And even with that trick, it'll still need to be serialized on the read()
syscall so it won't scale either if the state is huge.  For that issue
there's no workaround we can do from userspace.


The read() calls for multiple VFIO devices can be issued in parallel,
and in fact they are in my patch set.

(..)

4. Risk of OOM on unlimited VFIO buffering
==========================================

This follows from the above bullet, but my pure question to ask here is how
VFIO guarantees no OOM condition by buffering VFIO state.

I mean, currently your proposal uses vfio_load_bufs_thread() as a separate
thread to only load the vfio states once sequential data is received;
however, is there an upper limit to how much buffering it could do?  IOW:

vfio_load_state_buffer():

if (packet->idx >= migration->load_bufs->len) {
    g_array_set_size(migration->load_bufs, packet->idx + 1);
}

lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
...
lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
lb->len = data_size - sizeof(*packet);
lb->is_present = true;

What if the garray keeps growing with lb->data allocated, which triggers the
memcg limit of the process (if QEMU is in such a process)?  Or just depletes
host memory, causing an OOM kill.

I think we may need to find a way to throttle max memory usage of such
buffering.

So far this will indeed be more of a problem if it is done during the
VFIO iteration phases, but I still hope a solution can work with both the
iteration phase and the switchover phase, even if you only do

Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer

2024-06-24 Thread Maciej S. Szmigiero

Hi Peter,

On 23.06.2024 22:27, Peter Xu wrote:

On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

This is an updated v1 patch series of the RFC (v0) series located here:
https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/


OK I took some hours thinking about this today, and here are some high-level
comments for this series.  I'll start with the ones more relevant to what
Fabiano has already suggested in the other thread, then I'll add some more.

https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de


That's a long list, thanks for these comments.

I have responded to them inline below.


1. Multifd device state support
===============================

As Fabiano suggested in his RFC post, we may need one more layer of
abstraction to represent VFIO's demand on allowing multifd to send
arbitrary buffers to the wire.  This can be more than "how to pass the
device state buffer to the sender threads".

So far, MultiFDMethods is only about RAM.  If you pull the latest master
branch, Fabiano just merged yet another two RAM compressors that are extended
on top of the MultiFDMethods model.  However, they're still all about RAM.  I
think it's better to keep it this way, so maybe MultiFDMethods should some
day be called MultiFDRamMethods.

multifd_send_fill_packet() may only be suitable for RAM buffers, not ad-hoc
buffers like what VFIO is using.  multifd_send_zero_page_detect() may not be
needed either for arbitrary buffers.  Most of those are still page-based.

I think it also means we shouldn't call ->send_prepare() when the multifd send
thread notices that it's going to send a VFIO buffer.  So it should look
like this:

    int type = multifd_payload_type(p->data);

    if (type == MULTIFD_PAYLOAD_RAM) {
        multifd_send_state->ops->send_prepare(p, &local_err);
    } else {
        // VFIO buffers should belong here
        assert(type == MULTIFD_PAYLOAD_DEVICE_STATE);
        ...
    }

It also means it shouldn't contain code like:

nocomp_send_prepare():
 if (p->is_device_state_job) {
 return nocomp_send_prepare_device_state(p, errp);
 } else {
 return nocomp_send_prepare_ram(p, errp);
 }

nocomp should only exist in the RAM world, not VFIO's.

And it looks like you agree with Fabiano's RFC proposal, so please work on top
of that to provide that layer.  Please make sure it outputs the minimum in
"$ git grep device_state migration/multifd.c" when you work on the new
version.  Currently:

$ git grep device_state migration/multifd.c | wc -l
59

The hope is zero, or at least a minimum with good reasons.


I guess you mean "grep -i" in the above example, since otherwise
the above command will find only lowercase "device_state".

On the other hand, your example code above has uppercase
"DEVICE_STATE", suggesting that it might be okay?

Overall, using Fabiano's patch set as a base for mine makes sense to me.


2. Frequent mallocs/frees
=========================

Fabiano's series can also help to address some of these, but it looks like
this series uses malloc/free for more than the opaque data buffer.  This is not
required to get things merged, but it'll be nice to avoid those if possible.


Ack - as long as it's not making the code messy/fragile, of course.


3. load_state_buffer() and VFIODeviceStatePacket protocol
=========================================================

VFIODeviceStatePacket is the new protocol you introduced into multifd
packets, along with the new load_state_buffer() hook for loading such
buffers.  My question is whether it's needed at all, or whether it can be
more generic (and also easier) to just allow taking any device state in the
multifd packets, then load it with vmstate load().

I mean, vmstate_load() should really have worked on these buffers, if
after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
first flag (uint64), size as the 2nd, then (2) loading the rest of the buffer
into the VFIO kernel driver.  That is the same as what happens during the
blackout window.  It's not clear to me why load_state_buffer() is needed.

I also see that you're using exactly the same chunk size for such
buffering (VFIOMigration.data_buffer_size).

I think you have a "reason": VFIODeviceStatePacket and loading of the
buffer data resolved one major issue that wasn't there before but start to
have now: multifd allows concurrent arrivals of vfio buffers, even if the
buffer *must* be sequentially loaded.

That's a major pain for the current VFIO kernel ioctl design, IMHO.  I think I
used to ask nVidia people whether the VFIO get_state/set_state interface
can allow indexing or tagging of buffers, but I never got a real response.
IMHO that'll be extremely helpful for migration concurrency if
it can happen, rather than using a serialized buffer.  It means
concurrently saving/loading one VFIO device could be extremely hard, 

Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots

2024-06-23 Thread Maciej S. Szmigiero

On 21.06.2024 22:54, Peter Xu wrote:

On Fri, Jun 21, 2024 at 07:40:01PM +0200, Maciej S. Szmigiero wrote:

On 21.06.2024 17:56, Peter Xu wrote:

On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:

On 21.06.2024 17:04, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


Hi Fabiano,

On 20.06.2024 23:21, Fabiano Rosas wrote:

Hi folks,

First of all, apologies for the roughness of the series. I'm off for
the next couple of weeks and wanted to put something together early
for your consideration.

This series is a refactoring (based on an earlier, off-list
attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
the multifd core. If we're going to add support for more data types to
multifd, we first need to clean that up.

This time around this work was prompted by Maciej's series[1]. I see
you're having to add a bunch of is_device_state checks to work around
the rigidity of the code.

Aside from the VFIO work, there is also the intent (coming back from
Juan's ideas) to make multifd the default code path for migration,
which will have to include the vmstate migration and anything else we
put on the stream via QEMUFile.

I have long since been bothered by having 'pages' sprinkled all over
the code, so I might be coming at this with a bit of a narrow focus,
but I believe in order to support more types of payloads in multifd,
we need to first allow the scheduling at multifd_send_pages() to be
independent of MultiFDPages_t. So here it is. Let me know what you
think.


Thanks for the patch set, I quickly glanced at these patches and they
definitely make sense to me.


(..)

(as I said, I'll be off for a couple of weeks, so feel free to
incorporate any of this code if it's useful. Or to ignore it
completely).


I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
feature freeze in about a month, correct?



For general code improvements like this I'm not thinking about QEMU
releases at all. But this series is not super complex, so I could
imagine us merging it in time for 9.1 if we reach an agreement.

Are you thinking your series might miss the target? Or have concerns
over the stability of the refactoring? We can within reason merge code
based on the current framework and improve things on top, we already did
something similar when merging zero-page support. I don't have an issue
with that.


The reason that I asked whether you are targeting 9.1 is because my
patch set is definitely targeting that release.

At the same time my patch set will need to be rebased/refactored on top
of this patch set if it is supposed to be merged for 9.1 too.

If this patch set gets merged quickly that's not really a problem.

On the other hand, if another iteration (or iterations) is needed AND you are
not available in the coming weeks to work on them, then there's a question
of whether we will make the required deadline.


I think it's a bit of a rush to merge the vfio series in this release.  I'm not
sure it has enough time to be properly reviewed, reposted, retested, etc.

I've already started looking at it, and so far I have doubts not
only about the agreement with Fabiano on the device_state thing, which I prefer
to avoid, but I'm also thinking of any possible way to at least make the
worker threads generic too: a direct impact could be vDPA in the near
future if anyone cares, and I don't want modules to create threads
randomly during migration.

Meanwhile I'm also thinking about whether "the thread needs to dump all
data, and during iteration we can't do that" is a good reason not to
support that during iterations.

I haven't replied yet because I don't think I've thought all things through,
but I'll get there.

So I'm not saying that the design is problematic, but IMHO it's just not
mature enough to assume it will land in 9.1, considering it's still a large
one, and the first non-RFC version was just posted two days ago.



The RFC version was posted more than 2 months ago.

It has received some review comments from multiple people,
all of which were addressed in this patch set version.


I thought it was mostly me who reviewed it, am I right?  Or do you have
another thread where such a discussion happened, and where the design review
was properly done and reached an agreement?


Daniel P. Berrangé also submitted a few comments: [1], [2], [3], [4], [5].
In fact, it was he who first suggested not having a new channel header
wire format or dedicated device state channels.

In addition to that, Avihai was also following our discussions: [6] and
he also looked privately at an early (but functioning) draft of these
patches well before the RFC was even publicly posted.


IMHO that is also not how an RFC works.

It doesn't work like "if the RFC didn't get NACKed, a maintainer should merge
v1 when someone posts it".  Instead, an RFC should only mean these, at least to
me: "(1) please review this from a high level, things can drastically change;
(2) please don't merge this, because it is not f

Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots

2024-06-21 Thread Maciej S. Szmigiero

On 21.06.2024 17:56, Peter Xu wrote:

On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:

On 21.06.2024 17:04, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


Hi Fabiano,

On 20.06.2024 23:21, Fabiano Rosas wrote:

Hi folks,

First of all, apologies for the roughness of the series. I'm off for
the next couple of weeks and wanted to put something together early
for your consideration.

This series is a refactoring (based on an earlier, off-list
attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
the multifd core. If we're going to add support for more data types to
multifd, we first need to clean that up.

This time around this work was prompted by Maciej's series[1]. I see
you're having to add a bunch of is_device_state checks to work around
the rigidity of the code.

Aside from the VFIO work, there is also the intent (coming back from
Juan's ideas) to make multifd the default code path for migration,
which will have to include the vmstate migration and anything else we
put on the stream via QEMUFile.

I have long since been bothered by having 'pages' sprinkled all over
the code, so I might be coming at this with a bit of a narrow focus,
but I believe in order to support more types of payloads in multifd,
we need to first allow the scheduling at multifd_send_pages() to be
independent of MultiFDPages_t. So here it is. Let me know what you
think.


Thanks for the patch set, I quickly glanced at these patches and they
definitely make sense to me.


(..)

(as I said, I'll be off for a couple of weeks, so feel free to
incorporate any of this code if it's useful. Or to ignore it
completely).


I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
feature freeze in about a month, correct?



For general code improvements like this I'm not thinking about QEMU
releases at all. But this series is not super complex, so I could
imagine us merging it in time for 9.1 if we reach an agreement.

Are you thinking your series might miss the target? Or have concerns
over the stability of the refactoring? We can within reason merge code
based on the current framework and improve things on top, we already did
something similar when merging zero-page support. I don't have an issue
with that.


The reason that I asked whether you are targeting 9.1 is because my
patch set is definitely targeting that release.

At the same time my patch set will need to be rebased/refactored on top
of this patch set if it is supposed to be merged for 9.1 too.

If this patch set gets merged quickly that's not really a problem.

On the other hand, if another iteration (or iterations) is needed AND you are
not available in the coming weeks to work on them, then there's a question
of whether we will make the required deadline.


I think it's a bit of a rush to merge the vfio series in this release.  I'm not
sure it has enough time to be properly reviewed, reposted, retested, etc.

I've already started looking at it, and so far I have doubts not
only about the agreement with Fabiano on the device_state thing, which I prefer
to avoid, but I'm also thinking of any possible way to at least make the
worker threads generic too: a direct impact could be vDPA in the near
future if anyone cares, and I don't want modules to create threads
randomly during migration.

Meanwhile I'm also thinking about whether "the thread needs to dump all
data, and during iteration we can't do that" is a good reason not to
support that during iterations.

I haven't replied yet because I don't think I've thought all things through,
but I'll get there.

So I'm not saying that the design is problematic, but IMHO it's just not
mature enough to assume it will land in 9.1, considering it's still a large
one, and the first non-RFC version was just posted two days ago.



The RFC version was posted more than 2 months ago.

It has received some review comments from multiple people,
all of which were addressed in this patch set version.

I have not received any further comments during these 2 months, so I thought
the overall design was considered okay - if anything, there might be minor
code comments/issues, but these can easily be improved/fixed in the 5 weeks
remaining until the soft code freeze for 9.1.


If anything, I think that the VM live phase (non-downtime) transfers
functionality should be deferred until 9.2 because:
* It wasn't a part of the RFC, so even if implemented today it would get much
less testing overall,

* It's orthogonal to the switchover time device state transfer functionality
introduced by this patch set and could be added on top of that without
changing the wire protocol for switchover time device state transfers,

* It doesn't impact the switchover downtime so in this case 9.1 would
already contain all what's necessary to improve it.


Thanks,
Maciej




Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots

2024-06-21 Thread Maciej S. Szmigiero

On 21.06.2024 17:04, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


Hi Fabiano,

On 20.06.2024 23:21, Fabiano Rosas wrote:

Hi folks,

First of all, apologies for the roughness of the series. I'm off for
the next couple of weeks and wanted to put something together early
for your consideration.

This series is a refactoring (based on an earlier, off-list
attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
the multifd core. If we're going to add support for more data types to
multifd, we first need to clean that up.

This time around this work was prompted by Maciej's series[1]. I see
you're having to add a bunch of is_device_state checks to work around
the rigidity of the code.

Aside from the VFIO work, there is also the intent (coming back from
Juan's ideas) to make multifd the default code path for migration,
which will have to include the vmstate migration and anything else we
put on the stream via QEMUFile.

I have long since been bothered by having 'pages' sprinkled all over
the code, so I might be coming at this with a bit of a narrow focus,
but I believe in order to support more types of payloads in multifd,
we need to first allow the scheduling at multifd_send_pages() to be
independent of MultiFDPages_t. So here it is. Let me know what you
think.


Thanks for the patch set, I quickly glanced at these patches and they
definitely make sense to me.


(..)

(as I said, I'll be off for a couple of weeks, so feel free to
incorporate any of this code if it's useful. Or to ignore it
completely).


I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
feature freeze in about a month, correct?



For general code improvements like this I'm not thinking about QEMU
releases at all. But this series is not super complex, so I could
imagine us merging it in time for 9.1 if we reach an agreement.

Are you thinking your series might miss the target? Or have concerns
over the stability of the refactoring? We can within reason merge code
based on the current framework and improve things on top, we already did
something similar when merging zero-page support. I don't have an issue
with that.


The reason that I asked whether you are targeting 9.1 is because my
patch set is definitely targeting that release.

At the same time my patch set will need to be rebased/refactored on top
of this patch set if it is supposed to be merged for 9.1 too.

If this patch set gets merged quickly that's not really a problem.

On the other hand, if another iteration (or iterations) is needed AND you are
not available in the coming weeks to work on them, then there's a
question of whether we will make the required deadline.

Thanks,
Maciej




Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots

2024-06-21 Thread Maciej S. Szmigiero

Hi Fabiano,

On 20.06.2024 23:21, Fabiano Rosas wrote:

Hi folks,

First of all, apologies for the roughness of the series. I'm off for
the next couple of weeks and wanted to put something together early
for your consideration.

This series is a refactoring (based on an earlier, off-list
attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
the multifd core. If we're going to add support for more data types to
multifd, we first need to clean that up.

This time around this work was prompted by Maciej's series[1]. I see
you're having to add a bunch of is_device_state checks to work around
the rigidity of the code.

Aside from the VFIO work, there is also the intent (coming back from
Juan's ideas) to make multifd the default code path for migration,
which will have to include the vmstate migration and anything else we
put on the stream via QEMUFile.

I have long since been bothered by having 'pages' sprinkled all over
the code, so I might be coming at this with a bit of a narrow focus,
but I believe in order to support more types of payloads in multifd,
we need to first allow the scheduling at multifd_send_pages() to be
independent of MultiFDPages_t. So here it is. Let me know what you
think.


Thanks for the patch set, I quickly glanced at these patches and they
definitely make sense to me.

I guess its latest version can be found in the repo at [2], since
that's where the CI run mentioned below took it from?


(as I said, I'll be off for a couple of weeks, so feel free to
incorporate any of this code if it's useful. Or to ignore it
completely).


I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
feature freeze in about a month, correct?


CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028

0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigi...@oracle.com


[2]: https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple

Thanks,
Maciej




[PATCH v1 00/13] Multifd device state transfer support with VFIO consumer

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This is an updated v1 patch series of the RFC (v0) series located here:
https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/


Changes from the RFC (v0):
* Extend the existing multifd packet format instead of introducing a new
migration channel header.

* As the replacement for switching the migration channel header on or off,
introduce an "x-migration-multifd-transfer" VFIO device property instead that
allows configuring at runtime whether to send the particular device state
via multifd channels when live migrating that device.

This property defaults to "false" for bit stream compatibility with older
QEMU versions (an illustrative invocation is shown after this list).

* Remove the support for having dedicated device state transfer multifd
channels since the same downtime performance can be attained by simply
reducing the total number of multifd channels in a shared channel
configuration to the number of channels available for RAM transfer in
the dedicated device state channels configuration.

For example, the best downtime from the dedicated device state config
on my setup (achieved in configuration of 10 total multifd channels /
4 dedicated device state channels) can also be achieved in the
shared RAM/device state channel configuration by reducing the total
multifd channel count to 6.

It looks like not having too many RAM transfer multifd channels is
key to having a good downtime since the results are reproducibly
worse with 15 shared channels total, while they are as good as with
6 shared channels if there are 15 total channels but 4 of them are
dedicated to transferring device state (leaving 11 for RAM transfer).

* Make the next multifd channel selection more fair when converting
multifd_send_pages::next_channel to atomic.

* Convert the code to use QEMU thread sync primitives (QemuMutex with
QemuLockable, QemuCond) instead of their Glib equivalents (GMutex,
GMutexLocker and GCond).

* Rename complete_precopy_async{,_wait} to complete_precopy_{begin,_end} as
suggested.

* Rebase onto the last week's QEMU git master and retest.
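
As an illustration of how the property would be used (the PCI address below
is just an example, not from the series):

# opt a VFIO device into multifd device state transfer at runtime
qemu-system-x86_64 ... \
    -device vfio-pci,host=0000:af:00.1,x-migration-multifd-transfer=on

Since the property is per-device, a mixed configuration (some devices opted
in, some not) is possible.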


When working on the updated patch set version I also investigated the
possibility of refactoring VM live phase (non-downtime) transfers to
happen via multifd channels.

However, the VM live phase transfer works differently: it happens
opportunistically until the remaining data drops below the switchover
threshold, rather than always transferring the whole device state data
until it is exhausted.

For this reason, there would need to be some way in the migration
framework to update the remaining data estimate from the per-device
saving/transfer queuing threads and then stop these threads when the
decision has been reached in the migration core to stop the VM and
switch over.  Such functionality would need to be introduced first.

There would also need to be some fairness guarantees so every device
gets similar access to the multifd channels - otherwise there could be a
situation where the remaining data never drops below the switchover
threshold because some devices are starved of access to the multifd
transfer channels, as in the VM live phase additional device data is
constantly being generated.

Moreover, there's nothing stopping a QEMU device driver from requiring
different handling (loading, etc.) of VM live phase data than of the
post-switchover data.
For cases like this, some kind of new device VM live phase incoming
data load handler would need to be introduced too.

For the above reasons, the VM live phase multifd transfer functionality
isn't a simple extension of the functionality introduced by this patch
set.


For convenience, this patch set is also available as a git tree:
https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio


Maciej S. Szmigiero (13):
  vfio/migration: Add save_{iterate,complete_precopy}_started trace
events
  migration/ram: Add load start trace event
  migration/multifd: Zero p->flags before starting filling a packet
  migration: Add save_live_complete_precopy_{begin,end} handlers
  migration: Add qemu_loadvm_load_state_buffer() and its handler
  migration: Add load_finish handler and associated functions
  migration/multifd: Device state transfer support - receive side
  migration/multifd: Convert multifd_send_pages::next_channel to atomic
  migration/multifd: Device state transfer support - send side
  migration/multifd: Add migration_has_device_state_support()
  vfio/migration: Multifd device state transfer support - receive side
  vfio/migration: Add x-migration-multifd-transfer VFIO property
  vfio/migration: Multifd device state transfer support - send side

 hw/vfio/migration.c   | 545 +-
 hw/vfio/pci.c |   7 +
 hw/vfio/trace-events  |  15 +-
 include/hw/vfio/vfio-common.h |  27 ++
 include/migration/misc.h  |   5 +
 include/migration/register.h  |  70 +
 migration/migration.c |   6 +
 mi

[PATCH v1 02/13] migration/ram: Add load start trace event

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

There's a RAM load complete trace event but there was no start equivalent.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/ram.c| 1 +
 migration/trace-events | 1 +
 2 files changed, 2 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index ceea586b06ba..87b0cf86db0c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4129,6 +4129,7 @@ static int ram_load_precopy(QEMUFile *f)
   RAM_SAVE_FLAG_ZERO);
 }
 
+trace_ram_load_start();
 while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
 ram_addr_t addr;
 void *host = NULL, *host_bak = NULL;
diff --git a/migration/trace-events b/migration/trace-events
index 0b7c3324fb5e..43dfe4a4bc03 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -115,6 +115,7 @@ colo_flush_ram_cache_end(void) ""
 save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
+ram_load_start(void) ""
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
 ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"



[PATCH v1 08/13] migration/multifd: Convert multifd_send_pages::next_channel to atomic

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This is necessary for multifd_send_pages() to be able to be called
from multiple threads.
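
As an illustration, the core of the claim loop this patch introduces
works like below (a minimal sketch, simplified from the diff that
follows - the real loop also checks the channel's pending_job state):

    int i, i_next;

    do {
        i = qatomic_load_acquire(&next_channel);
        i_next = (i + 1) % migrate_multifd_channels();
        /* only the thread whose cmpxchg succeeds claims slot i */
    } while (qatomic_cmpxchg(&next_channel, i, i_next) != i);

Without the cmpxchg, two threads could read the same next_channel value
and pick the same channel.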

Signed-off-by: Maciej S. Szmigiero 
---
 migration/multifd.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 6e0af84bb9a1..daa34172bf24 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -614,26 +614,38 @@ static bool multifd_send_pages(void)
 return false;
 }
 
-/* We wait here, until at least one channel is ready */
-qemu_sem_wait(&multifd_send_state->channels_ready);
-
 /*
  * next_channel can remain from a previous migration that was
  * using more channels, so ensure it doesn't overflow if the
  * limit is lower now.
  */
-next_channel %= migrate_multifd_channels();
-for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
+i = qatomic_load_acquire(&next_channel);
+if (unlikely(i >= migrate_multifd_channels())) {
+qatomic_cmpxchg(&next_channel, i, 0);
+}
+
+/* We wait here, until at least one channel is ready */
+qemu_sem_wait(&multifd_send_state->channels_ready);
+
+while (true) {
+int i_next;
+
 if (multifd_send_should_exit()) {
 return false;
 }
+
+i = qatomic_load_acquire(&next_channel);
+i_next = (i + 1) % migrate_multifd_channels();
+if (qatomic_cmpxchg(&next_channel, i, i_next) != i) {
+continue;
+}
+
 p = &multifd_send_state->params[i];
 /*
  * Lockless read to p->pending_job is safe, because only multifd
  * sender thread can clear it.
  */
 if (qatomic_read(&p->pending_job) == false) {
-next_channel = (i + 1) % migrate_multifd_channels();
 break;
 }
 }



[PATCH v1 06/13] migration: Add load_finish handler and associated functions

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

The load_finish SaveVMHandler allows the migration code to poll whether
a device-specific asynchronous device state loading operation has finished.

In order to avoid calling this handler needlessly the device is supposed
to notify the migration code of its possible readiness via a call to
qemu_loadvm_load_finish_ready_broadcast() while holding
qemu_loadvm_load_finish_ready_lock.
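
As an illustration of the intended calling convention, a device could
implement the handler pair like below (a minimal sketch; the mydev_*
names and the MyDevState type are hypothetical, not part of this patch):

    static int mydev_load_finish(void *opaque, bool *is_finished, Error **errp)
    {
        MyDevState *s = opaque;

        /* called with qemu_loadvm_load_finish_ready_lock held */
        *is_finished = s->load_complete;
        return 0;
    }

    /* ...and in the device's asynchronous loading thread, once done: */
    qemu_loadvm_load_finish_ready_lock();
    s->load_complete = true;
    qemu_loadvm_load_finish_ready_broadcast();
    qemu_loadvm_load_finish_ready_unlock();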

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/register.h | 21 +++
 migration/migration.c|  6 +
 migration/migration.h|  3 +++
 migration/savevm.c   | 52 
 migration/savevm.h   |  4 +++
 5 files changed, 86 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index ce7641c90cea..7c20a9fb86ff 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -276,6 +276,27 @@ typedef struct SaveVMHandlers {
 int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
  Error **errp);
 
+/**
+ * @load_finish
+ *
+ * Poll whether all asynchronous device state loading had finished.
+ * Not called on the load failure path.
+ *
+ * Called while holding the qemu_loadvm_load_finish_ready_lock.
+ *
+ * If this method signals "not ready" then it might not be called
+ * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
+ * while holding qemu_loadvm_load_finish_ready_lock.
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @is_finished: whether the loading had finished (output parameter)
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ * It's not an error that the loading still hasn't finished.
+ */
+int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
+
 /**
  * @load_setup
  *
diff --git a/migration/migration.c b/migration/migration.c
index e1b269624c01..ff149e00132f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -236,6 +236,9 @@ void migration_object_init(void)
 
 current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR;
 
+qemu_mutex_init(&current_incoming->load_finish_ready_mutex);
+qemu_cond_init(&current_incoming->load_finish_ready_cond);
+
 migration_object_check(current_migration, &error_fatal);
 
 ram_mig_init();
@@ -387,6 +390,9 @@ void migration_incoming_state_destroy(void)
 mis->postcopy_qemufile_dst = NULL;
 }
 
+qemu_mutex_destroy(&mis->load_finish_ready_mutex);
+qemu_cond_destroy(&mis->load_finish_ready_cond);
+
 yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index 6af01362d424..0f2716ac42c6 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -230,6 +230,9 @@ struct MigrationIncomingState {
 
 /* Do exit on incoming migration failure */
 bool exit_on_error;
+
+QemuCond load_finish_ready_cond;
+QemuMutex load_finish_ready_mutex;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index 2e538cb02936..46cfb73eae79 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3020,6 +3020,37 @@ int qemu_loadvm_state(QEMUFile *f)
 return ret;
 }
 
+qemu_loadvm_load_finish_ready_lock();
+while (!ret) { /* Don't call load_finish() handlers on the load failure path */
+bool all_ready = true;
+SaveStateEntry *se = NULL;
+
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+bool this_ready;
+
+if (!se->ops || !se->ops->load_finish) {
+continue;
+}
+
+ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
+if (ret) {
+error_report_err(local_err);
+
+qemu_loadvm_load_finish_ready_unlock();
+return -EINVAL;
+} else if (!this_ready) {
+all_ready = false;
+}
+}
+
+if (all_ready) {
+break;
+}
+
+qemu_cond_wait(&mis->load_finish_ready_cond, &mis->load_finish_ready_mutex);
+}
+qemu_loadvm_load_finish_ready_unlock();
+
 if (ret == 0) {
 ret = qemu_file_get_error(f);
 }
@@ -3124,6 +3155,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
 return 0;
 }
 
+void qemu_loadvm_load_finish_ready_lock(void)
+{
+MigrationIncomingState *mis = migration_incoming_get_current();
+
+qemu_mutex_lock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_unlock(void)
+{
+MigrationIncomingState *mis = migration_incoming_get_current();
+
+qemu_mutex_unlock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_broadcast(void)

[PATCH v1 05/13] migration: Add qemu_loadvm_load_state_buffer() and its handler

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.
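
For illustration, a sketch of a call site (the idstr "0000:af:00.1" and
instance id 0 are placeholder values here):

    Error *local_err = NULL;

    if (qemu_loadvm_load_state_buffer("0000:af:00.1", 0,
                                      buf, buf_len, &local_err) != 0) {
        error_report_err(local_err);
    }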

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/register.h | 15 +++
 migration/savevm.c   | 25 +
 migration/savevm.h   |  3 +++
 3 files changed, 43 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index f7b3df71..ce7641c90cea 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -261,6 +261,21 @@ typedef struct SaveVMHandlers {
  */
 int (*load_state)(QEMUFile *f, void *opaque, int version_id);
 
+/**
+ * @load_state_buffer
+ *
+ * Load device state buffer provided to qemu_loadvm_load_state_buffer().
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @data: the data buffer to load
+ * @data_size: the data length in buffer
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ */
+int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
+ Error **errp);
+
 /**
  * @load_setup
  *
diff --git a/migration/savevm.c b/migration/savevm.c
index 56fb1c4c2563..2e538cb02936 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3099,6 +3099,31 @@ int qemu_loadvm_approve_switchover(void)
 return migrate_send_rp_switchover_ack(mis);
 }
 
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+  char *buf, size_t len, Error **errp)
+{
+SaveStateEntry *se;
+
+se = find_se(idstr, instance_id);
+if (!se) {
+error_setg(errp, "Unknown idstr %s or instance id %u for load state 
buffer",
+   idstr, instance_id);
+return -1;
+}
+
+if (!se->ops || !se->ops->load_state_buffer) {
+error_setg(errp, "idstr %s / instance %u has no load state buffer 
operation",
+   idstr, instance_id);
+return -1;
+}
+
+if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) {
+return -1;
+}
+
+return 0;
+}
+
 bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
   bool has_devices, strList *devices, Error **errp)
 {
diff --git a/migration/savevm.h b/migration/savevm.h
index 9ec96a995c93..d388f1bfca98 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
 bool in_postcopy, bool inactivate_disks);
 
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+  char *buf, size_t len, Error **errp);
+
 #endif



[PATCH v1 07/13] migration/multifd: Device state transfer support - receive side

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Add basic support for receiving device state via multifd channels -
channels that are shared with RAM transfers.

To differentiate between a device state and a RAM packet the packet
header is read first.

Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header, either device state (MultiFDPacketDeviceState_t) or RAM
data (the existing MultiFDPacket_t) is then read.

The received device state data is provided to
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.
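
In rough pseudo-C, the receive path added here does the following
(a sketch only - the real logic lives in the channel receive thread
and the unfill helpers in the diff below):

    MultiFDPacketHdr_t hdr;

    /* 1. read just the fixed-size packet header */
    /* ... receive sizeof(hdr) bytes from the channel ... */
    multifd_recv_unfill_packet_header(p, &hdr, &local_err);

    if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
        /* 2a. read the rest of MultiFDPacketDeviceState_t plus the state
           buffer and hand it to qemu_loadvm_load_state_buffer() */
    } else {
        /* 2b. read the rest of MultiFDPacket_t, i.e. a regular RAM packet */
    }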

Signed-off-by: Maciej S. Szmigiero 
---
 migration/multifd.c | 123 +---
 migration/multifd.h |  31 ++-
 2 files changed, 134 insertions(+), 20 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index c8a5b363f7d4..6e0af84bb9a1 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -21,6 +21,7 @@
 #include "file.h"
 #include "migration.h"
 #include "migration-stats.h"
+#include "savevm.h"
 #include "socket.h"
 #include "tls.h"
 #include "qemu-file.h"
@@ -404,7 +405,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
 uint32_t zero_num = pages->num - pages->normal_num;
 int i;
 
-packet->flags = cpu_to_be32(p->flags);
+packet->hdr.flags = cpu_to_be32(p->flags);
 packet->pages_alloc = cpu_to_be32(p->pages->allocated);
 packet->normal_pages = cpu_to_be32(pages->normal_num);
 packet->zero_pages = cpu_to_be32(zero_num);
@@ -432,28 +433,44 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
p->flags, p->next_packet_size);
 }
 
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p, MultiFDPacketHdr_t *hdr,
+ Error **errp)
 {
-MultiFDPacket_t *packet = p->packet;
-int i;
-
-packet->magic = be32_to_cpu(packet->magic);
-if (packet->magic != MULTIFD_MAGIC) {
+hdr->magic = be32_to_cpu(hdr->magic);
+if (hdr->magic != MULTIFD_MAGIC) {
 error_setg(errp, "multifd: received packet "
"magic %x and expected magic %x",
-   packet->magic, MULTIFD_MAGIC);
+   hdr->magic, MULTIFD_MAGIC);
 return -1;
 }
 
-packet->version = be32_to_cpu(packet->version);
-if (packet->version != MULTIFD_VERSION) {
+hdr->version = be32_to_cpu(hdr->version);
+if (hdr->version != MULTIFD_VERSION) {
 error_setg(errp, "multifd: received packet "
"version %u and expected version %u",
-   packet->version, MULTIFD_VERSION);
+   hdr->version, MULTIFD_VERSION);
 return -1;
 }
 
-p->flags = be32_to_cpu(packet->flags);
+p->flags = be32_to_cpu(hdr->flags);
+
+return 0;
+}
+
+static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p, Error **errp)
+{
+MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
+
+packet->instance_id = be32_to_cpu(packet->instance_id);
+p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+
+return 0;
+}
+
+static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
+{
+MultiFDPacket_t *packet = p->packet;
+int i;
 
 packet->pages_alloc = be32_to_cpu(packet->pages_alloc);
 /*
@@ -485,7 +502,6 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
 
 p->next_packet_size = be32_to_cpu(packet->next_packet_size);
 p->packet_num = be64_to_cpu(packet->packet_num);
-p->packets_recved++;
 p->total_normal_pages += p->normal_num;
 p->total_zero_pages += p->zero_num;
 
@@ -533,6 +549,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
 return 0;
 }
 
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+p->packets_recved++;
+
+if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
+return multifd_recv_unfill_packet_device_state(p, errp);
+} else {
+return multifd_recv_unfill_packet_ram(p, errp);
+}
+
+g_assert_not_reached();
+}
+
 static bool multifd_send_should_exit(void)
 {
 return qatomic_read(&multifd_send_state->exiting);
@@ -1177,8 +1206,8 @@ bool multifd_send_setup(void)
 p->packet_len = sizeof(MultiFDPacket_t)
   + sizeof(uint64_t) * page_count;
 p->packet = g_malloc0(p->packet_len);
-p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
-p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+p->packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC);
+ 

[PATCH v1 03/13] migration/multifd: Zero p->flags before starting filling a packet

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This way stale flags from a previous packet aren't left there.

p->flags can't contain SYNC to be sent at the next RAM packet since syncs
are now handled separately in multifd_send_thread.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/multifd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index f317bff07746..c8a5b363f7d4 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -933,6 +933,7 @@ static void *multifd_send_thread(void *opaque)
 if (qatomic_load_acquire(&p->pending_job)) {
 MultiFDPages_t *pages = p->pages;
 
+p->flags = 0;
 p->iovs_num = 0;
 assert(pages->num);
 
@@ -986,7 +987,6 @@ static void *multifd_send_thread(void *opaque)
 }
 /* p->next_packet_size will always be zero for a SYNC packet */
 stat64_add(&mig_stats.multifd_bytes, p->packet_len);
-p->flags = 0;
 }
 
 qatomic_set(&p->pending_sync, false);



[PATCH v1 01/13] vfio/migration: Add save_{iterate, complete_precopy}_started trace events

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This way both the start and end points of migrating a particular VFIO
device are known.

Also add a vfio_save_iterate_empty_hit trace event so it is known when
there's no more data to send for that device.

Signed-off-by: Maciej S. Szmigiero 
---
 hw/vfio/migration.c   | 13 +
 hw/vfio/trace-events  |  3 +++
 include/hw/vfio/vfio-common.h |  3 +++
 3 files changed, 19 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 34d4be2ce1b1..93f767e3c2dd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -472,6 +472,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
 return -ENOMEM;
 }
 
+migration->save_iterate_run = false;
+migration->save_iterate_empty_hit = false;
+
 if (vfio_precopy_supported(vbasedev)) {
 switch (migration->device_state) {
 case VFIO_DEVICE_STATE_RUNNING:
@@ -605,9 +608,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
 VFIOMigration *migration = vbasedev->migration;
 ssize_t data_size;
 
+if (!migration->save_iterate_run) {
+trace_vfio_save_iterate_started(vbasedev->name);
+migration->save_iterate_run = true;
+}
+
 data_size = vfio_save_block(f, migration);
 if (data_size < 0) {
 return data_size;
+} else if (data_size == 0 && !migration->save_iterate_empty_hit) {
+trace_vfio_save_iterate_empty_hit(vbasedev->name);
+migration->save_iterate_empty_hit = true;
 }
 
 vfio_update_estimated_pending_data(migration, data_size);
@@ -633,6 +644,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 int ret;
 Error *local_err = NULL;
 
+trace_vfio_save_complete_precopy_started(vbasedev->name);
+
 /* We reach here with device state STOP or STOP_COPY only */
 ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
VFIO_DEVICE_STATE_STOP, _err);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 64161bf6f44c..814000796687 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
 vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
 vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
+vfio_save_complete_precopy_started(const char *name) " (%s)"
 vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_save_iterate_started(const char *name) " (%s)"
+vfio_save_iterate_empty_hit(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
 vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 4cb1ab8645dc..510818f4dae3 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -71,6 +71,9 @@ typedef struct VFIOMigration {
 uint64_t precopy_init_size;
 uint64_t precopy_dirty_size;
 bool initial_data_sent;
+
+bool save_iterate_run;
+bool save_iterate_empty_hit;
 } VFIOMigration;
 
 struct VFIOGroup;



[PATCH v1 13/13] vfio/migration: Multifd device state transfer support - send side

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Implement the multifd device state transfer via an additional per-device
thread spawned from the save_live_complete_precopy_begin handler.

Switch between doing the data transfer in the new handler and doing it
in the old save_state handler depending on the
x-migration-multifd-transfer device property value.

Signed-off-by: Maciej S. Szmigiero 
---
 hw/vfio/migration.c   | 207 ++
 hw/vfio/trace-events  |   3 +
 include/hw/vfio/vfio-common.h |   9 ++
 3 files changed, 219 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 719e36800ab5..28a835f8a945 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -643,6 +643,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
 uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
 int ret;
 
+/* Make a copy of this setting at the start in case it is changed mid-migration */
+migration->multifd_transfer = vbasedev->migration_multifd_transfer;
+
+if (migration->multifd_transfer && !migration_has_device_state_support()) {
+error_setg(errp,
+   "%s: Multifd device transfer requested but unsupported in 
the current config",
+   vbasedev->name);
+return -EINVAL;
+}
+
 qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
 
 vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
@@ -692,6 +702,8 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
 return ret;
 }
 
+static void vfio_save_complete_precopy_async_thread_thread_terminate(VFIODevice *vbasedev);
+
 static void vfio_save_cleanup(void *opaque)
 {
 VFIODevice *vbasedev = opaque;
@@ -699,6 +711,8 @@ static void vfio_save_cleanup(void *opaque)
 Error *local_err = NULL;
 int ret;
 
+vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+
 /*
  * Changing device state from STOP_COPY to STOP can take time. Do it here,
  * after migration has completed, so it won't increase downtime.
@@ -712,6 +726,7 @@ static void vfio_save_cleanup(void *opaque)
 }
 }
 
+g_clear_pointer(&migration->idstr, g_free);
 g_free(migration->data_buffer);
 migration->data_buffer = NULL;
 migration->precopy_init_size = 0;
@@ -823,10 +838,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 {
 VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
 ssize_t data_size;
 int ret;
 Error *local_err = NULL;
 
+if (migration->multifd_transfer) {
+/* Emit dummy NOP data */
+qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+return 0;
+}
+
 trace_vfio_save_complete_precopy_started(vbasedev->name);
 
 /* We reach here with device state STOP or STOP_COPY only */
@@ -852,12 +874,188 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 return ret;
 }
 
+static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev, uint32_t idx)
+{
+VFIOMigration *migration = vbasedev->migration;
+g_autoptr(QIOChannelBuffer) bioc = NULL;
+QEMUFile *f = NULL;
+int ret;
+g_autofree VFIODeviceStatePacket *packet = NULL;
+size_t packet_len;
+
+bioc = qio_channel_buffer_new(0);
+qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
+
+f = qemu_file_new_output(QIO_CHANNEL(bioc));
+
+ret = vfio_save_device_config_state(f, vbasedev, NULL);
+if (ret) {
+return ret;
+}
+
+ret = qemu_fflush(f);
+if (ret) {
+goto ret_close_file;
+}
+
+packet_len = sizeof(*packet) + bioc->usage;
+packet = g_malloc0(packet_len);
+packet->idx = idx;
+packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
+memcpy(&packet->data, bioc->data, bioc->usage);
+
+ret = multifd_queue_device_state(migration->idstr, migration->instance_id,
+ (char *)packet, packet_len);
+
+bytes_transferred += packet_len;
+
+ret_close_file:
+g_clear_pointer(&f, qemu_fclose);
+return ret;
+}
+
+static void *vfio_save_complete_precopy_async_thread(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int *ret = &migration->save_complete_precopy_thread_ret;
+g_autofree VFIODeviceStatePacket *packet = NULL;
+uint32_t idx;
+
+/* We reach here with device state STOP or STOP_COPY only */
+*ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+VFIO_DEVICE_STATE_STOP, NULL);
+if (*ret) {
+return NULL;
+}
+
+packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
+
+for (idx = 0; ; idx++) {
+ssize_t data_size;
+size_t packet_size;
+
+data_size = rea

[PATCH v1 12/13] vfio/migration: Add x-migration-multifd-transfer VFIO property

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This property allows configuring at runtime whether to send the
particular device state via multifd channels when live migrating that
device.

It is ignored on the receive side and defaults to "false" for bit stream
compatibility with older QEMU versions.
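
For example, the property would be enabled on the migration source like
this (the host address is a placeholder):

    -device vfio-pci,host=0000:af:00.1,x-migration-multifd-transfer=on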

Signed-off-by: Maciej S. Szmigiero 
---
 hw/vfio/pci.c | 7 +++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 74a79bdf61f9..e2ac1db96002 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3346,6 +3346,8 @@ static void vfio_instance_init(Object *obj)
 pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
 }
 
+static PropertyInfo qdev_prop_bool_mutable;
+
 static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
 DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
@@ -3367,6 +3369,8 @@ static Property vfio_pci_dev_properties[] = {
 VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
 DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
 vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
+vbasedev.migration_multifd_transfer, qdev_prop_bool_mutable, bool),
 DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
  vbasedev.migration_events, false),
 DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
@@ -3464,6 +3468,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
 
 static void register_vfio_pci_dev_type(void)
 {
+qdev_prop_bool_mutable = qdev_prop_bool;
+qdev_prop_bool_mutable.realized_set_allowed = true;
+
 type_register_static(&vfio_pci_dev_info);
 type_register_static(&vfio_pci_nohotplug_dev_info);
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index aa8476a859a6..bc85891d8fff 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -132,6 +132,7 @@ typedef struct VFIODevice {
 bool no_mmap;
 bool ram_block_discard_allowed;
 OnOffAuto enable_migration;
+bool migration_multifd_transfer;
 bool migration_events;
 VFIODeviceOps *ops;
 unsigned int num_irqs;



[PATCH v1 11/13] vfio/migration: Multifd device state transfer support - receive side

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

The multifd received data needs to be reassembled since device state
packets sent via different multifd channels can arrive out-of-order.

Therefore, each VFIO device state packet carries a header indicating
its position in the stream.

The last such VFIO device state packet should have the
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config
state.

Since it's important to finish loading the device state transferred via
the main migration channel (via the save_live_iterate handler) before
starting to load the data asynchronously transferred via multifd, a new
VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to mark the
end of the main migration channel data.

The device state loading process waits until that flag is seen before
commencing loading of the multifd-transferred device state.
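
A simplified sketch of the resulting consumption order (using the names
from the diff below; the load thread runs this with load_bufs_mutex
held, error handling omitted):

    while (migration->load_buf_idx <= migration->load_buf_idx_last) {
        LoadedBuffer *lb = &g_array_index(migration->load_bufs, LoadedBuffer,
                                          migration->load_buf_idx);

        while (!lb->is_present) {
            /* wait for the out-of-order piece to arrive */
            qemu_cond_wait(&migration->load_bufs_buffer_ready_cond,
                           &migration->load_bufs_mutex);
        }

        /* write lb->data to the device - strict idx order is preserved */
        migration->load_buf_idx++;
    }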

Signed-off-by: Maciej S. Szmigiero 
---
 hw/vfio/migration.c   | 325 +-
 hw/vfio/trace-events  |   9 +-
 include/hw/vfio/vfio-common.h |  14 ++
 3 files changed, 344 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 93f767e3c2dd..719e36800ab5 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -15,6 +15,7 @@
 #include <linux/vfio.h>
 #include <sys/ioctl.h>
 
+#include "io/channel-buffer.h"
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "migration/misc.h"
@@ -47,6 +48,7 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
 #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xef15ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xef16ULL)
 
 /*
 * This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -55,6 +57,15 @@
  */
 #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
 
+#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
+
+typedef struct VFIODeviceStatePacket {
+uint32_t version;
+uint32_t idx;
+uint32_t flags;
+uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
+
 static int64_t bytes_transferred;
 
 static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -254,6 +265,176 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
 return ret;
 }
 
+typedef struct LoadedBuffer {
+bool is_present;
+char *data;
+size_t len;
+} LoadedBuffer;
+
+static void loaded_buffer_clear(gpointer data)
+{
+LoadedBuffer *lb = data;
+
+if (!lb->is_present) {
+return;
+}
+
+g_clear_pointer(&lb->data, g_free);
+lb->is_present = false;
+}
+
+static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+  Error **errp)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+QEMU_LOCK_GUARD(&migration->load_bufs_mutex);
+LoadedBuffer *lb;
+
+if (data_size < sizeof(*packet)) {
+error_setg(errp, "packet too short at %zu (min is %zu)",
+   data_size, sizeof(*packet));
+return -1;
+}
+
+if (packet->version != 0) {
+error_setg(errp, "packet has unknown version %" PRIu32,
+   packet->version);
+return -1;
+}
+
+if (packet->idx == UINT32_MAX) {
+error_setg(errp, "packet has too high idx %" PRIu32,
+   packet->idx);
+return -1;
+}
+
+trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+/* config state packet should be the last one in the stream */
+if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+migration->load_buf_idx_last = packet->idx;
+}
+
+assert(migration->load_bufs);
+if (packet->idx >= migration->load_bufs->len) {
+g_array_set_size(migration->load_bufs, packet->idx + 1);
+}
+
+lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
+if (lb->is_present) {
+error_setg(errp, "state buffer %" PRIu32 " already filled", 
packet->idx);
+return -1;
+}
+
+assert(packet->idx >= migration->load_buf_idx);
+
+lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
+lb->len = data_size - sizeof(*packet);
+lb->is_present = true;
+
+qemu_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+return 0;
+}
+
+static void *vfio_load_bufs_thread(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+Error **errp = &migration->load_bufs_thread_errp;
+g_autoptr(QemuLockable) locker = qemu_lockable_auto_lock(
+QEMU_MAKE_LOCKABLE(&migration->load_bufs_mutex));
+LoadedBuffer *lb;
+
+while (!migration->load_bufs_device_ready &&
+   

[PATCH v1 09/13] migration/multifd: Device state transfer support - send side

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

A new function multifd_queue_device_state() is provided for a device to
queue its state for transmission via a multifd channel.
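
A hedged example of the intended use from device code (the idstr and
instance id are placeholders; the VFIO consumer later in this series is
the real user):

    if (multifd_queue_device_state("0000:af:00.1", 0, data, len) != 0) {
        /* multifd channels are unavailable or shutting down */
        return -1;
    }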

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/misc.h |   4 +
 migration/multifd-zlib.c |   2 +-
 migration/multifd-zstd.c |   2 +-
 migration/multifd.c  | 181 +--
 migration/multifd.h  |  26 --
 5 files changed, 182 insertions(+), 33 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index bfadc5613bac..abf6f33eeae8 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -111,4 +111,8 @@ bool migration_in_bg_snapshot(void);
 /* migration/block-dirty-bitmap.c */
 void dirty_bitmap_mig_init(void);
 
+/* migration/multifd.c */
+int multifd_queue_device_state(char *idstr, uint32_t instance_id,
+   char *data, size_t len);
+
 #endif
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 737a9645d2fe..424547aa5be0 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -177,7 +177,7 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
 
 out:
 p->flags |= MULTIFD_FLAG_ZLIB;
-multifd_send_fill_packet(p);
+multifd_send_fill_packet_ram(p);
 return 0;
 }
 
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 256858df0a0a..89ef21898485 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -166,7 +166,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
 
 out:
 p->flags |= MULTIFD_FLAG_ZSTD;
-multifd_send_fill_packet(p);
+multifd_send_fill_packet_ram(p);
 return 0;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index daa34172bf24..6a7e5d659925 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -12,6 +12,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/iov.h"
 #include "qemu/rcu.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
@@ -19,6 +20,7 @@
 #include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "file.h"
+#include "migration/misc.h"
 #include "migration.h"
 #include "migration-stats.h"
 #include "savevm.h"
@@ -49,9 +51,12 @@ typedef struct {
 } __attribute__((packed)) MultiFDInit_t;
 
 struct {
+QemuMutex queue_job_mutex;
+
 MultiFDSendParams *params;
-/* array of pages to sent */
+/* array of pages or device state to be sent */
 MultiFDPages_t *pages;
+MultiFDDeviceState_t *device_state;
 /*
  * Global number of generated multifd packets.
  *
@@ -168,7 +173,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p)
 }
 
 /**
- * nocomp_send_prepare: prepare date to be able to send
+ * nocomp_send_prepare_ram: prepare RAM data for sending
  *
  * For no compression we just have to calculate the size of the
  * packet.
@@ -178,7 +183,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p)
  * @p: Params for the channel that we are using
  * @errp: pointer to an error
  */
-static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
+static int nocomp_send_prepare_ram(MultiFDSendParams *p, Error **errp)
 {
 bool use_zero_copy_send = migrate_zero_copy_send();
 int ret;
@@ -197,13 +202,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
  * Only !zerocopy needs the header in IOV; zerocopy will
  * send it separately.
  */
-multifd_send_prepare_header(p);
+multifd_send_prepare_header_ram(p);
 }
 
 multifd_send_prepare_iovs(p);
 p->flags |= MULTIFD_FLAG_NOCOMP;
 
-multifd_send_fill_packet(p);
+multifd_send_fill_packet_ram(p);
 
 if (use_zero_copy_send) {
 /* Send header first, without zerocopy */
@@ -217,6 +222,56 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
 return 0;
 }
 
+static void multifd_send_fill_packet_device_state(MultiFDSendParams *p)
+{
+MultiFDPacketDeviceState_t *packet = p->packet_device_state;
+
+packet->hdr.flags = cpu_to_be32(p->flags);
+strncpy(packet->idstr, p->device_state->idstr, sizeof(packet->idstr));
+packet->instance_id = cpu_to_be32(p->device_state->instance_id);
+packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+}
+
+/**
+ * nocomp_send_prepare_device_state: prepare device state data for sending
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int nocomp_send_prepare_device_state(MultiFDSendParams *p,
+Error **errp)
+{
+multifd_send_prepare_header_device_state(p);
+
+assert(!(p->flags & MULTIFD_FLAG_SYNC));
+
+p->next_packet_size = p->dev

[PATCH v1 04/13] migration: Add save_live_complete_precopy_{begin, end} handlers

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

These SaveVMHandlers allow a device to provide its own asynchronous
transmission of the remaining data at the end of a precopy phase.

In this use case the save_live_complete_precopy_begin handler is
supposed to start such transmission (for example, by launching
appropriate threads) while the save_live_complete_precopy_end
handler is supposed to wait until such transfer has finished (for
example, until the sending threads have exited).
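
A minimal sketch of a device wiring up the pair (the mydev_* names and
the MyDevState fields are hypothetical, not part of this patch):

    static int mydev_complete_precopy_begin(QEMUFile *f, char *idstr,
                                            uint32_t instance_id, void *opaque)
    {
        MyDevState *s = opaque;

        /* kick off the asynchronous transfer and return immediately */
        qemu_thread_create(&s->send_thread, "mydev-save", mydev_send_thread,
                           s, QEMU_THREAD_JOINABLE);
        return 0;
    }

    static int mydev_complete_precopy_end(QEMUFile *f, void *opaque)
    {
        MyDevState *s = opaque;

        /* block until the transfer started in begin() has finished */
        qemu_thread_join(&s->send_thread);
        return s->send_thread_ret;
    }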

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/register.h | 34 ++
 migration/savevm.c   | 35 +++
 2 files changed, 69 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index f60e797894e5..f7b3df71 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -103,6 +103,40 @@ typedef struct SaveVMHandlers {
  */
 int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
 
+/**
+ * @save_live_complete_precopy_begin
+ *
+ * Called at the end of a precopy phase, before all @save_live_complete_precopy
+ * handlers. The handler might, for example, arrange for device-specific
+ * asynchronous transmission of the remaining data. When postcopy is enabled,
+ * devices that support postcopy will skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @idstr: this device section idstr
+ * @instance_id: this device section instance_id
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+int (*save_live_complete_precopy_begin)(QEMUFile *f,
+char *idstr, uint32_t instance_id,
+void *opaque);
+/**
+ * @save_live_complete_precopy_end
+ *
+ * Called at the end of a precopy phase, after all @save_live_complete_precopy
+ * handlers. The handler might, for example, wait for the asynchronous
+ * transmission started by the @save_live_complete_precopy_begin handler
+ * to complete. When postcopy is enabled, devices that support postcopy will
+ * skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
+
 /* This runs both outside and inside the BQL.  */
 
 /**
diff --git a/migration/savevm.c b/migration/savevm.c
index c621f2359ba3..56fb1c4c2563 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1494,6 +1494,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
 SaveStateEntry *se;
 int ret;
 
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+!se->ops->save_live_complete_precopy_begin) {
+continue;
+}
+
+save_section_header(f, se, QEMU_VM_SECTION_END);
+
+ret = se->ops->save_live_complete_precopy_begin(f,
+se->idstr, se->instance_id,
+se->opaque);
+
+save_section_footer(f, se);
+
+if (ret < 0) {
+qemu_file_set_error(f, ret);
+return -1;
+}
+}
+
 QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
 if (!se->ops ||
 (in_postcopy && se->ops->has_postcopy &&
@@ -1525,6 +1546,20 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
 end_ts_each - start_ts_each);
 }
 
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+!se->ops->save_live_complete_precopy_end) {
+continue;
+}
+
+ret = se->ops->save_live_complete_precopy_end(f, se->opaque);
+if (ret < 0) {
+qemu_file_set_error(f, ret);
+return -1;
+}
+}
+
 trace_vmstate_downtime_checkpoint("src-iterable-saved");
 
 return 0;



[PATCH v1 10/13] migration/multifd: Add migration_has_device_state_support()

2024-06-18 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Since device state transfer via multifd channels requires multifd
channels with packets and is currently not compatible with multifd
compression, add an appropriate query function so a device can learn
whether it can actually make use of it.
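
For example, a device would gate its multifd path like this (a sketch
with a hypothetical caller-side flag):

    if (use_multifd_transfer && !migration_has_device_state_support()) {
        error_setg(errp, "multifd device state transfer not available "
                   "in the current migration configuration");
        return -EINVAL;
    }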

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/misc.h | 1 +
 migration/multifd.c  | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index abf6f33eeae8..4f3de2f23819 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -112,6 +112,7 @@ bool migration_in_bg_snapshot(void);
 void dirty_bitmap_mig_init(void);
 
 /* migration/multifd.c */
+bool migration_has_device_state_support(void);
 int multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len);
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 6a7e5d659925..e5f7021465ec 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -804,6 +804,12 @@ retry:
 return true;
 }
 
+bool migration_has_device_state_support(void)
+{
+return migrate_multifd() && !migrate_mapped_ram() &&
+migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}
+
 int multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len)
 {



Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-05-06 Thread Maciej S. Szmigiero

On 29.04.2024 17:09, Peter Xu wrote:

On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:

On 24.04.2024 00:35, Peter Xu wrote:

On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:

On 24.04.2024 00:20, Peter Xu wrote:

On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:

On 19.04.2024 17:31, Peter Xu wrote:

On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:

On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:

On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:

I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.


Firstly, I'm wondering whether we can have better names for these new
hooks.  Currently (only comment on the async* stuff):

  - complete_precopy_async
  - complete_precopy
  - complete_precopy_async_wait

But perhaps better:

  - complete_precopy_begin
  - complete_precopy
  - complete_precopy_end

?

As I don't see why the device must do something with async in such hook.
To me it's more like you're splitting one process into multiple, then
begin/end sounds more generic.

Then, if with that in mind, IIUC we can already split ram_save_complete()
into >1 phases too. For example, I would be curious whether the performance
will go back to normal if we offloading multifd_send_sync_main() into the
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.

I would even ask one step further as what Dan was asking: have you thought
about dumping VFIO states via multifd even during iterations?  Would that
help even more than this series (which IIUC only helps during the blackout
phase)?


To dump during RAM iteration, the VFIO device will need to have
dirty tracking and iterate on its state, because the guest CPUs
will still be running potentially changing VFIO state. That seems
impractical in the general case.


We already do such interations in vfio_save_iterate()?

My understanding is the recent VFIO work is based on the fact that the VFIO
device can track device state changes more or less (besides being able to
save/load full states).  E.g. I still remember in our QE tests some old
devices report much more dirty pages than expected during the iterations
when we were looking into such issue that a huge amount of dirty pages
reported.  But newer models seem to have fixed that and report much less.

That issue was about GPU not NICs, though, and IIUC a major portion of such
tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
maybe they work differently.


The device which this series was developed against (Mellanox ConnectX-7)
is already transferring its live state before the VM gets stopped (via
save_live_iterate SaveVMHandler).

It's just that in addition to the live state it has more than 400 MiB
of state that cannot be transferred while the VM is still running.
And that fact hurts a lot with respect to the migration downtime.

AFAIK it's a very similar story for (some) GPUs.


So during iteration phase VFIO cannot yet leverage the multifd channels
when with this series, am I right?


That's right.


Is it possible to extend that use case too?


I guess so, but since this phase (iteration while the VM is still
running) doesn't impact downtime it is much less critical.


But it affects the bandwidth, e.g. even with multifd enabled, the device
iteration data will still bottleneck at ~15Gbps on a common system setup
the best case, even if the hosts are 100Gbps direct connected.  Would that
be a concern in the future too, or it's known problem and it won't be fixed
anyway?


I think any improvements to the migration performance are good, even if
they don't impact downtime.

It's just that this patch set focuses on the downtime phase as the more
critical thing.

After this gets improved there's no reason why not to look at improving
performance of the VM live phase too if it brings sensible improvements.


I remember Avihai used to have plan to look into similar issues, I hope
this is exactly what he is looking for.  Otherwise changing migration
protocol from time to time is cumbersome; we always need to provide a flag
to make sure old systems migrates in the old ways, new systems run the new
ways, and for such a relatively major change I'd want to double check on
how far away we can support offload VFIO iterations data to multifd.


The device state transfer is indicated by a new flag in the multifd
header (MULTIFD_FLAG_DEVICE_STATE).

If we are to use multifd channels for VM live phase transfers these
could simply re-use the same flag type.


Right, and that's also my major purpose of such request to consider both
issues.

If supporting iterators can be easy on top of this, I am thinking whether
we should do this in one s

Re: [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side

2024-05-06 Thread Maciej S. Szmigiero

On 29.04.2024 22:04, Peter Xu wrote:

On Tue, Apr 16, 2024 at 04:43:02PM +0200, Maciej S. Szmigiero wrote:

+bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
+{
+g_autoptr(GMutexLocker) locker = NULL;
+
+/*
+ * Device state submissions for shared channels can come
+ * from multiple threads and conflict with page submissions
+ * with respect to multifd_send_state access.
+ */
+if (!multifd_send_state->device_state_dedicated_channels) {
+locker = g_mutex_locker_new(&multifd_send_state->queue_job_mutex);


Haven't read the rest, but suggest to stick with QemuMutex for the whole
patchset, as that's what we use in the rest migration code, along with
QEMU_LOCK_GUARD().



Ack, from a quick scan of QEMU thread sync primitives it seems that
QemuMutex with QemuLockable and QemuCond should fulfill the
requirements to replace GMutex, GMutexLocker and GCond.
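
For reference, a sketch of that mapping (assuming a QemuMutex field
replaces the GMutex one):

    QemuMutex queue_job_mutex;

    qemu_mutex_init(&queue_job_mutex);

    {
        /* like g_autoptr(GMutexLocker): unlocks at end of scope */
        QEMU_LOCK_GUARD(&queue_job_mutex);
        /* ... critical section ... */
    }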

Thanks,
Maciej




Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-26 Thread Maciej S. Szmigiero

On 24.04.2024 00:27, Peter Xu wrote:

On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote:

We don't lose any genericity since by default the transfer is done via
mixed RAM / device state multifd channels from a shared pool.

It's only when x-multifd-channels-device-state is set to value > 0 then
the requested multifd channel counts gets dedicated to device state.

It could be seen as a fine-tuning option for cases where tests show that
it provides some benefits to the particular workload - just like many
other existing migration options are.

14% downtime improvement is too much to waste - I'm not sure that's only
due to avoiding RAM syncs, it's possible that there are other subtle
performance interactions too.

For even more genericity this option could be named like
x-multifd-channels-map and contain an array of channel settings like
"ram,ram,ram,device-state,device-state".
Then a possible future other uses of multifd channels wouldn't even need
a new dedicated option.


Yeah I understand such option would only provide more options.

However as long as such option got introduced, user will start to do their
own "optimizations" on how to provision the multifd channels, and IMHO
it'll be great if we as developer can be crystal clear on why it needs to
be introduced in the first place, rather than making all channels open to
all purposes.

So I don't think I'm strongly against such parameter, but I want to double
check we really understand what's behind this to justify such parameter.
Meanwhile I'd be always be pretty caucious on introducing any migration
parameters, due to the compatibility nightmares.  The less parameter the
better..


Ack, I am also curious why dedicated device state multifd channels bring
such downtime improvement.





I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.


Firstly, I'm wondering whether we can have better names for these new
hooks.  Currently (only comment on the async* stuff):

- complete_precopy_async
- complete_precopy
- complete_precopy_async_wait

But perhaps better:

- complete_precopy_begin
- complete_precopy
- complete_precopy_end

?

As I don't see why the device must do something with async in such hook.
To me it's more like you're splitting one process into multiple, then
begin/end sounds more generic.


Ack, I will rename these hooks to begin/end.


Then, if with that in mind, IIUC we can already split ram_save_complete()
into >1 phases too. For example, I would be curious whether the performance
will go back to normal if we offloading multifd_send_sync_main() into the
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.


AFAIK there's already just one multifd_send_sync_main() during downtime -
the one called from save_live_complete_precopy SaveVMHandler.

In order to truly never interfere with device state transfer the sync would
need to be ordered after the device state transfer is complete - that is,
after VFIO complete_precopy_end (complete_precopy_async_wait) handler
returns.


Do you think it'll be worthwhile give it a shot, even if we can't decide
yet on the order of end()s to be called?


Upon closer inspection it looks like there are, in fact, *two*
RAM syncs done during the downtime - besides the one at the end of
ram_save_complete() there's another one in find_dirty_block(). This function
is called earlier from ram_save_complete() -> ram_find_and_save_block().

Unfortunately, skipping that intermediate sync in find_dirty_block() and
moving the one from the end of ram_save_complete() to another SaveVMHandler
that's called only after VFIO device state transfer doesn't actually
improve downtime (at least not on its own).


It'll be great if we could look into these issues instead of workarounds,
and figure out what affected the performance behind, and also whether that
can be fixed without such parameter.


I've been looking at this and added some measurements around device state
queuing for submission in multifd_queue_device_state().

To my surprise, the mixed RAM / device state config of 15/0 has *much*
lower total queuing time of around 100 msec compared to the dedicated
device state channels 15/4 config with total queuing time of around
300 msec.

Despite that, the 15/4 config still has significantly lower overall
downtime.

This means that any reason for the downtime difference is probably on
the receive / load side of the migration rather than on the save /
send side.

I guess the reason for the lower device state queuing time in the 15/0
case is that this data could be sent via any of the 15 multifd channels
rather than just the 4 dedicated ones in the 15/4 case.

Nevertheless, I will continue to look at this problem to a

Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-26 Thread Maciej S. Szmigiero

On 24.04.2024 00:35, Peter Xu wrote:

On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:

On 24.04.2024 00:20, Peter Xu wrote:

On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:

On 19.04.2024 17:31, Peter Xu wrote:

On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:

On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:

On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:

I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.


Firstly, I'm wondering whether we can have better names for these new
hooks.  Currently (only comment on the async* stuff):

 - complete_precopy_async
 - complete_precopy
 - complete_precopy_async_wait

But perhaps better:

 - complete_precopy_begin
 - complete_precopy
 - complete_precopy_end

?

As I don't see why the device must do something with async in such hook.
To me it's more like you're splitting one process into multiple, then
begin/end sounds more generic.

Then, if with that in mind, IIUC we can already split ram_save_complete()
into >1 phases too. For example, I would be curious whether the performance
will go back to normal if we offloading multifd_send_sync_main() into the
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.

I would even ask one step further as what Dan was asking: have you thought
about dumping VFIO states via multifd even during iterations?  Would that
help even more than this series (which IIUC only helps during the blackout
phase)?


To dump during RAM iteration, the VFIO device will need to have
dirty tracking and iterate on its state, because the guest CPUs
will still be running potentially changing VFIO state. That seems
impractical in the general case.


We already do such interations in vfio_save_iterate()?

My understanding is the recent VFIO work is based on the fact that the VFIO
device can track device state changes more or less (besides being able to
save/load full states).  E.g. I still remember in our QE tests some old
devices report much more dirty pages than expected during the iterations
when we were looking into such issue that a huge amount of dirty pages
reported.  But newer models seem to have fixed that and report much less.

That issue was about GPU not NICs, though, and IIUC a major portion of such
tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
maybe they work differently.


The device which this series was developed against (Mellanox ConnectX-7)
is already transferring its live state before the VM gets stopped (via
save_live_iterate SaveVMHandler).

It's just that in addition to the live state it has more than 400 MiB
of state that cannot be transferred while the VM is still running.
And that fact hurts a lot with respect to the migration downtime.

AFAIK it's a very similar story for (some) GPUs.


So during iteration phase VFIO cannot yet leverage the multifd channels
when with this series, am I right?


That's right.


Is it possible to extend that use case too?


I guess so, but since this phase (iteration while the VM is still
running) doesn't impact downtime it is much less critical.


But it affects the bandwidth, e.g. even with multifd enabled, the device
iteration data will still bottleneck at ~15Gbps on a common system setup
the best case, even if the hosts are 100Gbps direct connected.  Would that
be a concern in the future too, or it's known problem and it won't be fixed
anyway?


I think any improvements to the migration performance are good, even if
they don't impact downtime.

It's just that this patch set focuses on the downtime phase as the more
critical thing.

After this gets improved there's no reason why not to look at improving
performance of the VM live phase too if it brings sensible improvements.


I remember Avihai used to have plan to look into similar issues, I hope
this is exactly what he is looking for.  Otherwise changing migration
protocol from time to time is cumbersome; we always need to provide a flag
to make sure old systems migrates in the old ways, new systems run the new
ways, and for such a relatively major change I'd want to double check on
how far away we can support offload VFIO iterations data to multifd.


The device state transfer is indicated by a new flag in the multifd
header (MULTIFD_FLAG_DEVICE_STATE).

If we are to use multifd channels for VM live phase transfers these
could simply re-use the same flag type.


Thanks,



Thanks,
Maciej




Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-23 Thread Maciej S. Szmigiero

On 24.04.2024 00:20, Peter Xu wrote:

On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:

On 19.04.2024 17:31, Peter Xu wrote:

On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:

On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:

On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:

I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.


Firstly, I'm wondering whether we can have better names for these new
hooks.  Currently (only comment on the async* stuff):

- complete_precopy_async
- complete_precopy
- complete_precopy_async_wait

But perhaps better:

- complete_precopy_begin
- complete_precopy
- complete_precopy_end

?

As I don't see why the device must do something with async in such hook.
To me it's more like you're splitting one process into multiple, then
begin/end sounds more generic.

Then, if with that in mind, IIUC we can already split ram_save_complete()
into >1 phases too. For example, I would be curious whether the performance
will go back to normal if we offload multifd_send_sync_main() into
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.

I would even ask one step further, as Dan was asking: have you thought
about dumping VFIO states via multifd even during iterations?  Would that
help even more than this series (which IIUC only helps during the blackout
phase)?


To dump during RAM iteration, the VFIO device will need to have
dirty tracking and iterate on its state, because the guest CPUs
will still be running, potentially changing VFIO state. That seems
impractical in the general case.


We already do such iterations in vfio_save_iterate()?

My understanding is the recent VFIO work is based on the fact that the VFIO
device can track device state changes more or less (besides being able to
save/load full states).  E.g. I still remember that in our QE tests some old
devices reported many more dirty pages than expected during the iterations,
back when we were looking into an issue where a huge number of dirty pages
was reported.  But newer models seem to have fixed that and report much less.

That issue was about GPUs, not NICs, though, and IIUC a major portion of such
tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
maybe they work differently.


The device which this series was developed against (Mellanox ConnectX-7)
is already transferring its live state before the VM gets stopped (via
save_live_iterate SaveVMHandler).

It's just that in addition to the live state it has more than 400 MiB
of state that cannot be transferred while the VM is still running.
And that fact hurts a lot with respect to the migration downtime.

AFAIK it's a very similar story for (some) GPUs.


So during the iteration phase VFIO cannot yet leverage the multifd channels
with this series, am I right?


That's right.


Is it possible to extend that use case too?


I guess so, but since this phase (iteration while the VM is still
running) doesn't impact downtime it is much less critical.
 

Thanks,



Thanks,
Maciej




Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-23 Thread Maciej S. Szmigiero

On 19.04.2024 17:31, Peter Xu wrote:

On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:

On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:

On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:

I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.


Firstly, I'm wondering whether we can have better names for these new
hooks.  Currently (only comment on the async* stuff):

   - complete_precopy_async
   - complete_precopy
   - complete_precopy_async_wait

But perhaps better:

   - complete_precopy_begin
   - complete_precopy
   - complete_precopy_end

?

As I don't see why the device must do something asynchronous in such a hook.
To me it's more like you're splitting one process into multiple phases, so
begin/end sounds more generic.

Then, if with that in mind, IIUC we can already split ram_save_complete()
into >1 phases too. For example, I would be curious whether the performance
will go back to normal if we offload multifd_send_sync_main() into
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.

I would even ask one step further, as Dan was asking: have you thought
about dumping VFIO states via multifd even during iterations?  Would that
help even more than this series (which IIUC only helps during the blackout
phase)?


To dump during RAM iteration, the VFIO device will need to have
dirty tracking and iterate on its state, because the guest CPUs
will still be running, potentially changing VFIO state. That seems
impractical in the general case.


We already do such iterations in vfio_save_iterate()?

My understanding is the recent VFIO work is based on the fact that the VFIO
device can track device state changes more or less (besides being able to
save/load full states).  E.g. I still remember that in our QE tests some old
devices reported many more dirty pages than expected during the iterations,
back when we were looking into an issue where a huge number of dirty pages
was reported.  But newer models seem to have fixed that and report much less.

That issue was about GPUs, not NICs, though, and IIUC a major portion of such
tracking used to be for GPU vRAMs.  So maybe I was mixing up these, and
maybe they work differently.


The device which this series was developed against (Mellanox ConnectX-7)
is already transferring its live state before the VM gets stopped (via
save_live_iterate SaveVMHandler).

It's just that in addition to the live state it has more than 400 MiB
of state that cannot be transferred while the VM is still running.
And that fact hurts a lot with respect to the migration downtime.

AFAIK it's a very similar story for (some) GPUs.


Thanks,



Thanks,
Maciej




Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-23 Thread Maciej S. Szmigiero

On 18.04.2024 22:02, Peter Xu wrote:

On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:

On 18.04.2024 12:39, Daniel P. Berrangé wrote:

On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:

On 17.04.2024 18:35, Daniel P. Berrangé wrote:

On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:

On 17.04.2024 10:36, Daniel P. Berrangé wrote:

On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

(..)

That said, the idea of reserving channels specifically for VFIO doesn't
make a whole lot of sense to me either.

Once we've done the RAM transfer, and are in the switchover phase
doing device state transfer, all the multifd channels are idle.
We should just use all those channels to transfer the device state,
in parallel.  Reserving channels just guarantees many idle channels
during RAM transfer, and further idle channels during vmstate
transfer.

IMHO it is more flexible to just use all available multifd channel
resources all the time.


The reason for having dedicated device state channels is that they
provide lower downtime in my tests.

With either 15 or 11 mixed multifd channels (no dedicated device state
channels) I get a downtime of about 1250 msec.

Comparing that with 15 total multifd channels / 4 dedicated device
state channels, which give a downtime of about 1100 ms, it means that using
dedicated channels brings about a 14% downtime improvement.


Hmm, can you clarify /when/ the VFIO vmstate transfer is taking
place?  Is it transferred concurrently with the RAM?  I had thought
this series still has the RAM transfer iterations running first,
and then the VFIO VMstate at the end, simply making use of multifd
channels for parallelism of the end phase.  Your reply, though, makes
me question my interpretation.

Let me try to illustrate channel flow in various scenarios, time
flowing left to right:

1. serialized RAM, then serialized VM state  (ie historical migration)

 main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |


2. parallel RAM, then serialized VM state (ie today's multifd)

 main:     | Init |                                            | VM state |
 multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
 multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
 multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |


3. parallel RAM, then parallel VM state

 main:     | Init |                                            | VM state |
 multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
 multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
 multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
 multifd4:                                                     | VFIO VM state |
 multifd5:                                                     | VFIO VM state |


4. parallel RAM and VFIO VM state, then remaining VM state

 main:     | Init |                                            | VM state |
 multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
 multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
 multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
 multifd4:        | VFIO VM state |
 multifd5:        | VFIO VM state |


I thought this series was implementing approx (3), but are you actually
implementing (4), or something else entirely ?


You are right that this series approximately implements
the scheme described as number 3 in your diagrams.



However, there are some additional details worth mentioning:
* There's some, but a relatively small, amount of VFIO data being
transferred from the "save_live_iterate" SaveVMHandler while the VM is
still running.

This is still happening via the main migration channel.
Parallelizing this transfer in the future might make sense too,
although obviously this doesn't impact the downtime.

* After the VM is stopped and downtime starts the main (~ 400 MiB)
VFIO device state gets transferred via multifd channels.

However, these multifd channels (if they are not dedicated to device
state transfer) aren't idle during that time.
Rather they seem to be transferring the residual RAM data.

That's most likely what causes the additional observed downtime
when dedicated device state transfer multifd channels aren't used.


Ahh yes, I forgot about the residual dirty RAM, that makes sense as
an explanation. Allow me to work through the scenarios though, as I
still think my suggestion to not have separate dedicated channels is
better.


Let's say, hypothetically, we have an existing deployment today that
uses 6 multifd channels for RAM, i.e.:
  main:     | Init |                                            | VM state |
  multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM it

Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-18 Thread Maciej S. Szmigiero

On 18.04.2024 12:39, Daniel P. Berrangé wrote:

On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:

On 17.04.2024 18:35, Daniel P. Berrangé wrote:

On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:

On 17.04.2024 10:36, Daniel P. Berrangé wrote:

On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

(..)

That said, the idea of reserving channels specifically for VFIO doesn't
make a whole lot of sense to me either.

Once we've done the RAM transfer, and are in the switchover phase
doing device state transfer, all the multifd channels are idle.
We should just use all those channels to transfer the device state,
in parallel.  Reserving channels just guarantees many idle channels
during RAM transfer, and further idle channels during vmstate
transfer.

IMHO it is more flexible to just use all available multifd channel
resources all the time.


The reason for having dedicated device state channels is that they
provide lower downtime in my tests.

With either 15 or 11 mixed multifd channels (no dedicated device state
channels) I get a downtime of about 1250 msec.

Comparing that with 15 total multifd channels / 4 dedicated device
state channels, which give a downtime of about 1100 ms, it means that using
dedicated channels brings about a 14% downtime improvement.


Hmm, can you clarify /when/ the VFIO vmstate transfer is taking
place?  Is it transferred concurrently with the RAM?  I had thought
this series still has the RAM transfer iterations running first,
and then the VFIO VMstate at the end, simply making use of multifd
channels for parallelism of the end phase.  Your reply, though, makes
me question my interpretation.

Let me try to illustrate channel flow in various scenarios, time
flowing left to right:

1. serialized RAM, then serialized VM state  (ie historical migration)

main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |


2. parallel RAM, then serialized VM state (ie today's multifd)

main:      | Init |                                            | VM state |
multifd1:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd2:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd3:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |


3. parallel RAM, then parallel VM state

main:      | Init |                                            | VM state |
multifd1:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd2:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd3:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd4:                                                      | VFIO VM state |
multifd5:                                                      | VFIO VM state |


4. parallel RAM and VFIO VM state, then remaining VM state

main:      | Init |                                            | VM state |
multifd1:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd2:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd3:         | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
multifd4:         | VFIO VM state |
multifd5:         | VFIO VM state |


I thought this series was implementing approx (3), but are you actually
implementing (4), or something else entirely ?


You are right that this series approximately implements
the scheme described as number 3 in your diagrams.



However, there are some additional details worth mentioning:
* There's some, but a relatively small, amount of VFIO data being
transferred from the "save_live_iterate" SaveVMHandler while the VM is
still running.

This is still happening via the main migration channel.
Parallelizing this transfer in the future might make sense too,
although obviously this doesn't impact the downtime.

* After the VM is stopped and downtime starts the main (~ 400 MiB)
VFIO device state gets transferred via multifd channels.

However, these multifd channels (if they are not dedicated to device
state transfer) aren't idle during that time.
Rather they seem to be transferring the residual RAM data.

That's most likely what causes the additional observed downtime
when dedicated device state transfer multifd channels aren't used.


Ahh yes, I forgot about the residual dirty RAM, that makes sense as
an explanation. Allow me to work through the scenarios though, as I
still think my suggestion to not have separate dedicated channels is
better.


Let's say, hypothetically, we have an existing deployment today that
uses 6 multifd channels for RAM, i.e.:

 main:     | Init |                                                           | VM state |

 multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd3:        | R

Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-18 Thread Maciej S. Szmigiero

On 17.04.2024 18:35, Daniel P. Berrangé wrote:

On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:

On 17.04.2024 10:36, Daniel P. Berrangé wrote:

On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

VFIO device state transfer is currently done via the main migration channel.
This means that transfers from multiple VFIO devices are done sequentially
and via just a single common migration channel.

Such a way of transferring VFIO device state migration data reduces
performance and severely impacts the migration downtime (~50%) for VMs
that have multiple such devices with large state size - see the test
results below.

However, we already have a way to transfer migration data using multiple
connections - that's what multifd channels are.

Unfortunately, multifd channels are currently utilized for RAM transfer
only.
This patch set adds a new framework allowing their use for device state
transfer too.

The wire protocol is based on Avihai's x-channel-header patches, which
introduce a header for migration channels that allows the migration source
to explicitly indicate the migration channel type without having the
target deduce the channel type by peeking in the channel's content.

The new wire protocol can be switched on and off via the migration.x-channel-header
option for compatibility with older QEMU versions and testing.
Switching the new wire protocol off also disables device state transfer via
multifd channels.

The device state transfer can happen either via the same multifd channels
as RAM data is transferred, mixed with RAM data (when
migration.x-multifd-channels-device-state is 0) or exclusively via
dedicated device state transfer channels (when
migration.x-multifd-channels-device-state > 0).

Using dedicated device state transfer multifd channels brings further
performance benefits since these channels don't need to participate in
the RAM sync process.


I'm not convinced there's any need to introduce the new "channel header"
protocol messages. The multifd channels already have an initialization
message that is extensible to allow extra semantics to be indicated.
So if we want some of the multifd channels to be reserved for device
state, we could indicate that via some data in the MultiFDInit_t
message struct.
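
For illustration, reserving a channel for device state through that
existing handshake could look roughly like this sketch (the struct layout
below mirrors the current MultiFDInit_t; the flags byte and the
MULTIFD_INIT_FLAG_DEVICE_STATE name are hypothetical):

#define MULTIFD_INIT_FLAG_DEVICE_STATE 0x1   /* hypothetical */

typedef struct {
    uint32_t magic;
    uint32_t version;
    unsigned char uuid[16];  /* QemuUUID */
    uint8_t id;
    uint8_t flags;           /* repurposed reserved byte: channel kind */
    uint8_t unused1[6];      /* Reserved for future use */
    uint64_t unused2[4];     /* Reserved for future use */
} __attribute__((packed)) MultiFDInit_t;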


The reason for introducing x-channel-header was to avoid having to deduce
the channel type by peeking in the channel's content - where any channel
that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
multifd one.

But if this isn't desired then, as you say, the multifd channel type can
be indicated by using some unused field of the MultiFDInit_t message.

Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.


I don't like the heuristics we currently have, and would like to have
a better solution. What makes me cautious is that this proposal
is a protocol change, but only addresses one very narrow problem
with the migration protocol.

I'd like migration to see a more explicit bi-directional protocol
negotiation message set, where both QEMUs can auto-negotiate amongst
themselves many of the features that currently require tedious
manual configuration by mgmt apps via migrate parameters/capabilities.
That would address the problem you describe here, and so much more.


Isn't the capability negotiation handled automatically by libvirt
today?
I guess you'd prefer for QEMU to internally handle it instead?


If we add this channel header feature now, it creates yet another
thing to keep around for backwards compatibility. So if this is not
strictly required, in order to solve the VFIO VMstate problem, I'd
prefer to just solve the VMstate stuff on its own.


Okay, got it.


That said, the idea of reserving channels specifically for VFIO doesn't
make a whole lot of sense to me either.

Once we've done the RAM transfer, and are in the switchover phase
doing device state transfer, all the multifd channels are idle.
We should just use all those channels to transfer the device state,
in parallel.  Reserving channels just guarantees many idle channels
during RAM transfer, and further idle channels during vmstate
transfer.

IMHO it is more flexible to just use all available multifd channel
resources all the time.


The reason for having dedicated device state channels is that they
provide lower downtime in my tests.

With either 15 or 11 mixed multifd channels (no dedicated device state
channels) I get a downtime of about 1250 msec.

Comparing that with 15 total multifd channels / 4 dedicated device
state channels that give downtime of about 1100 ms it means that using
dedicated channels gets about 14% downtime improvement.


Hmm, can you clarify /when/ the VFIO vmstate transfer is taking
place?  Is it transferred concurrently with the RAM?  I had thought
this series still has the RAM transfer iterations running first,
and then the VFIO VMstate at the end, simply making use of multifd
channels for

Re: [PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-17 Thread Maciej S. Szmigiero

On 17.04.2024 10:36, Daniel P. Berrangé wrote:

On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

VFIO device state transfer is currently done via the main migration channel.
This means that transfers from multiple VFIO devices are done sequentially
and via just a single common migration channel.

Such a way of transferring VFIO device state migration data reduces
performance and severely impacts the migration downtime (~50%) for VMs
that have multiple such devices with large state size - see the test
results below.

However, we already have a way to transfer migration data using multiple
connections - that's what multifd channels are.

Unfortunately, multifd channels are currently utilized for RAM transfer
only.
This patch set adds a new framework allowing their use for device state
transfer too.

The wire protocol is based on Avihai's x-channel-header patches, which
introduce a header for migration channels that allows the migration source
to explicitly indicate the migration channel type without having the
target deduce the channel type by peeking in the channel's content.

The new wire protocol can be switched on and off via the migration.x-channel-header
option for compatibility with older QEMU versions and testing.
Switching the new wire protocol off also disables device state transfer via
multifd channels.

The device state transfer can happen either via the same multifd channels
as RAM data is transferred, mixed with RAM data (when
migration.x-multifd-channels-device-state is 0) or exclusively via
dedicated device state transfer channels (when
migration.x-multifd-channels-device-state > 0).

Using dedicated device state transfer multifd channels brings further
performance benefits since these channels don't need to participate in
the RAM sync process.


I'm not convinced there's any need to introduce the new "channel header"
protocol messages. The multifd channels already have an initialization
message that is extensible to allow extra semantics to be indicated.
So if we want some of the multifd channels to be reserved for device
state, we could indicate that via some data in the MultiFDInit_t
message struct.


The reason for introducing x-channel-header was to avoid having to deduce
the channel type by peeking in the channel's content - where any channel
that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
multifd one.

But if this isn't desired then, as you say, the multifd channel type can
be indicated by using some unused field of the MultiFDInit_t message.

Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.


That said, the idea of reserving channels specifically for VFIO doesn't
make a whole lot of sense to me either.

Once we've done the RAM transfer, and are in the switchover phase
doing device state transfer, all the multifd channels are idle.
We should just use all those channels to transfer the device state,
in parallel.  Reserving channels just guarantees many idle channels
during RAM transfer, and further idle channels during vmstate
transfer.

IMHO it is more flexible to just use all available multifd channel
resources all the time.


The reason for having dedicated device state channels is that they
provide lower downtime in my tests.

With either 15 or 11 mixed multifd channels (no dedicated device state
channels) I get a downtime of about 1250 msec.

Comparing that with 15 total multifd channels / 4 dedicated device
state channels, which give a downtime of about 1100 ms, it means that using
dedicated channels brings about a 14% downtime improvement.


Again the 'MultiFDPacket_t' struct has
both 'flags' and unused fields, so it is extensible to indicate
that it is being used for new types of data.


Yeah, that's what MULTIFD_FLAG_DEVICE_STATE in the packet header already
does in this patch set - it indicates that the packet contains device
state, not RAM data.
 

With regards,
Daniel


Best regards,
Maciej




[PATCH RFC 18/26] migration: Add load_finish handler and associated functions

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

load_finish SaveVMHandler allows migration code to poll whether
a device-specific asynchronous device state loading operation has finished.

In order to avoid calling this handler needlessly the device is supposed
to notify the migration code of its possible readiness via a call to
qemu_loadvm_load_finish_ready_broadcast() while holding
qemu_loadvm_load_finish_ready_lock.
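
For illustration, a device-side load_finish implementation could look
roughly like the sketch below (MyDeviceState and its fields are
hypothetical; only the handler contract described above comes from this
patch):

static int mydev_load_finish(void *opaque, bool *is_finished, Error **errp)
{
    MyDeviceState *dev = opaque;   /* hypothetical device state */

    /* Runs with qemu_loadvm_load_finish_ready_lock held, so the flags
     * updated by the device's loading thread are stable here. */
    if (dev->load_error) {
        error_setg(errp, "async device state load failed");
        return -EINVAL;
    }

    *is_finished = dev->load_done;
    return 0;   /* "not finished yet" is not an error */
}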

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/register.h | 21 +++
 migration/migration.c|  6 +
 migration/migration.h|  3 +++
 migration/savevm.c   | 52 
 migration/savevm.h   |  4 +++
 5 files changed, 86 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index 7d29b7e0b559..f15881fc87cd 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -272,6 +272,27 @@ typedef struct SaveVMHandlers {
 int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
  Error **errp);
 
+/**
+ * @load_finish
+ *
+ * Poll whether all asynchronous device state loading has finished.
+ * Not called on the load failure path.
+ *
+ * Called while holding the qemu_loadvm_load_finish_ready_lock.
+ *
+ * If this method signals "not ready" then it might not be called
+ * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
+ * while holding qemu_loadvm_load_finish_ready_lock.
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @is_finished: whether the loading had finished (output parameter)
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ * It's not an error that the loading still hasn't finished.
+ */
+int (*load_finish)(void *opaque, bool *is_finished, Error **errp);
+
 /**
  * @load_setup
  *
diff --git a/migration/migration.c b/migration/migration.c
index 8fe8be71a0e3..e4f82695a338 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -234,6 +234,9 @@ void migration_object_init(void)
 qemu_cond_init(&current_incoming->page_request_cond);
 current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
 
+g_mutex_init(&current_incoming->load_finish_ready_mutex);
+g_cond_init(&current_incoming->load_finish_ready_cond);
+
 migration_object_check(current_migration, &error_fatal);
 
 blk_mig_init();
@@ -387,6 +390,9 @@ void migration_incoming_state_destroy(void)
 mis->postcopy_qemufile_dst = NULL;
 }
 
+g_mutex_clear(&mis->load_finish_ready_mutex);
+g_cond_clear(&mis->load_finish_ready_cond);
+
 yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index a6114405917f..92014ef4cfcc 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -227,6 +227,9 @@ struct MigrationIncomingState {
  * is needed as this field is updated serially.
  */
 unsigned int switchover_ack_pending_num;
+
+GCond load_finish_ready_cond;
+GMutex load_finish_ready_mutex;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index 2e4d63faca06..30521ad3f340 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2994,6 +2994,37 @@ int qemu_loadvm_state(QEMUFile *f)
 return ret;
 }
 
+qemu_loadvm_load_finish_ready_lock();
+while (!ret) { /* Don't call load_finish() handlers on the load failure path */
+bool all_ready = true;
+SaveStateEntry *se = NULL;
+
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+bool this_ready;
+
+if (!se->ops || !se->ops->load_finish) {
+continue;
+}
+
+ret = se->ops->load_finish(se->opaque, &this_ready, &local_err);
+if (ret) {
+error_report_err(local_err);
+
+qemu_loadvm_load_finish_ready_unlock();
+return -EINVAL;
+} else if (!this_ready) {
+all_ready = false;
+}
+}
+
+if (all_ready) {
+break;
+}
+
+g_cond_wait(&mis->load_finish_ready_cond, &mis->load_finish_ready_mutex);
+}
+qemu_loadvm_load_finish_ready_unlock();
+
 if (ret == 0) {
 ret = qemu_file_get_error(f);
 }
@@ -3098,6 +3129,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, 
uint32_t instance_id,
 return 0;
 }
 
+void qemu_loadvm_load_finish_ready_lock(void)
+{
+MigrationIncomingState *mis = migration_incoming_get_current();
+
+g_mutex_lock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_loadvm_load_finish_ready_unlock(void)
+{
+MigrationIncomingState *mis = migration_incoming_get_current();
+
+g_mutex_unlock(&mis->load_finish_ready_mutex);
+}
+
+void qemu_

[PATCH RFC 07/26] migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This will allow passing additional parameters there in the future.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/postcopy-ram.c | 68 +++-
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e314e1023dc1..94fe872d8251 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1617,14 +1617,62 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file)
 trace_postcopy_preempt_new_channel();
 }
 
+typedef struct {
+unsigned int ref;
+MigrationState *s;
+} PostcopyPChannelConnectData;
+
+static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s)
+{
+PostcopyPChannelConnectData *data;
+
+data = g_malloc0(sizeof(*data));
+data->ref = 1;
+data->s = s;
+
+return data;
+}
+
+static void pcopy_preempt_connect_data_free(PostcopyPChannelConnectData *data)
+{
+g_free(data);
+}
+
+static PostcopyPChannelConnectData *
+pcopy_preempt_connect_data_ref(PostcopyPChannelConnectData *data)
+{
+unsigned int ref_old;
+
+ref_old = qatomic_fetch_inc(&data->ref);
+assert(ref_old < UINT_MAX);
+
+return data;
+}
+
+static void pcopy_preempt_connect_data_unref(gpointer opaque)
+{
+PostcopyPChannelConnectData *data = opaque;
+unsigned int ref_old;
+
+ref_old = qatomic_fetch_dec(&data->ref);
+assert(ref_old > 0);
+if (ref_old == 1) {
+pcopy_preempt_connect_data_free(data);
+}
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(PostcopyPChannelConnectData, pcopy_preempt_connect_data_unref)
+
 /*
  * Setup the postcopy preempt channel with the IOC.  If ERROR is specified,
  * setup the error instead.  This helper will free the ERROR if specified.
  */
 static void
-postcopy_preempt_send_channel_done(MigrationState *s,
+postcopy_preempt_send_channel_done(PostcopyPChannelConnectData *data,
QIOChannel *ioc, Error *local_err)
 {
+MigrationState *s = data->s;
+
 if (local_err) {
 migrate_set_error(s, local_err);
 error_free(local_err);
@@ -1645,18 +1693,19 @@ static void
 postcopy_preempt_tls_handshake(QIOTask *task, gpointer opaque)
 {
 g_autoptr(QIOChannel) ioc = QIO_CHANNEL(qio_task_get_source(task));
-MigrationState *s = opaque;
+PostcopyPChannelConnectData *data = opaque;
 Error *local_err = NULL;
 
 qio_task_propagate_error(task, _err);
-postcopy_preempt_send_channel_done(s, ioc, local_err);
+postcopy_preempt_send_channel_done(data, ioc, local_err);
 }
 
 static void
 postcopy_preempt_send_channel_new(QIOTask *task, gpointer opaque)
 {
 g_autoptr(QIOChannel) ioc = QIO_CHANNEL(qio_task_get_source(task));
-MigrationState *s = opaque;
+PostcopyPChannelConnectData *data = opaque;
+MigrationState *s = data->s;
 QIOChannelTLS *tioc;
 Error *local_err = NULL;
 
@@ -1672,14 +1721,15 @@ postcopy_preempt_send_channel_new(QIOTask *task, gpointer opaque)
 trace_postcopy_preempt_tls_handshake();
 qio_channel_set_name(QIO_CHANNEL(tioc), "migration-tls-preempt");
 qio_channel_tls_handshake(tioc, postcopy_preempt_tls_handshake,
-  s, NULL, NULL);
+  pcopy_preempt_connect_data_ref(data),
+  pcopy_preempt_connect_data_unref, NULL);
 /* Setup the channel until TLS handshake finished */
 return;
 }
 
 out:
 /* This handles both good and error cases */
-postcopy_preempt_send_channel_done(s, ioc, local_err);
+postcopy_preempt_send_channel_done(data, ioc, local_err);
 }
 
 /*
@@ -1714,8 +1764,12 @@ int postcopy_preempt_establish_channel(MigrationState *s)
 
 void postcopy_preempt_setup(MigrationState *s)
 {
+PostcopyPChannelConnectData *data;
+
+data = pcopy_preempt_connect_data_new(s);
 /* Kick an async task to connect */
-socket_send_channel_create(postcopy_preempt_send_channel_new, s, NULL);
+socket_send_channel_create(postcopy_preempt_send_channel_new,
+   data, pcopy_preempt_connect_data_unref);
 }
 
 static void postcopy_pause_ram_fast_load(MigrationIncomingState *mis)



[PATCH RFC 24/26] migration/multifd: Add migration_has_device_state_support()

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Since device state transfer via multifd channels requires multifd
channels with migration channel header and is currently not compatible
with multifd compression, add an appropriate query function so a device
can learn whether it can actually make use of it.

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/misc.h | 1 +
 migration/multifd.c  | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 25968e31247b..4da4f7f85f18 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -118,6 +118,7 @@ bool migration_in_bg_snapshot(void);
 void dirty_bitmap_mig_init(void);
 
 /* migration/multifd.c */
+bool migration_has_device_state_support(void);
 int multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len);
 
diff --git a/migration/multifd.c b/migration/multifd.c
index d8ce01539a05..d24217e705a0 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -819,6 +819,12 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
 return multifd_queue_page_locked(block, offset);
 }
 
+bool migration_has_device_state_support(void)
+{
+return migrate_multifd() && migrate_channel_header() &&
+migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}
+
 int multifd_queue_device_state(char *idstr, uint32_t instance_id,
char *data, size_t len)
 {



[PATCH RFC 15/26] migration/multifd: Zero p->flags before starting filling a packet

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This way there aren't stale flags there.

p->flags can't contain SYNC to be sent at the next RAM packet since syncs
are now handled separately in multifd_send_thread.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/multifd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index c2575e3d6dbf..7118c69a4d49 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -933,6 +933,7 @@ static void *multifd_send_thread(void *opaque)
 if (qatomic_load_acquire(&p->pending_job)) {
 MultiFDPages_t *pages = p->pages;
 
+p->flags = 0;
 p->iovs_num = 0;
 assert(pages->num);
 
@@ -986,7 +987,6 @@ static void *multifd_send_thread(void *opaque)
 }
 /* p->next_packet_size will always be zero for a SYNC packet */
 stat64_add(&mig_stats.multifd_bytes, p->packet_len);
-p->flags = 0;
 }
 
 qatomic_set(>pending_sync, false);



[PATCH RFC 26/26] vfio/migration: Multifd device state transfer support - send side

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Implement the multifd device state transfer via an additional per-device
thread spawned from the save_live_complete_precopy_async handler.

Switch between doing the data transfer in the new handler and doing it
in the old save_state handler depending on the
migration_has_device_state_support() return value.

Signed-off-by: Maciej S. Szmigiero 
---
 hw/vfio/migration.c   | 195 ++
 hw/vfio/trace-events  |   3 +
 include/hw/vfio/vfio-common.h |   8 ++
 3 files changed, 206 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3af62dea6899..6177431a0cd3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -608,11 +608,15 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
 return qemu_file_get_error(f);
 }
 
+static void vfio_save_complete_precopy_async_thread_thread_terminate(VFIODevice *vbasedev);
+
 static void vfio_save_cleanup(void *opaque)
 {
 VFIODevice *vbasedev = opaque;
 VFIOMigration *migration = vbasedev->migration;
 
+vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev);
+
 /*
  * Changing device state from STOP_COPY to STOP can take time. Do it here,
  * after migration has completed, so it won't increase downtime.
@@ -621,6 +625,7 @@ static void vfio_save_cleanup(void *opaque)
 vfio_migration_set_state_or_reset(vbasedev, VFIO_DEVICE_STATE_STOP);
 }
 
+g_clear_pointer(>idstr, g_free);
 g_free(migration->data_buffer);
 migration->data_buffer = NULL;
 migration->precopy_init_size = 0;
@@ -735,6 +740,12 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 ssize_t data_size;
 int ret;
 
+if (migration_has_device_state_support()) {
+/* Emit dummy NOP data */
+qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+return 0;
+}
+
 trace_vfio_save_complete_precopy_started(vbasedev->name);
 
 /* We reach here with device state STOP or STOP_COPY only */
@@ -762,11 +773,186 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 return ret;
 }
 
+static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev, uint32_t idx)
+{
+VFIOMigration *migration = vbasedev->migration;
+g_autoptr(QIOChannelBuffer) bioc = NULL;
+QEMUFile *f = NULL;
+int ret;
+g_autofree VFIODeviceStatePacket *packet = NULL;
+size_t packet_len;
+
+bioc = qio_channel_buffer_new(0);
+qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save");
+
+f = qemu_file_new_output(QIO_CHANNEL(bioc));
+
+ret = vfio_save_device_config_state(f, vbasedev);
+if (ret) {
+return ret;
+}
+
+ret = qemu_fflush(f);
+if (ret) {
+goto ret_close_file;
+}
+
+packet_len = sizeof(*packet) + bioc->usage;
+packet = g_malloc0(packet_len);
+packet->idx = idx;
+packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE;
+memcpy(&packet->data, bioc->data, bioc->usage);
+
+ret = multifd_queue_device_state(migration->idstr, migration->instance_id,
+ (char *)packet, packet_len);
+
+bytes_transferred += packet_len;
+
+ret_close_file:
+g_clear_pointer(&f, qemu_fclose);
+return ret;
+}
+
+static void *vfio_save_complete_precopy_async_thread(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+int *ret = &migration->save_complete_precopy_thread_ret;
+g_autofree VFIODeviceStatePacket *packet = NULL;
+uint32_t idx;
+
+/* We reach here with device state STOP or STOP_COPY only */
+*ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+VFIO_DEVICE_STATE_STOP);
+if (*ret) {
+return NULL;
+}
+
+packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size);
+
+for (idx = 0; ; idx++) {
+ssize_t data_size;
+size_t packet_size;
+
+data_size = read(migration->data_fd, &packet->data,
+ migration->data_buffer_size);
+if (data_size < 0) {
+if (errno != ENOMSG) {
+*ret = -errno;
+return NULL;
+}
+
+/*
+ * Pre-copy emptied all the device state for now. For more information,
+ * please refer to the Linux kernel VFIO uAPI.
+ */
+data_size = 0;
+}
+
+if (data_size == 0) {
+break;
+}
+
+packet->idx = idx;
+packet_size = sizeof(*packet) + data_size;
+
+*ret = multifd_queue_device_state(migration->idstr, migration->instance_id,
+  (char *)packet, packet_size);
+if (*ret) {
+return NULL;
+}
+
+bytes_transferred += packet_size;
+}
+
+*ret = vfio_save_c

[PATCH RFC 19/26] migration: Add x-multifd-channels-device-state parameter

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This parameter allows specifying how many multifd channels are dedicated
to sending device state in parallel.

It is ignored on the receive side.
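
For illustration, a setup that reserves 4 of 15 multifd channels for
device state could presumably be configured like this via HMP (an
illustrative invocation, not taken from the patch itself):

(qemu) migrate_set_parameter multifd-channels 15
(qemu) migrate_set_parameter x-multifd-channels-device-state 4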

Signed-off-by: Maciej S. Szmigiero 
---
 migration/migration-hmp-cmds.c |  7 +
 migration/options.c| 51 ++
 migration/options.h|  1 +
 qapi/migration.json| 16 ++-
 4 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 7e96ae6ffdae..37d71422fdc3 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -341,6 +341,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
 monitor_printf(mon, "%s: %u\n",
 MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_CHANNELS),
 params->multifd_channels);
+monitor_printf(mon, "%s: %u\n",
+MigrationParameter_str(MIGRATION_PARAMETER_X_MULTIFD_CHANNELS_DEVICE_STATE),
+params->x_multifd_channels_device_state);
 monitor_printf(mon, "%s: %s\n",
 MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_COMPRESSION),
 MultiFDCompression_str(params->multifd_compression));
@@ -626,6 +629,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
 p->has_multifd_channels = true;
 visit_type_uint8(v, param, &p->multifd_channels, &err);
 break;
+case MIGRATION_PARAMETER_X_MULTIFD_CHANNELS_DEVICE_STATE:
+p->has_x_multifd_channels_device_state = true;
+visit_type_uint8(v, param, &p->x_multifd_channels_device_state, &err);
+break;
 case MIGRATION_PARAMETER_MULTIFD_COMPRESSION:
 p->has_multifd_compression = true;
 visit_type_MultiFDCompression(v, param, &p->multifd_compression,
diff --git a/migration/options.c b/migration/options.c
index 949d8a6c0b62..a7f09570b04e 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -59,6 +59,7 @@
 /* The delay time (in ms) between two COLO checkpoints */
 #define DEFAULT_MIGRATE_X_CHECKPOINT_DELAY (200 * 100)
 #define DEFAULT_MIGRATE_MULTIFD_CHANNELS 2
+#define DEFAULT_MIGRATE_MULTIFD_CHANNELS_DEVICE_STATE 0
 #define DEFAULT_MIGRATE_MULTIFD_COMPRESSION MULTIFD_COMPRESSION_NONE
 /* 0: means nocompress, 1: best speed, ... 9: best compress ratio */
 #define DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL 1
@@ -138,6 +139,9 @@ Property migration_properties[] = {
 DEFINE_PROP_UINT8("multifd-channels", MigrationState,
   parameters.multifd_channels,
   DEFAULT_MIGRATE_MULTIFD_CHANNELS),
+DEFINE_PROP_UINT8("x-multifd-channels-device-state", MigrationState,
+  parameters.x_multifd_channels_device_state,
+  DEFAULT_MIGRATE_MULTIFD_CHANNELS_DEVICE_STATE),
 DEFINE_PROP_MULTIFD_COMPRESSION("multifd-compression", MigrationState,
   parameters.multifd_compression,
   DEFAULT_MIGRATE_MULTIFD_COMPRESSION),
@@ -885,6 +889,13 @@ int migrate_multifd_channels(void)
 return s->parameters.multifd_channels;
 }
 
+int migrate_multifd_channels_device_state(void)
+{
+MigrationState *s = migrate_get_current();
+
+return s->parameters.x_multifd_channels_device_state;
+}
+
 MultiFDCompression migrate_multifd_compression(void)
 {
 MigrationState *s = migrate_get_current();
@@ -1032,6 +1043,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
 params->block_incremental = s->parameters.block_incremental;
 params->has_multifd_channels = true;
 params->multifd_channels = s->parameters.multifd_channels;
+params->has_x_multifd_channels_device_state = true;
+params->x_multifd_channels_device_state = s->parameters.x_multifd_channels_device_state;
 params->has_multifd_compression = true;
 params->multifd_compression = s->parameters.multifd_compression;
 params->has_multifd_zlib_level = true;
@@ -1091,6 +1104,7 @@ void migrate_params_init(MigrationParameters *params)
 params->has_x_checkpoint_delay = true;
 params->has_block_incremental = true;
 params->has_multifd_channels = true;
+params->has_x_multifd_channels_device_state = true;
 params->has_multifd_compression = true;
 params->has_multifd_zlib_level = true;
 params->has_multifd_zstd_level = true;
@@ -1198,6 +1212,37 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
 return false;
 }
 
+if (params->has_multifd_channels &&
+params->has_x_multifd_channels_device_state &&
+params->x_multifd_channels_device_state > 0 &&
+!migrate_channel_header()) {
+error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+   "x_multi

[PATCH RFC 22/26] migration/multifd: Convert multifd_send_pages::next_channel to atomic

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This is necessary for multifd_send_pages() to be callable
from multiple threads.
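
Distilled, the pattern is the one sketched below (helper names invented;
qatomic_load_acquire()/qatomic_store_release() are the existing QEMU
primitives used by this patch):

static unsigned int next_channel;

/* Multiple producer threads pick a starting channel with an atomic load
 * and publish the next starting point with an atomic store.  The counter
 * is only a round-robin hint, so a racing update between the load and the
 * store is harmless - at worst two threads probe the same channel first. */
static unsigned int pick_start_channel(unsigned int nr_channels)
{
    return qatomic_load_acquire(&next_channel) % nr_channels;
}

static void publish_next_channel(unsigned int used, unsigned int nr_channels)
{
    qatomic_store_release(&next_channel, (used + 1) % nr_channels);
}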

Signed-off-by: Maciej S. Szmigiero 
---
 migration/multifd.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index a26418d87485..878ff7d9f9f0 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -622,8 +622,8 @@ static bool multifd_send_pages(void)
  * using more channels, so ensure it doesn't overflow if the
  * limit is lower now.
  */
-next_channel %= migrate_multifd_channels();
-for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
+i = qatomic_load_acquire(&next_channel) % migrate_multifd_channels();
+for (;; i = (i + 1) % migrate_multifd_channels()) {
 if (multifd_send_should_exit()) {
 return false;
 }
@@ -633,7 +633,8 @@ static bool multifd_send_pages(void)
  * sender thread can clear it.
  */
 if (qatomic_read(&p->pending_job) == false) {
-next_channel = (i + 1) % migrate_multifd_channels();
+qatomic_store_release(&next_channel,
+  (i + 1) % migrate_multifd_channels());
 break;
 }
 }



[PATCH RFC 23/26] migration/multifd: Device state transfer support - send side

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

A new function multifd_queue_device_state() is provided for a device to queue
its state for transmission via a multifd channel.
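
For illustration, a device could frame and queue chunks of its state
roughly as in the sketch below (modeled on the VFIO consumer later in this
series; MyDeviceStatePacket and the helper are hypothetical):

typedef struct {
    uint32_t version;
    uint32_t idx;       /* chunk index, for in-order reassembly */
    uint32_t flags;
    uint8_t data[];
} MyDeviceStatePacket;

static int queue_state_chunk(char *idstr, uint32_t instance_id,
                             uint32_t idx, void *buf, size_t len)
{
    g_autofree MyDeviceStatePacket *packet = g_malloc0(sizeof(*packet) + len);

    packet->idx = idx;
    memcpy(packet->data, buf, len);

    /* The multifd layer treats the whole packet as an opaque blob */
    return multifd_queue_device_state(idstr, instance_id,
                                      (char *)packet, sizeof(*packet) + len);
}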

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/misc.h |   4 +
 migration/multifd-zlib.c |   2 +-
 migration/multifd-zstd.c |   2 +-
 migration/multifd.c  | 244 ++-
 migration/multifd.h  |  30 +++--
 5 files changed, 244 insertions(+), 38 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index c9e200f4eb8f..25968e31247b 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -117,4 +117,8 @@ bool migration_in_bg_snapshot(void);
 /* migration/block-dirty-bitmap.c */
 void dirty_bitmap_mig_init(void);
 
+/* migration/multifd.c */
+int multifd_queue_device_state(char *idstr, uint32_t instance_id,
+   char *data, size_t len);
+
 #endif
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 99821cd4d5ef..e20c1de6033d 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -177,7 +177,7 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
 
 out:
 p->flags |= MULTIFD_FLAG_ZLIB;
-multifd_send_fill_packet(p);
+multifd_send_fill_packet_ram(p);
 return 0;
 }
 
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 02112255adcc..37cebd006921 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -166,7 +166,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
 
 out:
 p->flags |= MULTIFD_FLAG_ZSTD;
-multifd_send_fill_packet(p);
+multifd_send_fill_packet_ram(p);
 return 0;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 878ff7d9f9f0..d8ce01539a05 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -12,6 +12,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/iov.h"
 #include "qemu/rcu.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
@@ -20,6 +21,7 @@
 #include "qapi/error.h"
 #include "channel.h"
 #include "file.h"
+#include "migration/misc.h"
 #include "migration.h"
 #include "migration-stats.h"
 #include "savevm.h"
@@ -50,9 +52,17 @@ typedef struct {
 } __attribute__((packed)) MultiFDInit_t;
 
 struct {
+/*
+ * Are there some device state dedicated channels (true) or
+ * should device state be sent via any available channel (false)?
+ */
+bool device_state_dedicated_channels;
+GMutex queue_job_mutex;
+
 MultiFDSendParams *params;
-/* array of pages to sent */
+/* array of pages or device state to be sent */
 MultiFDPages_t *pages;
+MultiFDDeviceState_t *device_state;
 /*
  * Global number of generated multifd packets.
  *
@@ -169,7 +179,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p)
 }
 
 /**
- * nocomp_send_prepare: prepare date to be able to send
+ * nocomp_send_prepare_ram: prepare RAM data for sending
  *
  * For no compression we just have to calculate the size of the
  * packet.
@@ -179,7 +189,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p)
  * @p: Params for the channel that we are using
  * @errp: pointer to an error
  */
-static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
+static int nocomp_send_prepare_ram(MultiFDSendParams *p, Error **errp)
 {
 bool use_zero_copy_send = migrate_zero_copy_send();
 int ret;
@@ -198,13 +208,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
  * Only !zerocopy needs the header in IOV; zerocopy will
  * send it separately.
  */
-multifd_send_prepare_header(p);
+multifd_send_prepare_header_ram(p);
 }
 
 multifd_send_prepare_iovs(p);
 p->flags |= MULTIFD_FLAG_NOCOMP;
 
-multifd_send_fill_packet(p);
+multifd_send_fill_packet_ram(p);
 
 if (use_zero_copy_send) {
 /* Send header first, without zerocopy */
@@ -218,6 +228,59 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
 return 0;
 }
 
+static void multifd_send_fill_packet_device_state(MultiFDSendParams *p)
+{
+MultiFDPacketDeviceState_t *packet = p->packet_device_state;
+
+packet->hdr.flags = cpu_to_be32(p->flags);
+strncpy(packet->idstr, p->device_state->idstr, sizeof(packet->idstr));
+packet->instance_id = cpu_to_be32(p->device_state->instance_id);
+packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+}
+
+/**
+ * nocomp_send_prepare_device_state: prepare device state data for sending
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int nocomp_send_prepare_device_state(MultiFD

[PATCH RFC 08/26] migration: Allow passing migration header in migration channel creation

2024-04-16 Thread Maciej S. Szmigiero
From: Avihai Horon 

Signed-off-by: Avihai Horon 
[MSS: Rewrite using MFDSendChannelConnectData/PostcopyPChannelConnectData]
Signed-off-by: Maciej S. Szmigiero 
---
 migration/multifd.c  | 14 --
 migration/postcopy-ram.c | 14 --
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 58a18bb1e4a8..8eecda68ac0f 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -18,6 +18,7 @@
 #include "exec/ramblock.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "channel.h"
 #include "file.h"
 #include "migration.h"
 #include "migration-stats.h"
@@ -1014,15 +1015,20 @@ struct MFDSendChannelConnectData {
 unsigned int ref;
 MultiFDSendParams *p;
 QIOChannelTLS *tioc;
+MigChannelHeader header;
 };
 
-static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p)
+static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p,
+                                                                    MigChannelHeader *header)
 {
 MFDSendChannelConnectData *data;
 
 data = g_malloc0(sizeof(*data));
 data->ref = 1;
 data->p = p;
+if (header) {
+memcpy(&data->header, header, sizeof(*header));
+}
 
 return data;
 }
@@ -1110,6 +1116,10 @@ bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc,
 {
 MultiFDSendParams *p = data->p;
 
+if (migration_channel_header_send(ioc, &data->header, errp)) {
+return false;
+}
+
 qio_channel_set_delay(ioc, false);
 
 migration_ioc_register_yank(ioc);
@@ -1182,7 +1192,7 @@ static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
 {
 g_autoptr(MFDSendChannelConnectData) data = NULL;
 
-data = mfd_send_channel_connect_data_new(p);
+data = mfd_send_channel_connect_data_new(p, NULL);
 
 if (!multifd_use_packets()) {
 return file_send_channel_create(data, errp);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 94fe872d8251..53c90344acce 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -19,6 +19,7 @@
 #include "qemu/osdep.h"
 #include "qemu/madvise.h"
 #include "exec/target_page.h"
+#include "channel.h"
 #include "migration.h"
 #include "qemu-file.h"
 #include "savevm.h"
@@ -1620,15 +1621,20 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file)
 typedef struct {
 unsigned int ref;
 MigrationState *s;
+MigChannelHeader header;
 } PostcopyPChannelConnectData;
 
-static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s)
+static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s,
+                                                                   MigChannelHeader *header)
 {
 PostcopyPChannelConnectData *data;
 
 data = g_malloc0(sizeof(*data));
 data->ref = 1;
 data->s = s;
+if (header) {
+memcpy(&data->header, header, sizeof(*header));
+}
 
 return data;
 }
@@ -1673,6 +1679,10 @@ postcopy_preempt_send_channel_done(PostcopyPChannelConnectData *data,
 {
 MigrationState *s = data->s;
 
+if (!local_err) {
+migration_channel_header_send(ioc, &data->header, &local_err);
+}
+
 if (local_err) {
 migrate_set_error(s, local_err);
 error_free(local_err);
@@ -1766,7 +1776,7 @@ void postcopy_preempt_setup(MigrationState *s)
 {
 PostcopyPChannelConnectData *data;
 
-data = pcopy_preempt_connect_data_new(s);
+data = pcopy_preempt_connect_data_new(s, NULL);
 /* Kick an async task to connect */
 socket_send_channel_create(postcopy_preempt_send_channel_new,
data, pcopy_preempt_connect_data_unref);



[PATCH RFC 20/26] migration: Add MULTIFD_DEVICE_STATE migration channel type

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Signed-off-by: Maciej S. Szmigiero 
---
 migration/channel.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/migration/channel.h b/migration/channel.h
index 4232ee649939..b985c952550d 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -33,6 +33,7 @@ typedef enum {
 MIG_CHANNEL_TYPE_MAIN,
 MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
 MIG_CHANNEL_TYPE_MULTIFD,
+MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE,
 } MigChannelTypes;
 
 typedef struct QEMU_PACKED {



[PATCH RFC 02/26] migration: Add migration channel header send/receive

2024-04-16 Thread Maciej S. Szmigiero
From: Avihai Horon 

Add functions to send and receive migration channel header.

Signed-off-by: Avihai Horon 
[MSS: Mark MigChannelHeader as packed, remove device id from it]
Signed-off-by: Maciej S. Szmigiero 
---
 migration/channel.c| 59 ++
 migration/channel.h| 14 ++
 migration/trace-events |  2 ++
 3 files changed, 75 insertions(+)

diff --git a/migration/channel.c b/migration/channel.c
index f9de064f3b13..a72e85f5791c 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -21,6 +21,7 @@
 #include "io/channel-socket.h"
 #include "qemu/yank.h"
 #include "yank_functions.h"
+#include "options.h"
 
 /**
  * @migration_channel_process_incoming - Create new incoming migration channel
@@ -93,6 +94,64 @@ void migration_channel_connect(MigrationState *s,
 error_free(error);
 }
 
+int migration_channel_header_recv(QIOChannel *ioc, MigChannelHeader *header,
+  Error **errp)
+{
+uint64_t header_size;
+int ret;
+
+ret = qio_channel_read_all_eof(ioc, (char *)&header_size,
+   sizeof(header_size), errp);
+if (ret == 0 || ret == -1) {
+return -1;
+}
+
+header_size = be64_to_cpu(header_size);
+if (header_size > sizeof(*header)) {
+error_setg(errp,
+   "Received header of size %lu bytes which is greater than "
+   "max header size of %lu bytes",
+   header_size, sizeof(*header));
+return -EINVAL;
+}
+
+ret = qio_channel_read_all_eof(ioc, (char *)header, header_size, errp);
+if (ret == 0 || ret == -1) {
+return -1;
+}
+
+header->channel_type = be32_to_cpu(header->channel_type);
+
+trace_migration_channel_header_recv(header->channel_type,
+header_size);
+
+return 0;
+}
+
+int migration_channel_header_send(QIOChannel *ioc, MigChannelHeader *header,
+  Error **errp)
+{
+uint64_t header_size = sizeof(*header);
+int ret;
+
+if (!migrate_channel_header()) {
+return 0;
+}
+
+trace_migration_channel_header_send(header->channel_type,
+header_size);
+
+header_size = cpu_to_be64(header_size);
+ret = qio_channel_write_all(ioc, (char *)&header_size, sizeof(header_size),
+errp);
+if (ret) {
+return ret;
+}
+
+header->channel_type = cpu_to_be32(header->channel_type);
+
+return qio_channel_write_all(ioc, (char *)header, sizeof(*header), errp);
+}
 
 /**
  * @migration_channel_read_peek - Peek at migration channel, without
diff --git a/migration/channel.h b/migration/channel.h
index 5bdb8208a744..95d281828aaa 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -29,4 +29,18 @@ int migration_channel_read_peek(QIOChannel *ioc,
 const char *buf,
 const size_t buflen,
 Error **errp);
+typedef enum {
+MIG_CHANNEL_TYPE_MAIN,
+} MigChannelTypes;
+
+typedef struct QEMU_PACKED {
+uint32_t channel_type;
+} MigChannelHeader;
+
+int migration_channel_header_send(QIOChannel *ioc, MigChannelHeader *header,
+  Error **errp);
+
+int migration_channel_header_recv(QIOChannel *ioc, MigChannelHeader *header,
+  Error **errp);
+
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index f0e1cb80c75b..e48607d5a6a2 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -198,6 +198,8 @@ migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma)
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
 migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname, void *err)  "ioc=%p ioctype=%s hostname=%s err=%p"
+migration_channel_header_send(uint32_t channel_type, uint64_t header_size) "Migration channel header send: channel_type=%u, header_size=%lu"
+migration_channel_header_recv(uint32_t channel_type, uint64_t header_size) "Migration channel header recv: channel_type=%u, header_size=%lu"
 
 # global_state.c
 migrate_state_too_big(void) ""



[PATCH RFC 21/26] migration/multifd: Device state transfer support - receive side

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Add basic support for receiving device state via multifd channels -
either dedicated ones or channels shared with RAM transfer.

To differentiate between a device state packet and a RAM packet, the
packet header is read first.

Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header, either device state (MultiFDPacketDeviceState_t) or RAM
data (the existing MultiFDPacket_t) is then read.
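
For orientation, the resulting receive-side dispatch boils down to
something like the following sketch (multifd_recv_device_state() and
multifd_recv_ram() are hypothetical helper names standing in for the
respective unfill-and-process paths; error handling abbreviated):

static int multifd_recv_dispatch(MultiFDRecvParams *p, Error **errp)
{
    MultiFDPacketHdr_t hdr;

    /* the shared header is always read first... */
    if (qio_channel_read_all(p->c, (char *)&hdr, sizeof(hdr), errp)) {
        return -1;
    }
    if (multifd_recv_unfill_packet_header(p, &hdr, errp)) {
        return -1;
    }
    /* ...then the flags decide how the packet body is parsed */
    if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
        return multifd_recv_device_state(p, errp); /* hypothetical helper */
    }
    return multifd_recv_ram(p, errp); /* hypothetical helper */
}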

The received device state data is provided to
qemu_loadvm_load_state_buffer() function for processing in the
device's load_state_buffer handler.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/migration.c |   7 +-
 migration/multifd.c   | 146 --
 migration/multifd.h   |  34 +-
 3 files changed, 163 insertions(+), 24 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index e4f82695a338..ea2c8a043a77 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -987,7 +987,7 @@ static void migration_ioc_process_incoming_no_header(QIOChannel *ioc,
 /* Multiple connections */
 assert(migration_needs_multiple_sockets());
 if (migrate_multifd()) {
-multifd_recv_new_channel(ioc, &local_err);
+multifd_recv_new_channel(ioc, false, _err);
 } else {
 assert(migrate_postcopy_preempt());
 f = qemu_file_new_input(ioc);
@@ -1031,6 +1031,7 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 postcopy_preempt_new_channel(migration_incoming_get_current(), f);
 break;
 case MIG_CHANNEL_TYPE_MULTIFD:
+case MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE:
 {
 Error *local_err = NULL;
 
@@ -1039,7 +1040,9 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 return;
 }
 
-multifd_recv_new_channel(ioc, &local_err);
+multifd_recv_new_channel(ioc,
+ header.channel_type == MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE,
+ &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 return;
diff --git a/migration/multifd.c b/migration/multifd.c
index 7118c69a4d49..a26418d87485 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -22,6 +22,7 @@
 #include "file.h"
 #include "migration.h"
 #include "migration-stats.h"
+#include "savevm.h"
 #include "socket.h"
 #include "tls.h"
 #include "qemu-file.h"
@@ -404,7 +405,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
 uint32_t zero_num = pages->num - pages->normal_num;
 int i;
 
-packet->flags = cpu_to_be32(p->flags);
+packet->hdr.flags = cpu_to_be32(p->flags);
 packet->pages_alloc = cpu_to_be32(p->pages->allocated);
 packet->normal_pages = cpu_to_be32(pages->normal_num);
 packet->zero_pages = cpu_to_be32(zero_num);
@@ -432,28 +433,44 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
p->flags, p->next_packet_size);
 }
 
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p, MultiFDPacketHdr_t *hdr,
+ Error **errp)
 {
-MultiFDPacket_t *packet = p->packet;
-int i;
-
-packet->magic = be32_to_cpu(packet->magic);
-if (packet->magic != MULTIFD_MAGIC) {
+hdr->magic = be32_to_cpu(hdr->magic);
+if (hdr->magic != MULTIFD_MAGIC) {
 error_setg(errp, "multifd: received packet "
"magic %x and expected magic %x",
-   packet->magic, MULTIFD_MAGIC);
+   hdr->magic, MULTIFD_MAGIC);
 return -1;
 }
 
-packet->version = be32_to_cpu(packet->version);
-if (packet->version != MULTIFD_VERSION) {
+hdr->version = be32_to_cpu(hdr->version);
+if (hdr->version != MULTIFD_VERSION) {
 error_setg(errp, "multifd: received packet "
"version %u and expected version %u",
-   packet->version, MULTIFD_VERSION);
+   hdr->version, MULTIFD_VERSION);
 return -1;
 }
 
-p->flags = be32_to_cpu(packet->flags);
+p->flags = be32_to_cpu(hdr->flags);
+
+return 0;
+}
+
+static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p, Error **errp)
+{
+MultiFDPacketDeviceState_t *packet = p->packet_dev_state;
+
+packet->instance_id = be32_to_cpu(packet->instance_id);
+p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+
+return 0;
+}
+
+static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp)
+{
+MultiFDPacket_t *packet = p->packet;
+int i;
 
 packet->pages_alloc = be32_to_cpu(packet-

[PATCH RFC 14/26] migration/ram: Add load start trace event

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

There's a RAM load complete trace event but no corresponding load start one.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/ram.c| 1 +
 migration/trace-events | 1 +
 2 files changed, 2 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 8deb84984f4a..cebb06480d6f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4223,6 +4223,7 @@ static int ram_load_precopy(QEMUFile *f)
   RAM_SAVE_FLAG_ZERO);
 }
 
+trace_ram_load_start();
 while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) {
 ram_addr_t addr;
 void *host = NULL, *host_bak = NULL;
diff --git a/migration/trace-events b/migration/trace-events
index e48607d5a6a2..396c0233cb8c 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -115,6 +115,7 @@ colo_flush_ram_cache_end(void) ""
 save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
+ram_load_start(void) ""
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
 ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"



[PATCH RFC 00/26] Multifd  device state transfer support with VFIO consumer

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

VFIO device state transfer is currently done via the main migration channel.
This means that transfers from multiple VFIO devices are done sequentially
and via just a single common migration channel.

Transferring VFIO device state migration data this way reduces
performance and severely impacts the migration downtime (by ~50%) for
VMs that have multiple such devices with a large state size - see the
test results below.

However, we already have a way to transfer migration data using multiple
connections - that's what multifd channels are.

Unfortunately, multifd channels are currently utilized for RAM transfer
only.
This patch set adds a new framework allowing their use for device state
transfer too.

The wire protocol is based on Avihai's x-channel-header patches, which
introduce a header for migration channels that allow the migration source
to explicitly indicate the migration channel type without having the
target deduce the channel type by peeking in the channel's content.

The new wire protocol can be switched on and off via the
migration.x-channel-header option for compatibility with older QEMU
versions and for testing.
Switching the new wire protocol off also disables device state transfer via
multifd channels.

The device state transfer can happen either via the same multifd channels
as RAM data is transferred, mixed with RAM data (when
migration.x-multifd-channels-device-state is 0) or exclusively via
dedicated device state transfer channels (when
migration.x-multifd-channels-device-state > 0).
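
As an illustration, with the full series applied both knobs would be set
as migration properties, e.g. (the exact -global spelling below is an
assumption based on the option names above):

qemu-system-x86_64 \
    -global migration.x-channel-header=on \
    -global migration.x-multifd-channels-device-state=4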

Using dedicated device state transfer multifd channels brings further
performance benefits since these channels don't need to participate in
the RAM sync process.


These patches introduce a few new SaveVMHandlers:
* "save_live_complete_precopy_async{,wait}" handlers that allow device to
  provide its own asynchronous transmission of the remaining data at the
  end of a precopy phase.

  The "save_live_complete_precopy_async" handler is supposed to start such
  transmission (for example, by launching appropriate threads) while the
  "save_live_complete_precopy_async_wait" handler is supposed to wait until
  such transfer has finished (for example, until the sending threads
  have exited) - see the handler pair sketch right after this list.

* "load_state_buffer" and its caller qemu_loadvm_load_state_buffer() that
  allow providing device state buffer to explicitly specified device via
  its idstr and instance id.

* "load_finish" the allows migration code to poll whether a device-specific
  asynchronous device state loading operation had finished before proceeding
  further in the migration process (with associated condition variable for
  notification to avoid unnecessary polling).
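
A minimal sketch of such a handler pair, assuming a hypothetical device
with a MyDevState struct, a mydev_send_thread() thread function and a
send_ret result field (all made-up names):

static int mydev_save_complete_precopy_async(QEMUFile *f, char *idstr,
                                             uint32_t instance_id,
                                             void *opaque)
{
    MyDevState *s = opaque;

    /* kick off the device-specific sending thread and return at once */
    qemu_thread_create(&s->send_thread, "mydev-save",
                       mydev_send_thread, s, QEMU_THREAD_JOINABLE);
    return 0;
}

static int mydev_save_complete_precopy_async_wait(QEMUFile *f, void *opaque)
{
    MyDevState *s = opaque;

    /* block until the sending thread has pushed out the remaining data */
    qemu_thread_join(&s->send_thread);
    return s->send_ret;
}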


A VFIO device migration consumer for the new multifd channels device state
migration framework was implemented with a reassembly process for the multifd
received data since device state packets sent via different multifd channels
can arrive out-of-order.

The VFIO device data loading process happens in a separate thread in order
to avoid blocking a multifd receive thread during this fairly long process.
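
Conceptually, the reassembly is an idx-indexed growable array that the
multifd receive threads fill in whatever order packets arrive; a sketch
under the LoadedBuffer naming of the VFIO patches (locking elided, array
assumed created with clear_ = TRUE so new slots are zero-filled):

typedef struct LoadedBuffer {
    bool is_present;
    char *data;
    size_t len;
} LoadedBuffer;

static void store_out_of_order_packet(GArray *load_bufs, uint32_t idx,
                                      const void *payload, size_t len)
{
    LoadedBuffer *lb;

    if (idx >= load_bufs->len) {
        g_array_set_size(load_bufs, idx + 1);
    }
    lb = &g_array_index(load_bufs, LoadedBuffer, idx);
    lb->data = g_memdup2(payload, len);
    lb->len = len;
    lb->is_present = true;
}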


Test setup:
Source machine: 2x Xeon Gold 5218 / 192 GiB RAM
Mellanox ConnectX-7 with 100GbE link
6.9.0-rc1+ kernel
Target machine: 2x Xeon Platinum 8260 / 376 GiB RAM
Mellanox ConnectX-7 with 100GbE link
6.6.0+ kernel
VM: CPU 12cores x 2threads / 15 GiB RAM / 4x Mellanox ConnectX-7 VF


Migration config: 15 multifd channels total
  new way had 4 channels dedicated to device state transfer
  x-return-path=true, x-switchover-ack=true

Downtime with ~400 MiB VFIO total device state size:
                                            TLS off    TLS on
migration.x-channel-header=false (old way)  ~2100 ms   ~2300 ms
migration.x-channel-header=true (new way)   ~1100 ms   ~1200 ms
IMPROVEMENT                                 ~50%       ~50%


This patch set is also available as a git tree:
https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio


Avihai Horon (7):
  migration: Add x-channel-header pseudo-capability
  migration: Add migration channel header send/receive
  migration: Add send/receive header for main channel
  migration: Allow passing migration header in migration channel
creation
  migration: Add send/receive header for postcopy preempt channel
  migration: Add send/receive header for multifd channel
  migration: Enable x-channel-header pseudo-capability

Maciej S. Szmigiero (19):
  multifd: change multifd_new_send_channel_create() param type
  migration: Add a DestroyNotify parameter to
socket_send_channel_create()
  multifd: pass MFDSendChannelConnectData when connecting sending socket
  migration/postcopy: pass PostcopyPChannelConnectData when connecting
sending preempt socket
  migration/options: Mapped-r

[PATCH RFC 12/26] migration: Enable x-channel-header pseudo-capability

2024-04-16 Thread Maciej S. Szmigiero
From: Avihai Horon 

Now that the migration channel header has been implemented, enable it.

Signed-off-by: Avihai Horon 
Signed-off-by: Maciej S. Szmigiero 
---
 migration/options.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/migration/options.c b/migration/options.c
index abb5b485badd..949d8a6c0b62 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -386,7 +386,6 @@ bool migrate_channel_header(void)
 {
 MigrationState *s = migrate_get_current();
 
-return false;
 return s->channel_header;
 }
 



[PATCH RFC 09/26] migration: Add send/receive header for postcopy preempt channel

2024-04-16 Thread Maciej S. Szmigiero
From: Avihai Horon 

Add send and receive migration channel header for postcopy preempt
channel.

Signed-off-by: Avihai Horon 
[MSS: Adapt to rewritten migration header passing commit]
Signed-off-by: Maciej S. Szmigiero 
---
 migration/channel.h  | 1 +
 migration/migration.c| 5 +
 migration/postcopy-ram.c | 5 -
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/migration/channel.h b/migration/channel.h
index 95d281828aaa..c59ccedc7b6b 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -31,6 +31,7 @@ int migration_channel_read_peek(QIOChannel *ioc,
 Error **errp);
 typedef enum {
 MIG_CHANNEL_TYPE_MAIN,
+MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
 } MigChannelTypes;
 
 typedef struct QEMU_PACKED {
diff --git a/migration/migration.c b/migration/migration.c
index 0eb5b4f4f5a1..ac9ecf1f4f22 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1019,6 +1019,11 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 migration_incoming_setup(f);
 default_channel = true;
 break;
+case MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT:
+assert(migrate_postcopy_preempt());
+f = qemu_file_new_input(ioc);
+postcopy_preempt_new_channel(migration_incoming_get_current(), f);
+break;
 default:
 error_setg(errp, "Received unknown migration channel type %u",
header.channel_type);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 53c90344acce..c7e9f7345970 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1775,8 +1775,11 @@ int postcopy_preempt_establish_channel(MigrationState *s)
 void postcopy_preempt_setup(MigrationState *s)
 {
 PostcopyPChannelConnectData *data;
+MigChannelHeader header = {};
 
-data = pcopy_preempt_connect_data_new(s, NULL);
+header.channel_type = MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT;
+
+data = pcopy_preempt_connect_data_new(s, &header);
 /* Kick an async task to connect */
 socket_send_channel_create(postcopy_preempt_send_channel_new,
data, pcopy_preempt_connect_data_unref);



[PATCH RFC 16/26] migration: Add save_live_complete_precopy_async{, wait} handlers

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

These SaveVMHandlers allow a device to provide its own asynchronous
transmission of the remaining data at the end of a precopy phase.

The save_live_complete_precopy_async handler is supposed to start such
transmission (for example, by launching appropriate threads) while the
save_live_complete_precopy_async_wait handler is supposed to wait until
such transfer has finished (for example, until the sending threads
have exited).

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/register.h | 31 +++
 migration/savevm.c   | 35 +++
 2 files changed, 66 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index d7b70a8be68c..9d36e35bd612 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -102,6 +102,37 @@ typedef struct SaveVMHandlers {
  */
 int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
 
+/**
+ * @save_live_complete_precopy_async
+ *
+ * Arranges for handler-specific asynchronous transmission of the
+ * remaining data at the end of a precopy phase. When postcopy is
+ * enabled, devices that support postcopy will skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @idstr: this device section idstr
+ * @instance_id: this device section instance_id
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+int (*save_live_complete_precopy_async)(QEMUFile *f,
+char *idstr, uint32_t instance_id,
+void *opaque);
+/**
+ * @save_live_complete_precopy_async_wait
+ *
+ * Waits for the asynchronous transmission started by the of the
+ * @save_live_complete_precopy_async handler to complete.
+ * When postcopy is enabled, devices that support postcopy will skip this step.
+ *
+ * @f: QEMUFile where the handler can synchronously send data before returning
+ * @opaque: data pointer passed to register_savevm_live()
+ *
+ * Returns zero to indicate success and negative for error
+ */
+int (*save_live_complete_precopy_async_wait)(QEMUFile *f, void *opaque);
+
 /* This runs both outside and inside the BQL.  */
 
 /**
diff --git a/migration/savevm.c b/migration/savevm.c
index 388d7af7cdd8..fa35504678bf 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1497,6 +1497,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
 SaveStateEntry *se;
 int ret;
 
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+!se->ops->save_live_complete_precopy_async) {
+continue;
+}
+
+save_section_header(f, se, QEMU_VM_SECTION_END);
+
+ret = se->ops->save_live_complete_precopy_async(f,
+se->idstr, se->instance_id,
+se->opaque);
+
+save_section_footer(f, se);
+
+if (ret < 0) {
+qemu_file_set_error(f, ret);
+return -1;
+}
+}
+
 QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
 if (!se->ops ||
 (in_postcopy && se->ops->has_postcopy &&
@@ -1528,6 +1549,20 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
 end_ts_each - start_ts_each);
 }
 
+QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+ se->ops->has_postcopy(se->opaque)) ||
+!se->ops->save_live_complete_precopy_async_wait) {
+continue;
+}
+
+ret = se->ops->save_live_complete_precopy_async_wait(f, se->opaque);
+if (ret < 0) {
+qemu_file_set_error(f, ret);
+return -1;
+}
+}
+
 trace_vmstate_downtime_checkpoint("src-iterable-saved");
 
 return 0;



[PATCH RFC 25/26] vfio/migration: Multifd device state transfer support - receive side

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

The multifd received data needs to be reassembled since device state
packets sent via different multifd channels can arrive out-of-order.

Therefore, each VFIO device state packet carries a header indicating
its position in the stream.

The last such VFIO device state packet should have
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config
state.

Since it's important to finish loading the device state transferred via
the main migration channel (via the save_live_iterate handler) before
starting to load the data asynchronously transferred via multifd, a new
VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to mark the end
of the main migration channel data.

The device state loading process waits until that flag is seen before
commencing loading of the multifd-transferred device state.
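
In other words, the loading thread gates on a condition variable until
the main channel data is done; roughly as below
(vfio_wait_for_main_channel_done() is a made-up name, and reusing
load_bufs_buffer_ready_cond for this flag is an assumption):

static void vfio_wait_for_main_channel_done(VFIOMigration *migration)
{
    g_mutex_lock(&migration->load_bufs_mutex);
    while (!migration->load_bufs_device_ready) {
        g_cond_wait(&migration->load_bufs_buffer_ready_cond,
                    &migration->load_bufs_mutex);
    }
    g_mutex_unlock(&migration->load_bufs_mutex);
}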

Signed-off-by: Maciej S. Szmigiero 
---
 hw/vfio/migration.c   | 322 +-
 hw/vfio/trace-events  |   9 +-
 include/hw/vfio/vfio-common.h |  14 ++
 3 files changed, 342 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index bc3aea77455c..3af62dea6899 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -15,6 +15,7 @@
 #include <linux/vfio.h>
 #include <sys/ioctl.h>
 
+#include "io/channel-buffer.h"
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "migration/misc.h"
@@ -46,6 +47,7 @@
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xef13ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL)
 #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xef15ULL)
+#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE (0xef16ULL)
 
 /*
  * This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -54,6 +56,15 @@
  */
 #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
 
+#define VFIO_DEVICE_STATE_CONFIG_STATE (1)
+
+typedef struct VFIODeviceStatePacket {
+uint32_t version;
+uint32_t idx;
+uint32_t flags;
+uint8_t data[0];
+} QEMU_PACKED VFIODeviceStatePacket;
+
 static int64_t bytes_transferred;
 
 static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -186,6 +197,175 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
 return ret;
 }
 
+typedef struct LoadedBuffer {
+bool is_present;
+char *data;
+size_t len;
+} LoadedBuffer;
+
+static void loaded_buffer_clear(gpointer data)
+{
+LoadedBuffer *lb = data;
+
+if (!lb->is_present) {
+return;
+}
+
+g_clear_pointer(&lb->data, g_free);
+lb->is_present = false;
+}
+
+static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+  Error **errp)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+g_autoptr(GMutexLocker) locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+LoadedBuffer *lb;
+
+if (data_size < sizeof(*packet)) {
+error_setg(errp, "packet too short at %zu (min is %zu)",
+   data_size, sizeof(*packet));
+return -1;
+}
+
+if (packet->version != 0) {
+error_setg(errp, "packet has unknown version %" PRIu32,
+   packet->version);
+return -1;
+}
+
+if (packet->idx == UINT32_MAX) {
+error_setg(errp, "packet has too high idx %" PRIu32,
+   packet->idx);
+return -1;
+}
+
+trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+/* config state packet should be the last one in the stream */
+if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+migration->load_buf_idx_last = packet->idx;
+}
+
+assert(migration->load_bufs);
+if (packet->idx >= migration->load_bufs->len) {
+g_array_set_size(migration->load_bufs, packet->idx + 1);
+}
+
+lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
+if (lb->is_present) {
+error_setg(errp, "state buffer %" PRIu32 " already filled", packet->idx);
+return -1;
+}
+
+assert(packet->idx >= migration->load_buf_idx);
+
+lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
+lb->len = data_size - sizeof(*packet);
+lb->is_present = true;
+
+g_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+return 0;
+}
+
+static void *vfio_load_bufs_thread(void *opaque)
+{
+VFIODevice *vbasedev = opaque;
+VFIOMigration *migration = vbasedev->migration;
+Error **errp = &migration->load_bufs_thread_errp;
+g_autoptr(GMutexLocker) locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+LoadedBuffer *lb;
+
+while (!migration->load_bufs_device_ready &&
+

[PATCH RFC 10/26] migration: Add send/receive header for multifd channel

2024-04-16 Thread Maciej S. Szmigiero
From: Avihai Horon 

Add send and receive migration channel header for multifd channel.

Signed-off-by: Avihai Horon 
[MSS: Adapt to rewritten migration header passing commit]
Signed-off-by: Maciej S. Szmigiero 
---
 migration/channel.h   |  1 +
 migration/migration.c | 16 
 migration/multifd.c   |  4 +++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/migration/channel.h b/migration/channel.h
index c59ccedc7b6b..4232ee649939 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -32,6 +32,7 @@ int migration_channel_read_peek(QIOChannel *ioc,
 typedef enum {
 MIG_CHANNEL_TYPE_MAIN,
 MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
+MIG_CHANNEL_TYPE_MULTIFD,
 } MigChannelTypes;
 
 typedef struct QEMU_PACKED {
diff --git a/migration/migration.c b/migration/migration.c
index ac9ecf1f4f22..8fe8be71a0e3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1024,6 +1024,22 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 f = qemu_file_new_input(ioc);
 postcopy_preempt_new_channel(migration_incoming_get_current(), f);
 break;
+case MIG_CHANNEL_TYPE_MULTIFD:
+{
+Error *local_err = NULL;
+
+assert(migrate_multifd());
+if (multifd_recv_setup(errp) != 0) {
+return;
+}
+
+multifd_recv_new_channel(ioc, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+break;
+}
 default:
 error_setg(errp, "Received unknown migration channel type %u",
header.channel_type);
diff --git a/migration/multifd.c b/migration/multifd.c
index 8eecda68ac0f..c2575e3d6dbf 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1191,8 +1191,10 @@ out:
 static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
 {
 g_autoptr(MFDSendChannelConnectData) data = NULL;
+MigChannelHeader header = {};
 
-data = mfd_send_channel_connect_data_new(p, NULL);
+header.channel_type = MIG_CHANNEL_TYPE_MULTIFD;
+data = mfd_send_channel_connect_data_new(p, &header);
 
 if (!multifd_use_packets()) {
 return file_send_channel_create(data, errp);



[PATCH RFC 11/26] migration/options: Mapped-ram is not channel header compatible

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Mapped-ram is only available for multifd migration without the channel
header - add an appropriate check to the migration options.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/options.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/migration/options.c b/migration/options.c
index 8fd871cd956d..abb5b485badd 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -1284,6 +1284,13 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
 return false;
 }
 
+if (migrate_mapped_ram() &&
+params->has_multifd_channels && migrate_channel_header()) {
+error_setg(errp,
+   "Mapped-ram only available for multifd migration without 
channel header");
+return false;
+}
+
 if (params->has_x_vcpu_dirty_limit_period &&
 (params->x_vcpu_dirty_limit_period < 1 ||
  params->x_vcpu_dirty_limit_period > 1000)) {



[PATCH RFC 17/26] migration: Add qemu_loadvm_load_state_buffer() and its handler

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

qemu_loadvm_load_state_buffer() and its load_state_buffer
SaveVMHandler allow providing a device state buffer to an explicitly
specified device via its idstr and instance id.
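
An illustrative call site - roughly what the multifd receive path in
this series does once a complete device state buffer is available (the
wrapper function and its variable names are made up):

static int deliver_device_state(const char *idstr, uint32_t instance_id,
                                char *data, size_t data_len)
{
    Error *local_err = NULL;

    if (qemu_loadvm_load_state_buffer(idstr, instance_id,
                                      data, data_len, &local_err)) {
        error_report_err(local_err);
        return -1;
    }
    return 0;
}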

Signed-off-by: Maciej S. Szmigiero 
---
 include/migration/register.h | 15 +++
 migration/savevm.c   | 25 +
 migration/savevm.h   |  3 +++
 3 files changed, 43 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index 9d36e35bd612..7d29b7e0b559 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -257,6 +257,21 @@ typedef struct SaveVMHandlers {
  */
 int (*load_state)(QEMUFile *f, void *opaque, int version_id);
 
+/**
+ * @load_state_buffer
+ *
+ * Load device state buffer provided to qemu_loadvm_load_state_buffer().
+ *
+ * @opaque: data pointer passed to register_savevm_live()
+ * @data: the data buffer to load
+ * @data_size: the data length in buffer
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Returns zero to indicate success and negative for error
+ */
+int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
+ Error **errp);
+
 /**
  * @load_setup
  *
diff --git a/migration/savevm.c b/migration/savevm.c
index fa35504678bf..2e4d63faca06 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3073,6 +3073,31 @@ int qemu_loadvm_approve_switchover(void)
 return migrate_send_rp_switchover_ack(mis);
 }
 
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+  char *buf, size_t len, Error **errp)
+{
+SaveStateEntry *se;
+
+se = find_se(idstr, instance_id);
+if (!se) {
+error_setg(errp, "Unknown idstr %s or instance id %u for load state 
buffer",
+   idstr, instance_id);
+return -1;
+}
+
+if (!se->ops || !se->ops->load_state_buffer) {
+error_setg(errp, "idstr %s / instance %u has no load state buffer 
operation",
+   idstr, instance_id);
+return -1;
+}
+
+if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) {
+return -1;
+}
+
+return 0;
+}
+
 bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
   bool has_devices, strList *devices, Error **errp)
 {
diff --git a/migration/savevm.h b/migration/savevm.h
index 74669733dd63..c879ba8c970e 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
 bool in_postcopy, bool inactivate_disks);
 
+int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id,
+  char *buf, size_t len, Error **errp);
+
 #endif



[PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability

2024-04-16 Thread Maciej S. Szmigiero
From: Avihai Horon 

Add x-channel-header pseudo-capability which indicates that a header
should be sent through migration channels.

The header is the first thing to be sent through a migration channel and
it allows the destination to differentiate between the various channels
(main, multifd and preempt).

This eliminates the need to deduce the channel type by peeking in the
channel's content, which can be done only on a best-effort basis. It
will also allow other devices to create their own channels in the
future.

This patch only adds the pseudo-capability; its query helper still
returns false unconditionally. The following patches will add the actual
functionality, after which it will be enabled.

Signed-off-by: Avihai Horon 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/core/machine.c | 1 +
 migration/migration.h | 3 +++
 migration/options.c   | 9 +
 migration/options.h   | 1 +
 4 files changed, 14 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 37ede0e7d4fd..fa28c49f55b7 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -37,6 +37,7 @@ GlobalProperty hw_compat_8_2[] = {
 { "migration", "zero-page-detection", "legacy"},
 { TYPE_VIRTIO_IOMMU_PCI, "granule", "4k" },
 { TYPE_VIRTIO_IOMMU_PCI, "aw-bits", "64" },
+{ "migration", "channel_header", "off" },
 };
 const size_t hw_compat_8_2_len = G_N_ELEMENTS(hw_compat_8_2);
 
diff --git a/migration/migration.h b/migration/migration.h
index 8045e39c26fa..a6114405917f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -450,6 +450,9 @@ struct MigrationState {
  */
 uint8_t clear_bitmap_shift;
 
+/* Whether a header is sent in migration channels */
+bool channel_header;
+
 /*
  * This save hostname when out-going migration starts
  */
diff --git a/migration/options.c b/migration/options.c
index bfd7753b69a5..8fd871cd956d 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -100,6 +100,7 @@ Property migration_properties[] = {
   clear_bitmap_shift, CLEAR_BITMAP_SHIFT_DEFAULT),
 DEFINE_PROP_BOOL("x-preempt-pre-7-2", MigrationState,
  preempt_pre_7_2, false),
+DEFINE_PROP_BOOL("x-channel-header", MigrationState, channel_header, true),
 
 /* Migration parameters */
 DEFINE_PROP_UINT8("x-compress-level", MigrationState,
@@ -381,6 +382,14 @@ bool migrate_zero_copy_send(void)
 
 /* pseudo capabilities */
 
+bool migrate_channel_header(void)
+{
+MigrationState *s = migrate_get_current();
+
+return false;
+return s->channel_header;
+}
+
 bool migrate_multifd_flush_after_each_section(void)
 {
 MigrationState *s = migrate_get_current();
diff --git a/migration/options.h b/migration/options.h
index ab8199e20784..1144d72ec0db 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -52,6 +52,7 @@ bool migrate_zero_copy_send(void);
  * check, but they are not a capability.
  */
 
+bool migrate_channel_header(void);
 bool migrate_multifd_flush_after_each_section(void);
 bool migrate_postcopy(void);
 bool migrate_rdma(void);



[PATCH RFC 13/26] vfio/migration: Add save_{iterate, complete_precopy}_started trace events

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This way both the start and end points of migrating a particular VFIO
device are known.

Also add a vfio_save_iterate_empty_hit trace event so it is known when
there's no more data to send for that device.

Signed-off-by: Maciej S. Szmigiero 
---
 hw/vfio/migration.c   | 13 +
 hw/vfio/trace-events  |  3 +++
 include/hw/vfio/vfio-common.h |  3 +++
 3 files changed, 19 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 1149c6b3740f..bc3aea77455c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -394,6 +394,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
 return -ENOMEM;
 }
 
+migration->save_iterate_run = false;
+migration->save_iterate_empty_hit = false;
+
 if (vfio_precopy_supported(vbasedev)) {
 int ret;
 
@@ -515,9 +518,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
 VFIOMigration *migration = vbasedev->migration;
 ssize_t data_size;
 
+if (!migration->save_iterate_run) {
+trace_vfio_save_iterate_started(vbasedev->name);
+migration->save_iterate_run = true;
+}
+
 data_size = vfio_save_block(f, migration);
 if (data_size < 0) {
 return data_size;
+} else if (data_size == 0 && !migration->save_iterate_empty_hit) {
+trace_vfio_save_iterate_empty_hit(vbasedev->name);
+migration->save_iterate_empty_hit = true;
 }
 
 vfio_update_estimated_pending_data(migration, data_size);
@@ -542,6 +553,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 ssize_t data_size;
 int ret;
 
+trace_vfio_save_complete_precopy_started(vbasedev->name);
+
 /* We reach here with device state STOP or STOP_COPY only */
 ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
VFIO_DEVICE_STATE_STOP);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index f0474b244bf0..a72697678256 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -157,8 +157,11 @@ vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
 vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
+vfio_save_complete_precopy_started(const char *name) " (%s)"
 vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_save_iterate_started(const char *name) " (%s)"
+vfio_save_iterate_empty_hit(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
 vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index b9da6c08ef41..9bb523249e73 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -71,6 +71,9 @@ typedef struct VFIOMigration {
 uint64_t precopy_init_size;
 uint64_t precopy_dirty_size;
 bool initial_data_sent;
+
+bool save_iterate_run;
+bool save_iterate_empty_hit;
 } VFIOMigration;
 
 struct VFIOGroup;



[PATCH RFC 04/26] multifd: change multifd_new_send_channel_create() param type

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This function is only ever called with a MultiFDSendParams-typed parameter,
so use this type explicitly instead of an opaque pointer.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/multifd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 2802afe79d0d..039c0de40af5 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1132,13 +1132,13 @@ out:
 error_free(local_err);
 }
 
-static bool multifd_new_send_channel_create(gpointer opaque, Error **errp)
+static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
 {
 if (!multifd_use_packets()) {
-return file_send_channel_create(opaque, errp);
+return file_send_channel_create(p, errp);
 }
 
-socket_send_channel_create(multifd_new_send_channel_async, opaque);
+socket_send_channel_create(multifd_new_send_channel_async, p);
 return true;
 }
 



[PATCH RFC 06/26] multifd: pass MFDSendChannelConnectData when connecting sending socket

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This will allow passing additional parameters there in the future.

Signed-off-by: Maciej S. Szmigiero 
---
 migration/file.c|  5 ++-
 migration/multifd.c | 95 ++---
 migration/multifd.h |  4 +-
 3 files changed, 80 insertions(+), 24 deletions(-)

diff --git a/migration/file.c b/migration/file.c
index ab18ba505a1d..34dfbc4a5a2d 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -62,7 +62,10 @@ bool file_send_channel_create(gpointer opaque, Error **errp)
 goto out;
 }
 
-multifd_channel_connect(opaque, QIO_CHANNEL(ioc));
+ret = multifd_channel_connect(opaque, QIO_CHANNEL(ioc), errp);
+if (!ret) {
+object_unref(OBJECT(ioc));
+}
 
 out:
 /*
diff --git a/migration/multifd.c b/migration/multifd.c
index 4bc912d7500e..58a18bb1e4a8 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1010,34 +1010,76 @@ out:
 return NULL;
 }
 
-static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
-
-typedef struct {
+struct MFDSendChannelConnectData {
+unsigned int ref;
 MultiFDSendParams *p;
 QIOChannelTLS *tioc;
-} MultiFDTLSThreadArgs;
+};
+
+static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p)
+{
+MFDSendChannelConnectData *data;
+
+data = g_malloc0(sizeof(*data));
+data->ref = 1;
+data->p = p;
+
+return data;
+}
+
+static void mfd_send_channel_connect_data_free(MFDSendChannelConnectData *data)
+{
+g_free(data);
+}
+
+static MFDSendChannelConnectData *
+mfd_send_channel_connect_data_ref(MFDSendChannelConnectData *data)
+{
+unsigned int ref_old;
+
+ref_old = qatomic_fetch_inc(&data->ref);
+assert(ref_old < UINT_MAX);
+
+return data;
+}
+
+static void mfd_send_channel_connect_data_unref(gpointer opaque)
+{
+MFDSendChannelConnectData *data = opaque;
+unsigned int ref_old;
+
+ref_old = qatomic_fetch_dec(&data->ref);
+assert(ref_old > 0);
+if (ref_old == 1) {
+mfd_send_channel_connect_data_free(data);
+}
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(MFDSendChannelConnectData, mfd_send_channel_connect_data_unref)
+
+static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
 
 static void *multifd_tls_handshake_thread(void *opaque)
 {
-MultiFDTLSThreadArgs *args = opaque;
+g_autoptr(MFDSendChannelConnectData) data = opaque;
+QIOChannelTLS *tioc = data->tioc;
 
-qio_channel_tls_handshake(args->tioc,
+qio_channel_tls_handshake(tioc,
   multifd_new_send_channel_async,
-  args->p,
-  NULL,
+  g_steal_pointer(&data),
+  mfd_send_channel_connect_data_unref,
   NULL);
-g_free(args);
 
 return NULL;
 }
 
-static bool multifd_tls_channel_connect(MultiFDSendParams *p,
+static bool multifd_tls_channel_connect(MFDSendChannelConnectData *data,
 QIOChannel *ioc,
 Error **errp)
 {
+MultiFDSendParams *p = data->p;
 MigrationState *s = migrate_get_current();
 const char *hostname = s->hostname;
-MultiFDTLSThreadArgs *args;
 QIOChannelTLS *tioc;
 
 tioc = migration_tls_client_create(ioc, hostname, errp);
@@ -1053,19 +1095,21 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
 trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
 qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
 
-args = g_new0(MultiFDTLSThreadArgs, 1);
-args->tioc = tioc;
-args->p = p;
+data->tioc = tioc;
 
 p->tls_thread_created = true;
 qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
-   multifd_tls_handshake_thread, args,
+   multifd_tls_handshake_thread,
+   mfd_send_channel_connect_data_ref(data),
QEMU_THREAD_JOINABLE);
 return true;
 }
 
-void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc)
+bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc,
+ Error **errp)
 {
+MultiFDSendParams *p = data->p;
+
 qio_channel_set_delay(ioc, false);
 
 migration_ioc_register_yank(ioc);
@@ -1075,6 +1119,8 @@ void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc)
 p->thread_created = true;
 qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
QEMU_THREAD_JOINABLE);
+
+return true;
 }
 
 /*
@@ -1085,7 +1131,8 @@ void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc)
  */
 static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
 {
-MultiFDSendParams *p = opaque;
+MFDSendChannelConnectDa

[PATCH RFC 03/26] migration: Add send/receive header for main channel

2024-04-16 Thread Maciej S. Szmigiero
From: Avihai Horon 

Add send and receive migration channel header for main channel.

Signed-off-by: Avihai Horon 
[MSS: Rename main channel -> default channel where it matches the current term]
Signed-off-by: Maciej S. Szmigiero 
---
 migration/channel.c   |  9 +
 migration/migration.c | 82 +++
 2 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index a72e85f5791c..0e3f51654752 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -81,6 +81,13 @@ void migration_channel_connect(MigrationState *s,
 return;
 }
 } else {
+/* TODO: Send header after register yank? Make a QEMUFile variant? */
+MigChannelHeader header = {};
+header.channel_type = MIG_CHANNEL_TYPE_MAIN;
+if (migration_channel_header_send(ioc, &header, &error)) {
+goto out;
+}
+
 QEMUFile *f = qemu_file_new_output(ioc);
 
 migration_ioc_register_yank(ioc);
@@ -90,6 +97,8 @@ void migration_channel_connect(MigrationState *s,
 qemu_mutex_unlock(&s->qemu_file_lock);
 }
 }
+
+out:
 migrate_fd_connect(s, error);
 error_free(error);
 }
diff --git a/migration/migration.c b/migration/migration.c
index 86bf76e92585..0eb5b4f4f5a1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -869,12 +869,39 @@ void migration_fd_process_incoming(QEMUFile *f)
 migration_incoming_process();
 }
 
+static bool migration_should_start_incoming_header(bool main_channel)
+{
+MigrationIncomingState *mis = migration_incoming_get_current();
+
+if (!mis->from_src_file) {
+return false;
+}
+
+if (migrate_multifd()) {
+return multifd_recv_all_channels_created();
+}
+
+if (migrate_postcopy_preempt() && migrate_get_current()->preempt_pre_7_2) {
+return mis->postcopy_qemufile_dst != NULL;
+}
+
+if (migrate_postcopy_preempt()) {
+return main_channel;
+}
+
+return true;
+}
+
 /*
  * Returns true when we want to start a new incoming migration process,
  * false otherwise.
  */
 static bool migration_should_start_incoming(bool main_channel)
 {
+if (migrate_channel_header()) {
+return migration_should_start_incoming_header(main_channel);
+}
+
 /* Multifd doesn't start unless all channels are established */
 if (migrate_multifd()) {
 return migration_has_all_channels();
@@ -894,7 +921,22 @@ static bool migration_should_start_incoming(bool main_channel)
 return true;
 }
 
-void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
+static void migration_start_incoming(bool main_channel)
+{
+if (!migration_should_start_incoming(main_channel)) {
+return;
+}
+
+/* If it's a recovery, we're done */
+if (postcopy_try_recover()) {
+return;
+}
+
+migration_incoming_process();
+}
+
+static void migration_ioc_process_incoming_no_header(QIOChannel *ioc,
+ Error **errp)
 {
 MigrationIncomingState *mis = migration_incoming_get_current();
 Error *local_err = NULL;
@@ -951,13 +993,39 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 }
 }
 
-if (migration_should_start_incoming(default_channel)) {
-/* If it's a recovery, we're done */
-if (postcopy_try_recover()) {
-return;
-}
-migration_incoming_process();
+migration_start_incoming(default_channel);
+}
+
+void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
+{
+MigChannelHeader header = {};
+bool default_channel = false;
+QEMUFile *f;
+int ret;
+
+if (!migrate_channel_header()) {
+migration_ioc_process_incoming_no_header(ioc, errp);
+return;
+}
+
+ret = migration_channel_header_recv(ioc, &header, errp);
+if (ret) {
+return;
+}
+
+switch (header.channel_type) {
+case MIG_CHANNEL_TYPE_MAIN:
+f = qemu_file_new_input(ioc);
+migration_incoming_setup(f);
+default_channel = true;
+break;
+default:
+error_setg(errp, "Received unknown migration channel type %u",
+   header.channel_type);
+return;
 }
+
+migration_start_incoming(default_channel);
 }
 
 /**



[PATCH RFC 05/26] migration: Add a DestroyNotify parameter to socket_send_channel_create()

2024-04-16 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Makes managing the memory easier.
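
For example, a caller that heap-allocates its opaque data can now hand
ownership over to the connect task (illustrative fragment; MyConnectData
and my_channel_ready_cb are made-up names):

    MyConnectData *data = g_new0(MyConnectData, 1);

    /* data gets freed automatically once the QIOTask callback has run */
    socket_send_channel_create(my_channel_ready_cb, data, g_free);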

Signed-off-by: Maciej S. Szmigiero 
---
 migration/multifd.c  | 2 +-
 migration/postcopy-ram.c | 2 +-
 migration/socket.c   | 6 --
 migration/socket.h   | 3 ++-
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 039c0de40af5..4bc912d7500e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1138,7 +1138,7 @@ static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
 return file_send_channel_create(p, errp);
 }
 
-socket_send_channel_create(multifd_new_send_channel_async, p);
+socket_send_channel_create(multifd_new_send_channel_async, p, NULL);
 return true;
 }
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index eccff499cb20..e314e1023dc1 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1715,7 +1715,7 @@ int postcopy_preempt_establish_channel(MigrationState *s)
 void postcopy_preempt_setup(MigrationState *s)
 {
 /* Kick an async task to connect */
-socket_send_channel_create(postcopy_preempt_send_channel_new, s);
+socket_send_channel_create(postcopy_preempt_send_channel_new, s, NULL);
 }
 
 static void postcopy_pause_ram_fast_load(MigrationIncomingState *mis)
diff --git a/migration/socket.c b/migration/socket.c
index 9ab89b1e089b..6639581cf18d 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -35,11 +35,13 @@ struct SocketOutgoingArgs {
 SocketAddress *saddr;
 } outgoing_args;
 
-void socket_send_channel_create(QIOTaskFunc f, void *data)
+void socket_send_channel_create(QIOTaskFunc f,
+void *data, GDestroyNotify data_destroy)
 {
 QIOChannelSocket *sioc = qio_channel_socket_new();
+
 qio_channel_socket_connect_async(sioc, outgoing_args.saddr,
- f, data, NULL, NULL);
+ f, data, data_destroy, NULL);
 }
 
 QIOChannel *socket_send_channel_create_sync(Error **errp)
diff --git a/migration/socket.h b/migration/socket.h
index 46c233ecd29e..114ab34176aa 100644
--- a/migration/socket.h
+++ b/migration/socket.h
@@ -21,7 +21,8 @@
 #include "io/task.h"
 #include "qemu/sockets.h"
 
-void socket_send_channel_create(QIOTaskFunc f, void *data);
+void socket_send_channel_create(QIOTaskFunc f,
+void *data, GDestroyNotify data_destroy);
 QIOChannel *socket_send_channel_create_sync(Error **errp);
 
 void socket_start_incoming_migration(SocketAddress *saddr, Error **errp);



[PULL 1/3] hv-balloon: avoid alloca() usage

2024-03-08 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

alloca() is frowned upon, so replace it with g_malloc0() + g_autofree.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hv-balloon.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index ade283335a68..35333dab2434 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -366,7 +366,7 @@ static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc)
 PageRangeTree dtree;
 uint64_t *dctr;
 bool our_range;
-struct dm_unballoon_request *ur;
+g_autofree struct dm_unballoon_request *ur = NULL;
 size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
 PageRange range;
 bool bret;
@@ -388,8 +388,7 @@ static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc)
 assert(dtree.t);
 assert(dctr);
 
-ur = alloca(ur_size);
-memset(ur, 0, ur_size);
+ur = g_malloc0(ur_size);
 ur->hdr.type = DM_UNBALLOON_REQUEST;
 ur->hdr.size = ur_size;
 ur->hdr.trans_id = balloon->trans_id;
@@ -531,7 +530,7 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
 PageRange *hot_add_range = &balloon->hot_add_range;
 uint64_t *current_count = &balloon->ha_current_count;
 VMBusChannel *chan = hv_balloon_get_channel(balloon);
-struct dm_hot_add *ha;
+g_autofree struct dm_hot_add *ha = NULL;
 size_t ha_size = sizeof(*ha) + sizeof(ha->range);
 union dm_mem_page_range *ha_region;
 uint64_t align, chunk_max_size;
@@ -560,9 +559,8 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
  */
 *current_count = MIN(hot_add_range->count, chunk_max_size);
 
-ha = alloca(ha_size);
+ha = g_malloc0(ha_size);
 ha_region = &(&ha->range)[1];
-memset(ha, 0, ha_size);
 ha->hdr.type = DM_MEM_HOT_ADD_REQUEST;
 ha->hdr.size = ha_size;
 ha->hdr.trans_id = balloon->trans_id;



[PULL 3/3] vmbus: Print a warning when enabled without the recommended set of features

2024-03-08 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Some Windows versions crash at boot or fail to enable the VMBus device if
they don't see the expected set of Hyper-V features (enlightenments).

Since this provides a poor user experience, let's warn the user if the
VMBus device is enabled without the recommended set of Hyper-V features.

The recommended set is the minimum set of Hyper-V features required to make
the VMBus device work properly in Windows Server versions 2016, 2019 and
2022.
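
For example, an invocation carrying the recommended set (plus the hard
dependencies hv-stimer pulls in; treat the exact flag list as
illustrative):

qemu-system-x86_64 -accel kvm \
    -cpu host,hv-relaxed,hv-vpindex,hv-synic,hv-time,hv-stimer,hv-vapic,hv-runtime \
    -device vmbus-bridge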

Acked-by: Paolo Bonzini 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hyperv.c| 12 
 hw/hyperv/vmbus.c |  6 ++
 include/hw/hyperv/hyperv.h|  4 
 target/i386/kvm/hyperv-stub.c |  4 
 target/i386/kvm/hyperv.c  |  5 +
 target/i386/kvm/hyperv.h  |  2 ++
 target/i386/kvm/kvm.c |  7 +++
 7 files changed, 40 insertions(+)

diff --git a/hw/hyperv/hyperv.c b/hw/hyperv/hyperv.c
index 6c4a18dd0e2a..3ea54ba818b2 100644
--- a/hw/hyperv/hyperv.c
+++ b/hw/hyperv/hyperv.c
@@ -951,3 +951,15 @@ uint64_t hyperv_syndbg_query_options(void)
 
 return msg.u.query_options.options;
 }
+
+static bool vmbus_recommended_features_enabled;
+
+bool hyperv_are_vmbus_recommended_features_enabled(void)
+{
+return vmbus_recommended_features_enabled;
+}
+
+void hyperv_set_vmbus_recommended_features_enabled(void)
+{
+vmbus_recommended_features_enabled = true;
+}
diff --git a/hw/hyperv/vmbus.c b/hw/hyperv/vmbus.c
index 380239af2c7b..f33afeeea27d 100644
--- a/hw/hyperv/vmbus.c
+++ b/hw/hyperv/vmbus.c
@@ -2631,6 +2631,12 @@ static void vmbus_bridge_realize(DeviceState *dev, Error **errp)
 return;
 }
 
+if (!hyperv_are_vmbus_recommended_features_enabled()) {
+warn_report("VMBus enabled without the recommended set of Hyper-V 
features: "
+"hv-stimer, hv-vapic and hv-runtime. "
+"Some Windows versions might not boot or enable the VMBus 
device");
+}
+
 bridge->bus = VMBUS(qbus_new(TYPE_VMBUS, dev, "vmbus"));
 }
 
diff --git a/include/hw/hyperv/hyperv.h b/include/hw/hyperv/hyperv.h
index 015c3524b1c2..d717b4e13d40 100644
--- a/include/hw/hyperv/hyperv.h
+++ b/include/hw/hyperv/hyperv.h
@@ -139,4 +139,8 @@ typedef struct HvSynDbgMsg {
 } HvSynDbgMsg;
 typedef uint16_t (*HvSynDbgHandler)(void *context, HvSynDbgMsg *msg);
 void hyperv_set_syndbg_handler(HvSynDbgHandler handler, void *context);
+
+bool hyperv_are_vmbus_recommended_features_enabled(void);
+void hyperv_set_vmbus_recommended_features_enabled(void);
+
 #endif
diff --git a/target/i386/kvm/hyperv-stub.c b/target/i386/kvm/hyperv-stub.c
index 778ed782e6fc..3263dcf05d31 100644
--- a/target/i386/kvm/hyperv-stub.c
+++ b/target/i386/kvm/hyperv-stub.c
@@ -52,3 +52,7 @@ void hyperv_x86_synic_reset(X86CPU *cpu)
 void hyperv_x86_synic_update(X86CPU *cpu)
 {
 }
+
+void hyperv_x86_set_vmbus_recommended_features_enabled(void)
+{
+}
diff --git a/target/i386/kvm/hyperv.c b/target/i386/kvm/hyperv.c
index 6825c89af374..f2a3fe650a18 100644
--- a/target/i386/kvm/hyperv.c
+++ b/target/i386/kvm/hyperv.c
@@ -149,3 +149,8 @@ int kvm_hv_handle_exit(X86CPU *cpu, struct kvm_hyperv_exit *exit)
 return -1;
 }
 }
+
+void hyperv_x86_set_vmbus_recommended_features_enabled(void)
+{
+hyperv_set_vmbus_recommended_features_enabled();
+}
diff --git a/target/i386/kvm/hyperv.h b/target/i386/kvm/hyperv.h
index 67543296c3a4..e3982c8f4dd1 100644
--- a/target/i386/kvm/hyperv.h
+++ b/target/i386/kvm/hyperv.h
@@ -26,4 +26,6 @@ int hyperv_x86_synic_add(X86CPU *cpu);
 void hyperv_x86_synic_reset(X86CPU *cpu);
 void hyperv_x86_synic_update(X86CPU *cpu);
 
+void hyperv_x86_set_vmbus_recommended_features_enabled(void);
+
 #endif
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 42970ab046fa..e68cbe929302 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1650,6 +1650,13 @@ static int hyperv_init_vcpu(X86CPU *cpu)
 }
 }
 
+/* Skip SynIC and VP_INDEX since they are hard deps already */
+if (hyperv_feat_enabled(cpu, HYPERV_FEAT_STIMER) &&
+hyperv_feat_enabled(cpu, HYPERV_FEAT_VAPIC) &&
+hyperv_feat_enabled(cpu, HYPERV_FEAT_RUNTIME)) {
+hyperv_x86_set_vmbus_recommended_features_enabled();
+}
+
 return 0;
 }
 



[PULL 0/3] Hyper-V Dynamic Memory and VMBus misc small patches

2024-03-08 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

The following changes since commit 8f6330a807f2642dc2a3cdf33347aa28a4c00a87:

  Merge tag 'pull-maintainer-updates-060324-1' of https://gitlab.com/stsquad/qemu into staging (2024-03-06 16:56:20 +)

are available in the Git repository at:

  https://github.com/maciejsszmigiero/qemu.git tags/pull-hv-balloon-20240308

for you to fetch changes up to 6093637b4d32875f98cd59696ffc5f26884aa0b4:

  vmbus: Print a warning when enabled without the recommended set of features (2024-03-08 14:18:56 +0100)


Hyper-V Dynamic Memory and VMBus misc small patches

This pull request contains two small patches to hv-balloon:
the first one replacing alloca() usage with g_malloc0() + g_autofree
and the second one adding additional declaration of a protocol message
struct with an optional field explicitly defined to avoid a Coverity
warning.

Also included is a VMBus patch to print a warning when it is enabled
without the recommended set of Hyper-V features (enlightenments) since
some Windows versions crash at boot in this case.

--------
Maciej S. Szmigiero (3):
  hv-balloon: avoid alloca() usage
  hv-balloon: define dm_hot_add_with_region to avoid Coverity warning
  vmbus: Print a warning when enabled without the recommended set of features

 hw/hyperv/hv-balloon.c   | 18 --
 hw/hyperv/hyperv.c   | 12 
 hw/hyperv/vmbus.c|  6 ++
 include/hw/hyperv/dynmem-proto.h |  9 -
 include/hw/hyperv/hyperv.h   |  4 
 target/i386/kvm/hyperv-stub.c|  4 
 target/i386/kvm/hyperv.c |  5 +
 target/i386/kvm/hyperv.h |  2 ++
 target/i386/kvm/kvm.c|  7 +++
 9 files changed, 56 insertions(+), 11 deletions(-)



[PULL 2/3] hv-balloon: define dm_hot_add_with_region to avoid Coverity warning

2024-03-08 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Since the presence of a hot add memory region is optional in the hot
add request message, it wasn't part of this message declaration
(struct dm_hot_add).

Instead, the code allocated such enlarged message by simply adding the
necessary size for this extra field to the size of basic hot add message
struct.

However, Coverity considers accessing this extra member to be
an out-of-bounds access, even though the memory is actually there.

Fix this by adding an extended variant of this message that explicitly has
an additional union dm_mem_page_range at its end.

CID: #1523903
Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hv-balloon.c   | 10 +-
 include/hw/hyperv/dynmem-proto.h |  9 -
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 35333dab2434..3a9ef0769103 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -513,8 +513,8 @@ ret_idle:
 static void hv_balloon_hot_add_rb_wait(HvBalloon *balloon, StateDesc *stdesc)
 {
 VMBusChannel *chan = hv_balloon_get_channel(balloon);
-struct dm_hot_add *ha;
-size_t ha_size = sizeof(*ha) + sizeof(ha->range);
+struct dm_hot_add_with_region *ha;
+size_t ha_size = sizeof(*ha);
 
 assert(balloon->state == S_HOT_ADD_RB_WAIT);
 
@@ -530,8 +530,8 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
 PageRange *hot_add_range = &balloon->hot_add_range;
 uint64_t *current_count = &balloon->ha_current_count;
 VMBusChannel *chan = hv_balloon_get_channel(balloon);
-g_autofree struct dm_hot_add *ha = NULL;
-size_t ha_size = sizeof(*ha) + sizeof(ha->range);
+g_autofree struct dm_hot_add_with_region *ha = NULL;
+size_t ha_size = sizeof(*ha);
 union dm_mem_page_range *ha_region;
 uint64_t align, chunk_max_size;
 ssize_t ret;
@@ -560,7 +560,7 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
 *current_count = MIN(hot_add_range->count, chunk_max_size);
 
 ha = g_malloc0(ha_size);
-ha_region = &(&ha->range)[1];
+ha_region = &ha->region;
 ha->hdr.type = DM_MEM_HOT_ADD_REQUEST;
 ha->hdr.size = ha_size;
 ha->hdr.trans_id = balloon->trans_id;
diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h
index a657786a94b1..68b8b606f268 100644
--- a/include/hw/hyperv/dynmem-proto.h
+++ b/include/hw/hyperv/dynmem-proto.h
@@ -328,7 +328,8 @@ struct dm_unballoon_response {
 /*
  * Hot add request message. Message sent from the host to the guest.
  *
- * mem_range: Memory range to hot add.
+ * range: Memory range to hot add.
+ * region: Explicit hot add memory region for guest to use. Optional.
  *
  */
 
@@ -337,6 +338,12 @@ struct dm_hot_add {
 union dm_mem_page_range range;
 } QEMU_PACKED;
 
+struct dm_hot_add_with_region {
+struct dm_header hdr;
+union dm_mem_page_range range;
+union dm_mem_page_range region;
+} QEMU_PACKED;
+
 /*
  * Hot add response message.
  * This message is sent by the guest to report the status of a hot add request.



Re: [PATCH] vmbus: Print a warning when enabled without the recommended set of features

2024-02-29 Thread Maciej S. Szmigiero

On 25.01.2024 17:19, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Some Windows versions crash at boot or fail to enable the VMBus device if
they don't see the expected set of Hyper-V features (enlightenments).

Since this provides a poor user experience, let's warn the user if the VMBus
device is enabled without the recommended set of Hyper-V features.

The recommended set is the minimum set of Hyper-V features required to make
the VMBus device work properly in Windows Server versions 2016, 2019 and
2022.

Signed-off-by: Maciej S. Szmigiero 


@Paolo, @Marcelo: can I get some kind of Ack or comments for the KVM part?

Thanks,
Maciej




[PATCH] vmbus: Print a warning when enabled without the recommended set of features

2024-01-25 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Some Windows versions crash at boot or fail to enable the VMBus device if
they don't see the expected set of Hyper-V features (enlightenments).

Since this provides a poor user experience, let's warn the user if the VMBus
device is enabled without the recommended set of Hyper-V features.

The recommended set is the minimum set of Hyper-V features required to make
the VMBus device work properly in Windows Server versions 2016, 2019 and
2022.

Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hyperv.c| 12 
 hw/hyperv/vmbus.c |  6 ++
 include/hw/hyperv/hyperv.h|  4 
 target/i386/kvm/hyperv-stub.c |  4 
 target/i386/kvm/hyperv.c  |  5 +
 target/i386/kvm/hyperv.h  |  2 ++
 target/i386/kvm/kvm.c |  7 +++
 7 files changed, 40 insertions(+)
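
For reference, one way to enable the recommended feature set on the command
line (a hypothetical example; hv-vpindex, hv-synic and hv-time are included
because hv-stimer depends on them):

    qemu-system-x86_64 \
        -cpu host,hv-vpindex,hv-synic,hv-time,hv-stimer,hv-vapic,hv-runtime \
        -device vmbus-bridge ...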

diff --git a/hw/hyperv/hyperv.c b/hw/hyperv/hyperv.c
index 57b402b95610..2c91de7ff4a8 100644
--- a/hw/hyperv/hyperv.c
+++ b/hw/hyperv/hyperv.c
@@ -947,3 +947,15 @@ uint64_t hyperv_syndbg_query_options(void)
 
 return msg.u.query_options.options;
 }
+
+static bool vmbus_recommended_features_enabled;
+
+bool hyperv_are_vmbus_recommended_features_enabled(void)
+{
+return vmbus_recommended_features_enabled;
+}
+
+void hyperv_set_vmbus_recommended_features_enabled(void)
+{
+vmbus_recommended_features_enabled = true;
+}
diff --git a/hw/hyperv/vmbus.c b/hw/hyperv/vmbus.c
index 380239af2c7b..f33afeeea27d 100644
--- a/hw/hyperv/vmbus.c
+++ b/hw/hyperv/vmbus.c
@@ -2631,6 +2631,12 @@ static void vmbus_bridge_realize(DeviceState *dev, Error **errp)
 return;
 }
 
+if (!hyperv_are_vmbus_recommended_features_enabled()) {
+warn_report("VMBus enabled without the recommended set of Hyper-V 
features: "
+"hv-stimer, hv-vapic and hv-runtime. "
+"Some Windows versions might not boot or enable the VMBus 
device");
+}
+
 bridge->bus = VMBUS(qbus_new(TYPE_VMBUS, dev, "vmbus"));
 }
 
diff --git a/include/hw/hyperv/hyperv.h b/include/hw/hyperv/hyperv.h
index 015c3524b1c2..d717b4e13d40 100644
--- a/include/hw/hyperv/hyperv.h
+++ b/include/hw/hyperv/hyperv.h
@@ -139,4 +139,8 @@ typedef struct HvSynDbgMsg {
 } HvSynDbgMsg;
 typedef uint16_t (*HvSynDbgHandler)(void *context, HvSynDbgMsg *msg);
 void hyperv_set_syndbg_handler(HvSynDbgHandler handler, void *context);
+
+bool hyperv_are_vmbus_recommended_features_enabled(void);
+void hyperv_set_vmbus_recommended_features_enabled(void);
+
 #endif
diff --git a/target/i386/kvm/hyperv-stub.c b/target/i386/kvm/hyperv-stub.c
index 778ed782e6fc..3263dcf05d31 100644
--- a/target/i386/kvm/hyperv-stub.c
+++ b/target/i386/kvm/hyperv-stub.c
@@ -52,3 +52,7 @@ void hyperv_x86_synic_reset(X86CPU *cpu)
 void hyperv_x86_synic_update(X86CPU *cpu)
 {
 }
+
+void hyperv_x86_set_vmbus_recommended_features_enabled(void)
+{
+}
diff --git a/target/i386/kvm/hyperv.c b/target/i386/kvm/hyperv.c
index 6825c89af374..f2a3fe650a18 100644
--- a/target/i386/kvm/hyperv.c
+++ b/target/i386/kvm/hyperv.c
@@ -149,3 +149,8 @@ int kvm_hv_handle_exit(X86CPU *cpu, struct kvm_hyperv_exit *exit)
 return -1;
 }
 }
+
+void hyperv_x86_set_vmbus_recommended_features_enabled(void)
+{
+hyperv_set_vmbus_recommended_features_enabled();
+}
diff --git a/target/i386/kvm/hyperv.h b/target/i386/kvm/hyperv.h
index 67543296c3a4..e3982c8f4dd1 100644
--- a/target/i386/kvm/hyperv.h
+++ b/target/i386/kvm/hyperv.h
@@ -26,4 +26,6 @@ int hyperv_x86_synic_add(X86CPU *cpu);
 void hyperv_x86_synic_reset(X86CPU *cpu);
 void hyperv_x86_synic_update(X86CPU *cpu);
 
+void hyperv_x86_set_vmbus_recommended_features_enabled(void);
+
 #endif
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index e88e65fe014c..d3d01b3cf82d 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1650,6 +1650,13 @@ static int hyperv_init_vcpu(X86CPU *cpu)
 }
 }
 
+/* Skip SynIC and VP_INDEX since they are hard deps already */
+if (hyperv_feat_enabled(cpu, HYPERV_FEAT_STIMER) &&
+hyperv_feat_enabled(cpu, HYPERV_FEAT_VAPIC) &&
+hyperv_feat_enabled(cpu, HYPERV_FEAT_RUNTIME)) {
+hyperv_x86_set_vmbus_recommended_features_enabled();
+}
+
 return 0;
 }
 



Re: [PATCH v1 0/2] memory-device: reintroduce memory region size check

2024-01-22 Thread Maciej S. Szmigiero

Hi David,

On 17.01.2024 14:55, David Hildenbrand wrote:

Reintroduce a modified region size check, after we would now allow some
configurations that don't make any sense (e.g., partial hugetlb pages,
1G+1byte DIMMs).

We have to take care of hv-balloon first, which was the reason why we
removed that check in the first place.

Cc: "Maciej S. Szmigiero" 
Cc: Mario Casquero 
Cc: Igor Mammedov 
Cc: Xiao Guangrong 

David Hildenbrand (2):
   hv-balloon: use get_min_alignment() to express 32 GiB alignment
   memory-device: reintroduce memory region size check

  hw/hyperv/hv-balloon.c | 37 +
  hw/mem/memory-device.c | 14 ++
  2 files changed, 35 insertions(+), 16 deletions(-)



Looked at the changes, tested hv-balloon with a small memory
backend and it seems to work fine, so for the whole series:

Reviewed-by: Maciej S. Szmigiero 

Thanks,
Maciej




Re: [PATCH 2/5] vmbus: Switch bus reset to 3-phase-reset

2024-01-22 Thread Maciej S. Szmigiero

On 19.01.2024 17:35, Peter Maydell wrote:

Switch vmbus from using BusClass::reset to the Resettable interface.

This has no behavioural change, because the BusClass code to support
subclasses that use the legacy BusClass::reset will call that method
in the hold phase of 3-phase reset.

Signed-off-by: Peter Maydell 
---


Acked-by: Maciej S. Szmigiero 

Thanks,
Maciej




Re: [PATCH trivial 15/21] include/hw/hyperv/dynmem-proto.h: spelling fix: nunber

2023-11-14 Thread Maciej S. Szmigiero

On 14.11.2023 17:58, Michael Tokarev wrote:

Fixes: 4f80cd2f033e "Add Hyper-V Dynamic Memory Protocol definitions"
Cc: Maciej S. Szmigiero 
Signed-off-by: Michael Tokarev 
---


Acked-by: Maciej S. Szmigiero 

Thanks,
Maciej




[PATCH] hv-balloon: define dm_hot_add_with_region to avoid Coverity warning

2023-11-13 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Since the presence of a hot add memory region is optional in the hot add
request message, it wasn't part of this message declaration
(struct dm_hot_add).

Instead, the code allocated such an enlarged message by simply adding the
size needed for this extra field to the size of the basic hot add message
struct.

However, Coverity considers accessing this extra member to be
an out-of-bounds access, even though the memory is actually there.

Fix this by adding an extended variant of this message that explicitly has
an additional union dm_mem_page_range at its end.

CID: #1523903
Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hv-balloon.c   | 10 +-
 include/hw/hyperv/dynmem-proto.h |  9 -
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index a4b4bde0a1e9..5b8f8aac7216 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -512,8 +512,8 @@ ret_idle:
 static void hv_balloon_hot_add_rb_wait(HvBalloon *balloon, StateDesc *stdesc)
 {
 VMBusChannel *chan = hv_balloon_get_channel(balloon);
-struct dm_hot_add *ha;
-size_t ha_size = sizeof(*ha) + sizeof(ha->range);
+struct dm_hot_add_with_region *ha;
+size_t ha_size = sizeof(*ha);
 
 assert(balloon->state == S_HOT_ADD_RB_WAIT);
 
@@ -529,8 +529,8 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
 PageRange *hot_add_range = &balloon->hot_add_range;
 uint64_t *current_count = &balloon->ha_current_count;
 VMBusChannel *chan = hv_balloon_get_channel(balloon);
-g_autofree struct dm_hot_add *ha = NULL;
-size_t ha_size = sizeof(*ha) + sizeof(ha->range);
+g_autofree struct dm_hot_add_with_region *ha = NULL;
+size_t ha_size = sizeof(*ha);
 union dm_mem_page_range *ha_region;
 uint64_t align, chunk_max_size;
 ssize_t ret;
@@ -559,7 +559,7 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
 *current_count = MIN(hot_add_range->count, chunk_max_size);
 
 ha = g_malloc0(ha_size);
-ha_region = &(&ha->range)[1];
+ha_region = &ha->region;
 ha->hdr.type = DM_MEM_HOT_ADD_REQUEST;
 ha->hdr.size = ha_size;
 ha->hdr.trans_id = balloon->trans_id;
diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h
index d0f9090ac489..834edeb59855 100644
--- a/include/hw/hyperv/dynmem-proto.h
+++ b/include/hw/hyperv/dynmem-proto.h
@@ -328,7 +328,8 @@ struct dm_unballoon_response {
 /*
  * Hot add request message. Message sent from the host to the guest.
  *
- * mem_range: Memory range to hot add.
+ * range: Memory range to hot add.
+ * region: Explicit hot add memory region for guest to use. Optional.
  *
  */
 
@@ -337,6 +338,12 @@ struct dm_hot_add {
 union dm_mem_page_range range;
 } QEMU_PACKED;
 
+struct dm_hot_add_with_region {
+struct dm_header hdr;
+union dm_mem_page_range range;
+union dm_mem_page_range region;
+} QEMU_PACKED;
+
 /*
  * Hot add response message.
  * This message is sent by the guest to report the status of a hot add request.



Re: [PATCH] hv-balloon: avoid alloca() usage

2023-11-13 Thread Maciej S. Szmigiero

On 13.11.2023 09:59, David Hildenbrand wrote:

On 09.11.23 17:02, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

alloca() is frowned upon, replace it with g_malloc0() + g_autofree.



Reviewed-by: David Hildenbrand 

If this fixes a Coverity issue of #number, we usually indicate that using
"CID: #number" or "Fixes: CID: #number"



Will add "CID: #1523903" to the commit message then.

Thanks,
Maciej




[PATCH] hv-balloon: avoid alloca() usage

2023-11-09 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

alloca() is frowned upon, replace it with g_malloc0() + g_autofree.

Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hv-balloon.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)
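
The replacement pattern in a nutshell (a minimal standalone sketch, not the
driver code itself -- the function and buffer here are hypothetical):

    #include <glib.h>

    static void send_example_message(size_t msg_size)
    {
        /* Heap allocation that is freed automatically once "msg" goes out
         * of scope, replacing the stack allocation done by alloca(). */
        g_autofree char *msg = g_malloc0(msg_size);

        /* ... fill in and post the message; no explicit g_free() needed ... */
    }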

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 66f297c1d7e3..a4b4bde0a1e9 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -365,7 +365,7 @@ static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc)
 PageRangeTree dtree;
 uint64_t *dctr;
 bool our_range;
-struct dm_unballoon_request *ur;
+g_autofree struct dm_unballoon_request *ur = NULL;
 size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
 PageRange range;
 bool bret;
@@ -387,8 +387,7 @@ static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc)
 assert(dtree.t);
 assert(dctr);
 
-ur = alloca(ur_size);
-memset(ur, 0, ur_size);
+ur = g_malloc0(ur_size);
 ur->hdr.type = DM_UNBALLOON_REQUEST;
 ur->hdr.size = ur_size;
 ur->hdr.trans_id = balloon->trans_id;
@@ -530,7 +529,7 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
 PageRange *hot_add_range = &balloon->hot_add_range;
 uint64_t *current_count = &balloon->ha_current_count;
 VMBusChannel *chan = hv_balloon_get_channel(balloon);
-struct dm_hot_add *ha;
+g_autofree struct dm_hot_add *ha = NULL;
 size_t ha_size = sizeof(*ha) + sizeof(ha->range);
 union dm_mem_page_range *ha_region;
 uint64_t align, chunk_max_size;
@@ -559,9 +558,8 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
  */
 *current_count = MIN(hot_add_range->count, chunk_max_size);
 
-ha = alloca(ha_size);
+ha = g_malloc0(ha_size);
 ha_region = &(>range)[1];
-memset(ha, 0, ha_size);
 ha->hdr.type = DM_MEM_HOT_ADD_REQUEST;
 ha->hdr.size = ha_size;
 ha->hdr.trans_id = balloon->trans_id;



Re: [PULL 06/10] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support

2023-11-09 Thread Maciej S. Szmigiero

On 9.11.2023 15:51, Peter Maydell wrote:

On Mon, 6 Nov 2023 at 14:23, Maciej S. Szmigiero
 wrote:


From: "Maciej S. Szmigiero" 

One of the advantages of using this protocol over ACPI-based PC DIMM hotplug is
that it allows hot-adding memory at a much smaller granularity, because the
ACPI DIMM slot limit does not apply.

In order to enable this functionality a new memory backend needs to be
created and provided to the driver via the "memdev" parameter.

This can be achieved by, for example, adding
"-object memory-backend-ram,id=mem1,size=32G" to the QEMU command line and
then instantiating the driver with "memdev=mem1" parameter.

The device will try to use multiple memslots to cover the memory backend in
order to reduce the size of metadata for the not-yet-hot-added part of the
memory backend.

Co-developed-by: David Hildenbrand 
Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 


Hi; I was looking at this code because Coverity reported an
issue in it. I think that's because Coverity has got confused
about the way you're doing memory allocation here. But
in looking at the code I see that you're using alloca() in
this function.

Please could you rewrite this not to do that -- we don't use
alloca() or variable-length-arrays in QEMU except in a few
cases which we're trying to get rid of, so we'd like not to
add new uses to the code base.



Sure, will do - I didn't know alloca() is frowned upon
(and David probably didn't either).


thanks
-- PMM


Thanks,
Maciej




[PULL 10/10] MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 MAINTAINERS | 8 
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8e8a7d5be5de..d4a480ce5a62 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2656,6 +2656,14 @@ F: hw/usb/canokey.c
 F: hw/usb/canokey.h
 F: docs/system/devices/canokey.rst
 
+Hyper-V Dynamic Memory Protocol
+M: Maciej S. Szmigiero 
+S: Supported
+F: hw/hyperv/hv-balloon*.c
+F: hw/hyperv/hv-balloon*.h
+F: include/hw/hyperv/dynmem-proto.h
+F: include/hw/hyperv/hv-balloon.h
+
 Subsystems
 --
 Overall Audio backends



[PULL 04/10] Add Hyper-V Dynamic Memory Protocol definitions

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This commit adds Hyper-V Dynamic Memory Protocol definitions, taken
from hv_balloon Linux kernel driver, adapted to the QEMU coding style and
definitions.

Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 include/hw/hyperv/dynmem-proto.h | 423 +++
 1 file changed, 423 insertions(+)
 create mode 100644 include/hw/hyperv/dynmem-proto.h
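
A quick worked example of the version packing implemented by the macros in
this header (values computed from the definitions below):

    uint32_t v = DYNMEM_MAKE_VERSION(2, 0);   /* (2 << 16) | 0 == 0x00020000 */
    uint32_t major = DYNMEM_MAJOR_VERSION(v); /* 0x00020000 >> 16 == 2 */
    uint32_t minor = DYNMEM_MINOR_VERSION(v); /* 0x00020000 & 0xff == 0 */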

diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h
new file mode 100644
index ..d0f9090ac489
--- /dev/null
+++ b/include/hw/hyperv/dynmem-proto.h
@@ -0,0 +1,423 @@
+#ifndef HW_HYPERV_DYNMEM_PROTO_H
+#define HW_HYPERV_DYNMEM_PROTO_H
+
+/*
+ * Hyper-V Dynamic Memory Protocol definitions
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * Based on drivers/hv/hv_balloon.c from Linux kernel:
+ * Copyright (c) 2012, Microsoft Corporation.
+ *
+ * Author: K. Y. Srinivasan 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+/*
+ * Protocol versions. The low word is the minor version, the high word the major
+ * version.
+ *
+ * History:
+ * Initial version 1.0
+ * Changed to 0.1 on 2009/03/25
+ * Changes to 0.2 on 2009/05/14
+ * Changes to 0.3 on 2009/12/03
+ * Changed to 1.0 on 2011/04/05
+ * Changed to 2.0 on 2019/12/10
+ */
+
+#define DYNMEM_MAKE_VERSION(Major, Minor) ((uint32_t)(((Major) << 16) | 
(Minor)))
+#define DYNMEM_MAJOR_VERSION(Version) ((uint32_t)(Version) >> 16)
+#define DYNMEM_MINOR_VERSION(Version) ((uint32_t)(Version) & 0xff)
+
+enum {
+DYNMEM_PROTOCOL_VERSION_1 = DYNMEM_MAKE_VERSION(0, 3),
+DYNMEM_PROTOCOL_VERSION_2 = DYNMEM_MAKE_VERSION(1, 0),
+DYNMEM_PROTOCOL_VERSION_3 = DYNMEM_MAKE_VERSION(2, 0),
+
+DYNMEM_PROTOCOL_VERSION_WIN7 = DYNMEM_PROTOCOL_VERSION_1,
+DYNMEM_PROTOCOL_VERSION_WIN8 = DYNMEM_PROTOCOL_VERSION_2,
+DYNMEM_PROTOCOL_VERSION_WIN10 = DYNMEM_PROTOCOL_VERSION_3,
+
+DYNMEM_PROTOCOL_VERSION_CURRENT = DYNMEM_PROTOCOL_VERSION_WIN10
+};
+
+
+
+/*
+ * Message Types
+ */
+
+enum dm_message_type {
+/*
+ * Version 0.3
+ */
+DM_ERROR = 0,
+DM_VERSION_REQUEST = 1,
+DM_VERSION_RESPONSE = 2,
+DM_CAPABILITIES_REPORT = 3,
+DM_CAPABILITIES_RESPONSE = 4,
+DM_STATUS_REPORT = 5,
+DM_BALLOON_REQUEST = 6,
+DM_BALLOON_RESPONSE = 7,
+DM_UNBALLOON_REQUEST = 8,
+DM_UNBALLOON_RESPONSE = 9,
+DM_MEM_HOT_ADD_REQUEST = 10,
+DM_MEM_HOT_ADD_RESPONSE = 11,
+DM_VERSION_03_MAX = 11,
+/*
+ * Version 1.0.
+ */
+DM_INFO_MESSAGE = 12,
+DM_VERSION_1_MAX = 12,
+
+/*
+ * Version 2.0
+ */
+DM_MEM_HOT_REMOVE_REQUEST = 13,
+DM_MEM_HOT_REMOVE_RESPONSE = 14
+};
+
+
+/*
+ * Structures defining the dynamic memory management
+ * protocol.
+ */
+
+union dm_version {
+struct {
+uint16_t minor_version;
+uint16_t major_version;
+};
+uint32_t version;
+} QEMU_PACKED;
+
+
+union dm_caps {
+struct {
+uint64_t balloon:1;
+uint64_t hot_add:1;
+/*
+ * To support guests that may have alignment
+ * limitations on hot-add, the guest can specify
+ * its alignment requirements; a value of n
+ * represents an alignment of 2^n in mega bytes.
+ */
+uint64_t hot_add_alignment:4;
+uint64_t hot_remove:1;
+uint64_t reservedz:57;
+} cap_bits;
+uint64_t caps;
+} QEMU_PACKED;
+
+union dm_mem_page_range {
+struct  {
+/*
+ * The PFN number of the first page in the range.
+ * 40 bits is the architectural limit of a PFN
+ * number for AMD64.
+ */
+uint64_t start_page:40;
+/*
+ * The number of pages in the range.
+ */
+uint64_t page_cnt:24;
+} finfo;
+uint64_t  page_range;
+} QEMU_PACKED;
+
+
+
+/*
+ * The header for all dynamic memory messages:
+ *
+ * type: Type of the message.
+ * size: Size of the message in bytes; including the header.
+ * trans_id: The guest is responsible for manufacturing this ID.
+ */
+
+struct dm_header {
+uint16_t type;
+uint16_t size;
+uint32_t trans_id;
+} QEMU_PACKED;
+
+/*
+ * A generic message format for dynamic memory.
+ * Specific message formats are defined later in the file.
+ */
+
+struct dm_message {
+struct dm_header hdr;
+uint8_t data[]; /* enclosed message */
+} QEMU_PACKED;
+
+
+/*
+ * Specific message types supporting the dynamic memory protocol.
+ */
+
+/*
+ * Version negotiation message. Sent from the guest to the host.
+ * The guest is free to try different versions until the host
+ * accepts the version.
+ *
+ * dm_version: The protocol version requested.
+ * is_last_attempt: If TRUE, this is the last version guest will request.
+ * reservedz: Reserved field, set to zero.
+ */
+
+struct dm_version_request {
+struct dm_header hdr;
+union dm_version version;
+

[PULL 05/10] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This driver is like virtio-balloon on steroids: it allows both changing the
guest memory allocation via ballooning and (in the next patch) inserting
pieces of extra RAM into it on demand from a provided memory backend.

The actual resizing is done via ballooning interface (for example, via
the "balloon" HMP command).
This includes resizing the guest past its boot size - that is, hot-adding
additional memory in granularity limited only by the guest alignment
requirements, as provided by the next patch.

In contrast with ACPI DIMM hotplug where one can only request to unplug a
whole DIMM stick this driver allows removing memory from guest in single
page (4k) units via ballooning.

After a VM reboot the guest is back to its original (boot) size.

In the future, the guest boot memory size might be changed on reboot
instead, taking into account the effective size that VM had before that
reboot (much like Hyper-V does).

For performance reasons, the guest-released memory is tracked in a few
range trees, as a series of (start, count) ranges.
Each time a new page range is inserted into such tree its neighbors are
checked as candidates for possible merging with it.

Besides performance reasons, the Dynamic Memory protocol itself uses page
ranges as the data structure in its messages, so relevant pages need to be
merged into such ranges anyway.

One has to be careful when tracking the guest-released pages, since the
guest can maliciously report returning pages outside its current address
space, which may later clash with the address range of newly added memory.
Similarly, the guest can report freeing the same page twice.

The above design results in much better ballooning performance than when
using virtio-balloon with the same guest: 230 GB / minute with this driver
versus 70 GB / minute with virtio-balloon.

During a ballooning operation most of time is spent waiting for the guest
to come up with newly freed page ranges, processing the received ranges on
the host side (in QEMU and KVM) is nearly instantaneous.

The unballoon operation is also pretty much instantaneous:
thanks to the merging of the ballooned out page ranges 200 GB of memory can
be returned to the guest in about 1 second.
With virtio-balloon this operation takes about 2.5 minutes.

These tests were done against a Windows Server 2019 guest running on a
Xeon E5-2699, after dirtying the whole memory inside guest before each
balloon operation.

Using a range tree instead of a bitmap to track the removed memory also
means that the solution scales well with the guest size: even a 1 TB range
takes just a few bytes of such metadata.

Since the required GTree operations aren't present in every Glib version,
a check for them was added to the meson build script, together with new
"--enable-hv-balloon" and "--disable-hv-balloon" configure arguments.
If these GTree operations are missing in the system's Glib version, this
driver will be skipped during the QEMU build.

An optional "status-report=on" device parameter requests memory status
events from the guest (typically sent every second), which allow the host
to learn both the guest memory available and the guest memory in use
counts.

Following commits will add support for their external emission as
"HV_BALLOON_STATUS_REPORT" QMP events.

The driver is named hv-balloon since the Linux kernel client driver for
the Dynamic Memory Protocol is named as such and to follow the naming
pattern established by the virtio-balloon driver.
The whole protocol runs over Hyper-V VMBus.

The driver was tested against Windows Server 2012 R2, Windows Server 2016
and Windows Server 2019 guests and obeys the guest alignment requirements
reported to the host via DM_CAPABILITIES_REPORT message.

Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 Kconfig.host   |3 +
 hw/hyperv/Kconfig  |   10 +
 hw/hyperv/hv-balloon-internal.h|   33 +
 hw/hyperv/hv-balloon-page_range_tree.c |  228 +
 hw/hyperv/hv-balloon-page_range_tree.h |  118 +++
 hw/hyperv/hv-balloon.c | 1160 
 hw/hyperv/meson.build  |1 +
 hw/hyperv/trace-events |   13 +
 include/hw/hyperv/hv-balloon.h |   18 +
 meson.build|   28 +-
 meson_options.txt  |2 +
 scripts/meson-buildoptions.sh  |3 +
 12 files changed, 1616 insertions(+), 1 deletion(-)
 create mode 100644 hw/hyperv/hv-balloon-internal.h
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.c
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.h
 create mode 100644 hw/hyperv/hv-balloon.c
 create mode 100644 include/hw/hyperv/hv-balloon.h
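
To illustrate the (start, count) range merging described above, here is a
simplified sketch (the actual driver keeps these ranges in a GTree and does
more bookkeeping; types and names are condensed for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct PageRange {
        uint64_t start;  /* first page frame number in the range */
        uint64_t count;  /* number of pages in the range */
    } PageRange;

    /* Extend *range by *other if the two ranges are directly adjacent. */
    static bool page_range_try_merge(PageRange *range, const PageRange *other)
    {
        if (other->start + other->count == range->start) {
            /* "other" is the immediate left neighbor */
            range->start = other->start;
            range->count += other->count;
            return true;
        }
        if (range->start + range->count == other->start) {
            /* "other" is the immediate right neighbor */
            range->count += other->count;
            return true;
        }
        return false;
    }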

diff --git a/Kconfig.host b/Kconfig.host
index d763d892693c..2ee71578f38f 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -46,3 +46,6 @@ config FUZZ
 config VFIO_USER_SE

[PULL 07/10] qapi: Add query-memory-devices support to hv-balloon

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Used by the driver to report its provided memory state information.

Co-developed-by: David Hildenbrand 
Reviewed-by: David Hildenbrand 
Acked-by: Markus Armbruster 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/core/machine-hmp-cmds.c | 15 +++
 hw/hyperv/hv-balloon.c | 27 +-
 qapi/machine.json  | 39 --
 3 files changed, 78 insertions(+), 3 deletions(-)
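
With this change, "info memory-devices" output for an hv-balloon device
looks roughly like the following (device ID and values are made up for
illustration):

    Memory device [hv-balloon]: "hvb0"
      memaddr: 0x100000000
      max-size: 34359738368
      memdev: /objects/mem1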

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 9a4b59c6f210..a6ff6a487583 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -253,6 +253,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
 MemoryDeviceInfo *value;
 PCDIMMDeviceInfo *di;
 SgxEPCDeviceInfo *se;
+HvBalloonDeviceInfo *hi;
 
 for (info = info_list; info; info = info->next) {
 value = info->value;
@@ -310,6 +311,20 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
 monitor_printf(mon, "  node: %" PRId64 "\n", se->node);
 monitor_printf(mon, "  memdev: %s\n", se->memdev);
 break;
+case MEMORY_DEVICE_INFO_KIND_HV_BALLOON:
+hi = value->u.hv_balloon.data;
+monitor_printf(mon, "Memory device [%s]: \"%s\"\n",
+   MemoryDeviceInfoKind_str(value->type),
+   hi->id ? hi->id : "");
+if (hi->has_memaddr) {
+monitor_printf(mon, "  memaddr: 0x%" PRIx64 "\n",
+   hi->memaddr);
+}
+monitor_printf(mon, "  max-size: %" PRIu64 "\n", hi->max_size);
+if (hi->memdev) {
+monitor_printf(mon, "  memdev: %s\n", hi->memdev);
+}
+break;
 default:
 g_assert_not_reached();
 }
diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 5999f1127d87..44a8d15cc841 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -1625,6 +1625,31 @@ static MemoryRegion *hv_balloon_md_get_memory_region(MemoryDeviceState *md,
 return balloon->mr;
 }
 
+static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md,
+   MemoryDeviceInfo *info)
+{
+HvBalloonDeviceInfo *hi = g_new0(HvBalloonDeviceInfo, 1);
+const HvBalloon *balloon = HV_BALLOON(md);
+DeviceState *dev = DEVICE(md);
+
+if (dev->id) {
+hi->id = g_strdup(dev->id);
+}
+
+if (balloon->hostmem) {
+hi->memdev = object_get_canonical_path(OBJECT(balloon->hostmem));
+hi->memaddr = balloon->addr;
+hi->has_memaddr = true;
+hi->max_size = memory_region_size(balloon->mr);
+/* TODO: expose current provided size or something else? */
+} else {
+hi->max_size = 0;
+}
+
+info->u.hv_balloon.data = hi;
+info->type = MEMORY_DEVICE_INFO_KIND_HV_BALLOON;
+}
+
 static void hv_balloon_decide_memslots(MemoryDeviceState *md,
unsigned int limit)
 {
@@ -1712,5 +1737,5 @@ static void hv_balloon_class_init(ObjectClass *klass, void *data)
 mdc->get_memory_region = hv_balloon_md_get_memory_region;
 mdc->decide_memslots = hv_balloon_decide_memslots;
 mdc->get_memslots = hv_balloon_get_memslots;
-/* implement fill_device_info */
+mdc->fill_device_info = hv_balloon_md_fill_device_info;
 }
diff --git a/qapi/machine.json b/qapi/machine.json
index 6c9d2f6dcffe..2985d043c00d 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1289,6 +1289,29 @@
   }
 }
 
+##
+# @HvBalloonDeviceInfo:
+#
+# hv-balloon provided memory state information
+#
+# @id: device's ID
+#
+# @memaddr: physical address in memory, where device is mapped
+#
+# @max-size: the maximum size of memory that the device can provide
+#
+# @memdev: memory backend linked with device
+#
+# Since: 8.2
+##
+{ 'struct': 'HvBalloonDeviceInfo',
+  'data': { '*id': 'str',
+'*memaddr': 'size',
+'max-size': 'size',
+'*memdev': 'str'
+  }
+}
+
 ##
 # @MemoryDeviceInfoKind:
 #
@@ -1300,10 +1323,13 @@
 #
 # @sgx-epc: since 6.2.
 #
+# @hv-balloon: since 8.2.
+#
 # Since: 2.1
 ##
 { 'enum': 'MemoryDeviceInfoKind',
-  'data': [ 'dimm', 'nvdimm', 'virtio-pmem', 'virtio-mem', 'sgx-epc' ] }
+  'data': [ 'dimm', 'nvdimm', 'virtio-pmem', 'virtio-mem', 'sgx-epc',
+'hv-balloon' ] }
 
 ##
 # @PCDIMMDeviceInfoWrapper:
@@ -1337,6 +1363,14 @@
 { 'struct': 'SgxEPCDeviceInfoWrapper',
   'data': { 'data': 'SgxEPCDeviceInfo' } }
 
+##
+# @HvBalloonDeviceInfoWrapper:
+#
+# Since: 8.2
+##
+{

[PULL 00/10] Hyper-V Dynamic Memory Protocol driver (hv-balloon) pull req fixed

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Hi Stefan,

Fixed the CI pipeline issues with yesterday's pull request, and:
the following changes since commit d762bf97931b58839316b68a570eecc6143c9e3e:

  Merge tag 'pull-target-arm-20231102' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-11-03 10:04:12 +0800)

are available in the Git repository at:

  https://github.com/maciejsszmigiero/qemu.git tags/pull-hv-balloon-20231106

for you to fetch changes up to 00313b517d09c0b141fb32997791f911c28fd3ff:

  MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol (2023-11-06 14:08:10 +0100)


Hyper-V Dynamic Memory protocol driver.

This driver is like virtio-balloon on steroids for Windows guests:
it allows both changing the guest memory allocation via ballooning and
inserting pieces of extra RAM into it on demand from a provided memory
backend via Windows-native Hyper-V Dynamic Memory protocol.

* Preparatory patches to support empty memory devices and ones with
large alignment requirements.

* Revert of recently added "hw/virtio/virtio-pmem: Replace impossible
check by assertion" commit 5960f254dbb4 since this series makes this
situation possible again.

* Protocol definitions.

* Hyper-V DM protocol driver (hv-balloon) base (ballooning only).

* Hyper-V DM protocol driver (hv-balloon) hot-add support.

* qapi query-memory-devices support for the driver.

* qapi HV_BALLOON_STATUS_REPORT event.

* The relevant PC machine plumbing.

* New MAINTAINERS entry for the above.


David Hildenbrand (2):
  memory-device: Support empty memory devices
  memory-device: Drop size alignment check

Maciej S. Szmigiero (8):
  Revert "hw/virtio/virtio-pmem: Replace impossible check by assertion"
  Add Hyper-V Dynamic Memory Protocol definitions
  Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base
  Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support
  qapi: Add query-memory-devices support to hv-balloon
  qapi: Add HV_BALLOON_STATUS_REPORT event and its QMP query command
  hw/i386/pc: Support hv-balloon
  MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol

 Kconfig.host  |3 +
 MAINTAINERS   |8 +
 hw/core/machine-hmp-cmds.c|   15 +
 hw/hyperv/Kconfig |   10 +
 hw/hyperv/hv-balloon-internal.h   |   33 +
 hw/hyperv/hv-balloon-our_range_memslots.c |  201 
 hw/hyperv/hv-balloon-our_range_memslots.h |  110 ++
 hw/hyperv/hv-balloon-page_range_tree.c|  228 
 hw/hyperv/hv-balloon-page_range_tree.h|  118 ++
 hw/hyperv/hv-balloon-stub.c   |   19 +
 hw/hyperv/hv-balloon.c| 1769 +
 hw/hyperv/meson.build |1 +
 hw/hyperv/trace-events|   18 +
 hw/i386/Kconfig   |1 +
 hw/i386/pc.c  |   22 +
 hw/mem/memory-device.c|   49 +-
 hw/virtio/virtio-pmem.c   |5 +-
 include/hw/hyperv/dynmem-proto.h  |  423 +++
 include/hw/hyperv/hv-balloon.h|   18 +
 include/hw/mem/memory-device.h|7 +-
 meson.build   |   28 +-
 meson_options.txt |2 +
 monitor/monitor.c |1 +
 qapi/machine.json |  101 +-
 scripts/meson-buildoptions.sh |3 +
 tests/qtest/qmp-cmd-test.c|1 +
 26 files changed, 3180 insertions(+), 14 deletions(-)
 create mode 100644 hw/hyperv/hv-balloon-internal.h
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.c
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.h
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.c
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.h
 create mode 100644 hw/hyperv/hv-balloon-stub.c
 create mode 100644 hw/hyperv/hv-balloon.c
 create mode 100644 include/hw/hyperv/dynmem-proto.h
 create mode 100644 include/hw/hyperv/hv-balloon.h



[PULL 02/10] Revert "hw/virtio/virtio-pmem: Replace impossible check by assertion"

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This reverts commit 5960f254dbb46f0c7a9f5f44bf4d27c19c34cb97 since the
previous commit made this situation possible again.

Signed-off-by: Maciej S. Szmigiero 
---
 hw/virtio/virtio-pmem.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
index cc24812d2e92..c3512c2dae3f 100644
--- a/hw/virtio/virtio-pmem.c
+++ b/hw/virtio/virtio-pmem.c
@@ -147,7 +147,10 @@ static void virtio_pmem_fill_device_info(const VirtIOPMEM *pmem,
 static MemoryRegion *virtio_pmem_get_memory_region(VirtIOPMEM *pmem,
Error **errp)
 {
-assert(pmem->memdev);
+if (!pmem->memdev) {
+error_setg(errp, "'%s' property must be set", VIRTIO_PMEM_MEMDEV_PROP);
+return NULL;
+}
 
 return &pmem->memdev->mr;
 }



[PULL 06/10] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

One of the advantages of using this protocol over ACPI-based PC DIMM hotplug is
that it allows hot-adding memory at a much smaller granularity, because the
ACPI DIMM slot limit does not apply.

In order to enable this functionality a new memory backend needs to be
created and provided to the driver via the "memdev" parameter.

This can be achieved by, for example, adding
"-object memory-backend-ram,id=mem1,size=32G" to the QEMU command line and
then instantiating the driver with "memdev=mem1" parameter.

The device will try to use multiple memslots to cover the memory backend in
order to reduce the size of metadata for the not-yet-hot-added part of the
memory backend.

Co-developed-by: David Hildenbrand 
Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hv-balloon-our_range_memslots.c | 201 
 hw/hyperv/hv-balloon-our_range_memslots.h | 110 +
 hw/hyperv/hv-balloon.c| 566 +-
 hw/hyperv/meson.build |   2 +-
 hw/hyperv/trace-events|   5 +
 5 files changed, 878 insertions(+), 6 deletions(-)
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.c
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.h

diff --git a/hw/hyperv/hv-balloon-our_range_memslots.c b/hw/hyperv/hv-balloon-our_range_memslots.c
new file mode 100644
index ..99bae870f371
--- /dev/null
+++ b/hw/hyperv/hv-balloon-our_range_memslots.c
@@ -0,0 +1,201 @@
+/*
+ * QEMU Hyper-V Dynamic Memory Protocol driver
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "hv-balloon-internal.h"
+#include "hv-balloon-our_range_memslots.h"
+#include "trace.h"
+
+/* OurRange */
+static void our_range_init(OurRange *our_range, uint64_t start, uint64_t count)
+{
+assert(count <= UINT64_MAX - start);
+our_range->range.start = start;
+our_range->range.count = count;
+
+hvb_page_range_tree_init(&our_range->removed_guest);
+hvb_page_range_tree_init(&our_range->removed_both);
+
+/* mark the whole range as unused but for potential use */
+our_range->added = 0;
+our_range->unusable_tail = 0;
+}
+
+static void our_range_destroy(OurRange *our_range)
+{
+hvb_page_range_tree_destroy(&our_range->removed_guest);
+hvb_page_range_tree_destroy(&our_range->removed_both);
+}
+
+void hvb_our_range_clear_removed_trees(OurRange *our_range)
+{
+hvb_page_range_tree_destroy(&our_range->removed_guest);
+hvb_page_range_tree_destroy(&our_range->removed_both);
+hvb_page_range_tree_init(&our_range->removed_guest);
+hvb_page_range_tree_init(&our_range->removed_both);
+}
+
+void hvb_our_range_mark_added(OurRange *our_range, uint64_t additional_size)
+{
+assert(additional_size <= UINT64_MAX - our_range->added);
+
+our_range->added += additional_size;
+
+assert(our_range->added <= UINT64_MAX - our_range->unusable_tail);
+assert(our_range->added + our_range->unusable_tail <=
+   our_range->range.count);
+}
+
+/* OurRangeMemslots */
+static void our_range_memslots_init_slots(OurRangeMemslots *our_range,
+  MemoryRegion *backing_mr,
+  Object *memslot_owner)
+{
+OurRangeMemslotsSlots *memslots = &our_range->slots;
+unsigned int idx;
+uint64_t memslot_offset;
+
+assert(memslots->count > 0);
+memslots->slots = g_new0(MemoryRegion, memslots->count);
+
+/* Initialize our memslots, but don't map them yet. */
+assert(memslots->size_each > 0);
+for (idx = 0, memslot_offset = 0; idx < memslots->count;
+ idx++, memslot_offset += memslots->size_each) {
+uint64_t memslot_size;
+g_autofree char *name = NULL;
+
+/* The size of the last memslot might be smaller. */
+if (idx == memslots->count - 1) {
+uint64_t region_size;
+
+assert(our_range->mr);
+region_size = memory_region_size(our_range->mr);
+memslot_size = region_size - memslot_offset;
+} else {
+memslot_size = memslots->size_each;
+}
+
+name = g_strdup_printf("memslot-%u", idx);
+memory_region_init_alias(&memslots->slots[idx], memslot_owner, name,
+ backing_mr, memslot_offset, memslot_size);
+/*
+ * We want to be able to atomically and efficiently activate/deactivate
+ * individual memslots without affecting adjacent memslots in memory
+ * notifiers.
+ */
+memory_region_set_unmergeable(&memslots->slots[idx], true);
+}
+
+memslots->mapped_count = 0;
+}
+
+O

[PULL 01/10] memory-device: Support empty memory devices

2023-11-06 Thread Maciej S. Szmigiero
From: David Hildenbrand 

Let's support empty memory devices -- memory devices that don't have a
memory device region in the current configuration. hv-balloon with an
optional memdev is the primary use case.

Signed-off-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/mem/memory-device.c | 43 +++---
 include/hw/mem/memory-device.h |  7 +-
 2 files changed, 46 insertions(+), 4 deletions(-)
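
As a concrete example of the empty case this enables: an hv-balloon device
instantiated without a memdev (a hypothetical command line) has no memory
device region at all, yet can still be (un)plugged like any other memory
device:

    qemu-system-x86_64 ... -device vmbus-bridge -device hv-balloon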

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index ae38f48f1676..db702ccad554 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -20,6 +20,22 @@
 #include "exec/address-spaces.h"
 #include "trace.h"
 
+static bool memory_device_is_empty(const MemoryDeviceState *md)
+{
+const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
+Error *local_err = NULL;
+MemoryRegion *mr;
+
+/* dropping const here is fine as we don't touch the memory region */
+mr = mdc->get_memory_region((MemoryDeviceState *)md, &local_err);
+if (local_err) {
+/* Not empty, we'll report errors later when obtaining the MR again. */
+error_free(local_err);
+return false;
+}
+return !mr;
+}
+
 static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
 {
 const MemoryDeviceState *md_a = MEMORY_DEVICE(a);
@@ -249,6 +265,10 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
 uint64_t next_addr;
 Range tmp;
 
+if (memory_device_is_empty(md)) {
+continue;
+}
+
 range_init_nofail(&tmp, mdc->get_addr(md),
   memory_device_get_region_size(md, &error_abort));
 
@@ -292,6 +312,7 @@ MemoryDeviceInfoList *qmp_memory_device_list(void)
 const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(item->data);
 MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
 
+/* Let's query information even for empty memory devices. */
 mdc->fill_device_info(md, info);
 
 QAPI_LIST_APPEND(tail, info);
@@ -311,7 +332,7 @@ static int memory_device_plugged_size(Object *obj, void *opaque)
 const MemoryDeviceState *md = MEMORY_DEVICE(obj);
 const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
 
-if (dev->realized) {
+if (dev->realized && !memory_device_is_empty(md)) {
 *size += mdc->get_plugged_size(md, &error_abort);
 }
 }
@@ -337,6 +358,11 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
 uint64_t addr, align = 0;
 MemoryRegion *mr;
 
+/* We support empty memory devices even without device memory. */
+if (memory_device_is_empty(md)) {
+return;
+}
+
 if (!ms->device_memory) {
 error_setg(errp, "the configuration is not prepared for memory devices"
  " (e.g., for memory hotplug), consider specifying the"
@@ -380,10 +406,17 @@ out:
 void memory_device_plug(MemoryDeviceState *md, MachineState *ms)
 {
 const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
-const unsigned int memslots = memory_device_get_memslots(md);
-const uint64_t addr = mdc->get_addr(md);
+unsigned int memslots;
+uint64_t addr;
 MemoryRegion *mr;
 
+if (memory_device_is_empty(md)) {
+return;
+}
+
+memslots = memory_device_get_memslots(md);
+addr = mdc->get_addr(md);
+
 /*
  * We expect that a previous call to memory_device_pre_plug() succeeded, so
  * it can't fail at this point.
@@ -408,6 +441,10 @@ void memory_device_unplug(MemoryDeviceState *md, MachineState *ms)
 const unsigned int memslots = memory_device_get_memslots(md);
 MemoryRegion *mr;
 
+if (memory_device_is_empty(md)) {
+return;
+}
+
 /*
  * We expect that a previous call to memory_device_pre_plug() succeeded, so
  * it can't fail at this point.
diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h
index 3354d6c1667e..a1d62cc551ab 100644
--- a/include/hw/mem/memory-device.h
+++ b/include/hw/mem/memory-device.h
@@ -38,6 +38,10 @@ typedef struct MemoryDeviceState MemoryDeviceState;
  * address in guest physical memory can either be specified explicitly
  * or get assigned automatically.
  *
+ * Some memory device might not own a memory region in certain device
+ * configurations. Such devices can logically get (un)plugged, however,
+ * empty memory devices are mostly ignored by the memory device code.
+ *
  * Conceptually, memory devices only span one memory region. If multiple
  * successive memory regions are used, a covering memory region has to
  * be provided. Scattered memory regions are not supported for single
@@ -91,7 +95,8 @@ struct MemoryDeviceClass {
 uint64_t (*get_plugged_size)(const MemoryDeviceState *md, Error **errp);
 
 /*
- * Return the memory region of the memory device.
+ * Return the memory region of the 

[PULL 08/10] qapi: Add HV_BALLOON_STATUS_REPORT event and its QMP query command

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Used by the hv-balloon driver for (optional) guest memory status reports.

Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hv-balloon-stub.c | 19 
 hw/hyperv/hv-balloon.c  | 30 +-
 hw/hyperv/meson.build   |  2 +-
 monitor/monitor.c   |  1 +
 qapi/machine.json   | 62 +
 tests/qtest/qmp-cmd-test.c  |  1 +
 6 files changed, 113 insertions(+), 2 deletions(-)
 create mode 100644 hw/hyperv/hv-balloon-stub.c
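
An example QMP exchange with the new command (the returned values are
illustrative only):

    -> { "execute": "query-hv-balloon-status-report" }
    <- { "return": { "committed": 816640000, "available": 3333054464 } }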

diff --git a/hw/hyperv/hv-balloon-stub.c b/hw/hyperv/hv-balloon-stub.c
new file mode 100644
index ..a47412d4a8ad
--- /dev/null
+++ b/hw/hyperv/hv-balloon-stub.c
@@ -0,0 +1,19 @@
+/*
+ * QEMU Hyper-V Dynamic Memory Protocol driver
+ *
+ * Copyright (C) 2023 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-machine.h"
+#include "qapi/qapi-types-machine.h"
+
+HvBalloonInfo *qmp_query_hv_balloon_status_report(Error **errp)
+{
+error_setg(errp, "hv-balloon device not enabled in this build");
+return NULL;
+}
diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 44a8d15cc841..66f297c1d7e3 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -1102,7 +1102,35 @@ static void hv_balloon_handle_status_report(HvBalloon *balloon,
 balloon->status_report.available *= HV_BALLOON_PAGE_SIZE;
 balloon->status_report.received = true;
 
-/* report event */
+qapi_event_send_hv_balloon_status_report(balloon->status_report.committed,
+ balloon->status_report.available);
+}
+
+HvBalloonInfo *qmp_query_hv_balloon_status_report(Error **errp)
+{
+HvBalloon *balloon;
+HvBalloonInfo *info;
+
+balloon = HV_BALLOON(object_resolve_path_type("", TYPE_HV_BALLOON, NULL));
+if (!balloon) {
+error_setg(errp, "no %s device present", TYPE_HV_BALLOON);
+return NULL;
+}
+
+if (!balloon->status_report.enabled) {
+error_setg(errp, "guest memory status reporting not enabled");
+return NULL;
+}
+
+if (!balloon->status_report.received) {
+error_setg(errp, "no guest memory status report received yet");
+return NULL;
+}
+
+info = g_malloc0(sizeof(*info));
+info->committed = balloon->status_report.committed;
+info->available = balloon->status_report.available;
+return info;
 }
 
 static void hv_balloon_handle_unballoon_response(HvBalloon *balloon,
diff --git a/hw/hyperv/meson.build b/hw/hyperv/meson.build
index 852d4f4a2ee2..d3d2668c71ae 100644
--- a/hw/hyperv/meson.build
+++ b/hw/hyperv/meson.build
@@ -2,4 +2,4 @@ specific_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'))
 specific_ss.add(when: 'CONFIG_HYPERV_TESTDEV', if_true: files('hyperv_testdev.c'))
 specific_ss.add(when: 'CONFIG_VMBUS', if_true: files('vmbus.c'))
 specific_ss.add(when: 'CONFIG_SYNDBG', if_true: files('syndbg.c'))
-specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c', 'hv-balloon-page_range_tree.c', 'hv-balloon-our_range_memslots.c'))
+specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c', 'hv-balloon-page_range_tree.c', 'hv-balloon-our_range_memslots.c'), if_false: files('hv-balloon-stub.c'))
diff --git a/monitor/monitor.c b/monitor/monitor.c
index 941f87815aa4..01ede1babd3d 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -315,6 +315,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
 [QAPI_EVENT_QUORUM_FAILURE]= { 1000 * SCALE_MS },
 [QAPI_EVENT_VSERPORT_CHANGE]   = { 1000 * SCALE_MS },
 [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
+[QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS },
 };
 
 /*
diff --git a/qapi/machine.json b/qapi/machine.json
index 2985d043c00d..b6d634b30d55 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1137,6 +1137,68 @@
 { 'event': 'BALLOON_CHANGE',
   'data': { 'actual': 'int' } }
 
+##
+# @HvBalloonInfo:
+#
+# hv-balloon guest-provided memory status information.
+#
+# @committed: the amount of memory in use inside the guest plus the
+# amount of the memory unusable inside the guest (ballooned out,
+# offline, etc.)
+#
+# @available: the amount of the memory inside the guest available for
+# new allocations ("free")
+#
+# Since: 8.2
+##
+{ 'struct': 'HvBalloonInfo',
+  'data': { 'committed': 'size', 'available': 'size' } }
+
+##
+# @query-hv-balloon-status-report:
+#
+# Returns the hv-balloon driver data contained in the last received "STATUS"
+# message from the guest.
+#
+# Returns:
+# - @HvBalloonIn

[PULL 09/10] hw/i386/pc: Support hv-balloon

2023-11-06 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Add the necessary plumbing for the hv-balloon driver to the PC machine.

Co-developed-by: David Hildenbrand 
Reviewed-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/i386/Kconfig |  1 +
 hw/i386/pc.c| 22 ++
 2 files changed, 23 insertions(+)
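
With this plumbing in place, the device can be instantiated on a PC machine
roughly like this (a hypothetical command line; the memdev is optional and
enables hot-add):

    qemu-system-x86_64 -machine pc \
        -object memory-backend-ram,id=mem1,size=32G \
        -device vmbus-bridge \
        -device hv-balloon,id=hvb0,memdev=mem1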

diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index 94772c726b24..55850791df41 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -45,6 +45,7 @@ config PC
 select ACPI_VMGENID
 select VIRTIO_PMEM_SUPPORTED
 select VIRTIO_MEM_SUPPORTED
+select HV_BALLOON_SUPPORTED
 
 config PC_PCI
 bool
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 6031234a73f1..1aef21aa2c25 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -27,6 +27,7 @@
 #include "hw/i386/pc.h"
 #include "hw/char/serial.h"
 #include "hw/char/parallel.h"
+#include "hw/hyperv/hv-balloon.h"
 #include "hw/i386/fw_cfg.h"
 #include "hw/i386/vmport.h"
 #include "sysemu/cpus.h"
@@ -57,6 +58,7 @@
 #include "hw/i386/kvm/xen_evtchn.h"
 #include "hw/i386/kvm/xen_gnttab.h"
 #include "hw/i386/kvm/xen_xenstore.h"
+#include "hw/mem/memory-device.h"
 #include "e820_memory_layout.h"
 #include "trace.h"
 #include CONFIG_DEVICES
@@ -1422,6 +1424,21 @@ static void pc_memory_unplug(HotplugHandler *hotplug_dev,
 error_propagate(errp, local_err);
 }
 
+static void pc_hv_balloon_pre_plug(HotplugHandler *hotplug_dev,
+   DeviceState *dev, Error **errp)
+{
+/* The vmbus handler has no hotplug handler; we should never end up here. */
+g_assert(!dev->hotplugged);
+memory_device_pre_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev), NULL,
+   errp);
+}
+
+static void pc_hv_balloon_plug(HotplugHandler *hotplug_dev,
+   DeviceState *dev, Error **errp)
+{
+memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
+}
+
 static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
   DeviceState *dev, Error **errp)
 {
@@ -1452,6 +1469,8 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
 return;
 }
 pcms->iommu = dev;
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON)) {
+pc_hv_balloon_pre_plug(hotplug_dev, dev, errp);
 }
 }
 
@@ -1464,6 +1483,8 @@ static void pc_machine_device_plug_cb(HotplugHandler *hotplug_dev,
 x86_cpu_plug(hotplug_dev, dev, errp);
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MD_PCI)) {
 virtio_md_pci_plug(VIRTIO_MD_PCI(dev), MACHINE(hotplug_dev), errp);
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON)) {
+pc_hv_balloon_plug(hotplug_dev, dev, errp);
 }
 }
 
@@ -1505,6 +1526,7 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
 object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MD_PCI) ||
 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI) ||
+object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON) ||
 object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
 return HOTPLUG_HANDLER(machine);
 }



[PULL 03/10] memory-device: Drop size alignment check

2023-11-06 Thread Maciej S. Szmigiero
From: David Hildenbrand 

There is no strong requirement that the size has to be a multiple of the
requested alignment, so let's drop the check. This is a preparation for hv-balloon.

Signed-off-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/mem/memory-device.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index db702ccad554..e0704b8dc37a 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -236,12 +236,6 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
 return 0;
 }
 
-if (!QEMU_IS_ALIGNED(size, align)) {
-error_setg(errp, "backend memory size must be multiple of 0x%"
-   PRIx64, align);
-return 0;
-}
-
 if (hint) {
 if (range_init(&new, *hint, size) || !range_contains_range(&as, &new)) {
 error_setg(errp, "can't add memory device [0x%" PRIx64 ":0x%" PRIx64



Re: [PULL 0/9] Hyper-V Dynamic Memory Protocol driver (hv-balloon)

2023-11-06 Thread Maciej S. Szmigiero

On 6.11.2023 02:33, Stefan Hajnoczi wrote:

On Sun, 5 Nov 2023 at 19:49, Maciej S. Szmigiero
 wrote:


From: "Maciej S. Szmigiero" 

The following changes since commit d762bf97931b58839316b68a570eecc6143c9e3e:

   Merge tag 'pull-target-arm-20231102' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-11-03 10:04:12 +0800)

are available in the Git repository at:

   https://github.com/maciejsszmigiero/qemu.git tags/pull-hv-balloon-20231105

for you to fetch changes up to 2b49ecabc6bf15efa6aa05f20a7c319ff65c4e11:

   MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol (2023-11-03 20:31:10 +0100)


Hi Maciej,
Please take a look at this CI system build failure:

/usr/bin/ld: libqemuutil.a.p/meson-generated_.._qapi_qapi-commands-machine.c.o:
in function `qmp_marshal_query_hv_balloon_status_report':
/builds/qemu-project/qemu/build/qapi/qapi-commands-machine.c:1000:
undefined reference to `qmp_query_hv_balloon_status_report'

https://gitlab.com/qemu-project/qemu/-/jobs/5463619044

I have dropped this pull request from the staging tree for the time being.

You can run the GitLab CI by pushing to a personal qemu.git fork on
GitLab with "git push -o ci.variable=QEMU_CI=1 ..." and it's often
possible to reproduce the CI jobs locally using the Docker build tests
(see "make docker-help").


Oops, I was testing the driver but recently forgot to also test the
configuration where the driver is disabled in the QEMU build config.

Will fix this ASAP.


Stefan


Thanks,
Maciej




[PULL 3/9] Add Hyper-V Dynamic Memory Protocol definitions

2023-11-05 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This commit adds Hyper-V Dynamic Memory Protocol definitions, taken
from hv_balloon Linux kernel driver, adapted to the QEMU coding style and
definitions.

Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 include/hw/hyperv/dynmem-proto.h | 423 +++
 1 file changed, 423 insertions(+)
 create mode 100644 include/hw/hyperv/dynmem-proto.h

diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h
new file mode 100644
index ..d0f9090ac489
--- /dev/null
+++ b/include/hw/hyperv/dynmem-proto.h
@@ -0,0 +1,423 @@
+#ifndef HW_HYPERV_DYNMEM_PROTO_H
+#define HW_HYPERV_DYNMEM_PROTO_H
+
+/*
+ * Hyper-V Dynamic Memory Protocol definitions
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * Based on drivers/hv/hv_balloon.c from Linux kernel:
+ * Copyright (c) 2012, Microsoft Corporation.
+ *
+ * Author: K. Y. Srinivasan 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+/*
+ * Protocol versions. The low word is the minor version, the high word the major
+ * version.
+ *
+ * History:
+ * Initial version 1.0
+ * Changed to 0.1 on 2009/03/25
+ * Changes to 0.2 on 2009/05/14
+ * Changes to 0.3 on 2009/12/03
+ * Changed to 1.0 on 2011/04/05
+ * Changed to 2.0 on 2019/12/10
+ */
+
+#define DYNMEM_MAKE_VERSION(Major, Minor) ((uint32_t)(((Major) << 16) | 
(Minor)))
+#define DYNMEM_MAJOR_VERSION(Version) ((uint32_t)(Version) >> 16)
+#define DYNMEM_MINOR_VERSION(Version) ((uint32_t)(Version) & 0xff)
+
+enum {
+DYNMEM_PROTOCOL_VERSION_1 = DYNMEM_MAKE_VERSION(0, 3),
+DYNMEM_PROTOCOL_VERSION_2 = DYNMEM_MAKE_VERSION(1, 0),
+DYNMEM_PROTOCOL_VERSION_3 = DYNMEM_MAKE_VERSION(2, 0),
+
+DYNMEM_PROTOCOL_VERSION_WIN7 = DYNMEM_PROTOCOL_VERSION_1,
+DYNMEM_PROTOCOL_VERSION_WIN8 = DYNMEM_PROTOCOL_VERSION_2,
+DYNMEM_PROTOCOL_VERSION_WIN10 = DYNMEM_PROTOCOL_VERSION_3,
+
+DYNMEM_PROTOCOL_VERSION_CURRENT = DYNMEM_PROTOCOL_VERSION_WIN10
+};
+
+
+
+/*
+ * Message Types
+ */
+
+enum dm_message_type {
+/*
+ * Version 0.3
+ */
+DM_ERROR = 0,
+DM_VERSION_REQUEST = 1,
+DM_VERSION_RESPONSE = 2,
+DM_CAPABILITIES_REPORT = 3,
+DM_CAPABILITIES_RESPONSE = 4,
+DM_STATUS_REPORT = 5,
+DM_BALLOON_REQUEST = 6,
+DM_BALLOON_RESPONSE = 7,
+DM_UNBALLOON_REQUEST = 8,
+DM_UNBALLOON_RESPONSE = 9,
+DM_MEM_HOT_ADD_REQUEST = 10,
+DM_MEM_HOT_ADD_RESPONSE = 11,
+DM_VERSION_03_MAX = 11,
+/*
+ * Version 1.0.
+ */
+DM_INFO_MESSAGE = 12,
+DM_VERSION_1_MAX = 12,
+
+/*
+ * Version 2.0
+ */
+DM_MEM_HOT_REMOVE_REQUEST = 13,
+DM_MEM_HOT_REMOVE_RESPONSE = 14
+};
+
+
+/*
+ * Structures defining the dynamic memory management
+ * protocol.
+ */
+
+union dm_version {
+struct {
+uint16_t minor_version;
+uint16_t major_version;
+};
+uint32_t version;
+} QEMU_PACKED;
+
+
+union dm_caps {
+struct {
+uint64_t balloon:1;
+uint64_t hot_add:1;
+/*
+ * To support guests that may have alignment
+ * limitations on hot-add, the guest can specify
+ * its alignment requirements; a value of n
+ * represents an alignment of 2^n in mega bytes.
+ */
+uint64_t hot_add_alignment:4;
+uint64_t hot_remove:1;
+uint64_t reservedz:57;
+} cap_bits;
+uint64_t caps;
+} QEMU_PACKED;
+
+union dm_mem_page_range {
+struct  {
+/*
+ * The PFN number of the first page in the range.
+ * 40 bits is the architectural limit of a PFN
+ * number for AMD64.
+ */
+uint64_t start_page:40;
+/*
+ * The number of pages in the range.
+ */
+uint64_t page_cnt:24;
+} finfo;
+uint64_t  page_range;
+} QEMU_PACKED;
+
+
+
+/*
+ * The header for all dynamic memory messages:
+ *
+ * type: Type of the message.
+ * size: Size of the message in bytes; including the header.
+ * trans_id: The guest is responsible for manufacturing this ID.
+ */
+
+struct dm_header {
+uint16_t type;
+uint16_t size;
+uint32_t trans_id;
+} QEMU_PACKED;
+
+/*
+ * A generic message format for dynamic memory.
+ * Specific message formats are defined later in the file.
+ */
+
+struct dm_message {
+struct dm_header hdr;
+uint8_t data[]; /* enclosed message */
+} QEMU_PACKED;
+
+
+/*
+ * Specific message types supporting the dynamic memory protocol.
+ */
+
+/*
+ * Version negotiation message. Sent from the guest to the host.
+ * The guest is free to try different versions until the host
+ * accepts the version.
+ *
+ * dm_version: The protocol version requested.
+ * is_last_attempt: If TRUE, this is the last version guest will request.
+ * reservedz: Reserved field, set to zero.
+ */
+
+struct dm_version_request {
+struct dm_header hdr;
+union dm_version version;
+

[PULL 8/9] hw/i386/pc: Support hv-balloon

2023-11-05 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Add the necessary plumbing for the hv-balloon driver to the PC machine.

Co-developed-by: David Hildenbrand 
Reviewed-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/i386/Kconfig |  1 +
 hw/i386/pc.c| 22 ++
 2 files changed, 23 insertions(+)

diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index 94772c726b24..55850791df41 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -45,6 +45,7 @@ config PC
 select ACPI_VMGENID
 select VIRTIO_PMEM_SUPPORTED
 select VIRTIO_MEM_SUPPORTED
+select HV_BALLOON_SUPPORTED
 
 config PC_PCI
 bool
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 6031234a73f1..1aef21aa2c25 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -27,6 +27,7 @@
 #include "hw/i386/pc.h"
 #include "hw/char/serial.h"
 #include "hw/char/parallel.h"
+#include "hw/hyperv/hv-balloon.h"
 #include "hw/i386/fw_cfg.h"
 #include "hw/i386/vmport.h"
 #include "sysemu/cpus.h"
@@ -57,6 +58,7 @@
 #include "hw/i386/kvm/xen_evtchn.h"
 #include "hw/i386/kvm/xen_gnttab.h"
 #include "hw/i386/kvm/xen_xenstore.h"
+#include "hw/mem/memory-device.h"
 #include "e820_memory_layout.h"
 #include "trace.h"
 #include CONFIG_DEVICES
@@ -1422,6 +1424,21 @@ static void pc_memory_unplug(HotplugHandler *hotplug_dev,
 error_propagate(errp, local_err);
 }
 
+static void pc_hv_balloon_pre_plug(HotplugHandler *hotplug_dev,
+   DeviceState *dev, Error **errp)
+{
+/* The vmbus handler has no hotplug handler; we should never end up here. */
+g_assert(!dev->hotplugged);
+memory_device_pre_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev), NULL,
+   errp);
+}
+
+static void pc_hv_balloon_plug(HotplugHandler *hotplug_dev,
+   DeviceState *dev, Error **errp)
+{
+memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
+}
+
 static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
   DeviceState *dev, Error **errp)
 {
@@ -1452,6 +1469,8 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
 return;
 }
 pcms->iommu = dev;
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON)) {
+pc_hv_balloon_pre_plug(hotplug_dev, dev, errp);
 }
 }
 
@@ -1464,6 +1483,8 @@ static void pc_machine_device_plug_cb(HotplugHandler *hotplug_dev,
 x86_cpu_plug(hotplug_dev, dev, errp);
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MD_PCI)) {
 virtio_md_pci_plug(VIRTIO_MD_PCI(dev), MACHINE(hotplug_dev), errp);
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON)) {
+pc_hv_balloon_plug(hotplug_dev, dev, errp);
 }
 }
 
@@ -1505,6 +1526,7 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
 object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MD_PCI) ||
 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI) ||
+object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON) ||
 object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
 return HOTPLUG_HANDLER(machine);
 }



[PULL 9/9] MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol

2023-11-05 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 MAINTAINERS | 8 
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8e8a7d5be5de..d4a480ce5a62 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2656,6 +2656,14 @@ F: hw/usb/canokey.c
 F: hw/usb/canokey.h
 F: docs/system/devices/canokey.rst
 
+Hyper-V Dynamic Memory Protocol
+M: Maciej S. Szmigiero 
+S: Supported
+F: hw/hyperv/hv-balloon*.c
+F: hw/hyperv/hv-balloon*.h
+F: include/hw/hyperv/dynmem-proto.h
+F: include/hw/hyperv/hv-balloon.h
+
 Subsystems
 --
 Overall Audio backends



[PULL 6/9] qapi: Add query-memory-devices support to hv-balloon

2023-11-05 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Used by the driver to report its provided memory state information.

Co-developed-by: David Hildenbrand 
Reviewed-by: David Hildenbrand 
Acked-by: Markus Armbruster 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/core/machine-hmp-cmds.c | 15 +++
 hw/hyperv/hv-balloon.c | 27 +-
 qapi/machine.json  | 39 --
 3 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 9a4b59c6f210..a6ff6a487583 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -253,6 +253,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
 MemoryDeviceInfo *value;
 PCDIMMDeviceInfo *di;
 SgxEPCDeviceInfo *se;
+HvBalloonDeviceInfo *hi;
 
 for (info = info_list; info; info = info->next) {
 value = info->value;
@@ -310,6 +311,20 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
 monitor_printf(mon, "  node: %" PRId64 "\n", se->node);
 monitor_printf(mon, "  memdev: %s\n", se->memdev);
 break;
+case MEMORY_DEVICE_INFO_KIND_HV_BALLOON:
+hi = value->u.hv_balloon.data;
+monitor_printf(mon, "Memory device [%s]: \"%s\"\n",
+   MemoryDeviceInfoKind_str(value->type),
+   hi->id ? hi->id : "");
+if (hi->has_memaddr) {
+monitor_printf(mon, "  memaddr: 0x%" PRIx64 "\n",
+   hi->memaddr);
+}
+monitor_printf(mon, "  max-size: %" PRIu64 "\n", hi->max_size);
+if (hi->memdev) {
+monitor_printf(mon, "  memdev: %s\n", hi->memdev);
+}
+break;
 default:
 g_assert_not_reached();
 }
diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 4d87f99375b5..c384f23a3b5e 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -1622,6 +1622,31 @@ static MemoryRegion *hv_balloon_md_get_memory_region(MemoryDeviceState *md,
 return balloon->mr;
 }
 
+static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md,
+   MemoryDeviceInfo *info)
+{
+HvBalloonDeviceInfo *hi = g_new0(HvBalloonDeviceInfo, 1);
+const HvBalloon *balloon = HV_BALLOON(md);
+DeviceState *dev = DEVICE(md);
+
+if (dev->id) {
+hi->id = g_strdup(dev->id);
+}
+
+if (balloon->hostmem) {
+hi->memdev = object_get_canonical_path(OBJECT(balloon->hostmem));
+hi->memaddr = balloon->addr;
+hi->has_memaddr = true;
+hi->max_size = memory_region_size(balloon->mr);
+/* TODO: expose current provided size or something else? */
+} else {
+hi->max_size = 0;
+}
+
+info->u.hv_balloon.data = hi;
+info->type = MEMORY_DEVICE_INFO_KIND_HV_BALLOON;
+}
+
 static void hv_balloon_decide_memslots(MemoryDeviceState *md,
unsigned int limit)
 {
@@ -1709,5 +1734,5 @@ static void hv_balloon_class_init(ObjectClass *klass, void *data)
 mdc->get_memory_region = hv_balloon_md_get_memory_region;
 mdc->decide_memslots = hv_balloon_decide_memslots;
 mdc->get_memslots = hv_balloon_get_memslots;
-/* implement fill_device_info */
+mdc->fill_device_info = hv_balloon_md_fill_device_info;
 }
diff --git a/qapi/machine.json b/qapi/machine.json
index 6c9d2f6dcffe..2985d043c00d 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1289,6 +1289,29 @@
   }
 }
 
+##
+# @HvBalloonDeviceInfo:
+#
+# hv-balloon provided memory state information
+#
+# @id: device's ID
+#
+# @memaddr: physical address in memory, where device is mapped
+#
+# @max-size: the maximum size of memory that the device can provide
+#
+# @memdev: memory backend linked with device
+#
+# Since: 8.2
+##
+{ 'struct': 'HvBalloonDeviceInfo',
+  'data': { '*id': 'str',
+'*memaddr': 'size',
+'max-size': 'size',
+'*memdev': 'str'
+  }
+}
+
 ##
 # @MemoryDeviceInfoKind:
 #
@@ -1300,10 +1323,13 @@
 #
 # @sgx-epc: since 6.2.
 #
+# @hv-balloon: since 8.2.
+#
 # Since: 2.1
 ##
 { 'enum': 'MemoryDeviceInfoKind',
-  'data': [ 'dimm', 'nvdimm', 'virtio-pmem', 'virtio-mem', 'sgx-epc' ] }
+  'data': [ 'dimm', 'nvdimm', 'virtio-pmem', 'virtio-mem', 'sgx-epc',
+'hv-balloon' ] }
 
 ##
 # @PCDIMMDeviceInfoWrapper:
@@ -1337,6 +1363,14 @@
 { 'struct': 'SgxEPCDeviceInfoWrapper',
   'data': { 'data': 'SgxEPCDeviceInfo' } }
 
+##
+# @HvBalloonDeviceInfoWrapper:
+#
+# Since: 8.2
+##
+{
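
For reference, a hypothetical query-memory-devices exchange with an
hv-balloon device plugged in (the device ID, address and sizes below are
made up for illustration):

  -> { "execute": "query-memory-devices" }
  <- { "return": [
         { "type": "hv-balloon",
           "data": { "id": "hvb",
                     "memaddr": 4294967296,
                     "max-size": 34359738368,
                     "memdev": "/objects/mem1" } } ] }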

[PULL 7/9] qapi: Add HV_BALLOON_STATUS_REPORT event and its QMP query command

2023-11-05 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

Used by the hv-balloon driver for (optional) guest memory status reports.

Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hv-balloon.c | 30 +++-
 monitor/monitor.c  |  1 +
 qapi/machine.json  | 62 ++
 3 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index c384f23a3b5e..2d1464cd7dca 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -1099,7 +1099,35 @@ static void hv_balloon_handle_status_report(HvBalloon *balloon,
 balloon->status_report.available *= HV_BALLOON_PAGE_SIZE;
 balloon->status_report.received = true;
 
-/* report event */
+qapi_event_send_hv_balloon_status_report(balloon->status_report.committed,
+ balloon->status_report.available);
+}
+
+HvBalloonInfo *qmp_query_hv_balloon_status_report(Error **errp)
+{
+HvBalloon *balloon;
+HvBalloonInfo *info;
+
+balloon = HV_BALLOON(object_resolve_path_type("", TYPE_HV_BALLOON, NULL));
+if (!balloon) {
+error_setg(errp, "no %s device present", TYPE_HV_BALLOON);
+return NULL;
+}
+
+if (!balloon->status_report.enabled) {
+error_setg(errp, "guest memory status reporting not enabled");
+return NULL;
+}
+
+if (!balloon->status_report.received) {
+error_setg(errp, "no guest memory status report received yet");
+return NULL;
+}
+
+info = g_malloc0(sizeof(*info));
+info->committed = balloon->status_report.committed;
+info->available = balloon->status_report.available;
+return info;
 }
 
 static void hv_balloon_handle_unballoon_response(HvBalloon *balloon,
diff --git a/monitor/monitor.c b/monitor/monitor.c
index 941f87815aa4..01ede1babd3d 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -315,6 +315,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
 [QAPI_EVENT_QUORUM_FAILURE]= { 1000 * SCALE_MS },
 [QAPI_EVENT_VSERPORT_CHANGE]   = { 1000 * SCALE_MS },
 [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
+[QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS },
 };
 
 /*
diff --git a/qapi/machine.json b/qapi/machine.json
index 2985d043c00d..b6d634b30d55 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1137,6 +1137,68 @@
 { 'event': 'BALLOON_CHANGE',
   'data': { 'actual': 'int' } }
 
+##
+# @HvBalloonInfo:
+#
+# hv-balloon guest-provided memory status information.
+#
+# @committed: the amount of memory in use inside the guest plus the
+# amount of the memory unusable inside the guest (ballooned out,
+# offline, etc.)
+#
+# @available: the amount of the memory inside the guest available for
+# new allocations ("free")
+#
+# Since: 8.2
+##
+{ 'struct': 'HvBalloonInfo',
+  'data': { 'committed': 'size', 'available': 'size' } }
+
+##
+# @query-hv-balloon-status-report:
+#
+# Returns the hv-balloon driver data contained in the last received "STATUS"
+# message from the guest.
+#
+# Returns:
+# - @HvBalloonInfo on success
+# - If no hv-balloon device is present, guest memory status reporting
+#   is not enabled, or no guest memory status report has been received
+#   yet, GenericError
+#
+# Since: 8.2
+#
+# Example:
+#
+# -> { "execute": "query-hv-balloon-status-report" }
+# <- { "return": {
+#  "committed": 816640000,
+#  "available": 3333054464
+#   }
+#}
+##
+{ 'command': 'query-hv-balloon-status-report', 'returns': 'HvBalloonInfo' }
+
+##
+# @HV_BALLOON_STATUS_REPORT:
+#
+# Emitted when the hv-balloon driver receives a "STATUS" message from
+# the guest.
+#
+# Note: this event is rate-limited.
+#
+# Since: 8.2
+#
+# Example:
+#
+# <- { "event": "HV_BALLOON_STATUS_REPORT",
+#  "data": { "committed": 816640000, "available": 3333054464 },
+#  "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
+#
+##
+{ 'event': 'HV_BALLOON_STATUS_REPORT',
+  'data': 'HvBalloonInfo' }
+
 ##
 # @MemoryInfo:
 #



[PULL 1/9] memory-device: Support empty memory devices

2023-11-05 Thread Maciej S. Szmigiero
From: David Hildenbrand 

Let's support empty memory devices -- memory devices that don't have a
memory device region in the current configuration. hv-balloon with an
optional memdev is the primary use case.

Signed-off-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/mem/memory-device.c | 43 +++---
 include/hw/mem/memory-device.h |  7 +-
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index ae38f48f1676..db702ccad554 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -20,6 +20,22 @@
 #include "exec/address-spaces.h"
 #include "trace.h"
 
+static bool memory_device_is_empty(const MemoryDeviceState *md)
+{
+const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
+Error *local_err = NULL;
+MemoryRegion *mr;
+
+/* dropping const here is fine as we don't touch the memory region */
+mr = mdc->get_memory_region((MemoryDeviceState *)md, &local_err);
+if (local_err) {
+/* Not empty, we'll report errors later when obtaining the MR again. */
+error_free(local_err);
+return false;
+}
+return !mr;
+}
+
 static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
 {
 const MemoryDeviceState *md_a = MEMORY_DEVICE(a);
@@ -249,6 +265,10 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
 uint64_t next_addr;
 Range tmp;
 
+if (memory_device_is_empty(md)) {
+continue;
+}
+
range_init_nofail(&tmp, mdc->get_addr(md),
  memory_device_get_region_size(md, &error_abort));
 
@@ -292,6 +312,7 @@ MemoryDeviceInfoList *qmp_memory_device_list(void)
 const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(item->data);
 MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
 
+/* Let's query information even for empty memory devices. */
 mdc->fill_device_info(md, info);
 
 QAPI_LIST_APPEND(tail, info);
@@ -311,7 +332,7 @@ static int memory_device_plugged_size(Object *obj, void *opaque)
 const MemoryDeviceState *md = MEMORY_DEVICE(obj);
 const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
 
-if (dev->realized) {
+if (dev->realized && !memory_device_is_empty(md)) {
 *size += mdc->get_plugged_size(md, &error_abort);
 }
 }
@@ -337,6 +358,11 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
 uint64_t addr, align = 0;
 MemoryRegion *mr;
 
+/* We support empty memory devices even without device memory. */
+if (memory_device_is_empty(md)) {
+return;
+}
+
 if (!ms->device_memory) {
 error_setg(errp, "the configuration is not prepared for memory devices"
  " (e.g., for memory hotplug), consider specifying the"
@@ -380,10 +406,17 @@ out:
 void memory_device_plug(MemoryDeviceState *md, MachineState *ms)
 {
 const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
-const unsigned int memslots = memory_device_get_memslots(md);
-const uint64_t addr = mdc->get_addr(md);
+unsigned int memslots;
+uint64_t addr;
 MemoryRegion *mr;
 
+if (memory_device_is_empty(md)) {
+return;
+}
+
+memslots = memory_device_get_memslots(md);
+addr = mdc->get_addr(md);
+
 /*
  * We expect that a previous call to memory_device_pre_plug() succeeded, so
  * it can't fail at this point.
@@ -408,6 +441,10 @@ void memory_device_unplug(MemoryDeviceState *md, MachineState *ms)
 const unsigned int memslots = memory_device_get_memslots(md);
 MemoryRegion *mr;
 
+if (memory_device_is_empty(md)) {
+return;
+}
+
 /*
  * We expect that a previous call to memory_device_pre_plug() succeeded, so
  * it can't fail at this point.
diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h
index 3354d6c1667e..a1d62cc551ab 100644
--- a/include/hw/mem/memory-device.h
+++ b/include/hw/mem/memory-device.h
@@ -38,6 +38,10 @@ typedef struct MemoryDeviceState MemoryDeviceState;
  * address in guest physical memory can either be specified explicitly
  * or get assigned automatically.
  *
+ * Some memory device might not own a memory region in certain device
+ * configurations. Such devices can logically get (un)plugged, however,
+ * empty memory devices are mostly ignored by the memory device code.
+ *
  * Conceptually, memory devices only span one memory region. If multiple
  * successive memory regions are used, a covering memory region has to
  * be provided. Scattered memory regions are not supported for single
@@ -91,7 +95,8 @@ struct MemoryDeviceClass {
 uint64_t (*get_plugged_size)(const MemoryDeviceState *md, Error **errp);
 
 /*
- * Return the memory region of the memory device.
+ * Return the memory region of the 

[PULL 2/9] memory-device: Drop size alignment check

2023-11-05 Thread Maciej S. Szmigiero
From: David Hildenbrand 

There is no strong requirement that the size has to be multiples of the
requested alignment, let's drop it. This is a preparation for hv-balloon.

Signed-off-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/mem/memory-device.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index db702ccad554..e0704b8dc37a 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -236,12 +236,6 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
 return 0;
 }
 
-if (!QEMU_IS_ALIGNED(size, align)) {
-error_setg(errp, "backend memory size must be multiple of 0x%"
-   PRIx64, align);
-return 0;
-}
-
 if (hint) {
if (range_init(&new, *hint, size) || !range_contains_range(&as, &new)) {
 error_setg(errp, "can't add memory device [0x%" PRIx64 ":0x%" 
PRIx64



[PULL 5/9] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support

2023-11-05 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

One of the advantages of using this protocol over ACPI-based PC DIMM hotplug is
that it allows hot-adding memory in much smaller granularity because the
ACPI DIMM slot limit does not apply.

In order to enable this functionality a new memory backend needs to be
created and provided to the driver via the "memdev" parameter.

This can be achieved by, for example, adding
"-object memory-backend-ram,id=mem1,size=32G" to the QEMU command line and
then instantiating the driver with the "memdev=mem1" parameter.
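
Putting the two together, an illustrative invocation could look as follows
(the vmbus-bridge device is assumed here because the protocol runs over
Hyper-V VMBus; machine, accelerator and CPU flags are omitted):

  qemu-system-x86_64 ... \
    -device vmbus-bridge \
    -object memory-backend-ram,id=mem1,size=32G \
    -device hv-balloon,id=hvb,memdev=mem1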

The device will try to use multiple memslots to cover the memory backend in
order to reduce the size of metadata for the not-yet-hot-added part of the
memory backend.

Co-developed-by: David Hildenbrand 
Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 hw/hyperv/hv-balloon-our_range_memslots.c | 201 
 hw/hyperv/hv-balloon-our_range_memslots.h | 110 +
 hw/hyperv/hv-balloon.c| 566 +-
 hw/hyperv/meson.build |   2 +-
 hw/hyperv/trace-events|   5 +
 5 files changed, 878 insertions(+), 6 deletions(-)
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.c
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.h

diff --git a/hw/hyperv/hv-balloon-our_range_memslots.c b/hw/hyperv/hv-balloon-our_range_memslots.c
new file mode 100644
index ..99bae870f371
--- /dev/null
+++ b/hw/hyperv/hv-balloon-our_range_memslots.c
@@ -0,0 +1,201 @@
+/*
+ * QEMU Hyper-V Dynamic Memory Protocol driver
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "hv-balloon-internal.h"
+#include "hv-balloon-our_range_memslots.h"
+#include "trace.h"
+
+/* OurRange */
+static void our_range_init(OurRange *our_range, uint64_t start, uint64_t count)
+{
+assert(count <= UINT64_MAX - start);
+our_range->range.start = start;
+our_range->range.count = count;
+
+hvb_page_range_tree_init(&our_range->removed_guest);
+hvb_page_range_tree_init(&our_range->removed_both);
+
+/* mark the whole range as unused but for potential use */
+our_range->added = 0;
+our_range->unusable_tail = 0;
+}
+
+static void our_range_destroy(OurRange *our_range)
+{
+hvb_page_range_tree_destroy(&our_range->removed_guest);
+hvb_page_range_tree_destroy(&our_range->removed_both);
+}
+
+void hvb_our_range_clear_removed_trees(OurRange *our_range)
+{
+hvb_page_range_tree_destroy(&our_range->removed_guest);
+hvb_page_range_tree_destroy(&our_range->removed_both);
+hvb_page_range_tree_init(&our_range->removed_guest);
+hvb_page_range_tree_init(&our_range->removed_both);
+}
+
+void hvb_our_range_mark_added(OurRange *our_range, uint64_t additional_size)
+{
+assert(additional_size <= UINT64_MAX - our_range->added);
+
+our_range->added += additional_size;
+
+assert(our_range->added <= UINT64_MAX - our_range->unusable_tail);
+assert(our_range->added + our_range->unusable_tail <=
+   our_range->range.count);
+}
+
+/* OurRangeMemslots */
+static void our_range_memslots_init_slots(OurRangeMemslots *our_range,
+  MemoryRegion *backing_mr,
+  Object *memslot_owner)
+{
+OurRangeMemslotsSlots *memslots = &our_range->slots;
+unsigned int idx;
+uint64_t memslot_offset;
+
+assert(memslots->count > 0);
+memslots->slots = g_new0(MemoryRegion, memslots->count);
+
+/* Initialize our memslots, but don't map them yet. */
+assert(memslots->size_each > 0);
+for (idx = 0, memslot_offset = 0; idx < memslots->count;
+ idx++, memslot_offset += memslots->size_each) {
+uint64_t memslot_size;
+g_autofree char *name = NULL;
+
+/* The size of the last memslot might be smaller. */
+if (idx == memslots->count - 1) {
+uint64_t region_size;
+
+assert(our_range->mr);
+region_size = memory_region_size(our_range->mr);
+memslot_size = region_size - memslot_offset;
+} else {
+memslot_size = memslots->size_each;
+}
+
+name = g_strdup_printf("memslot-%u", idx);
+memory_region_init_alias(&memslots->slots[idx], memslot_owner, name,
+ backing_mr, memslot_offset, memslot_size);
+/*
+ * We want to be able to atomically and efficiently activate/deactivate
+ * individual memslots without affecting adjacent memslots in memory
+ * notifiers.
+ */
+memory_region_set_unmergeable(&memslots->slots[idx], true);
+}
+
+memslots->mapped_count = 0;
+}
+
+O

[PULL 0/9] Hyper-V Dynamic Memory Protocol driver (hv-balloon)

2023-11-05 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

The following changes since commit d762bf97931b58839316b68a570eecc6143c9e3e:

  Merge tag 'pull-target-arm-20231102' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-11-03 10:04:12 +0800)

are available in the Git repository at:

  https://github.com/maciejsszmigiero/qemu.git tags/pull-hv-balloon-20231105

for you to fetch changes up to 2b49ecabc6bf15efa6aa05f20a7c319ff65c4e11:

  MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol (2023-11-03 20:31:10 +0100)


Hyper-V Dynamic Memory protocol driver.

This driver is like virtio-balloon on steroids for Windows guests:
it allows both changing the guest memory allocation via ballooning and
inserting pieces of extra RAM into it on demand from a provided memory
backend via Windows-native Hyper-V Dynamic Memory protocol.

* Protocol definitions.

* Hyper-V DM protocol driver (hv-balloon) base (ballooning only).

* Hyper-V DM protocol driver (hv-balloon) hot-add support.

* qapi query-memory-devices support for the driver.

* qapi HV_BALLOON_STATUS_REPORT event.

* The relevant PC machine plumbing.

* New MAINTAINERS entry for the above.


David Hildenbrand (2):
  memory-device: Support empty memory devices
  memory-device: Drop size alignment check

Maciej S. Szmigiero (7):
  Add Hyper-V Dynamic Memory Protocol definitions
  Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base
  Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support
  qapi: Add query-memory-devices support to hv-balloon
  qapi: Add HV_BALLOON_STATUS_REPORT event and its QMP query command
  hw/i386/pc: Support hv-balloon
  MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol

 Kconfig.host  |3 +
 MAINTAINERS   |8 +
 hw/core/machine-hmp-cmds.c|   15 +
 hw/hyperv/Kconfig |   10 +
 hw/hyperv/hv-balloon-internal.h   |   33 +
 hw/hyperv/hv-balloon-our_range_memslots.c |  201 
 hw/hyperv/hv-balloon-our_range_memslots.h |  110 ++
 hw/hyperv/hv-balloon-page_range_tree.c|  228 
 hw/hyperv/hv-balloon-page_range_tree.h|  118 ++
 hw/hyperv/hv-balloon.c| 1766 +
 hw/hyperv/meson.build |1 +
 hw/hyperv/trace-events|   18 +
 hw/i386/Kconfig   |1 +
 hw/i386/pc.c  |   22 +
 hw/mem/memory-device.c|   49 +-
 include/hw/hyperv/dynmem-proto.h  |  423 +++
 include/hw/hyperv/hv-balloon.h|   18 +
 include/hw/mem/memory-device.h|7 +-
 meson.build   |   28 +-
 meson_options.txt |2 +
 monitor/monitor.c |1 +
 qapi/machine.json |  101 +-
 scripts/meson-buildoptions.sh |3 +
 23 files changed, 3153 insertions(+), 13 deletions(-)
 create mode 100644 hw/hyperv/hv-balloon-internal.h
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.c
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.h
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.c
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.h
 create mode 100644 hw/hyperv/hv-balloon.c
 create mode 100644 include/hw/hyperv/dynmem-proto.h
 create mode 100644 include/hw/hyperv/hv-balloon.h



[PULL 4/9] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base

2023-11-05 Thread Maciej S. Szmigiero
From: "Maciej S. Szmigiero" 

This driver is like virtio-balloon on steroids: it allows both changing the
guest memory allocation via ballooning and (in the next patch) inserting
pieces of extra RAM into it on demand from a provided memory backend.

The actual resizing is done via the ballooning interface (for example, via
the "balloon" HMP command, as illustrated below).
This includes resizing the guest past its boot size - that is, hot-adding
additional memory in granularity limited only by the guest alignment
requirements, as provided by the next patch.
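
For example, from the HMP monitor (illustrative values; "balloon" takes
the target guest size in MiB):

  (qemu) balloon 8192
  (qemu) info balloon
  balloon: actual=8192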

In contrast with ACPI DIMM hotplug, where one can only request to unplug a
whole DIMM stick, this driver allows removing memory from the guest in
single-page (4k) units via ballooning.

After a VM reboot the guest is back to its original (boot) size.

In the future, the guest boot memory size might be changed on reboot
instead, taking into account the effective size that the VM had before that
reboot (much like Hyper-V does).

For performance reasons, the guest-released memory is tracked in a few
range trees, as a series of (start, count) ranges.
Each time a new page range is inserted into such a tree its neighbors are
checked as candidates for possible merging with it.
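
As a rough sketch of that merge-on-insert idea (not the driver's actual
code: it assumes a GTree of disjoint ranges keyed by the range's start
address stored inside the value struct, created via g_tree_new_full() with
a uint64_t-comparing key function and g_free as the value destructor, and
it relies on the GTree node API available since GLib 2.68):

  #include <glib.h>
  #include <stdint.h>

  typedef struct PageRange { uint64_t start, count; } PageRange;

  static void range_tree_insert(GTree *tree, uint64_t start, uint64_t count)
  {
      GTreeNode *succ = g_tree_upper_bound(tree, &start);
      GTreeNode *pred = succ ? g_tree_node_previous(succ)
                             : g_tree_node_last(tree);
      PageRange *p = pred ? g_tree_node_value(pred) : NULL;
      PageRange *s = succ ? g_tree_node_value(succ) : NULL;

      if (p && p->start + p->count == start) { /* extend left neighbor */
          start = p->start;
          count += p->count;
      } else {
          p = NULL;
      }
      if (s && start + count == s->start) {    /* swallow right neighbor */
          count += s->count;
      } else {
          s = NULL;
      }

      if (p) { /* drop the now-absorbed left neighbor */
          uint64_t key = p->start;
          g_tree_remove(tree, &key);
      }
      if (s) { /* drop the now-absorbed right neighbor */
          uint64_t key = s->start;
          g_tree_remove(tree, &key);
      }

      PageRange *r = g_new0(PageRange, 1);
      r->start = start;
      r->count = count;
      g_tree_insert(tree, &r->start, r);
  }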

Besides performance reasons, the Dynamic Memory protocol itself uses page
ranges as the data structure in its messages, so relevant pages need to be
merged into such ranges anyway.

One has to be careful when tracking the guest-released pages, since the
guest can maliciously report returning pages outside its current address
space, which may later clash with the address range of newly added memory.
Similarly, the guest can report freeing the same page twice.

The above design results in much better ballooning performance than when
using virtio-balloon with the same guest: 230 GB / minute with this driver
versus 70 GB / minute with virtio-balloon.

During a ballooning operation most of the time is spent waiting for the
guest to come up with newly freed page ranges; processing the received
ranges on the host side (in QEMU and KVM) is nearly instantaneous.

The unballoon operation is also pretty much instantaneous:
thanks to the merging of the ballooned out page ranges 200 GB of memory can
be returned to the guest in about 1 second.
With virtio-balloon this operation takes about 2.5 minutes.

These tests were done against a Windows Server 2019 guest running on a
Xeon E5-2699, after dirtying the whole memory inside guest before each
balloon operation.

Using a range tree instead of a bitmap to track the removed memory also
means that the solution scales well with the guest size: even a 1 TB range
takes just a few bytes of such metadata.

Since the required GTree operations aren't present in every Glib version,
a check for them was added to the meson build script, together with new
"--enable-hv-balloon" and "--disable-hv-balloon" configure arguments.
If these GTree operations are missing in the system's Glib version this
driver will be skipped during QEMU build.

An optional "status-report=on" device parameter requests memory status
events from the guest (typically sent every second), which allow the host
to learn both the guest memory available and the guest memory in use
counts.
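
For instance (hypothetical device ID, combined with the "memdev" parameter
from the previous patch):

  -device hv-balloon,id=hvb,memdev=mem1,status-report=on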

Following commits will add support for their external emission as
"HV_BALLOON_STATUS_REPORT" QMP events.

The driver is named hv-balloon since the Linux kernel client driver for
the Dynamic Memory Protocol is named as such and to follow the naming
pattern established by the virtio-balloon driver.
The whole protocol runs over Hyper-V VMBus.

The driver was tested against Windows Server 2012 R2, Windows Server 2016
and Windows Server 2019 guests and obeys the guest alignment requirements
reported to the host via DM_CAPABILITIES_REPORT message.

Acked-by: David Hildenbrand 
Signed-off-by: Maciej S. Szmigiero 
---
 Kconfig.host   |3 +
 hw/hyperv/Kconfig  |   10 +
 hw/hyperv/hv-balloon-internal.h|   33 +
 hw/hyperv/hv-balloon-page_range_tree.c |  228 +
 hw/hyperv/hv-balloon-page_range_tree.h |  118 +++
 hw/hyperv/hv-balloon.c | 1157 
 hw/hyperv/meson.build  |1 +
 hw/hyperv/trace-events |   13 +
 include/hw/hyperv/hv-balloon.h |   18 +
 meson.build|   28 +-
 meson_options.txt  |2 +
 scripts/meson-buildoptions.sh  |3 +
 12 files changed, 1613 insertions(+), 1 deletion(-)
 create mode 100644 hw/hyperv/hv-balloon-internal.h
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.c
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.h
 create mode 100644 hw/hyperv/hv-balloon.c
 create mode 100644 include/hw/hyperv/hv-balloon.h

diff --git a/Kconfig.host b/Kconfig.host
index d763d892693c..2ee71578f38f 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -46,3 +46,6 @@ config FUZZ
 config VFIO_USER_SE

Re: [PATCH v8 0/9] Hyper-V Dynamic Memory Protocol driver (hv-balloon )

2023-11-02 Thread Maciej S. Szmigiero

On 2.11.2023 14:50, David Hildenbrand wrote:

On 23.10.23 19:24, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

This is a continuation of v7 of the patch series, located here:
https://lore.kernel.org/qemu-devel/cover.1693240836.git.maciej.szmigi...@oracle.com/



I skimmed over most parts and nothing jumped at me memory-device-related; I'm 
hoping I can take another closer look later/tomorrow; it's a lot of code and an 
in-depth review would be great. But I don't know if we'll find someone to 
volunteer? :)


Thanks - even a cursory review is valuable.


Soft-freeze is in 5 days.

You seem to be the only hyperv-related maintainer listed in MAINTAINERS. Do you 
want to merge this or should I route this via mem-next?


I can post a pull request this weekend so it can be pulled in on Monday 
(hopefully).


For the time being

Acked-by: David Hildenbrand 



Thanks,
Maciej




Re: [PATCH] hyperv: add check for NULL for msg

2023-10-26 Thread Maciej S. Szmigiero

On 26.10.2023 11:31, Анастасия Любимова wrote:


28/09/23 19:18, Maciej S. Szmigiero пишет:

On 28.09.2023 15:25, Anastasia Belova wrote:

cpu_physical_memory_map may return NULL in hyperv_hcall_post_message.
Add check for NULL to avoid NULL-dereference.
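
The shape of the fix is roughly the following (a simplified sketch, not
the literal patch; the surrounding hyperv_hcall_post_message() variables,
error constant and unmap path are paraphrased):

  msg = cpu_physical_memory_map(param, &len, 0);
  if (!msg) {
      return HV_STATUS_INSUFFICIENT_MEMORY;
  }
  if (len < sizeof(*msg)) {
      ret = HV_STATUS_INSUFFICIENT_MEMORY;
      goto unmap;
  }
  ...
  unmap:
  cpu_physical_memory_unmap(msg, len, 0, 0);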

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 76036a5fc7 ("hyperv: process POST_MESSAGE hypercall")
Signed-off-by: Anastasia Belova 


Makes sense to me, thanks.

Did you run your static checker through the remaining QEMU files,
too?

I can see similar cpu_physical_memory_map() usage in, for example:
target/s390x/helper.c, hw/nvram/spapr_nvram.c, hw/hyperv/vmbus.c,
display/ramfb.c...


It seems that configurations for analysis do not contain these files
so the checker hasn't warned us. Additional time is needed to
analyze these pieces of code and form patches if necessary.


No problem, it's not an urgent issue.  

Anastasia Belova


Thanks,
Maciej



