Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
On 17.07.2024 22:19, Fabiano Rosas wrote: Peter Xu writes: On Tue, Jul 16, 2024 at 10:10:12PM +0200, Maciej S. Szmigiero wrote: On 27.06.2024 16:56, Peter Xu wrote: On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote: On 26.06.2024 18:23, Peter Xu wrote: On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote: On 26.06.2024 03:51, Peter Xu wrote: On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote: On 25.06.2024 19:25, Peter Xu wrote: On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote: Hi Peter, Hi, Maciej, On 23.06.2024 22:27, Peter Xu wrote: On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" This is an updated v1 patch series of the RFC (v0) series located here: https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/ OK I took some hours thinking about this today, and here's some high level comments for this series. I'll start with which are more relevant to what Fabiano has already suggested in the other thread, then I'll add some more. https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de That's a long list, thanks for these comments. I have responded to them inline below. (..) 2) Submit this operation to the thread pool and wait for it to complete, VFIO doesn't need to have its own code waiting. If this pool is for migration purpose in general, qemu migration framework will need to wait at some point for all jobs to finish before moving on. Perhaps it should be at the end of the non-iterative session. So essentially, instead of calling save_live_complete_precopy_end handlers from the migration code you would like to hard-code its current VFIO implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate(). Only it wouldn't be then called VFIO precopy async thread terminate but some generic device state async precopy thread terminate function. I don't understand what did you mean by "hard code". 
"Hard code" wasn't maybe the best expression here. I meant the move of the functionality that's provided by vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set to the common migration code. I see. That function only does a thread_join() so far. So can I understand it as below [1] should work for us, and it'll be clean too (with nothing to hard-code)? It will need some signal to the worker threads pool to terminate before waiting for them to finish (as the code in [1] just waits). In the case of current vfio_save_complete_precopy_async_thread() implementation, this signal isn't necessary as this thread simply terminates when it has read all the date it needs from the device. In a worker threads pool case there will be some threads waiting for jobs to be queued to them and so they will need to be somehow signaled to exit. Right. We may need something like multifd_send_should_exit() + MultiFDSendParams.sem. It'll be nicer if we can generalize that part so multifd threads can also rebase to that thread model, but maybe I'm asking too much. The time to join() the worker threads can be even later, until migrate_fd_cleanup() on sender side. You may have a better idea on when would be the best place to do it when start working on it. What I was saying is if we target the worker thread pool to be used for "concurrently dump vmstates", then it'll make sense to make sure all the jobs there were flushed after qemu dumps all non-iterables (because this should be the last step of the switchover). I expect it looks like this: while (pool->active_threads) { qemu_sem_wait(>job_done); } [1] (..) 
I think that with this thread pool introduction we'll unfortunately almost certainly need to target this patch set at 9.2, since these overall changes (and Fabiano patches too) will need good testing, might uncover some performance regressions (for example related to the number of buffers limit or Fabiano multifd changes), bring some review comments from other people, etc. In addition to that, we are in the middle of holiday season and a lot of people aren't available - like Fabiano said he will be available only in a few weeks. Right, that's unfortunate. Let's see, but still I really hope we can also get some feedback from Fabiano before it lands, even with that we have chance for 9.1 but it's just challenging, it's the same condition I mentioned since the 1st email. And before Fabiano's back (he's the active maintainer for this release), I'm personally happy if you can propose something that can land earlier in this release partly. E.g., if you want we can at least upstream Fabiano's idea first, or some more on top. For that, also feel to have a look at my comment today: https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n Feel free to comment there too. There's a tiny uncertainty there so far on specify
Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
On 17.07.2024 21:00, Peter Xu wrote: On Tue, Jul 16, 2024 at 10:10:25PM +0200, Maciej S. Szmigiero wrote: The comment I removed is slightly misleading to me too, because right now active_slot contains the data hasn't yet been delivered to multifd, so we're "putting it back to free list" not because of it's free, but because we know it won't get used until the multifd send thread consumes it (because before that the thread will be busy, and we won't use the buffer if so in upcoming send()s). And then when I'm looking at this again, I think maybe it's a slight overkill, and maybe we can still keep the "opaque data" managed by multifd. One reason might be that I don't expect the "opaque data" payload keep growing at all: it should really be either RAM or device state as I commented elsewhere in a relevant thread, after all it's a thread model only for migration purpose to move vmstates.. Some amount of flexibility needs to be baked in. For instance, what about the handshake procedure? Don't we want to use multifd threads to put some information on the wire for that as well? Is this an orthogonal question? I don't think so. You say the payload data should be either RAM or device state. I'm asking what other types of data do we want the multifd channel to transmit and suggesting we need to allow room for the addition of that, whatever it is. One thing that comes to mind that is neither RAM or device state is some form of handshake or capabilities negotiation. The RFC version of my multifd device state transfer patch set introduced a new migration channel header (by Avihai) for clean and extensible migration channel handshaking but people didn't like so it was removed in v1. Hmm, I'm not sure this is relevant to the context of discussion here, but I confess I didn't notice the per-channel header thing in the previous RFC series. 
Link is here: https://lore.kernel.org/r/636cec92eb801f13ba893de79d4872f5d8342097.1713269378.git.maciej.szmigi...@oracle.com The channel header patches were dropped because Daniel didn't like them: https://lore.kernel.org/qemu-devel/zh-kf72fe9ov6...@redhat.com/ https://lore.kernel.org/qemu-devel/zh_6w8u3h4fmg...@redhat.com/ Maciej, if you want, you can split that out of the seriess. So far it looks like a good thing with/without how VFIO tackles it. Unfortunately, these Avihai's channel header patches obviously impact wire protocol and are a bit of intermingled with the rest of the device state transfer patch set so it would be good to know upfront whether there is some consensus to (re)introduce this new channel header (CCed Daniel, too). Thanks, Thanks, Maciej
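The dropped channel-header patches are not reproduced in this thread, so the sketch below is purely illustrative: field names, sizes, and the magic value are all invented, not Avihai's actual wire format. It only shows the general idea under discussion, namely that each migration channel opens with a small fixed header so the destination can identify the channel type without guessing from the first payload bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical channel types; the real set would match QEMU's channels. */
enum {
    MIG_CHANNEL_TYPE_MAIN = 0,
    MIG_CHANNEL_TYPE_MULTIFD = 1,
    MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT = 2,
};

/* Hypothetical fixed-size header sent first on every channel. */
typedef struct __attribute__((packed)) MigChannelHeader {
    uint32_t magic;        /* fixed marker, invented value below */
    uint32_t version;      /* bumped on incompatible header changes */
    uint8_t  channel_type; /* one of MIG_CHANNEL_TYPE_* */
    uint8_t  channel_id;   /* index among channels of the same type */
} MigChannelHeader;

#define MIG_CHANNEL_MAGIC 0x4d696748u  /* invented for this sketch */

/* Returns 0 on success, -1 if the header is short or unrecognized. */
static int mig_channel_header_parse(const uint8_t *buf, size_t len,
                                    MigChannelHeader *out)
{
    if (len < sizeof(*out)) {
        return -1;
    }
    memcpy(out, buf, sizeof(*out));
    if (out->magic != MIG_CHANNEL_MAGIC || out->version != 1) {
        return -1;
    }
    return 0;
}
```

A versioned header like this is what makes the handshake "extensible": unknown future channel types can be rejected cleanly instead of being misparsed as migration data.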
Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
On 10.07.2024 22:16, Fabiano Rosas wrote: Peter Xu writes: On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote: Peter Xu writes: On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:

Or graphically:

1) client fills the active slot with data. Channels point to nothing at this point:

    [a]      <-- active slot
    [][][][] <-- free slots, one per-channel
    [][][][] <-- channels' p->data pointers

2) multifd_send() swaps the pointers inside the client slot. Channels still point to nothing:

    []
    [a][][][]
    [][][][]

3) multifd_send() finds an idle channel and updates its pointer:

It seems the action "finds an idle channel" is in step 2 rather than step 3, which means the free slot is selected based on the id of the channel found, am I understanding correctly?

I think you're right. Actually I also feel like the description here is ambiguous, even though I think I get what Fabiano wanted to say. The free slot should be the first step of step 2+3; here what Fabiano really wanted to suggest is that we move the free buffer array from the multifd channels into the callers, then the caller can pass in whatever data to send. So I think maybe it's cleaner to write it as this in code (note: I didn't really change the code, just some ordering and comments):

===8<===
@@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
      */
     active_slot = slots->active;
     slots->active = slots->free[p->id];
-    p->data = active_slot;
-
-    /*
-     * By the next time we arrive here, the channel will certainly
-     * have consumed the active slot. Put it back on the free list
-     * now.
-     */
     slots->free[p->id] = active_slot;
+    /* Assign the current active slot to the chosen thread */
+    p->data = active_slot;
===8<===

The comment I removed is slightly misleading to me too, because right now active_slot contains data that hasn't yet been delivered to multifd, so we're "putting it back to the free list" not because it's free, but because we know it won't get used until the multifd send thread consumes it (because before that the thread will be busy, and we won't use the buffer if so in upcoming send()s).

And then when I'm looking at this again, I think maybe it's a slight overkill, and maybe we can still keep the "opaque data" managed by multifd. One reason might be that I don't expect the "opaque data" payload to keep growing at all: it should really be either RAM or device state, as I commented elsewhere in a relevant thread; after all it's a thread model only for migration purposes, to move vmstates..

Some amount of flexibility needs to be baked in. For instance, what about the handshake procedure? Don't we want to use multifd threads to put some information on the wire for that as well?

Is this an orthogonal question?

I don't think so. You say the payload data should be either RAM or device state. I'm asking what other types of data we want the multifd channel to transmit, and suggesting we need to allow room for the addition of that, whatever it is. One thing that comes to mind that is neither RAM nor device state is some form of handshake or capabilities negotiation.

The RFC version of my multifd device state transfer patch set introduced a new migration channel header (by Avihai) for clean and extensible migration channel handshaking, but people didn't like it so it was removed in v1.

Thanks, Maciej
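The pointer swap in the patch fragment quoted above can be modeled as a tiny self-contained sketch. The surrounding types ("Payload", "Channel", the channel count) are invented stand-ins for the real multifd structures; only the three-assignment swap mirrors the quoted diff.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CHANNELS 4

/* Stand-in for the opaque data the client fills. */
typedef struct Payload {
    int filled;
} Payload;

typedef struct MultiFDSlots {
    Payload *active;              /* slot the client fills next */
    Payload *free[NUM_CHANNELS];  /* one spare slot per channel */
} MultiFDSlots;

typedef struct Channel {
    Payload *data;                /* p->data in the quoted diff */
} Channel;

/* Hand the filled active slot to channel `id`, taking that channel's
 * spare slot as the new active one. The channel owns the old active
 * slot until its send thread consumes it, which is why it is safe to
 * park it in free[id] immediately. */
static void multifd_send_swap(MultiFDSlots *slots, Channel *ch, int id)
{
    Payload *active_slot = slots->active;

    slots->active = slots->free[id];
    slots->free[id] = active_slot;
    /* Assign the previously active slot to the chosen channel */
    ch->data = active_slot;
}
```

This makes the three-way ownership transfer in steps 2 and 3 of the diagram explicit: the client never blocks, because it always ends up holding a slot the chosen channel is guaranteed not to touch.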
Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
On 27.06.2024 16:56, Peter Xu wrote: On Thu, Jun 27, 2024 at 11:14:28AM +0200, Maciej S. Szmigiero wrote: On 26.06.2024 18:23, Peter Xu wrote: On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote: On 26.06.2024 03:51, Peter Xu wrote: On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote: On 25.06.2024 19:25, Peter Xu wrote: On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote: Hi Peter, Hi, Maciej, On 23.06.2024 22:27, Peter Xu wrote: On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" This is an updated v1 patch series of the RFC (v0) series located here: https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/ OK I took some hours thinking about this today, and here's some high level comments for this series. I'll start with which are more relevant to what Fabiano has already suggested in the other thread, then I'll add some more. https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de That's a long list, thanks for these comments. I have responded to them inline below. (..) 2) Submit this operation to the thread pool and wait for it to complete, VFIO doesn't need to have its own code waiting. If this pool is for migration purpose in general, qemu migration framework will need to wait at some point for all jobs to finish before moving on. Perhaps it should be at the end of the non-iterative session. So essentially, instead of calling save_live_complete_precopy_end handlers from the migration code you would like to hard-code its current VFIO implementation of calling vfio_save_complete_precopy_async_thread_thread_terminate(). Only it wouldn't be then called VFIO precopy async thread terminate but some generic device state async precopy thread terminate function. I don't understand what did you mean by "hard code". "Hard code" wasn't maybe the best expression here. 
I meant the move of the functionality that's provided by vfio_save_complete_precopy_async_thread_thread_terminate() in this patch set to the common migration code.

I see. That function only does a thread_join() so far. So can I understand it as below [1] should work for us, and it'll be clean too (with nothing to hard-code)?

It will need some signal to the worker threads pool to terminate before waiting for them to finish (as the code in [1] just waits). In the case of the current vfio_save_complete_precopy_async_thread() implementation, this signal isn't necessary since that thread simply terminates when it has read all the data it needs from the device. In a worker threads pool case there will be some threads waiting for jobs to be queued to them, so they will need to be somehow signaled to exit.

The time to join() the worker threads can be even later, up until migrate_fd_cleanup() on the sender side. You may have a better idea on the best place to do it once you start working on it. What I was saying is that if we target the worker thread pool to be used for "concurrently dump vmstates", then it'll make sense to make sure all the jobs there were flushed after QEMU dumps all non-iterables (because this should be the last step of the switchover). I expect it looks like this:

    while (pool->active_threads) {
        qemu_sem_wait(&pool->job_done);
    }

[1]

(..)

I think that with this thread pool introduction we'll unfortunately almost certainly need to target this patch set at 9.2, since these overall changes (and Fabiano's patches too) will need good testing, might uncover some performance regressions (for example related to the number-of-buffers limit or Fabiano's multifd changes), bring some review comments from other people, etc. In addition to that, we are in the middle of the holiday season and a lot of people aren't available - like Fabiano, who said he will be available only in a few weeks.

Right, that's unfortunate.
Let's see, but still I really hope we can also get some feedback from Fabiano before it lands, even with that we have chance for 9.1 but it's just challenging, it's the same condition I mentioned since the 1st email. And before Fabiano's back (he's the active maintainer for this release), I'm personally happy if you can propose something that can land earlier in this release partly. E.g., if you want we can at least upstream Fabiano's idea first, or some more on top. For that, also feel to have a look at my comment today: https://lore.kernel.org/r/Zn15y693g0AkDbYD@x1n Feel free to comment there too. There's a tiny uncertainty there so far on specifying "max size for a device state" if do what I suggested, as multifd setup will need to allocate an enum buffer suitable for both ram + device. But I think that's not an issue and you'll tackle that properly when working on it. It's more about whether you agree on what I said as a general concept. Since it seems that the discussion on Fabiano's
Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
On 26.06.2024 18:23, Peter Xu wrote: On Wed, Jun 26, 2024 at 05:47:34PM +0200, Maciej S. Szmigiero wrote: On 26.06.2024 03:51, Peter Xu wrote: On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote: On 25.06.2024 19:25, Peter Xu wrote: On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote: Hi Peter, Hi, Maciej, On 23.06.2024 22:27, Peter Xu wrote: On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" This is an updated v1 patch series of the RFC (v0) series located here: https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/ OK I took some hours thinking about this today, and here's some high level comments for this series. I'll start with which are more relevant to what Fabiano has already suggested in the other thread, then I'll add some more. https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de That's a long list, thanks for these comments. I have responded to them inline below.

(..)

4. Risk of OOM on unlimited VFIO buffering
==========================================

This follows from the above bullet, but my pure question to ask here is: how does VFIO guarantee no OOM condition when buffering VFIO state? I mean, currently your proposal uses vfio_load_bufs_thread() as a separate thread to only load the vfio states until sequential data is received; however, is there an upper limit on how much buffering it could do? IOW:

    vfio_load_state_buffer():

        if (packet->idx >= migration->load_bufs->len) {
            g_array_set_size(migration->load_bufs, packet->idx + 1);
        }
        lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
        ...
        lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
        lb->len = data_size - sizeof(*packet);
        lb->is_present = true;

What if the GArray keeps growing with lb->data allocated, which triggers the memcg limit of the process (if QEMU is in such a process)? Or just depletes host memory, causing an OOM kill. I think we may need to find a way to throttle the max memory usage of such buffering.
So far this will be more of a problem indeed if this will be done during VFIO iteration phases, but I still hope a solution can work with both the iteration phase and the switchover phase, even if you only do that in the switchover phase.

Unfortunately, this issue will be hard to fix since the source can legitimately send the very first buffer (chunk) of data as the last one (at the very end of the transmission). In this case, the target will need to buffer nearly the whole data. We can't stop receiving on any channel, either, since the next missing buffer can arrive on that channel. However, I don't think purposely DoSing the target QEMU is a realistic security concern in the typical live migration scenario. I mean, the source can easily force the target QEMU to exit just by feeding it wrong migration data. In case someone really wants to protect against the impact of theoretically unbounded QEMU memory allocations during live migration on the rest of the system, they can put the target QEMU process (temporarily) into a memory-limited cgroup.

Note that I'm not worrying about DoS from a malicious src QEMU, and I'm exactly talking about the generic case where QEMU (either src or dest, in that case normally both) is put into the memcg and if QEMU uses too much memory it'll literally get killed, even with no DoS issue at all. In short, we hopefully will have a design that will always work with QEMU running in a container, without a 0.5% chance of the dest QEMU being killed, if you see what I mean. The upper bound of VFIO buffering will be needed so the admin can add that on top of the memcg limit, and as long as QEMU keeps its word it'll always work without sudden death.

I think I have some idea about resolving this problem. That idea may further complicate the protocol a little bit. But before that, let's see whether we can reach an initial consensus on this matter first, on whether this is a sane request.
In short, we'll need to start to have a configurable size to say how much VFIO can buffer, maybe per-device, or globally. Then based on that we need to have some logic guarantee that over-mem won't happen, also without heavily affecting concurrency (e.g., single thread is definitely safe and without caching, but it can be slower). Here, I think I can add a per-device limit parameter on the number of buffers received out-of-order or waiting to be loaded into the device - with a reasonable default. Yes that should work. I don't even expect people would change that, but this might be the information people will need to know before putting it into a container if it's larger than how qemu dynamically consumes memories here and there. I'd expect it is still small enough so nobody will notice it (maybe a few tens of MBs? but just wildly guessing, where tens of MBs could fall
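The per-device cap on out-of-order buffers proposed above can be modeled as follows. All names and the fixed-size bitmap are invented for illustration; the actual patch set uses a GArray of received buffers, which this sketch deliberately simplifies. Buffers may arrive in any order but must be loaded in order, so the receiver counts how many are parked waiting and refuses new ones past a configurable limit instead of growing without bound.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BUFS 64           /* total indices handled by this sketch */

typedef struct DeviceLoadState {
    bool present[MAX_BUFS];   /* buffer idx received, not yet loaded */
    unsigned next_idx;        /* next idx the device can accept */
    unsigned queued;          /* received but not yet loaded */
    unsigned queued_limit;    /* the configurable per-device cap */
    unsigned loaded;          /* how many buffers reached the device */
} DeviceLoadState;

/* Returns false when accepting the buffer would exceed the cap (the
 * caller would then fail the migration rather than risk an OOM kill). */
static bool device_buffer_receive(DeviceLoadState *s, unsigned idx)
{
    if (idx >= MAX_BUFS || s->present[idx]) {
        return false;
    }
    /* An in-order buffer is always accepted: it drains immediately. */
    if (idx != s->next_idx && s->queued >= s->queued_limit) {
        return false;
    }
    s->present[idx] = true;
    s->queued++;

    /* Drain every buffer that is now loadable in order. */
    while (s->next_idx < MAX_BUFS && s->present[s->next_idx]) {
        s->present[s->next_idx] = false;   /* "load into the device" */
        s->next_idx++;
        s->queued--;
        s->loaded++;
    }
    return true;
}
```

With a cap like this, the worst-case memory use is queued_limit times the chunk size per device, which is exactly the number an admin can add on top of a memcg limit.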
Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
On 26.06.2024 03:51, Peter Xu wrote: On Wed, Jun 26, 2024 at 12:44:29AM +0200, Maciej S. Szmigiero wrote: On 25.06.2024 19:25, Peter Xu wrote: On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote: Hi Peter, Hi, Maciej, On 23.06.2024 22:27, Peter Xu wrote: On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" This is an updated v1 patch series of the RFC (v0) series located here: https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/ OK I took some hours thinking about this today, and here's some high level comments for this series. I'll start with which are more relevant to what Fabiano has already suggested in the other thread, then I'll add some more. https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de That's a long list, thanks for these comments. I have responded to them inline below. (..) 3. load_state_buffer() and VFIODeviceStatePacket protocol = VFIODeviceStatePacket is the new protocol you introduced into multifd packets, along with the new load_state_buffer() hook for loading such buffers. My question is whether it's needed at all, or.. whether it can be more generic (and also easier) to just allow taking any device state in the multifd packets, then load it with vmstate load(). I mean, the vmstate_load() should really have worked on these buffers, if after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the first flag (uint64), size as the 2nd, then (2) load that rest buffer into VFIO kernel driver. That is the same to happen during the blackout window. It's not clear to me why load_state_buffer() is needed. I also see that you're also using exactly the same chunk size for such buffering (VFIOMigration.data_buffer_size). 
I think you have a "reason": VFIODeviceStatePacket and loading of the buffer data resolved one major issue that wasn't there before but start to have now: multifd allows concurrent arrivals of vfio buffers, even if the buffer *must* be sequentially loaded. That's a major pain for current VFIO kernel ioctl design, IMHO. I think I used to ask nVidia people on whether the VFIO get_state/set_state interface can allow indexing or tagging of buffers but I never get a real response. IMHO that'll be extremely helpful for migration purpose on concurrency if it can happen, rather than using a serialized buffer. It means concurrently save/load one VFIO device could be extremely hard, if not impossible. I am pretty sure that the current kernel VFIO interface requires for the buffers to be loaded in-order - accidentally providing the out of order definitely breaks the restore procedure. Ah, I didn't mean that we need to do it with the current API. I'm talking about whether it's possible to have a v2 that will support those otherwise we'll need to do "workarounds" like what you're doing with "unlimited buffer these on dest, until we receive continuous chunk of data" tricks. Better kernel API might be possible in the long term but for now we have to live with what we have right now. After all, adding true unordered loading - I mean not just moving the reassembly process from QEMU to the kernel but making the device itself accept buffers out out order - will likely be pretty complex (requiring adding such functionality to the device firmware, etc). I would expect the device will need to be able to provision the device states so it became smaller objects rather than one binary object, then either tag-able or address-able on those objects. And even with that trick, it'll still need to be serialized on the read() syscall so it won't scale either if the state is huge. For that issue there's no workaround we can do from userspace. 
The read() calls for multiple VFIO devices can be issued in parallel, and in fact they are in my patch set. I was talking about concurrency for one device. AFAIK with the current hardware the read speed is limited by the device itself, so adding additional reading threads wouldn't help. Once someone has the hardware which is limited by single reading thread that person can add the necessary kernel API (including unordered loading) and then extend QEMU with such support. (..) 4. Risk of OOM on unlimited VFIO buffering == This follows with above bullet, but my pure question to ask here is how does VFIO guarantees no OOM condition by buffering VFIO state? I mean, currently your proposal used vfio_load_bufs_thread() as a separate thread to only load the vfio states until sequential data is received, however is there an upper limit of how much buffering it could do? IOW: vfio_load_state_buffer(): if (packet->idx >= migration->load_bufs->len) { g_array_set_size(migration->load_bufs, packet->idx + 1); } lb = _
Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
On 25.06.2024 19:25, Peter Xu wrote: On Mon, Jun 24, 2024 at 09:51:18PM +0200, Maciej S. Szmigiero wrote: Hi Peter, Hi, Maciej, On 23.06.2024 22:27, Peter Xu wrote: On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" This is an updated v1 patch series of the RFC (v0) series located here: https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/ OK I took some hours thinking about this today, and here's some high level comments for this series. I'll start with which are more relevant to what Fabiano has already suggested in the other thread, then I'll add some more. https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de That's a long list, thanks for these comments. I have responded to them inline below. (..) 3. load_state_buffer() and VFIODeviceStatePacket protocol = VFIODeviceStatePacket is the new protocol you introduced into multifd packets, along with the new load_state_buffer() hook for loading such buffers. My question is whether it's needed at all, or.. whether it can be more generic (and also easier) to just allow taking any device state in the multifd packets, then load it with vmstate load(). I mean, the vmstate_load() should really have worked on these buffers, if after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the first flag (uint64), size as the 2nd, then (2) load that rest buffer into VFIO kernel driver. That is the same to happen during the blackout window. It's not clear to me why load_state_buffer() is needed. I also see that you're also using exactly the same chunk size for such buffering (VFIOMigration.data_buffer_size). I think you have a "reason": VFIODeviceStatePacket and loading of the buffer data resolved one major issue that wasn't there before but start to have now: multifd allows concurrent arrivals of vfio buffers, even if the buffer *must* be sequentially loaded. That's a major pain for current VFIO kernel ioctl design, IMHO. 
I think I used to ask nVidia people on whether the VFIO get_state/set_state interface can allow indexing or tagging of buffers but I never get a real response. IMHO that'll be extremely helpful for migration purpose on concurrency if it can happen, rather than using a serialized buffer. It means concurrently save/load one VFIO device could be extremely hard, if not impossible. I am pretty sure that the current kernel VFIO interface requires for the buffers to be loaded in-order - accidentally providing the out of order definitely breaks the restore procedure. Ah, I didn't mean that we need to do it with the current API. I'm talking about whether it's possible to have a v2 that will support those otherwise we'll need to do "workarounds" like what you're doing with "unlimited buffer these on dest, until we receive continuous chunk of data" tricks. Better kernel API might be possible in the long term but for now we have to live with what we have right now. After all, adding true unordered loading - I mean not just moving the reassembly process from QEMU to the kernel but making the device itself accept buffers out out order - will likely be pretty complex (requiring adding such functionality to the device firmware, etc). And even with that trick, it'll still need to be serialized on the read() syscall so it won't scale either if the state is huge. For that issue there's no workaround we can do from userspace. The read() calls for multiple VFIO devices can be issued in parallel, and in fact they are in my patch set. (..) 4. Risk of OOM on unlimited VFIO buffering == This follows with above bullet, but my pure question to ask here is how does VFIO guarantees no OOM condition by buffering VFIO state? I mean, currently your proposal used vfio_load_bufs_thread() as a separate thread to only load the vfio states until sequential data is received, however is there an upper limit of how much buffering it could do? 
IOW:

    vfio_load_state_buffer():

        if (packet->idx >= migration->load_bufs->len) {
            g_array_set_size(migration->load_bufs, packet->idx + 1);
        }
        lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
        ...
        lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
        lb->len = data_size - sizeof(*packet);
        lb->is_present = true;

What if the GArray keeps growing with lb->data allocated, which triggers the memcg limit of the process (if QEMU is in such a process)? Or just depletes host memory, causing an OOM kill. I think we may need to find a way to throttle the max memory usage of such buffering.

So far this will be more of a problem indeed if this will be done during VFIO iteration phases, but I still hope a solution can work with both iteration phase and the switchover phase, even if you only do
Re: [PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
Hi Peter, On 23.06.2024 22:27, Peter Xu wrote: On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" This is an updated v1 patch series of the RFC (v0) series located here: https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/ OK I took some hours thinking about this today, and here's some high level comments for this series. I'll start with which are more relevant to what Fabiano has already suggested in the other thread, then I'll add some more. https://lore.kernel.org/r/20240620212111.29319-1-faro...@suse.de That's a long list, thanks for these comments. I have responded to them inline below. 1. Multifd device state support === As Fabiano suggested in his RFC post, we may need one more layer of abstraction to represent VFIO's demand on allowing multifd to send arbitrary buffer to the wire. This can be more than "how to pass the device state buffer to the sender threads". So far, MultiFDMethods is only about RAM. If you pull the latest master branch Fabiano just merged yet two more RAM compressors that are extended on top of MultiFDMethods model. However still they're all about RAM. I think it's better to keep it this way, so maybe MultiFDMethods should some day be called MultiFDRamMethods. multifd_send_fill_packet() may only be suitable for RAM buffers, not adhoc buffers like what VFIO is using. multifd_send_zero_page_detect() may not be needed either for arbitrary buffers. Most of those are still page-based. I think it also means we shouldn't call ->send_prepare() when multifd send thread notices that it's going to send a VFIO buffer. So it should look like this: int type = multifd_payload_type(p->data); if (type == MULTIFD_PAYLOAD_RAM) { multifd_send_state->ops->send_prepare(p, _err); } else { // VFIO buffers should belong here assert(type == MULTIFD_PAYLOAD_DEVICE_STATE); ... 
} It also means it shouldn't contain code like: nocomp_send_prepare(): if (p->is_device_state_job) { return nocomp_send_prepare_device_state(p, errp); } else { return nocomp_send_prepare_ram(p, errp); } nocomp should only exist in RAM world, not VFIO's. And it looks like you agree with Fabiano's RFC proposal, please work on top of that to provide that layer. Please make sure it outputs the minimum in "$ git grep device_state migration/multifd.c" when you work on the new version. Currently: $ git grep device_state migration/multifd.c | wc -l 59 The hope is zero, or at least a minimum with good reasons. I guess you mean "grep -i" in the above example, since otherwise the above command will find only lowercase "device_state". On the other hand, your example code above has uppercase "DEVICE_STATE", suggesting that it might be okay? Overall, using Fabiano's patch set as a base for mine makes sense to me. 2. Frequent mallocs/frees === Fabiano's series can also help to address some of these, but it looks like this series used malloc/free more than the opaque data buffer. This is not required to get things merged, but it'll be nice to avoid those if possible. Ack - as long as it's not making the code messy/fragile, of course. 3. load_state_buffer() and VFIODeviceStatePacket protocol === VFIODeviceStatePacket is the new protocol you introduced into multifd packets, along with the new load_state_buffer() hook for loading such buffers. My question is whether it's needed at all, or... whether it can be more generic (and also easier) to just allow taking any device state in the multifd packets, then load it with vmstate_load(). I mean, the vmstate_load() should really have worked on these buffers, if after all VFIO is looking for: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the first flag (uint64), size as the 2nd, then (2) load that rest buffer into VFIO kernel driver. That is the same to happen during the blackout window. It's not clear to me why load_state_buffer() is needed. 
I also see that you're also using exactly the same chunk size for such buffering (VFIOMigration.data_buffer_size). I think you have a "reason": VFIODeviceStatePacket and loading of the buffer data resolved one major issue that wasn't there before but exists now: multifd allows concurrent arrivals of vfio buffers, even if the buffer *must* be sequentially loaded. That's a major pain for the current VFIO kernel ioctl design, IMHO. I think I used to ask nVidia people whether the VFIO get_state/set_state interface can allow indexing or tagging of buffers but I never got a real response. IMHO that'll be extremely helpful for migration purposes on concurrency if it can happen, rather than using a serialized buffer. It means concurrently saving/loading one VFIO device could be extremely hard,
Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
On 21.06.2024 22:54, Peter Xu wrote: On Fri, Jun 21, 2024 at 07:40:01PM +0200, Maciej S. Szmigiero wrote: On 21.06.2024 17:56, Peter Xu wrote: On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote: On 21.06.2024 17:04, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: Hi Fabiano, On 20.06.2024 23:21, Fabiano Rosas wrote: Hi folks, First of all, apologies for the roughness of the series. I'm off for the next couple of weeks and wanted to put something together early for your consideration. This series is a refactoring (based on an earlier, off-list attempt[0]), aimed to remove the usage of the MultiFDPages_t type in the multifd core. If we're going to add support for more data types to multifd, we first need to clean that up. This time around this work was prompted by Maciej's series[1]. I see you're having to add a bunch of is_device_state checks to work around the rigidity of the code. Aside from the VFIO work, there is also the intent (coming back from Juan's ideas) to make multifd the default code path for migration, which will have to include the vmstate migration and anything else we put on the stream via QEMUFile. I have long since been bothered by having 'pages' sprinkled all over the code, so I might be coming at this with a bit of a narrow focus, but I believe in order to support more types of payloads in multifd, we need to first allow the scheduling at multifd_send_pages() to be independent of MultiFDPages_t. So here it is. Let me know what you think. Thanks for the patch set, I quickly glanced at these patches and they definitely make sense to me. (..) (as I said, I'll be off for a couple of weeks, so feel free to incorporate any of this code if it's useful. Or to ignore it completely). I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has feature freeze in about a month, correct? For general code improvements like this I'm not thinking about QEMU releases at all. 
But this series is not super complex, so I could imagine we merging it in time for 9.1 if we reach an agreement. Are you thinking your series might miss the target? Or have concerns over the stability of the refactoring? We can within reason merge code based on the current framework and improve things on top, we already did something similar when merging zero-page support. I don't have an issue with that. The reason that I asked whether you are targeting 9.1 is because my patch set is definitely targeting that release. At the same time my patch set will need to be rebased/refactored on top of this patch set if it is supposed to be merged for 9.1 too. If this patch set gets merged quickly that's not really a problem. On the other hand, if another iteration(s) is/are needed AND you are not available in the coming weeks to work on them then there's a question whether we will make the required deadline. I think it's a bit rush to merge the vfio series in this release. I'm not sure it has enough time to be properly reviewed, reposted, retested, etc. I've already started looking at it, and so far I think I have doubt not only on agreement with Fabiano on the device_state thing which I prefer to avoid, but also I'm thinking of any possible way to at least make the worker threads generic too: a direct impact could be vDPA in the near future if anyone cared, while I don't want modules to create threads randomly during migration. Meanwhile I'm also thinking whether that "the thread needs to dump all data, and during iteration we can't do that" is the good reason to not support that during iterations. I didn't yet reply because I don't think I think all things through, but I'll get there. So I'm not saying that the design is problematic, but IMHO it's just not mature enough to assume it will land in 9.1, considering it's still a large one, and the first non-rfc version just posted two days ago. The RFC version was posted more than 2 months ago. 
It has received some review comments from multiple people, all of which were addressed in this patch set version. I thought it was mostly me who reviewed it, am I right? Or do you have other thread that has such discussion happening, and the design review has properly done and reached an agreement? Daniel P. Berrangé also submitted a few comments: [1], [2], [3], [4], [5]. In fact, it is him who first suggested not having a new channel header wire format or dedicated device state channels. In addition to that, Avihai was also following our discussions: [6] and he also looked privately at an early (but functioning) draft of these patches well before the RFC was even publicly posted. IMHO that is also not how RFC works. It doesn't work like "if RFC didn't got NACKed, a maintainer should merge v1 when someone posts it". Instead RFC should only mean these at least to me: "(1) please review this from high level, things can drastically change; (2) please don't merge this, because it is not f
Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
On 21.06.2024 17:56, Peter Xu wrote: On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote: On 21.06.2024 17:04, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: Hi Fabiano, On 20.06.2024 23:21, Fabiano Rosas wrote: Hi folks, First of all, apologies for the roughness of the series. I'm off for the next couple of weeks and wanted to put something together early for your consideration. This series is a refactoring (based on an earlier, off-list attempt[0]), aimed to remove the usage of the MultiFDPages_t type in the multifd core. If we're going to add support for more data types to multifd, we first need to clean that up. This time around this work was prompted by Maciej's series[1]. I see you're having to add a bunch of is_device_state checks to work around the rigidity of the code. Aside from the VFIO work, there is also the intent (coming back from Juan's ideas) to make multifd the default code path for migration, which will have to include the vmstate migration and anything else we put on the stream via QEMUFile. I have long since been bothered by having 'pages' sprinkled all over the code, so I might be coming at this with a bit of a narrow focus, but I believe in order to support more types of payloads in multifd, we need to first allow the scheduling at multifd_send_pages() to be independent of MultiFDPages_t. So here it is. Let me know what you think. Thanks for the patch set, I quickly glanced at these patches and they definitely make sense to me. (..) (as I said, I'll be off for a couple of weeks, so feel free to incorporate any of this code if it's useful. Or to ignore it completely). I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has feature freeze in about a month, correct? For general code improvements like this I'm not thinking about QEMU releases at all. But this series is not super complex, so I could imagine we merging it in time for 9.1 if we reach an agreement. Are you thinking your series might miss the target? 
Or have concerns over the stability of the refactoring? We can within reason merge code based on the current framework and improve things on top, we already did something similar when merging zero-page support. I don't have an issue with that. The reason that I asked whether you are targeting 9.1 is because my patch set is definitely targeting that release. At the same time my patch set will need to be rebased/refactored on top of this patch set if it is supposed to be merged for 9.1 too. If this patch set gets merged quickly that's not really a problem. On the other hand, if another iteration(s) is/are needed AND you are not available in the coming weeks to work on them then there's a question whether we will make the required deadline. I think it's a bit rush to merge the vfio series in this release. I'm not sure it has enough time to be properly reviewed, reposted, retested, etc. I've already started looking at it, and so far I think I have doubt not only on agreement with Fabiano on the device_state thing which I prefer to avoid, but also I'm thinking of any possible way to at least make the worker threads generic too: a direct impact could be vDPA in the near future if anyone cared, while I don't want modules to create threads randomly during migration. Meanwhile I'm also thinking whether that "the thread needs to dump all data, and during iteration we can't do that" is the good reason to not support that during iterations. I didn't yet reply because I don't think I think all things through, but I'll get there. So I'm not saying that the design is problematic, but IMHO it's just not mature enough to assume it will land in 9.1, considering it's still a large one, and the first non-rfc version just posted two days ago. The RFC version was posted more than 2 months ago. It has received some review comments from multiple people, all of which were addressed in this patch set version. 
I have not received any further comments during these 2 months, so I thought the overall design is considered okay - if anything, there might be minor code comments/issues but these can easily be improved/fixed in the 5 weeks remaining to the soft code freeze for 9.1. If anything, I think that the VM live phase (non-downtime) transfers functionality should be deferred until 9.2 because: * It wasn't a part of the RFC so even if implemented today would get much less testing overall, * It's orthogonal to the switchover time device state transfer functionality introduced by this patch set and could be added on top of that without changing the wire protocol for switchover time device state transfers, * It doesn't impact the switchover downtime so in this case 9.1 would already contain all what's necessary to improve it. Thanks, Maciej
Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
On 21.06.2024 17:04, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: Hi Fabiano, On 20.06.2024 23:21, Fabiano Rosas wrote: Hi folks, First of all, apologies for the roughness of the series. I'm off for the next couple of weeks and wanted to put something together early for your consideration. This series is a refactoring (based on an earlier, off-list attempt[0]), aimed to remove the usage of the MultiFDPages_t type in the multifd core. If we're going to add support for more data types to multifd, we first need to clean that up. This time around this work was prompted by Maciej's series[1]. I see you're having to add a bunch of is_device_state checks to work around the rigidity of the code. Aside from the VFIO work, there is also the intent (coming back from Juan's ideas) to make multifd the default code path for migration, which will have to include the vmstate migration and anything else we put on the stream via QEMUFile. I have long since been bothered by having 'pages' sprinkled all over the code, so I might be coming at this with a bit of a narrow focus, but I believe in order to support more types of payloads in multifd, we need to first allow the scheduling at multifd_send_pages() to be independent of MultiFDPages_t. So here it is. Let me know what you think. Thanks for the patch set, I quickly glanced at these patches and they definitely make sense to me. (..) (as I said, I'll be off for a couple of weeks, so feel free to incorporate any of this code if it's useful. Or to ignore it completely). I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has feature freeze in about a month, correct? For general code improvements like this I'm not thinking about QEMU releases at all. But this series is not super complex, so I could imagine we merging it in time for 9.1 if we reach an agreement. Are you thinking your series might miss the target? Or have concerns over the stability of the refactoring? 
We can within reason merge code based on the current framework and improve things on top, we already did something similar when merging zero-page support. I don't have an issue with that. The reason that I asked whether you are targeting 9.1 is because my patch set is definitely targeting that release. At the same time my patch set will need to be rebased/refactored on top of this patch set if it is supposed to be merged for 9.1 too. If this patch set gets merged quickly that's not really a problem. On the other hand, if another iteration(s) is/are needed AND you are not available in the coming weeks to work on them then there's a question whether we will make the required deadline. Thanks, Maciej
Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
Hi Fabiano, On 20.06.2024 23:21, Fabiano Rosas wrote: Hi folks, First of all, apologies for the roughness of the series. I'm off for the next couple of weeks and wanted to put something together early for your consideration. This series is a refactoring (based on an earlier, off-list attempt[0]), aimed to remove the usage of the MultiFDPages_t type in the multifd core. If we're going to add support for more data types to multifd, we first need to clean that up. This time around this work was prompted by Maciej's series[1]. I see you're having to add a bunch of is_device_state checks to work around the rigidity of the code. Aside from the VFIO work, there is also the intent (coming back from Juan's ideas) to make multifd the default code path for migration, which will have to include the vmstate migration and anything else we put on the stream via QEMUFile. I have long since been bothered by having 'pages' sprinkled all over the code, so I might be coming at this with a bit of a narrow focus, but I believe in order to support more types of payloads in multifd, we need to first allow the scheduling at multifd_send_pages() to be independent of MultiFDPages_t. So here it is. Let me know what you think. Thanks for the patch set, I quickly glanced at these patches and they definitely make sense to me. I guess its latest version could be found in the repo at [2] since that's where the CI run mentioned below took it from? (as I said, I'll be off for a couple of weeks, so feel free to incorporate any of this code if it's useful. Or to ignore it completely). I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has feature freeze in about a month, correct? CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028 0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/ 1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigi...@oracle.com [2]: https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple Thanks, Maciej
[PATCH v1 00/13] Multifd device state transfer support with VFIO consumer
From: "Maciej S. Szmigiero" This is an updated v1 patch series of the RFC (v0) series located here: https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigi...@oracle.com/ Changes from the RFC (v0): * Extend the existing multifd packet format instead of introducing a new migration channel header. * As a replacement for switching the migration channel header on or off, introduce an "x-migration-multifd-transfer" VFIO device property that allows configuring at runtime whether to send the particular device state via multifd channels when live migrating that device. This property defaults to "false" for bit stream compatibility with older QEMU versions. * Remove the support for having dedicated device state transfer multifd channels since the same downtime performance can be attained by simply reducing the total number of multifd channels in a shared channel configuration to the number of channels available for RAM transfer in the dedicated device state channels configuration. For example, the best downtime from the dedicated device state config on my setup (achieved in a configuration of 10 total multifd channels / 4 dedicated device state channels) can also be achieved in the shared RAM/device state channel configuration by reducing the total multifd channel count to 6. It looks like not having too many RAM transfer multifd channels is key to having a good downtime since the results are reproducibly worse with 15 shared channels total, while they are as good as with 6 shared channels if there are 15 total channels but 4 of them are dedicated to transferring device state (leaving 11 for RAM transfer). * Make the next multifd channel selection fairer when converting multifd_send_pages::next_channel to atomic. * Convert the code to use QEMU thread sync primitives (QemuMutex with QemuLockable, QemuCond) instead of their Glib equivalents (GMutex, GMutexLocker and GCond). 
* Rename complete_precopy_async{,_wait} to complete_precopy_{begin,_end} as suggested. * Rebase onto last week's QEMU git master and retest. When working on the updated patch set version I also investigated the possibility of refactoring VM live phase (non-downtime) transfers to happen via multifd channels. However, the VM live phase transfer works differently: it happens opportunistically until the remaining data drops below the switchover threshold, rather than always transferring the whole device state data until it is exhausted. For this reason, there would need to be some way in the migration framework to update the remaining data estimate from the per-device saving/transfer queuing threads and then stop these threads when the decision has been reached in the migration core to stop the VM and switch over. Such functionality would need to be introduced first. There would also need to be some fairness guarantees so every device gets similar access to multifd channels - otherwise there could be a situation that the remaining data never drops below the switchover threshold because some devices are starved with respect to access to the multifd transfer channels - as in the VM live phase additional device data is constantly being generated. Moreover, there's nothing stopping a QEMU device driver from requiring different handling (loading, etc.) of VM live phase data from the post-switchover data. For cases like this some kind of a new device VM live phase incoming data load handler would need to be introduced too. For the above reasons, the VM live phase multifd transfer functionality isn't a simple extension of the functionality introduced by this patch set. For convenience, this patch set is also available as a git tree: https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio Maciej S. 
Szmigiero (13): vfio/migration: Add save_{iterate,complete_precopy}_started trace events migration/ram: Add load start trace event migration/multifd: Zero p->flags before starting filling a packet migration: Add save_live_complete_precopy_{begin,end} handlers migration: Add qemu_loadvm_load_state_buffer() and its handler migration: Add load_finish handler and associated functions migration/multifd: Device state transfer support - receive side migration/multifd: Convert multifd_send_pages::next_channel to atomic migration/multifd: Device state transfer support - send side migration/multifd: Add migration_has_device_state_support() vfio/migration: Multifd device state transfer support - receive side vfio/migration: Add x-migration-multifd-transfer VFIO property vfio/migration: Multifd device state transfer support - send side hw/vfio/migration.c | 545 +- hw/vfio/pci.c | 7 + hw/vfio/trace-events | 15 +- include/hw/vfio/vfio-common.h | 27 ++ include/migration/misc.h | 5 + include/migration/register.h | 70 + migration/migration.c | 6 + mi
[PATCH v1 02/13] migration/ram: Add load start trace event
From: "Maciej S. Szmigiero" There's a RAM load complete trace event but there wasn't its start equivalent. Signed-off-by: Maciej S. Szmigiero --- migration/ram.c| 1 + migration/trace-events | 1 + 2 files changed, 2 insertions(+) diff --git a/migration/ram.c b/migration/ram.c index ceea586b06ba..87b0cf86db0c 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -4129,6 +4129,7 @@ static int ram_load_precopy(QEMUFile *f) RAM_SAVE_FLAG_ZERO); } +trace_ram_load_start(); while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) { ram_addr_t addr; void *host = NULL, *host_bak = NULL; diff --git a/migration/trace-events b/migration/trace-events index 0b7c3324fb5e..43dfe4a4bc03 100644 --- a/migration/trace-events +++ b/migration/trace-events @@ -115,6 +115,7 @@ colo_flush_ram_cache_end(void) "" save_xbzrle_page_skipping(void) "" save_xbzrle_page_overflow(void) "" ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations" +ram_load_start(void) "" ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64 ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu" ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
[PATCH v1 08/13] migration/multifd: Convert multifd_send_pages::next_channel to atomic
From: "Maciej S. Szmigiero" This is necessary for multifd_send_pages() to be able to be called from multiple threads. Signed-off-by: Maciej S. Szmigiero --- migration/multifd.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 6e0af84bb9a1..daa34172bf24 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -614,26 +614,38 @@ static bool multifd_send_pages(void) return false; } -/* We wait here, until at least one channel is ready */ -qemu_sem_wait(&multifd_send_state->channels_ready); - /* * next_channel can remain from a previous migration that was * using more channels, so ensure it doesn't overflow if the * limit is lower now. */ -next_channel %= migrate_multifd_channels(); -for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) { +i = qatomic_load_acquire(&next_channel); +if (unlikely(i >= migrate_multifd_channels())) { +qatomic_cmpxchg(&next_channel, i, 0); +} + +/* We wait here, until at least one channel is ready */ +qemu_sem_wait(&multifd_send_state->channels_ready); + +while (true) { +int i_next; + if (multifd_send_should_exit()) { return false; } + +i = qatomic_load_acquire(&next_channel); +i_next = (i + 1) % migrate_multifd_channels(); +if (qatomic_cmpxchg(&next_channel, i, i_next) != i) { +continue; +} + p = &multifd_send_state->params[i]; /* * Lockless read to p->pending_job is safe, because only multifd * sender thread can clear it. */ if (qatomic_read(&p->pending_job) == false) { -next_channel = (i + 1) % migrate_multifd_channels(); break; } }
[PATCH v1 06/13] migration: Add load_finish handler and associated functions
From: "Maciej S. Szmigiero" load_finish SaveVMHandler allows migration code to poll whether a device-specific asynchronous device state loading operation had finished. In order to avoid calling this handler needlessly the device is supposed to notify the migration code of its possible readiness via a call to qemu_loadvm_load_finish_ready_broadcast() while holding qemu_loadvm_load_finish_ready_lock. Signed-off-by: Maciej S. Szmigiero --- include/migration/register.h | 21 +++ migration/migration.c| 6 + migration/migration.h| 3 +++ migration/savevm.c | 52 migration/savevm.h | 4 +++ 5 files changed, 86 insertions(+) diff --git a/include/migration/register.h b/include/migration/register.h index ce7641c90cea..7c20a9fb86ff 100644 --- a/include/migration/register.h +++ b/include/migration/register.h @@ -276,6 +276,27 @@ typedef struct SaveVMHandlers { int (*load_state_buffer)(void *opaque, char *data, size_t data_size, Error **errp); +/** + * @load_finish + * + * Poll whether all asynchronous device state loading had finished. + * Not called on the load failure path. + * + * Called while holding the qemu_loadvm_load_finish_ready_lock. + * + * If this method signals "not ready" then it might not be called + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked + * while holding qemu_loadvm_load_finish_ready_lock. + * + * @opaque: data pointer passed to register_savevm_live() + * @is_finished: whether the loading had finished (output parameter) + * @errp: pointer to Error*, to store an error if it happens. + * + * Returns zero to indicate success and negative for error + * It's not an error that the loading still hasn't finished. 
+ */ +int (*load_finish)(void *opaque, bool *is_finished, Error **errp); + /** * @load_setup * diff --git a/migration/migration.c b/migration/migration.c index e1b269624c01..ff149e00132f 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -236,6 +236,9 @@ void migration_object_init(void) current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR; +qemu_mutex_init(&current_incoming->load_finish_ready_mutex); +qemu_cond_init(&current_incoming->load_finish_ready_cond); + migration_object_check(current_migration, &error_fatal); ram_mig_init(); @@ -387,6 +390,9 @@ void migration_incoming_state_destroy(void) mis->postcopy_qemufile_dst = NULL; } +qemu_mutex_destroy(&mis->load_finish_ready_mutex); +qemu_cond_destroy(&mis->load_finish_ready_cond); + yank_unregister_instance(MIGRATION_YANK_INSTANCE); } diff --git a/migration/migration.h b/migration/migration.h index 6af01362d424..0f2716ac42c6 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -230,6 +230,9 @@ struct MigrationIncomingState { /* Do exit on incoming migration failure */ bool exit_on_error; + +QemuCond load_finish_ready_cond; +QemuMutex load_finish_ready_mutex; }; MigrationIncomingState *migration_incoming_get_current(void); diff --git a/migration/savevm.c b/migration/savevm.c index 2e538cb02936..46cfb73eae79 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -3020,6 +3020,37 @@ int qemu_loadvm_state(QEMUFile *f) return ret; } +qemu_loadvm_load_finish_ready_lock(); +while (!ret) { /* Don't call load_finish() handlers on the load failure path */ +bool all_ready = true; +SaveStateEntry *se = NULL; + +QTAILQ_FOREACH(se, &savevm_state.handlers, entry) { +bool this_ready; + +if (!se->ops || !se->ops->load_finish) { +continue; +} + +ret = se->ops->load_finish(se->opaque, &this_ready, &local_err); +if (ret) { +error_report_err(local_err); + +qemu_loadvm_load_finish_ready_unlock(); +return -EINVAL; +} else if (!this_ready) { +all_ready = false; +} +} + +if (all_ready) { +break; +} + +qemu_cond_wait(&mis->load_finish_ready_cond, 
&mis->load_finish_ready_mutex); +} +qemu_loadvm_load_finish_ready_unlock(); + if (ret == 0) { ret = qemu_file_get_error(f); } @@ -3124,6 +3155,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id, return 0; } +void qemu_loadvm_load_finish_ready_lock(void) +{ +MigrationIncomingState *mis = migration_incoming_get_current(); + +qemu_mutex_lock(&mis->load_finish_ready_mutex); +} + +void qemu_loadvm_load_finish_ready_unlock(void) +{ +MigrationIncomingState *mis = migration_incoming_get_current(); + +qemu_mutex_unlock(&mis->load_finish_ready_mutex); +} + +void qemu_loadvm_load_finish_ready_broadcast(void)
[PATCH v1 05/13] migration: Add qemu_loadvm_load_state_buffer() and its handler
From: "Maciej S. Szmigiero" qemu_loadvm_load_state_buffer() and its load_state_buffer SaveVMHandler allow providing device state buffer to explicitly specified device via its idstr and instance id. Signed-off-by: Maciej S. Szmigiero --- include/migration/register.h | 15 +++ migration/savevm.c | 25 + migration/savevm.h | 3 +++ 3 files changed, 43 insertions(+) diff --git a/include/migration/register.h b/include/migration/register.h index f7b3df71..ce7641c90cea 100644 --- a/include/migration/register.h +++ b/include/migration/register.h @@ -261,6 +261,21 @@ typedef struct SaveVMHandlers { */ int (*load_state)(QEMUFile *f, void *opaque, int version_id); +/** + * @load_state_buffer + * + * Load device state buffer provided to qemu_loadvm_load_state_buffer(). + * + * @opaque: data pointer passed to register_savevm_live() + * @data: the data buffer to load + * @data_size: the data length in buffer + * @errp: pointer to Error*, to store an error if it happens. + * + * Returns zero to indicate success and negative for error + */ +int (*load_state_buffer)(void *opaque, char *data, size_t data_size, + Error **errp); + /** * @load_setup * diff --git a/migration/savevm.c b/migration/savevm.c index 56fb1c4c2563..2e538cb02936 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -3099,6 +3099,31 @@ int qemu_loadvm_approve_switchover(void) return migrate_send_rp_switchover_ack(mis); } +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id, + char *buf, size_t len, Error **errp) +{ +SaveStateEntry *se; + +se = find_se(idstr, instance_id); +if (!se) { +error_setg(errp, "Unknown idstr %s or instance id %u for load state buffer", + idstr, instance_id); +return -1; +} + +if (!se->ops || !se->ops->load_state_buffer) { +error_setg(errp, "idstr %s / instance %u has no load state buffer operation", + idstr, instance_id); +return -1; +} + +if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) { +return -1; +} + +return 0; +} + bool save_snapshot(const 
char *name, bool overwrite, const char *vmstate, bool has_devices, strList *devices, Error **errp) { diff --git a/migration/savevm.h b/migration/savevm.h index 9ec96a995c93..d388f1bfca98 100644 --- a/migration/savevm.h +++ b/migration/savevm.h @@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void); int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, bool in_postcopy, bool inactivate_disks); +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id, + char *buf, size_t len, Error **errp); + #endif
[PATCH v1 07/13] migration/multifd: Device state transfer support - receive side
From: "Maciej S. Szmigiero" Add a basic support for receiving device state via multifd channels - channels that are shared with RAM transfers. To differentiate between a device state and a RAM packet the packet header is read first. Depending whether MULTIFD_FLAG_DEVICE_STATE flag is present or not in the packet header either device state (MultiFDPacketDeviceState_t) or RAM data (existing MultiFDPacket_t) is then read. The received device state data is provided to qemu_loadvm_load_state_buffer() function for processing in the device's load_state_buffer handler. Signed-off-by: Maciej S. Szmigiero --- migration/multifd.c | 123 +--- migration/multifd.h | 31 ++- 2 files changed, 134 insertions(+), 20 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index c8a5b363f7d4..6e0af84bb9a1 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -21,6 +21,7 @@ #include "file.h" #include "migration.h" #include "migration-stats.h" +#include "savevm.h" #include "socket.h" #include "tls.h" #include "qemu-file.h" @@ -404,7 +405,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p) uint32_t zero_num = pages->num - pages->normal_num; int i; -packet->flags = cpu_to_be32(p->flags); +packet->hdr.flags = cpu_to_be32(p->flags); packet->pages_alloc = cpu_to_be32(p->pages->allocated); packet->normal_pages = cpu_to_be32(pages->normal_num); packet->zero_pages = cpu_to_be32(zero_num); @@ -432,28 +433,44 @@ void multifd_send_fill_packet(MultiFDSendParams *p) p->flags, p->next_packet_size); } -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) +static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p, MultiFDPacketHdr_t *hdr, + Error **errp) { -MultiFDPacket_t *packet = p->packet; -int i; - -packet->magic = be32_to_cpu(packet->magic); -if (packet->magic != MULTIFD_MAGIC) { +hdr->magic = be32_to_cpu(hdr->magic); +if (hdr->magic != MULTIFD_MAGIC) { error_setg(errp, "multifd: received packet " "magic %x and expected magic %x", - 
packet->magic, MULTIFD_MAGIC); + hdr->magic, MULTIFD_MAGIC); return -1; } -packet->version = be32_to_cpu(packet->version); -if (packet->version != MULTIFD_VERSION) { +hdr->version = be32_to_cpu(hdr->version); +if (hdr->version != MULTIFD_VERSION) { error_setg(errp, "multifd: received packet " "version %u and expected version %u", - packet->version, MULTIFD_VERSION); + hdr->version, MULTIFD_VERSION); return -1; } -p->flags = be32_to_cpu(packet->flags); +p->flags = be32_to_cpu(hdr->flags); + +return 0; +} + +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p, Error **errp) +{ +MultiFDPacketDeviceState_t *packet = p->packet_dev_state; + +packet->instance_id = be32_to_cpu(packet->instance_id); +p->next_packet_size = be32_to_cpu(packet->next_packet_size); + +return 0; +} + +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp) +{ +MultiFDPacket_t *packet = p->packet; +int i; packet->pages_alloc = be32_to_cpu(packet->pages_alloc); /* @@ -485,7 +502,6 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) p->next_packet_size = be32_to_cpu(packet->next_packet_size); p->packet_num = be64_to_cpu(packet->packet_num); -p->packets_recved++; p->total_normal_pages += p->normal_num; p->total_zero_pages += p->zero_num; @@ -533,6 +549,19 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) return 0; } +static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) +{ +p->packets_recved++; + +if (p->flags & MULTIFD_FLAG_DEVICE_STATE) { +return multifd_recv_unfill_packet_device_state(p, errp); +} else { +return multifd_recv_unfill_packet_ram(p, errp); +} + +g_assert_not_reached(); +} + static bool multifd_send_should_exit(void) { return qatomic_read(_send_state->exiting); @@ -1177,8 +1206,8 @@ bool multifd_send_setup(void) p->packet_len = sizeof(MultiFDPacket_t) + sizeof(uint64_t) * page_count; p->packet = g_malloc0(p->packet_len); -p->packet->magic = cpu_to_be32(MULTIFD_MAGIC); 
-p->packet->version = cpu_to_be32(MULTIFD_VERSION); +p->packet->hdr.magic = cpu_to_be32(MULTIFD_MAGIC); +
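The header-first framing this patch introduces can be sketched independently of QEMU: read a fixed-size big-endian header shared by both packet types, validate magic and version, then branch on a device-state flag to decide which "rest of packet" layout follows. The magic, version and flag values below are made up for the demo, not the series' wire values.

```c
#include <arpa/inet.h>  /* htonl/ntohl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Demo constants: real QEMU uses MULTIFD_MAGIC/MULTIFD_VERSION and a
 * MULTIFD_FLAG_DEVICE_STATE bit; these values are illustrative. */
#define DEMO_MAGIC             0x11223344u
#define DEMO_VERSION           1u
#define DEMO_FLAG_DEVICE_STATE 0x200u

/* The shared header both packet types start with, big-endian on the wire. */
typedef struct DemoPacketHdr {
    uint32_t magic;
    uint32_t version;
    uint32_t flags;
} DemoPacketHdr;

/* Returns 1 for a device-state packet, 0 for a RAM packet, -1 for a
 * malformed header -- the same decision the receive thread makes before
 * choosing which packet layout to read next. */
static int demo_classify_packet(const uint8_t *wire, size_t len,
                                uint32_t *flags_out)
{
    DemoPacketHdr hdr;

    if (len < sizeof(hdr)) {
        return -1;
    }
    memcpy(&hdr, wire, sizeof(hdr));
    if (ntohl(hdr.magic) != DEMO_MAGIC || ntohl(hdr.version) != DEMO_VERSION) {
        return -1;
    }
    *flags_out = ntohl(hdr.flags);
    return (*flags_out & DEMO_FLAG_DEVICE_STATE) ? 1 : 0;
}

/* Helper to build a wire header for the demo. */
static void demo_fill_hdr(uint8_t *wire, uint32_t flags)
{
    DemoPacketHdr hdr = { htonl(DEMO_MAGIC), htonl(DEMO_VERSION),
                          htonl(flags) };

    memcpy(wire, &hdr, sizeof(hdr));
}
```

Validating magic/version before looking at the flags matches the order of checks in `multifd_recv_unfill_packet_header()` above.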
[PATCH v1 03/13] migration/multifd: Zero p->flags before starting filling a packet
From: "Maciej S. Szmigiero"

This way there aren't stale flags left over from the previous packet.

p->flags can't contain SYNC to be sent at the next RAM packet since
syncs are now handled separately in multifd_send_thread.

Signed-off-by: Maciej S. Szmigiero
---
 migration/multifd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index f317bff07746..c8a5b363f7d4 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -933,6 +933,7 @@ static void *multifd_send_thread(void *opaque)
         if (qatomic_load_acquire(&p->pending_job)) {
             MultiFDPages_t *pages = p->pages;
+            p->flags = 0;
             p->iovs_num = 0;
             assert(pages->num);
@@ -986,7 +987,6 @@
                 }
                 /* p->next_packet_size will always be zero for a SYNC packet */
                 stat64_add(&mig_stats.multifd_bytes, p->packet_len);
-                p->flags = 0;
             }
             qatomic_set(&p->pending_sync, false);
[PATCH v1 01/13] vfio/migration: Add save_{iterate, complete_precopy}_started trace events
From: "Maciej S. Szmigiero"

This way both the start and end points of migrating a particular VFIO
device are known.

Also add a vfio_save_iterate_empty_hit trace event so it is known when
there's no more data to send for that device.

Signed-off-by: Maciej S. Szmigiero
---
 hw/vfio/migration.c           | 13 +++++++++++++
 hw/vfio/trace-events          |  3 +++
 include/hw/vfio/vfio-common.h |  3 +++
 3 files changed, 19 insertions(+)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 34d4be2ce1b1..93f767e3c2dd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -472,6 +472,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
         return -ENOMEM;
     }
 
+    migration->save_iterate_run = false;
+    migration->save_iterate_empty_hit = false;
+
     if (vfio_precopy_supported(vbasedev)) {
         switch (migration->device_state) {
         case VFIO_DEVICE_STATE_RUNNING:
@@ -605,9 +608,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
     VFIOMigration *migration = vbasedev->migration;
     ssize_t data_size;
 
+    if (!migration->save_iterate_run) {
+        trace_vfio_save_iterate_started(vbasedev->name);
+        migration->save_iterate_run = true;
+    }
+
     data_size = vfio_save_block(f, migration);
     if (data_size < 0) {
         return data_size;
+    } else if (data_size == 0 && !migration->save_iterate_empty_hit) {
+        trace_vfio_save_iterate_empty_hit(vbasedev->name);
+        migration->save_iterate_empty_hit = true;
     }
 
     vfio_update_estimated_pending_data(migration, data_size);
@@ -633,6 +644,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     int ret;
     Error *local_err = NULL;
 
+    trace_vfio_save_complete_precopy_started(vbasedev->name);
+
     /* We reach here with device state STOP or STOP_COPY only */
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    VFIO_DEVICE_STATE_STOP, &local_err);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 64161bf6f44c..814000796687 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -158,8 +158,11 @@ vfio_migration_state_notifier(const char *name, int state) " (%s) state %d"
 vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
+vfio_save_complete_precopy_started(const char *name) " (%s)"
 vfio_save_device_config_state(const char *name) " (%s)"
 vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_save_iterate_started(const char *name) " (%s)"
+vfio_save_iterate_empty_hit(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
 vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 4cb1ab8645dc..510818f4dae3 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -71,6 +71,9 @@ typedef struct VFIOMigration {
     uint64_t precopy_init_size;
     uint64_t precopy_dirty_size;
     bool initial_data_sent;
+
+    bool save_iterate_run;
+    bool save_iterate_empty_hit;
 } VFIOMigration;
 
 struct VFIOGroup;
[PATCH v1 13/13] vfio/migration: Multifd device state transfer support - send side
From: "Maciej S. Szmigiero" Implement the multifd device state transfer via additional per-device thread spawned from save_live_complete_precopy_begin handler. Switch between doing the data transfer in the new handler and doing it in the old save_state handler depending on the x-migration-multifd-transfer device property value. Signed-off-by: Maciej S. Szmigiero --- hw/vfio/migration.c | 207 ++ hw/vfio/trace-events | 3 + include/hw/vfio/vfio-common.h | 9 ++ 3 files changed, 219 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 719e36800ab5..28a835f8a945 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -643,6 +643,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp) uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE; int ret; +/* Make a copy of this setting at the start in case it is changed mid-migration */ +migration->multifd_transfer = vbasedev->migration_multifd_transfer; + +if (migration->multifd_transfer && !migration_has_device_state_support()) { +error_setg(errp, + "%s: Multifd device transfer requested but unsupported in the current config", + vbasedev->name); +return -EINVAL; +} + qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE); vfio_query_stop_copy_size(vbasedev, _copy_size); @@ -692,6 +702,8 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp) return ret; } +static void vfio_save_complete_precopy_async_thread_thread_terminate(VFIODevice *vbasedev); + static void vfio_save_cleanup(void *opaque) { VFIODevice *vbasedev = opaque; @@ -699,6 +711,8 @@ static void vfio_save_cleanup(void *opaque) Error *local_err = NULL; int ret; +vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev); + /* * Changing device state from STOP_COPY to STOP can take time. Do it here, * after migration has completed, so it won't increase downtime. 
@@ -712,6 +726,7 @@ static void vfio_save_cleanup(void *opaque) } } +g_clear_pointer(>idstr, g_free); g_free(migration->data_buffer); migration->data_buffer = NULL; migration->precopy_init_size = 0; @@ -823,10 +838,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque) static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; ssize_t data_size; int ret; Error *local_err = NULL; +if (migration->multifd_transfer) { +/* Emit dummy NOP data */ +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); +return 0; +} + trace_vfio_save_complete_precopy_started(vbasedev->name); /* We reach here with device state STOP or STOP_COPY only */ @@ -852,12 +874,188 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } +static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev, uint32_t idx) +{ +VFIOMigration *migration = vbasedev->migration; +g_autoptr(QIOChannelBuffer) bioc = NULL; +QEMUFile *f = NULL; +int ret; +g_autofree VFIODeviceStatePacket *packet = NULL; +size_t packet_len; + +bioc = qio_channel_buffer_new(0); +qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save"); + +f = qemu_file_new_output(QIO_CHANNEL(bioc)); + +ret = vfio_save_device_config_state(f, vbasedev, NULL); +if (ret) { +return ret; +} + +ret = qemu_fflush(f); +if (ret) { +goto ret_close_file; +} + +packet_len = sizeof(*packet) + bioc->usage; +packet = g_malloc0(packet_len); +packet->idx = idx; +packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE; +memcpy(>data, bioc->data, bioc->usage); + +ret = multifd_queue_device_state(migration->idstr, migration->instance_id, + (char *)packet, packet_len); + +bytes_transferred += packet_len; + +ret_close_file: +g_clear_pointer(, qemu_fclose); +return ret; +} + +static void *vfio_save_complete_precopy_async_thread(void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int *ret = 
>save_complete_precopy_thread_ret; +g_autofree VFIODeviceStatePacket *packet = NULL; +uint32_t idx; + +/* We reach here with device state STOP or STOP_COPY only */ +*ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY, +VFIO_DEVICE_STATE_STOP, NULL); +if (*ret) { +return NULL; +} + +packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size); + +for (idx = 0; ; idx++) { +ssize_t data_size; +size_t packet_size; + +data_size = rea
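The send-side loop sketched in this thread — chunk the device state into packets carrying an incrementing `idx`, then finish with a config-state packet — can be modeled compactly. The packet layout mirrors the patch's `VFIODeviceStatePacket`, but every concrete value here (field sizes, the config flag, the emit callback) is an assumption made for the demo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative packet, shaped like the patch's VFIODeviceStatePacket. */
typedef struct DemoStatePacket {
    uint32_t version;
    uint32_t idx;
    uint32_t flags;      /* DEMO_FLAG_CONFIG marks the final packet */
    uint8_t data[];
} DemoStatePacket;

#define DEMO_FLAG_CONFIG 1u

typedef void (*demo_emit_fn)(const DemoStatePacket *p, size_t plen, void *ctx);

/* Split `len` bytes of device state into packets of at most `chunk`
 * payload bytes, calling `emit` for each, then finish with one
 * config-state packet.  Returns the number of packets emitted. */
static uint32_t demo_send_device_state(const uint8_t *state, size_t len,
                                       size_t chunk,
                                       const uint8_t *config, size_t config_len,
                                       demo_emit_fn emit, void *ctx)
{
    uint32_t idx = 0;

    for (size_t off = 0; off < len; off += chunk, idx++) {
        size_t n = len - off < chunk ? len - off : chunk;
        DemoStatePacket *p = malloc(sizeof(*p) + n);

        p->version = 0;
        p->idx = idx;
        p->flags = 0;
        memcpy(p->data, state + off, n);
        emit(p, sizeof(*p) + n, ctx);
        free(p);
    }

    /* The config state rides in the last packet, flagged as such. */
    DemoStatePacket *p = malloc(sizeof(*p) + config_len);
    p->version = 0;
    p->idx = idx;
    p->flags = DEMO_FLAG_CONFIG;
    memcpy(p->data, config, config_len);
    emit(p, sizeof(*p) + config_len, ctx);
    free(p);

    return idx + 1;
}

/* Demo sink: counts packets and remembers the last idx/flags. */
static uint32_t demo_count, demo_last_idx, demo_last_flags;
static void demo_emit(const DemoStatePacket *p, size_t plen, void *ctx)
{
    (void)plen;
    (void)ctx;
    demo_count++;
    demo_last_idx = p->idx;
    demo_last_flags = p->flags;
}
```

In the series the `emit` step corresponds to `multifd_queue_device_state()`, and chunk reads come from the VFIO migration data fd rather than from a memory buffer.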
[PATCH v1 12/13] vfio/migration: Add x-migration-multifd-transfer VFIO property
From: "Maciej S. Szmigiero"

This property allows configuring at runtime whether to send the
particular device state via multifd channels when live migrating that
device.

It is ignored on the receive side and defaults to "false" for bit
stream compatibility with older QEMU versions.

Signed-off-by: Maciej S. Szmigiero
---
 hw/vfio/pci.c                 | 7 +++++++
 include/hw/vfio/vfio-common.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 74a79bdf61f9..e2ac1db96002 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3346,6 +3346,8 @@ static void vfio_instance_init(Object *obj)
     pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
 }
 
+static PropertyInfo qdev_prop_bool_mutable;
+
 static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
     DEFINE_PROP_UUID_NODEFAULT("vf-token", VFIOPCIDevice, vf_token),
@@ -3367,6 +3369,8 @@ static Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_ON_OFF_AUTO("enable-migration", VFIOPCIDevice,
                             vbasedev.enable_migration, ON_OFF_AUTO_AUTO),
+    DEFINE_PROP("x-migration-multifd-transfer", VFIOPCIDevice,
+                vbasedev.migration_multifd_transfer, qdev_prop_bool_mutable, bool),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
                      vbasedev.migration_events, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
@@ -3464,6 +3468,9 @@ static const TypeInfo vfio_pci_nohotplug_dev_info = {
 
 static void register_vfio_pci_dev_type(void)
 {
+    qdev_prop_bool_mutable = qdev_prop_bool;
+    qdev_prop_bool_mutable.realized_set_allowed = true;
+
     type_register_static(&vfio_pci_dev_info);
     type_register_static(&vfio_pci_nohotplug_dev_info);
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index aa8476a859a6..bc85891d8fff 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -132,6 +132,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     OnOffAuto enable_migration;
+    bool migration_multifd_transfer;
     bool migration_events;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
[PATCH v1 11/13] vfio/migration: Multifd device state transfer support - receive side
From: "Maciej S. Szmigiero" The multifd received data needs to be reassembled since device state packets sent via different multifd channels can arrive out-of-order. Therefore, each VFIO device state packet carries a header indicating its position in the stream. The last such VFIO device state packet should have VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state. Since it's important to finish loading device state transferred via the main migration channel (via save_live_iterate handler) before starting loading the data asynchronously transferred via multifd a new VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to mark the end of the main migration channel data. The device state loading process waits until that flag is seen before commencing loading of the multifd-transferred device state. Signed-off-by: Maciej S. Szmigiero --- hw/vfio/migration.c | 325 +- hw/vfio/trace-events | 9 +- include/hw/vfio/vfio-common.h | 14 ++ 3 files changed, 344 insertions(+), 4 deletions(-) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 93f767e3c2dd..719e36800ab5 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -15,6 +15,7 @@ #include #include +#include "io/channel-buffer.h" #include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "migration/misc.h" @@ -47,6 +48,7 @@ #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xef15ULL) +#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE(0xef16ULL) /* * This is an arbitrary size based on migration of mlx5 devices, where typically @@ -55,6 +57,15 @@ */ #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB) +#define VFIO_DEVICE_STATE_CONFIG_STATE (1) + +typedef struct VFIODeviceStatePacket { +uint32_t version; +uint32_t idx; +uint32_t flags; +uint8_t data[0]; +} QEMU_PACKED VFIODeviceStatePacket; + static int64_t bytes_transferred; static const char *mig_state_to_str(enum vfio_device_mig_state 
state) @@ -254,6 +265,176 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev, return ret; } +typedef struct LoadedBuffer { +bool is_present; +char *data; +size_t len; +} LoadedBuffer; + +static void loaded_buffer_clear(gpointer data) +{ +LoadedBuffer *lb = data; + +if (!lb->is_present) { +return; +} + +g_clear_pointer(>data, g_free); +lb->is_present = false; +} + +static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size, + Error **errp) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data; +QEMU_LOCK_GUARD(>load_bufs_mutex); +LoadedBuffer *lb; + +if (data_size < sizeof(*packet)) { +error_setg(errp, "packet too short at %zu (min is %zu)", + data_size, sizeof(*packet)); +return -1; +} + +if (packet->version != 0) { +error_setg(errp, "packet has unknown version %" PRIu32, + packet->version); +return -1; +} + +if (packet->idx == UINT32_MAX) { +error_setg(errp, "packet has too high idx %" PRIu32, + packet->idx); +return -1; +} + +trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx); + +/* config state packet should be the last one in the stream */ +if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) { +migration->load_buf_idx_last = packet->idx; +} + +assert(migration->load_bufs); +if (packet->idx >= migration->load_bufs->len) { +g_array_set_size(migration->load_bufs, packet->idx + 1); +} + +lb = _array_index(migration->load_bufs, typeof(*lb), packet->idx); +if (lb->is_present) { +error_setg(errp, "state buffer %" PRIu32 " already filled", packet->idx); +return -1; +} + +assert(packet->idx >= migration->load_buf_idx); + +lb->data = g_memdup2(>data, data_size - sizeof(*packet)); +lb->len = data_size - sizeof(*packet); +lb->is_present = true; + +qemu_cond_broadcast(>load_bufs_buffer_ready_cond); + +return 0; +} + +static void *vfio_load_bufs_thread(void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = 
vbasedev->migration; +Error **errp = >load_bufs_thread_errp; +g_autoptr(QemuLockable) locker = qemu_lockable_auto_lock( +QEMU_MAKE_LOCKABLE(>load_bufs_mutex)); +LoadedBuffer *lb; + +while (!migration->load_bufs_device_ready && +
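Setting the thread synchronization aside, the core of the receive-side reassembly is: mark slot `idx` as present, then let the consumer drain the contiguous in-order prefix. A single-threaded miniature of that invariant (names hypothetical — the real code stores the buffers in a GArray and drains them from a dedicated thread under `load_bufs_mutex`/`load_bufs_buffer_ready_cond`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_BUFS 16

/* Out-of-order reassembly: packets may arrive on any channel in any
 * order, but the device must consume them strictly by idx. */
typedef struct DemoReassembly {
    bool present[DEMO_MAX_BUFS];
    uint32_t next;       /* next idx the consumer expects */
    uint32_t consumed;   /* buffers already handed to the device */
} DemoReassembly;

/* Accept packet `idx`; reject duplicates, out-of-range indices, and
 * indices below the consumed watermark (mirroring the patch's
 * "state buffer already filled" / idx assertions). */
static int demo_receive(DemoReassembly *r, uint32_t idx)
{
    if (idx >= DEMO_MAX_BUFS || idx < r->next || r->present[idx]) {
        return -1;
    }
    r->present[idx] = true;

    /* Drain whatever in-order prefix is now complete. */
    while (r->next < DEMO_MAX_BUFS && r->present[r->next]) {
        r->next++;
        r->consumed++;
    }
    return 0;
}
```

In the series the "drain" half runs in `vfio_load_bufs_thread`, woken by `qemu_cond_broadcast()` whenever a new buffer lands, and the stream end is known once the packet flagged `VFIO_DEVICE_STATE_CONFIG_STATE` supplies `load_buf_idx_last`.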
[PATCH v1 09/13] migration/multifd: Device state transfer support - send side
From: "Maciej S. Szmigiero" A new function multifd_queue_device_state() is provided for device to queue its state for transmission via a multifd channel. Signed-off-by: Maciej S. Szmigiero --- include/migration/misc.h | 4 + migration/multifd-zlib.c | 2 +- migration/multifd-zstd.c | 2 +- migration/multifd.c | 181 +-- migration/multifd.h | 26 -- 5 files changed, 182 insertions(+), 33 deletions(-) diff --git a/include/migration/misc.h b/include/migration/misc.h index bfadc5613bac..abf6f33eeae8 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -111,4 +111,8 @@ bool migration_in_bg_snapshot(void); /* migration/block-dirty-bitmap.c */ void dirty_bitmap_mig_init(void); +/* migration/multifd.c */ +int multifd_queue_device_state(char *idstr, uint32_t instance_id, + char *data, size_t len); + #endif diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c index 737a9645d2fe..424547aa5be0 100644 --- a/migration/multifd-zlib.c +++ b/migration/multifd-zlib.c @@ -177,7 +177,7 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp) out: p->flags |= MULTIFD_FLAG_ZLIB; -multifd_send_fill_packet(p); +multifd_send_fill_packet_ram(p); return 0; } diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c index 256858df0a0a..89ef21898485 100644 --- a/migration/multifd-zstd.c +++ b/migration/multifd-zstd.c @@ -166,7 +166,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp) out: p->flags |= MULTIFD_FLAG_ZSTD; -multifd_send_fill_packet(p); +multifd_send_fill_packet_ram(p); return 0; } diff --git a/migration/multifd.c b/migration/multifd.c index daa34172bf24..6a7e5d659925 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -12,6 +12,7 @@ #include "qemu/osdep.h" #include "qemu/cutils.h" +#include "qemu/iov.h" #include "qemu/rcu.h" #include "exec/target_page.h" #include "sysemu/sysemu.h" @@ -19,6 +20,7 @@ #include "qemu/error-report.h" #include "qapi/error.h" #include "file.h" +#include "migration/misc.h" #include 
"migration.h" #include "migration-stats.h" #include "savevm.h" @@ -49,9 +51,12 @@ typedef struct { } __attribute__((packed)) MultiFDInit_t; struct { +QemuMutex queue_job_mutex; + MultiFDSendParams *params; -/* array of pages to sent */ +/* array of pages or device state to be sent */ MultiFDPages_t *pages; +MultiFDDeviceState_t *device_state; /* * Global number of generated multifd packets. * @@ -168,7 +173,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p) } /** - * nocomp_send_prepare: prepare date to be able to send + * nocomp_send_prepare_ram: prepare RAM data for sending * * For no compression we just have to calculate the size of the * packet. @@ -178,7 +183,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p) * @p: Params for the channel that we are using * @errp: pointer to an error */ -static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp) +static int nocomp_send_prepare_ram(MultiFDSendParams *p, Error **errp) { bool use_zero_copy_send = migrate_zero_copy_send(); int ret; @@ -197,13 +202,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp) * Only !zerocopy needs the header in IOV; zerocopy will * send it separately. 
*/ -multifd_send_prepare_header(p); +multifd_send_prepare_header_ram(p); } multifd_send_prepare_iovs(p); p->flags |= MULTIFD_FLAG_NOCOMP; -multifd_send_fill_packet(p); +multifd_send_fill_packet_ram(p); if (use_zero_copy_send) { /* Send header first, without zerocopy */ @@ -217,6 +222,56 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp) return 0; } +static void multifd_send_fill_packet_device_state(MultiFDSendParams *p) +{ +MultiFDPacketDeviceState_t *packet = p->packet_device_state; + +packet->hdr.flags = cpu_to_be32(p->flags); +strncpy(packet->idstr, p->device_state->idstr, sizeof(packet->idstr)); +packet->instance_id = cpu_to_be32(p->device_state->instance_id); +packet->next_packet_size = cpu_to_be32(p->next_packet_size); +} + +/** + * nocomp_send_prepare_device_state: prepare device state data for sending + * + * Returns 0 for success or -1 for error + * + * @p: Params for the channel that we are using + * @errp: pointer to an error + */ +static int nocomp_send_prepare_device_state(MultiFDSendParams *p, +Error **errp) +{ +multifd_send_prepare_header_device_state(p); + +assert(!(p->flags & MULTIFD_FLAG_SYNC)); + +p->next_packet_size = p->dev
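Because device-state submissions can come from several per-device threads while RAM submissions come from the migration thread, handing a job to a channel has to be serialized, which is what the `queue_job_mutex` added above is for. A stripped-down model of such a queue-one-job hand-off (this is not QEMU's actual `MultiFDSendParams` machinery; the busy/retry convention is an assumption of the demo):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* One channel slot: holds at most one pending device-state job. */
typedef struct DemoChannel {
    pthread_mutex_t lock;
    char idstr[64];
    const char *data;
    size_t len;
    int pending;   /* 1 while a job is queued and unclaimed */
} DemoChannel;

/* Producer side: queue a job unless the channel is busy. */
static int demo_queue_device_state(DemoChannel *c, const char *idstr,
                                   const char *data, size_t len)
{
    pthread_mutex_lock(&c->lock);
    if (c->pending) {
        pthread_mutex_unlock(&c->lock);
        return -1;   /* busy: caller picks another channel or retries */
    }
    strncpy(c->idstr, idstr, sizeof(c->idstr) - 1);
    c->idstr[sizeof(c->idstr) - 1] = '\0';
    c->data = data;
    c->len = len;
    c->pending = 1;
    pthread_mutex_unlock(&c->lock);
    return 0;
}

/* Channel-thread side: claim the job, if any. Returns 1 if taken. */
static int demo_channel_take(DemoChannel *c, const char **data, size_t *len)
{
    pthread_mutex_lock(&c->lock);
    int had = c->pending;

    if (had) {
        *data = c->data;
        *len = c->len;
        c->pending = 0;
    }
    pthread_mutex_unlock(&c->lock);
    return had;
}
```

The real series pairs this with a condition variable (or the existing channel semaphores) so neither side has to spin.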
[PATCH v1 04/13] migration: Add save_live_complete_precopy_{begin, end} handlers
From: "Maciej S. Szmigiero" These SaveVMHandlers allow device to provide its own asynchronous transmission of the remaining data at the end of a precopy phase. In this use case the save_live_complete_precopy_begin handler is supposed to start such transmission (for example, by launching appropriate threads) while the save_live_complete_precopy_end handler is supposed to wait until such transfer has finished (for example, until the sending threads have exited). Signed-off-by: Maciej S. Szmigiero --- include/migration/register.h | 34 ++ migration/savevm.c | 35 +++ 2 files changed, 69 insertions(+) diff --git a/include/migration/register.h b/include/migration/register.h index f60e797894e5..f7b3df71 100644 --- a/include/migration/register.h +++ b/include/migration/register.h @@ -103,6 +103,40 @@ typedef struct SaveVMHandlers { */ int (*save_live_complete_precopy)(QEMUFile *f, void *opaque); +/** + * @save_live_complete_precopy_begin + * + * Called at the end of a precopy phase, before all @save_live_complete_precopy + * handlers. The handler might, for example, arrange for device-specific + * asynchronous transmission of the remaining data. When postcopy is enabled, + * devices that support postcopy will skip this step. + * + * @f: QEMUFile where the handler can synchronously send data before returning + * @idstr: this device section idstr + * @instance_id: this device section instance_id + * @opaque: data pointer passed to register_savevm_live() + * + * Returns zero to indicate success and negative for error + */ +int (*save_live_complete_precopy_begin)(QEMUFile *f, +char *idstr, uint32_t instance_id, +void *opaque); +/** + * @save_live_complete_precopy_end + * + * Called at the end of a precopy phase, after all @save_live_complete_precopy + * handlers. The handler might, for example, wait for the asynchronous + * transmission started by the @save_live_complete_precopy_begin handler + * to complete. 
When postcopy is enabled, devices that support postcopy will + * skip this step. + * + * @f: QEMUFile where the handler can synchronously send data before returning + * @opaque: data pointer passed to register_savevm_live() + * + * Returns zero to indicate success and negative for error + */ +int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque); + /* This runs both outside and inside the BQL. */ /** diff --git a/migration/savevm.c b/migration/savevm.c index c621f2359ba3..56fb1c4c2563 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -1494,6 +1494,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy) SaveStateEntry *se; int ret; +QTAILQ_FOREACH(se, _state.handlers, entry) { +if (!se->ops || (in_postcopy && se->ops->has_postcopy && + se->ops->has_postcopy(se->opaque)) || +!se->ops->save_live_complete_precopy_begin) { +continue; +} + +save_section_header(f, se, QEMU_VM_SECTION_END); + +ret = se->ops->save_live_complete_precopy_begin(f, +se->idstr, se->instance_id, +se->opaque); + +save_section_footer(f, se); + +if (ret < 0) { +qemu_file_set_error(f, ret); +return -1; +} +} + QTAILQ_FOREACH(se, _state.handlers, entry) { if (!se->ops || (in_postcopy && se->ops->has_postcopy && @@ -1525,6 +1546,20 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy) end_ts_each - start_ts_each); } +QTAILQ_FOREACH(se, _state.handlers, entry) { +if (!se->ops || (in_postcopy && se->ops->has_postcopy && + se->ops->has_postcopy(se->opaque)) || +!se->ops->save_live_complete_precopy_end) { +continue; +} + +ret = se->ops->save_live_complete_precopy_end(f, se->opaque); +if (ret < 0) { +qemu_file_set_error(f, ret); +return -1; +} +} + trace_vmstate_downtime_checkpoint("src-iterable-saved"); return 0;
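The begin/end split lends itself to a thread-per-device pattern: the begin handler launches an asynchronous sender, the end handler joins it and collects the result, so multiple devices can transmit concurrently in between. A toy model with plain pthreads — the function signatures here are illustrative, not the SaveVMHandlers ones:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Per-device async save context (hypothetical names). */
typedef struct DemoAsyncSave {
    pthread_t thread;
    atomic_int done;
    int result;
} DemoAsyncSave;

static void *demo_send_thread(void *opaque)
{
    DemoAsyncSave *s = opaque;

    /* ... transmit the remaining device state here ... */
    s->result = 0;
    atomic_store(&s->done, 1);
    return NULL;
}

/* Analogue of save_live_complete_precopy_begin: start the transfer. */
static int demo_complete_precopy_begin(DemoAsyncSave *s)
{
    atomic_store(&s->done, 0);
    return pthread_create(&s->thread, NULL, demo_send_thread, s);
}

/* Analogue of save_live_complete_precopy_end: wait and fetch result. */
static int demo_complete_precopy_end(DemoAsyncSave *s)
{
    pthread_join(s->thread, NULL);
    return s->result;
}
```

The key property is that `begin` returns immediately for every device before any `end` is called, which is exactly why the two hooks are iterated in separate loops in `qemu_savevm_state_complete_precopy_iterable()` above.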
[PATCH v1 10/13] migration/multifd: Add migration_has_device_state_support()
From: "Maciej S. Szmigiero"

Since device state transfer via multifd channels requires multifd
channels with packets and is currently not compatible with multifd
compression, add an appropriate query function so a device can learn
whether it can actually make use of it.

Signed-off-by: Maciej S. Szmigiero
---
 include/migration/misc.h | 1 +
 migration/multifd.c      | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index abf6f33eeae8..4f3de2f23819 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -112,6 +112,7 @@ bool migration_in_bg_snapshot(void);
 void dirty_bitmap_mig_init(void);
 
 /* migration/multifd.c */
+bool migration_has_device_state_support(void);
 int multifd_queue_device_state(char *idstr, uint32_t instance_id,
                                char *data, size_t len);
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 6a7e5d659925..e5f7021465ec 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -804,6 +804,12 @@ retry:
     return true;
 }
 
+bool migration_has_device_state_support(void)
+{
+    return migrate_multifd() && !migrate_mapped_ram() &&
+           migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE;
+}
+
 int multifd_queue_device_state(char *idstr, uint32_t instance_id,
                                char *data, size_t len)
 {
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 29.04.2024 17:09, Peter Xu wrote:
On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
On 24.04.2024 00:35, Peter Xu wrote:
On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
On 24.04.2024 00:20, Peter Xu wrote:
On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
On 19.04.2024 17:31, Peter Xu wrote:
On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:

I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.

Firstly, I'm wondering whether we can have better names for these new
hooks. Currently (only commenting on the async* stuff):

  - complete_precopy_async
  - complete_precopy
  - complete_precopy_async_wait

But perhaps better:

  - complete_precopy_begin
  - complete_precopy
  - complete_precopy_end

? As I don't see why the device must do something with async in such a
hook. To me it's more like you're splitting one process into multiple,
so begin/end sounds more generic.

Then, with that in mind, IIUC we can already split ram_save_complete()
into more than one phase too. For example, I would be curious whether
the performance would go back to normal if we offloaded
multifd_send_sync_main() into complete_precopy_end(), because we really
only need one shot of that, and I am quite surprised it already greatly
affects VFIO dumping its own things.

I would even ask one step further, as what Dan was asking: have you
thought about dumping VFIO states via multifd even during iterations?
Would that help even more than this series (which IIUC only helps during
the blackout phase)?

To dump during RAM iteration, the VFIO device would need to have dirty
tracking and iterate on its state, because the guest CPUs will still be
running and potentially changing VFIO state. That seems impractical in
the general case.

We already do such iterations in vfio_save_iterate()? My understanding
is that the recent VFIO work is based on the fact that the VFIO device
can track device state changes more or less (besides being able to
save/load full states). E.g. I still remember that in our QE tests some
old devices reported many more dirty pages than expected during the
iterations, back when we were looking into an issue where a huge amount
of dirty pages was reported. But newer models seem to have fixed that
and report much less. That issue was about GPUs not NICs, though, and
IIUC a major portion of such tracking used to be for GPU vRAMs. So maybe
I was mixing these up, and maybe they work differently.

The device which this series was developed against (Mellanox ConnectX-7)
is already transferring its live state before the VM gets stopped (via
the save_live_iterate SaveVMHandler). It's just that in addition to the
live state it has more than 400 MiB of state that cannot be transferred
while the VM is still running. And that fact hurts a lot with respect to
the migration downtime. AFAIK it's a very similar story for (some) GPUs.

So during the iteration phase VFIO cannot yet leverage the multifd
channels with this series, am I right?

That's right.

Is it possible to extend that use case too?

I guess so, but since this phase (iteration while the VM is still
running) doesn't impact downtime it is much less critical.

But it affects the bandwidth, e.g. even with multifd enabled, the device
iteration data will still bottleneck at ~15 Gbps on a common system
setup in the best case, even if the hosts are 100 Gbps direct-connected.
Would that be a concern in the future too, or is it a known problem that
won't be fixed anyway?

I think any improvements to the migration performance are good, even if
they don't impact downtime. It's just that this patch set focuses on the
downtime phase as the more critical thing. After this gets improved
there's no reason not to look at improving performance of the VM live
phase too if it brings sensible improvements.

I remember Avihai used to have a plan to look into similar issues; I
hope this is exactly what he is looking for. Otherwise changing the
migration protocol from time to time is cumbersome; we always need to
provide a flag to make sure old systems migrate in the old ways and new
systems run the new ways, and for such a relatively major change I'd
want to double check how far away we are from supporting offload of
VFIO iteration data to multifd.

The device state transfer is indicated by a new flag in the multifd
header (MULTIFD_FLAG_DEVICE_STATE). If we are to use multifd channels
for VM live phase transfers these could simply re-use the same flag
type.

Right, and that's also my major purpose of such a request to consider
both issues. If supporting iterators can be easy on top of this, I am
thinking whether we should do this in one s
Re: [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side
On 29.04.2024 22:04, Peter Xu wrote: On Tue, Apr 16, 2024 at 04:43:02PM +0200, Maciej S. Szmigiero wrote:

+bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
+{
+    g_autoptr(GMutexLocker) locker = NULL;
+
+    /*
+     * Device state submissions for shared channels can come
+     * from multiple threads and conflict with page submissions
+     * with respect to multifd_send_state access.
+     */
+    if (!multifd_send_state->device_state_dedicated_channels) {
+        locker = g_mutex_locker_new(&multifd_send_state->queue_job_mutex);

Haven't read the rest, but suggest to stick with QemuMutex for the whole patchset, as that's what we use in the rest of the migration code, along with QEMU_LOCK_GUARD(). Ack, from a quick scan of QEMU thread sync primitives it seems that QemuMutex with QemuLockable and QemuCond should fulfill the requirements to replace GMutex, GMutexLocker and GCond. Thanks, Maciej
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 24.04.2024 00:27, Peter Xu wrote: On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote: We don't lose any genericity since by default the transfer is done via mixed RAM / device state multifd channels from a shared pool. It's only when x-multifd-channels-device-state is set to a value > 0 that the requested number of multifd channels gets dedicated to device state. It could be seen as a fine-tuning option for cases where tests show that it provides some benefits to the particular workload - just like many other existing migration options are. 14% downtime improvement is too much to waste - I'm not sure that's only due to avoiding RAM syncs, it's possible that there are other subtle performance interactions too. For even more genericity this option could be named like x-multifd-channels-map and contain an array of channel settings like "ram,ram,ram,device-state,device-state". Then possible future other uses of multifd channels wouldn't even need a new dedicated option. Yeah I understand such an option would only provide more options. However, as long as such an option gets introduced, users will start to do their own "optimizations" on how to provision the multifd channels, and IMHO it'll be great if we as developers can be crystal clear on why it needs to be introduced in the first place, rather than making all channels open to all purposes. So I don't think I'm strongly against such a parameter, but I want to double check we really understand what's behind this to justify such a parameter. Meanwhile I'd always be pretty cautious about introducing any migration parameters, due to the compatibility nightmares. The fewer parameters the better. Ack, I am also curious why dedicated device state multifd channels bring such a downtime improvement. I think one of the reasons for these results is that mixed (RAM + device state) multifd channels participate in the RAM sync process (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
Firstly, I'm wondering whether we can have better names for these new hooks. Currently (only comment on the async* stuff):
- complete_precopy_async
- complete_precopy
- complete_precopy_async_wait

But perhaps better:
- complete_precopy_begin
- complete_precopy
- complete_precopy_end

? As I don't see why the device must do something with async in such hook. To me it's more like you're splitting one process into multiple, then begin/end sounds more generic. Ack, I will rename these hooks to begin/end. Then, if with that in mind, IIUC we can already split ram_save_complete() into >1 phases too. For example, I would be curious whether the performance will go back to normal if we offload multifd_send_sync_main() into the complete_precopy_end(), because we really only need one shot of that, and I am quite surprised it already greatly affects VFIO dumping its own things. AFAIK there's already just one multifd_send_sync_main() during downtime - the one called from the save_live_complete_precopy SaveVMHandler. In order to truly never interfere with device state transfer the sync would need to be ordered after the device state transfer is complete - that is, after the VFIO complete_precopy_end (complete_precopy_async_wait) handler returns. Do you think it'll be worthwhile to give it a shot, even if we can't decide yet on the order of end()s to be called? Upon closer inspection it looks like there are, in fact, *two* RAM syncs done during the downtime - besides the one at the end of ram_save_complete() there's another one in find_dirty_block(). This function is called earlier from ram_save_complete() -> ram_find_and_save_block(). Unfortunately, skipping that intermediate sync in find_dirty_block() and moving the one from the end of ram_save_complete() to another SaveVMHandler that's called only after the VFIO device state transfer doesn't actually improve downtime (at least not on its own).
It'll be great if we could look into these issues instead of workarounds, and figure out what affected the performance behind, and also whether that can be fixed without such parameter. I've been looking at this and added some measurements around device state queuing for submission in multifd_queue_device_state(). To my surprise, the mixed RAM / device state config of 15/0 has *much* lower total queuing time of around 100 msec compared to the dedicated device state channels 15/4 config with total queuing time of around 300 msec. Despite that, the 15/4 config still has significantly lower overall downtime. This means that any reason for the downtime difference is probably on the receive / load side of the migration rather than on the save / send side. I guess the reason for the lower device state queuing time in the 15/0 case is that this data could be sent via any of the 15 multifd channels rather than just the 4 dedicated ones in the 15/4 case. Nevertheless, I will continue to look at this problem to a
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 24.04.2024 00:35, Peter Xu wrote: On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote: On 24.04.2024 00:20, Peter Xu wrote: On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: On 19.04.2024 17:31, Peter Xu wrote: On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: I think one of the reasons for these results is that mixed (RAM + device state) multifd channels participate in the RAM sync process (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. Firstly, I'm wondering whether we can have better names for these new hooks. Currently (only comment on the async* stuff): - complete_precopy_async - complete_precopy - complete_precopy_async_wait But perhaps better: - complete_precopy_begin - complete_precopy - complete_precopy_end ? As I don't see why the device must do something with async in such hook. To me it's more like you're splitting one process into multiple, then begin/end sounds more generic. Then, if with that in mind, IIUC we can already split ram_save_complete() into >1 phases too. For example, I would be curious whether the performance will go back to normal if we offloading multifd_send_sync_main() into the complete_precopy_end(), because we really only need one shot of that, and I am quite surprised it already greatly affects VFIO dumping its own things. I would even ask one step further as what Dan was asking: have you thought about dumping VFIO states via multifd even during iterations? Would that help even more than this series (which IIUC only helps during the blackout phase)? To dump during RAM iteration, the VFIO device will need to have dirty tracking and iterate on its state, because the guest CPUs will still be running potentially changing VFIO state. That seems impractical in the general case. We already do such interations in vfio_save_iterate()? 
My understanding is the recent VFIO work is based on the fact that the VFIO device can track device state changes more or less (besides being able to save/load full states). E.g. I still remember in our QE tests some old devices report much more dirty pages than expected during the iterations when we were looking into such issue that a huge amount of dirty pages reported. But newer models seem to have fixed that and report much less. That issue was about GPU not NICs, though, and IIUC a major portion of such tracking used to be for GPU vRAMs. So maybe I was mixing up these, and maybe they work differently. The device which this series was developed against (Mellanox ConnectX-7) is already transferring its live state before the VM gets stopped (via save_live_iterate SaveVMHandler). It's just that in addition to the live state it has more than 400 MiB of state that cannot be transferred while the VM is still running. And that fact hurts a lot with respect to the migration downtime. AFAIK it's a very similar story for (some) GPUs. So during iteration phase VFIO cannot yet leverage the multifd channels when with this series, am I right? That's right. Is it possible to extend that use case too? I guess so, but since this phase (iteration while the VM is still running) doesn't impact downtime it is much less critical. But it affects the bandwidth, e.g. even with multifd enabled, the device iteration data will still bottleneck at ~15Gbps on a common system setup the best case, even if the hosts are 100Gbps direct connected. Would that be a concern in the future too, or it's known problem and it won't be fixed anyway? I think any improvements to the migration performance are good, even if they don't impact downtime. It's just that this patch set focuses on the downtime phase as the more critical thing. After this gets improved there's no reason why not to look at improving performance of the VM live phase too if it brings sensible improvements. 
I remember Avihai used to have plan to look into similar issues, I hope this is exactly what he is looking for. Otherwise changing migration protocol from time to time is cumbersome; we always need to provide a flag to make sure old systems migrates in the old ways, new systems run the new ways, and for such a relatively major change I'd want to double check on how far away we can support offload VFIO iterations data to multifd. The device state transfer is indicated by a new flag in the multifd header (MULTIFD_FLAG_DEVICE_STATE). If we are to use multifd channels for VM live phase transfers these could simply re-use the same flag type. Thanks, Thanks, Maciej
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 24.04.2024 00:20, Peter Xu wrote: On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote: On 19.04.2024 17:31, Peter Xu wrote: On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: I think one of the reasons for these results is that mixed (RAM + device state) multifd channels participate in the RAM sync process (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. Firstly, I'm wondering whether we can have better names for these new hooks. Currently (only comment on the async* stuff): - complete_precopy_async - complete_precopy - complete_precopy_async_wait But perhaps better: - complete_precopy_begin - complete_precopy - complete_precopy_end ? As I don't see why the device must do something with async in such hook. To me it's more like you're splitting one process into multiple, then begin/end sounds more generic. Then, if with that in mind, IIUC we can already split ram_save_complete() into >1 phases too. For example, I would be curious whether the performance will go back to normal if we offloading multifd_send_sync_main() into the complete_precopy_end(), because we really only need one shot of that, and I am quite surprised it already greatly affects VFIO dumping its own things. I would even ask one step further as what Dan was asking: have you thought about dumping VFIO states via multifd even during iterations? Would that help even more than this series (which IIUC only helps during the blackout phase)? To dump during RAM iteration, the VFIO device will need to have dirty tracking and iterate on its state, because the guest CPUs will still be running potentially changing VFIO state. That seems impractical in the general case. We already do such interations in vfio_save_iterate()? 
My understanding is the recent VFIO work is based on the fact that the VFIO device can track device state changes more or less (besides being able to save/load full states). E.g. I still remember in our QE tests some old devices report much more dirty pages than expected during the iterations when we were looking into such issue that a huge amount of dirty pages reported. But newer models seem to have fixed that and report much less. That issue was about GPU not NICs, though, and IIUC a major portion of such tracking used to be for GPU vRAMs. So maybe I was mixing up these, and maybe they work differently. The device which this series was developed against (Mellanox ConnectX-7) is already transferring its live state before the VM gets stopped (via save_live_iterate SaveVMHandler). It's just that in addition to the live state it has more than 400 MiB of state that cannot be transferred while the VM is still running. And that fact hurts a lot with respect to the migration downtime. AFAIK it's a very similar story for (some) GPUs. So during iteration phase VFIO cannot yet leverage the multifd channels when with this series, am I right? That's right. Is it possible to extend that use case too? I guess so, but since this phase (iteration while the VM is still running) doesn't impact downtime it is much less critical. Thanks, Thanks, Maciej
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 19.04.2024 17:31, Peter Xu wrote: On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote: On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote: On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: I think one of the reasons for these results is that mixed (RAM + device state) multifd channels participate in the RAM sync process (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't. Firstly, I'm wondering whether we can have better names for these new hooks. Currently (only comment on the async* stuff): - complete_precopy_async - complete_precopy - complete_precopy_async_wait But perhaps better: - complete_precopy_begin - complete_precopy - complete_precopy_end ? As I don't see why the device must do something with async in such hook. To me it's more like you're splitting one process into multiple, then begin/end sounds more generic. Then, if with that in mind, IIUC we can already split ram_save_complete() into >1 phases too. For example, I would be curious whether the performance will go back to normal if we offloading multifd_send_sync_main() into the complete_precopy_end(), because we really only need one shot of that, and I am quite surprised it already greatly affects VFIO dumping its own things. I would even ask one step further as what Dan was asking: have you thought about dumping VFIO states via multifd even during iterations? Would that help even more than this series (which IIUC only helps during the blackout phase)? To dump during RAM iteration, the VFIO device will need to have dirty tracking and iterate on its state, because the guest CPUs will still be running potentially changing VFIO state. That seems impractical in the general case. We already do such interations in vfio_save_iterate()? My understanding is the recent VFIO work is based on the fact that the VFIO device can track device state changes more or less (besides being able to save/load full states). E.g. 
I still remember in our QE tests some old devices report much more dirty pages than expected during the iterations when we were looking into such issue that a huge amount of dirty pages reported. But newer models seem to have fixed that and report much less. That issue was about GPU not NICs, though, and IIUC a major portion of such tracking used to be for GPU vRAMs. So maybe I was mixing up these, and maybe they work differently. The device which this series was developed against (Mellanox ConnectX-7) is already transferring its live state before the VM gets stopped (via save_live_iterate SaveVMHandler). It's just that in addition to the live state it has more than 400 MiB of state that cannot be transferred while the VM is still running. And that fact hurts a lot with respect to the migration downtime. AFAIK it's a very similar story for (some) GPUs. Thanks, Thanks, Maciej
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 18.04.2024 22:02, Peter Xu wrote: On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote: On 18.04.2024 12:39, Daniel P. Berrangé wrote: On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote: On 17.04.2024 18:35, Daniel P. Berrangé wrote: On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: On 17.04.2024 10:36, Daniel P. Berrangé wrote: On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" (..) That said, the idea of reserving channels specifically for VFIO doesn't make a whole lot of sense to me either. Once we've done the RAM transfer, and are in the switchover phase doing device state transfer, all the multifd channels are idle. We should just use all those channels to transfer the device state, in parallel. Reserving channels just guarantees many idle channels during RAM transfer, and further idle channels during vmstate transfer. IMHO it is more flexible to just use all available multifd channel resources all the time. The reason for having dedicated device state channels is that they provide lower downtime in my tests. With either 15 or 11 mixed multifd channels (no dedicated device state channels) I get a downtime of about 1250 msec. Comparing that with 15 total multifd channels / 4 dedicated device state channels that give downtime of about 1100 ms it means that using dedicated channels gets about 14% downtime improvement. Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking place ? Is is transferred concurrently with the RAM ? I had thought this series still has the RAM transfer iterations running first, and then the VFIO VMstate at the end, simply making use of multifd channels for parallelism of the end phase. your reply though makes me question my interpretation though. Let me try to illustrate channel flow in various scenarios, time flowing left to right: 1. 
serialized RAM, then serialized VM state (ie historical migration)

    main:     | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |

2. parallel RAM, then serialized VM state (ie today's multifd)

    main:     | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |

3. parallel RAM, then parallel VM state

    main:     | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd4:                                                     | VFIO VM state |
    multifd5:                                                     | VFIO VM state |

4. parallel RAM and VFIO VM state, then remaining VM state

    main:     | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd4: | VFIO VM state |
    multifd5: | VFIO VM state |

I thought this series was implementing approx (3), but are you actually implementing (4), or something else entirely? You are right that this series' operation is approximately implementing the schema described as number 3 in your diagrams. However, there are some additional details worth mentioning:

* There's some but relatively small amount of VFIO data being transferred from the "save_live_iterate" SaveVMHandler while the VM is still running. This is still happening via the main migration channel. Parallelizing this transfer in the future might make sense too, although obviously this doesn't impact the downtime.

* After the VM is stopped and downtime starts the main (~400 MiB) VFIO device state gets transferred via multifd channels. However, these multifd channels (if they are not dedicated to device state transfer) aren't idle during that time. Rather they seem to be transferring the residual RAM data.
That's most likely what causes the additional observed downtime when dedicated device state transfer multifd channels aren't used. Ahh yes, I forgot about the residual dirty RAM, that makes sense as an explanation. Allow me to work through the scenarios though, as I still think my suggestion to not have separate dedicated channels is better. Let's say hypothetically we have an existing deployment today that uses 6 multifd channels for RAM, ie:

    main:     | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM it
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 18.04.2024 12:39, Daniel P. Berrangé wrote: On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote: On 17.04.2024 18:35, Daniel P. Berrangé wrote: On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: On 17.04.2024 10:36, Daniel P. Berrangé wrote: On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" (..) That said, the idea of reserving channels specifically for VFIO doesn't make a whole lot of sense to me either. Once we've done the RAM transfer, and are in the switchover phase doing device state transfer, all the multifd channels are idle. We should just use all those channels to transfer the device state, in parallel. Reserving channels just guarantees many idle channels during RAM transfer, and further idle channels during vmstate transfer. IMHO it is more flexible to just use all available multifd channel resources all the time. The reason for having dedicated device state channels is that they provide lower downtime in my tests. With either 15 or 11 mixed multifd channels (no dedicated device state channels) I get a downtime of about 1250 msec. Comparing that with 15 total multifd channels / 4 dedicated device state channels that give downtime of about 1100 ms it means that using dedicated channels gets about 14% downtime improvement. Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking place ? Is is transferred concurrently with the RAM ? I had thought this series still has the RAM transfer iterations running first, and then the VFIO VMstate at the end, simply making use of multifd channels for parallelism of the end phase. your reply though makes me question my interpretation though. Let me try to illustrate channel flow in various scenarios, time flowing left to right: 1. serialized RAM, then serialized VM state (ie historical migration) main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State | 2. 
parallel RAM, then serialized VM state (ie today's multifd)

    main:     | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |

3. parallel RAM, then parallel VM state

    main:     | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd4:                                                     | VFIO VM state |
    multifd5:                                                     | VFIO VM state |

4. parallel RAM and VFIO VM state, then remaining VM state

    main:     | Init |                                            | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
    multifd4: | VFIO VM state |
    multifd5: | VFIO VM state |

I thought this series was implementing approx (3), but are you actually implementing (4), or something else entirely? You are right that this series' operation is approximately implementing the schema described as number 3 in your diagrams. However, there are some additional details worth mentioning:

* There's some but relatively small amount of VFIO data being transferred from the "save_live_iterate" SaveVMHandler while the VM is still running. This is still happening via the main migration channel. Parallelizing this transfer in the future might make sense too, although obviously this doesn't impact the downtime.

* After the VM is stopped and downtime starts the main (~400 MiB) VFIO device state gets transferred via multifd channels. However, these multifd channels (if they are not dedicated to device state transfer) aren't idle during that time. Rather they seem to be transferring the residual RAM data.

That's most likely what causes the additional observed downtime when dedicated device state transfer multifd channels aren't used. Ahh yes, I forgot about the residual dirty RAM, that makes sense as an explanation.
Allow me to work through the scenarios though, as I still think my suggestion to not have separate dedicated channels is better. Let's say hypothetically we have an existing deployment today that uses 6 multifd channels for RAM, ie:

    main:     | Init |                                                           | VM state |
    multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
    multifd3:        | R
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 17.04.2024 18:35, Daniel P. Berrangé wrote: On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote: On 17.04.2024 10:36, Daniel P. Berrangé wrote: On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" VFIO device state transfer is currently done via the main migration channel. This means that transfers from multiple VFIO devices are done sequentially and via just a single common migration channel. Such a way of transferring VFIO device state migration data reduces performance and severely impacts the migration downtime (~50%) for VMs that have multiple such devices with large state size - see the test results below. However, we already have a way to transfer migration data using multiple connections - that's what multifd channels are. Unfortunately, multifd channels are currently utilized for RAM transfer only. This patch set adds a new framework allowing their use for device state transfer too. The wire protocol is based on Avihai's x-channel-header patches, which introduce a header for migration channels that allows the migration source to explicitly indicate the migration channel type without having the target deduce the channel type by peeking in the channel's content. The new wire protocol can be switched on and off via the migration.x-channel-header option for compatibility with older QEMU versions and testing. Switching the new wire protocol off also disables device state transfer via multifd channels. The device state transfer can happen either via the same multifd channels as RAM data is transferred, mixed with RAM data (when migration.x-multifd-channels-device-state is 0) or exclusively via dedicated device state transfer channels (when migration.x-multifd-channels-device-state > 0). Using dedicated device state transfer multifd channels brings further performance benefits since these channels don't need to participate in the RAM sync process.
I'm not convinced there's any need to introduce the new "channel header" protocol messages. The multifd channels already have an initialization message that is extensible to allow extra semantics to be indicated. So if we want some of the multifd channels to be reserved for device state, we could indicate that via some data in the MultiFDInit_t message struct. The reason for introducing x-channel-header was to avoid having to deduce the channel type by peeking in the channel's content - where any channel that does not start with QEMU_VM_FILE_MAGIC is currently treated as a multifd one. But if this isn't desired then, as you say, the multifd channel type can be indicated by using some unused field of the MultiFDInit_t message. Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then. I don't like the heuristics we currently have, and would like to have a better solution. What makes me cautious is that this proposal is a protocol change, but only addressing one very narrow problem with the migration protocol. I'd like migration to see a more explicit bi-directional protocol negotiation message set, where both QEMUs can auto-negotiate amongst themselves many of the features that currently require tedious manual configuration by mgmt apps via migrate parameters/capabilities. That would address the problem you describe here, and so much more. Isn't the capability negotiation handled automatically by libvirt today? I guess you'd prefer for QEMU to internally handle it instead? If we add this channel header feature now, it creates yet another thing to keep around for back compatibility. So if this is not strictly required, in order to solve the VFIO VMstate problem, I'd prefer to just solve the VMstate stuff on its own. Okay, got it. That said, the idea of reserving channels specifically for VFIO doesn't make a whole lot of sense to me either.
Once we've done the RAM transfer, and are in the switchover phase doing device state transfer, all the multifd channels are idle. We should just use all those channels to transfer the device state, in parallel. Reserving channels just guarantees many idle channels during RAM transfer, and further idle channels during vmstate transfer. IMHO it is more flexible to just use all available multifd channel resources all the time. The reason for having dedicated device state channels is that they provide lower downtime in my tests. With either 15 or 11 mixed multifd channels (no dedicated device state channels) I get a downtime of about 1250 msec. Comparing that with 15 total multifd channels / 4 dedicated device state channels that gives a downtime of about 1100 ms it means that using dedicated channels gets about 14% downtime improvement. Hmm, can you clarify. /when/ is the VFIO vmstate transfer taking place? Is it transferred concurrently with the RAM? I had thought this series still has the RAM transfer iterations running first, and then the VFIO VMstate at the end, simply making use of multifd channels for
Re: [PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
On 17.04.2024 10:36, Daniel P. Berrangé wrote: On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" VFIO device state transfer is currently done via the main migration channel. This means that transfers from multiple VFIO devices are done sequentially and via just a single common migration channel. Such a way of transferring VFIO device state migration data reduces performance and severely impacts the migration downtime (~50%) for VMs that have multiple such devices with large state size - see the test results below. However, we already have a way to transfer migration data using multiple connections - that's what multifd channels are. Unfortunately, multifd channels are currently utilized for RAM transfer only. This patch set adds a new framework allowing their use for device state transfer too. The wire protocol is based on Avihai's x-channel-header patches, which introduce a header for migration channels that allows the migration source to explicitly indicate the migration channel type without having the target deduce the channel type by peeking in the channel's content. The new wire protocol can be switched on and off via the migration.x-channel-header option for compatibility with older QEMU versions and testing. Switching the new wire protocol off also disables device state transfer via multifd channels. The device state transfer can happen either via the same multifd channels as RAM data is transferred, mixed with RAM data (when migration.x-multifd-channels-device-state is 0) or exclusively via dedicated device state transfer channels (when migration.x-multifd-channels-device-state > 0). Using dedicated device state transfer multifd channels brings further performance benefits since these channels don't need to participate in the RAM sync process. I'm not convinced there's any need to introduce the new "channel header" protocol messages.
The multifd channels already have an initialization message that is extensible to allow extra semantics to be indicated. So if we want some of the multifd channels to be reserved for device state, we could indicate that via some data in the MultiFDInit_t message struct. The reason for introducing x-channel-header was to avoid having to deduce the channel type by peeking in the channel's content - where any channel that does not start with QEMU_VM_FILE_MAGIC is currently treated as a multifd one. But if this isn't desired then, as you say, the multifd channel type can be indicated by using some unused field of the MultiFDInit_t message. Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then. That said, the idea of reserving channels specifically for VFIO doesn't make a whole lot of sense to me either. Once we've done the RAM transfer, and are in the switchover phase doing device state transfer, all the multifd channels are idle. We should just use all those channels to transfer the device state, in parallel. Reserving channels just guarantees many idle channels during RAM transfer, and further idle channels during vmstate transfer. IMHO it is more flexible to just use all available multifd channel resources all the time. The reason for having dedicated device state channels is that they provide lower downtime in my tests. With either 15 or 11 mixed multifd channels (no dedicated device state channels) I get a downtime of about 1250 msec. Comparing that with 15 total multifd channels / 4 dedicated device state channels, which give a downtime of about 1100 ms, using dedicated channels yields about a 14% downtime improvement. Again the 'MultiFDPacket_t' struct has both 'flags' and unused fields, so it is extensible to indicate that it is being used for new types of data. Yeah, that's what MULTIFD_FLAG_DEVICE_STATE in the packet header already does in this patch set - it indicates that the packet contains device state, not RAM data. 
With regards, Daniel Best regards, Maciej
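To make the mechanism discussed above concrete, here is a minimal, self-contained C sketch of how a receiver can tell a device-state packet from a RAM packet by testing a flag bit in the (big-endian) packet header. The flag value and struct layout here are illustrative stand-ins, not QEMU's actual MultiFDPacket_t definition.

```c
#include <stdint.h>
#include <arpa/inet.h>  /* htonl/ntohl: 32-bit big-endian conversion */

/* Illustrative flag value; the real MULTIFD_FLAG_DEVICE_STATE lives in
 * migration/multifd.h and may use a different bit. */
#define MULTIFD_FLAG_DEVICE_STATE (1u << 4)

/* Minimal stand-in for the common packet header prefix shared by RAM
 * and device-state packets. */
typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t flags;   /* stored big-endian on the wire */
} PacketHdr;

/* Returns nonzero when the packet carries device state rather than RAM. */
static int packet_is_device_state(const PacketHdr *hdr)
{
    return (ntohl(hdr->flags) & MULTIFD_FLAG_DEVICE_STATE) != 0;
}
```

On the receive side the header is read first, and a test like this selects whether the rest of the packet is parsed as device state or as RAM pages.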
[PATCH RFC 18/26] migration: Add load_finish handler and associated functions
From: "Maciej S. Szmigiero" load_finish SaveVMHandler allows migration code to poll whether a device-specific asynchronous device state loading operation has finished. In order to avoid calling this handler needlessly, the device is supposed to notify the migration code of its possible readiness via a call to qemu_loadvm_load_finish_ready_broadcast() while holding qemu_loadvm_load_finish_ready_lock. Signed-off-by: Maciej S. Szmigiero --- include/migration/register.h | 21 +++ migration/migration.c| 6 + migration/migration.h| 3 +++ migration/savevm.c | 52 migration/savevm.h | 4 +++ 5 files changed, 86 insertions(+) diff --git a/include/migration/register.h b/include/migration/register.h index 7d29b7e0b559..f15881fc87cd 100644 --- a/include/migration/register.h +++ b/include/migration/register.h @@ -272,6 +272,27 @@ typedef struct SaveVMHandlers { int (*load_state_buffer)(void *opaque, char *data, size_t data_size, Error **errp); +/** + * @load_finish + * + * Poll whether all asynchronous device state loading has finished. + * Not called on the load failure path. + * + * Called while holding the qemu_loadvm_load_finish_ready_lock. + * + * If this method signals "not ready" then it might not be called + * again until qemu_loadvm_load_finish_ready_broadcast() is invoked + * while holding qemu_loadvm_load_finish_ready_lock. + * + * @opaque: data pointer passed to register_savevm_live() + * @is_finished: whether the loading has finished (output parameter) + * @errp: pointer to Error*, to store an error if it happens. + * + * Returns zero to indicate success and negative for an error. + * It's not an error that the loading still hasn't finished. 
+ */ +int (*load_finish)(void *opaque, bool *is_finished, Error **errp); + /** * @load_setup * diff --git a/migration/migration.c b/migration/migration.c index 8fe8be71a0e3..e4f82695a338 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -234,6 +234,9 @@ void migration_object_init(void) qemu_cond_init(_incoming->page_request_cond); current_incoming->page_requested = g_tree_new(page_request_addr_cmp); +g_mutex_init(_incoming->load_finish_ready_mutex); +g_cond_init(_incoming->load_finish_ready_cond); + migration_object_check(current_migration, _fatal); blk_mig_init(); @@ -387,6 +390,9 @@ void migration_incoming_state_destroy(void) mis->postcopy_qemufile_dst = NULL; } +g_mutex_clear(>load_finish_ready_mutex); +g_cond_clear(>load_finish_ready_cond); + yank_unregister_instance(MIGRATION_YANK_INSTANCE); } diff --git a/migration/migration.h b/migration/migration.h index a6114405917f..92014ef4cfcc 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -227,6 +227,9 @@ struct MigrationIncomingState { * is needed as this field is updated serially. 
*/ unsigned int switchover_ack_pending_num; + +GCond load_finish_ready_cond; +GMutex load_finish_ready_mutex; }; MigrationIncomingState *migration_incoming_get_current(void); diff --git a/migration/savevm.c b/migration/savevm.c index 2e4d63faca06..30521ad3f340 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -2994,6 +2994,37 @@ int qemu_loadvm_state(QEMUFile *f) return ret; } +qemu_loadvm_load_finish_ready_lock(); +while (!ret) { /* Don't call load_finish() handlers on the load failure path */ +bool all_ready = true; +SaveStateEntry *se = NULL; + +QTAILQ_FOREACH(se, _state.handlers, entry) { +bool this_ready; + +if (!se->ops || !se->ops->load_finish) { +continue; +} + +ret = se->ops->load_finish(se->opaque, _ready, _err); +if (ret) { +error_report_err(local_err); + +qemu_loadvm_load_finish_ready_unlock(); +return -EINVAL; +} else if (!this_ready) { +all_ready = false; +} +} + +if (all_ready) { +break; +} + +g_cond_wait(>load_finish_ready_cond, >load_finish_ready_mutex); +} +qemu_loadvm_load_finish_ready_unlock(); + if (ret == 0) { ret = qemu_file_get_error(f); } @@ -3098,6 +3129,27 @@ int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id, return 0; } +void qemu_loadvm_load_finish_ready_lock(void) +{ +MigrationIncomingState *mis = migration_incoming_get_current(); + +g_mutex_lock(>load_finish_ready_mutex); +} + +void qemu_loadvm_load_finish_ready_unlock(void) +{ +MigrationIncomingState *mis = migration_incoming_get_current(); + +g_mutex_unlock(>load_finish_ready_mutex); +} + +void qemu_
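The poll-and-wait protocol implemented above - take the lock, poll each device's load_finish handler, and sleep on a condition variable until some device broadcasts readiness - can be modeled with plain pthreads. This is a behavioral sketch only, reduced to one device; all names here (run_load_finish_demo, device_loader) are invented for illustration and are not QEMU API.

```c
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-ins for the migration-side lock/condvar pair and per-device state. */
static pthread_mutex_t ready_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready_cond  = PTHREAD_COND_INITIALIZER;
static bool device_done;

/* Device side: finish asynchronous loading, then broadcast readiness
 * (the analogue of qemu_loadvm_load_finish_ready_broadcast()). */
static void *device_loader(void *arg)
{
    (void)arg;
    usleep(10000);                        /* pretend to load device state */
    pthread_mutex_lock(&ready_mutex);
    device_done = true;
    pthread_cond_broadcast(&ready_cond);
    pthread_mutex_unlock(&ready_mutex);
    return NULL;
}

/* Migration side: a load_finish()-style poll, called with the lock held.
 * "Not ready yet" is reported via *is_finished, not as an error. */
static int load_finish_poll(bool *is_finished)
{
    *is_finished = device_done;
    return 0;
}

/* The qemu_loadvm_state()-style wait loop; returns true once all
 * (here: the single) devices report finished loading. */
static bool run_load_finish_demo(void)
{
    pthread_t th;
    bool finished = false;

    pthread_create(&th, NULL, device_loader, NULL);

    pthread_mutex_lock(&ready_mutex);
    while (!finished) {
        if (load_finish_poll(&finished) != 0) {
            break;                        /* error path */
        }
        if (!finished) {
            /* Sleep until a device broadcasts possible readiness. */
            pthread_cond_wait(&ready_cond, &ready_mutex);
        }
    }
    pthread_mutex_unlock(&ready_mutex);

    pthread_join(th, NULL);
    return finished;
}
```

Because the poll happens under the same mutex the device takes before broadcasting, the loop cannot miss a wakeup: either the poll already sees the flag set, or the broadcast arrives while the loop is parked in pthread_cond_wait.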
[PATCH RFC 07/26] migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket
From: "Maciej S. Szmigiero" This will allow passing additional parameters there in the future. Signed-off-by: Maciej S. Szmigiero --- migration/postcopy-ram.c | 68 +++- 1 file changed, 61 insertions(+), 7 deletions(-) diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index e314e1023dc1..94fe872d8251 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -1617,14 +1617,62 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file) trace_postcopy_preempt_new_channel(); } +typedef struct { +unsigned int ref; +MigrationState *s; +} PostcopyPChannelConnectData; + +static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s) +{ +PostcopyPChannelConnectData *data; + +data = g_malloc0(sizeof(*data)); +data->ref = 1; +data->s = s; + +return data; +} + +static void pcopy_preempt_connect_data_free(PostcopyPChannelConnectData *data) +{ +g_free(data); +} + +static PostcopyPChannelConnectData * +pcopy_preempt_connect_data_ref(PostcopyPChannelConnectData *data) +{ +unsigned int ref_old; + +ref_old = qatomic_fetch_inc(>ref); +assert(ref_old < UINT_MAX); + +return data; +} + +static void pcopy_preempt_connect_data_unref(gpointer opaque) +{ +PostcopyPChannelConnectData *data = opaque; +unsigned int ref_old; + +ref_old = qatomic_fetch_dec(>ref); +assert(ref_old > 0); +if (ref_old == 1) { +pcopy_preempt_connect_data_free(data); +} +} + +G_DEFINE_AUTOPTR_CLEANUP_FUNC(PostcopyPChannelConnectData, pcopy_preempt_connect_data_unref) + /* * Setup the postcopy preempt channel with the IOC. If ERROR is specified, * setup the error instead. This helper will free the ERROR if specified. 
*/ static void -postcopy_preempt_send_channel_done(MigrationState *s, +postcopy_preempt_send_channel_done(PostcopyPChannelConnectData *data, QIOChannel *ioc, Error *local_err) { +MigrationState *s = data->s; + if (local_err) { migrate_set_error(s, local_err); error_free(local_err); @@ -1645,18 +1693,19 @@ static void postcopy_preempt_tls_handshake(QIOTask *task, gpointer opaque) { g_autoptr(QIOChannel) ioc = QIO_CHANNEL(qio_task_get_source(task)); -MigrationState *s = opaque; +PostcopyPChannelConnectData *data = opaque; Error *local_err = NULL; qio_task_propagate_error(task, _err); -postcopy_preempt_send_channel_done(s, ioc, local_err); +postcopy_preempt_send_channel_done(data, ioc, local_err); } static void postcopy_preempt_send_channel_new(QIOTask *task, gpointer opaque) { g_autoptr(QIOChannel) ioc = QIO_CHANNEL(qio_task_get_source(task)); -MigrationState *s = opaque; +PostcopyPChannelConnectData *data = opaque; +MigrationState *s = data->s; QIOChannelTLS *tioc; Error *local_err = NULL; @@ -1672,14 +1721,15 @@ postcopy_preempt_send_channel_new(QIOTask *task, gpointer opaque) trace_postcopy_preempt_tls_handshake(); qio_channel_set_name(QIO_CHANNEL(tioc), "migration-tls-preempt"); qio_channel_tls_handshake(tioc, postcopy_preempt_tls_handshake, - s, NULL, NULL); + pcopy_preempt_connect_data_ref(data), + pcopy_preempt_connect_data_unref, NULL); /* Setup the channel until TLS handshake finished */ return; } out: /* This handles both good and error cases */ -postcopy_preempt_send_channel_done(s, ioc, local_err); +postcopy_preempt_send_channel_done(data, ioc, local_err); } /* @@ -1714,8 +1764,12 @@ int postcopy_preempt_establish_channel(MigrationState *s) void postcopy_preempt_setup(MigrationState *s) { +PostcopyPChannelConnectData *data; + +data = pcopy_preempt_connect_data_new(s); /* Kick an async task to connect */ -socket_send_channel_create(postcopy_preempt_send_channel_new, s, NULL); +socket_send_channel_create(postcopy_preempt_send_channel_new, + data, 
pcopy_preempt_connect_data_unref); } static void postcopy_pause_ram_fast_load(MigrationIncomingState *mis)
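The refcounting pattern introduced above (one reference owned by the connect task, extra references taken for the TLS handshake callback, overflow/underflow asserted) can be sketched with C11 atomics. Field and function names here are illustrative, not QEMU's, and the payload stands in for the MigrationState pointer.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Minimal model of PostcopyPChannelConnectData-style lifetime management. */
typedef struct {
    atomic_uint ref;
    int payload;              /* stands in for the MigrationState pointer */
} ConnectData;

static ConnectData *connect_data_new(int payload)
{
    ConnectData *d = calloc(1, sizeof(*d));
    atomic_init(&d->ref, 1);  /* caller owns the initial reference */
    d->payload = payload;
    return d;
}

static ConnectData *connect_data_ref(ConnectData *d)
{
    unsigned old = atomic_fetch_add(&d->ref, 1);
    assert(old < UINT_MAX);   /* guard against overflow, as the patch does */
    return d;
}

/* Returns 1 when the last reference was dropped and the object freed. */
static int connect_data_unref(ConnectData *d)
{
    unsigned old = atomic_fetch_sub(&d->ref, 1);
    assert(old > 0);          /* underflow means a ref/unref imbalance */
    if (old == 1) {
        free(d);
        return 1;
    }
    return 0;
}
```

The design point is that each asynchronous callback (socket connect, TLS handshake) holds its own reference, so the data stays valid no matter which callback fires last.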
[PATCH RFC 24/26] migration/multifd: Add migration_has_device_state_support()
From: "Maciej S. Szmigiero" Since device state transfer via multifd channels requires multifd channels with a migration channel header and is currently not compatible with multifd compression, add an appropriate query function so a device can learn whether it can actually make use of it. Signed-off-by: Maciej S. Szmigiero --- include/migration/misc.h | 1 + migration/multifd.c | 6 ++ 2 files changed, 7 insertions(+) diff --git a/include/migration/misc.h b/include/migration/misc.h index 25968e31247b..4da4f7f85f18 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -118,6 +118,7 @@ bool migration_in_bg_snapshot(void); void dirty_bitmap_mig_init(void); /* migration/multifd.c */ +bool migration_has_device_state_support(void); int multifd_queue_device_state(char *idstr, uint32_t instance_id, char *data, size_t len); diff --git a/migration/multifd.c b/migration/multifd.c index d8ce01539a05..d24217e705a0 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -819,6 +819,12 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset) return multifd_queue_page_locked(block, offset); } +bool migration_has_device_state_support(void) +{ +return migrate_multifd() && migrate_channel_header() && +migrate_multifd_compression() == MULTIFD_COMPRESSION_NONE; +} + int multifd_queue_device_state(char *idstr, uint32_t instance_id, char *data, size_t len) {
[PATCH RFC 15/26] migration/multifd: Zero p->flags before starting filling a packet
From: "Maciej S. Szmigiero" This way there aren't stale flags there. p->flags can't contain SYNC to be sent at the next RAM packet since syncs are now handled separately in multifd_send_thread. Signed-off-by: Maciej S. Szmigiero --- migration/multifd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/migration/multifd.c b/migration/multifd.c index c2575e3d6dbf..7118c69a4d49 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -933,6 +933,7 @@ static void *multifd_send_thread(void *opaque) if (qatomic_load_acquire(&p->pending_job)) { MultiFDPages_t *pages = p->pages; +p->flags = 0; p->iovs_num = 0; assert(pages->num); @@ -986,7 +987,6 @@ } /* p->next_packet_size will always be zero for a SYNC packet */ stat64_add(&mig_stats.multifd_bytes, p->packet_len); -p->flags = 0; } qatomic_set(&p->pending_sync, false);
[PATCH RFC 26/26] vfio/migration: Multifd device state transfer support - send side
From: "Maciej S. Szmigiero" Implement the multifd device state transfer via additional per-device thread spawned from save_live_complete_precopy_async handler. Switch between doing the data transfer in the new handler and doing it in the old save_state handler depending on the migration_has_device_state_support() return value. Signed-off-by: Maciej S. Szmigiero --- hw/vfio/migration.c | 195 ++ hw/vfio/trace-events | 3 + include/hw/vfio/vfio-common.h | 8 ++ 3 files changed, 206 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 3af62dea6899..6177431a0cd3 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -608,11 +608,15 @@ static int vfio_save_setup(QEMUFile *f, void *opaque) return qemu_file_get_error(f); } +static void vfio_save_complete_precopy_async_thread_thread_terminate(VFIODevice *vbasedev); + static void vfio_save_cleanup(void *opaque) { VFIODevice *vbasedev = opaque; VFIOMigration *migration = vbasedev->migration; +vfio_save_complete_precopy_async_thread_thread_terminate(vbasedev); + /* * Changing device state from STOP_COPY to STOP can take time. Do it here, * after migration has completed, so it won't increase downtime. 
@@ -621,6 +625,7 @@ static void vfio_save_cleanup(void *opaque) vfio_migration_set_state_or_reset(vbasedev, VFIO_DEVICE_STATE_STOP); } +g_clear_pointer(>idstr, g_free); g_free(migration->data_buffer); migration->data_buffer = NULL; migration->precopy_init_size = 0; @@ -735,6 +740,12 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) ssize_t data_size; int ret; +if (migration_has_device_state_support()) { +/* Emit dummy NOP data */ +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); +return 0; +} + trace_vfio_save_complete_precopy_started(vbasedev->name); /* We reach here with device state STOP or STOP_COPY only */ @@ -762,11 +773,186 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } +static int vfio_save_complete_precopy_async_thread_config_state(VFIODevice *vbasedev, uint32_t idx) +{ +VFIOMigration *migration = vbasedev->migration; +g_autoptr(QIOChannelBuffer) bioc = NULL; +QEMUFile *f = NULL; +int ret; +g_autofree VFIODeviceStatePacket *packet = NULL; +size_t packet_len; + +bioc = qio_channel_buffer_new(0); +qio_channel_set_name(QIO_CHANNEL(bioc), "vfio-device-config-save"); + +f = qemu_file_new_output(QIO_CHANNEL(bioc)); + +ret = vfio_save_device_config_state(f, vbasedev); +if (ret) { +return ret; +} + +ret = qemu_fflush(f); +if (ret) { +goto ret_close_file; +} + +packet_len = sizeof(*packet) + bioc->usage; +packet = g_malloc0(packet_len); +packet->idx = idx; +packet->flags = VFIO_DEVICE_STATE_CONFIG_STATE; +memcpy(>data, bioc->data, bioc->usage); + +ret = multifd_queue_device_state(migration->idstr, migration->instance_id, + (char *)packet, packet_len); + +bytes_transferred += packet_len; + +ret_close_file: +g_clear_pointer(, qemu_fclose); +return ret; +} + +static void *vfio_save_complete_precopy_async_thread(void *opaque) +{ +VFIODevice *vbasedev = opaque; +VFIOMigration *migration = vbasedev->migration; +int *ret = >save_complete_precopy_thread_ret; +g_autofree VFIODeviceStatePacket *packet = NULL; +uint32_t idx; + 
+/* We reach here with device state STOP or STOP_COPY only */ +*ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY, +VFIO_DEVICE_STATE_STOP); +if (*ret) { +return NULL; +} + +packet = g_malloc0(sizeof(*packet) + migration->data_buffer_size); + +for (idx = 0; ; idx++) { +ssize_t data_size; +size_t packet_size; + +data_size = read(migration->data_fd, >data, + migration->data_buffer_size); +if (data_size < 0) { +if (errno != ENOMSG) { +*ret = -errno; +return NULL; +} + +/* + * Pre-copy emptied all the device state for now. For more information, + * please refer to the Linux kernel VFIO uAPI. + */ +data_size = 0; +} + +if (data_size == 0) +break; + +packet->idx = idx; +packet_size = sizeof(*packet) + data_size; + +*ret = multifd_queue_device_state(migration->idstr, migration->instance_id, + (char *)packet, packet_size); +if (*ret) { +return NULL; +} + +bytes_transferred += packet_size; +} + +*ret = vfio_save_c
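The save thread's loop above reduces to: read up to data_buffer_size bytes of device state, tag each chunk with a monotonically increasing idx, queue it as one multifd packet, and stop at EOF (ENOMSG from the VFIO data_fd meaning "empty for now", per the kernel uAPI). A minimal sketch of just the chunk/index bookkeeping, operating on an in-memory blob instead of a file descriptor; the emit callback and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Split a device-state blob into chunks of at most `chunk` bytes and hand
 * each one, tagged with its packet index, to `emit` (which stands in for
 * multifd_queue_device_state()). Returns the number of packets produced. */
static unsigned chunk_device_state(const uint8_t *state, size_t len,
                                   size_t chunk,
                                   void (*emit)(uint32_t idx,
                                                const uint8_t *buf, size_t n))
{
    uint32_t idx = 0;
    for (size_t off = 0; off < len; idx++) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        if (emit) {
            emit(idx, state + off, n);  /* queue one indexed packet */
        }
        off += n;
    }
    return idx;
}
```

The idx tag is what lets the receive side reassemble chunks in order even though multifd channels may deliver packets out of order.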
[PATCH RFC 19/26] migration: Add x-multifd-channels-device-state parameter
From: "Maciej S. Szmigiero" This parameter allows specifying how many multifd channels are dedicated to sending device state in parallel. It is ignored on the receive side. Signed-off-by: Maciej S. Szmigiero --- migration/migration-hmp-cmds.c | 7 + migration/options.c| 51 ++ migration/options.h| 1 + qapi/migration.json| 16 ++- 4 files changed, 74 insertions(+), 1 deletion(-) diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c index 7e96ae6ffdae..37d71422fdc3 100644 --- a/migration/migration-hmp-cmds.c +++ b/migration/migration-hmp-cmds.c @@ -341,6 +341,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict) monitor_printf(mon, "%s: %u\n", MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_CHANNELS), params->multifd_channels); +monitor_printf(mon, "%s: %u\n", + MigrationParameter_str(MIGRATION_PARAMETER_X_MULTIFD_CHANNELS_DEVICE_STATE), +params->x_multifd_channels_device_state); monitor_printf(mon, "%s: %s\n", MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_COMPRESSION), MultiFDCompression_str(params->multifd_compression)); @@ -626,6 +629,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict) p->has_multifd_channels = true; visit_type_uint8(v, param, >multifd_channels, ); break; +case MIGRATION_PARAMETER_X_MULTIFD_CHANNELS_DEVICE_STATE: +p->has_x_multifd_channels_device_state = true; +visit_type_uint8(v, param, >x_multifd_channels_device_state, ); +break; case MIGRATION_PARAMETER_MULTIFD_COMPRESSION: p->has_multifd_compression = true; visit_type_MultiFDCompression(v, param, >multifd_compression, diff --git a/migration/options.c b/migration/options.c index 949d8a6c0b62..a7f09570b04e 100644 --- a/migration/options.c +++ b/migration/options.c @@ -59,6 +59,7 @@ /* The delay time (in ms) between two COLO checkpoints */ #define DEFAULT_MIGRATE_X_CHECKPOINT_DELAY (200 * 100) #define DEFAULT_MIGRATE_MULTIFD_CHANNELS 2 +#define DEFAULT_MIGRATE_MULTIFD_CHANNELS_DEVICE_STATE 0 #define 
DEFAULT_MIGRATE_MULTIFD_COMPRESSION MULTIFD_COMPRESSION_NONE /* 0: means nocompress, 1: best speed, ... 9: best compress ratio */ #define DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL 1 @@ -138,6 +139,9 @@ Property migration_properties[] = { DEFINE_PROP_UINT8("multifd-channels", MigrationState, parameters.multifd_channels, DEFAULT_MIGRATE_MULTIFD_CHANNELS), +DEFINE_PROP_UINT8("x-multifd-channels-device-state", MigrationState, + parameters.x_multifd_channels_device_state, + DEFAULT_MIGRATE_MULTIFD_CHANNELS_DEVICE_STATE), DEFINE_PROP_MULTIFD_COMPRESSION("multifd-compression", MigrationState, parameters.multifd_compression, DEFAULT_MIGRATE_MULTIFD_COMPRESSION), @@ -885,6 +889,13 @@ int migrate_multifd_channels(void) return s->parameters.multifd_channels; } +int migrate_multifd_channels_device_state(void) +{ +MigrationState *s = migrate_get_current(); + +return s->parameters.x_multifd_channels_device_state; +} + MultiFDCompression migrate_multifd_compression(void) { MigrationState *s = migrate_get_current(); @@ -1032,6 +1043,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp) params->block_incremental = s->parameters.block_incremental; params->has_multifd_channels = true; params->multifd_channels = s->parameters.multifd_channels; +params->has_x_multifd_channels_device_state = true; +params->x_multifd_channels_device_state = s->parameters.x_multifd_channels_device_state; params->has_multifd_compression = true; params->multifd_compression = s->parameters.multifd_compression; params->has_multifd_zlib_level = true; @@ -1091,6 +1104,7 @@ void migrate_params_init(MigrationParameters *params) params->has_x_checkpoint_delay = true; params->has_block_incremental = true; params->has_multifd_channels = true; +params->has_x_multifd_channels_device_state = true; params->has_multifd_compression = true; params->has_multifd_zlib_level = true; params->has_multifd_zstd_level = true; @@ -1198,6 +1212,37 @@ bool migrate_params_check(MigrationParameters *params, Error **errp) return 
false; } +if (params->has_multifd_channels && +params->has_x_multifd_channels_device_state && +params->x_multifd_channels_device_state > 0 && +!migrate_channel_header()) { +error_setg(errp, QERR_INVALID_PARAMETER_VALUE, + "x_multi
[PATCH RFC 22/26] migration/multifd: Convert multifd_send_pages::next_channel to atomic
From: "Maciej S. Szmigiero" This is necessary for multifd_send_pages() to be able to be called from multiple threads. Signed-off-by: Maciej S. Szmigiero --- migration/multifd.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index a26418d87485..878ff7d9f9f0 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -622,8 +622,8 @@ static bool multifd_send_pages(void) * using more channels, so ensure it doesn't overflow if the * limit is lower now. */ -next_channel %= migrate_multifd_channels(); -for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) { +i = qatomic_load_acquire(&next_channel) % migrate_multifd_channels(); +for (;; i = (i + 1) % migrate_multifd_channels()) { if (multifd_send_should_exit()) { return false; } @@ -633,7 +633,8 @@ * sender thread can clear it. */ if (qatomic_read(&p->pending_job) == false) { -next_channel = (i + 1) % migrate_multifd_channels(); +qatomic_store_release(&next_channel, + (i + 1) % migrate_multifd_channels()); break; } }
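The converted next_channel logic amounts to: load the shared index with acquire semantics, scan for a free channel starting there, and publish the successor with release semantics. A stand-alone C11 sketch, simplified to always take the first candidate (the real loop also checks each channel's pending_job):

```c
#include <stdatomic.h>

/* Shared round-robin cursor across senders (starts at 0). */
static atomic_uint next_channel;

/* Pick a channel index and advance the cursor past it. */
static unsigned pick_channel(unsigned nchannels)
{
    unsigned i = atomic_load_explicit(&next_channel,
                                      memory_order_acquire) % nchannels;
    /* ...the real code scans from i for a channel with no pending job... */
    atomic_store_explicit(&next_channel, (i + 1) % nchannels,
                          memory_order_release);
    return i;
}
```

Note that the load/store pair is not atomic as a unit: two truly concurrent callers could still read the same starting index. The atomics make the cursor safe to touch from multiple threads, but fair distribution still assumes callers are serialized by a higher-level lock, as the send path's later queueing mutex provides.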
[PATCH RFC 23/26] migration/multifd: Device state transfer support - send side
From: "Maciej S. Szmigiero" A new function multifd_queue_device_state() is provided for device to queue its state for transmission via a multifd channel. Signed-off-by: Maciej S. Szmigiero --- include/migration/misc.h | 4 + migration/multifd-zlib.c | 2 +- migration/multifd-zstd.c | 2 +- migration/multifd.c | 244 ++- migration/multifd.h | 30 +++-- 5 files changed, 244 insertions(+), 38 deletions(-) diff --git a/include/migration/misc.h b/include/migration/misc.h index c9e200f4eb8f..25968e31247b 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -117,4 +117,8 @@ bool migration_in_bg_snapshot(void); /* migration/block-dirty-bitmap.c */ void dirty_bitmap_mig_init(void); +/* migration/multifd.c */ +int multifd_queue_device_state(char *idstr, uint32_t instance_id, + char *data, size_t len); + #endif diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c index 99821cd4d5ef..e20c1de6033d 100644 --- a/migration/multifd-zlib.c +++ b/migration/multifd-zlib.c @@ -177,7 +177,7 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp) out: p->flags |= MULTIFD_FLAG_ZLIB; -multifd_send_fill_packet(p); +multifd_send_fill_packet_ram(p); return 0; } diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c index 02112255adcc..37cebd006921 100644 --- a/migration/multifd-zstd.c +++ b/migration/multifd-zstd.c @@ -166,7 +166,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp) out: p->flags |= MULTIFD_FLAG_ZSTD; -multifd_send_fill_packet(p); +multifd_send_fill_packet_ram(p); return 0; } diff --git a/migration/multifd.c b/migration/multifd.c index 878ff7d9f9f0..d8ce01539a05 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -12,6 +12,7 @@ #include "qemu/osdep.h" #include "qemu/cutils.h" +#include "qemu/iov.h" #include "qemu/rcu.h" #include "exec/target_page.h" #include "sysemu/sysemu.h" @@ -20,6 +21,7 @@ #include "qapi/error.h" #include "channel.h" #include "file.h" +#include "migration/misc.h" #include 
"migration.h" #include "migration-stats.h" #include "savevm.h" @@ -50,9 +52,17 @@ typedef struct { } __attribute__((packed)) MultiFDInit_t; struct { +/* + * Are there some device state dedicated channels (true) or + * should device state be sent via any available channel (false)? + */ +bool device_state_dedicated_channels; +GMutex queue_job_mutex; + MultiFDSendParams *params; -/* array of pages to sent */ +/* array of pages or device state to be sent */ MultiFDPages_t *pages; +MultiFDDeviceState_t *device_state; /* * Global number of generated multifd packets. * @@ -169,7 +179,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p) } /** - * nocomp_send_prepare: prepare date to be able to send + * nocomp_send_prepare_ram: prepare RAM data for sending * * For no compression we just have to calculate the size of the * packet. @@ -179,7 +189,7 @@ static void multifd_send_prepare_iovs(MultiFDSendParams *p) * @p: Params for the channel that we are using * @errp: pointer to an error */ -static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp) +static int nocomp_send_prepare_ram(MultiFDSendParams *p, Error **errp) { bool use_zero_copy_send = migrate_zero_copy_send(); int ret; @@ -198,13 +208,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp) * Only !zerocopy needs the header in IOV; zerocopy will * send it separately. 
*/ -multifd_send_prepare_header(p); +multifd_send_prepare_header_ram(p); } multifd_send_prepare_iovs(p); p->flags |= MULTIFD_FLAG_NOCOMP; -multifd_send_fill_packet(p); +multifd_send_fill_packet_ram(p); if (use_zero_copy_send) { /* Send header first, without zerocopy */ @@ -218,6 +228,59 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp) return 0; } +static void multifd_send_fill_packet_device_state(MultiFDSendParams *p) +{ +MultiFDPacketDeviceState_t *packet = p->packet_device_state; + +packet->hdr.flags = cpu_to_be32(p->flags); +strncpy(packet->idstr, p->device_state->idstr, sizeof(packet->idstr)); +packet->instance_id = cpu_to_be32(p->device_state->instance_id); +packet->next_packet_size = cpu_to_be32(p->next_packet_size); +} + +/** + * nocomp_send_prepare_device_state: prepare device state data for sending + * + * Returns 0 for success or -1 for error + * + * @p: Params for the channel that we are using + * @errp: pointer to an error + */ +static int nocomp_send_prepare_device_state(MultiFD
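The multifd_send_fill_packet_device_state() hunk above copies the device's idstr with strncpy (which NUL-pads the remainder of the field) and byte-swaps the numeric fields to big-endian. A self-contained sketch of that fill step; the struct layout, sizes, and flag value are simplified stand-ins for MultiFDPacketDeviceState_t, not the actual wire format.

```c
#include <arpa/inet.h>  /* htonl/ntohl */
#include <stdint.h>
#include <string.h>

#define MULTIFD_FLAG_DEVICE_STATE (1u << 4)  /* illustrative value */

/* Simplified stand-in for the device-state packet header. */
typedef struct __attribute__((packed)) {
    uint32_t flags;             /* big-endian; marks the packet type */
    char     idstr[256];        /* NUL-padded device id string */
    uint32_t instance_id;       /* big-endian */
    uint32_t next_packet_size;  /* big-endian payload length */
} DevStatePacketHdr;

/* Fill the header; returns -1 if idstr would not fit NUL-terminated,
 * since strncpy would otherwise silently truncate it. */
static int fill_dev_packet(DevStatePacketHdr *p, const char *idstr,
                           uint32_t instance_id, uint32_t payload_len)
{
    if (strlen(idstr) >= sizeof(p->idstr)) {
        return -1;
    }
    memset(p, 0, sizeof(*p));
    p->flags = htonl(MULTIFD_FLAG_DEVICE_STATE);
    strncpy(p->idstr, idstr, sizeof(p->idstr));  /* pads the rest with NULs */
    p->instance_id = htonl(instance_id);
    p->next_packet_size = htonl(payload_len);
    return 0;
}
```

NUL-padding the idstr field matters on the wire: it keeps the packet contents deterministic (no leaked stack bytes) and lets the receiver treat the field as a terminated C string.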
[PATCH RFC 08/26] migration: Allow passing migration header in migration channel creation
From: Avihai Horon Signed-off-by: Avihai Horon [MSS: Rewrite using MFDSendChannelConnectData/PostcopyPChannelConnectData] Signed-off-by: Maciej S. Szmigiero --- migration/multifd.c | 14 -- migration/postcopy-ram.c | 14 -- 2 files changed, 24 insertions(+), 4 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 58a18bb1e4a8..8eecda68ac0f 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -18,6 +18,7 @@ #include "exec/ramblock.h" #include "qemu/error-report.h" #include "qapi/error.h" +#include "channel.h" #include "file.h" #include "migration.h" #include "migration-stats.h" @@ -1014,15 +1015,20 @@ struct MFDSendChannelConnectData { unsigned int ref; MultiFDSendParams *p; QIOChannelTLS *tioc; +MigChannelHeader header; }; -static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p) +static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p, + MigChannelHeader *header) { MFDSendChannelConnectData *data; data = g_malloc0(sizeof(*data)); data->ref = 1; data->p = p; +if (header) { +memcpy(>header, header, sizeof(*header)); +} return data; } @@ -1110,6 +1116,10 @@ bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc, { MultiFDSendParams *p = data->p; +if (migration_channel_header_send(ioc, >header, errp)) { +return false; +} + qio_channel_set_delay(ioc, false); migration_ioc_register_yank(ioc); @@ -1182,7 +1192,7 @@ static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp) { g_autoptr(MFDSendChannelConnectData) data = NULL; -data = mfd_send_channel_connect_data_new(p); +data = mfd_send_channel_connect_data_new(p, NULL); if (!multifd_use_packets()) { return file_send_channel_create(data, errp); diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index 94fe872d8251..53c90344acce 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -19,6 +19,7 @@ #include "qemu/osdep.h" #include "qemu/madvise.h" #include 
"exec/target_page.h" +#include "channel.h" #include "migration.h" #include "qemu-file.h" #include "savevm.h" @@ -1620,15 +1621,20 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file) typedef struct { unsigned int ref; MigrationState *s; +MigChannelHeader header; } PostcopyPChannelConnectData; -static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s) +static PostcopyPChannelConnectData *pcopy_preempt_connect_data_new(MigrationState *s, + MigChannelHeader *header) { PostcopyPChannelConnectData *data; data = g_malloc0(sizeof(*data)); data->ref = 1; data->s = s; +if (header) { +memcpy(>header, header, sizeof(*header)); +} return data; } @@ -1673,6 +1679,10 @@ postcopy_preempt_send_channel_done(PostcopyPChannelConnectData *data, { MigrationState *s = data->s; +if (!local_err) { +migration_channel_header_send(ioc, >header, _err); +} + if (local_err) { migrate_set_error(s, local_err); error_free(local_err); @@ -1766,7 +1776,7 @@ void postcopy_preempt_setup(MigrationState *s) { PostcopyPChannelConnectData *data; -data = pcopy_preempt_connect_data_new(s); +data = pcopy_preempt_connect_data_new(s, NULL); /* Kick an async task to connect */ socket_send_channel_create(postcopy_preempt_send_channel_new, data, pcopy_preempt_connect_data_unref);
[PATCH RFC 20/26] migration: Add MULTIFD_DEVICE_STATE migration channel type
From: "Maciej S. Szmigiero" Signed-off-by: Maciej S. Szmigiero --- migration/channel.h | 1 + 1 file changed, 1 insertion(+) diff --git a/migration/channel.h b/migration/channel.h index 4232ee649939..b985c952550d 100644 --- a/migration/channel.h +++ b/migration/channel.h @@ -33,6 +33,7 @@ typedef enum { MIG_CHANNEL_TYPE_MAIN, MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT, MIG_CHANNEL_TYPE_MULTIFD, +MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE, } MigChannelTypes; typedef struct QEMU_PACKED {
[PATCH RFC 02/26] migration: Add migration channel header send/receive
From: Avihai Horon

Add functions to send and receive migration channel header.

Signed-off-by: Avihai Horon
[MSS: Mark MigChannelHeader as packed, remove device id from it]
Signed-off-by: Maciej S. Szmigiero
---
 migration/channel.c    | 59 ++++++++++++++++++++++++++++++++++++++++++
 migration/channel.h    | 14 ++++++++++
 migration/trace-events |  2 ++
 3 files changed, 75 insertions(+)

diff --git a/migration/channel.c b/migration/channel.c
index f9de064f3b13..a72e85f5791c 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -21,6 +21,7 @@
 #include "io/channel-socket.h"
 #include "qemu/yank.h"
 #include "yank_functions.h"
+#include "options.h"
 
 /**
  * @migration_channel_process_incoming - Create new incoming migration channel
@@ -93,6 +94,64 @@ void migration_channel_connect(MigrationState *s,
     error_free(error);
 }
 
+int migration_channel_header_recv(QIOChannel *ioc, MigChannelHeader *header,
+                                  Error **errp)
+{
+    uint64_t header_size;
+    int ret;
+
+    ret = qio_channel_read_all_eof(ioc, (char *)&header_size,
+                                   sizeof(header_size), errp);
+    if (ret == 0 || ret == -1) {
+        return -1;
+    }
+
+    header_size = be64_to_cpu(header_size);
+    if (header_size > sizeof(*header)) {
+        error_setg(errp,
+                   "Received header of size %lu bytes which is greater than "
+                   "max header size of %lu bytes",
+                   header_size, sizeof(*header));
+        return -EINVAL;
+    }
+
+    ret = qio_channel_read_all_eof(ioc, (char *)header, header_size, errp);
+    if (ret == 0 || ret == -1) {
+        return -1;
+    }
+
+    header->channel_type = be32_to_cpu(header->channel_type);
+
+    trace_migration_channel_header_recv(header->channel_type, header_size);
+
+    return 0;
+}
+
+int migration_channel_header_send(QIOChannel *ioc, MigChannelHeader *header,
+                                  Error **errp)
+{
+    uint64_t header_size = sizeof(*header);
+    int ret;
+
+    if (!migrate_channel_header()) {
+        return 0;
+    }
+
+    trace_migration_channel_header_send(header->channel_type, header_size);
+
+    header_size = cpu_to_be64(header_size);
+    ret = qio_channel_write_all(ioc, (char *)&header_size, sizeof(header_size),
+                                errp);
+    if (ret) {
+        return ret;
+    }
+
+    header->channel_type = cpu_to_be32(header->channel_type);
+
+    return qio_channel_write_all(ioc, (char *)header, sizeof(*header), errp);
+}
 
 /**
  * @migration_channel_read_peek - Peek at migration channel, without
diff --git a/migration/channel.h b/migration/channel.h
index 5bdb8208a744..95d281828aaa 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -29,4 +29,18 @@ int migration_channel_read_peek(QIOChannel *ioc,
                                 const char *buf, const size_t buflen,
                                 Error **errp);
 
+typedef enum {
+    MIG_CHANNEL_TYPE_MAIN,
+} MigChannelTypes;
+
+typedef struct QEMU_PACKED {
+    uint32_t channel_type;
+} MigChannelHeader;
+
+int migration_channel_header_send(QIOChannel *ioc, MigChannelHeader *header,
+                                  Error **errp);
+
+int migration_channel_header_recv(QIOChannel *ioc, MigChannelHeader *header,
+                                  Error **errp);
+
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index f0e1cb80c75b..e48607d5a6a2 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -198,6 +198,8 @@ migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma)
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
 migration_set_outgoing_channel(void *ioc, const char *ioctype, const char *hostname, void *err) "ioc=%p ioctype=%s hostname=%s err=%p"
+migration_channel_header_send(uint32_t channel_type, uint64_t header_size) "Migration channel header send: channel_type=%u, header_size=%lu"
+migration_channel_header_recv(uint32_t channel_type, uint64_t header_size) "Migration channel header recv: channel_type=%u, header_size=%lu"
 # global_state.c
 migrate_state_too_big(void) ""
[PATCH RFC 21/26] migration/multifd: Device state transfer support - receive side
From: "Maciej S. Szmigiero"

Add basic support for receiving device state via multifd channels -
both dedicated ones and ones shared with RAM transfer.

To differentiate between a device state and a RAM packet the packet
header is read first.

Depending on whether the MULTIFD_FLAG_DEVICE_STATE flag is present in the
packet header either device state (MultiFDPacketDeviceState_t) or RAM data
(existing MultiFDPacket_t) is then read.

The received device state data is provided to the
qemu_loadvm_load_state_buffer() function for processing in the device's
load_state_buffer handler.

Signed-off-by: Maciej S. Szmigiero
---
 migration/migration.c |   7 +-
 migration/multifd.c   | 146 ++++++++++++++++++++++++++++++++++++------
 migration/multifd.h   |  34 +++++++++-
 3 files changed, 163 insertions(+), 24 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index e4f82695a338..ea2c8a043a77 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -987,7 +987,7 @@ static void migration_ioc_process_incoming_no_header(QIOChannel *ioc,
     /* Multiple connections */
     assert(migration_needs_multiple_sockets());
     if (migrate_multifd()) {
-        multifd_recv_new_channel(ioc, &local_err);
+        multifd_recv_new_channel(ioc, false, &local_err);
     } else {
         assert(migrate_postcopy_preempt());
         f = qemu_file_new_input(ioc);
@@ -1031,6 +1031,7 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
         postcopy_preempt_new_channel(migration_incoming_get_current(), f);
         break;
     case MIG_CHANNEL_TYPE_MULTIFD:
+    case MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE:
     {
         Error *local_err = NULL;
 
@@ -1039,7 +1040,9 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
             return;
         }
 
-        multifd_recv_new_channel(ioc, &local_err);
+        multifd_recv_new_channel(ioc,
+                                 header.channel_type == MIG_CHANNEL_TYPE_MULTIFD_DEVICE_STATE,
+                                 &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
diff --git a/migration/multifd.c b/migration/multifd.c
index 7118c69a4d49..a26418d87485 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -22,6 +22,7 @@
 #include "file.h"
#include "migration.h" #include "migration-stats.h" +#include "savevm.h" #include "socket.h" #include "tls.h" #include "qemu-file.h" @@ -404,7 +405,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p) uint32_t zero_num = pages->num - pages->normal_num; int i; -packet->flags = cpu_to_be32(p->flags); +packet->hdr.flags = cpu_to_be32(p->flags); packet->pages_alloc = cpu_to_be32(p->pages->allocated); packet->normal_pages = cpu_to_be32(pages->normal_num); packet->zero_pages = cpu_to_be32(zero_num); @@ -432,28 +433,44 @@ void multifd_send_fill_packet(MultiFDSendParams *p) p->flags, p->next_packet_size); } -static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) +static int multifd_recv_unfill_packet_header(MultiFDRecvParams *p, MultiFDPacketHdr_t *hdr, + Error **errp) { -MultiFDPacket_t *packet = p->packet; -int i; - -packet->magic = be32_to_cpu(packet->magic); -if (packet->magic != MULTIFD_MAGIC) { +hdr->magic = be32_to_cpu(hdr->magic); +if (hdr->magic != MULTIFD_MAGIC) { error_setg(errp, "multifd: received packet " "magic %x and expected magic %x", - packet->magic, MULTIFD_MAGIC); + hdr->magic, MULTIFD_MAGIC); return -1; } -packet->version = be32_to_cpu(packet->version); -if (packet->version != MULTIFD_VERSION) { +hdr->version = be32_to_cpu(hdr->version); +if (hdr->version != MULTIFD_VERSION) { error_setg(errp, "multifd: received packet " "version %u and expected version %u", - packet->version, MULTIFD_VERSION); + hdr->version, MULTIFD_VERSION); return -1; } -p->flags = be32_to_cpu(packet->flags); +p->flags = be32_to_cpu(hdr->flags); + +return 0; +} + +static int multifd_recv_unfill_packet_device_state(MultiFDRecvParams *p, Error **errp) +{ +MultiFDPacketDeviceState_t *packet = p->packet_dev_state; + +packet->instance_id = be32_to_cpu(packet->instance_id); +p->next_packet_size = be32_to_cpu(packet->next_packet_size); + +return 0; +} + +static int multifd_recv_unfill_packet_ram(MultiFDRecvParams *p, Error **errp) +{ +MultiFDPacket_t *packet = 
p->packet; +int i; packet->pages_alloc = be32_to_cpu(packet-
[PATCH RFC 14/26] migration/ram: Add load start trace event
From: "Maciej S. Szmigiero" There's a RAM load complete trace event but there wasn't its start equivalent. Signed-off-by: Maciej S. Szmigiero --- migration/ram.c| 1 + migration/trace-events | 1 + 2 files changed, 2 insertions(+) diff --git a/migration/ram.c b/migration/ram.c index 8deb84984f4a..cebb06480d6f 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -4223,6 +4223,7 @@ static int ram_load_precopy(QEMUFile *f) RAM_SAVE_FLAG_ZERO); } +trace_ram_load_start(); while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) { ram_addr_t addr; void *host = NULL, *host_bak = NULL; diff --git a/migration/trace-events b/migration/trace-events index e48607d5a6a2..396c0233cb8c 100644 --- a/migration/trace-events +++ b/migration/trace-events @@ -115,6 +115,7 @@ colo_flush_ram_cache_end(void) "" save_xbzrle_page_skipping(void) "" save_xbzrle_page_overflow(void) "" ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations" +ram_load_start(void) "" ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64 ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu" ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
[PATCH RFC 00/26] Multifd device state transfer support with VFIO consumer
From: "Maciej S. Szmigiero"

VFIO device state transfer is currently done via the main migration
channel. This means that transfers from multiple VFIO devices are done
sequentially and via just a single common migration channel.

Transferring VFIO device state migration data this way reduces performance
and severely impacts the migration downtime (~50%) for VMs that have
multiple such devices with large state size - see the test results below.

However, we already have a way to transfer migration data using multiple
connections - that's what multifd channels are.

Unfortunately, multifd channels are currently utilized for RAM transfer
only. This patch set adds a new framework allowing their use for device
state transfer too.

The wire protocol is based on Avihai's x-channel-header patches, which
introduce a header for migration channels that allows the migration source
to explicitly indicate the migration channel type without having the
target deduce the channel type by peeking in the channel's content.

The new wire protocol can be switched on and off via the
migration.x-channel-header option for compatibility with older QEMU
versions and for testing. Switching the new wire protocol off also
disables device state transfer via multifd channels.

The device state transfer can happen either via the same multifd channels
as RAM data, mixed with RAM data (when
migration.x-multifd-channels-device-state is 0), or exclusively via
dedicated device state transfer channels (when
migration.x-multifd-channels-device-state > 0).

Using dedicated device state transfer multifd channels brings further
performance benefits since these channels don't need to participate in the
RAM sync process.

These patches introduce a few new SaveVMHandlers:
* "save_live_complete_precopy_async{,wait}" handlers that allow a device
  to provide its own asynchronous transmission of the remaining data at
  the end of a precopy phase.
  The "save_live_complete_precopy_async" handler is supposed to start such
  transmission (for example, by launching appropriate threads) while the
  "save_live_complete_precopy_async_wait" handler is supposed to wait
  until such transfer has finished (for example, until the sending threads
  have exited).

* "load_state_buffer" and its caller qemu_loadvm_load_state_buffer() that
  allow providing a device state buffer to an explicitly specified device
  via its idstr and instance id.

* "load_finish" that allows the migration code to poll whether a
  device-specific asynchronous device state loading operation has finished
  before proceeding further in the migration process (with an associated
  condition variable for notification to avoid unnecessary polling).

A VFIO device migration consumer for the new multifd channels device state
migration framework was implemented with a reassembly process for the
multifd received data, since device state packets sent via different
multifd channels can arrive out-of-order.

The VFIO device data loading process happens in a separate thread in order
to avoid blocking a multifd receive thread during this fairly long
process.
Test setup:
Source machine: 2x Xeon Gold 5218 / 192 GiB RAM
                Mellanox ConnectX-7 with 100GbE link
                6.9.0-rc1+ kernel
Target machine: 2x Xeon Platinum 8260 / 376 GiB RAM
                Mellanox ConnectX-7 with 100GbE link
                6.6.0+ kernel
VM: CPU 12cores x 2threads / 15 GiB RAM / 4x Mellanox ConnectX-7 VF

Migration config: 15 multifd channels total
                  the new way had 4 channels dedicated to device state transfer
                  x-return-path=true, x-switchover-ack=true

Downtime with ~400 MiB VFIO total device state size:
                                            TLS off     TLS on
migration.x-channel-header=false (old way)  ~2100 ms    ~2300 ms
migration.x-channel-header=true (new way)   ~1100 ms    ~1200 ms
IMPROVEMENT                                 ~50%        ~50%

This patch set is also available as a git tree:
https://github.com/maciejsszmigiero/qemu/tree/multifd-device-state-transfer-vfio

Avihai Horon (7):
  migration: Add x-channel-header pseudo-capability
  migration: Add migration channel header send/receive
  migration: Add send/receive header for main channel
  migration: Allow passing migration header in migration channel creation
  migration: Add send/receive header for postcopy preempt channel
  migration: Add send/receive header for multifd channel
  migration: Enable x-channel-header pseudo-capability

Maciej S. Szmigiero (19):
  multifd: change multifd_new_send_channel_create() param type
  migration: Add a DestroyNotify parameter to socket_send_channel_create()
  multifd: pass MFDSendChannelConnectData when connecting sending socket
  migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket
  migration/options: Mapped-r
[PATCH RFC 12/26] migration: Enable x-channel-header pseudo-capability
From: Avihai Horon Now that migration channel header has been implemented, enable it. Signed-off-by: Avihai Horon Signed-off-by: Maciej S. Szmigiero --- migration/options.c | 1 - 1 file changed, 1 deletion(-) diff --git a/migration/options.c b/migration/options.c index abb5b485badd..949d8a6c0b62 100644 --- a/migration/options.c +++ b/migration/options.c @@ -386,7 +386,6 @@ bool migrate_channel_header(void) { MigrationState *s = migrate_get_current(); -return false; return s->channel_header; }
[PATCH RFC 09/26] migration: Add send/receive header for postcopy preempt channel
From: Avihai Horon

Add send and receive migration channel header for postcopy preempt
channel.

Signed-off-by: Avihai Horon
[MSS: Adapt to rewritten migration header passing commit]
Signed-off-by: Maciej S. Szmigiero
---
 migration/channel.h      | 1 +
 migration/migration.c    | 5 +++++
 migration/postcopy-ram.c | 5 ++++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/migration/channel.h b/migration/channel.h
index 95d281828aaa..c59ccedc7b6b 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -31,6 +31,7 @@ int migration_channel_read_peek(QIOChannel *ioc,
                                 Error **errp);
 typedef enum {
     MIG_CHANNEL_TYPE_MAIN,
+    MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
 } MigChannelTypes;
 
 typedef struct QEMU_PACKED {
diff --git a/migration/migration.c b/migration/migration.c
index 0eb5b4f4f5a1..ac9ecf1f4f22 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1019,6 +1019,11 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
         migration_incoming_setup(f);
         default_channel = true;
         break;
+    case MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT:
+        assert(migrate_postcopy_preempt());
+        f = qemu_file_new_input(ioc);
+        postcopy_preempt_new_channel(migration_incoming_get_current(), f);
+        break;
     default:
         error_setg(errp, "Received unknown migration channel type %u",
                    header.channel_type);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 53c90344acce..c7e9f7345970 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1775,8 +1775,11 @@ int postcopy_preempt_establish_channel(MigrationState *s)
 void postcopy_preempt_setup(MigrationState *s)
 {
     PostcopyPChannelConnectData *data;
+    MigChannelHeader header = {};
 
-    data = pcopy_preempt_connect_data_new(s, NULL);
+    header.channel_type = MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT;
+
+    data = pcopy_preempt_connect_data_new(s, &header);
     /* Kick an async task to connect */
     socket_send_channel_create(postcopy_preempt_send_channel_new, data,
                                pcopy_preempt_connect_data_unref);
[PATCH RFC 16/26] migration: Add save_live_complete_precopy_async{, wait} handlers
From: "Maciej S. Szmigiero"

These SaveVMHandlers allow a device to provide its own asynchronous
transmission of the remaining data at the end of a precopy phase.

The save_live_complete_precopy_async handler is supposed to start such
transmission (for example, by launching appropriate threads) while the
save_live_complete_precopy_async_wait handler is supposed to wait until
such transfer has finished (for example, until the sending threads have
exited).

Signed-off-by: Maciej S. Szmigiero
---
 include/migration/register.h | 31 +++++++++++++++++++++++
 migration/savevm.c           | 35 +++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index d7b70a8be68c..9d36e35bd612 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -102,6 +102,37 @@ typedef struct SaveVMHandlers {
      */
     int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
 
+    /**
+     * @save_live_complete_precopy_async
+     *
+     * Arranges for handler-specific asynchronous transmission of the
+     * remaining data at the end of a precopy phase. When postcopy is
+     * enabled, devices that support postcopy will skip this step.
+     *
+     * @f: QEMUFile where the handler can synchronously send data before returning
+     * @idstr: this device section idstr
+     * @instance_id: this device section instance_id
+     * @opaque: data pointer passed to register_savevm_live()
+     *
+     * Returns zero to indicate success and negative for error
+     */
+    int (*save_live_complete_precopy_async)(QEMUFile *f,
+                                            char *idstr, uint32_t instance_id,
+                                            void *opaque);
+    /**
+     * @save_live_complete_precopy_async_wait
+     *
+     * Waits for the asynchronous transmission started by the
+     * @save_live_complete_precopy_async handler to complete.
+     * When postcopy is enabled, devices that support postcopy will skip this step.
+     *
+     * @f: QEMUFile where the handler can synchronously send data before returning
+     * @opaque: data pointer passed to register_savevm_live()
+     *
+     * Returns zero to indicate success and negative for error
+     */
+    int (*save_live_complete_precopy_async_wait)(QEMUFile *f, void *opaque);
+
     /* This runs both outside and inside the BQL. */
 
     /**
diff --git a/migration/savevm.c b/migration/savevm.c
index 388d7af7cdd8..fa35504678bf 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1497,6 +1497,27 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     SaveStateEntry *se;
     int ret;
 
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+                         se->ops->has_postcopy(se->opaque)) ||
+            !se->ops->save_live_complete_precopy_async) {
+            continue;
+        }
+
+        save_section_header(f, se, QEMU_VM_SECTION_END);
+
+        ret = se->ops->save_live_complete_precopy_async(f,
+                                                        se->idstr, se->instance_id,
+                                                        se->opaque);
+
+        save_section_footer(f, se);
+
+        if (ret < 0) {
+            qemu_file_set_error(f, ret);
+            return -1;
+        }
+    }
+
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
         if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
@@ -1528,6 +1549,20 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
                                               end_ts_each - start_ts_each);
     }
 
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || (in_postcopy && se->ops->has_postcopy &&
+                         se->ops->has_postcopy(se->opaque)) ||
+            !se->ops->save_live_complete_precopy_async_wait) {
+            continue;
+        }
+
+        ret = se->ops->save_live_complete_precopy_async_wait(f, se->opaque);
+        if (ret < 0) {
+            qemu_file_set_error(f, ret);
+            return -1;
+        }
+    }
+
     trace_vmstate_downtime_checkpoint("src-iterable-saved");
 
     return 0;
[PATCH RFC 25/26] vfio/migration: Multifd device state transfer support - receive side
From: "Maciej S. Szmigiero" The multifd received data needs to be reassembled since device state packets sent via different multifd channels can arrive out-of-order. Therefore, each VFIO device state packet carries a header indicating its position in the stream. The last such VFIO device state packet should have VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state. Since it's important to finish loading device state transferred via the main migration channel (via save_live_iterate handler) before starting loading the data asynchronously transferred via multifd a new VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE flag is introduced to mark the end of the main migration channel data. The device state loading process waits until that flag is seen before commencing loading of the multifd-transferred device state. Signed-off-by: Maciej S. Szmigiero --- hw/vfio/migration.c | 322 +- hw/vfio/trace-events | 9 +- include/hw/vfio/vfio-common.h | 14 ++ 3 files changed, 342 insertions(+), 3 deletions(-) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index bc3aea77455c..3af62dea6899 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -15,6 +15,7 @@ #include #include +#include "io/channel-buffer.h" #include "sysemu/runstate.h" #include "hw/vfio/vfio-common.h" #include "migration/misc.h" @@ -46,6 +47,7 @@ #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xef13ULL) #define VFIO_MIG_FLAG_DEV_DATA_STATE(0xef14ULL) #define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xef15ULL) +#define VFIO_MIG_FLAG_DEV_DATA_STATE_COMPLETE(0xef16ULL) /* * This is an arbitrary size based on migration of mlx5 devices, where typically @@ -54,6 +56,15 @@ */ #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB) +#define VFIO_DEVICE_STATE_CONFIG_STATE (1) + +typedef struct VFIODeviceStatePacket { +uint32_t version; +uint32_t idx; +uint32_t flags; +uint8_t data[0]; +} QEMU_PACKED VFIODeviceStatePacket; + static int64_t bytes_transferred; static const char *mig_state_to_str(enum vfio_device_mig_state 
state)
@@ -186,6 +197,175 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
     return ret;
 }
 
+typedef struct LoadedBuffer {
+    bool is_present;
+    char *data;
+    size_t len;
+} LoadedBuffer;
+
+static void loaded_buffer_clear(gpointer data)
+{
+    LoadedBuffer *lb = data;
+
+    if (!lb->is_present) {
+        return;
+    }
+
+    g_clear_pointer(&lb->data, g_free);
+    lb->is_present = false;
+}
+
+static int vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+                                  Error **errp)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+    g_autoptr(GMutexLocker) locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+    LoadedBuffer *lb;
+
+    if (data_size < sizeof(*packet)) {
+        error_setg(errp, "packet too short at %zu (min is %zu)",
+                   data_size, sizeof(*packet));
+        return -1;
+    }
+
+    if (packet->version != 0) {
+        error_setg(errp, "packet has unknown version %" PRIu32,
+                   packet->version);
+        return -1;
+    }
+
+    if (packet->idx == UINT32_MAX) {
+        error_setg(errp, "packet has too high idx %" PRIu32,
+                   packet->idx);
+        return -1;
+    }
+
+    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
+
+    /* config state packet should be the last one in the stream */
+    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+        migration->load_buf_idx_last = packet->idx;
+    }
+
+    assert(migration->load_bufs);
+    if (packet->idx >= migration->load_bufs->len) {
+        g_array_set_size(migration->load_bufs, packet->idx + 1);
+    }
+
+    lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
+    if (lb->is_present) {
+        error_setg(errp, "state buffer %" PRIu32 " already filled", packet->idx);
+        return -1;
+    }
+
+    assert(packet->idx >= migration->load_buf_idx);
+
+    lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
+    lb->len = data_size - sizeof(*packet);
+    lb->is_present = true;
+
+    g_cond_broadcast(&migration->load_bufs_buffer_ready_cond);
+
+    return 0;
+}
+
+static void *vfio_load_bufs_thread(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    Error **errp = &migration->load_bufs_thread_errp;
+    g_autoptr(GMutexLocker) locker = g_mutex_locker_new(&migration->load_bufs_mutex);
+    LoadedBuffer *lb;
+
+    while (!migration->load_bufs_device_ready &&
+
[PATCH RFC 10/26] migration: Add send/receive header for multifd channel
From: Avihai Horon

Add send and receive migration channel header for multifd channel.

Signed-off-by: Avihai Horon
[MSS: Adapt to rewritten migration header passing commit]
Signed-off-by: Maciej S. Szmigiero
---
 migration/channel.h   |  1 +
 migration/migration.c | 16 ++++++++++++++++
 migration/multifd.c   |  4 +++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/migration/channel.h b/migration/channel.h
index c59ccedc7b6b..4232ee649939 100644
--- a/migration/channel.h
+++ b/migration/channel.h
@@ -32,6 +32,7 @@ int migration_channel_read_peek(QIOChannel *ioc,
 typedef enum {
     MIG_CHANNEL_TYPE_MAIN,
     MIG_CHANNEL_TYPE_POSTCOPY_PREEMPT,
+    MIG_CHANNEL_TYPE_MULTIFD,
 } MigChannelTypes;
 
 typedef struct QEMU_PACKED {
diff --git a/migration/migration.c b/migration/migration.c
index ac9ecf1f4f22..8fe8be71a0e3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1024,6 +1024,22 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
         f = qemu_file_new_input(ioc);
         postcopy_preempt_new_channel(migration_incoming_get_current(), f);
         break;
+    case MIG_CHANNEL_TYPE_MULTIFD:
+    {
+        Error *local_err = NULL;
+
+        assert(migrate_multifd());
+        if (multifd_recv_setup(errp) != 0) {
+            return;
+        }
+
+        multifd_recv_new_channel(ioc, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+        break;
+    }
     default:
         error_setg(errp, "Received unknown migration channel type %u",
                    header.channel_type);
diff --git a/migration/multifd.c b/migration/multifd.c
index 8eecda68ac0f..c2575e3d6dbf 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1191,8 +1191,10 @@ out:
 static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp)
 {
     g_autoptr(MFDSendChannelConnectData) data = NULL;
+    MigChannelHeader header = {};
 
-    data = mfd_send_channel_connect_data_new(p, NULL);
+    header.channel_type = MIG_CHANNEL_TYPE_MULTIFD;
+    data = mfd_send_channel_connect_data_new(p, &header);
 
     if (!multifd_use_packets()) {
         return file_send_channel_create(data, errp);
[PATCH RFC 11/26] migration/options: Mapped-ram is not channel header compatible
From: "Maciej S. Szmigiero" Mapped-ram is only available for multifd migration without channel header - add an appropriate check to migration options. Signed-off-by: Maciej S. Szmigiero --- migration/options.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/migration/options.c b/migration/options.c index 8fd871cd956d..abb5b485badd 100644 --- a/migration/options.c +++ b/migration/options.c @@ -1284,6 +1284,13 @@ bool migrate_params_check(MigrationParameters *params, Error **errp) return false; } +if (migrate_mapped_ram() && +params->has_multifd_channels && migrate_channel_header()) { +error_setg(errp, + "Mapped-ram only available for multifd migration without channel header"); +return false; +} + if (params->has_x_vcpu_dirty_limit_period && (params->x_vcpu_dirty_limit_period < 1 || params->x_vcpu_dirty_limit_period > 1000)) {
[PATCH RFC 17/26] migration: Add qemu_loadvm_load_state_buffer() and its handler
From: "Maciej S. Szmigiero" qemu_loadvm_load_state_buffer() and its load_state_buffer SaveVMHandler allow providing device state buffer to explicitly specified device via its idstr and instance id. Signed-off-by: Maciej S. Szmigiero --- include/migration/register.h | 15 +++ migration/savevm.c | 25 + migration/savevm.h | 3 +++ 3 files changed, 43 insertions(+) diff --git a/include/migration/register.h b/include/migration/register.h index 9d36e35bd612..7d29b7e0b559 100644 --- a/include/migration/register.h +++ b/include/migration/register.h @@ -257,6 +257,21 @@ typedef struct SaveVMHandlers { */ int (*load_state)(QEMUFile *f, void *opaque, int version_id); +/** + * @load_state_buffer + * + * Load device state buffer provided to qemu_loadvm_load_state_buffer(). + * + * @opaque: data pointer passed to register_savevm_live() + * @data: the data buffer to load + * @data_size: the data length in buffer + * @errp: pointer to Error*, to store an error if it happens. + * + * Returns zero to indicate success and negative for error + */ +int (*load_state_buffer)(void *opaque, char *data, size_t data_size, + Error **errp); + /** * @load_setup * diff --git a/migration/savevm.c b/migration/savevm.c index fa35504678bf..2e4d63faca06 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -3073,6 +3073,31 @@ int qemu_loadvm_approve_switchover(void) return migrate_send_rp_switchover_ack(mis); } +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id, + char *buf, size_t len, Error **errp) +{ +SaveStateEntry *se; + +se = find_se(idstr, instance_id); +if (!se) { +error_setg(errp, "Unknown idstr %s or instance id %u for load state buffer", + idstr, instance_id); +return -1; +} + +if (!se->ops || !se->ops->load_state_buffer) { +error_setg(errp, "idstr %s / instance %u has no load state buffer operation", + idstr, instance_id); +return -1; +} + +if (se->ops->load_state_buffer(se->opaque, buf, len, errp) != 0) { +return -1; +} + +return 0; +} + bool 
save_snapshot(const char *name, bool overwrite, const char *vmstate, bool has_devices, strList *devices, Error **errp) { diff --git a/migration/savevm.h b/migration/savevm.h index 74669733dd63..c879ba8c970e 100644 --- a/migration/savevm.h +++ b/migration/savevm.h @@ -70,4 +70,7 @@ int qemu_loadvm_approve_switchover(void); int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, bool in_postcopy, bool inactivate_disks); +int qemu_loadvm_load_state_buffer(const char *idstr, uint32_t instance_id, + char *buf, size_t len, Error **errp); + #endif
[PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability
From: Avihai Horon Add x-channel-header pseudo-capability which indicates that a header should be sent through migration channels. The header is the first thing to be sent through a migration channel and it allows the destination to differentiate between the various channels (main, multifd and preempt). This eliminates the need to deduce the channel type by peeking in the channel's content, which can be done only on a best-effort basis. It will also allow other devices to create their own channels in the future. This patch only adds the pseudo-capability and sets it to false always. The following patches will add the actual functionality, after which it will be enabled.. Signed-off-by: Avihai Horon Signed-off-by: Maciej S. Szmigiero --- hw/core/machine.c | 1 + migration/migration.h | 3 +++ migration/options.c | 9 + migration/options.h | 1 + 4 files changed, 14 insertions(+) diff --git a/hw/core/machine.c b/hw/core/machine.c index 37ede0e7d4fd..fa28c49f55b7 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -37,6 +37,7 @@ GlobalProperty hw_compat_8_2[] = { { "migration", "zero-page-detection", "legacy"}, { TYPE_VIRTIO_IOMMU_PCI, "granule", "4k" }, { TYPE_VIRTIO_IOMMU_PCI, "aw-bits", "64" }, +{ "migration", "channel_header", "off" }, }; const size_t hw_compat_8_2_len = G_N_ELEMENTS(hw_compat_8_2); diff --git a/migration/migration.h b/migration/migration.h index 8045e39c26fa..a6114405917f 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -450,6 +450,9 @@ struct MigrationState { */ uint8_t clear_bitmap_shift; +/* Whether a header is sent in migration channels */ +bool channel_header; + /* * This save hostname when out-going migration starts */ diff --git a/migration/options.c b/migration/options.c index bfd7753b69a5..8fd871cd956d 100644 --- a/migration/options.c +++ b/migration/options.c @@ -100,6 +100,7 @@ Property migration_properties[] = { clear_bitmap_shift, CLEAR_BITMAP_SHIFT_DEFAULT), DEFINE_PROP_BOOL("x-preempt-pre-7-2", MigrationState, 
preempt_pre_7_2, false), +DEFINE_PROP_BOOL("x-channel-header", MigrationState, channel_header, true), /* Migration parameters */ DEFINE_PROP_UINT8("x-compress-level", MigrationState, @@ -381,6 +382,14 @@ bool migrate_zero_copy_send(void) /* pseudo capabilities */ +bool migrate_channel_header(void) +{ +MigrationState *s = migrate_get_current(); + +return false; +return s->channel_header; +} + bool migrate_multifd_flush_after_each_section(void) { MigrationState *s = migrate_get_current(); diff --git a/migration/options.h b/migration/options.h index ab8199e20784..1144d72ec0db 100644 --- a/migration/options.h +++ b/migration/options.h @@ -52,6 +52,7 @@ bool migrate_zero_copy_send(void); * check, but they are not a capability. */ +bool migrate_channel_header(void); bool migrate_multifd_flush_after_each_section(void); bool migrate_postcopy(void); bool migrate_rdma(void);
[PATCH RFC 13/26] vfio/migration: Add save_{iterate, complete_precopy}_started trace events
From: "Maciej S. Szmigiero" This way both the start and end points of migrating a particular VFIO device are known. Add also a vfio_save_iterate_empty_hit trace event so it is known when there's no more data to send for that device. Signed-off-by: Maciej S. Szmigiero --- hw/vfio/migration.c | 13 + hw/vfio/trace-events | 3 +++ include/hw/vfio/vfio-common.h | 3 +++ 3 files changed, 19 insertions(+) diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 1149c6b3740f..bc3aea77455c 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -394,6 +394,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque) return -ENOMEM; } +migration->save_iterate_run = false; +migration->save_iterate_empty_hit = false; + if (vfio_precopy_supported(vbasedev)) { int ret; @@ -515,9 +518,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque) VFIOMigration *migration = vbasedev->migration; ssize_t data_size; +if (!migration->save_iterate_run) { +trace_vfio_save_iterate_started(vbasedev->name); +migration->save_iterate_run = true; +} + data_size = vfio_save_block(f, migration); if (data_size < 0) { return data_size; +} else if (data_size == 0 && !migration->save_iterate_empty_hit) { +trace_vfio_save_iterate_empty_hit(vbasedev->name); +migration->save_iterate_empty_hit = true; } vfio_update_estimated_pending_data(migration, data_size); @@ -542,6 +553,8 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) ssize_t data_size; int ret; +trace_vfio_save_complete_precopy_started(vbasedev->name); + /* We reach here with device state STOP or STOP_COPY only */ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY, VFIO_DEVICE_STATE_STOP); diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index f0474b244bf0..a72697678256 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -157,8 +157,11 @@ vfio_migration_state_notifier(const char *name, int state) " (%s) state %d" vfio_save_block(const char *name, int data_size) " (%s) data_size %d" 
vfio_save_cleanup(const char *name) " (%s)" vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d" +vfio_save_complete_precopy_started(const char *name) " (%s)" vfio_save_device_config_state(const char *name) " (%s)" vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 +vfio_save_iterate_started(const char *name) " (%s)" +vfio_save_iterate_empty_hit(const char *name) " (%s)" vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64 vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index b9da6c08ef41..9bb523249e73 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -71,6 +71,9 @@ typedef struct VFIOMigration { uint64_t precopy_init_size; uint64_t precopy_dirty_size; bool initial_data_sent; + +bool save_iterate_run; +bool save_iterate_empty_hit; } VFIOMigration; struct VFIOGroup;
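The two guard flags added above act as one-shot latches: each trace event fires only on the first qualifying vfio_save_iterate() call per migration run, then goes silent. A minimal standalone sketch of that latch logic (the names here are made up for illustration and are not the QEMU trace API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of the patch's latch logic: each "trace event"
 * fires at most once per migration run. */
typedef struct {
    bool iterate_run;   /* mirrors migration->save_iterate_run */
    bool empty_hit;     /* mirrors migration->save_iterate_empty_hit */
} MigState;

/* Returns a bitmask: bit 0 = "started" traced, bit 1 = "empty hit" traced. */
static int save_iterate_step(MigState *s, long data_size)
{
    int fired = 0;

    if (!s->iterate_run) {
        fired |= 1;          /* would call trace_vfio_save_iterate_started() */
        s->iterate_run = true;
    }
    if (data_size == 0 && !s->empty_hit) {
        fired |= 2;          /* would call trace_vfio_save_iterate_empty_hit() */
        s->empty_hit = true;
    }
    return fired;
}
```

Subsequent iterations with non-zero or zero data sizes produce no further events, which keeps the trace log to exactly one start marker and one empty-hit marker per device per run.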
[PATCH RFC 04/26] multifd: change multifd_new_send_channel_create() param type
From: "Maciej S. Szmigiero" This function is called only with MultiFDSendParams type param so use this type explicitly instead of using an opaque pointer. Signed-off-by: Maciej S. Szmigiero --- migration/multifd.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 2802afe79d0d..039c0de40af5 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -1132,13 +1132,13 @@ out: error_free(local_err); } -static bool multifd_new_send_channel_create(gpointer opaque, Error **errp) +static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp) { if (!multifd_use_packets()) { -return file_send_channel_create(opaque, errp); +return file_send_channel_create(p, errp); } -socket_send_channel_create(multifd_new_send_channel_async, opaque); +socket_send_channel_create(multifd_new_send_channel_async, p); return true; }
[PATCH RFC 06/26] multifd: pass MFDSendChannelConnectData when connecting sending socket
From: "Maciej S. Szmigiero" This will allow passing additional parameters there in the future. Signed-off-by: Maciej S. Szmigiero --- migration/file.c| 5 ++- migration/multifd.c | 95 ++--- migration/multifd.h | 4 +- 3 files changed, 80 insertions(+), 24 deletions(-) diff --git a/migration/file.c b/migration/file.c index ab18ba505a1d..34dfbc4a5a2d 100644 --- a/migration/file.c +++ b/migration/file.c @@ -62,7 +62,10 @@ bool file_send_channel_create(gpointer opaque, Error **errp) goto out; } -multifd_channel_connect(opaque, QIO_CHANNEL(ioc)); +ret = multifd_channel_connect(opaque, QIO_CHANNEL(ioc), errp); +if (!ret) { +object_unref(OBJECT(ioc)); +} out: /* diff --git a/migration/multifd.c b/migration/multifd.c index 4bc912d7500e..58a18bb1e4a8 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -1010,34 +1010,76 @@ out: return NULL; } -static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque); - -typedef struct { +struct MFDSendChannelConnectData { +unsigned int ref; MultiFDSendParams *p; QIOChannelTLS *tioc; -} MultiFDTLSThreadArgs; +}; + +static MFDSendChannelConnectData *mfd_send_channel_connect_data_new(MultiFDSendParams *p) +{ +MFDSendChannelConnectData *data; + +data = g_malloc0(sizeof(*data)); +data->ref = 1; +data->p = p; + +return data; +} + +static void mfd_send_channel_connect_data_free(MFDSendChannelConnectData *data) +{ +g_free(data); +} + +static MFDSendChannelConnectData * +mfd_send_channel_connect_data_ref(MFDSendChannelConnectData *data) +{ +unsigned int ref_old; + +ref_old = qatomic_fetch_inc(&data->ref); +assert(ref_old < UINT_MAX); + +return data; +} + +static void mfd_send_channel_connect_data_unref(gpointer opaque) +{ +MFDSendChannelConnectData *data = opaque; +unsigned int ref_old; + +ref_old = qatomic_fetch_dec(&data->ref); +assert(ref_old > 0); +if (ref_old == 1) { +mfd_send_channel_connect_data_free(data); +} +} + +G_DEFINE_AUTOPTR_CLEANUP_FUNC(MFDSendChannelConnectData, mfd_send_channel_connect_data_unref) + +static void
multifd_new_send_channel_async(QIOTask *task, gpointer opaque); static void *multifd_tls_handshake_thread(void *opaque) { -MultiFDTLSThreadArgs *args = opaque; +g_autoptr(MFDSendChannelConnectData) data = opaque; +QIOChannelTLS *tioc = data->tioc; -qio_channel_tls_handshake(args->tioc, +qio_channel_tls_handshake(tioc, multifd_new_send_channel_async, - args->p, - NULL, + g_steal_pointer(&data), + mfd_send_channel_connect_data_unref, NULL); -g_free(args); return NULL; } -static bool multifd_tls_channel_connect(MultiFDSendParams *p, +static bool multifd_tls_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc, Error **errp) { +MultiFDSendParams *p = data->p; MigrationState *s = migrate_get_current(); const char *hostname = s->hostname; -MultiFDTLSThreadArgs *args; QIOChannelTLS *tioc; tioc = migration_tls_client_create(ioc, hostname, errp); @@ -1053,19 +1095,21 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p, trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname); qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing"); -args = g_new0(MultiFDTLSThreadArgs, 1); -args->tioc = tioc; -args->p = p; +data->tioc = tioc; p->tls_thread_created = true; qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker", - multifd_tls_handshake_thread, args, + multifd_tls_handshake_thread, + mfd_send_channel_connect_data_ref(data), QEMU_THREAD_JOINABLE); return true; } -void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc) +bool multifd_channel_connect(MFDSendChannelConnectData *data, QIOChannel *ioc, + Error **errp) { +MultiFDSendParams *p = data->p; + qio_channel_set_delay(ioc, false); migration_ioc_register_yank(ioc); @@ -1075,6 +1119,8 @@ void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc) p->thread_created = true; qemu_thread_create(&p->thread, p->name, multifd_send_thread, p, QEMU_THREAD_JOINABLE); + +return true; } /* @@ -1085,7 +1131,8 @@ void multifd_channel_connect(MultiFDSendParams *p, QIOChannel *ioc) */
static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque) { -MultiFDSendParams *p = opaque; +MFDSendChannelConnectData *data = opaque;
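The ref/unref helpers in this patch implement a plain atomic reference count so the connect-context object can be shared safely between the TLS handshake thread and the completion callback, whichever finishes last frees it. A standalone sketch of the same pattern using C11 atomics in place of QEMU's qatomic helpers (all names here are made up for illustration):

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for MFDSendChannelConnectData: a context object
 * shared between a worker thread and an async completion callback. */
typedef struct {
    atomic_uint ref;
    int channel_id;   /* placeholder for the MultiFDSendParams pointer */
} ConnectData;

static ConnectData *connect_data_new(int id)
{
    ConnectData *d = calloc(1, sizeof(*d));
    atomic_init(&d->ref, 1);      /* creator holds the first reference */
    d->channel_id = id;
    return d;
}

static ConnectData *connect_data_ref(ConnectData *d)
{
    unsigned old = atomic_fetch_add(&d->ref, 1);
    assert(old < UINT_MAX);       /* overflow guard, as in the patch */
    return d;
}

static void connect_data_unref(ConnectData *d)
{
    unsigned old = atomic_fetch_sub(&d->ref, 1);
    assert(old > 0);
    if (old == 1) {               /* dropped the last reference: free */
        free(d);
    }
}
```

The key property, mirrored from the patch, is that ref() is called before handing the pointer to another thread, so the count can never drop to zero while a second user still holds it.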
[PATCH RFC 03/26] migration: Add send/receive header for main channel
From: Avihai Horon Add send and receive migration channel header for main channel. Signed-off-by: Avihai Horon [MSS: Rename main channel -> default channel where it matches the current term] Signed-off-by: Maciej S. Szmigiero --- migration/channel.c | 9 + migration/migration.c | 82 +++ 2 files changed, 84 insertions(+), 7 deletions(-) diff --git a/migration/channel.c b/migration/channel.c index a72e85f5791c..0e3f51654752 100644 --- a/migration/channel.c +++ b/migration/channel.c @@ -81,6 +81,13 @@ void migration_channel_connect(MigrationState *s, return; } } else { +/* TODO: Send header after register yank? Make a QEMUFile variant? */ +MigChannelHeader header = {}; +header.channel_type = MIG_CHANNEL_TYPE_MAIN; +if (migration_channel_header_send(ioc, &header, &error)) { +goto out; +} + QEMUFile *f = qemu_file_new_output(ioc); migration_ioc_register_yank(ioc); @@ -90,6 +97,8 @@ qemu_mutex_unlock(&s->qemu_file_lock); } } + +out: migrate_fd_connect(s, error); error_free(error); } diff --git a/migration/migration.c b/migration/migration.c index 86bf76e92585..0eb5b4f4f5a1 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -869,12 +869,39 @@ void migration_fd_process_incoming(QEMUFile *f) migration_incoming_process(); } +static bool migration_should_start_incoming_header(bool main_channel) +{ +MigrationIncomingState *mis = migration_incoming_get_current(); + +if (!mis->from_src_file) { +return false; +} + +if (migrate_multifd()) { +return multifd_recv_all_channels_created(); +} + +if (migrate_postcopy_preempt() && migrate_get_current()->preempt_pre_7_2) { +return mis->postcopy_qemufile_dst != NULL; +} + +if (migrate_postcopy_preempt()) { +return main_channel; +} + +return true; +} + /* * Returns true when we want to start a new incoming migration process, * false otherwise.
*/ static bool migration_should_start_incoming(bool main_channel) { +if (migrate_channel_header()) { +return migration_should_start_incoming_header(main_channel); +} + /* Multifd doesn't start unless all channels are established */ if (migrate_multifd()) { return migration_has_all_channels(); @@ -894,7 +921,22 @@ static bool migration_should_start_incoming(bool main_channel) return true; } -void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp) +static void migration_start_incoming(bool main_channel) +{ +if (!migration_should_start_incoming(main_channel)) { +return; +} + +/* If it's a recovery, we're done */ +if (postcopy_try_recover()) { +return; +} + +migration_incoming_process(); +} + +static void migration_ioc_process_incoming_no_header(QIOChannel *ioc, + Error **errp) { MigrationIncomingState *mis = migration_incoming_get_current(); Error *local_err = NULL; @@ -951,13 +993,39 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp) } } -if (migration_should_start_incoming(default_channel)) { -/* If it's a recovery, we're done */ -if (postcopy_try_recover()) { -return; -} -migration_incoming_process(); +migration_start_incoming(default_channel); +} + +void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp) +{ +MigChannelHeader header = {}; +bool default_channel = false; +QEMUFile *f; +int ret; + +if (!migrate_channel_header()) { +migration_ioc_process_incoming_no_header(ioc, errp); +return; +} + +ret = migration_channel_header_recv(ioc, &header, errp); +if (ret) { +return; +} + +switch (header.channel_type) { +case MIG_CHANNEL_TYPE_MAIN: +f = qemu_file_new_input(ioc); +migration_incoming_setup(f); +default_channel = true; +break; +default: +error_setg(errp, "Received unknown migration channel type %u", + header.channel_type); +return; } + +migration_start_incoming(default_channel); } /**
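The channel header turns channel identification into an explicit handshake: the source prefixes each connection with a small typed header, and the destination reads it and dispatches on the channel type instead of guessing from arrival order. A rough standalone sketch of such framing (the wire layout, magic value, and names below are invented; the real MigChannelHeader is defined elsewhere in this series, and a real wire format would also fix the byte order):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header: a magic value plus a channel-type field. */
enum { CH_MAIN = 0, CH_MULTIFD = 1 };
#define HDR_MAGIC 0x4d494748u   /* arbitrary magic, made up for this sketch */

typedef struct {
    uint32_t magic;
    uint32_t channel_type;
} WireHeader;

/* Sender side: serialize the header in front of the channel's payload. */
static size_t header_send(uint8_t *buf, uint32_t type)
{
    WireHeader h = { HDR_MAGIC, type };

    memcpy(buf, &h, sizeof(h));
    return sizeof(h);
}

/* Receiver side: returns the channel type, or -1 on a malformed header. */
static int header_recv(const uint8_t *buf, size_t len)
{
    WireHeader h;

    if (len < sizeof(h)) {
        return -1;
    }
    memcpy(&h, buf, sizeof(h));
    if (h.magic != HDR_MAGIC) {
        return -1;
    }
    return (int)h.channel_type;
}
```

With this in place the receiver's dispatch becomes a switch on the decoded type, exactly as migration_ioc_process_incoming() does above for MIG_CHANNEL_TYPE_MAIN, with unknown types rejected up front.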
[PATCH RFC 05/26] migration: Add a DestroyNotify parameter to socket_send_channel_create()
From: "Maciej S. Szmigiero" Makes managing the memory easier. Signed-off-by: Maciej S. Szmigiero --- migration/multifd.c | 2 +- migration/postcopy-ram.c | 2 +- migration/socket.c | 6 -- migration/socket.h | 3 ++- 4 files changed, 8 insertions(+), 5 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 039c0de40af5..4bc912d7500e 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -1138,7 +1138,7 @@ static bool multifd_new_send_channel_create(MultiFDSendParams *p, Error **errp) return file_send_channel_create(p, errp); } -socket_send_channel_create(multifd_new_send_channel_async, p); +socket_send_channel_create(multifd_new_send_channel_async, p, NULL); return true; } diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index eccff499cb20..e314e1023dc1 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -1715,7 +1715,7 @@ int postcopy_preempt_establish_channel(MigrationState *s) void postcopy_preempt_setup(MigrationState *s) { /* Kick an async task to connect */ -socket_send_channel_create(postcopy_preempt_send_channel_new, s); +socket_send_channel_create(postcopy_preempt_send_channel_new, s, NULL); } static void postcopy_pause_ram_fast_load(MigrationIncomingState *mis) diff --git a/migration/socket.c b/migration/socket.c index 9ab89b1e089b..6639581cf18d 100644 --- a/migration/socket.c +++ b/migration/socket.c @@ -35,11 +35,13 @@ struct SocketOutgoingArgs { SocketAddress *saddr; } outgoing_args; -void socket_send_channel_create(QIOTaskFunc f, void *data) +void socket_send_channel_create(QIOTaskFunc f, +void *data, GDestroyNotify data_destroy) { QIOChannelSocket *sioc = qio_channel_socket_new(); + qio_channel_socket_connect_async(sioc, outgoing_args.saddr, - f, data, NULL, NULL); + f, data, data_destroy, NULL); } QIOChannel *socket_send_channel_create_sync(Error **errp) diff --git a/migration/socket.h b/migration/socket.h index 46c233ecd29e..114ab34176aa 100644 --- a/migration/socket.h +++ 
b/migration/socket.h @@ -21,7 +21,8 @@ #include "io/task.h" #include "qemu/sockets.h" -void socket_send_channel_create(QIOTaskFunc f, void *data); +void socket_send_channel_create(QIOTaskFunc f, +void *data, GDestroyNotify data_destroy); QIOChannel *socket_send_channel_create_sync(Error **errp); void socket_start_incoming_migration(SocketAddress *saddr, Error **errp);
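A GDestroyNotify decouples the payload's lifetime from the caller: the async API takes ownership of `data` and guarantees the notifier runs exactly once when the callback chain is done, which is what makes the memory "easier to manage" here. A tiny stand-in for that contract (run_async below is a made-up synchronous substitute for qio_channel_socket_connect_async(); the real call completes asynchronously):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the GDestroyNotify contract: whoever accepts (task, data,
 * destroy) promises to invoke destroy(data) exactly once at the end. */
typedef void (*task_fn)(void *data);
typedef void (*destroy_fn)(void *data);

static int task_runs;
static int destroys;

static void demo_task(void *data)     { (void)data; task_runs++; }
static void demo_destroy(void *data)  { (void)data; destroys++; }

static void run_async(task_fn f, void *data, destroy_fn destroy)
{
    if (f) {
        f(data);
    }
    if (destroy) {
        destroy(data);   /* fires once, after the task, like the real API */
    }
}
```

Passing NULL as the notifier, as the two existing callers in this patch do, keeps their behavior exactly as before; only new callers that want automatic cleanup pass a real function.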
[PULL 1/3] hv-balloon: avoid alloca() usage
From: "Maciej S. Szmigiero" alloca() is frowned upon; replace it with g_malloc0() + g_autofree. Reviewed-by: Philippe Mathieu-Daudé Reviewed-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/hyperv/hv-balloon.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index ade283335a68..35333dab2434 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -366,7 +366,7 @@ static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc) PageRangeTree dtree; uint64_t *dctr; bool our_range; -struct dm_unballoon_request *ur; +g_autofree struct dm_unballoon_request *ur = NULL; size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]); PageRange range; bool bret; @@ -388,8 +388,7 @@ static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc) assert(dtree.t); assert(dctr); -ur = alloca(ur_size); -memset(ur, 0, ur_size); +ur = g_malloc0(ur_size); ur->hdr.type = DM_UNBALLOON_REQUEST; ur->hdr.size = ur_size; ur->hdr.trans_id = balloon->trans_id; @@ -531,7 +530,7 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc) PageRange *hot_add_range = &balloon->hot_add_range; uint64_t *current_count = &balloon->ha_current_count; VMBusChannel *chan = hv_balloon_get_channel(balloon); -struct dm_hot_add *ha; +g_autofree struct dm_hot_add *ha = NULL; size_t ha_size = sizeof(*ha) + sizeof(ha->range); union dm_mem_page_range *ha_region; uint64_t align, chunk_max_size; @@ -560,9 +559,8 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc) */ *current_count = MIN(hot_add_range->count, chunk_max_size); -ha = alloca(ha_size); +ha = g_malloc0(ha_size); ha_region = &(&ha->range)[1]; -memset(ha, 0, ha_size); ha->hdr.type = DM_MEM_HOT_ADD_REQUEST; ha->hdr.size = ha_size; ha->hdr.trans_id = balloon->trans_id;
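The pattern being replaced is a variable-size message, a fixed header plus a trailing array, built on the stack with alloca() and zeroed by hand. A minimal illustration of the heap-allocated equivalent, using plain calloc()/free() where the patch uses g_malloc0() with g_autofree for automatic cleanup (the struct here is a simplified stand-in for dm_unballoon_request, not the real protocol layout):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for a variable-size protocol message: fixed header
 * followed by a variable-length range array (flexible array member). */
struct range {
    uint64_t start;
    uint64_t count;
};

struct request {
    uint32_t type;
    uint32_t size;
    struct range range_array[];
};

/* Heap allocation replaces `r = alloca(sz); memset(r, 0, sz);`:
 * calloc() zeroes the whole message, including the trailing array. */
static struct request *request_new(size_t nranges)
{
    size_t sz = sizeof(struct request) + nranges * sizeof(struct range);
    struct request *r = calloc(1, sz);

    r->size = (uint32_t)sz;
    return r;
}
```

With g_autofree (or an explicit free() here), the allocation is released on every exit path, which alloca() only appeared to give for free while also risking stack overflow for attacker-influenced sizes.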
[PULL 3/3] vmbus: Print a warning when enabled without the recommended set of features
From: "Maciej S. Szmigiero" Some Windows versions crash at boot or fail to enable the VMBus device if they don't see the expected set of Hyper-V features (enlightenments). Since this provides poor user experience let's warn user if the VMBus device is enabled without the recommended set of Hyper-V features. The recommended set is the minimum set of Hyper-V features required to make the VMBus device work properly in Windows Server versions 2016, 2019 and 2022. Acked-by: Paolo Bonzini Signed-off-by: Maciej S. Szmigiero --- hw/hyperv/hyperv.c| 12 hw/hyperv/vmbus.c | 6 ++ include/hw/hyperv/hyperv.h| 4 target/i386/kvm/hyperv-stub.c | 4 target/i386/kvm/hyperv.c | 5 + target/i386/kvm/hyperv.h | 2 ++ target/i386/kvm/kvm.c | 7 +++ 7 files changed, 40 insertions(+) diff --git a/hw/hyperv/hyperv.c b/hw/hyperv/hyperv.c index 6c4a18dd0e2a..3ea54ba818b2 100644 --- a/hw/hyperv/hyperv.c +++ b/hw/hyperv/hyperv.c @@ -951,3 +951,15 @@ uint64_t hyperv_syndbg_query_options(void) return msg.u.query_options.options; } + +static bool vmbus_recommended_features_enabled; + +bool hyperv_are_vmbus_recommended_features_enabled(void) +{ +return vmbus_recommended_features_enabled; +} + +void hyperv_set_vmbus_recommended_features_enabled(void) +{ +vmbus_recommended_features_enabled = true; +} diff --git a/hw/hyperv/vmbus.c b/hw/hyperv/vmbus.c index 380239af2c7b..f33afeeea27d 100644 --- a/hw/hyperv/vmbus.c +++ b/hw/hyperv/vmbus.c @@ -2631,6 +2631,12 @@ static void vmbus_bridge_realize(DeviceState *dev, Error **errp) return; } +if (!hyperv_are_vmbus_recommended_features_enabled()) { +warn_report("VMBus enabled without the recommended set of Hyper-V features: " +"hv-stimer, hv-vapic and hv-runtime. 
" +"Some Windows versions might not boot or enable the VMBus device"); +} + bridge->bus = VMBUS(qbus_new(TYPE_VMBUS, dev, "vmbus")); } diff --git a/include/hw/hyperv/hyperv.h b/include/hw/hyperv/hyperv.h index 015c3524b1c2..d717b4e13d40 100644 --- a/include/hw/hyperv/hyperv.h +++ b/include/hw/hyperv/hyperv.h @@ -139,4 +139,8 @@ typedef struct HvSynDbgMsg { } HvSynDbgMsg; typedef uint16_t (*HvSynDbgHandler)(void *context, HvSynDbgMsg *msg); void hyperv_set_syndbg_handler(HvSynDbgHandler handler, void *context); + +bool hyperv_are_vmbus_recommended_features_enabled(void); +void hyperv_set_vmbus_recommended_features_enabled(void); + #endif diff --git a/target/i386/kvm/hyperv-stub.c b/target/i386/kvm/hyperv-stub.c index 778ed782e6fc..3263dcf05d31 100644 --- a/target/i386/kvm/hyperv-stub.c +++ b/target/i386/kvm/hyperv-stub.c @@ -52,3 +52,7 @@ void hyperv_x86_synic_reset(X86CPU *cpu) void hyperv_x86_synic_update(X86CPU *cpu) { } + +void hyperv_x86_set_vmbus_recommended_features_enabled(void) +{ +} diff --git a/target/i386/kvm/hyperv.c b/target/i386/kvm/hyperv.c index 6825c89af374..f2a3fe650a18 100644 --- a/target/i386/kvm/hyperv.c +++ b/target/i386/kvm/hyperv.c @@ -149,3 +149,8 @@ int kvm_hv_handle_exit(X86CPU *cpu, struct kvm_hyperv_exit *exit) return -1; } } + +void hyperv_x86_set_vmbus_recommended_features_enabled(void) +{ +hyperv_set_vmbus_recommended_features_enabled(); +} diff --git a/target/i386/kvm/hyperv.h b/target/i386/kvm/hyperv.h index 67543296c3a4..e3982c8f4dd1 100644 --- a/target/i386/kvm/hyperv.h +++ b/target/i386/kvm/hyperv.h @@ -26,4 +26,6 @@ int hyperv_x86_synic_add(X86CPU *cpu); void hyperv_x86_synic_reset(X86CPU *cpu); void hyperv_x86_synic_update(X86CPU *cpu); +void hyperv_x86_set_vmbus_recommended_features_enabled(void); + #endif diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 42970ab046fa..e68cbe929302 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -1650,6 +1650,13 @@ static int hyperv_init_vcpu(X86CPU *cpu) } } 
+/* Skip SynIC and VP_INDEX since they are hard deps already */ +if (hyperv_feat_enabled(cpu, HYPERV_FEAT_STIMER) && +hyperv_feat_enabled(cpu, HYPERV_FEAT_VAPIC) && +hyperv_feat_enabled(cpu, HYPERV_FEAT_RUNTIME)) { +hyperv_x86_set_vmbus_recommended_features_enabled(); +} + return 0; }
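The gating condition in hyperv_init_vcpu() reduces to an all-features-present test over the three recommended enlightenments (SynIC and VP_INDEX are hard dependencies of VMBus and need no check). A sketch of the same check as a feature bitmask (the flag encoding below is invented; QEMU queries each feature individually via hyperv_feat_enabled()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit encoding of the three recommended enlightenments. */
enum {
    FEAT_STIMER  = 1u << 0,   /* hv-stimer  */
    FEAT_VAPIC   = 1u << 1,   /* hv-vapic   */
    FEAT_RUNTIME = 1u << 2,   /* hv-runtime */
};

/* True only when every recommended feature is enabled; a single missing
 * one is enough to trigger the VMBus realize-time warning. */
static bool vmbus_recommended_features_enabled(uint32_t feats)
{
    const uint32_t required = FEAT_STIMER | FEAT_VAPIC | FEAT_RUNTIME;

    return (feats & required) == required;
}
```

This also explains why the latch is set from vCPU init and only read at vmbus_bridge_realize() time: the feature set is fixed once the vCPUs are configured, so a single global boolean suffices.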
[PULL 0/3] Hyper-V Dynamic Memory and VMBus misc small patches
From: "Maciej S. Szmigiero" The following changes since commit 8f6330a807f2642dc2a3cdf33347aa28a4c00a87: Merge tag 'pull-maintainer-updates-060324-1' of https://gitlab.com/stsquad/qemu into staging (2024-03-06 16:56:20 +) are available in the Git repository at: https://github.com/maciejsszmigiero/qemu.git tags/pull-hv-balloon-20240308 for you to fetch changes up to 6093637b4d32875f98cd59696ffc5f26884aa0b4: vmbus: Print a warning when enabled without the recommended set of features (2024-03-08 14:18:56 +0100) Hyper-V Dynamic Memory and VMBus misc small patches This pull request contains two small patches to hv-balloon: the first one replacing alloca() usage with g_malloc0() + g_autofree and the second one adding additional declaration of a protocol message struct with an optional field explicitly defined to avoid a Coverity warning. Also included is a VMBus patch to print a warning when it is enabled without the recommended set of Hyper-V features (enlightenments) since some Windows versions crash at boot in this case. -------- Maciej S. Szmigiero (3): hv-balloon: avoid alloca() usage hv-balloon: define dm_hot_add_with_region to avoid Coverity warning vmbus: Print a warning when enabled without the recommended set of features hw/hyperv/hv-balloon.c | 18 -- hw/hyperv/hyperv.c | 12 hw/hyperv/vmbus.c| 6 ++ include/hw/hyperv/dynmem-proto.h | 9 - include/hw/hyperv/hyperv.h | 4 target/i386/kvm/hyperv-stub.c| 4 target/i386/kvm/hyperv.c | 5 + target/i386/kvm/hyperv.h | 2 ++ target/i386/kvm/kvm.c| 7 +++ 9 files changed, 56 insertions(+), 11 deletions(-)
[PULL 2/3] hv-balloon: define dm_hot_add_with_region to avoid Coverity warning
From: "Maciej S. Szmigiero" Since the presence of a hot add memory region in the hot add request message is optional, it wasn't part of this message declaration (struct dm_hot_add). Instead, the code allocated such an enlarged message by simply adding the size of this extra field to the size of the basic hot add message struct. However, Coverity considers accessing this extra member to be an out-of-bounds access, even though the memory is actually there. Fix this by adding an extended variant of this message that explicitly has an additional union dm_mem_page_range at its end. CID: #1523903 Signed-off-by: Maciej S. Szmigiero --- hw/hyperv/hv-balloon.c | 10 +- include/hw/hyperv/dynmem-proto.h | 9 - 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index 35333dab2434..3a9ef0769103 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -513,8 +513,8 @@ ret_idle: static void hv_balloon_hot_add_rb_wait(HvBalloon *balloon, StateDesc *stdesc) { VMBusChannel *chan = hv_balloon_get_channel(balloon); -struct dm_hot_add *ha; -size_t ha_size = sizeof(*ha) + sizeof(ha->range); +struct dm_hot_add_with_region *ha; +size_t ha_size = sizeof(*ha); assert(balloon->state == S_HOT_ADD_RB_WAIT); @@ -530,8 +530,8 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc) PageRange *hot_add_range = &balloon->hot_add_range; uint64_t *current_count = &balloon->ha_current_count; VMBusChannel *chan = hv_balloon_get_channel(balloon); -g_autofree struct dm_hot_add *ha = NULL; -size_t ha_size = sizeof(*ha) + sizeof(ha->range); +g_autofree struct dm_hot_add_with_region *ha = NULL; +size_t ha_size = sizeof(*ha); union dm_mem_page_range *ha_region; uint64_t align, chunk_max_size; ssize_t ret; @@ -560,7 +560,7 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc) *current_count = MIN(hot_add_range->count, chunk_max_size); ha = g_malloc0(ha_size); -ha_region = &(&ha->range)[1]; +ha_region = &ha->region;
ha->hdr.type = DM_MEM_HOT_ADD_REQUEST; ha->hdr.size = ha_size; ha->hdr.trans_id = balloon->trans_id; diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h index a657786a94b1..68b8b606f268 100644 --- a/include/hw/hyperv/dynmem-proto.h +++ b/include/hw/hyperv/dynmem-proto.h @@ -328,7 +328,8 @@ struct dm_unballoon_response { /* * Hot add request message. Message sent from the host to the guest. * - * mem_range: Memory range to hot add. + * range: Memory range to hot add. + * region: Explicit hot add memory region for guest to use. Optional. * */ @@ -337,6 +338,12 @@ struct dm_hot_add { union dm_mem_page_range range; } QEMU_PACKED; +struct dm_hot_add_with_region { +struct dm_header hdr; +union dm_mem_page_range range; +union dm_mem_page_range region; +} QEMU_PACKED; + /* * Hot add response message. * This message is sent by the guest to report the status of a hot add request.
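The fix works because the extended struct places the optional region at exactly the offset the old over-allocation trick produced, so the wire format is unchanged while the access becomes a normal in-bounds member. A simplified check of that layout equivalence (field types are simplified stand-ins and the real structs are QEMU_PACKED; this only demonstrates the idea):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified versions of the two message layouts.  Reading the optional
 * second range via `&(&ha->range)[1]` on the short struct is what
 * Coverity flags as out-of-bounds; the *_with_region variant makes the
 * member explicit while occupying the same bytes. */
struct page_range {
    uint64_t pfn;
    uint64_t count;
};

struct hot_add {
    uint64_t hdr;               /* stand-in for struct dm_header */
    struct page_range range;
};

struct hot_add_with_region {
    uint64_t hdr;
    struct page_range range;
    struct page_range region;   /* previously implied by over-allocation */
};
```

The layout invariant, region starting exactly where the short struct ends, is what guarantees the guest-visible message bytes are identical before and after the patch.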
Re: [PATCH] vmbus: Print a warning when enabled without the recommended set of features
On 25.01.2024 17:19, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Some Windows versions crash at boot or fail to enable the VMBus device if they don't see the expected set of Hyper-V features (enlightenments). Since this provides poor user experience let's warn user if the VMBus device is enabled without the recommended set of Hyper-V features. The recommended set is the minimum set of Hyper-V features required to make the VMBus device work properly in Windows Server versions 2016, 2019 and 2022. Signed-off-by: Maciej S. Szmigiero @Paolo, @Marcelo: can I get some kind of Ack or comments for the KVM part? Thanks, Maciej
[PATCH] vmbus: Print a warning when enabled without the recommended set of features
From: "Maciej S. Szmigiero" Some Windows versions crash at boot or fail to enable the VMBus device if they don't see the expected set of Hyper-V features (enlightenments). Since this provides poor user experience let's warn user if the VMBus device is enabled without the recommended set of Hyper-V features. The recommended set is the minimum set of Hyper-V features required to make the VMBus device work properly in Windows Server versions 2016, 2019 and 2022. Signed-off-by: Maciej S. Szmigiero --- hw/hyperv/hyperv.c| 12 hw/hyperv/vmbus.c | 6 ++ include/hw/hyperv/hyperv.h| 4 target/i386/kvm/hyperv-stub.c | 4 target/i386/kvm/hyperv.c | 5 + target/i386/kvm/hyperv.h | 2 ++ target/i386/kvm/kvm.c | 7 +++ 7 files changed, 40 insertions(+) diff --git a/hw/hyperv/hyperv.c b/hw/hyperv/hyperv.c index 57b402b95610..2c91de7ff4a8 100644 --- a/hw/hyperv/hyperv.c +++ b/hw/hyperv/hyperv.c @@ -947,3 +947,15 @@ uint64_t hyperv_syndbg_query_options(void) return msg.u.query_options.options; } + +static bool vmbus_recommended_features_enabled; + +bool hyperv_are_vmbus_recommended_features_enabled(void) +{ +return vmbus_recommended_features_enabled; +} + +void hyperv_set_vmbus_recommended_features_enabled(void) +{ +vmbus_recommended_features_enabled = true; +} diff --git a/hw/hyperv/vmbus.c b/hw/hyperv/vmbus.c index 380239af2c7b..f33afeeea27d 100644 --- a/hw/hyperv/vmbus.c +++ b/hw/hyperv/vmbus.c @@ -2631,6 +2631,12 @@ static void vmbus_bridge_realize(DeviceState *dev, Error **errp) return; } +if (!hyperv_are_vmbus_recommended_features_enabled()) { +warn_report("VMBus enabled without the recommended set of Hyper-V features: " +"hv-stimer, hv-vapic and hv-runtime. 
" +"Some Windows versions might not boot or enable the VMBus device"); +} + bridge->bus = VMBUS(qbus_new(TYPE_VMBUS, dev, "vmbus")); } diff --git a/include/hw/hyperv/hyperv.h b/include/hw/hyperv/hyperv.h index 015c3524b1c2..d717b4e13d40 100644 --- a/include/hw/hyperv/hyperv.h +++ b/include/hw/hyperv/hyperv.h @@ -139,4 +139,8 @@ typedef struct HvSynDbgMsg { } HvSynDbgMsg; typedef uint16_t (*HvSynDbgHandler)(void *context, HvSynDbgMsg *msg); void hyperv_set_syndbg_handler(HvSynDbgHandler handler, void *context); + +bool hyperv_are_vmbus_recommended_features_enabled(void); +void hyperv_set_vmbus_recommended_features_enabled(void); + #endif diff --git a/target/i386/kvm/hyperv-stub.c b/target/i386/kvm/hyperv-stub.c index 778ed782e6fc..3263dcf05d31 100644 --- a/target/i386/kvm/hyperv-stub.c +++ b/target/i386/kvm/hyperv-stub.c @@ -52,3 +52,7 @@ void hyperv_x86_synic_reset(X86CPU *cpu) void hyperv_x86_synic_update(X86CPU *cpu) { } + +void hyperv_x86_set_vmbus_recommended_features_enabled(void) +{ +} diff --git a/target/i386/kvm/hyperv.c b/target/i386/kvm/hyperv.c index 6825c89af374..f2a3fe650a18 100644 --- a/target/i386/kvm/hyperv.c +++ b/target/i386/kvm/hyperv.c @@ -149,3 +149,8 @@ int kvm_hv_handle_exit(X86CPU *cpu, struct kvm_hyperv_exit *exit) return -1; } } + +void hyperv_x86_set_vmbus_recommended_features_enabled(void) +{ +hyperv_set_vmbus_recommended_features_enabled(); +} diff --git a/target/i386/kvm/hyperv.h b/target/i386/kvm/hyperv.h index 67543296c3a4..e3982c8f4dd1 100644 --- a/target/i386/kvm/hyperv.h +++ b/target/i386/kvm/hyperv.h @@ -26,4 +26,6 @@ int hyperv_x86_synic_add(X86CPU *cpu); void hyperv_x86_synic_reset(X86CPU *cpu); void hyperv_x86_synic_update(X86CPU *cpu); +void hyperv_x86_set_vmbus_recommended_features_enabled(void); + #endif diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index e88e65fe014c..d3d01b3cf82d 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -1650,6 +1650,13 @@ static int hyperv_init_vcpu(X86CPU *cpu) } } 
+/* Skip SynIC and VP_INDEX since they are hard deps already */ +if (hyperv_feat_enabled(cpu, HYPERV_FEAT_STIMER) && +hyperv_feat_enabled(cpu, HYPERV_FEAT_VAPIC) && +hyperv_feat_enabled(cpu, HYPERV_FEAT_RUNTIME)) { +hyperv_x86_set_vmbus_recommended_features_enabled(); +} + return 0; }
Re: [PATCH v1 0/2] memory-device: reintroduce memory region size check
Hi David, On 17.01.2024 14:55, David Hildenbrand wrote: Reintroduce a modified region size check, after we would now allow some configurations that don't make any sense (e.g., partial hugetlb pages, 1G+1byte DIMMs). We have to take care of hv-balloon first, which was the reason why we removed that check in the first place. Cc: "Maciej S. Szmigiero" Cc: Mario Casquero Cc: Igor Mammedov Cc: Xiao Guangrong David Hildenbrand (2): hv-balloon: use get_min_alignment() to express 32 GiB alignment memory-device: reintroduce memory region size check hw/hyperv/hv-balloon.c | 37 + hw/mem/memory-device.c | 14 ++ 2 files changed, 35 insertions(+), 16 deletions(-) Looked at the changes, tested hv-balloon with a small memory backend and it seems to work fine, so for the whole series: Reviewed-by: Maciej S. Szmigiero Thanks, Maciej
Re: [PATCH 2/5] vmbus: Switch bus reset to 3-phase-reset
On 19.01.2024 17:35, Peter Maydell wrote: Switch vmbus from using BusClass::reset to the Resettable interface. This has no behavioural change, because the BusClass code to support subclasses that use the legacy BusClass::reset will call that method in the hold phase of 3-phase reset. Signed-off-by: Peter Maydell --- Acked-by: Maciej S. Szmigiero Thanks, Maciej
Re: [PATCH trivial 15/21] include/hw/hyperv/dynmem-proto.h: spelling fix: nunber
On 14.11.2023 17:58, Michael Tokarev wrote: Fixes: 4f80cd2f033e "Add Hyper-V Dynamic Memory Protocol definitions" Cc: Maciej S. Szmigiero Signed-off-by: Michael Tokarev --- Acked-by: Maciej S. Szmigiero Thanks, Maciej
[PATCH] hv-balloon: define dm_hot_add_with_region to avoid Coverity warning
From: "Maciej S. Szmigiero" Since the presence of a hot add memory region in the hot add request message is optional, it wasn't part of this message declaration (struct dm_hot_add). Instead, the code allocated such an enlarged message by simply adding the size of this extra field to the size of the basic hot add message struct. However, Coverity considers accessing this extra member to be an out-of-bounds access, even though the memory is actually there. Fix this by adding an extended variant of this message that explicitly has an additional union dm_mem_page_range at its end. CID: #1523903 Signed-off-by: Maciej S. Szmigiero --- hw/hyperv/hv-balloon.c | 10 +- include/hw/hyperv/dynmem-proto.h | 9 - 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index a4b4bde0a1e9..5b8f8aac7216 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -512,8 +512,8 @@ ret_idle: static void hv_balloon_hot_add_rb_wait(HvBalloon *balloon, StateDesc *stdesc) { VMBusChannel *chan = hv_balloon_get_channel(balloon); -struct dm_hot_add *ha; -size_t ha_size = sizeof(*ha) + sizeof(ha->range); +struct dm_hot_add_with_region *ha; +size_t ha_size = sizeof(*ha); assert(balloon->state == S_HOT_ADD_RB_WAIT); @@ -529,8 +529,8 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc) PageRange *hot_add_range = &balloon->hot_add_range; uint64_t *current_count = &balloon->ha_current_count; VMBusChannel *chan = hv_balloon_get_channel(balloon); -g_autofree struct dm_hot_add *ha = NULL; -size_t ha_size = sizeof(*ha) + sizeof(ha->range); +g_autofree struct dm_hot_add_with_region *ha = NULL; +size_t ha_size = sizeof(*ha); union dm_mem_page_range *ha_region; uint64_t align, chunk_max_size; ssize_t ret; @@ -559,7 +559,7 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc) *current_count = MIN(hot_add_range->count, chunk_max_size); ha = g_malloc0(ha_size); -ha_region = &(&ha->range)[1]; +ha_region = &ha->region;
ha->hdr.type = DM_MEM_HOT_ADD_REQUEST; ha->hdr.size = ha_size; ha->hdr.trans_id = balloon->trans_id; diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h index d0f9090ac489..834edeb59855 100644 --- a/include/hw/hyperv/dynmem-proto.h +++ b/include/hw/hyperv/dynmem-proto.h @@ -328,7 +328,8 @@ struct dm_unballoon_response { /* * Hot add request message. Message sent from the host to the guest. * - * mem_range: Memory range to hot add. + * range: Memory range to hot add. + * region: Explicit hot add memory region for guest to use. Optional. * */ @@ -337,6 +338,12 @@ struct dm_hot_add { union dm_mem_page_range range; } QEMU_PACKED; +struct dm_hot_add_with_region { +struct dm_header hdr; +union dm_mem_page_range range; +union dm_mem_page_range region; +} QEMU_PACKED; + /* * Hot add response message. * This message is sent by the guest to report the status of a hot add request.
Re: [PATCH] hv-balloon: avoid alloca() usage
On 13.11.2023 09:59, David Hildenbrand wrote:
On 09.11.23 17:02, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero"

alloca() is frowned upon, replace it with g_malloc0() + g_autofree.

Reviewed-by: David Hildenbrand

If this fixes a Coverity issue #number, we usually indicate that using "CID: #number" or "Fixes: CID: #number"

Will add "CID: #1523903" to the commit message then.

Thanks,
Maciej
[PATCH] hv-balloon: avoid alloca() usage
From: "Maciej S. Szmigiero"

alloca() is frowned upon, replace it with g_malloc0() + g_autofree.

Signed-off-by: Maciej S. Szmigiero
---
 hw/hyperv/hv-balloon.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 66f297c1d7e3..a4b4bde0a1e9 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -365,7 +365,7 @@ static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc)
     PageRangeTree dtree;
     uint64_t *dctr;
     bool our_range;
-    struct dm_unballoon_request *ur;
+    g_autofree struct dm_unballoon_request *ur = NULL;
     size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
     PageRange range;
     bool bret;
@@ -387,8 +387,7 @@ static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc)
     assert(dtree.t);
     assert(dctr);
 
-    ur = alloca(ur_size);
-    memset(ur, 0, ur_size);
+    ur = g_malloc0(ur_size);
     ur->hdr.type = DM_UNBALLOON_REQUEST;
     ur->hdr.size = ur_size;
     ur->hdr.trans_id = balloon->trans_id;
@@ -530,7 +529,7 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
     PageRange *hot_add_range = &balloon->hot_add_range;
     uint64_t *current_count = &balloon->ha_current_count;
     VMBusChannel *chan = hv_balloon_get_channel(balloon);
-    struct dm_hot_add *ha;
+    g_autofree struct dm_hot_add *ha = NULL;
     size_t ha_size = sizeof(*ha) + sizeof(ha->range);
     union dm_mem_page_range *ha_region;
     uint64_t align, chunk_max_size;
@@ -559,9 +558,8 @@ static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
      */
     *current_count = MIN(hot_add_range->count, chunk_max_size);
 
-    ha = alloca(ha_size);
+    ha = g_malloc0(ha_size);
     ha_region = &(&ha->range)[1];
-    memset(ha, 0, ha_size);
     ha->hdr.type = DM_MEM_HOT_ADD_REQUEST;
     ha->hdr.size = ha_size;
     ha->hdr.trans_id = balloon->trans_id;
Re: [PULL 06/10] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support
On 9.11.2023 15:51, Peter Maydell wrote:
On Mon, 6 Nov 2023 at 14:23, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero"

One of the advantages of using this protocol over ACPI-based PC DIMM hotplug is that it allows hot-adding memory at a much smaller granularity because the ACPI DIMM slot limit does not apply.

In order to enable this functionality a new memory backend needs to be created and provided to the driver via the "memdev" parameter. This can be achieved by, for example, adding "-object memory-backend-ram,id=mem1,size=32G" to the QEMU command line and then instantiating the driver with the "memdev=mem1" parameter.

The device will try to use multiple memslots to cover the memory backend in order to reduce the size of metadata for the not-yet-hot-added part of the memory backend.

Co-developed-by: David Hildenbrand
Acked-by: David Hildenbrand
Signed-off-by: Maciej S. Szmigiero

Hi; I was looking at this code because Coverity reported an issue in it. I think that's because Coverity has got confused about the way you're doing memory allocation here. But in looking at the code I see that you're using alloca() in this function. Please could you rewrite this not to do that -- we don't use alloca() or variable-length-arrays in QEMU except in a few cases which we're trying to get rid of, so we'd like not to add new uses to the code base.

Sure, will do - I didn't know alloca() was frowned upon (and David probably didn't either).

thanks
-- PMM

Thanks,
Maciej
[PULL 10/10] MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol
From: "Maciej S. Szmigiero" Acked-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- MAINTAINERS | 8 1 file changed, 8 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 8e8a7d5be5de..d4a480ce5a62 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2656,6 +2656,14 @@ F: hw/usb/canokey.c F: hw/usb/canokey.h F: docs/system/devices/canokey.rst +Hyper-V Dynamic Memory Protocol +M: Maciej S. Szmigiero +S: Supported +F: hw/hyperv/hv-balloon*.c +F: hw/hyperv/hv-balloon*.h +F: include/hw/hyperv/dynmem-proto.h +F: include/hw/hyperv/hv-balloon.h + Subsystems -- Overall Audio backends
[PULL 04/10] Add Hyper-V Dynamic Memory Protocol definitions
From: "Maciej S. Szmigiero" This commit adds Hyper-V Dynamic Memory Protocol definitions, taken from hv_balloon Linux kernel driver, adapted to the QEMU coding style and definitions. Acked-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- include/hw/hyperv/dynmem-proto.h | 423 +++ 1 file changed, 423 insertions(+) create mode 100644 include/hw/hyperv/dynmem-proto.h diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h new file mode 100644 index ..d0f9090ac489 --- /dev/null +++ b/include/hw/hyperv/dynmem-proto.h @@ -0,0 +1,423 @@ +#ifndef HW_HYPERV_DYNMEM_PROTO_H +#define HW_HYPERV_DYNMEM_PROTO_H + +/* + * Hyper-V Dynamic Memory Protocol definitions + * + * Copyright (C) 2020-2023 Oracle and/or its affiliates. + * + * Based on drivers/hv/hv_balloon.c from Linux kernel: + * Copyright (c) 2012, Microsoft Corporation. + * + * Author: K. Y. Srinivasan + * + * This work is licensed under the terms of the GNU GPL, version 2. + * See the COPYING file in the top-level directory. + */ + +/* + * Protocol versions. The low word is the minor version, the high word the major + * version. 
+ * + * History: + * Initial version 1.0 + * Changed to 0.1 on 2009/03/25 + * Changes to 0.2 on 2009/05/14 + * Changes to 0.3 on 2009/12/03 + * Changed to 1.0 on 2011/04/05 + * Changed to 2.0 on 2019/12/10 + */ + +#define DYNMEM_MAKE_VERSION(Major, Minor) ((uint32_t)(((Major) << 16) | (Minor))) +#define DYNMEM_MAJOR_VERSION(Version) ((uint32_t)(Version) >> 16) +#define DYNMEM_MINOR_VERSION(Version) ((uint32_t)(Version) & 0xff) + +enum { +DYNMEM_PROTOCOL_VERSION_1 = DYNMEM_MAKE_VERSION(0, 3), +DYNMEM_PROTOCOL_VERSION_2 = DYNMEM_MAKE_VERSION(1, 0), +DYNMEM_PROTOCOL_VERSION_3 = DYNMEM_MAKE_VERSION(2, 0), + +DYNMEM_PROTOCOL_VERSION_WIN7 = DYNMEM_PROTOCOL_VERSION_1, +DYNMEM_PROTOCOL_VERSION_WIN8 = DYNMEM_PROTOCOL_VERSION_2, +DYNMEM_PROTOCOL_VERSION_WIN10 = DYNMEM_PROTOCOL_VERSION_3, + +DYNMEM_PROTOCOL_VERSION_CURRENT = DYNMEM_PROTOCOL_VERSION_WIN10 +}; + + + +/* + * Message Types + */ + +enum dm_message_type { +/* + * Version 0.3 + */ +DM_ERROR = 0, +DM_VERSION_REQUEST = 1, +DM_VERSION_RESPONSE = 2, +DM_CAPABILITIES_REPORT = 3, +DM_CAPABILITIES_RESPONSE = 4, +DM_STATUS_REPORT = 5, +DM_BALLOON_REQUEST = 6, +DM_BALLOON_RESPONSE = 7, +DM_UNBALLOON_REQUEST = 8, +DM_UNBALLOON_RESPONSE = 9, +DM_MEM_HOT_ADD_REQUEST = 10, +DM_MEM_HOT_ADD_RESPONSE = 11, +DM_VERSION_03_MAX = 11, +/* + * Version 1.0. + */ +DM_INFO_MESSAGE = 12, +DM_VERSION_1_MAX = 12, + +/* + * Version 2.0 + */ +DM_MEM_HOT_REMOVE_REQUEST = 13, +DM_MEM_HOT_REMOVE_RESPONSE = 14 +}; + + +/* + * Structures defining the dynamic memory management + * protocol. + */ + +union dm_version { +struct { +uint16_t minor_version; +uint16_t major_version; +}; +uint32_t version; +} QEMU_PACKED; + + +union dm_caps { +struct { +uint64_t balloon:1; +uint64_t hot_add:1; +/* + * To support guests that may have alignment + * limitations on hot-add, the guest can specify + * its alignment requirements; a value of n + * represents an alignment of 2^n in mega bytes. 
+ */ +uint64_t hot_add_alignment:4; +uint64_t hot_remove:1; +uint64_t reservedz:57; +} cap_bits; +uint64_t caps; +} QEMU_PACKED; + +union dm_mem_page_range { +struct { +/* + * The PFN number of the first page in the range. + * 40 bits is the architectural limit of a PFN + * number for AMD64. + */ +uint64_t start_page:40; +/* + * The number of pages in the range. + */ +uint64_t page_cnt:24; +} finfo; +uint64_t page_range; +} QEMU_PACKED; + + + +/* + * The header for all dynamic memory messages: + * + * type: Type of the message. + * size: Size of the message in bytes; including the header. + * trans_id: The guest is responsible for manufacturing this ID. + */ + +struct dm_header { +uint16_t type; +uint16_t size; +uint32_t trans_id; +} QEMU_PACKED; + +/* + * A generic message format for dynamic memory. + * Specific message formats are defined later in the file. + */ + +struct dm_message { +struct dm_header hdr; +uint8_t data[]; /* enclosed message */ +} QEMU_PACKED; + + +/* + * Specific message types supporting the dynamic memory protocol. + */ + +/* + * Version negotiation message. Sent from the guest to the host. + * The guest is free to try different versions until the host + * accepts the version. + * + * dm_version: The protocol version requested. + * is_last_attempt: If TRUE, this is the last version guest will request. + * reservedz: Reserved field, set to zero. + */ + +struct dm_version_request { +struct dm_header hdr; +union dm_version version; +
[PULL 05/10] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base
From: "Maciej S. Szmigiero"

This driver is like virtio-balloon on steroids: it allows both changing the guest memory allocation via ballooning and (in the next patch) inserting pieces of extra RAM into it on demand from a provided memory backend.

The actual resizing is done via the ballooning interface (for example, via the "balloon" HMP command). This includes resizing the guest past its boot size - that is, hot-adding additional memory in granularity limited only by the guest alignment requirements, as provided by the next patch.

In contrast with ACPI DIMM hotplug, where one can only request to unplug a whole DIMM stick, this driver allows removing memory from the guest in single-page (4k) units via ballooning.

After a VM reboot the guest is back to its original (boot) size. In the future, the guest boot memory size might be changed on reboot instead, taking into account the effective size that the VM had before that reboot (much like Hyper-V does).

For performance reasons, the guest-released memory is tracked in a few range trees, as a series of (start, count) ranges. Each time a new page range is inserted into such a tree its neighbors are checked as candidates for possible merging with it. Besides performance reasons, the Dynamic Memory protocol itself uses page ranges as the data structure in its messages, so relevant pages need to be merged into such ranges anyway.

One has to be careful when tracking the guest-released pages, since the guest can maliciously report returning pages outside its current address space, which may later clash with the address range of newly added memory. Similarly, the guest can report freeing the same page twice.

The above design results in much better ballooning performance than when using virtio-balloon with the same guest: 230 GB / minute with this driver versus 70 GB / minute with virtio-balloon.
During a ballooning operation most of the time is spent waiting for the guest to come up with newly freed page ranges; processing the received ranges on the host side (in QEMU and KVM) is nearly instantaneous.

The unballoon operation is also pretty much instantaneous: thanks to the merging of the ballooned-out page ranges, 200 GB of memory can be returned to the guest in about 1 second. With virtio-balloon this operation takes about 2.5 minutes.

These tests were done against a Windows Server 2019 guest running on a Xeon E5-2699, after dirtying the whole memory inside the guest before each balloon operation.

Using a range tree instead of a bitmap to track the removed memory also means that the solution scales well with the guest size: even a 1 TB range takes just a few bytes of such metadata.

Since the required GTree operations aren't present in every GLib version, a check for them was added to the meson build script, together with new "--enable-hv-balloon" and "--disable-hv-balloon" configure arguments. If these GTree operations are missing from the system's GLib version, this driver will be skipped during the QEMU build.

An optional "status-report=on" device parameter requests memory status events from the guest (typically sent every second), which allow the host to learn both the guest's available memory and in-use memory counts. Following commits will add support for their external emission as "HV_BALLOON_STATUS_REPORT" QMP events.

The driver is named hv-balloon since the Linux kernel client driver for the Dynamic Memory Protocol is named as such and to follow the naming pattern established by the virtio-balloon driver. The whole protocol runs over Hyper-V VMBus.

The driver was tested against Windows Server 2012 R2, Windows Server 2016 and Windows Server 2019 guests and obeys the guest alignment requirements reported to the host via the DM_CAPABILITIES_REPORT message.

Acked-by: David Hildenbrand
Signed-off-by: Maciej S.
Szmigiero --- Kconfig.host |3 + hw/hyperv/Kconfig | 10 + hw/hyperv/hv-balloon-internal.h| 33 + hw/hyperv/hv-balloon-page_range_tree.c | 228 + hw/hyperv/hv-balloon-page_range_tree.h | 118 +++ hw/hyperv/hv-balloon.c | 1160 hw/hyperv/meson.build |1 + hw/hyperv/trace-events | 13 + include/hw/hyperv/hv-balloon.h | 18 + meson.build| 28 +- meson_options.txt |2 + scripts/meson-buildoptions.sh |3 + 12 files changed, 1616 insertions(+), 1 deletion(-) create mode 100644 hw/hyperv/hv-balloon-internal.h create mode 100644 hw/hyperv/hv-balloon-page_range_tree.c create mode 100644 hw/hyperv/hv-balloon-page_range_tree.h create mode 100644 hw/hyperv/hv-balloon.c create mode 100644 include/hw/hyperv/hv-balloon.h diff --git a/Kconfig.host b/Kconfig.host index d763d892693c..2ee71578f38f 100644 --- a/Kconfig.host +++ b/Kconfig.host @@ -46,3 +46,6 @@ config FUZZ config VFIO_USER_SE
[PULL 07/10] qapi: Add query-memory-devices support to hv-balloon
From: "Maciej S. Szmigiero" Used by the driver to report its provided memory state information. Co-developed-by: David Hildenbrand Reviewed-by: David Hildenbrand Acked-by: Markus Armbruster Signed-off-by: Maciej S. Szmigiero --- hw/core/machine-hmp-cmds.c | 15 +++ hw/hyperv/hv-balloon.c | 27 +- qapi/machine.json | 39 -- 3 files changed, 78 insertions(+), 3 deletions(-) diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c index 9a4b59c6f210..a6ff6a487583 100644 --- a/hw/core/machine-hmp-cmds.c +++ b/hw/core/machine-hmp-cmds.c @@ -253,6 +253,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict) MemoryDeviceInfo *value; PCDIMMDeviceInfo *di; SgxEPCDeviceInfo *se; +HvBalloonDeviceInfo *hi; for (info = info_list; info; info = info->next) { value = info->value; @@ -310,6 +311,20 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict) monitor_printf(mon, " node: %" PRId64 "\n", se->node); monitor_printf(mon, " memdev: %s\n", se->memdev); break; +case MEMORY_DEVICE_INFO_KIND_HV_BALLOON: +hi = value->u.hv_balloon.data; +monitor_printf(mon, "Memory device [%s]: \"%s\"\n", + MemoryDeviceInfoKind_str(value->type), + hi->id ? 
hi->id : ""); +if (hi->has_memaddr) { +monitor_printf(mon, " memaddr: 0x%" PRIx64 "\n", + hi->memaddr); +} +monitor_printf(mon, " max-size: %" PRIu64 "\n", hi->max_size); +if (hi->memdev) { +monitor_printf(mon, " memdev: %s\n", hi->memdev); +} +break; default: g_assert_not_reached(); } diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index 5999f1127d87..44a8d15cc841 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -1625,6 +1625,31 @@ static MemoryRegion *hv_balloon_md_get_memory_region(MemoryDeviceState *md, return balloon->mr; } +static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md, + MemoryDeviceInfo *info) +{ +HvBalloonDeviceInfo *hi = g_new0(HvBalloonDeviceInfo, 1); +const HvBalloon *balloon = HV_BALLOON(md); +DeviceState *dev = DEVICE(md); + +if (dev->id) { +hi->id = g_strdup(dev->id); +} + +if (balloon->hostmem) { +hi->memdev = object_get_canonical_path(OBJECT(balloon->hostmem)); +hi->memaddr = balloon->addr; +hi->has_memaddr = true; +hi->max_size = memory_region_size(balloon->mr); +/* TODO: expose current provided size or something else? 
*/ +} else { +hi->max_size = 0; +} + +info->u.hv_balloon.data = hi; +info->type = MEMORY_DEVICE_INFO_KIND_HV_BALLOON; +} + static void hv_balloon_decide_memslots(MemoryDeviceState *md, unsigned int limit) { @@ -1712,5 +1737,5 @@ static void hv_balloon_class_init(ObjectClass *klass, void *data) mdc->get_memory_region = hv_balloon_md_get_memory_region; mdc->decide_memslots = hv_balloon_decide_memslots; mdc->get_memslots = hv_balloon_get_memslots; -/* implement fill_device_info */ +mdc->fill_device_info = hv_balloon_md_fill_device_info; } diff --git a/qapi/machine.json b/qapi/machine.json index 6c9d2f6dcffe..2985d043c00d 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -1289,6 +1289,29 @@ } } +## +# @HvBalloonDeviceInfo: +# +# hv-balloon provided memory state information +# +# @id: device's ID +# +# @memaddr: physical address in memory, where device is mapped +# +# @max-size: the maximum size of memory that the device can provide +# +# @memdev: memory backend linked with device +# +# Since: 8.2 +## +{ 'struct': 'HvBalloonDeviceInfo', + 'data': { '*id': 'str', +'*memaddr': 'size', +'max-size': 'size', +'*memdev': 'str' + } +} + ## # @MemoryDeviceInfoKind: # @@ -1300,10 +1323,13 @@ # # @sgx-epc: since 6.2. # +# @hv-balloon: since 8.2. +# # Since: 2.1 ## { 'enum': 'MemoryDeviceInfoKind', - 'data': [ 'dimm', 'nvdimm', 'virtio-pmem', 'virtio-mem', 'sgx-epc' ] } + 'data': [ 'dimm', 'nvdimm', 'virtio-pmem', 'virtio-mem', 'sgx-epc', +'hv-balloon' ] } ## # @PCDIMMDeviceInfoWrapper: @@ -1337,6 +1363,14 @@ { 'struct': 'SgxEPCDeviceInfoWrapper', 'data': { 'data': 'SgxEPCDeviceInfo' } } +## +# @HvBalloonDeviceInfoWrapper: +# +# Since: 8.2 +## +{
[PULL 00/10] Hyper-V Dynamic Memory Protocol driver (hv-balloon) pull req fixed
From: "Maciej S. Szmigiero" Hi Stefan, Fixed the CI pipeline issues with yesterday's pull request, and: the following changes since commit d762bf97931b58839316b68a570eecc6143c9e3e: Merge tag 'pull-target-arm-20231102' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-11-03 10:04:12 +0800) are available in the Git repository at: https://github.com/maciejsszmigiero/qemu.git tags/pull-hv-balloon-20231106 for you to fetch changes up to 00313b517d09c0b141fb32997791f911c28fd3ff: MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol (2023-11-06 14:08:10 +0100) Hyper-V Dynamic Memory protocol driver. This driver is like virtio-balloon on steroids for Windows guests: it allows both changing the guest memory allocation via ballooning and inserting pieces of extra RAM into it on demand from a provided memory backend via Windows-native Hyper-V Dynamic Memory protocol. * Preparatory patches to support empty memory devices and ones with large alignment requirements. * Revert of recently added "hw/virtio/virtio-pmem: Replace impossible check by assertion" commit 5960f254dbb4 since this series makes this situation possible again. * Protocol definitions. * Hyper-V DM protocol driver (hv-balloon) base (ballooning only). * Hyper-V DM protocol driver (hv-balloon) hot-add support. * qapi query-memory-devices support for the driver. * qapi HV_BALLOON_STATUS_REPORT event. * The relevant PC machine plumbing. * New MAINTAINERS entry for the above. David Hildenbrand (2): memory-device: Support empty memory devices memory-device: Drop size alignment check Maciej S. 
Szmigiero (8): Revert "hw/virtio/virtio-pmem: Replace impossible check by assertion" Add Hyper-V Dynamic Memory Protocol definitions Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support qapi: Add query-memory-devices support to hv-balloon qapi: Add HV_BALLOON_STATUS_REPORT event and its QMP query command hw/i386/pc: Support hv-balloon MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol Kconfig.host |3 + MAINTAINERS |8 + hw/core/machine-hmp-cmds.c| 15 + hw/hyperv/Kconfig | 10 + hw/hyperv/hv-balloon-internal.h | 33 + hw/hyperv/hv-balloon-our_range_memslots.c | 201 hw/hyperv/hv-balloon-our_range_memslots.h | 110 ++ hw/hyperv/hv-balloon-page_range_tree.c| 228 hw/hyperv/hv-balloon-page_range_tree.h| 118 ++ hw/hyperv/hv-balloon-stub.c | 19 + hw/hyperv/hv-balloon.c| 1769 + hw/hyperv/meson.build |1 + hw/hyperv/trace-events| 18 + hw/i386/Kconfig |1 + hw/i386/pc.c | 22 + hw/mem/memory-device.c| 49 +- hw/virtio/virtio-pmem.c |5 +- include/hw/hyperv/dynmem-proto.h | 423 +++ include/hw/hyperv/hv-balloon.h| 18 + include/hw/mem/memory-device.h|7 +- meson.build | 28 +- meson_options.txt |2 + monitor/monitor.c |1 + qapi/machine.json | 101 +- scripts/meson-buildoptions.sh |3 + tests/qtest/qmp-cmd-test.c|1 + 26 files changed, 3180 insertions(+), 14 deletions(-) create mode 100644 hw/hyperv/hv-balloon-internal.h create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.c create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.h create mode 100644 hw/hyperv/hv-balloon-page_range_tree.c create mode 100644 hw/hyperv/hv-balloon-page_range_tree.h create mode 100644 hw/hyperv/hv-balloon-stub.c create mode 100644 hw/hyperv/hv-balloon.c create mode 100644 include/hw/hyperv/dynmem-proto.h create mode 100644 include/hw/hyperv/hv-balloon.h
[PULL 02/10] Revert "hw/virtio/virtio-pmem: Replace impossible check by assertion"
From: "Maciej S. Szmigiero"

This reverts commit 5960f254dbb46f0c7a9f5f44bf4d27c19c34cb97 since the previous commit made this situation possible again.

Signed-off-by: Maciej S. Szmigiero
---
 hw/virtio/virtio-pmem.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
index cc24812d2e92..c3512c2dae3f 100644
--- a/hw/virtio/virtio-pmem.c
+++ b/hw/virtio/virtio-pmem.c
@@ -147,7 +147,10 @@ static void virtio_pmem_fill_device_info(const VirtIOPMEM *pmem,
 static MemoryRegion *virtio_pmem_get_memory_region(VirtIOPMEM *pmem,
                                                    Error **errp)
 {
-    assert(pmem->memdev);
+    if (!pmem->memdev) {
+        error_setg(errp, "'%s' property must be set", VIRTIO_PMEM_MEMDEV_PROP);
+        return NULL;
+    }
 
     return &pmem->memdev->mr;
 }
[PULL 06/10] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support
From: "Maciej S. Szmigiero"

One of the advantages of using this protocol over ACPI-based PC DIMM hotplug is that it allows hot-adding memory at a much smaller granularity because the ACPI DIMM slot limit does not apply.

In order to enable this functionality a new memory backend needs to be created and provided to the driver via the "memdev" parameter. This can be achieved by, for example, adding "-object memory-backend-ram,id=mem1,size=32G" to the QEMU command line and then instantiating the driver with the "memdev=mem1" parameter.

The device will try to use multiple memslots to cover the memory backend in order to reduce the size of metadata for the not-yet-hot-added part of the memory backend.

Co-developed-by: David Hildenbrand
Acked-by: David Hildenbrand
Signed-off-by: Maciej S. Szmigiero
---
 hw/hyperv/hv-balloon-our_range_memslots.c | 201
 hw/hyperv/hv-balloon-our_range_memslots.h | 110
 hw/hyperv/hv-balloon.c                    | 566 +-
 hw/hyperv/meson.build                     |   2 +-
 hw/hyperv/trace-events                    |   5 +
 5 files changed, 878 insertions(+), 6 deletions(-)
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.c
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.h

diff --git a/hw/hyperv/hv-balloon-our_range_memslots.c b/hw/hyperv/hv-balloon-our_range_memslots.c
new file mode 100644
index ..99bae870f371
--- /dev/null
+++ b/hw/hyperv/hv-balloon-our_range_memslots.c
@@ -0,0 +1,201 @@
+/*
+ * QEMU Hyper-V Dynamic Memory Protocol driver
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "hv-balloon-internal.h"
+#include "hv-balloon-our_range_memslots.h"
+#include "trace.h"
+
+/* OurRange */
+static void our_range_init(OurRange *our_range, uint64_t start, uint64_t count)
+{
+    assert(count <= UINT64_MAX - start);
+    our_range->range.start = start;
+    our_range->range.count = count;
+
+    hvb_page_range_tree_init(&our_range->removed_guest);
+    hvb_page_range_tree_init(&our_range->removed_both);
+
+    /* mark the whole range as unused but for potential use */
+    our_range->added = 0;
+    our_range->unusable_tail = 0;
+}
+
+static void our_range_destroy(OurRange *our_range)
+{
+    hvb_page_range_tree_destroy(&our_range->removed_guest);
+    hvb_page_range_tree_destroy(&our_range->removed_both);
+}
+
+void hvb_our_range_clear_removed_trees(OurRange *our_range)
+{
+    hvb_page_range_tree_destroy(&our_range->removed_guest);
+    hvb_page_range_tree_destroy(&our_range->removed_both);
+    hvb_page_range_tree_init(&our_range->removed_guest);
+    hvb_page_range_tree_init(&our_range->removed_both);
+}
+
+void hvb_our_range_mark_added(OurRange *our_range, uint64_t additional_size)
+{
+    assert(additional_size <= UINT64_MAX - our_range->added);
+
+    our_range->added += additional_size;
+
+    assert(our_range->added <= UINT64_MAX - our_range->unusable_tail);
+    assert(our_range->added + our_range->unusable_tail <=
+           our_range->range.count);
+}
+
+/* OurRangeMemslots */
+static void our_range_memslots_init_slots(OurRangeMemslots *our_range,
+                                          MemoryRegion *backing_mr,
+                                          Object *memslot_owner)
+{
+    OurRangeMemslotsSlots *memslots = &our_range->slots;
+    unsigned int idx;
+    uint64_t memslot_offset;
+
+    assert(memslots->count > 0);
+    memslots->slots = g_new0(MemoryRegion, memslots->count);
+
+    /* Initialize our memslots, but don't map them yet. */
+    assert(memslots->size_each > 0);
+    for (idx = 0, memslot_offset = 0; idx < memslots->count;
+         idx++, memslot_offset += memslots->size_each) {
+        uint64_t memslot_size;
+        g_autofree char *name = NULL;
+
+        /* The size of the last memslot might be smaller. */
+        if (idx == memslots->count - 1) {
+            uint64_t region_size;
+
+            assert(our_range->mr);
+            region_size = memory_region_size(our_range->mr);
+            memslot_size = region_size - memslot_offset;
+        } else {
+            memslot_size = memslots->size_each;
+        }
+
+        name = g_strdup_printf("memslot-%u", idx);
+        memory_region_init_alias(&memslots->slots[idx], memslot_owner, name,
+                                 backing_mr, memslot_offset, memslot_size);
+        /*
+         * We want to be able to atomically and efficiently activate/deactivate
+         * individual memslots without affecting adjacent memslots in memory
+         * notifiers.
+         */
+        memory_region_set_unmergeable(&memslots->slots[idx], true);
+    }
+
+    memslots->mapped_count = 0;
+}
+
+O
[PULL 01/10] memory-device: Support empty memory devices
From: David Hildenbrand

Let's support empty memory devices -- memory devices that don't have a memory device region in the current configuration. hv-balloon with an optional memdev is the primary use case.

Signed-off-by: David Hildenbrand
Signed-off-by: Maciej S. Szmigiero
---
 hw/mem/memory-device.c         | 43 +++---
 include/hw/mem/memory-device.h |  7 +-
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index ae38f48f1676..db702ccad554 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -20,6 +20,22 @@
 #include "exec/address-spaces.h"
 #include "trace.h"
 
+static bool memory_device_is_empty(const MemoryDeviceState *md)
+{
+    const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
+    Error *local_err = NULL;
+    MemoryRegion *mr;
+
+    /* dropping const here is fine as we don't touch the memory region */
+    mr = mdc->get_memory_region((MemoryDeviceState *)md, &local_err);
+    if (local_err) {
+        /* Not empty, we'll report errors later when obtaining the MR again. */
+        error_free(local_err);
+        return false;
+    }
+    return !mr;
+}
+
 static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
 {
     const MemoryDeviceState *md_a = MEMORY_DEVICE(a);
@@ -249,6 +265,10 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
         uint64_t next_addr;
         Range tmp;
 
+        if (memory_device_is_empty(md)) {
+            continue;
+        }
+
         range_init_nofail(&tmp, mdc->get_addr(md),
                           memory_device_get_region_size(md, &error_abort));
 
@@ -292,6 +312,7 @@ MemoryDeviceInfoList *qmp_memory_device_list(void)
         const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(item->data);
         MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1);
 
+        /* Let's query information even for empty memory devices. */
         mdc->fill_device_info(md, info);
 
         QAPI_LIST_APPEND(tail, info);
@@ -311,7 +332,7 @@ static int memory_device_plugged_size(Object *obj, void *opaque)
     const MemoryDeviceState *md = MEMORY_DEVICE(obj);
     const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj);
 
-    if (dev->realized) {
+    if (dev->realized && !memory_device_is_empty(md)) {
         *size += mdc->get_plugged_size(md, &error_abort);
     }
 }
@@ -337,6 +358,11 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
     uint64_t addr, align = 0;
     MemoryRegion *mr;
 
+    /* We support empty memory devices even without device memory. */
+    if (memory_device_is_empty(md)) {
+        return;
+    }
+
     if (!ms->device_memory) {
         error_setg(errp, "the configuration is not prepared for memory devices"
                    " (e.g., for memory hotplug), consider specifying the"
@@ -380,10 +406,17 @@ out:
 void memory_device_plug(MemoryDeviceState *md, MachineState *ms)
 {
     const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
-    const unsigned int memslots = memory_device_get_memslots(md);
-    const uint64_t addr = mdc->get_addr(md);
+    unsigned int memslots;
+    uint64_t addr;
     MemoryRegion *mr;
 
+    if (memory_device_is_empty(md)) {
+        return;
+    }
+
+    memslots = memory_device_get_memslots(md);
+    addr = mdc->get_addr(md);
+
     /*
      * We expect that a previous call to memory_device_pre_plug() succeeded, so
      * it can't fail at this point.
@@ -408,6 +441,10 @@ void memory_device_unplug(MemoryDeviceState *md, MachineState *ms)
     const unsigned int memslots = memory_device_get_memslots(md);
     MemoryRegion *mr;
 
+    if (memory_device_is_empty(md)) {
+        return;
+    }
+
     /*
      * We expect that a previous call to memory_device_pre_plug() succeeded, so
      * it can't fail at this point.
diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h index 3354d6c1667e..a1d62cc551ab 100644 --- a/include/hw/mem/memory-device.h +++ b/include/hw/mem/memory-device.h @@ -38,6 +38,10 @@ typedef struct MemoryDeviceState MemoryDeviceState; * address in guest physical memory can either be specified explicitly * or get assigned automatically. * + * Some memory device might not own a memory region in certain device + * configurations. Such devices can logically get (un)plugged, however, + * empty memory devices are mostly ignored by the memory device code. + * * Conceptually, memory devices only span one memory region. If multiple * successive memory regions are used, a covering memory region has to * be provided. Scattered memory regions are not supported for single @@ -91,7 +95,8 @@ struct MemoryDeviceClass { uint64_t (*get_plugged_size)(const MemoryDeviceState *md, Error **errp); /* - * Return the memory region of the memory device. + * Return the memory region of the
[PULL 08/10] qapi: Add HV_BALLOON_STATUS_REPORT event and its QMP query command
From: "Maciej S. Szmigiero" Used by the hv-balloon driver for (optional) guest memory status reports. Acked-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/hyperv/hv-balloon-stub.c | 19 hw/hyperv/hv-balloon.c | 30 +- hw/hyperv/meson.build | 2 +- monitor/monitor.c | 1 + qapi/machine.json | 62 + tests/qtest/qmp-cmd-test.c | 1 + 6 files changed, 113 insertions(+), 2 deletions(-) create mode 100644 hw/hyperv/hv-balloon-stub.c diff --git a/hw/hyperv/hv-balloon-stub.c b/hw/hyperv/hv-balloon-stub.c new file mode 100644 index ..a47412d4a8ad --- /dev/null +++ b/hw/hyperv/hv-balloon-stub.c @@ -0,0 +1,19 @@ +/* + * QEMU Hyper-V Dynamic Memory Protocol driver + * + * Copyright (C) 2023 Oracle and/or its affiliates. + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "qapi/qapi-commands-machine.h" +#include "qapi/qapi-types-machine.h" + +HvBalloonInfo *qmp_query_hv_balloon_status_report(Error **errp) +{ +error_setg(errp, "hv-balloon device not enabled in this build"); +return NULL; +} diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index 44a8d15cc841..66f297c1d7e3 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -1102,7 +1102,35 @@ static void hv_balloon_handle_status_report(HvBalloon *balloon, balloon->status_report.available *= HV_BALLOON_PAGE_SIZE; balloon->status_report.received = true; -/* report event */ +qapi_event_send_hv_balloon_status_report(balloon->status_report.committed, + balloon->status_report.available); +} + +HvBalloonInfo *qmp_query_hv_balloon_status_report(Error **errp) +{ +HvBalloon *balloon; +HvBalloonInfo *info; + +balloon = HV_BALLOON(object_resolve_path_type("", TYPE_HV_BALLOON, NULL)); +if (!balloon) { +error_setg(errp, "no %s device present", TYPE_HV_BALLOON); +return NULL; +} + +if (!balloon->status_report.enabled) { +error_setg(errp, "guest memory status 
reporting not enabled"); +return NULL; +} + +if (!balloon->status_report.received) { +error_setg(errp, "no guest memory status report received yet"); +return NULL; +} + +info = g_malloc0(sizeof(*info)); +info->committed = balloon->status_report.committed; +info->available = balloon->status_report.available; +return info; } static void hv_balloon_handle_unballoon_response(HvBalloon *balloon, diff --git a/hw/hyperv/meson.build b/hw/hyperv/meson.build index 852d4f4a2ee2..d3d2668c71ae 100644 --- a/hw/hyperv/meson.build +++ b/hw/hyperv/meson.build @@ -2,4 +2,4 @@ specific_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c')) specific_ss.add(when: 'CONFIG_HYPERV_TESTDEV', if_true: files('hyperv_testdev.c')) specific_ss.add(when: 'CONFIG_VMBUS', if_true: files('vmbus.c')) specific_ss.add(when: 'CONFIG_SYNDBG', if_true: files('syndbg.c')) -specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c', 'hv-balloon-page_range_tree.c', 'hv-balloon-our_range_memslots.c')) +specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c', 'hv-balloon-page_range_tree.c', 'hv-balloon-our_range_memslots.c'), if_false: files('hv-balloon-stub.c')) diff --git a/monitor/monitor.c b/monitor/monitor.c index 941f87815aa4..01ede1babd3d 100644 --- a/monitor/monitor.c +++ b/monitor/monitor.c @@ -315,6 +315,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = { [QAPI_EVENT_QUORUM_FAILURE]= { 1000 * SCALE_MS }, [QAPI_EVENT_VSERPORT_CHANGE] = { 1000 * SCALE_MS }, [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS }, +[QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS }, }; /* diff --git a/qapi/machine.json b/qapi/machine.json index 2985d043c00d..b6d634b30d55 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -1137,6 +1137,68 @@ { 'event': 'BALLOON_CHANGE', 'data': { 'actual': 'int' } } +## +# @HvBalloonInfo: +# +# hv-balloon guest-provided memory status information. 
+# +# @committed: the amount of memory in use inside the guest plus the +# amount of the memory unusable inside the guest (ballooned out, +# offline, etc.) +# +# @available: the amount of the memory inside the guest available for +# new allocations ("free") +# +# Since: 8.2 +## +{ 'struct': 'HvBalloonInfo', + 'data': { 'committed': 'size', 'available': 'size' } } + +## +# @query-hv-balloon-status-report: +# +# Returns the hv-balloon driver data contained in the last received "STATUS" +# message from the guest. +# +# Returns: +# - @HvBalloonIn
[PULL 09/10] hw/i386/pc: Support hv-balloon
From: "Maciej S. Szmigiero" Add the necessary plumbing for the hv-balloon driver to the PC machine. Co-developed-by: David Hildenbrand Reviewed-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/i386/Kconfig | 1 + hw/i386/pc.c| 22 ++ 2 files changed, 23 insertions(+) diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig index 94772c726b24..55850791df41 100644 --- a/hw/i386/Kconfig +++ b/hw/i386/Kconfig @@ -45,6 +45,7 @@ config PC select ACPI_VMGENID select VIRTIO_PMEM_SUPPORTED select VIRTIO_MEM_SUPPORTED +select HV_BALLOON_SUPPORTED config PC_PCI bool diff --git a/hw/i386/pc.c b/hw/i386/pc.c index 6031234a73f1..1aef21aa2c25 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -27,6 +27,7 @@ #include "hw/i386/pc.h" #include "hw/char/serial.h" #include "hw/char/parallel.h" +#include "hw/hyperv/hv-balloon.h" #include "hw/i386/fw_cfg.h" #include "hw/i386/vmport.h" #include "sysemu/cpus.h" @@ -57,6 +58,7 @@ #include "hw/i386/kvm/xen_evtchn.h" #include "hw/i386/kvm/xen_gnttab.h" #include "hw/i386/kvm/xen_xenstore.h" +#include "hw/mem/memory-device.h" #include "e820_memory_layout.h" #include "trace.h" #include CONFIG_DEVICES @@ -1422,6 +1424,21 @@ static void pc_memory_unplug(HotplugHandler *hotplug_dev, error_propagate(errp, local_err); } +static void pc_hv_balloon_pre_plug(HotplugHandler *hotplug_dev, + DeviceState *dev, Error **errp) +{ +/* The vmbus handler has no hotplug handler; we should never end up here. 
*/ +g_assert(!dev->hotplugged); +memory_device_pre_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev), NULL, + errp); +} + +static void pc_hv_balloon_plug(HotplugHandler *hotplug_dev, + DeviceState *dev, Error **errp) +{ +memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev)); +} + static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev, Error **errp) { @@ -1452,6 +1469,8 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev, return; } pcms->iommu = dev; +} else if (object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON)) { +pc_hv_balloon_pre_plug(hotplug_dev, dev, errp); } } @@ -1464,6 +1483,8 @@ static void pc_machine_device_plug_cb(HotplugHandler *hotplug_dev, x86_cpu_plug(hotplug_dev, dev, errp); } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MD_PCI)) { virtio_md_pci_plug(VIRTIO_MD_PCI(dev), MACHINE(hotplug_dev), errp); +} else if (object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON)) { +pc_hv_balloon_plug(hotplug_dev, dev, errp); } } @@ -1505,6 +1526,7 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine, object_dynamic_cast(OBJECT(dev), TYPE_CPU) || object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MD_PCI) || object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI) || +object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON) || object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) { return HOTPLUG_HANDLER(machine); }
[PULL 03/10] memory-device: Drop size alignment check
From: David Hildenbrand There is no strong requirement that the size has to be a multiple of the requested alignment, let's drop it. This is a preparation for hv-balloon. Signed-off-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/mem/memory-device.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c index db702ccad554..e0704b8dc37a 100644 --- a/hw/mem/memory-device.c +++ b/hw/mem/memory-device.c @@ -236,12 +236,6 @@ static uint64_t memory_device_get_free_addr(MachineState *ms, return 0; } -if (!QEMU_IS_ALIGNED(size, align)) { -error_setg(errp, "backend memory size must be multiple of 0x%" - PRIx64, align); -return 0; -} - if (hint) { if (range_init(&new, *hint, size) || !range_contains_range(&as, &new)) { error_setg(errp, "can't add memory device [0x%" PRIx64 ":0x%" PRIx64
Re: [PULL 0/9] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
On 6.11.2023 02:33, Stefan Hajnoczi wrote: On Sun, 5 Nov 2023 at 19:49, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" The following changes since commit d762bf97931b58839316b68a570eecc6143c9e3e: Merge tag 'pull-target-arm-20231102' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-11-03 10:04:12 +0800) are available in the Git repository at: https://github.com/maciejsszmigiero/qemu.git tags/pull-hv-balloon-20231105 for you to fetch changes up to 2b49ecabc6bf15efa6aa05f20a7c319ff65c4e11: MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol (2023-11-03 20:31:10 +0100) Hi Maciej, Please take a look at this CI system build failure: /usr/bin/ld: libqemuutil.a.p/meson-generated_.._qapi_qapi-commands-machine.c.o: in function `qmp_marshal_query_hv_balloon_status_report': /builds/qemu-project/qemu/build/qapi/qapi-commands-machine.c:1000: undefined reference to `qmp_query_hv_balloon_status_report' https://gitlab.com/qemu-project/qemu/-/jobs/5463619044 I have dropped this pull request from the staging tree for the time being. You can run the GitLab CI by pushing to a personal qemu.git fork on GitLab with "git push -o ci.variable=QEMU_CI=1 ..." and it's often possible to reproduce the CI jobs locally using the Docker build tests (see "make docker-help"). Oops, I was testing the driver but recently forgot to also test the configuration with the driver disabled in the QEMU build config. Will fix this ASAP. Stefan Thanks, Maciej
[PULL 3/9] Add Hyper-V Dynamic Memory Protocol definitions
From: "Maciej S. Szmigiero" This commit adds Hyper-V Dynamic Memory Protocol definitions, taken from hv_balloon Linux kernel driver, adapted to the QEMU coding style and definitions. Acked-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- include/hw/hyperv/dynmem-proto.h | 423 +++ 1 file changed, 423 insertions(+) create mode 100644 include/hw/hyperv/dynmem-proto.h diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h new file mode 100644 index ..d0f9090ac489 --- /dev/null +++ b/include/hw/hyperv/dynmem-proto.h @@ -0,0 +1,423 @@ +#ifndef HW_HYPERV_DYNMEM_PROTO_H +#define HW_HYPERV_DYNMEM_PROTO_H + +/* + * Hyper-V Dynamic Memory Protocol definitions + * + * Copyright (C) 2020-2023 Oracle and/or its affiliates. + * + * Based on drivers/hv/hv_balloon.c from Linux kernel: + * Copyright (c) 2012, Microsoft Corporation. + * + * Author: K. Y. Srinivasan + * + * This work is licensed under the terms of the GNU GPL, version 2. + * See the COPYING file in the top-level directory. + */ + +/* + * Protocol versions. The low word is the minor version, the high word the major + * version. 
+ * + * History: + * Initial version 1.0 + * Changed to 0.1 on 2009/03/25 + * Changes to 0.2 on 2009/05/14 + * Changes to 0.3 on 2009/12/03 + * Changed to 1.0 on 2011/04/05 + * Changed to 2.0 on 2019/12/10 + */ + +#define DYNMEM_MAKE_VERSION(Major, Minor) ((uint32_t)(((Major) << 16) | (Minor))) +#define DYNMEM_MAJOR_VERSION(Version) ((uint32_t)(Version) >> 16) +#define DYNMEM_MINOR_VERSION(Version) ((uint32_t)(Version) & 0xff) + +enum { +DYNMEM_PROTOCOL_VERSION_1 = DYNMEM_MAKE_VERSION(0, 3), +DYNMEM_PROTOCOL_VERSION_2 = DYNMEM_MAKE_VERSION(1, 0), +DYNMEM_PROTOCOL_VERSION_3 = DYNMEM_MAKE_VERSION(2, 0), + +DYNMEM_PROTOCOL_VERSION_WIN7 = DYNMEM_PROTOCOL_VERSION_1, +DYNMEM_PROTOCOL_VERSION_WIN8 = DYNMEM_PROTOCOL_VERSION_2, +DYNMEM_PROTOCOL_VERSION_WIN10 = DYNMEM_PROTOCOL_VERSION_3, + +DYNMEM_PROTOCOL_VERSION_CURRENT = DYNMEM_PROTOCOL_VERSION_WIN10 +}; + + + +/* + * Message Types + */ + +enum dm_message_type { +/* + * Version 0.3 + */ +DM_ERROR = 0, +DM_VERSION_REQUEST = 1, +DM_VERSION_RESPONSE = 2, +DM_CAPABILITIES_REPORT = 3, +DM_CAPABILITIES_RESPONSE = 4, +DM_STATUS_REPORT = 5, +DM_BALLOON_REQUEST = 6, +DM_BALLOON_RESPONSE = 7, +DM_UNBALLOON_REQUEST = 8, +DM_UNBALLOON_RESPONSE = 9, +DM_MEM_HOT_ADD_REQUEST = 10, +DM_MEM_HOT_ADD_RESPONSE = 11, +DM_VERSION_03_MAX = 11, +/* + * Version 1.0. + */ +DM_INFO_MESSAGE = 12, +DM_VERSION_1_MAX = 12, + +/* + * Version 2.0 + */ +DM_MEM_HOT_REMOVE_REQUEST = 13, +DM_MEM_HOT_REMOVE_RESPONSE = 14 +}; + + +/* + * Structures defining the dynamic memory management + * protocol. + */ + +union dm_version { +struct { +uint16_t minor_version; +uint16_t major_version; +}; +uint32_t version; +} QEMU_PACKED; + + +union dm_caps { +struct { +uint64_t balloon:1; +uint64_t hot_add:1; +/* + * To support guests that may have alignment + * limitations on hot-add, the guest can specify + * its alignment requirements; a value of n + * represents an alignment of 2^n in mega bytes. 
+ */ +uint64_t hot_add_alignment:4; +uint64_t hot_remove:1; +uint64_t reservedz:57; +} cap_bits; +uint64_t caps; +} QEMU_PACKED; + +union dm_mem_page_range { +struct { +/* + * The PFN number of the first page in the range. + * 40 bits is the architectural limit of a PFN + * number for AMD64. + */ +uint64_t start_page:40; +/* + * The number of pages in the range. + */ +uint64_t page_cnt:24; +} finfo; +uint64_t page_range; +} QEMU_PACKED; + + + +/* + * The header for all dynamic memory messages: + * + * type: Type of the message. + * size: Size of the message in bytes; including the header. + * trans_id: The guest is responsible for manufacturing this ID. + */ + +struct dm_header { +uint16_t type; +uint16_t size; +uint32_t trans_id; +} QEMU_PACKED; + +/* + * A generic message format for dynamic memory. + * Specific message formats are defined later in the file. + */ + +struct dm_message { +struct dm_header hdr; +uint8_t data[]; /* enclosed message */ +} QEMU_PACKED; + + +/* + * Specific message types supporting the dynamic memory protocol. + */ + +/* + * Version negotiation message. Sent from the guest to the host. + * The guest is free to try different versions until the host + * accepts the version. + * + * dm_version: The protocol version requested. + * is_last_attempt: If TRUE, this is the last version guest will request. + * reservedz: Reserved field, set to zero. + */ + +struct dm_version_request { +struct dm_header hdr; +union dm_version version; +
[PULL 8/9] hw/i386/pc: Support hv-balloon
From: "Maciej S. Szmigiero" Add the necessary plumbing for the hv-balloon driver to the PC machine. Co-developed-by: David Hildenbrand Reviewed-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/i386/Kconfig | 1 + hw/i386/pc.c| 22 ++ 2 files changed, 23 insertions(+) diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig index 94772c726b24..55850791df41 100644 --- a/hw/i386/Kconfig +++ b/hw/i386/Kconfig @@ -45,6 +45,7 @@ config PC select ACPI_VMGENID select VIRTIO_PMEM_SUPPORTED select VIRTIO_MEM_SUPPORTED +select HV_BALLOON_SUPPORTED config PC_PCI bool diff --git a/hw/i386/pc.c b/hw/i386/pc.c index 6031234a73f1..1aef21aa2c25 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -27,6 +27,7 @@ #include "hw/i386/pc.h" #include "hw/char/serial.h" #include "hw/char/parallel.h" +#include "hw/hyperv/hv-balloon.h" #include "hw/i386/fw_cfg.h" #include "hw/i386/vmport.h" #include "sysemu/cpus.h" @@ -57,6 +58,7 @@ #include "hw/i386/kvm/xen_evtchn.h" #include "hw/i386/kvm/xen_gnttab.h" #include "hw/i386/kvm/xen_xenstore.h" +#include "hw/mem/memory-device.h" #include "e820_memory_layout.h" #include "trace.h" #include CONFIG_DEVICES @@ -1422,6 +1424,21 @@ static void pc_memory_unplug(HotplugHandler *hotplug_dev, error_propagate(errp, local_err); } +static void pc_hv_balloon_pre_plug(HotplugHandler *hotplug_dev, + DeviceState *dev, Error **errp) +{ +/* The vmbus handler has no hotplug handler; we should never end up here. 
*/ +g_assert(!dev->hotplugged); +memory_device_pre_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev), NULL, + errp); +} + +static void pc_hv_balloon_plug(HotplugHandler *hotplug_dev, + DeviceState *dev, Error **errp) +{ +memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev)); +} + static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev, Error **errp) { @@ -1452,6 +1469,8 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev, return; } pcms->iommu = dev; +} else if (object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON)) { +pc_hv_balloon_pre_plug(hotplug_dev, dev, errp); } } @@ -1464,6 +1483,8 @@ static void pc_machine_device_plug_cb(HotplugHandler *hotplug_dev, x86_cpu_plug(hotplug_dev, dev, errp); } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MD_PCI)) { virtio_md_pci_plug(VIRTIO_MD_PCI(dev), MACHINE(hotplug_dev), errp); +} else if (object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON)) { +pc_hv_balloon_plug(hotplug_dev, dev, errp); } } @@ -1505,6 +1526,7 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine, object_dynamic_cast(OBJECT(dev), TYPE_CPU) || object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MD_PCI) || object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI) || +object_dynamic_cast(OBJECT(dev), TYPE_HV_BALLOON) || object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) { return HOTPLUG_HANDLER(machine); }
[PULL 9/9] MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol
From: "Maciej S. Szmigiero" Acked-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- MAINTAINERS | 8 1 file changed, 8 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 8e8a7d5be5de..d4a480ce5a62 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2656,6 +2656,14 @@ F: hw/usb/canokey.c F: hw/usb/canokey.h F: docs/system/devices/canokey.rst +Hyper-V Dynamic Memory Protocol +M: Maciej S. Szmigiero +S: Supported +F: hw/hyperv/hv-balloon*.c +F: hw/hyperv/hv-balloon*.h +F: include/hw/hyperv/dynmem-proto.h +F: include/hw/hyperv/hv-balloon.h + Subsystems -- Overall Audio backends
[PULL 6/9] qapi: Add query-memory-devices support to hv-balloon
From: "Maciej S. Szmigiero" Used by the driver to report its provided memory state information. Co-developed-by: David Hildenbrand Reviewed-by: David Hildenbrand Acked-by: Markus Armbruster Signed-off-by: Maciej S. Szmigiero --- hw/core/machine-hmp-cmds.c | 15 +++ hw/hyperv/hv-balloon.c | 27 +- qapi/machine.json | 39 -- 3 files changed, 78 insertions(+), 3 deletions(-) diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c index 9a4b59c6f210..a6ff6a487583 100644 --- a/hw/core/machine-hmp-cmds.c +++ b/hw/core/machine-hmp-cmds.c @@ -253,6 +253,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict) MemoryDeviceInfo *value; PCDIMMDeviceInfo *di; SgxEPCDeviceInfo *se; +HvBalloonDeviceInfo *hi; for (info = info_list; info; info = info->next) { value = info->value; @@ -310,6 +311,20 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict) monitor_printf(mon, " node: %" PRId64 "\n", se->node); monitor_printf(mon, " memdev: %s\n", se->memdev); break; +case MEMORY_DEVICE_INFO_KIND_HV_BALLOON: +hi = value->u.hv_balloon.data; +monitor_printf(mon, "Memory device [%s]: \"%s\"\n", + MemoryDeviceInfoKind_str(value->type), + hi->id ? 
hi->id : ""); +if (hi->has_memaddr) { +monitor_printf(mon, " memaddr: 0x%" PRIx64 "\n", + hi->memaddr); +} +monitor_printf(mon, " max-size: %" PRIu64 "\n", hi->max_size); +if (hi->memdev) { +monitor_printf(mon, " memdev: %s\n", hi->memdev); +} +break; default: g_assert_not_reached(); } diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index 4d87f99375b5..c384f23a3b5e 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -1622,6 +1622,31 @@ static MemoryRegion *hv_balloon_md_get_memory_region(MemoryDeviceState *md, return balloon->mr; } +static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md, + MemoryDeviceInfo *info) +{ +HvBalloonDeviceInfo *hi = g_new0(HvBalloonDeviceInfo, 1); +const HvBalloon *balloon = HV_BALLOON(md); +DeviceState *dev = DEVICE(md); + +if (dev->id) { +hi->id = g_strdup(dev->id); +} + +if (balloon->hostmem) { +hi->memdev = object_get_canonical_path(OBJECT(balloon->hostmem)); +hi->memaddr = balloon->addr; +hi->has_memaddr = true; +hi->max_size = memory_region_size(balloon->mr); +/* TODO: expose current provided size or something else? 
*/ +} else { +hi->max_size = 0; +} + +info->u.hv_balloon.data = hi; +info->type = MEMORY_DEVICE_INFO_KIND_HV_BALLOON; +} + static void hv_balloon_decide_memslots(MemoryDeviceState *md, unsigned int limit) { @@ -1709,5 +1734,5 @@ static void hv_balloon_class_init(ObjectClass *klass, void *data) mdc->get_memory_region = hv_balloon_md_get_memory_region; mdc->decide_memslots = hv_balloon_decide_memslots; mdc->get_memslots = hv_balloon_get_memslots; -/* implement fill_device_info */ +mdc->fill_device_info = hv_balloon_md_fill_device_info; } diff --git a/qapi/machine.json b/qapi/machine.json index 6c9d2f6dcffe..2985d043c00d 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -1289,6 +1289,29 @@ } } +## +# @HvBalloonDeviceInfo: +# +# hv-balloon provided memory state information +# +# @id: device's ID +# +# @memaddr: physical address in memory, where device is mapped +# +# @max-size: the maximum size of memory that the device can provide +# +# @memdev: memory backend linked with device +# +# Since: 8.2 +## +{ 'struct': 'HvBalloonDeviceInfo', + 'data': { '*id': 'str', +'*memaddr': 'size', +'max-size': 'size', +'*memdev': 'str' + } +} + ## # @MemoryDeviceInfoKind: # @@ -1300,10 +1323,13 @@ # # @sgx-epc: since 6.2. # +# @hv-balloon: since 8.2. +# # Since: 2.1 ## { 'enum': 'MemoryDeviceInfoKind', - 'data': [ 'dimm', 'nvdimm', 'virtio-pmem', 'virtio-mem', 'sgx-epc' ] } + 'data': [ 'dimm', 'nvdimm', 'virtio-pmem', 'virtio-mem', 'sgx-epc', +'hv-balloon' ] } ## # @PCDIMMDeviceInfoWrapper: @@ -1337,6 +1363,14 @@ { 'struct': 'SgxEPCDeviceInfoWrapper', 'data': { 'data': 'SgxEPCDeviceInfo' } } +## +# @HvBalloonDeviceInfoWrapper: +# +# Since: 8.2 +## +{
[PULL 7/9] qapi: Add HV_BALLOON_STATUS_REPORT event and its QMP query command
From: "Maciej S. Szmigiero" Used by the hv-balloon driver for (optional) guest memory status reports. Acked-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/hyperv/hv-balloon.c | 30 +++- monitor/monitor.c | 1 + qapi/machine.json | 62 ++ 3 files changed, 92 insertions(+), 1 deletion(-) diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index c384f23a3b5e..2d1464cd7dca 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -1099,7 +1099,35 @@ static void hv_balloon_handle_status_report(HvBalloon *balloon, balloon->status_report.available *= HV_BALLOON_PAGE_SIZE; balloon->status_report.received = true; -/* report event */ +qapi_event_send_hv_balloon_status_report(balloon->status_report.committed, + balloon->status_report.available); +} + +HvBalloonInfo *qmp_query_hv_balloon_status_report(Error **errp) +{ +HvBalloon *balloon; +HvBalloonInfo *info; + +balloon = HV_BALLOON(object_resolve_path_type("", TYPE_HV_BALLOON, NULL)); +if (!balloon) { +error_setg(errp, "no %s device present", TYPE_HV_BALLOON); +return NULL; +} + +if (!balloon->status_report.enabled) { +error_setg(errp, "guest memory status reporting not enabled"); +return NULL; +} + +if (!balloon->status_report.received) { +error_setg(errp, "no guest memory status report received yet"); +return NULL; +} + +info = g_malloc0(sizeof(*info)); +info->committed = balloon->status_report.committed; +info->available = balloon->status_report.available; +return info; } static void hv_balloon_handle_unballoon_response(HvBalloon *balloon, diff --git a/monitor/monitor.c b/monitor/monitor.c index 941f87815aa4..01ede1babd3d 100644 --- a/monitor/monitor.c +++ b/monitor/monitor.c @@ -315,6 +315,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = { [QAPI_EVENT_QUORUM_FAILURE]= { 1000 * SCALE_MS }, [QAPI_EVENT_VSERPORT_CHANGE] = { 1000 * SCALE_MS }, [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS }, +[QAPI_EVENT_HV_BALLOON_STATUS_REPORT] = { 1000 * SCALE_MS }, 
}; /* diff --git a/qapi/machine.json b/qapi/machine.json index 2985d043c00d..b6d634b30d55 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -1137,6 +1137,68 @@ { 'event': 'BALLOON_CHANGE', 'data': { 'actual': 'int' } } +## +# @HvBalloonInfo: +# +# hv-balloon guest-provided memory status information. +# +# @committed: the amount of memory in use inside the guest plus the +# amount of the memory unusable inside the guest (ballooned out, +# offline, etc.) +# +# @available: the amount of the memory inside the guest available for +# new allocations ("free") +# +# Since: 8.2 +## +{ 'struct': 'HvBalloonInfo', + 'data': { 'committed': 'size', 'available': 'size' } } + +## +# @query-hv-balloon-status-report: +# +# Returns the hv-balloon driver data contained in the last received "STATUS" +# message from the guest. +# +# Returns: +# - @HvBalloonInfo on success +# - If no hv-balloon device is present, guest memory status reporting +# is not enabled or no guest memory status report received yet, +# GenericError +# +# Since: 8.2 +# +# Example: +# +# -> { "execute": "query-hv-balloon-status-report" } +# <- { "return": { +# "committed": 81664, +# "available": 054464 +# } +#} +## +{ 'command': 'query-hv-balloon-status-report', 'returns': 'HvBalloonInfo' } + +## +# @HV_BALLOON_STATUS_REPORT: +# +# Emitted when the hv-balloon driver receives a "STATUS" message from +# the guest. +# +# Note: this event is rate-limited. +# +# Since: 8.2 +# +# Example: +# +# <- { "event": "HV_BALLOON_STATUS_REPORT", +# "data": { "committed": 81664, "available": 054464 }, +# "timestamp": { "seconds": 1600295492, "microseconds": 661044 } } +# +## +{ 'event': 'HV_BALLOON_STATUS_REPORT', + 'data': 'HvBalloonInfo' } + ## # @MemoryInfo: #
[PULL 1/9] memory-device: Support empty memory devices
From: David Hildenbrand Let's support empty memory devices -- memory devices that don't have a memory device region in the current configuration. hv-balloon with an optional memdev is the primary use case. Signed-off-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/mem/memory-device.c | 43 +++--- include/hw/mem/memory-device.h | 7 +- 2 files changed, 46 insertions(+), 4 deletions(-) diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c index ae38f48f1676..db702ccad554 100644 --- a/hw/mem/memory-device.c +++ b/hw/mem/memory-device.c @@ -20,6 +20,22 @@ #include "exec/address-spaces.h" #include "trace.h" +static bool memory_device_is_empty(const MemoryDeviceState *md) +{ +const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md); +Error *local_err = NULL; +MemoryRegion *mr; + +/* dropping const here is fine as we don't touch the memory region */ +mr = mdc->get_memory_region((MemoryDeviceState *)md, &local_err); +if (local_err) { +/* Not empty, we'll report errors later when obtaining the MR again. */ +error_free(local_err); +return false; +} +return !mr; +} + static gint memory_device_addr_sort(gconstpointer a, gconstpointer b) { const MemoryDeviceState *md_a = MEMORY_DEVICE(a); @@ -249,6 +265,10 @@ static uint64_t memory_device_get_free_addr(MachineState *ms, uint64_t next_addr; Range tmp; +if (memory_device_is_empty(md)) { +continue; +} + range_init_nofail(&tmp, mdc->get_addr(md), memory_device_get_region_size(md, &error_abort)); @@ -292,6 +312,7 @@ MemoryDeviceInfoList *qmp_memory_device_list(void) const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(item->data); MemoryDeviceInfo *info = g_new0(MemoryDeviceInfo, 1); +/* Let's query information even for empty memory devices. 
*/ mdc->fill_device_info(md, info); QAPI_LIST_APPEND(tail, info); @@ -311,7 +332,7 @@ static int memory_device_plugged_size(Object *obj, void *opaque) const MemoryDeviceState *md = MEMORY_DEVICE(obj); const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(obj); -if (dev->realized) { +if (dev->realized && !memory_device_is_empty(md)) { *size += mdc->get_plugged_size(md, _abort); } } @@ -337,6 +358,11 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms, uint64_t addr, align = 0; MemoryRegion *mr; +/* We support empty memory devices even without device memory. */ +if (memory_device_is_empty(md)) { +return; +} + if (!ms->device_memory) { error_setg(errp, "the configuration is not prepared for memory devices" " (e.g., for memory hotplug), consider specifying the" @@ -380,10 +406,17 @@ out: void memory_device_plug(MemoryDeviceState *md, MachineState *ms) { const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md); -const unsigned int memslots = memory_device_get_memslots(md); -const uint64_t addr = mdc->get_addr(md); +unsigned int memslots; +uint64_t addr; MemoryRegion *mr; +if (memory_device_is_empty(md)) { +return; +} + +memslots = memory_device_get_memslots(md); +addr = mdc->get_addr(md); + /* * We expect that a previous call to memory_device_pre_plug() succeeded, so * it can't fail at this point. @@ -408,6 +441,10 @@ void memory_device_unplug(MemoryDeviceState *md, MachineState *ms) const unsigned int memslots = memory_device_get_memslots(md); MemoryRegion *mr; +if (memory_device_is_empty(md)) { +return; +} + /* * We expect that a previous call to memory_device_pre_plug() succeeded, so * it can't fail at this point. 
diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h index 3354d6c1667e..a1d62cc551ab 100644 --- a/include/hw/mem/memory-device.h +++ b/include/hw/mem/memory-device.h @@ -38,6 +38,10 @@ typedef struct MemoryDeviceState MemoryDeviceState; * address in guest physical memory can either be specified explicitly * or get assigned automatically. * + * Some memory device might not own a memory region in certain device + * configurations. Such devices can logically get (un)plugged, however, + * empty memory devices are mostly ignored by the memory device code. + * * Conceptually, memory devices only span one memory region. If multiple * successive memory regions are used, a covering memory region has to * be provided. Scattered memory regions are not supported for single @@ -91,7 +95,8 @@ struct MemoryDeviceClass { uint64_t (*get_plugged_size)(const MemoryDeviceState *md, Error **errp); /* - * Return the memory region of the memory device. + * Return the memory region of the
[PULL 2/9] memory-device: Drop size alignment check
From: David Hildenbrand There is no strong requirement that the size has to be a multiple of the requested alignment, let's drop it. This is a preparation for hv-balloon. Signed-off-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/mem/memory-device.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c index db702ccad554..e0704b8dc37a 100644 --- a/hw/mem/memory-device.c +++ b/hw/mem/memory-device.c @@ -236,12 +236,6 @@ static uint64_t memory_device_get_free_addr(MachineState *ms, return 0; } -if (!QEMU_IS_ALIGNED(size, align)) { -error_setg(errp, "backend memory size must be multiple of 0x%" - PRIx64, align); -return 0; -} - if (hint) { if (range_init(&new, *hint, size) || !range_contains_range(&as, &new)) { error_setg(errp, "can't add memory device [0x%" PRIx64 ":0x%" PRIx64
[PULL 5/9] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support
From: "Maciej S. Szmigiero" One of the advantages of using this protocol over ACPI-based PC DIMM hotplug is that it allows hot-adding memory at a much smaller granularity because the ACPI DIMM slot limit does not apply. In order to enable this functionality a new memory backend needs to be created and provided to the driver via the "memdev" parameter. This can be achieved by, for example, adding "-object memory-backend-ram,id=mem1,size=32G" to the QEMU command line and then instantiating the driver with the "memdev=mem1" parameter. The device will try to use multiple memslots to cover the memory backend in order to reduce the size of metadata for the not-yet-hot-added part of the memory backend. Co-developed-by: David Hildenbrand Acked-by: David Hildenbrand Signed-off-by: Maciej S. Szmigiero --- hw/hyperv/hv-balloon-our_range_memslots.c | 201 hw/hyperv/hv-balloon-our_range_memslots.h | 110 + hw/hyperv/hv-balloon.c| 566 +- hw/hyperv/meson.build | 2 +- hw/hyperv/trace-events| 5 + 5 files changed, 878 insertions(+), 6 deletions(-) create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.c create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.h diff --git a/hw/hyperv/hv-balloon-our_range_memslots.c b/hw/hyperv/hv-balloon-our_range_memslots.c new file mode 100644 index ..99bae870f371 --- /dev/null +++ b/hw/hyperv/hv-balloon-our_range_memslots.c @@ -0,0 +1,201 @@ +/* + * QEMU Hyper-V Dynamic Memory Protocol driver + * + * Copyright (C) 2020-2023 Oracle and/or its affiliates. + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ */ + +#include "hv-balloon-internal.h" +#include "hv-balloon-our_range_memslots.h" +#include "trace.h" + +/* OurRange */ +static void our_range_init(OurRange *our_range, uint64_t start, uint64_t count) +{ +assert(count <= UINT64_MAX - start); +our_range->range.start = start; +our_range->range.count = count; + +hvb_page_range_tree_init(&our_range->removed_guest); +hvb_page_range_tree_init(&our_range->removed_both); + +/* mark the whole range as unused but for potential use */ +our_range->added = 0; +our_range->unusable_tail = 0; +} + +static void our_range_destroy(OurRange *our_range) +{ +hvb_page_range_tree_destroy(&our_range->removed_guest); +hvb_page_range_tree_destroy(&our_range->removed_both); +} + +void hvb_our_range_clear_removed_trees(OurRange *our_range) +{ +hvb_page_range_tree_destroy(&our_range->removed_guest); +hvb_page_range_tree_destroy(&our_range->removed_both); +hvb_page_range_tree_init(&our_range->removed_guest); +hvb_page_range_tree_init(&our_range->removed_both); +} + +void hvb_our_range_mark_added(OurRange *our_range, uint64_t additional_size) +{ +assert(additional_size <= UINT64_MAX - our_range->added); + +our_range->added += additional_size; + +assert(our_range->added <= UINT64_MAX - our_range->unusable_tail); +assert(our_range->added + our_range->unusable_tail <= + our_range->range.count); +} + +/* OurRangeMemslots */ +static void our_range_memslots_init_slots(OurRangeMemslots *our_range, + MemoryRegion *backing_mr, + Object *memslot_owner) +{ +OurRangeMemslotsSlots *memslots = &our_range->slots; +unsigned int idx; +uint64_t memslot_offset; + +assert(memslots->count > 0); +memslots->slots = g_new0(MemoryRegion, memslots->count); + +/* Initialize our memslots, but don't map them yet. */ +assert(memslots->size_each > 0); +for (idx = 0, memslot_offset = 0; idx < memslots->count; + idx++, memslot_offset += memslots->size_each) { +uint64_t memslot_size; +g_autofree char *name = NULL; + +/* The size of the last memslot might be smaller. 
*/ +if (idx == memslots->count - 1) { +uint64_t region_size; + +assert(our_range->mr); +region_size = memory_region_size(our_range->mr); +memslot_size = region_size - memslot_offset; +} else { +memslot_size = memslots->size_each; +} + +name = g_strdup_printf("memslot-%u", idx); +memory_region_init_alias(>slots[idx], memslot_owner, name, + backing_mr, memslot_offset, memslot_size); +/* + * We want to be able to atomically and efficiently activate/deactivate + * individual memslots without affecting adjacent memslots in memory + * notifiers. + */ +memory_region_set_unmergeable(>slots[idx], true); +} + +memslots->mapped_count = 0; +} + +O
[PULL 0/9] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
From: "Maciej S. Szmigiero"

The following changes since commit d762bf97931b58839316b68a570eecc6143c9e3e:

  Merge tag 'pull-target-arm-20231102' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-11-03 10:04:12 +0800)

are available in the Git repository at:

  https://github.com/maciejsszmigiero/qemu.git tags/pull-hv-balloon-20231105

for you to fetch changes up to 2b49ecabc6bf15efa6aa05f20a7c319ff65c4e11:

  MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol (2023-11-03 20:31:10 +0100)

----------------------------------------------------------------
Hyper-V Dynamic Memory protocol driver.

This driver is like virtio-balloon on steroids for Windows guests: it allows both changing the guest memory allocation via ballooning and inserting pieces of extra RAM into it on demand from a provided memory backend via the Windows-native Hyper-V Dynamic Memory protocol.

* Protocol definitions.
* Hyper-V DM protocol driver (hv-balloon) base (ballooning only).
* Hyper-V DM protocol driver (hv-balloon) hot-add support.
* qapi query-memory-devices support for the driver.
* qapi HV_BALLOON_STATUS_REPORT event.
* The relevant PC machine plumbing.
* New MAINTAINERS entry for the above.

----------------------------------------------------------------
David Hildenbrand (2):
      memory-device: Support empty memory devices
      memory-device: Drop size alignment check

Maciej S. Szmigiero (7):
      Add Hyper-V Dynamic Memory Protocol definitions
      Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base
      Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) hot-add support
      qapi: Add query-memory-devices support to hv-balloon
      qapi: Add HV_BALLOON_STATUS_REPORT event and its QMP query command
      hw/i386/pc: Support hv-balloon
      MAINTAINERS: Add an entry for Hyper-V Dynamic Memory Protocol

 Kconfig.host                              |    3 +
 MAINTAINERS                               |    8 +
 hw/core/machine-hmp-cmds.c                |   15 +
 hw/hyperv/Kconfig                         |   10 +
 hw/hyperv/hv-balloon-internal.h           |   33 +
 hw/hyperv/hv-balloon-our_range_memslots.c |  201
 hw/hyperv/hv-balloon-our_range_memslots.h |  110 ++
 hw/hyperv/hv-balloon-page_range_tree.c    |  228
 hw/hyperv/hv-balloon-page_range_tree.h    |  118 ++
 hw/hyperv/hv-balloon.c                    | 1766 +
 hw/hyperv/meson.build                     |    1 +
 hw/hyperv/trace-events                    |   18 +
 hw/i386/Kconfig                           |    1 +
 hw/i386/pc.c                              |   22 +
 hw/mem/memory-device.c                    |   49 +-
 include/hw/hyperv/dynmem-proto.h          |  423 +++
 include/hw/hyperv/hv-balloon.h            |   18 +
 include/hw/mem/memory-device.h            |    7 +-
 meson.build                               |   28 +-
 meson_options.txt                         |    2 +
 monitor/monitor.c                         |    1 +
 qapi/machine.json                         |  101 +-
 scripts/meson-buildoptions.sh             |    3 +
 23 files changed, 3153 insertions(+), 13 deletions(-)
 create mode 100644 hw/hyperv/hv-balloon-internal.h
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.c
 create mode 100644 hw/hyperv/hv-balloon-our_range_memslots.h
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.c
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.h
 create mode 100644 hw/hyperv/hv-balloon.c
 create mode 100644 include/hw/hyperv/dynmem-proto.h
 create mode 100644 include/hw/hyperv/hv-balloon.h
[PULL 4/9] Add Hyper-V Dynamic Memory Protocol driver (hv-balloon) base
From: "Maciej S. Szmigiero"

This driver is like virtio-balloon on steroids: it allows both changing the guest memory allocation via ballooning and (in the next patch) inserting pieces of extra RAM into it on demand from a provided memory backend.

The actual resizing is done via the ballooning interface (for example, via the "balloon" HMP command). This includes resizing the guest past its boot size - that is, hot-adding additional memory in granularity limited only by the guest alignment requirements, as provided by the next patch.

In contrast with ACPI DIMM hotplug, where one can only request to unplug a whole DIMM stick, this driver allows removing memory from the guest in single-page (4k) units via ballooning.

After a VM reboot the guest is back to its original (boot) size. In the future, the guest boot memory size might be changed on reboot instead, taking into account the effective size that the VM had before that reboot (much like Hyper-V does).

For performance reasons, the guest-released memory is tracked in a few range trees, as a series of (start, count) ranges. Each time a new page range is inserted into such a tree its neighbors are checked as candidates for possible merging with it. Besides performance reasons, the Dynamic Memory protocol itself uses page ranges as the data structure in its messages, so relevant pages need to be merged into such ranges anyway.

One has to be careful when tracking the guest-released pages, since the guest can maliciously report returning pages outside its current address space, which would later clash with the address range of newly added memory. Similarly, the guest can report freeing the same page twice.

The above design results in much better ballooning performance than when using virtio-balloon with the same guest: 230 GB / minute with this driver versus 70 GB / minute with virtio-balloon.
During a ballooning operation most of the time is spent waiting for the guest to come up with newly freed page ranges; processing the received ranges on the host side (in QEMU and KVM) is nearly instantaneous.

The unballoon operation is also pretty much instantaneous: thanks to the merging of the ballooned-out page ranges, 200 GB of memory can be returned to the guest in about 1 second. With virtio-balloon this operation takes about 2.5 minutes.

These tests were done against a Windows Server 2019 guest running on a Xeon E5-2699, after dirtying the whole memory inside the guest before each balloon operation.

Using a range tree instead of a bitmap to track the removed memory also means that the solution scales well with the guest size: even a 1 TB range takes just a few bytes of such metadata.

Since the required GTree operations aren't present in every Glib version, a check for them was added to the meson build script, together with new "--enable-hv-balloon" and "--disable-hv-balloon" configure arguments. If these GTree operations are missing in the system's Glib version this driver will be skipped during the QEMU build.

An optional "status-report=on" device parameter requests memory status events from the guest (typically sent every second), which allow the host to learn both the guest memory available and the guest memory in use counts. Following commits will add support for their external emission as "HV_BALLOON_STATUS_REPORT" QMP events.

The driver is named hv-balloon since the Linux kernel client driver for the Dynamic Memory Protocol is named as such and to follow the naming pattern established by the virtio-balloon driver. The whole protocol runs over Hyper-V VMBus.

The driver was tested against Windows Server 2012 R2, Windows Server 2016 and Windows Server 2019 guests and obeys the guest alignment requirements reported to the host via the DM_CAPABILITIES_REPORT message.

Acked-by: David Hildenbrand
Signed-off-by: Maciej S. Szmigiero
---
 Kconfig.host                           |    3 +
 hw/hyperv/Kconfig                      |   10 +
 hw/hyperv/hv-balloon-internal.h        |   33 +
 hw/hyperv/hv-balloon-page_range_tree.c |  228 +
 hw/hyperv/hv-balloon-page_range_tree.h |  118 +++
 hw/hyperv/hv-balloon.c                 | 1157
 hw/hyperv/meson.build                  |    1 +
 hw/hyperv/trace-events                 |   13 +
 include/hw/hyperv/hv-balloon.h         |   18 +
 meson.build                            |   28 +-
 meson_options.txt                      |    2 +
 scripts/meson-buildoptions.sh          |    3 +
 12 files changed, 1613 insertions(+), 1 deletion(-)
 create mode 100644 hw/hyperv/hv-balloon-internal.h
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.c
 create mode 100644 hw/hyperv/hv-balloon-page_range_tree.h
 create mode 100644 hw/hyperv/hv-balloon.c
 create mode 100644 include/hw/hyperv/hv-balloon.h

diff --git a/Kconfig.host b/Kconfig.host
index d763d892693c..2ee71578f38f 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -46,3 +46,6 @@ config FUZZ
 config VFIO_USER_SE
Re: [PATCH v8 0/9] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
On 2.11.2023 14:50, David Hildenbrand wrote:
> On 23.10.23 19:24, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero"
>>
>> This is a continuation of the v7 of the patch series located here:
>> https://lore.kernel.org/qemu-devel/cover.1693240836.git.maciej.szmigi...@oracle.com/
>
> I skimmed over most parts and nothing jumped at me memory-device-related; I'm hoping I can take another closer look later/tomorrow; it's a lot of code and an in-depth review would be great. But I don't know if we'll find someone to volunteer? :)

Thanks - even a cursory review is valuable.

> Soft-freeze is in 5 days. You seem to be the only hyperv-related maintainer listed in MAINTAINERS. Do you want to merge this or should I route this via mem-next?

I can post a pull request this weekend so it can be pulled in on Monday (hopefully).

> For the time being
>
> Acked-by: David Hildenbrand

Thanks,
Maciej
Re: [PATCH] hyperv: add check for NULL for msg
On 26.10.2023 11:31, Анастасия Любимова wrote:
> 28/09/23 19:18, Maciej S. Szmigiero wrote:
>> On 28.09.2023 15:25, Anastasia Belova wrote:
>>> cpu_physical_memory_map may return NULL in hyperv_hcall_post_message. Add a check for NULL to avoid a NULL dereference.
>>>
>>> Found by Linux Verification Center (linuxtesting.org) with SVACE.
>>>
>>> Fixes: 76036a5fc7 ("hyperv: process POST_MESSAGE hypercall")
>>> Signed-off-by: Anastasia Belova
>>
>> Makes sense to me, thanks.
>>
>> Did you run your static checker through the remaining QEMU files, too? I can see similar cpu_physical_memory_map() usage in, for example: target/s390x/helper.c, hw/nvram/spapr_nvram.c, hw/hyperv/vmbus.c, display/ramfb.c...
>
> It seems that the configurations for analysis do not contain these files, so the checker hasn't warned us. Additional time is needed to analyze these pieces of code and form patches if necessary.

No problem, it's not an urgent issue.

> Anastasia Belova

Thanks,
Maciej