On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>
> > > > >
> > > > > VFIO device state transfer is currently done via the main migration
> > > > > channel. This means that transfers from multiple VFIO devices are
> > > > > done sequentially and via just a single common migration channel.
> > > > >
> > > > > Such a way of transferring VFIO device state migration data reduces
> > > > > performance and severely impacts the migration downtime (~50%) for
> > > > > VMs that have multiple such devices with large state size - see the
> > > > > test results below.
> > > > >
> > > > > However, we already have a way to transfer migration data using
> > > > > multiple connections - that's what multifd channels are.
> > > > >
> > > > > Unfortunately, multifd channels are currently utilized for RAM
> > > > > transfer only. This patch set adds a new framework allowing their
> > > > > use for device state transfer too.
> > > > >
> > > > > The wire protocol is based on Avihai's x-channel-header patches,
> > > > > which introduce a header for migration channels that allows the
> > > > > migration source to explicitly indicate the migration channel type
> > > > > without having the target deduce the channel type by peeking in the
> > > > > channel's content.
> > > > >
> > > > > The new wire protocol can be switched on and off via the
> > > > > migration.x-channel-header option for compatibility with older QEMU
> > > > > versions and for testing. Switching the new wire protocol off also
> > > > > disables device state transfer via multifd channels.
> > > > >
> > > > > The device state transfer can happen either via the same multifd
> > > > > channels as RAM data is transferred, mixed with RAM data (when
> > > > > migration.x-multifd-channels-device-state is 0), or exclusively via
> > > > > dedicated device state transfer channels (when
> > > > > migration.x-multifd-channels-device-state > 0).
> > > > >
> > > > > Using dedicated device state transfer multifd channels brings
> > > > > further performance benefits since these channels don't need to
> > > > > participate in the RAM sync process.
> > > >
> > > > I'm not convinced there's any need to introduce the new "channel
> > > > header" protocol messages. The multifd channels already have an
> > > > initialization message that is extensible to allow extra semantics
> > > > to be indicated. So if we want some of the multifd channels to be
> > > > reserved for device state, we could indicate that via some data in
> > > > the MultiFDInit_t message struct.
> > >
> > > The reason for introducing x-channel-header was to avoid having to
> > > deduce the channel type by peeking in the channel's content - where
> > > any channel that does not start with QEMU_VM_FILE_MAGIC is currently
> > > treated as a multifd one.
> > >
> > > But if this isn't desired then, as you say, the multifd channel type
> > > can be indicated by using some unused field of the MultiFDInit_t
> > > message.
> > >
> > > Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.
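To make the suggestion above concrete, a minimal sketch of the kind of
change being discussed - flagging a channel's purpose in the existing
multifd initialization packet instead of adding a new channel header -
could look like the snippet below. The flags field and its bit value are
hypothetical illustrations, not something current QEMU or the series
defines:

    /* Illustrative only: indicate a multifd channel's purpose in its
     * initialization packet, so the destination never has to peek at
     * the channel's content. The "flags" field and the flag value are
     * hypothetical, carved out of otherwise reserved space.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define MULTIFD_CHANNEL_FLAG_DEVICE_STATE 0x1

    typedef struct {
        uint32_t magic;
        uint32_t version;
        unsigned char uuid[16];   /* QemuUUID of the migration */
        uint8_t id;               /* channel number */
        uint8_t flags;            /* hypothetical purpose flags */
        uint8_t unused1[6];       /* remaining reserved bytes */
        uint64_t unused2[4];      /* reserved for future use */
    } __attribute__((packed)) MultiFDInit_t;

    /* Receive side: decide what this channel will carry based on the
     * init packet alone.
     */
    static bool multifd_channel_is_device_state(const MultiFDInit_t *msg)
    {
        return msg->flags & MULTIFD_CHANNEL_FLAG_DEVICE_STATE;
    }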
> >
> > I don't like the heuristics we currently have, and would like to have
> > a better solution. What makes me cautious is that this proposal is a
> > protocol change, but only addressing one very narrow problem with the
> > migration protocol.
> >
> > I'd like migration to see a more explicit bi-directional protocol
> > negotiation message set, where both QEMUs can auto-negotiate amongst
> > themselves many of the features that currently require tedious manual
> > configuration by mgmt apps via migrate parameters/capabilities. That
> > would address the problem you describe here, and so much more.
> 
> Isn't the capability negotiation handled automatically by libvirt
> today?
> I guess you'd prefer for QEMU to internally handle it instead?

Yes, it would be much saner if QEMU handled it automatically as part of
its own protocol handshake. This avoids the need to change libvirt to
enable new functionality in the migration protocol in many (but not
all) cases, and thus speeds up development and deployment of new
features. Libvirt should really only need to be changed to support
runtime performance tunables, rather than migration protocol features.
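Purely as an illustration of that kind of handshake (nothing like this
exists in the migration protocol today), each side could advertise a
feature bitmap and both ends would independently settle on the
intersection, instead of relying on the management application to
pre-configure matching capabilities on both sides. The message layout
and feature bits below are invented for illustration:

    /* Hypothetical bi-directional feature negotiation sketch; this is
     * not an existing QEMU migration message.
     */
    #include <stdint.h>

    #define MIG_FEAT_MULTIFD          (1ULL << 0)
    #define MIG_FEAT_DEVICE_STATE_FD  (1ULL << 1) /* device state over multifd */
    #define MIG_FEAT_CHANNEL_HEADER   (1ULL << 2)

    typedef struct {
        uint32_t magic;
        uint32_t version;
        uint64_t supported_features;  /* everything this QEMU can do */
    } MigHandshakeMsg;

    /* Both sides send their MigHandshakeMsg, then each independently
     * computes the same effective feature set without mgmt app help.
     */
    static uint64_t mig_negotiate(uint64_t local_features,
                                  uint64_t remote_features)
    {
        return local_features & remote_features;
    }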
> > > > That said, the idea of reserving channels specifically for VFIO
> > > > doesn't make a whole lot of sense to me either.
> > > >
> > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > doing device state transfer, all the multifd channels are idle.
> > > > We should just use all those channels to transfer the device state,
> > > > in parallel. Reserving channels just guarantees many idle channels
> > > > during RAM transfer, and further idle channels during vmstate
> > > > transfer.
> > > >
> > > > IMHO it is more flexible to just use all available multifd channel
> > > > resources all the time.
> > >
> > > The reason for having dedicated device state channels is that they
> > > provide lower downtime in my tests.
> > >
> > > With either 15 or 11 mixed multifd channels (no dedicated device
> > > state channels) I get a downtime of about 1250 msec.
> > >
> > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > state channels, which give a downtime of about 1100 ms, using
> > > dedicated channels gets about a 14% downtime improvement.
> >
> > Hmm, can you clarify /when/ the VFIO vmstate transfer takes place?
> > Is it transferred concurrently with the RAM? I had thought this series
> > still has the RAM transfer iterations running first, and then the VFIO
> > VM state at the end, simply making use of multifd channels for
> > parallelism of the end phase. Your reply, though, makes me question my
> > interpretation.
> >
> > Let me try to illustrate channel flow in various scenarios, time
> > flowing left to right:
> >
> > 1. serialized RAM, then serialized VM state (ie historical migration)
> >
> >      main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> >
> >
> > 2. parallel RAM, then serialized VM state (ie today's multifd)
> >
> >      main: | Init |                                            | VM state |
> >  multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >  multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >  multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >
> >
> > 3. parallel RAM, then parallel VM state
> >
> >      main: | Init |                                            | VM state |
> >  multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >  multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >  multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >  multifd4:                                                     | VFIO VM state |
> >  multifd5:                                                     | VFIO VM state |
> >
> >
> > 4. parallel RAM and VFIO VM state, then remaining VM state
> >
> >      main: | Init |                                            | VM state |
> >  multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >  multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >  multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> >  multifd4:        | VFIO VM state                              |
> >  multifd5:        | VFIO VM state                              |
> >
> >
> > I thought this series was implementing approx (3), but are you
> > actually implementing (4), or something else entirely?
> 
> You are right that this series' operation approximately implements the
> schema described as number 3 in your diagrams.
> 
> However, there are some additional details worth mentioning:
> 
> * There is some VFIO data, though a relatively small amount, being
>   transferred from the "save_live_iterate" SaveVMHandler while the VM
>   is still running.
> 
>   This is still happening via the main migration channel.
>   Parallelizing this transfer in the future might make sense too,
>   although obviously this doesn't impact the downtime.
> 
> * After the VM is stopped and downtime starts, the main (~400 MiB)
>   VFIO device state gets transferred via multifd channels.
> 
>   However, these multifd channels (if they are not dedicated to device
>   state transfer) aren't idle during that time. Rather, they seem to
>   be transferring the residual RAM data.
> 
>   That's most likely what causes the additional observed downtime when
>   dedicated device state transfer multifd channels aren't used.
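As a rough illustration of the switchover-phase flow described above
(scenario 3), where the stopped-VM device state is chunked across
whichever multifd channels are eligible to carry it, the send side could
look something like the sketch below. The helper names are hypothetical
and are not the actual API added by the series:

    /* Hypothetical sketch of switchover-phase device state transfer
     * over multifd channels; function names are illustrative only.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define DEVICE_STATE_CHUNK_SIZE (1024 * 1024)

    /* Assumed helper: queue one buffer onto any idle eligible multifd
     * channel, blocking until a channel becomes available.
     */
    extern void multifd_queue_device_state_chunk(const char *idstr,
                                                 const uint8_t *buf,
                                                 size_t len);

    /* Called per VFIO device once the VM is stopped: split its saved
     * state into chunks so several channels can carry it in parallel.
     */
    static void vfio_send_device_state(const char *idstr,
                                       const uint8_t *state, size_t size)
    {
        for (size_t off = 0; off < size; off += DEVICE_STATE_CHUNK_SIZE) {
            size_t len = size - off;
            if (len > DEVICE_STATE_CHUNK_SIZE) {
                len = DEVICE_STATE_CHUNK_SIZE;
            }
            multifd_queue_device_state_chunk(idstr, state + off, len);
        }
    }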
Ahh yes, I forgot about the residual dirty RAM, that makes sense as an
explanation. Allow me to work through the scenarios though, as I still
think my suggestion to not have separate dedicated channels is
better...

Let's say hypothetically we have an existing deployment today that uses
6 multifd channels for RAM, ie:

     main: | Init |                                            | VM state |
 multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |

That value of 6 was chosen because it corresponds to the amount of
network & CPU utilization the admin wants to allow this VM to use for
migration. All 6 channels are fully utilized at all times.

If we now want to parallelize VFIO VM state, the peak network and CPU
utilization the admin wants to reserve for the VM should not change.
Thus the admin will still want to configure only 6 channels total.

With your proposal the admin has to reduce RAM transfer to 4 of the
channels, in order to then reserve 2 channels for VFIO VM state, so we
get a flow like:

     main: | Init |                                            | VM state |
 multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd5:                                                     | VFIO VM state |
 multifd6:                                                     | VFIO VM state |

This is bad, as it reduces performance of the RAM transfer. VFIO VM
state transfer is better, but that's not a net win overall.

So let's say the admin was happy to increase the number of multifd
channels from 6 to 8. This series proposes that they would leave RAM
using 6 channels as before, and now reserve the 2 extra ones for VFIO
VM state:

     main: | Init |                                            | VM state |
 multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
 multifd7:                                                     | VFIO VM state |
 multifd8:                                                     | VFIO VM state |

RAM would perform as well as it did historically, and VM state would
improve due to the 2 parallel channels not competing with the residual
RAM transfer.

This is what your latency comparison numbers show as a benefit for this
channel reservation design.

I believe this comparison is inappropriate / unfair though, as it is
comparing a situation with 6 total channels against a situation with 8
total channels.

If the admin was happy to increase the total channels to 8, then they
should allow RAM to use all 8 channels, and then let VFIO VM state +
residual RAM also use the very same set of 8 channels:

     main: | Init |                                            | VM state |
 multifd1:        | RAM iter 1 | ... | RAM iter N | Residual RAM + VFIO VM state |
 multifd2:        | RAM iter 1 | ... | RAM iter N | Residual RAM + VFIO VM state |
 multifd3:        | RAM iter 1 | ... | RAM iter N | Residual RAM + VFIO VM state |
 multifd4:        | RAM iter 1 | ... | RAM iter N | Residual RAM + VFIO VM state |
 multifd5:        | RAM iter 1 | ... | RAM iter N | Residual RAM + VFIO VM state |
 multifd6:        | RAM iter 1 | ... | RAM iter N | Residual RAM + VFIO VM state |
 multifd7:        | RAM iter 1 | ... | RAM iter N | Residual RAM + VFIO VM state |
 multifd8:        | RAM iter 1 | ... | RAM iter N | Residual RAM + VFIO VM state |

This will speed up the initial RAM iters still further, and the final
switchover phase even more. If residual RAM is larger than the VFIO VM
state, then it will dominate the switchover latency, so having VFIO VM
state compete with it is not a problem. If the VFIO VM state is larger
than residual RAM, then allowing it access to all 8 channels instead of
only 2 channels will be a clear win.
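To put that argument in rough quantitative terms, here is a
back-of-the-envelope model of the switchover transfer time under the
two layouts. All inputs (sizes, per-channel bandwidth, channel counts)
are hypothetical, and the model ignores per-packet and RAM sync
overheads:

    /* Back-of-the-envelope model of switchover transfer time; inputs
     * are hypothetical and overheads are ignored.
     */

    /* All channels shared: residual RAM and device state are pooled
     * across every channel.
     */
    static double switchover_time_shared(double residual_ram_bytes,
                                         double device_state_bytes,
                                         int total_channels,
                                         double channel_bw_bytes_per_s)
    {
        return (residual_ram_bytes + device_state_bytes) /
               (total_channels * channel_bw_bytes_per_s);
    }

    /* Reserved channels: RAM and device state each get a fixed subset,
     * so the slower subset determines the switchover time.
     */
    static double switchover_time_reserved(double residual_ram_bytes,
                                           double device_state_bytes,
                                           int ram_channels,
                                           int dev_channels,
                                           double channel_bw_bytes_per_s)
    {
        double t_ram = residual_ram_bytes /
                       (ram_channels * channel_bw_bytes_per_s);
        double t_dev = device_state_bytes /
                       (dev_channels * channel_bw_bytes_per_s);
        return t_ram > t_dev ? t_ram : t_dev;
    }

Under this simplified model the shared layout never loses, since
max(a/m, b/n) >= (a+b)/(m+n) for any split of m+n channels into m and n;
what it deliberately leaves out is the RAM sync participation cost that
the dedicated channels in the series avoid.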
With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|