On Wed, Oct 25, 2023 at 03:20:16PM +0100, Daniel P. Berrangé wrote:
> On Wed, Oct 25, 2023 at 10:57:12AM -0300, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berra...@redhat.com> writes:
> >
> > > On Mon, Oct 23, 2023 at 05:35:45PM -0300, Fabiano Rosas wrote:
> > >> Add a capability that allows the management layer to delegate to QEMU
> > >> the decision of whether to pause a VM and perform a non-live
> > >> migration. Depending on the type of migration being performed, this
> > >> could bring performance benefits.
> > >
> > > I'm not really seeing what problem this is solving.
> > >
> >
> > Well, this is the fruit of your discussion with Peter Xu in the previous
> > version of the patch.
> >
> > To recap: he thinks QEMU is doing useless work with file migrations
> > because they are always asynchronous. He thinks we should always pause
> > before doing fixed-ram migration. You said that libvirt would rather use
> > fixed-ram for a broader set of savevm-style commands, so you'd rather
> > not always pause. I'm trying to cater to both of your wishes. This new
> > capability is the middle ground I came up with.
> >
> > So fixed-ram would always pause the VM, because that is the primary
> > use-case, but libvirt would be allowed to say: don't pause this time.
>
> If the VM is going to be powered off immediately after saving
> a snapshot then yes, you might as well pause it, but we can't
> assume that will be the case. An equally common use case
> would be for saving periodic snapshots of a running VM. This
> should be transparent such that the VM remains running the
> whole time, except a narrow window at completion of RAM/state
> saving where we flip the disk snapshots, so they are in sync
> with the RAM snapshot.

Libvirt will still use fixed-ram for live snapshot purposes, especially for
Windows?  Then auto-pause may still be useful to distinguish that from what
Fabiano wants to achieve here (which is, in reality, non-live)?

IIRC from the previous discussion, that was the major point: libvirt can
still leverage fixed-ram for a live case, since Windows lacks an efficient
live snapshot (the background-snapshot feature).

From that POV it sounds like auto-pause is a good knob for that.

>
> IOW, save/restore to disk can imply paused, but snapshotting
> should not imply paused. So I don't see an unambiguous
> rationale that we should diverge when fixed-ram is set and
> auto-pause the VM.
>
> > > Mgmt apps are perfectly capable of pausing the VM before issuing
> > > the migrate operation.
> > >
> >
> > Right. But would QEMU be allowed to just assume that if a VM is paused
> > at the start of migration it can then go ahead and skip all dirty page
> > mechanisms?
>
> Skipping dirty page tracking would imply that the mgmt app cannot
> resume CPUs without either letting the operation complete, or
> aborting it.
>
> That is probably a reasonable assumption, as I can't come up with
> a use case for starting out paused and then later resuming, unless
> there was a scenario where you needed to synchronize something
> external with the start of migration. Synchronizing storage though
> is something that happens at the end of migration instead.
>
> > Without pausing, we're basically doing *live* migration into a static
> > file that will be kept on disk for who knows how long before being
> > restored on the other side. We could release the src QEMU resources (a
> > bit) earlier if we paused the VM beforehand.
>
> Can we really release resources early?
> If the save operation fails right at the end, we want to be able to
> resume execution of CPUs, which assumes all resources are still
> available, otherwise we have a failure scenario where we've not
> successfully saved to disk and also don't still have the running QEMU.

Indeed, we need to consider what happens if the user starts the VM again
during an auto-pause enabled migration.  There are a few options, and one of
them should allow early freeing of resources.  Assuming auto-pause=on and
the migration has started, then:

  1) Allow the VM to start later

    1.a) Start dirty tracking right at this point

         I don't prefer this.  It would make everything transparent, but
         IMHO it adds unnecessary complexity in maintaining dirty tracking
         status.

    1.b) Fail the migration

         Can be a good option, IMHO, treating auto-pause as a promise from
         the user that the VM won't need to be running anymore.  If the VM
         starts, the promise is broken and the migration fails.

  2) Don't allow the VM to start later

     Can also be a good option.  In this case VM resources (I think mostly
     RAM) can be freed right after being migrated.  If the user requests a
     VM start, fail the start instead of the migration itself.  The
     migration must succeed or data is lost.

Thanks,

> >
> > We're basically talking about whether we want the VM to be usable in the
> > (hopefully) very short time between issuing the migration command and
> > the migration being finished. We might be splitting hairs here, but we
> > need some sort of consensus.
>
> The time may not be very short for large VMs.
>
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

--
Peter Xu
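
As a rough illustration of options 1.a, 1.b and 2 above, here is a minimal
Python sketch of the three policies for a VM start request that arrives
while an auto-pause migration is in flight.  It is only a model of the
decision logic: `StartPolicy`, `Migration` and every field and method name
are hypothetical and do not correspond to any QEMU code or QMP interface.

```python
from dataclasses import dataclass
from enum import Enum, auto


class StartPolicy(Enum):
    """Hypothetical policies for a VM start during an auto-pause migration."""
    START_DIRTY_TRACKING = auto()  # option 1.a
    FAIL_MIGRATION = auto()        # option 1.b
    REJECT_START = auto()          # option 2


@dataclass
class Migration:
    policy: StartPolicy
    vm_running: bool = False      # auto-pause=on: the VM was stopped up front
    dirty_tracking: bool = False  # skipped while the VM is paused
    failed: bool = False

    def can_free_ram_early(self) -> bool:
        # Only option 2 guarantees the VM never runs again before the
        # migration completes, so only then can migrated RAM be released.
        return self.policy is StartPolicy.REJECT_START

    def request_vm_start(self) -> bool:
        """Handle a start request; return True if the VM is now running."""
        if self.policy is StartPolicy.REJECT_START:
            # Option 2: fail the start, the migration keeps going.
            return False
        if self.policy is StartPolicy.FAIL_MIGRATION:
            # Option 1.b: the "won't run again" promise is broken.
            self.failed = True
        else:
            # Option 1.a: begin dirty tracking from this point on.
            self.dirty_tracking = True
        self.vm_running = True
        return True
```

In this model only `REJECT_START` lets `can_free_ram_early()` return True,
which matches the point that option 2 is the one allowing early release of
resources, at the cost that the migration must then succeed.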