On 4/4/23 16:53, Peter Xu wrote:
> On Tue, Apr 04, 2023 at 10:00:16AM +0200, Claudio Fontana wrote:
>> Hi Peter,
>
> Hi, Claudio,
>
>>
>> On 4/3/23 21:26, Peter Xu wrote:
>>> Hi, Claudio,
>>>
>>> Thanks for the context.
>>>
>>> On Mon, Apr 03, 2023 at 09:47:26AM +0200, Claudio Fontana wrote:
>>>> Hi, not sure if what is asked here is context in terms of the previous
>>>> upstream discussions or the specific requirement we are trying to bring
>>>> upstream.
>>>>
>>>> In terms of the specific requirement we are trying to bring upstream, we
>>>> need the libvirt+QEMU VM save and restore functionality to be able to
>>>> transfer VM sizes of ~30 GB (4/8 vcpus) in roughly 5 seconds. When an
>>>> event trigger happens, the VM needs to be quickly paused and saved to
>>>> disk safely, including datasync, and another VM needs to be restored,
>>>> also in ~5 secs. For our specific requirement, the VM is never running
>>>> when its data (mostly consisting of RAM) is saved.
>>>>
>>>> I understand that the need to also handle the "live" case comes from
>>>> upstream discussions about solving the "general case", where someone
>>>> might want to do this for "live" VMs, but if helpful I want to highlight
>>>> that it is not part of the specific requirement we are trying to
>>>> address, and for this specific case it won't be in the future either, as
>>>> the whole point of the trigger is to replace the running VM with another
>>>> VM, so it cannot be kept running.
>>>
>>> From what I read so far, that scenario suits exactly what live snapshot
>>> would do with current QEMU - that at least should involve a snapshot on
>>> the disks being used, or I can't see how that can be live. So it looks
>>> like a separate request.
>>>
>>>> The reason we are using "migrate" here likely stems from the fact that
>>>> existing libvirt code currently uses QMP migrate to implement the save
>>>> and restore commands. And in my personal view, I think that reusing the
>>>> existing building blocks (migration, multifd) would be preferable, to
>>>> avoid having to maintain two separate ways to do the same thing. That
>>>> said, it could be done in a different way, if the performance can keep
>>>> up. Just thinking of reducing the overall effort and also the
>>>> maintenance surface.
>>>
>>> I would vaguely guess the performance can not only keep up but be better
>>> than what the current solution provides, due to the possibility of (1)
>>> batch handling of continuous guest pages, and (2) no dirty tracking
>>> overhead at all.
>>>
>>> For (2), it's not about wr-protect page faults or vmexits due to PML
>>> being full (because vcpus will be stopped anyway..), it's about enabling
>>> the dirty tracking (which already carries overhead, especially when huge
>>> pages are enabled, to split huge pages in EPT pgtables) and all the
>>> bitmap operations QEMU does during live migration even if the VM is not
>>> live.
>>
>> Something we could profile for; I do not remember it being a really
>> important source of overhead in my previous profiling runs,
>> but it may be worthwhile to redo the profiling with Fabiano's patchset.
>
> Yes, I don't know the detailed numbers either; it should depend on the
> guest configuration (mem size, mem type, kernel version, etc). It could be
> less of a concern compared to the time used elsewhere. More on this below.
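Just to put a back-of-the-envelope number on the time budget we are working
with (using the figures from above): ~30 GB in ~5 seconds means sustaining
roughly

    30 GB / 5 s = 6 GB/s

end to end, including the final datasync, so any per-page or bitmap overhead
has to be weighed against a fairly tight budget.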
>
>>
>>>
>>> IMHO reusing multifd may or may not be a good idea here, because it'll of
>>> course also complicate the multifd code, hence make multifd harder to
>>> maintain, while not in a good way, because as I mentioned I don't think
>>> it can use much of what multifd provides.
>>
>> The main advantage we get is the automatic multithreading of the
>> qemu_savevm_state_iterate code, in my view.
>>
>> Reimplementing the same thing again has the potential to cause bitrot for
>> this use case, and using multiple fds for the transfer is exactly what is
>> needed here, and in my understanding is the exact same reason multifd
>> exists: to take advantage of high bandwidth migration channels.
>>
>> The only adjustment needed to multifd is the ability to work with block
>> devices (file fds) as the migration channels instead of just sockets,
>> so it seems a very natural extension of multifd to me.
>
> Yes, since I haven't looked at the multifd patches at all, I don't have a
> solid clue on how much it'll affect multifd. I'll leave that to Juan.
>
>>
>>>
>>> I don't have a strong opinion on the impl (even though I do have a
>>> preference..), but I think at least we should still check on two things:
>>>
>>> - Being crystal clear on the use case above, and double check whether
>>>   "VM stop" should be the default operation at the start of the new cmd
>>>   - we shouldn't assume the user will be aware of doing this, neither
>>>   should we assume the user is aware of the performance implications.
>>
>> Not sure I can identify what you are asking specifically: the use case is
>> to stop executing the currently running VM as soon as possible, save it to
>> disk, then restore another VM as soon as possible.
>> Probably I missed something there.
>
> Yes, then IMHO as mentioned we should make "vm stop" part of the command
> procedure if the VM was still running when invoked. Then we can already
> optimize the dirty logging mentioned in (2) above with the current
> framework. E.g., we already optimized live snapshot to not enable dirty
> logging:
>
>     if (!migrate_background_snapshot()) {
>         memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
>         migration_bitmap_sync_precopy(rs);
>     }
>
> Maybe that can also be done for fixed-ram migration, so no matter how much
> overhead there would be, it can be avoided.
Understood, agree.

Would it make sense to check for something like

    if (!runstate_is_running())

instead of checking for the specific multifd + fixed-ram feature? From a
high-level perspective, I think there should not be any dirtying if the
vcpus are not running, right? This could even be a bit more future proof,
avoiding checks for many individual features if they all happen to share
the fact that the vcpus are not running.

>
> PS: I think similar optimizations can be done too in ram_save_complete() or
> ram_state_pending_exact().. maybe we should move the check into
> migration_bitmap_sync_precopy() so it can be skipped as a whole when it can.

Makes sense, interesting.

I wonder if ramblock_is_ignored() could be optimized a bit too, since it
seems to consume roughly the same amount of cpu as the dirty bitmap
handling, even when "ignore-shared" is not used.

This feature was added by:

commit fbd162e629aaf8a7e464af44d2f73d06b26428ad
Author: Yury Kotov <yury-ko...@yandex-team.ru>
Date:   Fri Feb 15 20:45:46 2019 +0300

    migration: Add an ability to ignore shared RAM blocks

    If ignore-shared capability is set then skip shared RAMBlocks during the
    RAM migration. Also, move qemu_ram_foreach_migratable_block (and rename)
    to the migration code, because it requires access to the migration
    capabilities.

    Signed-off-by: Yury Kotov <yury-ko...@yandex-team.ru>
    Message-Id: <20190215174548.2630-4-yury-ko...@yandex-team.ru>
    Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
    Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com>

Probably not that important, just mentioning it since we were thinking of
possible small optimizations.

I would like to share the complete previous callgrind data, but cannot find
a way to export it in a readable state; I could export the graph as a PDF
though, if helpful. Likely we'd need a new round of measurements with
perf...

Ciao,

Claudio

>
> Thanks,
>
>>
>>>
>>> - Making sure the image layout is well defined, so:
>>>
>>>   - It'll be extensible in the future, and,
>>>
>>>   - If someone would like to refactor it to not use the migration thread
>>>     model anymore, the image format, hopefully, can be easy to keep
>>>     untouched so it can be compatible with the current approach.
>>>
>>> Just my two cents. I think Juan should have the best grasp on this.
>>>
>>> Thanks,
>>>
>>
>> Ciao,
>>
>> Claudio
>>
>
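P.S.: just to make the runstate idea above a bit more concrete, this is
roughly the kind of change I have in mind. It is only an untested sketch,
assuming the snippet you pasted still sits in ram_init_bitmaps(), and it
ignores for now the question of what should happen if the vcpus get resumed
while the save is still in progress:

    /* In ram_init_bitmaps(), purely illustrative: */
    if (!migrate_background_snapshot() && runstate_is_running()) {
        /*
         * Only enable dirty tracking (and pay for the initial bitmap
         * sync) when the vcpus can actually dirty memory while we are
         * writing it out.
         */
        memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
        migration_bitmap_sync_precopy(rs);
    }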