Hi Kirti,

What's your opinion about this?

Thanks,
Keqian

On 2021/1/30 14:30, Keqian Zhu wrote:
> Hi Kirti,
>
> On 2021/1/28 5:03, Kirti Wankhede wrote:
>>
>> On 1/11/2021 1:04 PM, Keqian Zhu wrote:
>>> For now the switch of vfio dirty page tracking is integrated into
>>> the vfio_save_handler, which causes some problems [1].
>>>
>>
>> Sorry, I missed the [1] mail, somehow it didn't land in my inbox.
>>
>>> The object of dirty tracking is guest memory, but the object of
>>> the vfio_save_handler is device state. This mixed logic produces
>>> unnecessary coupling and conflicts:
>>>
>>> 1. Coupling: Their saving granules are different (per VM vs per device).
>>>    vfio will enable dirty_page_tracking for each device, while doing it
>>>    once is actually enough.
>>
>> That's correct, enabling dirty page tracking once is enough. But log_start
>> and log_stop get called on address space update transactions, region_add()
>> or region_del(), at which point migration may not be active. We don't want
>> to allocate bitmap memory in the kernel for the lifetime of the VM without
>> knowing whether migration will happen or not. The vfio_iommu_type1 module
>> should allocate bitmap memory only while migration is active.
>>
> Yeah, we can use the global start/stop callbacks as suggested by Paolo,
> which solves this problem.
>
>> Paolo's suggestion to use the log_global_start and log_global_stop
>> callbacks seems correct here. But at that point the vfio device state has
>> not yet changed to _SAVING, as you had identified in [1]. Maybe we can
>> start tracking the bitmap in the iommu_type1 module while the device is
>> not yet _SAVING, but getting the dirty bitmap while the device is not yet
>> in the _SAVING|_RUNNING state doesn't seem like an optimal solution.
>>
>> Pasting here your question from [1]:
>>
>>> Before starting dirty tracking, we will check and ensure that the device
>>> is in the _SAVING state and return an error otherwise. But the question
>>> is: what is the rationale? Why does the VFIO_IOMMU_DIRTY_PAGES ioctl
>>> have anything to do with the device state?
>>
>> Let's walk through the types of devices we are supporting:
>>
>> 1. mdev devices without an IOMMU backed device
>>    The vendor driver pins pages as and when required during runtime. We can
>>    say that the vendor driver is smart and identifies the pages to pin. We
>>    are good here.
>>
>> 2. mdev device with an IOMMU backed device
>>    This is similar to vfio-pci, a directly assigned device, where all pages
>>    are pinned at VM bootup. The vendor driver is not smart, so a bitmap
>>    query will always report all pages as dirty. If --auto-converge is not
>>    set, the VM gets stuck indefinitely in the pre-copy phase. This is known
>>    to us.
>>
> Little question here ;-). Why does auto-converge (slowing down the vCPUs)
> help ease the case where everything is dirty?
>
>> 3. mdev device with an IOMMU backed device and a smart vendor driver
>>    In this case as well, all pages are pinned at VM bootup, but the vendor
>>    driver is smart enough to identify the pages and pin them explicitly.
>>    Pages can be pinned at any time, i.e. during normal VM runtime, on
>>    setting the _SAVING flag (entering the pre-copy phase), or while in the
>>    iterative pre-copy phase. There is no restriction based on these phases
>>    for calling vfio_pin_pages(). The vendor driver can start pinning pages
>>    based on its device state when the _SAVING flag is set. In that case, if
>>    the dirty bitmap is queried before that, it will report all of sysmem as
>>    dirty, causing an unnecessary copy of sysmem.
>>    As an optimal solution, I think it is better to query the bitmap only
>>    after all vfio devices are in the pre-copy phase, i.e. after the _SAVING
>>    flag is set.
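As a side note, below is a minimal sketch of the per-container alternative
that the log_global_start/log_global_stop suggestion above points at, reusing
the VFIO_IOMMU_DIRTY_PAGES ioctl from the patch further down. The hook names
come from QEMU's MemoryListener; everything else is illustrative only and not
part of this patch:

    static void vfio_listener_log_global_start(MemoryListener *listener)
    {
        VFIOContainer *container = container_of(listener, VFIOContainer, listener);
        struct vfio_iommu_type1_dirty_bitmap dirty = {
            .argsz = sizeof(dirty),
            .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
        };

        /* one ioctl per container, i.e. once per VM, not once per device */
        if (ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty)) {
            error_report("Failed to start dirty tracking errno: %d", errno);
        }
    }

    /* vfio_listener_log_global_stop() would be symmetric with _FLAG_STOP */

    static const MemoryListener vfio_memory_listener = {
        .region_add       = vfio_listener_region_add,
        .region_del       = vfio_listener_region_del,
        .log_global_start = vfio_listener_log_global_start,
        .log_global_stop  = vfio_listener_log_global_stop,
        .log_sync         = vfio_listerner_log_sync,
    };

Because these hooks are per listener rather than per memory section, the ioctl
runs once per container (i.e. once per VM) when memory_global_dirty_log_start()
and memory_global_dirty_log_stop() are invoked from the ram migration code.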
> OK, I get your idea. But Qemu assumes all pages are dirty initially, so this
> does not seem to be a problem.
>
> Let's assume we have a device of type 3, and this device starts to pin pages
> on setting the _SAVING flag.
>
> Before this patch, the work flow is:
> {
>     ram_save_setup()
>         memory_global_dirty_log_start(): start dirty tracking, excluding the
>         vfio part.
>         migration_bitmap_sync_precopy(): try to sync the dirty log from kvm,
>         vhost, etc., including vfio (as "all devices saving" is not satisfied,
>         it fails to get the log from vfio). The result is that the bitmap of
>         the ramblock is all dirty.
>
>     vfio_save_setup() of this device
>         vfio_migration_set_state(): add the SAVING state to this device, and
>         vfio starts to log dirty pages of this device.
>
>     first round (i.e. bulk stage) of ram saving: only handles the dirty log
>     collected by the first call of migration_bitmap_sync_precopy().
>
>     iterative stage of ram saving: when the remaining dirty log is less than a
>     threshold, call migration_bitmap_sync_precopy() again. At this stage, all
>     devices are saving, so it succeeds in getting the log from vfio.
> }
>
> With this patch, the work flow is:
> {
>     ram_save_setup()
>         memory_global_dirty_log_start(): start dirty tracking, including the
>         vfio part.
>         migration_bitmap_sync_precopy(): try to sync the dirty log from kvm,
>         vhost, etc., including vfio (as "all devices saving" is not checked,
>         it succeeds in getting the full dirty log from vfio). The result is
>         that the bitmap of the ramblock is all dirty.
>
>     vfio_save_setup() of this device
>         vfio_migration_set_state(): add the SAVING state to this device, and
>         vfio starts to log dirty pages of this device.
>
>     first round of ram saving: only handles the dirty log collected by the
>     first call of migration_bitmap_sync_precopy().
>
>     iterative stage of ram saving: when the remaining dirty log is less than a
>     threshold, call migration_bitmap_sync_precopy() again. At this stage, all
>     devices are saving, so it succeeds in getting the log from vfio.
> }
>
> We can see that there is no unnecessary copy of guest memory with this patch.
>
>>
>>> 2. Conflicts: The ram_save_setup() traverses all memory_listeners
>>>    to execute their log_start() and log_sync() hooks to get the
>>>    first round dirty bitmap, which is used by the bulk stage of
>>>    ram saving. However, it can't get the dirty bitmap from vfio, as
>>>    @savevm_ram_handlers is registered before @vfio_save_handler.
>>>
>> Right, but it can get the dirty bitmap from the vfio device in its iterative
>> callback:
>>     ram_save_pending ->
>>         migration_bitmap_sync_precopy() .. ->
>>             vfio_listerner_log_sync
>>
> Yeah, this work flow is OK. But we are not able to handle the vfio dirty log
> in the first round of ram saving.
>
> We plan to add a new interface named manual_log_clear in vfio [1]. If this
> interface is enabled, kernel vfio does not automatically clear and repopulate
> the dirty bitmap when reporting the dirty log to Qemu. These actions are
> triggered by Qemu when it invokes manual_log_clear
> (memory_region_clear_dirty_bitmap()) before handling the dirty log.
>
> Under the following conditions:
> (1) Qemu can handle the vfio dirty log in the first round of ram saving.
> (2) The device is of type1.
> (3) manual_log_clear is enabled.
> (4) The device driver unpins some ranges during the first round of ram saving
>     (before Qemu actually handles these ranges).
>
> Then these unpinned ranges are not reported to Qemu at the iterative stage
> (manual_log_clear clears them and the re-population does not contain them),
> which avoids an unnecessary copy of guest memory.
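As a side note, here is roughly where the manual_log_clear idea would plug into
Qemu. The .log_clear MemoryListener hook already exists and is driven by
memory_region_clear_dirty_bitmap(); the vfio_physical_log_clear() helper below
is a hypothetical placeholder for whatever ioctl/flag the series in [1] defines,
so this is a sketch of the plumbing, not the proposed uAPI:

    /* Illustration only: the kernel interface below is a placeholder */
    static void vfio_listener_log_clear(MemoryListener *listener,
                                        MemoryRegionSection *section)
    {
        VFIOContainer *container = container_of(listener, VFIOContainer, listener);
        hwaddr iova = section->offset_within_address_space;
        hwaddr size = int128_get64(section->size);

        /*
         * Tell the kernel it may now clear (and later repopulate) the dirty
         * bitmap for this range, just before Qemu transmits these pages.
         * vfio_physical_log_clear() stands in for the manual_log_clear uAPI.
         */
        vfio_physical_log_clear(container, iova, size);
    }

    /* registered as .log_clear in vfio_memory_listener */

Since the clear is requested by Qemu only right before it sends a range, log
for ranges that the vendor driver has unpinned in the meantime is not
repopulated into the next report, which is the property described above.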
>
> Thanks,
> Keqian
>
> [1] https://lore.kernel.org/kvmarm/20210128151742.18840-1-zhukeqi...@huawei.com/
>
>
>> Thanks,
>> Kirti
>>
>>> Moving the switch of vfio dirty_page_tracking into vfio_memory_listener
>>> can solve the above problems. Besides, do not require devices to be in
>>> the SAVING state for vfio_sync_dirty_bitmap().
>>>
>>> [1] https://www.spinics.net/lists/kvm/msg229967.html
>>>
>>> Reported-by: Zenghui Yu <yuzeng...@huawei.com>
>>> Signed-off-by: Keqian Zhu <zhukeqi...@huawei.com>
>>> ---
>>>  hw/vfio/common.c    | 53 +++++++++++++++++++++++++++++++++++++--------
>>>  hw/vfio/migration.c | 35 ------------------------------
>>>  2 files changed, 44 insertions(+), 44 deletions(-)
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 6ff1daa763..9128cd7ee1 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -311,7 +311,7 @@ bool vfio_mig_active(void)
>>>      return true;
>>>  }
>>>
>>> -static bool vfio_devices_all_saving(VFIOContainer *container)
>>> +static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>>  {
>>>      VFIOGroup *group;
>>>      VFIODevice *vbasedev;
>>> @@ -329,13 +329,8 @@ static bool vfio_devices_all_saving(VFIOContainer *container)
>>>                  return false;
>>>              }
>>>
>>> -            if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
>>> -                if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
>>> -                    && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
>>> -                    return false;
>>> -                }
>>> -                continue;
>>> -            } else {
>>> +            if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
>>> +                && (migration->device_state & VFIO_DEVICE_STATE_RUNNING)) {
>>>                  return false;
>>>              }
>>>          }
>>> @@ -987,6 +982,44 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>>      }
>>>  }
>>>
>>> +static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
>>> +{
>>> +    int ret;
>>> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
>>> +        .argsz = sizeof(dirty),
>>> +    };
>>> +
>>> +    if (start) {
>>> +        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
>>> +    } else {
>>> +        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
>>> +    }
>>> +
>>> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
>>> +    if (ret) {
>>> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
>>> +                     dirty.flags, errno);
>>> +    }
>>> +}
>>> +
>>> +static void vfio_listener_log_start(MemoryListener *listener,
>>> +                                    MemoryRegionSection *section,
>>> +                                    int old, int new)
>>> +{
>>> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>> +
>>> +    vfio_set_dirty_page_tracking(container, true);
>>> +}
>>> +
>>> +static void vfio_listener_log_stop(MemoryListener *listener,
>>> +                                   MemoryRegionSection *section,
>>> +                                   int old, int new)
>>> +{
>>> +    VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>>> +
>>> +    vfio_set_dirty_page_tracking(container, false);
>>> +}
>>> +
>>>  static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
>>>                                   uint64_t size, ram_addr_t ram_addr)
>>>  {
>>> @@ -1128,7 +1161,7 @@ static void vfio_listerner_log_sync(MemoryListener *listener,
>>>          return;
>>>      }
>>>
>>> -    if (vfio_devices_all_saving(container)) {
>>> +    if (vfio_devices_all_dirty_tracking(container)) {
>>>          vfio_sync_dirty_bitmap(container, section);
>>>      }
>>>  }
>>> @@ -1136,6 +1169,8 @@ static void vfio_listerner_log_sync(MemoryListener *listener,
>>>
>>>  static const MemoryListener vfio_memory_listener = {
>>>      .region_add = vfio_listener_region_add,
>>>      .region_del = vfio_listener_region_del,
>>> +    .log_start = vfio_listener_log_start,
>>> +    .log_stop = vfio_listener_log_stop,
>>>      .log_sync = vfio_listerner_log_sync,
>>>  };
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 00daa50ed8..c0f646823a 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -395,40 +395,10 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>>      return qemu_file_get_error(f);
>>>  }
>>>
>>> -static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
>>> -{
>>> -    int ret;
>>> -    VFIOMigration *migration = vbasedev->migration;
>>> -    VFIOContainer *container = vbasedev->group->container;
>>> -    struct vfio_iommu_type1_dirty_bitmap dirty = {
>>> -        .argsz = sizeof(dirty),
>>> -    };
>>> -
>>> -    if (start) {
>>> -        if (migration->device_state & VFIO_DEVICE_STATE_SAVING) {
>>> -            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
>>> -        } else {
>>> -            return -EINVAL;
>>> -        }
>>> -    } else {
>>> -        dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
>>> -    }
>>> -
>>> -    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
>>> -    if (ret) {
>>> -        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
>>> -                     dirty.flags, errno);
>>> -        return -errno;
>>> -    }
>>> -    return ret;
>>> -}
>>> -
>>>  static void vfio_migration_cleanup(VFIODevice *vbasedev)
>>>  {
>>>      VFIOMigration *migration = vbasedev->migration;
>>>
>>> -    vfio_set_dirty_page_tracking(vbasedev, false);
>>> -
>>>      if (migration->region.mmaps) {
>>>          vfio_region_unmap(&migration->region);
>>>      }
>>> @@ -469,11 +439,6 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>          return ret;
>>>      }
>>>
>>> -    ret = vfio_set_dirty_page_tracking(vbasedev, true);
>>> -    if (ret) {
>>> -        return ret;
>>> -    }
>>> -
>>>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>
>>>      ret = qemu_file_get_error(f);
>>>
>> .
>>
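One practical note on the common.c hunks above: .log_start/.log_stop fire once
per MemoryRegionSection whose dirty-log mask changes, so the container-wide
START/STOP ioctl may be issued several times per migration. Below is a minimal
sketch of a guard, assuming a new dirty_pages_started field in VFIOContainer
(the field is illustrative, not part of this patch):

    static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
    {
        struct vfio_iommu_type1_dirty_bitmap dirty = {
            .argsz = sizeof(dirty),
            .flags = start ? VFIO_IOMMU_DIRTY_PAGES_FLAG_START
                           : VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP,
        };

        /* hypothetical field: skip redundant ioctls for the same state */
        if (container->dirty_pages_started == start) {
            return;
        }

        if (ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty)) {
            error_report("Failed to set dirty tracking flag 0x%x errno: %d",
                         dirty.flags, errno);
            return;
        }
        container->dirty_pages_started = start;
    }

Alternatively, the log_global_start/log_global_stop hooks mentioned earlier in
the thread avoid the repetition altogether, since they are invoked once per
listener rather than once per section.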