Document the interfaces used for VFIO device migration. Add the flow of state changes during live migration with a VFIO device.
Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
---
 MAINTAINERS                   |   1 +
 docs/devel/vfio-migration.rst | 119 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+)
 create mode 100644 docs/devel/vfio-migration.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 6a197bd358d6..6f3fcffc6b3d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1728,6 +1728,7 @@ M: Alex Williamson <alex.william...@redhat.com>
 S: Supported
 F: hw/vfio/*
 F: include/hw/vfio/
+F: docs/devel/vfio-migration.rst
 
 vfio-ccw
 M: Cornelia Huck <coh...@redhat.com>
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index 000000000000..dab9127825e4
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,119 @@
+=====================
+VFIO device Migration
+=====================
+
+VFIO devices use an iterative approach for migration because certain VFIO
+devices (e.g. GPUs) have a large amount of data to be transferred. The iterative
+pre-copy phase of migration allows the guest to continue running whilst the VFIO
+device state is transferred to the destination, which helps to reduce the total
+downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
+migration by returning pending_bytes as zero during the pre-copy phase.
+
+A detailed description of the UAPI for VFIO device migration is in the comment
+above the ``vfio_device_migration_info`` structure definition in the header file
+linux-headers/linux/vfio.h.
+
+VFIO device hooks for iterative approach:
+- A ``save_setup`` function that sets up migration region, sets _SAVING flag in
+VFIO device state and informs VFIO IOMMU module to start dirty page tracking.
+
+- A ``load_setup`` function that sets up migration region on the destination
+and sets _RESUMING flag in VFIO device state.
+
+- A ``save_live_pending`` function that reads pending_bytes from vendor driver,
+which indicates how much more data the vendor driver has yet to save for the
+VFIO device.
+
+- A ``save_live_iterate`` function that reads VFIO device's data from vendor
+driver through migration region during iterative phase.
+
+- A ``save_live_complete_precopy`` function that clears _RUNNING flag in VFIO
+device state, saves device config space, if any, and iteratively copies
+remaining data for VFIO device till pending_bytes returned by vendor driver
+is zero.
+
+- A ``load_state`` function that loads config section and data sections
+generated by above save functions.
+
+- ``cleanup`` functions for both save and load that unmap migration region.
+
+A VM state change handler is registered to change VFIO device state based on VM
+state change.
+
+Similarly, a migration state change notifier is added to get a notification on
+migration state change. These states are translated to VFIO device state and
+conveyed to vendor driver.
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback is added to mark system memory pages
+that are used for DMA by VFIO device as dirty. Dirty pages bitmap is queried
+per container. All pages pinned by vendor driver through vfio_pin_pages()
+external API have to be marked as dirty during migration. When there are CPU
+writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
+by vendor driver can also be written by device. There is currently no device
+which has hardware support for dirty page tracking. So all pages which are
+pinned by vendor driver are considered as dirty.
+Dirty pages are tracked when device is in stop-and-copy phase because if pages
+are marked dirty during pre-copy phase and content is transferred from source
+to destination, there is no way to know newly dirtied pages from the point they
+were copied earlier until device stops. To avoid repeated copy of same content,
+pinned pages are marked dirty only during stop-and-copy phase.
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+With vIOMMU, IO virtual address range can get unmapped while in pre-copy phase
+of migration. In that case, unmap ioctl returns pages pinned in that range and
+QEMU reports corresponding guest physical pages dirty.
+During stop-and-copy phase, an IOMMU notifier is used to get a callback for
+mapped pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for
+those mapped ranges.
+
+Flow of state changes during Live migration
+===========================================
+Below is the flow of state changes during live migration, where the states in
+brackets represent VM state, migration state and VFIO device state as:
+    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
+
+Live migration save path
+------------------------
+    QEMU normal running state
+    (RUNNING, _NONE, _RUNNING)
+    |
+    migrate_init spawns migration_thread
+    Migration thread then calls each device's .save_setup()
+    (RUNNING, _SETUP, _RUNNING|_SAVING)
+    |
+    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+    If device is active, get pending_bytes by .save_live_pending()
+    If total pending_bytes >= threshold_size, call .save_live_iterate()
+    Data of VFIO device for pre-copy phase is copied
+    Iterate till total pending bytes converge and are less than threshold
+    |
+    On migration completion, vCPUs stop and .save_live_complete_precopy is called
+    for each active device. VFIO device is then transitioned to _SAVING state
+    (FINISH_MIGRATE, _DEVICE, _SAVING)
+    |
+    For VFIO device, iterate in .save_live_complete_precopy until pending data is 0
+    (FINISH_MIGRATE, _DEVICE, _STOPPED)
+    |
+    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+    Migration thread schedules cleanup bottom half and exits
+
+Live migration resume path
+--------------------------
+
+    Incoming migration calls .load_setup for each device
+    (RESTORE_VM, _ACTIVE, _STOPPED)
+    |
+    For each device, .load_state is called for that device section data
+    (RESTORE_VM, _ACTIVE, _RESUMING)
+    |
+    At the end, .load_cleanup is called for each device and vCPUs are started |
+    (RUNNING, _NONE, _RUNNING)
+
+
+Postcopy
+========
+Postcopy migration is not supported for VFIO devices.
--
2.7.0
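
As a reading aid for the save path described in the document above, here is a minimal C sketch. It is not QEMU code. Every name in it (``DEV_STATE_*``, ``SavePhase``, ``save_path_device_state``, ``precopy_should_iterate``, ``precopy_rounds``) is hypothetical, and the bit values are illustrative stand-ins for the flags defined in linux-headers/linux/vfio.h. It assumes a single device that drains a fixed amount of data per pre-copy round, which a real vendor driver need not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the VFIO device state flags (assumed values;
 * the authoritative definitions are in linux-headers/linux/vfio.h). */
enum {
    DEV_STATE_STOPPED  = 0,
    DEV_STATE_RUNNING  = 1 << 0,
    DEV_STATE_SAVING   = 1 << 1,
    DEV_STATE_RESUMING = 1 << 2,
};

/* Hypothetical names for the steps of the save path flow. */
typedef enum {
    PHASE_NORMAL,        /* (RUNNING, _NONE, _RUNNING) */
    PHASE_PRECOPY,       /* (RUNNING, _SETUP/_ACTIVE, _RUNNING|_SAVING) */
    PHASE_STOP_AND_COPY, /* (FINISH_MIGRATE, _DEVICE, _SAVING) */
    PHASE_DONE,          /* (FINISH_MIGRATE, _COMPLETED, _STOPPED) */
} SavePhase;

/* Device state that the flow lists for each save path step. */
static uint32_t save_path_device_state(SavePhase phase)
{
    switch (phase) {
    case PHASE_NORMAL:
        return DEV_STATE_RUNNING;
    case PHASE_PRECOPY:
        return DEV_STATE_RUNNING | DEV_STATE_SAVING;
    case PHASE_STOP_AND_COPY:
        return DEV_STATE_SAVING;
    case PHASE_DONE:
        return DEV_STATE_STOPPED;
    }
    assert(!"unknown save phase");
    return DEV_STATE_STOPPED;
}

/* Pre-copy keeps iterating while pending bytes are at or above the threshold. */
static bool precopy_should_iterate(uint64_t total_pending, uint64_t threshold)
{
    return total_pending >= threshold;
}

/* Count the pre-copy rounds needed when each round drains up to `chunk`
 * bytes (a simplifying assumption made for this sketch only). */
static unsigned precopy_rounds(uint64_t pending, uint64_t chunk,
                               uint64_t threshold)
{
    unsigned rounds = 0;

    assert(chunk > 0 && threshold > 0); /* otherwise the loop cannot converge */
    while (precopy_should_iterate(pending, threshold)) {
        pending -= (pending < chunk) ? pending : chunk;
        rounds++;
    }
    return rounds;
}
```

For example, with 100 pending bytes, 30 bytes drained per round and a threshold of 20, ``precopy_rounds(100, 30, 20)`` runs three rounds before the remaining 10 bytes fall below the threshold and are left for the stop-and-copy step.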