On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote: > On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote: >> Enable vfio-pci devices to be saved and restored across an exec restart >> of qemu. >> >> At vfio creation time, save the value of vfio container, group, and device >> descriptors in cpr state. >> >> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA >> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped >> at a different VA after exec. DMA to already-mapped pages continues. Save >> the msi message area as part of vfio-pci vmstate, save the interrupt and >> notifier eventfd's in cpr state, and clear the close-on-exec flag for the >> vfio descriptors. The flag is not cleared earlier because the descriptors >> should not persist across miscellaneous fork and exec calls that may be >> performed during normal operation. >> >> On qemu restart, vfio_realize() finds the saved descriptors, uses >> the descriptors, and notes that the device is being reused. Device and >> iommu state is already configured, so operations in vfio_realize that >> would modify the configuration are skipped for a reused device, including >> vfio ioctl's and writes to PCI configuration space. The result is that >> vfio_realize constructs qemu data structures that reflect the current >> state of the device. However, the reconstruction is not complete until >> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr >> state. It rebuilds vector data structures and attaches the interrupts to >> the new KVM instance. cpr-load then invokes the main vfio listener callback, >> which walks the flattened ranges of the vfio_address_spaces and calls >> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's. Lastly, it >> starts the VM and suppresses vfio pci device reset. >> >> This functionality is delivered by 3 patches for clarity. Part 1 handles >> device file descriptors and DMA. Part 2 adds eventfd and MSI/MSI-X vector >> support. Part 3 adds INTX support. >> >> Signed-off-by: Steve Sistare <steven.sist...@oracle.com> >> --- >> MAINTAINERS | 1 + >> hw/pci/pci.c | 10 ++++ >> hw/vfio/common.c | 115 >> ++++++++++++++++++++++++++++++++++++++---- >> hw/vfio/cpr.c | 94 ++++++++++++++++++++++++++++++++++ >> hw/vfio/meson.build | 1 + >> hw/vfio/pci.c | 77 ++++++++++++++++++++++++++++ >> hw/vfio/trace-events | 1 + >> include/hw/pci/pci.h | 1 + >> include/hw/vfio/vfio-common.h | 8 +++ >> include/migration/cpr.h | 3 ++ >> migration/cpr.c | 10 +++- >> migration/target.c | 14 +++++ >> 12 files changed, 324 insertions(+), 11 deletions(-) >> create mode 100644 hw/vfio/cpr.c >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index cfe7480..feed239 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -2992,6 +2992,7 @@ CPR >> M: Steve Sistare <steven.sist...@oracle.com> >> M: Mark Kanda <mark.ka...@oracle.com> >> S: Maintained >> +F: hw/vfio/cpr.c >> F: include/migration/cpr.h >> F: migration/cpr.c >> F: qapi/cpr.json >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c >> index 0fd21e1..e35df4f 100644 >> --- a/hw/pci/pci.c >> +++ b/hw/pci/pci.c >> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev) >> { >> int r; >> >> + /* >> + * A reused vfio-pci device is already configured, so do not reset it >> + * during qemu_system_reset prior to cpr-load, else interrupts may be >> + * lost. By contrast, pure-virtual pci devices may be reset here and >> + * updated with new state in cpr-load with no ill effects. >> + */ >> + if (dev->reused) { >> + return; >> + } >> + >> pci_device_deassert_intx(dev); >> assert(dev->irq_state == 0); >> > > > Hmm that's a weird thing to do. I suspect this works because > "reused" means something like "in the process of being restored"? > Because clearly, we do not want to skip this part e.g. when > guest resets the device.
Exactly. vfio_realize sets the flag if it detects the device is reused during a restart, and vfio_pci_post_load clears the reused flag. > So a better name could be called for, but really I don't > love how vfio gets to poke at internal PCI state. > I'd rather we found a way just not to call this function. > If we can't, maybe an explicit API, and make it > actually say what it's doing? How about: pci_set_restore(PCIDevice *dev) { dev->restore = true; } pci_clr_restore(PCIDevice *dev) { dev->restore = false; } vfio_realize() pci_set_restore(pdev) vfio_pci_post_load() pci_clr_restore(pdev) pci_do_device_reset() if (dev->restore) return; - Steve