On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
>> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>>>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>>>>>> Enable vfio-pci devices to be saved and restored across an exec restart
>>>>>> of qemu.
>>>>>>
>>>>>> At vfio creation time, save the value of vfio container, group, and 
>>>>>> device
>>>>>> descriptors in cpr state.
>>>>>>
>>>>>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>>>>>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be 
>>>>>> remapped
>>>>>> at a different VA after exec.  DMA to already-mapped pages continues.  
>>>>>> Save
>>>>>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>>>>>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>>>>>> vfio descriptors.  The flag is not cleared earlier because the 
>>>>>> descriptors
>>>>>> should not persist across miscellaneous fork and exec calls that may be
>>>>>> performed during normal operation.
>>>>>>
>>>>>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>>>>>> the descriptors, and notes that the device is being reused.  Device and
>>>>>> iommu state is already configured, so operations in vfio_realize that
>>>>>> would modify the configuration are skipped for a reused device, including
>>>>>> vfio ioctl's and writes to PCI configuration space.  The result is that
>>>>>> vfio_realize constructs qemu data structures that reflect the current
>>>>>> state of the device.  However, the reconstruction is not complete until
>>>>>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>>>>>> state.  It rebuilds vector data structures and attaches the interrupts to
>>>>>> the new KVM instance.  cpr-load then invokes the main vfio listener 
>>>>>> callback,
>>>>>> which walks the flattened ranges of the vfio_address_spaces and calls
>>>>>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>>>>>> starts the VM and suppresses vfio pci device reset.
>>>>>>
>>>>>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>>>>>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X 
>>>>>> vector
>>>>>> support.  Part 3 adds INTX support.
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
>>>>>> ---
>>>>>>  MAINTAINERS                   |   1 +
>>>>>>  hw/pci/pci.c                  |  10 ++++
>>>>>>  hw/vfio/common.c              | 115 
>>>>>> ++++++++++++++++++++++++++++++++++++++----
>>>>>>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>>>>>>  hw/vfio/meson.build           |   1 +
>>>>>>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>>>>>>  hw/vfio/trace-events          |   1 +
>>>>>>  include/hw/pci/pci.h          |   1 +
>>>>>>  include/hw/vfio/vfio-common.h |   8 +++
>>>>>>  include/migration/cpr.h       |   3 ++
>>>>>>  migration/cpr.c               |  10 +++-
>>>>>>  migration/target.c            |  14 +++++
>>>>>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>>>>>  create mode 100644 hw/vfio/cpr.c
>>>>>>
>>>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>>>> index cfe7480..feed239 100644
>>>>>> --- a/MAINTAINERS
>>>>>> +++ b/MAINTAINERS
>>>>>> @@ -2992,6 +2992,7 @@ CPR
>>>>>>  M: Steve Sistare <steven.sist...@oracle.com>
>>>>>>  M: Mark Kanda <mark.ka...@oracle.com>
>>>>>>  S: Maintained
>>>>>> +F: hw/vfio/cpr.c
>>>>>>  F: include/migration/cpr.h
>>>>>>  F: migration/cpr.c
>>>>>>  F: qapi/cpr.json
>>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>>> index 0fd21e1..e35df4f 100644
>>>>>> --- a/hw/pci/pci.c
>>>>>> +++ b/hw/pci/pci.c
>>>>>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>>>>>  {
>>>>>>      int r;
>>>>>>  
>>>>>> +    /*
>>>>>> +     * A reused vfio-pci device is already configured, so do not reset 
>>>>>> it
>>>>>> +     * during qemu_system_reset prior to cpr-load, else interrupts may 
>>>>>> be
>>>>>> +     * lost.  By contrast, pure-virtual pci devices may be reset here 
>>>>>> and
>>>>>> +     * updated with new state in cpr-load with no ill effects.
>>>>>> +     */
>>>>>> +    if (dev->reused) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>>      pci_device_deassert_intx(dev);
>>>>>>      assert(dev->irq_state == 0);
>>>>>>  
>>>>>
>>>>>
>>>>> Hmm that's a weird thing to do. I suspect this works because
>>>>> "reused" means something like "in the process of being restored"?
>>>>> Because clearly, we do not want to skip this part e.g. when
>>>>> guest resets the device.
>>>>
>>>> Exactly.  vfio_realize sets the flag if it detects the device is reused 
>>>> during
>>>> a restart, and vfio_pci_post_load clears the reused flag.
>>>>
>>>>> So a better name could be called for, but really I don't
>>>>> love how vfio gets to poke at internal PCI state.
>>>>> I'd rather we found a way just not to call this function.
>>>>> If we can't, maybe an explicit API, and make it
>>>>> actually say what it's doing?
>>>>
>>>> How about:
>>>>
>>>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
>>>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
>>>>
>>>> vfio_realize()
>>>>   pci_set_restore(pdev)
>>>>
>>>> vfio_pci_post_load()
>>>>   pci_clr_restore(pdev)
>>>>
>>>> pci_do_device_reset()
>>>>     if (dev->restore)
>>>>         return;
>>>>
>>>> - Steve
>>>
>>>
>>> Not too bad. I'd like a better definition of what dev->restore is
>>> exactly and to add them in comments near where it
>>> is defined and used.
>>
>> Will do.
>>
>>> E.g. does this mean "device is being restored because of qemu restart"?
>>>
>>> Do we need a per device flag for this thing or would a global
>>> "qemu restart in progress" flag be enough?
>>
>> A global flag (or function, which already exists) would suppress reset for 
>> all
>> PCI devices, not just vfio-pci.  I am concerned that for some devices, 
>> vmstate 
>> load may implicitly depend on the device having been reset for correctness, 
>> by 
>> virtue of some fields being initialized in the reset function.
>>
>> - Steve
> 
> So just so I understand, how do these other devices work with restart?
> Do they use the save/loadvm machinery? And the reason vfio doesn't
> is because it generally does not support savevm/loadvm?

They all use save/loadvm.  vfio-pci also uses save/loadvm to preserve its soft 
state,
plus it preserves its device descriptors.  The only bit we are skipping is the 
reset
function for vfio-pci, because the hardware device is actively processing dma 
and 
interrupts, and they would be lost.  Reset is called unconditionally for all 
devices 
during qemu startup, prior to loadvm, by the path qdev_machine_creation_done() 
->
qemu_system_reset().

- Steve

Reply via email to