Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-11 Thread Steven Sistare
On 3/10/2022 5:30 PM, Alex Williamson wrote:
> On Thu, 10 Mar 2022 14:55:50 -0500
> Steven Sistare  wrote:
> 
>> On 3/10/2022 1:35 PM, Alex Williamson wrote:
>>> On Thu, 10 Mar 2022 10:00:29 -0500
>>> Steven Sistare  wrote:
>>>   
 On 3/7/2022 5:16 PM, Alex Williamson wrote:  
> On Wed, 22 Dec 2021 11:05:24 -0800
> Steve Sistare  wrote:  
> [...]
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 37eca66..cee82cf 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -7,6 +7,7 @@
>>  
>>  #include "qemu/osdep.h"
>>  #include "exec/memory.h"
>> +#include "hw/vfio/vfio-common.h"
>>  #include "io/channel-buffer.h"
>>  #include "io/channel-file.h"
>>  #include "migration.h"
>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>  error_setg(errp, "cpr-exec requires cpr-save with restart 
>> mode");
>>  return;
>>  }
>> -
>> +if (cpr_vfio_save(errp)) {
>> +return;
>> +}
>
> Why is vfio so unique that it needs separate handlers versus other
> devices?  Thanks,

 In earlier patches these functions fiddled with more objects, but at this 
 point
 they are simple enough to convert to pre_save and post_load vmstate 
 handlers for
 the container and group objects.  However, we would still need to call 
 special 
 functons for vfio from qmp_cpr_exec:

   * validate all containers support CPR before we start blasting vaddrs
 However, I could validate all, in every call to pre_save for each 
 container.
 That would be less efficient, but fits the vmstate model.  
>>>
>>> Would it be a better option to mirror the migration blocker support, ie.
>>> any device that doesn't support cpr registers a blocker and generic
>>> code only needs to keep track of whether any blockers are registered.  
>>
>> We cannot specifically use migrate_add_blocker(), because it is checked in
>> the migration specific function migrate_prepare(), in a layer of functions 
>> above the simpler qemu_save_device_state() used in cpr.  But yes, we could
>> do something similar for vfio.  Increment a global counter in vfio_realize
>> if the container does not support cpr, and decrement it when the container is
>> destroyed.  pre_save could just check the counter.
> 
> Right, not suggesting to piggyback on migrate_add_blocker() only to use
> a similar mechanism.  Only drivers that can't support cpr need register
> a blocker but testing for blockers is done generically, not just for
> vfio devices.
> 
   * restore all vaddr's if qemu_save_device_state fails.
 However, I could recover for all containers inside pre_save when one 
 container fails.
 Feels strange touching all objects in a function for one, but there is 
 no real
 downside.  
>>>
>>> I'm not as familiar as I should be with migration callbacks, thanks to
>>> mostly not supporting it for vfio devices, but it seems strange to me
>>> that there's no existing callback or notifier per device to propagate
>>> save failure.  Do we not at least get some sort of resume callback in
>>> that case?  
>>
>> We do not:
>> struct VMStateDescription {
>> int (*pre_load)(void *opaque);
>> int (*post_load)(void *opaque, int version_id);
>> int (*pre_save)(void *opaque);
>> int (*post_save)(void *opaque);
>>
>> The handler returns an error, which stops further saves and is propagated 
>> back
>> to the top level caller qemu_save_device_state().
>>
>> The vast majority of handlers do not have side effects, with no need to 
>> unwind 
>> anything on failure.
>>
>> This raises another point.  If pre_save succeeds for all the containers,
>> but fails for some non-vfio object, then the overall operation is abandoned,
>> but we do not restore the vaddr's.  To plug that hole, we need to call the
>> unwind code from qmp_cpr_save, or implement your alternative below.
> 
> We're trying to reuse migration interfaces, are we also triggering
> migration state change notifiers?  ie.
> MIGRATION_STATUS_{CANCELLING,CANCELLED,FAILED}  

No. That happens in the migration layer which we do not use.

> We already hook vfio
> devices supporting migration into that notifier to tell the driver to
> move the device back to the running state on failure, which seems a bit
> unique to vfio devices.  Containers could maybe register their own
> callbacks.
> 
>>> As an alternative, maybe each container could register a vm change
>>> handler that would trigger reloading vaddrs if we move to a running
>>> state and a flag on the container indicates vaddrs were invalidated?
>>> Thanks,  
>>
>> That works and is modular, but I dislike that it adds checks on the
>> happy path for a case that will rarely happen, and it pushes recovery from
>> failure further away from the original failure, which would make debugging
>> cascading fail

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-10 Thread Alex Williamson
On Thu, 10 Mar 2022 14:55:50 -0500
Steven Sistare  wrote:

> On 3/10/2022 1:35 PM, Alex Williamson wrote:
> > On Thu, 10 Mar 2022 10:00:29 -0500
> > Steven Sistare  wrote:
> >   
> >> On 3/7/2022 5:16 PM, Alex Williamson wrote:  
> >>> On Wed, 22 Dec 2021 11:05:24 -0800
> >>> Steve Sistare  wrote:  
>  @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer 
>  *container, int group_fd,
>   {
>   int iommu_type, ret;
>   
>  +/*
>  + * If container is reused, just set its type and skip the ioctls, 
>  as the
>  + * container and group are already configured in the kernel.
>  + * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
>  + * If you ever add new types or spapr cpr support, kind reader, 
>  please
>  + * also implement VFIO_GET_IOMMU.
>  + */
> >>>
> >>> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> >>> problem is that vfio_iommu_type1_check_extension() should actually base
> >>> some of the details on the instantiated vfio_iommu, ex.
> >>>
> >>>   switch (arg) {
> >>>   case VFIO_TYPE1_IOMMU:
> >>>   return (iommu && iommu->v2) ? 0 : 1;
> >>>   case VFIO_UNMAP_ALL:
> >>>   case VFIO_UPDATE_VADDR:
> >>>   case VFIO_TYPE1v2_IOMMU:
> >>>   return (iommu && !iommu->v2) ? 0 : 1;
> >>>   case VFIO_TYPE1_NESTING_IOMMU:
> >>>   return (iommu && !iommu->nesting) ? 0 : 1;
> >>>   ...
> >>>
> >>> We can't support v1 if we've already set a v2 container and vice versa.
> >>> There are probably some corner cases and compatibility to puzzle
> >>> through, but I wouldn't think we need a new ioctl to check this.
> >>
> >> That change makes sense, and may be worth while on its own merits, but 
> >> does not
> >> solve the problem, which is that qemu will not be able to infer iommu_type 
> >> in
> >> the future if new types are added.  Given:
> >>   * a new kernel supporting shiny new TYPE1v3
> >>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it 
> >> has no
> >> knowledge of v3
> >>   * live update to qemu which supports v3, which will be listed first in 
> >> vfio_get_iommu_type.
> >>
> >> Then the new qemu has no way to infer iommu_type.  If it has code that 
> >> makes 
> >> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in 
> >> vfio_container_region_add,
> >> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function 
> >> correctly.
> >>
> >> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the 
> >> same time our
> >> hypothetical future developer adds TYPE1v3.  The current inability to ask 
> >> the kernel
> >> "what are you" about a container feels like a bug to me.  
> > 
> > Hmm, I don't think the kernel has an innate responsibility to remind
> > the user of a configuration that they've already made.
> 
> No, but it can make userland cleaner.  For example, CRIU checkpoint/restart 
> queries
> the kernel to save process state, and later makes syscalls to restore it.  
> Where the
> kernel does not export sufficient information, CRIU must provide interpose 
> libraries
> so it can remember state internally on its way to the kernel.  And 
> applications must
> link against the interpose libraries.

The counter argument is that it bloats the kernel to add interfaces to
report back things that userspace should already know.  Which has more
exploit vectors, a new kernel ioctl or yet another userspace library?
 
> > But I also
> > don't follow your TYPE1v3 example.  If we added such a type, I imagine
> > the switch would change to:
> > 
> > switch (arg)
> > case VFIO_TYPE1_IOMMU:
> > return (iommu && (iommu->v2 || iommu->v3) ? 0 : 1;
> > case VFIO_UNMAP_ALL:
> > case VFIO_UPDATE_VADDR:
> > return (iommu && !(iommu-v2 || iommu->v3) ? 0 : 1;
> > case VFIO_TYPE1v2_IOMMU:
> > return (iommu && !iommu-v2) ? 0 : 1;
> > case VFIO_TYPE1v3_IOMMU:
> > return (iommu && !iommu->v3) ? 0 : 1;
> > ...
> > 
> > How would that not allow exactly the scenario described, ie. new QEMU
> > can see that old QEMU left it a v2 IOMMU.  
> 
> OK, that works as long as the switch returns true for all options before
> VFIO_SET_IOMMU is called.  I guess your test for "iommu" above does that,
> which I missed before.  If we are on the same page now, I will modify my
> comment "please also implement VFIO_GET_IOMMU".

Yes, in the above all extensions are supported before the container
type is set, then once set only the relevant extensions are available.

...
>  diff --git a/migration/cpr.c b/migration/cpr.c
>  index 37eca66..cee82cf 100644
>  --- a/migration/cpr.c
>  +++ b/migration/cpr.c
>  @@ -7,6 +7,7 @@
>   
>   #include "qemu/osdep.h"
>   #include "exec/memory.h"
>  +#include "hw/vfio/vfio-common.h"
>   #include "io/channel-buffer.h"
>   #include "io/channel-file.h"
>   #

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-10 Thread Steven Sistare
On 3/10/2022 1:35 PM, Alex Williamson wrote:
> On Thu, 10 Mar 2022 10:00:29 -0500
> Steven Sistare  wrote:
> 
>> On 3/7/2022 5:16 PM, Alex Williamson wrote:
>>> On Wed, 22 Dec 2021 11:05:24 -0800
>>> Steve Sistare  wrote:
 @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer 
 *container, int group_fd,
  {
  int iommu_type, ret;
  
 +/*
 + * If container is reused, just set its type and skip the ioctls, as 
 the
 + * container and group are already configured in the kernel.
 + * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
 + * If you ever add new types or spapr cpr support, kind reader, please
 + * also implement VFIO_GET_IOMMU.
 + */  
>>>
>>> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
>>> problem is that vfio_iommu_type1_check_extension() should actually base
>>> some of the details on the instantiated vfio_iommu, ex.
>>>
>>> switch (arg) {
>>> case VFIO_TYPE1_IOMMU:
>>> return (iommu && iommu->v2) ? 0 : 1;
>>> case VFIO_UNMAP_ALL:
>>> case VFIO_UPDATE_VADDR:
>>> case VFIO_TYPE1v2_IOMMU:
>>> return (iommu && !iommu->v2) ? 0 : 1;
>>> case VFIO_TYPE1_NESTING_IOMMU:
>>> return (iommu && !iommu->nesting) ? 0 : 1;
>>> ...
>>>
>>> We can't support v1 if we've already set a v2 container and vice versa.
>>> There are probably some corner cases and compatibility to puzzle
>>> through, but I wouldn't think we need a new ioctl to check this.  
>>
>> That change makes sense, and may be worth while on its own merits, but does 
>> not
>> solve the problem, which is that qemu will not be able to infer iommu_type in
>> the future if new types are added.  Given:
>>   * a new kernel supporting shiny new TYPE1v3
>>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it 
>> has no
>> knowledge of v3
>>   * live update to qemu which supports v3, which will be listed first in 
>> vfio_get_iommu_type.
>>
>> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
>> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in 
>> vfio_container_region_add,
>> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function 
>> correctly.
>>
>> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the 
>> same time our
>> hypothetical future developer adds TYPE1v3.  The current inability to ask 
>> the kernel
>> "what are you" about a container feels like a bug to me.
> 
> Hmm, I don't think the kernel has an innate responsibility to remind
> the user of a configuration that they've already made.  

No, but it can make userland cleaner.  For example, CRIU checkpoint/restart 
queries
the kernel to save process state, and later makes syscalls to restore it.  
Where the
kernel does not export sufficient information, CRIU must provide interpose 
libraries
so it can remember state internally on its way to the kernel.  And applications 
must
link against the interpose libraries.

> But I also
> don't follow your TYPE1v3 example.  If we added such a type, I imagine
> the switch would change to:
> 
>   switch (arg)
>   case VFIO_TYPE1_IOMMU:
>   return (iommu && (iommu->v2 || iommu->v3) ? 0 : 1;
>   case VFIO_UNMAP_ALL:
>   case VFIO_UPDATE_VADDR:
>   return (iommu && !(iommu-v2 || iommu->v3) ? 0 : 1;
>   case VFIO_TYPE1v2_IOMMU:
>   return (iommu && !iommu-v2) ? 0 : 1;
>   case VFIO_TYPE1v3_IOMMU:
>   return (iommu && !iommu->v3) ? 0 : 1;
>   ...
> 
> How would that not allow exactly the scenario described, ie. new QEMU
> can see that old QEMU left it a v2 IOMMU.

OK, that works as long as the switch returns true for all options before
VFIO_SET_IOMMU is called.  I guess your test for "iommu" above does that,
which I missed before.  If we are on the same page now, I will modify my
comment "please also implement VFIO_GET_IOMMU".

> ...
 +
 +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
 +{
 +if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
 +!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
 +error_setg(errp, "VFIO container does not support 
 VFIO_UPDATE_VADDR "
 + "or VFIO_UNMAP_ALL");
 +return false;
 +} else {
 +return true;
 +}
 +}  
>>>
>>> We could have minimally used this where we assumed a TYPE1v2 container.  
>>
>> Are you referring to vfio_init_container (discussed above)?
>> Are you suggesting that, if reused is true, we validate those extensions are
>> present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?
> 
> Yeah, though maybe it's not sufficiently precise to be worthwhile given
> the current kernel behavior.
> 
 +
 +/*
 + * Verify that all containers support CPR, and unmap all dma vaddr's.

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-10 Thread Alex Williamson
On Thu, 10 Mar 2022 10:00:29 -0500
Steven Sistare  wrote:

> On 3/7/2022 5:16 PM, Alex Williamson wrote:
> > On Wed, 22 Dec 2021 11:05:24 -0800
> > Steve Sistare  wrote:
> >> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer 
> >> *container, int group_fd,
> >>  {
> >>  int iommu_type, ret;
> >>  
> >> +/*
> >> + * If container is reused, just set its type and skip the ioctls, as 
> >> the
> >> + * container and group are already configured in the kernel.
> >> + * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> >> + * If you ever add new types or spapr cpr support, kind reader, please
> >> + * also implement VFIO_GET_IOMMU.
> >> + */  
> > 
> > VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> > problem is that vfio_iommu_type1_check_extension() should actually base
> > some of the details on the instantiated vfio_iommu, ex.
> > 
> > switch (arg) {
> > case VFIO_TYPE1_IOMMU:
> > return (iommu && iommu->v2) ? 0 : 1;
> > case VFIO_UNMAP_ALL:
> > case VFIO_UPDATE_VADDR:
> > case VFIO_TYPE1v2_IOMMU:
> > return (iommu && !iommu->v2) ? 0 : 1;
> > case VFIO_TYPE1_NESTING_IOMMU:
> > return (iommu && !iommu->nesting) ? 0 : 1;
> > ...
> > 
> > We can't support v1 if we've already set a v2 container and vice versa.
> > There are probably some corner cases and compatibility to puzzle
> > through, but I wouldn't think we need a new ioctl to check this.  
> 
> That change makes sense, and may be worth while on its own merits, but does 
> not
> solve the problem, which is that qemu will not be able to infer iommu_type in
> the future if new types are added.  Given:
>   * a new kernel supporting shiny new TYPE1v3
>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has 
> no
> knowledge of v3
>   * live update to qemu which supports v3, which will be listed first in 
> vfio_get_iommu_type.
> 
> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in 
> vfio_container_region_add,
> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function 
> correctly.
> 
> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same 
> time our
> hypothetical future developer adds TYPE1v3.  The current inability to ask the 
> kernel
> "what are you" about a container feels like a bug to me.

Hmm, I don't think the kernel has an innate responsibility to remind
the user of a configuration that they've already made.  But I also
don't follow your TYPE1v3 example.  If we added such a type, I imagine
the switch would change to:

switch (arg)
case VFIO_TYPE1_IOMMU:
return (iommu && (iommu->v2 || iommu->v3) ? 0 : 1;
case VFIO_UNMAP_ALL:
case VFIO_UPDATE_VADDR:
return (iommu && !(iommu-v2 || iommu->v3) ? 0 : 1;
case VFIO_TYPE1v2_IOMMU:
return (iommu && !iommu-v2) ? 0 : 1;
case VFIO_TYPE1v3_IOMMU:
return (iommu && !iommu->v3) ? 0 : 1;
...

How would that not allow exactly the scenario described, ie. new QEMU
can see that old QEMU left it a v2 IOMMU.

...
> >> +
> >> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> >> +{
> >> +if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> >> +!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> >> +error_setg(errp, "VFIO container does not support 
> >> VFIO_UPDATE_VADDR "
> >> + "or VFIO_UNMAP_ALL");
> >> +return false;
> >> +} else {
> >> +return true;
> >> +}
> >> +}  
> > 
> > We could have minimally used this where we assumed a TYPE1v2 container.  
> 
> Are you referring to vfio_init_container (discussed above)?
> Are you suggesting that, if reused is true, we validate those extensions are
> present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?

Yeah, though maybe it's not sufficiently precise to be worthwhile given
the current kernel behavior.

> >> +
> >> +/*
> >> + * Verify that all containers support CPR, and unmap all dma vaddr's.
> >> + */
> >> +int vfio_cpr_save(Error **errp)
> >> +{
> >> +ERRP_GUARD();
> >> +VFIOAddressSpace *space;
> >> +VFIOContainer *container;
> >> +
> >> +QLIST_FOREACH(space, &vfio_address_spaces, list) {
> >> +QLIST_FOREACH(container, &space->containers, next) {
> >> +if (!vfio_is_cpr_capable(container, errp)) {
> >> +return -1;
> >> +}
> >> +if (vfio_dma_unmap_vaddr_all(container, errp)) {
> >> +return -1;
> >> +}
> >> +}
> >> +}  
> > 
> > Seems like we ought to validate all containers support CPR before we
> > start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
> > this fails with no attempt to unwind!  Yikes!

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-10 Thread Steven Sistare
On 3/7/2022 5:16 PM, Alex Williamson wrote:
> On Wed, 22 Dec 2021 11:05:24 -0800
> Steve Sistare  wrote:
> 
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in cpr state.
>>
>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>> should not persist across miscellaneous fork and exec calls that may be
>> performed during normal operation.
>>
>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  The result is that
>> vfio_realize constructs qemu data structures that reflect the current
>> state of the device.  However, the reconstruction is not complete until
>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>> state.  It rebuilds vector data structures and attaches the interrupts to
>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
>> which walks the flattened ranges of the vfio_address_spaces and calls
>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>> starts the VM and suppresses vfio pci device reset.
>>
>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>> support.  Part 3 adds INTX support.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  MAINTAINERS   |   1 +
>>  hw/pci/pci.c  |  10 
>>  hw/vfio/common.c  | 115 
>> ++
>>  hw/vfio/cpr.c |  94 ++
>>  hw/vfio/meson.build   |   1 +
>>  hw/vfio/pci.c |  77 
>>  hw/vfio/trace-events  |   1 +
>>  include/hw/pci/pci.h  |   1 +
>>  include/hw/vfio/vfio-common.h |   8 +++
>>  include/migration/cpr.h   |   3 ++
>>  migration/cpr.c   |  10 +++-
>>  migration/target.c|  14 +
>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index cfe7480..feed239 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2992,6 +2992,7 @@ CPR
>>  M: Steve Sistare 
>>  M: Mark Kanda 
>>  S: Maintained
>> +F: hw/vfio/cpr.c
>>  F: include/migration/cpr.h
>>  F: migration/cpr.c
>>  F: qapi/cpr.json
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 0fd21e1..e35df4f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>  int r;
>>  
>> +/*
>> + * A reused vfio-pci device is already configured, so do not reset it
>> + * during qemu_system_reset prior to cpr-load, else interrupts may be
>> + * lost.  By contrast, pure-virtual pci devices may be reset here and
>> + * updated with new state in cpr-load with no ill effects.
>> + */
>> +if (dev->reused) {
>> +return;
>> +}
>> +
>>  pci_device_deassert_intx(dev);
>>  assert(dev->irq_state == 0);
>>  
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 5b87f95..90f66ad 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -31,6 +31,7 @@
>>  #include "exec/memory.h"
>>  #include "exec/ram_addr.h"
>>  #include "hw/hw.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/range.h"
>> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>  .size = size,
>>  };
>>  
>> +assert(!container->reused);
>> +
>>  if (iotlb && container->dirty_pages_supported &&
>>  vfio_devices_all_running_and_saving(container)) {
>>  return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, 
>> hwaddr iova,
>>  {
>>  struct vfio_iommu_type1_dma_map map = {
>>  .argsz = sizeof(map),
>> -.flags = VFIO_DMA_MAP_FLAG_READ,
>>  .vaddr = (__u64)(uintptr_t)vaddr,
>>  .iova = iova,
>>  .size = size,
>>  };
>>  
>> +/*
>> + * Set the new vaddr for any mappings registered during cpr-load.
>> + * Reused is cleared thereafter.
>> + */
>> +if (conta

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-03-07 Thread Alex Williamson
On Wed, 22 Dec 2021 11:05:24 -0800
Steve Sistare  wrote:

> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> the msi message area as part of vfio-pci vmstate, save the interrupt and
> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> vfio descriptors.  The flag is not cleared earlier because the descriptors
> should not persist across miscellaneous fork and exec calls that may be
> performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the saved descriptors, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  The result is that
> vfio_realize constructs qemu data structures that reflect the current
> state of the device.  However, the reconstruction is not complete until
> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> state.  It rebuilds vector data structures and attaches the interrupts to
> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> which walks the flattened ranges of the vfio_address_spaces and calls
> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> starts the VM and suppresses vfio pci device reset.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare 
> ---
>  MAINTAINERS   |   1 +
>  hw/pci/pci.c  |  10 
>  hw/vfio/common.c  | 115 
> ++
>  hw/vfio/cpr.c |  94 ++
>  hw/vfio/meson.build   |   1 +
>  hw/vfio/pci.c |  77 
>  hw/vfio/trace-events  |   1 +
>  include/hw/pci/pci.h  |   1 +
>  include/hw/vfio/vfio-common.h |   8 +++
>  include/migration/cpr.h   |   3 ++
>  migration/cpr.c   |  10 +++-
>  migration/target.c|  14 +
>  12 files changed, 324 insertions(+), 11 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cfe7480..feed239 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2992,6 +2992,7 @@ CPR
>  M: Steve Sistare 
>  M: Mark Kanda 
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 0fd21e1..e35df4f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>  int r;
>  
> +/*
> + * A reused vfio-pci device is already configured, so do not reset it
> + * during qemu_system_reset prior to cpr-load, else interrupts may be
> + * lost.  By contrast, pure-virtual pci devices may be reset here and
> + * updated with new state in cpr-load with no ill effects.
> + */
> +if (dev->reused) {
> +return;
> +}
> +
>  pci_device_deassert_intx(dev);
>  assert(dev->irq_state == 0);
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5b87f95..90f66ad 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>  .size = size,
>  };
>  
> +assert(!container->reused);
> +
>  if (iotlb && container->dirty_pages_supported &&
>  vfio_devices_all_running_and_saving(container)) {
>  return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, 
> hwaddr iova,
>  {
>  struct vfio_iommu_type1_dma_map map = {
>  .argsz = sizeof(map),
> -.flags = VFIO_DMA_MAP_FLAG_READ,
>  .vaddr = (__u64)(uintptr_t)vaddr,
>  .iova = iova,
>  .size = size,
>  };
>  
> +/*
> + * Set the new vaddr for any mappings registered during cpr-load.
> + * Reused is cleared thereafter.
> + */
> +if (container->reused) {
> +map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> +if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +goto fail;
> +   

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-01-06 Thread Steven Sistare
On 1/6/2022 4:12 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 06:24:25PM -0500, Steven Sistare wrote:
>> On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
 On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
 Enable vfio-pci devices to be saved and restored across an exec restart
 of qemu.

 At vfio creation time, save the value of vfio container, group, and 
 device
 descriptors in cpr state.

 In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
 mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be 
 remapped
 at a different VA after exec.  DMA to already-mapped pages continues.  
 Save
 the msi message area as part of vfio-pci vmstate, save the interrupt 
 and
 notifier eventfd's in cpr state, and clear the close-on-exec flag for 
 the
 vfio descriptors.  The flag is not cleared earlier because the 
 descriptors
 should not persist across miscellaneous fork and exec calls that may be
 performed during normal operation.

 On qemu restart, vfio_realize() finds the saved descriptors, uses
 the descriptors, and notes that the device is being reused.  Device and
 iommu state is already configured, so operations in vfio_realize that
 would modify the configuration are skipped for a reused device, 
 including
 vfio ioctl's and writes to PCI configuration space.  The result is that
 vfio_realize constructs qemu data structures that reflect the current
 state of the device.  However, the reconstruction is not complete until
 cpr-load is called. cpr-load loads the msi data and finds eventfds in 
 cpr
 state.  It rebuilds vector data structures and attaches the interrupts 
 to
 the new KVM instance.  cpr-load then invokes the main vfio listener 
 callback,
 which walks the flattened ranges of the vfio_address_spaces and calls
 VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, 
 it
 starts the VM and suppresses vfio pci device reset.

 This functionality is delivered by 3 patches for clarity.  Part 1 
 handles
 device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X 
 vector
 support.  Part 3 adds INTX support.

 Signed-off-by: Steve Sistare 
 ---
  MAINTAINERS   |   1 +
  hw/pci/pci.c  |  10 
  hw/vfio/common.c  | 115 
 ++
  hw/vfio/cpr.c |  94 ++
  hw/vfio/meson.build   |   1 +
  hw/vfio/pci.c |  77 
  hw/vfio/trace-events  |   1 +
  include/hw/pci/pci.h  |   1 +
  include/hw/vfio/vfio-common.h |   8 +++
  include/migration/cpr.h   |   3 ++
  migration/cpr.c   |  10 +++-
  migration/target.c|  14 +
  12 files changed, 324 insertions(+), 11 deletions(-)
  create mode 100644 hw/vfio/cpr.c

 diff --git a/MAINTAINERS b/MAINTAINERS
 index cfe7480..feed239 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -2992,6 +2992,7 @@ CPR
  M: Steve Sistare 
  M: Mark Kanda 
  S: Maintained
 +F: hw/vfio/cpr.c
  F: include/migration/cpr.h
  F: migration/cpr.c
  F: qapi/cpr.json
 diff --git a/hw/pci/pci.c b/hw/pci/pci.c
 index 0fd21e1..e35df4f 100644
 --- a/hw/pci/pci.c
 +++ b/hw/pci/pci.c
 @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
  {
  int r;
  
 +/*
 + * A reused vfio-pci device is already configured, so do not 
 reset it
 + * during qemu_system_reset prior to cpr-load, else interrupts 
 may be
 + * lost.  By contrast, pure-virtual pci devices may be reset here 
 and
 + * updated with new state in cpr-load with no ill effects.
 + */
 +if (dev->reused) {
 +return;
 +}
 +
  pci_device_deassert_intx(dev);
  assert(dev->irq_state == 0);
  
>>>
>>>
>>> Hmm that's a weird thing to do. I suspect this works because
>>> "reused" means something like "in the process of being restored"?
>>> Because

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-01-06 Thread Michael S. Tsirkin
On Wed, Jan 05, 2022 at 06:24:25PM -0500, Steven Sistare wrote:
> On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
> >> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>  On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> >> Enable vfio-pci devices to be saved and restored across an exec restart
> >> of qemu.
> >>
> >> At vfio creation time, save the value of vfio container, group, and 
> >> device
> >> descriptors in cpr state.
> >>
> >> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> >> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be 
> >> remapped
> >> at a different VA after exec.  DMA to already-mapped pages continues.  
> >> Save
> >> the msi message area as part of vfio-pci vmstate, save the interrupt 
> >> and
> >> notifier eventfd's in cpr state, and clear the close-on-exec flag for 
> >> the
> >> vfio descriptors.  The flag is not cleared earlier because the 
> >> descriptors
> >> should not persist across miscellaneous fork and exec calls that may be
> >> performed during normal operation.
> >>
> >> On qemu restart, vfio_realize() finds the saved descriptors, uses
> >> the descriptors, and notes that the device is being reused.  Device and
> >> iommu state is already configured, so operations in vfio_realize that
> >> would modify the configuration are skipped for a reused device, 
> >> including
> >> vfio ioctl's and writes to PCI configuration space.  The result is that
> >> vfio_realize constructs qemu data structures that reflect the current
> >> state of the device.  However, the reconstruction is not complete until
> >> cpr-load is called. cpr-load loads the msi data and finds eventfds in 
> >> cpr
> >> state.  It rebuilds vector data structures and attaches the interrupts 
> >> to
> >> the new KVM instance.  cpr-load then invokes the main vfio listener 
> >> callback,
> >> which walks the flattened ranges of the vfio_address_spaces and calls
> >> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, 
> >> it
> >> starts the VM and suppresses vfio pci device reset.
> >>
> >> This functionality is delivered by 3 patches for clarity.  Part 1 
> >> handles
> >> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X 
> >> vector
> >> support.  Part 3 adds INTX support.
> >>
> >> Signed-off-by: Steve Sistare 
> >> ---
> >>  MAINTAINERS   |   1 +
> >>  hw/pci/pci.c  |  10 
> >>  hw/vfio/common.c  | 115 
> >> ++
> >>  hw/vfio/cpr.c |  94 ++
> >>  hw/vfio/meson.build   |   1 +
> >>  hw/vfio/pci.c |  77 
> >>  hw/vfio/trace-events  |   1 +
> >>  include/hw/pci/pci.h  |   1 +
> >>  include/hw/vfio/vfio-common.h |   8 +++
> >>  include/migration/cpr.h   |   3 ++
> >>  migration/cpr.c   |  10 +++-
> >>  migration/target.c|  14 +
> >>  12 files changed, 324 insertions(+), 11 deletions(-)
> >>  create mode 100644 hw/vfio/cpr.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index cfe7480..feed239 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -2992,6 +2992,7 @@ CPR
> >>  M: Steve Sistare 
> >>  M: Mark Kanda 
> >>  S: Maintained
> >> +F: hw/vfio/cpr.c
> >>  F: include/migration/cpr.h
> >>  F: migration/cpr.c
> >>  F: qapi/cpr.json
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 0fd21e1..e35df4f 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
> >>  {
> >>  int r;
> >>  
> >> +/*
> >> + * A reused vfio-pci device is already configured, so do not 
> >> reset it
> >> + * during qemu_system_reset prior to cpr-load, else interrupts 
> >> may be
> >> + * lost.  By contrast, pure-virtual pci devices may be reset here 
> >> and
> >> + * updated with new state in cpr-load with no ill effects.
> >> + */
> >> +if (dev->reused) {
> >> +return;
> >> +}
> >> +
> >>  pci_device_deassert_intx(dev);
> >>  assert(dev->irq_state == 0);
> >>  
> >
> >
> > Hmm that's a weird thing to do. I suspect this works because
> > "reused" means something like "in the process of being restored"?
> > Because clearly, we do not want to skip this part e.g. wh

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-01-05 Thread Steven Sistare
On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
>> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
 On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and 
>> device
>> descriptors in cpr state.
>>
>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be 
>> remapped
>> at a different VA after exec.  DMA to already-mapped pages continues.  
>> Save
>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>> vfio descriptors.  The flag is not cleared earlier because the 
>> descriptors
>> should not persist across miscellaneous fork and exec calls that may be
>> performed during normal operation.
>>
>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  The result is that
>> vfio_realize constructs qemu data structures that reflect the current
>> state of the device.  However, the reconstruction is not complete until
>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>> state.  It rebuilds vector data structures and attaches the interrupts to
>> the new KVM instance.  cpr-load then invokes the main vfio listener 
>> callback,
>> which walks the flattened ranges of the vfio_address_spaces and calls
>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>> starts the VM and suppresses vfio pci device reset.
>>
>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X 
>> vector
>> support.  Part 3 adds INTX support.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  MAINTAINERS   |   1 +
>>  hw/pci/pci.c  |  10 
>>  hw/vfio/common.c  | 115 
>> ++
>>  hw/vfio/cpr.c |  94 ++
>>  hw/vfio/meson.build   |   1 +
>>  hw/vfio/pci.c |  77 
>>  hw/vfio/trace-events  |   1 +
>>  include/hw/pci/pci.h  |   1 +
>>  include/hw/vfio/vfio-common.h |   8 +++
>>  include/migration/cpr.h   |   3 ++
>>  migration/cpr.c   |  10 +++-
>>  migration/target.c|  14 +
>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index cfe7480..feed239 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2992,6 +2992,7 @@ CPR
>>  M: Steve Sistare 
>>  M: Mark Kanda 
>>  S: Maintained
>> +F: hw/vfio/cpr.c
>>  F: include/migration/cpr.h
>>  F: migration/cpr.c
>>  F: qapi/cpr.json
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 0fd21e1..e35df4f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>  int r;
>>  
>> +/*
>> + * A reused vfio-pci device is already configured, so do not reset 
>> it
>> + * during qemu_system_reset prior to cpr-load, else interrupts may 
>> be
>> + * lost.  By contrast, pure-virtual pci devices may be reset here 
>> and
>> + * updated with new state in cpr-load with no ill effects.
>> + */
>> +if (dev->reused) {
>> +return;
>> +}
>> +
>>  pci_device_deassert_intx(dev);
>>  assert(dev->irq_state == 0);
>>  
>
>
> Hmm that's a weird thing to do. I suspect this works because
> "reused" means something like "in the process of being restored"?
> Because clearly, we do not want to skip this part e.g. when
> guest resets the device.

 Exactly.  vfio_realize sets the flag if it detects the device is reused 
 during
 a restart, and vfio_pci_post_load clears the reused flag.

> So a better name could be called for, but really I don't
> love how vfio gets to poke at internal PCI state.
> I'd rather we found 

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-01-05 Thread Michael S. Tsirkin
On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
> >> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>  Enable vfio-pci devices to be saved and restored across an exec restart
>  of qemu.
> 
>  At vfio creation time, save the value of vfio container, group, and 
>  device
>  descriptors in cpr state.
> 
>  In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>  mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be 
>  remapped
>  at a different VA after exec.  DMA to already-mapped pages continues.  
>  Save
>  the msi message area as part of vfio-pci vmstate, save the interrupt and
>  notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>  vfio descriptors.  The flag is not cleared earlier because the 
>  descriptors
>  should not persist across miscellaneous fork and exec calls that may be
>  performed during normal operation.
> 
>  On qemu restart, vfio_realize() finds the saved descriptors, uses
>  the descriptors, and notes that the device is being reused.  Device and
>  iommu state is already configured, so operations in vfio_realize that
>  would modify the configuration are skipped for a reused device, including
>  vfio ioctl's and writes to PCI configuration space.  The result is that
>  vfio_realize constructs qemu data structures that reflect the current
>  state of the device.  However, the reconstruction is not complete until
>  cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>  state.  It rebuilds vector data structures and attaches the interrupts to
>  the new KVM instance.  cpr-load then invokes the main vfio listener 
>  callback,
>  which walks the flattened ranges of the vfio_address_spaces and calls
>  VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>  starts the VM and suppresses vfio pci device reset.
> 
>  This functionality is delivered by 3 patches for clarity.  Part 1 handles
>  device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X 
>  vector
>  support.  Part 3 adds INTX support.
> 
>  Signed-off-by: Steve Sistare 
>  ---
>   MAINTAINERS   |   1 +
>   hw/pci/pci.c  |  10 
>   hw/vfio/common.c  | 115 
>  ++
>   hw/vfio/cpr.c |  94 ++
>   hw/vfio/meson.build   |   1 +
>   hw/vfio/pci.c |  77 
>   hw/vfio/trace-events  |   1 +
>   include/hw/pci/pci.h  |   1 +
>   include/hw/vfio/vfio-common.h |   8 +++
>   include/migration/cpr.h   |   3 ++
>   migration/cpr.c   |  10 +++-
>   migration/target.c|  14 +
>   12 files changed, 324 insertions(+), 11 deletions(-)
>   create mode 100644 hw/vfio/cpr.c
> 
>  diff --git a/MAINTAINERS b/MAINTAINERS
>  index cfe7480..feed239 100644
>  --- a/MAINTAINERS
>  +++ b/MAINTAINERS
>  @@ -2992,6 +2992,7 @@ CPR
>   M: Steve Sistare 
>   M: Mark Kanda 
>   S: Maintained
>  +F: hw/vfio/cpr.c
>   F: include/migration/cpr.h
>   F: migration/cpr.c
>   F: qapi/cpr.json
>  diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>  index 0fd21e1..e35df4f 100644
>  --- a/hw/pci/pci.c
>  +++ b/hw/pci/pci.c
>  @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>   {
>   int r;
>   
>  +/*
>  + * A reused vfio-pci device is already configured, so do not reset 
>  it
>  + * during qemu_system_reset prior to cpr-load, else interrupts may 
>  be
>  + * lost.  By contrast, pure-virtual pci devices may be reset here 
>  and
>  + * updated with new state in cpr-load with no ill effects.
>  + */
>  +if (dev->reused) {
>  +return;
>  +}
>  +
>   pci_device_deassert_intx(dev);
>   assert(dev->irq_state == 0);
>   
> >>>
> >>>
> >>> Hmm that's a weird thing to do. I suspect this works because
> >>> "reused" means something like "in the process of being restored"?
> >>> Because clearly, we do not want to skip this part e.g. when
> >>> guest resets the device.
> >>
> >> Exactly.  vfio_realize sets the flag if it detects the device is reused 
> >> during
> >> a restart, and vfio_pci_post_load clears the reused flag.
> >>
> >>> So a better name could be called for, but really I don't
> >>> love how vfio gets to poke at internal PCI state.
> >>> I'd rather we found a way just not to call this function.
> >>> If we 

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-01-05 Thread Steven Sistare
On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
 Enable vfio-pci devices to be saved and restored across an exec restart
 of qemu.

 At vfio creation time, save the value of vfio container, group, and device
 descriptors in cpr state.

 In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
 mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
 at a different VA after exec.  DMA to already-mapped pages continues.  Save
 the msi message area as part of vfio-pci vmstate, save the interrupt and
 notifier eventfd's in cpr state, and clear the close-on-exec flag for the
 vfio descriptors.  The flag is not cleared earlier because the descriptors
 should not persist across miscellaneous fork and exec calls that may be
 performed during normal operation.

 On qemu restart, vfio_realize() finds the saved descriptors, uses
 the descriptors, and notes that the device is being reused.  Device and
 iommu state is already configured, so operations in vfio_realize that
 would modify the configuration are skipped for a reused device, including
 vfio ioctl's and writes to PCI configuration space.  The result is that
 vfio_realize constructs qemu data structures that reflect the current
 state of the device.  However, the reconstruction is not complete until
 cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
 state.  It rebuilds vector data structures and attaches the interrupts to
 the new KVM instance.  cpr-load then invokes the main vfio listener 
 callback,
 which walks the flattened ranges of the vfio_address_spaces and calls
 VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
 starts the VM and suppresses vfio pci device reset.

 This functionality is delivered by 3 patches for clarity.  Part 1 handles
 device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
 support.  Part 3 adds INTX support.

 Signed-off-by: Steve Sistare 
 ---
  MAINTAINERS   |   1 +
  hw/pci/pci.c  |  10 
  hw/vfio/common.c  | 115 
 ++
  hw/vfio/cpr.c |  94 ++
  hw/vfio/meson.build   |   1 +
  hw/vfio/pci.c |  77 
  hw/vfio/trace-events  |   1 +
  include/hw/pci/pci.h  |   1 +
  include/hw/vfio/vfio-common.h |   8 +++
  include/migration/cpr.h   |   3 ++
  migration/cpr.c   |  10 +++-
  migration/target.c|  14 +
  12 files changed, 324 insertions(+), 11 deletions(-)
  create mode 100644 hw/vfio/cpr.c

 diff --git a/MAINTAINERS b/MAINTAINERS
 index cfe7480..feed239 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -2992,6 +2992,7 @@ CPR
  M: Steve Sistare 
  M: Mark Kanda 
  S: Maintained
 +F: hw/vfio/cpr.c
  F: include/migration/cpr.h
  F: migration/cpr.c
  F: qapi/cpr.json
 diff --git a/hw/pci/pci.c b/hw/pci/pci.c
 index 0fd21e1..e35df4f 100644
 --- a/hw/pci/pci.c
 +++ b/hw/pci/pci.c
 @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
  {
  int r;
  
 +/*
 + * A reused vfio-pci device is already configured, so do not reset it
 + * during qemu_system_reset prior to cpr-load, else interrupts may be
 + * lost.  By contrast, pure-virtual pci devices may be reset here and
 + * updated with new state in cpr-load with no ill effects.
 + */
 +if (dev->reused) {
 +return;
 +}
 +
  pci_device_deassert_intx(dev);
  assert(dev->irq_state == 0);
  
>>>
>>>
>>> Hmm that's a weird thing to do. I suspect this works because
>>> "reused" means something like "in the process of being restored"?
>>> Because clearly, we do not want to skip this part e.g. when
>>> guest resets the device.
>>
>> Exactly.  vfio_realize sets the flag if it detects the device is reused 
>> during
>> a restart, and vfio_pci_post_load clears the reused flag.
>>
>>> So a better name could be called for, but really I don't
>>> love how vfio gets to poke at internal PCI state.
>>> I'd rather we found a way just not to call this function.
>>> If we can't, maybe an explicit API, and make it
>>> actually say what it's doing?
>>
>> How about:
>>
>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
>>
>> vfio_realize()
>>   pci_set_restore(pdev)
>>
>> vfio_pci_post_load()
>>   pci_clr_restore(pdev)
>>
>> pci_do_devi

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-01-05 Thread Michael S. Tsirkin
On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> >> Enable vfio-pci devices to be saved and restored across an exec restart
> >> of qemu.
> >>
> >> At vfio creation time, save the value of vfio container, group, and device
> >> descriptors in cpr state.
> >>
> >> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> >> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> >> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> >> the msi message area as part of vfio-pci vmstate, save the interrupt and
> >> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> >> vfio descriptors.  The flag is not cleared earlier because the descriptors
> >> should not persist across miscellaneous fork and exec calls that may be
> >> performed during normal operation.
> >>
> >> On qemu restart, vfio_realize() finds the saved descriptors, uses
> >> the descriptors, and notes that the device is being reused.  Device and
> >> iommu state is already configured, so operations in vfio_realize that
> >> would modify the configuration are skipped for a reused device, including
> >> vfio ioctl's and writes to PCI configuration space.  The result is that
> >> vfio_realize constructs qemu data structures that reflect the current
> >> state of the device.  However, the reconstruction is not complete until
> >> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> >> state.  It rebuilds vector data structures and attaches the interrupts to
> >> the new KVM instance.  cpr-load then invokes the main vfio listener 
> >> callback,
> >> which walks the flattened ranges of the vfio_address_spaces and calls
> >> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> >> starts the VM and suppresses vfio pci device reset.
> >>
> >> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> >> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> >> support.  Part 3 adds INTX support.
> >>
> >> Signed-off-by: Steve Sistare 
> >> ---
> >>  MAINTAINERS   |   1 +
> >>  hw/pci/pci.c  |  10 
> >>  hw/vfio/common.c  | 115 
> >> ++
> >>  hw/vfio/cpr.c |  94 ++
> >>  hw/vfio/meson.build   |   1 +
> >>  hw/vfio/pci.c |  77 
> >>  hw/vfio/trace-events  |   1 +
> >>  include/hw/pci/pci.h  |   1 +
> >>  include/hw/vfio/vfio-common.h |   8 +++
> >>  include/migration/cpr.h   |   3 ++
> >>  migration/cpr.c   |  10 +++-
> >>  migration/target.c|  14 +
> >>  12 files changed, 324 insertions(+), 11 deletions(-)
> >>  create mode 100644 hw/vfio/cpr.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index cfe7480..feed239 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -2992,6 +2992,7 @@ CPR
> >>  M: Steve Sistare 
> >>  M: Mark Kanda 
> >>  S: Maintained
> >> +F: hw/vfio/cpr.c
> >>  F: include/migration/cpr.h
> >>  F: migration/cpr.c
> >>  F: qapi/cpr.json
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 0fd21e1..e35df4f 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
> >>  {
> >>  int r;
> >>  
> >> +/*
> >> + * A reused vfio-pci device is already configured, so do not reset it
> >> + * during qemu_system_reset prior to cpr-load, else interrupts may be
> >> + * lost.  By contrast, pure-virtual pci devices may be reset here and
> >> + * updated with new state in cpr-load with no ill effects.
> >> + */
> >> +if (dev->reused) {
> >> +return;
> >> +}
> >> +
> >>  pci_device_deassert_intx(dev);
> >>  assert(dev->irq_state == 0);
> >>  
> > 
> > 
> > Hmm that's a weird thing to do. I suspect this works because
> > "reused" means something like "in the process of being restored"?
> > Because clearly, we do not want to skip this part e.g. when
> > guest resets the device.
> 
> Exactly.  vfio_realize sets the flag if it detects the device is reused during
> a restart, and vfio_pci_post_load clears the reused flag.
> 
> > So a better name could be called for, but really I don't
> > love how vfio gets to poke at internal PCI state.
> > I'd rather we found a way just not to call this function.
> > If we can't, maybe an explicit API, and make it
> > actually say what it's doing?
> 
> How about:
> 
> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
> 
> vfio_realize()
>   pci_set_restore(pdev)
> 
> vfio_pci_post_load()
>   pci_clr_restore(pdev)
> 
> pci_do_device_reset()
> if (dev->restore)
> return;
> 
> - St

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2022-01-05 Thread Steven Sistare
On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in cpr state.
>>
>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>> should not persist across miscellaneous fork and exec calls that may be
>> performed during normal operation.
>>
>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  The result is that
>> vfio_realize constructs qemu data structures that reflect the current
>> state of the device.  However, the reconstruction is not complete until
>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>> state.  It rebuilds vector data structures and attaches the interrupts to
>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
>> which walks the flattened ranges of the vfio_address_spaces and calls
>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>> starts the VM and suppresses vfio pci device reset.
>>
>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>> support.  Part 3 adds INTX support.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  MAINTAINERS   |   1 +
>>  hw/pci/pci.c  |  10 
>>  hw/vfio/common.c  | 115 
>> ++
>>  hw/vfio/cpr.c |  94 ++
>>  hw/vfio/meson.build   |   1 +
>>  hw/vfio/pci.c |  77 
>>  hw/vfio/trace-events  |   1 +
>>  include/hw/pci/pci.h  |   1 +
>>  include/hw/vfio/vfio-common.h |   8 +++
>>  include/migration/cpr.h   |   3 ++
>>  migration/cpr.c   |  10 +++-
>>  migration/target.c|  14 +
>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index cfe7480..feed239 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2992,6 +2992,7 @@ CPR
>>  M: Steve Sistare 
>>  M: Mark Kanda 
>>  S: Maintained
>> +F: hw/vfio/cpr.c
>>  F: include/migration/cpr.h
>>  F: migration/cpr.c
>>  F: qapi/cpr.json
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 0fd21e1..e35df4f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>  int r;
>>  
>> +/*
>> + * A reused vfio-pci device is already configured, so do not reset it
>> + * during qemu_system_reset prior to cpr-load, else interrupts may be
>> + * lost.  By contrast, pure-virtual pci devices may be reset here and
>> + * updated with new state in cpr-load with no ill effects.
>> + */
>> +if (dev->reused) {
>> +return;
>> +}
>> +
>>  pci_device_deassert_intx(dev);
>>  assert(dev->irq_state == 0);
>>  
> 
> 
> Hmm that's a weird thing to do. I suspect this works because
> "reused" means something like "in the process of being restored"?
> Because clearly, we do not want to skip this part e.g. when
> guest resets the device.

Exactly.  vfio_realize sets the flag if it detects the device is reused during
a restart, and vfio_pci_post_load clears the reused flag.

> So a better name could be called for, but really I don't
> love how vfio gets to poke at internal PCI state.
> I'd rather we found a way just not to call this function.
> If we can't, maybe an explicit API, and make it
> actually say what it's doing?

How about:

pci_set_restore(PCIDevice *dev) { dev->restore = true; }
pci_clr_restore(PCIDevice *dev) { dev->restore = false; }

vfio_realize()
  pci_set_restore(pdev)

vfio_pci_post_load()
  pci_clr_restore(pdev)

pci_do_device_reset()
if (dev->restore)
return;

- Steve
 



Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2021-12-22 Thread Michael S. Tsirkin
On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> the msi message area as part of vfio-pci vmstate, save the interrupt and
> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> vfio descriptors.  The flag is not cleared earlier because the descriptors
> should not persist across miscellaneous fork and exec calls that may be
> performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the saved descriptors, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  The result is that
> vfio_realize constructs qemu data structures that reflect the current
> state of the device.  However, the reconstruction is not complete until
> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> state.  It rebuilds vector data structures and attaches the interrupts to
> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> which walks the flattened ranges of the vfio_address_spaces and calls
> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> starts the VM and suppresses vfio pci device reset.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare 
> ---
>  MAINTAINERS   |   1 +
>  hw/pci/pci.c  |  10 
>  hw/vfio/common.c  | 115 
> ++
>  hw/vfio/cpr.c |  94 ++
>  hw/vfio/meson.build   |   1 +
>  hw/vfio/pci.c |  77 
>  hw/vfio/trace-events  |   1 +
>  include/hw/pci/pci.h  |   1 +
>  include/hw/vfio/vfio-common.h |   8 +++
>  include/migration/cpr.h   |   3 ++
>  migration/cpr.c   |  10 +++-
>  migration/target.c|  14 +
>  12 files changed, 324 insertions(+), 11 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cfe7480..feed239 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2992,6 +2992,7 @@ CPR
>  M: Steve Sistare 
>  M: Mark Kanda 
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 0fd21e1..e35df4f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>  int r;
>  
> +/*
> + * A reused vfio-pci device is already configured, so do not reset it
> + * during qemu_system_reset prior to cpr-load, else interrupts may be
> + * lost.  By contrast, pure-virtual pci devices may be reset here and
> + * updated with new state in cpr-load with no ill effects.
> + */
> +if (dev->reused) {
> +return;
> +}
> +
>  pci_device_deassert_intx(dev);
>  assert(dev->irq_state == 0);
>  


Hmm that's a weird thing to do. I suspect this works because
"reused" means something like "in the process of being restored"?
Because clearly, we do not want to skip this part e.g. when
guest resets the device.
So a better name could be called for, but really I don't
love how vfio gets to poke at internal PCI state.
I'd rather we found a way just not to call this function.
If we can't, maybe an explicit API, and make it
actually say what it's doing?


> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5b87f95..90f66ad 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>  .size = size,
>  };
>  
> +assert(!container->reused);
> +
>  if (iotlb && container->dirty_pages_supported &&
>  vfio_devices_all_running_and_saving(container)) {
>  return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, 
> hwaddr iova,
>  {
>  struct vfio_iommu_type1_dma_map map = {
>  .argsz = sizeof(map),
> - 

[PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

2021-12-22 Thread Steve Sistare
Enable vfio-pci devices to be saved and restored across an exec restart
of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in cpr state.

In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
at a different VA after exec.  DMA to already-mapped pages continues.  Save
the msi message area as part of vfio-pci vmstate, save the interrupt and
notifier eventfd's in cpr state, and clear the close-on-exec flag for the
vfio descriptors.  The flag is not cleared earlier because the descriptors
should not persist across miscellaneous fork and exec calls that may be
performed during normal operation.

On qemu restart, vfio_realize() finds the saved descriptors, uses
the descriptors, and notes that the device is being reused.  Device and
iommu state is already configured, so operations in vfio_realize that
would modify the configuration are skipped for a reused device, including
vfio ioctl's and writes to PCI configuration space.  The result is that
vfio_realize constructs qemu data structures that reflect the current
state of the device.  However, the reconstruction is not complete until
cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
state.  It rebuilds vector data structures and attaches the interrupts to
the new KVM instance.  cpr-load then invokes the main vfio listener callback,
which walks the flattened ranges of the vfio_address_spaces and calls
VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
starts the VM and suppresses vfio pci device reset.

This functionality is delivered by 3 patches for clarity.  Part 1 handles
device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
support.  Part 3 adds INTX support.

Signed-off-by: Steve Sistare 
---
 MAINTAINERS   |   1 +
 hw/pci/pci.c  |  10 
 hw/vfio/common.c  | 115 ++
 hw/vfio/cpr.c |  94 ++
 hw/vfio/meson.build   |   1 +
 hw/vfio/pci.c |  77 
 hw/vfio/trace-events  |   1 +
 include/hw/pci/pci.h  |   1 +
 include/hw/vfio/vfio-common.h |   8 +++
 include/migration/cpr.h   |   3 ++
 migration/cpr.c   |  10 +++-
 migration/target.c|  14 +
 12 files changed, 324 insertions(+), 11 deletions(-)
 create mode 100644 hw/vfio/cpr.c

diff --git a/MAINTAINERS b/MAINTAINERS
index cfe7480..feed239 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2992,6 +2992,7 @@ CPR
 M: Steve Sistare 
 M: Mark Kanda 
 S: Maintained
+F: hw/vfio/cpr.c
 F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 0fd21e1..e35df4f 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
 {
 int r;
 
+/*
+ * A reused vfio-pci device is already configured, so do not reset it
+ * during qemu_system_reset prior to cpr-load, else interrupts may be
+ * lost.  By contrast, pure-virtual pci devices may be reset here and
+ * updated with new state in cpr-load with no ill effects.
+ */
+if (dev->reused) {
+return;
+}
+
 pci_device_deassert_intx(dev);
 assert(dev->irq_state == 0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5b87f95..90f66ad 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -31,6 +31,7 @@
 #include "exec/memory.h"
 #include "exec/ram_addr.h"
 #include "hw/hw.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
@@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
 .size = size,
 };
 
+assert(!container->reused);
+
 if (iotlb && container->dirty_pages_supported &&
 vfio_devices_all_running_and_saving(container)) {
 return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
@@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr 
iova,
 {
 struct vfio_iommu_type1_dma_map map = {
 .argsz = sizeof(map),
-.flags = VFIO_DMA_MAP_FLAG_READ,
 .vaddr = (__u64)(uintptr_t)vaddr,
 .iova = iova,
 .size = size,
 };
 
+/*
+ * Set the new vaddr for any mappings registered during cpr-load.
+ * Reused is cleared thereafter.
+ */
+if (container->reused) {
+map.flags = VFIO_DMA_MAP_FLAG_VADDR;
+if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+goto fail;
+}
+return 0;
+}
+
+map.flags = VFIO_DMA_MAP_FLAG_READ;
 if (!readonly) {
 map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
 }
@@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr 
iova,
 return 0;
 }
 
-error_report("VFIO_MAP_DMA failed: %s", stre