Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC

Steven Sistare Mon, 17 Aug 2020 14:23:42 -0700

On 8/17/2020 4:48 PM, Alex Williamson wrote:
> On Mon, 17 Aug 2020 14:30:51 -0400
> Steven Sistare <steven.sist...@oracle.com> wrote:
> 
>> On 7/30/2020 11:14 AM, Steve Sistare wrote:
>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>> in the Linux kernel. For the madvise patches, see:
>>>
>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yzn...@oracle.com/
>>>
>>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
>>> ---
>>>  include/qemu/osdep.h | 7 +++++++
>>>  1 file changed, 7 insertions(+)  
>>
>> Hi Alex,
>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire
>> qemu live update series, is getting a chilly reception on lkml.  We could
>> instead create guest memory using memfd_create and preserve the fd across
>> exec.  However, the subsequent mmap(fd) will return a different VA than was
>> used previously, which is a problem for memory that was registered with
>> vfio, as the original VA is remembered in the kernel struct vfio_dma and
>> used in various kernel functions, such as vfio_iommu_replay.
>>
>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>> new_vaddr.  The implementation finds an exact match for (iova, size) and
>> replaces vaddr with new_vaddr.  Flags cannot be changed.
>>
>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>> vfio on any form of shared memory (shm, dax, etc) could also be preserved
>> across exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>
>> What do you think?
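
To make the VA problem above concrete, here is a minimal user-space sketch.
It does not exec; instead it maps the same memfd twice, with the second mmap
standing in for the mapping a new qemu would create after exec.  The helper
name memfd_two_mappings and the "guest-ram" label are made up for
illustration, and none of this is taken from the patch series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a memfd of `len` bytes, map it twice, and
 * check that both mappings see the same data at different addresses.
 * Returns 0 on success, -1 on any failure.
 */
int memfd_two_mappings(size_t len)
{
    static const char msg[] = "guest data";
    char *old_va, *new_va;
    int fd, ret = -1;

    assert(len >= sizeof(msg));

    /* No MFD_CLOEXEC, so the fd would stay open across exec(). */
    fd = memfd_create("guest-ram", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0)
        goto out_close;

    /* Stands in for the mapping made before exec. */
    old_va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_va == MAP_FAILED)
        goto out_close;
    memcpy(old_va, msg, sizeof(msg));

    /* Stands in for the mapping made by the new process after exec. */
    new_va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (new_va == MAP_FAILED)
        goto out_unmap_old;

    /* Same pages, different virtual address. */
    if (new_va != old_va && memcmp(new_va, msg, sizeof(msg)) == 0)
        ret = 0;

    munmap(new_va, len);
out_unmap_old:
    munmap(old_va, len);
out_close:
    close(fd);
    return ret;
}
```

The contents survive because both mappings share the pages behind the fd,
but any address the kernel recorded from the first mapping no longer
describes where the process sees that memory.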
> 
> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> ioctl, so I think we should just use one of the flag bits on the
> existing MAP_DMA ioctl for this variant.


Sounds good.
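
As a rough illustration of the flag-bit variant, here is a small user-space
model, not kernel code.  The flag name DMA_MAP_FLAG_REMAP_VADDR, its value,
and the struct and function names are all hypothetical; the only behavior
taken from this thread is "exact match on (iova, size), replace vaddr, leave
flags alone".  Overlap checks and everything else a real MAP_DMA does are
omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the name and value are illustrative only. */
#define DMA_MAP_FLAG_REMAP_VADDR (1u << 31)

#define MAX_ENTRIES 16

/* Simplified stand-in for the kernel's per-mapping record. */
struct dma_entry {
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
    uint32_t flags;   /* permission bits, fixed at map time */
    int in_use;
};

struct container {
    struct dma_entry entries[MAX_ENTRIES];
};

/*
 * Model of a MAP_DMA call.  Without the remap flag, record a new mapping.
 * With it, require an exact (iova, size) match and replace only vaddr.
 * Returns 0 on success, -1 on failure.
 */
int dma_map(struct container *c, uint32_t flags,
            uint64_t vaddr, uint64_t iova, uint64_t size)
{
    assert(c != NULL);

    if (flags & DMA_MAP_FLAG_REMAP_VADDR) {
        for (size_t i = 0; i < MAX_ENTRIES; i++) {
            struct dma_entry *e = &c->entries[i];
            if (e->in_use && e->iova == iova && e->size == size) {
                e->vaddr = vaddr;   /* permission flags stay unchanged */
                return 0;
            }
        }
        return -1;   /* no exact match: reject rather than guess */
    }

    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        struct dma_entry *e = &c->entries[i];
        if (!e->in_use) {
            *e = (struct dma_entry){ iova, size, vaddr, flags, 1 };
            return 0;
        }
    }
    return -1;   /* table full */
}

/* Return the vaddr recorded for an exact (iova, size) pair, or 0 if none. */
uint64_t dma_lookup_vaddr(const struct container *c,
                          uint64_t iova, uint64_t size)
{
    assert(c != NULL);

    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        const struct dma_entry *e = &c->entries[i];
        if (e->in_use && e->iova == iova && e->size == size)
            return e->vaddr;
    }
    return 0;
}
```

Reusing the existing call keeps one argument layout for both operations; the
cost is that the remap path has to ignore or reject the permission bits the
caller passes alongside the new flag.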

> Reading through the discussion on the kernel side there seems to be
> some confusion around why vfio needs the vaddr beyond the user call to
> MAP_DMA though.  Originally this was used to test for virtually
> contiguous mappings for merging and splitting purposes.  This is
> defunct in the v2 interface, however the vaddr is now used largely for
> mdev devices.  If an mdev device is not backed by an IOMMU device and
> does not share a container with an IOMMU device, then a user MAP_DMA
> ioctl essentially just registers the translation within the vfio
> container.  The mdev vendor driver can then later either request pages
> to be pinned for device DMA or can perform copy_to/from_user() to
> simulate DMA via the CPU.
> 
> Therefore I don't see that there's a simple re-architecture of the vfio
> IOMMU backend that could drop vaddr use.  

Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
same conclusion.

> I'm a bit concerned this new
> remap proposal also raises the question of how do we prevent userspace
> remapping vaddrs racing with asynchronous kernel use of the previous
> vaddrs.  

Agreed.  After a quick glance at the code, holding iommu->lock during 
remap might be sufficient, but it needs more study.
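
To show what I mean by holding the lock, here is a user-space sketch with a
pthread mutex standing in for iommu->lock.  The struct and function names
are hypothetical and this is a model of the idea only, not the vfio code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Simplified stand-in for a container holding one mapping and its lock. */
struct iommu_model {
    pthread_mutex_t lock;
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
};

/* Remap path: replace vaddr for an exact (iova, size) match, under the lock. */
int model_remap(struct iommu_model *m, uint64_t iova, uint64_t size,
                uint64_t new_vaddr)
{
    int ret = -1;

    assert(m != NULL);
    pthread_mutex_lock(&m->lock);
    if (m->iova == iova && m->size == size) {
        m->vaddr = new_vaddr;
        ret = 0;
    }
    pthread_mutex_unlock(&m->lock);
    return ret;
}

/* Consumer path: translate an iova to a user address under the same lock. */
int model_translate(struct iommu_model *m, uint64_t iova, uint64_t *uaddr)
{
    int ret = -1;

    assert(m != NULL && uaddr != NULL);
    pthread_mutex_lock(&m->lock);
    if (iova >= m->iova && iova - m->iova < m->size) {
        *uaddr = m->vaddr + (iova - m->iova);
        ret = 0;
    }
    pthread_mutex_unlock(&m->lock);
    return ret;
}
```

The lock only guarantees that a lookup sees either the old or the new vaddr,
never a torn value.  A consumer that translated before the remap and is still
using the old address afterwards is not covered, which is part of why this
needs more study.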

> Are we expecting guest drivers/agents to quiesce the device,
> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?

No.  We want to support any guest, and the guest is not aware that qemu
live update is occurring.

> The vfio migration interface we've developed does have a mechanism to
> stop a device, would we need to use this here?  If we do have a
> mechanism to quiesce the device, is the only reason we're not unmapping
> everything and remapping it into the new address space the latency in
> performing that operation?  Thanks,

Same answer - we don't require that the guest has vfio migration support.

- Steve
