On 8/17/2020 4:48 PM, Alex Williamson wrote:
> On Mon, 17 Aug 2020 14:30:51 -0400
> Steven Sistare <steven.sist...@oracle.com> wrote:
>
>> On 7/30/2020 11:14 AM, Steve Sistare wrote:
>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>> in the Linux kernel. For the madvise patches, see:
>>>
>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yzn...@oracle.com/
>>>
>>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
>>> ---
>>> include/qemu/osdep.h | 7 +++++++
>>> 1 file changed, 7 insertions(+)
>>
>> Hi Alex,
>> The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu
>> live update series, is getting a chilly reception on lkml. We could instead
>> create guest memory using memfd_create and preserve the fd across exec.
>> However, the subsequent mmap(fd) will return a different VA than was used
>> previously, which is a problem for memory that was registered with vfio, as
>> the original VA is remembered in the kernel struct vfio_dma and used in
>> various kernel functions, such as vfio_iommu_replay.
>>
>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>> new_vaddr. The implementation finds an exact match for (iova, size) and
>> replaces vaddr with new_vaddr. Flags cannot be changed.
>>
>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>> vfio on any form of shared memory (shm, dax, etc) could also be preserved
>> across exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>
>> What do you think
>
> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> ioctl, so I think we should just use one of the flag bits on the
> existing MAP_DMA ioctl for this variant.

Sounds good.

> Reading through the discussion on the kernel side there seems to be
> some confusion around why vfio needs the vaddr beyond the user call to
> MAP_DMA though. Originally this was used to test for virtually
> contiguous mappings for merging and splitting purposes. This is
> defunct in the v2 interface, however the vaddr is now used largely for
> mdev devices. If an mdev device is not backed by an IOMMU device and
> does not share a container with an IOMMU device, then a user MAP_DMA
> ioctl essentially just registers the translation within the vfio
> container. The mdev vendor driver can then later either request pages
> to be pinned for device DMA or can perform copy_to/from_user() to
> simulate DMA via the CPU.
>
> Therefore I don't see that there's a simple re-architecture of the vfio
> IOMMU backend that could drop vaddr use.

Yes. I did not explain on lkml as you do here (thanks), but I reached the
same conclusion.

> I'm a bit concerned this new
> remap proposal also raises the question of how do we prevent userspace
> remapping vaddrs racing with asynchronous kernel use of the previous
> vaddrs.

Agreed. After a quick glance at the code, holding iommu->lock during remap
might be sufficient, but it needs more study.

> Are we expecting guest drivers/agents to quiesce the device,
> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?

No. We want to support any guest, and the guest is not aware that qemu
live update is occurring.

> The vfio migration interface we've developed does have a mechanism to
> stop a device, would we need to use this here? If we do have a
> mechanism to quiesce the device, is the only reason we're not unmapping
> everything and remapping it into the new address space the latency in
> performing that operation? Thanks,

Same answer - we don't require that the guest has vfio migration support.

- Steve
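
For concreteness, here is a minimal userspace model of the remap semantics
discussed in this thread: find an exact (iova, size) match, replace vaddr
with new_vaddr, and leave the flags untouched, all while holding one lock.
This is only a sketch and not kernel code. The names dma_entry, dma_table,
and dma_remap_vaddr are hypothetical, and the pthread mutex merely stands in
for iommu->lock. Whether holding that lock is really enough to exclude
asynchronous kernel users of the old vaddr is exactly the part that needs
more study.

```c
/* Userspace model only. All type and function names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct dma_entry {
    uint64_t iova;   /* device-visible address of the mapping */
    uint64_t size;   /* length of the mapping in bytes */
    uint64_t vaddr;  /* process virtual address backing the mapping */
    uint32_t flags;  /* permissions; remap must never change these */
};

struct dma_table {
    pthread_mutex_t lock;       /* stand-in for iommu->lock */
    struct dma_entry *entries;
    size_t count;
};

/*
 * Replace vaddr for the entry that matches (iova, size) exactly.
 * Returns 0 on success, -1 if there is no exact match. A partial
 * overlap is treated as "no match" and changes nothing.
 */
static int dma_remap_vaddr(struct dma_table *t, uint64_t iova,
                           uint64_t size, uint64_t new_vaddr)
{
    int ret = -1;

    assert(t != NULL);

    pthread_mutex_lock(&t->lock);
    for (size_t i = 0; i < t->count; i++) {
        struct dma_entry *e = &t->entries[i];

        if (e->iova == iova && e->size == size) {
            e->vaddr = new_vaddr;   /* flags are deliberately untouched */
            ret = 0;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);

    return ret;
}
```

In this model, any reader of vaddr would have to take the same lock for the
update to be race-free, which mirrors the concern raised above about
asynchronous kernel use of the previous vaddrs.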