[ adding Michal and lsf-pci ]

On Wed, Jan 31, 2018 at 7:02 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <haozhong.zh...@intel.com> wrote:
>> + vfio maintainer Alex Williamson in case my understanding of vfio is
>> incorrect.
>>
>> On 01/31/18 16:32 -0800, Dan Williams wrote:
>>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>>> <haozhong.zh...@intel.com> wrote:
>>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>>> >> <haozhong.zh...@intel.com> wrote:
>>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>>> >> >> <haozhong.zh...@intel.com> wrote:
>>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX
>>> >> >> > (e.g., files on ext4/xfs file system mounted with '-o dax').
>>> >> >>
>>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>>> >> >> metadata is in sync after a fault. However, that does not make
>>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>>> >> >> coordinate DMA with filesystem operations. There is no way to do that
>>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>>> >> >> the DMA synchronization problem. Apologies for not noticing this
>>> >> >> earlier.
>>> >> >
>>> >> > QEMU does not truncate or punch holes of the file once it has been
>>> >> > mmap()'ed. Does the problem [1] still exist in such case?
>>> >>
>>> >> Something else on the system might. The only agent that could enforce
>>> >> protection is the kernel, and the kernel will likely just disallow
>>> >> passing addresses from filesystem-dax vmas through to a guest
>>> >> altogether. I think there's even a problem in the non-DAX case unless
>>> >> KVM is pinning pages while they are handed out to a guest. The problem
>>> >> is that we don't have a page cache page to pin in the DAX case.
>>> >>
>>> >
>>> > Does it mean any user-space code like
>>> >     ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>>> >     // make DMA to ptr
>>> > is unsafe?
>>>
>>> Yes, it is currently unsafe because there is no coordination with the
>>> filesystem if it decides to make block layout changes. We can fix that
>>> in the non-virtualization case by having the filesystem wait for DMA
>>> completion callbacks (i.e. wait for all pages to be idle), but as far
>>> as I can see we can't do the same coordination for DMA initiated by a
>>> guest device driver.
>>>
>>
>> I think that fix [1] also works for KVM/QEMU. The guest DMA are
>> performed on two types of devices:
>>
>> 1. For emulated devices, the guest DMA requests are trapped and
>>    actually performed by QEMU on the host side. The host side fix [1]
>>    can cover this case.
>>
>> 2. For passthrough devices, vfio pins all pages, including those
>>    backed by dax mode files, used by the guest if any device is
>>    passthroughed to it. If I read the commit message in [2] correctly,
>>    operations that change the page-to-file offset association of pages
>>    from dax mode files will be deferred until the reference count of
>>    the affected pages becomes 1. That is, if any passthrough device
>>    is used with a VM, the changes of page-to-file offset will not be
>>    able to happen until the VM is shutdown, so the fix [1] still takes
>>    effect here.
>
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(), it is a problem if pages
> are pinned indefinitely especially DAX. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near term it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.
>
>> Another question is how a user-space application (e.g., QEMU) knows
>> whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.

Btw, there is an LSF/MM topic proposal on this subject [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-January/013935.html
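
For readers less familiar with the user-space pattern debated in the quoted thread, here is a minimal C sketch of mapping a file with MAP_SYNC. The helper name map_sync_file() and the fallback flag definitions (for older libc headers) are illustrative assumptions, not part of any patch in this thread. Note that, per the discussion above, a successful MAP_SYNC mapping only keeps filesystem metadata in sync after a fault; it does not by itself make DMA to the mapping safe.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for libc headers that predate Linux 4.15.
 * Assumption: these are the asm-generic Linux UAPI values. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Hypothetical helper: map the first `len` bytes of `path` with MAP_SYNC.
 * Returns the mapping on success, or MAP_FAILED with errno set. */
static void *map_sync_file(const char *path, size_t len)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    /* MAP_SHARED_VALIDATE makes the kernel reject unknown flags, so a
     * kernel or filesystem without MAP_SYNC support fails the call
     * instead of silently ignoring the flag. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

    /* The mapping stays valid after close(); preserve mmap's errno. */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return p;
}
```

A caller would compare the result against MAP_FAILED and inspect errno. On a file that is not on a DAX-capable mount, the mmap() call is expected to fail with EOPNOTSUPP, which is how an application can detect that MAP_SYNC is unavailable for that file.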