Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Dan Williams Wed, 31 Jan 2018 19:11:28 -0800

[ adding Michal and lsf-pci ]

On Wed, Jan 31, 2018 at 7:02 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <haozhong.zh...@intel.com> wrote:
>> + vfio maintainer Alex Williamson in case my understanding of vfio is 
>> incorrect.
>>
>> On 01/31/18 16:32 -0800, Dan Williams wrote:
>>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>>> <haozhong.zh...@intel.com> wrote:
>>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>>> >> <haozhong.zh...@intel.com> wrote:
>>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>>> >> >> <haozhong.zh...@intel.com> wrote:
>>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX 
>>> >> >> > (e.g.,
>>> >> >> > files on ext4/xfs file system mounted with '-o dax').
>>> >> >>
>>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>>> >> >> metadata is in sync after a fault. However, that does not make
>>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>>> >> >> coordinate DMA with fileystem operations. There is no way to do that
>>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>>> >> >> the DMA synchronization problem. Apologies for not noticing this
>>> >> >> earlier.
>>> >> >
>>> >> > QEMU does not truncate or punch holes of the file once it has been
>>> >> > mmap()'ed. Does the problem [1] still exist in such case?
>>> >>
>>> >> Something else on the system might. The only agent that could enforce
>>> >> protection is the kernel, and the kernel will likely just disallow
>>> >> passing addresses from filesystem-dax vmas through to a guest
>>> >> altogether. I think there's even a problem in the non-DAX case unless
>>> >> KVM is pinning pages while they are handed out to a guest. The problem
>>> >> is that we don't have a page cache page to pin in the DAX case.
>>> >>
>>> >
>>> > Does it mean any user-space code like
>>> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>>> >   // make DMA to ptr
>>> > is unsafe?
>>>
>>> Yes, it is currently unsafe because there is no coordination with the
>>> filesytem if it decides to make block layout changes. We can fix that
>>> in the non-virtualization case by having the filesystem wait for DMA
>>> completion callbacks (i.e. what for all pages to be idle), but as far
>>> as I can see we can't do the same coordination for DMA initiated by a
>>> guest device driver.
>>>
>>
>> I think that fix [1] also works for KVM/QEMU. The guest DMA are
>> performed on two types of devices:
>>
>> 1. For emulated devices, the guest DMA requests are trapped and
>>    actually performed by QEMU on the host side. The host side fix [1]
>>    can cover this case.
>>
>> 2. For passthrough devices, vfio pins all pages, including those
>>    backed by dax mode files, used by the guest if any device is
>>    passthroughed to it. If I read the commit message in [2] correctly,
>>    operations that change the page-to-file offset association of pages
>>    from dax mode files will be deferred until the reference count of
>>    the affected pages becomes 1.  That is, if any passthrough device
>>    is used with a VM, the changes of page-to-file offset will not be
>>    able to happen until the VM is shutdown, so the fix [1] still takes
>>    effect here.
>
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(), it is a problem if pages
> are pinned indefinitely especially DAX. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.
>
>> Another question is how a user-space application (e.g., QEMU) knows
>> whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.


Btw, there is an LSF/MM topic proposal on this subject [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-January/013935.html

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Reply via email to