On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang <haozhong.zh...@intel.com> wrote:
> On 01/31/18 16:08 -0800, Dan Williams wrote:
>> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>> <haozhong.zh...@intel.com> wrote:
>> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> >> <haozhong.zh...@intel.com> wrote:
>> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> >> > files on ext4/xfs file system mounted with '-o dax').
>> >>
>> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> >> metadata is in sync after a fault. However, that does not make
>> >> filesystem-DAX safe for use with QEMU, because we still need to
>> >> coordinate DMA with fileystem operations. There is no way to do that
>> >> coordination from within a guest. QEMU needs to use device-dax if the
>> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>> >> pages to be mapped in EPT entries unless / until we have a solution to
>> >> the DMA synchronization problem. Apologies for not noticing this
>> >> earlier.
>> >
>> > QEMU does not truncate or punch holes of the file once it has been
>> > mmap()'ed. Does the problem [1] still exist in such case?
>>
>> Something else on the system might. The only agent that could enforce
>> protection is the kernel, and the kernel will likely just disallow
>> passing addresses from filesystem-dax vmas through to a guest
>> altogether. I think there's even a problem in the non-DAX case unless
>> KVM is pinning pages while they are handed out to a guest. The problem
>> is that we don't have a page cache page to pin in the DAX case.
>>
>
> Does it mean any user-space code like
>     ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>     // make DMA to ptr
> is unsafe?

Yes, it is currently unsafe because there is no coordination with the
filesystem if it decides to make block layout changes. We can fix that
in the non-virtualization case by having the filesystem wait for DMA
completion callbacks (i.e. wait for all pages to be idle), but as far
as I can see we can't do the same coordination for DMA initiated by a
guest device driver.
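
For reference, a minimal sketch of how user space might request a
MAP_SYNC mapping and fall back to a plain shared mapping when the file
does not support it. The helper name map_shared_try_sync() is
hypothetical, and the fallback flag values are assumed to match the
asm-generic Linux UAPI headers (as on x86). This only illustrates the
mmap() call discussed above; a successful MAP_SYNC mapping does not by
itself address the DMA vs. block-layout problem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers. Assumption: asm-generic UAPI values. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map `len` bytes of `fd`, asking for MAP_SYNC first.
 * On success, *got_sync is 1 if the kernel accepted MAP_SYNC and 0 if we
 * fell back to an ordinary MAP_SHARED mapping. Returns MAP_FAILED on error.
 */
static void *map_shared_try_sync(int fd, size_t len, int *got_sync)
{
    assert(len > 0);

    /* MAP_SHARED_VALIDATE makes the kernel reject flags it cannot honor. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *got_sync = 1;
        return p;
    }

    /*
     * EOPNOTSUPP: the file does not support MAP_SYNC (e.g. not on DAX).
     * EINVAL: the kernel likely predates MAP_SHARED_VALIDATE.
     * Anything else is treated as a real error.
     */
    if (errno != EOPNOTSUPP && errno != EINVAL)
        return MAP_FAILED;

    *got_sync = 0;
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

A caller would check *got_sync before relying on the MAP_SYNC metadata
behavior, and would still need msync()/fsync() or CPU cache flushes for
data durability as appropriate.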