Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()
On Tue, Jul 25, 2017 at 02:50:37PM +0200, Jan Kara wrote: > On Tue 25-07-17 14:15:22, Christoph Hellwig wrote: > > On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote: > > > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote: > > > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote: > > > > > I guess it's up to filesystem if it wants to reuse the same spot to > > > > > write > > > > > data or not. I think your assumptions works for ext4 and xfs. I > > > > > wouldn't > > > > > be that sure for btrfs or other filesystems with CoW support. > > > > > > > > Or XFS with reflinks for that matter. Which currently can't be > > > > combined with DAX, but I had a somewhat working version a few month > > > > ago. > > > > > > But in cases like COW when the block mapping changes, the process > > > must run unmap_mapping_range() before installing the new PTE so that all > > > processes mapping this file offset actually refault and see the new > > > mapping. So this would go through pte_none() case. Am I missing something? > > > > Yes, for DAX COW mappings we'd probably need something like this, unlike > > the pagecache COW handling for which only the underlying block change, > > but not the page. > > Right. So again nothing where the WARN_ON should trigger. Yes. I was confused on how COW is handled. Acked-by: Kirill A. Shutemov -- Kirill A. Shutemov
Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()
On Tue 25-07-17 14:15:22, Christoph Hellwig wrote: > On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote: > > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote: > > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote: > > > > I guess it's up to filesystem if it wants to reuse the same spot to > > > > write > > > > data or not. I think your assumptions works for ext4 and xfs. I wouldn't > > > > be that sure for btrfs or other filesystems with CoW support. > > > > > > Or XFS with reflinks for that matter. Which currently can't be > > > combined with DAX, but I had a somewhat working version a few month > > > ago. > > > > But in cases like COW when the block mapping changes, the process > > must run unmap_mapping_range() before installing the new PTE so that all > > processes mapping this file offset actually refault and see the new > > mapping. So this would go through pte_none() case. Am I missing something? > > Yes, for DAX COW mappings we'd probably need something like this, unlike > the pagecache COW handling for which only the underlying block change, > but not the page. Right. So again nothing where the WARN_ON should trigger. That being said I don't care about the WARN_ON too deeply but it can help to catch DAX bugs so if we can keep it I'd prefer to do so... Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()
On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote: > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote: > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote: > > > I guess it's up to filesystem if it wants to reuse the same spot to write > > > data or not. I think your assumptions works for ext4 and xfs. I wouldn't > > > be that sure for btrfs or other filesystems with CoW support. > > > > Or XFS with reflinks for that matter. Which currently can't be > > combined with DAX, but I had a somewhat working version a few month > > ago. > > But in cases like COW when the block mapping changes, the process > must run unmap_mapping_range() before installing the new PTE so that all > processes mapping this file offset actually refault and see the new > mapping. So this would go through pte_none() case. Am I missing something? Yes, for DAX COW mappings we'd probably need something like this, unlike the pagecache COW handling for which only the underlying block change, but not the page.
Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()
On Tue 25-07-17 10:01:58, Christoph Hellwig wrote: > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote: > > I guess it's up to filesystem if it wants to reuse the same spot to write > > data or not. I think your assumptions works for ext4 and xfs. I wouldn't > > be that sure for btrfs or other filesystems with CoW support. > > Or XFS with reflinks for that matter. Which currently can't be > combined with DAX, but I had a somewhat working version a few month > ago. But in cases like COW when the block mapping changes, the process must run unmap_mapping_range() before installing the new PTE so that all processes mapping this file offset actually refault and see the new mapping. So this would go through pte_none() case. Am I missing something? Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()
On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote: > I guess it's up to filesystem if it wants to reuse the same spot to write > data or not. I think your assumptions works for ext4 and xfs. I wouldn't > be that sure for btrfs or other filesystems with CoW support. Or XFS with reflinks for that matter. Which currently can't be combined with DAX, but I had a somewhat working version a few month ago.
Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()
On Mon, Jul 24, 2017 at 11:06:12AM -0600, Ross Zwisler wrote: > @@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, > unsigned long addr, > if (!pte) > goto out; > retval = -EBUSY; > - if (!pte_none(*pte)) > - goto out_unlock; > + if (!pte_none(*pte)) { > + if (mkwrite) { > + /* > + * For read faults on private mappings the PFN passed > + * in may not match the PFN we have mapped if the > + * mapped PFN is a writeable COW page. In the mkwrite > + * case we are creating a writable PTE for a shared > + * mapping and we expect the PFNs to match. > + */ Can we? I guess it's up to filesystem if it wants to reuse the same spot to write data or not. I think your assumptions works for ext4 and xfs. I wouldn't be that sure for btrfs or other filesystems with CoW support. -- Kirill A. Shutemov
[PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()
To be able to use the common 4k zero page in DAX we need to have our PTE fault path look more like our PMD fault path where a PTE entry can be marked as dirty and writeable as it is first inserted rather than waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call. Right now we can rely on having a dax_pfn_mkwrite() call because we can distinguish between these two cases in do_wp_page(): case 1: 4k zero page => writable DAX storage case 2: read-only DAX storage => writeable DAX storage This distinction is made by via vm_normal_page(). vm_normal_page() returns false for the common 4k zero page, though, just as it does for DAX ptes. Instead of special casing the DAX + 4k zero page case we will simplify our DAX PTE page fault sequence so that it matches our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead use dax_iomap_fault() to handle write-protection faults. This means that insert_pfn() needs to follow the lead of insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite' is set insert_pfn() will do the work that was previously done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path. Signed-off-by: Ross Zwisler Reviewed-by: Jan Kara --- include/linux/mm.h | 2 ++ mm/memory.c| 50 +++--- 2 files changed, 45 insertions(+), 7 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 46b9ac5..483e84c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2293,6 +2293,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, pgprot_t pgprot); int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, pfn_t pfn); +int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr, + pfn_t pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); diff --git a/mm/memory.c b/mm/memory.c index 0e517be..b29dd42 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1646,7 +1646,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, EXPORT_SYMBOL(vm_insert_page); static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, - pfn_t pfn, pgprot_t prot) + pfn_t pfn, pgprot_t prot, bool mkwrite) { struct mm_struct *mm = vma->vm_mm; int retval; @@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, if (!pte) goto out; retval = -EBUSY; - if (!pte_none(*pte)) - goto out_unlock; + if (!pte_none(*pte)) { + if (mkwrite) { + /* +* For read faults on private mappings the PFN passed +* in may not match the PFN we have mapped if the +* mapped PFN is a writeable COW page. In the mkwrite +* case we are creating a writable PTE for a shared +* mapping and we expect the PFNs to match. +*/ + if (WARN_ON_ONCE(pte_pfn(*pte) != pfn_t_to_pfn(pfn))) + goto out_unlock; + entry = *pte; + goto out_mkwrite; + } else + goto out_unlock; + } /* Ok, finally just insert the thing.. */ if (pfn_t_devmap(pfn)) entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); else entry = pte_mkspecial(pfn_t_pte(pfn, prot)); + +out_mkwrite: + if (mkwrite) { + entry = pte_mkyoung(entry); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + } + set_pte_at(mm, addr, pte, entry); update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */ @@ -1736,14 +1757,15 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV)); - ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot); + ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot, + false); return ret; } EXPORT_SYMBOL(vm_insert_pfn_prot); -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, - pfn_t pfn) +static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, + pfn_t pfn, bool mkwrite) { pgprot_t pgprot = vma->vm_page_prot; @@ -1772,10 +1794,24 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, page = pfn_to_page(pfn_t_to_pfn(pfn)); return insert_page(vma, addr, page, pgprot); } - return insert_pfn(vma, addr, pfn, pgprot); + return insert_pfn(vma, addr, pfn, pgprot,