Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-25 Thread Kirill A. Shutemov
On Tue, Jul 25, 2017 at 02:50:37PM +0200, Jan Kara wrote:
> On Tue 25-07-17 14:15:22, Christoph Hellwig wrote:
> > On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> > > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > > > I guess it's up to the filesystem whether it wants to reuse the same
> > > > > spot to write data or not. I think your assumption works for ext4 and
> > > > > xfs. I wouldn't be that sure for btrfs or other filesystems with CoW
> > > > > support.
> > > > 
> > > > Or XFS with reflinks, for that matter, which currently can't be
> > > > combined with DAX, but I had a somewhat working version a few months
> > > > ago.
> > > 
> > > But in cases like COW, when the block mapping changes, the process
> > > must run unmap_mapping_range() before installing the new PTE so that all
> > > processes mapping this file offset actually refault and see the new
> > > mapping. So this would go through the pte_none() case. Am I missing
> > > something?
> > 
> > Yes, for DAX COW mappings we'd probably need something like this, unlike
> > the pagecache COW handling for which only the underlying block changes,
> > but not the page.
> 
> Right. So again nothing where the WARN_ON should trigger.

Yes. I was confused about how COW is handled.

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-25 Thread Jan Kara
On Tue 25-07-17 14:15:22, Christoph Hellwig wrote:
> On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > > I guess it's up to the filesystem whether it wants to reuse the same
> > > > spot to write data or not. I think your assumption works for ext4 and
> > > > xfs. I wouldn't be that sure for btrfs or other filesystems with CoW
> > > > support.
> > > 
> > > Or XFS with reflinks, for that matter, which currently can't be
> > > combined with DAX, but I had a somewhat working version a few months
> > > ago.
> > 
> > But in cases like COW, when the block mapping changes, the process
> > must run unmap_mapping_range() before installing the new PTE so that all
> > processes mapping this file offset actually refault and see the new
> > mapping. So this would go through the pte_none() case. Am I missing
> > something?
> 
> Yes, for DAX COW mappings we'd probably need something like this, unlike
> the pagecache COW handling for which only the underlying block changes,
> but not the page.

Right. So again nothing where the WARN_ON should trigger. That being said, I
don't care about the WARN_ON too deeply, but it can help catch DAX bugs, so
if we can keep it I'd prefer to do so...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-25 Thread Christoph Hellwig
On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > I guess it's up to the filesystem whether it wants to reuse the same spot
> > > to write data or not. I think your assumption works for ext4 and xfs. I
> > > wouldn't be that sure for btrfs or other filesystems with CoW support.
> > 
> > Or XFS with reflinks, for that matter, which currently can't be
> > combined with DAX, but I had a somewhat working version a few months
> > ago.
> 
> But in cases like COW, when the block mapping changes, the process
> must run unmap_mapping_range() before installing the new PTE so that all
> processes mapping this file offset actually refault and see the new
> mapping. So this would go through the pte_none() case. Am I missing
> something?

Yes, for DAX COW mappings we'd probably need something like this, unlike
the pagecache COW handling for which only the underlying block changes,
but not the page.


Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-25 Thread Jan Kara
On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > I guess it's up to the filesystem whether it wants to reuse the same spot
> > to write data or not. I think your assumption works for ext4 and xfs. I
> > wouldn't be that sure for btrfs or other filesystems with CoW support.
> 
> Or XFS with reflinks, for that matter, which currently can't be
> combined with DAX, but I had a somewhat working version a few months
> ago.

But in cases like COW, when the block mapping changes, the process
must run unmap_mapping_range() before installing the new PTE so that all
processes mapping this file offset actually refault and see the new
mapping. So this would go through the pte_none() case. Am I missing something?
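
For illustration, here is a minimal sketch of that ordering (dax_cow_remap()
and fs_cow_allocate_and_copy() are hypothetical names used only for this
sketch; unmap_mapping_range() is the real interface):

	#include <linux/fs.h>
	#include <linux/mm.h>

	/*
	 * CoW write that moves the block mapping: set up the new block first,
	 * then zap every existing mapping of the file range so that all
	 * processes refault, go through the pte_none() path and pick up the
	 * new block.
	 */
	static int dax_cow_remap(struct inode *inode, loff_t pos, loff_t len)
	{
		int ret;

		/* hypothetical fs-specific step: allocate new block, copy data */
		ret = fs_cow_allocate_and_copy(inode, pos, len);
		if (ret)
			return ret;

		/* tear down all existing PTEs for this range in all mappings */
		unmap_mapping_range(inode->i_mapping, pos, len, 0);

		return 0;
	}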

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-25 Thread Christoph Hellwig
On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> I guess it's up to the filesystem whether it wants to reuse the same spot
> to write data or not. I think your assumption works for ext4 and xfs. I
> wouldn't be that sure for btrfs or other filesystems with CoW support.

Or XFS with reflinks, for that matter, which currently can't be
combined with DAX, but I had a somewhat working version a few months
ago.


Re: [PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-24 Thread Kirill A. Shutemov
On Mon, Jul 24, 2017 at 11:06:12AM -0600, Ross Zwisler wrote:
> @@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   if (!pte)
>   goto out;
>   retval = -EBUSY;
> - if (!pte_none(*pte))
> - goto out_unlock;
> + if (!pte_none(*pte)) {
> + if (mkwrite) {
> + /*
> +  * For read faults on private mappings the PFN passed
> +  * in may not match the PFN we have mapped if the
> +  * mapped PFN is a writeable COW page.  In the mkwrite
> +  * case we are creating a writable PTE for a shared
> +  * mapping and we expect the PFNs to match.
> +  */

Can we?

I guess it's up to the filesystem whether it wants to reuse the same spot to
write data or not. I think your assumption works for ext4 and xfs. I wouldn't
be that sure for btrfs or other filesystems with CoW support.

-- 
 Kirill A. Shutemov


[PATCH v5 1/5] mm: add vm_insert_mixed_mkwrite()

2017-07-24 Thread Ross Zwisler
To be able to use the common 4k zero page in DAX we need to have our PTE
fault path look more like our PMD fault path where a PTE entry can be
marked as dirty and writeable as it is first inserted rather than waiting
for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.

Right now we can rely on having a dax_pfn_mkwrite() call because we can
distinguish between these two cases in do_wp_page():

case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage

This distinction is made via vm_normal_page().  vm_normal_page() returns
NULL for the common 4k zero page, though, just as it does for DAX ptes.
Instead of special casing the DAX + 4k zero page case we will simplify our
DAX PTE page fault sequence so that it matches our DAX PMD sequence, and
get rid of the dax_pfn_mkwrite() helper.  We will instead use
dax_iomap_fault() to handle write-protection faults.

This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set insert_pfn()
will do the work that was previously done by wp_page_reuse() as part of the
dax_pfn_mkwrite() call path.
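
As an illustration of the intended use (a sketch only; dax_fault_example() is
a hypothetical caller, and the actual wiring into dax_iomap_fault() happens
later in this series):

	#include <linux/mm.h>
	#include <linux/pfn_t.h>

	/*
	 * Sketch of a DAX PTE fault: on a write fault, insert the PTE already
	 * dirty and writable so no follow-up pfn_mkwrite fault is needed;
	 * read faults keep using plain vm_insert_mixed().  The caller still
	 * has to translate the 0/-errno result into a VM_FAULT_* code.
	 */
	static int dax_fault_example(struct vm_fault *vmf, pfn_t pfn)
	{
		if (vmf->flags & FAULT_FLAG_WRITE)
			return vm_insert_mixed_mkwrite(vmf->vma, vmf->address,
						       pfn);

		return vm_insert_mixed(vmf->vma, vmf->address, pfn);
	}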

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 50 +++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..483e84c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2293,6 +2293,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn, pgprot_t pgprot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
pfn_t pfn);
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+   pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be..b29dd42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1646,7 +1646,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL(vm_insert_page);
 
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-   pfn_t pfn, pgprot_t prot)
+   pfn_t pfn, pgprot_t prot, bool mkwrite)
 {
struct mm_struct *mm = vma->vm_mm;
int retval;
@@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
if (!pte)
goto out;
retval = -EBUSY;
-   if (!pte_none(*pte))
-   goto out_unlock;
+   if (!pte_none(*pte)) {
+   if (mkwrite) {
+   /*
+* For read faults on private mappings the PFN passed
+* in may not match the PFN we have mapped if the
+* mapped PFN is a writeable COW page.  In the mkwrite
+* case we are creating a writable PTE for a shared
+* mapping and we expect the PFNs to match.
+*/
+   if (WARN_ON_ONCE(pte_pfn(*pte) != pfn_t_to_pfn(pfn)))
+   goto out_unlock;
+   entry = *pte;
+   goto out_mkwrite;
+   } else
+   goto out_unlock;
+   }
 
/* Ok, finally just insert the thing.. */
if (pfn_t_devmap(pfn))
entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
else
entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+
+out_mkwrite:
+   if (mkwrite) {
+   entry = pte_mkyoung(entry);
+   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   }
+
set_pte_at(mm, addr, pte, entry);
update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
@@ -1736,14 +1757,15 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 
track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
-   ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
+   ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
+   false);
 
return ret;
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
-int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-   pfn_t pfn)
+static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+   pfn_t pfn, bool mkwrite)
 {
pgprot_t pgprot = vma->vm_page_prot;
 
@@ -1772,10 +1794,24 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
page = pfn_to_page(pfn_t_to_pfn(pfn));
return insert_page(vma, addr, page, pgprot);
}
-   return insert_pfn(vma, addr, pfn, pgprot);
+   return insert_pfn(vma, addr, pfn, pgprot,