Re: [PATCH v3 04/11] ext2: remove support for DAX PMD faults

2016-09-27 Thread Dave Chinner
On Tue, Sep 27, 2016 at 02:47:55PM -0600, Ross Zwisler wrote:
> DAX PMD support was added via the following commit:
> 
> commit e7b1ea2ad658 ("ext2: huge page fault support")
> 
> I believe this path to be untested as ext2 doesn't reliably provide block
> allocations that are aligned to 2MiB.  In my testing I've been unable to
> get ext2 to actually fault in a PMD.  It always fails with a "pfn
> unaligned" message because the sector returned by ext2_get_block() isn't
> aligned.
> 
> I've tried various settings for the "stride" and "stripe_width" extended
> options to mkfs.ext2, without any luck.
> 
> Since we can't reliably get PMDs, remove support so that we don't have an
> untested code path that we may someday traverse when we happen to get an
> aligned block allocation.  This should also make 4k DAX faults in ext2 a
> bit faster since they will no longer have to call the PMD fault handler
> only to get a response of VM_FAULT_FALLBACK.
> 
> Signed-off-by: Ross Zwisler 

> @@ -154,7 +133,6 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct 
> *vma,
>  
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
>   .fault  = ext2_dax_fault,
> - .pmd_fault  = ext2_dax_pmd_fault,
>   .page_mkwrite   = ext2_dax_fault,
>   .pfn_mkwrite= ext2_dax_pfn_mkwrite,
>  };

Would it be better to put a comment mentioning this here? So as the
years go by, this reminds people not to bother trying to implement
it?

/*
 * .pmd_fault is not supported for DAX because allocation in ext2
 * cannot be reliably aligned to huge page sizes and so pmd faults
 * will always fail and fall back to regular faults.
 */
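
i.e. the ops table would end up looking something like this (just the hunk
above combined with that comment):

static const struct vm_operations_struct ext2_dax_vm_ops = {
	.fault		= ext2_dax_fault,
	/*
	 * .pmd_fault is not supported for DAX because allocation in ext2
	 * cannot be reliably aligned to huge page sizes and so pmd faults
	 * will always fail and fall back to regular faults.
	 */
	.page_mkwrite	= ext2_dax_fault,
	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
};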

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 09/11] dax: add struct iomap based DAX PMD support

2016-09-27 Thread Dave Chinner
On Tue, Sep 27, 2016 at 02:48:00PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This patch allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled using the new struct
> iomap based fault handlers.
> 
> There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
> mappings that have an associated block allocation, and 4k DAX empty
> entries.  The empty entries exist to provide locking for the duration of a
> given page fault.
> 
> This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
> entries, PMD DAX entries that have associated block allocations, and 2 MiB
> DAX empty entries.
> 
> Unlike the 4k case where we insert a struct page* into the radix tree for
> 4k zero pages, for HZP we insert a DAX exceptional entry with the new
> RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
> every 2MiB hole mapping, and it doesn't make sense to have that same struct
> page* with multiple entries in multiple trees.  This would cause contention
> on the single page lock for the one Huge Zero Page, and it would break the
> page->index and page->mapping associations that are assumed to be valid in
> many other places in the kernel.
> 
> One difficult use case is when one thread is trying to use 4k entries in
> radix tree for a given offset, and another thread is using 2 MiB entries
> for that same offset.  The current code handles this by making the 2 MiB
> user fall back to 4k entries for most cases.  This was done because it is
> the simplest solution, and because the use of 2MiB pages is already
> opportunistic.
> 
> If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
> we run into the problem of how we lock out 4k page faults for the entire
> 2MiB range while we clean out the radix tree so we can insert the 2MiB
> entry.  We can solve this problem if we need to, but I think that the cases
> where both 2MiB entries and 4K entries are being used for the same range
> will be rare enough and the gain small enough that it probably won't be
> worth the complexity.
> 
> Signed-off-by: Ross Zwisler 

> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> + * more often than one might expect in the below functions.
> + */
> +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> +
> +static void __dax_pmd_dbg(struct iomap *iomap, unsigned long address,
> + const char *reason, const char *fn)
> +{
> + if (iomap) {
> + char bname[BDEVNAME_SIZE];
> +
> + bdevname(iomap->bdev, bname);
> + pr_debug("%s: %s addr %lx dev %s type %#x blkno %ld "
> + "offset %lld length %lld fallback: %s\n", fn,
> + current->comm, address, bname, iomap->type,
> + iomap->blkno, iomap->offset, iomap->length, reason);
> + } else {
> + pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
> + current->comm, address, reason);
> + }
> +}

Yuck! Tracepoints for debugging information like this, please, not
printk awfulness.
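
Something along these lines would do (an untested sketch - the event name
and fields here are only illustrative, wrapped in the usual boilerplate
that include/trace/define_trace.h expects):

#undef TRACE_SYSTEM
#define TRACE_SYSTEM fs_dax

#if !defined(_TRACE_FS_DAX_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_FS_DAX_H

#include <linux/tracepoint.h>

TRACE_EVENT(dax_pmd_insert_mapping,
	TP_PROTO(unsigned long address, long length, unsigned long pfn),
	TP_ARGS(address, length, pfn),
	TP_STRUCT__entry(
		__field(unsigned long, address)
		__field(long, length)
		__field(unsigned long, pfn)
	),
	TP_fast_assign(
		__entry->address = address;
		__entry->length = length;
		__entry->pfn = pfn;
	),
	TP_printk("addr %#lx length %ld pfn %#lx",
		  __entry->address, __entry->length, __entry->pfn)
);

#endif /* _TRACE_FS_DAX_H */
#include <trace/define_trace.h>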

> +
> +#define dax_pmd_dbg(bh, address, reason) \
> + __dax_pmd_dbg(bh, address, reason, __func__)
> +
> +static int iomap_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> + struct vm_fault *vmf, unsigned long address,
> + struct iomap *iomap, loff_t pos, bool write, void **entryp)

Please put a "dax" in the function name. grepping, cscope, etc are
much easier when static function names are namespaced properly.

> +{
> + struct address_space *mapping = vma->vm_file->f_mapping;
> + struct block_device *bdev = iomap->bdev;
> + struct blk_dax_ctl dax = {
> + .sector = iomap_dax_sector(iomap, pos),
> + .size = PMD_SIZE,
> + };
> + long length = dax_map_atomic(bdev, &dax);
> + void *ret;
> +
> + if (length < 0) {
> + dax_pmd_dbg(iomap, address, "dax-error fallback");
> + return VM_FAULT_FALLBACK;
> + }

Fails to unmap. Please use a goto based error stack. And
tracepoints make this much neater:

trace_dax_pmd_insert_mapping(iomap, address, &dax, length);
if (length < 0)
goto unmap_fallback;
if (length < PMD_SIZE)
goto unmap_fallback;
.

trace_dax_pmd_insert_mapping_done(iomap, address, &dax, length);
return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);

unmap_fallback:
dax_unmap_atomic(bdev, &dax);
fallback:
trace_dax_pmd_insert_fallback(iomap, address, &dax, length);
return VM_FAULT_FALLBACK;
}

i.e. we don't need all those debug printks to tell us what
failed - the first tracepoint tells us everything about the context
we are about to check, and the last tracepoint tells us the result.

Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()

2016-09-27 Thread Christoph Hellwig
On Tue, Sep 27, 2016 at 07:17:07PM +0200, Jan Kara wrote:
> OK, the changelog is stale but I actually took care to integrate this with
> your iomap patches and for the new invalidation code in iomap_dax_actor()
> to work we need this additional information...

It's not just the changelogs (which will need updates on more than this
patch), but also the content.  We're not using get_blocks for DAX
anymore, so this patch should not be needed anymore.


Re: [PATCH v3 00/11] re-enable DAX PMD support

2016-09-27 Thread Christoph Hellwig
On Tue, Sep 27, 2016 at 02:47:51PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This series allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled.
> 
> Jan and Christoph, can you please help review these changes?

About to get on a plane, so it might take a bit to do a real review.
In general this looks fine, but I guess the first two ext4 patches
should just go straight to Ted independent of the rest?

Also Jan just posted a giant DAX patchbomb, we'll need to find a way
to integrate all that work, and maybe prioritize things if we want
to get bits into 4.9 still.


[PATCH 09/20] mm: Factor out functionality to finish page faults

2016-09-27 Thread Jan Kara
Introduce function finish_fault() as a helper function for finishing
page faults. It is a rather thin wrapper around alloc_set_pte() but since
we'd want to call this from DAX code or filesystems, it is still useful
to avoid some boilerplate code.
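
For illustration, a fault handler that has its page prepared then just does
the following (this is what do_read_fault() below ends up looking like, with
error handling trimmed):

	ret = __do_fault(vmf);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		return ret;
	/* locks the PTE, inserts it, adds rmap, charges memcg, adds to LRU */
	ret |= finish_fault(vmf);
	unlock_page(vmf->page);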

Signed-off-by: Jan Kara 
---
 include/linux/mm.h |  1 +
 mm/memory.c| 42 +-
 2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index faa77b15e9a6..919ebdd27f1e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -622,6 +622,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
vm_area_struct *vma)
 
 int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
struct page *page);
+int finish_fault(struct vm_fault *vmf);
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 17db88a38e8a..f54cfad7fe04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3029,6 +3029,36 @@ int alloc_set_pte(struct vm_fault *vmf, struct 
mem_cgroup *memcg,
return 0;
 }
 
+
+/**
+ * finish_fault - finish page fault once we have prepared the page to fault
+ *
+ * @vmf: structure describing the fault
+ *
+ * This function handles all that is needed to finish a page fault once the
+ * page to fault in is prepared. It handles locking of PTEs, inserts PTE for
+ * given page, adds reverse page mapping, handles memcg charges and LRU
+ * addition. The function returns 0 on success, VM_FAULT_ code in case of
+ * error.
+ *
+ * The function expects the page to be locked.
+ */
+int finish_fault(struct vm_fault *vmf)
+{
+   struct page *page;
+   int ret;
+
+   /* Did we COW the page? */
+   if (vmf->flags & FAULT_FLAG_WRITE && !(vmf->vma->vm_flags & VM_SHARED))
+   page = vmf->cow_page;
+   else
+   page = vmf->page;
+   ret = alloc_set_pte(vmf, vmf->memcg, page);
+   if (vmf->pte)
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return ret;
+}
+
 static unsigned long fault_around_bytes __read_mostly =
rounddown_pow_of_two(65536);
 
@@ -3174,9 +3204,7 @@ static int do_read_fault(struct vm_fault *vmf)
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
-   ret |= alloc_set_pte(vmf, NULL, vmf->page);
-   if (vmf->pte)
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   ret |= finish_fault(vmf);
unlock_page(vmf->page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
put_page(vmf->page);
@@ -3215,9 +3243,7 @@ static int do_cow_fault(struct vm_fault *vmf)
copy_user_highpage(new_page, vmf->page, vmf->address, vma);
__SetPageUptodate(new_page);
 
-   ret |= alloc_set_pte(vmf, memcg, new_page);
-   if (vmf->pte)
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   ret |= finish_fault(vmf);
if (!(ret & VM_FAULT_DAX_LOCKED)) {
unlock_page(vmf->page);
put_page(vmf->page);
@@ -3258,9 +3284,7 @@ static int do_shared_fault(struct vm_fault *vmf)
}
}
 
-   ret |= alloc_set_pte(vmf, NULL, vmf->page);
-   if (vmf->pte)
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   ret |= finish_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
VM_FAULT_RETRY))) {
unlock_page(vmf->page);
-- 
2.6.6



[PATCH 11/20] mm: Remove unnecessary vma->vm_ops check

2016-09-27 Thread Jan Kara
We don't check whether vma->vm_ops is NULL in do_shared_fault() so
there's hardly any point in checking it in wp_page_shared() which gets
called only for shared file mappings as well.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index a4522e8999b2..63d9c1a54caf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2301,7 +2301,7 @@ static int wp_page_shared(struct vm_fault *vmf, struct 
page *old_page)
 
get_page(old_page);
 
-   if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+   if (vma->vm_ops->page_mkwrite) {
int tmp;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.6.6



[PATCH 20/20] dax: Clear dirty entry tags on cache flush

2016-09-27 Thread Jan Kara
Currently we never clear dirty tags in DAX mappings and thus address
ranges to flush accumulate. Now that we have locking of radix tree
entries, we have all the locking necessary to reliably clear the radix
tree dirty tag when flushing caches for corresponding address range.
Similarly to page_mkclean() we also have to write-protect pages to get a
page fault when the page is next written to so that we can mark the
entry dirty again.
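
The resulting per-entry flow in dax_writeback_one() is then roughly (see the
diff below):

	/* write-protect all mappings of this index first */
	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
	wb_cache_pmem(dax.addr, dax.size);
	/* only after the flush is it safe to drop the dirty tag */
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);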

Signed-off-by: Jan Kara 
---
 fs/dax.c | 64 
 1 file changed, 64 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index a2d3781c9f4e..233f548d298e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include <linux/mmu_notifier.h>
 #include 
 #include "internal.h"
 
@@ -668,6 +669,59 @@ static void *dax_insert_mapping_entry(struct address_space 
*mapping,
return new_entry;
 }
 
+static inline unsigned long
+pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
+{
+   unsigned long address;
+
+   address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   return address;
+}
+
+/* Walk all mappings of a given index of a file and writeprotect them */
+static void dax_mapping_entry_mkclean(struct address_space *mapping,
+ pgoff_t index, unsigned long pfn)
+{
+   struct vm_area_struct *vma;
+   pte_t *ptep;
+   pte_t pte;
+   spinlock_t *ptl;
+   bool changed;
+
+   i_mmap_lock_read(mapping);
+   vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
+   unsigned long address;
+
+   cond_resched();
+
+   if (!(vma->vm_flags & VM_SHARED))
+   continue;
+
+   address = pgoff_address(index, vma);
+   changed = false;
+   if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
+   continue;
+   if (pfn != pte_pfn(*ptep))
+   goto unlock;
+   if (!pte_dirty(*ptep) && !pte_write(*ptep))
+   goto unlock;
+
+   flush_cache_page(vma, address, pfn);
+   pte = ptep_clear_flush(vma, address, ptep);
+   pte = pte_wrprotect(pte);
+   pte = pte_mkclean(pte);
+   set_pte_at(vma->vm_mm, address, ptep, pte);
+   changed = true;
+unlock:
+   pte_unmap_unlock(ptep, ptl);
+
+   if (changed)
+   mmu_notifier_invalidate_page(vma->vm_mm, address);
+   }
+   i_mmap_unlock_read(mapping);
+}
+
 static int dax_writeback_one(struct block_device *bdev,
struct address_space *mapping, pgoff_t index, void *entry)
 {
@@ -735,7 +789,17 @@ static int dax_writeback_one(struct block_device *bdev,
goto unmap;
}
 
+   dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
wb_cache_pmem(dax.addr, dax.size);
+   /*
+* After we have flushed the cache, we can clear the dirty tag. There
+* cannot be new dirty data in the pfn after the flush has completed as
+* the pfn mappings are writeprotected and fault waits for mapping
+* entry lock.
+*/
+   spin_lock_irq(&mapping->tree_lock);
+   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
+   spin_unlock_irq(&mapping->tree_lock);
 unmap:
dax_unmap_atomic(bdev, &dax);
put_locked_mapping_entry(mapping, index, entry);
-- 
2.6.6



[PATCH 0/20 v3] dax: Clear dirty bits after flushing caches

2016-09-27 Thread Jan Kara
Hello,

this is the third revision of my patches to clear dirty bits from the radix tree
of DAX inodes when caches for the corresponding pfns have been flushed. This patch set
is significantly larger than the previous version because I'm changing how
->fault, ->page_mkwrite, and ->pfn_mkwrite handlers may choose to handle the
fault so that we don't have to leak details about DAX locking into the generic
code. In principle, these patches enable handlers to easily update PTEs and do
other work necessary to finish the fault without duplicating the functionality
present in the generic code. I'd really like feedback from mm folks on whether
such changes to the fault handling code are fine or what they'd do differently.

The patches pass testing with xfstests on ext4 and xfs on my end
- just be aware they are the basis for further DAX fixes without which some
stress tests can still trigger failures. I'll be sending these fixes separately
in order to keep the series of reasonable size. For full testing, you
can pull all the patches from

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git dax

but be aware I will likely rebase that branch and do other nasty stuff with
it so don't use it as a basis of your git trees.

Changes since v2:
* rebased on top of 4.8-rc8 - this involved dealing with new fault_env
  structure
* changed calling convention for fault helpers

Changes since v1:
* make sure all PTE updates happen under radix tree entry lock to protect
  against races between faults & write-protecting code
* remove information about DAX locking from mm/memory.c
* smaller updates based on Ross' feedback


Background information regarding the motivation:

Currently we never clear dirty bits in the radix tree of a DAX inode. Thus
fsync(2) flushes all the dirty pfns again and again. These patches implement
clearing of the dirty tag in the radix tree so that we issue a flush only when
needed.

The difficulty with clearing the dirty tag is that we have to protect against
a concurrent page fault setting the dirty tag and writing new data into the
page. So we need a lock serializing page faults and clearing of the dirty tag
and write-protecting PTEs (so that we get another page fault when the pfn is
written to again and we have to set the dirty tag again).

The effect of the patch set is easily visible:

Writing 1 GB of data via mmap, then fsync twice.

Before this patch set both fsyncs take ~205 ms on my test machine. After the
patch set the first fsync takes ~283 ms (the additional cost of walking PTEs,
clearing dirty bits etc. is very noticeable), while the second fsync takes below
1 us.
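
For reference, a minimal sketch of such a test (not the exact program used
for the numbers above) is:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/time.h>
	#include <unistd.h>

	#define SIZE	(1024UL * 1024 * 1024)	/* 1 GB */

	static double now(void)
	{
		struct timeval tv;

		gettimeofday(&tv, NULL);
		return tv.tv_sec + tv.tv_usec / 1e6;
	}

	int main(int argc, char **argv)
	{
		char *p;
		double t;
		int fd, i;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDWR | O_CREAT, 0644);
		ftruncate(fd, SIZE);
		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		memset(p, 0xab, SIZE);		/* dirty the whole mapping */

		for (i = 0; i < 2; i++) {
			t = now();
			fsync(fd);
			printf("fsync %d: %.3f ms\n", i + 1, (now() - t) * 1000);
		}
		return 0;
	}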

As a bonus, these patches make filesystem freezing for DAX filesystems
reliable because mappings are now properly writeprotected while freezing the
fs.

Patches have passed xfstests for both xfs and ext4.

Honza


[PATCH 13/20] mm: Pass vm_fault structure into do_page_mkwrite()

2016-09-27 Thread Jan Kara
We will need more information in the ->page_mkwrite() helper for DAX to
be able to fully finish faults there. Pass vm_fault structure to
do_page_mkwrite() and use it there so that information propagates
properly from upper layers.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0643b3b5a12a..7c87edaa7a8f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2034,20 +2034,14 @@ static gfp_t __get_fault_gfp_mask(struct vm_area_struct 
*vma)
  *
  * We do this without the lock held, so that it can sleep if it needs to.
  */
-static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
-  unsigned long address)
+static int do_page_mkwrite(struct vm_fault *vmf)
 {
-   struct vm_fault vmf;
int ret;
+   struct page *page = vmf->page;
 
-   vmf.virtual_address = address & PAGE_MASK;
-   vmf.pgoff = page->index;
-   vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
-   vmf.gfp_mask = __get_fault_gfp_mask(vma);
-   vmf.page = page;
-   vmf.cow_page = NULL;
+   vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
-   ret = vma->vm_ops->page_mkwrite(vma, &vmf);
+   ret = vmf->vma->vm_ops->page_mkwrite(vmf->vma, vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
return ret;
if (unlikely(!(ret & VM_FAULT_LOCKED))) {
@@ -2323,7 +2317,8 @@ static int wp_page_shared(struct vm_fault *vmf, struct 
page *old_page)
int tmp;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   tmp = do_page_mkwrite(vma, old_page, vmf->address);
+   vmf->page = old_page;
+   tmp = do_page_mkwrite(vmf);
if (unlikely(!tmp || (tmp &
  (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(old_page);
@@ -3286,7 +3281,7 @@ static int do_shared_fault(struct vm_fault *vmf)
 */
if (vma->vm_ops->page_mkwrite) {
unlock_page(vmf->page);
-   tmp = do_page_mkwrite(vma, vmf->page, vmf->address);
+   tmp = do_page_mkwrite(vmf);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(vmf->page);
-- 
2.6.6



[PATCH 16/20] mm: Provide helper for finishing mkwrite faults

2016-09-27 Thread Jan Kara
Provide a helper function for finishing write faults due to PTE being
read-only. The helper will be used by DAX to avoid the need of
complicating generic MM code with DAX locking specifics.
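
For illustration, wp_pfn_shared() below then ends up doing just:

	ret = vma->vm_ops->pfn_mkwrite(vma, vmf);
	if (ret & VM_FAULT_ERROR)
		return ret;
	/* re-takes the PTE lock, revalidates the PTE and makes it writeable */
	return finish_mkwrite_fault(vmf);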

Signed-off-by: Jan Kara 
---
 include/linux/mm.h |  1 +
 mm/memory.c| 65 +++---
 2 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1055f2ece80d..e5a014be8932 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -617,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
vm_area_struct *vma)
 int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
struct page *page);
 int finish_fault(struct vm_fault *vmf);
+int finish_mkwrite_fault(struct vm_fault *vmf);
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index f49e736d6a36..8c8cb7f2133e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2266,6 +2266,36 @@ oom:
return VM_FAULT_OOM;
 }
 
+/**
+ * finish_mkwrite_fault - finish page fault making PTE writeable once the
+ *  page is prepared
+ *
+ * @vmf: structure describing the fault
+ *
+ * This function handles all that is needed to finish a write page fault due
+ * to PTE being read-only once the mapped page is prepared. It handles locking
+ * of PTE and modifying it. The function returns VM_FAULT_WRITE on success,
+ * 0 when PTE got changed before we acquired PTE lock.
+ *
+ * The function expects the page to be locked or other protection against
+ * concurrent faults / writeback (such as DAX radix tree locks).
+ */
+int finish_mkwrite_fault(struct vm_fault *vmf)
+{
+   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
+  &vmf->ptl);
+   /*
+* We might have raced with another page fault while we released the
+* pte_offset_map_lock.
+*/
+   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return 0;
+   }
+   wp_page_reuse(vmf);
+   return VM_FAULT_WRITE;
+}
+
 /*
  * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
  * mapping
@@ -2282,16 +2312,7 @@ static int wp_pfn_shared(struct vm_fault *vmf)
ret = vma->vm_ops->pfn_mkwrite(vma, vmf);
if (ret & VM_FAULT_ERROR)
return ret;
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
-   /*
-* We might have raced with another page fault while we
-* released the pte_offset_map_lock.
-*/
-   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return 0;
-   }
+   return finish_mkwrite_fault(vmf);
}
wp_page_reuse(vmf);
return VM_FAULT_WRITE;
@@ -2301,7 +2322,6 @@ static int wp_page_shared(struct vm_fault *vmf)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
-   int page_mkwrite = 0;
 
get_page(vmf->page);
 
@@ -2315,26 +2335,17 @@ static int wp_page_shared(struct vm_fault *vmf)
put_page(vmf->page);
return tmp;
}
-   /*
-* Since we dropped the lock we need to revalidate
-* the PTE as someone else may have changed it.  If
-* they did, we just return, as we can count on the
-* MMU to tell us if they didn't also make it writable.
-*/
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
-   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
+   tmp = finish_mkwrite_fault(vmf);
+   if (unlikely(!tmp || (tmp &
+ (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
unlock_page(vmf->page);
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
put_page(vmf->page);
-   return 0;
+   return tmp;
}
-   page_mkwrite = 1;
-   }
-
-   wp_page_reuse(vmf);
-   if (!page_mkwrite)
+   } else {
+   wp_page_reuse(vmf);
lock_page(vmf->page);
+   }
fault_dirty_shared_page(vma, vmf->page);
put_page(vmf->page);
 
-- 
2.6.6



[PATCH 10/20] mm: Move handling of COW faults into DAX code

2016-09-27 Thread Jan Kara
Move final handling of COW faults from generic code into the DAX fault
handler. That way generic code doesn't have to be aware of the peculiarities
of DAX locking, so remove that knowledge from it.
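
The COW branch of the DAX fault handlers then ends up as (see the fs/dax.c
hunks below):

	error = finish_fault(vmf);
	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
	return error ? error : VM_FAULT_DONE_COW;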

Signed-off-by: Jan Kara 
---
 fs/dax.c| 22 --
 include/linux/dax.h |  7 ---
 include/linux/mm.h  |  9 +
 mm/memory.c | 14 --
 4 files changed, 21 insertions(+), 31 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0dc251ca77b8..b1c503930d1d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -876,10 +876,15 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault 
*vmf,
goto unlock_entry;
if (!radix_tree_exceptional_entry(entry)) {
vmf->page = entry;
-   return VM_FAULT_LOCKED;
+   if (unlikely(PageHWPoison(entry))) {
+   put_locked_mapping_entry(mapping, vmf->pgoff,
+entry);
+   return VM_FAULT_HWPOISON;
+   }
}
-   vmf->entry = entry;
-   return VM_FAULT_DAX_LOCKED;
+   error = finish_fault(vmf);
+   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+   return error ? error : VM_FAULT_DONE_COW;
}
 
if (!buffer_mapped(&bh)) {
@@ -1430,10 +1435,15 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
goto unlock_entry;
if (!radix_tree_exceptional_entry(entry)) {
vmf->page = entry;
-   return VM_FAULT_LOCKED;
+   if (unlikely(PageHWPoison(entry))) {
+   put_locked_mapping_entry(mapping, vmf->pgoff,
+entry);
+   return VM_FAULT_HWPOISON;
+   }
}
-   vmf->entry = entry;
-   return VM_FAULT_DAX_LOCKED;
+   error = finish_fault(vmf);
+   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+   return error ? error : VM_FAULT_DONE_COW;
}
 
switch (iomap.type) {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4bc568f..b1a1acd10df2 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -26,7 +26,6 @@ void dax_wake_mapping_entry_waiter(struct address_space 
*mapping,
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
-void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
 int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
unsigned int offset, unsigned int length);
 #else
@@ -35,12 +34,6 @@ static inline struct page *read_dax_sector(struct 
block_device *bdev,
 {
return ERR_PTR(-ENXIO);
 }
-/* Shouldn't ever be called when dax is disabled. */
-static inline void dax_unlock_mapping_entry(struct address_space *mapping,
-   pgoff_t index)
-{
-   BUG();
-}
 static inline int __dax_zero_page_range(struct block_device *bdev,
sector_t sector, unsigned int offset, unsigned int length)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 919ebdd27f1e..1055f2ece80d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -310,12 +310,6 @@ struct vm_fault {
 * is set (which is also implied by
 * VM_FAULT_ERROR).
 */
-   void *entry;/* ->fault handler can alternatively
-* return locked DAX entry. In that
-* case handler should return
-* VM_FAULT_DAX_LOCKED and fill in
-* entry here.
-*/
/* These three entries are valid only while holding ptl lock */
pte_t *pte; /* Pointer to pte entry matching
 * the 'address'. NULL if the page
@@ -1118,8 +1112,7 @@ static inline void clear_page_pfmemalloc(struct page 
*page)
 #define VM_FAULT_LOCKED0x0200  /* ->fault locked the returned page */
 #define VM_FAULT_RETRY 0x0400  /* ->fault blocked, must retry */
 #define VM_FAULT_FALLBACK 0x0800   /* huge page fault failed, fall back to 
small */
-#define VM_FAULT_DAX_LOCKED 0x1000 /* ->fault has locked DAX entry */
-#define VM_FAULT_DONE_COW   0x2000 /* ->fault has fully handled COW */
+#define VM_FAULT_DONE_COW   0x1000 /* ->fault has fully handled COW */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large 
hwpoison */
 
diff --git a/mm/memory.c b/mm/memory.c

[PATCH 18/20] dax: Make cache flushing protected by entry lock

2016-09-27 Thread Jan Kara
Currently, flushing of caches for DAX mappings ignores the entry lock. So
far this was OK (modulo a bug where a difference in the entry lock bit could
cause cache flushing to be mistakenly skipped), but in the following
patches we will write-protect PTEs on cache flushing and clear dirty
tags. For that we will need more exclusion. So do cache flushing under
an entry lock. As a bonus, this allows us to remove one lock-unlock pair of
mapping->tree_lock.
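
The ordering in dax_writeback_one() thus becomes (see the diff below):

	entry = lock_slot(mapping, slot);	/* serializes against faults */
	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
	spin_unlock_irq(&mapping->tree_lock);
	...
	wb_cache_pmem(dax.addr, dax.size);
	dax_unmap_atomic(bdev, &dax);
	put_locked_mapping_entry(mapping, index, entry);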

Signed-off-by: Jan Kara 
---
 fs/dax.c | 66 +---
 1 file changed, 42 insertions(+), 24 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b1c503930d1d..c6cadf8413a3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -672,43 +672,63 @@ static int dax_writeback_one(struct block_device *bdev,
struct address_space *mapping, pgoff_t index, void *entry)
 {
struct radix_tree_root *page_tree = &mapping->page_tree;
-   int type = RADIX_DAX_TYPE(entry);
-   struct radix_tree_node *node;
struct blk_dax_ctl dax;
-   void **slot;
+   void *entry2, **slot;
int ret = 0;
+   int type;
 
-   spin_lock_irq(&mapping->tree_lock);
/*
-* Regular page slots are stabilized by the page lock even
-* without the tree itself locked.  These unlocked entries
-* need verification under the tree lock.
+* A page got tagged dirty in DAX mapping? Something is seriously
+* wrong.
 */
-   if (!__radix_tree_lookup(page_tree, index, &node, &slot))
-   goto unlock;
-   if (*slot != entry)
-   goto unlock;
-
-   /* another fsync thread may have already written back this entry */
-   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-   goto unlock;
+   if (WARN_ON(!radix_tree_exceptional_entry(entry)))
+   return -EIO;
 
+   spin_lock_irq(&mapping->tree_lock);
+   entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
+   /* Entry got punched out / reallocated? */
+   if (!entry2 || !radix_tree_exceptional_entry(entry2))
+   goto put_unlock;
+   /*
+* Entry got reallocated elsewhere? No need to writeback. We have to
+* compare sectors as we must not bail out due to difference in lockbit
+* or entry type.
+*/
+   if (RADIX_DAX_SECTOR(entry2) != RADIX_DAX_SECTOR(entry))
+   goto put_unlock;
+   type = RADIX_DAX_TYPE(entry2);
if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
ret = -EIO;
-   goto unlock;
+   goto put_unlock;
}
 
+   /* Another fsync thread may have already written back this entry */
+   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+   goto put_unlock;
+   /* Lock the entry to serialize with page faults */
+   entry = lock_slot(mapping, slot);
+   /*
+* We can clear the tag now but we have to be careful so that concurrent
+* dax_writeback_one() calls for the same index cannot finish before we
+* actually flush the caches. This is achieved as the calls will look
+* at the entry only under tree_lock and once they do that they will
+* see the entry locked and wait for it to unlock.
+*/
+   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+   spin_unlock_irq(&mapping->tree_lock);
+
dax.sector = RADIX_DAX_SECTOR(entry);
dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
-   spin_unlock_irq(&mapping->tree_lock);
 
/*
 * We cannot hold tree_lock while calling dax_map_atomic() because it
 * eventually calls cond_resched().
 */
ret = dax_map_atomic(bdev, &dax);
-   if (ret < 0)
+   if (ret < 0) {
+   put_locked_mapping_entry(mapping, index, entry);
return ret;
+   }
 
if (WARN_ON_ONCE(ret < dax.size)) {
ret = -EIO;
@@ -716,15 +736,13 @@ static int dax_writeback_one(struct block_device *bdev,
}
 
wb_cache_pmem(dax.addr, dax.size);
-
-   spin_lock_irq(&mapping->tree_lock);
-   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
-   spin_unlock_irq(&mapping->tree_lock);
- unmap:
+unmap:
dax_unmap_atomic(bdev, &dax);
+   put_locked_mapping_entry(mapping, index, entry);
return ret;
 
- unlock:
+put_unlock:
+   put_unlocked_mapping_entry(mapping, index, entry2);
spin_unlock_irq(&mapping->tree_lock);
return ret;
 }
-- 
2.6.6



[PATCH 12/20] mm: Factor out common parts of write fault handling

2016-09-27 Thread Jan Kara
Currently we duplicate handling of shared write faults in
wp_page_reuse() and do_shared_fault(). Factor them out into a common
function.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 78 +
 1 file changed, 37 insertions(+), 41 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 63d9c1a54caf..0643b3b5a12a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2063,6 +2063,41 @@ static int do_page_mkwrite(struct vm_area_struct *vma, 
struct page *page,
 }
 
 /*
+ * Handle dirtying of a page in shared file mapping on a write fault.
+ *
+ * The function expects the page to be locked and unlocks it.
+ */
+static void fault_dirty_shared_page(struct vm_area_struct *vma,
+   struct page *page)
+{
+   struct address_space *mapping;
+   bool dirtied;
+   bool page_mkwrite = vma->vm_ops->page_mkwrite;
+
+   dirtied = set_page_dirty(page);
+   VM_BUG_ON_PAGE(PageAnon(page), page);
+   /*
+* Take a local copy of the address_space - page.mapping may be zeroed
+* by truncate after unlock_page().   The address_space itself remains
+* pinned by vma->vm_file's reference.  We rely on unlock_page()'s
+* release semantics to prevent the compiler from undoing this copying.
+*/
+   mapping = page_rmapping(page);
+   unlock_page(page);
+
+   if ((dirtied || page_mkwrite) && mapping) {
+   /*
+* Some device drivers do not set page.mapping
+* but still dirty their pages
+*/
+   balance_dirty_pages_ratelimited(mapping);
+   }
+
+   if (!page_mkwrite)
+   file_update_time(vma->vm_file);
+}
+
+/*
  * Handle write page faults for pages that can be reused in the current vma
  *
  * This can happen either due to the mapping being with the VM_SHARED flag,
@@ -2092,28 +2127,11 @@ static inline int wp_page_reuse(struct vm_fault *vmf, 
struct page *page,
pte_unmap_unlock(vmf->pte, vmf->ptl);
 
if (dirty_shared) {
-   struct address_space *mapping;
-   int dirtied;
-
if (!page_mkwrite)
lock_page(page);
 
-   dirtied = set_page_dirty(page);
-   VM_BUG_ON_PAGE(PageAnon(page), page);
-   mapping = page->mapping;
-   unlock_page(page);
+   fault_dirty_shared_page(vma, page);
put_page(page);
-
-   if ((dirtied || page_mkwrite) && mapping) {
-   /*
-* Some device drivers do not set page.mapping
-* but still dirty their pages
-*/
-   balance_dirty_pages_ratelimited(mapping);
-   }
-
-   if (!page_mkwrite)
-   file_update_time(vma->vm_file);
}
 
return VM_FAULT_WRITE;
@@ -3256,8 +3274,6 @@ uncharge_out:
 static int do_shared_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct address_space *mapping;
-   int dirtied = 0;
int ret, tmp;
 
ret = __do_fault(vmf);
@@ -3286,27 +3302,7 @@ static int do_shared_fault(struct vm_fault *vmf)
return ret;
}
 
-   if (set_page_dirty(vmf->page))
-   dirtied = 1;
-   /*
-* Take a local copy of the address_space - page.mapping may be zeroed
-* by truncate after unlock_page().   The address_space itself remains
-* pinned by vma->vm_file's reference.  We rely on unlock_page()'s
-* release semantics to prevent the compiler from undoing this copying.
-*/
-   mapping = page_rmapping(vmf->page);
-   unlock_page(vmf->page);
-   if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
-   /*
-* Some device drivers do not set page.mapping but still
-* dirty their pages
-*/
-   balance_dirty_pages_ratelimited(mapping);
-   }
-
-   if (!vma->vm_ops->page_mkwrite)
-   file_update_time(vma->vm_file);
-
+   fault_dirty_shared_page(vma, vmf->page);
return ret;
 }
 
-- 
2.6.6



[PATCH 05/20] mm: Trim __do_fault() arguments

2016-09-27 Thread Jan Kara
Use vm_fault structure to pass cow_page, page, and entry in and out of
the function. That reduces the number of __do_fault() arguments from 4 to 1.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 53 +++--
 1 file changed, 23 insertions(+), 30 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b7f1f535e079..ba7760fb7db2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2844,26 +2844,22 @@ oom:
  * released depending on flags and vma->vm_ops->fault() return value.
  * See filemap_fault() and __lock_page_retry().
  */
-static int __do_fault(struct vm_fault *vmf, struct page *cow_page,
- struct page **page, void **entry)
+static int __do_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
int ret;
 
-   vmf->cow_page = cow_page;
-
ret = vma->vm_ops->fault(vma, vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
-   if (ret & VM_FAULT_DAX_LOCKED) {
-   *entry = vmf->entry;
+   if (ret & VM_FAULT_DAX_LOCKED)
return ret;
-   }
 
if (unlikely(PageHWPoison(vmf->page))) {
if (ret & VM_FAULT_LOCKED)
unlock_page(vmf->page);
put_page(vmf->page);
+   vmf->page = NULL;
return VM_FAULT_HWPOISON;
}
 
@@ -2872,7 +2868,6 @@ static int __do_fault(struct vm_fault *vmf, struct page 
*cow_page,
else
VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
 
-   *page = vmf->page;
return ret;
 }
 
@@ -3169,7 +3164,6 @@ out:
 static int do_read_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *fault_page;
int ret = 0;
 
/*
@@ -3183,24 +3177,23 @@ static int do_read_fault(struct vm_fault *vmf)
return ret;
}
 
-   ret = __do_fault(vmf, NULL, &fault_page, NULL);
+   ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
-   ret |= alloc_set_pte(vmf, NULL, fault_page);
+   ret |= alloc_set_pte(vmf, NULL, vmf->page);
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   unlock_page(fault_page);
+   unlock_page(vmf->page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
-   put_page(fault_page);
+   put_page(vmf->page);
return ret;
 }
 
 static int do_cow_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *fault_page, *new_page;
-   void *fault_entry;
+   struct page *new_page;
struct mem_cgroup *memcg;
int ret;
 
@@ -3217,20 +3210,21 @@ static int do_cow_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
 
-   ret = __do_fault(vmf, new_page, &fault_page, &fault_entry);
+   vmf->cow_page = new_page;
+   ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
 
if (!(ret & VM_FAULT_DAX_LOCKED))
-   copy_user_highpage(new_page, fault_page, vmf->address, vma);
+   copy_user_highpage(new_page, vmf->page, vmf->address, vma);
__SetPageUptodate(new_page);
 
ret |= alloc_set_pte(vmf, memcg, new_page);
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
if (!(ret & VM_FAULT_DAX_LOCKED)) {
-   unlock_page(fault_page);
-   put_page(fault_page);
+   unlock_page(vmf->page);
+   put_page(vmf->page);
} else {
dax_unlock_mapping_entry(vma->vm_file->f_mapping, vmf->pgoff);
}
@@ -3246,12 +3240,11 @@ uncharge_out:
 static int do_shared_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *fault_page;
struct address_space *mapping;
int dirtied = 0;
int ret, tmp;
 
-   ret = __do_fault(vmf, NULL, &fault_page, NULL);
+   ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
@@ -3260,26 +3253,26 @@ static int do_shared_fault(struct vm_fault *vmf)
 * about to become writable
 */
if (vma->vm_ops->page_mkwrite) {
-   unlock_page(fault_page);
-   tmp = do_page_mkwrite(vma, fault_page, vmf->address);
+   unlock_page(vmf->page);
+   tmp = do_page_mkwrite(vma, vmf->page, vmf->address);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
-   put_page(fault_page);
+   put_page(vmf->page);
return tmp;
}
}
 
-   ret |= 

[PATCH 14/20] mm: Use vmf->page during WP faults

2016-09-27 Thread Jan Kara
So far we set vmf->page during WP faults only when we needed to pass it
to the ->page_mkwrite handler. Set it in all the cases now and use that
instead of passing the page pointer explicitly around.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 58 +-
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7c87edaa7a8f..98304eb7bff4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2099,11 +2099,12 @@ static void fault_dirty_shared_page(struct 
vm_area_struct *vma,
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
  */
-static inline int wp_page_reuse(struct vm_fault *vmf, struct page *page,
+static inline int wp_page_reuse(struct vm_fault *vmf,
int page_mkwrite, int dirty_shared)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
+   struct page *page = vmf->page;
pte_t entry;
/*
 * Clear the pages cpupid information as the existing
@@ -2147,10 +2148,11 @@ static inline int wp_page_reuse(struct vm_fault *vmf, 
struct page *page,
  *   held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old 
page.
  */
-static int wp_page_copy(struct vm_fault *vmf, struct page *old_page)
+static int wp_page_copy(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
+   struct page *old_page = vmf->page;
struct page *new_page = NULL;
pte_t entry;
int page_copied = 0;
@@ -2302,26 +2304,25 @@ static int wp_pfn_shared(struct vm_fault *vmf)
return 0;
}
}
-   return wp_page_reuse(vmf, NULL, 0, 0);
+   return wp_page_reuse(vmf, 0, 0);
 }
 
-static int wp_page_shared(struct vm_fault *vmf, struct page *old_page)
+static int wp_page_shared(struct vm_fault *vmf)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
int page_mkwrite = 0;
 
-   get_page(old_page);
+   get_page(vmf->page);
 
if (vma->vm_ops->page_mkwrite) {
int tmp;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   vmf->page = old_page;
tmp = do_page_mkwrite(vmf);
if (unlikely(!tmp || (tmp &
  (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
-   put_page(old_page);
+   put_page(vmf->page);
return tmp;
}
/*
@@ -2333,15 +2334,15 @@ static int wp_page_shared(struct vm_fault *vmf, struct 
page *old_page)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
-   unlock_page(old_page);
+   unlock_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   put_page(old_page);
+   put_page(vmf->page);
return 0;
}
page_mkwrite = 1;
}
 
-   return wp_page_reuse(vmf, old_page, page_mkwrite, 1);
+   return wp_page_reuse(vmf, page_mkwrite, 1);
 }
 
 /*
@@ -2366,10 +2367,9 @@ static int do_wp_page(struct vm_fault *vmf)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *old_page;
 
-   old_page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
-   if (!old_page) {
+   vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+   if (!vmf->page) {
/*
 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
 * VM_PFNMAP VMA.
@@ -2382,30 +2382,30 @@ static int do_wp_page(struct vm_fault *vmf)
return wp_pfn_shared(vmf);
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return wp_page_copy(vmf, old_page);
+   return wp_page_copy(vmf);
}
 
/*
 * Take out anonymous pages first, anonymous shared vmas are
 * not dirty accountable.
 */
-   if (PageAnon(old_page) && !PageKsm(old_page)) {
+   if (PageAnon(vmf->page) && !PageKsm(vmf->page)) {
int total_mapcount;
-   if (!trylock_page(old_page)) {
-   get_page(old_page);
+   if (!trylock_page(vmf->page)) {
+   get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   lock_page(old_page);
+   lock_page(vmf->page);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);

[PATCH 15/20] mm: Move part of wp_page_reuse() into the single call site

2016-09-27 Thread Jan Kara
The write-shared fault handling in wp_page_reuse() is needed only by
wp_page_shared(). Move that handling into its single call site to make
wp_page_reuse() simpler and avoid a strange situation where we sometimes
pass in a locked page and sometimes an unlocked one, etc.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 98304eb7bff4..f49e736d6a36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2099,8 +2099,7 @@ static void fault_dirty_shared_page(struct vm_area_struct 
*vma,
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
  */
-static inline int wp_page_reuse(struct vm_fault *vmf,
-   int page_mkwrite, int dirty_shared)
+static inline void wp_page_reuse(struct vm_fault *vmf)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
@@ -2120,16 +2119,6 @@ static inline int wp_page_reuse(struct vm_fault *vmf,
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
-
-   if (dirty_shared) {
-   if (!page_mkwrite)
-   lock_page(page);
-
-   fault_dirty_shared_page(vma, page);
-   put_page(page);
-   }
-
-   return VM_FAULT_WRITE;
 }
 
 /*
@@ -2304,7 +2293,8 @@ static int wp_pfn_shared(struct vm_fault *vmf)
return 0;
}
}
-   return wp_page_reuse(vmf, 0, 0);
+   wp_page_reuse(vmf);
+   return VM_FAULT_WRITE;
 }
 
 static int wp_page_shared(struct vm_fault *vmf)
@@ -2342,7 +2332,13 @@ static int wp_page_shared(struct vm_fault *vmf)
page_mkwrite = 1;
}
 
-   return wp_page_reuse(vmf, page_mkwrite, 1);
+   wp_page_reuse(vmf);
+   if (!page_mkwrite)
+   lock_page(vmf->page);
+   fault_dirty_shared_page(vma, vmf->page);
+   put_page(vmf->page);
+
+   return VM_FAULT_WRITE;
 }
 
 /*
@@ -2417,7 +2413,8 @@ static int do_wp_page(struct vm_fault *vmf)
page_move_anon_rmap(vmf->page, vma);
}
unlock_page(vmf->page);
-   return wp_page_reuse(vmf, 0, 0);
+   wp_page_reuse(vmf);
+   return VM_FAULT_WRITE;
}
unlock_page(vmf->page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-- 
2.6.6



[PATCH 5/6] mm: Invalidate DAX radix tree entries only if appropriate

2016-09-27 Thread Jan Kara
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this
is not desirable as we track cache dirtiness in these entries, and when
they are evicted we may not flush caches although it is necessary. This
can manifest, for example, when we write to the same block both via mmap
and via write(2) (to different offsets): fsync(2) then does not properly
flush CPU caches when the modification via write(2) was the last one.
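
A sketch of the problematic sequence (one file block, two offsets within it;
the details are only illustrative):

	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	p[0] = 1;			/* dirties the DAX radix tree entry */
	pwrite(fd, buf, 128, 512);	/* invalidation evicts that entry... */
	fsync(fd);			/* ...and the mmap store is not flushed */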

Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.

Signed-off-by: Jan Kara 
---
 fs/dax.c| 71 +
 include/linux/dax.h |  2 ++
 mm/truncate.c   | 71 -
 3 files changed, 122 insertions(+), 22 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1542653e8aa1..c8a639d2214e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -521,16 +521,38 @@ static void put_unlocked_mapping_entry(struct 
address_space *mapping,
dax_wake_mapping_entry_waiter(mapping, index, false);
 }
 
+static int __dax_invalidate_mapping_entry(struct address_space *mapping,
+ pgoff_t index, bool trunc)
+{
+   int ret = 0;
+   void *entry;
+   struct radix_tree_root *page_tree = &mapping->page_tree;
+
+   spin_lock_irq(&mapping->tree_lock);
+   entry = get_unlocked_mapping_entry(mapping, index, NULL);
+   if (!entry || !radix_tree_exceptional_entry(entry))
+   goto out;
+   if (!trunc &&
+   (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
+   goto out;
+   radix_tree_delete(page_tree, index);
+   mapping->nrexceptional--;
+   ret = 1;
+out:
+   spin_unlock_irq(&mapping->tree_lock);
+   if (ret)
+   dax_wake_mapping_entry_waiter(mapping, index, true);
+   return ret;
+}
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-   void *entry;
+   int ret = __dax_invalidate_mapping_entry(mapping, index, true);
 
-   spin_lock_irq(&mapping->tree_lock);
-   entry = get_unlocked_mapping_entry(mapping, index, NULL);
/*
 * This gets called from truncate / punch_hole path. As such, the caller
 * must hold locks protecting against concurrent modifications of the
@@ -538,16 +560,45 @@ int dax_delete_mapping_entry(struct address_space 
*mapping, pgoff_t index)
 * caller has seen exceptional entry for this index, we better find it
 * at that index as well...
 */
-   if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
-   spin_unlock_irq(&mapping->tree_lock);
-   return 0;
-   }
-   radix_tree_delete(&mapping->page_tree, index);
+   WARN_ON_ONCE(!ret);
+   return ret;
+}
+
+/*
+ * Invalidate exceptional DAX entry if easily possible. This handles DAX
+ * entries for invalidate_inode_pages() so we evict the entry only if we can
+ * do so without blocking.
+ */
+int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+   int ret = 0;
+   void *entry, **slot;
+   struct radix_tree_root *page_tree = &mapping->page_tree;
+
+   spin_lock_irq(&mapping->tree_lock);
+   entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
+   if (!entry || !radix_tree_exceptional_entry(entry) ||
+   slot_locked(mapping, slot))
+   goto out;
+   if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+   radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+   goto out;
+   radix_tree_delete(page_tree, index);
mapping->nrexceptional--;
+   ret = 1;
+out:
spin_unlock_irq(&mapping->tree_lock);
-   dax_wake_mapping_entry_waiter(mapping, index, true);
+   if (ret)
+   dax_wake_mapping_entry_waiter(mapping, index, true);
+   return ret;
+}
 
-   return 1;
+/*
+ * Invalidate exceptional DAX entry if it is clean.
+ */
+int dax_invalidate_mapping_entry2(struct address_space *mapping, pgoff_t index)
+{
+   return __dax_invalidate_mapping_entry(mapping, index, false);
 }
 
 /*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b1a1acd10df2..d2fd94b057fe 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -21,6 +21,8 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
+int dax_invalidate_mapping_entry(struct 

[PATCH 6/6] dax: Avoid page invalidation races and unnecessary radix tree traversals

2016-09-27 Thread Jan Kara
Currently each filesystem (possibly through generic_file_direct_write()
or iomap_dax_rw()) takes care of invalidating page tables and evicting
hole pages from the radix tree when a write(2) to the file happens. This
invalidation is only necessary when there is some block allocation
resulting from the write(2). Furthermore, in its current place the
invalidation is racy with respect to a page fault instantiating a hole
page just after we have invalidated it.

So perform the page invalidation inside dax_do_io() where we can do it
only when really necessary and after blocks have been allocated, so
nobody will be instantiating new hole pages anymore.
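
i.e. the invalidation moves next to the place where we learn about the new
allocation (see the iomap_dax_actor() hunk below):

	if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages)
		invalidate_inode_pages2_range(inode->i_mapping,
					      pos >> PAGE_SHIFT,
					      (end - 1) >> PAGE_SHIFT);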

Signed-off-by: Jan Kara 
---
 fs/dax.c | 40 +++-
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c8a639d2214e..2f69ca891aab 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -186,6 +186,18 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
 */
WARN_ON_ONCE(rw == WRITE &&
 buffer_unwritten(bh));
+   /*
+* Write can allocate block for an area which
+* has a hole page mapped into page tables. We
+* have to tear down these mappings so that
+* data written by write(2) is visible in mmap.
+*/
+   if (buffer_new(bh) &&
+   inode->i_mapping->nrpages) {
+   invalidate_inode_pages2_range(
+ inode->i_mapping, page,
+ (bh_max - 1) >> PAGE_SHIFT);
+   }
} else {
unsigned done = bh->b_size -
(bh_max - (pos - first));
@@ -1410,6 +1422,17 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
return -EIO;
 
+   /*
+* Write can allocate block for an area which has a hole page mapped
+* into page tables. We have to tear down these mappings so that data
+* written by write(2) is visible in mmap.
+*/
+   if (iomap->flags & IOMAP_F_NEW && inode->i_mapping->nrpages) {
+   invalidate_inode_pages2_range(inode->i_mapping,
+ pos >> PAGE_SHIFT,
+ (end - 1) >> PAGE_SHIFT);
+   }
+
while (pos < end) {
unsigned offset = pos & (PAGE_SIZE - 1);
struct blk_dax_ctl dax = { 0 };
@@ -1469,23 +1492,6 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iov_iter_rw(iter) == WRITE)
flags |= IOMAP_WRITE;
 
-   /*
-* Yes, even DAX files can have page cache attached to them:  A zeroed
-* page is inserted into the pagecache when we have to serve a write
-* fault on a hole.  It should never be dirtied and can simply be
-* dropped from the pagecache once we get real data for the page.
-*
-* XXX: This is racy against mmap, and there's nothing we can do about
-* it. We'll eventually need to shift this down even further so that
-* we can check if we allocated blocks over a hole first.
-*/
-   if (mapping->nrpages) {
-   ret = invalidate_inode_pages2_range(mapping,
-   pos >> PAGE_SHIFT,
-   (pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
-   WARN_ON_ONCE(ret);
-   }
-
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
iter, iomap_dax_actor);
-- 
2.6.6



[PATCH 1/6] dax: Do not warn about BH_New buffers

2016-09-27 Thread Jan Kara
Filesystems will return BH_New buffers to dax code to indicate freshly
allocated blocks which will then trigger synchronization of file
mappings in page tables with actual block mappings. So do not warn about
returned BH_New buffers.

Signed-off-by: Jan Kara 
---
 fs/dax.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 233f548d298e..1542653e8aa1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -158,8 +158,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
.addr = ERR_PTR(-EIO),
};
unsigned blkbits = inode->i_blkbits;
-   sector_t file_blks = (i_size_read(inode) + (1 << blkbits) - 1)
-   >> blkbits;
 
if (rw == READ)
end = min(end, i_size_read(inode));
@@ -186,9 +184,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
 * We allow uninitialized buffers for writes
 * beyond EOF as those cannot race with faults
 */
-   WARN_ON_ONCE(
-   (buffer_new(bh) && block < file_blks) ||
-   (rw == WRITE && buffer_unwritten(bh)));
+   WARN_ON_ONCE(rw == WRITE &&
+buffer_unwritten(bh));
} else {
unsigned done = bh->b_size -
(bh_max - (pos - first));
@@ -985,7 +982,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault 
*vmf,
}
 
/* Filesystem should not return unwritten buffers to us! */
-   WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+   WARN_ON_ONCE(buffer_unwritten(&bh));
error = dax_insert_mapping(mapping, bh.b_bdev, to_sector(&bh, inode),
bh.b_size, &entry, vma, vmf);
  unlock_entry:
@@ -1094,7 +1091,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned 
long address,
if (get_block(inode, block, &bh, 1) != 0)
return VM_FAULT_SIGBUS;
alloc = true;
-   WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+   WARN_ON_ONCE(buffer_unwritten(&bh));
}
 
bdev = bh.b_bdev;
-- 
2.6.6



[PATCH 0/6] dax: Page invalidation fixes

2016-09-27 Thread Jan Kara
Hello,

these patches fix races when invalidating hole pages in DAX mappings. See
changelogs for details. The series is based on my patches to write-protect
DAX PTEs because we really need to closely track dirtiness (and cleanness!)
of radix tree entries in DAX mappings in order to avoid discarding valid
dirty bits leading to missed cache flushes on fsync(2).

Honza


[PATCH 3/6] ext4: Remove clearing of BH_New bit for zeroed blocks

2016-09-27 Thread Jan Kara
So far we did not return BH_New buffers from ext4_dax_get_block()
because that would trigger racy zeroing in DAX code. This zeroing is
gone these days so we can remove the workaround.

Signed-off-by: Jan Kara 
---
 fs/ext4/inode.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 87150122d361..7ccd6fd7819d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3298,11 +3298,6 @@ int ext4_dax_get_block(struct inode *inode, sector_t 
iblock,
if (ret < 0)
return ret;
}
-   /*
-* At least for now we have to clear BH_New so that DAX code
-* doesn't attempt to zero blocks again in a racy way.
-*/
-   clear_buffer_new(bh_result);
return 0;
 }
 #else
-- 
2.6.6

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 2/6] ext2: Return BH_New buffers for zeroed blocks

2016-09-27 Thread Jan Kara
So far we did not return BH_New buffers from ext2_get_blocks() when we
allocated and zeroed-out a block for DAX inode to avoid racy zeroing in
DAX code. This zeroing is gone these days so we can remove the
workaround.

Signed-off-by: Jan Kara 
---
 fs/ext2/inode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f6312c153731..ac8edbd4af74 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -754,9 +754,8 @@ static int ext2_get_blocks(struct inode *inode,
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
}
-   } else {
-   *new = true;
}
+   *new = true;
 
ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
mutex_unlock(&ei->truncate_mutex);
-- 
2.6.6

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()

2016-09-27 Thread Christoph Hellwig
On Tue, Sep 27, 2016 at 06:43:33PM +0200, Jan Kara wrote:
> So far we did not set BH_New for newly allocated blocks for DAX inodes
> in __xfs_get_blocks() because we wanted to avoid zeroing done in generic
> DAX code which was racy. Now the zeroing is gone so we can remove this
> workaround and return BH_New for newly allocated blocks. DAX will use this
> information to properly update mappings of the file.

__xfs_get_blocks isn't used by the DAX code any more.
xfs_file_iomap_begin should already be doing the right thing for now.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 03/20] mm: Use pgoff in struct vm_fault instead of passing it separately

2016-09-27 Thread Jan Kara
struct vm_fault has already pgoff entry. Use it instead of passing pgoff
as a separate argument and then assigning it later.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 35 ++-
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 447a1ef4a9e3..4c2ec9a9d8af 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2275,7 +2275,7 @@ static int wp_pfn_shared(struct vm_fault *vmf, pte_t 
orig_pte)
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
struct vm_fault vmf2 = {
.page = NULL,
-   .pgoff = linear_page_index(vma, vmf->address),
+   .pgoff = vmf->pgoff,
.virtual_address = vmf->address & PAGE_MASK,
.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
};
@@ -2844,15 +2844,15 @@ oom:
  * released depending on flags and vma->vm_ops->fault() return value.
  * See filemap_fault() and __lock_page_retry().
  */
-static int __do_fault(struct vm_fault *vmf, pgoff_t pgoff,
-   struct page *cow_page, struct page **page, void **entry)
+static int __do_fault(struct vm_fault *vmf, struct page *cow_page,
+ struct page **page, void **entry)
 {
struct vm_area_struct *vma = vmf->vma;
struct vm_fault vmf2;
int ret;
 
vmf2.virtual_address = vmf->address & PAGE_MASK;
-   vmf2.pgoff = pgoff;
+   vmf2.pgoff = vmf->pgoff;
vmf2.flags = vmf->flags;
vmf2.page = NULL;
vmf2.gfp_mask = __get_fault_gfp_mask(vma);
@@ -3111,9 +3111,10 @@ late_initcall(fault_around_debugfs);
  * fault_around_pages() value (and therefore to page order).  This way it's
  * easier to guarantee that we don't cross page table boundaries.
  */
-static int do_fault_around(struct vm_fault *vmf, pgoff_t start_pgoff)
+static int do_fault_around(struct vm_fault *vmf)
 {
unsigned long address = vmf->address, nr_pages, mask;
+   pgoff_t start_pgoff = vmf->pgoff;
pgoff_t end_pgoff;
int off, ret = 0;
 
@@ -3171,7 +3172,7 @@ out:
return ret;
 }
 
-static int do_read_fault(struct vm_fault *vmf, pgoff_t pgoff)
+static int do_read_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *fault_page;
@@ -3183,12 +3184,12 @@ static int do_read_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
 * something).
 */
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-   ret = do_fault_around(vmf, pgoff);
+   ret = do_fault_around(vmf);
if (ret)
return ret;
}
 
-   ret = __do_fault(vmf, pgoff, NULL, &fault_page, NULL);
+   ret = __do_fault(vmf, NULL, &fault_page, NULL);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
@@ -3201,7 +3202,7 @@ static int do_read_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
return ret;
 }
 
-static int do_cow_fault(struct vm_fault *vmf, pgoff_t pgoff)
+static int do_cow_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *fault_page, *new_page;
@@ -3222,7 +3223,7 @@ static int do_cow_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
return VM_FAULT_OOM;
}
 
-   ret = __do_fault(vmf, pgoff, new_page, &fault_page, &fault_entry);
+   ret = __do_fault(vmf, new_page, &fault_page, &fault_entry);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
 
@@ -3237,7 +3238,7 @@ static int do_cow_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
unlock_page(fault_page);
put_page(fault_page);
} else {
-   dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
+   dax_unlock_mapping_entry(vma->vm_file->f_mapping, vmf->pgoff);
}
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
@@ -3248,7 +3249,7 @@ uncharge_out:
return ret;
 }
 
-static int do_shared_fault(struct vm_fault *vmf, pgoff_t pgoff)
+static int do_shared_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *fault_page;
@@ -3256,7 +3257,7 @@ static int do_shared_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
int dirtied = 0;
int ret, tmp;
 
-   ret = __do_fault(vmf, pgoff, NULL, &fault_page, NULL);
+   ret = __do_fault(vmf, NULL, &fault_page, NULL);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
@@ -3317,16 +3318,15 @@ static int do_shared_fault(struct vm_fault *vmf, 
pgoff_t pgoff)
 static int do_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   pgoff_t pgoff = linear_page_index(vma, vmf->address);
 
/* The VMA was not fully populated on 
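
For reference, a condensed sketch of the calling convention this patch
(together with the next one in the series) moves toward; example_handle_fault()
is a made-up wrapper mirroring __handle_mm_fault(), and do_read_fault() is
internal to mm/memory.c, so this is an illustration rather than buildable code:

#include <linux/mm.h>
#include <linux/pagemap.h>

static int example_handle_fault(struct vm_area_struct *vma,
				unsigned long address, unsigned int flags)
{
	struct vm_fault vmf = {
		.vma     = vma,
		.address = address,
		.flags   = flags,
		/* pgoff is computed once, here, instead of in every helper */
		.pgoff   = linear_page_index(vma, address),
	};

	/* helpers such as do_read_fault() now read vmf.pgoff internally */
	return do_read_fault(&vmf);
}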

[PATCH 19/20] dax: Protect PTE modification on WP fault by radix tree entry lock

2016-09-27 Thread Jan Kara
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite()
has released corresponding radix tree entry lock. When we want to
writeprotect PTE on cache flush, we need PTE modification to happen
under radix tree entry lock to ensure consisten updates of PTE and radix
tree (standard faults use page lock to ensure this consistency). So move
update of PTE bit into dax_pfn_mkwrite().

Signed-off-by: Jan Kara 
---
 fs/dax.c| 22 --
 mm/memory.c |  2 +-
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c6cadf8413a3..a2d3781c9f4e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1163,17 +1163,27 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
 {
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
-   void *entry;
+   void *entry, **slot;
pgoff_t index = vmf->pgoff;
 
spin_lock_irq(&mapping->tree_lock);
-   entry = get_unlocked_mapping_entry(mapping, index, NULL);
-   if (!entry || !radix_tree_exceptional_entry(entry))
-   goto out;
+   entry = get_unlocked_mapping_entry(mapping, index, &slot);
+   if (!entry || !radix_tree_exceptional_entry(entry)) {
+   if (entry)
+   put_unlocked_mapping_entry(mapping, index, entry);
+   spin_unlock_irq(&mapping->tree_lock);
+   return VM_FAULT_NOPAGE;
+   }
radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
-   put_unlocked_mapping_entry(mapping, index, entry);
-out:
+   entry = lock_slot(mapping, slot);
spin_unlock_irq(&mapping->tree_lock);
+   /*
+* If we race with somebody updating the PTE and finish_mkwrite_fault()
+* fails, we don't care. We need to return VM_FAULT_NOPAGE and retry
+* the fault in either case.
+*/
+   finish_mkwrite_fault(vmf);
+   put_locked_mapping_entry(mapping, index, entry);
return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/mm/memory.c b/mm/memory.c
index e7a4a30a5e88..5fa3d0c5196e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2310,7 +2310,7 @@ static int wp_pfn_shared(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
vmf->flags |= FAULT_FLAG_MKWRITE;
ret = vma->vm_ops->pfn_mkwrite(vma, vmf);
-   if (ret & VM_FAULT_ERROR)
+   if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))
return ret;
return finish_mkwrite_fault(vmf);
}
-- 
2.6.6

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 07/20] mm: Add orig_pte field into vm_fault

2016-09-27 Thread Jan Kara
Add orig_pte field to vm_fault structure to allow ->page_mkwrite
handlers to fully handle the fault. This also allows us to save some
passing of extra arguments around.

Signed-off-by: Jan Kara 
---
 include/linux/mm.h |  4 +--
 mm/internal.h  |  2 +-
 mm/khugepaged.c|  5 ++--
 mm/memory.c| 76 +++---
 4 files changed, 44 insertions(+), 43 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5fc6daf5242c..c908fd7243ea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -300,8 +300,8 @@ struct vm_fault {
unsigned long virtual_address;  /* Faulting virtual address masked by
 * PAGE_MASK */
pmd_t *pmd; /* Pointer to pmd entry matching
-* the 'address'
-*/
+* the 'address' */
+   pte_t orig_pte; /* Value of PTE at the time of fault */
 
struct page *cow_page;  /* Handler may choose to COW */
struct page *page;  /* ->fault handlers should return a
diff --git a/mm/internal.h b/mm/internal.h
index cc80060914f6..7c7421da5d63 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -36,7 +36,7 @@
 /* Do not use these with a slab allocator */
 #define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
 
-int do_swap_page(struct vm_fault *vmf, pte_t orig_pte);
+int do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f88b2d3810a7..66bc77f2d1d2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -890,11 +890,12 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
*mm,
vmf.pte = pte_offset_map(pmd, address);
for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
vmf.pte++, vmf.address += PAGE_SIZE) {
-   pteval = *vmf.pte;
+   vmf.orig_pte = *vmf.pte;
+   pteval = vmf.orig_pte;
if (!is_swap_pte(pteval))
continue;
swapped_in++;
-   ret = do_swap_page(&vmf, pteval);
+   ret = do_swap_page(&vmf);
 
/* do_swap_page returns VM_FAULT_RETRY with released mmap_sem */
if (ret & VM_FAULT_RETRY) {
diff --git a/mm/memory.c b/mm/memory.c
index 48de8187d7b2..0c8779c23925 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2070,8 +2070,8 @@ static int do_page_mkwrite(struct vm_area_struct *vma, 
struct page *page,
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
  */
-static inline int wp_page_reuse(struct vm_fault *vmf, pte_t orig_pte,
-   struct page *page, int page_mkwrite, int dirty_shared)
+static inline int wp_page_reuse(struct vm_fault *vmf, struct page *page,
+   int page_mkwrite, int dirty_shared)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
@@ -2084,8 +2084,8 @@ static inline int wp_page_reuse(struct vm_fault *vmf, 
pte_t orig_pte,
if (page)
page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
 
-   flush_cache_page(vma, vmf->address, pte_pfn(orig_pte));
-   entry = pte_mkyoung(orig_pte);
+   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
+   entry = pte_mkyoung(vmf->orig_pte);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
@@ -2135,8 +2135,7 @@ static inline int wp_page_reuse(struct vm_fault *vmf, 
pte_t orig_pte,
  *   held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old 
page.
  */
-static int wp_page_copy(struct vm_fault *vmf, pte_t orig_pte,
-   struct page *old_page)
+static int wp_page_copy(struct vm_fault *vmf, struct page *old_page)
 {
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
@@ -2150,7 +2149,7 @@ static int wp_page_copy(struct vm_fault *vmf, pte_t 
orig_pte,
if (unlikely(anon_vma_prepare(vma)))
goto oom;
 
-   if (is_zero_pfn(pte_pfn(orig_pte))) {
+   if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma,
  vmf->address);
if (!new_page)
@@ -2174,7 +2173,7 @@ static int wp_page_copy(struct vm_fault *vmf, pte_t 
orig_pte,
 * Re-check the pte - we dropped the lock
 */
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
-   if 

[PATCH 04/20] mm: Use passed vm_fault structure in __do_fault()

2016-09-27 Thread Jan Kara
Instead of creating another vm_fault structure, use the one passed to
__do_fault() for passing arguments into fault handler.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 26 +++---
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4c2ec9a9d8af..b7f1f535e079 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,37 +2848,31 @@ static int __do_fault(struct vm_fault *vmf, struct page 
*cow_page,
  struct page **page, void **entry)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct vm_fault vmf2;
int ret;
 
-   vmf2.virtual_address = vmf->address & PAGE_MASK;
-   vmf2.pgoff = vmf->pgoff;
-   vmf2.flags = vmf->flags;
-   vmf2.page = NULL;
-   vmf2.gfp_mask = __get_fault_gfp_mask(vma);
-   vmf2.cow_page = cow_page;
+   vmf->cow_page = cow_page;
 
-   ret = vma->vm_ops->fault(vma, &vmf2);
+   ret = vma->vm_ops->fault(vma, vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
if (ret & VM_FAULT_DAX_LOCKED) {
-   *entry = vmf2.entry;
+   *entry = vmf->entry;
return ret;
}
 
-   if (unlikely(PageHWPoison(vmf2.page))) {
+   if (unlikely(PageHWPoison(vmf->page))) {
if (ret & VM_FAULT_LOCKED)
-   unlock_page(vmf2.page);
-   put_page(vmf2.page);
+   unlock_page(vmf->page);
+   put_page(vmf->page);
return VM_FAULT_HWPOISON;
}
 
if (unlikely(!(ret & VM_FAULT_LOCKED)))
-   lock_page(vmf2.page);
+   lock_page(vmf->page);
else
-   VM_BUG_ON_PAGE(!PageLocked(vmf2.page), vmf2.page);
+   VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
 
-   *page = vmf2.page;
+   *page = vmf->page;
return ret;
 }
 
@@ -3573,8 +3567,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
struct vm_fault vmf = {
.vma = vma,
.address = address,
+   .virtual_address = address & PAGE_MASK,
.flags = flags,
.pgoff = linear_page_index(vma, address),
+   .gfp_mask = __get_fault_gfp_mask(vma),
};
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
-- 
2.6.6

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 4/6] xfs: Set BH_New for allocated DAX blocks in __xfs_get_blocks()

2016-09-27 Thread Jan Kara
On Tue 27-09-16 10:01:18, Christoph Hellwig wrote:
> On Tue, Sep 27, 2016 at 06:43:33PM +0200, Jan Kara wrote:
> > So far we did not set BH_New for newly allocated blocks for DAX inodes
> > in __xfs_get_blocks() because we wanted to avoid zeroing done in generic
> > DAX code which was racy. Now the zeroing is gone so we can remove this
> > workaround and return BH_New for newly allocated blocks. DAX will use this
> > information to properly update mappings of the file.
> 
> __xfs_get_blocks isn't used by the DAX code any more.
> xfs_file_iomap_begin should already be doing the right thing for now.

OK, the changelog is stale but I actually took care to integrate this with
your iomap patches and for the new invalidation code in iomap_dax_actor()
to work we need this additional information...

Honza
-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v3 07/11] dax: coordinate locking for offsets in PMD range

2016-09-27 Thread Ross Zwisler
DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.

Signed-off-by: Ross Zwisler 
---
 fs/dax.c| 37 -
 include/linux/dax.h |  2 +-
 mm/filemap.c|  2 +-
 3 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index baef586..406feea 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,10 +64,17 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
+static pgoff_t dax_entry_start(pgoff_t index, void *entry)
+{
+   if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+   index &= (PMD_MASK >> PAGE_SHIFT);
+   return index;
+}
+
 static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
- pgoff_t index)
+ pgoff_t entry_start)
 {
-   unsigned long hash = hash_long((unsigned long)mapping ^ index,
+   unsigned long hash = hash_long((unsigned long)mapping ^ entry_start,
   DAX_WAIT_TABLE_BITS);
return wait_table + hash;
 }
@@ -285,7 +292,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  */
 struct exceptional_entry_key {
struct address_space *mapping;
-   unsigned long index;
+   pgoff_t entry_start;
 };
 
 struct wait_exceptional_entry_queue {
@@ -301,7 +308,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, 
unsigned int mode,
container_of(wait, struct wait_exceptional_entry_queue, wait);
 
if (key->mapping != ewait->key.mapping ||
-   key->index != ewait->key.index)
+   key->entry_start != ewait->key.entry_start)
return 0;
return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -359,12 +366,10 @@ static void *get_unlocked_mapping_entry(struct 
address_space *mapping,
 {
void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
-   wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+   wait_queue_head_t *wq;
 
init_wait(&ewait.wait);
ewait.wait.func = wake_exceptional_entry_func;
-   ewait.key.mapping = mapping;
-   ewait.key.index = index;
 
for (;;) {
entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +380,11 @@ static void *get_unlocked_mapping_entry(struct 
address_space *mapping,
*slotp = slot;
return entry;
}
+
+   wq = dax_entry_waitqueue(mapping,
+   dax_entry_start(index, entry));
+   ewait.key.mapping = mapping;
+   ewait.key.entry_start = dax_entry_start(index, entry);
prepare_to_wait_exclusive(wq, &ewait.wait,
  TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&mapping->tree_lock);
@@ -447,10 +457,11 @@ restart:
return entry;
 }
 
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
+void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
   pgoff_t index, bool wake_all)
 {
-   wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+   wait_queue_head_t *wq = dax_entry_waitqueue(mapping,
+   dax_entry_start(index, entry));
 
/*
 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -462,7 +473,7 @@ void dax_wake_mapping_entry_waiter(struct address_space 
*mapping,
struct exceptional_entry_key key;
 
key.mapping = mapping;
-   key.index = index;
+   key.entry_start = dax_entry_start(index, entry);
__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
}
 }
@@ -480,7 +491,7 @@ void dax_unlock_mapping_entry(struct address_space 
*mapping, pgoff_t index)
}
unlock_slot(mapping, slot);
spin_unlock_irq(&mapping->tree_lock);
-   dax_wake_mapping_entry_waiter(mapping, index, false);
+   dax_wake_mapping_entry_waiter(entry, mapping, index, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -505,7 +516,7 @@ static void put_unlocked_mapping_entry(struct address_space 
*mapping,
return;
 
/* We have to wake up next waiter for the radix tree entry lock */
-   dax_wake_mapping_entry_waiter(mapping, index, false);
+   dax_wake_mapping_entry_waiter(entry, mapping, index, false);
 }
 
 /*
@@ -532,7 +543,7 @@ int 
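
To make the "lock on the start of the PMD range" point concrete, a small
standalone demonstration of the index masking (assuming x86-64's 4k pages and
2MiB PMDs; PAGE_SHIFT and PMD_SHIFT are hard-coded here rather than taken from
kernel headers, and entry_start() only mirrors dax_entry_start()):

#include <stdio.h>

#define PAGE_SHIFT 12				/* 4k pages (assumed)  */
#define PMD_SHIFT  21				/* 2MiB PMDs (assumed) */
#define PMD_MASK   (~((1UL << PMD_SHIFT) - 1))

/* For a PMD entry, every index inside the 2MiB range collapses to the
 * index of the range's first page, so they all hash to one lock bit. */
static unsigned long entry_start(unsigned long index, int is_pmd)
{
	if (is_pmd)
		index &= (PMD_MASK >> PAGE_SHIFT);
	return index;
}

int main(void)
{
	/* Indices 0x200-0x3ff share one lock when a PMD entry is present;
	 * as a plain PTE entry, index 0x203 keys on itself. */
	printf("%#lx %#lx\n", entry_start(0x203, 1), entry_start(0x203, 0));
	return 0;	/* prints: 0x200 0x203 */
}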

[PATCH v3 04/11] ext2: remove support for DAX PMD faults

2016-09-27 Thread Ross Zwisler
DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler 
---
 fs/ext2/file.c | 24 +---
 1 file changed, 1 insertion(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca363d..d5af6d2 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, 
struct vm_fault *vmf)
return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-   pmd_t *pmd, unsigned int flags)
-{
-   struct inode *inode = file_inode(vma->vm_file);
-   struct ext2_inode_info *ei = EXT2_I(inode);
-   int ret;
-
-   if (flags & FAULT_FLAG_WRITE) {
-   sb_start_pagefault(inode->i_sb);
-   file_update_time(vma->vm_file);
-   }
-   down_read(&ei->dax_sem);
-
-   ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
-   up_read(&ei->dax_sem);
-   if (flags & FAULT_FLAG_WRITE)
-   sb_end_pagefault(inode->i_sb);
-   return ret;
-}
-
 static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
struct vm_fault *vmf)
 {
@@ -154,7 +133,6 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
.fault  = ext2_dax_fault,
-   .pmd_fault  = ext2_dax_pmd_fault,
.page_mkwrite   = ext2_dax_fault,
.pfn_mkwrite= ext2_dax_pfn_mkwrite,
 };
@@ -166,7 +144,7 @@ static int ext2_file_mmap(struct file *file, struct 
vm_area_struct *vma)
 
file_accessed(file);
vma->vm_ops = &ext2_dax_vm_ops;
-   vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+   vma->vm_flags |= VM_MIXEDMAP;
return 0;
 }
 #else
-- 
2.7.4

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v3 03/11] dax: remove buffer_size_valid()

2016-09-27 Thread Ross Zwisler
Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 22 +-
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-   return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter 
*iter,
rc = get_block(inode, block, bh, rw == WRITE);
if (rc)
break;
-   if (!buffer_size_valid(bh))
-   bh->b_size = 1 << blkbits;
bh_max = pos - first + bh->b_size;
bdev = bh->b_bdev;
/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned 
long address,
 
bdev = bh.b_bdev;
 
-   /*
-* If the filesystem isn't willing to tell us the length of a hole,
-* just fall back to PTEs.  Calling get_block 512 times in a loop
-* would be silly.
-*/
-   if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+   if (bh.b_size < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "allocated block too small");
return VM_FAULT_FALLBACK;
}
-- 
2.7.4

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
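
For context, a sketch of the calling convention dax_pmd_fault() relies on
after this change (closely following the hunk above; example_pmd_mapping() is
a made-up name and this is not a complete fault handler): the caller zeroes
the buffer_head, asks for up to PMD_SIZE, and then trusts whatever b_size the
filesystem hands back.

#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/string.h>

static int example_pmd_mapping(struct inode *inode, sector_t block,
			       get_block_t get_block)
{
	struct buffer_head bh;

	memset(&bh, 0, sizeof(bh));
	bh.b_bdev = inode->i_sb->s_bdev;
	bh.b_size = PMD_SIZE;			/* ask for up to 2MiB      */

	if (get_block(inode, block, &bh, 0) != 0)
		return VM_FAULT_SIGBUS;
	if (bh.b_size < PMD_SIZE)		/* fs mapped less: fall back */
		return VM_FAULT_FALLBACK;
	return 0;				/* proceed with the PMD map  */
}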


[PATCH v3 05/11] dax: make 'wait_table' global variable static

2016-09-27 Thread Ross Zwisler
The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be 
static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9b9be8a..ac28cdf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v3 11/11] dax: remove "depends on BROKEN" from FS_DAX_PMD

2016-09-27 Thread Ross Zwisler
Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler 
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
depends on FS_DAX
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
-   depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v3 01/11] ext4: allow DAX writeback for hole punch

2016-09-27 Thread Ross Zwisler
Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Cc: 
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode 
*inode, loff_t offset,
 }
 
 /*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
  *
  * @inode:  File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, 
loff_t length)
 * Write out all dirty pages to avoid race conditions
 * Then release them.
 */
-   if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+   if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
ret = filemap_write_and_wait_range(mapping, offset,
   offset + length - 1);
if (ret)
-- 
2.7.4

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v3 10/11] xfs: use struct iomap based DAX PMD fault path

2016-09-27 Thread Ross Zwisler
Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved iomap_dax_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler 
---
 fs/xfs/xfs_aops.c | 25 +
 fs/xfs/xfs_aops.h |  3 ---
 fs/xfs/xfs_file.c |  2 +-
 3 files changed, 6 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4a28fa9..39c754f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1170,8 +1170,7 @@ __xfs_get_blocks(
sector_t iblock,
struct buffer_head  *bh_result,
int create,
-   bool direct,
-   bool dax_fault)
+   bool direct)
 {
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
@@ -1265,12 +1264,8 @@ __xfs_get_blocks(
if (ISUNWRITTEN(&imap))
set_buffer_unwritten(bh_result);
/* direct IO needs special help */
-   if (create) {
-   if (dax_fault)
-   ASSERT(!ISUNWRITTEN(&imap));
-   else
-   xfs_map_direct(inode, bh_result, &imap, offset);
-   }
+   if (create)
+   xfs_map_direct(inode, bh_result, &imap, offset);
}
 
/*
@@ -1310,7 +1305,7 @@ xfs_get_blocks(
struct buffer_head  *bh_result,
int create)
 {
-   return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+   return __xfs_get_blocks(inode, iblock, bh_result, create, false);
 }
 
 int
@@ -1320,17 +1315,7 @@ xfs_get_blocks_direct(
struct buffer_head  *bh_result,
int create)
 {
-   return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
-   struct inode *inode,
-   sector_t iblock,
-   struct buffer_head  *bh_result,
-   int create)
-{
-   return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+   return __xfs_get_blocks(inode, iblock, bh_result, create, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 1950e3b..6779e9d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -57,9 +57,6 @@ int   xfs_get_blocks(struct inode *inode, sector_t offset,
   struct buffer_head *map_bh, int create);
 int xfs_get_blocks_direct(struct inode *inode, sector_t offset,
  struct buffer_head *map_bh, int create);
-int xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
-struct buffer_head *map_bh, int create);
-
 int xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private);
 int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 882f264..e86b2be 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1539,7 +1539,7 @@ xfs_filemap_pmd_fault(
}
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-   ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+   ret = iomap_dax_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
if (flags & FAULT_FLAG_WRITE)
-- 
2.7.4

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v3 09/11] dax: add struct iomap based DAX PMD support

2016-09-27 Thread Ross Zwisler
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler 
---
 fs/dax.c| 396 ++--
 include/linux/dax.h |  29 +++-
 mm/filemap.c|   4 +-
 3 files changed, 380 insertions(+), 49 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b5e7b13..13934d7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include 
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-   RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-   RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -400,19 +386,52 @@ static void *get_unlocked_mapping_entry(struct 
address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with type RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  Also, a 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+   unsigned long new_type)
 {
+   bool pmd_downgrade = false; /* splitting 2MiB entry