Re: [PATCH v2 1/3] introduce memcpy_nocache()

2016-11-01 Thread Boaz Harrosh
On 10/28/2016 04:54 AM, Boylston, Brian wrote:
> Boaz Harrosh wrote on 2016-10-26:
>> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>>> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
>>> if possible.  Without arch-specific support, this defaults to just
>>> memcpy().  For now, include arch-specific support for x86.
>>>
>>> Cc: Ross Zwisler 
>>> Cc: Thomas Gleixner 
>>> Cc: Ingo Molnar 
>>> Cc: "H. Peter Anvin" 
>>> Cc: 
>>> Cc: Al Viro 
>>> Cc: Dan Williams 
>>> Signed-off-by: Brian Boylston 
>>> Reviewed-by: Toshi Kani 
>>> Reported-by: Oliver Moreno 
>>> ---
>>>  arch/x86/include/asm/string_32.h |  3 +++
>>>  arch/x86/include/asm/string_64.h |  3 +++
>>>  arch/x86/lib/misc.c  | 12 
>>>  include/linux/string.h   | 15 +++
>>>  4 files changed, 33 insertions(+)
>>> diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
>>> index 3d3e835..64f80c0 100644
>>> --- a/arch/x86/include/asm/string_32.h
>>> +++ b/arch/x86/include/asm/string_32.h
>>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
>>>
>>>  #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>>  #define __HAVE_ARCH_MEMMOVE
>>>  void *memmove(void *dest, const void *src, size_t n);
>>> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
>>> index 90dbbd9..a8fdd55 100644
>>> --- a/arch/x86/include/asm/string_64.h
>>> +++ b/arch/x86/include/asm/string_64.h
>>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
>>>  #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>>>  #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>>  #define __HAVE_ARCH_MEMSET
>>>  void *memset(void *s, int c, size_t n);
>>>  void *__memset(void *s, int c, size_t n);
>>> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
>>> index 76b373a..c993ab3 100644
>>> --- a/arch/x86/lib/misc.c
>>> +++ b/arch/x86/lib/misc.c
>>> @@ -1,3 +1,6 @@
>>> +#include 
>>> +#include 
>>> +
>>>  /*
>>>   * Count the digits of @val including a possible sign.
>>>   *
>>> @@ -19,3 +22,12 @@ int num_digits(int val)
>>> }
>>> return d;
>>>  }
>>> +
>>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>>> +{
>>> +   __copy_from_user_inatomic_nocache(dest, src, count);
>>> +   return dest;
>>> +}
>>> +EXPORT_SYMBOL(memcpy_nocache);
>>> +#endif
>>> diff --git a/include/linux/string.h b/include/linux/string.h
>>> index 26b6f6a..7f40c41 100644
>>> --- a/include/linux/string.h
>>> +++ b/include/linux/string.h
>>> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>>>  #ifndef __HAVE_ARCH_MEMCPY
>>>  extern void * memcpy(void *,const void *,__kernel_size_t);
>>>  #endif
>>> +
>>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +/**
>>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>>> + * processor cache if possible
>>> + * @dest: Where to copy to
>>> + * @src: Where to copy from
>>> + * @count: The size of the area.
>>> + */
>>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
>>> +{
>>> +   return memcpy(dest, src, count);
>>> +}
>>
>> What about memcpy_to_pmem() in linux/pmem.h? It already has all the arch
>> switches.
>>
>> Feels bad to add yet another arch switch over __copy_user_nocache.
>>
>> Just feels like too many things that do the same thing. Sigh.
> 
> I agree that this looks like a nicer path.
> 
> I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
> but lib/iov_iter.c doesn't currently #include linux/pmem.h.  Would it be
> acceptable to add it?  Also, I wasn't sure if memcpy_to_pmem() would always
> mean exactly "memcpy nocache".
> 

I think this is the way to go. In my opinion there is no reason not to
include pmem.h in lib/iov_iter.c.

And I think memcpy_to_pmem() would always be the fastest arch way to bypass
the cache, so it should be safe to use for all cases. It is so in the arches
that support this now, and I cannot imagine a theoretical arch that would
differ. But let the specific arch people holler if this steps on their toes,
later when they care about this at all.
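
For reference, the arch-switch pattern already in linux/pmem.h looks roughly
like the following. This is a paraphrased sketch, not the verbatim header,
and the arch hook shown is simplified; the point is only that the
generic-versus-arch dispatch the proposed memcpy_nocache() adds already
exists there:

#ifdef CONFIG_ARCH_HAS_PMEM_API
static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
{
	/* on x86 this ends up in the movnt-based non-temporal copy */
	arch_memcpy_to_pmem(dst, src, n);
}
#else
static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
{
	/* cached fallback; the caller must flush/sync separately */
	memcpy(dst, src, n);
}
#endif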

> I had also considered adjusting copy_from_iter_pmem() (also in linux/pmem.h)
> to just use memcpy_to_pmem() directly, but then it can't use the goodness
> that is the iterate_and_advance() macro in iov_iter.c.
> 

No, please keep all these local to iov_iter.c. Any future changes should be
local to it and fixed in one place.

> So, I took a shot with a possibly ill-fated memcpy_nocache().  Thoughts on
> either of the above two?  Are these even in line with what you were thinking?
> 

Yes thanks again for pushing this. I think it is important. CC me on patches
and I wi

[PATCH v9 00/16] re-enable DAX PMD support

2016-11-01 Thread Ross Zwisler
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Previously we had talked about this series going through the XFS tree, but
Jan has a patch set that will need to build on this series and it heavily
modifies the MM code.  I think he would prefer that series to go through
Andrew Morton's -MM tree, so it probably makes sense for this series to go
through that same tree.

For reference, here is the series from Jan that I was talking about:
https://marc.info/?l=linux-mm&m=147499252322902&w=2

Andrew, can you please pick this up for the v4.10 merge window?
This series is currently based on v4.9-rc3.  I tried to rebase onto a -mm
branch or tag, but couldn't find one that contained the DAX iomap changes
that were merged as part of the v4.9 merge window.  I'm happy to rebase &
test on a v4.9-rc* based -MM branch or tag whenever they are available.

Changes since v8:
- Rebased onto v4.9-rc3.
- Updated the DAX PMD fault path so that on fallback we always check to see
  if we are dealing with a transparent huge page, and if we are we will
  split it.  This was already happening for one of the fallback cases via a
  patch from Toshi, and Jan hit a deadlock in another fallback case where
  the same splitting was needed.  (Jan & Toshi)

This series has passed all my xfstests testing, including the test that was
hitting the deadlock with v8.

Here is a tree containing my changes:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v9

Ross Zwisler (16):
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  dax: make 'wait_table' global variable static
  dax: remove the last BUG_ON() from fs/dax.c
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add dax_iomap_sector() helper function
  dax: dax_iomap_fault() needs to call iomap_end()
  dax: move RADIX_DAX_* defines to dax.h
  dax: move put_(un)locked_mapping_entry() in dax.c
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig  |   1 -
 fs/dax.c| 826 +---
 fs/ext2/file.c  |  35 +--
 fs/ext4/inode.c |   3 +
 fs/xfs/xfs_aops.c   |  26 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  58 +++-
 mm/filemap.c|   5 +-
 9 files changed, 537 insertions(+), 430 deletions(-)

-- 
2.7.4



[PATCH v9 01/16] ext4: tell DAX the size of allocation holes

2016-11-01 Thread Ross Zwisler
When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is currently worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9c06472..3d58b2b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -767,6 +767,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
ext4_update_bh_state(bh, map.m_flags);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
ret = 0;
+   } else if (ret == 0) {
+   /* hole case, need to fill in bh->b_size */
+   bh->b_size = inode->i_sb->s_blocksize * map.m_len;
}
return ret;
 }
-- 
2.7.4



[PATCH v9 07/16] dax: coordinate locking for offsets in PMD range

2016-11-01 Thread Ross Zwisler
DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.
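
As a concrete illustration (example numbers only, assuming 4 KiB pages and
2 MiB PMDs, so PMD_SHIFT - PAGE_SHIFT == 9):

	pgoff_t index = 0x203;	/* falls inside the PMD entry starting at 0x200 */
	index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);	/* index == 0x200 */

so every offset in 0x200..0x3ff hashes with entry_start == 0x200 and shares
one bit lock.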

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 65 +
 include/linux/dax.h |  2 +-
 mm/filemap.c|  2 +-
 3 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 835e7f0..7238702 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,14 +64,6 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
- pgoff_t index)
-{
-   unsigned long hash = hash_long((unsigned long)mapping ^ index,
-  DAX_WAIT_TABLE_BITS);
-   return wait_table + hash;
-}
-
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
struct request_queue *q = bdev->bd_queue;
@@ -285,7 +277,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  */
 struct exceptional_entry_key {
struct address_space *mapping;
-   unsigned long index;
+   pgoff_t entry_start;
 };
 
 struct wait_exceptional_entry_queue {
@@ -293,6 +285,26 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
 };
 
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+   pgoff_t index, void *entry, struct exceptional_entry_key *key)
+{
+   unsigned long hash;
+
+   /*
+* If 'entry' is a PMD, align the 'index' that we use for the wait
+* queue to the start of that PMD.  This ensures that all offsets in
+* the range covered by the PMD map to the same bit lock.
+*/
+   if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+   index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
+
+   key->mapping = mapping;
+   key->entry_start = index;
+
+   hash = hash_long((unsigned long)mapping ^ index, DAX_WAIT_TABLE_BITS);
+   return wait_table + hash;
+}
+
 static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
   int sync, void *keyp)
 {
@@ -301,7 +313,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
container_of(wait, struct wait_exceptional_entry_queue, wait);
 
if (key->mapping != ewait->key.mapping ||
-   key->index != ewait->key.index)
+   key->entry_start != ewait->key.entry_start)
return 0;
return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -359,12 +371,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
-   wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+   wait_queue_head_t *wq;
 
init_wait(&ewait.wait);
ewait.wait.func = wake_exceptional_entry_func;
-   ewait.key.mapping = mapping;
-   ewait.key.index = index;
 
for (;;) {
entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +385,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
*slotp = slot;
return entry;
}
+
+   wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
prepare_to_wait_exclusive(wq, &ewait.wait,
  TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&mapping->tree_lock);
@@ -447,10 +459,20 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
return entry;
 }
 
+/*
+ * We do not necessarily hold the mapping->tree_lock when we call this
+ * function so it is possible that 'entry' is no longer a valid item in the
+ * radix tree.  This is okay, though, because all we really need to do is to
+ * find the correct waitqueue where tasks might be sleeping waiting for that
+ * old 'entry' and wake them.
+ */
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
-  pgoff_t index, bool wake_all)
+   pgoff_t index, void *entry, bool wake_all)
 {
-   wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+   struct exceptional_entry_key key;
+   wait_queue_head_t *wq;
+
+   wq = dax_entry_waitqueue(mapping, index, entry, &key);
 
/*
 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -458

[PATCH v9 05/16] dax: remove the last BUG_ON() from fs/dax.c

2016-11-01 Thread Ross Zwisler
Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair.  Just warn once and return an error.

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index e52e754..219fa2b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1194,7 +1194,8 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
/* Block boundary? Nothing to do */
if (!length)
return 0;
-   BUG_ON((offset + length) > PAGE_SIZE);
+   if (WARN_ON_ONCE((offset + length) > PAGE_SIZE))
+   return -EINVAL;
 
memset(&bh, 0, sizeof(bh));
bh.b_bdev = inode->i_sb->s_bdev;
-- 
2.7.4



[PATCH v9 06/16] dax: consistent variable naming for DAX entries

2016-11-01 Thread Ross Zwisler
No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 219fa2b..835e7f0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
pgoff_t index, void ***slotp)
 {
-   void *ret, **slot;
+   void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
ewait.key.index = index;
 
for (;;) {
-   ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+   entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
  &slot);
-   if (!ret || !radix_tree_exceptional_entry(ret) ||
+   if (!entry || !radix_tree_exceptional_entry(entry) ||
!slot_locked(mapping, slot)) {
if (slotp)
*slotp = slot;
-   return ret;
+   return entry;
}
prepare_to_wait_exclusive(wq, &ewait.wait,
  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-   void *ret, **slot;
+   void *entry, **slot;
 
 restart:
spin_lock_irq(&mapping->tree_lock);
-   ret = get_unlocked_mapping_entry(mapping, index, &slot);
+   entry = get_unlocked_mapping_entry(mapping, index, &slot);
/* No entry for given index? Make sure radix tree is big enough. */
-   if (!ret) {
+   if (!entry) {
int err;
 
spin_unlock_irq(&mapping->tree_lock);
@@ -410,10 +410,10 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
if (err)
return ERR_PTR(err);
-   ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+   entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
   RADIX_DAX_ENTRY_LOCK);
spin_lock_irq(&mapping->tree_lock);
-   err = radix_tree_insert(&mapping->page_tree, index, ret);
+   err = radix_tree_insert(&mapping->page_tree, index, entry);
radix_tree_preload_end();
if (err) {
spin_unlock_irq(&mapping->tree_lock);
@@ -425,11 +425,11 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
/* Good, we have inserted empty locked entry into the tree. */
mapping->nrexceptional++;
spin_unlock_irq(&mapping->tree_lock);
-   return ret;
+   return entry;
}
/* Normal page in radix tree? */
-   if (!radix_tree_exceptional_entry(ret)) {
-   struct page *page = ret;
+   if (!radix_tree_exceptional_entry(entry)) {
+   struct page *page = entry;
 
get_page(page);
spin_unlock_irq(&mapping->tree_lock);
@@ -442,9 +442,9 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
}
return page;
}
-   ret = lock_slot(mapping, slot);
+   entry = lock_slot(mapping, slot);
spin_unlock_irq(&mapping->tree_lock);
-   return ret;
+   return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-   void *ret, **slot;
+   void *entry, **slot;
 
spin_lock_irq(&mapping->tree_lock);
-   ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-   if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+   entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+   if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 !slot_locked(mapping, slot))) {
spin_unlock_irq(&mapping->tree_lock);
return;
-- 
2.7.4


[PATCH v9 02/16] dax: remove buffer_size_valid()

2016-11-01 Thread Ross Zwisler
Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
---
 fs/dax.c | 22 +-
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 014defd..b09817a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-   return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
rc = get_block(inode, block, bh, rw == WRITE);
if (rc)
break;
-   if (!buffer_size_valid(bh))
-   bh->b_size = 1 << blkbits;
bh_max = pos - first + bh->b_size;
bdev = bh->b_bdev;
/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
bdev = bh.b_bdev;
 
-   /*
-* If the filesystem isn't willing to tell us the length of a hole,
-* just fall back to PTEs.  Calling get_block 512 times in a loop
-* would be silly.
-*/
-   if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+   if (bh.b_size < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "allocated block too small");
return VM_FAULT_FALLBACK;
}
-- 
2.7.4



[PATCH v9 03/16] ext2: remove support for DAX PMD faults

2016-11-01 Thread Ross Zwisler
DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/ext2/file.c | 29 ++---
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a0e1478..fb88b51 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-   pmd_t *pmd, unsigned int flags)
-{
-   struct inode *inode = file_inode(vma->vm_file);
-   struct ext2_inode_info *ei = EXT2_I(inode);
-   int ret;
-
-   if (flags & FAULT_FLAG_WRITE) {
-   sb_start_pagefault(inode->i_sb);
-   file_update_time(vma->vm_file);
-   }
-   down_read(&ei->dax_sem);
-
-   ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
-   up_read(&ei->dax_sem);
-   if (flags & FAULT_FLAG_WRITE)
-   sb_end_pagefault(inode->i_sb);
-   return ret;
-}
-
 static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
struct vm_fault *vmf)
 {
@@ -154,7 +133,11 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
.fault  = ext2_dax_fault,
-   .pmd_fault  = ext2_dax_pmd_fault,
+   /*
+* .pmd_fault is not supported for DAX because allocation in ext2
+* cannot be reliably aligned to huge page sizes and so pmd faults
+* will always fail and fall back to regular faults.
+*/
.page_mkwrite   = ext2_dax_fault,
.pfn_mkwrite= ext2_dax_pfn_mkwrite,
 };
@@ -166,7 +149,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
file_accessed(file);
vma->vm_ops = &ext2_dax_vm_ops;
-   vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+   vma->vm_flags |= VM_MIXEDMAP;
return 0;
 }
 #else
-- 
2.7.4



[PATCH v9 04/16] dax: make 'wait_table' global variable static

2016-11-01 Thread Ross Zwisler
The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so it has scope local to fs/dax.c, and to make sparse happy.

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index b09817a..e52e754 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4



[PATCH v9 11/16] dax: dax_iomap_fault() needs to call iomap_end()

2016-11-01 Thread Ross Zwisler
Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS.  ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation.  However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc.  So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().
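
Condensed, the new pairing looks like this (a simplified sketch of the flow,
not the literal hunk; the real code also preserves an earlier error code
rather than overwriting it):

 finish_iomap:
	if (ops->iomap_end) {
		/* report 0 bytes 'written' on failure so the fs can clean up,
		 * and the full PAGE_SIZE when the fault succeeded */
		ops->iomap_end(inode, pos, PAGE_SIZE, error ? 0 : PAGE_SIZE,
				flags, &iomap);
	}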

Signed-off-by: Ross Zwisler 
Suggested-by: Jan Kara 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 37 +
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7737954..6edd89b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1165,6 +1165,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
struct iomap iomap = { 0 };
unsigned flags = 0;
int error, major = 0;
+   int locked_status = 0;
void *entry;
 
/*
@@ -1194,7 +1195,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
goto unlock_entry;
if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
error = -EIO;   /* fs corruption? */
-   goto unlock_entry;
+   goto finish_iomap;
}
 
sector = dax_iomap_sector(&iomap, pos);
@@ -1216,13 +1217,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
 
if (error)
-   goto unlock_entry;
+   goto finish_iomap;
if (!radix_tree_exceptional_entry(entry)) {
vmf->page = entry;
-   return VM_FAULT_LOCKED;
+   locked_status = VM_FAULT_LOCKED;
+   } else {
+   vmf->entry = entry;
+   locked_status = VM_FAULT_DAX_LOCKED;
}
-   vmf->entry = entry;
-   return VM_FAULT_DAX_LOCKED;
+   goto finish_iomap;
}
 
switch (iomap.type) {
@@ -1237,8 +1240,10 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
-   if (!(vmf->flags & FAULT_FLAG_WRITE))
-   return dax_load_hole(mapping, entry, vmf);
+   if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+   locked_status = dax_load_hole(mapping, entry, vmf);
+   break;
+   }
/*FALLTHRU*/
default:
WARN_ON_ONCE(1);
@@ -1246,14 +1251,30 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
}
 
+ finish_iomap:
+   if (ops->iomap_end) {
+   if (error) {
+   /* keep previous error */
+   ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
+   &iomap);
+   } else {
+   error = ops->iomap_end(inode, pos, PAGE_SIZE,
+   PAGE_SIZE, flags, &iomap);
+   }
+   }
  unlock_entry:
-   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+   if (!locked_status || error)
+   put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
if (error == -ENOMEM)
return VM_FAULT_OOM | major;
/* -EBUSY is fine, somebody else faulted on the same PTE */
if (error < 0 && error != -EBUSY)
return VM_FAULT_SIGBUS | major;
+   if (locked_status) {
+   WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */
+   return locked_status;
+   }
return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
-- 
2.7.4



[PATCH v9 10/16] dax: add dax_iomap_sector() helper function

2016-11-01 Thread Ross Zwisler
To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault().  This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.
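
A quick worked example of the helper's arithmetic (illustrative values only):

	struct iomap iomap = { .blkno = 1000, .offset = 0x200000 };	/* blkno in 512-byte sectors */
	loff_t pos = 0x203456;
	/* (0x203000 - 0x200000) >> 9 == 24, so dax_iomap_sector() returns 1024 */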

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fdbd7a1..7737954 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1030,6 +1030,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+   return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
@@ -1055,8 +1060,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct blk_dax_ctl dax = { 0 };
ssize_t map_len;
 
-   dax.sector = iomap->blkno +
-   (((pos & PAGE_MASK) - iomap->offset) >> 9);
+   dax.sector = dax_iomap_sector(iomap, pos);
dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
map_len = dax_map_atomic(iomap->bdev, &dax);
if (map_len < 0) {
@@ -1193,7 +1197,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
goto unlock_entry;
}
 
-   sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+   sector = dax_iomap_sector(&iomap, pos);
 
if (vmf->cow_page) {
switch (iomap.type) {
-- 
2.7.4



[PATCH v9 13/16] dax: move put_(un)locked_mapping_entry() in dax.c

2016-11-01 Thread Ross Zwisler
No functional change.

The static functions put_locked_mapping_entry() and
put_unlocked_mapping_entry() will soon be used in error cases in
grab_mapping_entry(), so move their definitions above this function.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/dax.c | 50 +-
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c45cc4d..0582c7c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -382,6 +382,31 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
}
 }
 
+static void put_locked_mapping_entry(struct address_space *mapping,
+pgoff_t index, void *entry)
+{
+   if (!radix_tree_exceptional_entry(entry)) {
+   unlock_page(entry);
+   put_page(entry);
+   } else {
+   dax_unlock_mapping_entry(mapping, index);
+   }
+}
+
+/*
+ * Called when we are done with radix tree entry we looked up via
+ * get_unlocked_mapping_entry() and which we didn't lock in the end.
+ */
+static void put_unlocked_mapping_entry(struct address_space *mapping,
+  pgoff_t index, void *entry)
+{
+   if (!radix_tree_exceptional_entry(entry))
+   return;
+
+   /* We have to wake up next waiter for the radix tree entry lock */
+   dax_wake_mapping_entry_waiter(mapping, index, entry, false);
+}
+
 /*
  * Find radix tree entry at given index. If it points to a page, return with
  * the page locked. If it points to the exceptional entry, return with the
@@ -486,31 +511,6 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
-static void put_locked_mapping_entry(struct address_space *mapping,
-pgoff_t index, void *entry)
-{
-   if (!radix_tree_exceptional_entry(entry)) {
-   unlock_page(entry);
-   put_page(entry);
-   } else {
-   dax_unlock_mapping_entry(mapping, index);
-   }
-}
-
-/*
- * Called when we are done with radix tree entry we looked up via
- * get_unlocked_mapping_entry() and which we didn't lock in the end.
- */
-static void put_unlocked_mapping_entry(struct address_space *mapping,
-  pgoff_t index, void *entry)
-{
-   if (!radix_tree_exceptional_entry(entry))
-   return;
-
-   /* We have to wake up next waiter for the radix tree entry lock */
-   dax_wake_mapping_entry_waiter(mapping, index, entry, false);
-}
-
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
-- 
2.7.4



[PATCH v9 14/16] dax: add struct iomap based DAX PMD support

2016-11-01 Thread Ross Zwisler
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.
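
To make the precedence rules above concrete, the downgrade decision can be
sketched as follows. This is illustrative pseudocode built on the dax_is_*()
helpers added below; pmd_downgrade_needed() is a made-up name, not code from
this patch:

	static bool pmd_downgrade_needed(void *entry, bool want_pmd)
	{
		/* nothing present, or a PMD request: no downgrade happens */
		if (!entry || want_pmd)
			return false;
		/*
		 * A 4k request that finds a 2MiB huge zero page or a 2MiB
		 * empty entry unmaps it and downgrades it to 4k entries; a
		 * 2MiB entry with real storage behind it stays in the tree.
		 */
		return dax_is_pmd_entry(entry) &&
			(dax_is_zero_entry(entry) || dax_is_empty_entry(entry));
	}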

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 378 +++-
 include/linux/dax.h |  55 ++--
 mm/filemap.c|   3 +-
 3 files changed, 386 insertions(+), 50 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0582c7c..281e91a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -76,6 +76,26 @@ static void dax_unmap_atomic(struct block_device *bdev,
blk_queue_exit(bdev->bd_queue);
 }
 
+static int dax_is_pmd_entry(void *entry)
+{
+   return (unsigned long)entry & RADIX_DAX_PMD;
+}
+
+static int dax_is_pte_entry(void *entry)
+{
+   return !((unsigned long)entry & RADIX_DAX_PMD);
+}
+
+static int dax_is_zero_entry(void *entry)
+{
+   return (unsigned long)entry & RADIX_DAX_HZP;
+}
+
+static int dax_is_empty_entry(void *entry)
+{
+   return (unsigned long)entry & RADIX_DAX_EMPTY;
+}
+
 struct page *read_dax_sector(struct block_device *bdev, sector_t n)
 {
struct page *page = alloc_pages(GFP_KERNEL, 0);
@@ -281,7 +301,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
 * queue to the start of that PMD.  This ensures that all offsets in
 * the range covered by the PMD map to the same bit lock.
 */
-   if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+   if (dax_is_pmd_entry(entry))
index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
 
key->mapping = mapping;
@@ -413,36 +433,116 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  A 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_s

[PATCH v9 09/16] dax: correct dax iomap code namespace

2016-11-01 Thread Ross Zwisler
The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler 
Suggested-by: Dave Chinner 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 16 
 fs/ext2/file.c  |  6 +++---
 fs/xfs/xfs_file.c   |  8 
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 3d0b103..fdbd7a1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1031,7 +1031,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
 {
struct iov_iter *iter = data;
@@ -1088,7 +1088,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:  The control block for this I/O
  * @iter:  The addresses to do I/O from or to
  * @ops:   iomap ops passed from the file system
@@ -1098,7 +1098,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
struct iomap_ops *ops)
 {
struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1128,7 +1128,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-   iter, iomap_dax_actor);
+   iter, dax_iomap_actor);
if (ret <= 0)
break;
pos += ret;
@@ -1138,10 +1138,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
iocb->ki_pos += done;
return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1150,7 +1150,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
struct iomap_ops *ops)
 {
struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1252,5 +1252,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
return VM_FAULT_SIGBUS | major;
return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index fb88b51..b0f2415 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
return 0; /* skip atime */
 
inode_lock_shared(inode);
-   ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+   ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
inode_unlock_shared(inode);
 
file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out_unlock;
 
-   ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+   ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
i_size_write(inode, iocb->ki_pos);
mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
down_read(&ei->dax_sem);
 
-   ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+   ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
up_read(&ei->dax_sem);
if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6e4f7f9..8ce5d3f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -318,7 +318,7 @@ xfs_file_dax_read(
return 0; /* skip atime */
 
xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-   ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);

[PATCH v9 15/16] xfs: use struct iomap based DAX PMD fault path

2016-11-01 Thread Ross Zwisler
Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/xfs/xfs_aops.c | 26 +-
 fs/xfs/xfs_aops.h |  3 ---
 fs/xfs/xfs_file.c |  2 +-
 3 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3e57a56..561cf14 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1298,8 +1298,7 @@ __xfs_get_blocks(
sector_tiblock,
struct buffer_head  *bh_result,
int create,
-   booldirect,
-   booldax_fault)
+   booldirect)
 {
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
@@ -1420,13 +1419,8 @@ __xfs_get_blocks(
if (ISUNWRITTEN(&imap))
set_buffer_unwritten(bh_result);
/* direct IO needs special help */
-   if (create) {
-   if (dax_fault)
-   ASSERT(!ISUNWRITTEN(&imap));
-   else
-   xfs_map_direct(inode, bh_result, &imap, offset,
-   is_cow);
-   }
+   if (create)
+   xfs_map_direct(inode, bh_result, &imap, offset, is_cow);
}
 
/*
@@ -1466,7 +1460,7 @@ xfs_get_blocks(
struct buffer_head  *bh_result,
int create)
 {
-   return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+   return __xfs_get_blocks(inode, iblock, bh_result, create, false);
 }
 
 int
@@ -1476,17 +1470,7 @@ xfs_get_blocks_direct(
struct buffer_head  *bh_result,
int create)
 {
-   return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
-   struct inode*inode,
-   sector_tiblock,
-   struct buffer_head  *bh_result,
-   int create)
-{
-   return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+   return __xfs_get_blocks(inode, iblock, bh_result, create, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index b3c6634..34dc00d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -59,9 +59,6 @@ int   xfs_get_blocks(struct inode *inode, sector_t offset,
   struct buffer_head *map_bh, int create);
 intxfs_get_blocks_direct(struct inode *inode, sector_t offset,
  struct buffer_head *map_bh, int create);
-intxfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
-struct buffer_head *map_bh, int create);
-
 intxfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private);
 intxfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8ce5d3f..d818c16 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1545,7 +1545,7 @@ xfs_filemap_pmd_fault(
}
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-   ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+   ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
if (flags & FAULT_FLAG_WRITE)
-- 
2.7.4



[PATCH v9 16/16] dax: remove "depends on BROKEN" from FS_DAX_PMD

2016-11-01 Thread Ross Zwisler
Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler 
Reviewed-by: Jan Kara 
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 4bd03a2..8e9e5f41 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
depends on FS_DAX
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
-   depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4



[PATCH v9 12/16] dax: move RADIX_DAX_* defines to dax.h

2016-11-01 Thread Ross Zwisler
The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c.  When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.
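
For illustration, this is how the moved macros compose and decompose an
exceptional entry (usage sketch only, using the definitions quoted below):

	void *entry = RADIX_DAX_ENTRY(8, false);		/* pack sector 8, PTE-sized */
	sector_t sector = RADIX_DAX_SECTOR(entry);		/* == 8 */
	bool is_pmd = RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD;	/* false */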

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 14 --
 include/linux/dax.h | 15 ++-
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6edd89b..c45cc4d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include 
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-   RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-   RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a3dfee4..e9ea78c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,21 @@
 
 struct iomap_ops;
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, other two
+ * bits to determine entry type. In total 3 special bits.
+ */
+#define RADIX_DAX_SHIFT(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+   RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
+   RADIX_TREE_EXCEPTIONAL_ENTRY))
+
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
struct iomap_ops *ops);
-- 
2.7.4



[PATCH v9 08/16] dax: remove dax_pmd_fault()

2016-11-01 Thread Ross Zwisler
dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/dax.c| 213 
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7238702..3d0b103 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -915,219 +915,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR  ((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-   const char *reason, const char *fn)
-{
-   if (bh) {
-   char bname[BDEVNAME_SIZE];
-   bdevname(bh->b_bdev, bname);
-   pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-   "length %zd fallback: %s\n", fn, current->comm,
-   address, bname, bh->b_state, (u64)bh->b_blocknr,
-   bh->b_size, reason);
-   } else {
-   pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-   current->comm, address, reason);
-   }
-}
-
-#define dax_pmd_dbg(bh, address, reason)   __dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-   pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-   struct file *file = vma->vm_file;
-   struct address_space *mapping = file->f_mapping;
-   struct inode *inode = mapping->host;
-   struct buffer_head bh;
-   unsigned blkbits = inode->i_blkbits;
-   unsigned long pmd_addr = address & PMD_MASK;
-   bool write = flags & FAULT_FLAG_WRITE;
-   struct block_device *bdev;
-   pgoff_t size, pgoff;
-   sector_t block;
-   int result = 0;
-   bool alloc = false;
-
-   /* dax pmd mappings require pfn_t_devmap() */
-   if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-   return VM_FAULT_FALLBACK;
-
-   /* Fall back to PTEs if we're going to COW */
-   if (write && !(vma->vm_flags & VM_SHARED)) {
-   split_huge_pmd(vma, pmd, address);
-   dax_pmd_dbg(NULL, address, "cow write");
-   return VM_FAULT_FALLBACK;
-   }
-   /* If the PMD would extend outside the VMA */
-   if (pmd_addr < vma->vm_start) {
-   dax_pmd_dbg(NULL, address, "vma start unaligned");
-   return VM_FAULT_FALLBACK;
-   }
-   if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-   dax_pmd_dbg(NULL, address, "vma end unaligned");
-   return VM_FAULT_FALLBACK;
-   }
-
-   pgoff = linear_page_index(vma, pmd_addr);
-   size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-   if (pgoff >= size)
-   return VM_FAULT_SIGBUS;
-   /* If the PMD would cover blocks out of the file */
-   if ((pgoff | PG_PMD_COLOUR) >= size) {
-   dax_pmd_dbg(NULL, address,
-   "offset + huge page size > file size");
-   return VM_FAULT_FALLBACK;
-   }
-
-   memset(&bh, 0, sizeof(bh));
-   bh.b_bdev = inode->i_sb->s_bdev;
-   block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-   bh.b_size = PMD_SIZE;
-
-   if (get_block(inode, block, &bh, 0) != 0)
-   return VM_FAULT_SIGBUS;
-
-   if (!buffer_mapped(&bh) && write) {
-   if (get_block(inode, block, &bh, 1) != 0)
-   return VM_FAULT_SIGBUS;
-   alloc = true;
-   WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-   }
-
-   bdev = bh.b_bdev;
-
-   if (bh.b_size < PMD_SIZE) {
-   dax_pmd_dbg(&bh, address, "allocated block too small");
-   return VM_FAULT_FALLBACK;
-   }
-
-   /*
-* If we allocated new storage, make sure no process has any
-* zero pages covering this hole
-*/
-   if (alloc) {
-   loff_t lstart = pgoff << PAGE_SHIFT;
-   loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-   truncate_pagecache_range(inode, lstart, lend);
-   }
-
-   if (!write && !bu


[PATCH 05/11] ext4: Use iomap for zeroing blocks in DAX mode

2016-11-01 Thread Jan Kara
Use the iomap infrastructure for zeroing blocks when in DAX mode.
ext4_iomap_begin() handles read requests just fine, and that's all that
is needed for iomap_zero_range().
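
As a quick illustration of why that is enough (an editor's outline of the
generic helper's flow, not the exact fs/iomap.c code): iomap_zero_range()
only asks the filesystem where the bytes live and then zeroes them itself.

	/*
	 * for each sub-range of [from, from + length):
	 *         ask the filesystem what backs it via ops->iomap_begin();
	 *         if it is a hole or an unwritten extent,
	 *                 nothing to do - the range already reads as zeroes;
	 *         else
	 *                 zero the mapped bytes (through the DAX device here);
	 *         advance and repeat;
	 */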

Signed-off-by: Jan Kara 
---
 fs/ext4/inode.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cc10145ea98b..ac26a390f14c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3844,8 +3844,10 @@ static int ext4_block_zero_page_range(handle_t *handle,
if (length > max || length < 0)
length = max;
 
-   if (IS_DAX(inode))
-   return dax_zero_page_range(inode, from, length, ext4_get_block);
+   if (IS_DAX(inode)) {
+   return iomap_zero_range(inode, from, length, NULL,
+   &ext4_iomap_ops);
+   }
return __ext4_block_zero_page_range(handle, mapping, from, length);
 }
 
-- 
2.6.6



[PATCH 03/11] ext4: Let S_DAX set only if DAX is really supported

2016-11-01 Thread Jan Kara
Currently we set S_DAX in inode->i_flags for a regular file whenever
ext4 is mounted with the dax mount option. However, in some cases we
cannot really do DAX - e.g. when the inode is marked to use data
journalling, when inode data is being encrypted, or when the inode is
stored inline. Make sure the S_DAX flag is appropriately set/cleared in
these cases.
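
Read together, the new condition in ext4_set_inode_flags() amounts to the
following predicate (an editor's sketch only - ext4_may_use_dax() is a
hypothetical helper, the patch open-codes these checks):

static bool ext4_may_use_dax(struct inode *inode)
{
	/* dax mount option on a regular file, and no feature DAX cannot handle */
	return test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
	       !ext4_should_journal_data(inode) &&
	       !ext4_has_inline_data(inode) &&
	       !ext4_encrypted_inode(inode);
}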

Signed-off-by: Jan Kara 
---
 fs/ext4/inline.c | 10 ++
 fs/ext4/inode.c  |  9 -
 fs/ext4/super.c  |  5 +
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index f74d5ee2cdec..c29678965c3c 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -299,6 +299,11 @@ static int ext4_create_inline_data(handle_t *handle,
EXT4_I(inode)->i_inline_size = len + EXT4_MIN_INLINE_DATA_SIZE;
ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
+   /*
+* Propagate changes to inode->i_flags as well - e.g. S_DAX may
+* get cleared
+*/
+   ext4_set_inode_flags(inode);
get_bh(is.iloc.bh);
error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
 
@@ -442,6 +447,11 @@ static int ext4_destroy_inline_data_nolock(handle_t 
*handle,
}
}
ext4_clear_inode_flag(inode, EXT4_INODE_INLINE_DATA);
+   /*
+* Propagate changes to inode->i_flags as well - e.g. S_DAX may
+* get set.
+*/
+   ext4_set_inode_flags(inode);
 
get_bh(is.iloc.bh);
error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9eb1f89201ed..4cbd1b24c237 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4350,7 +4350,9 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
-   if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode))
+   if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
+   !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
+   !ext4_encrypted_inode(inode))
new_fl |= S_DAX;
inode_set_flags(inode, new_fl,
S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
@@ -5618,6 +5620,11 @@ int ext4_change_inode_journal_flag(struct inode *inode, 
int val)
ext4_clear_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
}
ext4_set_aops(inode);
+   /*
+* Update inode->i_flags after EXT4_INODE_JOURNAL_DATA was updated.
+* E.g. S_DAX may get cleared / set.
+*/
+   ext4_set_inode_flags(inode);
 
jbd2_journal_unlock_updates(journal);
percpu_up_write(&sbi->s_journal_flag_rwsem);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6db81fbcbaa6..c83d6f1cfab8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1125,6 +1125,10 @@ static int ext4_set_context(struct inode *inode, const 
void *ctx, size_t len,
ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
ext4_clear_inode_state(inode,
EXT4_STATE_MAY_INLINE_DATA);
+   /*
+* Update inode->i_flags - e.g. S_DAX may get disabled
+*/
+   ext4_set_inode_flags(inode);
}
return res;
}
@@ -1139,6 +1143,7 @@ static int ext4_set_context(struct inode *inode, const 
void *ctx, size_t len,
len, 0);
if (!res) {
ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
+   /* Update inode->i_flags - e.g. S_DAX may get disabled */
res = ext4_mark_inode_dirty(handle, inode);
if (res)
EXT4_ERROR_INODE(inode, "Failed to mark inode dirty");
-- 
2.6.6



[PATCH 01/11] ext4: Factor out checks from ext4_file_write_iter()

2016-11-01 Thread Jan Kara
Factor the checks of 'from' and the overwrite detection out of
ext4_file_write_iter() so that the function is easier to follow.

Signed-off-by: Jan Kara 
---
 fs/ext4/file.c | 97 ++
 1 file changed, 50 insertions(+), 47 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a822d30e73f..a6a7becb9465 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -88,6 +88,51 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter 
*from, loff_t pos)
return 0;
 }
 
+/* Is IO overwriting allocated and initialized blocks? */
+static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
+{
+   struct ext4_map_blocks map;
+   unsigned int blkbits = inode->i_blkbits;
+   int err, blklen;
+
+   if (pos + len > i_size_read(inode))
+   return false;
+
+   map.m_lblk = pos >> blkbits;
+   map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
+   blklen = map.m_len;
+
+   err = ext4_map_blocks(NULL, inode, &map, 0);
+   /*
+* 'err==len' means that all of blocks has been preallocated no matter
+* they are initialized or not.  For excluding unwritten extents, we
+* need to check m_flags.
+*/
+   return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
+}
+
+static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+   struct inode *inode = file_inode(iocb->ki_filp);
+   ssize_t ret;
+
+   ret = generic_write_checks(iocb, from);
+   if (ret <= 0)
+   return ret;
+   /*
+* If we have encountered a bitmap-format file, the size limit
+* is smaller than s_maxbytes, which is for extent-mapped files.
+*/
+   if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
+   struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+   if (iocb->ki_pos >= sbi->s_bitmap_maxbytes)
+   return -EFBIG;
+   iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
+   }
+   return iov_iter_count(from);
+}
+
 static ssize_t
 ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
@@ -98,7 +143,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
ssize_t ret;
 
inode_lock(inode);
-   ret = generic_write_checks(iocb, from);
+   ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
 
@@ -114,53 +159,11 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
ext4_unwritten_wait(inode);
}
 
-   /*
-* If we have encountered a bitmap-format file, the size limit
-* is smaller than s_maxbytes, which is for extent-mapped files.
-*/
-   if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
-   struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-
-   if (iocb->ki_pos >= sbi->s_bitmap_maxbytes) {
-   ret = -EFBIG;
-   goto out;
-   }
-   iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
-   }
-
iocb->private = &overwrite;
-   if (o_direct) {
-   size_t length = iov_iter_count(from);
-   loff_t pos = iocb->ki_pos;
-
-   /* check whether we do a DIO overwrite or not */
-   if (ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   pos + length <= i_size_read(inode)) {
-   struct ext4_map_blocks map;
-   unsigned int blkbits = inode->i_blkbits;
-   int err, len;
-
-   map.m_lblk = pos >> blkbits;
-   map.m_len = EXT4_MAX_BLOCKS(length, pos, blkbits);
-   len = map.m_len;
-
-   err = ext4_map_blocks(NULL, inode, &map, 0);
-   /*
-* 'err==len' means that all of blocks has
-* been preallocated no matter they are
-* initialized or not.  For excluding
-* unwritten extents, we need to check
-* m_flags.  There are two conditions that
-* indicate for initialized extents.  1) If we
-* hit extent cache, EXT4_MAP_MAPPED flag is
-* returned; 2) If we do a real lookup,
-* non-flags are returned.  So we should check
-* these two conditions.
-*/
-   if (err == len && (map.m_flags & EXT4_MAP_MAPPED))
-   overwrite = 1;
-   }
-   }
+   /* Check whether we do a DIO overwrite or not */
+   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
+   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
+   overwri

[PATCH 07/11] ext4: Avoid split extents for DAX writes

2016-11-01 Thread Jan Kara
Currently, mapping of blocks for DAX writes happens with the
EXT4_GET_BLOCKS_PRE_IO flag set. As a result, each ext4_map_blocks()
call creates a separate written extent, even though it could be merged
with the neighboring extents in the extent tree. The reason for using
this flag is that if the extent is unwritten, we need to convert it to
a written one and zero it out. However, this "convert mapped range to
written" operation is already implemented by ext4_map_blocks() for the
case of data writes into an unwritten extent. So just use the flags for
that mode of operation, simplify the code, and avoid unnecessary split
extents.
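
In terms of the mapping calls, the change is (an editor's summary drawn
from the diff below and from the DAX write support patch):

	/*
	 * Before - two calls, leaving a separate (split) written extent:
	 *     ext4_map_blocks(handle, inode, &map,
	 *                     EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO);
	 *     if (map.m_flags & EXT4_MAP_UNWRITTEN)
	 *             ext4_map_blocks(handle, inode, &map,
	 *                             EXT4_GET_BLOCKS_CONVERT |
	 *                             EXT4_GET_BLOCKS_CREATE_ZERO);
	 *
	 * After - one call that allocates, zeroes, and returns a written
	 * extent which can be merged with its neighbours:
	 *     ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_CREATE_ZERO);
	 */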

Signed-off-by: Jan Kara 
---
 fs/ext4/inode.c | 17 -
 1 file changed, 17 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d07d003ebce2..635518dde20e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3346,7 +3346,6 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
return PTR_ERR(handle);
 
ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_PRE_IO |
  EXT4_GET_BLOCKS_CREATE_ZERO);
if (ret < 0) {
ext4_journal_stop(handle);
@@ -3355,22 +3354,6 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
goto retry;
return ret;
}
-   /* For DAX writes we need to zero out unwritten extents */
-   if (map.m_flags & EXT4_MAP_UNWRITTEN) {
-   /*
-* We are protected by i_mmap_sem or i_rwsem so we know
-* block cannot go away from under us even though we
-* dropped i_data_sem. Convert extent to written and
-* write zeros there.
-*/
-   ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CONVERT |
- EXT4_GET_BLOCKS_CREATE_ZERO);
-   if (ret < 0) {
-   ext4_journal_stop(handle);
-   return ret;
-   }
-   }
}
 
iomap->flags = 0;
-- 
2.6.6



[PATCH 10/11] ext2: Use iomap_zero_range() for zeroing truncated page in DAX path

2016-11-01 Thread Jan Kara
Currently the last user of ext2_get_blocks() for DAX inodes is
dax_truncate_page(). Convert that to iomap_zero_range() so that all DAX
IO uses the iomap path.
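
The length passed to iomap_zero_range() below is simply the distance from
the new size to the end of its page. A standalone illustration of that
arithmetic (an editor's example, assuming 4 KiB pages):

#include <stdio.h>

#define PAGE_SIZE	4096ULL
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long long newsize = 10000;	/* new i_size after truncate */
	unsigned long long len = PAGE_ALIGN(newsize) - newsize;

	/* 10000 rounds up to 12288, so the last 2288 bytes of the page are zeroed */
	printf("zero %llu bytes starting at offset %llu\n", len, newsize);
	return 0;
}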

Signed-off-by: Jan Kara 
---
 fs/ext2/inode.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 335cd1e1f902..b723666124a7 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -846,6 +846,9 @@ struct iomap_ops ext2_iomap_ops = {
.iomap_begin= ext2_iomap_begin,
.iomap_end  = ext2_iomap_end,
 };
+#else
+/* Define empty ops for !CONFIG_FS_DAX case to avoid ugly ifdefs */
+struct iomap_ops ext2_iomap_ops;
 #endif /* CONFIG_FS_DAX */
 
 int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
@@ -1289,9 +1292,11 @@ static int ext2_setsize(struct inode *inode, loff_t 
newsize)
 
inode_dio_wait(inode);
 
-   if (IS_DAX(inode))
-   error = dax_truncate_page(inode, newsize, ext2_get_block);
-   else if (test_opt(inode->i_sb, NOBH))
+   if (IS_DAX(inode)) {
+   error = iomap_zero_range(inode, newsize,
+PAGE_ALIGN(newsize) - newsize, NULL,
+&ext2_iomap_ops);
+   } else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
else
-- 
2.6.6



[PATCH 0/11] ext4: Convert ext4 DAX IO to iomap framework

2016-11-01 Thread Jan Kara
Hello,

this patch set converts ext4 DAX IO paths to the new iomap framework and
removes the old bh-based DAX functions. As a result ext4 gains PMD page
fault support, also some other minor bugs get fixed. The patch set is based
on Ross' DAX PMD page fault support series [1]. It passes xfstests both in
DAX and non-DAX mode.

The question is how shall we merge this. If Dave is pulling PMD patches through
XFS tree, then these patches could go there as well (chances for conflicts
with other ext4 stuff are relatively low) or Dave could just export a stable
branch with PMD series which Ted would just pull...

Honza

[1] http://www.spinics.net/lists/linux-mm/msg115247.html


[PATCH 09/11] ext4: Rip out DAX handling from direct IO path

2016-11-01 Thread Jan Kara
Reads and writes for DAX inodes should no longer end up in direct IO
code. Rip out the support and add a warning.

Signed-off-by: Jan Kara 
---
 fs/ext4/inode.c | 49 +++--
 1 file changed, 15 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 001fef06ea97..e236b84fc079 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3571,19 +3571,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, 
struct iov_iter *iter)
iocb->private = NULL;
if (overwrite)
get_block_func = ext4_dio_get_block_overwrite;
-   else if (IS_DAX(inode)) {
-   /*
-* We can avoid zeroing for aligned DAX writes beyond EOF. Other
-* writes need zeroing either because they can race with page
-* faults or because they use partial blocks.
-*/
-   if (round_down(offset, 1 << inode->i_blkbits) >= inode->i_size &&
-   ext4_aligned_io(inode, offset, count))
-   get_block_func = ext4_dio_get_block;
-   else
-   get_block_func = ext4_dax_get_block;
-   dio_flags = DIO_LOCKING;
-   } else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
+   else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
   round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
get_block_func = ext4_dio_get_block;
dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
@@ -3597,14 +3585,9 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, 
struct iov_iter *iter)
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
 #endif
-   if (IS_DAX(inode)) {
-   ret = dax_do_io(iocb, inode, iter, get_block_func,
-   ext4_end_io_dio, dio_flags);
-   } else
-   ret = __blockdev_direct_IO(iocb, inode,
-  inode->i_sb->s_bdev, iter,
-  get_block_func,
-  ext4_end_io_dio, NULL, dio_flags);
+   ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
+  get_block_func, ext4_end_io_dio, NULL,
+  dio_flags);
 
if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
EXT4_STATE_DIO_UNWRITTEN)) {
@@ -3673,6 +3656,7 @@ static ssize_t ext4_direct_IO_read(struct kiocb *iocb, 
struct iov_iter *iter)
 {
struct address_space *mapping = iocb->ki_filp->f_mapping;
struct inode *inode = mapping->host;
+   size_t count = iov_iter_count(iter);
ssize_t ret;
 
/*
@@ -3681,19 +3665,12 @@ static ssize_t ext4_direct_IO_read(struct kiocb *iocb, 
struct iov_iter *iter)
 * we are protected against page writeback as well.
 */
inode_lock_shared(inode);
-   if (IS_DAX(inode)) {
-   ret = dax_do_io(iocb, inode, iter, ext4_dio_get_block, NULL, 0);
-   } else {
-   size_t count = iov_iter_count(iter);
-
-   ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
-  iocb->ki_pos + count);
-   if (ret)
-   goto out_unlock;
-   ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
-  iter, ext4_dio_get_block,
-  NULL, NULL, 0);
-   }
+   ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
+  iocb->ki_pos + count);
+   if (ret)
+   goto out_unlock;
+   ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
+  iter, ext4_dio_get_block, NULL, NULL, 0);
 out_unlock:
inode_unlock_shared(inode);
return ret;
@@ -3722,6 +3699,10 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
if (ext4_has_inline_data(inode))
return 0;
 
+   /* DAX uses iomap path now */
+   if (WARN_ON_ONCE(IS_DAX(inode)))
+   return 0;
+
trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
if (iov_iter_rw(iter) == READ)
ret = ext4_direct_IO_read(iocb, iter);
-- 
2.6.6



[PATCH 06/11] ext4: DAX iomap write support

2016-11-01 Thread Jan Kara
Implement DAX writes using the new iomap infrastructure instead of
overloading the direct IO path.

Signed-off-by: Jan Kara 
---
 fs/ext4/file.c  | 39 ++--
 fs/ext4/inode.c | 94 +
 2 files changed, 125 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 28ebc2418dc2..d7ab0e90d1b8 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -172,6 +172,39 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
 }
 
 static ssize_t
+ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+   struct inode *inode = file_inode(iocb->ki_filp);
+   ssize_t ret;
+   bool overwrite = false;
+
+   inode_lock(inode);
+   ret = ext4_write_checks(iocb, from);
+   if (ret <= 0)
+   goto out;
+   ret = file_remove_privs(iocb->ki_filp);
+   if (ret)
+   goto out;
+   ret = file_update_time(iocb->ki_filp);
+   if (ret)
+   goto out;
+
+   if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
+   overwrite = true;
+   downgrade_write(&inode->i_rwsem);
+   }
+   ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+out:
+   if (!overwrite)
+   inode_unlock(inode);
+   else
+   inode_unlock_shared(inode);
+   if (ret > 0)
+   ret = generic_write_sync(iocb, ret);
+   return ret;
+}
+
+static ssize_t
 ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
struct inode *inode = file_inode(iocb->ki_filp);
@@ -180,6 +213,9 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
int overwrite = 0;
ssize_t ret;
 
+   if (IS_DAX(inode))
+   return ext4_dax_write_iter(iocb, from);
+
inode_lock(inode);
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
@@ -199,8 +235,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (((o_direct && !unaligned_aio) || IS_DAX(inode)) &&
-   ext4_should_dioread_nolock(inode) &&
+   if ((o_direct && !unaligned_aio) && ext4_should_dioread_nolock(inode) &&
ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
overwrite = 1;
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ac26a390f14c..d07d003ebce2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3316,18 +3316,62 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
struct ext4_map_blocks map;
int ret;
 
-   if (flags & IOMAP_WRITE)
-   return -EIO;
-
if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
return -ERANGE;
 
map.m_lblk = first_block;
map.m_len = last_block - first_block + 1;
 
-   ret = ext4_map_blocks(NULL, inode, &map, 0);
-   if (ret < 0)
-   return ret;
+   if (!(flags & IOMAP_WRITE)) {
+   ret = ext4_map_blocks(NULL, inode, &map, 0);
+   } else {
+   int dio_credits;
+   handle_t *handle;
+   int retries = 0;
+
+   /* Trim mapping request to maximum we can map at once for DIO */
+   if (map.m_len > DIO_MAX_BLOCKS)
+   map.m_len = DIO_MAX_BLOCKS;
+   dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
+retry:
+   /*
+* Either we allocate blocks and then we don't get unwritten
+* extent so we have reserved enough credits, or the blocks
+* are already allocated and unwritten and in that case
+* extent conversion fits in the credits as well.
+*/
+   handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
+   dio_credits);
+   if (IS_ERR(handle))
+   return PTR_ERR(handle);
+
+   ret = ext4_map_blocks(handle, inode, &map,
+ EXT4_GET_BLOCKS_PRE_IO |
+ EXT4_GET_BLOCKS_CREATE_ZERO);
+   if (ret < 0) {
+   ext4_journal_stop(handle);
+   if (ret == -ENOSPC &&
+   ext4_should_retry_alloc(inode->i_sb, &retries))
+   goto retry;
+   return ret;
+   }
+   /* For DAX writes we need to zero out unwritten extents */
+   if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+   /*
+* We are protected by i_mmap_sem or i_rwsem so we know
+* block cannot go away from under us even though we
+* dropped i_data_sem. Convert extent to written and
+* write zero

[PATCH 11/11] dax: Rip out get_block based IO support

2016-11-01 Thread Jan Kara
No one uses the functions based on the get_block callback anymore. Rip
them out.

Signed-off-by: Jan Kara 
---
 fs/dax.c| 315 
 include/linux/dax.h |  12 --
 2 files changed, 327 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1fdf4c091371..380ca8547e35 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -116,168 +116,6 @@ struct page *read_dax_sector(struct block_device *bdev, 
sector_t n)
return page;
 }
 
-static bool buffer_written(struct buffer_head *bh)
-{
-   return buffer_mapped(bh) && !buffer_unwritten(bh);
-}
-
-static sector_t to_sector(const struct buffer_head *bh,
-   const struct inode *inode)
-{
-   sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
-
-   return sector;
-}
-
-static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
- loff_t start, loff_t end, get_block_t get_block,
- struct buffer_head *bh)
-{
-   loff_t pos = start, max = start, bh_max = start;
-   bool hole = false;
-   struct block_device *bdev = NULL;
-   int rw = iov_iter_rw(iter), rc;
-   long map_len = 0;
-   struct blk_dax_ctl dax = {
-   .addr = ERR_PTR(-EIO),
-   };
-   unsigned blkbits = inode->i_blkbits;
-   sector_t file_blks = (i_size_read(inode) + (1 << blkbits) - 1)
-   >> blkbits;
-
-   if (rw == READ)
-   end = min(end, i_size_read(inode));
-
-   while (pos < end) {
-   size_t len;
-   if (pos == max) {
-   long page = pos >> PAGE_SHIFT;
-   sector_t block = page << (PAGE_SHIFT - blkbits);
-   unsigned first = pos - (block << blkbits);
-   long size;
-
-   if (pos == bh_max) {
-   bh->b_size = PAGE_ALIGN(end - pos);
-   bh->b_state = 0;
-   rc = get_block(inode, block, bh, rw == WRITE);
-   if (rc)
-   break;
-   bh_max = pos - first + bh->b_size;
-   bdev = bh->b_bdev;
-   /*
-* We allow uninitialized buffers for writes
-* beyond EOF as those cannot race with faults
-*/
-   WARN_ON_ONCE(
-   (buffer_new(bh) && block < file_blks) ||
-   (rw == WRITE && buffer_unwritten(bh)));
-   } else {
-   unsigned done = bh->b_size -
-   (bh_max - (pos - first));
-   bh->b_blocknr += done >> blkbits;
-   bh->b_size -= done;
-   }
-
-   hole = rw == READ && !buffer_written(bh);
-   if (hole) {
-   size = bh->b_size - first;
-   } else {
-   dax_unmap_atomic(bdev, &dax);
-   dax.sector = to_sector(bh, inode);
-   dax.size = bh->b_size;
-   map_len = dax_map_atomic(bdev, &dax);
-   if (map_len < 0) {
-   rc = map_len;
-   break;
-   }
-   dax.addr += first;
-   size = map_len - first;
-   }
-   /*
-* pos + size is one past the last offset for IO,
-* so pos + size can overflow loff_t at extreme offsets.
-* Cast to u64 to catch this and get the true minimum.
-*/
-   max = min_t(u64, pos + size, end);
-   }
-
-   if (iov_iter_rw(iter) == WRITE) {
-   len = copy_from_iter_pmem(dax.addr, max - pos, iter);
-   } else if (!hole)
-   len = copy_to_iter((void __force *) dax.addr, max - pos,
-   iter);
-   else
-   len = iov_iter_zero(max - pos, iter);
-
-   if (!len) {
-   rc = -EFAULT;
-   break;
-   }
-
-   pos += len;
-   if (!IS_ERR(dax.addr))
-   dax.addr += len;
-   }
-
-   dax_unmap_atomic(bdev, &dax);
-
-   return (pos == start) ? rc : pos - start;
-}
-
-/**
- * dax_do_io - Perform I/O to a DAX file
- * @iocb: The control block for this I/O
- * @inode: The file which 

[PATCH 08/11] ext4: Convert DAX faults to iomap infrastructure

2016-11-01 Thread Jan Kara
Convert DAX faults to use the iomap infrastructure. We would not have to
start a transaction in ext4_dax_fault() anymore since ext4_iomap_begin()
takes care of that, but for now we still do so to avoid a lock inversion
of the transaction start with the DAX entry lock, which gets acquired in
dax_iomap_fault() before the ->iomap_begin handler is called.
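
In other words, keeping the transaction start in the fault handler
preserves this ordering (an editor's restatement of the paragraph above):

	/*
	 * ext4_dax_fault():
	 *     ext4_journal_start()            <- transaction started first
	 *     dax_iomap_fault():
	 *         lock DAX radix tree entry   <- entry lock taken second
	 *         ->iomap_begin()             <- must not start the transaction
	 *                                        here, or the two would be taken
	 *                                        in the opposite order
	 */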

Signed-off-by: Jan Kara 
---
 fs/ext4/ext4.h  |  1 +
 fs/ext4/file.c  |  9 +
 fs/ext4/inode.c | 18 ++
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 098b39910001..2714eb6174ab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3272,6 +3272,7 @@ static inline bool ext4_aligned_io(struct inode *inode, 
loff_t off, loff_t len)
 }
 
 extern struct iomap_ops ext4_iomap_ops;
+extern struct iomap_ops ext4_iomap_fault_ops;
 
 #endif /* __KERNEL__ */
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d7ab0e90d1b8..da44e49c8276 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -273,7 +273,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, 
struct vm_fault *vmf)
if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
else
-   result = dax_fault(vma, vmf, ext4_dax_get_block);
+   result = dax_iomap_fault(vma, vmf, &ext4_iomap_fault_ops);
 
if (write) {
if (!IS_ERR(handle))
@@ -307,9 +307,10 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, 
unsigned long addr,
 
if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
-   else
-   result = dax_pmd_fault(vma, addr, pmd, flags,
-ext4_dax_get_block);
+   else {
+   result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
+&ext4_iomap_fault_ops);
+   }
 
if (write) {
if (!IS_ERR(handle))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 635518dde20e..001fef06ea97 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3419,11 +3419,29 @@ static int ext4_iomap_end(struct inode *inode, loff_t 
offset, loff_t length,
return 0;
 }
 
+/*
+ * For faults we don't allocate any blocks outside of isize and we don't want
+ * to change it so we use a dedicated function for it...
+ */
+static int ext4_iomap_fault_end(struct inode *inode, loff_t offset,
+   loff_t length, ssize_t written, unsigned flags,
+   struct iomap *iomap)
+{
+   if (flags & IOMAP_WRITE)
+   ext4_journal_stop(ext4_journal_current_handle());
+   return 0;
+}
+
 struct iomap_ops ext4_iomap_ops = {
.iomap_begin= ext4_iomap_begin,
.iomap_end  = ext4_iomap_end,
 };
 
+struct iomap_ops ext4_iomap_fault_ops = {
+   .iomap_begin= ext4_iomap_begin,
+   .iomap_end  = ext4_iomap_fault_end,
+};
+
 #else
 /* Just define empty function, it will never get called. */
 int ext4_dax_get_block(struct inode *inode, sector_t iblock,
-- 
2.6.6



[PATCH 04/11] ext4: Convert DAX reads to iomap infrastructure

2016-11-01 Thread Jan Kara
Implement a basic iomap_begin function that handles reads and use it
for DAX reads.
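
The byte-to-block translation at the top of ext4_iomap_begin() (see the
diff below) is plain shift arithmetic; a standalone illustration (an
editor's example, assuming 4 KiB blocks, i.e. blkbits = 12):

#include <stdio.h>

int main(void)
{
	unsigned int blkbits = 12;			/* 4096-byte blocks */
	unsigned long long offset = 5000, length = 10000;

	unsigned long first_block = offset >> blkbits;
	unsigned long last_block = (offset + length - 1) >> blkbits;
	unsigned long m_len = last_block - first_block + 1;

	/* offset 5000, length 10000 covers blocks 1..3, i.e. 3 blocks */
	printf("first=%lu last=%lu len=%lu\n", first_block, last_block, m_len);
	return 0;
}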

Signed-off-by: Jan Kara 
---
 fs/ext4/ext4.h  |  2 ++
 fs/ext4/file.c  | 40 +++-
 fs/ext4/inode.c | 54 ++
 3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 282a51b07c57..098b39910001 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3271,6 +3271,8 @@ static inline bool ext4_aligned_io(struct inode *inode, 
loff_t off, loff_t len)
return IS_ALIGNED(off, blksize) && IS_ALIGNED(len, blksize);
 }
 
+extern struct iomap_ops ext4_iomap_ops;
+
 #endif /* __KERNEL__ */
 
 #define EFSBADCRC  EBADMSG /* Bad CRC detected */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 55f8b922b76d..28ebc2418dc2 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -31,6 +31,44 @@
 #include "xattr.h"
 #include "acl.h"
 
+#ifdef CONFIG_FS_DAX
+static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+   struct inode *inode = iocb->ki_filp->f_mapping->host;
+   ssize_t ret;
+
+   inode_lock_shared(inode);
+   /*
+* Recheck under inode lock - at this point we are sure it cannot
+* change anymore
+*/
+   if (!IS_DAX(inode)) {
+   inode_unlock_shared(inode);
+   /* Fallback to buffered IO in case we cannot support DAX */
+   return generic_file_read_iter(iocb, to);
+   }
+   ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
+   inode_unlock_shared(inode);
+
+   file_accessed(iocb->ki_filp);
+   return ret;
+}
+#endif
+
+static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+   struct inode *inode = iocb->ki_filp->f_mapping->host;
+
+   if (!iov_iter_count(to))
+   return 0; /* skip atime */
+
+#ifdef CONFIG_FS_DAX
+   if (IS_DAX(inode))
+   return ext4_dax_read_iter(iocb, to);
+#endif
+   return generic_file_read_iter(iocb, to);
+}
+
 /*
  * Called when an inode is released. Note that this is different
  * from ext4_file_open: open gets called at every open, but release
@@ -691,7 +729,7 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int 
whence)
 
 const struct file_operations ext4_file_operations = {
.llseek = ext4_llseek,
-   .read_iter  = generic_file_read_iter,
+   .read_iter  = ext4_file_read_iter,
.write_iter = ext4_file_write_iter,
.unlocked_ioctl = ext4_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4cbd1b24c237..cc10145ea98b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -3305,6 +3306,59 @@ int ext4_dax_get_block(struct inode *inode, sector_t 
iblock,
clear_buffer_new(bh_result);
return 0;
 }
+
+static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+   unsigned flags, struct iomap *iomap)
+{
+   unsigned int blkbits = inode->i_blkbits;
+   unsigned long first_block = offset >> blkbits;
+   unsigned long last_block = (offset + length - 1) >> blkbits;
+   struct ext4_map_blocks map;
+   int ret;
+
+   if (flags & IOMAP_WRITE)
+   return -EIO;
+
+   if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+   return -ERANGE;
+
+   map.m_lblk = first_block;
+   map.m_len = last_block - first_block + 1;
+
+   ret = ext4_map_blocks(NULL, inode, &map, 0);
+   if (ret < 0)
+   return ret;
+
+   iomap->flags = 0;
+   iomap->bdev = inode->i_sb->s_bdev;
+   iomap->offset = first_block << blkbits;
+
+   if (ret == 0) {
+   iomap->type = IOMAP_HOLE;
+   iomap->blkno = IOMAP_NULL_BLOCK;
+   iomap->length = (u64)map.m_len << blkbits;
+   } else {
+   if (map.m_flags & EXT4_MAP_MAPPED) {
+   iomap->type = IOMAP_MAPPED;
+   } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+   iomap->type = IOMAP_UNWRITTEN;
+   } else {
+   WARN_ON_ONCE(1);
+   return -EIO;
+   }
+   iomap->blkno = (sector_t)map.m_pblk << (blkbits - 9);
+   iomap->length = (u64)map.m_len << blkbits;
+   }
+
+   if (map.m_flags & EXT4_MAP_NEW)
+   iomap->flags |= IOMAP_F_NEW;
+   return 0;
+}
+
+struct iomap_ops ext4_iomap_ops = {
+   .iomap_begin= ext4_iomap_begin,
+};
+
 #else
 /* Just define empty function, it will never get called. */
 int ext4_dax_get_block(struct inode *inode, sector_t iblock,
-- 
2.6.6



[PATCH 02/11] ext4: Allow unaligned unlocked DAX IO

2016-11-01 Thread Jan Kara
Currently we don't allow unaligned writes without inode_lock. This is
because zeroing of partial blocks could cause data corruption for racing
unaligned writes to the same block. However, DAX handles zeroing during
block allocation, and thus zeroing of partial blocks cannot race. Allow
unaligned DAX IO to run without inode_lock.

Signed-off-by: Jan Kara 
---
 fs/ext4/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index a6a7becb9465..55f8b922b76d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -161,7 +161,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
+   if (((o_direct && !unaligned_aio) || IS_DAX(inode)) &&
+   ext4_should_dioread_nolock(inode) &&
ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
overwrite = 1;
 
-- 
2.6.6



Re: [PATCH 0/11] ext4: Convert ext4 DAX IO to iomap framework

2016-11-01 Thread Dave Chinner
On Tue, Nov 01, 2016 at 10:06:10PM +0100, Jan Kara wrote:
> Hello,
> 
> this patch set converts ext4 DAX IO paths to the new iomap framework and
> removes the old bh-based DAX functions. As a result ext4 gains PMD page
> fault support, also some other minor bugs get fixed. The patch set is based
> on Ross' DAX PMD page fault support series [1]. It passes xfstests both in
> DAX and non-DAX mode.
> 
> The question is how shall we merge this. If Dave is pulling PMD patches 
> through
> XFS tree, then these patches could go there as well (chances for conflicts
> with other ext4 stuff are relatively low) or Dave could just export a stable
> branch with PMD series which Ted would just pull...

I plan to grab Ross's PMD series in the next couple of days and I'll
push it out as a stable topic branch once I've sanity tested it.  I
don't really want to take a big chunk of ext4 stuff through the XFS
tree if it can be avoided.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH 05/21] mm: Trim __do_fault() arguments

2016-11-01 Thread Jan Kara
Use the vm_fault structure to pass cow_page, page, and entry in and out
of the function. That reduces the number of __do_fault() arguments from
4 to 1.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 mm/memory.c | 53 +++--
 1 file changed, 23 insertions(+), 30 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8145dadb2645..f5ef7b8a30c5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,26 +2848,22 @@ static int do_anonymous_page(struct vm_fault *vmf)
  * released depending on flags and vma->vm_ops->fault() return value.
  * See filemap_fault() and __lock_page_retry().
  */
-static int __do_fault(struct vm_fault *vmf, struct page *cow_page,
- struct page **page, void **entry)
+static int __do_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
int ret;
 
-   vmf->cow_page = cow_page;
-
ret = vma->vm_ops->fault(vma, vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
-   if (ret & VM_FAULT_DAX_LOCKED) {
-   *entry = vmf->entry;
+   if (ret & VM_FAULT_DAX_LOCKED)
return ret;
-   }
 
if (unlikely(PageHWPoison(vmf->page))) {
if (ret & VM_FAULT_LOCKED)
unlock_page(vmf->page);
put_page(vmf->page);
+   vmf->page = NULL;
return VM_FAULT_HWPOISON;
}
 
@@ -2876,7 +2872,6 @@ static int __do_fault(struct vm_fault *vmf, struct page 
*cow_page,
else
VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
 
-   *page = vmf->page;
return ret;
 }
 
@@ -3173,7 +3168,6 @@ static int do_fault_around(struct vm_fault *vmf)
 static int do_read_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *fault_page;
int ret = 0;
 
/*
@@ -3187,24 +3181,23 @@ static int do_read_fault(struct vm_fault *vmf)
return ret;
}
 
-   ret = __do_fault(vmf, NULL, &fault_page, NULL);
+   ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
-   ret |= alloc_set_pte(vmf, NULL, fault_page);
+   ret |= alloc_set_pte(vmf, NULL, vmf->page);
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   unlock_page(fault_page);
+   unlock_page(vmf->page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
-   put_page(fault_page);
+   put_page(vmf->page);
return ret;
 }
 
 static int do_cow_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *fault_page, *new_page;
-   void *fault_entry;
+   struct page *new_page;
struct mem_cgroup *memcg;
int ret;
 
@@ -3221,20 +3214,21 @@ static int do_cow_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
 
-   ret = __do_fault(vmf, new_page, &fault_page, &fault_entry);
+   vmf->cow_page = new_page;
+   ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
 
if (!(ret & VM_FAULT_DAX_LOCKED))
-   copy_user_highpage(new_page, fault_page, vmf->address, vma);
+   copy_user_highpage(new_page, vmf->page, vmf->address, vma);
__SetPageUptodate(new_page);
 
ret |= alloc_set_pte(vmf, memcg, new_page);
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
if (!(ret & VM_FAULT_DAX_LOCKED)) {
-   unlock_page(fault_page);
-   put_page(fault_page);
+   unlock_page(vmf->page);
+   put_page(vmf->page);
} else {
dax_unlock_mapping_entry(vma->vm_file->f_mapping, vmf->pgoff);
}
@@ -3250,12 +3244,11 @@ static int do_cow_fault(struct vm_fault *vmf)
 static int do_shared_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *fault_page;
struct address_space *mapping;
int dirtied = 0;
int ret, tmp;
 
-   ret = __do_fault(vmf, NULL, &fault_page, NULL);
+   ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
@@ -3264,26 +3257,26 @@ static int do_shared_fault(struct vm_fault *vmf)
 * about to become writable
 */
if (vma->vm_ops->page_mkwrite) {
-   unlock_page(fault_page);
-   tmp = do_page_mkwrite(vma, fault_page, vmf->address);
+   unlock_page(vmf->page);
+   tmp = do_page_mkwrite(vma, vmf->page, vmf->address);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
-   put_p

[PATCH 01/20] mm: Change type of vmf->virtual_address

2016-11-01 Thread Jan Kara
Every single user of vmf->virtual_address cast that entry to unsigned
long before doing anything with it. So just change the type of that
entry to unsigned long directly.

Signed-off-by: Jan Kara 
---
 arch/powerpc/platforms/cell/spufs/file.c |  4 ++--
 arch/x86/entry/vdso/vma.c|  4 ++--
 drivers/char/agp/alpha-agp.c |  2 +-
 drivers/char/mspec.c |  2 +-
 drivers/dax/dax.c|  2 +-
 drivers/gpu/drm/armada/armada_gem.c  |  2 +-
 drivers/gpu/drm/drm_vm.c |  9 -
 drivers/gpu/drm/etnaviv/etnaviv_gem.c|  7 +++
 drivers/gpu/drm/exynos/exynos_drm_gem.c  |  5 ++---
 drivers/gpu/drm/gma500/framebuffer.c |  2 +-
 drivers/gpu/drm/gma500/gem.c |  5 ++---
 drivers/gpu/drm/i915/i915_gem.c  |  5 ++---
 drivers/gpu/drm/msm/msm_gem.c|  7 +++
 drivers/gpu/drm/omapdrm/omap_gem.c   | 17 +++--
 drivers/gpu/drm/tegra/gem.c  |  4 ++--
 drivers/gpu/drm/ttm/ttm_bo_vm.c  |  2 +-
 drivers/gpu/drm/udl/udl_gem.c|  5 ++---
 drivers/gpu/drm/vgem/vgem_drv.c  |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c|  5 ++---
 drivers/misc/cxl/context.c   |  2 +-
 drivers/misc/sgi-gru/grumain.c   |  2 +-
 drivers/staging/android/ion/ion.c|  2 +-
 drivers/staging/lustre/lustre/llite/vvp_io.c |  8 +---
 drivers/xen/privcmd.c|  2 +-
 fs/dax.c |  4 ++--
 include/linux/mm.h   |  2 +-
 mm/memory.c  |  7 +++
 27 files changed, 55 insertions(+), 65 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/file.c 
b/arch/powerpc/platforms/cell/spufs/file.c
index 06254467e4dd..f7b33a477b95 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -236,7 +236,7 @@ static int
 spufs_mem_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
struct spu_context *ctx = vma->vm_file->private_data;
-   unsigned long address = (unsigned long)vmf->virtual_address;
+   unsigned long address = vmf->virtual_address;
unsigned long pfn, offset;
 
offset = vmf->pgoff << PAGE_SHIFT;
@@ -355,7 +355,7 @@ static int spufs_ps_fault(struct vm_area_struct *vma,
down_read(¤t->mm->mmap_sem);
} else {
area = ctx->spu->problem_phys + ps_offs;
-   vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
+   vm_insert_pfn(vma, vmf->virtual_address,
(area + offset) >> PAGE_SHIFT);
spu_context_trace(spufs_ps_fault__insert, ctx, ctx->spu);
}
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index f840766659a8..113e0155c6b5 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -157,7 +157,7 @@ static int vvar_fault(const struct vm_special_mapping *sm,
return VM_FAULT_SIGBUS;
 
if (sym_offset == image->sym_vvar_page) {
-   ret = vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
+   ret = vm_insert_pfn(vma, vmf->virtual_address,
__pa_symbol(&__vvar_page) >> PAGE_SHIFT);
} else if (sym_offset == image->sym_pvclock_page) {
struct pvclock_vsyscall_time_info *pvti =
@@ -165,7 +165,7 @@ static int vvar_fault(const struct vm_special_mapping *sm,
if (pvti && vclock_was_used(VCLOCK_PVCLOCK)) {
ret = vm_insert_pfn(
vma,
-   (unsigned long)vmf->virtual_address,
+   vmf->virtual_address,
__pa(pvti) >> PAGE_SHIFT);
}
}
diff --git a/drivers/char/agp/alpha-agp.c b/drivers/char/agp/alpha-agp.c
index 199b8e99f7d7..537b1dc14c9f 100644
--- a/drivers/char/agp/alpha-agp.c
+++ b/drivers/char/agp/alpha-agp.c
@@ -19,7 +19,7 @@ static int alpha_core_agp_vm_fault(struct vm_area_struct *vma,
unsigned long pa;
struct page *page;
 
-   dma_addr = (unsigned long)vmf->virtual_address - vma->vm_start
+   dma_addr = vmf->virtual_address - vma->vm_start
+ agp->aperture.bus_base;
pa = agp->ops->translate(agp, dma_addr);
 
diff --git a/drivers/char/mspec.c b/drivers/char/mspec.c
index f3f92d5fcda0..36eb17c16951 100644
--- a/drivers/char/mspec.c
+++ b/drivers/char/mspec.c
@@ -227,7 +227,7 @@ mspec_fault(struct vm_area_struct *vma, struct vm_fault 
*vmf)
 * be because another thread has installed the pte first, so it
 * is no problem.
 */
-   vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn);
+   vm_insert_pfn(vma, vmf->virtu

[PATCH 0/21 v4] dax: Clear dirty bits after flushing caches

2016-11-01 Thread Jan Kara
Hello,

this is the fourth revision of my patches to clear dirty bits from radix tree
of DAX inodes when caches for corresponding pfns have been flushed. This patch
set is significantly larger than the previous version because I'm changing how
->fault, ->page_mkwrite, and ->pfn_mkwrite handlers may choose to handle the
fault so that we don't have to leak details about DAX locking into the generic
code. In principle, these patches enable handlers to easily update PTEs and do
other work necessary to finish the fault without duplicating the functionality
present in the generic code. I'd be really like feedback from mm folks whether
such changes to fault handling code are fine or what they'd do differently.

The patches are based on 4.9-rc1 + Ross' DAX PMD page fault series [1] + ext4
conversion of DAX IO patch to the iomap infrastructure [2]. For testing,
I've pushed out a tree including all these patches and further DAX fixes
to:

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git dax

The patches pass testing with xfstests on ext4 and xfs on my end. I'd be
grateful for review so that we can push these patches for the next merge
window.

[1] http://www.spinics.net/lists/linux-mm/msg115247.html
[2] Posted an hour ago - look for "ext4: Convert ext4 DAX IO to iomap framework"

Changes since v3:
* rebased on top of 4.9-rc1 + DAX PMD fault series + ext4 iomap conversion
* reordered some of the patches
* killed ->virtual_address field in vm_fault structure as requested by
  Christoph

Changes since v2:
* rebased on top of 4.8-rc8 - this involved dealing with new fault_env
  structure
* changed calling convention for fault helpers

Changes since v1:
* make sure all PTE updates happen under radix tree entry lock to protect
  against races between faults & write-protecting code
* remove information about DAX locking from mm/memory.c
* smaller updates based on Ross' feedback


Background information regarding the motivation:

Currently we never clear dirty bits in the radix tree of a DAX inode. Thus
fsync(2) flushes all the dirty pfns again and again. These patches
implement clearing of the dirty tag in the radix tree so that we issue a
flush only when needed.

The difficulty with clearing the dirty tag is that we have to protect against
a concurrent page fault setting the dirty tag and writing new data into the
page. So we need a lock serializing page fault and clearing of the dirty tag
and write-protecting PTEs (so that we get another pagefault when pfn is written
to again and we have to set the dirty tag again).
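
Schematically, the writeback side of the scheme described above looks
like this (an editor's outline; the names are descriptive, not the actual
functions in the series):

	/*
	 * for each radix tree entry tagged as dirty in the DAX mapping:
	 *         lock the radix tree entry;               serializes against faults
	 *         write-protect every PTE mapping the pfn; next write re-faults
	 *         flush CPU caches for the pfn range;
	 *         clear the dirty tag on the entry;
	 *         unlock the entry;
	 */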

The effect of the patch set is easily visible:

Writing 1 GB of data via mmap, then fsync twice.

Before this patch set, both fsyncs take ~205 ms on my test machine. After
the patch set, the first fsync takes ~283 ms (the additional cost of
walking PTEs, clearing dirty bits, etc. is very noticeable), while the
second fsync takes below 1 us.

As a bonus, these patches make filesystem freezing for DAX filesystems
reliable because mappings are now properly writeprotected while freezing the
fs.
Honza


[PATCH 02/20] mm: Join struct fault_env and vm_fault

2016-11-01 Thread Jan Kara
Currently we have two different structures for passing fault information
around - struct vm_fault and struct fault_env. DAX will need more
information in struct vm_fault to handle its faults, so the content of
that structure would become even closer to fault_env. Furthermore, it
would need to generate struct fault_env to be able to call some of the
generic functions. So at this point I don't think there's much use in
keeping these two structures separate. Just embed into struct vm_fault
all that is needed to use it for both purposes.
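
For orientation, this is roughly the field set the merged structure ends
up carrying, reconstructed from the hunks in this series (an editor's
abridged sketch, not the final include/linux/mm.h definition):

struct vm_fault {
	struct vm_area_struct *vma;	/* target VMA (was in fault_env) */
	unsigned long address;		/* faulting virtual address */
	unsigned int flags;		/* FAULT_FLAG_xxx */
	pgoff_t pgoff;			/* logical offset in the file */
	gfp_t gfp_mask;			/* allocation mask for this fault */
	struct page *page;		/* page returned by ->fault() */
	struct page *cow_page;		/* page prepared for a COW fault */
	void *entry;			/* locked DAX radix tree entry */
	struct mem_cgroup *memcg;	/* memcg charged for cow_page */
	pmd_t *pmd;			/* PMD the address falls under */
	pte_t *pte;			/* mapped PTE, if any */
	spinlock_t *ptl;		/* page table lock for @pte */
};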

Signed-off-by: Jan Kara 
---
 Documentation/filesystems/Locking |   2 +-
 fs/userfaultfd.c  |  22 +-
 include/linux/huge_mm.h   |  10 +-
 include/linux/mm.h|  28 +-
 include/linux/userfaultfd_k.h |   4 +-
 mm/filemap.c  |  14 +-
 mm/huge_memory.c  | 173 ++--
 mm/internal.h |   2 +-
 mm/khugepaged.c   |  20 +-
 mm/memory.c   | 549 +++---
 mm/nommu.c|   2 +-
 11 files changed, 414 insertions(+), 412 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index d30fb2cb5066..02961390f4ba 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -549,7 +549,7 @@ till "end_pgoff". ->map_pages() is called with page table 
locked and must
 not block.  If it's not possible to reach a page without blocking,
 filesystem should skip it. Filesystem should use do_set_pte() to setup
 page table entry. Pointer to entry associated with the page is passed in
-"pte" field in fault_env structure. Pointers to entries for other offsets
+"pte" field in vm_fault structure. Pointers to entries for other offsets
 should be calculated relative to "pte".
 
->page_mkwrite() is called when a previously read-only pte is
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 85959d8324df..d96e2f30084b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -257,9 +257,9 @@ out:
  * fatal_signal_pending()s, and the mmap_sem must be released before
  * returning it.
  */
-int handle_userfault(struct fault_env *fe, unsigned long reason)
+int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 {
-   struct mm_struct *mm = fe->vma->vm_mm;
+   struct mm_struct *mm = vmf->vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
@@ -268,7 +268,7 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
ret = VM_FAULT_SIGBUS;
-   ctx = fe->vma->vm_userfaultfd_ctx.ctx;
+   ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;
 
@@ -301,17 +301,18 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
 * without first stopping userland access to the memory. For
 * VM_UFFD_MISSING userfaults this is enough for now.
 */
-   if (unlikely(!(fe->flags & FAULT_FLAG_ALLOW_RETRY))) {
+   if (unlikely(!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))) {
/*
 * Validate the invariant that nowait must allow retry
 * to be sure not to return SIGBUS erroneously on
 * nowait invocations.
 */
-   BUG_ON(fe->flags & FAULT_FLAG_RETRY_NOWAIT);
+   BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
 #ifdef CONFIG_DEBUG_VM
if (printk_ratelimit()) {
printk(KERN_WARNING
-  "FAULT_FLAG_ALLOW_RETRY missing %x\n", 
fe->flags);
+  "FAULT_FLAG_ALLOW_RETRY missing %x\n",
+  vmf->flags);
dump_stack();
}
 #endif
@@ -323,7 +324,7 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
 * and wait.
 */
ret = VM_FAULT_RETRY;
-   if (fe->flags & FAULT_FLAG_RETRY_NOWAIT)
+   if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
goto out;
 
/* take the reference before dropping the mmap_sem */
@@ -331,11 +332,11 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
 
init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
-   uwq.msg = userfault_msg(fe->address, fe->flags, reason);
+   uwq.msg = userfault_msg(vmf->address, vmf->flags, reason);
uwq.ctx = ctx;
 
return_to_userland =
-   (fe->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+   (vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
 
spin_lock(&ctx->fault_pending_wqh.lock);
@@ -353,7 +354,8 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
  TASK_KILLABLE);
spin_unlo

[PATCH 06/21] mm: Use passed vm_fault structure in wp_pfn_shared()

2016-11-01 Thread Jan Kara
Instead of creating another vm_fault structure, use the one passed to
wp_pfn_shared() for passing arguments into the pfn_mkwrite handler.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 mm/memory.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f5ef7b8a30c5..5f6bc9028a88 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2277,16 +2277,11 @@ static int wp_pfn_shared(struct vm_fault *vmf, pte_t 
orig_pte)
struct vm_area_struct *vma = vmf->vma;
 
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
-   struct vm_fault vmf2 = {
-   .page = NULL,
-   .pgoff = vmf->pgoff,
-   .address = vmf->address,
-   .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
-   };
int ret;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   ret = vma->vm_ops->pfn_mkwrite(vma, &vmf2);
+   vmf->flags |= FAULT_FLAG_MKWRITE;
+   ret = vma->vm_ops->pfn_mkwrite(vma, vmf);
if (ret & VM_FAULT_ERROR)
return ret;
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-- 
2.6.6



[PATCH 09/21] mm: Factor out functionality to finish page faults

2016-11-01 Thread Jan Kara
Introduce finish_fault() as a helper function for finishing page
faults. It is a rather thin wrapper around alloc_set_pte(), but since we
want to call this from DAX code or filesystems, it is still useful for
avoiding some boilerplate code.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 include/linux/mm.h |  1 +
 mm/memory.c| 44 +++-
 2 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 78173d7de007..7ac2bbaab4f4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -620,6 +620,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
vm_area_struct *vma)
 
 int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
struct page *page);
+int finish_fault(struct vm_fault *vmf);
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index ac901bb02398..d3fc4988f869 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3033,6 +3033,38 @@ int alloc_set_pte(struct vm_fault *vmf, struct 
mem_cgroup *memcg,
return 0;
 }
 
+
+/**
+ * finish_fault - finish page fault once we have prepared the page to fault
+ *
+ * @vmf: structure describing the fault
+ *
+ * This function handles all that is needed to finish a page fault once the
+ * page to fault in is prepared. It handles locking of PTEs, inserts PTE for
+ * given page, adds reverse page mapping, handles memcg charges and LRU
+ * addition. The function returns 0 on success, VM_FAULT_ code in case of
+ * error.
+ *
+ * The function expects the page to be locked and on success it consumes a
+ * reference of a page being mapped (for the PTE which maps it).
+ */
+int finish_fault(struct vm_fault *vmf)
+{
+   struct page *page;
+   int ret;
+
+   /* Did we COW the page? */
+   if ((vmf->flags & FAULT_FLAG_WRITE) &&
+   !(vmf->vma->vm_flags & VM_SHARED))
+   page = vmf->cow_page;
+   else
+   page = vmf->page;
+   ret = alloc_set_pte(vmf, vmf->memcg, page);
+   if (vmf->pte)
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return ret;
+}
+
 static unsigned long fault_around_bytes __read_mostly =
rounddown_pow_of_two(65536);
 
@@ -3178,9 +3210,7 @@ static int do_read_fault(struct vm_fault *vmf)
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
-   ret |= alloc_set_pte(vmf, NULL, vmf->page);
-   if (vmf->pte)
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   ret |= finish_fault(vmf);
unlock_page(vmf->page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
put_page(vmf->page);
@@ -3219,9 +3249,7 @@ static int do_cow_fault(struct vm_fault *vmf)
copy_user_highpage(new_page, vmf->page, vmf->address, vma);
__SetPageUptodate(new_page);
 
-   ret |= alloc_set_pte(vmf, memcg, new_page);
-   if (vmf->pte)
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   ret |= finish_fault(vmf);
if (!(ret & VM_FAULT_DAX_LOCKED)) {
unlock_page(vmf->page);
put_page(vmf->page);
@@ -3262,9 +3290,7 @@ static int do_shared_fault(struct vm_fault *vmf)
}
}
 
-   ret |= alloc_set_pte(vmf, NULL, vmf->page);
-   if (vmf->pte)
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   ret |= finish_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
VM_FAULT_RETRY))) {
unlock_page(vmf->page);
-- 
2.6.6



[PATCH 17/20] mm: Export follow_pte()

2016-11-01 Thread Jan Kara
DAX will need to implement its own version of page_check_address(). To
avoid duplicating page table walking code, export follow_pte() which
does what we need.
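
A sketch of the intended DAX-side use (an editor's illustration with a
hypothetical helper name; error handling trimmed):

static bool dax_pte_dirty(struct mm_struct *mm, unsigned long address)
{
	pte_t *ptep;
	spinlock_t *ptl;
	bool dirty = false;

	/* walk to the PTE for @address; returns non-zero if nothing is mapped */
	if (follow_pte(mm, address, &ptep, &ptl))
		return false;

	dirty = pte_dirty(*ptep);
	pte_unmap_unlock(ptep, ptl);
	return dirty;
}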

Signed-off-by: Jan Kara 
---
 include/linux/mm.h | 2 ++
 mm/memory.c| 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e5a014be8932..133fabe4bb4c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1224,6 +1224,8 @@ int copy_page_range(struct mm_struct *dst, struct 
mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
+int follow_pte(struct mm_struct *mm, unsigned long address, pte_t **ptepp,
+  spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 8c8cb7f2133e..e7a4a30a5e88 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3763,8 +3763,8 @@ out:
return -EINVAL;
 }
 
-static inline int follow_pte(struct mm_struct *mm, unsigned long address,
-pte_t **ptepp, spinlock_t **ptlp)
+int follow_pte(struct mm_struct *mm, unsigned long address, pte_t **ptepp,
+  spinlock_t **ptlp)
 {
int res;
 
-- 
2.6.6



[PATCH 12/21] mm: Factor out common parts of write fault handling

2016-11-01 Thread Jan Kara
Currently we duplicate handling of shared write faults in
wp_page_reuse() and do_shared_fault(). Factor them out into a common
function.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 mm/memory.c | 78 +
 1 file changed, 37 insertions(+), 41 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 26b2858e6a12..4da66c984c2c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2067,6 +2067,41 @@ static int do_page_mkwrite(struct vm_area_struct *vma, 
struct page *page,
 }
 
 /*
+ * Handle dirtying of a page in shared file mapping on a write fault.
+ *
+ * The function expects the page to be locked and unlocks it.
+ */
+static void fault_dirty_shared_page(struct vm_area_struct *vma,
+   struct page *page)
+{
+   struct address_space *mapping;
+   bool dirtied;
+   bool page_mkwrite = vma->vm_ops->page_mkwrite;
+
+   dirtied = set_page_dirty(page);
+   VM_BUG_ON_PAGE(PageAnon(page), page);
+   /*
+* Take a local copy of the address_space - page.mapping may be zeroed
+* by truncate after unlock_page().   The address_space itself remains
+* pinned by vma->vm_file's reference.  We rely on unlock_page()'s
+* release semantics to prevent the compiler from undoing this copying.
+*/
+   mapping = page_rmapping(page);
+   unlock_page(page);
+
+   if ((dirtied || page_mkwrite) && mapping) {
+   /*
+* Some device drivers do not set page.mapping
+* but still dirty their pages
+*/
+   balance_dirty_pages_ratelimited(mapping);
+   }
+
+   if (!page_mkwrite)
+   file_update_time(vma->vm_file);
+}
+
+/*
  * Handle write page faults for pages that can be reused in the current vma
  *
  * This can happen either due to the mapping being with the VM_SHARED flag,
@@ -2096,28 +2131,11 @@ static inline int wp_page_reuse(struct vm_fault *vmf, 
struct page *page,
pte_unmap_unlock(vmf->pte, vmf->ptl);
 
if (dirty_shared) {
-   struct address_space *mapping;
-   int dirtied;
-
if (!page_mkwrite)
lock_page(page);
 
-   dirtied = set_page_dirty(page);
-   VM_BUG_ON_PAGE(PageAnon(page), page);
-   mapping = page->mapping;
-   unlock_page(page);
+   fault_dirty_shared_page(vma, page);
put_page(page);
-
-   if ((dirtied || page_mkwrite) && mapping) {
-   /*
-* Some device drivers do not set page.mapping
-* but still dirty their pages
-*/
-   balance_dirty_pages_ratelimited(mapping);
-   }
-
-   if (!page_mkwrite)
-   file_update_time(vma->vm_file);
}
 
return VM_FAULT_WRITE;
@@ -3262,8 +3280,6 @@ static int do_cow_fault(struct vm_fault *vmf)
 static int do_shared_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct address_space *mapping;
-   int dirtied = 0;
int ret, tmp;
 
ret = __do_fault(vmf);
@@ -3292,27 +3308,7 @@ static int do_shared_fault(struct vm_fault *vmf)
return ret;
}
 
-   if (set_page_dirty(vmf->page))
-   dirtied = 1;
-   /*
-* Take a local copy of the address_space - page.mapping may be zeroed
-* by truncate after unlock_page().   The address_space itself remains
-* pinned by vma->vm_file's reference.  We rely on unlock_page()'s
-* release semantics to prevent the compiler from undoing this copying.
-*/
-   mapping = page_rmapping(vmf->page);
-   unlock_page(vmf->page);
-   if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
-   /*
-* Some device drivers do not set page.mapping but still
-* dirty their pages
-*/
-   balance_dirty_pages_ratelimited(mapping);
-   }
-
-   if (!vma->vm_ops->page_mkwrite)
-   file_update_time(vma->vm_file);
-
+   fault_dirty_shared_page(vma, vmf->page);
return ret;
 }
 
-- 
2.6.6



[PATCH 04/21] mm: Use passed vm_fault structure in __do_fault()

2016-11-01 Thread Jan Kara
Instead of creating another vm_fault structure, use the one passed to
__do_fault() for passing arguments into the fault handler.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 mm/memory.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3b79eace8d23..8145dadb2645 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2852,37 +2852,31 @@ static int __do_fault(struct vm_fault *vmf, struct page 
*cow_page,
  struct page **page, void **entry)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct vm_fault vmf2;
int ret;
 
-   vmf2.address = vmf->address;
-   vmf2.pgoff = vmf->pgoff;
-   vmf2.flags = vmf->flags;
-   vmf2.page = NULL;
-   vmf2.gfp_mask = __get_fault_gfp_mask(vma);
-   vmf2.cow_page = cow_page;
+   vmf->cow_page = cow_page;
 
-   ret = vma->vm_ops->fault(vma, &vmf2);
+   ret = vma->vm_ops->fault(vma, vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
if (ret & VM_FAULT_DAX_LOCKED) {
-   *entry = vmf2.entry;
+   *entry = vmf->entry;
return ret;
}
 
-   if (unlikely(PageHWPoison(vmf2.page))) {
+   if (unlikely(PageHWPoison(vmf->page))) {
if (ret & VM_FAULT_LOCKED)
-   unlock_page(vmf2.page);
-   put_page(vmf2.page);
+   unlock_page(vmf->page);
+   put_page(vmf->page);
return VM_FAULT_HWPOISON;
}
 
if (unlikely(!(ret & VM_FAULT_LOCKED)))
-   lock_page(vmf2.page);
+   lock_page(vmf->page);
else
-   VM_BUG_ON_PAGE(!PageLocked(vmf2.page), vmf2.page);
+   VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
 
-   *page = vmf2.page;
+   *page = vmf->page;
return ret;
 }
 
@@ -3579,6 +3573,7 @@ static int __handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
.address = address,
.flags = flags,
.pgoff = linear_page_index(vma, address),
+   .gfp_mask = __get_fault_gfp_mask(vma),
};
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
-- 
2.6.6



[PATCH 16/21] mm: Provide helper for finishing mkwrite faults

2016-11-01 Thread Jan Kara
Provide a helper function for finishing write faults due to the PTE
being read-only. The helper will be used by DAX to avoid the need to
complicate generic MM code with DAX locking specifics.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 include/linux/mm.h |  1 +
 mm/memory.c| 67 --
 2 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 56fdd79e5d1e..0920adb6ec1b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -615,6 +615,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
vm_area_struct *vma)
 int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
struct page *page);
 int finish_fault(struct vm_fault *vmf);
+int finish_mkwrite_fault(struct vm_fault *vmf);
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 06aba4203104..1517ff91c743 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2270,6 +2270,38 @@ static int wp_page_copy(struct vm_fault *vmf)
return VM_FAULT_OOM;
 }
 
+/**
+ * finish_mkwrite_fault - finish page fault for a shared mapping, making PTE
+ *   writeable once the page is prepared
+ *
+ * @vmf: structure describing the fault
+ *
+ * This function handles all that is needed to finish a write page fault in a
+ * shared mapping due to PTE being read-only once the mapped page is prepared.
+ * It handles locking of PTE and modifying it. The function returns
+ * VM_FAULT_WRITE on success, 0 when PTE got changed before we acquired PTE
+ * lock.
+ *
+ * The function expects the page to be locked or other protection against
+ * concurrent faults / writeback (such as DAX radix tree locks).
+ */
+int finish_mkwrite_fault(struct vm_fault *vmf)
+{
+   WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
+   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
+  &vmf->ptl);
+   /*
+* We might have raced with another page fault while we released the
+* pte_offset_map_lock.
+*/
+   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return 0;
+   }
+   wp_page_reuse(vmf);
+   return VM_FAULT_WRITE;
+}
+
 /*
  * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
  * mapping
@@ -2286,16 +2318,7 @@ static int wp_pfn_shared(struct vm_fault *vmf)
ret = vma->vm_ops->pfn_mkwrite(vma, vmf);
if (ret & VM_FAULT_ERROR)
return ret;
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
-   /*
-* We might have raced with another page fault while we
-* released the pte_offset_map_lock.
-*/
-   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return 0;
-   }
+   return finish_mkwrite_fault(vmf);
}
wp_page_reuse(vmf);
return VM_FAULT_WRITE;
@@ -2305,7 +2328,6 @@ static int wp_page_shared(struct vm_fault *vmf)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
-   int page_mkwrite = 0;
 
get_page(vmf->page);
 
@@ -2319,26 +2341,17 @@ static int wp_page_shared(struct vm_fault *vmf)
put_page(vmf->page);
return tmp;
}
-   /*
-* Since we dropped the lock we need to revalidate
-* the PTE as someone else may have changed it.  If
-* they did, we just return, as we can count on the
-* MMU to tell us if they didn't also make it writable.
-*/
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, &vmf->ptl);
-   if (!pte_same(*vmf->pte, vmf->orig_pte)) {
+   tmp = finish_mkwrite_fault(vmf);
+   if (unlikely(!tmp || (tmp &
+ (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
unlock_page(vmf->page);
-   pte_unmap_unlock(vmf->pte, vmf->ptl);
put_page(vmf->page);
-   return 0;
+   return tmp;
}
-   page_mkwrite = 1;
-   }
-
-   wp_page_reuse(vmf);
-   if (!page_mkwrite)
+   } else {
+   wp_page_reuse(vmf);
lock_page(vmf->page);
+   }
fault_dirty_shared_page(vma, vmf->page);
put_page(vmf->page);
 
-- 
2.6.6
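
A user-space analogue may help readers less familiar with this
drop-the-lock/revalidate pattern. The sketch below (plain C with
pthreads; the names, values and the simplified "PTE" are made up for
illustration and are not kernel code) mirrors the structure of
finish_mkwrite_fault(): snapshot the value, drop the lock, re-acquire
it, and only proceed if the value is unchanged.

#include <pthread.h>
#include <stdio.h>

/*
 * User-space analogue of the revalidation in finish_mkwrite_fault():
 * snapshot a value, drop the lock for slow work, re-take the lock and
 * only proceed if the value is still the one we saw (the pte_same()
 * check above).  Names and values are illustrative, not kernel code.
 */
static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
static unsigned long fake_pte = 0x1000;         /* stands in for *vmf->pte */

static int finish_mkwrite(unsigned long orig_pte)
{
        pthread_mutex_lock(&ptl);               /* pte_offset_map_lock() */
        if (fake_pte != orig_pte) {             /* raced with another fault */
                pthread_mutex_unlock(&ptl);
                return 0;                       /* bail out, nothing done */
        }
        fake_pte |= 0x2;                        /* "make the PTE writeable" */
        pthread_mutex_unlock(&ptl);
        return 1;                               /* VM_FAULT_WRITE-style success */
}

int main(void)
{
        unsigned long orig;
        int ret;

        pthread_mutex_lock(&ptl);
        orig = fake_pte;                        /* vmf->orig_pte snapshot */
        pthread_mutex_unlock(&ptl);             /* lock dropped, e.g. for ->pfn_mkwrite() */

        /* other faults could change fake_pte while the lock is dropped */

        ret = finish_mkwrite(orig);
        printf("finish_mkwrite -> %d, pte now %#lx\n", ret, fake_pte);
        return 0;
}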



[PATCH 11/21] mm: Remove unnecessary vma->vm_ops check

2016-11-01 Thread Jan Kara
We don't check whether vma->vm_ops is NULL in do_shared_fault(), so
there's hardly any point in checking it in wp_page_shared() or
wp_pfn_shared(), which likewise get called only for shared file mappings.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7be96a43d5ac..26b2858e6a12 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2275,7 +2275,7 @@ static int wp_pfn_shared(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
-   if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
+   if (vma->vm_ops->pfn_mkwrite) {
int ret;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2305,7 +2305,7 @@ static int wp_page_shared(struct vm_fault *vmf, struct 
page *old_page)
 
get_page(old_page);
 
-   if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+   if (vma->vm_ops->page_mkwrite) {
int tmp;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.6.6



[PATCH 08/21] mm: Allow full handling of COW faults in ->fault handlers

2016-11-01 Thread Jan Kara
To allow full handling of COW faults, add a memcg field to struct
vm_fault and a new ->fault() handler return value meaning that the COW
fault has been fully handled and the memcg charge must not be canceled.
This will allow us to remove knowledge about special DAX locking from
the generic fault code.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 include/linux/mm.h | 4 +++-
 mm/memory.c| 8 +---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8e758060851..78173d7de007 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -301,7 +301,8 @@ struct vm_fault {
 * the 'address' */
pte_t orig_pte; /* Value of PTE at the time of fault */
 
-   struct page *cow_page;  /* Handler may choose to COW */
+   struct page *cow_page;  /* Page handler may use for COW fault */
+   struct mem_cgroup *memcg;   /* Cgroup cow_page belongs to */
struct page *page;  /* ->fault handlers should return a
 * page here, unless VM_FAULT_NOPAGE
 * is set (which is also implied by
@@ -1103,6 +1104,7 @@ static inline void clear_page_pfmemalloc(struct page 
*page)
 #define VM_FAULT_RETRY 0x0400  /* ->fault blocked, must retry */
 #define VM_FAULT_FALLBACK 0x0800   /* huge page fault failed, fall back to 
small */
 #define VM_FAULT_DAX_LOCKED 0x1000 /* ->fault has locked DAX entry */
+#define VM_FAULT_DONE_COW   0x2000 /* ->fault has fully handled COW */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large 
hwpoison */
 
diff --git a/mm/memory.c b/mm/memory.c
index 25028422a578..ac901bb02398 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,9 +2848,8 @@ static int __do_fault(struct vm_fault *vmf)
int ret;
 
ret = vma->vm_ops->fault(vma, vmf);
-   if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
-   return ret;
-   if (ret & VM_FAULT_DAX_LOCKED)
+   if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
+   VM_FAULT_DAX_LOCKED | VM_FAULT_DONE_COW)))
return ret;
 
if (unlikely(PageHWPoison(vmf->page))) {
@@ -3209,9 +3208,12 @@ static int do_cow_fault(struct vm_fault *vmf)
}
 
vmf->cow_page = new_page;
+   vmf->memcg = memcg;
ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
+   if (ret & VM_FAULT_DONE_COW)
+   return ret;
 
if (!(ret & VM_FAULT_DAX_LOCKED))
copy_user_highpage(new_page, vmf->page, vmf->address, vma);
-- 
2.6.6
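
For readers keeping track of the growing set of VM_FAULT_* bits, here
is a tiny stand-alone illustration of how the early-bailout test in
__do_fault() composes them. The four values are taken from the mm.h
hunk above; VM_FAULT_ERROR and VM_FAULT_NOPAGE are illustrative
stand-ins for bits defined elsewhere.

#include <stdio.h>

/* Values as shown in the include/linux/mm.h hunk above. */
#define VM_FAULT_RETRY      0x0400
#define VM_FAULT_FALLBACK   0x0800
#define VM_FAULT_DAX_LOCKED 0x1000
#define VM_FAULT_DONE_COW   0x2000

/* Illustrative stand-ins for bits defined elsewhere in mm.h. */
#define VM_FAULT_ERROR      0x0001
#define VM_FAULT_NOPAGE     0x0100

int main(void)
{
        /* A ->fault() handler can return one or more of these bits... */
        unsigned int ret = VM_FAULT_DONE_COW;

        /* ...and __do_fault() now bails out early if any of them is set. */
        if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
                   VM_FAULT_DAX_LOCKED | VM_FAULT_DONE_COW))
                printf("return early to the caller, ret=%#x\n", ret);
        else
                printf("continue with the generic fault path\n");
        return 0;
}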



[PATCH 10/21] mm: Move handling of COW faults into DAX code

2016-11-01 Thread Jan Kara
Move the final handling of COW faults from generic code into the DAX
fault handler. That way generic code doesn't have to be aware of the
peculiarities of DAX locking, so remove that knowledge and make the
locking functions private to fs/dax.c.

Signed-off-by: Jan Kara 
---
 fs/dax.c| 58 +++--
 include/linux/dax.h |  7 ---
 include/linux/mm.h  |  9 +
 mm/memory.c | 14 -
 4 files changed, 35 insertions(+), 53 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 05c669853316..b4ea4bcca9e7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -240,6 +240,23 @@ static void *get_unlocked_mapping_entry(struct 
address_space *mapping,
}
 }
 
+static void dax_unlock_mapping_entry(struct address_space *mapping,
+pgoff_t index)
+{
+   void *entry, **slot;
+
+   spin_lock_irq(&mapping->tree_lock);
+   entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+   if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
+!slot_locked(mapping, slot))) {
+   spin_unlock_irq(&mapping->tree_lock);
+   return;
+   }
+   unlock_slot(mapping, slot);
+   spin_unlock_irq(&mapping->tree_lock);
+   dax_wake_mapping_entry_waiter(mapping, index, entry, false);
+}
+
 static void put_locked_mapping_entry(struct address_space *mapping,
 pgoff_t index, void *entry)
 {
@@ -434,22 +451,6 @@ void dax_wake_mapping_entry_waiter(struct address_space 
*mapping,
__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
 }
 
-void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-   void *entry, **slot;
-
-   spin_lock_irq(&mapping->tree_lock);
-   entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-   if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
-!slot_locked(mapping, slot))) {
-   spin_unlock_irq(&mapping->tree_lock);
-   return;
-   }
-   unlock_slot(mapping, slot);
-   spin_unlock_irq(&mapping->tree_lock);
-   dax_wake_mapping_entry_waiter(mapping, index, entry, false);
-}
-
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
@@ -953,7 +954,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
struct iomap iomap = { 0 };
unsigned flags = 0;
int error, major = 0;
-   int locked_status = 0;
+   int vmf_ret = 0;
void *entry;
 
/*
@@ -1006,13 +1007,14 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
 
if (error)
goto finish_iomap;
-   if (!radix_tree_exceptional_entry(entry)) {
+
+   __SetPageUptodate(vmf->cow_page);
+   if (!radix_tree_exceptional_entry(entry))
vmf->page = entry;
-   locked_status = VM_FAULT_LOCKED;
-   } else {
-   vmf->entry = entry;
-   locked_status = VM_FAULT_DAX_LOCKED;
-   }
+   vmf_ret = finish_fault(vmf);
+   if (!vmf_ret)
+   vmf_ret = VM_FAULT_DONE_COW;
+   vmf->page = NULL;
goto finish_iomap;
}
 
@@ -1029,7 +1031,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (!(vmf->flags & FAULT_FLAG_WRITE)) {
-   locked_status = dax_load_hole(mapping, entry, vmf);
+   vmf_ret = dax_load_hole(mapping, entry, vmf);
break;
}
/*FALLTHRU*/
@@ -1041,7 +1043,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
 
  finish_iomap:
if (ops->iomap_end) {
-   if (error) {
+   if (error || (vmf_ret & VM_FAULT_ERROR)) {
/* keep previous error */
ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
&iomap);
@@ -1051,7 +1053,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
}
}
  unlock_entry:
-   if (!locked_status || error)
+   if (vmf_ret != VM_FAULT_LOCKED || error)
put_locked_mapping_entry(mapping, vmf->pgoff, entry);
  out:
if (error == -ENOMEM)
@@ -1059,9 +1061,9 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
/* -EBUSY is fine, somebody else faulted on the same PTE */
if (error < 0 && error != -EBUSY)
return VM_FAULT_SIGBUS | major;
-   if (locked_status) {
+   if (vmf_ret) {
WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? *

[PATCH 06/20] mm: Use passed vm_fault structure in wp_pfn_shared()

2016-11-01 Thread Jan Kara
Instead of creating another vm_fault structure, use the one passed to
wp_pfn_shared() for passing arguments into the pfn_mkwrite handler.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ba7760fb7db2..48de8187d7b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2273,16 +2273,11 @@ static int wp_pfn_shared(struct vm_fault *vmf, pte_t 
orig_pte)
struct vm_area_struct *vma = vmf->vma;
 
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
-   struct vm_fault vmf2 = {
-   .page = NULL,
-   .pgoff = vmf->pgoff,
-   .virtual_address = vmf->address & PAGE_MASK,
-   .flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
-   };
int ret;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   ret = vma->vm_ops->pfn_mkwrite(vma, &vmf2);
+   vmf->flags |= FAULT_FLAG_MKWRITE;
+   ret = vma->vm_ops->pfn_mkwrite(vma, vmf);
if (ret & VM_FAULT_ERROR)
return ret;
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-- 
2.6.6



[PATCH 18/20] dax: Make cache flushing protected by entry lock

2016-11-01 Thread Jan Kara
Currently, flushing of caches for DAX mappings ignores the entry lock.
So far this was OK (modulo a bug where a difference in the entry lock
bit could cause cache flushing to be mistakenly skipped), but in the
following patches we will write-protect PTEs on cache flushing and
clear dirty tags. For that we will need more exclusion, so do the cache
flushing under an entry lock. As a bonus, this allows us to remove one
lock-unlock pair of mapping->tree_lock.

Signed-off-by: Jan Kara 
---
 fs/dax.c | 66 +---
 1 file changed, 42 insertions(+), 24 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b1c503930d1d..c6cadf8413a3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -672,43 +672,63 @@ static int dax_writeback_one(struct block_device *bdev,
struct address_space *mapping, pgoff_t index, void *entry)
 {
struct radix_tree_root *page_tree = &mapping->page_tree;
-   int type = RADIX_DAX_TYPE(entry);
-   struct radix_tree_node *node;
struct blk_dax_ctl dax;
-   void **slot;
+   void *entry2, **slot;
int ret = 0;
+   int type;
 
-   spin_lock_irq(&mapping->tree_lock);
/*
-* Regular page slots are stabilized by the page lock even
-* without the tree itself locked.  These unlocked entries
-* need verification under the tree lock.
+* A page got tagged dirty in DAX mapping? Something is seriously
+* wrong.
 */
-   if (!__radix_tree_lookup(page_tree, index, &node, &slot))
-   goto unlock;
-   if (*slot != entry)
-   goto unlock;
-
-   /* another fsync thread may have already written back this entry */
-   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-   goto unlock;
+   if (WARN_ON(!radix_tree_exceptional_entry(entry)))
+   return -EIO;
 
+   spin_lock_irq(&mapping->tree_lock);
+   entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
+   /* Entry got punched out / reallocated? */
+   if (!entry2 || !radix_tree_exceptional_entry(entry2))
+   goto put_unlock;
+   /*
+* Entry got reallocated elsewhere? No need to writeback. We have to
+* compare sectors as we must not bail out due to difference in lockbit
+* or entry type.
+*/
+   if (RADIX_DAX_SECTOR(entry2) != RADIX_DAX_SECTOR(entry))
+   goto put_unlock;
+   type = RADIX_DAX_TYPE(entry2);
if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
ret = -EIO;
-   goto unlock;
+   goto put_unlock;
}
 
+   /* Another fsync thread may have already written back this entry */
+   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+   goto put_unlock;
+   /* Lock the entry to serialize with page faults */
+   entry = lock_slot(mapping, slot);
+   /*
+* We can clear the tag now but we have to be careful so that concurrent
+* dax_writeback_one() calls for the same index cannot finish before we
+* actually flush the caches. This is achieved as the calls will look
+* at the entry only under tree_lock and once they do that they will
+* see the entry locked and wait for it to unlock.
+*/
+   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+   spin_unlock_irq(&mapping->tree_lock);
+
dax.sector = RADIX_DAX_SECTOR(entry);
dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
-   spin_unlock_irq(&mapping->tree_lock);
 
/*
 * We cannot hold tree_lock while calling dax_map_atomic() because it
 * eventually calls cond_resched().
 */
ret = dax_map_atomic(bdev, &dax);
-   if (ret < 0)
+   if (ret < 0) {
+   put_locked_mapping_entry(mapping, index, entry);
return ret;
+   }
 
if (WARN_ON_ONCE(ret < dax.size)) {
ret = -EIO;
@@ -716,15 +736,13 @@ static int dax_writeback_one(struct block_device *bdev,
}
 
wb_cache_pmem(dax.addr, dax.size);
-
-   spin_lock_irq(&mapping->tree_lock);
-   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
-   spin_unlock_irq(&mapping->tree_lock);
- unmap:
+unmap:
dax_unmap_atomic(bdev, &dax);
+   put_locked_mapping_entry(mapping, index, entry);
return ret;
 
- unlock:
+put_unlock:
+   put_unlocked_mapping_entry(mapping, index, entry2);
spin_unlock_irq(&mapping->tree_lock);
return ret;
 }
-- 
2.6.6



[PATCH 18/21] mm: Export follow_pte()

2016-11-01 Thread Jan Kara
DAX will need to implement its own version of page_check_address(). To
avoid duplicating page table walking code, export follow_pte() which
does what we need.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 include/linux/mm.h | 2 ++
 mm/memory.c| 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0920adb6ec1b..b7b54f5b5198 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1210,6 +1210,8 @@ int copy_page_range(struct mm_struct *dst, struct 
mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
+int follow_pte(struct mm_struct *mm, unsigned long address, pte_t **ptepp,
+  spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index b3bd6b6c6472..7660f6169bee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3782,8 +3782,8 @@ static int __follow_pte(struct mm_struct *mm, unsigned 
long address,
return -EINVAL;
 }
 
-static inline int follow_pte(struct mm_struct *mm, unsigned long address,
-pte_t **ptepp, spinlock_t **ptlp)
+int follow_pte(struct mm_struct *mm, unsigned long address, pte_t **ptepp,
+  spinlock_t **ptlp)
 {
int res;
 
-- 
2.6.6



[PATCH 02/21] mm: Use vmf->address instead of vmf->virtual_address

2016-11-01 Thread Jan Kara
Every single user of vmf->virtual_address cast that entry to unsigned
long before doing anything with it, so the type of virtual_address does
not really provide any additional safety. Just use the masked
vmf->address, which already has the appropriate type.

Signed-off-by: Jan Kara 
---
 arch/powerpc/platforms/cell/spufs/file.c |  4 ++--
 arch/x86/entry/vdso/vma.c|  4 ++--
 drivers/char/agp/alpha-agp.c |  2 +-
 drivers/char/mspec.c |  2 +-
 drivers/dax/dax.c|  2 +-
 drivers/gpu/drm/armada/armada_gem.c  |  2 +-
 drivers/gpu/drm/drm_vm.c | 11 ++-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c|  7 +++
 drivers/gpu/drm/exynos/exynos_drm_gem.c  |  6 +++---
 drivers/gpu/drm/gma500/framebuffer.c |  2 +-
 drivers/gpu/drm/gma500/gem.c |  5 ++---
 drivers/gpu/drm/i915/i915_gem.c  |  2 +-
 drivers/gpu/drm/msm/msm_gem.c|  7 +++
 drivers/gpu/drm/omapdrm/omap_gem.c   | 20 +---
 drivers/gpu/drm/tegra/gem.c  |  4 ++--
 drivers/gpu/drm/ttm/ttm_bo_vm.c  |  2 +-
 drivers/gpu/drm/udl/udl_gem.c|  5 ++---
 drivers/gpu/drm/vgem/vgem_drv.c  |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c|  5 ++---
 drivers/misc/cxl/context.c   |  2 +-
 drivers/misc/sgi-gru/grumain.c   |  2 +-
 drivers/staging/android/ion/ion.c|  2 +-
 drivers/staging/lustre/lustre/llite/vvp_io.c |  9 ++---
 drivers/xen/privcmd.c|  2 +-
 fs/dax.c |  4 ++--
 include/linux/mm.h   |  2 --
 mm/memory.c  |  7 +++
 27 files changed, 59 insertions(+), 65 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/file.c 
b/arch/powerpc/platforms/cell/spufs/file.c
index 06254467e4dd..e8a31fffcdda 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -236,7 +236,7 @@ static int
 spufs_mem_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
struct spu_context *ctx = vma->vm_file->private_data;
-   unsigned long address = (unsigned long)vmf->virtual_address;
+   unsigned long address = vmf->address & PAGE_MASK;
unsigned long pfn, offset;
 
offset = vmf->pgoff << PAGE_SHIFT;
@@ -355,7 +355,7 @@ static int spufs_ps_fault(struct vm_area_struct *vma,
down_read(¤t->mm->mmap_sem);
} else {
area = ctx->spu->problem_phys + ps_offs;
-   vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
+   vm_insert_pfn(vma, vmf->address & PAGE_MASK,
(area + offset) >> PAGE_SHIFT);
spu_context_trace(spufs_ps_fault__insert, ctx, ctx->spu);
}
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 23c881caabd1..e20a5cb6cd31 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -109,7 +109,7 @@ static int vvar_fault(const struct vm_special_mapping *sm,
return VM_FAULT_SIGBUS;
 
if (sym_offset == image->sym_vvar_page) {
-   ret = vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
+   ret = vm_insert_pfn(vma, vmf->address & PAGE_MASK,
__pa_symbol(&__vvar_page) >> PAGE_SHIFT);
} else if (sym_offset == image->sym_pvclock_page) {
struct pvclock_vsyscall_time_info *pvti =
@@ -117,7 +117,7 @@ static int vvar_fault(const struct vm_special_mapping *sm,
if (pvti && vclock_was_used(VCLOCK_PVCLOCK)) {
ret = vm_insert_pfn(
vma,
-   (unsigned long)vmf->virtual_address,
+   vmf->address & PAGE_MASK,
__pa(pvti) >> PAGE_SHIFT);
}
}
diff --git a/drivers/char/agp/alpha-agp.c b/drivers/char/agp/alpha-agp.c
index 199b8e99f7d7..372d9378d997 100644
--- a/drivers/char/agp/alpha-agp.c
+++ b/drivers/char/agp/alpha-agp.c
@@ -19,7 +19,7 @@ static int alpha_core_agp_vm_fault(struct vm_area_struct *vma,
unsigned long pa;
struct page *page;
 
-   dma_addr = (unsigned long)vmf->virtual_address - vma->vm_start
+   dma_addr = (vmf->address & PAGE_MASK) - vma->vm_start
+ agp->aperture.bus_base;
pa = agp->ops->translate(agp, dma_addr);
 
diff --git a/drivers/char/mspec.c b/drivers/char/mspec.c
index f3f92d5fcda0..2b7e1bc9ac5c 100644
--- a/drivers/char/mspec.c
+++ b/drivers/char/mspec.c
@@ -227,7 +227,7 @@ mspec_fault(struct vm_area_struct *vma, struct vm_fault 
*vmf)
 * be because another thread has installed the pte first, so it
 * is no problem.
 */
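
As a small aside, the vmf->address & PAGE_MASK expression that replaces
vmf->virtual_address throughout simply rounds the faulting address down
to its page boundary. A stand-alone illustration, assuming 4K pages
(the constants below are local to the example):

#include <stdio.h>

#define PAGE_SHIFT 12UL                    /* 4K pages assumed for the example */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))      /* clears the low 12 offset bits */

int main(void)
{
        unsigned long address = 0x7f1234567abcUL;  /* arbitrary faulting address */

        /* What the old code read from vmf->virtual_address, the new code
         * computes as vmf->address & PAGE_MASK: the page-aligned base. */
        printf("address      %#lx\n", address);
        printf("page base    %#lx\n", address & PAGE_MASK);
        printf("page offset  %#lx\n", address & ~PAGE_MASK);
        return 0;
}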

[PATCH 14/21] mm: Use vmf->page during WP faults

2016-11-01 Thread Jan Kara
So far we set vmf->page during WP faults only when we needed to pass
it to the ->page_mkwrite handler. Set it in all cases now and use it
instead of passing the page pointer around explicitly.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 mm/memory.c | 58 +-
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c89f99c270bc..e278a8a6ccc7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2103,11 +2103,12 @@ static void fault_dirty_shared_page(struct 
vm_area_struct *vma,
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
  */
-static inline int wp_page_reuse(struct vm_fault *vmf, struct page *page,
+static inline int wp_page_reuse(struct vm_fault *vmf,
int page_mkwrite, int dirty_shared)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
+   struct page *page = vmf->page;
pte_t entry;
/*
 * Clear the pages cpupid information as the existing
@@ -2151,10 +2152,11 @@ static inline int wp_page_reuse(struct vm_fault *vmf, 
struct page *page,
  *   held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old 
page.
  */
-static int wp_page_copy(struct vm_fault *vmf, struct page *old_page)
+static int wp_page_copy(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
+   struct page *old_page = vmf->page;
struct page *new_page = NULL;
pte_t entry;
int page_copied = 0;
@@ -2306,26 +2308,25 @@ static int wp_pfn_shared(struct vm_fault *vmf)
return 0;
}
}
-   return wp_page_reuse(vmf, NULL, 0, 0);
+   return wp_page_reuse(vmf, 0, 0);
 }
 
-static int wp_page_shared(struct vm_fault *vmf, struct page *old_page)
+static int wp_page_shared(struct vm_fault *vmf)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
int page_mkwrite = 0;
 
-   get_page(old_page);
+   get_page(vmf->page);
 
if (vma->vm_ops->page_mkwrite) {
int tmp;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   vmf->page = old_page;
tmp = do_page_mkwrite(vmf);
if (unlikely(!tmp || (tmp &
  (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
-   put_page(old_page);
+   put_page(vmf->page);
return tmp;
}
/*
@@ -2337,15 +2338,15 @@ static int wp_page_shared(struct vm_fault *vmf, struct 
page *old_page)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
-   unlock_page(old_page);
+   unlock_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   put_page(old_page);
+   put_page(vmf->page);
return 0;
}
page_mkwrite = 1;
}
 
-   return wp_page_reuse(vmf, old_page, page_mkwrite, 1);
+   return wp_page_reuse(vmf, page_mkwrite, 1);
 }
 
 /*
@@ -2370,10 +2371,9 @@ static int do_wp_page(struct vm_fault *vmf)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
-   struct page *old_page;
 
-   old_page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
-   if (!old_page) {
+   vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+   if (!vmf->page) {
/*
 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
 * VM_PFNMAP VMA.
@@ -2386,30 +2386,30 @@ static int do_wp_page(struct vm_fault *vmf)
return wp_pfn_shared(vmf);
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return wp_page_copy(vmf, old_page);
+   return wp_page_copy(vmf);
}
 
/*
 * Take out anonymous pages first, anonymous shared vmas are
 * not dirty accountable.
 */
-   if (PageAnon(old_page) && !PageKsm(old_page)) {
+   if (PageAnon(vmf->page) && !PageKsm(vmf->page)) {
int total_mapcount;
-   if (!trylock_page(old_page)) {
-   get_page(old_page);
+   if (!trylock_page(vmf->page)) {
+   get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   lock_page(old_page);
+   lock_page(vmf->page);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
   

[PATCH 03/21] mm: Use pgoff in struct vm_fault instead of passing it separately

2016-11-01 Thread Jan Kara
struct vm_fault already has a pgoff entry. Use it instead of passing
pgoff as a separate argument and then assigning it later.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 mm/khugepaged.c |  1 +
 mm/memory.c | 35 ++-
 2 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f88b2d3810a7..d7df06383b10 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -880,6 +880,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
*mm,
.address = address,
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
+   .pgoff = linear_page_index(vma, address),
};
 
/* we only decide to swapin, if there is enough young ptes */
diff --git a/mm/memory.c b/mm/memory.c
index c652b65469cd..3b79eace8d23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2279,7 +2279,7 @@ static int wp_pfn_shared(struct vm_fault *vmf, pte_t 
orig_pte)
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
struct vm_fault vmf2 = {
.page = NULL,
-   .pgoff = linear_page_index(vma, vmf->address),
+   .pgoff = vmf->pgoff,
.address = vmf->address,
.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
};
@@ -2848,15 +2848,15 @@ static int do_anonymous_page(struct vm_fault *vmf)
  * released depending on flags and vma->vm_ops->fault() return value.
  * See filemap_fault() and __lock_page_retry().
  */
-static int __do_fault(struct vm_fault *vmf, pgoff_t pgoff,
-   struct page *cow_page, struct page **page, void **entry)
+static int __do_fault(struct vm_fault *vmf, struct page *cow_page,
+ struct page **page, void **entry)
 {
struct vm_area_struct *vma = vmf->vma;
struct vm_fault vmf2;
int ret;
 
vmf2.address = vmf->address;
-   vmf2.pgoff = pgoff;
+   vmf2.pgoff = vmf->pgoff;
vmf2.flags = vmf->flags;
vmf2.page = NULL;
vmf2.gfp_mask = __get_fault_gfp_mask(vma);
@@ -3115,9 +3115,10 @@ late_initcall(fault_around_debugfs);
  * fault_around_pages() value (and therefore to page order).  This way it's
  * easier to guarantee that we don't cross page table boundaries.
  */
-static int do_fault_around(struct vm_fault *vmf, pgoff_t start_pgoff)
+static int do_fault_around(struct vm_fault *vmf)
 {
unsigned long address = vmf->address, nr_pages, mask;
+   pgoff_t start_pgoff = vmf->pgoff;
pgoff_t end_pgoff;
int off, ret = 0;
 
@@ -3175,7 +3176,7 @@ static int do_fault_around(struct vm_fault *vmf, pgoff_t 
start_pgoff)
return ret;
 }
 
-static int do_read_fault(struct vm_fault *vmf, pgoff_t pgoff)
+static int do_read_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *fault_page;
@@ -3187,12 +3188,12 @@ static int do_read_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
 * something).
 */
if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-   ret = do_fault_around(vmf, pgoff);
+   ret = do_fault_around(vmf);
if (ret)
return ret;
}
 
-   ret = __do_fault(vmf, pgoff, NULL, &fault_page, NULL);
+   ret = __do_fault(vmf, NULL, &fault_page, NULL);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
 
@@ -3205,7 +3206,7 @@ static int do_read_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
return ret;
 }
 
-static int do_cow_fault(struct vm_fault *vmf, pgoff_t pgoff)
+static int do_cow_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *fault_page, *new_page;
@@ -3226,7 +3227,7 @@ static int do_cow_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
return VM_FAULT_OOM;
}
 
-   ret = __do_fault(vmf, pgoff, new_page, &fault_page, &fault_entry);
+   ret = __do_fault(vmf, new_page, &fault_page, &fault_entry);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
 
@@ -3241,7 +3242,7 @@ static int do_cow_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
unlock_page(fault_page);
put_page(fault_page);
} else {
-   dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
+   dax_unlock_mapping_entry(vma->vm_file->f_mapping, vmf->pgoff);
}
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
@@ -3252,7 +3253,7 @@ static int do_cow_fault(struct vm_fault *vmf, pgoff_t 
pgoff)
return ret;
 }
 
-static int do_shared_fault(struct vm_fault *vmf, pgoff_t pgoff)
+static int do_shared_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct pa
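
The series repeatedly initializes .pgoff with linear_page_index(vma,
address). For readers who want the arithmetic spelled out, it is the
file page offset corresponding to a virtual address within a VMA (the
inverse of the pgoff_address() helper added later in the series). A
stand-alone sketch with made-up numbers, assuming 4K pages; the formula
mirrors the mainline helper but treat it as illustrative.

#include <stdio.h>

#define PAGE_SHIFT 12UL   /* 4K pages assumed for the example */

/*
 * Illustrative user-space version of linear_page_index(): map a user
 * virtual address inside a VMA to the corresponding file page offset.
 */
static unsigned long linear_page_index(unsigned long address,
                                       unsigned long vm_start,
                                       unsigned long vm_pgoff)
{
        return ((address - vm_start) >> PAGE_SHIFT) + vm_pgoff;
}

int main(void)
{
        unsigned long vm_start = 0x7f0000000000UL;  /* made-up mapping base */
        unsigned long vm_pgoff = 16;                /* mapping starts at file page 16 */
        unsigned long address  = 0x7f0000004abcUL;  /* faulting address */

        printf("address %#lx -> pgoff %lu\n",
               address, linear_page_index(address, vm_start, vm_pgoff));
        /* prints pgoff 20: four pages into the mapping, plus vm_pgoff */
        return 0;
}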

[PATCH 01/21] mm: Join struct fault_env and vm_fault

2016-11-01 Thread Jan Kara
Currently we have two different structures for passing fault
information around - struct vm_fault and struct fault_env. DAX will
need more information in struct vm_fault to handle its faults, so the
content of that structure would become even closer to fault_env.
Furthermore it would need to generate a struct fault_env to be able to
call some of the generic functions. So at this point I don't think
there's much use in keeping these two structures separate. Just embed
into struct vm_fault all that is needed to use it for both purposes.

Signed-off-by: Jan Kara 
---
 Documentation/filesystems/Locking |   2 +-
 fs/userfaultfd.c  |  22 +-
 include/linux/huge_mm.h   |  10 +-
 include/linux/mm.h|  28 +-
 include/linux/userfaultfd_k.h |   4 +-
 mm/filemap.c  |  14 +-
 mm/huge_memory.c  | 173 ++--
 mm/internal.h |   2 +-
 mm/khugepaged.c   |  20 +-
 mm/memory.c   | 549 +++---
 mm/nommu.c|   2 +-
 11 files changed, 414 insertions(+), 412 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index 14cdc101d165..ac3d080eabaa 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -557,7 +557,7 @@ till "end_pgoff". ->map_pages() is called with page table 
locked and must
 not block.  If it's not possible to reach a page without blocking,
 filesystem should skip it. Filesystem should use do_set_pte() to setup
 page table entry. Pointer to entry associated with the page is passed in
-"pte" field in fault_env structure. Pointers to entries for other offsets
+"pte" field in vm_fault structure. Pointers to entries for other offsets
 should be calculated relative to "pte".
 
->page_mkwrite() is called when a previously read-only pte is
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 85959d8324df..d96e2f30084b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -257,9 +257,9 @@ static inline bool userfaultfd_must_wait(struct 
userfaultfd_ctx *ctx,
  * fatal_signal_pending()s, and the mmap_sem must be released before
  * returning it.
  */
-int handle_userfault(struct fault_env *fe, unsigned long reason)
+int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 {
-   struct mm_struct *mm = fe->vma->vm_mm;
+   struct mm_struct *mm = vmf->vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
@@ -268,7 +268,7 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
ret = VM_FAULT_SIGBUS;
-   ctx = fe->vma->vm_userfaultfd_ctx.ctx;
+   ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;
 
@@ -301,17 +301,18 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
 * without first stopping userland access to the memory. For
 * VM_UFFD_MISSING userfaults this is enough for now.
 */
-   if (unlikely(!(fe->flags & FAULT_FLAG_ALLOW_RETRY))) {
+   if (unlikely(!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))) {
/*
 * Validate the invariant that nowait must allow retry
 * to be sure not to return SIGBUS erroneously on
 * nowait invocations.
 */
-   BUG_ON(fe->flags & FAULT_FLAG_RETRY_NOWAIT);
+   BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
 #ifdef CONFIG_DEBUG_VM
if (printk_ratelimit()) {
printk(KERN_WARNING
-  "FAULT_FLAG_ALLOW_RETRY missing %x\n", 
fe->flags);
+  "FAULT_FLAG_ALLOW_RETRY missing %x\n",
+  vmf->flags);
dump_stack();
}
 #endif
@@ -323,7 +324,7 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
 * and wait.
 */
ret = VM_FAULT_RETRY;
-   if (fe->flags & FAULT_FLAG_RETRY_NOWAIT)
+   if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
goto out;
 
/* take the reference before dropping the mmap_sem */
@@ -331,11 +332,11 @@ int handle_userfault(struct fault_env *fe, unsigned long 
reason)
 
init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
-   uwq.msg = userfault_msg(fe->address, fe->flags, reason);
+   uwq.msg = userfault_msg(vmf->address, vmf->flags, reason);
uwq.ctx = ctx;
 
return_to_userland =
-   (fe->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+   (vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
 
spin_lock(&ctx->fault_pending_wqh.lock);
@@ -353,7 +354,8 @@ int handle_userfault(struct fault_env *fe, unsigned long 
r

[PATCH 20/21] dax: Protect PTE modification on WP fault by radix tree entry lock

2016-11-01 Thread Jan Kara
Currently the PTE gets updated in wp_pfn_shared() after
dax_pfn_mkwrite() has released the corresponding radix tree entry lock.
When we want to write-protect the PTE on cache flush, the PTE
modification needs to happen under the radix tree entry lock to ensure
consistent updates of the PTE and the radix tree (standard faults use
the page lock to ensure this consistency). So move the PTE update into
dax_pfn_mkwrite().

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 fs/dax.c| 22 --
 mm/memory.c |  2 +-
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 857117b6db6b..63b4cebe3f20 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -784,17 +784,27 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
 {
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
-   void *entry;
+   void *entry, **slot;
pgoff_t index = vmf->pgoff;
 
spin_lock_irq(&mapping->tree_lock);
-   entry = get_unlocked_mapping_entry(mapping, index, NULL);
-   if (!entry || !radix_tree_exceptional_entry(entry))
-   goto out;
+   entry = get_unlocked_mapping_entry(mapping, index, &slot);
+   if (!entry || !radix_tree_exceptional_entry(entry)) {
+   if (entry)
+   put_unlocked_mapping_entry(mapping, index, entry);
+   spin_unlock_irq(&mapping->tree_lock);
+   return VM_FAULT_NOPAGE;
+   }
radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
-   put_unlocked_mapping_entry(mapping, index, entry);
-out:
+   entry = lock_slot(mapping, slot);
spin_unlock_irq(&mapping->tree_lock);
+   /*
+* If we race with somebody updating the PTE and finish_mkwrite_fault()
+* fails, we don't care. We need to return VM_FAULT_NOPAGE and retry
+* the fault in either case.
+*/
+   finish_mkwrite_fault(vmf);
+   put_locked_mapping_entry(mapping, index, entry);
return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/mm/memory.c b/mm/memory.c
index 7660f6169bee..2683e18d6d55 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2316,7 +2316,7 @@ static int wp_pfn_shared(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
vmf->flags |= FAULT_FLAG_MKWRITE;
ret = vma->vm_ops->pfn_mkwrite(vma, vmf);
-   if (ret & VM_FAULT_ERROR)
+   if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))
return ret;
return finish_mkwrite_fault(vmf);
}
-- 
2.6.6



[PATCH 15/21] mm: Move part of wp_page_reuse() into the single call site

2016-11-01 Thread Jan Kara
The shared-write fault handling in wp_page_reuse() is needed only by
wp_page_shared(). Move that handling into its single call site to make
wp_page_reuse() simpler and to avoid the odd situation where we
sometimes pass in a locked page and sometimes an unlocked one.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 mm/memory.c | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e278a8a6ccc7..06aba4203104 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2103,8 +2103,7 @@ static void fault_dirty_shared_page(struct vm_area_struct 
*vma,
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
  */
-static inline int wp_page_reuse(struct vm_fault *vmf,
-   int page_mkwrite, int dirty_shared)
+static inline void wp_page_reuse(struct vm_fault *vmf)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
@@ -2124,16 +2123,6 @@ static inline int wp_page_reuse(struct vm_fault *vmf,
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
-
-   if (dirty_shared) {
-   if (!page_mkwrite)
-   lock_page(page);
-
-   fault_dirty_shared_page(vma, page);
-   put_page(page);
-   }
-
-   return VM_FAULT_WRITE;
 }
 
 /*
@@ -2308,7 +2297,8 @@ static int wp_pfn_shared(struct vm_fault *vmf)
return 0;
}
}
-   return wp_page_reuse(vmf, 0, 0);
+   wp_page_reuse(vmf);
+   return VM_FAULT_WRITE;
 }
 
 static int wp_page_shared(struct vm_fault *vmf)
@@ -2346,7 +2336,13 @@ static int wp_page_shared(struct vm_fault *vmf)
page_mkwrite = 1;
}
 
-   return wp_page_reuse(vmf, page_mkwrite, 1);
+   wp_page_reuse(vmf);
+   if (!page_mkwrite)
+   lock_page(vmf->page);
+   fault_dirty_shared_page(vma, vmf->page);
+   put_page(vmf->page);
+
+   return VM_FAULT_WRITE;
 }
 
 /*
@@ -2421,7 +2417,8 @@ static int do_wp_page(struct vm_fault *vmf)
page_move_anon_rmap(vmf->page, vma);
}
unlock_page(vmf->page);
-   return wp_page_reuse(vmf, 0, 0);
+   wp_page_reuse(vmf);
+   return VM_FAULT_WRITE;
}
unlock_page(vmf->page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-- 
2.6.6



[PATCH 13/21] mm: Pass vm_fault structure into do_page_mkwrite()

2016-11-01 Thread Jan Kara
We will need more information in the ->page_mkwrite() helper for DAX
to be able to fully finish faults there. Pass the vm_fault structure to
do_page_mkwrite() and use it there so that the information propagates
properly from the upper layers.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 mm/memory.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4da66c984c2c..c89f99c270bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2038,20 +2038,14 @@ static gfp_t __get_fault_gfp_mask(struct vm_area_struct 
*vma)
  *
  * We do this without the lock held, so that it can sleep if it needs to.
  */
-static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
-  unsigned long address)
+static int do_page_mkwrite(struct vm_fault *vmf)
 {
-   struct vm_fault vmf;
int ret;
+   struct page *page = vmf->page;
 
-   vmf.address = address;
-   vmf.pgoff = page->index;
-   vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
-   vmf.gfp_mask = __get_fault_gfp_mask(vma);
-   vmf.page = page;
-   vmf.cow_page = NULL;
+   vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
-   ret = vma->vm_ops->page_mkwrite(vma, &vmf);
+   ret = vmf->vma->vm_ops->page_mkwrite(vmf->vma, vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
return ret;
if (unlikely(!(ret & VM_FAULT_LOCKED))) {
@@ -2327,7 +2321,8 @@ static int wp_page_shared(struct vm_fault *vmf, struct 
page *old_page)
int tmp;
 
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   tmp = do_page_mkwrite(vma, old_page, vmf->address);
+   vmf->page = old_page;
+   tmp = do_page_mkwrite(vmf);
if (unlikely(!tmp || (tmp &
  (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(old_page);
@@ -3292,7 +3287,7 @@ static int do_shared_fault(struct vm_fault *vmf)
 */
if (vma->vm_ops->page_mkwrite) {
unlock_page(vmf->page);
-   tmp = do_page_mkwrite(vma, vmf->page, vmf->address);
+   tmp = do_page_mkwrite(vmf);
if (unlikely(!tmp ||
(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
put_page(vmf->page);
-- 
2.6.6



[PATCH 21/21] dax: Clear dirty entry tags on cache flush

2016-11-01 Thread Jan Kara
Currently we never clear dirty tags in DAX mappings, and thus the
address ranges to flush keep accumulating. Now that we have locking of
radix tree entries, we have all the locking necessary to reliably clear
the radix tree dirty tag when flushing caches for the corresponding
address range. Similarly to page_mkclean(), we also have to
write-protect pages so that the next write triggers a page fault and we
can mark the entry dirty again.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 fs/dax.c | 64 
 1 file changed, 64 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 63b4cebe3f20..5651d58de74c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -615,6 +616,59 @@ static void *dax_insert_mapping_entry(struct address_space 
*mapping,
return new_entry;
 }
 
+static inline unsigned long
+pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
+{
+   unsigned long address;
+
+   address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   return address;
+}
+
+/* Walk all mappings of a given index of a file and writeprotect them */
+static void dax_mapping_entry_mkclean(struct address_space *mapping,
+ pgoff_t index, unsigned long pfn)
+{
+   struct vm_area_struct *vma;
+   pte_t *ptep;
+   pte_t pte;
+   spinlock_t *ptl;
+   bool changed;
+
+   i_mmap_lock_read(mapping);
+   vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
+   unsigned long address;
+
+   cond_resched();
+
+   if (!(vma->vm_flags & VM_SHARED))
+   continue;
+
+   address = pgoff_address(index, vma);
+   changed = false;
+   if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
+   continue;
+   if (pfn != pte_pfn(*ptep))
+   goto unlock;
+   if (!pte_dirty(*ptep) && !pte_write(*ptep))
+   goto unlock;
+
+   flush_cache_page(vma, address, pfn);
+   pte = ptep_clear_flush(vma, address, ptep);
+   pte = pte_wrprotect(pte);
+   pte = pte_mkclean(pte);
+   set_pte_at(vma->vm_mm, address, ptep, pte);
+   changed = true;
+unlock:
+   pte_unmap_unlock(ptep, ptl);
+
+   if (changed)
+   mmu_notifier_invalidate_page(vma->vm_mm, address);
+   }
+   i_mmap_unlock_read(mapping);
+}
+
 static int dax_writeback_one(struct block_device *bdev,
struct address_space *mapping, pgoff_t index, void *entry)
 {
@@ -688,7 +742,17 @@ static int dax_writeback_one(struct block_device *bdev,
goto unmap;
}
 
+   dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
wb_cache_pmem(dax.addr, dax.size);
+   /*
+* After we have flushed the cache, we can clear the dirty tag. There
+* cannot be new dirty data in the pfn after the flush has completed as
+* the pfn mappings are writeprotected and fault waits for mapping
+* entry lock.
+*/
+   spin_lock_irq(&mapping->tree_lock);
+   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
+   spin_unlock_irq(&mapping->tree_lock);
  unmap:
dax_unmap_atomic(bdev, &dax);
put_locked_mapping_entry(mapping, index, entry);
-- 
2.6.6
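
The pgoff_address() helper added here is just arithmetic: it maps a
file page offset back to a user virtual address inside a VMA (the
inverse of linear_page_index()). A stand-alone sketch of that
computation with made-up numbers, assuming 4K pages:

#include <stdio.h>

#define PAGE_SHIFT 12UL   /* assumed 4K pages for the example */

/* Mirrors pgoff_address() from the patch: file offset -> virtual address. */
static unsigned long pgoff_address(unsigned long pgoff, unsigned long vm_start,
                                   unsigned long vm_pgoff)
{
        return vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT);
}

int main(void)
{
        /* Example: a VMA mapped at 0x7f0000000000 starting at file page 16. */
        unsigned long vm_start = 0x7f0000000000UL;
        unsigned long vm_pgoff = 16;
        unsigned long index = 20;       /* radix tree index being written back */

        printf("index %lu maps to address %#lx\n",
               index, pgoff_address(index, vm_start, vm_pgoff));
        /* prints 0x7f0000004000: four pages past vm_start */
        return 0;
}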



[PATCH 17/21] mm: Change return values of finish_mkwrite_fault()

2016-11-01 Thread Jan Kara
Currently finish_mkwrite_fault() returns 0 when the PTE got changed
before we acquired the PTE lock and VM_FAULT_WRITE when we succeeded in
modifying the PTE. This is somewhat confusing since 0 generally means
success; it is also inconsistent with finish_fault(), which returns 0
on success. Change finish_mkwrite_fault() to return 0 on success and
VM_FAULT_NOPAGE when the PTE changed. Practically, there should be no
behavioral difference since we bail out from the fault the same way
regardless of whether we return 0, VM_FAULT_NOPAGE, or VM_FAULT_WRITE.
Also note that VM_FAULT_WRITE has no effect for shared mappings since
the only two places that check it - KSM and GUP - care about private
mappings only. Generally the meaning of VM_FAULT_WRITE for shared
mappings is not well defined and we should probably clean that up.

Signed-off-by: Jan Kara 
---
 mm/memory.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1517ff91c743..b3bd6b6c6472 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2296,10 +2296,10 @@ int finish_mkwrite_fault(struct vm_fault *vmf)
 */
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
-   return 0;
+   return VM_FAULT_NOPAGE;
}
wp_page_reuse(vmf);
-   return VM_FAULT_WRITE;
+   return 0;
 }
 
 /*
@@ -2342,8 +2342,7 @@ static int wp_page_shared(struct vm_fault *vmf)
return tmp;
}
tmp = finish_mkwrite_fault(vmf);
-   if (unlikely(!tmp || (tmp &
- (VM_FAULT_ERROR | VM_FAULT_NOPAGE {
+   if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
unlock_page(vmf->page);
put_page(vmf->page);
return tmp;
-- 
2.6.6
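
To see why the commit message can claim there is no behavioral
difference, the little stand-alone program below walks both caller
checks from wp_page_shared() over the two possible outcomes; the flag
values are illustrative stand-ins, not taken from mm.h.

#include <stdio.h>

/* Illustrative stand-ins; the real values live in include/linux/mm.h. */
#define VM_FAULT_ERROR  0x0001
#define VM_FAULT_NOPAGE 0x0100
#define VM_FAULT_WRITE  0x0008

int main(void)
{
        /* finish_mkwrite_fault() results: [0] = success, [1] = PTE changed. */
        int old_ret[2] = { VM_FAULT_WRITE, 0 };
        int new_ret[2] = { 0, VM_FAULT_NOPAGE };
        int i;

        for (i = 0; i < 2; i++) {
                /* Old check in wp_page_shared(): bail on 0 or error/nopage. */
                int old_bail = !old_ret[i] ||
                               (old_ret[i] & (VM_FAULT_ERROR | VM_FAULT_NOPAGE));
                /* New check: bail only on error/nopage bits. */
                int new_bail = (new_ret[i] & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)) != 0;

                printf("%s: old convention bails=%d, new convention bails=%d\n",
                       i == 0 ? "success" : "PTE changed", old_bail, new_bail);
        }
        return 0;
}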



[PATCH 20/20] dax: Clear dirty entry tags on cache flush

2016-11-01 Thread Jan Kara
Currently we never clear dirty tags in DAX mappings, and thus the
address ranges to flush keep accumulating. Now that we have locking of
radix tree entries, we have all the locking necessary to reliably clear
the radix tree dirty tag when flushing caches for the corresponding
address range. Similarly to page_mkclean(), we also have to
write-protect pages so that the next write triggers a page fault and we
can mark the entry dirty again.

Signed-off-by: Jan Kara 
---
 fs/dax.c | 64 
 1 file changed, 64 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index a2d3781c9f4e..233f548d298e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -668,6 +669,59 @@ static void *dax_insert_mapping_entry(struct address_space 
*mapping,
return new_entry;
 }
 
+static inline unsigned long
+pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
+{
+   unsigned long address;
+
+   address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   return address;
+}
+
+/* Walk all mappings of a given index of a file and writeprotect them */
+static void dax_mapping_entry_mkclean(struct address_space *mapping,
+ pgoff_t index, unsigned long pfn)
+{
+   struct vm_area_struct *vma;
+   pte_t *ptep;
+   pte_t pte;
+   spinlock_t *ptl;
+   bool changed;
+
+   i_mmap_lock_read(mapping);
+   vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
+   unsigned long address;
+
+   cond_resched();
+
+   if (!(vma->vm_flags & VM_SHARED))
+   continue;
+
+   address = pgoff_address(index, vma);
+   changed = false;
+   if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
+   continue;
+   if (pfn != pte_pfn(*ptep))
+   goto unlock;
+   if (!pte_dirty(*ptep) && !pte_write(*ptep))
+   goto unlock;
+
+   flush_cache_page(vma, address, pfn);
+   pte = ptep_clear_flush(vma, address, ptep);
+   pte = pte_wrprotect(pte);
+   pte = pte_mkclean(pte);
+   set_pte_at(vma->vm_mm, address, ptep, pte);
+   changed = true;
+unlock:
+   pte_unmap_unlock(ptep, ptl);
+
+   if (changed)
+   mmu_notifier_invalidate_page(vma->vm_mm, address);
+   }
+   i_mmap_unlock_read(mapping);
+}
+
 static int dax_writeback_one(struct block_device *bdev,
struct address_space *mapping, pgoff_t index, void *entry)
 {
@@ -735,7 +789,17 @@ static int dax_writeback_one(struct block_device *bdev,
goto unmap;
}
 
+   dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
wb_cache_pmem(dax.addr, dax.size);
+   /*
+* After we have flushed the cache, we can clear the dirty tag. There
+* cannot be new dirty data in the pfn after the flush has completed as
+* the pfn mappings are writeprotected and fault waits for mapping
+* entry lock.
+*/
+   spin_lock_irq(&mapping->tree_lock);
+   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
+   spin_unlock_irq(&mapping->tree_lock);
 unmap:
dax_unmap_atomic(bdev, &dax);
put_locked_mapping_entry(mapping, index, entry);
-- 
2.6.6



[PATCH 19/21] dax: Make cache flushing protected by entry lock

2016-11-01 Thread Jan Kara
Currently, flushing of caches for DAX mappings ignores the entry lock.
So far this was OK (modulo a bug where a difference in the entry lock
bit could cause cache flushing to be mistakenly skipped), but in the
following patches we will write-protect PTEs on cache flushing and
clear dirty tags. For that we will need more exclusion, so do the cache
flushing under an entry lock. As a bonus, this allows us to remove one
lock-unlock pair of mapping->tree_lock.

Reviewed-by: Ross Zwisler 
Signed-off-by: Jan Kara 
---
 fs/dax.c | 61 +++--
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b4ea4bcca9e7..857117b6db6b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -619,32 +619,50 @@ static int dax_writeback_one(struct block_device *bdev,
struct address_space *mapping, pgoff_t index, void *entry)
 {
struct radix_tree_root *page_tree = &mapping->page_tree;
-   struct radix_tree_node *node;
struct blk_dax_ctl dax;
-   void **slot;
+   void *entry2, **slot;
int ret = 0;
 
-   spin_lock_irq(&mapping->tree_lock);
/*
-* Regular page slots are stabilized by the page lock even
-* without the tree itself locked.  These unlocked entries
-* need verification under the tree lock.
+* A page got tagged dirty in DAX mapping? Something is seriously
+* wrong.
 */
-   if (!__radix_tree_lookup(page_tree, index, &node, &slot))
-   goto unlock;
-   if (*slot != entry)
-   goto unlock;
-
-   /* another fsync thread may have already written back this entry */
-   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-   goto unlock;
+   if (WARN_ON(!radix_tree_exceptional_entry(entry)))
+   return -EIO;
 
+   spin_lock_irq(&mapping->tree_lock);
+   entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
+   /* Entry got punched out / reallocated? */
+   if (!entry2 || !radix_tree_exceptional_entry(entry2))
+   goto put_unlocked;
+   /*
+* Entry got reallocated elsewhere? No need to writeback. We have to
+* compare sectors as we must not bail out due to difference in lockbit
+* or entry type.
+*/
+   if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+   goto put_unlocked;
if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
dax_is_zero_entry(entry))) {
ret = -EIO;
-   goto unlock;
+   goto put_unlocked;
}
 
+   /* Another fsync thread may have already written back this entry */
+   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+   goto put_unlocked;
+   /* Lock the entry to serialize with page faults */
+   entry = lock_slot(mapping, slot);
+   /*
+* We can clear the tag now but we have to be careful so that concurrent
+* dax_writeback_one() calls for the same index cannot finish before we
+* actually flush the caches. This is achieved as the calls will look
+* at the entry only under tree_lock and once they do that they will
+* see the entry locked and wait for it to unlock.
+*/
+   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+   spin_unlock_irq(&mapping->tree_lock);
+
/*
 * Even if dax_writeback_mapping_range() was given a wbc->range_start
 * in the middle of a PMD, the 'index' we are given will be aligned to
@@ -654,15 +672,16 @@ static int dax_writeback_one(struct block_device *bdev,
 */
dax.sector = dax_radix_sector(entry);
dax.size = PAGE_SIZE << dax_radix_order(entry);
-   spin_unlock_irq(&mapping->tree_lock);
 
/*
 * We cannot hold tree_lock while calling dax_map_atomic() because it
 * eventually calls cond_resched().
 */
ret = dax_map_atomic(bdev, &dax);
-   if (ret < 0)
+   if (ret < 0) {
+   put_locked_mapping_entry(mapping, index, entry);
return ret;
+   }
 
if (WARN_ON_ONCE(ret < dax.size)) {
ret = -EIO;
@@ -670,15 +689,13 @@ static int dax_writeback_one(struct block_device *bdev,
}
 
wb_cache_pmem(dax.addr, dax.size);
-
-   spin_lock_irq(&mapping->tree_lock);
-   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
-   spin_unlock_irq(&mapping->tree_lock);
  unmap:
dax_unmap_atomic(bdev, &dax);
+   put_locked_mapping_entry(mapping, index, entry);
return ret;
 
- unlock:
+ put_unlocked:
+   put_unlocked_mapping_entry(mapping, index, entry2);
spin_unlock_irq(&mapping->tree_lock);
return ret;
 }
-- 
2.6.6


[PATCH 19/20] dax: Protect PTE modification on WP fault by radix tree entry lock

2016-11-01 Thread Jan Kara
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite()
has released corresponding radix tree entry lock. When we want to
writeprotect PTE on cache flush, we need PTE modification to happen
under radix tree entry lock to ensure consistent updates of PTE and radix
tree (standard faults use page lock to ensure this consistency). So move
update of PTE bit into dax_pfn_mkwrite().
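
In outline, the write-protect fault path after this change looks as follows (a condensed sketch of the dax_pfn_mkwrite() hunk below, with error handling omitted):

	spin_lock_irq(&mapping->tree_lock);
	entry = get_unlocked_mapping_entry(mapping, index, &slot);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	entry = lock_slot(mapping, slot);		/* take the radix tree entry lock */
	spin_unlock_irq(&mapping->tree_lock);
	finish_mkwrite_fault(vmf);			/* update the PTE under the entry lock */
	put_locked_mapping_entry(mapping, index, entry);
	return VM_FAULT_NOPAGE;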

Signed-off-by: Jan Kara 
---
 fs/dax.c| 22 --
 mm/memory.c |  2 +-
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c6cadf8413a3..a2d3781c9f4e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1163,17 +1163,27 @@ int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
-   void *entry;
+   void *entry, **slot;
pgoff_t index = vmf->pgoff;
 
spin_lock_irq(&mapping->tree_lock);
-   entry = get_unlocked_mapping_entry(mapping, index, NULL);
-   if (!entry || !radix_tree_exceptional_entry(entry))
-   goto out;
+   entry = get_unlocked_mapping_entry(mapping, index, &slot);
+   if (!entry || !radix_tree_exceptional_entry(entry)) {
+   if (entry)
+   put_unlocked_mapping_entry(mapping, index, entry);
+   spin_unlock_irq(&mapping->tree_lock);
+   return VM_FAULT_NOPAGE;
+   }
radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
-   put_unlocked_mapping_entry(mapping, index, entry);
-out:
+   entry = lock_slot(mapping, slot);
spin_unlock_irq(&mapping->tree_lock);
+   /*
+* If we race with somebody updating the PTE and finish_mkwrite_fault()
+* fails, we don't care. We need to return VM_FAULT_NOPAGE and retry
+* the fault in either case.
+*/
+   finish_mkwrite_fault(vmf);
+   put_locked_mapping_entry(mapping, index, entry);
return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/mm/memory.c b/mm/memory.c
index e7a4a30a5e88..5fa3d0c5196e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2310,7 +2310,7 @@ static int wp_pfn_shared(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
vmf->flags |= FAULT_FLAG_MKWRITE;
ret = vma->vm_ops->pfn_mkwrite(vma, vmf);
-   if (ret & VM_FAULT_ERROR)
+   if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))
return ret;
return finish_mkwrite_fault(vmf);
}
-- 
2.6.6



[PATCH 07/21] mm: Add orig_pte field into vm_fault

2016-11-01 Thread Jan Kara
Add orig_pte field to vm_fault structure to allow ->page_mkwrite
handlers to fully handle the fault. This also allows us to save some
passing of extra arguments around.
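
As an illustration, a caller that previously had to pass the original PTE value alongside the vm_fault structure can now simply record it there (a sketch based on the khugepaged hunk below):

	struct vm_fault vmf = {
		.vma = vma,
		.address = address,
		.pmd = pmd,
	};

	vmf.pte = pte_offset_map(pmd, address);
	vmf.orig_pte = *vmf.pte;		/* PTE value observed at fault time */
	if (is_swap_pte(vmf.orig_pte))
		ret = do_swap_page(&vmf);	/* no separate orig_pte argument */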

Signed-off-by: Jan Kara 
---
 include/linux/mm.h |  4 +--
 mm/internal.h  |  2 +-
 mm/khugepaged.c|  7 ++---
 mm/memory.c| 82 +++---
 4 files changed, 47 insertions(+), 48 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a4ebe3c67c6..f8e758060851 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -298,8 +298,8 @@ struct vm_fault {
pgoff_t pgoff;  /* Logical page offset based on vma */
unsigned long address;  /* Faulting virtual address */
pmd_t *pmd; /* Pointer to pmd entry matching
-* the 'address'
-*/
+* the 'address' */
+   pte_t orig_pte; /* Value of PTE at the time of fault */
 
struct page *cow_page;  /* Handler may choose to COW */
struct page *page;  /* ->fault handlers should return a
diff --git a/mm/internal.h b/mm/internal.h
index 093b1eacc91b..44d68895a9b9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -36,7 +36,7 @@
 /* Do not use these with a slab allocator */
 #define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
 
-int do_swap_page(struct vm_fault *vmf, pte_t orig_pte);
+int do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d7df06383b10..1f20f25fe029 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -873,7 +873,6 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
unsigned long address, pmd_t *pmd,
int referenced)
 {
-   pte_t pteval;
int swapped_in = 0, ret = 0;
struct vm_fault vmf = {
.vma = vma,
@@ -891,11 +890,11 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
vmf.pte = pte_offset_map(pmd, address);
for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
vmf.pte++, vmf.address += PAGE_SIZE) {
-   pteval = *vmf.pte;
-   if (!is_swap_pte(pteval))
+   vmf.orig_pte = *vmf.pte;
+   if (!is_swap_pte(vmf.orig_pte))
continue;
swapped_in++;
-   ret = do_swap_page(&vmf, pteval);
+   ret = do_swap_page(&vmf);
 
/* do_swap_page returns VM_FAULT_RETRY with released mmap_sem */
if (ret & VM_FAULT_RETRY) {
diff --git a/mm/memory.c b/mm/memory.c
index 5f6bc9028a88..25028422a578 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2074,8 +2074,8 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
  */
-static inline int wp_page_reuse(struct vm_fault *vmf, pte_t orig_pte,
-   struct page *page, int page_mkwrite, int dirty_shared)
+static inline int wp_page_reuse(struct vm_fault *vmf, struct page *page,
+   int page_mkwrite, int dirty_shared)
__releases(vmf->ptl)
 {
struct vm_area_struct *vma = vmf->vma;
@@ -2088,8 +2088,8 @@ static inline int wp_page_reuse(struct vm_fault *vmf, pte_t orig_pte,
if (page)
page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
 
-   flush_cache_page(vma, vmf->address, pte_pfn(orig_pte));
-   entry = pte_mkyoung(orig_pte);
+   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
+   entry = pte_mkyoung(vmf->orig_pte);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
@@ -2139,8 +2139,7 @@ static inline int wp_page_reuse(struct vm_fault *vmf, pte_t orig_pte,
  *   held to the old page, as well as updating the rmap.
 * - In any case, unlock the PTL and drop the reference we took to the old page.
  */
-static int wp_page_copy(struct vm_fault *vmf, pte_t orig_pte,
-   struct page *old_page)
+static int wp_page_copy(struct vm_fault *vmf, struct page *old_page)
 {
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
@@ -2154,7 +2153,7 @@ static int wp_page_copy(struct vm_fault *vmf, pte_t orig_pte,
if (unlikely(anon_vma_prepare(vma)))
goto oom;
 
-   if (is_zero_pfn(pte_pfn(orig_pte))) {
+   if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movab

Re: [PATCH 0/11] ext4: Convert ext4 DAX IO to iomap framework

2016-11-01 Thread Ross Zwisler
On Wed, Nov 02, 2016 at 09:12:35AM +1100, Dave Chinner wrote:
> On Tue, Nov 01, 2016 at 10:06:10PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > this patch set converts ext4 DAX IO paths to the new iomap framework and
> > removes the old bh-based DAX functions. As a result ext4 gains PMD page
> > fault support, also some other minor bugs get fixed. The patch set is based
> > on Ross' DAX PMD page fault support series [1]. It passes xfstests both in
> > DAX and non-DAX mode.
> > 
> > The question is how shall we merge this. If Dave is pulling PMD patches 
> > through
> > XFS tree, then these patches could go there as well (chances for conflicts
> > with other ext4 stuff are relatively low) or Dave could just export a stable
> > branch with PMD series which Ted would just pull...
> 
> I plan to grab Ross's PMD series in the next couple of days and I'll
> push it out as a stable topic branch once I've sanity tested it.  I
> don't really want to take a big chunk of ext4 stuff through the XFS
> tree if it can be avoided

Yea, we also need to figure out how to get Jan's "dax: Clear dirty bits after
flushing caches" set merged, which is mostly MM stuff and I think will go
through akpm's tree?  That set is also based on my PMD stuff.


Re: [PATCH 0/11] ext4: Convert ext4 DAX IO to iomap framework

2016-11-01 Thread Jan Kara
On Tue 01-11-16 16:45:50, Ross Zwisler wrote:
> On Wed, Nov 02, 2016 at 09:12:35AM +1100, Dave Chinner wrote:
> > On Tue, Nov 01, 2016 at 10:06:10PM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > this patch set converts ext4 DAX IO paths to the new iomap framework and
> > > removes the old bh-based DAX functions. As a result ext4 gains PMD page
> > > fault support, also some other minor bugs get fixed. The patch set is 
> > > based
> > > on Ross' DAX PMD page fault support series [1]. It passes xfstests both in
> > > DAX and non-DAX mode.
> > > 
> > > The question is how shall we merge this. If Dave is pulling PMD patches 
> > > through
> > > XFS tree, then these patches could go there as well (chances for conflicts
> > > with other ext4 stuff are relatively low) or Dave could just export a 
> > > stable
> > > branch with PMD series which Ted would just pull...
> > 
> > I plan to grab Ross's PMD series in the next couple of days and I'll
> > push it out as a stable topic branch once I've sanity tested it.  I
> > don't really want to take a big chunk of ext4 stuff through the XFS
> > tree if it can be avoided
> 
> Yea, we also need to figure out how to get Jan's "dax: Clear dirty bits after
> flushing caches" set merged, which is mostly MM stuff and I think will go
> through akpm's tree?  That set is also based on my PMD stuff.

Yeah, I've spoken to Andrew and he wants to take the MM changes through his
tree. I'll talk to him how to make this happen given the patches the series
depends on but the series still needs some review so "how to merge" is not
exactly a question of the day...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/21 v4] dax: Clear dirty bits after flushing caches

2016-11-01 Thread Jan Kara
Hi,

forgot to add Kirill to CC since this modifies the fault path he changed
recently. I don't want to resend the whole series just because of this so
at least I'm pinging him like this...

Honza
On Tue 01-11-16 23:36:06, Jan Kara wrote:
> Hello,
> 
> this is the fourth revision of my patches to clear dirty bits from radix tree
> of DAX inodes when caches for corresponding pfns have been flushed. This patch
> set is significantly larger than the previous version because I'm changing how
> ->fault, ->page_mkwrite, and ->pfn_mkwrite handlers may choose to handle the
> fault so that we don't have to leak details about DAX locking into the generic
> code. In principle, these patches enable handlers to easily update PTEs and do
> other work necessary to finish the fault without duplicating the functionality
> present in the generic code. I'd really like feedback from mm folks on whether
> such changes to fault handling code are fine or what they'd do differently.
> 
> The patches are based on 4.9-rc1 + Ross' DAX PMD page fault series [1] + ext4
> conversion of DAX IO patch to the iomap infrastructure [2]. For testing,
> I've pushed out a tree including all these patches and further DAX fixes
> to:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git dax
> 
> The patches pass testing with xfstests on ext4 and xfs on my end. I'd be
> grateful for review so that we can push these patches for the next merge
> window.
> 
> [1] http://www.spinics.net/lists/linux-mm/msg115247.html
> [2] Posted an hour ago - look for "ext4: Convert ext4 DAX IO to iomap 
> framework"
> 
> Changes since v3:
> * rebased on top of 4.9-rc1 + DAX PMD fault series + ext4 iomap conversion
> * reordered some of the patches
> * killed ->virtual_address field in vm_fault structure as requested by
>   Christoph
> 
> Changes since v2:
> * rebased on top of 4.8-rc8 - this involved dealing with new fault_env
>   structure
> * changed calling convention for fault helpers
> 
> Changes since v1:
> * make sure all PTE updates happen under radix tree entry lock to protect
>   against races between faults & write-protecting code
> * remove information about DAX locking from mm/memory.c
> * smaller updates based on Ross' feedback
> 
> 
> Background information regarding the motivation:
> 
> Currently we never clear dirty bits in the radix tree of a DAX inode. Thus
> fsync(2) flushes all the dirty pfns again and again. This patches implement
> clearing of the dirty tag in the radix tree so that we issue flush only when
> needed.
> 
> The difficulty with clearing the dirty tag is that we have to protect against
> a concurrent page fault setting the dirty tag and writing new data into the
> page. So we need a lock serializing page fault and clearing of the dirty tag
> and write-protecting PTEs (so that we get another pagefault when pfn is 
> written
> to again and we have to set the dirty tag again).
> 
> The effect of the patch set is easily visible:
> 
> Writing 1 GB of data via mmap, then fsync twice.
> 
> Before this patch set both fsyncs take ~205 ms on my test machine, after the
> patch set the first fsync takes ~283 ms (the additional cost of walking PTEs,
> clearing dirty bits etc. is very noticeable), the second fsync takes below
> 1 us.
> 
> As a bonus, these patches make filesystem freezing for DAX filesystems
> reliable because mappings are now properly writeprotected while freezing the
> fs.
>   Honza
-- 
Jan Kara 
SUSE Labs, CR
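
For reference, the measurement described in the quoted cover letter (write 1 GB through an mmap of a file on a DAX filesystem, then fsync twice) can be reproduced with a small userspace test along the following lines. This is only a sketch; the file path, flags and output format are assumptions, not taken from the original posting:

	/* daxsync.c: dirty 1 GiB of a DAX-mapped file via mmap, then time two fsyncs.
	 * Build: cc -O2 -o daxsync daxsync.c
	 * Run:   ./daxsync /mnt/pmem/testfile   (any file on a DAX-mounted fs)
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>
	#include <unistd.h>

	#define SIZE (1UL << 30)	/* 1 GiB */

	static double now(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	int main(int argc, char **argv)
	{
		double t;
		char *p;
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <file on DAX fs>\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDWR | O_CREAT, 0644);
		if (fd < 0 || ftruncate(fd, SIZE) < 0) {
			perror("open/ftruncate");
			return 1;
		}
		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 0xab, SIZE);	/* dirty every page through the mapping */

		t = now();
		fsync(fd);		/* flushes caches for all dirty pfns */
		printf("first fsync:  %.3f ms\n", (now() - t) * 1000);

		t = now();
		fsync(fd);		/* nearly free once dirty tags get cleared */
		printf("second fsync: %.3f ms\n", (now() - t) * 1000);

		munmap(p, SIZE);
		close(fd);
		return 0;
	}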


Re: [PATCH 02/21] mm: Use vmf->address instead of of vmf->virtual_address

2016-11-01 Thread Hillf Danton
On Wednesday, November 02, 2016 6:36 AM Jan Kara wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8e8b76d11bb4..2a4ebe3c67c6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -297,8 +297,6 @@ struct vm_fault {
>   gfp_t gfp_mask; /* gfp mask to be used for allocations */
>   pgoff_t pgoff;  /* Logical page offset based on vma */
>   unsigned long address;  /* Faulting virtual address */
> - void __user *virtual_address;   /* Faulting virtual address masked by
> -  * PAGE_MASK */
>   pmd_t *pmd; /* Pointer to pmd entry matching
>* the 'address'
>*/
Do we already have a pmd field at this point?

In [PATCH 01/20] mm: Change type of vmf->virtual_address we see
[1] __user * gone, and
[2] no 'address' field added,
so I suspect a stray merge occurred.

btw, s:01/20:01/21: in the subject line?

Hillf

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef815b9cd426..a5636d646022 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -295,7 +295,7 @@ struct vm_fault {
>   unsigned int flags; /* FAULT_FLAG_xxx flags */
>   gfp_t gfp_mask; /* gfp mask to be used for allocations */
>   pgoff_t pgoff;  /* Logical page offset based on vma */
> - void __user *virtual_address;   /* Faulting virtual address */
> + unsigned long virtual_address;  /* Faulting virtual address */
> 
>   struct page *cow_page;  /* Handler may choose to COW */
>   struct page *page;  /* ->fault handlers should return a




Re: [PATCH 0/21 v4] dax: Clear dirty bits after flushing caches

2016-11-01 Thread Ross Zwisler
On Tue, Nov 01, 2016 at 11:36:06PM +0100, Jan Kara wrote:
> Hello,
> 
> this is the fourth revision of my patches to clear dirty bits from radix tree
> of DAX inodes when caches for corresponding pfns have been flushed. This patch
> set is significantly larger than the previous version because I'm changing how
> ->fault, ->page_mkwrite, and ->pfn_mkwrite handlers may choose to handle the
> fault so that we don't have to leak details about DAX locking into the generic
> code. In principle, these patches enable handlers to easily update PTEs and do
> other work necessary to finish the fault without duplicating the functionality
> present in the generic code. I'd really like feedback from mm folks on whether
> such changes to fault handling code are fine or what they'd do differently.
> 
> The patches are based on 4.9-rc1 + Ross' DAX PMD page fault series [1] + ext4
> conversion of DAX IO patch to the iomap infrastructure [2]. For testing,
> I've pushed out a tree including all these patches and further DAX fixes
> to:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git dax

In my testing I hit what I believe to be a new lockdep splat.  This was
produced with ext4+dax+generic/246, though I've tried several times to
reproduce it and haven't been able.  This testing was done with your tree plus
one patch to fix the DAX PMD recursive fallback issue that you reported.  This
new patch is folded into v9 of my PMD series that I sent out earlier today.

I've posted the tree I was testing with here:

https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=jan_dax

Here is the lockdep splat, passed through kasan_symbolize:

run fstests generic/246 at 2016-11-01 21:51:34

==
[ INFO: possible circular locking dependency detected ]
4.9.0-rc1-00165-g13826b5 #2 Not tainted
---
t_mmap_writev/13704 is trying to acquire lock:
 ([ 3522.320075] &ei->i_mmap_sem
){.+}[ 3522.320924] , at:
[] ext4_dax_fault+0x36/0xd0 fs/ext4/file.c:267

but task is already holding lock:
 ([ 3522.324135] jbd2_handle
){.+}[ 3522.324875] , at:
[] start_this_handle+0x110/0x440 fs/jbd2/transaction.c:361

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1[ 3522.330384]  (
jbd2_handle[ 3522.330889] ){.+}
:
   [] lock_acquire+0xf2/0x1e0 
kernel/locking/lockdep.c:3746
   [] start_this_handle+0x174/0x440 
fs/jbd2/transaction.c:389
   [] jbd2__journal_start+0xdb/0x290 
fs/jbd2/transaction.c:457
   [] __ext4_journal_start_sb+0x89/0x1d0 
fs/ext4/ext4_jbd2.c:76
   [< inline >] __ext4_journal_start fs/ext4/ext4_jbd2.h:318
   [] ext4_alloc_file_blocks.isra.34+0xef/0x310 
fs/ext4/extents.c:4701
   [< inline >] ext4_zero_range fs/ext4/extents.c:4850
   [] ext4_fallocate+0x974/0xae0 fs/ext4/extents.c:4952
   [] vfs_fallocate+0x15a/0x230 fs/open.c:320
   [< inline >] SYSC_fallocate fs/open.c:343
   [] SyS_fallocate+0x44/0x70 fs/open.c:337
   [] entry_SYSCALL_64_fastpath+0x1f/0xc2 
arch/x86/entry/entry_64.S:209

-> #0[ 3522.342547]  (
&ei->i_mmap_sem[ 3522.343023] ){.+}
:
   [< inline >] check_prev_add kernel/locking/lockdep.c:1829
   [< inline >] check_prevs_add kernel/locking/lockdep.c:1939
   [< inline >] validate_chain kernel/locking/lockdep.c:2266
   [] __lock_acquire+0x127f/0x14d0 
kernel/locking/lockdep.c:3335
   [] lock_acquire+0xf2/0x1e0 
kernel/locking/lockdep.c:3746
   [] down_read+0x3e/0xa0 kernel/locking/rwsem.c:22
   [] ext4_dax_fault+0x36/0xd0 fs/ext4/file.c:267
   [] __do_fault+0x21/0x130 mm/memory.c:2872
   [< inline >] do_read_fault mm/memory.c:3231
   [< inline >] do_fault mm/memory.c:
   [< inline >] handle_pte_fault mm/memory.c:3534
   [< inline >] __handle_mm_fault mm/memory.c:3624
   [] handle_mm_fault+0x114e/0x1550 mm/memory.c:3661
   [] __do_page_fault+0x247/0x4f0 arch/x86/mm/fault.c:1397
   [] trace_do_page_fault+0x5d/0x290 
arch/x86/mm/fault.c:1490
   [] do_async_page_fault+0x1a/0xa0 
arch/x86/kernel/kvm.c:265
   [] async_page_fault+0x28/0x30 
arch/x86/entry/entry_64.S:1015
   [< inline >] arch_copy_from_iter_pmem 
./arch/x86/include/asm/pmem.h:95
   [< inline >] copy_from_iter_pmem ./include/linux/pmem.h:118
   [] dax_iomap_actor+0x147/0x270 fs/dax.c:1027
   [] iomap_apply+0xb3/0x130 fs/iomap.c:78
   [] dax_iomap_rw+0x76/0xa0 fs/dax.c:1067
   [< inline >] ext4_dax_write_iter fs/ext4/file.c:196
   [] ext4_file_write_iter+0x243/0x340 fs/ext4/file.c:217
   [] do_iter_readv_writev+0xb1/0x130 fs/read_write.c:695
   [] do_readv_writev+0x1a4/0x250 fs/read_write.c:872
   [] vfs_writev+0x3f/0x50 fs/read_write.c:911
   [] do_writev+0x64/0x100 fs/read_write.c:944
   [< inline