Re: [PATCH v3 09/11] dax: add struct iomap based DAX PMD support

2016-09-27 Thread Dave Chinner
On Tue, Sep 27, 2016 at 02:48:00PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This patch allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled using the new struct
> iomap based fault handlers.
> 
> There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
> mappings that have an associated block allocation, and 4k DAX empty
> entries.  The empty entries exist to provide locking for the duration of a
> given page fault.
> 
> This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
> entries, PMD DAX entries that have associated block allocations, and 2 MiB
> DAX empty entries.
> 
> Unlike the 4k case where we insert a struct page* into the radix tree for
> 4k zero pages, for HZP we insert a DAX exceptional entry with the new
> RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
> every 2MiB hole mapping, and it doesn't make sense to have that same struct
> page* with multiple entries in multiple trees.  This would cause contention
> on the single page lock for the one Huge Zero Page, and it would break the
> page->index and page->mapping associations that are assumed to be valid in
> many other places in the kernel.
> 
> One difficult use case is when one thread is trying to use 4k entries in
> radix tree for a given offset, and another thread is using 2 MiB entries
> for that same offset.  The current code handles this by making the 2 MiB
> user fall back to 4k entries for most cases.  This was done because it is
> the simplest solution, and because the use of 2MiB pages is already
> opportunistic.
> 
> If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
> we run into the problem of how we lock out 4k page faults for the entire
> 2MiB range while we clean out the radix tree so we can insert the 2MiB
> entry.  We can solve this problem if we need to, but I think that the cases
> where both 2MiB entries and 4K entries are being used for the same range
> will be rare enough and the gain small enough that it probably won't be
> worth the complexity.
> 
> Signed-off-by: Ross Zwisler 

> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> + * more often than one might expect in the below functions.
> + */
> +#define PG_PMD_COLOUR((PMD_SIZE >> PAGE_SHIFT) - 1)
> +
> +static void __dax_pmd_dbg(struct iomap *iomap, unsigned long address,
> + const char *reason, const char *fn)
> +{
> + if (iomap) {
> + char bname[BDEVNAME_SIZE];
> +
> + bdevname(iomap->bdev, bname);
> + pr_debug("%s: %s addr %lx dev %s type %#x blkno %ld "
> + "offset %lld length %lld fallback: %s\n", fn,
> + current->comm, address, bname, iomap->type,
> + iomap->blkno, iomap->offset, iomap->length, reason);
> + } else {
> + pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
> + current->comm, address, reason);
> + }
> +}

Yuck! Tracepoints for debugging information like this, please, not
printk awfulness.

> +
> +#define dax_pmd_dbg(bh, address, reason) \
> + __dax_pmd_dbg(bh, address, reason, __func__)
> +
> +static int iomap_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> + struct vm_fault *vmf, unsigned long address,
> + struct iomap *iomap, loff_t pos, bool write, void **entryp)

Please put a "dax" in the function name. grepping, cscope, etc are
much easier when static function names are namespaced properly.

> +{
> + struct address_space *mapping = vma->vm_file->f_mapping;
> + struct block_device *bdev = iomap->bdev;
> + struct blk_dax_ctl dax = {
> + .sector = iomap_dax_sector(iomap, pos),
> + .size = PMD_SIZE,
> + };
> + long length = dax_map_atomic(bdev, );
> + void *ret;
> +
> + if (length < 0) {
> + dax_pmd_dbg(iomap, address, "dax-error fallback");
> + return VM_FAULT_FALLBACK;
> + }

Fails to unmap. Please use an goto based error stack. And
tracepoints make this much neater:

trace_dax_pmd_insert_mapping(iomap, address, , length);
if (length < 0)
goto unmap_fallback;
if (length < PMD_SIZE)
goto unmap_fallback;
.

trace_dax_pmd_insert_mapping_done(iomap, address, , length);
return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);

unmap_fallback:
dax_unmap_atomic(bdev, );
fallback:
trace_dax_pmd_insert_fallback(iomap, address, , length);
return VM_FAULT_FALLBACK;
}

i.e. we don't need need all those debug printks to tell us what
failed - the first tracepoint tells use everything about the context
we are about to check, and the last tracepoint 

[PATCH v3 09/11] dax: add struct iomap based DAX PMD support

2016-09-27 Thread Ross Zwisler
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases.  This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry.  We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler 
---
 fs/dax.c| 396 ++--
 include/linux/dax.h |  29 +++-
 mm/filemap.c|   4 +-
 3 files changed, 380 insertions(+), 49 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b5e7b13..13934d7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include 
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-   RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-   RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -400,19 +386,52 @@ static void *get_unlocked_mapping_entry(struct 
address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with type RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  Also, a 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+   unsigned long new_type)
 {
+   bool pmd_downgrade = false; /* splitting 2MiB entry