Re: [ndctl PATCH 4/8] ccan/list: add a list_add_after helper

2017-10-06 Thread Dan Williams
On Thu, Oct 5, 2017 at 6:54 PM, Vishal Verma  wrote:
> In preparation for the error-inject command, add a new helper to
> ccan/list for adding a list element in the middle of a list.
>
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  ccan/list/list.h | 32 
>  1 file changed, 32 insertions(+)

Let's not touch the ccan files since these should always match what is
in upstream ccan. Create a local util/list.h to add new routines. It
would also be nice to rename this to list_splice() to bring us closer
in line with kernel list primitives.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v4 0/3] acpi nfit: make emulation of translate SPA on nfit_test

2017-10-06 Thread Verma, Vishal L
On Thu, 2017-10-05 at 21:57 -0700, Dan Williams wrote:
> On Fri, Sep 22, 2017 at 12:44 AM, Yasunori Goto wrote:
> > 
> > Hello
> > 
> > I made a v4 patch set to emulate translate SPA by nfit_test.ko
> > module.
> > If possible, please merge it.
> > 
> > 
> > This patch set emulates the translate SPA feature in nfit_test.ko.
> > 
> > The nfit ACPI driver already supports the translate SPA interface
> > via ND_CMD_CALL, but nfit_test does not support it yet.
> > 
> > To test translate SPA with nfit_test, this patch set is needed.
> > 
> > ---
> > Change log since v3 [1]:
> >   - Rebase onto the current libnvdimm-for-next branch.
> >   - Fix lines over 80 characters and excess tabs, as pointed out
> > by Verma-san. (Thanks!)
> >   - Since ndctl now uses the name "nfit_cmd" instead of
> > "passthru", rename "passthru" to "nfit_cmd".
> > 
> > Change log since v2 [2]:
> >  - Make the definitions private to nfit_test.
> > (NFIT_CMD_TRANSLATE_SPA, struct nd_cmd_translate_spa, and
> > others)
> >  - Check region by kobj.name instead of using is_nd_pmem().
> > 
> > 
> > Change log since v1 [3]:
> >  - Separate the kernel patch set from the ndctl patch set.
> >  - Change interface via ND_CMD_CALL.
> > 
> > [1] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg05863
> > .html
> > [2] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg05582
> > .html
> > [3] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg05287
> > .html
> > 
> 
> Vishal, since you built on top of these can I add your Tested-by or
> Reviewed-by?

Yes, I meant to send a Reviewed-by, but might've forgotten.

Thanks,
-Vishal


Re: [PATCH 1/4] libnvdimm: move poison list functions to a new 'badrange' file

2017-10-06 Thread Dan Williams
On Thu, Oct 5, 2017 at 6:53 PM, Vishal Verma  wrote:
> From: Dave Jiang 
>
> nfit_test needs to use the poison list manipulation code as well. Make
> it more generic and in the process rename poison to badrange, and move
> all the related helpers to a new file.
>
> Signed-off-by: Dave Jiang 
> [vishal: add a missed include in bus.c for the new badrange functions]
> Signed-off-by: Vishal Verma 
> ---
>  drivers/acpi/nfit/core.c  |   2 +-
>  drivers/acpi/nfit/mce.c   |   2 +-
>  drivers/nvdimm/Makefile   |   1 +
>  drivers/nvdimm/badrange.c | 294 
> ++
>  drivers/nvdimm/bus.c  |  24 ++--
>  drivers/nvdimm/core.c | 260 +---
>  drivers/nvdimm/nd-core.h  |   3 +-
>  drivers/nvdimm/nd.h   |   6 -
>  include/linux/libnvdimm.h |  21 +++-
>  9 files changed, 331 insertions(+), 282 deletions(-)
>  create mode 100644 drivers/nvdimm/badrange.c
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index a3ecd5e..4b157f8 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2240,7 +2240,7 @@ static int ars_status_process_records(struct 
> acpi_nfit_desc *acpi_desc,
> if (ars_status->out_length
> < 44 + sizeof(struct nd_ars_record) * (i + 1))
> break;
> -   rc = nvdimm_bus_add_poison(nvdimm_bus,
> +   rc = nvdimm_bus_add_badrange(nvdimm_bus,
> ars_status->records[i].err_address,
> ars_status->records[i].length);
> if (rc)
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index feeb95d..b929214 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -67,7 +67,7 @@ static int nfit_handle_mce(struct notifier_block *nb, 
> unsigned long val,
> continue;
>
> /* If this fails due to an -ENOMEM, there is little we can do 
> */
> -   nvdimm_bus_add_poison(acpi_desc->nvdimm_bus,
> +   nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
> ALIGN(mce->addr, L1_CACHE_BYTES),
> L1_CACHE_BYTES);
> nvdimm_region_notify(nfit_spa->nd_region,
> diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> index 909554c..ca6d325 100644
> --- a/drivers/nvdimm/Makefile
> +++ b/drivers/nvdimm/Makefile
> @@ -20,6 +20,7 @@ libnvdimm-y += region_devs.o
>  libnvdimm-y += region.o
>  libnvdimm-y += namespace_devs.o
>  libnvdimm-y += label.o
> +libnvdimm-y += badrange.o
>  libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
>  libnvdimm-$(CONFIG_BTT) += btt_devs.o
>  libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
> diff --git a/drivers/nvdimm/badrange.c b/drivers/nvdimm/badrange.c
> new file mode 100644
> index 000..6ad782f
> --- /dev/null
> +++ b/drivers/nvdimm/badrange.c
> @@ -0,0 +1,294 @@
> +/*
> + * Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "nd-core.h"
> +#include "nd.h"
> +
> +void badrange_init(struct badrange *badrange)
> +{
> +   INIT_LIST_HEAD(&badrange->list);
> +   spin_lock_init(&badrange->lock);
> +}
> +EXPORT_SYMBOL_GPL(badrange_init);
> +
> +static void append_badrange_entry(struct badrange *badrange,
> +   struct badrange_entry *be, u64 addr, u64 length)
> +{
> +   lockdep_assert_held(&badrange->lock);
> +   be->start = addr;
> +   be->length = length;
> +   list_add_tail(&be->list, &badrange->list);
> +}

Small nit, can we rename the instance variable from 'be' to 'bre'?
'be' triggers all my 'big endian' neurons to fire. Other than that,
looks good.


[PATCH] Fix mpage_writepage() for pages with buffers

2017-10-06 Thread Matthew Wilcox

When using FAT on a block device which supports rw_page, we can hit
BUG_ON(!PageLocked(page)) in try_to_free_buffers().  This is because we
call clean_buffers() after unlocking the page we've written.  Introduce a
new clean_page_buffers() which cleans all buffers associated with a page
and call it from within bdev_write_page().

Reported-by: Toshi Kani 
Reported-by: OGAWA Hirofumi 
Tested-by: Toshi Kani 
Signed-off-by: Matthew Wilcox 
Cc: sta...@vger.kernel.org

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9941dc8342df..3fbe75bdd257 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -716,10 +716,12 @@ int bdev_write_page(struct block_device *bdev, sector_t 
sector,
 
set_page_writeback(page);
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, true);
-   if (result)
+   if (result) {
end_page_writeback(page);
-   else
+   } else {
+   clean_page_buffers(page);
unlock_page(page);
+   }
blk_queue_exit(bdev->bd_queue);
return result;
 }
diff --git a/fs/mpage.c b/fs/mpage.c
index 2e4c41ccb5c9..d97b003f1607 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -468,6 +468,16 @@ static void clean_buffers(struct page *page, unsigned 
first_unmapped)
try_to_free_buffers(page);
 }
 
+/*
+ * For situations where we want to clean all buffers attached to a page.
+ * We don't need to calculate how many buffers are attached to the page,
+ * we just need to specify a number larger than the maximum number of buffers.
+ */
+void clean_page_buffers(struct page *page)
+{
+   clean_buffers(page, PAGE_SIZE);
+}
+
 static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
  void *data)
 {
@@ -605,10 +615,8 @@ static int __mpage_writepage(struct page *page, struct 
writeback_control *wbc,
if (bio == NULL) {
if (first_unmapped == blocks_per_page) {
if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
-   page, wbc)) {
-   clean_buffers(page, first_unmapped);
+   page, wbc))
goto out;
-   }
}
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
BIO_MAX_PAGES, GFP_NOFS|__GFP_HIGH);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index c8dae555eccf..446b24cac67d 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -232,6 +232,7 @@ int generic_write_end(struct file *, struct address_space *,
loff_t, unsigned, unsigned,
struct page *, void *);
 void page_zero_new_buffers(struct page *page, unsigned from, unsigned to);
+void clean_page_buffers(struct page *page);
 int cont_write_begin(struct file *, struct address_space *, loff_t,
unsigned, unsigned, struct page **, void **,
get_block_t *, loff_t *);


[PATCH v7 02/12] fs, mm: pass fd to ->mmap_validate()

2017-10-06 Thread Dan Williams
The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
block map changes while the file is mapped. It requires the fd to set up
an fasync_struct for signalling lease break events to the lease holder.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Andrew Morton 
Signed-off-by: Dan Williams 
---
 arch/mips/kernel/vdso.c |2 +-
 arch/tile/mm/elf.c  |2 +-
 arch/x86/mm/mpx.c   |3 ++-
 fs/aio.c|2 +-
 include/linux/fs.h  |2 +-
 include/linux/mm.h  |9 +
 ipc/shm.c   |3 ++-
 mm/internal.h   |2 +-
 mm/mmap.c   |   13 +++--
 mm/nommu.c  |5 +++--
 mm/util.c   |7 ---
 11 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index cf10654477a9..ab26c7ac0316 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, 
int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-  0, NULL, 0);
+  0, NULL, 0, -1);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 5ffcbe76aef9..61a9588e141a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
addr = mmap_region(NULL, addr, INTRPT_SIZE,
   VM_READ|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
-  NULL, 0);
+  NULL, 0, -1);
if (addr > (unsigned long) -PAGE_SIZE)
retval = (int) addr;
}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9ceaa955d2ba..a8baa94a496b 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len)
 
down_write(&mm->mmap_sem);
addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-  MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
+   MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate,
+   NULL, -1);
up_write(&mm->mmap_sem);
if (populate)
mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 5a2487217072..d10ca6db2ee6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int 
nr_events)
 
ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
   PROT_READ | PROT_WRITE,
-  MAP_SHARED, 0, &unused, NULL);
+  MAP_SHARED, 0, &unused, NULL, -1);
up_write(&mm->mmap_sem);
if (IS_ERR((void *)ctx->mmap_base)) {
ctx->mmap_size = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 51538958f7f5..c2b9bf3dc4e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1702,7 +1702,7 @@ struct file_operations {
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*mmap_validate) (struct file *, struct vm_area_struct *,
-   unsigned long);
+   unsigned long, int);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5c4c98e4adc9..0afa19feb755 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, 
unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-   struct list_head *uf, unsigned long map_flags);
+   struct list_head *uf, unsigned long map_flags, int fd);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
-   struct list_head *uf);
+   struct list_head *uf, int fd);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 struct list_head *uf);
 
@@ -2145,9 +2145,10 @@ static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
unsigned long p

[PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush

2017-10-06 Thread Dan Williams
Changes since v6 [1]:
* Abandon the concept of immutable files and rework the implementation
  to reuse same FL_LAYOUT file lease mechanism that coordinates pnfsd
  layouts vs local filesystem changes. This establishes an interface where
  the kernel is always in control of the block-map and is free to
  invalidate MAP_DIRECT mappings when a lease breaker arrives. (Christoph)

* Introduce a new ->mmap_validate() file operation since we need both
  the original @flags and @fd passed to mmap(2) to setup a MAP_DIRECT
  mapping.

* Introduce a ->lease_direct() vm operation to allow the RDMA core to
  safely register memory against DAX and tear down the mapping when the
  lease is broken. This can be reused by any sub-system that follows a
  memory registration semantic.

[1]: https://lkml.org/lkml/2017/8/23/754

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block-map, or otherwise
dirty the block-map metadata of a file without notification. It supports
a "flush from userspace" model where persistent memory applications can
bypass the overhead of ongoing coordination of writes with the
filesystem, and it provides safety to RDMA operations involving DAX
mappings.

The kernel always has the ability to revoke access and convert the file
back to normal operation after performing a "lease break". Similar to
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started, it can only delay it via the
/proc/sys/fs/lease-break-time setting.

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory with no ongoing coordination with
the filesystem via fsync/msync syscalls.

---

Dan Williams (12):
  mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap 
flags
  fs, mm: pass fd to ->mmap_validate()
  fs: introduce i_mapdcount
  fs: MAP_DIRECT core
  xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
  xfs: wire up MAP_DIRECT
  dma-mapping: introduce dma_has_iommu()
  fs, mapdirect: introduce ->lease_direct()
  xfs: wire up ->lease_direct()
  device-dax: wire up ->lease_direct()
  IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
  tools/testing/nvdimm: enable rdma unit tests


 arch/alpha/include/uapi/asm/mman.h   |1 
 arch/mips/include/uapi/asm/mman.h|1 
 arch/mips/kernel/vdso.c  |2 
 arch/parisc/include/uapi/asm/mman.h  |1 
 arch/tile/mm/elf.c   |3 
 arch/x86/mm/mpx.c|3 
 arch/xtensa/include/uapi/asm/mman.h  |1 
 drivers/base/dma-mapping.c   |   10 +
 drivers/dax/device.c |4 
 drivers/infiniband/core/umem.c   |   90 ++-
 drivers/iommu/amd_iommu.c|6 
 drivers/iommu/intel-iommu.c  |6 
 fs/Kconfig   |4 
 fs/Makefile  |1 
 fs/aio.c |2 
 fs/mapdirect.c   |  349 ++
 fs/xfs/Kconfig   |4 
 fs/xfs/Makefile  |1 
 fs/xfs/xfs_file.c|  130 ++
 fs/xfs/xfs_iomap.c   |9 +
 fs/xfs/xfs_layout.c  |   42 +++
 fs/xfs/xfs_layout.h  |   13 +
 fs/xfs/xfs_pnfs.c|   30 --
 fs/xfs/xfs_pnfs.h|   10 -
 include/linux/dma-mapping.h  |3 
 include/linux/fs.h   |   33 ++
 include/linux/mapdirect.h|   68 +
 include/linux/mm.h   |   15 +
 include/linux/mman.h |   42 +++
 include/rdma/ib_umem.h   |8 +
 include/uapi/asm-generic/mman-common.h   |1 
 include/uapi/asm-generic/mman.h  |1 
 ipc/shm.c|3 
 mm/internal.h|2 
 mm/mmap.c|   28 ++
 mm/nommu.c   |5 
 mm/util.c|7 -
 tools/include/uapi/asm-generic/mman-common.h |1 
 tools/testing/nvdimm/Kbuild  |   31 ++
 tools/testing/nvdimm/config_check.c  |2 
 tools/testing/nvdimm/test/iomap.c|6 
 41 files changed, 906 insertions(+), 73 deletions(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h
 create mode 100644 include/linux/mapdirect.h


[PATCH v7 01/12] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags

2017-10-06 Thread Dan Williams
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
is guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs, Linus observed:

I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do

ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);

and "know" that MAP_SYNC actually takes.

And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.

Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do

ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);

and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.

Boom. Done.

Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis.  Towards that end arrange for flags to be generically
validated against a mmap_supported_mask exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance opt-in.

Cc: Jan Kara 
Cc: Arnd Bergmann 
Cc: Andy Lutomirski 
Cc: Andrew Morton 
Suggested-by: Christoph Hellwig 
Suggested-by: Linus Torvalds 
Signed-off-by: Dan Williams 
---
 arch/alpha/include/uapi/asm/mman.h   |1 +
 arch/mips/include/uapi/asm/mman.h|1 +
 arch/mips/kernel/vdso.c  |2 +
 arch/parisc/include/uapi/asm/mman.h  |1 +
 arch/tile/mm/elf.c   |3 +-
 arch/xtensa/include/uapi/asm/mman.h  |1 +
 include/linux/fs.h   |2 +
 include/linux/mm.h   |2 +
 include/linux/mman.h |   39 ++
 include/uapi/asm-generic/mman-common.h   |1 +
 mm/mmap.c|   21 --
 tools/include/uapi/asm-generic/mman-common.h |1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h 
b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..92823f24890b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE   0x0f/* Mask for type of mapping (OSF/1 is 
_wrong_) */
 #define MAP_FIXED  0x100   /* Interpret addr exactly */
 #define MAP_ANONYMOUS  0x10/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3/* share + validate extension 
flags */
 
 /* not used by linux, but here to make sure we don't clash with OSF/1 defines 
*/
 #define _MAP_HASSEMAPHORE 0x0200
diff --git a/arch/mips/include/uapi/asm/mman.h 
b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..c77689076577 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_PRIVATE0x002   /* Changes are private */
 #define MAP_TYPE   0x00f   /* Mask for type of mapping */
 #define MAP_FIXED  0x010   /* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3/* share + validate extension 
flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME 0x020   /* Assign page to file */
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, 
int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-  0, NULL);
+  0, NULL, 0);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h 
b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..36b688d52de3 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE   0x03/* Mask for type of mapping */
 #define MAP_FIXED  0x04

[PATCH v7 04/12] fs: MAP_DIRECT core

2017-10-06 Thread Dan Williams
Introduce a set of helper APIs for filesystems to establish FL_LAYOUT
leases to protect against writes and block map updates while a
MAP_DIRECT mapping is established. While the lease protects against the
syscall write path and fallocate, it does not protect against allocating
write-faults, so this relies on i_mapdcount to disable block map updates
from write faults.

Like the pNFS case, MAP_DIRECT does its own lease timeout, since we
need a process context for running map_direct_invalidate().

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/Makefile   |2 
 fs/mapdirect.c|  232 +
 include/linux/mapdirect.h |   45 +
 3 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 include/linux/mapdirect.h

diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..c0e791d235d8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
 obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
-obj-$(CONFIG_FS_DAX)   += dax.o
+obj-$(CONFIG_FS_DAX)   += dax.o mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)+= crypto/
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
new file mode 100644
index ..9ac7c1d946a2
--- /dev/null
+++ b/fs/mapdirect.c
@@ -0,0 +1,232 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAPDIRECT_BREAK 0
+#define MAPDIRECT_VALID 1
+
+struct map_direct_state {
+   atomic_t mds_ref;
+   atomic_t mds_vmaref;
+   unsigned long mds_state;
+   struct inode *mds_inode;
+   struct delayed_work mds_work;
+   struct fasync_struct *mds_fa;
+   struct vm_area_struct *mds_vma;
+};
+
+bool is_map_direct_valid(struct map_direct_state *mds)
+{
+   return test_bit(MAPDIRECT_VALID, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(is_map_direct_valid);
+
+static void put_map_direct(struct map_direct_state *mds)
+{
+   if (!atomic_dec_and_test(&mds->mds_ref))
+   return;
+   kfree(mds);
+}
+
+int put_map_direct_vma(struct map_direct_state *mds)
+{
+   struct vm_area_struct *vma = mds->mds_vma;
+   struct file *file = vma->vm_file;
+   struct inode *inode = file_inode(file);
+   void *owner = mds;
+
+   if (!atomic_dec_and_test(&mds->mds_vmaref))
+   return 0;
+
+   /*
+* Flush in-flight+forced lm_break events that may be
+* referencing this dying vma.
+*/
+   mds->mds_vma = NULL;
+   set_bit(MAPDIRECT_BREAK, &mds->mds_state);
+   vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+   flush_delayed_work(&mds->mds_work);
+   iput(inode);
+
+   put_map_direct(mds);
+   return 1;
+}
+EXPORT_SYMBOL_GPL(put_map_direct_vma);
+
+void get_map_direct_vma(struct map_direct_state *mds)
+{
+   atomic_inc(&mds->mds_vmaref);
+}
+EXPORT_SYMBOL_GPL(get_map_direct_vma);
+
+static void map_direct_invalidate(struct work_struct *work)
+{
+   struct map_direct_state *mds;
+   struct vm_area_struct *vma;
+   struct inode *inode;
+   void *owner;
+
+   mds = container_of(work, typeof(*mds), mds_work.work);
+
+   clear_bit(MAPDIRECT_VALID, &mds->mds_state);
+
+   vma = ACCESS_ONCE(mds->mds_vma);
+   inode = mds->mds_inode;
+   if (vma) {
+   unsigned long len = vma->vm_end - vma->vm_start;
+   loff_t start = (loff_t) vma->vm_pgoff * PAGE_SIZE;
+
+   unmap_mapping_range(inode->i_mapping, start, len, 1);
+   }
+   owner = mds;
+   vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+
+   put_map_direct(mds);
+}
+
+static bool map_direct_lm_break(struct file_lock *fl)
+{
+   struct map_direct_state *mds = fl->fl_owner;
+
+   /*
+* Given that we need to take sleeping locks to invalidate the
+* mapping we schedule that work with the original timeout set
+* by the file-locks core. Then we tell the core to hold off on
+* continuing with the lease break until the delayed work
+* completes the invalidation a

[PATCH v7 03/12] fs: introduce i_mapdcount

2017-10-06 Thread Dan Williams
When ->iomap_begin() sees this count being non-zero and determines that
the block map of the file needs to be modified to satisfy the I/O
request it will instead return an error. This is needed for MAP_DIRECT
where, due to locking constraints, we can't rely on xfs_break_layouts()
to protect against allocating write-faults, either from the process that
set up the MAP_DIRECT mapping or from other processes that have the file
mapped.  xfs_break_layouts() requires XFS_IOLOCK, which is problematic to
mix with the XFS_MMAPLOCK in the fault path.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_iomap.c |9 +
 include/linux/fs.h |   31 +++
 2 files changed, 40 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index a1909bc064e9..6816f8ebbdcf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1053,6 +1053,15 @@ xfs_file_iomap_begin(
goto out_unlock;
}
/*
+* If a file has MAP_DIRECT mappings disable block map
> +* updates. This should only affect mmap write faults as
+* other paths are protected by an FL_LAYOUT lease.
+*/
+   if (i_mapdcount_read(inode)) {
+   error = -ETXTBSY;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat 
symmetric
 * with the work writeback does. This is a completely arbitrary
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c2b9bf3dc4e9..f83871b188ff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -642,6 +642,9 @@ struct inode {
atomic_ti_count;
atomic_ti_dio_count;
atomic_ti_writecount;
+#ifdef CONFIG_FS_DAX
+   atomic_ti_mapdcount;/* count of MAP_DIRECT vmas */
+#endif
 #ifdef CONFIG_IMA
atomic_ti_readcount; /* struct files open RO */
 #endif
@@ -2784,6 +2787,34 @@ static inline void i_readcount_inc(struct inode *inode)
return;
 }
 #endif
+
+#ifdef CONFIG_FS_DAX
+static inline void i_mapdcount_dec(struct inode *inode)
+{
+   BUG_ON(!atomic_read(&inode->i_mapdcount));
+   atomic_dec(&inode->i_mapdcount);
+}
+static inline void i_mapdcount_inc(struct inode *inode)
+{
+   atomic_inc(&inode->i_mapdcount);
+}
+static inline int i_mapdcount_read(struct inode *inode)
+{
+   return atomic_read(&inode->i_mapdcount);
+}
+#else
+static inline void i_mapdcount_dec(struct inode *inode)
+{
+}
+static inline void i_mapdcount_inc(struct inode *inode)
+{
+}
+static inline int i_mapdcount_read(struct inode *inode)
+{
+   return 0;
+}
+#endif
+
 extern int do_pipe_flags(int *, int);
 
 #define __kernel_read_file_id(id) \



[PATCH v7 05/12] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT

2017-10-06 Thread Dan Williams
Move xfs_break_layouts() to its own compilation unit so that it can be
used for both pnfs layouts and MAP_DIRECT mappings.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/xfs/Kconfig  |4 
 fs/xfs/Makefile |1 +
 fs/xfs/xfs_layout.c |   42 ++
 fs/xfs/xfs_layout.h |   13 +
 fs/xfs/xfs_pnfs.c   |   30 --
 fs/xfs/xfs_pnfs.h   |   10 ++
 6 files changed, 62 insertions(+), 38 deletions(-)
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 1b98cfa342ab..f62fc6629abb 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL
  result in warnings.
 
  This behavior can be modified at runtime via sysfs.
+
+config XFS_LAYOUT
+   def_bool y
+   depends on EXPORTFS_BLOCK_OPS
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a6e955bfead8..d44135107490 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL) += xfs_acl.o
 xfs-$(CONFIG_SYSCTL)   += xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)   += xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)   += xfs_pnfs.o
+xfs-$(CONFIG_XFS_LAYOUT)   += xfs_layout.o
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
new file mode 100644
index ..71d95e1a910a
--- /dev/null
+++ b/fs/xfs/xfs_layout.c
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include "xfs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+
+#include 
+
+/*
+ * Ensure that we do not have any outstanding pNFS layouts that can be used by
+ * clients to directly read from or write to this inode.  This must be called
+ * before every operation that can remove blocks from the extent map.
+ * Additionally we call it during the write operation, where aren't concerned
+ * about exposing unallocated blocks but just want to provide basic
+ * synchronization between a local writer and pNFS clients.  mmap writes would
+ * also benefit from this sort of synchronization, but due to the tricky 
locking
+ * rules in the page fault path we don't bother.
+ */
+int
+xfs_break_layouts(
+   struct inode*inode,
+   uint*iolock)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+   int error;
+
+   ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+   while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
+   xfs_iunlock(ip, *iolock);
+   error = break_layout(inode, true);
+   *iolock = XFS_IOLOCK_EXCL;
+   xfs_ilock(ip, *iolock);
+   }
+
+   return error;
+}
diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h
new file mode 100644
index ..f848ee78cc93
--- /dev/null
+++ b/fs/xfs/xfs_layout.h
@@ -0,0 +1,13 @@
+#ifndef _XFS_LAYOUT_H
+#define _XFS_LAYOUT_H 1
+
+#ifdef CONFIG_XFS_LAYOUT
+int xfs_break_layouts(struct inode *inode, uint *iolock);
+#else
+static inline int
+xfs_break_layouts(struct inode *inode, uint *iolock)
+{
+   return 0;
+}
+#endif /* CONFIG_XFS_LAYOUT */
+#endif /* _XFS_LAYOUT_H */
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 2f2dc3c09ad0..8ec72220e73b 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -20,36 +20,6 @@
 #include "xfs_pnfs.h"
 
 /*
- * Ensure that we do not have any outstanding pNFS layouts that can be used by
- * clients to directly read from or write to this inode.  This must be called
- * before every operation that can remove blocks from the extent map.
- * Additionally we call it during the write operation, where aren't concerned
- * about exposing unallocated blocks but just want to provide basic
- * synchronization between a local writer and pNFS clients.  mmap writes would
- * also benefit from this sort of synchronization, but due to the tricky
- * locking rules in the page fault path we don't bother.
- */
-int
-xfs_break_layouts(
-   struct inode*inode,
-   uint*iolock)
-{
-   struct xfs_inode*ip = XFS_I(inode);
-   int error;
-
-   ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
-   while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
-   xfs_iunlock(ip, *iolock);
-   error = break_layout(inode, true);
-   *iolock = XFS_IOLOCK_EXCL;
-   xfs_ilock(ip, *iolock);
-   }
-
-   return error;
-}
-
-/*
  * Get a unique ID including its location so that the client can identify
  * the exported device.
  */
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index b587cb99b2b7..4135b2482697 100644
--- a/fs/xfs/xfs_pnfs

[PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-06 Thread Dan Williams
Add a helper to determine if the dma mappings set up for a given device
are backed by an iommu. In particular, this lets code paths know that a
dma_unmap operation will revoke access to memory if the device can not
otherwise be quiesced. The need for this knowledge is driven by a need
to make RDMA transfers to DAX mappings safe. If the DAX file's block map
changes we need to be able to reliably stop accesses to blocks that have
been freed or re-assigned to a new file.

Since PMEM+DAX is currently only enabled for x86, we only update the x86
iommu drivers.

Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Greg Kroah-Hartman 
Cc: Joerg Roedel 
Cc: David Woodhouse 
Cc: Ashok Raj 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/base/dma-mapping.c  |   10 ++
 drivers/iommu/amd_iommu.c   |6 ++
 drivers/iommu/intel-iommu.c |6 ++
 include/linux/dma-mapping.h |3 +++
 4 files changed, 25 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index e584eddef0a7..e1b5f103d90e 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
of_dma_deconfigure(dev);
acpi_dma_deconfigure(dev);
 }
+
+bool dma_has_iommu(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   if (ops && ops->has_iommu)
+   return ops->has_iommu(dev);
+   return false;
+}
+EXPORT_SYMBOL(dma_has_iommu);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 51f8215877f5..873f899fcf57 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2271,6 +2271,11 @@ static struct protection_domain *get_domain(struct device *dev)
return domain;
 }
 
+static bool amd_dma_has_iommu(struct device *dev)
+{
+   return !IS_ERR(get_domain(dev));
+}
+
 static void update_device_table(struct protection_domain *domain)
 {
struct iommu_dev_data *dev_data;
@@ -2689,6 +2694,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
.unmap_sg   = unmap_sg,
.dma_supported  = amd_iommu_dma_supported,
.mapping_error  = amd_iommu_mapping_error,
+   .has_iommu  = amd_dma_has_iommu,
 };
 
 static int init_reserved_iova_ranges(void)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..243ef42fdad4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3578,6 +3578,11 @@ static int iommu_no_mapping(struct device *dev)
return 0;
 }
 
+static bool intel_dma_has_iommu(struct device *dev)
+{
+   return !iommu_no_mapping(dev);
+}
+
 static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 size_t size, int dir, u64 dma_mask)
 {
@@ -3872,6 +3877,7 @@ const struct dma_map_ops intel_dma_ops = {
.map_page = intel_map_page,
.unmap_page = intel_unmap_page,
.mapping_error = intel_mapping_error,
+   .has_iommu = intel_dma_has_iommu,
 #ifdef CONFIG_X86
.dma_supported = x86_dma_supported,
 #endif
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 29ce9815da87..659f122c18f5 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -128,6 +128,7 @@ struct dma_map_ops {
   enum dma_data_direction dir);
int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
int (*dma_supported)(struct device *dev, u64 mask);
+   bool (*has_iommu)(struct device *dev);
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
u64 (*get_required_mask)(struct device *dev);
 #endif
@@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif
 
+extern bool dma_has_iommu(struct device *dev);
+
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
  size_t size,
  enum dma_data_direction dir,

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v7 09/12] xfs: wire up ->lease_direct()

2017-10-06 Thread Dan Williams
A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT
mapping established. For xfs we establish a new lease and then check if
the MAP_DIRECT mapping has been broken. We want to be sure that the
process will receive notification that the MAP_DIRECT mapping is being
torn down so it knows why other code paths are throwing failures.

For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the
MAP_DIRECT mapping is invalid or in the process of being invalidated.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c |   28 
 1 file changed, 28 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e35518600e28..823b65f17429 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1166,6 +1166,33 @@ xfs_filemap_direct_close(
put_map_direct_vma(vma->vm_private_data);
 }
 
+static struct lease_direct *
+xfs_filemap_direct_lease(
+   struct vm_area_struct   *vma,
+   void(*break_fn)(void *),
+   void*owner)
+{
+   struct lease_direct *ld;
+
+   ld = map_direct_lease(vma, break_fn, owner);
+
+   if (IS_ERR(ld))
+   return ld;
+
+   /*
+* We now have an established lease while the base MAP_DIRECT
+* lease was not broken. So, we know that the "lease holder" will
+* receive a SIGIO notification when the lease is broken and
+* take any necessary cleanup actions.
+*/
+   if (!is_map_direct_broken(vma->vm_private_data))
+   return ld;
+
+   map_direct_lease_destroy(ld);
+
+   return ERR_PTR(-ENXIO);
+}
+
 static const struct vm_operations_struct xfs_file_vm_direct_ops = {
.fault  = xfs_filemap_fault,
.huge_fault = xfs_filemap_huge_fault,
@@ -1175,6 +1202,7 @@ static const struct vm_operations_struct xfs_file_vm_direct_ops = {
 
.open   = xfs_filemap_direct_open,
.close  = xfs_filemap_direct_close,
+   .lease_direct   = xfs_filemap_direct_lease,
 };
 
 static const struct vm_operations_struct xfs_file_vm_ops = {



[PATCH v7 08/12] fs, mapdirect: introduce ->lease_direct()

2017-10-06 Thread Dan Williams
Provide a vma operation that registers a lease that is broken by
break_layout(). This is motivated by a need to stop in-progress RDMA
when the block-map of a DAX-file changes. I.e. since DAX gives
direct-access to filesystem blocks we can not allow those blocks to move
or change state while they are under active RDMA. So, if the filesystem
determines it needs to move blocks it can revoke device access before
proceeding.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/mapdirect.c|  117 +
 include/linux/mapdirect.h |   23 +
 include/linux/mm.h|6 ++
 3 files changed, 146 insertions(+)

diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index 9ac7c1d946a2..338cbe055fc7 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -32,12 +33,26 @@ struct map_direct_state {
struct vm_area_struct *mds_vma;
 };
 
+struct lease_direct_state {
+   void *lds_owner;
+   struct file *lds_file;
+   unsigned long lds_state;
+   void (*lds_break_fn)(void *lds_owner);
+   struct work_struct lds_work;
+};
+
 bool is_map_direct_valid(struct map_direct_state *mds)
 {
return test_bit(MAPDIRECT_VALID, &mds->mds_state);
 }
 EXPORT_SYMBOL_GPL(is_map_direct_valid);
 
+bool is_map_direct_broken(struct map_direct_state *mds)
+{
+   return test_bit(MAPDIRECT_BREAK, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(is_map_direct_broken);
+
 static void put_map_direct(struct map_direct_state *mds)
 {
if (!atomic_dec_and_test(&mds->mds_ref))
@@ -162,6 +177,108 @@ static const struct lock_manager_operations map_direct_lm_ops = {
.lm_setup = map_direct_lm_setup,
 };
 
+static void lease_direct_invalidate(struct work_struct *work)
+{
+   struct lease_direct_state *lds;
+   void *owner;
+
+   lds = container_of(work, typeof(*lds), lds_work);
+   owner = lds;
+   lds->lds_break_fn(lds->lds_owner);
+   vfs_setlease(lds->lds_file, F_UNLCK, NULL, &owner);
+}
+
+static bool lease_direct_lm_break(struct file_lock *fl)
+{
+   struct lease_direct_state *lds = fl->fl_owner;
+
+   if (!test_and_set_bit(MAPDIRECT_BREAK, &lds->lds_state))
+   schedule_work(&lds->lds_work);
+   return false;
+}
+
+static int lease_direct_lm_change(struct file_lock *fl, int arg,
+   struct list_head *dispose)
+{
+   WARN_ON(!(arg & F_UNLCK));
+   return lease_modify(fl, arg, dispose);
+}
+
+static const struct lock_manager_operations lease_direct_lm_ops = {
+   .lm_break = lease_direct_lm_break,
+   .lm_change = lease_direct_lm_change,
+};
+
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+   void (*lds_break_fn)(void *), void *lds_owner)
+{
+   struct file *file = vma->vm_file;
+   struct lease_direct_state *lds;
+   struct lease_direct *ld;
+   struct file_lock *fl;
+   int rc = -ENOMEM;
+   void *owner;
+
+   ld = kzalloc(sizeof(*ld) + sizeof(*lds), GFP_KERNEL);
+   if (!ld)
+   return ERR_PTR(-ENOMEM);
+   INIT_LIST_HEAD(&ld->list);
+   lds = (struct lease_direct_state *)(ld + 1);
+   owner = lds;
+   ld->lds = lds;
+   lds->lds_break_fn = lds_break_fn;
+   lds->lds_owner = lds_owner;
+   INIT_WORK(&lds->lds_work, lease_direct_invalidate);
+   lds->lds_file = get_file(file);
+
+   fl = locks_alloc_lock();
+   if (!fl)
+   goto err_lock_alloc;
+
+   locks_init_lock(fl);
+   fl->fl_lmops = &lease_direct_lm_ops;
+   fl->fl_flags = FL_LAYOUT;
+   fl->fl_type = F_RDLCK;
+   fl->fl_end = OFFSET_MAX;
+   fl->fl_owner = lds;
+   fl->fl_pid = current->tgid;
+   fl->fl_file = file;
+
+   rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+   if (rc)
+   goto err_setlease;
+   if (fl) {
+   WARN_ON(1);
+   owner = lds;
+   vfs_setlease(file, F_UNLCK, NULL, &owner);
+   owner = NULL;
+   rc = -ENXIO;
+   goto err_setlease;
+   }
+
+   return ld;
+err_setlease:
+   locks_free_lock(fl);
+err_lock_alloc:
+   kfree(lds);
+   return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(map_direct_lease);
+
+void map_direct_lease_destroy(struct lease_direct *ld)
+{
+   struct lease_direct_state *lds = ld->lds;
+   struct file *file = lds->lds_file;
+   void *owner = lds;
+
+   vfs_setlease(file, F_UNLCK, NULL, &owner);
+   flush_work(&lds->lds_work);
+   fput(file);
+   WARN_ON(!list_empty(&ld->list));
+   kfree(ld);
+}
+EXPORT_SYMBOL_GPL(map_direct_lease_destroy);
+
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma)
 {
struct map_direct_state *mds = kzalloc(s

[PATCH v7 06/12] xfs: wire up MAP_DIRECT

2017-10-06 Thread Dan Williams
MAP_DIRECT is an mmap(2) flag with the following semantics:

  MAP_DIRECT
  When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
  same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease
  is broken when a "lease breaker" attempts to write(2), change the block
  map (fallocate), or change the size of the file. Otherwise the mechanism
  of a lease break is identical to the typical lease break case where the
  lease needs to be removed (munmap) within the number of seconds
  specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
  remove the lease in time the kernel will invalidate the mapping and
  force all future accesses to the mapping to trigger SIGBUS.

  In addition to lease break timeouts causing faults in the mapping to
  result in SIGBUS, other states of the file will trigger SIGBUS at fault
  time:

  * The file is not DAX capable
  * The file has reflinked (copy-on-write) blocks
  * The fault would trigger the filesystem to allocate blocks
  * The fault would trigger the filesystem to perform extent conversion

  In other words, MAP_DIRECT expects and enforces a fully allocated file
  where faults can be satisfied without modifying block map metadata.

  An unprivileged process may establish a MAP_DIRECT mapping on a file
  whose UID (owner) matches the filesystem UID of the process. A process
  with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
  arbitrary files.

  ERRORS
  EACCES Beyond the typical mmap(2) conditions that trigger EACCES
  MAP_DIRECT also requires the permission to set a file lease.

  EOPNOTSUPP The filesystem explicitly does not support the flag

  SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
 might require block-map updates, or the lease timed out and the
 kernel invalidated the mapping.

Cc: Jan Kara 
Cc: Arnd Bergmann 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Alexander Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/xfs/Kconfig  |2 -
 fs/xfs/xfs_file.c   |  102 +++
 include/linux/mman.h|3 +
 include/uapi/asm-generic/mman.h |1 
 4 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index f62fc6629abb..f8765653a438 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
 
 config XFS_LAYOUT
def_bool y
-   depends on EXPORTFS_BLOCK_OPS
+   depends on EXPORTFS_BLOCK_OPS || FS_DAX
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebdd0bd2b261..e35518600e28 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -40,12 +40,22 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static const struct vm_operations_struct xfs_file_vm_ops;
+static const struct vm_operations_struct xfs_file_vm_direct_ops;
+
+static inline bool
+is_xfs_map_direct(
+   struct vm_area_struct *vma)
+{
+   return vma->vm_ops == &xfs_file_vm_direct_ops;
+}
 
 /*
  * Clear the specified ranges to zero through either the pagecache or DAX.
@@ -1008,6 +1018,26 @@ xfs_file_llseek(
return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
 }
 
+static int
+xfs_vma_checks(
+   struct vm_area_struct   *vma,
+   struct inode*inode)
+{
+   if (!is_xfs_map_direct(vma))
+   return 0;
+
+   if (!is_map_direct_valid(vma->vm_private_data))
+   return VM_FAULT_SIGBUS;
+
+   if (xfs_is_reflink_inode(XFS_I(inode)))
+   return VM_FAULT_SIGBUS;
+
+   if (!IS_DAX(inode))
+   return VM_FAULT_SIGBUS;
+
+   return 0;
+}
+
 /*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
@@ -1024,6 +1054,7 @@ __xfs_filemap_fault(
 	enum page_entry_size	pe_size,
 	bool			write_fault)
 {
+   struct vm_area_struct   *vma = vmf->vma;
struct inode*inode = file_inode(vmf->vma->vm_file);
struct xfs_inode*ip = XFS_I(inode);
int ret;
@@ -1032,10 +1063,14 @@ __xfs_filemap_fault(
 
if (write_fault) {
sb_start_pagefault(inode->i_sb);
-   file_update_time(vmf->vma->vm_file);
+   file_update_time(vma->vm_file);
}
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+   ret = xfs_vma_checks(vma, inode);
+   if (ret)
+   goto out_unlock;
+
if (IS_DAX(inode)) {
ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
} else {
@@ -1044,6 +1079,8 @@ __xfs_filemap_fault(
else
ret = filemap_fault(vmf);
}
+
+out_unlock:
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
   

[PATCH v7 12/12] tools/testing/nvdimm: enable rdma unit tests

2017-10-06 Thread Dan Williams
Provide a mock dma_has_iommu() for the ibverbs core. Enable
ib_umem_get() to satisfy its DAX safety checks for a controlled test.

Signed-off-by: Dan Williams 
---
 tools/testing/nvdimm/Kbuild |   31 +++
 tools/testing/nvdimm/config_check.c |2 ++
 tools/testing/nvdimm/test/iomap.c   |6 ++
 3 files changed, 39 insertions(+)

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index d870520da68b..e4ee7f482ac0 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -15,11 +15,13 @@ ldflags-y += --wrap=insert_resource
 ldflags-y += --wrap=remove_resource
 ldflags-y += --wrap=acpi_evaluate_object
 ldflags-y += --wrap=acpi_evaluate_dsm
+ldflags-y += --wrap=dma_has_iommu
 
 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
 ACPI_SRC := $(DRIVERS)/acpi/nfit
 DAX_SRC := $(DRIVERS)/dax
+IBCORE := $(DRIVERS)/infiniband/core
 ccflags-y := -I$(src)/$(NVDIMM_SRC)/
 
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
@@ -33,6 +35,7 @@ obj-$(CONFIG_DAX) += dax.o
 endif
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+obj-$(CONFIG_INFINIBAND) += ib_core.o
 
 nfit-y := $(ACPI_SRC)/core.o
 nfit-$(CONFIG_X86_MCE) += $(ACPI_SRC)/mce.o
@@ -75,4 +78,32 @@ libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
 libnvdimm-$(CONFIG_NVDIMM_DAX) += $(NVDIMM_SRC)/dax_devs.o
 libnvdimm-y += config_check.o
 
+ib_core-y := $(IBCORE)/packer.o
+ib_core-y += $(IBCORE)/ud_header.o
+ib_core-y += $(IBCORE)/verbs.o
+ib_core-y += $(IBCORE)/cq.o
+ib_core-y += $(IBCORE)/rw.o
+ib_core-y += $(IBCORE)/sysfs.o
+ib_core-y += $(IBCORE)/device.o
+ib_core-y += $(IBCORE)/fmr_pool.o
+ib_core-y += $(IBCORE)/cache.o
+ib_core-y += $(IBCORE)/netlink.o
+ib_core-y += $(IBCORE)/roce_gid_mgmt.o
+ib_core-y += $(IBCORE)/mr_pool.o
+ib_core-y += $(IBCORE)/addr.o
+ib_core-y += $(IBCORE)/sa_query.o
+ib_core-y += $(IBCORE)/multicast.o
+ib_core-y += $(IBCORE)/mad.o
+ib_core-y += $(IBCORE)/smi.o
+ib_core-y += $(IBCORE)/agent.o
+ib_core-y += $(IBCORE)/mad_rmpp.o
+ib_core-y += $(IBCORE)/security.o
+ib_core-y += $(IBCORE)/nldev.o
+
+ib_core-$(CONFIG_INFINIBAND_USER_MEM) += $(IBCORE)/umem.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_odp.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += $(IBCORE)/cgroup.o
+ib_core-y += config_check.o
+
 obj-m += test/
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
index 7dc5a0af9b54..33e7c805bfd6 100644
--- a/tools/testing/nvdimm/config_check.c
+++ b/tools/testing/nvdimm/config_check.c
@@ -14,4 +14,6 @@ void check(void)
BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX));
BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM));
+   BUILD_BUG_ON(!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM));
+   BUILD_BUG_ON(!IS_MODULE(CONFIG_INFINIBAND));
 }
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index e1f75a1914a1..1c240328ee5b 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -388,4 +388,10 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
 }
 EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
 
+bool __wrap_dma_has_iommu(struct device *dev)
+{
+   return true;
+}
+EXPORT_SYMBOL(__wrap_dma_has_iommu);
+
 MODULE_LICENSE("GPL v2");



[PATCH v7 10/12] device-dax: wire up ->lease_direct()

2017-10-06 Thread Dan Williams
The only event that will break a lease_direct lease in the device-dax
case is the device shutdown path where the physical pages might get
assigned to another device.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/dax/device.c  |4 
 fs/Kconfig|4 
 fs/Makefile   |3 ++-
 include/linux/mapdirect.h |2 +-
 4 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e9f3b3e4bbf4..fa75004185c4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include 
 #include 
 #include 
 #include 
@@ -430,6 +431,7 @@ static int dev_dax_fault(struct vm_fault *vmf)
 static const struct vm_operations_struct dax_vm_ops = {
.fault = dev_dax_fault,
.huge_fault = dev_dax_huge_fault,
+   .lease_direct = map_direct_lease,
 };
 
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
@@ -540,8 +542,10 @@ static void kill_dev_dax(struct dev_dax *dev_dax)
 {
struct dax_device *dax_dev = dev_dax->dax_dev;
struct inode *inode = dax_inode(dax_dev);
+   const bool wait = true;
 
kill_dax(dax_dev);
+   break_layout(inode, wait);
unmap_mapping_range(inode->i_mapping, 0, 0, 1);
 }
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..2e3784ae1bc4 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -58,6 +58,10 @@ config FS_DAX_PMD
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
 
+config DAX_MAP_DIRECT
+   bool
+   default FS_DAX || DEV_DAX
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index c0e791d235d8..21b8fb104656 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,8 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
 obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
-obj-$(CONFIG_FS_DAX)   += dax.o mapdirect.o
+obj-$(CONFIG_FS_DAX)   += dax.o
+obj-$(CONFIG_DAX_MAP_DIRECT)   += mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)+= crypto/
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
index dc4d4ba677d0..bafa78a6085f 100644
--- a/include/linux/mapdirect.h
+++ b/include/linux/mapdirect.h
@@ -26,7 +26,7 @@ struct lease_direct {
struct lease_direct_state *lds;
 };
 
-#if IS_ENABLED(CONFIG_FS_DAX)
+#if IS_ENABLED(CONFIG_DAX_MAP_DIRECT)
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
 int put_map_direct_vma(struct map_direct_state *mds);
 void get_map_direct_vma(struct map_direct_state *mds);



[PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings

2017-10-06 Thread Dan Williams
Currently the ibverbs core in the kernel is completely unaware of the
dangers of filesystem-DAX mappings. Specifically, the filesystem is free
to move file blocks at will. In the case of DAX, it means that RDMA to a
given file offset can dynamically switch to another file offset, another
file, or free space with no notification to RDMA device to cease
operations. Historically, this lack of communication between the ibverbs
core and filesystem was not a problem because RDMA always targeted
dynamically allocated page cache, so at least the RDMA device would have
valid memory to target even if the file was being modified. With DAX we
need to add coordination since RDMA is bypassing page-cache and going
direct to on-media pages of the file. RDMA to DAX can cause damage if
filesystem blocks move / change state.

Use the new ->lease_direct() operation to get a notification when the
filesystem is invalidating the block map of the file and needs RDMA
operations to stop. Given that the kernel can not be in a position where
it needs to wait indefinitely for userspace to stop a device we need a
mechanism where the kernel can force-revoke access. Towards that end, use
the new dma_has_iommu() helper to determine if ib_dma_unmap_sg() is
sufficient for revoking access. Once we have that assurance and a
->lease_direct() lease we can safely allow RDMA to DAX.

Cc: Sean Hefty 
Cc: Doug Ledford 
Cc: Hal Rosenstock 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 drivers/infiniband/core/umem.c |   90 ++--
 include/rdma/ib_umem.h |8 
 2 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..dc3ae1bee669 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -46,11 +47,12 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
+   struct lease_direct *ld, *_ld;
struct scatterlist *sg;
struct page *page;
int i;
 
-   if (umem->nmap > 0)
+   if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
ib_dma_unmap_sg(dev, umem->sg_head.sgl,
umem->npages,
DMA_BIDIRECTIONAL);
@@ -64,8 +66,22 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
}
 
sg_free_table(&umem->sg_head);
-   return;
 
+   list_for_each_entry_safe(ld, _ld, &umem->leases, list) {
+   list_del_init(&ld->list);
+   map_direct_lease_destroy(ld);
+   }
+}
+
+static void ib_umem_lease_break(void *__umem)
+{
+   struct ib_umem *umem = __umem;
+   struct ib_device *dev = umem->context->device;
+
+   if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
+   ib_dma_unmap_sg(dev, umem->sg_head.sgl,
+   umem->npages,
+   DMA_BIDIRECTIONAL);
 }
 
 /**
@@ -96,7 +112,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
struct scatterlist *sg, *sg_list_start;
int need_release = 0;
unsigned int gup_flags = FOLL_WRITE;
+   struct vm_area_struct *vma_prev = NULL;
+   struct device *dma_dev;
 
+   dma_dev = context->device->dma_device;
if (dmasync)
dma_attrs |= DMA_ATTR_WRITE_BARRIER;
 
@@ -120,6 +139,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->address= addr;
umem->page_shift = PAGE_SHIFT;
umem->pid= get_task_pid(current, PIDTYPE_PID);
+   INIT_LIST_HEAD(&umem->leases);
+   set_bit(IB_UMEM_MAPPED, &umem->state);
/*
 * We ask for writable memory if any of the following
 * access flags are set.  "Local write" and "remote write"
@@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->hugetlb   = 1;
 
page_list = (struct page **) __get_free_page(GFP_KERNEL);
-   if (!page_list) {
-   put_pid(umem->pid);
-   kfree(umem);
-   return ERR_PTR(-ENOMEM);
-   }
+   if (!page_list)
+   goto err_pagelist;
 
/*
-* if we can't alloc the vma_list, it's not so bad;
-* just assume the memory is not hugetlb memory
+* If DAX is enabled we need the vma to setup a ->lease_direct()
+* lease to protect against file modifications, otherwise we can
+* tolerate a failure to allocate the vma_list and just assume
+* that all vmas are not hugetlb-vmas.
 */
vma_list = (struct 

Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-06 Thread Dan Williams
On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse  wrote:
> On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>> Add a helper to determine if the dma mappings set up for a given device
>> are backed by an iommu. In particular, this lets code paths know that a
>> dma_unmap operation will revoke access to memory if the device can not
>> otherwise be quiesced. The need for this knowledge is driven by a need
>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> changes we need to be to reliably stop accesses to blocks that have been
>> freed or re-assigned to a new file.
>
> "a dma_unmap operation revoke access to memory"... but it's OK that the
> next *map* will give the same DMA address to someone else, right?

I'm assuming the next map will be to other physical addresses and a
different requester device since the memory is still registered
exclusively.


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-06 Thread Dan Williams
On Fri, Oct 6, 2017 at 3:52 PM, Dan Williams  wrote:
> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse  wrote:
>> On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>>> Add a helper to determine if the dma mappings set up for a given device
>>> are backed by an iommu. In particular, this lets code paths know that a
>>> dma_unmap operation will revoke access to memory if the device can not
>>> otherwise be quiesced. The need for this knowledge is driven by a need
>>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>>> changes we need to be to reliably stop accesses to blocks that have been
>>> freed or re-assigned to a new file.
>>
>> "a dma_unmap operation revoke access to memory"... but it's OK that the
>> next *map* will give the same DMA address to someone else, right?
>
> I'm assuming the next map will be to other physical addresses and a
> different requester device since the memory is still registered
> exclusively.

[ chatted with Ashok ]

Yes, it seems we need a way to pin that IOVA as in use, but invalidate
it. Then we can wait for the unmap to occur when the memory is
unregistered to avoid this IOVA reuse problem.


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-06 Thread Dan Williams
On Fri, Oct 6, 2017 at 4:10 PM, David Woodhouse  wrote:
> On Fri, 2017-10-06 at 15:52 -0700, Dan Williams wrote:
>> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse  wrote:
>> >
>> > On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>> > >
>> > > Add a helper to determine if the dma mappings set up for a given device
>> > > are backed by an iommu. In particular, this lets code paths know that a
>> > > dma_unmap operation will revoke access to memory if the device can not
>> > > otherwise be quiesced. The need for this knowledge is driven by a need
>> > > to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> > > changes we need to be to reliably stop accesses to blocks that have been
>> > > freed or re-assigned to a new file.
>> > "a dma_unmap operation revoke access to memory"... but it's OK that the
>> > next *map* will give the same DMA address to someone else, right?
>>
>> I'm assuming the next map will be to other physical addresses and a
>> different requester device since the memory is still registered
>> exclusively.
>
> I meant the next map for this device/group.
>
> It may well use the same virtual DMA address as the one you just
> unmapped, yet actually map to a different physical address. So if the
> DMA still occurs to the "old" address, that isn't revoked at all — it's
> just going to the wrong physical location.
>
> And if you are sure that the DMA will never happen, why do you need to
> revoke the mapping in the first place?

Right, crossed mails. The semantic I want is that the IOVA is
invalidated / starts throwing errors to the device because the address
it thought it was talking to has been remapped in the file. Once
userspace wakes up and responds to this invalidation event it can do
the actual unmap to make the IOVA reusable again.