Re: [PATCH v8 06/14] xfs: wire up MAP_DIRECT

2017-10-10 Thread Dan Williams
On Tue, Oct 10, 2017 at 6:09 PM, Dave Chinner  wrote:
> On Tue, Oct 10, 2017 at 07:49:30AM -0700, Dan Williams wrote:
>> @@ -1009,6 +1019,22 @@ xfs_file_llseek(
>>  }
>>
>>  /*
>> + * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
>> + * valid. See map_direct_invalidate.
>> + */
>> +static int
>> +xfs_can_fault_direct(
>> + struct vm_area_struct   *vma)
>> +{
>> + if (!xfs_vma_is_direct(vma))
>> + return 0;
>> +
>> + if (!test_map_direct_valid(vma->vm_private_data))
>> + return VM_FAULT_SIGBUS;
>> + return 0;
>> +}
>
> Better, but I'm going to be an annoying pedant here: a "can
> <something>" check should return a boolean true/false.
>
> Also, it's a bit jarring to see that a non-direct VMA that /can't/
> do direct faults returns the same thing as a direct-vma that /can/
> do direct faults, so a couple of extra comments for people who will
> quickly forget how this code works (i.e. me) will be helpful. Say
> something like this:
>
> /*
>  * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
>  * valid. See map_direct_invalidate.
>  */
> static bool
> xfs_vma_has_direct_lease(
> struct vm_area_struct   *vma)
> {
> /* Non MAP_DIRECT vmas do not require layout leases */
> if (!xfs_vma_is_direct(vma))
> return true;
>
> if (!test_map_direct_valid(vma->vm_private_data))
> return false;
>
> /* We have a valid lease */
> return true;
> }
>
> .
> if (!xfs_vma_has_direct_lease(vma)) {
> ret = VM_FAULT_SIGBUS;
> goto out_unlock;
> }
> 


Looks good to me.


Re: [PATCH v8 04/14] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT

2017-10-10 Thread Dan Williams
On Tue, Oct 10, 2017 at 5:46 PM, Dave Chinner  wrote:
> On Tue, Oct 10, 2017 at 07:49:17AM -0700, Dan Williams wrote:
>> Move xfs_break_layouts() to its own compilation unit so that it can be
>> used for both pnfs layouts and MAP_DIRECT mappings.
> .
>> diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
>> index b587cb99b2b7..4135b2482697 100644
>> --- a/fs/xfs/xfs_pnfs.h
>> +++ b/fs/xfs/xfs_pnfs.h
>> @@ -1,19 +1,13 @@
>>  #ifndef _XFS_PNFS_H
>>  #define _XFS_PNFS_H 1
>>
>> +#include "xfs_layout.h"
>> +
>
> I missed this the first time through - we try not to put includes
> in header files, and instead make sure each C file has all the
> includes it requires. Can you move this to all the C files that
> need layouts and remove the xfs_pnfs.h include from them?

Sure, will do.


Re: [PATCH v8 06/14] xfs: wire up MAP_DIRECT

2017-10-10 Thread Dave Chinner
On Tue, Oct 10, 2017 at 07:49:30AM -0700, Dan Williams wrote:
> @@ -1009,6 +1019,22 @@ xfs_file_llseek(
>  }
>  
>  /*
> + * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
> + * valid. See map_direct_invalidate.
> + */
> +static int
> +xfs_can_fault_direct(
> + struct vm_area_struct   *vma)
> +{
> + if (!xfs_vma_is_direct(vma))
> + return 0;
> +
> + if (!test_map_direct_valid(vma->vm_private_data))
> + return VM_FAULT_SIGBUS;
> + return 0;
> +}

Better, but I'm going to be an annoying pedant here: a "can
<something>" check should return a boolean true/false.

Also, it's a bit jarring to see that a non-direct VMA that /can't/
do direct faults returns the same thing as a direct-vma that /can/
do direct faults, so a couple of extra comments for people who will
quickly forget how this code works (i.e. me) will be helpful. Say
something like this:

/*
 * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
 * valid. See map_direct_invalidate.
 */
static bool
xfs_vma_has_direct_lease(
struct vm_area_struct   *vma)
{
/* Non MAP_DIRECT vmas do not require layout leases */
if (!xfs_vma_is_direct(vma))
return true;

if (!test_map_direct_valid(vma->vm_private_data))
return false;

/* We have a valid lease */
return true;
}

.
if (!xfs_vma_has_direct_lease(vma)) {
ret = VM_FAULT_SIGBUS;
goto out_unlock;
}


Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v8 04/14] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT

2017-10-10 Thread Dave Chinner
On Tue, Oct 10, 2017 at 07:49:17AM -0700, Dan Williams wrote:
> Move xfs_break_layouts() to its own compilation unit so that it can be
> used for both pnfs layouts and MAP_DIRECT mappings.
.
> diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
> index b587cb99b2b7..4135b2482697 100644
> --- a/fs/xfs/xfs_pnfs.h
> +++ b/fs/xfs/xfs_pnfs.h
> @@ -1,19 +1,13 @@
>  #ifndef _XFS_PNFS_H
>  #define _XFS_PNFS_H 1
>  
> +#include "xfs_layout.h"
> +

I missed this the first time through - we try not to put includes
in header files, and instead make sure each C file has all the
includes it requires. Can you move this to all the C files that
need layouts and remove the xfs_pnfs.h include from them?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] Fix mpage_writepage() for pages with buffers

2017-10-10 Thread Linus Torvalds
On Tue, Oct 10, 2017 at 12:44 PM, Andrew Morton
 wrote:
>
> This is all pretty mature code (isn't it?).  Any idea why this bug
> popped up now?

Also, while the patch looks sane, the

clean_buffers(page, PAGE_SIZE);

line really threw me. That's an insane value to pick, it looks like
"bytes in page", but it isn't. It's just a random value that is bigger
than "PAGE_SIZE >> SECTOR_SHIFT".

I'd prefer to see just ~0u if the intention is just "bigger than
anything possible".
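Concretely, the suggestion amounts to something like this (a sketch of
the suggested change, not a tested patch; clean_buffers() is the
fs/mpage.c helper under discussion):

	/* clean every buffer on the page; ~0U is an explicit "no limit" value */
	clean_buffers(page, ~0U);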

Linus


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-10 Thread Dan Williams
On Tue, Oct 10, 2017 at 11:05 AM, Jason Gunthorpe
 wrote:
> On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
>> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe wrote:
>> >> Have a look at the patch [1], I don't touch the ODP path.
>> >
>> > But, does ODP work OK already? I'm not clear on that..
>>
>> It had better. If the mapping is invalidated I would hope that
>> generates an io fault that gets handled by the driver to setup the new
>> mapping. I don't see how it can work otherwise.
>
> I would assume so too...
>
>> > This is why ODP should be the focus because this cannot work fully
>> > reliably otherwise..
>>
>> The lease break time is configurable. If that application can't
>> respond to a stop request within a timeout of its own choosing then it
>> should not be using DAX mappings.
>
> Well, no RDMA application can really do this, unless you set the
> timeout to multiple minutes, on par with network timeouts.

The default lease break timeout is 45 seconds on my system, so minutes
does not seem out of the question.

Also keep in mind that what triggers the lease break is another
application trying to write or punch holes in a file that is mapped
for RDMA. So, if the hardware can't handle the iommu mapping getting
invalidated asynchronously and the application can't react in the
lease break timeout period then the administrator should arrange for
the file to not be written or truncated while it is mapped.
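To illustrate what reacting within the lease break timeout means for the
application, here is a minimal, hypothetical userspace sketch (mine, not
from this series; quiesce_rdma() stands in for driver-specific teardown).
The series delivers the lease break notification to the registering
process as SIGIO via the fasync machinery:

#include <signal.h>
#include <sys/mman.h>

static volatile sig_atomic_t lease_broken;

static void sigio_handler(int sig)
{
	lease_broken = 1;	/* async-signal-safe: just record the event */
}

/* called periodically from the application's event loop */
static void maybe_handle_lease_break(void *addr, size_t len)
{
	if (!lease_broken)
		return;
	/* quiesce_rdma(); stop issuing new transfers, drain in-flight ones */
	munmap(addr, len);	/* tearing down the mapping releases the lease */
}

/* setup: signal(SIGIO, sigio_handler); must run before mapping the file */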

It's already the case that get_user_pages() does not lock down file
associations, so if your application is contending with these types of
file changes it likely already has a problem keeping transactions in
sync with the file state even without DAX.

> Again, these details are why I think this kind of DAX and non ODP-MRs
> are probably practically not too useful for a production system. Great
> for test of course, but in that case SIGKILL would be fine too...
>
>> > Well, what about using SIGKILL if the lease-break-time hits? The
>> > kernel will clean up the MRs when the process exits and this will
>> > fence DMA to that memory.
>>
>> Can you point me to where the MR cleanup code fences DMA and quiesces
>> the device?
>
> Yes. The MR's are associated with an fd. When the fd is closed
> ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
> all the objects, including MRs, and deletes them.
>
> The specification for deleting a MR requires a synchronous fence with
> the hardware. After MR deletion the hardware will not DMA to any pages
> described by the old MR, and those pages will be unpinned.
>
>> > But, still, if you really want to be fine grained, then I think
>> > invalidating the impacted MR's is a better solution for RDMA than
>> > trying to do it with the IOMMU...
>>
>> If there's a better routine for handling ib_umem_lease_break() I'd
>> love to use it. Right now I'm reaching for the only tool I know for
>> kernel enforced revocation of DMA access.
>
> Well, you'd have to code something in the MR code to keep track of DAX
> MRs and issue an out of band invalidate to impacted MRs to create the
> fence.
>
> This probably needs some driver work, I'm not sure if all the hardware
> can do out of band invalidate to any MR or not..

Ok.

>
> Generally speaking, in RDMA, when a new feature like this comes along
> we have to push a lot of the work down to the driver authors, and the
> approach has historically been that new features only work on some
> hardware (as much as I dislike this, it is pragmatic)
>
> So, not being able to support DAX on certain RDMA hardware is not
> an unreasonable situation in our space.

That makes sense, but it still seems to me that this proposed solution
allows more than enough ways to avoid that worst case scenario where
hardware reacts badly to iommu invalidation. Drivers that can do
better than iommu invalidation can arrange for a callback to do their
driver-specific action at lease break time. Hardware that can't should
be blacklisted from supporting DAX altogether. In other words this is
a starting point to incrementally enhance or disable specific drivers,
but with the assurance that the kernel can always do the safe thing
when / if the driver is missing a finer grained solution.


Re: [PATCH] Fix mpage_writepage() for pages with buffers

2017-10-10 Thread Andrew Morton
On Fri, 6 Oct 2017 14:15:41 -0700 Matthew Wilcox  wrote:

> When using FAT on a block device which supports rw_page, we can hit
> BUG_ON(!PageLocked(page)) in try_to_free_buffers().  This is because we
> call clean_buffers() after unlocking the page we've written.  Introduce a
> new clean_page_buffers() which cleans all buffers associated with a page
> and call it from within bdev_write_page().

This is all pretty mature code (isn't it?).  Any idea why this bug
popped up now?



Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-10 Thread Jason Gunthorpe
On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe wrote:

> >> Have a look at the patch [1], I don't touch the ODP path.
> >
> > But, does ODP work OK already? I'm not clear on that..
> 
> It had better. If the mapping is invalidated I would hope that
> generates an io fault that gets handled by the driver to setup the new
> mapping. I don't see how it can work otherwise.

I would assume so too...

> > This is why ODP should be the focus because this cannot work fully
> > reliably otherwise..
> 
> The lease break time is configurable. If that application can't
> respond to a stop request within a timeout of its own choosing then it
> should not be using DAX mappings.

Well, no RDMA application can really do this, unless you set the
timeout to multiple minutes, on par with network timeouts.

Again, these details are why I think this kind of DAX and non ODP-MRs
are probably practically not too useful for a production system. Great
for test of course, but in that case SIGKILL would be fine too...

> > Well, what about using SIGKILL if the lease-break-time hits? The
> > kernel will clean up the MRs when the process exits and this will
> > fence DMA to that memory.
> 
> Can you point me to where the MR cleanup code fences DMA and quiesces
> the device?

Yes. The MR's are associated with an fd. When the fd is closed
ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
all the objects, including MRs, and deletes them.

The specification for deleting a MR requires a synchronous fence with
the hardware. After MR deletion the hardware will not DMA to any pages
described by the old MR, and those pages will be unpinned.
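In userspace terms the fence described here looks like this (a hedged
sketch using standard libibverbs calls, not code from the thread):

	/* explicit teardown: deregistration is a synchronous fence; once it
	 * returns the HCA will no longer DMA to the pages the MR described */
	ibv_dereg_mr(mr);

	/* implicit teardown: closing the device (and with it the uverbs fd),
	 * e.g. on process exit, destroys any remaining MRs the same way */
	ibv_close_device(ctx);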

> > But, still, if you really want to be fine grained, then I think
> > invalidating the impacted MR's is a better solution for RDMA than
> > trying to do it with the IOMMU...
> 
> If there's a better routine for handling ib_umem_lease_break() I'd
> love to use it. Right now I'm reaching for the only tool I know for
> kernel enforced revocation of DMA access.

Well, you'd have to code something in the MR code to keep track of DAX
MRs and issue an out of band invalidate to impacted MRs to create the
fence.

This probably needs some driver work, I'm not sure if all the hardware
can do out of band invalidate to any MR or not..

Generally speaking, in RDMA, when a new feature like this comes along
we have to push a lot of the work down to the driver authors, and the
approach has historically been that new features only work on some
hardware (as much as I dislike this, it is pragmatic)

So, not being able to support DAX on certain RDMA hardware is not
an unreasonable situation in our space.

Jason


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-10 Thread Jason Gunthorpe
On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote:

> > I don't think this has ever come up in the context of an all-device MR
> > invalidate requirement. Drivers already have code to invalidate
> > specific MRs, but to find all MRs that touch certain pages and then
> > invalidate them would be new code.
> >
> > We also have ODP aware drivers that can retarget a MR to new
> > physical pages. If the block map changes DAX should synchronously
> > retarget the ODP MR, not halt DMA.
> 
> Have a look at the patch [1], I don't touch the ODP path.

But, does ODP work OK already? I'm not clear on that..

> > Most likely ODP & DAX would need to be used together to get robust
> > user applications, as having the user QP's go to an error state at
> > random times (due to DMA failures) during operation is never going to
> > be acceptable...
> 
> It's not random. The process that set up the mapping and registered
> the memory gets SIGIO when someone else tries to modify the file map.
> That process then gets /proc/sys/fs/lease-break-time seconds to fix
> the problem before the kernel force revokes the DMA access.

Well, the process can't fix the problem in bounded time, so it is
random if it will fail or not.

MR life time is under the control of the remote side, and time to
complete the network exchanges required to release the MRs is hard to
bound. So even if I implement SIGIO properly my app will still likely
have random QP failures under various cases and work loads. :(

This is why ODP should be the focus because this cannot work fully
reliably otherwise..

> > Perhaps you might want to initially only support ODP MR mappings with
> > DAX and then the DMA fencing issue goes away?
> 
> I'd rather try to fix the non-ODP DAX case instead of just turning it off.

Well, what about using SIGKILL if the lease-break-time hits? The
kernel will clean up the MRs when the process exits and this will
fence DMA to that memory.

But, still, if you really want to be fine grained, then I think
invalidating the impacted MR's is a better solution for RDMA than
trying to do it with the IOMMU...

Jason


[PATCH v8 14/14] tools/testing/nvdimm: enable rdma unit tests

2017-10-10 Thread Dan Williams
Provide a mock dma_get_iommu_domain() for the ibverbs core. Enable
ib_umem_get() to satisfy its DAX safety checks for a controlled test.
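For context, the mock below relies on GNU ld's symbol wrapping: building
with "--wrap=dma_get_iommu_domain" (see the ldflags-y addition in the
Kbuild hunk) reroutes every call site to the __wrap_ variant, while the
original stays reachable under the __real_ prefix. In sketch form:

	/* with --wrap=dma_get_iommu_domain, callers are redirected here */
	struct iommu_domain *__wrap_dma_get_iommu_domain(struct device *dev);

	/* the untouched implementation remains callable as */
	struct iommu_domain *__real_dma_get_iommu_domain(struct device *dev);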

Signed-off-by: Dan Williams 
---
 tools/testing/nvdimm/Kbuild |   31 +++
 tools/testing/nvdimm/config_check.c |2 ++
 tools/testing/nvdimm/test/iomap.c   |   14 ++
 3 files changed, 47 insertions(+)

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index d870520da68b..f4a007090950 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -15,11 +15,13 @@ ldflags-y += --wrap=insert_resource
 ldflags-y += --wrap=remove_resource
 ldflags-y += --wrap=acpi_evaluate_object
 ldflags-y += --wrap=acpi_evaluate_dsm
+ldflags-y += --wrap=dma_get_iommu_domain
 
 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
 ACPI_SRC := $(DRIVERS)/acpi/nfit
 DAX_SRC := $(DRIVERS)/dax
+IBCORE := $(DRIVERS)/infiniband/core
 ccflags-y := -I$(src)/$(NVDIMM_SRC)/
 
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
@@ -33,6 +35,7 @@ obj-$(CONFIG_DAX) += dax.o
 endif
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+obj-$(CONFIG_INFINIBAND) += ib_core.o
 
 nfit-y := $(ACPI_SRC)/core.o
 nfit-$(CONFIG_X86_MCE) += $(ACPI_SRC)/mce.o
@@ -75,4 +78,32 @@ libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
 libnvdimm-$(CONFIG_NVDIMM_DAX) += $(NVDIMM_SRC)/dax_devs.o
 libnvdimm-y += config_check.o
 
+ib_core-y := $(IBCORE)/packer.o
+ib_core-y += $(IBCORE)/ud_header.o
+ib_core-y += $(IBCORE)/verbs.o
+ib_core-y += $(IBCORE)/cq.o
+ib_core-y += $(IBCORE)/rw.o
+ib_core-y += $(IBCORE)/sysfs.o
+ib_core-y += $(IBCORE)/device.o
+ib_core-y += $(IBCORE)/fmr_pool.o
+ib_core-y += $(IBCORE)/cache.o
+ib_core-y += $(IBCORE)/netlink.o
+ib_core-y += $(IBCORE)/roce_gid_mgmt.o
+ib_core-y += $(IBCORE)/mr_pool.o
+ib_core-y += $(IBCORE)/addr.o
+ib_core-y += $(IBCORE)/sa_query.o
+ib_core-y += $(IBCORE)/multicast.o
+ib_core-y += $(IBCORE)/mad.o
+ib_core-y += $(IBCORE)/smi.o
+ib_core-y += $(IBCORE)/agent.o
+ib_core-y += $(IBCORE)/mad_rmpp.o
+ib_core-y += $(IBCORE)/security.o
+ib_core-y += $(IBCORE)/nldev.o
+
+ib_core-$(CONFIG_INFINIBAND_USER_MEM) += $(IBCORE)/umem.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_odp.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += $(IBCORE)/cgroup.o
+ib_core-y += config_check.o
+
 obj-m += test/
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
index 7dc5a0af9b54..33e7c805bfd6 100644
--- a/tools/testing/nvdimm/config_check.c
+++ b/tools/testing/nvdimm/config_check.c
@@ -14,4 +14,6 @@ void check(void)
BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX));
BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM));
+   BUILD_BUG_ON(!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM));
+   BUILD_BUG_ON(!IS_MODULE(CONFIG_INFINIBAND));
 }
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index e1f75a1914a1..1e439b2b01e7 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -388,4 +389,17 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
 }
 EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
 
+/*
+ * This assumes that any iommu api routine we would call with this
+ * domain checks for NULL ops and either returns an error or does
+ * nothing.
+ */
+struct iommu_domain *__wrap_dma_get_iommu_domain(struct device *dev)
+{
+   static struct iommu_domain domain;
+
+   return &domain;
+}
+EXPORT_SYMBOL(__wrap_dma_get_iommu_domain);
+
 MODULE_LICENSE("GPL v2");



[PATCH v8 10/14] device-dax: wire up ->lease_direct()

2017-10-10 Thread Dan Williams
The only event that will break a lease_direct lease in the device-dax
case is the device shutdown path where the physical pages might get
assigned to another device.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/dax/Kconfig   |1 +
 drivers/dax/device.c  |4 
 fs/Kconfig|4 
 fs/Makefile   |3 ++-
 fs/mapdirect.c|3 ++-
 include/linux/mapdirect.h |5 -
 6 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index b79aa8f7a497..be03d4dbe646 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -8,6 +8,7 @@ if DAX
 config DEV_DAX
tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
+   depends on FILE_LOCKING
help
  Support raw access to differentiated (persistence, bandwidth,
  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e9f3b3e4bbf4..fa75004185c4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include 
 #include 
 #include 
 #include 
@@ -430,6 +431,7 @@ static int dev_dax_fault(struct vm_fault *vmf)
 static const struct vm_operations_struct dax_vm_ops = {
.fault = dev_dax_fault,
.huge_fault = dev_dax_huge_fault,
+   .lease_direct = map_direct_lease,
 };
 
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
@@ -540,8 +542,10 @@ static void kill_dev_dax(struct dev_dax *dev_dax)
 {
struct dax_device *dax_dev = dev_dax->dax_dev;
struct inode *inode = dax_inode(dax_dev);
+   const bool wait = true;
 
kill_dax(dax_dev);
+   break_layout(inode, wait);
unmap_mapping_range(inode->i_mapping, 0, 0, 1);
 }
 
diff --git a/fs/Kconfig b/fs/Kconfig
index a7b31a96a753..3668cfb046d5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -59,6 +59,10 @@ config FS_DAX_PMD
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
 
+config DAX_MAP_DIRECT
+   bool
+   default FS_DAX || DEV_DAX
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index c0e791d235d8..21b8fb104656 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,8 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
 obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
-obj-$(CONFIG_FS_DAX)   += dax.o mapdirect.o
+obj-$(CONFIG_FS_DAX)   += dax.o
+obj-$(CONFIG_DAX_MAP_DIRECT)   += mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)+= crypto/
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index c6954033fc1a..dd4a16f9ffc6 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -218,7 +218,7 @@ static const struct lock_manager_operations lease_direct_lm_ops = {
.lm_change = lease_direct_lm_change,
 };
 
-static struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
void (*lds_break_fn)(void *), void *lds_owner)
 {
struct file *file = vma->vm_file;
@@ -272,6 +272,7 @@ static struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
kfree(lds);
return ERR_PTR(rc);
 }
+EXPORT_SYMBOL_GPL(map_direct_lease);
 
 struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma,
void (*break_fn)(void *), void *owner)
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
index e0df6ac5795a..6695fdcf8009 100644
--- a/include/linux/mapdirect.h
+++ b/include/linux/mapdirect.h
@@ -26,13 +26,15 @@ struct lease_direct {
struct lease_direct_state *lds;
 };
 
-#if IS_ENABLED(CONFIG_FS_DAX)
+#if IS_ENABLED(CONFIG_DAX_MAP_DIRECT)
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
 bool test_map_direct_valid(struct map_direct_state *mds);
 void generic_map_direct_open(struct vm_area_struct *vma);
 void generic_map_direct_close(struct vm_area_struct *vma);
 struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma,
void (*ld_break_fn)(void *), void *ld_owner);
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+   void (*lds_break_fn)(void *), void *lds_owner);
 void map_direct_lease_destroy(struct lease_direct *ld);
 #else
 static inline struct map_direct_state *map_direct_register(int fd,
@@ -47,6 +49,7 @@ static inline bool test_map_direct_valid(struct map_direct_state *mds)
 #define generic_map_direct_open NULL
 

[PATCH v8 11/14] iommu: up-level sg_num_pages() from amd-iommu

2017-10-10 Thread Dan Williams
iommu_sg_num_pages() is a helper that walks a scatterlist and counts
pages, taking segment boundaries and iommu_num_pages() into account.
Up-level it for determining the IOVA range that dma_map_ops established
at dma_map_sg() time. The intent is to iommu_unmap() the IOVA range in
advance of freeing it.

Cc: Joerg Roedel 
Signed-off-by: Dan Williams 
---
 drivers/iommu/amd_iommu.c |   30 ++
 drivers/iommu/iommu.c |   27 +++
 include/linux/iommu.h |2 ++
 3 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index c8e1a45af182..4795b0823469 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2459,32 +2459,6 @@ static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size,
__unmap_single(dma_dom, dma_addr, size, dir);
 }
 
-static int sg_num_pages(struct device *dev,
-   struct scatterlist *sglist,
-   int nelems)
-{
-   unsigned long mask, boundary_size;
-   struct scatterlist *s;
-   int i, npages = 0;
-
-   mask  = dma_get_seg_boundary(dev);
-   boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT :
-  1UL << (BITS_PER_LONG - PAGE_SHIFT);
-
-   for_each_sg(sglist, s, nelems, i) {
-   int p, n;
-
-   s->dma_address = npages << PAGE_SHIFT;
-   p = npages % boundary_size;
-   n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
-   if (p + n > boundary_size)
-   npages += boundary_size - p;
-   npages += n;
-   }
-
-   return npages;
-}
-
 /*
  * The exported map_sg function for dma_ops (handles scatter-gather
  * lists).
@@ -2507,7 +2481,7 @@ static int map_sg(struct device *dev, struct scatterlist *sglist,
dma_dom  = to_dma_ops_domain(domain);
dma_mask = *dev->dma_mask;
 
-   npages = sg_num_pages(dev, sglist, nelems);
+   npages = iommu_sg_num_pages(dev, sglist, nelems);
 
address = dma_ops_alloc_iova(dev, dma_dom, npages, dma_mask);
if (address == AMD_IOMMU_MAPPING_ERROR)
@@ -2585,7 +2559,7 @@ static void unmap_sg(struct device *dev, struct scatterlist *sglist,
 
startaddr = sg_dma_address(sglist) & PAGE_MASK;
dma_dom   = to_dma_ops_domain(domain);
-   npages= sg_num_pages(dev, sglist, nelems);
+   npages= iommu_sg_num_pages(dev, sglist, nelems);
 
__unmap_single(dma_dom, startaddr, npages << PAGE_SHIFT, dir);
 }
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3de5c0bcb5cc..cfe6eeea3578 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
@@ -1631,6 +1632,32 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
+int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+   int nelems)
+{
+   unsigned long mask, boundary_size;
+   struct scatterlist *s;
+   int i, npages = 0;
+
+   mask = dma_get_seg_boundary(dev);
+   boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT
+   : 1UL << (BITS_PER_LONG - PAGE_SHIFT);
+
+   for_each_sg(sglist, s, nelems, i) {
+   int p, n;
+
+   s->dma_address = npages << PAGE_SHIFT;
+   p = npages % boundary_size;
+   n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
+   if (p + n > boundary_size)
+   npages += boundary_size - p;
+   npages += n;
+   }
+
+   return npages;
+}
+EXPORT_SYMBOL_GPL(iommu_sg_num_pages);
+
 size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 struct scatterlist *sg, unsigned int nents, int prot)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a7f2ac689d29..5b2d20e1475a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -303,6 +303,8 @@ extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
   unsigned long iova, size_t size);
+extern int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+   int nelems);
 extern size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
struct scatterlist *sg, unsigned int nents,
int prot);



[PATCH v8 07/14] iommu, dma-mapping: introduce dma_get_iommu_domain()

2017-10-10 Thread Dan Williams
Add a dma-mapping api helper to retrieve the generic iommu_domain for a
device.  The motivation for this interface is making RDMA transfers to
DAX mappings safe. If the DAX file's block map changes we need to be able to
reliably stop accesses to blocks that have been freed or re-assigned to
a new file. With the iommu_domain and a callback from the DAX filesystem
the kernel can safely revoke access to a DMA device. The process that
performed the RDMA memory registration is also notified of this
revocation event, but the kernel can not otherwise be in the position of
waiting for userspace to quiesce the device.

Since PMEM+DAX is currently only enabled for x86, we only update the x86
iommu drivers.
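To show how the helper is meant to compose with the rest of the series,
here is a hedged sketch (not from the patch; revoke_dma_access() is a
hypothetical caller) that pairs dma_get_iommu_domain() with
iommu_sg_num_pages() (patch 11) and iommu_unmap() to force-fence a
scatterlist mapping:

static void revoke_dma_access(struct device *dev, struct scatterlist *sgl,
		int nelems)
{
	struct iommu_domain *domain = dma_get_iommu_domain(dev);
	unsigned long iova;
	int npages;

	if (!domain)
		return;	/* no iommu translation: nothing to revoke here */

	iova = sg_dma_address(sgl) & PAGE_MASK;
	npages = iommu_sg_num_pages(dev, sgl, nelems);
	iommu_unmap(domain, iova, (size_t)npages << PAGE_SHIFT);
}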

Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Greg Kroah-Hartman 
Cc: Joerg Roedel 
Cc: David Woodhouse 
Cc: Ashok Raj 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/base/dma-mapping.c  |   10 ++
 drivers/iommu/amd_iommu.c   |   10 ++
 drivers/iommu/intel-iommu.c |   15 +++
 include/linux/dma-mapping.h |3 +++
 4 files changed, 38 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index e584eddef0a7..fdb9764f95a4 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
of_dma_deconfigure(dev);
acpi_dma_deconfigure(dev);
 }
+
+struct iommu_domain *dma_get_iommu_domain(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   if (ops && ops->get_iommu)
+   return ops->get_iommu(dev);
+   return NULL;
+}
+EXPORT_SYMBOL(dma_get_iommu_domain);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 51f8215877f5..c8e1a45af182 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev)
return domain;
 }
 
+static struct iommu_domain *amd_dma_get_iommu(struct device *dev)
+{
+   struct protection_domain *domain = get_domain(dev);
+
+   if (IS_ERR(domain))
+   return NULL;
+   return &domain->domain;
+}
+
 static void update_device_table(struct protection_domain *domain)
 {
struct iommu_dev_data *dev_data;
@@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
.unmap_sg   = unmap_sg,
.dma_supported  = amd_iommu_dma_supported,
.mapping_error  = amd_iommu_mapping_error,
+   .get_iommu  = amd_dma_get_iommu,
 };
 
 static int init_reserved_iova_ranges(void)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..f3f4939cebad 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev)
return 0;
 }
 
+static struct iommu_domain *intel_dma_get_iommu(struct device *dev)
+{
+   struct dmar_domain *domain;
+
+   if (iommu_no_mapping(dev))
+   return NULL;
+
+   domain = get_valid_domain_for_dev(dev);
+   if (!domain)
+   return NULL;
+
+   return &domain->domain;
+}
+
 static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 size_t size, int dir, u64 dma_mask)
 {
@@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = {
.map_page = intel_map_page,
.unmap_page = intel_unmap_page,
.mapping_error = intel_mapping_error,
+   .get_iommu = intel_dma_get_iommu,
 #ifdef CONFIG_X86
.dma_supported = x86_dma_supported,
 #endif
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 29ce9815da87..aa62df1d0d72 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -128,6 +128,7 @@ struct dma_map_ops {
   enum dma_data_direction dir);
int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
int (*dma_supported)(struct device *dev, u64 mask);
+   struct iommu_domain *(*get_iommu)(struct device *dev);
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
u64 (*get_required_mask)(struct device *dev);
 #endif
@@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif
 
+extern struct iommu_domain *dma_get_iommu_domain(struct device *dev);
+
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
  size_t size,
  enum dma_data_direction dir,


[PATCH v8 12/14] iommu/vt-d: use iommu_num_sg_pages

2017-10-10 Thread Dan Williams
Use the common helper for accounting the size of the IOVA range for a
scatterlist so that iommu and dma apis agree on the size of a
scatterlist. This is in support of using iommu_unmap() in advance of
dma_unmap_sg() to invalidate an io-mapping before the IOVA range is
deallocated. MAP_DIRECT needs this functionality to force-revoke RDMA
access to a DAX mapping when userspace fails to respond within the
lease break timeout period.

Cc: Ashok Raj 
Cc: David Woodhouse 
Cc: Joerg Roedel 
Signed-off-by: Dan Williams 
---
 drivers/iommu/intel-iommu.c |   19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f3f4939cebad..94a5fbe62fb8 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3785,14 +3785,9 @@ static void intel_unmap_sg(struct device *dev, struct scatterlist *sglist,
   unsigned long attrs)
 {
dma_addr_t startaddr = sg_dma_address(sglist) & PAGE_MASK;
-   unsigned long nrpages = 0;
-   struct scatterlist *sg;
-   int i;
-
-   for_each_sg(sglist, sg, nelems, i) {
-   nrpages += aligned_nrpages(sg_dma_address(sg), sg_dma_len(sg));
-   }
+   unsigned long nrpages;
 
+   nrpages = iommu_sg_num_pages(dev, sglist, nelems);
intel_unmap(dev, startaddr, nrpages << VTD_PAGE_SHIFT);
 }
 
@@ -3813,14 +3808,12 @@ static int intel_nontranslate_map_sg(struct device *hddev,
 static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nelems,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
struct dmar_domain *domain;
size_t size = 0;
int prot = 0;
unsigned long iova_pfn;
int ret;
-   struct scatterlist *sg;
-   unsigned long start_vpfn;
+   unsigned long start_vpfn, npages;
struct intel_iommu *iommu;
 
BUG_ON(dir == DMA_NONE);
@@ -3833,11 +3826,9 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nelems,
 
iommu = domain_get_iommu(domain);
 
-   for_each_sg(sglist, sg, nelems, i)
-   size += aligned_nrpages(sg->offset, sg->length);
+   npages = iommu_sg_num_pages(dev, sglist, nelems);
 
-   iova_pfn = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size),
-   *dev->dma_mask);
+   iova_pfn = intel_alloc_iova(dev, domain, npages, *dev->dma_mask);
if (!iova_pfn) {
sglist->dma_length = 0;
return 0;



[PATCH v8 08/14] fs, mapdirect: introduce ->lease_direct()

2017-10-10 Thread Dan Williams
Provide a vma operation that registers a lease that is broken by
break_layout(). This is motivated by a need to stop in-progress RDMA
when the block-map of a DAX-file changes. I.e. since DAX gives
direct-access to filesystem blocks we can not allow those blocks to move
or change state while they are under active RDMA. So, if the filesystem
determines it needs to move blocks it can revoke device access before
proceeding.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/mapdirect.c|  144 +
 include/linux/mapdirect.h |   14 
 include/linux/mm.h|8 +++
 3 files changed, 166 insertions(+)

diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index 9f4dd7395dcd..c6954033fc1a 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -32,12 +33,25 @@ struct map_direct_state {
struct vm_area_struct *mds_vma;
 };
 
+struct lease_direct_state {
+   void *lds_owner;
+   struct file *lds_file;
+   unsigned long lds_state;
+   void (*lds_break_fn)(void *lds_owner);
+   struct delayed_work lds_work;
+};
+
 bool test_map_direct_valid(struct map_direct_state *mds)
 {
 return test_bit(MAPDIRECT_VALID, &mds->mds_state);
 }
 EXPORT_SYMBOL_GPL(test_map_direct_valid);
 
+static bool test_map_direct_broken(struct map_direct_state *mds)
+{
+   return test_bit(MAPDIRECT_BREAK, &mds->mds_state);
+}
+
 static void put_map_direct(struct map_direct_state *mds)
 {
 if (!atomic_dec_and_test(&mds->mds_ref))
@@ -168,6 +182,136 @@ static const struct lock_manager_operations map_direct_lm_ops = {
.lm_setup = map_direct_lm_setup,
 };
 
+static void lease_direct_invalidate(struct work_struct *work)
+{
+   struct lease_direct_state *lds;
+   void *owner;
+
+   lds = container_of(work, typeof(*lds), lds_work.work);
+   owner = lds;
+   lds->lds_break_fn(lds->lds_owner);
+   vfs_setlease(lds->lds_file, F_UNLCK, NULL, &owner);
+}
+
+static bool lease_direct_lm_break(struct file_lock *fl)
+{
+   struct lease_direct_state *lds = fl->fl_owner;
+
+   if (!test_and_set_bit(MAPDIRECT_BREAK, &lds->lds_state))
+   schedule_delayed_work(&lds->lds_work, lease_break_time * HZ);
+
+   /* Tell the core lease code to wait for delayed work completion */
+   fl->fl_break_time = 0;
+
+   return false;
+}
+
+static int lease_direct_lm_change(struct file_lock *fl, int arg,
+   struct list_head *dispose)
+{
+   WARN_ON(!(arg & F_UNLCK));
+   return lease_modify(fl, arg, dispose);
+}
+
+static const struct lock_manager_operations lease_direct_lm_ops = {
+   .lm_break = lease_direct_lm_break,
+   .lm_change = lease_direct_lm_change,
+};
+
+static struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+   void (*lds_break_fn)(void *), void *lds_owner)
+{
+   struct file *file = vma->vm_file;
+   struct lease_direct_state *lds;
+   struct lease_direct *ld;
+   struct file_lock *fl;
+   int rc = -ENOMEM;
+   void *owner;
+
+   ld = kzalloc(sizeof(*ld) + sizeof(*lds), GFP_KERNEL);
+   if (!ld)
+   return ERR_PTR(-ENOMEM);
+   INIT_LIST_HEAD(&ld->list);
+   lds = (struct lease_direct_state *)(ld + 1);
+   owner = lds;
+   ld->lds = lds;
+   lds->lds_break_fn = lds_break_fn;
+   lds->lds_owner = lds_owner;
+   INIT_DELAYED_WORK(&lds->lds_work, lease_direct_invalidate);
+   lds->lds_file = get_file(file);
+
+   fl = locks_alloc_lock();
+   if (!fl)
+   goto err_lock_alloc;
+
+   locks_init_lock(fl);
+   fl->fl_lmops = &lease_direct_lm_ops;
+   fl->fl_flags = FL_LAYOUT;
+   fl->fl_type = F_RDLCK;
+   fl->fl_end = OFFSET_MAX;
+   fl->fl_owner = lds;
+   fl->fl_pid = current->tgid;
+   fl->fl_file = file;
+
+   rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+   if (rc)
+   goto err_setlease;
+   if (fl) {
+   WARN_ON(1);
+   owner = lds;
+   vfs_setlease(file, F_UNLCK, NULL, &owner);
+   owner = NULL;
+   rc = -ENXIO;
+   goto err_setlease;
+   }
+
+   return ld;
+err_setlease:
+   locks_free_lock(fl);
+err_lock_alloc:
+   kfree(lds);
+   return ERR_PTR(rc);
+}
+
+struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma,
+   void (*break_fn)(void *), void *owner)
+{
+   struct lease_direct *ld;
+
+   ld = map_direct_lease(vma, break_fn, owner);
+
+   if (IS_ERR(ld))
+   return ld;
+
+   /*
+* We now 

[PATCH v8 03/14] fs: MAP_DIRECT core

2017-10-10 Thread Dan Williams
Introduce a set of helper apis for filesystems to establish FL_LAYOUT
leases to protect against writes and block map updates while a
MAP_DIRECT mapping is established. While the lease protects against the
syscall write path and fallocate it does not protect against allocating
write-faults, so this relies on i_mapdcount to disable block map updates
from write faults.

Like the pnfs case MAP_DIRECT does its own timeout of the lease since we
need to have a process context for running map_direct_invalidate().

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/Kconfig|1 
 fs/Makefile   |2 
 fs/mapdirect.c|  237 +
 include/linux/mapdirect.h |   40 
 4 files changed, 279 insertions(+), 1 deletion(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 include/linux/mapdirect.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..a7b31a96a753 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
 config FS_DAX
bool "Direct Access (DAX) support"
depends on MMU
+   depends on FILE_LOCKING
depends on !(ARM || MIPS || SPARC)
select FS_IOMAP
select DAX
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..c0e791d235d8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
 obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
-obj-$(CONFIG_FS_DAX)   += dax.o
+obj-$(CONFIG_FS_DAX)   += dax.o mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)+= crypto/
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
new file mode 100644
index ..9f4dd7395dcd
--- /dev/null
+++ b/fs/mapdirect.c
@@ -0,0 +1,237 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAPDIRECT_BREAK 0
+#define MAPDIRECT_VALID 1
+
+struct map_direct_state {
+   atomic_t mds_ref;
+   atomic_t mds_vmaref;
+   unsigned long mds_state;
+   struct inode *mds_inode;
+   struct delayed_work mds_work;
+   struct fasync_struct *mds_fa;
+   struct vm_area_struct *mds_vma;
+};
+
+bool test_map_direct_valid(struct map_direct_state *mds)
+{
+   return test_bit(MAPDIRECT_VALID, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(test_map_direct_valid);
+
+static void put_map_direct(struct map_direct_state *mds)
+{
+   if (!atomic_dec_and_test(&mds->mds_ref))
+   return;
+   kfree(mds);
+}
+
+static void put_map_direct_vma(struct map_direct_state *mds)
+{
+   struct vm_area_struct *vma = mds->mds_vma;
+   struct file *file = vma->vm_file;
+   struct inode *inode = file_inode(file);
+   void *owner = mds;
+
+   if (!atomic_dec_and_test(&mds->mds_vmaref))
+   return;
+
+   /*
+* Flush in-flight+forced lm_break events that may be
+* referencing this dying vma.
+*/
+   mds->mds_vma = NULL;
+   set_bit(MAPDIRECT_BREAK, &mds->mds_state);
+   vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+   flush_delayed_work(&mds->mds_work);
+   iput(inode);
+
+   put_map_direct(mds);
+}
+
+void generic_map_direct_close(struct vm_area_struct *vma)
+{
+   put_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_close);
+
+static void get_map_direct_vma(struct map_direct_state *mds)
+{
+   atomic_inc(&mds->mds_vmaref);
+}
+
+void generic_map_direct_open(struct vm_area_struct *vma)
+{
+   get_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_open);
+
+static void map_direct_invalidate(struct work_struct *work)
+{
+   struct map_direct_state *mds;
+   struct vm_area_struct *vma;
+   struct inode *inode;
+   void *owner;
+
+   mds = container_of(work, typeof(*mds), mds_work.work);
+
+   clear_bit(MAPDIRECT_VALID, &mds->mds_state);
+
+   vma = ACCESS_ONCE(mds->mds_vma);
+   inode = mds->mds_inode;
+   

[PATCH v8 06/14] xfs: wire up MAP_DIRECT

2017-10-10 Thread Dan Williams
MAP_DIRECT is an mmap(2) flag with the following semantics:

  MAP_DIRECT
  When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
  same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease
  is broken when a "lease breaker" attempts to write(2), change the block
  map (fallocate), or change the size of the file. Otherwise the mechanism
  of a lease break is identical to the typical lease break case where the
  lease needs to be removed (munmap) within the number of seconds
  specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
  remove the lease in time the kernel will invalidate the mapping and
  force all future accesses to the mapping to trigger SIGBUS.

  In addition to lease break timeouts causing faults in the mapping to
  result in SIGBUS, other states of the file will trigger SIGBUS at fault
  time:

  * The fault would trigger the filesystem to allocate blocks
  * The fault would trigger the filesystem to perform extent conversion

  In other words, MAP_DIRECT expects and enforces a fully allocated file
  where faults can be satisfied without modifying block map metadata.

  An unprivileged process may establish a MAP_DIRECT mapping on a file
  whose UID (owner) matches the filesystem UID of the process. A process
  with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
  arbitrary files.

  ERRORS
  EACCES Beyond the typical mmap(2) conditions that trigger EACCES
  MAP_DIRECT also requires the permission to set a file lease.

  EOPNOTSUPP The filesystem explicitly does not support the flag

  EPERM The file does not permit MAP_DIRECT mappings. Potential reasons
  are that DAX access is not available or the file has reflink extents.

  SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
 might require block-map updates, or the lease timed out and the
 kernel invalidated the mapping.
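  For illustration, a hypothetical user of the interface would look
  something like this (a sketch against the proposed ABI, not a tested
  program; map_direct() is an invented helper name):

	#include <sys/mman.h>

	int map_direct(int fd, size_t len, void **out)
	{
		void *addr;

		/*
		 * The file must already be fully allocated and written:
		 * faults that would allocate blocks or convert unwritten
		 * extents trigger SIGBUS under MAP_DIRECT.
		 */
		addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
		if (addr == MAP_FAILED)
			return -1;	/* e.g. EOPNOTSUPP, EPERM, EACCES */
		*out = addr;
		return 0;
	}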

Cc: Jan Kara 
Cc: Arnd Bergmann 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Alexander Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/xfs/Kconfig  |2 -
 fs/xfs/xfs_file.c   |  103 ++-
 include/linux/mman.h|3 +
 include/uapi/asm-generic/mman.h |1 
 4 files changed, 106 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index f62fc6629abb..f8765653a438 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
 
 config XFS_LAYOUT
def_bool y
-   depends on EXPORTFS_BLOCK_OPS
+   depends on EXPORTFS_BLOCK_OPS || FS_DAX
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebdd0bd2b261..4bee027c9366 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -40,12 +40,22 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static const struct vm_operations_struct xfs_file_vm_ops;
+static const struct vm_operations_struct xfs_file_vm_direct_ops;
+
+static bool
+xfs_vma_is_direct(
+   struct vm_area_struct   *vma)
+{
+   return vma->vm_ops == &xfs_file_vm_direct_ops;
+}
 
 /*
  * Clear the specified ranges to zero through either the pagecache or DAX.
@@ -1009,6 +1019,22 @@ xfs_file_llseek(
 }
 
 /*
+ * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
+ * valid. See map_direct_invalidate.
+ */
+static int
+xfs_can_fault_direct(
+   struct vm_area_struct   *vma)
+{
+   if (!xfs_vma_is_direct(vma))
+   return 0;
+
+   if (!test_map_direct_valid(vma->vm_private_data))
+   return VM_FAULT_SIGBUS;
+   return 0;
+}
+
+/*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
  *
@@ -1024,7 +1050,8 @@ __xfs_filemap_fault(
enum page_entry_sizepe_size,
boolwrite_fault)
 {
-   struct inode*inode = file_inode(vmf->vma->vm_file);
+   struct vm_area_struct   *vma = vmf->vma;
+   struct inode*inode = file_inode(vma->vm_file);
struct xfs_inode*ip = XFS_I(inode);
int ret;
 
@@ -1032,10 +1059,14 @@ __xfs_filemap_fault(
 
if (write_fault) {
sb_start_pagefault(inode->i_sb);
-   file_update_time(vmf->vma->vm_file);
+   file_update_time(vma->vm_file);
}
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+   ret = xfs_can_fault_direct(vma);
+   if (ret)
+   goto out_unlock;
+
if (IS_DAX(inode)) {
 ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
} else {
@@ 

[PATCH v8 09/14] xfs: wire up ->lease_direct()

2017-10-10 Thread Dan Williams
A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT
mapping established. For xfs we use the generic_map_direct_lease()
handler for ->lease_direct(). It establishes a new lease and then checks
if the MAP_DIRECT mapping has been broken. We want to be sure that the
process will receive notification that the MAP_DIRECT mapping is being
torn down so it knows why other code paths are throwing failures.

For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the
MAP_DIRECT mapping is invalid or in the process of being invalidated.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4bee027c9366..bc512a9a8df5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1157,6 +1157,7 @@ static const struct vm_operations_struct xfs_file_vm_direct_ops = {
 
.open   = generic_map_direct_open,
.close  = generic_map_direct_close,
+   .lease_direct   = generic_map_direct_lease,
 };
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
@@ -1209,8 +1210,8 @@ xfs_file_mmap_direct(
vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
 
/*
-* generic_map_direct_{open,close} expect ->vm_private_data is
-* set to the result of map_direct_register
+* generic_map_direct_{open,close,lease} expect
+* ->vm_private_data is set to the result of map_direct_register
 */
vma->vm_private_data = mds;
return 0;



[PATCH v8 04/14] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT

2017-10-10 Thread Dan Williams
Move xfs_break_layouts() to its own compilation unit so that it can be
used for both pnfs layouts and MAP_DIRECT mappings.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/xfs/Kconfig  |4 
 fs/xfs/Makefile |1 +
 fs/xfs/xfs_layout.c |   42 ++
 fs/xfs/xfs_layout.h |   13 +
 fs/xfs/xfs_pnfs.c   |   30 --
 fs/xfs/xfs_pnfs.h   |   10 ++
 6 files changed, 62 insertions(+), 38 deletions(-)
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 1b98cfa342ab..f62fc6629abb 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL
  result in warnings.
 
  This behavior can be modified at runtime via sysfs.
+
+config XFS_LAYOUT
+   def_bool y
+   depends on EXPORTFS_BLOCK_OPS
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a6e955bfead8..d44135107490 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL) += xfs_acl.o
 xfs-$(CONFIG_SYSCTL)   += xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)   += xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)   += xfs_pnfs.o
+xfs-$(CONFIG_XFS_LAYOUT)   += xfs_layout.o
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
new file mode 100644
index ..71d95e1a910a
--- /dev/null
+++ b/fs/xfs/xfs_layout.c
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include "xfs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+
+#include 
+
+/*
+ * Ensure that we do not have any outstanding pNFS layouts that can be used by
+ * clients to directly read from or write to this inode.  This must be called
+ * before every operation that can remove blocks from the extent map.
+ * Additionally we call it during the write operation, where we aren't concerned
+ * about exposing unallocated blocks but just want to provide basic
+ * synchronization between a local writer and pNFS clients.  mmap writes would
+ * also benefit from this sort of synchronization, but due to the tricky locking
+ * rules in the page fault path we don't bother.
+ */
+int
+xfs_break_layouts(
+   struct inode*inode,
+   uint*iolock)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+   int error;
+
+   ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+   while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
+   xfs_iunlock(ip, *iolock);
+   error = break_layout(inode, true);
+   *iolock = XFS_IOLOCK_EXCL;
+   xfs_ilock(ip, *iolock);
+   }
+
+   return error;
+}
diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h
new file mode 100644
index ..f848ee78cc93
--- /dev/null
+++ b/fs/xfs/xfs_layout.h
@@ -0,0 +1,13 @@
+#ifndef _XFS_LAYOUT_H
+#define _XFS_LAYOUT_H 1
+
+#ifdef CONFIG_XFS_LAYOUT
+int xfs_break_layouts(struct inode *inode, uint *iolock);
+#else
+static inline int
+xfs_break_layouts(struct inode *inode, uint *iolock)
+{
+   return 0;
+}
+#endif /* CONFIG_XFS_LAYOUT */
+#endif /* _XFS_LAYOUT_H */
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 2f2dc3c09ad0..8ec72220e73b 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -20,36 +20,6 @@
 #include "xfs_pnfs.h"
 
 /*
- * Ensure that we do not have any outstanding pNFS layouts that can be used by
- * clients to directly read from or write to this inode.  This must be called
- * before every operation that can remove blocks from the extent map.
- * Additionally we call it during the write operation, where we aren't concerned
- * about exposing unallocated blocks but just want to provide basic
- * synchronization between a local writer and pNFS clients.  mmap writes would
- * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
- */
-int
-xfs_break_layouts(
-   struct inode*inode,
-   uint*iolock)
-{
-   struct xfs_inode*ip = XFS_I(inode);
-   int error;
-
-   ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
-   while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
-   xfs_iunlock(ip, *iolock);
-   error = break_layout(inode, true);
-   *iolock = XFS_IOLOCK_EXCL;
-   xfs_ilock(ip, *iolock);
-   }
-
-   return error;
-}
-
-/*
  * Get a unique ID including its location so that the client can 

[PATCH v8 05/14] fs, xfs, iomap: introduce iomap_can_allocate()

2017-10-10 Thread Dan Williams
In preparation for using FL_LAYOUT leases to allow coordination between
the kernel and processes doing userspace flushes / RDMA with DAX
mappings, add this helper that can be used to detect when block-map
updates are not allowed.

This is targeted to be used in an ->iomap_begin() implementation where
we may have various filesystem locks held and can not synchronously wait
for any FL_LAYOUT leases to be released. In particular an iomap mmap
fault handler running under mmap_sem can not unlock that semaphore and
wait for these leases to be unlocked. Instead, this signals the lease
holder(s) that a break is requested and immediately returns with an
error.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Suggested-by: Dave Chinner 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_iomap.c|3 +++
 fs/xfs/xfs_layout.c   |5 -
 include/linux/iomap.h |   10 ++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index a1909bc064e9..b3cda11e9515 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1052,6 +1052,9 @@ xfs_file_iomap_begin(
error = -EAGAIN;
goto out_unlock;
}
+   error = iomap_can_allocate(inode);
+   if (error)
+   goto out_unlock;
/*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
	 * pages to keep the chunks of work done where somewhat symmetric
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
index 71d95e1a910a..88c533bf5b7c 100644
--- a/fs/xfs/xfs_layout.c
+++ b/fs/xfs/xfs_layout.c
@@ -19,7 +19,10 @@
  * about exposing unallocated blocks but just want to provide basic
  * synchronization between a local writer and pNFS clients.  mmap writes would
  * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
+ * rules in the page fault path all we can do is start the lease break
+ * timeout. See usage of iomap_can_allocate in xfs_file_iomap_begin to
+ * prevent write-faults from allocating blocks or performing extent
+ * conversion.
  */
 int
 xfs_break_layouts(
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f64dc6ce5161..e24b4e81d41a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -2,6 +2,7 @@
 #define LINUX_IOMAP_H 1
 
 #include <linux/types.h>
+#include <linux/fs.h>
 
 struct fiemap_extent_info;
 struct inode;
@@ -88,6 +89,15 @@ loff_t iomap_seek_hole(struct inode *inode, loff_t offset,
const struct iomap_ops *ops);
 loff_t iomap_seek_data(struct inode *inode, loff_t offset,
const struct iomap_ops *ops);
+/*
+ * Check if there are any file layout leases preventing block map
+ * changes and if so start the lease break process, but do not wait for
+ * it to complete (return -EWOULDBLOCK);
+ */
+static inline int iomap_can_allocate(struct inode *inode)
+{
+   return break_layout(inode, false);
+}
 
 /*
  * Flags for direct I/O ->end_io:

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v8 01/14] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags

2017-10-10 Thread Dan Williams
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
is guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:

I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do

ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);

and "know" that MAP_SYNC actually takes.

And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.

Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do

ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);

and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.

Boom. Done.

Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis.  Towards that end arrange for flags to be generically
validated against a mmap_supported_mask exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.
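
For a userspace consumer the calling convention then looks roughly like
the sketch below (illustrative only, not from this patch; it assumes
libc headers that predate MAP_SHARED_VALIDATE and probes for the
MAP_DIRECT extension defined later in this series):

	#include <sys/mman.h>
	#include <errno.h>

	#ifndef MAP_SHARED_VALIDATE
	#define MAP_SHARED_VALIDATE 0x3	/* value defined by this patch */
	#endif

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
	if (addr == MAP_FAILED && (errno == EOPNOTSUPP || errno == EINVAL)) {
		/*
		 * EOPNOTSUPP: flag validation is present but the extension
		 * is unsupported; EINVAL: legacy kernel rejecting the 0x3
		 * map type. Either way, fall back.
		 */
		addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			    fd, 0);
	}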

Cc: Jan Kara 
Cc: Arnd Bergmann 
Cc: Andy Lutomirski 
Cc: Andrew Morton 
Suggested-by: Christoph Hellwig 
Suggested-by: Linus Torvalds 
Signed-off-by: Dan Williams 
---
 arch/alpha/include/uapi/asm/mman.h   |1 +
 arch/mips/include/uapi/asm/mman.h|1 +
 arch/mips/kernel/vdso.c  |2 +
 arch/parisc/include/uapi/asm/mman.h  |1 +
 arch/tile/mm/elf.c   |3 +-
 arch/xtensa/include/uapi/asm/mman.h  |1 +
 include/linux/fs.h   |2 +
 include/linux/mm.h   |2 +
 include/linux/mman.h |   39 ++
 include/uapi/asm-generic/mman-common.h   |1 +
 mm/mmap.c|   21 --
 tools/include/uapi/asm-generic/mman-common.h |1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..92823f24890b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE   0x0f   /* Mask for type of mapping (OSF/1 is _wrong_) */
 #define MAP_FIXED  0x100   /* Interpret addr exactly */
 #define MAP_ANONYMOUS  0x10/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3   /* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with OSF/1 defines */
 #define _MAP_HASSEMAPHORE 0x0200
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..c77689076577 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_PRIVATE0x002   /* Changes are private */
 #define MAP_TYPE   0x00f   /* Mask for type of mapping */
 #define MAP_FIXED  0x010   /* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3   /* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME 0x020   /* Assign page to file */
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-  0, NULL);
+  0, NULL, 0);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..36b688d52de3 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ 

[PATCH v8 02/14] fs, mm: pass fd to ->mmap_validate()

2017-10-10 Thread Dan Williams
The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
block map changes while the file is mapped. It requires the fd to setup
an fasync_struct for signalling lease break events to the lease holder.
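
As a sketch of the resulting hook (illustrative only; the real consumer
is wired up in the "fs: MAP_DIRECT core" patch, and
foo_setup_direct_lease() / foo_mmap() are made-up names):

	static int
	foo_mmap_validate(
		struct file		*filp,
		struct vm_area_struct	*vma,
		unsigned long		map_flags,
		int			fd)
	{
		/*
		 * The fd names the descriptor the process passed to
		 * mmap(), so the lease's fasync_struct can be attached
		 * to it for lease-break signalling.
		 */
		if (map_flags & MAP_DIRECT)
			return foo_setup_direct_lease(filp, vma, fd);
		return foo_mmap(filp, vma);
	}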

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Andrew Morton 
Signed-off-by: Dan Williams 
---
 arch/mips/kernel/vdso.c |2 +-
 arch/tile/mm/elf.c  |2 +-
 arch/x86/mm/mpx.c   |3 ++-
 fs/aio.c|2 +-
 include/linux/fs.h  |2 +-
 include/linux/mm.h  |9 +
 ipc/shm.c   |3 ++-
 mm/internal.h   |2 +-
 mm/mmap.c   |   13 +++--
 mm/nommu.c  |5 +++--
 mm/util.c   |7 ---
 11 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index cf10654477a9..ab26c7ac0316 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-  0, NULL, 0);
+  0, NULL, 0, -1);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 5ffcbe76aef9..61a9588e141a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
addr = mmap_region(NULL, addr, INTRPT_SIZE,
   VM_READ|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
-  NULL, 0);
+  NULL, 0, -1);
if (addr > (unsigned long) -PAGE_SIZE)
retval = (int) addr;
}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9ceaa955d2ba..a8baa94a496b 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len)
 
	down_write(&mm->mmap_sem);
	addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
+		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate,
+		       NULL, -1);
	up_write(&mm->mmap_sem);
if (populate)
mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 5a2487217072..d10ca6db2ee6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
				       PROT_READ | PROT_WRITE,
-				       MAP_SHARED, 0, &unused, NULL);
+				       MAP_SHARED, 0, &unused, NULL, -1);
	up_write(&mm->mmap_sem);
if (IS_ERR((void *)ctx->mmap_base)) {
ctx->mmap_size = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 51538958f7f5..c2b9bf3dc4e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1702,7 +1702,7 @@ struct file_operations {
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*mmap_validate) (struct file *, struct vm_area_struct *,
-   unsigned long);
+   unsigned long, int);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5c4c98e4adc9..0afa19feb755 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-   struct list_head *uf, unsigned long map_flags);
+   struct list_head *uf, unsigned long map_flags, int fd);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
-   struct list_head *uf);
+   struct list_head *uf, int fd);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 struct list_head *uf);
 
@@ -2145,9 +2145,10 @@ static inline unsigned long
 

[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace flush

2017-10-10 Thread Dan Williams
Changes since v7 [1]:
* Fix IOVA reuse race by leaving the dma scatterlist mapped until
  unregistration time. Use iommu_unmap() in ib_umem_lease_break() to
  force-invalidate the ibverbs memory registration. (David Woodhouse)

* Introduce iomap_can_allocate() as a way to check if any layouts are
  present in the mmap write-fault path to prevent block map changes, and
  start the lease break process when an allocating write-fault occurs.
  This also removes the i_mapdcount bloat of 'struct inode' from v7.
  (Dave Chinner)

* Provide generic_map_direct_{open,close,lease} to cleanup the
  filesystem wiring to implement MAP_DIRECT support (Dave Chinner)

* Abandon (defer to a potential new fcntl()) support for using
  MAP_DIRECT on non-DAX files. With this change we can validate the
  inode is MAP_DIRECT capable just once at mmap time rather than every
  fault.  (Dave Chinner)

* Arrange for lease_direct leases to also wait the
  /proc/sys/fs/lease-break-time period before calling break_fn. For
  example, allow the lease-holder time to quiesce RDMA operations before
  the iommu starts throwing io-faults.

* Switch intel-iommu to use iommu_num_sg_pages().

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012707.html

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block-map, or otherwise
dirty the block-map metadata of a file without notification. It supports
a "flush from userspace" model where persistent memory applications can
bypass the overhead of ongoing coordination of writes with the
filesystem, and it provides safety to RDMA operations involving DAX
mappings.

The kernel always has the ability to revoke access and convert the file
back to normal operation after performing a "lease break". Similar to
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started; it can only delay it via the
/proc/sys/fs/lease-break-time setting.

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory with no ongoing coordination with
the filesystem via fsync/msync syscalls.
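
A rough sketch of the intended userspace flow (illustrative only;
assumes uapi headers with this series applied and a SIGIO handler
registered for lease-break notification):

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
	if (addr == MAP_FAILED) {
		/* no MAP_DIRECT support here; fall back to fsync/msync */
	} else {
		/*
		 * Stores to addr can be made durable from userspace
		 * (e.g. cache flush + fence) with no fsync. On SIGIO the
		 * block map is about to change and all access must be
		 * quiesced within lease-break-time.
		 */
	}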

---

Dan Williams (14):
  mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
  fs, mm: pass fd to ->mmap_validate()
  fs: MAP_DIRECT core
  xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
  fs, xfs, iomap: introduce iomap_can_allocate()
  xfs: wire up MAP_DIRECT
  iommu, dma-mapping: introduce dma_get_iommu_domain()
  fs, mapdirect: introduce ->lease_direct()
  xfs: wire up ->lease_direct()
  device-dax: wire up ->lease_direct()
  iommu: up-level sg_num_pages() from amd-iommu
  iommu/vt-d: use iommu_num_sg_pages
  IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
  tools/testing/nvdimm: enable rdma unit tests


 arch/alpha/include/uapi/asm/mman.h   |1 
 arch/mips/include/uapi/asm/mman.h|1 
 arch/mips/kernel/vdso.c  |2 
 arch/parisc/include/uapi/asm/mman.h  |1 
 arch/tile/mm/elf.c   |3 
 arch/x86/mm/mpx.c|3 
 arch/xtensa/include/uapi/asm/mman.h  |1 
 drivers/base/dma-mapping.c   |   10 +
 drivers/dax/Kconfig  |1 
 drivers/dax/device.c |4 
 drivers/infiniband/core/umem.c   |   90 +-
 drivers/iommu/amd_iommu.c|   40 +--
 drivers/iommu/intel-iommu.c  |   30 +-
 drivers/iommu/iommu.c|   27 ++
 fs/Kconfig   |5 
 fs/Makefile  |1 
 fs/aio.c |2 
 fs/mapdirect.c   |  382 ++
 fs/xfs/Kconfig   |4 
 fs/xfs/Makefile  |1 
 fs/xfs/xfs_file.c|  103 +++
 fs/xfs/xfs_iomap.c   |3 
 fs/xfs/xfs_layout.c  |   45 +++
 fs/xfs/xfs_layout.h  |   13 +
 fs/xfs/xfs_pnfs.c|   30 --
 fs/xfs/xfs_pnfs.h|   10 -
 include/linux/dma-mapping.h  |3 
 include/linux/fs.h   |2 
 include/linux/iomap.h|   10 +
 include/linux/iommu.h|2 
 include/linux/mapdirect.h|   57 
 include/linux/mm.h   |   17 +
 include/linux/mman.h |   42 +++
 include/rdma/ib_umem.h   |8 +
 include/uapi/asm-generic/mman-common.h   |1 
 include/uapi/asm-generic/mman.h  |1 
 ipc/shm.c|3 
 mm/internal.h   


Re: [ndctl PATCH] ndctl, test: rdma vs dax

2017-10-10 Thread Johannes Thumshirn
On Mon, Oct 09, 2017 at 08:45:41AM -0700, Dan Williams wrote:
> On Mon, Oct 9, 2017 at 1:07 AM, Johannes Thumshirn  wrote:
> > On Sat, Oct 07, 2017 at 08:14:42AM -0700, Dan Williams wrote:
> > [...]
> >
> >> +rxe_cfg stop
> >> +rxe_cfg start
> >> +if ! rxe_cfg status | grep -n rxe0; then
> >> + rxe_cfg add eth0
> >> +fi
> >
> > Can we maybe skip the dependency on rxe_cfg? All that is needed is modprobe
> > and echo.
> 
> Sure, I'll take a look.

For my NVMe over Soft-RoCE test setup with Rapido [1] I used the following:

modprobe rdma-rxe
echo eth0 > /sys/module/rdma_rxe/parameters/add

> 
> > Also hard coding eth0 might be problematic in this case. This works
> > on your test-setup but surely isn't portable.
> 
> Yes,  which is part of the reason I have this listed under the
> "destructive" tests. Any advice on how to make it portable would be
> appreciated.

Maybe:
ETH=${ETH:-eth0}
echo $ETH > /sys/module/rdma_rxe/parameters/add

Byte,
Johannes

[1] https://github.com/rapido-linux/rapido/blob/master/nvme_rdma_autorun.sh#L74

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm