Re: [PATCH v8 06/14] xfs: wire up MAP_DIRECT
On Tue, Oct 10, 2017 at 6:09 PM, Dave Chinner wrote:
> On Tue, Oct 10, 2017 at 07:49:30AM -0700, Dan Williams wrote:
>> @@ -1009,6 +1019,22 @@ xfs_file_llseek(
>>  }
>>
>>  /*
>> + * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
>> + * valid. See map_direct_invalidate.
>> + */
>> +static int
>> +xfs_can_fault_direct(
>> +	struct vm_area_struct	*vma)
>> +{
>> +	if (!xfs_vma_is_direct(vma))
>> +		return 0;
>> +
>> +	if (!test_map_direct_valid(vma->vm_private_data))
>> +		return VM_FAULT_SIGBUS;
>> +	return 0;
>> +}
>
> Better, but I'm going to be an annoying pedant here: a "can <something>"
> check should return a boolean true/false.
>
> Also, it's a bit jarring to see that a non-direct VMA that /can't/
> do direct faults returns the same thing as a direct-vma that /can/
> do direct faults, so a couple of extra comments for people who will
> quickly forget how this code works (i.e. me) will be helpful. Say
> something like this:
>
> /*
>  * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
>  * valid. See map_direct_invalidate.
>  */
> static bool
> xfs_vma_has_direct_lease(
> 	struct vm_area_struct	*vma)
> {
> 	/* Non MAP_DIRECT vmas do not require layout leases */
> 	if (!xfs_vma_is_direct(vma))
> 		return true;
>
> 	if (!test_map_direct_valid(vma->vm_private_data))
> 		return false;
>
> 	/* We have a valid lease */
> 	return true;
> }
>
> .
>
> 	if (!xfs_vma_has_direct_lease(vma)) {
> 		ret = VM_FAULT_SIGBUS;
> 		goto out_unlock;
> 	}

Looks good to me.

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v8 04/14] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
On Tue, Oct 10, 2017 at 5:46 PM, Dave Chinner wrote:
> On Tue, Oct 10, 2017 at 07:49:17AM -0700, Dan Williams wrote:
>> Move xfs_break_layouts() to its own compilation unit so that it can be
>> used for both pnfs layouts and MAP_DIRECT mappings.
> .
>> diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
>> index b587cb99b2b7..4135b2482697 100644
>> --- a/fs/xfs/xfs_pnfs.h
>> +++ b/fs/xfs/xfs_pnfs.h
>> @@ -1,19 +1,13 @@
>>  #ifndef _XFS_PNFS_H
>>  #define _XFS_PNFS_H 1
>>
>> +#include "xfs_layout.h"
>> +
>
> I missed this the first time through - we try not to put includes
> in header files, and instead make sure each C file has all the
> includes they require. Can you move this to all the C files that
> need layouts and remove the include of the xfs_pnfs.h include from
> them?

Sure, will do.
Re: [PATCH v8 06/14] xfs: wire up MAP_DIRECT
On Tue, Oct 10, 2017 at 07:49:30AM -0700, Dan Williams wrote:
> @@ -1009,6 +1019,22 @@ xfs_file_llseek(
>  }
>
>  /*
> + * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
> + * valid. See map_direct_invalidate.
> + */
> +static int
> +xfs_can_fault_direct(
> +	struct vm_area_struct	*vma)
> +{
> +	if (!xfs_vma_is_direct(vma))
> +		return 0;
> +
> +	if (!test_map_direct_valid(vma->vm_private_data))
> +		return VM_FAULT_SIGBUS;
> +	return 0;
> +}

Better, but I'm going to be an annoying pedant here: a "can <something>"
check should return a boolean true/false.

Also, it's a bit jarring to see that a non-direct VMA that /can't/
do direct faults returns the same thing as a direct-vma that /can/
do direct faults, so a couple of extra comments for people who will
quickly forget how this code works (i.e. me) will be helpful. Say
something like this:

/*
 * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
 * valid. See map_direct_invalidate.
 */
static bool
xfs_vma_has_direct_lease(
	struct vm_area_struct	*vma)
{
	/* Non MAP_DIRECT vmas do not require layout leases */
	if (!xfs_vma_is_direct(vma))
		return true;

	if (!test_map_direct_valid(vma->vm_private_data))
		return false;

	/* We have a valid lease */
	return true;
}

.

	if (!xfs_vma_has_direct_lease(vma)) {
		ret = VM_FAULT_SIGBUS;
		goto out_unlock;
	}

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v8 04/14] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
On Tue, Oct 10, 2017 at 07:49:17AM -0700, Dan Williams wrote:
> Move xfs_break_layouts() to its own compilation unit so that it can be
> used for both pnfs layouts and MAP_DIRECT mappings.
.
> diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
> index b587cb99b2b7..4135b2482697 100644
> --- a/fs/xfs/xfs_pnfs.h
> +++ b/fs/xfs/xfs_pnfs.h
> @@ -1,19 +1,13 @@
>  #ifndef _XFS_PNFS_H
>  #define _XFS_PNFS_H 1
>
> +#include "xfs_layout.h"
> +

I missed this the first time through - we try not to put includes
in header files, and instead make sure each C file has all the
includes they require. Can you move this to all the C files that
need layouts and remove the include of the xfs_pnfs.h include from
them?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] Fix mpage_writepage() for pages with buffers
On Tue, Oct 10, 2017 at 12:44 PM, Andrew Morton wrote:
>
> This is all pretty mature code (isn't it?). Any idea why this bug
> popped up now?

Also, while the patch looks sane, the

	clean_buffers(page, PAGE_SIZE);

line really threw me. That's an insane value to pick, it looks like
"bytes in page", but it isn't. It's just a random value that is bigger
than "PAGE_SIZE >> SECTOR_SHIFT".

I'd prefer to see just ~0u if the intention is just "bigger than
anything possible".

	Linus
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Tue, Oct 10, 2017 at 11:05 AM, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
>> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe wrote:
>> >> Have a look at the patch [1], I don't touch the ODP path.
>> >
>> > But, does ODP work OK already? I'm not clear on that..
>>
>> It had better. If the mapping is invalidated I would hope that
>> generates an io fault that gets handled by the driver to setup the new
>> mapping. I don't see how it can work otherwise.
>
> I would assume so too...
>
>> > This is why ODP should be the focus because this cannot work fully
>> > reliably otherwise..
>>
>> The lease break time is configurable. If that application can't
>> respond to a stop request within a timeout of its own choosing then it
>> should not be using DAX mappings.
>
> Well, no RDMA application can really do this, unless you set the
> timeout to multiple minutes, on par with network timeouts.

The default lease break timeout is 45 seconds on my system, so minutes
does not seem out of the question. Also keep in mind that what triggers
the lease break is another application trying to write or punch holes
in a file that is mapped for RDMA. So, if the hardware can't handle the
iommu mapping getting invalidated asynchronously and the application
can't react in the lease break timeout period then the administrator
should arrange for the file to not be written or truncated while it is
mapped. It's already the case that get_user_pages() does not lock down
file associations, so if your application is contending with these
types of file changes it likely already has a problem keeping
transactions in sync with the file state even without DAX.

> Again, these details are why I think this kind of DAX and non ODP-MRs
> are probably practically not too useful for a production system. Great
> for test of course, but in that case SIGKILL would be fine too...
>
>> > Well, what about using SIGKILL if the lease-break-time hits? The
>> > kernel will clean up the MRs when the process exits and this will
>> > fence DMA to that memory.
>>
>> Can you point me to where the MR cleanup code fences DMA and quiesces
>> the device?
>
> Yes. The MRs are associated with an fd. When the fd is closed
> ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
> all the objects, including MRs, and deletes them.
>
> The specification for deleting a MR requires a synchronous fence with
> the hardware. After MR deletion the hardware will not DMA to any pages
> described by the old MR, and those pages will be unpinned.
>
>> > But, still, if you really want to be fine grained, then I think
>> > invalidating the impacted MRs is a better solution for RDMA than
>> > trying to do it with the IOMMU...
>>
>> If there's a better routine for handling ib_umem_lease_break() I'd
>> love to use it. Right now I'm reaching for the only tool I know for
>> kernel enforced revocation of DMA access.
>
> Well, you'd have to code something in the MR code to keep track of DAX
> MRs and issue an out of band invalidate to impacted MRs to create the
> fence.
>
> This probably needs some driver work, I'm not sure if all the hardware
> can do out of band invalidate to any MR or not..

Ok.

> Generally speaking, in RDMA, when a new feature like this comes along
> we have to push a lot of the work down to the driver authors, and the
> approach has historically been that new features only work on some
> hardware (as much as I dislike this, it is pragmatic)
>
> So, not being able to support DAX on certain RDMA hardware is not
> an unreasonable situation in our space.

That makes sense, but it still seems to me that this proposed solution
allows more than enough ways to avoid that worst case scenario where
hardware reacts badly to iommu invalidation. Drivers that can do better
than iommu invalidation can arrange for a callback to do their
driver-specific action at lease break time. Hardware that can't should
be blacklisted from supporting DAX altogether. In other words this is a
starting point to incrementally enhance or disable specific drivers,
but with the assurance that the kernel can always do the safe thing
when / if the driver is missing a finer grained solution.
Re: [PATCH] Fix mpage_writepage() for pages with buffers
On Fri, 6 Oct 2017 14:15:41 -0700 Matthew Wilcox wrote:

> When using FAT on a block device which supports rw_page, we can hit
> BUG_ON(!PageLocked(page)) in try_to_free_buffers(). This is because we
> call clean_buffers() after unlocking the page we've written. Introduce a
> new clean_page_buffers() which cleans all buffers associated with a page
> and call it from within bdev_write_page().

This is all pretty mature code (isn't it?). Any idea why this bug
popped up now?
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe wrote:
> >> Have a look at the patch [1], I don't touch the ODP path.
> >
> > But, does ODP work OK already? I'm not clear on that..
>
> It had better. If the mapping is invalidated I would hope that
> generates an io fault that gets handled by the driver to setup the new
> mapping. I don't see how it can work otherwise.

I would assume so too...

> > This is why ODP should be the focus because this cannot work fully
> > reliably otherwise..
>
> The lease break time is configurable. If that application can't
> respond to a stop request within a timeout of its own choosing then it
> should not be using DAX mappings.

Well, no RDMA application can really do this, unless you set the
timeout to multiple minutes, on par with network timeouts.

Again, these details are why I think this kind of DAX and non ODP-MRs
are probably practically not too useful for a production system. Great
for test of course, but in that case SIGKILL would be fine too...

> > Well, what about using SIGKILL if the lease-break-time hits? The
> > kernel will clean up the MRs when the process exits and this will
> > fence DMA to that memory.
>
> Can you point me to where the MR cleanup code fences DMA and quiesces
> the device?

Yes. The MRs are associated with an fd. When the fd is closed
ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
all the objects, including MRs, and deletes them.

The specification for deleting a MR requires a synchronous fence with
the hardware. After MR deletion the hardware will not DMA to any pages
described by the old MR, and those pages will be unpinned.

> > But, still, if you really want to be fine grained, then I think
> > invalidating the impacted MRs is a better solution for RDMA than
> > trying to do it with the IOMMU...
>
> If there's a better routine for handling ib_umem_lease_break() I'd
> love to use it. Right now I'm reaching for the only tool I know for
> kernel enforced revocation of DMA access.

Well, you'd have to code something in the MR code to keep track of DAX
MRs and issue an out of band invalidate to impacted MRs to create the
fence.

This probably needs some driver work, I'm not sure if all the hardware
can do out of band invalidate to any MR or not..

Generally speaking, in RDMA, when a new feature like this comes along
we have to push a lot of the work down to the driver authors, and the
approach has historically been that new features only work on some
hardware (as much as I dislike this, it is pragmatic)

So, not being able to support DAX on certain RDMA hardware is not
an unreasonable situation in our space.

Jason
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote:
> > I don't think this has ever come up in the context of an all-device MR
> > invalidate requirement. Drivers already have code to invalidate
> > specific MRs, but to find all MRs that touch certain pages and then
> > invalidate them would be new code.
> >
> > We also have ODP aware drivers that can retarget a MR to new
> > physical pages. If the block map changes DAX should synchronously
> > retarget the ODP MR, not halt DMA.
>
> Have a look at the patch [1], I don't touch the ODP path.

But, does ODP work OK already? I'm not clear on that..

> > Most likely ODP & DAX would need to be used together to get robust
> > user applications, as having the user QP's go to an error state at
> > random times (due to DMA failures) during operation is never going to
> > be acceptable...
>
> It's not random. The process that set up the mapping and registered
> the memory gets SIGIO when someone else tries to modify the file map.
> That process then gets /proc/sys/fs/lease-break-time seconds to fix
> the problem before the kernel force revokes the DMA access.

Well, the process can't fix the problem in bounded time, so it is
random if it will fail or not. MR life time is under the control of
the remote side, and time to complete the network exchanges required
to release the MRs is hard to bound.

So even if I implement SIGIO properly my app will still likely have
random QP failures under various cases and work loads. :(

This is why ODP should be the focus because this cannot work fully
reliably otherwise..

> > Perhaps you might want to initially only support ODP MR mappings with
> > DAX and then the DMA fencing issue goes away?
>
> I'd rather try to fix the non-ODP DAX case instead of just turning it
> off.

Well, what about using SIGKILL if the lease-break-time hits? The
kernel will clean up the MRs when the process exits and this will
fence DMA to that memory.

But, still, if you really want to be fine grained, then I think
invalidating the impacted MRs is a better solution for RDMA than
trying to do it with the IOMMU...

Jason
[PATCH v8 14/14] tools/testing/nvdimm: enable rdma unit tests
Provide a mock dma_get_iommu_domain() for the ibverbs core. Enable
ib_umem_get() to satisfy its DAX safety checks for a controlled test.

Signed-off-by: Dan Williams
---
 tools/testing/nvdimm/Kbuild         | 31 +++++++++++++++++++++++++++++++
 tools/testing/nvdimm/config_check.c |  2 ++
 tools/testing/nvdimm/test/iomap.c   | 14 ++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index d870520da68b..f4a007090950 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -15,11 +15,13 @@ ldflags-y += --wrap=insert_resource
 ldflags-y += --wrap=remove_resource
 ldflags-y += --wrap=acpi_evaluate_object
 ldflags-y += --wrap=acpi_evaluate_dsm
+ldflags-y += --wrap=dma_get_iommu_domain

 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
 ACPI_SRC := $(DRIVERS)/acpi/nfit
 DAX_SRC := $(DRIVERS)/dax
+IBCORE := $(DRIVERS)/infiniband/core
 ccflags-y := -I$(src)/$(NVDIMM_SRC)/

 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
@@ -33,6 +35,7 @@ obj-$(CONFIG_DAX) += dax.o
 endif
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+obj-$(CONFIG_INFINIBAND) += ib_core.o

 nfit-y := $(ACPI_SRC)/core.o
 nfit-$(CONFIG_X86_MCE) += $(ACPI_SRC)/mce.o
@@ -75,4 +78,32 @@ libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
 libnvdimm-$(CONFIG_NVDIMM_DAX) += $(NVDIMM_SRC)/dax_devs.o
 libnvdimm-y += config_check.o

+ib_core-y := $(IBCORE)/packer.o
+ib_core-y += $(IBCORE)/ud_header.o
+ib_core-y += $(IBCORE)/verbs.o
+ib_core-y += $(IBCORE)/cq.o
+ib_core-y += $(IBCORE)/rw.o
+ib_core-y += $(IBCORE)/sysfs.o
+ib_core-y += $(IBCORE)/device.o
+ib_core-y += $(IBCORE)/fmr_pool.o
+ib_core-y += $(IBCORE)/cache.o
+ib_core-y += $(IBCORE)/netlink.o
+ib_core-y += $(IBCORE)/roce_gid_mgmt.o
+ib_core-y += $(IBCORE)/mr_pool.o
+ib_core-y += $(IBCORE)/addr.o
+ib_core-y += $(IBCORE)/sa_query.o
+ib_core-y += $(IBCORE)/multicast.o
+ib_core-y += $(IBCORE)/mad.o
+ib_core-y += $(IBCORE)/smi.o
+ib_core-y += $(IBCORE)/agent.o
+ib_core-y += $(IBCORE)/mad_rmpp.o
+ib_core-y += $(IBCORE)/security.o
+ib_core-y += $(IBCORE)/nldev.o
+
+ib_core-$(CONFIG_INFINIBAND_USER_MEM) += $(IBCORE)/umem.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_odp.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += $(IBCORE)/cgroup.o
+ib_core-y += config_check.o
+
 obj-m += test/
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
index 7dc5a0af9b54..33e7c805bfd6 100644
--- a/tools/testing/nvdimm/config_check.c
+++ b/tools/testing/nvdimm/config_check.c
@@ -14,4 +14,6 @@ void check(void)
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX));
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM));
+	BUILD_BUG_ON(!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_INFINIBAND));
 }
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index e1f75a1914a1..1e439b2b01e7 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -388,4 +389,17 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
 }
 EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);

+/*
+ * This assumes that any iommu api routine we would call with this
+ * domain checks for NULL ops and either returns an error or does
+ * nothing.
+ */
+struct iommu_domain *__wrap_dma_get_iommu_domain(struct device *dev)
+{
+	static struct iommu_domain domain;
+
+	return &domain;
+}
+EXPORT_SYMBOL(__wrap_dma_get_iommu_domain);
+
 MODULE_LICENSE("GPL v2");
[PATCH v8 10/14] device-dax: wire up ->lease_direct()
The only event that will break a lease_direct lease in the device-dax
case is the device shutdown path where the physical pages might get
assigned to another device.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 drivers/dax/Kconfig       | 1 +
 drivers/dax/device.c      | 4 ++++
 fs/Kconfig                | 4 ++++
 fs/Makefile               | 3 ++-
 fs/mapdirect.c            | 3 ++-
 include/linux/mapdirect.h | 5 ++++-
 6 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index b79aa8f7a497..be03d4dbe646 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -8,6 +8,7 @@ if DAX
 config DEV_DAX
 	tristate "Device DAX: direct access mapping device"
 	depends on TRANSPARENT_HUGEPAGE
+	depends on FILE_LOCKING
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e9f3b3e4bbf4..fa75004185c4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  * General Public License for more details.
  */
+#include
 #include
 #include
 #include
@@ -430,6 +431,7 @@ static int dev_dax_fault(struct vm_fault *vmf)
 static const struct vm_operations_struct dax_vm_ops = {
 	.fault = dev_dax_fault,
 	.huge_fault = dev_dax_huge_fault,
+	.lease_direct = map_direct_lease,
 };

 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
@@ -540,8 +542,10 @@ static void kill_dev_dax(struct dev_dax *dev_dax)
 {
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct inode *inode = dax_inode(dax_dev);
+	const bool wait = true;

 	kill_dax(dax_dev);
+	break_layout(inode, wait);
 	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
 }
diff --git a/fs/Kconfig b/fs/Kconfig
index a7b31a96a753..3668cfb046d5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -59,6 +59,10 @@ config FS_DAX_PMD
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE

+config DAX_MAP_DIRECT
+	bool
+	default FS_DAX || DEV_DAX
+
 endif # BLOCK

 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index c0e791d235d8..21b8fb104656 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,8 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)		+= aio.o
-obj-$(CONFIG_FS_DAX)		+= dax.o mapdirect.o
+obj-$(CONFIG_FS_DAX)		+= dax.o
+obj-$(CONFIG_DAX_MAP_DIRECT)	+= mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)	+= locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index c6954033fc1a..dd4a16f9ffc6 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -218,7 +218,7 @@ static const struct lock_manager_operations lease_direct_lm_ops = {
 	.lm_change = lease_direct_lm_change,
 };

-static struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
 		void (*lds_break_fn)(void *), void *lds_owner)
 {
 	struct file *file = vma->vm_file;
@@ -272,6 +272,7 @@ static struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
 		kfree(lds);
 	return ERR_PTR(rc);
 }
+EXPORT_SYMBOL_GPL(map_direct_lease);

 struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma,
 		void (*break_fn)(void *), void *owner)
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
index e0df6ac5795a..6695fdcf8009 100644
--- a/include/linux/mapdirect.h
+++ b/include/linux/mapdirect.h
@@ -26,13 +26,15 @@ struct lease_direct {
 	struct lease_direct_state *lds;
 };

-#if IS_ENABLED(CONFIG_FS_DAX)
+#if IS_ENABLED(CONFIG_DAX_MAP_DIRECT)
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
 bool test_map_direct_valid(struct map_direct_state *mds);
 void generic_map_direct_open(struct vm_area_struct *vma);
 void generic_map_direct_close(struct vm_area_struct *vma);
 struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma,
 		void (*ld_break_fn)(void *), void *ld_owner);
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*lds_break_fn)(void *), void *lds_owner);
 void map_direct_lease_destroy(struct lease_direct *ld);
 #else
 static inline struct map_direct_state *map_direct_register(int fd,
@@ -47,6 +49,7 @@ static inline bool test_map_direct_valid(struct map_direct_state *mds)
 #define generic_map_direct_open NULL
[PATCH v8 11/14] iommu: up-level sg_num_pages() from amd-iommu
iommu_sg_num_pages() is a helper that walks a scatterlist and counts
pages taking segment boundaries and iommu_num_pages() into account.
Up-level it for determining the IOVA range that dma_map_ops established
at dma_map_sg() time. The intent is to iommu_unmap() the IOVA range in
advance of freeing the IOVA range.

Cc: Joerg Roedel
Signed-off-by: Dan Williams
---
 drivers/iommu/amd_iommu.c | 30 ++----------------------------
 drivers/iommu/iommu.c     | 27 +++++++++++++++++++++++++++
 include/linux/iommu.h     |  2 ++
 3 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index c8e1a45af182..4795b0823469 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2459,32 +2459,6 @@ static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size,
 	__unmap_single(dma_dom, dma_addr, size, dir);
 }

-static int sg_num_pages(struct device *dev,
-			struct scatterlist *sglist,
-			int nelems)
-{
-	unsigned long mask, boundary_size;
-	struct scatterlist *s;
-	int i, npages = 0;
-
-	mask = dma_get_seg_boundary(dev);
-	boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT :
-				   1UL << (BITS_PER_LONG - PAGE_SHIFT);
-
-	for_each_sg(sglist, s, nelems, i) {
-		int p, n;
-
-		s->dma_address = npages << PAGE_SHIFT;
-		p = npages % boundary_size;
-		n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
-		if (p + n > boundary_size)
-			npages += boundary_size - p;
-		npages += n;
-	}
-
-	return npages;
-}
-
 /*
  * The exported map_sg function for dma_ops (handles scatter-gather
  * lists).
@@ -2507,7 +2481,7 @@ static int map_sg(struct device *dev, struct scatterlist *sglist,
 	dma_dom = to_dma_ops_domain(domain);
 	dma_mask = *dev->dma_mask;

-	npages = sg_num_pages(dev, sglist, nelems);
+	npages = iommu_sg_num_pages(dev, sglist, nelems);

 	address = dma_ops_alloc_iova(dev, dma_dom, npages, dma_mask);
 	if (address == AMD_IOMMU_MAPPING_ERROR)
@@ -2585,7 +2559,7 @@ static void unmap_sg(struct device *dev, struct scatterlist *sglist,
 	startaddr = sg_dma_address(sglist) & PAGE_MASK;
 	dma_dom = to_dma_ops_domain(domain);
-	npages = sg_num_pages(dev, sglist, nelems);
+	npages = iommu_sg_num_pages(dev, sglist, nelems);

 	__unmap_single(dma_dom, startaddr, npages << PAGE_SHIFT, dir);
 }
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3de5c0bcb5cc..cfe6eeea3578 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -33,6 +33,7 @@
 #include
 #include
 #include
+#include

 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
@@ -1631,6 +1632,32 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);

+int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+		int nelems)
+{
+	unsigned long mask, boundary_size;
+	struct scatterlist *s;
+	int i, npages = 0;
+
+	mask = dma_get_seg_boundary(dev);
+	boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT
+		: 1UL << (BITS_PER_LONG - PAGE_SHIFT);
+
+	for_each_sg(sglist, s, nelems, i) {
+		int p, n;
+
+		s->dma_address = npages << PAGE_SHIFT;
+		p = npages % boundary_size;
+		n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
+		if (p + n > boundary_size)
+			npages += boundary_size - p;
+		npages += n;
+	}
+
+	return npages;
+}
+EXPORT_SYMBOL_GPL(iommu_sg_num_pages);
+
 size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 		struct scatterlist *sg, unsigned int nents, int prot)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a7f2ac689d29..5b2d20e1475a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -303,6 +303,8 @@ extern size_t iommu_unmap(struct iommu_domain *domain,
 		unsigned long iova, size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
 		unsigned long iova, size_t size);
+extern int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+		int nelems);
 extern size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 		struct scatterlist *sg, unsigned int nents, int prot);
[PATCH v8 07/14] iommu, dma-mapping: introduce dma_get_iommu_domain()
Add a dma-mapping api helper to retrieve the generic iommu_domain for a
device. The motivation for this interface is making RDMA transfers to
DAX mappings safe. If the DAX file's block map changes we need to be
able to reliably stop accesses to blocks that have been freed or
re-assigned to a new file. With the iommu_domain and a callback from
the DAX filesystem the kernel can safely revoke access to a DMA
device. The process that performed the RDMA memory registration is
also notified of this revocation event, but the kernel can not
otherwise be in the position of waiting for userspace to quiesce the
device.

Since PMEM+DAX is currently only enabled for x86, we only update the
x86 iommu drivers.

Cc: Marek Szyprowski
Cc: Robin Murphy
Cc: Greg Kroah-Hartman
Cc: Joerg Roedel
Cc: David Woodhouse
Cc: Ashok Raj
Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: "Darrick J. Wong"
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 drivers/base/dma-mapping.c  | 10 ++++++++++
 drivers/iommu/amd_iommu.c   | 10 ++++++++++
 drivers/iommu/intel-iommu.c | 15 +++++++++++++++
 include/linux/dma-mapping.h |  3 +++
 4 files changed, 38 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index e584eddef0a7..fdb9764f95a4 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
 	of_dma_deconfigure(dev);
 	acpi_dma_deconfigure(dev);
 }
+
+struct iommu_domain *dma_get_iommu_domain(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (ops && ops->get_iommu)
+		return ops->get_iommu(dev);
+	return NULL;
+}
+EXPORT_SYMBOL(dma_get_iommu_domain);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 51f8215877f5..c8e1a45af182 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev)
 	return domain;
 }

+static struct iommu_domain *amd_dma_get_iommu(struct device *dev)
+{
+	struct protection_domain *domain = get_domain(dev);
+
+	if (IS_ERR(domain))
+		return NULL;
+	return &domain->domain;
+}
+
 static void update_device_table(struct protection_domain *domain)
 {
 	struct iommu_dev_data *dev_data;
@@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
 	.unmap_sg	= unmap_sg,
 	.dma_supported	= amd_iommu_dma_supported,
 	.mapping_error	= amd_iommu_mapping_error,
+	.get_iommu	= amd_dma_get_iommu,
 };

 static int init_reserved_iova_ranges(void)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..f3f4939cebad 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev)
 	return 0;
 }

+static struct iommu_domain *intel_dma_get_iommu(struct device *dev)
+{
+	struct dmar_domain *domain;
+
+	if (iommu_no_mapping(dev))
+		return NULL;
+
+	domain = get_valid_domain_for_dev(dev);
+	if (!domain)
+		return NULL;
+
+	return &domain->domain;
+}
+
 static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 		size_t size, int dir, u64 dma_mask)
 {
@@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = {
 	.map_page = intel_map_page,
 	.unmap_page = intel_unmap_page,
 	.mapping_error = intel_mapping_error,
+	.get_iommu = intel_dma_get_iommu,
 #ifdef CONFIG_X86
 	.dma_supported = x86_dma_supported,
 #endif
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 29ce9815da87..aa62df1d0d72 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -128,6 +128,7 @@ struct dma_map_ops {
 			enum dma_data_direction dir);
 	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
 	int (*dma_supported)(struct device *dev, u64 mask);
+	struct iommu_domain *(*get_iommu)(struct device *dev);
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
 	u64 (*get_required_mask)(struct device *dev);
 #endif
@@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif

+extern struct iommu_domain *dma_get_iommu_domain(struct device *dev);
+
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
 		size_t size, enum dma_data_direction dir,
[PATCH v8 12/14] iommu/vt-d: use iommu_num_sg_pages
Use the common helper for accounting the size of the IOVA range for a scatterlist so that iommu and dma apis agree on the size of a scatterlist. This is in support of using iommu_unmap() in advance of dma_unmap_sg() to invalidate an io-mapping in advance of the IOVA range being deallocated. MAP_DIRECT needs this functionality for force-revoking RDMA access to a DAX mapping when userspace fails to respond within a lease break timeout period. Cc: Ashok Raj Cc: David Woodhouse Cc: Joerg Roedel Signed-off-by: Dan Williams --- drivers/iommu/intel-iommu.c | 19 +-- 1 file changed, 5 insertions(+), 14 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index f3f4939cebad..94a5fbe62fb8 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -3785,14 +3785,9 @@ static void intel_unmap_sg(struct device *dev, struct scatterlist *sglist, unsigned long attrs) { dma_addr_t startaddr = sg_dma_address(sglist) & PAGE_MASK; - unsigned long nrpages = 0; - struct scatterlist *sg; - int i; - - for_each_sg(sglist, sg, nelems, i) { - nrpages += aligned_nrpages(sg_dma_address(sg), sg_dma_len(sg)); - } + unsigned long nrpages; + nrpages = iommu_sg_num_pages(dev, sglist, nelems); intel_unmap(dev, startaddr, nrpages << VTD_PAGE_SHIFT); } @@ -3813,14 +3808,12 @@ static int intel_nontranslate_map_sg(struct device *hddev, static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nelems, enum dma_data_direction dir, unsigned long attrs) { - int i; struct dmar_domain *domain; size_t size = 0; int prot = 0; unsigned long iova_pfn; int ret; - struct scatterlist *sg; - unsigned long start_vpfn; + unsigned long start_vpfn, npages; struct intel_iommu *iommu; BUG_ON(dir == DMA_NONE); @@ -3833,11 +3826,9 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele iommu = domain_get_iommu(domain); - for_each_sg(sglist, sg, nelems, i) - size += aligned_nrpages(sg->offset, sg->length); + npages = 
iommu_sg_num_pages(dev, sglist, nelems); - iova_pfn = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size), - *dev->dma_mask); + iova_pfn = intel_alloc_iova(dev, domain, npages, *dev->dma_mask); if (!iova_pfn) { sglist->dma_length = 0; return 0;
[PATCH v8 08/14] fs, mapdirect: introduce ->lease_direct()
Provide a vma operation that registers a lease that is broken by break_layout(). This is motivated by a need to stop in-progress RDMA when the block-map of a DAX-file changes. I.e. since DAX gives direct-access to filesystem blocks we can not allow those blocks to move or change state while they are under active RDMA. So, if the filesystem determines it needs to move blocks it can revoke device access before proceeding. Cc: Jan KaraCc: Jeff Moyer Cc: Christoph Hellwig Cc: Dave Chinner Cc: "Darrick J. Wong" Cc: Ross Zwisler Cc: Jeff Layton Cc: "J. Bruce Fields" Signed-off-by: Dan Williams --- fs/mapdirect.c| 144 + include/linux/mapdirect.h | 14 include/linux/mm.h|8 +++ 3 files changed, 166 insertions(+) diff --git a/fs/mapdirect.c b/fs/mapdirect.c index 9f4dd7395dcd..c6954033fc1a 100644 --- a/fs/mapdirect.c +++ b/fs/mapdirect.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -32,12 +33,25 @@ struct map_direct_state { struct vm_area_struct *mds_vma; }; +struct lease_direct_state { + void *lds_owner; + struct file *lds_file; + unsigned long lds_state; + void (*lds_break_fn)(void *lds_owner); + struct delayed_work lds_work; +}; + bool test_map_direct_valid(struct map_direct_state *mds) { return test_bit(MAPDIRECT_VALID, >mds_state); } EXPORT_SYMBOL_GPL(test_map_direct_valid); +static bool test_map_direct_broken(struct map_direct_state *mds) +{ + return test_bit(MAPDIRECT_BREAK, >mds_state); +} + static void put_map_direct(struct map_direct_state *mds) { if (!atomic_dec_and_test(>mds_ref)) @@ -168,6 +182,136 @@ static const struct lock_manager_operations map_direct_lm_ops = { .lm_setup = map_direct_lm_setup, }; +static void lease_direct_invalidate(struct work_struct *work) +{ + struct lease_direct_state *lds; + void *owner; + + lds = container_of(work, typeof(*lds), lds_work.work); + owner = lds; + lds->lds_break_fn(lds->lds_owner); + vfs_setlease(lds->lds_file, F_UNLCK, NULL, ); +} + +static bool lease_direct_lm_break(struct file_lock *fl) 
+{ + struct lease_direct_state *lds = fl->fl_owner; + + if (!test_and_set_bit(MAPDIRECT_BREAK, >lds_state)) + schedule_delayed_work(>lds_work, lease_break_time * HZ); + + /* Tell the core lease code to wait for delayed work completion */ + fl->fl_break_time = 0; + + return false; +} + +static int lease_direct_lm_change(struct file_lock *fl, int arg, + struct list_head *dispose) +{ + WARN_ON(!(arg & F_UNLCK)); + return lease_modify(fl, arg, dispose); +} + +static const struct lock_manager_operations lease_direct_lm_ops = { + .lm_break = lease_direct_lm_break, + .lm_change = lease_direct_lm_change, +}; + +static struct lease_direct *map_direct_lease(struct vm_area_struct *vma, + void (*lds_break_fn)(void *), void *lds_owner) +{ + struct file *file = vma->vm_file; + struct lease_direct_state *lds; + struct lease_direct *ld; + struct file_lock *fl; + int rc = -ENOMEM; + void *owner; + + ld = kzalloc(sizeof(*ld) + sizeof(*lds), GFP_KERNEL); + if (!ld) + return ERR_PTR(-ENOMEM); + INIT_LIST_HEAD(>list); + lds = (struct lease_direct_state *)(ld + 1); + owner = lds; + ld->lds = lds; + lds->lds_break_fn = lds_break_fn; + lds->lds_owner = lds_owner; + INIT_DELAYED_WORK(>lds_work, lease_direct_invalidate); + lds->lds_file = get_file(file); + + fl = locks_alloc_lock(); + if (!fl) + goto err_lock_alloc; + + locks_init_lock(fl); + fl->fl_lmops = _direct_lm_ops; + fl->fl_flags = FL_LAYOUT; + fl->fl_type = F_RDLCK; + fl->fl_end = OFFSET_MAX; + fl->fl_owner = lds; + fl->fl_pid = current->tgid; + fl->fl_file = file; + + rc = vfs_setlease(file, fl->fl_type, , ); + if (rc) + goto err_setlease; + if (fl) { + WARN_ON(1); + owner = lds; + vfs_setlease(file, F_UNLCK, NULL, ); + owner = NULL; + rc = -ENXIO; + goto err_setlease; + } + + return ld; +err_setlease: + locks_free_lock(fl); +err_lock_alloc: + kfree(lds); + return ERR_PTR(rc); +} + +struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma, + void (*break_fn)(void *), void *owner) +{ + struct lease_direct *ld; + + 
ld = map_direct_lease(vma, break_fn, owner); + + if (IS_ERR(ld)) + return ld; + + /* +* We now
[PATCH v8 03/14] fs: MAP_DIRECT core
Introduce a set of helper apis for filesystems to establish FL_LAYOUT leases to protect against writes and block map updates while a MAP_DIRECT mapping is established. While the lease protects against the syscall write path and fallocate it does not protect against allocating write-faults, so this relies on i_mapdcount to disable block map updates from write faults. Like the pnfs case MAP_DIRECT does its own timeout of the lease since we need to have a process context for running map_direct_invalidate(). Cc: Jan KaraCc: Jeff Moyer Cc: Christoph Hellwig Cc: Dave Chinner Cc: "Darrick J. Wong" Cc: Ross Zwisler Cc: Jeff Layton Cc: "J. Bruce Fields" Signed-off-by: Dan Williams --- fs/Kconfig|1 fs/Makefile |2 fs/mapdirect.c| 237 + include/linux/mapdirect.h | 40 4 files changed, 279 insertions(+), 1 deletion(-) create mode 100644 fs/mapdirect.c create mode 100644 include/linux/mapdirect.h diff --git a/fs/Kconfig b/fs/Kconfig index 7aee6d699fd6..a7b31a96a753 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig" config FS_DAX bool "Direct Access (DAX) support" depends on MMU + depends on FILE_LOCKING depends on !(ARM || MIPS || SPARC) select FS_IOMAP select DAX diff --git a/fs/Makefile b/fs/Makefile index 7bbaca9c67b1..c0e791d235d8 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o -obj-$(CONFIG_FS_DAX) += dax.o +obj-$(CONFIG_FS_DAX) += dax.o mapdirect.o obj-$(CONFIG_FS_ENCRYPTION)+= crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o diff --git a/fs/mapdirect.c b/fs/mapdirect.c new file mode 100644 index ..9f4dd7395dcd --- /dev/null +++ b/fs/mapdirect.c @@ -0,0 +1,237 @@ +/* + * Copyright(c) 2017 Intel Corporation. All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include +#include +#include +#include +#include +#include +#include +#include + +#define MAPDIRECT_BREAK 0 +#define MAPDIRECT_VALID 1 + +struct map_direct_state { + atomic_t mds_ref; + atomic_t mds_vmaref; + unsigned long mds_state; + struct inode *mds_inode; + struct delayed_work mds_work; + struct fasync_struct *mds_fa; + struct vm_area_struct *mds_vma; +}; + +bool test_map_direct_valid(struct map_direct_state *mds) +{ + return test_bit(MAPDIRECT_VALID, >mds_state); +} +EXPORT_SYMBOL_GPL(test_map_direct_valid); + +static void put_map_direct(struct map_direct_state *mds) +{ + if (!atomic_dec_and_test(>mds_ref)) + return; + kfree(mds); +} + +static void put_map_direct_vma(struct map_direct_state *mds) +{ + struct vm_area_struct *vma = mds->mds_vma; + struct file *file = vma->vm_file; + struct inode *inode = file_inode(file); + void *owner = mds; + + if (!atomic_dec_and_test(>mds_vmaref)) + return; + + /* +* Flush in-flight+forced lm_break events that may be +* referencing this dying vma. 
+*/ + mds->mds_vma = NULL; + set_bit(MAPDIRECT_BREAK, >mds_state); + vfs_setlease(vma->vm_file, F_UNLCK, NULL, ); + flush_delayed_work(>mds_work); + iput(inode); + + put_map_direct(mds); +} + +void generic_map_direct_close(struct vm_area_struct *vma) +{ + put_map_direct_vma(vma->vm_private_data); +} +EXPORT_SYMBOL_GPL(generic_map_direct_close); + +static void get_map_direct_vma(struct map_direct_state *mds) +{ + atomic_inc(>mds_vmaref); +} + +void generic_map_direct_open(struct vm_area_struct *vma) +{ + get_map_direct_vma(vma->vm_private_data); +} +EXPORT_SYMBOL_GPL(generic_map_direct_open); + +static void map_direct_invalidate(struct work_struct *work) +{ + struct map_direct_state *mds; + struct vm_area_struct *vma; + struct inode *inode; + void *owner; + + mds = container_of(work, typeof(*mds), mds_work.work); + + clear_bit(MAPDIRECT_VALID, >mds_state); + + vma = ACCESS_ONCE(mds->mds_vma); + inode = mds->mds_inode; +
[PATCH v8 06/14] xfs: wire up MAP_DIRECT
MAP_DIRECT is an mmap(2) flag with the following semantics:

MAP_DIRECT
When specified with MAP_SHARED_VALIDATE, sets up a file lease with the same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease is broken when a "lease breaker" attempts to write(2), change the block map (fallocate), or change the size of the file. Otherwise the mechanism of a lease break is identical to the typical lease break case where the lease needs to be removed (munmap) within the number of seconds specified by /proc/sys/fs/lease-break-time. If the lease holder fails to remove the lease in time the kernel will invalidate the mapping and force all future accesses to the mapping to trigger SIGBUS.

In addition to lease break timeouts causing faults in the mapping to result in SIGBUS, other states of the file will trigger SIGBUS at fault time:

* The fault would trigger the filesystem to allocate blocks
* The fault would trigger the filesystem to perform extent conversion

In other words, MAP_DIRECT expects and enforces a fully allocated file where faults can be satisfied without modifying block map metadata.

An unprivileged process may establish a MAP_DIRECT mapping on a file whose UID (owner) matches the filesystem UID of the process. A process with the CAP_LEASE capability may establish a MAP_DIRECT mapping on arbitrary files.

ERRORS
EACCES Beyond the typical mmap(2) conditions that trigger EACCES, MAP_DIRECT also requires the permission to set a file lease.

EOPNOTSUPP The filesystem explicitly does not support the flag.

EPERM The file does not permit MAP_DIRECT mappings. Potential reasons are that DAX access is not available or the file has reflink extents.

SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that might require block-map updates, or the lease timed out and the kernel invalidated the mapping.

Cc: Jan Kara Cc: Arnd Bergmann Cc: Jeff Moyer Cc: Christoph Hellwig Cc: Dave Chinner Cc: Alexander Viro Cc: "Darrick J. 
Wong" Cc: Ross Zwisler Cc: Jeff Layton Cc: "J. Bruce Fields" Signed-off-by: Dan Williams --- fs/xfs/Kconfig |2 - fs/xfs/xfs_file.c | 103 ++- include/linux/mman.h|3 + include/uapi/asm-generic/mman.h |1 4 files changed, 106 insertions(+), 3 deletions(-) diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index f62fc6629abb..f8765653a438 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL config XFS_LAYOUT def_bool y - depends on EXPORTFS_BLOCK_OPS + depends on EXPORTFS_BLOCK_OPS || FS_DAX diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index ebdd0bd2b261..4bee027c9366 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -40,12 +40,22 @@ #include "xfs_iomap.h" #include "xfs_reflink.h" +#include #include #include #include +#include #include static const struct vm_operations_struct xfs_file_vm_ops; +static const struct vm_operations_struct xfs_file_vm_direct_ops; + +static bool +xfs_vma_is_direct( + struct vm_area_struct *vma) +{ + return vma->vm_ops == _file_vm_direct_ops; +} /* * Clear the specified ranges to zero through either the pagecache or DAX. @@ -1009,6 +1019,22 @@ xfs_file_llseek( } /* + * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is + * valid. See map_direct_invalidate. + */ +static int +xfs_can_fault_direct( + struct vm_area_struct *vma) +{ + if (!xfs_vma_is_direct(vma)) + return 0; + + if (!test_map_direct_valid(vma->vm_private_data)) + return VM_FAULT_SIGBUS; + return 0; +} + +/* * Locking for serialisation of IO during page faults. 
This results in a lock * ordering of: * @@ -1024,7 +1050,8 @@ __xfs_filemap_fault( enum page_entry_sizepe_size, boolwrite_fault) { - struct inode*inode = file_inode(vmf->vma->vm_file); + struct vm_area_struct *vma = vmf->vma; + struct inode*inode = file_inode(vma->vm_file); struct xfs_inode*ip = XFS_I(inode); int ret; @@ -1032,10 +1059,14 @@ __xfs_filemap_fault( if (write_fault) { sb_start_pagefault(inode->i_sb); - file_update_time(vmf->vma->vm_file); + file_update_time(vma->vm_file); } xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + ret = xfs_can_fault_direct(vma); + if (ret) + goto out_unlock; + if (IS_DAX(inode)) { ret = dax_iomap_fault(vmf, pe_size, _iomap_ops); } else { @@
[PATCH v8 09/14] xfs: wire up ->lease_direct()
A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT mapping established. For xfs we use the generic_map_direct_lease() handler for ->lease_direct(). It establishes a new lease and then checks if the MAP_DIRECT mapping has been broken. We want to be sure that the process will receive notification that the MAP_DIRECT mapping is being torn down so it knows why other code paths are throwing failures. For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the MAP_DIRECT mapping is invalid or in the process of being invalidated. Cc: Jan Kara Cc: Jeff Moyer Cc: Christoph Hellwig Cc: Dave Chinner Cc: "Darrick J. Wong" Cc: Ross Zwisler Cc: Jeff Layton Cc: "J. Bruce Fields" Signed-off-by: Dan Williams --- fs/xfs/xfs_file.c |5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 4bee027c9366..bc512a9a8df5 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1157,6 +1157,7 @@ static const struct vm_operations_struct xfs_file_vm_direct_ops = { .open = generic_map_direct_open, .close = generic_map_direct_close, + .lease_direct = generic_map_direct_lease, }; static const struct vm_operations_struct xfs_file_vm_ops = { @@ -1209,8 +1210,8 @@ xfs_file_mmap_direct( vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; /* -* generic_map_direct_{open,close} expect ->vm_private_data is -* set to the result of map_direct_register +* generic_map_direct_{open,close,lease} expect +* ->vm_private_data is set to the result of map_direct_register */ vma->vm_private_data = mds; return 0;
[PATCH v8 04/14] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
Move xfs_break_layouts() to its own compilation unit so that it can be used for both pnfs layouts and MAP_DIRECT mappings. Cc: Jan KaraCc: Jeff Moyer Cc: Christoph Hellwig Cc: Dave Chinner Cc: "Darrick J. Wong" Cc: Ross Zwisler Signed-off-by: Dan Williams --- fs/xfs/Kconfig |4 fs/xfs/Makefile |1 + fs/xfs/xfs_layout.c | 42 ++ fs/xfs/xfs_layout.h | 13 + fs/xfs/xfs_pnfs.c | 30 -- fs/xfs/xfs_pnfs.h | 10 ++ 6 files changed, 62 insertions(+), 38 deletions(-) create mode 100644 fs/xfs/xfs_layout.c create mode 100644 fs/xfs/xfs_layout.h diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index 1b98cfa342ab..f62fc6629abb 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL result in warnings. This behavior can be modified at runtime via sysfs. + +config XFS_LAYOUT + def_bool y + depends on EXPORTFS_BLOCK_OPS diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index a6e955bfead8..d44135107490 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL) += xfs_acl.o xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o +xfs-$(CONFIG_XFS_LAYOUT) += xfs_layout.o diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c new file mode 100644 index ..71d95e1a910a --- /dev/null +++ b/fs/xfs/xfs_layout.c @@ -0,0 +1,42 @@ +/* + * Copyright (c) 2014 Christoph Hellwig. + */ +#include "xfs.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_sb.h" +#include "xfs_mount.h" +#include "xfs_inode.h" + +#include + +/* + * Ensure that we do not have any outstanding pNFS layouts that can be used by + * clients to directly read from or write to this inode. This must be called + * before every operation that can remove blocks from the extent map. 
+ * Additionally we call it during the write operation, where aren't concerned + * about exposing unallocated blocks but just want to provide basic + * synchronization between a local writer and pNFS clients. mmap writes would + * also benefit from this sort of synchronization, but due to the tricky locking + * rules in the page fault path we don't bother. + */ +int +xfs_break_layouts( + struct inode*inode, + uint*iolock) +{ + struct xfs_inode*ip = XFS_I(inode); + int error; + + ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)); + + while ((error = break_layout(inode, false) == -EWOULDBLOCK)) { + xfs_iunlock(ip, *iolock); + error = break_layout(inode, true); + *iolock = XFS_IOLOCK_EXCL; + xfs_ilock(ip, *iolock); + } + + return error; +} diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h new file mode 100644 index ..f848ee78cc93 --- /dev/null +++ b/fs/xfs/xfs_layout.h @@ -0,0 +1,13 @@ +#ifndef _XFS_LAYOUT_H +#define _XFS_LAYOUT_H 1 + +#ifdef CONFIG_XFS_LAYOUT +int xfs_break_layouts(struct inode *inode, uint *iolock); +#else +static inline int +xfs_break_layouts(struct inode *inode, uint *iolock) +{ + return 0; +} +#endif /* CONFIG_XFS_LAYOUT */ +#endif /* _XFS_LAYOUT_H */ diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c index 2f2dc3c09ad0..8ec72220e73b 100644 --- a/fs/xfs/xfs_pnfs.c +++ b/fs/xfs/xfs_pnfs.c @@ -20,36 +20,6 @@ #include "xfs_pnfs.h" /* - * Ensure that we do not have any outstanding pNFS layouts that can be used by - * clients to directly read from or write to this inode. This must be called - * before every operation that can remove blocks from the extent map. - * Additionally we call it during the write operation, where aren't concerned - * about exposing unallocated blocks but just want to provide basic - * synchronization between a local writer and pNFS clients. mmap writes would - * also benefit from this sort of synchronization, but due to the tricky locking - * rules in the page fault path we don't bother. 
- */ -int -xfs_break_layouts( - struct inode*inode, - uint*iolock) -{ - struct xfs_inode*ip = XFS_I(inode); - int error; - - ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)); - - while ((error = break_layout(inode, false) == -EWOULDBLOCK)) { - xfs_iunlock(ip, *iolock); - error = break_layout(inode, true); - *iolock = XFS_IOLOCK_EXCL; - xfs_ilock(ip, *iolock); - } - - return error; -} - -/* * Get a unique ID including its location so that the client can
[PATCH v8 05/14] fs, xfs, iomap: introduce iomap_can_allocate()
In preparation for using FL_LAYOUT leases to allow coordination between the kernel and processes doing userspace flushes / RDMA with DAX mappings, add this helper that can be used to detect when block-map updates are not allowed. This is targeted to be used in an ->iomap_begin() implementation where we may have various filesystem locks held and can not synchronously wait for any FL_LAYOUT leases to be released. In particular an iomap mmap fault handler running under mmap_sem can not unlock that semaphore and wait for these leases to be unlocked. Instead, this signals the lease holder(s) that a break is requested and immediately returns with an error. Cc: Jan KaraCc: Jeff Moyer Cc: Christoph Hellwig Cc: "Darrick J. Wong" Cc: Ross Zwisler Suggested-by: Dave Chinner Signed-off-by: Dan Williams --- fs/xfs/xfs_iomap.c|3 +++ fs/xfs/xfs_layout.c |5 - include/linux/iomap.h | 10 ++ 3 files changed, 17 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index a1909bc064e9..b3cda11e9515 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1052,6 +1052,9 @@ xfs_file_iomap_begin( error = -EAGAIN; goto out_unlock; } + error = iomap_can_allocate(inode); + if (error) + goto out_unlock; /* * We cap the maximum length we map here to MAX_WRITEBACK_PAGES * pages to keep the chunks of work done where somewhat symmetric diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c index 71d95e1a910a..88c533bf5b7c 100644 --- a/fs/xfs/xfs_layout.c +++ b/fs/xfs/xfs_layout.c @@ -19,7 +19,10 @@ * about exposing unallocated blocks but just want to provide basic * synchronization between a local writer and pNFS clients. mmap writes would * also benefit from this sort of synchronization, but due to the tricky locking - * rules in the page fault path we don't bother. + * rules in the page fault path all we can do is start the lease break + * timeout. 
See usage of iomap_can_allocate in xfs_file_iomap_begin to + prevent write-faults from allocating blocks or performing extent + conversion. */ int xfs_break_layouts( diff --git a/include/linux/iomap.h b/include/linux/iomap.h index f64dc6ce5161..e24b4e81d41a 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -2,6 +2,7 @@ #define LINUX_IOMAP_H 1 #include +#include struct fiemap_extent_info; struct inode; @@ -88,6 +89,15 @@ loff_t iomap_seek_hole(struct inode *inode, loff_t offset, const struct iomap_ops *ops); loff_t iomap_seek_data(struct inode *inode, loff_t offset, const struct iomap_ops *ops); +/* + * Check if there are any file layout leases preventing block map + * changes and if so start the lease break process, but do not wait for + * it to complete (return -EWOULDBLOCK); + */ +static inline int iomap_can_allocate(struct inode *inode) +{ + return break_layout(inode, false); +} /* * Flags for direct I/O ->end_io:
[PATCH v8 01/14] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
The mmap(2) syscall suffers from the ABI anti-pattern of not validating unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a mechanism to define new behavior that is known to fail on older kernels without the support. Define a new MAP_SHARED_VALIDATE flag pattern that is guaranteed to fail on all legacy mmap implementations. It is worth noting that the original proposal was for a standalone MAP_VALIDATE flag. However, when that could not be supported by all archs Linus observed: I see why you *think* you want a bitmap. You think you want a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC etc, so that people can do ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_SYNC, fd, 0); and "know" that MAP_SYNC actually takes. And I'm saying that whole wish is bogus. You're fundamentally depending on special semantics, just make it explicit. It's already not portable, so don't try to make it so. Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value of 0x3, and make people do ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0); and then the kernel side is easier too (none of that random garbage playing games with looking at the "MAP_VALIDATE bit", but just another case statement in that map type thing. Boom. Done. Similar to ->fallocate() we also want the ability to validate the support for new flags on a per ->mmap() 'struct file_operations' instance basis. Towards that end arrange for flags to be generically validated against a mmap_supported_mask exported by 'struct file_operations'. By default all existing flags are implicitly supported, but new flags require MAP_SHARED_VALIDATE and per-instance-opt-in. 
Cc: Jan KaraCc: Arnd Bergmann Cc: Andy Lutomirski Cc: Andrew Morton Suggested-by: Christoph Hellwig Suggested-by: Linus Torvalds Signed-off-by: Dan Williams --- arch/alpha/include/uapi/asm/mman.h |1 + arch/mips/include/uapi/asm/mman.h|1 + arch/mips/kernel/vdso.c |2 + arch/parisc/include/uapi/asm/mman.h |1 + arch/tile/mm/elf.c |3 +- arch/xtensa/include/uapi/asm/mman.h |1 + include/linux/fs.h |2 + include/linux/mm.h |2 + include/linux/mman.h | 39 ++ include/uapi/asm-generic/mman-common.h |1 + mm/mmap.c| 21 -- tools/include/uapi/asm-generic/mman-common.h |1 + 12 files changed, 69 insertions(+), 6 deletions(-) diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index 3b26cc62dadb..92823f24890b 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -14,6 +14,7 @@ #define MAP_TYPE 0x0f/* Mask for type of mapping (OSF/1 is _wrong_) */ #define MAP_FIXED 0x100 /* Interpret addr exactly */ #define MAP_ANONYMOUS 0x10/* don't use a file */ +#define MAP_SHARED_VALIDATE 0x3/* share + validate extension flags */ /* not used by linux, but here to make sure we don't clash with OSF/1 defines */ #define _MAP_HASSEMAPHORE 0x0200 diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index da3216007fe0..c77689076577 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -30,6 +30,7 @@ #define MAP_PRIVATE0x002 /* Changes are private */ #define MAP_TYPE 0x00f /* Mask for type of mapping */ #define MAP_FIXED 0x010 /* Interpret addr exactly */ +#define MAP_SHARED_VALIDATE 0x3/* share + validate extension flags */ /* not used by linux, but here to make sure we don't clash with ABI defines */ #define MAP_RENAME 0x020 /* Assign page to file */ diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index 019035d7225c..cf10654477a9 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct 
linux_binprm *bprm, int uses_interp) base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ|VM_WRITE|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, - 0, NULL); + 0, NULL, 0); if (IS_ERR_VALUE(base)) { ret = base; goto out; diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index 775b5d5e41a1..36b688d52de3 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++
[PATCH v8 02/14] fs, mm: pass fd to ->mmap_validate()
The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent block map changes while the file is mapped. It requires the fd to setup an fasync_struct for signalling lease break events to the lease holder. Cc: Jan KaraCc: Jeff Moyer Cc: Christoph Hellwig Cc: Dave Chinner Cc: "Darrick J. Wong" Cc: Ross Zwisler Cc: Andrew Morton Signed-off-by: Dan Williams --- arch/mips/kernel/vdso.c |2 +- arch/tile/mm/elf.c |2 +- arch/x86/mm/mpx.c |3 ++- fs/aio.c|2 +- include/linux/fs.h |2 +- include/linux/mm.h |9 + ipc/shm.c |3 ++- mm/internal.h |2 +- mm/mmap.c | 13 +++-- mm/nommu.c |5 +++-- mm/util.c |7 --- 11 files changed, 28 insertions(+), 22 deletions(-) diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index cf10654477a9..ab26c7ac0316 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ|VM_WRITE|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, - 0, NULL, 0); + 0, NULL, 0, -1); if (IS_ERR_VALUE(base)) { ret = base; goto out; diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c index 5ffcbe76aef9..61a9588e141a 100644 --- a/arch/tile/mm/elf.c +++ b/arch/tile/mm/elf.c @@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, addr = mmap_region(NULL, addr, INTRPT_SIZE, VM_READ|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, - NULL, 0); + NULL, 0, -1); if (addr > (unsigned long) -PAGE_SIZE) retval = (int) addr; } diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c index 9ceaa955d2ba..a8baa94a496b 100644 --- a/arch/x86/mm/mpx.c +++ b/arch/x86/mm/mpx.c @@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len) down_write(>mmap_sem); addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE, - MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, , NULL); + MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, , + NULL, -1); up_write(>mmap_sem); if (populate) mm_populate(addr, populate); diff --git a/fs/aio.c 
b/fs/aio.c index 5a2487217072..d10ca6db2ee6 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size, PROT_READ | PROT_WRITE, - MAP_SHARED, 0, , NULL); + MAP_SHARED, 0, , NULL, -1); up_write(>mmap_sem); if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; diff --git a/include/linux/fs.h b/include/linux/fs.h index 51538958f7f5..c2b9bf3dc4e9 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1702,7 +1702,7 @@ struct file_operations { long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*mmap_validate) (struct file *, struct vm_area_struct *, - unsigned long); + unsigned long, int); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); diff --git a/include/linux/mm.h b/include/linux/mm.h index 5c4c98e4adc9..0afa19feb755 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf, unsigned long map_flags); + struct list_head *uf, unsigned long map_flags, int fd); extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, - struct list_head *uf); + struct list_head *uf, int fd); extern int do_munmap(struct mm_struct *, unsigned long, size_t, struct list_head *uf); @@ -2145,9 +2145,10 @@ static inline unsigned long
[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace flush
Changes since v7 [1]:

* Fix the IOVA reuse race by leaving the dma scatterlist mapped until
  unregistration time. Use iommu_unmap() in ib_umem_lease_break() to
  force-invalidate the ibverbs memory registration. (David Woodhouse)

* Introduce iomap_can_allocate() as a way to check, in the mmap
  write-fault path, whether any layouts are present that should prevent
  block map changes, and start the lease break process when an
  allocating write-fault occurs. This also removes the i_mapdcount
  bloat of 'struct inode' from v7. (Dave Chinner)

* Provide generic_map_direct_{open,close,lease} to clean up the
  filesystem wiring needed to implement MAP_DIRECT support. (Dave
  Chinner)

* Abandon (defer to a potential new fcntl()) support for using
  MAP_DIRECT on non-DAX files. With this change we can validate that
  the inode is MAP_DIRECT capable just once at mmap time rather than on
  every fault. (Dave Chinner)

* Arrange for lease_direct leases to also wait the
  /proc/sys/fs/lease-break-time period before calling break_fn, for
  example to allow the lease holder time to quiesce RDMA operations
  before the iommu starts throwing io-faults.

* Switch intel-iommu to use iommu_num_sg_pages().

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012707.html

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block map, or otherwise
dirty the block-map metadata, of a file without notification. It
supports a "flush from userspace" model where persistent-memory
applications can bypass the overhead of ongoing coordination of writes
with the filesystem, and it provides safety for RDMA operations
involving DAX mappings. The kernel always retains the ability to revoke
access and convert the file back to normal operation after performing a
"lease break". As with fcntl leases, there is no way for userspace to
cancel the lease break process once it has started; it can only delay
it via the /proc/sys/fs/lease-break-time setting.
MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory, with no ongoing coordination
with the filesystem via fsync/msync syscalls.

---

Dan Williams (14):
      mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
      fs, mm: pass fd to ->mmap_validate()
      fs: MAP_DIRECT core
      xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
      fs, xfs, iomap: introduce iomap_can_allocate()
      xfs: wire up MAP_DIRECT
      iommu, dma-mapping: introduce dma_get_iommu_domain()
      fs, mapdirect: introduce ->lease_direct()
      xfs: wire up ->lease_direct()
      device-dax: wire up ->lease_direct()
      iommu: up-level sg_num_pages() from amd-iommu
      iommu/vt-d: use iommu_num_sg_pages
      IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
      tools/testing/nvdimm: enable rdma unit tests

 arch/alpha/include/uapi/asm/mman.h     |   1 
 arch/mips/include/uapi/asm/mman.h      |   1 
 arch/mips/kernel/vdso.c                |   2 
 arch/parisc/include/uapi/asm/mman.h    |   1 
 arch/tile/mm/elf.c                     |   3 
 arch/x86/mm/mpx.c                      |   3 
 arch/xtensa/include/uapi/asm/mman.h    |   1 
 drivers/base/dma-mapping.c             |  10 
 drivers/dax/Kconfig                    |   1 
 drivers/dax/device.c                   |   4 
 drivers/infiniband/core/umem.c         |  90 +-
 drivers/iommu/amd_iommu.c              |  40 +--
 drivers/iommu/intel-iommu.c            |  30 +-
 drivers/iommu/iommu.c                  |  27 ++
 fs/Kconfig                             |   5 
 fs/Makefile                            |   1 
 fs/aio.c                               |   2 
 fs/mapdirect.c                         | 382 ++
 fs/xfs/Kconfig                         |   4 
 fs/xfs/Makefile                        |   1 
 fs/xfs/xfs_file.c                      | 103 +++
 fs/xfs/xfs_iomap.c                     |   3 
 fs/xfs/xfs_layout.c                    |  45 +++
 fs/xfs/xfs_layout.h                    |  13 +
 fs/xfs/xfs_pnfs.c                      |  30 --
 fs/xfs/xfs_pnfs.h                      |  10 -
 include/linux/dma-mapping.h            |   3 
 include/linux/fs.h                     |   2 
 include/linux/iomap.h                  |  10 +
 include/linux/iommu.h                  |   2 
 include/linux/mapdirect.h              |  57 
 include/linux/mm.h                     |  17 +
 include/linux/mman.h                   |  42 +++
 include/rdma/ib_umem.h                 |   8 +
 include/uapi/asm-generic/mman-common.h |   1 
 include/uapi/asm-generic/mman.h        |   1 
 ipc/shm.c                              |   3 
 mm/internal.h
Re: [ndctl PATCH] ndctl, test: rdma vs dax
On Mon, Oct 09, 2017 at 08:45:41AM -0700, Dan Williams wrote:
> On Mon, Oct 9, 2017 at 1:07 AM, Johannes Thumshirn wrote:
> > On Sat, Oct 07, 2017 at 08:14:42AM -0700, Dan Williams wrote:
> > [...]
> >> +rxe_cfg stop
> >> +rxe_cfg start
> >> +if ! rxe_cfg status | grep -n rxe0; then
> >> +	rxe_cfg add eth0
> >> +fi
> >
> > Can we maybe skip the dependency on rxe_cfg? All that is needed is
> > modprobe and echo.
>
> Sure, I'll take a look.

For my NVMe over Soft-RoCE test setup with Rapido [1] I used the
following:

modprobe rdma-rxe
echo eth0 > /sys/module/rdma_rxe/parameters/add

> > Also hard coding eth0 might be problematic in this case. This works
> > on your test-setup but surely isn't portable.
>
> Yes, which is part of the reason I have this listed under the
> "destructive" tests. Any advice on how to make it portable would be
> appreciated.

Maybe:

ETH=${ETH:-eth0}
echo $ETH > /sys/module/rdma_rxe/parameters/add

Byte,
	Johannes

[1] https://github.com/rapido-linux/rapido/blob/master/nvme_rdma_autorun.sh#L74

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850