Delivery reports about your e-mail
The original message was received at Thu, 12 Oct 2017 12:25:23 +0800
from lists.01.org [187.53.147.89]

----- The following addresses had permanent fatal errors -----

----- Transcript of session follows -----
... while talking to lists.01.org.:
>>> MAIL From:"Bounced mail"
<<< 501 "Bounced mail" ... Refused

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
On Wed, Oct 11, 2017 at 7:17 PM, Dan Williams wrote:
> On Wed, Oct 11, 2017 at 6:28 PM, Dan Williams wrote:
>> On Wed, Oct 11, 2017 at 6:21 PM, Al Viro wrote:
>>> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>>>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>>>> block map changes while the file is mapped. It requires the fd to setup
>>>> an fasync_struct for signalling lease break events to the lease holder.
>>>
>>> *UGH*
>>>
>>> That looks like one hell of a bad API. You are not even guaranteed that
>>> descriptor will remain be still open by the time you pass it down to your
>>> helper, nevermind the moment when event actually happens...
>>
>> What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?
>
> Ugh, so I think the difference with F_SETLEASE is that the lease ends
> when the fd is closed. In the mmap case the lease follows the lifetime
> of the vma. I'll rethink this interface...

I'm not seeing a lot of good options outside of documenting that if you
close the fd that is registered with MAP_DIRECT you may still get SIGIO
notifications with si_fd set to the stale fd.
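
For readers following the F_SETLEASE comparison, here is a minimal userspace sketch of the existing fcntl lease notification pattern. It is only an illustration: watch_lease(), on_lease_break() and last_break_fd are hypothetical names, and nothing here is MAP_DIRECT code. The point it shows is that si_fd is just the descriptor number the lease was registered with, so a handler cannot tell from the number alone whether that descriptor has since been closed or reused.

```c
#define _GNU_SOURCE	/* F_SETLEASE and F_SETSIG are Linux-specific */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical bookkeeping: descriptor number reported by the last break. */
static volatile sig_atomic_t last_break_fd = -1;

static void on_lease_break(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	/* Only a number; it may be stale if the fd was closed meanwhile. */
	last_break_fd = info->si_fd;
}

/*
 * Install a handler for signo, route lease-break notifications for fd to
 * that signal, and take a read lease.  Returns 0 or -errno.
 */
static int watch_lease(int fd, int signo)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = on_lease_break;
	sa.sa_flags = SA_SIGINFO;	/* needed for si_fd to be delivered */
	sigemptyset(&sa.sa_mask);
	if (sigaction(signo, &sa, NULL) < 0)
		return -errno;
	if (fcntl(fd, F_SETSIG, signo) < 0)
		return -errno;
	if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
		return -errno;
	return 0;
}
```

A caller would pass a descriptor opened O_RDONLY on a file it owns, for example watch_lease(fd, SIGRTMIN). With a plain fcntl lease the lease is dropped when the descriptor is closed, which is the difference from the vma-lifetime lease discussed above.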
Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
On Wed, Oct 11, 2017 at 6:28 PM, Dan Williams wrote:
> On Wed, Oct 11, 2017 at 6:21 PM, Al Viro wrote:
>> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>>> block map changes while the file is mapped. It requires the fd to setup
>>> an fasync_struct for signalling lease break events to the lease holder.
>>
>> *UGH*
>>
>> That looks like one hell of a bad API. You are not even guaranteed that
>> descriptor will remain be still open by the time you pass it down to your
>> helper, nevermind the moment when event actually happens...
>
> What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?

Ugh, so I think the difference with F_SETLEASE is that the lease ends
when the fd is closed. In the mmap case the lease follows the lifetime
of the vma. I'll rethink this interface...
Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
On Wed, Oct 11, 2017 at 6:21 PM, Al Viro wrote:
> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>> block map changes while the file is mapped. It requires the fd to setup
>> an fasync_struct for signalling lease break events to the lease holder.
>
> *UGH*
>
> That looks like one hell of a bad API. You are not even guaranteed that
> descriptor will remain be still open by the time you pass it down to your
> helper, nevermind the moment when event actually happens...

What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?
Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
> block map changes while the file is mapped. It requires the fd to setup
> an fasync_struct for signalling lease break events to the lease holder.

*UGH*

That looks like one hell of a bad API. You are not even guaranteed that
descriptor will remain be still open by the time you pass it down to your
helper, nevermind the moment when event actually happens...
[PATCH v9 5/6] fs, xfs, iomap: introduce break_layout_nowait()
In preparation for using FL_LAYOUT leases to allow coordination between
the kernel and processes doing userspace flushes / RDMA with DAX
mappings, add this helper that can be used to start the lease break
process in contexts where we can not sleep waiting for the lease break
timeout.

This is targeted to be used in an ->iomap_begin() implementation where
we may have various filesystem locks held and can not synchronously wait
for any FL_LAYOUT leases to be released. In particular an iomap mmap
fault handler running under mmap_sem can not unlock that semaphore and
wait for these leases to be unlocked. Instead, this signals the lease
holder(s) that a break is requested and immediately returns with an
error.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Al Viro
Cc: "Darrick J. Wong"
Cc: Ross Zwisler
Suggested-by: Dave Chinner
Signed-off-by: Dan Williams
---
 fs/xfs/xfs_iomap.c  |    3 +++
 fs/xfs/xfs_layout.c |    5 ++++-
 include/linux/fs.h  |    9 +++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..840e4080afb5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1055,6 +1055,9 @@ xfs_file_iomap_begin(
 			error = -EAGAIN;
 			goto out_unlock;
 		}
+		error = break_layout_nowait(inode);
+		if (error)
+			goto out_unlock;
 		/*
 		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 		 * pages to keep the chunks of work done where somewhat symmetric
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
index 71d95e1a910a..7a633b6e9397 100644
--- a/fs/xfs/xfs_layout.c
+++ b/fs/xfs/xfs_layout.c
@@ -19,7 +19,10 @@
  * about exposing unallocated blocks but just want to provide basic
  * synchronization between a local writer and pNFS clients. mmap writes would
  * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
+ * rules in the page fault path all we can do is start the lease break
+ * timeout. See usage of break_layout_nowait in xfs_file_iomap_begin to
+ * prevent write-faults from allocating blocks or performing extent
+ * conversion.
  */
 int
 xfs_break_layouts(
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17e0e899e184..2b030a2fccc7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2364,6 +2364,15 @@ static inline int break_layout(struct inode *inode, bool wait)
 
 #endif /* CONFIG_FILE_LOCKING */
 
+/*
+ * For use in paths where we can not wait for the layout to be recalled,
+ * for example when we are holding mmap_sem.
+ */
+static inline int break_layout_nowait(struct inode *inode)
+{
+	return break_layout(inode, false);
+}
+
 /* fs/open.c */
 struct audit_names;
 struct filename {
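
The "start the break and bail out" behaviour described in this patch can be modelled in plain userspace C. This is a sketch under stated assumptions, not kernel code: struct model_inode and the model_*() functions are hypothetical stand-ins, and the waiting case is simplified to the lease holders complying immediately.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode's FL_LAYOUT lease state. */
struct model_inode {
	int  layout_leases;	/* outstanding FL_LAYOUT leases */
	bool break_started;	/* lease holders have been signalled */
};

/*
 * Model of break_layout(inode, wait).  With no leases there is nothing to
 * do.  Otherwise the break is started; a non-waiting caller gets
 * -EWOULDBLOCK, while a waiting caller blocks until the leases are gone
 * (modelled here as an immediate release).
 */
static int model_break_layout(struct model_inode *inode, bool wait)
{
	if (inode->layout_leases == 0)
		return 0;
	inode->break_started = true;
	if (!wait)
		return -EWOULDBLOCK;
	inode->layout_leases = 0;	/* pretend the holders complied */
	return 0;
}

static int model_break_layout_nowait(struct model_inode *inode)
{
	return model_break_layout(inode, false);
}

/* Model of the fault-path caller: bail out instead of sleeping. */
static int model_iomap_begin(struct model_inode *inode)
{
	int error = model_break_layout_nowait(inode);

	if (error)
		return error;	/* "goto out_unlock" in the patch */
	return 0;		/* would go on to allocate blocks */
}
```

In this model a fault against an inode with outstanding leases fails at once but leaves break_started set, which corresponds to the commit message's "signals the lease holder(s) that a break is requested and immediately returns with an error".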
[PATCH v9 6/6] xfs: wire up MAP_DIRECT
MAP_DIRECT is an mmap(2) flag with the following semantics:

  MAP_DIRECT
  When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
  same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease
  is broken when a "lease breaker" attempts to write(2), change the block
  map (fallocate), or change the size of the file. Otherwise the mechanism
  of a lease break is identical to the typical lease break case where the
  lease needs to be removed (munmap) within the number of seconds
  specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
  remove the lease in time the kernel will invalidate the mapping and
  force all future accesses to the mapping to trigger SIGBUS.

  In addition to lease break timeouts causing faults in the mapping to
  result in SIGBUS, other states of the file will trigger SIGBUS at fault
  time:

      * The fault would trigger the filesystem to allocate blocks
      * The fault would trigger the filesystem to perform extent conversion

  In other words, MAP_DIRECT expects and enforces a fully allocated file
  where faults can be satisfied without modifying block map metadata.

  An unprivileged process may establish a MAP_DIRECT mapping on a file
  whose UID (owner) matches the filesystem UID of the process. A process
  with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
  arbitrary files.

  ERRORS
  EACCES Beyond the typical mmap(2) conditions that trigger EACCES
  MAP_DIRECT also requires the permission to set a file lease.

  EOPNOTSUPP The filesystem explicitly does not support the flag

  EPERM The file does not permit MAP_DIRECT mappings. Potential reasons
  are that DAX access is not available or the file has reflink extents.

  SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
  might require block-map updates, or the lease timed out and the
  kernel invalidated the mapping.

Cc: Jan Kara
Cc: Arnd Bergmann
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Alexander Viro
Cc: "Darrick J. Wong"
Cc: Ross Zwisler
Cc: Jeff Layton
Cc: "J. Bruce Fields"
Signed-off-by: Dan Williams
---
 fs/xfs/Kconfig                  |    2 +-
 fs/xfs/xfs_file.c               |  107 ++-
 include/linux/mman.h            |    3 +
 include/uapi/asm-generic/mman.h |    1 +
 4 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index f62fc6629abb..f8765653a438 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
 
 config XFS_LAYOUT
 	def_bool y
-	depends on EXPORTFS_BLOCK_OPS
+	depends on EXPORTFS_BLOCK_OPS || FS_DAX
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3cc7292b2e9f..71dbe0307746 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -41,12 +41,22 @@
 #include "xfs_reflink.h"
 #include "xfs_layout.h"
 
+#include
 #include
 #include
 #include
+#include
 #include
 
 static const struct vm_operations_struct xfs_file_vm_ops;
+static const struct vm_operations_struct xfs_file_vm_direct_ops;
+
+static bool
+xfs_vma_is_direct(
+	struct vm_area_struct	*vma)
+{
+	return vma->vm_ops == &xfs_file_vm_direct_ops;
+}
 
 /*
  * Clear the specified ranges to zero through either the pagecache or DAX.
@@ -1013,6 +1023,25 @@ xfs_file_llseek(
 }
 
 /*
+ * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
+ * valid. See map_direct_invalidate.
+ */
+static bool
+xfs_vma_has_direct_lease(
+	struct vm_area_struct	*vma)
+{
+	/* Non MAP_DIRECT vmas do not require layout leases */
+	if (!xfs_vma_is_direct(vma))
+		return true;
+
+	if (!test_map_direct_valid(vma->vm_private_data))
+		return false;
+
+	/* We have a valid lease */
+	return true;
+}
+
+/*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
  *
@@ -1028,7 +1057,8 @@ __xfs_filemap_fault(
 	enum page_entry_size	pe_size,
 	bool			write_fault)
 {
-	struct inode		*inode = file_inode(vmf->vma->vm_file);
+	struct vm_area_struct	*vma = vmf->vma;
+	struct inode		*inode = file_inode(vma->vm_file);
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			ret;
 
@@ -1036,10 +1066,15 @@ __xfs_filemap_fault(
 
 	if (write_fault) {
 		sb_start_pagefault(inode->i_sb);
-		file_update_time(vmf->vma->vm_file);
+		file_update_time(vma->vm_file);
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	if (!xfs_vma_has_direct_lease(vma)) {
+		ret = VM_FAULT_SIGBUS;
+
[PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
block map changes while the file is mapped. It requires the fd to setup
an fasync_struct for signalling lease break events to the lease holder.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: "Darrick J. Wong"
Cc: Ross Zwisler
Cc: Andrew Morton
Signed-off-by: Dan Williams
---
 arch/mips/kernel/vdso.c |    2 +-
 arch/tile/mm/elf.c      |    2 +-
 arch/x86/mm/mpx.c       |    3 ++-
 fs/aio.c                |    2 +-
 include/linux/fs.h      |    2 +-
 include/linux/mm.h      |    9 +
 ipc/shm.c               |    3 ++-
 mm/internal.h           |    2 +-
 mm/mmap.c               |   13 +++--
 mm/nommu.c              |    5 +++--
 mm/util.c               |    7 ---
 11 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index cf10654477a9..ab26c7ac0316 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-			   0, NULL, 0);
+			   0, NULL, 0, -1);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 5ffcbe76aef9..61a9588e141a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 		addr = mmap_region(NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
 				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
-				   NULL, 0);
+				   NULL, 0, -1);
 		if (addr > (unsigned long) -PAGE_SIZE)
 			retval = (int) addr;
 	}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9ceaa955d2ba..a8baa94a496b 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len)
 
 	down_write(&mm->mmap_sem);
 	addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
+		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate,
+		       NULL, -1);
 	up_write(&mm->mmap_sem);
 	if (populate)
 		mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 5a2487217072..d10ca6db2ee6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
 				       PROT_READ | PROT_WRITE,
-				       MAP_SHARED, 0, &unused, NULL);
+				       MAP_SHARED, 0, &unused, NULL, -1);
 	up_write(&mm->mmap_sem);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5aee97d64cae..17e0e899e184 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1702,7 +1702,7 @@ struct file_operations {
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
 	int (*mmap_validate) (struct file *, struct vm_area_struct *,
-			unsigned long);
+			unsigned long, int);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 38f6ed954dde..ec45087348c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf, unsigned long map_flags);
+	struct list_head *uf, unsigned long map_flags, int fd);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
-	struct list_head *uf);
+	struct list_head *uf, int fd);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 
@@ -2145,9 +2145,10 @@ static inline unsigned long
[PATCH v9 1/6] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
is guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:

    I see why you *think* you want a bitmap. You think you want
    a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
    etc, so that people can do

        ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
                        | MAP_SYNC, fd, 0);

    and "know" that MAP_SYNC actually takes.

    And I'm saying that whole wish is bogus. You're fundamentally
    depending on special semantics, just make it explicit. It's already
    not portable, so don't try to make it so.

    Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
    of 0x3, and make people do

        ret = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

    and then the kernel side is easier too (none of that random garbage
    playing games with looking at the "MAP_VALIDATE bit", but just another
    case statement in that map type thing.

    Boom. Done.

Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end arrange for flags to be generically
validated against a mmap_supported_mask exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.

Cc: Arnd Bergmann
Cc: Andy Lutomirski
Cc: Andrew Morton
Suggested-by: Christoph Hellwig
Suggested-by: Linus Torvalds
Reviewed-by: Jan Kara
Signed-off-by: Dan Williams
---
 arch/alpha/include/uapi/asm/mman.h           |    1 +
 arch/mips/include/uapi/asm/mman.h            |    1 +
 arch/mips/kernel/vdso.c                      |    2 +-
 arch/parisc/include/uapi/asm/mman.h          |    1 +
 arch/tile/mm/elf.c                           |    3 +-
 arch/xtensa/include/uapi/asm/mman.h          |    1 +
 include/linux/fs.h                           |    2 +
 include/linux/mm.h                           |    2 +
 include/linux/mman.h                         |   39 ++
 include/uapi/asm-generic/mman-common.h       |    1 +
 mm/mmap.c                                    |   21 --
 tools/include/uapi/asm-generic/mman-common.h |    1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..f85f18ffbf8c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -11,6 +11,7 @@
 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping (OSF/1 is _wrong_) */
 #define MAP_FIXED	0x100		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..054314bb062a 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -28,6 +28,7 @@
  */
 #define MAP_SHARED	0x001		/* Share changes */
 #define MAP_PRIVATE	0x002		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
 
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-			   0, NULL);
+			   0, NULL, 0);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..a66fdb9c4b6d 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -11,6 +11,7 @@
 #define
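
Why a map type of 0x3 "is guaranteed to fail on all legacy mmap implementations" can be sketched in userspace C. This is an illustration only: the SK_* names and both functions are hypothetical, the mm/mmap.c hunk is not visible in this excerpt, and the single supported_mask parameter stands in for "legacy flags plus whatever the file's mmap_supported_mask opts into". The -EOPNOTSUPP result follows the error description in patch 6/6.

```c
#include <errno.h>

/* Values taken from the header hunks in this patch. */
#define SK_MAP_SHARED		0x01
#define SK_MAP_PRIVATE		0x02
#define SK_MAP_SHARED_VALIDATE	0x03
#define SK_MAP_TYPE		0x0f

/* Legacy behaviour: only two map types exist; extra flag bits are ignored. */
static int legacy_check(unsigned long flags)
{
	switch (flags & SK_MAP_TYPE) {
	case SK_MAP_SHARED:
	case SK_MAP_PRIVATE:
		return 0;	/* unknown extension bits silently accepted */
	default:
		return -EINVAL;	/* includes type 0x3 */
	}
}

/*
 * Sketch of the new behaviour: type 0x3 is "shared, and reject any flag
 * bit outside supported_mask".  The other types behave as before.
 */
static int validating_check(unsigned long flags, unsigned long supported_mask)
{
	switch (flags & SK_MAP_TYPE) {
	case SK_MAP_SHARED_VALIDATE:
		if (flags & ~(SK_MAP_TYPE | supported_mask))
			return -EOPNOTSUPP;
		return 0;
	case SK_MAP_SHARED:
	case SK_MAP_PRIVATE:
		return 0;
	default:
		return -EINVAL;
	}
}
```

With a made-up extension bit such as 0x80000, legacy_check(SK_MAP_SHARED | 0x80000) succeeds even though the bit means nothing to that kernel, which is the anti-pattern the commit message describes, while legacy_check(SK_MAP_SHARED_VALIDATE | 0x80000) fails with -EINVAL.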
[PATCH v9 4/6] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
Move xfs_break_layouts() to its own compilation unit so that it can be
used for both pnfs layouts and MAP_DIRECT mappings.

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: "Darrick J. Wong"
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 fs/xfs/Kconfig      |    4 ++++
 fs/xfs/Makefile     |    1 +
 fs/xfs/xfs_file.c   |    1 +
 fs/xfs/xfs_ioctl.c  |    1 +
 fs/xfs/xfs_iops.c   |    1 +
 fs/xfs/xfs_layout.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_layout.h |   13 +++++++++++++
 fs/xfs/xfs_pnfs.c   |   31 +------------------------------
 fs/xfs/xfs_pnfs.h   |    8 --------
 9 files changed, 64 insertions(+), 38 deletions(-)
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 1b98cfa342ab..f62fc6629abb 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL
 	  result in warnings.
 
 	  This behavior can be modified at runtime via sysfs.
+
+config XFS_LAYOUT
+	def_bool y
+	depends on EXPORTFS_BLOCK_OPS
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a6e955bfead8..d44135107490 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
 xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
+xfs-$(CONFIG_XFS_LAYOUT)	+= xfs_layout.o
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 309e26c9dddb..3cc7292b2e9f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -39,6 +39,7 @@
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_layout.h"
 
 #include
 #include
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..8bfd6db4f06d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -44,6 +44,7 @@
 #include "xfs_btree.h"
 #include
 #include "xfs_fsmap.h"
+#include "xfs_layout.h"
 
 #include
 #include
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 17081c77ef86..4bc2e5ef1a3a 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -39,6 +39,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
+#include "xfs_layout.h"
 
 #include
 #include
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
new file mode 100644
index 000000000000..71d95e1a910a
--- /dev/null
+++ b/fs/xfs/xfs_layout.c
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include "xfs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+
+#include
+
+/*
+ * Ensure that we do not have any outstanding pNFS layouts that can be used by
+ * clients to directly read from or write to this inode. This must be called
+ * before every operation that can remove blocks from the extent map.
+ * Additionally we call it during the write operation, where aren't concerned
+ * about exposing unallocated blocks but just want to provide basic
+ * synchronization between a local writer and pNFS clients. mmap writes would
+ * also benefit from this sort of synchronization, but due to the tricky locking
+ * rules in the page fault path we don't bother.
+ */
+int
+xfs_break_layouts(
+	struct inode		*inode,
+	uint			*iolock)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
+		xfs_iunlock(ip, *iolock);
+		error = break_layout(inode, true);
+		*iolock = XFS_IOLOCK_EXCL;
+		xfs_ilock(ip, *iolock);
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h
new file mode 100644
index 000000000000..f848ee78cc93
--- /dev/null
+++ b/fs/xfs/xfs_layout.h
@@ -0,0 +1,13 @@
+#ifndef _XFS_LAYOUT_H
+#define _XFS_LAYOUT_H 1
+
+#ifdef CONFIG_XFS_LAYOUT
+int xfs_break_layouts(struct inode *inode, uint *iolock);
+#else
+static inline int
+xfs_break_layouts(struct inode *inode, uint *iolock)
+{
+	return 0;
+}
+#endif /* CONFIG_XFS_LAYOUT */
+#endif /* _XFS_LAYOUT_H */
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 4246876df7b7..ee9de16d7672 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -18,36 +18,7 @@
 #include "xfs_shared.h"
 #include "xfs_bit.h"
 #include "xfs_pnfs.h"
-
-/*
- * Ensure that we do not have any outstanding pNFS layouts that can be used by
- * clients to directly read from or write to this inode. This must be called
- * before every operation that can remove blocks from the extent map.
- * Additionally we call it
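
A side note on the loop condition in the moved xfs_break_layouts(): as written, == binds tighter than =, so error receives the result of the comparison rather than the return value of break_layout(). The following userspace sketch shows the difference; fake_break_layout() is a hypothetical stand-in that always reports it would block, and the patch itself only moves the code unchanged.

```c
#include <errno.h>

/* Hypothetical stand-in for break_layout(inode, false). */
static int fake_break_layout(void)
{
	return -EWOULDBLOCK;
}

/* Same shape as the condition in the patch: error = (call == -EWOULDBLOCK). */
static int error_as_written(void)
{
	int error;

	if ((error = fake_break_layout() == -EWOULDBLOCK))
		return error;	/* holds the comparison result */
	return error;
}

/* With the assignment parenthesised: error = call, then compare. */
static int error_with_parens(void)
{
	int error;

	if ((error = fake_break_layout()) == -EWOULDBLOCK)
		return error;	/* holds the function's return value */
	return error;
}
```

In the first form error is 1 or 0 after the test, so, as far as this reading goes, a return value from the non-blocking call other than -EWOULDBLOCK would be reported as 0 when the loop exits.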
[PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
Changes since v8 [1]:
* Move MAP_SHARED_VALIDATE definition next to MAP_SHARED in all arch
  headers (Jan)

* Include xfs_layout.h directly in all the files that call
  xfs_break_layouts() (Dave)

* Clarify / add more comments to the MAP_DIRECT checks at fault time
  (Dave)

* Rename iomap_can_allocate() to break_layout_nowait() to make it plain
  the reason we are bailing out of iomap_begin.

* Defer the lease_direct mechanism and RDMA core changes to a later
  patch series.

* EXT4 support is in the works and will be rebased on Jan's MAP_SYNC
  patches.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012772.html

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block-map, or otherwise
dirty the block-map metadata of a file without notification. It supports
a "flush from userspace" model where persistent memory applications can
bypass the overhead of ongoing coordination of writes with the
filesystem, and it provides safety to RDMA operations involving DAX
mappings.

The kernel always has the ability to revoke access and convert the file
back to normal operation after performing a "lease break". Similar to
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started, it can only delay it via the
/proc/sys/fs/lease-break-time setting.

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory with no ongoing coordination with
the filesystem via fsync/msync syscalls.

The MAP_DIRECT mechanism is complementary to MAP_SYNC. Here are some
scenarios where you would choose one over the other:

* 3rd party DMA / RDMA to DAX with hardware that does not support
  on-demand paging (shared virtual memory) => MAP_DIRECT

* Support for reflinked inodes, fallocate-punch-hole, truncate, or any
  other operation that mutates the block map of an actively mapped file
  => MAP_SYNC

* Userspace flush => MAP_SYNC or MAP_DIRECT

* Assurances that the file's block map metadata is stable, i.e. minimize
  worst case fault latency by locking out updates => MAP_DIRECT

---

Dan Williams (6):
      mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
      fs, mm: pass fd to ->mmap_validate()
      fs: MAP_DIRECT core
      xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
      fs, xfs, iomap: introduce break_layout_nowait()
      xfs: wire up MAP_DIRECT

 arch/alpha/include/uapi/asm/mman.h           |    1
 arch/mips/include/uapi/asm/mman.h            |    1
 arch/mips/kernel/vdso.c                      |    2
 arch/parisc/include/uapi/asm/mman.h          |    1
 arch/tile/mm/elf.c                           |    3
 arch/x86/mm/mpx.c                            |    3
 arch/xtensa/include/uapi/asm/mman.h          |    1
 fs/Kconfig                                   |    1
 fs/Makefile                                  |    2
 fs/aio.c                                     |    2
 fs/mapdirect.c                               |  237 ++
 fs/xfs/Kconfig                               |    4
 fs/xfs/Makefile                              |    1
 fs/xfs/xfs_file.c                            |  108
 fs/xfs/xfs_ioctl.c                           |    1
 fs/xfs/xfs_iomap.c                           |    3
 fs/xfs/xfs_iops.c                            |    1
 fs/xfs/xfs_layout.c                          |   45 +
 fs/xfs/xfs_layout.h                          |   13 +
 fs/xfs/xfs_pnfs.c                            |   31 ---
 fs/xfs/xfs_pnfs.h                            |    8 -
 include/linux/fs.h                           |   11 +
 include/linux/mapdirect.h                    |   40
 include/linux/mm.h                           |    9 +
 include/linux/mman.h                         |   42 +
 include/uapi/asm-generic/mman-common.h       |    1
 include/uapi/asm-generic/mman.h              |    1
 ipc/shm.c                                    |    3
 mm/internal.h                                |    2
 mm/mmap.c                                    |   28 ++-
 mm/nommu.c                                   |    5 -
 mm/util.c                                    |    7 -
 tools/include/uapi/asm-generic/mman-common.h |    1
 33 files changed, 557 insertions(+), 62 deletions(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h
 create mode 100644 include/linux/mapdirect.h
[PATCH v9 3/6] fs: MAP_DIRECT core
Introduce a set of helper apis for filesystems to establish FL_LAYOUT
leases to protect against writes and block map updates while a
MAP_DIRECT mapping is established. While the lease protects against the
syscall write path and fallocate it does not protect against allocating
write-faults, so this relies on i_mapdcount to disable block map updates
from write faults.

Like the pnfs case MAP_DIRECT does its own timeout of the lease since we
need to have a process context for running map_direct_invalidate().

Cc: Jan Kara
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: "Darrick J. Wong"
Cc: Ross Zwisler
Cc: Jeff Layton
Cc: "J. Bruce Fields"
Signed-off-by: Dan Williams
---
 fs/Kconfig                |    1 +
 fs/Makefile               |    2 +-
 fs/mapdirect.c            |  237 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mapdirect.h |   40 ++++++++
 4 files changed, 279 insertions(+), 1 deletion(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 include/linux/mapdirect.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..a7b31a96a753 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
 config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
+	depends on FILE_LOCKING
 	depends on !(ARM || MIPS || SPARC)
 	select FS_IOMAP
 	select DAX
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..c0e791d235d8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)		+= aio.o
-obj-$(CONFIG_FS_DAX)		+= dax.o
+obj-$(CONFIG_FS_DAX)		+= dax.o mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)	+= locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
new file mode 100644
index 000000000000..9f4dd7395dcd
--- /dev/null
+++ b/fs/mapdirect.c
@@ -0,0 +1,237 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define MAPDIRECT_BREAK 0
+#define MAPDIRECT_VALID 1
+
+struct map_direct_state {
+	atomic_t mds_ref;
+	atomic_t mds_vmaref;
+	unsigned long mds_state;
+	struct inode *mds_inode;
+	struct delayed_work mds_work;
+	struct fasync_struct *mds_fa;
+	struct vm_area_struct *mds_vma;
+};
+
+bool test_map_direct_valid(struct map_direct_state *mds)
+{
+	return test_bit(MAPDIRECT_VALID, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(test_map_direct_valid);
+
+static void put_map_direct(struct map_direct_state *mds)
+{
+	if (!atomic_dec_and_test(&mds->mds_ref))
+		return;
+	kfree(mds);
+}
+
+static void put_map_direct_vma(struct map_direct_state *mds)
+{
+	struct vm_area_struct *vma = mds->mds_vma;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	void *owner = mds;
+
+	if (!atomic_dec_and_test(&mds->mds_vmaref))
+		return;
+
+	/*
+	 * Flush in-flight+forced lm_break events that may be
+	 * referencing this dying vma.
+	 */
+	mds->mds_vma = NULL;
+	set_bit(MAPDIRECT_BREAK, &mds->mds_state);
+	vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+	flush_delayed_work(&mds->mds_work);
+	iput(inode);
+
+	put_map_direct(mds);
+}
+
+void generic_map_direct_close(struct vm_area_struct *vma)
+{
+	put_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_close);
+
+static void get_map_direct_vma(struct map_direct_state *mds)
+{
+	atomic_inc(&mds->mds_vmaref);
+}
+
+void generic_map_direct_open(struct vm_area_struct *vma)
+{
+	get_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_open);
+
+static void map_direct_invalidate(struct work_struct *work)
+{
+	struct map_direct_state *mds;
+	struct vm_area_struct *vma;
+	struct inode *inode;
+	void *owner;
+
+	mds = container_of(work, typeof(*mds), mds_work.work);
+
+	clear_bit(MAPDIRECT_VALID, &mds->mds_state);
+
+	vma = ACCESS_ONCE(mds->mds_vma);
+	inode = mds->mds_inode;
+
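
The two reference counts and the VALID bit in the visible part of fs/mapdirect.c can be modelled in userspace with C11 atomics. This is a sketch, not the kernel implementation: the sk_* names are hypothetical, the initial counts of one each are an assumption (the allocation path is not shown in this excerpt), and the lease unlock, work flush and iput() steps are reduced to a comment.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of struct map_direct_state. */
struct sk_state {
	atomic_int  ref;	/* lifetime of the structure (mds_ref) */
	atomic_int  vmaref;	/* vmas sharing the state (mds_vmaref) */
	atomic_bool valid;	/* analogue of the MAPDIRECT_VALID bit */
	int        *freed;	/* illustration hook: counts frees */
};

/* Assumed starting point: one structure reference, one vma reference. */
static struct sk_state *sk_alloc(int *freed)
{
	struct sk_state *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	atomic_init(&s->ref, 1);
	atomic_init(&s->vmaref, 1);
	atomic_init(&s->valid, true);
	s->freed = freed;
	return s;
}

/* Like put_map_direct(): free on the final structure reference. */
static void sk_put(struct sk_state *s)
{
	if (atomic_fetch_sub(&s->ref, 1) != 1)
		return;
	if (s->freed)
		(*s->freed)++;
	free(s);
}

/* Like generic_map_direct_open(): another vma now shares the state. */
static void sk_vma_open(struct sk_state *s)
{
	atomic_fetch_add(&s->vmaref, 1);
}

/* Like generic_map_direct_close(): the last vma drops the structure ref. */
static void sk_vma_close(struct sk_state *s)
{
	if (atomic_fetch_sub(&s->vmaref, 1) != 1)
		return;
	/* kernel: unlock the lease, flush the delayed work, iput() */
	sk_put(s);
}

/* Like map_direct_invalidate(): later faults must be refused. */
static void sk_invalidate(struct sk_state *s)
{
	atomic_store(&s->valid, false);
}

/* Like test_map_direct_valid(), as consulted by the fault handler. */
static bool sk_fault_allowed(struct sk_state *s)
{
	return atomic_load(&s->valid);
}
```

Keeping the vma count separate from the structure count lets teardown work run exactly once, when the last vma closes, while other holders of a structure reference can still safely touch the state.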
Re: ffsb job does not exit on xfs 4.14-rc1+
On Wed, Oct 11, 2017 at 09:54:15PM +0800, Xiong Zhou wrote:
> On Mon, Sep 25, 2017 at 10:49:03AM +0200, Carlos Maiolino wrote:
> > On Mon, Sep 25, 2017 at 01:40:06AM +0000, Xiong Zhou wrote:
> > > Hi,
> > >
> > > ffsb test won't exit like this on Linus tree 4.14-rc1+.
> > > Latest commit cd4175b11685
> >
> > Can you provide more information? Do you have any kernel log from this
> > issue?
> > dmesg, Oopses, traces, etc.
> > Storage configuration might also be required here.
>
> Turns out this only repreduces on nvdimm devices, xfs without dax
> mount option. More logs are attached.

It's a hang, so what's the output of sysrq-w once it's hung?

> > have you also tried to reproduce it with another filesystem? If so, is the
> > same
> > problem reproducible with another filesystem or only with XFS?
>
> Only xfs. Test on ext4 ends shortly.
>
> >
> > P.S. please avoid sending it to all lists (mainly LKML).
>
> Why? I thought LKML was better archived.

The lists are all archived, but that's irrelevant. The list you
should report problems to is based on the scope of the problem, not
whether the lists are archived or not.

Because the scope of the problem at this point is XFS, it's
inappropriate to report it to lists that are for general kernel or
VFS issues. There's enough noise on those lists without everyone
bombarding them with subsystem specific issues - that's why we have
subsystem specific lists in the first place.

If it turns out to be a problem in some other subsystem, we'll add
cc's to other subsystem or general lists as appropriate.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC] KVM "fake DAX" device flushing
On Wed, Oct 11, 2017 at 11:51 AM, Pankaj Gupta wrote:
> We are sharing the prototype version of 'fake DAX' flushing
> interface for the initial feedback. This is still work in progress
> and not yet ready for merging.
>
> Prototype right now just implements basic functionality without advanced
> features with two major parts:
>
> - Qemu virtio-pmem device
>   It exposes a persistent memory range to KVM guest which at host side is
>   file backed memory and works as persistent memory device. In addition to
>   this it provides a virtio flushing interface for KVM guest to do a Qemu
>   side sync for guest DAX persistent memory range.
>
> - Guest virtio-pmem driver
>   Reads persistent memory range from paravirt device and reserves system
>   memory map. It also allocates a block device corresponding to the pmem
>   range which is accessed by DAX capable file systems. (file system support
>   is still pending).
>
> We shared the project idea for 'fake DAX' flushing interface here [1].
> Based on suggestions here [2], we implemented guest 'virtio-pmem'
> driver and Qemu paravirt device.
>
> [1] https://www.spinics.net/lists/kvm/msg149761.html
> [2] https://www.spinics.net/lists/kvm/msg153095.html
>
> Work yet to be done:
>
> - Separate out the common code used by ACPI pmem interface and
>   reuse it.
>
> - In pmem device memmap allocation and working. There is some parallel work
>   going on upstream related to 'memory_hotplug restructuring' [3] and also
>   hitting a memory section alignment issue [4].
>
>   [3] https://lwn.net/Articles/712099/
>   [4] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html
>
> - Provide DAX capable file-system(ext4 & XFS) support.
> - Qemu device flush functionality.
> - Qemu live migration work when host page cache is used.
> - Multiple virtio-pmem disks support.
>
> Prototype implementation for feedback:
>
> Kernel:
> https://github.com/pagupta/linux/commit/d15cf90074eae91aeed7a228da3faf319566dd40

Please send this as a patch so it can be reviewed over email.
[RFC] KVM "fake DAX" device flushing
We are sharing the prototype version of 'fake DAX' flushing
interface for the initial feedback. This is still work in progress
and not yet ready for merging.

Prototype right now just implements basic functionality without advanced
features with two major parts:

- Qemu virtio-pmem device
  It exposes a persistent memory range to KVM guest which at host side is
  file backed memory and works as persistent memory device. In addition to
  this it provides a virtio flushing interface for KVM guest to do a Qemu
  side sync for guest DAX persistent memory range.

- Guest virtio-pmem driver
  Reads persistent memory range from paravirt device and reserves system
  memory map. It also allocates a block device corresponding to the pmem
  range which is accessed by DAX capable file systems. (file system support
  is still pending).

We shared the project idea for 'fake DAX' flushing interface here [1].
Based on suggestions here [2], we implemented guest 'virtio-pmem'
driver and Qemu paravirt device.

[1] https://www.spinics.net/lists/kvm/msg149761.html
[2] https://www.spinics.net/lists/kvm/msg153095.html

Work yet to be done:

- Separate out the common code used by ACPI pmem interface and
  reuse it.

- In pmem device memmap allocation and working. There is some parallel work
  going on upstream related to 'memory_hotplug restructuring' [3] and also
  hitting a memory section alignment issue [4].

  [3] https://lwn.net/Articles/712099/
  [4] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html

- Provide DAX capable file-system(ext4 & XFS) support.
- Qemu device flush functionality.
- Qemu live migration work when host page cache is used.
- Multiple virtio-pmem disks support.

Prototype implementation for feedback:

Kernel: https://github.com/pagupta/linux/commit/d15cf90074eae91aeed7a228da3faf319566dd40
Qemu  : https://github.com/pagupta/qemu/commit/9c428db1e1076970e097e2b0ef8afe52509af823

Please provide feedback. Also, I will be attending KVM Forum in Prague
(25-27 Oct). If you are attending KVM Forum/Linux conference, I would love
to have a discussion on ideas and future work.

Thank you,
Pankaj Gupta
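As a rough illustration of the flush path described above, here is a minimal userspace sketch in C. It is not taken from the prototype: the names fake_dax_host_flush() and fake_dax_demo() are hypothetical, an ordinary file mapped with MAP_SHARED stands in for the file-backed guest range, and using fsync() as the host-side sync is an assumption, since the RFC only says that Qemu performs a sync for the guest DAX range.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical host-side flush handler: the guest requests a flush and the
 * host makes the file-backed "pmem" range durable. fsync() is an assumption;
 * the RFC does not say which call the Qemu side uses.
 */
int fake_dax_host_flush(int backing_fd)
{
	return fsync(backing_fd);
}

/*
 * Model one guest store followed by one flush request.
 * Returns 0 on success, -1 on any failure.
 */
int fake_dax_demo(const char *path, const char *msg)
{
	const size_t len = 4096;
	char *range;
	int rc = -1;
	int fd;

	assert(path != NULL && msg != NULL);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	/* MAP_SHARED so stores land in the page cache of the backing file. */
	range = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (range == MAP_FAILED)
		goto out_close;

	strncpy(range, msg, len - 1);		/* the "guest" store */
	rc = fake_dax_host_flush(fd);		/* the flush request */

	munmap(range, len);
out_close:
	close(fd);
	return rc;
}
```

Calling fake_dax_demo("/tmp/fake_dax_demo.img", "hello") exercises the store and the flush once; a real device would carry the flush request over a virtqueue instead of a direct function call.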
Re: [PATCH] libnvdimm: add smart payload fields added in DSM 1.6
On Wed, Oct 11, 2017 at 10:57 AM, Dave Jiang wrote:
> NVDIMM DSM interface v1.6 added additional smart health fields. Updating the
> smart payload data structure accordingly.

I'll also add a note when I merge this that the only reason we are
maintaining this structure in the kernel is in case we want to
translate 3rd party SMART payload formats into the ND_IOCTL_SMART
format. Outside of that we could just delete this since ndctl is
already doing that translation.
[PATCH] libnvdimm: add smart payload fields added in DSM 1.6
NVDIMM DSM interface v1.6 added additional smart health fields. Updating the
smart payload data structure accordingly.

Signed-off-by: Dave Jiang
---
 include/uapi/linux/ndctl.h |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 3f03567..5ca8628 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -25,6 +25,7 @@ struct nd_cmd_smart {
 #define ND_SMART_USED_VALID	(1 << 2)
 #define ND_SMART_TEMP_VALID	(1 << 3)
 #define ND_SMART_CTEMP_VALID	(1 << 4)
+#define ND_SMART_SHUTDOWN_COUNT_VALID	(1 << 5)
 #define ND_SMART_ALARM_VALID	(1 << 9)
 #define ND_SMART_SHUTDOWN_VALID	(1 << 10)
 #define ND_SMART_VENDOR_VALID	(1 << 11)
@@ -44,7 +45,10 @@ struct nd_smart_payload {
 	__u8 alarm_flags;
 	__u16 temperature;
 	__u16 ctrl_temperature;
-	__u8 reserved1[15];
+	__u32 shutdown_count;
+	__u8 ait_status;
+	__u16 pmic_temperature;
+	__u8 reserved1[8];
 	__u8 shutdown_state;
 	__u32 vendor_size;
 	__u8 vendor_data[92];
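The patch only works as a layout-preserving change if the new fields occupy exactly the 15 bytes that reserved1[] used to cover. The sketch below checks that arithmetic (4 + 1 + 2 + 8 = 15) in userspace C. The struct names smart_region_old and smart_region_new and the helper smart_shutdown_count() are hypothetical and model only the region touched by the diff; the packed attribute is an assumption, suggested by the __u16 that follows a single __u8.

```c
#include <assert.h>
#include <stdint.h>

#define ND_SMART_SHUTDOWN_COUNT_VALID (1u << 5)	/* bit added by the patch */

/* The touched region of the payload before the patch. */
struct smart_region_old {
	uint8_t reserved1[15];
} __attribute__((packed));

/* The same region after the patch (assumed packed, see above). */
struct smart_region_new {
	uint32_t shutdown_count;
	uint8_t  ait_status;
	uint16_t pmic_temperature;
	uint8_t  reserved1[8];
} __attribute__((packed));

/* 4 + 1 + 2 + 8 == 15: fields after the region keep their offsets. */
_Static_assert(sizeof(struct smart_region_new) ==
	       sizeof(struct smart_region_old),
	       "new fields must fit in the old reserved area");

/*
 * Hypothetical consumer: only trust shutdown_count when the valid bit is
 * set in the payload flags. Returns 0 on success, -1 if not valid.
 */
int smart_shutdown_count(uint32_t flags, const struct smart_region_new *r,
			 uint32_t *count)
{
	if (!(flags & ND_SMART_SHUTDOWN_COUNT_VALID))
		return -1;
	*count = r->shutdown_count;
	return 0;
}
```

Firmware that predates DSM 1.6 would leave the new valid bit clear, so a consumer written this way falls back cleanly instead of reading reserved bytes as a count.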
Re: nfit test deadlock
On Wed, Oct 11, 2017 at 9:24 AM, Ross Zwisler wrote:
> Hey Dan,
>
> I was getting the ndctl unit tests working again in my setup today, and on the
> first run of ndctl's "make check" hit a deadlock. This seems to be very easy
> to reproduce, all you have to do is specify a number of jobs to make that is
> larger than 1 (which I was accidentally doing via an alias),
> i.e. "make -j32 check"
>
> This seems to reproduce 100% of the time.
>
> I'll append the output of "echo w > /proc/sysrq-trigger" to the end of this
> mail.
>
> I was using v4.13 and ndctl 58.2.

I'll take a look. Probably just need more synchronization around the
nfit_test setup/teardown path, but my recommendation for now is don't
try to run the unit tests in parallel.
nfit test deadlock
Hey Dan,

I was getting the ndctl unit tests working again in my setup today, and on the
first run of ndctl's "make check" hit a deadlock. This seems to be very easy
to reproduce, all you have to do is specify a number of jobs to make that is
larger than 1 (which I was accidentally doing via an alias),
i.e. "make -j32 check"

This seems to reproduce 100% of the time.

I'll append the output of "echo w > /proc/sysrq-trigger" to the end of this
mail.

I was using v4.13 and ndctl 58.2.

- Ross

---

[ 132.668043] sysrq: SysRq : Show Blocked State
[ 132.668968] task PC stack pid father
[ 132.670774] lt-libndctl D 0 5991 5983 0x0004
[ 132.672102] Call Trace:
[ 132.672744] __schedule+0x411/0xb10
[ 132.673266] ? trace_hardirqs_on+0xd/0x10
[ 132.674058] schedule+0x40/0x90
[ 132.674545] __kernfs_remove+0x1f9/0x310
[ 132.675298] ? remove_wait_queue+0x70/0x70
[ 132.676046] kernfs_remove_by_name_ns+0x45/0x90
[ 132.676848] remove_files.isra.1+0x35/0x70
[ 132.677451] sysfs_remove_group+0x44/0x90
[ 132.678259] sysfs_remove_groups+0x2e/0x50
[ 132.679047] device_remove_attrs+0x4d/0x80
[ 132.679438] device_del+0x1ec/0x330
[ 132.679888] device_unregister+0x1a/0x60
[ 132.680266] nvdimm_bus_unregister+0x17/0x20 [libnvdimm]
[ 132.680876] acpi_nfit_unregister+0x15/0x20 [nfit]
[ 132.681329] devm_action_release+0xf/0x20
[ 132.681835] release_nodes+0x16d/0x2b0
[ 132.682196] devres_release_all+0x3c/0x50
[ 132.682573] device_release_driver_internal+0x175/0x220
[ 132.683231] device_release_driver+0x12/0x20
[ 132.683715] bus_remove_device+0x100/0x180
[ 132.684102] device_del+0x1f4/0x330
[ 132.684428] platform_device_del+0x28/0x90
[ 132.684967] platform_device_unregister+0x12/0x30
[ 132.685412] nfit_test_exit+0x17/0x92f [nfit_test]
[ 132.685980] SyS_delete_module+0x1d8/0x230
[ 132.686369] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 132.686915] RIP: 0033:0x7f841012b317
[ 132.687255] RSP: 002b:7fffe5ce0898 EFLAGS: 0206 ORIG_RAX: 00b0
[ 132.688070] RAX: ffda RBX: 7f84103e4500 RCX: 7f841012b317
[ 132.688850] RDX: 7f84103e5730 RSI: 0800 RDI: 0258ac98
[ 132.689501] RBP: 7fffe5ce05b0 R08: 7f8410e19c80 R09: 0017
[ 132.690257] R10: 006d R11: 0206 R12: 0038
[ 132.690988] R13: 0001 R14: R15: fbad2887
[ 132.691735] lt-dsm-fail D 0 5995 5986 0x0004
[ 132.692246] Call Trace:
[ 132.692481] __schedule+0x411/0xb10
[ 132.692972] schedule+0x40/0x90
[ 132.693288] schedule_preempt_disabled+0x18/0x30
[ 132.694083] __mutex_lock+0x487/0xa20
[ 132.694720] ? acpi_nfit_flush_probe+0x3a/0x150 [nfit]
[ 132.695452] mutex_lock_nested+0x1b/0x20
[ 132.696245] ? mutex_lock_nested+0x1b/0x20
[ 132.696947] acpi_nfit_flush_probe+0x3a/0x150 [nfit]
[ 132.697750] ? kernfs_seq_start+0x2f/0x90
[ 132.698302] ? __mutex_lock+0x228/0xa20
[ 132.699077] ? lock_acquire+0xea/0x1f0
[ 132.699698] ? kernfs_seq_start+0x37/0x90
[ 132.700083] wait_probe_show+0x25/0x60 [libnvdimm]
[ 132.700529] dev_attr_show+0x20/0x50
[ 132.701022] ? sysfs_file_ops+0x46/0x60
[ 132.701392] sysfs_kf_seq_show+0xb2/0x110
[ 132.701910] kernfs_seq_show+0x27/0x30
[ 132.702271] seq_read+0x103/0x3d0
[ 132.702709] kernfs_fop_read+0x11e/0x190
[ 132.703082] __vfs_read+0x37/0x160
[ 132.703399] ? security_file_permission+0x9e/0xc0
[ 132.704000] vfs_read+0xab/0x150
[ 132.704312] SyS_read+0x58/0xc0
[ 132.704737] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 132.705295] RIP: 0033:0x7fc0be0d4a80
[ 132.705964] RSP: 002b:7fff3b5cfd08 EFLAGS: 0246 ORIG_RAX:
[ 132.707094] RAX: ffda RBX: 0004 RCX: 7fc0be0d4a80
[ 132.708154] RDX: 0400 RSI: 7fff3b5cfd80 RDI: 0004
[ 132.709206] RBP: 7fff3b5d02a0 R08: 01a3ec00 R09: 0035
[ 132.709968] R10: 0073 R11: 0246 R12: 00401620
[ 132.710707] R13: 7fff3b5d0cd0 R14: R15:
[ 132.711369] lt-parent-uuid D 0 5998 5989 0x0004
[ 132.711984] Call Trace:
[ 132.712229] __schedule+0x411/0xb10
[ 132.712565] schedule+0x40/0x90
[ 132.713004] schedule_preempt_disabled+0x18/0x30
[ 132.713443] __mutex_lock+0x487/0xa20
[ 132.713891] ? acpi_nfit_flush_probe+0x3a/0x150 [nfit]
[ 132.714378] mutex_lock_nested+0x1b/0x20
[ 132.714853] ? mutex_lock_nested+0x1b/0x20
[ 132.715239] acpi_nfit_flush_probe+0x3a/0x150 [nfit]
[ 132.715818] ? kernfs_seq_start+0x2f/0x90
[ 132.716205] ? __mutex_lock+0x228/0xa20
[ 132.716674] ? lock_acquire+0xea/0x1f0
[ 132.717035] ? kernfs_seq_start+0x37/0x90
[ 132.717412] wait_probe_show+0x25/0x60 [libnvdimm]
[ 132.718006] dev_attr_show+0x20/0x50
[ 132.718344] ? sysfs_file_ops+0x46/0x60
[ 132.718818] sysfs_kf_seq_show+0xb2/0x110
[ 132.719204]
Re: [PATCH v8 13/14] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
On Wed, Oct 11, 2017 at 4:54 AM, Joerg Roedel wrote:
> On Tue, Oct 10, 2017 at 07:50:12AM -0700, Dan Williams wrote:
>> +static void ib_umem_lease_break(void *__umem)
>> +{
>> +	struct ib_umem *umem = umem;
>> +	struct ib_device *idev = umem->context->device;
>> +	struct device *dev = idev->dma_device;
>> +	struct scatterlist *sgl = umem->sg_head.sgl;
>> +
>> +	iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK,
>> +			iommu_sg_num_pages(dev, sgl, umem->npages));
>> +}
>
> This looks like an invitation to break your code by random iommu-driver
> changes. There is no guarantee that an iommu-backed dma-api
> implementation will map exactly iommu_sg_num_pages() pages for a given
> sg-list. In other words, you are mixing the use of the IOMMU-API and the
> DMA-API in an incompatible way that only works because you know the
> internals of the iommu-drivers.
>
> I've seen in another patch that your changes strictly require an IOMMU,
> so what you should do instead is to switch from the DMA-API to the
> IOMMU-API and do the address-space management yourself.
>

Ok, I'll switch over completely to the iommu api for this. It will
also address Robin's concern.
Re: [PATCH] Fix mpage_writepage() for pages with buffers
On Tue, Oct 10, 2017 at 01:31:44PM -0700, Linus Torvalds wrote:
> On Tue, Oct 10, 2017 at 12:44 PM, Andrew Morton wrote:
> >
> > This is all pretty mature code (isn't it?). Any idea why this bug
> > popped up now?

I have no idea why it's suddenly popped up. It looks like it should be
a bohrbug, but it's actually a heisenbug, and I don't understand that
either.

> Also, while the patch looks sane, the
>
>     clean_buffers(page, PAGE_SIZE);
>
> line really threw me. That's an insane value to pick, it looks like
> "bytes in page", but it isn't. It's just a random value that is bigger
> than "PAGE_SIZE >> SECTOR_SHIFT".
>
> I'd prefer to see just ~0u if the intention is just "bigger than
> anything possible".

Actually, I did choose it to be "number of bytes in the page", based on
the reasoning that I didn't want to calculate what the actual block size
was, and the block size surely couldn't be any smaller than one byte.

I forgot about the SECTOR_SIZE limit on filesystem block size, so your
spelling of "big enough" does look better. Now that I think about it
some more, I suppose we might end up with a situation where we're
eventually passing a hugepage to this routine, and futureproofing it
with ~0U probably makes more sense.
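The "big enough" argument above can be checked with a little arithmetic. The sketch below is a simplified userspace model, not the kernel's clean_buffers(): buffers_cleaned() and max_buffers() are hypothetical names, and the loop assumes the real routine walks a page's buffers in order and stops when its counter reaches first_unmapped. The 4 MiB page is an invented size used only to show where a PAGE_SIZE sentinel would stop being large enough.

```c
#include <assert.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors: the smallest block size */

/* Upper bound on the number of buffers attached to a page of this size. */
unsigned int max_buffers(unsigned long page_bytes)
{
	return (unsigned int)(page_bytes >> SECTOR_SHIFT);
}

/*
 * Model of the walk: visit nr_buffers buffers in order and stop early once
 * the counter reaches first_unmapped. Returns how many buffers get cleaned.
 */
unsigned int buffers_cleaned(unsigned int nr_buffers,
			     unsigned int first_unmapped)
{
	unsigned int counter = 0, cleaned = 0;

	while (cleaned < nr_buffers) {
		if (counter++ == first_unmapped)
			break;
		cleaned++;
	}
	return cleaned;
}
```

With a 4096-byte page there are at most 8 buffers, so a sentinel of 4096 cleans all of them. With the hypothetical 4 MiB page there could be 8192 buffers, a sentinel of 4096 would stop halfway, and ~0U still covers every buffer.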
Re: [PATCH v8 01/14] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
On Wed, Oct 11, 2017 at 12:43 AM, Jan Kara wrote:
> On Tue 10-10-17 07:49:01, Dan Williams wrote:
>> The mmap(2) syscall suffers from the ABI anti-pattern of not validating
>> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
>> mechanism to define new behavior that is known to fail on older kernels
>> without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
>> is guaranteed to fail on all legacy mmap implementations.
>>
>> It is worth noting that the original proposal was for a standalone
>> MAP_VALIDATE flag. However, when that could not be supported by all
>> archs Linus observed:
>>
>>     I see why you *think* you want a bitmap. You think you want
>>     a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
>>     etc, so that people can do
>>
>>     ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
>>                     | MAP_SYNC, fd, 0);
>>
>>     and "know" that MAP_SYNC actually takes.
>>
>>     And I'm saying that whole wish is bogus. You're fundamentally
>>     depending on special semantics, just make it explicit. It's already
>>     not portable, so don't try to make it so.
>>
>>     Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
>>     of 0x3, and make people do
>>
>>     ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
>>                     | MAP_SYNC, fd, 0);
>>
>>     and then the kernel side is easier too (none of that random garbage
>>     playing games with looking at the "MAP_VALIDATE bit", but just another
>>     case statement in that map type thing.
>>
>>     Boom. Done.
>>
>> Similar to ->fallocate() we also want the ability to validate the
>> support for new flags on a per ->mmap() 'struct file_operations'
>> instance basis. Towards that end arrange for flags to be generically
>> validated against a mmap_supported_mask exported by 'struct
>> file_operations'. By default all existing flags are implicitly
>> supported, but new flags require MAP_SHARED_VALIDATE and
>> per-instance-opt-in.
>>
>> Cc: Jan Kara
>> Cc: Arnd Bergmann
>> Cc: Andy Lutomirski
>> Cc: Andrew Morton
>> Suggested-by: Christoph Hellwig
>> Suggested-by: Linus Torvalds
>> Signed-off-by: Dan Williams
>> ---
>>  arch/alpha/include/uapi/asm/mman.h           |    1 +
>>  arch/mips/include/uapi/asm/mman.h            |    1 +
>>  arch/mips/kernel/vdso.c                      |    2 +
>>  arch/parisc/include/uapi/asm/mman.h          |    1 +
>>  arch/tile/mm/elf.c                           |    3 +-
>>  arch/xtensa/include/uapi/asm/mman.h          |    1 +
>>  include/linux/fs.h                           |    2 +
>>  include/linux/mm.h                           |    2 +
>>  include/linux/mman.h                         |   39 ++
>>  include/uapi/asm-generic/mman-common.h       |    1 +
>>  mm/mmap.c                                    |   21 --
>>  tools/include/uapi/asm-generic/mman-common.h |    1 +
>>  12 files changed, 69 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
>> index 3b26cc62dadb..92823f24890b 100644
>> --- a/arch/alpha/include/uapi/asm/mman.h
>> +++ b/arch/alpha/include/uapi/asm/mman.h
>> @@ -14,6 +14,7 @@
>>  #define MAP_TYPE	0x0f	/* Mask for type of mapping (OSF/1 is _wrong_) */
>>  #define MAP_FIXED	0x100	/* Interpret addr exactly */
>>  #define MAP_ANONYMOUS	0x10	/* don't use a file */
>> +#define MAP_SHARED_VALIDATE 0x3	/* share + validate extension flags */
>
> Just a nit but I'd put definition of MAP_SHARED_VALIDATE close to the
> definition of MAP_SHARED and MAP_PRIVATE where it logically belongs (for
> all archs).

Will do.
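For readers following the flag-validation argument quoted above, here is a small userspace model in C of the two behaviours. It is a sketch under stated assumptions, not the patch's code: every MODEL_* name and both function names are hypothetical, the legacy-flag mask and the bit chosen for the new flag are invented for illustration, and the error codes (-EINVAL for an unknown map type, -EOPNOTSUPP for an unsupported new flag) are assumptions. Only the value 0x3 for the new map type comes from the patch.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_MAP_SHARED		0x01
#define MODEL_MAP_PRIVATE		0x02
#define MODEL_MAP_SHARED_VALIDATE	0x03		/* value from the patch */
#define MODEL_MAP_TYPE			0x0f
#define MODEL_LEGACY_FLAGS		0x0000fff0UL	/* hypothetical "existing flags" */
#define MODEL_MAP_SYNC			0x00080000UL	/* hypothetical new flag bit */

/* A legacy kernel knows only two map types, so type 0x3 always fails. */
int model_legacy_mmap_type(unsigned long flags)
{
	switch (flags & MODEL_MAP_TYPE) {
	case MODEL_MAP_SHARED:
	case MODEL_MAP_PRIVATE:
		return 0;	/* unknown extra flags are silently ignored */
	default:
		return -EINVAL;
	}
}

/*
 * A kernel with the patch: plain MAP_SHARED keeps ignoring unknown flags,
 * while MAP_SHARED_VALIDATE rejects any new flag that the file's
 * supported mask has not opted into.
 */
int model_validate_mmap_flags(unsigned long flags, unsigned long supported_mask)
{
	unsigned long extra = flags & ~(MODEL_MAP_TYPE | MODEL_LEGACY_FLAGS);

	switch (flags & MODEL_MAP_TYPE) {
	case MODEL_MAP_SHARED:
	case MODEL_MAP_PRIVATE:
		return 0;
	case MODEL_MAP_SHARED_VALIDATE:
		return (extra & ~supported_mask) ? -EOPNOTSUPP : 0;
	default:
		return -EINVAL;
	}
}
```

The point of the model is that a caller passing MODEL_MAP_SHARED_VALIDATE | MODEL_MAP_SYNC gets an error in every case where the new flag would not take effect, which is what plain MAP_SHARED cannot promise.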
>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f8c10d336e42..5c4c98e4adc9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
>>
>>  extern unsigned long mmap_region(struct file *file, unsigned long addr,
>>  	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>> -	struct list_head *uf);
>> +	struct list_head *uf, unsigned long map_flags);
>>  extern unsigned long do_mmap(struct file *file, unsigned long addr,
>>  	unsigned long len, unsigned long prot, unsigned long flags,
>>  	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
>
> I have to say I'm not very keen on passing down both vm_flags and map_flags
> - vm_flags are almost a subset of map_flags but not quite and the ambiguity
> which needs to be used for a particular check seems to open a space for
> errors. Granted you currently only care about MAP_DIRECT in ->mmap_validate
> and just pass map_flags
Re: ffsb job does not exit on xfs 4.14-rc1+
On Wed, Oct 11, 2017 at 6:54 AM, Xiong Zhou wrote:
> On Mon, Sep 25, 2017 at 10:49:03AM +0200, Carlos Maiolino wrote:
>> On Mon, Sep 25, 2017 at 01:40:06AM +, Xiong Zhou wrote:
>> > Hi,
>> >
>> > ffsb test won't exit like this on Linus tree 4.14-rc1+.
>> > Latest commit cd4175b11685
>>
>> Can you provide more information? Do you have any kernel log from this issue?
>> dmesg, Oopses, traces, etc.
>> Storage configuration might also be required here.
>
> Turns out this only reproduces on nvdimm devices, xfs without dax
> mount option. More logs are attached.
>
>> have you also tried to reproduce it with another filesystem? If so, is the
>> same problem reproducible with another filesystem or only with XFS?
>
> Only xfs. Test on ext4 ends shortly.
>
>> P.S. please avoid sending it to all lists (mainly LKML).
>
> Why? I thought LKML was better archived.
>
>> Unless it's a more generic kernel problem, keep it in fsdevel and/or the
>> respective filesystem list if it's related to a single filesystem only.
>
> 4.14-rc1+ won't survive this script, while 4.13 can.

Can you try a git bisect?
Re: [PATCH v8 13/14] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
On Tue, Oct 10, 2017 at 07:50:12AM -0700, Dan Williams wrote:
> +static void ib_umem_lease_break(void *__umem)
> +{
> +	struct ib_umem *umem = umem;
> +	struct ib_device *idev = umem->context->device;
> +	struct device *dev = idev->dma_device;
> +	struct scatterlist *sgl = umem->sg_head.sgl;
> +
> +	iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK,
> +			iommu_sg_num_pages(dev, sgl, umem->npages));
> +}

This looks like an invitation to break your code by random iommu-driver
changes. There is no guarantee that an iommu-backed dma-api
implementation will map exactly iommu_sg_num_pages() pages for a given
sg-list. In other words, you are mixing the use of the IOMMU-API and the
DMA-API in an incompatible way that only works because you know the
internals of the iommu-drivers.

I've seen in another patch that your changes strictly require an IOMMU,
so what you should do instead is to switch from the DMA-API to the
IOMMU-API and do the address-space management yourself.

Regards,

	Joerg
Fwd: Geometric Dimensioning and Tolerancing
To: linux-nvdimm@lists.01.org

For the detailed course schedule and registration information, please see the attachment.
Re: [PATCH v8 01/14] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
On Tue 10-10-17 07:49:01, Dan Williams wrote:
> The mmap(2) syscall suffers from the ABI anti-pattern of not validating
> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
> mechanism to define new behavior that is known to fail on older kernels
> without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
> is guaranteed to fail on all legacy mmap implementations.
>
> It is worth noting that the original proposal was for a standalone
> MAP_VALIDATE flag. However, when that could not be supported by all
> archs Linus observed:
>
>     I see why you *think* you want a bitmap. You think you want
>     a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
>     etc, so that people can do
>
>     ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
>                     | MAP_SYNC, fd, 0);
>
>     and "know" that MAP_SYNC actually takes.
>
>     And I'm saying that whole wish is bogus. You're fundamentally
>     depending on special semantics, just make it explicit. It's already
>     not portable, so don't try to make it so.
>
>     Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
>     of 0x3, and make people do
>
>     ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
>                     | MAP_SYNC, fd, 0);
>
>     and then the kernel side is easier too (none of that random garbage
>     playing games with looking at the "MAP_VALIDATE bit", but just another
>     case statement in that map type thing.
>
>     Boom. Done.
>
> Similar to ->fallocate() we also want the ability to validate the
> support for new flags on a per ->mmap() 'struct file_operations'
> instance basis. Towards that end arrange for flags to be generically
> validated against a mmap_supported_mask exported by 'struct
> file_operations'. By default all existing flags are implicitly
> supported, but new flags require MAP_SHARED_VALIDATE and
> per-instance-opt-in.
>
> Cc: Jan Kara
> Cc: Arnd Bergmann
> Cc: Andy Lutomirski
> Cc: Andrew Morton
> Suggested-by: Christoph Hellwig
> Suggested-by: Linus Torvalds
> Signed-off-by: Dan Williams
> ---
>  arch/alpha/include/uapi/asm/mman.h           |    1 +
>  arch/mips/include/uapi/asm/mman.h            |    1 +
>  arch/mips/kernel/vdso.c                      |    2 +
>  arch/parisc/include/uapi/asm/mman.h          |    1 +
>  arch/tile/mm/elf.c                           |    3 +-
>  arch/xtensa/include/uapi/asm/mman.h          |    1 +
>  include/linux/fs.h                           |    2 +
>  include/linux/mm.h                           |    2 +
>  include/linux/mman.h                         |   39 ++
>  include/uapi/asm-generic/mman-common.h       |    1 +
>  mm/mmap.c                                    |   21 --
>  tools/include/uapi/asm-generic/mman-common.h |    1 +
>  12 files changed, 69 insertions(+), 6 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 3b26cc62dadb..92823f24890b 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -14,6 +14,7 @@
>  #define MAP_TYPE	0x0f	/* Mask for type of mapping (OSF/1 is _wrong_) */
>  #define MAP_FIXED	0x100	/* Interpret addr exactly */
>  #define MAP_ANONYMOUS	0x10	/* don't use a file */
> +#define MAP_SHARED_VALIDATE 0x3	/* share + validate extension flags */

Just a nit but I'd put definition of MAP_SHARED_VALIDATE close to the
definition of MAP_SHARED and MAP_PRIVATE where it logically belongs (for
all archs).

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f8c10d336e42..5c4c98e4adc9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
>
>  extern unsigned long mmap_region(struct file *file, unsigned long addr,
>  	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> -	struct list_head *uf);
> +	struct list_head *uf, unsigned long map_flags);
>  extern unsigned long do_mmap(struct file *file, unsigned long addr,
>  	unsigned long len, unsigned long prot, unsigned long flags,
>  	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,

I have to say I'm not very keen on passing down both vm_flags and map_flags
- vm_flags are almost a subset of map_flags but not quite and the ambiguity
which needs to be used for a particular check seems to open a space for
errors. Granted you currently only care about MAP_DIRECT in ->mmap_validate
and just pass map_flags through mmap_region() so there's no space for
confusion but future checks could do something different. But OTOH I don't
see a cleaner way of avoiding the need to allocate