Re: [ndctl PATCH v3 1/4] libndctl: test enablement for non-nfit devices
"Verma, Vishal L" writes: Hi Vishal, > On Thu, 2021-03-11 at 13:16 +0530, Santosh Sivaraj wrote: >> Unify adding dimms for papr and nfit families, this will help in adding > > Minor nit, but it seems like the subject line and the first sentence in > the body should be swapped. The one-line description of what's happening > in this patch is "Unify adding dimms for papr and nfit families", and > the body can go into more detail about why - i.e. in preparation for > enabling tests on non-nfit devices. > Agree. >> all attributes needed for the unit tests too. We don't fail adding a dimm >> if some of the dimm attributes are missing, so this will work fine on PAPR >> platforms where most dimm attributes are provided. > > Does this mean 'most - but not all'? Might be a good clarification to > make here. Yes, that's right. I will fix that up in the next version. Thanks, Santosh > >> >> Signed-off-by: Santosh Sivaraj >> --- >> ndctl/lib/libndctl.c | 103 --- >> 1 file changed, 38 insertions(+), 65 deletions(-) >> >> v3: >> * Drop patch which skips SMART tests, smart test enablement will be posted >> soon. >> >> v2: >> * Patch 2: Fix a bug, I skip erroring out if PAPR family, but condition had >> INTEL family instead. That change was there to test the same code on x86, >> but >> accidently committed. Now have a environment variable to force test PAPR >> family on x86. >> >> * Patch 4: Remove stray code, artifact of refactoring in patch 1. >> >> diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c >> index 36fb6fe..26b9317 100644 >> --- a/ndctl/lib/libndctl.c >> +++ b/ndctl/lib/libndctl.c >> @@ -1646,41 +1646,9 @@ static int ndctl_bind(struct ndctl_ctx *ctx, struct >> kmod_module *module, >> static int ndctl_unbind(struct ndctl_ctx *ctx, const char *devpath); >> static struct kmod_module *to_module(struct ndctl_ctx *ctx, const char >> *alias); >> >> -static int add_papr_dimm(struct ndctl_dimm *dimm, const char *dimm_base) >> -{ >> -int rc = -ENODEV; >> -char buf[SYSFS_ATTR_SIZE]; >> -struct ndctl_ctx *ctx = dimm->bus->ctx; >> -char *path = calloc(1, strlen(dimm_base) + 100); >> -const char * const devname = ndctl_dimm_get_devname(dimm); >> - >> -dbg(ctx, "%s: Probing of_pmem dimm at %s\n", devname, dimm_base); >> - >> -if (!path) >> -return -ENOMEM; >> - >> -/* construct path to the papr compatible dimm flags file */ >> -sprintf(path, "%s/papr/flags", dimm_base); >> - >> -if (ndctl_bus_is_papr_scm(dimm->bus) && >> -sysfs_read_attr(ctx, path, buf) == 0) { >> - >> -dbg(ctx, "%s: Adding papr-scm dimm flags:\"%s\"\n", devname, >> buf); >> -dimm->cmd_family = NVDIMM_FAMILY_PAPR; >> - >> -/* Parse dimm flags */ >> -parse_papr_flags(dimm, buf); >> - >> -/* Allocate monitor mode fd */ >> -dimm->health_eventfd = open(path, O_RDONLY|O_CLOEXEC); >> -rc = 0; >> -} >> - >> -free(path); >> -return rc; >> -} >> - >> -static int add_nfit_dimm(struct ndctl_dimm *dimm, const char *dimm_base) >> +static int populate_dimm_attributes(struct ndctl_dimm *dimm, >> +const char *dimm_base, >> +const char *bus_prefix) >> { >> int i, rc = -1; >> char buf[SYSFS_ATTR_SIZE]; >> @@ -1694,7 +1662,7 @@ static int add_nfit_dimm(struct ndctl_dimm *dimm, >> const char *dimm_base) >> * 'unique_id' may not be available on older kernels, so don't >> * fail if the read fails. 
>> */ >> -sprintf(path, "%s/nfit/id", dimm_base); >> +sprintf(path, "%s/%s/id", dimm_base, bus_prefix); >> if (sysfs_read_attr(ctx, path, buf) == 0) { >> unsigned int b[9]; >> >> @@ -1709,68 +1677,74 @@ static int add_nfit_dimm(struct ndctl_dimm *dimm, >> const char *dimm_base) >> } >> } >> >> -sprintf(path, "%s/nfit/handle", dimm_base); >> +sprintf(path, "%s/%s/handle", dimm_base, bus_prefix); >> if (sysfs_read_attr(ctx, path, buf) < 0) >> goto err_read; >> dimm->handle = strtoul(buf, NULL, 0); >> >> -sprintf(path, "%s/nfit/phys_id", dimm_base); >> +sprintf(path, "%s/%s/phys_id", dimm_base, bus_prefix); >> if (sysfs_read_attr(ctx, path, buf) < 0) >> goto err_read; >> dimm->phys_id = strtoul(buf, NULL, 0); >> >> -sprintf(path, "%s/nfit/serial", dimm_base); >> +sprintf(path, "%s/%s/serial", dimm_base, bus_prefix); >> if (sysfs_read_attr(ctx, path, buf) == 0) >> dimm->serial = strtoul(buf, NULL, 0); >> >> -sprintf(path, "%s/nfit/vendor", dimm_base); >> +sprintf(path, "%s/%s/vendor", dimm_base, bus_prefix); >> if (sysfs_read_attr(ctx, path, buf) == 0) >> dimm->vendor_id = strtoul(buf, NULL, 0); >> >> -sprintf(path, "%s/nfit/device", dimm_base); >> +sprintf(path, "%s/%s/device", dimm_base, bus_pre
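For orientation, a rough sketch of how the unified helper is meant to be selected per bus family. Only populate_dimm_attributes() and ndctl_bus_is_papr_scm() appear in the patch above; the wrapper name and the use of the existing libndctl helper ndctl_bus_has_nfit() for the nfit branch are illustrative assumptions, not the committed code:

	/* illustrative only: pick the sysfs sub-directory prefix by bus family
	 * and reuse the same attribute-population helper for both */
	static int add_dimm_attributes(struct ndctl_dimm *dimm, const char *dimm_base)
	{
		if (ndctl_bus_has_nfit(dimm->bus))
			return populate_dimm_attributes(dimm, dimm_base, "nfit");
		else if (ndctl_bus_is_papr_scm(dimm->bus))
			return populate_dimm_attributes(dimm, dimm_base, "papr");

		/* unknown family: leave the optional attributes unset */
		return 0;
	}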
Re: [ndctl PATCH v3 2/4] test: Don't skip tests if nfit modules are missing
"Verma, Vishal L" writes: > On Thu, 2021-03-11 at 13:16 +0530, Santosh Sivaraj wrote: >> For NFIT to be available ACPI is a must, so don't fail when nfit modules >> are missing on a platform that doesn't support ACPI. >> >> Signed-off-by: Santosh Sivaraj >> --- >> test.h| 2 +- >> test/ack-shutdown-count-set.c | 2 +- >> test/blk_namespaces.c | 2 +- >> test/core.c | 23 +-- >> test/dpa-alloc.c | 2 +- >> test/dsm-fail.c | 2 +- >> test/libndctl.c | 2 +- >> test/multi-pmem.c | 2 +- >> test/parent-uuid.c| 2 +- >> test/pmem_namespaces.c| 2 +- >> 10 files changed, 30 insertions(+), 11 deletions(-) >> >> diff --git a/test.h b/test.h >> index cba8d41..7de13fe 100644 >> --- a/test.h >> +++ b/test.h >> @@ -20,7 +20,7 @@ void builtin_xaction_namespace_reset(void); >> >> >> struct kmod_ctx; >> struct kmod_module; >> -int nfit_test_init(struct kmod_ctx **ctx, struct kmod_module **mod, >> +int ndctl_test_init(struct kmod_ctx **ctx, struct kmod_module **mod, >> struct ndctl_ctx *nd_ctx, int log_level, >> struct ndctl_test *test); >> >> >> diff --git a/test/ack-shutdown-count-set.c b/test/ack-shutdown-count-set.c >> index fb1d82b..c561ff3 100644 >> --- a/test/ack-shutdown-count-set.c >> +++ b/test/ack-shutdown-count-set.c >> @@ -99,7 +99,7 @@ static int test_ack_shutdown_count_set(int loglevel, >> struct ndctl_test *test, >> int result = EXIT_FAILURE, err; >> >> >> ndctl_set_log_priority(ctx, loglevel); >> -err = nfit_test_init(&kmod_ctx, &mod, NULL, loglevel, test); >> +err = ndctl_test_init(&kmod_ctx, &mod, NULL, loglevel, test); >> if (err < 0) { >> result = 77; >> ndctl_test_skip(test); >> diff --git a/test/blk_namespaces.c b/test/blk_namespaces.c >> index d7f00cb..f076e85 100644 >> --- a/test/blk_namespaces.c >> +++ b/test/blk_namespaces.c >> @@ -228,7 +228,7 @@ int test_blk_namespaces(int log_level, struct ndctl_test >> *test, >> >> >> if (!bus) { >> fprintf(stderr, "ACPI.NFIT unavailable falling back to >> nfit_test\n"); >> -rc = nfit_test_init(&kmod_ctx, &mod, NULL, log_level, test); >> +rc = ndctl_test_init(&kmod_ctx, &mod, NULL, log_level, test); >> ndctl_invalidate(ctx); >> bus = ndctl_bus_get_by_provider(ctx, "nfit_test.0"); >> if (rc < 0 || !bus) { >> diff --git a/test/core.c b/test/core.c >> index cc7d8d9..903034a 100644 >> --- a/test/core.c >> +++ b/test/core.c >> @@ -11,6 +11,7 @@ >> #include >> #include >> #include >> +#include >> #include >> >> >> #define KVER_STRLEN 20 >> @@ -106,11 +107,11 @@ int ndctl_test_get_skipped(struct ndctl_test *test) >> return test->skip; >> } >> >> >> -int nfit_test_init(struct kmod_ctx **ctx, struct kmod_module **mod, >> +int ndctl_test_init(struct kmod_ctx **ctx, struct kmod_module **mod, >> struct ndctl_ctx *nd_ctx, int log_level, >> struct ndctl_test *test) >> { >> -int rc; >> +int rc, family = NVDIMM_FAMILY_INTEL; >> unsigned int i; >> const char *name; >> struct ndctl_bus *bus; >> @@ -127,6 +128,19 @@ int nfit_test_init(struct kmod_ctx **ctx, struct >> kmod_module **mod, >> "nd_e820", >> "nd_pmem", >> }; >> +char *test_env; >> + >> +/* Do we want to force test PAPR? */ >> +test_env = getenv("NDCTL_TEST_FAMILY"); >> +if (test_env && strcmp(test_env, "PAPR") == 0) >> +family = NVDIMM_FAMILY_PAPR; >> + >> +/* ACPI is a must for nfit, so if ACPI is not available let's default to >> + * PAPR */ > > fix multi line comment to the right formatting: > /* > * line 1, etc > */ > Will fix that. 
>> +if (access("/sys/bus/acpi", F_OK) == -1) { >> +if (errno == ENOENT) >> +family = NVDIMM_FAMILY_PAPR; >> +} > > Instead of a blind default, can we perform a similar check for presence of > PAPR too? > Yes, I wanted to do that, but there is no reliable way of check that; there is no ofnode before module load, and there won't be any PAPR specific DT entries if the platform is not Power. I also test the 'ndtest' module on x86 with NDCTL_TEST_FAMILY environment variable. I can let the default be nfit_test (NVDIMM_FAMILY_INTEL) and only load PAPR module when the environment variable is set. Thoughts? Thanks, Santosh >> >> >> log_init(&log_ctx, "test/init", "NDCTL_TEST"); >> log_ctx.log_priority = log_level; >> @@ -185,6 +199,11 @@ retry: >> >> >> path = kmod_module_get_path(*mod); >> if (!path) { >> +if (family == NVDIMM_FAMILY_PAPR && >> +(strcmp(name, "nfit") == 0 || >> + strcmp(name, "nd_e820") == 0)) >> +continue; >> + >>
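A minimal sketch of the default Santosh proposes in the reply above: keep the nfit_test / Intel family unless the environment variable explicitly asks for PAPR. The helper name is made up for illustration; getenv()/strcmp() and the NVDIMM_FAMILY_* constants are already available in test/core.c per the patch:

	static int ndctl_test_choose_family(void)
	{
		const char *test_env = getenv("NDCTL_TEST_FAMILY");

		/* opt in to the PAPR/ndtest path explicitly ... */
		if (test_env && strcmp(test_env, "PAPR") == 0)
			return NVDIMM_FAMILY_PAPR;

		/* ... otherwise keep the existing nfit_test default */
		return NVDIMM_FAMILY_INTEL;
	}

With that default, setting NDCTL_TEST_FAMILY=PAPR in the environment is what exercises the ndtest-backed path, including on x86.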
RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
> -Original Message- > From: ruansy.f...@fujitsu.com > Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure() > > > > > > > > > > > > After the conversation with Dave I don't see the point of this. > > > > > > If there is a memory_failure() on a page, why not just call > > > > > > memory_failure()? That already knows how to find the inode and > > > > > > the filesystem can be notified from there. > > > > > > > > > > We want memory_failure() to support reflinked files. In this > > > > > case, we are not able to track multiple files from a page (this > > > > > broken page) because > > > > > page->mapping, page->index can only track one file. Thus, I > > > > > introduce this > > > > > ->memory_failure(), implemented in the pmem driver, to call > > > > > ->corrupted_range() > > > > > from one level up to the next, and finally find out the files that are > > > > > using (mmapping) this page. > > > > > > > > > > > > > I know the motivation, but this implementation seems backwards. > > > > It's already the case that memory_failure() looks up the > > > > address_space associated with a mapping. From there I would expect > > > > a new 'struct address_space_operations' op to let the fs handle > > > > the case when there are multiple address_spaces associated with a given file. > > > > > > > > > > Let me think about it. In this way, we > > > 1. associate the file mapping with the dax page in the dax page fault; > > > > I think this needs to be a new type of association that proxies the > > representation of the reflink across all involved address_spaces. > > > > > 2. iterate the reflinked files to notify the `kill processes signal` via the > > > new address_space_operation; > > > 3. re-associate to another reflinked file mapping when unmapping > > > (rmap query in the filesystem to get the other file). > > > > Perhaps the proxy object is reference counted per-ref-link. It seems > > error prone to keep changing the association of the pfn while the reflink is intact. > Hi, Dan > > I think my early rfc patchset was implemented in this way: > - Create a per-page 'dax-rmap tree' to store each reflinked file's (mapping, > offset) when handling the dax page fault. > - Mount this tree on page->zone_device_data, which is not used in fsdax, so > that we can iterate reflinked file mappings in memory_failure() easily. > In my understanding, the dax-rmap tree is the proxy object you mentioned. If > so, I have to say, this method was rejected, because it causes huge > overhead in the case where every dax page has its own dax-rmap tree. > Hi, Dan What do you think about this? I am still confused. Could you give me some advice? -- Thanks, Ruan Shiyang. > > -- > Thanks, > Ruan Shiyang. > > > > > It did not handle those dax pages that are not in use, because their > > > ->mapping is not associated with any file. I didn't think it through > > > until reading your conversation. Here is my understanding: this > > > case should be handled by the badblock mechanism in the pmem driver. This > > > badblock mechanism will call > > > ->corrupted_range() to tell the filesystem to repair the data if possible. > > > > There are 2 types of notifications. There are badblocks discovered by > > the driver (see notify_pmem()) and there are memory_failures() > > signalled by the CPU machine-check handler, or the platform BIOS. In > > the case of badblocks, that is information that needs to be considered by the > > fs block allocator to avoid / try-to-repair badblocks on allocate, and > > to allow listing damaged files that need repair.
The memory_failure() > > notification needs immediate handling to tear down mappings to that > > pfn and signal processes that have consumed it with > > SIGBUS-action-required. Processes that have the poison mapped, but have not > consumed it receive SIGBUS-action-optional. > > > > > So, we split it into two parts. And dax device and block device > > > won't be > > mixed > > > up again. Is my understanding right? > > > > Right, it's only the filesystem that knows that the block_device and > > the dax_device alias data at the same logical offset. The requirements > > for sector error handling and page error handling are separate like > > block_device_operations and dax_operations. > > > > > But the solution above is to solve the hwpoison on one or couple > > > pages, which happens rarely(I think). Do the 'pmem remove' > > > operation > > cause hwpoison too? > > > Call memory_failure() so many times? I havn't understood this yet. > > > > I'm working on a patch here to call memory_failure() on a wide range > > for the surprise remove of a dax_device while a filesystem might be > > mounted. It won't be efficient, but there is no other way to notify > > the kernel that it needs to immediately stop referencing a page. > ___ > Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an > email to linux-nvdimm-le...@lists.01.org > ___
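For context, the interface this thread is debating is a pgmap-level callback that the pmem driver would implement and forward up the stack via ->corrupted_range(). A rough sketch of its shape; the argument list and names here are assumptions for illustration, not the posted patch:

	struct dev_pagemap_ops {
		/* ... existing ops ... */

		/*
		 * Called by memory_failure() when poison lands in this pgmap,
		 * so the driver can translate the pfn range and notify the
		 * layers above it (e.g. the filesystem) instead of relying on
		 * the generic page-by-page handling.
		 */
		int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
				      unsigned long nr_pages, int flags);
	};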
[PATCH v3 10/10] fs/xfs: Add dedupe support for fsdax
Add xfs_break_two_dax_layouts() to break the layouts of two dax files. Then call the compare-range function only when both files are DAX, or both are not. Signed-off-by: Shiyang Ruan --- fs/xfs/xfs_file.c| 20 fs/xfs/xfs_inode.c | 8 +++- fs/xfs/xfs_inode.h | 1 + fs/xfs/xfs_reflink.c | 5 +++-- 4 files changed, 31 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 1987d15eab61..82467d08e3ce 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -784,6 +784,26 @@ xfs_break_dax_layouts( 0, 0, xfs_wait_dax_page(inode)); } +int +xfs_break_two_dax_layouts( + struct inode*src, + struct inode*dest) +{ + int error; + boolretry = false; + +retry: + error = xfs_break_dax_layouts(src, &retry); + if (error || retry) + goto retry; + + error = xfs_break_dax_layouts(dest, &retry); + if (error || retry) + goto retry; + + return error; +} + int xfs_break_layouts( struct inode*inode, diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index b7352bc4c815..c11b11e59a83 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3651,8 +3651,10 @@ xfs_ilock2_io_mmap( struct xfs_inode*ip2) { int ret; + struct inode*inode1 = VFS_I(ip1); + struct inode*inode2 = VFS_I(ip2); - ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2)); + ret = xfs_iolock_two_inodes_and_break_layout(inode1, inode2); if (ret) return ret; if (ip1 == ip2) @@ -3660,6 +3662,10 @@ xfs_ilock2_io_mmap( else xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL, ip2, XFS_MMAPLOCK_EXCL); + + if (IS_DAX(inode1) && IS_DAX(inode2)) + ret = xfs_break_two_dax_layouts(inode1, inode2); + return 0; } diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index eca333f5f715..9ed7a2895602 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -431,6 +431,7 @@ enum xfs_prealloc_flags { intxfs_update_prealloc_flags(struct xfs_inode *ip, enum xfs_prealloc_flags flags); +intxfs_break_two_dax_layouts(struct inode *inode1, struct inode *inode2); intxfs_break_layouts(struct inode *inode, uint *iolock, enum layout_break_reason reason); diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 73c556c2ff76..e62b00c2a0c8 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -29,6 +29,7 @@ #include "xfs_iomap.h" #include "xfs_sb.h" #include "xfs_ag_resv.h" +#include /* * Copy on Write of Shared Blocks @@ -1303,8 +1304,8 @@ xfs_reflink_remap_prep( if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest)) goto out_unlock; - /* Don't share DAX file data for now. */ - if (IS_DAX(inode_in) || IS_DAX(inode_out)) + /* Don't share DAX file data with non-DAX file. */ + if (IS_DAX(inode_in) != IS_DAX(inode_out)) goto out_unlock; if (IS_DAX(inode_in)) -- 2.30.1 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH v3 09/10] fs/xfs: Handle CoW for fsdax write() path
In fsdax mode, WRITE and ZERO on a shared extent need CoW performed. After CoW, new allocated extents needs to be remapped to the file. So, add an iomap_end for dax write ops to do the remapping work. Signed-off-by: Shiyang Ruan --- fs/xfs/xfs_bmap_util.c | 3 +-- fs/xfs/xfs_file.c | 9 +++ fs/xfs/xfs_iomap.c | 58 +- fs/xfs/xfs_iomap.h | 4 +++ fs/xfs/xfs_iops.c | 7 +++-- fs/xfs/xfs_reflink.c | 3 +-- 6 files changed, 69 insertions(+), 15 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 7371a7f7c652..809013de9915 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -976,8 +976,7 @@ xfs_free_file_space( return 0; if (offset + len > XFS_ISIZE(ip)) len = XFS_ISIZE(ip) - offset; - error = iomap_zero_range(VFS_I(ip), offset, len, NULL, - &xfs_buffered_write_iomap_ops); + error = xfs_iomap_zero_range(VFS_I(ip), offset, len, NULL); if (error) return error; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5b0f93f73837..1987d15eab61 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -623,11 +623,8 @@ xfs_file_dax_write( count = iov_iter_count(from); trace_xfs_file_dax_write(ip, count, pos); - ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops); - if (ret > 0 && iocb->ki_pos > i_size_read(inode)) { - i_size_write(inode, iocb->ki_pos); - error = xfs_setfilesize(ip, pos, ret); - } + ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops); + out: xfs_iunlock(ip, iolock); if (error) @@ -1250,7 +1247,7 @@ __xfs_filemap_fault( ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, (write_fault && !vmf->cow_page) ? -&xfs_direct_write_iomap_ops : +&xfs_dax_write_iomap_ops : &xfs_read_iomap_ops); if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(vmf, pe_size, pfn); diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 7b9ff824e82d..0afae5dbbce1 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -771,7 +771,8 @@ xfs_direct_write_iomap_begin( /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared, - &lockmode, flags & IOMAP_DIRECT); + &lockmode, + flags & IOMAP_DIRECT || IS_DAX(inode)); if (error) goto out_unlock; if (shared) @@ -850,6 +851,38 @@ const struct iomap_ops xfs_direct_write_iomap_ops = { .iomap_begin= xfs_direct_write_iomap_begin, }; +static int +xfs_dax_write_iomap_end( + struct inode*inode, + loff_t pos, + loff_t length, + ssize_t written, + unsigned intflags, + struct iomap*iomap) +{ + int error = 0; + xfs_inode_t *ip = XFS_I(inode); + boolcow = xfs_is_cow_inode(ip); + + if (pos + written > i_size_read(inode)) { + i_size_write(inode, pos + written); + error = xfs_setfilesize(ip, pos, written); + if (error && cow) { + xfs_reflink_cancel_cow_range(ip, pos, written, true); + return error; + } + } + if (cow) + error = xfs_reflink_end_cow(ip, pos, written); + + return error; +} + +const struct iomap_ops xfs_dax_write_iomap_ops = { + .iomap_begin= xfs_direct_write_iomap_begin, + .iomap_end = xfs_dax_write_iomap_end, +}; + static int xfs_buffered_write_iomap_begin( struct inode*inode, @@ -1308,3 +1341,26 @@ xfs_xattr_iomap_begin( const struct iomap_ops xfs_xattr_iomap_ops = { .iomap_begin= xfs_xattr_iomap_begin, }; + +int +xfs_iomap_zero_range( + struct inode*inode, + loff_t offset, + loff_t len, + bool*did_zero) +{ + return iomap_zero_range(inode, offset, len, did_zero, + IS_DAX(inode) ? 
&xfs_dax_write_iomap_ops : + &xfs_buffered_write_iomap_ops); +} + +int +xfs_iomap_truncate_page( + struct inode*inode, + loff_t pos, + bool*did_zero) +{ + return iomap_truncate_page(inode, pos, did_zero, + IS_DAX(inode) ? &xfs_dax_write_iomap_ops : + &xfs_buffered_write_iomap_ops);
[PATCH v3 08/10] fsdax: Dedup file range to use a compare function
With dax we cannot deal with readpage() etc. So, we create a dax comparison funciton which is similar with vfs_dedupe_file_range_compare(). And introduce dax_remap_file_range_prep() for filesystem use. Signed-off-by: Goldwyn Rodrigues Signed-off-by: Shiyang Ruan --- fs/dax.c | 56 fs/remap_range.c | 45 --- fs/xfs/xfs_reflink.c | 9 +-- include/linux/dax.h | 4 include/linux/fs.h | 15 5 files changed, 115 insertions(+), 14 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 348297b38f76..76f81f1d76ec 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1833,3 +1833,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, return dax_insert_pfn_mkwrite(vmf, pfn, order); } EXPORT_SYMBOL_GPL(dax_finish_sync_fault); + +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1, + struct inode *ino2, loff_t pos2, loff_t len, void *data, + struct iomap *smap, struct iomap *dmap) +{ + void *saddr, *daddr; + bool *same = data; + int ret; + + if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) { + *same = true; + return len; + } + + if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) { + *same = false; + return 0; + } + + ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE), + &saddr, NULL); + if (ret < 0) + return -EIO; + + ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE), + &daddr, NULL); + if (ret < 0) + return -EIO; + + *same = !memcmp(saddr, daddr, len); + return len; +} + +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff, + struct inode *dest, loff_t destoff, loff_t len, bool *is_same, + const struct iomap_ops *ops) +{ + int id, ret = 0; + + id = dax_read_lock(); + while (len) { + ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops, + is_same, dax_range_compare_actor); + if (ret < 0 || !*is_same) + goto out; + + len -= ret; + srcoff += ret; + destoff += ret; + } + ret = 0; +out: + dax_read_unlock(id); + return ret; +} +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare); diff --git a/fs/remap_range.c b/fs/remap_range.c index 77dba3a49e65..9079390edaf3 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -14,6 +14,7 @@ #include #include #include +#include #include "internal.h" #include @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2) * Compare extents of two files to see if they are the same. * Caller must have locked both inodes to prevent write races. */ -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, -struct inode *dest, loff_t destoff, -loff_t len, bool *is_same) +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, + struct inode *dest, loff_t destoff, + loff_t len, bool *is_same) { loff_t src_poff; loff_t dest_poff; @@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, out_error: return error; } +EXPORT_SYMBOL(vfs_dedupe_file_range_compare); /* * Check that the two inodes are eligible for cloning, the ranges make @@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, * If there's an error, then the usual negative error code is returned. * Otherwise returns 0 with *len set to the request length. 
*/ -int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - loff_t *len, unsigned int remap_flags) +static int +__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t *len, unsigned int remap_flags, + const struct iomap_ops *ops) { struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); @@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, if (remap_flags & REMAP_FILE_DEDUP) { boolis_same = false; - ret = vfs_dedupe_file_range_compare(inode_in, pos_in, - inode_out, pos_out, *len, &is_same); + if (!IS_DAX(inode_in) && !IS_DAX(inode_ou
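A hedged sketch of the call-site shape this patch enables for a filesystem: inodes that are both DAX go through the new dax comparison with the filesystem's read iomap ops, while page-cache backed inodes keep the existing compare. The wrapper name and the choice of xfs_read_iomap_ops here are illustrative, not lifted verbatim from the patch:

	static int example_dedupe_compare(struct inode *src, loff_t srcoff,
					  struct inode *dest, loff_t destoff,
					  loff_t len, bool *is_same)
	{
		/* both DAX: compare through dax_direct_access() mappings */
		if (IS_DAX(src) && IS_DAX(dest))
			return dax_dedupe_file_range_compare(src, srcoff, dest,
						destoff, len, is_same,
						&xfs_read_iomap_ops);

		/* both page-cache backed: the existing readpage()-based compare */
		return vfs_dedupe_file_range_compare(src, srcoff, dest, destoff,
						     len, is_same);
	}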
[PATCH v3 07/10] iomap: Introduce iomap_apply2() for operations on two files
Some operations, such as comparing a range of data in two files under fsdax mode, requires nested iomap_open()/iomap_end() on two file. Thus, we introduce iomap_apply2() to accept arguments from two files and iomap_actor2_t for actions on two files. Signed-off-by: Shiyang Ruan --- fs/iomap/apply.c | 56 +++ include/linux/iomap.h | 7 +- 2 files changed, 62 insertions(+), 1 deletion(-) diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c index 26ab6563181f..fbc38ce3d5b6 100644 --- a/fs/iomap/apply.c +++ b/fs/iomap/apply.c @@ -97,3 +97,59 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags, return written ? written : ret; } + +loff_t +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2, + loff_t length, unsigned int flags, const struct iomap_ops *ops, + void *data, iomap_actor2_t actor) +{ + struct iomap smap = { .type = IOMAP_HOLE }; + struct iomap dmap = { .type = IOMAP_HOLE }; + loff_t written = 0, ret, ret2 = 0; + loff_t len1 = length, len2, min_len; + + ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL); + if (ret) + goto out_src; + if (WARN_ON(smap.offset > pos1)) { + written = -EIO; + goto out_src; + } + if (WARN_ON(smap.length == 0)) { + written = -EIO; + goto out_src; + } + len2 = min_t(loff_t, len1, smap.length); + + ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL); + if (ret) + goto out_dest; + if (WARN_ON(dmap.offset > pos2)) { + written = -EIO; + goto out_dest; + } + if (WARN_ON(dmap.length == 0)) { + written = -EIO; + goto out_dest; + } + min_len = min_t(loff_t, len2, dmap.length); + + written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap); + +out_dest: + if (ops->iomap_end) + ret2 = ops->iomap_end(ino2, pos2, len2, + written > 0 ? written : 0, flags, &dmap); +out_src: + if (ops->iomap_end) + ret = ops->iomap_end(ino1, pos1, len1, +written > 0 ? written : 0, flags, &smap); + + if (ret) + return written ? written : ret; + + if (ret2) + return written ? written : ret2; + + return written; +} diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 5bd3cac4df9c..913f98897a77 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -148,10 +148,15 @@ struct iomap_ops { */ typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len, void *data, struct iomap *iomap, struct iomap *srcmap); - +typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1, + struct inode *ino2, loff_t pos2, loff_t len, void *data, + struct iomap *smap, struct iomap *dmap); loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags, const struct iomap_ops *ops, void *data, iomap_actor_t actor); +loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, + loff_t pos2, loff_t length, unsigned int flags, + const struct iomap_ops *ops, void *data, iomap_actor2_t actor); ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from, const struct iomap_ops *ops); -- 2.30.1 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
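To illustrate how the new entry point is meant to be consumed (this loosely mirrors the dax dedupe-compare user later in the series; the names below are placeholders, not part of the patch):

	/* the actor sees one already-mapped chunk of both files and returns
	 * how many bytes of the range it handled */
	static loff_t example_actor2(struct inode *ino1, loff_t pos1,
				     struct inode *ino2, loff_t pos2, loff_t len,
				     void *data, struct iomap *smap, struct iomap *dmap)
	{
		/* operate on [pos1, pos1+len) of ino1 and [pos2, pos2+len) of ino2 */
		return len;
	}

	static int example_walk_two_files(struct inode *ino1, struct inode *ino2,
					  loff_t len, const struct iomap_ops *ops)
	{
		loff_t pos1 = 0, pos2 = 0, ret;

		while (len) {
			ret = iomap_apply2(ino1, pos1, ino2, pos2, len, 0, ops,
					   NULL, example_actor2);
			if (ret < 0)
				return ret;
			if (!ret)
				break;
			pos1 += ret;
			pos2 += ret;
			len -= ret;
		}
		return 0;
	}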
[PATCH v3 05/10] fsdax: Replace mmap entry in case of CoW
We replace the existing entry to the newly allocated one in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this entry as writeprotected. This helps us snapshots so new write pagefaults after snapshots trigger a CoW. Signed-off-by: Goldwyn Rodrigues Signed-off-by: Shiyang Ruan Reviewed-by: Christoph Hellwig --- fs/dax.c | 37 ++--- 1 file changed, 26 insertions(+), 11 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 181aad97136a..cfe513eb111e 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -722,6 +722,9 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d return 0; } +#define DAX_IF_DIRTY (1 << 0) +#define DAX_IF_COW (1 << 1) + /* * By this point grab_mapping_entry() has ensured that we have a locked entry * of the appropriate size so we don't have to worry about downgrading PMDs to @@ -729,16 +732,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d * already in the tree, we will skip the insertion and just dirty the PMD as * appropriate. */ -static void *dax_insert_entry(struct xa_state *xas, - struct address_space *mapping, struct vm_fault *vmf, - void *entry, pfn_t pfn, unsigned long flags, bool dirty) +static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, + void *entry, pfn_t pfn, unsigned long flags, + unsigned int insert_flags) { + struct address_space *mapping = vmf->vma->vm_file->f_mapping; void *new_entry = dax_make_entry(pfn, flags); + bool dirty = insert_flags & DAX_IF_DIRTY; + bool cow = insert_flags & DAX_IF_COW; if (dirty) __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) { + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { unsigned long index = xas->xa_index; /* we are replacing a zero page with block mapping */ if (dax_is_pmd_entry(entry)) @@ -750,7 +756,7 @@ static void *dax_insert_entry(struct xa_state *xas, xas_reset(xas); xas_lock_irq(xas); - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { void *old; dax_disassociate_entry(entry, mapping, false); @@ -774,6 +780,9 @@ static void *dax_insert_entry(struct xa_state *xas, if (dirty) xas_set_mark(xas, PAGECACHE_TAG_DIRTY); + if (cow) + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE); + xas_unlock_irq(xas); return entry; } @@ -1098,8 +1107,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr)); vm_fault_t ret; - *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, - DAX_ZERO_PAGE, false); + *entry = dax_insert_entry(xas, vmf, *entry, pfn, DAX_ZERO_PAGE, 0); ret = vmf_insert_mixed(vmf->vma, vaddr, pfn); trace_dax_load_hole(inode, vmf, ret); @@ -1126,8 +1134,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, goto fallback; pfn = page_to_pfn_t(zero_page); - *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, - DAX_PMD | DAX_ZERO_PAGE, false); + *entry = dax_insert_entry(xas, vmf, *entry, pfn, + DAX_PMD | DAX_ZERO_PAGE, 0); if (arch_needs_pgtable_deposit()) { pgtable = pte_alloc_one(vma->vm_mm); @@ -1431,6 +1439,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp, loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT; bool write = vmf->flags & FAULT_FLAG_WRITE; bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap); + unsigned int insert_flags = 0; int err = 0; pfn_t pfn; void *kaddr; @@ -1453,8 +1462,14 @@ static vm_fault_t dax_fault_actor(struct 
vm_fault *vmf, pfn_t *pfnp, if (err) return dax_fault_return(err); - entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0, -write && !sync); + if (write) { + if (!sync) + insert_flags |= DAX_IF_DIRTY; + if (iomap->flags & IOMAP_F_SHARED) + insert_flags |= DAX_IF_COW; + } + + entry = dax_insert_entry(xas, vmf, entry, pfn, 0, insert_flags); if (write && srcmap->addr != iomap->addr) { err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr, false); -- 2.30.1 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH v3 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
Punch hole on a reflinked file needs dax_copy_edge() too. Otherwise, data in not aligned area will be not correct. So, add the srcmap to dax_iomap_zero() and replace memset() as dax_copy_edge(). Signed-off-by: Shiyang Ruan --- fs/dax.c | 9 +++-- fs/iomap/buffered-io.c | 2 +- include/linux/dax.h| 3 ++- 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index cfe513eb111e..348297b38f76 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1174,7 +1174,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, } #endif /* CONFIG_FS_DAX_PMD */ -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap) +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap, + struct iomap *srcmap) { sector_t sector = iomap_sector(iomap, pos & PAGE_MASK); pgoff_t pgoff; @@ -1204,7 +1205,11 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap) } if (!page_aligned) { - memset(kaddr + offset, 0, size); + if (iomap->addr != srcmap->addr) + dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap, + kaddr, true); + else + memset(kaddr + offset, 0, size); dax_flush(iomap->dax_dev, kaddr + offset, size); } dax_read_unlock(id); diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 16a1e82e3aeb..d754b1f1a05d 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -933,7 +933,7 @@ static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos, s64 bytes; if (IS_DAX(inode)) - bytes = dax_iomap_zero(pos, length, iomap); + bytes = dax_iomap_zero(pos, length, iomap, srcmap); else bytes = iomap_zero(inode, pos, length, iomap, srcmap); if (bytes < 0) diff --git a/include/linux/dax.h b/include/linux/dax.h index b52f084aa643..3275e01ed33d 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -237,7 +237,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index); int dax_invalidate_mapping_entry_sync(struct address_space *mapping, pgoff_t index); -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap); +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap, + struct iomap *srcmap); static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); -- 2.30.1 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH v3 04/10] fsdax: Introduce dax_iomap_cow_copy()
In the case where the iomap is a write operation and iomap is not equal to srcmap after iomap_begin, we consider it is a CoW operation. The destance extent which iomap indicated is new allocated extent. So, it is needed to copy the data from srcmap to new allocated extent. In theory, it is better to copy the head and tail ranges which is outside of the non-aligned area instead of copying the whole aligned range. But in dax page fault, it will always be an aligned range. So, we have to copy the whole range in this case. Signed-off-by: Shiyang Ruan Reviewed-by: Christoph Hellwig --- fs/dax.c | 71 1 file changed, 66 insertions(+), 5 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index a70e6aa285bb..181aad97136a 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1037,6 +1037,51 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size, return rc; } +/* + * Copy the head and tail part of the pages not included in the write but + * required for CoW, because pos/pos+length are not page aligned. But in dax + * page fault case, the range is page aligned, we need to copy the whole range + * of data. Use copy_edge to distinguish these cases. + */ +static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size, + struct iomap *srcmap, void *daddr, bool copy_edge) +{ + loff_t head_off = pos & (align_size - 1); + size_t size = ALIGN(head_off + length, align_size); + loff_t end = pos + length; + loff_t pg_end = round_up(end, align_size); + void *saddr = 0; + int ret = 0; + + ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL); + if (ret) + return ret; + + if (!copy_edge) + return copy_mc_to_kernel(daddr, saddr, length); + + /* Copy the head part of the range. Note: we pass offset as length. */ + if (head_off) { + if (saddr) + ret = copy_mc_to_kernel(daddr, saddr, head_off); + else + memset(daddr, 0, head_off); + } + /* Copy the tail part of the range */ + if (end < pg_end) { + loff_t tail_off = head_off + length; + loff_t tail_len = pg_end - end; + + if (saddr) + ret = copy_mc_to_kernel(daddr + tail_off, + saddr + tail_off, tail_len); + else + memset(daddr + tail_off, 0, tail_len); + } + + return ret; +} + /* * The user has performed a load from a hole in the file. Allocating a new * page in the file would cause excessive storage usage for workloads with @@ -1166,11 +1211,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, struct dax_device *dax_dev = iomap->dax_dev; struct iov_iter *iter = data; loff_t end = pos + length, done = 0; + bool write = iov_iter_rw(iter) == WRITE; ssize_t ret = 0; size_t xfer; int id; - if (iov_iter_rw(iter) == READ) { + if (!write) { end = min(end, i_size_read(inode)); if (pos >= end) return 0; @@ -1179,7 +1225,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, return iov_iter_zero(min(length, end - pos), iter); } - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED)) + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED && + !(iomap->flags & IOMAP_F_SHARED))) return -EIO; /* @@ -1218,6 +1265,13 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, break; } + if (write && srcmap->addr != iomap->addr) { + ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap, +kaddr, true); + if (ret) + break; + } + map_len = PFN_PHYS(map_len); kaddr += offset; map_len -= offset; @@ -1229,7 +1283,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, * validated via access_ok() in either vfs_read() or * vfs_write(), depending on which operation we are doing. 
*/ - if (iov_iter_rw(iter) == WRITE) + if (write) xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr, map_len, iter); else @@ -1379,6 +1433,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp, bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap); int err = 0; pfn_t pfn; + void *kaddr; /* if we are reading UNWRITTEN and HO
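To make the head/tail arithmetic in dax_iomap_cow_copy() concrete, here is a small worked example assuming 4 KiB pages (the numbers are illustrative, not from the patch):

	/*
	 * A 2048-byte write at file offset 5120 lands inside the page covering
	 * offsets 4096..8192:
	 *
	 *   head_off = 5120 & 4095          = 1024  -> 1024 bytes ahead of the
	 *                                              write are copied from srcmap
	 *   end      = 5120 + 2048          = 7168
	 *   pg_end   = round_up(7168, 4096) = 8192  -> 1024 bytes after the
	 *                                              write are copied from srcmap
	 *
	 * Only the middle 2048 bytes come from the new write itself. In the page
	 * fault path the range is already page aligned, so copy_edge is false and
	 * the whole range is copied from srcmap instead.
	 */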
[PATCH v3 03/10] fsdax: Output address in dax_iomap_pfn() and rename it
Add address output in dax_iomap_pfn() in order to perform a memcpy() in CoW case. Since this function both output address and pfn, rename it to dax_iomap_direct_access(). Signed-off-by: Shiyang Ruan Reviewed-by: Christoph Hellwig --- fs/dax.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 33ddad0f3091..a70e6aa285bb 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -997,8 +997,8 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos) return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9; } -static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size, -pfn_t *pfnp) +static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size, + void **kaddr, pfn_t *pfnp) { const sector_t sector = dax_iomap_sector(iomap, pos); pgoff_t pgoff; @@ -1010,11 +1010,13 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size, return rc; id = dax_read_lock(); length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), - NULL, pfnp); + kaddr, pfnp); if (length < 0) { rc = length; goto out; } + if (!pfnp) + goto out_check_addr; rc = -EINVAL; if (PFN_PHYS(length) < size) goto out; @@ -1024,6 +1026,12 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size, if (length > 1 && !pfn_t_devmap(*pfnp)) goto out; rc = 0; + +out_check_addr: + if (!kaddr) + goto out; + if (!*kaddr) + rc = -EFAULT; out: dax_read_unlock(id); return rc; @@ -1386,7 +1394,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp, return VM_FAULT_SIGBUS; } - err = dax_iomap_pfn(iomap, pos, size, &pfn); + err = dax_iomap_direct_access(iomap, pos, size, NULL, &pfn); if (err) return dax_fault_return(err); -- 2.30.1 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
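Since either output can now be skipped, a caller inside fs/dax.c that only needs the kernel address for a memcpy() can pass NULL for the pfn, which is how the CoW copy helper later in the series uses it. A minimal usage sketch (the wrapper name is illustrative):

	/* illustrative helper: fetch the kernel address backing @pos in @srcmap */
	static int example_get_source_addr(struct iomap *srcmap, loff_t pos,
					   void **saddr)
	{
		/* the pfn output is not needed here, so pass NULL for pfnp */
		return dax_iomap_direct_access(srcmap, pos, PAGE_SIZE, saddr, NULL);
	}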
[PATCH v3 02/10] fsdax: Factor helper: dax_fault_actor()
The core logic in the two dax page fault functions is similar. So, move the logic into a common helper function. Also, to facilitate the addition of new features, such as CoW, switch-case is no longer used to handle different iomap types. Signed-off-by: Shiyang Ruan --- fs/dax.c | 291 +++ 1 file changed, 145 insertions(+), 146 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 7031e4302b13..33ddad0f3091 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1053,6 +1053,66 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, return ret; } +#ifdef CONFIG_FS_DAX_PMD +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, + struct iomap *iomap, void **entry) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + unsigned long pmd_addr = vmf->address & PMD_MASK; + struct vm_area_struct *vma = vmf->vma; + struct inode *inode = mapping->host; + pgtable_t pgtable = NULL; + struct page *zero_page; + spinlock_t *ptl; + pmd_t pmd_entry; + pfn_t pfn; + + zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm); + + if (unlikely(!zero_page)) + goto fallback; + + pfn = page_to_pfn_t(zero_page); + *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, + DAX_PMD | DAX_ZERO_PAGE, false); + + if (arch_needs_pgtable_deposit()) { + pgtable = pte_alloc_one(vma->vm_mm); + if (!pgtable) + return VM_FAULT_OOM; + } + + ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); + if (!pmd_none(*(vmf->pmd))) { + spin_unlock(ptl); + goto fallback; + } + + if (pgtable) { + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); + mm_inc_nr_ptes(vma->vm_mm); + } + pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot); + pmd_entry = pmd_mkhuge(pmd_entry); + set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry); + spin_unlock(ptl); + trace_dax_pmd_load_hole(inode, vmf, zero_page, *entry); + return VM_FAULT_NOPAGE; + +fallback: + if (pgtable) + pte_free(vma->vm_mm, pgtable); + trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry); + return VM_FAULT_FALLBACK; +} +#else +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, + struct iomap *iomap, void **entry) +{ + return VM_FAULT_FALLBACK; +} +#endif /* CONFIG_FS_DAX_PMD */ + s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap) { sector_t sector = iomap_sector(iomap, pos & PAGE_MASK); @@ -1289,6 +1349,61 @@ static int dax_fault_cow_page(struct vm_fault *vmf, struct iomap *iomap, return 0; } +/** + * dax_fault_actor - Common actor to handle pfn insertion in PTE/PMD fault. + * @vmf: vm fault instance + * @pfnp: pfn to be returned + * @xas: the dax mapping tree of a file + * @entry: an unlocked dax entry to be inserted + * @pmd: distinguish whether it is a pmd fault + * @flags: iomap flags + * @iomap: from iomap_begin() + * @srcmap:from iomap_begin(), not equal to iomap if it is a CoW + */ +static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp, + struct xa_state *xas, void *entry, bool pmd, unsigned int flags, + struct iomap *iomap, struct iomap *srcmap) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + size_t size = pmd ? PMD_SIZE : PAGE_SIZE; + loff_t pos = (loff_t)xas->xa_offset << PAGE_SHIFT; + bool write = vmf->flags & FAULT_FLAG_WRITE; + bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap); + int err = 0; + pfn_t pfn; + + /* if we are reading UNWRITTEN and HOLE, return a hole. 
*/ + if (!write && + (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) { + if (!pmd) + return dax_load_hole(xas, mapping, &entry, vmf); + else + return dax_pmd_load_hole(xas, vmf, iomap, &entry); + } + + if (iomap->type != IOMAP_MAPPED) { + WARN_ON_ONCE(1); + return VM_FAULT_SIGBUS; + } + + err = dax_iomap_pfn(iomap, pos, size, &pfn); + if (err) + return dax_fault_return(err); + + entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0, +write && !sync); + + if (sync) + return dax_fault_synchronous_pfnp(pfnp, pfn); + + if (pmd) + return vmf_insert_pfn_pmd(vmf, pfn, write); + if (write) + return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn); + else + return vmf_insert_mixed(vmf->vma, vmf->address, pfn); +} + static vm_fault_t dax_iomap_pte_fault(struct vm
[PATCH v3 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax
From: Shiyang Ruan

This patchset is an attempt to add CoW support for fsdax, and takes XFS, which has both the reflink and fsdax features, as an example.

Changes from V2:
- Fix the mistake in iomap_apply2() and dax_dedupe_file_range_compare()
- Add CoW judgement in dax_iomap_zero()
- Fix other code style problems and mistakes

Changes from V1:
- Factor some helper functions to simplify dax fault code
- Introduce iomap_apply2() for dax_dedupe_file_range_compare()
- Fix mistakes and other problems
- Rebased on v5.11

One of the key mechanisms that needs to be implemented in fsdax is CoW: copy the data from the srcmap before we actually write data to the destination iomap, and only copy the ranges in which the data won't be changed by the write. Another mechanism is range comparison. In the page cache case, readpage() is used to load on-disk data into the page cache so that the data can be compared. In the fsdax case, readpage() does not work, so we need another way to compare data, one with direct access support. With these two mechanisms implemented in fsdax, we are able to make reflink and fsdax work together in XFS.

Some of the patches are picked up from Goldwyn's patchset; I made some changes to adapt them to this patchset. (Rebased on v5.11)

==

Shiyang Ruan (10):
  fsdax: Factor helpers to simplify dax fault code
  fsdax: Factor helper: dax_fault_actor()
  fsdax: Output address in dax_iomap_pfn() and rename it
  fsdax: Introduce dax_iomap_cow_copy()
  fsdax: Replace mmap entry in case of CoW
  fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
  iomap: Introduce iomap_apply2() for operations on two files
  fsdax: Dedup file range to use a compare function
  fs/xfs: Handle CoW for fsdax write() path
  fs/xfs: Add dedupe support for fsdax

 fs/dax.c               | 596 ++---
 fs/iomap/apply.c       | 56
 fs/iomap/buffered-io.c | 2 +-
 fs/remap_range.c       | 45 +++-
 fs/xfs/xfs_bmap_util.c | 3 +-
 fs/xfs/xfs_file.c      | 29 +-
 fs/xfs/xfs_inode.c     | 8 +-
 fs/xfs/xfs_inode.h     | 1 +
 fs/xfs/xfs_iomap.c     | 58 +++-
 fs/xfs/xfs_iomap.h     | 4 +
 fs/xfs/xfs_iops.c      | 7 +-
 fs/xfs/xfs_reflink.c   | 17 +-
 include/linux/dax.h    | 7 +-
 include/linux/fs.h     | 15 +-
 include/linux/iomap.h  | 7 +-
 15 files changed, 602 insertions(+), 253 deletions(-)

-- 2.30.1 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
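In rough form, the per-write CoW step described above reduces to the hunk that patch 04 adds to dax_iomap_actor(); it is paraphrased here for orientation (simplified, not a verbatim excerpt):

	/* write into a newly allocated extent: preserve the surrounding bytes */
	if (write && srcmap->addr != iomap->addr) {
		ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
					 kaddr, true);
		if (ret)
			break;
	}
	/* ...then the user's data is copied into the new extent as usual */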
[PATCH v3 01/10] fsdax: Factor helpers to simplify dax fault code
The dax page fault code is too long and a bit difficult to read. And it is hard to understand when we trying to add new features. Some of the PTE/PMD codes have similar logic. So, factor them as helper functions to simplify the code. Signed-off-by: Shiyang Ruan Reviewed-by: Christoph Hellwig --- fs/dax.c | 152 ++- 1 file changed, 84 insertions(+), 68 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 26d5dcd2d69e..7031e4302b13 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1243,6 +1243,52 @@ static bool dax_fault_is_synchronous(unsigned long flags, && (iomap->flags & IOMAP_F_DIRTY); } +/* + * If we are doing synchronous page fault and inode needs fsync, we can insert + * PTE/PMD into page tables only after that happens. Skip insertion for now and + * return the pfn so that caller can insert it after fsync is done. + */ +static vm_fault_t dax_fault_synchronous_pfnp(pfn_t *pfnp, pfn_t pfn) +{ + if (WARN_ON_ONCE(!pfnp)) + return VM_FAULT_SIGBUS; + + *pfnp = pfn; + return VM_FAULT_NEEDDSYNC; +} + +static int dax_fault_cow_page(struct vm_fault *vmf, struct iomap *iomap, + loff_t pos, vm_fault_t *ret) +{ + int error = 0; + unsigned long vaddr = vmf->address; + sector_t sector = dax_iomap_sector(iomap, pos); + + switch (iomap->type) { + case IOMAP_HOLE: + case IOMAP_UNWRITTEN: + clear_user_highpage(vmf->cow_page, vaddr); + break; + case IOMAP_MAPPED: + error = copy_cow_page_dax(iomap->bdev, iomap->dax_dev, + sector, vmf->cow_page, vaddr); + break; + default: + WARN_ON_ONCE(1); + error = -EIO; + break; + } + + if (error) + return error; + + __SetPageUptodate(vmf->cow_page); + *ret = finish_fault(vmf); + if (!*ret) + *ret = VM_FAULT_DONE_COW; + return 0; +} + static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, int *iomap_errp, const struct iomap_ops *ops) { @@ -1311,30 +1357,9 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, } if (vmf->cow_page) { - sector_t sector = dax_iomap_sector(&iomap, pos); - - switch (iomap.type) { - case IOMAP_HOLE: - case IOMAP_UNWRITTEN: - clear_user_highpage(vmf->cow_page, vaddr); - break; - case IOMAP_MAPPED: - error = copy_cow_page_dax(iomap.bdev, iomap.dax_dev, - sector, vmf->cow_page, vaddr); - break; - default: - WARN_ON_ONCE(1); - error = -EIO; - break; - } - + error = dax_fault_cow_page(vmf, &iomap, pos, &ret); if (error) - goto error_finish_iomap; - - __SetPageUptodate(vmf->cow_page); - ret = finish_fault(vmf); - if (!ret) - ret = VM_FAULT_DONE_COW; + ret = dax_fault_return(error); goto finish_iomap; } @@ -1354,19 +1379,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, write && !sync); - /* -* If we are doing synchronous page fault and inode needs fsync, -* we can insert PTE into page tables only after that happens. -* Skip insertion for now and return the pfn so that caller can -* insert it after fsync is done. 
-*/ if (sync) { - if (WARN_ON_ONCE(!pfnp)) { - error = -EIO; - goto error_finish_iomap; - } - *pfnp = pfn; - ret = VM_FAULT_NEEDDSYNC | major; + ret = dax_fault_synchronous_pfnp(pfnp, pfn); goto finish_iomap; } trace_dax_insert_mapping(inode, vmf, entry); @@ -1465,13 +1479,45 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, return VM_FAULT_FALLBACK; } +static bool dax_fault_check_fallback(struct vm_fault *vmf, struct xa_state *xas, + pgoff_t max_pgoff) +{ + unsigned long pmd_addr = vmf->address & PMD_MASK; + bool write = vmf->flags & FAULT_FLAG_WRITE; + + /* +* Make sure that the faulting address's PMD offset (color) matches +* the PMD offset from the start of the file. This
Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
On Wed, Mar 17, 2021 at 9:45 PM Darrick J. Wong wrote: > > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote: > > Jason wondered why the get_user_pages_fast() path takes references on a > > @pgmap object. The rationale was to protect against accessing a 'struct > > page' that might be in the process of being removed by the driver, but > > he rightly points out that should be solved the same way all gup-fast > > synchronization is solved which is invalidate the mapping and let the > > gup slow path do @pgmap synchronization [1]. > > > > To achieve that it means that new user mappings need to stop being > > created and all existing user mappings need to be invalidated. > > > > For device-dax this is already the case as kill_dax() prevents future > > faults from installing a pte, and the single device-dax inode > > address_space can be trivially unmapped. > > > > The situation is different for filesystem-dax where device pages could > > be mapped by any number of inode address_space instances. An initial > > thought was to treat the device removal event like a drop_pagecache_sb() > > event that walks superblocks and unmaps all inodes. However, Dave points > > out that it is not just the filesystem user-mappings that need to react > > to global DAX page-unmap events, it is also filesystem metadata > > (proposed DAX metadata access), and other drivers (upstream > > DM-writecache) that need to react to this event [2]. > > > > The only kernel facility that is meant to globally broadcast the loss of > > a page (via corruption or surprise remove) is memory_failure(). The > > downside of memory_failure() is that it is a pfn-at-a-time interface. > > However, the events that would trigger the need to call memory_failure() > > over a full PMEM device should be rare. Remove should always be > > coordinated by the administrator with the filesystem. If someone force > > removes a device from underneath a mounted filesystem the driver assumes > > they have a good reason, or otherwise get to keep the pieces. Since > > ->remove() callbacks can not fail the only option is to trigger the mass > > memory_failure(). > > > > The mechanism to determine whether memory_failure() triggers at > > pmem->remove() time is whether the associated dax_device has an elevated > > reference at @pgmap ->kill() time. > > > > With this in place the get_user_pages_fast() path can drop its > > half-measure synchronization with an @pgmap reference. > > > > Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1] > > Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2] > > Reported-by: Jason Gunthorpe > > Cc: Dave Chinner > > Cc: Christoph Hellwig > > Cc: Shiyang Ruan > > Cc: Vishal Verma > > Cc: Dave Jiang > > Cc: Ira Weiny > > Cc: Matthew Wilcox > > Cc: Jan Kara > > Cc: Andrew Morton > > Cc: Naoya Horiguchi > > Cc: "Darrick J. 
Wong" > > Signed-off-by: Dan Williams > > --- > > drivers/dax/super.c | 15 +++ > > drivers/nvdimm/pmem.c| 10 +- > > drivers/nvdimm/pmem.h|1 + > > include/linux/dax.h |5 + > > include/linux/memremap.h |5 + > > include/linux/mm.h |3 +++ > > mm/memory-failure.c | 11 +-- > > mm/memremap.c| 11 +++ > > 8 files changed, 58 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > > index 5fa6ae9dbc8b..5ebcedf4a68c 100644 > > --- a/drivers/dax/super.c > > +++ b/drivers/dax/super.c > > @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev) > > } > > EXPORT_SYMBOL_GPL(put_dax); > > > > +bool dax_is_idle(struct dax_device *dax_dev) > > +{ > > + struct inode *inode; > > + > > + if (!dax_dev) > > + return true; > > + > > + WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags), > > + "dax idle check on live device.\n"); > > + > > + inode = &dax_dev->inode; > > + return atomic_read(&inode->i_count) < 2; > > +} > > +EXPORT_SYMBOL_GPL(dax_is_idle); > > + > > /** > > * dax_get_by_host() - temporary lookup mechanism for filesystem-dax > > * @host: alternate name for the device registered by a dax driver > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > > index b8a85bfb2e95..e8822c9262ee 100644 > > --- a/drivers/nvdimm/pmem.c > > +++ b/drivers/nvdimm/pmem.c > > @@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap > > *pgmap) > > { > > struct request_queue *q = > > container_of(pgmap->ref, struct request_queue, > > q_usage_counter); > > + struct pmem_device *pmem = q->queuedata; > > > > blk_freeze_queue_start(q); > > + kill_dax(pmem->dax_dev); > > + if (!dax_is_idle(pmem->dax_dev)) { > > + dev_warn(pmem->dev, > > + "DAX active at remove, trigger mass memory > > failure\n"); > > + dev_pagemap_failure(pgmap); > > + } > > } > > > > static void pmem_release_di
Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner wrote: > > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote: > > Jason wondered why the get_user_pages_fast() path takes references on a > > @pgmap object. The rationale was to protect against accessing a 'struct > > page' that might be in the process of being removed by the driver, but > > he rightly points out that should be solved the same way all gup-fast > > synchronization is solved which is invalidate the mapping and let the > > gup slow path do @pgmap synchronization [1]. > > > > To achieve that it means that new user mappings need to stop being > > created and all existing user mappings need to be invalidated. > > > > For device-dax this is already the case as kill_dax() prevents future > > faults from installing a pte, and the single device-dax inode > > address_space can be trivially unmapped. > > > > The situation is different for filesystem-dax where device pages could > > be mapped by any number of inode address_space instances. An initial > > thought was to treat the device removal event like a drop_pagecache_sb() > > event that walks superblocks and unmaps all inodes. However, Dave points > > out that it is not just the filesystem user-mappings that need to react > > to global DAX page-unmap events, it is also filesystem metadata > > (proposed DAX metadata access), and other drivers (upstream > > DM-writecache) that need to react to this event [2]. > > > > The only kernel facility that is meant to globally broadcast the loss of > > a page (via corruption or surprise remove) is memory_failure(). The > > downside of memory_failure() is that it is a pfn-at-a-time interface. > > However, the events that would trigger the need to call memory_failure() > > over a full PMEM device should be rare. > > This is a highly suboptimal design. Filesystems only need a single > callout to trigger a shutdown that unmaps every active mapping in > the filesystem - we do not need a page-by-page error notification > which results in 250 million hwposion callouts per TB of pmem to do > this. > > Indeed, the moment we get the first hwpoison from this patch, we'll > map it to the primary XFS superblock and we'd almost certainly > consider losing the storage behind that block to be a shut down > trigger. During the shutdown, the filesystem should unmap all the > active mappings (we already need to add this to shutdown on DAX > regardless of this device remove issue) and so we really don't need > a page-by-page notification of badness. XFS doesn't, but what about device-mapper and other agents? Even if the driver had a callback up the stack memory_failure() still needs to be able to trigger failures down the stack for CPU consumed poison. > > AFAICT, it's going to take minutes, maybe hours for do the page-by-page > iteration to hwposion every page. It's going to take a few seconds > for the filesystem shutdown to run a device wide invalidation. > > SO, yeah, I think this should simply be a single ranged call to the > filesystem like: > > ->memory_failure(dev, 0, -1ULL) > > to tell the filesystem that the entire backing device has gone away, > and leave the filesystem to handle failure entirely at the > filesystem level. So I went with memory_failure() after our discussion of all the other agents in the system that might care about these pfns going offline and relying on memory_failure() to route down to each of those. I.e. the "reuse the drop_pagecache_sb() model" idea was indeed insufficient. 
Now I'm trying to reconcile the fact that platform poison handling will
hit memory_failure() first and may not immediately reach the driver, if
ever (see the perennially awkward firmware-first-mode error handling:
ghes_handle_memory_failure()). So even if the ->memory_failure(dev...)
up call exists there is no guarantee it can get called for all poison
before the memory_failure() down call happens. Which means regardless of
whether ->memory_failure(dev...) exists, memory_failure() needs to be
able to do the right thing.

Combine that with the fact that new buses like CXL might be configured
in "poison on decode error" mode, which means that a memory_failure()
storm can happen regardless of whether the driver initiates it
programmatically.

How about a mechanism to optionally let a filesystem take over memory
failure handling for a range of pfns that memory_failure() can consult
to fail ranges at a time rather than one by one? So a new 'struct
dax_operations' op:

	void (*memory_failure_register)(struct dax_device *, void *data);

Where any agent that claims a dax_dev can register to take over
memory_failure() handling for any event that happens in that range. This
would be routed through device-mapper like any other 'struct
dax_operations' op. I think that meets your requirement to get
notifications of all the events you want to handle, but still allows
memory_failure() to be the last resort for everything that has not opted
into this error handling.
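To make that idea concrete, here is a sketch of how such an opt-in range
handler could be recorded and consulted from memory_failure(). Every
identifier below is hypothetical, invented only to show the shape of the
proposal:

	/* hypothetical registration record; nothing here exists upstream */
	struct mf_range_handler {
		unsigned long start_pfn;
		unsigned long nr_pages;
		/* handle every failed pfn in [pfn, pfn + nr) in one call */
		int (*handle)(void *data, unsigned long pfn, unsigned long nr);
		void *data;
	};

	/* consulted early in memory_failure() before per-page handling */
	static int mf_try_range_handler(unsigned long pfn, unsigned long nr)
	{
		struct mf_range_handler *h = mf_range_lookup(pfn); /* assumed lookup */

		if (!h)
			return -EOPNOTSUPP;	/* fall back to page-by-page */
		return h->handle(h->data, pfn, nr);
	}

The filesystem (or device-mapper target) that claimed the dax_device
would supply ->handle(), turning a poison storm over its range into a
single shutdown/unmap, while everything that has not registered keeps
the existing pfn-at-a-time behavior.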
Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
On Thu, Mar 18, 2021 at 3:02 AM Joao Martins wrote:
>
> On 3/18/21 4:08 AM, Dan Williams wrote:
> > Now that device-dax and filesystem-dax are guaranteed to unmap all user
> > mappings of devmap / DAX pages before tearing down the 'struct page'
> > array, get_user_pages_fast() can rely on its traditional synchronization
> > method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> > device shutdown. Specifically the unmap guarantee ensures that gup-fast
> > either succeeds in taking a page reference (lock-less), or it detects a
> > need to fall back to the slow path where the device presence can be
> > revalidated with locks held.
>
> [...]
>
> > @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >
> >  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +
> >  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >                               unsigned long end, unsigned int flags,
> >                               struct page **pages, int *nr)
> >  {
> >       int nr_start = *nr;
> > -     struct dev_pagemap *pgmap = NULL;
> >
> >       do {
> > -             struct page *page = pfn_to_page(pfn);
> > +             struct page *page;
> > +
> > +             /*
> > +              * Typically pfn_to_page() on a devmap pfn is not safe
> > +              * without holding a live reference on the hosting
> > +              * pgmap. In the gup-fast path it is safe because any
> > +              * races will be resolved by either gup-fast taking a
> > +              * reference or the shutdown path unmapping the pte to
> > +              * trigger gup-fast to fall back to the slow path.
> > +              */
> > +             page = pfn_to_page(pfn);
> >
> > -             pgmap = get_dev_pagemap(pfn, pgmap);
> > -             if (unlikely(!pgmap)) {
> > -                     undo_dev_pagemap(nr, nr_start, flags, pages);
> > -                     return 0;
> > -             }
> >              SetPageReferenced(page);
> >              pages[*nr] = page;
> >              if (unlikely(!try_grab_page(page, flags))) {
>
> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
> try_grab_page() for checking pgmap type to see if we are in a device-dax
> longterm pin?
>

Yes. I still need to answer the question of whether mapping invalidation
triggers longterm pin holders to relinquish their hold, but that's a
problem regardless of whether gup-fast is supported or not.
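A minimal sketch of the kind of check Joao is suggesting, assuming the
reference taken by try_grab_page() is what keeps page->pgmap stable. The
helper name and the exact type test are assumptions for illustration,
not the eventual patch:

	static bool gup_fast_devmap_longterm_ok(struct page *page,
						unsigned int flags)
	{
		if (!(flags & FOLL_LONGTERM))
			return true;
		/*
		 * fsdax pages cannot back a longterm pin; device-dax
		 * pages (a different pgmap type) could be allowed.
		 */
		return page->pgmap->type != MEMORY_DEVICE_FS_DAX;
	}

gup-fast would call something like this right after a successful
try_grab_page() and undo the grab / fall back to the slow path when it
returns false.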
Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
On 3/18/21 4:08 AM, Dan Williams wrote:
> Now that device-dax and filesystem-dax are guaranteed to unmap all user
> mappings of devmap / DAX pages before tearing down the 'struct page'
> array, get_user_pages_fast() can rely on its traditional synchronization
> method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> device shutdown. Specifically the unmap guarantee ensures that gup-fast
> either succeeds in taking a page reference (lock-less), or it detects a
> need to fall back to the slow path where the device presence can be
> revalidated with locks held.

[...]

> @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>                               unsigned long end, unsigned int flags,
>                               struct page **pages, int *nr)
>  {
>       int nr_start = *nr;
> -     struct dev_pagemap *pgmap = NULL;
>
>       do {
> -             struct page *page = pfn_to_page(pfn);
> +             struct page *page;
> +
> +             /*
> +              * Typically pfn_to_page() on a devmap pfn is not safe
> +              * without holding a live reference on the hosting
> +              * pgmap. In the gup-fast path it is safe because any
> +              * races will be resolved by either gup-fast taking a
> +              * reference or the shutdown path unmapping the pte to
> +              * trigger gup-fast to fall back to the slow path.
> +              */
> +             page = pfn_to_page(pfn);
>
> -             pgmap = get_dev_pagemap(pfn, pgmap);
> -             if (unlikely(!pgmap)) {
> -                     undo_dev_pagemap(nr, nr_start, flags, pages);
> -                     return 0;
> -             }
>              SetPageReferenced(page);
>              pages[*nr] = page;
>              if (unlikely(!try_grab_page(page, flags))) {

So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
try_grab_page() for checking pgmap type to see if we are in a device-dax
longterm pin?

Joao

[0] https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10f...@oracle.com/