Re: [PATCH v2] device-dax: use fallback nid when numa node is invalid

2021-09-14 Thread Dan Williams
On Mon, Sep 13, 2021 at 7:06 PM Justin He  wrote:
>
> Hi Dan,
>
> > -Original Message-
> > From: Dan Williams 
> > Sent: Friday, September 10, 2021 11:42 PM
> > To: Justin He 
> > Cc: Vishal Verma ; Dave Jiang
> > ; David Hildenbrand ; Linux NVDIMM
> > ; Linux Kernel Mailing List <ker...@vger.kernel.org>
> > Subject: Re: [PATCH v2] device-dax: use fallback nid when numa node is
> > invalid
> >
> > On Fri, Sep 10, 2021 at 5:46 AM Jia He  wrote:
> > >
> > > Previously, numa_off was set unconditionally in dummy_numa_init()
> > > even with a fake numa node. Then ACPI sets the node id to NUMA_NO_NODE (-1)
> > > after acpi_map_pxm_to_node() because it regards numa_off as turning
> > > off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
> > > arm64 in the fake numa case.
> > >
> > > Without this patch, pmem can't be probed as RAM devices on arm64 if
> > > SRAT table isn't present:
> > >   $ ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> > >   kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1
> > >   kmem: probe of dax0.0 failed with error -22
> > >
> > > This fixes it by using memory_add_physaddr_to_nid() as a fallback nid.
> > >
> > > Suggested-by: David Hildenbrand 
> > > Signed-off-by: Jia He 
> > > ---
> > > v2: - rebase it based on David's "memory group" patch.
> > > - drop the changes in dev_dax_kmem_remove() since nid had been
> > >   removed in remove_memory().
> > >  drivers/dax/kmem.c | 31 +--
> > >  1 file changed, 17 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > > index a37622060fff..e4836eb7539e 100644
> > > --- a/drivers/dax/kmem.c
> > > +++ b/drivers/dax/kmem.c
> > > @@ -47,20 +47,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > unsigned long total_len = 0;
> > > struct dax_kmem_data *data;
> > > int i, rc, mapped = 0;
> > > -   int numa_node;
> > > -
> > > -   /*
> > > -* Ensure good NUMA information for the persistent memory.
> > > -* Without this check, there is a risk that slow memory
> > > -* could be mixed in a node with faster memory, causing
> > > -* unavoidable performance issues.
> > > -*/
> > > -   numa_node = dev_dax->target_node;
> > > -   if (numa_node < 0) {
> > > -   dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> > > -   numa_node);
> > > -   return -EINVAL;
> > > -   }
> > > +   int numa_node = dev_dax->target_node;
> > >
> > > for (i = 0; i < dev_dax->nr_range; i++) {
> > > struct range range;
> > > @@ -71,6 +58,22 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > i, range.start, range.end);
> > > continue;
> > > }
> > > +
> > > +   /*
> > > +* Ensure good NUMA information for the persistent memory.
> > > +* Without this check, there is a risk, though not fatal, that slow
> > > +* memory could be mixed in a node with faster memory, causing
> > > +* unavoidable performance issues. Warn about this and use the
> > > +* fallback node id.
> > > +*/
> > > +   if (numa_node < 0) {
> > > +   int new_node = memory_add_physaddr_to_nid(range.start);
> > > +
> > > +   dev_info(dev, "changing nid from %d to %d for DAX region [%#llx-%#llx]\n",
> > > +numa_node, new_node, range.start, range.end);
> > > +   numa_node = new_node;
> > > +   }
> > > +
> > > +   total_len += range_len(&range);
> >
> > This fallback change belongs where the parent region for the namespace
> > adopts its target_node, because it's not clear
> > memory_add_physaddr_to_nid() is the right fallback in all situations.
> > Here is where this setting is happening currently:
> >
> > drivers/acpi/nfit/core.c:3004:  ndr_desc->target_node = pxm_to_node(spa->proximity_domain);
> On my local arm64 guest('virt' machine type), the target_node is
> set to -1 at this line.
> That is:
> The condition "spa->flags & ACPI_NFIT_PROXIMITY_VALID" is hit.
>
> > drivers/acpi/nfit/core.c:3007:  ndr_desc->target_node = NUMA_NO_NODE;
> > drivers/nvdimm/e820.c:29:   ndr_desc.target_node = nid;
> > drivers/nvdimm/of_pmem.c:58:    ndr_desc.target_node = ndr_desc.numa_node;
> > drivers/nvdimm/region_devs.c:1127:  nd_region->target_node = ndr_desc->target_node;
>
>
> Sorry, Dan. I thought I missed your previous mail:
>
> =
> Looks like it is the NFIT driver, thanks.
>
> If you're getting NUMA_NO_NODE in dax_kmem from the NFIT driver it
> means your ACPI NFIT 
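
For orientation, here is a minimal sketch of the placement Dan points at above: apply the fallback where the NFIT driver adopts the region's target_node, rather than in dev_dax_kmem_probe(). The helper name and the use of spa->address as the input to memory_add_physaddr_to_nid() are assumptions for illustration, not something settled in this thread.

static int nfit_spa_target_node(struct acpi_nfit_system_address *spa)
{
	int nid = NUMA_NO_NODE;

	/* mirror drivers/acpi/nfit/core.c: trust the proximity domain if valid */
	if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
		nid = pxm_to_node(spa->proximity_domain);

	/* hypothetical fallback when the pxm mapping yields NUMA_NO_NODE */
	if (nid == NUMA_NO_NODE)
		nid = memory_add_physaddr_to_nid(spa->address);

	return nid;
}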

Re: [PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-14 Thread Dan Williams
On Tue, Sep 14, 2021 at 4:32 PM Jane Chu  wrote:
>
> If pwrite(2) encounters poison in a pmem range, it fails with EIO.
> This is unnecessary if the hardware is capable of clearing the poison.
> 
> Not all dax backend hardware has the capability of clearing poison
> on the fly, but dax backed by Intel DCPMEM does, and it's desirable
> to, first, speed up repairing by means of it; second, maintain
> backend continuity instead of fragmenting it in search of clean
> blocks.
>
> Jane Chu (3):
>   dax: introduce dax_operation dax_clear_poison

The problem with new dax operations is that they need to be plumbed
not only through fsdax and pmem, but also through device-mapper.

In this case I think we're already covered by dax_zero_page_range().
That will ultimately trigger pmem_clear_poison() and it is routed
through device-mapper properly.

Can you clarify why the existing dax_zero_page_range() is not sufficient?
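
For comparison, a sketch of what the patch 2/3 hunk might look like if it reused the existing op as suggested; the surrounding variables (dax_dev, pgoff, size, kaddr, iter) are the ones already in dax_iomap_actor(), and this is an untested illustration, not the posted series:

		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
				&kaddr, NULL);
		/*
		 * On poison, zero (and thereby clear) the page-aligned range,
		 * assuming the write is about to overwrite it anyway, then
		 * retry the mapping once.
		 */
		if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
			if (dax_zero_page_range(dax_dev, pgoff, PHYS_PFN(size)) == 0)
				map_len = dax_direct_access(dax_dev, pgoff,
						PHYS_PFN(size), &kaddr, NULL);
		}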

>   dax: introduce dax_clear_poison to dax pwrite operation
>   libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation
>
>  drivers/dax/super.c   | 13 +
>  drivers/nvdimm/pmem.c | 17 +
>  fs/dax.c  |  9 +
>  include/linux/dax.h   |  6 ++
>  4 files changed, 45 insertions(+)
>
> --
> 2.18.4
>



Re: [RESEND PATCH v4 1/4] drivers/nvdimm: Add nvdimm pmu structure

2021-09-14 Thread Dan Williams
On Tue, Sep 14, 2021 at 9:08 PM Dan Williams  wrote:
>
> On Thu, Sep 9, 2021 at 12:56 AM kajoljain  wrote:
> >
> >
> >
> > On 9/8/21 3:29 AM, Dan Williams wrote:
> > > Hi Kajol,
> > >
> > > Apologies for the delay in responding to this series, some comments below:
> >
> > Hi Dan,
> > No issues, thanks for reviewing the patches.
> >
> > >
> > > On Thu, Sep 2, 2021 at 10:10 PM Kajol Jain  wrote:
> > >>
> > >> A structure is added, called nvdimm_pmu, for performance
> > >> stats reporting support of nvdimm devices. It can be used to add
> > >> nvdimm pmu data such as supported events and pmu event functions
> > >> like event_init/add/read/del with cpu hotplug support.
> > >>
> > >> Acked-by: Peter Zijlstra (Intel) 
> > >> Reviewed-by: Madhavan Srinivasan 
> > >> Tested-by: Nageswara R Sastry 
> > >> Signed-off-by: Kajol Jain 
> > >> ---
> > >>  include/linux/nd.h | 43 +++
> > >>  1 file changed, 43 insertions(+)
> > >>
> > >> diff --git a/include/linux/nd.h b/include/linux/nd.h
> > >> index ee9ad76afbba..712499cf7335 100644
> > >> --- a/include/linux/nd.h
> > >> +++ b/include/linux/nd.h
> > >> @@ -8,6 +8,8 @@
> > >>  #include 
> > >>  #include 
> > >>  #include 
> > >> +#include 
> > >> +#include 
> > >>
> > >>  enum nvdimm_event {
> > >> NVDIMM_REVALIDATE_POISON,
> > >> @@ -23,6 +25,47 @@ enum nvdimm_claim_class {
> > >> NVDIMM_CCLASS_UNKNOWN,
> > >>  };
> > >>
> > >> +/* Event attribute array index */
> > >> +#define NVDIMM_PMU_FORMAT_ATTR 0
> > >> +#define NVDIMM_PMU_EVENT_ATTR  1
> > >> +#define NVDIMM_PMU_CPUMASK_ATTR2
> > >> +#define NVDIMM_PMU_NULL_ATTR   3
> > >> +
> > >> +/**
> > >> + * struct nvdimm_pmu - data structure for nvdimm perf driver
> > >> + *
> > >> + * @name: name of the nvdimm pmu device.
> > >> + * @pmu: pmu data structure for nvdimm performance stats.
> > >> + * @dev: nvdimm device pointer.
> > >> + * @functions(event_init/add/del/read): platform specific pmu functions.
> > >
> > > This is not valid kernel-doc:
> > >
> > > include/linux/nd.h:67: warning: Function parameter or member
> > > 'event_init' not described in 'nvdimm_pmu'
> > > include/linux/nd.h:67: warning: Function parameter or member 'add' not
> > > described in 'nvdimm_pmu'
> > > include/linux/nd.h:67: warning: Function parameter or member 'del' not
> > > described in 'nvdimm_pmu'
> > > include/linux/nd.h:67: warning: Function parameter or member 'read'
> > > not described in 'nvdimm_pmu'
> > >
> > > ...but I think rather than fixing those up 'struct nvdimm_pmu' should be pruned.
> > >
> > > It's not clear to me that it is worth the effort to describe these
> > > details to the nvdimm core which is just going to turn around and call
> > > the pmu core. I'd just as soon have the driver call the pmu core
> > > directly, optionally passing in attributes and callbacks that come
> > > from the nvdimm core and/or the nvdimm provider.
> >
> > The intent of adding these callbacks (event_init/add/del/read) is to give
> > the nvdimm core the flexibility to add common checks/routines if required
> > in the future. Those checks could be common to all architectures while
> > still having the ability to call arch/platform specific driver code for
> > its own routines.
> >
> > But as you said, currently we don't have any common checks and it is
> > directly calling platform specific code, so we can get rid of it.
> > Should we remove this part for now?
>
> Yes, let's go direct to the perf API for now and await the need for a
> common core wrapper to present itself.
>
> >
> >
> > >
> > > Otherwise it's also not clear which of these structure members are
> > > used at runtime vs purely used as temporary storage to pass parameters
> > > to the pmu core.
> > >
> > >> + * @attr_groups: data structure for events, formats and cpumask
> > >> + * @cpu: designated cpu for counter access.
> > >> + * @node: node for cpu hotplug notifier link.
> > >> + * @cpuhp_state: state for cpu hotplug notification.
> > >> + * @arch_cpumask: cpumask to get designated cpu for counter access.
> > >> + */
> > >> +struct nvdimm_pmu {
> > >> +   const char *name;
> > >> +   struct pmu pmu;
> > >> +   struct device *dev;
> > >> +   int (*event_init)(struct perf_event *event);
> > >> +   int  (*add)(struct perf_event *event, int flags);
> > >> +   void (*del)(struct perf_event *event, int flags);
> > >> +   void (*read)(struct perf_event *event);
> > >> +   /*
> > >> +* Attribute groups for the nvdimm pmu. Index 0 used for
> > >> +* format attribute, index 1 used for event attribute,
> > >> +* index 2 used for cpumask attribute and index 3 kept as NULL.
> > >> +*/
> > >> +   const struct attribute_group *attr_groups[4];
> > >
> > > Following from above, I'd rather this was organized as static
> > > attributes with an is_visible() helper for the groups for any dynamic
> > > aspects. That 

Re: [RESEND PATCH v4 1/4] drivers/nvdimm: Add nvdimm pmu structure

2021-09-14 Thread Dan Williams
On Thu, Sep 9, 2021 at 12:56 AM kajoljain  wrote:
>
>
>
> On 9/8/21 3:29 AM, Dan Williams wrote:
> > Hi Kajol,
> >
> > Apologies for the delay in responding to this series, some comments below:
>
> Hi Dan,
> No issues, thanks for reviewing the patches.
>
> >
> > On Thu, Sep 2, 2021 at 10:10 PM Kajol Jain  wrote:
> >>
> >> A structure is added, called nvdimm_pmu, for performance
> >> stats reporting support of nvdimm devices. It can be used to add
> >> nvdimm pmu data such as supported events and pmu event functions
> >> like event_init/add/read/del with cpu hotplug support.
> >>
> >> Acked-by: Peter Zijlstra (Intel) 
> >> Reviewed-by: Madhavan Srinivasan 
> >> Tested-by: Nageswara R Sastry 
> >> Signed-off-by: Kajol Jain 
> >> ---
> >>  include/linux/nd.h | 43 +++
> >>  1 file changed, 43 insertions(+)
> >>
> >> diff --git a/include/linux/nd.h b/include/linux/nd.h
> >> index ee9ad76afbba..712499cf7335 100644
> >> --- a/include/linux/nd.h
> >> +++ b/include/linux/nd.h
> >> @@ -8,6 +8,8 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >> +#include 
> >>
> >>  enum nvdimm_event {
> >> NVDIMM_REVALIDATE_POISON,
> >> @@ -23,6 +25,47 @@ enum nvdimm_claim_class {
> >> NVDIMM_CCLASS_UNKNOWN,
> >>  };
> >>
> >> +/* Event attribute array index */
> >> +#define NVDIMM_PMU_FORMAT_ATTR 0
> >> +#define NVDIMM_PMU_EVENT_ATTR  1
> >> +#define NVDIMM_PMU_CPUMASK_ATTR2
> >> +#define NVDIMM_PMU_NULL_ATTR   3
> >> +
> >> +/**
> >> + * struct nvdimm_pmu - data structure for nvdimm perf driver
> >> + *
> >> + * @name: name of the nvdimm pmu device.
> >> + * @pmu: pmu data structure for nvdimm performance stats.
> >> + * @dev: nvdimm device pointer.
> >> + * @functions(event_init/add/del/read): platform specific pmu functions.
> >
> > This is not valid kernel-doc:
> >
> > include/linux/nd.h:67: warning: Function parameter or member
> > 'event_init' not described in 'nvdimm_pmu'
> > include/linux/nd.h:67: warning: Function parameter or member 'add' not
> > described in 'nvdimm_pmu'
> > include/linux/nd.h:67: warning: Function parameter or member 'del' not
> > described in 'nvdimm_pmu'
> > include/linux/nd.h:67: warning: Function parameter or member 'read'
> > not described in 'nvdimm_pmu'
> >
> > ...but I think rather than fixing those up 'struct nvdimm_pmu' should be pruned.
> >
> > It's not clear to me that it is worth the effort to describe these
> > details to the nvdimm core which is just going to turn around and call
> > the pmu core. I'd just as soon have the driver call the pmu core
> > directly, optionally passing in attributes and callbacks that come
> > from the nvdimm core and/or the nvdimm provider.
>
> The intent of adding these callbacks (event_init/add/del/read) is to give
> the nvdimm core the flexibility to add common checks/routines if required
> in the future. Those checks could be common to all architectures while
> still having the ability to call arch/platform specific driver code for
> its own routines.
>
> But as you said, currently we don't have any common checks and it is
> directly calling platform specific code, so we can get rid of it.
> Should we remove this part for now?

Yes, let's go direct to the perf API for now and await the need for a
common core wrapper to present itself.
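
As a reference point, a minimal sketch of what "going direct to the perf API" could look like for an nvdimm provider driver; the callback bodies, the "nmem0" name, and the stubbed behaviour are assumptions for illustration only:

#include <linux/perf_event.h>

static int nd_pmu_event_init(struct perf_event *event)
{
	/* validate event->attr against the device's supported events */
	return 0;
}

static int nd_pmu_add(struct perf_event *event, int flags)
{
	/* program and start the hardware counter */
	return 0;
}

static void nd_pmu_del(struct perf_event *event, int flags)
{
	/* stop the counter and record the final count */
}

static void nd_pmu_read(struct perf_event *event)
{
	/* update event->count from the hardware counter */
}

static struct pmu nd_provider_pmu = {
	.task_ctx_nr = perf_invalid_context,	/* uncore-style PMU */
	.event_init  = nd_pmu_event_init,
	.add         = nd_pmu_add,
	.del         = nd_pmu_del,
	.read        = nd_pmu_read,
	/* .attr_groups supplied by the provider (events/format/cpumask) */
};

static int nd_provider_register_pmu(void)
{
	/* type -1 asks perf to allocate a dynamic PMU type id */
	return perf_pmu_register(&nd_provider_pmu, "nmem0", -1);
}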

>
>
> >
> > Otherwise it's also not clear which of these structure members are
> > used at runtime vs purely used as temporary storage to pass parameters
> > to the pmu core.
> >
> >> + * @attr_groups: data structure for events, formats and cpumask
> >> + * @cpu: designated cpu for counter access.
> >> + * @node: node for cpu hotplug notifier link.
> >> + * @cpuhp_state: state for cpu hotplug notification.
> >> + * @arch_cpumask: cpumask to get designated cpu for counter access.
> >> + */
> >> +struct nvdimm_pmu {
> >> +   const char *name;
> >> +   struct pmu pmu;
> >> +   struct device *dev;
> >> +   int (*event_init)(struct perf_event *event);
> >> +   int  (*add)(struct perf_event *event, int flags);
> >> +   void (*del)(struct perf_event *event, int flags);
> >> +   void (*read)(struct perf_event *event);
> >> +   /*
> >> +* Attribute groups for the nvdimm pmu. Index 0 used for
> >> +* format attribute, index 1 used for event attribute,
> >> +* index 2 used for cpumask attribute and index 3 kept as NULL.
> >> +*/
> >> +   const struct attribute_group *attr_groups[4];
> >
> > Following from above, I'd rather this was organized as static
> > attributes with an is_visible() helper for the groups for any dynamic
> > aspects. That mirrors the behavior of nvdimm_create() and allows for
> > device drivers to compose the attribute groups from a core set and /
> > or a provider specific set.
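
A small sketch of the static-attributes layout being described, with an is_visible() hook covering the dynamic part; the attribute names and the capability check are hypothetical:

#include <linux/device.h>
#include <linux/sysfs.h>

static struct attribute *nd_pmu_event_attrs[] = {
	/* &nd_pmu_event_attr_media_reads.attr.attr, ... (provider events) */
	NULL,
};

static umode_t nd_pmu_event_visible(struct kobject *kobj,
				    struct attribute *attr, int n)
{
	/* return 0 here to hide an event the device cannot count */
	return attr->mode;
}

static const struct attribute_group nd_pmu_events_group = {
	.name       = "events",
	.attrs      = nd_pmu_event_attrs,
	.is_visible = nd_pmu_event_visible,
};

static const struct attribute_group *nd_pmu_attr_groups[] = {
	&nd_pmu_events_group,
	NULL,
};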
>
> Since we don't have any common events right now, can I use papr
> attributes directly or should we create dummy events for common things and 
> 

[PATCH 3/3] libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation

2021-09-14 Thread Jane Chu
Provide pmem_dax_clear_poison() to struct dax_operations.clear_poison.

Signed-off-by: Jane Chu 
---
 drivers/nvdimm/pmem.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 1e0615b8565e..307a53aa3432 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -294,6 +294,22 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
   PAGE_SIZE));
 }
 
+static int pmem_dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
+   size_t nr_pages)
+{
+   unsigned int len = PFN_PHYS(nr_pages);
+   sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT;
+   struct pmem_device *pmem = dax_get_private(dax_dev);
+   phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
+   blk_status_t ret;
+
+   if (!is_bad_pmem(&pmem->bb, sector, len))
+   return 0;
+
+   ret = pmem_clear_poison(pmem, pmem_off, len);
+   return (ret == BLK_STS_OK) ? 0 : -EIO;
+}
+
 static long pmem_dax_direct_access(struct dax_device *dax_dev,
pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
 {
@@ -326,6 +342,7 @@ static const struct dax_operations pmem_dax_ops = {
.copy_from_iter = pmem_copy_from_iter,
.copy_to_iter = pmem_copy_to_iter,
.zero_page_range = pmem_dax_zero_page_range,
+   .clear_poison = pmem_dax_clear_poison,
 };
 
 static const struct attribute_group *pmem_attribute_groups[] = {
-- 
2.18.4




[PATCH 0/3] dax: clear poison on the fly along pwrite

2021-09-14 Thread Jane Chu
If pwrite(2) encounters poison in a pmem range, it fails with EIO.
This is unnecessary if the hardware is capable of clearing the poison.

Not all dax backend hardware has the capability of clearing poison
on the fly, but dax backed by Intel DCPMEM does, and it's desirable
to, first, speed up repairing by means of it; second, maintain
backend continuity instead of fragmenting it in search of clean
blocks.
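
For context, the failure mode being addressed looks roughly like this from userspace (illustrative only; the file path is hypothetical):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* page-aligned buffer and offset, targeting a file on a DAX mount */
	char buf[4096] __attribute__((aligned(4096)));
	int fd = open("/mnt/dax/file", O_RDWR);

	if (fd < 0)
		return 1;
	memset(buf, 0xab, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) < 0 && errno == EIO)
		fprintf(stderr, "poisoned range: pwrite failed with EIO\n");
	close(fd);
	return 0;
}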

Jane Chu (3):
  dax: introduce dax_operation dax_clear_poison
  dax: introduce dax_clear_poison to dax pwrite operation
  libnvdimm/pmem: Provide pmem_dax_clear_poison for dax operation

 drivers/dax/super.c   | 13 +
 drivers/nvdimm/pmem.c | 17 +
 fs/dax.c  |  9 +
 include/linux/dax.h   |  6 ++
 4 files changed, 45 insertions(+)

-- 
2.18.4




[PATCH 2/3] dax: introduce dax_clear_poison to dax pwrite operation

2021-09-14 Thread Jane Chu
When pwrite(2) encounters poison in a dax range, it fails with EIO.
But if the backend hardware of the dax device is capable of clearing
poison, try that and resume the write.

Signed-off-by: Jane Chu 
---
 fs/dax.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 99b4e78d888f..592a156abbf2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1156,8 +1156,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (ret)
break;
 
+   /*
+* If WRITE operation encounters media error in a page aligned
+* range, try to clear the error, then resume, for just once.
+*/
        map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
                                    &kaddr, NULL);
+       if ((map_len == -EIO) && (iov_iter_rw(iter) == WRITE)) {
+               if (dax_clear_poison(dax_dev, pgoff, PHYS_PFN(size)) == 0)
+                       map_len = dax_direct_access(dax_dev, pgoff,
+                                       PHYS_PFN(size), &kaddr, NULL);
+   }
if (map_len < 0) {
ret = map_len;
break;
-- 
2.18.4




[PATCH 1/3] dax: introduce dax_operation dax_clear_poison

2021-09-14 Thread Jane Chu
Not all dax backend hardware has the capability of clearing poison
on the fly, but dax backed by Intel DCPMEM does, and it's desirable
to, first, speed up repairing by means of it; second, maintain
backend continuity instead of fragmenting it in search of clean
blocks.

Signed-off-by: Jane Chu 
---
 drivers/dax/super.c | 13 +
 include/linux/dax.h |  6 ++
 2 files changed, 19 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 44736cbd446e..935d496fa7db 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -373,6 +373,19 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+int dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff,
+   size_t nr_pages)
+{
+   if (!dax_alive(dax_dev))
+   return -ENXIO;
+
+   if (!dax_dev->ops->clear_poison)
+   return -EOPNOTSUPP;
+
+   return dax_dev->ops->clear_poison(dax_dev, pgoff, nr_pages);
+}
+EXPORT_SYMBOL_GPL(dax_clear_poison);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..c54c1087ece1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,6 +36,11 @@ struct dax_operations {
struct iov_iter *);
/* zero_page_range: required operation. Zero page range   */
int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
+   /*
+* clear_poison: clear media error in the given page aligned range via
+* vendor appropriate method. Optional operation.
+*/
+   int (*clear_poison)(struct dax_device *, pgoff_t, size_t);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -226,6 +231,7 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
size_t nr_pages);
+int dax_clear_poison(struct dax_device *dax_dev, pgoff_t pgoff, size_t nr_pages);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.18.4




[PATCH 2/3] dax: introduce dax clear poison to page aligned dax pwrite operation

2021-09-14 Thread Jane Chu
Currently, when pwrite(2) is issued to a dax range that contains poison,
the pwrite(2) fails with EIO. But if the hardware backend of the
dax device is capable of clearing poison, try that and resume the write.

Signed-off-by: Jane Chu 
---
 fs/dax.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 99b4e78d888f..592a156abbf2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1156,8 +1156,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (ret)
break;
 
+   /*
+* If WRITE operation encounters media error in a page aligned
+* range, try to clear the error, then resume, for just once.
+*/
        map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
                                    &kaddr, NULL);
+       if ((map_len == -EIO) && (iov_iter_rw(iter) == WRITE)) {
+               if (dax_clear_poison(dax_dev, pgoff, PHYS_PFN(size)) == 0)
+                       map_len = dax_direct_access(dax_dev, pgoff,
+                                       PHYS_PFN(size), &kaddr, NULL);
+   }
if (map_len < 0) {
ret = map_len;
break;
-- 
2.18.4