Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-02 Thread Darrick J. Wong
On Wed, Aug 03, 2022 at 02:43:20AM +, ruansy.f...@fujitsu.com wrote:
> 
> 在 2022/7/19 6:56, Dan Williams 写道:
> > Darrick J. Wong wrote:
> >> On Thu, Jul 14, 2022 at 11:21:44AM -0700, Dan Williams wrote:
> >>> ruansy.f...@fujitsu.com wrote:
>  This patch is inspired by Dan's "mm, dax, pmem: Introduce
>  dev_pagemap_failure()"[1].  With the help of dax_holder and
>  ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>  (or mapped device) on it to unmap all files in use and notify processes
>  who are using those files.
> 
>  Call trace:
>  trigger unbind
>    -> unbind_store()
> -> ... (skip)
>  -> devres_release_all()   # was pmem driver ->remove() in v1
>   -> kill_dax()
>    -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, 
>  MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> 
>  Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>  event.  So do not shutdown filesystem directly if something not
>  supported, or if failure range includes metadata area.  Make sure all
>  files and processes are handled correctly.
> 
>  ==
>  Changes since v5:
> 1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
> 2. hold s_umount before sync_filesystem()
> 3. move sync_filesystem() after SB_BORN check
> 4. Rebased on next-20220714
> 
>  Changes since v4:
> 1. sync_filesystem() at the beginning when MF_MEM_REMOVE
> 2. Rebased on next-20220706
> 
>  [1]: 
>  https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> 
>  Signed-off-by: Shiyang Ruan 
>  ---
>    drivers/dax/super.c |  3 ++-
>    fs/xfs/xfs_notify_failure.c | 15 +++
>    include/linux/mm.h  |  1 +
>    3 files changed, 18 insertions(+), 1 deletion(-)
> 
>  diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>  index 9b5e2a5eb0ae..cf9a64563fbe 100644
>  --- a/drivers/dax/super.c
>  +++ b/drivers/dax/super.c
>  @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>    
>   if (dax_dev->holder_data != NULL)
>  -dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>  +dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>  +MF_MEM_PRE_REMOVE);
>    
>   clear_bit(DAXDEV_ALIVE, _dev->flags);
>   synchronize_srcu(_srcu);
>  diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>  index 69d9c83ea4b2..6da6747435eb 100644
>  --- a/fs/xfs/xfs_notify_failure.c
>  +++ b/fs/xfs/xfs_notify_failure.c
>  @@ -76,6 +76,9 @@ xfs_dax_failure_fn(
>    
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | 
>  XFS_RMAP_BMBT_BLOCK))) {
>  +/* Do not shutdown so early when device is to be 
>  removed */
>  +if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>  +return 0;
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
>   }
>  @@ -174,12 +177,22 @@ xfs_dax_notify_failure(
>   struct xfs_mount*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
>  +int error;
>    
>   if (!(mp->m_sb.sb_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for 
>  notify_failure()!");
>   return -EIO;
>   }
>    
>  +if (mf_flags & MF_MEM_PRE_REMOVE) {
>  +xfs_info(mp, "device is about to be removed!");
>  +down_write(>m_super->s_umount);
>  +error = sync_filesystem(mp->m_super);
>  +up_write(>m_super->s_umount);
> >>>
> >>> Are all mappings invalidated after this point?
> >>
> >> No; all this step does is pushes dirty filesystem [meta]data to pmem
> >> before we lose DAXDEV_ALIVE...
> >>
> >>> The goal of the removal notification is to invalidate all DAX mappings
> >>> that are no pointing to pfns that do not exist anymore, so just syncing
> >>> does not seem like enough, and the shutdown is skipped above. What am I
> >>> missing?
> >>
> >> ...however, the shutdown above only applies to filesystem metadata.  In
> >> effect, we avoid the fs shutdown in MF_MEM_PRE_REMOVE mode, which
> >> enables the mf_dax_kill_procs calls to proceed against mapped file data.
> >> I have a nagging suspicion that in non-PREREMOVE mode, we can end up
> >> shutting down the filesytem on an xattr block and the 'return
> >> -EFSCORRUPTED' actually prevents us 

Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-08-02 Thread ruansy.f...@fujitsu.com

在 2022/7/19 6:56, Dan Williams 写道:
> Darrick J. Wong wrote:
>> On Thu, Jul 14, 2022 at 11:21:44AM -0700, Dan Williams wrote:
>>> ruansy.f...@fujitsu.com wrote:
 This patch is inspired by Dan's "mm, dax, pmem: Introduce
 dev_pagemap_failure()"[1].  With the help of dax_holder and
 ->notify_failure() mechanism, the pmem driver is able to ask filesystem
 (or mapped device) on it to unmap all files in use and notify processes
 who are using those files.

 Call trace:
 trigger unbind
   -> unbind_store()
-> ... (skip)
 -> devres_release_all()   # was pmem driver ->remove() in v1
  -> kill_dax()
   -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()

 Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
 event.  So do not shutdown filesystem directly if something not
 supported, or if failure range includes metadata area.  Make sure all
 files and processes are handled correctly.

 ==
 Changes since v5:
1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
2. hold s_umount before sync_filesystem()
3. move sync_filesystem() after SB_BORN check
4. Rebased on next-20220714

 Changes since v4:
1. sync_filesystem() at the beginning when MF_MEM_REMOVE
2. Rebased on next-20220706

 [1]: 
 https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

 Signed-off-by: Shiyang Ruan 
 ---
   drivers/dax/super.c |  3 ++-
   fs/xfs/xfs_notify_failure.c | 15 +++
   include/linux/mm.h  |  1 +
   3 files changed, 18 insertions(+), 1 deletion(-)

 diff --git a/drivers/dax/super.c b/drivers/dax/super.c
 index 9b5e2a5eb0ae..cf9a64563fbe 100644
 --- a/drivers/dax/super.c
 +++ b/drivers/dax/super.c
 @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
   
if (dax_dev->holder_data != NULL)
 -  dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
 +  dax_holder_notify_failure(dax_dev, 0, U64_MAX,
 +  MF_MEM_PRE_REMOVE);
   
clear_bit(DAXDEV_ALIVE, _dev->flags);
synchronize_srcu(_srcu);
 diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
 index 69d9c83ea4b2..6da6747435eb 100644
 --- a/fs/xfs/xfs_notify_failure.c
 +++ b/fs/xfs/xfs_notify_failure.c
 @@ -76,6 +76,9 @@ xfs_dax_failure_fn(
   
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
 +  /* Do not shutdown so early when device is to be removed */
 +  if (notify->mf_flags & MF_MEM_PRE_REMOVE)
 +  return 0;
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
}
 @@ -174,12 +177,22 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
 +  int error;
   
if (!(mp->m_sb.sb_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
   
 +  if (mf_flags & MF_MEM_PRE_REMOVE) {
 +  xfs_info(mp, "device is about to be removed!");
 +  down_write(>m_super->s_umount);
 +  error = sync_filesystem(mp->m_super);
 +  up_write(>m_super->s_umount);
>>>
>>> Are all mappings invalidated after this point?
>>
>> No; all this step does is pushes dirty filesystem [meta]data to pmem
>> before we lose DAXDEV_ALIVE...
>>
>>> The goal of the removal notification is to invalidate all DAX mappings
>>> that are no pointing to pfns that do not exist anymore, so just syncing
>>> does not seem like enough, and the shutdown is skipped above. What am I
>>> missing?
>>
>> ...however, the shutdown above only applies to filesystem metadata.  In
>> effect, we avoid the fs shutdown in MF_MEM_PRE_REMOVE mode, which
>> enables the mf_dax_kill_procs calls to proceed against mapped file data.
>> I have a nagging suspicion that in non-PREREMOVE mode, we can end up
>> shutting down the filesytem on an xattr block and the 'return
>> -EFSCORRUPTED' actually prevents us from reaching all the remaining file
>> data mappings.
>>
>> IOWs, I think that clause above really ought to have returned zero so
>> that we keep the filesystem up while we're tearing down mappings, and
>> only call xfs_force_shutdown() after we've had a chance to let
>> xfs_dax_notify_ddev_failure() tear down all the mappings.
>>
>> I missed that subtlety in the initial ~30 rounds of review, but I figure
>> at this point let's just land it in 5.20 and clean up that quirk for
>> 

[PATCH v7] x86/mce: retrieve poison range from hardware

2022-08-02 Thread Jane Chu
With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
poison granularity") that changed nfit_handle_mce() callback to report
badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
2 back-to-back poisons and the driver ends up logging 8 badblocks,
because 0x1000 bytes is 8 512-byte.

Dan Williams noticed that apei_mce_report_mem_error() hardcode
the LSB field to PAGE_SHIFT instead of consulting the input
struct cper_sec_mem_err record.  So change to rely on hardware whenever
support is available.

Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com

Reviewed-by: Dan Williams 
Reviewed-by: Ingo Molnar 
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..8ed341714686 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,15 +29,26 @@
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
struct mce m;
+   int lsb;
 
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
 
+   /*
+* Even if the ->validation_bits are set for address mask,
+* to be extra safe, check and reject an error radius '0',
+* and fall back to the default page size.
+*/
+   if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
+   lsb = find_first_bit((void *)_err->physical_addr_mask, 
PAGE_SHIFT);
+   else
+   lsb = PAGE_SHIFT;
+
mce_setup();
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 
MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH v6] x86/mce: retrieve poison range from hardware

2022-08-02 Thread Jane Chu
On 8/2/2022 3:59 AM, Ingo Molnar wrote:
> 
> * Jane Chu  wrote:
> 
>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>> poison granularity") that changed nfit_handle_mce() callback to report
>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>> because 0x1000 bytes is 8 512-byte.
>>
>> Dan Williams noticed that apei_mce_report_mem_error() hardcode
>> the LSB field to PAGE_SHIFT instead of consulting the input
>> struct cper_sec_mem_err record.  So change to rely on hardware whenever
>> support is available.
>>
>> Link: 
>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>
>> Reviewed-by: Dan Williams 
>> Signed-off-by: Jane Chu 
>> ---
>>   arch/x86/kernel/cpu/mce/apei.c | 14 +-
>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
>> index 717192915f28..e2439c7872ba 100644
>> --- a/arch/x86/kernel/cpu/mce/apei.c
>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>> @@ -29,15 +29,27 @@
>>   void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
>> *mem_err)
>>   {
>>  struct mce m;
>> +int lsb;
>>   
>>  if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>  return;
>>   
>> +/*
>> + * Even if the ->validation_bits are set for address mask,
>> + * to be extra safe, check and reject an error radius '0',
>> + * and fallback to the default page size.
>> + */
> 
> speling nit:
> 
>s/fallback/fall back
> 
>> +if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
>> +lsb = find_first_bit((const unsigned long *)
>> +_err->physical_addr_mask, PAGE_SHIFT);
> 
> I think we can write this in a shorter form and in a single line:
> 
>   lsb = find_first_bit((void *)_err->physical_addr_mask, 
> PAGE_SHIFT);
> 
> (Ignore checkpatch if it wants to break the line.)
> 
> Untested.
> 
> Assuming my suggestion is correct and with those addressed:
> 
>Reviewed-by: Ingo Molnar 

Thanks!  Just tested the change, v7 coming next.

-jane

> 
> Thanks,
> 
>   Ingo



[PATCH] nvdimm/namespace: Fix comment typo

2022-08-02 Thread Jason Wang
The double `existing' is duplicated in the comment, remove one.

Signed-off-by: Jason Wang 
---
 drivers/nvdimm/namespace_devs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index dfade66bab73..c60ec0b373c5 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -385,7 +385,7 @@ static resource_size_t init_dpa_allocation(struct 
nd_label_id *label_id,
  *
  * BLK-space is valid as long as it does not precede a PMEM
  * allocation in a given region. PMEM-space must be contiguous
- * and adjacent to an existing existing allocation (if one
+ * and adjacent to an existing allocation (if one
  * exists).  If reserving PMEM any space is valid.
  */
 static void space_valid(struct nd_region *nd_region, struct nvdimm_drvdata 
*ndd,
-- 
2.35.1




Re: [PATCH v6] x86/mce: retrieve poison range from hardware

2022-08-02 Thread Ingo Molnar


* Jane Chu  wrote:

> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
> poison granularity") that changed nfit_handle_mce() callback to report
> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
> because 0x1000 bytes is 8 512-byte.
> 
> Dan Williams noticed that apei_mce_report_mem_error() hardcode
> the LSB field to PAGE_SHIFT instead of consulting the input
> struct cper_sec_mem_err record.  So change to rely on hardware whenever
> support is available.
> 
> Link: 
> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
> 
> Reviewed-by: Dan Williams 
> Signed-off-by: Jane Chu 
> ---
>  arch/x86/kernel/cpu/mce/apei.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index 717192915f28..e2439c7872ba 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -29,15 +29,27 @@
>  void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err 
> *mem_err)
>  {
>   struct mce m;
> + int lsb;
>  
>   if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>   return;
>  
> + /*
> +  * Even if the ->validation_bits are set for address mask,
> +  * to be extra safe, check and reject an error radius '0',
> +  * and fallback to the default page size.
> +  */

speling nit:

  s/fallback/fall back

> + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> + lsb = find_first_bit((const unsigned long *)
> + _err->physical_addr_mask, PAGE_SHIFT);

I think we can write this in a shorter form and in a single line:

lsb = find_first_bit((void *)_err->physical_addr_mask, 
PAGE_SHIFT);

(Ignore checkpatch if it wants to break the line.)

Untested.

Assuming my suggestion is correct and with those addressed:

  Reviewed-by: Ingo Molnar 

Thanks,

Ingo