Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
On Wed, Aug 03, 2022 at 02:43:20AM +, ruansy.f...@fujitsu.com wrote: > > 在 2022/7/19 6:56, Dan Williams 写道: > > Darrick J. Wong wrote: > >> On Thu, Jul 14, 2022 at 11:21:44AM -0700, Dan Williams wrote: > >>> ruansy.f...@fujitsu.com wrote: > This patch is inspired by Dan's "mm, dax, pmem: Introduce > dev_pagemap_failure()"[1]. With the help of dax_holder and > ->notify_failure() mechanism, the pmem driver is able to ask filesystem > (or mapped device) on it to unmap all files in use and notify processes > who are using those files. > > Call trace: > trigger unbind > -> unbind_store() > -> ... (skip) > -> devres_release_all() # was pmem driver ->remove() in v1 > -> kill_dax() > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, > MF_MEM_PRE_REMOVE) > -> xfs_dax_notify_failure() > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove > event. So do not shutdown filesystem directly if something not > supported, or if failure range includes metadata area. Make sure all > files and processes are handled correctly. > > == > Changes since v5: > 1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE > 2. hold s_umount before sync_filesystem() > 3. move sync_filesystem() after SB_BORN check > 4. Rebased on next-20220714 > > Changes since v4: > 1. sync_filesystem() at the beginning when MF_MEM_REMOVE > 2. Rebased on next-20220706 > > [1]: > https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/ > > Signed-off-by: Shiyang Ruan > --- > drivers/dax/super.c | 3 ++- > fs/xfs/xfs_notify_failure.c | 15 +++ > include/linux/mm.h | 1 + > 3 files changed, 18 insertions(+), 1 deletion(-) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index 9b5e2a5eb0ae..cf9a64563fbe 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) > return; > > if (dax_dev->holder_data != NULL) > -dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); > +dax_holder_notify_failure(dax_dev, 0, U64_MAX, > +MF_MEM_PRE_REMOVE); > > clear_bit(DAXDEV_ALIVE, _dev->flags); > synchronize_srcu(_srcu); > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c > index 69d9c83ea4b2..6da6747435eb 100644 > --- a/fs/xfs/xfs_notify_failure.c > +++ b/fs/xfs/xfs_notify_failure.c > @@ -76,6 +76,9 @@ xfs_dax_failure_fn( > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | > XFS_RMAP_BMBT_BLOCK))) { > +/* Do not shutdown so early when device is to be > removed */ > +if (notify->mf_flags & MF_MEM_PRE_REMOVE) > +return 0; > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); > return -EFSCORRUPTED; > } > @@ -174,12 +177,22 @@ xfs_dax_notify_failure( > struct xfs_mount*mp = dax_holder(dax_dev); > u64 ddev_start; > u64 ddev_end; > +int error; > > if (!(mp->m_sb.sb_flags & SB_BORN)) { > xfs_warn(mp, "filesystem is not ready for > notify_failure()!"); > return -EIO; > } > > +if (mf_flags & MF_MEM_PRE_REMOVE) { > +xfs_info(mp, "device is about to be removed!"); > +down_write(>m_super->s_umount); > +error = sync_filesystem(mp->m_super); > +up_write(>m_super->s_umount); > >>> > >>> Are all mappings invalidated after this point? > >> > >> No; all this step does is pushes dirty filesystem [meta]data to pmem > >> before we lose DAXDEV_ALIVE... > >> > >>> The goal of the removal notification is to invalidate all DAX mappings > >>> that are no pointing to pfns that do not exist anymore, so just syncing > >>> does not seem like enough, and the shutdown is skipped above. What am I > >>> missing? > >> > >> ...however, the shutdown above only applies to filesystem metadata. In > >> effect, we avoid the fs shutdown in MF_MEM_PRE_REMOVE mode, which > >> enables the mf_dax_kill_procs calls to proceed against mapped file data. > >> I have a nagging suspicion that in non-PREREMOVE mode, we can end up > >> shutting down the filesytem on an xattr block and the 'return > >> -EFSCORRUPTED' actually prevents us
Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
在 2022/7/19 6:56, Dan Williams 写道: > Darrick J. Wong wrote: >> On Thu, Jul 14, 2022 at 11:21:44AM -0700, Dan Williams wrote: >>> ruansy.f...@fujitsu.com wrote: This patch is inspired by Dan's "mm, dax, pmem: Introduce dev_pagemap_failure()"[1]. With the help of dax_holder and ->notify_failure() mechanism, the pmem driver is able to ask filesystem (or mapped device) on it to unmap all files in use and notify processes who are using those files. Call trace: trigger unbind -> unbind_store() -> ... (skip) -> devres_release_all() # was pmem driver ->remove() in v1 -> kill_dax() -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) -> xfs_dax_notify_failure() Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove event. So do not shutdown filesystem directly if something not supported, or if failure range includes metadata area. Make sure all files and processes are handled correctly. == Changes since v5: 1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE 2. hold s_umount before sync_filesystem() 3. move sync_filesystem() after SB_BORN check 4. Rebased on next-20220714 Changes since v4: 1. sync_filesystem() at the beginning when MF_MEM_REMOVE 2. Rebased on next-20220706 [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/ Signed-off-by: Shiyang Ruan --- drivers/dax/super.c | 3 ++- fs/xfs/xfs_notify_failure.c | 15 +++ include/linux/mm.h | 1 + 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 9b5e2a5eb0ae..cf9a64563fbe 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) return; if (dax_dev->holder_data != NULL) - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); + dax_holder_notify_failure(dax_dev, 0, U64_MAX, + MF_MEM_PRE_REMOVE); clear_bit(DAXDEV_ALIVE, _dev->flags); synchronize_srcu(_srcu); diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index 69d9c83ea4b2..6da6747435eb 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -76,6 +76,9 @@ xfs_dax_failure_fn( if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { + /* Do not shutdown so early when device is to be removed */ + if (notify->mf_flags & MF_MEM_PRE_REMOVE) + return 0; xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); return -EFSCORRUPTED; } @@ -174,12 +177,22 @@ xfs_dax_notify_failure( struct xfs_mount*mp = dax_holder(dax_dev); u64 ddev_start; u64 ddev_end; + int error; if (!(mp->m_sb.sb_flags & SB_BORN)) { xfs_warn(mp, "filesystem is not ready for notify_failure()!"); return -EIO; } + if (mf_flags & MF_MEM_PRE_REMOVE) { + xfs_info(mp, "device is about to be removed!"); + down_write(>m_super->s_umount); + error = sync_filesystem(mp->m_super); + up_write(>m_super->s_umount); >>> >>> Are all mappings invalidated after this point? >> >> No; all this step does is pushes dirty filesystem [meta]data to pmem >> before we lose DAXDEV_ALIVE... >> >>> The goal of the removal notification is to invalidate all DAX mappings >>> that are no pointing to pfns that do not exist anymore, so just syncing >>> does not seem like enough, and the shutdown is skipped above. What am I >>> missing? >> >> ...however, the shutdown above only applies to filesystem metadata. In >> effect, we avoid the fs shutdown in MF_MEM_PRE_REMOVE mode, which >> enables the mf_dax_kill_procs calls to proceed against mapped file data. >> I have a nagging suspicion that in non-PREREMOVE mode, we can end up >> shutting down the filesytem on an xattr block and the 'return >> -EFSCORRUPTED' actually prevents us from reaching all the remaining file >> data mappings. >> >> IOWs, I think that clause above really ought to have returned zero so >> that we keep the filesystem up while we're tearing down mappings, and >> only call xfs_force_shutdown() after we've had a chance to let >> xfs_dax_notify_ddev_failure() tear down all the mappings. >> >> I missed that subtlety in the initial ~30 rounds of review, but I figure >> at this point let's just land it in 5.20 and clean up that quirk for >>
[PATCH v7] x86/mce: retrieve poison range from hardware
With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity") that changed nfit_handle_mce() callback to report badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting 2 back-to-back poisons and the driver ends up logging 8 badblocks, because 0x1000 bytes is 8 512-byte. Dan Williams noticed that apei_mce_report_mem_error() hardcode the LSB field to PAGE_SHIFT instead of consulting the input struct cper_sec_mem_err record. So change to rely on hardware whenever support is available. Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com Reviewed-by: Dan Williams Reviewed-by: Ingo Molnar Signed-off-by: Jane Chu --- arch/x86/kernel/cpu/mce/apei.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c index 717192915f28..8ed341714686 100644 --- a/arch/x86/kernel/cpu/mce/apei.c +++ b/arch/x86/kernel/cpu/mce/apei.c @@ -29,15 +29,26 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err) { struct mce m; + int lsb; if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) return; + /* +* Even if the ->validation_bits are set for address mask, +* to be extra safe, check and reject an error radius '0', +* and fall back to the default page size. +*/ + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) + lsb = find_first_bit((void *)_err->physical_addr_mask, PAGE_SHIFT); + else + lsb = PAGE_SHIFT; + mce_setup(); m.bank = -1; /* Fake a memory read error with unknown channel */ m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f; - m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT; + m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb; if (severity >= GHES_SEV_RECOVERABLE) m.status |= MCI_STATUS_UC; -- 2.18.4
Re: [PATCH v6] x86/mce: retrieve poison range from hardware
On 8/2/2022 3:59 AM, Ingo Molnar wrote: > > * Jane Chu wrote: > >> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine >> poison granularity") that changed nfit_handle_mce() callback to report >> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been >> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting >> 2 back-to-back poisons and the driver ends up logging 8 badblocks, >> because 0x1000 bytes is 8 512-byte. >> >> Dan Williams noticed that apei_mce_report_mem_error() hardcode >> the LSB field to PAGE_SHIFT instead of consulting the input >> struct cper_sec_mem_err record. So change to rely on hardware whenever >> support is available. >> >> Link: >> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com >> >> Reviewed-by: Dan Williams >> Signed-off-by: Jane Chu >> --- >> arch/x86/kernel/cpu/mce/apei.c | 14 +- >> 1 file changed, 13 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c >> index 717192915f28..e2439c7872ba 100644 >> --- a/arch/x86/kernel/cpu/mce/apei.c >> +++ b/arch/x86/kernel/cpu/mce/apei.c >> @@ -29,15 +29,27 @@ >> void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err >> *mem_err) >> { >> struct mce m; >> +int lsb; >> >> if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) >> return; >> >> +/* >> + * Even if the ->validation_bits are set for address mask, >> + * to be extra safe, check and reject an error radius '0', >> + * and fallback to the default page size. >> + */ > > speling nit: > >s/fallback/fall back > >> +if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) >> +lsb = find_first_bit((const unsigned long *) >> +_err->physical_addr_mask, PAGE_SHIFT); > > I think we can write this in a shorter form and in a single line: > > lsb = find_first_bit((void *)_err->physical_addr_mask, > PAGE_SHIFT); > > (Ignore checkpatch if it wants to break the line.) > > Untested. > > Assuming my suggestion is correct and with those addressed: > >Reviewed-by: Ingo Molnar Thanks! Just tested the change, v7 coming next. -jane > > Thanks, > > Ingo
[PATCH] nvdimm/namespace: Fix comment typo
The double `existing' is duplicated in the comment, remove one. Signed-off-by: Jason Wang --- drivers/nvdimm/namespace_devs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c index dfade66bab73..c60ec0b373c5 100644 --- a/drivers/nvdimm/namespace_devs.c +++ b/drivers/nvdimm/namespace_devs.c @@ -385,7 +385,7 @@ static resource_size_t init_dpa_allocation(struct nd_label_id *label_id, * * BLK-space is valid as long as it does not precede a PMEM * allocation in a given region. PMEM-space must be contiguous - * and adjacent to an existing existing allocation (if one + * and adjacent to an existing allocation (if one * exists). If reserving PMEM any space is valid. */ static void space_valid(struct nd_region *nd_region, struct nvdimm_drvdata *ndd, -- 2.35.1
Re: [PATCH v6] x86/mce: retrieve poison range from hardware
* Jane Chu wrote: > With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine > poison granularity") that changed nfit_handle_mce() callback to report > badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been > discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting > 2 back-to-back poisons and the driver ends up logging 8 badblocks, > because 0x1000 bytes is 8 512-byte. > > Dan Williams noticed that apei_mce_report_mem_error() hardcode > the LSB field to PAGE_SHIFT instead of consulting the input > struct cper_sec_mem_err record. So change to rely on hardware whenever > support is available. > > Link: > https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com > > Reviewed-by: Dan Williams > Signed-off-by: Jane Chu > --- > arch/x86/kernel/cpu/mce/apei.c | 14 +- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c > index 717192915f28..e2439c7872ba 100644 > --- a/arch/x86/kernel/cpu/mce/apei.c > +++ b/arch/x86/kernel/cpu/mce/apei.c > @@ -29,15 +29,27 @@ > void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err > *mem_err) > { > struct mce m; > + int lsb; > > if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) > return; > > + /* > + * Even if the ->validation_bits are set for address mask, > + * to be extra safe, check and reject an error radius '0', > + * and fallback to the default page size. > + */ speling nit: s/fallback/fall back > + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK) > + lsb = find_first_bit((const unsigned long *) > + _err->physical_addr_mask, PAGE_SHIFT); I think we can write this in a shorter form and in a single line: lsb = find_first_bit((void *)_err->physical_addr_mask, PAGE_SHIFT); (Ignore checkpatch if it wants to break the line.) Untested. Assuming my suggestion is correct and with those addressed: Reviewed-by: Ingo Molnar Thanks, Ingo