Re: [PATCH v7] x86/mce: retrieve poison range from hardware
Jane Chu wrote:
> On 8/3/2022 1:53 AM, Ingo Molnar wrote:
> >
> > * Jane Chu wrote:
> >
> >> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
> >
> > s/Commit/commit
>
> Maintainers,
> Would you prefer a v8, or take care the comment upon accepting the patch?
>
> >
> >> poison granularity") that changed nfit_handle_mce() callback to report
> >> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
> >> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
> >> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
> >> because 0x1000 bytes is 8 512-byte.
> >>
> >> Dan Williams noticed that apei_mce_report_mem_error() hardcode
> >> the LSB field to PAGE_SHIFT instead of consulting the input
> >> struct cper_sec_mem_err record. So change to rely on hardware whenever
> >> support is available.
> >>
> >> Link:
> >> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
> >>
> >> Reviewed-by: Dan Williams
> >> Reviewed-by: Ingo Molnar
> >> Signed-off-by: Jane Chu
> >> ---
> >>  arch/x86/kernel/cpu/mce/apei.c | 13 ++++++++++++-
> >>  1 file changed, 12 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/kernel/cpu/mce/apei.c
> >> b/arch/x86/kernel/cpu/mce/apei.c
> >> index 717192915f28..8ed341714686 100644
> >> --- a/arch/x86/kernel/cpu/mce/apei.c
> >> +++ b/arch/x86/kernel/cpu/mce/apei.c
> >> @@ -29,15 +29,26 @@
> >>  void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err
> >> *mem_err)
> >>  {
> >>  	struct mce m;
> >> +	int lsb;
> >>
> >>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
> >>  		return;
> >>
> >> +	/*
> >> +	 * Even if the ->validation_bits are set for address mask,
> >> +	 * to be extra safe, check and reject an error radius '0',
> >> +	 * and fall back to the default page size.
> >> +	 */
> >> +	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> >> +		lsb = find_first_bit((void *)&mem_err->physical_addr_mask,
> >> PAGE_SHIFT);
> >> +	else
> >> +		lsb = PAGE_SHIFT;
> >> +
> >>  	mce_setup(&m);
> >>  	m.bank = -1;
> >>  	/* Fake a memory read error with unknown channel */
> >>  	m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV |
> >> MCI_STATUS_MISCV | 0x9f;
> >> -	m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
> >> +	m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
> >
> > LGTM.
> >
> > I suppose this wants to go upstream via the tree the bug came from (NVDIMM
> > tree? ACPI tree?), or should we pick it up into the x86 tree?
>
> No idea. Maintainers?

There's no real NVDIMM dependency here, just a general cleanup of how
APEI error granularities are managed. So I think it is appropriate for
this to go through the x86 tree via the typical path for mce related
topics.
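For readers following the thread, here is a minimal userspace sketch of the granularity arithmetic the patch description relies on. It is not the kernel code: the SKETCH_* constants and all helper names are hypothetical, the 4 KiB page size and the value 2 for the physical address mode are assumptions, and first_set_bit_below() only imitates what find_first_bit() is expected to do on a single 64-bit word (return the limit when no bit below it is set).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT     12  /* assumption: 4 KiB pages */
#define SKETCH_SECTOR_SHIFT    9  /* 512-byte badblock unit from the thread */
#define SKETCH_MISC_ADDR_PHYS  2  /* assumption: physical address mode value */

/*
 * Stand-in for find_first_bit() on one 64-bit word: index of the lowest
 * set bit below 'limit', or 'limit' itself when no such bit exists.
 */
static unsigned int first_set_bit_below(uint64_t mask, unsigned int limit)
{
    assert(limit <= 64);
    for (unsigned int bit = 0; bit < limit; bit++)
        if (mask & (UINT64_C(1) << bit))
            return bit;
    return limit;
}

/*
 * Mirrors the selection in the patch: trust the address mask when the
 * record marks it valid, otherwise fall back to page granularity. An
 * all-zero mask also ends up at page granularity.
 */
static unsigned int poison_lsb(int mask_valid, uint64_t physical_addr_mask)
{
    if (mask_valid)
        return first_set_bit_below(physical_addr_mask, SKETCH_PAGE_SHIFT);
    return SKETCH_PAGE_SHIFT;
}

/* Pack the address mode above the LSB field, as the patch does with << 6. */
static uint64_t make_misc(unsigned int lsb)
{
    return ((uint64_t)SKETCH_MISC_ADDR_PHYS << 6) | lsb;
}

/* Recover the LSB field, assuming it occupies the low 6 bits of misc. */
static unsigned int misc_addr_lsb(uint64_t misc)
{
    return (unsigned int)(misc & 0x3f);
}

/* Number of 512-byte sectors covered by a 1 << lsb byte range, rounded up. */
static uint64_t sectors_for_lsb(unsigned int lsb)
{
    uint64_t bytes = UINT64_C(1) << lsb;
    uint64_t sector = UINT64_C(1) << SKETCH_SECTOR_SHIFT;

    return (bytes + sector - 1) >> SKETCH_SECTOR_SHIFT;
}
```

With a hypothetical mask of ~0xff, poison_lsb() yields 8, a 256-byte range that rounds up to one 512-byte sector. With the hardcoded page shift of 12, the range is 0x1000 bytes, which is the 8 sectors per poison mentioned in the commit message.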
Re: [PATCH v7] x86/mce: retrieve poison range from hardware
On 8/3/2022 1:53 AM, Ingo Molnar wrote:
>
> * Jane Chu wrote:
>
>> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
>
> s/Commit/commit

Maintainers,
Would you prefer a v8, or take care the comment upon accepting the patch?

>
>> poison granularity") that changed nfit_handle_mce() callback to report
>> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
>> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
>> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
>> because 0x1000 bytes is 8 512-byte.
>>
>> Dan Williams noticed that apei_mce_report_mem_error() hardcode
>> the LSB field to PAGE_SHIFT instead of consulting the input
>> struct cper_sec_mem_err record. So change to rely on hardware whenever
>> support is available.
>>
>> Link:
>> https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com
>>
>> Reviewed-by: Dan Williams
>> Reviewed-by: Ingo Molnar
>> Signed-off-by: Jane Chu
>> ---
>>  arch/x86/kernel/cpu/mce/apei.c | 13 ++++++++++++-
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
>> index 717192915f28..8ed341714686 100644
>> --- a/arch/x86/kernel/cpu/mce/apei.c
>> +++ b/arch/x86/kernel/cpu/mce/apei.c
>> @@ -29,15 +29,26 @@
>>  void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err
>> *mem_err)
>>  {
>>  	struct mce m;
>> +	int lsb;
>>
>>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>  		return;
>>
>> +	/*
>> +	 * Even if the ->validation_bits are set for address mask,
>> +	 * to be extra safe, check and reject an error radius '0',
>> +	 * and fall back to the default page size.
>> +	 */
>> +	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
>> +		lsb = find_first_bit((void *)&mem_err->physical_addr_mask,
>> PAGE_SHIFT);
>> +	else
>> +		lsb = PAGE_SHIFT;
>> +
>>  	mce_setup(&m);
>>  	m.bank = -1;
>>  	/* Fake a memory read error with unknown channel */
>>  	m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV |
>> MCI_STATUS_MISCV | 0x9f;
>> -	m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
>> +	m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
>
> LGTM.
>
> I suppose this wants to go upstream via the tree the bug came from (NVDIMM
> tree? ACPI tree?), or should we pick it up into the x86 tree?

No idea. Maintainers?

thanks!
-jane

>
> Thanks,
>
> 	Ingo
Re: [PATCH v5 24/32] tools/testing/nvdimm: Convert to printbuf
On 8/8/22 14:30, Dan Williams wrote:
> Matthew Wilcox (Oracle) wrote:
> > From: Kent Overstreet
> >
> > This converts from seq_buf to printbuf. Here we're using printbuf with
> > an external buffer, meaning it's a direct conversion.
> >
> > Signed-off-by: Kent Overstreet
> > Cc: Dan Williams
> > Cc: Dave Hansen
> > Cc: nvd...@lists.linux.dev
>
> My Acked-by still applies:
>
> https://lore.kernel.org/all/62b61165348f4_a7a2f294d0@dwillia2-xfh.notmuch/
>
> ...and Shivaprasad's Tested-by should still apply:
>
> https://lore.kernel.org/all/b299ebe2-88e5-c2bd-bad0-bef62d4ac...@linux.ibm.com/

Whoops - got them now, thanks!
RE: [PATCH v5 24/32] tools/testing/nvdimm: Convert to printbuf
Matthew Wilcox (Oracle) wrote:
> From: Kent Overstreet
>
> This converts from seq_buf to printbuf. Here we're using printbuf with
> an external buffer, meaning it's a direct conversion.
>
> Signed-off-by: Kent Overstreet
> Cc: Dan Williams
> Cc: Dave Hansen
> Cc: nvd...@lists.linux.dev

My Acked-by still applies:

https://lore.kernel.org/all/62b61165348f4_a7a2f294d0@dwillia2-xfh.notmuch/

...and Shivaprasad's Tested-by should still apply:

https://lore.kernel.org/all/b299ebe2-88e5-c2bd-bad0-bef62d4ac...@linux.ibm.com/