RE: [PATCH v2] x86/mce: retrieve poison range from hardware whenever supported

2022-07-15 Thread Dan Williams
Jane Chu wrote:
> With commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
> poison granularity"), which changed the nfit_handle_mce() callback to
> report badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it
> was discovered that the mce->misc LSB field indicates a 0x1000-byte
> granularity, so injecting 2 back-to-back poisons ends up with the
> driver logging 8 badblocks, because 0x1000 bytes is eight 512-byte
> sectors.
> 
> Dan Williams noticed that apei_mce_report_mem_error() hardcodes
> the LSB field to PAGE_SHIFT instead of consulting the input
> struct cper_sec_mem_err record.  So change to rely on the granularity
> reported by hardware whenever it is available.
> 
> v1: https://lkml.org/lkml/2022/7/15/1040

This should be "Link:" and it should reference lore.kernel.org by
msg-id. Lore is maintained by LF infrastructure and the message-id can
be used to search lore.kernel.org mirrors even if the LF infra is down.

Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8...@oracle.com

> Signed-off-by: Jane Chu 
> ---
>  arch/x86/kernel/cpu/mce/apei.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index 717192915f28..a4d5893632e0 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -29,6 +29,7 @@
>  void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>  {
>   struct mce m;
> + int lsb;
>  
>   if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>   return;
> @@ -37,7 +38,10 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>   m.bank = -1;
>   /* Fake a memory read error with unknown channel */
>   m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
> - m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
> + lsb = __builtin_ffs(mem_err->physical_addr_mask) - 1;

Just use the kernel's __ffs64() which does not require the subtraction
fixup, and you can assume that physical_addr_mask is non-zero just
like trace_extlog_mem_event() does.

> + if (lsb <= 0)
> + lsb = PAGE_SHIFT;

This fixup can then be removed as well.
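
Untested sketch of what I mean, with the 'lsb' local and the PAGE_SHIFT
fallback dropped (__ffs64() lives in <linux/bitops.h>):

	/* take the poison granularity straight from the CPER record */
	m.misc = (MCI_MISC_ADDR_PHYS << 6) |
		 __ffs64(mem_err->physical_addr_mask);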

After the above comments are addressed you can add:

Reviewed-by: Dan Williams 



[PATCH v2] x86/mce: retrieve poison range from hardware whenever supported

2022-07-15 Thread Jane Chu
With commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
poison granularity"), which changed the nfit_handle_mce() callback to
report badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it
was discovered that the mce->misc LSB field indicates a 0x1000-byte
granularity, so injecting 2 back-to-back poisons ends up with the
driver logging 8 badblocks, because 0x1000 bytes is eight 512-byte
sectors.

Dan Williams noticed that apei_mce_report_mem_error() hardcodes
the LSB field to PAGE_SHIFT instead of consulting the input
struct cper_sec_mem_err record.  So change to rely on the granularity
reported by hardware whenever it is available.

v1: https://lkml.org/lkml/2022/7/15/1040
Signed-off-by: Jane Chu 
---
 arch/x86/kernel/cpu/mce/apei.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 717192915f28..a4d5893632e0 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,6 +29,7 @@
 void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 {
struct mce m;
+   int lsb;
 
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
@@ -37,7 +38,10 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
m.bank = -1;
/* Fake a memory read error with unknown channel */
	m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
-   m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+   lsb = __builtin_ffs(mem_err->physical_addr_mask) - 1;
+   if (lsb <= 0)
+   lsb = PAGE_SHIFT;
+   m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
 
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
-- 
2.18.4




Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-15 Thread Jane Chu
On 7/15/2022 12:17 PM, Dan Williams wrote:
> [ add Tony ]
> 
> Jane Chu wrote:
>> On 7/14/2022 6:19 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
>>>> one call for each poison with accurate address.
>>>>
>>>> Also, short ARS would find 2 poisons.
>>>>
>>>> I attached the console output, my annotation is prefixed with "<==".
>>>
>>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600  <== 2nd poison @ 0x600
>>> [29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xff00
>>>
>>> Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
>>> is seeing a 256-byte address mask?
>>
>> Good question!  One would think both GHES reporting and
>> nfit_handle_mce() are consuming the same mce record...
>> Who might know?
> 
> Did some grepping...
> 
> Have a look at: apei_mce_report_mem_error()
> 
> "The call is coming from inside the house!"
> 
> Luckily we do not need to contact a BIOS engineer to get this fixed.

Great, thank you!
Just put together a quick fix for review after I tested it.

> 
>>> Sigh, is this "firmware-first" causing the kernel to get bad information
>>> via the native mechanisms?
>>> I would expect that if this test was truly worried about minimizing BIOS
>>> latency it would disable firmware-first error reporting. I wonder if
>>> that fixes the observed problem?
>>
>> Could you elaborate on firmware-first error handling, please?  What are
>> the possible consequences of disabling it, and how is it disabled?
> 
> With my Linux kernel developer hat on, firmware-first error handling is
> really only useful for supporting legacy operating systems that do not
> have native machine check handling, or for platforms that have bugs that
> would otherwise cause OS native error handling to fail. Otherwise, for
> modern Linux, firmware-first error handling is pure overhead and a
> source of bugs.
> 
> In this case the bug is in the Linux code that translates the ACPI event
> back into an MCE record.

Thanks!

-jane



Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-15 Thread Dan Williams
[ add Tony ]

Jane Chu wrote:
> On 7/14/2022 6:19 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> >> one call for each poison with accurate address.
> >>
> >> Also, short ARS would find 2 poisons.
> >>
> >> I attached the console output, my annotation is prefixed with "<==".
> > 
> > [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600  <== 2nd poison @ 0x600
> > [29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xff00
> > 
> > Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
> > is seeing a 256-byte address mask?
> 
> Good question!  One would think both GHES reporting and 
> nfit_handle_mce() are consuming the same mce record...
> Who might know?

Did some grepping...

Have a look at: apei_mce_report_mem_error()

"The call is coming from inside the house!"

Luckily we do not need to contact a BIOS engineer to get this fixed.
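
For reference, the translation currently pins the reported poison
granularity to a page regardless of what the CPER record carried, in
apei_mce_report_mem_error() (arch/x86/kernel/cpu/mce/apei.c):

	/* mem_err->physical_addr_mask is never consulted here */
	m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;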

> > Sigh, is this "firmware-first" causing the kernel to get bad information
> > via the native mechanisms?
> > I would expect that if this test was truly worried about minimizing BIOS
> > latency it would disable firmware-first error reporting. I wonder if
> > that fixes the observed problem?
> 
> Could you elaborate on firmware-first error handling, please?  What are
> the possible consequences of disabling it, and how is it disabled?

With my Linux kernel developer hat on, firmware-first error handling is
really only useful for supporting legacy operating systems that do not
have native machine check handling, or for platforms that have bugs that
would otherwise cause OS native error handling to fail. Otherwise, for
modern Linux, firmware-first error handling is pure overhead and a
source of bugs.

In this case the bug is in the Linux code that translates the ACPI event
back into an MCE record.



Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-15 Thread Jane Chu
On 7/14/2022 5:58 PM, Dan Williams wrote:
[..]
>>>
>>>>> However, the ARS engine likely can return the precise error ranges so I
>>>>> think the fix is to just use the address range indicated by 1UL <<
>>>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>>>> scrub request to ask the device for the precise error list.
>>>>
>>>> You mean for the nfit_handle_mce() callback to issue a short ARS for each
>>>> poison report over a 4K range?
>>>
>>> Over a L1_CACHE_BYTES range...
>>>
[..]
>>>
>>> For the badrange tracking, no. So this would just be a check to say
>>> "Yes, CPU I see you think the whole 4K is gone, but lets double check
>>> with more precise information for what gets placed in the badrange
>>> tracking".
>>
>> Okay, process-wise, this is what I am seeing -
>>
>> - for each poison, nfit_handle_mce() issues a short ARS given (addr, 64bytes)
> 
> Why would the short-ARS be performed over a 64-byte span when the MCE
> reported a 4K aligned event?

Cuz you said so, see above. :)  Yes, 4K range as reported by the MCE 
makes sense.

> 
>> - and short ARS returns to say that's actually (addr, 256bytes),
>> - and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes) anyway.
> 
> Right, I am reacting to the fact that the patch is picking 512 as an
> arbitrary blast radius. It's ok to expand the blast radius from
> hardware when, for example, recording a 64-byte MCE in badrange which
> only understands 512-byte records, but it's not ok to take a 4K MCE and
> trim it to 512 bytes without asking hardware for a more precise report.

Agreed.

> 
> Recall that the NFIT driver supports platforms that may not offer ARS.
> In that case the 4K MCE from the CPU is all that the driver gets and
> there is no data source for a more precise answer.
> 
> So the ask is to avoid trimming the blast radius of MCE reports unless
> and until a short-ARS says otherwise.
> 
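
If I'm following, the policy you're asking for would look roughly like
this hypothetical sketch (illustration only, not actual nfit code;
record_poison() is a made-up helper):

	/*
	 * Expand spans to badrange's 512-byte record granularity, but never
	 * trim what the MCE reported unless a short ARS returned a more
	 * precise span.
	 */
	static void record_poison(struct nvdimm_bus *bus, u64 mce_addr,
				  u64 mce_len, u64 ars_addr, u64 ars_len,
				  bool ars_valid)
	{
		u64 addr = mce_addr, len = mce_len;

		if (ars_valid) {
			addr = ars_addr;
			len = ars_len;
		}
		/* round out to 512-byte records, never in */
		len = ALIGN(addr + len, 512) - ALIGN_DOWN(addr, 512);
		/* ignoring a possible -ENOMEM here for brevity */
		nvdimm_bus_add_badrange(bus, ALIGN_DOWN(addr, 512), len);
	}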

What happens to a short ARS on a platform that doesn't support ARS?
-EOPNOTSUPP?

>> The precise badrange from short ARS is lost in the process; given the
>> time spent visiting the BIOS, what's the gain?
> 
> Generic support for not under-recording poison on platforms that do not
> support ARS.
> 
>> Could we defer the precise badrange until there is a consumer of the
>> information?
> 
> Ideally the consumer is immediate and this precise information can make
> it to the filesystem which might be able to make a better decision about
> what data got clobbered.
> 
> See dax_notify_failure() infrastructure currently in linux-next that can
> convey poison events to filesystems. That might be a path to start
> tracking and reporting precise failure information to address the
> constraints of the badrange implementation.

Yes, I'm aware of dax_notify_failure(), but would appreciate it if you
could elaborate on how that code path could be leveraged for a precise
badrange implementation.
My understanding is that dax_notify_failure() is in the path of a
synchronous fault accompanied by SIGBUS with BUS_MCEERR_AR.
But badrange could be recorded without the poison being consumed, even
without a DAX filesystem in the picture.
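
For reference, my rough mental model of that interface, as a purely
hypothetical sketch going by the linux-next series (the actual struct,
field, and callback names/signatures may differ):

	/* hypothetical: a filesystem registering as the dax_device holder */
	static int myfs_dax_notify_failure(struct dax_device *dax_dev,
					   u64 offset, u64 len, int mf_flags)
	{
		/* map (offset, len) back to file extents and notify users */
		return 0;
	}

	static const struct dax_holder_operations myfs_dax_holder_ops = {
		.notify_failure = myfs_dax_notify_failure,
	};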

thanks,
-jane



Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-15 Thread Jane Chu
On 7/14/2022 6:19 PM, Dan Williams wrote:
> Jane Chu wrote:
>> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
>> one call for each poison with accurate address.
>>
>> Also, short ARS would find 2 poisons.
>>
>> I attached the console output, my annotation is prefixed with "<==".
> 
> [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600  <== 2nd poison @ 0x600
> [29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xff00
> 
> Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
> is seeing a 256-byte address mask?

Good question!  One would think both GHES reporting and 
nfit_handle_mce() are consuming the same mce record...
Who might know?

> 
> Sigh, is this "firmware-first" causing the kernel to get bad information
> via the native mechanisms?
> I would expect that if this test was truly worried about minimizing BIOS
> latency it would disable firmware-first error reporting. I wonder if
> that fixes the observed problem?

Could you elaborate on firmware-first error handling, please?  What are
the possible consequences of disabling it, and how is it disabled?

thanks!
-jane