Re: [PATCH 06/20] x86/mce/amd: Use helper for GPU UMC bank type checks

2023-11-27 Thread Yazen Ghannam
On 11/27/2023 6:46 AM, Borislav Petkov wrote:
On Sat, Nov 18, 2023 at 01:32:34PM -0600, Yazen Ghannam wrote:
+/* GPU UMCs have MCATYPE=0x1.*/
+bool smca_gpu_umc_bank_type(u64 ipid)
+{
+	if (!smca_umc_bank_type(ipid))
+		return false;
+
+	return FIELD_GET
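As a rough userspace illustration of the check quoted in this message (not the patch's actual code, which is cut off in the snippet): the helper first confirms that the bank is a UMC and only then tests the MCATYPE field. The bit positions used here (HardwareID in IPID[43:32], McaType in IPID[63:48]), the UMC hardware ID 0x96, and every `sketch_*` / `SKETCH_*` name are assumptions made for this sketch; only the MCATYPE value 0x1 comes from the quoted comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout for this sketch only. */
#define SKETCH_IPID_HWID_SHIFT     32
#define SKETCH_IPID_HWID_MASK      0xFFFULL
#define SKETCH_IPID_MCATYPE_SHIFT  48
#define SKETCH_IPID_MCATYPE_MASK   0xFFFFULL

#define SKETCH_HWID_UMC            0x96ULL /* assumed UMC hardware ID */
#define SKETCH_MCATYPE_GPU_UMC     0x1ULL  /* "GPU UMCs have MCATYPE=0x1" */

/* Extract a bit field: shift it down to bit 0, then mask off the rest. */
static uint64_t sketch_field(uint64_t reg, unsigned int shift, uint64_t mask)
{
	return (reg >> shift) & mask;
}

/* True if the IPID value describes any UMC bank (under the assumptions above). */
bool sketch_umc_bank_type(uint64_t ipid)
{
	return sketch_field(ipid, SKETCH_IPID_HWID_SHIFT,
			    SKETCH_IPID_HWID_MASK) == SKETCH_HWID_UMC;
}

/* True only if the bank is a UMC and its MCATYPE marks it as a GPU UMC. */
bool sketch_gpu_umc_bank_type(uint64_t ipid)
{
	if (!sketch_umc_bank_type(ipid))
		return false;

	return sketch_field(ipid, SKETCH_IPID_MCATYPE_SHIFT,
			    SKETCH_IPID_MCATYPE_MASK) == SKETCH_MCATYPE_GPU_UMC;
}
```

Checking the UMC hardware ID first matters because an MCATYPE value is only meaningful relative to its hardware ID; another block could use the same MCATYPE number for an unrelated bank type.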

Re: [PATCH 05/20] x86/mce/amd: Use helper for UMC bank type check

2023-11-27 Thread Yazen Ghannam
On 11/27/2023 6:43 AM, Borislav Petkov wrote:
On Sat, Nov 18, 2023 at 01:32:33PM -0600, Yazen Ghannam wrote:
@@ -714,14 +721,10 @@ static bool legacy_mce_is_memory_error(struct mce *m)
  */
 static bool smca_mce_is_memory_error(struct mce *m)
 {
-	enum smca_bank_types bank_type

Re: [PATCH 03/20] x86/mce: Use mce_setup() helpers for apei_smca_report_x86_error()

2023-11-27 Thread Yazen Ghannam
On 11/22/2023 1:28 PM, Borislav Petkov wrote:
On Sat, Nov 18, 2023 at 01:32:31PM -0600, Yazen Ghannam wrote:
Current AMD systems may report MCA errors using the ACPI Boot Error Record Table (BERT). The BERT entries for MCA errors will be an x86 Common Platform Error Record (CPER) with an MSR

Re: [PATCH 02/20] x86/mce: Define mce_setup() helpers for global and per-CPU fields

2023-11-27 Thread Yazen Ghannam
On 11/22/2023 1:24 PM, Borislav Petkov wrote:
On Sat, Nov 18, 2023 at 01:32:30PM -0600, Yazen Ghannam wrote:
+void mce_setup_global(struct mce *m)
We usually call those things "common": mce_setup_common().
+{
+	memset(m, 0, sizeof(struct mce));
+
+	m->cpuid

[PATCH 19/20] x86/mce/apei: Handle variable register array size

2023-11-18 Thread Yazen Ghannam
ned-off-by: Avadhut Naik Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/apei.c | 73 +++--- 1 file changed, 59 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c index 4820f8677460..d01c9b272e2f 100644 ---

[PATCH 17/20] x86/mce: Add wrapper for struct mce to export vendor specific info

2023-11-18 Thread Yazen Ghannam
Some Checkpatch checks have been ignored to maintain coding style. [Yazen: Add last commit message paragraph. Rebase on other MCA updates.] Suggested-by: Borislav Petkov (AMD) Signed-off-by: Avadhut Naik Signed-off-by: Yazen Ghannam --- arch/x86/include/asm/mce.h | 6 +- arch/x86/kerne

[PATCH 11/20] x86/mce/amd: Simplify DFR handler setup

2023-11-18 Thread Yazen Ghannam
value from the hardware. Simplify the enable flow by using the hardware-provided value. Any conflicts will be caught by setup_APIC_eilvt(). Conflicts on production systems can be handled as quirks, if needed. Also, rename the function using a "verb-first" style. Signed-off-by: Yaz

[PATCH 20/20] EDAC/mce_amd: Add support for FRU Text in MCA

2023-11-18 Thread Yazen Ghannam
o maintain coding style. [Yazen: Add Avadhut as co-developer for wrapper changes. ] Co-developed-by: Avadhut Naik Signed-off-by: Avadhut Naik Signed-off-by: Yazen Ghannam --- arch/x86/include/asm/mce.h | 2 ++ arch/x86/kernel/cpu/mce/apei.c | 2 ++ arch/x86/kernel/cpu/mce/core.c | 3 ++

[PATCH 16/20] x86/mce/amd: Support SMCA Corrected Error Interrupt

2023-11-18 Thread Yazen Ghannam
to set up the MCA Thresholding interrupt handler. If successful, set the feature enable bit in the MCA_CONFIG register to indicate to the Platform that the OS is ready for the interrupt. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/amd.c | 21 +++-- 1 file c

[PATCH 18/20] x86/mce, EDAC/mce_amd: Add support for new MCA_SYND{1, 2} registers

2023-11-18 Thread Yazen Ghannam
zen's Co-developed-by tag and moved SoB tag.] [Yazen: Change %Lx to %llx in TP_printk().] Signed-off-by: Avadhut Naik Signed-off-by: Yazen Ghannam --- arch/x86/include/asm/mce.h | 12 arch/x86/kernel/cpu/mce/core.c | 26 ++ drivers/edac/mce_amd.c

[PATCH 14/20] x86/mce/amd: Unify AMD DFR handler with MCA Polling

2023-11-18 Thread Yazen Ghannam
the common MCA code. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/amd.c | 122 + arch/x86/kernel/cpu/mce/core.c | 16 - 2 files changed, 48 insertions(+), 90 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.

[PATCH 10/20] x86/mce/amd: Prep DFR handler before enabling banks

2023-11-18 Thread Yazen Ghannam
configuration. This ensures the kernel is ready to receive the interrupts before the hardware is configured to send them. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/amd.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel

[PATCH 04/20] x86/mce/amd, EDAC/mce_amd: Move long names to decoder module

2023-11-18 Thread Yazen Ghannam
The "long names" for SMCA banks are only used by the MCE decoder module. Move them out of the arch code and into the decoder module. Signed-off-by: Yazen Ghannam --- arch/x86/include/asm/mce.h | 1 - arch/x86/kernel/cpu/mce/amd.c | 74 ++- dr

[PATCH 12/20] x86/mce/amd: Clean up enable_deferred_error_interrupt()

2023-11-18 Thread Yazen Ghannam
read it once during init and pass to functions that need it. Start with the Deferred error interrupt case. The MCA Thresholding interrupt case will be handled during refactoring. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/amd.c | 46 +++ 1 file

[PATCH 09/20] x86/mce/amd: Clean up SMCA configuration

2023-11-18 Thread Yazen Ghannam
the MCA_CONFIG MSR, so include updated register field definitions using bitops. Leave old code until init path is cleaned up. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/amd.c | 84 --- 1 file changed, 49 insertions(+), 35 deletions(-) diff --git a

[PATCH 15/20] x86/mce: Skip AMD threshold init if no threshold banks found

2023-11-18 Thread Yazen Ghannam
esholding data structures are not initialized. Check "bank_map" to determine if the thresholding structures should be allocated and initialized. Also, remove "mce_flags.amd_threshold" which is redundant when checking "bank_map". Signed-off-by: Yazen Ghannam --- arch/x

[PATCH 13/20] x86/mce: Unify AMD THR handler with MCA Polling

2023-11-18 Thread Yazen Ghannam
reset" flow. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/amd.c | 57 ++ arch/x86/kernel/cpu/mce/core.c | 8 + arch/x86/kernel/cpu/mce/internal.h | 2 ++ 3 files changed, 37 insertions(+), 30 deletions(-) diff --git a/arch/x86/kernel/cpu/m

[PATCH 08/20] x86/mce/amd: Look up bank type by IPID

2023-11-18 Thread Yazen Ghannam
code until init path is cleaned up. Signed-off-by: Yazen Ghannam --- arch/x86/include/asm/mce.h | 2 +- arch/x86/kernel/cpu/mce/amd.c | 88 --- drivers/edac/mce_amd.c | 2 +- 3 files changed, 84 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/a

[PATCH 06/20] x86/mce/amd: Use helper for GPU UMC bank type checks

2023-11-18 Thread Yazen Ghannam
removed. Signed-off-by: Yazen Ghannam --- arch/x86/include/asm/mce.h | 4 +++- arch/x86/kernel/cpu/mce/amd.c | 12 +++- drivers/edac/amd64_edac.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 - 4 files changed, 19 insertions(+), 8

[PATCH 05/20] x86/mce/amd: Use helper for UMC bank type check

2023-11-18 Thread Yazen Ghannam
ror is for memory, and where the exact bank type is not needed. Use bitops and rename old mask until removed. Signed-off-by: Yazen Ghannam --- arch/x86/include/asm/mce.h | 3 ++- arch/x86/kernel/cpu/mce/amd.c | 15 +-- 2 files changed, 11 insertions(+), 7 deletions(-) diff --git

[PATCH 07/20] x86/mce/amd: Use fixed bank number for quirks

2023-11-18 Thread Yazen Ghannam
odels. Related to this quirk is code to disable MCA Thresholding for the IF bank. Check the bank number for the quirks instead of determining the bank type. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/amd.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arc

[PATCH 03/20] x86/mce: Use mce_setup() helpers for apei_smca_report_x86_error()

2023-11-18 Thread Yazen Ghannam
CPU number. If no possible CPU was found, then return early. Gather the global MCA information first, save the found CPU number, then gather the per-CPU information. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/apei.c | 18 -- arch/x86/kernel/cpu/mce/internal.h

[PATCH 02/20] x86/mce: Define mce_setup() helpers for global and per-CPU fields

2023-11-18 Thread Yazen Ghannam
t with any per-CPU decoding or handling. Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/core.c | 31 +-- 1 file changed, 21 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 1642018dd6c9..7e

[PATCH 01/20] x86/mce/inject: Clear test status value

2023-11-18 Thread Yazen Ghannam
6/mce: Check whether writes to MCA_STATUS are getting ignored") Signed-off-by: Yazen Ghannam --- arch/x86/kernel/cpu/mce/inject.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c index 4d8d4bcf915d..72f0695c3dc1 100644 ---

[PATCH 00/20] MCA Updates

2023-11-18 Thread Yazen Ghannam
ik (2): x86/mce: Add wrapper for struct mce to export vendor specific info x86/mce, EDAC/mce_amd: Add support for new MCA_SYND{1,2} registers Yazen Ghannam (18): x86/mce/inject: Clear test status value x86/mce: Define mce_setup() helpers for global and per-CPU fields x86/mce: Use mce_

Re: [PATCH] drm/amdgpu: Handle potential NULL pointer dereference

2022-08-26 Thread Yazen Ghannam
On Thu, Aug 25, 2022 at 08:54:46AM -0400, Russell, Kent wrote:
> [AMD Official Use Only - General]
>
> It does indeed short-circuit on || (If the left side of an || statement is
> not 0, it doesn't evaluate the right side and returns true). So we can ignore
> this patch, since checking for each
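The short-circuit behaviour described in this message can be seen in a small standalone example. This is only a sketch: `struct sketch_ctx` and all `sketch_*` names are hypothetical and are not taken from the amdgpu patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical structure, used only for this illustration. */
struct sketch_ctx {
	const int *data;
};

/* Counts its own invocations so that a skipped evaluation is observable. */
static bool sketch_note_call(int *calls)
{
	(*calls)++;
	return true;
}

/* The right-hand operand of || is evaluated only when lhs is false. */
bool sketch_or(bool lhs, int *rhs_calls)
{
	return lhs || sketch_note_call(rhs_calls);
}

/* Safe for ctx == NULL: ctx->data is read only when !ctx is false. */
bool sketch_is_unusable(const struct sketch_ctx *ctx)
{
	return !ctx || !ctx->data;
}
```

Calling `sketch_or(true, &calls)` leaves `calls` unchanged, while `sketch_or(false, &calls)` increments it once. The same guarantee is what makes a combined check such as `!ctx || !ctx->data` safe without a separate NULL test beforehand.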

Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

2021-09-27 Thread Yazen Ghannam
On Sat, Sep 25, 2021 at 01:20:57PM +0200, Borislav Petkov wrote:
> On Fri, Sep 24, 2021 at 07:46:10PM +0000, Yazen Ghannam wrote:
> > I agree with you in general. But this device isn't really a GPU. And
> > users of this device seem to want to count *every* error, at least f

Re: [PATCHv4 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

2021-09-24 Thread Yazen Ghannam
> > Signed-off-by: Mukul Joshi
> ---
This patch looks good to me overall.
Reviewed-by: Yazen Ghannam
Thanks,
Yazen

Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

2021-09-24 Thread Yazen Ghannam
On Thu, Sep 23, 2021 at 08:14:15PM +0200, Borislav Petkov wrote:
> On Thu, Sep 23, 2021 at 05:23:21PM +0000, Yazen Ghannam wrote:
> > Shouldn't the error still be reported to EDAC for decoding and counting? I
> > think users want this.
>
> You know what happens with u

Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

2021-09-23 Thread Yazen Ghannam
On Thu, Sep 23, 2021 at 11:30:55AM -0400, Joshi, Mukul wrote:
...
> > > +	return NOTIFY_DONE;
> > > +
> > > +	/*
> > > +	 * If it is correctable error, return.
> > > +	 */
> > > +	if (mce_is_correctable(m))
> > > +		return NOTIFY_OK;
> >
> > Shouldn't this be "NOTIFY_DONE" if "don't
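The distinction being questioned here can be illustrated with a userspace sketch. It assumes the usual notifier convention in which NOTIFY_DONE (0x0000) means "not interested in this event" and NOTIFY_OK (0x0001) means "event handled"; `struct sketch_event`, its fields, and the `sketch_*` / `SKETCH_*` names are hypothetical and do not come from the patch. Returning "done" for correctable errors is one reading of the reviewer's (truncated) question, not a confirmed outcome of the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the kernel's NOTIFY_DONE / NOTIFY_OK; the names are local to this sketch. */
enum {
	SKETCH_NOTIFY_DONE = 0x0000, /* don't care about this event */
	SKETCH_NOTIFY_OK   = 0x0001, /* event was handled here */
};

/* Hypothetical event description, used only for this illustration. */
struct sketch_event {
	bool for_this_device;
	bool correctable;
};

/*
 * A handler that only acts on uncorrectable errors for its own device.
 * Events it skips are reported as "done" (don't care), so the return
 * value tells the caller whether any work actually happened.
 */
int sketch_bad_page_notifier(const struct sketch_event *ev)
{
	if (!ev->for_this_device)
		return SKETCH_NOTIFY_DONE;

	if (ev->correctable)
		return SKETCH_NOTIFY_DONE;

	/* Page retirement would happen here. */
	return SKETCH_NOTIFY_OK;
}
```

Neither value stops the chain, so later handlers such as a decoder still see the event in both cases; the choice only records whether this handler did anything with it.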

Re: [PATCHv3 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

2021-09-23 Thread Yazen Ghannam
On Wed, Sep 22, 2021 at 03:36:20PM -0400, Mukul Joshi wrote:
> On Aldebaran, GPU driver will handle bad page retirement
> even though UMC is host managed. As a result, register a
> bad page retirement handler on the mce notifier chain to
> retire bad pages on Aldebaran.
>
I think this should stat

Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-06-03 Thread Yazen Ghannam
On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
...
> > Is that the same deferred interrupt which calls
> > amd_deferred_error_interrupt() ?
>
> Sorry picking this up after sometime. I thought I had replied to this email.
> Yes it is the same deferred interrupt which calls
> amd_def