[PATCH] x86/MCE/AMD: Always give PANIC severity for UC errors in kernel context

2017-09-19 Thread Yazen Ghannam
From: Yazen Ghannam 

Our current AMD severity logic can possibly give MCE_AR_SEVERITY for
uncorrectable errors in kernel context. The current #MC handler only calls
memory_failure() on errors in user context, but older versions will call
memory_failure() unconditionally. In older versions, the system can get
stuck in a loop as memory_failure() will try to handle the bad kernel
memory and find it busy.

Return MCE_PANIC_SEVERITY for all UC errors in IN_KERNEL context. Newer
kernel versions have an IN_KERNEL_RECOV context for recoverable kernel
errors. All other kernel uncorrectable errors can be considered
unrecoverable.

Fixes: bf80bbd7dcf5 (x86/mce: Add an AMD severities-grading function)

Signed-off-by: Yazen Ghannam 
Cc:  # v4.9..
---
 arch/x86/kernel/cpu/mcheck/mce-severity.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 2773c8547f69..f5518706baa6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -245,6 +245,9 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
 
 	if (m->status & MCI_STATUS_UC) {
 
+		if (ctx == IN_KERNEL)
+			return MCE_PANIC_SEVERITY;
+
 		/*
 		 * On older systems where overflow_recov flag is not present, we
 		 * should simply panic if an error overflow occurs. If
@@ -255,10 +258,6 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
 			if (mce_flags.smca)
 				return mce_severity_amd_smca(m, ctx);
 
-			/* software can try to contain */
-			if (!(m->mcgstatus & MCG_STATUS_RIPV) && (ctx == IN_KERNEL))
-				return MCE_PANIC_SEVERITY;
-
 			/* kill current process */
 			return MCE_AR_SEVERITY;
 		} else {
-- 
2.7.4
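
The behavioural change in the patch above can be sketched as a small userspace model. This is a sketch under stated assumptions, not the kernel code: `struct mce_model`, `enum severity`, and both function names are hypothetical stand-ins, the enum values are chosen for illustration, and the SMCA path and the branch taken when `overflow_recov` is absent are left out.

```c
#include <stdint.h>

/* Bit positions follow the x86 MCA register layout. */
#define MCI_STATUS_UC   (1ULL << 61)	/* uncorrected error */
#define MCG_STATUS_RIPV (1ULL << 0)	/* restart IP valid */

/* Hypothetical stand-ins; values chosen for this model only. */
enum context  { IN_KERNEL = 1, IN_USER = 2 };
enum severity { SEV_NOT_UC, SEV_AR, SEV_PANIC };

struct mce_model {
	uint64_t status;	/* models MCi_STATUS */
	uint64_t mcgstatus;	/* models MCG_STATUS */
};

/* Before the patch: kernel context panics only when RIPV is clear. */
static enum severity uc_severity_before(const struct mce_model *m,
					enum context ctx)
{
	if (!(m->status & MCI_STATUS_UC))
		return SEV_NOT_UC;
	if (!(m->mcgstatus & MCG_STATUS_RIPV) && ctx == IN_KERNEL)
		return SEV_PANIC;
	return SEV_AR;		/* kill current process */
}

/* After the patch: every UC error in kernel context panics. */
static enum severity uc_severity_after(const struct mce_model *m,
				       enum context ctx)
{
	if (!(m->status & MCI_STATUS_UC))
		return SEV_NOT_UC;
	if (ctx == IN_KERNEL)
		return SEV_PANIC;
	return SEV_AR;		/* kill current process */
}
```

For a UC error in kernel context with RIPV set, the first function returns `SEV_AR` and the second returns `SEV_PANIC`; that is the case the commit message describes. For `IN_USER` both versions return `SEV_AR`.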



[PATCH] x86/MCE/AMD: Always give panic severity for UC errors in kernel context

2017-11-21 Thread Yazen Ghannam
From: Yazen Ghannam 

[Upstream commit d65dfc81bb3894fdb68cbc74bbf5fb48d2354071]

The AMD severity grading function was introduced in kernel 4.1. The
current logic can possibly give MCE_AR_SEVERITY for uncorrectable
errors in kernel context. The system may then get stuck in a loop as
memory_failure() will try to handle the bad kernel memory and find it
busy.

Return MCE_PANIC_SEVERITY for all UC errors in IN_KERNEL context on AMD
systems.

After:

  b2f9d678e28c ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries")

was accepted in v4.6, this issue was masked because of the tail-end attempt
at kernel mode recovery in the #MC handler.

However, uncorrectable errors in IN_KERNEL context should always be
considered unrecoverable and cause a panic.

Signed-off-by: Yazen Ghannam 
Cc:  # 4.1.x, 4.4.x
Fixes: bf80bbd7dcf5 (x86/mce: Add an AMD severities-grading function)
---
Same as
  d65dfc81bb38 ("x86/MCE/AMD: Always give panic severity for UC errors in kernel context")
but fixed up to apply to v4.1 and v4.4 stable branches.

 arch/x86/kernel/cpu/mcheck/mce-severity.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 9c682c222071..a9287f0f06f2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -200,6 +200,9 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
 
 	if (m->status & MCI_STATUS_UC) {
 
+		if (ctx == IN_KERNEL)
+			return MCE_PANIC_SEVERITY;
+
 		/*
 		 * On older systems where overflow_recov flag is not present, we
 		 * should simply panic if an error overflow occurs. If
@@ -207,10 +210,6 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
 		 * to at least kill process to prolong system operation.
 		 */
 		if (mce_flags.overflow_recov) {
-			/* software can try to contain */
-			if (!(m->mcgstatus & MCG_STATUS_RIPV) && (ctx == IN_KERNEL))
-				return MCE_PANIC_SEVERITY;
-
 			/* kill current process */
 			return MCE_AR_SEVERITY;
 		} else {
-- 
2.14.1



Re: [PATCH] x86/MCE/AMD: Always give PANIC severity for UC errors in kernel context

2017-09-26 Thread Borislav Petkov
(drop CC:stable from CC list)

Do not add CC:stable when sending the patch with git send-email.

On Tue, Sep 19, 2017 at 09:07:11AM -0500, Yazen Ghannam wrote:
> From: Yazen Ghannam 
> 
> Our current AMD severity logic can possibly give MCE_AR_SEVERITY for
> uncorrectable errors in kernel context. The current #MC handler only calls
> memory_failure() on errors in user context, but older versions will call
> memory_failure() unconditionally.

Err, I don't understand this aspect: what does "older versions" mean
here exactly?

There are no older versions of the #MC handler - there's only one
version - the version against which this patch gets applied.

Hmmm?

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


RE: [PATCH] x86/MCE/AMD: Always give PANIC severity for UC errors in kernel context

2017-09-26 Thread Ghannam, Yazen
> -----Original Message-----
> From: Borislav Petkov [mailto:b...@alien8.de]
> Sent: Tuesday, September 26, 2017 8:01 AM
> To: Ghannam, Yazen 
> Cc: linux-e...@vger.kernel.org; Tony Luck ;
> x...@kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] x86/MCE/AMD: Always give PANIC severity for UC
> errors in kernel context
> 
> (drop CC:stable from CC list)
> 
> Do not add CC:stable when sending the patch with git send-email.
> 

How should I CC:stable? 

> On Tue, Sep 19, 2017 at 09:07:11AM -0500, Yazen Ghannam wrote:
> > From: Yazen Ghannam 
> >
> > Our current AMD severity logic can possibly give MCE_AR_SEVERITY for
> > uncorrectable errors in kernel context. The current #MC handler only
> > calls memory_failure() on errors in user context, but older versions
> > will call memory_failure() unconditionally.
> 
> Err, I don't understand this aspect: what does "older versions" mean here
> exactly?
> 
> There are no older versions of the #MC handler - there's only one version -
> the version against which this patch gets applied.
> 
> Hmmm?
> 

There are the stable branches on kernel.org and some distro kernels based on
older kernel versions.

The AMD severity grading function was introduced in v4.1 and has this issue.
However, the following commit was included in v4.6 and masks the issue.

b2f9d678e28c x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries

This patch will apply to v4.9 and later. Another version will be needed to apply
to the v4.1 and v4.4 stable branches.

Thanks,
Yazen
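
The version ranges named in the reply above can be summarised in a small helper. This is only an illustrative sketch: `enum exposure` and `amd_uc_exposure()` are hypothetical names, the ranges are taken from the thread as written, and the helper says nothing about whether a particular tree already carries the fix or a backport.

```c
/* Hypothetical classification, following the versions named in the thread. */
enum exposure {
	NOT_PRESENT,	/* before v4.1: no AMD severity grading function */
	EXPOSED,	/* v4.1 to v4.5: issue present and visible */
	MASKED,		/* v4.6 and later: hidden by b2f9d678e28c */
};

/* Classify a kernel release by its major and minor version numbers. */
static enum exposure amd_uc_exposure(int major, int minor)
{
	if (major < 4 || (major == 4 && minor < 1))
		return NOT_PRESENT;
	if (major == 4 && minor < 6)
		return EXPOSED;
	return MASKED;
}
```

Under this reading, the v4.1 and v4.4 stable branches fall in the `EXPOSED` range and need the separate backport, while v4.9 and later are `MASKED` and take the patch as posted.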


Re: [PATCH] x86/MCE/AMD: Always give PANIC severity for UC errors in kernel context

2017-09-26 Thread Borislav Petkov
On Tue, Sep 26, 2017 at 03:21:22PM +0000, Ghannam, Yazen wrote:
> How should I CC:stable?

Documentation/process/stable-kernel-rules.rst

> There are the stable branches on kernel.org and some distro kernels based on
> older kernel versions.
> 
> The AMD severity grading function was introduced in v4.1 and has this issue.
> However, the following commit was included in v4.6 and masks the issue.
> 
> b2f9d678e28c x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
> 
> This patch will apply to v4.9 and later. Another version will be needed to
> apply to the v4.1 and v4.4 stable branches.

Then write that in the commit message. But *also* add the main reason
why you're doing this - to explicitly state that an error in IN_KERNEL
context always causes a panic on AMD. Because if it weren't for that, old
kernels should simply backport b2f9d678e28c and be done with it.

And I still don't understand the IN_KERNEL_RECOV thing you mention in
the commit message. That's Intel-only, what does it have to do with AMD?

Btw, while at it, fix that signature

static int mce_severity_amd_smca(struct mce *m, int err_ctx)

to

static int mce_severity_amd_smca(struct mce *m, enum context err_ctx)

Thx.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
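
The signature change requested above can be illustrated with a minimal sketch. C converts between `int` and enum types implicitly, so the gain is mostly readability plus diagnostics such as GCC's `-Wswitch`, which can flag a `switch` over an enum that misses an enumerator. The function `context_name()` is hypothetical, and the enum below is a reduced stand-in with assumed values, not the kernel's definition.

```c
/* Reduced stand-in for the kernel's context enum; values assumed. */
enum context { IN_KERNEL = 1, IN_USER = 2 };

/*
 * Taking `enum context` instead of `int` documents which values are
 * expected and lets the compiler check the switch for completeness.
 */
static const char *context_name(enum context err_ctx)
{
	switch (err_ctx) {
	case IN_KERNEL:
		return "IN_KERNEL";
	case IN_USER:
		return "IN_USER";
	}
	return "unknown";	/* value outside the enumerators */
}
```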


RE: [PATCH] x86/MCE/AMD: Always give PANIC severity for UC errors in kernel context

2017-09-27 Thread Ghannam, Yazen
> -----Original Message-----
> From: Borislav Petkov [mailto:b...@alien8.de]
> Sent: Tuesday, September 26, 2017 6:21 PM
> To: Ghannam, Yazen 
...
> > There are the stable branches on kernel.org and some distro kernels
> > based on older kernel versions.
> >
> > The AMD severity grading function was introduced in v4.1 and has this
> > issue.
> > However, the following commit was included in v4.6 and masks the issue.
> >
> > b2f9d678e28c x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT
> > exception table entries
> >
> > This patch will apply to v4.9 and later. Another version will be
> > needed to apply to the v4.1 and v4.4 stable branches.
> 
> Then write that in the commit message. But *also* add the main reason why
> you're doing this - to explicitly state that an error in IN_KERNEL context
> always causes a panic on AMD. Because if it weren't for that, old kernels
> should simply backport b2f9d678e28c and be done with it.
> 

Okay, will do.

> And I still don't understand the IN_KERNEL_RECOV thing you mention in the
> commit message. That's Intel-only, what does it have to do with AMD?
> 

Generally, we can use the IN_KERNEL_RECOV context to show that the error
is recoverable versus IN_KERNEL which we can consider unrecoverable.

Specifically, the Intel SER and AMD SUCCOR features represent the same
thing (MCA Recovery). I'll send another patch for enabling recovery on
AMD SUCCOR systems. I want to keep this patch as just a bug fix. 

> Btw, while at it, fix that signature
> 
> static int mce_severity_amd_smca(struct mce *m, int err_ctx)
> 
> to
> 
> static int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
> 

Sure, I'll do this in another patch. I want to keep this as a bug fix to apply
to the stable branches.

Thanks,
Yazen


Re: [PATCH] x86/MCE/AMD: Always give PANIC severity for UC errors in kernel context

2017-09-27 Thread Borislav Petkov
On Wed, Sep 27, 2017 at 03:17:51PM +0000, Ghannam, Yazen wrote:
> Generally, we can use the IN_KERNEL_RECOV context to show that the error
> is recoverable versus IN_KERNEL which we can consider unrecoverable.
> 
> Specifically, the Intel SER and AMD SUCCOR features represent the same
> thing (MCA Recovery). I'll send another patch for enabling recovery on
> AMD SUCCOR systems. I want to keep this patch as just a bug fix.

Ok, but then do not mention IN_KERNEL_RECOV here as it only confuses: is
it a recoverable error or is it not? /me scratches head...

> Sure, I'll do this in another patch. I want to keep this as a bug fix
> to apply to the stable branches.

Sure.

Thx.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.