Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-13 Thread Christophe LEROY




On 13/10/2018 at 10:48, Nicholas Piggin wrote:

On Sat, 13 Oct 2018 08:29:48 +
Christophe Leroy  wrote:


On 10/11/2018 02:31 PM, Christophe LEROY wrote:



On 09/10/2018 at 13:16, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 09:36:18 +
Christophe Leroy  wrote:
  

On 10/09/2018 05:30 AM, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 06:46:30 +0200
Christophe LEROY  wrote:

On 09/10/2018 at 06:32, Nicholas Piggin wrote:

On Mon, 8 Oct 2018 17:39:11 +0200
Christophe LEROY  wrote:

Hi Nick,

On 19/07/2017 at 08:59, Nicholas Piggin wrote:

Use nmi_enter similarly to system reset interrupts. This uses NMI
printk NMI buffers and turns off various debugging facilities that
helps avoid tripping on ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/traps.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
 void machine_check_exception(struct pt_regs *regs)
 {
-    enum ctx_state prev_state = exception_enter();
     int recover = 0;
+    bool nested = in_nmi();
+    if (!nested)
+        nmi_enter();


This alters preempt_count, so when die() is called in_interrupt()
returns true although the trap didn't happen in interrupt context.
As a result oops_end() panics with "Fatal exception in interrupt"
instead of gently sending SIGBUS to the faulting app.
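
For readers following along, the effect described above can be reproduced with a small userspace model of preempt_count. This is only a sketch: the bit layout and the behaviour of nmi_enter() follow my reading of include/linux/preempt.h and include/linux/hardirq.h from around v4.19 and should be treated as assumptions, and the model_* helpers are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout, modelled on include/linux/preempt.h around v4.19. */
#define SOFTIRQ_OFFSET  (1U << 8)
#define HARDIRQ_OFFSET  (1U << 16)
#define NMI_OFFSET      (1U << 20)

#define SOFTIRQ_MASK    (0xffU << 8)
#define HARDIRQ_MASK    (0xfU << 16)
#define NMI_MASK        (0x1U << 20)

/* Hypothetical stand-in for irq_count(): only the irq/NMI bits count. */
static unsigned int model_irq_count(unsigned int pc)
{
    return pc & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK);
}

/* Hypothetical stand-in for in_interrupt(). */
static bool model_in_interrupt(unsigned int pc)
{
    return model_irq_count(pc) != 0;
}

/* Hypothetical stand-in for in_nmi(). */
static bool model_in_nmi(unsigned int pc)
{
    return (pc & NMI_MASK) != 0;
}

/* Stand-in for nmi_enter(): returns the updated preempt count. */
static unsigned int model_nmi_enter(unsigned int pc)
{
    assert(!model_in_nmi(pc));  /* the kernel has BUG_ON(in_nmi()) here */
    return pc + NMI_OFFSET + HARDIRQ_OFFSET;
}
```

Starting from a task-context count of 0, the count after model_nmi_enter() has both the NMI and hardirq bits set, so model_in_interrupt() reports true even though nothing but the machine check itself was interrupted.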


Thanks for tracking that down.

Any idea on how to fix this ?


I would say we have to deliver the sigbus by hand.

    if ((user_mode(regs)))
        _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
    else
        die("Machine check", regs, SIGBUS);


And what about all the other things done by 'die()' ?

And what if it is a kernel thread ?

In one of my boards, I have a kernel thread regularly checking the HW,
and if it gets a machine check I expect it to gently stop and the die
notification to be delivered to all registered notifiers.

Before this patch, it was working well.


I guess the alternative is we could check regs->trap for machine
check in the die test. Complication is having to account for MCE
in an interrupt handler.

  if (in_interrupt()) {
      if (!IS_MCHECK_EXC(regs) ||
          (irq_count() - (NMI_OFFSET + HARDIRQ_OFFSET)))
          panic("Fatal exception in interrupt");
  }
  }

Something like that might work for you? We need a ppc64 macro for the
MCE, and can probably add something like in_nmi_from_interrupt() for
the second part of the test.


Don't know, I'm away from home on business trip so I won't be able to
test anything before next week. However it looks more or less like a
hack, doesn't it ?


I thought it seemed okay (with the right functions added). Actually it
could be a bit nicer to do this, then it works generally :

   if (in_interrupt()) {
       if (!in_nmi() || in_nmi_from_interrupt())
           panic("Fatal exception in interrupt");
   }
  


What about the following ?


Hmm, in some ways maybe it's nicer. One complication is I would like the
same thing to be available for platform specific machine check
handlers, so then you need to pass is_in_interrupt to them. Which you
can do without any problem... But is it cleaner than the above?


To me it looks cleaner than twiddling the preempt_count depending on
whether or not we were already in nmi().

Let's draft something and see what it looks like.


Ok, finally I went with your solution, see below, as it avoids having to
modify all subarch and platform specific machine check handlers.

Unfortunately it doesn't solve the issue, it only delays it:

oops_end() calls do_exit(), which has the following test:

    if (unlikely(in_interrupt()))
        panic("Aiee, killing interrupt handler!");


So for the time being I still have no idea how to fix that, do you?


Huh, I'm not sure. x86's MCE handling looks like it does this:

 /*
  * We might have interrupted pretty much anything.  In
  * fact, if we're a machine check, we can even interrupt
  * NMI processing.  We don't want in_nmi() to return true,
  * but we need to notify RCU.
  */
 rcu_nmi_enter();

But I don't see why they don't want the full NMI treatment there. I
thought the whole point was to do everything so you would get e.g.,
the NMI-safe printk and so on.

The reason the in_interrupt checks work below is that the synchronous
trap handlers (e.g., for BUG) do not enter interrupt context, so the
question is about the context they interrupted. Maybe the right way to
go is nmi_exit just before deciding to oops.


Yes I arrived at the same conclusion. I tested it just now and it works 
for me. 
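
A rough userspace sketch of that idea (calling nmi_exit just before deciding to oops) is given here for illustration. It is not the patch that was tested: the preempt_count layout is assumed from include/linux/preempt.h around v4.19, model_machine_check() and the result enum are hypothetical names, and the real handler does considerably more work.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout, modelled on include/linux/preempt.h around v4.19. */
#define SOFTIRQ_OFFSET  (1U << 8)
#define HARDIRQ_OFFSET  (1U << 16)
#define NMI_OFFSET      (1U << 20)

#define SOFTIRQ_MASK    (0xffU << 8)
#define HARDIRQ_MASK    (0xfU << 16)
#define NMI_MASK        (0x1U << 20)

static bool model_in_interrupt(unsigned int pc)
{
    return (pc & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK)) != 0;
}

static bool model_in_nmi(unsigned int pc)
{
    return (pc & NMI_MASK) != 0;
}

enum mce_result { MCE_RECOVERED, MCE_KILL_TASK, MCE_PANIC };

/* pc is the preempt count of the context the machine check interrupted. */
static enum mce_result model_machine_check(unsigned int pc, bool recovered)
{
    bool nested = model_in_nmi(pc);

    if (!nested)
        pc += NMI_OFFSET + HARDIRQ_OFFSET;  /* nmi_enter() */

    /* ... the platform machine check handler would run here ... */

    if (recovered)
        return MCE_RECOVERED;               /* nmi_exit() on the normal path */

    if (!nested) {
        assert(model_in_nmi(pc));           /* nmi_exit() has BUG_ON(!in_nmi()) */
        pc -= NMI_OFFSET + HARDIRQ_OFFSET;  /* nmi_exit() before oopsing */
    }

    /* oops_end() and do_exit() both panic when in_interrupt() is true. */
    return model_in_interrupt(pc) ? MCE_PANIC : MCE_KILL_TASK;
}
```

In this model an unrecovered machine check taken in task context ends with the task being killed, while one taken in hardirq, softirq, or nested NMI context still panics, because in_interrupt() once again describes the interrupted context.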

Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-13 Thread Nicholas Piggin
On Sat, 13 Oct 2018 08:29:48 +
Christophe Leroy  wrote:

> On 10/11/2018 02:31 PM, Christophe LEROY wrote:
> > 
> > 
> > On 09/10/2018 at 13:16, Nicholas Piggin wrote:
> >> On Tue, 9 Oct 2018 09:36:18 +
> >> Christophe Leroy  wrote:
> >>  
> >>> On 10/09/2018 05:30 AM, Nicholas Piggin wrote:  
> >>>> On Tue, 9 Oct 2018 06:46:30 +0200
> >>>> Christophe LEROY wrote:
> >>>>> On 09/10/2018 at 06:32, Nicholas Piggin wrote:
> >>>>>> On Mon, 8 Oct 2018 17:39:11 +0200
> >>>>>> Christophe LEROY wrote:
> >>>>>>> Hi Nick,
> >>>>>>>
> >>>>>>> On 19/07/2017 at 08:59, Nicholas Piggin wrote:
> >>>>>>>> Use nmi_enter similarly to system reset interrupts. This uses NMI
> >>>>>>>> printk NMI buffers and turns off various debugging facilities that
> >>>>>>>> helps avoid tripping on ourselves or other CPUs.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Nicholas Piggin
> >>>>>>>> ---
> >>>>>>>>  arch/powerpc/kernel/traps.c | 9 ++++++---
> >>>>>>>>  1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> >>>>>>>> index 2849c4f50324..6d31f9d7c333 100644
> >>>>>>>> --- a/arch/powerpc/kernel/traps.c
> >>>>>>>> +++ b/arch/powerpc/kernel/traps.c
> >>>>>>>> @@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
> >>>>>>>>  void machine_check_exception(struct pt_regs *regs)
> >>>>>>>>  {
> >>>>>>>> -    enum ctx_state prev_state = exception_enter();
> >>>>>>>>      int recover = 0;
> >>>>>>>> +    bool nested = in_nmi();
> >>>>>>>> +    if (!nested)
> >>>>>>>> +        nmi_enter();
> >>>>>>>
> >>>>>>> This alters preempt_count, so when die() is called in_interrupt()
> >>>>>>> returns true although the trap didn't happen in interrupt context.
> >>>>>>> As a result oops_end() panics with "Fatal exception in interrupt"
> >>>>>>> instead of gently sending SIGBUS to the faulting app.
> >>>>>>
> >>>>>> Thanks for tracking that down.
> >>>>>>> Any idea on how to fix this ?
> >>>>>>
> >>>>>> I would say we have to deliver the sigbus by hand.
> >>>>>>
> >>>>>>     if ((user_mode(regs)))
> >>>>>>         _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
> >>>>>>     else
> >>>>>>         die("Machine check", regs, SIGBUS);
> >>>>>
> >>>>> And what about all the other things done by 'die()' ?
> >>>>>
> >>>>> And what if it is a kernel thread ?
> >>>>>
> >>>>> In one of my boards, I have a kernel thread regularly checking the HW,
> >>>>> and if it gets a machine check I expect it to gently stop and the die
> >>>>> notification to be delivered to all registered notifiers.
> >>>>>
> >>>>> Before this patch, it was working well.
> >>>>
> >>>> I guess the alternative is we could check regs->trap for machine
> >>>> check in the die test. Complication is having to account for MCE
> >>>> in an interrupt handler.
> >>>>
> >>>>   if (in_interrupt()) {
> >>>>       if (!IS_MCHECK_EXC(regs) ||
> >>>>           (irq_count() - (NMI_OFFSET + HARDIRQ_OFFSET)))
> >>>>           panic("Fatal exception in interrupt");
> >>>>   }
> >>>>
> >>>> Something like that might work for you? We need a ppc64 macro for the
> >>>> MCE, and can probably add something like in_nmi_from_interrupt() for
> >>>> the second part of the test.
> >>>
> >>> Don't know, I'm away from home on business trip so I won't be able to
> >>> test anything before next week. However it looks more or less like a
> >>> hack, doesn't it ?  
> >>
> >> I thought it seemed okay (with the right functions added). Actually it
> >> could be a bit nicer to do this, then it works generally :
> >>
> >>   if (in_interrupt()) {
> >>    if (!in_nmi() || in_nmi_from_interrupt())
> >>    panic("Fatal exception in interrupt");
> >>   }
> >>  
> >>>
> >>> What about the following ?  
> >>
> >> Hmm, in some ways maybe it's nicer. One complication is I would like the
> >> same thing to be available for platform specific machine check
> >> handlers, so then you need to pass is_in_interrupt to them. Which you
> >> can do without any problem... But is it cleaner than the above?  
> > 
> > To me it looks cleaner than twiddling the preempt_count depending on
> > whether or not we were already in nmi().
> > 
> > Let's draft something and see what it looks like.  
> 
> Ok, finally I went with your solution, see below, as it avoids having to
> modify all subarch and platform specific machine check handlers.
> 
> Unfortunately it doesn't solve the issue, it only delays it:
> 
> oops_end() calls do_exit(), which has the following test:
> 
>   if (unlikely(in_interrupt()))
>   panic("Aiee, killing interrupt handler!");
> 
> 
> So for the time being I still have no idea how to fix that, do you?

Huh, I'm not sure. x86's MCE handling looks like it does this:

/*
 * We might have interrupted pretty much anything.  In
 

Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-13 Thread Christophe Leroy




On 10/11/2018 02:31 PM, Christophe LEROY wrote:



On 09/10/2018 at 13:16, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 09:36:18 +
Christophe Leroy  wrote:


On 10/09/2018 05:30 AM, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 06:46:30 +0200
Christophe LEROY  wrote:

On 09/10/2018 at 06:32, Nicholas Piggin wrote:

On Mon, 8 Oct 2018 17:39:11 +0200
Christophe LEROY  wrote:

Hi Nick,

On 19/07/2017 at 08:59, Nicholas Piggin wrote:

Use nmi_enter similarly to system reset interrupts. This uses NMI
printk NMI buffers and turns off various debugging facilities that
helps avoid tripping on ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/traps.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
 
 void machine_check_exception(struct pt_regs *regs)
 {
-    enum ctx_state prev_state = exception_enter();
     int recover = 0;
+    bool nested = in_nmi();
+    if (!nested)
+        nmi_enter();


This alters preempt_count, so when die() is called in_interrupt()
returns true although the trap didn't happen in interrupt context.
As a result oops_end() panics with "Fatal exception in interrupt"
instead of gently sending SIGBUS to the faulting app.


Thanks for tracking that down.

Any idea on how to fix this ?


I would say we have to deliver the sigbus by hand.

   if ((user_mode(regs)))
   _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
   else
   die("Machine check", regs, SIGBUS);


And what about all the other things done by 'die()' ?

And what if it is a kernel thread ?

In one of my boards, I have a kernel thread regularly checking the HW,
and if it gets a machine check I expect it to gently stop and the die
notification to be delivered to all registered notifiers.

Before this patch, it was working well.


I guess the alternative is we could check regs->trap for machine
check in the die test. Complication is having to account for MCE
in an interrupt handler.

 if (in_interrupt()) {
  if (!IS_MCHECK_EXC(regs) || (irq_count() - 
(NMI_OFFSET + HARDIRQ_OFFSET)))

  panic("Fatal exception in interrupt");
 }

Something like that might work for you? We need a ppc64 macro for the
MCE, and can probably add something like in_nmi_from_interrupt() for
the second part of the test.


Don't know, I'm away from home on business trip so I won't be able to
test anything before next week. However it looks more or less like a
hack, doesn't it ?


I thought it seemed okay (with the right functions added). Actually it
could be a bit nicer to do this, then it works generally :

  if (in_interrupt()) {
   if (!in_nmi() || in_nmi_from_interrupt())
   panic("Fatal exception in interrupt");
  }



What about the following ?


Hmm, in some ways maybe it's nicer. One complication is I would like the
same thing to be available for platform specific machine check
handlers, so then you need to pass is_in_interrupt to them. Which you
can do without any problem... But is it cleaner than the above?


To me it looks cleaner than twiddling the preempt_count depending on
whether or not we were already in nmi().


Let's draft something and see what it looks like.


Ok, finally I went with your solution, see below, as it avoids having to
modify all subarch and platform specific machine check handlers.

Unfortunately it doesn't solve the issue, it only delays it:

oops_end() calls do_exit(), which has the following test:

if (unlikely(in_interrupt()))
panic("Aiee, killing interrupt handler!");


So for the time being I still have no idea how to fix that, do you?

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index fd58749b4d6b..3569e826f0c2 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -132,6 +132,21 @@ static void pmac_backlight_unblank(void)
 static inline void pmac_backlight_unblank(void) { }
 #endif

+static bool from_interrupt(void)
+{
+    if (!in_nmi())
+        return in_interrupt();
+    /*
+     * If we are in NMI, we need to determine if we were already in
+     * interrupt before entering NMI. To do that, we recalculate irq_count()
+     * from before the call to nmi_enter().
+     * If we were already in NMI and reentered in a new one, we have
+     * increased the preempt count by HARDIRQ_OFFSET, so the calculated
+     * value will be non-zero.
+     */
+    return irq_count() - NMI_OFFSET - HARDIRQ_OFFSET;
+}
+
 /*
  * If oops/die is expected to crash the machine, return true here.
  *
@@ -147,8 +162,7 @@ bool die_will_crash(void)
return true;
if 

Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-11 Thread Christophe LEROY




On 09/10/2018 at 13:16, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 09:36:18 +
Christophe Leroy  wrote:


On 10/09/2018 05:30 AM, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 06:46:30 +0200
Christophe LEROY  wrote:
   

On 09/10/2018 at 06:32, Nicholas Piggin wrote:

On Mon, 8 Oct 2018 17:39:11 +0200
Christophe LEROY  wrote:
  

Hi Nick,

On 19/07/2017 at 08:59, Nicholas Piggin wrote:

Use nmi_enter similarly to system reset interrupts. This uses NMI
printk NMI buffers and turns off various debugging facilities that
helps avoid tripping on ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/traps.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
 
 void machine_check_exception(struct pt_regs *regs)
 {
-    enum ctx_state prev_state = exception_enter();
     int recover = 0;
+    bool nested = in_nmi();
+    if (!nested)
+        nmi_enter();


This alters preempt_count, so when die() is called in_interrupt()
returns true although the trap didn't happen in interrupt context.
As a result oops_end() panics with "Fatal exception in interrupt"
instead of gently sending SIGBUS to the faulting app.


Thanks for tracking that down.
  

Any idea on how to fix this ?


I would say we have to deliver the sigbus by hand.

   if ((user_mode(regs)))
   _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
   else
   die("Machine check", regs, SIGBUS);
  


And what about all the other things done by 'die()' ?

And what if it is a kernel thread ?

In one of my boards, I have a kernel thread regularly checking the HW,
and if it gets a machine check I expect it to gently stop and the die
notification to be delivered to all registered notifiers.

Before this patch, it was working well.


I guess the alternative is we could check regs->trap for machine
check in the die test. Complication is having to account for MCE
in an interrupt handler.

 if (in_interrupt()) {
  if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET + 
HARDIRQ_OFFSET)))
  panic("Fatal exception in interrupt");
 }

Something like that might work for you? We need a ppc64 macro for the
MCE, and can probably add something like in_nmi_from_interrupt() for
the second part of the test.


Don't know, I'm away from home on business trip so I won't be able to
test anything before next week. However it looks more or less like a
hack, doesn't it ?


I thought it seemed okay (with the right functions added). Actually it
could be a bit nicer to do this, then it works generally :

  if (in_interrupt()) {
   if (!in_nmi() || in_nmi_from_interrupt())
   panic("Fatal exception in interrupt");
  }



What about the following ?


Hmm, in some ways maybe it's nicer. One complication is I would like the
same thing to be available for platform specific machine check
handlers, so then you need to pass is_in_interrupt to them. Which you
can do without any problem... But is it cleaner than the above?


To me it looks cleaner than twiddling the preempt_count depending on
whether or not we were already in nmi().


Let's draft something and see what it looks like.




I guess one advantage of yours is that a BUG somewhere in the NMI path
will panic the system. Or is that a disadvantage?


Why would it panic the system more than it does now? And is it an issue
at all? Doesn't BUG() panic in any case?


Christophe


Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-11 Thread Christophe LEROY




On 09/10/2018 at 14:14, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 14:01:37 +0200
Christophe LEROY  wrote:


On 09/10/2018 at 13:16, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 09:36:18 +
Christophe Leroy  wrote:
   

On 10/09/2018 05:30 AM, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 06:46:30 +0200
Christophe LEROY  wrote:
  

On 09/10/2018 at 06:32, Nicholas Piggin wrote:

On Mon, 8 Oct 2018 17:39:11 +0200
Christophe LEROY  wrote:
 

Hi Nick,

On 19/07/2017 at 08:59, Nicholas Piggin wrote:

Use nmi_enter similarly to system reset interrupts. This uses NMI
printk NMI buffers and turns off various debugging facilities that
helps avoid tripping on ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/traps.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
 
 void machine_check_exception(struct pt_regs *regs)
 {
-    enum ctx_state prev_state = exception_enter();
     int recover = 0;
+    bool nested = in_nmi();
+    if (!nested)
+        nmi_enter();


This alters preempt_count, so when die() is called in_interrupt()
returns true although the trap didn't happen in interrupt context.
As a result oops_end() panics with "Fatal exception in interrupt"
instead of gently sending SIGBUS to the faulting app.


Thanks for tracking that down.
 

Any idea on how to fix this ?


I would say we have to deliver the sigbus by hand.

if ((user_mode(regs)))
_exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
else
die("Machine check", regs, SIGBUS);
 


And what about all the other things done by 'die()' ?

And what if it is a kernel thread ?

In one of my boards, I have a kernel thread regularly checking the HW,
and if it gets a machine check I expect it to gently stop and the die
notification to be delivered to all registered notifiers.

Before this patch, it was working well.


I guess the alternative is we could check regs->trap for machine
check in the die test. Complication is having to account for MCE
in an interrupt handler.

  if (in_interrupt()) {
   if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET + 
HARDIRQ_OFFSET)))
   panic("Fatal exception in interrupt");
  }

Something like that might work for you? We need a ppc64 macro for the
MCE, and can probably add something like in_nmi_from_interrupt() for
the second part of the test.


Don't know, I'm away from home on business trip so I won't be able to
test anything before next week. However it looks more or less like a
hack, doesn't it ?


I thought it seemed okay (with the right functions added). Actually it
could be a bit nicer to do this, then it works generally :

   if (in_interrupt()) {
if (!in_nmi() || in_nmi_from_interrupt())
panic("Fatal exception in interrupt");
   }



Yes looks nice, but:
1/ what is in_nmi_from_interrupt() ? Is it (in_nmi() && (in_irq() ||
in_softirq())) ?


   return (irq_count() - (NMI_OFFSET + HARDIRQ_OFFSET)) != 0;

(basically just in_interrupt() with the nmi_enter undone)
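
To make the proposed test concrete, here is a minimal userspace sketch of in_nmi_from_interrupt() together with the die-path check it would feed. The preempt_count layout is assumed from include/linux/preempt.h around v4.19, the model_* functions are hypothetical stand-ins working on a plain integer, and model_die_should_panic() only mirrors the `if (in_interrupt())` snippet quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout, modelled on include/linux/preempt.h around v4.19. */
#define SOFTIRQ_OFFSET  (1U << 8)
#define HARDIRQ_OFFSET  (1U << 16)
#define NMI_OFFSET      (1U << 20)

#define SOFTIRQ_MASK    (0xffU << 8)
#define HARDIRQ_MASK    (0xfU << 16)
#define NMI_MASK        (0x1U << 20)

static unsigned int model_irq_count(unsigned int pc)
{
    return pc & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK);
}

static bool model_in_interrupt(unsigned int pc)
{
    return model_irq_count(pc) != 0;
}

static bool model_in_nmi(unsigned int pc)
{
    return (pc & NMI_MASK) != 0;
}

/* in_interrupt() with the effect of nmi_enter() undone. */
static bool model_in_nmi_from_interrupt(unsigned int pc)
{
    assert(model_in_nmi(pc));   /* only meaningful inside an NMI */
    return (model_irq_count(pc) - (NMI_OFFSET + HARDIRQ_OFFSET)) != 0;
}

/* Mirrors the check proposed for the die path. */
static bool model_die_should_panic(unsigned int pc)
{
    if (model_in_interrupt(pc)) {
        if (!model_in_nmi(pc) || model_in_nmi_from_interrupt(pc))
            return true;
    }
    return false;
}
```

With this model, a machine check taken from task context (count NMI_OFFSET + HARDIRQ_OFFSET) does not panic, while one taken from a hardirq or softirq handler still does, since the subtraction leaves the interrupted context's bits behind.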


2/ what about in_nmi_from_nmi(), how do we detect that ?


Oh good point, I'm not sure. I guess we could irq_enter() in the
nested case, I think that would make in_nmi_from_interrupt()
return true.


Yes we could, but I find it ugly.

Don't you think it looks less strange to just check in_interrupt()
before calling nmi_enter()?
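
As an illustration of that alternative, the sketch below samples in_interrupt() before nmi_enter() and compares the result with the subtraction-based test for a nested machine check. It is a userspace model only: the preempt_count layout is assumed from include/linux/preempt.h around v4.19, and struct mce_entry, model_mce_enter() and the other model_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout, modelled on include/linux/preempt.h around v4.19. */
#define SOFTIRQ_OFFSET  (1U << 8)
#define HARDIRQ_OFFSET  (1U << 16)
#define NMI_OFFSET      (1U << 20)

#define SOFTIRQ_MASK    (0xffU << 8)
#define HARDIRQ_MASK    (0xfU << 16)
#define NMI_MASK        (0x1U << 20)

static unsigned int model_irq_count(unsigned int pc)
{
    return pc & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK);
}

static bool model_in_interrupt(unsigned int pc)
{
    return model_irq_count(pc) != 0;
}

static bool model_in_nmi(unsigned int pc)
{
    return (pc & NMI_MASK) != 0;
}

static unsigned int model_nmi_enter(unsigned int pc)
{
    assert(!model_in_nmi(pc));  /* the kernel has BUG_ON(in_nmi()) here */
    return pc + NMI_OFFSET + HARDIRQ_OFFSET;
}

struct mce_entry {
    unsigned int pc;        /* preempt count while the handler runs */
    bool nested;            /* already in NMI on entry */
    bool from_interrupt;    /* in_interrupt() sampled before nmi_enter() */
};

/* Entry sequence: sample the context first, then enter NMI if not nested. */
static struct mce_entry model_mce_enter(unsigned int pc)
{
    struct mce_entry e;

    e.from_interrupt = model_in_interrupt(pc);
    e.nested = model_in_nmi(pc);
    e.pc = e.nested ? pc : model_nmi_enter(pc);
    return e;
}

/* The subtraction-based test from the thread, for comparison. */
static bool model_in_nmi_from_interrupt(unsigned int pc)
{
    return (model_irq_count(pc) - (NMI_OFFSET + HARDIRQ_OFFSET)) != 0;
}
```

For a machine check nested inside another NMI that itself came from task context, the count is unchanged on the second entry, so the subtraction-based test reports "not from interrupt" while the sampled flag correctly reports true. The cost is that the flag then has to be passed down to whoever makes the panic decision.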


Christophe


Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-09 Thread Nicholas Piggin
On Tue, 9 Oct 2018 14:01:37 +0200
Christophe LEROY  wrote:

> On 09/10/2018 at 13:16, Nicholas Piggin wrote:
> > On Tue, 9 Oct 2018 09:36:18 +
> > Christophe Leroy  wrote:
> >   
> >> On 10/09/2018 05:30 AM, Nicholas Piggin wrote:  
> >>> On Tue, 9 Oct 2018 06:46:30 +0200
> >>> Christophe LEROY  wrote:
> >>>  
> >>>> On 09/10/2018 at 06:32, Nicholas Piggin wrote:
> >>>>> On Mon, 8 Oct 2018 17:39:11 +0200
> >>>>> Christophe LEROY wrote:
> >>>>>
> >>>>>> Hi Nick,
> >>>>>>
> >>>>>> On 19/07/2017 at 08:59, Nicholas Piggin wrote:
> >>>>>>> Use nmi_enter similarly to system reset interrupts. This uses NMI
> >>>>>>> printk NMI buffers and turns off various debugging facilities that
> >>>>>>> helps avoid tripping on ourselves or other CPUs.
> >>>>>>>
> >>>>>>> Signed-off-by: Nicholas Piggin
> >>>>>>> ---
> >>>>>>>  arch/powerpc/kernel/traps.c | 9 ++++++---
> >>>>>>>  1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> >>>>>>> index 2849c4f50324..6d31f9d7c333 100644
> >>>>>>> --- a/arch/powerpc/kernel/traps.c
> >>>>>>> +++ b/arch/powerpc/kernel/traps.c
> >>>>>>> @@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
> >>>>>>>  
> >>>>>>>  void machine_check_exception(struct pt_regs *regs)
> >>>>>>>  {
> >>>>>>> -    enum ctx_state prev_state = exception_enter();
> >>>>>>>      int recover = 0;
> >>>>>>> +    bool nested = in_nmi();
> >>>>>>> +    if (!nested)
> >>>>>>> +        nmi_enter();
> >>>>>>
> >>>>>> This alters preempt_count, so when die() is called in_interrupt()
> >>>>>> returns true although the trap didn't happen in interrupt context.
> >>>>>> As a result oops_end() panics with "Fatal exception in interrupt"
> >>>>>> instead of gently sending SIGBUS to the faulting app.
> >>>>>
> >>>>> Thanks for tracking that down.
> >>>>>
> >>>>>> Any idea on how to fix this ?
> >>>>>
> >>>>> I would say we have to deliver the sigbus by hand.
> >>>>>
> >>>>>     if ((user_mode(regs)))
> >>>>>         _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
> >>>>>     else
> >>>>>         die("Machine check", regs, SIGBUS);
> >>>>>
> >>>>
> >>>> And what about all the other things done by 'die()' ?
> >>>>
> >>>> And what if it is a kernel thread ?
> >>>>
> >>>> In one of my boards, I have a kernel thread regularly checking the HW,
> >>>> and if it gets a machine check I expect it to gently stop and the die
> >>>> notification to be delivered to all registered notifiers.
> >>>>
> >>>> Before this patch, it was working well.
> >>>
> >>> I guess the alternative is we could check regs->trap for machine
> >>> check in the die test. Complication is having to account for MCE
> >>> in an interrupt handler.
> >>>
> >>>  if (in_interrupt()) {
> >>>   if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET 
> >>> + HARDIRQ_OFFSET)))
> >>>   panic("Fatal exception in interrupt");
> >>>  }
> >>>
> >>> Something like that might work for you? We need a ppc64 macro for the
> >>> MCE, and can probably add something like in_nmi_from_interrupt() for
> >>> the second part of the test.  
> >>
> >> Don't know, I'm away from home on business trip so I won't be able to
> >> test anything before next week. However it looks more or less like a
> >> hack, doesn't it ?  
> > 
> > I thought it seemed okay (with the right functions added). Actually it
> > could be a bit nicer to do this, then it works generally :
> > 
> >   if (in_interrupt()) {
> >if (!in_nmi() || in_nmi_from_interrupt())
> >panic("Fatal exception in interrupt");
> >   }  
> 
> 
> Yes looks nice, but:
> 1/ what is in_nmi_from_interrupt() ? Is it (in_nmi() && (in_irq() ||
> in_softirq())) ?

  return (irq_count() - (NMI_OFFSET + HARDIRQ_OFFSET)) != 0;

(basically just in_interrupt() with the nmi_enter undone)

> 2/ what about in_nmi_from_nmi(), how do we detect that ?

Oh good point, I'm not sure. I guess we could irq_enter() in the
nested case, I think that would make in_nmi_from_interrupt()
return true.

Thanks,
Nick


Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-09 Thread Christophe LEROY




On 09/10/2018 at 13:16, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 09:36:18 +
Christophe Leroy  wrote:


On 10/09/2018 05:30 AM, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 06:46:30 +0200
Christophe LEROY  wrote:
   

On 09/10/2018 at 06:32, Nicholas Piggin wrote:

On Mon, 8 Oct 2018 17:39:11 +0200
Christophe LEROY  wrote:
  

Hi Nick,

On 19/07/2017 at 08:59, Nicholas Piggin wrote:

Use nmi_enter similarly to system reset interrupts. This uses NMI
printk NMI buffers and turns off various debugging facilities that
helps avoid tripping on ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/traps.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
 
 void machine_check_exception(struct pt_regs *regs)
 {
-    enum ctx_state prev_state = exception_enter();
     int recover = 0;
+    bool nested = in_nmi();
+    if (!nested)
+        nmi_enter();


This alters preempt_count, so when die() is called in_interrupt()
returns true although the trap didn't happen in interrupt context.
As a result oops_end() panics with "Fatal exception in interrupt"
instead of gently sending SIGBUS to the faulting app.


Thanks for tracking that down.
  

Any idea on how to fix this ?


I would say we have to deliver the sigbus by hand.

   if ((user_mode(regs)))
   _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
   else
   die("Machine check", regs, SIGBUS);
  


And what about all the other things done by 'die()' ?

And what if it is a kernel thread ?

In one of my boards, I have a kernel thread regularly checking the HW,
and if it gets a machine check I expect it to gently stop and the die
notification to be delivered to all registered notifiers.

Before this patch, it was working well.


I guess the alternative is we could check regs->trap for machine
check in the die test. Complication is having to account for MCE
in an interrupt handler.

 if (in_interrupt()) {
  if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET + 
HARDIRQ_OFFSET)))
  panic("Fatal exception in interrupt");
 }

Something like that might work for you? We need a ppc64 macro for the
MCE, and can probably add something like in_nmi_from_interrupt() for
the second part of the test.


Don't know, I'm away from home on business trip so I won't be able to
test anything before next week. However it looks more or less like a
hack, doesn't it ?


I thought it seemed okay (with the right functions added). Actually it
could be a bit nicer to do this, then it works generally :

  if (in_interrupt()) {
   if (!in_nmi() || in_nmi_from_interrupt())
   panic("Fatal exception in interrupt");
  }



Yes looks nice, but:
1/ what is in_nmi_from_interrupt() ? Is it (in_nmi() && (in_irq() ||
in_softirq())) ?

2/ what about in_nmi_from_nmi(), how do we detect that ?

Christophe





What about the following ?


Hmm, in some ways maybe it's nicer. One complication is I would like the
same thing to be available for platform specific machine check
handlers, so then you need to pass is_in_interrupt to them. Which you
can do without any problem... But is it cleaner than the above?

I guess one advantage of yours is that a BUG somewhere in the NMI path
will panic the system. Or is that a disadvantage?

Thanks,
Nick




diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index fd58749b4d6b..1f09033a5103 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -208,7 +208,7 @@ static unsigned long oops_begin(struct pt_regs *regs)
   NOKPROBE_SYMBOL(oops_begin);

   static void oops_end(unsigned long flags, struct pt_regs *regs,
-  int signr)
+int signr, bool is_in_interrupt)
   {
bust_spinlocks(0);
add_taint(TAINT_DIE, LOCKDEP_NOW_UNRELIABLE);
@@ -247,7 +247,7 @@ static void oops_end(unsigned long flags, struct
pt_regs *regs,
mdelay(MSEC_PER_SEC);
}

-   if (in_interrupt())
+   if (is_in_interrupt)
panic("Fatal exception in interrupt");
if (panic_on_oops)
panic("Fatal exception");
@@ -288,7 +288,7 @@ static int __die(const char *str, struct pt_regs
*regs, long err)
   }
   NOKPROBE_SYMBOL(__die);

-void die(const char *str, struct pt_regs *regs, long err)
+static void nmi_die(const char *str, struct pt_regs *regs, long err,
bool is_in_interrupt)
   {
unsigned long flags;

@@ -303,7 +303,13 @@ void die(const char *str, struct pt_regs *regs,
long err)
flags = oops_begin(regs);
if (__die(str, regs, err))
err = 

Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-09 Thread Nicholas Piggin
On Tue, 9 Oct 2018 09:36:18 +
Christophe Leroy  wrote:

> On 10/09/2018 05:30 AM, Nicholas Piggin wrote:
> > On Tue, 9 Oct 2018 06:46:30 +0200
> > Christophe LEROY  wrote:
> >   
> >> On 09/10/2018 at 06:32, Nicholas Piggin wrote:
> >>> On Mon, 8 Oct 2018 17:39:11 +0200
> >>> Christophe LEROY  wrote:
> >>>  
> >>>> Hi Nick,
> >>>>
> >>>> On 19/07/2017 at 08:59, Nicholas Piggin wrote:
> >>>>> Use nmi_enter similarly to system reset interrupts. This uses NMI
> >>>>> printk NMI buffers and turns off various debugging facilities that
> >>>>> helps avoid tripping on ourselves or other CPUs.
> >>>>>
> >>>>> Signed-off-by: Nicholas Piggin
> >>>>> ---
> >>>>>  arch/powerpc/kernel/traps.c | 9 ++++++---
> >>>>>  1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> >>>>> index 2849c4f50324..6d31f9d7c333 100644
> >>>>> --- a/arch/powerpc/kernel/traps.c
> >>>>> +++ b/arch/powerpc/kernel/traps.c
> >>>>> @@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
> >>>>>  
> >>>>>  void machine_check_exception(struct pt_regs *regs)
> >>>>>  {
> >>>>> -    enum ctx_state prev_state = exception_enter();
> >>>>>      int recover = 0;
> >>>>> +    bool nested = in_nmi();
> >>>>> +    if (!nested)
> >>>>> +        nmi_enter();
> >>>>
> >>>> This alters preempt_count, so when die() is called in_interrupt()
> >>>> returns true although the trap didn't happen in interrupt context.
> >>>> As a result oops_end() panics with "Fatal exception in interrupt"
> >>>> instead of gently sending SIGBUS to the faulting app.
> >>>
> >>> Thanks for tracking that down.
> >>>
> >>>> Any idea on how to fix this ?
> >>>
> >>> I would say we have to deliver the sigbus by hand.
> >>>
> >>>   if ((user_mode(regs)))
> >>>   _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
> >>>   else
> >>>   die("Machine check", regs, SIGBUS);
> >>>  
> >>
> >> And what about all the other things done by 'die()' ?
> >>
> >> And what if it is a kernel thread ?
> >>
> >> On one of my boards, I have a kernel thread regularly checking the HW,
> >> and if it gets a machine check I expect it to gently stop and the die
> >> notification to be delivered to all registered notifiers.
> >>
> >> Until this patch, it was working well.
> > 
> > I guess the alternative is we could check regs->trap for machine
> > check in the die test. Complication is having to account for MCE
> > in an interrupt handler.
> > 
> > if (in_interrupt()) {
> >  if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET + 
> > HARDIRQ_OFFSET)))
> >  panic("Fatal exception in interrupt");
> > }
> > 
> > Something like that might work for you? We need a ppc64 macro for the
> > MCE, and can probably add something like in_nmi_from_interrupt() for
> > the second part of the test.  
> 
> Don't know, I'm away from home on a business trip so I won't be able to
> test anything before next week. However, it looks more or less like a
> hack, doesn't it?

I thought it seemed okay (with the right functions added). Actually, it
could be a bit nicer to do this, since it then works generally:

 if (in_interrupt()) {
  if (!in_nmi() || in_nmi_from_interrupt())
  panic("Fatal exception in interrupt");
 }
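
To check the logic of that test, here is a minimal user-space sketch (not
kernel code). The preempt_count bit layout is my recollection of
include/linux/preempt.h around v4.19 and should be treated as an
assumption. in_nmi_from_interrupt() does not exist yet, so the model_*
helpers below are hypothetical stand-ins that take the count as a
parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed preempt_count layout: bits 8-15 softirq, 16-19 hardirq, 20 NMI. */
#define SOFTIRQ_MASK   (0xffUL << 8)
#define HARDIRQ_MASK   (0x0fUL << 16)
#define NMI_MASK       (0x01UL << 20)
#define SOFTIRQ_OFFSET (1UL << 8)
#define HARDIRQ_OFFSET (1UL << 16)
#define NMI_OFFSET     (1UL << 20)

/* Model of irq_count(): only the softirq, hardirq and NMI fields. */
static unsigned long model_irq_count(unsigned long count)
{
	return count & (SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK);
}

/*
 * Hypothetical in_nmi_from_interrupt(): true when we are in an NMI and
 * something other than the NMI's own contribution (NMI_OFFSET +
 * HARDIRQ_OFFSET, as added by nmi_enter()) is left in irq_count().
 */
static bool model_in_nmi_from_interrupt(unsigned long count)
{
	if (!(count & NMI_MASK))
		return false;
	return model_irq_count(count) - (NMI_OFFSET + HARDIRQ_OFFSET) != 0;
}

/* Model of the proposed oops_end() test. */
static bool model_fatal_in_interrupt(unsigned long count)
{
	bool in_interrupt = model_irq_count(count) != 0;
	bool in_nmi = (count & NMI_MASK) != 0;

	if (!in_interrupt)
		return false;
	return !in_nmi || model_in_nmi_from_interrupt(count);
}
```

With this model, a machine check taken from task context (count equal to
NMI_OFFSET + HARDIRQ_OFFSET) does not hit the panic, while a plain hardirq,
or a machine check taken on top of a hardirq or softirq, still does.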

> 
> What about the following ?

Hmm, in some ways maybe it's nicer. One complication is I would like the
same thing to be available for platform specific machine check
handlers, so then you need to pass is_in_interrupt to them. Which you
can do without any problem... But is it cleaner than the above?

I guess one advantage of yours is that a BUG somewhere in the NMI path
will panic the system. Or is that a disadvantage?

Thanks,
Nick


> 
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index fd58749b4d6b..1f09033a5103 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -208,7 +208,7 @@ static unsigned long oops_begin(struct pt_regs *regs)
>   NOKPROBE_SYMBOL(oops_begin);
> 
>   static void oops_end(unsigned long flags, struct pt_regs *regs,
> -int signr)
> +  int signr, bool is_in_interrupt)
>   {
>   bust_spinlocks(0);
>   add_taint(TAINT_DIE, LOCKDEP_NOW_UNRELIABLE);
> @@ -247,7 +247,7 @@ static void oops_end(unsigned long flags, struct 
> pt_regs *regs,
>   mdelay(MSEC_PER_SEC);
>   }
> 
> - if (in_interrupt())
> + if (is_in_interrupt)
>   panic("Fatal exception in interrupt");
>   if (panic_on_oops)
>   panic("Fatal exception");
> @@ -288,7 +288,7 @@ static int __die(const char *str, struct pt_regs 
> *regs, long err)
>   }
>   NOKPROBE_SYMBOL(__die);
> 
> -void die(const char *str, struct pt_regs *regs, long err)
> +static void nmi_die(const char *str, struct pt_regs *regs, long err, 

Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-09 Thread Christophe Leroy




On 10/09/2018 05:30 AM, Nicholas Piggin wrote:

On Tue, 9 Oct 2018 06:46:30 +0200
Christophe LEROY  wrote:


On 09/10/2018 at 06:32, Nicholas Piggin wrote:

On Mon, 8 Oct 2018 17:39:11 +0200
Christophe LEROY  wrote:
   

Hi Nick,

On 19/07/2017 at 08:59, Nicholas Piggin wrote:

Use nmi_enter similarly to system reset interrupts. This uses the NMI
printk buffers and turns off various debugging facilities, which
helps avoid tripping over ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
arch/powerpc/kernel/traps.c | 9 ++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)

void machine_check_exception(struct pt_regs *regs)

{
-   enum ctx_state prev_state = exception_enter();
int recover = 0;
+   bool nested = in_nmi();
+   if (!nested)
+   nmi_enter();


This alters preempt_count, so when die() is called
in_interrupt() returns true although the trap didn't happen in
interrupt, and oops_end() panics with "fatal exception in interrupt"
instead of gently sending SIGBUS to the faulting app.


Thanks for tracking that down.
   

Any idea on how to fix this ?


I would say we have to deliver the sigbus by hand.

  if ((user_mode(regs)))
  _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
  else
  die("Machine check", regs, SIGBUS);
   


And what about all the other things done by 'die()' ?

And what if it is a kernel thread ?

On one of my boards, I have a kernel thread regularly checking the HW,
and if it gets a machine check I expect it to gently stop and the die
notification to be delivered to all registered notifiers.

Until this patch, it was working well.


I guess the alternative is we could check regs->trap for machine
check in the die test. Complication is having to account for MCE
in an interrupt handler.

if (in_interrupt()) {
 if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET + 
HARDIRQ_OFFSET)))
 panic("Fatal exception in interrupt");
}

Something like that might work for you? We need a ppc64 macro for the
MCE, and can probably add something like in_nmi_from_interrupt() for
the second part of the test.


Don't know, I'm away from home on a business trip so I won't be able to
test anything before next week. However, it looks more or less like a
hack, doesn't it?


What about the following ?

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index fd58749b4d6b..1f09033a5103 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -208,7 +208,7 @@ static unsigned long oops_begin(struct pt_regs *regs)
 NOKPROBE_SYMBOL(oops_begin);

 static void oops_end(unsigned long flags, struct pt_regs *regs,
-  int signr)
+int signr, bool is_in_interrupt)
 {
bust_spinlocks(0);
add_taint(TAINT_DIE, LOCKDEP_NOW_UNRELIABLE);
@@ -247,7 +247,7 @@ static void oops_end(unsigned long flags, struct 
pt_regs *regs,

mdelay(MSEC_PER_SEC);
}

-   if (in_interrupt())
+   if (is_in_interrupt)
panic("Fatal exception in interrupt");
if (panic_on_oops)
panic("Fatal exception");
@@ -288,7 +288,7 @@ static int __die(const char *str, struct pt_regs 
*regs, long err)

 }
 NOKPROBE_SYMBOL(__die);

-void die(const char *str, struct pt_regs *regs, long err)
+static void nmi_die(const char *str, struct pt_regs *regs, long err, 
bool is_in_interrupt)

 {
unsigned long flags;

@@ -303,7 +303,13 @@ void die(const char *str, struct pt_regs *regs, 
long err)

flags = oops_begin(regs);
if (__die(str, regs, err))
err = 0;
-   oops_end(flags, regs, err);
+   oops_end(flags, regs, err, is_in_interrupt);
+}
+NOKPROBE_SYMBOL(nmi_die);
+
+void die(const char *str, struct pt_regs *regs, long err)
+{
+   nmi_die(str, regs, err, in_interrupt());
 }
 NOKPROBE_SYMBOL(die);

@@ -737,6 +743,7 @@ int machine_check_generic(struct pt_regs *regs)
 void machine_check_exception(struct pt_regs *regs)
 {
int recover = 0;
+   bool is_in_interrupt = in_interrupt();
bool nested = in_nmi();
if (!nested)
nmi_enter();
@@ -765,7 +772,7 @@ void machine_check_exception(struct pt_regs *regs)
if (check_io_access(regs))
goto bail;

-   die("Machine check", regs, SIGBUS);
+   nmi_die("Machine check", regs, SIGBUS, is_in_interrupt);

/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
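
The idea behind the patch above can be reduced to a small user-space
model (not kernel code): take the in_interrupt() snapshot before
nmi_enter() changes preempt_count, and let the die path decide from that
snapshot. The bit layout is my recollection of include/linux/preempt.h
around v4.19 and is an assumption, and all model_* names are made up for
this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed preempt_count layout: bits 8-15 softirq, 16-19 hardirq, 20 NMI. */
#define SOFTIRQ_MASK   (0xffUL << 8)
#define HARDIRQ_MASK   (0x0fUL << 16)
#define NMI_MASK       (0x01UL << 20)
#define SOFTIRQ_OFFSET (1UL << 8)
#define HARDIRQ_OFFSET (1UL << 16)
#define NMI_OFFSET     (1UL << 20)

static unsigned long model_count;	/* stands in for preempt_count */

static bool model_in_interrupt(void)
{
	return (model_count & (SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) != 0;
}

static bool model_in_nmi(void)
{
	return (model_count & NMI_MASK) != 0;
}

/* Model of nmi_die(): the panic decision comes from the caller's snapshot. */
static bool model_nmi_die_panics(bool is_in_interrupt)
{
	return is_in_interrupt;
}

/* Model of machine_check_exception() reaching the die path. */
static bool model_mce_panics(unsigned long count_at_entry)
{
	bool is_in_interrupt, nested, panics;

	model_count = count_at_entry;
	is_in_interrupt = model_in_interrupt();	/* snapshot before nmi_enter() */
	nested = model_in_nmi();
	if (!nested)
		model_count += NMI_OFFSET + HARDIRQ_OFFSET;	/* nmi_enter() */

	panics = model_nmi_die_panics(is_in_interrupt);

	if (!nested)
		model_count -= NMI_OFFSET + HARDIRQ_OFFSET;	/* nmi_exit() */
	return panics;
}
```

In this model a machine check from task context does not panic, while one
taken in hardirq, softirq or nested NMI context still does.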


Thanks
Christophe


Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-08 Thread Nicholas Piggin
On Tue, 9 Oct 2018 06:46:30 +0200
Christophe LEROY  wrote:

> On 09/10/2018 at 06:32, Nicholas Piggin wrote:
> > On Mon, 8 Oct 2018 17:39:11 +0200
> > Christophe LEROY  wrote:
> >   
> >> Hi Nick,
> >>
> >> On 19/07/2017 at 08:59, Nicholas Piggin wrote:
> >>> Use nmi_enter similarly to system reset interrupts. This uses the NMI
> >>> printk buffers and turns off various debugging facilities, which
> >>> helps avoid tripping over ourselves or other CPUs.
> >>>
> >>> Signed-off-by: Nicholas Piggin 
> >>> ---
> >>>arch/powerpc/kernel/traps.c | 9 ++---
> >>>1 file changed, 6 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> >>> index 2849c4f50324..6d31f9d7c333 100644
> >>> --- a/arch/powerpc/kernel/traps.c
> >>> +++ b/arch/powerpc/kernel/traps.c
> >>> @@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
> >>>
> >>>void machine_check_exception(struct pt_regs *regs)
> >>>{
> >>> - enum ctx_state prev_state = exception_enter();
> >>>   int recover = 0;
> >>> + bool nested = in_nmi();
> >>> + if (!nested)
> >>> + nmi_enter();  
> >>
> >> This alters preempt_count, so when die() is called
> >> in_interrupt() returns true although the trap didn't happen in
> >> interrupt, and oops_end() panics with "fatal exception in interrupt"
> >> instead of gently sending SIGBUS to the faulting app.
> > 
> > Thanks for tracking that down.
> >   
> >> Any idea on how to fix this ?  
> > 
> > I would say we have to deliver the sigbus by hand.
> > 
> >  if ((user_mode(regs)))
> >  _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
> >  else
> >  die("Machine check", regs, SIGBUS);
> >   
> 
> And what about all the other things done by 'die()' ?
> 
> And what if it is a kernel thread ?
> 
> On one of my boards, I have a kernel thread regularly checking the HW,
> and if it gets a machine check I expect it to gently stop and the die
> notification to be delivered to all registered notifiers.
> 
> Until this patch, it was working well.

I guess the alternative is we could check regs->trap for machine
check in the die test. Complication is having to account for MCE
in an interrupt handler.

   if (in_interrupt()) {
if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET + 
HARDIRQ_OFFSET)))
panic("Fatal exception in interrupt");
   }

Something like that might work for you? We need a ppc64 macro for the
MCE, and can probably add something like in_nmi_from_interrupt() for
the second part of the test.

Thanks,
Nick


Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-08 Thread Christophe LEROY




On 09/10/2018 at 06:32, Nicholas Piggin wrote:

On Mon, 8 Oct 2018 17:39:11 +0200
Christophe LEROY  wrote:


Hi Nick,

On 19/07/2017 at 08:59, Nicholas Piggin wrote:

Use nmi_enter similarly to system reset interrupts. This uses the NMI
printk buffers and turns off various debugging facilities, which
helps avoid tripping over ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
   arch/powerpc/kernel/traps.c | 9 ++---
   1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
   
   void machine_check_exception(struct pt_regs *regs)

   {
-   enum ctx_state prev_state = exception_enter();
int recover = 0;
+   bool nested = in_nmi();
+   if (!nested)
+   nmi_enter();


This alters preempt_count, so when die() is called
in_interrupt() returns true although the trap didn't happen in
interrupt, and oops_end() panics with "fatal exception in interrupt"
instead of gently sending SIGBUS to the faulting app.


Thanks for tracking that down.


Any idea on how to fix this ?


I would say we have to deliver the sigbus by hand.

 if ((user_mode(regs)))
 _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
 else
 die("Machine check", regs, SIGBUS);



And what about all the other things done by 'die()' ?

And what if it is a kernel thread ?

On one of my boards, I have a kernel thread regularly checking the HW,
and if it gets a machine check I expect it to gently stop and the die
notification to be delivered to all registered notifiers.

Until this patch, it was working well.

Christophe


Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-08 Thread Nicholas Piggin
On Mon, 8 Oct 2018 17:39:11 +0200
Christophe LEROY  wrote:

> Hi Nick,
> 
> On 19/07/2017 at 08:59, Nicholas Piggin wrote:
> > Use nmi_enter similarly to system reset interrupts. This uses the NMI
> > printk buffers and turns off various debugging facilities, which
> > helps avoid tripping over ourselves or other CPUs.
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >   arch/powerpc/kernel/traps.c | 9 ++---
> >   1 file changed, 6 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> > index 2849c4f50324..6d31f9d7c333 100644
> > --- a/arch/powerpc/kernel/traps.c
> > +++ b/arch/powerpc/kernel/traps.c
> > @@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
> >   
> >   void machine_check_exception(struct pt_regs *regs)
> >   {
> > -   enum ctx_state prev_state = exception_enter();
> > int recover = 0;
> > +   bool nested = in_nmi();
> > +   if (!nested)
> > +   nmi_enter();  
> 
> This alters preempt_count, so when die() is called
> in_interrupt() returns true although the trap didn't happen in
> interrupt, and oops_end() panics with "fatal exception in interrupt"
> instead of gently sending SIGBUS to the faulting app.

Thanks for tracking that down.

> Any idea on how to fix this ?

I would say we have to deliver the sigbus by hand.

if ((user_mode(regs)))
_exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
else
die("Machine check", regs, SIGBUS);


Re: [PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2018-10-08 Thread Christophe LEROY

Hi Nick,

On 19/07/2017 at 08:59, Nicholas Piggin wrote:

Use nmi_enter similarly to system reset interrupts. This uses the NMI
printk buffers and turns off various debugging facilities, which
helps avoid tripping over ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/kernel/traps.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
  
  void machine_check_exception(struct pt_regs *regs)

  {
-   enum ctx_state prev_state = exception_enter();
int recover = 0;
+   bool nested = in_nmi();
+   if (!nested)
+   nmi_enter();


This alters preempt_count, so when die() is called
in_interrupt() returns true although the trap didn't happen in
interrupt, and oops_end() panics with "fatal exception in interrupt"
instead of gently sending SIGBUS to the faulting app.
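
To make the effect on preempt_count concrete, here is a minimal
user-space model (not kernel code). The bit layout, and the fact that
nmi_enter() adds NMI_OFFSET + HARDIRQ_OFFSET, are my recollection of the
generic headers around v4.19 and should be treated as assumptions; the
model_* names are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed preempt_count layout: bits 8-15 softirq, 16-19 hardirq, 20 NMI. */
#define SOFTIRQ_MASK   (0xffUL << 8)
#define HARDIRQ_MASK   (0x0fUL << 16)
#define NMI_MASK       (0x01UL << 20)
#define HARDIRQ_OFFSET (1UL << 16)
#define NMI_OFFSET     (1UL << 20)

static unsigned long model_count;	/* stands in for preempt_count */

/* Model of in_interrupt(): any softirq, hardirq or NMI bit set. */
static bool model_in_interrupt(void)
{
	return (model_count & (SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) != 0;
}

static void model_nmi_enter(void)
{
	assert(!(model_count & NMI_MASK));	/* nesting is not modelled */
	model_count += NMI_OFFSET + HARDIRQ_OFFSET;
}

static void model_nmi_exit(void)
{
	assert(model_count & NMI_MASK);
	model_count -= NMI_OFFSET + HARDIRQ_OFFSET;
}

/*
 * A machine check taken from plain task context: report what
 * in_interrupt() would say at the point where die() gets called.
 */
static bool mce_from_task_sees_interrupt(void)
{
	bool seen;

	model_count = 0;		/* task context, preemption enabled */
	model_nmi_enter();
	seen = model_in_interrupt();	/* what oops_end() would test */
	model_nmi_exit();
	return seen;
}
```

In this model mce_from_task_sees_interrupt() returns true even though the
starting context was a plain task, which is the situation described above.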


Any idea on how to fix this ?

Christophe

  
  	__this_cpu_inc(irq_stat.mce_exceptions);
  
@@ -820,10 +822,11 @@ void machine_check_exception(struct pt_regs *regs)
  
  	/* Must die if the interrupt is not recoverable */

if (!(regs->msr & MSR_RI))
-   panic("Unrecoverable Machine check");
+   nmi_panic(regs, "Unrecoverable Machine check");
  
  bail:

-   exception_exit(prev_state);
+   if (!nested)
+   nmi_exit();
  }
  
  void SMIException(struct pt_regs *regs)




[PATCH v2 3/3] powerpc: machine check interrupt is a non-maskable interrupt

2017-07-19 Thread Nicholas Piggin
Use nmi_enter similarly to system reset interrupts. This uses the NMI
printk buffers and turns off various debugging facilities, which
helps avoid tripping over ourselves or other CPUs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/traps.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 2849c4f50324..6d31f9d7c333 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
 
 void machine_check_exception(struct pt_regs *regs)
 {
-   enum ctx_state prev_state = exception_enter();
int recover = 0;
+   bool nested = in_nmi();
+   if (!nested)
+   nmi_enter();
 
__this_cpu_inc(irq_stat.mce_exceptions);
 
@@ -820,10 +822,11 @@ void machine_check_exception(struct pt_regs *regs)
 
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
-   panic("Unrecoverable Machine check");
+   nmi_panic(regs, "Unrecoverable Machine check");
 
 bail:
-   exception_exit(prev_state);
+   if (!nested)
+   nmi_exit();
 }
 
 void SMIException(struct pt_regs *regs)
-- 
2.11.0