Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-04-09 Thread Mahesh J Salgaonkar
On 2024-03-08 19:08:50 Fri, Michael Ellerman wrote:
> Aneesh Kumar K V  writes:
> > On 3/7/24 5:13 PM, Michael Ellerman wrote:
> >> Mahesh Salgaonkar  writes:
> >>> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
> >>> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
> >>> interrupt handler) if percpu allocation comes from vmalloc area.
> >>>
> >>> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
> >>> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue
> >>> when percpu allocation is from the embedded first chunk. However with
> >>> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where
> >>> percpu allocation can come from the vmalloc area.
> >>>
> >>> With kernel command line "percpu_alloc=page" we can force percpu
> >>> allocation to come from vmalloc area and can see kernel crash in
> >>> machine_check_early:
> >>>
> >>> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
> >>> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
> >>> [1.215719] --- interrupt: 200
> >>> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
> >>> [1.215722] [c00fffd731b0] [] 0x0
> >>> [1.215724] [c00fffd73210] [c0008364] 
> >>> machine_check_early_common+0x134/0x1f8
> >>>
> >>> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
> >>> first chunk is not embedded.
> >> 
> >> My system (powernv) doesn't even boot with percpu_alloc=page.
> >
> >
> > Can you share the crash details?
> 
> Yes but it's not pretty :)
> 
>   [1.725257][  T714] systemd-journald[714]: Collecting audit messages is disabled.
>   [1.729401][T1] systemd[1]: Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
>   [^[[0;32m  OK  ^[[0m] Finished ^[[0;1;39msystemd-tmpfiles-…reate Static Device Nodes in /dev.
>   [1.773902][   C22] Disabling lock debugging due to kernel taint
>   [1.773905][   C23] Oops: Machine check, sig: 7 [#1]
>   [1.773911][   C23] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
>   [1.773916][   C23] Modules linked in:
>   [1.773920][   C23] CPU: 23 PID: 0 Comm: swapper/23 Tainted: G   M   6.8.0-rc7-02500-g23515c370cbb #1
>   [1.773924][   C23] Hardware name: 8335-GTH POWER9 0x4e1202 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
>   [1.773926][   C23] NIP:   LR:  CTR:
>   [1.773929][   C23] REGS: c00fffa6ef50 TRAP:    Tainted: G   M    (6.8.0-rc7-02500-g23515c370cbb)
>   [1.773932][   C23] MSR:   <>  CR:   XER:
>   [1.773937][   C23] CFAR:  IRQMASK: 3
>   [1.773937][   C23] GPR00:  c00fffa6efe0 c00fffa6efb0
>   [1.773937][   C23] GPR04: c003d8c0 c1f5f000  0103
>   [1.773937][   C23] GPR08: 0003 653a0d962a590300
>   [1.773937][   C23] GPR12: c00fffa6f280  c00084a4
>   [1.773937][   C23] GPR16: 53474552  c003d8c0 c00fffa6f280
>   [1.773937][   C23] GPR20: c1f5f000 c00fffa6f340 c00fffa6f2e8
>   [1.773937][   C23] GPR24: 0007fecf c65bbb80 00550102 c2172b20
>   [1.773937][   C23] GPR28:  53474552  c00c6d80
>   [1.773982][   C23] NIP [] 0x0
>   [1.773988][   C23] LR [] 0x0
>   [1.773990][   C23] Call Trace:
>   [1.773991][   C23] [c00fffa6efe0] [c1f5f000] .TOC.+0x0/0xa1000 (unreliable)
>   [1.773999][   C23] Code:
>   [1.774021][   C23] ---[ end trace  ]---
> 
> Something has gone badly wrong.
> 
> That was a test kernel with some other commits, but nothing that should
> cause that. Removing percpu_alloc=page fixes it.

So, when I try this without my patch "Avoid nmi_enter/nmi_exit in real
mode interrupt", I see this crash getting recreated. However, I was not
able to recreate it even once with my changes. Are you able to see this
crash with my patch?

Thanks,
-Mahesh.



Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-03-08 Thread Michael Ellerman
Aneesh Kumar K V  writes:
> On 3/7/24 5:13 PM, Michael Ellerman wrote:
>> Mahesh Salgaonkar  writes:
>>> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
>>> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
>>> interrupt handler) if percpu allocation comes from vmalloc area.
>>>
>>> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
>>> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
>>> percpu allocation is from the embedded first chunk. However with
>>> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
>>> allocation can come from the vmalloc area.
>>>
>>> With kernel command line "percpu_alloc=page" we can force percpu allocation
>>> to come from vmalloc area and can see kernel crash in machine_check_early:
>>>
>>> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
>>> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
>>> [1.215719] --- interrupt: 200
>>> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
>>> [1.215722] [c00fffd731b0] [] 0x0
>>> [1.215724] [c00fffd73210] [c0008364] 
>>> machine_check_early_common+0x134/0x1f8
>>>
>>> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
>>> first chunk is not embedded.
>> 
>> My system (powernv) doesn't even boot with percpu_alloc=page.
>
>
> Can you share the crash details?

Yes but it's not pretty :)

  [1.725257][  T714] systemd-journald[714]: Collecting audit messages is disabled.
  [1.729401][T1] systemd[1]: Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
  [^[[0;32m  OK  ^[[0m] Finished ^[[0;1;39msystemd-tmpfiles-…reate Static Device Nodes in /dev.
  [1.773902][   C22] Disabling lock debugging due to kernel taint
  [1.773905][   C23] Oops: Machine check, sig: 7 [#1]
  [1.773911][   C23] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
  [1.773916][   C23] Modules linked in:
  [1.773920][   C23] CPU: 23 PID: 0 Comm: swapper/23 Tainted: G   M   6.8.0-rc7-02500-g23515c370cbb #1
  [1.773924][   C23] Hardware name: 8335-GTH POWER9 0x4e1202 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
  [1.773926][   C23] NIP:   LR:  CTR:
  [1.773929][   C23] REGS: c00fffa6ef50 TRAP:    Tainted: G   M    (6.8.0-rc7-02500-g23515c370cbb)
  [1.773932][   C23] MSR:   <>  CR:   XER: 
  [1.773937][   C23] CFAR:  IRQMASK: 3 
  [1.773937][   C23] GPR00:  c00fffa6efe0 c00fffa6efb0
  [1.773937][   C23] GPR04: c003d8c0 c1f5f000  0103
  [1.773937][   C23] GPR08: 0003 653a0d962a590300
  [1.773937][   C23] GPR12: c00fffa6f280  c00084a4
  [1.773937][   C23] GPR16: 53474552  c003d8c0 c00fffa6f280
  [1.773937][   C23] GPR20: c1f5f000 c00fffa6f340 c00fffa6f2e8
  [1.773937][   C23] GPR24: 0007fecf c65bbb80 00550102 c2172b20
  [1.773937][   C23] GPR28:  53474552  c00c6d80
  [1.773982][   C23] NIP [] 0x0
  [1.773988][   C23] LR [] 0x0
  [1.773990][   C23] Call Trace:
  [1.773991][   C23] [c00fffa6efe0] [c1f5f000] .TOC.+0x0/0xa1000 (unreliable)
  [1.773999][   C23] Code:
  [1.774021][   C23] ---[ end trace  ]---

Something has gone badly wrong.

That was a test kernel with some other commits, but nothing that should
cause that. Removing percpu_alloc=page fixes it.

It's based on fddff98e83b4b4d54470902ea0d520c4d423ca3b.

>> AFAIK the only reason we added support for it was to handle 4K kernels
>> with HPT. See commit eb553f16973a ("powerpc/64/mm: implement page
>> mapping percpu first chunk allocator").
>> 
>> So I wonder if we should change the Kconfig to only offer it as an
>> option in that case, and change the logic in setup_per_cpu_areas() to
>> only use it as a last resort.
>> 
>> I guess we probably still need this commit though, even if just for 4K HPT.
>> 
>>
> We have also observed some errors when there is a large gap between the
> start memory of NUMA nodes. That made the percpu offset really large,
> causing boot failures even on 64K.

Yeah, I have vague memories of that :)

cheers


Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-03-07 Thread Mahesh J Salgaonkar
On 2024-03-07 22:43:07 Thu, Michael Ellerman wrote:
> > diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
> > index a4196ab1d0167..0b96464ff0339 100644
> > --- a/arch/powerpc/include/asm/interrupt.h
> > +++ b/arch/powerpc/include/asm/interrupt.h
> > @@ -336,6 +336,14 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct inte
> > if (IS_ENABLED(CONFIG_KASAN))
> > return;
> >  
> > +   /*
> > +* Likewise, do not use it in real mode if percpu first chunk is not
> > +* embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there
> > +* are chances where percpu allocation can come from vmalloc area.
> > +*/
> > +   if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk)
> 
> I think this would be clearer if it was inverted, eg:
> 
> if (percpu_first_chunk_is_paged)
>return;

Agree.

> 
> That way you shouldn't need to check CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK 
> here.
> Instead it can be part of the ifdef in the header.
> 
> > @@ -351,6 +359,8 @@ static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct inter
> > // no nmi_exit for a pseries hash guest taking a real mode exception
> > } else if (IS_ENABLED(CONFIG_KASAN)) {
> > // no nmi_exit for KASAN in real mode
> > +   } else if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk) {
> > +   // no nmi_exit if percpu first chunk is not embedded
> > } else {
> > nmi_exit();
> > }
> > diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
> > index 8e5b7d0b851c6..e24063eb0b33b 100644
> > --- a/arch/powerpc/include/asm/percpu.h
> > +++ b/arch/powerpc/include/asm/percpu.h
> > @@ -15,6 +15,16 @@
> >  #endif /* CONFIG_SMP */
> >  #endif /* __powerpc64__ */
> >  
> > +#ifdef CONFIG_PPC64
> > +#include 
> > +DECLARE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
> > +
> > +#define is_embed_first_chunk   \
> > +   (static_key_enabled(&__percpu_embed_first_chunk.key))
> > +#else
> > +#define is_embed_first_chunk   true
> > +#endif /* CONFIG_PPC64 */
> > +
> 
> Something like:
> 
> #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
> #include 
> DECLARE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);
> 
> #define percpu_first_chunk_is_paged   \
>   (static_key_enabled(&__percpu_first_chunk_is_paged.key))
> #else
> #define percpu_first_chunk_is_paged   false
> #endif /* CONFIG_PPC64 */

Sure, will fix it.
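
For reference, the reworked check could look roughly like this (a sketch
only; the config guard and the include are assumptions, not something
settled at this point in the thread), with the DEFINE and
static_key_enable() in setup_64.c renamed to match:

  /* arch/powerpc/include/asm/percpu.h -- sketch */
  #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
  #include <linux/jump_label.h>
  DECLARE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);

  /* true only when the first percpu chunk came from the page (vmalloc) allocator */
  #define percpu_first_chunk_is_paged \
          (static_key_enabled(&__percpu_first_chunk_is_paged.key))
  #else
  #define percpu_first_chunk_is_paged false
  #endif

  /* arch/powerpc/include/asm/interrupt.h -- sketch of the inverted check */
  /* Real mode cannot safely touch vmalloc-backed percpu data. */
  if (percpu_first_chunk_is_paged)
          return;

  /* Otherwise, it should be safe to call it */
  nmi_enter();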

> 
> > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> > index 2f19d5e944852..e04f0ff69d4b6 100644
> > --- a/arch/powerpc/kernel/setup_64.c
> > +++ b/arch/powerpc/kernel/setup_64.c
> > @@ -834,6 +834,7 @@ static __init int pcpu_cpu_to_node(int cpu)
> >  
> >  unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
> >  EXPORT_SYMBOL(__per_cpu_offset);
> > +DEFINE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
> >  
> >  void __init setup_per_cpu_areas(void)
> >  {
> > @@ -869,6 +870,8 @@ void __init setup_per_cpu_areas(void)
> > pr_warn("PERCPU: %s allocator failed (%d), "
> > "falling back to page size\n",
> > pcpu_fc_names[pcpu_chosen_fc], rc);
> > +   else
> > +   static_key_enable(&__percpu_embed_first_chunk.key);
> > }
> >  
> > if (rc < 0)
>  
> Finally, the current patch breaks the microwatt build:
> 
>   $ make microwatt_defconfig ; make -s -j $(nproc)
>   make[1]: Entering directory '/home/michael/linux/.build'
> GEN Makefile
>   #
>   # configuration written to .config
>   #
>   make[1]: Leaving directory '/home/michael/linux/.build'
>   ld: arch/powerpc/kernel/traps.o:(.toc+0x0): undefined reference to `__percpu_embed_first_chunk'
>   ld: arch/powerpc/kernel/mce.o:(.toc+0x0): undefined reference to `__percpu_embed_first_chunk'
>   make[3]: *** [../scripts/Makefile.vmlinux:37: vmlinux] Error 1
> 
> I guess because it has CONFIG_JUMP_LABEL=n?

Even with CONFIG_JUMP_LABEL=n it should still work. Let me take a look and
fix this for the microwatt build.
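
One possibility (a guess, not confirmed in this thread): the failure may not
be about CONFIG_JUMP_LABEL at all. microwatt_defconfig builds without
CONFIG_SMP, and the new DEFINE_STATIC_KEY_FALSE() appears to sit in the
SMP-only part of setup_64.c, while percpu.h still declares the key for all
of CONFIG_PPC64 and interrupt.h still reads it through the macro, leaving
the TOC reference unresolved. One possible shape of a fix, with the guard
below being an assumption:

  /* arch/powerpc/include/asm/percpu.h -- sketch */
  #if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && defined(CONFIG_SMP)
  #include <linux/jump_label.h>
  DECLARE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);
  #define percpu_first_chunk_is_paged \
          (static_key_enabled(&__percpu_first_chunk_is_paged.key))
  #else
  /* non-SMP builds (e.g. microwatt) fall back to the constant */
  #define percpu_first_chunk_is_paged false
  #endif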

Thanks for your review.
-Mahesh.

> 
> cheers

-- 
Mahesh J Salgaonkar


Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-03-07 Thread Aneesh Kumar K V
On 3/7/24 5:13 PM, Michael Ellerman wrote:
> Hi Mahesh,
> 
> Mahesh Salgaonkar  writes:
>> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
>> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
>> interrupt handler) if percpu allocation comes from vmalloc area.
>>
>> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
>> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
>> percpu allocation is from the embedded first chunk. However with
>> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
>> allocation can come from the vmalloc area.
>>
>> With kernel command line "percpu_alloc=page" we can force percpu allocation
>> to come from vmalloc area and can see kernel crash in machine_check_early:
>>
>> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
>> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
>> [1.215719] --- interrupt: 200
>> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
>> [1.215722] [c00fffd731b0] [] 0x0
>> [1.215724] [c00fffd73210] [c0008364] 
>> machine_check_early_common+0x134/0x1f8
>>
>> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
>> first chunk is not embedded.
> 
> My system (powernv) doesn't even boot with percpu_alloc=page.
> 


Can you share the crash details? 


> AFAIK the only reason we added support for it was to handle 4K kernels
> with HPT. See commit eb553f16973a ("powerpc/64/mm: implement page
> mapping percpu first chunk allocator").
> 
> So I wonder if we should change the Kconfig to only offer it as an
> option in that case, and change the logic in setup_per_cpu_areas() to
> only use it as a last resort.
> 
> I guess we probably still need this commit though, even if just for 4K HPT.
> 
>
We have also observed some errors when there is a large gap between the start
memory of NUMA nodes. That made the percpu offset really large, causing boot
failures even on 64K.

-aneesh


Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-03-07 Thread Michael Ellerman
Hi Mahesh,

Mahesh Salgaonkar  writes:
> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
> interrupt handler) if percpu allocation comes from vmalloc area.
>
> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
> percpu allocation is from the embedded first chunk. However with
> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
> allocation can come from the vmalloc area.
>
> With kernel command line "percpu_alloc=page" we can force percpu allocation
> to come from vmalloc area and can see kernel crash in machine_check_early:
>
> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
> [1.215719] --- interrupt: 200
> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
> [1.215722] [c00fffd731b0] [] 0x0
> [1.215724] [c00fffd73210] [c0008364] 
> machine_check_early_common+0x134/0x1f8
>
> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
> first chunk is not embedded.

My system (powernv) doesn't even boot with percpu_alloc=page.

AFAIK the only reason we added support for it was to handle 4K kernels
with HPT. See commit eb553f16973a ("powerpc/64/mm: implement page
mapping percpu first chunk allocator").

So I wonder if we should change the Kconfig to only offer it as an
option in that case, and change the logic in setup_per_cpu_areas() to
only use it as a last resort.

I guess we probably still need this commit though, even if just for 4K HPT.
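
A sketch of that Kconfig direction (the condition and symbol names below are
assumptions, only meant to illustrate restricting it to 4K HPT kernels):

  # arch/powerpc/Kconfig -- sketch
  select NEED_PER_CPU_PAGE_FIRST_CHUNK	if PPC_64S_HASH_MMU && PPC_4K_PAGES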


> Signed-off-by: Mahesh Salgaonkar 
> ---
> Changes in v4:
> - Fix coding style issues.
>
> Changes in v3:
> - Address comments from Christophe Leroy to avoid using #ifdefs in the
>   code
> - v2 at 
> https://lore.kernel.org/linuxppc-dev/20240205053647.1763446-1-mah...@linux.ibm.com/
>
> Changes in v2:
> - Rebase to upstream master
> - Use jump_labels, if CONFIG_JUMP_LABEL is enabled, to avoid redoing the
>   embed first chunk test at each interrupt entry.
> - v1 is at 
> https://lore.kernel.org/linuxppc-dev/164578465828.74956.6065296024817333750.stgit@jupiter/
> ---
>  arch/powerpc/include/asm/interrupt.h | 10 ++
>  arch/powerpc/include/asm/percpu.h| 10 ++
>  arch/powerpc/kernel/setup_64.c   |  3 +++
>  3 files changed, 23 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
> index a4196ab1d0167..0b96464ff0339 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -336,6 +336,14 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct inte
>   if (IS_ENABLED(CONFIG_KASAN))
>   return;
>  
> + /*
> +  * Likewise, do not use it in real mode if percpu first chunk is not
> +  * embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there
> +  * are chances where percpu allocation can come from vmalloc area.
> +  */
> + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk)

I think this would be clearer if it was inverted, eg:

if (percpu_first_chunk_is_paged)
   return;

That way you shouldn't need to check CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK here.
Instead it can be part of the ifdef in the header.

> @@ -351,6 +359,8 @@ static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct inter
>   // no nmi_exit for a pseries hash guest taking a real mode exception
>   } else if (IS_ENABLED(CONFIG_KASAN)) {
>   // no nmi_exit for KASAN in real mode
> + } else if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk) {
> + // no nmi_exit if percpu first chunk is not embedded
>   } else {
>   nmi_exit();
>   }
> diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
> index 8e5b7d0b851c6..e24063eb0b33b 100644
> --- a/arch/powerpc/include/asm/percpu.h
> +++ b/arch/powerpc/include/asm/percpu.h
> @@ -15,6 +15,16 @@
>  #endif /* CONFIG_SMP */
>  #endif /* __powerpc64__ */
>  
> +#ifdef CONFIG_PPC64
> +#include 
> +DECLARE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
> +
> +#define is_embed_first_chunk \
> + (static_key_enabled(&__percpu_embed_first_chunk.key))
> +#else
> +#define is_embed_first_chunk true
> +#endif /* CONFIG_PPC64 */
> +

Something like:

#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
#include 
DECLARE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);

#define percpu_first_chunk_is_paged \
(static_key_enabled(&__percpu_first_chunk_is_paged.key))
#else
#define percpu_first_chunk_is_paged false
#endif /* CONFIG_PPC64 */

> diff --git a/arch/powerpc/kernel/setup_64.c 

Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-03-06 Thread Shirisha ganta
On Wed, 2024-02-14 at 15:21 +0530, Mahesh Salgaonkar wrote:
> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
> interrupt handler) if percpu allocation comes from vmalloc area.
> 
> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
> percpu allocation is from the embedded first chunk. However with
> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
> allocation can come from the vmalloc area.
> 
> With kernel command line "percpu_alloc=page" we can force percpu allocation
> to come from vmalloc area and can see kernel crash in machine_check_early:
> 
> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
> [1.215719] --- interrupt: 200
> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
> [1.215722] [c00fffd731b0] [] 0x0
> [1.215724] [c00fffd73210] [c0008364] machine_check_early_common+0x134/0x1f8
> 
> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
> first chunk is not embedded.
> 
> Signed-off-by: Mahesh Salgaonkar 

Thanks for the patch.
I have tested the patch and the fix works fine.
The selftests/powerpc/mce/inject-ra-err test case works as
expected after enabling percpu_alloc=page with the patch applied.

Output with Patch:
# ./inject-ra-err 
test: inject-ra-err
tags: git_version:unknown
success: inject-ra-err
#

Tested-by: Shirisha Ganta 
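
For reference, a rough sketch of the reproduction steps (the selftest path
and build invocation below are assumptions, not spelled out in this thread):

  # boot with "percpu_alloc=page" appended to the kernel command line, then:
  make -C tools/testing/selftests/powerpc/mce
  ./tools/testing/selftests/powerpc/mce/inject-ra-err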


> ---
> Changes in v4:
> - Fix coding style issues.
> 
> Changes in v3:
> - Address comments from Christophe Leroy to avoid using #ifdefs in
> the
>   code
> - v2 at 
> https://lore.kernel.org/linuxppc-dev/20240205053647.1763446-1-mah...@linux.ibm.com/
> 
> Changes in v2:
> - Rebase to upstream master
> - Use jump_labels, if CONFIG_JUMP_LABEL is enabled, to avoid redoing
> the
>   embed first chunk test at each interrupt entry.
> - v1 is at 
> https://lore.kernel.org/linuxppc-dev/164578465828.74956.6065296024817333750.stgit@jupiter/
> ---
>  arch/powerpc/include/asm/interrupt.h | 10 ++
>  arch/powerpc/include/asm/percpu.h| 10 ++
>  arch/powerpc/kernel/setup_64.c   |  3 +++
>  3 files changed, 23 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
> index a4196ab1d0167..0b96464ff0339 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -336,6 +336,14 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct inte
>   if (IS_ENABLED(CONFIG_KASAN))
>   return;
>  
> + /*
> +  * Likewise, do not use it in real mode if percpu first chunk is not
> +  * embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there
> +  * are chances where percpu allocation can come from vmalloc area.
> +  */
> + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk)
> + return;
> +
>   /* Otherwise, it should be safe to call it */
>   nmi_enter();
>  }
> @@ -351,6 +359,8 @@ static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct inter
>   // no nmi_exit for a pseries hash guest taking a real mode exception
>   } else if (IS_ENABLED(CONFIG_KASAN)) {
>   // no nmi_exit for KASAN in real mode
> + } else if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk) {
> + // no nmi_exit if percpu first chunk is not embedded
>   } else {
>   nmi_exit();
>   }
> diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
> index 8e5b7d0b851c6..e24063eb0b33b 100644
> --- a/arch/powerpc/include/asm/percpu.h
> +++ b/arch/powerpc/include/asm/percpu.h
> @@ -15,6 +15,16 @@
>  #endif /* CONFIG_SMP */
>  #endif /* __powerpc64__ */
>  
> +#ifdef CONFIG_PPC64
> +#include 
> +DECLARE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
> +
> +#define is_embed_first_chunk \
> + (static_key_enabled(&__percpu_embed_first_chunk.key))
> +#else
> +#define is_embed_first_chunk true
> +#endif /* CONFIG_PPC64 */
> +
>  #include 
>  
>  #include 
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 2f19d5e944852..e04f0ff69d4b6 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -834,6 +834,7 @@ static __init int pcpu_cpu_to_node(int cpu)
>  
>  unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
>  EXPORT_SYMBOL(__per_cpu_offset);
> +DEFINE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
>  
>  void __init setup_per_cpu_areas(void)
>  {
> @@ -869,6 +870,8 @@ void __init setup_per_cpu_areas(void)
>   pr_warn("PERCPU: %s allocator 

Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-02-14 Thread Christophe Leroy


On 14/02/2024 at 10:51, Mahesh Salgaonkar wrote:
> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
> interrupt handler) if percpu allocation comes from vmalloc area.
> 
> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
> percpu allocation is from the embedded first chunk. However with
> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
> allocation can come from the vmalloc area.
> 
> With kernel command line "percpu_alloc=page" we can force percpu allocation
> to come from vmalloc area and can see kernel crash in machine_check_early:
> 
> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
> [1.215719] --- interrupt: 200
> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
> [1.215722] [c00fffd731b0] [] 0x0
> [1.215724] [c00fffd73210] [c0008364] 
> machine_check_early_common+0x134/0x1f8
> 
> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
> first chunk is not embedded.
> 
> Signed-off-by: Mahesh Salgaonkar 

Reviewed-by: Christophe Leroy 

> ---
> Changes in v4:
> - Fix coding style issues.
> 
> Changes in v3:
> - Address comments from Christophe Leroy to avoid using #ifdefs in the
>code
> - v2 at 
> https://lore.kernel.org/linuxppc-dev/20240205053647.1763446-1-mah...@linux.ibm.com/
> 
> Changes in v2:
> - Rebase to upstream master
> - Use jump_labels, if CONFIG_JUMP_LABEL is enabled, to avoid redoing the
>embed first chunk test at each interrupt entry.
> - v1 is at 
> https://lore.kernel.org/linuxppc-dev/164578465828.74956.6065296024817333750.stgit@jupiter/
> ---
>   arch/powerpc/include/asm/interrupt.h | 10 ++
>   arch/powerpc/include/asm/percpu.h| 10 ++
>   arch/powerpc/kernel/setup_64.c   |  3 +++
>   3 files changed, 23 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
> index a4196ab1d0167..0b96464ff0339 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -336,6 +336,14 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct inte
>   if (IS_ENABLED(CONFIG_KASAN))
>   return;
>   
> + /*
> +  * Likewise, do not use it in real mode if percpu first chunk is not
> +  * embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there
> +  * are chances where percpu allocation can come from vmalloc area.
> +  */
> + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk)
> + return;
> +
>   /* Otherwise, it should be safe to call it */
>   nmi_enter();
>   }
> @@ -351,6 +359,8 @@ static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct inter
>   // no nmi_exit for a pseries hash guest taking a real mode exception
>   } else if (IS_ENABLED(CONFIG_KASAN)) {
>   // no nmi_exit for KASAN in real mode
> + } else if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk) {
> + // no nmi_exit if percpu first chunk is not embedded
>   } else {
>   nmi_exit();
>   }
> diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
> index 8e5b7d0b851c6..e24063eb0b33b 100644
> --- a/arch/powerpc/include/asm/percpu.h
> +++ b/arch/powerpc/include/asm/percpu.h
> @@ -15,6 +15,16 @@
>   #endif /* CONFIG_SMP */
>   #endif /* __powerpc64__ */
>   
> +#ifdef CONFIG_PPC64
> +#include 
> +DECLARE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
> +
> +#define is_embed_first_chunk \
> + (static_key_enabled(&__percpu_embed_first_chunk.key))
> +#else
> +#define is_embed_first_chunk true
> +#endif /* CONFIG_PPC64 */
> +
>   #include 
>   
>   #include 
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 2f19d5e944852..e04f0ff69d4b6 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -834,6 +834,7 @@ static __init int pcpu_cpu_to_node(int cpu)
>   
>   unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
>   EXPORT_SYMBOL(__per_cpu_offset);
> +DEFINE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
>   
>   void __init setup_per_cpu_areas(void)
>   {
> @@ -869,6 +870,8 @@ void __init setup_per_cpu_areas(void)
>   pr_warn("PERCPU: %s allocator failed (%d), "
>   "falling back to page size\n",
>   pcpu_fc_names[pcpu_chosen_fc], rc);
> + else
> + static_key_enable(&__percpu_embed_first_chunk.key);
>   }
>   
>   if (rc < 0)


[PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-02-14 Thread Mahesh Salgaonkar
nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
interrupt handler) if percpu allocation comes from vmalloc area.

Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
percpu allocation is from the embedded first chunk. However with
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
allocation can come from the vmalloc area.

With kernel command line "percpu_alloc=page" we can force percpu allocation
to come from vmalloc area and can see kernel crash in machine_check_early:

[1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
[1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
[1.215719] --- interrupt: 200
[1.215720] [c00fffd73180] [] 0x0 (unreliable)
[1.215722] [c00fffd731b0] [] 0x0
[1.215724] [c00fffd73210] [c0008364] 
machine_check_early_common+0x134/0x1f8

Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
first chunk is not embedded.

Signed-off-by: Mahesh Salgaonkar 
---
Changes in v4:
- Fix coding style issues.

Changes in v3:
- Address comments from Christophe Leroy to avoid using #ifdefs in the
  code
- v2 at 
https://lore.kernel.org/linuxppc-dev/20240205053647.1763446-1-mah...@linux.ibm.com/

Changes in v2:
- Rebase to upstream master
- Use jump_labels, if CONFIG_JUMP_LABEL is enabled, to avoid redoing the
  embed first chunk test at each interrupt entry.
- v1 is at 
https://lore.kernel.org/linuxppc-dev/164578465828.74956.6065296024817333750.stgit@jupiter/
---
 arch/powerpc/include/asm/interrupt.h | 10 ++
 arch/powerpc/include/asm/percpu.h| 10 ++
 arch/powerpc/kernel/setup_64.c   |  3 +++
 3 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
index a4196ab1d0167..0b96464ff0339 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -336,6 +336,14 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct inte
if (IS_ENABLED(CONFIG_KASAN))
return;
 
+   /*
+* Likewise, do not use it in real mode if percpu first chunk is not
+* embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there
+* are chances where percpu allocation can come from vmalloc area.
+*/
+   if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk)
+   return;
+
/* Otherwise, it should be safe to call it */
nmi_enter();
 }
@@ -351,6 +359,8 @@ static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct inter
// no nmi_exit for a pseries hash guest taking a real mode exception
} else if (IS_ENABLED(CONFIG_KASAN)) {
// no nmi_exit for KASAN in real mode
+   } else if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && !is_embed_first_chunk) {
+   // no nmi_exit if percpu first chunk is not embedded
} else {
nmi_exit();
}
diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
index 8e5b7d0b851c6..e24063eb0b33b 100644
--- a/arch/powerpc/include/asm/percpu.h
+++ b/arch/powerpc/include/asm/percpu.h
@@ -15,6 +15,16 @@
 #endif /* CONFIG_SMP */
 #endif /* __powerpc64__ */
 
+#ifdef CONFIG_PPC64
+#include 
+DECLARE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
+
+#define is_embed_first_chunk   \
+   (static_key_enabled(&__percpu_embed_first_chunk.key))
+#else
+#define is_embed_first_chunk   true
+#endif /* CONFIG_PPC64 */
+
 #include 
 
 #include 
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 2f19d5e944852..e04f0ff69d4b6 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -834,6 +834,7 @@ static __init int pcpu_cpu_to_node(int cpu)
 
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
+DEFINE_STATIC_KEY_FALSE(__percpu_embed_first_chunk);
 
 void __init setup_per_cpu_areas(void)
 {
@@ -869,6 +870,8 @@ void __init setup_per_cpu_areas(void)
pr_warn("PERCPU: %s allocator failed (%d), "
"falling back to page size\n",
pcpu_fc_names[pcpu_chosen_fc], rc);
+   else
+   static_key_enable(&__percpu_embed_first_chunk.key);
}
 
if (rc < 0)
-- 
2.43.0
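
For readers unfamiliar with the jump label API used in this patch, here is a
minimal stand-alone sketch of the pattern (generic kernel code, not part of
the patch; my_feature_key and the my_* helpers are made-up names): the key
defaults to false, is flipped once during init, and is then read on hot
paths.

  #include <linux/init.h>
  #include <linux/jump_label.h>

  DEFINE_STATIC_KEY_FALSE(my_feature_key);              /* starts disabled */

  static bool my_feature_detect(void) { return true; }  /* hypothetical probe */
  static void my_do_fast_thing(void) { }                /* hypothetical work */

  static int __init my_feature_init(void)
  {
          if (my_feature_detect())
                  static_key_enable(&my_feature_key.key);  /* flip once at boot */
          return 0;
  }

  static void my_hot_path(void)
  {
          /* static_branch_unlikely() is runtime-patched when the key flips;
           * static_key_enabled(&key.key), as used in the patch above, is a
           * plain read of the key's enabled count instead. */
          if (static_branch_unlikely(&my_feature_key))
                  my_do_fast_thing();
  }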