Re: [PATCH 2/3] powerpc: fix build errors

2022-02-25 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of February 26, 2022 8:28 am:
> On Fri, Feb 25, 2022 at 10:23:07AM +1000, Nicholas Piggin wrote:
>> Excerpts from Segher Boessenkool's message of February 25, 2022 3:29 am:
>> > On Thu, Feb 24, 2022 at 09:13:25PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Arnd Bergmann's message of February 24, 2022 8:20 pm:
>> >> > Again, there should be a minimum number of those .machine directives
>> >> > in inline asm as well, which tends to work out fine as long as the
>> >> > entire kernel is built with the correct -march= option for the minimum
>> >> > supported CPU, and stays away from inline asm that requires a higher
>> >> > CPU level.
>> >> 
>> >> There's really no advantage to them, and they're ugly and annoying
>> >> and if we applied the concept consistently for all asm they would grow 
>> >> to a very large number.
>> > 
>> > The advantage is that you get machine code that *works*.  There are
>> > quite a few mnemonics that translate to different instructions with
>> > different machine options!  We like to get the intended instructions
>> > instead of something that depends on what assembler options the user
>> > has passed behind our backs.
>> > 
>> >> The idea they'll give you good static checking just doesn't really
>> >> pan out.
>> > 
>> > That never was a goal of this at all.
>> > 
>> > -many was very problematical for GCC itself.  We no longer use it.
>> 
>> You have the wrong context. We're not talking about -many vs .machine
>> here.
> 
> Okay, so you have no idea what you are talking about?  Wow.

Wrong context. It's not about -many. We're past that everyone agrees 
it's wrong.

> The reason GCC uses .machine *itself* is because assembler -mmachine
> options *cannot work*, for many reasons.  We hit problems often enough
> that years ago we started moving away from it already.  The biggest
> problems are that on one hand there are mnemonics that encode to
> different instructions depending on target arch or cpu selected (like
> mftb, lxvx, wait, etc.), and on the other hand GCC needs to switch that
> target halfway through compilation (attribute((target(....
> 
> Often these problems were hidden most of the time by us passing -many.
> But not all of the time, and over time, problems became more frequent
> and nasty.
> 
> Passing assembler -m options is nasty when you have to mix it with
> .machine statements (and we need the latter no matter what), and it

No it's not nasty, read the gas manual. -m specifies the machine and
so does .machine. It's simple.

> becomes completely unpredictable if the user passes other -m options
> manually.
> Inline assembler is inserted textually in the generated assembler code.
> This is a big part of the strength of inline assembler.  It does mean
> that if you need a different target selected for your assembler code
> then you need to arrange for that in your assembler code.
> 
> So yes, this very much is about -many, other -m options, and .machine .
> I discourage the kernel (as well as any other project) from using -m
> options, especially -many, but that is your own choice of course.  I
> get sick and tired from you calling a deliberate design decision we
> arrived at after years of work and weighing alternatives a "bug" though.

Alan posted a good summary here

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102485#c10

Thanks,
Nick


Re: [PATCH 2/3] powerpc: fix build errors

2022-02-25 Thread Segher Boessenkool
On Fri, Feb 25, 2022 at 10:32:02AM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of February 25, 2022 3:12 am:
> > On Thu, Feb 24, 2022 at 03:05:28PM +1000, Nicholas Piggin wrote:
> >> + * gcc 10 started to emit a .machine directive at the beginning of 
> >> generated
> >> + * .s files, which overrides assembler -Wa,-m options passed down.
> >> + * Unclear if this behaviour will be reverted.
> > 
> > It will not be reverted.  If you need a certain .machine for some asm
> > code, you should write just that!
> 
> It should be reverted because it breaks old binutils which did not have
> the workaround patch for this broken gcc behaviour. And it is just
> unnecessary because -m option can already be used to do the same thing.
> 
> Not that I expect gcc to revert it.

Nothing will happen if you do not file a bug report.  And do read the
bug reporting instructions first please.

> >> +#ifdef CONFIG_CC_IS_GCC
> >> +#if (GCC_VERSION >= 10)
> >> +#if (CONFIG_AS_VERSION == 23800)
> >> +asm(".machine any");
> >> +#endif
> >> +#endif
> >> +#endif
> >> +#endif /* __ASSEMBLY__ */
> > 
> > Abusing toplevel asm like this is broken and you *will* end up with
> > unhappiness all around.
> 
> It actually unbreaks things and reduces my unhappiness.

It is broken.  You will need -fno-toplevel-reorder, and you really do
not want that, if you *can* use it in the kernel even.

> It's only done 
> for broken compiler versions and only where as does not have the 
> workaround for the breakage.

What compiler versions?  Please file a PR.


Segher


Re: [PATCH 2/3] powerpc: fix build errors

2022-02-25 Thread Segher Boessenkool
On Fri, Feb 25, 2022 at 10:23:07AM +1000, Nicholas Piggin wrote:
> Excerpts from Segher Boessenkool's message of February 25, 2022 3:29 am:
> > On Thu, Feb 24, 2022 at 09:13:25PM +1000, Nicholas Piggin wrote:
> >> Excerpts from Arnd Bergmann's message of February 24, 2022 8:20 pm:
> >> > Again, there should be a minimum number of those .machine directives
> >> > in inline asm as well, which tends to work out fine as long as the
> >> > entire kernel is built with the correct -march= option for the minimum
> >> > supported CPU, and stays away from inline asm that requires a higher
> >> > CPU level.
> >> 
> >> There's really no advantage to them, and they're ugly and annoying
> >> and if we applied the concept consistently for all asm they would grow 
> >> to a very large number.
> > 
> > The advantage is that you get machine code that *works*.  There are
> > quite a few mnemonics that translate to different instructions with
> > different machine options!  We like to get the intended instructions
> > instead of something that depends on what assembler options the user
> > has passed behind our backs.
> > 
> >> The idea they'll give you good static checking just doesn't really
> >> pan out.
> > 
> > That never was a goal of this at all.
> > 
> > -many was very problematical for GCC itself.  We no longer use it.
> 
> You have the wrong context. We're not talking about -many vs .machine
> here.

Okay, so you have no idea what you are talking about?  Wow.

The reason GCC uses .machine *itself* is because assembler -mmachine
options *cannot work*, for many reasons.  We hit problems often enough
that years ago we started moving away from it already.  The biggest
problems are that on one hand there are mnemonics that encode to
different instructions depending on target arch or cpu selected (like
mftb, lxvx, wait, etc.), and on the other hand GCC needs to switch that
target halfway through compilation (attribute((target(....

Often these problems were hidden most of the time by us passing -many.
But not all of the time, and over time, problems became more frequent
and nasty.

Passing assembler -m options is nasty when you have to mix it with
.machine statements (and we need the latter no matter what), and it
becomes completely unpredictable if the user passes other -m options
manually.

Inline assembler is inserted textually in the generated assembler code.
This is a big part of the strength of inline assembler.  It does mean
that if you need a different target selected for your assembler code
then you need to arrange for that in your assembler code.

So yes, this very much is about -many, other -m options, and .machine .
I discourage the kernel (as well as any other project) from using -m
options, especially -many, but that is your own choice of course.  I
get sick and tired from you calling a deliberate design decision we
arrived at after years of work and weighing alternatives a "bug" though.


Segher


[PATCH v1] powerpc/interrupt: Remove struct interrupt_state

2022-02-25 Thread Christophe Leroy
Since commit ceff77efa4f8 ("powerpc/64e/interrupt: Use new interrupt
context tracking scheme") struct interrupt_state has been empty and
unused.

Remove it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/interrupt.h | 32 +++-
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index fc28f46d2f9d..984ffceb8f91 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -123,9 +123,6 @@ static inline void nap_adjust_return(struct pt_regs *regs)
 #endif
 }
 
-struct interrupt_state {
-};
-
 static inline void booke_restore_dbcr0(void)
 {
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
@@ -138,7 +135,7 @@ static inline void booke_restore_dbcr0(void)
 #endif
 }
 
-static inline void interrupt_enter_prepare(struct pt_regs *regs, struct 
interrupt_state *state)
+static inline void interrupt_enter_prepare(struct pt_regs *regs)
 {
 #ifdef CONFIG_PPC32
if (!arch_irq_disabled_regs(regs))
@@ -228,17 +225,17 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs, struct interrup
  * However interrupt_nmi_exit_prepare does return directly to regs, because
  * NMIs do not do "exit work" or replay soft-masked interrupts.
  */
-static inline void interrupt_exit_prepare(struct pt_regs *regs, struct 
interrupt_state *state)
+static inline void interrupt_exit_prepare(struct pt_regs *regs)
 {
 }
 
-static inline void interrupt_async_enter_prepare(struct pt_regs *regs, struct 
interrupt_state *state)
+static inline void interrupt_async_enter_prepare(struct pt_regs *regs)
 {
 #ifdef CONFIG_PPC64
/* Ensure interrupt_enter_prepare does not enable MSR[EE] */
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
 #endif
-   interrupt_enter_prepare(regs, state);
+   interrupt_enter_prepare(regs);
 #ifdef CONFIG_PPC_BOOK3S_64
/*
 * RI=1 is set by interrupt_enter_prepare, so this thread flags access
@@ -251,7 +248,7 @@ static inline void interrupt_async_enter_prepare(struct 
pt_regs *regs, struct in
irq_enter();
 }
 
-static inline void interrupt_async_exit_prepare(struct pt_regs *regs, struct 
interrupt_state *state)
+static inline void interrupt_async_exit_prepare(struct pt_regs *regs)
 {
/*
 * Adjust at exit so the main handler sees the true NIA. This must
@@ -262,7 +259,7 @@ static inline void interrupt_async_exit_prepare(struct 
pt_regs *regs, struct int
nap_adjust_return(regs);
 
irq_exit();
-   interrupt_exit_prepare(regs, state);
+   interrupt_exit_prepare(regs);
 }
 
 struct interrupt_nmi_state {
@@ -447,13 +444,11 @@ static __always_inline void ##func(struct pt_regs 
*regs); \
\
 interrupt_handler void func(struct pt_regs *regs)  \
 {  \
-   struct interrupt_state state;   \
-   \
-   interrupt_enter_prepare(regs, );  \
+   interrupt_enter_prepare(regs);  \
\
##func (regs);  \
\
-   interrupt_exit_prepare(regs, );   \
+   interrupt_exit_prepare(regs);   \
 }  \
 NOKPROBE_SYMBOL(func); \
\
@@ -482,14 +477,13 @@ static __always_inline long ##func(struct pt_regs 
*regs); \
\
 interrupt_handler long func(struct pt_regs *regs)  \
 {  \
-   struct interrupt_state state;   \
long ret;   \
\
-   interrupt_enter_prepare(regs, );  \
+   interrupt_enter_prepare(regs);  \
\
ret = ##func (regs);\
\
-   interrupt_exit_prepare(regs, );   \
+   interrupt_exit_prepare(regs);   \
\
 

Re: [PATCH 00/16] Remove usage of the deprecated "pci-dma-compat.h" API

2022-02-25 Thread Christoph Hellwig
I've applied patches 1,5,6,8 and 9 to the dma-mapping tree.


Re: [PATCH V6 16/20] riscv: compat: vdso: Add rv32 VDSO base code implementation

2022-02-25 Thread Guo Ren
On Fri, Feb 25, 2022 at 11:50 PM Arnd Bergmann  wrote:
>
> On Fri, Feb 25, 2022 at 4:42 PM Guo Ren  wrote:
> >
> > Hi Arnd & Palmer,
> >
> > Here is the new modified compat_vdso/Makefile, please have a look,
> > first. Then I would update it to v7:
> > ===
> > # SPDX-License-Identifier: GPL-2.0-only
> > #
> > # Makefile for compat_vdso
> > #
> >
> > # Symbols present in the compat_vdso
> > compat_vdso-syms  = rt_sigreturn
> > compat_vdso-syms += getcpu
> > compat_vdso-syms += flush_icache
> >
> > ifdef CROSS_COMPILE_COMPAT
> > COMPAT_CC := $(CROSS_COMPILE_COMPAT)gcc
> > COMPAT_LD := $(CROSS_COMPILE_COMPAT)ld
> > else
> > COMPAT_CC := $(CC)
> > COMPAT_LD := $(LD)
> > endif
> >
> > COMPAT_CC_FLAGS := -march=rv32g -mabi=ilp32
> > COMPAT_LD_FLAGS := -melf32lriscv
>
> Have you come across a case in which a separate cross toolchain
> is required? If not, I would leave this out and just set the flags for the
> normal toolchain.
Okay

>
> I also think it would be a nicer split to build the two vdso variants
> as vdso64/vdso32 rather than vdso/compat_vdso. That way,
> the build procedure can be kept as close as possible to the
> native 32-bit build.
Yes, current native 32-bit vdso & 64-bit vdso use the same
vdso/Makfile. So, I think it could be another patch for this cleanup.

>
> Arnd



-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/


Re: [PATCH V6 16/20] riscv: compat: vdso: Add rv32 VDSO base code implementation

2022-02-25 Thread Arnd Bergmann
On Fri, Feb 25, 2022 at 4:42 PM Guo Ren  wrote:
>
> Hi Arnd & Palmer,
>
> Here is the new modified compat_vdso/Makefile, please have a look,
> first. Then I would update it to v7:
> ===
> # SPDX-License-Identifier: GPL-2.0-only
> #
> # Makefile for compat_vdso
> #
>
> # Symbols present in the compat_vdso
> compat_vdso-syms  = rt_sigreturn
> compat_vdso-syms += getcpu
> compat_vdso-syms += flush_icache
>
> ifdef CROSS_COMPILE_COMPAT
> COMPAT_CC := $(CROSS_COMPILE_COMPAT)gcc
> COMPAT_LD := $(CROSS_COMPILE_COMPAT)ld
> else
> COMPAT_CC := $(CC)
> COMPAT_LD := $(LD)
> endif
>
> COMPAT_CC_FLAGS := -march=rv32g -mabi=ilp32
> COMPAT_LD_FLAGS := -melf32lriscv

Have you come across a case in which a separate cross toolchain
is required? If not, I would leave this out and just set the flags for the
normal toolchain.

I also think it would be a nicer split to build the two vdso variants
as vdso64/vdso32 rather than vdso/compat_vdso. That way,
the build procedure can be kept as close as possible to the
native 32-bit build.

Arnd


Re: [PATCH V6 16/20] riscv: compat: vdso: Add rv32 VDSO base code implementation

2022-02-25 Thread Guo Ren
Hi Arnd & Palmer,

Here is the new modified compat_vdso/Makefile, please have a look,
first. Then I would update it to v7:
===
# SPDX-License-Identifier: GPL-2.0-only
#
# Makefile for compat_vdso
#

# Symbols present in the compat_vdso
compat_vdso-syms  = rt_sigreturn
compat_vdso-syms += getcpu
compat_vdso-syms += flush_icache

ifdef CROSS_COMPILE_COMPAT
COMPAT_CC := $(CROSS_COMPILE_COMPAT)gcc
COMPAT_LD := $(CROSS_COMPILE_COMPAT)ld
else
COMPAT_CC := $(CC)
COMPAT_LD := $(LD)
endif

COMPAT_CC_FLAGS := -march=rv32g -mabi=ilp32
COMPAT_LD_FLAGS := -melf32lriscv

# Files to link into the compat_vdso
obj-compat_vdso = $(patsubst %, %.o, $(compat_vdso-syms)) note.o

# Build rules
targets := $(obj-compat_vdso) compat_vdso.so compat_vdso.so.dbg compat_vdso.lds
obj-compat_vdso := $(addprefix $(obj)/, $(obj-compat_vdso))

obj-y += compat_vdso.o
CPPFLAGS_compat_vdso.lds += -P -C -U$(ARCH)

# Disable profiling and instrumentation for VDSO code
GCOV_PROFILE := n
KCOV_INSTRUMENT := n
KASAN_SANITIZE := n
UBSAN_SANITIZE := n

# Force dependency
$(obj)/compat_vdso.o: $(obj)/compat_vdso.so

# link rule for the .so file, .lds has to be first
$(obj)/compat_vdso.so.dbg: $(obj)/compat_vdso.lds $(obj-compat_vdso) FORCE
$(call if_changed,compat_vdsold)
LDFLAGS_compat_vdso.so.dbg = -shared -S -soname=linux-compat_vdso.so.1 \
--build-id=sha1 --hash-style=both --eh-frame-hdr

$(obj-compat_vdso): %.o: %.S FORCE
$(call if_changed_dep,compat_vdsoas)

# strip rule for the .so file
$(obj)/%.so: OBJCOPYFLAGS := -S
$(obj)/%.so: $(obj)/%.so.dbg FORCE
$(call if_changed,objcopy)

# Generate VDSO offsets using helper script
gen-compat_vdsosym := $(srctree)/$(src)/gen_compat_vdso_offsets.sh
quiet_cmd_compat_vdsosym = VDSOSYM $@
cmd_compat_vdsosym = $(NM) $< | $(gen-compat_vdsosym) |
LC_ALL=C sort > $@

include/generated/compat_vdso-offsets.h: $(obj)/compat_vdso.so.dbg FORCE
$(call if_changed,compat_vdsosym)

# actual build commands
# The DSO images are built using a special linker script
# Make sure only to export the intended __compat_vdso_xxx symbol offsets.
quiet_cmd_compat_vdsold = VDSOLD  $@
  cmd_compat_vdsold = $(COMPAT_LD) $(ld_flags) $(COMPAT_LD_FLAGS)
-T $(filter-out FORCE,$^) -o $@.tmp && \
   $(OBJCOPY) $(patsubst %, -G __compat_vdso_%,
$(compat_vdso-syms)) $@.tmp $@ && \
   rm $@.tmp

# actual build commands
quiet_cmd_compat_vdsoas = VDSOAS $@
  cmd_compat_vdsoas = $(COMPAT_CC) $(a_flags) $(COMPAT_CC_FLAGS) -c -o $@ $<

# install commands for the unstripped file
quiet_cmd_compat_vdso_install = INSTALL $@
  cmd_compat_vdso_install = cp $(obj)/$@.dbg $(MODLIB)/compat_vdso/$@

compat_vdso.so: $(obj)/compat_vdso.so.dbg
@mkdir -p $(MODLIB)/compat_vdso
$(call cmd,compat_vdso_install)

compat_vdso_install: compat_vdso.so
===

Here is the make V=1 output:

make -f /home/guoren/source/kernel/riscv-linux/scripts/Makefile.build
obj=arch/riscv/kernel/vdso include/generated/vdso-offsets.h
make -f /home/guoren/source/kernel/riscv-linux/scripts/Makefile.build
obj=arch/riscv/kernel/compat_vdso
include/generated/compat_vdso-offsets.h
  riscv64-unknown-linux-gnu-gcc -E
-Wp,-MMD,arch/riscv/kernel/compat_vdso/.compat_vdso.lds.d  -nostdinc
-I/home/guoren/source/kernel/riscv-linux/arch/riscv/include
-I./arch/riscv/include/generated
-I/home/guoren/source/kernel/riscv-linux/include -I./include
-I/home/guoren/source/kernel/riscv-linux/arch/riscv/include/uapi
-I./arch/riscv/include/generated/uapi
-I/home/guoren/source/kernel/riscv-linux/include/uapi -I./includ
e/generated/uapi -include
/home/guoren/source/kernel/riscv-linux/include/linux/compiler-version.h
-include /home/guoren/source/kernel/riscv-linux/include/linux/kconfig.h
-D__KERNEL__ -fmacro-prefix-map=/home/guoren/source/kernel/riscv-linux/=
   -P -C -Uriscv -I
/home/guoren/source/kernel/riscv-linux/arch/riscv/kernel/compat_vdso
-I ./arch/riscv/kernel/compat_vdso -P -Uriscv -D__ASSEMBLY__
-DLINKER_SCRIPT -o arch/riscv/ke
rnel/compat_vdso/compat_vdso.lds
/home/guoren/source/kernel/riscv-linux/arch/riscv/kernel/compat_vdso/compat_vdso.lds.S
  riscv64-unknown-linux-gnu-gcc
-Wp,-MMD,arch/riscv/kernel/compat_vdso/.rt_sigreturn.o.d  -nostdinc
-I/home/guoren/source/kernel/riscv-linux/arch/riscv/include
-I./arch/riscv/include/generated
-I/home/guoren/source/kernel/riscv-linux/include -I./include
-I/home/guoren/source/kernel/riscv-linux/arch/riscv/include/uapi
-I./arch/riscv/include/generated/uapi
-I/home/guoren/source/kernel/riscv-linux/include/uapi -I./include/ge
nerated/uapi -include
/home/guoren/source/kernel/riscv-linux/include/linux/compiler-version.h
-include /home/guoren/source/kernel/riscv-linux/include/linux/kconfig.h
-D__KERNEL__ -fmacro-prefix-map=/home/guoren/source/kernel/riscv-linux/=
-D__ASSEMBLY__ -fno-PIE -mabi=lp64 -march=rv64imafdc -I

Re: cleanup swiotlb initialization

2022-02-25 Thread Boris Ostrovsky



On 2/25/22 3:47 AM, Christoph Hellwig wrote:

On Thu, Feb 24, 2022 at 12:07:26PM -0500, Boris Ostrovsky wrote:

Just to be clear: this crashes only as dom0. Boots fine as baremetal.

Ah.  I can gues what this might be.  On Xen the hypervisor controls the
IOMMU and we should never end up initializing it in Linux, right?


Right, we shouldn't be in that code path.

Can you try the patch below on top of the series?



Yes, this makes dom0 boot fine.


(It also addresses something I wanted to mention while looking at the patches, 
which was to remove Xen initialization code from pci_swiotlb_detect_4gb() since 
it had noting to do with what the routine's name implies.)


-boris



[PATCH v7 4/4] docs: ABI: sysfs-bus-nvdimm: Document sysfs event format entries for nvdimm pmu

2022-02-25 Thread Kajol Jain
Details are added for the event, cpumask and format attributes
in the ABI documentation.

Acked-by: Peter Zijlstra (Intel) 
Tested-by: Nageswara R Sastry 
Signed-off-by: Kajol Jain 
---
Changelog:
v6 -> v7
- Add Acked-by and Tested-by tag from Peter Zijlstra
  and Nageswara R Sastry.

 Documentation/ABI/testing/sysfs-bus-nvdimm | 35 ++
 1 file changed, 35 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-nvdimm 
b/Documentation/ABI/testing/sysfs-bus-nvdimm
index bff84a16812a..1c1f5acbf53d 100644
--- a/Documentation/ABI/testing/sysfs-bus-nvdimm
+++ b/Documentation/ABI/testing/sysfs-bus-nvdimm
@@ -6,3 +6,38 @@ Description:
 
 The libnvdimm sub-system implements a common sysfs interface for
 platform nvdimm resources. See Documentation/driver-api/nvdimm/.
+
+What:   /sys/bus/event_source/devices/nmemX/format
+Date:   February 2022
+KernelVersion:  5.18
+Contact:Kajol Jain 
+Description:   (RO) Attribute group to describe the magic bits
+   that go into perf_event_attr.config for a particular pmu.
+   (See ABI/testing/sysfs-bus-event_source-devices-format).
+
+   Each attribute under this group defines a bit range of the
+   perf_event_attr.config. Supported attribute is listed
+   below::
+ event  = "config:0-4"  - event ID
+
+   For example::
+   ctl_res_cnt = "event=0x1"
+
+What:   /sys/bus/event_source/devices/nmemX/events
+Date:   February 2022
+KernelVersion:  5.18
+Contact:Kajol Jain 
+Description:   (RO) Attribute group to describe performance monitoring events
+for the nvdimm memory device. Each attribute in this group
+describes a single performance monitoring event supported by
+this nvdimm pmu.  The name of the file is the name of the 
event.
+(See ABI/testing/sysfs-bus-event_source-devices-events). A
+listing of the events supported by a given nvdimm provider type
+can be found in Documentation/driver-api/nvdimm/$provider.
+
+What:  /sys/bus/event_source/devices/nmemX/cpumask
+Date:  February 2022
+KernelVersion:  5.18
+Contact:Kajol Jain 
+Description:   (RO) This sysfs file exposes the cpumask which is designated to
+   to retrieve nvdimm pmu event counter data.
-- 
2.31.1



[PATCH v7 3/4] powerpc/papr_scm: Add perf interface support

2022-02-25 Thread Kajol Jain
Performance monitoring support for papr-scm nvdimm devices
via perf interface is added which includes addition of pmu
functions like add/del/read/event_init for nvdimm_pmu struture.

A new parameter 'priv' in added to the pdev_archdata structure to save
nvdimm_pmu device pointer, to handle the unregistering of pmu device.

papr_scm_pmu_register function populates the nvdimm_pmu structure
with name, capabilities, cpumask along with event handling
functions. Finally the populated nvdimm_pmu structure is passed to
register the pmu device. Event handling functions internally uses
hcall to get events and counter data.

Result in power9 machine with 2 nvdimm device:

Ex: List all event by perf list

command:# perf list nmem

  nmem0/cache_rh_cnt/[Kernel PMU event]
  nmem0/cache_wh_cnt/[Kernel PMU event]
  nmem0/cri_res_util/[Kernel PMU event]
  nmem0/ctl_res_cnt/ [Kernel PMU event]
  nmem0/ctl_res_tm/  [Kernel PMU event]
  nmem0/fast_w_cnt/  [Kernel PMU event]
  nmem0/host_l_cnt/  [Kernel PMU event]
  nmem0/host_l_dur/  [Kernel PMU event]
  nmem0/host_s_cnt/  [Kernel PMU event]
  nmem0/host_s_dur/  [Kernel PMU event]
  nmem0/med_r_cnt/   [Kernel PMU event]
  nmem0/med_r_dur/   [Kernel PMU event]
  nmem0/med_w_cnt/   [Kernel PMU event]
  nmem0/med_w_dur/   [Kernel PMU event]
  nmem0/mem_life/[Kernel PMU event]
  nmem0/poweron_secs/[Kernel PMU event]
  ...
  nmem1/mem_life/[Kernel PMU event]
  nmem1/poweron_secs/[Kernel PMU event]


Acked-by: Peter Zijlstra (Intel) 
Tested-by: Nageswara R Sastry 
Signed-off-by: Kajol Jain 
[Add numa_map_to_online_node function call to get online node id]
Reported-by: Nageswara R Sastry 
---
Changelog:
Resend v6 -> v7
- Add function call to numa_map_to_online_node function inorder to
  get online numa node. As the node id returned by function dev_to_node
  can be offline in some scenarios and create issue in hotplug code.

- Add Acked-by, Tested-by and Reported-by tags from Peter Zijlstra
  and Nageswara R Sastry.

 arch/powerpc/include/asm/device.h |   5 +
 arch/powerpc/platforms/pseries/papr_scm.c | 225 ++
 2 files changed, 230 insertions(+)

diff --git a/arch/powerpc/include/asm/device.h 
b/arch/powerpc/include/asm/device.h
index 219559d65864..47ed639f3b8f 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -48,6 +48,11 @@ struct dev_archdata {
 
 struct pdev_archdata {
u64 dma_mask;
+   /*
+* Pointer to nvdimm_pmu structure, to handle the unregistering
+* of pmu device
+*/
+   void *priv;
 };
 
 #endif /* _ASM_POWERPC_DEVICE_H */
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f48e87ac89c9..4dd513d7c029 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -68,6 +69,8 @@
 #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
 #define PAPR_SCM_PERF_STATS_VERSION 0x1
 
+#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu)
+
 /* Struct holding a single performance metric */
 struct papr_scm_perf_stat {
u8 stat_id[8];
@@ -120,6 +123,9 @@ struct papr_scm_priv {
 
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
+
+/* array to have event_code and stat_id mappings */
+   char **nvdimm_events_map;
 };
 
 static int papr_scm_pmem_flush(struct nd_region *nd_region,
@@ -340,6 +346,218 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
return 0;
 }
 
+static int papr_scm_pmu_get_value(struct perf_event *event, struct device 
*dev, u64 *count)
+{
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+   struct papr_scm_priv *p = (struct papr_scm_priv *)dev->driver_data;
+   int rc, size;
+
+   /* Allocate request buffer enough to hold single performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   if (!p || !p->nvdimm_events_map)
+   return -EINVAL;
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+   return -ENOMEM;
+
+   stat = >scm_statistic[0];
+   memcpy(>stat_id,
+  p->nvdimm_events_map[event->attr.config],
+

[PATCH v7 0/4] Add perf interface to expose nvdimm

2022-02-25 Thread Kajol Jain
Patchset adds performance stats reporting support for nvdimm.
Added interface includes support for pmu register/unregister
functions. A structure is added called nvdimm_pmu to be used for
adding arch/platform specific data such as cpumask, nvdimm device
pointer and pmu event functions like event_init/add/read/del.
User could use the standard perf tool to access perf events
exposed via pmu.

Interface also defines supported event list, config fields for the
event attributes and their corresponding bit values which are exported
via sysfs. Patch 3 exposes IBM pseries platform nmem* device
performance stats using this interface.

Result from power9 pseries lpar with 2 nvdimm device:

Ex: List all event by perf list

command:# perf list nmem

  nmem0/cache_rh_cnt/[Kernel PMU event]
  nmem0/cache_wh_cnt/[Kernel PMU event]
  nmem0/cri_res_util/[Kernel PMU event]
  nmem0/ctl_res_cnt/ [Kernel PMU event]
  nmem0/ctl_res_tm/  [Kernel PMU event]
  nmem0/fast_w_cnt/  [Kernel PMU event]
  nmem0/host_l_cnt/  [Kernel PMU event]
  nmem0/host_l_dur/  [Kernel PMU event]
  nmem0/host_s_cnt/  [Kernel PMU event]
  nmem0/host_s_dur/  [Kernel PMU event]
  nmem0/med_r_cnt/   [Kernel PMU event]
  nmem0/med_r_dur/   [Kernel PMU event]
  nmem0/med_w_cnt/   [Kernel PMU event]
  nmem0/med_w_dur/   [Kernel PMU event]
  nmem0/mem_life/[Kernel PMU event]
  nmem0/poweron_secs/[Kernel PMU event]
  ...
  nmem1/mem_life/[Kernel PMU event]
  nmem1/poweron_secs/[Kernel PMU event]

Patch1:
Introduces the nvdimm_pmu structure
Patch2:
Adds common interface to add arch/platform specific data
includes nvdimm device pointer, pmu data along with
pmu event functions. It also defines supported event list
and adds attribute groups for format, events and cpumask.
It also adds code for cpu hotplug support.
Patch3:
Add code in arch/powerpc/platform/pseries/papr_scm.c to expose
nmem* pmu. It fills in the nvdimm_pmu structure with pmu name,
capabilities, cpumask and event functions and then registers
the pmu by adding callbacks to register_nvdimm_pmu.
Patch4:
Sysfs documentation patch

Changelog
---
v6 -> v7
- Add function call to numa_map_to_online_node function inorder to
  get online numa node. As the node id returned by function dev_to_node
  can be offline in some scenarios and can create issue in hotplug code
  as reported by Nageswara R Sastry.

- Add function declaration of perf_pmu_register, perf_pmu_unregister
  and  perf_pmu_migrate_context functions in nd.h file to resolve
  the implicit-function-declaration warning as reported by kernel test
  robot.
  Link: https://lore.kernel.org/all/202202241242.zqzgkguy-...@intel.com/

- Add Tested-by, Acked-by and Reported-by tags from Peter Zijlstra
  and Nageswara R Sastry.

- Link to the patchset v6: https://lkml.org/lkml/2022/2/17/857

Resend v5 -> v6
- No logic change, just a rebase to latest upstream and
  tested the patchset.

- Link to the patchset Resend v5: https://lkml.org/lkml/2021/11/15/3979

v5 -> Resend v5
- Resend the patchset

- Link to the patchset v5: https://lkml.org/lkml/2021/9/28/643

v4 -> v5:
- Remove multiple variables defined in nvdimm_pmu structure include
  name and pmu functions(event_int/add/del/read) as they are just
  used to copy them again in pmu variable. Now we are directly doing
  this step in arch specific code as suggested by Dan Williams.

- Remove attribute group field from nvdimm pmu structure and
  defined these attribute groups in common interface which
  includes format, event list along with cpumask as suggested by
  Dan Williams.
  Since we added static defination for attrbute groups needed in
  common interface, removes corresponding code from papr.

- Add nvdimm pmu event list with event codes in the common interface.

- Remove Acked-by/Reviewed-by/Tested-by tags as code is refactored
  to handle review comments from Dan.

- Make nvdimm_pmu_free_hotplug_memory function static as reported
  by kernel test robot, also add corresponding Reported-by tag.

- Link to the patchset v4: https://lkml.org/lkml/2021/9/3/45

v3 -> v4
- Rebase code on top of current papr_scm code without any logical
  changes.

- Added Acked-by tag from Peter Zijlstra and Reviewed by tag
  from Madhavan Srinivasan.

- Link to the patchset v3: https://lkml.org/lkml/2021/6/17/605

v2 -> v3
- Added Tested-by tag.

- Fix nvdimm 

[PATCH v7 2/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats

2022-02-25 Thread Kajol Jain
A common interface is added to get performance stats reporting
support for nvdimm devices. Added interface defines supported
event list, config fields for the event attributes and their
corresponding bit values which are exported via sysfs.

Interface also added support for pmu register/unregister functions,
cpu hotplug feature along with macros for handling events addition
via sysfs. It adds attribute groups for format, cpumask and events
to the pmu structure.

User could use the standard perf tool to access perf events exposed
via nvdimm pmu.

Acked-by: Peter Zijlstra (Intel) 
Tested-by: Nageswara R Sastry 
Signed-off-by: Kajol Jain 
[Declare pmu functions in nd.h file to resolve implicit-function-declaration
warning and make hotplug function static as reported by kernel test robot]
Link: https://lore.kernel.org/all/202202241242.zqzgkguy-...@intel.com/
Reported-by: kernel test robot 
---
Changelog:
v6 -> v7
- Add function declaration of perf_pmu_register, perf_pmu_unregister
  and  perf_pmu_migrate_context functions in nd.h file to resolve
  the implicit-function-declaration warning as reported by kernel test robot.

- Add Acked-by and Tested-by tag from Peter Zijlstra and Nageswara R
  Sastry.

 drivers/nvdimm/Makefile  |   1 +
 drivers/nvdimm/nd_perf.c | 328 +++
 include/linux/nd.h   |  24 +++
 3 files changed, 353 insertions(+)
 create mode 100644 drivers/nvdimm/nd_perf.c

diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 29203f3d3069..25dba6095612 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -18,6 +18,7 @@ nd_e820-y := e820.o
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
+libnvdimm-y += nd_perf.o
 libnvdimm-y += dimm.o
 libnvdimm-y += region_devs.o
 libnvdimm-y += region.o
diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c
new file mode 100644
index ..314415894acf
--- /dev/null
+++ b/drivers/nvdimm/nd_perf.c
@@ -0,0 +1,328 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * nd_perf.c: NVDIMM Device Performance Monitoring Unit support
+ *
+ * Perf interface to expose nvdimm performance stats.
+ *
+ * Copyright (C) 2021 IBM Corporation
+ */
+
+#define pr_fmt(fmt) "nvdimm_pmu: " fmt
+
+#include 
+
+#define EVENT(_name, _code) enum{_name = _code}
+
+/*
+ * NVDIMM Events codes.
+ */
+
+/* Controller Reset Count */
+EVENT(CTL_RES_CNT, 0x1);
+/* Controller Reset Elapsed Time */
+EVENT(CTL_RES_TM,  0x2);
+/* Power-on Seconds */
+EVENT(POWERON_SECS,0x3);
+/* Life Remaining */
+EVENT(MEM_LIFE,0x4);
+/* Critical Resource Utilization */
+EVENT(CRI_RES_UTIL,0x5);
+/* Host Load Count */
+EVENT(HOST_L_CNT,  0x6);
+/* Host Store Count */
+EVENT(HOST_S_CNT,  0x7);
+/* Host Store Duration */
+EVENT(HOST_S_DUR,  0x8);
+/* Host Load Duration */
+EVENT(HOST_L_DUR,  0x9);
+/* Media Read Count */
+EVENT(MED_R_CNT,   0xa);
+/* Media Write Count */
+EVENT(MED_W_CNT,   0xb);
+/* Media Read Duration */
+EVENT(MED_R_DUR,   0xc);
+/* Media Write Duration */
+EVENT(MED_W_DUR,   0xd);
+/* Cache Read Hit Count */
+EVENT(CACHE_RH_CNT,0xe);
+/* Cache Write Hit Count */
+EVENT(CACHE_WH_CNT,0xf);
+/* Fast Write Count */
+EVENT(FAST_W_CNT,  0x10);
+
+NVDIMM_EVENT_ATTR(ctl_res_cnt, CTL_RES_CNT);
+NVDIMM_EVENT_ATTR(ctl_res_tm,  CTL_RES_TM);
+NVDIMM_EVENT_ATTR(poweron_secs,POWERON_SECS);
+NVDIMM_EVENT_ATTR(mem_life,MEM_LIFE);
+NVDIMM_EVENT_ATTR(cri_res_util,CRI_RES_UTIL);
+NVDIMM_EVENT_ATTR(host_l_cnt,  HOST_L_CNT);
+NVDIMM_EVENT_ATTR(host_s_cnt,  HOST_S_CNT);
+NVDIMM_EVENT_ATTR(host_s_dur,  HOST_S_DUR);
+NVDIMM_EVENT_ATTR(host_l_dur,  HOST_L_DUR);
+NVDIMM_EVENT_ATTR(med_r_cnt,   MED_R_CNT);
+NVDIMM_EVENT_ATTR(med_w_cnt,   MED_W_CNT);
+NVDIMM_EVENT_ATTR(med_r_dur,   MED_R_DUR);
+NVDIMM_EVENT_ATTR(med_w_dur,   MED_W_DUR);
+NVDIMM_EVENT_ATTR(cache_rh_cnt,CACHE_RH_CNT);
+NVDIMM_EVENT_ATTR(cache_wh_cnt,CACHE_WH_CNT);
+NVDIMM_EVENT_ATTR(fast_w_cnt,  FAST_W_CNT);
+
+static struct attribute *nvdimm_events_attr[] = {
+   NVDIMM_EVENT_PTR(CTL_RES_CNT),
+   NVDIMM_EVENT_PTR(CTL_RES_TM),
+   NVDIMM_EVENT_PTR(POWERON_SECS),
+   NVDIMM_EVENT_PTR(MEM_LIFE),
+   NVDIMM_EVENT_PTR(CRI_RES_UTIL),
+   NVDIMM_EVENT_PTR(HOST_L_CNT),
+   NVDIMM_EVENT_PTR(HOST_S_CNT),
+   NVDIMM_EVENT_PTR(HOST_S_DUR),
+   NVDIMM_EVENT_PTR(HOST_L_DUR),
+   NVDIMM_EVENT_PTR(MED_R_CNT),
+   NVDIMM_EVENT_PTR(MED_W_CNT),
+   NVDIMM_EVENT_PTR(MED_R_DUR),
+   NVDIMM_EVENT_PTR(MED_W_DUR),
+   NVDIMM_EVENT_PTR(CACHE_RH_CNT),
+   NVDIMM_EVENT_PTR(CACHE_WH_CNT),
+   NVDIMM_EVENT_PTR(FAST_W_CNT),
+   NULL
+};
+
+static struct 

[PATCH v7 1/4] drivers/nvdimm: Add nvdimm pmu structure

2022-02-25 Thread Kajol Jain
A structure is added called nvdimm_pmu, for performance
stats reporting support of nvdimm devices. It can be used to add
device pmu data such as pmu data structure for performance
stats, nvdimm device pointer along with cpumask attributes.

Acked-by: Peter Zijlstra (Intel) 
Tested-by: Nageswara R Sastry 
Signed-off-by: Kajol Jain 
---
Changelog:
v6 -> v7
- Add Acked-by and Tested-by tag from Peter Zijlstra
  and Nageswara R Sastry.

 include/linux/nd.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/include/linux/nd.h b/include/linux/nd.h
index 8a8c63edb1b2..ad186e828263 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 enum nvdimm_event {
NVDIMM_REVALIDATE_POISON,
@@ -23,6 +24,25 @@ enum nvdimm_claim_class {
NVDIMM_CCLASS_UNKNOWN,
 };
 
+/**
+ * struct nvdimm_pmu - data structure for nvdimm perf driver
+ * @pmu: pmu data structure for nvdimm performance stats.
+ * @dev: nvdimm device pointer.
+ * @cpu: designated cpu for counter access.
+ * @node: node for cpu hotplug notifier link.
+ * @cpuhp_state: state for cpu hotplug notification.
+ * @arch_cpumask: cpumask to get designated cpu for counter access.
+ */
+struct nvdimm_pmu {
+   struct pmu pmu;
+   struct device *dev;
+   int cpu;
+   struct hlist_node node;
+   enum cpuhp_state cpuhp_state;
+   /* cpumask provided by arch/platform specific code */
+   struct cpumask arch_cpumask;
+};
+
 struct nd_device_driver {
struct device_driver drv;
unsigned long type;
-- 
2.31.1



Re: [PATCH] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2022-02-25 Thread Christophe Leroy


Le 25/02/2022 à 11:24, Mahesh Salgaonkar a écrit :
> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
> interrupt handler) if percpu allocation comes from vmalloc area.
> 
> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
> percpu allocation is from the embedded first chunk. However with
> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
> allocation can come from the vmalloc area.
> 
> With kernel command line "percpu_alloc=page" we can force percpu allocation
> to come from vmalloc area and can see kernel crash in machine_check_early:
> 
> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
> [1.215719] --- interrupt: 200
> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
> [1.215722] [c00fffd731b0] [] 0x0
> [1.215724] [c00fffd73210] [c0008364] 
> machine_check_early_common+0x134/0x1f8
> 
> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
> first chunk is not embedded.
> 
> Signed-off-by: Mahesh Salgaonkar 
> ---
>   arch/powerpc/include/asm/interrupt.h |   15 +++
>   arch/powerpc/include/asm/percpu.h|2 ++
>   arch/powerpc/kernel/setup_64.c   |3 +++
>   3 files changed, 20 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/interrupt.h 
> b/arch/powerpc/include/asm/interrupt.h
> index fc28f46d2f9dc..8c5b3a7ca2ab6 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -327,6 +327,16 @@ static inline void interrupt_nmi_enter_prepare(struct 
> pt_regs *regs, struct inte
>   }
>   #endif
>   
> + /*
> +  * Do not use nmi_enter() in real mode if percpu first chunk is
> +  * not embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled
> +  * there are chances where percpu allocation can come from vmalloc
> +  * area.
> +  */
> + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) &&
> + !(mfmsr() & MSR_DR) && !__percpu_embed_first_chunk)
> + return;
> +

It would probably be better to refactor with the following test and 
avoid calling mfmsr() twice.

>   /*
>* Do not use nmi_enter() for pseries hash guest taking a real-mode
>* NMI because not everything it touches is within the RMA limit.
> @@ -339,6 +349,10 @@ static inline void interrupt_nmi_enter_prepare(struct 
> pt_regs *regs, struct inte
>   
>   static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct 
> interrupt_nmi_state *state)
>   {
> + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) &&
> + !(mfmsr() & MSR_DR) && !__percpu_embed_first_chunk)
> + goto skip_nmi_exit;
> +
>   if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64) ||
>   !firmware_has_feature(FW_FEATURE_LPAR) ||
>   radix_enabled() || (mfmsr() & MSR_DR))

Same here, if you can avoid calling mfmsr() two times that would be better.


> @@ -349,6 +363,7 @@ static inline void interrupt_nmi_exit_prepare(struct 
> pt_regs *regs, struct inter
>* new work to do (must use irq_work for that).
>*/
>   
> +skip_nmi_exit:
>   #ifdef CONFIG_PPC64
>   #ifdef CONFIG_PPC_BOOK3S
>   if (arch_irq_disabled_regs(regs)) {
> diff --git a/arch/powerpc/include/asm/percpu.h 
> b/arch/powerpc/include/asm/percpu.h
> index 8e5b7d0b851c6..45c08f5b0b4e9 100644
> --- a/arch/powerpc/include/asm/percpu.h
> +++ b/arch/powerpc/include/asm/percpu.h
> @@ -12,6 +12,8 @@
>   
>   #define __my_cpu_offset local_paca->data_offset
>   
> +extern bool __percpu_embed_first_chunk;
> +
>   #endif /* CONFIG_SMP */
>   #endif /* __powerpc64__ */
>   
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index be8577ac93971..39dbf2fb23d61 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -786,6 +786,7 @@ static __init int pcpu_cpu_to_node(int cpu)
>   
>   unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
>   EXPORT_SYMBOL(__per_cpu_offset);
> +bool __percpu_embed_first_chunk;
>   
>   void __init setup_per_cpu_areas(void)
>   {
> @@ -821,6 +822,8 @@ void __init setup_per_cpu_areas(void)
>   pr_warn("PERCPU: %s allocator failed (%d), "
>   "falling back to page size\n",
>   pcpu_fc_names[pcpu_chosen_fc], rc);
> + else
> + __percpu_embed_first_chunk = true;

If I understand correctly that happens once at system startup ?

Then could you use jump_labels in order to avoid redoing the test at 
each interrupt entry ?


>   }
>   
>   if (rc < 0)
> 
> 

Re: [PATCH v6 0/4] Add perf interface to expose nvdimm

2022-02-25 Thread kajoljain



On 2/25/22 16:41, Nageswara Sastry wrote:
> 
> 
> On 25/02/22 12:08 pm, kajoljain wrote:
>>
>>
>> On 2/25/22 11:25, Nageswara Sastry wrote:
>>>
>>>
>>> On 17/02/22 10:03 pm, Kajol Jain wrote:
 Patchset adds performance stats reporting support for nvdimm.
 Added interface includes support for pmu register/unregister
 functions. A structure is added called nvdimm_pmu to be used for
 adding arch/platform specific data such as cpumask, nvdimm device
 pointer and pmu event functions like event_init/add/read/del.
 User could use the standard perf tool to access perf events
 exposed via pmu.

 Interface also defines supported event list, config fields for the
 event attributes and their corresponding bit values which are exported
 via sysfs. Patch 3 exposes IBM pseries platform nmem* device
 performance stats using this interface.

 Result from power9 pseries lpar with 2 nvdimm device:

 Ex: List all event by perf list

 command:# perf list nmem

     nmem0/cache_rh_cnt/    [Kernel PMU
 event]
     nmem0/cache_wh_cnt/    [Kernel PMU
 event]
     nmem0/cri_res_util/    [Kernel PMU
 event]
     nmem0/ctl_res_cnt/ [Kernel PMU
 event]
     nmem0/ctl_res_tm/  [Kernel PMU
 event]
     nmem0/fast_w_cnt/  [Kernel PMU
 event]
     nmem0/host_l_cnt/  [Kernel PMU
 event]
     nmem0/host_l_dur/  [Kernel PMU
 event]
     nmem0/host_s_cnt/  [Kernel PMU
 event]
     nmem0/host_s_dur/  [Kernel PMU
 event]
     nmem0/med_r_cnt/   [Kernel PMU
 event]
     nmem0/med_r_dur/   [Kernel PMU
 event]
     nmem0/med_w_cnt/   [Kernel PMU
 event]
     nmem0/med_w_dur/   [Kernel PMU
 event]
     nmem0/mem_life/    [Kernel PMU
 event]
     nmem0/poweron_secs/    [Kernel PMU
 event]
     ...
     nmem1/mem_life/    [Kernel PMU
 event]
     nmem1/poweron_secs/    [Kernel PMU
 event]

 Patch1:
   Introduces the nvdimm_pmu structure
 Patch2:
   Adds common interface to add arch/platform specific data
   includes nvdimm device pointer, pmu data along with
   pmu event functions. It also defines supported event list
   and adds attribute groups for format, events and cpumask.
   It also adds code for cpu hotplug support.
 Patch3:
   Add code in arch/powerpc/platform/pseries/papr_scm.c to
 expose
   nmem* pmu. It fills in the nvdimm_pmu structure with pmu
 name,
   capabilities, cpumask and event functions and then registers
   the pmu by adding callbacks to register_nvdimm_pmu.
 Patch4:
   Sysfs documentation patch

 Changelog
>>>
>>> Tested these patches with the automated tests at
>>> avocado-misc-tests/perf/perf_nmem.py
>>> URL:
>>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/perf/perf_nmem.py
>>>
>>>
>>>
>>> 1. On the system where target id and online id were different then not
>>> seeing value in 'cpumask' and those tests failed.
>>>
>>> Example:
>>> Log from dmesg
>>> ...
>>> papr_scm ibm,persistent-memory:ibm,pmemory@4413: Region registered
>>> with target node 1 and online node 0
>>> ...
>>
>> Hi Nageswara Sastry,
>>     Thanks for testing the patch set. Yes you right, incase target
>> node id and online node id is different, it can happen when target
>> node is not online and hence can cause this issue, thanks for pointing
>> it.
>>
>> Function dev_to_node will return node id for a given nvdimm device which
>> can be offline in some scenarios. We should use numa node id return by
>> numa_map_to_online_node function in that scenario. This function incase
>> given node is offline, it will lookup for next closest online node and
>> return that nodeid.
>>
>> Can you try with below change and see, if you are still getting this
>> issue. Please let me know.
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index bdf2620db461..4dd513d7c029 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -536,7 +536,7 @@ static void papr_scm_pmu_register(struct
>> papr_scm_priv *p)
>>  PERF_PMU_CAP_NO_EXCLUDE;
>>
>>  /*updating 

Re: [PATCH v6 0/4] Add perf interface to expose nvdimm

2022-02-25 Thread Nageswara Sastry




On 25/02/22 12:08 pm, kajoljain wrote:



On 2/25/22 11:25, Nageswara Sastry wrote:



On 17/02/22 10:03 pm, Kajol Jain wrote:

Patchset adds performance stats reporting support for nvdimm.
Added interface includes support for pmu register/unregister
functions. A structure is added called nvdimm_pmu to be used for
adding arch/platform specific data such as cpumask, nvdimm device
pointer and pmu event functions like event_init/add/read/del.
User could use the standard perf tool to access perf events
exposed via pmu.

Interface also defines supported event list, config fields for the
event attributes and their corresponding bit values which are exported
via sysfs. Patch 3 exposes IBM pseries platform nmem* device
performance stats using this interface.

Result from power9 pseries lpar with 2 nvdimm device:

Ex: List all event by perf list

command:# perf list nmem

    nmem0/cache_rh_cnt/    [Kernel PMU event]
    nmem0/cache_wh_cnt/    [Kernel PMU event]
    nmem0/cri_res_util/    [Kernel PMU event]
    nmem0/ctl_res_cnt/ [Kernel PMU event]
    nmem0/ctl_res_tm/  [Kernel PMU event]
    nmem0/fast_w_cnt/  [Kernel PMU event]
    nmem0/host_l_cnt/  [Kernel PMU event]
    nmem0/host_l_dur/  [Kernel PMU event]
    nmem0/host_s_cnt/  [Kernel PMU event]
    nmem0/host_s_dur/  [Kernel PMU event]
    nmem0/med_r_cnt/   [Kernel PMU event]
    nmem0/med_r_dur/   [Kernel PMU event]
    nmem0/med_w_cnt/   [Kernel PMU event]
    nmem0/med_w_dur/   [Kernel PMU event]
    nmem0/mem_life/    [Kernel PMU event]
    nmem0/poweron_secs/    [Kernel PMU event]
    ...
    nmem1/mem_life/    [Kernel PMU event]
    nmem1/poweron_secs/    [Kernel PMU event]

Patch1:
  Introduces the nvdimm_pmu structure
Patch2:
  Adds common interface to add arch/platform specific data
  includes nvdimm device pointer, pmu data along with
  pmu event functions. It also defines supported event list
  and adds attribute groups for format, events and cpumask.
  It also adds code for cpu hotplug support.
Patch3:
  Add code in arch/powerpc/platform/pseries/papr_scm.c to expose
  nmem* pmu. It fills in the nvdimm_pmu structure with pmu name,
  capabilities, cpumask and event functions and then registers
  the pmu by adding callbacks to register_nvdimm_pmu.
Patch4:
  Sysfs documentation patch

Changelog


Tested these patches with the automated tests at
avocado-misc-tests/perf/perf_nmem.py
URL:
https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/perf/perf_nmem.py


1. On the system where target id and online id were different then not
seeing value in 'cpumask' and those tests failed.

Example:
Log from dmesg
...
papr_scm ibm,persistent-memory:ibm,pmemory@4413: Region registered
with target node 1 and online node 0
...


Hi Nageswara Sastry,
Thanks for testing the patch set. Yes you right, incase target
node id and online node id is different, it can happen when target
node is not online and hence can cause this issue, thanks for pointing
it.

Function dev_to_node will return node id for a given nvdimm device which
can be offline in some scenarios. We should use numa node id return by
numa_map_to_online_node function in that scenario. This function incase
given node is offline, it will lookup for next closest online node and
return that nodeid.

Can you try with below change and see, if you are still getting this
issue. Please let me know.

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
b/arch/powerpc/platforms/pseries/papr_scm.c
index bdf2620db461..4dd513d7c029 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -536,7 +536,7 @@ static void papr_scm_pmu_register(struct
papr_scm_priv *p)
 PERF_PMU_CAP_NO_EXCLUDE;

 /*updating the cpumask variable */
-   nodeid = dev_to_node(>pdev->dev);
+   nodeid = numa_map_to_online_node(dev_to_node(>pdev->dev));
 nd_pmu->arch_cpumask = *cpumask_of_node(nodeid);

Thanks,
Kajol Jain



With the above patch all the tests are passing on the system where 
target id and online id were different. Here is the the result:


(1/9) perf_nmem.py:perfNMEM.test_pmu_register_dmesg: PASS (3.47 s)
(2/9) perf_nmem.py:perfNMEM.test_sysfs: PASS (1.15 s)
(3/9) perf_nmem.py:perfNMEM.test_pmu_count: PASS (1.08 s)
(4/9) perf_nmem.py:perfNMEM.test_all_events: 

Re: [PATCH 2/3] powerpc: fix build errors

2022-02-25 Thread Nicholas Piggin
Excerpts from Arnd Bergmann's message of February 25, 2022 6:33 pm:
> On Fri, Feb 25, 2022 at 1:32 AM Nicholas Piggin  wrote:
>> Excerpts from Segher Boessenkool's message of February 25, 2022 3:12 am:
>> >> +#ifdef CONFIG_CC_IS_GCC
>> >> +#if (GCC_VERSION >= 10)
>> >> +#if (CONFIG_AS_VERSION == 23800)
>> >> +asm(".machine any");
>> >> +#endif
>> >> +#endif
>> >> +#endif
>> >> +#endif /* __ASSEMBLY__ */
>> >
>> > Abusing toplevel asm like this is broken and you *will* end up with
>> > unhappiness all around.
>>
>> It actually unbreaks things and reduces my unhappiness. It's only done
>> for broken compiler versions and only where as does not have the
>> workaround for the breakage.
> 
> It doesn't work with clang, which always passes explicit .machine
> statements around each inline asm, and it's also fundamentally
> incompatible with LTO builds. Generally speaking, you can't expect
> a top-level asm statement to have any effect inside of another
> function.

You have misunderstood my patch. It is not supposed to "work" with
clang and it explicitly is complied out of clang. It's not intended
to have any implementation independent meaning. It's working around
a very specific issue with specific versions of gcc, and that's what
it does.

It's also not intended to be the final solution, it's a workaround
hack. We will move away from -many of course. I will post it as a
series since which hopefully will make it less confusing to people.

Thanks,
Nick


[PATCH] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2022-02-25 Thread Mahesh Salgaonkar
nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
interrupt handler) if percpu allocation comes from vmalloc area.

Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
percpu allocation is from the embedded first chunk. However with
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
allocation can come from the vmalloc area.

With kernel command line "percpu_alloc=page" we can force percpu allocation
to come from vmalloc area and can see kernel crash in machine_check_early:

[1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
[1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
[1.215719] --- interrupt: 200
[1.215720] [c00fffd73180] [] 0x0 (unreliable)
[1.215722] [c00fffd731b0] [] 0x0
[1.215724] [c00fffd73210] [c0008364] 
machine_check_early_common+0x134/0x1f8

Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
first chunk is not embedded.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/interrupt.h |   15 +++
 arch/powerpc/include/asm/percpu.h|2 ++
 arch/powerpc/kernel/setup_64.c   |3 +++
 3 files changed, 20 insertions(+)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index fc28f46d2f9dc..8c5b3a7ca2ab6 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -327,6 +327,16 @@ static inline void interrupt_nmi_enter_prepare(struct 
pt_regs *regs, struct inte
}
 #endif
 
+   /*
+* Do not use nmi_enter() in real mode if percpu first chunk is
+* not embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled
+* there are chances where percpu allocation can come from vmalloc
+* area.
+*/
+   if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) &&
+   !(mfmsr() & MSR_DR) && !__percpu_embed_first_chunk)
+   return;
+
/*
 * Do not use nmi_enter() for pseries hash guest taking a real-mode
 * NMI because not everything it touches is within the RMA limit.
@@ -339,6 +349,10 @@ static inline void interrupt_nmi_enter_prepare(struct 
pt_regs *regs, struct inte
 
 static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct 
interrupt_nmi_state *state)
 {
+   if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) &&
+   !(mfmsr() & MSR_DR) && !__percpu_embed_first_chunk)
+   goto skip_nmi_exit;
+
if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64) ||
!firmware_has_feature(FW_FEATURE_LPAR) ||
radix_enabled() || (mfmsr() & MSR_DR))
@@ -349,6 +363,7 @@ static inline void interrupt_nmi_exit_prepare(struct 
pt_regs *regs, struct inter
 * new work to do (must use irq_work for that).
 */
 
+skip_nmi_exit:
 #ifdef CONFIG_PPC64
 #ifdef CONFIG_PPC_BOOK3S
if (arch_irq_disabled_regs(regs)) {
diff --git a/arch/powerpc/include/asm/percpu.h 
b/arch/powerpc/include/asm/percpu.h
index 8e5b7d0b851c6..45c08f5b0b4e9 100644
--- a/arch/powerpc/include/asm/percpu.h
+++ b/arch/powerpc/include/asm/percpu.h
@@ -12,6 +12,8 @@
 
 #define __my_cpu_offset local_paca->data_offset
 
+extern bool __percpu_embed_first_chunk;
+
 #endif /* CONFIG_SMP */
 #endif /* __powerpc64__ */
 
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index be8577ac93971..39dbf2fb23d61 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -786,6 +786,7 @@ static __init int pcpu_cpu_to_node(int cpu)
 
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
+bool __percpu_embed_first_chunk;
 
 void __init setup_per_cpu_areas(void)
 {
@@ -821,6 +822,8 @@ void __init setup_per_cpu_areas(void)
pr_warn("PERCPU: %s allocator failed (%d), "
"falling back to page size\n",
pcpu_fc_names[pcpu_chosen_fc], rc);
+   else
+   __percpu_embed_first_chunk = true;
}
 
if (rc < 0)




Re: cleanup swiotlb initialization

2022-02-25 Thread Christoph Hellwig
On Thu, Feb 24, 2022 at 12:07:26PM -0500, Boris Ostrovsky wrote:
>>> Just to be clear: this crashes only as dom0. Boots fine as baremetal.
>> Ah.  I can gues what this might be.  On Xen the hypervisor controls the
>> IOMMU and we should never end up initializing it in Linux, right?
>
>
> Right, we shouldn't be in that code path.

Can you try the patch below on top of the series?


diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index b849f11a756d0..f88c9a4a5ed12 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -47,22 +47,6 @@ static unsigned int x86_swiotlb_flags;
  */
 static void __init pci_swiotlb_detect_4gb(void)
 {
-#ifdef CONFIG_SWIOTLB_XEN
-   if (xen_pv_domain()) {
-   if (xen_initial_domain())
-   x86_swiotlb_enable = true;
-
-   if (x86_swiotlb_enable) {
-   dma_ops = _swiotlb_dma_ops;
-#ifdef CONFIG_PCI
-   /* Make sure ACS will be enabled */
-   pci_request_acs();
-#endif
-   }
-   return;
-   }
-#endif /* CONFIG_SWIOTLB_XEN */
-
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
x86_swiotlb_enable = true;
@@ -82,16 +66,36 @@ static inline void __init pci_swiotlb_detect_4gb(void)
 }
 #endif /* CONFIG_SWIOTLB */
 
+#ifdef CONFIG_SWIOTLB_XEN
+static void __init xen_swiotlb_init(void)
+{
+   if (!xen_initial_domain() && !x86_swiotlb_enable)
+   return;
+   x86_swiotlb_enable = true;
+   dma_ops = _swiotlb_dma_ops;
+#ifdef CONFIG_PCI
+   /* Make sure ACS will be enabled */
+   pci_request_acs();
+#endif
+   swiotlb_init_remap(true, x86_swiotlb_flags, xen_swiotlb_fixup);
+}
+#else
+static inline void __init xen_swiotlb_init(void)
+{
+}
+#endif /* CONFIG_SWIOTLB_XEN */
+
 void __init pci_iommu_alloc(void)
 {
+   if (xen_pv_domain()) {
+   xen_swiotlb_init();
+   return;
+   }
pci_swiotlb_detect_4gb();
gart_iommu_hole_init();
amd_iommu_detect();
detect_intel_iommu();
-#ifdef CONFIG_SWIOTLB
-   swiotlb_init_remap(x86_swiotlb_enable, x86_swiotlb_flags,
-  xen_pv_domain() ? xen_swiotlb_fixup : NULL);
-#endif
+   swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
 }
 
 /*


Re: [PATCH v6 0/4] Add perf interface to expose nvdimm

2022-02-25 Thread kajoljain



On 2/25/22 13:17, Aneesh Kumar K V wrote:
> On Fri, 2022-02-25 at 12:08 +0530, kajoljain wrote:
>>
>>
>> On 2/25/22 11:25, Nageswara Sastry wrote:
>>>
>>>
>>> On 17/02/22 10:03 pm, Kajol Jain wrote:

> 

 Changelog
>>>
>>> Tested these patches with the automated tests at
>>> avocado-misc-tests/perf/perf_nmem.py
>>> URL:
>>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/perf/perf_nmem.py
>>>
>>>
>>> 1. On the system where target id and online id were different then
>>> not
>>> seeing value in 'cpumask' and those tests failed.
>>>
>>> Example:
>>> Log from dmesg
>>> ...
>>> papr_scm ibm,persistent-memory:ibm,pmemory@4413: Region
>>> registered
>>> with target node 1 and online node 0
>>> ...
>>
>> Hi Nageswara Sastry,
>>    Thanks for testing the patch set. Yes you right, incase target
>> node id and online node id is different, it can happen when target
>> node is not online and hence can cause this issue, thanks for
>> pointing
>> it.
>>
>> Function dev_to_node will return node id for a given nvdimm device
>> which
>> can be offline in some scenarios. We should use numa node id return
>> by
>> numa_map_to_online_node function in that scenario. This function
>> incase
>> given node is offline, it will lookup for next closest online node
>> and
>> return that nodeid.
>>
>> Can you try with below change and see, if you are still getting this
>> issue. Please let me know.
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index bdf2620db461..4dd513d7c029 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -536,7 +536,7 @@ static void papr_scm_pmu_register(struct
>> papr_scm_priv *p)
>>     PERF_PMU_CAP_NO_EXCLUDE;
>>
>>     /*updating the cpumask variable */
>> -   nodeid = dev_to_node(>pdev->dev);
>> +   nodeid = numa_map_to_online_node(dev_to_node(>pdev->dev));
>>     nd_pmu->arch_cpumask = *cpumask_of_node(nodeid);
>>
>>>
> 
> Can you use p->region->numa_node? 

Hi Aneesh,
  Thanks for reviewing the changes. Actually we can't use numa_node
field of region structure directly inside papr_scm.c as we will get
dereferencing issue

Result of build with this change:

arch/powerpc/platforms/pseries/papr_scm.c: In function
'papr_scm_pmu_register':
arch/powerpc/platforms/pseries/papr_scm.c:539:21: error: dereferencing
pointer to incomplete type 'struct nd_region'
  nodeid = >region->numa_node;
 ^~
make[3]: *** [scripts/Makefile.build:288:
arch/powerpc/platforms/pseries/papr_scm.o] Error 1
make[2]: *** [scripts/Makefile.build:550:
arch/powerpc/platforms/pseries] Erro

This is because, this structure is defined inside drivers/nvdimm/nd.h
code which is not included in this file.

So, thats why I am using
`numa_map_to_online_node(dev_to_node(>pdev->dev));` directly.

Let me know if you are ok with initial change.

Thanks,
Kajol Jain

> 
> -aneesh
> 
> 


Re: [PATCH 2/3] powerpc: fix build errors

2022-02-25 Thread Arnd Bergmann
On Fri, Feb 25, 2022 at 1:32 AM Nicholas Piggin  wrote:
> Excerpts from Segher Boessenkool's message of February 25, 2022 3:12 am:
> >> +#ifdef CONFIG_CC_IS_GCC
> >> +#if (GCC_VERSION >= 10)
> >> +#if (CONFIG_AS_VERSION == 23800)
> >> +asm(".machine any");
> >> +#endif
> >> +#endif
> >> +#endif
> >> +#endif /* __ASSEMBLY__ */
> >
> > Abusing toplevel asm like this is broken and you *will* end up with
> > unhappiness all around.
>
> It actually unbreaks things and reduces my unhappiness. It's only done
> for broken compiler versions and only where as does not have the
> workaround for the breakage.

It doesn't work with clang, which always passes explicit .machine
statements around each inline asm, and it's also fundamentally
incompatible with LTO builds. Generally speaking, you can't expect
a top-level asm statement to have any effect inside of another
function.

Arnd