Re: [PATCH 2/3] powerpc: fix build errors
Excerpts from Segher Boessenkool's message of February 26, 2022 8:28 am: > On Fri, Feb 25, 2022 at 10:23:07AM +1000, Nicholas Piggin wrote: >> Excerpts from Segher Boessenkool's message of February 25, 2022 3:29 am: >> > On Thu, Feb 24, 2022 at 09:13:25PM +1000, Nicholas Piggin wrote: >> >> Excerpts from Arnd Bergmann's message of February 24, 2022 8:20 pm: >> >> > Again, there should be a minimum number of those .machine directives >> >> > in inline asm as well, which tends to work out fine as long as the >> >> > entire kernel is built with the correct -march= option for the minimum >> >> > supported CPU, and stays away from inline asm that requires a higher >> >> > CPU level. >> >> >> >> There's really no advantage to them, and they're ugly and annoying >> >> and if we applied the concept consistently for all asm they would grow >> >> to a very large number. >> > >> > The advantage is that you get machine code that *works*. There are >> > quite a few mnemonics that translate to different instructions with >> > different machine options! We like to get the intended instructions >> > instead of something that depends on what assembler options the user >> > has passed behind our backs. >> > >> >> The idea they'll give you good static checking just doesn't really >> >> pan out. >> > >> > That never was a goal of this at all. >> > >> > -many was very problematical for GCC itself. We no longer use it. >> >> You have the wrong context. We're not talking about -many vs .machine >> here. > > Okay, so you have no idea what you are talking about? Wow. Wrong context. It's not about -many. We're past that everyone agrees it's wrong. > The reason GCC uses .machine *itself* is because assembler -mmachine > options *cannot work*, for many reasons. We hit problems often enough > that years ago we started moving away from it already. The biggest > problems are that on one hand there are mnemonics that encode to > different instructions depending on target arch or cpu selected (like > mftb, lxvx, wait, etc.), and on the other hand GCC needs to switch that > target halfway through compilation (attribute((target(.... > > Often these problems were hidden most of the time by us passing -many. > But not all of the time, and over time, problems became more frequent > and nasty. > > Passing assembler -m options is nasty when you have to mix it with > .machine statements (and we need the latter no matter what), and it No it's not nasty, read the gas manual. -m specifies the machine and so does .machine. It's simple. > becomes completely unpredictable if the user passes other -m options > manually. > Inline assembler is inserted textually in the generated assembler code. > This is a big part of the strength of inline assembler. It does mean > that if you need a different target selected for your assembler code > then you need to arrange for that in your assembler code. > > So yes, this very much is about -many, other -m options, and .machine . > I discourage the kernel (as well as any other project) from using -m > options, especially -many, but that is your own choice of course. I > get sick and tired from you calling a deliberate design decision we > arrived at after years of work and weighing alternatives a "bug" though. Alan posted a good summary here https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102485#c10 Thanks, Nick
Re: [PATCH 2/3] powerpc: fix build errors
On Fri, Feb 25, 2022 at 10:32:02AM +1000, Nicholas Piggin wrote: > Excerpts from Segher Boessenkool's message of February 25, 2022 3:12 am: > > On Thu, Feb 24, 2022 at 03:05:28PM +1000, Nicholas Piggin wrote: > >> + * gcc 10 started to emit a .machine directive at the beginning of > >> generated > >> + * .s files, which overrides assembler -Wa,-m options passed down. > >> + * Unclear if this behaviour will be reverted. > > > > It will not be reverted. If you need a certain .machine for some asm > > code, you should write just that! > > It should be reverted because it breaks old binutils which did not have > the workaround patch for this broken gcc behaviour. And it is just > unnecessary because -m option can already be used to do the same thing. > > Not that I expect gcc to revert it. Nothing will happen if you do not file a bug report. And do read the bug reporting instructions first please. > >> +#ifdef CONFIG_CC_IS_GCC > >> +#if (GCC_VERSION >= 10) > >> +#if (CONFIG_AS_VERSION == 23800) > >> +asm(".machine any"); > >> +#endif > >> +#endif > >> +#endif > >> +#endif /* __ASSEMBLY__ */ > > > > Abusing toplevel asm like this is broken and you *will* end up with > > unhappiness all around. > > It actually unbreaks things and reduces my unhappiness. It is broken. You will need -fno-toplevel-reorder, and you really do not want that, if you *can* use it in the kernel even. > It's only done > for broken compiler versions and only where as does not have the > workaround for the breakage. What compiler versions? Please file a PR. Segher
Re: [PATCH 2/3] powerpc: fix build errors
On Fri, Feb 25, 2022 at 10:23:07AM +1000, Nicholas Piggin wrote: > Excerpts from Segher Boessenkool's message of February 25, 2022 3:29 am: > > On Thu, Feb 24, 2022 at 09:13:25PM +1000, Nicholas Piggin wrote: > >> Excerpts from Arnd Bergmann's message of February 24, 2022 8:20 pm: > >> > Again, there should be a minimum number of those .machine directives > >> > in inline asm as well, which tends to work out fine as long as the > >> > entire kernel is built with the correct -march= option for the minimum > >> > supported CPU, and stays away from inline asm that requires a higher > >> > CPU level. > >> > >> There's really no advantage to them, and they're ugly and annoying > >> and if we applied the concept consistently for all asm they would grow > >> to a very large number. > > > > The advantage is that you get machine code that *works*. There are > > quite a few mnemonics that translate to different instructions with > > different machine options! We like to get the intended instructions > > instead of something that depends on what assembler options the user > > has passed behind our backs. > > > >> The idea they'll give you good static checking just doesn't really > >> pan out. > > > > That never was a goal of this at all. > > > > -many was very problematical for GCC itself. We no longer use it. > > You have the wrong context. We're not talking about -many vs .machine > here. Okay, so you have no idea what you are talking about? Wow. The reason GCC uses .machine *itself* is because assembler -mmachine options *cannot work*, for many reasons. We hit problems often enough that years ago we started moving away from it already. The biggest problems are that on one hand there are mnemonics that encode to different instructions depending on target arch or cpu selected (like mftb, lxvx, wait, etc.), and on the other hand GCC needs to switch that target halfway through compilation (attribute((target(.... Often these problems were hidden most of the time by us passing -many. But not all of the time, and over time, problems became more frequent and nasty. Passing assembler -m options is nasty when you have to mix it with .machine statements (and we need the latter no matter what), and it becomes completely unpredictable if the user passes other -m options manually. Inline assembler is inserted textually in the generated assembler code. This is a big part of the strength of inline assembler. It does mean that if you need a different target selected for your assembler code then you need to arrange for that in your assembler code. So yes, this very much is about -many, other -m options, and .machine . I discourage the kernel (as well as any other project) from using -m options, especially -many, but that is your own choice of course. I get sick and tired from you calling a deliberate design decision we arrived at after years of work and weighing alternatives a "bug" though. Segher
[PATCH v1] powerpc/interrupt: Remove struct interrupt_state
Since commit ceff77efa4f8 ("powerpc/64e/interrupt: Use new interrupt context tracking scheme") struct interrupt_state has been empty and unused. Remove it. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/interrupt.h | 32 +++- 1 file changed, 12 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h index fc28f46d2f9d..984ffceb8f91 100644 --- a/arch/powerpc/include/asm/interrupt.h +++ b/arch/powerpc/include/asm/interrupt.h @@ -123,9 +123,6 @@ static inline void nap_adjust_return(struct pt_regs *regs) #endif } -struct interrupt_state { -}; - static inline void booke_restore_dbcr0(void) { #ifdef CONFIG_PPC_ADV_DEBUG_REGS @@ -138,7 +135,7 @@ static inline void booke_restore_dbcr0(void) #endif } -static inline void interrupt_enter_prepare(struct pt_regs *regs, struct interrupt_state *state) +static inline void interrupt_enter_prepare(struct pt_regs *regs) { #ifdef CONFIG_PPC32 if (!arch_irq_disabled_regs(regs)) @@ -228,17 +225,17 @@ static inline void interrupt_enter_prepare(struct pt_regs *regs, struct interrup * However interrupt_nmi_exit_prepare does return directly to regs, because * NMIs do not do "exit work" or replay soft-masked interrupts. */ -static inline void interrupt_exit_prepare(struct pt_regs *regs, struct interrupt_state *state) +static inline void interrupt_exit_prepare(struct pt_regs *regs) { } -static inline void interrupt_async_enter_prepare(struct pt_regs *regs, struct interrupt_state *state) +static inline void interrupt_async_enter_prepare(struct pt_regs *regs) { #ifdef CONFIG_PPC64 /* Ensure interrupt_enter_prepare does not enable MSR[EE] */ local_paca->irq_happened |= PACA_IRQ_HARD_DIS; #endif - interrupt_enter_prepare(regs, state); + interrupt_enter_prepare(regs); #ifdef CONFIG_PPC_BOOK3S_64 /* * RI=1 is set by interrupt_enter_prepare, so this thread flags access @@ -251,7 +248,7 @@ static inline void interrupt_async_enter_prepare(struct pt_regs *regs, struct in irq_enter(); } -static inline void interrupt_async_exit_prepare(struct pt_regs *regs, struct interrupt_state *state) +static inline void interrupt_async_exit_prepare(struct pt_regs *regs) { /* * Adjust at exit so the main handler sees the true NIA. This must @@ -262,7 +259,7 @@ static inline void interrupt_async_exit_prepare(struct pt_regs *regs, struct int nap_adjust_return(regs); irq_exit(); - interrupt_exit_prepare(regs, state); + interrupt_exit_prepare(regs); } struct interrupt_nmi_state { @@ -447,13 +444,11 @@ static __always_inline void ##func(struct pt_regs *regs); \ \ interrupt_handler void func(struct pt_regs *regs) \ { \ - struct interrupt_state state; \ - \ - interrupt_enter_prepare(regs, ); \ + interrupt_enter_prepare(regs); \ \ ##func (regs); \ \ - interrupt_exit_prepare(regs, ); \ + interrupt_exit_prepare(regs); \ } \ NOKPROBE_SYMBOL(func); \ \ @@ -482,14 +477,13 @@ static __always_inline long ##func(struct pt_regs *regs); \ \ interrupt_handler long func(struct pt_regs *regs) \ { \ - struct interrupt_state state; \ long ret; \ \ - interrupt_enter_prepare(regs, ); \ + interrupt_enter_prepare(regs); \ \ ret = ##func (regs);\ \ - interrupt_exit_prepare(regs, ); \ + interrupt_exit_prepare(regs); \ \
Re: [PATCH 00/16] Remove usage of the deprecated "pci-dma-compat.h" API
I've applied patches 1,5,6,8 and 9 to the dma-mapping tree.
Re: [PATCH V6 16/20] riscv: compat: vdso: Add rv32 VDSO base code implementation
On Fri, Feb 25, 2022 at 11:50 PM Arnd Bergmann wrote: > > On Fri, Feb 25, 2022 at 4:42 PM Guo Ren wrote: > > > > Hi Arnd & Palmer, > > > > Here is the new modified compat_vdso/Makefile, please have a look, > > first. Then I would update it to v7: > > === > > # SPDX-License-Identifier: GPL-2.0-only > > # > > # Makefile for compat_vdso > > # > > > > # Symbols present in the compat_vdso > > compat_vdso-syms = rt_sigreturn > > compat_vdso-syms += getcpu > > compat_vdso-syms += flush_icache > > > > ifdef CROSS_COMPILE_COMPAT > > COMPAT_CC := $(CROSS_COMPILE_COMPAT)gcc > > COMPAT_LD := $(CROSS_COMPILE_COMPAT)ld > > else > > COMPAT_CC := $(CC) > > COMPAT_LD := $(LD) > > endif > > > > COMPAT_CC_FLAGS := -march=rv32g -mabi=ilp32 > > COMPAT_LD_FLAGS := -melf32lriscv > > Have you come across a case in which a separate cross toolchain > is required? If not, I would leave this out and just set the flags for the > normal toolchain. Okay > > I also think it would be a nicer split to build the two vdso variants > as vdso64/vdso32 rather than vdso/compat_vdso. That way, > the build procedure can be kept as close as possible to the > native 32-bit build. Yes, current native 32-bit vdso & 64-bit vdso use the same vdso/Makfile. So, I think it could be another patch for this cleanup. > > Arnd -- Best Regards Guo Ren ML: https://lore.kernel.org/linux-csky/
Re: [PATCH V6 16/20] riscv: compat: vdso: Add rv32 VDSO base code implementation
On Fri, Feb 25, 2022 at 4:42 PM Guo Ren wrote: > > Hi Arnd & Palmer, > > Here is the new modified compat_vdso/Makefile, please have a look, > first. Then I would update it to v7: > === > # SPDX-License-Identifier: GPL-2.0-only > # > # Makefile for compat_vdso > # > > # Symbols present in the compat_vdso > compat_vdso-syms = rt_sigreturn > compat_vdso-syms += getcpu > compat_vdso-syms += flush_icache > > ifdef CROSS_COMPILE_COMPAT > COMPAT_CC := $(CROSS_COMPILE_COMPAT)gcc > COMPAT_LD := $(CROSS_COMPILE_COMPAT)ld > else > COMPAT_CC := $(CC) > COMPAT_LD := $(LD) > endif > > COMPAT_CC_FLAGS := -march=rv32g -mabi=ilp32 > COMPAT_LD_FLAGS := -melf32lriscv Have you come across a case in which a separate cross toolchain is required? If not, I would leave this out and just set the flags for the normal toolchain. I also think it would be a nicer split to build the two vdso variants as vdso64/vdso32 rather than vdso/compat_vdso. That way, the build procedure can be kept as close as possible to the native 32-bit build. Arnd
Re: [PATCH V6 16/20] riscv: compat: vdso: Add rv32 VDSO base code implementation
Hi Arnd & Palmer, Here is the new modified compat_vdso/Makefile, please have a look, first. Then I would update it to v7: === # SPDX-License-Identifier: GPL-2.0-only # # Makefile for compat_vdso # # Symbols present in the compat_vdso compat_vdso-syms = rt_sigreturn compat_vdso-syms += getcpu compat_vdso-syms += flush_icache ifdef CROSS_COMPILE_COMPAT COMPAT_CC := $(CROSS_COMPILE_COMPAT)gcc COMPAT_LD := $(CROSS_COMPILE_COMPAT)ld else COMPAT_CC := $(CC) COMPAT_LD := $(LD) endif COMPAT_CC_FLAGS := -march=rv32g -mabi=ilp32 COMPAT_LD_FLAGS := -melf32lriscv # Files to link into the compat_vdso obj-compat_vdso = $(patsubst %, %.o, $(compat_vdso-syms)) note.o # Build rules targets := $(obj-compat_vdso) compat_vdso.so compat_vdso.so.dbg compat_vdso.lds obj-compat_vdso := $(addprefix $(obj)/, $(obj-compat_vdso)) obj-y += compat_vdso.o CPPFLAGS_compat_vdso.lds += -P -C -U$(ARCH) # Disable profiling and instrumentation for VDSO code GCOV_PROFILE := n KCOV_INSTRUMENT := n KASAN_SANITIZE := n UBSAN_SANITIZE := n # Force dependency $(obj)/compat_vdso.o: $(obj)/compat_vdso.so # link rule for the .so file, .lds has to be first $(obj)/compat_vdso.so.dbg: $(obj)/compat_vdso.lds $(obj-compat_vdso) FORCE $(call if_changed,compat_vdsold) LDFLAGS_compat_vdso.so.dbg = -shared -S -soname=linux-compat_vdso.so.1 \ --build-id=sha1 --hash-style=both --eh-frame-hdr $(obj-compat_vdso): %.o: %.S FORCE $(call if_changed_dep,compat_vdsoas) # strip rule for the .so file $(obj)/%.so: OBJCOPYFLAGS := -S $(obj)/%.so: $(obj)/%.so.dbg FORCE $(call if_changed,objcopy) # Generate VDSO offsets using helper script gen-compat_vdsosym := $(srctree)/$(src)/gen_compat_vdso_offsets.sh quiet_cmd_compat_vdsosym = VDSOSYM $@ cmd_compat_vdsosym = $(NM) $< | $(gen-compat_vdsosym) | LC_ALL=C sort > $@ include/generated/compat_vdso-offsets.h: $(obj)/compat_vdso.so.dbg FORCE $(call if_changed,compat_vdsosym) # actual build commands # The DSO images are built using a special linker script # Make sure only to export the intended __compat_vdso_xxx symbol offsets. quiet_cmd_compat_vdsold = VDSOLD $@ cmd_compat_vdsold = $(COMPAT_LD) $(ld_flags) $(COMPAT_LD_FLAGS) -T $(filter-out FORCE,$^) -o $@.tmp && \ $(OBJCOPY) $(patsubst %, -G __compat_vdso_%, $(compat_vdso-syms)) $@.tmp $@ && \ rm $@.tmp # actual build commands quiet_cmd_compat_vdsoas = VDSOAS $@ cmd_compat_vdsoas = $(COMPAT_CC) $(a_flags) $(COMPAT_CC_FLAGS) -c -o $@ $< # install commands for the unstripped file quiet_cmd_compat_vdso_install = INSTALL $@ cmd_compat_vdso_install = cp $(obj)/$@.dbg $(MODLIB)/compat_vdso/$@ compat_vdso.so: $(obj)/compat_vdso.so.dbg @mkdir -p $(MODLIB)/compat_vdso $(call cmd,compat_vdso_install) compat_vdso_install: compat_vdso.so === Here is the make V=1 output: make -f /home/guoren/source/kernel/riscv-linux/scripts/Makefile.build obj=arch/riscv/kernel/vdso include/generated/vdso-offsets.h make -f /home/guoren/source/kernel/riscv-linux/scripts/Makefile.build obj=arch/riscv/kernel/compat_vdso include/generated/compat_vdso-offsets.h riscv64-unknown-linux-gnu-gcc -E -Wp,-MMD,arch/riscv/kernel/compat_vdso/.compat_vdso.lds.d -nostdinc -I/home/guoren/source/kernel/riscv-linux/arch/riscv/include -I./arch/riscv/include/generated -I/home/guoren/source/kernel/riscv-linux/include -I./include -I/home/guoren/source/kernel/riscv-linux/arch/riscv/include/uapi -I./arch/riscv/include/generated/uapi -I/home/guoren/source/kernel/riscv-linux/include/uapi -I./includ e/generated/uapi -include /home/guoren/source/kernel/riscv-linux/include/linux/compiler-version.h -include /home/guoren/source/kernel/riscv-linux/include/linux/kconfig.h -D__KERNEL__ -fmacro-prefix-map=/home/guoren/source/kernel/riscv-linux/= -P -C -Uriscv -I /home/guoren/source/kernel/riscv-linux/arch/riscv/kernel/compat_vdso -I ./arch/riscv/kernel/compat_vdso -P -Uriscv -D__ASSEMBLY__ -DLINKER_SCRIPT -o arch/riscv/ke rnel/compat_vdso/compat_vdso.lds /home/guoren/source/kernel/riscv-linux/arch/riscv/kernel/compat_vdso/compat_vdso.lds.S riscv64-unknown-linux-gnu-gcc -Wp,-MMD,arch/riscv/kernel/compat_vdso/.rt_sigreturn.o.d -nostdinc -I/home/guoren/source/kernel/riscv-linux/arch/riscv/include -I./arch/riscv/include/generated -I/home/guoren/source/kernel/riscv-linux/include -I./include -I/home/guoren/source/kernel/riscv-linux/arch/riscv/include/uapi -I./arch/riscv/include/generated/uapi -I/home/guoren/source/kernel/riscv-linux/include/uapi -I./include/ge nerated/uapi -include /home/guoren/source/kernel/riscv-linux/include/linux/compiler-version.h -include /home/guoren/source/kernel/riscv-linux/include/linux/kconfig.h -D__KERNEL__ -fmacro-prefix-map=/home/guoren/source/kernel/riscv-linux/= -D__ASSEMBLY__ -fno-PIE -mabi=lp64 -march=rv64imafdc -I
Re: cleanup swiotlb initialization
On 2/25/22 3:47 AM, Christoph Hellwig wrote: On Thu, Feb 24, 2022 at 12:07:26PM -0500, Boris Ostrovsky wrote: Just to be clear: this crashes only as dom0. Boots fine as baremetal. Ah. I can gues what this might be. On Xen the hypervisor controls the IOMMU and we should never end up initializing it in Linux, right? Right, we shouldn't be in that code path. Can you try the patch below on top of the series? Yes, this makes dom0 boot fine. (It also addresses something I wanted to mention while looking at the patches, which was to remove Xen initialization code from pci_swiotlb_detect_4gb() since it had noting to do with what the routine's name implies.) -boris
[PATCH v7 4/4] docs: ABI: sysfs-bus-nvdimm: Document sysfs event format entries for nvdimm pmu
Details are added for the event, cpumask and format attributes in the ABI documentation. Acked-by: Peter Zijlstra (Intel) Tested-by: Nageswara R Sastry Signed-off-by: Kajol Jain --- Changelog: v6 -> v7 - Add Acked-by and Tested-by tag from Peter Zijlstra and Nageswara R Sastry. Documentation/ABI/testing/sysfs-bus-nvdimm | 35 ++ 1 file changed, 35 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-bus-nvdimm b/Documentation/ABI/testing/sysfs-bus-nvdimm index bff84a16812a..1c1f5acbf53d 100644 --- a/Documentation/ABI/testing/sysfs-bus-nvdimm +++ b/Documentation/ABI/testing/sysfs-bus-nvdimm @@ -6,3 +6,38 @@ Description: The libnvdimm sub-system implements a common sysfs interface for platform nvdimm resources. See Documentation/driver-api/nvdimm/. + +What: /sys/bus/event_source/devices/nmemX/format +Date: February 2022 +KernelVersion: 5.18 +Contact:Kajol Jain +Description: (RO) Attribute group to describe the magic bits + that go into perf_event_attr.config for a particular pmu. + (See ABI/testing/sysfs-bus-event_source-devices-format). + + Each attribute under this group defines a bit range of the + perf_event_attr.config. Supported attribute is listed + below:: + event = "config:0-4" - event ID + + For example:: + ctl_res_cnt = "event=0x1" + +What: /sys/bus/event_source/devices/nmemX/events +Date: February 2022 +KernelVersion: 5.18 +Contact:Kajol Jain +Description: (RO) Attribute group to describe performance monitoring events +for the nvdimm memory device. Each attribute in this group +describes a single performance monitoring event supported by +this nvdimm pmu. The name of the file is the name of the event. +(See ABI/testing/sysfs-bus-event_source-devices-events). A +listing of the events supported by a given nvdimm provider type +can be found in Documentation/driver-api/nvdimm/$provider. + +What: /sys/bus/event_source/devices/nmemX/cpumask +Date: February 2022 +KernelVersion: 5.18 +Contact:Kajol Jain +Description: (RO) This sysfs file exposes the cpumask which is designated to + to retrieve nvdimm pmu event counter data. -- 2.31.1
[PATCH v7 3/4] powerpc/papr_scm: Add perf interface support
Performance monitoring support for papr-scm nvdimm devices via perf interface is added which includes addition of pmu functions like add/del/read/event_init for nvdimm_pmu struture. A new parameter 'priv' in added to the pdev_archdata structure to save nvdimm_pmu device pointer, to handle the unregistering of pmu device. papr_scm_pmu_register function populates the nvdimm_pmu structure with name, capabilities, cpumask along with event handling functions. Finally the populated nvdimm_pmu structure is passed to register the pmu device. Event handling functions internally uses hcall to get events and counter data. Result in power9 machine with 2 nvdimm device: Ex: List all event by perf list command:# perf list nmem nmem0/cache_rh_cnt/[Kernel PMU event] nmem0/cache_wh_cnt/[Kernel PMU event] nmem0/cri_res_util/[Kernel PMU event] nmem0/ctl_res_cnt/ [Kernel PMU event] nmem0/ctl_res_tm/ [Kernel PMU event] nmem0/fast_w_cnt/ [Kernel PMU event] nmem0/host_l_cnt/ [Kernel PMU event] nmem0/host_l_dur/ [Kernel PMU event] nmem0/host_s_cnt/ [Kernel PMU event] nmem0/host_s_dur/ [Kernel PMU event] nmem0/med_r_cnt/ [Kernel PMU event] nmem0/med_r_dur/ [Kernel PMU event] nmem0/med_w_cnt/ [Kernel PMU event] nmem0/med_w_dur/ [Kernel PMU event] nmem0/mem_life/[Kernel PMU event] nmem0/poweron_secs/[Kernel PMU event] ... nmem1/mem_life/[Kernel PMU event] nmem1/poweron_secs/[Kernel PMU event] Acked-by: Peter Zijlstra (Intel) Tested-by: Nageswara R Sastry Signed-off-by: Kajol Jain [Add numa_map_to_online_node function call to get online node id] Reported-by: Nageswara R Sastry --- Changelog: Resend v6 -> v7 - Add function call to numa_map_to_online_node function inorder to get online numa node. As the node id returned by function dev_to_node can be offline in some scenarios and create issue in hotplug code. - Add Acked-by, Tested-by and Reported-by tags from Peter Zijlstra and Nageswara R Sastry. arch/powerpc/include/asm/device.h | 5 + arch/powerpc/platforms/pseries/papr_scm.c | 225 ++ 2 files changed, 230 insertions(+) diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h index 219559d65864..47ed639f3b8f 100644 --- a/arch/powerpc/include/asm/device.h +++ b/arch/powerpc/include/asm/device.h @@ -48,6 +48,11 @@ struct dev_archdata { struct pdev_archdata { u64 dma_mask; + /* +* Pointer to nvdimm_pmu structure, to handle the unregistering +* of pmu device +*/ + void *priv; }; #endif /* _ASM_POWERPC_DEVICE_H */ diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index f48e87ac89c9..4dd513d7c029 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -19,6 +19,7 @@ #include #include #include +#include #define BIND_ANY_ADDR (~0ul) @@ -68,6 +69,8 @@ #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS) #define PAPR_SCM_PERF_STATS_VERSION 0x1 +#define to_nvdimm_pmu(_pmu)container_of(_pmu, struct nvdimm_pmu, pmu) + /* Struct holding a single performance metric */ struct papr_scm_perf_stat { u8 stat_id[8]; @@ -120,6 +123,9 @@ struct papr_scm_priv { /* length of the stat buffer as expected by phyp */ size_t stat_buffer_len; + +/* array to have event_code and stat_id mappings */ + char **nvdimm_events_map; }; static int papr_scm_pmem_flush(struct nd_region *nd_region, @@ -340,6 +346,218 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p, return 0; } +static int papr_scm_pmu_get_value(struct perf_event *event, struct device *dev, u64 *count) +{ + struct papr_scm_perf_stat *stat; + struct papr_scm_perf_stats *stats; + struct papr_scm_priv *p = (struct papr_scm_priv *)dev->driver_data; + int rc, size; + + /* Allocate request buffer enough to hold single performance stat */ + size = sizeof(struct papr_scm_perf_stats) + + sizeof(struct papr_scm_perf_stat); + + if (!p || !p->nvdimm_events_map) + return -EINVAL; + + stats = kzalloc(size, GFP_KERNEL); + if (!stats) + return -ENOMEM; + + stat = >scm_statistic[0]; + memcpy(>stat_id, + p->nvdimm_events_map[event->attr.config], +
[PATCH v7 0/4] Add perf interface to expose nvdimm
Patchset adds performance stats reporting support for nvdimm. Added interface includes support for pmu register/unregister functions. A structure is added called nvdimm_pmu to be used for adding arch/platform specific data such as cpumask, nvdimm device pointer and pmu event functions like event_init/add/read/del. User could use the standard perf tool to access perf events exposed via pmu. Interface also defines supported event list, config fields for the event attributes and their corresponding bit values which are exported via sysfs. Patch 3 exposes IBM pseries platform nmem* device performance stats using this interface. Result from power9 pseries lpar with 2 nvdimm device: Ex: List all event by perf list command:# perf list nmem nmem0/cache_rh_cnt/[Kernel PMU event] nmem0/cache_wh_cnt/[Kernel PMU event] nmem0/cri_res_util/[Kernel PMU event] nmem0/ctl_res_cnt/ [Kernel PMU event] nmem0/ctl_res_tm/ [Kernel PMU event] nmem0/fast_w_cnt/ [Kernel PMU event] nmem0/host_l_cnt/ [Kernel PMU event] nmem0/host_l_dur/ [Kernel PMU event] nmem0/host_s_cnt/ [Kernel PMU event] nmem0/host_s_dur/ [Kernel PMU event] nmem0/med_r_cnt/ [Kernel PMU event] nmem0/med_r_dur/ [Kernel PMU event] nmem0/med_w_cnt/ [Kernel PMU event] nmem0/med_w_dur/ [Kernel PMU event] nmem0/mem_life/[Kernel PMU event] nmem0/poweron_secs/[Kernel PMU event] ... nmem1/mem_life/[Kernel PMU event] nmem1/poweron_secs/[Kernel PMU event] Patch1: Introduces the nvdimm_pmu structure Patch2: Adds common interface to add arch/platform specific data includes nvdimm device pointer, pmu data along with pmu event functions. It also defines supported event list and adds attribute groups for format, events and cpumask. It also adds code for cpu hotplug support. Patch3: Add code in arch/powerpc/platform/pseries/papr_scm.c to expose nmem* pmu. It fills in the nvdimm_pmu structure with pmu name, capabilities, cpumask and event functions and then registers the pmu by adding callbacks to register_nvdimm_pmu. Patch4: Sysfs documentation patch Changelog --- v6 -> v7 - Add function call to numa_map_to_online_node function inorder to get online numa node. As the node id returned by function dev_to_node can be offline in some scenarios and can create issue in hotplug code as reported by Nageswara R Sastry. - Add function declaration of perf_pmu_register, perf_pmu_unregister and perf_pmu_migrate_context functions in nd.h file to resolve the implicit-function-declaration warning as reported by kernel test robot. Link: https://lore.kernel.org/all/202202241242.zqzgkguy-...@intel.com/ - Add Tested-by, Acked-by and Reported-by tags from Peter Zijlstra and Nageswara R Sastry. - Link to the patchset v6: https://lkml.org/lkml/2022/2/17/857 Resend v5 -> v6 - No logic change, just a rebase to latest upstream and tested the patchset. - Link to the patchset Resend v5: https://lkml.org/lkml/2021/11/15/3979 v5 -> Resend v5 - Resend the patchset - Link to the patchset v5: https://lkml.org/lkml/2021/9/28/643 v4 -> v5: - Remove multiple variables defined in nvdimm_pmu structure include name and pmu functions(event_int/add/del/read) as they are just used to copy them again in pmu variable. Now we are directly doing this step in arch specific code as suggested by Dan Williams. - Remove attribute group field from nvdimm pmu structure and defined these attribute groups in common interface which includes format, event list along with cpumask as suggested by Dan Williams. Since we added static defination for attrbute groups needed in common interface, removes corresponding code from papr. - Add nvdimm pmu event list with event codes in the common interface. - Remove Acked-by/Reviewed-by/Tested-by tags as code is refactored to handle review comments from Dan. - Make nvdimm_pmu_free_hotplug_memory function static as reported by kernel test robot, also add corresponding Reported-by tag. - Link to the patchset v4: https://lkml.org/lkml/2021/9/3/45 v3 -> v4 - Rebase code on top of current papr_scm code without any logical changes. - Added Acked-by tag from Peter Zijlstra and Reviewed by tag from Madhavan Srinivasan. - Link to the patchset v3: https://lkml.org/lkml/2021/6/17/605 v2 -> v3 - Added Tested-by tag. - Fix nvdimm
[PATCH v7 2/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats
A common interface is added to get performance stats reporting support for nvdimm devices. Added interface defines supported event list, config fields for the event attributes and their corresponding bit values which are exported via sysfs. Interface also added support for pmu register/unregister functions, cpu hotplug feature along with macros for handling events addition via sysfs. It adds attribute groups for format, cpumask and events to the pmu structure. User could use the standard perf tool to access perf events exposed via nvdimm pmu. Acked-by: Peter Zijlstra (Intel) Tested-by: Nageswara R Sastry Signed-off-by: Kajol Jain [Declare pmu functions in nd.h file to resolve implicit-function-declaration warning and make hotplug function static as reported by kernel test robot] Link: https://lore.kernel.org/all/202202241242.zqzgkguy-...@intel.com/ Reported-by: kernel test robot --- Changelog: v6 -> v7 - Add function declaration of perf_pmu_register, perf_pmu_unregister and perf_pmu_migrate_context functions in nd.h file to resolve the implicit-function-declaration warning as reported by kernel test robot. - Add Acked-by and Tested-by tag from Peter Zijlstra and Nageswara R Sastry. drivers/nvdimm/Makefile | 1 + drivers/nvdimm/nd_perf.c | 328 +++ include/linux/nd.h | 24 +++ 3 files changed, 353 insertions(+) create mode 100644 drivers/nvdimm/nd_perf.c diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile index 29203f3d3069..25dba6095612 100644 --- a/drivers/nvdimm/Makefile +++ b/drivers/nvdimm/Makefile @@ -18,6 +18,7 @@ nd_e820-y := e820.o libnvdimm-y := core.o libnvdimm-y += bus.o libnvdimm-y += dimm_devs.o +libnvdimm-y += nd_perf.o libnvdimm-y += dimm.o libnvdimm-y += region_devs.o libnvdimm-y += region.o diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c new file mode 100644 index ..314415894acf --- /dev/null +++ b/drivers/nvdimm/nd_perf.c @@ -0,0 +1,328 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * nd_perf.c: NVDIMM Device Performance Monitoring Unit support + * + * Perf interface to expose nvdimm performance stats. + * + * Copyright (C) 2021 IBM Corporation + */ + +#define pr_fmt(fmt) "nvdimm_pmu: " fmt + +#include + +#define EVENT(_name, _code) enum{_name = _code} + +/* + * NVDIMM Events codes. + */ + +/* Controller Reset Count */ +EVENT(CTL_RES_CNT, 0x1); +/* Controller Reset Elapsed Time */ +EVENT(CTL_RES_TM, 0x2); +/* Power-on Seconds */ +EVENT(POWERON_SECS,0x3); +/* Life Remaining */ +EVENT(MEM_LIFE,0x4); +/* Critical Resource Utilization */ +EVENT(CRI_RES_UTIL,0x5); +/* Host Load Count */ +EVENT(HOST_L_CNT, 0x6); +/* Host Store Count */ +EVENT(HOST_S_CNT, 0x7); +/* Host Store Duration */ +EVENT(HOST_S_DUR, 0x8); +/* Host Load Duration */ +EVENT(HOST_L_DUR, 0x9); +/* Media Read Count */ +EVENT(MED_R_CNT, 0xa); +/* Media Write Count */ +EVENT(MED_W_CNT, 0xb); +/* Media Read Duration */ +EVENT(MED_R_DUR, 0xc); +/* Media Write Duration */ +EVENT(MED_W_DUR, 0xd); +/* Cache Read Hit Count */ +EVENT(CACHE_RH_CNT,0xe); +/* Cache Write Hit Count */ +EVENT(CACHE_WH_CNT,0xf); +/* Fast Write Count */ +EVENT(FAST_W_CNT, 0x10); + +NVDIMM_EVENT_ATTR(ctl_res_cnt, CTL_RES_CNT); +NVDIMM_EVENT_ATTR(ctl_res_tm, CTL_RES_TM); +NVDIMM_EVENT_ATTR(poweron_secs,POWERON_SECS); +NVDIMM_EVENT_ATTR(mem_life,MEM_LIFE); +NVDIMM_EVENT_ATTR(cri_res_util,CRI_RES_UTIL); +NVDIMM_EVENT_ATTR(host_l_cnt, HOST_L_CNT); +NVDIMM_EVENT_ATTR(host_s_cnt, HOST_S_CNT); +NVDIMM_EVENT_ATTR(host_s_dur, HOST_S_DUR); +NVDIMM_EVENT_ATTR(host_l_dur, HOST_L_DUR); +NVDIMM_EVENT_ATTR(med_r_cnt, MED_R_CNT); +NVDIMM_EVENT_ATTR(med_w_cnt, MED_W_CNT); +NVDIMM_EVENT_ATTR(med_r_dur, MED_R_DUR); +NVDIMM_EVENT_ATTR(med_w_dur, MED_W_DUR); +NVDIMM_EVENT_ATTR(cache_rh_cnt,CACHE_RH_CNT); +NVDIMM_EVENT_ATTR(cache_wh_cnt,CACHE_WH_CNT); +NVDIMM_EVENT_ATTR(fast_w_cnt, FAST_W_CNT); + +static struct attribute *nvdimm_events_attr[] = { + NVDIMM_EVENT_PTR(CTL_RES_CNT), + NVDIMM_EVENT_PTR(CTL_RES_TM), + NVDIMM_EVENT_PTR(POWERON_SECS), + NVDIMM_EVENT_PTR(MEM_LIFE), + NVDIMM_EVENT_PTR(CRI_RES_UTIL), + NVDIMM_EVENT_PTR(HOST_L_CNT), + NVDIMM_EVENT_PTR(HOST_S_CNT), + NVDIMM_EVENT_PTR(HOST_S_DUR), + NVDIMM_EVENT_PTR(HOST_L_DUR), + NVDIMM_EVENT_PTR(MED_R_CNT), + NVDIMM_EVENT_PTR(MED_W_CNT), + NVDIMM_EVENT_PTR(MED_R_DUR), + NVDIMM_EVENT_PTR(MED_W_DUR), + NVDIMM_EVENT_PTR(CACHE_RH_CNT), + NVDIMM_EVENT_PTR(CACHE_WH_CNT), + NVDIMM_EVENT_PTR(FAST_W_CNT), + NULL +}; + +static struct
[PATCH v7 1/4] drivers/nvdimm: Add nvdimm pmu structure
A structure is added called nvdimm_pmu, for performance stats reporting support of nvdimm devices. It can be used to add device pmu data such as pmu data structure for performance stats, nvdimm device pointer along with cpumask attributes. Acked-by: Peter Zijlstra (Intel) Tested-by: Nageswara R Sastry Signed-off-by: Kajol Jain --- Changelog: v6 -> v7 - Add Acked-by and Tested-by tag from Peter Zijlstra and Nageswara R Sastry. include/linux/nd.h | 20 1 file changed, 20 insertions(+) diff --git a/include/linux/nd.h b/include/linux/nd.h index 8a8c63edb1b2..ad186e828263 100644 --- a/include/linux/nd.h +++ b/include/linux/nd.h @@ -8,6 +8,7 @@ #include #include #include +#include enum nvdimm_event { NVDIMM_REVALIDATE_POISON, @@ -23,6 +24,25 @@ enum nvdimm_claim_class { NVDIMM_CCLASS_UNKNOWN, }; +/** + * struct nvdimm_pmu - data structure for nvdimm perf driver + * @pmu: pmu data structure for nvdimm performance stats. + * @dev: nvdimm device pointer. + * @cpu: designated cpu for counter access. + * @node: node for cpu hotplug notifier link. + * @cpuhp_state: state for cpu hotplug notification. + * @arch_cpumask: cpumask to get designated cpu for counter access. + */ +struct nvdimm_pmu { + struct pmu pmu; + struct device *dev; + int cpu; + struct hlist_node node; + enum cpuhp_state cpuhp_state; + /* cpumask provided by arch/platform specific code */ + struct cpumask arch_cpumask; +}; + struct nd_device_driver { struct device_driver drv; unsigned long type; -- 2.31.1
Re: [PATCH] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.
Le 25/02/2022 à 11:24, Mahesh Salgaonkar a écrit : > nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel > crash when invoked during real mode interrupt handling (e.g. early HMI/MCE > interrupt handler) if percpu allocation comes from vmalloc area. > > Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI() > wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when > percpu allocation is from the embedded first chunk. However with > CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu > allocation can come from the vmalloc area. > > With kernel command line "percpu_alloc=page" we can force percpu allocation > to come from vmalloc area and can see kernel crash in machine_check_early: > > [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110 > [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0 > [1.215719] --- interrupt: 200 > [1.215720] [c00fffd73180] [] 0x0 (unreliable) > [1.215722] [c00fffd731b0] [] 0x0 > [1.215724] [c00fffd73210] [c0008364] > machine_check_early_common+0x134/0x1f8 > > Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu > first chunk is not embedded. > > Signed-off-by: Mahesh Salgaonkar > --- > arch/powerpc/include/asm/interrupt.h | 15 +++ > arch/powerpc/include/asm/percpu.h|2 ++ > arch/powerpc/kernel/setup_64.c |3 +++ > 3 files changed, 20 insertions(+) > > diff --git a/arch/powerpc/include/asm/interrupt.h > b/arch/powerpc/include/asm/interrupt.h > index fc28f46d2f9dc..8c5b3a7ca2ab6 100644 > --- a/arch/powerpc/include/asm/interrupt.h > +++ b/arch/powerpc/include/asm/interrupt.h > @@ -327,6 +327,16 @@ static inline void interrupt_nmi_enter_prepare(struct > pt_regs *regs, struct inte > } > #endif > > + /* > + * Do not use nmi_enter() in real mode if percpu first chunk is > + * not embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled > + * there are chances where percpu allocation can come from vmalloc > + * area. > + */ > + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && > + !(mfmsr() & MSR_DR) && !__percpu_embed_first_chunk) > + return; > + It would probably be better to refactor with the following test and avoid calling mfmsr() twice. > /* >* Do not use nmi_enter() for pseries hash guest taking a real-mode >* NMI because not everything it touches is within the RMA limit. > @@ -339,6 +349,10 @@ static inline void interrupt_nmi_enter_prepare(struct > pt_regs *regs, struct inte > > static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct > interrupt_nmi_state *state) > { > + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && > + !(mfmsr() & MSR_DR) && !__percpu_embed_first_chunk) > + goto skip_nmi_exit; > + > if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64) || > !firmware_has_feature(FW_FEATURE_LPAR) || > radix_enabled() || (mfmsr() & MSR_DR)) Same here, if you can avoid calling mfmsr() two times that would be better. > @@ -349,6 +363,7 @@ static inline void interrupt_nmi_exit_prepare(struct > pt_regs *regs, struct inter >* new work to do (must use irq_work for that). >*/ > > +skip_nmi_exit: > #ifdef CONFIG_PPC64 > #ifdef CONFIG_PPC_BOOK3S > if (arch_irq_disabled_regs(regs)) { > diff --git a/arch/powerpc/include/asm/percpu.h > b/arch/powerpc/include/asm/percpu.h > index 8e5b7d0b851c6..45c08f5b0b4e9 100644 > --- a/arch/powerpc/include/asm/percpu.h > +++ b/arch/powerpc/include/asm/percpu.h > @@ -12,6 +12,8 @@ > > #define __my_cpu_offset local_paca->data_offset > > +extern bool __percpu_embed_first_chunk; > + > #endif /* CONFIG_SMP */ > #endif /* __powerpc64__ */ > > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c > index be8577ac93971..39dbf2fb23d61 100644 > --- a/arch/powerpc/kernel/setup_64.c > +++ b/arch/powerpc/kernel/setup_64.c > @@ -786,6 +786,7 @@ static __init int pcpu_cpu_to_node(int cpu) > > unsigned long __per_cpu_offset[NR_CPUS] __read_mostly; > EXPORT_SYMBOL(__per_cpu_offset); > +bool __percpu_embed_first_chunk; > > void __init setup_per_cpu_areas(void) > { > @@ -821,6 +822,8 @@ void __init setup_per_cpu_areas(void) > pr_warn("PERCPU: %s allocator failed (%d), " > "falling back to page size\n", > pcpu_fc_names[pcpu_chosen_fc], rc); > + else > + __percpu_embed_first_chunk = true; If I understand correctly that happens once at system startup ? Then could you use jump_labels in order to avoid redoing the test at each interrupt entry ? > } > > if (rc < 0) > >
Re: [PATCH v6 0/4] Add perf interface to expose nvdimm
On 2/25/22 16:41, Nageswara Sastry wrote: > > > On 25/02/22 12:08 pm, kajoljain wrote: >> >> >> On 2/25/22 11:25, Nageswara Sastry wrote: >>> >>> >>> On 17/02/22 10:03 pm, Kajol Jain wrote: Patchset adds performance stats reporting support for nvdimm. Added interface includes support for pmu register/unregister functions. A structure is added called nvdimm_pmu to be used for adding arch/platform specific data such as cpumask, nvdimm device pointer and pmu event functions like event_init/add/read/del. User could use the standard perf tool to access perf events exposed via pmu. Interface also defines supported event list, config fields for the event attributes and their corresponding bit values which are exported via sysfs. Patch 3 exposes IBM pseries platform nmem* device performance stats using this interface. Result from power9 pseries lpar with 2 nvdimm device: Ex: List all event by perf list command:# perf list nmem nmem0/cache_rh_cnt/ [Kernel PMU event] nmem0/cache_wh_cnt/ [Kernel PMU event] nmem0/cri_res_util/ [Kernel PMU event] nmem0/ctl_res_cnt/ [Kernel PMU event] nmem0/ctl_res_tm/ [Kernel PMU event] nmem0/fast_w_cnt/ [Kernel PMU event] nmem0/host_l_cnt/ [Kernel PMU event] nmem0/host_l_dur/ [Kernel PMU event] nmem0/host_s_cnt/ [Kernel PMU event] nmem0/host_s_dur/ [Kernel PMU event] nmem0/med_r_cnt/ [Kernel PMU event] nmem0/med_r_dur/ [Kernel PMU event] nmem0/med_w_cnt/ [Kernel PMU event] nmem0/med_w_dur/ [Kernel PMU event] nmem0/mem_life/ [Kernel PMU event] nmem0/poweron_secs/ [Kernel PMU event] ... nmem1/mem_life/ [Kernel PMU event] nmem1/poweron_secs/ [Kernel PMU event] Patch1: Introduces the nvdimm_pmu structure Patch2: Adds common interface to add arch/platform specific data includes nvdimm device pointer, pmu data along with pmu event functions. It also defines supported event list and adds attribute groups for format, events and cpumask. It also adds code for cpu hotplug support. Patch3: Add code in arch/powerpc/platform/pseries/papr_scm.c to expose nmem* pmu. It fills in the nvdimm_pmu structure with pmu name, capabilities, cpumask and event functions and then registers the pmu by adding callbacks to register_nvdimm_pmu. Patch4: Sysfs documentation patch Changelog >>> >>> Tested these patches with the automated tests at >>> avocado-misc-tests/perf/perf_nmem.py >>> URL: >>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/perf/perf_nmem.py >>> >>> >>> >>> 1. On the system where target id and online id were different then not >>> seeing value in 'cpumask' and those tests failed. >>> >>> Example: >>> Log from dmesg >>> ... >>> papr_scm ibm,persistent-memory:ibm,pmemory@4413: Region registered >>> with target node 1 and online node 0 >>> ... >> >> Hi Nageswara Sastry, >> Thanks for testing the patch set. Yes you right, incase target >> node id and online node id is different, it can happen when target >> node is not online and hence can cause this issue, thanks for pointing >> it. >> >> Function dev_to_node will return node id for a given nvdimm device which >> can be offline in some scenarios. We should use numa node id return by >> numa_map_to_online_node function in that scenario. This function incase >> given node is offline, it will lookup for next closest online node and >> return that nodeid. >> >> Can you try with below change and see, if you are still getting this >> issue. Please let me know. >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index bdf2620db461..4dd513d7c029 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -536,7 +536,7 @@ static void papr_scm_pmu_register(struct >> papr_scm_priv *p) >> PERF_PMU_CAP_NO_EXCLUDE; >> >> /*updating
Re: [PATCH v6 0/4] Add perf interface to expose nvdimm
On 25/02/22 12:08 pm, kajoljain wrote: On 2/25/22 11:25, Nageswara Sastry wrote: On 17/02/22 10:03 pm, Kajol Jain wrote: Patchset adds performance stats reporting support for nvdimm. Added interface includes support for pmu register/unregister functions. A structure is added called nvdimm_pmu to be used for adding arch/platform specific data such as cpumask, nvdimm device pointer and pmu event functions like event_init/add/read/del. User could use the standard perf tool to access perf events exposed via pmu. Interface also defines supported event list, config fields for the event attributes and their corresponding bit values which are exported via sysfs. Patch 3 exposes IBM pseries platform nmem* device performance stats using this interface. Result from power9 pseries lpar with 2 nvdimm device: Ex: List all event by perf list command:# perf list nmem nmem0/cache_rh_cnt/ [Kernel PMU event] nmem0/cache_wh_cnt/ [Kernel PMU event] nmem0/cri_res_util/ [Kernel PMU event] nmem0/ctl_res_cnt/ [Kernel PMU event] nmem0/ctl_res_tm/ [Kernel PMU event] nmem0/fast_w_cnt/ [Kernel PMU event] nmem0/host_l_cnt/ [Kernel PMU event] nmem0/host_l_dur/ [Kernel PMU event] nmem0/host_s_cnt/ [Kernel PMU event] nmem0/host_s_dur/ [Kernel PMU event] nmem0/med_r_cnt/ [Kernel PMU event] nmem0/med_r_dur/ [Kernel PMU event] nmem0/med_w_cnt/ [Kernel PMU event] nmem0/med_w_dur/ [Kernel PMU event] nmem0/mem_life/ [Kernel PMU event] nmem0/poweron_secs/ [Kernel PMU event] ... nmem1/mem_life/ [Kernel PMU event] nmem1/poweron_secs/ [Kernel PMU event] Patch1: Introduces the nvdimm_pmu structure Patch2: Adds common interface to add arch/platform specific data includes nvdimm device pointer, pmu data along with pmu event functions. It also defines supported event list and adds attribute groups for format, events and cpumask. It also adds code for cpu hotplug support. Patch3: Add code in arch/powerpc/platform/pseries/papr_scm.c to expose nmem* pmu. It fills in the nvdimm_pmu structure with pmu name, capabilities, cpumask and event functions and then registers the pmu by adding callbacks to register_nvdimm_pmu. Patch4: Sysfs documentation patch Changelog Tested these patches with the automated tests at avocado-misc-tests/perf/perf_nmem.py URL: https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/perf/perf_nmem.py 1. On the system where target id and online id were different then not seeing value in 'cpumask' and those tests failed. Example: Log from dmesg ... papr_scm ibm,persistent-memory:ibm,pmemory@4413: Region registered with target node 1 and online node 0 ... Hi Nageswara Sastry, Thanks for testing the patch set. Yes you right, incase target node id and online node id is different, it can happen when target node is not online and hence can cause this issue, thanks for pointing it. Function dev_to_node will return node id for a given nvdimm device which can be offline in some scenarios. We should use numa node id return by numa_map_to_online_node function in that scenario. This function incase given node is offline, it will lookup for next closest online node and return that nodeid. Can you try with below change and see, if you are still getting this issue. Please let me know. diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index bdf2620db461..4dd513d7c029 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -536,7 +536,7 @@ static void papr_scm_pmu_register(struct papr_scm_priv *p) PERF_PMU_CAP_NO_EXCLUDE; /*updating the cpumask variable */ - nodeid = dev_to_node(>pdev->dev); + nodeid = numa_map_to_online_node(dev_to_node(>pdev->dev)); nd_pmu->arch_cpumask = *cpumask_of_node(nodeid); Thanks, Kajol Jain With the above patch all the tests are passing on the system where target id and online id were different. Here is the the result: (1/9) perf_nmem.py:perfNMEM.test_pmu_register_dmesg: PASS (3.47 s) (2/9) perf_nmem.py:perfNMEM.test_sysfs: PASS (1.15 s) (3/9) perf_nmem.py:perfNMEM.test_pmu_count: PASS (1.08 s) (4/9) perf_nmem.py:perfNMEM.test_all_events:
Re: [PATCH 2/3] powerpc: fix build errors
Excerpts from Arnd Bergmann's message of February 25, 2022 6:33 pm: > On Fri, Feb 25, 2022 at 1:32 AM Nicholas Piggin wrote: >> Excerpts from Segher Boessenkool's message of February 25, 2022 3:12 am: >> >> +#ifdef CONFIG_CC_IS_GCC >> >> +#if (GCC_VERSION >= 10) >> >> +#if (CONFIG_AS_VERSION == 23800) >> >> +asm(".machine any"); >> >> +#endif >> >> +#endif >> >> +#endif >> >> +#endif /* __ASSEMBLY__ */ >> > >> > Abusing toplevel asm like this is broken and you *will* end up with >> > unhappiness all around. >> >> It actually unbreaks things and reduces my unhappiness. It's only done >> for broken compiler versions and only where as does not have the >> workaround for the breakage. > > It doesn't work with clang, which always passes explicit .machine > statements around each inline asm, and it's also fundamentally > incompatible with LTO builds. Generally speaking, you can't expect > a top-level asm statement to have any effect inside of another > function. You have misunderstood my patch. It is not supposed to "work" with clang and it explicitly is complied out of clang. It's not intended to have any implementation independent meaning. It's working around a very specific issue with specific versions of gcc, and that's what it does. It's also not intended to be the final solution, it's a workaround hack. We will move away from -many of course. I will post it as a series since which hopefully will make it less confusing to people. Thanks, Nick
[PATCH] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.
nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel crash when invoked during real mode interrupt handling (e.g. early HMI/MCE interrupt handler) if percpu allocation comes from vmalloc area. Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI() wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when percpu allocation is from the embedded first chunk. However with CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu allocation can come from the vmalloc area. With kernel command line "percpu_alloc=page" we can force percpu allocation to come from vmalloc area and can see kernel crash in machine_check_early: [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110 [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0 [1.215719] --- interrupt: 200 [1.215720] [c00fffd73180] [] 0x0 (unreliable) [1.215722] [c00fffd731b0] [] 0x0 [1.215724] [c00fffd73210] [c0008364] machine_check_early_common+0x134/0x1f8 Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu first chunk is not embedded. Signed-off-by: Mahesh Salgaonkar --- arch/powerpc/include/asm/interrupt.h | 15 +++ arch/powerpc/include/asm/percpu.h|2 ++ arch/powerpc/kernel/setup_64.c |3 +++ 3 files changed, 20 insertions(+) diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h index fc28f46d2f9dc..8c5b3a7ca2ab6 100644 --- a/arch/powerpc/include/asm/interrupt.h +++ b/arch/powerpc/include/asm/interrupt.h @@ -327,6 +327,16 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct inte } #endif + /* +* Do not use nmi_enter() in real mode if percpu first chunk is +* not embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled +* there are chances where percpu allocation can come from vmalloc +* area. +*/ + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && + !(mfmsr() & MSR_DR) && !__percpu_embed_first_chunk) + return; + /* * Do not use nmi_enter() for pseries hash guest taking a real-mode * NMI because not everything it touches is within the RMA limit. @@ -339,6 +349,10 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct inte static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct interrupt_nmi_state *state) { + if (IS_ENABLED(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && + !(mfmsr() & MSR_DR) && !__percpu_embed_first_chunk) + goto skip_nmi_exit; + if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64) || !firmware_has_feature(FW_FEATURE_LPAR) || radix_enabled() || (mfmsr() & MSR_DR)) @@ -349,6 +363,7 @@ static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct inter * new work to do (must use irq_work for that). */ +skip_nmi_exit: #ifdef CONFIG_PPC64 #ifdef CONFIG_PPC_BOOK3S if (arch_irq_disabled_regs(regs)) { diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h index 8e5b7d0b851c6..45c08f5b0b4e9 100644 --- a/arch/powerpc/include/asm/percpu.h +++ b/arch/powerpc/include/asm/percpu.h @@ -12,6 +12,8 @@ #define __my_cpu_offset local_paca->data_offset +extern bool __percpu_embed_first_chunk; + #endif /* CONFIG_SMP */ #endif /* __powerpc64__ */ diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index be8577ac93971..39dbf2fb23d61 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -786,6 +786,7 @@ static __init int pcpu_cpu_to_node(int cpu) unsigned long __per_cpu_offset[NR_CPUS] __read_mostly; EXPORT_SYMBOL(__per_cpu_offset); +bool __percpu_embed_first_chunk; void __init setup_per_cpu_areas(void) { @@ -821,6 +822,8 @@ void __init setup_per_cpu_areas(void) pr_warn("PERCPU: %s allocator failed (%d), " "falling back to page size\n", pcpu_fc_names[pcpu_chosen_fc], rc); + else + __percpu_embed_first_chunk = true; } if (rc < 0)
Re: cleanup swiotlb initialization
On Thu, Feb 24, 2022 at 12:07:26PM -0500, Boris Ostrovsky wrote: >>> Just to be clear: this crashes only as dom0. Boots fine as baremetal. >> Ah. I can gues what this might be. On Xen the hypervisor controls the >> IOMMU and we should never end up initializing it in Linux, right? > > > Right, we shouldn't be in that code path. Can you try the patch below on top of the series? diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index b849f11a756d0..f88c9a4a5ed12 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -47,22 +47,6 @@ static unsigned int x86_swiotlb_flags; */ static void __init pci_swiotlb_detect_4gb(void) { -#ifdef CONFIG_SWIOTLB_XEN - if (xen_pv_domain()) { - if (xen_initial_domain()) - x86_swiotlb_enable = true; - - if (x86_swiotlb_enable) { - dma_ops = _swiotlb_dma_ops; -#ifdef CONFIG_PCI - /* Make sure ACS will be enabled */ - pci_request_acs(); -#endif - } - return; - } -#endif /* CONFIG_SWIOTLB_XEN */ - /* don't initialize swiotlb if iommu=off (no_iommu=1) */ if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN) x86_swiotlb_enable = true; @@ -82,16 +66,36 @@ static inline void __init pci_swiotlb_detect_4gb(void) } #endif /* CONFIG_SWIOTLB */ +#ifdef CONFIG_SWIOTLB_XEN +static void __init xen_swiotlb_init(void) +{ + if (!xen_initial_domain() && !x86_swiotlb_enable) + return; + x86_swiotlb_enable = true; + dma_ops = _swiotlb_dma_ops; +#ifdef CONFIG_PCI + /* Make sure ACS will be enabled */ + pci_request_acs(); +#endif + swiotlb_init_remap(true, x86_swiotlb_flags, xen_swiotlb_fixup); +} +#else +static inline void __init xen_swiotlb_init(void) +{ +} +#endif /* CONFIG_SWIOTLB_XEN */ + void __init pci_iommu_alloc(void) { + if (xen_pv_domain()) { + xen_swiotlb_init(); + return; + } pci_swiotlb_detect_4gb(); gart_iommu_hole_init(); amd_iommu_detect(); detect_intel_iommu(); -#ifdef CONFIG_SWIOTLB - swiotlb_init_remap(x86_swiotlb_enable, x86_swiotlb_flags, - xen_pv_domain() ? xen_swiotlb_fixup : NULL); -#endif + swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags); } /*
Re: [PATCH v6 0/4] Add perf interface to expose nvdimm
On 2/25/22 13:17, Aneesh Kumar K V wrote: > On Fri, 2022-02-25 at 12:08 +0530, kajoljain wrote: >> >> >> On 2/25/22 11:25, Nageswara Sastry wrote: >>> >>> >>> On 17/02/22 10:03 pm, Kajol Jain wrote: > Changelog >>> >>> Tested these patches with the automated tests at >>> avocado-misc-tests/perf/perf_nmem.py >>> URL: >>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/perf/perf_nmem.py >>> >>> >>> 1. On the system where target id and online id were different then >>> not >>> seeing value in 'cpumask' and those tests failed. >>> >>> Example: >>> Log from dmesg >>> ... >>> papr_scm ibm,persistent-memory:ibm,pmemory@4413: Region >>> registered >>> with target node 1 and online node 0 >>> ... >> >> Hi Nageswara Sastry, >> Thanks for testing the patch set. Yes you right, incase target >> node id and online node id is different, it can happen when target >> node is not online and hence can cause this issue, thanks for >> pointing >> it. >> >> Function dev_to_node will return node id for a given nvdimm device >> which >> can be offline in some scenarios. We should use numa node id return >> by >> numa_map_to_online_node function in that scenario. This function >> incase >> given node is offline, it will lookup for next closest online node >> and >> return that nodeid. >> >> Can you try with below change and see, if you are still getting this >> issue. Please let me know. >> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c >> b/arch/powerpc/platforms/pseries/papr_scm.c >> index bdf2620db461..4dd513d7c029 100644 >> --- a/arch/powerpc/platforms/pseries/papr_scm.c >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c >> @@ -536,7 +536,7 @@ static void papr_scm_pmu_register(struct >> papr_scm_priv *p) >> PERF_PMU_CAP_NO_EXCLUDE; >> >> /*updating the cpumask variable */ >> - nodeid = dev_to_node(>pdev->dev); >> + nodeid = numa_map_to_online_node(dev_to_node(>pdev->dev)); >> nd_pmu->arch_cpumask = *cpumask_of_node(nodeid); >> >>> > > Can you use p->region->numa_node? Hi Aneesh, Thanks for reviewing the changes. Actually we can't use numa_node field of region structure directly inside papr_scm.c as we will get dereferencing issue Result of build with this change: arch/powerpc/platforms/pseries/papr_scm.c: In function 'papr_scm_pmu_register': arch/powerpc/platforms/pseries/papr_scm.c:539:21: error: dereferencing pointer to incomplete type 'struct nd_region' nodeid = >region->numa_node; ^~ make[3]: *** [scripts/Makefile.build:288: arch/powerpc/platforms/pseries/papr_scm.o] Error 1 make[2]: *** [scripts/Makefile.build:550: arch/powerpc/platforms/pseries] Erro This is because, this structure is defined inside drivers/nvdimm/nd.h code which is not included in this file. So, thats why I am using `numa_map_to_online_node(dev_to_node(>pdev->dev));` directly. Let me know if you are ok with initial change. Thanks, Kajol Jain > > -aneesh > >
Re: [PATCH 2/3] powerpc: fix build errors
On Fri, Feb 25, 2022 at 1:32 AM Nicholas Piggin wrote: > Excerpts from Segher Boessenkool's message of February 25, 2022 3:12 am: > >> +#ifdef CONFIG_CC_IS_GCC > >> +#if (GCC_VERSION >= 10) > >> +#if (CONFIG_AS_VERSION == 23800) > >> +asm(".machine any"); > >> +#endif > >> +#endif > >> +#endif > >> +#endif /* __ASSEMBLY__ */ > > > > Abusing toplevel asm like this is broken and you *will* end up with > > unhappiness all around. > > It actually unbreaks things and reduces my unhappiness. It's only done > for broken compiler versions and only where as does not have the > workaround for the breakage. It doesn't work with clang, which always passes explicit .machine statements around each inline asm, and it's also fundamentally incompatible with LTO builds. Generally speaking, you can't expect a top-level asm statement to have any effect inside of another function. Arnd