Re: [PATCH RT 0/1] Linux v4.19.306-rt132-rc1
On 2024-02-02 18:04:56 [+0100], Daniel Wagner wrote:
> Dear RT Folks,

Hi,

> This is the RT stable review cycle of patch 4.19.306-rt132-rc1.
…
> Please scream at me if I messed something up. Please test the patches
> too.

Good.

> Enjoy!
> Daniel

Sebastian
Re: [PATCH RFC 0/4] Introduce uts_release
On 02/02/2024 15:01, Masahiro Yamada wrote:
> As you see, several drivers store UTS_RELEASE in their driver data,
> and even print it in debug print.
>
> I do not see why it is useful.

I would tend to agree, and mentioned that earlier. As you discussed in
3/4, if UTS_RELEASE is unneeded, it is better to get rid of it. Jakub
replied about this.

> If such version information is useful for drivers, the intention is
> whether the version of the module, or the version of vmlinux. That is
> a question. They differ when CONFIG_MODVERSION.

I think often this information in UTS_RELEASE is shared as informative
only, so the user can conveniently know the specific kernel git version.

> When module developers intend to printk the git version from which the
> module was compiled, presumably they want to use UTS_RELEASE, which was
> expanded at the compile time of the module. If you replace it with
> uts_release, it is the git version of vmlinux. Of course, the
> replacement is safe for always-builtin code.
>
> Lastly, we can avoid using UTS_RELEASE without relying on your patch.
>
> For example, commit 3a3a11e6e5a2bc0595c7e36ae33c861c9e8c75b1 replaced
> UTS_RELEASE with init_uts_ns.name.release
>
> So, is your uts_release a shorthand of init_uts_ns.name.release?

Yes - well, both are strings containing UTS_RELEASE. Using a struct
sub-member is a bit ungainly, but I suppose that we should not be making
life easy for people using this.

However, we already have init_utsname in:

static inline struct new_utsname *init_utsname(void)
{
	return &init_uts_ns.name;
}

so we could use init_utsname()->release, which is a bit nicer.

> I think what you can contribute are:
>
>  - Explore the UTS_RELEASE users, and check if you can get rid of it.

Unfortunately I expect resistance to this. I also expect that in places
like the FW loader it is necessary. And when this is used in sysfs,
people will say that it is part of the ABI now.

How about I send a patch to update these users to init_uts_ns and
mention also that it would be better to not use it at all, if possible?
I can cc you.

>  - Where UTS_RELEASE is useful, consider if it is possible to replace
>    it with init_uts_ns.name.release

OK, but, as above, we could use init_utsname()->release also.

Thanks,
John
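For illustration, a minimal sketch (not from the series) of what the
suggested replacement looks like in a driver's debug print;
example_print_release() is a hypothetical name:

#include <linux/printk.h>
#include <linux/utsname.h>

/* Hypothetical example: report the running kernel's release string.
 * UTS_RELEASE here would be baked in at the module's compile time;
 * init_utsname()->release always reflects the running vmlinux. */
static void example_print_release(void)
{
	pr_info("running kernel release: %s\n", init_utsname()->release);
}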
Re: [PATCH v2 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
On 2/2/24 20:53, Mathieu Poirier wrote:
> On Thu, Feb 01, 2024 at 07:33:35PM +0100, Arnaud POULIQUEN wrote:
>>
>> On 2/1/24 17:02, Mathieu Poirier wrote:
>>> On Thu, Feb 01, 2024 at 04:06:37PM +0100, Arnaud POULIQUEN wrote:

hello Mathieu,

On 1/31/24 19:52, Mathieu Poirier wrote:
> On Tue, Jan 30, 2024 at 10:13:48AM +0100, Arnaud POULIQUEN wrote:
>>
>> On 1/26/24 18:11, Mathieu Poirier wrote:
>>> On Thu, Jan 18, 2024 at 11:04:33AM +0100, Arnaud Pouliquen wrote:

The new TEE remoteproc device is used to manage remote firmware in a
secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
introduced to delegate the loading of the firmware to the trusted
execution context. In such cases, the firmware should be signed and
adhere to the image format defined by the TEE.

Signed-off-by: Arnaud Pouliquen
---
V1 to V2 update:
- remove the select "TEE_REMOTEPROC" in STM32_RPROC config as detected
  by the kernel test robot:
      WARNING: unmet direct dependencies detected for TEE_REMOTEPROC
      Depends on [n]: REMOTEPROC [=y] && OPTEE [=n]
      Selected by [y]:
      - STM32_RPROC [=y] && (ARCH_STM32 || COMPILE_TEST [=y]) && REMOTEPROC [=y]
- Fix initialized trproc variable in stm32_rproc_probe
---
 drivers/remoteproc/stm32_rproc.c | 149 ++++++++++++++++++++++++++++++-
 1 file changed, 144 insertions(+), 5 deletions(-)

diff --git a/drivers/remoteproc/stm32_rproc.c b/drivers/remoteproc/stm32_rproc.c
index fcc0001e2657..cf6a21bac945 100644
--- a/drivers/remoteproc/stm32_rproc.c
+++ b/drivers/remoteproc/stm32_rproc.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include

 #include "remoteproc_internal.h"
@@ -49,6 +50,9 @@
 #define M4_STATE_STANDBY	4
 #define M4_STATE_CRASH		5

+/* Remote processor unique identifier aligned with the Trusted Execution Environment definitions */
+#define STM32_MP1_M4_PROC_ID	0
+
 struct stm32_syscon {
 	struct regmap *map;
 	u32 reg;
@@ -90,6 +94,8 @@ struct stm32_rproc {
 	struct stm32_mbox mb[MBOX_NB_MBX];
 	struct workqueue_struct *workqueue;
 	bool hold_boot_smc;
+	bool fw_loaded;
+	struct tee_rproc *trproc;
 	void __iomem *rsc_va;
 };

@@ -257,6 +263,91 @@ static int stm32_rproc_release(struct rproc *rproc)
 	return err;
 }

+static int stm32_rproc_tee_elf_sanity_check(struct rproc *rproc,
+					    const struct firmware *fw)
+{
+	struct stm32_rproc *ddata = rproc->priv;
+	unsigned int ret = 0;
+
+	if (rproc->state == RPROC_DETACHED)
+		return 0;
+
+	ret = tee_rproc_load_fw(ddata->trproc, fw);
+	if (!ret)
+		ddata->fw_loaded = true;
+
+	return ret;
+}
+
+static int stm32_rproc_tee_elf_load(struct rproc *rproc,
+				    const struct firmware *fw)
+{
+	struct stm32_rproc *ddata = rproc->priv;
+	unsigned int ret;
+
+	/*
+	 * This function can be called by remote proc for recovery
+	 * without the sanity check. In this case we need to load the firmware
+	 * else nothing done here as the firmware has been preloaded for the
+	 * sanity check to be able to parse it for the resource table.
+	 */

>>> This comment is very confusing - please consider refactoring.

+	if (ddata->fw_loaded)
+		return 0;
+

>>> I'm not sure about keeping a flag to indicate the status of the loaded
>>> firmware. It is not done for the non-secure method, I don't see why it
>>> would be needed for the secure one.

>> The difference is in the sanity check:
>> - in rproc_elf_sanity_check we parse the elf file to verify that it is
>>   valid;
>> - in stm32_rproc_tee_elf_sanity_check we have to do the same, which
>>   means authenticating it; the authentication is done during the load.
>>
>> So this flag is used to avoid loading it a second time.
>> Refactoring the comment should help to understand this flag.
>>
>> An alternative would be to bypass the sanity check. Bu
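For reference, the load-once behaviour being debated, recapped as a
minimal illustrative sketch (names taken from the patch above; this is
not the final code):

static int tee_elf_load_sketch(struct rproc *rproc,
			       const struct firmware *fw)
{
	struct stm32_rproc *ddata = rproc->priv;

	/*
	 * If the sanity check already authenticated (and thereby
	 * loaded) the firmware in the TEE, there is nothing left to
	 * do; recovery paths that skipped the sanity check still load
	 * it here.
	 */
	if (ddata->fw_loaded)
		return 0;

	return tee_rproc_load_fw(ddata->trproc, fw);
}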
Re: [PATCH v13 2/6] ring-buffer: Introducing ring-buffer mapping functions
On Sat, Feb 03, 2024 at 07:33:51PM -0500, Steven Rostedt wrote:
> On Mon, 29 Jan 2024 14:27:58 +0000
> Vincent Donnefort wrote:
>
> > --- /dev/null
> > +++ b/include/uapi/linux/trace_mmap.h
> > @@ -0,0 +1,43 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _TRACE_MMAP_H_
> > +#define _TRACE_MMAP_H_
> > +
> > +#include <linux/types.h>
> > +
> > +/**
> > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > + * @meta_page_size:	Size of this meta-page.
> > + * @meta_struct_len:	Size of this structure.
> > + * @subbuf_size:	Size of each subbuf, including the header.
>
> That is "the header"?

I mean the header of the "bpage". But that is confusing. I'll either
remove that mention or just say it is aligned with the order of a
system page size.

> > + * @nr_subbufs:	Number of subbufs in the ring-buffer.
> > + * @reader.lost_events:	Number of events lost at the time of the reader swap.
> > + * @reader.id:	subbuf ID of the current reader. From 0 to @nr_subbufs - 1
> > + * @reader.read:	Number of bytes read on the reader subbuf.
>
> Note, flags needs a comment.
>
> > + * @entries:	Number of entries in the ring-buffer.
> > + * @overrun:	Number of entries lost in the ring-buffer.
> > + * @read:	Number of entries that have been read.
>
> So do the two Reserved words, otherwise I'm going to start getting
> error reports about sparse warnings that check kerneldoc comments
> against their content. Every field needs to be represented in the
> kerneldoc comment.

Ack, wasn't sure it was needed.

> -- Steve
>
> > + */
> > +struct trace_buffer_meta {
> > +	__u32	meta_page_size;
> > +	__u32	meta_struct_len;
> > +
> > +	__u32	subbuf_size;
> > +	__u32	nr_subbufs;
> > +
> > +	struct {
> > +		__u64	lost_events;
> > +		__u32	id;
> > +		__u32	read;
> > +	} reader;
> > +
> > +	__u64	flags;
> > +
> > +	__u64	entries;
> > +	__u64	overrun;
> > +	__u64	read;
> > +
> > +	__u64	Reserved1;
> > +	__u64	Reserved2;
> > +};
> > +
> > +#endif /* _TRACE_MMAP_H_ */
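As a rough illustration of how userspace might consume this header, a
hypothetical sketch that maps a per-CPU buffer and reads a few
meta-page fields; the tracefs path and the meta-page-first mapping
layout are assumptions based on this series, not a settled ABI:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/trace_mmap.h>	/* the new UAPI header from this patch */

int main(void)
{
	int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
		      O_RDONLY);
	struct trace_buffer_meta *meta;

	if (fd < 0)
		return 1;

	/* Assumption: the first mapped page of the buffer is the meta-page. */
	meta = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, fd, 0);
	if (meta == MAP_FAILED)
		return 1;

	printf("subbuf_size=%u nr_subbufs=%u entries=%llu\n",
	       meta->subbuf_size, meta->nr_subbufs,
	       (unsigned long long)meta->entries);

	munmap(meta, getpagesize());
	close(fd);
	return 0;
}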
[PATCH 0/3] Add PPG support for PMI632 LPG dtsi
Hook up the PBS & SDAM to the PMI632 LPG so that we can use the
hw_pattern for the LEDs.

Signed-off-by: Luca Weiss
---
Luca Weiss (3):
      dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
      arm64: dts: qcom: pmi632: Add PBS client and use in LPG node
      arm64: defconfig: Enable QCOM PBS

 Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml | 4 ++++
 arch/arm64/boot/dts/qcom/pmi632.dtsi                      | 9 +++++++++
 arch/arm64/configs/defconfig                              | 1 +
 3 files changed, 14 insertions(+)
---
base-commit: 1f790ac9c84028d89ef4dbb28ecc5771fc352e25
change-id: 20240117-pmi632-ppg-f1efb4318722

Best regards,
-- 
Luca Weiss
[PATCH 1/3] dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
Add the PBS (Programmable Boot Sequencer) to the list of devices.

Signed-off-by: Luca Weiss
---
 Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml b/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml
index 8103fb61a16c..b7f01cbb8fff 100644
--- a/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml
+++ b/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml
@@ -160,6 +160,10 @@ patternProperties:
     type: object
     $ref: /schemas/nvmem/qcom,spmi-sdam.yaml#
 
+  "^pbs@[0-9a-f]+$":
+    type: object
+    $ref: /schemas/soc/qcom/qcom,pbs.yaml#
+
   "phy@[0-9a-f]+$":
     type: object
     $ref: /schemas/phy/qcom,snps-eusb2-repeater.yaml#
-- 
2.43.0
[PATCH 2/3] arm64: dts: qcom: pmi632: Add PBS client and use in LPG node
With SDAM + PBS the LPG driver can configure the LED pattern in hardware.

Signed-off-by: Luca Weiss
---
 arch/arm64/boot/dts/qcom/pmi632.dtsi | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/pmi632.dtsi b/arch/arm64/boot/dts/qcom/pmi632.dtsi
index 4eb79e0ce40a..d2bb49a619d7 100644
--- a/arch/arm64/boot/dts/qcom/pmi632.dtsi
+++ b/arch/arm64/boot/dts/qcom/pmi632.dtsi
@@ -127,6 +127,11 @@ pmi632_adc_tm: adc-tm@3500 {
 			status = "disabled";
 		};
 
+		pmi632_pbs_client3: pbs@7400 {
+			compatible = "qcom,pmi632-pbs", "qcom,pbs";
+			reg = <0x7400>;
+		};
+
 		pmi632_sdam_7: nvram@b600 {
 			compatible = "qcom,spmi-sdam";
 			reg = <0xb600>;
@@ -155,6 +160,10 @@ pmic@3 {
 		pmi632_lpg: pwm {
 			compatible = "qcom,pmi632-lpg";
 
+			nvmem = <&pmi632_sdam_7>;
+			nvmem-names = "lpg_chan_sdam";
+			qcom,pbs = <&pmi632_pbs_client3>;
+
 			#address-cells = <1>;
 			#size-cells = <0>;
 			#pwm-cells = <2>;
-- 
2.43.0
[PATCH 3/3] arm64: defconfig: Enable QCOM PBS
Enable the PBS driver used on e.g. PMI632.

Signed-off-by: Luca Weiss
---
 arch/arm64/configs/defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index cfa3e00def09..e92a5fd9f660 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -1375,6 +1375,7 @@ CONFIG_QCOM_STATS=m
 CONFIG_QCOM_WCNSS_CTRL=m
 CONFIG_QCOM_APR=m
 CONFIG_QCOM_ICC_BWMON=m
+CONFIG_QCOM_PBS=m
 CONFIG_ARCH_R8A77995=y
 CONFIG_ARCH_R8A77990=y
 CONFIG_ARCH_R8A77951=y
-- 
2.43.0
Question about the ipi_raise filter usage and output
Hi guys,

With the ipi_raise event enabled and filtered with:

  echo 'reason == "Function call interrupts"' > filter

the 'cat trace' output shows messages like:

...
insmod-3355 [010] 1.. 24479.230381: ipi_raise: target_mask=,0bff (Function call interrupts)
...

The above output is triggered by my kernel module, which smp cross
calls a remote function from cpu#10 to cpu#11. For the 'target_mask'
value, what does the ',0bff' mean?

~~

Another question is about the filter: I'd like to catch only the IPIs
happening on cpu#11 *AND* caused by a remote function call, so how do I
write the 'target_cpus' part of the filter expression?

I tried:

  echo 'target_cpus == 11 && reason == "Function call interrupts"' > events/ipi/ipi_raise/filter

But 'cat trace' doesn't show anything about cpu#11 IPI info, although
both /proc/interrupts and smp_processor_id() in the remote function
show there are IPIs sent to cpu#11.

Any suggestions? Thank you!
Re: [PATCH v13 3/6] tracing: Add snapshot refcount
On Tue, 30 Jan 2024 10:32:45 +0000
Vincent Donnefort wrote:

> > All errors (new ones prefixed by >>):
> >
> >    kernel/trace/trace.c: In function 'tracing_set_tracer':
> >    kernel/trace/trace.c:6644:17: error: implicit declaration of function
> >    'tracing_disarm_snapshot_locked'; did you mean 'tracing_disarm_snapshot'?
> >    [-Werror=implicit-function-declaration]
> >     6644 |                 tracing_disarm_snapshot_locked(tr);
> >          |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >          |                 tracing_disarm_snapshot
> > >> kernel/trace/trace.c:6648:23: error: implicit declaration of function
> > >> 'tracing_arm_snapshot_locked'; did you mean 'tracing_arm_snapshot'?
> > >> [-Werror=implicit-function-declaration]
> >     6648 |           ret = tracing_arm_snapshot_locked(tr);
> >          |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> >          |                 tracing_arm_snapshot
> >    cc1: some warnings being treated as errors
>
> Right, two tracers (hwlat and osnoise) select _only_ MAX_TRACE and not
> TRACER_SNAPSHOT.
>
> However, AFAICT, they will not call any of the swapping functions (they
> don't set use_max_tr). So I suppose arm/disarm can be omitted in that
> case.

Yeah, if you can test with the various configs enabled and disabled to
make sure that it still builds properly, then that should be good.

I should make sure that my own ktest config that I use to run tests
checks these variations too.

-- Steve
Re: Question about the ipi_raise filter usage and output
On Mon, Feb 05, 2024 at 05:57:29PM +0800, richard clark wrote:
> Hi guys,
>
> With the ipi_raise event enabled and filtered with:
> echo 'reason == "Function call interrupts"' > filter, then the 'cat
> trace' output below messages:
> ...
> insmod-3355 [010] 1.. 24479.230381: ipi_raise:
> target_mask=,0bff (Function call interrupts)
> ...
> The above output is triggered by my kernel module where it will smp
> cross call a remote function from cpu#10 to cpu#11, for the
> 'target_mask' value, what does the ',0bff' mean?

That's a cpumask bitmap: 0xbff is 0b1011_1111_1111, which is:

     ,- CPU 10
     |
    1011_1111_1111
    | '----------'
    |      |
    |      `- CPUs 9 to 0
    `- CPU 11

Note that bitmap has CPUs 0-9 and CPU 11 set, but CPU 10 is not set.

I suspect your kernel module has generated the bitmap incorrectly; it
looks like you have a mask for CPUs 0-11 minus a mask for CPU 10? For
CPUs 10 and 11, that should be 0xc00 / 0b1100_0000_0000.

> ~~
>
> Another question is for the filter, I'd like to catch the IPI only
> happening on cpu#11 *AND* a remote function call, so how to write the
> 'target_cpus' in the filter expression?
>
> I try to write below:
> echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> events/ipi/ipi_raise/filter

The '==' checks if the target_cpus bitmap *only* contains CPU 11. If
the cpumask contains other CPUs, the filter will skip the call.

I believe you can use '&' to check whether a cpumask contains a CPU,
e.g.

  'target_cpus & 11'

Thanks,
Mark.
Re: Question about the ipi_raise filter usage and output
On Mon, 5 Feb 2024 17:57:29 +0800
richard clark wrote:

> Hi guys,
>
> With the ipi_raise event enabled and filtered with:
> echo 'reason == "Function call interrupts"' > filter, then the 'cat
> trace' output below messages:
> ...
> insmod-3355 [010] 1.. 24479.230381: ipi_raise:
> target_mask=,0bff (Function call interrupts)
> ...
> The above output is triggered by my kernel module where it will smp
> cross call a remote function from cpu#10 to cpu#11, for the
> 'target_mask' value, what does the ',0bff' mean?
> ~~

It's the CPU mask. bff is bits 1011 1111 1111 or CPUs = 0-9,11.

> Another question is for the filter, I'd like to catch the IPI only
> happening on cpu#11 *AND* a remote function call, so how to write the
> 'target_cpus' in the filter expression?
>
> I try to write below:
> echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> events/ipi/ipi_raise/filter

You mean when it is sent only to CPU 11? Not when the event is
happening on CPU 11. Like the above example, the event was triggered on
CPU 10, but the mask was for all the other CPUs.

If you are looking for just CPU 11, you can do:

  echo 'target_cpus == 0x800 && reason == "Function call interrupts"'

> But the 'cat trace' doesn't show anything about cpu#11 IPI info,
> although both the /proc/interrupts and the smp_processor_id() in the
> remote function shows there's IPI sent to the cpu#11.

-- Steve
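Since the filter wants a raw hex mask rather than a CPU number, the
arithmetic is just a shift; a trivial illustrative snippet in C:

#include <stdio.h>

int main(void)
{
	int cpu = 11;

	/* Bit 'cpu' of the mask: 1UL << 11 == 0x800, matching the
	 * 'target_cpus == 0x800' filter above. */
	printf("mask for CPU %d: %#lx\n", cpu, 1UL << cpu);
	return 0;
}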
Re: [PATCH v3 1/3] kasan: switch kunit tests to console tracepoints
On 7 Jan 2024, at 19:22, Paul Heidekrüger wrote:
> On 12.12.2023 10:32, Marco Elver wrote:
>> On Tue, 12 Dec 2023 at 10:19, Paul Heidekrüger wrote:
>>>
>>> On 12.12.2023 00:37, Andrey Konovalov wrote:
>>>> On Tue, Dec 12, 2023 at 12:35 AM Paul Heidekrüger wrote:
>>>>>
>>>>> Using CONFIG_FTRACE=y instead of CONFIG_TRACEPOINTS=y produces the
>>>>> same error for me.
>>>>>
>>>>> So
>>>>>
>>>>>     CONFIG_KUNIT=y
>>>>>     CONFIG_KUNIT_ALL_TESTS=n
>>>>>     CONFIG_FTRACE=y
>>>>>     CONFIG_KASAN=y
>>>>>     CONFIG_KASAN_GENERIC=y
>>>>>     CONFIG_KASAN_KUNIT_TEST=y
>>>>>
>>>>> produces
>>>>>
>>>>>     ➜ ./tools/testing/kunit/kunit.py run --kunitconfig=mm/kasan/.kunitconfig --arch=arm64
>>>>>     Configuring KUnit Kernel ...
>>>>>     Regenerating .config ...
>>>>>     Populating config with:
>>>>>     $ make ARCH=arm64 O=.kunit olddefconfig CC=clang
>>>>>     ERROR:root:Not all Kconfig options selected in kunitconfig were in the generated .config.
>>>>>     This is probably due to unsatisfied dependencies.
>>>>>     Missing: CONFIG_KASAN_KUNIT_TEST=y
>>>>>
>>>>> By that error message, CONFIG_FTRACE appears to be present in the
>>>>> generated config, but CONFIG_KASAN_KUNIT_TEST still isn't.
>>>>> Presumably, CONFIG_KASAN_KUNIT_TEST is missing because of an
>>>>> unsatisfied dependency, which must be CONFIG_TRACEPOINTS, unless
>>>>> I'm missing something ...
>>>>>
>>>>> If I just generate an arm64 defconfig and select CONFIG_FTRACE=y,
>>>>> CONFIG_TRACEPOINTS=y shows up in my .config. So, maybe this is
>>>>> kunit.py-related then?
>>>>>
>>>>> Andrey, you said that the tests have been working for you; are you
>>>>> running them with kunit.py?
>>>>
>>>> No, I just run the kernel built with a config file that I put
>>>> together based on defconfig.
>>>
>>> Ah. I believe I've figured it out.
>>>
>>> When I add CONFIG_STACK_TRACER=y in addition to CONFIG_FTRACE=y, it
>>> works.
>>
>> CONFIG_FTRACE should be enough - maybe also check x86 vs. arm64 to
>> debug more.
>
> See below.
>
>>> CONFIG_STACK_TRACER selects CONFIG_FUNCTION_TRACER,
>>> CONFIG_FUNCTION_TRACER selects CONFIG_GENERIC_TRACER,
>>> CONFIG_GENERIC_TRACER selects CONFIG_TRACING, and CONFIG_TRACING
>>> selects CONFIG_TRACEPOINTS.
>>>
>>> CONFIG_BLK_DEV_IO_TRACE=y also works instead of
>>> CONFIG_STACK_TRACER=y, as it directly selects CONFIG_TRACEPOINTS.
>>>
>>> CONFIG_FTRACE=y on its own does not appear to suffice for kunit.py
>>> on arm64.
>>
>> When you build manually with just CONFIG_FTRACE, is CONFIG_TRACEPOINTS
>> enabled?
>
> When I add CONFIG_FTRACE and enter-key my way through the FTRACE
> prompts - I believe because CONFIG_FTRACE is a menuconfig? - at the
> beginning of a build, CONFIG_TRACEPOINTS does get set on arm64, yes.
>
> On x86, the defconfig already includes CONFIG_TRACEPOINTS.
>
> I also had a closer look at how kunit.py builds its configs.
> I believe it does something along the following lines:
>
>     cp <kunitconfig> .kunit/.config
>     make ARCH=arm64 O=.kunit olddefconfig
>
> On arm64, that isn't enough to set CONFIG_TRACEPOINTS; same behaviour
> when run outside of kunit.py.
>
> For CONFIG_TRACEPOINTS, `make ARCH=arm64 menuconfig` shows:
>
>     Symbol: TRACEPOINTS [=n]
>     Type  : bool
>     Defined at init/Kconfig:1920
>     Selected by [n]:
>       - TRACING [=n]
>       - BLK_DEV_IO_TRACE [=n] && FTRACE [=y] && SYSFS [=y] && BLOCK [=y]
>
> So, CONFIG_TRACING or CONFIG_BLK_DEV_IO_TRACE are the two options that
> prevent CONFIG_TRACEPOINTS from being set on arm64.
>
> For CONFIG_TRACING we have:
>
>     Symbol: TRACING [=n]
>     Type  : bool
>     Defined at kernel/trace/Kconfig:157
>     Selects: RING_BUFFER [=n] && STACKTRACE [=y] && TRACEPOINTS [=n] &&
>       NOP_TRACER [=n] && BINARY_PRINTF [=n] && EVENT_TRACING [=n] &&
>       TRACE_CLOCK [=y] && TASKS_RCU [=n]
>     Selected by [n]:
>       - DRM_I915_TRACE_GEM [=n] && HAS_IOMEM [=y] && DRM_I915 [=n] && EXPERT [=n] && DRM_I915_DEBUG_GEM [=n]
>       - DRM_I915_TRACE_GTT [=n] && HAS_IOMEM [=y] && DRM_I915 [=n] && EXPERT [=n] && DRM_I915_DEBUG_GEM [=n]
>       - PREEMPTIRQ_TRACEPOINTS [=n] && (TRACE_PREEMPT_TOGGLE [=n] || TRACE_IRQFLAGS [=n])
>       - GENERIC_TRACER [=n]
>       - ENABLE_DEFAULT_TRACERS [=n] && FTRACE [=y] && !GENERIC_TRACER [=n]
>       - FPROBE_EVENTS [=n] && FTRACE [=y] && FPROBE [=n] && HAVE_REGS_AND_STACK_ACCESS_API [=y]
>       - KPROBE_EVENTS [=n] && FTRACE [=y] && KPROBES [=n] && HAVE_REGS_AND_STACK_ACCESS_API [=y]
>       - UPROBE_EVENTS [=n] && FTRACE [=y] && ARCH_SUPPORTS_UPROBES [=y] && MMU [=y] && PERF_EVENTS [=n]
>       - SYNTH_EVENTS [=n] && FTRACE [=y]
>       - USER_EVENTS [=n] && FTRACE [=y]
>       - HIST_TRIGGERS [=n] && FTRACE [=y] && ARCH_HAVE_NMI_SAFE_CMPXCHG [=y]
>
>>> I believe the reason my .kunitconfig as well as the existing
>>> mm/kfence/.kunitconfig work on x86 is because CONFIG_TRACEPOINTS=y is
>>> present in an
Re: [PATCH v3 04/47] filelock: add some new helper functions
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index 085ff6ba0653..a814664b1053 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
>  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
>  int fcntl_getlease(struct file *filp);
>  
> +static inline bool lock_is_unlock(struct file_lock *fl)
> +{
> +	return fl->fl_type == F_UNLCK;
> +}
> +
> +static inline bool lock_is_read(struct file_lock *fl)
> +{
> +	return fl->fl_type == F_RDLCK;
> +}
> +
> +static inline bool lock_is_write(struct file_lock *fl)
> +{
> +	return fl->fl_type == F_WRLCK;
> +}
> +
> +static inline void locks_wake_up(struct file_lock *fl)
> +{
> +	wake_up(&fl->fl_wait);
> +}
> +
> +/* for walking lists of file_locks linked by fl_list */
> +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> +

This causes a build warning for fs/ceph/ and fs/afs when
!CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
patch. The diff looks a bit wonky but essentially I've moved
lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:

diff --git a/include/linux/filelock.h b/include/linux/filelock.h
index a814664b1053..62be9c6b1e59 100644
--- a/include/linux/filelock.h
+++ b/include/linux/filelock.h
@@ -133,20 +133,6 @@ struct file_lock_context {
 	struct list_head	flc_lease;
 };
 
-#ifdef CONFIG_FILE_LOCKING
-int fcntl_getlk(struct file *, unsigned int, struct flock *);
-int fcntl_setlk(unsigned int, struct file *, unsigned int,
-		struct flock *);
-
-#if BITS_PER_LONG == 32
-int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
-int fcntl_setlk64(unsigned int, struct file *, unsigned int,
-		struct flock64 *);
-#endif
-
-int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
-int fcntl_getlease(struct file *filp);
-
 static inline bool lock_is_unlock(struct file_lock *fl)
 {
 	return fl->fl_type == F_UNLCK;
@@ -170,6 +156,20 @@ static inline void locks_wake_up(struct file_lock *fl)
 /* for walking lists of file_locks linked by fl_list */
 #define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
 
+#ifdef CONFIG_FILE_LOCKING
+int fcntl_getlk(struct file *, unsigned int, struct flock *);
+int fcntl_setlk(unsigned int, struct file *, unsigned int,
+		struct flock *);
+
+#if BITS_PER_LONG == 32
+int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
+int fcntl_setlk64(unsigned int, struct file *, unsigned int,
+		struct flock64 *);
+#endif
+
+int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
+int fcntl_getlease(struct file *filp);
+
 /* fs/locks.c */
 void locks_free_lock_context(struct inode *inode);
 void locks_free_lock(struct file_lock *fl);
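As an aside, a small hypothetical example of what callers gain from
these helpers; the function below is illustrative only (the caller is
assumed to hold ctx->flc_lock), not part of the patch:

#include <linux/filelock.h>

static void count_posix_locks(struct file_lock_context *ctx,
			      int *readers, int *writers)
{
	struct file_lock *fl;

	*readers = 0;
	*writers = 0;

	/* Walk the POSIX lock list with the new iterator... */
	for_each_file_lock(fl, &ctx->flc_posix) {
		/* ...and classify each lock with the new predicates. */
		if (lock_is_read(fl))
			(*readers)++;
		else if (lock_is_write(fl))
			(*writers)++;
	}
}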
Re: [PATCH v3 04/47] filelock: add some new helper functions
On Mon, 2024-02-05 at 12:36 +0100, Christian Brauner wrote:
> > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > index 085ff6ba0653..a814664b1053 100644
> > --- a/include/linux/filelock.h
> > +++ b/include/linux/filelock.h
> > @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> >  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> >  int fcntl_getlease(struct file *filp);
> >  
> > +static inline bool lock_is_unlock(struct file_lock *fl)
> > +{
> > +	return fl->fl_type == F_UNLCK;
> > +}
> > +
> > +static inline bool lock_is_read(struct file_lock *fl)
> > +{
> > +	return fl->fl_type == F_RDLCK;
> > +}
> > +
> > +static inline bool lock_is_write(struct file_lock *fl)
> > +{
> > +	return fl->fl_type == F_WRLCK;
> > +}
> > +
> > +static inline void locks_wake_up(struct file_lock *fl)
> > +{
> > +	wake_up(&fl->fl_wait);
> > +}
> > +
> > +/* for walking lists of file_locks linked by fl_list */
> > +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> > +
>
> This causes a build warning for fs/ceph/ and fs/afs when
> !CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
> patch. The diff looks a bit wonky but essentially I've moved
> lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
> for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:

I sent a patch for this problem yesterday. Did you not get it?

> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index a814664b1053..62be9c6b1e59 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -133,20 +133,6 @@ struct file_lock_context {
> 	struct list_head	flc_lease;
> };
> 
> -#ifdef CONFIG_FILE_LOCKING
> -int fcntl_getlk(struct file *, unsigned int, struct flock *);
> -int fcntl_setlk(unsigned int, struct file *, unsigned int,
> -		struct flock *);
> -
> -#if BITS_PER_LONG == 32
> -int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
> -int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> -		struct flock64 *);
> -#endif
> -
> -int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> -int fcntl_getlease(struct file *filp);
> -
> static inline bool lock_is_unlock(struct file_lock *fl)
> {
> 	return fl->fl_type == F_UNLCK;
> @@ -170,6 +156,20 @@ static inline void locks_wake_up(struct file_lock *fl)
> /* for walking lists of file_locks linked by fl_list */
> #define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> 
> +#ifdef CONFIG_FILE_LOCKING
> +int fcntl_getlk(struct file *, unsigned int, struct flock *);
> +int fcntl_setlk(unsigned int, struct file *, unsigned int,
> +		struct flock *);
> +
> +#if BITS_PER_LONG == 32
> +int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
> +int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> +		struct flock64 *);
> +#endif
> +
> +int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> +int fcntl_getlease(struct file *filp);
> +
> /* fs/locks.c */
> void locks_free_lock_context(struct inode *inode);
> void locks_free_lock(struct file_lock *fl);

-- 
Jeff Layton
Re: [PATCH v3 04/47] filelock: add some new helper functions
On Mon, Feb 05, 2024 at 06:55:44AM -0500, Jeff Layton wrote:
> On Mon, 2024-02-05 at 12:36 +0100, Christian Brauner wrote:
> > > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > > index 085ff6ba0653..a814664b1053 100644
> > > --- a/include/linux/filelock.h
> > > +++ b/include/linux/filelock.h
> > > @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> > >  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> > >  int fcntl_getlease(struct file *filp);
> > >  
> > > +static inline bool lock_is_unlock(struct file_lock *fl)
> > > +{
> > > +	return fl->fl_type == F_UNLCK;
> > > +}
> > > +
> > > +static inline bool lock_is_read(struct file_lock *fl)
> > > +{
> > > +	return fl->fl_type == F_RDLCK;
> > > +}
> > > +
> > > +static inline bool lock_is_write(struct file_lock *fl)
> > > +{
> > > +	return fl->fl_type == F_WRLCK;
> > > +}
> > > +
> > > +static inline void locks_wake_up(struct file_lock *fl)
> > > +{
> > > +	wake_up(&fl->fl_wait);
> > > +}
> > > +
> > > +/* for walking lists of file_locks linked by fl_list */
> > > +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> > > +
> >
> > This causes a build warning for fs/ceph/ and fs/afs when
> > !CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
> > patch. The diff looks a bit wonky but essentially I've moved
> > lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
> > for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:
>
> I sent a patch for this problem yesterday. Did you not get it?

Whoops, probably missed it on the trip back from fosdem.
I'll double check now.
Re: [PATCH v3 04/47] filelock: add some new helper functions
On Mon, 2024-02-05 at 12:57 +0100, Christian Brauner wrote:
> On Mon, Feb 05, 2024 at 06:55:44AM -0500, Jeff Layton wrote:
> > On Mon, 2024-02-05 at 12:36 +0100, Christian Brauner wrote:
> > > > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > > > index 085ff6ba0653..a814664b1053 100644
> > > > --- a/include/linux/filelock.h
> > > > +++ b/include/linux/filelock.h
> > > > @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> > > >  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> > > >  int fcntl_getlease(struct file *filp);
> > > >  
> > > > +static inline bool lock_is_unlock(struct file_lock *fl)
> > > > +{
> > > > +	return fl->fl_type == F_UNLCK;
> > > > +}
> > > > +
> > > > +static inline bool lock_is_read(struct file_lock *fl)
> > > > +{
> > > > +	return fl->fl_type == F_RDLCK;
> > > > +}
> > > > +
> > > > +static inline bool lock_is_write(struct file_lock *fl)
> > > > +{
> > > > +	return fl->fl_type == F_WRLCK;
> > > > +}
> > > > +
> > > > +static inline void locks_wake_up(struct file_lock *fl)
> > > > +{
> > > > +	wake_up(&fl->fl_wait);
> > > > +}
> > > > +
> > > > +/* for walking lists of file_locks linked by fl_list */
> > > > +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> > > > +
> > >
> > > This causes a build warning for fs/ceph/ and fs/afs when
> > > !CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
> > > patch. The diff looks a bit wonky but essentially I've moved
> > > lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
> > > for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:
> >
> > I sent a patch for this problem yesterday. Did you not get it?
>
> Whoops, probably missed it on the trip back from fosdem.
> I'll double check now.

No worries. If you choose to go with your version, you can add:

Reviewed-by: Jeff Layton
Re: [PATCH v3 04/47] filelock: add some new helper functions
On Mon, Feb 05, 2024 at 07:06:00AM -0500, Jeff Layton wrote:
> On Mon, 2024-02-05 at 12:57 +0100, Christian Brauner wrote:
> > On Mon, Feb 05, 2024 at 06:55:44AM -0500, Jeff Layton wrote:
> > > On Mon, 2024-02-05 at 12:36 +0100, Christian Brauner wrote:
> > > > > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > > > > index 085ff6ba0653..a814664b1053 100644
> > > > > --- a/include/linux/filelock.h
> > > > > +++ b/include/linux/filelock.h
> > > > > @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> > > > >  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> > > > >  int fcntl_getlease(struct file *filp);
> > > > >  
> > > > > +static inline bool lock_is_unlock(struct file_lock *fl)
> > > > > +{
> > > > > +	return fl->fl_type == F_UNLCK;
> > > > > +}
> > > > > +
> > > > > +static inline bool lock_is_read(struct file_lock *fl)
> > > > > +{
> > > > > +	return fl->fl_type == F_RDLCK;
> > > > > +}
> > > > > +
> > > > > +static inline bool lock_is_write(struct file_lock *fl)
> > > > > +{
> > > > > +	return fl->fl_type == F_WRLCK;
> > > > > +}
> > > > > +
> > > > > +static inline void locks_wake_up(struct file_lock *fl)
> > > > > +{
> > > > > +	wake_up(&fl->fl_wait);
> > > > > +}
> > > > > +
> > > > > +/* for walking lists of file_locks linked by fl_list */
> > > > > +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> > > > > +
> > > >
> > > > This causes a build warning for fs/ceph/ and fs/afs when
> > > > !CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
> > > > patch. The diff looks a bit wonky but essentially I've moved
> > > > lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
> > > > for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:
> > >
> > > I sent a patch for this problem yesterday. Did you not get it?
> >
> > Whoops, probably missed it on the trip back from fosdem.
> > I'll double check now.
>
> No worries. If you choose to go with your version, you can add:
>
> Reviewed-by: Jeff Layton

No, I took yours. :)
[PATCH net-next v5 2/5] page_frag: unify gfp bits for order 3 page allocation
Currently there seem to be three page frag implementations which all
try to allocate an order 3 page; if that fails, they then fall back to
allocating an order 0 page, and each of them allows the order 3 page
allocation to fail under certain conditions by using specific gfp bits.

The gfp bits for order 3 page allocation differ between the
implementations: __GFP_NOMEMALLOC is or'd in to forbid access to
emergency reserve memory for __page_frag_cache_refill(), but it is not
or'd in for the other implementations; __GFP_DIRECT_RECLAIM is masked
off to avoid direct reclaim in vhost_net_page_frag_refill(), but it is
not masked off in __page_frag_cache_refill().

This patch unifies the gfp bits used between the different
implementations by or'ing in __GFP_NOMEMALLOC and masking off
__GFP_DIRECT_RECLAIM for order 3 page allocation, to avoid possible
pressure on mm. Leave the gfp unifying for the page frag implementation
in sock.c for now, as suggested by Paolo Abeni.

Signed-off-by: Yunsheng Lin
Reviewed-by: Alexander Duyck
CC: Alexander Duyck
---
 drivers/vhost/net.c | 2 +-
 mm/page_alloc.c     | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..e574e21cc0ca 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
 		/* Avoid direct reclaim but allow kswapd to wake */
 		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
 					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
+					  __GFP_NORETRY | __GFP_NOMEMALLOC,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0f7e67c4250..636145c29f70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-		    __GFP_NOMEMALLOC;
+	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
 	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
 				PAGE_FRAG_CACHE_MAX_ORDER);
 	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
-- 
2.33.0
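To make the unified policy concrete, here is a minimal standalone
sketch (not the patch itself) of the shared pattern: a cheap high-order
attempt that must not stall or touch reserves, then an order-0 fallback
with the caller's original gfp semantics:

#include <linux/gfp.h>

/* Illustrative only: try an order-3 page with relaxed gfp bits (no
 * direct reclaim, no emergency reserves), then fall back to a single
 * page using the caller's gfp. */
static struct page *frag_refill_sketch(gfp_t gfp)
{
	gfp_t high_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
			 __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
	struct page *page;

	page = alloc_pages(high_gfp, PAGE_FRAG_CACHE_MAX_ORDER);
	if (page)
		return page;

	return alloc_pages(gfp, 0);
}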
[PATCH net-next v5 4/5] vhost/net: remove vhost_net_page_frag_refill()
The page frag in vhost_net_page_frag_refill() uses the 'struct
page_frag' from skb_page_frag_refill(), but its implementation is
similar to page_frag_alloc_align() now. This patch removes
vhost_net_page_frag_refill() by using 'struct page_frag_cache' instead
of 'struct page_frag', and allocating frags using
page_frag_alloc_align().

The added benefit is that it not only unifies the page frag
implementation a little, but also gives about a 0.5% performance boost
in testing with the vhost_net_test introduced in the last patch.

Signed-off-by: Yunsheng Lin
Acked-by: Jason Wang
---
 drivers/vhost/net.c | 91 ++++++++++++++-------------------------------
 1 file changed, 27 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e574e21cc0ca..4b2fcb228a0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -141,10 +141,8 @@ struct vhost_net {
 	unsigned tx_zcopy_err;
 	/* Flush in progress. Protected by tx vq lock. */
 	bool tx_flush;
-	/* Private page frag */
-	struct page_frag page_frag;
-	/* Refcount bias of page frag */
-	int refcnt_bias;
+	/* Private page frag cache */
+	struct page_frag_cache pf_cache;
 };
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
@@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
 	       !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
-				       struct page_frag *pfrag, gfp_t gfp)
-{
-	if (pfrag->page) {
-		if (pfrag->offset + sz <= pfrag->size)
-			return true;
-		__page_frag_cache_drain(pfrag->page, net->refcnt_bias);
-	}
-
-	pfrag->offset = 0;
-	net->refcnt_bias = 0;
-	if (SKB_FRAG_PAGE_ORDER) {
-		/* Avoid direct reclaim but allow kswapd to wake */
-		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
-					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY | __GFP_NOMEMALLOC,
-					  SKB_FRAG_PAGE_ORDER);
-		if (likely(pfrag->page)) {
-			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-			goto done;
-		}
-	}
-	pfrag->page = alloc_page(gfp);
-	if (likely(pfrag->page)) {
-		pfrag->size = PAGE_SIZE;
-		goto done;
-	}
-	return false;
-
-done:
-	net->refcnt_bias = USHRT_MAX;
-	page_ref_add(pfrag->page, USHRT_MAX - 1);
-	return true;
-}
-
 #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
 
 static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
@@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	struct vhost_net *net = container_of(vq->dev, struct vhost_net,
 					     dev);
 	struct socket *sock = vhost_vq_get_backend(vq);
-	struct page_frag *alloc_frag = &net->page_frag;
 	struct virtio_net_hdr *gso;
 	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
 	struct tun_xdp_hdr *hdr;
@@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	int sock_hlen = nvq->sock_hlen;
 	void *buf;
 	int copied;
+	int ret;
 
 	if (unlikely(len < nvq->sock_hlen))
 		return -EFAULT;
@@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 		return -ENOSPC;
 
 	buflen += SKB_DATA_ALIGN(len + pad);
-	alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
-	if (unlikely(!vhost_net_page_frag_refill(net, buflen,
-						 alloc_frag, GFP_KERNEL)))
+	buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
+				    SMP_CACHE_BYTES);
+	if (unlikely(!buf))
 		return -ENOMEM;
 
-	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset +
-				     offsetof(struct tun_xdp_hdr, gso),
-				     sock_hlen, from);
-	if (copied != sock_hlen)
-		return -EFAULT;
+	copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso),
+				sock_hlen, from);
+	if (copied != sock_hlen) {
+		ret = -EFAULT;
+		goto err;
+	}
 
 	hdr = buf;
 	gso = &hdr->gso;
@@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 			       vhost16_to_cpu(vq, gso->csum_start) +
			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
-
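For readers unfamiliar with the API the driver switches to, a small
hypothetical usage sketch of 'struct page_frag_cache' with
page_frag_alloc_align() and page_frag_free(); the alignment argument
mirrors the SMP_CACHE_BYTES usage in the patch:

#include <linux/gfp.h>

/* Illustrative only: allocate a cache-line-aligned fragment from a
 * zero-initialized per-device cache and release it when done. */
static void page_frag_usage_sketch(struct page_frag_cache *cache)
{
	void *buf;

	buf = page_frag_alloc_align(cache, 256, GFP_KERNEL,
				    SMP_CACHE_BYTES);
	if (!buf)
		return;

	/* ... fill and hand off buf ... */

	page_frag_free(buf);	/* drops the fragment's page reference */
}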
[PATCH net-next v5 5/5] tools: virtio: introduce vhost_net_test
Introduce vhost_net_test for both vhost_net tx and rx, based on
virtio_test, to test vhost_net changes in the kernel.

Steps for vhost_net tx testing:
1. Prepare an out buf.
2. Kick the vhost_net to do tx processing.
3. Do the receiving in the tun side.
4. Verify the data received by tun is correct.

Steps for vhost_net rx testing:
1. Prepare an in buf.
2. Do the sending in the tun side.
3. Kick the vhost_net to do rx processing.
4. Verify the data received by vhost_net is correct.

Signed-off-by: Yunsheng Lin
---
 tools/virtio/.gitignore            |   1 +
 tools/virtio/Makefile              |   8 +-
 tools/virtio/linux/virtio_config.h |   4 +
 tools/virtio/vhost_net_test.c      | 536 +++++++++++++++++++++++++++++
 4 files changed, 546 insertions(+), 3 deletions(-)
 create mode 100644 tools/virtio/vhost_net_test.c

diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
index 9934d48d9a55..7e47b281c442 100644
--- a/tools/virtio/.gitignore
+++ b/tools/virtio/.gitignore
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 *.d
 virtio_test
+vhost_net_test
 vringh_test
 virtio-trace/trace-agent
diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index d128925980e0..e25e99c1c3b7 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 all: test mod
-test: virtio_test vringh_test
+test: virtio_test vringh_test vhost_net_test
 virtio_test: virtio_ring.o virtio_test.o
 vringh_test: vringh_test.o vringh.o virtio_ring.o
+vhost_net_test: virtio_ring.o vhost_net_test.o
 
 try-run = $(shell set -e;		\
 	if ($(1)) >/dev/null 2>&1;	\
@@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
 .PHONY: all test mod clean vhost oot oot-clean oot-build
 clean:
-	${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
-	      vhost_test/Module.symvers vhost_test/modules.order *.d
+	${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
+	      vhost_test/.*.cmd vhost_test/Module.symvers \
+	      vhost_test/modules.order *.d
 -include *.d
diff --git a/tools/virtio/linux/virtio_config.h b/tools/virtio/linux/virtio_config.h
index 2a8a70e2a950..42a564f22f2d 100644
--- a/tools/virtio/linux/virtio_config.h
+++ b/tools/virtio/linux/virtio_config.h
@@ -1,4 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_VIRTIO_CONFIG_H
+#define LINUX_VIRTIO_CONFIG_H
 #include
 #include
 #include
@@ -95,3 +97,5 @@ static inline __virtio64 cpu_to_virtio64(struct virtio_device *vdev, u64 val)
 {
 	return __cpu_to_virtio64(virtio_is_little_endian(vdev), val);
 }
+
+#endif
diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
new file mode 100644
index 000000000000..6c41204e6707
--- /dev/null
+++ b/tools/virtio/vhost_net_test.c
@@ -0,0 +1,536 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define HDR_LEN		sizeof(struct virtio_net_hdr_mrg_rxbuf)
+#define TEST_BUF_LEN	256
+#define TEST_PTYPE	ETH_P_LOOPBACK
+#define DESC_NUM	256
+
+/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
+void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
+
+struct vq_info {
+	int kick;
+	int call;
+	int idx;
+	long started;
+	long completed;
+	struct pollfd fds;
+	void *ring;
+	/* copy used for control */
+	struct vring vring;
+	struct virtqueue *vq;
+};
+
+struct vdev_info {
+	struct virtio_device vdev;
+	int control;
+	struct vq_info vqs[2];
+	int nvqs;
+	void *buf;
+	size_t buf_size;
+	char *test_buf;
+	char *res_buf;
+	struct vhost_memory *mem;
+	int sock;
+	int ifindex;
+	unsigned char mac[ETHER_ADDR_LEN];
+};
+
+static int tun_alloc(struct vdev_info *dev, char *tun_name)
+{
+	struct ifreq ifr;
+	int len = HDR_LEN;
+	int fd, e;
+
+	fd = open("/dev/net/tun", O_RDWR);
+	if (fd < 0) {
+		perror("Cannot open /dev/net/tun");
+		return fd;
+	}
+
+	memset(&ifr, 0, sizeof(ifr));
+
+	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
+	strncpy(ifr.ifr_name, tun_name, IFNAMSIZ);
+
+	e = ioctl(fd, TUNSETIFF, &ifr);
+	if (e < 0) {
+		perror("ioctl[TUNSETIFF]");
+		close(fd);
+		return e;
+	}
+
+	e = ioctl(fd, TUNSETVNETHDRSZ, &len);
+	if (e < 0) {
+		perror("ioctl[TUNSETVNETHDRSZ]");
+		close(fd);
+		return e;
+	}
+
+	e = ioctl(fd, SIOCGIFHWADDR, &ifr);
+	if (e < 0) {
+		perror("ioctl[SIOCGIFHWADDR]");
+		close(fd);
+		return e;
+
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
On Mon, 5 Feb 2024 07:53:40 +0100
Sven Schnelle wrote:

> tracer_tracing_is_on() checks whether record_disabled is not zero. This
> checks both the record_disabled counter and the RB_BUFFER_OFF flag.
> Reading the source it looks like this function should only check for
> the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
> This fixes spurious fails in the 'test for function traceon/off triggers'
> test from the ftrace testsuite when the system is under load.

I've seen these spurious failures too, but haven't looked deeper into
it. Thanks,

-- Steve
Re: Question about the ipi_raise filter usage and output
On Mon, 5 Feb 2024 10:28:57 +0000
Mark Rutland wrote:

> > I try to write below:
> > echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> > events/ipi/ipi_raise/filter
>
> The '==' checks if the target_cpus bitmap *only* contains CPU 11. If
> the cpumask contains other CPUs, the filter will skip the call.
>
> I believe you can use '&' to check whether a cpumask contains a CPU,
> e.g.
>
>   'target_cpus & 11'

11 == 0xb = 0b1011

So the above would only be true for CPUs 0, 1 and 3 ;-)

I think you meant: 'target_cpus & 0x800'

I tried '1 << 11' but it appears to not allow shifts. I wonder if we
should add that?

-- Steve
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven,

Steven Rostedt writes:

> On Mon, 5 Feb 2024 07:53:40 +0100
> Sven Schnelle wrote:
>
>> tracer_tracing_is_on() checks whether record_disabled is not zero. This
>> checks both the record_disabled counter and the RB_BUFFER_OFF flag.
>> Reading the source it looks like this function should only check for
>> the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
>> This fixes spurious fails in the 'test for function traceon/off triggers'
>> test from the ftrace testsuite when the system is under load.
>
> I've seen these spurious failures too, but haven't looked deeper into
> it. Thanks,

Another issue I'm hitting sometimes is this part:

  csum1=`md5sum trace`
  sleep $SLEEP_TIME
  csum2=`md5sum trace`

  if [ "$csum1" != "$csum2" ]; then
      fail "Tracing file is still changing"
  fi

This is because the command line was replaced in the
saved_cmdlines_buffer; an example diff between both files is:

 ftracetest-17950 [005] ..... 344507.002490: sched_process_wait: comm=ftracetest pid=0 prio=120
 ftracetest-17950 [005] ..... 344507.002492: sched_process_wait: comm=ftracetest pid=0 prio=120
-stress-ng-fanot-17820 [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=1054 [ns]
+<...>-17820 [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=1054 [ns]
 ftracetest-17950 [005] d.h.. 344507.009901: sched_stat_runtime: comm=ftracetest pid=17950 runtime=7417915 [ns]
 stress-ng-fanot-17819 [003] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=9983473 [ns]
-stress-ng-fanot-17820 [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=865 [ns]
+<...>-17820 [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=865 [ns]
 stress-ng-fanot-17819 [004] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=8388039 [ns]

This can be improved by:

  echo 32768 > /sys/kernel/tracing/saved_cmdlines_size

But this is of course not a fix - should we maybe replace the program
name with <...> before comparing, remove the check completely, or do
anything else? What do you think?

Thanks,
Sven
Re: [PATCH 1/3] dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
On Mon, 05 Feb 2024 10:51:38 +0100, Luca Weiss wrote:
> Add the PBS (Programmable Boot Sequencer) to the list of devices.
>
> Signed-off-by: Luca Weiss
> ---
>  Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml | 4 ++++
>  1 file changed, 4 insertions(+)
>

My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check'
on your patch (DT_CHECKER_FLAGS is new in v5.13):

yamllint warnings/errors:

dtschema/dtc warnings/errors:
/builds/robherring/dt-review-ci/linux/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml:
Error in referenced schema matching $id: http://devicetree.org/schemas/soc/qcom/qcom,pbs.yaml

doc reference errors (make refcheckdocs):

See https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20240205-pmi632-ppg-v1-1-e236c95a2...@fairphone.com

The base for the series is generally the latest rc1. A different
dependency should be noted in *this* patch.

If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:

  pip3 install dtschema --upgrade

Please check and re-submit after running the above command yourself.
Note that DT_SCHEMA_FILES can be set to your schema file to speed up
checking your schema. However, it must be unset to test all examples
with your schema.
Re: [PATCH v3 3/3] tracing: convert __trace_seq_init to use WARN_ON_ONCE
(arch/x86/kernel/traps.c:238)
[ 1915.516038][   T42] ? exc_invalid_op (arch/x86/kernel/traps.c:259)
[ 1915.516298][   T42] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 1915.516586][   T42] ? trace_seq_printf (kernel/trace/trace_seq.c:35)
[ 1915.516860][   T42] ? trace_find_cmdline (arch/x86/include/asm/preempt.h:103 kernel/trace/trace.c:2546)
[ 1915.517142][   T42] ? trace_find_cmdline (arch/x86/include/asm/preempt.h:103 kernel/trace/trace.c:2546)
[ 1915.517420][   T42] ? preempt_count_sub (arch/x86/include/asm/preempt.h:84 kernel/sched/core.c:5900)
[ 1915.517694][   T42] trace_print_lat_context (kernel/trace/trace_output.c:516 kernel/trace/trace_output.c:664)
[ 1915.517995][   T42] print_trace_fmt (kernel/trace/trace.c:?)
[ 1915.518257][   T42] ftrace_dump (kernel/trace/trace.c:10413)
[ 1915.518505][   T42] rcu_scale_writer (kernel/rcu/rcuscale.c:534)
[ 1915.518775][   T42] ? __cfi_rcu_scale_writer (kernel/rcu/rcuscale.c:453)
[ 1915.519071][   T42] kthread (kernel/kthread.c:390)
[ 1915.519298][   T42] ? __cfi_kthread (kernel/kthread.c:341)
[ 1915.519554][   T42] ret_from_fork (arch/x86/kernel/process.c:153)
[ 1915.519802][   T42] ? __cfi_kthread (kernel/kthread.c:341)
[ 1915.520057][   T42] ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
[ 1915.520326][   T42]
[ 1915.520496][   T42] irq event stamp: 8228
[ 1915.520724][   T42] hardirqs last enabled at (8227): _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:?)
[ 1915.521290][   T42] hardirqs last disabled at (8228): ftrace_dump (kernel/trace/trace.c:10359)
[ 1915.521783][   T42] softirqs last enabled at (0): copy_process (include/linux/rcupdate.h:298 include/linux/rcupdate.h:750 kernel/fork.c:2366)
[ 1915.522276][   T42] softirqs last disabled at (0): 0x0
[ 1915.522663][   T42] ---[ end trace ]---
[ 1915.522970][   T42] swapper-1 0..Zff 1910231878us : fib6_table_lookup: table 2166473472 oif 80736 iif -14080 proto 0 81ff::ff60:6d05:81ff::ff00:0/0 -> ::/0 tos 96 scope 109 flags 5 ==> dev gw :: err -1
[ 1915.524039][   T42] swapper-1 0. 1910231880us : Unknown type 1830
[ 1915.524442][   T42] ---------------------------------
[ 1915.524733][   T42] rcu-scale: Test complete

The kernel config and materials to reproduce are available at:

https://download.01.org/0day-ci/archive/20240205/202402052141.769871e2-...@intel.com

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
On Mon, 05 Feb 2024 14:16:30 +0100
Sven Schnelle wrote:

> Another issue I'm hitting sometimes is this part:
>
>   csum1=`md5sum trace`
>   sleep $SLEEP_TIME
>   csum2=`md5sum trace`
>
>   if [ "$csum1" != "$csum2" ]; then
>       fail "Tracing file is still changing"
>   fi
>
> This is because the command line was replaced in the
> saved_cmdlines_buffer; an example diff between both files is:

[..]

> This can be improved by:
>
>   echo 32768 > /sys/kernel/tracing/saved_cmdlines_size
>
> But this is of course not a fix - should we maybe replace the program
> name with <...> before comparing, remove the check completely, or do
> anything else? What do you think?

Hmm, actually I would say that this exposes a real bug. Not a major
one, but one that I find annoying. The saved commandlines should only
be updated when a trace event occurs. But really, it should only be
updated if one is added to the ring buffer. If the ring buffer isn't
being updated, we shouldn't be adding new command lines.

There may be a location that has tracing off but still updating the
cmdlines which will break the saved cache.

-- Steve
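Purely as an illustration of the direction described above (not a
reviewed fix), the guard could take roughly this shape; the names
follow kernel/trace/trace.c, but the exact call site is an assumption:

/* Hypothetical sketch: only refresh the saved-cmdlines cache when the
 * instance's buffer can actually accept events. */
static void record_cmdline_sketch(struct trace_array *tr,
				  struct task_struct *task)
{
	if (!tracer_tracing_is_on(tr))
		return;	/* buffer off: leave the cmdline cache alone */

	tracing_record_cmdline(task);
}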
Re: Question about the ipi_raise filter usage and output
[adding Valentin]

On Mon, Feb 05, 2024 at 08:06:09AM -0500, Steven Rostedt wrote:
> On Mon, 5 Feb 2024 10:28:57 +0000
> Mark Rutland wrote:
>
> > > I try to write below:
> > > echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> > > events/ipi/ipi_raise/filter
> >
> > The '==' checks if the target_cpus bitmap *only* contains CPU 11. If
> > the cpumask contains other CPUs, the filter will skip the call.
> >
> > I believe you can use '&' to check whether a cpumask contains a CPU,
> > e.g.
> >
> >   'target_cpus & 11'
>
> 11 == 0xb = 0b1011
>
> So the above would only be true for CPUs 0, 1 and 3 ;-)

Sorry, I misunderstood the scalar logic and thought that we treated:

  '$mask $OP $scalar', e.g. 'target_cpus & 11'

... as a special case meaning a cpumask with that scalar bit set, i.e.

  '$mask $OP CPUS{$scalar}', e.g. 'target_cpus & CPUS{11}'

... but evidently I was wrong.

> I think you meant: 'target_cpus & 0x800'
>
> I tried '1 << 11' but it appears to not allow shifts. I wonder if we
> should add that?

Hmm... shouldn't we make 'CPUS{11}' work for that?

From a quick test (below), that doesn't seem to work, though I think it
probably should?

  # cat /sys/devices/system/cpu/online
  0-3
  # echo 1 > /sys/kernel/tracing/events/ipi/ipi_raise/enable
  # echo 'target_cpus & CPUS{3}' > /sys/kernel/tracing/events/ipi/ipi_raise/filter
  # grep IPI /proc/interrupts
   IPI0:        54         41         32         42  Rescheduling interrupts
   IPI1:      1202       1035        893        909  Function call interrupts
   IPI2:         0          0          0          0  CPU stop interrupts
   IPI3:         0          0          0          0  CPU stop (for crash dump) interrupts
   IPI4:         0          0          0          0  Timer broadcast interrupts
   IPI5:         0          0          0          0  IRQ work interrupts
  # sleep 1
  # grep IPI /proc/interrupts
   IPI0:        54         42         32         42  Rescheduling interrupts
   IPI1:      1209       1037        912        927  Function call interrupts
   IPI2:         0          0          0          0  CPU stop interrupts
   IPI3:         0          0          0          0  CPU stop (for crash dump) interrupts
   IPI4:         0          0          0          0  Timer broadcast interrupts
   IPI5:         0          0          0          0  IRQ work interrupts
  # cat /sys/devices/system/cpu/online
  0-3
  # cat /sys/kernel/tracing/trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 0/0   #P:4
  #
  #                                _-----=> irqs-off/BH-disabled
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disable
  #                              |||| /     delay
  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
  #              | |         |   |||||     |         |

More confusingly, if I use '8', I get events with cpumasks which
shouldn't match AFAICT:

  # echo 'target_cpus & 8' > /sys/kernel/tracing/events/ipi/ipi_raise/filter
  # echo '' > /sys/kernel/tracing/trace
  # grep IPI /proc/interrupts
   IPI0:        55         46         34         43  Rescheduling interrupts
   IPI1:      1358       1155        994       1021  Function call interrupts
   IPI2:         0          0          0          0  CPU stop interrupts
   IPI3:         0          0          0          0  CPU stop (for crash dump) interrupts
   IPI4:         0          0          0          0  Timer broadcast interrupts
   IPI5:         0          0          0          0  IRQ work interrupts
  # sleep 1
  # grep IPI /proc/interrupts
   IPI0:        56         46         34         43  Rescheduling interrupts
   IPI1:      1366       1158       1005       1038  Function call interrupts
   IPI2:         0          0          0          0  CPU stop interrupts
   IPI3:         0          0          0          0  CPU stop (for crash dump) interrupts
   IPI4:         0          0          0          0  Timer broadcast interrupts
   IPI5:         0          0          0          0  IRQ work interrupts
  # cat /sys/kernel/tracing/trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 91/91   #P:4
  #
  #                                _-----=> irqs-off/BH-disabled
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disable
  #                              |||| /     delay
  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
  #              | |         |   |||||     |         |
            <idle>-0       [000] d.h4.   480.720312: ipi_ra
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven,

Steven Rostedt writes:

> On Mon, 05 Feb 2024 14:16:30 +0100
> Sven Schnelle wrote:
>>
>> Another issue I'm hitting sometimes is this part:
>>
>>   csum1=`md5sum trace`
>>   sleep $SLEEP_TIME
>>   csum2=`md5sum trace`
>>
>>   if [ "$csum1" != "$csum2" ]; then
>>       fail "Tracing file is still changing"
>>   fi
>>
>> This is because the command line was replaced in the
>> saved_cmdlines_buffer; an example diff between both files is:
>
> [..]
>
>> This can be improved by:
>>
>>   echo 32768 > /sys/kernel/tracing/saved_cmdlines_size
>>
>> But this is of course not a fix - should we maybe replace the program
>> name with <...> before comparing, remove the check completely, or do
>> anything else? What do you think?
>
> Hmm, actually I would say that this exposes a real bug. Not a major
> one, but one that I find annoying. The saved commandlines should only
> be updated when a trace event occurs. But really, it should only be
> updated if one is added to the ring buffer. If the ring buffer isn't
> being updated, we shouldn't be adding new command lines.
>
> There may be a location that has tracing off but still updating the
> cmdlines which will break the saved cache.

Ok, my understanding is that it will override the entry in the list if
another process comes up with the same PID. But I haven't read the code
carefully - let me do that now.
Re: [PATCH 1/3] dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
On Montag, 5. Februar 2024 14:46:45 CET Rob Herring wrote: > On Mon, 05 Feb 2024 10:51:38 +0100, Luca Weiss wrote: > > Add the PBS (Programmable Boot Sequencer) to the list of devices. > > > > Signed-off-by: Luca Weiss > > --- > > > > Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml | 4 > > 1 file changed, 4 insertions(+) > > My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check' > on your patch (DT_CHECKER_FLAGS is new in v5.13): > > yamllint warnings/errors: > > dtschema/dtc warnings/errors: > /builds/robherring/dt-review-ci/linux/Documentation/devicetree/bindings/mfd/ > qcom,spmi-pmic.yaml: Error in referenced schema matching $id: > http://devicetree.org/schemas/soc/qcom/qcom,pbs.yaml These patches have been merged into linux-next recently, so should get into the next release. > > doc reference errors (make refcheckdocs): > > See > https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20240205-pmi > 632-ppg-v1-1-e236c95a2...@fairphone.com > > The base for the series is generally the latest rc1. A different dependency > should be noted in *this* patch. > > If you already ran 'make dt_binding_check' and didn't see the above > error(s), then make sure 'yamllint' is installed and dt-schema is up to > date: > > pip3 install dtschema --upgrade > > Please check and re-submit after running the above command yourself. Note > that DT_SCHEMA_FILES can be set to your schema file to speed up checking > your schema. However, it must be unset to test all examples with your > schema.
Re: Question about the ipi_raise filter usage and output
On 05/02/24 14:39, Mark Rutland wrote:
> [adding Valentin]
>

Thanks!

> On Mon, Feb 05, 2024 at 08:06:09AM -0500, Steven Rostedt wrote:
>> On Mon, 5 Feb 2024 10:28:57 +0000
>> Mark Rutland wrote:
>>
>> > > I try to write below:
>> > > echo 'target_cpus == 11 && reason == "Function call interrupts"' >
>> > > events/ipi/ipi_raise/filter
>> >
>> > The '==' checks if the target_cpus bitmap *only* contains CPU 11. If the cpumask
>> > contains other CPUs, the filter will skip the call.
>> >
>> > I believe you can use '&' to check whether a cpumask contains a CPU, e.g.
>> >
>> > 	'target_cpus & 11'
>>
>> 11 == 0xb = b1011
>>
>> So the above would only be true for CPUs 0,1 and 3 ;-)
>
> Sorry, I misunderstood the scalar logic and thought that we treated:
>
> 	'$mask $OP $scalar', e.g. 'target_cpus & 11'
>
> .. as a special case meaning a cpumask with that scalar bit set, i.e.
>
> 	'$mask $OP CPUS{$scalar}', e.g. 'target_cpus & CPUS{11}'
>
> .. but evidently I was wrong.
>
>> I think you meant: 'target_cpus & 0x800'
>>
>> I tried '1 << 11' but it appears to not allow shifts. I wonder if we should
>> add that?
>
> Hmm... shouldn't we make 'CPUS{11}' work for that?
>

It /should/ already be the case, the user input with the curly braces is
parsed as a cpulist. So CPUS{11} really does mean CPU11, not a hex value
to be interpreted as a cpumask. However...

> From a quick test (below), that doesn't seem to work, though I think it
> probably should?
>
> Have I completely misunderstood how this is supposed to work, or is that a
> bug?
>

The CPUS{} thingie only works with an event field that is either declared
as a cpumask (__cpumask) or a scalar. That's not the case for ipi_raise,
the target_cpus event field is saved as a "raw" bitmask. There /should/
have been a warning about the event filter though, but I think it's not
happening because I'm allowing more than just FILTER_CPUMASK in
parse_pred() to make it work for scalars. I'll go poke around some more.

Generally for this sort of IPI investigation I'd recommend using the newer
trace_ipi_send_cpu() and trace_ipi_send_cpumask() (for which CPUS{}
filtering works). If it's only the function call interrupts you're
interested in, have a look at trace_csd_queue_cpu().

> Mark.
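For reference, the scalar semantics Steven describes can be checked outside
the kernel with a standalone C sketch (illustrative only): the right-hand
side of '&' is a raw bit pattern, so 11 (0b1011) tests CPUs 0, 1 and 3,
while testing CPU 11 needs bit 11, i.e. 1 << 11 == 0x800:

	#include <stdio.h>

	int main(void)
	{
		unsigned long target_cpus = 1UL << 3;	/* event aimed at CPU 3 */

		/* 11 == 0b1011: matches because CPU 3's bit is set in 0b1011 */
		printf("mask & 11:    %s\n", (target_cpus & 11) ? "match" : "no match");

		/* testing for CPU 11 requires bit 11, i.e. 0x800 */
		printf("mask & 0x800: %s\n", (target_cpus & 0x800) ? "match" : "no match");

		return 0;
	}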
Re: [PATCH RFC v3 22/35] arm64: mte: Enable tag storage if CMA areas have been activated
Hi Evgenii, On Fri, Feb 02, 2024 at 02:30:00PM -0800, Evgenii Stepanov wrote: > On Thu, Jan 25, 2024 at 8:44 AM Alexandru Elisei > wrote: > > > > Before enabling MTE tag storage management, make sure that the CMA areas > > have been successfully activated. If a CMA area fails activation, the pages > > are kept as reserved. Reserved pages are never used by the page allocator. > > > > If this happens, the kernel would have to manage tag storage only for some > > of the memory, but not for all memory, and that would make the code > > unreasonably complicated. > > > > Choose to disable tag storage management altogether if a CMA area fails to > > be activated. > > > > Signed-off-by: Alexandru Elisei > > --- > > > > Changes since v2: > > > > * New patch. > > > > arch/arm64/include/asm/mte_tag_storage.h | 12 ++ > > arch/arm64/kernel/mte_tag_storage.c | 50 > > 2 files changed, 62 insertions(+) > > > > diff --git a/arch/arm64/include/asm/mte_tag_storage.h > > b/arch/arm64/include/asm/mte_tag_storage.h > > index 3c2cd29e053e..7b3f6bff8e6f 100644 > > --- a/arch/arm64/include/asm/mte_tag_storage.h > > +++ b/arch/arm64/include/asm/mte_tag_storage.h > > @@ -6,8 +6,20 @@ > > #define __ASM_MTE_TAG_STORAGE_H > > > > #ifdef CONFIG_ARM64_MTE_TAG_STORAGE > > + > > +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); > > + > > +static inline bool tag_storage_enabled(void) > > +{ > > + return static_branch_likely(&tag_storage_enabled_key); > > +} > > + > > void mte_init_tag_storage(void); > > #else > > +static inline bool tag_storage_enabled(void) > > +{ > > + return false; > > +} > > static inline void mte_init_tag_storage(void) > > { > > } > > diff --git a/arch/arm64/kernel/mte_tag_storage.c > > b/arch/arm64/kernel/mte_tag_storage.c > > index 9a1a8a45171e..d58c68b4a849 100644 > > --- a/arch/arm64/kernel/mte_tag_storage.c > > +++ b/arch/arm64/kernel/mte_tag_storage.c > > @@ -19,6 +19,8 @@ > > > > #include > > > > +__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key); > > + > > struct tag_region { > > struct range mem_range; /* Memory associated with the tag storage, > > in PFNs. */ > > struct range tag_range; /* Tag storage memory, in PFNs. */ > > @@ -314,3 +316,51 @@ void __init mte_init_tag_storage(void) > > num_tag_regions = 0; > > pr_info("MTE tag storage region management disabled"); > > } > > + > > +static int __init mte_enable_tag_storage(void) > > +{ > > + struct range *tag_range; > > + struct cma *cma; > > + int i, ret; > > + > > + if (num_tag_regions == 0) > > + return 0; > > + > > + for (i = 0; i < num_tag_regions; i++) { > > + tag_range = &tag_regions[i].tag_range; > > + cma = tag_regions[i].cma; > > + /* > > +* CMA will keep the pages as reserved when the region fails > > +* activation. > > +*/ > > + if (PageReserved(pfn_to_page(tag_range->start))) > > + goto out_disabled; > > + } > > + > > + static_branch_enable(&tag_storage_enabled_key); > > + pr_info("MTE tag storage region management enabled"); > > + > > + return 0; > > + > > +out_disabled: > > + for (i = 0; i < num_tag_regions; i++) { > > + tag_range = &tag_regions[i].tag_range; > > + cma = tag_regions[i].cma; > > + > > + if (PageReserved(pfn_to_page(tag_range->start))) > > + continue; > > + > > + /* Try really hard to reserve the tag storage. */ > > + ret = cma_alloc(cma, range_len(tag_range), 8, true); > > + /* > > +* Tag storage is still in use for data, memory and/or tag > > +* corruption will ensue. 
> > +*/ > > + WARN_ON_ONCE(ret); > > cma_alloc returns (page *), so this condition needs to be inverted, > and the type of `ret` changed. > Not sure how it slipped through, this is a compile error with clang. Checked just now, it's a warning with gcc, I must have missed it. Will fix. Thanks, Alex > > > + } > > + num_tag_regions = 0; > > + pr_info("MTE tag storage region management disabled"); > > + > > + return -EINVAL; > > +} > > +arch_initcall(mte_enable_tag_storage); > > -- > > 2.43.0 > >
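For reference, the inverted check Evgenii suggests could look like the
sketch below (untested and abbreviated, kernel context). It illustrates the
point that cma_alloc() returns a struct page pointer, which is NULL on
failure:

	#include <linux/cma.h>
	#include <linux/range.h>

	/* Abbreviated sketch of the fixed loop body, mirroring the patch's
	 * cma_alloc(cma, range_len(tag_range), 8, true) call. */
	static void reserve_tag_storage_sketch(struct cma *cma, struct range *tag_range)
	{
		struct page *page;

		/* Try really hard to reserve the tag storage. */
		page = cma_alloc(cma, range_len(tag_range), 8, true);

		/*
		 * cma_alloc() returns NULL on failure, so warn when no page
		 * came back, not on a non-zero return as the v3 patch did.
		 */
		WARN_ON_ONCE(!page);
	}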
[PATCH v14 0/6] Introducing trace buffer mapping by user-space
The tracing ring-buffers can be stored on disk or sent to network without
any copy via splice. However the latter doesn't allow real time processing
of the traces. A solution is to give userspace direct access to the
ring-buffer pages via a mapping. An application can now become a consumer
of the ring-buffer, in a similar fashion to what trace_pipe offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
    the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
    (intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
    rb_update_meta_page().
  * rb_update_meta_page() is now called from ring_buffer_map_get_reader()
    to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
    in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
    - pages_lost / pages_read (allow to compute how full is the ring-buffer)
    - read (allow to compute how many entries can be read)
    - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
    rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
    ** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.
Vincent Donnefort (6):
  ring-buffer: Zero ring-buffer sub-buffers
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Add snapshot refcount
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst                 |   1 +
 Documentation/trace/ring-buffer-map.rst       | 104 ++
 include/linux/ring_buffer.h                   |   7 +
 include/uapi/linux/trace_mmap.h               |  48 +++
 kernel/trace/ring_buffer.c                    | 344 +-
 kernel/trace/trace.c                          | 221 +--
 kernel/trace/trace.h                          |   9 +-
 kernel/trace/trace_events_trigger.c           |  58 ++-
 tools/testing/selftests/ring-buffer/Makefile  |   8 +
 tools/testing/selftests/ring-buffer/config    |   2 +
 .../testing/selftests/ring-buffer/map_test.c  | 273 ++
 11 files changed, 1029 insertions(+), 46 deletions(-)
 create mode 100644 Documentation/trace/ring-buffer-map.rst
 create mode 100644 include/uapi/linux/trace_mmap.h
 create mode 100644 tools/testing/selftests/ring-buffer/Makefile
 create mode 100644 tools/testing/selftests/ring-buffer/config
 create mode 100644 tools/testing/selftests/ring-buffer/map_test.c

base-commit: e2412e51fdea837b50ce31fea8e5dfc885237f3a
--
2.43.0.594
[PATCH v14 1/6] ring-buffer: Zero ring-buffer sub-buffers
In preparation for the ring-buffer memory mapping where each subbuf will be accessible to user-space, zero all the page allocations. Signed-off-by: Vincent Donnefort Reviewed-by: Masami Hiramatsu (Google) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index fd4bfe3ecf01..ca796675c0a1 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1472,7 +1472,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(&bpage->list, pages); - page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags, + page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), + mflags | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto free_pages; @@ -1557,7 +1558,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) cpu_buffer->reader_page = bpage; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, cpu_buffer->buffer->subbuf_order); + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO, + cpu_buffer->buffer->subbuf_order); if (!page) goto fail_free_reader; bpage->page = page_address(page); @@ -5525,7 +5527,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu) if (bpage->data) goto out; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY, + page = alloc_pages_node(cpu_to_node(cpu), + GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) { kfree(bpage); -- 2.43.0.594.gd9cf4e227d-goog
[PATCH v14 2/6] ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add a set
of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored.

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the first
mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include
 #include

+#include
+
 struct trace_buffer;
 struct ring_buffer_iter;

@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
 #define trace_rb_cpu_prepare	NULL
 #endif

+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+				   unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..182e05a3004a
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:	Size of this meta-page.
+ * @meta_struct_len:	Size of this structure.
+ * @subbuf_size:	Size of each sub-buffer.
+ * @nr_subbufs:		Number of subbufs in the ring-buffer.
+ * @reader.lost_events:	Number of events lost at the time of the reader swap.
+ * @reader.id:		subbuf ID of the current reader. From 0 to @nr_subbufs - 1
+ * @reader.read:	Number of bytes read on the reader subbuf.
+ * @flags:		Placeholder for now, no defined values.
+ * @entries:		Number of entries in the ring-buffer.
+ * @overrun:		Number of entries lost in the ring-buffer.
+ * @read:		Number of entries that have been read.
+ * @Reserved1:		Reserved for future use.
+ * @Reserved2:		Reserved for future use.
+ */
+struct trace_buffer_meta {
+	__u32		meta_page_size;
+	__u32		meta_struct_len;
+
+	__u32		subbuf_size;
+	__u32		nr_subbufs;
+
+	struct {
+		__u64	lost_events;
+		__u32	id;
+		__u32	read;
+	} reader;
+
+	__u64	flags;
+
+	__u64	entries;
+	__u64	overrun;
+	__u64	read;
+
+	__u64	Reserved1;
+	__u64	Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ca796675c0a1..dcb4a0bc7ecb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
 	local_t		 entries;	/* entries on this page */
 	unsigned long	 real_end;	/* real end of data */
 	unsigned	 order;		/* order of the page */
+	u32		 id;		/* ID for external mapping */
 	struct buffer_data_page *page;	/* Actual data page */
 };

@@ -484,6 +485,12 @@ struct ring_buffer_per_cpu {
 	u64				read_stamp;
 	/* pages removed since last reset */
 	unsigned long			pages_removed;
+
+	unsigned int			mapped;
+	struct mutex			mapping_lock;
+	unsigned long			*subbuf_ids;	/* ID to addr */
+	struct trace_buffer_meta	*meta_page;
+
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long				nr_pages_to_update;
 	struct list_head		new_pages; /* new pages to add */
@@ -1548,6 +1555,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
 	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
 	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+	mutex_init(&cpu_buffer->mapping_lock);

 	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
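To make the resulting layout concrete, here is a tiny user-space sketch of
the offset computation from the commit message (the 4 KiB sizes are
assumptions for illustration; real values come from the meta-page at run
time):

	#include <stdio.h>

	int main(void)
	{
		unsigned int meta_page_size = 4096;	/* assumed for illustration */
		unsigned int subbuf_size = 4096;	/* assumed for illustration */
		unsigned int reader_id = 2;		/* say the meta-page reports ID 2 */

		/* meta-page first, then sub-buffers ordered by their unique ID */
		printf("reader offset: %u\n",
		       meta_page_size + reader_id * subbuf_size);	/* 12288 */
		return 0;
	}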
[PATCH v14 3/6] tracing: Add snapshot refcount
When a ring-buffer is memory mapped by user-space, no trace or ring-buffer swap is possible. This means the snapshot feature is mutually exclusive with the memory mapping. Having a refcount on snapshot users will help to know if a mapping is possible or not. Instead of relying on the global trace_types_lock, a new spinlock is introduced to serialize accesses to trace_array->snapshot. This intends to allow access to that variable in a context where the mmap lock is already held. Signed-off-by: Vincent Donnefort diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2a7c6fd934e9..4ebf4d0bd14c 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1300,6 +1300,50 @@ static void free_snapshot(struct trace_array *tr) tr->allocated_snapshot = false; } +static int tracing_arm_snapshot_locked(struct trace_array *tr) +{ + int ret; + + lockdep_assert_held(&trace_types_lock); + + spin_lock(&tr->snapshot_trigger_lock); + if (tr->snapshot == UINT_MAX) { + spin_unlock(&tr->snapshot_trigger_lock); + return -EBUSY; + } + + tr->snapshot++; + spin_unlock(&tr->snapshot_trigger_lock); + + ret = tracing_alloc_snapshot_instance(tr); + if (ret) { + spin_lock(&tr->snapshot_trigger_lock); + tr->snapshot--; + spin_unlock(&tr->snapshot_trigger_lock); + } + + return ret; +} + +int tracing_arm_snapshot(struct trace_array *tr) +{ + int ret; + + mutex_lock(&trace_types_lock); + ret = tracing_arm_snapshot_locked(tr); + mutex_unlock(&trace_types_lock); + + return ret; +} + +void tracing_disarm_snapshot(struct trace_array *tr) +{ + spin_lock(&tr->snapshot_trigger_lock); + if (!WARN_ON(!tr->snapshot)) + tr->snapshot--; + spin_unlock(&tr->snapshot_trigger_lock); +} + /** * tracing_alloc_snapshot - allocate snapshot buffer. * @@ -1373,10 +1417,6 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, mutex_lock(&trace_types_lock); - ret = tracing_alloc_snapshot_instance(tr); - if (ret) - goto fail_unlock; - if (tr->current_trace->use_max_tr) { ret = -EBUSY; goto fail_unlock; @@ -1395,6 +1435,10 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, goto fail_unlock; } + ret = tracing_arm_snapshot_locked(tr); + if (ret) + goto fail_unlock; + local_irq_disable(); arch_spin_lock(&tr->max_lock); tr->cond_snapshot = cond_snapshot; @@ -1439,6 +1483,8 @@ int tracing_snapshot_cond_disable(struct trace_array *tr) arch_spin_unlock(&tr->max_lock); local_irq_enable(); + tracing_disarm_snapshot(tr); + return ret; } EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable); @@ -1481,6 +1527,7 @@ int tracing_snapshot_cond_disable(struct trace_array *tr) } EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable); #define free_snapshot(tr) do { } while (0) +#define tracing_arm_snapshot_locked(tr) ({ -EBUSY; }) #endif /* CONFIG_TRACER_SNAPSHOT */ void tracer_tracing_off(struct trace_array *tr) @@ -6593,11 +6640,12 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf) */ synchronize_rcu(); free_snapshot(tr); + tracing_disarm_snapshot(tr); } - if (t->use_max_tr && !tr->allocated_snapshot) { - ret = tracing_alloc_snapshot_instance(tr); - if (ret < 0) + if (t->use_max_tr) { + ret = tracing_arm_snapshot_locked(tr); + if (ret) goto out; } #else @@ -6606,8 +6654,13 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf) if (t->init) { ret = tracer_init(t, tr); - if (ret) + if (ret) { +#ifdef CONFIG_TRACER_MAX_TRACE + if (t->use_max_tr) + tracing_disarm_snapshot(tr); +#endif goto out; + } } tr->current_trace = t; @@ -7709,10 +7762,11 @@ tracing_snapshot_write(struct file *filp, const 
char __user *ubuf, size_t cnt, if (tr->allocated_snapshot) ret = resize_buffer_duplicate_size(&tr->max_buffer, &tr->array_buffer, iter->cpu_file); - else - ret = tracing_alloc_snapshot_instance(tr); - if (ret < 0) + + ret = tracing_arm_snapshot_locked(tr); + if (ret) break; + /* Now, we're going to swap */ if (iter->cpu_file == RING_BUFFER_ALL_CPUS) { local_irq_disable(); @@ -7722,6 +7776,7 @@ tracing_snapshot_write(struct file *filp, const cha
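As far as the calling pattern goes, a hypothetical user of the new helpers
would bracket its snapshot usage like the sketch below (illustrative only,
not code from the series):

	/* Hypothetical snapshot user, kernel context. */
	static int my_feature_enable(struct trace_array *tr)
	{
		int ret;

		/* take a snapshot reference; fails once the buffer is mmapped */
		ret = tracing_arm_snapshot(tr);
		if (ret)
			return ret;

		/* ... set up whatever relies on snapshot swaps ... */
		return 0;
	}

	static void my_feature_disable(struct trace_array *tr)
	{
		/* drop the reference taken in my_feature_enable() */
		tracing_disarm_snapshot(tr);
	}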
[PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice, which
is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each subbuffer
constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
  ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.

Signed-off-by: Vincent Donnefort

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index 182e05a3004a..7330249257e7 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };

+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4ebf4d0bd14c..36b62cf2fb3f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr,
 		return;
 	}

+	if (tr->mapped) {
+		trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+		return;
+	}
+
 	local_irq_save(flags);
 	update_max_tr(tr, current, smp_processor_id(), cond_data);
 	local_irq_restore(flags);
@@ -1307,7 +1313,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr)
 	lockdep_assert_held(&trace_types_lock);

 	spin_lock(&tr->snapshot_trigger_lock);
-	if (tr->snapshot == UINT_MAX) {
+	if (tr->snapshot == UINT_MAX || tr->mapped) {
 		spin_unlock(&tr->snapshot_trigger_lock);
 		return -EBUSY;
 	}
@@ -6533,7 +6539,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
 	if (tr->current_trace == &nop_trace)
 		return;
-
+
 	tr->current_trace->enabled--;

 	if (tr->current_trace->reset)
@@ -8652,15 +8658,31 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 	return ret;
 }

-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct ftrace_buffer_info *info = file->private_data;
 	struct trace_iterator *iter = &info->iter;
+	int err;
+
+	if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+		if (!(file->f_flags & O_NONBLOCK)) {
+			err = ring_buffer_wait(iter->array_buffer->buffer,
+					       iter->cpu_file,
+					       iter->tr->buffer_percent);
+			if (err)
+				return err;
+		}

-	if (cmd)
-		return -ENOIOCTLCMD;
+		return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+						  iter->cpu_file);
+	} else if (cmd) {
+		return -ENOTTY;
+	}

+	/*
+	 * An ioctl call with cmd 0 to the ring buffer file will wake up all
+	 * waiters
+	 */
 	mutex_lock(&trace_types_lock);

 	iter->wait_index++;
@@ -8673,6 +8695,97 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned
 	return 0;
 }

+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+	struct ftrace_buffer_info *info =
vmf->vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + vm_fault_t ret = VM_FAULT_SIGBUS; + struct page *page; + + page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file, +vmf->pgoff); + if (!page) + return ret; + + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + struct trace_array *tr = iter->tr; + + ring_buffer_unmap(iter->array
[PATCH v14 5/6] Documentation: tracing: Add ring-buffer mapping
It is now possible to mmap() a ring-buffer to stream its content. Add some
documentation and a code example.

Signed-off-by: Vincent Donnefort

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
    timerlat-tracer
    intel_th
    ring-buffer-design
+   ring-buffer-map
    stm
    sys-t
    coresight/index

diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..628254e63830
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort
+
+Overview
+========
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which
+can be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID.
+It is therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+        reader_id = meta->reader->id;
+        reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one
+using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl
+also updates the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or each
+subbuf). It is also not possible to use snapshot or splice.
+
+Concurrent readers (either another application mapping that ring-buffer or
+the kernel with trace_pipe) are allowed but not recommended. They will
+compete for the ring-buffer and the output is unpredictable.
+
+Example
+=======
+
+.. code-block:: c
+
+        #include <fcntl.h>
+        #include <stdio.h>
+        #include <stdlib.h>
+        #include <unistd.h>
+
+        #include <linux/trace_mmap.h>
+
+        #include <sys/mman.h>
+        #include <sys/ioctl.h>
+
+        #define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+        int main(void)
+        {
+                int page_size = getpagesize(), fd, reader_id;
+                unsigned long meta_len, data_len;
+                struct trace_buffer_meta *meta;
+                void *map, *reader, *data;
+
+                fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+                if (fd < 0)
+                        exit(EXIT_FAILURE);
+
+                map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+                if (map == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                meta = (struct trace_buffer_meta *)map;
+                meta_len = meta->meta_page_size;
+
+                printf("entries:    %llu\n", meta->entries);
+                printf("overrun:    %llu\n", meta->overrun);
+                printf("read:       %llu\n", meta->read);
+                printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+                data_len = meta->subbuf_size * meta->nr_subbufs;
+                data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+                if (data == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+                        exit(EXIT_FAILURE);
+
+                reader_id = meta->reader.id;
+                reader = data + meta->subbuf_size * reader_id;
+
+                printf("Current reader address: %p\n", reader);
+
+                munmap(data, data_len);
+                munmap(meta, meta_len);
+                close(fd);
+
+                return 0;
+        }
--
2.43.0.594.gd9cf4e227d-goog
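A possible follow-up to the documentation example (a sketch of mine, not
part of the patch): once a reader sub-buffer is selected, its raw payload
can be walked from the mapping. A real consumer would parse the sub-buffer
header for the committed length; this naive version just hex-dumps from the
already-read offset to the end of the sub-buffer:

	#include <stdint.h>
	#include <stdio.h>

	#include <linux/trace_mmap.h>

	/* Sketch: dump the reader sub-buffer, given the meta and data
	 * mappings obtained as in the example above. */
	static void dump_reader(const struct trace_buffer_meta *meta, const void *data)
	{
		const uint8_t *reader = (const uint8_t *)data +
					(size_t)meta->subbuf_size * meta->reader.id;
		uint32_t i;

		/* meta->reader.read = bytes already consumed on this sub-buffer */
		for (i = meta->reader.read; i < meta->subbuf_size; i++)
			printf("%02x ", reader[i]);
		printf("\n");
	}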
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 11:34, Vincent Donnefort wrote: Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is imposible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Mapping will prevent snapshot and buffer size modifications. How are the kernel linear mapping and the userspace mapping made coherent on architectures with virtually aliasing data caches ? Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Thanks, Mathieu Signed-off-by: Vincent Donnefort diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h index 182e05a3004a..7330249257e7 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -43,4 +43,6 @@ struct trace_buffer_meta { __u64 Reserved2; }; +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) + #endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 4ebf4d0bd14c..36b62cf2fb3f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr, return; } + if (tr->mapped) { + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); + return; + } + local_irq_save(flags); update_max_tr(tr, current, smp_processor_id(), cond_data); local_irq_restore(flags); @@ -1307,7 +1313,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr) lockdep_assert_held(&trace_types_lock); spin_lock(&tr->snapshot_trigger_lock); - if (tr->snapshot == UINT_MAX) { + if (tr->snapshot == UINT_MAX || tr->mapped) { spin_unlock(&tr->snapshot_trigger_lock); return -EBUSY; } @@ -6533,7 +6539,7 @@ static void tracing_set_nop(struct trace_array *tr) { if (tr->current_trace == &nop_trace) return; - + tr->current_trace->enabled--; if (tr->current_trace->reset) @@ -8652,15 +8658,31 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, return ret; } -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct ftrace_buffer_info *info = file->private_data; struct trace_iterator *iter = &info->iter; + int err; + + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err = ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent); + if (err) + return err; + } - if (cmd) - return -ENOIOCTLCMD; + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + /* +* An ioctl call with cmd 0 to the ring buffer file will wake up all +* waiters +*/ 
mutex_lock(&trace_types_lock); iter->wait_index++; @@ -8673,6 +8695,97 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned return 0; } +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) +{ + struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + vm_fault_t ret = VM_FAULT_SIGBUS; + struct page *page; + + page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file, +vmf->pgoff); + if (!page) + return ret; + + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return
Re: [PATCH] nvdimm: make nvdimm_bus_type const
On 2/4/24 1:20 PM, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the nvdimm_bus_type variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Dave Jiang > --- > drivers/nvdimm/bus.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c > index ef3d0f83318b..508aed017ddc 100644 > --- a/drivers/nvdimm/bus.c > +++ b/drivers/nvdimm/bus.c > @@ -271,7 +271,7 @@ EXPORT_SYMBOL_GPL(nvdimm_clear_poison); > > static int nvdimm_bus_match(struct device *dev, struct device_driver *drv); > > -static struct bus_type nvdimm_bus_type = { > +static const struct bus_type nvdimm_bus_type = { > .name = "nd", > .uevent = nvdimm_bus_uevent, > .match = nvdimm_bus_match, > > --- > base-commit: a085a5eb6594a3ebe5c275e9c2c2d341f686c23c > change-id: 20240204-bus_cleanup-nvdimm-91771693bd4d > > Best regards,
[PATCH v2 0/5] K3 DSP Remoteproc remove cleanup
Hello all, This series uses various devm_ helpers to simplify the device removal path. Removing an unused var "ret1" got squashed into the wrong patch in the v1 series causing a bisectability error. v2 is based on -next with the first 3 already taken. These are the last 5 patches of the v1 series[0]. Thanks, Andrew [0] https://lore.kernel.org/lkml/20240123184913.725435-4-...@ti.com/T/ Andrew Davis (5): remoteproc: k3-dsp: Use devm_ti_sci_get_by_phandle() helper remoteproc: k3-dsp: Use devm_kzalloc() helper remoteproc: k3-dsp: Add devm action to release tsp remoteproc: k3-dsp: Use devm_ioremap_wc() helper remoteproc: k3-dsp: Use devm_rproc_add() helper drivers/remoteproc/ti_k3_dsp_remoteproc.c | 116 ++ 1 file changed, 32 insertions(+), 84 deletions(-) -- 2.39.2
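For readers unfamiliar with the devm action pattern used throughout the
series, a generic sketch (not code from the patches): a release callback
registered at probe time runs automatically, in reverse registration order,
on unbind or probe failure:

	#include <linux/device.h>

	static void my_release(void *data)
	{
		/* undo whatever was acquired for @data */
	}

	static int my_probe_step(struct device *dev, void *resource)
	{
		/*
		 * On success the release runs automatically at unbind time;
		 * on registration failure it runs immediately ("or_reset"),
		 * so no manual error path is needed.
		 */
		return devm_add_action_or_reset(dev, my_release, resource);
	}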
[PATCH v2 1/5] remoteproc: k3-dsp: Use devm_ti_sci_get_by_phandle() helper
Use the device lifecycle managed TI-SCI get() function. This helps prevent mistakes like not put()'ing in the wrong order in cleanup functions and forgetting to put() on error paths. Signed-off-by: Andrew Davis --- drivers/remoteproc/ti_k3_dsp_remoteproc.c | 32 +++ 1 file changed, 9 insertions(+), 23 deletions(-) diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c index a13552c71f440..64ec5759c4ec1 100644 --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c @@ -708,30 +708,24 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) kproc->dev = dev; kproc->data = data; - kproc->ti_sci = ti_sci_get_by_phandle(np, "ti,sci"); + kproc->ti_sci = devm_ti_sci_get_by_phandle(dev, "ti,sci"); if (IS_ERR(kproc->ti_sci)) return dev_err_probe(dev, PTR_ERR(kproc->ti_sci), "failed to get ti-sci handle\n"); ret = of_property_read_u32(np, "ti,sci-dev-id", &kproc->ti_sci_id); - if (ret) { - dev_err_probe(dev, ret, "missing 'ti,sci-dev-id' property\n"); - goto put_sci; - } + if (ret) + return dev_err_probe(dev, ret, "missing 'ti,sci-dev-id' property\n"); kproc->reset = devm_reset_control_get_exclusive(dev, NULL); - if (IS_ERR(kproc->reset)) { - ret = dev_err_probe(dev, PTR_ERR(kproc->reset), - "failed to get reset\n"); - goto put_sci; - } + if (IS_ERR(kproc->reset)) + return dev_err_probe(dev, PTR_ERR(kproc->reset), +"failed to get reset\n"); kproc->tsp = k3_dsp_rproc_of_get_tsp(dev, kproc->ti_sci); - if (IS_ERR(kproc->tsp)) { - ret = dev_err_probe(dev, PTR_ERR(kproc->tsp), - "failed to construct ti-sci proc control\n"); - goto put_sci; - } + if (IS_ERR(kproc->tsp)) + return dev_err_probe(dev, PTR_ERR(kproc->tsp), +"failed to construct ti-sci proc control\n"); ret = ti_sci_proc_request(kproc->tsp); if (ret < 0) { @@ -805,10 +799,6 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret1)); free_tsp: kfree(kproc->tsp); -put_sci: - ret1 = ti_sci_put_handle(kproc->ti_sci); - if (ret1) - dev_err(dev, "failed to put ti_sci handle (%pe)\n", ERR_PTR(ret1)); return ret; } @@ -836,10 +826,6 @@ static void k3_dsp_rproc_remove(struct platform_device *pdev) kfree(kproc->tsp); - ret = ti_sci_put_handle(kproc->ti_sci); - if (ret) - dev_err(dev, "failed to put ti_sci handle (%pe)\n", ERR_PTR(ret)); - k3_dsp_reserved_mem_exit(kproc); } -- 2.39.2
[PATCH v2 3/5] remoteproc: k3-dsp: Add devm action to release tsp
Use a device lifecycle managed action to release the tsp ti_sci_proc
handle. This helps prevent mistakes like releasing out of order in cleanup
functions and forgetting to release on error paths.

Signed-off-by: Andrew Davis
---
 drivers/remoteproc/ti_k3_dsp_remoteproc.c | 27 +++
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c
index b9332c66a52ab..800c8c6767086 100644
--- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c
@@ -646,6 +646,13 @@ static void k3_dsp_reserved_mem_exit(struct k3_dsp_rproc *kproc)
 		iounmap(kproc->rmem[i].cpu_addr);
 }

+static void k3_dsp_release_tsp(void *data)
+{
+	struct ti_sci_proc *tsp = data;
+
+	ti_sci_proc_release(tsp);
+}
+
 static struct ti_sci_proc *k3_dsp_rproc_of_get_tsp(struct device *dev,
 						   const struct ti_sci_handle *sci)
@@ -682,7 +689,6 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev)
 	const char *fw_name;
 	bool p_state = false;
 	int ret = 0;
-	int ret1;

 	data = of_device_get_match_data(dev);
 	if (!data)
@@ -732,16 +738,17 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev)
 		dev_err_probe(dev, ret, "ti_sci_proc_request failed\n");
 		return ret;
 	}
+	ret = devm_add_action_or_reset(dev, k3_dsp_release_tsp, kproc->tsp);
+	if (ret)
+		return ret;

 	ret = k3_dsp_rproc_of_get_memories(pdev, kproc);
 	if (ret)
-		goto release_tsp;
+		return ret;

 	ret = k3_dsp_reserved_mem_init(kproc);
-	if (ret) {
-		dev_err_probe(dev, ret, "reserved memory init failed\n");
-		goto release_tsp;
-	}
+	if (ret)
+		return dev_err_probe(dev, ret, "reserved memory init failed\n");

 	ret = kproc->ti_sci->ops.dev_ops.is_on(kproc->ti_sci, kproc->ti_sci_id,
 					       NULL, &p_state);
@@ -793,10 +800,6 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev)

 release_mem:
 	k3_dsp_reserved_mem_exit(kproc);
-release_tsp:
-	ret1 = ti_sci_proc_release(kproc->tsp);
-	if (ret1)
-		dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret1));
 	return ret;
 }

@@ -818,10 +821,6 @@ static void k3_dsp_rproc_remove(struct platform_device *pdev)

 	rproc_del(kproc->rproc);

-	ret = ti_sci_proc_release(kproc->tsp);
-	if (ret)
-		dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret));
-
 	k3_dsp_reserved_mem_exit(kproc);
 }
--
2.39.2
[PATCH v2 2/5] remoteproc: k3-dsp: Use devm_kzalloc() helper
Use device lifecycle managed devm_kzalloc() helper function. This helps prevent mistakes like freeing out of order in cleanup functions and forgetting to free on all error paths. Signed-off-by: Andrew Davis --- drivers/remoteproc/ti_k3_dsp_remoteproc.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c index 64ec5759c4ec1..b9332c66a52ab 100644 --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c @@ -659,7 +659,7 @@ struct ti_sci_proc *k3_dsp_rproc_of_get_tsp(struct device *dev, if (ret < 0) return ERR_PTR(ret); - tsp = kzalloc(sizeof(*tsp), GFP_KERNEL); + tsp = devm_kzalloc(dev, sizeof(*tsp), GFP_KERNEL); if (!tsp) return ERR_PTR(-ENOMEM); @@ -730,7 +730,7 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) ret = ti_sci_proc_request(kproc->tsp); if (ret < 0) { dev_err_probe(dev, ret, "ti_sci_proc_request failed\n"); - goto free_tsp; + return ret; } ret = k3_dsp_rproc_of_get_memories(pdev, kproc); @@ -797,8 +797,6 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) ret1 = ti_sci_proc_release(kproc->tsp); if (ret1) dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret1)); -free_tsp: - kfree(kproc->tsp); return ret; } @@ -824,8 +822,6 @@ static void k3_dsp_rproc_remove(struct platform_device *pdev) if (ret) dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret)); - kfree(kproc->tsp); - k3_dsp_reserved_mem_exit(kproc); } -- 2.39.2
[PATCH v2 5/5] remoteproc: k3-dsp: Use devm_rproc_add() helper
Use device lifecycle managed devm_rproc_add() helper function. This helps prevent mistakes like deleting out of order in cleanup functions and forgetting to delete on all error paths. Signed-off-by: Andrew Davis --- drivers/remoteproc/ti_k3_dsp_remoteproc.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c index f799f74734b4a..3555b535b1683 100644 --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c @@ -768,7 +768,7 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) } } - ret = rproc_add(rproc); + ret = devm_rproc_add(dev, rproc); if (ret) return dev_err_probe(dev, ret, "failed to add register device with remoteproc core\n"); @@ -786,14 +786,9 @@ static void k3_dsp_rproc_remove(struct platform_device *pdev) if (rproc->state == RPROC_ATTACHED) { ret = rproc_detach(rproc); - if (ret) { - /* Note this error path leaks resources */ + if (ret) dev_err(dev, "failed to detach proc (%pe)\n", ERR_PTR(ret)); - return; - } } - - rproc_del(kproc->rproc); } static const struct k3_dsp_mem_data c66_mems[] = { -- 2.39.2
[PATCH v2 4/5] remoteproc: k3-dsp: Use devm_ioremap_wc() helper
Use a device lifecycle managed ioremap helper function. This helps prevent mistakes like unmapping out of order in cleanup functions and forgetting to unmap on all error paths. Signed-off-by: Andrew Davis --- drivers/remoteproc/ti_k3_dsp_remoteproc.c | 48 +-- 1 file changed, 10 insertions(+), 38 deletions(-) diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c index 800c8c6767086..f799f74734b4a 100644 --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c @@ -598,16 +598,13 @@ static int k3_dsp_reserved_mem_init(struct k3_dsp_rproc *kproc) /* use remaining reserved memory regions for static carveouts */ for (i = 0; i < num_rmems; i++) { rmem_np = of_parse_phandle(np, "memory-region", i + 1); - if (!rmem_np) { - ret = -EINVAL; - goto unmap_rmem; - } + if (!rmem_np) + return -EINVAL; rmem = of_reserved_mem_lookup(rmem_np); if (!rmem) { of_node_put(rmem_np); - ret = -EINVAL; - goto unmap_rmem; + return -EINVAL; } of_node_put(rmem_np); @@ -615,12 +612,11 @@ static int k3_dsp_reserved_mem_init(struct k3_dsp_rproc *kproc) /* 64-bit address regions currently not supported */ kproc->rmem[i].dev_addr = (u32)rmem->base; kproc->rmem[i].size = rmem->size; - kproc->rmem[i].cpu_addr = ioremap_wc(rmem->base, rmem->size); + kproc->rmem[i].cpu_addr = devm_ioremap_wc(dev, rmem->base, rmem->size); if (!kproc->rmem[i].cpu_addr) { dev_err(dev, "failed to map reserved memory#%d at %pa of size %pa\n", i + 1, &rmem->base, &rmem->size); - ret = -ENOMEM; - goto unmap_rmem; + return -ENOMEM; } dev_dbg(dev, "reserved memory%d: bus addr %pa size 0x%zx va %pK da 0x%x\n", @@ -631,19 +627,6 @@ static int k3_dsp_reserved_mem_init(struct k3_dsp_rproc *kproc) kproc->num_rmems = num_rmems; return 0; - -unmap_rmem: - for (i--; i >= 0; i--) - iounmap(kproc->rmem[i].cpu_addr); - return ret; -} - -static void k3_dsp_reserved_mem_exit(struct k3_dsp_rproc *kproc) -{ - int i; - - for (i = 0; i < kproc->num_rmems; i++) - iounmap(kproc->rmem[i].cpu_addr); } static void k3_dsp_release_tsp(void *data) @@ -752,10 +735,8 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) ret = kproc->ti_sci->ops.dev_ops.is_on(kproc->ti_sci, kproc->ti_sci_id, NULL, &p_state); - if (ret) { - dev_err_probe(dev, ret, "failed to get initial state, mode cannot be determined\n"); - goto release_mem; - } + if (ret) + return dev_err_probe(dev, ret, "failed to get initial state, mode cannot be determined\n"); /* configure J721E devices for either remoteproc or IPC-only mode */ if (p_state) { @@ -779,8 +760,7 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) if (data->uses_lreset) { ret = reset_control_status(kproc->reset); if (ret < 0) { - dev_err_probe(dev, ret, "failed to get reset status\n"); - goto release_mem; + return dev_err_probe(dev, ret, "failed to get reset status\n"); } else if (ret == 0) { dev_warn(dev, "local reset is deasserted for device\n"); k3_dsp_rproc_reset(kproc); @@ -789,18 +769,12 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) } ret = rproc_add(rproc); - if (ret) { - dev_err_probe(dev, ret, "failed to add register device with remoteproc core\n"); - goto release_mem; - } + if (ret) + return dev_err_probe(dev, ret, "failed to add register device with remoteproc core\n"); platform_set_drvdata(pdev, kproc); return 0; - -release_mem: - k3_dsp_reserved_mem_exit(kproc); - return ret; } static void k3_dsp_rproc_remove(struct platform_device *pdev) @@ -820,8 +794,6 @@ static void k3_dsp_rproc_remove(struct platform_device 
*pdev) } rproc_del(kproc->rproc); - - k3_dsp_reserved_mem_exit(kproc); } static const struct k3_dsp_mem_data c66_mems[] = { -- 2.39.2
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: > On 2024-02-05 11:34, Vincent Donnefort wrote: > > Currently, user-space extracts data from the ring-buffer via splice, > > which is handy for storage or network sharing. However, due to splice > > limitations, it is imposible to do real-time analysis without a copy. > > > > A solution for that problem is to let the user-space map the ring-buffer > > directly. > > > > The mapping is exposed via the per-CPU file trace_pipe_raw. The first > > element of the mapping is the meta-page. It is followed by each > > subbuffer constituting the ring-buffer, ordered by their unique page ID: > > > >* Meta-page -- include/uapi/linux/trace_mmap.h for a description > >* Subbuf ID 0 > >* Subbuf ID 1 > > ... > > > > It is therefore easy to translate a subbuf ID into an offset in the > > mapping: > > > >reader_id = meta->reader->id; > >reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; > > > > When new data is available, the mapper must call a newly introduced ioctl: > > TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to > > point to the next reader containing unread data. > > > > Mapping will prevent snapshot and buffer size modifications. > > How are the kernel linear mapping and the userspace mapping made coherent > on architectures with virtually aliasing data caches ? > > Ref. > https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Hi Mathieu, Thanks for the pointer. We are in the exact same problem as DAX. We do modify the data through the kernel linear mapping while user-space can read it through its own. I should probably return an error when used with any of the arch ARM || SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grain differentiation. 
> > Thanks, > > Mathieu > > > > > Signed-off-by: Vincent Donnefort > > > > diff --git a/include/uapi/linux/trace_mmap.h > > b/include/uapi/linux/trace_mmap.h > > index 182e05a3004a..7330249257e7 100644 > > --- a/include/uapi/linux/trace_mmap.h > > +++ b/include/uapi/linux/trace_mmap.h > > @@ -43,4 +43,6 @@ struct trace_buffer_meta { > > __u64 Reserved2; > > }; > > +#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1) > > + > > #endif /* _TRACE_MMAP_H_ */ > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > > index 4ebf4d0bd14c..36b62cf2fb3f 100644 > > --- a/kernel/trace/trace.c > > +++ b/kernel/trace/trace.c > > @@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct > > trace_array *tr, > > return; > > } > > + if (tr->mapped) { > > + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); > > + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); > > + return; > > + } > > + > > local_irq_save(flags); > > update_max_tr(tr, current, smp_processor_id(), cond_data); > > local_irq_restore(flags); > > @@ -1307,7 +1313,7 @@ static int tracing_arm_snapshot_locked(struct > > trace_array *tr) > > lockdep_assert_held(&trace_types_lock); > > spin_lock(&tr->snapshot_trigger_lock); > > - if (tr->snapshot == UINT_MAX) { > > + if (tr->snapshot == UINT_MAX || tr->mapped) { > > spin_unlock(&tr->snapshot_trigger_lock); > > return -EBUSY; > > } > > @@ -6533,7 +6539,7 @@ static void tracing_set_nop(struct trace_array *tr) > > { > > if (tr->current_trace == &nop_trace) > > return; > > - > > + > > tr->current_trace->enabled--; > > if (tr->current_trace->reset) > > @@ -8652,15 +8658,31 @@ tracing_buffers_splice_read(struct file *file, > > loff_t *ppos, > > return ret; > > } > > -/* An ioctl call with cmd 0 to the ring buffer file will wake up all > > waiters */ > > static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, > > unsigned long arg) > > { > > struct ftrace_buffer_info *info = file->private_data; > > struct trace_iterator *iter = &info->iter; > > + int err; > > + > > + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { > > + if (!(file->f_flags & O_NONBLOCK)) { > > + err = ring_buffer_wait(iter->array_buffer->buffer, > > + iter->cpu_file, > > + iter->tr->buffer_percent); > > + if (err) > > + return err; > > + } > > - if (cmd) > > - return -ENOIOCTLCMD; > > + return ring_buffer_map_get_reader(iter->array_buffer->buffer, > > + iter->cpu_file); > > + } else if (cmd) { > > + return -ENOTTY; > > + } > > + /* > > +* An ioctl call with cmd 0 to the ring buffer file will wake up all > > +* waiters > > +*/ > > mutex_lock(&trace_types_lock); > > iter->wait_index++; > > @@ -8673,6 +8695,97 @@ static long tracing_buffer
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 13:34, Vincent Donnefort wrote:

On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote:
[...]

How are the kernel linear mapping and the userspace mapping made coherent
on architectures with virtually aliasing data caches ?

Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t

Hi Mathieu,

Thanks for the pointer.

We are in the exact same problem as DAX. We do modify the data through the
kernel linear mapping while user-space can read it through its own. I
should probably return an error when used with any of the arch ARM ||
SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grain
differentiation.

You might want to use LTTng's ring buffer approach instead. See
https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202
lib_ring_buffer_flush_read_subbuf_dcache()

Basically, whenever user-space grabs a sub-buffer for reading (through
lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng calls
flush_dcache_page() on all pages of this subbuffer (I should really change
this for a call to flush_dcache_folio() which would be more efficient).

Note that doing this is not very efficient on architectures which have
coherent data caches and incoherent dcache vs icache: in that case, we
issue the flush_dcache_page uselessly. I plan on using the new
cpu_dcache_is_aliasing() check once/if it makes its way upstream to remove
those useless flushes on architectures which define
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the data
cache.

The equivalent of LTTng's "get subbuf" operation would be the new
TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
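Translated to the ftrace side, that approach would amount to something like
the following sketch (hypothetical helper, kernel context), run when
TRACE_MMAP_IOCTL_GET_READER hands a sub-buffer to user-space:

	#include <linux/highmem.h>	/* flush_dcache_page() */

	/*
	 * Hypothetical: make kernel-side writes to one sub-buffer visible
	 * to a user mapping on a virtually aliasing D-cache.
	 */
	static void rb_flush_subbuf_dcache(struct page **pages, int nr_pages)
	{
		int i;

		for (i = 0; i < nr_pages; i++)
			flush_dcache_page(pages[i]);
	}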
Re: [PATCH] virtio: make virtio_bus const
On Sun, Feb 04, 2024 at 05:52:51PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the virtio_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH] vdpa: make vdpa_bus const
On Sun, Feb 04, 2024 at 05:50:45PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the vdpa_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH] rpmsg: core: make rpmsg_bus const
On Sun, Feb 04, 2024 at 05:32:05PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the rpmsg_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH] nvdimm: make nvdimm_bus_type const
On Sun, Feb 04, 2024 at 05:20:07PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the nvdimm_bus_type variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH] device-dax: make dax_bus_type const
On Sun, Feb 04, 2024 at 01:07:11PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the dax_bus_type variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On Mon, Feb 05, 2024 at 01:44:47PM -0500, Mathieu Desnoyers wrote: > On 2024-02-05 13:34, Vincent Donnefort wrote: > > On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: > [...] > > > > > > > How are the kernel linear mapping and the userspace mapping made coherent > > > on architectures with virtually aliasing data caches ? > > > > > > Ref. > > > https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t > > > > Hi Mathieu, > > > > Thanks for the pointer. > > > > We are in the exact same problem as DAX. We do modify the data through the > > kernel linear mapping while user-space can read it through its own. I should > > probably return an error when used with any of the arch ARM || SPARC || > > MIPS, > > until cpu_dcache_is_aliasing() introduces a fine-grain differentiation. > > You might want to use LTTng's ring buffer approach instead. See > > https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202 > > lib_ring_buffer_flush_read_subbuf_dcache() Thanks! > > Basically, whenever user-space grabs a sub-buffer for reading (through > lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng > calls flush_dcache_page() on all pages of this subbuffer (I should > really change this for a call to flush_dcache_folio() which would be > more efficient). > > Note that doing this is not very efficient on architectures which have > coherent data caches and incoherent dcache vs icache: in that case, > we issue the flush_dcache_page uselessly. I plan on using the new > cpu_dcache_is_aliasing() check once/if it makes it way upstream to > remove those useless flushes on architectures which define > ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the > data cache. I believe the aim is to use the mapping by default in libtracefs and fallback to splice whenever not available... But for those arch, I guess that might be a mistake. Wonder if then it isn't just better to return ENOTSUPP? > > The equivalent of LTTng's "get subbuf" operation would be > the new TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU. That is correct! > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
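Gating the mapping on such architectures could be as small as the sketch
below, assuming cpu_dcache_is_aliasing() from Mathieu's series lands
upstream (hypothetical code; -EOPNOTSUPP used as the user-visible error):

	/* Hypothetical gate at the top of the mmap path (kernel context). */
	static int tracing_buffers_mmap_sketch(struct file *filp,
					       struct vm_area_struct *vma)
	{
		/* a user mapping would virtually alias the kernel linear map */
		if (cpu_dcache_is_aliasing())
			return -EOPNOTSUPP;

		/* ... proceed with the actual mapping ... */
		return 0;
	}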
Re: [PATCH v2 0/5] K3 DSP Remoteproc remove cleanup
On Mon, Feb 05, 2024 at 12:27:48PM -0600, Andrew Davis wrote: > Hello all, > > This series uses various devm_ helpers to simplify the device > removal path. > > Removing an unused var "ret1" got squashed into the wrong patch in > the v1 series causing a bisectability error. v2 is based on -next > with the first 3 already taken. These are the last 5 patches of the > v1 series[0]. > > Thanks, > Andrew > > [0] https://lore.kernel.org/lkml/20240123184913.725435-4-...@ti.com/T/ > > Andrew Davis (5): > remoteproc: k3-dsp: Use devm_ti_sci_get_by_phandle() helper > remoteproc: k3-dsp: Use devm_kzalloc() helper > remoteproc: k3-dsp: Add devm action to release tsp > remoteproc: k3-dsp: Use devm_ioremap_wc() helper > remoteproc: k3-dsp: Use devm_rproc_add() helper > > drivers/remoteproc/ti_k3_dsp_remoteproc.c | 116 ++ > 1 file changed, 32 insertions(+), 84 deletions(-) > I have applied this set. Thanks, Mathieu > -- > 2.39.2 >
Re: [PATCH] rpmsg: core: make rpmsg_bus const
On Sun, Feb 04, 2024 at 05:32:05PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the rpmsg_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere > --- > drivers/rpmsg/rpmsg_core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/rpmsg/rpmsg_core.c b/drivers/rpmsg/rpmsg_core.c > index 8abc7d022ff7..4295c01a2861 100644 > --- a/drivers/rpmsg/rpmsg_core.c > +++ b/drivers/rpmsg/rpmsg_core.c > @@ -605,7 +605,7 @@ static void rpmsg_dev_remove(struct device *dev) > rpmsg_destroy_ept(rpdev->ept); > } > > -static struct bus_type rpmsg_bus = { > +static const struct bus_type rpmsg_bus = { > .name = "rpmsg", > .match = rpmsg_dev_match, > .dev_groups = rpmsg_dev_groups, > Applied. Thanks, Mathieu > --- > base-commit: 80255b24efbe83a6a01600484b6959259a30ded5 > change-id: 20240204-bus_cleanup-rpmsg-1a5f6ab69a24 > > Best regards, > -- > Ricardo B. Marliere >
[PATCH v9 00/15] Add Cgroup support for SGX EPC memory
SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments, e.g., support for pod level control in a Kubernetes cluster on a VM or bare-metal host [1,2]. This patchset implements support for sgx_epc memory within the misc cgroup controller. A user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per cgroup. The implementation reports current usage and events of reaching the limit per cgroup as well as the total system capacity.

Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques and pages can be swapped out of the EPC to their backing store, which is normal system memory allocated via shmem and accounted by the memory controller. Similar to per-cgroup reclamation done by the memory controller, the EPC misc controller needs to implement a per-cgroup EPC reclaiming process: when the EPC usage of a cgroup reaches its hard limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out some EPC pages within the same cgroup to make room for new allocations. For that, this implementation tracks reclaimable EPC pages in a separate LRU list in each cgroup, and below are more details and justification of this design.

Track EPC pages in per-cgroup LRUs (from Dave)
----------------------------------------------

tl;dr: A cgroup hitting its limit should be as similar as possible to the system running out of EPC memory. The only two choices to implement that are nasty changes to the existing LRU scanning algorithm, or adding new LRUs. The result: add a new LRU for each cgroup and scan those instead. Replace the existing global LRU with the root cgroup's LRU (only when this new support is compiled in, obviously).

The existing EPC memory management aims to be a miniature version of the core VM where EPC memory can be overcommitted and reclaimed. EPC allocations can wait for reclaim. The alternative to waiting would have been to send a signal and let the enclave die. This series attempts to implement that same logic for cgroups, for the same reasons: it's preferable to wait for memory to become available and let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list (and the existing scanning algorithm) is essentially useless for doing reclaim when a cgroup hits its limit because the cgroup's pages are scattered around that LRU. It is unspeakably inefficient to scan a linked list with millions of entries for what could be dozens of pages from a cgroup that needs reclaim. Even if unspeakably slow reclaim was accepted, the existing scanning algorithm only picks a few pages off the head of the global LRU. It would either need to hold the list locks for unreasonable amounts of time, or be taught to scan the list in pieces, which has its own challenges.

Unreclaimable Enclave Pages
---------------------------

There are a variety of page types for enclaves, each serving different purposes [5]. Although the SGX architecture supports swapping for all types, some special pages, e.g., Version Array (VA) and Secure Enclave Control Structure (SECS) [5], hold metadata of reclaimed pages and enclaves. That makes reclamation of such pages more intricate to manage. The SGX driver's global reclaimer currently does not swap out VA pages.
It only swaps the SECS page of an enclave when all other associated pages have been swapped out. The cgroup reclaimer follows the same approach: it does not track those pages in per-cgroup LRUs and considers them unreclaimable. The allocation of these pages is counted towards the usage of a specific cgroup and is subject to the cgroup's set EPC limits.

Earlier versions of this series implemented forced enclave-killing to reclaim VA and SECS pages. That was designed to enforce the 'max' limit, particularly in scenarios where a user or administrator reduces this limit post-launch of enclaves. However, subsequent discussions [3, 4] indicated that such preemptive enforcement is not necessary for the misc controllers. Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed, and the limit is only enforced at the time of a new EPC allocation request. When a cgroup hits its limit but nothing is left in the LRUs of the subtree, i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC within that cgroup will result in an 'ENOMEM'.

Unreclaimable Guest VM EPC Pages
--------------------------------

The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats those as unreclaimable and returns ENOMEM when its limit is hit and nothing reclaimable is left within the cgroup. The virtual EPC driver translates th
[PATCH v9 03/15] cgroup/misc: Add SGX EPC resource type
From: Kristen Carlson Accardi Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type for the misc controller. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen --- V6: - Split the original patch into this and the preceding one (Kai) --- include/linux/misc_cgroup.h | 4 kernel/cgroup/misc.c| 4 2 files changed, 8 insertions(+) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 541a5611c597..2f6cc3a0ad23 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -17,6 +17,10 @@ enum misc_res_type { MISC_CG_RES_SEV, /* AMD SEV-ES ASIDs resource */ MISC_CG_RES_SEV_ES, +#endif +#ifdef CONFIG_CGROUP_SGX_EPC + /* SGX EPC memory resource */ + MISC_CG_RES_SGX_EPC, #endif MISC_CG_RES_TYPES }; diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 1f0d8e05b36c..e51d6a45007f 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -24,6 +24,10 @@ static const char *const misc_res_name[] = { /* AMD SEV-ES ASIDs resource */ "sev_es", #endif +#ifdef CONFIG_CGROUP_SGX_EPC + /* Intel SGX EPC memory bytes */ + "sgx_epc", +#endif }; /* Root misc cgroup */ -- 2.25.1
[PATCH v9 01/15] cgroup/misc: Add per resource callbacks for CSS events
From: Kristen Carlson Accardi The misc cgroup controller (subsystem) currently does not perform resource type specific action for Cgroups Subsystem State (CSS) events: the 'css_alloc' event when a cgroup is created and the 'css_free' event when a cgroup is destroyed. Define callbacks for those events and allow resource providers to register the callbacks per resource type as needed. This will be utilized later by the EPC misc cgroup support implemented in the SGX driver. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo --- V8: - Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko) V7: - Make ops one per resource type and store them in array (Michal) - Rename the ops struct to misc_res_ops, and enforce the constraints of required callback functions (Jarkko) - Moved addition of priv field to patch 4 where it was used first. (Jarkko) V6: - Create ops struct for per resource callbacks (Jarkko) - Drop max_write callback (Dave, Michal) - Style fixes (Kai) --- include/linux/misc_cgroup.h | 11 + kernel/cgroup/misc.c| 84 + 2 files changed, 87 insertions(+), 8 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index e799b1f8d05b..0806d4436208 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -27,6 +27,16 @@ struct misc_cg; #include +/** + * struct misc_res_ops: per resource type callback ops. + * @alloc: invoked for resource specific initialization when cgroup is allocated. + * @free: invoked for resource specific cleanup when cgroup is deallocated. + */ +struct misc_res_ops { + int (*alloc)(struct misc_cg *cg); + void (*free)(struct misc_cg *cg); +}; + /** * struct misc_res: Per cgroup per misc type resource * @max: Maximum limit on the resource. @@ -56,6 +66,7 @@ struct misc_cg { u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount); void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount); diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 79a3717a5803..14ab13ef3bc7 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -39,6 +39,9 @@ static struct misc_cg root_cg; */ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; +/* Resource type specific operations */ +static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; + /** * parent_misc() - Get the parent of the passed misc cgroup. * @cgroup: cgroup whose parent needs to be fetched. @@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity) } EXPORT_SYMBOL_GPL(misc_cg_set_capacity); +/** + * misc_cg_set_ops() - set resource specific operations. + * @type: Type of the misc res. + * @ops: Operations for the given type. + * + * Context: Any context. + * Return: + * * %0 - Successfully registered the operations. + * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks. 
+ */ +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops) +{ + if (!valid_type(type)) + return -EINVAL; + + if (!ops->alloc) { + pr_err("%s: alloc missing\n", __func__); + return -EINVAL; + } + + if (!ops->free) { + pr_err("%s: free missing\n", __func__); + return -EINVAL; + } + + misc_res_ops[type] = ops; + return 0; +} +EXPORT_SYMBOL_GPL(misc_cg_set_ops); + /** * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup. * @type: Misc res type in misc cg to cancel the charge from. @@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = { {} }; +static inline int _misc_cg_res_alloc(struct misc_cg *cg) +{ + enum misc_res_type i; + int ret; + + for (i = 0; i < MISC_CG_RES_TYPES; i++) { + WRITE_ONCE(cg->res[i].max, MAX_NUM); + atomic64_set(&cg->res[i].usage, 0); + if (misc_res_ops[i]) { + ret = misc_res_ops[i]->alloc(cg); + if (ret) + return ret; + } + } + + return 0; +} + +static inline void _misc_cg_res_free(struct misc_cg *cg) +{ + enum misc_res_type i; + + for (i = 0; i < MISC_CG_RES_TYPES; i++) + if (misc_res_ops[i]) + misc_res_ops[i]->free(cg); +} + /** * misc_cg_alloc() - Allocate misc cgroup. * @parent_css: Parent cgroup. @@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = { static struct cgroup_subsys_state * misc_cg_alloc(struct cgroup_subsys_state *parent_css) { - enum misc_res_type i; - struct m
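A usage sketch of the new registration API (patch 4 wires this up for SGX EPC; the example_* names below are made up for illustration):

    static int example_cg_alloc(struct misc_cg *cg)
    {
            /* resource-specific per-cgroup initialization */
            return 0;
    }

    static void example_cg_free(struct misc_cg *cg)
    {
            /* resource-specific per-cgroup cleanup */
    }

    static const struct misc_res_ops example_res_ops = {
            .alloc = example_cg_alloc,
            .free  = example_cg_free,
    };

    /* In the provider's init path; returns -EINVAL if either callback
     * is missing. */
    ret = misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &example_res_ops);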
[PATCH v9 02/15] cgroup/misc: Export APIs for SGX driver
From: Kristen Carlson Accardi The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches its own or an ancestor's limit. This requires a walk from the current cgroup up to the root, similar to misc_cg_try_charge(). Export misc_cg_parent() to enable this walk. The SGX driver may also need to start a global-level reclamation from the root. Export misc_cg_root() for the SGX driver to access. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo --- V6: - Make commit messages more concise and split the original patch into two (Kai) --- include/linux/misc_cgroup.h | 24 kernel/cgroup/misc.c| 21 - 2 files changed, 32 insertions(+), 13 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 0806d4436208..541a5611c597 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -64,6 +64,7 @@ struct misc_cg { struct misc_res res[MISC_CG_RES_TYPES]; }; +struct misc_cg *misc_cg_root(void); u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); @@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css) return css ? container_of(css, struct misc_cg, css) : NULL; } +/** + * misc_cg_parent() - Get the parent of the passed misc cgroup. + * @cgroup: cgroup whose parent needs to be fetched. + * + * Context: Any context. + * Return: + * * struct misc_cg* - Parent of the @cgroup. + * * %NULL - If @cgroup is null or the passed cgroup does not have a parent. + */ +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup) +{ + return cgroup ? css_misc(cgroup->css.parent) : NULL; +} + /* * get_current_misc_cg() - Find and get the misc cgroup of the current task. * @@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg) } #else /* !CONFIG_CGROUP_MISC */ +static inline struct misc_cg *misc_cg_root(void) +{ + return NULL; +} + +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg) +{ + return NULL; +} static inline u64 misc_cg_res_total_usage(enum misc_res_type type) { diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 14ab13ef3bc7..1f0d8e05b36c 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; /** - * parent_misc() - Get the parent of the passed misc cgroup. - * @cgroup: cgroup whose parent needs to be fetched. - * - * Context: Any context. - * Return: - * * struct misc_cg* - Parent of the @cgroup. - * * %NULL - If @cgroup is null or the passed cgroup does not have a parent. + * misc_cg_root() - Return the root misc cgroup. */ -static struct misc_cg *parent_misc(struct misc_cg *cgroup) +struct misc_cg *misc_cg_root(void) { - return cgroup ? css_misc(cgroup->css.parent) : NULL; + return &root_cg; } +EXPORT_SYMBOL_GPL(misc_cg_root); /** * valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) if (!amount) return 0; - for (i = cg; i; i = parent_misc(i)) { + for (i = cg; i; i = misc_cg_parent(i)) { res = &i->res[type]; new_usage = atomic64_add_return(amount, &res->usage); @@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) return 0; err_charge: - for (j = i; j; j = parent_misc(j)) { + for (j = i; j; j = misc_cg_parent(j)) { atomic64_inc(&j->res[type].events); cgroup_file_notify(&j->events_file); } - for (j = cg; j != i; j = parent_misc(j)) + for (j = cg; j != i; j = misc_cg_parent(j)) misc_cg_cancel_charge(type, j, amount); misc_cg_cancel_charge(type, i, amount); return ret; @@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount) if (!(amount && valid_type(type) && cg)) return; - for (i = cg; i; i = parent_misc(i)) + for (i = cg; i; i = misc_cg_parent(i)) misc_cg_cancel_charge(type, i, amount); } EXPORT_SYMBOL_GPL(misc_cg_uncharge); -- 2.25.1
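A sketch of the root-ward walk these exports enable; it mirrors the per-cgroup limit walk the EPC reclaimer performs in patch 8, with an illustrative function name:

    /* Walk from @cg to the root and return the tightest (lowest) limit
     * seen along the way. */
    static u64 example_min_limit_to_root(struct misc_cg *cg, enum misc_res_type type)
    {
            u64 m = U64_MAX;

            for (; cg; cg = misc_cg_parent(cg))
                    m = min(m, READ_ONCE(cg->res[type].max));

            return m;
    }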
[PATCH v9 06/15] x86/sgx: Abstract tracking reclaimable pages in LRU
From: Kristen Carlson Accardi The functions sgx_{mark,unmark}_page_reclaimable() manage the tracking of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly allocated page into the global LRU list while sgx_unmark_page_reclaimable() does the opposite. Abstract the hard-coded global LRU references in these functions to make them reusable when pages are tracked in per-cgroup LRUs. Create a helper, sgx_lru_list(), that returns the LRU that tracks a given EPC page. It simply returns the global LRU now, and will later return the LRU of the cgroup within which the EPC page was allocated. Replace the hard-coded global LRU with a call to this helper. Subsequent patches will first get the cgroup reclamation flow ready while keeping pages tracked in the global LRU and reclaimed by ksgxd; the final switch is made at the end by having sgx_lru_list() return the per-cgroup LRU. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen --- V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 30 ++ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 912959c7ecc9..a131aa985c95 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space); */ static struct sgx_epc_lru_list sgx_global_lru; +static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page) +{ + return &sgx_global_lru; +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ @@ -500,25 +505,24 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void) } /** - * sgx_mark_page_reclaimable() - Mark a page as reclaimable + * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU. * @page: EPC page - * - * Mark a page as reclaimable and add it to the active page list. Pages - * are automatically removed from the active list when freed. */ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_global_lru.lock); + struct sgx_epc_lru_list *lru = sgx_lru_list(page); + + spin_lock(&lru->lock); page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED; - list_add_tail(&page->list, &sgx_global_lru.reclaimable); - spin_unlock(&sgx_global_lru.lock); + list_add_tail(&page->list, &lru->reclaimable); + spin_unlock(&lru->lock); } /** - * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list + * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU * @page: EPC page * - * Clear the reclaimable flag and remove the page from the active page list. + * Clear the reclaimable flag if set and remove the page from its LRU. * * Return: * 0 on success, @@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) */ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_global_lru.lock); + struct sgx_epc_lru_list *lru = sgx_lru_list(page); + + spin_lock(&lru->lock); if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) { /* The page is being reclaimed. */ if (list_empty(&page->list)) { - spin_unlock(&sgx_global_lru.lock); + spin_unlock(&lru->lock); return -EBUSY; } list_del(&page->list); page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_global_lru.lock); + spin_unlock(&lru->lock); return 0; } -- 2.25.1
[PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
From: Kristen Carlson Accardi SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments. For example, in a Kubernetes environment, a user can request certain EPC quota for a pod but the orchestrator cannot enforce the quota to limit runtime EPC usage of the pod without an EPC cgroup controller. Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to limit and track EPC allocations per cgroup. Earlier patches have added the "sgx_epc" resource type in the misc cgroup subsystem. Add basic support in the SGX driver as the "sgx_epc" resource provider: - Set "capacity" of EPC by calling misc_cg_set_capacity() - Update EPC usage counter, "current", by calling charge and uncharge APIs for EPC allocation and deallocation, respectively. - Set up sgx_epc resource type specific callbacks, which perform initialization and cleanup during cgroup allocation and deallocation, respectively. With these changes, the misc cgroup controller enables users to set a hard limit for EPC usage in the "misc.max" interface file. It reports current usage in "misc.current", the total EPC memory available in "misc.capacity", and the number of times EPC usage reached the max limit in "misc.events". For now, the EPC cgroup simply blocks additional EPC allocation in sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are still tracked in the global active list, only reclaimed by the global reclaimer when the total free page count is lower than a threshold. Later patches will reorganize the tracking and reclamation code in the global reclaimer and implement per-cgroup tracking and reclaiming. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo --- V8: - Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko) - Remove extra space, '_INTEL'. (Jarkko) V7: - Use a static for root cgroup (Kai) - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai) - Correct check for charge API return (Kai) - Start initialization in SGX device driver init (Kai) - Remove unneeded BUG_ON (Kai) - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai) V6: - Split the original large patch "Limit process EPC usage with misc cgroup controller" and restructure it (Kai) --- arch/x86/Kconfig | 13 + arch/x86/kernel/cpu/sgx/Makefile | 1 + arch/x86/kernel/cpu/sgx/epc_cgroup.c | 74 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 73 +++ arch/x86/kernel/cpu/sgx/main.c | 52 ++- arch/x86/kernel/cpu/sgx/sgx.h| 5 ++ include/linux/misc_cgroup.h | 2 + 7 files changed, 218 insertions(+), 2 deletions(-) create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5edec175b9bf..10c3d1d099b2 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1947,6 +1947,19 @@ config X86_SGX If unsure, say N. +config CGROUP_SGX_EPC + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX" + depends on X86_SGX && CGROUP_MISC + help + Provides control over the EPC footprint of tasks in a cgroup via + the Miscellaneous cgroup controller.
+ + EPC is a subset of regular memory that is usable only by SGX + enclaves and is very limited in quantity, e.g. less than 1% + of total DRAM. + + Say N if unsure. + config X86_USER_SHADOW_STACK bool "X86 userspace shadow stack" depends on AS_WRUSS diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile index 9c1656779b2a..12901a488da7 100644 --- a/arch/x86/kernel/cpu/sgx/Makefile +++ b/arch/x86/kernel/cpu/sgx/Makefile @@ -4,3 +4,4 @@ obj-y += \ ioctl.o \ main.o obj-$(CONFIG_X86_SGX_KVM) += virt.o +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c new file mode 100644 index ..f4a37ace67d7 --- /dev/null +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -0,0 +1,74 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright(c) 2022 Intel Corporation. + +#include +#include +#include "epc_cgroup.h" + +/* The root EPC cgroup */ +static struct sgx_epc_cgroup epc_cg_root; + +/** + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page + * + * @epc_cg:The EPC cgroup to be charged for the page. + * Return: + * * %0 - If successfully charged. + * * -errno - for failures. + */ +int sgx_epc_cgroup_try_charge(struct
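The hunk above is cut off mid-function. Judging from the context lines visible in patch 10's diff, the initial charge/uncharge wrappers are thin shims around the misc controller APIs, roughly as follows (the try_charge body matches the line patch 10 later replaces; the uncharge body is inferred, not quoted from this patch):

    int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
    {
            /* charge one EPC page (PAGE_SIZE bytes) to the misc controller */
            return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
    }

    void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
    {
            misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
    }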
[PATCH v9 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup
From: Sean Christopherson Each EPC cgroup will have an LRU structure to track reclaimable EPC pages. When a cgroup's usage reaches its limit, the cgroup needs to reclaim pages from its LRU or the LRUs of its descendants to make room for any new allocations. To prepare for reclamation per cgroup, expose the top level reclamation function, sgx_reclaim_pages(), in the header file for reuse. Add a parameter to the function to pass in an LRU so cgroups can pass in different tracking LRUs later. Add another parameter for passing in the number of pages to scan and make the function return the number of pages reclaimed, as a cgroup reclaimer may need to track reclamation progress from its descendants and change the number of pages to scan in subsequent calls. Create a wrapper for the global reclaimer, sgx_reclaim_pages_global(), to just call this function with the global LRU passed in. When the per-cgroup LRU is added later, the wrapper will perform global reclamation from the root cgroup. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen --- V8: - Use width of 80 characters in text paragraphs. (Jarkko) V7: - Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim function". Do not split the top level function (Kai) - Dropped patches 7 and 8 of V6. --- arch/x86/kernel/cpu/sgx/main.c | 53 +++--- arch/x86/kernel/cpu/sgx/sgx.h | 1 + 2 files changed, 37 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index a131aa985c95..4f5824c4751d 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -286,11 +286,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, mutex_unlock(&encl->lock); } -/* - * Take a fixed number of pages from the head of the active page pool and - * reclaim them to the enclave's private shmem files. Skip the pages, which have - * been accessed since the last scan. Move those pages to the tail of active - * page pool so that the pages get scanned in LRU like fashion. +/** + * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU + * + * Take a fixed number of pages from the head of a given LRU and reclaim them to + * the enclave's private shmem files. Skip the pages, which have been accessed + * since the last scan. Move those pages to the tail of the list so that the + * pages get scanned in LRU like fashion. * * Batch process a chunk of pages (at the moment 16) in order to degrade amount * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit @@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, * + EWB) but not sufficiently. Reclaiming one page at a time would also be * problematic as it would increase the lock contention too much, which would * halt forward progress. + * + * @lru: The LRU from which pages are reclaimed. + * @nr_to_scan: Pointer to the target number of pages to scan, must be less than + * SGX_NR_TO_SCAN. + * Return: Number of pages reclaimed.
*/ -static void sgx_reclaim_pages(void) +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan) { struct sgx_epc_page *chunk[SGX_NR_TO_SCAN]; struct sgx_backing backing[SGX_NR_TO_SCAN]; @@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void) int ret; int i; - spin_lock(&sgx_global_lru.lock); - for (i = 0; i < SGX_NR_TO_SCAN; i++) { - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, - struct sgx_epc_page, list); + spin_lock(&lru->lock); + + for (; *nr_to_scan > 0; --(*nr_to_scan)) { + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list); if (!epc_page) break; @@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void) */ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_global_lru.lock); + + spin_unlock(&lru->lock); for (i = 0; i < cnt; i++) { epc_page = chunk[i]; @@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void) continue; skip: - spin_lock(&sgx_global_lru.lock); - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable); - spin_unlock(&sgx_global_lru.lock); + spin_lock(&lru->lock); + list_add_tail(&epc_page->list, &lru->reclaimable); + spin_unlock(&lru->lock); kref_put(&encl_page->encl->refcount, sgx_encl_release); @@ -366,6 +374,7 @@ static void sgx_reclaim_pages(void) sgx_reclaimer_block(epc_page); } +
[PATCH v9 05/15] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
From: Sean Christopherson Introduce a data structure to wrap the existing reclaimable list and its spinlock. Each cgroup later will have one instance of this structure to track EPC pages allocated for processes associated with the same cgroup. Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages from the reclaimable list in this structure when its usage reaches near its limit. Use this structure to encapsulate the LRU list and its lock used by the global reclaimer. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Jarkko Sakkinen --- V6: - removed introduction to unreclaimables in commit message. V4: - Removed unneeded comments for the spinlock and the non-reclaimables. (Kai, Jarkko) - Revised the commit to add introduction comments for unreclaimables and multiple LRU lists.(Kai) - Reordered the patches: delay all changes for unreclaimables to later, and this one becomes the first change in the SGX subsystem. V3: - Removed the helper functions and revised commit messages. --- arch/x86/kernel/cpu/sgx/main.c | 39 +- arch/x86/kernel/cpu/sgx/sgx.h | 15 + 2 files changed, 35 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index c32f18b70c73..912959c7ecc9 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space); /* * These variables are part of the state of the reclaimer, and must be accessed - * with sgx_reclaimer_lock acquired. + * with sgx_global_lru.lock acquired. */ -static LIST_HEAD(sgx_active_page_list); -static DEFINE_SPINLOCK(sgx_reclaimer_lock); +static struct sgx_epc_lru_list sgx_global_lru; static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); @@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void) int ret; int i; - spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); for (i = 0; i < SGX_NR_TO_SCAN; i++) { - if (list_empty(&sgx_active_page_list)) + epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, + struct sgx_epc_page, list); + if (!epc_page) break; - epc_page = list_first_entry(&sgx_active_page_list, - struct sgx_epc_page, list); list_del_init(&epc_page->list); encl_page = epc_page->owner; @@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void) */ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_reclaimer_lock); + spin_unlock(&sgx_global_lru.lock); for (i = 0; i < cnt; i++) { epc_page = chunk[i]; @@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void) continue; skip: - spin_lock(&sgx_reclaimer_lock); - list_add_tail(&epc_page->list, &sgx_active_page_list); - spin_unlock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); + list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable); + spin_unlock(&sgx_global_lru.lock); kref_put(&encl_page->encl->refcount, sgx_encl_release); @@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void) static bool sgx_should_reclaim(unsigned long watermark) { return atomic_long_read(&sgx_nr_free_pages) < watermark && - !list_empty(&sgx_active_page_list); + !list_empty(&sgx_global_lru.reclaimable); } /* @@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void) ksgxd_tsk = tsk; + sgx_lru_init(&sgx_global_lru); + return true; } @@ -507,10 +508,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void) */ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) { - 
spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED; - list_add_tail(&page->list, &sgx_active_page_list); - spin_unlock(&sgx_reclaimer_lock); + list_add_tail(&page->list, &sgx_global_lru.reclaimable); + spin_unlock(&sgx_global_lru.lock); } /** @@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) */ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) { /* The page is being reclaimed. */ if (list_empty(&page->list)) { - spin_unlock(&sgx_reclaimer_lock); + spin_unlock
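The sgx.h hunk of this patch is cut off above. From the call sites shown (lru->lock, lru->reclaimable, sgx_lru_init()), the introduced structure is essentially a list plus the lock guarding it; a sketch of what the header addition must look like:

    struct sgx_epc_lru_list {
            spinlock_t lock;
            struct list_head reclaimable;
    };

    static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
    {
            spin_lock_init(&lru->lock);
            INIT_LIST_HEAD(&lru->reclaimable);
    }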
[PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
From: Kristen Carlson Accardi Implement the reclamation flow for cgroups, encapsulated in the top-level function sgx_epc_cgroup_reclaim_pages(). It does a pre-order walk on its subtree, and makes calls to sgx_reclaim_pages() at each node, passing in the LRU of that node. It keeps track of total reclaimed pages, and pages left to attempt. It stops the walk once the desired number of pages has been attempted. In some contexts, e.g. page fault handling, only asynchronous reclamation is allowed. Create a work queue, a corresponding work item, and function definitions to support the asynchronous reclamation. Both synchronous and asynchronous flows invoke the same top level reclaim function, and will be triggered later by sgx_epc_cgroup_try_charge() when usage of the cgroup is at or near its limit. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V9: - Add comments for static variables. (Jarkko) V8: - Remove alignment for substructure variables. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 181 ++- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 3 + 2 files changed, 183 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index f4a37ace67d7..16b6d9f909eb 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -8,9 +8,180 @@ /* The root EPC cgroup */ static struct sgx_epc_cgroup epc_cg_root; +/* + * The work queue that reclaims EPC pages in the background for cgroups. + * + * A cgroup schedules a work item into this queue to reclaim pages within the + * same cgroup when its usage limit is reached and synchronous reclamation is not + * an option, e.g., in a fault handler. + */ +static struct workqueue_struct *sgx_epc_cg_wq; + +static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg) +{ + return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE; +} + +static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg) +{ + return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE; +} + +/* + * Get the lower bound of limits of a cgroup and its ancestors. Used in + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is + * over its limit or its ancestors' hence reclamation is needed. + */ +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg) +{ + struct misc_cg *i = epc_cg->cg; + u64 m = U64_MAX; + + while (i) { + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max)); + i = misc_cg_parent(i); + } + + return m / PAGE_SIZE; +} + /** - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs + * @root: Root of the tree to check * + * Return: %true if all cgroups under the specified root have empty LRU lists. + * Used to avoid livelocks due to a cgroup having a non-zero charge count but + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or + * because all pages in the cgroup are unreclaimable.
+ */ +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root) +{ + struct cgroup_subsys_state *css_root; + struct cgroup_subsys_state *pos; + struct sgx_epc_cgroup *epc_cg; + bool ret = true; + + /* +* Caller ensure css_root ref acquired +*/ + css_root = &root->css; + + rcu_read_lock(); + css_for_each_descendant_pre(pos, css_root) { + if (!css_tryget(pos)) + break; + + rcu_read_unlock(); + + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos)); + + spin_lock(&epc_cg->lru.lock); + ret = list_empty(&epc_cg->lru.reclaimable); + spin_unlock(&epc_cg->lru.lock); + + rcu_read_lock(); + css_put(pos); + if (!ret) + break; + } + + rcu_read_unlock(); + + return ret; +} + +/** + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages + * @root: Root of the tree to start walking from. + * Return: Number of pages reclaimed. + */ +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root) +{ + /* +* Attempting to reclaim only a few pages will often fail and is +* inefficient, while reclaiming a huge number of pages can result in +* soft lockups due to holding various locks for an extended duration. +*/ + unsigned int nr_to_scan = SGX_NR_TO_SCAN; + struct cgroup_subsys_state *css_root; + struct cgroup_subsys_state *pos; + struct sgx_epc_cgroup *epc_cg; + unsigned
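The body of sgx_epc_cgroup_reclaim_pages() is cut off above. A sketch of how the rest of the pre-order walk plausibly proceeds, patterned on sgx_epc_cgroup_lru_empty() above and the sgx_reclaim_pages() signature from patch 7 (a reconstruction, not the verbatim patch):

    unsigned int cnt = 0;

    css_root = &root->css;

    rcu_read_lock();
    css_for_each_descendant_pre(pos, css_root) {
            if (!css_tryget(pos))
                    break;

            rcu_read_unlock();

            epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
            cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);

            rcu_read_lock();
            css_put(pos);

            /* stop once the target number of pages has been scanned */
            if (!nr_to_scan)
                    break;
    }
    rcu_read_unlock();

    return cnt;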
[PATCH v9 09/15] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
Enclave Page Cache(EPC) memory can be swapped out to regular system memory, and the consumed memory should be charged to a proper mem_cgroup. Currently the selection of mem_cgroup to charge is done in sgx_encl_get_mem_cgroup(). But it only considers two contexts in which the swapping can be done: normal tasks and the ksgxd kthread. With the new EPC cgroup implementation, the swapping can also happen in EPC cgroup work-queue threads. In those cases, it improperly selects the root mem_cgroup to charge for the RAM usage. Change sgx_encl_get_mem_cgroup() to handle non-task contexts only and return the mem_cgroup of an mm_struct associated with the enclave. The return is used to charge for EPC backing pages in all kthread cases. Pass a flag into the top level reclamation function, sgx_reclaim_pages(), to explicitly indicate whether it is called from a background kthread. Internally, if the flag is true, switch the active mem_cgroup to the one returned from sgx_encl_get_mem_cgroup(), prior to any backing page allocation, in order to ensure that shmem page allocations are charged to the enclave's cgroup. Removed current_is_ksgxd() as it is no longer needed. Signed-off-by: Haitao Huang Reported-by: Mikko Ylinen --- V9: - Reduce number of if statements. (Tim) V8: - Limit text paragraphs to 80 characters wide. (Jarkko) --- arch/x86/kernel/cpu/sgx/encl.c | 38 +--- arch/x86/kernel/cpu/sgx/encl.h | 3 +-- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 7 ++--- arch/x86/kernel/cpu/sgx/main.c | 27 +--- arch/x86/kernel/cpu/sgx/sgx.h| 3 ++- 5 files changed, 36 insertions(+), 42 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 279148e72459..4e5948362060 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -993,9 +993,7 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde } /* - * When called from ksgxd, returns the mem_cgroup of a struct mm stored - * in the enclave's mm_list. When not called from ksgxd, just returns - * the mem_cgroup of the current task. + * Returns the mem_cgroup of a struct mm stored in the enclave's mm_list. */ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) { @@ -1003,14 +1001,6 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) struct sgx_encl_mm *encl_mm; int idx; - /* -* If called from normal task context, return the mem_cgroup -* of the current task's mm. The remainder of the handling is for -* ksgxd. -*/ - if (!current_is_ksgxd()) - return get_mem_cgroup_from_mm(current->mm); - /* * Search the enclave's mm_list to find an mm associated with * this enclave to charge the allocation to. @@ -1047,27 +1037,33 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * @encl: an enclave pointer * @page_index:enclave page index * @backing: data for accessing backing storage for the page + * @indirect: in ksgxd or EPC cgroup work queue context + * + * Create a backing page for loading data back into an EPC page with ELDU. This + * function takes a reference on a new backing page which must be dropped with a + * corresponding call to sgx_encl_put_backing(). * - * When called from ksgxd, sets the active memcg from one of the - * mms in the enclave's mm_list prior to any backing page allocation, - * in order to ensure that shmem page allocations are charged to the - * enclave. Create a backing page for loading data back into an EPC page with - * ELDU. 
This function takes a reference on a new backing page which - * must be dropped with a corresponding call to sgx_encl_put_backing(). + * When @indirect is true, sets the active memcg from one of the mms in the + * enclave's mm_list prior to any backing page allocation, in order to ensure + * that shmem page allocations are charged to the enclave. * * Return: * 0 on success, * -errno otherwise. */ int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index, - struct sgx_backing *backing) + struct sgx_backing *backing, bool indirect) { - struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl); - struct mem_cgroup *memcg = set_active_memcg(encl_memcg); + struct mem_cgroup *encl_memcg; + struct mem_cgroup *memcg; int ret; - ret = __sgx_encl_get_backing(encl, page_index, backing); + if (!indirect) + return __sgx_encl_get_backing(encl, page_index, backing); + encl_memcg = sgx_encl_get_mem_cgroup(encl); + memcg = set_active_memcg(encl_memcg); + ret = __sgx_encl_get_backing(encl, page_index, backing); set_active_memcg(memcg); mem_cgroup_put(encl_memcg); diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch
[PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
From: Kristen Carlson Accardi When the EPC usage of a cgroup is near its limit, the cgroup needs to reclaim pages used in the same cgroup to make room for new allocations. This is analogous to the behavior that the global reclaimer is triggered when the global usage is close to total available EPC. Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate whether synchronous reclaim is allowed or not. And trigger the synchronous/asynchronous reclamation flow accordingly. Note at this point, all reclaimable EPC pages are still tracked in the global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation is activated yet. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 -- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- arch/x86/kernel/cpu/sgx/main.c | 2 +- 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index d399fda2b55e..abf74fdb12b4 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -184,13 +184,35 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) /** * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page * @epc_cg:The EPC cgroup to be charged for the page. + * @reclaim: Whether or not synchronous reclaim is allowed * Return: * * %0 - If successfully charged. * * -errno - for failures. */ -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) { - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); + for (;;) { + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, + PAGE_SIZE)) + break; + + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) + return -ENOMEM; + + if (signal_pending(current)) + return -ERESTARTSYS; + + if (!reclaim) { + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); + return -EBUSY; + } + + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) + /* All pages were too young to reclaim, try again a little later */ + schedule(); + } + + return 0; } /** diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index e3c6a08f0ee8..d061cd807b45 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -23,7 +23,7 @@ static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void) static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { } -static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) +static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) { return 0; } @@ -66,7 +66,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) put_misc_cg(epc_cg->cg); } -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg); +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim); void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg); bool sgx_epc_cgroup_lru_empty(struct misc_cg *root); void sgx_epc_cgroup_init(void); diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 51904f191b97..2279ae967707 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -588,7 +588,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) int 
ret; epc_cg = sgx_get_current_epc_cg(); - ret = sgx_epc_cgroup_try_charge(epc_cg); + ret = sgx_epc_cgroup_try_charge(epc_cg, reclaim); if (ret) { sgx_put_epc_cg(epc_cg); return ERR_PTR(ret); -- 2.25.1
[PATCH v9 11/15] x86/sgx: Abstract check for global reclaimable pages
From: Kristen Carlson Accardi To determine if any page is available for reclamation at the global level, only checking for emptiness of the global LRU is not adequate when pages are tracked in multiple LRUs, one per cgroup. For this purpose, create a new helper, sgx_can_reclaim(), which currently only checks the global LRU and will later check the emptiness of the LRUs of all cgroups when per-cgroup tracking is turned on. Replace all the checks of the global LRU, list_empty(&sgx_global_lru.reclaimable), with calls to sgx_can_reclaim(). Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- v7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 2279ae967707..6b0c26cac621 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -37,6 +37,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag return &sgx_global_lru; } +static inline bool sgx_can_reclaim(void) +{ + return !list_empty(&sgx_global_lru.reclaimable); +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ @@ -398,7 +403,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to static bool sgx_should_reclaim(unsigned long watermark) { return atomic_long_read(&sgx_nr_free_pages) < watermark && - !list_empty(&sgx_global_lru.reclaimable); + sgx_can_reclaim(); } static void sgx_reclaim_pages_global(bool indirect) @@ -601,7 +606,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) break; } - if (list_empty(&sgx_global_lru.reclaimable)) { + if (!sgx_can_reclaim()) { page = ERR_PTR(-ENOMEM); break; } -- 2.25.1
[PATCH v9 12/15] x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer
From: Kristen Carlson Accardi When cgroup is enabled, all reclaimable pages will be tracked in cgroup LRUs. The global reclaimer needs to start reclamation from the root cgroup. Expose the top level cgroup reclamation function so the global reclaimer can reuse it. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V8: - Remove unneeded breaks in function declarations. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index abf74fdb12b4..6e31f8727b8a 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root) * @indirect: In ksgxd or EPC cgroup work queue context. * Return: Number of pages reclaimed. */ -static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) { /* * Attempting to reclaim only a few pages will often fail and is diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index d061cd807b45..5b3e8e1b8630 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { } static inline void sgx_epc_cgroup_init(void) { } + +static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) +{ + return 0; +} #else struct sgx_epc_cgroup { struct misc_cg *cg; @@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim); void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg); bool sgx_epc_cgroup_lru_empty(struct misc_cg *root); +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect); + void sgx_epc_cgroup_init(void); #endif -- 2.25.1
[PATCH v9 14/15] Docs/x86/sgx: Add description for cgroup support
From: Sean Christopherson Add initial documentation of how to regulate the distribution of SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup controller. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson --- V8: - Limit text width to 80 characters to be consistent. V6: - Remove mentioning of VMM specific behavior on handling SIGBUS - Remove statement of forced reclamation, add statement to specify ENOMEM returned when no reclamation possible. - Added statements on the non-preemptive nature for the max limit - Dropped Reviewed-by tag because of changes V4: - Fix indentation (Randy) - Change misc.events file to be read-only - Fix a typo for 'subsystem' - Add behavior when VMM overcommit EPC with a cgroup (Mikko) --- Documentation/arch/x86/sgx.rst | 83 ++ 1 file changed, 83 insertions(+) diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst index d90796adc2ec..c537e6a9aa65 100644 --- a/Documentation/arch/x86/sgx.rst +++ b/Documentation/arch/x86/sgx.rst @@ -300,3 +300,86 @@ to expected failures and handle them as follows: first call. It indicates a bug in the kernel or the userspace client if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has a return code other than 0. + + +Cgroup Support +== + +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates +distribution of SGX EPC memory, which is a subset of system RAM that is used to +provide SGX-enabled applications with protected memory, and is otherwise +inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be +read/written outside of an SGX enclave. + +Although current systems implement EPC by stealing memory from RAM, for all +intents and purposes the EPC is independent from normal system memory, e.g. must +be reserved at boot from RAM and cannot be converted between EPC and normal +memory while the system is running. The EPC is managed by the SGX subsystem and +is not accounted by the memory controller. Note that this is true only for EPC +memory itself, i.e. normal memory allocations related to SGX and EPC memory, +e.g. the backing memory for evicted EPC pages, are accounted, limited and +protected by the memory controller. + +Much like normal system memory, EPC memory can be overcommitted via virtual +memory techniques and pages can be swapped out of the EPC to their backing store +(normal system memory allocated via shmem). The SGX EPC subsystem is analogous +to the memory subsystem, and it implements limit and protection models for EPC +memory. + +SGX EPC Interface Files +--- + +For a generic description of the Miscellaneous controller interface files, +please see Documentation/admin-guide/cgroup-v2.rst + +All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If +a value which is not PAGE_SIZE aligned is written, the actual value used by the +controller will be rounded down to the closest PAGE_SIZE multiple. + + misc.capacity +A read-only flat-keyed file shown only in the root cgroup. The sgx_epc +resource will show the total amount of EPC memory available on the +platform. + + misc.current +A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc +resource will show the current active EPC memory usage of the cgroup and +its descendants. EPC pages that are swapped out to backing RAM are not +included in the current count. 
+ + misc.max +A read-write single value file which exists on non-root cgroups. The +sgx_epc resource will show the EPC usage hard limit. The default is +"max". + +If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for +page fault handling, will be blocked until EPC can be reclaimed from the +cgroup. If there are no pages left that are reclaimable within the same +group, the kernel returns ENOMEM. + +The EPC pages allocated for a guest VM by the virtual EPC driver are not +reclaimable by the host kernel. In case the guest cgroup's limit is +reached and no reclaimable pages left in the same cgroup, the virtual +EPC driver returns SIGBUS to the user space process to indicate failure +on new EPC allocation requests. + +The misc.max limit is non-preemptive. If a user writes a limit lower +than the current usage to this file, the cgroup will not preemptively +deallocate pages currently in use, and will only start blocking the next +allocation and reclaiming EPC at that time. + + misc.events +A read-only flat-keyed file which exists on non-root cgroups. +A value change in this file generates a file modified event. + + max +The number of times
[PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
The scripts rely on the cgroup-tools package from libcgroup [1]. To run selftests for epc cgroup: sudo ./run_epc_cg_selftests.sh To watch misc cgroup 'current' changes during testing, run this in a separate terminal: ./watch_misc_for_tests.sh current With different cgroups, the script starts one or multiple concurrent SGX selftests, each running one unclobbered_vdso_oversubscribed test. Each such test tries to load an enclave of EPC size equal to the EPC capacity available on the platform. The script checks results against the expectation set for each cgroup and reports success or failure. The script creates 3 different cgroups at the beginning with the following expectations: 1) SMALL - intentionally small enough to fail the test loading an enclave of size equal to the capacity. 2) LARGE - large enough to run up to 4 concurrent tests but fail some if more than 4 concurrent tests are run. The script starts 4 expecting at least one test to pass, and then starts 5 expecting at least one test to fail. 3) LARGER - limit is the same as the capacity, large enough to run lots of concurrent tests. The script starts 8 of them and expects all to pass. Then it reruns the same test with one process randomly killed and usage checked to be zero after all processes exit. The script also includes a test with a low mem_cg limit and LARGE sgx_epc limit to verify that the RAM used for per-cgroup reclamation is charged to a proper mem_cg. [1] https://github.com/libcgroup/libcgroup/blob/main/README Signed-off-by: Haitao Huang --- V7: - Added memcontrol test. V5: - Added script with automatic results checking, removed the interactive script. - The script can run independently from the series below. --- .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++ .../selftests/sgx/watch_misc_for_tests.sh | 13 + 2 files changed, 259 insertions(+) create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh new file mode 100755 index ..e027bf39f005 --- /dev/null +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh @@ -0,0 +1,246 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright(c) 2023 Intel Corporation. + +TEST_ROOT_CG=selftest +cgcreate -g misc:$TEST_ROOT_CG +if [ $? -ne 0 ]; then +echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted." +exit 1 +fi +TEST_CG_SUB1=$TEST_ROOT_CG/test1 +TEST_CG_SUB2=$TEST_ROOT_CG/test2 +# We will only set limit in test1 and run tests in test3 +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 +TEST_CG_SUB4=$TEST_ROOT_CG/test4 + +cgcreate -g misc:$TEST_CG_SUB1 +cgcreate -g misc:$TEST_CG_SUB2 +cgcreate -g misc:$TEST_CG_SUB3 +cgcreate -g misc:$TEST_CG_SUB4 + +# Default to V2 +CG_MISC_ROOT=/sys/fs/cgroup +CG_MEM_ROOT=/sys/fs/cgroup +CG_V1=0 +if [ ! -d "/sys/fs/cgroup/misc" ]; then +echo "# cgroup V2 is in use." +else +echo "# cgroup V1 is in use." +CG_MISC_ROOT=/sys/fs/cgroup/misc +CG_MEM_ROOT=/sys/fs/cgroup/memory +CG_V1=1 +fi + +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}') +# This is below number of VA pages needed for enclave of capacity size. So +# should fail oversubscribed cases +SMALL=$(( CAPACITY / 512 )) + +# At least load one enclave of capacity size successfully, maybe up to 4. +# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+LARGE=$(( SMALL * 4 )) + +# Load lots of enclaves +LARGER=$CAPACITY +echo "# Setting up limits." +echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max +echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max +echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max + +timestamp=$(date +%Y%m%d_%H%M%S) + +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed" + +wait_check_process_status() { +local pid=$1 +local check_for_success=$2 # If 1, check for success; +# If 0, check for failure +wait "$pid" +local status=$? + +if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then +echo "# Process $pid succeeded." +return 0 +elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then +echo "# Process $pid returned failure." +return 0 +fi +return 1 +} + +wait_and_detect_for_any() { +local pids=("$@") +local check_for_success=$1 # If 1, check for success; +# If 0, check for failure +local detected=1 # 0 for success detection + +for pid in "${pids[@]:1}"; do +if wait_check_process_status "$pid" "$check_for_success"; then +detected=0 +# Wait for other processes to exit +fi +done + +return $detected +} + +echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..." +# Always use leaf node of misc c
[PATCH v9 13/15] x86/sgx: Turn on per-cgroup EPC reclamation
From: Kristen Carlson Accardi Previous patches have implemented all infrastructure needed for per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC pages are still tracked in the global LRU as sgx_lru_list() returns hard coded reference to the global LRU. Change sgx_lru_list() to return the LRU of the cgroup in which the given EPC page is allocated. This makes all EPC pages tracked in per-cgroup LRUs and the global reclaimer (ksgxd) will not be able to reclaim any pages from the global LRU. However, in cases of over-committing, i.e., sum of cgroup limits greater than the total capacity, cgroups may never reclaim but the total usage can still be near the capacity. Therefore global reclamation is still needed in those cases and it should reclaim from the root cgroup. Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup when cgroup is enabled, otherwise from the global LRU. Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all cgroups when EPC cgroup is enabled, otherwise only check the global LRU. With these changes, the global reclamation and per-cgroup reclamation both work properly with all pages tracked in per-cgroup LRUs. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 6b0c26cac621..d4265a390ba9 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru; static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page) { +#ifdef CONFIG_CGROUP_SGX_EPC + if (epc_page->epc_cg) + return &epc_page->epc_cg->lru; + + /* This should not happen if kernel is configured correctly */ + WARN_ON_ONCE(1); +#endif return &sgx_global_lru; } static inline bool sgx_can_reclaim(void) { +#ifdef CONFIG_CGROUP_SGX_EPC + return !sgx_epc_cgroup_lru_empty(misc_cg_root()); +#else return !list_empty(&sgx_global_lru.reclaimable); +#endif } static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); @@ -410,7 +421,10 @@ static void sgx_reclaim_pages_global(bool indirect) { unsigned int nr_to_scan = SGX_NR_TO_SCAN; - sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect); + if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC)) + sgx_epc_cgroup_reclaim_pages(misc_cg_root(), indirect); + else + sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect); } /* -- 2.25.1
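sgx_epc_cgroup_lru_empty() is introduced earlier in this series and is not shown in the hunk above. A simplified sketch of such a pre-order walk over the misc-cgroup subtree follows; the helper names and the lru lock field are assumed from context, this is not the series' exact code.

/* Return true only if no cgroup under @root has reclaimable EPC pages. */
static bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
{
        struct cgroup_subsys_state *pos;
        struct sgx_epc_cgroup *epc_cg;
        bool ret = true;

        rcu_read_lock();
        css_for_each_descendant_pre(pos, &root->css) {
                if (!css_tryget(pos))
                        break;
                rcu_read_unlock();

                /* css_misc()/sgx_epc_cgroup_from_misc_cg(): assumed container_of() helpers */
                epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));

                spin_lock(&epc_cg->lru.lock);
                ret = list_empty(&epc_cg->lru.reclaimable);
                spin_unlock(&epc_cg->lru.lock);

                rcu_read_lock();
                css_put(pos);
                if (!ret)
                        break;
        }
        rcu_read_unlock();

        return ret;
}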
Re: [PATCH] device-dax: make dax_bus_type const
On 2/4/24 9:07 AM, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the dax_bus_type variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Dave Jiang > --- > drivers/dax/bus.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 1659b787b65f..fe0a415c854d 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -222,7 +222,7 @@ static void dax_bus_remove(struct device *dev) > dax_drv->remove(dev_dax); > } > > -static struct bus_type dax_bus_type = { > +static const struct bus_type dax_bus_type = { > .name = "dax", > .uevent = dax_bus_uevent, > .match = dax_bus_match, > > --- > base-commit: 73bf93edeeea866b0b6efbc8d2595bdaaba7f1a5 > change-id: 20240204-bus_cleanup-dax-52c34f72615f > > Best regards,
Re: [PATCH v7 2/5] remoteproc: k3: Move out data structures common with M4 driver
On 2/2/24 11:55 AM, Hari Nagalla wrote: From: Martyn Welch We will be adding the M4F driver which shares a lot of commonality with the DSP driver. Common data structures are introduced here. Signed-off-by: Martyn Welch Signed-off-by: Hari Nagalla --- Changes since v5: - Created a separate patch for data structures to ease review Changes since v6: - Reworded 'split' to 'move' as the common data structures between DSP and M4 remote rproc drivers are moved into a common driver. Is this a joke? In v6 Krzysztof commented the following: Where is the split? I see only addition here. Where is the usage of this header? This is basically dead code. Don't add dead code, but instead actually move the structures here! Move is cut and paste, not just paste. Instead of changing the patch in any way to address this comment you just replaced the word 'split' with 'move' in the commit subject.. Maybe no one will notice this is all still dead code since you didn't say the word 'split' anymore.. Andrew link to v6: https://lore.kernel.org/all/20230913111644.29889-3-hnaga...@ti.com/ drivers/remoteproc/ti_k3_common.h | 107 ++ 1 file changed, 107 insertions(+) create mode 100644 drivers/remoteproc/ti_k3_common.h diff --git a/drivers/remoteproc/ti_k3_common.h b/drivers/remoteproc/ti_k3_common.h new file mode 100644 index ..f1bab83dd0fc --- /dev/null +++ b/drivers/remoteproc/ti_k3_common.h @@ -0,0 +1,107 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * TI K3 Remote Processor(s) driver common code + * + * Refactored from ti_k3_dsp_remoteproc.c. + * + * ti_k3_dsp_remoteproc.c: + * Copyright (C) 2018-2022 Texas Instruments Incorporated - https://www.ti.com/ + * Suman Anna + */ + +#ifndef REMOTEPROC_TI_K3_COMMON_H +#define REMOTEPROC_TI_K3_COMMON_H + +#define KEYSTONE_RPROC_LOCAL_ADDRESS_MASK (SZ_16M - 1) + +/** + * struct k3_rproc_mem - internal memory structure + * @cpu_addr: MPU virtual address of the memory region + * @bus_addr: Bus address used to access the memory region + * @dev_addr: Device address of the memory region from remote processor view + * @size: Size of the memory region + */ +struct k3_rproc_mem { + void __iomem *cpu_addr; + phys_addr_t bus_addr; + u32 dev_addr; + size_t size; +}; + +/** + * struct k3_rproc_mem_data - memory definitions for a remote processor + * @name: name for this memory entry + * @dev_addr: device address for the memory entry + */ +struct k3_rproc_mem_data { + const char *name; + const u32 dev_addr; +}; + +/** + * struct k3_rproc_dev_data - device data structure for a remote processor + * @mems: pointer to memory definitions for a remote processor + * @num_mems: number of memory regions in @mems + * @boot_align_addr: boot vector address alignment granularity + * @uses_lreset: flag to denote the need for local reset management + */ +struct k3_rproc_dev_data { + const struct k3_rproc_mem_data *mems; + u32 num_mems; + u32 boot_align_addr; + bool uses_lreset; +}; + +/** + * struct k3_rproc - k3 remote processor driver structure + * @dev: cached device pointer + * @rproc: remoteproc device handle + * @mem: internal memory regions data + * @num_mems: number of internal memory regions + * @sram: on-chip SRAM memory regions data + * @num_sram: number of on-chip SRAM memory regions + * @rmem: reserved memory regions data + * @num_rmems: number of reserved memory regions + * @reset: reset control handle + * @data: pointer to device data + * @tsp: TI-SCI processor control handle + * @ti_sci: TI-SCI handle + * @ti_sci_id: TI-SCI device identifier + * @mbox: mailbox channel handle + * @client: mailbox client to request the mailbox channel + */ +struct k3_rproc { + struct device *dev;
+ struct rproc *rproc; + struct k3_rproc_mem *mem; + int num_mems; + struct k3_rproc_mem *sram; + int num_sram; + struct k3_rproc_mem *rmem; + int num_rmems; + struct reset_control *reset; + const struct k3_rproc_dev_data *data; + struct ti_sci_proc *tsp; + const struct ti_sci_handle *ti_sci; + u32 ti_sci_id; + struct mbox_chan *mbox; + struct mbox_client client; +}; + +void k3_rproc_kick(struct rproc *rproc, int vqid); +int k3_rproc_reset(struct k3_rproc *kproc); +int k3_rproc_release(struct k3_rproc *kproc); +int k3_rproc_request_mbox(struct rproc *rproc); +int k3_rproc_prepare(struct rproc *rproc); +int k3_rproc_unprepare(struct rproc *rproc); +struct resource_table *k3_get_loaded_rsc_table(struct rproc *rproc, + size_t *rsc_table_sz); +void *k3_rproc_da_to_va(struct rproc *rproc, u64 da, size_t len, + bool *is_iomem); +int k3_rproc_of_get_memories(struct platform_device *pdev, +struct k3_rproc *kproc); +int k3_rproc_of_get_sram_memories(struct platform_device *pdev, +struct k3_rproc *kproc); +int k3_reserved_mem_init(struct k3_rproc *kproc); +void k3_reserved_mem_ex
Re: [PATCH v7 3/5] remoteproc: k3: Move out functions common with M4 driver
On 2/2/24 11:55 AM, Hari Nagalla wrote: From: Martyn Welch In the next commit we will be adding the M4F driver which shares a lot of commonality with the DSP driver. Move this shared functionality out so that it can be used by both drivers. Signed-off-by: Martyn Welch Signed-off-by: Hari Nagalla --- Changes since v2: - New patch (reordered refactored from v2) Changes since v3: - Removed "ipc_only" element from k3_rproc structure - Refactored to bring 3 more common functions Changes since v4: - None Changes since v5: - Rearranged the function order to match the functions in ti_k3_dsp_remoteproc.c to ease review. Changes since v6: - Generated patch with -M/-B/-C options You were asked to generate this patch "correctly" with these options, not just use them all and hope for the best.. Now it looks like you re-wrote all of ti_k3_dsp_remoteproc.c when you only factored out a couple of functions to a different file. Build up the new ti_k3_common.c one function per patch if it helps. And factor the functions out of ti_k3_r5 also as it seems many of these are common to that driver too. Andrew link to v6: https://lore.kernel.org/all/20230913111644.29889-4-hnaga...@ti.com/ drivers/remoteproc/Makefile |2 +- drivers/remoteproc/ti_k3_common.c | 583 ++ drivers/remoteproc/ti_k3_dsp_remoteproc.c | 1277 ++--- 3 files changed, 952 insertions(+), 910 deletions(-) create mode 100644 drivers/remoteproc/ti_k3_common.c rewrite drivers/remoteproc/ti_k3_dsp_remoteproc.c (67%) diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile index 91314a9b43ce..55c552e27a45 100644 --- a/drivers/remoteproc/Makefile +++ b/drivers/remoteproc/Makefile @@ -36,6 +36,6 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o -obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o +obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o ti_k3_common.o obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o diff --git a/drivers/remoteproc/ti_k3_common.c b/drivers/remoteproc/ti_k3_common.c new file mode 100644 index ..62c7c5dba78a --- /dev/null +++ b/drivers/remoteproc/ti_k3_common.c @@ -0,0 +1,583 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * TI K3 Remote Processor(s) driver common code + * + * Refactored from ti_k3_dsp_remoteproc.c. + * + * ti_k3_dsp_remoteproc.c: + * Copyright (C) 2018-2022 Texas Instruments Incorporated - https://www.ti.com/ + * Suman Anna + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "omap_remoteproc.h" +#include "remoteproc_internal.h" +#include "ti_sci_proc.h" +#include "ti_k3_common.h" + +/** + * k3_rproc_mbox_callback() - inbound mailbox message handler + * @client: mailbox client pointer used for requesting the mailbox channel + * @data: mailbox payload + * + * This handler is invoked by the K3 mailbox driver whenever a mailbox + * message is received. Usually, the mailbox payload simply contains + * the index of the virtqueue that is kicked by the remote processor, + * and we let remoteproc core handle it. + * + * In addition to virtqueue indices, we also have some out-of-band values + * that indicate different events. Those values are deliberately very + * large so they don't coincide with virtqueue indices.
+ */ +static void k3_rproc_mbox_callback(struct mbox_client *client, void *data) +{ + struct k3_rproc *kproc = container_of(client, struct k3_rproc, + client); + struct device *dev = kproc->rproc->dev.parent; + const char *name = kproc->rproc->name; + u32 msg = omap_mbox_message(data); + + dev_dbg(dev, "mbox msg: 0x%x\n", msg); + + switch (msg) { + case RP_MBOX_CRASH: + /* +* remoteproc detected an exception, but error recovery is not +* supported. So, just log this for now +*/ + dev_err(dev, "K3 rproc %s crashed\n", name); + break; + case RP_MBOX_ECHO_REPLY: + dev_info(dev, "received echo reply from %s\n", name); + break; + default: + /* silently handle all other valid messages */ + if (msg >= RP_MBOX_READY && msg < RP_MBOX_END_MSG) + return; + if (msg > kproc->rproc->max_notifyid) { + dev_dbg(dev, "dropping unknown message 0x%x", msg); + return; + } + /* msg contains the index of the triggered vring */ + if (rproc_vq_interrupt(kproc->rproc, msg) == IRQ_NONE) +
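The excerpt above only shows the inbound mailbox path before it is cut off. For context, the outbound k3_rproc_kick() declared in ti_k3_common.h would plausibly keep the existing DSP behaviour, where the payload is simply the index of the kicked virtqueue. A sketch, not the patch's actual code:

/* Kick the remote processor by sending the virtqueue index as payload. */
void k3_rproc_kick(struct rproc *rproc, int vqid)
{
        struct k3_rproc *kproc = rproc->priv;
        struct device *dev = rproc->dev.parent;
        mbox_msg_t msg = (mbox_msg_t)vqid;
        int ret;

        /* send the index of the triggered virtqueue in the mailbox payload */
        ret = mbox_send_message(kproc->mbox, (void *)msg);
        if (ret < 0)
                dev_err(dev, "failed to send mailbox message, status = %d\n", ret);
}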
[PATCH v17 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
From: Ankit Agrawal NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect. The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (occurs at the time of host system boot). The device memory is accessed instead using the chip-to-chip interconnect that is exposed as a contiguous physically addressable region on the host. Since the device memory is cache coherent with the CPU, it can be mmapped into the user VMA with a cacheable mapping and used like regular RAM. The device memory is not added to the host kernel, but mapped directly as this reduces memory wastage due to struct pages. There is also a requirement of a reserved 1G uncached region (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide a different non-cached property to the reserved 1G region, it needs to be carved out from the device memory and mapped as a separate region in the Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace. The variant driver exposes these two regions - the non-cached reserved (resmem) and the cached rest of the device memory (termed as usemem) - as separate VFIO 64b BAR regions. This is divergent from the baremetal approach, where the device memory is exposed as a device memory region. The decision for a different approach was taken in view of the fact that it would necessitate additional code in Qemu to discover and insert those regions in the VM IPA, along with the additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top of tree Qemu) out there desiring grace hopper support. Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved out region to the next available PCI BAR (i.e. comprising region 2 and 3). The cached device memory aperture is assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as the BAR2 region and the cacheable one as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers. The hardware ensures that the system does not crash when the memory is accessed with the memory enable turned off. It synthesizes ~0 reads and drops writes on such accesses. So there is no need to support disabling/enabling the BAR through the PCI_COMMAND config space register. The memory layout on the host looks like the following:

               devmem (memlength)
|--------------------------------------------------|
|------------cached--------------------------|--NC--|
|                                            |
usemem.phys/memphys                          resmem.phys

PCI BARs need to be aligned to a power of 2, but the actual memory on the device may not be. A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes.
Note that the GPU device driver [6] is capable of knowing the exact device memory size through separate means. The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module. Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks. Hence make the usable memory size memblock aligned. Currently there is no provision in KVM for an S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. As previously mentioned, resmem is mapped with pgprot_writecombine(), which sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM marks the region with MemAttr[2:0]=0b101 in S2. If the device memory properties are not present in the host ACPI table, the driver registers the vfio-pci-core function pointers. This goes along with a qemu series [6] to provide the necessary implementation of the Grace Hopper Superchip firmware specification so that the guest operating system can see the correct ACPI modeling for the coherent GPU device. Verified with the CUDA workload in the VM. [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [2] section D8.5.5 of https://developer.arm.com/documentation/ddi0487/latest/ [3] https://lore.kernel.org/all/2
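A sketch of the two alignment steps described above; the macro name and the 512M value are assumptions for illustration, not taken from the driver:

#include <linux/align.h>
#include <linux/log2.h>
#include <linux/sizes.h>

#define NV_MEMBLK_SIZE  SZ_512M /* assumed memblock alignment unit */

/*
 * The usable size is memblock-aligned for the VM driver, then the BAR
 * reported to userspace is rounded up to a power of two. The padding is
 * harmless because the hardware reads ~0 and drops writes there.
 */
static u64 nvgrace_gpu_reported_bar_size(u64 memlength)
{
        u64 usable = ALIGN(memlength, NV_MEMBLK_SIZE);

        return roundup_pow_of_two(usable);
}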
[PATCH v17 1/3] vfio/pci: rename and export do_io_rw()
From: Ankit Agrawal do_io_rw() is used to read/write to the device MMIO. The grace hopper VFIO PCI variant driver requires this functionality to read/write to its memory. Rename it as a vfio_pci_core function and export it as GPL. Signed-off-by: Ankit Agrawal --- drivers/vfio/pci/vfio_pci_rdwr.c | 16 +--- include/linux/vfio_pci_core.h| 5 - 2 files changed, 13 insertions(+), 8 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 07fea08ea8a2..03b8f7ada1ac 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -96,10 +96,10 @@ VFIO_IOREAD(32) * reads with -1. This is intended for handling MSI-X vector tables and * leftover space for ROM BARs. */ -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, - void __iomem *io, char __user *buf, - loff_t off, size_t count, size_t x_start, - size_t x_end, bool iswrite) +ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, + void __iomem *io, char __user *buf, + loff_t off, size_t count, size_t x_start, + size_t x_end, bool iswrite) { ssize_t done = 0; int ret; @@ -201,6 +201,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, return done; } +EXPORT_SYMBOL_GPL(vfio_pci_core_do_io_rw); int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar) { @@ -279,8 +280,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf, x_end = vdev->msix_offset + vdev->msix_size; } - done = do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, buf, pos, - count, x_start, x_end, iswrite); + done = vfio_pci_core_do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, buf, pos, + count, x_start, x_end, iswrite); if (done >= 0) *ppos += done; @@ -348,7 +349,8 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, char __user *buf, * probing, so we don't currently worry about access in relation * to the memory enable bit in the command register. */ - done = do_io_rw(vdev, false, iomem, buf, off, count, 0, 0, iswrite); + done = vfio_pci_core_do_io_rw(vdev, false, iomem, buf, off, count, + 0, 0, iswrite); vga_put(vdev->pdev, rsrc); diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 85e84b92751b..cf9480a31f3e 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -130,7 +130,10 @@ void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev); int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar); pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev, pci_channel_state_t state); - +ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, + void __iomem *io, char __user *buf, + loff_t off, size_t count, size_t x_start, + size_t x_end, bool iswrite); #define VFIO_IOWRITE_DECLATION(size) \ int vfio_pci_core_iowrite##size(struct vfio_pci_core_device *vdev, \ bool test_mem, u##size val, void __iomem *io); -- 2.34.1
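For context, a hypothetical caller in a variant driver could then wrap the exported helper like this; the function name and the fake-BAR framing are assumptions, but the argument pattern mirrors the vfio_pci_vga_rw() call in the diff above:

/*
 * Back a fake BAR with device memory: no MSI-X table to exclude
 * (x_start = x_end = 0) and no memory-enable gating (test_mem = false),
 * since the hardware tolerates accesses with memory enable off.
 */
static ssize_t nvgrace_gpu_mem_rw(struct vfio_pci_core_device *vdev,
                                  void __iomem *io, char __user *buf,
                                  loff_t off, size_t count, bool iswrite)
{
        return vfio_pci_core_do_io_rw(vdev, false, io, buf, off, count,
                                      0, 0, iswrite);
}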
[PATCH v17 2/3] vfio/pci: rename and export range_intersect_range
From: Ankit Agrawal range_intersect_range determines the overlap between two ranges. If there is an overlap, the helper function returns the overlapping offset and size. The VFIO PCI variant driver emulates the PCI config space BAR offset registers. These offsets may be accessed for read/write with a variety of lengths including sub-word sizes from sub-word offsets. The driver makes use of this helper function to read/write the targeted part of the emulated register. Make this a vfio_pci_core function, rename and export as GPL. Also update references in the virtio driver. Signed-off-by: Ankit Agrawal --- drivers/vfio/pci/vfio_pci_config.c | 45 +++ drivers/vfio/pci/virtio/main.c | 72 +++--- include/linux/vfio_pci_core.h | 5 +++ 3 files changed, 76 insertions(+), 46 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 672a1804af6a..4fc3c605af13 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1966,3 +1966,48 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf, return done; } + +/** + * vfio_pci_core_range_intersect_range() - Determine overlap between a buffer + * and register offset ranges. + * @range1_start:start offset of the buffer + * @count1: number of buffer bytes. + * @range2_start:start register offset + * @count2: number of bytes of register + * @start_offset:start offset of overlap start in the buffer + * @intersect_count: number of overlapping bytes + * @register_offset: start offset of overlap start in register + * + * The function determines an overlap between a register and a buffer. + * range1 represents the buffer and range2 represents the register. + * + * Returns: true if there is overlap, false if not. + * The overlap start and range are returned through the function args. + */ +bool vfio_pci_core_range_intersect_range(loff_t range1_start, size_t count1, +loff_t range2_start, size_t count2, +loff_t *start_offset, +size_t *intersect_count, +size_t *register_offset) +{ + if (range1_start <= range2_start && + range1_start + count1 > range2_start) { + *start_offset = range2_start - range1_start; + *intersect_count = min_t(size_t, count2, +range1_start + count1 - range2_start); + *register_offset = 0; + return true; + } + + if (range1_start > range2_start && + range1_start < range2_start + count2) { + *start_offset = 0; + *intersect_count = min_t(size_t, count1, +range2_start + count2 - range1_start); + *register_offset = range1_start - range2_start; + return true; + } + + return false; +} +EXPORT_SYMBOL_GPL(vfio_pci_core_range_intersect_range); diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c index d5af683837d3..b5d3a8c5bbc9 100644 --- a/drivers/vfio/pci/virtio/main.c +++ b/drivers/vfio/pci/virtio/main.c @@ -132,33 +132,6 @@ virtiovf_pci_bar0_rw(struct virtiovf_pci_core_device *virtvdev, return ret ?
ret : count; } -static bool range_intersect_range(loff_t range1_start, size_t count1, - loff_t range2_start, size_t count2, - loff_t *start_offset, - size_t *intersect_count, - size_t *register_offset) -{ - if (range1_start <= range2_start && - range1_start + count1 > range2_start) { - *start_offset = range2_start - range1_start; - *intersect_count = min_t(size_t, count2, -range1_start + count1 - range2_start); - *register_offset = 0; - return true; - } - - if (range1_start > range2_start && - range1_start < range2_start + count2) { - *start_offset = 0; - *intersect_count = min_t(size_t, count1, -range2_start + count2 - range1_start); - *register_offset = range1_start - range2_start; - return true; - } - - return false; -} - static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev, char __user *buf, size_t count, loff_t *ppos) @@ -178,16 +151,18 @@ static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev, if (ret < 0) return ret; - if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16), -
[PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
From: Ankit Agrawal NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect. The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (occurs at the time of host system boot). The device memory is accessed instead using the chip-to-chip interconnect that is exposed as a contiguous physically addressable region on the host. This device memory aperture can be obtained from the host ACPI table using device_property_read_u64(), according to the FW specification. Since the device memory is cache coherent with the CPU, it can be mmapped into the user VMA with a cacheable mapping using remap_pfn_range() and used like regular RAM. The device memory is not added to the host kernel, but mapped directly as this reduces memory wastage due to struct pages. There is also a requirement of a reserved 1G uncached region (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide a different non-cached property to the reserved 1G region, it needs to be carved out from the device memory and mapped as a separate region in the Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace. The variant driver exposes these two regions - the non-cached reserved (resmem) and the cached rest of the device memory (termed as usemem) - as separate VFIO 64b BAR regions. This is divergent from the baremetal approach, where the device memory is exposed as a device memory region. The decision for a different approach was taken in view of the fact that it would necessitate additional code in Qemu to discover and insert those regions in the VM IPA, along with the additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top of tree Qemu) out there desiring grace hopper support. Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved out region to the next available PCI BAR (i.e. comprising region 2 and 3). The cached device memory aperture is assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as the BAR2 region and the cacheable one as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers. The hardware ensures that the system does not crash when the memory is accessed with the memory enable turned off. It synthesizes ~0 reads and drops writes on such accesses. So there is no need to support disabling/enabling the BAR through the PCI_COMMAND config space register. The memory layout on the host looks like the following:

               devmem (memlength)
|--------------------------------------------------|
|------------cached--------------------------|--NC--|
|                                            |
usemem.phys/memphys                          resmem.phys

PCI BARs need to be aligned to a power of 2, but the actual memory on the device may not be.
A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes. Note that the GPU device driver [6] is capable of knowing the exact device memory size through separate means. The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module. Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks. Hence make the usable memory size memblock aligned. Currently there is no provision in KVM for an S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. As previously mentioned, resmem is mapped with pgprot_writecombine(), which sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM marks the region with MemAttr[2:0]=0b101 in S2. If the bare metal properties are not present, the driver registers the vfio-pci-core function pointers. This goes along with a qemu series [6] to provide the necessary implementation of the Grace Hopper Superchip firmware specification so that the guest operating system can see the correct ACPI modeling for the coherent GPU device. Verified with the CUDA workload in the VM. [1] https://www.nvidia.com/en-in/technologies/
Re: [PATCH RFC 0/4] Introduce uts_release
On Mon, Feb 5, 2024 at 5:25 PM John Garry wrote: > > On 02/02/2024 15:01, Masahiro Yamada wrote: > >> -- > >> 2.35.3 > > > > As you see, several drivers store UTS_RELEASE in their driver data, > > and even print it in debug print. > > > > > > I do not see why it is useful. > > I would tend to agree, and mentioned that earlier. > > > As you discussed in 3/4, if UTS_RELEASE is unneeded, > > it is better to get rid of it. > > Jakub replied about this. > > > > > > > If such version information is useful for drivers, the intention is > > whether the version of the module, or the version of vmlinux. > > That is a question. > > They differ when CONFIG_MODVERSION. > > > > I think often this information in UTS_RELEASE is shared as informative > only, so the user can conveniently know the specific kernel git version. > > > > > When module developers intend to printk the git version > > from which the module was compiled from, > > presumably they want to use UTS_RELEASE, which > > was expanded at the compile time of the module. > > > > If you replace it with uts_release, it is the git version > > of vmlinux. > > > > > > Of course, the replacement is safe for always-builtin code. > > > > > > > > Lastly, we can avoid using UTS_RELEASE without relying > > on your patch. > > > > > > > > For example, commit 3a3a11e6e5a2bc0595c7e36ae33c861c9e8c75b1 > > replaced UTS_RELEASE with init_uts_ns.name.release > > > > > > So, is your uts_release a shorthand of init_uts_ns.name.release? > > Yes - well that both are strings containing UTS_RELEASE. Using a struct > sub-member is bit ungainly, but I suppose that we should not be making > life easy for people using this. > > However we already have init_utsname in: > > static inline struct new_utsname *init_utsname(void) > { > return &init_uts_ns.name; > } > > So could use init_utsname()->release, which is a bit nicer. > > > > > > > > > I think what you can contribute are: > > > > - Explore the UTS_RELEASE users, and check if you can get rid of it. > > Unfortunately I expect resistance for this. I also expect places like FW > loader it is necessary. And when this is used in sysfs, people will say > that it is part of the ABI now. > > How about I send the patch to update to use init_uts_ns and mention also > that it would be better to not use at all, if possible? I can cc you. OK. As I mentioned in the previous reply, the replacement is safe for builtin code. When you touch modular code, please pay a little more care, because UTS_RELEASE and init_utsname()->release may differ when CONFIG_MODVERSIONS=y. > > > > > - Where UTS_RELEASE is useful, consider if it is possible > > to replace it with init_uts_ns.name.release > > ok, but, as above, could use init_utsname()->release also I am fine with it. init_utsname()->release is more commonly used (but less common than UTS_RELEASE) $ git grep 'init_utsname()->release' | wc 28 922065 $ git grep 'init_uts_ns.name.release' | wc 7 34 587 $ git grep 'UTS_RELEASE' | wc 57 3044741 -- Best Regards Masahiro Yamada
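A minimal sketch of the builtin-code replacement being discussed; both lines print the same string for builtin code, since init_utsname()->release holds vmlinux's UTS_RELEASE at runtime (for modules the two can differ with CONFIG_MODVERSIONS=y, as noted above). The demo function name is hypothetical:

#include <linux/init.h>
#include <linux/printk.h>
#include <linux/utsname.h>
#include <generated/utsrelease.h>

static int __init uts_release_demo_init(void)
{
        pr_info("compile time: %s\n", UTS_RELEASE);              /* macro, expanded at build */
        pr_info("runtime:      %s\n", init_utsname()->release);  /* string in vmlinux        */
        return 0;
}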
Re: Question about the ipi_raise filter usage and output
Hi Steve, On Mon, Feb 5, 2024 at 6:38 PM Steven Rostedt wrote: > > On Mon, 5 Feb 2024 17:57:29 +0800 > richard clark wrote: > > > I try to write below: > > echo 'target_cpus == 11 && reason == "Function call interrupts"' > > > events/ipi/ipi_raise/filter > > You mean when it is sent only to CPU 11? Not when the event is > happening on CPU 11. Like the above example, the event was triggered on > CPU 10, but the mask was for all the other CPUs. > > If you are looking for just CPU 11, you can do: > > echo 'target_cpus == 0x800 && reason == "Function call interrupts"' > Seems both 'target_cpus == 0x800 && reason == "Function call interrupts"' and 'target_cpus & 0x800 && reason == "Function call interrupts"' don't work: # cat events/ipi/ipi_raise/enable 1 # cat events/ipi/ipi_raise/filter target_cpus == 0x800 && reason == "Function call interrupts" The kernel module code snippet: void ipi_func_run_cpu(void *info) { pr_info("remote function runs on cpu[%d].\n", smp_processor_id()); } static int __init ipi_send_init(void) { int target = (smp_processor_id() + 1) % nr_cpu_ids; int ret = smp_call_function_single(target, ipi_func_run_cpu, NULL, true); pr_info("ipi cpu[%d --> %d] ret = %d\n", smp_processor_id(), target, ret); return 0; } ... module_init(ipi_send_init); module_exit(ipi_send_exit); $ sudo taskset -c 10 insmod ipi_send.ko $ dmesg ... [84931.864273] remote function runs on cpu[11]. [84931.864282] ipi cpu[10 --> 11] ret = 0 The 'cat trace' will output the below message with 'reason == "Function call interrupts"' filter: ... sudo-5726[007] dn.h1.. 84302.833545: ipi_raise: target_mask=,0001 (Function call interrupts) sudo-5726[007] dn.h2.. 84302.837544: ipi_raise: target_mask=,0001 (Function call interrupts) insmod-5727[011] dn.h1.. 84302.841545: ipi_raise: target_mask=,0001 (Function call interrupts) insmod-5727[010] 1.. 84302.843966: ipi_raise: target_mask=,0bff (Function call interrupts) insmod-5727[010] 1.. 84302.843975: ipi_raise: target_mask=,0bff (Function call interrupts) insmod-5727[010] 1.. 84302.844184: ipi_raise: target_mask=,0800 (Function call interrupts) ... I find that 'target_cpus == 0xfff && reason == "Function call interrupts"' doesn't have output in the buffer, but 'target_cpus & 0xfff && reason == "Function call interrupts"' does. I also tried to use 'target_cpus & 0xf00 && reason == "Function call interrupts"' in my case, the trace buffer has nothing after the kmod inserted. Any comments? > > > -- Steve
Re: [PATCH] virtio: make virtio_bus const
On Mon, Feb 5, 2024 at 4:52 AM Ricardo B. Marliere wrote: > > Now that the driver core can properly handle constant struct bus_type, > move the virtio_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere > --- Acked-by: Jason Wang Thanks
Re: [PATCH net-next v5 5/5] tools: virtio: introduce vhost_net_test
On Mon, Feb 5, 2024 at 8:46 PM Yunsheng Lin wrote: > > introduce vhost_net_test for both vhost_net tx and rx basing > on virtio_test to test vhost_net changing in the kernel. > > Steps for vhost_net tx testing: > 1. Prepare a out buf. > 2. Kick the vhost_net to do tx processing. > 3. Do the receiving in the tun side. > 4. verify the data received by tun is correct. > > Steps for vhost_net rx testing: > 1. Prepare a in buf. > 2. Do the sending in the tun side. > 3. Kick the vhost_net to do rx processing. > 4. verify the data received by vhost_net is correct. > > Signed-off-by: Yunsheng Lin > --- > tools/virtio/.gitignore| 1 + > tools/virtio/Makefile | 8 +- > tools/virtio/linux/virtio_config.h | 4 + > tools/virtio/vhost_net_test.c | 536 + > 4 files changed, 546 insertions(+), 3 deletions(-) > create mode 100644 tools/virtio/vhost_net_test.c > > diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore > index 9934d48d9a55..7e47b281c442 100644 > --- a/tools/virtio/.gitignore > +++ b/tools/virtio/.gitignore > @@ -1,5 +1,6 @@ > # SPDX-License-Identifier: GPL-2.0-only > *.d > virtio_test > +vhost_net_test > vringh_test > virtio-trace/trace-agent > diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile > index d128925980e0..e25e99c1c3b7 100644 > --- a/tools/virtio/Makefile > +++ b/tools/virtio/Makefile > @@ -1,8 +1,9 @@ > # SPDX-License-Identifier: GPL-2.0 > all: test mod > -test: virtio_test vringh_test > +test: virtio_test vringh_test vhost_net_test > virtio_test: virtio_ring.o virtio_test.o > vringh_test: vringh_test.o vringh.o virtio_ring.o > +vhost_net_test: virtio_ring.o vhost_net_test.o > > try-run = $(shell set -e; \ > if ($(1)) >/dev/null 2>&1; \ > @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean > > .PHONY: all test mod clean vhost oot oot-clean oot-build > clean: > - ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \ > - vhost_test/Module.symvers vhost_test/modules.order *.d > + ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \ > + vhost_test/.*.cmd vhost_test/Module.symvers \ > + vhost_test/modules.order *.d > -include *.d > diff --git a/tools/virtio/linux/virtio_config.h > b/tools/virtio/linux/virtio_config.h > index 2a8a70e2a950..42a564f22f2d 100644 > --- a/tools/virtio/linux/virtio_config.h > +++ b/tools/virtio/linux/virtio_config.h > @@ -1,4 +1,6 @@ > /* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef LINUX_VIRTIO_CONFIG_H > +#define LINUX_VIRTIO_CONFIG_H > #include > #include > #include > @@ -95,3 +97,5 @@ static inline __virtio64 cpu_to_virtio64(struct > virtio_device *vdev, u64 val) > { > return __cpu_to_virtio64(virtio_is_little_endian(vdev), val); > } > + > +#endif > diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c > new file mode 100644 > index ..6c41204e6707 > --- /dev/null > +++ b/tools/virtio/vhost_net_test.c > @@ -0,0 +1,536 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#define _GNU_SOURCE > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define HDR_LENsizeof(struct virtio_net_hdr_mrg_rxbuf) > +#define TEST_BUF_LEN 256 > +#define TEST_PTYPE ETH_P_LOOPBACK > +#define DESC_NUM 256 > + > +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */ > +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end; > + > +struct vq_info { > + int kick; > + int call; > + int idx; > + long started; > + 
long completed; > + struct pollfd fds; > + void *ring; > + /* copy used for control */ > + struct vring vring; > + struct virtqueue *vq; > +}; > + > +struct vdev_info { > + struct virtio_device vdev; > + int control; > + struct vq_info vqs[2]; > + int nvqs; > + void *buf; > + size_t buf_size; > + char *test_buf; > + char *res_buf; > + struct vhost_memory *mem; > + int sock; > + int ifindex; > + unsigned char mac[ETHER_ADDR_LEN]; > +}; > + > +static int tun_alloc(struct vdev_info *dev, char *tun_name) > +{ > + struct ifreq ifr; > + int len = HDR_LEN; > + int fd, e; > + > + fd = open("/dev/net/tun", O_RDWR); > + if (fd < 0) { > + perror("Cannot open /dev/net/tun"); > + return fd; > + } > + > + memset(&ifr, 0, sizeof(ifr)); > + > + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR; > + strncpy(ifr.ifr_name, tun_name, IFNAMSIZ); > + > + e = ioctl(fd, TUNSETIFF, &ifr); > + if (e < 0) { > + perror("ioctl[TUNSETIFF]"); > + close(fd); > +
[PATCH v11] bus: mhi: host: Add tracing support
This change adds ftrace support for the following functions, which helps in debugging issues when there is a channel state or MHI state change and also when we receive data and control events: 1. mhi_intvec_mhi_states 2. mhi_process_data_event_ring 3. mhi_process_ctrl_ev_ring 4. mhi_gen_tre 5. mhi_update_channel_state 6. mhi_tryset_pm_state 7. mhi_pm_st_worker Change the implementation of the arrays which hold the enum-to-string mappings to make it consistent in both the trace header file and other files. Wherever trace events are added, debug messages are removed. Signed-off-by: Krishna chaitanya chundru Reviewed-by: Manivannan Sadhasivam Reviewed-by: "Steven Rostedt (Google)" --- Changes in v11: - Rebased with mhi next. - Link to v10: https://lore.kernel.org/r/20240131-ftrace_support-v10-1-4349306b8...@quicinc.com Changes in v10: - Modified command_start and command_end traces to take string as input to mention the correct - string as suggested by mani - As suggested by mani, modified the print format from lower case to upper case format. - Link to v9: https://lore.kernel.org/r/20240105-ftrace_support-v9-1-a2dca64cc...@quicinc.com Changes in v9: - Change the implementations of some arrays so that the string-to-enum mapping - is the same in both the trace header and other files as suggested by steve. - Link to v8: https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com Changes in v8: - Pass the structure and dereference the variables in TP_fast_assign as suggested by steve - Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com Changes in v7: - change log format as pointed out by mani. - Link to v6: https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com Changes in v6: - use 'rp' directly as suggested by jeffrey. - Link to v5: https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com Changes in v5: - Use DECLARE_EVENT_CLASS for multiple events as suggested by steve. - Instead of converting to u64 to print the address, use %px to print the address to avoid - warnings on some platforms. - Link to v4: https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com Changes in v4: - Fix compilation issues in the previous patch which happened due to rebasing. - In the defconfig the FTRACE config is not enabled, due to that the compilation issue was not - seen in my workspace. - Link to v3: https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com Changes in v3: - move the trace header file from include/trace/events to drivers/bus/mhi/host/ so that - we can include driver header files. - Use macros directly in the trace events as suggested by Jeffrey Hugo. - Reorder the structure in the events as suggested by steve to avoid holes in the buffer. - removed the mhi_to_physical function as this can give security issues. - removed macros to define strings as we can get those from driver headers. Changes in v2: - Passing the raw state into the trace event and using __print_symbolic() as suggested by bjorn. - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn. - Fixed the kernel test robot issues.
- Link to v1: https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com --- drivers/bus/mhi/common.h| 38 +++--- drivers/bus/mhi/host/init.c | 64 + drivers/bus/mhi/host/internal.h | 41 ++ drivers/bus/mhi/host/main.c | 19 ++- drivers/bus/mhi/host/pm.c | 7 +- drivers/bus/mhi/host/trace.h| 280 6 files changed, 384 insertions(+), 65 deletions(-) diff --git a/drivers/bus/mhi/common.h b/drivers/bus/mhi/common.h index f794b9c8049e..dda340aaed95 100644 --- a/drivers/bus/mhi/common.h +++ b/drivers/bus/mhi/common.h @@ -297,30 +297,30 @@ struct mhi_ring_element { __le32 dword[2]; }; +#define MHI_STATE_LIST \ + mhi_state(RESET,"RESET")\ + mhi_state(READY,"READY")\ + mhi_state(M0, "M0") \ + mhi_state(M1, "M1") \ + mhi_state(M2, "M2") \ + mhi_state(M3, "M3") \ + mhi_state(M3_FAST, "M3_FAST") \ + mhi_state(BHI, "BHI") \ + mhi_state_end(SYS_ERR, "SYS ERROR") + +#undef mhi_state +#undef mhi_state_end + +#define mhi_state(a, b)case MHI_STATE_##a: return b; +#define mhi_state_end(a, b)case MHI_STATE_##a: return b; + static inline const char *mhi_state_str(enum mhi_state state) { switch (state) { - case MHI_STATE_RESET: - return "RESET"; - case MHI_STATE_READY: - return "READY"; - case MHI_STATE_M0: - return "M0"; - case MHI_STATE_M1: - return "M1"; -
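The diff is cut off above, but the pattern it introduces is worth spelling out: one X-macro list is expanded twice with different per-entry definitions, so the enum-to-string mapping can never drift apart between the C code and the trace header. A sketch of both expansions follows; the trace.h side is assumed, as it is not visible in this excerpt:

/* common.h side: expand each list entry to a switch case */
#undef mhi_state
#undef mhi_state_end
#define mhi_state(a, b)         case MHI_STATE_##a: return b;
#define mhi_state_end(a, b)     case MHI_STATE_##a: return b;

static inline const char *mhi_state_str(enum mhi_state state)
{
        switch (state) {
        MHI_STATE_LIST
        default:
                return "Unknown state";
        }
}

/* trace.h side (assumed): expand the same list into symbol/name pairs */
#undef mhi_state
#undef mhi_state_end
#define mhi_state(a, b)         { MHI_STATE_##a, b },
#define mhi_state_end(a, b)     { MHI_STATE_##a, b }

TP_printk("state: %s", __print_symbolic(__entry->state, MHI_STATE_LIST))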
Re: [PATCH v11] bus: mhi: host: Add tracing support
On Tue, Feb 06, 2024 at 10:02:05AM +0530, Krishna chaitanya chundru wrote: > This change adds ftrace support for following functions which > helps in debugging the issues when there is Channel state & MHI > state change and also when we receive data and control events: > 1. mhi_intvec_mhi_states > 2. mhi_process_data_event_ring > 3. mhi_process_ctrl_ev_ring > 4. mhi_gen_tre > 5. mhi_update_channel_state > 6. mhi_tryset_pm_state > 7. mhi_pm_st_worker > > Change the implementation of the arrays which has enum to strings mapping > to make it consistent in both trace header file and other files. > > Where ever the trace events are added, debug messages are removed. > > Signed-off-by: Krishna chaitanya chundru There are a lot of checkpatch errors. Please fix them and resubmit. - Mani > Reviewed-by: Manivannan Sadhasivam > Reviewed-by: "Steven Rostedt (Google)" > --- > Changes in v11: > - Rebased with mhi next. > - Link to v10: > https://lore.kernel.org/r/20240131-ftrace_support-v10-1-4349306b8...@quicinc.com > > Changes in v10: > - Modified command_start and command_end traces to take string as input to > mention correct > - string as suggested by mani > - As sugggested by mani modified the print format from lower format to upper > case format. > - Link to v9: > https://lore.kernel.org/r/20240105-ftrace_support-v9-1-a2dca64cc...@quicinc.com > > Changes in v9: > - Change the implementations of some array so that the strings to enum mapping > - is same in both trace header and other files as suggested by steve. > - Link to v8: > https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com > > Changes in v8: > - Pass the structure and derefernce the variables in TP_fast_assign as > suggested by steve > - Link to v7: > https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com > > Changes in v7: > - change log format as pointed by mani. > - Link to v6: > https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com > > Changes in v6: > - use 'rp' directly as suggested by jeffrey. > - Link to v5: > https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com > > Changes in v5: > - Use DECLARE_EVENT_CLASS for multiple events as suggested by steve. > - Instead of converting to u64 to print address, use %px to print the address > to avoid > - warnings in some platforms. > - Link to v4: > https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com > > Changes in v4: > - Fix compilation issues in previous patch which happended due to rebasing. > - In the defconfig FTRACE config is not enabled due to that the compilation > issue is not > - seen in my workspace. > - Link to v3: > https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com > > Changes in v3: > - move trace header file from include/trace/events to drivers/bus/mhi/host/ > so that > - we can include driver header files. > - Use macros directly in the trace events as suggested Jeffrey Hugo. > - Reorder the structure in the events as suggested by steve to avoid holes in > the buffer. > - removed the mhi_to_physical function as this can give security issues. > - removed macros to define strings as we can get those from driver headers. > - Link to v2: > https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com > > Changes in v2: > - Passing the raw state into the trace event and using __print_symbolic() as > suggested by bjorn. > - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn. 
> - Fixed the kernel test rebot issues. > - Link to v1: > https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com > --- > drivers/bus/mhi/common.h| 38 +++--- > drivers/bus/mhi/host/init.c | 64 + > drivers/bus/mhi/host/internal.h | 41 ++ > drivers/bus/mhi/host/main.c | 19 ++- > drivers/bus/mhi/host/pm.c | 7 +- > drivers/bus/mhi/host/trace.h| 280 > > 6 files changed, 384 insertions(+), 65 deletions(-) > > diff --git a/drivers/bus/mhi/common.h b/drivers/bus/mhi/common.h > index f794b9c8049e..dda340aaed95 100644 > --- a/drivers/bus/mhi/common.h > +++ b/drivers/bus/mhi/common.h > @@ -297,30 +297,30 @@ struct mhi_ring_element { > __le32 dword[2]; > }; > > +#define MHI_STATE_LIST \ > + mhi_state(RESET,"RESET")\ > + mhi_state(READY,"READY")\ > + mhi_state(M0, "M0") \ > + mhi_state(M1, "M1") \ > + mhi_state(M2, "M2") \ > + mhi_state(M3, "M3") \ > + mhi_state(M3_FAST, "M3_FAST") \ > + mhi_state(BHI, "BHI") \ > + mhi_state_end(SYS_ERR, "SYS ERROR") > + > +#undef mhi_state > +#undef mhi_state_end > + > +#define mhi_state(a, b) case MHI_STATE_##a: return b; > +#defin
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Steven Rostedt writes: > On Mon, 05 Feb 2024 14:16:30 +0100 > Sven Schnelle wrote: >> >> Another issue i'm hitting sometimes is this part: >> >> csum1=`md5sum trace` >> sleep $SLEEP_TIME >> csum2=`md5sum trace` >> >> if [ "$csum1" != "$csum2" ]; then >> fail "Tracing file is still changing" >> fi >> >> This is because the command line was replaced in the >> saved_cmdlines_buffer, an example diff between both files >> is: > > [..] > >> >> This can be improved by: >> >> echo 32768 > /sys/kernel/tracing/saved_cmdlines_size >> >> But this is of course not a fix - should we maybe replace the program >> name with <...> before comparing, remove the check completely, or do >> anything else? What do you think? > > Hmm, actually I would say that this exposes a real bug. Not a major > one, but one that I find annoying. The saved commandlines should only > be updated when a trace event occurs. But really, it should only be > updated if one is added to the ring buffer. If the ring buffer isn't > being updated, we shouldn't be adding new command lines. > > There may be a location that has tracing off but still updating the > cmdlines which will break the saved cache. Looking at trace_save_cmdline(): tpid = tsk->pid & (PID_MAX_DEFAULT - 1); where PID_MAX_DEFAULT = 0x8000 so this is basically tpid = tsk->pid & 0x7fff; further on: // might clash with another pid if (otherpid & 0x7fff) == (tsk->pid & 0x7fff) idx = savedcmd->map_pid_to_cmdline[tpid]; if (idx == NO_CMDLINE_MAP) { // This will pick an existing entry if there are // more than cmdline_num entries present idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num; savedcmd->map_pid_to_cmdline[tpid] = idx; savedcmd->cmdline_idx = idx; } So I think the problem that sometimes '<...>' instead of the correct comm is logged is just expected behaviour given the code above.
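A tiny user-space illustration of that masking: two pids that differ by PID_MAX_DEFAULT hash to the same map_pid_to_cmdline slot, so the later task evicts the earlier task's saved comm and the earlier pid then resolves to '<...>':

#include <stdio.h>

#define PID_MAX_DEFAULT 0x8000

int main(void)
{
        int pid1 = 0x1234;
        int pid2 = pid1 + PID_MAX_DEFAULT;      /* 0x9234 */

        /* both lines print slot 0x1234 */
        printf("pid %#x -> slot %#x\n", pid1, pid1 & (PID_MAX_DEFAULT - 1));
        printf("pid %#x -> slot %#x\n", pid2, pid2 & (PID_MAX_DEFAULT - 1));
        return 0;
}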
Re: [PATCH v11] bus: mhi: host: Add tracing support
On 2/6/2024 11:56 AM, Manivannan Sadhasivam wrote: On Tue, Feb 06, 2024 at 10:02:05AM +0530, Krishna chaitanya chundru wrote: This change adds ftrace support for following functions which helps in debugging the issues when there is Channel state & MHI state change and also when we receive data and control events: 1. mhi_intvec_mhi_states 2. mhi_process_data_event_ring 3. mhi_process_ctrl_ev_ring 4. mhi_gen_tre 5. mhi_update_channel_state 6. mhi_tryset_pm_state 7. mhi_pm_st_worker Change the implementation of the arrays which has enum to strings mapping to make it consistent in both trace header file and other files. Where ever the trace events are added, debug messages are removed. Signed-off-by: Krishna chaitanya chundru There are a lot of checkpatch errors. Please fix them and resubmit. - Mani Hi Mani, The errors that checkpatch points out are false positives; they have been reported since v2, and the kernel test bot also does not throw any errors for them. I checked with the internal team and they said these errors are false positives. Thanks & Regards, Krishna Chaitanya. Reviewed-by: Manivannan Sadhasivam Reviewed-by: "Steven Rostedt (Google)" --- Changes in v11: - Rebased with mhi next. - Link to v10: https://lore.kernel.org/r/20240131-ftrace_support-v10-1-4349306b8...@quicinc.com Changes in v10: - Modified command_start and command_end traces to take string as input to mention correct - string as suggested by mani - As sugggested by mani modified the print format from lower format to upper case format. - Link to v9: https://lore.kernel.org/r/20240105-ftrace_support-v9-1-a2dca64cc...@quicinc.com Changes in v9: - Change the implementations of some array so that the strings to enum mapping - is same in both trace header and other files as suggested by steve. - Link to v8: https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com Changes in v8: - Pass the structure and derefernce the variables in TP_fast_assign as suggested by steve - Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com Changes in v7: - change log format as pointed by mani. - Link to v6: https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com Changes in v6: - use 'rp' directly as suggested by jeffrey. - Link to v5: https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com Changes in v5: - Use DECLARE_EVENT_CLASS for multiple events as suggested by steve. - Instead of converting to u64 to print address, use %px to print the address to avoid - warnings in some platforms. - Link to v4: https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com Changes in v4: - Fix compilation issues in previous patch which happended due to rebasing. - In the defconfig FTRACE config is not enabled due to that the compilation issue is not - seen in my workspace. - Link to v3: https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com Changes in v3: - move trace header file from include/trace/events to drivers/bus/mhi/host/ so that - we can include driver header files. - Use macros directly in the trace events as suggested Jeffrey Hugo. - Reorder the structure in the events as suggested by steve to avoid holes in the buffer. - removed the mhi_to_physical function as this can give security issues. - removed macros to define strings as we can get those from driver headers.
- Link to v2: https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com Changes in v2: - Passing the raw state into the trace event and using __print_symbolic() as suggested by bjorn. - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn. - Fixed the kernel test rebot issues. - Link to v1: https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com --- drivers/bus/mhi/common.h| 38 +++--- drivers/bus/mhi/host/init.c | 64 + drivers/bus/mhi/host/internal.h | 41 ++ drivers/bus/mhi/host/main.c | 19 ++- drivers/bus/mhi/host/pm.c | 7 +- drivers/bus/mhi/host/trace.h| 280 6 files changed, 384 insertions(+), 65 deletions(-) diff --git a/drivers/bus/mhi/common.h b/drivers/bus/mhi/common.h index f794b9c8049e..dda340aaed95 100644 --- a/drivers/bus/mhi/common.h +++ b/drivers/bus/mhi/common.h @@ -297,30 +297,30 @@ struct mhi_ring_element { __le32 dword[2]; }; +#define MHI_STATE_LIST\ + mhi_state(RESET,"RESET") \ + mhi_state(READY,"READY") \ + mhi_state(M0, "M0") \ + mhi_state(M1, "M1") \ + mhi_state(M2, "M2") \ + mhi_state(M3, "M3") \ + mhi_state(M3_FAST, "M3_FAST")\ + mhi_state(BHI, "BHI")\ + mhi_state_end(SYS_ERR, "SYS ERROR") + +#undef mhi_state
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
On 2/5/24 07:53, Sven Schnelle wrote:

tracer_tracing_is_on() checks whether record_disabled is not zero. This
checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading
the source, it looks like this function should only check for the
RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes
spurious failures in the 'test for function traceon/off triggers' test from
the ftrace testsuite when the system is under load.

Signed-off-by: Sven Schnelle
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..47e221e1e720 100644

Tested-by: Mete Durlu
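
For context, the two ring-buffer helpers differ roughly as follows
(a paraphrased sketch of kernel/trace/ring_buffer.c, not the exact hunk):

/* record_disabled is an atomic counter; RB_BUFFER_OFF is a single high
 * bit inside it that marks the buffer as deliberately switched off.
 */
#define RB_BUFFER_OFF		(1 << 20)

/* True only while nothing has recording disabled, including transient
 * internal disables, e.g. while the buffer is being resized.
 */
bool ring_buffer_record_is_on(struct trace_buffer *buffer)
{
	return !atomic_read(&buffer->record_disabled);
}

/* True unless the buffer was explicitly switched off; transient counter
 * increments under load do not flip the result, which is why the test
 * no longer fails spuriously.
 */
bool ring_buffer_record_is_set_on(struct trace_buffer *buffer)
{
	return !(atomic_read(&buffer->record_disabled) & RB_BUFFER_OFF);
}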
Re: [PATCH net-next v5 5/5] tools: virtio: introduce vhost_net_test
On 2024/2/6 11:08, Jason Wang wrote:
...
>> +
>> +static void wait_for_interrupt(struct vq_info *vq)
>> +{
>> +	unsigned long long val;
>> +
>> +	poll(&vq->fds, 1, -1);
>
> It's not good to wait indefinitely.

How about a timeout value of 100ms, as below?

	poll(&vq->fds, 1, 100);

>
>> +
>> +	if (vq->fds.revents & POLLIN)
>> +		read(vq->fds.fd, &val, sizeof(val));
>> +}
>> +
>> +static void verify_res_buf(char *res_buf)
>> +{
>> +	int i;
>> +
>> +	for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
>> +		assert(res_buf[i] == (char)i);
>> +}
>> +
>> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
>> +			bool delayed, int bufs)
>> +{
>> +	long long spurious = 0;
>> +	struct scatterlist sl;
>> +	unsigned int len;
>> +	int r;
>> +
>> +	for (;;) {
>> +		long started_before = vq->started;
>> +		long completed_before = vq->completed;
>> +
>> +		virtqueue_disable_cb(vq->vq);
>> +		do {
>> +			while (vq->started < bufs &&
>> +			       (vq->started - vq->completed) < 1) {
>> +				sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
>> +				r = virtqueue_add_outbuf(vq->vq, &sl, 1,
>> +							 dev->test_buf + vq->started,
>> +							 GFP_ATOMIC);
>> +				if (unlikely(r != 0))
>> +					break;
>> +
>> +				++vq->started;
>
> If we never decrease started/completed shouldn't we use unsigned here?
> (as well as completed)
>
> Otherwise we may get unexpected results for vq->started as well as
> vq->completed.

We have the "vq->started < bufs" check before the increment, as above, and
there is 'assert(nbufs > 0)' when getting optarg in main(), which means we
never allow started/completed to be greater than nbufs, as I understand it.

>
>> +
>> +				if (unlikely(!virtqueue_kick(vq->vq))) {
>> +					r = -1;
>> +					break;
>> +				}
>> +			}
>> +
>> +			if (vq->started >= bufs)
>> +				r = -1;
>
> Which condition do we reach here? It is also a copy & paste of virtio_test.c

It means we have finished adding the outbuf to the virtqueue, and 'r' is set
to '-1' so that we can break out of the inner loop if virtqueue_get_buf()
returns no result, as I understand it.

>
>> +
>> +			/* Flush out completed bufs if any */
>> +			while (virtqueue_get_buf(vq->vq, &len)) {
>> +				int n;
>> +
>> +				n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
>> +				assert(n == TEST_BUF_LEN);
>> +				verify_res_buf(dev->res_buf);
>> +
>> +				++vq->completed;
>> +				r = 0;
>> +			}
>> +		} while (r == 0);
>> +
>> +		if (vq->completed == completed_before && vq->started == started_before)
>> +			++spurious;
>> +
>> +		assert(vq->completed <= bufs);
>> +		assert(vq->started <= bufs);
>> +		if (vq->completed == bufs)
>> +			break;
>> +
>> +		if (delayed) {
>> +			if (virtqueue_enable_cb_delayed(vq->vq))
>> +				wait_for_interrupt(vq);
>> +		} else {
>> +			if (virtqueue_enable_cb(vq->vq))
>> +				wait_for_interrupt(vq);
>> +		}
>
> This could be simplified with
>
> if (delayed)
> else
>
> wait_for_interrupt(vq)

I am not sure I understand the above comment. wait_for_interrupt() is only
called conditionally, depending on the return value of
virtqueue_enable_cb_delayed() or virtqueue_enable_cb().

>
>> +	}
>> +	printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
>> +	       spurious, vq->started, vq->completed);
>> +}
>> +
...
>> +
>> +		/* Flush out completed bufs if any */
>> +		while (virtqueue_get_buf(vq->vq, &len)) {
>> +			struct ether_header *eh;
>> +
>> +			eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
>> +
>> +			/* tun netdev is up and running, ignore the
>> +			 * non-TEST_PTYPE packet.
>> +			 */
>> +			if (eh->ether_type != htons(TEST_PTYPE)) {
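
If the bounded poll() is adopted, the helper would presumably also want to
distinguish a timeout from a real event. A minimal sketch of what that could
look like, assuming the vq_info layout quoted above (this is not the actual
patch):

/* Wait up to 100ms for the interrupt eventfd to fire. A zero return
 * from poll() means the timeout expired with no event, in which case
 * the caller simply retries its enable_cb/get_buf loop.
 */
static void wait_for_interrupt(struct vq_info *vq)
{
	unsigned long long val;
	int r;

	r = poll(&vq->fds, 1, 100);
	assert(r >= 0);	/* poll() itself must not fail */
	if (r == 0)
		return;	/* timed out, no interrupt */

	if (vq->fds.revents & POLLIN)
		read(vq->fds.fd, &val, sizeof(val));
}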