Re: [PATCH RT 0/1] Linux v4.19.306-rt132-rc1
On 2024-02-02 18:04:56 [+0100], Daniel Wagner wrote:
> Dear RT Folks,

Hi,

> This is the RT stable review cycle of patch 4.19.306-rt132-rc1.
…
> Please scream at me if I messed something up. Please test the patches
> too.

Good.

> Enjoy!
> Daniel

Sebastian
Re: [PATCH RFC 0/4] Introduce uts_release
On 02/02/2024 15:01, Masahiro Yamada wrote:
> As you see, several drivers store UTS_RELEASE in their driver data,
> and even print it in debug print.
>
> I do not see why it is useful.

I would tend to agree, and mentioned that earlier. As you discussed in
3/4, if UTS_RELEASE is unneeded, it is better to get rid of it. Jakub
replied about this.

> If such version information is useful for drivers, the intention is
> whether the version of the module, or the version of vmlinux. That is
> a question. They differ when CONFIG_MODVERSION.

I think often this information in UTS_RELEASE is shared as informative
only, so the user can conveniently know the specific kernel git version.

> When module developers intend to printk the git version from which the
> module was compiled, presumably they want to use UTS_RELEASE, which was
> expanded at the compile time of the module. If you replace it with
> uts_release, it is the git version of vmlinux. Of course, the
> replacement is safe for always-builtin code.
>
> Lastly, we can avoid using UTS_RELEASE without relying on your patch.
>
> For example, commit 3a3a11e6e5a2bc0595c7e36ae33c861c9e8c75b1 replaced
> UTS_RELEASE with init_uts_ns.name.release
>
> So, is your uts_release a shorthand of init_uts_ns.name.release?

Yes - well, both are strings containing UTS_RELEASE. Using a struct
sub-member is a bit ungainly, but I suppose that we should not be making
life easy for people using this.

However, we already have init_utsname in:

static inline struct new_utsname *init_utsname(void)
{
	return &init_uts_ns.name;
}

so we could use init_utsname()->release, which is a bit nicer.

> I think what you can contribute are:
>
>  - Explore the UTS_RELEASE users, and check if you can get rid of it.

Unfortunately I expect resistance to this. I also expect that in places
like the FW loader it is necessary. And when this is used in sysfs,
people will say that it is part of the ABI now.

How about I send a patch to update these users to init_uts_ns and
mention also that it would be better to not use it at all, if possible?
I can cc you.

>  - Where UTS_RELEASE is useful, consider if it is possible to replace
>    it with init_uts_ns.name.release

OK, but, as above, we could use init_utsname()->release also.

Thanks,
John
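For illustration, a minimal sketch (not from the series) of what the
suggested replacement looks like in a driver's debug print;
example_print_release() is a hypothetical name:

#include <linux/printk.h>
#include <linux/utsname.h>

/* Hypothetical example: report the running kernel's release string.
 * UTS_RELEASE here would be baked in at the module's compile time;
 * init_utsname()->release always reflects the running vmlinux. */
static void example_print_release(void)
{
	pr_info("running kernel release: %s\n", init_utsname()->release);
}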
Re: [PATCH v2 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
On 2/2/24 20:53, Mathieu Poirier wrote:
> On Thu, Feb 01, 2024 at 07:33:35PM +0100, Arnaud POULIQUEN wrote:
>>
>> On 2/1/24 17:02, Mathieu Poirier wrote:
>>> On Thu, Feb 01, 2024 at 04:06:37PM +0100, Arnaud POULIQUEN wrote:

hello Mathieu,

On 1/31/24 19:52, Mathieu Poirier wrote:
> On Tue, Jan 30, 2024 at 10:13:48AM +0100, Arnaud POULIQUEN wrote:
>>
>> On 1/26/24 18:11, Mathieu Poirier wrote:
>>> On Thu, Jan 18, 2024 at 11:04:33AM +0100, Arnaud Pouliquen wrote:

The new TEE remoteproc device is used to manage remote firmware in a
secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
introduced to delegate the loading of the firmware to the trusted
execution context. In such cases, the firmware should be signed and
adhere to the image format defined by the TEE.

Signed-off-by: Arnaud Pouliquen
---
V1 to V2 update:
- remove the select "TEE_REMOTEPROC" in STM32_RPROC config as detected
  by the kernel test robot:
      WARNING: unmet direct dependencies detected for TEE_REMOTEPROC
      Depends on [n]: REMOTEPROC [=y] && OPTEE [=n]
      Selected by [y]:
      - STM32_RPROC [=y] && (ARCH_STM32 || COMPILE_TEST [=y]) && REMOTEPROC [=y]
- Fix initialized trproc variable in stm32_rproc_probe
---
 drivers/remoteproc/stm32_rproc.c | 149 ++++++++++++++++++++++++++++++-
 1 file changed, 144 insertions(+), 5 deletions(-)

diff --git a/drivers/remoteproc/stm32_rproc.c b/drivers/remoteproc/stm32_rproc.c
index fcc0001e2657..cf6a21bac945 100644
--- a/drivers/remoteproc/stm32_rproc.c
+++ b/drivers/remoteproc/stm32_rproc.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include

 #include "remoteproc_internal.h"
@@ -49,6 +50,9 @@
 #define M4_STATE_STANDBY	4
 #define M4_STATE_CRASH		5

+/* Remote processor unique identifier aligned with the Trusted Execution Environment definitions */
+#define STM32_MP1_M4_PROC_ID	0
+
 struct stm32_syscon {
 	struct regmap *map;
 	u32 reg;
@@ -90,6 +94,8 @@ struct stm32_rproc {
 	struct stm32_mbox mb[MBOX_NB_MBX];
 	struct workqueue_struct *workqueue;
 	bool hold_boot_smc;
+	bool fw_loaded;
+	struct tee_rproc *trproc;
 	void __iomem *rsc_va;
 };

@@ -257,6 +263,91 @@ static int stm32_rproc_release(struct rproc *rproc)
 	return err;
 }

+static int stm32_rproc_tee_elf_sanity_check(struct rproc *rproc,
+					    const struct firmware *fw)
+{
+	struct stm32_rproc *ddata = rproc->priv;
+	unsigned int ret = 0;
+
+	if (rproc->state == RPROC_DETACHED)
+		return 0;
+
+	ret = tee_rproc_load_fw(ddata->trproc, fw);
+	if (!ret)
+		ddata->fw_loaded = true;
+
+	return ret;
+}
+
+static int stm32_rproc_tee_elf_load(struct rproc *rproc,
+				    const struct firmware *fw)
+{
+	struct stm32_rproc *ddata = rproc->priv;
+	unsigned int ret;
+
+	/*
+	 * This function can be called by remote proc for recovery
+	 * without the sanity check. In this case we need to load the firmware
+	 * else nothing done here as the firmware has been preloaded for the
+	 * sanity check to be able to parse it for the resource table.
+	 */

>>> This comment is very confusing - please consider refactoring.

+	if (ddata->fw_loaded)
+		return 0;
+

>>> I'm not sure about keeping a flag to indicate the status of the loaded
>>> firmware. It is not done for the non-secure method, I don't see why it
>>> would be needed for the secure one.

>> The difference is in the sanity check:
>> - in rproc_elf_sanity_check we parse the elf file to verify that it is
>>   valid;
>> - in stm32_rproc_tee_elf_sanity_check we have to do the same, which
>>   means authenticating it; the authentication is done during the load.
>>
>> So this flag is used to avoid loading it a second time.
>> Refactoring the comment should help to understand this flag.
>>
>> An alternative would be to bypass the sanity check. Bu
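For reference, the load-once behaviour being debated, recapped as a
minimal illustrative sketch (names taken from the patch above; this is
not the final code):

static int tee_elf_load_sketch(struct rproc *rproc,
			       const struct firmware *fw)
{
	struct stm32_rproc *ddata = rproc->priv;

	/*
	 * If the sanity check already authenticated (and thereby
	 * loaded) the firmware in the TEE, there is nothing left to
	 * do; recovery paths that skipped the sanity check still load
	 * it here.
	 */
	if (ddata->fw_loaded)
		return 0;

	return tee_rproc_load_fw(ddata->trproc, fw);
}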
Re: [PATCH v13 2/6] ring-buffer: Introducing ring-buffer mapping functions
On Sat, Feb 03, 2024 at 07:33:51PM -0500, Steven Rostedt wrote:
> On Mon, 29 Jan 2024 14:27:58 +0000
> Vincent Donnefort wrote:
>
> > --- /dev/null
> > +++ b/include/uapi/linux/trace_mmap.h
> > @@ -0,0 +1,43 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _TRACE_MMAP_H_
> > +#define _TRACE_MMAP_H_
> > +
> > +#include <linux/types.h>
> > +
> > +/**
> > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > + * @meta_page_size:	Size of this meta-page.
> > + * @meta_struct_len:	Size of this structure.
> > + * @subbuf_size:	Size of each subbuf, including the header.
>
> That is "the header"?

I mean the header of the "bpage". But that is confusing. I'll either
remove that mention or just say it is aligned with the order of a
system page size.

> > + * @nr_subbufs:	Number of subbufs in the ring-buffer.
> > + * @reader.lost_events:	Number of events lost at the time of the reader swap.
> > + * @reader.id:	subbuf ID of the current reader. From 0 to @nr_subbufs - 1
> > + * @reader.read:	Number of bytes read on the reader subbuf.
>
> Note, flags needs a comment.
>
> > + * @entries:	Number of entries in the ring-buffer.
> > + * @overrun:	Number of entries lost in the ring-buffer.
> > + * @read:	Number of entries that have been read.
>
> So do the two Reserved words, otherwise I'm going to start getting
> error reports about sparse warnings that check kerneldoc comments
> against their content. Every field needs to be represented in the
> kerneldoc comment.

Ack, wasn't sure it was needed.

> -- Steve
>
> > + */
> > +struct trace_buffer_meta {
> > +	__u32	meta_page_size;
> > +	__u32	meta_struct_len;
> > +
> > +	__u32	subbuf_size;
> > +	__u32	nr_subbufs;
> > +
> > +	struct {
> > +		__u64	lost_events;
> > +		__u32	id;
> > +		__u32	read;
> > +	} reader;
> > +
> > +	__u64	flags;
> > +
> > +	__u64	entries;
> > +	__u64	overrun;
> > +	__u64	read;
> > +
> > +	__u64	Reserved1;
> > +	__u64	Reserved2;
> > +};
> > +
> > +#endif /* _TRACE_MMAP_H_ */
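As a rough illustration of how userspace might consume this header, a
hypothetical sketch that maps a per-CPU buffer and reads a few
meta-page fields; the tracefs path and the meta-page-first mapping
layout are assumptions based on this series, not a settled ABI:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/trace_mmap.h>	/* the new UAPI header from this patch */

int main(void)
{
	int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
		      O_RDONLY);
	struct trace_buffer_meta *meta;

	if (fd < 0)
		return 1;

	/* Assumption: the first mapped page of the buffer is the meta-page. */
	meta = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, fd, 0);
	if (meta == MAP_FAILED)
		return 1;

	printf("subbuf_size=%u nr_subbufs=%u entries=%llu\n",
	       meta->subbuf_size, meta->nr_subbufs,
	       (unsigned long long)meta->entries);

	munmap(meta, getpagesize());
	close(fd);
	return 0;
}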
[PATCH 0/3] Add PPG support for PMI632 LPG dtsi
Hook up the PBS & SDAM to the PMI632 LPG so that we can use the
hw_pattern for the LEDs.

Signed-off-by: Luca Weiss
---
Luca Weiss (3):
      dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
      arm64: dts: qcom: pmi632: Add PBS client and use in LPG node
      arm64: defconfig: Enable QCOM PBS

 Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml | 4 ++++
 arch/arm64/boot/dts/qcom/pmi632.dtsi                      | 9 +++++++++
 arch/arm64/configs/defconfig                              | 1 +
 3 files changed, 14 insertions(+)
---
base-commit: 1f790ac9c84028d89ef4dbb28ecc5771fc352e25
change-id: 20240117-pmi632-ppg-f1efb4318722

Best regards,
-- 
Luca Weiss
[PATCH 1/3] dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
Add the PBS (Programmable Boot Sequencer) to the list of devices.

Signed-off-by: Luca Weiss
---
 Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml b/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml
index 8103fb61a16c..b7f01cbb8fff 100644
--- a/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml
+++ b/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml
@@ -160,6 +160,10 @@ patternProperties:
     type: object
     $ref: /schemas/nvmem/qcom,spmi-sdam.yaml#
 
+  "^pbs@[0-9a-f]+$":
+    type: object
+    $ref: /schemas/soc/qcom/qcom,pbs.yaml#
+
   "phy@[0-9a-f]+$":
     type: object
     $ref: /schemas/phy/qcom,snps-eusb2-repeater.yaml#
-- 
2.43.0
[PATCH 2/3] arm64: dts: qcom: pmi632: Add PBS client and use in LPG node
With SDAM + PBS the LPG driver can configure the LED pattern in hardware.

Signed-off-by: Luca Weiss
---
 arch/arm64/boot/dts/qcom/pmi632.dtsi | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/pmi632.dtsi b/arch/arm64/boot/dts/qcom/pmi632.dtsi
index 4eb79e0ce40a..d2bb49a619d7 100644
--- a/arch/arm64/boot/dts/qcom/pmi632.dtsi
+++ b/arch/arm64/boot/dts/qcom/pmi632.dtsi
@@ -127,6 +127,11 @@ pmi632_adc_tm: adc-tm@3500 {
 			status = "disabled";
 		};
 
+		pmi632_pbs_client3: pbs@7400 {
+			compatible = "qcom,pmi632-pbs", "qcom,pbs";
+			reg = <0x7400>;
+		};
+
 		pmi632_sdam_7: nvram@b600 {
 			compatible = "qcom,spmi-sdam";
 			reg = <0xb600>;
@@ -155,6 +160,10 @@ pmic@3 {
 		pmi632_lpg: pwm {
 			compatible = "qcom,pmi632-lpg";
 
+			nvmem = <&pmi632_sdam_7>;
+			nvmem-names = "lpg_chan_sdam";
+			qcom,pbs = <&pmi632_pbs_client3>;
+
 			#address-cells = <1>;
 			#size-cells = <0>;
 			#pwm-cells = <2>;
-- 
2.43.0
[PATCH 3/3] arm64: defconfig: Enable QCOM PBS
Enable the PBS driver used on e.g. PMI632.

Signed-off-by: Luca Weiss
---
 arch/arm64/configs/defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index cfa3e00def09..e92a5fd9f660 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -1375,6 +1375,7 @@ CONFIG_QCOM_STATS=m
 CONFIG_QCOM_WCNSS_CTRL=m
 CONFIG_QCOM_APR=m
 CONFIG_QCOM_ICC_BWMON=m
+CONFIG_QCOM_PBS=m
 CONFIG_ARCH_R8A77995=y
 CONFIG_ARCH_R8A77990=y
 CONFIG_ARCH_R8A77951=y
-- 
2.43.0
Question about the ipi_raise filter usage and output
Hi guys,

With the ipi_raise event enabled and filtered with:

  echo 'reason == "Function call interrupts"' > filter

the 'cat trace' output shows messages like:

...
insmod-3355 [010] 1.. 24479.230381: ipi_raise: target_mask=,0bff (Function call interrupts)
...

The above output is triggered by my kernel module, which smp cross
calls a remote function from cpu#10 to cpu#11. For the 'target_mask'
value, what does the ',0bff' mean?

~~

Another question is about the filter: I'd like to catch only the IPIs
happening on cpu#11 *AND* caused by a remote function call, so how do I
write the 'target_cpus' part of the filter expression?

I tried:

  echo 'target_cpus == 11 && reason == "Function call interrupts"' > events/ipi/ipi_raise/filter

But 'cat trace' doesn't show anything about cpu#11 IPI info, although
both /proc/interrupts and smp_processor_id() in the remote function
show there are IPIs sent to cpu#11.

Any suggestions? Thank you!
Re: [PATCH v13 3/6] tracing: Add snapshot refcount
On Tue, 30 Jan 2024 10:32:45 +0000
Vincent Donnefort wrote:

> > All errors (new ones prefixed by >>):
> >
> >    kernel/trace/trace.c: In function 'tracing_set_tracer':
> >    kernel/trace/trace.c:6644:17: error: implicit declaration of function
> >    'tracing_disarm_snapshot_locked'; did you mean 'tracing_disarm_snapshot'?
> >    [-Werror=implicit-function-declaration]
> >     6644 |                 tracing_disarm_snapshot_locked(tr);
> >          |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >          |                 tracing_disarm_snapshot
> > >> kernel/trace/trace.c:6648:23: error: implicit declaration of function
> > >> 'tracing_arm_snapshot_locked'; did you mean 'tracing_arm_snapshot'?
> > >> [-Werror=implicit-function-declaration]
> >     6648 |           ret = tracing_arm_snapshot_locked(tr);
> >          |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> >          |                 tracing_arm_snapshot
> >    cc1: some warnings being treated as errors
>
> Right, two tracers (hwlat and osnoise) select _only_ MAX_TRACE and not
> TRACER_SNAPSHOT.
>
> However, AFAICT, they will not call any of the swapping functions (they
> don't set use_max_tr). So I suppose arm/disarm can be omitted in that
> case.

Yeah, if you can test with the various configs enabled and disabled to
make sure that it still builds properly, then that should be good.

I should make sure that my own ktest config that I use to run tests
checks these variations too.

-- Steve
Re: Question about the ipi_raise filter usage and output
On Mon, Feb 05, 2024 at 05:57:29PM +0800, richard clark wrote:
> Hi guys,
>
> With the ipi_raise event enabled and filtered with:
> echo 'reason == "Function call interrupts"' > filter, then the 'cat
> trace' output below messages:
> ...
> insmod-3355 [010] 1.. 24479.230381: ipi_raise:
> target_mask=,0bff (Function call interrupts)
> ...
> The above output is triggered by my kernel module where it will smp
> cross call a remote function from cpu#10 to cpu#11, for the
> 'target_mask' value, what does the ',0bff' mean?

That's a cpumask bitmap: 0xbff is 0b1011_1111_1111, which is:

     ,- CPU 10
     |
    1011_1111_1111
    | '----------'
    |      |
    |      `- CPUs 9 to 0
    `- CPU 11

Note that bitmap has CPUs 0-9 and CPU 11 set, but CPU 10 is not set.

I suspect your kernel module has generated the bitmap incorrectly; it
looks like you have a mask for CPUs 0-11 minus a mask for CPU 10? For
CPUs 10 and 11, that should be 0xc00 / 0b1100_0000_0000.

> ~~
>
> Another question is for the filter, I'd like to catch the IPI only
> happening on cpu#11 *AND* a remote function call, so how to write the
> 'target_cpus' in the filter expression?
>
> I try to write below:
> echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> events/ipi/ipi_raise/filter

The '==' checks if the target_cpus bitmap *only* contains CPU 11. If
the cpumask contains other CPUs, the filter will skip the call.

I believe you can use '&' to check whether a cpumask contains a CPU,
e.g.

  'target_cpus & 11'

Thanks,
Mark.
Re: Question about the ipi_raise filter usage and output
On Mon, 5 Feb 2024 17:57:29 +0800
richard clark wrote:

> Hi guys,
>
> With the ipi_raise event enabled and filtered with:
> echo 'reason == "Function call interrupts"' > filter, then the 'cat
> trace' output below messages:
> ...
> insmod-3355 [010] 1.. 24479.230381: ipi_raise:
> target_mask=,0bff (Function call interrupts)
> ...
> The above output is triggered by my kernel module where it will smp
> cross call a remote function from cpu#10 to cpu#11, for the
> 'target_mask' value, what does the ',0bff' mean?
> ~~

It's the CPU mask. bff is bits 1011 1111 1111 or CPUs = 0-9,11.

> Another question is for the filter, I'd like to catch the IPI only
> happening on cpu#11 *AND* a remote function call, so how to write the
> 'target_cpus' in the filter expression?
>
> I try to write below:
> echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> events/ipi/ipi_raise/filter

You mean when it is sent only to CPU 11? Not when the event is
happening on CPU 11. Like the above example, the event was triggered on
CPU 10, but the mask was for all the other CPUs.

If you are looking for just CPU 11, you can do:

  echo 'target_cpus == 0x800 && reason == "Function call interrupts"'

> But the 'cat trace' doesn't show anything about cpu#11 IPI info,
> although both the /proc/interrupts and the smp_processor_id() in the
> remote function shows there's IPI sent to the cpu#11.

-- Steve
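Since the filter wants a raw hex mask rather than a CPU number, the
arithmetic is just a shift; a trivial illustrative snippet in C:

#include <stdio.h>

int main(void)
{
	int cpu = 11;

	/* Bit 'cpu' of the mask: 1UL << 11 == 0x800, matching the
	 * 'target_cpus == 0x800' filter above. */
	printf("mask for CPU %d: %#lx\n", cpu, 1UL << cpu);
	return 0;
}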
Re: [PATCH v3 1/3] kasan: switch kunit tests to console tracepoints
On 7 Jan 2024, at 19:22, Paul Heidekrüger wrote:
> On 12.12.2023 10:32, Marco Elver wrote:
>> On Tue, 12 Dec 2023 at 10:19, Paul Heidekrüger wrote:
>>>
>>> On 12.12.2023 00:37, Andrey Konovalov wrote:
>>>> On Tue, Dec 12, 2023 at 12:35 AM Paul Heidekrüger wrote:
>>>>>
>>>>> Using CONFIG_FTRACE=y instead of CONFIG_TRACEPOINTS=y produces the
>>>>> same error for me.
>>>>>
>>>>> So
>>>>>
>>>>>     CONFIG_KUNIT=y
>>>>>     CONFIG_KUNIT_ALL_TESTS=n
>>>>>     CONFIG_FTRACE=y
>>>>>     CONFIG_KASAN=y
>>>>>     CONFIG_KASAN_GENERIC=y
>>>>>     CONFIG_KASAN_KUNIT_TEST=y
>>>>>
>>>>> produces
>>>>>
>>>>>     ➜ ./tools/testing/kunit/kunit.py run --kunitconfig=mm/kasan/.kunitconfig --arch=arm64
>>>>>     Configuring KUnit Kernel ...
>>>>>     Regenerating .config ...
>>>>>     Populating config with:
>>>>>     $ make ARCH=arm64 O=.kunit olddefconfig CC=clang
>>>>>     ERROR:root:Not all Kconfig options selected in kunitconfig were in the generated .config.
>>>>>     This is probably due to unsatisfied dependencies.
>>>>>     Missing: CONFIG_KASAN_KUNIT_TEST=y
>>>>>
>>>>> By that error message, CONFIG_FTRACE appears to be present in the
>>>>> generated config, but CONFIG_KASAN_KUNIT_TEST still isn't.
>>>>> Presumably, CONFIG_KASAN_KUNIT_TEST is missing because of an
>>>>> unsatisfied dependency, which must be CONFIG_TRACEPOINTS, unless
>>>>> I'm missing something ...
>>>>>
>>>>> If I just generate an arm64 defconfig and select CONFIG_FTRACE=y,
>>>>> CONFIG_TRACEPOINTS=y shows up in my .config. So, maybe this is
>>>>> kunit.py-related then?
>>>>>
>>>>> Andrey, you said that the tests have been working for you; are you
>>>>> running them with kunit.py?
>>>>
>>>> No, I just run the kernel built with a config file that I put
>>>> together based on defconfig.
>>>
>>> Ah. I believe I've figured it out.
>>>
>>> When I add CONFIG_STACK_TRACER=y in addition to CONFIG_FTRACE=y, it
>>> works.
>>
>> CONFIG_FTRACE should be enough - maybe also check x86 vs. arm64 to
>> debug more.
>
> See below.
>
>>> CONFIG_STACK_TRACER selects CONFIG_FUNCTION_TRACER,
>>> CONFIG_FUNCTION_TRACER selects CONFIG_GENERIC_TRACER,
>>> CONFIG_GENERIC_TRACER selects CONFIG_TRACING, and CONFIG_TRACING
>>> selects CONFIG_TRACEPOINTS.
>>>
>>> CONFIG_BLK_DEV_IO_TRACE=y also works instead of
>>> CONFIG_STACK_TRACER=y, as it directly selects CONFIG_TRACEPOINTS.
>>>
>>> CONFIG_FTRACE=y on its own does not appear to suffice for kunit.py
>>> on arm64.
>>
>> When you build manually with just CONFIG_FTRACE, is CONFIG_TRACEPOINTS
>> enabled?
>
> When I add CONFIG_FTRACE and enter-key my way through the FTRACE
> prompts - I believe because CONFIG_FTRACE is a menuconfig? - at the
> beginning of a build, CONFIG_TRACEPOINTS does get set on arm64, yes.
>
> On x86, the defconfig already includes CONFIG_TRACEPOINTS.
>
> I also had a closer look at how kunit.py builds its configs.
> I believe it does something along the following lines:
>
>     cp <kunitconfig> .kunit/.config
>     make ARCH=arm64 O=.kunit olddefconfig
>
> On arm64, that isn't enough to set CONFIG_TRACEPOINTS; same behaviour
> when run outside of kunit.py.
>
> For CONFIG_TRACEPOINTS, `make ARCH=arm64 menuconfig` shows:
>
>     Symbol: TRACEPOINTS [=n]
>     Type  : bool
>     Defined at init/Kconfig:1920
>     Selected by [n]:
>       - TRACING [=n]
>       - BLK_DEV_IO_TRACE [=n] && FTRACE [=y] && SYSFS [=y] && BLOCK [=y]
>
> So, CONFIG_TRACING or CONFIG_BLK_DEV_IO_TRACE are the two options that
> prevent CONFIG_TRACEPOINTS from being set on arm64.
>
> For CONFIG_TRACING we have:
>
>     Symbol: TRACING [=n]
>     Type  : bool
>     Defined at kernel/trace/Kconfig:157
>     Selects: RING_BUFFER [=n] && STACKTRACE [=y] && TRACEPOINTS [=n] &&
>       NOP_TRACER [=n] && BINARY_PRINTF [=n] && EVENT_TRACING [=n] &&
>       TRACE_CLOCK [=y] && TASKS_RCU [=n]
>     Selected by [n]:
>       - DRM_I915_TRACE_GEM [=n] && HAS_IOMEM [=y] && DRM_I915 [=n] && EXPERT [=n] && DRM_I915_DEBUG_GEM [=n]
>       - DRM_I915_TRACE_GTT [=n] && HAS_IOMEM [=y] && DRM_I915 [=n] && EXPERT [=n] && DRM_I915_DEBUG_GEM [=n]
>       - PREEMPTIRQ_TRACEPOINTS [=n] && (TRACE_PREEMPT_TOGGLE [=n] || TRACE_IRQFLAGS [=n])
>       - GENERIC_TRACER [=n]
>       - ENABLE_DEFAULT_TRACERS [=n] && FTRACE [=y] && !GENERIC_TRACER [=n]
>       - FPROBE_EVENTS [=n] && FTRACE [=y] && FPROBE [=n] && HAVE_REGS_AND_STACK_ACCESS_API [=y]
>       - KPROBE_EVENTS [=n] && FTRACE [=y] && KPROBES [=n] && HAVE_REGS_AND_STACK_ACCESS_API [=y]
>       - UPROBE_EVENTS [=n] && FTRACE [=y] && ARCH_SUPPORTS_UPROBES [=y] && MMU [=y] && PERF_EVENTS [=n]
>       - SYNTH_EVENTS [=n] && FTRACE [=y]
>       - USER_EVENTS [=n] && FTRACE [=y]
>       - HIST_TRIGGERS [=n] && FTRACE [=y] && ARCH_HAVE_NMI_SAFE_CMPXCHG [=y]
>
>>> I believe the reason my .kunitconfig as well as the existing
>>> mm/kfence/.kunitconfig work on x86 is because CONFIG_TRACEPOINTS=y is
>>> present in an
Re: [PATCH v3 04/47] filelock: add some new helper functions
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index 085ff6ba0653..a814664b1053 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
>  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
>  int fcntl_getlease(struct file *filp);
>  
> +static inline bool lock_is_unlock(struct file_lock *fl)
> +{
> +	return fl->fl_type == F_UNLCK;
> +}
> +
> +static inline bool lock_is_read(struct file_lock *fl)
> +{
> +	return fl->fl_type == F_RDLCK;
> +}
> +
> +static inline bool lock_is_write(struct file_lock *fl)
> +{
> +	return fl->fl_type == F_WRLCK;
> +}
> +
> +static inline void locks_wake_up(struct file_lock *fl)
> +{
> +	wake_up(&fl->fl_wait);
> +}
> +
> +/* for walking lists of file_locks linked by fl_list */
> +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> +

This causes a build warning for fs/ceph/ and fs/afs when
!CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
patch. The diff looks a bit wonky but essentially I've moved
lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:

diff --git a/include/linux/filelock.h b/include/linux/filelock.h
index a814664b1053..62be9c6b1e59 100644
--- a/include/linux/filelock.h
+++ b/include/linux/filelock.h
@@ -133,20 +133,6 @@ struct file_lock_context {
 	struct list_head	flc_lease;
 };
 
-#ifdef CONFIG_FILE_LOCKING
-int fcntl_getlk(struct file *, unsigned int, struct flock *);
-int fcntl_setlk(unsigned int, struct file *, unsigned int,
-		struct flock *);
-
-#if BITS_PER_LONG == 32
-int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
-int fcntl_setlk64(unsigned int, struct file *, unsigned int,
-		struct flock64 *);
-#endif
-
-int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
-int fcntl_getlease(struct file *filp);
-
 static inline bool lock_is_unlock(struct file_lock *fl)
 {
 	return fl->fl_type == F_UNLCK;
@@ -170,6 +156,20 @@ static inline void locks_wake_up(struct file_lock *fl)
 /* for walking lists of file_locks linked by fl_list */
 #define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
 
+#ifdef CONFIG_FILE_LOCKING
+int fcntl_getlk(struct file *, unsigned int, struct flock *);
+int fcntl_setlk(unsigned int, struct file *, unsigned int,
+		struct flock *);
+
+#if BITS_PER_LONG == 32
+int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
+int fcntl_setlk64(unsigned int, struct file *, unsigned int,
+		struct flock64 *);
+#endif
+
+int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
+int fcntl_getlease(struct file *filp);
+
 /* fs/locks.c */
 void locks_free_lock_context(struct inode *inode);
 void locks_free_lock(struct file_lock *fl);
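As an aside, a small hypothetical example of what callers gain from
these helpers; the function below is illustrative only (the caller is
assumed to hold ctx->flc_lock), not part of the patch:

#include <linux/filelock.h>

static void count_posix_locks(struct file_lock_context *ctx,
			      int *readers, int *writers)
{
	struct file_lock *fl;

	*readers = 0;
	*writers = 0;

	/* Walk the POSIX lock list with the new iterator... */
	for_each_file_lock(fl, &ctx->flc_posix) {
		/* ...and classify each lock with the new predicates. */
		if (lock_is_read(fl))
			(*readers)++;
		else if (lock_is_write(fl))
			(*writers)++;
	}
}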
Re: [PATCH v3 04/47] filelock: add some new helper functions
On Mon, 2024-02-05 at 12:36 +0100, Christian Brauner wrote:
> > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > index 085ff6ba0653..a814664b1053 100644
> > --- a/include/linux/filelock.h
> > +++ b/include/linux/filelock.h
> > @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> >  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> >  int fcntl_getlease(struct file *filp);
> >  
> > +static inline bool lock_is_unlock(struct file_lock *fl)
> > +{
> > +	return fl->fl_type == F_UNLCK;
> > +}
> > +
> > +static inline bool lock_is_read(struct file_lock *fl)
> > +{
> > +	return fl->fl_type == F_RDLCK;
> > +}
> > +
> > +static inline bool lock_is_write(struct file_lock *fl)
> > +{
> > +	return fl->fl_type == F_WRLCK;
> > +}
> > +
> > +static inline void locks_wake_up(struct file_lock *fl)
> > +{
> > +	wake_up(&fl->fl_wait);
> > +}
> > +
> > +/* for walking lists of file_locks linked by fl_list */
> > +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> > +
>
> This causes a build warning for fs/ceph/ and fs/afs when
> !CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
> patch. The diff looks a bit wonky but essentially I've moved
> lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
> for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:

I sent a patch for this problem yesterday. Did you not get it?

> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index a814664b1053..62be9c6b1e59 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -133,20 +133,6 @@ struct file_lock_context {
> 	struct list_head	flc_lease;
> };
> 
> -#ifdef CONFIG_FILE_LOCKING
> -int fcntl_getlk(struct file *, unsigned int, struct flock *);
> -int fcntl_setlk(unsigned int, struct file *, unsigned int,
> -		struct flock *);
> -
> -#if BITS_PER_LONG == 32
> -int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
> -int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> -		struct flock64 *);
> -#endif
> -
> -int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> -int fcntl_getlease(struct file *filp);
> -
> static inline bool lock_is_unlock(struct file_lock *fl)
> {
> 	return fl->fl_type == F_UNLCK;
> @@ -170,6 +156,20 @@ static inline void locks_wake_up(struct file_lock *fl)
> /* for walking lists of file_locks linked by fl_list */
> #define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> 
> +#ifdef CONFIG_FILE_LOCKING
> +int fcntl_getlk(struct file *, unsigned int, struct flock *);
> +int fcntl_setlk(unsigned int, struct file *, unsigned int,
> +		struct flock *);
> +
> +#if BITS_PER_LONG == 32
> +int fcntl_getlk64(struct file *, unsigned int, struct flock64 *);
> +int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> +		struct flock64 *);
> +#endif
> +
> +int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> +int fcntl_getlease(struct file *filp);
> +
> /* fs/locks.c */
> void locks_free_lock_context(struct inode *inode);
> void locks_free_lock(struct file_lock *fl);

-- 
Jeff Layton
Re: [PATCH v3 04/47] filelock: add some new helper functions
On Mon, Feb 05, 2024 at 06:55:44AM -0500, Jeff Layton wrote:
> On Mon, 2024-02-05 at 12:36 +0100, Christian Brauner wrote:
> > > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > > index 085ff6ba0653..a814664b1053 100644
> > > --- a/include/linux/filelock.h
> > > +++ b/include/linux/filelock.h
> > > @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> > >  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> > >  int fcntl_getlease(struct file *filp);
> > >  
> > > +static inline bool lock_is_unlock(struct file_lock *fl)
> > > +{
> > > +	return fl->fl_type == F_UNLCK;
> > > +}
> > > +
> > > +static inline bool lock_is_read(struct file_lock *fl)
> > > +{
> > > +	return fl->fl_type == F_RDLCK;
> > > +}
> > > +
> > > +static inline bool lock_is_write(struct file_lock *fl)
> > > +{
> > > +	return fl->fl_type == F_WRLCK;
> > > +}
> > > +
> > > +static inline void locks_wake_up(struct file_lock *fl)
> > > +{
> > > +	wake_up(&fl->fl_wait);
> > > +}
> > > +
> > > +/* for walking lists of file_locks linked by fl_list */
> > > +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> > > +
> >
> > This causes a build warning for fs/ceph/ and fs/afs when
> > !CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
> > patch. The diff looks a bit wonky but essentially I've moved
> > lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
> > for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:
>
> I sent a patch for this problem yesterday. Did you not get it?

Whoops, probably missed it on the trip back from fosdem.
I'll double check now.
Re: [PATCH v3 04/47] filelock: add some new helper functions
On Mon, 2024-02-05 at 12:57 +0100, Christian Brauner wrote:
> On Mon, Feb 05, 2024 at 06:55:44AM -0500, Jeff Layton wrote:
> > On Mon, 2024-02-05 at 12:36 +0100, Christian Brauner wrote:
> > > > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > > > index 085ff6ba0653..a814664b1053 100644
> > > > --- a/include/linux/filelock.h
> > > > +++ b/include/linux/filelock.h
> > > > @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> > > >  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> > > >  int fcntl_getlease(struct file *filp);
> > > >  
> > > > +static inline bool lock_is_unlock(struct file_lock *fl)
> > > > +{
> > > > +	return fl->fl_type == F_UNLCK;
> > > > +}
> > > > +
> > > > +static inline bool lock_is_read(struct file_lock *fl)
> > > > +{
> > > > +	return fl->fl_type == F_RDLCK;
> > > > +}
> > > > +
> > > > +static inline bool lock_is_write(struct file_lock *fl)
> > > > +{
> > > > +	return fl->fl_type == F_WRLCK;
> > > > +}
> > > > +
> > > > +static inline void locks_wake_up(struct file_lock *fl)
> > > > +{
> > > > +	wake_up(&fl->fl_wait);
> > > > +}
> > > > +
> > > > +/* for walking lists of file_locks linked by fl_list */
> > > > +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> > > > +
> > >
> > > This causes a build warning for fs/ceph/ and fs/afs when
> > > !CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
> > > patch. The diff looks a bit wonky but essentially I've moved
> > > lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
> > > for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:
> >
> > I sent a patch for this problem yesterday. Did you not get it?
>
> Whoops, probably missed it on the trip back from fosdem.
> I'll double check now.

No worries. If you choose to go with your version, you can add:

Reviewed-by: Jeff Layton
Re: [PATCH v3 04/47] filelock: add some new helper functions
On Mon, Feb 05, 2024 at 07:06:00AM -0500, Jeff Layton wrote:
> On Mon, 2024-02-05 at 12:57 +0100, Christian Brauner wrote:
> > On Mon, Feb 05, 2024 at 06:55:44AM -0500, Jeff Layton wrote:
> > > On Mon, 2024-02-05 at 12:36 +0100, Christian Brauner wrote:
> > > > > diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> > > > > index 085ff6ba0653..a814664b1053 100644
> > > > > --- a/include/linux/filelock.h
> > > > > +++ b/include/linux/filelock.h
> > > > > @@ -147,6 +147,29 @@ int fcntl_setlk64(unsigned int, struct file *, unsigned int,
> > > > >  int fcntl_setlease(unsigned int fd, struct file *filp, int arg);
> > > > >  int fcntl_getlease(struct file *filp);
> > > > >  
> > > > > +static inline bool lock_is_unlock(struct file_lock *fl)
> > > > > +{
> > > > > +	return fl->fl_type == F_UNLCK;
> > > > > +}
> > > > > +
> > > > > +static inline bool lock_is_read(struct file_lock *fl)
> > > > > +{
> > > > > +	return fl->fl_type == F_RDLCK;
> > > > > +}
> > > > > +
> > > > > +static inline bool lock_is_write(struct file_lock *fl)
> > > > > +{
> > > > > +	return fl->fl_type == F_WRLCK;
> > > > > +}
> > > > > +
> > > > > +static inline void locks_wake_up(struct file_lock *fl)
> > > > > +{
> > > > > +	wake_up(&fl->fl_wait);
> > > > > +}
> > > > > +
> > > > > +/* for walking lists of file_locks linked by fl_list */
> > > > > +#define for_each_file_lock(_fl, _head)	list_for_each_entry(_fl, _head, fl_list)
> > > > > +
> > > >
> > > > This causes a build warning for fs/ceph/ and fs/afs when
> > > > !CONFIG_FILE_LOCKING. I'm about to fold the following diff into this
> > > > patch. The diff looks a bit wonky but essentially I've moved
> > > > lock_is_unlock(), lock_is_{read,write}(), locks_wake_up() and
> > > > for_each_file_lock() out of the ifdef CONFIG_FILE_LOCKING:
> > >
> > > I sent a patch for this problem yesterday. Did you not get it?
> >
> > Whoops, probably missed it on the trip back from fosdem.
> > I'll double check now.
>
> No worries. If you choose to go with your version, you can add:
>
> Reviewed-by: Jeff Layton

No, I took yours. :)
[PATCH net-next v5 2/5] page_frag: unify gfp bits for order 3 page allocation
Currently there seem to be three page frag implementations which all
try to allocate an order 3 page; if that fails, they then fall back to
allocating an order 0 page, and each of them allows the order 3 page
allocation to fail under certain conditions by using specific gfp bits.

The gfp bits for order 3 page allocation differ between the
implementations: __GFP_NOMEMALLOC is or'd in to forbid access to
emergency reserve memory for __page_frag_cache_refill(), but it is not
or'd in for the other implementations; __GFP_DIRECT_RECLAIM is masked
off to avoid direct reclaim in vhost_net_page_frag_refill(), but it is
not masked off in __page_frag_cache_refill().

This patch unifies the gfp bits used between the different
implementations by or'ing in __GFP_NOMEMALLOC and masking off
__GFP_DIRECT_RECLAIM for order 3 page allocation, to avoid possible
pressure on mm. Leave the gfp unifying for the page frag implementation
in sock.c for now, as suggested by Paolo Abeni.

Signed-off-by: Yunsheng Lin
Reviewed-by: Alexander Duyck
CC: Alexander Duyck
---
 drivers/vhost/net.c | 2 +-
 mm/page_alloc.c     | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..e574e21cc0ca 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
 		/* Avoid direct reclaim but allow kswapd to wake */
 		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
 					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
+					  __GFP_NORETRY | __GFP_NOMEMALLOC,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0f7e67c4250..636145c29f70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-		    __GFP_NOMEMALLOC;
+	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
 	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
 				PAGE_FRAG_CACHE_MAX_ORDER);
 	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
-- 
2.33.0
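To make the unified policy concrete, here is a minimal standalone
sketch (not the patch itself) of the shared pattern: a cheap high-order
attempt that must not stall or touch reserves, then an order-0 fallback
with the caller's original gfp semantics:

#include <linux/gfp.h>

/* Illustrative only: try an order-3 page with relaxed gfp bits (no
 * direct reclaim, no emergency reserves), then fall back to a single
 * page using the caller's gfp. */
static struct page *frag_refill_sketch(gfp_t gfp)
{
	gfp_t high_gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
			 __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
	struct page *page;

	page = alloc_pages(high_gfp, PAGE_FRAG_CACHE_MAX_ORDER);
	if (page)
		return page;

	return alloc_pages(gfp, 0);
}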
[PATCH net-next v5 4/5] vhost/net: remove vhost_net_page_frag_refill()
The page frag in vhost_net_page_frag_refill() uses the 'struct
page_frag' from skb_page_frag_refill(), but its implementation is
similar to page_frag_alloc_align() now. This patch removes
vhost_net_page_frag_refill() by using 'struct page_frag_cache' instead
of 'struct page_frag', and allocating frags using
page_frag_alloc_align().

The added benefit is that it not only unifies the page frag
implementation a little, but also gives about a 0.5% performance boost
in testing with the vhost_net_test introduced in the last patch.

Signed-off-by: Yunsheng Lin
Acked-by: Jason Wang
---
 drivers/vhost/net.c | 91 ++++++++++++++-------------------------------
 1 file changed, 27 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e574e21cc0ca..4b2fcb228a0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -141,10 +141,8 @@ struct vhost_net {
 	unsigned tx_zcopy_err;
 	/* Flush in progress. Protected by tx vq lock. */
 	bool tx_flush;
-	/* Private page frag */
-	struct page_frag page_frag;
-	/* Refcount bias of page frag */
-	int refcnt_bias;
+	/* Private page frag cache */
+	struct page_frag_cache pf_cache;
 };
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
@@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
 	       !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
-				       struct page_frag *pfrag, gfp_t gfp)
-{
-	if (pfrag->page) {
-		if (pfrag->offset + sz <= pfrag->size)
-			return true;
-		__page_frag_cache_drain(pfrag->page, net->refcnt_bias);
-	}
-
-	pfrag->offset = 0;
-	net->refcnt_bias = 0;
-	if (SKB_FRAG_PAGE_ORDER) {
-		/* Avoid direct reclaim but allow kswapd to wake */
-		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
-					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY | __GFP_NOMEMALLOC,
-					  SKB_FRAG_PAGE_ORDER);
-		if (likely(pfrag->page)) {
-			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-			goto done;
-		}
-	}
-	pfrag->page = alloc_page(gfp);
-	if (likely(pfrag->page)) {
-		pfrag->size = PAGE_SIZE;
-		goto done;
-	}
-	return false;
-
-done:
-	net->refcnt_bias = USHRT_MAX;
-	page_ref_add(pfrag->page, USHRT_MAX - 1);
-	return true;
-}
-
 #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
 
 static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
@@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	struct vhost_net *net = container_of(vq->dev, struct vhost_net,
 					     dev);
 	struct socket *sock = vhost_vq_get_backend(vq);
-	struct page_frag *alloc_frag = &net->page_frag;
 	struct virtio_net_hdr *gso;
 	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
 	struct tun_xdp_hdr *hdr;
@@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	int sock_hlen = nvq->sock_hlen;
 	void *buf;
 	int copied;
+	int ret;
 
 	if (unlikely(len < nvq->sock_hlen))
 		return -EFAULT;
@@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 		return -ENOSPC;
 
 	buflen += SKB_DATA_ALIGN(len + pad);
-	alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
-	if (unlikely(!vhost_net_page_frag_refill(net, buflen,
-						 alloc_frag, GFP_KERNEL)))
+	buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
+				    SMP_CACHE_BYTES);
+	if (unlikely(!buf))
 		return -ENOMEM;
 
-	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset +
-				     offsetof(struct tun_xdp_hdr, gso),
-				     sock_hlen, from);
-	if (copied != sock_hlen)
-		return -EFAULT;
+	copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso),
+				sock_hlen, from);
+	if (copied != sock_hlen) {
+		ret = -EFAULT;
+		goto err;
+	}
 
 	hdr = buf;
 	gso = &hdr->gso;
@@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 			       vhost16_to_cpu(vq, gso->csum_start) +
			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
-
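For readers unfamiliar with the API the driver switches to, a small
hypothetical usage sketch of 'struct page_frag_cache' with
page_frag_alloc_align() and page_frag_free(); the alignment argument
mirrors the SMP_CACHE_BYTES usage in the patch:

#include <linux/gfp.h>

/* Illustrative only: allocate a cache-line-aligned fragment from a
 * zero-initialized per-device cache and release it when done. */
static void page_frag_usage_sketch(struct page_frag_cache *cache)
{
	void *buf;

	buf = page_frag_alloc_align(cache, 256, GFP_KERNEL,
				    SMP_CACHE_BYTES);
	if (!buf)
		return;

	/* ... fill and hand off buf ... */

	page_frag_free(buf);	/* drops the fragment's page reference */
}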
[PATCH net-next v5 5/5] tools: virtio: introduce vhost_net_test
Introduce vhost_net_test for both vhost_net tx and rx, based on
virtio_test, to test vhost_net changes in the kernel.

Steps for vhost_net tx testing:
1. Prepare an out buf.
2. Kick the vhost_net to do tx processing.
3. Do the receiving in the tun side.
4. Verify the data received by tun is correct.

Steps for vhost_net rx testing:
1. Prepare an in buf.
2. Do the sending in the tun side.
3. Kick the vhost_net to do rx processing.
4. Verify the data received by vhost_net is correct.

Signed-off-by: Yunsheng Lin
---
 tools/virtio/.gitignore            |   1 +
 tools/virtio/Makefile              |   8 +-
 tools/virtio/linux/virtio_config.h |   4 +
 tools/virtio/vhost_net_test.c      | 536 +++++++++++++++++++++++++++++
 4 files changed, 546 insertions(+), 3 deletions(-)
 create mode 100644 tools/virtio/vhost_net_test.c

diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
index 9934d48d9a55..7e47b281c442 100644
--- a/tools/virtio/.gitignore
+++ b/tools/virtio/.gitignore
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 *.d
 virtio_test
+vhost_net_test
 vringh_test
 virtio-trace/trace-agent
diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index d128925980e0..e25e99c1c3b7 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 all: test mod
-test: virtio_test vringh_test
+test: virtio_test vringh_test vhost_net_test
 virtio_test: virtio_ring.o virtio_test.o
 vringh_test: vringh_test.o vringh.o virtio_ring.o
+vhost_net_test: virtio_ring.o vhost_net_test.o
 
 try-run = $(shell set -e;		\
 	if ($(1)) >/dev/null 2>&1;	\
@@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
 .PHONY: all test mod clean vhost oot oot-clean oot-build
 clean:
-	${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
-	      vhost_test/Module.symvers vhost_test/modules.order *.d
+	${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
+	      vhost_test/.*.cmd vhost_test/Module.symvers \
+	      vhost_test/modules.order *.d
 -include *.d
diff --git a/tools/virtio/linux/virtio_config.h b/tools/virtio/linux/virtio_config.h
index 2a8a70e2a950..42a564f22f2d 100644
--- a/tools/virtio/linux/virtio_config.h
+++ b/tools/virtio/linux/virtio_config.h
@@ -1,4 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_VIRTIO_CONFIG_H
+#define LINUX_VIRTIO_CONFIG_H
 #include
 #include
 #include
@@ -95,3 +97,5 @@ static inline __virtio64 cpu_to_virtio64(struct virtio_device *vdev, u64 val)
 {
 	return __cpu_to_virtio64(virtio_is_little_endian(vdev), val);
 }
+
+#endif
diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
new file mode 100644
index 000000000000..6c41204e6707
--- /dev/null
+++ b/tools/virtio/vhost_net_test.c
@@ -0,0 +1,536 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define HDR_LEN		sizeof(struct virtio_net_hdr_mrg_rxbuf)
+#define TEST_BUF_LEN	256
+#define TEST_PTYPE	ETH_P_LOOPBACK
+#define DESC_NUM	256
+
+/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
+void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
+
+struct vq_info {
+	int kick;
+	int call;
+	int idx;
+	long started;
+	long completed;
+	struct pollfd fds;
+	void *ring;
+	/* copy used for control */
+	struct vring vring;
+	struct virtqueue *vq;
+};
+
+struct vdev_info {
+	struct virtio_device vdev;
+	int control;
+	struct vq_info vqs[2];
+	int nvqs;
+	void *buf;
+	size_t buf_size;
+	char *test_buf;
+	char *res_buf;
+	struct vhost_memory *mem;
+	int sock;
+	int ifindex;
+	unsigned char mac[ETHER_ADDR_LEN];
+};
+
+static int tun_alloc(struct vdev_info *dev, char *tun_name)
+{
+	struct ifreq ifr;
+	int len = HDR_LEN;
+	int fd, e;
+
+	fd = open("/dev/net/tun", O_RDWR);
+	if (fd < 0) {
+		perror("Cannot open /dev/net/tun");
+		return fd;
+	}
+
+	memset(&ifr, 0, sizeof(ifr));
+
+	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
+	strncpy(ifr.ifr_name, tun_name, IFNAMSIZ);
+
+	e = ioctl(fd, TUNSETIFF, &ifr);
+	if (e < 0) {
+		perror("ioctl[TUNSETIFF]");
+		close(fd);
+		return e;
+	}
+
+	e = ioctl(fd, TUNSETVNETHDRSZ, &len);
+	if (e < 0) {
+		perror("ioctl[TUNSETVNETHDRSZ]");
+		close(fd);
+		return e;
+	}
+
+	e = ioctl(fd, SIOCGIFHWADDR, &ifr);
+	if (e < 0) {
+		perror("ioctl[SIOCGIFHWADDR]");
+		close(fd);
+		return e;
+
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
On Mon, 5 Feb 2024 07:53:40 +0100
Sven Schnelle wrote:

> tracer_tracing_is_on() checks whether record_disabled is not zero. This
> checks both the record_disabled counter and the RB_BUFFER_OFF flag.
> Reading the source it looks like this function should only check for
> the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
> This fixes spurious fails in the 'test for function traceon/off triggers'
> test from the ftrace testsuite when the system is under load.

I've seen these spurious failures too, but haven't looked deeper into
it. Thanks,

-- Steve
Re: Question about the ipi_raise filter usage and output
On Mon, 5 Feb 2024 10:28:57 +0000
Mark Rutland wrote:

> > I try to write below:
> > echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> > events/ipi/ipi_raise/filter
>
> The '==' checks if the target_cpus bitmap *only* contains CPU 11. If
> the cpumask contains other CPUs, the filter will skip the call.
>
> I believe you can use '&' to check whether a cpumask contains a CPU,
> e.g.
>
>   'target_cpus & 11'

11 == 0xb = 0b1011

So the above would only be true for CPUs 0, 1 and 3 ;-)

I think you meant: 'target_cpus & 0x800'

I tried '1 << 11' but it appears to not allow shifts. I wonder if we
should add that?

-- Steve
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven,

Steven Rostedt writes:

> On Mon, 5 Feb 2024 07:53:40 +0100
> Sven Schnelle wrote:
>
>> tracer_tracing_is_on() checks whether record_disabled is not zero. This
>> checks both the record_disabled counter and the RB_BUFFER_OFF flag.
>> Reading the source it looks like this function should only check for
>> the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
>> This fixes spurious fails in the 'test for function traceon/off triggers'
>> test from the ftrace testsuite when the system is under load.
>
> I've seen these spurious failures too, but haven't looked deeper into
> it. Thanks,

Another issue I'm hitting sometimes is this part:

  csum1=`md5sum trace`
  sleep $SLEEP_TIME
  csum2=`md5sum trace`

  if [ "$csum1" != "$csum2" ]; then
      fail "Tracing file is still changing"
  fi

This is because the command line was replaced in the
saved_cmdlines_buffer; an example diff between both files is:

 ftracetest-17950 [005] ..... 344507.002490: sched_process_wait: comm=ftracetest pid=0 prio=120
 ftracetest-17950 [005] ..... 344507.002492: sched_process_wait: comm=ftracetest pid=0 prio=120
-stress-ng-fanot-17820 [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=1054 [ns]
+<...>-17820 [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=1054 [ns]
 ftracetest-17950 [005] d.h.. 344507.009901: sched_stat_runtime: comm=ftracetest pid=17950 runtime=7417915 [ns]
 stress-ng-fanot-17819 [003] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=9983473 [ns]
-stress-ng-fanot-17820 [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=865 [ns]
+<...>-17820 [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=865 [ns]
 stress-ng-fanot-17819 [004] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=8388039 [ns]

This can be improved by:

  echo 32768 > /sys/kernel/tracing/saved_cmdlines_size

But this is of course not a fix - should we maybe replace the program
name with <...> before comparing, remove the check completely, or do
anything else? What do you think?

Thanks,
Sven
Re: [PATCH 1/3] dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
On Mon, 05 Feb 2024 10:51:38 +0100, Luca Weiss wrote:
> Add the PBS (Programmable Boot Sequencer) to the list of devices.
>
> Signed-off-by: Luca Weiss
> ---
>  Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml | 4 ++++
>  1 file changed, 4 insertions(+)
>

My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check'
on your patch (DT_CHECKER_FLAGS is new in v5.13):

yamllint warnings/errors:

dtschema/dtc warnings/errors:
/builds/robherring/dt-review-ci/linux/Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml:
Error in referenced schema matching $id: http://devicetree.org/schemas/soc/qcom/qcom,pbs.yaml

doc reference errors (make refcheckdocs):

See https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20240205-pmi632-ppg-v1-1-e236c95a2...@fairphone.com

The base for the series is generally the latest rc1. A different
dependency should be noted in *this* patch.

If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:

  pip3 install dtschema --upgrade

Please check and re-submit after running the above command yourself.
Note that DT_SCHEMA_FILES can be set to your schema file to speed up
checking your schema. However, it must be unset to test all examples
with your schema.
Re: [PATCH v3 3/3] tracing: convert __trace_seq_init to use WARN_ON_ONCE
(arch/x86/kernel/traps.c:238)
[ 1915.516038][   T42] ? exc_invalid_op (arch/x86/kernel/traps.c:259)
[ 1915.516298][   T42] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 1915.516586][   T42] ? trace_seq_printf (kernel/trace/trace_seq.c:35)
[ 1915.516860][   T42] ? trace_find_cmdline (arch/x86/include/asm/preempt.h:103 kernel/trace/trace.c:2546)
[ 1915.517142][   T42] ? trace_find_cmdline (arch/x86/include/asm/preempt.h:103 kernel/trace/trace.c:2546)
[ 1915.517420][   T42] ? preempt_count_sub (arch/x86/include/asm/preempt.h:84 kernel/sched/core.c:5900)
[ 1915.517694][   T42] trace_print_lat_context (kernel/trace/trace_output.c:516 kernel/trace/trace_output.c:664)
[ 1915.517995][   T42] print_trace_fmt (kernel/trace/trace.c:?)
[ 1915.518257][   T42] ftrace_dump (kernel/trace/trace.c:10413)
[ 1915.518505][   T42] rcu_scale_writer (kernel/rcu/rcuscale.c:534)
[ 1915.518775][   T42] ? __cfi_rcu_scale_writer (kernel/rcu/rcuscale.c:453)
[ 1915.519071][   T42] kthread (kernel/kthread.c:390)
[ 1915.519298][   T42] ? __cfi_kthread (kernel/kthread.c:341)
[ 1915.519554][   T42] ret_from_fork (arch/x86/kernel/process.c:153)
[ 1915.519802][   T42] ? __cfi_kthread (kernel/kthread.c:341)
[ 1915.520057][   T42] ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
[ 1915.520326][   T42]
[ 1915.520496][   T42] irq event stamp: 8228
[ 1915.520724][   T42] hardirqs last enabled at (8227): _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:?)
[ 1915.521290][   T42] hardirqs last disabled at (8228): ftrace_dump (kernel/trace/trace.c:10359)
[ 1915.521783][   T42] softirqs last enabled at (0): copy_process (include/linux/rcupdate.h:298 include/linux/rcupdate.h:750 kernel/fork.c:2366)
[ 1915.522276][   T42] softirqs last disabled at (0): 0x0
[ 1915.522663][   T42] ---[ end trace ]---
[ 1915.522970][   T42] swapper-1 0..Zff 1910231878us : fib6_table_lookup: table 2166473472 oif 80736 iif -14080 proto 0 81ff::ff60:6d05:81ff::ff00:0/0 -> ::/0 tos 96 scope 109 flags 5 ==> dev gw :: err -1
[ 1915.524039][   T42] swapper-1 0. 1910231880us : Unknown type 1830
[ 1915.524442][   T42] ---------------------------------
[ 1915.524733][   T42] rcu-scale: Test complete

The kernel config and materials to reproduce are available at:

https://download.01.org/0day-ci/archive/20240205/202402052141.769871e2-...@intel.com

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
On Mon, 05 Feb 2024 14:16:30 +0100
Sven Schnelle wrote:

> Another issue I'm hitting sometimes is this part:
>
>   csum1=`md5sum trace`
>   sleep $SLEEP_TIME
>   csum2=`md5sum trace`
>
>   if [ "$csum1" != "$csum2" ]; then
>       fail "Tracing file is still changing"
>   fi
>
> This is because the command line was replaced in the
> saved_cmdlines_buffer; an example diff between both files is:

[..]

> This can be improved by:
>
>   echo 32768 > /sys/kernel/tracing/saved_cmdlines_size
>
> But this is of course not a fix - should we maybe replace the program
> name with <...> before comparing, remove the check completely, or do
> anything else? What do you think?

Hmm, actually I would say that this exposes a real bug. Not a major
one, but one that I find annoying. The saved commandlines should only
be updated when a trace event occurs. But really, it should only be
updated if one is added to the ring buffer. If the ring buffer isn't
being updated, we shouldn't be adding new command lines.

There may be a location that has tracing off but still updating the
cmdlines which will break the saved cache.

-- Steve
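Purely as an illustration of the direction described above (not a
reviewed fix), the guard could take roughly this shape; the names
follow kernel/trace/trace.c, but the exact call site is an assumption:

/* Hypothetical sketch: only refresh the saved-cmdlines cache when the
 * instance's buffer can actually accept events. */
static void record_cmdline_sketch(struct trace_array *tr,
				  struct task_struct *task)
{
	if (!tracer_tracing_is_on(tr))
		return;	/* buffer off: leave the cmdline cache alone */

	tracing_record_cmdline(task);
}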
Re: Question about the ipi_raise filter usage and output
[adding Valentin]

On Mon, Feb 05, 2024 at 08:06:09AM -0500, Steven Rostedt wrote:
> On Mon, 5 Feb 2024 10:28:57 +0000
> Mark Rutland wrote:
>
> > > I try to write below:
> > > echo 'target_cpus == 11 && reason == "Function call interrupts"' >
> > > events/ipi/ipi_raise/filter
> >
> > The '==' checks if the target_cpus bitmap *only* contains CPU 11. If
> > the cpumask contains other CPUs, the filter will skip the call.
> >
> > I believe you can use '&' to check whether a cpumask contains a CPU,
> > e.g.
> >
> >   'target_cpus & 11'
>
> 11 == 0xb = 0b1011
>
> So the above would only be true for CPUs 0, 1 and 3 ;-)

Sorry, I misunderstood the scalar logic and thought that we treated:

  '$mask $OP $scalar', e.g. 'target_cpus & 11'

... as a special case meaning a cpumask with that scalar bit set, i.e.

  '$mask $OP CPUS{$scalar}', e.g. 'target_cpus & CPUS{11}'

... but evidently I was wrong.

> I think you meant: 'target_cpus & 0x800'
>
> I tried '1 << 11' but it appears to not allow shifts. I wonder if we
> should add that?

Hmm... shouldn't we make 'CPUS{11}' work for that?

From a quick test (below), that doesn't seem to work, though I think it
probably should?

  # cat /sys/devices/system/cpu/online
  0-3
  # echo 1 > /sys/kernel/tracing/events/ipi/ipi_raise/enable
  # echo 'target_cpus & CPUS{3}' > /sys/kernel/tracing/events/ipi/ipi_raise/filter
  # grep IPI /proc/interrupts
   IPI0:        54         41         32         42  Rescheduling interrupts
   IPI1:      1202       1035        893        909  Function call interrupts
   IPI2:         0          0          0          0  CPU stop interrupts
   IPI3:         0          0          0          0  CPU stop (for crash dump) interrupts
   IPI4:         0          0          0          0  Timer broadcast interrupts
   IPI5:         0          0          0          0  IRQ work interrupts
  # sleep 1
  # grep IPI /proc/interrupts
   IPI0:        54         42         32         42  Rescheduling interrupts
   IPI1:      1209       1037        912        927  Function call interrupts
   IPI2:         0          0          0          0  CPU stop interrupts
   IPI3:         0          0          0          0  CPU stop (for crash dump) interrupts
   IPI4:         0          0          0          0  Timer broadcast interrupts
   IPI5:         0          0          0          0  IRQ work interrupts
  # cat /sys/devices/system/cpu/online
  0-3
  # cat /sys/kernel/tracing/trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 0/0   #P:4
  #
  #                                _-----=> irqs-off/BH-disabled
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disable
  #                              |||| /     delay
  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
  #              | |         |   |||||     |         |

More confusingly, if I use '8', I get events with cpumasks which
shouldn't match AFAICT:

  # echo 'target_cpus & 8' > /sys/kernel/tracing/events/ipi/ipi_raise/filter
  # echo '' > /sys/kernel/tracing/trace
  # grep IPI /proc/interrupts
   IPI0:        55         46         34         43  Rescheduling interrupts
   IPI1:      1358       1155        994       1021  Function call interrupts
   IPI2:         0          0          0          0  CPU stop interrupts
   IPI3:         0          0          0          0  CPU stop (for crash dump) interrupts
   IPI4:         0          0          0          0  Timer broadcast interrupts
   IPI5:         0          0          0          0  IRQ work interrupts
  # sleep 1
  # grep IPI /proc/interrupts
   IPI0:        56         46         34         43  Rescheduling interrupts
   IPI1:      1366       1158       1005       1038  Function call interrupts
   IPI2:         0          0          0          0  CPU stop interrupts
   IPI3:         0          0          0          0  CPU stop (for crash dump) interrupts
   IPI4:         0          0          0          0  Timer broadcast interrupts
   IPI5:         0          0          0          0  IRQ work interrupts
  # cat /sys/kernel/tracing/trace
  # tracer: nop
  #
  # entries-in-buffer/entries-written: 91/91   #P:4
  #
  #                                _-----=> irqs-off/BH-disabled
  #                               / _----=> need-resched
  #                              | / _---=> hardirq/softirq
  #                              || / _--=> preempt-depth
  #                              ||| / _-=> migrate-disable
  #                              |||| /     delay
  #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
  #              | |         |   |||||     |         |
            <idle>-0       [000] d.h4.   480.720312: ipi_ra
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven,

Steven Rostedt writes:

> On Mon, 05 Feb 2024 14:16:30 +0100
> Sven Schnelle wrote:
>>
>> Another issue I'm hitting sometimes is this part:
>>
>>   csum1=`md5sum trace`
>>   sleep $SLEEP_TIME
>>   csum2=`md5sum trace`
>>
>>   if [ "$csum1" != "$csum2" ]; then
>>       fail "Tracing file is still changing"
>>   fi
>>
>> This is because the command line was replaced in the
>> saved_cmdlines_buffer; an example diff between both files is:
>
> [..]
>
>> This can be improved by:
>>
>>   echo 32768 > /sys/kernel/tracing/saved_cmdlines_size
>>
>> But this is of course not a fix - should we maybe replace the program
>> name with <...> before comparing, remove the check completely, or do
>> anything else? What do you think?
>
> Hmm, actually I would say that this exposes a real bug. Not a major
> one, but one that I find annoying. The saved commandlines should only
> be updated when a trace event occurs. But really, it should only be
> updated if one is added to the ring buffer. If the ring buffer isn't
> being updated, we shouldn't be adding new command lines.
>
> There may be a location that has tracing off but still updating the
> cmdlines which will break the saved cache.

Ok, my understanding is that it will override the entry in the list if
another process comes up with the same PID. But I haven't read the code
carefully - let me do that now.
Re: [PATCH 1/3] dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
On Montag, 5. Februar 2024 14:46:45 CET Rob Herring wrote: > On Mon, 05 Feb 2024 10:51:38 +0100, Luca Weiss wrote: > > Add the PBS (Programmable Boot Sequencer) to the list of devices. > > > > Signed-off-by: Luca Weiss > > --- > > > > Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml | 4 > > 1 file changed, 4 insertions(+) > > My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check' > on your patch (DT_CHECKER_FLAGS is new in v5.13): > > yamllint warnings/errors: > > dtschema/dtc warnings/errors: > /builds/robherring/dt-review-ci/linux/Documentation/devicetree/bindings/mfd/ > qcom,spmi-pmic.yaml: Error in referenced schema matching $id: > http://devicetree.org/schemas/soc/qcom/qcom,pbs.yaml These patches have been merged into linux-next recently, so should get into the next release. > > doc reference errors (make refcheckdocs): > > See > https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20240205-pmi > 632-ppg-v1-1-e236c95a2...@fairphone.com > > The base for the series is generally the latest rc1. A different dependency > should be noted in *this* patch. > > If you already ran 'make dt_binding_check' and didn't see the above > error(s), then make sure 'yamllint' is installed and dt-schema is up to > date: > > pip3 install dtschema --upgrade > > Please check and re-submit after running the above command yourself. Note > that DT_SCHEMA_FILES can be set to your schema file to speed up checking > your schema. However, it must be unset to test all examples with your > schema.
Re: Question about the ipi_raise filter usage and output
On 05/02/24 14:39, Mark Rutland wrote:
> [adding Valentin]
>

Thanks!

> On Mon, Feb 05, 2024 at 08:06:09AM -0500, Steven Rostedt wrote:
>> On Mon, 5 Feb 2024 10:28:57 +0000
>> Mark Rutland wrote:
>>
>> > > I try to write below:
>> > > echo 'target_cpus == 11 && reason == "Function call interrupts"' >
>> > > events/ipi/ipi_raise/filter
>> >
>> > The '==' checks if the target_cpus bitmap *only* contains CPU 11. If the cpumask
>> > contains other CPUs, the filter will skip the call.
>> >
>> > I believe you can use '&' to check whether a cpumask contains a CPU, e.g.
>> >
>> > 	'target_cpus & 11'
>>
>> 11 == 0xb = b1011
>>
>> So the above would only be true for CPUs 0,1 and 3 ;-)
>
> Sorry, I misunderstood the scalar logic and thought that we treated:
>
> 	'$mask $OP $scalar', e.g. 'target_cpus & 11'
>
> .. as a special case meaning a cpumask with that scalar bit set, i.e.
>
> 	'$mask $OP CPUS{$scalar}', e.g. 'target_cpus & CPUS{11}'
>
> .. but evidently I was wrong.
>
>> I think you meant: 'target_cpus & 0x800'
>>
>> I tried '1 << 11' but it appears to not allow shifts. I wonder if we should
>> add that?
>
> Hmm... shouldn't we make 'CPUS{11}' work for that?
>

It /should/ already be the case, the user input with the curly braces is
parsed as a cpulist. So CPUS{11} really does mean CPU11, not a hex value
to be interpreted as a cpumask. However...

> From a quick test (below), that doesn't seem to work, though I think it
> probably should?
>
> Have I completely misunderstood how this is supposed to work, or is that a
> bug?
>

The CPUS{} thingie only works with an event field that is either declared
as a cpumask (__cpumask) or a scalar. That's not the case for ipi_raise,
the target_cpus event field is saved as a "raw" bitmask. There /should/
have been a warning about the event filter though, but I think it's not
happening because I'm allowing more than just FILTER_CPUMASK in
parse_pred() to make it work for scalars. I'll go poke around some more.

Generally for this sort of IPI investigation I'd recommend using the newer
trace_ipi_send_cpu() and trace_ipi_send_cpumask() (for which CPUS{}
filtering works). If it's only the function call interrupts you're
interested in, have a look at trace_csd_queue_cpu().

> Mark.
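For reference, the scalar semantics Steven describes can be checked outside
the kernel with a standalone C sketch (illustrative only): the right-hand
side of '&' is a raw bit pattern, so 11 (0b1011) tests CPUs 0, 1 and 3,
while testing CPU 11 needs bit 11, i.e. 1 << 11 == 0x800:

	#include <stdio.h>

	int main(void)
	{
		unsigned long target_cpus = 1UL << 3;	/* event aimed at CPU 3 */

		/* 11 == 0b1011: matches because CPU 3's bit is set in 0b1011 */
		printf("mask & 11:    %s\n", (target_cpus & 11) ? "match" : "no match");

		/* testing for CPU 11 requires bit 11, i.e. 0x800 */
		printf("mask & 0x800: %s\n", (target_cpus & 0x800) ? "match" : "no match");

		return 0;
	}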
Re: [PATCH RFC v3 22/35] arm64: mte: Enable tag storage if CMA areas have been activated
Hi Evgenii, On Fri, Feb 02, 2024 at 02:30:00PM -0800, Evgenii Stepanov wrote: > On Thu, Jan 25, 2024 at 8:44 AM Alexandru Elisei > wrote: > > > > Before enabling MTE tag storage management, make sure that the CMA areas > > have been successfully activated. If a CMA area fails activation, the pages > > are kept as reserved. Reserved pages are never used by the page allocator. > > > > If this happens, the kernel would have to manage tag storage only for some > > of the memory, but not for all memory, and that would make the code > > unreasonably complicated. > > > > Choose to disable tag storage management altogether if a CMA area fails to > > be activated. > > > > Signed-off-by: Alexandru Elisei > > --- > > > > Changes since v2: > > > > * New patch. > > > > arch/arm64/include/asm/mte_tag_storage.h | 12 ++ > > arch/arm64/kernel/mte_tag_storage.c | 50 > > 2 files changed, 62 insertions(+) > > > > diff --git a/arch/arm64/include/asm/mte_tag_storage.h > > b/arch/arm64/include/asm/mte_tag_storage.h > > index 3c2cd29e053e..7b3f6bff8e6f 100644 > > --- a/arch/arm64/include/asm/mte_tag_storage.h > > +++ b/arch/arm64/include/asm/mte_tag_storage.h > > @@ -6,8 +6,20 @@ > > #define __ASM_MTE_TAG_STORAGE_H > > > > #ifdef CONFIG_ARM64_MTE_TAG_STORAGE > > + > > +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); > > + > > +static inline bool tag_storage_enabled(void) > > +{ > > + return static_branch_likely(&tag_storage_enabled_key); > > +} > > + > > void mte_init_tag_storage(void); > > #else > > +static inline bool tag_storage_enabled(void) > > +{ > > + return false; > > +} > > static inline void mte_init_tag_storage(void) > > { > > } > > diff --git a/arch/arm64/kernel/mte_tag_storage.c > > b/arch/arm64/kernel/mte_tag_storage.c > > index 9a1a8a45171e..d58c68b4a849 100644 > > --- a/arch/arm64/kernel/mte_tag_storage.c > > +++ b/arch/arm64/kernel/mte_tag_storage.c > > @@ -19,6 +19,8 @@ > > > > #include > > > > +__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key); > > + > > struct tag_region { > > struct range mem_range; /* Memory associated with the tag storage, > > in PFNs. */ > > struct range tag_range; /* Tag storage memory, in PFNs. */ > > @@ -314,3 +316,51 @@ void __init mte_init_tag_storage(void) > > num_tag_regions = 0; > > pr_info("MTE tag storage region management disabled"); > > } > > + > > +static int __init mte_enable_tag_storage(void) > > +{ > > + struct range *tag_range; > > + struct cma *cma; > > + int i, ret; > > + > > + if (num_tag_regions == 0) > > + return 0; > > + > > + for (i = 0; i < num_tag_regions; i++) { > > + tag_range = &tag_regions[i].tag_range; > > + cma = tag_regions[i].cma; > > + /* > > +* CMA will keep the pages as reserved when the region fails > > +* activation. > > +*/ > > + if (PageReserved(pfn_to_page(tag_range->start))) > > + goto out_disabled; > > + } > > + > > + static_branch_enable(&tag_storage_enabled_key); > > + pr_info("MTE tag storage region management enabled"); > > + > > + return 0; > > + > > +out_disabled: > > + for (i = 0; i < num_tag_regions; i++) { > > + tag_range = &tag_regions[i].tag_range; > > + cma = tag_regions[i].cma; > > + > > + if (PageReserved(pfn_to_page(tag_range->start))) > > + continue; > > + > > + /* Try really hard to reserve the tag storage. */ > > + ret = cma_alloc(cma, range_len(tag_range), 8, true); > > + /* > > +* Tag storage is still in use for data, memory and/or tag > > +* corruption will ensue. 
> > +*/ > > + WARN_ON_ONCE(ret); > > cma_alloc returns (page *), so this condition needs to be inverted, > and the type of `ret` changed. > Not sure how it slipped through, this is a compile error with clang. Checked just now, it's a warning with gcc, I must have missed it. Will fix. Thanks, Alex > > > + } > > + num_tag_regions = 0; > > + pr_info("MTE tag storage region management disabled"); > > + > > + return -EINVAL; > > +} > > +arch_initcall(mte_enable_tag_storage); > > -- > > 2.43.0 > >
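For reference, the inverted check Evgenii suggests could look like the
sketch below (untested and abbreviated, kernel context). It illustrates the
point that cma_alloc() returns a struct page pointer, which is NULL on
failure:

	#include <linux/cma.h>
	#include <linux/range.h>

	/* Abbreviated sketch of the fixed loop body, mirroring the patch's
	 * cma_alloc(cma, range_len(tag_range), 8, true) call. */
	static void reserve_tag_storage_sketch(struct cma *cma, struct range *tag_range)
	{
		struct page *page;

		/* Try really hard to reserve the tag storage. */
		page = cma_alloc(cma, range_len(tag_range), 8, true);

		/*
		 * cma_alloc() returns NULL on failure, so warn when no page
		 * came back, not on a non-zero return as the v3 patch did.
		 */
		WARN_ON_ONCE(!page);
	}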
[PATCH v14 0/6] Introducing trace buffer mapping by user-space
The tracing ring-buffers can be stored on disk or sent to network without
any copy via splice. However the latter doesn't allow real time processing
of the traces. A solution is to give userspace direct access to the
ring-buffer pages via a mapping. An application can now become a consumer
of the ring-buffer, in a similar fashion to what trace_pipe offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
    the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
    (intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
    rb_update_meta_page().
  * rb_update_meta_page() is now called from ring_buffer_map_get_reader()
    to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
    in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
    - pages_lost / pages_read (allow to compute how full is the ring-buffer)
    - read (allow to compute how many entries can be read)
    - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
    rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
    ** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.
Vincent Donnefort (6):
  ring-buffer: Zero ring-buffer sub-buffers
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Add snapshot refcount
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst                 |   1 +
 Documentation/trace/ring-buffer-map.rst       | 104 ++
 include/linux/ring_buffer.h                   |   7 +
 include/uapi/linux/trace_mmap.h               |  48 +++
 kernel/trace/ring_buffer.c                    | 344 +-
 kernel/trace/trace.c                          | 221 +--
 kernel/trace/trace.h                          |   9 +-
 kernel/trace/trace_events_trigger.c           |  58 ++-
 tools/testing/selftests/ring-buffer/Makefile  |   8 +
 tools/testing/selftests/ring-buffer/config    |   2 +
 .../testing/selftests/ring-buffer/map_test.c  | 273 ++
 11 files changed, 1029 insertions(+), 46 deletions(-)
 create mode 100644 Documentation/trace/ring-buffer-map.rst
 create mode 100644 include/uapi/linux/trace_mmap.h
 create mode 100644 tools/testing/selftests/ring-buffer/Makefile
 create mode 100644 tools/testing/selftests/ring-buffer/config
 create mode 100644 tools/testing/selftests/ring-buffer/map_test.c

base-commit: e2412e51fdea837b50ce31fea8e5dfc885237f3a
--
2.43.0.594
[PATCH v14 1/6] ring-buffer: Zero ring-buffer sub-buffers
In preparation for the ring-buffer memory mapping where each subbuf will be accessible to user-space, zero all the page allocations. Signed-off-by: Vincent Donnefort Reviewed-by: Masami Hiramatsu (Google) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index fd4bfe3ecf01..ca796675c0a1 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1472,7 +1472,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(&bpage->list, pages); - page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags, + page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), + mflags | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto free_pages; @@ -1557,7 +1558,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) cpu_buffer->reader_page = bpage; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, cpu_buffer->buffer->subbuf_order); + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO, + cpu_buffer->buffer->subbuf_order); if (!page) goto fail_free_reader; bpage->page = page_address(page); @@ -5525,7 +5527,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu) if (bpage->data) goto out; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY, + page = alloc_pages_node(cpu_to_node(cpu), + GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) { kfree(bpage); -- 2.43.0.594.gd9cf4e227d-goog
[PATCH v14 2/6] ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add a set
of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored.

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the first
mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include
 #include

+#include
+
 struct trace_buffer;
 struct ring_buffer_iter;

@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
 #define trace_rb_cpu_prepare	NULL
 #endif

+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+				   unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..182e05a3004a
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:	Size of this meta-page.
+ * @meta_struct_len:	Size of this structure.
+ * @subbuf_size:	Size of each sub-buffer.
+ * @nr_subbufs:		Number of subbufs in the ring-buffer.
+ * @reader.lost_events:	Number of events lost at the time of the reader swap.
+ * @reader.id:		subbuf ID of the current reader. From 0 to @nr_subbufs - 1
+ * @reader.read:	Number of bytes read on the reader subbuf.
+ * @flags:		Placeholder for now, no defined values.
+ * @entries:		Number of entries in the ring-buffer.
+ * @overrun:		Number of entries lost in the ring-buffer.
+ * @read:		Number of entries that have been read.
+ * @Reserved1:		Reserved for future use.
+ * @Reserved2:		Reserved for future use.
+ */
+struct trace_buffer_meta {
+	__u32		meta_page_size;
+	__u32		meta_struct_len;
+
+	__u32		subbuf_size;
+	__u32		nr_subbufs;
+
+	struct {
+		__u64	lost_events;
+		__u32	id;
+		__u32	read;
+	} reader;
+
+	__u64	flags;
+
+	__u64	entries;
+	__u64	overrun;
+	__u64	read;
+
+	__u64	Reserved1;
+	__u64	Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ca796675c0a1..dcb4a0bc7ecb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
 	local_t		 entries;	/* entries on this page */
 	unsigned long	 real_end;	/* real end of data */
 	unsigned	 order;		/* order of the page */
+	u32		 id;		/* ID for external mapping */
 	struct buffer_data_page *page;	/* Actual data page */
 };

@@ -484,6 +485,12 @@ struct ring_buffer_per_cpu {
 	u64				read_stamp;
 	/* pages removed since last reset */
 	unsigned long			pages_removed;
+
+	unsigned int			mapped;
+	struct mutex			mapping_lock;
+	unsigned long			*subbuf_ids;	/* ID to addr */
+	struct trace_buffer_meta	*meta_page;
+
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long				nr_pages_to_update;
 	struct list_head		new_pages; /* new pages to add */
@@ -1548,6 +1555,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
 	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
 	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+	mutex_init(&cpu_buffer->mapping_lock);

 	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
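To make the resulting layout concrete, here is a tiny user-space sketch of
the offset computation from the commit message (the 4 KiB sizes are
assumptions for illustration; real values come from the meta-page at run
time):

	#include <stdio.h>

	int main(void)
	{
		unsigned int meta_page_size = 4096;	/* assumed for illustration */
		unsigned int subbuf_size = 4096;	/* assumed for illustration */
		unsigned int reader_id = 2;		/* say the meta-page reports ID 2 */

		/* meta-page first, then sub-buffers ordered by their unique ID */
		printf("reader offset: %u\n",
		       meta_page_size + reader_id * subbuf_size);	/* 12288 */
		return 0;
	}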
[PATCH v14 3/6] tracing: Add snapshot refcount
When a ring-buffer is memory mapped by user-space, no trace or ring-buffer swap is possible. This means the snapshot feature is mutually exclusive with the memory mapping. Having a refcount on snapshot users will help to know if a mapping is possible or not. Instead of relying on the global trace_types_lock, a new spinlock is introduced to serialize accesses to trace_array->snapshot. This intends to allow access to that variable in a context where the mmap lock is already held. Signed-off-by: Vincent Donnefort diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2a7c6fd934e9..4ebf4d0bd14c 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1300,6 +1300,50 @@ static void free_snapshot(struct trace_array *tr) tr->allocated_snapshot = false; } +static int tracing_arm_snapshot_locked(struct trace_array *tr) +{ + int ret; + + lockdep_assert_held(&trace_types_lock); + + spin_lock(&tr->snapshot_trigger_lock); + if (tr->snapshot == UINT_MAX) { + spin_unlock(&tr->snapshot_trigger_lock); + return -EBUSY; + } + + tr->snapshot++; + spin_unlock(&tr->snapshot_trigger_lock); + + ret = tracing_alloc_snapshot_instance(tr); + if (ret) { + spin_lock(&tr->snapshot_trigger_lock); + tr->snapshot--; + spin_unlock(&tr->snapshot_trigger_lock); + } + + return ret; +} + +int tracing_arm_snapshot(struct trace_array *tr) +{ + int ret; + + mutex_lock(&trace_types_lock); + ret = tracing_arm_snapshot_locked(tr); + mutex_unlock(&trace_types_lock); + + return ret; +} + +void tracing_disarm_snapshot(struct trace_array *tr) +{ + spin_lock(&tr->snapshot_trigger_lock); + if (!WARN_ON(!tr->snapshot)) + tr->snapshot--; + spin_unlock(&tr->snapshot_trigger_lock); +} + /** * tracing_alloc_snapshot - allocate snapshot buffer. * @@ -1373,10 +1417,6 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, mutex_lock(&trace_types_lock); - ret = tracing_alloc_snapshot_instance(tr); - if (ret) - goto fail_unlock; - if (tr->current_trace->use_max_tr) { ret = -EBUSY; goto fail_unlock; @@ -1395,6 +1435,10 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, goto fail_unlock; } + ret = tracing_arm_snapshot_locked(tr); + if (ret) + goto fail_unlock; + local_irq_disable(); arch_spin_lock(&tr->max_lock); tr->cond_snapshot = cond_snapshot; @@ -1439,6 +1483,8 @@ int tracing_snapshot_cond_disable(struct trace_array *tr) arch_spin_unlock(&tr->max_lock); local_irq_enable(); + tracing_disarm_snapshot(tr); + return ret; } EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable); @@ -1481,6 +1527,7 @@ int tracing_snapshot_cond_disable(struct trace_array *tr) } EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable); #define free_snapshot(tr) do { } while (0) +#define tracing_arm_snapshot_locked(tr) ({ -EBUSY; }) #endif /* CONFIG_TRACER_SNAPSHOT */ void tracer_tracing_off(struct trace_array *tr) @@ -6593,11 +6640,12 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf) */ synchronize_rcu(); free_snapshot(tr); + tracing_disarm_snapshot(tr); } - if (t->use_max_tr && !tr->allocated_snapshot) { - ret = tracing_alloc_snapshot_instance(tr); - if (ret < 0) + if (t->use_max_tr) { + ret = tracing_arm_snapshot_locked(tr); + if (ret) goto out; } #else @@ -6606,8 +6654,13 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf) if (t->init) { ret = tracer_init(t, tr); - if (ret) + if (ret) { +#ifdef CONFIG_TRACER_MAX_TRACE + if (t->use_max_tr) + tracing_disarm_snapshot(tr); +#endif goto out; + } } tr->current_trace = t; @@ -7709,10 +7762,11 @@ tracing_snapshot_write(struct file *filp, const 
char __user *ubuf, size_t cnt, if (tr->allocated_snapshot) ret = resize_buffer_duplicate_size(&tr->max_buffer, &tr->array_buffer, iter->cpu_file); - else - ret = tracing_alloc_snapshot_instance(tr); - if (ret < 0) + + ret = tracing_arm_snapshot_locked(tr); + if (ret) break; + /* Now, we're going to swap */ if (iter->cpu_file == RING_BUFFER_ALL_CPUS) { local_irq_disable(); @@ -7722,6 +7776,7 @@ tracing_snapshot_write(struct file *filp, const cha
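As far as the calling pattern goes, a hypothetical user of the new helpers
would bracket its snapshot usage like the sketch below (illustrative only,
not code from the series):

	/* Hypothetical snapshot user, kernel context. */
	static int my_feature_enable(struct trace_array *tr)
	{
		int ret;

		/* take a snapshot reference; fails once the buffer is mmapped */
		ret = tracing_arm_snapshot(tr);
		if (ret)
			return ret;

		/* ... set up whatever relies on snapshot swaps ... */
		return 0;
	}

	static void my_feature_disable(struct trace_array *tr)
	{
		/* drop the reference taken in my_feature_enable() */
		tracing_disarm_snapshot(tr);
	}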
[PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice, which
is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each subbuffer
constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
  ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.

Signed-off-by: Vincent Donnefort

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index 182e05a3004a..7330249257e7 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };

+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4ebf4d0bd14c..36b62cf2fb3f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr,
 		return;
 	}

+	if (tr->mapped) {
+		trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+		return;
+	}
+
 	local_irq_save(flags);
 	update_max_tr(tr, current, smp_processor_id(), cond_data);
 	local_irq_restore(flags);
@@ -1307,7 +1313,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr)
 	lockdep_assert_held(&trace_types_lock);

 	spin_lock(&tr->snapshot_trigger_lock);
-	if (tr->snapshot == UINT_MAX) {
+	if (tr->snapshot == UINT_MAX || tr->mapped) {
 		spin_unlock(&tr->snapshot_trigger_lock);
 		return -EBUSY;
 	}
@@ -6533,7 +6539,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
 	if (tr->current_trace == &nop_trace)
 		return;
-
+
 	tr->current_trace->enabled--;

 	if (tr->current_trace->reset)
@@ -8652,15 +8658,31 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 	return ret;
 }

-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct ftrace_buffer_info *info = file->private_data;
 	struct trace_iterator *iter = &info->iter;
+	int err;
+
+	if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+		if (!(file->f_flags & O_NONBLOCK)) {
+			err = ring_buffer_wait(iter->array_buffer->buffer,
+					       iter->cpu_file,
+					       iter->tr->buffer_percent);
+			if (err)
+				return err;
+		}

-	if (cmd)
-		return -ENOIOCTLCMD;
+		return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+						  iter->cpu_file);
+	} else if (cmd) {
+		return -ENOTTY;
+	}

+	/*
+	 * An ioctl call with cmd 0 to the ring buffer file will wake up all
+	 * waiters
+	 */
 	mutex_lock(&trace_types_lock);

 	iter->wait_index++;
@@ -8673,6 +8695,97 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned
 	return 0;
 }

+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+	struct ftrace_buffer_info *info =
vmf->vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + vm_fault_t ret = VM_FAULT_SIGBUS; + struct page *page; + + page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file, +vmf->pgoff); + if (!page) + return ret; + + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + struct trace_array *tr = iter->tr; + + ring_buffer_unmap(iter->array
[PATCH v14 5/6] Documentation: tracing: Add ring-buffer mapping
It is now possible to mmap() a ring-buffer to stream its content. Add some
documentation and a code example.

Signed-off-by: Vincent Donnefort

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
    timerlat-tracer
    intel_th
    ring-buffer-design
+   ring-buffer-map
    stm
    sys-t
    coresight/index

diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..628254e63830
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort
+
+Overview
+========
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which
+can be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID.
+It is therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+        reader_id = meta->reader->id;
+        reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one
+using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl
+also updates the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or each
+subbuf). It is also not possible to use snapshot or splice.
+
+Concurrent readers (either another application mapping that ring-buffer or
+the kernel with trace_pipe) are allowed but not recommended. They will
+compete for the ring-buffer and the output is unpredictable.
+
+Example
+=======
+
+.. code-block:: c
+
+        #include <fcntl.h>
+        #include <stdio.h>
+        #include <stdlib.h>
+        #include <unistd.h>
+
+        #include <linux/trace_mmap.h>
+
+        #include <sys/mman.h>
+        #include <sys/ioctl.h>
+
+        #define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+        int main(void)
+        {
+                int page_size = getpagesize(), fd, reader_id;
+                unsigned long meta_len, data_len;
+                struct trace_buffer_meta *meta;
+                void *map, *reader, *data;
+
+                fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+                if (fd < 0)
+                        exit(EXIT_FAILURE);
+
+                map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+                if (map == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                meta = (struct trace_buffer_meta *)map;
+                meta_len = meta->meta_page_size;
+
+                printf("entries:    %llu\n", meta->entries);
+                printf("overrun:    %llu\n", meta->overrun);
+                printf("read:       %llu\n", meta->read);
+                printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+                data_len = meta->subbuf_size * meta->nr_subbufs;
+                data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+                if (data == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+                        exit(EXIT_FAILURE);
+
+                reader_id = meta->reader.id;
+                reader = data + meta->subbuf_size * reader_id;
+
+                printf("Current reader address: %p\n", reader);
+
+                munmap(data, data_len);
+                munmap(meta, meta_len);
+                close(fd);
+
+                return 0;
+        }
--
2.43.0.594.gd9cf4e227d-goog
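A possible follow-up to the documentation example (a sketch of mine, not
part of the patch): once a reader sub-buffer is selected, its raw payload
can be walked from the mapping. A real consumer would parse the sub-buffer
header for the committed length; this naive version just hex-dumps from the
already-read offset to the end of the sub-buffer:

	#include <stdint.h>
	#include <stdio.h>

	#include <linux/trace_mmap.h>

	/* Sketch: dump the reader sub-buffer, given the meta and data
	 * mappings obtained as in the example above. */
	static void dump_reader(const struct trace_buffer_meta *meta, const void *data)
	{
		const uint8_t *reader = (const uint8_t *)data +
					(size_t)meta->subbuf_size * meta->reader.id;
		uint32_t i;

		/* meta->reader.read = bytes already consumed on this sub-buffer */
		for (i = meta->reader.read; i < meta->subbuf_size; i++)
			printf("%02x ", reader[i]);
		printf("\n");
	}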
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 11:34, Vincent Donnefort wrote: Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is imposible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Mapping will prevent snapshot and buffer size modifications. How are the kernel linear mapping and the userspace mapping made coherent on architectures with virtually aliasing data caches ? Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Thanks, Mathieu Signed-off-by: Vincent Donnefort diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h index 182e05a3004a..7330249257e7 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -43,4 +43,6 @@ struct trace_buffer_meta { __u64 Reserved2; }; +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) + #endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 4ebf4d0bd14c..36b62cf2fb3f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr, return; } + if (tr->mapped) { + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); + return; + } + local_irq_save(flags); update_max_tr(tr, current, smp_processor_id(), cond_data); local_irq_restore(flags); @@ -1307,7 +1313,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr) lockdep_assert_held(&trace_types_lock); spin_lock(&tr->snapshot_trigger_lock); - if (tr->snapshot == UINT_MAX) { + if (tr->snapshot == UINT_MAX || tr->mapped) { spin_unlock(&tr->snapshot_trigger_lock); return -EBUSY; } @@ -6533,7 +6539,7 @@ static void tracing_set_nop(struct trace_array *tr) { if (tr->current_trace == &nop_trace) return; - + tr->current_trace->enabled--; if (tr->current_trace->reset) @@ -8652,15 +8658,31 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, return ret; } -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct ftrace_buffer_info *info = file->private_data; struct trace_iterator *iter = &info->iter; + int err; + + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err = ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent); + if (err) + return err; + } - if (cmd) - return -ENOIOCTLCMD; + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + /* +* An ioctl call with cmd 0 to the ring buffer file will wake up all +* waiters +*/ 
mutex_lock(&trace_types_lock); iter->wait_index++; @@ -8673,6 +8695,97 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned return 0; } +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) +{ + struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + vm_fault_t ret = VM_FAULT_SIGBUS; + struct page *page; + + page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file, +vmf->pgoff); + if (!page) + return ret; + + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return
Re: [PATCH] nvdimm: make nvdimm_bus_type const
On 2/4/24 1:20 PM, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the nvdimm_bus_type variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Dave Jiang > --- > drivers/nvdimm/bus.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c > index ef3d0f83318b..508aed017ddc 100644 > --- a/drivers/nvdimm/bus.c > +++ b/drivers/nvdimm/bus.c > @@ -271,7 +271,7 @@ EXPORT_SYMBOL_GPL(nvdimm_clear_poison); > > static int nvdimm_bus_match(struct device *dev, struct device_driver *drv); > > -static struct bus_type nvdimm_bus_type = { > +static const struct bus_type nvdimm_bus_type = { > .name = "nd", > .uevent = nvdimm_bus_uevent, > .match = nvdimm_bus_match, > > --- > base-commit: a085a5eb6594a3ebe5c275e9c2c2d341f686c23c > change-id: 20240204-bus_cleanup-nvdimm-91771693bd4d > > Best regards,
[PATCH v2 0/5] K3 DSP Remoteproc remove cleanup
Hello all, This series uses various devm_ helpers to simplify the device removal path. Removing an unused var "ret1" got squashed into the wrong patch in the v1 series causing a bisectability error. v2 is based on -next with the first 3 already taken. These are the last 5 patches of the v1 series[0]. Thanks, Andrew [0] https://lore.kernel.org/lkml/20240123184913.725435-4-...@ti.com/T/ Andrew Davis (5): remoteproc: k3-dsp: Use devm_ti_sci_get_by_phandle() helper remoteproc: k3-dsp: Use devm_kzalloc() helper remoteproc: k3-dsp: Add devm action to release tsp remoteproc: k3-dsp: Use devm_ioremap_wc() helper remoteproc: k3-dsp: Use devm_rproc_add() helper drivers/remoteproc/ti_k3_dsp_remoteproc.c | 116 ++ 1 file changed, 32 insertions(+), 84 deletions(-) -- 2.39.2
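For readers unfamiliar with the devm action pattern used throughout the
series, a generic sketch (not code from the patches): a release callback
registered at probe time runs automatically, in reverse registration order,
on unbind or probe failure:

	#include <linux/device.h>

	static void my_release(void *data)
	{
		/* undo whatever was acquired for @data */
	}

	static int my_probe_step(struct device *dev, void *resource)
	{
		/*
		 * On success the release runs automatically at unbind time;
		 * on registration failure it runs immediately ("or_reset"),
		 * so no manual error path is needed.
		 */
		return devm_add_action_or_reset(dev, my_release, resource);
	}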
[PATCH v2 1/5] remoteproc: k3-dsp: Use devm_ti_sci_get_by_phandle() helper
Use the device lifecycle managed TI-SCI get() function. This helps prevent mistakes like not put()'ing in the wrong order in cleanup functions and forgetting to put() on error paths. Signed-off-by: Andrew Davis --- drivers/remoteproc/ti_k3_dsp_remoteproc.c | 32 +++ 1 file changed, 9 insertions(+), 23 deletions(-) diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c index a13552c71f440..64ec5759c4ec1 100644 --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c @@ -708,30 +708,24 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) kproc->dev = dev; kproc->data = data; - kproc->ti_sci = ti_sci_get_by_phandle(np, "ti,sci"); + kproc->ti_sci = devm_ti_sci_get_by_phandle(dev, "ti,sci"); if (IS_ERR(kproc->ti_sci)) return dev_err_probe(dev, PTR_ERR(kproc->ti_sci), "failed to get ti-sci handle\n"); ret = of_property_read_u32(np, "ti,sci-dev-id", &kproc->ti_sci_id); - if (ret) { - dev_err_probe(dev, ret, "missing 'ti,sci-dev-id' property\n"); - goto put_sci; - } + if (ret) + return dev_err_probe(dev, ret, "missing 'ti,sci-dev-id' property\n"); kproc->reset = devm_reset_control_get_exclusive(dev, NULL); - if (IS_ERR(kproc->reset)) { - ret = dev_err_probe(dev, PTR_ERR(kproc->reset), - "failed to get reset\n"); - goto put_sci; - } + if (IS_ERR(kproc->reset)) + return dev_err_probe(dev, PTR_ERR(kproc->reset), +"failed to get reset\n"); kproc->tsp = k3_dsp_rproc_of_get_tsp(dev, kproc->ti_sci); - if (IS_ERR(kproc->tsp)) { - ret = dev_err_probe(dev, PTR_ERR(kproc->tsp), - "failed to construct ti-sci proc control\n"); - goto put_sci; - } + if (IS_ERR(kproc->tsp)) + return dev_err_probe(dev, PTR_ERR(kproc->tsp), +"failed to construct ti-sci proc control\n"); ret = ti_sci_proc_request(kproc->tsp); if (ret < 0) { @@ -805,10 +799,6 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret1)); free_tsp: kfree(kproc->tsp); -put_sci: - ret1 = ti_sci_put_handle(kproc->ti_sci); - if (ret1) - dev_err(dev, "failed to put ti_sci handle (%pe)\n", ERR_PTR(ret1)); return ret; } @@ -836,10 +826,6 @@ static void k3_dsp_rproc_remove(struct platform_device *pdev) kfree(kproc->tsp); - ret = ti_sci_put_handle(kproc->ti_sci); - if (ret) - dev_err(dev, "failed to put ti_sci handle (%pe)\n", ERR_PTR(ret)); - k3_dsp_reserved_mem_exit(kproc); } -- 2.39.2
[PATCH v2 3/5] remoteproc: k3-dsp: Add devm action to release tsp
Use a device lifecycle managed action to release the tsp ti_sci_proc
handle. This helps prevent mistakes like releasing out of order in cleanup
functions and forgetting to release on error paths.

Signed-off-by: Andrew Davis
---
 drivers/remoteproc/ti_k3_dsp_remoteproc.c | 27 +++
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c
index b9332c66a52ab..800c8c6767086 100644
--- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c
@@ -646,6 +646,13 @@ static void k3_dsp_reserved_mem_exit(struct k3_dsp_rproc *kproc)
 		iounmap(kproc->rmem[i].cpu_addr);
 }

+static void k3_dsp_release_tsp(void *data)
+{
+	struct ti_sci_proc *tsp = data;
+
+	ti_sci_proc_release(tsp);
+}
+
 static struct ti_sci_proc *k3_dsp_rproc_of_get_tsp(struct device *dev,
 						   const struct ti_sci_handle *sci)
@@ -682,7 +689,6 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev)
 	const char *fw_name;
 	bool p_state = false;
 	int ret = 0;
-	int ret1;

 	data = of_device_get_match_data(dev);
 	if (!data)
@@ -732,16 +738,17 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev)
 		dev_err_probe(dev, ret, "ti_sci_proc_request failed\n");
 		return ret;
 	}
+	ret = devm_add_action_or_reset(dev, k3_dsp_release_tsp, kproc->tsp);
+	if (ret)
+		return ret;

 	ret = k3_dsp_rproc_of_get_memories(pdev, kproc);
 	if (ret)
-		goto release_tsp;
+		return ret;

 	ret = k3_dsp_reserved_mem_init(kproc);
-	if (ret) {
-		dev_err_probe(dev, ret, "reserved memory init failed\n");
-		goto release_tsp;
-	}
+	if (ret)
+		return dev_err_probe(dev, ret, "reserved memory init failed\n");

 	ret = kproc->ti_sci->ops.dev_ops.is_on(kproc->ti_sci, kproc->ti_sci_id,
 					       NULL, &p_state);
@@ -793,10 +800,6 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev)

 release_mem:
 	k3_dsp_reserved_mem_exit(kproc);
-release_tsp:
-	ret1 = ti_sci_proc_release(kproc->tsp);
-	if (ret1)
-		dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret1));
 	return ret;
 }

@@ -818,10 +821,6 @@ static void k3_dsp_rproc_remove(struct platform_device *pdev)

 	rproc_del(kproc->rproc);

-	ret = ti_sci_proc_release(kproc->tsp);
-	if (ret)
-		dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret));
-
 	k3_dsp_reserved_mem_exit(kproc);
 }
--
2.39.2
[PATCH v2 2/5] remoteproc: k3-dsp: Use devm_kzalloc() helper
Use device lifecycle managed devm_kzalloc() helper function. This helps prevent mistakes like freeing out of order in cleanup functions and forgetting to free on all error paths. Signed-off-by: Andrew Davis --- drivers/remoteproc/ti_k3_dsp_remoteproc.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c index 64ec5759c4ec1..b9332c66a52ab 100644 --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c @@ -659,7 +659,7 @@ struct ti_sci_proc *k3_dsp_rproc_of_get_tsp(struct device *dev, if (ret < 0) return ERR_PTR(ret); - tsp = kzalloc(sizeof(*tsp), GFP_KERNEL); + tsp = devm_kzalloc(dev, sizeof(*tsp), GFP_KERNEL); if (!tsp) return ERR_PTR(-ENOMEM); @@ -730,7 +730,7 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) ret = ti_sci_proc_request(kproc->tsp); if (ret < 0) { dev_err_probe(dev, ret, "ti_sci_proc_request failed\n"); - goto free_tsp; + return ret; } ret = k3_dsp_rproc_of_get_memories(pdev, kproc); @@ -797,8 +797,6 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) ret1 = ti_sci_proc_release(kproc->tsp); if (ret1) dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret1)); -free_tsp: - kfree(kproc->tsp); return ret; } @@ -824,8 +822,6 @@ static void k3_dsp_rproc_remove(struct platform_device *pdev) if (ret) dev_err(dev, "failed to release proc (%pe)\n", ERR_PTR(ret)); - kfree(kproc->tsp); - k3_dsp_reserved_mem_exit(kproc); } -- 2.39.2
[PATCH v2 5/5] remoteproc: k3-dsp: Use devm_rproc_add() helper
Use device lifecycle managed devm_rproc_add() helper function. This helps prevent mistakes like deleting out of order in cleanup functions and forgetting to delete on all error paths. Signed-off-by: Andrew Davis --- drivers/remoteproc/ti_k3_dsp_remoteproc.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c index f799f74734b4a..3555b535b1683 100644 --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c @@ -768,7 +768,7 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) } } - ret = rproc_add(rproc); + ret = devm_rproc_add(dev, rproc); if (ret) return dev_err_probe(dev, ret, "failed to add register device with remoteproc core\n"); @@ -786,14 +786,9 @@ static void k3_dsp_rproc_remove(struct platform_device *pdev) if (rproc->state == RPROC_ATTACHED) { ret = rproc_detach(rproc); - if (ret) { - /* Note this error path leaks resources */ + if (ret) dev_err(dev, "failed to detach proc (%pe)\n", ERR_PTR(ret)); - return; - } } - - rproc_del(kproc->rproc); } static const struct k3_dsp_mem_data c66_mems[] = { -- 2.39.2
[PATCH v2 4/5] remoteproc: k3-dsp: Use devm_ioremap_wc() helper
Use a device lifecycle managed ioremap helper function. This helps prevent mistakes like unmapping out of order in cleanup functions and forgetting to unmap on all error paths. Signed-off-by: Andrew Davis --- drivers/remoteproc/ti_k3_dsp_remoteproc.c | 48 +-- 1 file changed, 10 insertions(+), 38 deletions(-) diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c b/drivers/remoteproc/ti_k3_dsp_remoteproc.c index 800c8c6767086..f799f74734b4a 100644 --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c @@ -598,16 +598,13 @@ static int k3_dsp_reserved_mem_init(struct k3_dsp_rproc *kproc) /* use remaining reserved memory regions for static carveouts */ for (i = 0; i < num_rmems; i++) { rmem_np = of_parse_phandle(np, "memory-region", i + 1); - if (!rmem_np) { - ret = -EINVAL; - goto unmap_rmem; - } + if (!rmem_np) + return -EINVAL; rmem = of_reserved_mem_lookup(rmem_np); if (!rmem) { of_node_put(rmem_np); - ret = -EINVAL; - goto unmap_rmem; + return -EINVAL; } of_node_put(rmem_np); @@ -615,12 +612,11 @@ static int k3_dsp_reserved_mem_init(struct k3_dsp_rproc *kproc) /* 64-bit address regions currently not supported */ kproc->rmem[i].dev_addr = (u32)rmem->base; kproc->rmem[i].size = rmem->size; - kproc->rmem[i].cpu_addr = ioremap_wc(rmem->base, rmem->size); + kproc->rmem[i].cpu_addr = devm_ioremap_wc(dev, rmem->base, rmem->size); if (!kproc->rmem[i].cpu_addr) { dev_err(dev, "failed to map reserved memory#%d at %pa of size %pa\n", i + 1, &rmem->base, &rmem->size); - ret = -ENOMEM; - goto unmap_rmem; + return -ENOMEM; } dev_dbg(dev, "reserved memory%d: bus addr %pa size 0x%zx va %pK da 0x%x\n", @@ -631,19 +627,6 @@ static int k3_dsp_reserved_mem_init(struct k3_dsp_rproc *kproc) kproc->num_rmems = num_rmems; return 0; - -unmap_rmem: - for (i--; i >= 0; i--) - iounmap(kproc->rmem[i].cpu_addr); - return ret; -} - -static void k3_dsp_reserved_mem_exit(struct k3_dsp_rproc *kproc) -{ - int i; - - for (i = 0; i < kproc->num_rmems; i++) - iounmap(kproc->rmem[i].cpu_addr); } static void k3_dsp_release_tsp(void *data) @@ -752,10 +735,8 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) ret = kproc->ti_sci->ops.dev_ops.is_on(kproc->ti_sci, kproc->ti_sci_id, NULL, &p_state); - if (ret) { - dev_err_probe(dev, ret, "failed to get initial state, mode cannot be determined\n"); - goto release_mem; - } + if (ret) + return dev_err_probe(dev, ret, "failed to get initial state, mode cannot be determined\n"); /* configure J721E devices for either remoteproc or IPC-only mode */ if (p_state) { @@ -779,8 +760,7 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) if (data->uses_lreset) { ret = reset_control_status(kproc->reset); if (ret < 0) { - dev_err_probe(dev, ret, "failed to get reset status\n"); - goto release_mem; + return dev_err_probe(dev, ret, "failed to get reset status\n"); } else if (ret == 0) { dev_warn(dev, "local reset is deasserted for device\n"); k3_dsp_rproc_reset(kproc); @@ -789,18 +769,12 @@ static int k3_dsp_rproc_probe(struct platform_device *pdev) } ret = rproc_add(rproc); - if (ret) { - dev_err_probe(dev, ret, "failed to add register device with remoteproc core\n"); - goto release_mem; - } + if (ret) + return dev_err_probe(dev, ret, "failed to add register device with remoteproc core\n"); platform_set_drvdata(pdev, kproc); return 0; - -release_mem: - k3_dsp_reserved_mem_exit(kproc); - return ret; } static void k3_dsp_rproc_remove(struct platform_device *pdev) @@ -820,8 +794,6 @@ static void k3_dsp_rproc_remove(struct platform_device 
*pdev) } rproc_del(kproc->rproc); - - k3_dsp_reserved_mem_exit(kproc); } static const struct k3_dsp_mem_data c66_mems[] = { -- 2.39.2
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: > On 2024-02-05 11:34, Vincent Donnefort wrote: > > Currently, user-space extracts data from the ring-buffer via splice, > > which is handy for storage or network sharing. However, due to splice > > limitations, it is imposible to do real-time analysis without a copy. > > > > A solution for that problem is to let the user-space map the ring-buffer > > directly. > > > > The mapping is exposed via the per-CPU file trace_pipe_raw. The first > > element of the mapping is the meta-page. It is followed by each > > subbuffer constituting the ring-buffer, ordered by their unique page ID: > > > >* Meta-page -- include/uapi/linux/trace_mmap.h for a description > >* Subbuf ID 0 > >* Subbuf ID 1 > > ... > > > > It is therefore easy to translate a subbuf ID into an offset in the > > mapping: > > > >reader_id = meta->reader->id; > >reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; > > > > When new data is available, the mapper must call a newly introduced ioctl: > > TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to > > point to the next reader containing unread data. > > > > Mapping will prevent snapshot and buffer size modifications. > > How are the kernel linear mapping and the userspace mapping made coherent > on architectures with virtually aliasing data caches ? > > Ref. > https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Hi Mathieu, Thanks for the pointer. We are in the exact same problem as DAX. We do modify the data through the kernel linear mapping while user-space can read it through its own. I should probably return an error when used with any of the arch ARM || SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grain differentiation. 
> > Thanks, > > Mathieu > > > > > Signed-off-by: Vincent Donnefort > > > > diff --git a/include/uapi/linux/trace_mmap.h > > b/include/uapi/linux/trace_mmap.h > > index 182e05a3004a..7330249257e7 100644 > > --- a/include/uapi/linux/trace_mmap.h > > +++ b/include/uapi/linux/trace_mmap.h > > @@ -43,4 +43,6 @@ struct trace_buffer_meta { > > __u64 Reserved2; > > }; > > +#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1) > > + > > #endif /* _TRACE_MMAP_H_ */ > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > > index 4ebf4d0bd14c..36b62cf2fb3f 100644 > > --- a/kernel/trace/trace.c > > +++ b/kernel/trace/trace.c > > @@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct > > trace_array *tr, > > return; > > } > > + if (tr->mapped) { > > + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); > > + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); > > + return; > > + } > > + > > local_irq_save(flags); > > update_max_tr(tr, current, smp_processor_id(), cond_data); > > local_irq_restore(flags); > > @@ -1307,7 +1313,7 @@ static int tracing_arm_snapshot_locked(struct > > trace_array *tr) > > lockdep_assert_held(&trace_types_lock); > > spin_lock(&tr->snapshot_trigger_lock); > > - if (tr->snapshot == UINT_MAX) { > > + if (tr->snapshot == UINT_MAX || tr->mapped) { > > spin_unlock(&tr->snapshot_trigger_lock); > > return -EBUSY; > > } > > @@ -6533,7 +6539,7 @@ static void tracing_set_nop(struct trace_array *tr) > > { > > if (tr->current_trace == &nop_trace) > > return; > > - > > + > > tr->current_trace->enabled--; > > if (tr->current_trace->reset) > > @@ -8652,15 +8658,31 @@ tracing_buffers_splice_read(struct file *file, > > loff_t *ppos, > > return ret; > > } > > -/* An ioctl call with cmd 0 to the ring buffer file will wake up all > > waiters */ > > static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, > > unsigned long arg) > > { > > struct ftrace_buffer_info *info = file->private_data; > > struct trace_iterator *iter = &info->iter; > > + int err; > > + > > + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { > > + if (!(file->f_flags & O_NONBLOCK)) { > > + err = ring_buffer_wait(iter->array_buffer->buffer, > > + iter->cpu_file, > > + iter->tr->buffer_percent); > > + if (err) > > + return err; > > + } > > - if (cmd) > > - return -ENOIOCTLCMD; > > + return ring_buffer_map_get_reader(iter->array_buffer->buffer, > > + iter->cpu_file); > > + } else if (cmd) { > > + return -ENOTTY; > > + } > > + /* > > +* An ioctl call with cmd 0 to the ring buffer file will wake up all > > +* waiters > > +*/ > > mutex_lock(&trace_types_lock); > > iter->wait_index++; > > @@ -8673,6 +8695,97 @@ static long tracing_buffer
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 13:34, Vincent Donnefort wrote:

On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote:
[...]

How are the kernel linear mapping and the userspace mapping made coherent
on architectures with virtually aliasing data caches ?

Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t

Hi Mathieu,

Thanks for the pointer.

We are in the exact same problem as DAX. We do modify the data through the
kernel linear mapping while user-space can read it through its own. I
should probably return an error when used with any of the arch ARM ||
SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grain
differentiation.

You might want to use LTTng's ring buffer approach instead. See
https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202
lib_ring_buffer_flush_read_subbuf_dcache()

Basically, whenever user-space grabs a sub-buffer for reading (through
lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng calls
flush_dcache_page() on all pages of this subbuffer (I should really change
this for a call to flush_dcache_folio() which would be more efficient).

Note that doing this is not very efficient on architectures which have
coherent data caches and incoherent dcache vs icache: in that case, we
issue the flush_dcache_page uselessly. I plan on using the new
cpu_dcache_is_aliasing() check once/if it makes its way upstream to remove
those useless flushes on architectures which define
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the data
cache.

The equivalent of LTTng's "get subbuf" operation would be the new
TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
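Translated to the ftrace side, that approach would amount to something like
the following sketch (hypothetical helper, kernel context), run when
TRACE_MMAP_IOCTL_GET_READER hands a sub-buffer to user-space:

	#include <linux/highmem.h>	/* flush_dcache_page() */

	/*
	 * Hypothetical: make kernel-side writes to one sub-buffer visible
	 * to a user mapping on a virtually aliasing D-cache.
	 */
	static void rb_flush_subbuf_dcache(struct page **pages, int nr_pages)
	{
		int i;

		for (i = 0; i < nr_pages; i++)
			flush_dcache_page(pages[i]);
	}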
Re: [PATCH] virtio: make virtio_bus const
On Sun, Feb 04, 2024 at 05:52:51PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the virtio_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH] vdpa: make vdpa_bus const
On Sun, Feb 04, 2024 at 05:50:45PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the vdpa_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH] rpmsg: core: make rpmsg_bus const
On Sun, Feb 04, 2024 at 05:32:05PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the rpmsg_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH] nvdimm: make nvdimm_bus_type const
On Sun, Feb 04, 2024 at 05:20:07PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the nvdimm_bus_type variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH] device-dax: make dax_bus_type const
On Sun, Feb 04, 2024 at 01:07:11PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the dax_bus_type variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Greg Kroah-Hartman
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On Mon, Feb 05, 2024 at 01:44:47PM -0500, Mathieu Desnoyers wrote: > On 2024-02-05 13:34, Vincent Donnefort wrote: > > On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: > [...] > > > > > > > How are the kernel linear mapping and the userspace mapping made coherent > > > on architectures with virtually aliasing data caches ? > > > > > > Ref. > > > https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t > > > > Hi Mathieu, > > > > Thanks for the pointer. > > > > We are in the exact same problem as DAX. We do modify the data through the > > kernel linear mapping while user-space can read it through its own. I should > > probably return an error when used with any of the arch ARM || SPARC || > > MIPS, > > until cpu_dcache_is_aliasing() introduces a fine-grain differentiation. > > You might want to use LTTng's ring buffer approach instead. See > > https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202 > > lib_ring_buffer_flush_read_subbuf_dcache() Thanks! > > Basically, whenever user-space grabs a sub-buffer for reading (through > lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng > calls flush_dcache_page() on all pages of this subbuffer (I should > really change this for a call to flush_dcache_folio() which would be > more efficient). > > Note that doing this is not very efficient on architectures which have > coherent data caches and incoherent dcache vs icache: in that case, > we issue the flush_dcache_page uselessly. I plan on using the new > cpu_dcache_is_aliasing() check once/if it makes it way upstream to > remove those useless flushes on architectures which define > ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the > data cache. I believe the aim is to use the mapping by default in libtracefs and fallback to splice whenever not available... But for those arch, I guess that might be a mistake. Wonder if then it isn't just better to return ENOTSUPP? > > The equivalent of LTTng's "get subbuf" operation would be > the new TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU. That is correct! > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
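Gating the mapping on such architectures could be as small as the sketch
below, assuming cpu_dcache_is_aliasing() from Mathieu's series lands
upstream (hypothetical code; -EOPNOTSUPP used as the user-visible error):

	/* Hypothetical gate at the top of the mmap path (kernel context). */
	static int tracing_buffers_mmap_sketch(struct file *filp,
					       struct vm_area_struct *vma)
	{
		/* a user mapping would virtually alias the kernel linear map */
		if (cpu_dcache_is_aliasing())
			return -EOPNOTSUPP;

		/* ... proceed with the actual mapping ... */
		return 0;
	}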
Re: [PATCH v2 0/5] K3 DSP Remoteproc remove cleanup
On Mon, Feb 05, 2024 at 12:27:48PM -0600, Andrew Davis wrote: > Hello all, > > This series uses various devm_ helpers to simplify the device > removal path. > > Removing an unused var "ret1" got squashed into the wrong patch in > the v1 series causing a bisectability error. v2 is based on -next > with the first 3 already taken. These are the last 5 patches of the > v1 series[0]. > > Thanks, > Andrew > > [0] https://lore.kernel.org/lkml/20240123184913.725435-4-...@ti.com/T/ > > Andrew Davis (5): > remoteproc: k3-dsp: Use devm_ti_sci_get_by_phandle() helper > remoteproc: k3-dsp: Use devm_kzalloc() helper > remoteproc: k3-dsp: Add devm action to release tsp > remoteproc: k3-dsp: Use devm_ioremap_wc() helper > remoteproc: k3-dsp: Use devm_rproc_add() helper > > drivers/remoteproc/ti_k3_dsp_remoteproc.c | 116 ++ > 1 file changed, 32 insertions(+), 84 deletions(-) > I have applied this set. Thanks, Mathieu > -- > 2.39.2 >
Re: [PATCH] rpmsg: core: make rpmsg_bus const
On Sun, Feb 04, 2024 at 05:32:05PM -0300, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the rpmsg_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere > --- > drivers/rpmsg/rpmsg_core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/rpmsg/rpmsg_core.c b/drivers/rpmsg/rpmsg_core.c > index 8abc7d022ff7..4295c01a2861 100644 > --- a/drivers/rpmsg/rpmsg_core.c > +++ b/drivers/rpmsg/rpmsg_core.c > @@ -605,7 +605,7 @@ static void rpmsg_dev_remove(struct device *dev) > rpmsg_destroy_ept(rpdev->ept); > } > > -static struct bus_type rpmsg_bus = { > +static const struct bus_type rpmsg_bus = { > .name = "rpmsg", > .match = rpmsg_dev_match, > .dev_groups = rpmsg_dev_groups, > Applied. Thanks, Mathieu > --- > base-commit: 80255b24efbe83a6a01600484b6959259a30ded5 > change-id: 20240204-bus_cleanup-rpmsg-1a5f6ab69a24 > > Best regards, > -- > Ricardo B. Marliere >
[PATCH v9 00/15] Add Cgroup support for SGX EPC memory
SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments, e.g., support for pod level control in a Kubernetes cluster on a VM or bare-metal host [1,2]. This patchset implements support for sgx_epc memory within the misc cgroup controller. A user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per cgroup. The implementation reports current usage and events of reaching the limit per cgroup as well as the total system capacity.

Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques and pages can be swapped out of the EPC to their backing store, which is normal system memory allocated via shmem and accounted by the memory controller. Similar to per-cgroup reclamation done by the memory controller, the EPC misc controller needs to implement a per-cgroup EPC reclaiming process: when the EPC usage of a cgroup reaches its hard limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out some EPC pages within the same cgroup to make room for new allocations. For that, this implementation tracks reclaimable EPC pages in a separate LRU list in each cgroup, and below are more details and justification of this design.

Track EPC pages in per-cgroup LRUs (from Dave)
----------------------------------------------

tl;dr: A cgroup hitting its limit should be as similar as possible to the system running out of EPC memory. The only two choices to implement that are nasty changes to the existing LRU scanning algorithm, or adding new LRUs. The result: add a new LRU for each cgroup and scan those instead. Replace the existing global LRU with the root cgroup's LRU (only when this new support is compiled in, obviously).

The existing EPC memory management aims to be a miniature version of the core VM where EPC memory can be overcommitted and reclaimed. EPC allocations can wait for reclaim. The alternative to waiting would have been to send a signal and let the enclave die. This series attempts to implement that same logic for cgroups, for the same reasons: it's preferable to wait for memory to become available and let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list (and the existing scanning algorithm) is essentially useless for doing reclaim when a cgroup hits its limit because the cgroup's pages are scattered around that LRU. It is unspeakably inefficient to scan a linked list with millions of entries for what could be dozens of pages from a cgroup that needs reclaim. Even if unspeakably slow reclaim was accepted, the existing scanning algorithm only picks a few pages off the head of the global LRU. It would either need to hold the list locks for unreasonable amounts of time, or be taught to scan the list in pieces, which has its own challenges.

Unreclaimable Enclave Pages
---------------------------

There are a variety of page types for enclaves, each serving different purposes [5]. Although the SGX architecture supports swapping for all types, some special pages, e.g., Version Array (VA) and Secure Enclave Control Structure (SECS) [5], hold metadata of reclaimed pages and enclaves. That makes reclamation of such pages more intricate to manage. The SGX driver's global reclaimer currently does not swap out VA pages.
It only swaps the SECS page of an enclave when all other associated pages have been swapped out. The cgroup reclaimer follows the same approach: it does not track those pages in per-cgroup LRUs and considers them unreclaimable. The allocation of these pages is counted towards the usage of a specific cgroup and is subject to the cgroup's set EPC limits.

Earlier versions of this series implemented forced enclave-killing to reclaim VA and SECS pages. That was designed to enforce the 'max' limit, particularly in scenarios where a user or administrator reduces this limit post-launch of enclaves. However, subsequent discussions [3, 4] indicated that such preemptive enforcement is not necessary for the misc controllers. Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed, and the limit is only enforced at the time of a new EPC allocation request. When a cgroup hits its limit but nothing is left in the LRUs of the subtree, i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC within that cgroup will result in an 'ENOMEM'.

Unreclaimable Guest VM EPC Pages
--------------------------------

The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats those as unreclaimable and returns ENOMEM when its limit is hit and nothing reclaimable is left within the cgroup. The virtual EPC driver translates th
[PATCH v9 03/15] cgroup/misc: Add SGX EPC resource type
From: Kristen Carlson Accardi Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type for the misc controller. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen --- V6: - Split the original patch into this and the preceding one (Kai) --- include/linux/misc_cgroup.h | 4 kernel/cgroup/misc.c| 4 2 files changed, 8 insertions(+) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 541a5611c597..2f6cc3a0ad23 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -17,6 +17,10 @@ enum misc_res_type { MISC_CG_RES_SEV, /* AMD SEV-ES ASIDs resource */ MISC_CG_RES_SEV_ES, +#endif +#ifdef CONFIG_CGROUP_SGX_EPC + /* SGX EPC memory resource */ + MISC_CG_RES_SGX_EPC, #endif MISC_CG_RES_TYPES }; diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 1f0d8e05b36c..e51d6a45007f 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -24,6 +24,10 @@ static const char *const misc_res_name[] = { /* AMD SEV-ES ASIDs resource */ "sev_es", #endif +#ifdef CONFIG_CGROUP_SGX_EPC + /* Intel SGX EPC memory bytes */ + "sgx_epc", +#endif }; /* Root misc cgroup */ -- 2.25.1
[PATCH v9 01/15] cgroup/misc: Add per resource callbacks for CSS events
From: Kristen Carlson Accardi The misc cgroup controller (subsystem) currently does not perform resource type specific action for Cgroups Subsystem State (CSS) events: the 'css_alloc' event when a cgroup is created and the 'css_free' event when a cgroup is destroyed. Define callbacks for those events and allow resource providers to register the callbacks per resource type as needed. This will be utilized later by the EPC misc cgroup support implemented in the SGX driver. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo --- V8: - Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko) V7: - Make ops one per resource type and store them in array (Michal) - Rename the ops struct to misc_res_ops, and enforce the constraints of required callback functions (Jarkko) - Moved addition of priv field to patch 4 where it was used first. (Jarkko) V6: - Create ops struct for per resource callbacks (Jarkko) - Drop max_write callback (Dave, Michal) - Style fixes (Kai) --- include/linux/misc_cgroup.h | 11 + kernel/cgroup/misc.c| 84 + 2 files changed, 87 insertions(+), 8 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index e799b1f8d05b..0806d4436208 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -27,6 +27,16 @@ struct misc_cg; #include +/** + * struct misc_res_ops: per resource type callback ops. + * @alloc: invoked for resource specific initialization when cgroup is allocated. + * @free: invoked for resource specific cleanup when cgroup is deallocated. + */ +struct misc_res_ops { + int (*alloc)(struct misc_cg *cg); + void (*free)(struct misc_cg *cg); +}; + /** * struct misc_res: Per cgroup per misc type resource * @max: Maximum limit on the resource. @@ -56,6 +66,7 @@ struct misc_cg { u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount); void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount); diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 79a3717a5803..14ab13ef3bc7 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -39,6 +39,9 @@ static struct misc_cg root_cg; */ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; +/* Resource type specific operations */ +static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; + /** * parent_misc() - Get the parent of the passed misc cgroup. * @cgroup: cgroup whose parent needs to be fetched. @@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity) } EXPORT_SYMBOL_GPL(misc_cg_set_capacity); +/** + * misc_cg_set_ops() - set resource specific operations. + * @type: Type of the misc res. + * @ops: Operations for the given type. + * + * Context: Any context. + * Return: + * * %0 - Successfully registered the operations. + * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks. 
+ */ +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops) +{ + if (!valid_type(type)) + return -EINVAL; + + if (!ops->alloc) { + pr_err("%s: alloc missing\n", __func__); + return -EINVAL; + } + + if (!ops->free) { + pr_err("%s: free missing\n", __func__); + return -EINVAL; + } + + misc_res_ops[type] = ops; + return 0; +} +EXPORT_SYMBOL_GPL(misc_cg_set_ops); + /** * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup. * @type: Misc res type in misc cg to cancel the charge from. @@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = { {} }; +static inline int _misc_cg_res_alloc(struct misc_cg *cg) +{ + enum misc_res_type i; + int ret; + + for (i = 0; i < MISC_CG_RES_TYPES; i++) { + WRITE_ONCE(cg->res[i].max, MAX_NUM); + atomic64_set(&cg->res[i].usage, 0); + if (misc_res_ops[i]) { + ret = misc_res_ops[i]->alloc(cg); + if (ret) + return ret; + } + } + + return 0; +} + +static inline void _misc_cg_res_free(struct misc_cg *cg) +{ + enum misc_res_type i; + + for (i = 0; i < MISC_CG_RES_TYPES; i++) + if (misc_res_ops[i]) + misc_res_ops[i]->free(cg); +} + /** * misc_cg_alloc() - Allocate misc cgroup. * @parent_css: Parent cgroup. @@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = { static struct cgroup_subsys_state * misc_cg_alloc(struct cgroup_subsys_state *parent_css) { - enum misc_res_type i; - struct m
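A usage sketch of the new registration API (patch 4 wires this up for SGX EPC; the example_* names below are made up for illustration):

    static int example_cg_alloc(struct misc_cg *cg)
    {
            /* resource-specific per-cgroup initialization */
            return 0;
    }

    static void example_cg_free(struct misc_cg *cg)
    {
            /* resource-specific per-cgroup cleanup */
    }

    static const struct misc_res_ops example_res_ops = {
            .alloc = example_cg_alloc,
            .free  = example_cg_free,
    };

    /* In the provider's init path; returns -EINVAL if either callback
     * is missing. */
    ret = misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &example_res_ops);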
[PATCH v9 02/15] cgroup/misc: Export APIs for SGX driver
From: Kristen Carlson Accardi The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches its own or an ancestor's limit. This requires a walk from the current cgroup up to the root, similar to misc_cg_try_charge(). Export misc_cg_parent() to enable this walk. The SGX driver may also need to start a global-level reclamation from the root. Export misc_cg_root() for the SGX driver to access. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo --- V6: - Make commit messages more concise and split the original patch into two (Kai) --- include/linux/misc_cgroup.h | 24 kernel/cgroup/misc.c| 21 - 2 files changed, 32 insertions(+), 13 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 0806d4436208..541a5611c597 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -64,6 +64,7 @@ struct misc_cg { struct misc_res res[MISC_CG_RES_TYPES]; }; +struct misc_cg *misc_cg_root(void); u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); @@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css) return css ? container_of(css, struct misc_cg, css) : NULL; } +/** + * misc_cg_parent() - Get the parent of the passed misc cgroup. + * @cgroup: cgroup whose parent needs to be fetched. + * + * Context: Any context. + * Return: + * * struct misc_cg* - Parent of the @cgroup. + * * %NULL - If @cgroup is null or the passed cgroup does not have a parent. + */ +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup) +{ + return cgroup ? css_misc(cgroup->css.parent) : NULL; +} + /* * get_current_misc_cg() - Find and get the misc cgroup of the current task. * @@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg) } #else /* !CONFIG_CGROUP_MISC */ +static inline struct misc_cg *misc_cg_root(void) +{ + return NULL; +} + +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg) +{ + return NULL; +} static inline u64 misc_cg_res_total_usage(enum misc_res_type type) { diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 14ab13ef3bc7..1f0d8e05b36c 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; /** - * parent_misc() - Get the parent of the passed misc cgroup. - * @cgroup: cgroup whose parent needs to be fetched. - * - * Context: Any context. - * Return: - * * struct misc_cg* - Parent of the @cgroup. - * * %NULL - If @cgroup is null or the passed cgroup does not have a parent. + * misc_cg_root() - Return the root misc cgroup. */ -static struct misc_cg *parent_misc(struct misc_cg *cgroup) +struct misc_cg *misc_cg_root(void) { - return cgroup ? css_misc(cgroup->css.parent) : NULL; + return &root_cg; } +EXPORT_SYMBOL_GPL(misc_cg_root); /** * valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) if (!amount) return 0; - for (i = cg; i; i = parent_misc(i)) { + for (i = cg; i; i = misc_cg_parent(i)) { res = &i->res[type]; new_usage = atomic64_add_return(amount, &res->usage); @@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) return 0; err_charge: - for (j = i; j; j = parent_misc(j)) { + for (j = i; j; j = misc_cg_parent(j)) { atomic64_inc(&j->res[type].events); cgroup_file_notify(&j->events_file); } - for (j = cg; j != i; j = parent_misc(j)) + for (j = cg; j != i; j = misc_cg_parent(j)) misc_cg_cancel_charge(type, j, amount); misc_cg_cancel_charge(type, i, amount); return ret; @@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount) if (!(amount && valid_type(type) && cg)) return; - for (i = cg; i; i = parent_misc(i)) + for (i = cg; i; i = misc_cg_parent(i)) misc_cg_cancel_charge(type, i, amount); } EXPORT_SYMBOL_GPL(misc_cg_uncharge); -- 2.25.1
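A sketch of the root-ward walk these exports enable; it mirrors the per-cgroup limit walk the EPC reclaimer performs in patch 8, with an illustrative function name:

    /* Walk from @cg to the root and return the tightest (lowest) limit
     * seen along the way. */
    static u64 example_min_limit_to_root(struct misc_cg *cg, enum misc_res_type type)
    {
            u64 m = U64_MAX;

            for (; cg; cg = misc_cg_parent(cg))
                    m = min(m, READ_ONCE(cg->res[type].max));

            return m;
    }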
[PATCH v9 06/15] x86/sgx: Abstract tracking reclaimable pages in LRU
From: Kristen Carlson Accardi The functions sgx_{mark,unmark}_page_reclaimable() manage the tracking of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly allocated page into the global LRU list while sgx_unmark_page_reclaimable() does the opposite. Abstract the hard-coded global LRU references in these functions to make them reusable when pages are tracked in per-cgroup LRUs. Create a helper, sgx_lru_list(), that returns the LRU that tracks a given EPC page. It simply returns the global LRU now, and will later return the LRU of the cgroup within which the EPC page was allocated. Replace the hard-coded global LRU with a call to this helper. Subsequent patches will first get the cgroup reclamation flow ready while keeping pages tracked in the global LRU and reclaimed by ksgxd; the final switch is made at the end by having sgx_lru_list() return the per-cgroup LRU. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen --- V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 30 ++ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 912959c7ecc9..a131aa985c95 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space); */ static struct sgx_epc_lru_list sgx_global_lru; +static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page) +{ + return &sgx_global_lru; +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ @@ -500,25 +505,24 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void) } /** - * sgx_mark_page_reclaimable() - Mark a page as reclaimable + * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU. * @page: EPC page - * - * Mark a page as reclaimable and add it to the active page list. Pages - * are automatically removed from the active list when freed. */ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_global_lru.lock); + struct sgx_epc_lru_list *lru = sgx_lru_list(page); + + spin_lock(&lru->lock); page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED; - list_add_tail(&page->list, &sgx_global_lru.reclaimable); - spin_unlock(&sgx_global_lru.lock); + list_add_tail(&page->list, &lru->reclaimable); + spin_unlock(&lru->lock); } /** - * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list + * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU * @page: EPC page * - * Clear the reclaimable flag and remove the page from the active page list. + * Clear the reclaimable flag if set and remove the page from its LRU. * * Return: * 0 on success, @@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) */ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_global_lru.lock); + struct sgx_epc_lru_list *lru = sgx_lru_list(page); + + spin_lock(&lru->lock); if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) { /* The page is being reclaimed. */ if (list_empty(&page->list)) { - spin_unlock(&sgx_global_lru.lock); + spin_unlock(&lru->lock); return -EBUSY; } list_del(&page->list); page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_global_lru.lock); + spin_unlock(&lru->lock); return 0; } -- 2.25.1
[PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
From: Kristen Carlson Accardi SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments. For example, in a Kubernetes environment, a user can request certain EPC quota for a pod but the orchestrator cannot enforce the quota to limit runtime EPC usage of the pod without an EPC cgroup controller. Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to limit and track EPC allocations per cgroup. Earlier patches have added the "sgx_epc" resource type in the misc cgroup subsystem. Add basic support in the SGX driver as the "sgx_epc" resource provider: - Set "capacity" of EPC by calling misc_cg_set_capacity() - Update EPC usage counter, "current", by calling charge and uncharge APIs for EPC allocation and deallocation, respectively. - Set up sgx_epc resource type specific callbacks, which perform initialization and cleanup during cgroup allocation and deallocation, respectively. With these changes, the misc cgroup controller enables users to set a hard limit for EPC usage in the "misc.max" interface file. It reports current usage in "misc.current", the total EPC memory available in "misc.capacity", and the number of times EPC usage reached the max limit in "misc.events". For now, the EPC cgroup simply blocks additional EPC allocation in sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are still tracked in the global active list, only reclaimed by the global reclaimer when the total free page count is lower than a threshold. Later patches will reorganize the tracking and reclamation code in the global reclaimer and implement per-cgroup tracking and reclaiming. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo --- V8: - Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko) - Remove extra space, '_INTEL'. (Jarkko) V7: - Use a static for root cgroup (Kai) - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai) - Correct check for charge API return (Kai) - Start initialization in SGX device driver init (Kai) - Remove unneeded BUG_ON (Kai) - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai) V6: - Split the original large patch "Limit process EPC usage with misc cgroup controller" and restructure it (Kai) --- arch/x86/Kconfig | 13 + arch/x86/kernel/cpu/sgx/Makefile | 1 + arch/x86/kernel/cpu/sgx/epc_cgroup.c | 74 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 73 +++ arch/x86/kernel/cpu/sgx/main.c | 52 ++- arch/x86/kernel/cpu/sgx/sgx.h| 5 ++ include/linux/misc_cgroup.h | 2 + 7 files changed, 218 insertions(+), 2 deletions(-) create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5edec175b9bf..10c3d1d099b2 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1947,6 +1947,19 @@ config X86_SGX If unsure, say N. +config CGROUP_SGX_EPC + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX" + depends on X86_SGX && CGROUP_MISC + help + Provides control over the EPC footprint of tasks in a cgroup via + the Miscellaneous cgroup controller.
+ + EPC is a subset of regular memory that is usable only by SGX + enclaves and is very limited in quantity, e.g. less than 1% + of total DRAM. + + Say N if unsure. + config X86_USER_SHADOW_STACK bool "X86 userspace shadow stack" depends on AS_WRUSS diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile index 9c1656779b2a..12901a488da7 100644 --- a/arch/x86/kernel/cpu/sgx/Makefile +++ b/arch/x86/kernel/cpu/sgx/Makefile @@ -4,3 +4,4 @@ obj-y += \ ioctl.o \ main.o obj-$(CONFIG_X86_SGX_KVM) += virt.o +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c new file mode 100644 index ..f4a37ace67d7 --- /dev/null +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -0,0 +1,74 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright(c) 2022 Intel Corporation. + +#include +#include +#include "epc_cgroup.h" + +/* The root EPC cgroup */ +static struct sgx_epc_cgroup epc_cg_root; + +/** + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page + * + * @epc_cg:The EPC cgroup to be charged for the page. + * Return: + * * %0 - If successfully charged. + * * -errno - for failures. + */ +int sgx_epc_cgroup_try_charge(struct
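The hunk above is cut off mid-function. Judging from the context lines visible in patch 10's diff, the initial charge/uncharge wrappers are thin shims around the misc controller APIs, roughly as follows (the try_charge body matches the line patch 10 later replaces; the uncharge body is inferred, not quoted from this patch):

    int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
    {
            /* charge one EPC page (PAGE_SIZE bytes) to the misc controller */
            return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
    }

    void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
    {
            misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
    }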
[PATCH v9 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup
From: Sean Christopherson Each EPC cgroup will have an LRU structure to track reclaimable EPC pages. When a cgroup's usage reaches its limit, the cgroup needs to reclaim pages from its LRU or the LRUs of its descendants to make room for any new allocations. To prepare for reclamation per cgroup, expose the top level reclamation function, sgx_reclaim_pages(), in the header file for reuse. Add a parameter to the function to pass in an LRU so cgroups can pass in different tracking LRUs later. Add another parameter for passing in the number of pages to scan and make the function return the number of pages reclaimed, as a cgroup reclaimer may need to track reclamation progress from its descendants and change the number of pages to scan in subsequent calls. Create a wrapper for the global reclaimer, sgx_reclaim_pages_global(), to just call this function with the global LRU passed in. When the per-cgroup LRU is added later, the wrapper will perform global reclamation from the root cgroup. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen --- V8: - Use width of 80 characters in text paragraphs. (Jarkko) V7: - Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim function". Do not split the top level function (Kai) - Dropped patches 7 and 8 of V6. --- arch/x86/kernel/cpu/sgx/main.c | 53 +++--- arch/x86/kernel/cpu/sgx/sgx.h | 1 + 2 files changed, 37 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index a131aa985c95..4f5824c4751d 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -286,11 +286,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, mutex_unlock(&encl->lock); } -/* - * Take a fixed number of pages from the head of the active page pool and - * reclaim them to the enclave's private shmem files. Skip the pages, which have - * been accessed since the last scan. Move those pages to the tail of active - * page pool so that the pages get scanned in LRU like fashion. +/** + * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU + * + * Take a fixed number of pages from the head of a given LRU and reclaim them to + * the enclave's private shmem files. Skip the pages, which have been accessed + * since the last scan. Move those pages to the tail of the list so that the + * pages get scanned in LRU like fashion. * * Batch process a chunk of pages (at the moment 16) in order to degrade amount * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit @@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, * + EWB) but not sufficiently. Reclaiming one page at a time would also be * problematic as it would increase the lock contention too much, which would * halt forward progress. + * + * @lru: The LRU from which pages are reclaimed. + * @nr_to_scan: Pointer to the target number of pages to scan, must be less than + * SGX_NR_TO_SCAN. + * Return: Number of pages reclaimed.
*/ -static void sgx_reclaim_pages(void) +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan) { struct sgx_epc_page *chunk[SGX_NR_TO_SCAN]; struct sgx_backing backing[SGX_NR_TO_SCAN]; @@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void) int ret; int i; - spin_lock(&sgx_global_lru.lock); - for (i = 0; i < SGX_NR_TO_SCAN; i++) { - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, - struct sgx_epc_page, list); + spin_lock(&lru->lock); + + for (; *nr_to_scan > 0; --(*nr_to_scan)) { + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list); if (!epc_page) break; @@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void) */ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_global_lru.lock); + + spin_unlock(&lru->lock); for (i = 0; i < cnt; i++) { epc_page = chunk[i]; @@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void) continue; skip: - spin_lock(&sgx_global_lru.lock); - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable); - spin_unlock(&sgx_global_lru.lock); + spin_lock(&lru->lock); + list_add_tail(&epc_page->list, &lru->reclaimable); + spin_unlock(&lru->lock); kref_put(&encl_page->encl->refcount, sgx_encl_release); @@ -366,6 +374,7 @@ static void sgx_reclaim_pages(void) sgx_reclaimer_block(epc_page); } +
[PATCH v9 05/15] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
From: Sean Christopherson Introduce a data structure to wrap the existing reclaimable list and its spinlock. Each cgroup later will have one instance of this structure to track EPC pages allocated for processes associated with the same cgroup. Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages from the reclaimable list in this structure when its usage reaches near its limit. Use this structure to encapsulate the LRU list and its lock used by the global reclaimer. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Jarkko Sakkinen --- V6: - removed introduction to unreclaimables in commit message. V4: - Removed unneeded comments for the spinlock and the non-reclaimables. (Kai, Jarkko) - Revised the commit to add introduction comments for unreclaimables and multiple LRU lists.(Kai) - Reordered the patches: delay all changes for unreclaimables to later, and this one becomes the first change in the SGX subsystem. V3: - Removed the helper functions and revised commit messages. --- arch/x86/kernel/cpu/sgx/main.c | 39 +- arch/x86/kernel/cpu/sgx/sgx.h | 15 + 2 files changed, 35 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index c32f18b70c73..912959c7ecc9 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space); /* * These variables are part of the state of the reclaimer, and must be accessed - * with sgx_reclaimer_lock acquired. + * with sgx_global_lru.lock acquired. */ -static LIST_HEAD(sgx_active_page_list); -static DEFINE_SPINLOCK(sgx_reclaimer_lock); +static struct sgx_epc_lru_list sgx_global_lru; static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); @@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void) int ret; int i; - spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); for (i = 0; i < SGX_NR_TO_SCAN; i++) { - if (list_empty(&sgx_active_page_list)) + epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, + struct sgx_epc_page, list); + if (!epc_page) break; - epc_page = list_first_entry(&sgx_active_page_list, - struct sgx_epc_page, list); list_del_init(&epc_page->list); encl_page = epc_page->owner; @@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void) */ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_reclaimer_lock); + spin_unlock(&sgx_global_lru.lock); for (i = 0; i < cnt; i++) { epc_page = chunk[i]; @@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void) continue; skip: - spin_lock(&sgx_reclaimer_lock); - list_add_tail(&epc_page->list, &sgx_active_page_list); - spin_unlock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); + list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable); + spin_unlock(&sgx_global_lru.lock); kref_put(&encl_page->encl->refcount, sgx_encl_release); @@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void) static bool sgx_should_reclaim(unsigned long watermark) { return atomic_long_read(&sgx_nr_free_pages) < watermark && - !list_empty(&sgx_active_page_list); + !list_empty(&sgx_global_lru.reclaimable); } /* @@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void) ksgxd_tsk = tsk; + sgx_lru_init(&sgx_global_lru); + return true; } @@ -507,10 +508,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void) */ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) { - 
spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED; - list_add_tail(&page->list, &sgx_active_page_list); - spin_unlock(&sgx_reclaimer_lock); + list_add_tail(&page->list, &sgx_global_lru.reclaimable); + spin_unlock(&sgx_global_lru.lock); } /** @@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) */ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) { /* The page is being reclaimed. */ if (list_empty(&page->list)) { - spin_unlock(&sgx_reclaimer_lock); + spin_unlock
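The sgx.h hunk of this patch is cut off above. From the call sites shown (lru->lock, lru->reclaimable, sgx_lru_init()), the introduced structure is essentially a list plus the lock guarding it; a sketch of what the header addition must look like:

    struct sgx_epc_lru_list {
            spinlock_t lock;
            struct list_head reclaimable;
    };

    static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
    {
            spin_lock_init(&lru->lock);
            INIT_LIST_HEAD(&lru->reclaimable);
    }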
[PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
From: Kristen Carlson Accardi Implement the reclamation flow for cgroups, encapsulated in the top-level function sgx_epc_cgroup_reclaim_pages(). It does a pre-order walk on its subtree, and makes calls to sgx_reclaim_pages() at each node, passing in the LRU of that node. It keeps track of total reclaimed pages, and pages left to attempt. It stops the walk once the desired number of pages has been attempted. In some contexts, e.g. page fault handling, only asynchronous reclamation is allowed. Create a work queue, a corresponding work item, and function definitions to support the asynchronous reclamation. Both synchronous and asynchronous flows invoke the same top level reclaim function, and will be triggered later by sgx_epc_cgroup_try_charge() when usage of the cgroup is at or near its limit. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V9: - Add comments for static variables. (Jarkko) V8: - Remove alignment for substructure variables. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 181 ++- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 3 + 2 files changed, 183 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index f4a37ace67d7..16b6d9f909eb 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -8,9 +8,180 @@ /* The root EPC cgroup */ static struct sgx_epc_cgroup epc_cg_root; +/* + * The work queue that reclaims EPC pages in the background for cgroups. + * + * A cgroup schedules a work item into this queue to reclaim pages within the + * same cgroup when its usage limit is reached and synchronous reclamation is not + * an option, e.g., in a fault handler. + */ +static struct workqueue_struct *sgx_epc_cg_wq; + +static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg) +{ + return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE; +} + +static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg) +{ + return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE; +} + +/* + * Get the lower bound of limits of a cgroup and its ancestors. Used in + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is + * over its limit or its ancestors' hence reclamation is needed. + */ +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg) +{ + struct misc_cg *i = epc_cg->cg; + u64 m = U64_MAX; + + while (i) { + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max)); + i = misc_cg_parent(i); + } + + return m / PAGE_SIZE; +} + /** - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs + * @root: Root of the tree to check * + * Return: %true if all cgroups under the specified root have empty LRU lists. + * Used to avoid livelocks due to a cgroup having a non-zero charge count but + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or + * because all pages in the cgroup are unreclaimable.
+ */ +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root) +{ + struct cgroup_subsys_state *css_root; + struct cgroup_subsys_state *pos; + struct sgx_epc_cgroup *epc_cg; + bool ret = true; + + /* +* Caller ensure css_root ref acquired +*/ + css_root = &root->css; + + rcu_read_lock(); + css_for_each_descendant_pre(pos, css_root) { + if (!css_tryget(pos)) + break; + + rcu_read_unlock(); + + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos)); + + spin_lock(&epc_cg->lru.lock); + ret = list_empty(&epc_cg->lru.reclaimable); + spin_unlock(&epc_cg->lru.lock); + + rcu_read_lock(); + css_put(pos); + if (!ret) + break; + } + + rcu_read_unlock(); + + return ret; +} + +/** + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages + * @root: Root of the tree to start walking from. + * Return: Number of pages reclaimed. + */ +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root) +{ + /* +* Attempting to reclaim only a few pages will often fail and is +* inefficient, while reclaiming a huge number of pages can result in +* soft lockups due to holding various locks for an extended duration. +*/ + unsigned int nr_to_scan = SGX_NR_TO_SCAN; + struct cgroup_subsys_state *css_root; + struct cgroup_subsys_state *pos; + struct sgx_epc_cgroup *epc_cg; + unsigned
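The body of sgx_epc_cgroup_reclaim_pages() is cut off above. A sketch of how the rest of the pre-order walk plausibly proceeds, patterned on sgx_epc_cgroup_lru_empty() above and the sgx_reclaim_pages() signature from patch 7 (a reconstruction, not the verbatim patch):

    unsigned int cnt = 0;

    css_root = &root->css;

    rcu_read_lock();
    css_for_each_descendant_pre(pos, css_root) {
            if (!css_tryget(pos))
                    break;

            rcu_read_unlock();

            epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
            cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);

            rcu_read_lock();
            css_put(pos);

            /* stop once the target number of pages has been scanned */
            if (!nr_to_scan)
                    break;
    }
    rcu_read_unlock();

    return cnt;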
[PATCH v9 09/15] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
Enclave Page Cache(EPC) memory can be swapped out to regular system memory, and the consumed memory should be charged to a proper mem_cgroup. Currently the selection of mem_cgroup to charge is done in sgx_encl_get_mem_cgroup(). But it only considers two contexts in which the swapping can be done: normal tasks and the ksgxd kthread. With the new EPC cgroup implementation, the swapping can also happen in EPC cgroup work-queue threads. In those cases, it improperly selects the root mem_cgroup to charge for the RAM usage. Change sgx_encl_get_mem_cgroup() to handle non-task contexts only and return the mem_cgroup of an mm_struct associated with the enclave. The return is used to charge for EPC backing pages in all kthread cases. Pass a flag into the top level reclamation function, sgx_reclaim_pages(), to explicitly indicate whether it is called from a background kthread. Internally, if the flag is true, switch the active mem_cgroup to the one returned from sgx_encl_get_mem_cgroup(), prior to any backing page allocation, in order to ensure that shmem page allocations are charged to the enclave's cgroup. Removed current_is_ksgxd() as it is no longer needed. Signed-off-by: Haitao Huang Reported-by: Mikko Ylinen --- V9: - Reduce number of if statements. (Tim) V8: - Limit text paragraphs to 80 characters wide. (Jarkko) --- arch/x86/kernel/cpu/sgx/encl.c | 38 +--- arch/x86/kernel/cpu/sgx/encl.h | 3 +-- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 7 ++--- arch/x86/kernel/cpu/sgx/main.c | 27 +--- arch/x86/kernel/cpu/sgx/sgx.h| 3 ++- 5 files changed, 36 insertions(+), 42 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 279148e72459..4e5948362060 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -993,9 +993,7 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde } /* - * When called from ksgxd, returns the mem_cgroup of a struct mm stored - * in the enclave's mm_list. When not called from ksgxd, just returns - * the mem_cgroup of the current task. + * Returns the mem_cgroup of a struct mm stored in the enclave's mm_list. */ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) { @@ -1003,14 +1001,6 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) struct sgx_encl_mm *encl_mm; int idx; - /* -* If called from normal task context, return the mem_cgroup -* of the current task's mm. The remainder of the handling is for -* ksgxd. -*/ - if (!current_is_ksgxd()) - return get_mem_cgroup_from_mm(current->mm); - /* * Search the enclave's mm_list to find an mm associated with * this enclave to charge the allocation to. @@ -1047,27 +1037,33 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * @encl: an enclave pointer * @page_index:enclave page index * @backing: data for accessing backing storage for the page + * @indirect: in ksgxd or EPC cgroup work queue context + * + * Create a backing page for loading data back into an EPC page with ELDU. This + * function takes a reference on a new backing page which must be dropped with a + * corresponding call to sgx_encl_put_backing(). * - * When called from ksgxd, sets the active memcg from one of the - * mms in the enclave's mm_list prior to any backing page allocation, - * in order to ensure that shmem page allocations are charged to the - * enclave. Create a backing page for loading data back into an EPC page with - * ELDU. 
This function takes a reference on a new backing page which - * must be dropped with a corresponding call to sgx_encl_put_backing(). + * When @indirect is true, sets the active memcg from one of the mms in the + * enclave's mm_list prior to any backing page allocation, in order to ensure + * that shmem page allocations are charged to the enclave. * * Return: * 0 on success, * -errno otherwise. */ int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index, - struct sgx_backing *backing) + struct sgx_backing *backing, bool indirect) { - struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl); - struct mem_cgroup *memcg = set_active_memcg(encl_memcg); + struct mem_cgroup *encl_memcg; + struct mem_cgroup *memcg; int ret; - ret = __sgx_encl_get_backing(encl, page_index, backing); + if (!indirect) + return __sgx_encl_get_backing(encl, page_index, backing); + encl_memcg = sgx_encl_get_mem_cgroup(encl); + memcg = set_active_memcg(encl_memcg); + ret = __sgx_encl_get_backing(encl, page_index, backing); set_active_memcg(memcg); mem_cgroup_put(encl_memcg); diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch
[PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()
From: Kristen Carlson Accardi When the EPC usage of a cgroup is near its limit, the cgroup needs to reclaim pages used in the same cgroup to make room for new allocations. This is analogous to the behavior that the global reclaimer is triggered when the global usage is close to total available EPC. Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate whether synchronous reclaim is allowed or not. And trigger the synchronous/asynchronous reclamation flow accordingly. Note at this point, all reclaimable EPC pages are still tracked in the global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation is activated yet. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 -- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++-- arch/x86/kernel/cpu/sgx/main.c | 2 +- 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index d399fda2b55e..abf74fdb12b4 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -184,13 +184,35 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work) /** * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page * @epc_cg:The EPC cgroup to be charged for the page. + * @reclaim: Whether or not synchronous reclaim is allowed * Return: * * %0 - If successfully charged. * * -errno - for failures. */ -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) { - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE); + for (;;) { + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, + PAGE_SIZE)) + break; + + if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) + return -ENOMEM; + + if (signal_pending(current)) + return -ERESTARTSYS; + + if (!reclaim) { + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work); + return -EBUSY; + } + + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false)) + /* All pages were too young to reclaim, try again a little later */ + schedule(); + } + + return 0; } /** diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index e3c6a08f0ee8..d061cd807b45 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -23,7 +23,7 @@ static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void) static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { } -static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) +static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim) { return 0; } @@ -66,7 +66,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) put_misc_cg(epc_cg->cg); } -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg); +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim); void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg); bool sgx_epc_cgroup_lru_empty(struct misc_cg *root); void sgx_epc_cgroup_init(void); diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 51904f191b97..2279ae967707 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -588,7 +588,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) int 
ret; epc_cg = sgx_get_current_epc_cg(); - ret = sgx_epc_cgroup_try_charge(epc_cg); + ret = sgx_epc_cgroup_try_charge(epc_cg, reclaim); if (ret) { sgx_put_epc_cg(epc_cg); return ERR_PTR(ret); -- 2.25.1
[PATCH v9 11/15] x86/sgx: Abstract check for global reclaimable pages
From: Kristen Carlson Accardi To determine if any page is available for reclamation at the global level, only checking for emptiness of the global LRU is not adequate when pages are tracked in multiple LRUs, one per cgroup. For this purpose, create a new helper, sgx_can_reclaim(), which currently only checks the global LRU and will later check the emptiness of the LRUs of all cgroups when per-cgroup tracking is turned on. Replace all the checks of the global LRU, list_empty(&sgx_global_lru.reclaimable), with calls to sgx_can_reclaim(). Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- v7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 2279ae967707..6b0c26cac621 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -37,6 +37,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag return &sgx_global_lru; } +static inline bool sgx_can_reclaim(void) +{ + return !list_empty(&sgx_global_lru.reclaimable); +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ @@ -398,7 +403,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to static bool sgx_should_reclaim(unsigned long watermark) { return atomic_long_read(&sgx_nr_free_pages) < watermark && - !list_empty(&sgx_global_lru.reclaimable); + sgx_can_reclaim(); } static void sgx_reclaim_pages_global(bool indirect) @@ -601,7 +606,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) break; } - if (list_empty(&sgx_global_lru.reclaimable)) { + if (!sgx_can_reclaim()) { page = ERR_PTR(-ENOMEM); break; } -- 2.25.1
[PATCH v9 12/15] x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer
From: Kristen Carlson Accardi When cgroup is enabled, all reclaimable pages will be tracked in cgroup LRUs. The global reclaimer needs to start reclamation from the root cgroup. Expose the top level cgroup reclamation function so the global reclaimer can reuse it. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V8: - Remove unneeded breaks in function declarations. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index abf74fdb12b4..6e31f8727b8a 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root) * @indirect: In ksgxd or EPC cgroup work queue context. * Return: Number of pages reclaimed. */ -static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) { /* * Attempting to reclaim only a few pages will often fail and is diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index d061cd807b45..5b3e8e1b8630 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { } static inline void sgx_epc_cgroup_init(void) { } + +static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect) +{ + return 0; +} #else struct sgx_epc_cgroup { struct misc_cg *cg; @@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim); void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg); bool sgx_epc_cgroup_lru_empty(struct misc_cg *root); +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect); + void sgx_epc_cgroup_init(void); #endif -- 2.25.1
[PATCH v9 14/15] Docs/x86/sgx: Add description for cgroup support
From: Sean Christopherson Add initial documentation of how to regulate the distribution of SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup controller. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson --- V8: - Limit text width to 80 characters to be consistent. V6: - Remove mentioning of VMM specific behavior on handling SIGBUS - Remove statement of forced reclamation, add statement to specify ENOMEM returned when no reclamation possible. - Added statements on the non-preemptive nature for the max limit - Dropped Reviewed-by tag because of changes V4: - Fix indentation (Randy) - Change misc.events file to be read-only - Fix a typo for 'subsystem' - Add behavior when VMM overcommit EPC with a cgroup (Mikko) --- Documentation/arch/x86/sgx.rst | 83 ++ 1 file changed, 83 insertions(+) diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst index d90796adc2ec..c537e6a9aa65 100644 --- a/Documentation/arch/x86/sgx.rst +++ b/Documentation/arch/x86/sgx.rst @@ -300,3 +300,86 @@ to expected failures and handle them as follows: first call. It indicates a bug in the kernel or the userspace client if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has a return code other than 0. + + +Cgroup Support +== + +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates +distribution of SGX EPC memory, which is a subset of system RAM that is used to +provide SGX-enabled applications with protected memory, and is otherwise +inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be +read/written outside of an SGX enclave. + +Although current systems implement EPC by stealing memory from RAM, for all +intents and purposes the EPC is independent from normal system memory, e.g. must +be reserved at boot from RAM and cannot be converted between EPC and normal +memory while the system is running. The EPC is managed by the SGX subsystem and +is not accounted by the memory controller. Note that this is true only for EPC +memory itself, i.e. normal memory allocations related to SGX and EPC memory, +e.g. the backing memory for evicted EPC pages, are accounted, limited and +protected by the memory controller. + +Much like normal system memory, EPC memory can be overcommitted via virtual +memory techniques and pages can be swapped out of the EPC to their backing store +(normal system memory allocated via shmem). The SGX EPC subsystem is analogous +to the memory subsystem, and it implements limit and protection models for EPC +memory. + +SGX EPC Interface Files +--- + +For a generic description of the Miscellaneous controller interface files, +please see Documentation/admin-guide/cgroup-v2.rst + +All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If +a value which is not PAGE_SIZE aligned is written, the actual value used by the +controller will be rounded down to the closest PAGE_SIZE multiple. + + misc.capacity +A read-only flat-keyed file shown only in the root cgroup. The sgx_epc +resource will show the total amount of EPC memory available on the +platform. + + misc.current +A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc +resource will show the current active EPC memory usage of the cgroup and +its descendants. EPC pages that are swapped out to backing RAM are not +included in the current count. 
+ + misc.max +A read-write single value file which exists on non-root cgroups. The +sgx_epc resource will show the EPC usage hard limit. The default is +"max". + +If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for +page fault handling, will be blocked until EPC can be reclaimed from the +cgroup. If there are no pages left that are reclaimable within the same +group, the kernel returns ENOMEM. + +The EPC pages allocated for a guest VM by the virtual EPC driver are not +reclaimable by the host kernel. In case the guest cgroup's limit is +reached and no reclaimable pages left in the same cgroup, the virtual +EPC driver returns SIGBUS to the user space process to indicate failure +on new EPC allocation requests. + +The misc.max limit is non-preemptive. If a user writes a limit lower +than the current usage to this file, the cgroup will not preemptively +deallocate pages currently in use, and will only start blocking the next +allocation and reclaiming EPC at that time. + + misc.events +A read-only flat-keyed file which exists on non-root cgroups. +A value change in this file generates a file modified event. + + max +The number of times
[PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
The scripts rely on the cgroup-tools package from libcgroup [1]. To run selftests for epc cgroup: sudo ./run_epc_cg_selftests.sh To watch misc cgroup 'current' changes during testing, run this in a separate terminal: ./watch_misc_for_tests.sh current With different cgroups, the script starts one or multiple concurrent SGX selftests, each running one unclobbered_vdso_oversubscribed test. Each such test tries to load an enclave of EPC size equal to the EPC capacity available on the platform. The script checks results against the expectation set for each cgroup and reports success or failure. The script creates 3 different cgroups at the beginning with the following expectations: 1) SMALL - intentionally small enough to fail the test loading an enclave of size equal to the capacity. 2) LARGE - large enough to run up to 4 concurrent tests but fail some if more than 4 concurrent tests are run. The script starts 4 expecting at least one test to pass, and then starts 5 expecting at least one test to fail. 3) LARGER - limit is the same as the capacity, large enough to run lots of concurrent tests. The script starts 8 of them and expects all to pass. Then it reruns the same test with one process randomly killed and usage checked to be zero after all processes exit. The script also includes a test with a low mem_cg limit and LARGE sgx_epc limit to verify that the RAM used for per-cgroup reclamation is charged to a proper mem_cg. [1] https://github.com/libcgroup/libcgroup/blob/main/README Signed-off-by: Haitao Huang --- V7: - Added memcontrol test. V5: - Added script with automatic results checking, removed the interactive script. - The script can run independently from the series below. --- .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++ .../selftests/sgx/watch_misc_for_tests.sh | 13 + 2 files changed, 259 insertions(+) create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh new file mode 100755 index ..e027bf39f005 --- /dev/null +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh @@ -0,0 +1,246 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright(c) 2023 Intel Corporation. + +TEST_ROOT_CG=selftest +cgcreate -g misc:$TEST_ROOT_CG +if [ $? -ne 0 ]; then +echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted." +exit 1 +fi +TEST_CG_SUB1=$TEST_ROOT_CG/test1 +TEST_CG_SUB2=$TEST_ROOT_CG/test2 +# We will only set limit in test1 and run tests in test3 +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3 +TEST_CG_SUB4=$TEST_ROOT_CG/test4 + +cgcreate -g misc:$TEST_CG_SUB1 +cgcreate -g misc:$TEST_CG_SUB2 +cgcreate -g misc:$TEST_CG_SUB3 +cgcreate -g misc:$TEST_CG_SUB4 + +# Default to V2 +CG_MISC_ROOT=/sys/fs/cgroup +CG_MEM_ROOT=/sys/fs/cgroup +CG_V1=0 +if [ ! -d "/sys/fs/cgroup/misc" ]; then +echo "# cgroup V2 is in use." +else +echo "# cgroup V1 is in use." +CG_MISC_ROOT=/sys/fs/cgroup/misc +CG_MEM_ROOT=/sys/fs/cgroup/memory +CG_V1=1 +fi + +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}') +# This is below number of VA pages needed for enclave of capacity size. So +# should fail oversubscribed cases +SMALL=$(( CAPACITY / 512 )) + +# At least load one enclave of capacity size successfully, maybe up to 4. +# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+LARGE=$(( SMALL * 4 )) + +# Load lots of enclaves +LARGER=$CAPACITY +echo "# Setting up limits." +echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max +echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max +echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max + +timestamp=$(date +%Y%m%d_%H%M%S) + +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed" + +wait_check_process_status() { +local pid=$1 +local check_for_success=$2 # If 1, check for success; +# If 0, check for failure +wait "$pid" +local status=$? + +if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then +echo "# Process $pid succeeded." +return 0 +elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then +echo "# Process $pid returned failure." +return 0 +fi +return 1 +} + +wait_and_detect_for_any() { +local pids=("$@") +local check_for_success=$1 # If 1, check for success; +# If 0, check for failure +local detected=1 # 0 for success detection + +for pid in "${pids[@]:1}"; do +if wait_check_process_status "$pid" "$check_for_success"; then +detected=0 +# Wait for other processes to exit +fi +done + +return $detected +} + +echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..." +# Always use leaf node of misc c
[PATCH v9 13/15] x86/sgx: Turn on per-cgroup EPC reclamation
From: Kristen Carlson Accardi Previous patches have implemented all infrastructure needed for per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC pages are still tracked in the global LRU as sgx_lru_list() returns hard coded reference to the global LRU. Change sgx_lru_list() to return the LRU of the cgroup in which the given EPC page is allocated. This makes all EPC pages tracked in per-cgroup LRUs and the global reclaimer (ksgxd) will not be able to reclaim any pages from the global LRU. However, in cases of over-committing, i.e., sum of cgroup limits greater than the total capacity, cgroups may never reclaim but the total usage can still be near the capacity. Therefore global reclamation is still needed in those cases and it should reclaim from the root cgroup. Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup when cgroup is enabled, otherwise from the global LRU. Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all cgroups when EPC cgroup is enabled, otherwise only check the global LRU. With these changes, the global reclamation and per-cgroup reclamation both work properly with all pages tracked in per-cgroup LRUs. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang --- V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 6b0c26cac621..d4265a390ba9 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru; static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page) { +#ifdef CONFIG_CGROUP_SGX_EPC + if (epc_page->epc_cg) + return &epc_page->epc_cg->lru; + + /* This should not happen if kernel is configured correctly */ + WARN_ON_ONCE(1); +#endif return &sgx_global_lru; } static inline bool sgx_can_reclaim(void) { +#ifdef CONFIG_CGROUP_SGX_EPC + return !sgx_epc_cgroup_lru_empty(misc_cg_root()); +#else return !list_empty(&sgx_global_lru.reclaimable); +#endif } static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); @@ -410,7 +421,10 @@ static void sgx_reclaim_pages_global(bool indirect) { unsigned int nr_to_scan = SGX_NR_TO_SCAN; - sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect); + if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC)) + sgx_epc_cgroup_reclaim_pages(misc_cg_root(), indirect); + else + sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect); } /* -- 2.25.1
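sgx_epc_cgroup_lru_empty() is introduced earlier in this series and is not shown in the hunk above. A simplified sketch of such a pre-order walk over the misc-cgroup subtree follows; the helper names and the lru lock field are assumed from context, this is not the series' exact code.

/* Return true only if no cgroup under @root has reclaimable EPC pages. */
static bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
{
        struct cgroup_subsys_state *pos;
        struct sgx_epc_cgroup *epc_cg;
        bool ret = true;

        rcu_read_lock();
        css_for_each_descendant_pre(pos, &root->css) {
                if (!css_tryget(pos))
                        break;
                rcu_read_unlock();

                /* css_misc()/sgx_epc_cgroup_from_misc_cg(): assumed container_of() helpers */
                epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));

                spin_lock(&epc_cg->lru.lock);
                ret = list_empty(&epc_cg->lru.reclaimable);
                spin_unlock(&epc_cg->lru.lock);

                rcu_read_lock();
                css_put(pos);
                if (!ret)
                        break;
        }
        rcu_read_unlock();

        return ret;
}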
Re: [PATCH] device-dax: make dax_bus_type const
On 2/4/24 9:07 AM, Ricardo B. Marliere wrote: > Now that the driver core can properly handle constant struct bus_type, > move the dax_bus_type variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere Reviewed-by: Dave Jiang > --- > drivers/dax/bus.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 1659b787b65f..fe0a415c854d 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -222,7 +222,7 @@ static void dax_bus_remove(struct device *dev) > dax_drv->remove(dev_dax); > } > > -static struct bus_type dax_bus_type = { > +static const struct bus_type dax_bus_type = { > .name = "dax", > .uevent = dax_bus_uevent, > .match = dax_bus_match, > > --- > base-commit: 73bf93edeeea866b0b6efbc8d2595bdaaba7f1a5 > change-id: 20240204-bus_cleanup-dax-52c34f72615f > > Best regards,
Re: [PATCH v7 2/5] remoteproc: k3: Move out data structures common with M4 driver
On 2/2/24 11:55 AM, Hari Nagalla wrote: From: Martyn Welch We will be adding the M4F driver which shares a lot of commonality with the DSP driver. Common data structures are introduced here. Signed-off-by: Martyn Welch Signed-off-by: Hari Nagalla --- Changes since v5: - Created a separate patch for data structures to ease review Changes since v6: - Reworded 'split' to 'move' as the common data structures between DSP and M4 remote rproc drivers are moved into a common driver. Is this a joke? In v6 Krzysztof commented the following: Where is the split? I see only addition here. Where is the usage of this header? This is basically dead code. Don't add dead code, but instead actually move the structures here! Move is cut and paste, not just paste. Instead of changing the patch in any way to address this comment you just replaced the word 'split' with 'move' in the commit subject.. Maybe no one will notice this is all still dead code since you didn't say the word 'split' anymore.. Andrew link to v6: https://lore.kernel.org/all/20230913111644.29889-3-hnaga...@ti.com/ drivers/remoteproc/ti_k3_common.h | 107 ++ 1 file changed, 107 insertions(+) create mode 100644 drivers/remoteproc/ti_k3_common.h diff --git a/drivers/remoteproc/ti_k3_common.h b/drivers/remoteproc/ti_k3_common.h new file mode 100644 index ..f1bab83dd0fc --- /dev/null +++ b/drivers/remoteproc/ti_k3_common.h @@ -0,0 +1,107 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * TI K3 Remote Processor(s) driver common code + * + * Refactored from ti_k3_dsp_remoteproc.c. + * + * ti_k3_dsp_remoteproc.c: + * Copyright (C) 2018-2022 Texas Instruments Incorporated - https://www.ti.com/ + * Suman Anna + */ + +#ifndef REMOTEPROC_TI_K3_COMMON_H +#define REMOTEPROC_TI_K3_COMMON_H + +#define KEYSTONE_RPROC_LOCAL_ADDRESS_MASK (SZ_16M - 1) + +/** + * struct k3_rproc_mem - internal memory structure + * @cpu_addr: MPU virtual address of the memory region + * @bus_addr: Bus address used to access the memory region + * @dev_addr: Device address of the memory region from remote processor view + * @size: Size of the memory region + */ +struct k3_rproc_mem { + void __iomem *cpu_addr; + phys_addr_t bus_addr; + u32 dev_addr; + size_t size; +}; + +/** + * struct k3_rproc_mem_data - memory definitions for a remote processor + * @name: name for this memory entry + * @dev_addr: device address for the memory entry + */ +struct k3_rproc_mem_data { + const char *name; + const u32 dev_addr; +}; + +/** + * struct k3_rproc_dev_data - device data structure for a remote processor + * @mems: pointer to memory definitions for a remote processor + * @num_mems: number of memory regions in @mems + * @boot_align_addr: boot vector address alignment granularity + * @uses_lreset: flag to denote the need for local reset management + */ +struct k3_rproc_dev_data { + const struct k3_rproc_mem_data *mems; + u32 num_mems; + u32 boot_align_addr; + bool uses_lreset; +}; + +/** + * struct k3_rproc - k3 remote processor driver structure + * @dev: cached device pointer + * @rproc: remoteproc device handle + * @mem: internal memory regions data + * @num_mems: number of internal memory regions + * @sram: on-chip SRAM memory regions data + * @num_sram: number of on-chip SRAM memory regions + * @rmem: reserved memory regions data + * @num_rmems: number of reserved memory regions + * @reset: reset control handle + * @data: pointer to device data + * @tsp: TI-SCI processor control handle + * @ti_sci: TI-SCI handle + * @ti_sci_id: TI-SCI device identifier + * @mbox: mailbox channel handle + * @client: mailbox client to request the mailbox channel + */ +struct k3_rproc { + struct device *dev;
+ struct rproc *rproc; + struct k3_rproc_mem *mem; + int num_mems; + struct k3_rproc_mem *sram; + int num_sram; + struct k3_rproc_mem *rmem; + int num_rmems; + struct reset_control *reset; + const struct k3_rproc_dev_data *data; + struct ti_sci_proc *tsp; + const struct ti_sci_handle *ti_sci; + u32 ti_sci_id; + struct mbox_chan *mbox; + struct mbox_client client; +}; + +void k3_rproc_kick(struct rproc *rproc, int vqid); +int k3_rproc_reset(struct k3_rproc *kproc); +int k3_rproc_release(struct k3_rproc *kproc); +int k3_rproc_request_mbox(struct rproc *rproc); +int k3_rproc_prepare(struct rproc *rproc); +int k3_rproc_unprepare(struct rproc *rproc); +struct resource_table *k3_get_loaded_rsc_table(struct rproc *rproc, + size_t *rsc_table_sz); +void *k3_rproc_da_to_va(struct rproc *rproc, u64 da, size_t len, + bool *is_iomem); +int k3_rproc_of_get_memories(struct platform_device *pdev, +struct k3_rproc *kproc); +int k3_rproc_of_get_sram_memories(struct platform_device *pdev, +struct k3_rproc *kproc); +int k3_reserved_mem_init(struct k3_rproc *kproc); +void k3_reserved_mem_ex
Re: [PATCH v7 3/5] remoteproc: k3: Move out functions common with M4 driver
On 2/2/24 11:55 AM, Hari Nagalla wrote: From: Martyn Welch In the next commit we will be adding the M4F driver which shares a lot of commonality with the DSP driver. Move this shared functionality out so that it can be used by both drivers. Signed-off-by: Martyn Welch Signed-off-by: Hari Nagalla --- Changes since v2: - New patch (reordered refactored from v2) Changes since v3: - Removed "ipc_only" element from k3_rproc structure - Refactored to bring 3 more common functions Changes since v4: - None Changes since v5: - Rearranged the function order to match the functions in ti_k3_dsp_remoteproc.c to ease review. Changes since v6: - Generated patch with -M/-B/-C options You were asked to generate this patch "correctly" with these options, not just use them all and hope for the best.. Now it looks like you re-wrote all of ti_k3_dsp_remoteproc.c when you only factored out a couple of functions to a different file. Build up the new ti_k3_common.c one function per patch if it helps. And factor the functions out of ti_k3_r5 also as it seems many of these are common to that driver too. Andrew link to v6: https://lore.kernel.org/all/20230913111644.29889-4-hnaga...@ti.com/ drivers/remoteproc/Makefile |2 +- drivers/remoteproc/ti_k3_common.c | 583 ++ drivers/remoteproc/ti_k3_dsp_remoteproc.c | 1277 ++--- 3 files changed, 952 insertions(+), 910 deletions(-) create mode 100644 drivers/remoteproc/ti_k3_common.c rewrite drivers/remoteproc/ti_k3_dsp_remoteproc.c (67%) diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile index 91314a9b43ce..55c552e27a45 100644 --- a/drivers/remoteproc/Makefile +++ b/drivers/remoteproc/Makefile @@ -36,6 +36,6 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o -obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o +obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o ti_k3_common.o obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o diff --git a/drivers/remoteproc/ti_k3_common.c b/drivers/remoteproc/ti_k3_common.c new file mode 100644 index ..62c7c5dba78a --- /dev/null +++ b/drivers/remoteproc/ti_k3_common.c @@ -0,0 +1,583 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * TI K3 Remote Processor(s) driver common code + * + * Refactored from ti_k3_dsp_remoteproc.c. + * + * ti_k3_dsp_remoteproc.c: + * Copyright (C) 2018-2022 Texas Instruments Incorporated - https://www.ti.com/ + * Suman Anna + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "omap_remoteproc.h" +#include "remoteproc_internal.h" +#include "ti_sci_proc.h" +#include "ti_k3_common.h" + +/** + * k3_rproc_mbox_callback() - inbound mailbox message handler + * @client: mailbox client pointer used for requesting the mailbox channel + * @data: mailbox payload + * + * This handler is invoked by the K3 mailbox driver whenever a mailbox + * message is received. Usually, the mailbox payload simply contains + * the index of the virtqueue that is kicked by the remote processor, + * and we let remoteproc core handle it. + * + * In addition to virtqueue indices, we also have some out-of-band values + * that indicate different events. Those values are deliberately very + * large so they don't coincide with virtqueue indices.
+ */ +static void k3_rproc_mbox_callback(struct mbox_client *client, void *data) +{ + struct k3_rproc *kproc = container_of(client, struct k3_rproc, + client); + struct device *dev = kproc->rproc->dev.parent; + const char *name = kproc->rproc->name; + u32 msg = omap_mbox_message(data); + + dev_dbg(dev, "mbox msg: 0x%x\n", msg); + + switch (msg) { + case RP_MBOX_CRASH: + /* +* remoteproc detected an exception, but error recovery is not +* supported. So, just log this for now +*/ + dev_err(dev, "K3 rproc %s crashed\n", name); + break; + case RP_MBOX_ECHO_REPLY: + dev_info(dev, "received echo reply from %s\n", name); + break; + default: + /* silently handle all other valid messages */ + if (msg >= RP_MBOX_READY && msg < RP_MBOX_END_MSG) + return; + if (msg > kproc->rproc->max_notifyid) { + dev_dbg(dev, "dropping unknown message 0x%x", msg); + return; + } + /* msg contains the index of the triggered vring */ + if (rproc_vq_interrupt(kproc->rproc, msg) == IRQ_NONE) +
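The excerpt above only shows the inbound mailbox path before it is cut off. For context, the outbound k3_rproc_kick() declared in ti_k3_common.h would plausibly keep the existing DSP behaviour, where the payload is simply the index of the kicked virtqueue. A sketch, not the patch's actual code:

/* Kick the remote processor by sending the virtqueue index as payload. */
void k3_rproc_kick(struct rproc *rproc, int vqid)
{
        struct k3_rproc *kproc = rproc->priv;
        struct device *dev = rproc->dev.parent;
        mbox_msg_t msg = (mbox_msg_t)vqid;
        int ret;

        /* send the index of the triggered virtqueue in the mailbox payload */
        ret = mbox_send_message(kproc->mbox, (void *)msg);
        if (ret < 0)
                dev_err(dev, "failed to send mailbox message, status = %d\n", ret);
}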
[PATCH v17 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
From: Ankit Agrawal NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect. The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (occurs at the time of host system boot). The device memory is accessed instead using the chip-to-chip interconnect that is exposed as a contiguous physically addressable region on the host. Since the device memory is cache coherent with the CPU, it can be mmapped into the user VMA with a cacheable mapping and used like regular RAM. The device memory is not added to the host kernel, but mapped directly as this reduces memory wastage due to struct pages. There is also a requirement of a reserved 1G uncached region (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide a different non-cached property to the reserved 1G region, it needs to be carved out from the device memory and mapped as a separate region in the Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace. The variant driver exposes these two regions - the non-cached reserved (resmem) and the cached rest of the device memory (termed as usemem) - as separate VFIO 64b BAR regions. This is divergent from the baremetal approach, where the device memory is exposed as a device memory region. The decision for a different approach was taken in view of the fact that it would necessitate additional code in Qemu to discover and insert those regions in the VM IPA, along with the additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top of tree Qemu) out there desiring grace hopper support. Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved out region to the next available PCI BAR (i.e. comprising region 2 and 3). The cached device memory aperture is assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as the BAR2 region and the cacheable one as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers. The hardware ensures that the system does not crash when the memory is accessed with the memory enable turned off. It synthesizes ~0 reads and drops writes on such accesses. So there is no need to support disabling/enabling the BAR through the PCI_COMMAND config space register. The memory layout on the host looks like the following:

               devmem (memlength)
|--------------------------------------------------|
|------------cached--------------------------|--NC--|
|                                            |
usemem.phys/memphys                          resmem.phys

PCI BARs need to be aligned to a power of 2, but the actual memory on the device may not be. A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes.
Note that the GPU device driver [6] is capable of knowing the exact device memory size through separate means. The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module. Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks. Hence make the usable memory size memblock aligned. Currently there is no provision in KVM for an S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. As previously mentioned, resmem is mapped with pgprot_writecombine(), which sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM marks the region with MemAttr[2:0]=0b101 in S2. If the device memory properties are not present in the host ACPI table, the driver registers the vfio-pci-core function pointers. This goes along with a qemu series [6] to provide the necessary implementation of the Grace Hopper Superchip firmware specification so that the guest operating system can see the correct ACPI modeling for the coherent GPU device. Verified with the CUDA workload in the VM. [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [2] section D8.5.5 of https://developer.arm.com/documentation/ddi0487/latest/ [3] https://lore.kernel.org/all/2
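A sketch of the two alignment steps described above; the macro name and the 512M value are assumptions for illustration, not taken from the driver:

#include <linux/align.h>
#include <linux/log2.h>
#include <linux/sizes.h>

#define NV_MEMBLK_SIZE  SZ_512M /* assumed memblock alignment unit */

/*
 * The usable size is memblock-aligned for the VM driver, then the BAR
 * reported to userspace is rounded up to a power of two. The padding is
 * harmless because the hardware reads ~0 and drops writes there.
 */
static u64 nvgrace_gpu_reported_bar_size(u64 memlength)
{
        u64 usable = ALIGN(memlength, NV_MEMBLK_SIZE);

        return roundup_pow_of_two(usable);
}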
[PATCH v17 1/3] vfio/pci: rename and export do_io_rw()
From: Ankit Agrawal do_io_rw() is used to read/write to the device MMIO. The grace hopper VFIO PCI variant driver requires this functionality to read/write to its memory. Rename it as a vfio_pci_core function and export it as GPL. Signed-off-by: Ankit Agrawal --- drivers/vfio/pci/vfio_pci_rdwr.c | 16 +--- include/linux/vfio_pci_core.h| 5 - 2 files changed, 13 insertions(+), 8 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 07fea08ea8a2..03b8f7ada1ac 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -96,10 +96,10 @@ VFIO_IOREAD(32) * reads with -1. This is intended for handling MSI-X vector tables and * leftover space for ROM BARs. */ -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, - void __iomem *io, char __user *buf, - loff_t off, size_t count, size_t x_start, - size_t x_end, bool iswrite) +ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, + void __iomem *io, char __user *buf, + loff_t off, size_t count, size_t x_start, + size_t x_end, bool iswrite) { ssize_t done = 0; int ret; @@ -201,6 +201,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, return done; } +EXPORT_SYMBOL_GPL(vfio_pci_core_do_io_rw); int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar) { @@ -279,8 +280,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf, x_end = vdev->msix_offset + vdev->msix_size; } - done = do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, buf, pos, - count, x_start, x_end, iswrite); + done = vfio_pci_core_do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, buf, pos, + count, x_start, x_end, iswrite); if (done >= 0) *ppos += done; @@ -348,7 +349,8 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, char __user *buf, * probing, so we don't currently worry about access in relation * to the memory enable bit in the command register. */ - done = do_io_rw(vdev, false, iomem, buf, off, count, 0, 0, iswrite); + done = vfio_pci_core_do_io_rw(vdev, false, iomem, buf, off, count, + 0, 0, iswrite); vga_put(vdev->pdev, rsrc); diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 85e84b92751b..cf9480a31f3e 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -130,7 +130,10 @@ void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev); int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar); pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev, pci_channel_state_t state); - +ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, + void __iomem *io, char __user *buf, + loff_t off, size_t count, size_t x_start, + size_t x_end, bool iswrite); #define VFIO_IOWRITE_DECLATION(size) \ int vfio_pci_core_iowrite##size(struct vfio_pci_core_device *vdev, \ bool test_mem, u##size val, void __iomem *io); -- 2.34.1
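For context, a hypothetical caller in a variant driver could then wrap the exported helper like this; the function name and the fake-BAR framing are assumptions, but the argument pattern mirrors the vfio_pci_vga_rw() call in the diff above:

/*
 * Back a fake BAR with device memory: no MSI-X table to exclude
 * (x_start = x_end = 0) and no memory-enable gating (test_mem = false),
 * since the hardware tolerates accesses with memory enable off.
 */
static ssize_t nvgrace_gpu_mem_rw(struct vfio_pci_core_device *vdev,
                                  void __iomem *io, char __user *buf,
                                  loff_t off, size_t count, bool iswrite)
{
        return vfio_pci_core_do_io_rw(vdev, false, io, buf, off, count,
                                      0, 0, iswrite);
}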
[PATCH v17 2/3] vfio/pci: rename and export range_intersect_range
From: Ankit Agrawal range_intersect_range determines the overlap between two ranges. If there is an overlap, the helper function returns the overlapping offset and size. The VFIO PCI variant driver emulates the PCI config space BAR offset registers. These offsets may be accessed for read/write with a variety of lengths including sub-word sizes from sub-word offsets. The driver makes use of this helper function to read/write the targeted part of the emulated register. Make this a vfio_pci_core function, rename and export as GPL. Also update references in the virtio driver. Signed-off-by: Ankit Agrawal --- drivers/vfio/pci/vfio_pci_config.c | 45 +++ drivers/vfio/pci/virtio/main.c | 72 +++--- include/linux/vfio_pci_core.h | 5 +++ 3 files changed, 76 insertions(+), 46 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 672a1804af6a..4fc3c605af13 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1966,3 +1966,48 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf, return done; } + +/** + * vfio_pci_core_range_intersect_range() - Determine overlap between a buffer + * and register offset ranges. + * @range1_start:start offset of the buffer + * @count1: number of buffer bytes. + * @range2_start:start register offset + * @count2: number of bytes of register + * @start_offset:start offset of overlap start in the buffer + * @intersect_count: number of overlapping bytes + * @register_offset: start offset of overlap start in register + * + * The function determines an overlap between a register and a buffer. + * range1 represents the buffer and range2 represents the register. + * + * Returns: true if there is overlap, false if not. + * The overlap start and range are returned through the function args. + */ +bool vfio_pci_core_range_intersect_range(loff_t range1_start, size_t count1, +loff_t range2_start, size_t count2, +loff_t *start_offset, +size_t *intersect_count, +size_t *register_offset) +{ + if (range1_start <= range2_start && + range1_start + count1 > range2_start) { + *start_offset = range2_start - range1_start; + *intersect_count = min_t(size_t, count2, +range1_start + count1 - range2_start); + *register_offset = 0; + return true; + } + + if (range1_start > range2_start && + range1_start < range2_start + count2) { + *start_offset = 0; + *intersect_count = min_t(size_t, count1, +range2_start + count2 - range1_start); + *register_offset = range1_start - range2_start; + return true; + } + + return false; +} +EXPORT_SYMBOL_GPL(vfio_pci_core_range_intersect_range); diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c index d5af683837d3..b5d3a8c5bbc9 100644 --- a/drivers/vfio/pci/virtio/main.c +++ b/drivers/vfio/pci/virtio/main.c @@ -132,33 +132,6 @@ virtiovf_pci_bar0_rw(struct virtiovf_pci_core_device *virtvdev, return ret ?
ret : count; } -static bool range_intersect_range(loff_t range1_start, size_t count1, - loff_t range2_start, size_t count2, - loff_t *start_offset, - size_t *intersect_count, - size_t *register_offset) -{ - if (range1_start <= range2_start && - range1_start + count1 > range2_start) { - *start_offset = range2_start - range1_start; - *intersect_count = min_t(size_t, count2, -range1_start + count1 - range2_start); - *register_offset = 0; - return true; - } - - if (range1_start > range2_start && - range1_start < range2_start + count2) { - *start_offset = 0; - *intersect_count = min_t(size_t, count1, -range2_start + count2 - range1_start); - *register_offset = range1_start - range2_start; - return true; - } - - return false; -} - static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev, char __user *buf, size_t count, loff_t *ppos) @@ -178,16 +151,18 @@ static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev, if (ret < 0) return ret; - if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16), -
[PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
From: Ankit Agrawal NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect. The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (occurs at the time of host system boot). The device memory is accessed instead using the chip-to-chip interconnect that is exposed as a contiguous physically addressable region on the host. This device memory aperture can be obtained from the host ACPI table using device_property_read_u64(), according to the FW specification. Since the device memory is cache coherent with the CPU, it can be mmapped into the user VMA with a cacheable mapping using remap_pfn_range() and used like regular RAM. The device memory is not added to the host kernel, but mapped directly as this reduces memory wastage due to struct pages. There is also a requirement of a reserved 1G uncached region (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide a different non-cached property to the reserved 1G region, it needs to be carved out from the device memory and mapped as a separate region in the Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace. The variant driver exposes these two regions - the non-cached reserved (resmem) and the cached rest of the device memory (termed as usemem) - as separate VFIO 64b BAR regions. This is divergent from the baremetal approach, where the device memory is exposed as a device memory region. The decision for a different approach was taken in view of the fact that it would necessitate additional code in Qemu to discover and insert those regions in the VM IPA, along with the additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top of tree Qemu) out there desiring grace hopper support. Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved out region to the next available PCI BAR (i.e. comprising region 2 and 3). The cached device memory aperture is assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as the BAR2 region and the cacheable one as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers. The hardware ensures that the system does not crash when the memory is accessed with the memory enable turned off. It synthesizes ~0 reads and drops writes on such accesses. So there is no need to support disabling/enabling the BAR through the PCI_COMMAND config space register. The memory layout on the host looks like the following:

               devmem (memlength)
|--------------------------------------------------|
|------------cached--------------------------|--NC--|
|                                            |
usemem.phys/memphys                          resmem.phys

PCI BARs need to be aligned to a power of 2, but the actual memory on the device may not be.
A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes. Note that the GPU device driver [6] is capable of knowing the exact device memory size through separate means. The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module. Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks. Hence make the usable memory size memblock aligned. Currently there is no provision in KVM for an S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. As previously mentioned, resmem is mapped with pgprot_writecombine(), which sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM marks the region with MemAttr[2:0]=0b101 in S2. If the bare metal properties are not present, the driver registers the vfio-pci-core function pointers. This goes along with a qemu series [6] to provide the necessary implementation of the Grace Hopper Superchip firmware specification so that the guest operating system can see the correct ACPI modeling for the coherent GPU device. Verified with the CUDA workload in the VM. [1] https://www.nvidia.com/en-in/technologies/
Re: [PATCH RFC 0/4] Introduce uts_release
On Mon, Feb 5, 2024 at 5:25 PM John Garry wrote: > > On 02/02/2024 15:01, Masahiro Yamada wrote: > >> -- > >> 2.35.3 > > > > As you see, several drivers store UTS_RELEASE in their driver data, > > and even print it in debug print. > > > > > > I do not see why it is useful. > > I would tend to agree, and mentioned that earlier. > > > As you discussed in 3/4, if UTS_RELEASE is unneeded, > > it is better to get rid of it. > > Jakub replied about this. > > > > > > > If such version information is useful for drivers, the intention is > > whether the version of the module, or the version of vmlinux. > > That is a question. > > They differ when CONFIG_MODVERSION. > > > > I think often this information in UTS_RELEASE is shared as informative > only, so the user can conveniently know the specific kernel git version. > > > > > When module developers intend to printk the git version > > from which the module was compiled from, > > presumably they want to use UTS_RELEASE, which > > was expanded at the compile time of the module. > > > > If you replace it with uts_release, it is the git version > > of vmlinux. > > > > > > Of course, the replacement is safe for always-builtin code. > > > > > > > > Lastly, we can avoid using UTS_RELEASE without relying > > on your patch. > > > > > > > > For example, commit 3a3a11e6e5a2bc0595c7e36ae33c861c9e8c75b1 > > replaced UTS_RELEASE with init_uts_ns.name.release > > > > > > So, is your uts_release a shorthand of init_uts_ns.name.release? > > Yes - well that both are strings containing UTS_RELEASE. Using a struct > sub-member is bit ungainly, but I suppose that we should not be making > life easy for people using this. > > However we already have init_utsname in: > > static inline struct new_utsname *init_utsname(void) > { > return &init_uts_ns.name; > } > > So could use init_utsname()->release, which is a bit nicer. > > > > > > > > > I think what you can contribute are: > > > > - Explore the UTS_RELEASE users, and check if you can get rid of it. > > Unfortunately I expect resistance for this. I also expect places like FW > loader it is necessary. And when this is used in sysfs, people will say > that it is part of the ABI now. > > How about I send the patch to update to use init_uts_ns and mention also > that it would be better to not use at all, if possible? I can cc you. OK. As I mentioned in the previous reply, the replacement is safe for builtin code. When you touch modular code, please pay a little more care, because UTS_RELEASE and init_utsname()->release may differ when CONFIG_MODVERSIONS=y. > > > > > - Where UTS_RELEASE is useful, consider if it is possible > > to replace it with init_uts_ns.name.release > > ok, but, as above, could use init_utsname()->release also I am fine with it. init_utsname()->release is more commonly used (but less common than UTS_RELEASE) $ git grep 'init_utsname()->release' | wc 28 922065 $ git grep 'init_uts_ns.name.release' | wc 7 34 587 $ git grep 'UTS_RELEASE' | wc 57 3044741 -- Best Regards Masahiro Yamada
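A minimal sketch of the builtin-code replacement being discussed; both lines print the same string for builtin code, since init_utsname()->release holds vmlinux's UTS_RELEASE at runtime (for modules the two can differ with CONFIG_MODVERSIONS=y, as noted above). The demo function name is hypothetical:

#include <linux/init.h>
#include <linux/printk.h>
#include <linux/utsname.h>
#include <generated/utsrelease.h>

static int __init uts_release_demo_init(void)
{
        pr_info("compile time: %s\n", UTS_RELEASE);              /* macro, expanded at build */
        pr_info("runtime:      %s\n", init_utsname()->release);  /* string in vmlinux        */
        return 0;
}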
Re: Question about the ipi_raise filter usage and output
Hi Steve, On Mon, Feb 5, 2024 at 6:38 PM Steven Rostedt wrote: > > On Mon, 5 Feb 2024 17:57:29 +0800 > richard clark wrote: > > > I try to write below: > > echo 'target_cpus == 11 && reason == "Function call interrupts"' > > > events/ipi/ipi_raise/filter > > You mean when it is sent only to CPU 11? Not when the event is > happening on CPU 11. Like the above example, the event was triggered on > CPU 10, but the mask was for all the other CPUs. > > If you are looking for just CPU 11, you can do: > > echo 'target_cpus == 0x800 && reason == "Function call interrupts"' > Seems both 'target_cpus == 0x800 && reason == "Function call interrupts"' and 'target_cpus & 0x800 && reason == "Function call interrupts"' don't work: # cat events/ipi/ipi_raise/enable 1 # cat events/ipi/ipi_raise/filter target_cpus == 0x800 && reason == "Function call interrupts" The kernel module code snippet: void ipi_func_run_cpu(void *info) { pr_info("remote function runs on cpu[%d].\n", smp_processor_id()); } static int __init ipi_send_init(void) { int target = (smp_processor_id() + 1) % nr_cpu_ids; int ret = smp_call_function_single(target, ipi_func_run_cpu, NULL, true); pr_info("ipi cpu[%d --> %d] ret = %d\n", smp_processor_id(), target, ret); return 0; } ... module_init(ipi_send_init); module_exit(ipi_send_exit); $ sudo taskset -c 10 insmod ipi_send.ko $ dmesg ... [84931.864273] remote function runs on cpu[11]. [84931.864282] ipi cpu[10 --> 11] ret = 0 The 'cat trace' will output the below message with 'reason == "Function call interrupts"' filter: ... sudo-5726[007] dn.h1.. 84302.833545: ipi_raise: target_mask=,0001 (Function call interrupts) sudo-5726[007] dn.h2.. 84302.837544: ipi_raise: target_mask=,0001 (Function call interrupts) insmod-5727[011] dn.h1.. 84302.841545: ipi_raise: target_mask=,0001 (Function call interrupts) insmod-5727[010] 1.. 84302.843966: ipi_raise: target_mask=,0bff (Function call interrupts) insmod-5727[010] 1.. 84302.843975: ipi_raise: target_mask=,0bff (Function call interrupts) insmod-5727[010] 1.. 84302.844184: ipi_raise: target_mask=,0800 (Function call interrupts) ... I find that 'target_cpus == 0xfff && reason == "Function call interrupts"' doesn't have output in the buffer, but 'target_cpus & 0xfff && reason == "Function call interrupts"' does. I also tried to use 'target_cpus & 0xf00 && reason == "Function call interrupts"' in my case, the trace buffer has nothing after the kmod inserted. Any comments? > > > -- Steve
Re: [PATCH] virtio: make virtio_bus const
On Mon, Feb 5, 2024 at 4:52 AM Ricardo B. Marliere wrote: > > Now that the driver core can properly handle constant struct bus_type, > move the virtio_bus variable to be a constant structure as well, > placing it into read-only memory which can not be modified at runtime. > > Cc: Greg Kroah-Hartman > Suggested-by: Greg Kroah-Hartman > Signed-off-by: Ricardo B. Marliere > --- Acked-by: Jason Wang Thanks
Re: [PATCH net-next v5 5/5] tools: virtio: introduce vhost_net_test
On Mon, Feb 5, 2024 at 8:46 PM Yunsheng Lin wrote: > > introduce vhost_net_test for both vhost_net tx and rx basing > on virtio_test to test vhost_net changing in the kernel. > > Steps for vhost_net tx testing: > 1. Prepare a out buf. > 2. Kick the vhost_net to do tx processing. > 3. Do the receiving in the tun side. > 4. verify the data received by tun is correct. > > Steps for vhost_net rx testing: > 1. Prepare a in buf. > 2. Do the sending in the tun side. > 3. Kick the vhost_net to do rx processing. > 4. verify the data received by vhost_net is correct. > > Signed-off-by: Yunsheng Lin > --- > tools/virtio/.gitignore| 1 + > tools/virtio/Makefile | 8 +- > tools/virtio/linux/virtio_config.h | 4 + > tools/virtio/vhost_net_test.c | 536 + > 4 files changed, 546 insertions(+), 3 deletions(-) > create mode 100644 tools/virtio/vhost_net_test.c > > diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore > index 9934d48d9a55..7e47b281c442 100644 > --- a/tools/virtio/.gitignore > +++ b/tools/virtio/.gitignore > @@ -1,5 +1,6 @@ > # SPDX-License-Identifier: GPL-2.0-only > *.d > virtio_test > +vhost_net_test > vringh_test > virtio-trace/trace-agent > diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile > index d128925980e0..e25e99c1c3b7 100644 > --- a/tools/virtio/Makefile > +++ b/tools/virtio/Makefile > @@ -1,8 +1,9 @@ > # SPDX-License-Identifier: GPL-2.0 > all: test mod > -test: virtio_test vringh_test > +test: virtio_test vringh_test vhost_net_test > virtio_test: virtio_ring.o virtio_test.o > vringh_test: vringh_test.o vringh.o virtio_ring.o > +vhost_net_test: virtio_ring.o vhost_net_test.o > > try-run = $(shell set -e; \ > if ($(1)) >/dev/null 2>&1; \ > @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean > > .PHONY: all test mod clean vhost oot oot-clean oot-build > clean: > - ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \ > - vhost_test/Module.symvers vhost_test/modules.order *.d > + ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \ > + vhost_test/.*.cmd vhost_test/Module.symvers \ > + vhost_test/modules.order *.d > -include *.d > diff --git a/tools/virtio/linux/virtio_config.h > b/tools/virtio/linux/virtio_config.h > index 2a8a70e2a950..42a564f22f2d 100644 > --- a/tools/virtio/linux/virtio_config.h > +++ b/tools/virtio/linux/virtio_config.h > @@ -1,4 +1,6 @@ > /* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef LINUX_VIRTIO_CONFIG_H > +#define LINUX_VIRTIO_CONFIG_H > #include > #include > #include > @@ -95,3 +97,5 @@ static inline __virtio64 cpu_to_virtio64(struct > virtio_device *vdev, u64 val) > { > return __cpu_to_virtio64(virtio_is_little_endian(vdev), val); > } > + > +#endif > diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c > new file mode 100644 > index ..6c41204e6707 > --- /dev/null > +++ b/tools/virtio/vhost_net_test.c > @@ -0,0 +1,536 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#define _GNU_SOURCE > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define HDR_LENsizeof(struct virtio_net_hdr_mrg_rxbuf) > +#define TEST_BUF_LEN 256 > +#define TEST_PTYPE ETH_P_LOOPBACK > +#define DESC_NUM 256 > + > +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */ > +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end; > + > +struct vq_info { > + int kick; > + int call; > + int idx; > + long started; > + 
long completed; > + struct pollfd fds; > + void *ring; > + /* copy used for control */ > + struct vring vring; > + struct virtqueue *vq; > +}; > + > +struct vdev_info { > + struct virtio_device vdev; > + int control; > + struct vq_info vqs[2]; > + int nvqs; > + void *buf; > + size_t buf_size; > + char *test_buf; > + char *res_buf; > + struct vhost_memory *mem; > + int sock; > + int ifindex; > + unsigned char mac[ETHER_ADDR_LEN]; > +}; > + > +static int tun_alloc(struct vdev_info *dev, char *tun_name) > +{ > + struct ifreq ifr; > + int len = HDR_LEN; > + int fd, e; > + > + fd = open("/dev/net/tun", O_RDWR); > + if (fd < 0) { > + perror("Cannot open /dev/net/tun"); > + return fd; > + } > + > + memset(&ifr, 0, sizeof(ifr)); > + > + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR; > + strncpy(ifr.ifr_name, tun_name, IFNAMSIZ); > + > + e = ioctl(fd, TUNSETIFF, &ifr); > + if (e < 0) { > + perror("ioctl[TUNSETIFF]"); > + close(fd); > +
[PATCH v11] bus: mhi: host: Add tracing support
This change adds ftrace support for the following functions, which helps in debugging issues when there is a channel state or MHI state change and also when we receive data and control events: 1. mhi_intvec_mhi_states 2. mhi_process_data_event_ring 3. mhi_process_ctrl_ev_ring 4. mhi_gen_tre 5. mhi_update_channel_state 6. mhi_tryset_pm_state 7. mhi_pm_st_worker Change the implementation of the arrays which hold the enum-to-string mappings to make it consistent in both the trace header file and other files. Wherever trace events are added, debug messages are removed. Signed-off-by: Krishna chaitanya chundru Reviewed-by: Manivannan Sadhasivam Reviewed-by: "Steven Rostedt (Google)" --- Changes in v11: - Rebased with mhi next. - Link to v10: https://lore.kernel.org/r/20240131-ftrace_support-v10-1-4349306b8...@quicinc.com Changes in v10: - Modified command_start and command_end traces to take string as input to mention the correct - string as suggested by mani - As suggested by mani, modified the print format from lower case to upper case format. - Link to v9: https://lore.kernel.org/r/20240105-ftrace_support-v9-1-a2dca64cc...@quicinc.com Changes in v9: - Change the implementations of some arrays so that the string-to-enum mapping - is the same in both the trace header and other files as suggested by steve. - Link to v8: https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com Changes in v8: - Pass the structure and dereference the variables in TP_fast_assign as suggested by steve - Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com Changes in v7: - change log format as pointed out by mani. - Link to v6: https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com Changes in v6: - use 'rp' directly as suggested by jeffrey. - Link to v5: https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com Changes in v5: - Use DECLARE_EVENT_CLASS for multiple events as suggested by steve. - Instead of converting to u64 to print the address, use %px to print the address to avoid - warnings on some platforms. - Link to v4: https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com Changes in v4: - Fix compilation issues in the previous patch which happened due to rebasing. - In the defconfig the FTRACE config is not enabled, due to that the compilation issue was not - seen in my workspace. - Link to v3: https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com Changes in v3: - move the trace header file from include/trace/events to drivers/bus/mhi/host/ so that - we can include driver header files. - Use macros directly in the trace events as suggested by Jeffrey Hugo. - Reorder the structure in the events as suggested by steve to avoid holes in the buffer. - removed the mhi_to_physical function as this can give security issues. - removed macros to define strings as we can get those from driver headers. Changes in v2: - Passing the raw state into the trace event and using __print_symbolic() as suggested by bjorn. - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn. - Fixed the kernel test robot issues.
- Link to v1: https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com --- drivers/bus/mhi/common.h| 38 +++--- drivers/bus/mhi/host/init.c | 64 + drivers/bus/mhi/host/internal.h | 41 ++ drivers/bus/mhi/host/main.c | 19 ++- drivers/bus/mhi/host/pm.c | 7 +- drivers/bus/mhi/host/trace.h| 280 6 files changed, 384 insertions(+), 65 deletions(-) diff --git a/drivers/bus/mhi/common.h b/drivers/bus/mhi/common.h index f794b9c8049e..dda340aaed95 100644 --- a/drivers/bus/mhi/common.h +++ b/drivers/bus/mhi/common.h @@ -297,30 +297,30 @@ struct mhi_ring_element { __le32 dword[2]; }; +#define MHI_STATE_LIST \ + mhi_state(RESET,"RESET")\ + mhi_state(READY,"READY")\ + mhi_state(M0, "M0") \ + mhi_state(M1, "M1") \ + mhi_state(M2, "M2") \ + mhi_state(M3, "M3") \ + mhi_state(M3_FAST, "M3_FAST") \ + mhi_state(BHI, "BHI") \ + mhi_state_end(SYS_ERR, "SYS ERROR") + +#undef mhi_state +#undef mhi_state_end + +#define mhi_state(a, b)case MHI_STATE_##a: return b; +#define mhi_state_end(a, b)case MHI_STATE_##a: return b; + static inline const char *mhi_state_str(enum mhi_state state) { switch (state) { - case MHI_STATE_RESET: - return "RESET"; - case MHI_STATE_READY: - return "READY"; - case MHI_STATE_M0: - return "M0"; - case MHI_STATE_M1: - return "M1"; -
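The diff is cut off above, but the pattern it introduces is worth spelling out: one X-macro list is expanded twice with different per-entry definitions, so the enum-to-string mapping can never drift apart between the C code and the trace header. A sketch of both expansions follows; the trace.h side is assumed, as it is not visible in this excerpt:

/* common.h side: expand each list entry to a switch case */
#undef mhi_state
#undef mhi_state_end
#define mhi_state(a, b)         case MHI_STATE_##a: return b;
#define mhi_state_end(a, b)     case MHI_STATE_##a: return b;

static inline const char *mhi_state_str(enum mhi_state state)
{
        switch (state) {
        MHI_STATE_LIST
        default:
                return "Unknown state";
        }
}

/* trace.h side (assumed): expand the same list into symbol/name pairs */
#undef mhi_state
#undef mhi_state_end
#define mhi_state(a, b)         { MHI_STATE_##a, b },
#define mhi_state_end(a, b)     { MHI_STATE_##a, b }

TP_printk("state: %s", __print_symbolic(__entry->state, MHI_STATE_LIST))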
Re: [PATCH v11] bus: mhi: host: Add tracing support
On Tue, Feb 06, 2024 at 10:02:05AM +0530, Krishna chaitanya chundru wrote: > This change adds ftrace support for following functions which > helps in debugging the issues when there is Channel state & MHI > state change and also when we receive data and control events: > 1. mhi_intvec_mhi_states > 2. mhi_process_data_event_ring > 3. mhi_process_ctrl_ev_ring > 4. mhi_gen_tre > 5. mhi_update_channel_state > 6. mhi_tryset_pm_state > 7. mhi_pm_st_worker > > Change the implementation of the arrays which has enum to strings mapping > to make it consistent in both trace header file and other files. > > Where ever the trace events are added, debug messages are removed. > > Signed-off-by: Krishna chaitanya chundru There are a lot of checkpatch errors. Please fix them and resubmit. - Mani > Reviewed-by: Manivannan Sadhasivam > Reviewed-by: "Steven Rostedt (Google)" > --- > Changes in v11: > - Rebased with mhi next. > - Link to v10: > https://lore.kernel.org/r/20240131-ftrace_support-v10-1-4349306b8...@quicinc.com > > Changes in v10: > - Modified command_start and command_end traces to take string as input to > mention correct > - string as suggested by mani > - As sugggested by mani modified the print format from lower format to upper > case format. > - Link to v9: > https://lore.kernel.org/r/20240105-ftrace_support-v9-1-a2dca64cc...@quicinc.com > > Changes in v9: > - Change the implementations of some array so that the strings to enum mapping > - is same in both trace header and other files as suggested by steve. > - Link to v8: > https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com > > Changes in v8: > - Pass the structure and derefernce the variables in TP_fast_assign as > suggested by steve > - Link to v7: > https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com > > Changes in v7: > - change log format as pointed by mani. > - Link to v6: > https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com > > Changes in v6: > - use 'rp' directly as suggested by jeffrey. > - Link to v5: > https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com > > Changes in v5: > - Use DECLARE_EVENT_CLASS for multiple events as suggested by steve. > - Instead of converting to u64 to print address, use %px to print the address > to avoid > - warnings in some platforms. > - Link to v4: > https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com > > Changes in v4: > - Fix compilation issues in previous patch which happended due to rebasing. > - In the defconfig FTRACE config is not enabled due to that the compilation > issue is not > - seen in my workspace. > - Link to v3: > https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com > > Changes in v3: > - move trace header file from include/trace/events to drivers/bus/mhi/host/ > so that > - we can include driver header files. > - Use macros directly in the trace events as suggested Jeffrey Hugo. > - Reorder the structure in the events as suggested by steve to avoid holes in > the buffer. > - removed the mhi_to_physical function as this can give security issues. > - removed macros to define strings as we can get those from driver headers. > - Link to v2: > https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com > > Changes in v2: > - Passing the raw state into the trace event and using __print_symbolic() as > suggested by bjorn. > - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn. 
> - Fixed the kernel test rebot issues. > - Link to v1: > https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com > --- > drivers/bus/mhi/common.h| 38 +++--- > drivers/bus/mhi/host/init.c | 64 + > drivers/bus/mhi/host/internal.h | 41 ++ > drivers/bus/mhi/host/main.c | 19 ++- > drivers/bus/mhi/host/pm.c | 7 +- > drivers/bus/mhi/host/trace.h| 280 > > 6 files changed, 384 insertions(+), 65 deletions(-) > > diff --git a/drivers/bus/mhi/common.h b/drivers/bus/mhi/common.h > index f794b9c8049e..dda340aaed95 100644 > --- a/drivers/bus/mhi/common.h > +++ b/drivers/bus/mhi/common.h > @@ -297,30 +297,30 @@ struct mhi_ring_element { > __le32 dword[2]; > }; > > +#define MHI_STATE_LIST \ > + mhi_state(RESET,"RESET")\ > + mhi_state(READY,"READY")\ > + mhi_state(M0, "M0") \ > + mhi_state(M1, "M1") \ > + mhi_state(M2, "M2") \ > + mhi_state(M3, "M3") \ > + mhi_state(M3_FAST, "M3_FAST") \ > + mhi_state(BHI, "BHI") \ > + mhi_state_end(SYS_ERR, "SYS ERROR") > + > +#undef mhi_state > +#undef mhi_state_end > + > +#define mhi_state(a, b) case MHI_STATE_##a: return b; > +#defin
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Steven Rostedt writes: > On Mon, 05 Feb 2024 14:16:30 +0100 > Sven Schnelle wrote: >> >> Another issue i'm hitting sometimes is this part: >> >> csum1=`md5sum trace` >> sleep $SLEEP_TIME >> csum2=`md5sum trace` >> >> if [ "$csum1" != "$csum2" ]; then >> fail "Tracing file is still changing" >> fi >> >> This is because the command line was replaced in the >> saved_cmdlines_buffer, an example diff between both files >> is: > > [..] > >> >> This can be improved by: >> >> echo 32768 > /sys/kernel/tracing/saved_cmdlines_size >> >> But this is of course not a fix - should we maybe replace the program >> name with <...> before comparing, remove the check completely, or do >> anything else? What do you think? > > Hmm, actually I would say that this exposes a real bug. Not a major > one, but one that I find annoying. The saved commandlines should only > be updated when a trace event occurs. But really, it should only be > updated if one is added to the ring buffer. If the ring buffer isn't > being updated, we shouldn't be adding new command lines. > > There may be a location that has tracing off but still updating the > cmdlines which will break the saved cache. Looking at trace_save_cmdline(): tpid = tsk->pid & (PID_MAX_DEFAULT - 1); where PID_MAX_DEFAULT = 0x8000 so this is basically tpid = tsk->pid & 0x7fff; further on: // might clash with another pid if (otherpid & 0x7fff) == (tsk->pid & 0x7fff) idx = savedcmd->map_pid_to_cmdline[tpid]; if (idx == NO_CMDLINE_MAP) { // This will pick an existing entry if there are // more than cmdline_num entries present idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num; savedcmd->map_pid_to_cmdline[tpid] = idx; savedcmd->cmdline_idx = idx; } So I think the problem that sometimes '<...>' instead of the correct comm is logged is just expected behaviour given the code above.
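A tiny user-space illustration of that masking: two pids that differ by PID_MAX_DEFAULT hash to the same map_pid_to_cmdline slot, so the later task evicts the earlier task's saved comm and the earlier pid then resolves to '<...>':

#include <stdio.h>

#define PID_MAX_DEFAULT 0x8000

int main(void)
{
        int pid1 = 0x1234;
        int pid2 = pid1 + PID_MAX_DEFAULT;      /* 0x9234 */

        /* both lines print slot 0x1234 */
        printf("pid %#x -> slot %#x\n", pid1, pid1 & (PID_MAX_DEFAULT - 1));
        printf("pid %#x -> slot %#x\n", pid2, pid2 & (PID_MAX_DEFAULT - 1));
        return 0;
}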
Re: [PATCH v11] bus: mhi: host: Add tracing support
On 2/6/2024 11:56 AM, Manivannan Sadhasivam wrote: On Tue, Feb 06, 2024 at 10:02:05AM +0530, Krishna chaitanya chundru wrote: This change adds ftrace support for following functions which helps in debugging the issues when there is Channel state & MHI state change and also when we receive data and control events: 1. mhi_intvec_mhi_states 2. mhi_process_data_event_ring 3. mhi_process_ctrl_ev_ring 4. mhi_gen_tre 5. mhi_update_channel_state 6. mhi_tryset_pm_state 7. mhi_pm_st_worker Change the implementation of the arrays which has enum to strings mapping to make it consistent in both trace header file and other files. Where ever the trace events are added, debug messages are removed. Signed-off-by: Krishna chaitanya chundru There are a lot of checkpatch errors. Please fix them and resubmit. - Mani Hi Mani, The errors that checkpatch points out are false positives; they have been reported since v2, and the kernel test bot also does not throw any errors for them. I checked with the internal team and they said these errors are false positives. Thanks & Regards, Krishna Chaitanya. Reviewed-by: Manivannan Sadhasivam Reviewed-by: "Steven Rostedt (Google)" --- Changes in v11: - Rebased with mhi next. - Link to v10: https://lore.kernel.org/r/20240131-ftrace_support-v10-1-4349306b8...@quicinc.com Changes in v10: - Modified command_start and command_end traces to take string as input to mention correct - string as suggested by mani - As sugggested by mani modified the print format from lower format to upper case format. - Link to v9: https://lore.kernel.org/r/20240105-ftrace_support-v9-1-a2dca64cc...@quicinc.com Changes in v9: - Change the implementations of some array so that the strings to enum mapping - is same in both trace header and other files as suggested by steve. - Link to v8: https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com Changes in v8: - Pass the structure and derefernce the variables in TP_fast_assign as suggested by steve - Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com Changes in v7: - change log format as pointed by mani. - Link to v6: https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com Changes in v6: - use 'rp' directly as suggested by jeffrey. - Link to v5: https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com Changes in v5: - Use DECLARE_EVENT_CLASS for multiple events as suggested by steve. - Instead of converting to u64 to print address, use %px to print the address to avoid - warnings in some platforms. - Link to v4: https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com Changes in v4: - Fix compilation issues in previous patch which happended due to rebasing. - In the defconfig FTRACE config is not enabled due to that the compilation issue is not - seen in my workspace. - Link to v3: https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com Changes in v3: - move trace header file from include/trace/events to drivers/bus/mhi/host/ so that - we can include driver header files. - Use macros directly in the trace events as suggested Jeffrey Hugo. - Reorder the structure in the events as suggested by steve to avoid holes in the buffer. - removed the mhi_to_physical function as this can give security issues. - removed macros to define strings as we can get those from driver headers.
- Link to v2: https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com Changes in v2: - Passing the raw state into the trace event and using __print_symbolic() as suggested by bjorn. - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn. - Fixed the kernel test rebot issues. - Link to v1: https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com --- drivers/bus/mhi/common.h| 38 +++--- drivers/bus/mhi/host/init.c | 64 + drivers/bus/mhi/host/internal.h | 41 ++ drivers/bus/mhi/host/main.c | 19 ++- drivers/bus/mhi/host/pm.c | 7 +- drivers/bus/mhi/host/trace.h| 280 6 files changed, 384 insertions(+), 65 deletions(-) diff --git a/drivers/bus/mhi/common.h b/drivers/bus/mhi/common.h index f794b9c8049e..dda340aaed95 100644 --- a/drivers/bus/mhi/common.h +++ b/drivers/bus/mhi/common.h @@ -297,30 +297,30 @@ struct mhi_ring_element { __le32 dword[2]; }; +#define MHI_STATE_LIST\ + mhi_state(RESET,"RESET") \ + mhi_state(READY,"READY") \ + mhi_state(M0, "M0") \ + mhi_state(M1, "M1") \ + mhi_state(M2, "M2") \ + mhi_state(M3, "M3") \ + mhi_state(M3_FAST, "M3_FAST")\ + mhi_state(BHI, "BHI")\ + mhi_state_end(SYS_ERR, "SYS ERROR") + +#undef mhi_state
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
On 2/5/24 07:53, Sven Schnelle wrote:

tracer_tracing_is_on() checks whether record_disabled is not zero. This
checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading
the source, it looks like this function should only check for the
RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes
spurious failures in the 'test for function traceon/off triggers' test from
the ftrace testsuite when the system is under load.

Signed-off-by: Sven Schnelle
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..47e221e1e720 100644

Tested-by: Mete Durlu
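
For context, the two ring-buffer helpers differ roughly as follows
(a paraphrased sketch of kernel/trace/ring_buffer.c, not the exact hunk):

/* record_disabled is an atomic counter; RB_BUFFER_OFF is a single high
 * bit inside it that marks the buffer as deliberately switched off.
 */
#define RB_BUFFER_OFF		(1 << 20)

/* True only while nothing has recording disabled, including transient
 * internal disables, e.g. while the buffer is being resized.
 */
bool ring_buffer_record_is_on(struct trace_buffer *buffer)
{
	return !atomic_read(&buffer->record_disabled);
}

/* True unless the buffer was explicitly switched off; transient counter
 * increments under load do not flip the result, which is why the test
 * no longer fails spuriously.
 */
bool ring_buffer_record_is_set_on(struct trace_buffer *buffer)
{
	return !(atomic_read(&buffer->record_disabled) & RB_BUFFER_OFF);
}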
Re: [PATCH net-next v5 5/5] tools: virtio: introduce vhost_net_test
On 2024/2/6 11:08, Jason Wang wrote:
...
>> +
>> +static void wait_for_interrupt(struct vq_info *vq)
>> +{
>> +	unsigned long long val;
>> +
>> +	poll(&vq->fds, 1, -1);
>
> It's not good to wait indefinitely.

How about a timeout value of 100ms, as below?

	poll(&vq->fds, 1, 100);

>
>> +
>> +	if (vq->fds.revents & POLLIN)
>> +		read(vq->fds.fd, &val, sizeof(val));
>> +}
>> +
>> +static void verify_res_buf(char *res_buf)
>> +{
>> +	int i;
>> +
>> +	for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
>> +		assert(res_buf[i] == (char)i);
>> +}
>> +
>> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
>> +			bool delayed, int bufs)
>> +{
>> +	long long spurious = 0;
>> +	struct scatterlist sl;
>> +	unsigned int len;
>> +	int r;
>> +
>> +	for (;;) {
>> +		long started_before = vq->started;
>> +		long completed_before = vq->completed;
>> +
>> +		virtqueue_disable_cb(vq->vq);
>> +		do {
>> +			while (vq->started < bufs &&
>> +			       (vq->started - vq->completed) < 1) {
>> +				sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
>> +				r = virtqueue_add_outbuf(vq->vq, &sl, 1,
>> +							 dev->test_buf + vq->started,
>> +							 GFP_ATOMIC);
>> +				if (unlikely(r != 0))
>> +					break;
>> +
>> +				++vq->started;
>
> If we never decrease started/completed shouldn't we use unsigned here?
> (as well as completed)
>
> Otherwise we may get unexpected results for vq->started as well as
> vq->completed.

We have the "vq->started < bufs" check before the increment, as above, and
there is 'assert(nbufs > 0)' when getting optarg in main(), which means we
never allow started/completed to be greater than nbufs, as I understand it.

>
>> +
>> +				if (unlikely(!virtqueue_kick(vq->vq))) {
>> +					r = -1;
>> +					break;
>> +				}
>> +			}
>> +
>> +			if (vq->started >= bufs)
>> +				r = -1;
>
> Which condition do we reach here? It is also a copy & paste of virtio_test.c

It means we have finished adding the outbuf to the virtqueue, and 'r' is set
to '-1' so that we can break out of the inner loop if virtqueue_get_buf()
returns no result, as I understand it.

>
>> +
>> +			/* Flush out completed bufs if any */
>> +			while (virtqueue_get_buf(vq->vq, &len)) {
>> +				int n;
>> +
>> +				n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
>> +				assert(n == TEST_BUF_LEN);
>> +				verify_res_buf(dev->res_buf);
>> +
>> +				++vq->completed;
>> +				r = 0;
>> +			}
>> +		} while (r == 0);
>> +
>> +		if (vq->completed == completed_before && vq->started == started_before)
>> +			++spurious;
>> +
>> +		assert(vq->completed <= bufs);
>> +		assert(vq->started <= bufs);
>> +		if (vq->completed == bufs)
>> +			break;
>> +
>> +		if (delayed) {
>> +			if (virtqueue_enable_cb_delayed(vq->vq))
>> +				wait_for_interrupt(vq);
>> +		} else {
>> +			if (virtqueue_enable_cb(vq->vq))
>> +				wait_for_interrupt(vq);
>> +		}
>
> This could be simplified with
>
> if (delayed)
> else
>
> wait_for_interrupt(vq)

I am not sure I understand the above comment. wait_for_interrupt() is only
called conditionally, depending on the return value of
virtqueue_enable_cb_delayed() or virtqueue_enable_cb().

>
>> +	}
>> +	printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
>> +	       spurious, vq->started, vq->completed);
>> +}
>> +
...
>> +
>> +		/* Flush out completed bufs if any */
>> +		while (virtqueue_get_buf(vq->vq, &len)) {
>> +			struct ether_header *eh;
>> +
>> +			eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
>> +
>> +			/* tun netdev is up and running, ignore the
>> +			 * non-TEST_PTYPE packet.
>> +			 */
>> +			if (eh->ether_type != htons(TEST_PTYPE)) {
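
If the bounded poll() is adopted, the helper would presumably also want to
distinguish a timeout from a real event. A minimal sketch of what that could
look like, assuming the vq_info layout quoted above (this is not the actual
patch):

/* Wait up to 100ms for the interrupt eventfd to fire. A zero return
 * from poll() means the timeout expired with no event, in which case
 * the caller simply retries its enable_cb/get_buf loop.
 */
static void wait_for_interrupt(struct vq_info *vq)
{
	unsigned long long val;
	int r;

	r = poll(&vq->fds, 1, 100);
	assert(r >= 0);	/* poll() itself must not fail */
	if (r == 0)
		return;	/* timed out, no interrupt */

	if (vq->fds.revents & POLLIN)
		read(vq->fds.fd, &val, sizeof(val));
}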