Re: [PATCH net-next V2] virtio-net: synchronize operstate with admin state on up/down

2024-06-19 Thread Michael S. Tsirkin
On Thu, Jun 06, 2024 at 08:22:13AM +0800, Jason Wang wrote:
> On Fri, May 31, 2024 at 8:18 AM Jason Wang  wrote:
> >
> > On Thu, May 30, 2024 at 9:09 PM Michael S. Tsirkin  wrote:
> > >
> > > On Thu, May 30, 2024 at 06:29:51PM +0800, Jason Wang wrote:
> > > > > On Thu, May 30, 2024 at 2:10 PM Michael S. Tsirkin  wrote:
> > > > >
> > > > > On Thu, May 30, 2024 at 11:20:55AM +0800, Jason Wang wrote:
> > > > > > This patch synchronizes operstate with admin state per RFC 2863.
> > > > > >
> > > > > > This is done by trying to toggle the carrier upon open/close and
> > > > > > synchronizing with the config change work. This allows the status
> > > > > > to propagate correctly to stacked devices like:
> > > > > >
> > > > > > ip link add link enp0s3 macvlan0 type macvlan
> > > > > > ip link set link enp0s3 down
> > > > > > ip link show
> > > > > >
> > > > > > Before this patch:
> > > > > >
> > > > > > 3: enp0s3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
> > > > > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > > > ..
> > > > > > 5: macvlan0@enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
> > > > > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > > > >
> > > > > > After this patch:
> > > > > >
> > > > > > 3: enp0s3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
> > > > > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > > > ...
> > > > > > 5: macvlan0@enp0s3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> > > > > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > > > >
> > > > > > Cc: Venkat Venkatsubra 
> > > > > > Cc: Gia-Khanh Nguyen 
> > > > > > Reviewed-by: Xuan Zhuo 
> > > > > > Acked-by: Michael S. Tsirkin 
> > > > > > Signed-off-by: Jason Wang 
> > > > > > ---
> > > > > > Changes since V1:
> > > > > > - rebase
> > > > > > - add ack/review tags
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > ---
> > > > > >  drivers/net/virtio_net.c | 94 
> > > > > > +++-
> > > > > >  1 file changed, 63 insertions(+), 31 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > index 4a802c0ea2cb..69e4ae353c51 100644
> > > > > > --- a/drivers/net/virtio_net.c
> > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > @@ -433,6 +433,12 @@ struct virtnet_info {
> > > > > >   /* The lock to synchronize the access to refill_enabled */
> > > > > >   spinlock_t refill_lock;
> > > > > >
> > > > > > + /* Is config change enabled? */
> > > > > > + bool config_change_enabled;
> > > > > > +
> > > > > > + /* The lock to synchronize the access to config_change_enabled */
> > > > > > + spinlock_t config_change_lock;
> > > > > > +
> > > > > >   /* Work struct for config space updates */
> > > > > >   struct work_struct config_work;
> > > > > >
> > > > >
> > > > >
> > > > > But we already have dev->config_lock and dev->config_enabled.
> > > > >
> > > > > And it actually works better - instead of discarding config
> > > > > change events it defers them until enabled.
> > > > >
> > > >
> > > > Yes but then both virtio-net driver and virtio core can ask to enable
> > > > and disable and then we need some kind of synchronization which is
> > > > non-trivial.
> > >
> > > Well for core it happens on bring up path before driver works
> > > and later on tear down after it is gone.
> > > So I do not think they ever do it at the same time.
> >
> > For example, there could be a suspend/resume when the admin state is down.
> >
> > >
> > >
> > > > And device enabling on the core is different from bringing the device
> > > > up in the networking subsystem. Here we just delay dealing with the
> > > > config change interrupt until ndo_open(). (E.g. trying to ack an
> > > > announce is meaningless when the device is down.)
> > > >
> > > > Thanks
> > >
> > > another thing is that it is better not to re-read all config
> > > on link up if there was no config interrupt - less vm exits.
> >
> > Yes, but it should not matter much as it's done in the ndo_open().
> 
> Michael, any more comments on this?
> 
> Please confirm whether this patch is ok or not. If you prefer to reuse
> config_disable(), I can change it from a boolean to a counter that
> allows nesting.
> 
> Thanks

I think doing this in core and re-using config_lock is a cleaner approach for 
sure, yes.

For remove we do not need stacking. But yes we need it for freeze.

I am not sure how a counter will work. Let's see the patch.


It might be easier, besides config_enabled, to add config_disabled,
document that config_enabled is controlled by the core and
config_disabled by the driver.
And we'd have two APIs, virtio_config_disable_core() and
virtio_config_disable_driver().
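
Roughly like this, as a sketch (the field names below are hypothetical;
today the core only has dev->config_enabled under dev->config_lock):

/* driver-owned veto, e.g. toggled from ndo_open()/ndo_stop() */
void virtio_config_disable_driver(struct virtio_device *dev)
{
        unsigned long flags;

        spin_lock_irqsave(&dev->config_lock, flags);
        dev->config_driver_disabled = true;     /* hypothetical field */
        spin_unlock_irqrestore(&dev->config_lock, flags);
}

/* the config change work would then run only when neither side vetoes */
static bool virtio_config_allowed(struct virtio_device *dev)
{
        return dev->config_core_enabled &&      /* hypothetical field */
               !dev->config_driver_disabled;
}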


But I leave that decision in your capable hands.

Re: [PATCH] livepatch: introduce klp_func called interface

2024-06-19 Thread zhang warden



> On Jun 7, 2024, at 17:07, Miroslav Benes  wrote:
> 
> It would be better than this patch but given what was mentioned in the 
> thread I wonder if it is possible to use ftrace even for this. See 
> /sys/kernel/tracing/trace_stat/function*. It already gathers the number of 
> hits.
> 

Hi, Miroslav

Is ftrace able to trace functions which I added into the kernel by livepatching?
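
For reference, the trace_stat interface mentioned above can be exercised
like this (a sketch, assuming CONFIG_FUNCTION_PROFILER=y; whether it also
counts livepatch-added functions is exactly the question):

echo 1 > /sys/kernel/tracing/function_profile_enabled
cat /sys/kernel/tracing/trace_stat/function0   # per-CPU hit counts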

Regards,
Wardenjohn


Re: [PATCH v2 5/7] bpf: do not create bpf_non_sleepable_error_inject list when unnecessary

2024-06-19 Thread Alexei Starovoitov
On Wed, Jun 19, 2024 at 3:49 PM Vlastimil Babka  wrote:
>
> When CONFIG_FUNCTION_ERROR_INJECTION is disabled,
> within_error_injection_list() will return false for any address, and the
> check_non_sleepable_error_inject() denylist check is thus redundant.
> The bpf_non_sleepable_error_inject list thus does not need to be
> constructed at all, so #ifdef it out.
>
> This will allow inlining functions on the list when
> CONFIG_FUNCTION_ERROR_INJECTION is disabled, as there will be no BTF_ID()
> reference for them.
>
> Signed-off-by: Vlastimil Babka 
> ---
>  kernel/bpf/verifier.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 77da1f438bec..5cd93de37d68 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -21044,6 +21044,8 @@ static int check_attach_modify_return(unsigned long addr, const char *func_name)
> return -EINVAL;
>  }
>
> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
> +
>  /* list of non-sleepable functions that are otherwise on
>   * ALLOW_ERROR_INJECTION list
>   */
> @@ -21061,6 +21063,19 @@ static int check_non_sleepable_error_inject(u32 btf_id)
> return btf_id_set_contains(&bpf_non_sleepable_error_inject, btf_id);
>  }
>
> +#else /* CONFIG_FUNCTION_ERROR_INJECTION */
> +
> +/*
> + * Pretend the denylist is empty, within_error_injection_list() will return
> + * false anyway.
> + */
> +static int check_non_sleepable_error_inject(u32 btf_id)
> +{
> +   return 0;
> +}
> +
> +#endif

The comment reads like this is an optimization, but it's a mandatory
ifdef since should_failslab() might not be found by resolve_btfids
during the build.
Please make it clear in the comment.
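
Something along these lines, perhaps (a suggested wording, not a patch):

/*
 * Not an optimization: with CONFIG_FUNCTION_ERROR_INJECTION=n this stub
 * is mandatory, because resolve_btfids may not find the functions
 * (e.g. should_failslab) that the BTF_ID() entries would reference, so
 * the denylist cannot be built at all.
 * within_error_injection_list() returns false anyway.
 */
static int check_non_sleepable_error_inject(u32 btf_id)
{
        return 0;
}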



[PATCH v2 7/7] mm, page_alloc: add static key for should_fail_alloc_page()

2024-06-19 Thread Vlastimil Babka
Similarly to should_failslab(), remove the overhead of calling the
noinline function should_fail_alloc_page() with a static key that guards
the callsite in the page allocator hotpath, and is controlled by the
fault and error injection frameworks and bpf.

Additionally, compile out all relevant code if neither
CONFIG_FAIL_PAGE_ALLOC nor CONFIG_FUNCTION_ERROR_INJECTION is enabled.
When only the latter is disabled, make should_fail_alloc_page()
static inline instead of noinline.

No measurement was done other than verifying that should_fail_alloc_page
is gone from the perf profile. A measurement with the analogous change
for should_failslab() suggests that for a page-allocator-intensive
workload there might be a noticeable improvement. It also makes
CONFIG_FAIL_PAGE_ALLOC an option suitable not only for debug kernels.
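
For context, this is how the key ends up being flipped at runtime through
the existing fault-injection debugfs interface (a sketch; paths assume
CONFIG_FAULT_INJECTION_DEBUG_FS=y):

# a non-zero probability enables should_fail_alloc_page_active
echo 10 > /sys/kernel/debug/fail_page_alloc/probability
echo 100 > /sys/kernel/debug/fail_page_alloc/interval
# writing zero transitions the static key back off
echo 0 > /sys/kernel/debug/fail_page_alloc/probability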

Reviewed-by: Roman Gushchin 
Signed-off-by: Vlastimil Babka 
---
 include/linux/fault-inject.h |  3 ++-
 mm/fail_page_alloc.c |  3 ++-
 mm/internal.h|  2 ++
 mm/page_alloc.c  | 30 +++---
 4 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index 0d0fa94dc1c8..1a782042ae80 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -96,8 +96,9 @@ static inline void fault_config_init(struct fault_config *config,
 
 struct kmem_cache;
 
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
 bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);
-
+#endif
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);
 #else
diff --git a/mm/fail_page_alloc.c b/mm/fail_page_alloc.c
index b1b09cce9394..0906b76d78e8 100644
--- a/mm/fail_page_alloc.c
+++ b/mm/fail_page_alloc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
 #include 
+#include "internal.h"
 
 static struct {
struct fault_attr attr;
@@ -9,7 +10,7 @@ static struct {
bool ignore_gfp_reclaim;
u32 min_order;
 } fail_page_alloc = {
-   .attr = FAULT_ATTR_INITIALIZER,
+   .attr = FAULT_ATTR_INITIALIZER_KEY(&should_fail_alloc_page_active.key),
.ignore_gfp_reclaim = true,
.ignore_gfp_highmem = true,
.min_order = 1,
diff --git a/mm/internal.h b/mm/internal.h
index b2c75b12014e..8539e39b02e6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -410,6 +410,8 @@ extern char * const zone_names[MAX_NR_ZONES];
 /* perform sanity checks on struct pages being allocated or freed */
 DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
 
+DECLARE_STATIC_KEY_FALSE(should_fail_alloc_page_active);
+
 extern int min_free_kbytes;
 
 void setup_per_zone_wmarks(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..b6e246acb4aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3008,11 +3008,35 @@ struct page *rmqueue(struct zone *preferred_zone,
return page;
 }
 
-noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
+#if defined(CONFIG_FUNCTION_ERROR_INJECTION) || defined(CONFIG_FAIL_PAGE_ALLOC)
+DEFINE_STATIC_KEY_FALSE(should_fail_alloc_page_active);
+
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
+noinline
+#else
+static inline
+#endif
+bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 {
return __should_fail_alloc_page(gfp_mask, order);
 }
-ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
+ALLOW_ERROR_INJECTION_KEY(should_fail_alloc_page, TRUE, &should_fail_alloc_page_active);
+
+static __always_inline bool
+should_fail_alloc_page_wrapped(gfp_t gfp_mask, unsigned int order)
+{
+   if (static_branch_unlikely(&should_fail_alloc_page_active))
+   return should_fail_alloc_page(gfp_mask, order);
+
+   return false;
+}
+#else
+static __always_inline bool
+should_fail_alloc_page_wrapped(gfp_t gfp_mask, unsigned int order)
+{
+   return false;
+}
+#endif
 
 static inline long __zone_watermark_unusable_free(struct zone *z,
unsigned int order, unsigned int alloc_flags)
@@ -4430,7 +4454,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
might_alloc(gfp_mask);
 
-   if (should_fail_alloc_page(gfp_mask, order))
+   if (should_fail_alloc_page_wrapped(gfp_mask, order))
return false;
 
*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);

-- 
2.45.2




[PATCH v2 6/7] mm, slab: add static key for should_failslab()

2024-06-19 Thread Vlastimil Babka
Since commit 4f6923fbb352 ("mm: make should_failslab always available for
fault injection") should_failslab() is unconditionally a noinline
function. This adds visible overhead to the slab allocation hotpath,
even if the function is empty. With CONFIG_FAILSLAB=y there's additional
overhead, even when the functionality is not activated by a boot
parameter or via debugfs.

The overhead can be eliminated with a static key around the callsite.
Fault injection and error injection frameworks including bpf can now be
told that this function has a static key associated, and are able to
enable and disable it accordingly.

Additionally, compile out all relevant code if neither CONFIG_FAILSLAB
nor CONFIG_FUNCTION_ERROR_INJECTION is enabled. When only the latter is
not enabled, make should_failslab() static inline instead of noinline.

To demonstrate the reduced overhead of calling an empty
should_failslab() function, a kernel build with
CONFIG_FUNCTION_ERROR_INJECTION enabled but CONFIG_FAILSLAB disabled,
and CPU mitigations enabled, was used in a qemu-kvm (virtme-ng) on AMD
Ryzen 7 2700 machine, and execution of a program trying to open() a
non-existent file was measured 3 times:

for (int i = 0; i < 10000000; i++) {
open("non_existent", O_RDONLY);
}
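
A self-contained version of the measured program, as a sketch (no error
handling; timed externally, e.g. with time(1)):

#include <fcntl.h>

int main(void)
{
        /* repeatedly fail an open() to exercise the slab hotpath */
        for (int i = 0; i < 10000000; i++)
                open("non_existent", O_RDONLY);
        return 0;
}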

After this patch, the measured real time was 4.3% smaller. Using perf
profiling it was verified that should_failslab was gone from the
profile.

With CONFIG_FAILSLAB also enabled, the patched kernel's performance was
unaffected, as expected, while the unpatched kernel's performance was
worse, resulting in a relative speedup of 10.5%. This means it no longer
needs to be an option suitable only for debug kernel builds.

Acked-by: Alexei Starovoitov 
Reviewed-by: Roman Gushchin 
Signed-off-by: Vlastimil Babka 
---
 include/linux/fault-inject.h |  4 +++-
 mm/failslab.c|  2 +-
 mm/slab.h|  3 +++
 mm/slub.c| 30 +++---
 4 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index cfe75cc1bac4..0d0fa94dc1c8 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -107,9 +107,11 @@ static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 }
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
 int should_failslab(struct kmem_cache *s, gfp_t gfpflags);
+#endif
 #ifdef CONFIG_FAILSLAB
-extern bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags);
+bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags);
 #else
 static inline bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags)
 {
diff --git a/mm/failslab.c b/mm/failslab.c
index ffc420c0e767..878fd08e5dac 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -9,7 +9,7 @@ static struct {
bool ignore_gfp_reclaim;
bool cache_filter;
 } failslab = {
-   .attr = FAULT_ATTR_INITIALIZER,
+   .attr = FAULT_ATTR_INITIALIZER_KEY(&should_failslab_active.key),
.ignore_gfp_reclaim = true,
.cache_filter = false,
 };
diff --git a/mm/slab.h b/mm/slab.h
index 5f8f47c5bee0..792e19cb37b8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Internal slab definitions
@@ -160,6 +161,8 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(freelist_aba_t)
  */
 #define slab_page(s) folio_page(slab_folio(s), 0)
 
+DECLARE_STATIC_KEY_FALSE(should_failslab_active);
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/slub.c b/mm/slub.c
index 0809760cf789..11980aa94631 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3874,13 +3874,37 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
0, sizeof(void *));
 }
 
-noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
+#if defined(CONFIG_FUNCTION_ERROR_INJECTION) || defined(CONFIG_FAILSLAB)
+DEFINE_STATIC_KEY_FALSE(should_failslab_active);
+
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
+noinline
+#else
+static inline
+#endif
+int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
 {
if (__should_failslab(s, gfpflags))
return -ENOMEM;
return 0;
 }
-ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
+ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, &should_failslab_active);
+
+static __always_inline int should_failslab_wrapped(struct kmem_cache *s,
+  gfp_t gfp)
+{
+   if (static_branch_unlikely(&should_failslab_active))
+   return should_failslab(s, gfp);
+   else
+   return 0;
+}
+#else
+static __always_inline int should_failslab_wrapped(struct kmem_cache *s,
+  gfp_t gfp)
+{
+   return false;
+}
+#endif
 
 static __fastpath_inline
 struct kmem_cache 

[PATCH v2 4/7] bpf: support error injection static keys for multi_link attached progs

2024-06-19 Thread Vlastimil Babka
Functions marked for error injection can have an associated static key
that guards the callsite(s) to avoid overhead of calling an empty
function when no error injection is in progress.

Outside of the error injection framework itself, bpf programs can be
attached to kprobes and override results of error-injectable functions.
To make sure these functions are actually called, attaching such bpf
programs should control the static key accordingly.

Therefore, add an array of static keys to struct bpf_kprobe_multi_link
and fill it in addrs_check_error_injection_list() for programs with
kprobe_override enabled, using get_injection_key() instead of
within_error_injection_list(). Introduce bpf_kprobe_ei_keys_control() to
control the static keys and call the control function when doing
multi_link_attach and release.
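
For reference, a minimal sketch of the reference-counted semantics the
attach/release paths rely on (demo-only names, not part of the patch):

static DEFINE_STATIC_KEY_FALSE(demo_key);

static void demo(void)
{
        static_key_slow_inc(&demo_key.key);     /* link A attaches: key enabled */
        static_key_slow_inc(&demo_key.key);     /* link B attaches: count is 2 */
        static_key_slow_dec(&demo_key.key);     /* A detaches: still enabled */
        static_key_slow_dec(&demo_key.key);     /* B detaches: key disabled */
}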

Signed-off-by: Vlastimil Babka 
---
 kernel/trace/bpf_trace.c | 59 +++-
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 944de1c41209..ef0fadb76bfa 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2613,6 +2613,7 @@ struct bpf_kprobe_multi_link {
struct bpf_link link;
struct fprobe fp;
unsigned long *addrs;
+   struct static_key **ei_keys;
u64 *cookies;
u32 cnt;
u32 mods_cnt;
@@ -2687,11 +2688,30 @@ static void free_user_syms(struct user_syms *us)
kvfree(us->buf);
 }
 
+static void bpf_kprobe_ei_keys_control(struct bpf_kprobe_multi_link *link, bool enable)
+{
+   u32 i;
+
+   for (i = 0; i < link->cnt; i++) {
+   if (!link->ei_keys[i])
+   break;
+
+   if (enable)
+   static_key_slow_inc(link->ei_keys[i]);
+   else
+   static_key_slow_dec(link->ei_keys[i]);
+   }
+}
+
 static void bpf_kprobe_multi_link_release(struct bpf_link *link)
 {
struct bpf_kprobe_multi_link *kmulti_link;
 
kmulti_link = container_of(link, struct bpf_kprobe_multi_link, link);
+
+   if (kmulti_link->ei_keys)
+   bpf_kprobe_ei_keys_control(kmulti_link, false);
+
unregister_fprobe(&kmulti_link->fp);
kprobe_multi_put_modules(kmulti_link->mods, kmulti_link->mods_cnt);
 }
@@ -2703,6 +2723,7 @@ static void bpf_kprobe_multi_link_dealloc(struct bpf_link *link)
kmulti_link = container_of(link, struct bpf_kprobe_multi_link, link);
kvfree(kmulti_link->addrs);
kvfree(kmulti_link->cookies);
+   kvfree(kmulti_link->ei_keys);
kfree(kmulti_link->mods);
kfree(kmulti_link);
 }
@@ -2985,13 +3006,19 @@ static int get_modules_for_addrs(struct module ***mods, unsigned long *addrs, u32 addrs_cnt)
return arr.mods_cnt;
 }
 
-static int addrs_check_error_injection_list(unsigned long *addrs, u32 cnt)
+static int addrs_check_error_injection_list(unsigned long *addrs, struct static_key **ei_keys,
+   u32 cnt)
 {
-   u32 i;
+   struct static_key *ei_key;
+   u32 i, j = 0;
 
for (i = 0; i < cnt; i++) {
-   if (!within_error_injection_list(addrs[i]))
+   ei_key = get_injection_key(addrs[i]);
+   if (IS_ERR(ei_key))
return -EINVAL;
+
+   if (ei_key)
+   ei_keys[j++] = ei_key;
}
return 0;
 }
@@ -3000,6 +3027,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
 {
struct bpf_kprobe_multi_link *link = NULL;
struct bpf_link_primer link_primer;
+   struct static_key **ei_keys = NULL;
void __user *ucookies;
unsigned long *addrs;
u32 flags, cnt, size;
@@ -3075,9 +3103,24 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
goto error;
}
 
-   if (prog->kprobe_override && addrs_check_error_injection_list(addrs, cnt)) {
-   err = -EINVAL;
-   goto error;
+   if (prog->kprobe_override) {
+   ei_keys = kvcalloc(cnt, sizeof(*ei_keys), GFP_KERNEL);
+   if (!ei_keys) {
+   err = -ENOMEM;
+   goto error;
+   }
+
+   if (addrs_check_error_injection_list(addrs, ei_keys, cnt)) {
+   err = -EINVAL;
+   goto error;
+   }
+
+   if (ei_keys[0]) {
+   link->ei_keys = ei_keys;
+   } else {
+   kvfree(ei_keys);
+   ei_keys = NULL;
+   }
}
 
link = kzalloc(sizeof(*link), GFP_KERNEL);
@@ -3132,10 +3175,14 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
return err;
}
 
+   if (link->ei_keys)
+   bpf_kprobe_ei_keys_control(link, true);
+

[PATCH v2 1/7] fault-inject: add support for static keys around fault injection sites

2024-06-19 Thread Vlastimil Babka
Some fault injection sites are placed in hotpaths and incur overhead
even if not enabled, due to one or more function calls leading up to
should_fail_ex() that returns false due to attr->probability == 0.

This overhead can be eliminated if the outermost call into the checks is
guarded with a static key, so add support for that. The framework should
be told that such a static key exists for a fault_attr by initializing
fault_attr->active with the static key's address. When it's not NULL,
enable the static key from setup_fault_attr() when the fault probability
is non-zero.

Also wire up writing into debugfs "probability" file to enable or
disable the static key when transitioning between zero and non-zero
probability.

For now, do not add configfs interface support as the immediate plan is
to leverage this for should_failslab() and should_fail_alloc_page()
after other necessary preparatory changes, and not for any of the
configfs based fault injection users.
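
A sketch of how a fault injection site wires this up (placeholder names;
patches 6 and 7 do exactly this for failslab and fail_page_alloc):

DEFINE_STATIC_KEY_FALSE(my_fault_active);       /* hypothetical */

static struct fault_attr my_fault =
        FAULT_ATTR_INITIALIZER_KEY(&my_fault_active.key);

static bool my_hotpath_should_fail(size_t size)
{
        /* no overhead here while the key is disabled */
        if (static_branch_unlikely(&my_fault_active))
                return should_fail(&my_fault, size);
        return false;
}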

Signed-off-by: Vlastimil Babka 
---
 include/linux/fault-inject.h |  7 ++-
 lib/fault-inject.c   | 43 ++-
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index 6d5edef09d45..cfe75cc1bac4 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * For explanation of the elements of this struct, see
@@ -30,13 +31,14 @@ struct fault_attr {
unsigned long count;
struct ratelimit_state ratelimit_state;
struct dentry *dname;
+   struct static_key *active;
 };
 
 enum fault_flags {
FAULT_NOWARN =  1 << 0,
 };
 
-#define FAULT_ATTR_INITIALIZER {   \
+#define FAULT_ATTR_INITIALIZER_KEY(_key) { \
.interval = 1,  \
.times = ATOMIC_INIT(1),\
.require_end = ULONG_MAX,   \
@@ -44,8 +46,11 @@ enum fault_flags {
.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,   \
.verbose = 2,   \
.dname = NULL,  \
+   .active = (_key),   \
}
 
+#define FAULT_ATTR_INITIALIZER FAULT_ATTR_INITIALIZER_KEY(NULL)
+
#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
 int setup_fault_attr(struct fault_attr *attr, char *str);
 bool should_fail_ex(struct fault_attr *attr, ssize_t size, int flags);
diff --git a/lib/fault-inject.c b/lib/fault-inject.c
index d608f9b48c10..de9552cb22d0 100644
--- a/lib/fault-inject.c
+++ b/lib/fault-inject.c
@@ -35,6 +35,9 @@ int setup_fault_attr(struct fault_attr *attr, char *str)
atomic_set(&attr->times, times);
atomic_set(&attr->space, space);
 
+   if (probability != 0 && attr->active)
+   static_key_slow_inc(attr->active);
+
return 1;
 }
 EXPORT_SYMBOL_GPL(setup_fault_attr);
@@ -166,6 +169,12 @@ EXPORT_SYMBOL_GPL(should_fail);
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
 
+/*
+ * Protect updating probability from debugfs as that may trigger static key
+ * changes when changing between zero and non-zero.
+ */
+static DEFINE_MUTEX(probability_mutex);
+
 static int debugfs_ul_set(void *data, u64 val)
 {
*(unsigned long *)data = val;
@@ -186,6 +195,38 @@ static void debugfs_create_ul(const char *name, umode_t mode,
debugfs_create_file(name, mode, parent, value, &fops_ul);
 }
 
+static int debugfs_prob_set(void *data, u64 val)
+{
+   struct fault_attr *attr = data;
+
+   mutex_lock(&probability_mutex);
+
+   if (attr->active) {
+   if (attr->probability != 0 && val == 0) {
+   static_key_slow_dec(attr->active);
+   } else if (attr->probability == 0 && val != 0) {
+   static_key_slow_inc(attr->active);
+   }
+   }
+
+   attr->probability = val;
+
+   mutex_unlock(&probability_mutex);
+
+   return 0;
+}
+
+static int debugfs_prob_get(void *data, u64 *val)
+{
+   struct fault_attr *attr = data;
+
+   *val = attr->probability;
+
+   return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_prob, debugfs_prob_get, debugfs_prob_set, "%llu\n");
+
 #ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER
 
 static int debugfs_stacktrace_depth_set(void *data, u64 val)
@@ -218,7 +259,7 @@ struct dentry *fault_create_debugfs_attr(const char *name,
if (IS_ERR(dir))
return dir;
 
-   debugfs_create_ul("probability", mode, dir, >probability);
+   debugfs_create_file("probability", mode, dir, attr, _prob);
debugfs_create_ul("interval", mode, dir, >interval);
debugfs_create_atomic_t("times", mode, dir, >times);

[PATCH v2 2/7] error-injection: support static keys around injectable functions

2024-06-19 Thread Vlastimil Babka
Error injectable functions cannot be inlined, and since some are called
from hot paths, this incurs overhead even if no error injection is
enabled for them.

To avoid this overhead when disabled, allow the callsites of error
injectable functions to put the calls behind a static key, which the
framework can control when error injection is enabled or disabled for
the function.

Introduce a new ALLOW_ERROR_INJECTION_KEY() macro that adds a parameter
with the static key's address, and store it in struct
error_injection_entry. This new field has caused a mismatch when
populating the injection list from the _error_injection_whitelist
section using the current STRUCT_ALIGN(), so change the alignment to 8.

During the population, copy the key's address also to struct ei_entry,
and make it possible to retrieve it by get_injection_key().

Finally, make the processing of writes to the debugfs inject file enable
the static key when the function is added to the injection list, and
disable when removed.
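
Usage then looks like this (a sketch with placeholder names; patches 6
and 7 apply the same pattern to should_failslab() and
should_fail_alloc_page()):

DEFINE_STATIC_KEY_FALSE(my_func_active);        /* hypothetical */

noinline int my_func(void)
{
        return 0;       /* real work elided */
}
ALLOW_ERROR_INJECTION_KEY(my_func, ERRNO, &my_func_active);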

Signed-off-by: Vlastimil Babka 
---
 include/asm-generic/error-injection.h | 13 -
 include/asm-generic/vmlinux.lds.h |  2 +-
 include/linux/error-injection.h   | 12 ++--
 kernel/fail_function.c| 10 ++
 lib/error-inject.c| 19 +++
 5 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/error-injection.h b/include/asm-generic/error-injection.h
index b05253f68eaa..eed2731f3820 100644
--- a/include/asm-generic/error-injection.h
+++ b/include/asm-generic/error-injection.h
@@ -12,6 +12,7 @@ enum {
 
 struct error_injection_entry {
unsigned long   addr;
+   unsigned long   static_key_addr;
int etype;
 };
 
@@ -25,16 +26,26 @@ struct pt_regs;
  * 'Error Injectable Functions' section.
  */
 #define ALLOW_ERROR_INJECTION(fname, _etype)   \
-static struct error_injection_entry __used \
+static struct error_injection_entry __used __aligned(8) \
__section("_error_injection_whitelist") \
_eil_addr_##fname = {   \
.addr = (unsigned long)fname,   \
.etype = EI_ETYPE_##_etype, \
}
 
+#define ALLOW_ERROR_INJECTION_KEY(fname, _etype, key)  \
+static struct error_injection_entry __used __aligned(8) \
+   __section("_error_injection_whitelist") \
+   _eil_addr_##fname = {   \
+   .addr = (unsigned long)fname,   \
+   .static_key_addr = (unsigned long)key,  \
+   .etype = EI_ETYPE_##_etype, \
+   }
+
 void override_function_with_return(struct pt_regs *regs);
 #else
 #define ALLOW_ERROR_INJECTION(fname, _etype)
+#define ALLOW_ERROR_INJECTION_KEY(fname, _etype, key)
 
 static inline void override_function_with_return(struct pt_regs *regs) { }
 #endif
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5703526d6ebf..1b15a0af2a00 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -248,7 +248,7 @@
 
 #ifdef CONFIG_FUNCTION_ERROR_INJECTION
 #define ERROR_INJECT_WHITELIST()   \
-   STRUCT_ALIGN(); \
+   . = ALIGN(8);   \
BOUNDED_SECTION(_error_injection_whitelist)
 #else
 #define ERROR_INJECT_WHITELIST()
diff --git a/include/linux/error-injection.h b/include/linux/error-injection.h
index 20e738f4eae8..48da027c0302 100644
--- a/include/linux/error-injection.h
+++ b/include/linux/error-injection.h
@@ -6,10 +6,13 @@
 #include 
 #include 
 
+struct static_key;
+
 #ifdef CONFIG_FUNCTION_ERROR_INJECTION
 
-extern bool within_error_injection_list(unsigned long addr);
-extern int get_injectable_error_type(unsigned long addr);
+bool within_error_injection_list(unsigned long addr);
+int get_injectable_error_type(unsigned long addr);
+struct static_key *get_injection_key(unsigned long addr);
 
 #else /* !CONFIG_FUNCTION_ERROR_INJECTION */
 
@@ -23,6 +26,11 @@ static inline int get_injectable_error_type(unsigned long addr)
return -EOPNOTSUPP;
 }
 
+static inline struct static_key *get_injection_key(unsigned long addr)
+{
+   return NULL;
+}
+
 #endif
 
 #endif /* _LINUX_ERROR_INJECTION_H */
diff --git a/kernel/fail_function.c b/kernel/fail_function.c
index d971a0189319..d39a9606a448 100644
--- a/kernel/fail_function.c
+++ b/kernel/fail_function.c
@@ -27,6 +27,7 @@ struct fei_attr {
struct list_head list;
struct kprobe kp;
unsigned long retval;
+   struct static_key *key;
 };
 static DEFINE_MUTEX(fei_lock);
 static LIST_HEAD(fei_attr_list);

[PATCH v2 0/7] static key support for error injection functions

2024-06-19 Thread Vlastimil Babka
This should now be complete, but perf_events attached bpf programs are
untested (Patch 3).
This is spread across several subsystems but the easiest way would be
to go through a single tree, such as the mm tree.

As previously mentioned by myself [1] and others [2] the functions
designed for error injection can bring visible overhead in fastpaths
such as slab or page allocation, because even if nothing hooks into them
at a given moment, they are noninline function calls regardless of
CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
always available for fault injection") and af3b854492f3
("mm/page_alloc.c: allow error injection").

Live patching their callsites has been also suggested in both [1] and
[2] threads, and this is an attempt to do that with static keys that
guard the call sites. When disabled, the error injection functions still
exist and are noinline, but are not being called. Any of the existing
mechanisms that can inject errors should make sure to enable the
respective static key. I have added that support to hopefully all of
them that can be used today.

- the legacy fault injection, i.e. CONFIG_FAILSLAB and
  CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the
  address of the static key if it exists. The key will be activated if the
  fault injection probability becomes non-zero, and deactivated in the
  opposite transition. This also removes the overhead of the evaluation
  (on top of the noninline function call) when these mechanisms are
  configured in the kernel but unused at the moment.

- the generic error injection using kretprobes with
  override_function_with_return is handled in Patch 2. The
  ALLOW_ERROR_INJECTION() annotation is extended so that static key
  address can be passed, and the framework controls it when error
  injection is enabled or disabled in debugfs for the function.

- bpf programs can override return values of probed functions with
  CONFIG_BPF_KPROBE_OVERRIDE=y and have prog->kprobe_override=1. They
  can be attached to perf_event, which is handled in Patch 3, or via
  multi_link_attach, which is handled in Patch 4. I have tested the
  latter using a modified bcc program from commit 4f6923fbb352
  description, but not Patch 3 using a perf_event - testing is welcome.

- ftrace seems to be using override_function_with_return from
  #define ftrace_override_function_with_return but there appear to be
  no users, which was confirmed by Mark Rutland in the RFC thread.

If anyone was crazy enough to use multiple of the mechanisms above
simultaneously, the usage of static_key_slow_inc/dec will do the right
thing and the key will be enabled iff at least one mechanism is active.

In addition to the static key support, Patch 5 makes it possible to
stop making the fault injection functions noninline with
CONFIG_FUNCTION_ERROR_INJECTION=n by compiling out the BTF_ID()
references for bpf_non_sleepable_error_inject which are unnecessary in
that case.

Patches 6 and 7 implement the static keys for the two mm fault injection
sites in slab and page allocators. I have measured the improvement for
the slab case, as described in Patch 6:

To demonstrate the reduced overhead of calling an empty
should_failslab() function, a kernel build with
CONFIG_FUNCTION_ERROR_INJECTION enabled but CONFIG_FAILSLAB disabled,
and CPU mitigations enabled, was used in a qemu-kvm (virtme-ng) on AMD
Ryzen 7 2700 machine, and execution of a program trying to open() a
non-existent file was measured 3 times:

for (int i = 0; i < 10000000; i++) {
open("non_existent", O_RDONLY);
}

After this patch, the measured real time was 4.3% smaller. Using perf
profiling it was verified that should_failslab was gone from the
profile.

With CONFIG_FAILSLAB also enabled, the patched kernel's performance was
unaffected, as expected, while the unpatched kernel's performance was
worse, resulting in a relative speedup of 10.5%. This means it no longer
needs to be an option suitable only for debug kernel builds.

There might be other such fault injection callsites in hotpaths of other
subsystems but I didn't search for them at this point. With all the
preparations in place, it should be simple to improve them now.

FAQ:

Q: Does this improve only config options nobody uses in production
anyway?

A: No, the error injection hooks are unconditionally noninline functions
even if they are empty. CONFIG_FUNCTION_ERROR_INJECTION=y is probably
rather common, and overrides are done via bpf. The goal was to eliminate
this unnecessary overhead. But as a secondary benefit, the legacy
fault injection options can now also be enabled in production kernels
without extra overhead.

Q: Should we remove the legacy fault injection framework?

A: Maybe? I didn't want to wait for that to happen, so it's just handled
as well (Patch 1). The generic error injection handling and bpf needed
the most effort anyway.

Q: Should there be a unified 

[PATCH v2 5/7] bpf: do not create bpf_non_sleepable_error_inject list when unnecessary

2024-06-19 Thread Vlastimil Babka
When CONFIG_FUNCTION_ERROR_INJECTION is disabled,
within_error_injection_list() will return false for any address, and the
check_non_sleepable_error_inject() denylist check is thus redundant.
The bpf_non_sleepable_error_inject list thus does not need to be
constructed at all, so #ifdef it out.

This will allow inlining functions on the list when
CONFIG_FUNCTION_ERROR_INJECTION is disabled, as there will be no BTF_ID()
reference for them.

Signed-off-by: Vlastimil Babka 
---
 kernel/bpf/verifier.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 77da1f438bec..5cd93de37d68 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -21044,6 +21044,8 @@ static int check_attach_modify_return(unsigned long addr, const char *func_name)
return -EINVAL;
 }
 
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
+
 /* list of non-sleepable functions that are otherwise on
  * ALLOW_ERROR_INJECTION list
  */
@@ -21061,6 +21063,19 @@ static int check_non_sleepable_error_inject(u32 btf_id)
return btf_id_set_contains(&bpf_non_sleepable_error_inject, btf_id);
 }
 
+#else /* CONFIG_FUNCTION_ERROR_INJECTION */
+
+/*
+ * Pretend the denylist is empty, within_error_injection_list() will return
+ * false anyway.
+ */
+static int check_non_sleepable_error_inject(u32 btf_id)
+{
+   return 0;
+}
+
+#endif
+
 int bpf_check_attach_target(struct bpf_verifier_log *log,
const struct bpf_prog *prog,
const struct bpf_prog *tgt_prog,

-- 
2.45.2




[PATCH v2 3/7] bpf: support error injection static keys for perf_event attached progs

2024-06-19 Thread Vlastimil Babka
Functions marked for error injection can have an associated static key
that guards the callsite(s) to avoid overhead of calling an empty
function when no error injection is in progress.

Outside of the error injection framework itself, bpf programs can be
attached to perf events and override results of error-injectable
functions. To make sure these functions are actually called, attaching
such bpf programs should control the static key accordingly.

Therefore, add the static key's address to struct trace_kprobe and fill
it in trace_kprobe_error_injectable(), using get_injection_key() instead
of within_error_injection_list(). Introduce
trace_kprobe_error_injection_control() to control the static key and
call the control function when attaching or detaching programs with
kprobe_override to perf events.

Signed-off-by: Vlastimil Babka 
---
 kernel/trace/bpf_trace.c|  6 ++
 kernel/trace/trace_kprobe.c | 30 --
 kernel/trace/trace_probe.h  |  5 +
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f5154c051d2c..944de1c41209 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2283,6 +2283,9 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
rcu_assign_pointer(event->tp_event->prog_array, new_array);
bpf_prog_array_free_sleepable(old_array);
 
+   if (prog->kprobe_override)
+   trace_kprobe_error_injection_control(event->tp_event, true);
+
 unlock:
mutex_unlock(&bpf_event_mutex);
return ret;
@@ -2299,6 +2302,9 @@ void perf_event_detach_bpf_prog(struct perf_event *event)
if (!event->prog)
goto unlock;
 
+   if (event->prog->kprobe_override)
+   trace_kprobe_error_injection_control(event->tp_event, false);
+
old_array = bpf_event_rcu_dereference(event->tp_event->prog_array);
ret = bpf_prog_array_copy(old_array, event->prog, NULL, 0, &new_array);
if (ret == -ENOENT)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 16383247bdbf..1c1ee95bd5de 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -61,6 +61,7 @@ struct trace_kprobe {
unsigned long __percpu *nhit;
const char  *symbol;/* symbol name */
struct trace_probe  tp;
+   struct static_key   *ei_key;
 };
 
 static bool is_trace_kprobe(struct dyn_event *ev)
@@ -235,9 +236,34 @@ bool trace_kprobe_on_func_entry(struct trace_event_call *call)
 bool trace_kprobe_error_injectable(struct trace_event_call *call)
 {
struct trace_kprobe *tk = trace_kprobe_primary_from_call(call);
+   struct static_key *ei_key;
 
-   return tk ? within_error_injection_list(trace_kprobe_address(tk)) :
-  false;
+   if (!tk)
+   return false;
+
+   ei_key = get_injection_key(trace_kprobe_address(tk));
+   if (IS_ERR(ei_key))
+   return false;
+
+   tk->ei_key = ei_key;
+   return true;
+}
+
+void trace_kprobe_error_injection_control(struct trace_event_call *call,
+ bool enable)
+{
+   struct trace_kprobe *tk = trace_kprobe_primary_from_call(call);
+
+   if (!tk)
+   return;
+
+   if (!tk->ei_key)
+   return;
+
+   if (enable)
+   static_key_slow_inc(tk->ei_key);
+   else
+   static_key_slow_dec(tk->ei_key);
 }
 
 static int register_kprobe_event(struct trace_kprobe *tk);
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 5803e6a41570..d9ddcabb9f97 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -212,6 +212,8 @@ DECLARE_BASIC_PRINT_TYPE_FUNC(symbol);
 #ifdef CONFIG_KPROBE_EVENTS
 bool trace_kprobe_on_func_entry(struct trace_event_call *call);
 bool trace_kprobe_error_injectable(struct trace_event_call *call);
+void trace_kprobe_error_injection_control(struct trace_event_call *call,
+ bool enabled);
 #else
 static inline bool trace_kprobe_on_func_entry(struct trace_event_call *call)
 {
@@ -222,6 +224,9 @@ static inline bool trace_kprobe_error_injectable(struct trace_event_call *call)
 {
return false;
 }
+
+static inline void trace_kprobe_error_injection_control(struct trace_event_call *call,
+   bool enabled) { }
 #endif /* CONFIG_KPROBE_EVENTS */
 
 struct probe_arg {

-- 
2.45.2




Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-19 Thread Mike Rapoport
From: Mike Rapoport (IBM) 

On Thu, 13 Jun 2024 11:55:06 -0400, Steven Rostedt wrote:
> Reserve unspecified location of physical memory from kernel command line
> 
> Background:
> 
> In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
> dmesg output and some other information when a crash happens in the field.
> (This is only done when the user selects "Allow Google to collect data for
>  improving the system"). But there are cases when there's a bug that
> requires more data to be retrieved to figure out what is happening. We would
> like to increase the pstore size, either temporarily, or maybe even
> permanently. The pstore on these devices are at a fixed location in RAM (as
> the RAM is not cleared on soft reboots nor crashes). The location is chosen
> by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
> There's a driver that queries for this to initialize the pstore for
> ChromeOS:
> 
> [...]

Applied to for-next branch of memblock.git tree, thanks!

[0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
  commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332
[1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
  commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332
[2/2] pstore/ramoops: Add ramoops.mem_name= command line option
  commit: d9d814eebb1ae9742e7fd7f39730653b16326bd4

tree: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
branch: for-next
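
For reference, the resulting usage from the series looks like this
(reserve_mem takes size:align:label, and ramoops consumes the label via
its new mem_name parameter):

reserve_mem=12M:4096:oops ramoops.mem_name=oops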

--
Sincerely yours,
Mike.




Re: [PATCH v2] KUnit: add missing MODULE_DESCRIPTION() macros for lib/test_*.ko

2024-06-19 Thread Kees Cook
On Wed, Jun 19, 2024 at 01:59:15PM -0700, Jeff Johnson wrote:
> make allmodconfig && make W=1 C=1 reports for lib/test_*.ko:
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hexdump.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_dhry.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_firmware.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_sysctl.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hash.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ida.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_list_sort.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_min_heap.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_module.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_sort.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_static_keys.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_static_key_base.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_memcat_p.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_blackhole_dev.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_meminit.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_free_pages.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_kprobes.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ref_tracker.o
> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_bits.o
> 
> Add the missing invocations of the MODULE_DESCRIPTION() macro.
> 
> Signed-off-by: Jeff Johnson 

Thanks for chasing these down!

Reviewed-by: Kees Cook 

-- 
Kees Cook



Re: [PATCH] filemap: add trace events for get_pages, map_pages, and fault

2024-06-19 Thread Matthew Wilcox
On Tue, Jun 18, 2024 at 09:36:56AM +0000, Takaya Saeki wrote:
> + TP_printk(
> + "dev %d:%d ino %lx ofs=%lu max_ofs=%lu",

It seems weird to have a space between dev and %d, but an equals between
ofs and %lu.  I see there is some precedent for this elsewhere, but
there are other places which use ino=.  I'd rather:

"dev=%d:%d ino=%lx ofs=%lu max_ofs=%lu",

> + TP_printk(
> + "dev %d:%d ino %lx ofs=%lu",

Likewise.

> + MAJOR(__entry->s_dev),
> + MINOR(__entry->s_dev), __entry->i_ino,
> + __entry->index << PAGE_SHIFT

This needs to be cast to an loff_t before shifting.
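
I.e., roughly (a sketch combining both points, with the format switched
to %llu to match the widened value; field names as in the quoted patch):

TP_printk("dev=%d:%d ino=%lx ofs=%llu",
	  MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
	  __entry->i_ino,
	  (loff_t)__entry->index << PAGE_SHIFT)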




[PATCH 7/7] ARM: dts: qcom: msm8226: Convert APCS usages to mbox interface

2024-06-19 Thread Luca Weiss
Since we now have the apcs set up as a mailbox provider, let's use the
interface for all drivers where possible.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8226.dtsi | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
index 9deee34fc5ca..5c1122f93054 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
+++ b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
@@ -157,7 +157,7 @@ master-stats {
 
smd-edge {
interrupts = ;
-   qcom,ipc = <&apcs 8 0>;
+   mboxes = <&apcs 0>;
qcom,smd-edge = <15>;
 
rpm_requests: rpm-requests {
@@ -235,7 +235,7 @@ smp2p-adsp {
interrupt-parent = <>;
interrupts = ;
 
-   qcom,ipc = <&apcs 8 10>;
+   mboxes = <&apcs 10>;
 
qcom,local-pid = <0>;
qcom,remote-pid = <2>;
@@ -1232,7 +1232,7 @@ adsp: remoteproc@fe200000 {
smd-edge {
interrupts = ;
 
-   qcom,ipc = <&apcs 8 8>;
+   mboxes = <&apcs 8>;
qcom,smd-edge = <1>;
 
label = "lpass";

-- 
2.45.2




[PATCH 5/7] ARM: dts: qcom: msm8226: Add CPU frequency scaling support

2024-06-19 Thread Luca Weiss
Add a node for the a7pll with its frequencies. With this we can use the
apcs-kpss-global driver for the apcs node and use the apcs to scale the
CPU frequency according to the opp-table.

At the same time, unfortunately, we need to provide the gcc node xo_board
instead of the XO via rpmcc, since otherwise we'd have a circular
dependency between apcs, gcc and the rpm.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8226.dtsi | 103 ++-
 1 file changed, 100 insertions(+), 3 deletions(-)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
index 270973e85625..6e9fbe2e7223 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
+++ b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
@@ -44,6 +44,8 @@ CPU0: cpu@0 {
device_type = "cpu";
reg = <0>;
next-level-cache = <>;
+   clocks = <&a7pll>;
+   operating-points-v2 = <&cpu_opp_table>;
qcom,acc = <>;
qcom,saw = <>;
};
@@ -54,6 +56,8 @@ CPU1: cpu@1 {
device_type = "cpu";
reg = <1>;
next-level-cache = <>;
+   clocks = <&a7pll>;
+   operating-points-v2 = <&cpu_opp_table>;
qcom,acc = <>;
qcom,saw = <>;
};
@@ -64,6 +68,8 @@ CPU2: cpu@2 {
device_type = "cpu";
reg = <2>;
next-level-cache = <>;
+   clocks = <&a7pll>;
+   operating-points-v2 = <&cpu_opp_table>;
qcom,acc = <>;
qcom,saw = <>;
};
@@ -74,6 +80,8 @@ CPU3: cpu@3 {
device_type = "cpu";
reg = <3>;
next-level-cache = <>;
+   clocks = <&a7pll>;
+   operating-points-v2 = <&cpu_opp_table>;
qcom,acc = <>;
qcom,saw = <>;
};
@@ -98,6 +106,29 @@ memory@0 {
reg = <0x0 0x0>;
};
 
+   cpu_opp_table: opp-table-cpu {
+   compatible = "operating-points-v2";
+   opp-shared;
+
+   opp-300000000 {
+   opp-hz = /bits/ 64 <300000000>;
+   };
+
+   opp-384000000 {
+   opp-hz = /bits/ 64 <384000000>;
+   };
+
+   opp-600000000 {
+   opp-hz = /bits/ 64 <600000000>;
+   };
+
+   opp-787200000 {
+   opp-hz = /bits/ 64 <787200000>;
+   };
+
+   /* Higher CPU frequencies need speedbin support */
+   };
+
pmu {
compatible = "arm,cortex-a7-pmu";
interrupts = ;
};
 
-   apcs: syscon@f9011000 {
-   compatible = "syscon";
+   apcs: mailbox@f9011000 {
+   compatible = "qcom,msm8226-apcs-kpss-global",
+"qcom,msm8916-apcs-kpss-global", "syscon";
reg = <0xf9011000 0x1000>;
+   #mbox-cells = <1>;
+   clocks = <&a7pll>, <&gcc GPLL0_VOTE>;
+   clock-names = "pll", "aux";
+   #clock-cells = <0>;
+   };
+
+   a7pll: clock@f9016000 {
+   compatible = "qcom,msm8226-a7pll";
+   reg = <0xf9016000 0x40>;
+   #clock-cells = <0>;
+   clocks = <&xo_board>;
+   clock-names = "xo";
+   operating-points-v2 = <&a7pll_opp_table>;
+
+   a7pll_opp_table: opp-table {
+   compatible = "operating-points-v2";
+
+   opp-768000000 {
+   opp-hz = /bits/ 64 <768000000>;
+   };
+
+   opp-787200000 {
+   opp-hz = /bits/ 64 <787200000>;
+   };
+
+   opp-998400000 {
+   opp-hz = /bits/ 64 <998400000>;
+   };
+
+   opp-1094400000 {
+   opp-hz = /bits/ 64 <1094400000>;
+   };
+
+   opp-1190400000 {
+   opp-hz = /bits/ 64 <1190400000>;
+   };
+
+   opp-1305600000 {
+   opp-hz = /bits/ 64 <1305600000>;
+   };
+
+   

[PATCH 6/7] ARM: dts: qcom: msm8226: Hook up CPU cooling

2024-06-19 Thread Luca Weiss
Add cooling-maps for the CPU thermal zones so the driver can actually do
something when the CPU temperature rises too much.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8226.dtsi | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
index 6e9fbe2e7223..9deee34fc5ca 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
+++ b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <dt-bindings/thermal/thermal.h>
 
 / {
#address-cells = <1>;
@@ -48,6 +49,7 @@ CPU0: cpu@0 {
operating-points-v2 = <_opp_table>;
qcom,acc = <>;
qcom,saw = <>;
+   #cooling-cells = <2>;
};
 
CPU1: cpu@1 {
@@ -60,6 +62,7 @@ CPU1: cpu@1 {
operating-points-v2 = <_opp_table>;
qcom,acc = <>;
qcom,saw = <>;
+   #cooling-cells = <2>;
};
 
CPU2: cpu@2 {
@@ -72,6 +75,7 @@ CPU2: cpu@2 {
operating-points-v2 = <_opp_table>;
qcom,acc = <>;
qcom,saw = <>;
+   #cooling-cells = <2>;
};
 
CPU3: cpu@3 {
@@ -84,6 +88,7 @@ CPU3: cpu@3 {
operating-points-v2 = <_opp_table>;
qcom,acc = <>;
qcom,saw = <>;
+   #cooling-cells = <2>;
};
 
L2: l2-cache {
@@ -1256,6 +1261,16 @@ cpu0-thermal {
 
thermal-sensors = <&tsens 5>;
 
+   cooling-maps {
+   map0 {
+   trip = <&cpu_alert0>;
+   cooling-device = <&CPU0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+<&CPU1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+<&CPU2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+<&CPU3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
+   };
+   };
+
trips {
cpu_alert0: trip0 {
temperature = <75000>;
@@ -1277,6 +1292,16 @@ cpu1-thermal {
 
thermal-sensors = <&tsens 2>;
 
+   cooling-maps {
+   map0 {
+   trip = <&cpu_alert1>;
+   cooling-device = <&CPU0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+<&CPU1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+<&CPU2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+<&CPU3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
+   };
+   };
+
trips {
cpu_alert1: trip0 {
temperature = <75000>;

-- 
2.45.2




[PATCH 0/7] Add CPU frequency scaling support for MSM8226

2024-06-19 Thread Luca Weiss
Apart from a bunch of bindings updates, add support for the a7pll found
on the SoC and wire up everything in the dtsi. And finally switch over
to using apcs via mbox interface to stop using the apcs via syscon.

Only a limited list of CPU frequencies is supported for now; higher
ones require speedbin support, which I plan to work on after this lands.

Signed-off-by: Luca Weiss 
---
Luca Weiss (7):
  dt-bindings: mailbox: qcom: add compatible for MSM8226 SoC
  dt-bindings: clock: qcom,a53pll: Allow opp-table subnode
  dt-bindings: clock: qcom,a53pll: Add msm8226-a7pll compatible
  clk: qcom: a53-pll: Add MSM8226 a7pll support
  ARM: dts: qcom: msm8226: Add CPU frequency scaling support
  ARM: dts: qcom: msm8226: Hook up CPU cooling
  ARM: dts: qcom: msm8226: Convert APCS usages to mbox interface

 .../devicetree/bindings/clock/qcom,a53pll.yaml |   4 +
 .../bindings/mailbox/qcom,apcs-kpss-global.yaml|   1 +
 arch/arm/boot/dts/qcom/qcom-msm8226.dtsi   | 134 -
 drivers/clk/qcom/a53-pll.c |   1 +
 4 files changed, 134 insertions(+), 6 deletions(-)
---
base-commit: 0efa3123a1658dbafdace0bfcdcc4f34eebc7f9f
change-id: 20240619-msm8226-cpufreq-788b0bf0256a

Best regards,
-- 
Luca Weiss 




[PATCH 4/7] clk: qcom: a53-pll: Add MSM8226 a7pll support

2024-06-19 Thread Luca Weiss
The MSM8226 has one PLL for its Cortex-A7 cores. The frequencies will be
specified in devicetree.

Signed-off-by: Luca Weiss 
---
 drivers/clk/qcom/a53-pll.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/clk/qcom/a53-pll.c b/drivers/clk/qcom/a53-pll.c
index f9c5e296dba2..f43d455ab4b8 100644
--- a/drivers/clk/qcom/a53-pll.c
+++ b/drivers/clk/qcom/a53-pll.c
@@ -151,6 +151,7 @@ static int qcom_a53pll_probe(struct platform_device *pdev)
 }
 
 static const struct of_device_id qcom_a53pll_match_table[] = {
+   { .compatible = "qcom,msm8226-a7pll" },
{ .compatible = "qcom,msm8916-a53pll" },
{ .compatible = "qcom,msm8939-a53pll" },
{ }

-- 
2.45.2




[PATCH 3/7] dt-bindings: clock: qcom,a53pll: Add msm8226-a7pll compatible

2024-06-19 Thread Luca Weiss
Add the compatible for the A7PLL found in MSM8226 SoCs.

Signed-off-by: Luca Weiss 
---
 Documentation/devicetree/bindings/clock/qcom,a53pll.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
index 8cd73a623ef5..47ceab641a4c 100644
--- a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
@@ -21,6 +21,7 @@ properties:
   - qcom,ipq6018-a53pll
   - qcom,ipq8074-a53pll
   - qcom,ipq9574-a73pll
+  - qcom,msm8226-a7pll
   - qcom,msm8916-a53pll
   - qcom,msm8939-a53pll
 

-- 
2.45.2




[PATCH 2/7] dt-bindings: clock: qcom,a53pll: Allow opp-table subnode

2024-06-19 Thread Luca Weiss
Allow placing an opp-table as a subnode that can be assigned using
operating-points-v2 to specify the frequency table for the PLL.

Signed-off-by: Luca Weiss 
---
 Documentation/devicetree/bindings/clock/qcom,a53pll.yaml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
index 5ca927a8b1d5..8cd73a623ef5 100644
--- a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
@@ -40,6 +40,9 @@ properties:
 
   operating-points-v2: true
 
+  opp-table:
+type: object
+
 required:
   - compatible
   - reg

-- 
2.45.2




[PATCH 1/7] dt-bindings: mailbox: qcom: add compatible for MSM8226 SoC

2024-06-19 Thread Luca Weiss
Add the mailbox compatible for MSM8226 SoC.

Signed-off-by: Luca Weiss 
---
 Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml b/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml
index 982c741e6225..dc75ea2383f1 100644
--- a/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml
+++ b/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml
@@ -25,6 +25,7 @@ properties:
   - const: qcom,ipq6018-apcs-apps-global
   - items:
   - enum:
+  - qcom,msm8226-apcs-kpss-global
   - qcom,qcs404-apcs-apps-global
   - const: qcom,msm8916-apcs-kpss-global
   - const: syscon

-- 
2.45.2




[PATCH v2] KUnit: add missing MODULE_DESCRIPTION() macros for lib/test_*.ko

2024-06-19 Thread Jeff Johnson
make allmodconfig && make W=1 C=1 reports for lib/test_*.ko:
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hexdump.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_dhry.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_firmware.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_sysctl.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hash.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ida.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_list_sort.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_min_heap.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_module.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_sort.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_static_keys.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_static_key_base.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_memcat_p.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_blackhole_dev.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_meminit.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_free_pages.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_kprobes.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ref_tracker.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_bits.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson 
---
Changes in v2:
- Updated the description of lib/test_kprobes.c per Masami
- Removed references to lib/test_user_copy since that is being renamed in
  https://lore.kernel.org/all/20240612195921.2685842-2-k...@kernel.org/
  and the description added in
  https://lore.kernel.org/all/20240619202659.work.532-k...@kernel.org/
- Link to v1: 
https://lore.kernel.org/r/20240601-md-lib-test-v1-1-a728620e3...@quicinc.com
---
 lib/dhry_run.c | 1 +
 lib/test-kstrtox.c | 1 +
 lib/test_bits.c| 1 +
 lib/test_blackhole_dev.c   | 1 +
 lib/test_firmware.c| 1 +
 lib/test_free_pages.c  | 1 +
 lib/test_hash.c| 1 +
 lib/test_hexdump.c | 1 +
 lib/test_ida.c | 1 +
 lib/test_kprobes.c | 3 ++-
 lib/test_list_sort.c   | 1 +
 lib/test_memcat_p.c| 1 +
 lib/test_meminit.c | 1 +
 lib/test_min_heap.c| 1 +
 lib/test_module.c  | 1 +
 lib/test_ref_tracker.c | 3 ++-
 lib/test_sort.c| 1 +
 lib/test_static_key_base.c | 1 +
 lib/test_static_keys.c | 1 +
 lib/test_sysctl.c  | 1 +
 20 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/lib/dhry_run.c b/lib/dhry_run.c
index e6a279dabf84..4a6d05ce4361 100644
--- a/lib/dhry_run.c
+++ b/lib/dhry_run.c
@@ -83,4 +83,5 @@ static int __init dhry_init(void)
 module_init(dhry_init);
 
 MODULE_AUTHOR("Geert Uytterhoeven ");
+MODULE_DESCRIPTION("Dhrystone benchmark test module");
 MODULE_LICENSE("GPL");
diff --git a/lib/test-kstrtox.c b/lib/test-kstrtox.c
index f355f67169b6..ee87fef66cb5 100644
--- a/lib/test-kstrtox.c
+++ b/lib/test-kstrtox.c
@@ -732,4 +732,5 @@ static int __init test_kstrtox_init(void)
return -EINVAL;
 }
 module_init(test_kstrtox_init);
+MODULE_DESCRIPTION("Module test for kstrto*() APIs");
 MODULE_LICENSE("Dual BSD/GPL");
diff --git a/lib/test_bits.c b/lib/test_bits.c
index c9368a2314e7..01313980f175 100644
--- a/lib/test_bits.c
+++ b/lib/test_bits.c
@@ -72,4 +72,5 @@ static struct kunit_suite bits_test_suite = {
 };
 kunit_test_suite(bits_test_suite);
 
+MODULE_DESCRIPTION("Test cases for functions and macros in bits.h");
 MODULE_LICENSE("GPL");
diff --git a/lib/test_blackhole_dev.c b/lib/test_blackhole_dev.c
index f247089d63c0..ec290ac2a0d9 100644
--- a/lib/test_blackhole_dev.c
+++ b/lib/test_blackhole_dev.c
@@ -96,4 +96,5 @@ module_init(test_blackholedev_init);
 module_exit(test_blackholedev_exit);
 
 MODULE_AUTHOR("Mahesh Bandewar ");
+MODULE_DESCRIPTION("module test of the blackhole_dev");
 MODULE_LICENSE("GPL");
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 9cfdcd6d21db..bcb32cbff188 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -1567,4 +1567,5 @@ static void __exit test_firmware_exit(void)
 module_exit(test_firmware_exit);
 
 MODULE_AUTHOR("Kees Cook ");
+MODULE_DESCRIPTION("interface to trigger and test firmware loading");
 MODULE_LICENSE("GPL");
diff --git a/lib/test_free_pages.c b/lib/test_free_pages.c
index 9ebf6f5549f3..48952364c540 100644
--- a/lib/test_free_pages.c
+++ b/lib/test_free_pages.c
@@ -44,4 +44,5 @@ static void m_ex(void)
 module_init(m_in);
 module_exit(m_ex);
 MODULE_AUTHOR("Matthew Wilcox ");
+MODULE_DESCRIPTION("Check that free_pages() doesn't leak memory");
 MODULE_LICENSE("GPL");
diff --git a/lib/test_hash.c b/lib/test_hash.c
index bb25fda34794..a7af39662a0a 100644
--- a/lib/test_hash.c
+++ b/lib/test_hash.c
@@ -235,4 +235,5 @@ static struct kunit_suite hash_test_suite = {
 
 kunit_test_suite(hash_test_suite);
 

Re: [PATCH 0/5] Use mboxes in smsm node for all dtsi where possible

2024-06-19 Thread Konrad Dybcio




On 6/19/24 18:42, Luca Weiss wrote:

With the binding and driver patches queued for 6.11[0][1], it's time to
update the dtsi files to use the new binding.

[0] 
https://lore.kernel.org/linux-arm-msm/171840533352.102487.9576671528001022676.b4...@kernel.org/
[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux.git/log/?h=drivers-for-6.11

@Bjorn: I think it makes sense to apply these patches only in 6.12, so
that the driver will already exist in the arm64 tree as well and git
bisect is not impeded by that.

Patches are just compile-tested.

Signed-off-by: Luca Weiss 
---


Reviewed-by: Konrad Dybcio 

Konrad



Re: [RFC bpf-next 01/10] uprobe: Add session callbacks to uprobe_consumer

2024-06-19 Thread Jiri Olsa
On Mon, Jun 17, 2024 at 03:53:50PM -0700, Andrii Nakryiko wrote:
> On Mon, Jun 10, 2024 at 4:06 AM Jiri Olsa  wrote:
> >
> > On Thu, Jun 06, 2024 at 09:52:39AM -0700, Andrii Nakryiko wrote:
> > > On Thu, Jun 6, 2024 at 9:46 AM Jiri Olsa  wrote:
> > > >
> > > > On Wed, Jun 05, 2024 at 10:50:11PM +0200, Jiri Olsa wrote:
> > > > > On Wed, Jun 05, 2024 at 07:56:19PM +0200, Oleg Nesterov wrote:
> > > > > > On 06/05, Andrii Nakryiko wrote:
> > > > > > >
> > > > > > > so any such
> > > > > > > limitations will cause problems, issue reports, investigation, 
> > > > > > > etc.
> > > > > >
> > > > > > Agreed...
> > > > > >
> > > > > > > As one possible solution, what if we do
> > > > > > >
> > > > > > > struct return_instance {
> > > > > > > ...
> > > > > > > u64 session_cookies[];
> > > > > > > };
> > > > > > >
> > > > > > > and allocate sizeof(struct return_instance) + 8 *
> > > > > > > <number of session consumers> and then at runtime pass
> > > > > > > &session_cookies[i] as data pointer to session-aware callbacks?
> > > > > >
> > > > > > I too thought about this, but I guess it is not that simple.
> > > > > >
> > > > > > Just for example. Suppose we have 2 session-consumers C1 and C2.
> > > > > > What if uprobe_unregister(C1) comes before the probed function
> > > > > > returns?
> > > > > >
> > > > > > We need something like map_cookie_to_consumer().
> > > > >
> > > > > I guess we could have hash table in return_instance that gets 
> > > > > 'consumer -> cookie' ?
> > > >
> > > > ok, hash table is probably too big for this.. I guess some solution that
> > > > would iterate consumers and cookies and make sure they match would be fine
> > > >
> > >
> > > Yes, I was hoping to avoid hash tables for this, and in the common
> > > case have no added overhead.
> >
> > hi,
> > here's first stab on that.. the change below:
> >   - extends current handlers with extra argument rather than adding new
> > set of handlers
> >   - store session consumers objects within return_instance object and
> >   - iterate these objects ^^^ in handle_uretprobe_chain
> >
> > I guess it could be still polished, but I wonder if this could
> > be the right direction to do this.. thoughts? ;-)
> 
> Yeah, I think this is the right direction. It's a bit sad that this
> makes getting rid of rw_sem on hot path even harder, but that's a
> separate problem.
> 
> >
> > thanks,
> > jirka
> >
> >
> > ---
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index f46e0ca0169c..4e40e8352eac 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -34,15 +34,19 @@ enum uprobe_filter_ctx {
> >  };
> >
> >  struct uprobe_consumer {
> > -   int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
> > +   int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs,
> > +   unsigned long *data);
> 
> can we use __u64 here? This long vs __u64 might cause problems for BPF
> when the host is 32-bit architecture (BPF is always 64-bit).

ok

> 
> > int (*ret_handler)(struct uprobe_consumer *self,
> > unsigned long func,
> > -   struct pt_regs *regs);
> > +   struct pt_regs *regs,
> > +   unsigned long *data);
> > bool (*filter)(struct uprobe_consumer *self,
> > enum uprobe_filter_ctx ctx,
> > struct mm_struct *mm);
> >
> 
> [...]
> 
> >  static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
> >  {
> > struct uprobe_task *n_utask;
> > @@ -1756,11 +1795,11 @@ static int dup_utask(struct task_struct *t, struct 
> > uprobe_task *o_utask)
> >
> > p = &n_utask->return_instances;
> > for (o = o_utask->return_instances; o; o = o->next) {
> > -   n = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
> > +   n = alloc_return_instance(o->session_cnt);
> > if (!n)
> > return -ENOMEM;
> >
> > -   *n = *o;
> > +   memcpy(n, o, ri_size(o->session_cnt));
> > get_uprobe(n->uprobe);
> > n->next = NULL;
> >
> > @@ -1853,35 +1892,38 @@ static void cleanup_return_instances(struct 
> > uprobe_task *utask, bool chained,
> > utask->return_instances = ri;
> >  }
> >
> > -static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
> > +static struct return_instance *
> > +prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs,
> > + struct return_instance *ri, int session_cnt)
> 
> you have struct uprobe, why do you need to pass session_cnt? Also,
> given return_instance is cached, it seems more natural to have
> 
> struct return_instance **ri as in/out parameter, and keep the function
> itself as void

I tried that, but now I think it'd be better if we just let prepare_uretprobe
allocate (if needed) and free ri in case it fails, as sketched below.
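
A rough sketch of that shape (alloc_return_instance() is from the diff
above; setup_return_instance() and the exact error handling here are
hypothetical):

static struct return_instance *
prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs,
                  struct return_instance *ri, int session_cnt)
{
        /* allocate lazily, so callers without a cached instance still work */
        if (!ri)
                ri = alloc_return_instance(session_cnt);
        if (!ri)
                return NULL;

        if (setup_return_instance(uprobe, regs, ri)) {
                /* prepare_uretprobe() owns ri, so it frees it on failure */
                kfree(ri);
                return NULL;
        }
        return ri;
}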

Re: [PATCH] filemap: add trace events for get_pages, map_pages, and fault

2024-06-19 Thread Takaya Saeki
Thanks Steven for trying it out.

> I can see it bringing down the number of pages needed to be saved 
> dramatically.

Yes, I agree.
However, note that wc does not count the size of the page caches
correctly, since 'get_map_pages' gives you a range. In your example of
the less command, the total sizes of page_cache_size and
get_map_pages actually look the same.

Instead of less, running a large binary such as Chrome (232MB) gives
us a better example. '# trace-cmd record -e filemap chrome --version'
showed 58% reduction from 42MB to 17MB in my environment.
- Total size of mm_filemap_add_to_page_cache: 42,958,848 bytes
- With mm_filemap_map_pages and mm_filemap_get_pages: 17,993,728 bytes

By the way, 'mm_filemap_map_pages' traces a range requested to
populate, which includes pages that are not in caches yet and thus are
skipped. So, you need to calculate the intersection with
mm_filemap_add_to_page_cache to see which page caches are actually
mapped.
I'm wondering if we should emit a trace event for each page that is
successfully mapped, as mm_filemap_add_to_page_cache does, by putting an
event inside the page-map loop, as sketched below.
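
A minimal sketch of that idea (trace_mm_filemap_map_page() and
install_pte_for_page() are hypothetical names, and the loop is heavily
simplified compared to the real filemap_map_pages()):

for (pgoff_t off = start; off < start + count; off++) {
        struct page *page = folio_file_page(folio, off);

        /* pages that are not cached/uptodate get skipped */
        if (!install_pte_for_page(vmf, page, off))
                continue;
        /* one event per page actually mapped */
        trace_mm_filemap_map_page(folio->mapping, off);
}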



Re: [PATCH vhost 11/23] vdpa/mlx5: Set an initial size on the VQ

2024-06-19 Thread Dragos Tatulea
On Wed, 2024-06-19 at 17:08 +0200, Eugenio Perez Martin wrote:
> On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
> > 
> > The virtqueue size is a pre-requisite for setting up any virtqueue
> > resources. For the upcoming optimization of creating virtqueues at
> > device add, the virtqueue size has to be configured.
> > 
> > Store the default queue size in struct mlx5_vdpa_net to make it easy in
> > the future to pre-configure this default value via vdpa tool.
> > 
> > The queue size check in setup_vq() will always be false. So remove it.
> > 
> > Signed-off-by: Dragos Tatulea 
> > Reviewed-by: Cosmin Ratiu 
> > ---
> >  drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 ---
> >  drivers/vdpa/mlx5/net/mlx5_vnet.h | 1 +
> >  2 files changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > index 245b5dac98d3..1181e0ac3671 100644
> > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > @@ -58,6 +58,8 @@ MODULE_LICENSE("Dual BSD/GPL");
> >   */
> >  #define MLX5V_DEFAULT_VQ_COUNT 2
> > 
> > +#define MLX5V_DEFAULT_VQ_SIZE 256
> > +
> >  struct mlx5_vdpa_cq_buf {
> > struct mlx5_frag_buf_ctrl fbc;
> > struct mlx5_frag_buf frag_buf;
> > @@ -1445,9 +1447,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, 
> > struct mlx5_vdpa_virtqueue *mvq)
> > u16 idx = mvq->index;
> > int err;
> > 
> > -   if (!mvq->num_ent)
> > -   return 0;
> > -
> > if (mvq->initialized)
> > return 0;
> > 
> > @@ -3523,6 +3522,7 @@ static void init_mvqs(struct mlx5_vdpa_net *ndev)
> > mvq->ndev = ndev;
> > mvq->fwqp.fw = true;
> > mvq->fw_state = MLX5_VIRTIO_NET_Q_OBJECT_NONE;
> > +   mvq->num_ent = ndev->default_queue_size;
> > }
> >  }
> > 
> > @@ -3660,6 +3660,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> > *v_mdev, const char *name,
> > goto err_alloc;
> > }
> > ndev->cur_num_vqs = MLX5V_DEFAULT_VQ_COUNT;
> > +   ndev->default_queue_size = MLX5V_DEFAULT_VQ_SIZE;
> > 
> > init_mvqs(ndev);
> > allocate_irqs(ndev);
> > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.h 
> > b/drivers/vdpa/mlx5/net/mlx5_vnet.h
> > index 90b556a57971..2ada29767cc5 100644
> > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.h
> > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.h
> > @@ -58,6 +58,7 @@ struct mlx5_vdpa_net {
> > bool setup;
> > u32 cur_num_vqs;
> > u32 rqt_size;
> > +   u16 default_queue_size;
> 
> It seems to me this is only assigned here and not used in the rest of
> the series, why allocate a member here instead of using macro
> directly?
It is used in init_mvqs(). I wanted to make it easy in case we add the default
queue size to the vdpa tool. I'm also ok with switching to constants for now.

Thank you for your reviews!

Thanks,
Dragos
> 
> > bool nb_registered;
> > struct notifier_block nb;
> > struct vdpa_callback config_cb;
> > 
> > --
> > 2.45.1
> > 
> 



Re: [PATCH vhost 08/23] vdpa/mlx5: Clear and reinitialize software VQ data on reset

2024-06-19 Thread Dragos Tatulea
On Wed, 2024-06-19 at 13:28 +0200, Eugenio Perez Martin wrote:
> On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea  wrote:
> > 
> > The hardware VQ configuration is mirrored by data in struct
> > mlx5_vdpa_virtqueue . Instead of clearing just a few fields at reset,
> > fully clear the struct and initialize with the appropriate default
> > values.
> > 
> > As clear_vqs_ready() is used only during reset, get rid of it.
> > 
> > Signed-off-by: Dragos Tatulea 
> > Reviewed-by: Cosmin Ratiu 
> 
> Acked-by: Eugenio Pérez 
> 
> > ---
> >  drivers/vdpa/mlx5/net/mlx5_vnet.c | 16 +++-
> >  1 file changed, 3 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > index c8b5c87f001d..de013b5a2815 100644
> > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > @@ -2941,18 +2941,6 @@ static void teardown_vq_resources(struct 
> > mlx5_vdpa_net *ndev)
> > ndev->setup = false;
> >  }
> > 
> > -static void clear_vqs_ready(struct mlx5_vdpa_net *ndev)
> > -{
> > -   int i;
> > -
> > -   for (i = 0; i < ndev->mvdev.max_vqs; i++) {
> > -   ndev->vqs[i].ready = false;
> > -   ndev->vqs[i].modified_fields = 0;
> > -   }
> > -
> > -   ndev->mvdev.cvq.ready = false;
> > -}
> > -
> >  static int setup_cvq_vring(struct mlx5_vdpa_dev *mvdev)
> >  {
> > struct mlx5_control_vq *cvq = &mvdev->cvq;
> > @@ -3035,12 +3023,14 @@ static int mlx5_vdpa_compat_reset(struct 
> > vdpa_device *vdev, u32 flags)
> > down_write(&ndev->reslock);
> > unregister_link_notifier(ndev);
> > teardown_vq_resources(ndev);
> > -   clear_vqs_ready(ndev);
> > +   init_mvqs(ndev);
> 
> Nitpick / suggestion if you have to send a v2. The init_mvqs function
> name sounds like it can allocate stuff that needs to be released to
> me. But I'm very bad at naming :). Maybe something like
> "mvqs_set_defaults" or similar?
Makes sense. I think I will call it mvqs_reset / reset_mvqs to keep things
consistent.

Thanks,
Dragos
> 
> > +
> > if (flags & VDPA_RESET_F_CLEAN_MAP)
> > mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
> > ndev->mvdev.status = 0;
> > ndev->mvdev.suspended = false;
> > ndev->cur_num_vqs = MLX5V_DEFAULT_VQ_COUNT;
> > +   ndev->mvdev.cvq.ready = false;
> > ndev->mvdev.cvq.received_desc = 0;
> > ndev->mvdev.cvq.completed_desc = 0;
> > memset(ndev->event_cbs, 0, sizeof(*ndev->event_cbs) * 
> > (mvdev->max_vqs + 1));
> > 
> > --
> > 2.45.1
> > 
> 



[PATCH 1/5] ARM: dts: qcom: msm8974: Use mboxes in smsm node

2024-06-19 Thread Luca Weiss
With the smsm bindings and driver finally supporting mboxes, switch to
that and stop using apcs as syscon.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8974.dtsi | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi 
b/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
index 1bea3cef4ba7..18b6a7c77996 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
@@ -294,9 +294,7 @@ smsm {
#address-cells = <1>;
#size-cells = <0>;
 
-   qcom,ipc-1 = <&apcs 8 13>;
-   qcom,ipc-2 = <&apcs 8 9>;
-   qcom,ipc-3 = <&apcs 8 19>;
+   mboxes = <0>, <&apcs 13>, <&apcs 9>, <&apcs 19>;
 
apps_smsm: apps@0 {
reg = <0>;

-- 
2.45.2




[PATCH 0/5] Use mboxes in smsm node for all dtsi where possible

2024-06-19 Thread Luca Weiss
With the binding and driver patches queued for 6.11[0][1], it's time to
update the dtsi files to use the new binding.

[0] 
https://lore.kernel.org/linux-arm-msm/171840533352.102487.9576671528001022676.b4...@kernel.org/
[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux.git/log/?h=drivers-for-6.11

@Bjorn: I think it makes sense to apply these patches only in 6.12, so
that the driver will already exist in the arm64 tree as well and git
bisect is not impeded by that.

Patches are just compile-tested.

Signed-off-by: Luca Weiss 
---
Luca Weiss (5):
  ARM: dts: qcom: msm8974: Use mboxes in smsm node
  arm64: dts: qcom: msm8916: Use mboxes in smsm node
  arm64: dts: qcom: msm8939: Use mboxes in smsm node
  arm64: dts: qcom: msm8953: Use mboxes in smsm node
  arm64: dts: qcom: msm8976: Use mboxes in smsm node

 arch/arm/boot/dts/qcom/qcom-msm8974.dtsi | 4 +---
 arch/arm64/boot/dts/qcom/msm8916.dtsi| 3 +--
 arch/arm64/boot/dts/qcom/msm8939.dtsi| 3 +--
 arch/arm64/boot/dts/qcom/msm8953.dtsi| 3 +--
 arch/arm64/boot/dts/qcom/msm8976.dtsi| 4 +---
 5 files changed, 5 insertions(+), 12 deletions(-)
---
base-commit: 2102cb0d050d34d50b9642a3a50861787527e922
change-id: 20240619-smsm-mbox-dts-3ee9daadf81b

Best regards,
-- 
Luca Weiss 




[PATCH 2/5] arm64: dts: qcom: msm8916: Use mboxes in smsm node

2024-06-19 Thread Luca Weiss
With the smsm bindings and driver finally supporting mboxes, switch to
that and stop using apcs as syscon.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/msm8916.dtsi | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/msm8916.dtsi 
b/arch/arm64/boot/dts/qcom/msm8916.dtsi
index bdedbcdc36d3..7383bcc603ab 100644
--- a/arch/arm64/boot/dts/qcom/msm8916.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8916.dtsi
@@ -410,8 +410,7 @@ smsm {
#address-cells = <1>;
#size-cells = <0>;
 
-   qcom,ipc-1 = <&apcs 8 13>;
-   qcom,ipc-3 = <&apcs 8 19>;
+   mboxes = <0>, <&apcs 13>, <0>, <&apcs 19>;
 
apps_smsm: apps@0 {
reg = <0>;

-- 
2.45.2




[PATCH 3/5] arm64: dts: qcom: msm8939: Use mboxes in smsm node

2024-06-19 Thread Luca Weiss
With the smsm bindings and driver finally supporting mboxes, switch to
that and stop using apcs as syscon.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/msm8939.dtsi | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/msm8939.dtsi 
b/arch/arm64/boot/dts/qcom/msm8939.dtsi
index e309ef909ea7..46d9480cd464 100644
--- a/arch/arm64/boot/dts/qcom/msm8939.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8939.dtsi
@@ -443,8 +443,7 @@ smsm {
#address-cells = <1>;
#size-cells = <0>;
 
-   qcom,ipc-1 = <&apcs1_mbox 8 13>;
-   qcom,ipc-3 = <&apcs1_mbox 8 19>;
+   mboxes = <0>, <&apcs1_mbox 13>, <0>, <&apcs1_mbox 19>;
 
apps_smsm: apps@0 {
reg = <0>;

-- 
2.45.2




[PATCH 4/5] arm64: dts: qcom: msm8953: Use mboxes in smsm node

2024-06-19 Thread Luca Weiss
With the smsm bindings and driver finally supporting mboxes, switch to
that and stop using apcs as syscon.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/msm8953.dtsi | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi 
b/arch/arm64/boot/dts/qcom/msm8953.dtsi
index 1b61a63710a6..a4bfb624fb8a 100644
--- a/arch/arm64/boot/dts/qcom/msm8953.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi
@@ -411,8 +411,7 @@ smsm {
#address-cells = <1>;
#size-cells = <0>;
 
-   qcom,ipc-1 = <&apcs 8 13>;
-   qcom,ipc-3 = <&apcs 8 19>;
+   mboxes = <0>, <&apcs 13>, <0>, <&apcs 19>;
 
apps_smsm: apps@0 {
reg = <0>;

-- 
2.45.2




[PATCH 5/5] arm64: dts: qcom: msm8976: Use mboxes in smsm node

2024-06-19 Thread Luca Weiss
With the smsm bindings and driver finally supporting mboxes, switch to
that and stop using apcs as syscon.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/msm8976.dtsi | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
b/arch/arm64/boot/dts/qcom/msm8976.dtsi
index e299d42c5d98..d62dcb76fa48 100644
--- a/arch/arm64/boot/dts/qcom/msm8976.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi
@@ -439,9 +439,7 @@ smsm {
#address-cells = <1>;
#size-cells = <0>;
 
-   qcom,ipc-1 = <&apcs 8 13>;
-   qcom,ipc-2 = <&apcs 8 9>;
-   qcom,ipc-3 = <&apcs 8 19>;
+   mboxes = <0>, <&apcs 13>, <&apcs 9>, <&apcs 19>;
 
apps_smsm: apps@0 {
reg = <0>;

-- 
2.45.2




Re: [PATCH vhost 23/23] vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready()

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> VQ indices in the range [cur_num_qps, max_vqs) represent queues that
> have not yet been activated. .set_vq_ready should not activate these
> VQs.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 1a5ee0d2b47f..a969a7f105a6 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1575,6 +1575,9 @@ static int resume_vq(struct mlx5_vdpa_net *ndev, struct 
> mlx5_vdpa_virtqueue *mvq
> if (!mvq->initialized)
> return 0;
>
> +   if (mvq->index >= ndev->cur_num_vqs)
> +   return 0;
> +
> switch (mvq->fw_state) {
> case MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT:
> /* Due to a FW quirk we need to modify the VQ fields first 
> then change state.
>
> --
> 2.45.1
>




Re: [PATCH v7 2/5] remoteproc: Add TEE support

2024-06-19 Thread Arnaud POULIQUEN
Hi,

On 6/19/24 17:41, Mathieu Poirier wrote:
> Hi,
> 
> On Tue, Jun 11, 2024 at 09:39:01AM +0200, Arnaud Pouliquen wrote:
>> Add a remoteproc TEE (Trusted Execution Environment) driver
>> that will be probed by the TEE bus. If the associated trusted
>> application is supported on the secure side, this driver offers a client
>> interface to load a firmware into the secure part.
>> This firmware could be authenticated by the secure trusted application.
>>
>> Signed-off-by: Arnaud Pouliquen 
>> ---
>> update from V6
>> - Fix missing "{" in tee_rproc_find_loaded_rsc_table inline definition.
>>
>> update from V5
>> - make tee_rproc_get_loaded_rsc_table() local and replace this API by
>>   tee_rproc_find_loaded_rsc_table()
>> - map and unmap the resource table in tee_rproc_parse_fw to make a cached 
>> copy
>> - use the new rproc_pa_to_va() API to map the resource table memory declared 
>> in carveout
>> - remove tee_rproc_release_loaded_rsc_table as no more used.
>> ---
>>  drivers/remoteproc/Kconfig  |  10 +
>>  drivers/remoteproc/Makefile |   1 +
>>  drivers/remoteproc/tee_remoteproc.c | 451 
>>  include/linux/remoteproc.h  |   4 +
>>  include/linux/tee_remoteproc.h  | 100 ++
>>  5 files changed, 566 insertions(+)
>>  create mode 100644 drivers/remoteproc/tee_remoteproc.c
>>  create mode 100644 include/linux/tee_remoteproc.h
>>
>> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
>> index 48845dc8fa85..6c1c07202276 100644
>> --- a/drivers/remoteproc/Kconfig
>> +++ b/drivers/remoteproc/Kconfig
>> @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC
>>  
>>It's safe to say N if not interested in using RPU r5f cores.
>>  
>> +
>> +config TEE_REMOTEPROC
>> +tristate "Remoteproc support by a TEE application"
>> +depends on OPTEE
>> +help
>> +  Support a remote processor with a TEE application. The Trusted
>> +  Execution Context is responsible for loading the trusted firmware
>> +  image and managing the remote processor's lifecycle.
>> +  This can be either built-in or a loadable module.
>> +
>>  endif # REMOTEPROC
>>  
>>  endmenu
>> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
>> index 91314a9b43ce..fa8daebce277 100644
>> --- a/drivers/remoteproc/Makefile
>> +++ b/drivers/remoteproc/Makefile
>> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)  += rcar_rproc.o
>>  obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o
>>  obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o
>>  obj-$(CONFIG_STM32_RPROC)   += stm32_rproc.o
>> +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o
>>  obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)  += ti_k3_dsp_remoteproc.o
>>  obj-$(CONFIG_TI_K3_R5_REMOTEPROC)   += ti_k3_r5_remoteproc.o
>>  obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o
>> diff --git a/drivers/remoteproc/tee_remoteproc.c 
>> b/drivers/remoteproc/tee_remoteproc.c
>> new file mode 100644
>> index ..9455fd9d0d2d
>> --- /dev/null
>> +++ b/drivers/remoteproc/tee_remoteproc.c
>> @@ -0,0 +1,451 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved
>> + * Author: Arnaud Pouliquen 
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "remoteproc_internal.h"
>> +
>> +#define MAX_TEE_PARAM_ARRY_MEMBER   4
>> +
>> +/*
>> + * Authentication of the firmware and load in the remote processor memory
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + * [in]  params[1].memref:  buffer containing the firmware image
>> + */
>> +#define TA_RPROC_FW_CMD_LOAD_FW 1
>> +
>> +/*
>> + * Start the remote processor
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + */
>> +#define TA_RPROC_FW_CMD_START_FW2
>> +
>> +/*
>> + * Stop the remote processor
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + */
>> +#define TA_RPROC_FW_CMD_STOP_FW 3
>> +
>> +/*
>> + * Return the address of the resource table, or 0 if not found
>> + * No check is done to verify that the address returned is accessible by
>> + * the non secure context. If the resource table is loaded in a protected
>> + * memory the access by the non secure context will lead to a data abort.
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + * [out]  params[1].value.a:32bit LSB resource table memory address
>> + * [out]  params[1].value.b:32bit MSB resource table memory address
>> + * [out]  params[2].value.a:32bit LSB resource table memory size
>> + * [out]  params[2].value.b:32bit MSB resource table memory size
>> + */
>> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE   4
>> +
>> +/*
>> + * Return the address of the core dump
>> + *
>> + * 

Re: [PATCH vhost 22/23] vdpa/mlx5: Don't reset VQs more than necessary

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> The vdpa device can be reset many times in sequence without any
> significant state changes in between. Previously this was not a problem:
> VQs were torn down only on first reset. But after VQ pre-creation was
> introduced, each reset will delete and re-create the hardware VQs and
> their associated resources.
>
> To solve this problem, avoid resetting hardware VQs if the VQs are still
> in a blank state.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 30 +++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index d80d6b47da61..1a5ee0d2b47f 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -3134,18 +3134,41 @@ static void init_group_to_asid_map(struct 
> mlx5_vdpa_dev *mvdev)
> mvdev->group2asid[i] = 0;
>  }
>
> +static bool needs_vqs_reset(const struct mlx5_vdpa_dev *mvdev)
> +{
> +   struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
> +   struct mlx5_vdpa_virtqueue *mvq = &ndev->vqs[0];
> +
> +   if (mvdev->status & VIRTIO_CONFIG_S_DRIVER_OK)
> +   return true;
> +
> +   if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT)
> +   return true;
> +
> +   return mvq->modified_fields & (
> +   MLX5_VIRTQ_MODIFY_MASK_STATE |
> +   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS |
> +   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_AVAIL_IDX |
> +   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX
> +   );
> +}
> +
>  static int mlx5_vdpa_compat_reset(struct vdpa_device *vdev, u32 flags)
>  {
> struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
> struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
> +   bool vq_reset;
>
> print_status(mvdev, 0, true);
> mlx5_vdpa_info(mvdev, "performing device reset\n");
>
> down_write(&ndev->reslock);
> unregister_link_notifier(ndev);
> -   teardown_vq_resources(ndev);
> -   init_mvqs(ndev);
> +   vq_reset = needs_vqs_reset(mvdev);
> +   if (vq_reset) {
> +   teardown_vq_resources(ndev);
> +   init_mvqs(ndev);
> +   }
>
> if (flags & VDPA_RESET_F_CLEAN_MAP)
> mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
> @@ -3165,7 +3188,8 @@ static int mlx5_vdpa_compat_reset(struct vdpa_device 
> *vdev, u32 flags)
> if (mlx5_vdpa_create_dma_mr(mvdev))
> mlx5_vdpa_warn(mvdev, "create MR failed\n");
> }
> -   setup_vq_resources(ndev, false);
> +   if (vq_reset)
> +   setup_vq_resources(ndev, false);
> up_write(&ndev->reslock);
>
> return 0;
>
> --
> 2.45.1
>




Re: [PATCH vhost 21/23] vdpa/mlx5: Re-create HW VQs under certain conditions

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> There are a few conditions under which the hardware VQs need a full
> teardown and setup:
>
> - VQ size changed to something else than default value. Hardware VQ size
>   modification is not supported.
>
> - User turns off certain device features: mergeable buffers, checksum,
>   virtio 1.0 compliance. In these cases, the TIR and RQT need to be
>   re-created.
>
> Add a needs_teardown configuration variable and set it when detecting
> the above scenarios. On next DRIVER_OK, the resources will be torn down
> first.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +++
>  drivers/vdpa/mlx5/net/mlx5_vnet.h |  1 +
>  2 files changed, 16 insertions(+)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index b2836fd3d1dd..d80d6b47da61 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2390,6 +2390,7 @@ static void mlx5_vdpa_set_vq_num(struct vdpa_device 
> *vdev, u16 idx, u32 num)
>  }
>
> mvq = &ndev->vqs[idx];
> +   ndev->needs_teardown = num != mvq->num_ent;
> mvq->num_ent = num;
>  }
>
> @@ -2800,6 +2801,7 @@ static int mlx5_vdpa_set_driver_features(struct 
> vdpa_device *vdev, u64 features)
> struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
> struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
> u64 old_features = mvdev->actual_features;
> +   u64 diff_features;
> int err;
>
> print_features(mvdev, features, true);
> @@ -2822,6 +2824,14 @@ static int mlx5_vdpa_set_driver_features(struct 
> vdpa_device *vdev, u64 features)
> }
> }
>
> +   /* When below features diverge from initial device features, VQs need 
> a full teardown. */
> +#define NEEDS_TEARDOWN_MASK (BIT_ULL(VIRTIO_NET_F_MRG_RXBUF) | \
> +BIT_ULL(VIRTIO_NET_F_CSUM) | \
> +BIT_ULL(VIRTIO_F_VERSION_1))
> +
> +   diff_features = mvdev->mlx_features ^ mvdev->actual_features;
> +   ndev->needs_teardown = !!(diff_features & NEEDS_TEARDOWN_MASK);
> +
> update_cvq_info(mvdev);
> return err;
>  }
> @@ -3038,6 +3048,7 @@ static void teardown_vq_resources(struct mlx5_vdpa_net 
> *ndev)
> destroy_rqt(ndev);
> teardown_virtqueues(ndev);
> ndev->setup = false;
> +   ndev->needs_teardown = false;
>  }
>
>  static int setup_cvq_vring(struct mlx5_vdpa_dev *mvdev)
> @@ -3078,6 +3089,10 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
> *vdev, u8 status)
> goto err_setup;
> }
> register_link_notifier(ndev);
> +
> +   if (ndev->needs_teardown)
> +   teardown_vq_resources(ndev);
> +
> if (ndev->setup) {
> err = resume_vqs(ndev);
> if (err) {
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.h 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.h
> index 2ada29767cc5..da7318f82d2a 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.h
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.h
> @@ -56,6 +56,7 @@ struct mlx5_vdpa_net {
> struct dentry *rx_dent;
> struct dentry *rx_table_dent;
> bool setup;
> +   bool needs_teardown;
> u32 cur_num_vqs;
> u32 rqt_size;
> u16 default_queue_size;
>
> --
> 2.45.1
>




Re: [PATCH vhost 20/23] vdpa/mlx5: Pre-create hardware VQs at vdpa .dev_add time

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> Currently, hardware VQs are created right when the vdpa device gets into
> DRIVER_OK state. That is easier because most of the VQ state is known by
> then.
>
> This patch switches to creating all VQs and their associated resources
> at device creation time. The motivation is to reduce the vdpa device
> live migration downtime by moving the expensive operation of creating
> all the hardware VQs and their associated resources out of downtime on
> the destination VM.
>
> The VQs are now created in a blank state. The VQ configuration will
> happen later, on DRIVER_OK. Then the configuration will be applied when
> the VQs are moved to the Ready state.
>
> When .set_vq_ready() is called on a VQ before DRIVER_OK, special care is
> needed: now that the VQ is already created a resume_vq() will be
> triggered too early when no mr has been configured yet. Skip calling
> resume_vq() in this case, let it be handled during DRIVER_OK.
>
> For virtio-vdpa, the device configuration is done earlier during
> .vdpa_dev_add() by vdpa_register_device(). Avoid calling
> setup_vq_resources() a second time in that case.
>

I guess this happens if virtio_vdpa is already loaded, but I cannot
see how this is different here. Apart from the IOTLB, what else does
it change from the mlx5_vdpa POV?

> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 
> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 37 -
>  1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 249b5afbe34a..b2836fd3d1dd 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2444,7 +2444,7 @@ static void mlx5_vdpa_set_vq_ready(struct vdpa_device 
> *vdev, u16 idx, bool ready
> mvq = &ndev->vqs[idx];
> if (!ready) {
> suspend_vq(ndev, mvq);
> -   } else {
> +   } else if (mvdev->status & VIRTIO_CONFIG_S_DRIVER_OK) {
> if (resume_vq(ndev, mvq))
> ready = false;
> }
> @@ -3078,10 +3078,18 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
> *vdev, u8 status)
> goto err_setup;
> }
> register_link_notifier(ndev);
> -   err = setup_vq_resources(ndev, true);
> -   if (err) {
> -   mlx5_vdpa_warn(mvdev, "failed to setup 
> driver\n");
> -   goto err_driver;
> +   if (ndev->setup) {
> +   err = resume_vqs(ndev);
> +   if (err) {
> +   mlx5_vdpa_warn(mvdev, "failed to 
> resume VQs\n");
> +   goto err_driver;
> +   }
> +   } else {
> +   err = setup_vq_resources(ndev, true);
> +   if (err) {
> +   mlx5_vdpa_warn(mvdev, "failed to 
> setup driver\n");
> +   goto err_driver;
> +   }
> }
> } else {
> mlx5_vdpa_warn(mvdev, "did not expect DRIVER_OK to be 
> cleared\n");
> @@ -3142,6 +3150,7 @@ static int mlx5_vdpa_compat_reset(struct vdpa_device 
> *vdev, u32 flags)
> if (mlx5_vdpa_create_dma_mr(mvdev))
> mlx5_vdpa_warn(mvdev, "create MR failed\n");
> }
> +   setup_vq_resources(ndev, false);
> up_write(&ndev->reslock);
>
> return 0;
> @@ -3836,8 +3845,21 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> *v_mdev, const char *name,
> goto err_reg;
>
> mgtdev->ndev = ndev;
> +
> +   /* For virtio-vdpa, the device was set up during device register. */
> +   if (ndev->setup)
> +   return 0;
> +
> +   down_write(&ndev->reslock);
> +   err = setup_vq_resources(ndev, false);
> +   up_write(&ndev->reslock);
> +   if (err)
> +   goto err_setup_vq_res;
> +
> return 0;
>
> +err_setup_vq_res:
> +   _vdpa_unregister_device(&ndev->vdev);
>  err_reg:
> destroy_workqueue(mvdev->wq);
>  err_res2:
> @@ -3863,6 +3885,11 @@ static void mlx5_vdpa_dev_del(struct vdpa_mgmt_dev 
> *v_mdev, struct vdpa_device *
>
> unregister_link_notifier(ndev);
> _vdpa_unregister_device(dev);
> +
> +   down_write(&ndev->reslock);
> +   teardown_vq_resources(ndev);
> +   up_write(&ndev->reslock);
> +
> wq = mvdev->wq;
> mvdev->wq = NULL;
> destroy_workqueue(wq);
>
> --
> 2.45.1
>




Re: [PATCH v2] x86/paravirt: Disable virt spinlock on bare metal

2024-06-19 Thread Chen Yu
On 2024-06-19 at 18:34:34 +0300, Nikolay Borisov wrote:
> 
> 
> On 19.06.24 г. 18:25 ч., Chen Yu wrote:
> > Hi Nikolay,
> > 
> > On 2024-06-18 at 11:24:42 +0300, Nikolay Borisov wrote:
> > > 
> > > 
> > > On 26.05.24 г. 4:58 ч., Chen Yu wrote:
> > > > The kernel can change spinlock behavior when running as a guest. But
> > > > this guest-friendly behavior causes performance problems on bare metal.
> > > > So there's a 'virt_spin_lock_key' static key to switch between the two
> > > > modes.
> > > > 
> > > > The static key is always enabled by default (run in guest mode) and
> > > > should be disabled for bare metal (and in some guests that want native
> > > > behavior).
> > > > 
> > > > Performance drop is reported when running encode/decode workload and
> > > > BenchSEE cache sub-workload.
> > > > Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
> > > > native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
> > > > is disabled the virt_spin_lock_key is incorrectly set to true on bare
> > > > metal. The qspinlock degenerates to test-and-set spinlock, which
> > > > decrease the performance on bare metal.
> > > > 
> > > > Fix this by disabling virt_spin_lock_key if it is on bare metal,
> > > > regardless of CONFIG_PARAVIRT_SPINLOCKS.
> > > > 
> > > 
> > > nit:
> > > 
> > > This bug wouldn't have happened if the key was defined FALSE by default 
> > > and
> > > only enabled in the appropriate case. I think it makes more sense to 
> > > invert
> > > the logic and have the key FALSE by default and only enable it iff the
> > > kernel is running under a hypervisor... At worst only the virtualization
> > > case would suffer if the lock is falsely not enabled.
> > 
> > Thank you for your review. I agree, initializing the key to FALSE by
> > default seems to be more readable. Could this change be a subsequent
> > adjustment based on the current fix, which would be more bisectable?
> 
> Why can't this change be squashed in the current proposed patch?
>

The current patch deals with the incorrect condition on CONFIG_PARAVIRT_SPINLOCKS.
Changing the virt_spin_lock_key's default value is supposed to bring "no
functional change expected", but there might be some corner cases...
Anyway, I'll put the changes together in one patch and do some tests.
 
> > 
> > 
> > Set the default key to false. If booting in a VM, enable this key. Later,
> > during the VM initialization, if another more efficient spinlock is
> > preferred, like the paravirt spinlock, the virt_spin_lock_key will be
> > disabled accordingly.
> 
> Yep, or simply during the initialization stage the correct flavor will be
> chosen, no need to do the on-off dance. But that's a topic for a different
> discussion.
>

Yes it is doable. 

thanks,
Chenyu
> > 
> > diff --git a/arch/x86/include/asm/qspinlock.h 
> > b/arch/x86/include/asm/qspinlock.h
> > index cde8357bb226..a7d3ba00e70e 100644
> > --- a/arch/x86/include/asm/qspinlock.h
> > +++ b/arch/x86/include/asm/qspinlock.h
> > @@ -66,13 +66,13 @@ static inline bool vcpu_is_preempted(long cpu)
> >   #ifdef CONFIG_PARAVIRT
> >   /*
> > - * virt_spin_lock_key - enables (by default) the virt_spin_lock() hijack.
> > + * virt_spin_lock_key - disables (by default) the virt_spin_lock() hijack.
> >*
> >* Native (and PV wanting native due to vCPU pinning) should disable this 
> > key.
> >* It is done in this backwards fashion to only have a single direction 
> > change,
> >* which removes ordering between native_pv_spin_init() and HV setup.
> >*/
> > -DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
> > +DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);
> >   /*
> >* Shortcut for the queued_spin_lock_slowpath() function that allows
> > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> > index c193c9e60a1b..fec381533555 100644
> > --- a/arch/x86/kernel/paravirt.c
> > +++ b/arch/x86/kernel/paravirt.c
> > @@ -51,12 +51,12 @@ DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", 
> > .noinstr.text);
> >   DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text);
> >   #endif
> > -DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
> > +DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
> >   void __init native_pv_lock_init(void)
> >   {
> > -   if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> > -   static_branch_disable(&virt_spin_lock_key);
> > +   if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
> > +   static_branch_enable(&virt_spin_lock_key);
> >   }
> >   static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
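
For reference, the virt_spin_lock() fast path that this key gates looks
roughly like the following (a sketch from memory, not the exact tree
contents):

static inline bool virt_spin_lock(struct qspinlock *lock)
{
        if (!static_branch_likely(&virt_spin_lock_key))
                return false;   /* bare metal: use the real qspinlock */

        /*
         * Guest without paravirt spinlocks: degrade to a simple
         * test-and-set lock instead of queueing.
         */
        do {
                while (atomic_read(&lock->val) != 0)
                        cpu_relax();
        } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);

        return true;
}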



Re: [PATCH vhost 19/23] vdpa/mlx5: Use suspend/resume during VQP change

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> Resume a VQ if it is already created when the number of VQ pairs
> increases. This is done in preparation for VQ pre-creation which is
> coming in a later patch. It is necessary because calling setup_vq() on
> an already created VQ will return early and will not enable the queue.
>
> For symmetry, suspend a VQ instead of tearing it down when the number of
> VQ pairs decreases. But only if the resume operation is supported.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 0e1c1b7ff297..249b5afbe34a 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2130,14 +2130,22 @@ static int change_num_qps(struct mlx5_vdpa_dev 
> *mvdev, int newqps)
> if (err)
> return err;
>
> -   for (i = ndev->cur_num_vqs - 1; i >= 2 * newqps; i--)
> -   teardown_vq(ndev, &ndev->vqs[i]);
> +   for (i = ndev->cur_num_vqs - 1; i >= 2 * newqps; i--) {
> +   struct mlx5_vdpa_virtqueue *mvq = &ndev->vqs[i];
> +
> +   if (is_resumable(ndev))
> +   suspend_vq(ndev, mvq);
> +   else
> +   teardown_vq(ndev, mvq);
> +   }
>
> ndev->cur_num_vqs = 2 * newqps;
> } else {
> ndev->cur_num_vqs = 2 * newqps;
> for (i = cur_qps * 2; i < 2 * newqps; i++) {
> -   err = setup_vq(ndev, >vqs[i], true);
> +   struct mlx5_vdpa_virtqueue *mvq = &ndev->vqs[i];
> +
> +   err = mvq->initialized ? resume_vq(ndev, mvq) : 
> setup_vq(ndev, mvq, true);
> if (err)
> goto clean_added;
> }
>
> --
> 2.45.1
>




[PATCH v5 32/37] s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs

2024-06-19 Thread Ilya Leoshkevich
This is normally done by the generic entry code, but the
kernel_stack_overflow() flow bypasses it.

Reviewed-by: Alexander Potapenko 
Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/traps.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 52578b5cecbd..dde69d2a64f0 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <linux/kmsan.h>
 #include 
 #include 
 #include 
@@ -262,6 +263,11 @@ static void monitor_event_exception(struct pt_regs *regs)
 
 void kernel_stack_overflow(struct pt_regs *regs)
 {
+   /*
+* Normally regs are unpoisoned by the generic entry code, but
+* kernel_stack_overflow() is a rare case that is called bypassing it.
+*/
+   kmsan_unpoison_entry_regs(regs);
bust_spinlocks(1);
printk("Kernel stack overflow.\n");
show_regs(regs);
-- 
2.45.1




[PATCH v5 14/37] kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()

2024-06-19 Thread Ilya Leoshkevich
Improve the readability by replacing the custom aligning logic with
ALIGN_DOWN(). Unlike other places where a similar sequence is used,
there is no size parameter that needs to be adjusted, so the standard
macro fits.
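
A quick worked example of the equivalence (illustrative only, with
KMSAN_ORIGIN_SIZE == 4):

/* new form */
static_assert(ALIGN_DOWN(0x1007, 4) == 0x1004);
/* old form: addr -= addr % KMSAN_ORIGIN_SIZE */
static_assert(0x1007 - (0x1007 % 4) == 0x1004);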

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/shadow.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index 2d57408c78ae..9c58f081d84f 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -123,14 +123,12 @@ struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void 
*address, u64 size,
  */
 void *kmsan_get_metadata(void *address, bool is_origin)
 {
-   u64 addr = (u64)address, pad, off;
+   u64 addr = (u64)address, off;
struct page *page;
void *ret;
 
-   if (is_origin && !IS_ALIGNED(addr, KMSAN_ORIGIN_SIZE)) {
-   pad = addr % KMSAN_ORIGIN_SIZE;
-   addr -= pad;
-   }
+   if (is_origin)
+   addr = ALIGN_DOWN(addr, KMSAN_ORIGIN_SIZE);
address = (void *)addr;
if (kmsan_internal_is_vmalloc_addr(address) ||
kmsan_internal_is_module_addr(address))
-- 
2.45.1




[PATCH v5 05/37] kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces

2024-06-19 Thread Ilya Leoshkevich
Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap. Skip the comparison when this is the case.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/instrumentation.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index 470b0b4afcc4..8a1bbbc723ab 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -20,7 +20,8 @@
 
 static inline bool is_bad_asm_addr(void *addr, uintptr_t size, bool is_store)
 {
-   if ((u64)addr < TASK_SIZE)
+   if (IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) &&
+   (u64)addr < TASK_SIZE)
return true;
if (!kmsan_get_metadata(addr, KMSAN_META_SHADOW))
return true;
-- 
2.45.1




[PATCH v5 22/37] s390: Use a larger stack for KMSAN

2024-06-19 Thread Ilya Leoshkevich
Adjust the stack size for the KMSAN-enabled kernel like it was done
for the KASAN-enabled one in commit 7fef92ccadd7 ("s390/kasan: double
the stack size"). Both tools have similar requirements.

Reviewed-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/Makefile  | 2 +-
 arch/s390/include/asm/thread_info.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/Makefile b/arch/s390/Makefile
index f2b21c7a70ef..7fd57398221e 100644
--- a/arch/s390/Makefile
+++ b/arch/s390/Makefile
@@ -36,7 +36,7 @@ KBUILD_CFLAGS_DECOMPRESSOR += $(if 
$(CONFIG_DEBUG_INFO_DWARF4), $(call cc-option
 KBUILD_CFLAGS_DECOMPRESSOR += $(if 
$(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds)
 
 UTS_MACHINE:= s390x
-STACK_SIZE := $(if $(CONFIG_KASAN),65536,16384)
+STACK_SIZE := $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384))
 CHECKFLAGS += -D__s390__ -D__s390x__
 
 export LD_BFD
diff --git a/arch/s390/include/asm/thread_info.h 
b/arch/s390/include/asm/thread_info.h
index a674c7d25da5..d02a709717b8 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -16,7 +16,7 @@
 /*
  * General size of kernel stacks
  */
-#ifdef CONFIG_KASAN
+#if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN)
 #define THREAD_SIZE_ORDER 4
 #else
 #define THREAD_SIZE_ORDER 2
-- 
2.45.1




[PATCH v5 34/37] s390/uaccess: Add the missing linux/instrumented.h #include

2024-06-19 Thread Ilya Leoshkevich
uaccess.h uses instrument_get_user() and instrument_put_user(), which
are defined in linux/instrumented.h. Currently we get this header from
somewhere else by accident; prefer to be explicit about it and include
it directly.

Suggested-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/uaccess.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h
index 70f0edc00c2a..9213be0529ee 100644
--- a/arch/s390/include/asm/uaccess.h
+++ b/arch/s390/include/asm/uaccess.h
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/instrumented.h>
 
 void debug_user_asce(int exit);
 
-- 
2.45.1




[PATCH v5 27/37] s390/diag: Unpoison diag224() output buffer

2024-06-19 Thread Ilya Leoshkevich
Diagnose 224 stores 4k bytes, which currently cannot be deduced from
the inline assembly constraints. This leads to KMSAN false positives.

Fix the constraints by using a 4k-sized struct instead of a raw
pointer. While at it, prettify them too.
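
The constraint trick reads like this in isolation (a sketch, not kernel
code; the actual instruction is elided):

static inline void asm_stores_4k(void *ptr)
{
        /*
         * Casting ptr to a 4096-byte struct tells the compiler (and thus
         * KMSAN) that the asm statement may store up to 4096 bytes there.
         */
        asm volatile("" /* 4KiB-storing instruction elided */
                     : "=m" (*(struct { char buf[4096]; } *)ptr));
}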

Suggested-by: Heiko Carstens 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/diag.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kernel/diag.c b/arch/s390/kernel/diag.c
index 8dee9aa0ec95..8a7009618ba7 100644
--- a/arch/s390/kernel/diag.c
+++ b/arch/s390/kernel/diag.c
@@ -278,12 +278,14 @@ int diag224(void *ptr)
int rc = -EOPNOTSUPP;
 
diag_stat_inc(DIAG_STAT_X224);
-   asm volatile(
-   "   diag%1,%2,0x224\n"
-   "0: lhi %0,0x0\n"
+   asm volatile("\n"
+   "   diag%[type],%[addr],0x224\n"
+   "0: lhi %[rc],0\n"
"1:\n"
EX_TABLE(0b,1b)
-   : "+d" (rc) :"d" (0), "d" (addr) : "memory");
+   : [rc] "+d" (rc)
+   , "=m" (*(struct { char buf[PAGE_SIZE]; } *)ptr)
+   : [type] "d" (0), [addr] "d" (addr));
return rc;
 }
 EXPORT_SYMBOL(diag224);
-- 
2.45.1




[PATCH v5 19/37] lib/zlib: Unpoison DFLTCC output buffers

2024-06-19 Thread Ilya Leoshkevich
The constraints of the DFLTCC inline assembly are not precise: they
do not communicate the size of the output buffers to the compiler, so
it cannot automatically instrument it.

Add the manual kmsan_unpoison_memory() calls for the output buffers.
The logic is the same as in [1].

[1] 
https://github.com/zlib-ng/zlib-ng/commit/1f5ddcc009ac3511e99fc88736a9e1a6381168c5

Reported-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 lib/zlib_dfltcc/dfltcc.h  |  1 +
 lib/zlib_dfltcc/dfltcc_util.h | 28 
 2 files changed, 29 insertions(+)

diff --git a/lib/zlib_dfltcc/dfltcc.h b/lib/zlib_dfltcc/dfltcc.h
index b96232bdd44d..0f2a16d7a48a 100644
--- a/lib/zlib_dfltcc/dfltcc.h
+++ b/lib/zlib_dfltcc/dfltcc.h
@@ -80,6 +80,7 @@ struct dfltcc_param_v0 {
 uint8_t csb[1152];
 };
 
+static_assert(offsetof(struct dfltcc_param_v0, csb) == 384);
 static_assert(sizeof(struct dfltcc_param_v0) == 1536);
 
 #define CVT_CRC32 0
diff --git a/lib/zlib_dfltcc/dfltcc_util.h b/lib/zlib_dfltcc/dfltcc_util.h
index 4a46b5009f0d..10509270d822 100644
--- a/lib/zlib_dfltcc/dfltcc_util.h
+++ b/lib/zlib_dfltcc/dfltcc_util.h
@@ -2,6 +2,8 @@
 #ifndef DFLTCC_UTIL_H
 #define DFLTCC_UTIL_H
 
+#include "dfltcc.h"
+#include <linux/kmsan-checks.h>
 #include 
 
 /*
@@ -20,6 +22,7 @@ typedef enum {
 #define DFLTCC_CMPR 2
 #define DFLTCC_XPND 4
 #define HBT_CIRCULAR (1 << 7)
+#define DFLTCC_FN_MASK ((1 << 7) - 1)
 #define HB_BITS 15
 #define HB_SIZE (1 << HB_BITS)
 
@@ -34,6 +37,7 @@ static inline dfltcc_cc dfltcc(
 )
 {
 Byte *t2 = op1 ? *op1 : NULL;
+unsigned char *orig_t2 = t2;
 size_t t3 = len1 ? *len1 : 0;
 const Byte *t4 = op2 ? *op2 : NULL;
 size_t t5 = len2 ? *len2 : 0;
@@ -59,6 +63,30 @@ static inline dfltcc_cc dfltcc(
  : "cc", "memory");
 t2 = r2; t3 = r3; t4 = r4; t5 = r5;
 
+/*
+ * Unpoison the parameter block and the output buffer.
+ * This is a no-op in non-KMSAN builds.
+ */
+switch (fn & DFLTCC_FN_MASK) {
+case DFLTCC_QAF:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_qaf_param));
+break;
+case DFLTCC_GDHT:
+kmsan_unpoison_memory(param, offsetof(struct dfltcc_param_v0, csb));
+break;
+case DFLTCC_CMPR:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0));
+kmsan_unpoison_memory(
+orig_t2,
+t2 - orig_t2 +
+(((struct dfltcc_param_v0 *)param)->sbb == 0 ? 0 : 1));
+break;
+case DFLTCC_XPND:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0));
+kmsan_unpoison_memory(orig_t2, t2 - orig_t2);
+break;
+}
+
 if (op1)
 *op1 = t2;
 if (len1)
-- 
2.45.1




[PATCH v5 04/37] kmsan: Increase the maximum store size to 4096

2024-06-19 Thread Ilya Leoshkevich
The inline assembly block in s390's chsc() stores that much.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/instrumentation.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index cc3907a9c33a..470b0b4afcc4 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -110,11 +110,10 @@ void __msan_instrument_asm_store(void *addr, uintptr_t 
size)
 
ua_flags = user_access_save();
/*
-* Most of the accesses are below 32 bytes. The two exceptions so far
-* are clwb() (64 bytes) and FPU state (512 bytes).
-* It's unlikely that the assembly will touch more than 512 bytes.
+* Most of the accesses are below 32 bytes. The exceptions so far are
+* clwb() (64 bytes), FPU state (512 bytes) and chsc() (4096 bytes).
 */
-   if (size > 512) {
+   if (size > 4096) {
WARN_ONCE(1, "assembly store size too big: %ld\n", size);
size = 8;
}
-- 
2.45.1




[PATCH v5 35/37] s390/unwind: Disable KMSAN checks

2024-06-19 Thread Ilya Leoshkevich
The unwind code can read uninitialized frames. Furthermore, even in
the good case, KMSAN does not emit shadow for backchains. Therefore
disable it for the unwinding functions.

Reviewed-by: Alexander Potapenko 
Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/unwind_bc.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c
index 0ece156fdd7c..cd44be2b6ce8 100644
--- a/arch/s390/kernel/unwind_bc.c
+++ b/arch/s390/kernel/unwind_bc.c
@@ -49,6 +49,8 @@ static inline bool is_final_pt_regs(struct unwind_state 
*state,
   READ_ONCE_NOCHECK(regs->psw.mask) & PSW_MASK_PSTATE;
 }
 
+/* Avoid KMSAN false positives from touching uninitialized frames. */
+__no_kmsan_checks
 bool unwind_next_frame(struct unwind_state *state)
 {
struct stack_info *info = &state->stack_info;
@@ -118,6 +120,8 @@ bool unwind_next_frame(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_next_frame);
 
+/* Avoid KMSAN false positives from touching uninitialized frames. */
+__no_kmsan_checks
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long first_frame)
 {
-- 
2.45.1




[PATCH v5 20/37] kmsan: Accept ranges starting with 0 on s390

2024-06-19 Thread Ilya Leoshkevich
On s390 the virtual address 0 is valid (current CPU's lowcore is mapped
there), therefore KMSAN should not complain about it.

Disable the respective check on s390. There doesn't seem to be a
Kconfig option to describe this situation, so explicitly check for
s390.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/init.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index 9de76ac7062c..3f8b1bbb9060 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -33,7 +33,10 @@ static void __init kmsan_record_future_shadow_range(void 
*start, void *end)
bool merged = false;
 
KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES);
-   KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend);
+   KMSAN_WARN_ON((nstart >= nend) ||
+ /* Virtual address 0 is valid on s390. */
+ (!IS_ENABLED(CONFIG_S390) && !nstart) ||
+ !nend);
nstart = ALIGN_DOWN(nstart, PAGE_SIZE);
nend = ALIGN(nend, PAGE_SIZE);
 
-- 
2.45.1




[PATCH v5 33/37] s390/uaccess: Add KMSAN support to put_user() and get_user()

2024-06-19 Thread Ilya Leoshkevich
put_user() uses inline assembly with precise constraints, so Clang is
in principle capable of instrumenting it automatically. Unfortunately,
one of the constraints contains a dereferenced user pointer, and Clang
does not currently distinguish user and kernel pointers. Therefore
KMSAN attempts to access shadow for user pointers, which is not a right
thing to do.

An obvious fix to add __no_sanitize_memory to __put_user_fn() does not
work, since it's __always_inline. And __always_inline cannot be removed
due to the __put_user_bad() trick.

A different obvious fix of using the "a" instead of the "+Q" constraint
degrades the code quality, which is very important here, since it's a
hot path.

Instead, repurpose the __put_user_asm() macro to define
__put_user_{char,short,int,long}_noinstr() functions and mark them with
__no_sanitize_memory. For the non-KMSAN builds make them
__always_inline in order to keep the generated code quality. Also
define __put_user_{char,short,int,long}() functions, which call the
aforementioned ones and which *are* instrumented, because they call
KMSAN hooks, which may be implemented as macros.

The same applies to get_user() as well.

Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/uaccess.h | 111 +++-
 1 file changed, 79 insertions(+), 32 deletions(-)

diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h
index 81ae8a98e7ec..70f0edc00c2a 100644
--- a/arch/s390/include/asm/uaccess.h
+++ b/arch/s390/include/asm/uaccess.h
@@ -78,13 +78,24 @@ union oac {
 
 int __noreturn __put_user_bad(void);
 
-#define __put_user_asm(to, from, size) \
-({ \
+#ifdef CONFIG_KMSAN
+#define get_put_user_noinstr_attributes \
+   noinline __maybe_unused __no_sanitize_memory
+#else
+#define get_put_user_noinstr_attributes __always_inline
+#endif
+
+#define DEFINE_PUT_USER(type)  \
+static get_put_user_noinstr_attributes int \
+__put_user_##type##_noinstr(unsigned type __user *to,  \
+   unsigned type *from,\
+   unsigned long size) \
+{  \
union oac __oac_spec = {\
.oac1.as = PSW_BITS_AS_SECONDARY,   \
.oac1.a = 1,\
};  \
-   int __rc;   \
+   int rc; \
\
asm volatile(   \
"   lr  0,%[spec]\n"\
@@ -93,12 +104,28 @@ int __noreturn __put_user_bad(void);
"2:\n"  \
EX_TABLE_UA_STORE(0b, 2b, %[rc])\
EX_TABLE_UA_STORE(1b, 2b, %[rc])\
-   : [rc] "=" (__rc), [_to] "+Q" (*(to)) \
+   : [rc] "=" (rc), [_to] "+Q" (*(to))   \
: [_size] "d" (size), [_from] "Q" (*(from)),\
  [spec] "d" (__oac_spec.val)   \
: "cc", "0");   \
-   __rc;   \
-})
+   return rc;  \
+}  \
+   \
+static __always_inline int \
+__put_user_##type(unsigned type __user *to, unsigned type *from,   \
+ unsigned long size)   \
+{  \
+   int rc; \
+   \
+   rc = __put_user_##type##_noinstr(to, from, size);   \
+   instrument_put_user(*from, to, size);   \
+   return rc;  \
+}
+
+DEFINE_PUT_USER(char);
+DEFINE_PUT_USER(short);
+DEFINE_PUT_USER(int);
+DEFINE_PUT_USER(long);
 
 static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned 
long size)
 {
@@ -106,24 +133,24 @@ static __always_inline int __put_user_fn(void *x, void 
__user *ptr, unsigned lon
 
 

[PATCH v5 09/37] kmsan: Expose kmsan_get_metadata()

2024-06-19 Thread Ilya Leoshkevich
Each s390 CPU has lowcore pages associated with it. Each CPU sees its
own lowcore at virtual address 0 through a hardware mechanism called
prefixing. Additionally, all lowcores are mapped to non-0 virtual
addresses stored in the lowcore_ptr[] array.

When lowcore is accessed through virtual address 0, one needs to
resolve metadata for lowcore_ptr[raw_smp_processor_id()].

Expose kmsan_get_metadata() to make it possible to do this from the
arch code.
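
For reference, patch 36 of this series uses the newly exposed function
in its arch hook roughly as follows (an abridged sketch; see that patch
for the full version including the recursion check):

static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin)
{
	if (is_lowcore_addr(addr)) {
		/* Resolve the prefix to get a distinct struct page. */
		addr += (void *)lowcore_ptr[raw_smp_processor_id()] -
			(void *)&S390_lowcore;
		return kmsan_get_metadata(addr, is_origin);
	}
	return NULL;
}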

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 include/linux/kmsan.h  | 9 +
 mm/kmsan/instrumentation.c | 1 +
 mm/kmsan/kmsan.h   | 1 -
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index e0c23a32cdf0..fe6c2212bdb1 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -230,6 +230,15 @@ void kmsan_handle_urb(const struct urb *urb, bool is_out);
  */
 void kmsan_unpoison_entry_regs(const struct pt_regs *regs);
 
+/**
+ * kmsan_get_metadata() - Return a pointer to KMSAN shadow or origins.
+ * @addr:  kernel address.
+ * @is_origin: whether to return origins or shadow.
+ *
+ * Return NULL if metadata cannot be found.
+ */
+void *kmsan_get_metadata(void *addr, bool is_origin);
+
 #else
 
 static inline void kmsan_init_shadow(void)
diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index 8a1bbbc723ab..94b49fac9d8b 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -14,6 +14,7 @@
 
 #include "kmsan.h"
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h
index adf443bcffe8..34b83c301d57 100644
--- a/mm/kmsan/kmsan.h
+++ b/mm/kmsan/kmsan.h
@@ -66,7 +66,6 @@ struct shadow_origin_ptr {
 
 struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size,
 bool store);
-void *kmsan_get_metadata(void *addr, bool is_origin);
 void __init kmsan_init_alloc_meta_for_range(void *start, void *end);
 
 enum kmsan_bug_reason {
-- 
2.45.1




[PATCH v5 23/37] s390/boot: Add the KMSAN runtime stub

2024-06-19 Thread Ilya Leoshkevich
It should be possible to have inline functions in the s390 header
files, which call kmsan_unpoison_memory(). The problem is that these
header files might be included by the decompressor, which does not
contain KMSAN runtime, causing linker errors.

Not compiling these calls if __SANITIZE_MEMORY__ is not defined -
either by changing kmsan-checks.h or at the call sites - may cause
unintended side effects, since calling these functions from
uninstrumented code that is linked into the kernel is a valid use case.

One might want to explicitly distinguish between the kernel and the
decompressor. Checking for a decompressor-specific #define is quite
heavy-handed, and will have to be done at all call sites.

A more generic approach is to provide a dummy kmsan_unpoison_memory()
definition. This produces some runtime overhead, but only when building
with CONFIG_KMSAN. The benefit is that it does not disturb the existing
KMSAN build logic and call sites don't need to be changed.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/Makefile | 1 +
 arch/s390/boot/kmsan.c  | 6 ++
 2 files changed, 7 insertions(+)
 create mode 100644 arch/s390/boot/kmsan.c

diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
index 526ed20b9d31..e7658997452b 100644
--- a/arch/s390/boot/Makefile
+++ b/arch/s390/boot/Makefile
@@ -44,6 +44,7 @@ obj-$(findstring y, $(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) 
$(CONFIG_PGSTE)) +=
 obj-$(CONFIG_RANDOMIZE_BASE)   += kaslr.o
 obj-y  += $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) info.o
 obj-$(CONFIG_KERNEL_ZSTD) += clz_ctz.o
+obj-$(CONFIG_KMSAN) += kmsan.o
 obj-all := $(obj-y) piggy.o syms.o
 
 targets:= bzImage section_cmp.boot.data 
section_cmp.boot.preserved.data $(obj-y)
diff --git a/arch/s390/boot/kmsan.c b/arch/s390/boot/kmsan.c
new file mode 100644
index 000000000000..e7b3ac48143e
--- /dev/null
+++ b/arch/s390/boot/kmsan.c
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+
+void kmsan_unpoison_memory(const void *address, size_t size)
+{
+}
-- 
2.45.1




[PATCH v5 29/37] s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN

2024-06-19 Thread Ilya Leoshkevich
Lockdep generates the following false positives with KMSAN on s390x:

[6.063666] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
[ ...]
[6.577050] Call Trace:
[6.619637]  [<0690d2de>] check_flags+0x1fe/0x210
[6.665411] ([<0690d2da>] check_flags+0x1fa/0x210)
[6.707478]  [<006cec1a>] lock_acquire+0x2ca/0xce0
[6.749959]  [<069820ea>] _raw_spin_lock_irqsave+0xea/0x190
[6.794912]  [<041fc988>] __stack_depot_save+0x218/0x5b0
[6.838420]  [<0197affe>] __msan_poison_alloca+0xfe/0x1a0
[6.882985]  [<07c5827c>] start_kernel+0x70c/0xd50
[6.927454]  [<00100036>] startup_continue+0x36/0x40

Between trace_hardirqs_on() and `stosm __mask, 3` lockdep thinks that
interrupts are on, but on the CPU they are still off. KMSAN
instrumentation takes spinlocks, giving lockdep a chance to see and
complain about this discrepancy.

KMSAN instrumentation is inserted in order to poison the __mask
variable. Disable instrumentation in the respective functions. They are
very small and it's easy to see that no important metadata updates are
lost because of this.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/irqflags.h | 17 ++---
 drivers/s390/char/sclp.c |  2 +-
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/irqflags.h b/arch/s390/include/asm/irqflags.h
index 02427b205c11..bcab456dfb80 100644
--- a/arch/s390/include/asm/irqflags.h
+++ b/arch/s390/include/asm/irqflags.h
@@ -37,12 +37,18 @@ static __always_inline void __arch_local_irq_ssm(unsigned 
long flags)
asm volatile("ssm   %0" : : "Q" (flags) : "memory");
 }
 
-static __always_inline unsigned long arch_local_save_flags(void)
+#ifdef CONFIG_KMSAN
+#define arch_local_irq_attributes noinline notrace __no_sanitize_memory 
__maybe_unused
+#else
+#define arch_local_irq_attributes __always_inline
+#endif
+
+static arch_local_irq_attributes unsigned long arch_local_save_flags(void)
 {
return __arch_local_irq_stnsm(0xff);
 }
 
-static __always_inline unsigned long arch_local_irq_save(void)
+static arch_local_irq_attributes unsigned long arch_local_irq_save(void)
 {
return __arch_local_irq_stnsm(0xfc);
 }
@@ -52,7 +58,12 @@ static __always_inline void arch_local_irq_disable(void)
arch_local_irq_save();
 }
 
-static __always_inline void arch_local_irq_enable(void)
+static arch_local_irq_attributes void arch_local_irq_enable_external(void)
+{
+   __arch_local_irq_stosm(0x01);
+}
+
+static arch_local_irq_attributes void arch_local_irq_enable(void)
 {
__arch_local_irq_stosm(0x03);
 }
diff --git a/drivers/s390/char/sclp.c b/drivers/s390/char/sclp.c
index d53ee34d398f..fb1d9949adca 100644
--- a/drivers/s390/char/sclp.c
+++ b/drivers/s390/char/sclp.c
@@ -736,7 +736,7 @@ sclp_sync_wait(void)
cr0_sync.val = cr0.val & ~CR0_IRQ_SUBCLASS_MASK;
cr0_sync.val |= 1UL << (63 - 54);
local_ctl_load(0, &cr0_sync);
-   __arch_local_irq_stosm(0x01);
+   arch_local_irq_enable_external();
/* Loop until driver state indicates finished request */
while (sclp_running_state != sclp_running_state_idle) {
/* Check for expired request timer */
-- 
2.45.1




[PATCH v5 16/37] mm: slub: Let KMSAN access metadata

2024-06-19 Thread Ilya Leoshkevich
Building the kernel with CONFIG_SLUB_DEBUG and CONFIG_KMSAN causes
KMSAN to complain about touching redzones in kfree().

Fix by extending the existing KASAN-related metadata_access_enable()
and metadata_access_disable() functions to KMSAN.

Acked-by: Vlastimil Babka 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/slub.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 1134091abac5..b050e528112c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -829,10 +829,12 @@ static int disable_higher_order_debug;
 static inline void metadata_access_enable(void)
 {
kasan_disable_current();
+   kmsan_disable_current();
 }
 
 static inline void metadata_access_disable(void)
 {
+   kmsan_enable_current();
kasan_enable_current();
 }
 
-- 
2.45.1




[PATCH v5 15/37] kmsan: Do not round up pg_data_t size

2024-06-19 Thread Ilya Leoshkevich
x86's alloc_node_data() rounds up node data size to PAGE_SIZE. It's not
explained why it's needed, but it's most likely for performance
reasons, since the padding bytes are not used anywhere. Some other
architectures do it as well, e.g., mips rounds it up to the cache line
size.

kmsan_init_shadow() initializes metadata for each node data and assumes
the x86 rounding, which does not match other architectures. This may
cause the range end to overshoot the end of available memory, in turn
causing virt_to_page_or_null() in kmsan_init_alloc_meta_for_range() to
return NULL, which leads to kernel panic shortly after.

Since the padding bytes are not used, drop the rounding.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index 3ac3b8921d36..9de76ac7062c 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -72,7 +72,7 @@ static void __init kmsan_record_future_shadow_range(void 
*start, void *end)
  */
 void __init kmsan_init_shadow(void)
 {
-   const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+   const size_t nd_size = sizeof(pg_data_t);
phys_addr_t p_start, p_end;
u64 loop;
int nid;
-- 
2.45.1




[PATCH v5 18/37] mm: kfence: Disable KMSAN when checking the canary

2024-06-19 Thread Ilya Leoshkevich
KMSAN warns about check_canary() accessing the canary.

The reason is that, even though set_canary() is properly instrumented
and sets shadow, slub explicitly poisons the canary's address range
afterwards.

Unpoisoning the canary is not the right thing to do: only
check_canary() is supposed to ever touch it. Instead, disable KMSAN
checks around canary read accesses.

Reviewed-by: Alexander Potapenko 
Tested-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kfence/core.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 964b8482275b..83f8e78827c0 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -305,8 +305,14 @@ metadata_update_state(struct kfence_metadata *meta, enum 
kfence_object_state nex
WRITE_ONCE(meta->state, next);
 }
 
+#ifdef CONFIG_KMSAN
+#define check_canary_attributes noinline __no_kmsan_checks
+#else
+#define check_canary_attributes inline
+#endif
+
 /* Check canary byte at @addr. */
-static inline bool check_canary_byte(u8 *addr)
+static check_canary_attributes bool check_canary_byte(u8 *addr)
 {
struct kfence_metadata *meta;
unsigned long flags;
@@ -341,7 +347,8 @@ static inline void set_canary(const struct kfence_metadata 
*meta)
*((u64 *)addr) = KFENCE_CANARY_PATTERN_U64;
 }
 
-static inline void check_canary(const struct kfence_metadata *meta)
+static check_canary_attributes void
+check_canary(const struct kfence_metadata *meta)
 {
const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE);
unsigned long addr = pageaddr;
-- 
2.45.1




[PATCH v5 30/37] s390/mm: Define KMSAN metadata for vmalloc and modules

2024-06-19 Thread Ilya Leoshkevich
The pages for the KMSAN metadata associated with most kernel mappings
are taken from memblock by the common code. However, vmalloc and module
metadata needs to be defined by the architectures.

Be a little bit more careful than x86: allocate exactly MODULES_LEN
for the module shadow and origins, and then take 2/3 of vmalloc for
the vmalloc shadow and origins. This ensures that users passing small
vmalloc= values on the command line do not cause module metadata
collisions.
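
Combined with the KMSAN_* defines below, the resulting layout, from low
to high addresses, is (a sketch; region sizes in parentheses):

  VMALLOC_START
    vmalloc area           (vmalloc_size)
  VMALLOC_END == KMSAN_VMALLOC_SHADOW_START
    vmalloc shadow         (vmalloc_size)
  KMSAN_VMALLOC_ORIGIN_START
    vmalloc origins        (vmalloc_size)
  KMSAN_MODULES_SHADOW_START
    module shadow          (MODULES_LEN)
  KMSAN_MODULES_ORIGIN_START
    module origins         (MODULES_LEN)
  MODULES_VADDR
    modules                (MODULES_LEN)
  MODULES_END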

Reviewed-by: Alexander Potapenko 
Acked-by: Alexander Gordeev 
Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/startup.c| 7 +++
 arch/s390/include/asm/pgtable.h | 8 
 2 files changed, 15 insertions(+)

diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
index 48ef5fe5c08a..d6b0d114939a 100644
--- a/arch/s390/boot/startup.c
+++ b/arch/s390/boot/startup.c
@@ -301,11 +301,18 @@ static unsigned long setup_kernel_memory_layout(unsigned 
long kernel_size)
MODULES_END = round_down(kernel_start, _SEGMENT_SIZE);
MODULES_VADDR = MODULES_END - MODULES_LEN;
VMALLOC_END = MODULES_VADDR;
+   if (IS_ENABLED(CONFIG_KMSAN))
+   VMALLOC_END -= MODULES_LEN * 2;
 
/* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
space left */
vsize = (VMALLOC_END - FIXMAP_SIZE) / 2;
vsize = round_down(vsize, _SEGMENT_SIZE);
vmalloc_size = min(vmalloc_size, vsize);
+   if (IS_ENABLED(CONFIG_KMSAN)) {
+   /* take 2/3 of vmalloc area for KMSAN shadow and origins */
+   vmalloc_size = round_down(vmalloc_size / 3, _SEGMENT_SIZE);
+   VMALLOC_END -= vmalloc_size * 2;
+   }
VMALLOC_START = VMALLOC_END - vmalloc_size;
 
__memcpy_real_area = round_down(VMALLOC_START - MEMCPY_REAL_SIZE, 
PAGE_SIZE);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 70b6ee557eb2..2f44c23efec0 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -107,6 +107,14 @@ static inline int is_module_addr(void *addr)
return 1;
 }
 
+#ifdef CONFIG_KMSAN
+#define KMSAN_VMALLOC_SIZE (VMALLOC_END - VMALLOC_START)
+#define KMSAN_VMALLOC_SHADOW_START VMALLOC_END
+#define KMSAN_VMALLOC_ORIGIN_START (KMSAN_VMALLOC_SHADOW_START + 
KMSAN_VMALLOC_SIZE)
+#define KMSAN_MODULES_SHADOW_START (KMSAN_VMALLOC_ORIGIN_START + 
KMSAN_VMALLOC_SIZE)
+#define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN)
+#endif
+
 #ifdef CONFIG_RANDOMIZE_BASE
 #define KASLR_LEN  (1UL << 31)
 #else
-- 
2.45.1




[PATCH v5 01/37] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()

2024-06-19 Thread Ilya Leoshkevich
Architectures use assembly code to initialize ftrace_regs and call
ftrace_ops_list_func(). Therefore, from KMSAN's point of view,
ftrace_regs is poisoned on ftrace_ops_list_func() entry. This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the architecture-specific assembly code and always
unpoisoning ftrace_regs in ftrace_ops_list_func.

The issue was not encountered on x86_64 so far only by accident:
assembly-allocated ftrace_regs was overlapping a stale partially
unpoisoned stack frame. Poisoning stack frames before returns [1]
makes the issue appear on x86_64 as well.

[1] 
https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/

Reviewed-by: Alexander Potapenko 
Acked-by: Steven Rostedt (Google) 
Signed-off-by: Ilya Leoshkevich 
---
 kernel/trace/ftrace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 65208d3b5ed9..c35ad4362d71 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7407,6 +7407,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long 
parent_ip,
 void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
+   kmsan_unpoison_memory(fregs, sizeof(*fregs));
__ftrace_ops_list_func(ip, parent_ip, NULL, fregs);
 }
 #else
-- 
2.45.1




[PATCH v5 28/37] s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()

2024-06-19 Thread Ilya Leoshkevich
s390 uses assembly code to initialize ftrace_regs and call
kprobe_ftrace_handler(). Therefore, from KMSAN's point of view,
ftrace_regs is poisoned on kprobe_ftrace_handler() entry. This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the assembly code and always unpoisoning ftrace_regs in
kprobe_ftrace_handler().

Reviewed-by: Alexander Potapenko 
Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/ftrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index ddf2ee47cb87..0bd6adc40a34 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -303,6 +304,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long 
parent_ip,
if (bit < 0)
return;
 
+   kmsan_unpoison_memory(fregs, sizeof(*fregs));
regs = ftrace_get_regs(fregs);
p = get_kprobe((kprobe_opcode_t *)ip);
if (!regs || unlikely(!p) || kprobe_disabled(p))
-- 
2.45.1




[PATCH v5 36/37] s390/kmsan: Implement the architecture-specific functions

2024-06-19 Thread Ilya Leoshkevich
arch_kmsan_get_meta_or_null() finds the lowcore shadow by querying the
prefix and calling kmsan_get_metadata() again.

kmsan_virt_addr_valid() delegates to virt_addr_valid().

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/kmsan.h | 59 +++
 1 file changed, 59 insertions(+)
 create mode 100644 arch/s390/include/asm/kmsan.h

diff --git a/arch/s390/include/asm/kmsan.h b/arch/s390/include/asm/kmsan.h
new file mode 100644
index 000000000000..eb850c942204
--- /dev/null
+++ b/arch/s390/include/asm/kmsan.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_S390_KMSAN_H
+#define _ASM_S390_KMSAN_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef MODULE
+
+static inline bool is_lowcore_addr(void *addr)
+{
+   return addr >= (void *)&S390_lowcore &&
+  addr < (void *)(&S390_lowcore + 1);
+}
+
+static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin)
+{
+   if (is_lowcore_addr(addr)) {
+   /*
+* Different lowcores accessed via S390_lowcore are described
+* by the same struct page. Resolve the prefix manually in
+* order to get a distinct struct page.
+*/
+   addr += (void *)lowcore_ptr[raw_smp_processor_id()] -
+   (void *)&S390_lowcore;
+   if (WARN_ON_ONCE(is_lowcore_addr(addr)))
+   return NULL;
+   return kmsan_get_metadata(addr, is_origin);
+   }
+   return NULL;
+}
+
+static inline bool kmsan_virt_addr_valid(void *addr)
+{
+   bool ret;
+
+   /*
+* pfn_valid() relies on RCU, and may call into the scheduler on exiting
+* the critical section. However, this would result in recursion with
+* KMSAN. Therefore, disable preemption here, and re-enable preemption
+* below while suppressing reschedules to avoid recursion.
+*
+* Note, this sacrifices occasionally breaking scheduling guarantees.
+* Although, a kernel compiled with KMSAN has already given up on any
+* performance guarantees due to being heavily instrumented.
+*/
+   preempt_disable();
+   ret = virt_addr_valid(addr);
+   preempt_enable_no_resched();
+
+   return ret;
+}
+
+#endif /* !MODULE */
+
+#endif /* _ASM_S390_KMSAN_H */
-- 
2.45.1




[PATCH v5 37/37] kmsan: Enable on s390

2024-06-19 Thread Ilya Leoshkevich
Now that everything else is in place, enable KMSAN in Kconfig.

Acked-by: Heiko Carstens 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c59d2b54df49..3cba4993d7c7 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -158,6 +158,7 @@ config S390
select HAVE_ARCH_KASAN
select HAVE_ARCH_KASAN_VMALLOC
select HAVE_ARCH_KCSAN
+   select HAVE_ARCH_KMSAN
select HAVE_ARCH_KFENCE
select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
select HAVE_ARCH_SECCOMP_FILTER
-- 
2.45.1




[PATCH v5 00/37] kmsan: Enable on s390

2024-06-19 Thread Ilya Leoshkevich
v4: https://lore.kernel.org/lkml/20240613153924.961511-1-...@linux.ibm.com/
v4 -> v5: Fix the __memset() build issue.
  Change the attribute #defines to lowercase in order to match
  the existing code style.
  Fix the kmsan_virt_addr_valid() implementation to avoid
  recursion in debug builds, like it's done on x86_64 - dropped
  R-bs, please take another look.
  Add kmsan_disable_current()/kmsan_enable_current() doc;
  Fix the poisoned memchr_inv() value in a different way;
  Add the missing linux/instrumented.h #include;
  (Alexander P.).
  Patches that need review:
  - [PATCH 12/37] kmsan: Introduce memset_no_sanitize_memory()
  - [PATCH 13/37] kmsan: Support SLAB_POISON
  - [PATCH 17/37] mm: slub: Disable KMSAN when checking the padding 
bytes
  - [PATCH 36/37] s390/kmsan: Implement the architecture-specific 
functions

v3: https://lore.kernel.org/lkml/20231213233605.661251-1-...@linux.ibm.com/
v3 -> v4: Rebase.
  Elaborate why ftrace_ops_list_func() change is needed on
  x86_64 (Steven).
  Add a comment to the DFLTCC patch (Alexander P.).
  Simplify diag224();
  Improve __arch_local_irq_attributes style;
  Use IS_ENABLED(CONFIG_KMSAN) for vmalloc area (Heiko).
  Align vmalloc area on _SEGMENT_SIZE (Alexander G.).

v2: https://lore.kernel.org/lkml/20231121220155.1217090-1-...@linux.ibm.com/
v2 -> v3: Drop kmsan_memmove_metadata() and strlcpy() patches;
  Remove kmsan_get_metadata() stub;
  Move kmsan_enable_current() and kmsan_disable_current() to
  include/linux/kmsan.h, explain why a counter is needed;
  Drop the memset_no_sanitize_memory() patch;
  Use __memset() in the SLAB_POISON patch;
  Add kmsan-checks.h to the DFLTCC patch;
  Add recursion check to the arch_kmsan_get_meta_or_null()
  patch (Alexander P.).

  Fix inline + __no_kmsan_checks issues.
  New patch for s390/irqflags, that resolves a lockdep warning.
  New patch for s390/diag, that resolves a false positive when
  running on an LPAR.
  New patch for STCCTM, same as above.
  New patch for check_bytes_and_report() that resolves a false
  positive that occurs even on Intel.

v1: https://lore.kernel.org/lkml/20231115203401.2495875-1-...@linux.ibm.com/
v1 -> v2: Add comments, sort #includes, introduce
  memset_no_sanitize_memory() and use it to avoid unpoisoning
  of redzones, change vmalloc alignment to _REGION3_SIZE, add
  R-bs (Alexander P.).

  Fix building
  [PATCH 28/33] s390/string: Add KMSAN support
  with FORTIFY_SOURCE.
  Reported-by: kernel test robot 
  Closes: 
https://lore.kernel.org/oe-kbuild-all/202311170550.bsbo44ix-...@intel.com/

Hi,

This series provides the minimal support for Kernel Memory Sanitizer on
s390. Kernel Memory Sanitizer is clang-only instrumentation for finding
accesses to uninitialized memory. The clang support for s390 has already
been merged [1].

With this series, I can successfully boot s390 defconfig and
debug_defconfig with kmsan.panic=1. The tool found one real
s390-specific bug (fixed in master).

Best regards,
Ilya

[1] https://reviews.llvm.org/D148596

Ilya Leoshkevich (37):
  ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
  kmsan: Make the tests compatible with kmsan.panic=1
  kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled
  kmsan: Increase the maximum store size to 4096
  kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces
  kmsan: Fix kmsan_copy_to_user() on arches with overlapping address
spaces
  kmsan: Remove a useless assignment from
kmsan_vmap_pages_range_noflush()
  kmsan: Remove an x86-specific #include from kmsan.h
  kmsan: Expose kmsan_get_metadata()
  kmsan: Export panic_on_kmsan
  kmsan: Allow disabling KMSAN checks for the current task
  kmsan: Introduce memset_no_sanitize_memory()
  kmsan: Support SLAB_POISON
  kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()
  kmsan: Do not round up pg_data_t size
  mm: slub: Let KMSAN access metadata
  mm: slub: Disable KMSAN when checking the padding bytes
  mm: kfence: Disable KMSAN when checking the canary
  lib/zlib: Unpoison DFLTCC output buffers
  kmsan: Accept ranges starting with 0 on s390
  s390/boot: Turn off KMSAN
  s390: Use a larger stack for KMSAN
  s390/boot: Add the KMSAN runtime stub
  s390/checksum: Add a KMSAN check
  s390/cpacf: Unpoison the results of cpacf_trng()
  s390/cpumf: Unpoison STCCTM output buffer
  s390/diag: Unpoison diag224() output buffer
  s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()
  s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN
  s390/mm: Define KMSAN metadata for vmalloc and modules
  s390/string: Add KMSAN support
  s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs
  

[PATCH v5 17/37] mm: slub: Disable KMSAN when checking the padding bytes

2024-06-19 Thread Ilya Leoshkevich
Even though the KMSAN warnings generated by memchr_inv() are suppressed
by metadata_access_enable(), its return value may still be poisoned.

The reason is that the last iteration of memchr_inv() returns
`*start != value ? start : NULL`, where *start is poisoned. Because of
this, somewhat counterintuitively, the shadow value computed by
visitSelectInst() is equal to `(uintptr_t)start`.

One possibility to fix this, since the intention behind guarding
memchr_inv() behind metadata_access_enable() is to touch poisoned
metadata without triggering KMSAN, is to unpoison its return value.
However, this approach is too fragile. So simply disable the KMSAN
checks in the respective functions.

Signed-off-by: Ilya Leoshkevich 
---
 mm/slub.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b050e528112c..fcd68fcea4ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1176,9 +1176,16 @@ static void restore_bytes(struct kmem_cache *s, char 
*message, u8 data,
memset(from, data, to - from);
 }
 
-static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
-   u8 *object, char *what,
-   u8 *start, unsigned int value, unsigned int bytes)
+#ifdef CONFIG_KMSAN
+#define pad_check_attributes noinline __no_kmsan_checks
+#else
+#define pad_check_attributes
+#endif
+
+static pad_check_attributes int
+check_bytes_and_report(struct kmem_cache *s, struct slab *slab,
+  u8 *object, char *what,
+  u8 *start, unsigned int value, unsigned int bytes)
 {
u8 *fault;
u8 *end;
@@ -1270,7 +1277,8 @@ static int check_pad_bytes(struct kmem_cache *s, struct 
slab *slab, u8 *p)
 }
 
 /* Check the pad bytes at the end of a slab page */
-static void slab_pad_check(struct kmem_cache *s, struct slab *slab)
+static pad_check_attributes void
+slab_pad_check(struct kmem_cache *s, struct slab *slab)
 {
u8 *start;
u8 *fault;
-- 
2.45.1




[PATCH v5 26/37] s390/cpumf: Unpoison STCCTM output buffer

2024-06-19 Thread Ilya Leoshkevich
stcctm() uses the "Q" constraint for dest, so KMSAN does not
understand that it fills multiple doublewords pointed to by dest, not
just one. This results in false positives.

Unpoison the whole dest manually with kmsan_unpoison_memory().

Reported-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/cpu_mf.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/s390/include/asm/cpu_mf.h b/arch/s390/include/asm/cpu_mf.h
index a0de5b9b02ea..9e4bbc3e53f8 100644
--- a/arch/s390/include/asm/cpu_mf.h
+++ b/arch/s390/include/asm/cpu_mf.h
@@ -10,6 +10,7 @@
 #define _ASM_S390_CPU_MF_H
 
 #include 
+#include 
 #include 
 #include 
 
@@ -239,6 +240,11 @@ static __always_inline int stcctm(enum stcctm_ctr_set set, 
u64 range, u64 *dest)
: "=d" (cc)
: "Q" (*dest), "d" (range), "i" (set)
: "cc", "memory");
+   /*
+* If cc == 2, less than RANGE counters are stored, but it's not easy
+* to tell how many. Always unpoison the whole range for simplicity.
+*/
+   kmsan_unpoison_memory(dest, range * sizeof(u64));
return cc;
 }
 
-- 
2.45.1




[PATCH v5 31/37] s390/string: Add KMSAN support

2024-06-19 Thread Ilya Leoshkevich
Add KMSAN support for the s390 implementations of the string functions.
Do this similarly to how it's already done for KASAN, except that the
optimized memset{16,32,64}() functions need to be disabled: it's
important for KMSAN to know that they initialized something.

The way boot code is built with regard to string functions is
problematic, since most files think it's configured with sanitizers,
but boot/string.c doesn't. This creates various problems with the
memset64() definitions, depending on whether the code is built with
sanitizers or fortify. This should probably be streamlined, but in the
meantime resolve the issues by introducing the IN_BOOT_STRING_C macro,
similar to the existing IN_ARCH_STRING_C macro.

Reviewed-by: Alexander Potapenko 
Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/string.c| 16 
 arch/s390/include/asm/string.h | 20 +++-
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/arch/s390/boot/string.c b/arch/s390/boot/string.c
index faccb33b462c..f6b9b1df48a8 100644
--- a/arch/s390/boot/string.c
+++ b/arch/s390/boot/string.c
@@ -1,11 +1,18 @@
 // SPDX-License-Identifier: GPL-2.0
+#define IN_BOOT_STRING_C 1
 #include 
 #include 
 #include 
 #undef CONFIG_KASAN
 #undef CONFIG_KASAN_GENERIC
+#undef CONFIG_KMSAN
 #include "../lib/string.c"
 
+/*
+ * Duplicate some functions from the common lib/string.c
+ * instead of fully including it.
+ */
+
 int strncmp(const char *cs, const char *ct, size_t count)
 {
unsigned char c1, c2;
@@ -22,6 +29,15 @@ int strncmp(const char *cs, const char *ct, size_t count)
return 0;
 }
 
+void *memset64(uint64_t *s, uint64_t v, size_t count)
+{
+   uint64_t *xs = s;
+
+   while (count--)
+   *xs++ = v;
+   return s;
+}
+
 char *skip_spaces(const char *str)
 {
while (isspace(*str))
diff --git a/arch/s390/include/asm/string.h b/arch/s390/include/asm/string.h
index 351685de53d2..2ab868cbae6c 100644
--- a/arch/s390/include/asm/string.h
+++ b/arch/s390/include/asm/string.h
@@ -15,15 +15,12 @@
 #define __HAVE_ARCH_MEMCPY /* gcc builtin & arch function */
 #define __HAVE_ARCH_MEMMOVE/* gcc builtin & arch function */
 #define __HAVE_ARCH_MEMSET /* gcc builtin & arch function */
-#define __HAVE_ARCH_MEMSET16   /* arch function */
-#define __HAVE_ARCH_MEMSET32   /* arch function */
-#define __HAVE_ARCH_MEMSET64   /* arch function */
 
 void *memcpy(void *dest, const void *src, size_t n);
 void *memset(void *s, int c, size_t n);
 void *memmove(void *dest, const void *src, size_t n);
 
-#ifndef CONFIG_KASAN
+#if !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN)
 #define __HAVE_ARCH_MEMCHR /* inline & arch function */
 #define __HAVE_ARCH_MEMCMP /* arch function */
 #define __HAVE_ARCH_MEMSCAN/* inline & arch function */
@@ -36,6 +33,9 @@ void *memmove(void *dest, const void *src, size_t n);
 #define __HAVE_ARCH_STRNCPY/* arch function */
 #define __HAVE_ARCH_STRNLEN/* inline & arch function */
 #define __HAVE_ARCH_STRSTR /* arch function */
+#define __HAVE_ARCH_MEMSET16   /* arch function */
+#define __HAVE_ARCH_MEMSET32   /* arch function */
+#define __HAVE_ARCH_MEMSET64   /* arch function */
 
 /* Prototypes for non-inlined arch strings functions. */
 int memcmp(const void *s1, const void *s2, size_t n);
@@ -44,7 +44,7 @@ size_t strlcat(char *dest, const char *src, size_t n);
 char *strncat(char *dest, const char *src, size_t n);
 char *strncpy(char *dest, const char *src, size_t n);
 char *strstr(const char *s1, const char *s2);
-#endif /* !CONFIG_KASAN */
+#endif /* !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) */
 
 #undef __HAVE_ARCH_STRCHR
 #undef __HAVE_ARCH_STRNCHR
@@ -74,20 +74,30 @@ void *__memset16(uint16_t *s, uint16_t v, size_t count);
 void *__memset32(uint32_t *s, uint32_t v, size_t count);
 void *__memset64(uint64_t *s, uint64_t v, size_t count);
 
+#ifdef __HAVE_ARCH_MEMSET16
 static inline void *memset16(uint16_t *s, uint16_t v, size_t count)
 {
return __memset16(s, v, count * sizeof(v));
 }
+#endif
 
+#ifdef __HAVE_ARCH_MEMSET32
 static inline void *memset32(uint32_t *s, uint32_t v, size_t count)
 {
return __memset32(s, v, count * sizeof(v));
 }
+#endif
 
+#ifdef __HAVE_ARCH_MEMSET64
+#ifdef IN_BOOT_STRING_C
+void *memset64(uint64_t *s, uint64_t v, size_t count);
+#else
 static inline void *memset64(uint64_t *s, uint64_t v, size_t count)
 {
return __memset64(s, v, count * sizeof(v));
 }
+#endif
+#endif
 
 #if !defined(IN_ARCH_STRING_C) && (!defined(CONFIG_FORTIFY_SOURCE) || 
defined(__NO_FORTIFY))
 
-- 
2.45.1




[PATCH v5 02/37] kmsan: Make the tests compatible with kmsan.panic=1

2024-06-19 Thread Ilya Leoshkevich
It's useful to have both tests and kmsan.panic=1 during development,
but right now the warnings that the tests cause lead to kernel
panics.

Temporarily set kmsan.panic=0 for the duration of the KMSAN testing.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/kmsan_test.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c
index 07d3a3a5a9c5..9bfd11674fe3 100644
--- a/mm/kmsan/kmsan_test.c
+++ b/mm/kmsan/kmsan_test.c
@@ -659,9 +659,13 @@ static void test_exit(struct kunit *test)
 {
 }
 
+static int orig_panic_on_kmsan;
+
 static int kmsan_suite_init(struct kunit_suite *suite)
 {
register_trace_console(probe_console, NULL);
+   orig_panic_on_kmsan = panic_on_kmsan;
+   panic_on_kmsan = 0;
return 0;
 }
 
@@ -669,6 +673,7 @@ static void kmsan_suite_exit(struct kunit_suite *suite)
 {
unregister_trace_console(probe_console, NULL);
tracepoint_synchronize_unregister();
+   panic_on_kmsan = orig_panic_on_kmsan;
 }
 
 static struct kunit_suite kmsan_test_suite = {
-- 
2.45.1




[PATCH v5 06/37] kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces

2024-06-19 Thread Ilya Leoshkevich
Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap. Assume that we are handling user memory access in
this case.

Reported-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 22e8657800ef..b408714f9ba3 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -267,7 +267,8 @@ void kmsan_copy_to_user(void __user *to, const void *from, 
size_t to_copy,
return;
 
ua_flags = user_access_save();
-   if ((u64)to < TASK_SIZE) {
+   if (!IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) ||
+   (u64)to < TASK_SIZE) {
/* This is a user memory access, check it. */
kmsan_internal_check_memory((void *)from, to_copy - left, to,
REASON_COPY_TO_USER);
-- 
2.45.1




[PATCH v5 12/37] kmsan: Introduce memset_no_sanitize_memory()

2024-06-19 Thread Ilya Leoshkevich
Add a wrapper for memset() that prevents unpoisoning. This is useful
for filling memory allocator redzones.
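
For example, patch 13 of this series uses it in init_object() to fill
a redzone without marking it as initialized (abridged):

	memset_no_sanitize_memory(p - s->red_left_pad, val,
				  s->red_left_pad);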

Signed-off-by: Ilya Leoshkevich 
---
 include/linux/kmsan.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index 23de1b3d6aee..5f50885f2023 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -255,6 +255,14 @@ void kmsan_enable_current(void);
  */
 void kmsan_disable_current(void);
 
+/*
+ * memset_no_sanitize_memory(): memset() without KMSAN instrumentation.
+ */
+static inline void *memset_no_sanitize_memory(void *s, int c, size_t n)
+{
+   return __memset(s, c, n);
+}
+
 #else
 
 static inline void kmsan_init_shadow(void)
@@ -362,6 +370,11 @@ static inline void kmsan_disable_current(void)
 {
 }
 
+static inline void *memset_no_sanitize_memory(void *s, int c, size_t n)
+{
+   return memset(s, c, n);
+}
+
 #endif
 
 #endif /* _LINUX_KMSAN_H */
-- 
2.45.1




[PATCH v5 25/37] s390/cpacf: Unpoison the results of cpacf_trng()

2024-06-19 Thread Ilya Leoshkevich
Prevent KMSAN from complaining about buffers filled by cpacf_trng()
being uninitialized.

Tested-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/cpacf.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/include/asm/cpacf.h b/arch/s390/include/asm/cpacf.h
index c786538e397c..dae8843b164f 100644
--- a/arch/s390/include/asm/cpacf.h
+++ b/arch/s390/include/asm/cpacf.h
@@ -12,6 +12,7 @@
 #define _ASM_S390_CPACF_H
 
 #include 
+#include 
 
 /*
  * Instruction opcodes for the CPACF instructions
@@ -542,6 +543,8 @@ static inline void cpacf_trng(u8 *ucbuf, unsigned long 
ucbuf_len,
: [ucbuf] "+&d" (u.pair), [cbuf] "+&d" (c.pair)
: [fc] "K" (CPACF_PRNO_TRNG), [opc] "i" (CPACF_PRNO)
: "cc", "memory", "0");
+   kmsan_unpoison_memory(ucbuf, ucbuf_len);
+   kmsan_unpoison_memory(cbuf, cbuf_len);
 }
 
 /**
-- 
2.45.1




[PATCH v5 13/37] kmsan: Support SLAB_POISON

2024-06-19 Thread Ilya Leoshkevich
Avoid false KMSAN negatives with SLUB_DEBUG by allowing
kmsan_slab_free() to poison the freed memory, and by preventing
init_object() from unpoisoning new allocations by using __memset().

There are two alternatives to this approach. First, init_object()
can be marked with __no_sanitize_memory. This annotation should be used
with great care, because it drops all instrumentation from the
function, and any shadow writes will be lost. Even though this is not a
concern with the current init_object() implementation, this may change
in the future.

Second, kmsan_poison_memory() calls may be added after memset() calls.
The downside is that init_object() is called from
free_debug_processing(), in which case poisoning will erase the
distinction between simply uninitialized memory and UAF.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/hooks.c |  2 +-
 mm/slub.c| 15 +++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 267d0afa2e8b..26d86dfdc819 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -74,7 +74,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object)
return;
 
/* RCU slabs could be legally used after free within the RCU period */
-   if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)))
+   if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
return;
/*
 * If there's a constructor, freed memory must remain in the same state
diff --git a/mm/slub.c b/mm/slub.c
index 1373ac365a46..1134091abac5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1139,7 +1139,13 @@ static void init_object(struct kmem_cache *s, void 
*object, u8 val)
unsigned int poison_size = s->object_size;
 
if (s->flags & SLAB_RED_ZONE) {
-   memset(p - s->red_left_pad, val, s->red_left_pad);
+   /*
+* Here and below, avoid overwriting the KMSAN shadow. Keeping
+* the shadow makes it possible to distinguish uninit-value
+* from use-after-free.
+*/
+   memset_no_sanitize_memory(p - s->red_left_pad, val,
+ s->red_left_pad);
 
if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
/*
@@ -1152,12 +1158,13 @@ static void init_object(struct kmem_cache *s, void 
*object, u8 val)
}
 
if (s->flags & __OBJECT_POISON) {
-   memset(p, POISON_FREE, poison_size - 1);
-   p[poison_size - 1] = POISON_END;
+   memset_no_sanitize_memory(p, POISON_FREE, poison_size - 1);
+   memset_no_sanitize_memory(p + poison_size - 1, POISON_END, 1);
}
 
if (s->flags & SLAB_RED_ZONE)
-   memset(p + poison_size, val, s->inuse - poison_size);
+   memset_no_sanitize_memory(p + poison_size, val,
+ s->inuse - poison_size);
 }
 
 static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
-- 
2.45.1




[PATCH v5 07/37] kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush()

2024-06-19 Thread Ilya Leoshkevich
The value assigned to prot is immediately overwritten on the next line
with PAGE_KERNEL. The right hand side of the assignment has no
side-effects.

Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Suggested-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/shadow.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index b9d05aff313e..2d57408c78ae 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -243,7 +243,6 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, 
unsigned long end,
s_pages[i] = shadow_page_for(pages[i]);
o_pages[i] = origin_page_for(pages[i]);
}
-   prot = __pgprot(pgprot_val(prot) | _PAGE_NX);
prot = PAGE_KERNEL;
 
origin_start = vmalloc_meta((void *)start, KMSAN_META_ORIGIN);
-- 
2.45.1




[PATCH v5 10/37] kmsan: Export panic_on_kmsan

2024-06-19 Thread Ilya Leoshkevich
When building the kmsan test as a module, modpost fails with the
following error message:

ERROR: modpost: "panic_on_kmsan" [mm/kmsan/kmsan_test.ko] undefined!

Export panic_on_kmsan in order to improve the KMSAN usability for
modules.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/report.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c
index 02736ec757f2..c79d3b0d2d0d 100644
--- a/mm/kmsan/report.c
+++ b/mm/kmsan/report.c
@@ -20,6 +20,7 @@ static DEFINE_RAW_SPINLOCK(kmsan_report_lock);
 /* Protected by kmsan_report_lock */
 static char report_local_descr[DESCR_SIZE];
 int panic_on_kmsan __read_mostly;
+EXPORT_SYMBOL_GPL(panic_on_kmsan);
 
 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
-- 
2.45.1




[PATCH v5 21/37] s390/boot: Turn off KMSAN

2024-06-19 Thread Ilya Leoshkevich
All other sanitizers are disabled for boot as well. While at it, add a
comment explaining why we need this.

Reviewed-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
index 070c9b2e905f..526ed20b9d31 100644
--- a/arch/s390/boot/Makefile
+++ b/arch/s390/boot/Makefile
@@ -3,11 +3,13 @@
 # Makefile for the linux s390-specific parts of the memory manager.
 #
 
+# Tooling runtimes are unavailable and cannot be linked for early boot code
 KCOV_INSTRUMENT := n
 GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 KBUILD_AFLAGS := $(KBUILD_AFLAGS_DECOMPRESSOR)
 KBUILD_CFLAGS := $(KBUILD_CFLAGS_DECOMPRESSOR)
-- 
2.45.1




[PATCH v5 24/37] s390/checksum: Add a KMSAN check

2024-06-19 Thread Ilya Leoshkevich
Add a KMSAN check to the CKSM inline assembly, similar to how it was
done for ASAN in commit e42ac7789df6 ("s390/checksum: always use cksm
instruction").

Acked-by: Alexander Gordeev 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/checksum.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/checksum.h b/arch/s390/include/asm/checksum.h
index b89159591ca0..46f5c9660616 100644
--- a/arch/s390/include/asm/checksum.h
+++ b/arch/s390/include/asm/checksum.h
@@ -13,6 +13,7 @@
 #define _S390_CHECKSUM_H
 
 #include 
+#include 
 #include 
 
 static inline __wsum cksm(const void *buff, int len, __wsum sum)
@@ -23,6 +24,7 @@ static inline __wsum cksm(const void *buff, int len, __wsum 
sum)
};
 
instrument_read(buff, len);
+   kmsan_check_memory(buff, len);
asm volatile("\n"
"0: cksm%[sum],%[rp]\n"
"   jo  0b\n"
-- 
2.45.1




[PATCH v5 11/37] kmsan: Allow disabling KMSAN checks for the current task

2024-06-19 Thread Ilya Leoshkevich
Like for KASAN, it's useful to temporarily disable KMSAN checks around,
e.g., redzone accesses. Introduce kmsan_disable_current() and
kmsan_enable_current(), which are similar to their KASAN counterparts.

Make them reentrant in order to handle memory allocations in interrupt
context. Repurpose the allow_reporting field for this.
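
A minimal usage sketch (the redzone pointer and expected byte here are
hypothetical; the slub patches later in this series apply the same
pattern via metadata_access_enable()/metadata_access_disable()):

	kmsan_disable_current();
	/* Read possibly poisoned redzone bytes without a report. */
	if (*(u8 *)redzone != POISON_INUSE)
		pr_warn("redzone mismatch\n");
	kmsan_enable_current();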

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 Documentation/dev-tools/kmsan.rst | 11 +--
 include/linux/kmsan.h | 24 
 include/linux/kmsan_types.h   |  2 +-
 mm/kmsan/core.c   |  1 -
 mm/kmsan/hooks.c  | 18 +++---
 mm/kmsan/report.c |  7 ---
 tools/objtool/check.c |  2 ++
 7 files changed, 55 insertions(+), 10 deletions(-)

diff --git a/Documentation/dev-tools/kmsan.rst 
b/Documentation/dev-tools/kmsan.rst
index 323eedad53cd..6a48d96c5c85 100644
--- a/Documentation/dev-tools/kmsan.rst
+++ b/Documentation/dev-tools/kmsan.rst
@@ -110,6 +110,13 @@ in the Makefile. Think of this as applying 
``__no_sanitize_memory`` to every
 function in the file or directory. Most users won't need KMSAN_SANITIZE, unless
 their code gets broken by KMSAN (e.g. runs at early boot time).
 
+KMSAN checks can also be temporarily disabled for the current task using
+``kmsan_disable_current()`` and ``kmsan_enable_current()`` calls. Each
+``kmsan_enable_current()`` call must be preceded by a
+``kmsan_disable_current()`` call; these call pairs may be nested. One needs to
+be careful with these calls, keeping the regions short and preferring other
+ways to disable instrumentation, where possible.
+
 Support
 ===
 
@@ -338,11 +345,11 @@ Per-task KMSAN state
 
 
 Every task_struct has an associated KMSAN task state that holds the KMSAN
-context (see above) and a per-task flag disallowing KMSAN reports::
+context (see above) and a per-task counter disallowing KMSAN reports::
 
   struct kmsan_context {
 ...
-bool allow_reporting;
+unsigned int depth;
 struct kmsan_context_state cstate;
 ...
   }
diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index fe6c2212bdb1..23de1b3d6aee 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -239,6 +239,22 @@ void kmsan_unpoison_entry_regs(const struct pt_regs *regs);
  */
 void *kmsan_get_metadata(void *addr, bool is_origin);
 
+/*
+ * kmsan_enable_current(): Enable KMSAN for the current task.
+ *
+ * Each kmsan_enable_current() call must be preceded by a
+ * kmsan_disable_current() call. These call pairs may be nested.
+ */
+void kmsan_enable_current(void);
+
+/*
+ * kmsan_disable_current(): Disable KMSAN for the current task.
+ *
+ * Each kmsan_disable_current() call must be followed by a
+ * kmsan_enable_current() call. These call pairs may be nested.
+ */
+void kmsan_disable_current(void);
+
 #else
 
 static inline void kmsan_init_shadow(void)
@@ -338,6 +354,14 @@ static inline void kmsan_unpoison_entry_regs(const struct 
pt_regs *regs)
 {
 }
 
+static inline void kmsan_enable_current(void)
+{
+}
+
+static inline void kmsan_disable_current(void)
+{
+}
+
 #endif
 
 #endif /* _LINUX_KMSAN_H */
diff --git a/include/linux/kmsan_types.h b/include/linux/kmsan_types.h
index 929287981afe..dfc59918b3c0 100644
--- a/include/linux/kmsan_types.h
+++ b/include/linux/kmsan_types.h
@@ -31,7 +31,7 @@ struct kmsan_context_state {
 struct kmsan_ctx {
struct kmsan_context_state cstate;
int kmsan_in_runtime;
-   bool allow_reporting;
+   unsigned int depth;
 };
 
 #endif /* _LINUX_KMSAN_TYPES_H */
diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c
index 95f859e38c53..81b0711a 100644
--- a/mm/kmsan/core.c
+++ b/mm/kmsan/core.c
@@ -43,7 +43,6 @@ void kmsan_internal_task_create(struct task_struct *task)
struct thread_info *info = current_thread_info();
 
__memset(ctx, 0, sizeof(*ctx));
-   ctx->allow_reporting = true;
kmsan_internal_unpoison_memory(info, sizeof(*info), false);
 }
 
diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index b408714f9ba3..267d0afa2e8b 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -39,12 +39,10 @@ void kmsan_task_create(struct task_struct *task)
 
 void kmsan_task_exit(struct task_struct *task)
 {
-   struct kmsan_ctx *ctx = &task->kmsan_ctx;
-
if (!kmsan_enabled || kmsan_in_runtime())
return;
 
-   ctx->allow_reporting = false;
+   kmsan_disable_current();
 }
 
 void kmsan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags)
@@ -424,3 +422,17 @@ void kmsan_check_memory(const void *addr, size_t size)
   REASON_ANY);
 }
 EXPORT_SYMBOL(kmsan_check_memory);
+
+void kmsan_enable_current(void)
+{
+   KMSAN_WARN_ON(current->kmsan_ctx.depth == 0);
+   current->kmsan_ctx.depth--;
+}
+EXPORT_SYMBOL(kmsan_enable_current);
+
+void kmsan_disable_current(void)
+{
+   current->kmsan_ctx.depth++;
+   KMSAN_WARN_ON(current->kmsan_ctx.depth == 0);
+}
+EXPORT_SYMBOL(kmsan_disable_current);

[PATCH v5 03/37] kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled

2024-06-19 Thread Ilya Leoshkevich
KMSAN relies on memblock returning all available pages to it
(see kmsan_memblock_free_pages()). It partitions these pages into 3
categories: pages available to the buddy allocator, shadow pages and
origin pages. This partitioning is static.

If new pages appear after kmsan_init_runtime(), it is considered
an error. DEFERRED_STRUCT_PAGE_INIT causes this, so mark it as
incompatible with KMSAN.

Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..9791fce5d0a7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -946,6 +946,7 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
+   depends on !KMSAN
select PADATA
help
  Ordinarily all struct pages are initialised during early boot in a
-- 
2.45.1




[PATCH v5 08/37] kmsan: Remove an x86-specific #include from kmsan.h

2024-06-19 Thread Ilya Leoshkevich
Replace the x86-specific asm/pgtable_64_types.h #include with the
linux/pgtable.h one, which all architectures have.

While at it, sort the headers alphabetically for the sake of
consistency with other KMSAN code.

Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core")
Suggested-by: Heiko Carstens 
Reviewed-by: Alexander Potapenko 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/kmsan.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h
index a14744205435..adf443bcffe8 100644
--- a/mm/kmsan/kmsan.h
+++ b/mm/kmsan/kmsan.h
@@ -10,14 +10,14 @@
 #ifndef __MM_KMSAN_KMSAN_H
 #define __MM_KMSAN_KMSAN_H
 
-#include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
 
 #define KMSAN_ALLOCA_MAGIC_ORIGIN 0xabcd0100
 #define KMSAN_CHAIN_MAGIC_ORIGIN 0xabcd0200
-- 
2.45.1




Re: [PATCH vhost 17/23] vdpa/mlx5: Consolidate all VQ modify to Ready to use resume_vq()

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> There are a few more places modifying the VQ to Ready directly. Let's
> consolidate them into resume_vq().
>
> The redundant warnings for resume_vq() errors can also be dropped.
>
> There is one special case that needs to be handled for virtio-vdpa:
> the initialized flag must be set to true earlier in setup_vq() so that
> resume_vq() doesn't return early.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++
>  1 file changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index e3a82c43b44e..f5d5b25cdb01 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -160,6 +160,7 @@ static void free_fixed_resources(struct mlx5_vdpa_net 
> *ndev);
>  static void init_mvqs(struct mlx5_vdpa_net *ndev);
>  static int setup_vq_resources(struct mlx5_vdpa_net *ndev, bool filled);
>  static void teardown_vq_resources(struct mlx5_vdpa_net *ndev);
> +static int resume_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue 
> *mvq);
>
>  static bool mlx5_vdpa_debug;
>
> @@ -1500,16 +1501,14 @@ static int setup_vq(struct mlx5_vdpa_net *ndev,
> if (err)
> goto err_vq;
>
> +   mvq->initialized = true;
> +
> if (mvq->ready) {
> -   err = modify_virtqueue_state(ndev, mvq, 
> MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY);
> -   if (err) {
> -   mlx5_vdpa_warn(&ndev->mvdev, "failed to modify to 
> ready vq idx %d(%d)\n",
> -  idx, err);
> +   err = resume_vq(ndev, mvq);
> +   if (err)
> goto err_modify;
> -   }
> }
>
> -   mvq->initialized = true;
> return 0;
>
>  err_modify:
> @@ -2422,7 +2421,6 @@ static void mlx5_vdpa_set_vq_ready(struct vdpa_device 
> *vdev, u16 idx, bool ready
> struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
> struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
> struct mlx5_vdpa_virtqueue *mvq;
> -   int err;
>
> if (!mvdev->actual_features)
> return;
> @@ -2439,14 +2437,10 @@ static void mlx5_vdpa_set_vq_ready(struct vdpa_device 
> *vdev, u16 idx, bool ready
> if (!ready) {
> suspend_vq(ndev, mvq);
> } else {
> -   err = modify_virtqueue_state(ndev, mvq, 
> MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY);
> -   if (err) {
> -   mlx5_vdpa_warn(mvdev, "modify VQ %d to ready failed 
> (%d)\n", idx, err);
> +   if (resume_vq(ndev, mvq))
> ready = false;
> -   }
> }
>
> -
> mvq->ready = ready;
>  }
>
>
> --
> 2.45.1
>




Re: [PATCH vhost 16/23] vdpa/mlx5: Add error code for suspend/resume VQ

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> Instead of blindly calling suspend/resume_vqs(), make them return error
> codes.
>
> To keep compatibility, keep suspending or resuming VQs on error and
> return the last error code. The assumption here is that the error code
> would be the same.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 77 
> +++
>  1 file changed, 54 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index e4d68d2d0bb4..e3a82c43b44e 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1526,71 +1526,102 @@ static int setup_vq(struct mlx5_vdpa_net *ndev,
> return err;
>  }
>
> -static void suspend_vq(struct mlx5_vdpa_net *ndev, struct 
> mlx5_vdpa_virtqueue *mvq)
> +static int suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue 
> *mvq)
>  {
> struct mlx5_virtq_attr attr;
> +   int err;
>
> if (!mvq->initialized)
> -   return;
> +   return 0;
>
> if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY)
> -   return;
> +   return 0;
>
> -   if (modify_virtqueue_state(ndev, mvq, 
> MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
> -   mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
> +   err = modify_virtqueue_state(ndev, mvq, 
> MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND);
> +   if (err) {
> +   mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed, err: 
> %d\n", err);
> +   return err;
> +   }
>
> -   if (query_virtqueue(ndev, mvq, &attr)) {
> -   mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
> -   return;
> +   err = query_virtqueue(ndev, mvq, &attr);
> +   if (err) {
> +   mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue, err: 
> %d\n", err);
> +   return err;
> }
> +
> mvq->avail_idx = attr.available_index;
> mvq->used_idx = attr.used_index;
> +
> +   return 0;
>  }
>
> -static void suspend_vqs(struct mlx5_vdpa_net *ndev)
> +static int suspend_vqs(struct mlx5_vdpa_net *ndev)
>  {
> +   int err = 0;
> int i;
>
> -   for (i = 0; i < ndev->cur_num_vqs; i++)
> -   suspend_vq(ndev, &ndev->vqs[i]);
> +   for (i = 0; i < ndev->cur_num_vqs; i++) {
> +   int local_err = suspend_vq(ndev, &ndev->vqs[i]);
> +
> +   err = local_err ? local_err : err;
> +   }
> +
> +   return err;
>  }
>
> -static void resume_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue 
> *mvq)
> +static int resume_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue 
> *mvq)
>  {
> +   int err;
> +
> if (!mvq->initialized)
> -   return;
> +   return 0;
>
> switch (mvq->fw_state) {
> case MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT:
> /* Due to a FW quirk we need to modify the VQ fields first 
> then change state.
>  * This should be fixed soon. After that, a single command 
> can be used.
>  */
> -   if (modify_virtqueue(ndev, mvq, 0))
> +   err = modify_virtqueue(ndev, mvq, 0);
> +   if (err) {
> mlx5_vdpa_warn(&ndev->mvdev,
> -   "modify vq properties failed for vq %u\n", 
> mvq->index);
> +   "modify vq properties failed for vq %u, err: 
> %d\n",
> +   mvq->index, err);
> +   return err;
> +   }
> break;
> case MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND:
> if (!is_resumable(ndev)) {
> mlx5_vdpa_warn(>mvdev, "vq %d is not 
> resumable\n", mvq->index);
> -   return;
> +   return -EINVAL;
> }
> break;
> case MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY:
> -   return;
> +   return 0;
> default:
> mlx5_vdpa_warn(>mvdev, "resume vq %u called from bad 
> state %d\n",
>mvq->index, mvq->fw_state);
> -   return;
> +   return -EINVAL;
> }
>
> -   if (modify_virtqueue_state(ndev, mvq, 
> MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY))
> -   mlx5_vdpa_warn(&ndev->mvdev, "modify to resume failed for vq 
> %u\n", mvq->index);
> +   err = modify_virtqueue_state(ndev, mvq, 
> MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY);
> +   if (err)
> +   mlx5_vdpa_warn(&ndev->mvdev, "modify to resume failed for vq 
> %u, err: %d\n",
> +  mvq->index, err);
> +
> +   return err;
>  }
>
> -static void resume_vqs(struct mlx5_vdpa_net *ndev)
> +static int resume_vqs(struct mlx5_vdpa_net 
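The archived message is truncated here. Judging from the suspend_vqs()
counterpart above, the resume path presumably aggregates errors the same way,
along these lines (a sketch, not the literal hunk):

static int resume_vqs(struct mlx5_vdpa_net *ndev)
{
	int err = 0;

	for (int i = 0; i < ndev->cur_num_vqs; i++) {
		int local_err = resume_vq(ndev, &ndev->vqs[i]);

		/* keep resuming the remaining VQs, remember the last failure */
		err = local_err ? local_err : err;
	}

	return err;
}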

Re: [PATCH v7 2/5] remoteproc: Add TEE support

2024-06-19 Thread Mathieu Poirier
Hi,

On Tue, Jun 11, 2024 at 09:39:01AM +0200, Arnaud Pouliquen wrote:
> Add a remoteproc TEE (Trusted Execution Environment) driver
> that will be probed by the TEE bus. If the associated trusted
> application is supported on the secure side, this driver offers a
> client interface to load firmware into the secure part.
> This firmware can then be authenticated by the secure trusted application.
> 
> Signed-off-by: Arnaud Pouliquen 
> ---
> update from V
> - Fix missing "{" in tee_rproc_find_loaded_rsc_table inline definition.
> 
> update from V5
> - make tee_rproc_get_loaded_rsc_table() local and replace this API by
>   tee_rproc_find_loaded_rsc_table()
> - map and unmap the resource table in tee_rproc_parse_fw to make a cached copy
> - use the new rproc_pa_to_va() API to map the resource table memory declared 
> in carevout
> - remove tee_rproc_release_loaded_rsc_table as no more used.
> ---
>  drivers/remoteproc/Kconfig  |  10 +
>  drivers/remoteproc/Makefile |   1 +
>  drivers/remoteproc/tee_remoteproc.c | 451 
>  include/linux/remoteproc.h  |   4 +
>  include/linux/tee_remoteproc.h  | 100 ++
>  5 files changed, 566 insertions(+)
>  create mode 100644 drivers/remoteproc/tee_remoteproc.c
>  create mode 100644 include/linux/tee_remoteproc.h
> 
> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
> index 48845dc8fa85..6c1c07202276 100644
> --- a/drivers/remoteproc/Kconfig
> +++ b/drivers/remoteproc/Kconfig
> @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC
>  
> It's safe to say N if not interested in using RPU r5f cores.
>  
> +
> +config TEE_REMOTEPROC
> + tristate "Remoteproc support by a TEE application"
> + depends on OPTEE
> + help
> +   Support a remote processor with a TEE application. The Trusted
> +   Execution Context is responsible for loading the trusted firmware
> +   image and managing the remote processor's lifecycle.
> +   This can be either built-in or a loadable module.
> +
>  endif # REMOTEPROC
>  
>  endmenu
> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
> index 91314a9b43ce..fa8daebce277 100644
> --- a/drivers/remoteproc/Makefile
> +++ b/drivers/remoteproc/Makefile
> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)   += rcar_rproc.o
>  obj-$(CONFIG_ST_REMOTEPROC)  += st_remoteproc.o
>  obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o
>  obj-$(CONFIG_STM32_RPROC)+= stm32_rproc.o
> +obj-$(CONFIG_TEE_REMOTEPROC) += tee_remoteproc.o
>  obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)   += ti_k3_dsp_remoteproc.o
>  obj-$(CONFIG_TI_K3_R5_REMOTEPROC)+= ti_k3_r5_remoteproc.o
>  obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o
> diff --git a/drivers/remoteproc/tee_remoteproc.c 
> b/drivers/remoteproc/tee_remoteproc.c
> new file mode 100644
> index ..9455fd9d0d2d
> --- /dev/null
> +++ b/drivers/remoteproc/tee_remoteproc.c
> @@ -0,0 +1,451 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved
> + * Author: Arnaud Pouliquen 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "remoteproc_internal.h"
> +
> +#define MAX_TEE_PARAM_ARRY_MEMBER	4
> +
> +/*
> + * Authentication of the firmware and load in the remote processor memory
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [in]   params[1].memref:  buffer containing the image of the 
> firmware
> + */
> +#define TA_RPROC_FW_CMD_LOAD_FW  1
> +
> +/*
> + * Start the remote processor
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + */
> +#define TA_RPROC_FW_CMD_START_FW 2
> +
> +/*
> + * Stop the remote processor
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + */
> +#define TA_RPROC_FW_CMD_STOP_FW  3
> +
> +/*
> + * Return the address of the resource table, or 0 if not found
> + * No check is done to verify that the address returned is accessible by
> + * the non secure context. If the resource table is loaded in a protected
> + * memory the access by the non secure context will lead to a data abort.
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [out]  params[1].value.a: 32bit LSB resource table memory address
> + * [out]  params[1].value.b: 32bit MSB resource table memory address
> + * [out]  params[2].value.a: 32bit LSB resource table memory size
> + * [out]  params[2].value.b: 32bit MSB resource table memory size
> + */
> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE	4
> +
> +/*
> + * Return the address of the core dump
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [out] params[1].memref:   address of the core dump image if exist,
> + *   else 
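The archive cuts the patch off here. As a concrete illustration of the command
interface defined above, a client invocation of TA_RPROC_FW_CMD_START_FW
through the kernel TEE client API might look like the following minimal
sketch, assuming a tee_context and a session already opened via
tee_client_open_context()/tee_client_open_session(); the function name is
illustrative:

#include <linux/tee_drv.h>

static int tee_rproc_start_sketch(struct tee_context *ctx, u32 session,
				  u32 rproc_id)
{
	struct tee_ioctl_invoke_arg arg = {
		.func = TA_RPROC_FW_CMD_START_FW,
		.session = session,
		.num_params = MAX_TEE_PARAM_ARRY_MEMBER,
	};
	/* unused params default to TEE_IOCTL_PARAM_ATTR_TYPE_NONE (0) */
	struct tee_param param[MAX_TEE_PARAM_ARRY_MEMBER] = {
		[0] = {
			.attr = TEE_IOCTL_PARAM_ATTR_TYPE_VALUE_INPUT,
			.u.value.a = rproc_id,	/* remote processor id */
		},
	};
	int ret;

	ret = tee_client_invoke_func(ctx, &arg, param);
	if (ret < 0)
		return ret;

	return arg.ret == 0 ? 0 : -EIO;
}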

Re: [PATCH v2] x86/paravirt: Disable virt spinlock on bare metal

2024-06-19 Thread Nikolay Borisov




On 19.06.24 г. 18:25 ч., Chen Yu wrote:

Hi Nikolay,

On 2024-06-18 at 11:24:42 +0300, Nikolay Borisov wrote:



On 26.05.24 г. 4:58 ч., Chen Yu wrote:

The kernel can change spinlock behavior when running as a guest. But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is always enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior).

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to test-and-set spinlock, which
decreases the performance on bare metal.

Fix this by disabling virt_spin_lock_key if it is on bare metal,
regardless of CONFIG_PARAVIRT_SPINLOCKS.



nit:

This bug wouldn't have happened if the key was defined FALSE by default and
only enabled in the appropriate case. I think it makes more sense to invert
the logic and have the key FALSE by default and only enable it iff the
kernel is running under a hypervisor... At worst only the virtualization
case would suffer if the lock is falsely not enabled.


Thank you for your review. I agree, initializing the key to FALSE by default
seems more readable. Could this change be a subsequent adjustment on top of
the current fix, to keep it bisectible?


Why can't this change be squashed in the current proposed patch?




Set the default key to false. If booting in a VM, enable this key. Later during
the VM initialization, if a more efficient spinlock flavor is preferred, like
the paravirt spinlock, the virt_spin_lock_key will be disabled accordingly.


Yep, or simply during the initialization stage the correct flavor will 
be chosen, no need to do the on-off dance. But that's a topic for a 
different discussion.




diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index cde8357bb226..a7d3ba00e70e 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -66,13 +66,13 @@ static inline bool vcpu_is_preempted(long cpu)
  
  #ifdef CONFIG_PARAVIRT

  /*
- * virt_spin_lock_key - enables (by default) the virt_spin_lock() hijack.
+ * virt_spin_lock_key - disables (by default) the virt_spin_lock() hijack.
   *
   * Native (and PV wanting native due to vCPU pinning) should disable this key.
   * It is done in this backwards fashion to only have a single direction 
change,
   * which removes ordering between native_pv_spin_init() and HV setup.
   */
-DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
+DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);
  
  /*

   * Shortcut for the queued_spin_lock_slowpath() function that allows
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c193c9e60a1b..fec381533555 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -51,12 +51,12 @@ DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", .noinstr.text);
  DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text);
  #endif
  
-DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);

+DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
  
  void __init native_pv_lock_init(void)

  {
-   if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
-   static_branch_disable(&virt_spin_lock_key);
+   if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+   static_branch_enable(&virt_spin_lock_key);
  }
  
  static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
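For context on the "degenerates to test-and-set spinlock" wording in the
commit message: when the key is enabled, qspinlock short-circuits into roughly
the following fallback, abridged here from arch/x86/include/asm/qspinlock.h
for illustration:

static inline bool virt_spin_lock(struct qspinlock *lock)
{
	/* key disabled (bare metal): take the normal queued-spinlock path */
	if (!static_branch_likely(&virt_spin_lock_key))
		return false;

	/*
	 * Key enabled (guest): spin on a plain test-and-set lock, since
	 * fair queued locks suffer badly from lock-holder preemption.
	 */
	do {
		while (atomic_read(&lock->val) != 0)
			cpu_relax();
	} while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);

	return true;
}

So leaving the key wrongly enabled on bare metal replaces the fair queued lock
with the busy-wait loop above.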




Re: [PATCH vhost 13/23] vdpa/mlx5: Set mkey modified flags on all VQs

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> Otherwise, when virtqueues are moved from INIT to READY the latest mkey
> will not be set appropriately.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 0201c6fe61e1..0abe01fd20e9 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2868,7 +2868,7 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
>
> mlx5_vdpa_update_mr(mvdev, new_mr, asid);
>
> -   for (int i = 0; i < ndev->cur_num_vqs; i++)
> +   for (int i = 0; i < mvdev->max_vqs; i++)
> ndev->vqs[i].modified_fields |= 
> MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY |
> 
> MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY;
>
>
> --
> 2.45.1
>




Re: [PATCH vhost 12/23] vdpa/mlx5: Start off rqt_size with max VQPs

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> Currently rqt_size is initialized during device flag configuration.
> That's because it is the earliest moment when device knows if MQ
> (multi queue) is on or off.
>
> Shift this configuration earlier to device creation time. This implies
> that non-MQ devices will have a larger RQT size. But the configuration
> will still be correct.
>
> This is done in preparation for the pre-creation of hardware virtqueues
> at device add time. When that change is added, the RQT will be created
> at device creation time so it needs to be initialized to its max size.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 1181e0ac3671..0201c6fe61e1 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2731,10 +2731,6 @@ static int mlx5_vdpa_set_driver_features(struct 
> vdpa_device *vdev, u64 features)
> return err;
>
> ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features;
> -   if (ndev->mvdev.actual_features & BIT_ULL(VIRTIO_NET_F_MQ))
> -   ndev->rqt_size = mlx5vdpa16_to_cpu(mvdev, 
> ndev->config.max_virtqueue_pairs);
> -   else
> -   ndev->rqt_size = 1;
>
> /* Interested in changes of vq features only. */
> if (get_features(old_features) != 
> get_features(mvdev->actual_features)) {
> @@ -3719,8 +3715,12 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> *v_mdev, const char *name,
> goto err_alloc;
> }
>
> -   if (device_features & BIT_ULL(VIRTIO_NET_F_MQ))
> +   if (device_features & BIT_ULL(VIRTIO_NET_F_MQ)) {
> config->max_virtqueue_pairs = cpu_to_mlx5vdpa16(mvdev, 
> max_vqs / 2);
> +   ndev->rqt_size = max_vqs / 2;
> +   } else {
> +   ndev->rqt_size = 1;
> +   }
>
> ndev->mvdev.mlx_features = device_features;
> mvdev->vdev.dma_dev = &mdev->pdev->dev;
>
> --
> 2.45.1
>




Re: [PATCH v2] x86/paravirt: Disable virt spinlock on bare metal

2024-06-19 Thread Chen Yu
Hi Nikolay,

On 2024-06-18 at 11:24:42 +0300, Nikolay Borisov wrote:
> 
> 
> On 26.05.24 г. 4:58 ч., Chen Yu wrote:
> > The kernel can change spinlock behavior when running as a guest. But
> > this guest-friendly behavior causes performance problems on bare metal.
> > So there's a 'virt_spin_lock_key' static key to switch between the two
> > modes.
> > 
> > The static key is always enabled by default (run in guest mode) and
> > should be disabled for bare metal (and in some guests that want native
> > behavior).
> > 
> > Performance drop is reported when running encode/decode workload and
> > BenchSEE cache sub-workload.
> > Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
> > native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
> > is disabled the virt_spin_lock_key is incorrectly set to true on bare
> > metal. The qspinlock degenerates to test-and-set spinlock, which
> > decreases the performance on bare metal.
> > 
> > Fix this by disabling virt_spin_lock_key if it is on bare metal,
> > regardless of CONFIG_PARAVIRT_SPINLOCKS.
> > 
> 
> nit:
> 
> This bug wouldn't have happened if the key was defined FALSE by default and
> only enabled in the appropriate case. I think it makes more sense to invert
> the logic and have the key FALSE by default and only enable it iff the
> kernel is running under a hypervisor... At worst only the virtualization
> case would suffer if the lock is falsely not enabled.

Thank you for your review. I agree, initializing the key to FALSE by default
seems more readable. Could this change be a subsequent adjustment on top of
the current fix, to keep it bisectible?


Set the default key to false. If booting in a VM, enable this key. Later during
the VM initialization, if a more efficient spinlock flavor is preferred, like
the paravirt spinlock, the virt_spin_lock_key will be disabled accordingly.

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index cde8357bb226..a7d3ba00e70e 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -66,13 +66,13 @@ static inline bool vcpu_is_preempted(long cpu)
 
 #ifdef CONFIG_PARAVIRT
 /*
- * virt_spin_lock_key - enables (by default) the virt_spin_lock() hijack.
+ * virt_spin_lock_key - disables (by default) the virt_spin_lock() hijack.
  *
  * Native (and PV wanting native due to vCPU pinning) should disable this key.
  * It is done in this backwards fashion to only have a single direction change,
  * which removes ordering between native_pv_spin_init() and HV setup.
  */
-DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
+DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);
 
 /*
  * Shortcut for the queued_spin_lock_slowpath() function that allows
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c193c9e60a1b..fec381533555 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -51,12 +51,12 @@ DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", .noinstr.text);
 DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text);
 #endif
 
-DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
+DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
 
 void __init native_pv_lock_init(void)
 {
-   if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
-   static_branch_disable(&virt_spin_lock_key);
+   if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+   static_branch_enable(&virt_spin_lock_key);
 }
 
 static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
-- 
2.25.1




Re: [PATCH vhost 11/23] vdpa/mlx5: Set an initial size on the VQ

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea  wrote:
>
> The virtqueue size is a pre-requisite for setting up any virtqueue
> resources. For the upcoming optimization of creating virtqueues at
> device add, the virtqueue size has to be configured.
>
> Store the default queue size in struct mlx5_vdpa_net to make it easy in
> the future to pre-configure this default value via vdpa tool.
>
> The queue size check in setup_vq() will always be false. So remove it.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 
> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.h | 1 +
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 245b5dac98d3..1181e0ac3671 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -58,6 +58,8 @@ MODULE_LICENSE("Dual BSD/GPL");
>   */
>  #define MLX5V_DEFAULT_VQ_COUNT 2
>
> +#define MLX5V_DEFAULT_VQ_SIZE 256
> +
>  struct mlx5_vdpa_cq_buf {
> struct mlx5_frag_buf_ctrl fbc;
> struct mlx5_frag_buf frag_buf;
> @@ -1445,9 +1447,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, struct 
> mlx5_vdpa_virtqueue *mvq)
> u16 idx = mvq->index;
> int err;
>
> -   if (!mvq->num_ent)
> -   return 0;
> -
> if (mvq->initialized)
> return 0;
>
> @@ -3523,6 +3522,7 @@ static void init_mvqs(struct mlx5_vdpa_net *ndev)
> mvq->ndev = ndev;
> mvq->fwqp.fw = true;
> mvq->fw_state = MLX5_VIRTIO_NET_Q_OBJECT_NONE;
> +   mvq->num_ent = ndev->default_queue_size;
> }
>  }
>
> @@ -3660,6 +3660,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> *v_mdev, const char *name,
> goto err_alloc;
> }
> ndev->cur_num_vqs = MLX5V_DEFAULT_VQ_COUNT;
> +   ndev->default_queue_size = MLX5V_DEFAULT_VQ_SIZE;
>
> init_mvqs(ndev);
> allocate_irqs(ndev);
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.h 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.h
> index 90b556a57971..2ada29767cc5 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.h
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.h
> @@ -58,6 +58,7 @@ struct mlx5_vdpa_net {
> bool setup;
> u32 cur_num_vqs;
> u32 rqt_size;
> +   u16 default_queue_size;

It seems to me this is only assigned here and not used in the rest of
the series, why allocate a member here instead of using macro
directly?

> bool nb_registered;
> struct notifier_block nb;
> struct vdpa_callback config_cb;
>
> --
> 2.45.1
>




Re: [PATCH] trace: riscv: Remove deprecated kprobe on ftrace support

2024-06-19 Thread Google
On Thu, 13 Jun 2024 19:13:47 +0800
Jinjie Ruan  wrote:

> Since commit 7caa9765465f60 ("ftrace: riscv: move from REGS to ARGS"),
> kprobe on ftrace is not supported by riscv, because riscv's support for
> FTRACE_WITH_REGS has been replaced with support for FTRACE_WITH_ARGS, and
> KPROBES_ON_FTRACE will be supplanted by FPROBES. So remove the deprecated
> kprobe on ftrace support, which is misunderstood.
> 
> Signed-off-by: Jinjie Ruan 

No problem. Now we have fprobe instead.
Anyway, this looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> ---
>  arch/riscv/Kconfig|  1 -
>  arch/riscv/kernel/probes/Makefile |  1 -
>  arch/riscv/kernel/probes/ftrace.c | 65 ---
>  3 files changed, 67 deletions(-)
>  delete mode 100644 arch/riscv/kernel/probes/ftrace.c
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 0525ee2d63c7..a1f2d604c459 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -149,7 +149,6 @@ config RISCV
>   select HAVE_KERNEL_UNCOMPRESSED if !XIP_KERNEL && !EFI_ZBOOT
>   select HAVE_KERNEL_ZSTD if !XIP_KERNEL && !EFI_ZBOOT
>   select HAVE_KPROBES if !XIP_KERNEL
> - select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
>   select HAVE_KRETPROBES if !XIP_KERNEL
>   # https://github.com/ClangBuiltLinux/linux/issues/1881
>   select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
> diff --git a/arch/riscv/kernel/probes/Makefile 
> b/arch/riscv/kernel/probes/Makefile
> index 8265ff497977..d2129f2c61b8 100644
> --- a/arch/riscv/kernel/probes/Makefile
> +++ b/arch/riscv/kernel/probes/Makefile
> @@ -1,7 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_KPROBES)+= kprobes.o decode-insn.o 
> simulate-insn.o
>  obj-$(CONFIG_RETHOOK)+= rethook.o rethook_trampoline.o
> -obj-$(CONFIG_KPROBES_ON_FTRACE)  += ftrace.o
>  obj-$(CONFIG_UPROBES)+= uprobes.o decode-insn.o 
> simulate-insn.o
>  CFLAGS_REMOVE_simulate-insn.o = $(CC_FLAGS_FTRACE)
>  CFLAGS_REMOVE_rethook.o = $(CC_FLAGS_FTRACE)
> diff --git a/arch/riscv/kernel/probes/ftrace.c 
> b/arch/riscv/kernel/probes/ftrace.c
> deleted file mode 100644
> index a69dfa610aa8..
> --- a/arch/riscv/kernel/probes/ftrace.c
> +++ /dev/null
> @@ -1,65 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0
> -
> -#include 
> -
> -/* Ftrace callback handler for kprobes -- called under preepmt disabled */
> -void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
> -struct ftrace_ops *ops, struct ftrace_regs *fregs)
> -{
> - struct kprobe *p;
> - struct pt_regs *regs;
> - struct kprobe_ctlblk *kcb;
> - int bit;
> -
> - if (unlikely(kprobe_ftrace_disabled))
> - return;
> -
> - bit = ftrace_test_recursion_trylock(ip, parent_ip);
> - if (bit < 0)
> - return;
> -
> - p = get_kprobe((kprobe_opcode_t *)ip);
> - if (unlikely(!p) || kprobe_disabled(p))
> - goto out;
> -
> - regs = ftrace_get_regs(fregs);
> - kcb = get_kprobe_ctlblk();
> - if (kprobe_running()) {
> - kprobes_inc_nmissed_count(p);
> - } else {
> - unsigned long orig_ip = instruction_pointer(regs);
> -
> - instruction_pointer_set(regs, ip);
> -
> - __this_cpu_write(current_kprobe, p);
> - kcb->kprobe_status = KPROBE_HIT_ACTIVE;
> - if (!p->pre_handler || !p->pre_handler(p, regs)) {
> - /*
> -  * Emulate singlestep (and also recover regs->pc)
> -  * as if there is a nop
> -  */
> - instruction_pointer_set(regs,
> - (unsigned long)p->addr + MCOUNT_INSN_SIZE);
> - if (unlikely(p->post_handler)) {
> - kcb->kprobe_status = KPROBE_HIT_SSDONE;
> - p->post_handler(p, regs, 0);
> - }
> - instruction_pointer_set(regs, orig_ip);
> - }
> -
> - /*
> -  * If pre_handler returns !0, it changes regs->pc. We have to
> -  * skip emulating post_handler.
> -  */
> - __this_cpu_write(current_kprobe, NULL);
> - }
> -out:
> - ftrace_test_recursion_unlock(bit);
> -}
> -NOKPROBE_SYMBOL(kprobe_ftrace_handler);
> -
> -int arch_prepare_kprobe_ftrace(struct kprobe *p)
> -{
> - p->ainsn.api.insn = NULL;
> - return 0;
> -}
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] fgraph: Use str_plural() in test_graph_storage_single()

2024-06-19 Thread Google
On Tue, 18 Jun 2024 15:20:14 +0800
Jiapeng Chong  wrote:

> Use existing str_plural() function rather than duplicating its
> implementation.
> 
> ./kernel/trace/trace_selftest.c:880:56-60: opportunity for str_plural(size).
> 
> Reported-by: Abaci Robot 
> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9349
> Signed-off-by: Jiapeng Chong 

Thanks, this looks good to me.

Acked-by: Masami Hiramatsu (Google) 


> ---
>  kernel/trace/trace_selftest.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> index adf0f436d84b..97f1e4bc47dc 100644
> --- a/kernel/trace/trace_selftest.c
> +++ b/kernel/trace/trace_selftest.c
> @@ -877,7 +877,7 @@ static int __init test_graph_storage_single(struct 
> fgraph_fixture *fixture)
>   int ret;
>  
>   pr_cont("PASSED\n");
> - pr_info("Testing fgraph storage of %d byte%s: ", size, size > 1 ? "s" : 
> "");
> + pr_info("Testing fgraph storage of %d byte%s: ", size, 
> str_plural(size));
>  
>   ret = init_fgraph_fixture(fixture);
>   if (ret && ret != -ENODEV) {
> -- 
> 2.20.1.7.g153144c
> 


-- 
Masami Hiramatsu (Google) 
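For reference, the helper suggested above amounts to the following, per
include/linux/string_choices.h:

static inline const char *str_plural(size_t num)
{
	return num == 1 ? "" : "s";
}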



Re: [PATCH 1/2] vdpa: support set mac address from vdpa tool

2024-06-19 Thread Cindy Lu
On Tue, Jun 18, 2024 at 6:39 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jun 17, 2024 at 09:44:21AM -0700, Jakub Kicinski wrote:
> > On Mon, 17 Jun 2024 12:20:19 -0400 Michael S. Tsirkin wrote:
> > > > But the virtio spec doesn't allow setting the MAC...
> > > > I'm probably just lost in the conversation but there's hypervisor side
> > > > and there is user/VM side, each of them already has an interface to set
> > > > the MAC. The MAC doesn't matter, but I want to make sure my mental model
> > > > matches reality in case we start duplicating too much..
> > >
> > > An obvious part of provisioning is specifying the config space
> > > of the device.
> >
> > Agreed, that part is obvious.
> > Please go ahead, I don't really care and you clearly don't have time
> > to explain.
>
> Thanks!
> Just in case Cindy who is working on it is also confused,
> here is what I meant:
>
> - an interface to provision a device, including its config
>   space, makes sense to me
> - default mac address is part of config space, and would thus be covered
> - note how this is different from ability to tweak the mac of an existing
>   device
>
Thanks Michael, I will continue working on this
thanks
cindy
>
> --
> MST
>
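For the provisioning case described above, the existing vdpa tool syntax
already carries the config-space attribute at creation time, e.g.
vdpa dev add name vdpa0 mgmtdev pci/0000:65:00.2 mac 00:11:22:33:44:55
(device and bus names here are illustrative). What it does not do is tweak
the MAC of an already-created device, which is the distinction being drawn.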




Re: [PATCH net-next v5 0/7] net: pass receive socket to drop tracepoint

2024-06-19 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (main)
by David S. Miller :

On Mon, 17 Jun 2024 11:09:00 -0700 you wrote:
> We set up our production packet drop monitoring around the kfree_skb
> tracepoint. While this tracepoint is extremely valuable for diagnosing
> critical problems, it also has some limitation with drops on the local
> receive path: this tracepoint can only inspect the dropped skb itself,
> but such skb might not carry enough information to:
> 
> 1. determine in which netns/container this skb gets dropped
> 2. determine by which socket/service this skb oughts to be received
> 
> [...]

Here is the summary with links:
  - [net-next,v5,1/7] net: add rx_sk to trace_kfree_skb
https://git.kernel.org/netdev/net-next/c/c53795d48ee8
  - [net-next,v5,2/7] net: introduce sk_skb_reason_drop function
https://git.kernel.org/netdev/net-next/c/ba8de796baf4
  - [net-next,v5,3/7] ping: use sk_skb_reason_drop to free rx packets
https://git.kernel.org/netdev/net-next/c/7467de17635f
  - [net-next,v5,4/7] net: raw: use sk_skb_reason_drop to free rx packets
https://git.kernel.org/netdev/net-next/c/ce9a2424e9da
  - [net-next,v5,5/7] tcp: use sk_skb_reason_drop to free rx packets
https://git.kernel.org/netdev/net-next/c/46a02aa35752
  - [net-next,v5,6/7] udp: use sk_skb_reason_drop to free rx packets
https://git.kernel.org/netdev/net-next/c/fc0cc9248843
  - [net-next,v5,7/7] af_packet: use sk_skb_reason_drop to free rx packets
https://git.kernel.org/netdev/net-next/c/e2e7d78d9a25

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
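As a sketch of what the series enables at an rx drop site (the surrounding
function is illustrative, not taken from the series):

/*
 * Previously a protocol's receive path called kfree_skb_reason(skb,
 * reason), losing track of the receiving socket. sk_skb_reason_drop()
 * forwards the socket to the kfree_skb tracepoint as rx_sk, so tracers
 * can recover the netns/service the skb was destined for.
 */
static void example_rx_drop(struct sock *sk, struct sk_buff *skb)
{
	sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_SOCKET_FILTER);
}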





Re: [PATCH v4 16/35] mm: slub: Unpoison the memchr_inv() return value

2024-06-19 Thread Ilya Leoshkevich
On Tue, 2024-06-18 at 16:38 +0200, Alexander Potapenko wrote:
> On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich 
> wrote:
> > 
> > Even though the KMSAN warnings generated by memchr_inv() are
> > suppressed
> > by metadata_access_enable(), its return value may still be
> > poisoned.
> > 
> > The reason is that the last iteration of memchr_inv() returns
> > `*start != value ? start : NULL`, where *start is poisoned. Because
> > of
> > this, somewhat counterintuitively, the shadow value computed by
> > visitSelectInst() is equal to `(uintptr_t)start`.
> > 
> > The intention behind guarding memchr_inv() behind
> > metadata_access_enable() is to touch poisoned metadata without
> > triggering KMSAN, so unpoison its return value.
> 
> What do you think about applying __no_kmsan_checks to these functions
> instead?

Ok, will do. The __no_kmsan_checks approach is already taken by
"mm: kfence: Disable KMSAN when checking the canary", so we might as
well be consistent in how we fix these issues.
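For readers following along, the fix looks roughly like this at a
memchr_inv() call site; a simplified sketch, not the literal SLUB code:

static bool region_is_clean(u8 *start, u8 value, size_t bytes)
{
	u8 *fault;

	metadata_access_enable();
	fault = memchr_inv(start, value, bytes);
	/*
	 * The patch's approach: the returned pointer may carry poisoned
	 * shadow from the final select inside memchr_inv(), so unpoison
	 * the local before branching on it.
	 */
	kmsan_unpoison_memory(&fault, sizeof(fault));
	metadata_access_disable();

	return !fault;
}

The alternative agreed on here is instead to tag the calling function with
__no_kmsan_checks, as was done in "mm: kfence: Disable KMSAN when checking
the canary", so no explicit unpoisoning is needed.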



Re: [PATCH vhost 08/23] vdpa/mlx5: Clear and reinitialize software VQ data on reset

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea  wrote:
>
> The hardware VQ configuration is mirrored by data in struct
> mlx5_vdpa_virtqueue . Instead of clearing just a few fields at reset,
> fully clear the struct and initialize with the appropriate default
> values.
>
> As clear_vqs_ready() is used only during reset, get rid of it.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 16 +++-
>  1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index c8b5c87f001d..de013b5a2815 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2941,18 +2941,6 @@ static void teardown_vq_resources(struct mlx5_vdpa_net 
> *ndev)
> ndev->setup = false;
>  }
>
> -static void clear_vqs_ready(struct mlx5_vdpa_net *ndev)
> -{
> -   int i;
> -
> -   for (i = 0; i < ndev->mvdev.max_vqs; i++) {
> -   ndev->vqs[i].ready = false;
> -   ndev->vqs[i].modified_fields = 0;
> -   }
> -
> -   ndev->mvdev.cvq.ready = false;
> -}
> -
>  static int setup_cvq_vring(struct mlx5_vdpa_dev *mvdev)
>  {
> struct mlx5_control_vq *cvq = &mvdev->cvq;
> @@ -3035,12 +3023,14 @@ static int mlx5_vdpa_compat_reset(struct vdpa_device 
> *vdev, u32 flags)
> down_write(&ndev->reslock);
> unregister_link_notifier(ndev);
> teardown_vq_resources(ndev);
> -   clear_vqs_ready(ndev);
> +   init_mvqs(ndev);

Nitpick / suggestion if you have to send a v2. The init_mvqs function
name sounds like it can allocate stuff that needs to be released to
me. But I'm very bad at naming :). Maybe something like
"mvqs_set_defaults" or similar?

> +
> if (flags & VDPA_RESET_F_CLEAN_MAP)
> mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
> ndev->mvdev.status = 0;
> ndev->mvdev.suspended = false;
> ndev->cur_num_vqs = MLX5V_DEFAULT_VQ_COUNT;
> +   ndev->mvdev.cvq.ready = false;
> ndev->mvdev.cvq.received_desc = 0;
> ndev->mvdev.cvq.completed_desc = 0;
> memset(ndev->event_cbs, 0, sizeof(*ndev->event_cbs) * (mvdev->max_vqs 
> + 1));
>
> --
> 2.45.1
>




Re: [PATCH vhost 05/23] vdpa/mlx5: Iterate over active VQs during suspend/resume

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea  wrote:
>
> No need to iterate over max number of VQs.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 96782b34e2b2..51630b1935f4 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1504,7 +1504,7 @@ static void suspend_vqs(struct mlx5_vdpa_net *ndev)
>  {
> int i;
>
> -   for (i = 0; i < ndev->mvdev.max_vqs; i++)
> +   for (i = 0; i < ndev->cur_num_vqs; i++)
> suspend_vq(ndev, &ndev->vqs[i]);
>  }
>
> @@ -1522,7 +1522,7 @@ static void resume_vq(struct mlx5_vdpa_net *ndev, 
> struct mlx5_vdpa_virtqueue *mv
>
>  static void resume_vqs(struct mlx5_vdpa_net *ndev)
>  {
> -   for (int i = 0; i < ndev->mvdev.max_vqs; i++)
> +   for (int i = 0; i < ndev->cur_num_vqs; i++)
> resume_vq(ndev, &ndev->vqs[i]);
>  }
>
>
> --
> 2.45.1
>




Re: [PATCH vhost 06/23] vdpa/mlx5: Remove duplicate suspend code

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea  wrote:
>
> Use the dedicated suspend_vqs() function instead.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Reviewed-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 51630b1935f4..eca6f68c2eda 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -3355,17 +3355,12 @@ static int mlx5_vdpa_suspend(struct vdpa_device *vdev)
>  {
> struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
> struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
> -   struct mlx5_vdpa_virtqueue *mvq;
> -   int i;
>
> mlx5_vdpa_info(mvdev, "suspending device\n");
>
> down_write(&ndev->reslock);
> unregister_link_notifier(ndev);
> -   for (i = 0; i < ndev->cur_num_vqs; i++) {
> -   mvq = &ndev->vqs[i];
> -   suspend_vq(ndev, mvq);
> -   }
> +   suspend_vqs(ndev);
> mlx5_vdpa_cvq_suspend(mvdev);
> mvdev->suspended = true;
> up_write(&ndev->reslock);
>
> --
> 2.45.1
>




Re: [PATCH vhost 04/23] vdpa/mlx5: Drop redundant check in teardown_virtqueues()

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea  wrote:
>
> The check is done inside teardown_vq().
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Reviewed-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 10 ++
>  1 file changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index b4d9ef4f66c8..96782b34e2b2 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2559,16 +2559,10 @@ static int setup_virtqueues(struct mlx5_vdpa_dev 
> *mvdev)
>
>  static void teardown_virtqueues(struct mlx5_vdpa_net *ndev)
>  {
> -   struct mlx5_vdpa_virtqueue *mvq;
> int i;
>
> -   for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
> -   mvq = &ndev->vqs[i];
> -   if (!mvq->initialized)
> -   continue;
> -
> -   teardown_vq(ndev, mvq);
> -   }
> +   for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--)
> +   teardown_vq(ndev, &ndev->vqs[i]);
>  }
>
>  static void update_cvq_info(struct mlx5_vdpa_dev *mvdev)
>
> --
> 2.45.1
>




Re: [PATCH vhost 03/23] vdpa/mlx5: Drop redundant code

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea  wrote:

Patch message suggestion:
Originally, the second loop initialized the CVQ. But (acde3929492b
("vdpa/mlx5: Use consistent RQT size") initialized all the queues in
the first loop, so the second iteration in ...

>
> The second iteration in init_mvqs() is never called because the first
> one will iterate up to max_vqs.
>

Either way,

Acked-by: Eugenio Pérez 

> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 
> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 6 --
>  1 file changed, 6 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 1ad281cbc541..b4d9ef4f66c8 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -3519,12 +3519,6 @@ static void init_mvqs(struct mlx5_vdpa_net *ndev)
> mvq->fwqp.fw = true;
> mvq->fw_state = MLX5_VIRTIO_NET_Q_OBJECT_NONE;
> }
> -   for (; i < ndev->mvdev.max_vqs; i++) {
> -   mvq = &ndev->vqs[i];
> -   memset(mvq, 0, offsetof(struct mlx5_vdpa_virtqueue, ri));
> -   mvq->index = i;
> -   mvq->ndev = ndev;
> -   }
>  }
>
>  struct mlx5_vdpa_mgmtdev {
>
> --
> 2.45.1
>




Re: Bug in Kernel 6.8.x, 6.9.x Causing Trace/Panic During Shutdown/Reboot

2024-06-19 Thread Ilkka Naulapää
disabled the CONFIG_FORCE_NR_CPUS option for 6.9.5 but the trace + panic
still exists. So that one didn't help. I've also been bisecting the
trace but have not finished it yet as the last half dozen builds
produced non-bootable kernels. Anyway, I will continue it soon(ish)
when I have a bit more free time.

--Ilkka

On Tue, Jun 18, 2024 at 5:52 PM Steven Rostedt  wrote:
>
> On Thu, 13 Jun 2024 10:32:24 +0300
> Ilkka Naulapää  wrote:
>
> > ok, so if you don't have any idea where this bug is after those debug
> > patches, I'll try to find some time to bisect it as a last resort.
> > Stay tuned.
>
> FYI,
>
> I just debugged a strange crash that was caused by my config having
> something leftover from your config. Specifically, that was:
>
> CONFIG_FORCE_NR_CPUS
>
> Do you get any warning about nr cpus not matching at boot up?
>
> Regardless, can you disable that and see if you still get the same
> crash.
>
> Thanks,
>
> -- Steve



Re: [PATCH vhost 02/23] vdpa/mlx5: Make setup/teardown_vq_resources() symmetrical

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea  wrote:
>
> ... by changing the setup_vq_resources() parameter.

s/parameter/parameter type/ ?

Either way,

Acked-by: Eugenio Pérez 

>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 
> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 3422da0e344b..1ad281cbc541 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -146,7 +146,7 @@ static bool is_index_valid(struct mlx5_vdpa_dev *mvdev, 
> u16 idx)
>
>  static void free_fixed_resources(struct mlx5_vdpa_net *ndev);
>  static void init_mvqs(struct mlx5_vdpa_net *ndev);
> -static int setup_vq_resources(struct mlx5_vdpa_dev *mvdev);
> +static int setup_vq_resources(struct mlx5_vdpa_net *ndev);
>  static void teardown_vq_resources(struct mlx5_vdpa_net *ndev);
>
>  static bool mlx5_vdpa_debug;
> @@ -2862,7 +2862,7 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
>
> if (teardown) {
> restore_channels_info(ndev);
> -   err = setup_vq_resources(mvdev);
> +   err = setup_vq_resources(ndev);
> if (err)
> return err;
> }
> @@ -2873,9 +2873,9 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
>  }
>
>  /* reslock must be held for this function */
> -static int setup_vq_resources(struct mlx5_vdpa_dev *mvdev)
> +static int setup_vq_resources(struct mlx5_vdpa_net *ndev)
>  {
> -   struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
> +   struct mlx5_vdpa_dev *mvdev = >mvdev;
> int err;
>
> WARN_ON(!rwsem_is_locked(&ndev->reslock));
> @@ -2997,7 +2997,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
> *vdev, u8 status)
> goto err_setup;
> }
> register_link_notifier(ndev);
> -   err = setup_vq_resources(mvdev);
> +   err = setup_vq_resources(ndev);
> if (err) {
> mlx5_vdpa_warn(mvdev, "failed to setup 
> driver\n");
> goto err_driver;
>
> --
> 2.45.1
>




Re: [PATCH vhost 01/23] vdpa/mlx5: Clarify meaning thorough function rename

2024-06-19 Thread Eugenio Perez Martin
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea  wrote:
>
> setup_driver()/teardown_driver() are a bit vague. These functions are
> used for virtqueue resources.
>
> Same for alloc_resources()/teardown_resources(): they represent fixed
> resources that are meant to exist during the device lifetime.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Cosmin Ratiu 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 28 ++--
>  1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index ecfc16151d61..3422da0e344b 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -144,10 +144,10 @@ static bool is_index_valid(struct mlx5_vdpa_dev *mvdev, 
> u16 idx)
> return idx <= mvdev->max_idx;
>  }
>
> -static void free_resources(struct mlx5_vdpa_net *ndev);
> +static void free_fixed_resources(struct mlx5_vdpa_net *ndev);
>  static void init_mvqs(struct mlx5_vdpa_net *ndev);
> -static int setup_driver(struct mlx5_vdpa_dev *mvdev);
> -static void teardown_driver(struct mlx5_vdpa_net *ndev);
> +static int setup_vq_resources(struct mlx5_vdpa_dev *mvdev);
> +static void teardown_vq_resources(struct mlx5_vdpa_net *ndev);
>
>  static bool mlx5_vdpa_debug;
>
> @@ -2848,7 +2848,7 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
> if (err)
> return err;
>
> -   teardown_driver(ndev);
> +   teardown_vq_resources(ndev);
> }
>
> mlx5_vdpa_update_mr(mvdev, new_mr, asid);
> @@ -2862,7 +2862,7 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
>
> if (teardown) {
> restore_channels_info(ndev);
> -   err = setup_driver(mvdev);
> +   err = setup_vq_resources(mvdev);
> if (err)
> return err;
> }
> @@ -2873,7 +2873,7 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
>  }
>
>  /* reslock must be held for this function */
> -static int setup_driver(struct mlx5_vdpa_dev *mvdev)
> +static int setup_vq_resources(struct mlx5_vdpa_dev *mvdev)
>  {
> struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
> int err;
> @@ -2931,7 +2931,7 @@ static int setup_driver(struct mlx5_vdpa_dev *mvdev)
>  }
>
>  /* reslock must be held for this function */
> -static void teardown_driver(struct mlx5_vdpa_net *ndev)
> +static void teardown_vq_resources(struct mlx5_vdpa_net *ndev)
>  {
>
> WARN_ON(!rwsem_is_locked(&ndev->reslock));
> @@ -2997,7 +2997,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
> *vdev, u8 status)
> goto err_setup;
> }
> register_link_notifier(ndev);
> -   err = setup_driver(mvdev);
> +   err = setup_vq_resources(mvdev);
> if (err) {
> mlx5_vdpa_warn(mvdev, "failed to setup 
> driver\n");
> goto err_driver;
> @@ -3040,7 +3040,7 @@ static int mlx5_vdpa_compat_reset(struct vdpa_device 
> *vdev, u32 flags)
>
> down_write(&ndev->reslock);
> unregister_link_notifier(ndev);
> -   teardown_driver(ndev);
> +   teardown_vq_resources(ndev);
> clear_vqs_ready(ndev);
> if (flags & VDPA_RESET_F_CLEAN_MAP)
> mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
> @@ -3197,7 +3197,7 @@ static void mlx5_vdpa_free(struct vdpa_device *vdev)
>
> ndev = to_mlx5_vdpa_ndev(mvdev);
>
> -   free_resources(ndev);
> +   free_fixed_resources(ndev);
> mlx5_vdpa_destroy_mr_resources(mvdev);
> if (!is_zero_ether_addr(ndev->config.mac)) {
> pfmdev = pci_get_drvdata(pci_physfn(mvdev->mdev->pdev));
> @@ -3467,7 +3467,7 @@ static int query_mtu(struct mlx5_core_dev *mdev, u16 
> *mtu)
> return 0;
>  }
>
> -static int alloc_resources(struct mlx5_vdpa_net *ndev)
> +static int alloc_fixed_resources(struct mlx5_vdpa_net *ndev)
>  {
> struct mlx5_vdpa_net_resources *res = &ndev->res;
> int err;
> @@ -3494,7 +3494,7 @@ static int alloc_resources(struct mlx5_vdpa_net *ndev)
> return err;
>  }
>
> -static void free_resources(struct mlx5_vdpa_net *ndev)
> +static void free_fixed_resources(struct mlx5_vdpa_net *ndev)
>  {
> struct mlx5_vdpa_net_resources *res = &ndev->res;
>
> @@ -3735,7 +3735,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> *v_mdev, const char *name,
> goto err_res;
> }
>
> -   err = alloc_resources(ndev);
> +   err = alloc_fixed_resources(ndev);
> if (err)
> goto err_mr;
>
> @@ -3758,7 +3758,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> *v_mdev, const char *name,
>  err_reg:
> destroy_workqueue(mvdev->wq);
>  err_res2:
> -   
