Re: [PATCH 2/7] dt-bindings: clock: qcom,a53pll: Allow opp-table subnode
On 19/06/2024 23:02, Luca Weiss wrote:
> Allow placing an opp-table as a subnode that can be assigned using
> operating-points-v2 to specify the frequency table for the PLL.
>
> Signed-off-by: Luca Weiss
> ---
>  Documentation/devicetree/bindings/clock/qcom,a53pll.yaml | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Krzysztof Kozlowski

Best regards,
Krzysztof
Re: [PATCH 3/7] dt-bindings: clock: qcom,a53pll: Add msm8226-a7pll compatible
On 19/06/2024 23:02, Luca Weiss wrote:
> Add the compatible for the A7PLL found in MSM8226 SoCs.
>
> Signed-off-by: Luca Weiss
> ---

Acked-by: Krzysztof Kozlowski

Best regards,
Krzysztof
Re: [PATCH 1/7] dt-bindings: mailbox: qcom: add compatible for MSM8226 SoC
On 19/06/2024 23:02, Luca Weiss wrote:
> Add the mailbox compatible for MSM8226 SoC.
>
> Signed-off-by: Luca Weiss
> ---

Acked-by: Krzysztof Kozlowski

Best regards,
Krzysztof
Re: [PATCH net-next V2] virtio-net: synchronize operstate with admin state on up/down
On Thu, Jun 06, 2024 at 08:22:13AM +0800, Jason Wang wrote:
> On Fri, May 31, 2024 at 8:18 AM Jason Wang wrote:
> >
> > On Thu, May 30, 2024 at 9:09 PM Michael S. Tsirkin wrote:
> > >
> > > On Thu, May 30, 2024 at 06:29:51PM +0800, Jason Wang wrote:
> > > > On Thu, May 30, 2024 at 2:10 PM Michael S. Tsirkin wrote:
> > > > >
> > > > > On Thu, May 30, 2024 at 11:20:55AM +0800, Jason Wang wrote:
> > > > > > This patch synchronizes operstate with admin state per RFC 2863.
> > > > > >
> > > > > > This is done by trying to toggle the carrier upon open/close and
> > > > > > synchronizing with the config change work. This allows status to
> > > > > > be propagated correctly to stacked devices like:
> > > > > >
> > > > > > ip link add link enp0s3 macvlan0 type macvlan
> > > > > > ip link set link enp0s3 down
> > > > > > ip link show
> > > > > >
> > > > > > Before this patch:
> > > > > >
> > > > > > 3: enp0s3: mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
> > > > > >     link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > > > ..
> > > > > > 5: macvlan0@enp0s3: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
> > > > > >     link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > > > >
> > > > > > After this patch:
> > > > > >
> > > > > > 3: enp0s3: mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
> > > > > >     link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > > > ...
> > > > > > 5: macvlan0@enp0s3: mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> > > > > >     link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > > > >
> > > > > > Cc: Venkat Venkatsubra
> > > > > > Cc: Gia-Khanh Nguyen
> > > > > > Reviewed-by: Xuan Zhuo
> > > > > > Acked-by: Michael S. Tsirkin
> > > > > > Signed-off-by: Jason Wang
> > > > > > ---
> > > > > > Changes since V1:
> > > > > > - rebase
> > > > > > - add ack/review tags
> > > > > > ---
> > > > > >  drivers/net/virtio_net.c | 94 +++-
> > > > > >  1 file changed, 63 insertions(+), 31 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > index 4a802c0ea2cb..69e4ae353c51 100644
> > > > > > --- a/drivers/net/virtio_net.c
> > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > @@ -433,6 +433,12 @@ struct virtnet_info {
> > > > > >          /* The lock to synchronize the access to refill_enabled */
> > > > > >          spinlock_t refill_lock;
> > > > > >
> > > > > > +        /* Is config change enabled? */
> > > > > > +        bool config_change_enabled;
> > > > > > +
> > > > > > +        /* The lock to synchronize the access to config_change_enabled */
> > > > > > +        spinlock_t config_change_lock;
> > > > > > +
> > > > > >          /* Work struct for config space updates */
> > > > > >          struct work_struct config_work;
> > > > >
> > > > > But we already have dev->config_lock and dev->config_enabled.
> > > > >
> > > > > And it actually works better - instead of discarding config
> > > > > change events it defers them until enabled.
> > > >
> > > > Yes, but then both the virtio-net driver and the virtio core can ask
> > > > to enable and disable, and then we need some kind of synchronization,
> > > > which is non-trivial.
> > >
> > > Well, for the core it happens on the bring-up path before the driver
> > > works, and later on tear-down after it is gone. So I do not think
> > > they ever do it at the same time.
> >
> > For example, there could be a suspend/resume when the admin state is down.
> >
> > > > And device enabling in the core is different from bringing the
> > > > device up in the networking subsystem. Here we just delay dealing
> > > > with the config change interrupt until ndo_open() (e.g. trying to
> > > > ack an announce is meaningless when the device is down).
> > > >
> > > > Thanks
> > >
> > > Another thing is that it is better not to re-read all config
> > > on link up if there was no config interrupt - fewer VM exits.
> >
> > Yes, but it should not matter much as it's done in ndo_open().
>
> Michael, any more comments on this?
>
> Please confirm whether this patch is OK or not. If you prefer to reuse
> config_disable(), I can change it from a boolean to a counter that
> allows nesting.
>
> Thanks

I think doing this in the core and re-using config_lock is a cleaner
approach for sure, yes. For remove we do not need stacking, but yes, we
need it for freeze.

I am not sure how a counter would work - let's see the patch. It might
be easier, besides config_enabled, to add config_disabled, and document
that config_enabled is controlled by the core and config_disabled by the
driver. We'd then have two APIs, virtio_config_disable_core() and
virtio_config_disable_driver(). But I leave that decision in your
capable hands.
Re: [PATCH] livepatch: introduce klp_func called interface
> On Jun 7, 2024, at 17:07, Miroslav Benes wrote:
>
> It would be better than this patch but given what was mentioned in the
> thread I wonder if it is possible to use ftrace even for this. See
> /sys/kernel/tracing/trace_stat/function*. It already gathers the number
> of hits.

Hi Miroslav,

Is ftrace able to trace a function which I added into the kernel by
livepatching?

Regards,
Wardenjohn
Re: [PATCH v2 5/7] bpf: do not create bpf_non_sleepable_error_inject list when unnecessary
On Wed, Jun 19, 2024 at 3:49 PM Vlastimil Babka wrote:
>
> When CONFIG_FUNCTION_ERROR_INJECTION is disabled,
> within_error_injection_list() will return false for any address, and the
> result of the check_non_sleepable_error_inject() denylist is thus
> redundant. The bpf_non_sleepable_error_inject list thus does not need to
> be constructed at all, so #ifdef it out.
>
> This will allow inlining functions on the list when
> CONFIG_FUNCTION_ERROR_INJECTION is disabled, as there will be no
> BTF_ID() reference for them.
>
> Signed-off-by: Vlastimil Babka
> ---
>  kernel/bpf/verifier.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 77da1f438bec..5cd93de37d68 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -21044,6 +21044,8 @@ static int check_attach_modify_return(unsigned long addr, const char *func_name)
>          return -EINVAL;
>  }
>
> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
> +
>  /* list of non-sleepable functions that are otherwise on
>   * ALLOW_ERROR_INJECTION list
>   */
> @@ -21061,6 +21063,19 @@ static int check_non_sleepable_error_inject(u32 btf_id)
>          return btf_id_set_contains(&btf_non_sleepable_error_inject, btf_id);
>  }
>
> +#else /* CONFIG_FUNCTION_ERROR_INJECTION */
> +
> +/*
> + * Pretend the denylist is empty, within_error_injection_list() will return
> + * false anyway.
> + */
> +static int check_non_sleepable_error_inject(u32 btf_id)
> +{
> +        return 0;
> +}
> +
> +#endif

The comment reads like this is an optimization, but it's a mandatory
#ifdef, since should_failslab() might not be found by resolve_btfids
during the build. Please make that clear in the comment.
[PATCH v2 7/7] mm, page_alloc: add static key for should_fail_alloc_page()
Similarly to should_failslab(), remove the overhead of calling the
noinline function should_fail_alloc_page() with a static key that guards
the callsite in the page allocator hotpath and is controlled by the
fault and error injection frameworks and bpf.

Additionally, compile out all relevant code if neither
CONFIG_FAIL_PAGE_ALLOC nor CONFIG_FUNCTION_ERROR_INJECTION is enabled.
When only the latter is not enabled, make should_fail_alloc_page()
static inline instead of noinline.

No measurement was done other than verifying that
should_fail_alloc_page() is gone from the perf profile. A measurement
with the analogous change for should_failslab() suggests that for a page
allocator intensive workload there might be a noticeable improvement.

This also makes CONFIG_FAIL_PAGE_ALLOC an option suitable not only for
debug kernels.

Reviewed-by: Roman Gushchin
Signed-off-by: Vlastimil Babka
---
 include/linux/fault-inject.h |  3 ++-
 mm/fail_page_alloc.c         |  3 ++-
 mm/internal.h                |  2 ++
 mm/page_alloc.c              | 30 +++---
 4 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index 0d0fa94dc1c8..1a782042ae80 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -96,8 +96,9 @@ static inline void fault_config_init(struct fault_config *config,
 
 struct kmem_cache;
 
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
 bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);
-
+#endif
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);
 #else
diff --git a/mm/fail_page_alloc.c b/mm/fail_page_alloc.c
index b1b09cce9394..0906b76d78e8 100644
--- a/mm/fail_page_alloc.c
+++ b/mm/fail_page_alloc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include
 #include
+#include "internal.h"
 
 static struct {
         struct fault_attr attr;
@@ -9,7 +10,7 @@ static struct {
         bool ignore_gfp_reclaim;
         u32 min_order;
 } fail_page_alloc = {
-        .attr = FAULT_ATTR_INITIALIZER,
+        .attr = FAULT_ATTR_INITIALIZER_KEY(&should_fail_alloc_page_active.key),
         .ignore_gfp_reclaim = true,
         .ignore_gfp_highmem = true,
         .min_order = 1,
diff --git a/mm/internal.h b/mm/internal.h
index b2c75b12014e..8539e39b02e6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -410,6 +410,8 @@ extern char * const zone_names[MAX_NR_ZONES];
 /* perform sanity checks on struct pages being allocated or freed */
 DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
 
+DECLARE_STATIC_KEY_FALSE(should_fail_alloc_page_active);
+
 extern int min_free_kbytes;
 
 void setup_per_zone_wmarks(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..b6e246acb4aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3008,11 +3008,35 @@ struct page *rmqueue(struct zone *preferred_zone,
         return page;
 }
 
-noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
+#if defined(CONFIG_FUNCTION_ERROR_INJECTION) || defined(CONFIG_FAIL_PAGE_ALLOC)
+DEFINE_STATIC_KEY_FALSE(should_fail_alloc_page_active);
+
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
+noinline
+#else
+static inline
+#endif
+bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 {
         return __should_fail_alloc_page(gfp_mask, order);
 }
-ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
+ALLOW_ERROR_INJECTION_KEY(should_fail_alloc_page, TRUE, &should_fail_alloc_page_active);
+
+static __always_inline bool
+should_fail_alloc_page_wrapped(gfp_t gfp_mask, unsigned int order)
+{
+        if (static_branch_unlikely(&should_fail_alloc_page_active))
+                return should_fail_alloc_page(gfp_mask, order);
+
+        return false;
+}
+#else
+static __always_inline bool
+should_fail_alloc_page_wrapped(gfp_t gfp_mask, unsigned int order)
+{
+        return false;
+}
+#endif
 
 static inline long __zone_watermark_unusable_free(struct zone *z,
                                 unsigned int order, unsigned int alloc_flags)
@@ -4430,7 +4454,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
         might_alloc(gfp_mask);
 
-        if (should_fail_alloc_page(gfp_mask, order))
+        if (should_fail_alloc_page_wrapped(gfp_mask, order))
                 return false;
 
         *alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
-- 
2.45.2
[PATCH v2 6/7] mm, slab: add static key for should_failslab()
Since commit 4f6923fbb352 ("mm: make should_failslab always available
for fault injection"), should_failslab() is unconditionally a noinline
function. This adds visible overhead to the slab allocation hotpath,
even if the function is empty. With CONFIG_FAILSLAB=y there's additional
overhead, even when the functionality is not activated by a boot
parameter or via debugfs.

The overhead can be eliminated with a static key around the callsite.
Fault injection and error injection frameworks, including bpf, can now
be told that this function has a static key associated, and are able to
enable and disable it accordingly.

Additionally, compile out all relevant code if neither CONFIG_FAILSLAB
nor CONFIG_FUNCTION_ERROR_INJECTION is enabled. When only the latter is
not enabled, make should_failslab() static inline instead of noinline.

To demonstrate the reduced overhead of calling an empty
should_failslab() function, a kernel build with
CONFIG_FUNCTION_ERROR_INJECTION enabled but CONFIG_FAILSLAB disabled,
and CPU mitigations enabled, was used in a qemu-kvm (virtme-ng) on an
AMD Ryzen 7 2700 machine, and execution of a program trying to open() a
non-existent file was measured 3 times:

	for (int i = 0; i < 1000; i++) {
		open("non_existent", O_RDONLY);
	}

After this patch, the measured real time was 4.3% smaller. Using perf
profiling it was verified that should_failslab was gone from the
profile. With CONFIG_FAILSLAB also enabled, the patched kernel's
performance was unaffected, as expected, while the unpatched kernel's
performance was worse, resulting in a relative speedup of 10.5%.

This means it no longer needs to be an option suitable only for debug
kernel builds.

Acked-by: Alexei Starovoitov
Reviewed-by: Roman Gushchin
Signed-off-by: Vlastimil Babka
---
 include/linux/fault-inject.h |  4 +++-
 mm/failslab.c                |  2 +-
 mm/slab.h                    |  3 +++
 mm/slub.c                    | 30 +++---
 4 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index cfe75cc1bac4..0d0fa94dc1c8 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -107,9 +107,11 @@ static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 }
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
 int should_failslab(struct kmem_cache *s, gfp_t gfpflags);
+#endif
 #ifdef CONFIG_FAILSLAB
-extern bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags);
+bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags);
 #else
 static inline bool __should_failslab(struct kmem_cache *s, gfp_t gfpflags)
 {
diff --git a/mm/failslab.c b/mm/failslab.c
index ffc420c0e767..878fd08e5dac 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -9,7 +9,7 @@ static struct {
         bool ignore_gfp_reclaim;
         bool cache_filter;
 } failslab = {
-        .attr = FAULT_ATTR_INITIALIZER,
+        .attr = FAULT_ATTR_INITIALIZER_KEY(&should_failslab_active.key),
         .ignore_gfp_reclaim = true,
         .cache_filter = false,
 };
diff --git a/mm/slab.h b/mm/slab.h
index 5f8f47c5bee0..792e19cb37b8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 
 /*
  * Internal slab definitions
@@ -160,6 +161,8 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(freelist_aba_t)
  */
 #define slab_page(s) folio_page(slab_folio(s), 0)
 
+DECLARE_STATIC_KEY_FALSE(should_failslab_active);
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/slub.c b/mm/slub.c
index 0809760cf789..11980aa94631 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3874,13 +3874,37 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
                         0, sizeof(void *));
 }
 
-noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
+#if defined(CONFIG_FUNCTION_ERROR_INJECTION) || defined(CONFIG_FAILSLAB)
+DEFINE_STATIC_KEY_FALSE(should_failslab_active);
+
+#ifdef CONFIG_FUNCTION_ERROR_INJECTION
+noinline
+#else
+static inline
+#endif
+int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
 {
         if (__should_failslab(s, gfpflags))
                 return -ENOMEM;
         return 0;
 }
-ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
+ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, &should_failslab_active);
+
+static __always_inline int should_failslab_wrapped(struct kmem_cache *s,
+                                                   gfp_t gfp)
+{
+        if (static_branch_unlikely(&should_failslab_active))
+                return should_failslab(s, gfp);
+        else
+                return 0;
+}
+#else
+static __always_inline int should_failslab_wrapped(struct kmem_cache *s,
+                                                   gfp_t gfp)
+{
+        return false;
+}
+#endif
 
 static __fastpath_inline struct kmem_cache *slab
[PATCH v2 4/7] bpf: support error injection static keys for multi_link attached progs
Functions marked for error injection can have an associated static key
that guards the callsite(s) to avoid the overhead of calling an empty
function when no error injection is in progress.

Outside of the error injection framework itself, bpf programs can be
attached to kprobes and override results of error-injectable functions.
To make sure these functions are actually called, attaching such bpf
programs should control the static key accordingly.

Therefore, add an array of static keys to struct bpf_kprobe_multi_link
and fill it in addrs_check_error_injection_list() for programs with
kprobe_override enabled, using get_injection_key() instead of
within_error_injection_list(). Introduce bpf_kprobe_ei_keys_control() to
control the static keys, and call it when doing multi_link_attach and
release.

Signed-off-by: Vlastimil Babka
---
 kernel/trace/bpf_trace.c | 59 +++-
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 944de1c41209..ef0fadb76bfa 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2613,6 +2613,7 @@ struct bpf_kprobe_multi_link {
         struct bpf_link link;
         struct fprobe fp;
         unsigned long *addrs;
+        struct static_key **ei_keys;
         u64 *cookies;
         u32 cnt;
         u32 mods_cnt;
@@ -2687,11 +2688,30 @@ static void free_user_syms(struct user_syms *us)
         kvfree(us->buf);
 }
 
+static void bpf_kprobe_ei_keys_control(struct bpf_kprobe_multi_link *link, bool enable)
+{
+        u32 i;
+
+        for (i = 0; i < link->cnt; i++) {
+                if (!link->ei_keys[i])
+                        break;
+
+                if (enable)
+                        static_key_slow_inc(link->ei_keys[i]);
+                else
+                        static_key_slow_dec(link->ei_keys[i]);
+        }
+}
+
 static void bpf_kprobe_multi_link_release(struct bpf_link *link)
 {
         struct bpf_kprobe_multi_link *kmulti_link;
 
         kmulti_link = container_of(link, struct bpf_kprobe_multi_link, link);
+
+        if (kmulti_link->ei_keys)
+                bpf_kprobe_ei_keys_control(kmulti_link, false);
+
         unregister_fprobe(&kmulti_link->fp);
         kprobe_multi_put_modules(kmulti_link->mods, kmulti_link->mods_cnt);
 }
@@ -2703,6 +2723,7 @@ static void bpf_kprobe_multi_link_dealloc(struct bpf_link *link)
         kmulti_link = container_of(link, struct bpf_kprobe_multi_link, link);
         kvfree(kmulti_link->addrs);
         kvfree(kmulti_link->cookies);
+        kvfree(kmulti_link->ei_keys);
         kfree(kmulti_link->mods);
         kfree(kmulti_link);
 }
@@ -2985,13 +3006,19 @@ static int get_modules_for_addrs(struct module ***mods, unsigned long *addrs, u3
         return arr.mods_cnt;
 }
 
-static int addrs_check_error_injection_list(unsigned long *addrs, u32 cnt)
+static int addrs_check_error_injection_list(unsigned long *addrs, struct static_key **ei_keys,
+                                            u32 cnt)
 {
-        u32 i;
+        struct static_key *ei_key;
+        u32 i, j = 0;
 
         for (i = 0; i < cnt; i++) {
-                if (!within_error_injection_list(addrs[i]))
+                ei_key = get_injection_key(addrs[i]);
+                if (IS_ERR(ei_key))
                         return -EINVAL;
+
+                if (ei_key)
+                        ei_keys[j++] = ei_key;
         }
         return 0;
 }
@@ -3000,6 +3027,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 {
         struct bpf_kprobe_multi_link *link = NULL;
         struct bpf_link_primer link_primer;
+        struct static_key **ei_keys = NULL;
         void __user *ucookies;
         unsigned long *addrs;
         u32 flags, cnt, size;
@@ -3075,9 +3103,24 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
                 goto error;
         }
 
-        if (prog->kprobe_override && addrs_check_error_injection_list(addrs, cnt)) {
-                err = -EINVAL;
-                goto error;
+        if (prog->kprobe_override) {
+                ei_keys = kvcalloc(cnt, sizeof(*ei_keys), GFP_KERNEL);
+                if (!ei_keys) {
+                        err = -ENOMEM;
+                        goto error;
+                }
+
+                if (addrs_check_error_injection_list(addrs, ei_keys, cnt)) {
+                        err = -EINVAL;
+                        goto error;
+                }
+
+                if (ei_keys[0]) {
+                        link->ei_keys = ei_keys;
+                } else {
+                        kvfree(ei_keys);
+                        ei_keys = NULL;
+                }
         }
 
         link = kzalloc(sizeof(*link), GFP_KERNEL);
@@ -3132,10 +3175,14 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
                 return err;
         }
 
+        if (link->ei_keys)
+                bpf_kprobe_ei_keys_control(link, true);
+
[PATCH v2 1/7] fault-inject: add support for static keys around fault injection sites
Some fault injection sites are placed in hotpaths and incur overhead
even when not enabled, due to one or more function calls leading up to
should_fail_ex(), which returns false due to attr->probability == 0.

This overhead can be eliminated if the outermost call into the checks is
guarded with a static key, so add support for that. The framework should
be told that such a static key exists for a fault_attr by initializing
fault_attr->active with the static key's address. When it's not NULL,
enable the static key from setup_fault_attr() when the fault probability
is non-zero. Also wire up writing into the debugfs "probability" file to
enable or disable the static key when transitioning between zero and
non-zero probability.

For now, do not add configfs interface support, as the immediate plan is
to leverage this for should_failslab() and should_fail_alloc_page()
after other necessary preparatory changes, and not for any of the
configfs-based fault injection users.

Signed-off-by: Vlastimil Babka
---
 include/linux/fault-inject.h |  7 ++-
 lib/fault-inject.c           | 43 ++-
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index 6d5edef09d45..cfe75cc1bac4 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 
 /*
  * For explanation of the elements of this struct, see
@@ -30,13 +31,14 @@ struct fault_attr {
         unsigned long count;
         struct ratelimit_state ratelimit_state;
         struct dentry *dname;
+        struct static_key *active;
 };
 
 enum fault_flags {
         FAULT_NOWARN = 1 << 0,
 };
 
-#define FAULT_ATTR_INITIALIZER {                                \
+#define FAULT_ATTR_INITIALIZER_KEY(_key) {                      \
                 .interval = 1,                                  \
                 .times = ATOMIC_INIT(1),                        \
                 .require_end = ULONG_MAX,                       \
@@ -44,8 +46,11 @@ enum fault_flags {
                 .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \
                 .verbose = 2,                                   \
                 .dname = NULL,                                  \
+                .active = (_key),                               \
         }
 
+#define FAULT_ATTR_INITIALIZER FAULT_ATTR_INITIALIZER_KEY(NULL)
+
 #define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
 int setup_fault_attr(struct fault_attr *attr, char *str);
 bool should_fail_ex(struct fault_attr *attr, ssize_t size, int flags);
diff --git a/lib/fault-inject.c b/lib/fault-inject.c
index d608f9b48c10..de9552cb22d0 100644
--- a/lib/fault-inject.c
+++ b/lib/fault-inject.c
@@ -35,6 +35,9 @@ int setup_fault_attr(struct fault_attr *attr, char *str)
         atomic_set(&attr->times, times);
         atomic_set(&attr->space, space);
 
+        if (probability != 0 && attr->active)
+                static_key_slow_inc(attr->active);
+
         return 1;
 }
 EXPORT_SYMBOL_GPL(setup_fault_attr);
@@ -166,6 +169,12 @@ EXPORT_SYMBOL_GPL(should_fail);
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
 
+/*
+ * Protect updating probability from debugfs as that may trigger static key
+ * changes when changing between zero and non-zero.
+ */
+static DEFINE_MUTEX(probability_mutex);
+
 static int debugfs_ul_set(void *data, u64 val)
 {
         *(unsigned long *)data = val;
@@ -186,6 +195,38 @@ static void debugfs_create_ul(const char *name, umode_t mode,
         debugfs_create_file(name, mode, parent, value, &fops_ul);
 }
 
+static int debugfs_prob_set(void *data, u64 val)
+{
+        struct fault_attr *attr = data;
+
+        mutex_lock(&probability_mutex);
+
+        if (attr->active) {
+                if (attr->probability != 0 && val == 0) {
+                        static_key_slow_dec(attr->active);
+                } else if (attr->probability == 0 && val != 0) {
+                        static_key_slow_inc(attr->active);
+                }
+        }
+
+        attr->probability = val;
+
+        mutex_unlock(&probability_mutex);
+
+        return 0;
+}
+
+static int debugfs_prob_get(void *data, u64 *val)
+{
+        struct fault_attr *attr = data;
+
+        *val = attr->probability;
+
+        return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_prob, debugfs_prob_get, debugfs_prob_set, "%llu\n");
+
 #ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER
 
 static int debugfs_stacktrace_depth_set(void *data, u64 val)
@@ -218,7 +259,7 @@ struct dentry *fault_create_debugfs_attr(const char *name,
         if (IS_ERR(dir))
                 return dir;
 
-        debugfs_create_ul("probability", mode, dir, &attr->probability);
+        debugfs_create_file("probability", mode, dir, attr, &fops_prob);
         debugfs_create_ul("interval", mode, dir, &attr->interval);
         debugfs_create_atomic_t("times", m
[PATCH v2 2/7] error-injection: support static keys around injectable functions
Error-injectable functions cannot be inlined, and since some are called
from hot paths, this incurs overhead even if no error injection is
enabled for them. To avoid this overhead when disabled, allow the
callsites of error-injectable functions to put the calls behind a static
key, which the framework can control when error injection is enabled or
disabled for the function.

Introduce a new ALLOW_ERROR_INJECTION_KEY() macro that adds a parameter
with the static key's address, and store it in struct
error_injection_entry. This new field has caused a mismatch when
populating the injection list from the _error_injection_whitelist
section using the current STRUCT_ALIGN(), so change the alignment to 8.

During the population, copy the key's address also to struct ei_entry,
and make it possible to retrieve it with get_injection_key().

Finally, make the processing of writes to the debugfs inject file enable
the static key when the function is added to the injection list, and
disable it when removed.

Signed-off-by: Vlastimil Babka
---
 include/asm-generic/error-injection.h | 13 -
 include/asm-generic/vmlinux.lds.h     |  2 +-
 include/linux/error-injection.h       | 12 ++--
 kernel/fail_function.c                | 10 ++
 lib/error-inject.c                    | 19 +++
 5 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/error-injection.h b/include/asm-generic/error-injection.h
index b05253f68eaa..eed2731f3820 100644
--- a/include/asm-generic/error-injection.h
+++ b/include/asm-generic/error-injection.h
@@ -12,6 +12,7 @@ enum {
 
 struct error_injection_entry {
         unsigned long addr;
+        unsigned long static_key_addr;
         int etype;
 };
 
@@ -25,16 +26,26 @@ struct pt_regs;
  * 'Error Injectable Functions' section.
  */
 #define ALLOW_ERROR_INJECTION(fname, _etype)                    \
-static struct error_injection_entry __used                      \
+static struct error_injection_entry __used __aligned(8)         \
         __section("_error_injection_whitelist")                 \
         _eil_addr_##fname = {                                   \
                 .addr = (unsigned long)fname,                   \
                 .etype = EI_ETYPE_##_etype,                     \
         }
 
+#define ALLOW_ERROR_INJECTION_KEY(fname, _etype, key)           \
+static struct error_injection_entry __used __aligned(8)         \
+        __section("_error_injection_whitelist")                 \
+        _eil_addr_##fname = {                                   \
+                .addr = (unsigned long)fname,                   \
+                .static_key_addr = (unsigned long)key,          \
+                .etype = EI_ETYPE_##_etype,                     \
+        }
+
 void override_function_with_return(struct pt_regs *regs);
 #else
 #define ALLOW_ERROR_INJECTION(fname, _etype)
+#define ALLOW_ERROR_INJECTION_KEY(fname, _etype, key)
 static inline void override_function_with_return(struct pt_regs *regs) { }
 #endif
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5703526d6ebf..1b15a0af2a00 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -248,7 +248,7 @@
 #ifdef CONFIG_FUNCTION_ERROR_INJECTION
 #define ERROR_INJECT_WHITELIST()        \
-        STRUCT_ALIGN();                 \
+        . = ALIGN(8);                   \
         BOUNDED_SECTION(_error_injection_whitelist)
 #else
 #define ERROR_INJECT_WHITELIST()
diff --git a/include/linux/error-injection.h b/include/linux/error-injection.h
index 20e738f4eae8..48da027c0302 100644
--- a/include/linux/error-injection.h
+++ b/include/linux/error-injection.h
@@ -6,10 +6,13 @@
 #include
 #include
 
+struct static_key;
+
 #ifdef CONFIG_FUNCTION_ERROR_INJECTION
 
-extern bool within_error_injection_list(unsigned long addr);
-extern int get_injectable_error_type(unsigned long addr);
+bool within_error_injection_list(unsigned long addr);
+int get_injectable_error_type(unsigned long addr);
+struct static_key *get_injection_key(unsigned long addr);
 
 #else /* !CONFIG_FUNCTION_ERROR_INJECTION */
 
@@ -23,6 +26,11 @@ static inline int get_injectable_error_type(unsigned long addr)
         return -EOPNOTSUPP;
 }
 
+static inline struct static_key *get_injection_key(unsigned long addr)
+{
+        return NULL;
+}
+
 #endif
 
 #endif /* _LINUX_ERROR_INJECTION_H */
diff --git a/kernel/fail_function.c b/kernel/fail_function.c
index d971a0189319..d39a9606a448 100644
--- a/kernel/fail_function.c
+++ b/kernel/fail_function.c
@@ -27,6 +27,7 @@ struct fei_attr {
         struct list_head list;
         struct kprobe kp;
         unsigned long retval;
+        struct static_key *key;
 };
 static DEFINE_MUTEX(fei_lock);
 static LIST_HEAD(fei_attr_list);
@@
[PATCH v2 0/7] static key support for error injection functions
This should now be complete, but perf_events attached bpf programs are untested (Patch 3). This is spread accross several subsystems but the easiest way would be to go through a single tree, such as the mm tree. As previously mentioned by myself [1] and others [2] the functions designed for error injection can bring visible overhead in fastpaths such as slab or page allocation, because even if nothing hooks into them at a given moment, they are noninline function calls regardless of CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab always available for fault injection") and af3b854492f3 ("mm/page_alloc.c: allow error injection"). Live patching their callsites has been also suggested in both [1] and [2] threads, and this is an attempt to do that with static keys that guard the call sites. When disabled, the error injection functions still exist and are noinline, but are not being called. Any of the existing mechanisms that can inject errors should make sure to enable the respective static key. I have added that support to hopefully all of them that can be used today. - the legacy fault injection, i.e. CONFIG_FAILSLAB and CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the address of the static key if it exists. The key will be activated if the fault injection probability becomes non-zero, and deactivated in the opposite transition. This also removes the overhead of the evaluation (on top of the noninline function call) when these mechanisms are configured in the kernel but unused at the moment. - the generic error injection using kretprobes with override_function_with_return is handled in Patch 2. The ALLOW_ERROR_INJECTION() annotation is extended so that static key address can be passed, and the framework controls it when error injection is enabled or disabled in debugfs for the function. - bpf programs can override return values of probed functions with CONFIG_BPF_KPROBE_OVERRIDE=y and have prog->kprobe_override=1. 
They can be attached to perf_event, which is handled in Patch 3, or via multi_link_attach, which is handled in Patch 4. I have tested the latter using a modified bcc program from commit 4f6923fbb352 description, but not Patch 3 using a perf_event - testing is welcome. - ftrace seems to be using override_function_with_return from #define ftrace_override_function_with_return but there appear to be no users, which was confirmed by Mark Rutland in the RFC thread. If anyone was crazy enough to use multiple of mechanisms above simultaneously, the usage of static_key_slow_inc/dec will do the right thing and the key will be enabled iff at least one mechanism is active. Additionally to the static key support, Patch 5 makes it possible to stop making the fault injection functions noninline with CONFIG_FUNCTION_ERROR_INJECTION=n by compiling out the BTF_ID() references for bpf_non_sleepable_error_inject which are unnecessary in that case. Patches 6 and 7 implement the static keys for the two mm fault injection sites in slab and page allocators. I have measured the improvement for the slab case, as described in Patch 6: To demonstrate the reduced overhead of calling an empty should_failslab() function, a kernel build with CONFIG_FUNCTION_ERROR_INJECTION enabled but CONFIG_FAILSLAB disabled, and CPU mitigations enabled, was used in a qemu-kvm (virtme-ng) on AMD Ryzen 7 2700 machine, and execution of a program trying to open() a non-existent file was measured 3 times: for (int i = 0; i < 1000; i++) { open("non_existent", O_RDONLY); } After this patch, the measured real time was 4.3% smaller. Using perf profiling it was verified that should_failslab was gone from the profile. With CONFIG_FAILSLAB also enabled, the patched kernel performace was unaffected, as expected, while unpatched kernel's performance was worse, resulting in the relative speedup being 10.5%. This means it no longer needs to be an option suitable only for debug kernel builds. 
There might be other such fault injection callsites in hotpaths of other subsystems but I didn't search for them at this point. With all the preparations in place, it should be simple to improve them now. FAQ: Q: Does this improve only config options nobody uses in production anyway? A: No, the error injection hooks are unconditionally noninline functions even if they are empty. CONFIG_FUNCTION_ERROR_INJECTION=y is probably rather common, with overrides done via bpf. The goal was to eliminate this unnecessary overhead. But as a secondary benefit, the legacy fault injection options can now also be enabled in production kernels without extra overhead. Q: Should we remove the legacy fault injection framework? A: Maybe? I didn't want to wait for that to happen, so it's just handled as well (Patch 1). The generic error injection handling and bpf needed the most effort anyway. Q: Should there be a unified
[PATCH v2 5/7] bpf: do not create bpf_non_sleepable_error_inject list when unnecessary
When CONFIG_FUNCTION_ERROR_INJECTION is disabled, within_error_injection_list() will return false for any address, so checking the check_non_sleepable_error_inject() denylist is redundant. The bpf_non_sleepable_error_inject list therefore does not need to be constructed at all, so #ifdef it out. This will allow inlining the functions on the list when CONFIG_FUNCTION_ERROR_INJECTION is disabled, as there will be no BTF_ID() reference for them. Signed-off-by: Vlastimil Babka --- kernel/bpf/verifier.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 77da1f438bec..5cd93de37d68 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -21044,6 +21044,8 @@ static int check_attach_modify_return(unsigned long addr, const char *func_name) return -EINVAL; } +#ifdef CONFIG_FUNCTION_ERROR_INJECTION + /* list of non-sleepable functions that are otherwise on * ALLOW_ERROR_INJECTION list */ @@ -21061,6 +21063,19 @@ static int check_non_sleepable_error_inject(u32 btf_id) return btf_id_set_contains(&btf_non_sleepable_error_inject, btf_id); } +#else /* CONFIG_FUNCTION_ERROR_INJECTION */ + +/* + * Pretend the denylist is empty, within_error_injection_list() will return + * false anyway. + */ +static int check_non_sleepable_error_inject(u32 btf_id) +{ + return 0; +} + +#endif + int bpf_check_attach_target(struct bpf_verifier_log *log, const struct bpf_prog *prog, const struct bpf_prog *tgt_prog, -- 2.45.2
[PATCH v2 3/7] bpf: support error injection static keys for perf_event attached progs
Functions marked for error injection can have an associated static key that guards the callsite(s) to avoid the overhead of calling an empty function when no error injection is in progress. Outside of the error injection framework itself, bpf programs can be attached to perf events and override results of error-injectable functions. To make sure these functions are actually called, attaching such bpf programs should control the static key accordingly. Therefore, add the static key's address to struct trace_kprobe and fill it in trace_kprobe_error_injectable(), using get_injection_key() instead of within_error_injection_list(). Introduce trace_kprobe_error_injection_control() to control the static key and call the control function when attaching or detaching programs with kprobe_override to perf events. Signed-off-by: Vlastimil Babka --- kernel/trace/bpf_trace.c| 6 ++ kernel/trace/trace_kprobe.c | 30 -- kernel/trace/trace_probe.h | 5 + 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index f5154c051d2c..944de1c41209 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -2283,6 +2283,9 @@ int perf_event_attach_bpf_prog(struct perf_event *event, rcu_assign_pointer(event->tp_event->prog_array, new_array); bpf_prog_array_free_sleepable(old_array); + if (prog->kprobe_override) + trace_kprobe_error_injection_control(event->tp_event, true); + unlock: mutex_unlock(&bpf_event_mutex); return ret; @@ -2299,6 +2302,9 @@ void perf_event_detach_bpf_prog(struct perf_event *event) if (!event->prog) goto unlock; + if (event->prog->kprobe_override) + trace_kprobe_error_injection_control(event->tp_event, false); + old_array = bpf_event_rcu_dereference(event->tp_event->prog_array); ret = bpf_prog_array_copy(old_array, event->prog, NULL, 0, &new_array); if (ret == -ENOENT) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 16383247bdbf..1c1ee95bd5de 100644 --- 
a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -61,6 +61,7 @@ struct trace_kprobe { unsigned long __percpu *nhit; const char *symbol;/* symbol name */ struct trace_probe tp; + struct static_key *ei_key; }; static bool is_trace_kprobe(struct dyn_event *ev) @@ -235,9 +236,34 @@ bool trace_kprobe_on_func_entry(struct trace_event_call *call) bool trace_kprobe_error_injectable(struct trace_event_call *call) { struct trace_kprobe *tk = trace_kprobe_primary_from_call(call); + struct static_key *ei_key; - return tk ? within_error_injection_list(trace_kprobe_address(tk)) : - false; + if (!tk) + return false; + + ei_key = get_injection_key(trace_kprobe_address(tk)); + if (IS_ERR(ei_key)) + return false; + + tk->ei_key = ei_key; + return true; +} + +void trace_kprobe_error_injection_control(struct trace_event_call *call, + bool enable) +{ + struct trace_kprobe *tk = trace_kprobe_primary_from_call(call); + + if (!tk) + return; + + if (!tk->ei_key) + return; + + if (enable) + static_key_slow_inc(tk->ei_key); + else + static_key_slow_dec(tk->ei_key); } static int register_kprobe_event(struct trace_kprobe *tk); diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index 5803e6a41570..d9ddcabb9f97 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -212,6 +212,8 @@ DECLARE_BASIC_PRINT_TYPE_FUNC(symbol); #ifdef CONFIG_KPROBE_EVENTS bool trace_kprobe_on_func_entry(struct trace_event_call *call); bool trace_kprobe_error_injectable(struct trace_event_call *call); +void trace_kprobe_error_injection_control(struct trace_event_call *call, + bool enabled); #else static inline bool trace_kprobe_on_func_entry(struct trace_event_call *call) { @@ -222,6 +224,9 @@ static inline bool trace_kprobe_error_injectable(struct trace_event_call *call) { return false; } + +static inline void trace_kprobe_error_injection_control(struct trace_event_call *call, + bool enabled) { } #endif /* CONFIG_KPROBE_EVENTS */ struct probe_arg { -- 2.45.2
Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
From: Mike Rapoport (IBM) On Thu, 13 Jun 2024 11:55:06 -0400, Steven Rostedt wrote: > Reserve unspecified location of physical memory from kernel command line > > Background: > > In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract > dmesg output and some other information when a crash happens in the field. > (This is only done when the user selects "Allow Google to collect data for > improving the system"). But there are cases when there's a bug that > requires more data to be retrieved to figure out what is happening. We would > like to increase the pstore size, either temporarily, or maybe even > permanently. The pstore on these devices are at a fixed location in RAM (as > the RAM is not cleared on soft reboots nor crashes). The location is chosen > by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. > There's a driver that queries for this to initialize the pstore for > ChromeOS: > > [...] Applied to for-next branch of memblock.git tree, thanks! [1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332 [2/2] pstore/ramoops: Add ramoops.mem_name= command line option commit: d9d814eebb1ae9742e7fd7f39730653b16326bd4 tree: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock branch: for-next -- Sincerely yours, Mike.
Re: [PATCH v2] KUnit: add missing MODULE_DESCRIPTION() macros for lib/test_*.ko
On Wed, Jun 19, 2024 at 01:59:15PM -0700, Jeff Johnson wrote: > make allmodconfig && make W=1 C=1 reports for lib/test_*.ko: > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hexdump.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_dhry.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_firmware.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_sysctl.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hash.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ida.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_list_sort.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_min_heap.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_module.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_sort.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_static_keys.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_static_key_base.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_memcat_p.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_blackhole_dev.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_meminit.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_free_pages.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_kprobes.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ref_tracker.o > WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_bits.o > > Add the missing invocations of the MODULE_DESCRIPTION() macro. > > Signed-off-by: Jeff Johnson Thanks for chasing these down! Reviewed-by: Kees Cook -- Kees Cook
Re: [PATCH] filemap: add trace events for get_pages, map_pages, and fault
On Tue, Jun 18, 2024 at 09:36:56AM +, Takaya Saeki wrote: > + TP_printk( > + "dev %d:%d ino %lx ofs=%lu max_ofs=%lu", It seems weird to have a space between dev and %d, but an equals between ofs and %lu. I see there is some precedent for this elsewhere, but there are other places which use ino=. I'd rather: "dev=%d:%d ino=%lx ofs=%lu max_ofs=%lu", > + TP_printk( > + "dev %d:%d ino %lx ofs=%lu", Likewise. > + MAJOR(__entry->s_dev), > + MINOR(__entry->s_dev), __entry->i_ino, > + __entry->index << PAGE_SHIFT This needs to be cast to an loff_t before shifting.
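A small standalone C demonstration of the overflow behind this review comment, assuming a 32-bit page index (as pgoff_t effectively is on 32-bit kernels) and a PAGE_SHIFT of 12; both are stand-ins for the kernel types:

```c
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12  /* 4 KiB pages assumed */

/* buggy: the shift is performed in 32-bit arithmetic and wraps
 * before the result is widened to 64 bits */
static int64_t byte_offset_buggy(uint32_t index)
{
    return index << MOCK_PAGE_SHIFT;
}

/* fixed: widen to a 64-bit type (loff_t in the kernel) first,
 * then shift */
static int64_t byte_offset_fixed(uint32_t index)
{
    return (int64_t)index << MOCK_PAGE_SHIFT;
}
```

Any file offset at or beyond 4 GiB silently wraps in the buggy version, which is why the trace event needs the cast.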
[PATCH 7/7] ARM: dts: qcom: msm8226: Convert APCS usages to mbox interface
Since we now have the apcs set up as a mailbox provider, let's use the interface for all drivers where possible. Signed-off-by: Luca Weiss --- arch/arm/boot/dts/qcom/qcom-msm8226.dtsi | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi index 9deee34fc5ca..5c1122f93054 100644 --- a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi +++ b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi @@ -157,7 +157,7 @@ master-stats { smd-edge { interrupts = ; - qcom,ipc = <&apcs 8 0>; + mboxes = <&apcs 0>; qcom,smd-edge = <15>; rpm_requests: rpm-requests { @@ -235,7 +235,7 @@ smp2p-adsp { interrupt-parent = <&intc>; interrupts = ; - qcom,ipc = <&apcs 8 10>; + mboxes = <&apcs 10>; qcom,local-pid = <0>; qcom,remote-pid = <2>; @@ -1232,7 +1232,7 @@ adsp: remoteproc@fe20 { smd-edge { interrupts = ; - qcom,ipc = <&apcs 8 8>; + mboxes = <&apcs 8>; qcom,smd-edge = <1>; label = "lpass"; -- 2.45.2
[PATCH 5/7] ARM: dts: qcom: msm8226: Add CPU frequency scaling support
Add a node for the a7pll with its frequencies. With this we can use the apcs-kpss-global driver for the apcs node and use the apcs to scale the CPU frequency according to the opp-table. At the same time unfortunately we need to provide the gcc node xo_board instead of the XO via rpmcc since otherwise we'll have a circular dependency between apcs, gcc and the rpm. Signed-off-by: Luca Weiss --- arch/arm/boot/dts/qcom/qcom-msm8226.dtsi | 103 ++- 1 file changed, 100 insertions(+), 3 deletions(-) diff --git a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi index 270973e85625..6e9fbe2e7223 100644 --- a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi +++ b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi @@ -44,6 +44,8 @@ CPU0: cpu@0 { device_type = "cpu"; reg = <0>; next-level-cache = <&L2>; + clocks = <&apcs>; + operating-points-v2 = <&cpu_opp_table>; qcom,acc = <&acc0>; qcom,saw = <&saw0>; }; @@ -54,6 +56,8 @@ CPU1: cpu@1 { device_type = "cpu"; reg = <1>; next-level-cache = <&L2>; + clocks = <&apcs>; + operating-points-v2 = <&cpu_opp_table>; qcom,acc = <&acc1>; qcom,saw = <&saw1>; }; @@ -64,6 +68,8 @@ CPU2: cpu@2 { device_type = "cpu"; reg = <2>; next-level-cache = <&L2>; + clocks = <&apcs>; + operating-points-v2 = <&cpu_opp_table>; qcom,acc = <&acc2>; qcom,saw = <&saw2>; }; @@ -74,6 +80,8 @@ CPU3: cpu@3 { device_type = "cpu"; reg = <3>; next-level-cache = <&L2>; + clocks = <&apcs>; + operating-points-v2 = <&cpu_opp_table>; qcom,acc = <&acc3>; qcom,saw = <&saw3>; }; @@ -98,6 +106,29 @@ memory@0 { reg = <0x0 0x0>; }; + cpu_opp_table: opp-table-cpu { + compatible = "operating-points-v2"; + opp-shared; + + opp-3 { + opp-hz = /bits/ 64 <3>; + }; + + opp-38400 { + opp-hz = /bits/ 64 <38400>; + }; + + opp-6 { + opp-hz = /bits/ 64 <6>; + }; + + opp-78720 { + opp-hz = /bits/ 64 <78720>; + }; + + /* Higher CPU frequencies need speedbin support */ + }; + pmu { compatible = "arm,cortex-a7-pmu"; interrupts = ; }; - apcs: syscon@f9011000 { - compatible = 
"syscon"; + apcs: mailbox@f9011000 { + compatible = "qcom,msm8226-apcs-kpss-global", +"qcom,msm8916-apcs-kpss-global", "syscon"; reg = <0xf9011000 0x1000>; + #mbox-cells = <1>; + clocks = <&a7pll>, <&gcc GPLL0_VOTE>; + clock-names = "pll", "aux"; + #clock-cells = <0>; + }; + + a7pll: clock@f9016000 { + compatible = "qcom,msm8226-a7pll"; + reg = <0xf9016000 0x40>; + #clock-cells = <0>; + clocks = <&xo_board>; + clock-names = "xo"; + operating-points-v2 = <&a7pll_opp_table>; + + a7pll_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-76800 { + opp-hz = /bits/ 64 <76800>; + }; + + opp-78720 { + opp-hz = /bits/ 64 <78720>; + }; + + opp-99840 { + opp-hz = /bits/ 64 <99840>; + }; + + opp-109440 { + opp-hz = /bits/ 64 <109440>; + }; + + opp-119040 { + opp-hz = /bits/ 64 <119040>; + }; + + opp-130560 { + opp-
[PATCH 6/7] ARM: dts: qcom: msm8226: Hook up CPU cooling
Add cooling-maps for the CPU thermal zones so the driver can actually do something when the CPU temperature rises too much. Signed-off-by: Luca Weiss --- arch/arm/boot/dts/qcom/qcom-msm8226.dtsi | 25 + 1 file changed, 25 insertions(+) diff --git a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi index 6e9fbe2e7223..9deee34fc5ca 100644 --- a/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi +++ b/arch/arm/boot/dts/qcom/qcom-msm8226.dtsi @@ -12,6 +12,7 @@ #include #include #include +#include / { #address-cells = <1>; @@ -48,6 +49,7 @@ CPU0: cpu@0 { operating-points-v2 = <&cpu_opp_table>; qcom,acc = <&acc0>; qcom,saw = <&saw0>; + #cooling-cells = <2>; }; CPU1: cpu@1 { @@ -60,6 +62,7 @@ CPU1: cpu@1 { operating-points-v2 = <&cpu_opp_table>; qcom,acc = <&acc1>; qcom,saw = <&saw1>; + #cooling-cells = <2>; }; CPU2: cpu@2 { @@ -72,6 +75,7 @@ CPU2: cpu@2 { operating-points-v2 = <&cpu_opp_table>; qcom,acc = <&acc2>; qcom,saw = <&saw2>; + #cooling-cells = <2>; }; CPU3: cpu@3 { @@ -84,6 +88,7 @@ CPU3: cpu@3 { operating-points-v2 = <&cpu_opp_table>; qcom,acc = <&acc3>; qcom,saw = <&saw3>; + #cooling-cells = <2>; }; L2: l2-cache { @@ -1256,6 +1261,16 @@ cpu0-thermal { thermal-sensors = <&tsens 5>; + cooling-maps { + map0 { + trip = <&cpu_alert0>; + cooling-device = <&CPU0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>, +<&CPU1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>, +<&CPU2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>, +<&CPU3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>; + }; + }; + trips { cpu_alert0: trip0 { temperature = <75000>; @@ -1277,6 +1292,16 @@ cpu1-thermal { thermal-sensors = <&tsens 2>; + cooling-maps { + map0 { + trip = <&cpu_alert1>; + cooling-device = <&CPU0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>, +<&CPU1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>, +<&CPU2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>, +<&CPU3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>; + }; + }; + trips { cpu_alert1: trip0 { temperature = <75000>; -- 2.45.2
[PATCH 0/7] Add CPU frequency scaling support for MSM8226
Apart from a bunch of bindings updates, add support for the a7pll found on the SoC and wire up everything in the dtsi. And finally, switch over to using the apcs via the mbox interface to stop using the apcs via syscon. Only a limited set of CPU frequencies is supported for now; higher ones require speedbin support, which I plan to work on after this lands. Signed-off-by: Luca Weiss --- Luca Weiss (7): dt-bindings: mailbox: qcom: add compatible for MSM8226 SoC dt-bindings: clock: qcom,a53pll: Allow opp-table subnode dt-bindings: clock: qcom,a53pll: Add msm8226-a7pll compatible clk: qcom: a53-pll: Add MSM8226 a7pll support ARM: dts: qcom: msm8226: Add CPU frequency scaling support ARM: dts: qcom: msm8226: Hook up CPU cooling ARM: dts: qcom: msm8226: Convert APCS usages to mbox interface .../devicetree/bindings/clock/qcom,a53pll.yaml | 4 + .../bindings/mailbox/qcom,apcs-kpss-global.yaml| 1 + arch/arm/boot/dts/qcom/qcom-msm8226.dtsi | 134 - drivers/clk/qcom/a53-pll.c | 1 + 4 files changed, 134 insertions(+), 6 deletions(-) --- base-commit: 0efa3123a1658dbafdace0bfcdcc4f34eebc7f9f change-id: 20240619-msm8226-cpufreq-788b0bf0256a Best regards, -- Luca Weiss
[PATCH 4/7] clk: qcom: a53-pll: Add MSM8226 a7pll support
The MSM8226 has one PLL for its Cortex-A7 cores. The frequencies will be specified in devicetree. Signed-off-by: Luca Weiss --- drivers/clk/qcom/a53-pll.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/clk/qcom/a53-pll.c b/drivers/clk/qcom/a53-pll.c index f9c5e296dba2..f43d455ab4b8 100644 --- a/drivers/clk/qcom/a53-pll.c +++ b/drivers/clk/qcom/a53-pll.c @@ -151,6 +151,7 @@ static int qcom_a53pll_probe(struct platform_device *pdev) } static const struct of_device_id qcom_a53pll_match_table[] = { + { .compatible = "qcom,msm8226-a7pll" }, { .compatible = "qcom,msm8916-a53pll" }, { .compatible = "qcom,msm8939-a53pll" }, { } -- 2.45.2
[PATCH 3/7] dt-bindings: clock: qcom,a53pll: Add msm8226-a7pll compatible
Add the compatible for the A7PLL found in MSM8226 SoCs. Signed-off-by: Luca Weiss --- Documentation/devicetree/bindings/clock/qcom,a53pll.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml index 8cd73a623ef5..47ceab641a4c 100644 --- a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml +++ b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml @@ -21,6 +21,7 @@ properties: - qcom,ipq6018-a53pll - qcom,ipq8074-a53pll - qcom,ipq9574-a73pll + - qcom,msm8226-a7pll - qcom,msm8916-a53pll - qcom,msm8939-a53pll -- 2.45.2
[PATCH 2/7] dt-bindings: clock: qcom,a53pll: Allow opp-table subnode
Allow placing an opp-table as a subnode that can be assigned using operating-points-v2 to specify the frequency table for the PLL. Signed-off-by: Luca Weiss --- Documentation/devicetree/bindings/clock/qcom,a53pll.yaml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml index 5ca927a8b1d5..8cd73a623ef5 100644 --- a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml +++ b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml @@ -40,6 +40,9 @@ properties: operating-points-v2: true + opp-table: +type: object + required: - compatible - reg -- 2.45.2
[PATCH 1/7] dt-bindings: mailbox: qcom: add compatible for MSM8226 SoC
Add the mailbox compatible for MSM8226 SoC. Signed-off-by: Luca Weiss --- Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml b/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml index 982c741e6225..dc75ea2383f1 100644 --- a/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml +++ b/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml @@ -25,6 +25,7 @@ properties: - const: qcom,ipq6018-apcs-apps-global - items: - enum: + - qcom,msm8226-apcs-kpss-global - qcom,qcs404-apcs-apps-global - const: qcom,msm8916-apcs-kpss-global - const: syscon -- 2.45.2
[PATCH v2] KUnit: add missing MODULE_DESCRIPTION() macros for lib/test_*.ko
make allmodconfig && make W=1 C=1 reports for lib/test_*.ko: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hexdump.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_dhry.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_firmware.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_sysctl.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_hash.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ida.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_list_sort.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_min_heap.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_module.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_sort.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_static_keys.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_static_key_base.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_memcat_p.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_blackhole_dev.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_meminit.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_free_pages.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_kprobes.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_ref_tracker.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_bits.o Add the missing invocations of the MODULE_DESCRIPTION() macro. 
Signed-off-by: Jeff Johnson --- Changes in v2: - Updated the description of lib/test_kprobes.c per Masami - Removed references to lib/test_user_copy since that is being renamed in https://lore.kernel.org/all/20240612195921.2685842-2-k...@kernel.org/ and the description added in https://lore.kernel.org/all/20240619202659.work.532-k...@kernel.org/ - Link to v1: https://lore.kernel.org/r/20240601-md-lib-test-v1-1-a728620e3...@quicinc.com --- lib/dhry_run.c | 1 + lib/test-kstrtox.c | 1 + lib/test_bits.c| 1 + lib/test_blackhole_dev.c | 1 + lib/test_firmware.c| 1 + lib/test_free_pages.c | 1 + lib/test_hash.c| 1 + lib/test_hexdump.c | 1 + lib/test_ida.c | 1 + lib/test_kprobes.c | 3 ++- lib/test_list_sort.c | 1 + lib/test_memcat_p.c| 1 + lib/test_meminit.c | 1 + lib/test_min_heap.c| 1 + lib/test_module.c | 1 + lib/test_ref_tracker.c | 3 ++- lib/test_sort.c| 1 + lib/test_static_key_base.c | 1 + lib/test_static_keys.c | 1 + lib/test_sysctl.c | 1 + 20 files changed, 22 insertions(+), 2 deletions(-) diff --git a/lib/dhry_run.c b/lib/dhry_run.c index e6a279dabf84..4a6d05ce4361 100644 --- a/lib/dhry_run.c +++ b/lib/dhry_run.c @@ -83,4 +83,5 @@ static int __init dhry_init(void) module_init(dhry_init); MODULE_AUTHOR("Geert Uytterhoeven "); +MODULE_DESCRIPTION("Dhrystone benchmark test module"); MODULE_LICENSE("GPL"); diff --git a/lib/test-kstrtox.c b/lib/test-kstrtox.c index f355f67169b6..ee87fef66cb5 100644 --- a/lib/test-kstrtox.c +++ b/lib/test-kstrtox.c @@ -732,4 +732,5 @@ static int __init test_kstrtox_init(void) return -EINVAL; } module_init(test_kstrtox_init); +MODULE_DESCRIPTION("Module test for kstrto*() APIs"); MODULE_LICENSE("Dual BSD/GPL"); diff --git a/lib/test_bits.c b/lib/test_bits.c index c9368a2314e7..01313980f175 100644 --- a/lib/test_bits.c +++ b/lib/test_bits.c @@ -72,4 +72,5 @@ static struct kunit_suite bits_test_suite = { }; kunit_test_suite(bits_test_suite); +MODULE_DESCRIPTION("Test cases for functions and macros in bits.h"); MODULE_LICENSE("GPL"); diff 
--git a/lib/test_blackhole_dev.c b/lib/test_blackhole_dev.c index f247089d63c0..ec290ac2a0d9 100644 --- a/lib/test_blackhole_dev.c +++ b/lib/test_blackhole_dev.c @@ -96,4 +96,5 @@ module_init(test_blackholedev_init); module_exit(test_blackholedev_exit); MODULE_AUTHOR("Mahesh Bandewar "); +MODULE_DESCRIPTION("module test of the blackhole_dev"); MODULE_LICENSE("GPL"); diff --git a/lib/test_firmware.c b/lib/test_firmware.c index 9cfdcd6d21db..bcb32cbff188 100644 --- a/lib/test_firmware.c +++ b/lib/test_firmware.c @@ -1567,4 +1567,5 @@ static void __exit test_firmware_exit(void) module_exit(test_firmware_exit); MODULE_AUTHOR("Kees Cook "); +MODULE_DESCRIPTION("interface to trigger and test firmware loading"); MODULE_LICENSE("GPL"); diff --git a/lib/test_free_pages.c b/lib/test_free_pages.c index 9ebf6f5549f3..48952364c540 100644 --- a/lib/test_free_pages.c +++ b/lib/test_free_pages.c @@ -44,4 +44,5 @@ static void m_ex(void) module_init(m_in); module_exit(m_ex); MODULE_AUTHOR("Matthew Wilcox "); +MODULE_DESCRIPTION("Check that free_pages() doesn't leak memory"); MODULE_LICENSE("GPL"); diff --git a/lib/test_hash.c b/lib/test_hash.c index bb25fda34794..a7af39662a0a 100644 --- a/lib/test_hash.c +++ b/lib/test_hash.c @@ -235,4 +235,5 @@ static struct kunit_suite hash_test_suite = { kunit_test_suite(hash_test_suite);
Re: [PATCH 0/5] Use mboxes in smsm node for all dtsi where possible
On 6/19/24 18:42, Luca Weiss wrote: With the binding and driver patches queued for 6.11[0][1], it's time to update the dtsi files to use the new binding. [0] https://lore.kernel.org/linux-arm-msm/171840533352.102487.9576671528001022676.b4...@kernel.org/ [1] https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux.git/log/?h=drivers-for-6.11 @Bjorn: I think this makes sense to only apply these patches for 6.12 so that also in the arm64 tree the driver will exist already, so git bisect is not impeded by that. Patches are just compile-tested. Signed-off-by: Luca Weiss --- Reviewed-by: Konrad Dybcio Konrad
Re: [RFC bpf-next 01/10] uprobe: Add session callbacks to uprobe_consumer
On Mon, Jun 17, 2024 at 03:53:50PM -0700, Andrii Nakryiko wrote: > On Mon, Jun 10, 2024 at 4:06 AM Jiri Olsa wrote: > > > > On Thu, Jun 06, 2024 at 09:52:39AM -0700, Andrii Nakryiko wrote: > > > On Thu, Jun 6, 2024 at 9:46 AM Jiri Olsa wrote: > > > > > > > > On Wed, Jun 05, 2024 at 10:50:11PM +0200, Jiri Olsa wrote: > > > > > On Wed, Jun 05, 2024 at 07:56:19PM +0200, Oleg Nesterov wrote: > > > > > > On 06/05, Andrii Nakryiko wrote: > > > > > > > > > > > > > > so any such > > > > > > > limitations will cause problems, issue reports, investigation, > > > > > > > etc. > > > > > > > > > > > > Agreed... > > > > > > > > > > > > > As one possible solution, what if we do > > > > > > > > > > > > > > struct return_instance { > > > > > > > ... > > > > > > > u64 session_cookies[]; > > > > > > > }; > > > > > > > > > > > > > > and allocate sizeof(struct return_instance) + 8 * > > > > > > > and then at runtime pass > > > > > > > &session_cookies[i] as data pointer to session-aware callbacks? > > > > > > > > > > > > I too thought about this, but I guess it is not that simple. > > > > > > > > > > > > Just for example. Suppose we have 2 session-consumers C1 and C2. > > > > > > What if uprobe_unregister(C1) comes before the probed function > > > > > > returns? > > > > > > > > > > > > We need something like map_cookie_to_consumer(). > > > > > > > > > > I guess we could have hash table in return_instance that gets > > > > > 'consumer -> cookie' ? > > > > > > > > ok, hash table is probably too big for this.. I guess some solution that > > > > would iterate consumers and cookies made sure it matches would be fine > > > > > > > > > > Yes, I was hoping to avoid hash tables for this, and in the common > > > case have no added overhead. > > > > hi, > > here's first stab on that.. 
the change below: > > - extends current handlers with extra argument rather than adding new > > set of handlers > > - store session consumers objects within return_instance object and > > - iterate these objects ^^^ in handle_uretprobe_chain > > > > I guess it could be still polished, but I wonder if this could > > be the right direction to do this.. thoughts? ;-) > > Yeah, I think this is the right direction. It's a bit sad that this > makes getting rid of rw_sem on hot path even harder, but that's a > separate problem. > > > > > thanks, > > jirka > > > > > > --- > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > index f46e0ca0169c..4e40e8352eac 100644 > > --- a/include/linux/uprobes.h > > +++ b/include/linux/uprobes.h > > @@ -34,15 +34,19 @@ enum uprobe_filter_ctx { > > }; > > > > struct uprobe_consumer { > > - int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); > > + int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs, > > + unsigned long *data); > > can we use __u64 here? This long vs __u64 might cause problems for BPF > when the host is 32-bit architecture (BPF is always 64-bit). ok > > > int (*ret_handler)(struct uprobe_consumer *self, > > unsigned long func, > > - struct pt_regs *regs); > > + struct pt_regs *regs, > > + unsigned long *data); > > bool (*filter)(struct uprobe_consumer *self, > > enum uprobe_filter_ctx ctx, > > struct mm_struct *mm); > > > > [...] 
> > > static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask) > > { > > struct uprobe_task *n_utask; > > @@ -1756,11 +1795,11 @@ static int dup_utask(struct task_struct *t, struct > > uprobe_task *o_utask) > > > > p = &n_utask->return_instances; > > for (o = o_utask->return_instances; o; o = o->next) { > > - n = kmalloc(sizeof(struct return_instance), GFP_KERNEL); > > + n = alloc_return_instance(o->session_cnt); > > if (!n) > > return -ENOMEM; > > > > - *n = *o; > > + memcpy(n, o, ri_size(o->session_cnt)); > > get_uprobe(n->uprobe); > > n->next = NULL; > > > > @@ -1853,35 +1892,38 @@ static void cleanup_return_instances(struct > > uprobe_task *utask, bool chained, > > utask->return_instances = ri; > > } > > > > -static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs) > > +static struct return_instance * > > +prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs, > > + struct return_instance *ri, int session_cnt) > > you have struct uprobe, why do you need to pass session_cnt? Also, > given return_instance is cached, it seems more natural to have > > struct return_instance **ri as in/out parameter, and keep the function > itself as void I tried that, but now I think it'd be better if we just let prepare_uretprobe to allocate (if needed) and free ri in case i
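The flexible-array scheme discussed in this thread (sizeof(struct return_instance) plus one 8-byte cookie slot per session consumer) can be sketched as follows. This is a simplified mock: the struct layout and the ri_size()/alloc_return_instance() helper names only mirror the proposal, not the real kernel return_instance.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct return_instance_mock {
    int session_cnt;
    uint64_t session_cookies[];  /* one cookie slot per session consumer */
};

/* total allocation size for a given number of session consumers */
static size_t ri_size(int session_cnt)
{
    return sizeof(struct return_instance_mock) +
           sizeof(uint64_t) * session_cnt;
}

static struct return_instance_mock *alloc_return_instance(int session_cnt)
{
    struct return_instance_mock *ri = calloc(1, ri_size(session_cnt));

    if (ri)
        ri->session_cnt = session_cnt;
    return ri;
}

/* duplicating an instance, as dup_utask() does above, must copy the
 * whole variable-sized object: memcpy(n, o, ri_size(o->session_cnt))
 * rather than a fixed-size struct assignment */
static struct return_instance_mock *
dup_return_instance(const struct return_instance_mock *o)
{
    struct return_instance_mock *n = malloc(ri_size(o->session_cnt));

    if (n)
        memcpy(n, o, ri_size(o->session_cnt));
    return n;
}
```

This also shows why the plain `*n = *o` in the original dup_utask() had to become a memcpy() of ri_size() bytes: struct assignment would silently drop the trailing cookie array.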
Re: [PATCH] filemap: add trace events for get_pages, map_pages, and fault
Thanks Steven for trying it out. > I can see it bringing down the number of pages needed to be saved > dramatically. Yes, I agree. However, note that wc does not count the size of the page caches correctly since 'get_map_pages' gives you a range. In your example of the less command, actually the total sizes of page_cache_size and get_map_pages look the same. Instead of less, running a large binary such as Chrome (232MB) gives us a better example. '# trace-cmd record -e filemap chrome --version' showed 58% reduction from 42MB to 17MB in my environment. - Total size of mm_filemap_add_to_page_cache: 42,958,848 bytes - With mm_filemap_map_pages and mm_filemap_get_pages: 17,993,728 bytes By the way, 'mm_filemap_map_pages' traces a range requested to populate, which includes pages that are not in caches yet and thus are skipped. So, you need to calculate the intersection with mm_filemap_add_to_page_cache to see page caches that are actually mapped. I'm wondering if we should put a trace event per each successful page mapping event as mm_filemap_add_to_page_cache does, by putting an event inside the page map loop.
Re: [PATCH vhost 11/23] vdpa/mlx5: Set an initial size on the VQ
On Wed, 2024-06-19 at 17:08 +0200, Eugenio Perez Martin wrote: > On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > > > The virtqueue size is a pre-requisite for setting up any virtqueue > > resources. For the upcoming optimization of creating virtqueues at > > device add, the virtqueue size has to be configured. > > > > Store the default queue size in struct mlx5_vdpa_net to make it easy in > > the future to pre-configure this default value via vdpa tool. > > > > The queue size check in setup_vq() will always be false. So remove it. > > > > Signed-off-by: Dragos Tatulea > > Reviewed-by: Cosmin Ratiu > > --- > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 --- > > drivers/vdpa/mlx5/net/mlx5_vnet.h | 1 + > > 2 files changed, 5 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > index 245b5dac98d3..1181e0ac3671 100644 > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > @@ -58,6 +58,8 @@ MODULE_LICENSE("Dual BSD/GPL"); > > */ > > #define MLX5V_DEFAULT_VQ_COUNT 2 > > > > +#define MLX5V_DEFAULT_VQ_SIZE 256 > > + > > struct mlx5_vdpa_cq_buf { > > struct mlx5_frag_buf_ctrl fbc; > > struct mlx5_frag_buf frag_buf; > > @@ -1445,9 +1447,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, > > struct mlx5_vdpa_virtqueue *mvq) > > u16 idx = mvq->index; > > int err; > > > > - if (!mvq->num_ent) > > - return 0; > > - > > if (mvq->initialized) > > return 0; > > > > @@ -3523,6 +3522,7 @@ static void init_mvqs(struct mlx5_vdpa_net *ndev) > > mvq->ndev = ndev; > > mvq->fwqp.fw = true; > > mvq->fw_state = MLX5_VIRTIO_NET_Q_OBJECT_NONE; > > + mvq->num_ent = ndev->default_queue_size; > > } > > } > > > > @@ -3660,6 +3660,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev > > *v_mdev, const char *name, > > goto err_alloc; > > } > > ndev->cur_num_vqs = MLX5V_DEFAULT_VQ_COUNT; > > + ndev->default_queue_size = MLX5V_DEFAULT_VQ_SIZE; > > > > init_mvqs(ndev); > > 
allocate_irqs(ndev); > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.h > > b/drivers/vdpa/mlx5/net/mlx5_vnet.h > > index 90b556a57971..2ada29767cc5 100644 > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.h > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.h > > @@ -58,6 +58,7 @@ struct mlx5_vdpa_net { > > bool setup; > > u32 cur_num_vqs; > > u32 rqt_size; > > + u16 default_queue_size; > > It seems to me this is only assigned here and not used in the rest of > the series, why allocate a member here instead of using macro > directly? It is used in init_mvqs(). I wanted to make it easy in case we add the default queue size to the vdpa tool. I'm also ok with switching to constants for now. Thank you for your reviews! Thanks, Dragos > > > bool nb_registered; > > struct notifier_block nb; > > struct vdpa_callback config_cb; > > > > -- > > 2.45.1 > > >
Re: [PATCH vhost 08/23] vdpa/mlx5: Clear and reinitialize software VQ data on reset
On Wed, 2024-06-19 at 13:28 +0200, Eugenio Perez Martin wrote: > On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea wrote: > > > > The hardware VQ configuration is mirrored by data in struct > > mlx5_vdpa_virtqueue . Instead of clearing just a few fields at reset, > > fully clear the struct and initialize with the appropriate default > > values. > > > > As clear_vqs_ready() is used only during reset, get rid of it. > > > > Signed-off-by: Dragos Tatulea > > Reviewed-by: Cosmin Ratiu > > Acked-by: Eugenio Pérez > > > --- > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 16 +++- > > 1 file changed, 3 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > index c8b5c87f001d..de013b5a2815 100644 > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > @@ -2941,18 +2941,6 @@ static void teardown_vq_resources(struct > > mlx5_vdpa_net *ndev) > > ndev->setup = false; > > } > > > > -static void clear_vqs_ready(struct mlx5_vdpa_net *ndev) > > -{ > > - int i; > > - > > - for (i = 0; i < ndev->mvdev.max_vqs; i++) { > > - ndev->vqs[i].ready = false; > > - ndev->vqs[i].modified_fields = 0; > > - } > > - > > - ndev->mvdev.cvq.ready = false; > > -} > > - > > static int setup_cvq_vring(struct mlx5_vdpa_dev *mvdev) > > { > > struct mlx5_control_vq *cvq = &mvdev->cvq; > > @@ -3035,12 +3023,14 @@ static int mlx5_vdpa_compat_reset(struct > > vdpa_device *vdev, u32 flags) > > down_write(&ndev->reslock); > > unregister_link_notifier(ndev); > > teardown_vq_resources(ndev); > > - clear_vqs_ready(ndev); > > + init_mvqs(ndev); > > Nitpick / suggestion if you have to send a v2. The init_mvqs function > name sounds like it can allocate stuff that needs to be released to > me. But I'm very bad at naming :). Maybe something like > "mvqs_set_defaults" or similar? Makes sense. I think I will call it mvqs_reset / reset_mvqs to keep things consistent. 
Thanks, Dragos > > > + > > if (flags & VDPA_RESET_F_CLEAN_MAP) > > mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); > > ndev->mvdev.status = 0; > > ndev->mvdev.suspended = false; > > ndev->cur_num_vqs = MLX5V_DEFAULT_VQ_COUNT; > > + ndev->mvdev.cvq.ready = false; > > ndev->mvdev.cvq.received_desc = 0; > > ndev->mvdev.cvq.completed_desc = 0; > > memset(ndev->event_cbs, 0, sizeof(*ndev->event_cbs) * > > (mvdev->max_vqs + 1)); > > > > -- > > 2.45.1 > > >
[PATCH 0/5] Use mboxes in smsm node for all dtsi where possible
With the binding and driver patches queued for 6.11[0][1], it's time to update the dtsi files to use the new binding. [0] https://lore.kernel.org/linux-arm-msm/171840533352.102487.9576671528001022676.b4...@kernel.org/ [1] https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux.git/log/?h=drivers-for-6.11 @Bjorn: I think it makes sense to apply these patches only for 6.12, so that the driver will already exist in the arm64 tree as well and git bisect is not impeded by that. Patches are just compile-tested. Signed-off-by: Luca Weiss --- Luca Weiss (5): ARM: dts: qcom: msm8974: Use mboxes in smsm node arm64: dts: qcom: msm8916: Use mboxes in smsm node arm64: dts: qcom: msm8939: Use mboxes in smsm node arm64: dts: qcom: msm8953: Use mboxes in smsm node arm64: dts: qcom: msm8976: Use mboxes in smsm node arch/arm/boot/dts/qcom/qcom-msm8974.dtsi | 4 +--- arch/arm64/boot/dts/qcom/msm8916.dtsi | 3 +-- arch/arm64/boot/dts/qcom/msm8939.dtsi | 3 +-- arch/arm64/boot/dts/qcom/msm8953.dtsi | 3 +-- arch/arm64/boot/dts/qcom/msm8976.dtsi | 4 +--- 5 files changed, 5 insertions(+), 12 deletions(-) --- base-commit: 2102cb0d050d34d50b9642a3a50861787527e922 change-id: 20240619-smsm-mbox-dts-3ee9daadf81b Best regards, -- Luca Weiss
[PATCH 1/5] ARM: dts: qcom: msm8974: Use mboxes in smsm node
With the smsm bindings and driver finally supporting mboxes, switch to that and stop using apcs as syscon. Signed-off-by: Luca Weiss --- arch/arm/boot/dts/qcom/qcom-msm8974.dtsi | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi b/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi index 1bea3cef4ba7..18b6a7c77996 100644 --- a/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi +++ b/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi @@ -294,9 +294,7 @@ smsm { #address-cells = <1>; #size-cells = <0>; - qcom,ipc-1 = <&apcs 8 13>; - qcom,ipc-2 = <&apcs 8 9>; - qcom,ipc-3 = <&apcs 8 19>; + mboxes = <0>, <&apcs 13>, <&apcs 9>, <&apcs 19>; apps_smsm: apps@0 { reg = <0>; -- 2.45.2
[PATCH 2/5] arm64: dts: qcom: msm8916: Use mboxes in smsm node
With the smsm bindings and driver finally supporting mboxes, switch to that and stop using apcs as syscon. Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8916.dtsi | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/boot/dts/qcom/msm8916.dtsi b/arch/arm64/boot/dts/qcom/msm8916.dtsi index bdedbcdc36d3..7383bcc603ab 100644 --- a/arch/arm64/boot/dts/qcom/msm8916.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8916.dtsi @@ -410,8 +410,7 @@ smsm { #address-cells = <1>; #size-cells = <0>; - qcom,ipc-1 = <&apcs 8 13>; - qcom,ipc-3 = <&apcs 8 19>; + mboxes = <0>, <&apcs 13>, <0>, <&apcs 19>; apps_smsm: apps@0 { reg = <0>; -- 2.45.2
[PATCH 3/5] arm64: dts: qcom: msm8939: Use mboxes in smsm node
With the smsm bindings and driver finally supporting mboxes, switch to that and stop using apcs as syscon. Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8939.dtsi | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/boot/dts/qcom/msm8939.dtsi b/arch/arm64/boot/dts/qcom/msm8939.dtsi index e309ef909ea7..46d9480cd464 100644 --- a/arch/arm64/boot/dts/qcom/msm8939.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8939.dtsi @@ -443,8 +443,7 @@ smsm { #address-cells = <1>; #size-cells = <0>; - qcom,ipc-1 = <&apcs1_mbox 8 13>; - qcom,ipc-3 = <&apcs1_mbox 8 19>; + mboxes = <0>, <&apcs1_mbox 13>, <0>, <&apcs1_mbox 19>; apps_smsm: apps@0 { reg = <0>; -- 2.45.2
[PATCH 4/5] arm64: dts: qcom: msm8953: Use mboxes in smsm node
With the smsm bindings and driver finally supporting mboxes, switch to that and stop using apcs as syscon. Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8953.dtsi | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi b/arch/arm64/boot/dts/qcom/msm8953.dtsi index 1b61a63710a6..a4bfb624fb8a 100644 --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi @@ -411,8 +411,7 @@ smsm { #address-cells = <1>; #size-cells = <0>; - qcom,ipc-1 = <&apcs 8 13>; - qcom,ipc-3 = <&apcs 8 19>; + mboxes = <0>, <&apcs 13>, <0>, <&apcs 19>; apps_smsm: apps@0 { reg = <0>; -- 2.45.2
[PATCH 5/5] arm64: dts: qcom: msm8976: Use mboxes in smsm node
With the smsm bindings and driver finally supporting mboxes, switch to that and stop using apcs as syscon. Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8976.dtsi | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi b/arch/arm64/boot/dts/qcom/msm8976.dtsi index e299d42c5d98..d62dcb76fa48 100644 --- a/arch/arm64/boot/dts/qcom/msm8976.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi @@ -439,9 +439,7 @@ smsm { #address-cells = <1>; #size-cells = <0>; - qcom,ipc-1 = <&apcs 8 13>; - qcom,ipc-2 = <&apcs 8 9>; - qcom,ipc-3 = <&apcs 8 19>; + mboxes = <0>, <&apcs 13>, <&apcs 9>, <&apcs 19>; apps_smsm: apps@0 { reg = <0>; -- 2.45.2
Re: [PATCH vhost 23/23] vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready()
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > VQ indices in the range [cur_num_qps, max_vqs) represent queues that > have not yet been activated. .set_vq_ready should not activate these > VQs. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 1a5ee0d2b47f..a969a7f105a6 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -1575,6 +1575,9 @@ static int resume_vq(struct mlx5_vdpa_net *ndev, struct > mlx5_vdpa_virtqueue *mvq > if (!mvq->initialized) > return 0; > > + if (mvq->index >= ndev->cur_num_vqs) > + return 0; > + > switch (mvq->fw_state) { > case MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT: > /* Due to a FW quirk we need to modify the VQ fields first > then change state. > > -- > 2.45.1 >
Re: [PATCH v7 2/5] remoteproc: Add TEE support
Hi, On 6/19/24 17:41, Mathieu Poirier wrote: > Hi, > > On Tue, Jun 11, 2024 at 09:39:01AM +0200, Arnaud Pouliquen wrote: >> Add a remoteproc TEE (Trusted Execution Environment) driver >> that will be probed by the TEE bus. If the associated Trusted >> application is supported on secure part this driver offers a client >> interface to load a firmware in the secure part. >> This firmware could be authenticated by the secure trusted application. >> >> Signed-off-by: Arnaud Pouliquen >> --- >> update from V >> - Fix missing "{" in tee_rproc_find_loaded_rsc_table inline definition. >> >> update from V5 >> - make tee_rproc_get_loaded_rsc_table() local and replace this API by >> tee_rproc_find_loaded_rsc_table() >> - map and unmap the resource table in tee_rproc_parse_fw to make a cached >> copy >> - use the new rproc_pa_to_va() API to map the resource table memory declared >> in carevout >> - remove tee_rproc_release_loaded_rsc_table as no more used. >> --- >> drivers/remoteproc/Kconfig | 10 + >> drivers/remoteproc/Makefile | 1 + >> drivers/remoteproc/tee_remoteproc.c | 451 >> include/linux/remoteproc.h | 4 + >> include/linux/tee_remoteproc.h | 100 ++ >> 5 files changed, 566 insertions(+) >> create mode 100644 drivers/remoteproc/tee_remoteproc.c >> create mode 100644 include/linux/tee_remoteproc.h >> >> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig >> index 48845dc8fa85..6c1c07202276 100644 >> --- a/drivers/remoteproc/Kconfig >> +++ b/drivers/remoteproc/Kconfig >> @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC >> >>It's safe to say N if not interested in using RPU r5f cores. >> >> + >> +config TEE_REMOTEPROC >> +tristate "Remoteproc support by a TEE application" >> +depends on OPTEE >> +help >> + Support a remote processor with a TEE application. The Trusted >> + Execution Context is responsible for loading the trusted firmware >> + image and managing the remote processor's lifecycle. >> + This can be either built-in or a loadable module. 
>> + >> endif # REMOTEPROC >> >> endmenu >> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile >> index 91314a9b43ce..fa8daebce277 100644 >> --- a/drivers/remoteproc/Makefile >> +++ b/drivers/remoteproc/Makefile >> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o >> obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o >> obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o >> obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o >> +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o >> obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o >> obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o >> obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o >> diff --git a/drivers/remoteproc/tee_remoteproc.c >> b/drivers/remoteproc/tee_remoteproc.c >> new file mode 100644 >> index ..9455fd9d0d2d >> --- /dev/null >> +++ b/drivers/remoteproc/tee_remoteproc.c >> @@ -0,0 +1,451 @@ >> +// SPDX-License-Identifier: GPL-2.0-or-later >> +/* >> + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved >> + * Author: Arnaud Pouliquen >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +#include "remoteproc_internal.h" >> + >> +#define MAX_TEE_PARAM_ARRY_MEMBER 4 >> + >> +/* >> + * Authentication of the firmware and load in the remote processor memory >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [in] params[1].memref: buffer containing the image of the >> buffer >> + */ >> +#define TA_RPROC_FW_CMD_LOAD_FW 1 >> + >> +/* >> + * Start the remote processor >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + */ >> +#define TA_RPROC_FW_CMD_START_FW2 >> + >> +/* >> + * Stop the remote processor >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + */ >> +#define TA_RPROC_FW_CMD_STOP_FW 3 >> + >> +/* >> + * Return the address of the resource table, or 0 if not found >> + * No 
check is done to verify that the address returned is accessible by >> + * the non secure context. If the resource table is loaded in a protected >> + * memory the access by the non secure context will lead to a data abort. >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [out] params[1].value.a:32bit LSB resource table memory address >> + * [out] params[1].value.b:32bit MSB resource table memory address >> + * [out] params[2].value.a:32bit LSB resource table memory size >> + * [out] params[2].value.b:32bit MSB resource table memory size >> + */ >> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE 4 >> + >> +/* >> + * Return the address of the core dump >> + * >> + *
Re: [PATCH vhost 22/23] vdpa/mlx5: Don't reset VQs more than necessary
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > The vdpa device can be reset many times in sequence without any > significant state changes in between. Previously this was not a problem: > VQs were torn down only on first reset. But after VQ pre-creation was > introduced, each reset will delete and re-create the hardware VQs and > their associated resources. > > To solve this problem, avoid resetting hardware VQs if the VQs are still > in a blank state. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 30 +++--- > 1 file changed, 27 insertions(+), 3 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index d80d6b47da61..1a5ee0d2b47f 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -3134,18 +3134,41 @@ static void init_group_to_asid_map(struct > mlx5_vdpa_dev *mvdev) > mvdev->group2asid[i] = 0; > } > > +static bool needs_vqs_reset(const struct mlx5_vdpa_dev *mvdev) > +{ > + struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); > + struct mlx5_vdpa_virtqueue *mvq = &ndev->vqs[0]; > + > + if (mvdev->status & VIRTIO_CONFIG_S_DRIVER_OK) > + return true; > + > + if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT) > + return true; > + > + return mvq->modified_fields & ( > + MLX5_VIRTQ_MODIFY_MASK_STATE | > + MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS | > + MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_AVAIL_IDX | > + MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX > + ); > +} > + > static int mlx5_vdpa_compat_reset(struct vdpa_device *vdev, u32 flags) > { > struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); > struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); > + bool vq_reset; > > print_status(mvdev, 0, true); > mlx5_vdpa_info(mvdev, "performing device reset\n"); > > down_write(&ndev->reslock); > unregister_link_notifier(ndev); > - teardown_vq_resources(ndev); > - init_mvqs(ndev); > + vq_reset = 
needs_vqs_reset(mvdev); > + if (vq_reset) { > + teardown_vq_resources(ndev); > + init_mvqs(ndev); > + } > > if (flags & VDPA_RESET_F_CLEAN_MAP) > mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); > @@ -3165,7 +3188,8 @@ static int mlx5_vdpa_compat_reset(struct vdpa_device > *vdev, u32 flags) > if (mlx5_vdpa_create_dma_mr(mvdev)) > mlx5_vdpa_warn(mvdev, "create MR failed\n"); > } > - setup_vq_resources(ndev, false); > + if (vq_reset) > + setup_vq_resources(ndev, false); > up_write(&ndev->reslock); > > return 0; > > -- > 2.45.1 >
Re: [PATCH vhost 21/23] vdpa/mlx5: Re-create HW VQs under certain conditions
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > There are a few conditions under which the hardware VQs need a full > teardown and setup: > > - VQ size changed to something else than default value. Hardware VQ size > modification is not supported. > > - User turns off certain device features: mergeable buffers, checksum > virtio 1.0 compliance. In these cases, the TIR and RQT need to be > re-created. > > Add a needs_teardown configuration variable and set it when detecting > the above scenarios. On next DRIVER_OK, the resources will be torn down > first. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +++ > drivers/vdpa/mlx5/net/mlx5_vnet.h | 1 + > 2 files changed, 16 insertions(+) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index b2836fd3d1dd..d80d6b47da61 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -2390,6 +2390,7 @@ static void mlx5_vdpa_set_vq_num(struct vdpa_device > *vdev, u16 idx, u32 num) > } > > mvq = &ndev->vqs[idx]; > + ndev->needs_teardown = num != mvq->num_ent; > mvq->num_ent = num; > } > > @@ -2800,6 +2801,7 @@ static int mlx5_vdpa_set_driver_features(struct > vdpa_device *vdev, u64 features) > struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); > struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); > u64 old_features = mvdev->actual_features; > + u64 diff_features; > int err; > > print_features(mvdev, features, true); > @@ -2822,6 +2824,14 @@ static int mlx5_vdpa_set_driver_features(struct > vdpa_device *vdev, u64 features) > } > } > > + /* When below features diverge from initial device features, VQs need > a full teardown. 
*/ > +#define NEEDS_TEARDOWN_MASK (BIT_ULL(VIRTIO_NET_F_MRG_RXBUF) | \ > +BIT_ULL(VIRTIO_NET_F_CSUM) | \ > +BIT_ULL(VIRTIO_F_VERSION_1)) > + > + diff_features = mvdev->mlx_features ^ mvdev->actual_features; > + ndev->needs_teardown = !!(diff_features & NEEDS_TEARDOWN_MASK); > + > update_cvq_info(mvdev); > return err; > } > @@ -3038,6 +3048,7 @@ static void teardown_vq_resources(struct mlx5_vdpa_net > *ndev) > destroy_rqt(ndev); > teardown_virtqueues(ndev); > ndev->setup = false; > + ndev->needs_teardown = false; > } > > static int setup_cvq_vring(struct mlx5_vdpa_dev *mvdev) > @@ -3078,6 +3089,10 @@ static void mlx5_vdpa_set_status(struct vdpa_device > *vdev, u8 status) > goto err_setup; > } > register_link_notifier(ndev); > + > + if (ndev->needs_teardown) > + teardown_vq_resources(ndev); > + > if (ndev->setup) { > err = resume_vqs(ndev); > if (err) { > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.h > b/drivers/vdpa/mlx5/net/mlx5_vnet.h > index 2ada29767cc5..da7318f82d2a 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.h > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.h > @@ -56,6 +56,7 @@ struct mlx5_vdpa_net { > struct dentry *rx_dent; > struct dentry *rx_table_dent; > bool setup; > + bool needs_teardown; > u32 cur_num_vqs; > u32 rqt_size; > u16 default_queue_size; > > -- > 2.45.1 >
Re: [PATCH vhost 20/23] vdpa/mlx5: Pre-create hardware VQs at vdpa .dev_add time
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > Currently, hardware VQs are created right when the vdpa device gets into > DRIVER_OK state. That is easier because most of the VQ state is known by > then. > > This patch switches to creating all VQs and their associated resources > at device creation time. The motivation is to reduce the vdpa device > live migration downtime by moving the expensive operation of creating > all the hardware VQs and their associated resources out of downtime on > the destination VM. > > The VQs are now created in a blank state. The VQ configuration will > happen later, on DRIVER_OK. Then the configuration will be applied when > the VQs are moved to the Ready state. > > When .set_vq_ready() is called on a VQ before DRIVER_OK, special care is > needed: now that the VQ is already created a resume_vq() will be > triggered too early when no mr has been configured yet. Skip calling > resume_vq() in this case, let it be handled during DRIVER_OK. > > For virtio-vdpa, the device configuration is done earlier during > .vdpa_dev_add() by vdpa_register_device(). Avoid calling > setup_vq_resources() a second time in that case. > I guess this happens if virtio_vdpa is already loaded, but I cannot see how this is different here. Apart from the IOTLB, what else does it change from the mlx5_vdpa POV? 
> Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 37 - > 1 file changed, 32 insertions(+), 5 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 249b5afbe34a..b2836fd3d1dd 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -2444,7 +2444,7 @@ static void mlx5_vdpa_set_vq_ready(struct vdpa_device > *vdev, u16 idx, bool ready > mvq = &ndev->vqs[idx]; > if (!ready) { > suspend_vq(ndev, mvq); > - } else { > + } else if (mvdev->status & VIRTIO_CONFIG_S_DRIVER_OK) { > if (resume_vq(ndev, mvq)) > ready = false; > } > @@ -3078,10 +3078,18 @@ static void mlx5_vdpa_set_status(struct vdpa_device > *vdev, u8 status) > goto err_setup; > } > register_link_notifier(ndev); > - err = setup_vq_resources(ndev, true); > - if (err) { > - mlx5_vdpa_warn(mvdev, "failed to setup > driver\n"); > - goto err_driver; > + if (ndev->setup) { > + err = resume_vqs(ndev); > + if (err) { > + mlx5_vdpa_warn(mvdev, "failed to > resume VQs\n"); > + goto err_driver; > + } > + } else { > + err = setup_vq_resources(ndev, true); > + if (err) { > + mlx5_vdpa_warn(mvdev, "failed to > setup driver\n"); > + goto err_driver; > + } > } > } else { > mlx5_vdpa_warn(mvdev, "did not expect DRIVER_OK to be > cleared\n"); > @@ -3142,6 +3150,7 @@ static int mlx5_vdpa_compat_reset(struct vdpa_device > *vdev, u32 flags) > if (mlx5_vdpa_create_dma_mr(mvdev)) > mlx5_vdpa_warn(mvdev, "create MR failed\n"); > } > + setup_vq_resources(ndev, false); > up_write(&ndev->reslock); > > return 0; > @@ -3836,8 +3845,21 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev > *v_mdev, const char *name, > goto err_reg; > > mgtdev->ndev = ndev; > + > + /* For virtio-vdpa, the device was set up during device register. 
*/ > + if (ndev->setup) > + return 0; > + > + down_write(&ndev->reslock); > + err = setup_vq_resources(ndev, false); > + up_write(&ndev->reslock); > + if (err) > + goto err_setup_vq_res; > + > return 0; > > +err_setup_vq_res: > + _vdpa_unregister_device(&mvdev->vdev); > err_reg: > destroy_workqueue(mvdev->wq); > err_res2: > @@ -3863,6 +3885,11 @@ static void mlx5_vdpa_dev_del(struct vdpa_mgmt_dev > *v_mdev, struct vdpa_device * > > unregister_link_notifier(ndev); > _vdpa_unregister_device(dev); > + > + down_write(&ndev->reslock); > + teardown_vq_resources(ndev); > + up_write(&ndev->reslock); > + > wq = mvdev->wq; > mvdev->wq = NULL; > destroy_workqueue(wq); > > -- > 2.45.1 >
Re: [PATCH v2] x86/paravirt: Disable virt spinlock on bare metal
On 2024-06-19 at 18:34:34 +0300, Nikolay Borisov wrote: > > > On 19.06.24 г. 18:25 ч., Chen Yu wrote: > > Hi Nikolay, > > > > On 2024-06-18 at 11:24:42 +0300, Nikolay Borisov wrote: > > > > > > > > > On 26.05.24 г. 4:58 ч., Chen Yu wrote: > > > > The kernel can change spinlock behavior when running as a guest. But > > > > this guest-friendly behavior causes performance problems on bare metal. > > > > So there's a 'virt_spin_lock_key' static key to switch between the two > > > > modes. > > > > > > > > The static key is always enabled by default (run in guest mode) and > > > > should be disabled for bare metal (and in some guests that want native > > > > behavior). > > > > > > > > Performance drop is reported when running encode/decode workload and > > > > BenchSEE cache sub-workload. > > > > Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused > > > > native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS > > > > is disabled the virt_spin_lock_key is incorrectly set to true on bare > > > > metal. The qspinlock degenerates to test-and-set spinlock, which > > > > decrease the performance on bare metal. > > > > > > > > Fix this by disabling virt_spin_lock_key if it is on bare metal, > > > > regardless of CONFIG_PARAVIRT_SPINLOCKS. > > > > > > > > > > nit: > > > > > > This bug wouldn't have happened if the key was defined FALSE by default > > > and > > > only enabled in the appropriate case. I think it makes more sense to > > > invert > > > the logic and have the key FALSE by default and only enable it iff the > > > kernel is running under a hypervisor... At worst only the virtualization > > > case would suffer if the lock is falsely not enabled. > > > > Thank you for your review. I agree, initializing the key to FALSE by > > default seems > > to be more readable. Could this change be the subsequent adjustment based > > on current fix, which could be more bisectible? > > Why can't this change be squashed in the current proposed patch? 
> Current patch deals with the incorrect condition of CONFIG_PARAVIRT_SPINLOCKS. The change of the virt_spin_lock_key's default value is supposed to be "no functional change expected", but maybe just in case of some corner cases... Anyway, I'll put the changes together in one patch and do some tests. > > > > > > Set the default key to false. If booting in a VM, enable this key. Later > > during > > the VM initialization, if other high-efficient spinlock is preferred, like > > paravirt-spinlock, the virt_spin_lock_key will be disabled accordingly. > > Yep, or simply during the initialization stage the correct flavor will be > chosen, no need to do the on-off dance. But that's a topic for a different > discussion. > Yes it is doable. thanks, Chenyu > > > > diff --git a/arch/x86/include/asm/qspinlock.h > > b/arch/x86/include/asm/qspinlock.h > > index cde8357bb226..a7d3ba00e70e 100644 > > --- a/arch/x86/include/asm/qspinlock.h > > +++ b/arch/x86/include/asm/qspinlock.h > > @@ -66,13 +66,13 @@ static inline bool vcpu_is_preempted(long cpu) > > #ifdef CONFIG_PARAVIRT > > /* > > - * virt_spin_lock_key - enables (by default) the virt_spin_lock() hijack. > > + * virt_spin_lock_key - disables (by default) the virt_spin_lock() hijack. > >* > >* Native (and PV wanting native due to vCPU pinning) should disable this > > key. > >* It is done in this backwards fashion to only have a single direction > > change, > >* which removes ordering between native_pv_spin_init() and HV setup. 
> >*/ > > -DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key); > > +DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key); > > /* > >* Shortcut for the queued_spin_lock_slowpath() function that allows > > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c > > index c193c9e60a1b..fec381533555 100644 > > --- a/arch/x86/kernel/paravirt.c > > +++ b/arch/x86/kernel/paravirt.c > > @@ -51,12 +51,12 @@ DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", > > .noinstr.text); > > DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text); > > #endif > > -DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); > > +DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key); > > void __init native_pv_lock_init(void) > > { > > - if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) > > - static_branch_disable(&virt_spin_lock_key); > > + if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) > > + static_branch_enable(&virt_spin_lock_key); > > } > > static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
Re: [PATCH vhost 19/23] vdpa/mlx5: Use suspend/resume during VQP change
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > Resume a VQ if it is already created when the number of VQ pairs > increases. This is done in preparation for VQ pre-creation which is > coming in a later patch. It is necessary because calling setup_vq() on > an already created VQ will return early and will not enable the queue. > > For symmetry, suspend a VQ instead of tearing it down when the number of > VQ pairs decreases. But only if the resume operation is supported. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 14 +++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 0e1c1b7ff297..249b5afbe34a 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -2130,14 +2130,22 @@ static int change_num_qps(struct mlx5_vdpa_dev > *mvdev, int newqps) > if (err) > return err; > > - for (i = ndev->cur_num_vqs - 1; i >= 2 * newqps; i--) > - teardown_vq(ndev, &ndev->vqs[i]); > + for (i = ndev->cur_num_vqs - 1; i >= 2 * newqps; i--) { > + struct mlx5_vdpa_virtqueue *mvq = &ndev->vqs[i]; > + > + if (is_resumable(ndev)) > + suspend_vq(ndev, mvq); > + else > + teardown_vq(ndev, mvq); > + } > > ndev->cur_num_vqs = 2 * newqps; > } else { > ndev->cur_num_vqs = 2 * newqps; > for (i = cur_qps * 2; i < 2 * newqps; i++) { > - err = setup_vq(ndev, &ndev->vqs[i], true); > + struct mlx5_vdpa_virtqueue *mvq = &ndev->vqs[i]; > + > + err = mvq->initialized ? resume_vq(ndev, mvq) : > setup_vq(ndev, mvq, true); > if (err) > goto clean_added; > } > > -- > 2.45.1 >
[PATCH v5 32/37] s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs
This is normally done by the generic entry code, but the kernel_stack_overflow() flow bypasses it. Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/kernel/traps.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c index 52578b5cecbd..dde69d2a64f0 100644 --- a/arch/s390/kernel/traps.c +++ b/arch/s390/kernel/traps.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -262,6 +263,11 @@ static void monitor_event_exception(struct pt_regs *regs) void kernel_stack_overflow(struct pt_regs *regs) { + /* +* Normally regs are unpoisoned by the generic entry code, but +* kernel_stack_overflow() is a rare case that is called bypassing it. +*/ + kmsan_unpoison_entry_regs(regs); bust_spinlocks(1); printk("Kernel stack overflow.\n"); show_regs(regs); -- 2.45.1
[PATCH v5 14/37] kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()
Improve the readability by replacing the custom aligning logic with ALIGN_DOWN(). Unlike other places where a similar sequence is used, there is no size parameter that needs to be adjusted, so the standard macro fits. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/shadow.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c index 2d57408c78ae..9c58f081d84f 100644 --- a/mm/kmsan/shadow.c +++ b/mm/kmsan/shadow.c @@ -123,14 +123,12 @@ struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *address, u64 size, */ void *kmsan_get_metadata(void *address, bool is_origin) { - u64 addr = (u64)address, pad, off; + u64 addr = (u64)address, off; struct page *page; void *ret; - if (is_origin && !IS_ALIGNED(addr, KMSAN_ORIGIN_SIZE)) { - pad = addr % KMSAN_ORIGIN_SIZE; - addr -= pad; - } + if (is_origin) + addr = ALIGN_DOWN(addr, KMSAN_ORIGIN_SIZE); address = (void *)addr; if (kmsan_internal_is_vmalloc_addr(address) || kmsan_internal_is_module_addr(address)) -- 2.45.1
[PATCH v5 05/37] kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces
Comparing pointers with TASK_SIZE does not make sense when kernel and userspace overlap. Skip the comparison when this is the case. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/instrumentation.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index 470b0b4afcc4..8a1bbbc723ab 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -20,7 +20,8 @@ static inline bool is_bad_asm_addr(void *addr, uintptr_t size, bool is_store) { - if ((u64)addr < TASK_SIZE) + if (IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) && + (u64)addr < TASK_SIZE) return true; if (!kmsan_get_metadata(addr, KMSAN_META_SHADOW)) return true; -- 2.45.1
[PATCH v5 22/37] s390: Use a larger stack for KMSAN
Adjust the stack size for the KMSAN-enabled kernel like it was done for the KASAN-enabled one in commit 7fef92ccadd7 ("s390/kasan: double the stack size"). Both tools have similar requirements. Reviewed-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/Makefile | 2 +- arch/s390/include/asm/thread_info.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/s390/Makefile b/arch/s390/Makefile index f2b21c7a70ef..7fd57398221e 100644 --- a/arch/s390/Makefile +++ b/arch/s390/Makefile @@ -36,7 +36,7 @@ KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO_DWARF4), $(call cc-option KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds) UTS_MACHINE:= s390x -STACK_SIZE := $(if $(CONFIG_KASAN),65536,16384) +STACK_SIZE := $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384)) CHECKFLAGS += -D__s390__ -D__s390x__ export LD_BFD diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h index a674c7d25da5..d02a709717b8 100644 --- a/arch/s390/include/asm/thread_info.h +++ b/arch/s390/include/asm/thread_info.h @@ -16,7 +16,7 @@ /* * General size of kernel stacks */ -#ifdef CONFIG_KASAN +#if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN) #define THREAD_SIZE_ORDER 4 #else #define THREAD_SIZE_ORDER 2 -- 2.45.1
[PATCH v5 34/37] s390/uaccess: Add the missing linux/instrumented.h #include
uaccess.h uses instrument_get_user() and instrument_put_user(), which are defined in linux/instrumented.h. Currently we get this header from somewhere else by accident; prefer to be explicit about it and include it directly. Suggested-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/uaccess.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index 70f0edc00c2a..9213be0529ee 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -18,6 +18,7 @@ #include #include #include +#include void debug_user_asce(int exit); -- 2.45.1
[PATCH v5 27/37] s390/diag: Unpoison diag224() output buffer
Diagnose 224 stores 4k bytes, which currently cannot be deduced from the inline assembly constraints. This leads to KMSAN false positives. Fix the constraints by using a 4k-sized struct instead of a raw pointer. While at it, prettify them too. Suggested-by: Heiko Carstens Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/kernel/diag.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/s390/kernel/diag.c b/arch/s390/kernel/diag.c index 8dee9aa0ec95..8a7009618ba7 100644 --- a/arch/s390/kernel/diag.c +++ b/arch/s390/kernel/diag.c @@ -278,12 +278,14 @@ int diag224(void *ptr) int rc = -EOPNOTSUPP; diag_stat_inc(DIAG_STAT_X224); - asm volatile( - " diag%1,%2,0x224\n" - "0: lhi %0,0x0\n" + asm volatile("\n" + " diag%[type],%[addr],0x224\n" + "0: lhi %[rc],0\n" "1:\n" EX_TABLE(0b,1b) - : "+d" (rc) :"d" (0), "d" (addr) : "memory"); + : [rc] "+d" (rc) + , "=m" (*(struct { char buf[PAGE_SIZE]; } *)ptr) + : [type] "d" (0), [addr] "d" (addr)); return rc; } EXPORT_SYMBOL(diag224); -- 2.45.1
[PATCH v5 19/37] lib/zlib: Unpoison DFLTCC output buffers
The constraints of the DFLTCC inline assembly are not precise: they do not communicate the size of the output buffers to the compiler, so it cannot automatically instrument it. Add the manual kmsan_unpoison_memory() calls for the output buffers. The logic is the same as in [1]. [1] https://github.com/zlib-ng/zlib-ng/commit/1f5ddcc009ac3511e99fc88736a9e1a6381168c5 Reported-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- lib/zlib_dfltcc/dfltcc.h | 1 + lib/zlib_dfltcc/dfltcc_util.h | 28 2 files changed, 29 insertions(+) diff --git a/lib/zlib_dfltcc/dfltcc.h b/lib/zlib_dfltcc/dfltcc.h index b96232bdd44d..0f2a16d7a48a 100644 --- a/lib/zlib_dfltcc/dfltcc.h +++ b/lib/zlib_dfltcc/dfltcc.h @@ -80,6 +80,7 @@ struct dfltcc_param_v0 { uint8_t csb[1152]; }; +static_assert(offsetof(struct dfltcc_param_v0, csb) == 384); static_assert(sizeof(struct dfltcc_param_v0) == 1536); #define CVT_CRC32 0 diff --git a/lib/zlib_dfltcc/dfltcc_util.h b/lib/zlib_dfltcc/dfltcc_util.h index 4a46b5009f0d..10509270d822 100644 --- a/lib/zlib_dfltcc/dfltcc_util.h +++ b/lib/zlib_dfltcc/dfltcc_util.h @@ -2,6 +2,8 @@ #ifndef DFLTCC_UTIL_H #define DFLTCC_UTIL_H +#include "dfltcc.h" +#include #include /* @@ -20,6 +22,7 @@ typedef enum { #define DFLTCC_CMPR 2 #define DFLTCC_XPND 4 #define HBT_CIRCULAR (1 << 7) +#define DFLTCC_FN_MASK ((1 << 7) - 1) #define HB_BITS 15 #define HB_SIZE (1 << HB_BITS) @@ -34,6 +37,7 @@ static inline dfltcc_cc dfltcc( ) { Byte *t2 = op1 ? *op1 : NULL; +unsigned char *orig_t2 = t2; size_t t3 = len1 ? *len1 : 0; const Byte *t4 = op2 ? *op2 : NULL; size_t t5 = len2 ? *len2 : 0; @@ -59,6 +63,30 @@ static inline dfltcc_cc dfltcc( : "cc", "memory"); t2 = r2; t3 = r3; t4 = r4; t5 = r5; +/* + * Unpoison the parameter block and the output buffer. + * This is a no-op in non-KMSAN builds. 
+ */ +switch (fn & DFLTCC_FN_MASK) { +case DFLTCC_QAF: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_qaf_param)); +break; +case DFLTCC_GDHT: +kmsan_unpoison_memory(param, offsetof(struct dfltcc_param_v0, csb)); +break; +case DFLTCC_CMPR: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0)); +kmsan_unpoison_memory( +orig_t2, +t2 - orig_t2 + +(((struct dfltcc_param_v0 *)param)->sbb == 0 ? 0 : 1)); +break; +case DFLTCC_XPND: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0)); +kmsan_unpoison_memory(orig_t2, t2 - orig_t2); +break; +} + if (op1) *op1 = t2; if (len1) -- 2.45.1
[PATCH v5 35/37] s390/unwind: Disable KMSAN checks
The unwind code can read uninitialized frames. Furthermore, even in the good case, KMSAN does not emit shadow for backchains. Therefore, disable KMSAN checks for the unwinding functions. Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/kernel/unwind_bc.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c index 0ece156fdd7c..cd44be2b6ce8 100644 --- a/arch/s390/kernel/unwind_bc.c +++ b/arch/s390/kernel/unwind_bc.c @@ -49,6 +49,8 @@ static inline bool is_final_pt_regs(struct unwind_state *state, READ_ONCE_NOCHECK(regs->psw.mask) & PSW_MASK_PSTATE; } +/* Avoid KMSAN false positives from touching uninitialized frames. */ +__no_kmsan_checks bool unwind_next_frame(struct unwind_state *state) { struct stack_info *info = &state->stack_info; @@ -118,6 +120,8 @@ bool unwind_next_frame(struct unwind_state *state) } EXPORT_SYMBOL_GPL(unwind_next_frame); +/* Avoid KMSAN false positives from touching uninitialized frames. */ +__no_kmsan_checks void __unwind_start(struct unwind_state *state, struct task_struct *task, struct pt_regs *regs, unsigned long first_frame) { -- 2.45.1
[PATCH v5 04/37] kmsan: Increase the maximum store size to 4096
The inline assembly block in s390's chsc() stores that much. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/instrumentation.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index cc3907a9c33a..470b0b4afcc4 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -110,11 +110,10 @@ void __msan_instrument_asm_store(void *addr, uintptr_t size) ua_flags = user_access_save(); /* -* Most of the accesses are below 32 bytes. The two exceptions so far -* are clwb() (64 bytes) and FPU state (512 bytes). -* It's unlikely that the assembly will touch more than 512 bytes. +* Most of the accesses are below 32 bytes. The exceptions so far are +* clwb() (64 bytes), FPU state (512 bytes) and chsc() (4096 bytes). */ - if (size > 512) { + if (size > 4096) { WARN_ONCE(1, "assembly store size too big: %ld\n", size); size = 8; } -- 2.45.1
[PATCH v5 20/37] kmsan: Accept ranges starting with 0 on s390
On s390 the virtual address 0 is valid (current CPU's lowcore is mapped there), therefore KMSAN should not complain about it. Disable the respective check on s390. There doesn't seem to be a Kconfig option to describe this situation, so explicitly check for s390. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/init.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c index 9de76ac7062c..3f8b1bbb9060 100644 --- a/mm/kmsan/init.c +++ b/mm/kmsan/init.c @@ -33,7 +33,10 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end) bool merged = false; KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES); - KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend); + KMSAN_WARN_ON((nstart >= nend) || + /* Virtual address 0 is valid on s390. */ + (!IS_ENABLED(CONFIG_S390) && !nstart) || + !nend); nstart = ALIGN_DOWN(nstart, PAGE_SIZE); nend = ALIGN(nend, PAGE_SIZE); -- 2.45.1
[PATCH v5 33/37] s390/uaccess: Add KMSAN support to put_user() and get_user()
put_user() uses inline assembly with precise constraints, so Clang is in principle capable of instrumenting it automatically. Unfortunately, one of the constraints contains a dereferenced user pointer, and Clang does not currently distinguish user and kernel pointers. Therefore KMSAN attempts to access shadow for user pointers, which is not the right thing to do. An obvious fix, adding __no_sanitize_memory to __put_user_fn(), does not work, since it's __always_inline, and __always_inline cannot be removed due to the __put_user_bad() trick. A different obvious fix, using the "a" constraint instead of "+Q", degrades the code quality, which is very important here, since it's a hot path. Instead, repurpose the __put_user_asm() macro to define __put_user_{char,short,int,long}_noinstr() functions and mark them with __no_sanitize_memory. For the non-KMSAN builds make them __always_inline in order to keep the generated code quality. Also define __put_user_{char,short,int,long}() functions, which call the aforementioned ones and which *are* instrumented, because they call KMSAN hooks, which may be implemented as macros. The same applies to get_user() as well. 
Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/uaccess.h | 111 +++- 1 file changed, 79 insertions(+), 32 deletions(-) diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index 81ae8a98e7ec..70f0edc00c2a 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -78,13 +78,24 @@ union oac { int __noreturn __put_user_bad(void); -#define __put_user_asm(to, from, size) \ -({ \ +#ifdef CONFIG_KMSAN +#define get_put_user_noinstr_attributes \ + noinline __maybe_unused __no_sanitize_memory +#else +#define get_put_user_noinstr_attributes __always_inline +#endif + +#define DEFINE_PUT_USER(type) \ +static get_put_user_noinstr_attributes int \ +__put_user_##type##_noinstr(unsigned type __user *to, \ + unsigned type *from,\ + unsigned long size) \ +{ \ union oac __oac_spec = {\ .oac1.as = PSW_BITS_AS_SECONDARY, \ .oac1.a = 1,\ }; \ - int __rc; \ + int rc; \ \ asm volatile( \ " lr 0,%[spec]\n"\ @@ -93,12 +104,28 @@ int __noreturn __put_user_bad(void); "2:\n" \ EX_TABLE_UA_STORE(0b, 2b, %[rc])\ EX_TABLE_UA_STORE(1b, 2b, %[rc])\ - : [rc] "=&d" (__rc), [_to] "+Q" (*(to)) \ + : [rc] "=&d" (rc), [_to] "+Q" (*(to)) \ : [_size] "d" (size), [_from] "Q" (*(from)),\ [spec] "d" (__oac_spec.val) \ : "cc", "0"); \ - __rc; \ -}) + return rc; \ +} \ + \ +static __always_inline int \ +__put_user_##type(unsigned type __user *to, unsigned type *from, \ + unsigned long size) \ +{ \ + int rc; \ + \ + rc = __put_user_##type##_noinstr(to, from, size); \ + instrument_put_user(*from, to, size); \ + return rc; \ +} + +DEFINE_PUT_USER(char); +DEFINE_PUT_USER(short); +DEFINE_PUT_USER(int); +DEFINE_PUT_USER(long); static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned long size) { @@ -106,24 +133,24 @@ static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned lon
[PATCH v5 09/37] kmsan: Expose kmsan_get_metadata()
Each s390 CPU has lowcore pages associated with it. Each CPU sees its own lowcore at virtual address 0 through a hardware mechanism called prefixing. Additionally, all lowcores are mapped to non-0 virtual addresses stored in the lowcore_ptr[] array. When lowcore is accessed through virtual address 0, one needs to resolve metadata for lowcore_ptr[raw_smp_processor_id()]. Expose kmsan_get_metadata() to make it possible to do this from the arch code. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- include/linux/kmsan.h | 9 + mm/kmsan/instrumentation.c | 1 + mm/kmsan/kmsan.h | 1 - 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index e0c23a32cdf0..fe6c2212bdb1 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -230,6 +230,15 @@ void kmsan_handle_urb(const struct urb *urb, bool is_out); */ void kmsan_unpoison_entry_regs(const struct pt_regs *regs); +/** + * kmsan_get_metadata() - Return a pointer to KMSAN shadow or origins. + * @addr: kernel address. + * @is_origin: whether to return origins or shadow. + * + * Return NULL if metadata cannot be found. + */ +void *kmsan_get_metadata(void *addr, bool is_origin); + #else static inline void kmsan_init_shadow(void) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index 8a1bbbc723ab..94b49fac9d8b 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -14,6 +14,7 @@ #include "kmsan.h" #include +#include #include #include #include diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h index adf443bcffe8..34b83c301d57 100644 --- a/mm/kmsan/kmsan.h +++ b/mm/kmsan/kmsan.h @@ -66,7 +66,6 @@ struct shadow_origin_ptr { struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size, bool store); -void *kmsan_get_metadata(void *addr, bool is_origin); void __init kmsan_init_alloc_meta_for_range(void *start, void *end); enum kmsan_bug_reason { -- 2.45.1
[PATCH v5 23/37] s390/boot: Add the KMSAN runtime stub
It should be possible to have inline functions in the s390 header files, which call kmsan_unpoison_memory(). The problem is that these header files might be included by the decompressor, which does not contain the KMSAN runtime, causing linker errors. Not compiling these calls if __SANITIZE_MEMORY__ is not defined - either by changing kmsan-checks.h or at the call sites - may cause unintended side effects, since calling these functions from uninstrumented code that is linked into the kernel is a valid use case. One might want to explicitly distinguish between the kernel and the decompressor. Checking for a decompressor-specific #define is quite heavy-handed, and will have to be done at all call sites. A more generic approach is to provide a dummy kmsan_unpoison_memory() definition. This produces some runtime overhead, but only when building with CONFIG_KMSAN. The benefit is that it does not disturb the existing KMSAN build logic and call sites don't need to be changed. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/Makefile | 1 + arch/s390/boot/kmsan.c | 6 ++ 2 files changed, 7 insertions(+) create mode 100644 arch/s390/boot/kmsan.c diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile index 526ed20b9d31..e7658997452b 100644 --- a/arch/s390/boot/Makefile +++ b/arch/s390/boot/Makefile @@ -44,6 +44,7 @@ obj-$(findstring y, $(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) $(CONFIG_PGSTE)) += obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o obj-y += $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) info.o obj-$(CONFIG_KERNEL_ZSTD) += clz_ctz.o +obj-$(CONFIG_KMSAN) += kmsan.o obj-all := $(obj-y) piggy.o syms.o targets:= bzImage section_cmp.boot.data section_cmp.boot.preserved.data $(obj-y) diff --git a/arch/s390/boot/kmsan.c b/arch/s390/boot/kmsan.c new file mode 100644 index ..e7b3ac48143e --- /dev/null +++ b/arch/s390/boot/kmsan.c @@ -0,0 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 +#include + +void kmsan_unpoison_memory(const void 
*address, size_t size) +{ +} -- 2.45.1
[PATCH v5 29/37] s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN
Lockdep generates the following false positives with KMSAN on s390x: [6.063666] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled()) [ ...] [6.577050] Call Trace: [6.619637] [<0690d2de>] check_flags+0x1fe/0x210 [6.665411] ([<0690d2da>] check_flags+0x1fa/0x210) [6.707478] [<006cec1a>] lock_acquire+0x2ca/0xce0 [6.749959] [<069820ea>] _raw_spin_lock_irqsave+0xea/0x190 [6.794912] [<041fc988>] __stack_depot_save+0x218/0x5b0 [6.838420] [<0197affe>] __msan_poison_alloca+0xfe/0x1a0 [6.882985] [<07c5827c>] start_kernel+0x70c/0xd50 [6.927454] [<00100036>] startup_continue+0x36/0x40 Between trace_hardirqs_on() and `stosm __mask, 3` lockdep thinks that interrupts are on, but on the CPU they are still off. KMSAN instrumentation takes spinlocks, giving lockdep a chance to see and complain about this discrepancy. KMSAN instrumentation is inserted in order to poison the __mask variable. Disable instrumentation in the respective functions. They are very small and it's easy to see that no important metadata updates are lost because of this. 
Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/irqflags.h | 17 ++--- drivers/s390/char/sclp.c | 2 +- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/irqflags.h b/arch/s390/include/asm/irqflags.h index 02427b205c11..bcab456dfb80 100644 --- a/arch/s390/include/asm/irqflags.h +++ b/arch/s390/include/asm/irqflags.h @@ -37,12 +37,18 @@ static __always_inline void __arch_local_irq_ssm(unsigned long flags) asm volatile("ssm %0" : : "Q" (flags) : "memory"); } -static __always_inline unsigned long arch_local_save_flags(void) +#ifdef CONFIG_KMSAN +#define arch_local_irq_attributes noinline notrace __no_sanitize_memory __maybe_unused +#else +#define arch_local_irq_attributes __always_inline +#endif + +static arch_local_irq_attributes unsigned long arch_local_save_flags(void) { return __arch_local_irq_stnsm(0xff); } -static __always_inline unsigned long arch_local_irq_save(void) +static arch_local_irq_attributes unsigned long arch_local_irq_save(void) { return __arch_local_irq_stnsm(0xfc); } @@ -52,7 +58,12 @@ static __always_inline void arch_local_irq_disable(void) arch_local_irq_save(); } -static __always_inline void arch_local_irq_enable(void) +static arch_local_irq_attributes void arch_local_irq_enable_external(void) +{ + __arch_local_irq_stosm(0x01); +} + +static arch_local_irq_attributes void arch_local_irq_enable(void) { __arch_local_irq_stosm(0x03); } diff --git a/drivers/s390/char/sclp.c b/drivers/s390/char/sclp.c index d53ee34d398f..fb1d9949adca 100644 --- a/drivers/s390/char/sclp.c +++ b/drivers/s390/char/sclp.c @@ -736,7 +736,7 @@ sclp_sync_wait(void) cr0_sync.val = cr0.val & ~CR0_IRQ_SUBCLASS_MASK; cr0_sync.val |= 1UL << (63 - 54); local_ctl_load(0, &cr0_sync); - __arch_local_irq_stosm(0x01); + arch_local_irq_enable_external(); /* Loop until driver state indicates finished request */ while (sclp_running_state != sclp_running_state_idle) { /* Check for expired request timer */ -- 2.45.1
[PATCH v5 16/37] mm: slub: Let KMSAN access metadata
Building the kernel with CONFIG_SLUB_DEBUG and CONFIG_KMSAN causes KMSAN to complain about touching redzones in kfree(). Fix by extending the existing KASAN-related metadata_access_enable() and metadata_access_disable() functions to KMSAN. Acked-by: Vlastimil Babka Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/slub.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/slub.c b/mm/slub.c index 1134091abac5..b050e528112c 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -829,10 +829,12 @@ static int disable_higher_order_debug; static inline void metadata_access_enable(void) { kasan_disable_current(); + kmsan_disable_current(); } static inline void metadata_access_disable(void) { + kmsan_enable_current(); kasan_enable_current(); } -- 2.45.1
[PATCH v5 15/37] kmsan: Do not round up pg_data_t size
x86's alloc_node_data() rounds up node data size to PAGE_SIZE. It's not explained why it's needed, but it's most likely for performance reasons, since the padding bytes are not used anywhere. Some other architectures do it as well, e.g., mips rounds it up to the cache line size. kmsan_init_shadow() initializes metadata for each node data and assumes the x86 rounding, which does not match other architectures. This may cause the range end to overshoot the end of available memory, in turn causing virt_to_page_or_null() in kmsan_init_alloc_meta_for_range() to return NULL, which leads to kernel panic shortly after. Since the padding bytes are not used, drop the rounding. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c index 3ac3b8921d36..9de76ac7062c 100644 --- a/mm/kmsan/init.c +++ b/mm/kmsan/init.c @@ -72,7 +72,7 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end) */ void __init kmsan_init_shadow(void) { - const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE); + const size_t nd_size = sizeof(pg_data_t); phys_addr_t p_start, p_end; u64 loop; int nid; -- 2.45.1
[PATCH v5 18/37] mm: kfence: Disable KMSAN when checking the canary
KMSAN warns about check_canary() accessing the canary. The reason is that, even though set_canary() is properly instrumented and sets shadow, slub explicitly poisons the canary's address range afterwards. Unpoisoning the canary is not the right thing to do: only check_canary() is supposed to ever touch it. Instead, disable KMSAN checks around canary read accesses. Reviewed-by: Alexander Potapenko Tested-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kfence/core.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 964b8482275b..83f8e78827c0 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -305,8 +305,14 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex WRITE_ONCE(meta->state, next); } +#ifdef CONFIG_KMSAN +#define check_canary_attributes noinline __no_kmsan_checks +#else +#define check_canary_attributes inline +#endif + /* Check canary byte at @addr. */ -static inline bool check_canary_byte(u8 *addr) +static check_canary_attributes bool check_canary_byte(u8 *addr) { struct kfence_metadata *meta; unsigned long flags; @@ -341,7 +347,8 @@ static inline void set_canary(const struct kfence_metadata *meta) *((u64 *)addr) = KFENCE_CANARY_PATTERN_U64; } -static inline void check_canary(const struct kfence_metadata *meta) +static check_canary_attributes void +check_canary(const struct kfence_metadata *meta) { const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE); unsigned long addr = pageaddr; -- 2.45.1
[PATCH v5 30/37] s390/mm: Define KMSAN metadata for vmalloc and modules
The pages for the KMSAN metadata associated with most kernel mappings are taken from memblock by the common code. However, vmalloc and module metadata needs to be defined by the architectures. Be a little bit more careful than x86: allocate exactly MODULES_LEN for the module shadow and origins, and then take 2/3 of vmalloc for the vmalloc shadow and origins. This ensures that users passing small vmalloc= values on the command line do not cause module metadata collisions. Reviewed-by: Alexander Potapenko Acked-by: Alexander Gordeev Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/startup.c| 7 +++ arch/s390/include/asm/pgtable.h | 8 2 files changed, 15 insertions(+) diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c index 48ef5fe5c08a..d6b0d114939a 100644 --- a/arch/s390/boot/startup.c +++ b/arch/s390/boot/startup.c @@ -301,11 +301,18 @@ static unsigned long setup_kernel_memory_layout(unsigned long kernel_size) MODULES_END = round_down(kernel_start, _SEGMENT_SIZE); MODULES_VADDR = MODULES_END - MODULES_LEN; VMALLOC_END = MODULES_VADDR; + if (IS_ENABLED(CONFIG_KMSAN)) + VMALLOC_END -= MODULES_LEN * 2; /* allow vmalloc area to occupy up to about 1/2 of the rest virtual space left */ vsize = (VMALLOC_END - FIXMAP_SIZE) / 2; vsize = round_down(vsize, _SEGMENT_SIZE); vmalloc_size = min(vmalloc_size, vsize); + if (IS_ENABLED(CONFIG_KMSAN)) { + /* take 2/3 of vmalloc area for KMSAN shadow and origins */ + vmalloc_size = round_down(vmalloc_size / 3, _SEGMENT_SIZE); + VMALLOC_END -= vmalloc_size * 2; + } VMALLOC_START = VMALLOC_END - vmalloc_size; __memcpy_real_area = round_down(VMALLOC_START - MEMCPY_REAL_SIZE, PAGE_SIZE); diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 70b6ee557eb2..2f44c23efec0 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -107,6 +107,14 @@ static inline int is_module_addr(void *addr) return 1; } +#ifdef CONFIG_KMSAN +#define 
KMSAN_VMALLOC_SIZE (VMALLOC_END - VMALLOC_START) +#define KMSAN_VMALLOC_SHADOW_START VMALLOC_END +#define KMSAN_VMALLOC_ORIGIN_START (KMSAN_VMALLOC_SHADOW_START + KMSAN_VMALLOC_SIZE) +#define KMSAN_MODULES_SHADOW_START (KMSAN_VMALLOC_ORIGIN_START + KMSAN_VMALLOC_SIZE) +#define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN) +#endif + #ifdef CONFIG_RANDOMIZE_BASE #define KASLR_LEN (1UL << 31) #else -- 2.45.1
[PATCH v5 01/37] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
Architectures use assembly code to initialize ftrace_regs and call ftrace_ops_list_func(). Therefore, from KMSAN's point of view, ftrace_regs is poisoned on ftrace_ops_list_func() entry. This causes KMSAN warnings when running the ftrace testsuite. Fix by trusting the architecture-specific assembly code and always unpoisoning ftrace_regs in ftrace_ops_list_func(). So far the issue has gone unnoticed on x86_64 only by accident: the assembly-allocated ftrace_regs was overlapping a stale, partially unpoisoned stack frame. Poisoning stack frames before returns [1] makes the issue appear on x86_64 as well. [1] https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/ Reviewed-by: Alexander Potapenko Acked-by: Steven Rostedt (Google) Signed-off-by: Ilya Leoshkevich --- kernel/trace/ftrace.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 65208d3b5ed9..c35ad4362d71 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -7407,6 +7407,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *op, struct ftrace_regs *fregs) { + kmsan_unpoison_memory(fregs, sizeof(*fregs)); __ftrace_ops_list_func(ip, parent_ip, NULL, fregs); } #else -- 2.45.1
[PATCH v5 28/37] s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()
s390 uses assembly code to initialize ftrace_regs and call kprobe_ftrace_handler(). Therefore, from KMSAN's point of view, ftrace_regs is poisoned on kprobe_ftrace_handler() entry. This causes KMSAN warnings when running the ftrace testsuite. Fix by trusting the assembly code and always unpoisoning ftrace_regs in kprobe_ftrace_handler(). Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/kernel/ftrace.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c index ddf2ee47cb87..0bd6adc40a34 100644 --- a/arch/s390/kernel/ftrace.c +++ b/arch/s390/kernel/ftrace.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -303,6 +304,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip, if (bit < 0) return; + kmsan_unpoison_memory(fregs, sizeof(*fregs)); regs = ftrace_get_regs(fregs); p = get_kprobe((kprobe_opcode_t *)ip); if (!regs || unlikely(!p) || kprobe_disabled(p)) -- 2.45.1
[PATCH v5 36/37] s390/kmsan: Implement the architecture-specific functions
arch_kmsan_get_meta_or_null() finds the lowcore shadow by querying the prefix and calling kmsan_get_metadata() again. kmsan_virt_addr_valid() delegates to virt_addr_valid(). Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/kmsan.h | 59 +++ 1 file changed, 59 insertions(+) create mode 100644 arch/s390/include/asm/kmsan.h diff --git a/arch/s390/include/asm/kmsan.h b/arch/s390/include/asm/kmsan.h new file mode 100644 index ..eb850c942204 --- /dev/null +++ b/arch/s390/include/asm/kmsan.h @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_S390_KMSAN_H +#define _ASM_S390_KMSAN_H + +#include +#include +#include +#include +#include + +#ifndef MODULE + +static inline bool is_lowcore_addr(void *addr) +{ + return addr >= (void *)&S390_lowcore && + addr < (void *)(&S390_lowcore + 1); +} + +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin) +{ + if (is_lowcore_addr(addr)) { + /* +* Different lowcores accessed via S390_lowcore are described +* by the same struct page. Resolve the prefix manually in +* order to get a distinct struct page. +*/ + addr += (void *)lowcore_ptr[raw_smp_processor_id()] - + (void *)&S390_lowcore; + if (WARN_ON_ONCE(is_lowcore_addr(addr))) + return NULL; + return kmsan_get_metadata(addr, is_origin); + } + return NULL; +} + +static inline bool kmsan_virt_addr_valid(void *addr) +{ + bool ret; + + /* +* pfn_valid() relies on RCU, and may call into the scheduler on exiting +* the critical section. However, this would result in recursion with +* KMSAN. Therefore, disable preemption here, and re-enable preemption +* below while suppressing reschedules to avoid recursion. +* +* Note, this sacrifices occasionally breaking scheduling guarantees. +* Although, a kernel compiled with KMSAN has already given up on any +* performance guarantees due to being heavily instrumented. 
+*/ + preempt_disable(); + ret = virt_addr_valid(addr); + preempt_enable_no_resched(); + + return ret; +} + +#endif /* !MODULE */ + +#endif /* _ASM_S390_KMSAN_H */ -- 2.45.1
[PATCH v5 37/37] kmsan: Enable on s390
Now that everything else is in place, enable KMSAN in Kconfig. Acked-by: Heiko Carstens Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c59d2b54df49..3cba4993d7c7 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -158,6 +158,7 @@ config S390 select HAVE_ARCH_KASAN select HAVE_ARCH_KASAN_VMALLOC select HAVE_ARCH_KCSAN + select HAVE_ARCH_KMSAN select HAVE_ARCH_KFENCE select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET select HAVE_ARCH_SECCOMP_FILTER -- 2.45.1
[PATCH v5 00/37] kmsan: Enable on s390
v4: https://lore.kernel.org/lkml/20240613153924.961511-1-...@linux.ibm.com/ v4 -> v5: Fix the __memset() build issue. Change the attribute #defines to lowercase in order to match the existing code style. Fix the kmsan_virt_addr_valid() implementation to avoid recursion in debug builds, like it's done on x86_64 - dropped R-bs, please take another look. Add kmsan_disable_current()/kmsan_enable_current() doc; Fix the poisoned memchr_inv() value in a different way; Add the missing linux/instrumented.h #include; (Alexander P.). Patches that need review: - [PATCH 12/37] kmsan: Introduce memset_no_sanitize_memory() - [PATCH 13/37] kmsan: Support SLAB_POISON - [PATCH 17/37] mm: slub: Disable KMSAN when checking the padding bytes - [PATCH 36/37] s390/kmsan: Implement the architecture-specific functions v3: https://lore.kernel.org/lkml/20231213233605.661251-1-...@linux.ibm.com/ v3 -> v4: Rebase. Elaborate why ftrace_ops_list_func() change is needed on x86_64 (Steven). Add a comment to the DFLTCC patch (Alexander P.). Simplify diag224(); Improve __arch_local_irq_attributes style; Use IS_ENABLED(CONFIG_KMSAN) for vmalloc area (Heiko). Align vmalloc area on _SEGMENT_SIZE (Alexander G.). v2: https://lore.kernel.org/lkml/20231121220155.1217090-1-...@linux.ibm.com/ v2 -> v3: Drop kmsan_memmove_metadata() and strlcpy() patches; Remove kmsan_get_metadata() stub; Move kmsan_enable_current() and kmsan_disable_current() to include/linux/kmsan.h, explain why a counter is needed; Drop the memset_no_sanitize_memory() patch; Use __memset() in the SLAB_POISON patch; Add kmsan-checks.h to the DFLTCC patch; Add recursion check to the arch_kmsan_get_meta_or_null() patch (Alexander P.). Fix inline + __no_kmsan_checks issues. New patch for s390/irqflags, that resolves a lockdep warning. New patch for s390/diag, that resolves a false positive when running on an LPAR. New patch for STCCTM, same as above. 
New patch for check_bytes_and_report() that resolves a false positive that occurs even on Intel. v1: https://lore.kernel.org/lkml/20231115203401.2495875-1-...@linux.ibm.com/ v1 -> v2: Add comments, sort #includes, introduce memset_no_sanitize_memory() and use it to avoid unpoisoning of redzones, change vmalloc alignment to _REGION3_SIZE, add R-bs (Alexander P.). Fix building [PATCH 28/33] s390/string: Add KMSAN support with FORTIFY_SOURCE. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202311170550.bsbo44ix-...@intel.com/ Hi, This series provides the minimal support for Kernel Memory Sanitizer on s390. Kernel Memory Sanitizer is clang-only instrumentation for finding accesses to uninitialized memory. The clang support for s390 has already been merged [1]. With this series, I can successfully boot s390 defconfig and debug_defconfig with kmsan.panic=1. The tool found one real s390-specific bug (fixed in master). Best regards, Ilya [1] https://reviews.llvm.org/D148596 Ilya Leoshkevich (37): ftrace: Unpoison ftrace_regs in ftrace_ops_list_func() kmsan: Make the tests compatible with kmsan.panic=1 kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled kmsan: Increase the maximum store size to 4096 kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush() kmsan: Remove an x86-specific #include from kmsan.h kmsan: Expose kmsan_get_metadata() kmsan: Export panic_on_kmsan kmsan: Allow disabling KMSAN checks for the current task kmsan: Introduce memset_no_sanitize_memory() kmsan: Support SLAB_POISON kmsan: Use ALIGN_DOWN() in kmsan_get_metadata() kmsan: Do not round up pg_data_t size mm: slub: Let KMSAN access metadata mm: slub: Disable KMSAN when checking the padding bytes mm: kfence: Disable KMSAN when checking the canary lib/zlib: Unpoison DFLTCC output buffers kmsan: Accept 
ranges starting with 0 on s390 s390/boot: Turn off KMSAN s390: Use a larger stack for KMSAN s390/boot: Add the KMSAN runtime stub s390/checksum: Add a KMSAN check s390/cpacf: Unpoison the results of cpacf_trng() s390/cpumf: Unpoison STCCTM output buffer s390/diag: Unpoison diag224() output buffer s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler() s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN s390/mm: Define KMSAN metadata for vmalloc and modules s390/string: Add KMSAN support s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs s
[PATCH v5 17/37] mm: slub: Disable KMSAN when checking the padding bytes
Even though the KMSAN warnings generated by memchr_inv() are suppressed by metadata_access_enable(), its return value may still be poisoned. The reason is that the last iteration of memchr_inv() returns `*start != value ? start : NULL`, where *start is poisoned. Because of this, somewhat counterintuitively, the shadow value computed by visitSelectInst() is equal to `(uintptr_t)start`. One possibility to fix this, since the intention behind guarding memchr_inv() behind metadata_access_enable() is to touch poisoned metadata without triggering KMSAN, is to unpoison its return value. However, this approach is too fragile. So simply disable the KMSAN checks in the respective functions. Signed-off-by: Ilya Leoshkevich --- mm/slub.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index b050e528112c..fcd68fcea4ab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1176,9 +1176,16 @@ static void restore_bytes(struct kmem_cache *s, char *message, u8 data, memset(from, data, to - from); } -static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab, - u8 *object, char *what, - u8 *start, unsigned int value, unsigned int bytes) +#ifdef CONFIG_KMSAN +#define pad_check_attributes noinline __no_kmsan_checks +#else +#define pad_check_attributes +#endif + +static pad_check_attributes int +check_bytes_and_report(struct kmem_cache *s, struct slab *slab, + u8 *object, char *what, + u8 *start, unsigned int value, unsigned int bytes) { u8 *fault; u8 *end; @@ -1270,7 +1277,8 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p) } /* Check the pad bytes at the end of a slab page */ -static void slab_pad_check(struct kmem_cache *s, struct slab *slab) +static pad_check_attributes void +slab_pad_check(struct kmem_cache *s, struct slab *slab) { u8 *start; u8 *fault; -- 2.45.1
[PATCH v5 26/37] s390/cpumf: Unpoison STCCTM output buffer
stcctm() uses the "Q" constraint for dest, therefore KMSAN does not understand that it fills multiple doublewords pointed to by dest, not just one. This results in false positives. Unpoison the whole dest manually with kmsan_unpoison_memory(). Reported-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/cpu_mf.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/s390/include/asm/cpu_mf.h b/arch/s390/include/asm/cpu_mf.h index a0de5b9b02ea..9e4bbc3e53f8 100644 --- a/arch/s390/include/asm/cpu_mf.h +++ b/arch/s390/include/asm/cpu_mf.h @@ -10,6 +10,7 @@ #define _ASM_S390_CPU_MF_H #include +#include #include #include @@ -239,6 +240,11 @@ static __always_inline int stcctm(enum stcctm_ctr_set set, u64 range, u64 *dest) : "=d" (cc) : "Q" (*dest), "d" (range), "i" (set) : "cc", "memory"); + /* +* If cc == 2, less than RANGE counters are stored, but it's not easy +* to tell how many. Always unpoison the whole range for simplicity. +*/ + kmsan_unpoison_memory(dest, range * sizeof(u64)); return cc; } -- 2.45.1
[PATCH v5 31/37] s390/string: Add KMSAN support
Add KMSAN support for the s390 implementations of the string functions. Do this similar to how it's already done for KASAN, except that the optimized memset{16,32,64}() functions need to be disabled: it's important for KMSAN to know that they initialized something. The way boot code is built with regard to string functions is problematic, since most files think it's configured with sanitizers, but boot/string.c doesn't. This creates various problems with the memset64() definitions, depending on whether the code is built with sanitizers or fortify. This should probably be streamlined, but in the meantime resolve the issues by introducing the IN_BOOT_STRING_C macro, similar to the existing IN_ARCH_STRING_C macro. Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/string.c| 16 arch/s390/include/asm/string.h | 20 +++- 2 files changed, 31 insertions(+), 5 deletions(-) diff --git a/arch/s390/boot/string.c b/arch/s390/boot/string.c index faccb33b462c..f6b9b1df48a8 100644 --- a/arch/s390/boot/string.c +++ b/arch/s390/boot/string.c @@ -1,11 +1,18 @@ // SPDX-License-Identifier: GPL-2.0 +#define IN_BOOT_STRING_C 1 #include #include #include #undef CONFIG_KASAN #undef CONFIG_KASAN_GENERIC +#undef CONFIG_KMSAN #include "../lib/string.c" +/* + * Duplicate some functions from the common lib/string.c + * instead of fully including it. 
+ */ + int strncmp(const char *cs, const char *ct, size_t count) { unsigned char c1, c2; @@ -22,6 +29,15 @@ int strncmp(const char *cs, const char *ct, size_t count) return 0; } +void *memset64(uint64_t *s, uint64_t v, size_t count) +{ + uint64_t *xs = s; + + while (count--) + *xs++ = v; + return s; +} + char *skip_spaces(const char *str) { while (isspace(*str)) diff --git a/arch/s390/include/asm/string.h b/arch/s390/include/asm/string.h index 351685de53d2..2ab868cbae6c 100644 --- a/arch/s390/include/asm/string.h +++ b/arch/s390/include/asm/string.h @@ -15,15 +15,12 @@ #define __HAVE_ARCH_MEMCPY /* gcc builtin & arch function */ #define __HAVE_ARCH_MEMMOVE/* gcc builtin & arch function */ #define __HAVE_ARCH_MEMSET /* gcc builtin & arch function */ -#define __HAVE_ARCH_MEMSET16 /* arch function */ -#define __HAVE_ARCH_MEMSET32 /* arch function */ -#define __HAVE_ARCH_MEMSET64 /* arch function */ void *memcpy(void *dest, const void *src, size_t n); void *memset(void *s, int c, size_t n); void *memmove(void *dest, const void *src, size_t n); -#ifndef CONFIG_KASAN +#if !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) #define __HAVE_ARCH_MEMCHR /* inline & arch function */ #define __HAVE_ARCH_MEMCMP /* arch function */ #define __HAVE_ARCH_MEMSCAN/* inline & arch function */ @@ -36,6 +33,9 @@ void *memmove(void *dest, const void *src, size_t n); #define __HAVE_ARCH_STRNCPY/* arch function */ #define __HAVE_ARCH_STRNLEN/* inline & arch function */ #define __HAVE_ARCH_STRSTR /* arch function */ +#define __HAVE_ARCH_MEMSET16 /* arch function */ +#define __HAVE_ARCH_MEMSET32 /* arch function */ +#define __HAVE_ARCH_MEMSET64 /* arch function */ /* Prototypes for non-inlined arch strings functions. 
*/ int memcmp(const void *s1, const void *s2, size_t n); @@ -44,7 +44,7 @@ size_t strlcat(char *dest, const char *src, size_t n); char *strncat(char *dest, const char *src, size_t n); char *strncpy(char *dest, const char *src, size_t n); char *strstr(const char *s1, const char *s2); -#endif /* !CONFIG_KASAN */ +#endif /* !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) */ #undef __HAVE_ARCH_STRCHR #undef __HAVE_ARCH_STRNCHR @@ -74,20 +74,30 @@ void *__memset16(uint16_t *s, uint16_t v, size_t count); void *__memset32(uint32_t *s, uint32_t v, size_t count); void *__memset64(uint64_t *s, uint64_t v, size_t count); +#ifdef __HAVE_ARCH_MEMSET16 static inline void *memset16(uint16_t *s, uint16_t v, size_t count) { return __memset16(s, v, count * sizeof(v)); } +#endif +#ifdef __HAVE_ARCH_MEMSET32 static inline void *memset32(uint32_t *s, uint32_t v, size_t count) { return __memset32(s, v, count * sizeof(v)); } +#endif +#ifdef __HAVE_ARCH_MEMSET64 +#ifdef IN_BOOT_STRING_C +void *memset64(uint64_t *s, uint64_t v, size_t count); +#else static inline void *memset64(uint64_t *s, uint64_t v, size_t count) { return __memset64(s, v, count * sizeof(v)); } +#endif +#endif #if !defined(IN_ARCH_STRING_C) && (!defined(CONFIG_FORTIFY_SOURCE) || defined(__NO_FORTIFY)) -- 2.45.1
[PATCH v5 02/37] kmsan: Make the tests compatible with kmsan.panic=1
It's useful to have both tests and kmsan.panic=1 during development, but right now the warnings that the tests cause lead to kernel panics. Temporarily set kmsan.panic=0 for the duration of the KMSAN testing. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/kmsan_test.c | 5 + 1 file changed, 5 insertions(+) diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c index 07d3a3a5a9c5..9bfd11674fe3 100644 --- a/mm/kmsan/kmsan_test.c +++ b/mm/kmsan/kmsan_test.c @@ -659,9 +659,13 @@ static void test_exit(struct kunit *test) { } +static int orig_panic_on_kmsan; + static int kmsan_suite_init(struct kunit_suite *suite) { register_trace_console(probe_console, NULL); + orig_panic_on_kmsan = panic_on_kmsan; + panic_on_kmsan = 0; return 0; } @@ -669,6 +673,7 @@ static void kmsan_suite_exit(struct kunit_suite *suite) { unregister_trace_console(probe_console, NULL); tracepoint_synchronize_unregister(); + panic_on_kmsan = orig_panic_on_kmsan; } static struct kunit_suite kmsan_test_suite = { -- 2.45.1
[PATCH v5 06/37] kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces
Comparing pointers with TASK_SIZE does not make sense when kernel and userspace overlap. Assume that we are handling user memory access in this case. Reported-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/hooks.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index 22e8657800ef..b408714f9ba3 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -267,7 +267,8 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy, return; ua_flags = user_access_save(); - if ((u64)to < TASK_SIZE) { + if (!IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) || + (u64)to < TASK_SIZE) { /* This is a user memory access, check it. */ kmsan_internal_check_memory((void *)from, to_copy - left, to, REASON_COPY_TO_USER); -- 2.45.1
[PATCH v5 12/37] kmsan: Introduce memset_no_sanitize_memory()
Add a wrapper for memset() that prevents unpoisoning. This is useful for filling memory allocator redzones. Signed-off-by: Ilya Leoshkevich --- include/linux/kmsan.h | 13 + 1 file changed, 13 insertions(+) diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index 23de1b3d6aee..5f50885f2023 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -255,6 +255,14 @@ void kmsan_enable_current(void); */ void kmsan_disable_current(void); +/* + * memset_no_sanitize_memory(): memset() without KMSAN instrumentation. + */ +static inline void *memset_no_sanitize_memory(void *s, int c, size_t n) +{ + return __memset(s, c, n); +} + #else static inline void kmsan_init_shadow(void) @@ -362,6 +370,11 @@ static inline void kmsan_disable_current(void) { } +static inline void *memset_no_sanitize_memory(void *s, int c, size_t n) +{ + return memset(s, c, n); +} + #endif #endif /* _LINUX_KMSAN_H */ -- 2.45.1
[PATCH v5 25/37] s390/cpacf: Unpoison the results of cpacf_trng()
Prevent KMSAN from complaining about buffers filled by cpacf_trng() being uninitialized. Tested-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/cpacf.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/s390/include/asm/cpacf.h b/arch/s390/include/asm/cpacf.h index c786538e397c..dae8843b164f 100644 --- a/arch/s390/include/asm/cpacf.h +++ b/arch/s390/include/asm/cpacf.h @@ -12,6 +12,7 @@ #define _ASM_S390_CPACF_H #include +#include /* * Instruction opcodes for the CPACF instructions @@ -542,6 +543,8 @@ static inline void cpacf_trng(u8 *ucbuf, unsigned long ucbuf_len, : [ucbuf] "+&d" (u.pair), [cbuf] "+&d" (c.pair) : [fc] "K" (CPACF_PRNO_TRNG), [opc] "i" (CPACF_PRNO) : "cc", "memory", "0"); + kmsan_unpoison_memory(ucbuf, ucbuf_len); + kmsan_unpoison_memory(cbuf, cbuf_len); } /** -- 2.45.1
[PATCH v5 13/37] kmsan: Support SLAB_POISON
Avoid false KMSAN negatives with SLUB_DEBUG by allowing kmsan_slab_free() to poison the freed memory, and by preventing init_object() from unpoisoning new allocations by using __memset(). There are two alternatives to this approach. First, init_object() can be marked with __no_sanitize_memory. This annotation should be used with great care, because it drops all instrumentation from the function, and any shadow writes will be lost. Even though this is not a concern with the current init_object() implementation, this may change in the future. Second, kmsan_poison_memory() calls may be added after memset() calls. The downside is that init_object() is called from free_debug_processing(), in which case poisoning will erase the distinction between simply uninitialized memory and UAF. Signed-off-by: Ilya Leoshkevich --- mm/kmsan/hooks.c | 2 +- mm/slub.c| 15 +++ 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index 267d0afa2e8b..26d86dfdc819 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -74,7 +74,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object) return; /* RCU slabs could be legally used after free within the RCU period */ - if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON))) + if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) return; /* * If there's a constructor, freed memory must remain in the same state diff --git a/mm/slub.c b/mm/slub.c index 1373ac365a46..1134091abac5 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1139,7 +1139,13 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) unsigned int poison_size = s->object_size; if (s->flags & SLAB_RED_ZONE) { - memset(p - s->red_left_pad, val, s->red_left_pad); + /* +* Here and below, avoid overwriting the KMSAN shadow. Keeping +* the shadow makes it possible to distinguish uninit-value +* from use-after-free. 
+*/ + memset_no_sanitize_memory(p - s->red_left_pad, val, + s->red_left_pad); if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) { /* @@ -1152,12 +1158,13 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) } if (s->flags & __OBJECT_POISON) { - memset(p, POISON_FREE, poison_size - 1); - p[poison_size - 1] = POISON_END; + memset_no_sanitize_memory(p, POISON_FREE, poison_size - 1); + memset_no_sanitize_memory(p + poison_size - 1, POISON_END, 1); } if (s->flags & SLAB_RED_ZONE) - memset(p + poison_size, val, s->inuse - poison_size); + memset_no_sanitize_memory(p + poison_size, val, + s->inuse - poison_size); } static void restore_bytes(struct kmem_cache *s, char *message, u8 data, -- 2.45.1
[PATCH v5 10/37] kmsan: Export panic_on_kmsan
When building the kmsan test as a module, modpost fails with the following error message: ERROR: modpost: "panic_on_kmsan" [mm/kmsan/kmsan_test.ko] undefined! Export panic_on_kmsan in order to improve the KMSAN usability for modules. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/report.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c index 02736ec757f2..c79d3b0d2d0d 100644 --- a/mm/kmsan/report.c +++ b/mm/kmsan/report.c @@ -20,6 +20,7 @@ static DEFINE_RAW_SPINLOCK(kmsan_report_lock); /* Protected by kmsan_report_lock */ static char report_local_descr[DESCR_SIZE]; int panic_on_kmsan __read_mostly; +EXPORT_SYMBOL_GPL(panic_on_kmsan); #ifdef MODULE_PARAM_PREFIX #undef MODULE_PARAM_PREFIX -- 2.45.1
[PATCH v5 07/37] kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush()
The value assigned to prot is immediately overwritten on the next line with PAGE_KERNEL. The right hand side of the assignment has no side-effects. Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations") Suggested-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/shadow.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c index b9d05aff313e..2d57408c78ae 100644 --- a/mm/kmsan/shadow.c +++ b/mm/kmsan/shadow.c @@ -243,7 +243,6 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end, s_pages[i] = shadow_page_for(pages[i]); o_pages[i] = origin_page_for(pages[i]); } - prot = __pgprot(pgprot_val(prot) | _PAGE_NX); prot = PAGE_KERNEL; origin_start = vmalloc_meta((void *)start, KMSAN_META_ORIGIN); -- 2.45.1
[PATCH v5 21/37] s390/boot: Turn off KMSAN
All other sanitizers are disabled for boot as well. While at it, add a comment explaining why we need this. Reviewed-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/Makefile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile index 070c9b2e905f..526ed20b9d31 100644 --- a/arch/s390/boot/Makefile +++ b/arch/s390/boot/Makefile @@ -3,11 +3,13 @@ # Makefile for the linux s390-specific parts of the memory manager. # +# Tooling runtimes are unavailable and cannot be linked for early boot code KCOV_INSTRUMENT := n GCOV_PROFILE := n UBSAN_SANITIZE := n KASAN_SANITIZE := n KCSAN_SANITIZE := n +KMSAN_SANITIZE := n KBUILD_AFLAGS := $(KBUILD_AFLAGS_DECOMPRESSOR) KBUILD_CFLAGS := $(KBUILD_CFLAGS_DECOMPRESSOR) -- 2.45.1
[PATCH v5 24/37] s390/checksum: Add a KMSAN check
Add a KMSAN check to the CKSM inline assembly, similar to how it was done for ASAN in commit e42ac7789df6 ("s390/checksum: always use cksm instruction"). Acked-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/checksum.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/include/asm/checksum.h b/arch/s390/include/asm/checksum.h index b89159591ca0..46f5c9660616 100644 --- a/arch/s390/include/asm/checksum.h +++ b/arch/s390/include/asm/checksum.h @@ -13,6 +13,7 @@ #define _S390_CHECKSUM_H #include +#include #include static inline __wsum cksm(const void *buff, int len, __wsum sum) @@ -23,6 +24,7 @@ static inline __wsum cksm(const void *buff, int len, __wsum sum) }; instrument_read(buff, len); + kmsan_check_memory(buff, len); asm volatile("\n" "0: cksm%[sum],%[rp]\n" " jo 0b\n" -- 2.45.1
[PATCH v5 11/37] kmsan: Allow disabling KMSAN checks for the current task
Like for KASAN, it's useful to temporarily disable KMSAN checks around, e.g., redzone accesses. Introduce kmsan_disable_current() and kmsan_enable_current(), which are similar to their KASAN counterparts. Make them reentrant in order to handle memory allocations in interrupt context. Repurpose the allow_reporting field for this. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- Documentation/dev-tools/kmsan.rst | 11 +-- include/linux/kmsan.h | 24 include/linux/kmsan_types.h | 2 +- mm/kmsan/core.c | 1 - mm/kmsan/hooks.c | 18 +++--- mm/kmsan/report.c | 7 --- tools/objtool/check.c | 2 ++ 7 files changed, 55 insertions(+), 10 deletions(-) diff --git a/Documentation/dev-tools/kmsan.rst b/Documentation/dev-tools/kmsan.rst index 323eedad53cd..6a48d96c5c85 100644 --- a/Documentation/dev-tools/kmsan.rst +++ b/Documentation/dev-tools/kmsan.rst @@ -110,6 +110,13 @@ in the Makefile. Think of this as applying ``__no_sanitize_memory`` to every function in the file or directory. Most users won't need KMSAN_SANITIZE, unless their code gets broken by KMSAN (e.g. runs at early boot time). +KMSAN checks can also be temporarily disabled for the current task using +``kmsan_disable_current()`` and ``kmsan_enable_current()`` calls. Each +``kmsan_enable_current()`` call must be preceded by a +``kmsan_disable_current()`` call; these call pairs may be nested. One needs to +be careful with these calls, keeping the regions short and preferring other +ways to disable instrumentation, where possible. + Support === @@ -338,11 +345,11 @@ Per-task KMSAN state Every task_struct has an associated KMSAN task state that holds the KMSAN -context (see above) and a per-task flag disallowing KMSAN reports:: +context (see above) and a per-task counter disallowing KMSAN reports:: struct kmsan_context { ... -bool allow_reporting; +unsigned int depth; struct kmsan_context_state cstate; ... 
} diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index fe6c2212bdb1..23de1b3d6aee 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -239,6 +239,22 @@ void kmsan_unpoison_entry_regs(const struct pt_regs *regs); */ void *kmsan_get_metadata(void *addr, bool is_origin); +/* + * kmsan_enable_current(): Enable KMSAN for the current task. + * + * Each kmsan_enable_current() current call must be preceded by a + * kmsan_disable_current() call. These call pairs may be nested. + */ +void kmsan_enable_current(void); + +/* + * kmsan_disable_current(): Disable KMSAN for the current task. + * + * Each kmsan_disable_current() current call must be followed by a + * kmsan_enable_current() call. These call pairs may be nested. + */ +void kmsan_disable_current(void); + #else static inline void kmsan_init_shadow(void) @@ -338,6 +354,14 @@ static inline void kmsan_unpoison_entry_regs(const struct pt_regs *regs) { } +static inline void kmsan_enable_current(void) +{ +} + +static inline void kmsan_disable_current(void) +{ +} + #endif #endif /* _LINUX_KMSAN_H */ diff --git a/include/linux/kmsan_types.h b/include/linux/kmsan_types.h index 929287981afe..dfc59918b3c0 100644 --- a/include/linux/kmsan_types.h +++ b/include/linux/kmsan_types.h @@ -31,7 +31,7 @@ struct kmsan_context_state { struct kmsan_ctx { struct kmsan_context_state cstate; int kmsan_in_runtime; - bool allow_reporting; + unsigned int depth; }; #endif /* _LINUX_KMSAN_TYPES_H */ diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c index 95f859e38c53..81b0711a 100644 --- a/mm/kmsan/core.c +++ b/mm/kmsan/core.c @@ -43,7 +43,6 @@ void kmsan_internal_task_create(struct task_struct *task) struct thread_info *info = current_thread_info(); __memset(ctx, 0, sizeof(*ctx)); - ctx->allow_reporting = true; kmsan_internal_unpoison_memory(info, sizeof(*info), false); } diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index b408714f9ba3..267d0afa2e8b 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -39,12 +39,10 @@ 
void kmsan_task_create(struct task_struct *task) void kmsan_task_exit(struct task_struct *task) { - struct kmsan_ctx *ctx = &task->kmsan_ctx; - if (!kmsan_enabled || kmsan_in_runtime()) return; - ctx->allow_reporting = false; + kmsan_disable_current(); } void kmsan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags) @@ -424,3 +422,17 @@ void kmsan_check_memory(const void *addr, size_t size) REASON_ANY); } EXPORT_SYMBOL(kmsan_check_memory); + +void kmsan_enable_current(void) +{ + KMSAN_WARN_ON(current->kmsan_ctx.depth == 0); + current->kmsan_ctx.depth--; +} +EXPORT_SYMBOL(kmsan_enable_current); + +void kmsan_disable_current(void) +{ + current->kmsan_ctx.depth++; +
[PATCH v5 03/37] kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled
KMSAN relies on memblock returning all available pages to it (see kmsan_memblock_free_pages()). It partitions these pages into 3 categories: pages available to the buddy allocator, shadow pages and origin pages. This partitioning is static. If new pages appear after kmsan_init_runtime(), it is considered an error. DEFERRED_STRUCT_PAGE_INIT causes this, so mark it as incompatible with KMSAN. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/Kconfig b/mm/Kconfig index b4cb45255a54..9791fce5d0a7 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -946,6 +946,7 @@ config DEFERRED_STRUCT_PAGE_INIT depends on SPARSEMEM depends on !NEED_PER_CPU_KM depends on 64BIT + depends on !KMSAN select PADATA help Ordinarily all struct pages are initialised during early boot in a -- 2.45.1
[PATCH v5 08/37] kmsan: Remove an x86-specific #include from kmsan.h
Replace the x86-specific asm/pgtable_64_types.h #include with the linux/pgtable.h one, which all architectures have. While at it, sort the headers alphabetically for the sake of consistency with other KMSAN code. Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core") Suggested-by: Heiko Carstens Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/kmsan.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h index a14744205435..adf443bcffe8 100644 --- a/mm/kmsan/kmsan.h +++ b/mm/kmsan/kmsan.h @@ -10,14 +10,14 @@ #ifndef __MM_KMSAN_KMSAN_H #define __MM_KMSAN_KMSAN_H -#include #include +#include +#include +#include +#include #include #include #include -#include -#include -#include #define KMSAN_ALLOCA_MAGIC_ORIGIN 0xabcd0100 #define KMSAN_CHAIN_MAGIC_ORIGIN 0xabcd0200 -- 2.45.1
Re: [PATCH vhost 17/23] vdpa/mlx5: Consolidate all VQ modify to Ready to use resume_vq()
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > There are a few more places modifying the VQ to Ready directly. Let's > consolidate them into resume_vq(). > > The redundant warnings for resume_vq() errors can also be dropped. > > There is one special case that needs to be handled for virtio-vdpa: > the initialized flag must be set to true earlier in setup_vq() so that > resume_vq() doesn't return early. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++ > 1 file changed, 6 insertions(+), 12 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index e3a82c43b44e..f5d5b25cdb01 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -160,6 +160,7 @@ static void free_fixed_resources(struct mlx5_vdpa_net > *ndev); > static void init_mvqs(struct mlx5_vdpa_net *ndev); > static int setup_vq_resources(struct mlx5_vdpa_net *ndev, bool filled); > static void teardown_vq_resources(struct mlx5_vdpa_net *ndev); > +static int resume_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue > *mvq); > > static bool mlx5_vdpa_debug; > > @@ -1500,16 +1501,14 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, > if (err) > goto err_vq; > > + mvq->initialized = true; > + > if (mvq->ready) { > - err = modify_virtqueue_state(ndev, mvq, > MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY); > - if (err) { > - mlx5_vdpa_warn(&ndev->mvdev, "failed to modify to > ready vq idx %d(%d)\n", > - idx, err); > + err = resume_vq(ndev, mvq); > + if (err) > goto err_modify; > - } > } > > - mvq->initialized = true; > return 0; > > err_modify: > @@ -2422,7 +2421,6 @@ static void mlx5_vdpa_set_vq_ready(struct vdpa_device > *vdev, u16 idx, bool ready > struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); > struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); > struct mlx5_vdpa_virtqueue *mvq; > - int err; > > if (!mvdev->actual_features) > 
return; > @@ -2439,14 +2437,10 @@ static void mlx5_vdpa_set_vq_ready(struct vdpa_device > *vdev, u16 idx, bool ready > if (!ready) { > suspend_vq(ndev, mvq); > } else { > - err = modify_virtqueue_state(ndev, mvq, > MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY); > - if (err) { > - mlx5_vdpa_warn(mvdev, "modify VQ %d to ready failed > (%d)\n", idx, err); > + if (resume_vq(ndev, mvq)) > ready = false; > - } > } > > - > mvq->ready = ready; > } > > > -- > 2.45.1 >
Re: [PATCH vhost 16/23] vdpa/mlx5: Add error code for suspend/resume VQ
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > Instead of blindly calling suspend/resume_vqs(), make then return error > codes. > > To keep compatibility, keep suspending or resuming VQs on error and > return the last error code. The assumption here is that the error code > would be the same. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 77 > +++ > 1 file changed, 54 insertions(+), 23 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index e4d68d2d0bb4..e3a82c43b44e 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -1526,71 +1526,102 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, > return err; > } > > -static void suspend_vq(struct mlx5_vdpa_net *ndev, struct > mlx5_vdpa_virtqueue *mvq) > +static int suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue > *mvq) > { > struct mlx5_virtq_attr attr; > + int err; > > if (!mvq->initialized) > - return; > + return 0; > > if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY) > - return; > + return 0; > > - if (modify_virtqueue_state(ndev, mvq, > MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND)) > - mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n"); > + err = modify_virtqueue_state(ndev, mvq, > MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND); > + if (err) { > + mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed, err: > %d\n", err); > + return err; > + } > > - if (query_virtqueue(ndev, mvq, &attr)) { > - mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n"); > - return; > + err = query_virtqueue(ndev, mvq, &attr); > + if (err) { > + mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue, err: > %d\n", err); > + return err; > } > + > mvq->avail_idx = attr.available_index; > mvq->used_idx = attr.used_index; > + > + return 0; > } > > -static void suspend_vqs(struct mlx5_vdpa_net *ndev) > +static int 
suspend_vqs(struct mlx5_vdpa_net *ndev) > { > + int err = 0; > int i; > > - for (i = 0; i < ndev->cur_num_vqs; i++) > - suspend_vq(ndev, &ndev->vqs[i]); > + for (i = 0; i < ndev->cur_num_vqs; i++) { > + int local_err = suspend_vq(ndev, &ndev->vqs[i]); > + > + err = local_err ? local_err : err; > + } > + > + return err; > } > > -static void resume_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue > *mvq) > +static int resume_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue > *mvq) > { > + int err; > + > if (!mvq->initialized) > - return; > + return 0; > > switch (mvq->fw_state) { > case MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT: > /* Due to a FW quirk we need to modify the VQ fields first > then change state. > * This should be fixed soon. After that, a single command > can be used. > */ > - if (modify_virtqueue(ndev, mvq, 0)) > + err = modify_virtqueue(ndev, mvq, 0); > + if (err) { > mlx5_vdpa_warn(&ndev->mvdev, > - "modify vq properties failed for vq %u\n", > mvq->index); > + "modify vq properties failed for vq %u, err: > %d\n", > + mvq->index, err); > + return err; > + } > break; > case MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND: > if (!is_resumable(ndev)) { > mlx5_vdpa_warn(&ndev->mvdev, "vq %d is not > resumable\n", mvq->index); > - return; > + return -EINVAL; > } > break; > case MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY: > - return; > + return 0; > default: > mlx5_vdpa_warn(&ndev->mvdev, "resume vq %u called from bad > state %d\n", >mvq->index, mvq->fw_state); > - return; > + return -EINVAL; > } > > - if (modify_virtqueue_state(ndev, mvq, > MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY)) > - mlx5_vdpa_warn(&ndev->mvdev, "modify to resume failed for vq > %u\n", mvq->index); > + err = modify_virtqueue_state(ndev, mvq, > MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY); > + if (err) > + mlx5_vdpa_warn(&ndev->mvdev, "modify to resume failed for vq > %u, err: %d\n", > + mvq->index, err); > + > + return err; > } > > -static void resume_vqs(st
Re: [PATCH v7 2/5] remoteproc: Add TEE support
Hi, On Tue, Jun 11, 2024 at 09:39:01AM +0200, Arnaud Pouliquen wrote: > Add a remoteproc TEE (Trusted Execution Environment) driver > that will be probed by the TEE bus. If the associated Trusted > application is supported on secure part this driver offers a client > interface to load a firmware in the secure part. > This firmware could be authenticated by the secure trusted application. > > Signed-off-by: Arnaud Pouliquen > --- > update from V > - Fix missing "{" in tee_rproc_find_loaded_rsc_table inline definition. > > update from V5 > - make tee_rproc_get_loaded_rsc_table() local and replace this API by > tee_rproc_find_loaded_rsc_table() > - map and unmap the resource table in tee_rproc_parse_fw to make a cached copy > - use the new rproc_pa_to_va() API to map the resource table memory declared > in carevout > - remove tee_rproc_release_loaded_rsc_table as no more used. > --- > drivers/remoteproc/Kconfig | 10 + > drivers/remoteproc/Makefile | 1 + > drivers/remoteproc/tee_remoteproc.c | 451 > include/linux/remoteproc.h | 4 + > include/linux/tee_remoteproc.h | 100 ++ > 5 files changed, 566 insertions(+) > create mode 100644 drivers/remoteproc/tee_remoteproc.c > create mode 100644 include/linux/tee_remoteproc.h > > diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig > index 48845dc8fa85..6c1c07202276 100644 > --- a/drivers/remoteproc/Kconfig > +++ b/drivers/remoteproc/Kconfig > @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC > > It's safe to say N if not interested in using RPU r5f cores. > > + > +config TEE_REMOTEPROC > + tristate "Remoteproc support by a TEE application" > + depends on OPTEE > + help > + Support a remote processor with a TEE application. The Trusted > + Execution Context is responsible for loading the trusted firmware > + image and managing the remote processor's lifecycle. > + This can be either built-in or a loadable module. 
> + > endif # REMOTEPROC > > endmenu > diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile > index 91314a9b43ce..fa8daebce277 100644 > --- a/drivers/remoteproc/Makefile > +++ b/drivers/remoteproc/Makefile > @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o > obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o > obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o > obj-$(CONFIG_STM32_RPROC)+= stm32_rproc.o > +obj-$(CONFIG_TEE_REMOTEPROC) += tee_remoteproc.o > obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o > obj-$(CONFIG_TI_K3_R5_REMOTEPROC)+= ti_k3_r5_remoteproc.o > obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o > diff --git a/drivers/remoteproc/tee_remoteproc.c > b/drivers/remoteproc/tee_remoteproc.c > new file mode 100644 > index ..9455fd9d0d2d > --- /dev/null > +++ b/drivers/remoteproc/tee_remoteproc.c > @@ -0,0 +1,451 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved > + * Author: Arnaud Pouliquen > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "remoteproc_internal.h" > + > +#define MAX_TEE_PARAM_ARRY_MEMBER4 > + > +/* > + * Authentication of the firmware and load in the remote processor memory > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + * [in] params[1].memref: buffer containing the image of the > buffer > + */ > +#define TA_RPROC_FW_CMD_LOAD_FW 1 > + > +/* > + * Start the remote processor > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + */ > +#define TA_RPROC_FW_CMD_START_FW 2 > + > +/* > + * Stop the remote processor > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + */ > +#define TA_RPROC_FW_CMD_STOP_FW 3 > + > +/* > + * Return the address of the resource table, or 0 if not found > + * No check is done to verify that the address returned is accessible by > 
+ * the non secure context. If the resource table is loaded in a protected > + * memory the access by the non secure context will lead to a data abort. > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + * [out] params[1].value.a: 32bit LSB resource table memory address > + * [out] params[1].value.b: 32bit MSB resource table memory address > + * [out] params[2].value.a: 32bit LSB resource table memory size > + * [out] params[2].value.b: 32bit MSB resource table memory size > + */ > +#define TA_RPROC_FW_CMD_GET_RSC_TABLE4 > + > +/* > + * Return the address of the core dump > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + * [out] params[1].memref: address of the core dump image if exist, > + * else return
Re: [PATCH v2] x86/paravirt: Disable virt spinlock on bare metal
On 19.06.24 г. 18:25 ч., Chen Yu wrote: Hi Nikolay, On 2024-06-18 at 11:24:42 +0300, Nikolay Borisov wrote: On 26.05.24 г. 4:58 ч., Chen Yu wrote: The kernel can change spinlock behavior when running as a guest. But this guest-friendly behavior causes performance problems on bare metal. So there's a 'virt_spin_lock_key' static key to switch between the two modes. The static key is always enabled by default (run in guest mode) and should be disabled for bare metal (and in some guests that want native behavior). Performance drop is reported when running encode/decode workload and BenchSEE cache sub-workload. Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is disabled the virt_spin_lock_key is incorrectly set to true on bare metal. The qspinlock degenerates to test-and-set spinlock, which decrease the performance on bare metal. Fix this by disabling virt_spin_lock_key if it is on bare metal, regardless of CONFIG_PARAVIRT_SPINLOCKS. nit: This bug wouldn't have happened if the key was defined FALSE by default and only enabled in the appropriate case. I think it makes more sense to invert the logic and have the key FALSE by default and only enable it iff the kernel is running under a hypervisor... At worst only the virtualization case would suffer if the lock is falsely not enabled. Thank you for your review. I agree, initializing the key to FALSE by default seems to be more readable. Could this change be the subsequent adjustment based on current fix, which could be more bisectible? Why can't this change be squashed in the current proposed patch? Set the default key to false. If booting in a VM, enable this key. Later during the VM initialization, if other high-efficient spinlock is preferred, like paravirt-spinlock, the virt_spin_lock_key will be disabled accordingly. 
Yep, or simply during the initialization stage the correct flavor will be chosen, no need to do the on-off dance. But that's a topic for a different discussion. diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index cde8357bb226..a7d3ba00e70e 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -66,13 +66,13 @@ static inline bool vcpu_is_preempted(long cpu) #ifdef CONFIG_PARAVIRT /* - * virt_spin_lock_key - enables (by default) the virt_spin_lock() hijack. + * virt_spin_lock_key - disables (by default) the virt_spin_lock() hijack. * * Native (and PV wanting native due to vCPU pinning) should disable this key. * It is done in this backwards fashion to only have a single direction change, * which removes ordering between native_pv_spin_init() and HV setup. */ -DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key); +DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key); /* * Shortcut for the queued_spin_lock_slowpath() function that allows diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index c193c9e60a1b..fec381533555 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -51,12 +51,12 @@ DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", .noinstr.text); DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text); #endif -DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); +DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key); void __init native_pv_lock_init(void) { - if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) - static_branch_disable(&virt_spin_lock_key); + if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) + static_branch_enable(&virt_spin_lock_key); } static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
Re: [PATCH vhost 13/23] vdpa/mlx5: Set mkey modified flags on all VQs
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote:
>
> Otherwise, when virtqueues are moved from INIT to READY the latest mkey
> will not be set appropriately.
>
> Signed-off-by: Dragos Tatulea
> Reviewed-by: Cosmin Ratiu

Acked-by: Eugenio Pérez

> ---
> drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 0201c6fe61e1..0abe01fd20e9 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -2868,7 +2868,7 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev *mvdev,
>
>         mlx5_vdpa_update_mr(mvdev, new_mr, asid);
>
> -       for (int i = 0; i < ndev->cur_num_vqs; i++)
> +       for (int i = 0; i < mvdev->max_vqs; i++)
>                 ndev->vqs[i].modified_fields |= MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY |
>                                                 MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY;
>
> --
> 2.45.1
>
Re: [PATCH vhost 12/23] vdpa/mlx5: Start off rqt_size with max VQPs
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > Currently rqt_size is initialized during device flag configuration. > That's because it is the earliest moment when device knows if MQ > (multi queue) is on or off. > > Shift this configuration earlier to device creation time. This implies > that non-MQ devices will have a larger RQT size. But the configuration > will still be correct. > > This is done in preparation for the pre-creation of hardware virtqueues > at device add time. When that change will be added, RQT will be created > at device creation time so it needs to be initialized to its max size. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 10 +- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 1181e0ac3671..0201c6fe61e1 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -2731,10 +2731,6 @@ static int mlx5_vdpa_set_driver_features(struct > vdpa_device *vdev, u64 features) > return err; > > ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features; > - if (ndev->mvdev.actual_features & BIT_ULL(VIRTIO_NET_F_MQ)) > - ndev->rqt_size = mlx5vdpa16_to_cpu(mvdev, > ndev->config.max_virtqueue_pairs); > - else > - ndev->rqt_size = 1; > > /* Interested in changes of vq features only. 
*/ > if (get_features(old_features) != > get_features(mvdev->actual_features)) { > @@ -3719,8 +3715,12 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev > *v_mdev, const char *name, > goto err_alloc; > } > > - if (device_features & BIT_ULL(VIRTIO_NET_F_MQ)) > + if (device_features & BIT_ULL(VIRTIO_NET_F_MQ)) { > config->max_virtqueue_pairs = cpu_to_mlx5vdpa16(mvdev, > max_vqs / 2); > + ndev->rqt_size = max_vqs / 2; > + } else { > + ndev->rqt_size = 1; > + } > > ndev->mvdev.mlx_features = device_features; > mvdev->vdev.dma_dev = &mdev->pdev->dev; > > -- > 2.45.1 >
Re: [PATCH v2] x86/paravirt: Disable virt spinlock on bare metal
Hi Nikolay, On 2024-06-18 at 11:24:42 +0300, Nikolay Borisov wrote: > > > On 26.05.24 г. 4:58 ч., Chen Yu wrote: > > The kernel can change spinlock behavior when running as a guest. But > > this guest-friendly behavior causes performance problems on bare metal. > > So there's a 'virt_spin_lock_key' static key to switch between the two > > modes. > > > > The static key is always enabled by default (run in guest mode) and > > should be disabled for bare metal (and in some guests that want native > > behavior). > > > > Performance drop is reported when running encode/decode workload and > > BenchSEE cache sub-workload. > > Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused > > native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS > > is disabled the virt_spin_lock_key is incorrectly set to true on bare > > metal. The qspinlock degenerates to test-and-set spinlock, which > > decrease the performance on bare metal. > > > > Fix this by disabling virt_spin_lock_key if it is on bare metal, > > regardless of CONFIG_PARAVIRT_SPINLOCKS. > > > > nit: > > This bug wouldn't have happened if the key was defined FALSE by default and > only enabled in the appropriate case. I think it makes more sense to invert > the logic and have the key FALSE by default and only enable it iff the > kernel is running under a hypervisor... At worst only the virtualization > case would suffer if the lock is falsely not enabled. Thank you for your review. I agree, initializing the key to FALSE by default seems to be more readable. Could this change be the subsequent adjustment based on current fix, which could be more bisectible? Set the default key to false. If booting in a VM, enable this key. Later during the VM initialization, if other high-efficient spinlock is preferred, like paravirt-spinlock, the virt_spin_lock_key will be disabled accordingly. 
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index cde8357bb226..a7d3ba00e70e 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -66,13 +66,13 @@ static inline bool vcpu_is_preempted(long cpu) #ifdef CONFIG_PARAVIRT /* - * virt_spin_lock_key - enables (by default) the virt_spin_lock() hijack. + * virt_spin_lock_key - disables (by default) the virt_spin_lock() hijack. * * Native (and PV wanting native due to vCPU pinning) should disable this key. * It is done in this backwards fashion to only have a single direction change, * which removes ordering between native_pv_spin_init() and HV setup. */ -DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key); +DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key); /* * Shortcut for the queued_spin_lock_slowpath() function that allows diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index c193c9e60a1b..fec381533555 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -51,12 +51,12 @@ DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", .noinstr.text); DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text); #endif -DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); +DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key); void __init native_pv_lock_init(void) { - if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) - static_branch_disable(&virt_spin_lock_key); + if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) + static_branch_enable(&virt_spin_lock_key); } static void native_tlb_remove_table(struct mmu_gather *tlb, void *table) -- 2.25.1
Re: [PATCH vhost 11/23] vdpa/mlx5: Set an initial size on the VQ
On Mon, Jun 17, 2024 at 5:09 PM Dragos Tatulea wrote: > > The virtqueue size is a pre-requisite for setting up any virtqueue > resources. For the upcoming optimization of creating virtqueues at > device add, the virtqueue size has to be configured. > > Store the default queue size in struct mlx5_vdpa_net to make it easy in > the future to pre-configure this default value via vdpa tool. > > The queue size check in setup_vq() will always be false. So remove it. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 --- > drivers/vdpa/mlx5/net/mlx5_vnet.h | 1 + > 2 files changed, 5 insertions(+), 3 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 245b5dac98d3..1181e0ac3671 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -58,6 +58,8 @@ MODULE_LICENSE("Dual BSD/GPL"); > */ > #define MLX5V_DEFAULT_VQ_COUNT 2 > > +#define MLX5V_DEFAULT_VQ_SIZE 256 > + > struct mlx5_vdpa_cq_buf { > struct mlx5_frag_buf_ctrl fbc; > struct mlx5_frag_buf frag_buf; > @@ -1445,9 +1447,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, struct > mlx5_vdpa_virtqueue *mvq) > u16 idx = mvq->index; > int err; > > - if (!mvq->num_ent) > - return 0; > - > if (mvq->initialized) > return 0; > > @@ -3523,6 +3522,7 @@ static void init_mvqs(struct mlx5_vdpa_net *ndev) > mvq->ndev = ndev; > mvq->fwqp.fw = true; > mvq->fw_state = MLX5_VIRTIO_NET_Q_OBJECT_NONE; > + mvq->num_ent = ndev->default_queue_size; > } > } > > @@ -3660,6 +3660,7 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev > *v_mdev, const char *name, > goto err_alloc; > } > ndev->cur_num_vqs = MLX5V_DEFAULT_VQ_COUNT; > + ndev->default_queue_size = MLX5V_DEFAULT_VQ_SIZE; > > init_mvqs(ndev); > allocate_irqs(ndev); > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.h > b/drivers/vdpa/mlx5/net/mlx5_vnet.h > index 90b556a57971..2ada29767cc5 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.h > 
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.h > @@ -58,6 +58,7 @@ struct mlx5_vdpa_net { > bool setup; > u32 cur_num_vqs; > u32 rqt_size; > + u16 default_queue_size; It seems to me this is only assigned here and not used in the rest of the series, why allocate a member here instead of using macro directly? > bool nb_registered; > struct notifier_block nb; > struct vdpa_callback config_cb; > > -- > 2.45.1 >
Re: [PATCH] trace: riscv: Remove deprecated kprobe on ftrace support
On Thu, 13 Jun 2024 19:13:47 +0800 Jinjie Ruan wrote: > Since commit 7caa9765465f60 ("ftrace: riscv: move from REGS to ARGS"), > kprobe on ftrace is not supported by riscv, because riscv's support for > FTRACE_WITH_REGS has been replaced with support for FTRACE_WITH_ARGS, and > KPROBES_ON_FTRACE will be supplanted by FPROBES. So remove the deprecated > kprobe on ftrace support, which is misunderstood. > > Signed-off-by: Jinjie Ruan No problem, Now we have fprobe instead. Anyway, this looks good to me. Acked-by: Masami Hiramatsu (Google) Thank you, > --- > arch/riscv/Kconfig| 1 - > arch/riscv/kernel/probes/Makefile | 1 - > arch/riscv/kernel/probes/ftrace.c | 65 --- > 3 files changed, 67 deletions(-) > delete mode 100644 arch/riscv/kernel/probes/ftrace.c > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > index 0525ee2d63c7..a1f2d604c459 100644 > --- a/arch/riscv/Kconfig > +++ b/arch/riscv/Kconfig > @@ -149,7 +149,6 @@ config RISCV > select HAVE_KERNEL_UNCOMPRESSED if !XIP_KERNEL && !EFI_ZBOOT > select HAVE_KERNEL_ZSTD if !XIP_KERNEL && !EFI_ZBOOT > select HAVE_KPROBES if !XIP_KERNEL > - select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL > select HAVE_KRETPROBES if !XIP_KERNEL > # https://github.com/ClangBuiltLinux/linux/issues/1881 > select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD > diff --git a/arch/riscv/kernel/probes/Makefile > b/arch/riscv/kernel/probes/Makefile > index 8265ff497977..d2129f2c61b8 100644 > --- a/arch/riscv/kernel/probes/Makefile > +++ b/arch/riscv/kernel/probes/Makefile > @@ -1,7 +1,6 @@ > # SPDX-License-Identifier: GPL-2.0 > obj-$(CONFIG_KPROBES)+= kprobes.o decode-insn.o > simulate-insn.o > obj-$(CONFIG_RETHOOK)+= rethook.o rethook_trampoline.o > -obj-$(CONFIG_KPROBES_ON_FTRACE) += ftrace.o > obj-$(CONFIG_UPROBES)+= uprobes.o decode-insn.o > simulate-insn.o > CFLAGS_REMOVE_simulate-insn.o = $(CC_FLAGS_FTRACE) > CFLAGS_REMOVE_rethook.o = $(CC_FLAGS_FTRACE) > diff --git a/arch/riscv/kernel/probes/ftrace.c > 
b/arch/riscv/kernel/probes/ftrace.c > deleted file mode 100644 > index a69dfa610aa8.. > --- a/arch/riscv/kernel/probes/ftrace.c > +++ /dev/null > @@ -1,65 +0,0 @@ > -// SPDX-License-Identifier: GPL-2.0 > - > -#include > - > -/* Ftrace callback handler for kprobes -- called under preepmt disabled */ > -void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip, > -struct ftrace_ops *ops, struct ftrace_regs *fregs) > -{ > - struct kprobe *p; > - struct pt_regs *regs; > - struct kprobe_ctlblk *kcb; > - int bit; > - > - if (unlikely(kprobe_ftrace_disabled)) > - return; > - > - bit = ftrace_test_recursion_trylock(ip, parent_ip); > - if (bit < 0) > - return; > - > - p = get_kprobe((kprobe_opcode_t *)ip); > - if (unlikely(!p) || kprobe_disabled(p)) > - goto out; > - > - regs = ftrace_get_regs(fregs); > - kcb = get_kprobe_ctlblk(); > - if (kprobe_running()) { > - kprobes_inc_nmissed_count(p); > - } else { > - unsigned long orig_ip = instruction_pointer(regs); > - > - instruction_pointer_set(regs, ip); > - > - __this_cpu_write(current_kprobe, p); > - kcb->kprobe_status = KPROBE_HIT_ACTIVE; > - if (!p->pre_handler || !p->pre_handler(p, regs)) { > - /* > - * Emulate singlestep (and also recover regs->pc) > - * as if there is a nop > - */ > - instruction_pointer_set(regs, > - (unsigned long)p->addr + MCOUNT_INSN_SIZE); > - if (unlikely(p->post_handler)) { > - kcb->kprobe_status = KPROBE_HIT_SSDONE; > - p->post_handler(p, regs, 0); > - } > - instruction_pointer_set(regs, orig_ip); > - } > - > - /* > - * If pre_handler returns !0, it changes regs->pc. We have to > - * skip emulating post_handler. > - */ > - __this_cpu_write(current_kprobe, NULL); > - } > -out: > - ftrace_test_recursion_unlock(bit); > -} > -NOKPROBE_SYMBOL(kprobe_ftrace_handler); > - > -int arch_prepare_kprobe_ftrace(struct kprobe *p) > -{ > - p->ainsn.api.insn = NULL; > - return 0; > -} > -- > 2.34.1 > -- Masami Hiramatsu (Google)
Re: [PATCH] fgraph: Use str_plural() in test_graph_storage_single()
On Tue, 18 Jun 2024 15:20:14 +0800 Jiapeng Chong wrote:

> Use existing str_plural() function rather than duplicating its
> implementation.
>
> ./kernel/trace/trace_selftest.c:880:56-60: opportunity for str_plural(size).
>
> Reported-by: Abaci Robot
> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9349
> Signed-off-by: Jiapeng Chong

Thanks, this looks good to me.

Acked-by: Masami Hiramatsu (Google)

> ---
> kernel/trace/trace_selftest.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> index adf0f436d84b..97f1e4bc47dc 100644
> --- a/kernel/trace/trace_selftest.c
> +++ b/kernel/trace/trace_selftest.c
> @@ -877,7 +877,7 @@ static int __init test_graph_storage_single(struct fgraph_fixture *fixture)
>         int ret;
>
>         pr_cont("PASSED\n");
> -       pr_info("Testing fgraph storage of %d byte%s: ", size, size > 1 ? "s" : "");
> +       pr_info("Testing fgraph storage of %d byte%s: ", size, str_plural(size));
>
>         ret = init_fgraph_fixture(fixture);
>         if (ret && ret != -ENODEV) {
> --
> 2.20.1.7.g153144c

--
Masami Hiramatsu (Google)
Re: [PATCH 1/2] vdpa: support set mac address from vdpa tool
On Tue, Jun 18, 2024 at 6:39 PM Michael S. Tsirkin wrote:
>
> On Mon, Jun 17, 2024 at 09:44:21AM -0700, Jakub Kicinski wrote:
> > On Mon, 17 Jun 2024 12:20:19 -0400 Michael S. Tsirkin wrote:
> > > > But the virtio spec doesn't allow setting the MAC...
> > > > I'm probably just lost in the conversation but there's hypervisor side
> > > > and there is user/VM side, each of them already has an interface to set
> > > > the MAC. The MAC doesn't matter, but I want to make sure my mental model
> > > > matches reality in case we start duplicating too much..
> > >
> > > An obvious part of provisioning is specifying the config space
> > > of the device.
> >
> > Agreed, that part is obvious.
> > Please go ahead, I don't really care and you clearly don't have time
> > to explain.
>
> Thanks!
> Just in case Cindy who is working on it is also confused,
> here is what I meant:
>
> - an interface to provision a device, including its config
>   space, makes sense to me
> - default mac address is part of config space, and would thus be covered
> - note how this is different from ability to tweak the mac of an existing
>   device
>

Thanks Michael, I will continue working on this.

thanks
cindy

> --
> MST
>
Re: [PATCH net-next v5 0/7] net: pass receive socket to drop tracepoint
Hello: This series was applied to netdev/net-next.git (main) by David S. Miller : On Mon, 17 Jun 2024 11:09:00 -0700 you wrote: > We set up our production packet drop monitoring around the kfree_skb > tracepoint. While this tracepoint is extremely valuable for diagnosing > critical problems, it also has some limitation with drops on the local > receive path: this tracepoint can only inspect the dropped skb itself, > but such skb might not carry enough information to: > > 1. determine in which netns/container this skb gets dropped > 2. determine by which socket/service this skb oughts to be received > > [...] Here is the summary with links: - [net-next,v5,1/7] net: add rx_sk to trace_kfree_skb https://git.kernel.org/netdev/net-next/c/c53795d48ee8 - [net-next,v5,2/7] net: introduce sk_skb_reason_drop function https://git.kernel.org/netdev/net-next/c/ba8de796baf4 - [net-next,v5,3/7] ping: use sk_skb_reason_drop to free rx packets https://git.kernel.org/netdev/net-next/c/7467de17635f - [net-next,v5,4/7] net: raw: use sk_skb_reason_drop to free rx packets https://git.kernel.org/netdev/net-next/c/ce9a2424e9da - [net-next,v5,5/7] tcp: use sk_skb_reason_drop to free rx packets https://git.kernel.org/netdev/net-next/c/46a02aa35752 - [net-next,v5,6/7] udp: use sk_skb_reason_drop to free rx packets https://git.kernel.org/netdev/net-next/c/fc0cc9248843 - [net-next,v5,7/7] af_packet: use sk_skb_reason_drop to free rx packets https://git.kernel.org/netdev/net-next/c/e2e7d78d9a25 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH v4 16/35] mm: slub: Unpoison the memchr_inv() return value
On Tue, 2024-06-18 at 16:38 +0200, Alexander Potapenko wrote: > On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich > wrote: > > > > Even though the KMSAN warnings generated by memchr_inv() are > > suppressed > > by metadata_access_enable(), its return value may still be > > poisoned. > > > > The reason is that the last iteration of memchr_inv() returns > > `*start != value ? start : NULL`, where *start is poisoned. Because > > of > > this, somewhat counterintuitively, the shadow value computed by > > visitSelectInst() is equal to `(uintptr_t)start`. > > > > The intention behind guarding memchr_inv() behind > > metadata_access_enable() is to touch poisoned metadata without > > triggering KMSAN, so unpoison its return value. > > What do you think about applying __no_kmsan_checks to these functions > instead? Ok, will do. The __no_kmsan_checks approach is already taken by "mm: kfence: Disable KMSAN when checking the canary", so we might as well be consistent in how we fix these issues.
Re: [PATCH vhost 08/23] vdpa/mlx5: Clear and reinitialize software VQ data on reset
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea wrote: > > The hardware VQ configuration is mirrored by data in struct > mlx5_vdpa_virtqueue . Instead of clearing just a few fields at reset, > fully clear the struct and initialize with the appropriate default > values. > > As clear_vqs_ready() is used only during reset, get rid of it. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 16 +++- > 1 file changed, 3 insertions(+), 13 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index c8b5c87f001d..de013b5a2815 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -2941,18 +2941,6 @@ static void teardown_vq_resources(struct mlx5_vdpa_net > *ndev) > ndev->setup = false; > } > > -static void clear_vqs_ready(struct mlx5_vdpa_net *ndev) > -{ > - int i; > - > - for (i = 0; i < ndev->mvdev.max_vqs; i++) { > - ndev->vqs[i].ready = false; > - ndev->vqs[i].modified_fields = 0; > - } > - > - ndev->mvdev.cvq.ready = false; > -} > - > static int setup_cvq_vring(struct mlx5_vdpa_dev *mvdev) > { > struct mlx5_control_vq *cvq = &mvdev->cvq; > @@ -3035,12 +3023,14 @@ static int mlx5_vdpa_compat_reset(struct vdpa_device > *vdev, u32 flags) > down_write(&ndev->reslock); > unregister_link_notifier(ndev); > teardown_vq_resources(ndev); > - clear_vqs_ready(ndev); > + init_mvqs(ndev); Nitpick / suggestion if you have to send a v2. The init_mvqs function name sounds like it can allocate stuff that needs to be released to me. But I'm very bad at naming :). Maybe something like "mvqs_set_defaults" or similar? 
> + > if (flags & VDPA_RESET_F_CLEAN_MAP) > mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); > ndev->mvdev.status = 0; > ndev->mvdev.suspended = false; > ndev->cur_num_vqs = MLX5V_DEFAULT_VQ_COUNT; > + ndev->mvdev.cvq.ready = false; > ndev->mvdev.cvq.received_desc = 0; > ndev->mvdev.cvq.completed_desc = 0; > memset(ndev->event_cbs, 0, sizeof(*ndev->event_cbs) * (mvdev->max_vqs > + 1)); > > -- > 2.45.1 >
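For illustration, the clear-then-set-defaults pattern discussed in this thread (whatever the helper ends up being named, init_mvqs or mvqs_set_defaults) can be sketched with hypothetical names; `struct vq_state` and its fields are invented for the sketch and only loosely mirror struct mlx5_vdpa_virtqueue:

```c
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mlx5_vdpa_virtqueue. */
struct vq_state {
	int index;
	bool ready;
	unsigned int modified_fields;
	void *owner;
};

/* Zero the whole struct, then restore only the identity fields and
 * non-zero defaults. Compared with resetting a handful of fields by
 * hand, this cannot miss a newly added field at reset time. */
static void vq_set_defaults(struct vq_state *vq, int index, void *owner)
{
	memset(vq, 0, sizeof(*vq));
	vq->index = index;
	vq->owner = owner;
}
```

The design point matches the patch: clearing the full struct makes reset behavior track the struct definition automatically, at the cost of having to re-set the few fields whose defaults are not zero.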
Re: [PATCH vhost 05/23] vdpa/mlx5: Iterate over active VQs during suspend/resume
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea wrote: > > No need to iterate over max number of VQs. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Acked-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 96782b34e2b2..51630b1935f4 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -1504,7 +1504,7 @@ static void suspend_vqs(struct mlx5_vdpa_net *ndev) > { > int i; > > - for (i = 0; i < ndev->mvdev.max_vqs; i++) > + for (i = 0; i < ndev->cur_num_vqs; i++) > suspend_vq(ndev, &ndev->vqs[i]); > } > > @@ -1522,7 +1522,7 @@ static void resume_vq(struct mlx5_vdpa_net *ndev, > struct mlx5_vdpa_virtqueue *mv > > static void resume_vqs(struct mlx5_vdpa_net *ndev) > { > - for (int i = 0; i < ndev->mvdev.max_vqs; i++) > + for (int i = 0; i < ndev->cur_num_vqs; i++) > resume_vq(ndev, &ndev->vqs[i]); > } > > > -- > 2.45.1 >
Re: [PATCH vhost 06/23] vdpa/mlx5: Remove duplicate suspend code
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea wrote: > > Use the dedicated suspend_vqs() function instead. > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Reviewed-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 51630b1935f4..eca6f68c2eda 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -3355,17 +3355,12 @@ static int mlx5_vdpa_suspend(struct vdpa_device *vdev) > { > struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); > struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); > - struct mlx5_vdpa_virtqueue *mvq; > - int i; > > mlx5_vdpa_info(mvdev, "suspending device\n"); > > down_write(&ndev->reslock); > unregister_link_notifier(ndev); > - for (i = 0; i < ndev->cur_num_vqs; i++) { > - mvq = &ndev->vqs[i]; > - suspend_vq(ndev, mvq); > - } > + suspend_vqs(ndev); > mlx5_vdpa_cvq_suspend(mvdev); > mvdev->suspended = true; > up_write(&ndev->reslock); > > -- > 2.45.1 >
Re: [PATCH vhost 04/23] vdpa/mlx5: Drop redundant check in teardown_virtqueues()
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea wrote: > > The check is done inside teardown_vq(). > > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu Reviewed-by: Eugenio Pérez > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 10 ++ > 1 file changed, 2 insertions(+), 8 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index b4d9ef4f66c8..96782b34e2b2 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -2559,16 +2559,10 @@ static int setup_virtqueues(struct mlx5_vdpa_dev > *mvdev) > > static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) > { > - struct mlx5_vdpa_virtqueue *mvq; > int i; > > - for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { > - mvq = &ndev->vqs[i]; > - if (!mvq->initialized) > - continue; > - > - teardown_vq(ndev, mvq); > - } > + for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) > + teardown_vq(ndev, &ndev->vqs[i]); > } > > static void update_cvq_info(struct mlx5_vdpa_dev *mvdev) > > -- > 2.45.1 >
Re: [PATCH vhost 03/23] vdpa/mlx5: Drop redundant code
On Mon, Jun 17, 2024 at 5:08 PM Dragos Tatulea wrote: Patch message suggestion: Originally, the second loop initialized the CVQ. But commit acde3929492b ("vdpa/mlx5: Use consistent RQT size") initialized all the queues in the first loop, so the second iteration in ... > > The second iteration in init_mvqs() is never called because the first > one will iterate up to max_vqs. > Either way, Acked-by: Eugenio Pérez > Signed-off-by: Dragos Tatulea > Reviewed-by: Cosmin Ratiu > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 6 -- > 1 file changed, 6 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 1ad281cbc541..b4d9ef4f66c8 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -3519,12 +3519,6 @@ static void init_mvqs(struct mlx5_vdpa_net *ndev) > mvq->fwqp.fw = true; > mvq->fw_state = MLX5_VIRTIO_NET_Q_OBJECT_NONE; > } > - for (; i < ndev->mvdev.max_vqs; i++) { > - mvq = &ndev->vqs[i]; > - memset(mvq, 0, offsetof(struct mlx5_vdpa_virtqueue, ri)); > - mvq->index = i; > - mvq->ndev = ndev; > - } > } > > struct mlx5_vdpa_mgmtdev { > > -- > 2.45.1 >