[PATCH v2] lib: objpool: fix head overrun on big.LITTLE system

2023-11-20 Thread wuqiang.matt
objpool overrun stress with test_objpool on an OrangePi 5+ SBC triggered
the following kernel warning:

WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100

This message is from objpool.c:168:

WARN_ON_ONCE(tail - head > pool->nr_objs);

The overrun test case validates the situation where pre-allocated
objects are insufficient: 8 objects are pre-allocated for each node and
the consumer thread of each node tries to grab 16 objects in a row. The
testing system is an OrangePi 5+, with RK3588, a big.LITTLE SoC with
4x A76 and 4x A55. When either all 4 big or all 4 little cores are
disabled, the overrun tests run well, but once big and little cores are
mixed together, the overrun test always triggers the overrun. It's
likely the memory timing differences between big and little cores cause
this trouble. Here are the debugging data of objpool_try_get_slot after
try_cmpxchg_release:

objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278

The local copies of 'head' and 'last' were 278 and 276, and reloading
'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release,
'slot->head' became 'head + 1', which is correct. But what's wrong here
is the stale value of 'last', and that stale value finally led to the
overrun of 'head'.

Memory updates of 'last' and 'head' are performed in push() and pop()
independently, which could be the culprit leading to this out-of-order
visibility of 'last' and 'head'. So for objpool_try_get_slot(), checking
only the condition 'head != slot->last' is not enough; the implicit
condition 'last - head <= nr_objs' must also be explicitly asserted to
guarantee 'last' is always behind 'head' before retrieving an object.
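
For illustration only (not part of the patch), a minimal userspace
sketch of this unsigned 32-bit window check, using nr_objs = 8 and the
head/last values from the debug data above:

    #include <stdint.h>
    #include <stdio.h>

    /* 'last - head - 1 < nr_objs' holds exactly when
     * last != head && last - head <= nr_objs, because uint32_t
     * arithmetic wraps around.
     */
    static int slot_has_objs(uint32_t head, uint32_t last, uint32_t nr_objs)
    {
            return last - head - 1 < nr_objs;
    }

    int main(void)
    {
            /* stale last (278) vs reloaded head (279): 278 - 279 - 1
             * wraps to 0xfffffffe, far above nr_objs, so pop() must
             * reload instead of retrieving a bogus object
             */
            printf("%d\n", slot_has_objs(279, 278, 8));     /* 0: reload */
            printf("%d\n", slot_has_objs(278, 279, 8));     /* 1: one object */
            return 0;
    }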

This patch checks and, if needed, reloads 'head' and 'last' to ensure
'last' is behind 'head' at the time of object retrieval. Performance
testing shows the average impact is about 0.1% for X86_64 and 1.12% for
ARM64. Here are the results:

OS: Debian 10 X86_64, Linux 6.6rc
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
                     1T          2T          4T          8T         16T
native:        49543304    99277826   199017659   399070324   795185848
objpool:       29909085    59865637   119692073   239750369   478005250
objpool+:      29879313    59230743   119609856   239067773   478509029
                    32T         48T         64T         96T        128T
native:      1596927073  2390099988  2929397330  3183875848  3257546602
objpool:      957553042  1435814086  1680872925  2043126796  2165424198
objpool+:     956476281  1434491297  1666055740  2041556569  2157415622

OS: Debian 11 AARCH64, Linux 6.6rc
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
                     1T          2T          4T          8T         16T
native:        30890508    60399915   123111980   242257008   494002946
objpool:       14742531    28883047    57739948   115886644   232455421
objpool+:      14107220    29032998    57286084   113730493   232232850
                    24T         32T         48T         64T         96T
native:       746406039  1000174750  1493236240  1998318364  2942911180
objpool:      349164852   467284332   702296756   934459713  1387898285
objpool+:     348388180   462750976   696606096   927865887  1368402195

Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC")

v1 -> v2:
- Title updated since it's a common issue of objpool for big.LITTLE
  systems, verified on Rockchip RK3588 and Amlogic A311D
- Fixes tag added, as reminded by Masami Hiramatsu

Signed-off-by: wuqiang.matt 
Acked-by: Masami Hiramatsu (Google) 
---
 lib/objpool.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/lib/objpool.c b/lib/objpool.c
index 1ab49b897b2e..e8c9a30e485f 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -199,6 +199,23 @@ static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
 	while (head != READ_ONCE(slot->last)) {
 		void *obj;
 
+		/*
+		 * data visibility of 'last' and 'head' could be out of
+		 * order since memory updating of 'last' and 'head' are
+		 * performed in push() and pop() independently
+		 *
+		 * before any retrieving attempts, pop() must guarantee
+		 * 'last' is behind 'head', that is to say, there must
+		 * be available objects in slot, which could be ensured
+		 * by condition 'last != head && last - head <= nr_objs'
+		 * that is equivalent to 'last - head - 1 < nr_objs' as
+		 * 'last' and 'head' are both unsigned int32
+		 */
+		if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
+			head = READ_ONCE(slot->head);
+			continue;
+		}
+
 		/* obj must be retrieved before moving forward head */
 		obj = READ_ONCE(slot->entries[head & slot->mask]);
 
-- 
2.40.1




Re: [PATCH RFC 20/37] mm: compaction: Reserve metadata storage in compaction_alloc()

2023-11-20 Thread Peter Collingbourne
Hi Alexandru,

On Wed, Aug 23, 2023 at 6:16 AM Alexandru Elisei
 wrote:
>
> If the source page being migrated has metadata associated with it, make
> sure to reserve the metadata storage when choosing a suitable destination
> page from the free list.
>
> Signed-off-by: Alexandru Elisei 
> ---
>  mm/compaction.c | 9 +
>  mm/internal.h   | 1 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index cc0139fa0cb0..af2ee3085623 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -570,6 +570,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> bool locked = false;
> unsigned long blockpfn = *start_pfn;
> unsigned int order;
> +   int ret;
>
> /* Strict mode is for isolation, speed is secondary */
> if (strict)
> @@ -626,6 +627,11 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>
> /* Found a free page, will break it into order-0 pages */
> order = buddy_order(page);
> +   if (metadata_storage_enabled() && cc->reserve_metadata) {
> +			ret = reserve_metadata_storage(page, order, cc->gfp_mask);

At this point the zone lock is held and preemption is disabled, which
makes it invalid to call reserve_metadata_storage.

Peter

> +   if (ret)
> +   goto isolate_fail;
> +   }
> isolated = __isolate_free_page(page, order);
> if (!isolated)
> break;
> @@ -1757,6 +1763,9 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
> struct compact_control *cc = (struct compact_control *)data;
> struct folio *dst;
>
> +   if (metadata_storage_enabled())
> +   cc->reserve_metadata = folio_has_metadata(src);
> +
> 	if (list_empty(&cc->freepages)) {
> isolate_freepages(cc);
>
> diff --git a/mm/internal.h b/mm/internal.h
> index d28ac0085f61..046cc264bfbe 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -492,6 +492,7 @@ struct compact_control {
>  */
> bool alloc_contig;  /* alloc_contig_range allocation */
> 	bool source_has_metadata;	/* source pages have associated metadata */
> +   bool reserve_metadata;
>  };
>
>  /*
> --
> 2.41.0
>



[PATCH 1/2] eventfs: Remove expectation that ei->is_freed means ei->dentry == NULL

2023-11-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The logic to free the eventfs_inode (ei) used to set is_freed and clear
the "dentry" field under the eventfs_mutex. But that changed when a race
was found where the ei->dentry needed to be cleared when the last dput()
was called on it. But there was still logic that checked whether
ei->dentry was not NULL while is_freed was set, and would warn if it was.

Since that situation has changed, and the ei->dentry isn't cleared
until the last dput() is called on it while ei->is_freed is set, do
not test for that condition anymore, and change the comments to reflect
that.

Fixes: 020010fbfa20 ("eventfs: Delete eventfs_inode when the last dentry is freed")
Reported-by: Mark Rutland 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f8a594a50ae6..f239b2b507a4 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -27,16 +27,16 @@
 /*
  * eventfs_mutex protects the eventfs_inode (ei) dentry. Any access
  * to the ei->dentry must be done under this mutex and after checking
- * if ei->is_freed is not set. The ei->dentry is released under the
- * mutex at the same time ei->is_freed is set. If ei->is_freed is set
- * then the ei->dentry is invalid.
+ * if ei->is_freed is not set. When ei->is_freed is set, the dentry
+ * is on its way to being freed after the last dput() is made on it.
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
- * After the SRCU grace period is over, the ei may be freed.
+ * After the SRCU grace period is over and the last dput() is called
+ * the ei is freed.
  */
 DEFINE_STATIC_SRCU(eventfs_srcu);
 
@@ -365,12 +365,14 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 		 * created the dentry for this e_dentry. In which case
 		 * use that one.
 		 *
-		 * Note, with the mutex held, the e_dentry cannot have content
-		 * and the ei->is_freed be true at the same time.
+		 * If ei->is_freed is set, the e_dentry is currently on its
+		 * way to being freed, don't return it. If e_dentry is NULL
+		 * it means it was already freed.
 		 */
-		dentry = *e_dentry;
-		if (WARN_ON_ONCE(dentry && ei->is_freed))
+		if (ei->is_freed)
 			dentry = NULL;
+		else
+			dentry = *e_dentry;
 		/* The lookup does not need to up the dentry refcount */
 		if (dentry && !lookup)
 			dget(dentry);
@@ -473,8 +475,8 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 		 * created the dentry for this e_dentry. In which case
 		 * use that one.
 		 *
-		 * Note, with the mutex held, the e_dentry cannot have content
-		 * and the ei->is_freed be true at the same time.
+		 * If ei->is_freed is set, the e_dentry is currently on its
+		 * way to being freed.
 		 */
 		dentry = ei->dentry;
 		if (dentry && !lookup)
-- 
2.42.0





[PATCH 2/2] eventfs: Do not invalidate dentry in create_file/dir_dentry()

2023-11-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

With the call to simple_recursive_removal() on the entire eventfs
subsystem when the directory is removed, d_invalidate() is performed on
all the dentries as they are removed. There's no need to do cleanups
when a dentry is being created while the directory is being deleted.

As dentries are cleaned up by simple_recursive_removal(), trying to
do d_invalidate() in these functions will cause the dentry to be
invalidated twice, and crash the kernel.

Link: https://lore.kernel.org/all/20231116123016.140576-1-naresh.kamb...@linaro.org/

Fixes: 407c6726ca71 ("eventfs: Use simple_recursive_removal() to clean up dentries")
Reported-by: Mark Rutland 
Reported-by: Naresh Kamboju 
Reported-by: Linux Kernel Functional Testing 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 19 ++-
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f239b2b507a4..3eb6c622a74d 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -326,7 +326,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 	struct eventfs_attr *attr = NULL;
 	struct dentry **e_dentry = &ei->d_children[idx];
 	struct dentry *dentry;
-	bool invalidate = false;
 
 	mutex_lock(&eventfs_mutex);
 	if (ei->is_freed) {
@@ -389,17 +388,14 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 		 * Otherwise it means two dentries exist with the same name.
 		 */
 		WARN_ON_ONCE(!ei->is_freed);
-		invalidate = true;
+		dentry = NULL;
 	}
 	mutex_unlock(&eventfs_mutex);
 
-	if (invalidate)
-		d_invalidate(dentry);
-
-	if (lookup || invalidate)
+	if (lookup)
 		dput(dentry);
 
-	return invalidate ? NULL : dentry;
+	return dentry;
 }
 
 /**
@@ -439,7 +435,6 @@ static struct dentry *
 create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 		  struct dentry *parent, bool lookup)
 {
-	bool invalidate = false;
 	struct dentry *dentry = NULL;
 
 	mutex_lock(&eventfs_mutex);
@@ -495,16 +490,14 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 		 * Otherwise it means two dentries exist with the same name.
 		 */
 		WARN_ON_ONCE(!ei->is_freed);
-		invalidate = true;
+		dentry = NULL;
 	}
 	mutex_unlock(&eventfs_mutex);
-	if (invalidate)
-		d_invalidate(dentry);
 
-	if (lookup || invalidate)
+	if (lookup)
 		dput(dentry);
 
-	return invalidate ? NULL : dentry;
+	return dentry;
 }
 
 /**
-- 
2.42.0





[PATCH 0/2] eventfs: Fixes for v6.7-rc2

2023-11-20 Thread Steven Rostedt
[ Resend to trigger my patchwork updates, and Cc linux-trace-kernel ]

A couple of fixes to eventfs:

- With the usage of simple_recursive_removal() recommended by Al Viro,
  the code should not be calling "d_invalidate()" itself. Doing so
  is causing crashes. The code was calling d_invalidate() on the race
  of trying to look up a file while the parent was being deleted.
  When that race was detected, d_invalidate() was called on the added
  dentry, but the deletion of the directory also called d_invalidate()
  on that same dentry.

- A fix to not free the eventfs_inode (ei) until the last dput() is
  called on its ei->dentry means that the ei->dentry can exist even
  after ei->is_freed is set to mark it for freeing. But code elsewhere
  was still checking that ei->dentry was NULL whenever ei->is_freed
  was set, and would trigger a WARN_ON if it wasn't. That combination
  is now valid and should no longer produce any warnings.

Steven Rostedt (Google) (2):
  eventfs: Remove expectation that ei->is_freed means ei->dentry == NULL
  eventfs: Do not invalidate dentry in create_file/dir_dentry()


 fs/tracefs/event_inode.c | 41 ++---
 1 file changed, 18 insertions(+), 23 deletions(-)



Re: [PATCH v4 0/2] Rpmsg support for i.MX DSP with resource table

2023-11-20 Thread Mathieu Poirier
On Fri, Oct 13, 2023 at 06:27:29PM +0300, Iuliana Prodan (OSS) wrote:
> From: Iuliana Prodan 
> 
> These patches are needed in order to support rpmsg on DSP when a
> resource table is available.
> 
> Changes since v3:
>  - add reserve-memory nodes in imx8mp-evk.dts rather than .dtsi (patch 2/2)
> 
> Changes since v2:
>  - add newline between nodes in dtsi (patch 2/2)
> 
> Changes since v1:
>  - add missing bracket in dtsi (patch 2/2)
> 
> Iuliana Prodan (2):
>   remoteproc: imx_dsp_rproc: add mandatory find_loaded_rsc_table op
>   arm64: dts: imx8mp: add reserve-memory nodes for DSP
> 
>  arch/arm64/boot/dts/freescale/imx8mp-evk.dts | 22 
>  drivers/remoteproc/imx_dsp_rproc.c   |  1 +
>  2 files changed, 23 insertions(+)
>

I have applied this set.

Thanks,
Mathieu

> -- 
> 2.17.1
> 


Re: selftests: ftrace: WARNING: __list_del_entry_valid_or_report (lib/list_debug.c:62 (discriminator 1))

2023-11-20 Thread Steven Rostedt
On Thu, 16 Nov 2023 18:00:16 +0530
Naresh Kamboju  wrote:

> Following kernel crash noticed while running selftests: ftrace on arm64 
> Juno-r2
> device running stable-rc linux-6.6.y kernel.
> 
> This kernel crash is hard to reproduce.
> 

Can you test this patch?

Note, there's a similar bug in 6.7-rc1 which I'll fix first. And when
that's accepted, I'll push this one for v6.6. This may become two
patches, as one is the d_invalidate() issue and the other is a memory
leak fix.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 5fcfb634fec2..b60048469df1 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -289,6 +289,8 @@ void eventfs_set_ef_status_free(struct tracefs_inode *ti, struct dentry *dentry)
 		ef = dentry->d_fsdata;
 		if (ef)
 			free_ef(ef);
+		else
+			kfree(ei);
 		return;
 	}
 
@@ -342,7 +344,6 @@ static void eventfs_post_create_dir(struct eventfs_file *ef)
 static struct dentry *
 create_dentry(struct eventfs_file *ef, struct dentry *parent, bool lookup)
 {
-	bool invalidate = false;
 	struct dentry *dentry;
 
 	mutex_lock(&eventfs_mutex);
@@ -387,23 +388,24 @@ create_dentry(struct eventfs_file *ef, struct dentry *parent, bool lookup)
 		eventfs_post_create_dir(ef);
 		dentry->d_fsdata = ef;
 	} else {
-		/* A race here, should try again (unless freed) */
-		invalidate = true;
-
 		/*
+		 * If we are here then the directory is being freed.
+		 * The simple_recursive_removal() will get rid of the dentry
+		 * here.
+		 */
+		dentry = NULL;
+		/*
 		 * Should never happen unless we get here due to being freed.
 		 * Otherwise it means two dentries exist with the same name.
 		 */
 		WARN_ON_ONCE(!ef->is_freed);
 	}
 	mutex_unlock(&eventfs_mutex);
-	if (invalidate)
-		d_invalidate(dentry);
 
-	if (lookup || invalidate)
+	if (lookup)
 		dput(dentry);
 
-	return invalidate ? NULL : dentry;
+	return dentry;
 }
 
 static bool match_event_file(struct eventfs_file *ef, const char *name)
-- 
2.42.0




[PATCH v3 2/5] params: Do not go over the limit when getting the string length

2023-11-20 Thread Andy Shevchenko
We can use strnlen() even at early boot stages, and it prevents
reading past the string boundary in case the string is already too long.

Reviewed-by: Luis Chamberlain 
Reviewed-by: Kees Cook 
Signed-off-by: Andy Shevchenko 
---
 kernel/params.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/params.c b/kernel/params.c
index 626fa8265932..f8e3c4139854 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -260,7 +260,10 @@ EXPORT_SYMBOL_GPL(param_set_uint_minmax);
 
 int param_set_charp(const char *val, const struct kernel_param *kp)
 {
-	if (strlen(val) > 1024) {
+	size_t len, maxlen = 1024;
+
+	len = strnlen(val, maxlen + 1);
+	if (len == maxlen + 1) {
 		pr_err("%s: string parameter too long\n", kp->name);
 		return -ENOSPC;
 	}
@@ -270,7 +273,7 @@ int param_set_charp(const char *val, const struct kernel_param *kp)
 	/* This is a hack.  We can't kmalloc in early boot, and we
 	 * don't need to; this mangled commandline is preserved. */
 	if (slab_is_available()) {
-		*(char **)kp->arg = kmalloc_parameter(strlen(val)+1);
+		*(char **)kp->arg = kmalloc_parameter(len + 1);
 		if (!*(char **)kp->arg)
 			return -ENOMEM;
 		strcpy(*(char **)kp->arg, val);
@@ -508,7 +511,7 @@ int param_set_copystring(const char *val, const struct kernel_param *kp)
 {
 	const struct kparam_string *kps = kp->str;
 
-	if (strlen(val)+1 > kps->maxlen) {
+	if (strnlen(val, kps->maxlen) == kps->maxlen) {
 		pr_err("%s: string doesn't fit in %u chars.\n",
 		       kp->name, kps->maxlen-1);
 		return -ENOSPC;
-- 
2.43.0.rc1.1.gbec44491f096
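
A minimal userspace sketch (not from the patch) of the bounded-length
idiom used above: scanning at most maxlen + 1 bytes both detects an
over-long string and guarantees the scan never walks past that
boundary.

    #include <stdio.h>
    #include <string.h>

    #define MAXLEN 1024

    /* Returns the length if it fits, -1 if the string exceeds MAXLEN.
     * strnlen() examines at most MAXLEN + 1 bytes, so an over-long
     * 'val' is rejected without reading to its real terminator.
     */
    static int checked_len(const char *val)
    {
            size_t len = strnlen(val, MAXLEN + 1);

            return len == MAXLEN + 1 ? -1 : (int)len;
    }

    int main(void)
    {
            printf("%d\n", checked_len("short string"));    /* 12 */
            return 0;
    }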




[PATCH v3 5/5] params: Fix multi-line comment style

2023-11-20 Thread Andy Shevchenko
The multi-line comment style in the file is rather arbitrary.
Make it follow the standard one.

Reviewed-by: Luis Chamberlain 
Reviewed-by: Kees Cook 
Signed-off-by: Andy Shevchenko 
---
 kernel/params.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/params.c b/kernel/params.c
index eb55b32399b4..2e447f8ae183 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -1,8 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
-/* Helpers for initial module or kernel cmdline parsing
-   Copyright (C) 2001 Rusty Russell.
-
-*/
+/*
+ * Helpers for initial module or kernel cmdline parsing
+ * Copyright (C) 2001 Rusty Russell.
+ */
 #include 
 #include 
 #include 
@@ -271,8 +271,10 @@ int param_set_charp(const char *val, const struct kernel_param *kp)
 
 	maybe_kfree_parameter(*(char **)kp->arg);
 
-	/* This is a hack.  We can't kmalloc in early boot, and we
-	 * don't need to; this mangled commandline is preserved. */
+	/*
+	 * This is a hack. We can't kmalloc() in early boot, and we
+	 * don't need to; this mangled commandline is preserved.
+	 */
 	if (slab_is_available()) {
 		*(char **)kp->arg = kmalloc_parameter(len + 1);
 		if (!*(char **)kp->arg)
@@ -743,8 +745,10 @@ void module_param_sysfs_remove(struct module *mod)
 {
 	if (mod->mkobj.mp) {
 		sysfs_remove_group(&mod->mkobj.kobj, &mod->mkobj.mp->grp);
-		/* We are positive that no one is using any param
-		 * attrs at this point.  Deallocate immediately. */
+		/*
+		 * We are positive that no one is using any param
+		 * attrs at this point. Deallocate immediately.
+		 */
 		free_module_param_attrs(&mod->mkobj);
 	}
 }
-- 
2.43.0.rc1.1.gbec44491f096




[PATCH v3 4/5] params: Sort headers

2023-11-20 Thread Andy Shevchenko
Sort the headers in alphabetical order to ease the maintenance of
this part.

Reviewed-by: Luis Chamberlain 
Reviewed-by: Kees Cook 
Signed-off-by: Andy Shevchenko 
---
 kernel/params.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/params.c b/kernel/params.c
index c3a029fe183d..eb55b32399b4 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -3,18 +3,18 @@
Copyright (C) 2001 Rusty Russell.
 
 */
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
+#include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 #include 
-#include 
-#include 
 #include 
+#include 
+#include 
 
 #ifdef CONFIG_SYSFS
 /* Protects all built-in parameters, modules use their own param_lock */
-- 
2.43.0.rc1.1.gbec44491f096




[PATCH v3 1/5] params: Introduce the param_unknown_fn type

2023-11-20 Thread Andy Shevchenko
Introduce a new type for the callback that parses an unknown argument.
This unifies the function prototypes that take it as a parameter.

Reviewed-by: Luis Chamberlain 
Reviewed-by: Kees Cook 
Signed-off-by: Andy Shevchenko 
---
 include/linux/moduleparam.h | 6 +++---
 kernel/params.c | 8 ++--
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
index 4fa9726bc328..bfb85fd13e1f 100644
--- a/include/linux/moduleparam.h
+++ b/include/linux/moduleparam.h
@@ -385,6 +385,8 @@ extern bool parameq(const char *name1, const char *name2);
  */
 extern bool parameqn(const char *name1, const char *name2, size_t n);
 
+typedef int (*parse_unknown_fn)(char *param, char *val, const char *doing, void *arg);
+
 /* Called on module insert or kernel boot */
 extern char *parse_args(const char *name,
  char *args,
@@ -392,9 +394,7 @@ extern char *parse_args(const char *name,
  unsigned num,
  s16 level_min,
  s16 level_max,
- void *arg,
- int (*unknown)(char *param, char *val,
-const char *doing, void *arg));
+ void *arg, parse_unknown_fn unknown);
 
 /* Called by module remove. */
 #ifdef CONFIG_SYSFS
diff --git a/kernel/params.c b/kernel/params.c
index 2d4a0564697e..626fa8265932 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -120,9 +120,7 @@ static int parse_one(char *param,
 unsigned num_params,
 s16 min_level,
 s16 max_level,
-void *arg,
-int (*handle_unknown)(char *param, char *val,
-const char *doing, void *arg))
+void *arg, parse_unknown_fn handle_unknown)
 {
unsigned int i;
int err;
@@ -165,9 +163,7 @@ char *parse_args(const char *doing,
 unsigned num,
 s16 min_level,
 s16 max_level,
-void *arg,
-int (*unknown)(char *param, char *val,
-   const char *doing, void *arg))
+void *arg, parse_unknown_fn unknown)
 {
char *param, *val, *err = NULL;
 
-- 
2.43.0.rc1.1.gbec44491f096
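
To illustrate what the typedef buys (a userspace sketch with
hypothetical names, not kernel code): one named callback type replaces
the function-pointer prototype that was previously repeated at every
declaration site.

    #include <stdio.h>

    typedef int (*parse_unknown_fn)(char *param, char *val,
                                    const char *doing, void *arg);

    /* One possible callback: just log the unknown parameter. */
    static int log_unknown(char *param, char *val, const char *doing,
                           void *arg)
    {
            printf("%s: unknown parameter '%s' = '%s'\n",
                   doing, param, val ? val : "");
            return 0;
    }

    /* Stand-in for parse_one(): on a table miss, call the handler. */
    static void parse_one_demo(char *param, char *val,
                               parse_unknown_fn unknown)
    {
            unknown(param, val, "demo", NULL);
    }

    int main(void)
    {
            char p[] = "frobnicate", v[] = "1";

            parse_one_demo(p, v, log_unknown);
            return 0;
    }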




[PATCH v3 3/5] params: Use size_add() for kmalloc()

2023-11-20 Thread Andy Shevchenko
Prevent allocation size calculations from integer overflow by using
size_add().

Reviewed-by: Luis Chamberlain 
Reviewed-by: Kees Cook 
Signed-off-by: Andy Shevchenko 
---
 kernel/params.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/params.c b/kernel/params.c
index f8e3c4139854..c3a029fe183d 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include <linux/overflow.h>
 #include 
 #include 
 #include 
@@ -48,7 +49,7 @@ static void *kmalloc_parameter(unsigned int size)
 {
 	struct kmalloced_param *p;
 
-	p = kmalloc(sizeof(*p) + size, GFP_KERNEL);
+	p = kmalloc(size_add(sizeof(*p), size), GFP_KERNEL);
 	if (!p)
 		return NULL;
 
-- 
2.43.0.rc1.1.gbec44491f096
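
A userspace model (an assumption-laden sketch, not the kernel
implementation) of why the saturating add matters: a huge
caller-supplied 'size' makes the naive 'sizeof(*p) + size' wrap to a
tiny value, so kmalloc() could succeed with an undersized buffer,
while a saturated SIZE_MAX request simply fails.

    #include <stdint.h>
    #include <stdio.h>

    /* Models the semantics of the kernel's size_add(): saturate to
     * SIZE_MAX on overflow so the subsequent allocation fails cleanly.
     */
    static size_t size_add_model(size_t a, size_t b)
    {
            size_t sum;

            if (__builtin_add_overflow(a, b, &sum))
                    return SIZE_MAX;
            return sum;
    }

    int main(void)
    {
            size_t hdr = 16;                /* stand-in for sizeof(*p) */
            size_t size = SIZE_MAX - 8;     /* hostile length */

            printf("naive: %zu\n", hdr + size);                /* wraps to 7 */
            printf("safe:  %zu\n", size_add_model(hdr, size)); /* SIZE_MAX */
            return 0;
    }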




[PATCH v3 0/5] params: harden string ops and allocatio ops

2023-11-20 Thread Andy Shevchenko
A couple of patches slightly harden the string ops used in the module
parameter code. On top of that, a few cleanups.

Since the main part is hardening, I think Kees' tree is the best fit
for the series. It's also possible to route it via Greg's sysfs
(driver core?) tree, but I'm open to other options.

Changelog v3:
- added tags (Kees, Luis)

Changelog v2:
- dropped the s*printf() --> sysfs_emit() conversion as it revealed
  an issue, i.e. reuse getters with non-page-aligned pointer, which
  would be addressed separately
- added cover letter and clarified the possible route for the series
  (Luis)

Andy Shevchenko (5):
  params: Introduce the param_unknown_fn type
  params: Do not go over the limit when getting the string length
  params: Use size_add() for kmalloc()
  params: Sort headers
  params: Fix multi-line comment style

 include/linux/moduleparam.h |  6 ++--
 kernel/params.c | 56 -
 2 files changed, 33 insertions(+), 29 deletions(-)

-- 
2.43.0.rc1.1.gbec44491f096




Re: [PATCH v1 1/1] ACPI: NFIT: Switch to use acpi_evaluate_dsm_typed()

2023-11-20 Thread Rafael J. Wysocki
On Mon, Nov 20, 2023 at 4:03 PM Andy Shevchenko
 wrote:
>
> On Thu, Oct 19, 2023 at 06:03:28PM -0700, Dan Williams wrote:
> > Andy Shevchenko wrote:
> > > The acpi_evaluate_dsm_typed() provides a way to check the type of the
> > > object evaluated by _DSM call. Use it instead of open coded variant.
> >
> > Looks good to me.
> >
> > Reviewed-by: Dan Williams 
>
> Thank you!
>
> Who is taking care of this? Rafael?

I can apply it.



Re: [PATCH v1 1/1] ACPI: NFIT: Switch to use acpi_evaluate_dsm_typed()

2023-11-20 Thread Andy Shevchenko
On Thu, Oct 19, 2023 at 06:03:28PM -0700, Dan Williams wrote:
> Andy Shevchenko wrote:
> > The acpi_evaluate_dsm_typed() provides a way to check the type of the
> > object evaluated by _DSM call. Use it instead of open coded variant.
> 
> Looks good to me.
> 
> Reviewed-by: Dan Williams 

Thank you!

Who is taking care of this? Rafael?

-- 
With Best Regards,
Andy Shevchenko





Re: [PATCH v1] lib: objpool: fix head overrun on RK3588 SBC

2023-11-20 Thread wuqiang.matt

On 2023/11/20 13:18, Masami Hiramatsu (Google) wrote:

On Tue, 14 Nov 2023 19:51:48 +0800
"wuqiang.matt"  wrote:


objpool overrun stress with test_objpool on an OrangePi 5+ SBC triggered
the following kernel warning:

 WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100

This message is from objpool.c:168:

 WARN_ON_ONCE(tail - head > pool->nr_objs);

The overrun test case validates the situation where pre-allocated
objects are insufficient: 8 objects are pre-allocated for each node and
the consumer thread of each node tries to grab 16 objects in a row. The
testing system is an OrangePi 5+, with RK3588, a big.LITTLE SoC with
4x A76 and 4x A55. When either all 4 big or all 4 little cores are
disabled, the overrun tests run well, but once big and little cores are
mixed together, the overrun test always triggers the overrun. It's
likely the memory timing differences between big and little cores cause
this trouble. Here are the debugging data of objpool_try_get_slot after
try_cmpxchg_release:

 objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278

The local copies of 'head' and 'last' were 278 and 276, and reloading
'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release,
'slot->head' became 'head + 1', which is correct. But what's wrong here
is the stale value of 'last', and that stale value finally led to the
overrun of 'head'.


Ah, good catch! So even if the ring size is enough, a head/tail update
that is not yet visible locally can cause the overrun!


It's really confusing at first glance of such an issue. I was assuming
the order between 'last' and 'head' would be implicitly maintained, but
after more digging I found that wasn't actually true; the order must be
explicitly guaranteed by pop().

I also verified with an Amlogic A311D, which has 6 cores (4x A73 and
2x A53), and got the same results. I think I just need to re-discover
the differences of HMP (heterogeneous multiprocessing) for big.LITTLE
or P/E-core CPUs.



Memory updates of 'last' and 'head' are performed in push() and pop()
independently, which could be the culprit leading to this out-of-order
visibility of 'last' and 'head'. So for objpool_try_get_slot(), checking
only the condition 'head != slot->last' is not enough; the implicit
condition 'last - head <= nr_objs' must also be explicitly asserted to
guarantee 'last' is always behind 'head' before retrieving an object.


Indeed. Thanks for the investigation!



This patch checks and, if needed, reloads 'head' and 'last' to ensure
'last' is behind 'head' at the time of object retrieval. Performance
testing shows the average impact is about 0.1% for X86_64 and 1.12% for
ARM64. Here are the results:

 OS: Debian 10 X86_64, Linux 6.6rc
 HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
                      1T          2T          4T          8T         16T
 native:        49543304    99277826   199017659   399070324   795185848
 objpool:       29909085    59865637   119692073   239750369   478005250
 objpool+:      29879313    59230743   119609856   239067773   478509029
                     32T         48T         64T         96T        128T
 native:      1596927073  2390099988  2929397330  3183875848  3257546602
 objpool:      957553042  1435814086  1680872925  2043126796  2165424198
 objpool+:     956476281  1434491297  1666055740  2041556569  2157415622

 OS: Debian 11 AARCH64, Linux 6.6rc
 HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
                      1T          2T          4T          8T         16T
 native:        30890508    60399915   123111980   242257008   494002946
 objpool:       14742531    28883047    57739948   115886644   232455421
 objpool+:      14107220    29032998    57286084   113730493   232232850
                     24T         32T         48T         64T         96T
 native:       746406039  1000174750  1493236240  1998318364  2942911180
 objpool:      349164852   467284332   702296756   934459713  1387898285
 objpool+:     348388180   462750976   696606096   927865887  1368402195



OK, looks good to me.

Acked-by: Masami Hiramatsu (Google) 

And let me pick it.


Signed-off-by: wuqiang.matt 


BTW, this is a real bugfix, so it should have a Fixes tag :)

Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC")



Oh, right! Thanks for your kind reminder. I'll keep that in mind.


Thank you!


Best regards.




---
  lib/objpool.c | 17 +
  1 file changed, 17 insertions(+)

diff --git a/lib/objpool.c b/lib/objpool.c
index ce0087f64400..cfdc02420884 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -201,6 +201,23 @@ static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
 	while (head != READ_ONCE(slot->last)) {
 		void *obj;
 
+		/*
+		 * data visibility of 'last' and 'head' could be out of
+		 * order since memory updating of 'last' and 'head' are
+		 * performed in push() and pop() independently
+

Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-20 Thread Abel Wu

On 11/20/23 6:56 PM, Peter Zijlstra Wrote:

On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:


Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:

vl' = vl

But in fact it is scaled on placement:

vl' = vl * W/(W + w)


(W+w)/W


Ah, right. I misunderstood (again) the comment which says:

vl_i = (W + w_i)*vl'_i / W

So the current implementation is:

v' = V - vl'

and what I was proposing is:

v' = V' - vl

and they are in fact equal: V' - vl = V - w*vl/W - vl
= V - (W + w)/W * vl = V - vl'.





Is this intended?


The scaling, yes that's intended and the comment explains why. So now
you have me confused too :-)

Specifically, I want the lag after placement to be equal to the lag we
come in with. Since placement will affect avg_vruntime (adding one
element to the average changes the average etc..) the placement also
affects the lag as measured after placement.


Yes. You did the math in an iterative fashion, while mine looks at the
final state:

v' = V' - vlag
V' = (WV + wv') / (W + w)

which gives:

V' = V - w * vlag / W



Or rather, if you enqueue and dequeue, I want the lag to be preserved.
If you do not take placement into consideration, lag will dissipate real
quick.


And to illustrate my understanding of strategy #1:



@@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		 * vl_i is given by:
 		 *
 		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
-		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
-		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
-		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
-		 *      = V - w_i*vl_i / (W + w_i)
-		 *
-		 * And the actual lag after adding an entity with vl_i is:
-		 *
-		 *   vl'_i = V' - v_i
-		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
-		 *         = vl_i - w_i*vl_i / (W + w_i)
-		 *
-		 * Which is strictly less than vl_i. So in order to preserve lag
-		 * we should inflate the lag before placement such that the
-		 * effective lag after placement comes out right.
-		 *
-		 * As such, invert the above relation for vl'_i to get the vl_i
-		 * we need to use such that the lag after placement is the lag
-		 * we computed before dequeue.
+		 *      = (W*V + w_i*(V' - vl_i)) / (W + w_i)
+		 *      = V - w_i*vl_i / W
 		 *
-		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
-		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
-		 *
-		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
-		 *                   = W*vl_i
-		 *
-		 *   vl_i = (W + w_i)*vl'_i / W
 		 */
 		load = cfs_rq->avg_load;
 		if (curr && curr->on_rq)
 			load += scale_load_down(curr->load.weight);
-
-		lag *= load + scale_load_down(se->load.weight);
 		if (WARN_ON_ONCE(!load))
 			load = 1;
-		lag = div_s64(lag, load);
+
+		vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
 	}
 	se->vruntime = vruntime - lag;



So you're proposing we do:

v = V - (lag * w) / (W + w) - lag


What I'm proposing is:

V' = V - w * vlag / W

so we have:

v' = V' - vlag
   = V - vlag * w/W - vlag
   = V - vlag * (W + w)/W

which is exactly the same as current implementation.



?

That can be written like:

v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
  = V - (lag * (W+w) + lag * w) / (W+w)
  = V - (lag * (W+2w)) / (W+w)

And that turns into a mess AFAICT.


Let me repeat my earlier argument. Suppose v,w,l are the new element.
V,W are the old avg_vruntime and sum-weight.

Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).

The new lag, after placement:

l' = V' - v = (V*W + v*w) / (W+w) - v
            = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
            = (V*W + v*w - v*W - v*w) / (W+w)
            = (V*W - v*W) / (W+w)
            = W*(V-v) / (W+w)
            = W/(W+w) * (V-v)

Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:

l' = W/(W+w) * (V - (V - (W+w)/W * l))
= W/(W+w) * (V - V + (W+w)/W * l)
= W/(W+w) * (W+w)/W * l
= l

So by scaling, we've preserved lag across placement.

That make sense?


Yes, I think I won't misunderstand again for the 3rd time :)

Thanks!
Abel



Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-20 Thread Peter Zijlstra
On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:

> Hi Peter, I'm a little confused here. As we adopt placement strategy #1
> when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
> Given that the weight doesn't change, we have:
> 
>   vl' = vl
> 
> But in fact it is scaled on placement:
> 
>   vl' = vl * W/(W + w)

(W+w)/W

> 
> Is this intended?

The scaling, yes that's intended and the comment explains why. So now
you have me confused too :-)

Specifically, I want the lag after placement to be equal to the lag we
come in with. Since placement will affect avg_vruntime (adding one
element to the average changes the average etc..) the placement also
affects the lag as measured after placement.

Or rather, if you enqueue and dequeue, I want the lag to be preserved.
If you do not take placement into consideration, lag will dissipate real
quick.

> And to illustrate my understanding of strategy #1:

> @@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>* vl_i is given by:
>*
>*   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> -  *  = (W*V + w_i*(V - vl_i)) / (W + w_i)
> -  *  = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
> -  *  = (V*(W + w_i) - w_i*l) / (W + w_i)
> -  *  = V - w_i*vl_i / (W + w_i)
> -  *
> -  * And the actual lag after adding an entity with vl_i is:
> -  *
> -  *   vl'_i = V' - v_i
> -  * = V - w_i*vl_i / (W + w_i) - (V - vl_i)
> -  * = vl_i - w_i*vl_i / (W + w_i)
> -  *
> -  * Which is strictly less than vl_i. So in order to preserve lag
> -  * we should inflate the lag before placement such that the
> -  * effective lag after placement comes out right.
> -  *
> -  * As such, invert the above relation for vl'_i to get the vl_i
> -  * we need to use such that the lag after placement is the lag
> -  * we computed before dequeue.
> +  *  = (W*V + w_i*(V' - vl_i)) / (W + w_i)
> +  *  = V - w_i*vl_i / W
>*
> -  *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
> -  * = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
> -  *
> -  *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
> -  *   = W*vl_i
> -  *
> -  *   vl_i = (W + w_i)*vl'_i / W
>*/
>   load = cfs_rq->avg_load;
>   if (curr && curr->on_rq)
>   load += scale_load_down(curr->load.weight);
> -
> - lag *= load + scale_load_down(se->load.weight);
>   if (WARN_ON_ONCE(!load))
>   load = 1;
> - lag = div_s64(lag, load);
> +
> +	vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
>   }
>   se->vruntime = vruntime - lag;


So you're proposing we do:

v = V - (lag * w) / (W + w) - lag

?

That can be written like:

v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
  = V - (lag * (W+w) + lag * w) / (W+w)
  = V - (lag * (W+2w)) / (W+w)

And that turns into a mess AFAICT.


Let me repeat my earlier argument. Suppose v,w,l are the new element.
V,W are the old avg_vruntime and sum-weight.

Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).

The new lag, after placement: 

l' = V' - v = (V*W + v*w) / (W+w) - v
   = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
   = (V*W + v*w - v*W - v*w) / (W+w)
   = (V*W - v*W) / (W+w)
   = W*(V-v) / (W+w)
   = W/(W+w) * (V-v)

Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:

l' = W/(W+w) * (V - (V - (W+w)/W * l))
   = W/(W+w) * (V - V + (W+w)/W * l)
   = W/(W+w) * (W+w)/W * l
   = l

So by scaling, we've preserved lag across placement.

That make sense?
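
For anyone who wants to sanity-check the algebra, here is a quick
numeric model (not from the thread; the weights are arbitrary example
values) of the scaled placement and the lag measured afterwards:

    #include <stdio.h>

    int main(void)
    {
            double V = 100.0, W = 3072.0;   /* old avg_vruntime, old total weight */
            double w = 1024.0, l = 5.0;     /* new entity's weight and lag */

            /* place with the inflated lag: v = V - (W+w)/W * l */
            double v  = V - (W + w) / W * l;
            /* new weighted average after adding the entity */
            double V2 = (V * W + v * w) / (W + w);
            /* lag measured after placement; should equal l */
            double l2 = V2 - v;

            printf("l = %.6f, l' = %.6f\n", l, l2); /* both 5.000000 */
            return 0;
    }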