date:20240404

Re: [PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig

2024-04-04 Thread Paul E. McKenney

On Fri, Apr 05, 2024 at 11:57:45AM +0900, Masami Hiramatsu wrote:
> On Fri, 5 Apr 2024 10:23:24 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > On Thu, 4 Apr 2024 10:43:14 -0700
> > "Paul E. McKenney"  wrote:
> > 
> > > On Thu, Apr 04, 2024 at 08:55:22AM +0900, Masami Hiramatsu wrote:
> > > > On Wed, 3 Apr 2024 12:16:28 -0700
> > > > "Paul E. McKenney"  wrote:
> > > > 
> > > > > commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to
> > > > > /proc/bootconfig") adds bootloader argument comments into 
> > > > > /proc/bootconfig.
> > > > > 
> > > > > /proc/bootconfig shows boot_command_line[] multiple times following
> > > > > every xbc key value pair, that's duplicated and not necessary.
> > > > > Remove redundant ones.
> > > > > 
> > > > > Output before and after the fix is like:
> > > > > key1 = value1
> > > > > *bootloader argument comments*
> > > > > key2 = value2
> > > > > *bootloader argument comments*
> > > > > key3 = value3
> > > > > *bootloader argument comments*
> > > > > ...
> > > > > 
> > > > > key1 = value1
> > > > > key2 = value2
> > > > > key3 = value3
> > > > > *bootloader argument comments*
> > > > > ...
> > > > > 
> > > > > Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment 
> > > > > to /proc/bootconfig")
> > > > > Signed-off-by: Zhenhua Huang 
> > > > > Signed-off-by: Paul E. McKenney 
> > > > > Cc: Masami Hiramatsu 
> > > > > Cc: 
> > > > > Cc: 
> > > > 
> > > > OOps, good catch! Let me pick it.
> > > > 
> > > > Acked-by: Masami Hiramatsu (Google) 
> > > 
> > > Thank you, and I have applied your ack and pulled this into its own
> > > bootconfig.2024.04.04a.
> > > 
> > > My guess is that you will push this via your own tree, and so I will
> > > drop my copy as soon as yours hits -next.
> > 
> > Thanks! I would like to make PR this soon as bootconfig fixes for v6.9-rc2.
> 
> Hmm I found that this always shows the command line comment in
> /proc/bootconfig even without "bootconfig" option.
> I think that is easier for user-tools but changes the behavior and
> a bit redundant.
> 
> We should skip showing this original argument comment if bootconfig is
> not initialized (no "bootconfig" in cmdline) as it is now.

So something like this folded into that patch?



diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
index e5635a6b127b0..7d2520378f5f2 100644
--- a/fs/proc/bootconfig.c
+++ b/fs/proc/bootconfig.c
@@ -63,7 +63,7 @@ static int __init copy_xbc_key_value_list(char *dst, size_t 
size)
dst += ret;
}
}
-   if (ret >= 0 && boot_command_line[0]) {
+   if (bootconfig_is_present() && ret >= 0 && boot_command_line[0]) {
ret = snprintf(dst, rest(dst, end), "# Parameters from 
bootloader:\n# %s\n",
   boot_command_line);
if (ret > 0)
diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
index ca73940e26df8..ef70d1b381421 100644
--- a/include/linux/bootconfig.h
+++ b/include/linux/bootconfig.h
@@ -10,6 +10,7 @@
 #ifdef __KERNEL__
 #include 
 #include 
+int bootconfig_is_present(void);
 #else /* !__KERNEL__ */
 /*
  * NOTE: This is only for tools/bootconfig, because tools/bootconfig will
diff --git a/init/main.c b/init/main.c
index 2ca52474d0c30..720a669b1493d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1572,3 +1572,8 @@ static noinline void __init kernel_init_freeable(void)
 
integrity_load_keys();
 }
+
+int bootconfig_is_present(void)
+{
+   return bootconfig_found || IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE);
+}



Give or take placement of the bootconfig_is_present() function's
declaration and definition.

Thanx, Paul

Thanx, Paul

> Thank you,
> 
> 
> > Thank you,
> > 
> > > 
> > >   Thanx, Paul
> > > 
> > > > Thank you!
> > > > 
> > > > > 
> > > > > diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
> > > > > index 902b326e1e560..e5635a6b127b0 100644
> > > > > --- a/fs/proc/bootconfig.c
> > > > > +++ b/fs/proc/bootconfig.c
> > > > > @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char 
> > > > > *dst, size_t size)
> > > > >   break;
> > > > >   dst += ret;
> > > > >   }
> > > > > - if (ret >= 0 && boot_command_line[0]) {
> > > > > - ret = snprintf(dst, rest(dst, end), "# 
> > > > > Parameters from bootloader:\n# %s\n",
> > > > > -boot_command_line);
> > > > > - if (ret > 0)
> > > > > - dst += ret;
> > > > > - }
> > > > > + }
> > > > > + if (ret >= 0 && boot_command_line[0]) {
> > > > > + ret

Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-04 Thread Huang, Kai

On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote:
> > > -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> > > +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg,  
> > > enum sgx_reclaim r)
> > 
> > Is the @r here intentional for shorter typing?
> > 
> 
> yes :-)
> Will speel out to make it consistent if that's the concern.

I kinda prefer the full name to match the CONFIG_CGROUP_SGX_EPC on case.  You
can put the 'enum sgx_reclaim reclaim' parameter into the new line if needed:

static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg,
enum sgx_reclaim reclaim)
{
return 0;
}

Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-04 Thread Huang, Kai

On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote:
> > Please also mention why "leaving asynchronous reclamation to later  
> > patch(es)" is
> > fine.  E.g., it won't break anything I suppose.
> > 
> 
> Right. Pages are still in the global list at the moment and only global  
> reclaiming is active until the "turn on" patch. Separating out is really  
> just for the purpose of review IMHO.

Sounds good to me.  Thanks.

Re: [PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig

2024-04-04 Thread Google

On Fri, 5 Apr 2024 10:23:24 +0900
Masami Hiramatsu (Google)  wrote:

> On Thu, 4 Apr 2024 10:43:14 -0700
> "Paul E. McKenney"  wrote:
> 
> > On Thu, Apr 04, 2024 at 08:55:22AM +0900, Masami Hiramatsu wrote:
> > > On Wed, 3 Apr 2024 12:16:28 -0700
> > > "Paul E. McKenney"  wrote:
> > > 
> > > > commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to
> > > > /proc/bootconfig") adds bootloader argument comments into 
> > > > /proc/bootconfig.
> > > > 
> > > > /proc/bootconfig shows boot_command_line[] multiple times following
> > > > every xbc key value pair, that's duplicated and not necessary.
> > > > Remove redundant ones.
> > > > 
> > > > Output before and after the fix is like:
> > > > key1 = value1
> > > > *bootloader argument comments*
> > > > key2 = value2
> > > > *bootloader argument comments*
> > > > key3 = value3
> > > > *bootloader argument comments*
> > > > ...
> > > > 
> > > > key1 = value1
> > > > key2 = value2
> > > > key3 = value3
> > > > *bootloader argument comments*
> > > > ...
> > > > 
> > > > Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to 
> > > > /proc/bootconfig")
> > > > Signed-off-by: Zhenhua Huang 
> > > > Signed-off-by: Paul E. McKenney 
> > > > Cc: Masami Hiramatsu 
> > > > Cc: 
> > > > Cc: 
> > > 
> > > OOps, good catch! Let me pick it.
> > > 
> > > Acked-by: Masami Hiramatsu (Google) 
> > 
> > Thank you, and I have applied your ack and pulled this into its own
> > bootconfig.2024.04.04a.
> > 
> > My guess is that you will push this via your own tree, and so I will
> > drop my copy as soon as yours hits -next.
> 
> Thanks! I would like to make PR this soon as bootconfig fixes for v6.9-rc2.
> 

Hmm I found that this always shows the command line comment in
/proc/bootconfig even without "bootconfig" option.
I think that is easier for user-tools but changes the behavior and
a bit redundant.

We should skip showing this original argument comment if bootconfig is
not initialized (no "bootconfig" in cmdline) as it is now.

Thank you,


> Thank you,
> 
> > 
> > Thanx, Paul
> > 
> > > Thank you!
> > > 
> > > > 
> > > > diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
> > > > index 902b326e1e560..e5635a6b127b0 100644
> > > > --- a/fs/proc/bootconfig.c
> > > > +++ b/fs/proc/bootconfig.c
> > > > @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char 
> > > > *dst, size_t size)
> > > > break;
> > > > dst += ret;
> > > > }
> > > > -   if (ret >= 0 && boot_command_line[0]) {
> > > > -   ret = snprintf(dst, rest(dst, end), "# 
> > > > Parameters from bootloader:\n# %s\n",
> > > > -  boot_command_line);
> > > > -   if (ret > 0)
> > > > -   dst += ret;
> > > > -   }
> > > > +   }
> > > > +   if (ret >= 0 && boot_command_line[0]) {
> > > > +   ret = snprintf(dst, rest(dst, end), "# Parameters from 
> > > > bootloader:\n# %s\n",
> > > > +  boot_command_line);
> > > > +   if (ret > 0)
> > > > +   dst += ret;
> > > > }
> > > >  out:
> > > > kfree(key);
> > > 
> > > 
> > > -- 
> > > Masami Hiramatsu (Google) 
> 
> 
> -- 
> Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google)

Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-04 Thread Huang, Kai

On Thu, 2024-04-04 at 20:24 -0500, Haitao Huang wrote:
> > Again, IMHO having CONFIG_CGROUP_SGX_EPC here is ugly, because it  
> > doesn't even
> > match the try_charge() above, which doesn't have the  
> > CONFIG_CGROUP_SGX_EPC.
> > 
> > If you add a wrapper in "epc_cgroup.h"
> > 
> Agree. but in sgx.h so sgx_epc_page struct is not exposed in epc_cgroup.h.

I am fine with any place that suits.

Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-04 Thread Haitao Huang


On Thu, 28 Mar 2024 07:53:45 -0500, Huang, Kai  wrote:




--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.


It's 2024 now.

And looks you need to use C style comment for /* Copyright ... */, after  
looking

at some other C files.


Ok, will update years and use C style.


+
+#include 
+#include 
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg:The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+   return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @sgx_cg:The charged sgx cgroup
+ */
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
+{
+   misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+static void sgx_cgroup_free(struct misc_cg *cg)
+{
+   struct sgx_cgroup *sgx_cg;
+
+   sgx_cg = sgx_cgroup_from_misc_cg(cg);
+   if (!sgx_cg)
+   return;
+
+   kfree(sgx_cg);
+}
+
+static int sgx_cgroup_alloc(struct misc_cg *cg);


Again, this declaration can be removed if you move the below structure  
...



+
+const struct misc_res_ops sgx_cgroup_ops = {
+   .alloc = sgx_cgroup_alloc,
+   .free = sgx_cgroup_free,
+};
+
+static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup  
*sgx_cg)

+{
+   cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
+   sgx_cg->cg = cg;
+}
+
+static int sgx_cgroup_alloc(struct misc_cg *cg)
+{
+   struct sgx_cgroup *sgx_cg;
+
+   sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
+   if (!sgx_cg)
+   return -ENOMEM;
+
+   sgx_cgroup_misc_init(cg, sgx_cg);
+
+   return 0;
+}


... here.



yes, thanks


+
+void sgx_cgroup_init(void)
+{
+   misc_cg_set_ops(MISC_CG_RES_SGX_EPC, _cgroup_ops);
+   sgx_cgroup_misc_init(misc_cg_root(), _cg_root);
+}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h  
b/arch/x86/kernel/cpu/sgx/epc_cgroup.h

new file mode 100644
index ..8f794e23fad6
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include 
+#include 
+#include 
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC


Nit: add an empty line to make text more breathable.



ok


+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_cgroup;
+
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+   return NULL;
+}
+
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
+
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+   return 0;
+}
+
+static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
+
+static inline void sgx_cgroup_init(void) { }
+#else


Nit: I prefer two empty lines before and after the 'else'.



ok


+struct sgx_cgroup {
+   struct misc_cg *cg;
+};
+
+static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct  
misc_cg *cg)

+{
+   return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+/**
+ * sgx_get_current_cg() - get the EPC cgroup of current process.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call
+ * sgx_put_cg() to return the reference.
+ *
+ * Return: EPC cgroup to which the current task belongs to.
+ */
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+   return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}


Again, I _think_ you need to check whether get_current_misc_cg() returns  
NULL?


Misc cgroup can be disabled by command line even it is on in the Kconfig.

I am not expert on cgroup, so could you check on this?



Good catch. Will add NULL check in sgx_cgroup_from_misc_cg()


+
+/**
+ * sgx_put_sgx_cg() - Put the EPC cgroup and reduce its ref count.
+ * @sgx_cg - EPC cgroup to put.
+ */
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
+{
+   if (sgx_cg)
+   put_misc_cg(sgx_cg->cg);
+}
+
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_init(void);
+
+#endif
+
+#endif /* _SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c  
b/arch/x86/kernel/cpu/sgx/main.c

index d219f14365d4..023af54c1beb 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -17,6 +18,7 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+#include "epc_cgroup.h"

 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int

Re: [PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig

2024-04-04 Thread Google

On Thu, 4 Apr 2024 10:43:14 -0700
"Paul E. McKenney"  wrote:

> On Thu, Apr 04, 2024 at 08:55:22AM +0900, Masami Hiramatsu wrote:
> > On Wed, 3 Apr 2024 12:16:28 -0700
> > "Paul E. McKenney"  wrote:
> > 
> > > commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to
> > > /proc/bootconfig") adds bootloader argument comments into 
> > > /proc/bootconfig.
> > > 
> > > /proc/bootconfig shows boot_command_line[] multiple times following
> > > every xbc key value pair, that's duplicated and not necessary.
> > > Remove redundant ones.
> > > 
> > > Output before and after the fix is like:
> > > key1 = value1
> > > *bootloader argument comments*
> > > key2 = value2
> > > *bootloader argument comments*
> > > key3 = value3
> > > *bootloader argument comments*
> > > ...
> > > 
> > > key1 = value1
> > > key2 = value2
> > > key3 = value3
> > > *bootloader argument comments*
> > > ...
> > > 
> > > Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to 
> > > /proc/bootconfig")
> > > Signed-off-by: Zhenhua Huang 
> > > Signed-off-by: Paul E. McKenney 
> > > Cc: Masami Hiramatsu 
> > > Cc: 
> > > Cc: 
> > 
> > OOps, good catch! Let me pick it.
> > 
> > Acked-by: Masami Hiramatsu (Google) 
> 
> Thank you, and I have applied your ack and pulled this into its own
> bootconfig.2024.04.04a.
> 
> My guess is that you will push this via your own tree, and so I will
> drop my copy as soon as yours hits -next.

Thanks! I would like to make PR this soon as bootconfig fixes for v6.9-rc2.

Thank you,

> 
>   Thanx, Paul
> 
> > Thank you!
> > 
> > > 
> > > diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
> > > index 902b326e1e560..e5635a6b127b0 100644
> > > --- a/fs/proc/bootconfig.c
> > > +++ b/fs/proc/bootconfig.c
> > > @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char *dst, 
> > > size_t size)
> > >   break;
> > >   dst += ret;
> > >   }
> > > - if (ret >= 0 && boot_command_line[0]) {
> > > - ret = snprintf(dst, rest(dst, end), "# Parameters from 
> > > bootloader:\n# %s\n",
> > > -boot_command_line);
> > > - if (ret > 0)
> > > - dst += ret;
> > > - }
> > > + }
> > > + if (ret >= 0 && boot_command_line[0]) {
> > > + ret = snprintf(dst, rest(dst, end), "# Parameters from 
> > > bootloader:\n# %s\n",
> > > +boot_command_line);
> > > + if (ret > 0)
> > > + dst += ret;
> > >   }
> > >  out:
> > >   kfree(key);
> > 
> > 
> > -- 
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google)

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-04 Thread Google

On Thu, 4 Apr 2024 18:11:09 +0200
Oleg Nesterov  wrote:

> On 04/05, Masami Hiramatsu wrote:
> >
> > Can we make this syscall and uprobe behavior clearer? As you said, if
> > the application use sigreturn or longjump, it may skip returns and
> > shadow stack entries are left in the kernel. In such cases, can uretprobe
> > detect it properly, or just crash the process (or process runs wrongly)?
> 
> Please see the comment in handle_trampoline(), it tries to detect this case.
> This patch should not make any difference.

I think you mean this loop will skip and discard the stacked return_instance
to find the valid one.

do {
/*
 * We should throw out the frames invalidated by longjmp().
 * If this chain is valid, then the next one should be alive
 * or NULL; the latter case means that nobody but ri->func
 * could hit this trampoline on return. TODO: sigaltstack().
 */
next = find_next_ret_chain(ri);
valid = !next || arch_uretprobe_is_alive(next, RP_CHECK_RET, 
regs);

instruction_pointer_set(regs, ri->orig_ret_vaddr);
do {
if (valid)
handle_uretprobe_chain(ri, regs);
ri = free_ret_instance(ri);
utask->depth--;
} while (ri != next);
} while (!valid);

I think this expects setjmp/longjmp as below

foo() { <- retprobe1
setjmp()
bar() { <- retprobe2
longjmp()
}
} <- return to trampoline

In this case, we need to skip retprobe2's instance.
My concern is, if we can not find appropriate return instance, what happen?
e.g.

foo() { <-- retprobe1
   bar() { # sp is decremented
   sys_uretprobe() <-- ??
}
}

It seems sys_uretprobe() will handle retprobe1 at that point instead of
SIGILL.

Can we avoid this with below strict check?

if (ri->stack != regs->sp + expected_offset)
goto sigill;

expected_offset should be 16 (push * 3 - ret) on x64 if we ri->stack is the
regs->sp right after call.

Thank you,

-- 
Masami Hiramatsu (Google)

Re: [PATCH net-next v2 5/6] mptcp: support rstreason for passive reset

2024-04-04 Thread Jason Xing

Hello Mat,

On Fri, Apr 5, 2024 at 4:33 AM Mat Martineau  wrote:
>
> On Thu, 4 Apr 2024, Jason Xing wrote:
>
> > From: Jason Xing 
> >
> > It relys on what reset options in MPTCP does as rfc8684 says. Reusing
> > this logic can save us much energy. This patch replaces all the prior
> > NOT_SPECIFIED reasons.
> >
> > Signed-off-by: Jason Xing 
> > ---
> > net/mptcp/subflow.c | 26 --
> > 1 file changed, 20 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
> > index a68d5d0f3e2a..24668d3020aa 100644
> > --- a/net/mptcp/subflow.c
> > +++ b/net/mptcp/subflow.c
> > @@ -304,7 +304,10 @@ static struct dst_entry *subflow_v4_route_req(const 
> > struct sock *sk,
> >
> >   dst_release(dst);
> >   if (!req->syncookie)
> > - tcp_request_sock_ops.send_reset(sk, skb, 
> > SK_RST_REASON_NOT_SPECIFIED);
> > + /* According to RFC 8684, 3.2. Starting a New Subflow,
> > +  * we should use an "MPTCP specific error" reason code.
> > +  */
> > + tcp_request_sock_ops.send_reset(sk, skb, 
> > SK_RST_REASON_MPTCP_RST_EMPTCP);
>
> Hi Jason -
>
> In this case, the MPTCP reset reason is set in subflow_check_req(). Looks
> like it uses EMPTCP but that isn't guaranteed to stay the same. I think it
> would be better to extract the reset reason from the skb extension or the
> subflow context "reset_reason" field.

Good suggestions :)

>
>
> >   return NULL;
> > }
> >
> > @@ -371,7 +374,10 @@ static struct dst_entry *subflow_v6_route_req(const 
> > struct sock *sk,
> >
> >   dst_release(dst);
> >   if (!req->syncookie)
> > - tcp6_request_sock_ops.send_reset(sk, skb, 
> > SK_RST_REASON_NOT_SPECIFIED);
> > + /* According to RFC 8684, 3.2. Starting a New Subflow,
> > +  * we should use an "MPTCP specific error" reason code.
> > +  */
> > + tcp6_request_sock_ops.send_reset(sk, skb, 
> > SK_RST_REASON_MPTCP_RST_EMPTCP);
>
> Same issue here.

Got it.

>
> >   return NULL;
> > }
> > #endif
> > @@ -778,6 +784,7 @@ static struct sock *subflow_syn_recv_sock(const struct 
> > sock *sk,
> >   bool fallback, fallback_is_fatal;
> >   struct mptcp_sock *owner;
> >   struct sock *child;
> > + int reason;
> >
> >   pr_debug("listener=%p, req=%p, conn=%p", listener, req, 
> > listener->conn);
> >
> > @@ -833,7 +840,8 @@ static struct sock *subflow_syn_recv_sock(const struct 
> > sock *sk,
> >*/
> >   if (!ctx || fallback) {
> >   if (fallback_is_fatal) {
> > - subflow_add_reset_reason(skb, 
> > MPTCP_RST_EMPTCP);
> > + reason = MPTCP_RST_EMPTCP;
> > + subflow_add_reset_reason(skb, reason);
> >   goto dispose_child;
> >   }
> >   goto fallback;
> > @@ -861,7 +869,8 @@ static struct sock *subflow_syn_recv_sock(const struct 
> > sock *sk,
> >   } else if (ctx->mp_join) {
> >   owner = subflow_req->msk;
> >   if (!owner) {
> > - subflow_add_reset_reason(skb, 
> > MPTCP_RST_EPROHIBIT);
> > + reason = MPTCP_RST_EPROHIBIT;
> > + subflow_add_reset_reason(skb, reason);
> >   goto dispose_child;
> >   }
> >
> > @@ -875,13 +884,18 @@ static struct sock *subflow_syn_recv_sock(const 
> > struct sock *sk,
> >ntohs(inet_sk((struct sock 
> > *)owner)->inet_sport));
> >   if (!mptcp_pm_sport_in_anno_list(owner, sk)) {
> >   SUBFLOW_REQ_INC_STATS(req, 
> > MPTCP_MIB_MISMATCHPORTACKRX);
> > + reason = MPTCP_RST_EUNSPEC;
>
> I think the MPTCP code here should have been using MPTCP_RST_EPROHIBIT.

I'll update in the V2 of the patch.

Thanks,
Jason

[PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-04 Thread Ho-Ren (Jack) Chuang

The current implementation treats emulated memory devices, such as
CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to
distinguish them. Thus, we modify the tiered memory initialization process
to introduce a delay specifically for CPUless NUMA nodes. This delay
ensures that the memory tier initialization for these nodes is deferred
until HMAT information is obtained during the boot process. Finally,
demotion tables are recalculated at the end.

* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be excluded
in the late init.

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information.
If no HMAT is specified, it falls back to using the default DRAM tier.

* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUlist nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock` to
protect `default_dram_perf_*`. This approach not only avoids deadlock
but also prevents holding a large lock simultaneously.

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information is
available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.
Because late initialized memory and default DRAM memory need to be managed,
a default memory type is created for storing all memory types that are
not initialized by device drivers and as a fallback.

Signed-off-by: Ho-Ren (Jack) Chuang 
Signed-off-by: Hao Xiang 
Reviewed-by: "Huang, Ying" 
---
 mm/memory-tiers.c | 94 +++
 1 file changed, 70 insertions(+), 24 deletions(-)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 516b144fd45a..6632102bd5c9 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -36,6 +36,11 @@ struct node_memory_type_map {
 
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * The list is used to store all memory types that are not created
+ * by a device driver.
+ */
+static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;
 
@@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
 
 static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
 
+/* The lock is used to protect `default_dram_perf*` info and nid. */
+static DEFINE_MUTEX(default_dram_perf_lock);
 static bool default_dram_perf_error;
 static struct access_coordinate default_dram_perf;
 static int default_dram_perf_ref_nid = NUMA_NO_NODE;
@@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct 
memory_dev_type *mem
 static struct memory_tier *set_node_memory_tier(int node)
 {
struct memory_tier *memtier;
-   struct memory_dev_type *memtype;
+   struct memory_dev_type *memtype = default_dram_type;
+   int adist = MEMTIER_ADISTANCE_DRAM;
pg_data_t *pgdat = NODE_DATA(node);
 
 
@@ -514,7 +522,16 @@ static struct memory_tier *set_node_memory_tier(int node)
if (!node_state(node, N_MEMORY))
return ERR_PTR(-EINVAL);
 
-   __init_node_memory_type(node, default_dram_type);
+   mt_calc_adistance(node, );
+   if (!node_memory_types[node].memtype) {
+   memtype = mt_find_alloc_memory_type(adist, 
_memory_types);
+   if (IS_ERR(memtype)) {
+   memtype = default_dram_type;
+   pr_info("Failed to allocate a memory type. Fall 
back.\n");
+   }
+   }
+
+   __init_node_memory_type(node, memtype);
 
memtype = node_memory_types[node].memtype;
node_set(node, memtype->nodes);
@@ -652,6 +669,35 @@ void mt_put_memory_types(struct list_head *memory_types)
 }
 EXPORT_SYMBOL_GPL(mt_put_memory_types);
 
+/*
+ * This is invoked via `late_initcall()` to initialize memory tiers for
+ * CPU-less memory nodes after driver initialization, which is
+ * expected to provide `adistance` algorithms.
+ */
+static int __init memory_tier_late_init(void)
+{
+   int nid;
+
+   guard(mutex)(_tier_lock);
+   for_each_node_state(nid, N_MEMORY) {
+   /*
+*

[PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-04 Thread Ho-Ren (Jack) Chuang

Since different memory devices require finding, allocating, and putting
memory types, these common steps are abstracted in this patch,
enhancing the scalability and conciseness of the code.

Signed-off-by: Ho-Ren (Jack) Chuang 
Reviewed-by: "Huang, Ying" 
---
 drivers/dax/kmem.c   | 30 --
 include/linux/memory-tiers.h | 13 +
 mm/memory-tiers.c| 29 +
 3 files changed, 46 insertions(+), 26 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..4fe9d040e375 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,36 +55,14 @@ static LIST_HEAD(kmem_memory_types);
 
 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-   bool found = false;
-   struct memory_dev_type *mtype;
-
-   mutex_lock(_memory_type_lock);
-   list_for_each_entry(mtype, _memory_types, list) {
-   if (mtype->adistance == adist) {
-   found = true;
-   break;
-   }
-   }
-   if (!found) {
-   mtype = alloc_memory_type(adist);
-   if (!IS_ERR(mtype))
-   list_add(>list, _memory_types);
-   }
-   mutex_unlock(_memory_type_lock);
-
-   return mtype;
+   guard(mutex)(_memory_type_lock);
+   return mt_find_alloc_memory_type(adist, _memory_types);
 }
 
 static void kmem_put_memory_types(void)
 {
-   struct memory_dev_type *mtype, *mtn;
-
-   mutex_lock(_memory_type_lock);
-   list_for_each_entry_safe(mtype, mtn, _memory_types, list) {
-   list_del(>list);
-   put_memory_type(mtype);
-   }
-   mutex_unlock(_memory_type_lock);
+   guard(mutex)(_memory_type_lock);
+   mt_put_memory_types(_memory_types);
 }
 
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..0d70788558f4 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+ struct list_head 
*memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
access_coordinate *perf, int *adis
 {
return -EIO;
 }
+
+static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct 
list_head *memory_types)
+{
+   return NULL;
+}
+
+static inline void mt_put_memory_types(struct list_head *memory_types)
+{
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..516b144fd45a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -623,6 +623,35 @@ void clear_node_memory_type(int node, struct 
memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+{
+   struct memory_dev_type *mtype;
+
+   list_for_each_entry(mtype, memory_types, list)
+   if (mtype->adistance == adist)
+   return mtype;
+
+   mtype = alloc_memory_type(adist);
+   if (IS_ERR(mtype))
+   return mtype;
+
+   list_add(>list, memory_types);
+
+   return mtype;
+}
+EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+   struct memory_dev_type *mtype, *mtn;
+
+   list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+   list_del(>list);
+   put_memory_type(mtype);
+   }
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 static void dump_hmem_attrs(struct access_coordinate *coord, const char 
*prefix)
 {
pr_info(
-- 
Ho-Ren (Jack) Chuang

[PATCH v11 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-04-04 Thread Ho-Ren (Jack) Chuang

When a memory device, such as CXL1.1 type3 memory, is emulated as
normal memory (E820_TYPE_RAM), the memory device is indistinguishable from
normal DRAM in terms of memory tiering with the current implementation.
The current memory tiering assigns all detected normal memory nodes to
the same DRAM tier. This results in normal memory devices with different
attributions being unable to be assigned to the correct memory tier,
leading to the inability to migrate pages between different
types of memory.
https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the
initialization of memory tiers for CPUless NUMA nodes until they obtain
HMAT information and after all devices are initialized at boot time,
eliminating the need for user intervention. If no HMAT is specified,
it falls back to using `default_dram_type`.

Example usecase:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance
attributes through QEMU, and the guest now sees memory nodes with
performance attributes in HMAT. With this change, we enable the
guest kernel to construct the correct memory tiering for the memory nodes.

- v11:
 Thanks to comments from Jonathan,
 * Replace `mutex_lock()` with `guard(mutex)()`
 * Reorder some modifications within the patchset
 * Rewrite the code for improved readability and fixing alignment issues
 * Pass all strict rules in checkpatch.pl
- v10:
 Thanks to Andrew's and SeongJae's comments,
 * Address kunit compilation errors
 * Resolve the bug of not returning the correct error code in
   `mt_perf_to_adistance`
 * 
https://lore.kernel.org/lkml/20240402001739.2521623-1-horenchu...@bytedance.com/T/#u
-v9:
 * Address corner cases in `memory_tier_late_init`. Thank Ying's comments.
 * 
https://lore.kernel.org/lkml/20240329053353.309557-1-horenchu...@bytedance.com/T/#u
-v8:
 * Fix email format
 * 
https://lore.kernel.org/lkml/20240329004815.195476-1-horenchu...@bytedance.com/T/#u
-v7:
 * Add Reviewed-by: "Huang, Ying" 
-v6:
 Thanks to Ying's comments,
 * Move `default_dram_perf_lock` to the function's beginning for clarity
 * Fix double unlocking at v5
 * 
https://lore.kernel.org/lkml/20240327072729.3381685-1-horenchu...@bytedance.com/T/#u
-v5:
 Thanks to Ying's comments,
 * Add comments about what is protected by `default_dram_perf_lock`
 * Fix an uninitialized pointer mtype
 * Slightly shorten the time holding `default_dram_perf_lock`
 * Fix a deadlock bug in `mt_perf_to_adistance`
 * 
https://lore.kernel.org/lkml/20240327041646.3258110-1-horenchu...@bytedance.com/T/#u
-v4:
 Thanks to Ying's comments,
 * Remove redundant code
 * Reorganize patches accordingly
 * 
https://lore.kernel.org/lkml/20240322070356.315922-1-horenchu...@bytedance.com/T/#u
-v3:
 Thanks to Ying's comments,
 * Make the newly added code independent of HMAT
 * Upgrade set_node_memory_tier to support more cases
 * Put all non-driver-initialized memory types into default_memory_types
   instead of using hmat_memory_types
 * find_alloc_memory_type -> mt_find_alloc_memory_type
 * 
https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchu...@bytedance.com/T/#u
-v2:
 Thanks to Ying's comments,
 * Rewrite cover letter & patch description
 * Rename functions, don't use _hmat
 * Abstract common functions into find_alloc_memory_type()
 * Use the expected way to use set_node_memory_tier instead of modifying it
 * 
https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
-v1:
 * 
https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: introduce an abstract layer for finding,
allocating, and putting memory types
  memory tier: create CPUless memory tiers after obtaining HMAT info

 drivers/dax/kmem.c   |  30 ++---
 include/linux/memory-tiers.h |  13 
 mm/memory-tiers.c| 123 ---
 3 files changed, 116 insertions(+), 50 deletions(-)

-- 
Ho-Ren (Jack) Chuang

Re: [PATCH v3 23/25] drivers: media: i2c: imx258: Add support for powerdown gpio

2024-04-04 Thread Luis Garcia

On 4/4/24 08:12, Dave Stevenson wrote:
> Hi Luigi
> 
> On Wed, 3 Apr 2024 at 20:34, Luigi311  wrote:
>>
>> On 4/3/24 10:57, Ondřej Jirman wrote:
>>> Hi Sakari and Luis,
>>>
>>> On Wed, Apr 03, 2024 at 04:25:41PM GMT, Sakari Ailus wrote:
 Hi Luis, Ondrej,

 On Wed, Apr 03, 2024 at 09:03:52AM -0600, g...@luigi311.com wrote:
> From: Luis Garcia 
>
> On some boards powerdown signal needs to be deasserted for this
> sensor to be enabled.
>
> Signed-off-by: Ondrej Jirman 
> Signed-off-by: Luis Garcia 
> ---
>  drivers/media/i2c/imx258.c | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c
> index 30352c33f63c..163f04f6f954 100644
> --- a/drivers/media/i2c/imx258.c
> +++ b/drivers/media/i2c/imx258.c
> @@ -679,6 +679,8 @@ struct imx258 {
> unsigned int lane_mode_idx;
> unsigned int csi2_flags;
>
> +   struct gpio_desc *powerdown_gpio;
> +
> /*
>  * Mutex for serialized access:
>  * Protect sensor module set pad format and start/stop streaming 
> safely.
> @@ -1213,6 +1215,8 @@ static int imx258_power_on(struct device *dev)
> struct imx258 *imx258 = to_imx258(sd);
> int ret;
>
> +   gpiod_set_value_cansleep(imx258->powerdown_gpio, 0);

 What does the spec say? Should this really happen before switching on the
 supplies below?
>>>
>>> There's no powerdown input in the IMX258 manual. The manual only mentions
>>> that XCLR (reset) should be held low during power on.
>>>
>>> https://megous.com/dl/tmp/15b0992a720ab82d.png
>>>
>>> https://megous.com/dl/tmp/f2cc991046d97641.png
>>>
>>>This sensor doesn’t have a built-in “Power ON Reset” function. The XCLR 
>>> pin
>>>is set to “LOW” and the power supplies are brought up. Then the XCLR pin
>>>should be set to “High” after INCK supplied.
>>>
>>> So this input is some feature on camera module itself outside of the
>>> IMX258 chip, which I think is used to gate power to the module. Eg. on 
>>> Pinephone
>>> Pro, there are two modules with shared power rails, so enabling supply to
>>> one module enables it to the other one, too. So this input becomes the only 
>>> way
>>> to really enable/disable power to the chip when both are used at once at 
>>> some
>>> point, because regulator_bulk_enable/disable becomes ineffective at that 
>>> point.
>>>
>>> Luis, maybe you saw some other datasheet that mentions this input? IMO,
>>> it just gates the power rails via some mosfets on the module itself, since
>>> there's not power down input to the chip itself.
>>>
>>> kind regards,
>>>   o.
>>>
>>
>> Ondrej, I did not see anything else in the datasheet since I'm pretty sure
>> I'm looking at the same datasheet as it was supplied to me by Pine64. I'm
>> not sure what datasheet Dave has access to since he got his for a
>> completely different module than what we are testing with though.
> 
> I only have a leaked datasheet (isn't the internet wonderful!)  [1]
> XCLR is documented in that, as Ondrej has said.
> 
> If this powerdown GPIO is meant to be driving XCLR, then it is in the
> wrong order against the supplies.
> 
> This does make me confused over the difference between this powerdown
> GPIO and the reset GPIO that you implement in 24/25.
> 
> Following the PinePhone Pro DT [3] and schematics [4]
> reset-gpios = < RK_PA0 GPIO_ACTIVE_LOW>;
> powerdown-gpios = < RK_PD4 GPIO_ACTIVE_HIGH>;
> 
> Schematic page 11 upper right block
> GPIO1_A0/ISP0_SHUTTER_EN/ISP1_SHUTTER_EN/TCPD_VBUS_SINK_EN_d becomes
> Camera_RST_L. Page 18 feeds that through to the RESET on the camera
> connector.
> Page 11 left middle block GPIO2_D4/SDIO0_BKPWR_d becomes DVP_PDN1_H.
> Page 18 feeds that through to the PWDN on the camera connector.
> 
> Seeing as we apparently have a lens driver kicking around as well,
> potentially one is reset to the VCM, and one to the sensor? DW9714
> does have an XSD shutdown pin.
> Only the module integrator is going to really know the answer,
> although potentially a little poking with gpioset and i2cdetect may
> tell you more.
> 
>   Dave
> 
> [1] https://web.archive.org/web/20201027131326/www.hi.app/IMX258-datasheet.pdf
> [2] 
> https://files.pine64.org/doc/PinePhonePro/PinephonePro-Schematic-V1.0-20211127.pdf
> [3] 
> https://xff.cz/git/linux/tree/arch/arm64/boot/dts/rockchip/rk3399-pinephone-pro.dts?h=orange-pi-5.18#n868
> [4] 
> https://files.pine64.org/doc/PinePhonePro/PinephonePro-Schematic-V1.0-20211127.pdf
> 
> 

Out of curiosity I dropped this and tested it on my PPP and it still loads
up the camera correctly so I am fine with dropping this and patch 22 that
adds in the dt binding

> +
> ret = regulator_bulk_enable(IMX258_NUM_SUPPLIES,
> imx258->supplies);
> if (ret) {
> @@ -1224,6 +1228,7 @@ static int imx258_power_on(struct device *dev)
> ret =

Re: 回复：回复：general protection fault in refill_obj_stock

2024-04-04 Thread Roman Gushchin

On Tue, Apr 02, 2024 at 02:14:58PM +0800, Ubisectech Sirius wrote:
> > On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote:
> >>> On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> >>> Hello.
> >>> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> >>> Recently, our team has discovered a issue in Linux kernel 6.7. Attached 
> >>> to the email were a PoC file of the issue.
> >>
> >>> Thank you for the report!
> >>
> >>> I tried to compile and run your test program for about half an hour
> >>> on a virtual machine running 6.7 with enabled KASAN, but wasn't able
> >>> to reproduce the problem.
> >> 
> >>> Can you, please, share a bit more information? How long does it take
> >>> to reproduce? Do you mind sharing your kernel config? Is there anything 
> >>> special
> >>> about your setup? What are exact steps to reproduce the problem?
> >>> Is this problem reproducible on 6.6?
> >> 
> >> Hi. 
> >> The .config of linux kernel 6.7 has send to you as attachment.
> > Thanks!
> > How long it takes to reproduce a problem? Do you just start your reproducer 
> > and wait?
> I just start the reproducer and wait without any other operation. The speed 
> of reproducing this problem is vary fast(Less than 5 seconds). 
> >> And The problem is reproducible on 6.6.
> > Hm, it rules out my recent changes.
> > Did you try any older kernels? 6.5? 6.0? Did you try to bisect the problem?
> > if it's fast to reproduce, it might be the best option.
> I have try the 6.0, 6.3, 6.4, 6.5 kernel. The Linux kernel 6.5 will get same 
> error output. But other version will get different output like below:
> [ 55.306672][ T7950] KASAN: null-ptr-deref in range 
> [0x0018-0x001f]
> [ 55.307259][ T7950] CPU: 1 PID: 7950 Comm: poc Not tainted 6.3.0 #1
> [ 55.307714][ T7950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.15.0-1 04/01/2014
> [ 55.308363][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173)
> [ 55.316475][ T7950] Call Trace:
> [ 55.316713][ T7950] 
> [ 55.317353][ T7950] tomoyo_path_permission (security/tomoyo/file.c:170 
> security/tomoyo/file.c:587 security/tomoyo/file.c:573)
> [ 55.317744][ T7950] tomoyo_check_open_permission (security/tomoyo/file.c:779)
> [ 55.320152][ T7950] tomoyo_file_open (security/tomoyo/tomoyo.c:332 
> security/tomoyo/tomoyo.c:327)
> [ 55.320495][ T7950] security_file_open (security/security.c:1719 
> (discriminator 13))
> [ 55.320850][ T7950] do_dentry_open (fs/open.c:908)
> [ 55.321526][ T7950] path_openat (fs/namei.c:3561 fs/namei.c:3715)
> [ 55.322614][ T7950] do_filp_open (fs/namei.c:3743)
> [ 55.325086][ T7950] do_sys_openat2 (fs/open.c:1349)
> [ 55.326249][ T7950] __x64_sys_openat (fs/open.c:1375)
> [ 55.327428][ T7950] do_syscall_64 (arch/x86/entry/common.c:50 
> arch/x86/entry/common.c:80)
> [ 55.327756][ T7950] entry_SYSCALL_64_after_hwframe 
> (arch/x86/entry/entry_64.S:120)
> [ 55.328185][ T7950] RIP: 0033:0x7f1c4a484f29
> [ 55.328504][ T7950] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 
> 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48
> [ 55.329864][ T7950] RSP: 002b:7ffd7bfe8398 EFLAGS: 0246 ORIG_RAX: 
> 0101
> [ 55.330464][ T7950] RAX: ffda RBX:  RCX: 
> 7f1c4a484f29
> [ 55.331024][ T7950] RDX: 00141842 RSI: 2380 RDI: 
> ff9c
> [ 55.331585][ T7950] RBP: 7ffd7bfe83a0 R08:  R09: 
> 7ffd7bfe83f0
> [ 55.332148][ T7950] R10:  R11: 0246 R12: 
> 55c5e36482d0
> [ 55.332707][ T7950] R13:  R14:  R15: 
> 
> [ 55.333268][ T7950] 
> [ 55.333488][ T7950] Modules linked in:
> [ 55.340525][ T7950] ---[ end trace  ]---
> [ 55.340936][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173)
> It look like other problem?
> > Also, are you running vanilla kernels or you do have some custom changes on 
> > top?
> I haven't made any custom changes. 
> >Thanks!

Ok, I installed a new toolchain, built a kernel with your config and reproduced 
the (a?) problem.
It definitely smells a generic memory corruption, as I get new stacktraces 
every time I run it.
I got something similar to your tomoyo stacktrace, then I got something about
ima_add_template_entry() and then something else. Never saw your original 
obj_cgroup_get()
stack.

It seems to be connected to your very full kernel config, as I can't reproduce 
anything
with my original more minimal config. It also doesn't seem to be connected to 
the
kernel memory accounting directly.

It would be helpful to understand which kernel config options are required to 
reproduce
the issue as well as what exactly the reproducer does. I'll try to spend some 
cycles
on this, but can't promise much.

Thanks!

Re: [PATCH v3 19/25] media: i2c: imx258: Change register settings for variants of the sensor

2024-04-04 Thread Luigi311

On 4/3/24 10:18, Sakari Ailus wrote:
> Hi Luis, Dave,
> 
> On Wed, Apr 03, 2024 at 09:03:48AM -0600, g...@luigi311.com wrote:
>> From: Dave Stevenson 
>>
>> Sony have advised that there are variants of the IMX258 sensor which
>> require slightly different register configuration to the mainline
>> imx258 driver defaults.
>>
>> There is no available run-time detection for the variant, so add
>> configuration via the DT compatible string.
>>
>> The Vision Components imx258 module supports PDAF, so add the
>> register differences for that variant
>>
>> Signed-off-by: Dave Stevenson 
>> Signed-off-by: Luis Garcia 
>> ---
>>  drivers/media/i2c/imx258.c | 48 ++
>>  1 file changed, 44 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c
>> index 775d957c9b87..fa48da212037 100644
>> --- a/drivers/media/i2c/imx258.c
>> +++ b/drivers/media/i2c/imx258.c
>> @@ -6,6 +6,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -321,8 +322,6 @@ static const struct imx258_reg mipi_642mbps_24mhz_4l[] = 
>> {
>>  
>>  static const struct imx258_reg mode_common_regs[] = {
>>  { 0x3051, 0x00 },
>> -{ 0x3052, 0x00 },
>> -{ 0x4E21, 0x14 },
>>  { 0x6B11, 0xCF },
>>  { 0x7FF0, 0x08 },
>>  { 0x7FF1, 0x0F },
>> @@ -345,7 +344,6 @@ static const struct imx258_reg mode_common_regs[] = {
>>  { 0x7FA8, 0x03 },
>>  { 0x7FA9, 0xFE },
>>  { 0x7B24, 0x81 },
>> -{ 0x7B25, 0x00 },
>>  { 0x6564, 0x07 },
>>  { 0x6B0D, 0x41 },
>>  { 0x653D, 0x04 },
>> @@ -460,6 +458,33 @@ static const struct imx258_reg mode_1048_780_regs[] = {
>>  { 0x034F, 0x0C },
>>  };
>>  
>> +struct imx258_variant_cfg {
>> +const struct imx258_reg *regs;
>> +unsigned int num_regs;
>> +};
>> +
>> +static const struct imx258_reg imx258_cfg_regs[] = {
>> +{ 0x3052, 0x00 },
>> +{ 0x4E21, 0x14 },
>> +{ 0x7B25, 0x00 },
>> +};
>> +
>> +static const struct imx258_variant_cfg imx258_cfg = {
>> +.regs = imx258_cfg_regs,
>> +.num_regs = ARRAY_SIZE(imx258_cfg_regs),
>> +};
>> +
>> +static const struct imx258_reg imx258_pdaf_cfg_regs[] = {
>> +{ 0x3052, 0x01 },
>> +{ 0x4E21, 0x10 },
>> +{ 0x7B25, 0x01 },
>> +};
>> +
>> +static const struct imx258_variant_cfg imx258_pdaf_cfg = {
>> +.regs = imx258_pdaf_cfg_regs,
>> +.num_regs = ARRAY_SIZE(imx258_pdaf_cfg_regs),
>> +};
>> +
>>  static const char * const imx258_test_pattern_menu[] = {
>>  "Disabled",
>>  "Solid Colour",
>> @@ -637,6 +662,8 @@ struct imx258 {
>>  struct v4l2_subdev sd;
>>  struct media_pad pad;
>>  
>> +const struct imx258_variant_cfg *variant_cfg;
>> +
>>  struct v4l2_ctrl_handler ctrl_handler;
>>  /* V4L2 Controls */
>>  struct v4l2_ctrl *link_freq;
>> @@ -1104,6 +1131,14 @@ static int imx258_start_streaming(struct imx258 
>> *imx258)
>>  return ret;
>>  }
>>  
>> +ret = imx258_write_regs(imx258, imx258->variant_cfg->regs,
>> +imx258->variant_cfg->num_regs);
>> +if (ret) {
>> +dev_err(>dev, "%s failed to set variant config\n",
>> +__func__);
>> +return ret;
>> +}
>> +
>>  ret = imx258_write_reg(imx258, IMX258_CLK_BLANK_STOP,
>> IMX258_REG_VALUE_08BIT,
>> imx258->csi2_flags & 
>> V4L2_MBUS_CSI2_NONCONTINUOUS_CLOCK ?
>> @@ -1492,6 +1527,10 @@ static int imx258_probe(struct i2c_client *client)
>>  
>>  imx258->csi2_flags = ep.bus.mipi_csi2.flags;
>>  
>> +imx258->variant_cfg = of_device_get_match_data(>dev);
> 
> You'll also need to keep this working for ACPI based systems. I.e. in
> practice remove "of_" prefix here and add the non-PDAF variant data to the
> relevant ACPI ID list.
> 

Removing of_ is easy enough and looking at all the other commits that make
this change in other drivers I dont see anything else being done besides
adding in the .data section that is down below for both imx258 and pdaf
versions. Is that what you are referencing or is there some other place
to add variant data to ACPI ID list?

>> +if (!imx258->variant_cfg)
>> +imx258->variant_cfg = _cfg;
>> +
>>  /* Initialize subdev */
>>  v4l2_i2c_subdev_init(>sd, client, _subdev_ops);
>>  
>> @@ -1579,7 +1618,8 @@ MODULE_DEVICE_TABLE(acpi, imx258_acpi_ids);
>>  #endif
>>  
>>  static const struct of_device_id imx258_dt_ids[] = {
>> -{ .compatible = "sony,imx258" },
>> +{ .compatible = "sony,imx258", .data = _cfg },
>> +{ .compatible = "sony,imx258-pdaf", .data = _pdaf_cfg },
>>  { /* sentinel */ }
>>  };
>>  MODULE_DEVICE_TABLE(of, imx258_dt_ids);
>

Re: [PATCH v3 12/25] media: i2c: imx258: Allow configuration of clock lane behaviour

2024-04-04 Thread Luigi311

On 4/3/24 12:48, Pavel Machek wrote:
> Hi!
> 
>> The sensor supports the clock lane either remaining in HS mode
>> during frame blanking, or dropping to LP11.
>>
>> Add configuration of the mode via V4L2_MBUS_CSI2_NONCONTINUOUS_CLOCK.
> 
>> +ret = imx258_write_reg(imx258, IMX258_CLK_BLANK_STOP,
>> +   IMX258_REG_VALUE_08BIT,
>> +   imx258->csi2_flags & 
>> V4L2_MBUS_CSI2_NONCONTINUOUS_CLOCK ?
>> +   1 : 0);
> 
> !! can be used to turn value into 1/0. I find it easier to read than ?
> 1 : 0  combination, but possibly that's fine, too.
> 
> Best regards,
>   Pavel
> 

I assume you mean by using 

!!(imx258->csi2_flags & V4L2_MBUS_CSI2_NONCONTINUOUS_CLOCK)

I can go ahead and use that instead

Re: [PATCH] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-04 Thread Verma, Vishal L

On Thu, 2024-04-04 at 14:23 -0700, Andrew Morton wrote:
> On Tue, 02 Apr 2024 00:24:28 -0600 Vishal Verma  
> wrote:
> 
> > In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> > referenced patch should be replaced with lockdep_assert_held(_write)().
> > 
> > Replace those, and additionally, replace a couple of other
> > WARN_ON_ONCE() introduced in the same patch for actual failure
> > cases (i.e. when acquiring a semaphore fails in a remove / unregister
> > path) with dev_WARN_ONCE() as is the precedent here.
> > 
> > Recall that previously, unregistration paths was implicitly protected by
> > overloading the device lock, which the patch in [1] sought to remove.
> > This meant adding a semaphore acquisition in these unregistration paths.
> > Since that can fail, and it doesn't make sense to return errors from
> > these paths, retain the two instances of (now) dev_WARN_ONCE().
> > 
> > ...
> > 
> > @@ -471,6 +471,7 @@ static void __unregister_dev_dax(void *dev)
> >  
> >     dev_dbg(dev, "%s\n", __func__);
> >  
> > +   lockdep_assert_held_write(_region_rwsem);
> >     kill_dev_dax(dev_dax);
> >     device_del(dev);
> >     free_dev_dax_ranges(dev_dax);
> 
> This is new and unchangelogged?
> 
> I'm taking Dan's reply to your patch as Not-A-Nack ;)
> 
True, but with Dan's new feedback, that results in a bit more rework,
this will likely turn into 2-3 patches. Working on it now, will be out
shortly!

Re: (subset) [PATCH v2 0/3] Split sony-castor into shinano-common and add Sony Xperia Z3

2024-04-04 Thread Bjorn Andersson



On Thu, 14 Mar 2024 19:56:21 +0100, Luca Weiss wrote:
> Prepare for adding sony-leo dts by splitting common parts into a
> separate dtsi file.
> 
> Then add the dts for Sony Xperia Z3.
> 
> Depends on:
> https://lore.kernel.org/linux-arm-msm/20240306-castor-changes-v1-0-2286eaf85...@z3ntu.xyz/T/
> 
> [...]

Applied, thanks!

[1/3] ARM: dts: qcom: msm8974-sony-castor: Split into shinano-common
  commit: 53426f53eda5e4a17197a8bc7dd1045601db407e
[3/3] ARM: dts: qcom: Add Sony Xperia Z3 smartphone
  commit: 8d91a5a4a6f5aff714a14ac4a86931aa789655d8

Best regards,
-- 
Bjorn Andersson

Re: [PATCH] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-04 Thread Andrew Morton

On Tue, 02 Apr 2024 00:24:28 -0600 Vishal Verma  
wrote:

> In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> referenced patch should be replaced with lockdep_assert_held(_write)().
> 
> Replace those, and additionally, replace a couple of other
> WARN_ON_ONCE() introduced in the same patch for actual failure
> cases (i.e. when acquiring a semaphore fails in a remove / unregister
> path) with dev_WARN_ONCE() as is the precedent here.
> 
> Recall that previously, unregistration paths was implicitly protected by
> overloading the device lock, which the patch in [1] sought to remove.
> This meant adding a semaphore acquisition in these unregistration paths.
> Since that can fail, and it doesn't make sense to return errors from
> these paths, retain the two instances of (now) dev_WARN_ONCE().
> 
> ...
>
> @@ -471,6 +471,7 @@ static void __unregister_dev_dax(void *dev)
>  
>   dev_dbg(dev, "%s\n", __func__);
>  
> + lockdep_assert_held_write(_region_rwsem);
>   kill_dev_dax(dev_dax);
>   device_del(dev);
>   free_dev_dax_ranges(dev_dax);

This is new and unchangelogged?

I'm taking Dan's reply to your patch as Not-A-Nack ;)

Re: (subset) [PATCH 1/1] clk: qcom: smd-rpm: Restore msm8976 num_clk

2024-04-04 Thread Bjorn Andersson



On Mon, 01 Apr 2024 19:16:39 +0200, Adam Skladowski wrote:
> During rework somehow msm8976 num_clk got removed, restore it.
> 
> 

Applied, thanks!

[1/1] clk: qcom: smd-rpm: Restore msm8976 num_clk
  commit: 0d4ce2458cd7d1d66a5ee2f3c036592fb663d5bc

Best regards,
-- 
Bjorn Andersson

Re: [External] Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-04 Thread Ho-Ren (Jack) Chuang

Hi Jonathan,

Thank you! I will fix them and send a V11 soon.

On Thu, Apr 4, 2024 at 6:37 AM Jonathan Cameron
 wrote:
>
> 
>
> > > > @@ -858,7 +910,8 @@ static int __init memory_tier_init(void)
> > > >* For now we can have 4 faster memory tiers with smaller 
> > > > adistance
> > > >* than default DRAM tier.
> > > >*/
> > > > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> > > > + default_dram_type = 
> > > > mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> > > > + 
> > > > _memory_types);
> > >
> > > Unusual indenting.  Align with just after (
> > >
> >
> > Aligning with "(" will exceed 100 columns. Would that be acceptable?
> I think we are talking cross purposes.
>
> default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
>   _memory_types);
>
> Is what I was suggesting.
>

Oh, now I see. Thanks!

> >
> > > >   if (IS_ERR(default_dram_type))
> > > >   panic("%s() failed to allocate default DRAM tier\n", 
> > > > __func__);
> > > >
> > > > @@ -868,6 +921,14 @@ static int __init memory_tier_init(void)
> > > >* types assigned.
> > > >*/
> > > >   for_each_node_state(node, N_MEMORY) {
> > > > + if (!node_state(node, N_CPU))
> > > > + /*
> > > > +  * Defer memory tier initialization on CPUless 
> > > > numa nodes.
> > > > +  * These will be initialized after firmware and 
> > > > devices are
> > >
> > > I think this wraps at just over 80 chars.  Seems silly to wrap so tightly 
> > > and not
> > > quite fit under 80. (this is about 83 chars.
> > >
> >
> > I can fix this.
> > I have a question. From my patch, this is <80 chars. However,
> > in an email, this is >80 chars. Does that mean we need to
> > count the number of chars in an email, not in a patch? Or if I
> > missed something? like vim configuration or?
>
> 3 tabs + 1 space + the text from * (58)
> = 24 + 1 + 58 = 83
>
> Advantage of using claws email for kernel stuff is it has a nice per character
> ruler at the top of the window.
>
> I wonder if you have a different tab indent size?  The kernel uses 8
> characters.  It might explain the few other odd indents if perhaps
> you have it at 4 in your editor?
>
> https://www.kernel.org/doc/html/v4.10/process/coding-style.html
>

Got it. I was using tab=4. I will change to 8. Thanks!

> Jonathan
>
> >
> > > > +  * initialized.
> > > > +  */
> > > > + continue;
> > > > +
> > > >   memtier = set_node_memory_tier(node);
> > > >   if (IS_ERR(memtier))
> > > >   /*
> > >
> >
> >
>


-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任

Re: [PATCH net-next v2 5/6] mptcp: support rstreason for passive reset

2024-04-04 Thread Mat Martineau


On Thu, 4 Apr 2024, Jason Xing wrote:


From: Jason Xing 

It relys on what reset options in MPTCP does as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces all the prior
NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
net/mptcp/subflow.c | 26 --
1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index a68d5d0f3e2a..24668d3020aa 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -304,7 +304,10 @@ static struct dst_entry *subflow_v4_route_req(const struct 
sock *sk,

dst_release(dst);
if (!req->syncookie)
-   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   /* According to RFC 8684, 3.2. Starting a New Subflow,
+* we should use an "MPTCP specific error" reason code.
+*/
+   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_MPTCP_RST_EMPTCP);


Hi Jason -

In this case, the MPTCP reset reason is set in subflow_check_req(). Looks 
like it uses EMPTCP but that isn't guaranteed to stay the same. I think it 
would be better to extract the reset reason from the skb extension or the 
subflow context "reset_reason" field.




return NULL;
}

@@ -371,7 +374,10 @@ static struct dst_entry *subflow_v6_route_req(const struct 
sock *sk,

dst_release(dst);
if (!req->syncookie)
-   tcp6_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   /* According to RFC 8684, 3.2. Starting a New Subflow,
+* we should use an "MPTCP specific error" reason code.
+*/
+   tcp6_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_MPTCP_RST_EMPTCP);


Same issue here.


return NULL;
}
#endif
@@ -778,6 +784,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
bool fallback, fallback_is_fatal;
struct mptcp_sock *owner;
struct sock *child;
+   int reason;

pr_debug("listener=%p, req=%p, conn=%p", listener, req, listener->conn);

@@ -833,7 +840,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
 */
if (!ctx || fallback) {
if (fallback_is_fatal) {
-   subflow_add_reset_reason(skb, MPTCP_RST_EMPTCP);
+   reason = MPTCP_RST_EMPTCP;
+   subflow_add_reset_reason(skb, reason);
goto dispose_child;
}
goto fallback;
@@ -861,7 +869,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
} else if (ctx->mp_join) {
owner = subflow_req->msk;
if (!owner) {
-   subflow_add_reset_reason(skb, 
MPTCP_RST_EPROHIBIT);
+   reason = MPTCP_RST_EPROHIBIT;
+   subflow_add_reset_reason(skb, reason);
goto dispose_child;
}

@@ -875,13 +884,18 @@ static struct sock *subflow_syn_recv_sock(const struct 
sock *sk,
 ntohs(inet_sk((struct sock 
*)owner)->inet_sport));
if (!mptcp_pm_sport_in_anno_list(owner, sk)) {
SUBFLOW_REQ_INC_STATS(req, 
MPTCP_MIB_MISMATCHPORTACKRX);
+   reason = MPTCP_RST_EUNSPEC;


I think the MPTCP code here should have been using MPTCP_RST_EPROHIBIT.


- Mat


goto dispose_child;
}
SUBFLOW_REQ_INC_STATS(req, 
MPTCP_MIB_JOINPORTACKRX);
}

-   if (!mptcp_finish_join(child))
+   if (!mptcp_finish_join(child)) {
+   struct mptcp_subflow_context *subflow = 
mptcp_subflow_ctx(sk);
+
+   reason = subflow->reset_reason;
goto dispose_child;
+   }

SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_JOINACKRX);
tcp_rsk(req)->drop_req = true;
@@ -901,7 +915,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
tcp_rsk(req)->drop_req = true;
inet_csk_prepare_for_destroy_sock(child);
tcp_done(child);
-   req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   req->rsk_ops->send_reset(sk, skb, convert_mptcp_reason(reason));

/* The last child reference will be released by the caller */
return child;
--
2.37.3

Copying TLS/user register data per perf-sample?

2024-04-04 Thread Beau Belgrave

Hello,

I'm looking into the possibility of capturing user data that is pointed
to by a user register (IE: fs/gs for TLS on x86/64) for each sample via
perf_events.

I was hoping to find a way to do this similar to PERF_SAMPLE_STACK_USER.
I think it could even use roughly the same ABI in the perf ring buffer.
Or it may be possible by some kprobe linked to the perf sample function.

This would allow a profiler to collect TLS (or other values) on x64. In
the Open Telemetry profiling SIG [1], we are trying to find a fast way
to grab a tracing association quickly on a per-thread basis. The team
at Elastic has a bespoke way to do this [2], however, I'd like to see a
more general way to achieve this. The folks I've been talking with seem
open to the idea of just having a TLS value for this we could capture
upon each sample. We could then just state, Open Telemetry SDKs should
have a TLS value for span correlation. However, we need a way to sample
the TLS value(s) when a sampling event is generated.

Is this already possible via some other means? It'd be great to be able
to do this directly at the perf_event sample via the ABI or a probe.

Thanks,
-Beau

1. https://opentelemetry.io/blog/2024/profiling/
2. 
https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation

Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-04 Thread David Hildenbrand


On 02.04.24 01:29, James Houghton wrote:

The bitmap is provided for secondary MMUs to use if they support it. For
test_young(), after it returns, the bitmap represents the pages that
were young in the interval [start, end). For clear_young, it represents
the pages that we wish the secondary MMU to clear the accessed/young bit
for.

If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
should be unchanged except that if young PTEs are found and the
architecture supports passing in a bitmap, instead of returning 1,
MMU_NOTIFIER_YOUNG_FAST is returned.

This allows MGLRU's look-around logic to work faster, resulting in a 4%
improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
to indicate to main mm that doing look-around is likely to be
beneficial.

If the secondary MMU doesn't support the bitmap, it must return
an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

[1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/

Suggested-by: Yu Zhao 
Signed-off-by: James Houghton 
---
  include/linux/mmu_notifier.h | 93 +---
  include/trace/events/kvm.h   | 13 +++--
  mm/mmu_notifier.c| 20 +---
  virt/kvm/kvm_main.c  | 19 ++--
  4 files changed, 123 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..daaa9db625d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,10 @@ enum mmu_notifier_event {
  
  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
  
+#define MMU_NOTIFIER_YOUNG			(1 << 0)

+#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE   (1 << 1)


Especially this one really deserves some documentation :)


+#define MMU_NOTIFIER_YOUNG_FAST(1 << 2)


And that one as well.

Likely best to briefly document all of them, and how they are
supposed to be used (return value for X).


+
  struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
@@ -106,21 +110,36 @@ struct mmu_notifier_ops {
 * clear_young is a lightweight version of clear_flush_young. Like the
 * latter, it is supposed to test-and-clear the young/accessed bitflag
 * in the secondary pte, but it may omit flushing the secondary tlb.
+*
+* If @bitmap is given but is not supported, return
+* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+*
+* If the walk is done "quickly" and there were young PTEs,
+* MMU_NOTIFIER_YOUNG_FAST is returned.
 */
int (*clear_young)(struct mmu_notifier *subscription,
   struct mm_struct *mm,
   unsigned long start,
-  unsigned long end);
+  unsigned long end,
+  unsigned long *bitmap);
  
  	/*

 * test_young is called to check the young/accessed bitflag in
 * the secondary pte. This is used to know if the page is
 * frequently used without actually clearing the flag or tearing
 * down the secondary mapping on the page.
+*
+* If @bitmap is given but is not supported, return
+* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+*
+* If the walk is done "quickly" and there were young PTEs,
+* MMU_NOTIFIER_YOUNG_FAST is returned.
 */
int (*test_young)(struct mmu_notifier *subscription,
  struct mm_struct *mm,
- unsigned long address);
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap);


What does "quickly" mean (why not use "fast")? What are the semantics, I 
don't find any existing usage of that in this file.


Further, what is MMU_NOTIFIER_YOUNG you introduce used for?

--
Cheers,

David / dhildenb

iMX8MP Cortex-M7 Relation to Audio Power Domain

2024-04-04 Thread João Paulo Silva Gonçalves

Hello all,

I was investigating why the kernel freezes on the iMX8MP when attempting to boot
the Cortex-M7 processor using the Linux remoteproc interface. However, with
v6.5, it started to work, and I was able to pinpoint to commit
b86c3afabb4f ('arm64: dts: imx8mp: Add SAI, SDMA, AudioMIX') [1] through 
bisection.
The patch appeared unrelated to remoteproc, and after some time, I realized 
there
is a connection between the functioning of remoteproc and the audio power 
domain.
For instance, adding the audio power domain to the node in the device
tree (below) made it work. The same behavior occurs in the downstream kernel.
There is a workaround for the problem by setting clkim8mp.mcore_booted=1 in the 
kernel arguments, but this is not seen as a final solution (it seems to 
disable all clock gating).

imx8mp-cm7 {
compatible = "fsl,imx8mp-cm7";
clocks = < IMX8MP_CLK_M7_CORE>;
clock-names = "core", "audio";
mbox-names = "tx", "rx", "rxdb";
mboxes = < 0 1
 1 1
 3 1>;
memory-region = <>, <>, <>, 
<_table>, <_reserved>;
rsc-da = <0x5500>;
syscon = <>;
fsl,startup-delay-ms = <500>;
power-domains = <_audio>;
};


Do any of you know anything about the relationship between the audio domain and
the Cortex-M7 on iMX8MP? The TRM is not very clear on this, and the only thing
I could find is that there are some mailboxes for Cortex-M7/Audio processor
communication managed by the audio power domain.

Thanks for the help!

[1] 
https://github.com/torvalds/linux/commit/b86c3afabb4f4ea146c206508527eb2a15485bcc


Regards,
João Paulo S. Goncalves

Re: [PATCH] livepatch: Add KLP_IDLE state

2024-04-04 Thread Joe Lawrence

On 4/4/24 11:17, Petr Mladek wrote:
> On Tue 2024-04-02 09:52:31, Joe Lawrence wrote:
>> On Tue, Apr 02, 2024 at 11:09:54AM +0800, zhangwar...@gmail.com wrote:
>>> From: Wardenjohn 
>>>
>>> In livepatch, using KLP_UNDEFINED is seems to be confused.
>>> When kernel is ready, livepatch is ready too, which state is
>>> idle but not undefined. What's more, if one livepatch process
>>> is finished, the klp state should be idle rather than undefined.
>>>
>>> Therefore, using KLP_IDLE to replace KLP_UNDEFINED is much better
>>> in reading and understanding.
>>> ---
>>>  include/linux/livepatch.h |  1 +
>>>  kernel/livepatch/patch.c  |  2 +-
>>>  kernel/livepatch/transition.c | 24 
>>>  3 files changed, 14 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
>>> index 9b9b38e89563..c1c53cd5b227 100644
>>> --- a/include/linux/livepatch.h
>>> +++ b/include/linux/livepatch.h
>>> @@ -19,6 +19,7 @@
>>>  
>>>  /* task patch states */
>>>  #define KLP_UNDEFINED  -1
>>> +#define KLP_IDLE   -1
>>
>> Hi Wardenjohn,
>>
>> Quick question, does this patch intend to:
>>
>> - Completely replace KLP_UNDEFINED with KLP_IDLE
>> - Introduce KLP_IDLE as an added, fourth potential state
>> - Introduce KLP_IDLE as synonym of sorts for KLP_UNDEFINED under certain
>>   conditions
>>
>> I ask because this patch leaves KLP_UNDEFINED defined and used in other
>> parts of the tree (ie, init/init_task.c), yet KLP_IDLE is added and
>> continues to use the same -1 enumeration.
> 
> Having two names for the same state adds more harm than good.
> 
> Honestly, neither "task->patch_state == KLP_UNDEFINED" nor "KLP_IDLE"
> make much sense.
> 
> The problem is in the variable name. It is not a state of a patch.
> It is the state of the transition. The right solution would be
> something like:
> 
>   klp_target_state -> klp_transition_target_state
>   task->patch_state -> task->klp_transition_state
>   KLP_UNKNOWN -> KLP_IDLE
> 

Yes, this is exactly how I think of these when reading the code.  The
model starts to make a lot more sense once you look at it thru this lens :)

> But it would also require renaming:
> 
>   /proc//patch_state -> klp_transition_state
> 
> which might break userspace tools => likely not acceptable.
> 
> 
> My opinion:
> 
> It would be nice to clean this up but it does not look worth the
> effort.
> 

Agreed.  Instead of changing code and the sysfs interface, we could
still add comments like:

  /* task patch transition target states */
  #define KLP_UNDEFINED   -1  /* idle, no transition in progress */
  #define KLP_UNPATCHED0  /* transitioning to unpatched state */
  #define KLP_PATCHED  1  /* transitioning to patched state */

  /* klp transition target state */
  static int klp_target_state = KLP_UNDEFINED;

  struct task_struct = {
  .patch_state= KLP_UNDEFINED,   /* klp transition state */

Maybe just one comment is enough?  Alternatively, we could elaborate in
Documentation/livepatch/livepatch.rst if it's really confusing.

Wardenjohn, since you're probably reading this code with fresh(er) eyes,
would any of the above be helpful?

-- 
Joe

Re: [PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig

2024-04-04 Thread Paul E. McKenney

On Thu, Apr 04, 2024 at 08:55:22AM +0900, Masami Hiramatsu wrote:
> On Wed, 3 Apr 2024 12:16:28 -0700
> "Paul E. McKenney"  wrote:
> 
> > commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to
> > /proc/bootconfig") adds bootloader argument comments into /proc/bootconfig.
> > 
> > /proc/bootconfig shows boot_command_line[] multiple times following
> > every xbc key value pair, that's duplicated and not necessary.
> > Remove redundant ones.
> > 
> > Output before and after the fix is like:
> > key1 = value1
> > *bootloader argument comments*
> > key2 = value2
> > *bootloader argument comments*
> > key3 = value3
> > *bootloader argument comments*
> > ...
> > 
> > key1 = value1
> > key2 = value2
> > key3 = value3
> > *bootloader argument comments*
> > ...
> > 
> > Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to 
> > /proc/bootconfig")
> > Signed-off-by: Zhenhua Huang 
> > Signed-off-by: Paul E. McKenney 
> > Cc: Masami Hiramatsu 
> > Cc: 
> > Cc: 
> 
> OOps, good catch! Let me pick it.
> 
> Acked-by: Masami Hiramatsu (Google) 

Thank you, and I have applied your ack and pulled this into its own
bootconfig.2024.04.04a.

My guess is that you will push this via your own tree, and so I will
drop my copy as soon as yours hits -next.

Thanx, Paul

> Thank you!
> 
> > 
> > diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
> > index 902b326e1e560..e5635a6b127b0 100644
> > --- a/fs/proc/bootconfig.c
> > +++ b/fs/proc/bootconfig.c
> > @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char *dst, 
> > size_t size)
> > break;
> > dst += ret;
> > }
> > -   if (ret >= 0 && boot_command_line[0]) {
> > -   ret = snprintf(dst, rest(dst, end), "# Parameters from 
> > bootloader:\n# %s\n",
> > -  boot_command_line);
> > -   if (ret > 0)
> > -   dst += ret;
> > -   }
> > +   }
> > +   if (ret >= 0 && boot_command_line[0]) {
> > +   ret = snprintf(dst, rest(dst, end), "# Parameters from 
> > bootloader:\n# %s\n",
> > +  boot_command_line);
> > +   if (ret > 0)
> > +   dst += ret;
> > }
> >  out:
> > kfree(key);
> 
> 
> -- 
> Masami Hiramatsu (Google)

Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-04 Thread Haitao Huang


Hi Kai,
Thanks for your suggestions. I'll adopt most of it as it.
Minor details below.

On Wed, 03 Apr 2024 08:08:28 -0500, Huang, Kai  wrote:


On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:

From: Kristen Carlson Accardi 

When a cgroup usage reaches its limit, and it is to be charged, i.e.,
sgx_cgroup_try_charge() called for new allocations, the cgroup needs to
reclaim pages from its LRU or LRUs of its descendants to make room for
any new allocations. This patch adds the basic building block for the
per-cgroup reclamation flow and use it for synchronous reclamation in
sgx_cgroup_try_charge().


It's better to firstly mention _why_ we need this first:

Currently in the EPC page allocation, the kernel simply fails the  
allocation
when the current EPC cgroup fails to charge due to its usage reaching  
limit.
This is not ideal.  When that happens, a better way is to reclaim EPC  
page(s)
from the current EPC cgroup (and/or its descendants) to reduce its usage  
so the

new allocation can succeed.

Add the basic building blocks to support the per-cgroup reclamation flow  
...




ok



First, modify sgx_reclaim_pages() to let callers to pass in the LRU from
which pages are reclaimed, so it can be reused by both the global and
cgroup reclaimers. Also return the number of pages attempted, so a
cgroup reclaimer can use it to track reclamation progress from its
descendants.


IMHO you are jumping too fast to the implementation details.  Better to  
have

some more background:

"
Currently the kernel only has one place to reclaim EPC pages: the global  
EPC LRU
list.  To support the "per-cgroup" EPC reclaim, maintain an LRU list for  
each
EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC  
page(s)

from a given EPC cgroup (and its descendants).
"



ok



For the global reclaimer, replace all call sites of sgx_reclaim_pages()
with calls to a newly created wrapper, sgx_reclaim_pages_global(), which
just calls sgx_reclaim_pages() with the global LRU passed in.

For cgroup reclamation, implement a basic reclamation flow, encapsulated
in the top-level function, sgx_cgroup_reclaim_pages(). It performs a
pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages()
at each node passing in the LRU of that node. It keeps track of total
attempted pages and stops the walk if desired number of pages are
attempted.


Then it's time to jump to implementation details:

"
Currently the kernel does the global EPC reclaim in sgx_reclaim_page().   
It

always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages.
Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from  
the

global LRU, and then tries to reclaim these pages at once for better
performance.

Use similar way to implement the "cgroup" variant EPC reclaim, but keep  
the
implementation simple: 1) change sgx_reclaim_pages() to take an LRU as  
input,
and return the pages that are "scanned" (but not actually reclaimed); 2)  
loop
the given EPC cgroup and its descendants and do the new  
sgx_reclaim_pages()

until SGX_NR_TO_SCAN pages are "scanned".

This implementation always tries to reclaim SGX_NR_TO_SCAN pages from  
the LRU of
the given EPC cgroup, and only moves to its descendants when there's no  
enough
reclaimable EPC pages to "scan" in its LRU.  It should be enough for  
most cases.

"



ok


Then I think it's better to explain why "alternatives" are not chosen:

"
Note, this simple implementation doesn't _exactly_ mimic the current  
global EPC
reclaim (which always tries to do the actual reclaim in batch of  
SGX_NR_TO_SCAN
pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the  
actual
reclaim of EPC pages will be split into smaller batches _across_  
multiple LRUs

with each being smaller than SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to  
have a
new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_  
the
given EPC cgroup _AND_ its descendants, and then do the actual reclaim  
in one

batch.  But this is unnecessarily complicated at this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to  
return the
actual "reclaimed" pages, but not "scanned" pages.  However this  
solution also

has cons: 
"

:

I recall you mentioned "unable to control latency of each reclaim" etc,  
but IIUC

one could be:

This approach may result in higher chance of "reclaiming EPC pages from
descendants but not the root/given EPC cgorup", e.g., when all EPC pages  
in the
root EPC cgroup are all young while these in its descendants are not.   
This may

not be desired.

Makes sense?



Agree with the flow.
The con is that this function may block too long that is unacceptable for  
some callers like synchronous flow which only needs some minimal (e.g.,  
one page to pass try_charge()) to make forward progress. Convention is to  
call this function loops to ensure caller's condition is met, i.e., the  
way the

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-04 Thread Oleg Nesterov

On 04/05, Masami Hiramatsu wrote:
>
> Can we make this syscall and uprobe behavior clearer? As you said, if
> the application use sigreturn or longjump, it may skip returns and
> shadow stack entries are left in the kernel. In such cases, can uretprobe
> detect it properly, or just crash the process (or process runs wrongly)?

Please see the comment in handle_trampoline(), it tries to detect this case.
This patch should not make any difference.

Oleg.

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-04 Thread Google

On Thu, 4 Apr 2024 13:58:43 +0200
Jiri Olsa  wrote:

> On Wed, Apr 03, 2024 at 07:00:07PM -0700, Andrii Nakryiko wrote:
> 
> SNIP
> 
> > Check rt_sigreturn syscall (manpage at [0], for example).
> > 
> >sigreturn() exists only to allow the implementation of signal
> >handlers.  It should never be called directly.  (Indeed, a simple
> >sigreturn() wrapper in the GNU C library simply returns -1, with
> >errno set to ENOSYS.)  Details of the arguments (if any) passed
> >to sigreturn() vary depending on the architecture.  (On some
> >architectures, such as x86-64, sigreturn() takes no arguments,
> >since all of the information that it requires is available in the
> >stack frame that was previously created by the kernel on the
> >user-space stack.)
> > 
> > This is a very similar use case. Also, check its source code in
> > arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
> > on any sign of something not being right. It's exactly the same with
> > sys_uretprobe.
> > 
> >   [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html
> > 
> > > And the number of syscalls are limited resource.
> > 
> > We have almost 500 of them, it didn't seems like adding 1-2 for good
> > reasons would be a problem. Can you please point to where the limits
> > on syscalls as a resource are described? I'm curious to learn.
> > 
> > >
> > > I'm actually not sure how much we need to care of it, but adding a new
> > > syscall is worth to be discussed carefully because all of them are
> > > user-space compatibility.
> > 
> > Absolutely, it's a good discussion to have.
> > 
> > >
> > > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > > > >
> > > > > > right, I'll check on syzkaller
> > > > > >
> > > > > > > set in the user function, what happen if the user function 
> > > > > > > directly
> > > > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > > >
> > > > > > the process should receive SIGILL if there's no pending uretprobe 
> > > > > > for
> > > > > > the current task, or it will trigger uretprobe if there's one 
> > > > > > pending
> > > > >
> > > > > No, that is too aggressive and not safe. Since the syscall is exposed 
> > > > > to
> > > > > user program, it should return appropriate error code instead of 
> > > > > SIGILL.
> > > > >
> > > >
> > > > This is the way it is today with uretprobes even through interrupt.
> > >
> > > I doubt that the interrupt (exception) and syscall should be handled
> > > differently. Especially, this exception is injected by uprobes but
> > > syscall will be caused by itself. But syscall can be called from user
> > > program (of couse this works as sys_kill(self, SIGILL)).
> > 
> > Yep, I'd keep the behavior the same between uretprobes implemented
> > through int3 and sys_uretprobe.
> 
> +1 
> 
> > 
> > >
> > > > E.g., it could happen that user process is using fibers and is
> > > > replacing stack pointer without kernel realizing this, which will
> > > > trigger some defensive checks in uretprobe handling code and kernel
> > > > will send SIGILL because it can't support such cases. This is
> > > > happening today already, and it works fine in practice (except for
> > > > applications that manually change stack pointer, too bad, you can't
> > > > trace them with uretprobes, unfortunately).
> > >
> > > OK, we at least need to document it.
> > 
> > +1, yep
> > 
> > >
> > > >
> > > > So I think it's absolutely adequate to have this behavior if the user
> > > > process is *intentionally* abusing this API.
> > >
> > > Of course user expected that it is abusing. So at least we need to
> > > add a document that this syscall number is reserved to uprobes and
> > > user program must not use it.
> > >
> > 
> > Totally agree about documenting this.
> 
> ok there's map page on sigreturn.. do you think we should add man page
> for uretprobe or you can think of some other place to document it?

I think it is better to have a man-page. Anyway, to discuss and explain
this syscall, the man-page is a good format to describe it.

> 
> > 
> > > >
> > > > > >
> > > > > > but we could limit the syscall to be executed just from the 
> > > > > > trampoline,
> > > > > > that should prevent all the user space use cases, I'll do that in 
> > > > > > next
> > > > > > version and add more tests for that
> > > > >
> > > > > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > > > > only from the trampoline, so it is natural to check the caller 
> > > > > address.
> > > > > (and uprobe should know where is the trampoline)
> > > > >
> > > > > Since the syscall is always exposed to the user program, it should
> > > > > - Do nothing and return an error unless it is properly called.
> > > > > - check the prerequisites for operation strictly.
> > > > > I concern that new system calls introduce vulnerabilities.
> > > > >
> > > >
> > > > As Oleg and Jiri mentioned, this

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-04 Thread Google

On Wed, 3 Apr 2024 19:00:07 -0700
Andrii Nakryiko  wrote:

> On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu  wrote:
> >
> > On Wed, 3 Apr 2024 09:58:12 -0700
> > Andrii Nakryiko  wrote:
> >
> > > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Wed, 3 Apr 2024 11:47:41 +0200
> > > > Jiri Olsa  wrote:
> > > >
> > > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > > > > Hi Jiri,
> > > > > >
> > > > > > On Tue,  2 Apr 2024 11:33:00 +0200
> > > > > > Jiri Olsa  wrote:
> > > > > >
> > > > > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > > > >
> > > > > > This is interesting approach. But I doubt we need to add additional
> > > > > > syscall just for this purpose. Can't we use another syscall or 
> > > > > > ioctl?
> > > > >
> > > > > so the plan is to optimize entry uprobe in a similar way and given
> > > > > the syscall is not a scarce resource I wanted to add another syscall
> > > > > for that one as well
> > > > >
> > > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > > > > possible to do that, the trampoline will just have to save one or
> > > > > more additional registers, but adding new syscall seems cleaner to me
> > > >
> > > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
> > >
> > > I think both ptrace and prctl are for completely different use cases
> > > and it would be an abuse of existing API to reuse them for uretprobe
> > > tracing. Also, keep in mind, that any extra argument that has to be
> > > passed into this syscall means that we need to complicate and slow
> > > generated assembly code that is injected into user process (to
> > > save/restore registers) and also kernel-side (again, to deal with all
> > > the extra registers that would be stored/restored on stack).
> > >
> > > Given syscalls are not some kind of scarce resources, what's the
> > > downside to have a dedicated and simple syscall?
> >
> > Syscalls are explicitly exposed to user space, thus, even if it is used
> > ONLY for a very specific situation, it is an official kernel interface,
> > and need to care about the compatibility. (If it causes SIGILL unless
> > a specific use case, I don't know there is a "compatibility".)
> 
> Check rt_sigreturn syscall (manpage at [0], for example).
> 
>sigreturn() exists only to allow the implementation of signal
>handlers.  It should never be called directly.  (Indeed, a simple
>sigreturn() wrapper in the GNU C library simply returns -1, with
>errno set to ENOSYS.)  Details of the arguments (if any) passed
>to sigreturn() vary depending on the architecture.  (On some
>architectures, such as x86-64, sigreturn() takes no arguments,
>since all of the information that it requires is available in the
>stack frame that was previously created by the kernel on the
>user-space stack.)
> 
> This is a very similar use case. Also, check its source code in
> arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
> on any sign of something not being right. It's exactly the same with
> sys_uretprobe.
> 
>   [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html

Thanks for a good example.
Hm, in the case of rt_sigreturn, it has no other way to do it so it
needs to use syscall. OTOH, sys_uretprobe is only for performance
optimization, and the performance may depend on the architecture.

> > And the number of syscalls are limited resource.
> 
> We have almost 500 of them, it didn't seems like adding 1-2 for good
> reasons would be a problem. Can you please point to where the limits
> on syscalls as a resource are described? I'm curious to learn.

Syscall table is compiled as a fixed array, so if we increase
the number, we need more tables. Of course this just increase 1 entry
and at least for x86 we already allocated bigger table, so it is OK.
But I'm just afraid if we can add more syscalls without any clear
rules, we may fill the tables with more specific syscalls.

Ah, we also should follow this document.

https://docs.kernel.org/process/adding-syscalls.html

Let me Cc linux-...@vger.kernel.org.

> >
> > I'm actually not sure how much we need to care of it, but adding a new
> > syscall is worth to be discussed carefully because all of them are
> > user-space compatibility.
> 
> Absolutely, it's a good discussion to have.

Thanks, if this is discussed enough and agreed from other maintainers,
I can safely pick this on my tree.

> 
> >
> > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > > >
> > > > > right, I'll check on syzkaller
> > > > >
> > > > > > set in the user function, what happen if the user function directly
> > > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > >
> > > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > > the current task, or it will trigger uretprobe if there's one

Re: [PATCH v10 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-04 Thread Haitao Huang


On Thu, 04 Apr 2024 06:16:54 -0500, Huang, Kai  wrote:


On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:

  void sgx_cgroup_init(void)
 {
+	sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE,  
WQ_UNBOUND_MAX_ACTIVE);

+
+   /* All Cgroups functionalities are disabled. */
+   if (WARN_ON(!sgx_cg_wq))
+   return;
+


I don't think you should WARN(), because it's not a kernel bug or  
similar.  Just

print a message saying EPC cgroup is disabled and move on.

if (!sgx_cg_wq) {
pr_err("SGX EPC cgroup disabled: alloc_workqueue() failed.\n");
return;
}



Sure
Thanks
Haitao

Re: [PATCH] livepatch: Add KLP_IDLE state

2024-04-04 Thread Petr Mladek

On Tue 2024-04-02 09:52:31, Joe Lawrence wrote:
> On Tue, Apr 02, 2024 at 11:09:54AM +0800, zhangwar...@gmail.com wrote:
> > From: Wardenjohn 
> > 
> > In livepatch, using KLP_UNDEFINED is seems to be confused.
> > When kernel is ready, livepatch is ready too, which state is
> > idle but not undefined. What's more, if one livepatch process
> > is finished, the klp state should be idle rather than undefined.
> > 
> > Therefore, using KLP_IDLE to replace KLP_UNDEFINED is much better
> > in reading and understanding.
> > ---
> >  include/linux/livepatch.h |  1 +
> >  kernel/livepatch/patch.c  |  2 +-
> >  kernel/livepatch/transition.c | 24 
> >  3 files changed, 14 insertions(+), 13 deletions(-)
> > 
> > diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> > index 9b9b38e89563..c1c53cd5b227 100644
> > --- a/include/linux/livepatch.h
> > +++ b/include/linux/livepatch.h
> > @@ -19,6 +19,7 @@
> >  
> >  /* task patch states */
> >  #define KLP_UNDEFINED  -1
> > +#define KLP_IDLE   -1
> 
> Hi Wardenjohn,
> 
> Quick question, does this patch intend to:
> 
> - Completely replace KLP_UNDEFINED with KLP_IDLE
> - Introduce KLP_IDLE as an added, fourth potential state
> - Introduce KLP_IDLE as synonym of sorts for KLP_UNDEFINED under certain
>   conditions
> 
> I ask because this patch leaves KLP_UNDEFINED defined and used in other
> parts of the tree (ie, init/init_task.c), yet KLP_IDLE is added and
> continues to use the same -1 enumeration.

Having two names for the same state adds more harm than good.

Honestly, neither "task->patch_state == KLP_UNDEFINED" nor "KLP_IDLE"
make much sense.

The problem is in the variable name. It is not a state of a patch.
It is the state of the transition. The right solution would be
something like:

  klp_target_state -> klp_transition_target_state
  task->patch_state -> task->klp_transition_state
  KLP_UNKNOWN -> KLP_IDLE

But it would also require renaming:

  /proc//patch_state -> klp_transition_state

which might break userspace tools => likely not acceptable.


My opinion:

It would be nice to clean this up but it does not look worth the
effort.

Best Regards,
Petr

[PATCH] [v5] kallsyms: rework symbol lookup return codes

2024-04-04 Thread Arnd Bergmann

From: Arnd Bergmann 

Building with W=1 in some configurations produces a false positive
warning for kallsyms:

kernel/kallsyms.c: In function '__sprint_symbol.isra':
kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as 
destination [-Werror=restrict]
  503 | strcpy(buffer, name);
  | ^~~~

This originally showed up while building with -O3, but later started
happening in other configurations as well, depending on inlining
decisions. The underlying issue is that the local 'name' variable is
always initialized to the be the same as 'buffer' in the called functions
that fill the buffer, which gcc notices while inlining, though it could
see that the address check always skips the copy.

The calling conventions here are rather unusual, as all of the internal
lookup functions (bpf_address_lookup, ftrace_mod_address_lookup,
ftrace_func_address_lookup, module_address_lookup and
kallsyms_lookup_buildid) already use the provided buffer and either return
the address of that buffer to indicate success, or NULL for failure,
but the callers are written to also expect an arbitrary other buffer
to be returned.

Rework the calling conventions to return the length of the filled buffer
instead of its address, which is simpler and easier to follow as well
as avoiding the warning. Leave only the kallsyms_lookup() calling conventions
unchanged, since that is called from 16 different functions and
adapting this would be a much bigger change.

Link: https://lore.kernel.org/all/20200107214042.855757-1-a...@arndb.de/
Link: https://lore.kernel.org/lkml/20240326130647.7bfb1...@gandalf.local.home/
Reviewed-by: Luis Chamberlain 
Acked-by: Steven Rostedt (Google) 
Signed-off-by: Arnd Bergmann 
---
v5: fix ftrace_mod_address_lookup return value,
rebased on top of 2e114248e086 ("bpf: Replace deprecated strncpy with 
strscpy")
v4: fix string length
v3: use strscpy() instead of strlcpy()
v2: complete rewrite after the first patch was rejected (in 2020). This
is now one of only two warnings that are in the way of enabling
-Wextra/-Wrestrict by default.
Signed-off-by: Arnd Bergmann 
---
 include/linux/filter.h   | 14 +++---
 include/linux/ftrace.h   |  6 +++---
 include/linux/module.h   | 14 +++---
 kernel/bpf/core.c|  7 +++
 kernel/kallsyms.c| 23 ---
 kernel/module/kallsyms.c | 26 +-
 kernel/trace/ftrace.c| 13 +
 7 files changed, 50 insertions(+), 53 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 161d5f7b64ed..e3a8f51fdf84 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1202,18 +1202,18 @@ static inline bool bpf_jit_kallsyms_enabled(void)
return false;
 }
 
-const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
+int __bpf_address_lookup(unsigned long addr, unsigned long *size,
 unsigned long *off, char *sym);
 bool is_bpf_text_address(unsigned long addr);
 int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
char *sym);
 struct bpf_prog *bpf_prog_ksym_find(unsigned long addr);
 
-static inline const char *
+static inline int
 bpf_address_lookup(unsigned long addr, unsigned long *size,
   unsigned long *off, char **modname, char *sym)
 {
-   const char *ret = __bpf_address_lookup(addr, size, off, sym);
+   int ret = __bpf_address_lookup(addr, size, off, sym);
 
if (ret && modname)
*modname = NULL;
@@ -1257,11 +1257,11 @@ static inline bool bpf_jit_kallsyms_enabled(void)
return false;
 }
 
-static inline const char *
+static inline int
 __bpf_address_lookup(unsigned long addr, unsigned long *size,
 unsigned long *off, char *sym)
 {
-   return NULL;
+   return 0;
 }
 
 static inline bool is_bpf_text_address(unsigned long addr)
@@ -1280,11 +1280,11 @@ static inline struct bpf_prog 
*bpf_prog_ksym_find(unsigned long addr)
return NULL;
 }
 
-static inline const char *
+static inline int
 bpf_address_lookup(unsigned long addr, unsigned long *size,
   unsigned long *off, char **modname, char *sym)
 {
-   return NULL;
+   return 0;
 }
 
 static inline void bpf_prog_kallsyms_add(struct bpf_prog *fp)
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 54d53f345d14..56834a3fa9be 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -87,15 +87,15 @@ struct ftrace_direct_func;
 
 #if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_MODULES) && \
defined(CONFIG_DYNAMIC_FTRACE)
-const char *
+int
 ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
   unsigned long *off, char **modname, char *sym);
 #else
-static inline const char *
+static inline int
 ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,

Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-04 Thread Jonathan Cameron



> > > @@ -858,7 +910,8 @@ static int __init memory_tier_init(void)
> > >* For now we can have 4 faster memory tiers with smaller adistance
> > >* than default DRAM tier.
> > >*/
> > > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> > > + default_dram_type = 
> > > mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> > > + 
> > > _memory_types);  
> >
> > Unusual indenting.  Align with just after (
> >  
> 
> Aligning with "(" will exceed 100 columns. Would that be acceptable?
I think we are talking cross purposes.

default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
  _memory_types);  

Is what I was suggesting.

> 
> > >   if (IS_ERR(default_dram_type))
> > >   panic("%s() failed to allocate default DRAM tier\n", 
> > > __func__);
> > >
> > > @@ -868,6 +921,14 @@ static int __init memory_tier_init(void)
> > >* types assigned.
> > >*/
> > >   for_each_node_state(node, N_MEMORY) {
> > > + if (!node_state(node, N_CPU))
> > > + /*
> > > +  * Defer memory tier initialization on CPUless numa 
> > > nodes.
> > > +  * These will be initialized after firmware and 
> > > devices are  
> >
> > I think this wraps at just over 80 chars.  Seems silly to wrap so tightly 
> > and not
> > quite fit under 80. (this is about 83 chars.
> >  
> 
> I can fix this.
> I have a question. From my patch, this is <80 chars. However,
> in an email, this is >80 chars. Does that mean we need to
> count the number of chars in an email, not in a patch? Or if I
> missed something? like vim configuration or?

3 tabs + 1 space + the text from * (58)
= 24 + 1 + 58 = 83

Advantage of using claws email for kernel stuff is it has a nice per character
ruler at the top of the window.

I wonder if you have a different tab indent size?  The kernel uses 8
characters.  It might explain the few other odd indents if perhaps
you have it at 4 in your editor?

https://www.kernel.org/doc/html/v4.10/process/coding-style.html

Jonathan

> 
> > > +  * initialized.
> > > +  */
> > > + continue;
> > > +
> > >   memtier = set_node_memory_tier(node);
> > >   if (IS_ERR(memtier))
> > >   /*  
> >  
> 
>

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-04 Thread Jiri Olsa

On Wed, Apr 03, 2024 at 07:00:07PM -0700, Andrii Nakryiko wrote:

SNIP

> Check rt_sigreturn syscall (manpage at [0], for example).
> 
>sigreturn() exists only to allow the implementation of signal
>handlers.  It should never be called directly.  (Indeed, a simple
>sigreturn() wrapper in the GNU C library simply returns -1, with
>errno set to ENOSYS.)  Details of the arguments (if any) passed
>to sigreturn() vary depending on the architecture.  (On some
>architectures, such as x86-64, sigreturn() takes no arguments,
>since all of the information that it requires is available in the
>stack frame that was previously created by the kernel on the
>user-space stack.)
> 
> This is a very similar use case. Also, check its source code in
> arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
> on any sign of something not being right. It's exactly the same with
> sys_uretprobe.
> 
>   [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html
> 
> > And the number of syscalls are limited resource.
> 
> We have almost 500 of them, it didn't seems like adding 1-2 for good
> reasons would be a problem. Can you please point to where the limits
> on syscalls as a resource are described? I'm curious to learn.
> 
> >
> > I'm actually not sure how much we need to care of it, but adding a new
> > syscall is worth to be discussed carefully because all of them are
> > user-space compatibility.
> 
> Absolutely, it's a good discussion to have.
> 
> >
> > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > > >
> > > > > right, I'll check on syzkaller
> > > > >
> > > > > > set in the user function, what happen if the user function directly
> > > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > >
> > > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > > the current task, or it will trigger uretprobe if there's one pending
> > > >
> > > > No, that is too aggressive and not safe. Since the syscall is exposed to
> > > > user program, it should return appropriate error code instead of SIGILL.
> > > >
> > >
> > > This is the way it is today with uretprobes even through interrupt.
> >
> > I doubt that the interrupt (exception) and syscall should be handled
> > differently. Especially, this exception is injected by uprobes but
> > syscall will be caused by itself. But syscall can be called from user
> > program (of couse this works as sys_kill(self, SIGILL)).
> 
> Yep, I'd keep the behavior the same between uretprobes implemented
> through int3 and sys_uretprobe.

+1 

> 
> >
> > > E.g., it could happen that user process is using fibers and is
> > > replacing stack pointer without kernel realizing this, which will
> > > trigger some defensive checks in uretprobe handling code and kernel
> > > will send SIGILL because it can't support such cases. This is
> > > happening today already, and it works fine in practice (except for
> > > applications that manually change stack pointer, too bad, you can't
> > > trace them with uretprobes, unfortunately).
> >
> > OK, we at least need to document it.
> 
> +1, yep
> 
> >
> > >
> > > So I think it's absolutely adequate to have this behavior if the user
> > > process is *intentionally* abusing this API.
> >
> > Of course user expected that it is abusing. So at least we need to
> > add a document that this syscall number is reserved to uprobes and
> > user program must not use it.
> >
> 
> Totally agree about documenting this.

ok there's map page on sigreturn.. do you think we should add man page
for uretprobe or you can think of some other place to document it?

> 
> > >
> > > > >
> > > > > but we could limit the syscall to be executed just from the 
> > > > > trampoline,
> > > > > that should prevent all the user space use cases, I'll do that in next
> > > > > version and add more tests for that
> > > >
> > > > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > > > only from the trampoline, so it is natural to check the caller address.
> > > > (and uprobe should know where is the trampoline)
> > > >
> > > > Since the syscall is always exposed to the user program, it should
> > > > - Do nothing and return an error unless it is properly called.
> > > > - check the prerequisites for operation strictly.
> > > > I concern that new system calls introduce vulnerabilities.
> > > >
> > >
> > > As Oleg and Jiri mentioned, this syscall can't harm kernel or other
> > > processes, only the process that is abusing the API. So any extra
> > > checks that would slow down this approach is an unnecessary overhead
> > > and complication that will never be useful in practice.
> >
> > I think at least it should check the caller address to ensure the
> > address is in the trampoline.
> > But anyway, uprobes itself can break the target process, so no one
> > might care if this system call breaks the process now.
> 
> If we already

Re: [PATCH v10 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-04 Thread Huang, Kai

On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:
>  
>  void sgx_cgroup_init(void)
>  {
> + sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE, 
> WQ_UNBOUND_MAX_ACTIVE);
> +
> + /* All Cgroups functionalities are disabled. */
> + if (WARN_ON(!sgx_cg_wq))
> + return;
> +

I don't think you should WARN(), because it's not a kernel bug or similar.  Just
print a message saying EPC cgroup is disabled and move on.

if (!sgx_cg_wq) {
pr_err("SGX EPC cgroup disabled: alloc_workqueue() failed.\n");
return;
}

Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-04 Thread Jonthan Haslam

> > Things to note about the results:
> >
> > - The results are slightly variable so don't get too caught up on
> >   individual thread count - it's the trend that is important.
> > - In terms of throughput with this specific benchmark a *very* macro view
> >   is that the RW spinlock provides 40-60% more throughput than the
> >   spinlock.  The per-CPU RW semaphore provides in the order of 50-100%
> >   more throughput then the spinlock.
> > - This doesn't fully reflect the large reduction in latency that we have
> >   seen in application based measurements. However, it does demonstrate
> >   that even the trivial change of going to a RW spinlock provides
> >   significant benefits.
> 
> This is probably because trig-uprobe-nop creates a single uprobe that
> is triggered on many CPUs. While in production we have also *many*
> uprobes running on many CPUs. In this benchmark, besides contention on
> uprobes_treelock, we are also hammering on other per-uprobe locks
> (register_rwsem, also if you don't have [0] patch locally, there will
> be another filter lock taken each time, filter->rwlock). There is also
> atomic refcounting going on, which when you have the same uprobe
> across all CPUs at the same time will cause a bunch of cache line
> bouncing.
> 
> So yes, it's understandable that in practice in production you see an
> even larger effect of optimizing uprobe_treelock than in this
> micro-benchmark.
> 
>   [0] 
> https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

Thanks for the reply and the thoughts on this Andrii. Yes, I do have the
filter->rwlock fix applied but, as you say, there are no doubt other
effects at play here as to be expected in such a synthetic workload. I'm
pleased with the outcomes though as they show a good result even if they
are at the lower end of what I expect.

The results also show that pursuing an RCU solution is definitely worth it
but that write penalty is brutal in the case of a full synchronize_rcu()!
Should be fun.

> > for num_threads in {1..20}
> > do
> > sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
> 
> just want to mention -a (affinity) option that you can pass a bench
> tool, it will pin each thread on its own CPU. It generally makes tests
> more uniform, eliminating CPU migrations variability.

Thanks for pointing that flag  out!

Jon.

> 
> > done
> >
> >
> > spinlock
> >
> > Summary: hits1.453 ± 0.005M/s (  1.453M/prod)
> > Summary: hits2.087 ± 0.005M/s (  1.043M/prod)
> > Summary: hits2.701 ± 0.012M/s (  0.900M/prod)
> 
> I also wanted to point out that the first measurement (1.453M/s in
> this row) is total throughput across all threads, while value in
> parenthesis (0.900M/prod) is averaged throughput per each thread. So
> this M/prod value is the most interesting in this benchmark where we
> assess the effect of reducing contention.
> 
> > Summary: hits1.917 ± 0.011M/s (  0.479M/prod)
> > Summary: hits2.105 ± 0.003M/s (  0.421M/prod)
> > Summary: hits1.615 ± 0.006M/s (  0.269M/prod)
> 
> [...]

[PATCH net-next v2 6/6] rstreason: make it work in trace world

2024-04-04 Thread Jason Xing

From: Jason Xing 

At last, we should let it work by introducing this reset reason in
trace world.

One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED

Signed-off-by: Jason Xing 
---
 include/trace/events/tcp.h | 37 +
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/tcp_ipv6.c|  2 +-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 5c04a61a11c2..9bed9e63c9c5 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * tcp event with arguments sk and skb
@@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
TP_ARGS(sk, skb)
 );
 
+#undef FN1
+#define FN1(reason)TRACE_DEFINE_ENUM(SK_RST_REASON_##reason);
+#undef FN2
+#define FN2(reason)TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
+DEFINE_RST_REASON(FN1, FN1)
+
+#undef FN1
+#undef FNe1
+#define FN1(reason){ SK_RST_REASON_##reason, #reason },
+#define FNe1(reason)   { SK_RST_REASON_##reason, #reason }
+
+#undef FN2
+#undef FNe2
+#define FN2(reason){ SKB_DROP_REASON_##reason, #reason },
+#define FNe2(reason)   { SKB_DROP_REASON_##reason, #reason }
 /*
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
 TRACE_EVENT(tcp_send_reset,
 
-   TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
+   TP_PROTO(const struct sock *sk,
+const struct sk_buff *skb,
+const int reason),
 
-   TP_ARGS(sk, skb),
+   TP_ARGS(sk, skb, reason),
 
TP_STRUCT__entry(
__field(const void *, skbaddr)
__field(const void *, skaddr)
__field(int, state)
+   __field(int, reason)
__array(__u8, saddr, sizeof(struct sockaddr_in6))
__array(__u8, daddr, sizeof(struct sockaddr_in6))
),
@@ -113,14 +132,24 @@ TRACE_EVENT(tcp_send_reset,
 */
TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, 
entry->saddr);
}
+   __entry->reason = reason;
),
 
-   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s 
reason=%s",
  __entry->skbaddr, __entry->skaddr,
  __entry->saddr, __entry->daddr,
- __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN")
+ __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN",
+ __entry->reason < RST_REASON_START ?
+   __print_symbolic(__entry->reason, 
DEFINE_DROP_REASON(FN2, FNe2)) :
+   __print_symbolic(__entry->reason, 
DEFINE_RST_REASON(FN1, FNe1)))
 );
 
+#undef FN1
+#undef FNe1
+
+#undef FN2
+#undef FNe2
+
 /*
  * tcp event with arguments sk
  *
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1ae2716f0c34..9c52a4a74842 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct 
sk_buff *skb,
if (sk)
arg.bound_dev_if = sk->sk_bound_dev_if;
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 18fbbad2028a..d5a7ecfcc1b3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3608,7 +3608,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t 
priority, int reason)
/* skb of trace_tcp_send_reset() keeps the skb that caused RST,
 * skb here is different to the troublesome skb, so use NULL
 */
-   trace_tcp_send_reset(sk, NULL);
+   trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED);
 }
 
 /* Send a crossed SYN-ACK during socket establishment.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6889ea70c760..3c995eff6e52 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1131,7 +1131,7 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb,
label = ip6_flowlabel(ipv6h);
}
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
-- 
2.37.3

[PATCH net-next v2 5/6] mptcp: support rstreason for passive reset

2024-04-04 Thread Jason Xing

From: Jason Xing 

It relys on what reset options in MPTCP does as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces all the prior
NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/mptcp/subflow.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index a68d5d0f3e2a..24668d3020aa 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -304,7 +304,10 @@ static struct dst_entry *subflow_v4_route_req(const struct 
sock *sk,
 
dst_release(dst);
if (!req->syncookie)
-   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   /* According to RFC 8684, 3.2. Starting a New Subflow,
+* we should use an "MPTCP specific error" reason code.
+*/
+   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_MPTCP_RST_EMPTCP);
return NULL;
 }
 
@@ -371,7 +374,10 @@ static struct dst_entry *subflow_v6_route_req(const struct 
sock *sk,
 
dst_release(dst);
if (!req->syncookie)
-   tcp6_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   /* According to RFC 8684, 3.2. Starting a New Subflow,
+* we should use an "MPTCP specific error" reason code.
+*/
+   tcp6_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_MPTCP_RST_EMPTCP);
return NULL;
 }
 #endif
@@ -778,6 +784,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
bool fallback, fallback_is_fatal;
struct mptcp_sock *owner;
struct sock *child;
+   int reason;
 
pr_debug("listener=%p, req=%p, conn=%p", listener, req, listener->conn);
 
@@ -833,7 +840,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
 */
if (!ctx || fallback) {
if (fallback_is_fatal) {
-   subflow_add_reset_reason(skb, MPTCP_RST_EMPTCP);
+   reason = MPTCP_RST_EMPTCP;
+   subflow_add_reset_reason(skb, reason);
goto dispose_child;
}
goto fallback;
@@ -861,7 +869,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
} else if (ctx->mp_join) {
owner = subflow_req->msk;
if (!owner) {
-   subflow_add_reset_reason(skb, 
MPTCP_RST_EPROHIBIT);
+   reason = MPTCP_RST_EPROHIBIT;
+   subflow_add_reset_reason(skb, reason);
goto dispose_child;
}
 
@@ -875,13 +884,18 @@ static struct sock *subflow_syn_recv_sock(const struct 
sock *sk,
 ntohs(inet_sk((struct sock 
*)owner)->inet_sport));
if (!mptcp_pm_sport_in_anno_list(owner, sk)) {
SUBFLOW_REQ_INC_STATS(req, 
MPTCP_MIB_MISMATCHPORTACKRX);
+   reason = MPTCP_RST_EUNSPEC;
goto dispose_child;
}
SUBFLOW_REQ_INC_STATS(req, 
MPTCP_MIB_JOINPORTACKRX);
}
 
-   if (!mptcp_finish_join(child))
+   if (!mptcp_finish_join(child)) {
+   struct mptcp_subflow_context *subflow = 
mptcp_subflow_ctx(sk);
+
+   reason = subflow->reset_reason;
goto dispose_child;
+   }
 
SUBFLOW_REQ_INC_STATS(req, MPTCP_MIB_JOINACKRX);
tcp_rsk(req)->drop_req = true;
@@ -901,7 +915,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
tcp_rsk(req)->drop_req = true;
inet_csk_prepare_for_destroy_sock(child);
tcp_done(child);
-   req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   req->rsk_ops->send_reset(sk, skb, convert_mptcp_reason(reason));
 
/* The last child reference will be released by the caller */
return child;
-- 
2.37.3

[PATCH net-next v2 4/6] tcp: support rstreason for passive reset

2024-04-04 Thread Jason Xing

From: Jason Xing 

Reuse the dropreason logic to show the exact reason of tcp reset,
so we don't need to implement those duplicated reset reasons.
This patch replaces all the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/ipv4/tcp_ipv4.c | 8 
 net/ipv6/tcp_ipv6.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d8d98db8f58e..1ae2716f0c34 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1935,7 +1935,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(rsk, skb, reason);
 discard:
kfree_skb_reason(skb, reason);
/* Be careful here. If this function gets more complicated and
@@ -2280,7 +2280,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-   tcp_v4_send_reset(nsk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(nsk, skb, drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -2358,7 +2358,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(NULL, skb, drop_reason);
}
 
 discard_it:
@@ -2409,7 +2409,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
tcp_v4_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(sk, skb, drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 7e591521b193..6889ea70c760 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1678,7 +1678,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, reason);
 discard:
if (opt_skb)
__kfree_skb(opt_skb);
@@ -1864,7 +1864,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-   tcp_v6_send_reset(nsk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(nsk, skb, drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -1940,7 +1940,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v6_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(NULL, skb, drop_reason);
}
 
 discard_it:
@@ -1995,7 +1995,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
tcp_v6_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:
-- 
2.37.3

[PATCH net-next v2 3/6] rstreason: prepare for active reset

2024-04-04 Thread Jason Xing

From: Jason Xing 

Like what we did to passive reset:
only passing possible reset reason in each active reset path.

No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/tcp.h |  2 +-
 net/ipv4/tcp.c| 15 ++-
 net/ipv4/tcp_output.c |  2 +-
 net/ipv4/tcp_timer.c  |  9 ++---
 net/mptcp/protocol.c  |  4 +++-
 net/mptcp/subflow.c   |  5 +++--
 6 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9ab5b37e9d53..67ab4dbf7805 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -667,7 +667,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 void tcp_send_probe0(struct sock *);
 int tcp_write_wakeup(struct sock *, int mib);
 void tcp_send_fin(struct sock *sk);
-void tcp_send_active_reset(struct sock *sk, gfp_t priority);
+void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason);
 int tcp_send_synack(struct sock *);
 void tcp_push_one(struct sock *, unsigned int mss_now);
 void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e767721b3a58..eacfe0012977 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2805,7 +2806,8 @@ void __tcp_close(struct sock *sk, long timeout)
/* Unread data was tossed, zap the connection. */
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, sk->sk_allocation);
+   tcp_send_active_reset(sk, sk->sk_allocation,
+ SK_RST_REASON_NOT_SPECIFIED);
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
/* Check zero linger _after_ checking for unread data. */
sk->sk_prot->disconnect(sk, 0);
@@ -2879,7 +2881,8 @@ void __tcp_close(struct sock *sk, long timeout)
struct tcp_sock *tp = tcp_sk(sk);
if (READ_ONCE(tp->linger2) < 0) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONLINGER);
} else {
@@ -2897,7 +2900,8 @@ void __tcp_close(struct sock *sk, long timeout)
if (sk->sk_state != TCP_CLOSE) {
if (tcp_check_oom(sk, 0)) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
} else if (!check_net(sock_net(sk))) {
@@ -3001,7 +3005,7 @@ int tcp_disconnect(struct sock *sk, int flags)
/* The last check adjusts for discrepancy of Linux wrt. RFC
 * states
 */
-   tcp_send_active_reset(sk, gfp_any());
+   tcp_send_active_reset(sk, gfp_any(), 
SK_RST_REASON_NOT_SPECIFIED);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (old_state == TCP_SYN_SENT)
WRITE_ONCE(sk->sk_err, ECONNRESET);
@@ -4557,7 +4561,8 @@ int tcp_abort(struct sock *sk, int err)
smp_wmb();
sk_error_report(sk);
if (tcp_need_reset(sk->sk_state))
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
tcp_done(sk);
}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..18fbbad2028a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3583,7 +3583,7 @@ void tcp_send_fin(struct sock *sk)
  * was unread data in the receive queue.  This behavior is recommended
  * by RFC 2525, section 2.17.  -DaveM
  */
-void tcp_send_active_reset(struct sock *sk, gfp_t priority)
+void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason)
 {
struct sk_buff *skb;
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 976db57b95d4..83fe7f62f7f1 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
 {
@@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool 
do_reset)
(!tp->snd_wnd && !tp->packets_out))
do_reset = true;
if (do_reset)
-

[PATCH net-next v2 2/6] rstreason: prepare for passive reset

2024-04-04 Thread Jason Xing

From: Jason Xing 

Adjust the paramenter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/request_sock.h |  3 ++-
 net/dccp/ipv4.c| 10 ++
 net/dccp/ipv6.c| 10 ++
 net/dccp/minisocks.c   |  3 ++-
 net/ipv4/tcp_ipv4.c| 12 +++-
 net/ipv4/tcp_minisocks.c   |  3 ++-
 net/ipv6/tcp_ipv6.c| 15 +--
 net/mptcp/subflow.c|  8 +---
 8 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 004e651e6067..93f9fee7e52f 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -34,7 +34,8 @@ struct request_sock_ops {
void(*send_ack)(const struct sock *sk, struct sk_buff *skb,
struct request_sock *req);
void(*send_reset)(const struct sock *sk,
- struct sk_buff *skb);
+ struct sk_buff *skb,
+ int reason);
void(*destructor)(struct request_sock *req);
void(*syn_ack_timeout)(const struct request_sock *req);
 };
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9fc9cea4c251..11b8d14be3e2 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ackvec.h"
 #include "ccid.h"
@@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, 
struct request_sock *req
return err;
 }
 
-static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb)
+static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb,
+  int reason)
 {
int err;
const struct iphdr *rxiph;
@@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
kfree_skb(skb);
return 0;
 }
@@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index c8ca703dc331..232092dc3887 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "dccp.h"
 #include "ipv6.h"
@@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock 
*req)
kfree_skb(inet_rsk(req)->pktopts);
 }
 
-static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb)
+static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb,
+  int reason)
 {
const struct ipv6hdr *rxip6h;
struct sk_buff *skb;
@@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff 
*skb)
return 0;
 
 reset:
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 discard:
if (opt_skb != NULL)
__kfree_skb(opt_skb);
@@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 64d805b27add..251a57cf5822 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -15,6 +15,7 @@
 #include

[PATCH net-next v2 1/6] net: introduce rstreason to detect why the RST is sent

2024-04-04 Thread Jason Xing

From: Jason Xing 

Add a new standalone file for the easy future extension to support
both active reset and passive reset in the TCP/DCCP/MPTCP protocols.

This patch only does the preparations for reset reason mechanism,
nothing else changes.

The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) reuse MP_TCPRST option for MPTCP
3) our own reasons

I will implement the basic codes of active/passive reset reason in
those three protocols, which is not complete for this moment. But
it provides a new chance to let other people add more reasons into
it:)

Signed-off-by: Jason Xing 
---
 include/net/rstreason.h | 93 +
 1 file changed, 93 insertions(+)
 create mode 100644 include/net/rstreason.h

diff --git a/include/net/rstreason.h b/include/net/rstreason.h
new file mode 100644
index ..24d098a78a60
--- /dev/null
+++ b/include/net/rstreason.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _LINUX_RSTREASON_H
+#define _LINUX_RSTREASON_H
+#include 
+
+#define DEFINE_RST_REASON(FN, FNe) \
+   FN(MPTCP_RST_EUNSPEC)   \
+   FN(MPTCP_RST_EMPTCP)\
+   FN(MPTCP_RST_ERESOURCE) \
+   FN(MPTCP_RST_EPROHIBIT) \
+   FN(MPTCP_RST_EWQ2BIG)   \
+   FN(MPTCP_RST_EBADPERF)  \
+   FN(MPTCP_RST_EMIDDLEBOX)\
+   FN(NOT_SPECIFIED)   \
+   FNe(MAX)
+
+#define RST_REASON_START (SKB_DROP_REASON_MAX + 1)
+
+/* There are three parts in order:
+ * 1) 0 - SKB_DROP_REASON_MAX: rely on drop reasons for passive reset in TCP
+ * 2) SKB_DROP_REASON_MAX + 1 - MPTCP_RST_EMIDDLEBOX: for MPTCP use
+ * 3) MPTCP_RST_EMIDDLEBOX - SK_RST_REASON_MAX: independent reset reason
+ */
+enum sk_rst_reason {
+   /* Leave this 'blank' part (0-SKB_DROP_REASON_MAX) for the reuse
+* of skb drop reason because rst reason relies on what drop reason
+* indicates exactly why it could happen.
+*/
+
+   /* Copy from include/uapi/linux/mptcp.h.
+* These reset fields will not be changed since they adhere to
+* RFC 8684. So do not touch them. I'm going to list each definition
+* of them respectively.
+*/
+   /* Unspecified error.
+* This is the default error; it implies that the subflow is no
+* longer available. The presence of this option shows that the
+* RST was generated by an MPTCP-aware device.
+*/
+   SK_RST_REASON_MPTCP_RST_EUNSPEC = RST_REASON_START,
+   /* MPTCP-specific error.
+* An error has been detected in the processing of MPTCP options.
+* This is the usual reason code to return in the cases where a RST
+* is being sent to close a subflow because of an invalid response.
+*/
+   SK_RST_REASON_MPTCP_RST_EMPTCP,
+   /* Lack of resources.
+* This code indicates that the sending host does not have enough
+* resources to support the terminated subflow.
+*/
+   SK_RST_REASON_MPTCP_RST_ERESOURCE,
+   /* Administratively prohibited.
+* This code indicates that the requested subflow is prohibited by
+* the policies of the sending host.
+*/
+   SK_RST_REASON_MPTCP_RST_EPROHIBIT,
+   /* Too much outstanding data.
+* This code indicates that there is an excessive amount of data
+* that needs to be transmitted over the terminated subflow while
+* having already been acknowledged over one or more other subflows.
+* This may occur if a path has been unavailable for a short period
+* and it is more efficient to reset and start again than it is to
+* retransmit the queued data.
+*/
+   SK_RST_REASON_MPTCP_RST_EWQ2BIG,
+   /* Unacceptable performance.
+* This code indicates that the performance of this subflow was
+* too low compared to the other subflows of this Multipath TCP
+* connection.
+*/
+   SK_RST_REASON_MPTCP_RST_EBADPERF,
+   /* Middlebox interference.
+* Middlebox interference has been detected over this subflow,
+* making MPTCP signaling invalid. For example, this may be sent
+* if the checksum does not validate.
+*/
+   SK_RST_REASON_MPTCP_RST_EMIDDLEBOX,
+
+   /* For the real standalone socket reset reason, we start from here */
+   SK_RST_REASON_NOT_SPECIFIED,
+
+   /* Maximum of socket reset reasons.
+* It shouldn't be used as a real 'reason'.
+*/
+   SK_RST_REASON_MAX,
+};
+
+static inline int convert_mptcp_reason(int reason)
+{
+   return reason += RST_REASON_START;
+}
+#endif
-- 
2.37.3

[PATCH net-next 0/6] Implement reset reason mechanism to detect

2024-04-04 Thread Jason Xing

From: Jason Xing 

In production, there are so many cases about why the RST skb is sent but
we don't have a very convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own.

In this series, I focus on the fundamental implement mostly about how
the rstreason mechnism and the detailed passive part works as an
example, not including the active reset part. In future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: 
https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

v2
Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/
1. rebase against the latest net-next tree

Jason Xing (6):
  net: introduce rstreason to detect why the RST is sent
  rstreason: prepare for passive reset
  rstreason: prepare for active reset
  tcp: support rstreason for passive reset
  mptcp: support rstreason for passive reset
  rstreason: make it work in trace world

 include/net/request_sock.h |  3 +-
 include/net/rstreason.h| 93 ++
 include/net/tcp.h  |  2 +-
 include/trace/events/tcp.h | 37 +--
 net/dccp/ipv4.c| 10 ++--
 net/dccp/ipv6.c| 10 ++--
 net/dccp/minisocks.c   |  3 +-
 net/ipv4/tcp.c | 15 --
 net/ipv4/tcp_ipv4.c| 14 +++---
 net/ipv4/tcp_minisocks.c   |  3 +-
 net/ipv4/tcp_output.c  |  4 +-
 net/ipv4/tcp_timer.c   |  9 ++--
 net/ipv6/tcp_ipv6.c| 17 ---
 net/mptcp/protocol.c   |  4 +-
 net/mptcp/subflow.c| 33 ++
 15 files changed, 209 insertions(+), 48 deletions(-)
 create mode 100644 include/net/rstreason.h

-- 
2.37.3

45 matches

Mail list logo