Re: [PATCH] riscv: selftests: Add signal handling vector tests

2024-04-03 Thread Björn Töpel
Charlie Jenkins  writes:

> Add two tests to check vector save/restore when a signal is received
> during a vector routine. One test ensures that a value is not clobbered
> during signal handling. The other verifies that vector registers
> modified in the signal handler are properly reflected when the signal
> handling is complete.
>
> Signed-off-by: Charlie Jenkins 

Nice!

Reviewed-by: Björn Töpel 



Re: [PATCH bpf-next v5 1/6] bpf/helpers: introduce sleepable bpf_timers

2024-04-03 Thread Alexei Starovoitov
On Wed, Apr 3, 2024 at 6:01 PM Alexei Starovoitov
 wrote:
>
> On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov
>  wrote:
> >
> > On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
> >  wrote:
> > > > > goto out;
> > > > > }
> > > > > +   spin_lock(&t->sleepable_lock);
> > > > > drop_prog_refcnt(t);
> > > > > +   spin_unlock(&t->sleepable_lock);
> > > >
> > > > this also looks odd.
> > >
> > > I basically need to protect "t->prog = NULL;" from happening while
> > > bpf_timer_work_cb is setting up the bpf program to be run.
> >
> > Ok. I think I understand the race you're trying to fix.
> > The bpf_timer_cancel_and_free() is doing
> > cancel_work()
> > and proceeds with
> > kfree_rcu(t, rcu);
> >
> > That's the only race and these extra locks don't help.
> >
> > The t->prog = NULL is nothing to worry about.
> > The bpf_timer_work_cb() might still see callback_fn == NULL
> > "when it's being setup" and it's ok.
> > These locks don't help that.
> >
> > I suggest to drop sleepable_lock everywhere.
> > READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
> > Add rcu_read_lock_trace() before calling bpf prog.
> >
> > The race to fix is above 'cancel_work + kfree_rcu'
> > since kfree_rcu might free 'struct bpf_hrtimer *t'
> > while the work is pending and work_queue internal
> > logic might UAF struct work_struct work.
> > By the time it may luckily enter bpf_timer_work_cb() it's too late.
> > The argument 'struct work_struct *work' might already be freed.
> >
> > To fix this problem, how about the following:
> > don't call kfree_rcu and instead queue the work to free it.
> > After cancel_work(&t->work); the work_struct can be reused.
> > So set it up to call "freeing callback" and do
> > schedule_work(&t->work);
> >
> > There is a big assumption here that new work won't be
> > executed before cancelled work completes.
> > Need to check with wq experts.
> >
> > Another approach is to do something smart with
> > cancel_work() return code.
> > If it returns true set a flag inside bpf_hrtimer and
> > make bpf_timer_work_cb() free(t) after bpf prog finishes.
>
> Looking through wq code... I think I have to correct myself.
> cancel_work and immediate free is probably fine from wq pov.
> It has this comment:
> worker->current_func(work);
> /*
>  * While we must be careful to not use "work" after this, the trace
>  * point will only record its address.
>  */
> trace_workqueue_execute_end(work, worker->current_func);
>
> the bpf_timer_work_cb() might still be running bpf prog.
> So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns,
> since kfree_rcu(t, rcu); could have freed it by then.
> There is also this code in net/rxrpc/rxperf.c
> cancel_work(&call->work);
> kfree(call);

Correction to correction.
Above piece in rxrpc is buggy.
The following race is possible:
cpu 0
process_one_work()
set_work_pool_and_clear_pending(work, pool->id, 0);

cpu 1
cancel_work()
kfree_rcu(work)

worker->current_func(work);

Here 'work' is a pointer to freed memory.
Though wq code will not be touching it, callback will UAF.

Also what I proposed earlier as:
INIT_WORK(A); schedule_work(); cancel_work(); INIT_WORK(B); schedule_work();
won't guarantee the ordering.
Since the callback function is different,
find_worker_executing_work() will consider it a separate work item.

Another option is to keep the bpf_timer_work_cb callback,
add a 'bool free_me;' to struct bpf_hrtimer,
and let the callback free it.
But it's also racy.
cancel_work() may return false, though worker->current_func(work)
wasn't called yet.
So we cannot set 'free_me' in bpf_timer_cancel_and_free()
in a race-free manner.

After brainstorming with Tejun it seems the best is to use
another work_struct to call a different callback and do
cancel_work_sync() there.

So we need something like:

struct bpf_hrtimer {
  union {
struct hrtimer timer;
+   struct work_struct work;
  };
  struct bpf_map *map;
  struct bpf_prog *prog;
  void __rcu *callback_fn;
  void *value;
  union {
struct rcu_head rcu;
+   struct work_struct sync_work;
  };
+ u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
 };

'work' will be used to call bpf_timer_work_cb.
'sync_work' will be used to call cancel_work_sync() + kfree_rcu().

And, of course,
schedule_work(&t->sync_work); from bpf_timer_cancel_and_free()
instead of kfree_rcu.
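
A rough sketch of this two-work_struct scheme (the callback name here is
hypothetical; the eventual patch may differ):

```c
/* Sketch only: assumes the union layout proposed above. */
static void bpf_timer_sync_work_cb(struct work_struct *sync_work)
{
	struct bpf_hrtimer *t = container_of(sync_work, struct bpf_hrtimer,
					     sync_work);

	/* Wait for a possibly-running bpf_timer_work_cb() to finish... */
	cancel_work_sync(&t->work);
	/* ...then defer the actual free past any remaining RCU readers. */
	kfree_rcu(t, rcu);
}
```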



Re: [PATCH v3 15/15] powerpc: Add support for suppressing warning backtraces

2024-04-03 Thread Michael Ellerman
Guenter Roeck  writes:
> Add name of functions triggering warning backtraces to the __bug_table
> object section to enable support for suppressing WARNING backtraces.
>
> To limit image size impact, the pointer to the function name is only added
> to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
> CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
> parameter is replaced with a (dummy) NULL parameter to avoid an image size
> increase due to unused __func__ entries (this is necessary because __func__
> is not a define but a virtual variable).
>
> Tested-by: Linux Kernel Functional Testing 
> Acked-by: Dan Carpenter 
> Cc: Michael Ellerman 
> Signed-off-by: Guenter Roeck 
> ---
> v2:
> - Rebased to v6.9-rc1
> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
> - Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
> v3:
> - Rebased to v6.9-rc2
>
>  arch/powerpc/include/asm/bug.h | 37 +-
>  1 file changed, 28 insertions(+), 9 deletions(-)

I ran it through some build and boot tests, LGTM.

Acked-by: Michael Ellerman  (powerpc)

cheers



Re: [PATCH v3 06/15] net: kunit: Suppress lock warning noise at end of dev_addr_lists tests

2024-04-03 Thread Jakub Kicinski
On Wed,  3 Apr 2024 06:19:27 -0700 Guenter Roeck wrote:
> dev_addr_lists_test generates lock warning noise at the end of tests
> if lock debugging is enabled. There are two sets of warnings.
> 
> WARNING: CPU: 0 PID: 689 at kernel/locking/mutex.c:923 
> __mutex_unlock_slowpath.constprop.0+0x13c/0x368
> DEBUG_LOCKS_WARN_ON(__owner_task(owner) != __get_current())
> 
> WARNING: kunit_try_catch/1336 still has locks held!
> 
> KUnit test cleanup is not guaranteed to run in the same thread as the test
> itself. For this test, this means that rtnl_lock() and rtnl_unlock() may
> be called from different threads. This triggers the warnings.
> Suppress the warnings because they are irrelevant for the test and just
> confusing and distracting.
> 
> The first warning can be suppressed by using START_SUPPRESSED_WARNING()
> and END_SUPPRESSED_WARNING() around the call to rtnl_unlock(). To suppress
> the second warning, it is necessary to set debug_locks_silent while the
> rtnl lock is held.

Is it okay if I move the locking into the tests, instead?
It's only 4 lines more and no magic required, seems to work fine.
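
For reference, the suppression described in the commit message above would
look something like the following (the exact macro argument form is an
assumption based on the series, not verified against it):

```c
/* Hypothetical sketch of the suppression approach from the series. */
debug_locks_silent = 1;		/* set while the rtnl lock is held */
START_SUPPRESSED_WARNING(__mutex_unlock_slowpath);
rtnl_unlock();
END_SUPPRESSED_WARNING(__mutex_unlock_slowpath);
debug_locks_silent = 0;
```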



Re: [PATCH bpf-next v5 1/6] bpf/helpers: introduce sleepable bpf_timers

2024-04-03 Thread Alexei Starovoitov
On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov
 wrote:
>
> On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
>  wrote:
> > > > goto out;
> > > > }
> > > > +   spin_lock(&t->sleepable_lock);
> > > > drop_prog_refcnt(t);
> > > > +   spin_unlock(&t->sleepable_lock);
> > >
> > > this also looks odd.
> >
> > I basically need to protect "t->prog = NULL;" from happening while
> > bpf_timer_work_cb is setting up the bpf program to be run.
>
> Ok. I think I understand the race you're trying to fix.
> The bpf_timer_cancel_and_free() is doing
> cancel_work()
> and proceeds with
> kfree_rcu(t, rcu);
>
> That's the only race and these extra locks don't help.
>
> The t->prog = NULL is nothing to worry about.
> The bpf_timer_work_cb() might still see callback_fn == NULL
> "when it's being setup" and it's ok.
> These locks don't help that.
>
> I suggest to drop sleepable_lock everywhere.
> READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
> Add rcu_read_lock_trace() before calling bpf prog.
>
> The race to fix is above 'cancel_work + kfree_rcu'
> since kfree_rcu might free 'struct bpf_hrtimer *t'
> while the work is pending and work_queue internal
> logic might UAF struct work_struct work.
> By the time it may luckily enter bpf_timer_work_cb() it's too late.
> The argument 'struct work_struct *work' might already be freed.
>
> To fix this problem, how about the following:
> don't call kfree_rcu and instead queue the work to free it.
> After cancel_work(&t->work); the work_struct can be reused.
> So set it up to call "freeing callback" and do
> schedule_work(&t->work);
>
> There is a big assumption here that new work won't be
> executed before cancelled work completes.
> Need to check with wq experts.
>
> Another approach is to do something smart with
> cancel_work() return code.
> If it returns true set a flag inside bpf_hrtimer and
> make bpf_timer_work_cb() free(t) after bpf prog finishes.

Looking through wq code... I think I have to correct myself.
cancel_work and immediate free is probably fine from wq pov.
It has this comment:
worker->current_func(work);
/*
 * While we must be careful to not use "work" after this, the trace
 * point will only record its address.
 */
trace_workqueue_execute_end(work, worker->current_func);

the bpf_timer_work_cb() might still be running bpf prog.
So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns,
since kfree_rcu(t, rcu); could have freed it by then.
There is also this code in net/rxrpc/rxperf.c
> cancel_work(&call->work);
kfree(call);

So it looks like it's fine to drop sleepable_lock,
add rcu_read_lock_trace() and things should be ok.
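
In that shape, bpf_timer_work_cb() might look roughly like this (a sketch
under the assumptions above, not the actual patch; argument marshalling for
the callback is elided):

```c
static void bpf_timer_work_cb(struct work_struct *work)
{
	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
	bpf_callback_t callback_fn;

	/* Sleepable progs are protected by RCU tasks-trace, not plain RCU. */
	rcu_read_lock_trace();
	callback_fn = READ_ONCE(t->callback_fn);
	if (callback_fn)
		/* key argument elided in this sketch */
		callback_fn((u64)(long)t->map, (u64)(long)t->value, 0, 0, 0);
	rcu_read_unlock_trace();
}
```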



Re: [PATCH] riscv: selftests: Add signal handling vector tests

2024-04-03 Thread Charlie Jenkins
> > +static void vector_override(int sig_no, siginfo_t *info, void *vcontext)
> > +{
> > +   ucontext_t *context = vcontext;
> > +
> > +   // vector state
> > +   struct __riscv_extra_ext_header *ext;
> > +   struct __riscv_v_ext_state *v_ext_state;
> > +
> > +   /* Find the vector context. */
> > +   ext = (void *)(&context->uc_mcontext.__fpregs);
> > +   if (ext->hdr.magic != RISCV_V_MAGIC) {
> > +   fprintf(stderr, "bad vector magic: %x\n", ext->hdr.magic);
> > +   abort();
> > +   }
> > +
> > +   v_ext_state = (void *)((char *)(ext) + sizeof(*ext));
> > +
> > +   *(int *)v_ext_state->datap = SIGNAL_HANDLER_OVERRIDE;
> > +
> > +   context->uc_mcontext.__gregs[REG_PC] = 
> > context->uc_mcontext.__gregs[REG_PC] + 4;
> > +}
> > +
> > +static int vector_sigreturn(int data, void (*handler)(int, siginfo_t *, 
> > void *))
> > +{
> > +   int after_sigreturn;
> > +   struct sigaction sig_action = {
> > +   .sa_sigaction = handler,
> > +   .sa_flags = SA_SIGINFO
> > +   };
> > +
> > +   sigaction(SIGSEGV, &sig_action, 0);
> > +
> > +   asm(".option push   \n\
> > +   .option arch, +v\n\
> > +   vsetivlix0, 1, e32, ta, ma  \n\
> > +   vmv.s.x     v0, %1  \n\
> > +   # Generate SIGSEGV  \n\
> > +   lw  a0, 0(x0)   \n\
> > +   vmv.x.s %0, v0  \n\
> > +   .option pop" : "=r" (after_sigreturn) : "r" (data));
> > +
> > +   return after_sigreturn;
> > +}
> > +
> > +TEST(vector_restore)
> > +{
> > +   int result;
> > +
> > +   result = vector_sigreturn(DEFAULT_VALUE, &simple_handle);
> > +
> > +   EXPECT_EQ(DEFAULT_VALUE, result);
> > +}
> > +
> > +TEST(vector_restore_signal_handler_override)
> > +{
> > +   int result;
> > +
> > +   result = vector_sigreturn(DEFAULT_VALUE, &vector_override);
> > +
> > +   EXPECT_EQ(SIGNAL_HANDLER_OVERRIDE, result);
> > +}
> > +
> > +TEST_HARNESS_MAIN
> >
> > ---
> > base-commit: 4cece764965020c22cff7665b18a012006359095
> > change-id: 20240403-vector_sigreturn_tests-8118f0ac54fa
> 



Re: [PATCH] riscv: selftests: Add signal handling vector tests

2024-04-03 Thread Vineet Gupta
> + vmv.s.x v0, %1  \n\
> + # Generate SIGSEGV  \n\
> + lw  a0, 0(x0)   \n\
> + vmv.x.s %0, v0  \n\
> + .option pop" : "=r" (after_sigreturn) : "r" (data));
> +
> + return after_sigreturn;
> +}
> +
> +TEST(vector_restore)
> +{
> + int result;
> +
> + result = vector_sigreturn(DEFAULT_VALUE, &simple_handle);
> +
> + EXPECT_EQ(DEFAULT_VALUE, result);
> +}
> +
> +TEST(vector_restore_signal_handler_override)
> +{
> + int result;
> +
> + result = vector_sigreturn(DEFAULT_VALUE, &vector_override);
> +
> + EXPECT_EQ(SIGNAL_HANDLER_OVERRIDE, result);
> +}
> +
> +TEST_HARNESS_MAIN
>
> ---
> base-commit: 4cece764965020c22cff7665b18a012006359095
> change-id: 20240403-vector_sigreturn_tests-8118f0ac54fa




Re: [PATCH bpf-next] selftests/bpf: Add F_SETFL for fcntl

2024-04-03 Thread Geliang Tang
Hi Jakub,
 
On Wed, 2024-04-03 at 15:29 -0700, John Fastabend wrote:
> Jakub Sitnicki wrote:
> > Hi Geliang,
> > 
> > On Wed, Apr 03, 2024 at 04:32 PM +08, Geliang Tang wrote:
> > > From: Geliang Tang 
> > > 
> > > Incorrect arguments are passed to fcntl() in test_sockmap.c when
> > > invoking
> > > it to set file status flags. If O_NONBLOCK is used as 2nd
> > > argument and
> > > passed into fcntl, -EINVAL will be returned (See do_fcntl() in
> > > fs/fcntl.c).
> > > The correct approach is to use F_SETFL as 2nd argument, and
> > > O_NONBLOCK as
> > > 3rd one.
> > > 
> > > Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
> > > Signed-off-by: Geliang Tang 
> > > ---
> > >  tools/testing/selftests/bpf/test_sockmap.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/tools/testing/selftests/bpf/test_sockmap.c
> > > b/tools/testing/selftests/bpf/test_sockmap.c
> > > index 024a0faafb3b..34d6a1e6f664 100644
> > > --- a/tools/testing/selftests/bpf/test_sockmap.c
> > > +++ b/tools/testing/selftests/bpf/test_sockmap.c
> > > @@ -603,7 +603,7 @@ static int msg_loop(int fd, int iov_count,
> > > int iov_length, int cnt,
> > >   struct timeval timeout;
> > >   fd_set w;
> > >  
> > > - fcntl(fd, fd_flags);
> > > + fcntl(fd, F_SETFL, fd_flags);
> > >   /* Account for pop bytes noting each iteration
> > > of apply will
> > >    * call msg_pop_data helper so we need to
> > > account for this
> > >    * by calculating the number of apply
> > > iterations. Note user
> > 
> > Good catch. But we also need to figure out why some tests failing
> > with
> > this patch applied and fix them in one go:
> > 
> > # 6/ 7  sockmap::txmsg test skb:FAIL
> > #21/ 7 sockhash::txmsg test skb:FAIL
> > #36/ 7 sockhash:ktls:txmsg test skb:FAIL
> > Pass: 42 Fail: 3

Sorry, I didn't notice these fails in my testing before, they do exist.
I'll try to fix them and send a v2 soon.

Thanks,
-Geliang

> > 
> > I'm seeing this error message when running `test_sockmap`:
> > 
> > detected skb data error with skb ingress update @iov[0]:0 "00 00 00
> > 00" != "PASS"
> > data verify msg failed: Unknown error -5
> > rx thread exited with err 1.
> 
> I have a theory this is a real bug in the SK_SKB_STREAM_PARSER, which
> has an issue with wakeup logic. Maybe we wake up the poll/select logic
> before the data is copied, and because the recv() is actually
> nonblocking now, we get the error.
> 
> > 
> > I'd also:
> > - add an error check for fcntl, so we don't regress,
> > - get rid of fd_flags, pass O_NONBLOCK flag directly to fcntl.
> > 
> > Thanks,
> > -jkbs
> 
> 




[PATCH] riscv: selftests: Add signal handling vector tests

2024-04-03 Thread Charlie Jenkins
Add two tests to check vector save/restore when a signal is received
during a vector routine. One test ensures that a value is not clobbered
during signal handling. The other verifies that vector registers
modified in the signal handler are properly reflected when the signal
handling is complete.

Signed-off-by: Charlie Jenkins 
---
These tests came about to highlight the bug fixed in
https://lore.kernel.org/lkml/20240403072638.567446-1-bj...@kernel.org/
and will only pass with that fix applied.
---
 tools/testing/selftests/riscv/Makefile |  2 +-
 tools/testing/selftests/riscv/sigreturn/.gitignore |  1 +
 tools/testing/selftests/riscv/sigreturn/Makefile   | 12 
 .../testing/selftests/riscv/sigreturn/sigreturn.c  | 82 ++
 4 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/riscv/Makefile 
b/tools/testing/selftests/riscv/Makefile
index 4a9ff515a3a0..7ce03d832b64 100644
--- a/tools/testing/selftests/riscv/Makefile
+++ b/tools/testing/selftests/riscv/Makefile
@@ -5,7 +5,7 @@
 ARCH ?= $(shell uname -m 2>/dev/null || echo not)
 
 ifneq (,$(filter $(ARCH),riscv))
-RISCV_SUBTARGETS ?= hwprobe vector mm
+RISCV_SUBTARGETS ?= hwprobe vector mm sigreturn
 else
 RISCV_SUBTARGETS :=
 endif
diff --git a/tools/testing/selftests/riscv/sigreturn/.gitignore 
b/tools/testing/selftests/riscv/sigreturn/.gitignore
new file mode 100644
index ..35002b8ae780
--- /dev/null
+++ b/tools/testing/selftests/riscv/sigreturn/.gitignore
@@ -0,0 +1 @@
+sigreturn
diff --git a/tools/testing/selftests/riscv/sigreturn/Makefile 
b/tools/testing/selftests/riscv/sigreturn/Makefile
new file mode 100644
index ..eb8bac9279a8
--- /dev/null
+++ b/tools/testing/selftests/riscv/sigreturn/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2021 ARM Limited
+# Originally tools/testing/arm64/abi/Makefile
+
+CFLAGS += -I$(top_srcdir)/tools/include
+
+TEST_GEN_PROGS := sigreturn
+
+include ../../lib.mk
+
+$(OUTPUT)/sigreturn: sigreturn.c
+   $(CC) -static -o$@ $(CFLAGS) $(LDFLAGS) $^
diff --git a/tools/testing/selftests/riscv/sigreturn/sigreturn.c 
b/tools/testing/selftests/riscv/sigreturn/sigreturn.c
new file mode 100644
index ..62397d5934f1
--- /dev/null
+++ b/tools/testing/selftests/riscv/sigreturn/sigreturn.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../../kselftest_harness.h"
+
+#define RISCV_V_MAGIC  0x53465457
+#define DEFAULT_VALUE  2
+#define SIGNAL_HANDLER_OVERRIDE3
+
+static void simple_handle(int sig_no, siginfo_t *info, void *vcontext)
+{
+   ucontext_t *context = vcontext;
+
+   context->uc_mcontext.__gregs[REG_PC] = 
context->uc_mcontext.__gregs[REG_PC] + 4;
+}
+
+static void vector_override(int sig_no, siginfo_t *info, void *vcontext)
+{
+   ucontext_t *context = vcontext;
+
+   // vector state
+   struct __riscv_extra_ext_header *ext;
+   struct __riscv_v_ext_state *v_ext_state;
+
+   /* Find the vector context. */
+   ext = (void *)(&context->uc_mcontext.__fpregs);
+   if (ext->hdr.magic != RISCV_V_MAGIC) {
+   fprintf(stderr, "bad vector magic: %x\n", ext->hdr.magic);
+   abort();
+   }
+
+   v_ext_state = (void *)((char *)(ext) + sizeof(*ext));
+
+   *(int *)v_ext_state->datap = SIGNAL_HANDLER_OVERRIDE;
+
+   context->uc_mcontext.__gregs[REG_PC] = 
context->uc_mcontext.__gregs[REG_PC] + 4;
+}
+
+static int vector_sigreturn(int data, void (*handler)(int, siginfo_t *, void 
*))
+{
+   int after_sigreturn;
+   struct sigaction sig_action = {
+   .sa_sigaction = handler,
+   .sa_flags = SA_SIGINFO
+   };
+
+   sigaction(SIGSEGV, &sig_action, 0);
+
+   asm(".option push   \n\
+   .option arch, +v\n\
+   vsetivlix0, 1, e32, ta, ma  \n\
+   vmv.s.x v0, %1  \n\
+   # Generate SIGSEGV  \n\
+   lw  a0, 0(x0)   \n\
+   vmv.x.s %0, v0  \n\
+   .option pop" : "=r" (after_sigreturn) : "r" (data));
+
+   return after_sigreturn;
+}
+
+TEST(vector_restore)
+{
+   int result;
+
+   result = vector_sigreturn(DEFAULT_VALUE, &simple_handle);
+
+   EXPECT_EQ(DEFAULT_VALUE, result);
+}
+
+TEST(vector_restore_signal_handler_override)
+{
+   int result;
+
+   result = vector_sigreturn(DEFAULT_VALUE, &vector_override);
+
+   EXPECT_EQ(SIGNAL_HANDLER_OVERRIDE, result);
+}
+
+TEST_HARNESS_MAIN

---
base-commit: 4cece764965020c22cff7665b18a012006359095
change-id: 20240403-vector_sigreturn_tests-8118f0ac54fa
-- 
- Charlie




[PATCH v3 29/29] kselftest/riscv: kselftest for user mode cfi

2024-04-03 Thread Deepak Gupta
Adds kselftest for RISC-V control flow integrity implementation for user
mode. There is not a lot going on in kernel for enabling landing pad for
user mode. The cfi selftests are intended to be compiled with a zicfilp and
zicfiss enabled compiler. Thus the kselftest simply checks whether landing pad
and shadow stack are enabled for the binary and process. The selftest then
registers a signal handler for SIGSEGV. Any control flow violation is reported
as SIGSEGV with si_code = SEGV_CPERR, so the test fails on receiving any
SEGV_CPERR. The shadow stack part has more changes in the kernel, and thus
there are separate tests for that:
- Exercise `map_shadow_stack` syscall
- `fork` test to make sure COW works for shadow stack pages
- gup tests
  As of today kernel uses FOLL_FORCE when access happens to memory via
  /proc//mem. Not breaking that for shadow stack
- signal test. Make sure signal delivery results in token creation on
  shadow stack and consumes (and verifies) token on sigreturn
- shadow stack protection test. attempts to write using regular store
  instruction on shadow stack memory must result in access faults

Signed-off-by: Deepak Gupta 
---
 tools/testing/selftests/riscv/Makefile|   2 +-
 tools/testing/selftests/riscv/cfi/.gitignore  |   3 +
 tools/testing/selftests/riscv/cfi/Makefile|  10 +
 .../testing/selftests/riscv/cfi/cfi_rv_test.h |  83 
 .../selftests/riscv/cfi/riscv_cfi_test.c  |  82 
 .../testing/selftests/riscv/cfi/shadowstack.c | 362 ++
 .../testing/selftests/riscv/cfi/shadowstack.h |  37 ++
 7 files changed, 578 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/riscv/cfi/.gitignore
 create mode 100644 tools/testing/selftests/riscv/cfi/Makefile
 create mode 100644 tools/testing/selftests/riscv/cfi/cfi_rv_test.h
 create mode 100644 tools/testing/selftests/riscv/cfi/riscv_cfi_test.c
 create mode 100644 tools/testing/selftests/riscv/cfi/shadowstack.c
 create mode 100644 tools/testing/selftests/riscv/cfi/shadowstack.h

diff --git a/tools/testing/selftests/riscv/Makefile 
b/tools/testing/selftests/riscv/Makefile
index 4a9ff515a3a0..867e5875b7ce 100644
--- a/tools/testing/selftests/riscv/Makefile
+++ b/tools/testing/selftests/riscv/Makefile
@@ -5,7 +5,7 @@
 ARCH ?= $(shell uname -m 2>/dev/null || echo not)
 
 ifneq (,$(filter $(ARCH),riscv))
-RISCV_SUBTARGETS ?= hwprobe vector mm
+RISCV_SUBTARGETS ?= hwprobe vector mm cfi
 else
 RISCV_SUBTARGETS :=
 endif
diff --git a/tools/testing/selftests/riscv/cfi/.gitignore 
b/tools/testing/selftests/riscv/cfi/.gitignore
new file mode 100644
index ..ce7623f9da28
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/.gitignore
@@ -0,0 +1,3 @@
+cfitests
+riscv_cfi_test
+shadowstack
\ No newline at end of file
diff --git a/tools/testing/selftests/riscv/cfi/Makefile 
b/tools/testing/selftests/riscv/cfi/Makefile
new file mode 100644
index ..b65f7ff38a32
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/Makefile
@@ -0,0 +1,10 @@
+CFLAGS += -I$(top_srcdir)/tools/include
+
+CFLAGS += -march=rv64gc_zicfilp_zicfiss
+
+TEST_GEN_PROGS := cfitests
+
+include ../../lib.mk
+
+$(OUTPUT)/cfitests: riscv_cfi_test.c shadowstack.c
+   $(CC) -o$@ $(CFLAGS) $(LDFLAGS) $^
diff --git a/tools/testing/selftests/riscv/cfi/cfi_rv_test.h 
b/tools/testing/selftests/riscv/cfi/cfi_rv_test.h
new file mode 100644
index ..fa1cf7183672
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/cfi_rv_test.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef SELFTEST_RISCV_CFI_H
+#define SELFTEST_RISCV_CFI_H
+#include 
+#include 
+#include "shadowstack.h"
+
+#define RISCV_CFI_SELFTEST_COUNT RISCV_SHADOW_STACK_TESTS
+
+#define CHILD_EXIT_CODE_SSWRITE10
+#define CHILD_EXIT_CODE_SIG_TEST   11
+
+#define my_syscall5(num, arg1, arg2, arg3, arg4, arg5) \
+({ \
+   register long _num  __asm__ ("a7") = (num); \
+   register long _arg1 __asm__ ("a0") = (long)(arg1);  \
+   register long _arg2 __asm__ ("a1") = (long)(arg2);  \
+   register long _arg3 __asm__ ("a2") = (long)(arg3);  \
+   register long _arg4 __asm__ ("a3") = (long)(arg4);  \
+   register long _arg5 __asm__ ("a4") = (long)(arg5);  \
+   \
+   __asm__ volatile ( \
+   "ecall\n" \
+   : "+r"(_arg1) \
+   : "r"(_arg2), "r"(_arg3), 

[PATCH v3 28/29] riscv: Documentation for shadow stack on riscv

2024-04-03 Thread Deepak Gupta
Adding documentation on shadow stack for user mode on riscv and kernel
interfaces exposed so that user tasks can enable it.

Signed-off-by: Deepak Gupta 
---
 Documentation/arch/riscv/zicfiss.rst | 169 +++
 1 file changed, 169 insertions(+)
 create mode 100644 Documentation/arch/riscv/zicfiss.rst

diff --git a/Documentation/arch/riscv/zicfiss.rst 
b/Documentation/arch/riscv/zicfiss.rst
new file mode 100644
index ..f133b6af9c15
--- /dev/null
+++ b/Documentation/arch/riscv/zicfiss.rst
@@ -0,0 +1,169 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Deepak Gupta 
+:Date:   12 January 2024
+
+=========================================================
+Shadow stack to protect function returns on RISC-V Linux
+=========================================================
+
+This document briefly describes the interface provided to userspace by Linux
+to enable shadow stack for user mode applications on RISC-V.
+
+1. Feature Overview
+-------------------
+
+Memory corruption issues usually result in crashes; however, in the hands of
+an adversary, creative use can result in a variety of security issues.
+
+One of those security issues is code re-use attacks, where an adversary
+can use corrupt return addresses present on the stack and chain them together
+to perform return oriented programming (ROP), thus compromising the control
+flow integrity (CFI) of the program.
+
+Return addresses live on the stack, and thus in read-write memory, making them
+susceptible to corruption and allowing an adversary to reach any program
+counter (PC) in the address space. On RISC-V, the `zicfiss` extension provides
+an alternate stack, the `shadow stack`, on which return addresses can be
+safely placed in the prolog of a function and retrieved in the epilog. The
+`zicfiss` extension makes the following changes:
+
+   - PTE encodings for shadow stack virtual memory
+ An earlier reserved encoding in first stage translation, i.e.
+ PTE.R=0, PTE.W=1, PTE.X=0, becomes the PTE encoding for shadow stack
+ pages.
+
+   - `sspush x1/x5` instruction pushes (stores) `x1/x5` to shadow stack.
+
+   - `sspopchk x1/x5` instruction pops (loads) from the shadow stack and
+ compares with `x1/x5`; if unequal, the CPU raises a `software check
+ exception` with `*tval = 3`.
+
+The compiler toolchain makes sure that function prologs have `sspush x1/x5` to
+save the return address on the shadow stack in addition to the regular stack.
+Similarly, function epilogs have `ld x5, offset(x2)`; `sspopchk x5` to ensure
+that the value popped from the regular stack matches the value popped from the
+shadow stack.
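
As an illustration of the prolog/epilog pattern just described (the frame
layout and offsets below are made up for the example, not compiler output):

```asm
foo:
	sspush	x1          # prolog: push return address to the shadow stack
	addi	sp, sp, -16
	sd	x1, 8(sp)       # spill it to the regular stack as usual
	# ... function body ...
	ld	x5, 8(sp)       # epilog: reload return address from regular stack
	sspopchk	x5      # pop shadow copy; traps with *tval = 3 on mismatch
	addi	sp, sp, 16
	jr	x5
```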
+
+2. Shadow stack protections and linux memory manager
+----------------------------------------------------
+
+As mentioned earlier, shadow stack pages get new page table encodings and thus
+have some special properties assigned to them, as do the instructions that
+operate on them, as below:
+
+   - Regular stores to shadow stack memory raise access store faults.
+ This way shadow stack memory is protected from stray inadvertent
+ writes.
+
+   - Regular loads to shadow stack memory are allowed.
+ This allows stack trace utilities or backtrace functions to read
+ the true call stack (not tampered with).
+
+   - Only shadow stack instructions can generate shadow stack load or
+ shadow stack store.
+
+   - Shadow stack load / shadow stack store on read-only memory raises an
+ AMO/store page fault. Thus both `sspush x1/x5` and `sspopchk x1/x5`
+ will raise an AMO/store page fault. This simplifies COW handling in the
+ kernel: during fork, the kernel can convert shadow stack pages into
+ read-only memory (as it does for regular read-write memory), and as
+ soon as a subsequent `sspush` or `sspopchk` in userspace is
+ encountered, the kernel can perform COW.
+
+   - Shadow stack load / shadow stack store on read-write, read-write-
+ execute memory raises an access fault. This is a fatal condition
+ because shadow stack should never be operating on read-write, read-
+ write-execute memory.
+
+3. ELF and psABI
+----------------
+
+Toolchain sets up `GNU_PROPERTY_RISCV_FEATURE_1_BCFI` for property
+`GNU_PROPERTY_RISCV_FEATURE_1_AND` in notes section of the object file.
+
+4. Linux enabling
+-----------------
+
+User space programs can have multiple shared objects loaded in their address
+space, and it's a difficult task to make sure all the dependencies have been
+compiled with shadow stack support. Thus it's left to the dynamic loader to
+enable shadow stack for the program.
+
+5. prctl() enabling
+-------------------
+
+`PR_SET_SHADOW_STACK_STATUS` / `PR_GET_SHADOW_STACK_STATUS` /
+`PR_LOCK_SHADOW_STACK_STATUS` are three prctls added to manage shadow stack
+enabling for tasks. The prctls are arch agnostic and return -EINVAL on other
+arches.
+
+`PR_SET_SHADOW_STACK_STATUS`: If arg1 is `PR_SHADOW_STACK_ENABLE` and the CPU
+supports `zicfiss` then kernel 

[PATCH v3 27/29] riscv: Documentation for landing pad / indirect branch tracking

2024-04-03 Thread Deepak Gupta
Adding documentation on landing pad aka indirect branch tracking on riscv
and kernel interfaces exposed so that user tasks can enable it.

Signed-off-by: Deepak Gupta 
---
 Documentation/arch/riscv/zicfilp.rst | 104 +++
 1 file changed, 104 insertions(+)
 create mode 100644 Documentation/arch/riscv/zicfilp.rst

diff --git a/Documentation/arch/riscv/zicfilp.rst 
b/Documentation/arch/riscv/zicfilp.rst
new file mode 100644
index ..3007c81f0465
--- /dev/null
+++ b/Documentation/arch/riscv/zicfilp.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Deepak Gupta 
+:Date:   12 January 2024
+
+====================================================
+Tracking indirect control transfers on RISC-V Linux
+====================================================
+
+This document briefly describes the interface provided to userspace by Linux
+to enable indirect branch tracking for user mode applications on RISC-V.
+
+1. Feature Overview
+-------------------
+
+Memory corruption issues usually result in crashes; however, in the hands of
+an adversary, creative use can result in a variety of security issues.
+
+One of those security issues is code re-use attacks, where an adversary
+can use corrupt function pointers and chain them together to perform jump
+oriented programming (JOP) or call oriented programming (COP), thus
+compromising the control flow integrity (CFI) of the program.
+
+Function pointers live in read-write memory and thus are susceptible to
+corruption, allowing an adversary to reach any program counter (PC) in the
+address space. On RISC-V, the zicfilp extension enforces a restriction on such
+indirect control transfers:
+
+   - indirect control transfers must land on a landing pad instruction
+ `lpad`. There are two exceptions to this rule:
+
+   - rs1 = x1 or rs1 = x5, i.e. a return from a function; returns
+ are protected using the shadow stack (see zicfiss.rst)
+
+   - rs1 = x7. On RISC-V the compiler usually does the below to
+ reach a function which is beyond the offset possible with a
+ J-type instruction.
+
+   "auipc x7, "
+   "jalr (x7)"
+
+ Such forms of indirect control transfer are still immutable
+ and don't rely on memory, and thus rs1=x7 is exempted from
+ tracking and considered a software guarded jump.
+
+The `lpad` instruction is a pseudo of `auipc rd, ` and is a HINT nop. The
+`lpad` instruction must be aligned on a 4 byte boundary and compares a 20 bit
+immediate with x7. If `imm_20bit` == 0, the CPU doesn't perform any comparison
+with x7. If `imm_20bit` != 0, then `imm_20bit` must match x7, else the CPU
+will raise a `software check exception` (cause=18) with `*tval = 2`.
+
+The compiler can generate a hash over function signatures and set it up
+(truncated to 20 bits) in x7 at call sites, and function prologs can have
+`lpad` with the same function hash. This further reduces the number of program
+counters a call site can reach.
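
A sketch of that call-site hashing (the hash value and registers below are
arbitrary illustrations, not compiler output):

```asm
caller:
	lui	x7, 0x2345      # set up 20-bit function-signature hash in x7
	jalr	x1, 0(a0)   # indirect call; must land on a matching lpad

callee:
	lpad	0x2345      # compared against x7; on mismatch the CPU raises
	# ...               # software check exception (cause=18, *tval = 2)
```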
+
+2. ELF and psABI
+----------------
+
+Toolchain sets up `GNU_PROPERTY_RISCV_FEATURE_1_FCFI` for property
+`GNU_PROPERTY_RISCV_FEATURE_1_AND` in notes section of the object file.
+
+3. Linux enabling
+-----------------
+
+User space programs can have multiple shared objects loaded in their address
+space, and it's a difficult task to make sure all the dependencies have been
+compiled with support for indirect branch tracking. Thus it's left to the
+dynamic loader to enable indirect branch tracking for the program.
+
+4. prctl() enabling
+-------------------
+
+`PR_SET_INDIR_BR_LP_STATUS` / `PR_GET_INDIR_BR_LP_STATUS` /
+`PR_LOCK_INDIR_BR_LP_STATUS` are three prctls added to manage indirect branch
+tracking. The prctls are arch agnostic and return -EINVAL on other arches.
+
+`PR_SET_INDIR_BR_LP_STATUS`: If arg1 is `PR_INDIR_BR_LP_ENABLE` and the CPU
+supports `zicfilp`, then the kernel will enable indirect branch tracking for
+the task. The dynamic loader can issue this `prctl` once it has determined
+that all the objects loaded in the address space support indirect branch
+tracking. Additionally, if there is a `dlopen` of an object which wasn't
+compiled with `zicfilp`, the dynamic loader can issue this prctl with arg1
+set to 0 (i.e. `PR_INDIR_BR_LP_ENABLE` being clear).
+
+`PR_GET_INDIR_BR_LP_STATUS`: Returns the current status of indirect branch
+tracking. If enabled, it'll return `PR_INDIR_BR_LP_ENABLE`.
+
+`PR_LOCK_INDIR_BR_LP_STATUS`: Locks the current status of indirect branch
+tracking on the task. User space may want to run with a strict security
+posture, not wanting objects without `zicfilp` support to be loaded, and thus
+would want to disallow disabling of indirect branch tracking. In that case
+user space can use this prctl to lock the current settings.
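What a dynamic loader might do with these prctls can be sketched as follows.
The constants are taken from this series' uapi additions (they are not in
released headers yet), the helper name is invented, and the mask argument for
the lock prctl is an assumption mirroring the shadow stack lock prctl.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>

/* Constants from this series' <linux/prctl.h> additions. */
#ifndef PR_SET_INDIR_BR_LP_STATUS
#define PR_GET_INDIR_BR_LP_STATUS	74
#define PR_SET_INDIR_BR_LP_STATUS	75
#define PR_LOCK_INDIR_BR_LP_STATUS	76
#define PR_INDIR_BR_LP_ENABLE		(1UL << 0)
#endif

/* Returns 0 on success, -errno otherwise. On kernels or arches without
 * zicfilp support the very first prctl fails with -EINVAL. */
static long enable_and_lock_lp(void)
{
	unsigned long status = 0;

	/* Enable tracking once all loaded objects are zicfilp-compatible. */
	if (prctl(PR_SET_INDIR_BR_LP_STATUS, PR_INDIR_BR_LP_ENABLE, 0, 0, 0))
		return -errno;

	if (prctl(PR_GET_INDIR_BR_LP_STATUS, &status, 0, 0, 0))
		return -errno;
	printf("lp tracking: %s\n",
	       (status & PR_INDIR_BR_LP_ENABLE) ? "enabled" : "disabled");

	/* Lock the configuration so it can't be disabled later. */
	if (prctl(PR_LOCK_INDIR_BR_LP_STATUS, PR_INDIR_BR_LP_ENABLE, 0, 0, 0))
		return -errno;
	return 0;
}
```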
+
+5. violations related to indirect branch tracking
+-------------------------------------------------
+
+Pertaining to indirect branch tracking, the CPU raises a software check
+exception.

[PATCH v3 26/29] riscv: create a config for shadow stack and landing pad instr support

2024-04-03 Thread Deepak Gupta
This patch creates a config for shadow stack support and landing pad
instruction support. Both can be enabled by selecting
`CONFIG_RISCV_USER_CFI`. Selecting `CONFIG_RISCV_USER_CFI` wires up the
path to enumerate CPU support, and if CPU support exists, the kernel will
support CPU assisted user mode cfi.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/Kconfig | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 7e0b2bcc388f..d6f1303ef660 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -203,6 +203,24 @@ config ARCH_HAS_BROKEN_DWARF5
	# https://github.com/llvm/llvm-project/commit/7ffabb61a5569444b5ac9322e22e5471cc5e4a77
depends on LD_IS_LLD && LLD_VERSION < 18
 
+config RISCV_USER_CFI
+	bool "riscv userspace control flow integrity"
+	depends on 64BIT && $(cc-option,-mabi=lp64 -march=rv64ima_zicfiss)
+	depends on RISCV_ALTERNATIVE
+	select ARCH_USES_HIGH_VMA_FLAGS
+	default y
+	help
+	  Provides CPU assisted control flow integrity to userspace tasks.
+	  Control flow integrity is provided by implementing shadow stack for
+	  backward edge and indirect branch tracking for forward edge in
+	  programs. Shadow stack protection is a hardware feature that detects
+	  function return address corruption. This helps mitigate ROP attacks.
+	  Indirect branch tracking enforces that all indirect branches must
+	  land on a landing pad instruction, else the CPU will fault. This
+	  mitigates against JOP / COP attacks. Applications must be enabled to
+	  use it, and old userspace does not get protection "for free".
+
 config ARCH_MMAP_RND_BITS_MIN
default 18 if 64BIT
default 8
-- 
2.43.2




[PATCH v3 25/29] riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe

2024-04-03 Thread Deepak Gupta
Adding enumeration of zicfilp and zicfiss extensions in hwprobe syscall.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/uapi/asm/hwprobe.h | 2 ++
 arch/riscv/kernel/sys_hwprobe.c   | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/riscv/include/uapi/asm/hwprobe.h 
b/arch/riscv/include/uapi/asm/hwprobe.h
index 9f2a8e3ff204..4ffc6de1eed7 100644
--- a/arch/riscv/include/uapi/asm/hwprobe.h
+++ b/arch/riscv/include/uapi/asm/hwprobe.h
@@ -59,6 +59,8 @@ struct riscv_hwprobe {
 #define	RISCV_HWPROBE_EXT_ZTSO		(1ULL << 33)
 #define	RISCV_HWPROBE_EXT_ZACAS		(1ULL << 34)
 #define	RISCV_HWPROBE_EXT_ZICOND	(1ULL << 35)
+#define	RISCV_HWPROBE_EXT_ZICFILP	(1ULL << 36)
+#define	RISCV_HWPROBE_EXT_ZICFISS	(1ULL << 37)
 #define RISCV_HWPROBE_KEY_CPUPERF_0	5
 #define	RISCV_HWPROBE_MISALIGNED_UNKNOWN	(0 << 0)
 #define	RISCV_HWPROBE_MISALIGNED_EMULATED	(1 << 0)
diff --git a/arch/riscv/kernel/sys_hwprobe.c b/arch/riscv/kernel/sys_hwprobe.c
index a7c56b41efd2..ddc7a9612a90 100644
--- a/arch/riscv/kernel/sys_hwprobe.c
+++ b/arch/riscv/kernel/sys_hwprobe.c
@@ -111,6 +111,8 @@ static void hwprobe_isa_ext0(struct riscv_hwprobe *pair,
EXT_KEY(ZTSO);
EXT_KEY(ZACAS);
EXT_KEY(ZICOND);
+   EXT_KEY(ZICFILP);
+   EXT_KEY(ZICFISS);
 
if (has_vector()) {
EXT_KEY(ZVBB);
-- 
2.43.2




[PATCH v3 24/29] riscv/ptrace: riscv cfi status and state via ptrace and in core files

2024-04-03 Thread Deepak Gupta
Expose a new register type NT_RISCV_USER_CFI for risc-v cfi status and
state. Intentionally, both landing pad and shadow stack status and state
are rolled into cfi state. Creating two different NT_RISCV_USER_XXX notes
would not be useful and would waste a note type. Enabling or disabling of
the feature is not allowed via the ptrace set interface. However, setting
the `elp` state or the shadow stack pointer is allowed via the ptrace set
interface. It is expected that `gdb` might want to fix up the `elp` state
or the shadow stack pointer.
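A debugger-side consumer of this regset payload could mirror the layout like
the sketch below. The field names follow the uapi header added by this patch;
the `_mirror` type names are invented, and the reserved field is spelled out
as the remaining 59 bits of the first 64-bit word.

```c
#include <stdint.h>

/* Mirror of the NT_RISCV_USER_CFI payload described in this patch. */
struct cfi_status_mirror {
	uint64_t lp_en : 1;	/* indirect branch tracking enabled */
	uint64_t lp_lock : 1;
	uint64_t elp_state : 1;	/* CPU is expecting a landing pad */
	uint64_t shstk_en : 1;	/* shadow stack enabled */
	uint64_t shstk_lock : 1;
	uint64_t rsvd : 59;	/* (64 - 5) remaining bits, must be zero */
};

struct user_cfi_state_mirror {
	struct cfi_status_mirror cfi_status;
	uint64_t shstk_ptr;	/* active shadow stack pointer */
};
```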

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/uapi/asm/ptrace.h | 18 ++
 arch/riscv/kernel/ptrace.c   | 83 
 include/uapi/linux/elf.h |  1 +
 3 files changed, 102 insertions(+)

diff --git a/arch/riscv/include/uapi/asm/ptrace.h 
b/arch/riscv/include/uapi/asm/ptrace.h
index a38268b19c3d..512be06a8661 100644
--- a/arch/riscv/include/uapi/asm/ptrace.h
+++ b/arch/riscv/include/uapi/asm/ptrace.h
@@ -127,6 +127,24 @@ struct __riscv_v_regset_state {
  */
 #define RISCV_MAX_VLENB (8192)
 
+struct __cfi_status {
+   /* indirect branch tracking state */
+   __u64 lp_en : 1;
+   __u64 lp_lock : 1;
+   __u64 elp_state : 1;
+
+   /* shadow stack status */
+   __u64 shstk_en : 1;
+   __u64 shstk_lock : 1;
+
+	__u64 rsvd : ((sizeof(__u64) * 8) - 5);
+};
+
+struct user_cfi_state {
+   struct __cfi_status cfi_status;
+   __u64 shstk_ptr;
+};
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _UAPI_ASM_RISCV_PTRACE_H */
diff --git a/arch/riscv/kernel/ptrace.c b/arch/riscv/kernel/ptrace.c
index e8515aa9d80b..33d4b32cc6a7 100644
--- a/arch/riscv/kernel/ptrace.c
+++ b/arch/riscv/kernel/ptrace.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 enum riscv_regset {
REGSET_X,
@@ -28,6 +29,9 @@ enum riscv_regset {
 #ifdef CONFIG_RISCV_ISA_V
REGSET_V,
 #endif
+#ifdef CONFIG_RISCV_USER_CFI
+   REGSET_CFI,
+#endif
 };
 
 static int riscv_gpr_get(struct task_struct *target,
@@ -152,6 +156,75 @@ static int riscv_vr_set(struct task_struct *target,
 }
 #endif
 
+#ifdef CONFIG_RISCV_USER_CFI
+static int riscv_cfi_get(struct task_struct *target,
+   const struct user_regset *regset,
+   struct membuf to)
+{
+	struct user_cfi_state user_cfi;
+	struct pt_regs *regs;
+
+	regs = task_pt_regs(target);
+
+	/* don't leak kernel stack through reserved bits */
+	memset(&user_cfi, 0, sizeof(user_cfi));
+	user_cfi.cfi_status.lp_en = is_indir_lp_enabled(target);
+	user_cfi.cfi_status.lp_lock = is_indir_lp_locked(target);
+	user_cfi.cfi_status.elp_state = !!(regs->status & SR_ELP);
+
+	user_cfi.cfi_status.shstk_en = is_shstk_enabled(target);
+	user_cfi.cfi_status.shstk_lock = is_shstk_locked(target);
+	user_cfi.shstk_ptr = get_active_shstk(target);
+
+	return membuf_write(&to, &user_cfi, sizeof(user_cfi));
+}
+
+/*
+ * Does it make sense to allow enabling / disabling of cfi via ptrace?
+ * Not allowing enable / disable / locking control via ptrace for now.
+ * Setting the shadow stack pointer is allowed. GDB might use it to unwind or
+ * for some other fixup. Similarly, gdb might want to suppress elp and may
+ * want to reset elp state.
+ */
+static int riscv_cfi_set(struct task_struct *target,
+   const struct user_regset *regset,
+   unsigned int pos, unsigned int count,
+   const void *kbuf, const void __user *ubuf)
+{
+   int ret;
+   struct user_cfi_state user_cfi;
+   struct pt_regs *regs;
+
+   regs = task_pt_regs(target);
+
+	ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_cfi, 0, -1);
+   if (ret)
+   return ret;
+
+	/*
+	 * Not allowing enabling or locking of shadow stack or landing pad.
+	 * There is no disabling of shadow stack or landing pad via ptrace.
+	 * The rsvd field must be zero so that those bits can be used in the
+	 * future.
+	 */
+	if (user_cfi.cfi_status.lp_en || user_cfi.cfi_status.lp_lock ||
+	    user_cfi.cfi_status.shstk_en || user_cfi.cfi_status.shstk_lock ||
+	    user_cfi.cfi_status.rsvd)
+		return -EINVAL;
+
+	/* If lpad is enabled on target and ptrace requests to set / clear elp, do that */
+   if (is_indir_lp_enabled(target)) {
+   if (user_cfi.cfi_status.elp_state) /* set elp state */
+   regs->status |= SR_ELP;
+   else
+   regs->status &= ~SR_ELP; /* clear elp state */
+   }
+
+   /* If shadow stack enabled on target, set new shadow stack pointer */
+   if (is_shstk_enabled(target))
+   set_active_shstk(target, user_cfi.shstk_ptr);
+
+   return 0;
+}
+#endif
+
 static const struct user_regset riscv_user_regset[] = {
[REGSET_X] = {
.core_note_type = NT_PRSTATUS,
@@ -182,6 +255,16 @@ static const struct user_regset riscv_user_regset[] = {
.set = riscv_vr_set,
},
 #endif
+#ifdef CONFIG_RISCV_USER_CFI
+   

[PATCH v3 23/29] riscv signal: Save and restore of shadow stack for signal

2024-04-03 Thread Deepak Gupta
Save shadow stack pointer in sigcontext structure while delivering signal.
Restore shadow stack pointer from sigcontext on sigreturn.

As part of the save operation, the kernel uses `ssamoswap` to save a
snapshot of the current shadow stack on the shadow stack itself (this can
be called a save token). During restore on sigreturn, the kernel retrieves
the token from the top of the shadow stack and validates it. This ensures
that user mode can't arbitrarily pivot to any shadow stack address without
having a token, and thus provides a strong security assurance between the
signal delivery and sigreturn windows.
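The token discipline described above can be sketched with a toy model. This
is illustrative only, not the kernel implementation: the array-based shadow
stack, the token encoding, and the function names are all invented.

```c
#include <stddef.h>
#include <stdint.h>

#define SS_SLOTS 64
static uint64_t shadow_stack[SS_SLOTS];	/* toy shadow stack, index = slot */

/* Save: push a token recording the ssp that was current at save time. */
static size_t token_save(size_t ssp)
{
	ssp -= 1;				/* make room for the token */
	shadow_stack[ssp] = (uint64_t)(ssp + 1);
	return ssp;
}

/* Restore: only pivot back if a matching token sits at the target. */
static int token_restore(size_t ssp, size_t *restored_ssp)
{
	if (shadow_stack[ssp] != (uint64_t)(ssp + 1))
		return -1;			/* no valid token: reject */
	shadow_stack[ssp] = 0;			/* consume the token */
	*restored_ssp = ssp + 1;
	return 0;
}
```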

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/usercfi.h | 19 +++
 arch/riscv/kernel/signal.c   | 45 +
 arch/riscv/kernel/usercfi.c  | 57 
 3 files changed, 121 insertions(+)

diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index 8accdc8ec164..507a27d5f53c 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -8,6 +8,7 @@
 #ifndef __ASSEMBLY__
 #include 
 #include 
+#include 
 
 struct task_struct;
 struct kernel_clone_args;
@@ -35,6 +36,9 @@ void set_shstk_status(struct task_struct *task, bool enable);
 bool is_indir_lp_enabled(struct task_struct *task);
 bool is_indir_lp_locked(struct task_struct *task);
 void set_indir_lp_status(struct task_struct *task, bool enable);
+unsigned long get_active_shstk(struct task_struct *task);
+int restore_user_shstk(struct task_struct *tsk, unsigned long shstk_ptr);
+int save_user_shstk(struct task_struct *tsk, unsigned long *saved_shstk_ptr);
 
 #define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
 
@@ -77,6 +81,16 @@ static inline void set_shstk_status(struct task_struct *task, bool enable)
 
 }
 
+static inline int restore_user_shstk(struct task_struct *tsk, unsigned long shstk_ptr)
+{
+	return -EINVAL;
+}
+
+static inline int save_user_shstk(struct task_struct *tsk, unsigned long *saved_shstk_ptr)
+{
+	return -EINVAL;
+}
+
 static inline bool is_indir_lp_enabled(struct task_struct *task)
 {
return false;
@@ -92,6 +106,11 @@ static inline void set_indir_lp_status(struct task_struct *task, bool enable)
 
 }
 
+static inline unsigned long get_active_shstk(struct task_struct *task)
+{
+   return 0;
+}
+
 #endif /* CONFIG_RISCV_USER_CFI */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/signal.c b/arch/riscv/kernel/signal.c
index 501e66debf69..428a886ab6ef 100644
--- a/arch/riscv/kernel/signal.c
+++ b/arch/riscv/kernel/signal.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 unsigned long signal_minsigstksz __ro_after_init;
 
@@ -232,6 +233,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
struct pt_regs *regs = current_pt_regs();
struct rt_sigframe __user *frame;
struct task_struct *task;
+   unsigned long ss_ptr = 0;
sigset_t set;
size_t frame_size = get_rt_frame_size(false);
 
@@ -254,6 +256,26 @@ SYSCALL_DEFINE0(rt_sigreturn)
	if (restore_altstack(&frame->uc.uc_stack))
goto badframe;
 
+	/*
+	 * Restore shadow stack as a form of token stored on the shadow stack
+	 * itself as a safe way to restore.
+	 * A token on the shadow stack gives the following properties:
+	 *  - Safe save and restore for shadow stack switching. Any save of
+	 *    the shadow stack must have saved a token on the shadow stack.
+	 *    Similarly, any restore of the shadow stack must check the token
+	 *    before restore. Since writing to the shadow stack with the
+	 *    address of the shadow stack itself is not easily allowed, a
+	 *    restore without a save is quite difficult for an attacker to
+	 *    perform.
+	 *  - A natural break. A token in the shadow stack provides a natural
+	 *    break in the shadow stack, so a single linear range can be
+	 *    bucketed into different shadow stack segments. sspopchk will
+	 *    detect the condition and fault to the kernel as a sw check
+	 *    exception.
+	 */
+	if (__copy_from_user(&ss_ptr, &frame->uc.uc_mcontext.sc_cfi_state.ss_ptr,
+			     sizeof(unsigned long)))
+		goto badframe;
+
+   if (is_shstk_enabled(current) && restore_user_shstk(current, ss_ptr))
+   goto badframe;
+
regs->cause = -1UL;
 
return regs->a0;
@@ -323,6 +345,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t 
*set,
struct rt_sigframe __user *frame;
long err = 0;
unsigned long __maybe_unused addr;
+   unsigned long ss_ptr = 0;
size_t frame_size = get_rt_frame_size(false);
 
frame = get_sigframe(ksig, regs, frame_size);
@@ -334,6 +357,23 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t 
*set,
/* Create the ucontext. */
	err |= __put_user(0, &frame->uc.uc_flags);
	err |= __put_user(NULL, &frame->uc.uc_link);
+   /*
+* Save a pointer to shadow stack itself 

[PATCH v3 22/29] riscv sigcontext: adding cfi state field in sigcontext

2024-04-03 Thread Deepak Gupta
Shadow stack needs to be saved and restored on signal delivery and signal
return.

sigcontext embedded in ucontext is extendible. Adding cfi state there,
which can be used to save cfi state before signal delivery and restore
cfi state on sigreturn.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/uapi/asm/sigcontext.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/riscv/include/uapi/asm/sigcontext.h 
b/arch/riscv/include/uapi/asm/sigcontext.h
index cd4f175dc837..5ccdd94a0855 100644
--- a/arch/riscv/include/uapi/asm/sigcontext.h
+++ b/arch/riscv/include/uapi/asm/sigcontext.h
@@ -21,6 +21,10 @@ struct __sc_riscv_v_state {
struct __riscv_v_ext_state v_state;
 } __attribute__((aligned(16)));
 
+struct __sc_riscv_cfi_state {
+	unsigned long ss_ptr;	/* shadow stack pointer */
+	unsigned long rsvd;	/* keeping another word reserved in case we need it */
+};
+
 /*
  * Signal context structure
  *
@@ -29,6 +33,7 @@ struct __sc_riscv_v_state {
  */
 struct sigcontext {
struct user_regs_struct sc_regs;
+   struct __sc_riscv_cfi_state sc_cfi_state;
union {
union __riscv_fp_state sc_fpregs;
struct __riscv_extra_ext_header sc_extdesc;
-- 
2.43.2




[PATCH v3 21/29] riscv/traps: Introduce software check exception

2024-04-03 Thread Deepak Gupta
zicfiss / zicfilp introduce a new exception to the priv isa, `software
check exception`, with cause code = 18. This patch implements the software
check exception.

Additionally, it implements a cfi violation handler which checks the code
in xtval. If xtval=2, it means that the sw check exception happened because
an indirect branch didn't land on a 4 byte aligned PC or on an `lpad`
instruction, or the label value embedded in `lpad` didn't match the label
value set up in `x7`. If xtval=3, it means that the sw check exception
happened because of a mismatch between the link register (x1 or x5) and the
top of the shadow stack (on execution of `sspopchk`).

In case of a cfi violation, SIGSEGV is raised with code=SEGV_CPERR.
SEGV_CPERR was introduced by the x86 shadow stack patches.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/asm-prototypes.h |  1 +
 arch/riscv/kernel/entry.S   |  3 ++
 arch/riscv/kernel/traps.c   | 38 +
 3 files changed, 42 insertions(+)

diff --git a/arch/riscv/include/asm/asm-prototypes.h 
b/arch/riscv/include/asm/asm-prototypes.h
index cd627ec289f1..5a27cefd7805 100644
--- a/arch/riscv/include/asm/asm-prototypes.h
+++ b/arch/riscv/include/asm/asm-prototypes.h
@@ -51,6 +51,7 @@ DECLARE_DO_ERROR_INFO(do_trap_ecall_u);
 DECLARE_DO_ERROR_INFO(do_trap_ecall_s);
 DECLARE_DO_ERROR_INFO(do_trap_ecall_m);
 DECLARE_DO_ERROR_INFO(do_trap_break);
+DECLARE_DO_ERROR_INFO(do_trap_software_check);
 
 asmlinkage void handle_bad_stack(struct pt_regs *regs);
 asmlinkage void do_page_fault(struct pt_regs *regs);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 7245a0ea25c1..f97af4ff5237 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -374,6 +374,9 @@ SYM_DATA_START_LOCAL(excp_vect_table)
RISCV_PTR do_page_fault   /* load page fault */
RISCV_PTR do_trap_unknown
RISCV_PTR do_page_fault   /* store page fault */
+   RISCV_PTR do_trap_unknown /* cause=16 */
+   RISCV_PTR do_trap_unknown /* cause=17 */
+   RISCV_PTR do_trap_software_check /* cause=18 is sw check exception */
 SYM_DATA_END_LABEL(excp_vect_table, SYM_L_LOCAL, excp_vect_table_end)
 
 #ifndef CONFIG_MMU
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index a1b9be3c4332..9fba263428a1 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -339,6 +339,44 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
 
 }
 
+#define CFI_TVAL_FCFI_CODE 2
+#define CFI_TVAL_BCFI_CODE 3
+/* handle cfi violations */
+bool handle_user_cfi_violation(struct pt_regs *regs)
+{
+   bool ret = false;
+   unsigned long tval = csr_read(CSR_TVAL);
+
+	if (((tval == CFI_TVAL_FCFI_CODE) && cpu_supports_indirect_br_lp_instr()) ||
+	    ((tval == CFI_TVAL_BCFI_CODE) && cpu_supports_shadow_stack())) {
+		do_trap_error(regs, SIGSEGV, SEGV_CPERR, regs->epc,
+			      "Oops - control flow violation");
+		ret = true;
+	}
+
+	return ret;
+}
+
+/*
+ * The software check exception is defined in the risc-v cfi spec. A software
+ * check exception is raised when:
+ * a) An indirect branch doesn't land on a 4 byte aligned PC or an `lpad`
+ *    instruction, or the `label` value programmed in the `lpad` instr doesn't
+ *    match the value set up in `x7`. The reported code in `xtval` is 2.
+ * b) The `sspopchk` instruction finds a mismatch between the top of the
+ *    shadow stack (ssp) and x1/x5. The reported code in `xtval` is 3.
+ */
+asmlinkage __visible __trap_section void do_trap_software_check(struct pt_regs *regs)
+{
+	if (user_mode(regs)) {
+		/* not a cfi violation, so merge into the flow of the unknown trap handler */
+		if (!handle_user_cfi_violation(regs))
+			do_trap_unknown(regs);
+	} else {
+		/* a sw check exception coming from the kernel is a bug in the kernel */
+		die(regs, "Kernel BUG");
+	}
+}
+
 #ifdef CONFIG_MMU
 asmlinkage __visible noinstr void do_page_fault(struct pt_regs *regs)
 {
-- 
2.43.2




[PATCH v3 20/29] riscv/kernel: update __show_regs to print shadow stack register

2024-04-03 Thread Deepak Gupta
Update __show_regs to print the captured shadow stack pointer as well.
On tasks where shadow stack is disabled, it'll simply print 0.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/kernel/process.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index ebed7589c51a..079fd6cd6446 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -89,8 +89,8 @@ void __show_regs(struct pt_regs *regs)
regs->s8, regs->s9, regs->s10);
pr_cont(" s11: " REG_FMT " t3 : " REG_FMT " t4 : " REG_FMT "\n",
regs->s11, regs->t3, regs->t4);
-   pr_cont(" t5 : " REG_FMT " t6 : " REG_FMT "\n",
-   regs->t5, regs->t6);
+   pr_cont(" t5 : " REG_FMT " t6 : " REG_FMT " ssp : " REG_FMT "\n",
+   regs->t5, regs->t6, get_active_shstk(current));
 
pr_cont("status: " REG_FMT " badaddr: " REG_FMT " cause: " REG_FMT "\n",
regs->status, regs->badaddr, regs->cause);
-- 
2.43.2




[PATCH v3 19/29] riscv: Implements arch agnostic indirect branch tracking prctls

2024-04-03 Thread Deepak Gupta
prctls implemented are:
PR_SET_INDIR_BR_LP_STATUS, PR_GET_INDIR_BR_LP_STATUS and
PR_LOCK_INDIR_BR_LP_STATUS.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/usercfi.h | 22 -
 arch/riscv/kernel/process.c  |  5 +++
 arch/riscv/kernel/usercfi.c  | 76 
 3 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index a168ae0fa5d8..8accdc8ec164 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -16,7 +16,9 @@ struct kernel_clone_args;
 struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
unsigned long ubcfi_locked : 1;
-   unsigned long rsvd : ((sizeof(unsigned long)*8) - 2);
+	unsigned long ufcfi_en : 1; /* Enable for forward cfi. Note that ELP goes in sstatus */
+	unsigned long ufcfi_locked : 1;
+	unsigned long rsvd : ((sizeof(unsigned long)*8) - 4);
unsigned long user_shdw_stk; /* Current user shadow stack pointer */
unsigned long shdw_stk_base; /* Base address of shadow stack */
unsigned long shdw_stk_size; /* size of shadow stack */
@@ -30,6 +32,9 @@ void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
 bool is_shstk_enabled(struct task_struct *task);
 bool is_shstk_locked(struct task_struct *task);
 void set_shstk_status(struct task_struct *task, bool enable);
+bool is_indir_lp_enabled(struct task_struct *task);
+bool is_indir_lp_locked(struct task_struct *task);
+void set_indir_lp_status(struct task_struct *task, bool enable);
 
 #define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
 
@@ -72,6 +77,21 @@ static inline void set_shstk_status(struct task_struct *task, bool enable)
 
 }
 
+static inline bool is_indir_lp_enabled(struct task_struct *task)
+{
+   return false;
+}
+
+static inline bool is_indir_lp_locked(struct task_struct *task)
+{
+   return false;
+}
+
+static inline void set_indir_lp_status(struct task_struct *task, bool enable)
+{
+
+}
+
 #endif /* CONFIG_RISCV_USER_CFI */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 3fb8b23f629b..ebed7589c51a 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -152,6 +152,11 @@ void start_thread(struct pt_regs *regs, unsigned long pc,
set_shstk_status(current, false);
set_shstk_base(current, 0, 0);
set_active_shstk(current, 0);
+   /*
+* disable indirect branch tracking on exec.
+* libc will enable it later via prctl.
+*/
+   set_indir_lp_status(current, false);
 
 #ifdef CONFIG_64BIT
regs->status &= ~SR_UXL;
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index cdedf1f78b3e..13920b9d86f3 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -69,6 +69,32 @@ void set_shstk_lock(struct task_struct *task)
task->thread_info.user_cfi_state.ubcfi_locked = 1;
 }
 
+bool is_indir_lp_enabled(struct task_struct *task)
+{
+   return task->thread_info.user_cfi_state.ufcfi_en ? true : false;
+}
+
+bool is_indir_lp_locked(struct task_struct *task)
+{
+   return task->thread_info.user_cfi_state.ufcfi_locked ? true : false;
+}
+
+void set_indir_lp_status(struct task_struct *task, bool enable)
+{
+   task->thread_info.user_cfi_state.ufcfi_en = enable ? 1 : 0;
+
+   if (enable)
+   task->thread_info.envcfg |= ENVCFG_LPE;
+   else
+   task->thread_info.envcfg &= ~ENVCFG_LPE;
+
+   csr_write(CSR_ENVCFG, task->thread_info.envcfg);
+}
+
+void set_indir_lp_lock(struct task_struct *task)
+{
+   task->thread_info.user_cfi_state.ufcfi_locked = 1;
+}
+
 /*
  * If size is 0, then to be compatible with regular stack we want it to be as
  * big as regular stack. Else PAGE_ALIGN it and return back
@@ -375,3 +401,53 @@ int arch_lock_shadow_stack_status(struct task_struct *task,
 
return 0;
 }
+
+int arch_get_indir_br_lp_status(struct task_struct *t, unsigned long __user *status)
+{
+   unsigned long fcfi_status = 0;
+
+   if (!cpu_supports_indirect_br_lp_instr())
+   return -EINVAL;
+
+   /* indirect branch tracking is enabled on the task or not */
+   fcfi_status |= (is_indir_lp_enabled(t) ? PR_INDIR_BR_LP_ENABLE : 0);
+
+	return copy_to_user(status, &fcfi_status, sizeof(fcfi_status)) ? -EFAULT : 0;
+}
+
+int arch_set_indir_br_lp_status(struct task_struct *t, unsigned long status)
+{
+   bool enable_indir_lp = false;
+
+   if (!cpu_supports_indirect_br_lp_instr())
+   return -EINVAL;
+
+	/* indirect branch tracking is locked and can't be modified further by user */
+   if (is_indir_lp_locked(t))
+   return -EINVAL;
+
+   /* Reject unknown flags */
+   if (status & ~PR_INDIR_BR_LP_ENABLE)
+   return -EINVAL;
+
+   enable_indir_lp = 

[PATCH v3 18/29] riscv: Implements arch agnostic shadow stack prctls

2024-04-03 Thread Deepak Gupta
Implement architecture agnostic prctls() interface for setting and getting
shadow stack status.

prctls implemented are PR_GET_SHADOW_STACK_STATUS,
PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS.

As part of PR_SET_SHADOW_STACK_STATUS/PR_GET_SHADOW_STACK_STATUS, only
PR_SHADOW_STACK_ENABLE is implemented because RISCV allows each mode to
write to their own shadow stack using `sspush` or `ssamoswap`.

PR_LOCK_SHADOW_STACK_STATUS locks current configuration of shadow stack
enabling.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/usercfi.h |  18 +-
 arch/riscv/kernel/process.c  |   8 +++
 arch/riscv/kernel/usercfi.c  | 107 +++
 3 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index b47574a7a8c9..a168ae0fa5d8 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -7,6 +7,7 @@
 
 #ifndef __ASSEMBLY__
 #include 
+#include 
 
 struct task_struct;
 struct kernel_clone_args;
@@ -14,7 +15,8 @@ struct kernel_clone_args;
 #ifdef CONFIG_RISCV_USER_CFI
 struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
-   unsigned long rsvd : ((sizeof(unsigned long)*8) - 1);
+   unsigned long ubcfi_locked : 1;
+   unsigned long rsvd : ((sizeof(unsigned long)*8) - 2);
unsigned long user_shdw_stk; /* Current user shadow stack pointer */
unsigned long shdw_stk_base; /* Base address of shadow stack */
unsigned long shdw_stk_size; /* size of shadow stack */
@@ -26,6 +28,10 @@ void shstk_release(struct task_struct *tsk);
 void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size);
 void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
 bool is_shstk_enabled(struct task_struct *task);
+bool is_shstk_locked(struct task_struct *task);
+void set_shstk_status(struct task_struct *task, bool enable);
+
+#define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
 
 #else
 
@@ -56,6 +62,16 @@ static inline bool is_shstk_enabled(struct task_struct *task)
return false;
 }
 
+static inline bool is_shstk_locked(struct task_struct *task)
+{
+   return false;
+}
+
+static inline void set_shstk_status(struct task_struct *task, bool enable)
+{
+
+}
+
 #endif /* CONFIG_RISCV_USER_CFI */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index ef48a25b0eff..3fb8b23f629b 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -145,6 +145,14 @@ void start_thread(struct pt_regs *regs, unsigned long pc,
regs->epc = pc;
regs->sp = sp;
 
+   /*
+* clear shadow stack state on exec.
+* libc will set it later via prctl.
+*/
+   set_shstk_status(current, false);
+   set_shstk_base(current, 0, 0);
+   set_active_shstk(current, 0);
+
 #ifdef CONFIG_64BIT
regs->status &= ~SR_UXL;
 
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index 11ef7ab925c9..cdedf1f78b3e 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -24,6 +24,16 @@ bool is_shstk_enabled(struct task_struct *task)
return task->thread_info.user_cfi_state.ubcfi_en ? true : false;
 }
 
+bool is_shstk_allocated(struct task_struct *task)
+{
+   return task->thread_info.user_cfi_state.shdw_stk_base ? true : false;
+}
+
+bool is_shstk_locked(struct task_struct *task)
+{
+   return task->thread_info.user_cfi_state.ubcfi_locked ? true : false;
+}
+
 void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size)
 {
task->thread_info.user_cfi_state.shdw_stk_base = shstk_addr;
@@ -42,6 +52,23 @@ void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
task->thread_info.user_cfi_state.user_shdw_stk = shstk_addr;
 }
 
+void set_shstk_status(struct task_struct *task, bool enable)
+{
+   task->thread_info.user_cfi_state.ubcfi_en = enable ? 1 : 0;
+
+   if (enable)
+   task->thread_info.envcfg |= ENVCFG_SSE;
+   else
+   task->thread_info.envcfg &= ~ENVCFG_SSE;
+
+   csr_write(CSR_ENVCFG, task->thread_info.envcfg);
+}
+
+void set_shstk_lock(struct task_struct *task)
+{
+   task->thread_info.user_cfi_state.ubcfi_locked = 1;
+}
+
 /*
  * If size is 0, then to be compatible with regular stack we want it to be as
  * big as regular stack. Else PAGE_ALIGN it and return back
@@ -268,3 +295,83 @@ void shstk_release(struct task_struct *tsk)
vm_munmap(base, size);
set_shstk_base(tsk, 0, 0);
 }
+
+int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status)
+{
+   unsigned long bcfi_status = 0;
+
+   if (!cpu_supports_shadow_stack())
+   return -EINVAL;
+
+   /* this means shadow stack is enabled on the task */
+   bcfi_status |= 

[PATCH v3 17/29] prctl: arch-agnostic prctl for indirect branch tracking

2024-04-03 Thread Deepak Gupta
Three architectures (x86, aarch64, riscv) support the indirect branch
tracking feature in a very similar fashion. At a very high level, indirect
branch tracking is a CPU feature where the CPU tracks branches which use a
memory operand to perform control transfer in a program. As part of this
tracking, on an indirect branch the CPU goes into a state where it expects
a landing pad instruction at the target, and if not found, the CPU raises a
fault (architecture dependent).
x86 landing pad instr - `ENDBRANCH`
aarch64 landing pad instr - `BTI`
riscv landing instr - `lpad`

Given that three major arches have support for indirect branch tracking,
this patch makes the `prctl` for indirect branch tracking arch agnostic.

To allow userspace to enable this feature for itself, following prtcls are
defined:
 - PR_GET_INDIR_BR_LP_STATUS: Gets current configured status for indirect
   branch tracking.
 - PR_SET_INDIR_BR_LP_STATUS: Sets a configuration for indirect branch
   tracking.
   Following status options are allowed
   - PR_INDIR_BR_LP_ENABLE: Enables indirect branch tracking on user
 thread.
   - PR_INDIR_BR_LP_DISABLE: Disables indirect branch tracking on user
 thread.
 - PR_LOCK_INDIR_BR_LP_STATUS: Locks configured status for indirect branch
   tracking for user thread.

Signed-off-by: Deepak Gupta 
---
 include/uapi/linux/prctl.h | 27 +++
 kernel/sys.c   | 30 ++
 2 files changed, 57 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 3c66ed8f46d8..b7a8212a068e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -328,4 +328,31 @@ struct prctl_mm_map {
  */
 #define PR_LOCK_SHADOW_STACK_STATUS  73
 
+/*
+ * Get the current indirect branch tracking configuration for the current
+ * thread, this will be the value configured via PR_SET_INDIR_BR_LP_STATUS.
+ */
+#define PR_GET_INDIR_BR_LP_STATUS  74
+
+/*
+ * Set the indirect branch tracking configuration. PR_INDIR_BR_LP_ENABLE will
+ * enable cpu feature for user thread, to track all indirect branches and ensure
+ * they land on arch defined landing pad instruction.
+ * x86 - If enabled, an indirect branch must land on `ENDBRANCH` instruction.
+ * aarch64 - If enabled, an indirect branch must land on `BTI` instruction.
+ * riscv - If enabled, an indirect branch must land on `lpad` instruction.
+ * PR_INDIR_BR_LP_DISABLE will disable feature for user thread and indirect
+ * branches will no more be tracked by cpu to land on arch defined landing pad
+ * instruction.
+ */
+#define PR_SET_INDIR_BR_LP_STATUS  75
+# define PR_INDIR_BR_LP_ENABLE(1UL << 0)
+
+/*
+ * Prevent further changes to the specified indirect branch tracking
+ * configuration.  All bits may be locked via this call, including
+ * undefined bits.
+ */
+#define PR_LOCK_INDIR_BR_LP_STATUS  76
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 242e9f147791..c770060c3f06 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2330,6 +2330,21 @@ int __weak arch_lock_shadow_stack_status(struct task_struct *t, unsigned long st
return -EINVAL;
 }
 
+int __weak arch_get_indir_br_lp_status(struct task_struct *t, unsigned long __user *status)
+{
+   return -EINVAL;
+}
+
+int __weak arch_set_indir_br_lp_status(struct task_struct *t, unsigned long __user *status)
+{
+   return -EINVAL;
+}
+
+int __weak arch_lock_indir_br_lp_status(struct task_struct *t, unsigned long __user *status)
+{
+   return -EINVAL;
+}
+
 #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
 
 #ifdef CONFIG_ANON_VMA_NAME
@@ -2787,6 +2802,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_lock_shadow_stack_status(me, arg2);
break;
+   case PR_GET_INDIR_BR_LP_STATUS:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = arch_get_indir_br_lp_status(me, (unsigned long __user *) arg2);
+   break;
+   case PR_SET_INDIR_BR_LP_STATUS:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = arch_set_indir_br_lp_status(me, (unsigned long __user *) arg2);
+   break;
+   case PR_LOCK_INDIR_BR_LP_STATUS:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = arch_lock_indir_br_lp_status(me, (unsigned long __user *) arg2);
+   break;
default:
error = -EINVAL;
break;
-- 
2.43.2




[PATCH v3 16/29] prctl: arch-agnostic prctl for shadow stack

2024-04-03 Thread Deepak Gupta
From: Mark Brown 

Three architectures (x86, aarch64, riscv) have announced support for
shadow stacks with fairly similar functionality.  While x86 uses
arch_prctl() to control the functionality, neither arm64 nor riscv uses
that interface, so this patch adds arch-agnostic prctl() support to
get and set the status of shadow stacks and to lock the current
configuration to prevent further changes, with support for turning
individual subfeatures on and off so applications can limit their
exposure to features that they do not need.  The features are:

  - PR_SHADOW_STACK_ENABLE: Tracking and enforcement of shadow stacks,
including allocation of a shadow stack if one is not already
allocated.
  - PR_SHADOW_STACK_WRITE: Writes to specific addresses in the shadow
stack.
  - PR_SHADOW_STACK_PUSH: Push additional values onto the shadow stack.
  - PR_SHADOW_STACK_DISABLE: Allows disabling the shadow stack. Note that
    once locked, disable must fail.

These features are expected to be inherited by new threads and cleared
on exec(), unknown features should be rejected for enable but accepted
for locking (in order to allow for future proofing).

This is based on a patch originally written by Deepak Gupta but later
modified by Mark Brown for arm's GCS patch series.

Signed-off-by: Mark Brown 
Co-developed-by: Deepak Gupta 
---
 include/linux/mm.h |  3 +++
 include/uapi/linux/prctl.h | 22 ++
 kernel/sys.c   | 30 ++
 3 files changed, 55 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9952937be659..1d08e1fd2f6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4201,5 +4201,8 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
 
return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
 }
+int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status);
+int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
+int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
 
 #endif /* _LINUX_MM_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..3c66ed8f46d8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -306,4 +306,26 @@ struct prctl_mm_map {
 # define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK  0xc
 # define PR_RISCV_V_VSTATE_CTRL_MASK   0x1f
 
+/*
+ * Get the current shadow stack configuration for the current thread,
+ * this will be the value configured via PR_SET_SHADOW_STACK_STATUS.
+ */
+#define PR_GET_SHADOW_STACK_STATUS  71
+
+/*
+ * Set the current shadow stack configuration.  Enabling the shadow
+ * stack will cause a shadow stack to be allocated for the thread.
+ */
+#define PR_SET_SHADOW_STACK_STATUS  72
+# define PR_SHADOW_STACK_ENABLE (1UL << 0)
+# define PR_SHADOW_STACK_WRITE (1UL << 1)
+# define PR_SHADOW_STACK_PUSH  (1UL << 2)
+
+/*
+ * Prevent further changes to the specified shadow stack
+ * configuration.  All bits may be locked via this call, including
+ * undefined bits.
+ */
+#define PR_LOCK_SHADOW_STACK_STATUS  73
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index f8e543f1e38a..242e9f147791 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2315,6 +2315,21 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
 }
 
+int __weak arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status)
+{
+   return -EINVAL;
+}
+
+int __weak arch_set_shadow_stack_status(struct task_struct *t, unsigned long status)
+{
+   return -EINVAL;
+}
+
+int __weak arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status)
+{
+   return -EINVAL;
+}
+
 #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
 
 #ifdef CONFIG_ANON_VMA_NAME
@@ -2757,6 +2772,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_RISCV_V_GET_CONTROL:
error = RISCV_V_GET_CONTROL();
break;
+   case PR_GET_SHADOW_STACK_STATUS:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = arch_get_shadow_stack_status(me, (unsigned long __user *) arg2);
+   break;
+   case PR_SET_SHADOW_STACK_STATUS:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = arch_set_shadow_stack_status(me, arg2);
+   break;
+   case PR_LOCK_SHADOW_STACK_STATUS:
+   if (arg3 || arg4 || arg5)
+   return -EINVAL;
+   error = arch_lock_shadow_stack_status(me, arg2);
+   break;
default:
error = -EINVAL;
break;
-- 
2.43.2




[PATCH v3 15/29] riscv/shstk: If needed allocate a new shadow stack on clone

2024-04-03 Thread Deepak Gupta
Userspace specifies CLONE_VM to share the address space and spawn a new
thread. `clone` allows userspace to specify a new stack for the new
thread. However, there is no way to specify a new shadow stack base
address without changing the API. This patch allocates a new shadow stack
whenever CLONE_VM is given.

In the case of vfork, the parent is suspended until the child finishes and
thus the child can use the parent's shadow stack. In the case of !CLONE_VM,
COW kicks in because the entire address space is copied from parent to
child.

`clone3` is extensible and can provide mechanisms by which a shadow stack
can be supplied as an input parameter. This is not settled yet and is
being extensively discussed on the mailing list. Once that's settled, this
commit will adapt to it.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/usercfi.h |  39 ++
 arch/riscv/kernel/process.c  |  12 ++-
 arch/riscv/kernel/usercfi.c  | 121 +++
 3 files changed, 171 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index 4fa201b4fc4e..b47574a7a8c9 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -8,6 +8,9 @@
 #ifndef __ASSEMBLY__
 #include 
 
+struct task_struct;
+struct kernel_clone_args;
+
 #ifdef CONFIG_RISCV_USER_CFI
 struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
@@ -17,6 +20,42 @@ struct cfi_status {
unsigned long shdw_stk_size; /* size of shadow stack */
 };
 
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+   const struct kernel_clone_args *args);
+void shstk_release(struct task_struct *tsk);
+void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size);
+void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
+bool is_shstk_enabled(struct task_struct *task);
+
+#else
+
+static inline unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+  const struct kernel_clone_args *args)
+{
+   return 0;
+}
+
+static inline void shstk_release(struct task_struct *tsk)
+{
+
+}
+
+static inline void set_shstk_base(struct task_struct *task, unsigned long shstk_addr,
+   unsigned long size)
+{
+
+}
+
+static inline void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
+{
+
+}
+
+static inline bool is_shstk_enabled(struct task_struct *task)
+{
+   return false;
+}
+
 #endif /* CONFIG_RISCV_USER_CFI */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index ce577cdc2af3..ef48a25b0eff 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 register unsigned long gp_in_global __asm__("gp");
 
@@ -202,7 +203,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 
 void exit_thread(struct task_struct *tsk)
 {
-
+   if (IS_ENABLED(CONFIG_RISCV_USER_CFI))
+   shstk_release(tsk);
 }
 
 int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
@@ -210,6 +212,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
unsigned long tls = args->tls;
+   unsigned long ssp = 0;
struct pt_regs *childregs = task_pt_regs(p);
 
memset(>thread.s, 0, sizeof(p->thread.s));
@@ -225,11 +228,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
p->thread.s[0] = (unsigned long)args->fn;
p->thread.s[1] = (unsigned long)args->fn_arg;
} else {
+   /* allocate new shadow stack if needed. In case of CLONE_VM we have to */
+   ssp = shstk_alloc_thread_stack(p, args);
+   if (IS_ERR_VALUE(ssp))
+   return PTR_ERR((void *)ssp);
+
*childregs = *(current_pt_regs());
/* Turn off status.VS */
riscv_v_vstate_off(childregs);
if (usp) /* User fork */
childregs->sp = usp;
+   if (ssp) /* if needed, set new ssp */
+   set_active_shstk(p, ssp);
if (clone_flags & CLONE_SETTLS)
childregs->tp = tls;
childregs->a0 = 0; /* Return value of fork() */
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index c4ed0d4e33d6..11ef7ab925c9 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -19,6 +19,41 @@
 
 #define SHSTK_ENTRY_SIZE sizeof(void *)
 
+bool is_shstk_enabled(struct task_struct *task)
+{
+   return task->thread_info.user_cfi_state.ubcfi_en ? true : false;
+}
+
+void set_shstk_base(struct 

[PATCH v3 14/29] riscv/mm: Implement map_shadow_stack() syscall

2024-04-03 Thread Deepak Gupta
As discussed extensively in the changelog for the addition of this
syscall on x86 ("x86/shstk: Introduce map_shadow_stack syscall"), the
existing mmap() and madvise() syscalls do not map entirely well onto the
security requirements for shadow stack memory, since they lead to windows
where memory is allocated but not yet protected, or stacks which are not
properly and safely initialised. Instead a new syscall map_shadow_stack()
has been defined which allocates and initialises a shadow stack page.

This patch implements this syscall for riscv. riscv doesn't require the
token to be set up by the kernel because user mode can do that by itself.
However, to provide compatibility and portability with other architectures,
user mode can specify the token set flag.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/kernel/Makefile  |   2 +
 arch/riscv/kernel/usercfi.c | 149 
 include/uapi/asm-generic/mman.h |   1 +
 3 files changed, 152 insertions(+)
 create mode 100644 arch/riscv/kernel/usercfi.c

diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..3bec82f4e94c 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -107,3 +107,5 @@ obj-$(CONFIG_COMPAT)+= compat_vdso/
 
 obj-$(CONFIG_64BIT)+= pi/
 obj-$(CONFIG_ACPI) += acpi.o
+
+obj-$(CONFIG_RISCV_USER_CFI) += usercfi.o
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
new file mode 100644
index ..c4ed0d4e33d6
--- /dev/null
+++ b/arch/riscv/kernel/usercfi.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024 Rivos, Inc.
+ * Deepak Gupta 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define SHSTK_ENTRY_SIZE sizeof(void *)
+
+/*
+ * Writes on shadow stack can either be `sspush` or `ssamoswap`. `sspush` can happen
+ * implicitly on current shadow stack pointed to by CSR_SSP. `ssamoswap` takes pointer to
+ * shadow stack. To keep it simple, we plan to use `ssamoswap` to perform writes on shadow
+ * stack.
+ */
+static noinline unsigned long amo_user_shstk(unsigned long *addr, unsigned long val)
+{
+   /*
+* Since shadow stack is supported only in 64bit configuration,
+* ssamoswap.d is used below. CONFIG_RISCV_USER_CFI is dependent
+* on 64BIT and compile of this file is dependent on CONFIG_RISCV_USER_CFI.
+* In case ssamoswap faults, return -1.
+* Never expect -1 on shadow stack. Expect return addresses and zero
+*/
+   unsigned long swap = -1;
+
+   __enable_user_access();
+   asm goto(
+   ".option push\n"
+   ".option arch, +zicfiss\n"
+   "1: ssamoswap.d %[swap], %[val], %[addr]\n"
+   _ASM_EXTABLE(1b, %l[fault])
+   RISCV_ACQUIRE_BARRIER
+   ".option pop\n"
+   : [swap] "=r" (swap), [addr] "+A" (*addr)
+   : [val] "r" (val)
+   : "memory"
+   : fault
+   );
+   __disable_user_access();
+   return swap;
+fault:
+   __disable_user_access();
+   return -1;
+}
+
+/*
+ * Create a restore token on the shadow stack.  A token is always XLEN wide
+ * and aligned to XLEN.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+   unsigned long addr;
+
+   /* Token must be aligned */
+   if (!IS_ALIGNED(ssp, SHSTK_ENTRY_SIZE))
+   return -EINVAL;
+
+   /* On RISC-V we're constructing token to be function of address itself */
+   addr = ssp - SHSTK_ENTRY_SIZE;
+
+   if (amo_user_shstk((unsigned long __user *)addr, (unsigned long) ssp) == -1)
+   return -EFAULT;
+
+   if (token_addr)
+   *token_addr = addr;
+
+   return 0;
+}
+
+static unsigned long allocate_shadow_stack(unsigned long addr, unsigned long size,
+   unsigned long token_offset,
+   bool set_tok)
+{
+   int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+   struct mm_struct *mm = current->mm;
+   unsigned long populate, tok_loc = 0;
+
+   if (addr)
+   flags |= MAP_FIXED_NOREPLACE;
+
+   mmap_write_lock(mm);
+   addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+   VM_SHADOW_STACK | VM_WRITE, 0, &populate, NULL);
+   mmap_write_unlock(mm);
+
+   if (!set_tok || IS_ERR_VALUE(addr))
+   goto out;
+
+   if (create_rstor_token(addr + token_offset, &tok_loc)) {
+   vm_munmap(addr, size);
+   return -EINVAL;
+   }
+
+   addr = tok_loc;
+
+out:
+   return addr;
+}
+
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, 

[PATCH v3 13/29] riscv mmu: write protect and shadow stack

2024-04-03 Thread Deepak Gupta
`fork` implements copy on write (COW) by making pages readonly in both
child and parent.

ptep_set_wrprotect and pte_wrprotect clear _PAGE_WRITE in the PTE. The
assumption is that the page is readable, and copy on write happens on
fault.

To implement COW on shadow stack pages, clearing the W bit would make them
XWR = 000. This results in a wrong PTE setting which says no permissions
but V=1 and the PFN field pointing to the final page. Instead the desired
behavior is to turn such a page into a readable page, take an access
(load/store) fault on sspush/sspop (shadow stack), and then perform COW on
it. This way regular reads are still allowed and do not lead to COW,
maintaining the current behavior of COW on non-shadow-stack writeable
memory.

On the other hand this doesn't interfere with existing COW for read-write
memory: the assumption is that _PAGE_READ must already have been set, and
thus setting _PAGE_READ again is harmless.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/pgtable.h | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9b837239d3e8..7a1c2a98d272 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -398,7 +398,7 @@ static inline int pte_special(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-   return __pte(pte_val(pte) & ~(_PAGE_WRITE));
+   return __pte((pte_val(pte) & ~(_PAGE_WRITE)) | (_PAGE_READ));
 }
 
 /* static inline pte_t pte_mkread(pte_t pte) */
@@ -581,7 +581,15 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
  unsigned long address, pte_t *ptep)
 {
-   atomic_long_and(~(unsigned long)_PAGE_WRITE, (atomic_long_t *)ptep);
+   volatile pte_t read_pte = *ptep;
+   /*
+* ptep_set_wrprotect can be called for shadow stack ranges too.
+* shadow stack memory is XWR = 010 and thus clearing _PAGE_WRITE will lead to
+* encoding 000b which is wrong encoding with V = 1. This should lead to page fault
+* but we dont want this wrong configuration to be set in page tables.
+*/
+   atomic_long_set((atomic_long_t *)ptep,
+   ((pte_val(read_pte) & ~(unsigned long)_PAGE_WRITE) | _PAGE_READ));
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-- 
2.43.2




[PATCH v3 12/29] riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs

2024-04-03 Thread Deepak Gupta
pte_mkwrite creates PTEs with WRITE encodings for the underlying arch.
The underlying arch can have two types of writeable mappings: one that can
be written using regular store instructions, and another that can only be
written using specialized store instructions (like shadow stack stores).
pte_mkwrite can select the write PTE encoding based on the VMA range (i.e.
VM_SHADOW_STACK).

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/pgtable.h |  7 +++
 arch/riscv/mm/pgtable.c  | 21 +
 2 files changed, 28 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 6362407f1e83..9b837239d3e8 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -403,6 +403,10 @@ static inline pte_t pte_wrprotect(pte_t pte)
 
 /* static inline pte_t pte_mkread(pte_t pte) */
 
+struct vm_area_struct;
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);
+#define pte_mkwrite pte_mkwrite
+
 static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
return __pte(pte_val(pte) | _PAGE_WRITE);
@@ -694,6 +698,9 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
 }
 
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+#define pmd_mkwrite pmd_mkwrite
+
 static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index ef887efcb679..c84ae2e0424d 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -142,3 +142,24 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
return pmd;
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+   if (vma_is_shadow_stack(vma->vm_flags))
+   return pte_mkwrite_shstk(pte);
+
+   pte = pte_mkwrite_novma(pte);
+
+   return pte;
+}
+
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+   if (vma_is_shadow_stack(vma->vm_flags))
+   return pmd_mkwrite_shstk(pmd);
+
+   pmd = pmd_mkwrite_novma(pmd);
+
+   return pmd;
+}
+
-- 
2.43.2




[PATCH v3 11/29] riscv mm: manufacture shadow stack pte

2024-04-03 Thread Deepak Gupta
This patch implements creating a shadow stack pte (on riscv). Creating a
shadow stack PTE on riscv means clearing RWX and then setting W=1.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 4d5983bc6766..6362407f1e83 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -408,6 +408,12 @@ static inline pte_t pte_mkwrite_novma(pte_t pte)
return __pte(pte_val(pte) | _PAGE_WRITE);
 }
 
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+   /* shadow stack on risc-v is XWR = 010. Clear everything and only set _PAGE_WRITE */
+   return __pte((pte_val(pte) & ~(_PAGE_LEAF)) | _PAGE_WRITE);
+}
+
 /* static inline pte_t pte_mkexec(pte_t pte) */
 
 static inline pte_t pte_mkdirty(pte_t pte)
@@ -693,6 +699,12 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
 }
 
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pte)
+{
+   /* shadow stack on risc-v is XWR = 010. Clear everything and only set _PAGE_WRITE */
+   return __pmd((pmd_val(pte) & ~(_PAGE_LEAF)) | _PAGE_WRITE);
+}
+
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
return pte_pmd(pte_wrprotect(pmd_pte(pmd)));
-- 
2.43.2




[PATCH v3 10/29] riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE

2024-04-03 Thread Deepak Gupta
`arch_calc_vm_prot_bits` is implemented on risc-v to return VM_READ |
VM_WRITE if PROT_WRITE is specified. Similarly `riscv_sys_mmap` is
updated to convert all incoming PROT_WRITE to (PROT_WRITE | PROT_READ).
This is to make sure that any existing apps using PROT_WRITE still work.

Earlier `protection_map[VM_WRITE]` used to pick read-write PTE encodings.
Now `protection_map[VM_WRITE]` will always pick PAGE_SHADOWSTACK PTE
encodings for shadow stack. Above changes ensure that existing apps
continue to work because underneath kernel will be picking
`protection_map[VM_WRITE|VM_READ]` PTE encodings.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/mman.h| 24 
 arch/riscv/include/asm/pgtable.h |  1 +
 arch/riscv/kernel/sys_riscv.c| 11 +++
 arch/riscv/mm/init.c |  2 +-
 mm/mmap.c|  1 +
 5 files changed, 38 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/include/asm/mman.h

diff --git a/arch/riscv/include/asm/mman.h b/arch/riscv/include/asm/mman.h
new file mode 100644
index ..ef9fedf32546
--- /dev/null
+++ b/arch/riscv/include/asm/mman.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MMAN_H__
+#define __ASM_MMAN_H__
+
+#include 
+#include 
+#include 
+
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+   unsigned long pkey __always_unused)
+{
+   unsigned long ret = 0;
+
+   /*
+* If PROT_WRITE was specified, force it to VM_READ | VM_WRITE.
+* Only VM_WRITE means shadow stack.
+*/
+   if (prot & PROT_WRITE)
+   ret = (VM_READ | VM_WRITE);
+   return ret;
+}
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+
+#endif /* ! __ASM_MMAN_H__ */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 6066822e7396..4d5983bc6766 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -184,6 +184,7 @@ extern struct pt_alloc_ops pt_ops __initdata;
 #define PAGE_READ_EXEC __pgprot(_PAGE_BASE | _PAGE_READ | _PAGE_EXEC)
 #define PAGE_WRITE_EXEC    __pgprot(_PAGE_BASE | _PAGE_READ |  \
                _PAGE_EXEC | _PAGE_WRITE)
+#define PAGE_SHADOWSTACK   __pgprot(_PAGE_BASE | _PAGE_WRITE)
 
 #define PAGE_COPY  PAGE_READ
 #define PAGE_COPY_EXEC PAGE_READ_EXEC
diff --git a/arch/riscv/kernel/sys_riscv.c b/arch/riscv/kernel/sys_riscv.c
index f1c1416a9f1e..846c36b1b3d5 100644
--- a/arch/riscv/kernel/sys_riscv.c
+++ b/arch/riscv/kernel/sys_riscv.c
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 static long riscv_sys_mmap(unsigned long addr, unsigned long len,
   unsigned long prot, unsigned long flags,
@@ -17,6 +19,15 @@ static long riscv_sys_mmap(unsigned long addr, unsigned long len,
if (unlikely(offset & (~PAGE_MASK >> page_shift_offset)))
return -EINVAL;
 
+   /*
+* If only PROT_WRITE is specified then extend that to PROT_READ
+* protection_map[VM_WRITE] is now going to select shadow stack encodings.
+* So specifying PROT_WRITE actually should select protection_map[VM_WRITE | VM_READ]
+* If user wants to create shadow stack then they should use `map_shadow_stack` syscall.
+*/
+   if (unlikely((prot & PROT_WRITE) && !(prot & PROT_READ)))
+   prot |= PROT_READ;
+
return ksys_mmap_pgoff(addr, len, prot, flags, fd,
   offset >> (PAGE_SHIFT - page_shift_offset));
 }
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index fa34cf55037b..98e5ece4052a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -299,7 +299,7 @@ pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
 static const pgprot_t protection_map[16] = {
[VM_NONE]   = PAGE_NONE,
[VM_READ]   = PAGE_READ,
-   [VM_WRITE]  = PAGE_COPY,
+   [VM_WRITE]  = PAGE_SHADOWSTACK,
[VM_WRITE | VM_READ]= PAGE_COPY,
[VM_EXEC]   = PAGE_EXEC,
[VM_EXEC | VM_READ] = PAGE_READ_EXEC,
diff --git a/mm/mmap.c b/mm/mmap.c
index d89770eaab6b..57a974f49b00 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
-- 
2.43.2




[PATCH v3 09/29] mm: abstract shadow stack vma behind `vma_is_shadow_stack`

2024-04-03 Thread Deepak Gupta
VM_SHADOW_STACK (an alias of VM_HIGH_ARCH_5) encodes a shadow stack VMA.

This patch changes checks of the VM_SHADOW_STACK flag in generic code to
call a function `vma_is_shadow_stack`, which returns true if it's a shadow
stack vma; the default stub (when support doesn't exist) returns false.

Signed-off-by: Deepak Gupta 
Suggested-by: Mike Rapoport 
---
 include/linux/mm.h | 13 -
 mm/gup.c   |  5 +++--
 mm/internal.h  |  2 +-
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64109f6c70f5..9952937be659 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -363,8 +363,19 @@ extern unsigned int kobjsize(const void *objp);
 
 #ifndef VM_SHADOW_STACK
 # define VM_SHADOW_STACK   VM_NONE
+
+static inline bool vma_is_shadow_stack(vm_flags_t vm_flags)
+{
+   return false;
+}
+#else
+static inline bool vma_is_shadow_stack(vm_flags_t vm_flags)
+{
+   return (vm_flags & VM_SHADOW_STACK);
+}
 #endif
 
+
 #if defined(CONFIG_X86)
# define VM_PAT    VM_ARCH_1   /* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
@@ -3473,7 +3484,7 @@ static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
return stack_guard_gap;
 
/* See reasoning around the VM_SHADOW_STACK definition */
-   if (vma->vm_flags & VM_SHADOW_STACK)
+   if (vma->vm_flags && vma_is_shadow_stack(vma->vm_flags))
return PAGE_SIZE;
 
return 0;
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..a7a02eb0a6b3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1053,7 +1053,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
!writable_file_mapping_allowed(vma, gup_flags))
return -EFAULT;
 
-   if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
+   if (!(vm_flags & VM_WRITE) || vma_is_shadow_stack(vm_flags)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
/* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
@@ -1071,7 +1071,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if (!is_cow_mapping(vm_flags))
return -EFAULT;
}
-   } else if (!(vm_flags & VM_READ)) {
+   } else if (!(vm_flags & VM_READ) && !vma_is_shadow_stack(vm_flags)) {
+   /* reads allowed if its shadow stack vma */
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
/*
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..5035b5a58df0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -572,7 +572,7 @@ static inline bool is_exec_mapping(vm_flags_t flags)
  */
 static inline bool is_stack_mapping(vm_flags_t flags)
 {
-   return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK);
+   return ((flags & VM_STACK) == VM_STACK) || vma_is_shadow_stack(flags);
 }
 
 /*
-- 
2.43.2




[PATCH v3 08/29] mm: Define VM_SHADOW_STACK for RISC-V

2024-04-03 Thread Deepak Gupta
VM_SHADOW_STACK is defined by x86 as vm flag to mark a shadow stack vma.

x86 uses the VM_HIGH_ARCH_5 bit, but that limits shadow stack vmas to
64bit only. arm64 follows the same path (see links).

To keep things simple, RISC-V follows the same.
This patch adds `ss` for shadow stack in process maps.

Links:
https://lore.kernel.org/lkml/20231009-arm64-gcs-v6-12-78e55deaa...@kernel.org/#r

Signed-off-by: Deepak Gupta 
---
 fs/proc/task_mmu.c |  3 +++
 include/linux/mm.h | 11 ++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3f78ebbb795f..d9d63eb74f0d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -702,6 +702,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 #ifdef CONFIG_X86_USER_SHADOW_STACK
[ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
+#ifdef CONFIG_RISCV_USER_CFI
+   [ilog2(VM_SHADOW_STACK)] = "ss",
 #endif
};
size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..64109f6c70f5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -352,7 +352,16 @@ extern unsigned int kobjsize(const void *objp);
  * for more details on the guard size.
  */
 # define VM_SHADOW_STACK   VM_HIGH_ARCH_5
-#else
+#endif
+
+#ifdef CONFIG_RISCV_USER_CFI
+/*
+ * RISC-V is going along with using VM_HIGH_ARCH_5 bit position for shadow stack
+ */
+#define VM_SHADOW_STACKVM_HIGH_ARCH_5
+#endif
+
+#ifndef VM_SHADOW_STACK
 # define VM_SHADOW_STACK   VM_NONE
 #endif
 
-- 
2.43.2




[PATCH v3 07/29] riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit

2024-04-03 Thread Deepak Gupta
Carves out space in the arch-specific thread struct for cfi status and
shadow stack in usermode on riscv.

This patch does the following:
- defines a new structure cfi_status with a status bit for the cfi feature
- defines shadow stack pointer, base and size in the cfi_status structure
- defines offsets to new member fields in thread in asm-offsets.c
- saves and restores the shadow stack pointer on trap entry (U --> S) and
  exit (S --> U)

Shadow stack save/restore is gated on feature availability and implemented
using alternatives. The CSR could be context switched in `switch_to` as
well, but as soon as kernel shadow stack support gets rolled in, the shadow
stack pointer will need to be switched at the trap entry/exit point (much
like `sp`). It can be argued that the kernel-using-shadow-stack deployment
scenario may not be as prevalent as user mode using this feature. But even
if there is only minimal deployment of kernel shadow stack, it still needs
to be supported. Hence the save/restore of the shadow stack pointer lives
in entry.S instead of in `switch_to.h`.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/processor.h   |  1 +
 arch/riscv/include/asm/thread_info.h |  3 +++
 arch/riscv/include/asm/usercfi.h | 24 
 arch/riscv/kernel/asm-offsets.c  |  4 
 arch/riscv/kernel/entry.S| 26 ++
 5 files changed, 58 insertions(+)
 create mode 100644 arch/riscv/include/asm/usercfi.h

diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 6c5b3d928b12..f8decf357804 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -14,6 +14,7 @@
 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_64BIT
 #define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index a503bdc2f6dd..f1dee307806e 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -57,6 +57,9 @@ struct thread_info {
int cpu;
unsigned long   syscall_work;   /* SYSCALL_WORK_ flags */
unsigned long envcfg;
+#ifdef CONFIG_RISCV_USER_CFI
+   struct cfi_status   user_cfi_state;
+#endif
 #ifdef CONFIG_SHADOW_CALL_STACK
void*scs_base;
void*scs_sp;
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
new file mode 100644
index ..4fa201b4fc4e
--- /dev/null
+++ b/arch/riscv/include/asm/usercfi.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (C) 2024 Rivos, Inc.
+ * Deepak Gupta 
+ */
+#ifndef _ASM_RISCV_USERCFI_H
+#define _ASM_RISCV_USERCFI_H
+
+#ifndef __ASSEMBLY__
+#include 
+
+#ifdef CONFIG_RISCV_USER_CFI
+struct cfi_status {
+   unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
+   unsigned long rsvd : ((sizeof(unsigned long)*8) - 1);
+   unsigned long user_shdw_stk; /* Current user shadow stack pointer */
+   unsigned long shdw_stk_base; /* Base address of shadow stack */
+   unsigned long shdw_stk_size; /* size of shadow stack */
+};
+
+#endif /* CONFIG_RISCV_USER_CFI */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_RISCV_USERCFI_H */
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a03129f40c46..5c5ea015c776 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -44,6 +44,10 @@ void asm_offsets(void)
 #endif
 
OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
+#ifdef CONFIG_RISCV_USER_CFI
+   OFFSET(TASK_TI_CFI_STATUS, task_struct, thread_info.user_cfi_state);
+   OFFSET(TASK_TI_USER_SSP, task_struct, thread_info.user_cfi_state.user_shdw_stk);
+#endif
OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
OFFSET(TASK_THREAD_F1,  task_struct, thread.fstate.f[1]);
OFFSET(TASK_THREAD_F2,  task_struct, thread.fstate.f[2]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 9d1a305d5508..7245a0ea25c1 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -60,6 +60,20 @@ SYM_CODE_START(handle_exception)
 
REG_L s0, TASK_TI_USER_SP(tp)
csrrc s1, CSR_STATUS, t0
+   /*
+* If previous mode was U, capture shadow stack pointer and save it away
+* Zero CSR_SSP at the same time for sanitization.
+*/
+   ALTERNATIVE("nop; nop; nop; nop",
+   __stringify(\
+   andi s2, s1, SR_SPP;\
+   bnez s2, skip_ssp_save; \
+   csrrw s2, CSR_SSP, x0;  \
+   REG_S s2, TASK_TI_USER_SSP(tp); \
+   skip_ssp_save:),
+   0,
+   RISCV_ISA_EXT_ZICFISS,
+   CONFIG_RISCV_USER_CFI)

[PATCH v3 06/29] riscv: zicfiss / zicfilp extension csr and bit definitions

2024-04-03 Thread Deepak Gupta
zicfiss and zicfilp extensions get enabled via b3 and b2 in *envcfg CSR.
menvcfg controls enabling for S/HS mode. henvcfg controls enabling for VS
while senvcfg controls enabling for U/VU mode.

zicfilp extension extends *status CSR to hold `expected landing pad` bit.
A trap or interrupt can occur between an indirect jmp/call and target
instr. `expected landing pad` bit from CPU is recorded into xstatus CSR so
that when supervisor performs xret, `expected landing pad` state of CPU can
be restored.

zicfiss adds one new CSR
- CSR_SSP: CSR_SSP contains current shadow stack pointer.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/csr.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index bbd2207adb39..3bb126d1c5ff 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -18,6 +18,15 @@
 #define SR_MPP _AC(0x1800, UL) /* Previously Machine */
 #define SR_SUM _AC(0x0004, UL) /* Supervisor User Memory Access */
 
+/* zicfilp landing pad status bit */
+#define SR_SPELP   _AC(0x0080, UL)
+#define SR_MPELP   _AC(0x0200, UL)
+#ifdef CONFIG_RISCV_M_MODE
+#define SR_ELP SR_MPELP
+#else
+#define SR_ELP SR_SPELP
+#endif
+
 #define SR_FS  _AC(0x6000, UL) /* Floating-point Status */
 #define SR_FS_OFF  _AC(0x, UL)
 #define SR_FS_INITIAL  _AC(0x2000, UL)
@@ -196,6 +205,8 @@
 #define ENVCFG_PBMTE   (_AC(1, ULL) << 62)
 #define ENVCFG_CBZE(_AC(1, UL) << 7)
 #define ENVCFG_CBCFE   (_AC(1, UL) << 6)
+#define ENVCFG_LPE (_AC(1, UL) << 2)
+#define ENVCFG_SSE (_AC(1, UL) << 3)
 #define ENVCFG_CBIE_SHIFT  4
 #define ENVCFG_CBIE(_AC(0x3, UL) << ENVCFG_CBIE_SHIFT)
 #define ENVCFG_CBIE_ILL_AC(0x0, UL)
@@ -216,6 +227,11 @@
 #define SMSTATEEN0_HSENVCFG(_ULL(1) << SMSTATEEN0_HSENVCFG_SHIFT)
 #define SMSTATEEN0_SSTATEEN0_SHIFT 63
 #define SMSTATEEN0_SSTATEEN0   (_ULL(1) << SMSTATEEN0_SSTATEEN0_SHIFT)
+/*
+ * zicfiss user mode csr
+ * CSR_SSP holds current shadow stack pointer.
+ */
+#define CSR_SSP 0x011
 
 /* symbolic CSR names: */
 #define CSR_CYCLE  0xc00
-- 
2.43.2




[PATCH v3 05/29] riscv: zicfiss / zicfilp enumeration

2024-04-03 Thread Deepak Gupta
This patch adds support for detecting zicfiss and zicfilp. zicfiss and
zicfilp stand for the unprivileged integer spec extensions for shadow stack
and branch tracking on indirect branches, respectively.

This patch looks for zicfiss and zicfilp in the device tree and accordingly
lights up the bit in the cpu feature bitmap. Furthermore, this patch adds
detection utility functions to return whether shadow stack or landing pads
are supported by the cpu.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/cpufeature.h | 13 +
 arch/riscv/include/asm/hwcap.h  |  2 ++
 arch/riscv/include/asm/processor.h  |  1 +
 arch/riscv/kernel/cpufeature.c  |  2 ++
 4 files changed, 18 insertions(+)

diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
index 0bd11862b760..f0fb8d8ae273 100644
--- a/arch/riscv/include/asm/cpufeature.h
+++ b/arch/riscv/include/asm/cpufeature.h
@@ -8,6 +8,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -137,4 +138,16 @@ static __always_inline bool riscv_cpu_has_extension_unlikely(int cpu, const unsi
 
 DECLARE_STATIC_KEY_FALSE(fast_misaligned_access_speed_key);
 
+static inline bool cpu_supports_shadow_stack(void)
+{
+   return (IS_ENABLED(CONFIG_RISCV_USER_CFI) &&
+   riscv_cpu_has_extension_unlikely(smp_processor_id(), RISCV_ISA_EXT_ZICFISS));
+}
+
+static inline bool cpu_supports_indirect_br_lp_instr(void)
+{
+   return (IS_ENABLED(CONFIG_RISCV_USER_CFI) &&
+   riscv_cpu_has_extension_unlikely(smp_processor_id(), RISCV_ISA_EXT_ZICFILP));
+}
+
 #endif
diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
index 1f2d2599c655..74b6c727f545 100644
--- a/arch/riscv/include/asm/hwcap.h
+++ b/arch/riscv/include/asm/hwcap.h
@@ -80,6 +80,8 @@
 #define RISCV_ISA_EXT_ZFA  71
 #define RISCV_ISA_EXT_ZTSO 72
 #define RISCV_ISA_EXT_ZACAS73
+#define RISCV_ISA_EXT_ZICFILP  74
+#define RISCV_ISA_EXT_ZICFISS  75
 
 #define RISCV_ISA_EXT_XLINUXENVCFG 127
 
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index a8509cc31ab2..6c5b3d928b12 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -13,6 +13,7 @@
 #include 
 
 #include 
+#include 
 
 #ifdef CONFIG_64BIT
 #define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 79a5a35fab96..d052cad5b82f 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -263,6 +263,8 @@ const struct riscv_isa_ext_data riscv_isa_ext[] = {
__RISCV_ISA_EXT_DATA(h, RISCV_ISA_EXT_h),
__RISCV_ISA_EXT_SUPERSET(zicbom, RISCV_ISA_EXT_ZICBOM, riscv_xlinuxenvcfg_exts),
__RISCV_ISA_EXT_SUPERSET(zicboz, RISCV_ISA_EXT_ZICBOZ, riscv_xlinuxenvcfg_exts),
+   __RISCV_ISA_EXT_SUPERSET(zicfilp, RISCV_ISA_EXT_ZICFILP, riscv_xlinuxenvcfg_exts),
+   __RISCV_ISA_EXT_SUPERSET(zicfiss, RISCV_ISA_EXT_ZICFISS, riscv_xlinuxenvcfg_exts),
__RISCV_ISA_EXT_DATA(zicntr, RISCV_ISA_EXT_ZICNTR),
__RISCV_ISA_EXT_DATA(zicond, RISCV_ISA_EXT_ZICOND),
__RISCV_ISA_EXT_DATA(zicsr, RISCV_ISA_EXT_ZICSR),
-- 
2.43.2




[PATCH v3 04/29] riscv: zicfilp / zicfiss in dt-bindings (extensions.yaml)

2024-04-03 Thread Deepak Gupta
Make an entry for cfi extensions in extensions.yaml.

Signed-off-by: Deepak Gupta 
---
 .../devicetree/bindings/riscv/extensions.yaml  | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml
index 63d81dc895e5..45b87ad6cc1c 100644
--- a/Documentation/devicetree/bindings/riscv/extensions.yaml
+++ b/Documentation/devicetree/bindings/riscv/extensions.yaml
@@ -317,6 +317,16 @@ properties:
 The standard Zicboz extension for cache-block zeroing as ratified
 in commit 3dd606f ("Create cmobase-v1.0.pdf") of riscv-CMOs.
 
+- const: zicfilp
+  description:
+The standard Zicfilp extension for enforcing forward edge control-flow
+integrity in commit 3a20dc9 of riscv-cfi and is in public review.
+
+- const: zicfiss
+  description:
+The standard Zicfiss extension for enforcing backward edge control-flow
+integrity in commit 3a20dc9 of riscv-cfi and is in public review.
+
 - const: zicntr
   description:
 The standard Zicntr extension for base counters and timers, as
-- 
2.43.2




[PATCH v3 03/29] riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv

2024-04-03 Thread Deepak Gupta
riscv will need an implementation of exit_thread to clean up the shadow
stack when a thread exits. If the current thread had shadow stack enabled,
a shadow stack is allocated by default for any new thread.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/Kconfig  | 1 +
 arch/riscv/kernel/process.c | 5 +
 2 files changed, 6 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..7e0b2bcc388f 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -149,6 +149,7 @@ config RISCV
select HAVE_SAMPLE_FTRACE_DIRECT_MULTI
select HAVE_STACKPROTECTOR
select HAVE_SYSCALL_TRACEPOINTS
+   select HAVE_EXIT_THREAD
select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index d3109557f951..ce577cdc2af3 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -200,6 +200,11 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
return 0;
 }
 
+void exit_thread(struct task_struct *tsk)
+{
+
+}
+
 int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 {
unsigned long clone_flags = args->flags;
-- 
2.43.2




[PATCH v3 02/29] riscv: define default value for envcfg for task

2024-04-03 Thread Deepak Gupta
Defines a base default value for envcfg per task. By default all tasks
should have the cache zeroing capability. Any future base capabilities that
apply to all tasks can be turned on the same way.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/csr.h | 2 ++
 arch/riscv/kernel/process.c  | 6 ++
 2 files changed, 8 insertions(+)

diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 2468c55933cd..bbd2207adb39 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -202,6 +202,8 @@
 #define ENVCFG_CBIE_FLUSH  _AC(0x1, UL)
 #define ENVCFG_CBIE_INV_AC(0x3, UL)
 #define ENVCFG_FIOM_AC(0x1, UL)
+/* by default all threads should be able to zero cache */
+#define ENVCFG_BASEENVCFG_CBZE
 
 /* Smstateen bits */
 #define SMSTATEEN0_AIA_IMSIC_SHIFT 58
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 92922dbd5b5c..d3109557f951 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -152,6 +152,12 @@ void start_thread(struct pt_regs *regs, unsigned long pc,
else
regs->status |= SR_UXL_64;
 #endif
+   /*
+* read current envcfg settings, AND it with base settings applicable
+* for all the tasks. Base settings should've been set up during CPU
+* bring up.
+*/
+   current->thread_info.envcfg = csr_read(CSR_ENVCFG) & ENVCFG_BASE;
 }
 
 void flush_thread(void)
-- 
2.43.2




[PATCH v3 01/29] riscv: envcfg save and restore on task switching

2024-04-03 Thread Deepak Gupta
envcfg CSR defines enabling bits for cache management instructions and
will soon control enabling for control flow integrity and pointer
masking features.

Control flow integrity enabling for forward cfi and backward cfi is
controlled via envcfg and thus needs to be enabled on a per-thread basis.

This patch creates a placeholder for the envcfg CSR in `thread_info` and
adds logic to save and restore it on task switching.

Signed-off-by: Deepak Gupta 
---
 arch/riscv/include/asm/switch_to.h   | 10 ++
 arch/riscv/include/asm/thread_info.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/arch/riscv/include/asm/switch_to.h b/arch/riscv/include/asm/switch_to.h
index 7efdb0584d47..2d9a00a30394 100644
--- a/arch/riscv/include/asm/switch_to.h
+++ b/arch/riscv/include/asm/switch_to.h
@@ -69,6 +69,15 @@ static __always_inline bool has_fpu(void) { return false; }
 #define __switch_to_fpu(__prev, __next) do { } while (0)
 #endif
 
+static inline void __switch_to_envcfg(struct task_struct *next)
+{
+   register unsigned long envcfg = next->thread_info.envcfg;
+
+   asm volatile (ALTERNATIVE("nop", "csrw " __stringify(CSR_ENVCFG) ", %0", 0,
+ RISCV_ISA_EXT_XLINUXENVCFG, 1)
+ :: "r" (envcfg) : "memory");
+}
+
 extern struct task_struct *__switch_to(struct task_struct *,
   struct task_struct *);
 
@@ -80,6 +89,7 @@ do {  \
__switch_to_fpu(__prev, __next);\
if (has_vector())   \
__switch_to_vector(__prev, __next); \
+   __switch_to_envcfg(__next); \
((last) = __switch_to(__prev, __next)); \
 } while (0)
 
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 5d473343634b..a503bdc2f6dd 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -56,6 +56,7 @@ struct thread_info {
longuser_sp;/* User stack pointer */
int cpu;
unsigned long   syscall_work;   /* SYSCALL_WORK_ flags */
+   unsigned long envcfg;
 #ifdef CONFIG_SHADOW_CALL_STACK
void*scs_base;
void*scs_sp;
-- 
2.43.2




[PATCH v3 00/29] riscv control-flow integrity for usermode

2024-04-03 Thread Deepak Gupta
Sending out v3 for cpu assisted riscv user mode control flow integrity.

v2 [9] was sent a week ago for this riscv usermode control flow integrity
enabling. The RFC patchset (v1) was sent early this year (January) [7].

changes in v3
--
envcfg:
logic to pick up the base envcfg had a bug where `ENVCFG_CBZE` could have
been picked on a per-task basis, even though the CPU didn't implement it.
Fixed in this series.

dt-bindings:
As suggested, split into separate commit. fixed the messaging that spec is
in public review

arch_is_shadow_stack change:
arch_is_shadow_stack changed to vma_is_shadow_stack

hwprobe:
zicfiss / zicfilp if present will get enumerated in hwprobe

selftests:
As suggested, added object and binary filenames to .gitignore
Selftest binaries anyway need to be compiled with a cfi enabled compiler,
which will make sure that landing pad and shadow stack are enabled. Thus
removed separate enable/disable tests. Cleaned up tests a bit.

changes in v2
---
As part of testing effort, compiled a rootfs with shadow stack and landing
pad enabled (libraries and binaries) and booted to shell. As part of long
running tests, I have been able to run some spec 2006 benchmarks [8] (here
link is provided only for list of benchmarks that were tested for long
running tests, excel sheet provided here actually is for some static stats
like code size growth on spec binaries). Thus converting from RFC to
regular patchset.

Securing control-flow integrity for usermode requires following

- Securing forward control flow : All callsites must reach a target
  that they actually intend to reach.

- Securing backward control flow : All function returns must
  return to location where they were called from.

This patch series uses riscv cpu extension `zicfilp` [2] to secure forward
control flow and `zicfiss` [2] to secure backward control flow. `zicfilp`
enforces that all indirect calls or jmps must land on a landing pad instr
and label embedded in landing pad instr must match a value programmed in
`x7` register (at callsite via compiler). `zicfiss` introduces shadow stack
which can only be writeable via shadow stack instructions (sspush and
ssamoswap) and thus can't be tampered with via inadvertent stores. More
details about extension can be read from [2] and there are details in
documentation as well (in this patch series).

Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.

Enabling of control flow integrity for user programs is left to user runtime
(specifically expected from dynamic loader). There has been a lot of earlier
discussion on the enabling topic around x86 shadow stack enabling [3, 4, 5] and
overall consensus had been to let dynamic loader (or usermode) to decide for
enabling the feature.

This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv. arm64 is expected
to implement shadow stack part of these arch agnostic `prctls` [6]

Changes since last time
***

Spec changes

- Forward cfi spec has become much simpler. `lpad` instruction is pseudo for
  `auipc rd, <20bit_imm>`. `lpad` checks x7 against 20bit embedded in instr.
  Thus label width is 20bit.

- Shadow stack management instructions are reduced to
sspush - to push x1/x5 on shadow stack
sspopchk - pops from shadow stack and compares with x1/x5.
ssamoswap - atomically swap value on shadow stack.
rdssp - reads current shadow stack pointer

- Shadow stack accesses on readonly memory always raise AMO/store page fault.
  `sspopchk` is load but if underlying page is readonly, it'll raise a store
  page fault. It simplifies hardware and kernel for COW handling for shadow
  stack pages.

- riscv defines a new exception type `software check exception` and control flow
  violations raise software check exception.

- enabling controls for shadow stack and landing pads are in the xenvcfg CSR
  and control lower privilege mode enabling. As an example senvcfg controls
  enabling for U and menvcfg controls enabling for S mode.

core mm shadow stack enabling
-----------------------------
Shadow stack for x86 usermode are now in mainline and thus this patch
series builds on top of that for arch-agnostic mm related changes. Big
thanks and shout out to Rick Edgecombe for that.

selftests
---------
Created some minimal selftests to test the patch series.


[1] - https://lore.kernel.org/lkml/20230213045351.3945824-1-de...@rivosinc.com/
[2] - https://github.com/riscv/riscv-cfi
[3] - 
https://lore.kernel.org/lkml/zwhcbq0bj+15e...@finisterre.sirena.org.uk/T/#mb121cd8b33d564e64234595a0ec52211479cf474
[4] - 
https://lore.kernel.org/all/20220130211838.8382-1-rick.p.edgeco...@intel.com/
[5] - 
https://lore.kernel.org/lkml/CAHk-=wgp5mk3poveejw16asbid0ghdt4okhnwawklbkrhqn...@mail.gmail.com/
[6] - 

Re: [PATCH bpf-next] selftests/bpf: Add F_SETFL for fcntl

2024-04-03 Thread John Fastabend
Jakub Sitnicki wrote:
> Hi Geliang,
> 
> On Wed, Apr 03, 2024 at 04:32 PM +08, Geliang Tang wrote:
> > From: Geliang Tang 
> >
> > Incorrect arguments are passed to fcntl() in test_sockmap.c when invoking
> > it to set file status flags. If O_NONBLOCK is used as 2nd argument and
> > passed into fcntl, -EINVAL will be returned (See do_fcntl() in fs/fcntl.c).
> > The correct approach is to use F_SETFL as 2nd argument, and O_NONBLOCK as
> > 3rd one.
> >
> > Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
> > Signed-off-by: Geliang Tang 
> > ---
> >  tools/testing/selftests/bpf/test_sockmap.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
> > index 024a0faafb3b..34d6a1e6f664 100644
> > --- a/tools/testing/selftests/bpf/test_sockmap.c
> > +++ b/tools/testing/selftests/bpf/test_sockmap.c
> > @@ -603,7 +603,7 @@ static int msg_loop(int fd, int iov_count, int 
> > iov_length, int cnt,
> > struct timeval timeout;
> > fd_set w;
> >  
> > -   fcntl(fd, fd_flags);
> > +   fcntl(fd, F_SETFL, fd_flags);
> > /* Account for pop bytes noting each iteration of apply will
> >  * call msg_pop_data helper so we need to account for this
> >  * by calculating the number of apply iterations. Note user
> 
> Good catch. But we also need to figure out why some tests failing with
> this patch applied and fix them in one go:
> 
> # 6/ 7  sockmap::txmsg test skb:FAIL
> #21/ 7 sockhash::txmsg test skb:FAIL
> #36/ 7 sockhash:ktls:txmsg test skb:FAIL
> Pass: 42 Fail: 3
> 
> I'm seeing this error message when running `test_sockmap`:
> 
> detected skb data error with skb ingress update @iov[0]:0 "00 00 00 00" != 
> "PASS"
> data verify msg failed: Unknown error -5
> rx thread exited with err 1.

I have a theory this is a real bug in the SK_SKB_STREAM_PARSER which has an
issue with wakeup logic. Maybe we wake up the poll/select logic before the
data is copied, and because the recv() is actually nonblocking now we get
the error.

> 
> I'd also:
> - add an error check for fnctl, so we don't regress,
> - get rid of fd_flags, pass O_NONBLOCK flag directly to fnctl.
> 
> Thanks,
> -jkbs





Re: [PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

2024-04-03 Thread Thomas Gleixner
On Wed, Apr 03 2024 at 12:35, John Stultz wrote:
> On Wed, Apr 3, 2024 at 12:10 PM Thomas Gleixner  wrote:
>>
>> On Wed, Apr 03 2024 at 11:16, John Stultz wrote:
>> > On Wed, Apr 3, 2024 at 9:32 AM Thomas Gleixner  wrote:
>> > Thanks for this, Thomas!
>> >
>> > Just FYI: testing with 6.1, the test no longer hangs, but I don't see
>> > the SKIP behavior. It just fails:
>> > not ok 6 check signal distribution
>> > # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0
>> >
>> > I've not had time yet to dig into what's going on, but let me know if
>> > you need any further details.
>>
>> That's weird. I ran it on my laptop with 6.1.y ...
>>
>> What kind of machine is that?
>
> I was running it in a VM.
>
> Interestingly with 64cpus it sometimes will do the skip behavior, but
> with 4 cpus it seems to always fail.

Duh, yes. The problem is that any thread might grab the signal as it is
process wide.

What was I thinking? Not much obviously.

The distribution mechanism is only targeting the wakeup at signal
queuing time and therefore avoids the wakeup of idle tasks. But it does
not guarantee that the signal is evenly distributed to the threads on
actual signal delivery.

Even with the change to stop the worker threads when they got a signal
it's not guaranteed that the last worker will actually get one within
the timeout simply because the main thread can win the race to collect
the signal every time. I just managed to make the patched test fail in
one out of 100 runs.

IOW, we cannot test this reliably at all with the current approach.

I'll think about it tomorrow again with brain awake.

Thanks,

tglx




Re: [PATCH v2] Documentation: kunit: Clarify test filter format

2024-04-03 Thread Daniel Latypov
On Tue, Apr 2, 2024 at 5:51 AM Brendan Jackman  wrote:
>
> It seems obvious once you know, but at first I didn't realise that the
> suite name is part of this format. Document it and add some examples.
>
> Signed-off-by: Brendan Jackman 

Reviewed-by: Daniel Latypov 

Thanks!

I agree with your comment on v1, I think the extra verbosity is fine.
It's still easy to read and this should hopefully eliminate the
ambiguity for most readers.

> ---
> v1->v2: Expanded to clarify that suite_glob and test_glob are two separate
> patterns. Also made some other trivial changes to formatting etc.





Re: [PATCH net-next 7/7] testing: net-drv: add a driver test for stats reporting

2024-04-03 Thread Jakub Kicinski
On Wed, 3 Apr 2024 18:52:50 +0200 Petr Machata wrote:
> > Nothing wrong with that. I guess the question in my mind is whether
> > we're aiming for making the tests "pythonic" (in which case "with"
> > definitely wins), or more of a "bash with classes" style trying to
> > avoid any constructs people may have to google. I'm on the fence on
> > that one, as the del example proves my python expertise is not high.
> > OTOH people who prefer bash will continue to write bash tests,
> > so maybe we don't have to worry about non-experts too much. Dunno.  
> 
> What I'm saying is, bash is currently a bit of a mess when it comes to
> cleanups. It's hard to get right, annoying to review, and sometimes
> individual cases add state that they don't unwind in cleanup() but only
> later in the function, so when you C-c half-way through such case, stuff
> stays behind.
> 
> Python has tools to just magic all this away.

Understood, just to be clear what I was saying is that, +/- bugs
in my example, it is possible to "attach" the lifetime of things
to a test object or such. Maybe people would be less likely to remember
to do that than use "with"? Dunno. In any case, IIUC we don't have to
decide now, so I went ahead with the v2 last night.



[KTAP V2 PATCH v4] ktap_v2: add test metadata

2024-04-03 Thread Rae Moar
Add specification for test metadata to the KTAP v2 spec.

KTAP v1 only specifies the output format of very basic test information:
test result and test name. Any additional test information either gets
added to general diagnostic data or is not included in the output at all.

The purpose of KTAP metadata is to create a framework to include and
easily identify additional important test information in KTAP.

KTAP metadata could include any test information that is pertinent for
user interaction before or after the running of the test. For example,
the test file path or the test speed.

Since this includes a large variety of information, this specification
will recognize notable types of KTAP metadata to ensure consistent format
across test frameworks. See the full list of types in the specification.

Example of KTAP Metadata:

 KTAP version 2
 #:ktap_test: main
 #:ktap_arch: uml
 1..1
 KTAP version 2
 #:ktap_test: suite_1
 #:ktap_subsystem: example
 #:ktap_test_file: lib/test.c
 1..2
 ok 1 test_1
 #:ktap_test: test_2
 #:ktap_speed: very_slow
 # test_2 has begun
 #:custom_is_flaky: true
 ok 2 test_2
 # suite_1 has passed
 ok 1 suite_1

The changes to the KTAP specification outline the format, location, and
different types of metadata.

Reviewed-by: Kees Cook 
Reviewed-by: David Gow 
Signed-off-by: Rae Moar 
---
Note this version is in response to comments made off the list asking for
more explanation on inheritance and edge cases.

Changes since v3:
- Add two metadata ktap_config and ktap_id
- Add section on metadata inheritance
- Add edge case examples

 Documentation/dev-tools/ktap.rst | 248 ++-
 1 file changed, 244 insertions(+), 4 deletions(-)

diff --git a/Documentation/dev-tools/ktap.rst b/Documentation/dev-tools/ktap.rst
index ff77f4aaa6ef..55bc43cd5aea 100644
--- a/Documentation/dev-tools/ktap.rst
+++ b/Documentation/dev-tools/ktap.rst
@@ -17,19 +17,22 @@ KTAP test results describe a series of tests (which may be nested: i.e., test
 can have subtests), each of which can contain both diagnostic data -- e.g., log
 lines -- and a final result. The test structure and results are
 machine-readable, whereas the diagnostic data is unstructured and is there to
-aid human debugging.
+aid human debugging. Since version 2, tests can also contain metadata which
+consists of important supplemental test information and can be
+machine-readable.
+
+KTAP output is built from five different types of lines:
 
-KTAP output is built from four different types of lines:
 - Version lines
 - Plan lines
 - Test case result lines
 - Diagnostic lines
+- Metadata lines
 
 In general, valid KTAP output should also form valid TAP output, but some
 information, in particular nested test results, may be lost. Also note that
 there is a stagnant draft specification for TAP14, KTAP diverges from this in
-a couple of places (notably the "Subtest" header), which are described where
-relevant later in this document.
+a couple of places, which are described where relevant later in this document.
 
 Version lines
-------------
even if they do not start with a "#": this is to capture any other useful
 kernel output which may help debug the test. It is nevertheless recommended
 that tests always prefix any diagnostic output they have with a "#" character.
 
+KTAP metadata lines
+-------------------
+
+KTAP metadata lines are used to include and easily identify important
+supplemental test information in KTAP. These lines may appear similar to
+diagnostic lines. The format of metadata lines is below:
+
+.. code-block:: none
+
+   #:<prefix>_<name>: <value>
+
+The <prefix> indicates where to find the specification for the type of
+metadata, such as the name of a test framework or "ktap" to indicate this
+specification. The list of currently approved prefixes and where to find the
+documentation of the metadata types is below. Note any metadata type that does
+not use a prefix from the list below must use the prefix "custom".
+
+Current List of Approved Prefixes:
+
+- ``ktap``: See Types of KTAP Metadata below for the list of metadata types.
+
+The format of <name> and <value> varies based on the type. See the
+individual specification. For "custom" types the <name> can be any
+string excluding ":", spaces, or newline characters and the <value> can be any
+string.
+
+**Location:**
+
+The first KTAP metadata line for a test must be "#:ktap_test: <test name>",
+which acts as a header to associate metadata with the correct test. Metadata
+for the main KTAP level uses the test name "main". A test's metadata ends
+with a "ktap_test" line for a different test.
+
+For test cases, the location of the metadata is between the prior test result
+line and the current test result line. For test suites, the location of the
+metadata is between the suite's version line and test plan line. For the main
+level, the location of the metadata is between the main version line and main
+test plan line. See the example below.
+

Re: [PATCH v3 00/15] Add support for suppressing warning backtraces

2024-04-03 Thread Kees Cook
On Wed, Apr 03, 2024 at 06:19:21AM -0700, Guenter Roeck wrote:
> Some unit tests intentionally trigger warning backtraces by passing bad
> parameters to kernel API functions. Such unit tests typically check the
> return value from such calls, not the existence of the warning backtrace.
> 
> Such intentionally generated warning backtraces are neither desirable
> nor useful for a number of reasons.
> - They can result in overlooked real problems.
> - A warning that suddenly starts to show up in unit tests needs to be
>   investigated and has to be marked to be ignored, for example by
>   adjusting filter scripts. Such filters are ad-hoc because there is
>   no real standard format for warnings. On top of that, such filter
>   scripts would require constant maintenance.
> 
> One option to address the problem would be to add messages such as "expected
> warning backtraces start / end here" to the kernel log.  However, that
> would again require filter scripts, it might result in missing real
> problematic warning backtraces triggered while the test is running, and
> the irrelevant backtrace(s) would still clog the kernel log.
> 
> Solve the problem by providing a means to identify and suppress specific
> warning backtraces while executing test code. Support suppressing multiple
> backtraces while at the same time limiting changes to generic code to the
> absolute minimum. Architecture specific changes are kept at minimum by
> retaining function names only if both CONFIG_DEBUG_BUGVERBOSE and
> CONFIG_KUNIT are enabled.
> 
> The first patch of the series introduces the necessary infrastructure.
> The second patch introduces support for counting suppressed backtraces.
> This capability is used in patch three to implement unit tests.
> Patch four documents the new API.
> The next two patches add support for suppressing backtraces in drm_rect
> and dev_addr_lists unit tests. These patches are intended to serve as
> examples for the use of the functionality introduced with this series.
> The remaining patches implement the necessary changes for all
> architectures with GENERIC_BUG support.
> 
> With CONFIG_KUNIT enabled, image size increase with this series applied is
> approximately 1%. The image size increase (and with it the functionality
> introduced by this series) can be avoided by disabling
> CONFIG_KUNIT_SUPPRESS_BACKTRACE.
> 
> This series is based on the RFC patch and subsequent discussion at
> https://patchwork.kernel.org/project/linux-kselftest/patch/02546e59-1afe-4b08-ba81-d94f3b691c9a@moroto.mountain/
> and offers a more comprehensive solution of the problem discussed there.
> 
> Design note:
>   Function pointers are only added to the __bug_table section if both
>   CONFIG_KUNIT_SUPPRESS_BACKTRACE and CONFIG_DEBUG_BUGVERBOSE are enabled
>   to avoid image size increases if CONFIG_KUNIT is disabled. There would be
>   some benefits to adding those pointers all the time (reduced complexity,
>   ability to display function names in BUG/WARNING messages). That change,
>   if desired, can be made later.
> 
> Checkpatch note:
>   Remaining checkpatch errors and warnings were deliberately ignored.
>   Some are triggered by matching coding style or by comments interpreted
>   as code, others by assembler macros which are disliked by checkpatch.
>   Suggestions for improvements are welcome.
> 
> Changes since RFC:
> - Introduced CONFIG_KUNIT_SUPPRESS_BACKTRACE
> - Minor cleanups and bug fixes
> - Added support for all affected architectures
> - Added support for counting suppressed warnings
> - Added unit tests using those counters
> - Added patch to suppress warning backtraces in dev_addr_lists tests
> 
> Changes since v1:
> - Rebased to v6.9-rc1
> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
>   [I retained those tags since there have been no functional changes]
> - Introduced KUNIT_SUPPRESS_BACKTRACE configuration option, enabled by
>   default.
> 
> Changes since v2:
> - Rebased to v6.9-rc2
> - Added comments to drm warning suppression explaining why it is needed.
> - Added patch to move conditional code in arch/sh/include/asm/bug.h
>   to avoid kerneldoc warning
> - Added architecture maintainers to Cc: for architecture specific patches
> - No functional changes
> 
> 
> Guenter Roeck (15):
>   bug/kunit: Core support for suppressing warning backtraces
>   kunit: bug: Count suppressed warning backtraces
>   kunit: Add test cases for backtrace warning suppression
>   kunit: Add documentation for warning backtrace suppression API
>   drm: Suppress intentional warning backtraces in scaling unit tests
>   net: kunit: Suppress lock warning noise at end of dev_addr_lists tests
>   x86: Add support for suppressing warning backtraces
>   arm64: Add support for suppressing warning backtraces
>   loongarch: Add support for suppressing warning backtraces
>   parisc: Add support for suppressing warning backtraces
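For context, a unit test would use the suppression mechanism from this series roughly as follows. This is a sketch only: the macro names are taken from the posted patches and may change before merge, and the warned-about function is illustrative.

```c
/*
 * Sketch of the API from patches 1-4 (names illustrative, from the
 * posted series). The warning triggered inside the marked region is
 * suppressed, and the counter added in patch 2 lets the test verify
 * that the warning actually fired.
 */
DEFINE_SUPPRESSED_WARNING(clk_register);

static void example_test(struct kunit *test)
{
	KUNIT_START_SUPPRESSED_WARNING(clk_register);
	/* ... call clk_register() with deliberately bad parameters ... */
	KUNIT_END_SUPPRESSED_WARNING(clk_register);

	KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(clk_register), 1);
}
```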

Re: [PATCH] selftests: cgroup: skip test_cgcore_lesser_ns_open when cgroup2 mounted without nsdelegate

2024-04-03 Thread Tejun Heo
On Wed, Mar 27, 2024 at 10:44:37AM +0800, Tianchen Ding wrote:
> The test case test_cgcore_lesser_ns_open only takes effect when cgroup2
> is mounted with "nsdelegate" mount option. If it misses this option, or
> is remounted without "nsdelegate", the test case will fail. For example,
> running bpf/test_cgroup_storage first, and then run cgroup/test_core will
> fail on test_cgcore_lesser_ns_open. Skip it if "nsdelegate" is not
> detected in cgroup2 mount options.
> 
> Fixes: bf35a7879f1d ("selftests: cgroup: Test open-time cgroup namespace 
> usage for migration checks")
> Signed-off-by: Tianchen Ding 

Applied to cgroup/for-6.10.

Thanks.

-- 
tejun



Re: [PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

2024-04-03 Thread John Stultz
On Wed, Apr 3, 2024 at 12:10 PM Thomas Gleixner  wrote:
>
> On Wed, Apr 03 2024 at 11:16, John Stultz wrote:
> > On Wed, Apr 3, 2024 at 9:32 AM Thomas Gleixner  wrote:
> > Thanks for this, Thomas!
> >
> > Just FYI: testing with 6.1, the test no longer hangs, but I don't see
> > the SKIP behavior. It just fails:
> > not ok 6 check signal distribution
> > # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0
> >
> > I've not had time yet to dig into what's going on, but let me know if
> > you need any further details.
>
> That's weird. I ran it on my laptop with 6.1.y ...
>
> What kind of machine is that?

I was running it in a VM.

Interestingly, with 64 CPUs it sometimes shows the skip behavior, but
with 4 CPUs it seems to always fail.

thanks
-john



Re: [PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

2024-04-03 Thread Thomas Gleixner
On Wed, Apr 03 2024 at 11:16, John Stultz wrote:
> On Wed, Apr 3, 2024 at 9:32 AM Thomas Gleixner  wrote:
> Thanks for this, Thomas!
>
> Just FYI: testing with 6.1, the test no longer hangs, but I don't see
> the SKIP behavior. It just fails:
> not ok 6 check signal distribution
> # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0
>
> I've not had time yet to dig into what's going on, but let me know if
> you need any further details.

That's weird. I ran it on my laptop with 6.1.y ...

What kind of machine is that?



Re: [PATCH bpf-next v5 1/6] bpf/helpers: introduce sleepable bpf_timers

2024-04-03 Thread Alexei Starovoitov
On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
 wrote:
> > > goto out;
> > > }
> > > +   spin_lock(&t->sleepable_lock);
> > > drop_prog_refcnt(t);
> > > +   spin_unlock(&t->sleepable_lock);
> >
> > this also looks odd.
>
> I basically need to protect "t->prog = NULL;" from happening while
> bpf_timer_work_cb is setting up the bpf program to be run.

Ok. I think I understand the race you're trying to fix.
The bpf_timer_cancel_and_free() is doing
cancel_work()
and proceeds with
kfree_rcu(t, rcu);

That's the only race and these extra locks don't help.

The t->prog = NULL is nothing to worry about.
The bpf_timer_work_cb() might still see callback_fn == NULL
"when it's being setup" and it's ok.
These locks don't help that.

I suggest to drop sleepable_lock everywhere.
READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
Add rcu_read_lock_trace() before calling bpf prog.

The race to fix is above 'cancel_work + kfree_rcu'
since kfree_rcu might free 'struct bpf_hrtimer *t'
while the work is pending and work_queue internal
logic might UAF struct work_struct work.
By the time it may luckily enter bpf_timer_work_cb() it's too late.
The argument 'struct work_struct *work' might already be freed.

To fix this problem, how about the following:
don't call kfree_rcu and instead queue the work to free it.
After cancel_work(&t->work); the work_struct can be reused.
So set it up to call "freeing callback" and do
schedule_work(&t->work);

There is a big assumption here that new work won't be
executed before cancelled work completes.
Need to check with wq experts.

Another approach is to do something smart with
cancel_work() return code.
If it returns true set a flag inside bpf_hrtimer and
make bpf_timer_work_cb() free(t) after bpf prog finishes.

> Also, side note: if anyone feels like it would go faster to fix those
> races by themselves instead of teaching me how to properly do it, this
> is definitely fine by me :)

Most of the time goes into analyzing and thinking :)
Whoever codes it doesn't speed things much.
Pls do another respin if you still have cycles to work on it.



Re: [PATCH v1 2/2] RISC-V: drop SOC_VIRT for ARCH_VIRT

2024-04-03 Thread Palmer Dabbelt

On Tue, 05 Mar 2024 10:37:06 PST (-0800), Conor Dooley wrote:

From: Conor Dooley 

The ARCH_ and SOC_ versions of this symbol have persisted for quite a
while now in parallel. Generated .config files from previous LTS kernels
should have both. Finally remove SOC_VIRT and update all config files
using it.

Signed-off-by: Conor Dooley 
---
I had a 1.5 year old ack from Jason that I dropped due to the passage of
time.

CC: Paul Walmsley 
CC: Palmer Dabbelt 
CC: Albert Ou 
CC: Brendan Higgins 
CC: David Gow 
CC: Rae Moar 
CC: "Jason A. Donenfeld" 
CC: Shuah Khan 
CC: linux-ri...@lists.infradead.org
CC: linux-ker...@vger.kernel.org
CC: linux-kselftest@vger.kernel.org
CC: kunit-...@googlegroups.com
CC: wiregu...@lists.zx2c4.com
CC: net...@vger.kernel.org
---
 arch/riscv/Kconfig.socs| 3 ---
 arch/riscv/configs/defconfig   | 2 +-
 arch/riscv/configs/nommu_virt_defconfig| 2 +-
 tools/testing/kunit/qemu_configs/riscv.py  | 2 +-
 tools/testing/selftests/wireguard/qemu/arch/riscv32.config | 2 +-
 tools/testing/selftests/wireguard/qemu/arch/riscv64.config | 2 +-
 6 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/Kconfig.socs b/arch/riscv/Kconfig.socs
index e85ffb63c48d..dcbfb659839c 100644
--- a/arch/riscv/Kconfig.socs
+++ b/arch/riscv/Kconfig.socs
@@ -52,9 +52,6 @@ config ARCH_THEAD
  This enables support for the RISC-V based T-HEAD SoCs.

 config ARCH_VIRT
-   def_bool SOC_VIRT
-
-config SOC_VIRT
bool "QEMU Virt Machine"
select CLINT_TIMER if RISCV_M_MODE
select POWER_RESET
diff --git a/arch/riscv/configs/defconfig b/arch/riscv/configs/defconfig
index ab3bab313d56..8d46a9137b1e 100644
--- a/arch/riscv/configs/defconfig
+++ b/arch/riscv/configs/defconfig
@@ -32,7 +32,7 @@ CONFIG_ARCH_SOPHGO=y
 CONFIG_SOC_STARFIVE=y
 CONFIG_ARCH_SUNXI=y
 CONFIG_ARCH_THEAD=y
-CONFIG_SOC_VIRT=y
+CONFIG_ARCH_VIRT=y
 CONFIG_SMP=y
 CONFIG_HOTPLUG_CPU=y
 CONFIG_PM=y
diff --git a/arch/riscv/configs/nommu_virt_defconfig 
b/arch/riscv/configs/nommu_virt_defconfig
index b794e2f8144e..de8143d1f738 100644
--- a/arch/riscv/configs/nommu_virt_defconfig
+++ b/arch/riscv/configs/nommu_virt_defconfig
@@ -24,7 +24,7 @@ CONFIG_EXPERT=y
 CONFIG_SLUB=y
 CONFIG_SLUB_TINY=y
 # CONFIG_MMU is not set
-CONFIG_SOC_VIRT=y
+CONFIG_ARCH_VIRT=y
 CONFIG_NONPORTABLE=y
 CONFIG_SMP=y
CONFIG_CMDLINE="root=/dev/vda rw earlycon=uart8250,mmio,0x10000000,115200n8 
console=ttyS0"
diff --git a/tools/testing/kunit/qemu_configs/riscv.py 
b/tools/testing/kunit/qemu_configs/riscv.py
index 12a1d525978a..c87758030ff7 100644
--- a/tools/testing/kunit/qemu_configs/riscv.py
+++ b/tools/testing/kunit/qemu_configs/riscv.py
@@ -13,7 +13,7 @@ if not os.path.isfile(OPENSBI_PATH):

 QEMU_ARCH = QemuArchParams(linux_arch='riscv',
   kconfig='''
-CONFIG_SOC_VIRT=y
+CONFIG_ARCH_VIRT=y
 CONFIG_SERIAL_8250=y
 CONFIG_SERIAL_8250_CONSOLE=y
 CONFIG_SERIAL_OF_PLATFORM=y
diff --git a/tools/testing/selftests/wireguard/qemu/arch/riscv32.config 
b/tools/testing/selftests/wireguard/qemu/arch/riscv32.config
index 2fc36efb166d..2500eaa9b469 100644
--- a/tools/testing/selftests/wireguard/qemu/arch/riscv32.config
+++ b/tools/testing/selftests/wireguard/qemu/arch/riscv32.config
@@ -2,7 +2,7 @@ CONFIG_NONPORTABLE=y
 CONFIG_ARCH_RV32I=y
 CONFIG_MMU=y
 CONFIG_FPU=y
-CONFIG_SOC_VIRT=y
+CONFIG_ARCH_VIRT=y
 CONFIG_SERIAL_8250=y
 CONFIG_SERIAL_8250_CONSOLE=y
 CONFIG_SERIAL_OF_PLATFORM=y
diff --git a/tools/testing/selftests/wireguard/qemu/arch/riscv64.config 
b/tools/testing/selftests/wireguard/qemu/arch/riscv64.config
index dc266f3b1915..29a67ac67766 100644
--- a/tools/testing/selftests/wireguard/qemu/arch/riscv64.config
+++ b/tools/testing/selftests/wireguard/qemu/arch/riscv64.config
@@ -1,7 +1,7 @@
 CONFIG_ARCH_RV64I=y
 CONFIG_MMU=y
 CONFIG_FPU=y
-CONFIG_SOC_VIRT=y
+CONFIG_ARCH_VIRT=y
 CONFIG_SERIAL_8250=y
 CONFIG_SERIAL_8250_CONSOLE=y
 CONFIG_SERIAL_OF_PLATFORM=y


Acked-by: Palmer Dabbelt 



Re: [PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

2024-04-03 Thread John Stultz
On Wed, Apr 3, 2024 at 9:32 AM Thomas Gleixner  wrote:
> Subject: selftests/timers/posix_timers: Make signal distribution test less 
> fragile
> From: Thomas Gleixner 
>
> The signal distribution test has a tendency to hang for a long time as the
> signal delivery is not really evenly distributed. In fact it might never be
> distributed across all threads ever in the way it is written.
>
> Address this by:
>
>1) Adding a timeout which aborts the test
>
>2) Letting the test threads exit once they got a signal instead of
>   running continuously. That ensures that the other threads will
>   have a chance to expire the timer and get the signal.
>
>3) Adding a detection whether all signals arrived at the main thread,
>   which allows to run the test on older kernels and emit 'SKIP'.
>
> While at it get rid of the pointless atomic operation on the thread local
> variable in the signal handler.
>
> Signed-off-by: Thomas Gleixner 

Thanks for this, Thomas!

Just FYI: testing with 6.1, the test no longer hangs, but I don't see
the SKIP behavior. It just fails:
not ok 6 check signal distribution
# Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0

I've not had time yet to dig into what's going on, but let me know if
you need any further details.

thanks
-john



Re: [RFC PATCH net-next v8 06/14] page_pool: convert to use netmem

2024-04-03 Thread Simon Horman
On Tue, Apr 02, 2024 at 05:20:43PM -0700, Mina Almasry wrote:
> Abstract the memory type from the page_pool so we can later add support
> for new memory types. Convert the page_pool to use the new netmem type
> abstraction, rather than use struct page directly.
> 
> As of this patch the netmem type is a no-op abstraction: it's always a
> struct page underneath. All the page pool internals are converted to
> use struct netmem instead of struct page, and the page pool now exports
> 2 APIs:
> 
> 1. The existing struct page API.
> 2. The new struct netmem API.
> 
> Keeping the existing API is transitional; we do not want to refactor all
> the current drivers using the page pool at once.
> 
> The netmem abstraction is currently a no-op. The page_pool uses
> page_to_netmem() to convert allocated pages to netmem, and uses
> netmem_to_page() to convert the netmem back to pages to pass to mm APIs.
> 
> Follow up patches to this series add non-paged netmem support to the
> page_pool. This change is factored out on its own to limit the code
> churn to this 1 patch, for ease of code review.
> 
> Signed-off-by: Mina Almasry 

...

> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h

...

> @@ -170,9 +172,10 @@ static inline void *page_pool_alloc_va(struct page_pool 
> *pool,
>   struct page *page;
>  
>   /* Mask off __GFP_HIGHMEM to ensure we can use page_address() */
> - page = page_pool_alloc(pool, &offset, size, gfp & ~__GFP_HIGHMEM);
> + page = netmem_to_page(
> + page_pool_alloc(pool, &offset, size, gfp & ~__GFP_HIGHMEM));
>   if (unlikely(!page))
> - return NULL;
> + return 0;

Hi Mina,

This doesn't seem right, as the return type is a pointer rather than an
integer.

Flagged by Sparse.

>  
>   return page_address(page) + offset;
>  }



Re: [PATCH net-next 7/7] testing: net-drv: add a driver test for stats reporting

2024-04-03 Thread Petr Machata


Jakub Kicinski  writes:

> On Wed, 3 Apr 2024 10:58:19 +0200 Petr Machata wrote:
>> Also, it's not clear what "del thing" should do in that context, because
>> if cfg also keeps a reference, __del__ won't get called. There could be
>> a direct method, like thing.exit() or whatever, but then you need
>> bookkeeping so as not to clean up the second time through cfg. It's the
>> less straightforward way of going about it IMHO.
>
> I see, having read up on what del actually does - "del thing" would
> indeed not work here.
>
>> I know that I must sound like a broken record at this point, but look:
>> 
>> with build("ip link set dev %s master %s" % (swp1, h1),
>>"ip link set dev %s nomaster" % swp1) as thing:
>> ... some code which may rise ...
>> ... more code, interface detached, `thing' gone ...
>> 
>> It's just as concise, makes it very clear where the device is part of
>> the bridge and where not anymore, and does away with the intricacies of
>> lifetime management.
>
> My experience [1] is that with "with" we often end up writing tests
> like this:
>
>   def test():
>   with a() as bunch,
>of() as things:
>   ... entire body of the test indented ...
>
> [1] https://github.com/kuba-moo/linux/blob/psp/tools/net/ynl/psp.py

Yeah, that does end up happening. I think there are a couple places
where you could have folded several withs in one, but it is going to be
indented, yeah.

But you end up indenting for try: finally: to make the del work reliably
anyway, so it's kinda lose/lose in that regard.

> Nothing wrong with that. I guess the question in my mind is whether
> we're aiming for making the tests "pythonic" (in which case "with"
> definitely wins), or more of a "bash with classes" style trying to
> avoid any constructs people may have to google. I'm on the fence on
> that one, as the del example proves my python expertise is not high.
> OTOH people who prefer bash will continue to write bash tests,
> so maybe we don't have to worry about non-experts too much. Dunno.

What I'm saying is, bash is currently a bit of a mess when it comes to
cleanups. It's hard to get right, annoying to review, and sometimes
individual cases add state that they don't unwind in cleanup() but only
later in the function, so when you C-c half-way through such case, stuff
stays behind.

Python has tools to just magic all this away.



Re: [PATCH v2] KVM: selftests: Fix build error due to assert in dirty_log_test

2024-04-03 Thread Sean Christopherson
On Wed, Apr 03, 2024, Raghavendra Rao Ananta wrote:
> The commit e5ed6c922537 ("KVM: selftests: Fix a semaphore imbalance in
> the dirty ring logging test") backported the fix from v6.8 to stable
> v6.1. However, since the patch uses 'TEST_ASSERT_EQ()', which doesn't
> exist on v6.1, the following build error is seen:
> 
> dirty_log_test.c:775:2: error: call to undeclared function
> 'TEST_ASSERT_EQ'; ISO C99 and later do not support implicit function
> declarations [-Wimplicit-function-declaration]
> TEST_ASSERT_EQ(sem_val, 0);
> ^
> 1 error generated.
> 
> Replace the macro with its equivalent, 'ASSERT_EQ()' to fix the issue.
> 
> Fixes: e5ed6c922537 ("KVM: selftests: Fix a semaphore imbalance in the dirty 
> ring logging test")
> Cc: 
> Signed-off-by: Raghavendra Rao Ananta 

Just to be super explicit, this is specifically for 6.1.y.

Acked-by: Sean Christopherson 

> ---
>  tools/testing/selftests/kvm/dirty_log_test.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
> b/tools/testing/selftests/kvm/dirty_log_test.c
> index ec40a33c29fd..711b9e4d86aa 100644
> --- a/tools/testing/selftests/kvm/dirty_log_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> @@ -772,9 +772,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>* verification of all iterations.
>*/
>   sem_getvalue(&sem_vcpu_stop, &sem_val);
> - TEST_ASSERT_EQ(sem_val, 0);
> + ASSERT_EQ(sem_val, 0);
>   sem_getvalue(&sem_vcpu_cont, &sem_val);
> - TEST_ASSERT_EQ(sem_val, 0);
> + ASSERT_EQ(sem_val, 0);
>  
>   pthread_create(&vcpu_thread, NULL, vcpu_worker, vcpu);
>  
> 
> base-commit: e5cd595e23c1a075359a337c0e5c3a4f2dc28dd1
> -- 
> 2.44.0.478.gd926399ef9-goog
> 



[PATCH v2] KVM: selftests: Fix build error due to assert in dirty_log_test

2024-04-03 Thread Raghavendra Rao Ananta
The commit e5ed6c922537 ("KVM: selftests: Fix a semaphore imbalance in
the dirty ring logging test") backported the fix from v6.8 to stable
v6.1. However, since the patch uses 'TEST_ASSERT_EQ()', which doesn't
exist on v6.1, the following build error is seen:

dirty_log_test.c:775:2: error: call to undeclared function
'TEST_ASSERT_EQ'; ISO C99 and later do not support implicit function
declarations [-Wimplicit-function-declaration]
TEST_ASSERT_EQ(sem_val, 0);
^
1 error generated.

Replace the macro with its equivalent, 'ASSERT_EQ()' to fix the issue.

Fixes: e5ed6c922537 ("KVM: selftests: Fix a semaphore imbalance in the dirty 
ring logging test")
Cc: 
Signed-off-by: Raghavendra Rao Ananta 
---
 tools/testing/selftests/kvm/dirty_log_test.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index ec40a33c29fd..711b9e4d86aa 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -772,9 +772,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 * verification of all iterations.
 */
	sem_getvalue(&sem_vcpu_stop, &sem_val);
-   TEST_ASSERT_EQ(sem_val, 0);
+   ASSERT_EQ(sem_val, 0);
	sem_getvalue(&sem_vcpu_cont, &sem_val);
-   TEST_ASSERT_EQ(sem_val, 0);
+   ASSERT_EQ(sem_val, 0);
 
	pthread_create(&vcpu_thread, NULL, vcpu_worker, vcpu);
 

base-commit: e5cd595e23c1a075359a337c0e5c3a4f2dc28dd1
-- 
2.44.0.478.gd926399ef9-goog




Re: Subject: [PATCH] Add test for more file systems in landlock - ext4

2024-04-03 Thread Mickaël Salaün
On Tue, Apr 02, 2024 at 01:37:44PM +0530, Saasha Gupta wrote:
> Date: Mon, 2 Apr 2024 19:59:56 +0530
> 
> RE: This patch is now properly preformatted.
> 
> Landlock LSM, a part of the security subsystem, has some tests in place
> for synthetic filesystems such as tmpfs, proc, sysfs, etc. The goal of
> the new issue, and hence this patch is to add tests for non synthetic
> file systems, such as ext4, btrfs, etc

I agree with Julia's review.

> 
> This patch adds tests for the ext4 file system. This includes creation
> of a loop device (test-ext4.img) and formatting with mkfs.
> 
> Signed-off-by: Saasha Gupta 
> ---
>  tools/testing/selftests/landlock/fs_test.c | 65 ++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/tools/testing/selftests/landlock/fs_test.c 
> b/tools/testing/selftests/landlock/fs_test.c
> index 9a6036fbf..b2f2cd5a5 100644
> --- a/tools/testing/selftests/landlock/fs_test.c
> +++ b/tools/testing/selftests/landlock/fs_test.c
> @@ -4675,6 +4675,14 @@ FIXTURE_VARIANT_ADD(layout3_fs, hostfs) {
>   .cwd_fs_magic = HOSTFS_SUPER_MAGIC,
>  };
>  
> +/* Add more filesystems */
> +FIXTURE_VARIANT_ADD(layout3_fs, ext4) {
> + .mnt = {
> + .type = "ext4",
> + },
> + .file_path = TMP_DIR "/dir/file",
> +};
> +
>  FIXTURE_SETUP(layout3_fs)
>  {
>   struct stat statbuf;
> @@ -4728,6 +4736,63 @@ FIXTURE_SETUP(layout3_fs)
>   self->has_created_file = true;
>   clear_cap(_metadata, CAP_DAC_OVERRIDE);
>   }
> +
> + /* Create non synthetic file system - ext4 */
> + if (stat(self->dir_path, &statbuf) != 0) {

dir_path should already exist with previous code right?

> + pid_t pid = fork();
> +
> + if (pid == -1) {
> + perror("Failed to fork");
> + exit(EXIT_FAILURE);
> + } else if (pid == 0) {
> + static const char *fallocate_argv[] = { "fallocate", "--length",
> +"4M", "test-ext4.img",
> +NULL };
> + execvp(fallocate_argv[0], fallocate_argv);

Using system() would makes this much simpler (see net_test.c).
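A sketch of what that simplification might look like; the helper name, paths and flags are illustrative only, and the ASSERT_EQ() error handling assumes the kselftest harness used elsewhere in fs_test.c:

```c
/*
 * Sketch only: following the system() pattern Mickaël points to,
 * the whole fork/execvp/waitpid dance collapses to two calls.
 */
static void create_ext4_image(struct __test_metadata *const _metadata)
{
	ASSERT_EQ(0, system("fallocate --length 4M " TMP_DIR "/test-ext4.img"));
	ASSERT_EQ(0, system("mkfs.ext4 -q -F " TMP_DIR "/test-ext4.img"));
}
```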

> + perror("execvp failed");
> + exit(EXIT_FAILURE);
> + } else {
> + int status;
> +
> + if (waitpid(pid, &status, 0) == -1) {
> + perror("waitpid failed");
> + exit(EXIT_FAILURE);
> + }
> + if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
> + TH_LOG("Failed to create ext4 filesystem image: 
> fallocate failed");
> + exit(EXIT_FAILURE);
> + }
> + }
> + }
> +
> + /* Format and mount non synthetic file system - ext4 */
> + if (stat("mnt", &statbuf) != 0) {

"mnt" never exists, so this would always run this code...

> + pid_t pid = fork();
> +
> + if (pid == -1) {
> + perror("Failed to fork");
> + exit(EXIT_FAILURE);
> + } else if (pid == 0) {
> + static const char *mkfs_argv[] = { "mkfs.ext4", "-q",
> +   "test-ext4.img", "mnt", NULL };
> + execvp(mkfs_argv[0], mkfs_argv);
> + perror("execvp failed");
> + exit(EXIT_FAILURE);
> + } else {
> + int status;
> +
> + if (waitpid(pid, &status, 0) == -1) {
> + perror("waitpid failed");
> + exit(EXIT_FAILURE);
> + }
> + if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
> + TH_LOG("Failed to format ext4 filesystem image: 
> mkfs.ext4 failed");
> + exit(EXIT_FAILURE);
> + }
> + }
> + }
>  }
>  
>  FIXTURE_TEARDOWN(layout3_fs)
> -- 
> 2.44.0
> 
> 
> 



[PATCH] KVM: selftests: Fix build error due to assert in dirty_log_test

2024-04-03 Thread Raghavendra Rao Ananta
The commit e5ed6c922537 ("KVM: selftests: Fix a semaphore imbalance in
the dirty ring logging test") backported the fix from v6.8 to stable
v6.1. However, since the patch uses 'TEST_ASSERT_EQ()', which doesn't
exist on v6.1, the following build error is seen:

dirty_log_test.c:775:2: error: call to undeclared function
'TEST_ASSERT_EQ'; ISO C99 and later do not support implicit function
declarations [-Wimplicit-function-declaration]
TEST_ASSERT_EQ(sem_val, 0);
^
1 error generated.

Replace the macro with its equivalent, 'ASSERT_EQ()' to fix the issue.

Fixes: e5ed6c922537 ("KVM: selftests: Fix a semaphore imbalance in the dirty 
ring logging test")
Cc: 
Signed-off-by: Raghavendra Rao Ananta 
Change-Id: I52c2c28d962e482bb4f40f285229a2465ed59d7e
---
 tools/testing/selftests/kvm/dirty_log_test.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index ec40a33c29fd..711b9e4d86aa 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -772,9 +772,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 * verification of all iterations.
 */
	sem_getvalue(&sem_vcpu_stop, &sem_val);
-   TEST_ASSERT_EQ(sem_val, 0);
+   ASSERT_EQ(sem_val, 0);
	sem_getvalue(&sem_vcpu_cont, &sem_val);
-   TEST_ASSERT_EQ(sem_val, 0);
+   ASSERT_EQ(sem_val, 0);
 
	pthread_create(&vcpu_thread, NULL, vcpu_worker, vcpu);
 

base-commit: e5cd595e23c1a075359a337c0e5c3a4f2dc28dd1
-- 
2.44.0.478.gd926399ef9-goog




Re: [PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

2024-04-03 Thread Thomas Gleixner
On Wed, Apr 03 2024 at 17:43, Thomas Gleixner wrote:
> On Wed, Apr 03 2024 at 17:03, Oleg Nesterov wrote:
>>
>> Why distribution_thread() can't simply exit if got_signal != 0 ?
>>
>> See https://lore.kernel.org/all/20230128195641.ga14...@redhat.com/
>
> Indeed. It's too obvious :)

Revised simpler version below.

Thanks,

tglx
---
Subject: selftests/timers/posix_timers: Make signal distribution test less 
fragile
From: Thomas Gleixner 

The signal distribution test has a tendency to hang for a long time as the
signal delivery is not really evenly distributed. In fact it might never be
distributed across all threads ever in the way it is written.

Address this by:

   1) Adding a timeout which aborts the test

   2) Letting the test threads exit once they got a signal instead of
  running continuously. That ensures that the other threads will
  have a chance to expire the timer and get the signal.

   3) Adding a detection whether all signals arrived at the main thread,
  which allows to run the test on older kernels and emit 'SKIP'.

While at it get rid of the pointless atomic operation on the thread local
variable in the signal handler.

Signed-off-by: Thomas Gleixner 
---
 tools/testing/selftests/timers/posix_timers.c |   41 --
 1 file changed, 26 insertions(+), 15 deletions(-)

--- a/tools/testing/selftests/timers/posix_timers.c
+++ b/tools/testing/selftests/timers/posix_timers.c
@@ -184,18 +184,19 @@ static int check_timer_create(int which)
return 0;
 }
 
-int remain;
-__thread int got_signal;
+static int remain;
+static __thread int got_signal;
 
 static void *distribution_thread(void *arg)
 {
-   while (__atomic_load_n(&remain, __ATOMIC_RELAXED));
+   while (!done && !got_signal);
+
return NULL;
 }
 
 static void distribution_handler(int nr)
 {
-   if (!__atomic_exchange_n(&got_signal, 1, __ATOMIC_RELAXED))
+   if (++got_signal == 1)
__atomic_fetch_sub(&remain, 1, __ATOMIC_RELAXED);
 }
 
@@ -205,8 +206,6 @@ static void distribution_handler(int nr)
  */
 static int check_timer_distribution(void)
 {
-   int err, i;
-   timer_t id;
const int nthreads = 10;
pthread_t threads[nthreads];
struct itimerspec val = {
@@ -215,7 +214,11 @@ static int check_timer_distribution(void
.it_interval.tv_sec = 0,
.it_interval.tv_nsec = 1000 * 1000,
};
+   time_t start, now;
+   timer_t id;
+   int err, i;
 
+   done = 0;
remain = nthreads + 1;  /* worker threads + this thread */
signal(SIGALRM, distribution_handler);
err = timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
@@ -230,8 +233,7 @@ static int check_timer_distribution(void
}
 
for (i = 0; i < nthreads; i++) {
-   err = pthread_create(&threads[i], NULL, distribution_thread,
-NULL);
+   err = pthread_create(&threads[i], NULL, distribution_thread, 
NULL);
if (err) {
ksft_print_msg("Can't create thread: %s (%d)\n",
   strerror(errno), errno);
@@ -240,7 +242,18 @@ static int check_timer_distribution(void
}
 
/* Wait for all threads to receive the signal. */
-   while (__atomic_load_n(&remain, __ATOMIC_RELAXED));
+   now = start = time(NULL);
+   while (__atomic_load_n(&remain, __ATOMIC_RELAXED)) {
+   now = time(NULL);
+   if (now - start > 2)
+   break;
+   }
+   done = 1;
+
+   if (timer_delete(id)) {
+   ksft_perror("Can't delete timer\n");
+   return -1;
+   }
 
for (i = 0; i < nthreads; i++) {
err = pthread_join(threads[i], NULL);
@@ -251,12 +264,10 @@ static int check_timer_distribution(void
}
}
 
-   if (timer_delete(id)) {
-   ksft_perror("Can't delete timer");
-   return -1;
-   }
-
-   ksft_test_result_pass("check_timer_distribution\n");
+   if (__atomic_load_n(&remain, __ATOMIC_RELAXED) == nthreads)
+   ksft_test_result_skip("No signal distribution. Assuming old 
kernel\n");
+   else
+   ksft_test_result(now - start <= 2, "check signal 
distribution\n");
return 0;
 }
 



Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Valentin Schneider
On 03/04/24 10:47, Waiman Long wrote:
> On 4/3/24 10:26, Valentin Schneider wrote:
>> IIUC that was Thomas' suggestion [1], but I can't tell yet how bad it would
>> be to change cgroup_lock() to also do a cpus_read_lock().
>
> Changing the locking order is certainly doable. I have taken a cursory
> look at it and at least the following files need to be changed:
>
>   kernel/bpf/cgroup.c
>   kernel/cgroup/cgroup.c
>   kernel/cgroup/legacy_freezer.c
>   mm/memcontrol.c
>
> That requires a lot more testing to make sure that there won't be a
> regression. Given that the call to cgroup_transfer_tasks() should be
> rare these days as it will only apply in the case of cgroup v1 under
> certain conditions, I am not sure this requirement justifies making such
> extensive changes. So I kind of defer it until we reach a consensus that
> it is the right thing to do.
>

Yeah if we can avoid it initially I'd be up for it.

Just one thing that came to mind - there's no flushing of the
cpuset_migrate_tasks_workfn() work, so the scheduler might move tasks
itself before the cpuset does via:

  balance_push() ->__balance_push_cpu_stop() -> select_fallback_rq()

But, given the current placement of cpuset_wait_for_hotplug(), I believe
that's something we can already have, so we should be good.

> Cheers,
> Longman




Re: Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Valentin Schneider
On 03/04/24 16:54, Michal Koutný wrote:
> On Wed, Apr 03, 2024 at 04:26:38PM +0200, Valentin Schneider 
>  wrote:
>> Also, I gave Michal's patch a try and it looks like it's introducing a
>
> Thank you.
>
>>   cgroup_threadgroup_rwsem -> cpuset_mutex
>> ordering from
>>   cgroup_transfer_tasks_locked()
>>   `\
> > percpu_down_write(&cgroup_threadgroup_rwsem);
>> cgroup_migrate()
>> `\
>>   cgroup_migrate_execute()
>>   `\
>> ss->can_attach() // cpuset_can_attach()
>> `\
> >   mutex_lock(&cpuset_mutex);
>>
>> which is invalid, see below.
>
> _This_ should be the right order (cpuset_mutex inside
> cgroup_threadgroup_rwsem), at least in my mental model. Thus I missed
> that cpuset_mutex must have been taken somewhere higher up in the
> hotplug code (CPU 0 in the lockdep dump, I can't easily see where from)
> :-/.
>

If I got this right...

cpuset_hotplug_update_tasks()
`\
  mutex_lock(&cpuset_mutex);
  hotplug_update_tasks_legacy()
  `\
remove_tasks_in_empty_cpuset()
`\
  cgroup_transfer_tasks_locked()
  `\
percpu_down_write(&cgroup_threadgroup_rwsem);

But then that is also followed by:

cgroup_migrate()
`\
  cgroup_migrate_execute()
  `\
ss->can_attach() // cpuset_can_attach()
`\
  mutex_lock(&cpuset_mutex);

which doesn't look good...


Also, I missed that earlier, but this triggers:

  cgroup_transfer_tasks_locked() ~> lockdep_assert_held(&cpuset_mutex);

[   30.773092] WARNING: CPU: 2 PID: 24 at kernel/cgroup/cgroup-v1.c:112 
cgroup_transfer_tasks_locked+0x39f/0x450
[   30.773807] Modules linked in:
[   30.774063] CPU: 2 PID: 24 Comm: cpuhp/2 Not tainted 
6.9.0-rc1-00042-g844178b616c7-dirty #25
[   30.774672] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   30.775457] RIP: 0010:cgroup_transfer_tasks_locked+0x39f/0x450
[   30.775891] Code: 0f 85 70 ff ff ff 0f 1f 44 00 00 e9 6d ff ff ff be ff ff 
ff ff 48 c7 c7 48 82 d6 82 e8 5a 6a ec 00 85 c0 0f 85 6d fd ff ff 90 <0f> 0b 90 
e9 64 fd ff ff 48 8b bd e8 fe ff ff be 01 00 00 00 e8 78
[   30.777270] RSP: :c90e7c20 EFLAGS: 00010246
[   30.777813] RAX:  RBX: c90e7cb0 RCX: 
[   30.778443] RDX:  RSI: 82d68248 RDI: 888004a9a300
[   30.779142] RBP: c90e7d50 R08: 0001 R09: 
[   30.779889] R10: c90e7d70 R11: 0001 R12: 8880057c6040
[   30.780420] R13: 88800539f800 R14: 0001 R15: 0004
[   30.780951] FS:  () GS:88801f50() 
knlGS:
[   30.781561] CS:  0010 DS:  ES:  CR0: 80050033
[   30.781989] CR2: f7e6fe85 CR3: 064ac000 CR4: 06f0
[   30.782558] Call Trace:
[   30.782783]  <TASK>
[   30.782982]  ? __warn+0x87/0x180
[   30.783250]  ? cgroup_transfer_tasks_locked+0x39f/0x450
[   30.783644]  ? report_bug+0x164/0x190
[   30.783970]  ? handle_bug+0x3b/0x70
[   30.784288]  ? exc_invalid_op+0x17/0x70
[   30.784641]  ? asm_exc_invalid_op+0x1a/0x20
[   30.784992]  ? cgroup_transfer_tasks_locked+0x39f/0x450
[   30.785375]  ? __lock_acquire+0xe9d/0x16d0
[   30.785707]  ? cpuset_update_active_cpus+0x15a/0xca0
[   30.786074]  ? cpuset_update_active_cpus+0x782/0xca0
[   30.786443]  cpuset_update_active_cpus+0x782/0xca0
[   30.786816]  sched_cpu_deactivate+0x1ad/0x1d0
[   30.787148]  ? __pfx_sched_cpu_deactivate+0x10/0x10
[   30.787509]  cpuhp_invoke_callback+0x16b/0x6b0
[   30.787859]  ? cpuhp_thread_fun+0x56/0x240
[   30.788175]  cpuhp_thread_fun+0x1ba/0x240
[   30.788485]  smpboot_thread_fn+0xd8/0x1d0
[   30.788823]  ? __pfx_smpboot_thread_fn+0x10/0x10
[   30.789229]  kthread+0xce/0x100
[   30.789526]  ? __pfx_kthread+0x10/0x10
[   30.789876]  ret_from_fork+0x2f/0x50
[   30.790200]  ? __pfx_kthread+0x10/0x10
[   30.792341]  ret_from_fork_asm+0x1a/0x30
[   30.792716]  </TASK>

> Michal




Re: [PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

2024-04-03 Thread Thomas Gleixner
On Wed, Apr 03 2024 at 17:03, Oleg Nesterov wrote:
> On 04/03, Thomas Gleixner wrote:
>> The test is fragile as hell, as there is absolutely no guarantee that the
>> signal target distribution is as expected. The expectation is based on a
>> statistical assumption which does not really hold.
>
> Agreed. I too never liked this test-case.
>
> I forgot everything about this patch and test-case, I can't really read
> your patch right now (sorry), so I am sure I missed something, but
>
>>  static void *distribution_thread(void *arg)
>>  {
>> -while (__atomic_load_n(, __ATOMIC_RELAXED));
>> -return NULL;
>> +while (__atomic_load_n(, __ATOMIC_RELAXED) && !done) {
>> +if (got_signal)
>> +usleep(10);
>> +}
>> +
>> +return (void *)got_signal;
>>  }
>
> Why can't distribution_thread() simply exit if got_signal != 0?
>
> See https://lore.kernel.org/all/20230128195641.ga14...@redhat.com/

Indeed. It's too obvious :)



Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Waiman Long



On 4/3/24 10:56, Michal Koutný wrote:

On Wed, Apr 03, 2024 at 10:47:33AM -0400, Waiman Long  
wrote:

should be rare these days as it will only apply in the case of cgroup
v1 under certain conditions,

Could the migration simply be omitted in those special cases?

(Tasks remain in cpusets with empty cpusets -- that already happens
with the current patch before the workqueue is dispatched.)


The tasks should not be runnable if there are no CPUs left in their v1 
cpuset. Migrating them to a parent with runnable CPUs is the current 
behavior, which I don't want to break. Alternatively, we could force it 
to fall back to cgroup v2 behavior using the CPUs in their parent cpuset.


Cheers,
Longman



Michal





Re: [PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

2024-04-03 Thread Oleg Nesterov
On 04/03, Thomas Gleixner wrote:
>
> The test is fragile as hell, as there is absolutely no guarantee that the
> signal target distribution is as expected. The expectation is based on a
> statistical assumption which does not really hold.

Agreed. I too never liked this test-case.

I forgot everything about this patch and test-case, I can't really read
your patch right now (sorry), so I am sure I missed something, but

>  static void *distribution_thread(void *arg)
>  {
> - while (__atomic_load_n(, __ATOMIC_RELAXED));
> - return NULL;
> + while (__atomic_load_n(, __ATOMIC_RELAXED) && !done) {
> + if (got_signal)
> + usleep(10);
> + }
> +
> + return (void *)got_signal;
>  }

Why can't distribution_thread() simply exit if got_signal != 0?

See https://lore.kernel.org/all/20230128195641.ga14...@redhat.com/

Oleg.




Re: Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Michal Koutný
On Wed, Apr 03, 2024 at 10:47:33AM -0400, Waiman Long  
wrote:
> should be rare these days as it will only apply in the case of cgroup
> v1 under certain conditions,

Could the migration simply be omitted in those special cases?

(Tasks remain in cpusets with empty cpusets -- that already happens
with the current patch before the workqueue is dispatched.)

Michal


signature.asc
Description: PGP signature


Re: Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Michal Koutný
On Wed, Apr 03, 2024 at 04:26:38PM +0200, Valentin Schneider 
 wrote:
> Also, I gave Michal's patch a try and it looks like it's introducing a

Thank you.

>   cgroup_threadgroup_rwsem -> cpuset_mutex
> ordering from
>   cgroup_transfer_tasks_locked()
>   `\
> percpu_down_write(&cgroup_threadgroup_rwsem);
> cgroup_migrate()
> `\
>   cgroup_migrate_execute()
>   `\
> ss->can_attach() // cpuset_can_attach()
> `\
>   mutex_lock(&cpuset_mutex);
> 
> which is invalid, see below.

_This_ should be the right order (cpuset_mutex inside
cgroup_threadgroup_rwsem), at least in my mental model. Thus I missed
that cpuset_mutex must have been taken somewhere higher up in the
hotplug code (CPU 0 in the lockdep dump, I can't easily see where from)
:-/.

Michal


signature.asc
Description: PGP signature


Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Waiman Long

On 4/3/24 10:26, Valentin Schneider wrote:

On 03/04/24 09:38, Waiman Long wrote:

On 4/3/24 08:02, Michal Koutný wrote:

On Tue, Apr 02, 2024 at 11:30:11AM -0400, Waiman Long  
wrote:

Yes, there is a potential that a cpus_read_lock() may be called leading to
deadlock. So unless we reverse the current cgroup_mutex --> cpu_hotplug_lock
ordering, it is not safe to call cgroup_transfer_tasks() directly.

I see that cgroup_transfer_tasks() has the only user -- cpuset. What
about bending it for the specific use like:

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 34aaf0e87def..64deb7212c5c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -109,7 +109,7 @@ struct cgroup *cgroup_get_from_fd(int fd);
   struct cgroup *cgroup_v1v2_get_from_fd(int fd);

   int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
-int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
+int cgroup_transfer_tasks_locked(struct cgroup *to, struct cgroup *from);

   int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
   int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 520a11cb12f4..f97025858c7a 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -91,7 +91,8 @@ EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
*
* Return: %0 on success or a negative errno code on failure
*/
-int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
+int cgroup_transfer_tasks_locked(struct cgroup *to, struct cgroup *from)
   {
  DEFINE_CGROUP_MGCTX(mgctx);
  struct cgrp_cset_link *link;
@@ -106,9 +106,11 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup 
*from)
  if (ret)
  return ret;

-   cgroup_lock();
-
-   cgroup_attach_lock(true);
+   /* The locking rules serve specific purpose of v1 cpuset hotplug
+* migration, see hotplug_update_tasks_legacy() and
+* cgroup_attach_lock() */
+   lockdep_assert_held(&cpuset_mutex);
+   lockdep_assert_cpus_held();
+   percpu_down_write(&cgroup_threadgroup_rwsem);

  /* all tasks in @from are being moved, all csets are source */
   spin_lock_irq(&css_set_lock);
@@ -144,8 +146,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup 
*from)
  } while (task && !ret);
   out_err:
  cgroup_migrate_finish();
-   cgroup_attach_unlock(true);
-   cgroup_unlock();
+   percpu_up_write(&cgroup_threadgroup_rwsem);
  return ret;
   }

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 13d27b17c889..94fb8b26f038 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4331,7 +4331,7 @@ static void remove_tasks_in_empty_cpuset(struct cpuset 
*cs)
  nodes_empty(parent->mems_allowed))
  parent = parent_cs(parent);

-   if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) {
+   if (cgroup_transfer_tasks_locked(parent->css.cgroup, cs->css.cgroup)) {
  pr_err("cpuset: failed to transfer tasks out of empty cpuset ");
  pr_cont_cgroup_name(cs->css.cgroup);
  pr_cont("\n");
@@ -4376,21 +4376,9 @@ hotplug_update_tasks_legacy(struct cpuset *cs,

  /*
   * Move tasks to the nearest ancestor with execution resources,
-* This is full cgroup operation which will also call back into
-* cpuset. Execute it asynchronously using workqueue.
   */
-   if (is_empty && css_tryget_online(&cs->css)) {
-   struct cpuset_remove_tasks_struct *s;
-
-   s = kzalloc(sizeof(*s), GFP_KERNEL);
-   if (WARN_ON_ONCE(!s)) {
-   css_put(&cs->css);
-   return;
-   }
-
-   s->cs = cs;
-   INIT_WORK(&s->work, cpuset_migrate_tasks_workfn);
-   schedule_work(&s->work);
+   if (is_empty)
+   remove_tasks_in_empty_cpuset(cs);
  }
   }


It still won't work because of the possibility of multiple tasks
being involved in a circular locking dependency. The hotplug thread
always acquires cpu_hotplug_lock first before acquiring cpuset_mutex or
cgroup_mutex in this case (cpu_hotplug_lock --> cgroup_mutex). Other
tasks calling into cgroup code will acquire the pair in the order
cgroup_mutex --> cpu_hotplug_lock. This may lead to a deadlock if these
two locking sequences happen at the same time. Lockdep will certainly
spill out a splat because of this.
So unless we change all the relevant
cgroup code to the new cpu_hotplug_lock --> cgroup_mutex locking order,
the hotplug code can't call cgroup_transfer_tasks() directly.


IIUC that was Thomas' suggestion [1], but I can't tell yet how bad it would
be to change cgroup_lock() to also do a cpus_read_lock().


Changing the locking order is certainly doable. I have taken a cursory 
look at it and at least the following files need to be changed:


 

Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Valentin Schneider
On 03/04/24 09:38, Waiman Long wrote:
> On 4/3/24 08:02, Michal Koutný wrote:
>> On Tue, Apr 02, 2024 at 11:30:11AM -0400, Waiman Long  
>> wrote:
>>> Yes, there is a potential that a cpus_read_lock() may be called leading to
>>> deadlock. So unless we reverse the current cgroup_mutex --> cpu_hotplug_lock
>>> ordering, it is not safe to call cgroup_transfer_tasks() directly.
>> I see that cgroup_transfer_tasks() has the only user -- cpuset. What
>> about bending it for the specific use like:
>>
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 34aaf0e87def..64deb7212c5c 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -109,7 +109,7 @@ struct cgroup *cgroup_get_from_fd(int fd);
>>   struct cgroup *cgroup_v1v2_get_from_fd(int fd);
>>
>>   int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
>> -int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
>> +int cgroup_transfer_tasks_locked(struct cgroup *to, struct cgroup *from);
>>
>>   int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
>>   int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype 
>> *cfts);
>> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
>> index 520a11cb12f4..f97025858c7a 100644
>> --- a/kernel/cgroup/cgroup-v1.c
>> +++ b/kernel/cgroup/cgroup-v1.c
>> @@ -91,7 +91,8 @@ EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
>>*
>>* Return: %0 on success or a negative errno code on failure
>>*/
>> -int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>> +int cgroup_transfer_tasks_locked(struct cgroup *to, struct cgroup *from)
>>   {
>>  DEFINE_CGROUP_MGCTX(mgctx);
>>  struct cgrp_cset_link *link;
>> @@ -106,9 +106,11 @@ int cgroup_transfer_tasks(struct cgroup *to, struct 
>> cgroup *from)
>>  if (ret)
>>  return ret;
>>
>> -cgroup_lock();
>> -
>> -cgroup_attach_lock(true);
>> +/* The locking rules serve specific purpose of v1 cpuset hotplug
>> + * migration, see hotplug_update_tasks_legacy() and
>> + * cgroup_attach_lock() */
>> +lockdep_assert_held(&cpuset_mutex);
>> +lockdep_assert_cpus_held();
>> +percpu_down_write(&cgroup_threadgroup_rwsem);
>>
>>  /* all tasks in @from are being moved, all csets are source */
>>  spin_lock_irq(&css_set_lock);
>> @@ -144,8 +146,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct 
>> cgroup *from)
>>  } while (task && !ret);
>>   out_err:
>>  cgroup_migrate_finish();
>> -cgroup_attach_unlock(true);
>> -cgroup_unlock();
>> +percpu_up_write(&cgroup_threadgroup_rwsem);
>>  return ret;
>>   }
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 13d27b17c889..94fb8b26f038 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -4331,7 +4331,7 @@ static void remove_tasks_in_empty_cpuset(struct cpuset 
>> *cs)
>>  nodes_empty(parent->mems_allowed))
>>  parent = parent_cs(parent);
>>
>> -if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) {
>> +if (cgroup_transfer_tasks_locked(parent->css.cgroup, cs->css.cgroup)) {
>>  pr_err("cpuset: failed to transfer tasks out of empty cpuset ");
>>  pr_cont_cgroup_name(cs->css.cgroup);
>>  pr_cont("\n");
>> @@ -4376,21 +4376,9 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
>>
>>  /*
>>   * Move tasks to the nearest ancestor with execution resources,
>> - * This is full cgroup operation which will also call back into
>> - * cpuset. Execute it asynchronously using workqueue.
>>   */
>> -if (is_empty && css_tryget_online(&cs->css)) {
>> -struct cpuset_remove_tasks_struct *s;
>> -
>> -s = kzalloc(sizeof(*s), GFP_KERNEL);
>> -if (WARN_ON_ONCE(!s)) {
>> -css_put(&cs->css);
>> -return;
>> -}
>> -
>> -s->cs = cs;
>> -INIT_WORK(&s->work, cpuset_migrate_tasks_workfn);
>> -schedule_work(&s->work);
>> +if (is_empty)
>> +remove_tasks_in_empty_cpuset(cs);
>>  }
>>   }
>>
>
> It still won't work because of the possibility of multiple tasks
> being involved in a circular locking dependency. The hotplug thread
> always acquires cpu_hotplug_lock first before acquiring cpuset_mutex or
> cgroup_mutex in this case (cpu_hotplug_lock --> cgroup_mutex). Other
> tasks calling into cgroup code will acquire the pair in the order
> cgroup_mutex --> cpu_hotplug_lock. This may lead to a deadlock if these
> two locking sequences happen at the same time. Lockdep will certainly
> spill out a splat because of this.

> So unless we change all the relevant
> cgroup code to the new cpu_hotplug_lock --> cgroup_mutex locking order,
> the hotplug code can't call cgroup_transfer_tasks() directly.
>

IIUC that was Thomas' suggestion [1], but I can't tell yet how bad it would
be to change cgroup_lock() to also do a cpus_read_lock().

Re: [PATCH net-next 7/7] testing: net-drv: add a driver test for stats reporting

2024-04-03 Thread Jakub Kicinski
On Wed, 3 Apr 2024 10:58:19 +0200 Petr Machata wrote:
> Also, it's not clear what "del thing" should do in that context, because
> if cfg also keeps a reference, __del__ won't get called. There could be
> a direct method, like thing.exit() or whatever, but then you need
> bookkeeping so as not to clean up the second time through cfg. It's the
> less straightforward way of going about it IMHO.

I see, having read up on what del actually does - "del thing" would
indeed not work here.

> I know that I must sound like a broken record at this point, but look:
> 
> with build("ip link set dev %s master %s" % (swp1, h1),
>"ip link set dev %s nomaster" % swp1) as thing:
> ... some code which may raise ...
> ... more code, interface detached, `thing' gone ...
> 
> It's just as concise, makes it very clear where the device is part of
> the bridge and where not anymore, and does away with the intricacies of
> lifetime management.

My experience [1] is that with "with" we often end up writing tests
like this:

def test():
with a() as bunch,
 of() as things:
... entire body of the test indented ...

[1] https://github.com/kuba-moo/linux/blob/psp/tools/net/ynl/psp.py

Nothing wrong with that. I guess the question in my mind is whether
we're aiming for making the tests "pythonic" (in which case "with"
definitely wins), or more of a "bash with classes" style trying to
avoid any constructs people may have to google. I'm on the fence on
that one, as the del example proves my python expertise is not high.
OTOH people who prefer bash will continue to write bash tests,
so maybe we don't have to worry about non-experts too much. Dunno.



Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Waiman Long

On 4/3/24 08:02, Michal Koutný wrote:

On Tue, Apr 02, 2024 at 11:30:11AM -0400, Waiman Long  
wrote:

Yes, there is a potential that a cpus_read_lock() may be called leading to
deadlock. So unless we reverse the current cgroup_mutex --> cpu_hotplug_lock
ordering, it is not safe to call cgroup_transfer_tasks() directly.

I see that cgroup_transfer_tasks() has the only user -- cpuset. What
about bending it for the specific use like:

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 34aaf0e87def..64deb7212c5c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -109,7 +109,7 @@ struct cgroup *cgroup_get_from_fd(int fd);
  struct cgroup *cgroup_v1v2_get_from_fd(int fd);
  
  int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);

-int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
+int cgroup_transfer_tasks_locked(struct cgroup *to, struct cgroup *from);
  
  int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);

  int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 520a11cb12f4..f97025858c7a 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -91,7 +91,8 @@ EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
   *
   * Return: %0 on success or a negative errno code on failure
   */
-int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
+int cgroup_transfer_tasks_locked(struct cgroup *to, struct cgroup *from)
  {
DEFINE_CGROUP_MGCTX(mgctx);
struct cgrp_cset_link *link;
@@ -106,9 +106,11 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup 
*from)
if (ret)
return ret;
  
-	cgroup_lock();

-
-   cgroup_attach_lock(true);
+   /* The locking rules serve specific purpose of v1 cpuset hotplug
+* migration, see hotplug_update_tasks_legacy() and
+* cgroup_attach_lock() */
+   lockdep_assert_held(&cpuset_mutex);
+   lockdep_assert_cpus_held();
+   percpu_down_write(&cgroup_threadgroup_rwsem);
  
  	/* all tasks in @from are being moved, all csets are source */

spin_lock_irq(&css_set_lock);
@@ -144,8 +146,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup 
*from)
} while (task && !ret);
  out_err:
cgroup_migrate_finish();
-   cgroup_attach_unlock(true);
-   cgroup_unlock();
+   percpu_up_write(&cgroup_threadgroup_rwsem);
return ret;
  }
  
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c

index 13d27b17c889..94fb8b26f038 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4331,7 +4331,7 @@ static void remove_tasks_in_empty_cpuset(struct cpuset 
*cs)
nodes_empty(parent->mems_allowed))
parent = parent_cs(parent);
  
-	if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) {

+   if (cgroup_transfer_tasks_locked(parent->css.cgroup, cs->css.cgroup)) {
pr_err("cpuset: failed to transfer tasks out of empty cpuset ");
pr_cont_cgroup_name(cs->css.cgroup);
pr_cont("\n");
@@ -4376,21 +4376,9 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
  
  	/*

 * Move tasks to the nearest ancestor with execution resources,
-* This is full cgroup operation which will also call back into
-* cpuset. Execute it asynchronously using workqueue.
 */
-   if (is_empty && css_tryget_online(&cs->css)) {
-   struct cpuset_remove_tasks_struct *s;
-
-   s = kzalloc(sizeof(*s), GFP_KERNEL);
-   if (WARN_ON_ONCE(!s)) {
-   css_put(&cs->css);
-   return;
-   }
-
-   s->cs = cs;
-   INIT_WORK(&s->work, cpuset_migrate_tasks_workfn);
-   schedule_work(&s->work);
+   if (is_empty)
+   remove_tasks_in_empty_cpuset(cs);
}
  }
  


It still won't work because of the possibility of multiple tasks 
being involved in a circular locking dependency. The hotplug thread 
always acquires cpu_hotplug_lock first before acquiring cpuset_mutex or 
cgroup_mutex in this case (cpu_hotplug_lock --> cgroup_mutex). Other 
tasks calling into cgroup code will acquire the pair in the order 
cgroup_mutex --> cpu_hotplug_lock. This may lead to a deadlock if these 
two locking sequences happen at the same time. Lockdep will certainly 
spill out a splat because of this. So unless we change all the relevant 
cgroup code to the new cpu_hotplug_lock --> cgroup_mutex locking order, 
the hotplug code can't call cgroup_transfer_tasks() directly.


Cheers,
Longman




[PATCH v3 15/15] powerpc: Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Add name of functions triggering warning backtraces to the __bug_table
object section to enable support for suppressing WARNING backtraces.

To limit image size impact, the pointer to the function name is only added
to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
parameter is replaced with a (dummy) NULL parameter to avoid an image size
increase due to unused __func__ entries (this is necessary because __func__
is not a define but a virtual variable).

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Cc: Michael Ellerman 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 arch/powerpc/include/asm/bug.h | 37 +-
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 1db485aacbd9..5b06745d20aa 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -14,6 +14,9 @@
 .section __bug_table,"aw"
 5001:   .4byte \addr - .
 .4byte 5002f - .
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+.4byte 0
+#endif
 .short \line, \flags
 .org 5001b+BUG_ENTRY_SIZE
 .previous
@@ -32,30 +35,46 @@
 #endif /* verbose */
 
 #else /* !__ASSEMBLY__ */
-/* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3 to be FILE, LINE, flags and
-   sizeof(struct bug_entry), respectively */
+/* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3,%4 to be FILE, __func__, LINE, flags
+   and sizeof(struct bug_entry), respectively */
 #ifdef CONFIG_DEBUG_BUGVERBOSE
+
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR"   .4byte %1 - .\n"
+#else
+# define __BUG_FUNC_PTR
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+
 #define _EMIT_BUG_ENTRY\
".section __bug_table,\"aw\"\n" \
"2: .4byte 1b - .\n"\
"   .4byte %0 - .\n"\
-   "   .short %1, %2\n"\
-   ".org 2b+%3\n"  \
+   __BUG_FUNC_PTR  \
+   "   .short %2, %3\n"\
+   ".org 2b+%4\n"  \
".previous\n"
 #else
 #define _EMIT_BUG_ENTRY\
".section __bug_table,\"aw\"\n" \
"2: .4byte 1b - .\n"\
-   "   .short %2\n"\
-   ".org 2b+%3\n"  \
+   "   .short %3\n"\
+   ".org 2b+%4\n"  \
".previous\n"
 #endif
 
+#ifdef HAVE_BUG_FUNCTION
+# define __BUG_FUNC__func__
+#else
+# define __BUG_FUNCNULL
+#endif
+
 #define BUG_ENTRY(insn, flags, ...)\
__asm__ __volatile__(   \
"1: " insn "\n" \
_EMIT_BUG_ENTRY \
-   : : "i" (__FILE__), "i" (__LINE__), \
+   : : "i" (__FILE__), "i" (__BUG_FUNC),   \
+ "i" (__LINE__),   \
  "i" (flags),  \
  "i" (sizeof(struct bug_entry)),   \
  ##__VA_ARGS__)
@@ -80,7 +99,7 @@
if (x)  \
BUG();  \
} else {\
-   BUG_ENTRY(PPC_TLNEI " %4, 0", 0, "r" ((__force long)(x)));  \
+   BUG_ENTRY(PPC_TLNEI " %5, 0", 0, "r" ((__force long)(x)));  \
}   \
 } while (0)
 
@@ -90,7 +109,7 @@
if (__ret_warn_on)  \
__WARN();   \
} else {\
-   BUG_ENTRY(PPC_TLNEI " %4, 0",   \
+   BUG_ENTRY(PPC_TLNEI " %5, 0",   \
  BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN),  \
  "r" (__ret_warn_on)); \
}   \
-- 
2.39.2




[PATCH v3 14/15] riscv: Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Add name of functions triggering warning backtraces to the __bug_table
object section to enable support for suppressing WARNING backtraces.

To limit image size impact, the pointer to the function name is only added
to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
parameter is replaced with a (dummy) NULL parameter to avoid an image size
increase due to unused __func__ entries (this is necessary because __func__
is not a define but a virtual variable).

To simplify the implementation, unify the __BUG_ENTRY_ADDR and
__BUG_ENTRY_FILE macros into a single macro named __BUG_REL() which takes
the address, file, or function reference as parameter.

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 arch/riscv/include/asm/bug.h | 38 
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/include/asm/bug.h b/arch/riscv/include/asm/bug.h
index 1aaea81fb141..79f360af4ad8 100644
--- a/arch/riscv/include/asm/bug.h
+++ b/arch/riscv/include/asm/bug.h
@@ -30,26 +30,39 @@
 typedef u32 bug_insn_t;
 
 #ifdef CONFIG_GENERIC_BUG_RELATIVE_POINTERS
-#define __BUG_ENTRY_ADDR   RISCV_INT " 1b - ."
-#define __BUG_ENTRY_FILE   RISCV_INT " %0 - ."
+#define __BUG_REL(val) RISCV_INT " " __stringify(val) " - ."
 #else
-#define __BUG_ENTRY_ADDR   RISCV_PTR " 1b"
-#define __BUG_ENTRY_FILE   RISCV_PTR " %0"
+#define __BUG_REL(val) RISCV_PTR " " __stringify(val)
 #endif
 
 #ifdef CONFIG_DEBUG_BUGVERBOSE
+
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR__BUG_REL(%1)
+#else
+# define __BUG_FUNC_PTR
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+
 #define __BUG_ENTRY\
-   __BUG_ENTRY_ADDR "\n\t" \
-   __BUG_ENTRY_FILE "\n\t" \
-   RISCV_SHORT " %1\n\t"   \
-   RISCV_SHORT " %2"
+   __BUG_REL(1b) "\n\t"\
+   __BUG_REL(%0) "\n\t"\
+   __BUG_FUNC_PTR "\n\t"   \
+   RISCV_SHORT " %2\n\t"   \
+   RISCV_SHORT " %3"
 #else
 #define __BUG_ENTRY\
-   __BUG_ENTRY_ADDR "\n\t" \
-   RISCV_SHORT " %2"
+   __BUG_REL(1b) "\n\t"\
+   RISCV_SHORT " %3"
 #endif
 
 #ifdef CONFIG_GENERIC_BUG
+#ifdef HAVE_BUG_FUNCTION
+# define __BUG_FUNC__func__
+#else
+# define __BUG_FUNCNULL
+#endif
+
 #define __BUG_FLAGS(flags) \
 do {   \
__asm__ __volatile__ (  \
@@ -58,10 +71,11 @@ do {   \
".pushsection __bug_table,\"aw\"\n\t"   \
"2:\n\t"\
__BUG_ENTRY "\n\t"  \
-   ".org 2b + %3\n\t"  \
+   ".org 2b + %4\n\t"  \
".popsection"   \
:   \
-   : "i" (__FILE__), "i" (__LINE__),   \
+   : "i" (__FILE__), "i" (__BUG_FUNC), \
+ "i" (__LINE__),   \
  "i" (flags),  \
  "i" (sizeof(struct bug_entry)));  \
 } while (0)
-- 
2.39.2




[PATCH v3 13/15] sh: Move defines needed for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Declaring the defines needed for suppressing warning inside
'#ifdef CONFIG_DEBUG_BUGVERBOSE' results in a kerneldoc warning.

.../bug.h:29: warning: expecting prototype for _EMIT_BUG_ENTRY().
Prototype was for HAVE_BUG_FUNCTION() instead

Move the defines above the kerneldoc entry for _EMIT_BUG_ENTRY
to make kerneldoc happy.

Reported-by: Simon Horman 
Cc: Simon Horman 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Signed-off-by: Guenter Roeck 
---
v3: Added patch. Possibly squash into previous patch.

 arch/sh/include/asm/bug.h | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/sh/include/asm/bug.h b/arch/sh/include/asm/bug.h
index 470ce6567d20..bf4947d51d69 100644
--- a/arch/sh/include/asm/bug.h
+++ b/arch/sh/include/asm/bug.h
@@ -11,6 +11,15 @@
 #define HAVE_ARCH_BUG
 #define HAVE_ARCH_WARN_ON
 
+#ifdef CONFIG_DEBUG_BUGVERBOSE
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR"\t.long %O2\n"
+#else
+# define __BUG_FUNC_PTR
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+#endif /* CONFIG_DEBUG_BUGVERBOSE */
+
 /**
  * _EMIT_BUG_ENTRY
  * %1 - __FILE__
@@ -25,13 +34,6 @@
  */
 #ifdef CONFIG_DEBUG_BUGVERBOSE
 
-#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
-# define HAVE_BUG_FUNCTION
-# define __BUG_FUNC_PTR"\t.long %O2\n"
-#else
-# define __BUG_FUNC_PTR
-#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
-
 #define _EMIT_BUG_ENTRY\
"\t.pushsection __bug_table,\"aw\"\n"   \
"2:\t.long 1b, %O1\n"   \
-- 
2.39.2




[PATCH v3 12/15] sh: Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Add name of functions triggering warning backtraces to the __bug_table
object section to enable support for suppressing WARNING backtraces.

To limit image size impact, the pointer to the function name is only added
to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
parameter is replaced with a (dummy) NULL parameter to avoid an image size
increase due to unused __func__ entries (this is necessary because __func__
is not a define but a virtual variable).

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 arch/sh/include/asm/bug.h | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/sh/include/asm/bug.h b/arch/sh/include/asm/bug.h
index 05a485c4fabc..470ce6567d20 100644
--- a/arch/sh/include/asm/bug.h
+++ b/arch/sh/include/asm/bug.h
@@ -24,21 +24,36 @@
  * The offending file and line are encoded in the __bug_table section.
  */
 #ifdef CONFIG_DEBUG_BUGVERBOSE
+
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR"\t.long %O2\n"
+#else
+# define __BUG_FUNC_PTR
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+
 #define _EMIT_BUG_ENTRY\
"\t.pushsection __bug_table,\"aw\"\n"   \
"2:\t.long 1b, %O1\n"   \
-   "\t.short %O2, %O3\n"   \
-   "\t.org 2b+%O4\n"   \
+   __BUG_FUNC_PTR  \
+   "\t.short %O3, %O4\n"   \
+   "\t.org 2b+%O5\n"   \
"\t.popsection\n"
 #else
 #define _EMIT_BUG_ENTRY\
"\t.pushsection __bug_table,\"aw\"\n"   \
"2:\t.long 1b\n"\
-   "\t.short %O3\n"\
-   "\t.org 2b+%O4\n"   \
+   "\t.short %O4\n"\
+   "\t.org 2b+%O5\n"   \
"\t.popsection\n"
 #endif
 
+#ifdef HAVE_BUG_FUNCTION
+# define __BUG_FUNC	__func__
+#else
+# define __BUG_FUNC	NULL
+#endif
+
 #define BUG()  \
 do {   \
__asm__ __volatile__ (  \
@@ -47,6 +62,7 @@ do {  \
 :  \
 : "n" (TRAPA_BUG_OPCODE),  \
   "i" (__FILE__),  \
+  "i" (__BUG_FUNC),\
   "i" (__LINE__), "i" (0), \
   "i" (sizeof(struct bug_entry))); \
unreachable();  \
@@ -60,6 +76,7 @@ do {  \
 :  \
 : "n" (TRAPA_BUG_OPCODE),  \
   "i" (__FILE__),  \
+  "i" (__BUG_FUNC),\
   "i" (__LINE__),  \
   "i" (BUGFLAG_WARNING|(flags)),   \
   "i" (sizeof(struct bug_entry))); \
@@ -85,6 +102,7 @@ do { \
 :  \
 : "n" (TRAPA_BUG_OPCODE),  \
   "i" (__FILE__),  \
+  "i" (__BUG_FUNC),\
   "i" (__LINE__),  \
   "i" (BUGFLAG_UNWINDER),  \
   "i" (sizeof(struct bug_entry))); \
-- 
2.39.2




[PATCH v3 11/15] s390: Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Add name of functions triggering warning backtraces to the __bug_table
object section to enable support for suppressing WARNING backtraces.

To limit image size impact, the pointer to the function name is only added
to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
parameter is replaced with a (dummy) NULL parameter to avoid an image size
increase due to unused __func__ entries (this is necessary because
__func__ is not a define but a virtual variable).

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Alexander Gordeev 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1 (simplified assembler changes after upstream commit
  3938490e78f4 ("s390/bug: remove entry size from __bug_table section"))
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 arch/s390/include/asm/bug.h | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/bug.h b/arch/s390/include/asm/bug.h
index c500d45fb465..44d4e9f24ae0 100644
--- a/arch/s390/include/asm/bug.h
+++ b/arch/s390/include/asm/bug.h
@@ -8,6 +8,15 @@
 
 #ifdef CONFIG_DEBUG_BUGVERBOSE
 
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR	"   .long   %0-.\n"
+# define __BUG_FUNC	__func__
+#else
+# define __BUG_FUNC_PTR
+# define __BUG_FUNC	NULL
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+
 #define __EMIT_BUG(x) do { \
asm_inline volatile(\
"0: mc  0,0\n"  \
@@ -17,10 +26,12 @@
".section __bug_table,\"aw\"\n" \
"2: .long   0b-.\n" \
"   .long   1b-.\n" \
-   "   .short  %0,%1\n"\
-   "   .org2b+%2\n"\
+   __BUG_FUNC_PTR  \
+   "   .short  %1,%2\n"\
+   "   .org2b+%3\n"\
".previous\n"   \
-   : : "i" (__LINE__), \
+   : : "i" (__BUG_FUNC),   \
+   "i" (__LINE__), \
"i" (x),\
"i" (sizeof(struct bug_entry)));\
 } while (0)
-- 
2.39.2




[PATCH v3 10/15] parisc: Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Add name of functions triggering warning backtraces to the __bug_table
object section to enable support for suppressing WARNING backtraces.

To limit image size impact, the pointer to the function name is only added
to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
parameter is replaced with a (dummy) NULL parameter to avoid an image size
increase due to unused __func__ entries (this is necessary because __func__
is not a define but a virtual variable).

While at it, declare assembler parameters as constants where possible.
Refine .blockz instructions to calculate the necessary padding instead
of using fixed values.

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Acked-by: Helge Deller 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 arch/parisc/include/asm/bug.h | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/parisc/include/asm/bug.h b/arch/parisc/include/asm/bug.h
index 833555f74ffa..b59c3f7380bf 100644
--- a/arch/parisc/include/asm/bug.h
+++ b/arch/parisc/include/asm/bug.h
@@ -23,8 +23,17 @@
 # define __BUG_REL(val) ".word " __stringify(val)
 #endif
 
-
 #ifdef CONFIG_DEBUG_BUGVERBOSE
+
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR	__BUG_REL(%c1)
+# define __BUG_FUNC	__func__
+#else
+# define __BUG_FUNC_PTR
+# define __BUG_FUNC	NULL
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+
 #define BUG()  \
do {\
asm volatile("\n"   \
@@ -33,10 +42,12 @@
 "\t.align 4\n" \
 "2:\t" __BUG_REL(1b) "\n"  \
 "\t" __BUG_REL(%c0)  "\n"  \
-"\t.short %1, %2\n"\
-"\t.blockz %3-2*4-2*2\n"   \
+"\t" __BUG_FUNC_PTR  "\n"  \
+"\t.short %c2, %c3\n"  \
+"\t.blockz %c4-(.-2b)\n"   \
 "\t.popsection"\
-: : "i" (__FILE__), "i" (__LINE__),\
+: : "i" (__FILE__), "i" (__BUG_FUNC),  \
+"i" (__LINE__),\
 "i" (0), "i" (sizeof(struct bug_entry)) ); \
unreachable();  \
} while(0)
@@ -58,10 +69,12 @@
 "\t.align 4\n" \
 "2:\t" __BUG_REL(1b) "\n"  \
 "\t" __BUG_REL(%c0)  "\n"  \
-"\t.short %1, %2\n"\
-"\t.blockz %3-2*4-2*2\n"   \
+"\t" __BUG_FUNC_PTR  "\n"  \
+"\t.short %c2, %3\n"   \
+"\t.blockz %c4-(.-2b)\n"   \
 "\t.popsection"\
-: : "i" (__FILE__), "i" (__LINE__),\
+: : "i" (__FILE__), "i" (__BUG_FUNC),  \
+"i" (__LINE__),\
 "i" (BUGFLAG_WARNING|(flags)), \
 "i" (sizeof(struct bug_entry)) );  \
} while(0)
@@ -74,7 +87,7 @@
 "\t.align 4\n" \
 "2:\t" __BUG_REL(1b) "\n"  \
 "\t.short %0\n"\
-"\t.blockz %1-4-2\n"   \
+"\t.blockz %c1-(.-2b)\n"   \
 "\t.popsection"\
 : : "i" (BUGFLAG_WARNING|(flags)), \
 "i" (sizeof(struct bug_entry)) );  \
-- 
2.39.2




[PATCH v3 09/15] loongarch: Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Add name of functions triggering warning backtraces to the __bug_table
object section to enable support for suppressing WARNING backtraces.

To limit image size impact, the pointer to the function name is only added
to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
parameter is replaced with a (dummy) NULL parameter to avoid an image size
increase due to unused __func__ entries (this is necessary because __func__
is not a define but a virtual variable).

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Cc: Huacai Chen 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1; resolved context conflict
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2; resolved context conflict

 arch/loongarch/include/asm/bug.h | 38 +++-
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/loongarch/include/asm/bug.h b/arch/loongarch/include/asm/bug.h
index 08388876ade4..193f396d81a0 100644
--- a/arch/loongarch/include/asm/bug.h
+++ b/arch/loongarch/include/asm/bug.h
@@ -3,47 +3,63 @@
 #define __ASM_BUG_H
 
 #include 
+#include 
 #include 
 
 #ifndef CONFIG_DEBUG_BUGVERBOSE
-#define _BUGVERBOSE_LOCATION(file, line)
+#define _BUGVERBOSE_LOCATION(file, func, line)
 #else
-#define __BUGVERBOSE_LOCATION(file, line)  \
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR(func)  .long func - .;
+#else
+# define __BUG_FUNC_PTR(func)
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+
+#define __BUGVERBOSE_LOCATION(file, func, line)\
.pushsection .rodata.str, "aMS", @progbits, 1;  \
10002:  .string file;   \
.popsection;\
\
.long 10002b - .;   \
+   __BUG_FUNC_PTR(func)\
.short line;
-#define _BUGVERBOSE_LOCATION(file, line) __BUGVERBOSE_LOCATION(file, line)
+#define _BUGVERBOSE_LOCATION(file, func, line) __BUGVERBOSE_LOCATION(file, func, line)
 #endif
 
 #ifndef CONFIG_GENERIC_BUG
-#define __BUG_ENTRY(flags)
+#define __BUG_ENTRY(flags, func)
 #else
-#define __BUG_ENTRY(flags) \
+#define __BUG_ENTRY(flags, func)   \
.pushsection __bug_table, "aw"; \
.align 2;   \
1:  .long 10001f - .;   \
-   _BUGVERBOSE_LOCATION(__FILE__, __LINE__)\
+   _BUGVERBOSE_LOCATION(__FILE__, func, __LINE__)  \
.short flags;   \
.popsection;\
10001:
 #endif
 
-#define ASM_BUG_FLAGS(flags)   \
-   __BUG_ENTRY(flags)  \
+#define ASM_BUG_FLAGS(flags, func) \
+   __BUG_ENTRY(flags, func)\
break   BRK_BUG
 
-#define ASM_BUG()  ASM_BUG_FLAGS(0)
+#define ASM_BUG()  ASM_BUG_FLAGS(0, .)
+
+#ifdef HAVE_BUG_FUNCTION
+# define __BUG_FUNC	__func__
+#else
+# define __BUG_FUNC	NULL
+#endif
 
 #define __BUG_FLAGS(flags) \
-   asm_inline volatile (__stringify(ASM_BUG_FLAGS(flags)));
+   asm_inline volatile (__stringify(ASM_BUG_FLAGS(flags, %0)) : : "i" (__BUG_FUNC));
 
 #define __WARN_FLAGS(flags)\
 do {   \
instrumentation_begin();\
-   __BUG_FLAGS(BUGFLAG_WARNING|(flags));   \
+   if (!IS_SUPPRESSED_WARNING(__func__))   \
+   __BUG_FLAGS(BUGFLAG_WARNING|(flags));   \
annotate_reachable();   \
instrumentation_end();  \
 } while (0)
-- 
2.39.2




[PATCH v3 08/15] arm64: Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Add name of functions triggering warning backtraces to the __bug_table
object section to enable support for suppressing WARNING backtraces.

To limit image size impact, the pointer to the function name is only added
to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
parameter is replaced with a (dummy) NULL parameter to avoid an image size
increase due to unused __func__ entries (this is necessary because __func__
is not a define but a virtual variable).

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 arch/arm64/include/asm/asm-bug.h | 29 +++--
 arch/arm64/include/asm/bug.h |  8 +++-
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/asm-bug.h b/arch/arm64/include/asm/asm-bug.h
index c762038ba400..c6d22e3cd840 100644
--- a/arch/arm64/include/asm/asm-bug.h
+++ b/arch/arm64/include/asm/asm-bug.h
@@ -8,36 +8,45 @@
 #include 
 
 #ifdef CONFIG_DEBUG_BUGVERBOSE
-#define _BUGVERBOSE_LOCATION(file, line) __BUGVERBOSE_LOCATION(file, line)
-#define __BUGVERBOSE_LOCATION(file, line)  \
+
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR(func)  .long func - .;
+#else
+# define __BUG_FUNC_PTR(func)
+#endif
+
+#define _BUGVERBOSE_LOCATION(file, func, line) __BUGVERBOSE_LOCATION(file, func, line)
+#define __BUGVERBOSE_LOCATION(file, func, line)\
.pushsection .rodata.str,"aMS",@progbits,1; \
14472:  .string file;   \
.popsection;\
\
.long 14472b - .;   \
+   __BUG_FUNC_PTR(func)\
.short line;
 #else
-#define _BUGVERBOSE_LOCATION(file, line)
+#define _BUGVERBOSE_LOCATION(file, func, line)
 #endif
 
 #ifdef CONFIG_GENERIC_BUG
 
-#define __BUG_ENTRY(flags) \
+#define __BUG_ENTRY(flags, func)   \
.pushsection __bug_table,"aw";  \
.align 2;   \
14470:  .long 14471f - .;   \
-_BUGVERBOSE_LOCATION(__FILE__, __LINE__)   \
-   .short flags;   \
+_BUGVERBOSE_LOCATION(__FILE__, func, __LINE__) \
+   .short flags;   \
.popsection;\
14471:
 #else
-#define __BUG_ENTRY(flags)
+#define __BUG_ENTRY(flags, func)
 #endif
 
-#define ASM_BUG_FLAGS(flags)   \
-   __BUG_ENTRY(flags)  \
+#define ASM_BUG_FLAGS(flags, func) \
+   __BUG_ENTRY(flags, func)\
brk BUG_BRK_IMM
 
-#define ASM_BUG()  ASM_BUG_FLAGS(0)
+#define ASM_BUG()  ASM_BUG_FLAGS(0, .)
 
 #endif /* __ASM_ASM_BUG_H */
diff --git a/arch/arm64/include/asm/bug.h b/arch/arm64/include/asm/bug.h
index 28be048db3f6..044c5e24a17d 100644
--- a/arch/arm64/include/asm/bug.h
+++ b/arch/arm64/include/asm/bug.h
@@ -11,8 +11,14 @@
 
 #include 
 
+#ifdef HAVE_BUG_FUNCTION
+# define __BUG_FUNC	__func__
+#else
+# define __BUG_FUNC	NULL
+#endif
+
 #define __BUG_FLAGS(flags) \
-   asm volatile (__stringify(ASM_BUG_FLAGS(flags)));
+   asm volatile (__stringify(ASM_BUG_FLAGS(flags, %c0)) : : "i" (__BUG_FUNC));
 
 #define BUG() do { \
__BUG_FLAGS(0); \
-- 
2.39.2




[PATCH v3 07/15] x86: Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Add name of functions triggering warning backtraces to the __bug_table
object section to enable support for suppressing WARNING backtraces.

To limit image size impact, the pointer to the function name is only added
to the __bug_table section if both CONFIG_KUNIT_SUPPRESS_BACKTRACE and
CONFIG_DEBUG_BUGVERBOSE are enabled. Otherwise, the __func__ assembly
parameter is replaced with a (dummy) NULL parameter to avoid an image size
increase due to unused __func__ entries (this is necessary because __func__
is not a define but a virtual variable).

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 arch/x86/include/asm/bug.h | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index a3ec87d198ac..7698dfa74c98 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -23,18 +23,28 @@
 
 #ifdef CONFIG_DEBUG_BUGVERBOSE
 
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+# define HAVE_BUG_FUNCTION
+# define __BUG_FUNC_PTR	__BUG_REL(%c1)
+# define __BUG_FUNC	__func__
+#else
+# define __BUG_FUNC_PTR
+# define __BUG_FUNC	NULL
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+
 #define _BUG_FLAGS(ins, flags, extra)  \
 do {   \
asm_inline volatile("1:\t" ins "\n" \
 ".pushsection __bug_table,\"aw\"\n"\
 "2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n"   \
 "\t"  __BUG_REL(%c0) "\t# bug_entry::file\n"   \
-"\t.word %c1""\t# bug_entry::line\n"   \
-"\t.word %c2""\t# bug_entry::flags\n"  \
-"\t.org 2b+%c3\n"  \
+"\t"  __BUG_FUNC_PTR "\t# bug_entry::function\n"   \
+"\t.word %c2""\t# bug_entry::line\n"   \
+"\t.word %c3""\t# bug_entry::flags\n"  \
+"\t.org 2b+%c4\n"  \
 ".popsection\n"\
 extra  \
-: : "i" (__FILE__), "i" (__LINE__),\
+: : "i" (__FILE__), "i" (__BUG_FUNC), "i" (__LINE__),\
 "i" (flags),   \
 "i" (sizeof(struct bug_entry)));   \
 } while (0)
@@ -80,7 +90,8 @@ do {  \
 do {   \
__auto_type __flags = BUGFLAG_WARNING|(flags);  \
instrumentation_begin();\
-   _BUG_FLAGS(ASM_UD2, __flags, ASM_REACHABLE);\
+   if (!IS_SUPPRESSED_WARNING(__func__))   \
+   _BUG_FLAGS(ASM_UD2, __flags, ASM_REACHABLE);\
instrumentation_end();  \
 } while (0)
 
-- 
2.39.2




[PATCH v3 06/15] net: kunit: Suppress lock warning noise at end of dev_addr_lists tests

2024-04-03 Thread Guenter Roeck
dev_addr_lists_test generates lock warning noise at the end of tests
if lock debugging is enabled. There are two sets of warnings.

WARNING: CPU: 0 PID: 689 at kernel/locking/mutex.c:923 __mutex_unlock_slowpath.constprop.0+0x13c/0x368
DEBUG_LOCKS_WARN_ON(__owner_task(owner) != __get_current())

WARNING: kunit_try_catch/1336 still has locks held!

KUnit test cleanup is not guaranteed to run in the same thread as the test
itself. For this test, this means that rtnl_lock() and rtnl_unlock() may
be called from different threads. This triggers the warnings.
Suppress the warnings because they are irrelevant for the test and just
confusing and distracting.

The first warning can be suppressed by using START_SUPPRESSED_WARNING()
and END_SUPPRESSED_WARNING() around the call to rtnl_unlock(). To suppress
the second warning, it is necessary to set debug_locks_silent while the
rtnl lock is held.

Tested-by: Linux Kernel Functional Testing 
Cc: David Gow 
Cc: Jakub Kicinski 
Cc: Eric Dumazet 
Acked-by: Dan Carpenter 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
v3:
- Rebased to v6.9-rc2

 net/core/dev_addr_lists_test.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/dev_addr_lists_test.c b/net/core/dev_addr_lists_test.c
index 4dbd0dc6aea2..b427dd1a3c93 100644
--- a/net/core/dev_addr_lists_test.c
+++ b/net/core/dev_addr_lists_test.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -49,6 +50,7 @@ static int dev_addr_test_init(struct kunit *test)
KUNIT_FAIL(test, "Can't register netdev %d", err);
}
 
+   debug_locks_silent = 1;
rtnl_lock();
return 0;
 }
@@ -56,8 +58,12 @@ static int dev_addr_test_init(struct kunit *test)
 static void dev_addr_test_exit(struct kunit *test)
 {
struct net_device *netdev = test->priv;
+   DEFINE_SUPPRESSED_WARNING(__mutex_unlock_slowpath);
 
+   START_SUPPRESSED_WARNING(__mutex_unlock_slowpath);
rtnl_unlock();
+   END_SUPPRESSED_WARNING(__mutex_unlock_slowpath);
+   debug_locks_silent = 0;
unregister_netdev(netdev);
free_netdev(netdev);
 }
-- 
2.39.2




[PATCH v3 05/15] drm: Suppress intentional warning backtraces in scaling unit tests

2024-04-03 Thread Guenter Roeck
The drm_test_rect_calc_hscale and drm_test_rect_calc_vscale unit tests
intentionally trigger warning backtraces by providing bad parameters to
the tested functions. What is tested is the return value, not the existence
of a warning backtrace. Suppress the backtraces to avoid clogging the
kernel log and distraction from real problems.

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Acked-by: Maíra Canal 
Cc: Maarten Lankhorst 
Cc: David Airlie 
Cc: Daniel Vetter 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
v3:
- Rebased to v6.9-rc2

 drivers/gpu/drm/tests/drm_rect_test.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/tests/drm_rect_test.c b/drivers/gpu/drm/tests/drm_rect_test.c
index 76332cd2ead8..66851769ee32 100644
--- a/drivers/gpu/drm/tests/drm_rect_test.c
+++ b/drivers/gpu/drm/tests/drm_rect_test.c
@@ -406,22 +406,38 @@ KUNIT_ARRAY_PARAM(drm_rect_scale, drm_rect_scale_cases, drm_rect_scale_case_desc
 
 static void drm_test_rect_calc_hscale(struct kunit *test)
 {
+   DEFINE_SUPPRESSED_WARNING(drm_calc_scale);
const struct drm_rect_scale_case *params = test->param_value;
int scaling_factor;
 
+   /*
+* drm_rect_calc_hscale() generates a warning backtrace whenever bad
+* parameters are passed to it. This affects all unit tests with an
+* error code in expected_scaling_factor.
+*/
+   START_SUPPRESSED_WARNING(drm_calc_scale);
	scaling_factor = drm_rect_calc_hscale(&params->src, &params->dst,
					      params->min_range, params->max_range);
+   END_SUPPRESSED_WARNING(drm_calc_scale);
 
KUNIT_EXPECT_EQ(test, scaling_factor, params->expected_scaling_factor);
 }
 
 static void drm_test_rect_calc_vscale(struct kunit *test)
 {
+   DEFINE_SUPPRESSED_WARNING(drm_calc_scale);
const struct drm_rect_scale_case *params = test->param_value;
int scaling_factor;
 
+   /*
+* drm_rect_calc_vscale() generates a warning backtrace whenever bad
+* parameters are passed to it. This affects all unit tests with an
+* error code in expected_scaling_factor.
+*/
+   START_SUPPRESSED_WARNING(drm_calc_scale);
	scaling_factor = drm_rect_calc_vscale(&params->src, &params->dst,
					      params->min_range, params->max_range);
+   END_SUPPRESSED_WARNING(drm_calc_scale);
 
KUNIT_EXPECT_EQ(test, scaling_factor, params->expected_scaling_factor);
 }
-- 
2.39.2




[PATCH v3 04/15] kunit: Add documentation for warning backtrace suppression API

2024-04-03 Thread Guenter Roeck
Document API functions for suppressing warning backtraces.

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Reviewed-by: Kees Cook 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
v3:
- Rebased to v6.9-rc2

 Documentation/dev-tools/kunit/usage.rst | 30 -
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index 22955d56b379..8d3d36d4103d 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -157,6 +157,34 @@ Alternatively, one can take full control over the error message by using
if (some_setup_function())
KUNIT_FAIL(test, "Failed to setup thing for testing");
 
+Suppressing warning backtraces
+------------------------------
+
+Some unit tests trigger warning backtraces either intentionally or as side
+effect. Such backtraces are normally undesirable since they distract from
+the actual test and may result in the impression that there is a problem.
+
+Such backtraces can be suppressed. To suppress a backtrace in some_function(),
+use the following code.
+
+.. code-block:: c
+
+   static void some_test(struct kunit *test)
+   {
+   DEFINE_SUPPRESSED_WARNING(some_function);
+
+   START_SUPPRESSED_WARNING(some_function);
+   trigger_backtrace();
+   END_SUPPRESSED_WARNING(some_function);
+   }
+
+SUPPRESSED_WARNING_COUNT() returns the number of suppressed backtraces. If the
+suppressed backtrace was triggered on purpose, this can be used to check if
+the backtrace was actually triggered.
+
+.. code-block:: c
+
+   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(some_function), 1);
 
 Test Suites
 ~~~~~~~~~~~
@@ -857,4 +885,4 @@ For example:
dev_managed_string = devm_kstrdup(fake_device, "Hello, World!");
 
// Everything is cleaned up automatically when the test ends.
-   }
\ No newline at end of file
+   }
-- 
2.39.2




[PATCH v3 03/15] kunit: Add test cases for backtrace warning suppression

2024-04-03 Thread Guenter Roeck
Add unit tests to verify that warning backtrace suppression works.

If backtrace suppression does _not_ work, the unit tests will likely
trigger unsuppressed backtraces, which should actually help to get
the affected architectures / platforms fixed.

Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Reviewed-by: Kees Cook 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 lib/kunit/Makefile |   7 +-
 lib/kunit/backtrace-suppression-test.c | 104 +
 2 files changed, 109 insertions(+), 2 deletions(-)
 create mode 100644 lib/kunit/backtrace-suppression-test.c

diff --git a/lib/kunit/Makefile b/lib/kunit/Makefile
index 545b57c3be48..3eee1bd0ce5e 100644
--- a/lib/kunit/Makefile
+++ b/lib/kunit/Makefile
@@ -16,10 +16,13 @@ endif
 
 # KUnit 'hooks' and bug handling are built-in even when KUnit is built
 # as a module.
-obj-y +=   hooks.o \
-   bug.o
+obj-y +=   hooks.o
+obj-$(CONFIG_KUNIT_SUPPRESS_BACKTRACE) += bug.o
 
 obj-$(CONFIG_KUNIT_TEST) +=kunit-test.o
+ifeq ($(CONFIG_KUNIT_SUPPRESS_BACKTRACE),y)
+obj-$(CONFIG_KUNIT_TEST) +=backtrace-suppression-test.o
+endif
 
 # string-stream-test compiles built-in only.
 ifeq ($(CONFIG_KUNIT_TEST),y)
diff --git a/lib/kunit/backtrace-suppression-test.c b/lib/kunit/backtrace-suppression-test.c
new file mode 100644
index ..47c619283802
--- /dev/null
+++ b/lib/kunit/backtrace-suppression-test.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit test for suppressing warning tracebacks
+ *
+ * Copyright (C) 2024, Guenter Roeck
+ * Author: Guenter Roeck 
+ */
+
+#include 
+#include 
+
+static void backtrace_suppression_test_warn_direct(struct kunit *test)
+{
+   DEFINE_SUPPRESSED_WARNING(backtrace_suppression_test_warn_direct);
+
+   START_SUPPRESSED_WARNING(backtrace_suppression_test_warn_direct);
+   WARN(1, "This backtrace should be suppressed");
+   END_SUPPRESSED_WARNING(backtrace_suppression_test_warn_direct);
+
+   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(backtrace_suppression_test_warn_direct), 1);
+}
+
+static void trigger_backtrace_warn(void)
+{
+   WARN(1, "This backtrace should be suppressed");
+}
+
+static void backtrace_suppression_test_warn_indirect(struct kunit *test)
+{
+   DEFINE_SUPPRESSED_WARNING(trigger_backtrace_warn);
+
+   START_SUPPRESSED_WARNING(trigger_backtrace_warn);
+   trigger_backtrace_warn();
+   END_SUPPRESSED_WARNING(trigger_backtrace_warn);
+
+   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(trigger_backtrace_warn), 1);
+}
+
+static void backtrace_suppression_test_warn_multi(struct kunit *test)
+{
+   DEFINE_SUPPRESSED_WARNING(trigger_backtrace_warn);
+   DEFINE_SUPPRESSED_WARNING(backtrace_suppression_test_warn_multi);
+
+   START_SUPPRESSED_WARNING(backtrace_suppression_test_warn_multi);
+   START_SUPPRESSED_WARNING(trigger_backtrace_warn);
+   WARN(1, "This backtrace should be suppressed");
+   trigger_backtrace_warn();
+   END_SUPPRESSED_WARNING(trigger_backtrace_warn);
+   END_SUPPRESSED_WARNING(backtrace_suppression_test_warn_multi);
+
+   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(backtrace_suppression_test_warn_multi), 1);
+   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(trigger_backtrace_warn), 1);
+}
+
+static void backtrace_suppression_test_warn_on_direct(struct kunit *test)
+{
+   DEFINE_SUPPRESSED_WARNING(backtrace_suppression_test_warn_on_direct);
+
+   if (!IS_ENABLED(CONFIG_DEBUG_BUGVERBOSE) && !IS_ENABLED(CONFIG_KALLSYMS))
+   kunit_skip(test, "requires CONFIG_DEBUG_BUGVERBOSE or CONFIG_KALLSYMS");
+
+   START_SUPPRESSED_WARNING(backtrace_suppression_test_warn_on_direct);
+   WARN_ON(1);
+   END_SUPPRESSED_WARNING(backtrace_suppression_test_warn_on_direct);
+
+   KUNIT_EXPECT_EQ(test,
+   SUPPRESSED_WARNING_COUNT(backtrace_suppression_test_warn_on_direct), 1);
+}
+
+static void trigger_backtrace_warn_on(void)
+{
+   WARN_ON(1);
+}
+
+static void backtrace_suppression_test_warn_on_indirect(struct kunit *test)
+{
+   DEFINE_SUPPRESSED_WARNING(trigger_backtrace_warn_on);
+
+   if (!IS_ENABLED(CONFIG_DEBUG_BUGVERBOSE))
+   kunit_skip(test, "requires CONFIG_DEBUG_BUGVERBOSE");
+
+   START_SUPPRESSED_WARNING(trigger_backtrace_warn_on);
+   trigger_backtrace_warn_on();
+   END_SUPPRESSED_WARNING(trigger_backtrace_warn_on);
+
+   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(trigger_backtrace_warn_on), 1);
+}
+
+static struct kunit_case backtrace_suppression_test_cases[] = {
+   KUNIT_CASE(backtrace_suppression_test_warn_direct),
+   

[PATCH v3 02/15] kunit: bug: Count suppressed warning backtraces

2024-04-03 Thread Guenter Roeck
Count suppressed warning backtraces to enable code which suppresses
warning backtraces to check if the expected backtrace(s) have been
observed.

Using atomics for the backtrace count resulted in build errors on some
architectures due to include file recursion, so use a plain integer
for now.

Acked-by: Dan Carpenter 
Reviewed-by: Kees Cook 
Tested-by: Linux Kernel Functional Testing 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
v3:
- Rebased to v6.9-rc2

 include/kunit/bug.h | 7 ++-
 lib/kunit/bug.c | 4 +++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/kunit/bug.h b/include/kunit/bug.h
index bd0fe047572b..72e9fb23bbd5 100644
--- a/include/kunit/bug.h
+++ b/include/kunit/bug.h
@@ -20,6 +20,7 @@
 struct __suppressed_warning {
struct list_head node;
const char *function;
+   int counter;
 };
 
 void __start_suppress_warning(struct __suppressed_warning *warning);
@@ -28,7 +29,7 @@ bool __is_suppressed_warning(const char *function);
 
 #define DEFINE_SUPPRESSED_WARNING(func)\
struct __suppressed_warning __kunit_suppress_##func = \
-   { .function = __stringify(func) }
+   { .function = __stringify(func), .counter = 0 }
 
 #define START_SUPPRESSED_WARNING(func) \
__start_suppress_warning(&__kunit_suppress_##func)
@@ -39,12 +40,16 @@ bool __is_suppressed_warning(const char *function);
 #define IS_SUPPRESSED_WARNING(func) \
__is_suppressed_warning(func)
 
+#define SUPPRESSED_WARNING_COUNT(func) \
+   (__kunit_suppress_##func.counter)
+
 #else /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
 
 #define DEFINE_SUPPRESSED_WARNING(func)
 #define START_SUPPRESSED_WARNING(func)
 #define END_SUPPRESSED_WARNING(func)
 #define IS_SUPPRESSED_WARNING(func) (false)
+#define SUPPRESSED_WARNING_COUNT(func) (0)
 
 #endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
 #endif /* __ASSEMBLY__ */
diff --git a/lib/kunit/bug.c b/lib/kunit/bug.c
index f93544d7a9d1..13b3d896c114 100644
--- a/lib/kunit/bug.c
+++ b/lib/kunit/bug.c
@@ -32,8 +32,10 @@ bool __is_suppressed_warning(const char *function)
return false;
 
list_for_each_entry(warning, _warnings, node) {
-   if (!strcmp(function, warning->function))
+   if (!strcmp(function, warning->function)) {
+   warning->counter++;
return true;
+   }
}
return false;
 }
-- 
2.39.2




[PATCH v3 01/15] bug/kunit: Core support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Some unit tests intentionally trigger warning backtraces by passing
bad parameters to API functions. Such unit tests typically check the
return value from those calls, not the existence of the warning backtrace.

Such intentionally generated warning backtraces are neither desirable
nor useful for a number of reasons.
- They can result in overlooked real problems.
- A warning that suddenly starts to show up in unit tests needs to be
  investigated and has to be marked to be ignored, for example by
  adjusting filter scripts. Such filters are ad-hoc because there is
  no real standard format for warnings. On top of that, such filter
  scripts would require constant maintenance.

One option to address problem would be to add messages such as "expected
warning backtraces start / end here" to the kernel log.  However, that
would again require filter scripts, it might result in missing real
problematic warning backtraces triggered while the test is running, and
the irrelevant backtrace(s) would still clog the kernel log.

Solve the problem by providing a means to identify and suppress specific
warning backtraces while executing test code. Since the new functionality
results in an image size increase of about 1% if CONFIG_KUNIT is enabled,
provide configuration option KUNIT_SUPPRESS_BACKTRACE to be able to disable
the new functionality. This option is by default enabled since almost all
systems with CONFIG_KUNIT enabled will want to benefit from it.

Cc: Dan Carpenter 
Cc: Daniel Diaz 
Cc: Naresh Kamboju 
Cc: Kees Cook 
Tested-by: Linux Kernel Functional Testing 
Acked-by: Dan Carpenter 
Reviewed-by: Kees Cook 
Signed-off-by: Guenter Roeck 
---
v2:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
- Added CONFIG_KUNIT_SUPPRESS_BACKTRACE configuration option,
  enabled by default
v3:
- Rebased to v6.9-rc2

 include/asm-generic/bug.h | 16 +---
 include/kunit/bug.h   | 51 +++
 include/kunit/test.h  |  1 +
 include/linux/bug.h   | 13 ++
 lib/bug.c | 51 ---
 lib/kunit/Kconfig |  9 +++
 lib/kunit/Makefile|  6 +++--
 lib/kunit/bug.c   | 40 ++
 8 files changed, 178 insertions(+), 9 deletions(-)
 create mode 100644 include/kunit/bug.h
 create mode 100644 lib/kunit/bug.c

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 6e794420bd39..c170b6477689 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -18,6 +18,7 @@
 #endif
 
 #ifndef __ASSEMBLY__
+#include 
 #include 
 #include 
 
@@ -39,8 +40,14 @@ struct bug_entry {
 #ifdef CONFIG_DEBUG_BUGVERBOSE
 #ifndef CONFIG_GENERIC_BUG_RELATIVE_POINTERS
const char  *file;
+#ifdef HAVE_BUG_FUNCTION
+   const char  *function;
+#endif
 #else
signed int  file_disp;
+#ifdef HAVE_BUG_FUNCTION
+   signed int  function_disp;
+#endif
 #endif
unsigned short  line;
 #endif
@@ -96,15 +103,18 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 #define __WARN()   __WARN_printf(TAINT_WARN, NULL)
 #define __WARN_printf(taint, arg...) do {  \
instrumentation_begin();\
-   warn_slowpath_fmt(__FILE__, __LINE__, taint, arg);  \
+   if (!IS_SUPPRESSED_WARNING(__func__))   \
+   warn_slowpath_fmt(__FILE__, __LINE__, taint, arg);\
instrumentation_end();  \
} while (0)
 #else
 #define __WARN()   __WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))
 #define __WARN_printf(taint, arg...) do {  \
instrumentation_begin();\
-   __warn_printk(arg); \
-   __WARN_FLAGS(BUGFLAG_NO_CUT_HERE | BUGFLAG_TAINT(taint));\
+   if (!IS_SUPPRESSED_WARNING(__func__)) { \
+   __warn_printk(arg); \
+   __WARN_FLAGS(BUGFLAG_NO_CUT_HERE | BUGFLAG_TAINT(taint));\
+   }   \
instrumentation_end();  \
} while (0)
 #define WARN_ON_ONCE(condition) ({ \
diff --git a/include/kunit/bug.h b/include/kunit/bug.h
new file mode 100644
index ..bd0fe047572b
--- /dev/null
+++ b/include/kunit/bug.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * KUnit helpers for backtrace suppression
+ *
+ * Copyright (c) 2024 Guenter Roeck 
+ */
+
+#ifndef _KUNIT_BUG_H
+#define _KUNIT_BUG_H
+
+#ifndef __ASSEMBLY__
+
+#include 
+
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+
+#include 
+#include 
+
+struct __suppressed_warning {
+   struct list_head node;
+  

[PATCH v3 00/15] Add support for suppressing warning backtraces

2024-04-03 Thread Guenter Roeck
Some unit tests intentionally trigger warning backtraces by passing bad
parameters to kernel API functions. Such unit tests typically check the
return value from such calls, not the existence of the warning backtrace.

Such intentionally generated warning backtraces are neither desirable
nor useful for a number of reasons.
- They can result in overlooked real problems.
- A warning that suddenly starts to show up in unit tests needs to be
  investigated and has to be marked to be ignored, for example by
  adjusting filter scripts. Such filters are ad-hoc because there is
  no real standard format for warnings. On top of that, such filter
  scripts would require constant maintenance.

One option to address the problem would be to add messages such as "expected
warning backtraces start / end here" to the kernel log.  However, that
would again require filter scripts, it might result in missing real
problematic warning backtraces triggered while the test is running, and
the irrelevant backtrace(s) would still clog the kernel log.

Solve the problem by providing a means to identify and suppress specific
warning backtraces while executing test code. Support suppressing multiple
backtraces while at the same time limiting changes to generic code to the
absolute minimum. Architecture specific changes are kept at minimum by
retaining function names only if both CONFIG_DEBUG_BUGVERBOSE and
CONFIG_KUNIT are enabled.

The first patch of the series introduces the necessary infrastructure.
The second patch introduces support for counting suppressed backtraces.
This capability is used in patch three to implement unit tests.
Patch four documents the new API.
The next two patches add support for suppressing backtraces in drm_rect
and dev_addr_lists unit tests. These patches are intended to serve as
examples for the use of the functionality introduced with this series.
The remaining patches implement the necessary changes for all
architectures with GENERIC_BUG support.

With CONFIG_KUNIT enabled, image size increase with this series applied is
approximately 1%. The image size increase (and with it the functionality
introduced by this series) can be avoided by disabling
CONFIG_KUNIT_SUPPRESS_BACKTRACE.

This series is based on the RFC patch and subsequent discussion at
https://patchwork.kernel.org/project/linux-kselftest/patch/02546e59-1afe-4b08-ba81-d94f3b691c9a@moroto.mountain/
and offers a more comprehensive solution of the problem discussed there.

Design note:
  Function pointers are only added to the __bug_table section if both
  CONFIG_KUNIT_SUPPRESS_BACKTRACE and CONFIG_DEBUG_BUGVERBOSE are enabled
  to avoid image size increases if CONFIG_KUNIT is disabled. There would be
  some benefits to adding those pointers all the time (reduced complexity,
  ability to display function names in BUG/WARNING messages). That change,
  if desired, can be made later.

Checkpatch note:
  Remaining checkpatch errors and warnings were deliberately ignored.
  Some are triggered by matching coding style or by comments interpreted
  as code, others by assembler macros which are disliked by checkpatch.
  Suggestions for improvements are welcome.

Changes since RFC:
- Introduced CONFIG_KUNIT_SUPPRESS_BACKTRACE
- Minor cleanups and bug fixes
- Added support for all affected architectures
- Added support for counting suppressed warnings
- Added unit tests using those counters
- Added patch to suppress warning backtraces in dev_addr_lists tests

Changes since v1:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
  [I retained those tags since there have been no functional changes]
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option, enabled by
  default.

Changes since v2:
- Rebased to v6.9-rc2
- Added comments to drm warning suppression explaining why it is needed.
- Added patch to move conditional code in arch/sh/include/asm/bug.h
  to avoid kerneldoc warning
- Added architecture maintainers to Cc: for architecture specific patches
- No functional changes


Guenter Roeck (15):
  bug/kunit: Core support for suppressing warning backtraces
  kunit: bug: Count suppressed warning backtraces
  kunit: Add test cases for backtrace warning suppression
  kunit: Add documentation for warning backtrace suppression API
  drm: Suppress intentional warning backtraces in scaling unit tests
  net: kunit: Suppress lock warning noise at end of dev_addr_lists tests
  x86: Add support for suppressing warning backtraces
  arm64: Add support for suppressing warning backtraces
  loongarch: Add support for suppressing warning backtraces
  parisc: Add support for suppressing warning backtraces
  s390: Add support for suppressing warning backtraces
  sh: Add support for suppressing warning backtraces
  sh: Move defines needed for suppressing warning backtraces
  riscv: Add support for suppressing warning backtraces
  

Re: [PATCH v6 1/2] posix-timers: Prefer delivery of signals to the current thread

2024-04-03 Thread Thomas Gleixner
On Tue, Apr 02 2024 at 10:23, John Stultz wrote:
> On Tue, Apr 2, 2024 at 7:57 AM Thomas Gleixner  wrote:
>> This test in particular exercises new functionality/behaviour, which
>> really has no business to be backported into stable just to make the
>> relevant test usable on older kernels.
>
> That's fair. I didn't have all the context around what motivated the
> change and the follow-on test, which is why I'm asking here.

It's a performance enhancement to avoid waking up idle threads for
signal delivery instead of just delivering it to the currently running
thread which made the CPU timer fire. So it does not qualify as a fix.

>> Why would testing with latest tests against an older kernel be valid per
>> se?
>
> So yeah, it definitely can get fuzzy trying to split hairs between
> when a change in behavior is a "new feature" or a "fix".
>
> Greg could probably articulate it better, but my understanding is the
> main point for running newer tests on older kernels is that newer
> tests will have more coverage of what is expected of the kernel. For
> features that older kernels don't support, ideally the tests will
> check for that functionality like userland applications would, and
> skip that portion of the test if it's unsupported. This way, we're
> able to find issues (important enough to warrant tests having been
> created) that have not yet been patched in the -stable trees.
>
> In this case, there is a behavioral change combined with a compliance
> test, which makes it look a bit more like a fix, rather than a feature
> (additionally the lack of a way for userland to probe for this new
> "feature" makes it seem fix-like).  But the intended result of this is
> just spurring this discussion to see if it makes sense to backport or
> not.  Disabling/ignoring the test (maybe after Thomas' fix to avoid it
> from hanging :) is a fine solution too, but not one I'd want folks to
> do until they've synced with maintainers and had full context.

I was staring at this test because it hangs even on upstream on a
regular basis, at least in a VM. The timeout change I posted prevents the
hang, but still the posixtimer test will not have 0 fails.

The test is fragile as hell, as there is absolutely no guarantee that the
signal target distribution is as expected. The expectation is based on a
statistical assumption which does not really hold.

So I came up with a modified variant of that, which can deduce pretty
reliably that the test runs on an older kernel.

Thanks,

tglx
---
Subject: selftests/timers/posix_timers: Make signal distribution test less fragile
From: Thomas Gleixner 
Date: Mon, 15 May 2023 00:40:10 +0200

The signal distribution test has a tendency to hang for a long time as the
signal delivery is not really evenly distributed. In fact it might never be
distributed across all threads ever in the way it is written.

Address this by:

   1) Adding a timeout which aborts the test

   2) Letting the test threads do a usleep() once they got a signal instead
  of running continuously. That ensures that the other threads will expire
  the timer and get the signal

   3) Adding a detection whether all signals arrived at the main thread,
  which allows running the test on older kernels.

While at it, get rid of the pointless atomic operation on the thread-local
variable in the signal handler.

Signed-off-by: Thomas Gleixner 
---
 tools/testing/selftests/timers/posix_timers.c |   48 +-
 1 file changed, 32 insertions(+), 16 deletions(-)

--- a/tools/testing/selftests/timers/posix_timers.c
+++ b/tools/testing/selftests/timers/posix_timers.c
@@ -184,18 +184,22 @@ static int check_timer_create(int which)
return 0;
 }
 
-int remain;
-__thread int got_signal;
+static int remain;
+static __thread int got_signal;
 
 static void *distribution_thread(void *arg)
 {
-   while (__atomic_load_n(&remain, __ATOMIC_RELAXED));
-   return NULL;
+   while (__atomic_load_n(&remain, __ATOMIC_RELAXED) && !done) {
+   if (got_signal)
+   usleep(10);
+   }
+
+   return (void *)got_signal;
 }
 
 static void distribution_handler(int nr)
 {
-   if (!__atomic_exchange_n(&got_signal, 1, __ATOMIC_RELAXED))
+   if (++got_signal == 1)
 __atomic_fetch_sub(&remain, 1, __ATOMIC_RELAXED);
 }
 
@@ -205,8 +209,6 @@ static void distribution_handler(int nr)
  */
 static int check_timer_distribution(void)
 {
-   int err, i;
-   timer_t id;
const int nthreads = 10;
pthread_t threads[nthreads];
struct itimerspec val = {
@@ -215,7 +217,11 @@ static int check_timer_distribution(void)
.it_interval.tv_sec = 0,
.it_interval.tv_nsec = 1000 * 1000,
};
+   int err, i, nsigs;
+   time_t start, now;
+   timer_t id;
 
+   done = 0;
remain = nthreads + 1;  /* worker threads + this thread */
signal(SIGALRM, distribution_handler);
err = 

Re: [PATCH bpf-next] selftests/bpf: Add F_SETFL for fcntl

2024-04-03 Thread Jakub Sitnicki
Hi Geliang,

On Wed, Apr 03, 2024 at 04:32 PM +08, Geliang Tang wrote:
> From: Geliang Tang 
>
> Incorrect arguments are passed to fcntl() in test_sockmap.c when invoking
> it to set file status flags. If O_NONBLOCK is used as 2nd argument and
> passed into fcntl, -EINVAL will be returned (See do_fcntl() in fs/fcntl.c).
> The correct approach is to use F_SETFL as 2nd argument, and O_NONBLOCK as
> 3rd one.
>
> Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
> Signed-off-by: Geliang Tang 
> ---
>  tools/testing/selftests/bpf/test_sockmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
> index 024a0faafb3b..34d6a1e6f664 100644
> --- a/tools/testing/selftests/bpf/test_sockmap.c
> +++ b/tools/testing/selftests/bpf/test_sockmap.c
> @@ -603,7 +603,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
>   struct timeval timeout;
>   fd_set w;
>  
> - fcntl(fd, fd_flags);
> + fcntl(fd, F_SETFL, fd_flags);
>   /* Account for pop bytes noting each iteration of apply will
>* call msg_pop_data helper so we need to account for this
>* by calculating the number of apply iterations. Note user

Good catch. But we also need to figure out why some tests failing with
this patch applied and fix them in one go:

# 6/ 7  sockmap::txmsg test skb:FAIL
#21/ 7 sockhash::txmsg test skb:FAIL
#36/ 7 sockhash:ktls:txmsg test skb:FAIL
Pass: 42 Fail: 3

I'm seeing this error message when running `test_sockmap`:

detected skb data error with skb ingress update @iov[0]:0 "00 00 00 00" != "PASS"
data verify msg failed: Unknown error -5
rx thread exited with err 1.

I'd also:
- add an error check for fnctl, so we don't regress,
- get rid of fd_flags, pass O_NONBLOCK flag directly to fnctl.

Thanks,
-jkbs



Re: [PATCH bpf-next v3] selftests/bpf: Move test_dev_cgroup to prog_tests

2024-04-03 Thread Muhammad Usama Anjum
On 4/3/24 7:36 AM, Yonghong Song wrote:
> 
> On 4/2/24 8:16 AM, Muhammad Usama Anjum wrote:
>> Yonghong Song,
>>
>> Thank you so much for replying. I was missing how to run pipeline manually.
>> Thanks a ton.
>>
>> On 4/1/24 11:53 PM, Yonghong Song wrote:
>>> On 4/1/24 5:34 AM, Muhammad Usama Anjum wrote:
 Move test_dev_cgroup.c to prog_tests/dev_cgroup.c to be able to run it
 with test_progs. Replace dev_cgroup.bpf.o with skel header file,
 dev_cgroup.skel.h, and load the program from it accordingly.

     ./test_progs -t dev_cgroup
     mknod: /tmp/test_dev_cgroup_null: Operation not permitted
     64+0 records in
     64+0 records out
     32768 bytes (33 kB, 32 KiB) copied, 0.000856684 s, 38.2 MB/s
     dd: failed to open '/dev/full': Operation not permitted
     dd: failed to open '/dev/random': Operation not permitted
     #72 test_dev_cgroup:OK
     Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
 Signed-off-by: Muhammad Usama Anjum 
 ---
 Changes since v2:
 - Replace test_dev_cgroup with serial_test_dev_cgroup as there is a
     probability that the test is racing against another cgroup test
 - Minor changes to the commit message above

 I've tested the patch with vmtest.sh on bpf-next/for-next and linux
 next. It is passing on both. Not sure why it was failed on BPFCI.
 Test run with vmtest.h:
 sudo LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh
 ./test_progs -t dev_cgroup
 ./test_progs -t dev_cgroup
 mknod: /tmp/test_dev_cgroup_null: Operation not permitted
 64+0 records in
 64+0 records out
 32768 bytes (33 kB, 32 KiB) copied, 0.000403432 s, 81.2 MB/s
 dd: failed to open '/dev/full': Operation not permitted
 dd: failed to open '/dev/random': Operation not permitted
    #69  dev_cgroup:OK
 Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>>> The CI failure:
>>>
>>>
>>> Error: #72 dev_cgroup
>>> serial_test_dev_cgroup:PASS:skel_open_and_load 0 nsec
>>> serial_test_dev_cgroup:PASS:cgroup_setup_and_join 0 nsec
>>> serial_test_dev_cgroup:PASS:bpf_attach 0 nsec
>>> serial_test_dev_cgroup:PASS:bpf_query 0 nsec
>>> serial_test_dev_cgroup:PASS:bpf_query 0 nsec
>>> serial_test_dev_cgroup:PASS:rm 0 nsec
>>> serial_test_dev_cgroup:PASS:mknod 0 nsec
>>> serial_test_dev_cgroup:PASS:rm 0 nsec
>>> serial_test_dev_cgroup:PASS:rm 0 nsec
>>> serial_test_dev_cgroup:FAIL:mknod unexpected mknod: actual 256 !=
>>> expected 0
>>> serial_test_dev_cgroup:PASS:rm 0 nsec
>>> serial_test_dev_cgroup:PASS:dd 0 nsec
>>> serial_test_dev_cgroup:PASS:dd 0 nsec
>>> serial_test_dev_cgroup:PASS:dd 0 nsec
>>>
>>> (cgroup_helpers.c:353: errno: Device or resource busy) umount cgroup2
>>>
>>> The error code 256 means mknod execution has some issues. Maybe you need to
>>> find the specific errno to find out what is going on. I think you can do a CI
>>> on-demand test to debug.
>> errno is 2 --> No such file or directory
>>
>> Locally I'm unable to reproduce it unless I remove
>> rm -f /tmp/test_dev_cgroup_zero so that the /tmp/test_dev_cgroup_zero
>> node is already present before test execution. The error code is 256 with errno 2.
>> I'm debugging by placing system("ls /tmp 1>&2"); to find out which files
>> are already present in /tmp. But ls's output doesn't appear on the CI logs.
> 
> errno 2 means ENOENT.
> From the mknod man page (https://linux.die.net/man/2/mknod), it means
>   A directory component in pathname does not exist or is a dangling
>   symbolic link.
> 
> It means /tmp does not exist or is a dangling symbolic link.
> It is indeed very strange. To make the test robust, maybe creating a temp
> directory with mkdtemp and use it as the path? The temp directory
> creation should be done before bpf prog attach.
I've tried the following but still no luck:
* /tmp is already present. Then I thought maybe the desired file is already
present. I've verified that no file of the same name is present inside
/tmp.
* I thought maybe mknod isn't present in the system. But mknod --help succeeds.
* I switched from /tmp to the current directory to create the node. But the
result is the same error.
* I've tried to use the same kernel config as the BPF CI is using. I'm not
able to reproduce it.

Not sure which edge case is being hit or what's going on. The problem appears
to be caused by some limitation in the rootfs.

-- 
BR,
Muhammad Usama Anjum



Re: Re: [PATCH 1/2] cgroup/cpuset: Make cpuset hotplug processing synchronous

2024-04-03 Thread Michal Koutný
On Tue, Apr 02, 2024 at 11:30:11AM -0400, Waiman Long wrote:
> Yes, there is a potential that a cpus_read_lock() may be called leading to
> deadlock. So unless we reverse the current cgroup_mutex --> cpu_hotplug_lock
> ordering, it is not safe to call cgroup_transfer_tasks() directly.

I see that cgroup_transfer_tasks() has the only user -- cpuset. What
about bending it for the specific use like:

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 34aaf0e87def..64deb7212c5c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -109,7 +109,7 @@ struct cgroup *cgroup_get_from_fd(int fd);
 struct cgroup *cgroup_v1v2_get_from_fd(int fd);
 
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
-int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
+int cgroup_transfer_tasks_locked(struct cgroup *to, struct cgroup *from);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 520a11cb12f4..f97025858c7a 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -91,7 +91,8 @@ EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
  *
  * Return: %0 on success or a negative errno code on failure
  */
-int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
+int cgroup_transfer_tasks_locked(struct cgroup *to, struct cgroup *from)
 {
DEFINE_CGROUP_MGCTX(mgctx);
struct cgrp_cset_link *link;
@@ -106,9 +106,11 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
if (ret)
return ret;
 
-   cgroup_lock();
-
-   cgroup_attach_lock(true);
+   /* The locking rules serve specific purpose of v1 cpuset hotplug
+* migration, see hotplug_update_tasks_legacy() and
+* cgroup_attach_lock() */
+   lockdep_assert_held(&cgroup_mutex);
+   lockdep_assert_cpus_held();
+   percpu_down_write(&cgroup_threadgroup_rwsem);
 
/* all tasks in @from are being moved, all csets are source */
 spin_lock_irq(&css_set_lock);
@@ -144,8 +146,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
} while (task && !ret);
 out_err:
cgroup_migrate_finish();
-   cgroup_attach_unlock(true);
-   cgroup_unlock();
+   percpu_up_write(&cgroup_threadgroup_rwsem);
return ret;
 }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 13d27b17c889..94fb8b26f038 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4331,7 +4331,7 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
nodes_empty(parent->mems_allowed))
parent = parent_cs(parent);
 
-   if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) {
+   if (cgroup_transfer_tasks_locked(parent->css.cgroup, cs->css.cgroup)) {
pr_err("cpuset: failed to transfer tasks out of empty cpuset ");
pr_cont_cgroup_name(cs->css.cgroup);
pr_cont("\n");
@@ -4376,21 +4376,9 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 
/*
 * Move tasks to the nearest ancestor with execution resources,
-* This is full cgroup operation which will also call back into
-* cpuset. Execute it asynchronously using workqueue.
 */
-   if (is_empty && css_tryget_online(&cs->css)) {
-   struct cpuset_remove_tasks_struct *s;
-
-   s = kzalloc(sizeof(*s), GFP_KERNEL);
-   if (WARN_ON_ONCE(!s)) {
-   css_put(&cs->css);
-   return;
-   }
-
-   s->cs = cs;
-   INIT_WORK(&s->work, cpuset_migrate_tasks_workfn);
-   schedule_work(&s->work);
+   if (is_empty)
+   remove_tasks_in_empty_cpuset(cs);
}
 }
 


signature.asc
Description: PGP signature


Re: [PATCH net-next 7/7] testing: net-drv: add a driver test for stats reporting

2024-04-03 Thread Petr Machata


Jakub Kicinski  writes:

> On Wed, 3 Apr 2024 00:04:14 +0200 Petr Machata wrote:
>> > Yes, I was wondering about that. It must be doable, IIRC 
>> > the multi-threading API "injects" args from a tuple.
>> > I was thinking something along the lines of:
>> >
>> > with NetDrvEnv(__file__) as cfg:
>> > ksft_run([check_pause, check_fec, pkt_byte_sum],
>> >  args=(cfg, ))
>> >
>> > I got lazy, let me take a closer look. Another benefit
>> > will be that once we pass in "env" / cfg - we can "register" 
>> > objects in there for auto-cleanup (in the future, current
>> > tests don't need cleanup)  
>> 
>> Yeah, though some of those should probably just be their own context
>> managers IMHO, not necessarily hooked to cfg. I'm thinking something
>> fairly general, so that the support boilerplate doesn't end up costing
>> an arm and leg:
>> 
>> with build("ip route add 192.0.2.1/28 nexthop via 192.0.2.17",
>>"ip route del 192.0.2.1/28"),
>>  build("ip link set dev %s master %s" % (swp1, h1),
>>"ip link set dev %s nomaster" % swp1):
>> le_test()
>>
>> Dunno. I guess it makes sense to have some of the common stuff
>> predefined, e.g. "with vrf() as h1". And then the stuff that's typically
>> in lib.sh's setup() and cleanup(), can be losslessly hooked up to cfg.
>
> I was thinking of something along the lines of:
>
> def test_abc(cfg):
> cfg.build("ip route add 192.0.2.1/28 nexthop via 192.0.2.17",
>   "ip route del 192.0.2.1/28")
> cfg.build("ip link set dev %s master %s" % (swp1, h1),
>   "ip link set dev %s nomaster" % swp1)
>
> optionally we could then also:
>
>  thing = cfg.build("ip link set dev %s master %s" % (swp1, h1),
>"ip link set dev %s nomaster" % swp1)
>
>  # ... some code which may raise ...
>
>  # unlink to do something else with the device
>  del thing
>  # ... more code ... 
>
> cfg may not be best here, could be cleaner to create a "test" object,
> always pass it in as the first param, and destroy it after each test.

I assume above you mean that cfg inherits the thing, but cfg lifetime
currently looks like it spreads across several test cases. ksft_run()
would need to know about it and call something to issue the postponed
cleanups between cases.

Also, it's not clear what "del thing" should do in that context, because
if cfg also keeps a reference, __del__ won't get called. There could be
a direct method, like thing.exit() or whatever, but then you need
bookkeeping so as not to clean up the second time through cfg. It's the
less straightforward way of going about it IMHO.

I know that I must sound like a broken record at this point, but look:

with build("ip link set dev %s master %s" % (swp1, h1),
   "ip link set dev %s nomaster" % swp1) as thing:
... some code which may raise ...
... more code, interface detached, `thing' gone ...

It's just as concise, makes it very clear where the device is part of
the bridge and where not anymore, and does away with the intricacies of
lifetime management.

If lifetimes don't nest, I think it's just going to be ugly either way.
But I don't think this comes up often.

I don't really see stuff that you could just throw at cfg to keep track
of, apart from the suite configuration (i.e. topology set up). But I
suppose if it comes up, we can do something like:

thing = cfg.retain(build(..., ...))

Or maybe have a dedicated retainer object, or whatever, it doesn't
necessarily need to be cfg itself.

>> This is what I ended up gravitating towards after writing a handful of
>> LNST tests anyway. The scoping makes it clear where the object exists,
>> lifetime is taken care of, it's all ponies rainbows basically. At least
>> as long as your object lifetimes can be cleanly nested, which admittedly
>> is not always.
>
> Should be fairly easy to support all cases - "with", "recording on
> cfg/test" and del.  Unfortunately in the two tests I came up with

Yup.

> quickly for this series cleanup is only needed for the env itself.
> It's a bit awkward to add the lifetime helpers without any users.

Yeah. I'm basically delving in this now to kinda try and steer future
expectations.



[PATCH bpf-next] selftests/bpf: Add F_SETFL for fcntl

2024-04-03 Thread Geliang Tang
From: Geliang Tang 

Incorrect arguments are passed to fcntl() in test_sockmap.c when invoking
it to set file status flags. If O_NONBLOCK is used as 2nd argument and
passed into fcntl, -EINVAL will be returned (See do_fcntl() in fs/fcntl.c).
The correct approach is to use F_SETFL as 2nd argument, and O_NONBLOCK as
3rd one.

Fixes: 16962b2404ac ("bpf: sockmap, add selftests")
Signed-off-by: Geliang Tang 
---
 tools/testing/selftests/bpf/test_sockmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index 024a0faafb3b..34d6a1e6f664 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -603,7 +603,7 @@ static int msg_loop(int fd, int iov_count, int iov_length, 
int cnt,
struct timeval timeout;
fd_set w;
 
-   fcntl(fd, fd_flags);
+   fcntl(fd, F_SETFL, fd_flags);
/* Account for pop bytes noting each iteration of apply will
 * call msg_pop_data helper so we need to account for this
 * by calculating the number of apply iterations. Note user
-- 
2.40.1




[PATCH v5 22/22] KVM: riscv: selftests: Add a test for counter overflow

2024-04-03 Thread Atish Patra
Add a test for verifying overflow interrupt. Currently, it relies on
overflow support on cycle/instret events. This test works for cycle/
instret events which support sampling via hpmcounters on the platform.
There are no ISA extensions to detect if a platform supports that. Thus,
this test will fail on platforms that support virtualization but don't
support overflow on these two events.

Reviewed-by: Anup Patel 
Signed-off-by: Atish Patra 
---
 .../selftests/kvm/riscv/sbi_pmu_test.c| 114 ++
 1 file changed, 114 insertions(+)

diff --git a/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c b/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c
index 7d195be5c3d9..451db956b885 100644
--- a/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c
+++ b/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c
@@ -14,6 +14,7 @@
 #include "test_util.h"
 #include "processor.h"
 #include "sbi.h"
+#include "arch_timer.h"
 
 /* Maximum counters(firmware + hardware) */
 #define RISCV_MAX_PMU_COUNTERS 64
@@ -24,6 +25,9 @@ union sbi_pmu_ctr_info ctrinfo_arr[RISCV_MAX_PMU_COUNTERS];
 static void *snapshot_gva;
 static vm_paddr_t snapshot_gpa;
 
+static int vcpu_shared_irq_count;
+static int counter_in_use;
+
 /* Cache the available counters in a bitmask */
 static unsigned long counter_mask_available;
 
@@ -117,6 +121,31 @@ static void guest_illegal_exception_handler(struct ex_regs *regs)
regs->epc += 4;
 }
 
+static void guest_irq_handler(struct ex_regs *regs)
+{
+   unsigned int irq_num = regs->cause & ~CAUSE_IRQ_FLAG;
+   struct riscv_pmu_snapshot_data *snapshot_data = snapshot_gva;
+   unsigned long overflown_mask;
+   unsigned long counter_val = 0;
+
+   /* Validate that we are in the correct irq handler */
+   GUEST_ASSERT_EQ(irq_num, IRQ_PMU_OVF);
+
+   /* Stop all counters first to avoid further interrupts */
+   stop_counter(counter_in_use, SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT);
+
+   csr_clear(CSR_SIP, BIT(IRQ_PMU_OVF));
+
+   overflown_mask = READ_ONCE(snapshot_data->ctr_overflow_mask);
+   GUEST_ASSERT(overflown_mask & 0x01);
+
+   WRITE_ONCE(vcpu_shared_irq_count, vcpu_shared_irq_count+1);
+
+   counter_val = READ_ONCE(snapshot_data->ctr_values[0]);
+   /* Now start the counter to mimic the real driver behavior */
+   start_counter(counter_in_use, SBI_PMU_START_FLAG_SET_INIT_VALUE, counter_val);
+}
+
static unsigned long get_counter_index(unsigned long cbase, unsigned long cmask,
   unsigned long cflags,
   unsigned long event)
@@ -276,6 +305,33 @@ static void test_pmu_event_snapshot(unsigned long event)
stop_reset_counter(counter, 0);
 }
 
+static void test_pmu_event_overflow(unsigned long event)
+{
+   unsigned long counter;
+   unsigned long counter_value_post;
+   unsigned long counter_init_value = ULONG_MAX - 1;
+   struct riscv_pmu_snapshot_data *snapshot_data = snapshot_gva;
+
+   counter = get_counter_index(0, counter_mask_available, 0, event);
+   counter_in_use = counter;
+
+   /* The counter value is updated w.r.t relative index of cbase passed to start/stop */
+   WRITE_ONCE(snapshot_data->ctr_values[0], counter_init_value);
+   start_counter(counter, SBI_PMU_START_FLAG_INIT_SNAPSHOT, 0);
+   dummy_func_loop(1);
+   udelay(msecs_to_usecs(2000));
+   /* irq handler should have stopped the counter */
+   stop_counter(counter, SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT);
+
+   counter_value_post = READ_ONCE(snapshot_data->ctr_values[0]);
+   /* The counter value after stopping should be less than the init value due to overflow */
+   __GUEST_ASSERT(counter_value_post < counter_init_value,
+  "counter_value_post %lx counter_init_value %lx for counter\n",
+  counter_value_post, counter_init_value);
+
+   stop_reset_counter(counter, 0);
+}
+
 static void test_invalid_event(void)
 {
struct sbiret ret;
@@ -366,6 +422,34 @@ static void test_pmu_events_snaphost(void)
GUEST_DONE();
 }
 
+static void test_pmu_events_overflow(void)
+{
+   int num_counters = 0;
+
+   /* Verify presence of SBI PMU and minimum required SBI version */
+   verify_sbi_requirement_assert();
+
+   snapshot_set_shmem(snapshot_gpa, 0);
+   csr_set(CSR_IE, BIT(IRQ_PMU_OVF));
+   local_irq_enable();
+
+   /* Get the counter details */
+   num_counters = get_num_counters();
+   update_counter_info(num_counters);
+
+   /*
+* Qemu supports overflow for cycle/instruction.
+* This test may fail on any platform that does not support overflow for these two events.
+*/
+   test_pmu_event_overflow(SBI_PMU_HW_CPU_CYCLES);
+   GUEST_ASSERT_EQ(vcpu_shared_irq_count, 1);
+
+   test_pmu_event_overflow(SBI_PMU_HW_INSTRUCTIONS);
+   GUEST_ASSERT_EQ(vcpu_shared_irq_count, 2);
+
+   GUEST_DONE();
+}
+
 static 

[PATCH v5 21/22] KVM: riscv: selftests: Add a test for PMU snapshot functionality

2024-04-03 Thread Atish Patra
Verify PMU snapshot functionality by setting up the shared memory
correctly and reading the counter values from the shared memory
instead of the CSR.

Reviewed-by: Anup Patel 
Signed-off-by: Atish Patra 
---
 .../testing/selftests/kvm/include/riscv/sbi.h |  25 
 .../selftests/kvm/lib/riscv/processor.c   |  12 ++
 .../selftests/kvm/riscv/sbi_pmu_test.c| 127 ++
 3 files changed, 164 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/riscv/sbi.h b/tools/testing/selftests/kvm/include/riscv/sbi.h
index 6675ca673c77..8c98bd99d450 100644
--- a/tools/testing/selftests/kvm/include/riscv/sbi.h
+++ b/tools/testing/selftests/kvm/include/riscv/sbi.h
@@ -8,6 +8,12 @@
 #ifndef SELFTEST_KVM_SBI_H
 #define SELFTEST_KVM_SBI_H
 
+/* SBI spec version fields */
+#define SBI_SPEC_VERSION_DEFAULT   0x1
+#define SBI_SPEC_VERSION_MAJOR_SHIFT   24
+#define SBI_SPEC_VERSION_MAJOR_MASK0x7f
+#define SBI_SPEC_VERSION_MINOR_MASK0xff
+
 /* SBI return error codes */
 #define SBI_SUCCESS 0
 #define SBI_ERR_FAILURE-1
@@ -33,6 +39,9 @@ enum sbi_ext_id {
 };
 
 enum sbi_ext_base_fid {
+   SBI_EXT_BASE_GET_SPEC_VERSION = 0,
+   SBI_EXT_BASE_GET_IMP_ID,
+   SBI_EXT_BASE_GET_IMP_VERSION,
SBI_EXT_BASE_PROBE_EXT = 3,
 };
 enum sbi_ext_pmu_fid {
@@ -60,6 +69,12 @@ union sbi_pmu_ctr_info {
};
 };
 
+struct riscv_pmu_snapshot_data {
+   u64 ctr_overflow_mask;
+   u64 ctr_values[64];
+   u64 reserved[447];
+};
+
 struct sbiret {
long error;
long value;
@@ -113,4 +128,14 @@ struct sbiret sbi_ecall(int ext, int fid, unsigned long 
arg0,
 
 bool guest_sbi_probe_extension(int extid, long *out_val);
 
+/* Make SBI version */
+static inline unsigned long sbi_mk_version(unsigned long major,
+   unsigned long minor)
+{
+   return ((major & SBI_SPEC_VERSION_MAJOR_MASK) <<
+   SBI_SPEC_VERSION_MAJOR_SHIFT) | minor;
+}
+
+unsigned long get_host_sbi_spec_version(void);
+
 #endif /* SELFTEST_KVM_SBI_H */
diff --git a/tools/testing/selftests/kvm/lib/riscv/processor.c b/tools/testing/selftests/kvm/lib/riscv/processor.c
index e8211f5d6863..ccb35573749c 100644
--- a/tools/testing/selftests/kvm/lib/riscv/processor.c
+++ b/tools/testing/selftests/kvm/lib/riscv/processor.c
@@ -502,3 +502,15 @@ bool guest_sbi_probe_extension(int extid, long *out_val)
 
return true;
 }
+
+unsigned long get_host_sbi_spec_version(void)
+{
+   struct sbiret ret;
+
+   ret = sbi_ecall(SBI_EXT_BASE, SBI_EXT_BASE_GET_SPEC_VERSION, 0,
+  0, 0, 0, 0, 0);
+
+   GUEST_ASSERT(!ret.error);
+
+   return ret.value;
+}
diff --git a/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c b/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c
index 8e7c7a3172d8..7d195be5c3d9 100644
--- a/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c
+++ b/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c
@@ -19,6 +19,11 @@
 #define RISCV_MAX_PMU_COUNTERS 64
 union sbi_pmu_ctr_info ctrinfo_arr[RISCV_MAX_PMU_COUNTERS];
 
+/* Snapshot shared memory data */
+#define PMU_SNAPSHOT_GPA_BASE  BIT(30)
+static void *snapshot_gva;
+static vm_paddr_t snapshot_gpa;
+
 /* Cache the available counters in a bitmask */
 static unsigned long counter_mask_available;
 
@@ -178,6 +183,32 @@ static unsigned long read_counter(int idx, union sbi_pmu_ctr_info ctrinfo)
return counter_val;
 }
 
+static inline void verify_sbi_requirement_assert(void)
+{
+   long out_val = 0;
+   bool probe;
+
+   probe = guest_sbi_probe_extension(SBI_EXT_PMU, &out_val);
+   GUEST_ASSERT(probe && out_val == 1);
+
+   if (get_host_sbi_spec_version() < sbi_mk_version(2, 0))
+   __GUEST_ASSERT(0, "SBI implementation version doesn't support PMU Snapshot");
+}
+
+static void snapshot_set_shmem(vm_paddr_t gpa, unsigned long flags)
+{
+   unsigned long lo = (unsigned long)gpa;
+#if __riscv_xlen == 32
+   unsigned long hi = (unsigned long)(gpa >> 32);
+#else
+   unsigned long hi = gpa == -1 ? -1 : 0;
+#endif
+   struct sbiret ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
+ lo, hi, flags, 0, 0, 0);
+
+   GUEST_ASSERT(ret.value == 0 && ret.error == 0);
+}
+
 static void test_pmu_event(unsigned long event)
 {
unsigned long counter;
@@ -210,6 +241,41 @@ static void test_pmu_event(unsigned long event)
stop_reset_counter(counter, 0);
 }
 
+static void test_pmu_event_snapshot(unsigned long event)
+{
+   unsigned long counter;
+   unsigned long counter_value_pre, counter_value_post;
+   unsigned long counter_init_value = 100;
+   struct riscv_pmu_snapshot_data *snapshot_data = snapshot_gva;
+
+   counter = get_counter_index(0, counter_mask_available, 0, event);
+   counter_value_pre = read_counter(counter, ctrinfo_arr[counter]);
+
+   /* Do not set the 
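The snapshot_set_shmem() helper above splits the 64-bit shared-memory GPA into two XLEN-sized arguments before the ecall. A minimal host-side sketch of the 32-bit split (the helper name `split_gpa32` is made up for illustration and is not part of the patch):

```c
#include <stdint.h>

/* Split a 64-bit guest physical address into the lo/hi halves that
 * SBI_EXT_PMU_SNAPSHOT_SET_SHMEM expects when XLEN is 32. */
static void split_gpa32(uint64_t gpa, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)gpa;         /* low 32 bits */
	*hi = (uint32_t)(gpa >> 32); /* high 32 bits */
}
```

On 64-bit builds the patch instead passes the full GPA in the low argument and uses the high word only to forward the all-ones "disable" value.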

[PATCH v5 20/22] KVM: riscv: selftests: Add SBI PMU selftest

2024-04-03 Thread Atish Patra
This test implements basic sanity test and cycle/instret event
counting tests.

Reviewed-by: Anup Patel 
Signed-off-by: Atish Patra 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../selftests/kvm/riscv/sbi_pmu_test.c| 340 ++
 2 files changed, 341 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/riscv/sbi_pmu_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 741c7dc16afc..1cfcd2797ee4 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -189,6 +189,7 @@ TEST_GEN_PROGS_s390x += rseq_test
 TEST_GEN_PROGS_s390x += set_memory_region_test
 TEST_GEN_PROGS_s390x += kvm_binary_stats_test
 
+TEST_GEN_PROGS_riscv += riscv/sbi_pmu_test
 TEST_GEN_PROGS_riscv += arch_timer
 TEST_GEN_PROGS_riscv += demand_paging_test
 TEST_GEN_PROGS_riscv += dirty_log_test
diff --git a/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c b/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c
new file mode 100644
index ..8e7c7a3172d8
--- /dev/null
+++ b/tools/testing/selftests/kvm/riscv/sbi_pmu_test.c
@@ -0,0 +1,340 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * sbi_pmu_test.c - Tests the riscv64 SBI PMU functionality.
+ *
+ * Copyright (c) 2024, Rivos Inc.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "kvm_util.h"
+#include "test_util.h"
+#include "processor.h"
+#include "sbi.h"
+
+/* Maximum counters(firmware + hardware) */
+#define RISCV_MAX_PMU_COUNTERS 64
+union sbi_pmu_ctr_info ctrinfo_arr[RISCV_MAX_PMU_COUNTERS];
+
+/* Cache the available counters in a bitmask */
+static unsigned long counter_mask_available;
+
+unsigned long pmu_csr_read_num(int csr_num)
+{
+#define switchcase_csr_read(__csr_num, __val)  {\
+   case __csr_num: \
+   __val = csr_read(__csr_num);\
+   break; }
+#define switchcase_csr_read_2(__csr_num, __val){\
+   switchcase_csr_read(__csr_num + 0, __val)\
+   switchcase_csr_read(__csr_num + 1, __val)}
+#define switchcase_csr_read_4(__csr_num, __val){\
+   switchcase_csr_read_2(__csr_num + 0, __val)  \
+   switchcase_csr_read_2(__csr_num + 2, __val)}
+#define switchcase_csr_read_8(__csr_num, __val){\
+   switchcase_csr_read_4(__csr_num + 0, __val)  \
+   switchcase_csr_read_4(__csr_num + 4, __val)}
+#define switchcase_csr_read_16(__csr_num, __val)   {\
+   switchcase_csr_read_8(__csr_num + 0, __val)  \
+   switchcase_csr_read_8(__csr_num + 8, __val)}
+#define switchcase_csr_read_32(__csr_num, __val)   {\
+   switchcase_csr_read_16(__csr_num + 0, __val) \
+   switchcase_csr_read_16(__csr_num + 16, __val)}
+
+   unsigned long ret = 0;
+
+   switch (csr_num) {
+   switchcase_csr_read_32(CSR_CYCLE, ret)
+   switchcase_csr_read_32(CSR_CYCLEH, ret)
+   default :
+   break;
+   }
+
+   return ret;
+#undef switchcase_csr_read_32
+#undef switchcase_csr_read_16
+#undef switchcase_csr_read_8
+#undef switchcase_csr_read_4
+#undef switchcase_csr_read_2
+#undef switchcase_csr_read
+}
+
+static inline void dummy_func_loop(uint64_t iter)
+{
+   int i = 0;
+
+   while (i < iter) {
+   asm volatile("nop");
+   i++;
+   }
+}
+
+static void start_counter(unsigned long counter, unsigned long start_flags,
+ unsigned long ival)
+{
+   struct sbiret ret;
+
+   ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START, counter, 1, start_flags,
+   ival, 0, 0);
+   __GUEST_ASSERT(ret.error == 0, "Unable to start counter %ld\n", counter);
+}
+
+/* This should be invoked only for reset counter use case */
+static void stop_reset_counter(unsigned long counter, unsigned long stop_flags)
+{
+   struct sbiret ret;
+
+   ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, counter, 1,
+   stop_flags | SBI_PMU_STOP_FLAG_RESET, 0, 0, 0);
+   __GUEST_ASSERT(ret.error == SBI_ERR_ALREADY_STOPPED,
+  "Unable to stop counter %ld\n", counter);
+}
+
+static void stop_counter(unsigned long counter, unsigned long stop_flags)
+{
+   struct sbiret ret;
+
+   ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, counter, 1, stop_flags,
+   0, 0, 0);
+   __GUEST_ASSERT(ret.error == 0, "Unable to stop counter %ld error %ld\n",
+  counter, ret.error);
+}
+
+static void guest_illegal_exception_handler(struct ex_regs *regs)
+{
+   __GUEST_ASSERT(regs->cause == EXC_INST_ILLEGAL,
+  "Unexpected exception handler %lx\n", regs->cause);
+
+   /* skip the trapping instruction */
+   regs->epc += 4;
+}
+
+static unsigned long get_counter_index(unsigned long cbase, unsigned long cmask,
+   
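The pmu_csr_read_num() helper above builds a binary tree of case labels because csr_read() requires a compile-time-constant CSR number, while the counter index only arrives at runtime. The same macro pattern can be sketched in portable C, dispatching a runtime index into an array standing in for the CSRs (all names here are illustrative, not from the patch):

```c
/* Each macro level doubles the number of generated case labels; the
 * case labels attach to the enclosing switch even though they sit
 * inside nested compound statements, exactly as in the selftest. */
#define CASE_READ(n, regs, val)   { case n: (val) = (regs)[n]; break; }
#define CASE_READ_2(n, regs, val) { CASE_READ(n + 0, regs, val) CASE_READ(n + 1, regs, val) }
#define CASE_READ_4(n, regs, val) { CASE_READ_2(n + 0, regs, val) CASE_READ_2(n + 2, regs, val) }

/* Read "register" number num, or 0 if it is out of range. */
static unsigned long read_reg_num(const unsigned long *regs, int num)
{
	unsigned long ret = 0;

	switch (num) {
	CASE_READ_4(0, regs, ret)
	default:
		break;
	}

	return ret;
}
```

The payoff is that each expanded `case` body contains a constant index, which is what lets the real helper feed csr_read() a constant CSR number.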

[PATCH v5 19/22] KVM: riscv: selftests: Add SBI PMU extension definitions

2024-04-03 Thread Atish Patra
The SBI PMU extension definition is required for upcoming SBI PMU
selftests.

Reviewed-by: Anup Patel 
Signed-off-by: Atish Patra 
---
 .../testing/selftests/kvm/include/riscv/sbi.h | 66 +++
 1 file changed, 66 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/riscv/sbi.h b/tools/testing/selftests/kvm/include/riscv/sbi.h
index ba04f2dec7b5..6675ca673c77 100644
--- a/tools/testing/selftests/kvm/include/riscv/sbi.h
+++ b/tools/testing/selftests/kvm/include/riscv/sbi.h
@@ -29,17 +29,83 @@
 enum sbi_ext_id {
SBI_EXT_BASE = 0x10,
SBI_EXT_STA = 0x535441,
+   SBI_EXT_PMU = 0x504D55,
 };
 
 enum sbi_ext_base_fid {
SBI_EXT_BASE_PROBE_EXT = 3,
 };
+enum sbi_ext_pmu_fid {
+   SBI_EXT_PMU_NUM_COUNTERS = 0,
+   SBI_EXT_PMU_COUNTER_GET_INFO,
+   SBI_EXT_PMU_COUNTER_CFG_MATCH,
+   SBI_EXT_PMU_COUNTER_START,
+   SBI_EXT_PMU_COUNTER_STOP,
+   SBI_EXT_PMU_COUNTER_FW_READ,
+   SBI_EXT_PMU_COUNTER_FW_READ_HI,
+   SBI_EXT_PMU_SNAPSHOT_SET_SHMEM,
+};
+
+union sbi_pmu_ctr_info {
+   unsigned long value;
+   struct {
+   unsigned long csr:12;
+   unsigned long width:6;
+#if __riscv_xlen == 32
+   unsigned long reserved:13;
+#else
+   unsigned long reserved:45;
+#endif
+   unsigned long type:1;
+   };
+};
 
 struct sbiret {
long error;
long value;
 };
 
+/** General pmu event codes specified in SBI PMU extension */
+enum sbi_pmu_hw_generic_events_t {
+   SBI_PMU_HW_NO_EVENT = 0,
+   SBI_PMU_HW_CPU_CYCLES   = 1,
+   SBI_PMU_HW_INSTRUCTIONS = 2,
+   SBI_PMU_HW_CACHE_REFERENCES = 3,
+   SBI_PMU_HW_CACHE_MISSES = 4,
+   SBI_PMU_HW_BRANCH_INSTRUCTIONS  = 5,
+   SBI_PMU_HW_BRANCH_MISSES= 6,
+   SBI_PMU_HW_BUS_CYCLES   = 7,
+   SBI_PMU_HW_STALLED_CYCLES_FRONTEND  = 8,
+   SBI_PMU_HW_STALLED_CYCLES_BACKEND   = 9,
+   SBI_PMU_HW_REF_CPU_CYCLES   = 10,
+
+   SBI_PMU_HW_GENERAL_MAX,
+};
+
+/* SBI PMU counter types */
+enum sbi_pmu_ctr_type {
+   SBI_PMU_CTR_TYPE_HW = 0x0,
+   SBI_PMU_CTR_TYPE_FW,
+};
+
+/* Flags defined for config matching function */
+#define SBI_PMU_CFG_FLAG_SKIP_MATCHBIT(0)
+#define SBI_PMU_CFG_FLAG_CLEAR_VALUE   BIT(1)
+#define SBI_PMU_CFG_FLAG_AUTO_STARTBIT(2)
+#define SBI_PMU_CFG_FLAG_SET_VUINH BIT(3)
+#define SBI_PMU_CFG_FLAG_SET_VSINH BIT(4)
+#define SBI_PMU_CFG_FLAG_SET_UINH  BIT(5)
+#define SBI_PMU_CFG_FLAG_SET_SINH  BIT(6)
+#define SBI_PMU_CFG_FLAG_SET_MINH  BIT(7)
+
+/* Flags defined for counter start function */
+#define SBI_PMU_START_FLAG_SET_INIT_VALUE BIT(0)
+#define SBI_PMU_START_FLAG_INIT_SNAPSHOT BIT(1)
+
+/* Flags defined for counter stop function */
+#define SBI_PMU_STOP_FLAG_RESET BIT(0)
+#define SBI_PMU_STOP_FLAG_TAKE_SNAPSHOT BIT(1)
+
 struct sbiret sbi_ecall(int ext, int fid, unsigned long arg0,
unsigned long arg1, unsigned long arg2,
unsigned long arg3, unsigned long arg4,
-- 
2.34.1
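As a cross-check of the layout added above, union sbi_pmu_ctr_info packs csr in bits [11:0], width in bits [17:12], and type in the top bit. A sketch decoding a raw info word with explicit shifts rather than C bitfields (assuming 64-bit XLEN; the macro names are made up for illustration):

```c
#include <stdint.h>

/* Field extraction mirroring union sbi_pmu_ctr_info for __riscv_xlen == 64:
 * csr in bits [11:0], width in bits [17:12], type in bit 63. */
#define CTR_INFO_CSR(v)   ((uint64_t)(v) & 0xfffULL)
#define CTR_INFO_WIDTH(v) (((uint64_t)(v) >> 12) & 0x3fULL)
#define CTR_INFO_TYPE(v)  (((uint64_t)(v) >> 63) & 0x1ULL)
```

Explicit shifts avoid relying on the compiler's bitfield ordering, which is what the union leaves implementation-defined.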




[PATCH v5 18/22] KVM: riscv: selftests: Add Sscofpmf to get-reg-list test

2024-04-03 Thread Atish Patra
The KVM RISC-V allows Sscofpmf extension for Guest/VM so let us
add this extension to get-reg-list test.

Reviewed-by: Anup Patel 
Reviewed-by: Andrew Jones 
Signed-off-by: Atish Patra 
---
 tools/testing/selftests/kvm/riscv/get-reg-list.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/kvm/riscv/get-reg-list.c b/tools/testing/selftests/kvm/riscv/get-reg-list.c
index b882b7b9b785..222198dd6d04 100644
--- a/tools/testing/selftests/kvm/riscv/get-reg-list.c
+++ b/tools/testing/selftests/kvm/riscv/get-reg-list.c
@@ -43,6 +43,7 @@ bool filter_reg(__u64 reg)
	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_V:
	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SMSTATEEN:
	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SSAIA:
+	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SSCOFPMF:
	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SSTC:
	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SVINVAL:
	case KVM_REG_RISCV_ISA_EXT | KVM_REG_RISCV_ISA_SINGLE | KVM_RISCV_ISA_EXT_SVNAPOT:
@@ -408,6 +409,7 @@ static const char *isa_ext_single_id_to_str(__u64 reg_off)
KVM_ISA_EXT_ARR(V),
KVM_ISA_EXT_ARR(SMSTATEEN),
KVM_ISA_EXT_ARR(SSAIA),
+   KVM_ISA_EXT_ARR(SSCOFPMF),
KVM_ISA_EXT_ARR(SSTC),
KVM_ISA_EXT_ARR(SVINVAL),
KVM_ISA_EXT_ARR(SVNAPOT),
@@ -931,6 +933,7 @@ KVM_ISA_EXT_SUBLIST_CONFIG(fp_f, FP_F);
 KVM_ISA_EXT_SUBLIST_CONFIG(fp_d, FP_D);
 KVM_ISA_EXT_SIMPLE_CONFIG(h, H);
 KVM_ISA_EXT_SUBLIST_CONFIG(smstateen, SMSTATEEN);
+KVM_ISA_EXT_SIMPLE_CONFIG(sscofpmf, SSCOFPMF);
 KVM_ISA_EXT_SIMPLE_CONFIG(sstc, SSTC);
 KVM_ISA_EXT_SIMPLE_CONFIG(svinval, SVINVAL);
 KVM_ISA_EXT_SIMPLE_CONFIG(svnapot, SVNAPOT);
@@ -986,6 +989,7 @@ struct vcpu_reg_list *vcpu_configs[] = {
	&config_fp_d,
	&config_h,
	&config_smstateen,
+	&config_sscofpmf,
	&config_sstc,
	&config_svinval,
	&config_svnapot,
-- 
2.34.1




[PATCH v5 17/22] KVM: riscv: selftests: Add helper functions for extension checks

2024-04-03 Thread Atish Patra
__vcpu_has_ext can check both SBI and ISA extensions when the first
argument is properly converted to SBI/ISA extension IDs. Introduce
two helper functions to make life easier for developers so they
don't have to worry about the conversions.

Replace the current usages as well with new helpers.

Signed-off-by: Atish Patra 
---
 tools/testing/selftests/kvm/include/riscv/processor.h | 10 ++
 tools/testing/selftests/kvm/riscv/arch_timer.c|  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/include/riscv/processor.h b/tools/testing/selftests/kvm/include/riscv/processor.h
index 3b9cb39327ff..5f389166338c 100644
--- a/tools/testing/selftests/kvm/include/riscv/processor.h
+++ b/tools/testing/selftests/kvm/include/riscv/processor.h
@@ -50,6 +50,16 @@ static inline uint64_t __kvm_reg_id(uint64_t type, uint64_t subtype,
 
 bool __vcpu_has_ext(struct kvm_vcpu *vcpu, uint64_t ext);
 
+static inline bool __vcpu_has_isa_ext(struct kvm_vcpu *vcpu, uint64_t isa_ext)
+{
+   return __vcpu_has_ext(vcpu, RISCV_ISA_EXT_REG(isa_ext));
+}
+
+static inline bool __vcpu_has_sbi_ext(struct kvm_vcpu *vcpu, uint64_t sbi_ext)
+{
+   return __vcpu_has_ext(vcpu, RISCV_SBI_EXT_REG(sbi_ext));
+}
+
 struct ex_regs {
unsigned long ra;
unsigned long sp;
diff --git a/tools/testing/selftests/kvm/riscv/arch_timer.c b/tools/testing/selftests/kvm/riscv/arch_timer.c
index e22848f747c0..6a3e97ead824 100644
--- a/tools/testing/selftests/kvm/riscv/arch_timer.c
+++ b/tools/testing/selftests/kvm/riscv/arch_timer.c
@@ -85,7 +85,7 @@ struct kvm_vm *test_vm_create(void)
int nr_vcpus = test_args.nr_vcpus;
 
vm = vm_create_with_vcpus(nr_vcpus, guest_code, vcpus);
-   __TEST_REQUIRE(__vcpu_has_ext(vcpus[0], RISCV_ISA_EXT_REG(KVM_RISCV_ISA_EXT_SSTC)),
+   __TEST_REQUIRE(__vcpu_has_isa_ext(vcpus[0], KVM_RISCV_ISA_EXT_SSTC),
   "SSTC not available, skipping test\n");
 
vm_init_vector_tables(vm);
-- 
2.34.1




[PATCH v5 16/22] KVM: riscv: selftests: Move sbi definitions to its own header file

2024-04-03 Thread Atish Patra
The SBI definitions will continue to grow. Move the sbi related
definitions to its own header file from processor.h

Suggested-by: Andrew Jones 
Signed-off-by: Atish Patra 
---
 .../selftests/kvm/include/riscv/processor.h   | 39 ---
 .../testing/selftests/kvm/include/riscv/sbi.h | 50 +++
 .../selftests/kvm/include/riscv/ucall.h   |  1 +
 tools/testing/selftests/kvm/steal_time.c  |  4 +-
 4 files changed, 54 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/riscv/sbi.h

diff --git a/tools/testing/selftests/kvm/include/riscv/processor.h b/tools/testing/selftests/kvm/include/riscv/processor.h
index ce473fe251dd..3b9cb39327ff 100644
--- a/tools/testing/selftests/kvm/include/riscv/processor.h
+++ b/tools/testing/selftests/kvm/include/riscv/processor.h
@@ -154,45 +154,6 @@ void vm_install_interrupt_handler(struct kvm_vm *vm, exception_handler_fn handle
 #define PGTBL_PAGE_SIZEPGTBL_L0_BLOCK_SIZE
 #define PGTBL_PAGE_SIZE_SHIFT  PGTBL_L0_BLOCK_SHIFT
 
-/* SBI return error codes */
-#define SBI_SUCCESS0
-#define SBI_ERR_FAILURE-1
-#define SBI_ERR_NOT_SUPPORTED  -2
-#define SBI_ERR_INVALID_PARAM  -3
-#define SBI_ERR_DENIED -4
-#define SBI_ERR_INVALID_ADDRESS-5
-#define SBI_ERR_ALREADY_AVAILABLE  -6
-#define SBI_ERR_ALREADY_STARTED-7
-#define SBI_ERR_ALREADY_STOPPED-8
-
-#define SBI_EXT_EXPERIMENTAL_START 0x0800
-#define SBI_EXT_EXPERIMENTAL_END   0x08FF
-
-#define KVM_RISCV_SELFTESTS_SBI_EXTSBI_EXT_EXPERIMENTAL_END
-#define KVM_RISCV_SELFTESTS_SBI_UCALL  0
-#define KVM_RISCV_SELFTESTS_SBI_UNEXP  1
-
-enum sbi_ext_id {
-   SBI_EXT_BASE = 0x10,
-   SBI_EXT_STA = 0x535441,
-};
-
-enum sbi_ext_base_fid {
-   SBI_EXT_BASE_PROBE_EXT = 3,
-};
-
-struct sbiret {
-   long error;
-   long value;
-};
-
-struct sbiret sbi_ecall(int ext, int fid, unsigned long arg0,
-   unsigned long arg1, unsigned long arg2,
-   unsigned long arg3, unsigned long arg4,
-   unsigned long arg5);
-
-bool guest_sbi_probe_extension(int extid, long *out_val);
-
 static inline void local_irq_enable(void)
 {
csr_set(CSR_SSTATUS, SR_SIE);
diff --git a/tools/testing/selftests/kvm/include/riscv/sbi.h b/tools/testing/selftests/kvm/include/riscv/sbi.h
new file mode 100644
index ..ba04f2dec7b5
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/riscv/sbi.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RISC-V SBI specific definitions
+ *
+ * Copyright (C) 2024 Rivos Inc.
+ */
+
+#ifndef SELFTEST_KVM_SBI_H
+#define SELFTEST_KVM_SBI_H
+
+/* SBI return error codes */
+#define SBI_SUCCESS 0
+#define SBI_ERR_FAILURE-1
+#define SBI_ERR_NOT_SUPPORTED  -2
+#define SBI_ERR_INVALID_PARAM  -3
+#define SBI_ERR_DENIED -4
+#define SBI_ERR_INVALID_ADDRESS-5
+#define SBI_ERR_ALREADY_AVAILABLE  -6
+#define SBI_ERR_ALREADY_STARTED-7
+#define SBI_ERR_ALREADY_STOPPED-8
+
+#define SBI_EXT_EXPERIMENTAL_START 0x0800
+#define SBI_EXT_EXPERIMENTAL_END   0x08FF
+
+#define KVM_RISCV_SELFTESTS_SBI_EXTSBI_EXT_EXPERIMENTAL_END
+#define KVM_RISCV_SELFTESTS_SBI_UCALL  0
+#define KVM_RISCV_SELFTESTS_SBI_UNEXP  1
+
+enum sbi_ext_id {
+   SBI_EXT_BASE = 0x10,
+   SBI_EXT_STA = 0x535441,
+};
+
+enum sbi_ext_base_fid {
+   SBI_EXT_BASE_PROBE_EXT = 3,
+};
+
+struct sbiret {
+   long error;
+   long value;
+};
+
+struct sbiret sbi_ecall(int ext, int fid, unsigned long arg0,
+   unsigned long arg1, unsigned long arg2,
+   unsigned long arg3, unsigned long arg4,
+   unsigned long arg5);
+
+bool guest_sbi_probe_extension(int extid, long *out_val);
+
+#endif /* SELFTEST_KVM_SBI_H */
diff --git a/tools/testing/selftests/kvm/include/riscv/ucall.h b/tools/testing/selftests/kvm/include/riscv/ucall.h
index be46eb32ec27..a695ae36f3e0 100644
--- a/tools/testing/selftests/kvm/include/riscv/ucall.h
+++ b/tools/testing/selftests/kvm/include/riscv/ucall.h
@@ -3,6 +3,7 @@
 #define SELFTEST_KVM_UCALL_H
 
 #include "processor.h"
+#include "sbi.h"
 
 #define UCALL_EXIT_REASON   KVM_EXIT_RISCV_SBI
 
diff --git a/tools/testing/selftests/kvm/steal_time.c b/tools/testing/selftests/kvm/steal_time.c
index bae0c5026f82..2ff82c7fd926 100644
--- a/tools/testing/selftests/kvm/steal_time.c
+++ b/tools/testing/selftests/kvm/steal_time.c
@@ -11,7 +11,9 @@
 #include 
 #include 
 #include 

[PATCH v5 15/22] RISC-V: KVM: Improve firmware counter read function

2024-04-03 Thread Atish Patra
Rename the function to indicate that it is meant for firmware
counter read. While at it, add a range sanity check for it as
well.

Signed-off-by: Atish Patra 
---
 arch/riscv/include/asm/kvm_vcpu_pmu.h | 2 +-
 arch/riscv/kvm/vcpu_pmu.c | 7 ++-
 arch/riscv/kvm/vcpu_sbi_pmu.c | 2 +-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index 55861b5d3382..fa0f535bbbf0 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -89,7 +89,7 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
 unsigned long ctr_mask, unsigned long flags,
 unsigned long eidx, u64 evtdata,
 struct kvm_vcpu_sbi_return *retdata);
-int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
+int kvm_riscv_vcpu_pmu_fw_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata);
+int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
  struct kvm_vcpu_sbi_return *retdata);
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index ff326152eeff..94efa88d054d 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -235,6 +235,11 @@ static int pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
u64 enabled, running;
int fevent_code;
 
+   if (cidx >= kvm_pmu_num_counters(kvpmu) || cidx == 1) {
+   pr_warn("Invalid counter id [%ld] during read\n", cidx);
+   return -EINVAL;
+   }
+
	pmc = &kvpmu->pmc[cidx];
 
if (pmc->cinfo.type == SBI_PMU_CTR_TYPE_FW) {
@@ -747,7 +752,7 @@ int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
return 0;
 }
 
-int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
+int kvm_riscv_vcpu_pmu_fw_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata)
 {
int ret;
diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
index cf111de51bdb..e4be34e03e83 100644
--- a/arch/riscv/kvm/vcpu_sbi_pmu.c
+++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
@@ -62,7 +62,7 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
	ret = kvm_riscv_vcpu_pmu_ctr_stop(vcpu, cp->a0, cp->a1, cp->a2, retdata);
break;
case SBI_EXT_PMU_COUNTER_FW_READ:
-   ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
+   ret = kvm_riscv_vcpu_pmu_fw_ctr_read(vcpu, cp->a0, retdata);
break;
case SBI_EXT_PMU_COUNTER_FW_READ_HI:
if (IS_ENABLED(CONFIG_32BIT))
-- 
2.34.1
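The range check added in pmu_ctr_read() above rejects an index that is out of range or equal to 1; index 1 corresponds to the TIME CSR slot rather than a usable PMU counter. The predicate can be sketched on its own (a hypothetical standalone helper, not the kernel function):

```c
#include <stdbool.h>

/* A counter index is usable only if it is below the advertised number
 * of counters and is not index 1, which maps to the TIME CSR. */
static bool pmu_ctr_idx_valid(unsigned long cidx, unsigned long num_counters)
{
	return cidx < num_counters && cidx != 1;
}
```

Rejecting the index before dereferencing the pmc array is what turns a malformed guest request into SBI_ERR_INVALID_PARAM instead of an out-of-bounds access.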




[PATCH v5 14/22] RISC-V: KVM: Support 64 bit firmware counters on RV32

2024-04-03 Thread Atish Patra
The SBI v2.0 introduced a fw_read_hi function to read 64 bit firmware
counters for RV32 based systems.

Add infrastructure to support that.

Reviewed-by: Anup Patel 
Signed-off-by: Atish Patra 
---
 arch/riscv/include/asm/kvm_vcpu_pmu.h |  4 ++-
 arch/riscv/kvm/vcpu_pmu.c | 44 ++-
 arch/riscv/kvm/vcpu_sbi_pmu.c |  6 
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_vcpu_pmu.h b/arch/riscv/include/asm/kvm_vcpu_pmu.h
index 257f17641e00..55861b5d3382 100644
--- a/arch/riscv/include/asm/kvm_vcpu_pmu.h
+++ b/arch/riscv/include/asm/kvm_vcpu_pmu.h
@@ -20,7 +20,7 @@ static_assert(RISCV_KVM_MAX_COUNTERS <= 64);
 
 struct kvm_fw_event {
/* Current value of the event */
-   unsigned long value;
+   u64 value;
 
/* Event monitoring status */
bool started;
@@ -91,6 +91,8 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
 struct kvm_vcpu_sbi_return *retdata);
 int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata);
+int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ struct kvm_vcpu_sbi_return *retdata);
 void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu);
 int kvm_riscv_vcpu_pmu_snapshot_set_shmem(struct kvm_vcpu *vcpu, unsigned long saddr_low,
  unsigned long saddr_high, unsigned long flags,
diff --git a/arch/riscv/kvm/vcpu_pmu.c b/arch/riscv/kvm/vcpu_pmu.c
index 9fedf9dc498b..ff326152eeff 100644
--- a/arch/riscv/kvm/vcpu_pmu.c
+++ b/arch/riscv/kvm/vcpu_pmu.c
@@ -197,6 +197,36 @@ static int pmu_get_pmc_index(struct kvm_pmu *pmu, unsigned long eidx,
return kvm_pmu_get_programmable_pmc_index(pmu, eidx, cbase, cmask);
 }
 
+static int pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ unsigned long *out_val)
+{
+   struct kvm_pmu *kvpmu = vcpu_to_pmu(vcpu);
+   struct kvm_pmc *pmc;
+   int fevent_code;
+
+   if (!IS_ENABLED(CONFIG_32BIT)) {
+   pr_warn("%s: should be invoked for only RV32\n", __func__);
+   return -EINVAL;
+   }
+
+   if (cidx >= kvm_pmu_num_counters(kvpmu) || cidx == 1) {
+   pr_warn("Invalid counter id [%ld] during read\n", cidx);
+   return -EINVAL;
+   }
+
+   pmc = &kvpmu->pmc[cidx];
+
+   if (pmc->cinfo.type != SBI_PMU_CTR_TYPE_FW)
+   return -EINVAL;
+
+   fevent_code = get_event_code(pmc->event_idx);
+   pmc->counter_val = kvpmu->fw_event[fevent_code].value;
+
+   *out_val = pmc->counter_val >> 32;
+
+   return 0;
+}
+
 static int pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
unsigned long *out_val)
 {
@@ -705,6 +735,18 @@ int kvm_riscv_vcpu_pmu_ctr_cfg_match(struct kvm_vcpu *vcpu, unsigned long ctr_ba
return 0;
 }
 
+int kvm_riscv_vcpu_pmu_fw_ctr_read_hi(struct kvm_vcpu *vcpu, unsigned long cidx,
+ struct kvm_vcpu_sbi_return *retdata)
+{
+   int ret;
+
+   ret = pmu_fw_ctr_read_hi(vcpu, cidx, &retdata->out_val);
+   if (ret == -EINVAL)
+   retdata->err_val = SBI_ERR_INVALID_PARAM;
+
+   return 0;
+}
+
 int kvm_riscv_vcpu_pmu_ctr_read(struct kvm_vcpu *vcpu, unsigned long cidx,
struct kvm_vcpu_sbi_return *retdata)
 {
@@ -778,7 +820,7 @@ void kvm_riscv_vcpu_pmu_init(struct kvm_vcpu *vcpu)
pmc->cinfo.csr = CSR_CYCLE + i;
} else {
pmc->cinfo.type = SBI_PMU_CTR_TYPE_FW;
-   pmc->cinfo.width = BITS_PER_LONG - 1;
+   pmc->cinfo.width = 63;
}
}
 
diff --git a/arch/riscv/kvm/vcpu_sbi_pmu.c b/arch/riscv/kvm/vcpu_sbi_pmu.c
index d3e7625fb2d2..cf111de51bdb 100644
--- a/arch/riscv/kvm/vcpu_sbi_pmu.c
+++ b/arch/riscv/kvm/vcpu_sbi_pmu.c
@@ -64,6 +64,12 @@ static int kvm_sbi_ext_pmu_handler(struct kvm_vcpu *vcpu, struct kvm_run *run,
case SBI_EXT_PMU_COUNTER_FW_READ:
ret = kvm_riscv_vcpu_pmu_ctr_read(vcpu, cp->a0, retdata);
break;
+   case SBI_EXT_PMU_COUNTER_FW_READ_HI:
+   if (IS_ENABLED(CONFIG_32BIT))
+   ret = kvm_riscv_vcpu_pmu_fw_ctr_read_hi(vcpu, cp->a0, retdata);
+   else
+   retdata->out_val = 0;
+   break;
case SBI_EXT_PMU_SNAPSHOT_SET_SHMEM:
	ret = kvm_riscv_vcpu_pmu_snapshot_set_shmem(vcpu, cp->a0, cp->a1, cp->a2, retdata);
break;
-- 
2.34.1
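On RV32, a caller combines SBI_EXT_PMU_COUNTER_FW_READ (low 32 bits) with the FW_READ_HI result added above (pmu_fw_ctr_read_hi() returns counter_val >> 32) to recover the full 64-bit firmware counter. A minimal sketch of the reassembly (hypothetical helper name, not part of the patch):

```c
#include <stdint.h>

/* Reassemble a 64-bit firmware counter from the two 32-bit halves
 * returned by FW_READ (lo) and FW_READ_HI (hi) on RV32. */
static uint64_t fw_ctr_value(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

Note the two ecalls are not atomic; a careful caller would re-read the high half and retry if it changed, the classic split-read pattern for 64-bit counters on 32-bit systems.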



