date:20250515

On Thu, May 15, 2025 at 12:21:14PM -0500, Andrew Davis wrote:
> On 5/13/25 10:52 AM, Bjorn Andersson wrote:
> > Clearing the table_sz on cleanup seemed reasonable, but further
> > discussions concluded that this merely working around the issue
> > and that the fix is incomplete.
> > 
> > As such, revert commit efdde3d73ab2 ("remoteproc: core: Clear table_sz
> > when rproc_shutdown") to avoid carrying a partial fix.
> > 
> 
> Setting table_sz to 0 still seems like a good idea from a defensive
> programming perspective. Both table_ptr and table_sz should be set
> and cleared together in all spots.
> 
> In addition to this, another fix would be to also update
> both table_ptr and table_sz to 0 when loading firmware without
> a resource table. Both should be done, no need to revert this.
>

As mentioned by Bjorn, this is a partial fix.  I'm all good with setting
table_sz to 0, but as long as the real solution to the problem is part of the
same work.  Once the patch is reverted, which I'm about to apply, work can
continue.

> Andrew
> 
> > Signed-off-by: Bjorn Andersson 
> > ---
> >   drivers/remoteproc/remoteproc_core.c | 1 -
> >   1 file changed, 1 deletion(-)
> > 
> > diff --git a/drivers/remoteproc/remoteproc_core.c 
> > b/drivers/remoteproc/remoteproc_core.c
> > index 
> > 48d146e1fa560397c11eeb8f824ae0fb844a022b..81b2ccf988e852ac79cee375c7e3f118c2a4b41a
> >  100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -2025,7 +2025,6 @@ int rproc_shutdown(struct rproc *rproc)
> > kfree(rproc->cached_table);
> > rproc->cached_table = NULL;
> > rproc->table_ptr = NULL;
> > -   rproc->table_sz = 0;
> >   out:
> > mutex_unlock(&rproc->lock);
> > return ret;
> > 
> > ---
> > base-commit: aa94665adc28f3fdc3de2979ac1e98bae961d6ca
> > change-id: 20250513-revert-rproc-table-sz-53ecf24726ae
> > 
> > Best regards,

Re: [PATCH] selftests: Add functional test for the abort file in fusectl

2025-05-15 Thread Chen Linxuan

On Thu, May 15, 2025 at 3:35 PM Chen Linxuan  wrote:

> This patch add a simple functional test for the "about" file

Sorry for the typo, it should be "abort".

[PATCH 1/2] selftests/mm: skip uffd tests in madv_guard if uffd is not present.

When userfaultfd is not compiled into kernel, userfaultfd() returns -1,
causing uffd tests in madv_guard fail. Skip the tests instead.

Signed-off-by: Zi Yan 
---
 tools/testing/selftests/mm/guard-regions.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mm/guard-regions.c 
b/tools/testing/selftests/mm/guard-regions.c
index 0cd9d236649d..93af3d3760f9 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -1453,8 +1453,21 @@ TEST_F(guard_regions, uffd)
 
/* Set up uffd. */
uffd = userfaultfd(0);
-   if (uffd == -1 && errno == EPERM)
-   ksft_exit_skip("No userfaultfd permissions, try running as 
root.\n");
+   if (uffd == -1) {
+   switch (errno) {
+   case EPERM:
+   SKIP(return, "No userfaultfd permissions, try running 
as root.");
+   break;
+   case ENOSYS:
+   SKIP(return, "userfaultfd is not supported/not 
enabled.");
+   break;
+   default:
+   ksft_exit_fail_msg("userfaultfd failed with %s\n",
+  strerror(errno));
+   break;
+   }
+   }
+
ASSERT_NE(uffd, -1);
 
ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
-- 
2.47.2

[PATCH 2/2] selftests/mm: skip hugevm test if kernel config file is not present.

When running hugevm tests in a machine without kernel config present, e.g.,
a VM running a kernel without CONFIG_IKCONFIG_PROC nor /boot/config-*,
skip hugevm tests, which reads kernel config to get page table level
information.

Signed-off-by: Zi Yan 
---
 .../selftests/mm/va_high_addr_switch.sh   | 26 +++
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/mm/va_high_addr_switch.sh 
b/tools/testing/selftests/mm/va_high_addr_switch.sh
index 1f92e8caceac..325de53966b6 100755
--- a/tools/testing/selftests/mm/va_high_addr_switch.sh
+++ b/tools/testing/selftests/mm/va_high_addr_switch.sh
@@ -7,23 +7,20 @@
 # real test to check that the kernel is configured to support at least 5
 # pagetable levels.
 
-# 1 means the test failed
-exitcode=1
-
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
 
-fail()
+skip()
 {
echo "$1"
-   exit $exitcode
+   exit $ksft_skip
 }
 
 check_supported_x86_64()
 {
local config="/proc/config.gz"
[[ -f "${config}" ]] || config="/boot/config-$(uname -r)"
-   [[ -f "${config}" ]] || fail "Cannot find kernel config in /proc or 
/boot"
+   [[ -f "${config}" ]] || skip "Cannot find kernel config in /proc or 
/boot"
 
# gzip -dcfq automatically handles both compressed and plaintext input.
# See man 1 gzip under '-f'.
@@ -33,11 +30,9 @@ check_supported_x86_64()
else {print 1}; exit}' /proc/cpuinfo 2>/dev/null)
 
if [[ "${pg_table_levels}" -lt 5 ]]; then
-   echo "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to 
run this test"
-   exit $ksft_skip
+   skip "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to 
run this test"
elif [[ "${cpu_supports_pl5}" -ne 0 ]]; then
-   echo "$0: CPU does not have the necessary la57 flag to support 
page table level 5"
-   exit $ksft_skip
+   skip "$0: CPU does not have the necessary la57 flag to support 
page table level 5"
fi
 }
 
@@ -45,24 +40,21 @@ check_supported_ppc64()
 {
local config="/proc/config.gz"
[[ -f "${config}" ]] || config="/boot/config-$(uname -r)"
-   [[ -f "${config}" ]] || fail "Cannot find kernel config in /proc or 
/boot"
+   [[ -f "${config}" ]] || skip "Cannot find kernel config in /proc or 
/boot"
 
local pg_table_levels=$(gzip -dcfq "${config}" | grep PGTABLE_LEVELS | 
cut -d'=' -f 2)
if [[ "${pg_table_levels}" -lt 5 ]]; then
-   echo "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to 
run this test"
-   exit $ksft_skip
+   skip "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to 
run this test"
fi
 
local mmu_support=$(grep -m1 "mmu" /proc/cpuinfo | awk '{print $3}')
if [[ "$mmu_support" != "radix" ]]; then
-   echo "$0: System does not use Radix MMU, required for 5-level 
paging"
-   exit $ksft_skip
+   skip "$0: System does not use Radix MMU, required for 5-level 
paging"
fi
 
local hugepages_total=$(awk '/HugePages_Total/ {print $2}' 
/proc/meminfo)
if [[ "${hugepages_total}" -eq 0 ]]; then
-   echo "$0: HugePages are not enabled, required for some tests"
-   exit $ksft_skip
+   skip "$0: HugePages are not enabled, required for some tests"
fi
 }
 
-- 
2.47.2

Re: [PATCH net-next v5] selftests/vsock: add initial vmtest.sh for vsock

2025-05-15 Thread Bobby Eshleman

On Thu, May 15, 2025 at 11:02:00AM +0200, Stefano Garzarella wrote:
> On Tue, May 13, 2025 at 11:31:13PM -0700, Bobby Eshleman wrote:
> > This commit introduces a new vmtest.sh runner for vsock.
> > 
> > It uses virtme-ng/qemu to run tests in a VM. The tests validate G2H,
> > H2G, and loopback. The testing tools from tools/testing/vsock/ are
> > reused. Currently, only vsock_test is used.
> > 
> > VMCI and hyperv support is automatically built, though not used.
> > 
> > Only tested on x86.
> > 
> > To run:
> > 
> >  $ tools/testing/selftests/vsock/vmtest.sh
> 
> I just tried to call the script directly, but I have this:
> 
> $ tools/testing/selftests/vsock/vmtest.sh
> skip:
> /home/stefano/repos/linux-vsock/tools/testing/selftests/vsock/vsock_test not 
> found! Please build the kselftest vsock target.
> pkill: pidfile not valid
> Try `pkill --help' for more information.
> 
> Is this expected?
> 
> > 
> > or
> > 
> >  $ make -C tools/testing/selftests TARGETS=vsock run_tests
> > 
> > Example runs:
> > 
> > $ ./tools/testing/selftests/vsock/vmtest.sh
> > TAP version 13
> > 1..3
> > not ok 0 vm_server_host_client # exit=1
> > ok 1 vm_client_host_server
> > ok 2 vm_loopback
> > 1..3
> > 
> > $ ./tools/testing/selftests/vsock/vmtest.sh vm_loopback
> > TAP version 13
> > 1..1
> > ok 0 vm_loopback
> > 1..1
> > 
> > Future work can include vsock_diag_test.
> > 
> > The tap output style copies mm's run_vmtests.sh.
> > 
> > Because vsock requires a VM to test anything other than loopback, this
> > patch adds vmtest.sh as a kselftest itself. This is different than other
> > systems that have a "vmtest.sh", where it is used as a utility script to
> > spin up a VM to run the selftests as a guest (but isn't hooked into
> > kselftest).
> > 
> > Testing in NIPA is still WIP.
> > 
> > Signed-off-by: Bobby Eshleman 
> > ---
> > Changes in v5:
> > - make log file a tmpfile (Paolo)
> > - make sure both default and user defined QEMU gets handled by the 
> > dependency check (Paolo)
> > - increased VM boot up timeout from 1m to 3m for slow hosts (Paolo)
> > - rename vm_setup -> vm_start (Paolo)
> > - derive wait_for_listener from selftests/net/net_helper.sh to removes ss 
> > usage
> > - Remove unused 'unset IFS' line (Paolo)
> > - leave space after variable declarations (Paolo)
> > - make QEMU_PIDFILE a tmp file (Paolo)
> > - make everything readonly that is only read (Paolo)
> > - source ktap_helpers.sh for KSFT_PASS and friends (Paolo)
> > - don't check for timeout util (Paolo)
> > - add missing usage string for -q qemu arg
> > - add tap prefix to SUMMARY line since it isn't part of TAP protocol
> > - exit with the correct status code based on failure/pass counts
> > - Link to v4:
> > https://lore.kernel.org/r/20250507-vsock-vmtest-v4-1-6e2a97262...@gmail.com
> > 
> > Changes in v4:
> > - do not use special tab delimiter for help string parsing (Stefano + Paolo)
> > - fix paths for when installing kselftest and running out-of-tree (Paolo)
> > - change vng to using running kernel instead of compiled kernel (Paolo)
> > - use multi-line string for QEMU_OPTS (Stefano)
> > - change timeout to 300s (Paolo)
> > - skip if tools are not found and use kselftests status codes (Paolo)
> > - remove build from vmtest.sh (Paolo)
> > - change  -> SSH_HOST_PORT (Stefano)
> > - add tap-format output
> > - add vmtest.log to gitignore
> > - check for vsock_test binary and remind user to build it if missing
> > - create a proper build in makefile
> > - style fixes
> > - add ssh, timeout, and pkill to dependency check, just in case
> > - fix numerical comparison in conditionals
> > - check qemu pidfile exists before proceeding (avoid wasting time waiting 
> > for ssh)
> > - fix tracking of pass/fail bug
> > - fix stderr redirection bug
> > - Link to v3: 
> > https://lore.kernel.org/r/20250428-vsock-vmtest-v3-1-181af6163...@gmail.com
> > 
> > Changes in v3:
> > - use common conditional syntax for checking variables
> > - use return value instead of global rc
> > - fix typo TEST_HOST_LISTENER_PORT -> TEST_HOST_PORT_LISTENER
> > - use SIGTERM instead of SIGKILL on cleanup
> > - use peer-cid=1 for loopback
> > - change sleep delay times into globals
> > - fix test_vm_loopback logging
> > - add test selection in arguments
> > - make QEMU an argument
> > - check that vng binary is on path
> > - use QEMU variable
> > - change  to 
> > - fix hardcoded file paths
> > - add comment in commit msg about script that vmtest.sh was based off of
> > - Add tools/testing/selftest/vsock/Makefile for kselftest
> > - Link to v2: 
> > https://lore.kernel.org/r/20250417-vsock-vmtest-v2-1-3901a2733...@gmail.com
> > 
> > Changes in v2:
> > - add kernel oops and warnings checker
> > - change testname variable to use FUNCNAME
> > - fix spacing in test_vm_server_host_client
> > - add -s skip build option to vmtest.sh
> > - add test_vm_loopback
> > - pass port to vm_wait_for_listener
> > - fix indentation in vmtest.sh
> > - add vmci and hyperv to config
> > - changed

Re: [PATCH bpf-next v4 0/3] Allow mmap of /sys/kernel/btf/vmlinux

2025-05-15 Thread Alan Maguire

> I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go.
> With some upcoming changes the library is sitting at 5MiB for a parse.
> Most of that memory is simply copying the BTF blob into user space.
> By allowing vmlinux BTF to be mmapped read-only into user space I can
> cut memory usage by about 75%.
> 
> Signed-off-by: Lorenz Bauer 

For the series,

Tested-by: Alan Maguire 

Tested with 4k and 64k page size on aarch64; all worked perfectly. Thanks!

Re: [PATCH] vhost/net: remove zerocopy support

2025-05-15 Thread Eugenio Perez Martin

On Mon, May 12, 2025 at 5:21 PM Jon Kohler  wrote:
>
>
>
> > On Apr 30, 2025, at 9:21 PM, Jon Kohler  wrote:
> >
> >
> >
> >> On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin  
> >> wrote:
> >>
> >> !---|
> >> CAUTION: External Email
> >>
> >> |---!
> >>
> >> On Tue, Apr 8, 2025 at 8:28 AM Jason Wang  wrote:
> >>>
> >>> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler  wrote:
> 
> 
> 
> > On Apr 6, 2025, at 7:14 PM, Jason Wang  wrote:
> >
> > !---|
> > CAUTION: External Email
> >
> > |---!
> >
> > On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler  wrote:
> >>
> >> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
> >> the module parameter for the handle_tx_zerocopy path back in 2019,
> >> nothing that many downstream distributions (e.g., RHEL7 and later) had
> >> already done the same.
> >>
> >> Both upstream and downstream disablement suggest this path is rarely
> >> used.
> >>
> >> Testing the module parameter shows that while the path allows packet
> >> forwarding, the zerocopy functionality itself is broken. On outbound
> >> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
> >> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
> >
> > This is by design to avoid DOS.
> 
>  I understand that, but it makes ZC non-functional in general, as ZC fails
>  and immediately increments the error counters.
> >>>
> >>> The main issue is HOL, but zerocopy may still work in some setups that
> >>> don't need to care about HOL. One example the macvtap passthrough
> >>> mode.
> >>>
> 
> >
> >> or
> >> skb_orphan_frags() elsewhere in the stack,
> >
> > Basically zerocopy is expected to work for guest -> remote case, so
> > could we still hit skb_orphan_frags() in this case?
> 
>  Yes, you’d hit that in tun_net_xmit().
> >>>
> >>> Only for local VM to local VM communication.
> >
> > Sure, but the tricky bit here is that if you have a mix of VM-VM and 
> > VM-external
> > traffic patterns, any time the error path is hit, the zc error counter will 
> > go up.
> >
> > When that happens, ZC will get silently disabled anyhow, so it leads to 
> > sporadic
> > success / non-deterministic performance.
> >
> >>>
>  If you punch a hole in that *and* in the
>  zc error counter (such that failed ZC doesn’t disable ZC in vhost), you 
>  get ZC
>  from vhost; however, the network interrupt handler under net_tx_action 
>  and
>  eventually incurs the memcpy under dev_queue_xmit_nit().
> >>>
> >>> Well, yes, we need a copy if there's a packet socket. But if there's
> >>> no network interface taps, we don't need to do the copy here.
> >>>
> >
> > Agreed on the packet socket side. I recently fixed an issue in lldpd [1] 
> > that prevented
> > this specific case; however, there are still other trip wires spread out 
> > across the
> > stack that would need to be addressed.
> >
> > [1] 
> > https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717
> >
> >>
> >> Hi!
> >>
> >> I need more time diving into the issues. As Jon mentioned, vhost ZC is
> >> so little used I didn't have the chance to experiment with this until
> >> now :). But yes, I expect to be able to overcome these for pasta, by
> >> adapting buffer sizes or modifying code etc.
> >
> > Another tricky bit here is that it has been disabled both upstream and 
> > downstream
> > for so long, the code naturally has a bit of wrench-in-the-engine.
> >
> > RE Buffer sizes: I tried this as well, because I think on sufficiently fast 
> > systems,
> > zero copy gets especially interesting in GSO/TSO cases where you have mega
> > payloads.
> >
> > I tried playing around with the good copy value such that ZC restricted 
> > itself to
> > only lets say 32K payloads and above, and while it *does* work (with enough
> > holes punched in), absolute t-put doesn’t actually go up, its just that CPU 
> > utilization
> > goes down a pinch. Not a bad thing for certain, but still not great.
> >
> > In fact, I found that tput actually went down with this path, even with ZC 
> > occurring
> > successfully, as there was still a mix of ZC and non-ZC because you can only
> > have so many pending at any given time before the copy path kicks in again.
> >
> >
> >>
> 
>  This is no more performant, and in fact is actually worse since the time 
>  spent
>  waiting on that memcpy to resolve is longer.
> 
> >
> >> as vhost_net does not set
> >> SKBFL_DONT_ORPHAN.
> >>>
> >>> Maybe we can try to set this as vhost-net can hornor ulimit now.
> >
> > Yea I tried that, and while it helps kick things

Re: [PATCH 01/12] module: Move modprobe_path and modules_disabled ctl_tables into the module subsys

2025-05-15 Thread Petr Pavlu

On 5/9/25 14:54, Joel Granados wrote:
> Move module sysctl (modprobe_path and modules_disabled) out of sysctl.c
> and into the modules subsystem. Make the modprobe_path variable static
> as it no longer needs to be exported. Remove module.h from the includes
> in sysctl as it no longer uses any module exported variables.
> 
> This is part of a greater effort to move ctl tables into their
> respective subsystems which will reduce the merge conflicts in
> kernel/sysctl.c.
> 
> Signed-off-by: Joel Granados 
> [...]
> --- a/kernel/module/kmod.c
> +++ b/kernel/module/kmod.c
> @@ -60,7 +60,7 @@ static DEFINE_SEMAPHORE(kmod_concurrent_max, 
> MAX_KMOD_CONCURRENT);
>  /*
>   modprobe_path is set via /proc/sys.
>  */
> -char modprobe_path[KMOD_PATH_LEN] = CONFIG_MODPROBE_PATH;
> +static char modprobe_path[KMOD_PATH_LEN] = CONFIG_MODPROBE_PATH;
>  
>  static void free_modprobe_argv(struct subprocess_info *info)
>  {
> @@ -177,3 +177,33 @@ int __request_module(bool wait, const char *fmt, ...)
>   return ret;
>  }
>  EXPORT_SYMBOL(__request_module);
> +
> +#ifdef CONFIG_MODULES
> +static const struct ctl_table kmod_sysctl_table[] = {
> + {
> + .procname   = "modprobe",
> + .data   = &modprobe_path,
> + .maxlen = KMOD_PATH_LEN,
> + .mode   = 0644,
> + .proc_handler   = proc_dostring,
> + },
> + {
> + .procname   = "modules_disabled",
> + .data   = &modules_disabled,
> + .maxlen = sizeof(int),
> + .mode   = 0644,
> + /* only handle a transition from default "0" to "1" */
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ONE,
> + .extra2 = SYSCTL_ONE,
> + },

This is minor.. but the file kernel/module/kmod.c contains the logic to
request direct modprobe invocation by the kernel. Registering the
modprobe_path sysctl here is appropriate. However, the modules_disabled
setting affects the entire module loader so I don't think it's best to
register it here.

I suggest keeping a single table for the module sysctl values but moving
it to kernel/module/main.c. This means the variable modprobe_path must
retain external linkage, on the other hand, modules_disabled can be made
static.

-- 
Thanks,
Petr

> +};
> +
> +static int __init init_kmod_sysctl(void)
> +{
> + register_sysctl_init("kernel", kmod_sysctl_table);
> + return 0;
> +}
> +
> +subsys_initcall(init_kmod_sysctl);
> +#endif

Re: [PATCH 5/6] x86/paravirt: Switch MSR access pv_ops functions to instruction interfaces

2025-05-15 Thread Xin Li


On 5/13/2025 3:24 PM, H. Peter Anvin wrote:

On May 12, 2025 11:06:02 PM PDT, "Jürgen Groß"  wrote:

On 13.05.25 07:55, Xin Li wrote:

On 5/12/2025 4:24 AM, Juergen Gross wrote:

Now with the mentioned patch really attached. :-)



Does it allow patching with an instruction more than 6 bytes long?

The immediate form MSR instructions are 9 bytes long.


Yes, shouldn't be a problem.


Juergen


However, it is more than that. The immediate instructions have a different 
interface, and it makes more sense to use the extra bytes to shuffle the bits 
around for the legacy forms:

Write:

 mov %rax,%rdx
 shr $32,%rdx
 wrmsr(ns)

Read:

 rdmsr
 shl $32,%rdx
 or %rdx,%rax

For the write case, this also means that two separate trap points are needed.

As far as Xen (the only user of pv msrs), note that it only paravirtualizes a 
very small number of MSRs, and some of those are fairly performance sensitive, 
so not going through the Xen framework for MSRs known to be either native or 
null on Xen would definitely be a win.




Hi Juergen,

I have some update on this thread while working on it.

If we continue down the path of maintaining pvops MSR APIs as this patch
series does, it seems we’ll need to duplicate the ALTERNATIVE code in
three different places.

1) The MSR access primitives defined in , which is used when
   CONFIG_PARAVIRT=n.

2) The pvops native MSR functions pv_native_{rd,wr}msr{,_safe}() defined
   in arch/x86/kernel/paravirt.c, used when CONFIG_PARAVIRT=y on bare
   metal.

3) The pvops Xen MSR functions paravirt_{read,write}_msr{,_safe}()
   defined in , used when CONFIG_PARAVIRT_XXL=y.

hpa had mentioned to me earlier that this would be a maintenance burden
— something I only truly realized once I got hands-on with it.

Maybe you have something in mind to address it?

Also add PeterZ to the To list because he cares it.

Thanks!
Xin

[PATCH] selftests: Add functional test for the abort file in fusectl

2025-05-15 Thread Chen Linxuan

This patch add a simple functional test for the "about" file
in fusectlfs (/sys/fs/fuse/connections/ID/about).

A simple fuse daemon is added for testing.

Related discussion can be found in the link below.

Link: 
https://lore.kernel.org/all/CAOQ4uxjKFXOKQxPpxtS6G_nR0tpw95w0GiO68UcWg_OBhmSY=q...@mail.gmail.com/
Cc: Amir Goldstein 
Signed-off-by: Chen Linxuan 
---
 MAINTAINERS   |   1 +
 tools/testing/selftests/Makefile  |   1 +
 .../selftests/filesystems/fusectl/.gitignore  |   3 +
 .../selftests/filesystems/fusectl/Makefile|  21 +++
 .../selftests/filesystems/fusectl/fuse_mnt.c  | 146 ++
 .../filesystems/fusectl/fusectl_test.c| 115 ++
 6 files changed, 287 insertions(+)
 create mode 100644 tools/testing/selftests/filesystems/fusectl/.gitignore
 create mode 100644 tools/testing/selftests/filesystems/fusectl/Makefile
 create mode 100644 tools/testing/selftests/filesystems/fusectl/fuse_mnt.c
 create mode 100644 tools/testing/selftests/filesystems/fusectl/fusectl_test.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f21f1dabb5fe1..efc6c89113b95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9740,6 +9740,7 @@ T:git 
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
 F: Documentation/filesystems/fuse.rst
 F: fs/fuse/
 F: include/uapi/linux/fuse.h
+F: tools/testing/selftests/filesystems/fusectl
 
 FUTEX SUBSYSTEM
 M: Thomas Gleixner 
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 80fb84fa3cfcb..a9bfefa961889 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -36,6 +36,7 @@ TARGETS += filesystems/fat
 TARGETS += filesystems/overlayfs
 TARGETS += filesystems/statmount
 TARGETS += filesystems/mount-notify
+TARGETS += filesystems/fusectl
 TARGETS += firmware
 TARGETS += fpu
 TARGETS += ftrace
diff --git a/tools/testing/selftests/filesystems/fusectl/.gitignore 
b/tools/testing/selftests/filesystems/fusectl/.gitignore
new file mode 100644
index 0..3e72e742d08e8
--- /dev/null
+++ b/tools/testing/selftests/filesystems/fusectl/.gitignore
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+fuse_mnt
+fusectl_test
diff --git a/tools/testing/selftests/filesystems/fusectl/Makefile 
b/tools/testing/selftests/filesystems/fusectl/Makefile
new file mode 100644
index 0..612aad69a93aa
--- /dev/null
+++ b/tools/testing/selftests/filesystems/fusectl/Makefile
@@ -0,0 +1,21 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g $(KHDR_INCLUDES)
+
+TEST_GEN_PROGS := fusectl_test
+TEST_GEN_FILES := fuse_mnt
+
+include ../../lib.mk
+
+VAR_CFLAGS := $(shell pkg-config fuse --cflags 2>/dev/null)
+ifeq ($(VAR_CFLAGS),)
+VAR_CFLAGS := -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse
+endif
+
+VAR_LDLIBS := $(shell pkg-config fuse --libs 2>/dev/null)
+ifeq ($(VAR_LDLIBS),)
+VAR_LDLIBS := -lfuse -pthread
+endif
+
+$(OUTPUT)/fuse_mnt: CFLAGS += $(VAR_CFLAGS)
+$(OUTPUT)/fuse_mnt: LDLIBS += $(VAR_LDLIBS)
diff --git a/tools/testing/selftests/filesystems/fusectl/fuse_mnt.c 
b/tools/testing/selftests/filesystems/fusectl/fuse_mnt.c
new file mode 100644
index 0..d12b17f30fadc
--- /dev/null
+++ b/tools/testing/selftests/filesystems/fusectl/fuse_mnt.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fusectl test file-system
+ * Creates a simple FUSE filesystem with a single read-write file (/test)
+ */
+
+#define FUSE_USE_VERSION 26
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+
+static char *content;
+static size_t content_size = 0;
+static const char test_path[] = "/test";
+
+static int test_getattr(const char *path, struct stat *st)
+{
+   memset(st, 0, sizeof(*st));
+
+   if (!strcmp(path, "/")) {
+   st->st_mode = S_IFDIR | 0755;
+   st->st_nlink = 2;
+   return 0;
+   }
+
+   if (!strcmp(path, test_path)) {
+   st->st_mode = S_IFREG | 0664;
+   st->st_nlink = 1;
+   st->st_size = content_size;
+   return 0;
+   }
+
+   return -ENOENT;
+}
+
+static int test_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
+   off_t offset, struct fuse_file_info *fi)
+{
+   if (strcmp(path, "/"))
+   return -ENOENT;
+
+   filler(buf, ".", NULL, 0);
+   filler(buf, "..", NULL, 0);
+   filler(buf, test_path + 1, NULL, 0);
+
+   return 0;
+}
+
+static int test_open(const char *path, struct fuse_file_info *fi)
+{
+   if (strcmp(path, test_path))
+   return -ENOENT;
+
+   return 0;
+}
+
+static int test_read(const char *path, char *buf, size_t size, off_t offset,
+struct fuse_file_info *fi)
+{
+   if (strcmp(path, test_path) != 0)
+   return -ENOENT;
+
+   if (!content || content_size == 0)
+

[PATCH] selftests/memfd: clean Makefile

2025-05-15 Thread Chen Linxuan

When writing a test for fusectl, I referred to this Makefile as a
reference for creating a FUSE daemon in the selftests.
While doing so, I noticed that there is a minor issue in the Makefile.

The fuse_mnt.c file is not actually compiled into fuse_mnt.o,
and the code setting CFLAGS for it never takes effect.
The reason fuse_mnt compiles successfully is because CFLAGS is set
at the very beginning of the file.

Signed-off-by: Chen Linxuan 
---
 tools/testing/selftests/memfd/Makefile | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/memfd/Makefile 
b/tools/testing/selftests/memfd/Makefile
index 163b6f68631c4..e9b886c65153d 100644
--- a/tools/testing/selftests/memfd/Makefile
+++ b/tools/testing/selftests/memfd/Makefile
@@ -1,5 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
-CFLAGS += -D_FILE_OFFSET_BITS=64
 CFLAGS += $(KHDR_INCLUDES)
 
 TEST_GEN_PROGS := memfd_test
@@ -16,10 +15,9 @@ ifeq ($(VAR_LDLIBS),)
 VAR_LDLIBS := -lfuse -pthread
 endif
 
-fuse_mnt.o: CFLAGS += $(VAR_CFLAGS)
-
 include ../lib.mk
 
+$(OUTPUT)/fuse_mnt: CFLAGS += $(VAR_CFLAGS)
 $(OUTPUT)/fuse_mnt: LDLIBS += $(VAR_LDLIBS)
 
 $(OUTPUT)/memfd_test: memfd_test.c common.c
-- 
2.43.0

Re: [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations

On 5/7/25 11:15, Harry Yoo wrote:
> On Fri, Apr 25, 2025 at 10:27:23AM +0200, Vlastimil Babka wrote:
>> Add functions for efficient guaranteed allocations e.g. in a critical
>> section that cannot sleep, when the exact number of allocations is not
>> known beforehand, but an upper limit can be calculated.
>> 
>> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
>> number of objects.
>> 
>> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
>> and is guaranteed not to fail until depleted.
>> 
>> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
>> allocator after the critical section. This will also attempt to refill
>> it to cache's sheaf capacity for better efficiency of sheaves handling,
>> but it's not stricly necessary to succeed.
>> 
>> kmem_cache_refill_sheaf() can be used to refill a previously obtained
>> sheaf to requested size. If the current size is sufficient, it does
>> nothing. If the requested size exceeds cache's sheaf_capacity and the
>> sheaf's current capacity, the sheaf will be replaced with a new one,
>> hence the indirect pointer parameter.
>> 
>> kmem_cache_sheaf_size() can be used to query the current size.
>> 
>> The implementation supports requesting sizes that exceed cache's
>> sheaf_capacity, but it is not efficient - such "oversize" sheaves are
>> allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
>> immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
>> might be especially ineffective when replacing a sheaf with a new one of
>> a larger capacity. It is therefore better to size cache's
>> sheaf_capacity accordingly to make oversize sheaves exceptional.
>> 
>> CONFIG_SLUB_STATS counters are added for sheaf prefill and return
>> operations. A prefill or return is considered _fast when it is able to
>> grab or return a percpu spare sheaf (even if the sheaf needs a refill to
>> satisfy the request, as those should amortize over time), and _slow
>> otherwise (when the barn or even sheaf allocation/freeing has to be
>> involved). sheaf_prefill_oversize is provided to determine how many
>> prefills were oversize (counter for oversize returns is not necessary as
>> all oversize refills result in oversize returns).
>> 
>> When slub_debug is enabled for a cache with sheaves, no percpu sheaves
>> exist for it, but the prefill functionality is still provided simply by
>> all prefilled sheaves becoming oversize. If percpu sheaves are not
>> created for a cache due to not passing the sheaf_capacity argument on
>> cache creation, the prefills also work through oversize sheaves, but
>> there's a WARN_ON_ONCE() to indicate the omission.
>> 
>> Signed-off-by: Vlastimil Babka 
>> Reviewed-by: Suren Baghdasaryan 
>> ---
> 
> Looks good to me,
> Reviewed-by: Harry Yoo 
> 
> with a nit below.

Thanks, incorporated the suggestion!

Re: [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations

On 5/14/25 16:01, Vlastimil Babka wrote:
> On 5/6/25 23:34, Suren Baghdasaryan wrote:
>> On Fri, Apr 25, 2025 at 1:27 AM Vlastimil Babka  wrote:
>>> @@ -2631,6 +2637,24 @@ static void sheaf_flush_unused(struct kmem_cache *s, 
>>> struct slab_sheaf *sheaf)
>>> sheaf->size = 0;
>>>  }
>>>
>>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
>>> +struct slab_sheaf *sheaf);
>> 
>> I think you could safely move __rcu_free_sheaf_prepare() here and
>> avoid the above forward declaration.
> 
> Right, done.
> 
>>> @@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>>> return true;
>>>  }
>>>
>>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
>>> +struct slab_sheaf *sheaf)
>> 
>> This function seems to be an almost exact copy of free_to_pcs_bulk()
>> from your previous patch. Maybe they can be consolidated?
> 
> True, I've extracted it to __kmem_cache_free_bulk_prepare().

... and that was a mistake as free_to_pcs_bulk() diverges in patch 9/9 in a
way that this makes it too infeasible

Re: [linux-next:master] [rcutorture] c27d0d38f2: WARNING:at_kernel/rcu/rcutorture.c:#rcutorture_one_extend_check[rcutorture]

2025-05-15 Thread Oliver Sang

hi, Paul,

On Wed, May 14, 2025 at 09:18:10PM -0700, Paul E. McKenney wrote:
> On Thu, May 15, 2025 at 10:30:02AM +0800, Oliver Sang wrote:
> > hi, Paul,
> > 
> > On Wed, May 14, 2025 at 11:26:34AM -0700, Paul E. McKenney wrote:
> > > On Wed, May 14, 2025 at 10:47:30AM +0800, kernel test robot wrote:
> > > > 
> > > > hi, Paul,
> > > > 
> > > > for this commit we tested before, now we found it causes issues in 
> > > > linux-next
> > > > master branch.
> > > 
> > > Good catch as always!!!  Yes, using HRTIMER_MODE_HARD means that this
> > > ircutorture_one_extend_check() function must check for hardirq as well
> > > as softirq.  Which, oddly enough, permits simplifying the code, though
> > > a larger patch.
> > > 
> > > I could reproduce this, and the patch at the end of this email fixes it
> > > for me.  Does it work for you as well?
> > 
> > thanks a lot for information!
> > 
> > what's the base of the patch? I tried to apply it upon c27d0d38f2 or lastest
> > linux-next/master
> > bdd609656ff55 (tag: next-20250514, linux-next/master) Add linux-next 
> > specific files for 20250514
> > 
> > both failed. thanks
> 
> Apologies, I just now pushed it out on my -rcu tree:
> 
> db950fccd45a ("rcutorture: Make rcutorture_one_extend_check() account for 
> hardirq")

issue is cleaned on db950fccd45a. thanks!

Tested-by: kernel test robot 

=
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
  
clang-20/i386-randconfig-141-20250508/debian-11.1-i386-20220923.cgz/300s/vm-snb/cpuhotplug/rcutorture/srcud

commit:
  8454f1334b167 ("lib: Make the ratelimit test more reliable")
  db950fccd45a8 ("rcutorture: Make rcutorture_one_extend_check() account for 
hardirq")

8454f1334b167506 db950fccd45a8273d93ca861d84
 ---
   fail:runs  %reproductionfail:runs
   | | |
  1:50  -2%:50dmesg.EIP:__x86_return_thunk
  1:50  -2%:50dmesg.EIP:console_trylock_spinning
 48:50 -96%:50dmesg.EIP:pv_native_safe_halt
 50:50-100%:50
dmesg.EIP:rcutorture_one_extend_check
 50:50-100%:50
dmesg.WARNING:at_kernel/rcu/rcutorture.c:#rcutorture_one_extend_check[rcutorture]

> 
> I will need to rebase this to precede the updated version of the commit
> that you tested, but one step at a time...
> 
>   Thanx, Paul
> 
> > > > =
> > > > tbox_group/testcase/rootfs/kconfig/compiler/runtime/test/torture_type:
> > > >   
> > > > vm-snb/rcutorture/debian-11.1-i386-20220923.cgz/i386-randconfig-141-20250508/clang-20/300s/cpuhotplug/srcud
> > > > 
> > > > c795676b5c0a4ab7 c27d0d38f2cafb70a68ca42c410
> > > >  ---
> > > >fail:runs  %reproductionfail:runs
> > > >| | |
> > > >:50   2%   1:50
> > > > dmesg.EIP:__kernel_text_address
> > > >:50   4%   2:50
> > > > dmesg.EIP:__srcu_check_read_flavor
> > > >:50   2%   1:50
> > > > dmesg.EIP:_raw_spin_unlock_irq
> > > >:50  14%   7:50
> > > > dmesg.EIP:_raw_spin_unlock_irqrestore
> > > >:50  30%  15:50
> > > > dmesg.EIP:console_flush_all
> > > >:50   4%   2:50
> > > > dmesg.EIP:console_trylock_spinning
> > > >:50   2%   1:50dmesg.EIP:delay_tsc
> > > >:50   2%   1:50
> > > > dmesg.EIP:finish_lock_switch
> > > >:50   2%   1:50
> > > > dmesg.EIP:kernel_text_address
> > > >:50   2%   1:50dmesg.EIP:lock_acquire
> > > >:50  36%  18:50
> > > > dmesg.EIP:pv_native_safe_halt
> > > >   1:50  -2%:50
> > > > dmesg.EIP:rcu_torture_fwd_prog_cr
> > > >   7:50 -14%:50
> > > > dmesg.EIP:rcu_torture_writer
> > > >:50 100%  50:50
> > > > dmesg.EIP:rcutorture_one_extend_check
> > > >  48:50   0%  48:50
> > > > dmesg.INFO:task_blocked_for_more_than#seconds
> > > >  48:50   0%  48:50
> > > > dmesg.Kernel_panic-not_syncing:hung_task:blocked_tasks
> > > >  50:50   0%  50:50
> > > > dmesg.UBSAN:negation-overflow_in_lib/sort.c
> > > >   1:50  -2%:50
> > > > dmesg.WARNING:at_kernel/rcu/rcutorture.c:#rcu_torture_fwd_prog_cr[rcutorture]
> > > >   7:50 -14%:50
> > > > dmesg.WARNING:at_kernel/rcu/rcutorture.c

[PATCH] selftests/mm: Fix test result reporting in gup_longterm

The kselftest framework uses the string logged when a test result is
reported as the unique identifier for a test, using it to track test
results between runs. The gup_longterm test completely fails to follow
this pattern, it runs a single test function repeatedly with various
parameters but each result report is a string logging an error message
which is fixed between runs.

Since the code already logs each test uniquely before it starts refactor
to also print this to a buffer, then use that name as the test result.
This isn't especially pretty, really this test could use a more
substantial cleanup.

Signed-off-by: Mark Brown 
---
 tools/testing/selftests/mm/gup_longterm.c | 163 --
 1 file changed, 107 insertions(+), 56 deletions(-)

diff --git a/tools/testing/selftests/mm/gup_longterm.c 
b/tools/testing/selftests/mm/gup_longterm.c
index 21595b20bbc3..a849537f9372 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -35,6 +35,8 @@ static int nr_hugetlbsizes;
 static size_t hugetlbsizes[10];
 static int gup_fd;
 
+static char test_name[1024];
+
 static __fsword_t get_fs_type(int fd)
 {
struct statfs fs;
@@ -93,33 +95,48 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
__fsword_t fs_type = get_fs_type(fd);
bool should_work;
char *mem;
+   int result = KSFT_PASS;
int ret;
 
+   if (fd < 0) {
+   result = KSFT_FAIL;
+   goto report;
+   }
+
if (ftruncate(fd, size)) {
if (errno == ENOENT) {
skip_test_dodgy_fs("ftruncate()");
} else {
-   ksft_test_result_fail("ftruncate() failed (%s)\n", 
strerror(errno));
+   ksft_print_msg("ftruncate() failed (%s)\n",
+  strerror(errno));
+   result = KSFT_FAIL;
+   goto report;
}
return;
}
 
if (fallocate(fd, 0, 0, size)) {
-   if (size == pagesize)
-   ksft_test_result_fail("fallocate() failed (%s)\n", 
strerror(errno));
-   else
-   ksft_test_result_skip("need more free huge pages\n");
-   return;
+   if (size == pagesize) {
+   ksft_print_msg("fallocate() failed (%s)\n", 
strerror(errno));
+   result = KSFT_FAIL;
+   } else {
+   ksft_print_msg("need more free huge pages\n");
+   result = KSFT_SKIP;
+   }
+   goto report;
}
 
mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
   shared ? MAP_SHARED : MAP_PRIVATE, fd, 0);
if (mem == MAP_FAILED) {
-   if (size == pagesize || shared)
-   ksft_test_result_fail("mmap() failed (%s)\n", 
strerror(errno));
-   else
-   ksft_test_result_skip("need more free huge pages\n");
-   return;
+   if (size == pagesize || shared) {
+   ksft_print_msg("mmap() failed (%s)\n", strerror(errno));
+   result = KSFT_FAIL;
+   } else {
+   ksft_print_msg("need more free huge pages\n");
+   result = KSFT_SKIP;
+   }
+   goto report;
}
 
/* Fault in the page such that GUP-fast can pin it directly. */
@@ -134,7 +151,8 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
 */
ret = mprotect(mem, size, PROT_READ);
if (ret) {
-   ksft_test_result_fail("mprotect() failed (%s)\n", 
strerror(errno));
+   ksft_print_msg("mprotect() failed (%s)\n", 
strerror(errno));
+   result = KSFT_FAIL;
goto munmap;
}
/* FALLTHROUGH */
@@ -147,12 +165,14 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
type == TEST_TYPE_RW_FAST;
 
if (gup_fd < 0) {
-   ksft_test_result_skip("gup_test not available\n");
+   ksft_print_msg("gup_test not available\n");
+   result = KSFT_SKIP;
break;
}
 
if (rw && shared && fs_is_unknown(fs_type)) {
-   ksft_test_result_skip("Unknown filesystem\n");
+   ksft_print_msg("Unknown filesystem\n");
+   result = KSFT_SKIP;
return;
}
/*
@@ -169,14 +189,19 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
args.flags |= rw ? PIN_LONGTERM_TEST_FLAG_USE_WRITE :

Re: [PATCH v2 4/4] media: i2c: imx214: Remove hard-coded external clock frequency

2025-05-15 Thread Laurent Pinchart

Hi Sakari,

On Tue, May 06, 2025 at 08:24:03AM +, Sakari Ailus wrote:
> On Mon, May 05, 2025 at 11:05:56PM +0200, André Apitzsch via B4 Relay wrote:
> > From: André Apitzsch 
> > 
> > Instead rely on the rate set on the clock (using assigned-clock-rates
> > etc.)
> > 
> > Signed-off-by: André Apitzsch 
> > ---
> >  drivers/media/i2c/imx214.c | 6 --
> >  1 file changed, 6 deletions(-)
> > 
> > diff --git a/drivers/media/i2c/imx214.c b/drivers/media/i2c/imx214.c
> > index 
> > 9e9be47394ec768a5b34d44b06b5bbb0988da5a1..c12996e294dccebb18c608254f1e0d14dc064423
> >  100644
> > --- a/drivers/media/i2c/imx214.c
> > +++ b/drivers/media/i2c/imx214.c
> > @@ -32,7 +32,6 @@
> >  
> >  #define IMX214_REG_FAST_STANDBY_CTRL   CCI_REG8(0x0106)
> >  
> > -#define IMX214_DEFAULT_CLK_FREQ2400
> >  #define IMX214_DEFAULT_LINK_FREQ   6
> >  /* Keep wrong link frequency for backward compatibility */
> >  #define IMX214_DEFAULT_LINK_FREQ_LEGACY48000
> > @@ -1405,11 +1404,6 @@ static int imx214_probe(struct i2c_client *client)
> > return dev_err_probe(dev, PTR_ERR(imx214->xclk),
> >  "failed to get xclk\n");
> >  
> > -   ret = clk_set_rate(imx214->xclk, IMX214_DEFAULT_CLK_FREQ);
> > -   if (ret)
> > -   return dev_err_probe(dev, ret,
> > -"failed to set xclk frequency\n");
> > -
> 
> Oops. I missed this is what the driver was doing already. Indeed, this is
> one of the historic sensor drivers that do set the frequency in DT systems.
> 
> The driver never used the clock-frequency property and instead used a fixed
> frequency. Changing the behaviour now could be problematic.
> 
> There are options here that I think we could do:
> 
> 1) use your v1 patch (4) which uses "clock-frequency" if it exists and
> otherwise uses the default, fixed frequency or
> 
> 2) set the frequency only if the "clock-frequency" property exists. The DT
> currently requires clock-frequency and the YAML conversion was done in 2020
> whereas the driver is from 2018. If we do this, the clock-frequency should
> be deprecated (or even removed from bingings).
> 
> I wonder what others think. Cc'd Laurent in any case.

Maybe I'm missing something, but I don't really see the issue here. The
clock-frequency DT property is currently ignored, and this patch doesn't
change that situation, does it ?

The change of behaviour here is related to the assigned-clock-rates
property. If that property is specified today, it will set the clock
rate, and the driver will override it to 24MHz right after. With this
patch, the clock rate won't be overridden. I think the risk of
regression is very low here, as I don't expect systems to set
assigned-clock-rates in DT to a value different than 24MHz and expect
the driver to override it.

> > ret = imx214_get_regulators(dev, imx214);
> > if (ret < 0)
> > return dev_err_probe(dev, ret, "failed to get regulators\n");

-- 
Regards,

Laurent Pinchart

Re: [PATCH net-next v5] selftests/vsock: add initial vmtest.sh for vsock

2025-05-15 Thread Stefano Garzarella


On Tue, May 13, 2025 at 11:31:13PM -0700, Bobby Eshleman wrote:

This commit introduces a new vmtest.sh runner for vsock.

It uses virtme-ng/qemu to run tests in a VM. The tests validate G2H,
H2G, and loopback. The testing tools from tools/testing/vsock/ are
reused. Currently, only vsock_test is used.

VMCI and hyperv support is automatically built, though not used.

Only tested on x86.

To run:

 $ tools/testing/selftests/vsock/vmtest.sh


I just tried to call the script directly, but I have this:

$ tools/testing/selftests/vsock/vmtest.sh
skip:
/home/stefano/repos/linux-vsock/tools/testing/selftests/vsock/vsock_test not 
found! Please build the kselftest vsock target.
pkill: pidfile not valid
Try `pkill --help' for more information.

Is this expected?



or

 $ make -C tools/testing/selftests TARGETS=vsock run_tests

Example runs:

$ ./tools/testing/selftests/vsock/vmtest.sh
TAP version 13
1..3
not ok 0 vm_server_host_client # exit=1
ok 1 vm_client_host_server
ok 2 vm_loopback
1..3

$ ./tools/testing/selftests/vsock/vmtest.sh vm_loopback
TAP version 13
1..1
ok 0 vm_loopback
1..1

Future work can include vsock_diag_test.

The tap output style copies mm's run_vmtests.sh.

Because vsock requires a VM to test anything other than loopback, this
patch adds vmtest.sh as a kselftest itself. This is different than other
systems that have a "vmtest.sh", where it is used as a utility script to
spin up a VM to run the selftests as a guest (but isn't hooked into
kselftest).

Testing in NIPA is still WIP.

Signed-off-by: Bobby Eshleman 
---
Changes in v5:
- make log file a tmpfile (Paolo)
- make sure both default and user defined QEMU gets handled by the dependency 
check (Paolo)
- increased VM boot up timeout from 1m to 3m for slow hosts (Paolo)
- rename vm_setup -> vm_start (Paolo)
- derive wait_for_listener from selftests/net/net_helper.sh to removes ss usage
- Remove unused 'unset IFS' line (Paolo)
- leave space after variable declarations (Paolo)
- make QEMU_PIDFILE a tmp file (Paolo)
- make everything readonly that is only read (Paolo)
- source ktap_helpers.sh for KSFT_PASS and friends (Paolo)
- don't check for timeout util (Paolo)
- add missing usage string for -q qemu arg
- add tap prefix to SUMMARY line since it isn't part of TAP protocol
- exit with the correct status code based on failure/pass counts
- Link to v4: 
https://lore.kernel.org/r/20250507-vsock-vmtest-v4-1-6e2a97262...@gmail.com


Changes in v4:
- do not use special tab delimiter for help string parsing (Stefano + Paolo)
- fix paths for when installing kselftest and running out-of-tree (Paolo)
- change vng to using running kernel instead of compiled kernel (Paolo)
- use multi-line string for QEMU_OPTS (Stefano)
- change timeout to 300s (Paolo)
- skip if tools are not found and use kselftests status codes (Paolo)
- remove build from vmtest.sh (Paolo)
- change  -> SSH_HOST_PORT (Stefano)
- add tap-format output
- add vmtest.log to gitignore
- check for vsock_test binary and remind user to build it if missing
- create a proper build in makefile
- style fixes
- add ssh, timeout, and pkill to dependency check, just in case
- fix numerical comparison in conditionals
- check qemu pidfile exists before proceeding (avoid wasting time waiting for 
ssh)
- fix tracking of pass/fail bug
- fix stderr redirection bug
- Link to v3: 
https://lore.kernel.org/r/20250428-vsock-vmtest-v3-1-181af6163...@gmail.com

Changes in v3:
- use common conditional syntax for checking variables
- use return value instead of global rc
- fix typo TEST_HOST_LISTENER_PORT -> TEST_HOST_PORT_LISTENER
- use SIGTERM instead of SIGKILL on cleanup
- use peer-cid=1 for loopback
- change sleep delay times into globals
- fix test_vm_loopback logging
- add test selection in arguments
- make QEMU an argument
- check that vng binary is on path
- use QEMU variable
- change  to 
- fix hardcoded file paths
- add comment in commit msg about script that vmtest.sh was based off of
- Add tools/testing/selftest/vsock/Makefile for kselftest
- Link to v2: 
https://lore.kernel.org/r/20250417-vsock-vmtest-v2-1-3901a2733...@gmail.com

Changes in v2:
- add kernel oops and warnings checker
- change testname variable to use FUNCNAME
- fix spacing in test_vm_server_host_client
- add -s skip build option to vmtest.sh
- add test_vm_loopback
- pass port to vm_wait_for_listener
- fix indentation in vmtest.sh
- add vmci and hyperv to config
- changed whitespace from tabs to spaces in help string
- Link to v1: 
https://lore.kernel.org/r/20250410-vsock-vmtest-v1-1-f35a81dab...@gmail.com
---
MAINTAINERS  |   1 +
tools/testing/selftests/vsock/.gitignore |   2 +
tools/testing/selftests/vsock/Makefile   |  16 ++
tools/testing/selftests/vsock/config | 114 
tools/testing/selftests/vsock/settings   |   1 +
tools/testing/selftests/vsock/vmtest.sh  | 441 +++
6 files changed, 575 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 
657a67f9031ef

Re: [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing

On 5/7/25 12:39, Harry Yoo wrote:
> On Fri, Apr 25, 2025 at 10:27:29AM +0200, Vlastimil Babka wrote:
>> Since we don't control the NUMA locality of objects in percpu sheaves,
>> allocations with node restrictions bypass them. Allocations without
>> restrictions may however still expect to get local objects with high
>> probability, and the introduction of sheaves can decrease it due to
>> freed object from a remote node ending up in percpu sheaves.
>> 
>> The fraction of such remote frees seems low (5% on an 8-node machine)
>> but it can be expected that some cache or workload specific corner cases
>> exist. We can either conclude that this is not a problem due to the low
>> fraction, or we can make remote frees bypass percpu sheaves and go
>> directly to their slabs. This will make the remote frees more expensive,
>> but if if's only a small fraction, most frees will still benefit from
>> the lower overhead of percpu sheaves.
>> 
>> This patch thus makes remote object freeing bypass percpu sheaves,
>> including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
>> it's not intended to be 100% guarantee that percpu sheaves will only
>> contain local objects. The refill from slabs does not provide that
>> guarantee in the first place, and there might be cpu migrations
>> happening when we need to unlock the local_lock. Avoiding all that could
>> be possible but complicated so we can leave it for later investigation
>> whether it would be worth it. It can be expected that the more selective
>> freeing will itself prevent accumulation of remote objects in percpu
>> sheaves so any such violations would have only short-term effects.
>> 
>> Another possible optimization to investigate is whether it would be
>> beneficial for node-restricted or strict_numa allocations to attempt to
>> obtain an object from percpu sheaves if the node or mempolicy (i.e.
>> MPOL_LOCAL) happens to want the local node of the allocating cpu. Right
>> now such allocations bypass sheaves, but they could probably look first
>> whether the first available object in percpu sheaves is local, and with
>> high probability succeed - and only bypass the sheaves in cases it's
>> not local.
>> 
>> Signed-off-by: Vlastimil Babka 
>> ---
>>  mm/slab_common.c |  7 +--
>>  mm/slub.c| 43 +--
>>  2 files changed, 42 insertions(+), 8 deletions(-)
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 
>> cc273cc45f632e16644355831132cdc391219cec..2bf83e2b85b23f4db2b311edaded4bef6b7d01de
>>  100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5924,8 +5948,15 @@ void slab_free(struct kmem_cache *s, struct slab 
>> *slab, void *object,
>>  if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), 
>> false)))
>>  return;
>>  
>> -if (!s->cpu_sheaves || !free_to_pcs(s, object))
>> -do_slab_free(s, slab, object, object, 1, addr);
>> +if (s->cpu_sheaves) {
>> +if (likely(!IS_ENABLED(CONFIG_NUMA) ||
>> +   slab_nid(slab) == numa_node_id())) {
>> +free_to_pcs(s, object);
> 
> Shouldn't it call do_slab_free() when free_to_pcs() failed?

Oops yes, thanks!

> 
>> +return;
>> +}
>> +}
>> +
>> +do_slab_free(s, slab, object, object, 1, addr);
>>  }
>>  
>>  #ifdef CONFIG_MEMCG
>> 
>> -- 
>> 2.49.0
>> 
>> 
>

[PATCH] selftests/mm: Deduplicate test logging in test_mlock_lock()

The mlock2-tests test_mlock_lock() test reports two test results with
an identical string, one reporitng if it successfully locked a block of
memory and another reporting if the lock is still present after doing an
unlock (following a similar pattern to other tests in the same program).
This confuses test automation since the test string is used to deduplicate
tests, change the post unlock test to report "Unlocked" instead like the
other tests to fix this.

Signed-off-by: Mark Brown 
---
 tools/testing/selftests/mm/mlock2-tests.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mlock2-tests.c 
b/tools/testing/selftests/mm/mlock2-tests.c
index 7f0d50fa361d..3e90ff37e336 100644
--- a/tools/testing/selftests/mm/mlock2-tests.c
+++ b/tools/testing/selftests/mm/mlock2-tests.c
@@ -196,7 +196,7 @@ static void test_mlock_lock(void)
ksft_exit_fail_msg("munlock(): %s\n", strerror(errno));
}
 
-   ksft_test_result(!unlock_lock_check(map), "%s: Locked\n", __func__);
+   ksft_test_result(!unlock_lock_check(map), "%s: Unlocked\n", __func__);
munmap(map, 2 * page_size);
 }
 

---
base-commit: 82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
change-id: 20250514-selftest-mm-mlock2-dup-277d586bb29d

Best regards,
-- 
Mark Brown

Re: [PATCH] selftests/mm: Deduplicate test logging in test_mlock_lock()





On 15/05/25 2:57 pm, Mark Brown wrote:

The mlock2-tests test_mlock_lock() test reports two test results with
an identical string, one reporitng if it successfully locked a block of
memory and another reporting if the lock is still present after doing an
unlock (following a similar pattern to other tests in the same program).
This confuses test automation since the test string is used to deduplicate
tests, change the post unlock test to report "Unlocked" instead like the
other tests to fix this.

Signed-off-by: Mark Brown 
---
  tools/testing/selftests/mm/mlock2-tests.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mlock2-tests.c 
b/tools/testing/selftests/mm/mlock2-tests.c
index 7f0d50fa361d..3e90ff37e336 100644
--- a/tools/testing/selftests/mm/mlock2-tests.c
+++ b/tools/testing/selftests/mm/mlock2-tests.c
@@ -196,7 +196,7 @@ static void test_mlock_lock(void)
ksft_exit_fail_msg("munlock(): %s\n", strerror(errno));
}
  
-	ksft_test_result(!unlock_lock_check(map), "%s: Locked\n", __func__);

+   ksft_test_result(!unlock_lock_check(map), "%s: Unlocked\n", __func__);
munmap(map, 2 * page_size);
  }
  


Acked-by: Dev Jain 



---
base-commit: 82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
change-id: 20250514-selftest-mm-mlock2-dup-277d586bb29d

Best regards,

Re: [PATCH] selftests/mm: Fix test result reporting in gup_longterm





On 15/05/25 2:27 pm, Mark Brown wrote:

The kselftest framework uses the string logged when a test result is
reported as the unique identifier for a test, using it to track test
results between runs. The gup_longterm test completely fails to follow
this pattern, it runs a single test function repeatedly with various
parameters but each result report is a string logging an error message
which is fixed between runs.

Since the code already logs each test uniquely before it starts refactor
to also print this to a buffer, then use that name as the test result.
This isn't especially pretty, really this test could use a more
substantial cleanup.

Signed-off-by: Mark Brown 
---
  tools/testing/selftests/mm/gup_longterm.c | 163 --
  1 file changed, 107 insertions(+), 56 deletions(-)

diff --git a/tools/testing/selftests/mm/gup_longterm.c 
b/tools/testing/selftests/mm/gup_longterm.c
index 21595b20bbc3..a849537f9372 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -35,6 +35,8 @@ static int nr_hugetlbsizes;
  static size_t hugetlbsizes[10];
  static int gup_fd;
  
+static char test_name[1024];

+
  static __fsword_t get_fs_type(int fd)
  {
struct statfs fs;
@@ -93,33 +95,48 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
__fsword_t fs_type = get_fs_type(fd);
bool should_work;
char *mem;
+   int result = KSFT_PASS;
int ret;
  
+	if (fd < 0) {

+   result = KSFT_FAIL;
+   goto report;
+   }
+
if (ftruncate(fd, size)) {
if (errno == ENOENT) {
skip_test_dodgy_fs("ftruncate()");
} else {
-   ksft_test_result_fail("ftruncate() failed (%s)\n", 
strerror(errno));
+   ksft_print_msg("ftruncate() failed (%s)\n",
+  strerror(errno));
+   result = KSFT_FAIL;
+   goto report;
}
return;
}
  
  	if (fallocate(fd, 0, 0, size)) {

-   if (size == pagesize)
-   ksft_test_result_fail("fallocate() failed (%s)\n", 
strerror(errno));
-   else
-   ksft_test_result_skip("need more free huge pages\n");
-   return;
+   if (size == pagesize) {
+   ksft_print_msg("fallocate() failed (%s)\n", 
strerror(errno));
+   result = KSFT_FAIL;
+   } else {
+   ksft_print_msg("need more free huge pages\n");
+   result = KSFT_SKIP;
+   }
+   goto report;
}
  
  	mem = mmap(NULL, size, PROT_READ | PROT_WRITE,

   shared ? MAP_SHARED : MAP_PRIVATE, fd, 0);
if (mem == MAP_FAILED) {
-   if (size == pagesize || shared)
-   ksft_test_result_fail("mmap() failed (%s)\n", 
strerror(errno));
-   else
-   ksft_test_result_skip("need more free huge pages\n");
-   return;
+   if (size == pagesize || shared) {
+   ksft_print_msg("mmap() failed (%s)\n", strerror(errno));
+   result = KSFT_FAIL;
+   } else {
+   ksft_print_msg("need more free huge pages\n");
+   result = KSFT_SKIP;
+   }
+   goto report;
}
  
  	/* Fault in the page such that GUP-fast can pin it directly. */

@@ -134,7 +151,8 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
 */
ret = mprotect(mem, size, PROT_READ);
if (ret) {
-   ksft_test_result_fail("mprotect() failed (%s)\n", 
strerror(errno));
+   ksft_print_msg("mprotect() failed (%s)\n", 
strerror(errno));
+   result = KSFT_FAIL;
goto munmap;
}
/* FALLTHROUGH */
@@ -147,12 +165,14 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
type == TEST_TYPE_RW_FAST;
  
  		if (gup_fd < 0) {

-   ksft_test_result_skip("gup_test not available\n");
+   ksft_print_msg("gup_test not available\n");
+   result = KSFT_SKIP;
break;
}
  
  		if (rw && shared && fs_is_unknown(fs_type)) {

-   ksft_test_result_skip("Unknown filesystem\n");
+   ksft_print_msg("Unknown filesystem\n");
+   result = KSFT_SKIP;
return;
}
/*
@@ -169,14 +189,19 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
args.flags |= rw ? PIN_LONGTERM_TES

[PATCH] selftests/mm: Deduplicate default page size test results in thuge-gen

The thuge-gen test program runs mmap() and shmget() tests for both every
available page size and the default page size, resulting in two tests for
the default size. These tests are distinct since the flags in the default
case do not specify an explicit size, add the flags to the test name that
is logged to deduplicate.

Signed-off-by: Mark Brown 
---
 tools/testing/selftests/mm/thuge-gen.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mm/thuge-gen.c 
b/tools/testing/selftests/mm/thuge-gen.c
index cd5174d735be..a41bc1234b37 100644
--- a/tools/testing/selftests/mm/thuge-gen.c
+++ b/tools/testing/selftests/mm/thuge-gen.c
@@ -127,7 +127,7 @@ void test_mmap(unsigned long size, unsigned flags)
 
show(size);
ksft_test_result(size == getpagesize() || (before - after) == NUM_PAGES,
-"%s mmap %lu\n", __func__, size);
+"%s mmap %lu %x\n", __func__, size, flags);
 
if (munmap(map, size * NUM_PAGES))
ksft_exit_fail_msg("%s: unmap %s\n", __func__, strerror(errno));
@@ -165,7 +165,7 @@ void test_shmget(unsigned long size, unsigned flags)
 
show(size);
ksft_test_result(size == getpagesize() || (before - after) == NUM_PAGES,
-"%s: mmap %lu\n", __func__, size);
+"%s: mmap %lu %x\n", __func__, size, flags);
if (shmdt(map))
ksft_exit_fail_msg("%s: shmdt: %s\n", __func__, 
strerror(errno));
 }

---
base-commit: 82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
change-id: 20250514-selfests-mm-thuge-gen-dup-7e1c40716091

Best regards,
-- 
Mark Brown

Re: [PATCH] selftests/mm: Fix test result reporting in gup_longterm

On Thu, May 15, 2025 at 03:05:07PM +0530, Dev Jain wrote:
> On 15/05/25 2:27 pm, Mark Brown wrote:

> > @@ -189,7 +214,10 @@ static void do_test(int fd, size_t size, enum 
> > test_type type, bool shared)
> >  * some previously unsupported filesystems, we might want to
> >  * perform some additional tests for possible data corruptions.
> >  */
> > -   ksft_test_result(should_work, "Should have worked\n");
> > +   if (should_work)
> > +   result = KSFT_PASS;

> Missed printing "Should have worked" here.

I didn't think that output was particularly useful separate to the
overall test result (which is logged on exit from the function), it's
just the test result more than diagnostic information.

Please delete unneeded context from mails when replying.  Doing this
makes it much easier to find your reply in the message, helping ensure
it won't be missed by people scrolling through the irrelevant quoted
material.

signature.asc
Description: PGP signature

Re: [PATCH] selftests/mm: Fix test result reporting in gup_longterm





On 15/05/25 3:11 pm, Mark Brown wrote:

On Thu, May 15, 2025 at 03:05:07PM +0530, Dev Jain wrote:

On 15/05/25 2:27 pm, Mark Brown wrote:



@@ -189,7 +214,10 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
 * some previously unsupported filesystems, we might want to
 * perform some additional tests for possible data corruptions.
 */
-   ksft_test_result(should_work, "Should have worked\n");
+   if (should_work)
+   result = KSFT_PASS;



Missed printing "Should have worked" here.


I didn't think that output was particularly useful separate to the
overall test result (which is logged on exit from the function), it's
just the test result more than diagnostic information.


No hard opinion.



Please delete unneeded context from mails when replying.  Doing this
makes it much easier to find your reply in the message, helping ensure
it won't be missed by people scrolling through the irrelevant quoted
material.


You have mentioned that before, sorry my bad! I also hate it :)

Re: [PATCH] selftests/mm: Deduplicate default page size test results in thuge-gen





On 15/05/25 3:06 pm, Mark Brown wrote:

The thuge-gen test program runs mmap() and shmget() tests for both every
available page size and the default page size, resulting in two tests for
the default size. These tests are distinct since the flags in the default
case do not specify an explicit size, add the flags to the test name that
is logged to deduplicate.

Signed-off-by: Mark Brown 


Acked-by: Dev Jain

Re: [PATCH] selftests: remove duplicated function definition

2025-05-15 Thread Christian Brauner

On Wed, May 14, 2025 at 10:02:02AM +0800, Chen Linxuan wrote:
> I failed to build this test on Ubuntu 24.04. Compiler complains that
> function sys_open_tree has already been defined in
> "../filesystems/overlayfs/wrappers.h".
> 
> Signed-off-by: Chen Linxuan 
> ---

This is already fixed in the next vfs tree. Thanks though!

Re: [PATCH 01/12] module: Move modprobe_path and modules_disabled ctl_tables into the module subsys

2025-05-15 Thread Joel Granados

On Thu, May 15, 2025 at 10:04:53AM +0200, Petr Pavlu wrote:
> On 5/9/25 14:54, Joel Granados wrote:
> > Move module sysctl (modprobe_path and modules_disabled) out of sysctl.c
> > and into the modules subsystem. Make the modprobe_path variable static
> > as it no longer needs to be exported. Remove module.h from the includes
> > in sysctl as it no longer uses any module exported variables.
> > 
> > This is part of a greater effort to move ctl tables into their
> > respective subsystems which will reduce the merge conflicts in
> > kernel/sysctl.c.
> > 
> > Signed-off-by: Joel Granados 
> > [...]
> > --- a/kernel/module/kmod.c
> > +++ b/kernel/module/kmod.c
> > @@ -60,7 +60,7 @@ static DEFINE_SEMAPHORE(kmod_concurrent_max, 
> > MAX_KMOD_CONCURRENT);
> >  /*
> > modprobe_path is set via /proc/sys.
> >  */
> > -char modprobe_path[KMOD_PATH_LEN] = CONFIG_MODPROBE_PATH;
> > +static char modprobe_path[KMOD_PATH_LEN] = CONFIG_MODPROBE_PATH;
> >  
> >  static void free_modprobe_argv(struct subprocess_info *info)
> >  {
> > @@ -177,3 +177,33 @@ int __request_module(bool wait, const char *fmt, ...)
> > return ret;
> >  }
> >  EXPORT_SYMBOL(__request_module);
> > +
> > +#ifdef CONFIG_MODULES
> > +static const struct ctl_table kmod_sysctl_table[] = {
> > +   {
> > +   .procname   = "modprobe",
> > +   .data   = &modprobe_path,
> > +   .maxlen = KMOD_PATH_LEN,
> > +   .mode   = 0644,
> > +   .proc_handler   = proc_dostring,
> > +   },
> > +   {
> > +   .procname   = "modules_disabled",
> > +   .data   = &modules_disabled,
> > +   .maxlen = sizeof(int),
> > +   .mode   = 0644,
> > +   /* only handle a transition from default "0" to "1" */
> > +   .proc_handler   = proc_dointvec_minmax,
> > +   .extra1 = SYSCTL_ONE,
> > +   .extra2 = SYSCTL_ONE,
> > +   },
> 
> This is minor.. but the file kernel/module/kmod.c contains the logic to
> request direct modprobe invocation by the kernel. Registering the
> modprobe_path sysctl here is appropriate. However, the modules_disabled
> setting affects the entire module loader so I don't think it's best to
> register it here.
> 
> I suggest keeping a single table for the module sysctl values but moving
> it to kernel/module/main.c. This means the variable modprobe_path must
> retain external linkage, on the other hand, modules_disabled can be made
> static.

Like this?:

---
 include/linux/module.h |  1 -
 kernel/module/main.c   | 30 +-
 kernel/sysctl.c| 20 
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index d94b196d5a34..25476168e012 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -302,7 +302,6 @@ struct notifier_block;
 
 #ifdef CONFIG_MODULES
 
-extern int modules_disabled; /* for sysctl */
 /* Get/put a kernel symbol (calls must be symmetric) */
 void *__symbol_get(const char *symbol);
 void *__symbol_get_gpl(const char *symbol);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index a2859dc3eea6..13055ef65f15 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -126,9 +126,37 @@ static void mod_update_bounds(struct module *mod)
 }
 
 /* Block module loading/unloading? */
-int modules_disabled;
+static int modules_disabled;
 core_param(nomodule, modules_disabled, bint, 0);
 
+static const struct ctl_table kmod_sysctl_table[] = {
+   {
+   .procname   = "modprobe",
+   .data   = &modprobe_path,
+   .maxlen = KMOD_PATH_LEN,
+   .mode   = 0644,
+   .proc_handler   = proc_dostring,
+   },
+   {
+   .procname   = "modules_disabled",
+   .data   = &modules_disabled,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   /* only handle a transition from default "0" to "1" */
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = SYSCTL_ONE,
+   .extra2 = SYSCTL_ONE,
+   },
+};
+
+static int __init init_kmod_sysctl(void)
+{
+   register_sysctl_init("kernel", kmod_sysctl_table);
+   return 0;
+}
+
+subsys_initcall(init_kmod_sysctl);
+
 /* Waiting for a module to finish initializing? */
 static DECLARE_WAIT_QUEUE_HEAD(module_wq);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9b4f0cff76ea..473133d9651e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -19,7 +19,6 @@
  *  Removed it and replaced it with older style, 03/23/00, Bill Wendling
  */
 
-#include 
 #include 
 #include 
 #include 
@@ -1616,25 +1615,6 @@ static const struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
-#ifdef CONFIG_MODULES
-   {
-   .procname   = "m

Re: [PATCH] selftests: Add functional test for the abort file in fusectl

2025-05-15 Thread Amir Goldstein

On Thu, May 15, 2025 at 9:35 AM Chen Linxuan  wrote:
>
> This patch add a simple functional test for the "about" file
> in fusectlfs (/sys/fs/fuse/connections/ID/about).
>
> A simple fuse daemon is added for testing.
>
> Related discussion can be found in the link below.
>
> Link: 
> https://lore.kernel.org/all/CAOQ4uxjKFXOKQxPpxtS6G_nR0tpw95w0GiO68UcWg_OBhmSY=q...@mail.gmail.com/
> Cc: Amir Goldstein 
> Signed-off-by: Chen Linxuan 
> ---
>  MAINTAINERS   |   1 +
>  tools/testing/selftests/Makefile  |   1 +
>  .../selftests/filesystems/fusectl/.gitignore  |   3 +
>  .../selftests/filesystems/fusectl/Makefile|  21 +++
>  .../selftests/filesystems/fusectl/fuse_mnt.c  | 146 ++
>  .../filesystems/fusectl/fusectl_test.c| 115 ++
>  6 files changed, 287 insertions(+)
>  create mode 100644 tools/testing/selftests/filesystems/fusectl/.gitignore
>  create mode 100644 tools/testing/selftests/filesystems/fusectl/Makefile
>  create mode 100644 tools/testing/selftests/filesystems/fusectl/fuse_mnt.c
>  create mode 100644 tools/testing/selftests/filesystems/fusectl/fusectl_test.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f21f1dabb5fe1..efc6c89113b95 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9740,6 +9740,7 @@ T:git 
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
>  F: Documentation/filesystems/fuse.rst
>  F: fs/fuse/
>  F: include/uapi/linux/fuse.h
> +F: tools/testing/selftests/filesystems/fusectl
>
>  FUTEX SUBSYSTEM
>  M: Thomas Gleixner 
> diff --git a/tools/testing/selftests/Makefile 
> b/tools/testing/selftests/Makefile
> index 80fb84fa3cfcb..a9bfefa961889 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -36,6 +36,7 @@ TARGETS += filesystems/fat
>  TARGETS += filesystems/overlayfs
>  TARGETS += filesystems/statmount
>  TARGETS += filesystems/mount-notify
> +TARGETS += filesystems/fusectl
>  TARGETS += firmware
>  TARGETS += fpu
>  TARGETS += ftrace
> diff --git a/tools/testing/selftests/filesystems/fusectl/.gitignore 
> b/tools/testing/selftests/filesystems/fusectl/.gitignore
> new file mode 100644
> index 0..3e72e742d08e8
> --- /dev/null
> +++ b/tools/testing/selftests/filesystems/fusectl/.gitignore
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +fuse_mnt
> +fusectl_test
> diff --git a/tools/testing/selftests/filesystems/fusectl/Makefile 
> b/tools/testing/selftests/filesystems/fusectl/Makefile
> new file mode 100644
> index 0..612aad69a93aa
> --- /dev/null
> +++ b/tools/testing/selftests/filesystems/fusectl/Makefile
> @@ -0,0 +1,21 @@
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +
> +CFLAGS += -Wall -O2 -g $(KHDR_INCLUDES)
> +
> +TEST_GEN_PROGS := fusectl_test
> +TEST_GEN_FILES := fuse_mnt
> +
> +include ../../lib.mk
> +
> +VAR_CFLAGS := $(shell pkg-config fuse --cflags 2>/dev/null)
> +ifeq ($(VAR_CFLAGS),)
> +VAR_CFLAGS := -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse
> +endif
> +
> +VAR_LDLIBS := $(shell pkg-config fuse --libs 2>/dev/null)
> +ifeq ($(VAR_LDLIBS),)
> +VAR_LDLIBS := -lfuse -pthread
> +endif
> +
> +$(OUTPUT)/fuse_mnt: CFLAGS += $(VAR_CFLAGS)
> +$(OUTPUT)/fuse_mnt: LDLIBS += $(VAR_LDLIBS)
> diff --git a/tools/testing/selftests/filesystems/fusectl/fuse_mnt.c 
> b/tools/testing/selftests/filesystems/fusectl/fuse_mnt.c
> new file mode 100644
> index 0..d12b17f30fadc
> --- /dev/null
> +++ b/tools/testing/selftests/filesystems/fusectl/fuse_mnt.c
> @@ -0,0 +1,146 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * fusectl test file-system
> + * Creates a simple FUSE filesystem with a single read-write file (/test)
> + */
> +
> +#define FUSE_USE_VERSION 26
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define MAX(a, b) ((a) > (b) ? (a) : (b))
> +
> +static char *content;
> +static size_t content_size = 0;
> +static const char test_path[] = "/test";
> +
> +static int test_getattr(const char *path, struct stat *st)
> +{
> +   memset(st, 0, sizeof(*st));
> +
> +   if (!strcmp(path, "/")) {
> +   st->st_mode = S_IFDIR | 0755;
> +   st->st_nlink = 2;
> +   return 0;
> +   }
> +
> +   if (!strcmp(path, test_path)) {
> +   st->st_mode = S_IFREG | 0664;
> +   st->st_nlink = 1;
> +   st->st_size = content_size;
> +   return 0;
> +   }
> +
> +   return -ENOENT;
> +}
> +
> +static int test_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
> +   off_t offset, struct fuse_file_info *fi)
> +{
> +   if (strcmp(path, "/"))
> +   return -ENOENT;
> +
> +   filler(buf, ".", NULL, 0);
> +   filler(buf, "..", NULL, 0);
> +   filler(buf, test_path + 1, NULL, 0);
> +
> +   return 0;
> +}
> +
> +static int test_open(const char *path, struct fuse_file_info *fi)
>

Re: [PATCH bpf-next v3 4/8] selftests/bpf: Introduce verdict programs for sockmap_redir

2025-05-15 Thread Michal Luczaj

On 5/15/25 06:29, John Fastabend wrote:
> On 2025-05-15 00:15:27, Michal Luczaj wrote:
>> Instead of piggybacking on test_sockmap_listen, introduce
>> test_sockmap_redir especially for sockmap redirection tests.
>>
>> Suggested-by: Jiayuan Chen 
>> Acked-by: John Fastabend 
>> Signed-off-by: Michal Luczaj 
>> ---
> 
> LGTM, this is a net new patch in v3 though correct? In the future
> only carry the Acks into patches that are from the previous version
> and not substantially changed.

Will do, sorry.

Note that I wasn't being sneaky, I asked about it in the cover letter :)

Thanks,
Michal

> All the same thanks for doing this. For this patch.
> 
> Acked-by: John Fastabend

Re: [PATCH v4] vdpa/octeon_ep: Control PCI dev enabling manually

2025-05-15 Thread Michael S. Tsirkin

On Thu, May 15, 2025 at 09:14:22AM +0200, Philipp Stanner wrote:
> On Thu, 2025-05-08 at 10:51 +0200, Philipp Stanner wrote:
> > PCI region request functions such as pci_request_region() currently
> > have
> > the problem of becoming sometimes managed functions, if
> > pcim_enable_device() instead of pci_enable_device() was called. The
> > PCI
> > subsystem wants to remove this deprecated behavior from its
> > interfaces.
> > 
> > octeopn_ep enables its device with pcim_enable_device() (for VF. PF
> > uses
> > manual management), but does so only to get automatic disablement.
> > The
> > driver wants to manage its PCI resources for VF manually, without
> > devres.
> > 
> > The easiest way not to use automatic resource management at all is by
> > also handling device enable- and disablement manually.
> > 
> > Replace pcim_enable_device() with pci_enable_device(). Add the
> > necessary
> > calls to pci_disable_device().
> > 
> > Signed-off-by: Philipp Stanner 
> > Acked-by: Vamsi Attunuru 
> 
> Hi again,
> 
> this is the last of 12 drivers blocking me from removing a few hundred
> lines of broken code from PCI. Would be great if it could be sent to
> Linus next merge window.
> 
> Can someone take this patch in?
> 
> Thx
> P.

I intend to, working on packing things up for -next as we speak.

> > ---
> > Changes in v4:
> >   - s/AF/PF
> >   - Add Vamsi's AB
> > 
> > Changes in v3:
> >   - Only call pci_disable_device() for the PF version. For AF it
> > would
> >     cause a WARN_ON because pcim_enable_device()'s callback will also
> >     try to disable.
> > ---
> >  drivers/vdpa/octeon_ep/octep_vdpa_main.c | 17 -
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/vdpa/octeon_ep/octep_vdpa_main.c
> > b/drivers/vdpa/octeon_ep/octep_vdpa_main.c
> > index f3d4dda4e04c..9b49efd24391 100644
> > --- a/drivers/vdpa/octeon_ep/octep_vdpa_main.c
> > +++ b/drivers/vdpa/octeon_ep/octep_vdpa_main.c
> > @@ -454,6 +454,9 @@ static void octep_vdpa_remove_pf(struct pci_dev
> > *pdev)
> >     octep_iounmap_region(pdev, octpf->base,
> > OCTEP_HW_MBOX_BAR);
> >  
> >     octep_vdpa_pf_bar_expand(octpf);
> > +
> > +   /* The pf version does not use managed PCI. */
> > +   pci_disable_device(pdev);
> >  }
> >  
> >  static void octep_vdpa_vf_bar_shrink(struct pci_dev *pdev)
> > @@ -825,7 +828,7 @@ static int octep_vdpa_probe_pf(struct pci_dev
> > *pdev)
> >     struct octep_pf *octpf;
> >     int ret;
> >  
> > -   ret = pcim_enable_device(pdev);
> > +   ret = pci_enable_device(pdev);
> >     if (ret) {
> >     dev_err(dev, "Failed to enable device\n");
> >     return ret;
> > @@ -834,15 +837,17 @@ static int octep_vdpa_probe_pf(struct pci_dev
> > *pdev)
> >     ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));
> >     if (ret) {
> >     dev_err(dev, "No usable DMA configuration\n");
> > -   return ret;
> > +   goto disable_pci;
> >     }
> >     octpf = devm_kzalloc(dev, sizeof(*octpf), GFP_KERNEL);
> > -   if (!octpf)
> > -   return -ENOMEM;
> > +   if (!octpf) {
> > +   ret = -ENOMEM;
> > +   goto disable_pci;
> > +   }
> >  
> >     ret = octep_iomap_region(pdev, octpf->base,
> > OCTEP_HW_MBOX_BAR);
> >     if (ret)
> > -   return ret;
> > +   goto disable_pci;
> >  
> >     pci_set_master(pdev);
> >     pci_set_drvdata(pdev, octpf);
> > @@ -856,6 +861,8 @@ static int octep_vdpa_probe_pf(struct pci_dev
> > *pdev)
> >  
> >  unmap_region:
> >     octep_iounmap_region(pdev, octpf->base, OCTEP_HW_MBOX_BAR);
> > +disable_pci:
> > +   pci_disable_device(pdev);
> >     return ret;
> >  }
> >

Re: [PATCH v2 4/4] media: i2c: imx214: Remove hard-coded external clock frequency

2025-05-15 Thread Sakari Ailus

Hi Laurent,

On Thu, May 15, 2025 at 10:58:46AM +0200, Laurent Pinchart wrote:
> Hi Sakari,
> 
> On Tue, May 06, 2025 at 08:24:03AM +, Sakari Ailus wrote:
> > On Mon, May 05, 2025 at 11:05:56PM +0200, André Apitzsch via B4 Relay wrote:
> > > From: André Apitzsch 
> > > 
> > > Instead rely on the rate set on the clock (using assigned-clock-rates
> > > etc.)
> > > 
> > > Signed-off-by: André Apitzsch 
> > > ---
> > >  drivers/media/i2c/imx214.c | 6 --
> > >  1 file changed, 6 deletions(-)
> > > 
> > > diff --git a/drivers/media/i2c/imx214.c b/drivers/media/i2c/imx214.c
> > > index 
> > > 9e9be47394ec768a5b34d44b06b5bbb0988da5a1..c12996e294dccebb18c608254f1e0d14dc064423
> > >  100644
> > > --- a/drivers/media/i2c/imx214.c
> > > +++ b/drivers/media/i2c/imx214.c
> > > @@ -32,7 +32,6 @@
> > >  
> > >  #define IMX214_REG_FAST_STANDBY_CTRL CCI_REG8(0x0106)
> > >  
> > > -#define IMX214_DEFAULT_CLK_FREQ  2400
> > >  #define IMX214_DEFAULT_LINK_FREQ 6
> > >  /* Keep wrong link frequency for backward compatibility */
> > >  #define IMX214_DEFAULT_LINK_FREQ_LEGACY  48000
> > > @@ -1405,11 +1404,6 @@ static int imx214_probe(struct i2c_client *client)
> > >   return dev_err_probe(dev, PTR_ERR(imx214->xclk),
> > >"failed to get xclk\n");
> > >  
> > > - ret = clk_set_rate(imx214->xclk, IMX214_DEFAULT_CLK_FREQ);
> > > - if (ret)
> > > - return dev_err_probe(dev, ret,
> > > -  "failed to set xclk frequency\n");
> > > -
> > 
> > Oops. I missed this is what the driver was doing already. Indeed, this is
> > one of the historic sensor drivers that do set the frequency in DT systems.
> > 
> > The driver never used the clock-frequency property and instead used a fixed
> > frequency. Changing the behaviour now could be problematic.
> > 
> > There are options here that I think we could do:
> > 
> > 1) use your v1 patch (4) which uses "clock-frequency" if it exists and
> > otherwise uses the default, fixed frequency or
> > 
> > 2) set the frequency only if the "clock-frequency" property exists. The DT
> > currently requires clock-frequency and the YAML conversion was done in 2020
> > whereas the driver is from 2018. If we do this, the clock-frequency should
> > be deprecated (or even removed from bingings).
> > 
> > I wonder what others think. Cc'd Laurent in any case.
> 
> Maybe I'm missing something, but I don't really see the issue here. The
> clock-frequency DT property is currently ignored, and this patch doesn't
> change that situation, does it ?
> 
> The change of behaviour here is related to the assigned-clock-rates
> property. If that property is specified today, it will set the clock
> rate, and the driver will override it to 24MHz right after. With this
> patch, the clock rate won't be overridden. I think the risk of
> regression is very low here, as I don't expect systems to set
> assigned-clock-rates in DT to a value different than 24MHz and expect
> the driver to override it.

If the DTS had assigned-clock-rates set correctly, then yes. How much can
we trust the older DTS did have that?

-- 
Regards,

Sakari Ailus

Re: [PATCH 01/12] module: Move modprobe_path and modules_disabled ctl_tables into the module subsys

2025-05-15 Thread Petr Pavlu

On 5/15/25 12:04, Joel Granados wrote:
> On Thu, May 15, 2025 at 10:04:53AM +0200, Petr Pavlu wrote:
>> On 5/9/25 14:54, Joel Granados wrote:
>>> Move module sysctl (modprobe_path and modules_disabled) out of sysctl.c
>>> and into the modules subsystem. Make the modprobe_path variable static
>>> as it no longer needs to be exported. Remove module.h from the includes
>>> in sysctl as it no longer uses any module exported variables.
>>>
>>> This is part of a greater effort to move ctl tables into their
>>> respective subsystems which will reduce the merge conflicts in
>>> kernel/sysctl.c.
>>>
>>> Signed-off-by: Joel Granados 
>>> [...]
>>> --- a/kernel/module/kmod.c
>>> +++ b/kernel/module/kmod.c
>>> @@ -60,7 +60,7 @@ static DEFINE_SEMAPHORE(kmod_concurrent_max, 
>>> MAX_KMOD_CONCURRENT);
>>>  /*
>>> modprobe_path is set via /proc/sys.
>>>  */
>>> -char modprobe_path[KMOD_PATH_LEN] = CONFIG_MODPROBE_PATH;
>>> +static char modprobe_path[KMOD_PATH_LEN] = CONFIG_MODPROBE_PATH;
>>>  
>>>  static void free_modprobe_argv(struct subprocess_info *info)
>>>  {
>>> @@ -177,3 +177,33 @@ int __request_module(bool wait, const char *fmt, ...)
>>> return ret;
>>>  }
>>>  EXPORT_SYMBOL(__request_module);
>>> +
>>> +#ifdef CONFIG_MODULES
>>> +static const struct ctl_table kmod_sysctl_table[] = {
>>> +   {
>>> +   .procname   = "modprobe",
>>> +   .data   = &modprobe_path,
>>> +   .maxlen = KMOD_PATH_LEN,
>>> +   .mode   = 0644,
>>> +   .proc_handler   = proc_dostring,
>>> +   },
>>> +   {
>>> +   .procname   = "modules_disabled",
>>> +   .data   = &modules_disabled,
>>> +   .maxlen = sizeof(int),
>>> +   .mode   = 0644,
>>> +   /* only handle a transition from default "0" to "1" */
>>> +   .proc_handler   = proc_dointvec_minmax,
>>> +   .extra1 = SYSCTL_ONE,
>>> +   .extra2 = SYSCTL_ONE,
>>> +   },
>>
>> This is minor.. but the file kernel/module/kmod.c contains the logic to
>> request direct modprobe invocation by the kernel. Registering the
>> modprobe_path sysctl here is appropriate. However, the modules_disabled
>> setting affects the entire module loader so I don't think it's best to
>> register it here.
>>
>> I suggest keeping a single table for the module sysctl values but moving
>> it to kernel/module/main.c. This means the variable modprobe_path must
>> retain external linkage, on the other hand, modules_disabled can be made
>> static.
> 
> Like this?:
> [...]

Let's also move the KMOD_PATH_LEN definition and the modprobe_path
declaration from include/linux/kmod.h to kernel/module/internal.h, as
they are now fully internal to the module loader, and use "module"
instead of "kmod" in the sysctl registration to avoid confusion with the
modprobe logic.

The adjusted patch is below.

Reviewed-by: Petr Pavlu 

-- 
Thanks,
Petr


diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 68f69362d427..9a07c3215389 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -14,10 +14,7 @@
 #include 
 #include 
 
-#define KMOD_PATH_LEN 256
-
 #ifdef CONFIG_MODULES
-extern char modprobe_path[]; /* for sysctl */
 /* modprobe exit status on success, -ve on error.  Return value
  * usually useless though. */
 extern __printf(2, 3)
diff --git a/include/linux/module.h b/include/linux/module.h
index 8050f77c3b64..f4ab8d90c475 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -304,7 +304,6 @@ struct notifier_block;
 
 #ifdef CONFIG_MODULES
 
-extern int modules_disabled; /* for sysctl */
 /* Get/put a kernel symbol (calls must be symmetric) */
 void *__symbol_get(const char *symbol);
 void *__symbol_get_gpl(const char *symbol);
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 626cf8668a7e..0954c8de00c2 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -58,6 +58,9 @@ extern const struct kernel_symbol __stop___ksymtab_gpl[];
 extern const u32 __start___kcrctab[];
 extern const u32 __start___kcrctab_gpl[];
 
+#define KMOD_PATH_LEN 256
+extern char modprobe_path[];
+
 struct load_info {
const char *name;
/* pointer to module in temporary copy, freed at end of load_module() */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index a2859dc3eea6..a336b7b3fb23 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -126,9 +126,37 @@ static void mod_update_bounds(struct module *mod)
 }
 
 /* Block module loading/unloading? */
-int modules_disabled;
+static int modules_disabled;
 core_param(nomodule, modules_disabled, bint, 0);
 
+static const struct ctl_table module_sysctl_table[] = {
+   {
+   .procname   = "modprobe",
+   .data   = &modprobe_path,
+   .maxlen = KMOD_PATH_LEN,
+   .mode   = 0644,
+   .proc_handler   = proc_dostring,
+   },
+   {
+

Re: [PATCH v4 0/9] SLUB percpu sheaves

On 4/25/25 10:27, Vlastimil Babka wrote:
> Hi,
> 
> This is the v4 and first non-RFC series to add an opt-in percpu
> array-based caching layer to SLUB, following the LSF/MM discussions.
> Since v3 I've also made changes to achieve full compatibility with
> slub_debug, and IRC discussions led to the last patch intended to
> improve NUMA locality (the patch remains separate for evaluation
> purposes).

I've pushed the changes based on the feedback here:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slub-percpu-sheaves

You can use that for testing/benchmarking. Thanks!

Re: [PATCH v2 4/4] media: i2c: imx214: Remove hard-coded external clock frequency

2025-05-15 Thread Laurent Pinchart

On Thu, May 15, 2025 at 02:54:54PM +0300, Sakari Ailus wrote:
> On Thu, May 15, 2025 at 10:58:46AM +0200, Laurent Pinchart wrote:
> > On Tue, May 06, 2025 at 08:24:03AM +, Sakari Ailus wrote:
> > > On Mon, May 05, 2025 at 11:05:56PM +0200, André Apitzsch via B4 Relay 
> > > wrote:
> > > > From: André Apitzsch 
> > > > 
> > > > Instead rely on the rate set on the clock (using assigned-clock-rates
> > > > etc.)
> > > > 
> > > > Signed-off-by: André Apitzsch 
> > > > ---
> > > >  drivers/media/i2c/imx214.c | 6 --
> > > >  1 file changed, 6 deletions(-)
> > > > 
> > > > diff --git a/drivers/media/i2c/imx214.c b/drivers/media/i2c/imx214.c
> > > > index 
> > > > 9e9be47394ec768a5b34d44b06b5bbb0988da5a1..c12996e294dccebb18c608254f1e0d14dc064423
> > > >  100644
> > > > --- a/drivers/media/i2c/imx214.c
> > > > +++ b/drivers/media/i2c/imx214.c
> > > > @@ -32,7 +32,6 @@
> > > >  
> > > >  #define IMX214_REG_FAST_STANDBY_CTRL   CCI_REG8(0x0106)
> > > >  
> > > > -#define IMX214_DEFAULT_CLK_FREQ2400
> > > >  #define IMX214_DEFAULT_LINK_FREQ   6
> > > >  /* Keep wrong link frequency for backward compatibility */
> > > >  #define IMX214_DEFAULT_LINK_FREQ_LEGACY48000
> > > > @@ -1405,11 +1404,6 @@ static int imx214_probe(struct i2c_client 
> > > > *client)
> > > > return dev_err_probe(dev, PTR_ERR(imx214->xclk),
> > > >  "failed to get xclk\n");
> > > >  
> > > > -   ret = clk_set_rate(imx214->xclk, IMX214_DEFAULT_CLK_FREQ);
> > > > -   if (ret)
> > > > -   return dev_err_probe(dev, ret,
> > > > -"failed to set xclk frequency\n");
> > > > -
> > > 
> > > Oops. I missed this is what the driver was doing already. Indeed, this is
> > > one of the historic sensor drivers that do set the frequency in DT 
> > > systems.
> > > 
> > > The driver never used the clock-frequency property and instead used a 
> > > fixed
> > > frequency. Changing the behaviour now could be problematic.
> > > 
> > > There are options here that I think we could do:
> > > 
> > > 1) use your v1 patch (4) which uses "clock-frequency" if it exists and
> > > otherwise uses the default, fixed frequency or
> > > 
> > > 2) set the frequency only if the "clock-frequency" property exists. The DT
> > > currently requires clock-frequency and the YAML conversion was done in 
> > > 2020
> > > whereas the driver is from 2018. If we do this, the clock-frequency should
> > > be deprecated (or even removed from bingings).
> > > 
> > > I wonder what others think. Cc'd Laurent in any case.
> > 
> > Maybe I'm missing something, but I don't really see the issue here. The
> > clock-frequency DT property is currently ignored, and this patch doesn't
> > change that situation, does it ?
> > 
> > The change of behaviour here is related to the assigned-clock-rates
> > property. If that property is specified today, it will set the clock
> > rate, and the driver will override it to 24MHz right after. With this
> > patch, the clock rate won't be overridden. I think the risk of
> > regression is very low here, as I don't expect systems to set
> > assigned-clock-rates in DT to a value different than 24MHz and expect
> > the driver to override it.
> 
> If the DTS had assigned-clock-rates set correctly, then yes. How much can
> we trust the older DTS did have that?

I am relatively confident that DT-based systems wouldn't have an
assigned-clock-rates property with a frequency that doesn't match
IMX214_DEFAULT_CLK_FREQ. The real question is whether or not I'm
over-confident :-)

-- 
Regards,

Laurent Pinchart

Re: [linux-next:master] [rcutorture] c27d0d38f2: WARNING:at_kernel/rcu/rcutorture.c:#rcutorture_one_extend_check[rcutorture]

2025-05-15 Thread Paul E. McKenney

On Thu, May 15, 2025 at 04:45:35PM +0800, Oliver Sang wrote:
> hi, Paul,
> 
> On Wed, May 14, 2025 at 09:18:10PM -0700, Paul E. McKenney wrote:
> > On Thu, May 15, 2025 at 10:30:02AM +0800, Oliver Sang wrote:
> > > hi, Paul,
> > > 
> > > On Wed, May 14, 2025 at 11:26:34AM -0700, Paul E. McKenney wrote:
> > > > On Wed, May 14, 2025 at 10:47:30AM +0800, kernel test robot wrote:
> > > > > 
> > > > > hi, Paul,
> > > > > 
> > > > > for this commit we tested before, now we found it causes issues in 
> > > > > linux-next
> > > > > master branch.
> > > > 
> > > > Good catch as always!!!  Yes, using HRTIMER_MODE_HARD means that this
> > > > ircutorture_one_extend_check() function must check for hardirq as well
> > > > as softirq.  Which, oddly enough, permits simplifying the code, though
> > > > a larger patch.
> > > > 
> > > > I could reproduce this, and the patch at the end of this email fixes it
> > > > for me.  Does it work for you as well?
> > > 
> > > thanks a lot for information!
> > > 
> > > what's the base of the patch? I tried to apply it upon c27d0d38f2 or 
> > > lastest
> > > linux-next/master
> > > bdd609656ff55 (tag: next-20250514, linux-next/master) Add linux-next 
> > > specific files for 20250514
> > > 
> > > both failed. thanks
> > 
> > Apologies, I just now pushed it out on my -rcu tree:
> > 
> > db950fccd45a ("rcutorture: Make rcutorture_one_extend_check() account for 
> > hardirq")
> 
> issue is cleaned on db950fccd45a. thanks!
> 
> Tested-by: kernel test robot 

I will apply on my next rebase, and once again thank you for your testing!

Thanx, Paul

> =
> compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
>   
> clang-20/i386-randconfig-141-20250508/debian-11.1-i386-20220923.cgz/300s/vm-snb/cpuhotplug/rcutorture/srcud
> 
> commit:
>   8454f1334b167 ("lib: Make the ratelimit test more reliable")
>   db950fccd45a8 ("rcutorture: Make rcutorture_one_extend_check() account for 
> hardirq")
> 
> 8454f1334b167506 db950fccd45a8273d93ca861d84
>  ---
>fail:runs  %reproductionfail:runs
>| | |
>   1:50  -2%:50dmesg.EIP:__x86_return_thunk
>   1:50  -2%:50
> dmesg.EIP:console_trylock_spinning
>  48:50 -96%:50dmesg.EIP:pv_native_safe_halt
>  50:50-100%:50
> dmesg.EIP:rcutorture_one_extend_check
>  50:50-100%:50
> dmesg.WARNING:at_kernel/rcu/rcutorture.c:#rcutorture_one_extend_check[rcutorture]
> 
> > 
> > I will need to rebase this to precede the updated version of the commit
> > that you tested, but one step at a time...
> > 
> > Thanx, Paul
> > 
> > > > > =
> > > > > tbox_group/testcase/rootfs/kconfig/compiler/runtime/test/torture_type:
> > > > >   
> > > > > vm-snb/rcutorture/debian-11.1-i386-20220923.cgz/i386-randconfig-141-20250508/clang-20/300s/cpuhotplug/srcud
> > > > > 
> > > > > c795676b5c0a4ab7 c27d0d38f2cafb70a68ca42c410
> > > > >  ---
> > > > >fail:runs  %reproductionfail:runs
> > > > >| | |
> > > > >:50   2%   1:50
> > > > > dmesg.EIP:__kernel_text_address
> > > > >:50   4%   2:50
> > > > > dmesg.EIP:__srcu_check_read_flavor
> > > > >:50   2%   1:50
> > > > > dmesg.EIP:_raw_spin_unlock_irq
> > > > >:50  14%   7:50
> > > > > dmesg.EIP:_raw_spin_unlock_irqrestore
> > > > >:50  30%  15:50
> > > > > dmesg.EIP:console_flush_all
> > > > >:50   4%   2:50
> > > > > dmesg.EIP:console_trylock_spinning
> > > > >:50   2%   1:50dmesg.EIP:delay_tsc
> > > > >:50   2%   1:50
> > > > > dmesg.EIP:finish_lock_switch
> > > > >:50   2%   1:50
> > > > > dmesg.EIP:kernel_text_address
> > > > >:50   2%   1:50dmesg.EIP:lock_acquire
> > > > >:50  36%  18:50
> > > > > dmesg.EIP:pv_native_safe_halt
> > > > >   1:50  -2%:50
> > > > > dmesg.EIP:rcu_torture_fwd_prog_cr
> > > > >   7:50 -14%:50
> > > > > dmesg.EIP:rcu_torture_writer
> > > > >:50 100%  50:50
> > > > > dmesg.EIP:rcutorture_one_extend_check
> > > > >  48:50   0%  48:50
> > > > > dmesg.INFO:task_blocked_for_more_than#seconds
> > > > >  48:50   0%

LFX Mentorship: Kselftest Patch Submission - Message Clarity Improvement in futex_requeue

2025-05-15 Thread Shivam Sharma

Hi maintainers,

As part of the Kselftest task for the LFX Mentorship Program, I have
reviewed the futex selftest and made minor improvements to the message
clarity in `futex_requeue.c`.

Attached is the patch with the changes. I’ve also uploaded it to the
mentorship platform as instructed.

Please let me know if any changes are needed.

Thanks,
Shivam Sharma
10sharmashi...@gmail.com


0001-selftests-futex-improve-message-clarity-in-futex_req.patch
Description: Binary data

Re: [PATCH v4 0/9] SLUB percpu sheaves

2025-05-15 Thread Suren Baghdasaryan

On Thu, May 15, 2025 at 5:46 AM Vlastimil Babka  wrote:
>
> On 4/25/25 10:27, Vlastimil Babka wrote:
> > Hi,
> >
> > This is the v4 and first non-RFC series to add an opt-in percpu
> > array-based caching layer to SLUB, following the LSF/MM discussions.
> > Since v3 I've also made changes to achieve full compatibility with
> > slub_debug, and IRC discussions led to the last patch intended to
> > improve NUMA locality (the patch remains separate for evaluation
> > purposes).
>
> I've pushed the changes based on the feedback here:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slub-percpu-sheaves
>
> You can use that for testing/benchmarking. Thanks!

Thanks! I'll give it a spin this weekend.

Re: [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations

2025-05-15 Thread Suren Baghdasaryan

On Thu, May 15, 2025 at 1:45 AM Vlastimil Babka  wrote:
>
> On 5/14/25 16:01, Vlastimil Babka wrote:
> > On 5/6/25 23:34, Suren Baghdasaryan wrote:
> >> On Fri, Apr 25, 2025 at 1:27 AM Vlastimil Babka  wrote:
> >>> @@ -2631,6 +2637,24 @@ static void sheaf_flush_unused(struct kmem_cache 
> >>> *s, struct slab_sheaf *sheaf)
> >>> sheaf->size = 0;
> >>>  }
> >>>
> >>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> >>> +struct slab_sheaf *sheaf);
> >>
> >> I think you could safely move __rcu_free_sheaf_prepare() here and
> >> avoid the above forward declaration.
> >
> > Right, done.
> >
> >>> @@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void 
> >>> *object)
> >>> return true;
> >>>  }
> >>>
> >>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> >>> +struct slab_sheaf *sheaf)
> >>
> >> This function seems to be an almost exact copy of free_to_pcs_bulk()
> >> from your previous patch. Maybe they can be consolidated?
> >
> > True, I've extracted it to __kmem_cache_free_bulk_prepare().
>
> ... and that was a mistake as free_to_pcs_bulk() diverges in patch 9/9 in a
> way that this makes it too infeasible

Ah, I see. Makes sense. Sorry for the misleading suggestion.

Re: [PATCH v4 1/1] x86/sgx: Enable automatic SVN updates for SGX enclaves

2025-05-15 Thread Dave Hansen

On 5/14/25 00:32, Reshetova, Elena wrote:
>> This was the recent discussion I am aware we had on this matter:
>> https://lkml.org/lkml/2024/2/5/1595
>> The measurements were done for older platform (skylake), but I am not
>> aware of any architectural changes since that time to improve this.
> And to make it more concrete, I made some simple tests based
> on program for stress testing rdrand/rdseed that was discussed in that
> threat earlier: https://lkml.org/lkml/2024/2/6/746 
> Using this stress testing and enough threads, I can make EUPDATESVN fail
> reliably and quite easily even with 10 time retry loop by kernel. So,
> given this, what should be the correct behaviour be here? WARN_ON
> is not justified imo. But should we return an error to userspace and
> block open()? 

No, we can't block open. Just silently fail the EUPDATESVN for now.

We can have a separate bikeshedding session about how to warn about the
failure (or not).

I can also hear Sean screaming from across the room that silent failure
is going to be really nasty to debug. Once we settle on this set, we can
have another discussion on an explicit update ABI so that folks who
truly control their environment can stop all the guests and explicitly
run EUPDATESVN at a nice convenient time. *THAT* interface can punt
retries out to userspace and warn for sure.

But, one thing at a time, please. This set will cover *normal* users:
those who don't know exactly where each SGX user is and what their
lifetimes are. Oh, and normal users also aren't under RDSEED attacks.

Re: [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers

2025-05-15 Thread Mi, Dapeng



On 5/16/2025 3:22 AM, Sean Christopherson wrote:
> On Thu, May 15, 2025, Dapeng Mi wrote:
>> On 5/15/2025 8:37 AM, Sean Christopherson wrote:
 diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
 index 153972e944eb..eba086ef5eca 100644
 --- a/arch/x86/kvm/svm/pmu.c
 +++ b/arch/x86/kvm/svm/pmu.c
 @@ -198,12 +198,20 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters,
 kvm_pmu_cap.num_counters_gp);
  
 -  if (pmu->version > 1) {
 -  pmu->global_ctrl_rsvd = ~((1ull << pmu->nr_arch_gp_counters) - 
 1);
 +  if (kvm_pmu_cap.version > 1) {
> ARGH!  I saw the pmu->version => kvm_pmu_cap.version change when going 
> through
> this patch, but assumed it was simply a refactoring bug and ignored it.
>
> Nope, 100% intentional, as I discovered after spending the better part of an 
> hour
> debugging.  Stuffing pmu->global_ctrl when it doesn't exist is necessary when 
> the
> mediated PMU is enabled, because pmu->global_ctrl will always be loaded in 
> hardware.
> And so loading '0' means the PMU is effectively disabled, because KVM 
> disallows the
> guest from writing the MSR.
>
> _That_ is the type of thing that absolutely needs a comment and a lengthy 
> explanation
> in the changelog.

Yes, this change is intended.  As long as HW supports Global_CTRL MSR, KVM
should write available value into it at vm-entry regardless of if guest PMU
has GLOBAL_CTRL (pmu version < 2), otherwise guest counters won't be really
enabled on HW.

yeah, it's indeed easily confused and should put a comment here. My fault ...


>
> Intel also needs the same treatment for PMU v1.  And since there's no hackery 
> that
> I can see, that suggests PMU v1 guests haven't been tested with the mediated 
> PMU
> on Intel.
>
> I guess congratulations are in order, because this patch just became my new 
> goto
> example of why I'm so strict about on the "one thing per patch" rule.

Embarrassing 


>
>>> It's not just global_ctrl.  PEBS and the fixed counters also depend on v2+ 
>>> (the
>>> SDM contradicts itself; KVM's ABI is that they're v2+).
>>>
 +  /*
 +   * At RESET, AMD CPUs set all enable bits for general purpose 
 counters in
 +   * IA32_PERF_GLOBAL_CTRL (so that software that was written for 
 v1 PMUs
 +   * don't unknowingly leave GP counters disabled in the global 
 controls).
 +   * Emulate that behavior when refreshing the PMU so that 
 userspace doesn't
 +   * need to manually set PERF_GLOBAL_CTRL.
 +   */
 +  pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
 +  pmu->global_ctrl_rsvd = ~pmu->global_ctrl;
pmu->global_status_rsvd = pmu->global_ctrl_rsvd;
}

Re: [PATCH v4 30/38] KVM: x86/pmu: Handle emulated instruction for mediated vPMU

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>  {
> - pmc->emulated_counter++;
> - kvm_pmu_request_counter_reprogram(pmc);
> + struct kvm_vcpu *vcpu = pmc->vcpu;
> +
> + /*
> +  * For perf-based PMUs, accumulate software-emulated events separately
> +  * from pmc->counter, as pmc->counter is offset by the count of the
> +  * associated perf event. Request reprogramming, which will consult
> +  * both emulated and hardware-generated events to detect overflow.
> +  */
> + if (!kvm_mediated_pmu_enabled(vcpu)) {
> + pmc->emulated_counter++;
> + kvm_pmu_request_counter_reprogram(pmc);
> + return;
> + }
> +
> + /*
> +  * For mediated PMUs, pmc->counter is updated when the vCPU's PMU is
> +  * put, and will be loaded into hardware when the PMU is loaded. Simply
> +  * increment the counter and signal overflow if it wraps to zero.
> +  */
> + pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
> + if (!pmc->counter) {

Ugh, this is broken for the fastpath.  If kvm_skip_emulated_instruction() is
invoked by handle_fastpath_set_msr_irqoff() or handle_fastpath_hlt(), KVM may
consume stale information (GLOBAL_CTRL, GLOBAL_STATUS and PMCs), and even if KVM
gets lucky and those are all fresh, the PMC and GLOBAL_STATUS changes won't be
propagated back to hardware.

The best idea I have is to track whether or not the guest may be counting 
branches
and/or instruction based on eventsels, and then bail from fastpaths that need to
skip instructions.  That flag would also be useful to further optimize
kvm_pmu_trigger_event().

> + pmc_to_pmu(pmc)->global_status |= BIT_ULL(pmc->idx);
> + if (pmc_pmi_enabled(pmc))
> + kvm_make_request(KVM_REQ_PMI, vcpu);
> + }
>  }
>  
>  static inline bool cpl_is_matched(struct kvm_pmc *pmc)
> -- 
> 2.49.0.395.g12beb8f557-goog
>

Re: [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and event filtering

2025-05-15 Thread Mi, Dapeng



On 3/25/2025 1:31 AM, Mingwei Zhang wrote:
> From: Dapeng Mi 
>
> Mediated vPMU needs to intercept EVENTSELx and FIXED_CNTR_CTRL MSRs to
> filter out guest malicious perf events. Either writing these MSRs or
> updating event filters would call reprogram_counter() eventually. Thus
> check if the guest event should be filtered out in reprogram_counter().
> If so, clear corresponding EVENTSELx MSR or FIXED_CNTR_CTRL field to
> ensure the guest event won't be really enabled at vm-entry.
>
> Besides, mediated vPMU intercepts the MSRs of these guest not owned
> counters and it just needs simply to read/write from/to pmc->counter.
>
> Suggested-by: Sean Christopherson 
> Signed-off-by: Dapeng Mi 
> Co-developed-by: Mingwei Zhang 
> Signed-off-by: Mingwei Zhang 
> ---
>  arch/x86/kvm/pmu.c | 27 +++
>  arch/x86/kvm/pmu.h |  3 +++
>  2 files changed, 30 insertions(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 63143eeb5c44..e9100dc49fdc 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -305,6 +305,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
>  
>  void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
>  {
> + if (kvm_mediated_pmu_enabled(pmc->vcpu)) {
> + pmc->counter = val & pmc_bitmask(pmc);
> + return;
> + }
> +
>   /*
>* Drop any unconsumed accumulated counts, the WRMSR is a write, not a
>* read-modify-write.  Adjust the counter value so that its value is
> @@ -455,6 +460,28 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>   bool emulate_overflow;
>   u8 fixed_ctr_ctrl;
>  
> + if (kvm_mediated_pmu_enabled(pmu_to_vcpu(pmu))) {
> + bool allowed = check_pmu_event_filter(pmc);
> +
> + if (pmc_is_gp(pmc)) {
> + if (allowed)
> + pmc->eventsel_hw |= pmc->eventsel &
> + 
> ARCH_PERFMON_EVENTSEL_ENABLE;
> + else
> + pmc->eventsel_hw &= 
> ~ARCH_PERFMON_EVENTSEL_ENABLE;
> + } else {
> + int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
> +
> + if (allowed)
> + pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;

Sean, just found there is a potential bug here.  The
"pmu->fixed_ctr_ctrl_hw" should not be assigned to "pmu->fixed_ctr_ctrl"
here, otherwise the other filtered fixed counter (not this allowed fixed
counter) could be enabled accidentally.

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index ba9d336f1d1d..f32e5f66f73b 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -473,7 +473,8 @@ static int reprogram_counter(struct kvm_pmc *pmc)
    int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;

    if (allowed)
-   pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
+   pmu->fixed_ctr_ctrl_hw |= pmu->fixed_ctr_ctrl &
+  
intel_fixed_bits_by_idx(idx, 0xf);
    else
    pmu->fixed_ctr_ctrl_hw &=
    ~intel_fixed_bits_by_idx(idx, 0xf);

> + else
> + pmu->fixed_ctr_ctrl_hw &=
> + ~intel_fixed_bits_by_idx(idx, 0xf);
> + }
> +
> + return 0;
> + }
> +
>   emulate_overflow = pmc_pause_counter(pmc);
>  
>   if (!pmc_event_is_allowed(pmc))
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 509c995b7871..6289f523d893 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -113,6 +113,9 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  {
>   u64 counter, enabled, running;
>  
> + if (kvm_mediated_pmu_enabled(pmc->vcpu))
> + return pmc->counter & pmc_bitmask(pmc);
> +
>   counter = pmc->counter + pmc->emulated_counter;
>  
>   if (pmc->perf_event && !pmc->is_paused)

Re: [PATCH v9 4/4] vhost: Add a KConfig knob to enable IOCTL VHOST_FORK_FROM_OWNER

2025-05-15 Thread Jason Wang

On Thu, May 15, 2025 at 2:14 PM Michael S. Tsirkin  wrote:
>
> On Wed, May 14, 2025 at 10:52:58AM +0800, Jason Wang wrote:
> > On Tue, May 13, 2025 at 3:09 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, May 13, 2025 at 12:08:51PM +0800, Jason Wang wrote:
> > > > On Wed, Apr 30, 2025 at 5:27 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Wed, Apr 30, 2025 at 11:34:49AM +0800, Jason Wang wrote:
> > > > > > On Tue, Apr 29, 2025 at 6:56 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, Apr 29, 2025 at 11:39:37AM +0800, Jason Wang wrote:
> > > > > > > > On Mon, Apr 21, 2025 at 11:46 AM Jason Wang 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Apr 21, 2025 at 11:45 AM Jason Wang 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Apr 21, 2025 at 10:45 AM Cindy Lu  
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Introduce a new config knob 
> > > > > > > > > > > `CONFIG_VHOST_ENABLE_FORK_OWNER_IOCTL`,
> > > > > > > > > > > to control the availability of the 
> > > > > > > > > > > `VHOST_FORK_FROM_OWNER` ioctl.
> > > > > > > > > > > When CONFIG_VHOST_ENABLE_FORK_OWNER_IOCTL is set to n, 
> > > > > > > > > > > the ioctl
> > > > > > > > > > > is disabled, and any attempt to use it will result in 
> > > > > > > > > > > failure.
> > > > > > > > > >
> > > > > > > > > > I think we need to describe why the default value was 
> > > > > > > > > > chosen to be false.
> > > > > > > > > >
> > > > > > > > > > What's more, should we document the implications here?
> > > > > > > > > >
> > > > > > > > > > inherit_owner was set to false: this means "legacy" 
> > > > > > > > > > userspace may
> > > > > > > > >
> > > > > > > > > I meant "true" actually.
> > > > > > > >
> > > > > > > > MIchael, I'd expect inherit_owner to be false. Otherwise legacy
> > > > > > > > applications need to be modified in order to get the behaviour
> > > > > > > > recovered which is an impossible taks.
> > > > > > > >
> > > > > > > > Any idea on this?
> > > > > > > >
> > > > > > > > Thanks
> > > > >
> > > > > So, let's say we had a modparam? Enough for this customer?
> > > > > WDYT?
> > > >
> > > > Just to make sure I understand the proposal.
> > > >
> > > > Did you mean a module parameter like "inherit_owner_by_default"? I
> > > > think it would be fine if we make it false by default.
> > > >
> > > > Thanks
> > >
> > > I think we should keep it true by default, changing the default
> > > risks regressing what we already fixes.
> >
> > I think it's not a regression since it comes since the day vhost is
> > introduced. To my understanding the real regression is the user space
> > noticeable behaviour changes introduced by vhost thread.
> >
> > > The specific customer can
> > > flip the modparam and be happy.
> >
> > If you stick to the false as default, I'm fine.
> >
> > Thanks
>
> That would be yet another behaviour change.

Back to the original behaviour.

> I think one was enough, don't you think?

I think such kind of change is unavoidable if we want to fix the
usersapce behaviour change.

Thanks

>
>
> > >
> > > > >
> > > > > --
> > > > > MST
> > > > >
> > >
>

Re: [PATCH 13/19] virtio_ring: introduce virtqueue ops

2025-05-15 Thread Jason Wang

On Wed, May 14, 2025 at 10:24 PM Michael S. Tsirkin  wrote:
>
> On Wed, May 14, 2025 at 10:19:05AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Apr 09, 2025 at 12:06:03PM +0800, Jason Wang wrote:
> > > On Tue, Apr 8, 2025 at 7:37 PM Michael S. Tsirkin  wrote:
> > > >
> > > > On Tue, Apr 08, 2025 at 03:02:35PM +0800, Jason Wang wrote:
> > > > > On Mon, Apr 7, 2025 at 4:20 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Mar 24, 2025 at 02:01:21PM +0800, Jason Wang wrote:
> > > > > > > This patch introduces virtqueue ops which is a set of the 
> > > > > > > callbacks
> > > > > > > that will be called for different queue layout or features. This 
> > > > > > > would
> > > > > > > help to avoid branches for split/packed and will ease the future
> > > > > > > implementation like in order.
> > > > > > >
> > > > > > > Signed-off-by: Jason Wang 
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > ---
> > > > > > >  drivers/virtio/virtio_ring.c | 96 
> > > > > > > +---
> > > > > > >  1 file changed, 67 insertions(+), 29 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > index a2884eae14d9..ce1dc90ee89d 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -159,9 +159,30 @@ struct vring_virtqueue_packed {
> > > > > > >   size_t event_size_in_bytes;
> > > > > > >  };
> > > > > > >
> > > > > > > +struct vring_virtqueue;
> > > > > > > +
> > > > > > > +struct virtqueue_ops {
> > > > > > > + int (*add)(struct vring_virtqueue *_vq, struct scatterlist 
> > > > > > > *sgs[],
> > > > > > > +unsigned int total_sg, unsigned int out_sgs,
> > > > > > > +unsigned int in_sgs, void *data,
> > > > > > > +void *ctx, bool premapped, gfp_t gfp);
> > > > > > > + void *(*get)(struct vring_virtqueue *vq, unsigned int *len, 
> > > > > > > void **ctx);
> > > > > > > + bool (*kick_prepare)(struct vring_virtqueue *vq);
> > > > > > > + void (*disable_cb)(struct vring_virtqueue *vq);
> > > > > > > + bool (*enable_cb_delayed)(struct vring_virtqueue *vq);
> > > > > > > + unsigned int (*enable_cb_prepare)(struct vring_virtqueue 
> > > > > > > *vq);
> > > > > > > + bool (*poll)(const struct vring_virtqueue *vq, u16 
> > > > > > > last_used_idx);
> > > > > > > + void *(*detach_unused_buf)(struct vring_virtqueue *vq);
> > > > > > > + bool (*more_used)(const struct vring_virtqueue *vq);
> > > > > > > + int (*resize)(struct vring_virtqueue *vq, u32 num);
> > > > > > > + void (*reset)(struct vring_virtqueue *vq);
> > > > > > > +};
> > > > > >
> > > > > > I like it that it's organized but
> > > > > > I worry about the overhead of indirect calls here.
> > > > >
> > > > > We can switch to use INDIRECT_CALL_X() here
> > > >
> > > > If you think it's cleaner.. but INDIRECT_CALL is all chained
> > >
> > > Yes, and it would be problematic as the number of ops increased.
> > >
> > > > while a switch can do a binary search.
> > > >
> > >
> > > Do you mean a nested switch?
> >
> > Not sure what is nested. gcc does a decent job of optimizing
> > switches. You have 4 types of ops:
> > packed/packed in order/split/split in order
> >
> > So:
> >
> > enum {
> >   VQ_SPLIT,
> >   VQ_SPLIT_IN_ORDER,
> >   VQ_PACKED,
> >   VQ_PACKED_IN_ORDER,
> > }
> >
> >
> > I do not see how it is worse?
> >
> >
>
>
>
> Actually, here is an idea - create an array of ops:
>
>
>
> enum vqtype {
> SPLIT,
> SPLIT_IN_ORDER,
> PACKED,
> PACKED_IN_ORDER,
> MAX
> };
>
>
> struct ops {
> int (*add)(int bar);
> };
>
> extern int packed(int);
> extern int packedinorder(int);
> extern int split(int);
> extern int splitinorder(int);
>
> const struct ops allops[MAX] = { [SPLIT] = {split}, [SPLIT_IN_ORDER] = { 
> splitinorder}, [PACKED] = { packed }, [PACKED_IN_ORDER] = {packedinorder}};
>
> int main(int argc, char **argv)
> {
> switch (argc) {
> case 0:
> return allops[PACKED].foo(argc);
> case 1:
> return allops[SPLIT].foo(argc);
> default:
> return allops[PACKED_IN_ORDER].foo(argc);

This still looks like an indirection call as we don't call the symbol
directly but need to load the function address into a register.

> }
> }
>
>
> I tested this and compiler is able to elide the indirect calls.

I've tried the following:

struct virtqueue_ops split_ops = {
.add = virtqueue_add_split,
.get = virtqueue_get_buf_ctx_split,
.kick_prepare = virtqueue_kick_prepare_split,
.disable_cb = virtqueue_disable_cb_split,
.enable_cb_delayed = virtqueue_enable_cb_delayed_split,
.enable_cb_prepare = virtqueue_enable_cb_prepare_split,
.poll = virtqueue

Fwd: [PATCH] selftests : timers : valid-adjtimex.c : Fixed style checks

2025-05-15 Thread rujra

fixed style checks according to Linux Kernel Coding Style standards.


1 : fixed alignment of parenthesis.
LOG : CHECK: Alignment should match open parenthesis
+   printf("ERROR: out of range value %ld actually set!\n",
+   tx.freq);

2 : fixed alignment of parenthesis.
LOG : CHECK: Alignment should match open parenthesis
+   printf("Error: No failure on invalid
ADJ_FREQUENCY %ld\n",
+   invalid_freq[i]);

3 : fixed line length of 106 to 100 and less.
LOG :  CHECK: line length of 106 exceeds 100 columns
+   printf("Invalid (sec: %ld  usec: %ld) did not fail! ",
tmx.time.tv_sec, tmx.time.tv_usec);

Signed-off-by: Rujra Bhatt 
---
 tools/testing/selftests/timers/valid-adjtimex.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/timers/valid-adjtimex.c
b/tools/testing/selftests/timers/valid-adjtimex.c
index 6b7801055ad1..5110f9ee285c 100644
--- a/tools/testing/selftests/timers/valid-adjtimex.c
+++ b/tools/testing/selftests/timers/valid-adjtimex.c
@@ -157,7 +157,7 @@ int validate_freq(void)
if (tx.freq == outofrange_freq[i]) {
printf("[FAIL]\n");
printf("ERROR: out of range value %ld actually set!\n",
-   tx.freq);
+  tx.freq);
pass = -1;
goto out;
}
@@ -172,7 +172,7 @@ int validate_freq(void)
if (ret >= 0) {
printf("[FAIL]\n");
printf("Error: No failure on invalid
ADJ_FREQUENCY %ld\n",
-   invalid_freq[i]);
+  invalid_freq[i]);
pass = -1;
goto out;
}
@@ -238,7 +238,8 @@ int set_bad_offset(long sec, long usec, int use_nano)
tmx.time.tv_usec = usec;
ret = clock_adjtime(CLOCK_REALTIME, &tmx);
if (ret >= 0) {
-   printf("Invalid (sec: %ld  usec: %ld) did not fail! ",
tmx.time.tv_sec, tmx.time.tv_usec);
+   printf("Invalid (sec: %ld  usec: %ld) did not fail! ",
+  tmx.time.tv_sec, tmx.time.tv_usec);
printf("[FAIL]\n");
return -1;
}
--
2.43.0

Re: [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs

On Thu, May 15, 2025, Dapeng Mi wrote:
> On 5/15/2025 8:41 AM, Sean Christopherson wrote:
> >> +  if (kvm_mediated_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) 
> >> &&
> > Just require the guest to have PERF_GLOBAL_CTRL, I don't see any reason to 
> > support
> > v1 PMUs.  It adds complexity and weirdness, and I can't imagine there's a 
> > use case.

I take that back, there absolutely are use cases, especially for AMD.  Any VM
shape that exists today should be compatible with the mediated PMU.  And I was
dead wrong about adding complexity; KVM already needs to intercept GLOBAL_CTRL
if the guest has fewer PMCs than hardware, so incorporating this check is all of
two lines of code.

Re: [PATCH] Revert "remoteproc: core: Clear table_sz when rproc_shutdown"

On Tue, May 13, 2025 at 04:52:46PM +0100, Bjorn Andersson wrote:
> Clearing the table_sz on cleanup seemed reasonable, but further
> discussions concluded that this merely working around the issue
> and that the fix is incomplete.
> 
> As such, revert commit efdde3d73ab2 ("remoteproc: core: Clear table_sz
> when rproc_shutdown") to avoid carrying a partial fix.
> 
> Signed-off-by: Bjorn Andersson 
> ---
>  drivers/remoteproc/remoteproc_core.c | 1 -
>  1 file changed, 1 deletion(-)
>

I have applied this patch.

Thanks,
Mathieu

> diff --git a/drivers/remoteproc/remoteproc_core.c 
> b/drivers/remoteproc/remoteproc_core.c
> index 
> 48d146e1fa560397c11eeb8f824ae0fb844a022b..81b2ccf988e852ac79cee375c7e3f118c2a4b41a
>  100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -2025,7 +2025,6 @@ int rproc_shutdown(struct rproc *rproc)
>   kfree(rproc->cached_table);
>   rproc->cached_table = NULL;
>   rproc->table_ptr = NULL;
> - rproc->table_sz = 0;
>  out:
>   mutex_unlock(&rproc->lock);
>   return ret;
> 
> ---
> base-commit: aa94665adc28f3fdc3de2979ac1e98bae961d6ca
> change-id: 20250513-revert-rproc-table-sz-53ecf24726ae
> 
> Best regards,
> -- 
> Bjorn Andersson 
>

Re: [PATCH 2/2] selftests/mm: skip hugevm test if kernel config file is not present.

2025-05-15 Thread Lorenzo Stoakes

On Thu, May 15, 2025 at 02:23:33PM -0400, Zi Yan wrote:
> When running hugevm tests in a machine without kernel config present, e.g.,
> a VM running a kernel without CONFIG_IKCONFIG_PROC nor /boot/config-*,
> skip hugevm tests, which reads kernel config to get page table level
> information.
>
> Signed-off-by: Zi Yan 

Looks generally reasonable to me, but I'm not so familiar with this so,

Acked-by: Lorenzo Stoakes 

> ---
>  .../selftests/mm/va_high_addr_switch.sh   | 26 +++
>  1 file changed, 9 insertions(+), 17 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/va_high_addr_switch.sh 
> b/tools/testing/selftests/mm/va_high_addr_switch.sh
> index 1f92e8caceac..325de53966b6 100755
> --- a/tools/testing/selftests/mm/va_high_addr_switch.sh
> +++ b/tools/testing/selftests/mm/va_high_addr_switch.sh
> @@ -7,23 +7,20 @@
>  # real test to check that the kernel is configured to support at least 5
>  # pagetable levels.
>
> -# 1 means the test failed
> -exitcode=1
> -
>  # Kselftest framework requirement - SKIP code is 4.
>  ksft_skip=4
>
> -fail()
> +skip()
>  {
>   echo "$1"
> - exit $exitcode
> + exit $ksft_skip
>  }
>
>  check_supported_x86_64()
>  {
>   local config="/proc/config.gz"
>   [[ -f "${config}" ]] || config="/boot/config-$(uname -r)"
> - [[ -f "${config}" ]] || fail "Cannot find kernel config in /proc or 
> /boot"
> + [[ -f "${config}" ]] || skip "Cannot find kernel config in /proc or 
> /boot"
>
>   # gzip -dcfq automatically handles both compressed and plaintext input.
>   # See man 1 gzip under '-f'.
> @@ -33,11 +30,9 @@ check_supported_x86_64()
>   else {print 1}; exit}' /proc/cpuinfo 2>/dev/null)
>
>   if [[ "${pg_table_levels}" -lt 5 ]]; then
> - echo "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to 
> run this test"
> - exit $ksft_skip
> + skip "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to 
> run this test"
>   elif [[ "${cpu_supports_pl5}" -ne 0 ]]; then
> - echo "$0: CPU does not have the necessary la57 flag to support 
> page table level 5"
> - exit $ksft_skip
> + skip "$0: CPU does not have the necessary la57 flag to support 
> page table level 5"
>   fi
>  }
>
> @@ -45,24 +40,21 @@ check_supported_ppc64()
>  {
>   local config="/proc/config.gz"
>   [[ -f "${config}" ]] || config="/boot/config-$(uname -r)"
> - [[ -f "${config}" ]] || fail "Cannot find kernel config in /proc or 
> /boot"
> + [[ -f "${config}" ]] || skip "Cannot find kernel config in /proc or 
> /boot"
>
>   local pg_table_levels=$(gzip -dcfq "${config}" | grep PGTABLE_LEVELS | 
> cut -d'=' -f 2)
>   if [[ "${pg_table_levels}" -lt 5 ]]; then
> - echo "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to 
> run this test"
> - exit $ksft_skip
> + skip "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to 
> run this test"
>   fi
>
>   local mmu_support=$(grep -m1 "mmu" /proc/cpuinfo | awk '{print $3}')
>   if [[ "$mmu_support" != "radix" ]]; then
> - echo "$0: System does not use Radix MMU, required for 5-level 
> paging"
> - exit $ksft_skip
> + skip "$0: System does not use Radix MMU, required for 5-level 
> paging"
>   fi
>
>   local hugepages_total=$(awk '/HugePages_Total/ {print $2}' 
> /proc/meminfo)
>   if [[ "${hugepages_total}" -eq 0 ]]; then
> - echo "$0: HugePages are not enabled, required for some tests"
> - exit $ksft_skip
> + skip "$0: HugePages are not enabled, required for some tests"
>   fi
>  }
>
> --
> 2.47.2
>

Re: [PATCH 1/2] selftests/mm: skip uffd tests in madv_guard if uffd is not present.

On 15 May 2025, at 14:41, Lorenzo Stoakes wrote:

> Ah you got to this first :) thanks!
>
> Could you do this with a cover letter though? It's really weird to have 2/2
> reply to 1/2, I know sometimes people do that, but it's just odd, and it'd be
> good to have an overview, thanks!
>
> On Thu, May 15, 2025 at 02:23:32PM -0400, Zi Yan wrote:
>> When userfaultfd is not compiled into kernel, userfaultfd() returns -1,
>> causing uffd tests in madv_guard fail. Skip the tests instead.
>
> 'madv_guard'? I'd just say the guard_regions.uffd test to fail.

Sure. Will change it.
>
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  tools/testing/selftests/mm/guard-regions.c | 17 +++--
>>  1 file changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/testing/selftests/mm/guard-regions.c 
>> b/tools/testing/selftests/mm/guard-regions.c
>> index 0cd9d236649d..93af3d3760f9 100644
>> --- a/tools/testing/selftests/mm/guard-regions.c
>> +++ b/tools/testing/selftests/mm/guard-regions.c
>> @@ -1453,8 +1453,21 @@ TEST_F(guard_regions, uffd)
>>
>>  /* Set up uffd. */
>>  uffd = userfaultfd(0);
>> -if (uffd == -1 && errno == EPERM)
>> -ksft_exit_skip("No userfaultfd permissions, try running as 
>> root.\n");
>
> Let's just make this all part of the same switch please!

What do you mean? EPERM is handled in the switch-case below.

>
> And while I originally used ksft_exit_skip(), I think we can just use the
> SKIP(return, ...) form here just fine to keep it consistent.

Right. I am using SKIP below, since when I ran it, ksft_exit_skip()
makes the whole test message inconsistent.

>
>> +if (uffd == -1) {
>> +switch (errno) {
>> +case EPERM:
>> +SKIP(return, "No userfaultfd permissions, try running 
>> as root.");
>> +break;
>> +case ENOSYS:
>> +SKIP(return, "userfaultfd is not supported/not 
>> enabled.");
>> +break;
>> +default:
>> +ksft_exit_fail_msg("userfaultfd failed with %s\n",
>> +   strerror(errno));
>> +break;
>> +}
>> +}
>> +
>>  ASSERT_NE(uffd, -1);
>>
>>  ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
>> --
>> 2.47.2
>>
>
> Thanks!


--
Best Regards,
Yan, Zi

Re: [PATCH 2/2] selftests/mm: skip hugevm test if kernel config file is not present.

2025-05-15 Thread Pedro Falcato

On Thu, May 15, 2025 at 07:43:40PM +0100, Lorenzo Stoakes wrote:
> On Thu, May 15, 2025 at 02:23:33PM -0400, Zi Yan wrote:
> > When running hugevm tests in a machine without kernel config present, e.g.,
> > a VM running a kernel without CONFIG_IKCONFIG_PROC nor /boot/config-*,
> > skip hugevm tests, which reads kernel config to get page table level
> > information.
> >
> > Signed-off-by: Zi Yan 
> 
> Looks generally reasonable to me, but I'm not so familiar with this so,
> 
> Acked-by: Lorenzo Stoakes 

Acked-by: Pedro Falcato 

Same here. Although, despite this being worth patching, I do think we should
document somewhere what the expectations are for mm selftests (in terms of
kernel config, environment, libc, possibly even utilities present).

-- 
Pedro

Re: [PATCH v4 05/38] perf: Add generic exclude_guest support

2025-05-15 Thread Liang, Kan




On 2025-05-15 3:25 p.m., Sean Christopherson wrote:
> On Thu, May 15, 2025, Kan Liang wrote:
>> On 2025-05-14 7:19 p.m., Sean Christopherson wrote:
 This naming is confusing on purpose? Pick either guest/host and stick
 with it.
>>>
>>> +1.  I also think the inner perf_host_{enter,exit}() helpers are superflous.
>>> These flows
>>>
>>> After a bit of hacking, and with a few spoilers, this is what I ended up 
>>> with
>>> (not anywhere near fully tested).  I like following KVM's 
>>> kvm_xxx_{load,put}()
>>> nomenclature to tie everything together, so I went with "guest" instead of 
>>> "host"
>>> even though the majority of work being down is to shedule out/in host 
>>> context.
>>>
>>> /* When loading a guest's mediated PMU, schedule out all exclude_guest 
>>> events. */
>>> void perf_load_guest_context(unsigned long data)
>>> {
>>> struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>
>>> lockdep_assert_irqs_disabled();
>>>
>>> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>>
>>> if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
>>> goto unlock;
>>>
>>> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>> ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
>>> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> if (cpuctx->task_ctx) {
>>> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>>> task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>>> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>> }
>>>
>>> arch_perf_load_guest_context(data);
>>>
>>> __this_cpu_write(guest_ctx_loaded, true);
>>>
>>> unlock:
>>> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>> }
>>> EXPORT_SYMBOL_GPL(perf_load_guest_context);
>>>
>>> void perf_put_guest_context(void)
>>> {
>>> struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>
>>> lockdep_assert_irqs_disabled();
>>>
>>> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>>
>>> if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
>>> goto unlock;
>>>
>>> arch_perf_put_guest_context();
>>
>> It will set the guest_ctx_loaded to false.
>> The update_context_time() invoked in the perf_event_sched_in() will not
>> get a chance to update the guest time.
> 
> The guest_ctx_loaded in arch/x86/events/core.c is a different variable, it 
> just
> happens to have the same name.
>

Ah, I thought it was the same variable. Then there should be no issue
with the guest time.

But the same variable name may bring confusions. Maybe add a x86_pmu/x86
prefix to the variable in x86 code.
> It's completely gross, but exposing guest_ctx_loaded outside of 
> kernel/events/core.c
> didn't seem like a great alternative.  If we wanted to use a single variable,
> then the writes in arch_perf_{load,put}_guest_context() can simply go away.
> 
Either two variables or a single variable is fine with me, as long as
they can be easily distinguished.

But I think we should put arch_perf_{load,put}_guest_context() and
guest_ctx_loaded between the perf_ctx_disable/enable() pair.
Perf should only update the state when PMU is completely disabled.
It matches the way the rest of the perf code does.
It could also avoid some potential issues, e.g., the state is not
updated completely, but the counter has already been fired.

Thanks,
Kan

Re: [PATCH v2 1/3] rpmsg: char: Reuse eptdev logic for anon device

Hi,

On Fri, May 09, 2025 at 11:59:25PM +0800, Dawei Li wrote:
> Current uAPI implementation for rpmsg ctrl & char device manipulation is
> abstracted in procedures below:
> 
> Current uAPI implementation for rpmsg ctrl & char device manipulation is
> abstracted in procedures below:
> - fd = open("/dev/rpmsg_ctrlX")
> - ioctl(fd, RPMSG_CREATE_EPT_IOCTL, &info); /dev/rpmsgY devnode is
>   generated.
> - fd_ep = open("/dev/rpmsgY", O_RDWR)
> - operations on fd_ep(write, read, poll ioctl)
> - ioctl(fd_ep, RPMSG_DESTROY_EPT_IOCTL)
> - close(fd_ep)
> - close(fd)
> 
> This /dev/rpmsgY abstraction is less favorable for:
> - Performance issue: It's time consuming for some operations are
> invovled:
>   - Device node creation.
> Depends on specific config, especially CONFIG_DEVTMPFS, the overall
> overhead is based on coordination between DEVTMPFS and userspace
> tools such as udev and mdev.
> 
>   - Extra kernel-space switch cost.
> 
>   - Other major costs brought by heavy-weight logic like device_add().
> 
> - /dev/rpmsgY node can be opened only once. It doesn't make much sense
> that a dynamically created device node can be opened only once.
> 
> - For some container application such as docker, a client can't access
>   host's dev unless specified explicitly. But in case of /dev/rpmsgY, which
>   is generated dynamically and whose existence is unknown for clients in
>   advance, this uAPI based on device node doesn't fit well.
> 
> An anon inode based approach is introduced to address the issues above.
> Rather than generating device node and opening it, rpmsg code just make
> a anon inode representing eptdev and return the fd to userspace.

Please change occurences of "anon" for "anonyous".  "Anon" was used to avoid
writing "anonymous" all the time in the code, but description should not use the
shorthand.  In the title, this patch and all other patches.

> 
> The legacy abstraction based on struct dev and struct cdev is honored
> for:
> - Avoid legacy uAPI break(RPMSG_CREATE_EPT_IOCTL)
> - Reuse existing logic:
>   -- dev_err() and friends.
>   -- Life cycle management of struct device.
> 
> Signed-off-by: Dawei Li 
> ---
>  drivers/rpmsg/rpmsg_char.c | 80 ++
>  1 file changed, 56 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/rpmsg/rpmsg_char.c b/drivers/rpmsg/rpmsg_char.c
> index eec7642d2686..5b2a883d6236 100644
> --- a/drivers/rpmsg/rpmsg_char.c
> +++ b/drivers/rpmsg/rpmsg_char.c
> @@ -91,7 +91,8 @@ int rpmsg_chrdev_eptdev_destroy(struct device *dev, void 
> *data)
>   /* wake up any blocked readers */
>   wake_up_interruptible(&eptdev->readq);
>  
> - cdev_device_del(&eptdev->cdev, &eptdev->dev);
> + if (eptdev->dev.devt)
> + cdev_device_del(&eptdev->cdev, &eptdev->dev);
>   put_device(&eptdev->dev);
>  
>   return 0;
> @@ -132,21 +133,17 @@ static int rpmsg_ept_flow_cb(struct rpmsg_device 
> *rpdev, void *priv, bool enable
>   return 0;
>  }
>  
> -static int rpmsg_eptdev_open(struct inode *inode, struct file *filp)
> +static int __rpmsg_eptdev_open(struct rpmsg_eptdev *eptdev)
>  {
> - struct rpmsg_eptdev *eptdev = cdev_to_eptdev(inode->i_cdev);
>   struct rpmsg_endpoint *ept;
>   struct rpmsg_device *rpdev = eptdev->rpdev;
>   struct device *dev = &eptdev->dev;
>  
> - mutex_lock(&eptdev->ept_lock);
>   if (eptdev->ept) {
> - mutex_unlock(&eptdev->ept_lock);
>   return -EBUSY;
>   }
>  
>   if (!eptdev->rpdev) {
> - mutex_unlock(&eptdev->ept_lock);
>   return -ENETRESET;
>   }
>  
> @@ -164,21 +161,32 @@ static int rpmsg_eptdev_open(struct inode *inode, 
> struct file *filp)
>   if (!ept) {
>   dev_err(dev, "failed to open %s\n", eptdev->chinfo.name);
>   put_device(dev);
> - mutex_unlock(&eptdev->ept_lock);
>   return -EINVAL;
>   }
>  
>   ept->flow_cb = rpmsg_ept_flow_cb;
>   eptdev->ept = ept;
> - filp->private_data = eptdev;
> - mutex_unlock(&eptdev->ept_lock);
>  
>   return 0;
>  }
>  
> -static int rpmsg_eptdev_release(struct inode *inode, struct file *filp)
> +static int rpmsg_eptdev_open(struct inode *inode, struct file *filp)
>  {
>   struct rpmsg_eptdev *eptdev = cdev_to_eptdev(inode->i_cdev);
> + int ret;
> +
> + mutex_lock(&eptdev->ept_lock);
> + ret = __rpmsg_eptdev_open(eptdev);
> + if (!ret)
> + filp->private_data = eptdev;
> + mutex_unlock(&eptdev->ept_lock);
> +
> + return ret;
> +}
> +
> +static int rpmsg_eptdev_release(struct inode *inode, struct file *filp)
> +{
> + struct rpmsg_eptdev *eptdev = filp->private_data;
>   struct device *dev = &eptdev->dev;
>  
>   /* Close the endpoint, if it's not already destroyed by the parent */
> @@ -400,12 +408,13 @@ static void rpmsg_eptdev_release_device(struct device 
> *dev)
>   struct rpmsg_eptdev *eptdev = dev_to_eptdev(dev);

Re: [PATCH 5/6] x86/paravirt: Switch MSR access pv_ops functions to instruction interfaces

2025-05-15 Thread H. Peter Anvin


On 5/15/25 00:32, Xin Li wrote:


Hi Juergen,

I have some update on this thread while working on it.

If we continue down the path of maintaining pvops MSR APIs as this patch
series does, it seems we’ll need to duplicate the ALTERNATIVE code in
three different places.

1) The MSR access primitives defined in , which is used when
    CONFIG_PARAVIRT=n.

2) The pvops native MSR functions pv_native_{rd,wr}msr{,_safe}() defined
    in arch/x86/kernel/paravirt.c, used when CONFIG_PARAVIRT=y on bare
    metal.

3) The pvops Xen MSR functions paravirt_{read,write}_msr{,_safe}()
    defined in , used when CONFIG_PARAVIRT_XXL=y.

hpa had mentioned to me earlier that this would be a maintenance burden
— something I only truly realized once I got hands-on with it.

Maybe you have something in mind to address it?

Also add PeterZ to the To list because he cares it.



Having the code being duplicated is definitely not a good thing; 
although I'm not one of the x86 maintainers anymore, I would consider it 
a strong reason to NAK such a patchset.


At one point I was considering augmenting the alternatives framework to 
be able to call an ad hoc subroutine to generate the code. It would be 
useful in cases like this, where if PV is enabled it can make a callout 
to the currently-active PV code to query the desired code to be output.


There are 16 unused bits in the alternatives table (not counting the 14 
unused flag bits), which could be used for an enumeration of such 
subroutines, optionally split into 8 bits of function enumeration and 8 
bits of private data. In this case, the "replacement" pointer becomes 
available as a private pointer; possibly to a metadata structure used by 
the subroutine.


This could also be used to significantly enhance the static-immediate 
framework, by being able to have explicit code which handles the 
transformations instead of needing to rely on assembly hacks. That way 
we might even be able to do that kind of transformations for any 
ro_after_init value.


I think the biggest concern is how this would affect objtool, since 
objtool would now not have any kind of direct visibility into the 
possibly generated code. How to best feed the information objtool needs 
to it would be my biggest question (in part because I don't know what 
objtool would actually need.)


-hpa

Re: [PATCH v2 2/3] rpmsg: char: Implement eptdev based on anon inode

On Fri, May 09, 2025 at 11:59:26PM +0800, Dawei Li wrote:
> Introduce new eptdev abstraction based on anon inode. The new API is
> exactly same with legacy one except:
> 
> - It's anonymous and devnode/path free.
> - Its fops->open() is empty.
> 
> Signed-off-by: Dawei Li 
> ---
>  drivers/rpmsg/rpmsg_char.c | 44 ++
>  drivers/rpmsg/rpmsg_char.h | 19 
>  2 files changed, 63 insertions(+)
> 
> diff --git a/drivers/rpmsg/rpmsg_char.c b/drivers/rpmsg/rpmsg_char.c
> index 5b2a883d6236..b0ec05f88013 100644
> --- a/drivers/rpmsg/rpmsg_char.c
> +++ b/drivers/rpmsg/rpmsg_char.c
> @@ -13,6 +13,7 @@
>  
>  #define pr_fmt(fmt)  KBUILD_MODNAME ": " fmt
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -517,6 +518,49 @@ int rpmsg_chrdev_eptdev_create(struct rpmsg_device 
> *rpdev, struct device *parent
>  }
>  EXPORT_SYMBOL(rpmsg_chrdev_eptdev_create);
>  
> +static const struct file_operations rpmsg_eptdev_anon_fops = {
> + .owner = THIS_MODULE,
> + .release = rpmsg_eptdev_release,
> + .read_iter = rpmsg_eptdev_read_iter,
> + .write_iter = rpmsg_eptdev_write_iter,
> + .poll = rpmsg_eptdev_poll,
> + .unlocked_ioctl = rpmsg_eptdev_ioctl,
> + .compat_ioctl = compat_ptr_ioctl,
> +};
> +
> +int rpmsg_eptdev_create(struct rpmsg_device *rpdev, struct device *parent,
> + struct rpmsg_channel_info chinfo, int *pfd)

rpmsg_anonymous_eptdev_create()

> +{
> + struct rpmsg_eptdev *eptdev;
> + int ret, fd;
> +
> + eptdev = __rpmsg_chrdev_eptdev_alloc(rpdev, parent, false);
> + if (IS_ERR(eptdev))
> + return PTR_ERR(eptdev);
> +
> + ret =  __rpmsg_chrdev_eptdev_add(eptdev, chinfo, false);
> + if (ret) {
> + dev_err(&eptdev->dev, "failed to add %s\n", 
> eptdev->chinfo.name);
> + return ret;
> + }
> +
> + fd = anon_inode_getfd("rpmsg-eptdev", &rpmsg_eptdev_anon_fops, eptdev, 
> O_RDWR | O_CLOEXEC);
> + if (fd < 0) {
> + put_device(&eptdev->dev);
> + return fd;
> + }
> +
> + mutex_lock(&eptdev->ept_lock);
> + ret = __rpmsg_eptdev_open(eptdev);
> + mutex_unlock(&eptdev->ept_lock);
> +
> + if (!ret)
> + *pfd = fd;
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(rpmsg_eptdev_create);
> +
>  static int rpmsg_chrdev_probe(struct rpmsg_device *rpdev)
>  {
>   struct rpmsg_channel_info chinfo;
> diff --git a/drivers/rpmsg/rpmsg_char.h b/drivers/rpmsg/rpmsg_char.h
> index 117d9cbc52f0..8cc2c14537da 100644
> --- a/drivers/rpmsg/rpmsg_char.h
> +++ b/drivers/rpmsg/rpmsg_char.h
> @@ -19,6 +19,19 @@
>  int rpmsg_chrdev_eptdev_create(struct rpmsg_device *rpdev, struct device 
> *parent,
>  struct rpmsg_channel_info chinfo);
>  
> +/**
> + * rpmsg_eptdev_create() - register ep device and its associated fd based on 
> an endpoint
> + * @rpdev:  prepared rpdev to be used for creating endpoints
> + * @parent: parent device
> + * @chinfo: associated endpoint channel information.
> + * @pfd: fd in represent of endpoint device
> + *
> + * This function create a new rpmsg endpoint device and its associated fd to 
> instantiate a new
> + * endpoint based on chinfo information.
> + */
> +int rpmsg_eptdev_create(struct rpmsg_device *rpdev, struct device *parent,
> + struct rpmsg_channel_info chinfo, int *pfd);
> +
>  /**
>   * rpmsg_chrdev_eptdev_destroy() - destroy created char device endpoint.
>   * @data: private data associated to the endpoint device
> @@ -36,6 +49,12 @@ static inline int rpmsg_chrdev_eptdev_create(struct 
> rpmsg_device *rpdev, struct
>   return -ENXIO;
>  }
>  
> +static inline int rpmsg_eptdev_create(struct rpmsg_device *rpdev, struct 
> device *parent,
> +   struct rpmsg_channel_info chinfo, int 
> *pfd)
> +{
> + return -ENXIO;
> +}
> +
>  static inline int rpmsg_chrdev_eptdev_destroy(struct device *dev, void *data)
>  {
>   return -ENXIO;
> -- 
> 2.25.1
>

[PATCH v3 2/6] dt-bindings: soc: qcom: add qcom,qcs615-imem compatible

Document qcom,qcs615-imem compatible. It has a child node for debugging
purposes.

Acked-by: Krzysztof Kozlowski 
Signed-off-by: Lijuan Gao 
---
 Documentation/devicetree/bindings/sram/qcom,imem.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/sram/qcom,imem.yaml 
b/Documentation/devicetree/bindings/sram/qcom,imem.yaml
index 
2711f90d9664b70fcd1e2f7e2dfd3386ed5c1952..dc3b5a69b9254e7001a329bb2011305152316689
 100644
--- a/Documentation/devicetree/bindings/sram/qcom,imem.yaml
+++ b/Documentation/devicetree/bindings/sram/qcom,imem.yaml
@@ -22,6 +22,7 @@ properties:
   - qcom,msm8974-imem
   - qcom,msm8976-imem
   - qcom,qcs404-imem
+  - qcom,qcs615-imem
   - qcom,qcs8300-imem
   - qcom,qdu1000-imem
   - qcom,sa8775p-imem

-- 
2.34.1

[PATCH v3 3/6] arm64: dts: qcom: qcs615: Add mproc node for SEMP2P

From: Kyle Deng 

The Shared Memory Point to Point (SMP2P) protocol facilitates
communication of a single 32-bit value between two processors.
Add these two nodes for remoteproc enablement on QCS615 SoC.

Signed-off-by: Kyle Deng 
Signed-off-by: Lijuan Gao 
---
 arch/arm64/boot/dts/qcom/qcs615.dtsi | 44 
 1 file changed, 44 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/qcs615.dtsi 
b/arch/arm64/boot/dts/qcom/qcs615.dtsi
index 
f08ba09772f333adaade2d30897cf50c555a4d12..f922349758d11ec7fda1c43736a4bf290916e67f
 100644
--- a/arch/arm64/boot/dts/qcom/qcs615.dtsi
+++ b/arch/arm64/boot/dts/qcom/qcs615.dtsi
@@ -332,6 +332,50 @@ mc_virt: interconnect-2 {
qcom,bcm-voters = <&apps_bcm_voter>;
};
 
+   smp2p-adsp {
+   compatible = "qcom,smp2p";
+   qcom,smem = <443>, <429>;
+   interrupts = ;
+   /* On this platform, bit 26 (normally SLPI) is repurposed for 
ADSP */
+   mboxes = <&apss_shared 26>;
+
+   qcom,local-pid = <0>;
+   qcom,remote-pid = <2>;
+
+   adsp_smp2p_out: master-kernel {
+   qcom,entry-name = "master-kernel";
+   #qcom,smem-state-cells = <1>;
+   };
+
+   adsp_smp2p_in: slave-kernel {
+   qcom,entry-name = "slave-kernel";
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   };
+   };
+
+   smp2p-cdsp {
+   compatible = "qcom,smp2p";
+   qcom,smem = <94>, <432>;
+   interrupts = ;
+   mboxes = <&apss_shared 6>;
+
+   qcom,local-pid = <0>;
+   qcom,remote-pid = <5>;
+
+   cdsp_smp2p_out: master-kernel {
+   qcom,entry-name = "master-kernel";
+   #qcom,smem-state-cells = <1>;
+   };
+
+   cdsp_smp2p_in: slave-kernel {
+   qcom,entry-name = "slave-kernel";
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   };
+
+   };
+
qup_opp_table: opp-table-qup {
compatible = "operating-points-v2";
opp-shared;

-- 
2.34.1

[PATCH v3 5/6] arm64: dts: qcom: qcs615: add ADSP and CDSP nodes

Add nodes for remoteprocs: ADSP and CDSP for QCS615 SoC to enable proper
remoteproc functionality.

Reviewed-by: Konrad Dybcio 
Signed-off-by: Lijuan Gao 
---
 arch/arm64/boot/dts/qcom/qcs615.dtsi | 86 
 1 file changed, 86 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/qcs615.dtsi 
b/arch/arm64/boot/dts/qcom/qcs615.dtsi
index 
dd54cfe7b7a6f03c1aa658ce3014d50478df5931..b7d649822a35fd30366ecab2eeb7b2fced1b369d
 100644
--- a/arch/arm64/boot/dts/qcom/qcs615.dtsi
+++ b/arch/arm64/boot/dts/qcom/qcs615.dtsi
@@ -473,6 +473,16 @@ smem_region: smem@8600 {
no-map;
hwlocks = <&tcsr_mutex 3>;
};
+
+   rproc_cdsp_mem: rproc-cdsp@93b0 {
+   reg = <0x0 0x93b0 0x0 0x1e0>;
+   no-map;
+   };
+
+   rproc_adsp_mem: rproc-adsp@9590 {
+   reg = <0x0 0x9590 0x0 0x1e0>;
+   no-map;
+   };
};
 
soc: soc@0 {
@@ -3117,6 +3127,44 @@ cti@790 {
clock-names = "apb_pclk";
};
 
+   remoteproc_cdsp: remoteproc@830 {
+   compatible = "qcom,qcs615-cdsp-pas", 
"qcom,sm8150-cdsp-pas";
+   reg = <0x0 0x0830 0x0 0x4040>;
+
+   interrupts-extended = <&intc GIC_SPI 578 
IRQ_TYPE_EDGE_RISING>,
+ <&cdsp_smp2p_in 0 
IRQ_TYPE_EDGE_RISING>,
+ <&cdsp_smp2p_in 1 
IRQ_TYPE_EDGE_RISING>,
+ <&cdsp_smp2p_in 2 
IRQ_TYPE_EDGE_RISING>,
+ <&cdsp_smp2p_in 3 
IRQ_TYPE_EDGE_RISING>;
+   interrupt-names = "wdog",
+ "fatal",
+ "ready",
+ "handover",
+ "stop-ack";
+
+   clocks = <&rpmhcc RPMH_CXO_CLK>;
+   clock-names = "xo";
+
+   power-domains = <&rpmhpd RPMHPD_CX>;
+   power-domain-names = "cx";
+
+   memory-region = <&rproc_cdsp_mem>;
+
+   qcom,qmp = <&aoss_qmp>;
+
+   qcom,smem-states = <&cdsp_smp2p_out 0>;
+   qcom,smem-state-names = "stop";
+
+   status = "disabled";
+
+   glink-edge {
+   interrupts = ;
+   mboxes = <&apss_shared 4>;
+   label = "cdsp";
+   qcom,remote-pid = <5>;
+   };
+   };
+
pmu@90b6300 {
compatible = "qcom,qcs615-cpu-bwmon", 
"qcom,sdm845-bwmon";
reg = <0x0 0x090b6300 0x0 0x600>;
@@ -3751,6 +3799,44 @@ usb_2_dwc3: usb@a80 {
maximum-speed = "high-speed";
};
};
+
+   remoteproc_adsp: remoteproc@6240 {
+   compatible = "qcom,qcs615-adsp-pas", 
"qcom,sm8150-adsp-pas";
+   reg = <0x0 0x6240 0x0 0x4040>;
+
+   interrupts-extended = <&intc GIC_SPI 162 
IRQ_TYPE_EDGE_RISING>,
+ <&adsp_smp2p_in 0 
IRQ_TYPE_EDGE_RISING>,
+ <&adsp_smp2p_in 1 
IRQ_TYPE_EDGE_RISING>,
+ <&adsp_smp2p_in 2 
IRQ_TYPE_EDGE_RISING>,
+ <&adsp_smp2p_in 3 
IRQ_TYPE_EDGE_RISING>;
+   interrupt-names = "wdog",
+ "fatal",
+ "ready",
+ "handover",
+ "stop-ack";
+
+   clocks = <&rpmhcc RPMH_CXO_CLK>;
+   clock-names = "xo";
+
+   power-domains = <&rpmhpd RPMHPD_CX>;
+   power-domain-names = "cx";
+
+   memory-region = <&rproc_adsp_mem>;
+
+   qcom,qmp = <&aoss_qmp>;
+
+   qcom,smem-states = <&adsp_smp2p_out 0>;
+   qcom,smem-state-names = "stop";
+
+   status = "disabled";
+
+   glink_edge: glink-edge {
+   interrupts = ;
+   mboxes = <&apss_shared 24>;
+   label = "lpass";
+   qcom,remote-pid = <2>;
+   };
+   };
};
 
arch_timer: timer {

-- 
2.34.1

[PATCH v3 4/6] arm64: dts: qcom: qcs615: Add IMEM and PIL info region

Add a simple-mfd representing IMEM on QCS615 and define the PIL
relocation info region as its child. The PIL region in IMEM is used to
communicate load addresses of remoteproc to post mortem debug tools, so
that these tools can collect ramdumps.

Signed-off-by: Lijuan Gao 
---
 arch/arm64/boot/dts/qcom/qcs615.dtsi | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/qcs615.dtsi 
b/arch/arm64/boot/dts/qcom/qcs615.dtsi
index 
f922349758d11ec7fda1c43736a4bf290916e67f..dd54cfe7b7a6f03c1aa658ce3014d50478df5931
 100644
--- a/arch/arm64/boot/dts/qcom/qcs615.dtsi
+++ b/arch/arm64/boot/dts/qcom/qcs615.dtsi
@@ -3290,6 +3290,20 @@ sram@c3f {
reg = <0x0 0x0c3f 0x0 0x400>;
};
 
+   sram@146aa000 {
+   compatible = "qcom,qcs615-imem", "syscon", "simple-mfd";
+   reg = <0x0 0x1468 0x0 0x2c000>;
+   ranges = <0 0 0x1468 0x2c000>;
+
+   #address-cells = <1>;
+   #size-cells = <1>;
+
+   pil-reloc@2a94c {
+   compatible = "qcom,pil-reloc-info";
+   reg = <0x2a94c 0xc8>;
+   };
+   };
+
apps_smmu: iommu@1500 {
compatible = "qcom,qcs615-smmu-500", "qcom,smmu-500", 
"arm,mmu-500";
reg = <0x0 0x1500 0x0 0x8>;

-- 
2.34.1

[PATCH v3 6/6] arm64: dts: qcom: qcs615-ride: enable remoteprocs

Enable all remoteproc nodes on the qcs615-ride board and point to the
appropriate firmware files to allow proper functioning of the remote
processors.

Reviewed-by: Konrad Dybcio 
Signed-off-by: Lijuan Gao 
---
 arch/arm64/boot/dts/qcom/qcs615-ride.dts | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/qcs615-ride.dts 
b/arch/arm64/boot/dts/qcom/qcs615-ride.dts
index 
2b5aa3c66867676bda59ff82b902b6e4974126f8..a6652e4817d1c218c7981b04daeb035e2852ac1a
 100644
--- a/arch/arm64/boot/dts/qcom/qcs615-ride.dts
+++ b/arch/arm64/boot/dts/qcom/qcs615-ride.dts
@@ -240,6 +240,18 @@ &qupv3_id_0 {
status = "okay";
 };
 
+&remoteproc_adsp {
+   firmware-name = "qcom/qcs615/adsp.mbn";
+
+   status = "okay";
+};
+
+&remoteproc_cdsp {
+   firmware-name = "qcom/qcs615/cdsp.mbn";
+
+   status = "okay";
+};
+
 &rpmhcc {
clocks = <&xo_board_clk>;
 };

-- 
2.34.1

[PATCH v3 1/6] dt-bindings: remoteproc: qcom,sm8150-pas: Document QCS615 remoteproc

Document the components used to boot the ADSP and CDSP on the Qualcomm
QCS615 SoC. Use fallback to indicate the compatibility of the remoteproc
on the QCS615 with that on the SM8150.

Reviewed-by: Krzysztof Kozlowski 
Signed-off-by: Lijuan Gao 
---
 .../bindings/remoteproc/qcom,sm8150-pas.yaml   | 65 +-
 1 file changed, 38 insertions(+), 27 deletions(-)

diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,sm8150-pas.yaml 
b/Documentation/devicetree/bindings/remoteproc/qcom,sm8150-pas.yaml
index 
5dcc2a32c080049ac6c486614a5bd4d71fd3ed62..a8cddf7e2fe1a84064730d847d2f0601b67572ff
 100644
--- a/Documentation/devicetree/bindings/remoteproc/qcom,sm8150-pas.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/qcom,sm8150-pas.yaml
@@ -15,17 +15,26 @@ description:
 
 properties:
   compatible:
-enum:
-  - qcom,sc8180x-adsp-pas
-  - qcom,sc8180x-cdsp-pas
-  - qcom,sc8180x-slpi-pas
-  - qcom,sm8150-adsp-pas
-  - qcom,sm8150-cdsp-pas
-  - qcom,sm8150-mpss-pas
-  - qcom,sm8150-slpi-pas
-  - qcom,sm8250-adsp-pas
-  - qcom,sm8250-cdsp-pas
-  - qcom,sm8250-slpi-pas
+oneOf:
+  - items:
+  - enum:
+  - qcom,qcs615-adsp-pas
+  - const: qcom,sm8150-adsp-pas
+  - items:
+  - enum:
+  - qcom,qcs615-cdsp-pas
+  - const: qcom,sm8150-cdsp-pas
+  - enum:
+  - qcom,sc8180x-adsp-pas
+  - qcom,sc8180x-cdsp-pas
+  - qcom,sc8180x-slpi-pas
+  - qcom,sm8150-adsp-pas
+  - qcom,sm8150-cdsp-pas
+  - qcom,sm8150-mpss-pas
+  - qcom,sm8150-slpi-pas
+  - qcom,sm8250-adsp-pas
+  - qcom,sm8250-cdsp-pas
+  - qcom,sm8250-slpi-pas
 
   reg:
 maxItems: 1
@@ -62,16 +71,17 @@ allOf:
   - if:
   properties:
 compatible:
-  enum:
-- qcom,sc8180x-adsp-pas
-- qcom,sc8180x-cdsp-pas
-- qcom,sc8180x-slpi-pas
-- qcom,sm8150-adsp-pas
-- qcom,sm8150-cdsp-pas
-- qcom,sm8150-slpi-pas
-- qcom,sm8250-adsp-pas
-- qcom,sm8250-cdsp-pas
-- qcom,sm8250-slpi-pas
+  contains:
+enum:
+  - qcom,sc8180x-adsp-pas
+  - qcom,sc8180x-cdsp-pas
+  - qcom,sc8180x-slpi-pas
+  - qcom,sm8150-adsp-pas
+  - qcom,sm8150-cdsp-pas
+  - qcom,sm8150-slpi-pas
+  - qcom,sm8250-adsp-pas
+  - qcom,sm8250-cdsp-pas
+  - qcom,sm8250-slpi-pas
 then:
   properties:
 interrupts:
@@ -88,12 +98,13 @@ allOf:
   - if:
   properties:
 compatible:
-  enum:
-- qcom,sc8180x-adsp-pas
-- qcom,sc8180x-cdsp-pas
-- qcom,sm8150-adsp-pas
-- qcom,sm8150-cdsp-pas
-- qcom,sm8250-cdsp-pas
+  contains:
+enum:
+  - qcom,sc8180x-adsp-pas
+  - qcom,sc8180x-cdsp-pas
+  - qcom,sm8150-adsp-pas
+  - qcom,sm8150-cdsp-pas
+  - qcom,sm8250-cdsp-pas
 then:
   properties:
 power-domains:

-- 
2.34.1

[PATCH v3 0/6] arm64: dts: qcom: qcs615: enable remoteprocs - ADSP and CDSP

Enable the remote processor PAS loader for QCS615 ADSP and CDSP
processors. This allows different platforms/architectures to control
(power on, load firmware, power off) those remote processors while
abstracting the hardware differences. Additionally, and add a PIL region
in IMEM so that post mortem debug tools can collect ramdumps.

Signed-off-by: Lijuan Gao
---
Changes in v3:
- Update base-commit to tag: next-20250515.
- Collected Reviewed-by: tag.
- Add a comment for SLPI 26 in the smp2p-adsp node.
- Update the IMEM address to the starting address of the IMEM layout,
and also update the offset address of pil-reloc.
- Link to v1:
https://lore.kernel.org/r/20250507-add_qcs615_remoteproc_support-v2-0-52ac6cb43...@quicinc.com

Changes in v2:
- Remove the qcom prefix from smp2p node name.
- Remove the unnecessary property qcom,ipc from smp2p node.
- Remove the unused node apcs: syscon.
- Remove the unused nodes from smp2p node, such as
sleepstate_smp2p_out/in, smp2p_rdbg2_out/in, smp2p_rdbg5_out/in.
- Update the commit message for IMEM PIL info region.
- Update the remoteproc node names.
- Correct the size of register for remoteproc nodes.
- Add empty line before status properties.
- Link to v1:
https://lore.kernel.org/r/20250423-add_qcs615_remoteproc_support-v1-0-a94fe8799...@quicinc.com

---
Kyle Deng (1):
arm64: dts: qcom: qcs615: Add mproc node for SEMP2P

Lijuan Gao (5):
dt-bindings: remoteproc: qcom,sm8150-pas: Document QCS615 remoteproc
dt-bindings: soc: qcom: add qcom,qcs615-imem compatible
arm64: dts: qcom: qcs615: Add IMEM and PIL info region
arm64: dts: qcom: qcs615: add ADSP and CDSP nodes
arm64: dts: qcom: qcs615-ride: enable remoteprocs

.../bindings/remoteproc/qcom,sm8150-pas.yaml | 65 ++
.../devicetree/bindings/sram/qcom,imem.yaml| 1 +
arch/arm64/boot/dts/qcom/qcs615-ride.dts | 12 ++
arch/arm64/boot/dts/qcom/qcs615.dtsi | 144 +
4 files changed, 195 insertions(+), 27 deletions(-)
---
base-commit: 484803582c77061b470ac64a634f25f89715be3f
change-id: 20250516-add_qcs615_remoteproc_support-b85ca60e48a8

Best regards,
--
Lijuan Gao

Re: [PATCH 0/9] tools/nolibc: split out more headers

2025-05-15 Thread Willy Tarreau

On Thu, May 15, 2025 at 09:57:46PM +0200, Thomas Weißschuh wrote:
> Split out all headers which are used by nolibc-test.c.
> This makes it easier to port existing applications to nolibc.

Nice work, it's pleasant to no longer see #ifdef NOLIBC in the .c. I'll
eventually try to do the same in my init code once I'm in vacation, it's
likely that it will be sufficient as well now.

For the series: Acked-by: Willy Tarreau 

Thanks!
Willy

[PATCH 1/2] nvdimm/btt: Fix memleaks in discover_arenas()

2025-05-15 Thread Li Zhijian

kmemleak reported a memleak after the ndctl_test
unreferenced object 0x88800e6cf2c0 (size 32):
  comm "modprobe", pid 969, jiffies 4294698691
  hex dump (first 32 bytes):
03 00 00 00 a0 0a 00 00 00 b0 b4 00 00 c9 ff ff  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace (crc 807f3e24):
__kmalloc_cache_noprof+0x331/0x410
nvdimm_namespace_attach_btt+0xa9b/0xcc0 [nd_btt]
platform_probe+0x45/0xa0
...
load_module+0x21f9/0x22f0

faddr2line tells that (based on v6.15-rc4):
$ ./scripts/faddr2line drivers/nvdimm/nd_btt.o nvdimm_namespace_attach_btt+0xa9
nvdimm_namespace_attach_btt+0xa9b/0xcc0:
log_set_indices at linux/drivers/nvdimm/btt.c:719
(inlined by) discover_arenas at linux/drivers/nvdimm/btt.c:888
(inlined by) btt_init at linux/drivers/nvdimm/btt.c:1583
(inlined by) nvdimm_namespace_attach_btt at linux/drivers/nvdimm/btt.c:1680

It's believed that this was a false positive because the leaking size
didn't match any instance in an arena.

However during looking into this issue, it's noticed that it does not
release an initializing arena instance and instances in btt->arena_list
in some error paths.

Signed-off-by: Li Zhijian 
---
 drivers/nvdimm/btt.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 423dcd190906..a11e4e7e9a52 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -801,17 +801,22 @@ static struct arena_info *alloc_arena(struct btt *btt, 
size_t size,
return arena;
 }
 
+static void free_arena(struct arena_info *arena)
+{
+   kfree(arena->rtt);
+   kfree(arena->map_locks);
+   kfree(arena->freelist);
+   debugfs_remove_recursive(arena->debugfs_dir);
+   kfree(arena);
+}
+
 static void free_arenas(struct btt *btt)
 {
struct arena_info *arena, *next;
 
list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
list_del(&arena->list);
-   kfree(arena->rtt);
-   kfree(arena->map_locks);
-   kfree(arena->freelist);
-   debugfs_remove_recursive(arena->debugfs_dir);
-   kfree(arena);
+   free_arena(arena);
}
 }
 
@@ -848,7 +853,7 @@ static void parse_arena_meta(struct arena_info *arena, 
struct btt_sb *super,
 static int discover_arenas(struct btt *btt)
 {
int ret = 0;
-   struct arena_info *arena;
+   struct arena_info *arena = NULL;
size_t remaining = btt->rawsize;
u64 cur_nlba = 0;
size_t cur_off = 0;
@@ -861,8 +866,10 @@ static int discover_arenas(struct btt *btt)
while (remaining) {
/* Alloc memory for arena */
arena = alloc_arena(btt, 0, 0, 0);
-   if (!arena)
-   return -ENOMEM;
+   if (!arena) {
+   ret = -ENOMEM;
+   goto out;
+   }
 
arena->infooff = cur_off;
ret = btt_info_read(arena, super);
@@ -921,7 +928,8 @@ static int discover_arenas(struct btt *btt)
return ret;
 
  out:
-   kfree(arena);
+   if (arena)
+   free_arena(arena);
free_arenas(btt);
return ret;
 }
-- 
2.47.0

[PATCH 2/2] nvdimm/btt: Fix memleaks in btt_init()

2025-05-15 Thread Li Zhijian

Call free_arenas() to release the arena instances in btt->arena_list
in the error paths.

Signed-off-by: Li Zhijian 
---
 drivers/nvdimm/btt.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index a11e4e7e9a52..a85448273a9a 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1597,7 +1597,7 @@ static struct btt *btt_init(struct nd_btt *nd_btt, 
unsigned long long rawsize,
if (btt->init_state != INIT_READY && nd_region->ro) {
dev_warn(dev, "%s is read-only, unable to init btt metadata\n",
dev_name(&nd_region->dev));
-   return NULL;
+   goto out;
} else if (btt->init_state != INIT_READY) {
btt->num_arenas = (rawsize / ARENA_MAX_SIZE) +
((rawsize % ARENA_MAX_SIZE) ? 1 : 0);
@@ -1607,25 +1607,29 @@ static struct btt *btt_init(struct nd_btt *nd_btt, 
unsigned long long rawsize,
ret = create_arenas(btt);
if (ret) {
dev_info(dev, "init: create_arenas: %d\n", ret);
-   return NULL;
+   goto out;
}
 
ret = btt_meta_init(btt);
if (ret) {
dev_err(dev, "init: error in meta_init: %d\n", ret);
-   return NULL;
+   goto out;
}
}
 
ret = btt_blk_init(btt);
if (ret) {
dev_err(dev, "init: error in blk_init: %d\n", ret);
-   return NULL;
+   goto out;
}
 
btt_debugfs_init(btt);
 
return btt;
+
+out:
+   free_arenas(btt);
+   return NULL;
 }
 
 /**
-- 
2.47.0

Re: [PATCH v4] vdpa/octeon_ep: Control PCI dev enabling manually

2025-05-15 Thread Philipp Stanner

On Thu, 2025-05-08 at 10:51 +0200, Philipp Stanner wrote:
> PCI region request functions such as pci_request_region() currently
> have
> the problem of becoming sometimes managed functions, if
> pcim_enable_device() instead of pci_enable_device() was called. The
> PCI
> subsystem wants to remove this deprecated behavior from its
> interfaces.
> 
> octeopn_ep enables its device with pcim_enable_device() (for VF. PF
> uses
> manual management), but does so only to get automatic disablement.
> The
> driver wants to manage its PCI resources for VF manually, without
> devres.
> 
> The easiest way not to use automatic resource management at all is by
> also handling device enable- and disablement manually.
> 
> Replace pcim_enable_device() with pci_enable_device(). Add the
> necessary
> calls to pci_disable_device().
> 
> Signed-off-by: Philipp Stanner 
> Acked-by: Vamsi Attunuru 

Hi again,

this is the last of 12 drivers blocking me from removing a few hundred
lines of broken code from PCI. Would be great if it could be sent to
Linus next merge window.

Can someone take this patch in?

Thx
P.

> ---
> Changes in v4:
>   - s/AF/PF
>   - Add Vamsi's AB
> 
> Changes in v3:
>   - Only call pci_disable_device() for the PF version. For AF it
> would
>     cause a WARN_ON because pcim_enable_device()'s callback will also
>     try to disable.
> ---
>  drivers/vdpa/octeon_ep/octep_vdpa_main.c | 17 -
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vdpa/octeon_ep/octep_vdpa_main.c
> b/drivers/vdpa/octeon_ep/octep_vdpa_main.c
> index f3d4dda4e04c..9b49efd24391 100644
> --- a/drivers/vdpa/octeon_ep/octep_vdpa_main.c
> +++ b/drivers/vdpa/octeon_ep/octep_vdpa_main.c
> @@ -454,6 +454,9 @@ static void octep_vdpa_remove_pf(struct pci_dev
> *pdev)
>   octep_iounmap_region(pdev, octpf->base,
> OCTEP_HW_MBOX_BAR);
>  
>   octep_vdpa_pf_bar_expand(octpf);
> +
> + /* The pf version does not use managed PCI. */
> + pci_disable_device(pdev);
>  }
>  
>  static void octep_vdpa_vf_bar_shrink(struct pci_dev *pdev)
> @@ -825,7 +828,7 @@ static int octep_vdpa_probe_pf(struct pci_dev
> *pdev)
>   struct octep_pf *octpf;
>   int ret;
>  
> - ret = pcim_enable_device(pdev);
> + ret = pci_enable_device(pdev);
>   if (ret) {
>   dev_err(dev, "Failed to enable device\n");
>   return ret;
> @@ -834,15 +837,17 @@ static int octep_vdpa_probe_pf(struct pci_dev
> *pdev)
>   ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));
>   if (ret) {
>   dev_err(dev, "No usable DMA configuration\n");
> - return ret;
> + goto disable_pci;
>   }
>   octpf = devm_kzalloc(dev, sizeof(*octpf), GFP_KERNEL);
> - if (!octpf)
> - return -ENOMEM;
> + if (!octpf) {
> + ret = -ENOMEM;
> + goto disable_pci;
> + }
>  
>   ret = octep_iomap_region(pdev, octpf->base,
> OCTEP_HW_MBOX_BAR);
>   if (ret)
> - return ret;
> + goto disable_pci;
>  
>   pci_set_master(pdev);
>   pci_set_drvdata(pdev, octpf);
> @@ -856,6 +861,8 @@ static int octep_vdpa_probe_pf(struct pci_dev
> *pdev)
>  
>  unmap_region:
>   octep_iounmap_region(pdev, octpf->base, OCTEP_HW_MBOX_BAR);
> +disable_pci:
> + pci_disable_device(pdev);
>   return ret;
>  }
>

Re: [PATCH v2] x86/sgx: Prevent attempts to reclaim poisoned pages

2025-05-15 Thread Dave Hansen

Thanks for sending this, Andrew!

I think I'll probably add a slightly shorter summary:

tl;dr: SGX page reclaim touches the page to copy its contents to
secondary storage. SGX instructions do not gracefully handle machine
checks. Despite this, the existing SGX code will try to reclaim pages
that it _knows_ are poisoned. Avoid even trying to reclaim poisoned pages.

But otherwise it looks great:

Acked-by: Dave Hansen

[PATCH] selftests: Improve test output grammar, code style

2025-05-15 Thread Hanne-Lotta Mäenpää

Add small grammar fixes in perf events and Real Time Clock tests'
output messages.

Include braces around a single if statement, when there are multiple
statements in the else branch, to align with the kernel coding style.

Signed-off-by: Hanne-Lotta Mäenpää 
---
 tools/testing/selftests/perf_events/watermark_signal.c |  7 ---
 tools/testing/selftests/rtc/rtctest.c  | 10 +-
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/perf_events/watermark_signal.c 
b/tools/testing/selftests/perf_events/watermark_signal.c
index 49dc1e831174..6176afd4950b 100644
--- a/tools/testing/selftests/perf_events/watermark_signal.c
+++ b/tools/testing/selftests/perf_events/watermark_signal.c
@@ -65,8 +65,9 @@ TEST(watermark_signal)
 
child = fork();
EXPECT_GE(child, 0);
-   if (child == 0)
+   if (child == 0) {
do_child();
+   }
else if (child < 0) {
perror("fork()");
goto cleanup;
@@ -75,7 +76,7 @@ TEST(watermark_signal)
if (waitpid(child, &child_status, WSTOPPED) != child ||
!(WIFSTOPPED(child_status) && WSTOPSIG(child_status) == SIGSTOP)) {
fprintf(stderr,
-   "failed to sycnhronize with child errno=%d status=%x\n",
+   "failed to synchronize with child errno=%d status=%x\n",
errno,
child_status);
goto cleanup;
@@ -84,7 +85,7 @@ TEST(watermark_signal)
fd = syscall(__NR_perf_event_open, &attr, child, -1, -1,
 PERF_FLAG_FD_CLOEXEC);
if (fd < 0) {
-   fprintf(stderr, "failed opening event %llx\n", attr.config);
+   fprintf(stderr, "failed to setup performance monitoring 
%llx\n", attr.config);
goto cleanup;
}
 
diff --git a/tools/testing/selftests/rtc/rtctest.c 
b/tools/testing/selftests/rtc/rtctest.c
index be175c0e6ae3..8fd4d5d3b527 100644
--- a/tools/testing/selftests/rtc/rtctest.c
+++ b/tools/testing/selftests/rtc/rtctest.c
@@ -138,10 +138,10 @@ TEST_F_TIMEOUT(rtc, date_read_loop, 
READ_LOOP_DURATION_SEC + 2) {
rtc_read = rtc_time_to_timestamp(&rtc_tm);
/* Time should not go backwards */
ASSERT_LE(prev_rtc_read, rtc_read);
-   /* Time should not increase more then 1s at a time */
+   /* Time should not increase more than 1s per read */
ASSERT_GE(prev_rtc_read + 1, rtc_read);
 
-   /* Sleep 11ms to avoid killing / overheating the RTC */
+   /* Sleep 11ms to avoid overheating the RTC */
nanosleep_with_retries(READ_LOOP_SLEEP_MS * 100);
 
prev_rtc_read = rtc_read;
@@ -236,7 +236,7 @@ TEST_F(rtc, alarm_alm_set) {
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
if (alarm_state == RTC_ALARM_RES_MINUTE)
-   SKIP(return, "Skipping test since alarms has only minute 
granularity.");
+   SKIP(return, "Skipping test since alarms have only minute 
granularity.");
 
rc = ioctl(self->fd, RTC_RD_TIME, &tm);
ASSERT_NE(-1, rc);
@@ -306,7 +306,7 @@ TEST_F(rtc, alarm_wkalm_set) {
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
if (alarm_state == RTC_ALARM_RES_MINUTE)
-   SKIP(return, "Skipping test since alarms has only minute 
granularity.");
+   SKIP(return, "Skipping test since alarms have only minute 
granularity.");
 
rc = ioctl(self->fd, RTC_RD_TIME, &alarm.time);
ASSERT_NE(-1, rc);
@@ -502,7 +502,7 @@ int main(int argc, char **argv)
if (access(rtc_file, R_OK) == 0)
ret = test_harness_run(argc, argv);
else
-   ksft_exit_skip("[SKIP]: Cannot access rtc file %s - Exiting\n",
+   ksft_exit_skip("Cannot access RTC file %s - exiting\n",
rtc_file);
 
return ret;
-- 
2.39.5

Re: [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h 
> b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> index 9159bf1a4730..35f27366c277 100644
> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> @@ -22,6 +22,8 @@ KVM_X86_PMU_OP(init)
>  KVM_X86_PMU_OP_OPTIONAL(reset)
>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
> +KVM_X86_PMU_OP(put_guest_context)
> +KVM_X86_PMU_OP(load_guest_context)

For KVM, the "guest_context" part is largely superfluous, as KVM always operates
on guest state, e.g. kvm_fpu_{load,put}().

I do think we should squeeze in "mediated" somewhere, otherwise the it's hard to
see that these are specific to the mediated PMU.

So probably mediated_{load,put}()?

>  #undef KVM_X86_PMU_OP
>  #undef KVM_X86_PMU_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7ee740aa..4117a382739a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -568,6 +568,10 @@ struct kvm_pmu {
>   u64 raw_event_mask;
>   struct kvm_pmc gp_counters[KVM_MAX_NR_GP_COUNTERS];
>   struct kvm_pmc fixed_counters[KVM_MAX_NR_FIXED_COUNTERS];
> + u32 gp_eventsel_base;
> + u32 gp_counter_base;
> + u32 fixed_base;
> + u32 cntr_shift;

Gah, my bad, "shift" was a terrible suggestion.  It should be "stride".

> @@ -306,6 +313,10 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>  int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
>  void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
>  bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu);
> +void kvm_pmu_put_guest_pmcs(struct kvm_vcpu *vcpu);
> +void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu);
> +void kvm_pmu_put_guest_context(struct kvm_vcpu *vcpu);
> +void kvm_pmu_load_guest_context(struct kvm_vcpu *vcpu);
>  
>  bool is_vmware_backdoor_pmc(u32 pmc_idx);
>  bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 1a7e3a897fdf..7e0d84d50b74 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -175,6 +175,22 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   return 1;
>  }
>  
> +static inline void amd_update_msr_base(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> + if (kvm_pmu_has_perf_global_ctrl(pmu) ||
> + guest_cpu_cap_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
> + pmu->gp_eventsel_base = MSR_F15H_PERF_CTL0;
> + pmu->gp_counter_base = MSR_F15H_PERF_CTR0;
> + pmu->cntr_shift = 2;
> + } else {
> + pmu->gp_eventsel_base = MSR_K7_EVNTSEL0;
> + pmu->gp_counter_base = MSR_K7_PERFCTR0;
> + pmu->cntr_shift = 1;
> + }
> +}

Moving quoted text around to organize responses...

> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 796b7bc4affe..ed17ab198dfb 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -460,6 +460,17 @@ static void intel_pmu_enable_fixed_counter_bits(struct 
> kvm_pmu *pmu, u64 bits)
>   pmu->fixed_ctr_ctrl_rsvd &= ~intel_fixed_bits_by_idx(i, bits);
>  }
>  
> +static inline void intel_update_msr_base(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> + pmu->gp_eventsel_base = MSR_P6_EVNTSEL0;
> + pmu->gp_counter_base = fw_writes_is_enabled(vcpu) ?
> +MSR_IA32_PMC0 : MSR_IA32_PERFCTR0;

This is wrong.  And I unintentionally proved that it's wrong, by goofing when I
fixed up this code and using MSR_IA32_PERFCTR0 instead of MSR_IA32_PMC0.

Whether or not the guest supports full-width writes is irrelevant, because 
support
for FW writes doesn't change the width of the counters.  Just because the 
*guest* 
can't directly write all e.g. 48 bits doesn't mean clobbering bits 47:32 is ok.

Similarly, on the AMD side, using the legacy interface in KVM is unnecessary.
The guest may be limited to those MSRs, but KVM has a hard dependency on PMU v2,
so just unconditionally use MSR_F15H_PERF_CTR0 (and for the record, because I
had to look it up, the newfangled MSRs on AMD are aliased to the legacy MSRs for
0..3).

Very happily, that means the MSRs don't need to be per-PMU, and they don't even
need to be configured at runtime for a given vendor.  Simply require FW writes
on Intel to enable the mediated PMU, and then hardcode the GP base to 
MSR_IA32_PMC0.

> +static void amd_put_guest_context(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> + rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
> + wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
> + rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, pmu->global_status);
> +
> + /* Clear global status bits if non-zero */
> + if (pmu->global_st

[PATCH 1/9] tools/nolibc: move ioctl() to sys/ioctl.h

This is the location regular userspace expects this definition.

Signed-off-by: Thomas Weißschuh 
---
 tools/include/nolibc/Makefile|  1 +
 tools/include/nolibc/nolibc.h|  1 +
 tools/include/nolibc/sys.h   | 12 
 tools/include/nolibc/sys/ioctl.h | 29 +
 4 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/tools/include/nolibc/Makefile b/tools/include/nolibc/Makefile
index 
4c4b7b5ff605ac5dde11628331378489d76d4466..7d66847df91a94ca784c04ced278eb6d7099bab4
 100644
--- a/tools/include/nolibc/Makefile
+++ b/tools/include/nolibc/Makefile
@@ -49,6 +49,7 @@ all_files := \
string.h \
sys.h \
sys/auxv.h \
+   sys/ioctl.h \
sys/mman.h \
sys/random.h \
sys/stat.h \
diff --git a/tools/include/nolibc/nolibc.h b/tools/include/nolibc/nolibc.h
index 
51c423a36b5957e2528e94d717d88e2383230cba..d6048d1e9ea5d4b5d504e156aafc5651dcd6b8c1
 100644
--- a/tools/include/nolibc/nolibc.h
+++ b/tools/include/nolibc/nolibc.h
@@ -97,6 +97,7 @@
 #include "types.h"
 #include "sys.h"
 #include "sys/auxv.h"
+#include "sys/ioctl.h"
 #include "sys/mman.h"
 #include "sys/random.h"
 #include "sys/stat.h"
diff --git a/tools/include/nolibc/sys.h b/tools/include/nolibc/sys.h
index 
5733fe54911dca44c7423951ff85fb166d95c06f..313c210173c804f728b0be4ab8e67542bb3a7219
 100644
--- a/tools/include/nolibc/sys.h
+++ b/tools/include/nolibc/sys.h
@@ -517,18 +517,6 @@ uid_t getuid(void)
 }
 
 
-/*
- * int ioctl(int fd, unsigned long cmd, ... arg);
- */
-
-static __attribute__((unused))
-long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
-{
-   return my_syscall3(__NR_ioctl, fd, cmd, arg);
-}
-
-#define ioctl(fd, cmd, arg) __sysret(sys_ioctl(fd, cmd, (unsigned long)(arg)))
-
 /*
  * int kill(pid_t pid, int signal);
  */
diff --git a/tools/include/nolibc/sys/ioctl.h b/tools/include/nolibc/sys/ioctl.h
new file mode 100644
index 
..cb34be92a5f96aa566afa53a0f19932c2d5fe1eb
--- /dev/null
+++ b/tools/include/nolibc/sys/ioctl.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * Ioctl definitions for NOLIBC
+ * Copyright (C) 2017-2021 Willy Tarreau 
+ */
+
+/* make sure to include all global symbols */
+#include "nolibc.h"
+
+#ifndef _NOLIBC_SYS_IOCTL_H
+#define _NOLIBC_SYS_IOCTL_H
+
+#include "../sys.h"
+
+#include 
+
+/*
+ * int ioctl(int fd, unsigned long cmd, ... arg);
+ */
+
+static __attribute__((unused))
+long sys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
+{
+   return my_syscall3(__NR_ioctl, fd, cmd, arg);
+}
+
+#define ioctl(fd, cmd, arg) __sysret(sys_ioctl(fd, cmd, (unsigned long)(arg)))
+
+#endif /* _NOLIBC_SYS_IOCTL_H */

-- 
2.49.0

[PATCH 4/9] tools/nolibc: move reboot() to sys/reboot.h

This is the location regular userspace expects this definition.

Signed-off-by: Thomas Weißschuh 
---
 tools/include/nolibc/Makefile |  1 +
 tools/include/nolibc/nolibc.h |  1 +
 tools/include/nolibc/sys.h| 18 --
 tools/include/nolibc/sys/reboot.h | 34 ++
 tools/include/nolibc/types.h  |  1 -
 5 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/tools/include/nolibc/Makefile b/tools/include/nolibc/Makefile
index 
35cfbc84d499b0a57689bdfed95a45023904d256..4850501b8d53be6b603ecbb04d6f952cd5370cf0
 100644
--- a/tools/include/nolibc/Makefile
+++ b/tools/include/nolibc/Makefile
@@ -54,6 +54,7 @@ all_files := \
sys/mount.h \
sys/prctl.h \
sys/random.h \
+   sys/reboot.h \
sys/stat.h \
sys/syscall.h \
sys/time.h \
diff --git a/tools/include/nolibc/nolibc.h b/tools/include/nolibc/nolibc.h
index 
1c159e32a248d46fa4d36a2c35e92eb9da91e9f6..36ea7a02c7434cc006a7b12d413add04a2f85565
 100644
--- a/tools/include/nolibc/nolibc.h
+++ b/tools/include/nolibc/nolibc.h
@@ -102,6 +102,7 @@
 #include "sys/mount.h"
 #include "sys/prctl.h"
 #include "sys/random.h"
+#include "sys/reboot.h"
 #include "sys/stat.h"
 #include "sys/syscall.h"
 #include "sys/time.h"
diff --git a/tools/include/nolibc/sys.h b/tools/include/nolibc/sys.h
index 
a17fe98968a29081661eaf235111482a543f87ba..6c89dd0316dd0ebf03ebde6fa5c14273df6a0c62
 100644
--- a/tools/include/nolibc/sys.h
+++ b/tools/include/nolibc/sys.h
@@ -730,24 +730,6 @@ ssize_t read(int fd, void *buf, size_t count)
 }
 
 
-/*
- * int reboot(int cmd);
- *  is among LINUX_REBOOT_CMD_*
- */
-
-static __attribute__((unused))
-ssize_t sys_reboot(int magic1, int magic2, int cmd, void *arg)
-{
-   return my_syscall4(__NR_reboot, magic1, magic2, cmd, arg);
-}
-
-static __attribute__((unused))
-int reboot(int cmd)
-{
-   return __sysret(sys_reboot(LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, 
cmd, 0));
-}
-
-
 /*
  * int getrlimit(int resource, struct rlimit *rlim);
  * int setrlimit(int resource, const struct rlimit *rlim);
diff --git a/tools/include/nolibc/sys/reboot.h 
b/tools/include/nolibc/sys/reboot.h
new file mode 100644
index 
..727363fbfd8a6f8994a0cf13829b7347ced198bb
--- /dev/null
+++ b/tools/include/nolibc/sys/reboot.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * Reboot definitions for NOLIBC
+ * Copyright (C) 2017-2021 Willy Tarreau 
+ */
+
+/* make sure to include all global symbols */
+#include "nolibc.h"
+
+#ifndef _NOLIBC_SYS_REBOOT_H
+#define _NOLIBC_SYS_REBOOT_H
+
+#include "../sys.h"
+
+#include 
+
+/*
+ * int reboot(int cmd);
+ *  is among LINUX_REBOOT_CMD_*
+ */
+
+static __attribute__((unused))
+ssize_t sys_reboot(int magic1, int magic2, int cmd, void *arg)
+{
+   return my_syscall4(__NR_reboot, magic1, magic2, cmd, arg);
+}
+
+static __attribute__((unused))
+int reboot(int cmd)
+{
+   return __sysret(sys_reboot(LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, 
cmd, 0));
+}
+
+#endif /* _NOLIBC_SYS_REBOOT_H */
diff --git a/tools/include/nolibc/types.h b/tools/include/nolibc/types.h
index 
93da29fe7719c9f196fba38d1f3f31cad0fc02f1..74c7694b2d5e54f7a86854697ac32a2ea2b62e86
 100644
--- a/tools/include/nolibc/types.h
+++ b/tools/include/nolibc/types.h
@@ -12,7 +12,6 @@
 
 #include "std.h"
 #include 
-#include  /* for LINUX_REBOOT_* */
 #include 
 #include 
 #include 

-- 
2.49.0

[PATCH 9/9] selftests/nolibc: drop include guards around standard headers

Nolibc now provides all the headers required by nolibc-test.c.

Signed-off-by: Thomas Weißschuh 
---
 tools/testing/selftests/nolibc/nolibc-test.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/tools/testing/selftests/nolibc/nolibc-test.c 
b/tools/testing/selftests/nolibc/nolibc-test.c
index 
0391c7d01380ea2f20d1d07497ea1964bcb6a9f4..dbe13000fb1ac153e9a89f627492daeb584a05d4
 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -9,12 +9,9 @@
  *   $(CC) -nostdlib -I/path/to/nolibc/sysroot  => _NOLIBC_* guards are present
  *   $(CC) with default libc=> NOLIBC* never defined
  */
-#ifndef NOLIBC
 #include 
 #include 
 #include 
-#ifndef _NOLIBC_STDIO_H
-/* standard libcs need more includes */
 #include 
 #include 
 #include 
@@ -43,8 +40,6 @@
 #include 
 #include 
 #include 
-#endif
-#endif
 
 #pragma GCC diagnostic ignored "-Wmissing-prototypes"
 

-- 
2.49.0

[PATCH 6/9] tools/nolibc: move makedev() and friends to sys/sysmacros.h

This is the location regular userspace expects these definitions.

Signed-off-by: Thomas Weißschuh 
---
 tools/include/nolibc/Makefile|  1 +
 tools/include/nolibc/nolibc.h|  1 +
 tools/include/nolibc/sys/sysmacros.h | 20 
 tools/include/nolibc/types.h |  5 -
 4 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/tools/include/nolibc/Makefile b/tools/include/nolibc/Makefile
index 
cc82399e940e53940061b60588566db51f2e191a..ae04de46cdfeb2080d95b60c3d16e1c9f3692c9c
 100644
--- a/tools/include/nolibc/Makefile
+++ b/tools/include/nolibc/Makefile
@@ -58,6 +58,7 @@ all_files := \
sys/resource.h \
sys/stat.h \
sys/syscall.h \
+   sys/sysmacros.h \
sys/time.h \
sys/timerfd.h \
sys/types.h \
diff --git a/tools/include/nolibc/nolibc.h b/tools/include/nolibc/nolibc.h
index 
7d151776e47a30080dc3d961bcdeacd6ab558629..182dcfce126638db1cc9fd3fa79ab99835b42c26
 100644
--- a/tools/include/nolibc/nolibc.h
+++ b/tools/include/nolibc/nolibc.h
@@ -106,6 +106,7 @@
 #include "sys/resource.h"
 #include "sys/stat.h"
 #include "sys/syscall.h"
+#include "sys/sysmacros.h"
 #include "sys/time.h"
 #include "sys/timerfd.h"
 #include "sys/wait.h"
diff --git a/tools/include/nolibc/sys/sysmacros.h 
b/tools/include/nolibc/sys/sysmacros.h
new file mode 100644
index 
..e7e24dda1dbb51df92307def00f3670459b2409c
--- /dev/null
+++ b/tools/include/nolibc/sys/sysmacros.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * Sysmacro definitions for NOLIBC
+ * Copyright (C) 2017-2021 Willy Tarreau 
+ */
+
+/* make sure to include all global symbols */
+#include "nolibc.h"
+
+#ifndef _NOLIBC_SYS_SYSMACROS_H
+#define _NOLIBC_SYS_SYSMACROS_H
+
+#include "../std.h"
+
+/* WARNING, it only deals with the 4096 first majors and 256 first minors */
+#define makedev(major, minor) ((dev_t)major) & 0xfff) << 8) | ((minor) & 
0xff)))
+#define major(dev) ((unsigned int)(((dev) >> 8) & 0xfff))
+#define minor(dev) ((unsigned int)((dev) & 0xff))
+
+#endif /* _NOLIBC_SYS_SYSMACROS_H */
diff --git a/tools/include/nolibc/types.h b/tools/include/nolibc/types.h
index 
2225c9388a46480a18e9ce603b08d846e5276bab..0071bfbc23154cf9b47b0bd747101fea4955018d
 100644
--- a/tools/include/nolibc/types.h
+++ b/tools/include/nolibc/types.h
@@ -188,11 +188,6 @@ struct stat {
 typedef __kernel_clockid_t clockid_t;
 typedef int timer_t;
 
-/* WARNING, it only deals with the 4096 first majors and 256 first minors */
-#define makedev(major, minor) ((dev_t)major) & 0xfff) << 8) | ((minor) & 
0xff)))
-#define major(dev) ((unsigned int)(((dev) >> 8) & 0xfff))
-#define minor(dev) ((unsigned int)((dev) & 0xff))
-
 #ifndef offsetof
 #define offsetof(TYPE, FIELD) ((size_t) &((TYPE *)0)->FIELD)
 #endif

-- 
2.49.0

[PATCH 8/9] tools/nolibc: move NULL and offsetof() to sys/stddef.h

This is the location regular userspace expects these definitions.

Signed-off-by: Thomas Weißschuh 
---
 tools/include/nolibc/Makefile |  1 +
 tools/include/nolibc/std.h|  6 +-
 tools/include/nolibc/stddef.h | 24 
 tools/include/nolibc/types.h  |  4 
 4 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/tools/include/nolibc/Makefile b/tools/include/nolibc/Makefile
index 
0ff06374577df457b0611d91dcfe637a05476364..c4615bed731ced75a00867fdeb455d72652a7e42
 100644
--- a/tools/include/nolibc/Makefile
+++ b/tools/include/nolibc/Makefile
@@ -44,6 +44,7 @@ all_files := \
std.h \
stdarg.h \
stdbool.h \
+   stddef.h \
stdint.h \
stdlib.h \
string.h \
diff --git a/tools/include/nolibc/std.h b/tools/include/nolibc/std.h
index 
933bc0be7e1c6be3b7909efc127f1d3b9c611135..adda7333d12e7d2c336938ede1aaf215b4b93165
 100644
--- a/tools/include/nolibc/std.h
+++ b/tools/include/nolibc/std.h
@@ -13,12 +13,8 @@
  * syscall-specific stuff, as this file is expected to be included very early.
  */
 
-/* note: may already be defined */
-#ifndef NULL
-#define NULL ((void *)0)
-#endif
-
 #include "stdint.h"
+#include "stddef.h"
 
 /* those are commonly provided by sys/types.h */
 typedef unsigned int  dev_t;
diff --git a/tools/include/nolibc/stddef.h b/tools/include/nolibc/stddef.h
new file mode 100644
index 
..ecbd13eab1f5190fd0e90a07699c2f06dbde8916
--- /dev/null
+++ b/tools/include/nolibc/stddef.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * Stddef definitions for NOLIBC
+ * Copyright (C) 2017-2021 Willy Tarreau 
+ */
+
+/* make sure to include all global symbols */
+#include "nolibc.h"
+
+#ifndef _NOLIBC_STDDEF_H
+#define _NOLIBC_STDDEF_H
+
+#include "stdint.h"
+
+/* note: may already be defined */
+#ifndef NULL
+#define NULL ((void *)0)
+#endif
+
+#ifndef offsetof
+#define offsetof(TYPE, FIELD) ((size_t) &((TYPE *)0)->FIELD)
+#endif
+
+#endif /* _NOLIBC_STDDEF_H */
diff --git a/tools/include/nolibc/types.h b/tools/include/nolibc/types.h
index 
0071bfbc23154cf9b47b0bd747101fea4955018d..30904be544ed01b212042ebc0f4dab610f64b216
 100644
--- a/tools/include/nolibc/types.h
+++ b/tools/include/nolibc/types.h
@@ -188,10 +188,6 @@ struct stat {
 typedef __kernel_clockid_t clockid_t;
 typedef int timer_t;
 
-#ifndef offsetof
-#define offsetof(TYPE, FIELD) ((size_t) &((TYPE *)0)->FIELD)
-#endif
-
 #ifndef container_of
 #define container_of(PTR, TYPE, FIELD) ({  \
__typeof__(((TYPE *)0)->FIELD) *__FIELD_PTR = (PTR);\

-- 
2.49.0

[PATCH 2/9] tools/nolibc: move mount() to sys/mount.h

This is the location regular userspace expects this definition.

Signed-off-by: Thomas Weißschuh 
---
 tools/include/nolibc/Makefile|  1 +
 tools/include/nolibc/nolibc.h|  1 +
 tools/include/nolibc/sys.h   | 20 
 tools/include/nolibc/sys/mount.h | 37 +
 4 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/tools/include/nolibc/Makefile b/tools/include/nolibc/Makefile
index 
7d66847df91a94ca784c04ced278eb6d7099bab4..aa14dfddd77deb5442a6be65dee5684b6d218da7
 100644
--- a/tools/include/nolibc/Makefile
+++ b/tools/include/nolibc/Makefile
@@ -51,6 +51,7 @@ all_files := \
sys/auxv.h \
sys/ioctl.h \
sys/mman.h \
+   sys/mount.h \
sys/random.h \
sys/stat.h \
sys/syscall.h \
diff --git a/tools/include/nolibc/nolibc.h b/tools/include/nolibc/nolibc.h
index 
d6048d1e9ea5d4b5d504e156aafc5651dcd6b8c1..690368f8e46c33df37c824429cf89dd0e95bb806
 100644
--- a/tools/include/nolibc/nolibc.h
+++ b/tools/include/nolibc/nolibc.h
@@ -99,6 +99,7 @@
 #include "sys/auxv.h"
 #include "sys/ioctl.h"
 #include "sys/mman.h"
+#include "sys/mount.h"
 #include "sys/random.h"
 #include "sys/stat.h"
 #include "sys/syscall.h"
diff --git a/tools/include/nolibc/sys.h b/tools/include/nolibc/sys.h
index 
313c210173c804f728b0be4ab8e67542bb3a7219..e66dd6e76055753d98da2627b038dffe7f93
 100644
--- a/tools/include/nolibc/sys.h
+++ b/tools/include/nolibc/sys.h
@@ -672,26 +672,6 @@ int mknod(const char *path, mode_t mode, dev_t dev)
return __sysret(sys_mknod(path, mode, dev));
 }
 
-/*
- * int mount(const char *source, const char *target,
- *   const char *fstype, unsigned long flags,
- *   const void *data);
- */
-static __attribute__((unused))
-int sys_mount(const char *src, const char *tgt, const char *fst,
- unsigned long flags, const void *data)
-{
-   return my_syscall5(__NR_mount, src, tgt, fst, flags, data);
-}
-
-static __attribute__((unused))
-int mount(const char *src, const char *tgt,
-  const char *fst, unsigned long flags,
-  const void *data)
-{
-   return __sysret(sys_mount(src, tgt, fst, flags, data));
-}
-
 
 /*
  * int pipe2(int pipefd[2], int flags);
diff --git a/tools/include/nolibc/sys/mount.h b/tools/include/nolibc/sys/mount.h
new file mode 100644
index 
..5129420e716d0a7e75c32a5bbc8f4e9f0e21b03b
--- /dev/null
+++ b/tools/include/nolibc/sys/mount.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * Mount definitions for NOLIBC
+ * Copyright (C) 2017-2021 Willy Tarreau 
+ */
+
+/* make sure to include all global symbols */
+#include "nolibc.h"
+
+#ifndef _NOLIBC_SYS_MOUNT_H
+#define _NOLIBC_SYS_MOUNT_H
+
+#include "../sys.h"
+
+#include 
+
+/*
+ * int mount(const char *source, const char *target,
+ *   const char *fstype, unsigned long flags,
+ *   const void *data);
+ */
+static __attribute__((unused))
+int sys_mount(const char *src, const char *tgt, const char *fst,
+ unsigned long flags, const void *data)
+{
+   return my_syscall5(__NR_mount, src, tgt, fst, flags, data);
+}
+
+static __attribute__((unused))
+int mount(const char *src, const char *tgt,
+ const char *fst, unsigned long flags,
+ const void *data)
+{
+   return __sysret(sys_mount(src, tgt, fst, flags, data));
+}
+
+#endif /* _NOLIBC_SYS_MOUNT_H */

-- 
2.49.0

[PATCH 3/9] tools/nolibc: move prctl() to sys/prctl.h

This is the location regular userspace expects this definition.

Signed-off-by: Thomas Weißschuh 
---
 tools/include/nolibc/Makefile|  1 +
 tools/include/nolibc/nolibc.h|  1 +
 tools/include/nolibc/sys.h   | 21 -
 tools/include/nolibc/sys/prctl.h | 36 
 4 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/tools/include/nolibc/Makefile b/tools/include/nolibc/Makefile
index 
aa14dfddd77deb5442a6be65dee5684b6d218da7..35cfbc84d499b0a57689bdfed95a45023904d256
 100644
--- a/tools/include/nolibc/Makefile
+++ b/tools/include/nolibc/Makefile
@@ -52,6 +52,7 @@ all_files := \
sys/ioctl.h \
sys/mman.h \
sys/mount.h \
+   sys/prctl.h \
sys/random.h \
sys/stat.h \
sys/syscall.h \
diff --git a/tools/include/nolibc/nolibc.h b/tools/include/nolibc/nolibc.h
index 
690368f8e46c33df37c824429cf89dd0e95bb806..1c159e32a248d46fa4d36a2c35e92eb9da91e9f6
 100644
--- a/tools/include/nolibc/nolibc.h
+++ b/tools/include/nolibc/nolibc.h
@@ -100,6 +100,7 @@
 #include "sys/ioctl.h"
 #include "sys/mman.h"
 #include "sys/mount.h"
+#include "sys/prctl.h"
 #include "sys/random.h"
 #include "sys/stat.h"
 #include "sys/syscall.h"
diff --git a/tools/include/nolibc/sys.h b/tools/include/nolibc/sys.h
index 
e66dd6e76055753d98da2627b038dffe7f93..a17fe98968a29081661eaf235111482a543f87ba
 100644
--- a/tools/include/nolibc/sys.h
+++ b/tools/include/nolibc/sys.h
@@ -23,7 +23,6 @@
 #include 
 #include  /* for O_* and AT_* */
 #include   /* for statx() */
-#include 
 #include 
 #include 
 
@@ -697,26 +696,6 @@ int pipe(int pipefd[2])
 }
 
 
-/*
- * int prctl(int option, unsigned long arg2, unsigned long arg3,
- *   unsigned long arg4, unsigned long arg5);
- */
-
-static __attribute__((unused))
-int sys_prctl(int option, unsigned long arg2, unsigned long arg3,
- unsigned long arg4, unsigned long arg5)
-{
-   return my_syscall5(__NR_prctl, option, arg2, arg3, arg4, arg5);
-}
-
-static __attribute__((unused))
-int prctl(int option, unsigned long arg2, unsigned long arg3,
- unsigned long arg4, unsigned long arg5)
-{
-   return __sysret(sys_prctl(option, arg2, arg3, arg4, arg5));
-}
-
-
 /*
  * int pivot_root(const char *new, const char *old);
  */
diff --git a/tools/include/nolibc/sys/prctl.h b/tools/include/nolibc/sys/prctl.h
new file mode 100644
index 
..778a6a868823ce496f0f5482a58a31d69afd6d30
--- /dev/null
+++ b/tools/include/nolibc/sys/prctl.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * Prctl definitions for NOLIBC
+ * Copyright (C) 2017-2021 Willy Tarreau 
+ */
+
+/* make sure to include all global symbols */
+#include "nolibc.h"
+
+#ifndef _NOLIBC_SYS_PRCTL_H
+#define _NOLIBC_SYS_PRCTL_H
+
+#include "../sys.h"
+
+#include 
+
+/*
+ * int prctl(int option, unsigned long arg2, unsigned long arg3,
+ *   unsigned long arg4, unsigned long arg5);
+ */
+
+static __attribute__((unused))
+int sys_prctl(int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+   return my_syscall5(__NR_prctl, option, arg2, arg3, arg4, arg5);
+}
+
+static __attribute__((unused))
+int prctl(int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+   return __sysret(sys_prctl(option, arg2, arg3, arg4, arg5));
+}
+
+#endif /* _NOLIBC_SYS_PRCTL_H */

-- 
2.49.0

[PATCH 0/9] tools/nolibc: split out more headers

Split out all headers which are used by nolibc-test.c.
This makes it easier to port existing applications to nolibc.

Signed-off-by: Thomas Weißschuh 
---
Thomas Weißschuh (9):
  tools/nolibc: move ioctl() to sys/ioctl.h
  tools/nolibc: move mount() to sys/mount.h
  tools/nolibc: move prctl() to sys/prctl.h
  tools/nolibc: move reboot() to sys/reboot.h
  tools/nolibc: move getrlimit() and friends to sys/resource.h
  tools/nolibc: move makedev() and friends to sys/sysmacros.h
  tools/nolibc: move uname() and friends to sys/utsname.h
  tools/nolibc: move NULL and offsetof() to sys/stddef.h
  selftests/nolibc: drop include guards around standard headers

 tools/include/nolibc/Makefile|   8 ++
 tools/include/nolibc/nolibc.h|   7 ++
 tools/include/nolibc/std.h   |   6 +-
 tools/include/nolibc/stddef.h|  24 +
 tools/include/nolibc/sys.h   | 136 ---
 tools/include/nolibc/sys/ioctl.h |  29 ++
 tools/include/nolibc/sys/mount.h |  37 
 tools/include/nolibc/sys/prctl.h |  36 +++
 tools/include/nolibc/sys/reboot.h|  34 +++
 tools/include/nolibc/sys/resource.h  |  53 +++
 tools/include/nolibc/sys/sysmacros.h |  20 
 tools/include/nolibc/sys/utsname.h   |  42 +
 tools/include/nolibc/types.h |  11 ---
 tools/testing/selftests/nolibc/nolibc-test.c |   5 -
 14 files changed, 291 insertions(+), 157 deletions(-)
---
base-commit: 6a25f787912a73613f12e7eefbebd72ee3d43f85
change-id: 20250515-nolibc-sys-31a4fd76d897

Best regards,
-- 
Thomas Weißschuh

[PATCH 7/9] tools/nolibc: move uname() and friends to sys/utsname.h

This is the location regular userspace expects these definitions.

Signed-off-by: Thomas Weißschuh 
---
 tools/include/nolibc/Makefile  |  1 +
 tools/include/nolibc/nolibc.h  |  1 +
 tools/include/nolibc/sys.h | 27 
 tools/include/nolibc/sys/utsname.h | 42 ++
 4 files changed, 44 insertions(+), 27 deletions(-)

diff --git a/tools/include/nolibc/Makefile b/tools/include/nolibc/Makefile
index 
ae04de46cdfeb2080d95b60c3d16e1c9f3692c9c..0ff06374577df457b0611d91dcfe637a05476364
 100644
--- a/tools/include/nolibc/Makefile
+++ b/tools/include/nolibc/Makefile
@@ -62,6 +62,7 @@ all_files := \
sys/time.h \
sys/timerfd.h \
sys/types.h \
+   sys/utsname.h \
sys/wait.h \
time.h \
types.h \
diff --git a/tools/include/nolibc/nolibc.h b/tools/include/nolibc/nolibc.h
index 
182dcfce126638db1cc9fd3fa79ab99835b42c26..c199ade200c240ef4081f767a162698256f39677
 100644
--- a/tools/include/nolibc/nolibc.h
+++ b/tools/include/nolibc/nolibc.h
@@ -109,6 +109,7 @@
 #include "sys/sysmacros.h"
 #include "sys/time.h"
 #include "sys/timerfd.h"
+#include "sys/utsname.h"
 #include "sys/wait.h"
 #include "ctype.h"
 #include "elf.h"
diff --git a/tools/include/nolibc/sys.h b/tools/include/nolibc/sys.h
index 
282909b1992d17d5292917fbabb0bd2a2dbfce88..9556c69a6ae1ff0f19d884f5022599fc31e04f1f
 100644
--- a/tools/include/nolibc/sys.h
+++ b/tools/include/nolibc/sys.h
@@ -23,7 +23,6 @@
 #include 
 #include  /* for O_* and AT_* */
 #include   /* for statx() */
-#include 
 
 #include "errno.h"
 #include "stdarg.h"
@@ -894,32 +893,6 @@ int umount2(const char *path, int flags)
 }
 
 
-/*
- * int uname(struct utsname *buf);
- */
-
-struct utsname {
-   char sysname[65];
-   char nodename[65];
-   char release[65];
-   char version[65];
-   char machine[65];
-   char domainname[65];
-};
-
-static __attribute__((unused))
-int sys_uname(struct utsname *buf)
-{
-   return my_syscall1(__NR_uname, buf);
-}
-
-static __attribute__((unused))
-int uname(struct utsname *buf)
-{
-   return __sysret(sys_uname(buf));
-}
-
-
 /*
  * int unlink(const char *path);
  */
diff --git a/tools/include/nolibc/sys/utsname.h 
b/tools/include/nolibc/sys/utsname.h
new file mode 100644
index 
..3adda115085069403253c2a276de1e2ae9bb1ad3
--- /dev/null
+++ b/tools/include/nolibc/sys/utsname.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * Utsname definitions for NOLIBC
+ * Copyright (C) 2017-2021 Willy Tarreau 
+ */
+
+/* make sure to include all global symbols */
+#include "nolibc.h"
+
+#ifndef _NOLIBC_SYS_UTSNAME_H
+#define _NOLIBC_SYS_UTSNAME_H
+
+#include "../sys.h"
+
+#include 
+
+/*
+ * int uname(struct utsname *buf);
+ */
+
+struct utsname {
+   char sysname[65];
+   char nodename[65];
+   char release[65];
+   char version[65];
+   char machine[65];
+   char domainname[65];
+};
+
+static __attribute__((unused))
+int sys_uname(struct utsname *buf)
+{
+   return my_syscall1(__NR_uname, buf);
+}
+
+static __attribute__((unused))
+int uname(struct utsname *buf)
+{
+   return __sysret(sys_uname(buf));
+}
+
+#endif /* _NOLIBC_SYS_UTSNAME_H */

-- 
2.49.0

[PATCH 5/9] tools/nolibc: move getrlimit() and friends to sys/resource.h

This is the location regular userspace expects these definitions.

Signed-off-by: Thomas Weißschuh 
---
 tools/include/nolibc/Makefile   |  1 +
 tools/include/nolibc/nolibc.h   |  1 +
 tools/include/nolibc/sys.h  | 38 --
 tools/include/nolibc/sys/resource.h | 53 +
 tools/include/nolibc/types.h|  1 -
 5 files changed, 55 insertions(+), 39 deletions(-)

diff --git a/tools/include/nolibc/Makefile b/tools/include/nolibc/Makefile
index 
4850501b8d53be6b603ecbb04d6f952cd5370cf0..cc82399e940e53940061b60588566db51f2e191a
 100644
--- a/tools/include/nolibc/Makefile
+++ b/tools/include/nolibc/Makefile
@@ -55,6 +55,7 @@ all_files := \
sys/prctl.h \
sys/random.h \
sys/reboot.h \
+   sys/resource.h \
sys/stat.h \
sys/syscall.h \
sys/time.h \
diff --git a/tools/include/nolibc/nolibc.h b/tools/include/nolibc/nolibc.h
index 
36ea7a02c7434cc006a7b12d413add04a2f85565..7d151776e47a30080dc3d961bcdeacd6ab558629
 100644
--- a/tools/include/nolibc/nolibc.h
+++ b/tools/include/nolibc/nolibc.h
@@ -103,6 +103,7 @@
 #include "sys/prctl.h"
 #include "sys/random.h"
 #include "sys/reboot.h"
+#include "sys/resource.h"
 #include "sys/stat.h"
 #include "sys/syscall.h"
 #include "sys/time.h"
diff --git a/tools/include/nolibc/sys.h b/tools/include/nolibc/sys.h
index 
6c89dd0316dd0ebf03ebde6fa5c14273df6a0c62..282909b1992d17d5292917fbabb0bd2a2dbfce88
 100644
--- a/tools/include/nolibc/sys.h
+++ b/tools/include/nolibc/sys.h
@@ -23,7 +23,6 @@
 #include 
 #include  /* for O_* and AT_* */
 #include   /* for statx() */
-#include 
 #include 
 
 #include "errno.h"
@@ -730,43 +729,6 @@ ssize_t read(int fd, void *buf, size_t count)
 }
 
 
-/*
- * int getrlimit(int resource, struct rlimit *rlim);
- * int setrlimit(int resource, const struct rlimit *rlim);
- */
-
-static __attribute__((unused))
-int sys_prlimit64(pid_t pid, int resource,
- const struct rlimit64 *new_limit, struct rlimit64 *old_limit)
-{
-   return my_syscall4(__NR_prlimit64, pid, resource, new_limit, old_limit);
-}
-
-static __attribute__((unused))
-int getrlimit(int resource, struct rlimit *rlim)
-{
-   struct rlimit64 rlim64;
-   int ret;
-
-   ret = __sysret(sys_prlimit64(0, resource, NULL, &rlim64));
-   rlim->rlim_cur = rlim64.rlim_cur;
-   rlim->rlim_max = rlim64.rlim_max;
-
-   return ret;
-}
-
-static __attribute__((unused))
-int setrlimit(int resource, const struct rlimit *rlim)
-{
-   struct rlimit64 rlim64 = {
-   .rlim_cur = rlim->rlim_cur,
-   .rlim_max = rlim->rlim_max,
-   };
-
-   return __sysret(sys_prlimit64(0, resource, &rlim64, NULL));
-}
-
-
 /*
  * int sched_yield(void);
  */
diff --git a/tools/include/nolibc/sys/resource.h 
b/tools/include/nolibc/sys/resource.h
new file mode 100644
index 
..ac2705a67e8b3abacffa57f097095b98afa1469c
--- /dev/null
+++ b/tools/include/nolibc/sys/resource.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * Resource definitions for NOLIBC
+ * Copyright (C) 2017-2021 Willy Tarreau 
+ */
+
+/* make sure to include all global symbols */
+#include "nolibc.h"
+
+#ifndef _NOLIBC_SYS_RESOURCE_H
+#define _NOLIBC_SYS_RESOURCE_H
+
+#include "../sys.h"
+
+#include 
+
+/*
+ * int getrlimit(int resource, struct rlimit *rlim);
+ * int setrlimit(int resource, const struct rlimit *rlim);
+ */
+
+static __attribute__((unused))
+int sys_prlimit64(pid_t pid, int resource,
+ const struct rlimit64 *new_limit, struct rlimit64 *old_limit)
+{
+   return my_syscall4(__NR_prlimit64, pid, resource, new_limit, old_limit);
+}
+
+static __attribute__((unused))
+int getrlimit(int resource, struct rlimit *rlim)
+{
+   struct rlimit64 rlim64;
+   int ret;
+
+   ret = __sysret(sys_prlimit64(0, resource, NULL, &rlim64));
+   rlim->rlim_cur = rlim64.rlim_cur;
+   rlim->rlim_max = rlim64.rlim_max;
+
+   return ret;
+}
+
+static __attribute__((unused))
+int setrlimit(int resource, const struct rlimit *rlim)
+{
+   struct rlimit64 rlim64 = {
+   .rlim_cur = rlim->rlim_cur,
+   .rlim_max = rlim->rlim_max,
+   };
+
+   return __sysret(sys_prlimit64(0, resource, &rlim64, NULL));
+}
+
+#endif /* _NOLIBC_SYS_RESOURCE_H */
diff --git a/tools/include/nolibc/types.h b/tools/include/nolibc/types.h
index 
74c7694b2d5e54f7a86854697ac32a2ea2b62e86..2225c9388a46480a18e9ce603b08d846e5276bab
 100644
--- a/tools/include/nolibc/types.h
+++ b/tools/include/nolibc/types.h
@@ -15,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 
 
 /* Only the generic macros and types may be defined here. The arch-specific

-- 
2.49.0

Re: [PATCH 1/2] selftests/mm: skip uffd tests in madv_guard if uffd is not present.

2025-05-15 Thread Lorenzo Stoakes

Ah you got to this first :) thanks!

Could you do this with a cover letter though? It's really weird to have 2/2
reply to 1/2, I know sometimes people do that, but it's just odd, and it'd be
good to have an overview, thanks!

On Thu, May 15, 2025 at 02:23:32PM -0400, Zi Yan wrote:
> When userfaultfd is not compiled into kernel, userfaultfd() returns -1,
> causing uffd tests in madv_guard fail. Skip the tests instead.

'madv_guard'? I'd just say the guard_regions.uffd test to fail.

>
> Signed-off-by: Zi Yan 
> ---
>  tools/testing/selftests/mm/guard-regions.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/guard-regions.c 
> b/tools/testing/selftests/mm/guard-regions.c
> index 0cd9d236649d..93af3d3760f9 100644
> --- a/tools/testing/selftests/mm/guard-regions.c
> +++ b/tools/testing/selftests/mm/guard-regions.c
> @@ -1453,8 +1453,21 @@ TEST_F(guard_regions, uffd)
>
>   /* Set up uffd. */
>   uffd = userfaultfd(0);
> - if (uffd == -1 && errno == EPERM)
> - ksft_exit_skip("No userfaultfd permissions, try running as 
> root.\n");

Let's just make this all part of the same switch please!

And while I originally used ksft_exit_skip(), I think we can just use the
SKIP(return, ...) form here just fine to keep it consistent.

> + if (uffd == -1) {
> + switch (errno) {
> + case EPERM:
> + SKIP(return, "No userfaultfd permissions, try running 
> as root.");
> + break;
> + case ENOSYS:
> + SKIP(return, "userfaultfd is not supported/not 
> enabled.");
> + break;
> + default:
> + ksft_exit_fail_msg("userfaultfd failed with %s\n",
> +strerror(errno));
> + break;
> + }
> + }
> +
>   ASSERT_NE(uffd, -1);
>
>   ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
> --
> 2.47.2
>

Thanks!

Re: [PATCH 1/2] selftests/mm: skip uffd tests in madv_guard if uffd is not present.

2025-05-15 Thread Pedro Falcato

On Thu, May 15, 2025 at 02:23:32PM -0400, Zi Yan wrote:
> When userfaultfd is not compiled into kernel, userfaultfd() returns -1,
> causing uffd tests in madv_guard fail. Skip the tests instead.
> 
> Signed-off-by: Zi Yan 

Reviewed-by: Pedro Falcato 

Thanks!

-- 
Pedro

Re: [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers

On Thu, May 15, 2025, Dapeng Mi wrote:
> On 5/15/2025 8:37 AM, Sean Christopherson wrote:
> >> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> >> index 153972e944eb..eba086ef5eca 100644
> >> --- a/arch/x86/kvm/svm/pmu.c
> >> +++ b/arch/x86/kvm/svm/pmu.c
> >> @@ -198,12 +198,20 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
> >>pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters,
> >> kvm_pmu_cap.num_counters_gp);
> >>  
> >> -  if (pmu->version > 1) {
> >> -  pmu->global_ctrl_rsvd = ~((1ull << pmu->nr_arch_gp_counters) - 
> >> 1);
> >> +  if (kvm_pmu_cap.version > 1) {

ARGH!  I saw the pmu->version => kvm_pmu_cap.version change when going 
through
this patch, but assumed it was simply a refactoring bug and ignored it.

Nope, 100% intentional, as I discovered after spending the better part of an 
hour
debugging.  Stuffing pmu->global_ctrl when it doesn't exist is necessary when 
the
mediated PMU is enabled, because pmu->global_ctrl will always be loaded in 
hardware.
And so loading '0' means the PMU is effectively disabled, because KVM disallows 
the
guest from writing the MSR.

_That_ is the type of thing that absolutely needs a comment and a lengthy 
explanation
in the changelog.

Intel also needs the same treatment for PMU v1.  And since there's no hackery 
that
I can see, that suggests PMU v1 guests haven't been tested with the mediated PMU
on Intel.

I guess congratulations are in order, because this patch just became my new goto
example of why I'm so strict about on the "one thing per patch" rule.

> > It's not just global_ctrl.  PEBS and the fixed counters also depend on v2+ 
> > (the
> > SDM contradicts itself; KVM's ABI is that they're v2+).
> >
> >> +  /*
> >> +   * At RESET, AMD CPUs set all enable bits for general purpose 
> >> counters in
> >> +   * IA32_PERF_GLOBAL_CTRL (so that software that was written for 
> >> v1 PMUs
> >> +   * don't unknowingly leave GP counters disabled in the global 
> >> controls).
> >> +   * Emulate that behavior when refreshing the PMU so that 
> >> userspace doesn't
> >> +   * need to manually set PERF_GLOBAL_CTRL.
> >> +   */
> >> +  pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
> >> +  pmu->global_ctrl_rsvd = ~pmu->global_ctrl;
> >>pmu->global_status_rsvd = pmu->global_ctrl_rsvd;
> >>}

Re: [PATCH v4 05/38] perf: Add generic exclude_guest support

On Thu, May 15, 2025, Kan Liang wrote:
> On 2025-05-14 7:19 p.m., Sean Christopherson wrote:
> >> This naming is confusing on purpose? Pick either guest/host and stick
> >> with it.
> > 
> > +1.  I also think the inner perf_host_{enter,exit}() helpers are superflous.
> > These flows
> > 
> > After a bit of hacking, and with a few spoilers, this is what I ended up 
> > with
> > (not anywhere near fully tested).  I like following KVM's 
> > kvm_xxx_{load,put}()
> > nomenclature to tie everything together, so I went with "guest" instead of 
> > "host"
> > even though the majority of work being down is to shedule out/in host 
> > context.
> > 
> > /* When loading a guest's mediated PMU, schedule out all exclude_guest 
> > events. */
> > void perf_load_guest_context(unsigned long data)
> > {
> > struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> > 
> > lockdep_assert_irqs_disabled();
> > 
> > perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> > 
> > if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
> > goto unlock;
> > 
> > perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> > ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
> > perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> > if (cpuctx->task_ctx) {
> > perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> > task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> > perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> > }
> > 
> > arch_perf_load_guest_context(data);
> > 
> > __this_cpu_write(guest_ctx_loaded, true);
> > 
> > unlock:
> > perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> > }
> > EXPORT_SYMBOL_GPL(perf_load_guest_context);
> > 
> > void perf_put_guest_context(void)
> > {
> > struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> > 
> > lockdep_assert_irqs_disabled();
> > 
> > perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> > 
> > if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
> > goto unlock;
> > 
> > arch_perf_put_guest_context();
> 
> It will set the guest_ctx_loaded to false.
> The update_context_time() invoked in the perf_event_sched_in() will not
> get a chance to update the guest time.

The guest_ctx_loaded in arch/x86/events/core.c is a different variable, it just
happens to have the same name.

It's completely gross, but exposing guest_ctx_loaded outside of 
kernel/events/core.c
didn't seem like a great alternative.  If we wanted to use a single variable,
then the writes in arch_perf_{load,put}_guest_context() can simply go away.

Re: [PATCH v2 4/4] media: i2c: imx214: Remove hard-coded external clock frequency

2025-05-15 Thread Sakari Ailus

Hi Laurent,

On Thu, May 15, 2025 at 03:03:40PM +0200, Laurent Pinchart wrote:
> On Thu, May 15, 2025 at 02:54:54PM +0300, Sakari Ailus wrote:
> > On Thu, May 15, 2025 at 10:58:46AM +0200, Laurent Pinchart wrote:
> > > On Tue, May 06, 2025 at 08:24:03AM +, Sakari Ailus wrote:
> > > > On Mon, May 05, 2025 at 11:05:56PM +0200, André Apitzsch via B4 Relay 
> > > > wrote:
> > > > > From: André Apitzsch 
> > > > > 
> > > > > Instead rely on the rate set on the clock (using assigned-clock-rates
> > > > > etc.)
> > > > > 
> > > > > Signed-off-by: André Apitzsch 
> > > > > ---
> > > > >  drivers/media/i2c/imx214.c | 6 --
> > > > >  1 file changed, 6 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/media/i2c/imx214.c b/drivers/media/i2c/imx214.c
> > > > > index 
> > > > > 9e9be47394ec768a5b34d44b06b5bbb0988da5a1..c12996e294dccebb18c608254f1e0d14dc064423
> > > > >  100644
> > > > > --- a/drivers/media/i2c/imx214.c
> > > > > +++ b/drivers/media/i2c/imx214.c
> > > > > @@ -32,7 +32,6 @@
> > > > >  
> > > > >  #define IMX214_REG_FAST_STANDBY_CTRL CCI_REG8(0x0106)
> > > > >  
> > > > > -#define IMX214_DEFAULT_CLK_FREQ  2400
> > > > >  #define IMX214_DEFAULT_LINK_FREQ 6
> > > > >  /* Keep wrong link frequency for backward compatibility */
> > > > >  #define IMX214_DEFAULT_LINK_FREQ_LEGACY  48000
> > > > > @@ -1405,11 +1404,6 @@ static int imx214_probe(struct i2c_client 
> > > > > *client)
> > > > >   return dev_err_probe(dev, PTR_ERR(imx214->xclk),
> > > > >"failed to get xclk\n");
> > > > >  
> > > > > - ret = clk_set_rate(imx214->xclk, IMX214_DEFAULT_CLK_FREQ);
> > > > > - if (ret)
> > > > > - return dev_err_probe(dev, ret,
> > > > > -  "failed to set xclk frequency\n");
> > > > > -
> > > > 
> > > > Oops. I missed this is what the driver was doing already. Indeed, this 
> > > > is
> > > > one of the historic sensor drivers that do set the frequency in DT 
> > > > systems.
> > > > 
> > > > The driver never used the clock-frequency property and instead used a 
> > > > fixed
> > > > frequency. Changing the behaviour now could be problematic.
> > > > 
> > > > There are options here that I think we could do:
> > > > 
> > > > 1) use your v1 patch (4) which uses "clock-frequency" if it exists and
> > > > otherwise uses the default, fixed frequency or
> > > > 
> > > > 2) set the frequency only if the "clock-frequency" property exists. The 
> > > > DT
> > > > currently requires clock-frequency and the YAML conversion was done in 
> > > > 2020
> > > > whereas the driver is from 2018. If we do this, the clock-frequency 
> > > > should
> > > > be deprecated (or even removed from bingings).
> > > > 
> > > > I wonder what others think. Cc'd Laurent in any case.
> > > 
> > > Maybe I'm missing something, but I don't really see the issue here. The
> > > clock-frequency DT property is currently ignored, and this patch doesn't
> > > change that situation, does it ?
> > > 
> > > The change of behaviour here is related to the assigned-clock-rates
> > > property. If that property is specified today, it will set the clock
> > > rate, and the driver will override it to 24MHz right after. With this
> > > patch, the clock rate won't be overridden. I think the risk of
> > > regression is very low here, as I don't expect systems to set
> > > assigned-clock-rates in DT to a value different than 24MHz and expect
> > > the driver to override it.
> > 
> > If the DTS had assigned-clock-rates set correctly, then yes. How much can
> > we trust the older DTS did have that?
> 
> I am relatively confident that DT-based systems wouldn't have an
> assigned-clock-rates property with a frequency that doesn't match
> IMX214_DEFAULT_CLK_FREQ. The real question is whether or not I'm
> over-confident :-)

The assigned-clock stuff wasn't always there. But nowadays I guess a lot
of things in practice depends on it.

So I guess doing this should be fine then.

-- 
Sakari Ailus

Re: [PATCH 1/2] selftests/mm: skip uffd tests in madv_guard if uffd is not present.

2025-05-15 Thread Lorenzo Stoakes

On Thu, May 15, 2025 at 02:46:41PM -0400, Zi Yan wrote:
> On 15 May 2025, at 14:41, Lorenzo Stoakes wrote:
>
> > Ah you got to this first :) thanks!
> >
> > Could you do this with a cover letter though? It's really weird to have 2/2
> > reply to 1/2, I know sometimes people do that, but it's just odd, and it'd 
> > be
> > good to have an overview, thanks!
> >
> > On Thu, May 15, 2025 at 02:23:32PM -0400, Zi Yan wrote:
> >> When userfaultfd is not compiled into kernel, userfaultfd() returns -1,
> >> causing uffd tests in madv_guard fail. Skip the tests instead.
> >
> > 'madv_guard'? I'd just say the guard_regions.uffd test to fail.
>
> Sure. Will change it.
> >
> >>
> >> Signed-off-by: Zi Yan 

Given I was being an idiot below, now the patch is fine as-is, just resend
with the nitty commit message change and cover letter as a v2 and we should
be good :)

Reviewed-by: Lorenzo Stoakes 

> >> ---
> >>  tools/testing/selftests/mm/guard-regions.c | 17 +++--
> >>  1 file changed, 15 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/tools/testing/selftests/mm/guard-regions.c 
> >> b/tools/testing/selftests/mm/guard-regions.c
> >> index 0cd9d236649d..93af3d3760f9 100644
> >> --- a/tools/testing/selftests/mm/guard-regions.c
> >> +++ b/tools/testing/selftests/mm/guard-regions.c
> >> @@ -1453,8 +1453,21 @@ TEST_F(guard_regions, uffd)
> >>
> >>/* Set up uffd. */
> >>uffd = userfaultfd(0);
> >> -  if (uffd == -1 && errno == EPERM)
> >> -  ksft_exit_skip("No userfaultfd permissions, try running as 
> >> root.\n");
> >
> > Let's just make this all part of the same switch please!
>
> What do you mean? EPERM is handled in the switch-case below.

Oh man, I'm the biggest idiot on Earth haha!

For some reason I read these '-'s as '+'s :

Yes please ignore the above, I therefore - like your approach - and am good
with it.

>
> >
> > And while I originally used ksft_exit_skip(), I think we can just use the
> > SKIP(return, ...) form here just fine to keep it consistent.
>
> Right. I am using SKIP below, since when I ran it, ksft_exit_skip()
> makes the whole test message inconsistent.

Yes, your confusion is warranted, I apparently can't read... :>) and
indeed, agreed.

Thanks for doing this!

>
> >
> >> +  if (uffd == -1) {
> >> +  switch (errno) {
> >> +  case EPERM:
> >> +  SKIP(return, "No userfaultfd permissions, try running 
> >> as root.");
> >> +  break;
> >> +  case ENOSYS:
> >> +  SKIP(return, "userfaultfd is not supported/not 
> >> enabled.");
> >> +  break;
> >> +  default:
> >> +  ksft_exit_fail_msg("userfaultfd failed with %s\n",
> >> + strerror(errno));
> >> +  break;
> >> +  }
> >> +  }
> >> +
> >>ASSERT_NE(uffd, -1);
> >>
> >>ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
> >> --
> >> 2.47.2
> >>
> >
> > Thanks!
>
>
> --
> Best Regards,
> Yan, Zi

[PATCH net] selftests: net: validate team flags propagation

2025-05-15 Thread Stanislav Fomichev

Cover three recent cases:
1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked
   with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags

Verified that each one triggers when the respective fix is reverted.
Not sure about the placement, but since it all relies on teaming,
added to the teaming directory.

One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way
to trigger netdev_sync_lower_features without it.

Signed-off-by: Stanislav Fomichev 
---
 drivers/net/netdevsim/netdev.c|  2 +
 net/core/dev.c| 10 ++-
 .../selftests/drivers/net/team/Makefile   |  2 +-
 .../testing/selftests/drivers/net/team/config |  1 +
 .../selftests/drivers/net/team/propagation.sh | 79 +++
 5 files changed, 92 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/team/propagation.sh

diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 0e0321a7ddd7..3bd1f8cffee8 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -879,11 +879,13 @@ static void nsim_setup(struct net_device *dev)
 NETIF_F_SG |
 NETIF_F_FRAGLIST |
 NETIF_F_HW_CSUM |
+NETIF_F_LRO |
 NETIF_F_TSO;
dev->hw_features |= NETIF_F_HW_TC |
NETIF_F_SG |
NETIF_F_FRAGLIST |
NETIF_F_HW_CSUM |
+   NETIF_F_LRO |
NETIF_F_TSO;
dev->max_mtu = ETH_MAX_MTU;
dev->xdp_features = NETDEV_XDP_ACT_HW_OFFLOAD;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0d891634c692..4debd4b8e0f5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9188,8 +9188,16 @@ static int __dev_set_promiscuity(struct net_device *dev, 
int inc, bool notify)
 
dev_change_rx_flags(dev, IFF_PROMISC);
}
-   if (notify)
+   if (notify) {
+   /* The ops lock is only required to ensure consistent locking
+* for `NETDEV_CHANGE` notifiers. This function is sometimes
+* called without the lock, even for devices that are ops
+* locked, such as in `dev_uc_sync_multiple` when using
+* bonding or teaming.
+*/
+   netdev_ops_assert_locked(dev);
__dev_notify_flags(dev, old_flags, IFF_PROMISC, 0, NULL);
+   }
return 0;
 }
 
diff --git a/tools/testing/selftests/drivers/net/team/Makefile 
b/tools/testing/selftests/drivers/net/team/Makefile
index 2d5a76d99181..eaf6938f100e 100644
--- a/tools/testing/selftests/drivers/net/team/Makefile
+++ b/tools/testing/selftests/drivers/net/team/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 # Makefile for net selftests
 
-TEST_PROGS := dev_addr_lists.sh
+TEST_PROGS := dev_addr_lists.sh propagation.sh
 
 TEST_INCLUDES := \
../bonding/lag_lib.sh \
diff --git a/tools/testing/selftests/drivers/net/team/config 
b/tools/testing/selftests/drivers/net/team/config
index b5e3a3aad4bf..636b3525b679 100644
--- a/tools/testing/selftests/drivers/net/team/config
+++ b/tools/testing/selftests/drivers/net/team/config
@@ -1,5 +1,6 @@
 CONFIG_DUMMY=y
 CONFIG_IPV6=y
 CONFIG_MACVLAN=y
+CONFIG_NETDEVSIM=m
 CONFIG_NET_TEAM=y
 CONFIG_NET_TEAM_MODE_LOADBALANCE=y
diff --git a/tools/testing/selftests/drivers/net/team/propagation.sh 
b/tools/testing/selftests/drivers/net/team/propagation.sh
new file mode 100755
index ..849a5f2cb3a7
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/team/propagation.sh
@@ -0,0 +1,79 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+NSIM_LRO_ID=$((256 + RANDOM % 256))
+NSIM_LRO_SYS=/sys/bus/netdevsim/devices/netdevsim$NSIM_LRO_ID
+
+NSIM_DEV_SYS_NEW=/sys/bus/netdevsim/new_device
+NSIM_DEV_SYS_DEL=/sys/bus/netdevsim/del_device
+
+cleanup()
+{
+   ip link del dummyteam &>/dev/null
+   ip link del team0 &>/dev/null
+   echo $NSIM_LRO_ID > $NSIM_DEV_SYS_DEL
+}
+
+# Trigger LRO propagation to the lower.
+# https://lore.kernel.org/netdev/aBvOpkIoxcr9PfDg@mini-arch/
+team_lro()
+{
+   # using netdevsim because it supports NETIF_F_LRO
+   NSIM_LRO_NAME=$(find $NSIM_LRO_SYS/net -maxdepth 1 -type d ! \
+   -path $NSIM_LRO_SYS/net -exec basename {} \;)
+
+   ip link add name team0 type team
+   ip link set $NSIM_LRO_NAME down
+   ip link set dev $NSIM_LRO_NAME master team0
+   ip link set team0 up
+   ethtool -K team0 large-receive-offload off
+
+   ip link del team0
+}
+
+# Trigger promisc propagation to the lower during IFLA_MASTER.
+# https://lore.kernel.org/netdev/20250506032328.3003050-1-...@fomichev.me/
+team_promisc()
+{
+   ip link add name dummyteam type dummy

Re: [PATCH v4 05/38] perf: Add generic exclude_guest support

2025-05-15 Thread Liang, Kan




On 2025-05-14 7:19 p.m., Sean Christopherson wrote:
> On Fri, Apr 25, 2025, Peter Zijlstra wrote:
>> On Mon, Mar 24, 2025 at 05:30:45PM +, Mingwei Zhang wrote:
>>
>>> @@ -6040,6 +6041,71 @@ void perf_put_mediated_pmu(void)
>>>  }
>>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>>  
>>> +static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
>>> +{
>>> +   perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>> +   ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
>>> +   perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> +   if (cpuctx->task_ctx) {
>>> +   perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>>> +   task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>>> +   perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>> +   }
>>> +}
>>> +
>>> +/* When entering a guest, schedule out all exclude_guest events. */
>>> +void perf_guest_enter(void)
>>> +{
>>> +   struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>> +
>>> +   lockdep_assert_irqs_disabled();
>>> +
>>> +   perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>> +
>>> +   if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
>>> +   goto unlock;
>>> +
>>> +   perf_host_exit(cpuctx);
>>> +
>>> +   __this_cpu_write(perf_in_guest, true);
>>> +
>>> +unlock:
>>> +   perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>> +}
>>> +EXPORT_SYMBOL_GPL(perf_guest_enter);
>>> +
>>> +static inline void perf_host_enter(struct perf_cpu_context *cpuctx)
>>> +{
>>> +   perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>> +   if (cpuctx->task_ctx)
>>> +   perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>>> +
>>> +   perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
>>> +
>>> +   if (cpuctx->task_ctx)
>>> +   perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>> +   perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> +}
>>> +
>>> +void perf_guest_exit(void)
>>> +{
>>> +   struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>> +
>>> +   lockdep_assert_irqs_disabled();
>>> +
>>> +   perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>> +
>>> +   if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>> +   goto unlock;
>>> +
>>> +   perf_host_enter(cpuctx);
>>> +
>>> +   __this_cpu_write(perf_in_guest, false);
>>> +unlock:
>>> +   perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>> +}
>>> +EXPORT_SYMBOL_GPL(perf_guest_exit);
>>
>> This naming is confusing on purpose? Pick either guest/host and stick
>> with it.
> 
> +1.  I also think the inner perf_host_{enter,exit}() helpers are superflous.
> These flows
> 
> After a bit of hacking, and with a few spoilers, this is what I ended up with
> (not anywhere near fully tested).  I like following KVM's kvm_xxx_{load,put}()
> nomenclature to tie everything together, so I went with "guest" instead of 
> "host"
> even though the majority of work being down is to shedule out/in host context.
> 
> /* When loading a guest's mediated PMU, schedule out all exclude_guest 
> events. */
> void perf_load_guest_context(unsigned long data)
> {
>   struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> 
>   lockdep_assert_irqs_disabled();
> 
>   perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> 
>   if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
>   goto unlock;
> 
>   perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>   ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
>   perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>   if (cpuctx->task_ctx) {
>   perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>   task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>   perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>   }
> 
>   arch_perf_load_guest_context(data);
> 
>   __this_cpu_write(guest_ctx_loaded, true);
> 
> unlock:
>   perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> }
> EXPORT_SYMBOL_GPL(perf_load_guest_context);
> 
> void perf_put_guest_context(void)
> {
>   struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> 
>   lockdep_assert_irqs_disabled();
> 
>   perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> 
>   if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
>   goto unlock;
> 
>   arch_perf_put_guest_context();

It will set the guest_ctx_loaded to false.
The update_context_time() invoked in the perf_event_sched_in() will not
get a chance to update the guest time.

I think something as below should work.

- Disable all in the PMU (disable global control)
- schedule in the host counters (but not run yet since the global
control of the PMU is disabled)
- arch_perf_put_guest_context()
- Enable all in the PMU (Enable global control. The host counters now start)

void perf_put_guest_context(void)
{
struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);

lockdep_assert_irqs_disabled();

perf_ctx_lock(cpuctx, cpuctx->task_ctx);

if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))

Re: [PATCH 1/2] selftests/mm: skip uffd tests in madv_guard if uffd is not present.