Re: [PATCH RFC net-next v2] page_pool: import Jesper's page_pool benchmark
Hi all,

This is very useful.

On Wed, 28 May 2025 at 16:51, Arnaldo Carvalho de Melo wrote:
>
> On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
> > Mina Almasry writes:
> > > On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen
> > > wrote:
> > >> Back when you posted the first RFC, Jesper and I chatted about ways to
> > >> avoid the ugly "load module and read the output from dmesg" interface to
> > >> the test.
> > >
> > > I agree the existing interface is ugly.
> > >
> > >> One idea we came up with was to make the module include only the "inner"
> > >> functions for the benchmark, and expose those to BPF as kfuncs. Then the
> > >> test runner can be a BPF program that runs the tests, collects the data
> > >> and passes it to userspace via maps or a ringbuffer or something. That's
> > >> a nicer and more customisable interface than the printk output. And if
> > >> they're small enough, maybe we could even include the functions into the
> > >> page_pool code itself, instead of in a separate benchmark module?
> > >>
> > >> WDYT of that idea? :)
> > >
> > > ...but this sounds like an enormous amount of effort, for something
> > > that is a bit ugly but isn't THAT bad. Especially for me, I'm not that
> > > much of an expert that I know how to implement what you're referring
> > > to off the top of my head. I normally am open to spending time but
> > > this is not that high on my todolist and I have limited bandwidth to
> > > resolve this :(
> > >
> > > I also feel that this is something that could be improved post merge.
>
> agreed
>
> > > I think it's very beneficial to have this merged in some form that can
> > > be improved later. Byungchul is making a lot of changes to these mm
> > > things and it would be nice to have an easy way to run the benchmark
> > > in tree and maybe even get automated results from nipa. If we could
> > > agree on mvp that is appropriate to merge without too much scope creep
> > > that would be ideal from my side at least.
> >
> > Right, fair. I guess we can merge it as-is, and then investigate whether
> > we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
>
> tldr; I'd advise to merge it as-is, then kfunc'ify parts of it and use
> it from a 'perf bench' suite.
>
> Yeah, the model would be what I did for uprobes, but even then there is
> a selftests based uprobes benchmark ;-)
>
> The 'perf bench' part, that calls into the skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/uprobe.c
>
> The skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel/bench_uprobe.bpf.c
>
> While this one is just to generate BPF load to measure the impact on
> uprobes, for your case it would involve using a ring buffer to
> communicate from the skel (BPF/kernel side) to the userspace part,
> similar to what is done in various other BPF based perf tooling
> available in:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel
>
> Like at this line (BPF skel part):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/bpf_skel/off_cpu.bpf.c?h=perf-tools-next#n253
>
> The simplest part is in the canonical, standalone runqslower tool, also
> hosted in the kernel sources:
>
> BPF skel sending stuff to userspace:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.bpf.c#n99
>
> The userspace part that reads it:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n90
>
> This is a callback that gets called for every event that the BPF skel
> produces, called from this loop:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n162
>
> That handle_event callback was associated via:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n153
>
> There is a dissection I did about this process a long time ago, but
> still relevant, I think:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/33
>
> The part explaining the interaction userspace/kernel starts here:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/40
>
> (yeah, its http, but then, its _old_vger ;-)
>
> Doing it in perf is interesting because it gets widely packaged, so
> whatever you add to it gets visibility for people using 'perf bench' and
> also gets available in most places, it would add to this collection:
>
> root@number:~# perf bench
> Usage:
>         perf bench [<common options>] <collection> <benchmark> [<options>]
>
>         # List of all available benchmark collections:
>
>          sched: Scheduler and IPC benchmarks
>        syscall: System call benchmarks
>            mem: Memory access benchmarks
>           numa: NUMA scheduling and MM benchmarks
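For readers who want the shape of the pattern Arnaldo describes, a minimal
libbpf ring-buffer consumer looks roughly like this (a sketch only: the
pinned-map path, the event layout and handle_event() are illustrative
assumptions, not code from perf or runqslower):

#include <stdio.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

struct bench_event {			/* illustrative event layout */
	unsigned long long cycles;
	unsigned long long ns;
};

static int handle_event(void *ctx, void *data, size_t size)
{
	const struct bench_event *e = data;

	printf("Per elem: %llu cycles(tsc) %llu ns\n", e->cycles, e->ns);
	return 0;			/* 0 = keep polling */
}

int main(void)
{
	struct ring_buffer *rb;
	/* Assumed: the BPF side pinned its BPF_MAP_TYPE_RINGBUF map here. */
	int map_fd = bpf_obj_get("/sys/fs/bpf/bench_events");

	if (map_fd < 0)
		return 1;

	/* Associate the callback, like runqslower does for its events. */
	rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
	if (!rb)
		return 1;

	/* Each sample produced by the BPF skel invokes handle_event(). */
	while (ring_buffer__poll(rb, 100 /* timeout, ms */) >= 0)
		;

	ring_buffer__free(rb);
	return 0;
}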
Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > The walk_page_range_novma() function is rather confusing - it supports two
> > modes, one used often, the other used only for debugging.
> >
> > The first mode is the common case of traversal of kernel page tables, which
> > is what nearly all callers use this for.
>
> ... and what people should be using it for 🙂
>
> > Secondly it provides an unusual debugging interface that allows for the
> > traversal of page tables in a userland range of memory even for that memory
> > which is not described by a VMA.
> >
> > This is highly unusual and it is far from certain that such page tables
> > should even exist, but perhaps this is precisely why it is useful as a
> > debugging mechanism.
> >
> > As a result, this is utilised by ptdump only. Historically, things were
> > reversed - ptdump was the only user, and other parts of the kernel evolved
> > to use the kernel page table walking here.
> >
> > Since we have some complicated and confusing locking rules for the novma
> > case, it makes sense to separate the two usages into their own functions.
> >
> > Doing this also provide self-documentation as to the intent of the caller -
> > are they doing something rather unusual or are they simply doing a standard
> > kernel page table walk?
> >
> > We therefore maintain walk_page_range_novma() for this single usage, and
> > document the function as such.
>
> If we have to keep this dangerous interface, it should probably be
>
> walk_page_range_debug() or walk_page_range_dump()

We can also move it from include/linux/pagewalk.h to mm/internal.h

> > Note that ptdump uses the precise same function for kernel walking as a
> > convenience, so we permit this but make it very explicit by having
> > walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> >
> > We introduce walk_page_range_kernel() for the far more common case of
> > kernel page table traversal.
>
> I wonder if we should give it a completely different name scheme to
> highlight that this is something completely different.
>
> walk_kernel_page_table_range()
>
> etc.
>
> --
> Cheers,
>
> David / dhildenb

--
Sincerely yours,
Mike.
Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On 04.06.25 10:07, Mike Rapoport wrote:
> On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> > On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > > The walk_page_range_novma() function is rather confusing - it supports two
> > > modes, one used often, the other used only for debugging.
> > >
> > > The first mode is the common case of traversal of kernel page tables, which
> > > is what nearly all callers use this for.
> >
> > ... and what people should be using it for 🙂
> >
> > > Secondly it provides an unusual debugging interface that allows for the
> > > traversal of page tables in a userland range of memory even for that memory
> > > which is not described by a VMA.
> > >
> > > This is highly unusual and it is far from certain that such page tables
> > > should even exist, but perhaps this is precisely why it is useful as a
> > > debugging mechanism.
> > >
> > > As a result, this is utilised by ptdump only. Historically, things were
> > > reversed - ptdump was the only user, and other parts of the kernel evolved
> > > to use the kernel page table walking here.
> > >
> > > Since we have some complicated and confusing locking rules for the novma
> > > case, it makes sense to separate the two usages into their own functions.
> > >
> > > Doing this also provide self-documentation as to the intent of the caller -
> > > are they doing something rather unusual or are they simply doing a standard
> > > kernel page table walk?
> > >
> > > We therefore maintain walk_page_range_novma() for this single usage, and
> > > document the function as such.
> >
> > If we have to keep this dangerous interface, it should probably be
> >
> > walk_page_range_debug() or walk_page_range_dump()
>
> We can also move it from include/linux/pagewalk.h to mm/internal.h

Agreed.

--
Cheers,

David / dhildenb
Re: [PATCH] selftests/timers: Fix integer overflow errors on 32 bit systems
> So this seems to be undoing commit 80fa614e2fbc ("selftests: timers:
> Remove local NSEC_PER_SEC and USEC_PER_SEC defines")

Thanks John, I somehow missed 80fa614e2fbc and was wondering how this had
got in.

> Would it make more sense to fix the NSEC_PER_SEC definition in time64.h to a
> LL?

I was just thinking the same this morning but wasn't sure if this would
cause issues anywhere else. I'll send a patch to change NSEC_PER_SEC in
time64.h as it's a lot cleaner that way.

Thanks
Terry
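To make the overflow concrete, here is a sketch (illustrative, not code
from the patch), assuming the header defines the constant with a plain
'L' suffix:

/* On a 32-bit system 'long' is 32 bits, so the multiply wraps before
 * the result is widened to 64 bits: */
#define NSEC_PER_SEC		1000000000L
long long bad = 5 * NSEC_PER_SEC;	/* 5000000000 doesn't fit in 32 bits */

/* With an LL suffix the whole expression is evaluated in 64 bits: */
#define NSEC_PER_SEC_LL		1000000000LL
long long ok = 5 * NSEC_PER_SEC_LL;	/* correct on 32- and 64-bit alike */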
[GIT PULL] SPDX/LICENSES update for 6.16-rc1
The following changes since commit 0af2f6be1b4281385b618cb86ad946eded089ac8:

  Linux 6.15-rc1 (2025-04-06 13:11:33 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx.git tags/spdx-6.16-rc1

for you to fetch changes up to 59c11a7a9a138a28c7dff81032a7b7b6f1794540:

  LICENSES: add CC0-1.0 license text (2025-05-21 14:54:17 +0200)

----------------------------------------------------------------
LICENSES update for 6.16-rc1

Here is a single patch to the LICENSES/ directory to add the CC0 license
that is currently used in the kcpuid x86 tool for one of their files.
This fixes the error that spdxcheck.py currently has with the kcpuid file
due to a missing LICENSE file for this specific license.

This change has been in linux-next for weeks with no reported issues.

Signed-off-by: Greg Kroah-Hartman

----------------------------------------------------------------
Lukas Bulwahn (1):
      LICENSES: add CC0-1.0 license text

 LICENSES/deprecated/CC0-1.0 | 129 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100644 LICENSES/deprecated/CC0-1.0
Re: [PATCH v4 0/7] Add managed SOFT RESERVE resource handling
Smita,

Thanks for your awesome work. I just tested the scenarios you listed, and
they work as expected. Thanks again. (Minor comments inlined)

Tested-by: Li Zhijian

To the CXL community,

The scenarios mentioned here essentially cover what a correct firmware may
provide. However, I would like to discuss one more scenario that I can
simulate with a modified QEMU:

The E820 exposes a SOFT RESERVED region which is the same as a CFMW, but
the HDM decoders are not committed. This means no region will be
auto-created during boot. As an example, after boot, the iomem tree is as
follows:

  105000-304fff : CXL Window 0
    105000-304fff : Soft Reserved

In this case, the SOFT RESERVED resource is not trimmed, so the end-user
cannot create a new region.

My question is: Is this scenario a problem? If it is, should we fix it in
this patchset or create a new patch?

On 04/06/2025 06:19, Smita Koralahalli wrote:
> Add the ability to manage SOFT RESERVE iomem resources prior to them being
> added to the iomem resource tree. This allows drivers, such as CXL, to
> remove any pieces of the SOFT RESERVE resource that intersect with created
> CXL regions.
>
> The current approach of leaving the SOFT RESERVE resources as is can cause
> failures during hotplug of devices, such as CXL, because the resource is
> not available for reuse after teardown of the device.
>
> The approach is to add SOFT RESERVE resources to a separate tree during
> boot.

No special tree at all since V3

> This allows any drivers to update the SOFT RESERVE resources before
> they are merged into the iomem resource tree. In addition a notifier chain
> is added so that drivers can be notified when these SOFT RESERVE resources
> are added to the ioeme resource tree.
>
> The CXL driver is modified to use a worker thread that waits for the CXL
> PCI and CXL mem drivers to be loaded and for their probe routine to
> complete. Then the driver walks through any created CXL regions to trim any
> intersections with SOFT RESERVE resources in the iomem tree.
>
> The dax driver uses the new soft reserve notifier chain so it can consume
> any remaining SOFT RESERVES once they're added to the iomem tree.
>
> The following scenarios have been tested:
>
> Example 1: Exact alignment, soft reserved is a child of the region
>
> |-- "Soft Reserved" ---|
> |-- "Region #" --------|
>
> Before:
>   105000-304fff : CXL Window 0
>     105000-304fff : region0
>       105000-304fff : Soft Reserved
>         108000-2f : dax0.0

BTW, I'm curious how to set up a dax with an address range different from
its corresponding region.

>           108000-2f : System RAM (kmem)
>
> After:
>   105000-304fff : CXL Window 0
>     105000-304fff : region1
>       108000-2f : dax0.0
>         108000-2f : System RAM (kmem)
>
> Example 2: Start and/or end aligned and soft reserved spans multiple
> regions

Tested

> |--- "Soft Reserved" ---|
> |--- "Region #" ---|
> or
> |--- "Soft Reserved" ---|
> |--- "Region #" ---|

Typo? Should be:

|--- "Soft Reserved" ---|
      |--- "Region #" ---|

> Example 3: No alignment
>
> |-- "Soft Reserved" --|
> |--- "Region #" ---|

Tested.

Thanks
Zhijian
Re: [PATCH] module: make __mod_device_table__* symbols static
On 6/2/25 12:55 PM, Masahiro Yamada wrote:
> The __mod_device_table__* symbols are only parsed by modpost to generate
> MODULE_ALIAS() entries from MODULE_DEVICE_TABLE().
>
> Therefore, these symbols do not need to be globally visible, or globally
> unique.
>
> If they are in the global scope, we would worry about the symbol
> uniqueness, but modpost is fine with parsing multiple symbols with the
> same name.
>
> Signed-off-by: Masahiro Yamada

Reviewed-by: Petr Pavlu

--
Thanks,
Petr
Re: [PATCH bpf-next] selftests/bpf: Fix compile error of bin_attribute::read/write()
On Wed, Jun 04, 2025 at 01:53:22PM +0800, Rong Tao wrote:
> From: Rong Tao
>
> Since commit 97d06802d10a ("sysfs: constify bin_attribute argument of
> bin_attribute::read/write()"), make bin_attribute parameter of
> bin_attribute::read/write() const.

hi,
there's already a fix for this in bpf/master

thanks,
jirka

>
> Signed-off-by: Rong Tao
> ---
>  tools/testing/selftests/bpf/test_kmods/bpf_testmod.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
> index e6c248e3ae54..e9e918cdf31f 100644
> --- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
> +++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
> @@ -385,7 +385,7 @@ int bpf_testmod_fentry_ok;
>
>  noinline ssize_t
>  bpf_testmod_test_read(struct file *file, struct kobject *kobj,
> -		      struct bin_attribute *bin_attr,
> +		      const struct bin_attribute *bin_attr,
>  		      char *buf, loff_t off, size_t len)
>  {
>  	struct bpf_testmod_test_read_ctx ctx = {
> @@ -465,7 +465,7 @@ ALLOW_ERROR_INJECTION(bpf_testmod_test_read, ERRNO);
>
>  noinline ssize_t
>  bpf_testmod_test_write(struct file *file, struct kobject *kobj,
> -		       struct bin_attribute *bin_attr,
> +		       const struct bin_attribute *bin_attr,
>  		       char *buf, loff_t off, size_t len)
>  {
>  	struct bpf_testmod_test_write_ctx ctx = {
> @@ -567,7 +567,7 @@ static void testmod_unregister_uprobe(void)
>
>  static ssize_t
>  bpf_testmod_uprobe_write(struct file *file, struct kobject *kobj,
> -			 struct bin_attribute *bin_attr,
> +			 const struct bin_attribute *bin_attr,
>  			 char *buf, loff_t off, size_t len)
>  {
>  	unsigned long offset = 0;
> --
> 2.49.0
>
[Question] attributes encoding in BTF
Hi all,

A simpler version of this series has been merged, and so I am now taking a
look at the issue I have put aside in the merged version: dealing with more
specific data layout for arguments passed on stack.

For example, a function can pass small structs on stack, but this needs
special care when generating the corresponding bpf trampolines. Those
structs have specific alignment specified by the target ABI, but it can
also be altered with attributes packing the structure or modifying the
alignment. Some platforms already support structs on stack (see the
tracing_struct_many_args test), but as discussed earlier, those may suffer
from the same kind of issue mentioned above.

On Wed Apr 23, 2025 at 9:24 PM CEST, Alexis Lothoré wrote:
> Hi Andrii,
>
> On Wed Apr 23, 2025 at 7:15 PM CEST, Andrii Nakryiko wrote:
>> On Thu, Apr 17, 2025 at 12:14 AM Alexis Lothoré
>> wrote:
>>>
>>> Hi Andrii,
>>>
>>> On Wed Apr 16, 2025 at 11:24 PM CEST, Andrii Nakryiko wrote:

[...]

>> I'd suggest looking at btf__align_of() in libbpf (tools/lib/bpf/btf.c)
>> to see how we calculate alignment there. It seems to work decently
>> enough. It won't cover any arch-specific extra rules like double
>> needing 16-byte alignment (I vaguely remember something like that for
>> some architectures, but I might be misremembering), or anything
>> similar. It also won't detect (I don't think it's possible without
>> DWARF) artificially increased alignment with attribute((aligned(N))).
>
> Thanks for the pointer, I'll take a look at it. The more we discuss this
> series, the less member size sounds relevant for what I'm trying to achieve
> here.
>
> Following Xu's comments, I have been thinking about how I could detect the
> custom alignments and packing on structures, and I was wondering if I could
> somehow benefit from __attribute__ encoding in BTF info ([1]). But
> following your hint, I also see some btf_is_struct_packed() in
> tools/lib/bpf/btf_dump.c that could help. I'll dig this further and see if
> I can manage to make something work with all of this.

Andrii's comment above illustrates well my current issue: when functions
pass arguments on stack, we are missing info for some of them to correctly
build trampolines, especially for structs, which can have attributes like
__attribute__((packed)) or __attribute__((aligned(x))).

[1] seems to be a recent solution implemented for BTF to cover this need.
IIUC it encodes any arbitrary attribute affecting a data type or function,
so if I have some struct like this one in my kernel or a module:

struct foo {
	short b;
	int a;
} __packed;

I would expect the corresponding BTF data to have some BTF_KIND_DECL_TAG
describing the "packed" attribute for the corresponding structure, but I
fail to find any of those when running:

$ bpftool btf dump file vmlinux format raw

In there I see some DECL_TAG, but those are mostly 'bpf_kfunc'; I see no
arbitrary attribute like 'packed' or 'aligned(x)'.

What I really need to do in the end is to be able to parse those alignment
attribute infos in the kernel at runtime when generating trampolines, but
I hoped to be able to see them with bpftool first to validate the concept.

I started taking a look further at this and stumbled upon [2] in which
Alan gives many more details about the feature, so I did the following
checks:

- kernel version 6.15.0-rc4 from bpf-next_base: it contains the updated
  Makefile.btf calling pahole with `--btf_features=attributes`
- pahole v1.30

  $ pahole --supported_btf_features
  encode_force,var,float,decl_tag,type_tag,enum64,optimized_func,consistent_func,decl_tag_kfuncs,reproducible_build,distilled_base,global_var,attributes

  This pahole comes from my distro pkg manager, but I have also done the
  same test with a freshly built pahole, while taking care of pulling the
  libbpf submodule.
- bpftool v7.6.0

  bpftool v7.6.0
  using libbpf v1.6
  features: llvm, skeletons

Could I be missing something obvious? Or did I misunderstand the actual
attribute encoding feature?

Thanks,

Alexis

[1] https://lore.kernel.org/bpf/20250130201239.1429648-1-ihor.solod...@linux.dev/
[2] https://lore.kernel.org/all/CA+icZUW31vpS=r3zm6g4fmkzuiqovqtd+e-8ihwsk_a-qts...@mail.gmail.com/

--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Re: [PATCH RFC net-next v2 2/3] vsock/test: Introduce get_transports()
On Wed, May 28, 2025 at 10:44:42PM +0200, Michal Luczaj wrote:
> Return a bitmap of registered vsock transports. As guesstimated by
> grepping /proc/kallsyms (CONFIG_KALLSYMS=y) for known symbols of type
> `struct virtio_transport`.
>
> Suggested-by: Stefano Garzarella
> Signed-off-by: Michal Luczaj
> ---
>  tools/testing/vsock/util.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
>  tools/testing/vsock/util.h | 12 ++++++++++
>  2 files changed, 72 insertions(+)
>
> diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
> index b7b3fb2221c1682ecde58cf12e2f0b0ded1cff39..74fb52f566148b16436251216dd9d9275f0ec95b 100644
> --- a/tools/testing/vsock/util.c
> +++ b/tools/testing/vsock/util.c
> @@ -7,6 +7,7 @@
>   * Author: Stefan Hajnoczi
>   */
>
> +#include
>  #include
>  #include
>  #include
> @@ -17,6 +18,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #include "timeout.h"
> @@ -854,3 +856,61 @@ void enable_so_linger(int fd, int timeout)
>  		exit(EXIT_FAILURE);
>  	}
>  }
> +
> +static int __get_transports(void)
> +{
> +	/* Order must match transports defined in util.h.
> +	 * man nm: "d" The symbol is in the initialized data section.
> +	 */
> +	const char * const syms[] = {
> +		"d loopback_transport",
> +		"d virtio_transport",
> +		"d vhost_transport",
> +		"d vmci_transport",
> +		"d hvs_transport",
> +	};

I would move this array (or a macro that defines it) near the transports
defined in util.h, so they are close together and we can easily
update/review changes.

BTW what about adding static asserts to check we are aligned?

> +	char buf[KALLSYMS_LINE_LEN];
> +	int ret = 0;
> +	FILE *f;
> +
> +	f = fopen(KALLSYMS_PATH, "r");
> +	if (!f) {
> +		perror("Can't open " KALLSYMS_PATH);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	while (fgets(buf, sizeof(buf), f)) {
> +		char *match;
> +		int i;
> +
> +		assert(buf[strlen(buf) - 1] == '\n');
> +
> +		for (i = 0; i < ARRAY_SIZE(syms); ++i) {
> +			match = strstr(buf, syms[i]);
> +
> +			/* Match should be followed by '\t' or '\n'.
> +			 * See kallsyms.c:s_show().
> +			 */
> +			if (match && isspace(match[strlen(syms[i])])) {
> +				ret |= _BITUL(i);
> +				break;
> +			}
> +		}
> +	}
> +
> +	fclose(f);
> +	return ret;
> +}
> +
> +/* Return integer with TRANSPORT_* bit set for every (known) registered vsock
> + * transport.
> + */
> +int get_transports(void)
> +{
> +	static int tr = -1;
> +
> +	if (tr == -1)
> +		tr = __get_transports();
> +
> +	return tr;
> +}
> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
> index 0afe7cbae12e5194172c639ccfbeb8b81f7c25ac..63953e32c3e18e1aa5c2addcf6f09f433660fa84 100644
> --- a/tools/testing/vsock/util.h
> +++ b/tools/testing/vsock/util.h
> @@ -3,8 +3,19 @@
>  #define UTIL_H
>
>  #include
> +#include
>  #include
>
> +#define KALLSYMS_PATH "/proc/kallsyms"
> +#define KALLSYMS_LINE_LEN 512

We don't need to expose them in util.h IMO, we can keep them in util.c

> +
> +/* All known transports */
> +#define TRANSPORT_LOOPBACK	_BITUL(0)
> +#define TRANSPORT_VIRTIO	_BITUL(1)
> +#define TRANSPORT_VHOST		_BITUL(2)
> +#define TRANSPORT_VMCI		_BITUL(3)
> +#define TRANSPORT_HYPERV	_BITUL(4)
> +
>  /* Tests can either run as the client or the server */
>  enum test_mode {
>  	TEST_MODE_UNSET,
> @@ -82,4 +93,5 @@ void setsockopt_timeval_check(int fd, int level, int optname,
>  			      struct timeval val, char const *errmsg);
>  void enable_so_zerocopy_check(int fd);
>  void enable_so_linger(int fd, int timeout);
> +int get_transports(void);
>  #endif /* UTIL_H */
> --
> 2.49.0
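Stefano's static-assert suggestion could look something like this (a
sketch only, not part of the patch; TRANSPORT_NUM is an invented helper
macro):

#include <assert.h>

/* Keep next to the TRANSPORT_* defines so list and bits can't drift. */
#define TRANSPORT_NUM	5

static const char * const syms[] = {
	"d loopback_transport",	/* TRANSPORT_LOOPBACK = _BITUL(0) */
	"d virtio_transport",	/* TRANSPORT_VIRTIO   = _BITUL(1) */
	"d vhost_transport",	/* TRANSPORT_VHOST    = _BITUL(2) */
	"d vmci_transport",	/* TRANSPORT_VMCI     = _BITUL(3) */
	"d hvs_transport",	/* TRANSPORT_HYPERV   = _BITUL(4) */
};

static_assert(sizeof(syms) / sizeof(syms[0]) == TRANSPORT_NUM,
	      "syms[] must stay in sync with the TRANSPORT_* bits");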
Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On Wed, Jun 04, 2025 at 11:07:05AM +0300, Mike Rapoport wrote:
> On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> > On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > > The walk_page_range_novma() function is rather confusing - it supports two
> > > modes, one used often, the other used only for debugging.
> > >
> > > The first mode is the common case of traversal of kernel page tables, which
> > > is what nearly all callers use this for.
> >
> > ... and what people should be using it for 🙂
> >
> > > Secondly it provides an unusual debugging interface that allows for the
> > > traversal of page tables in a userland range of memory even for that memory
> > > which is not described by a VMA.
> > >
> > > This is highly unusual and it is far from certain that such page tables
> > > should even exist, but perhaps this is precisely why it is useful as a
> > > debugging mechanism.
> > >
> > > As a result, this is utilised by ptdump only. Historically, things were
> > > reversed - ptdump was the only user, and other parts of the kernel evolved
> > > to use the kernel page table walking here.
> > >
> > > Since we have some complicated and confusing locking rules for the novma
> > > case, it makes sense to separate the two usages into their own functions.
> > >
> > > Doing this also provide self-documentation as to the intent of the caller -
> > > are they doing something rather unusual or are they simply doing a standard
> > > kernel page table walk?
> > >
> > > We therefore maintain walk_page_range_novma() for this single usage, and
> > > document the function as such.
> >
> > If we have to keep this dangerous interface, it should probably be
> >
> > walk_page_range_debug() or walk_page_range_dump()
>
> We can also move it from include/linux/pagewalk.h to mm/internal.h

Yeah I was wondering about a rename actually, but then thought 'oh novma
kinda fits blah blah' but you're right, this is better.

Nice idea to move to mm/internal.h also :) I like this...

Will fixup on respin

> > > Note that ptdump uses the precise same function for kernel walking as a
> > > convenience, so we permit this but make it very explicit by having
> > > walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> > >
> > > We introduce walk_page_range_kernel() for the far more common case of
> > > kernel page table traversal.
> >
> > I wonder if we should give it a completely different name scheme to
> > highlight that this is something completely different.
> >
> > walk_kernel_page_table_range()
> >
> > etc.
> >
> > --
> > Cheers,
> >
> > David / dhildenb
>
> --
> Sincerely yours,
> Mike.
Re: [PATCH RFC net-next v2 1/3] vsock/test: Introduce vsock_bind_try() helper
On Wed, May 28, 2025 at 10:44:41PM +0200, Michal Luczaj wrote:
> Create a socket and bind() it. If binding failed, gracefully return an
> error code while preserving `errno`. Base vsock_bind() on top of it.
>
> Suggested-by: Stefano Garzarella
> Signed-off-by: Michal Luczaj
> ---
>  tools/testing/vsock/util.c | 24 +++++++++++++++++++---
>  tools/testing/vsock/util.h |  1 +
>  2 files changed, 22 insertions(+), 3 deletions(-)

LGTM!

Reviewed-by: Stefano Garzarella

> diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
> index 0c7e9cbcbc85cde9c8764fc3bb623cde2f6c77a6..b7b3fb2221c1682ecde58cf12e2f0b0ded1cff39 100644
> --- a/tools/testing/vsock/util.c
> +++ b/tools/testing/vsock/util.c
> @@ -121,15 +121,17 @@ bool vsock_wait_sent(int fd)
>  	return !ret;
>  }
>
> -/* Create socket , bind to and return the file descriptor. */
> -int vsock_bind(unsigned int cid, unsigned int port, int type)
> +/* Create socket , bind to .
> + * Return the file descriptor, or -1 on error.
> + */
> +int vsock_bind_try(unsigned int cid, unsigned int port, int type)
>  {
>  	struct sockaddr_vm sa = {
>  		.svm_family = AF_VSOCK,
>  		.svm_cid = cid,
>  		.svm_port = port,
>  	};
> -	int fd;
> +	int fd, saved_errno;
>
>  	fd = socket(AF_VSOCK, type, 0);
>  	if (fd < 0) {
> @@ -138,6 +140,22 @@ int vsock_bind(unsigned int cid, unsigned int port, int type)
>  	}
>
>  	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa))) {
> +		saved_errno = errno;
> +		close(fd);
> +		errno = saved_errno;
> +		fd = -1;
> +	}
> +
> +	return fd;
> +}
> +
> +/* Create socket , bind to and return the file descriptor. */
> +int vsock_bind(unsigned int cid, unsigned int port, int type)
> +{
> +	int fd;
> +
> +	fd = vsock_bind_try(cid, port, type);
> +	if (fd < 0) {
>  		perror("bind");
>  		exit(EXIT_FAILURE);
>  	}
> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
> index 5e2db67072d5053804a9bb93934b625ea78bcd7a..0afe7cbae12e5194172c639ccfbeb8b81f7c25ac 100644
> --- a/tools/testing/vsock/util.h
> +++ b/tools/testing/vsock/util.h
> @@ -44,6 +44,7 @@ int vsock_connect(unsigned int cid, unsigned int port, int type);
>  int vsock_accept(unsigned int cid, unsigned int port,
>  		 struct sockaddr_vm *clientaddrp, int type);
>  int vsock_stream_connect(unsigned int cid, unsigned int port);
> +int vsock_bind_try(unsigned int cid, unsigned int port, int type);
>  int vsock_bind(unsigned int cid, unsigned int port, int type);
>  int vsock_bind_connect(unsigned int cid, unsigned int port,
>  		       unsigned int bind_port, int type);
> --
> 2.49.0
Re: [PATCH bpf-next] selftests/bpf: Fix compile error of bin_attribute::read/write()
On 6/4/25 16:53, Jiri Olsa wrote:
> On Wed, Jun 04, 2025 at 01:53:22PM +0800, Rong Tao wrote:
> > From: Rong Tao
> >
> > Since commit 97d06802d10a ("sysfs: constify bin_attribute argument of
> > bin_attribute::read/write()"), make bin_attribute parameter of
> > bin_attribute::read/write() const.
>
> hi,
> there's already a fix for this in bpf/master
>
> thanks,
> jirka

I am confused, when should I use bpf/master [2] and when should I use
bpf-next/master [1]? thank you :)

[1] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
[2] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Re: [PATCH v4 3/7] cxl/pci: Add pci_loaded tracking to mark PCI driver readiness
On 04/06/2025 06:19, Smita Koralahalli wrote:
> Introduce a pci_loaded flag similar to mem_active, and define
> mark_cxl_pci_loaded() to indicate when the PCI driver has initialized.
>
> This will be used by other CXL components, such as cxl_acpi and the soft
> reserved resource handling logic, to coordinate initialization and ensure
> that dependent operations proceed only after both cxl_pci and cxl_mem
> drivers are ready.
>
> Co-developed-by: Nathan Fontenot
> Signed-off-by: Nathan Fontenot
> Co-developed-by: Terry Bowman
> Signed-off-by: Terry Bowman
> Signed-off-by: Smita Koralahalli
> ---
>  drivers/cxl/core/suspend.c | 8 ++++++++
>  drivers/cxl/cxlpci.h       | 1 +
>  drivers/cxl/pci.c          | 2 ++
>  3 files changed, 11 insertions(+)
>
> diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/suspend.c
> index 5ba4b4de0e33..72818a2c8ec8 100644
> --- a/drivers/cxl/core/suspend.c
> +++ b/drivers/cxl/core/suspend.c
> @@ -3,8 +3,10 @@
>  #include
>  #include
>  #include "cxlmem.h"
> +#include "cxlpci.h"
>
>  static atomic_t mem_active;
> +static atomic_t pci_loaded;

I find it odd to place these changes in suspend.c.

Also, I noticed that in a subsequent patch, DJ has mentioned (and I agree)
that this patch is unnecessary.

Thanks
Zhijian

>
>  bool cxl_mem_active(void)
>  {
> @@ -25,3 +27,9 @@ void cxl_mem_active_dec(void)
>  	atomic_dec(&mem_active);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_mem_active_dec, "CXL");
> +
> +void mark_cxl_pci_loaded(void)
> +{
> +	atomic_inc(&pci_loaded);
> +}
> +EXPORT_SYMBOL_NS_GPL(mark_cxl_pci_loaded, "CXL");
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..5a811ac63fcf 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -135,4 +135,5 @@ void read_cdat_data(struct cxl_port *port);
>  void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
> +void mark_cxl_pci_loaded(void);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 785aa2af5eaa..b019bd324dba 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1189,6 +1189,8 @@ static int __init cxl_pci_driver_init(void)
>  	if (rc)
>  		pci_unregister_driver(&cxl_pci_driver);
>
> +	mark_cxl_pci_loaded();
> +
>  	return rc;
>  }
Re: [QUESTION] problems report: rcu_read_unlock_special() called in irq_exit() causes dead loop
On 6/3/2025 11:37 PM, Qi Xi wrote:
> Hi Joel,
>
> The patch works as expected. Previously, the issue triggered a soft lockup
> within ~10 minutes, but after applying the fix, the system ran stably for
> 30+ minutes without any issues.

Great to hear! Thanks for testing. I/we will roll this into a proper patch
and will provide it once we have something ready.

 - Joel

> Thanks,
> Qi
>
> On 2025/6/4 11:25, Xiongfeng Wang wrote:
>> On 2025/6/4 9:35, Joel Fernandes wrote:
>>> On Tue, Jun 03, 2025 at 03:22:42PM -0400, Joel Fernandes wrote:
>>>> On 6/3/2025 3:03 PM, Joel Fernandes wrote:
>>>>> On 6/3/2025 2:59 PM, Joel Fernandes wrote:
>>>>>> On Fri, May 30, 2025 at 09:55:45AM +0800, Xiongfeng Wang wrote:
>>>>>>> Hi Joel,
>>>>>>>
>>>>>>> On 2025/5/29 0:30, Joel Fernandes wrote:
>>>>>>>> On Wed, May 21, 2025 at 5:43 AM Xiongfeng Wang wrote:
>>>>>>>>> Hi RCU experts,
>>>>>>>>>
>>>>>>>>> When I ran syzkaller in Linux 6.6 with CONFIG_PREEMPT_RCU enabled, I got
>>>>>>>>> the following soft lockup. The Calltrace is too long. I put it in the end.
>>>>>>>>> The issue can also be reproduced in the latest kernel.
>>>>>>>>>
>>>>>>>>> The issue is as follows. CPU3 is waiting for a spin_lock, which is got
>>>>>>>>> by CPU1. But CPU1 is stuck in the following dead loop.
>>>>>>>>>
>>>>>>>>> irq_exit()
>>>>>>>>>   __irq_exit_rcu()
>>>>>>>>>     /* in_hardirq() returns false after this */
>>>>>>>>>     preempt_count_sub(HARDIRQ_OFFSET)
>>>>>>>>>     tick_irq_exit()
>>>>>>>>>       tick_nohz_irq_exit()
>>>>>>>>>         tick_nohz_stop_sched_tick()
>>>>>>>>>           trace_tick_stop()  /* a bpf prog is hooked on this trace point */
>>>>>>>>>             __bpf_trace_tick_stop()
>>>>>>>>>               bpf_trace_run2()
>>>>>>>>>                 rcu_read_unlock_special()
>>>>>>>>>                   /* will send a IPI to itself */
>>>>>>>>>                   irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>>>>>>>>>
>>>>>>>>> /* after interrupt is enabled again, the irq_work is called */
>>>>>>>>>
>>>>>>>>> asm_sysvec_irq_work()
>>>>>>>>>   sysvec_irq_work()
>>>>>>>>>     irq_exit()  /* after handled the irq_work, we again enter into irq_exit() */
>>>>>>>>>       __irq_exit_rcu()
>>>>>>>>>         ...skip...
>>>>>>>>>         /* we queue a irq_work again, and enter a dead loop */
>>>>>>>>>         irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>>>
>>> The following is a candidate fix (among other fixes being
>>> considered/discussed). The change is to check if context tracking thinks
>>> we're in IRQ and if so, avoid the irq_work. IMO, this should be rare enough
>>> that it shouldn't be an issue and it is dangerous to self-IPI consistently
>>> while we're exiting an IRQ anyway.
>>>
>>> Thoughts? Xiongfeng, do you want to try it?
>>
>> Thanks a lot for the fast response. My colleague is testing the
>> modification. She will feedback the result.
>>
>> Thanks,
>> Xiongfeng
>>
>>> Btw, I could easily reproduce it as a boot hang by doing:
>>>
>>> --- a/kernel/softirq.c
>>> +++ b/kernel/softirq.c
>>> @@ -638,6 +638,10 @@ void irq_enter(void)
>>>
>>>  static inline void tick_irq_exit(void)
>>>  {
>>> +	rcu_read_lock();
>>> +	WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
>>> +	rcu_read_unlock();
>>> +
>>>  #ifdef CONFIG_NO_HZ_COMMON
>>>  	int cpu = smp_processor_id();
>>>
>>> ---8<---
>>>
>>> From: Joel Fernandes
>>> Subject: [PATCH] Do not schedule irq_work when IRQ is exiting
>>>
>>> Signed-off-by: Joel Fernandes
>>> ---
>>>  include/linux/context_tracking_irq.h |  2 ++
>>>  kernel/context_tracking.c            | 12 ++++++++++++
>>>  kernel/rcu/tree_plugin.h             |  3 ++-
>>>  3 files changed, 16 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/context_tracking_irq.h b/include/linux/context_tracking_irq.h
>>> index 197916ee91a4..35a5ad971514 100644
>>> --- a/include/linux/context_tracking_irq.h
>>> +++ b/include/linux/context_tracking_irq.h
>>> @@ -9,6 +9,7 @@ void ct_irq_enter_irqson(void);
>>>  void ct_irq_exit_irqson(void);
>>>  void ct_nmi_enter(void);
>>>  void ct_nmi_exit(void);
>>> +bool ct_in_irq(void);
>>>  #else
>>>  static __always_inline void ct_irq_enter(void) { }
>>>  static __always_inline void ct_irq_exit(void) { }
>>> @@ -16,6 +17,7 @@ static inline void ct_irq_enter_irqson(void) { }
>>>  static inline void ct_irq_exit_irqson(void) { }
>>>  static __always_inline void ct_nmi_enter(void) { }
>>>  static __always_inline void ct_nmi_exit(void) { }
>>> +static inline bool ct_in_irq(void) { return false; }
>>>  #endif
>>>
>>>  #endif
>>> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
>>> index fb5be6e9b423..8e8055cf04af 100644
>>> --- a/kernel/context_tracking.c
>>> +++ b/kernel/context_tracking.c
>>> @@ -392,6 +392,18 @@ noinstr void ct_irq_exit(void)
>>>  	ct_nmi_exit();
>>>  }
>>>
>>> +/**
>>> + * ct_in_irq - che
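The kernel/rcu/tree_plugin.h hunk of the candidate fix is cut off above;
based on the new helper's name and the +3/-1 diffstat, its shape would be
roughly the following (an assumption about the fix, not the actual hunk):

/* In rcu_read_unlock_special(): only self-IPI for the deferred
 * quiescent state when context tracking says we are not inside
 * irq_exit(). Without the check, the irq_work fires as soon as IRQs
 * are re-enabled, its handler runs irq_exit(), which requeues the
 * irq_work -- the dead loop described in the report. */
if (!ct_in_irq())
	irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);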
Re: [PATCH RFC net-next v2 3/3] vsock/test: Cover more CIDs in transport_uaf test
On Wed, May 28, 2025 at 10:44:43PM +0200, Michal Luczaj wrote:
> Increase the coverage of test for UAF due to socket unbinding, and losing
> transport in general. It's a follow up to commit 301a62dfb0d0 ("vsock/test:
> Add test for UAF due to socket unbinding") and discussion in [1].
>
> The idea remains the same: take an unconnected stream socket with a
> transport assigned and then attempt to switch the transport by trying (and
> failing) to connect to some other CID. Now do this iterating over all the
> well known CIDs (plus one).
>
> Note that having only a virtio transport loaded (without vhost_vsock) is
> unsupported; test will always pass.
>
> Depending on transports available, a variety of splats are possible on
> unpatched machines. After reverting commit 78dafe1cf3af ("vsock: Orphan
> socket after transport release") and commit fcdd2242c023 ("vsock: Keep the
> binding until socket destruction"):
>
> BUG: KASAN: slab-use-after-free in __vsock_bind+0x61f/0x720
> Read of size 4 at addr 88811ff46b54 by task vsock_test/1475
> Call Trace:
>  dump_stack_lvl+0x68/0x90
>  print_report+0x170/0x53d
>  kasan_report+0xc2/0x180
>  __vsock_bind+0x61f/0x720
>  vsock_connect+0x727/0xc40
>  __sys_connect+0xe8/0x100
>  __x64_sys_connect+0x6e/0xc0
>  do_syscall_64+0x92/0x1c0
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> WARNING: CPU: 0 PID: 1475 at net/vmw_vsock/virtio_transport_common.c:37 virtio_transport_send_pkt_info+0xb2b/0x1160
> Call Trace:
>  virtio_transport_connect+0x90/0xb0
>  vsock_connect+0x782/0xc40
>  __sys_connect+0xe8/0x100
>  __x64_sys_connect+0x6e/0xc0
>  do_syscall_64+0x92/0x1c0
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> KASAN: null-ptr-deref in range [0x0010-0x0017]
> RIP: 0010:sock_has_perm+0xa7/0x2a0
> Call Trace:
>  selinux_socket_connect_helper.isra.0+0xbc/0x450
>  selinux_socket_connect+0x3b/0x70
>  security_socket_connect+0x31/0xd0
>  __sys_connect_file+0x79/0x1f0
>  __sys_connect+0xe8/0x100
>  __x64_sys_connect+0x6e/0xc0
>  do_syscall_64+0x92/0x1c0
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> refcount_t: addition on 0; use-after-free.
> WARNING: CPU: 7 PID: 1518 at lib/refcount.c:25 refcount_warn_saturate+0xdd/0x140
> RIP: 0010:refcount_warn_saturate+0xdd/0x140
> Call Trace:
>  __vsock_bind+0x65e/0x720
>  vsock_connect+0x727/0xc40
>  __sys_connect+0xe8/0x100
>  __x64_sys_connect+0x6e/0xc0
>  do_syscall_64+0x92/0x1c0
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> refcount_t: underflow; use-after-free.
> WARNING: CPU: 0 PID: 1475 at lib/refcount.c:28 refcount_warn_saturate+0x12b/0x140
> RIP: 0010:refcount_warn_saturate+0x12b/0x140
> Call Trace:
>  vsock_remove_bound+0x18f/0x280
>  __vsock_release+0x371/0x480
>  vsock_release+0x88/0x120
>  __sock_release+0xaa/0x260
>  sock_close+0x14/0x20
>  __fput+0x35a/0xaa0
>  task_work_run+0xff/0x1c0
>  do_exit+0x849/0x24c0
>  make_task_dead+0xf3/0x110
>  rewind_stack_and_make_dead+0x16/0x20
>
> [1]: https://lore.kernel.org/netdev/CAGxU2F5zhfWymY8u0hrKksW8PumXAYz-9_qRmW==92oax1b...@mail.gmail.com/
>
> Suggested-by: Stefano Garzarella
> Signed-off-by: Michal Luczaj
> ---
>  tools/testing/vsock/vsock_test.c | 83 +++++++++++++++++++++++++---------
>  1 file changed, 64 insertions(+), 19 deletions(-)
>
> diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> index f669baaa0dca3bebc678d00eafa80857d1f0fdd6..b58736023981ef7c4812e069ea577fcf2c0fe9fa 100644
> --- a/tools/testing/vsock/vsock_test.c
> +++ b/tools/testing/vsock/vsock_test.c
> @@ -1718,16 +1718,27 @@ static void test_stream_msgzcopy_leak_zcskb_server(const struct test_opts *opts)
>
>  #define MAX_PORT_RETRIES	24	/* net/vmw_vsock/af_vsock.c */
>
> -/* Test attempts to trigger a transport release for an unbound socket. This can
> - * lead to a reference count mishandling.
> - */
> -static void test_stream_transport_uaf_client(const struct test_opts *opts)
> +static bool test_stream_transport_uaf(int cid)
>  {
>  	int sockets[MAX_PORT_RETRIES];
>  	struct sockaddr_vm addr;
> -	int fd, i, alen;
> +	socklen_t alen;
> +	int fd, i, c;
> +	bool ret;
> +
> +	/* Probe for a transport by attempting a local CID bind. Unavailable
> +	 * transport (or more specifically: an unsupported transport/CID
> +	 * combination) results in EADDRNOTAVAIL, other errnos are fatal.
> +	 */
> +	fd = vsock_bind_try(cid, VMADDR_PORT_ANY, SOCK_STREAM);
> +	if (fd < 0) {
> +		if (errno != EADDRNOTAVAIL) {
> +			perror("Unexpected bind() errno");
> +			exit(EXIT_FAILURE);
> +		}
>
> -	fd = vsock_bind(VMADDR_CID_ANY, VMADDR_PORT_ANY, SOCK_STREAM);
> +		return false;
> +	}
>
>  	alen = sizeof(addr);
>  	if (getsockname(fd, (struct sockaddr *)&addr, &alen)) {
> @@ -1735,38 +1746,73 @@ static void test_stream_transport_uaf_client(const struct test_opts *opts)
>  		exit(EXIT_FAILURE);
>  	}
>
> +	/* Drain the autobind pool; see __vsock_bind_connectible(). */
>  	for (i = 0; i < MAX_PORT_RETRIES; ++i)
> -		sockets[i] = vsock_bind(VMADDR_CID_ANY, ++addr.svm_port,
> -					SOCK_STREAM)
Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > The walk_page_range_novma() function is rather confusing - it supports two
> > modes, one used often, the other used only for debugging.
> >
> > The first mode is the common case of traversal of kernel page tables, which
> > is what nearly all callers use this for.
>
> ... and what people should be using it for 🙂

:) Yeah the whole intent of this patch is to detach the 'crazy debug' bit
from the 'used by arches all over the place' stuff. Being super clear as
to what you're doing matters.

> > Secondly it provides an unusual debugging interface that allows for the
> > traversal of page tables in a userland range of memory even for that memory
> > which is not described by a VMA.
> >
> > This is highly unusual and it is far from certain that such page tables
> > should even exist, but perhaps this is precisely why it is useful as a
> > debugging mechanism.
> >
> > As a result, this is utilised by ptdump only. Historically, things were
> > reversed - ptdump was the only user, and other parts of the kernel evolved
> > to use the kernel page table walking here.
> >
> > Since we have some complicated and confusing locking rules for the novma
> > case, it makes sense to separate the two usages into their own functions.
> >
> > Doing this also provide self-documentation as to the intent of the caller -
> > are they doing something rather unusual or are they simply doing a standard
> > kernel page table walk?
> >
> > We therefore maintain walk_page_range_novma() for this single usage, and
> > document the function as such.
>
> If we have to keep this dangerous interface, it should probably be
>
> walk_page_range_debug() or walk_page_range_dump()

Ugh it's too early, I thought Mike suggested this :P but he suggested the
mm/internal.h bit. But anyway I agree with both, will fix in v2.

> > Note that ptdump uses the precise same function for kernel walking as a
> > convenience, so we permit this but make it very explicit by having
> > walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> >
> > We introduce walk_page_range_kernel() for the far more common case of
> > kernel page table traversal.
>
> I wonder if we should give it a completely different name scheme to
> highlight that this is something completely different.
>
> walk_kernel_page_table_range()

Yeah, I think this might be a good idea actually. This is doing something
'unusual' unlike all the other walk_kernel_xxx() handlers, so this should
highlight it even more clearly.

Will fixup in v2.

> etc.
>
> --
> Cheers,
>
> David / dhildenb

___
linux-riscv mailing list
linux-ri...@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
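Taken together, the proposal sketches out as two declarations along these
lines (illustrative only; the exact parameter lists are assumptions, not
the final interface):

/* Common case: walk a range of kernel page tables. */
int walk_kernel_page_table_range(unsigned long start, unsigned long end,
				 const struct mm_walk_ops *ops,
				 void *private);

/* Debug-only walk of user page tables not described by a VMA; sole
 * user is ptdump, so the declaration could live in mm/internal.h
 * rather than include/linux/pagewalk.h. */
int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
			  unsigned long end, const struct mm_walk_ops *ops,
			  void *private);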
Re: [PATCH v4 4/7] cxl/acpi: Add background worker to wait for cxl_pci and cxl_mem probe
On 04/06/2025 06:19, Smita Koralahalli wrote:
>  drivers/cxl/acpi.c         | 23 +++++++++++++++++++
>  drivers/cxl/core/suspend.c | 21 +++++++++++++++++
>  drivers/cxl/cxl.h          |  2 ++
>  3 files changed, 46 insertions(+)
>
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index cb14829bb9be..978f63b32b41 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -813,6 +813,24 @@ static int pair_cxl_resource(struct device *dev, void *data)
>  	return 0;
>  }
>
> +static void cxl_softreserv_mem_work_fn(struct work_struct *work)
> +{
> +	/* Wait for cxl_pci and cxl_mem drivers to load */
> +	cxl_wait_for_pci_mem();
> +
> +	/*
> +	 * Wait for the driver probe routines to complete after cxl_pci
> +	 * and cxl_mem drivers are loaded.
> +	 */
> +	wait_for_device_probe();
> +}
> +static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn);
> +
> +static void cxl_softreserv_mem_update(void)
> +{
> +	schedule_work(&cxl_sr_work);
> +}
> +
>  static int cxl_acpi_probe(struct platform_device *pdev)
>  {
>  	int rc;
> @@ -887,6 +905,10 @@ static int cxl_acpi_probe(struct platform_device *pdev)
>
>  	/* In case PCI is scanned before ACPI re-trigger memdev attach */
>  	cxl_bus_rescan();
> +
> +	/* Update SOFT RESERVE resources that intersect with CXL regions */
> +	cxl_softreserv_mem_update();
> +
>  	return 0;
>  }
>
> @@ -918,6 +940,7 @@ static int __init cxl_acpi_init(void)
>
>  static void __exit cxl_acpi_exit(void)
>  {
> +	cancel_work_sync(&cxl_sr_work);
>  	platform_driver_unregister(&cxl_acpi_driver);
>  	cxl_bus_drain();
>  }
> diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/suspend.c
> index 72818a2c8ec8..c0d8f70aed56 100644
> --- a/drivers/cxl/core/suspend.c
> +++ b/drivers/cxl/core/suspend.c
> @@ -2,12 +2,15 @@
>  /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
>  #include
>  #include
> +#include
>  #include "cxlmem.h"
>  #include "cxlpci.h"
>
>  static atomic_t mem_active;
>  static atomic_t pci_loaded;
>
> +static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue);

Given that this file (suspend.c) focuses on power management functions, it
might be more appropriate to move the wait queue declaration and its
related changes to acpi.c, where its caller is.

Thanks
Zhijian

> +
>  bool cxl_mem_active(void)
>  {
>  	if (IS_ENABLED(CONFIG_CXL_MEM))
> @@ -19,6 +22,7 @@ bool cxl_mem_active(void)
>  void cxl_mem_active_inc(void)
>  {
>  	atomic_inc(&mem_active);
> +	wake_up(&cxl_wait_queue);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_mem_active_inc, "CXL");
>
> @@ -28,8 +32,25 @@ void cxl_mem_active_dec(void)
>  	atomic_dec(&mem_active);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_mem_active_dec, "CXL");
>
> +static bool cxl_pci_loaded(void)
> +{
> +	if (IS_ENABLED(CONFIG_CXL_PCI))
> +		return atomic_read(&pci_loaded) != 0;
> +
> +	return false;
> +}
> +
>  void mark_cxl_pci_loaded(void)
>  {
>  	atomic_inc(&pci_loaded);
> +	wake_up(&cxl_wait_queue);
>  }
>  EXPORT_SYMBOL_NS_GPL(mark_cxl_pci_loaded, "CXL");
> +
> +void cxl_wait_for_pci_mem(void)
> +{
> +	if (!wait_event_timeout(cxl_wait_queue, cxl_pci_loaded() &&
> +				cxl_mem_active(), 30 * HZ))
> +		pr_debug("Timeout waiting for cxl_pci or cxl_mem probing\n");
> +}
Re: [PATCH RFC net-next v2] page_pool: import Jesper's page_pool benchmark
On 28/05/2025 21.46, Mina Almasry wrote:
> On Wed, May 28, 2025 at 2:28 AM Toke Høiland-Jørgensen wrote:
>> Mina Almasry writes:
>>> On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen
>>> wrote:
>>>>> Fast path results:
>>>>> no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns
>>>>>
>>>>> ptr_ring results:
>>>>> no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns
>>>>>
>>>>> slow path results:
>>>>> no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns
>>>>> ```
>>>>>
>>>>> Cc: Jesper Dangaard Brouer
>>>>> Cc: Ilias Apalodimas
>>>>> Cc: Jakub Kicinski
>>>>> Cc: Toke Høiland-Jørgensen
>>>>> Signed-off-by: Mina Almasry
>>>>
>>>> Back when you posted the first RFC, Jesper and I chatted about ways to
>>>> avoid the ugly "load module and read the output from dmesg" interface to
>>>> the test.
>>>
>>> I agree the existing interface is ugly.
>>>
>>>> One idea we came up with was to make the module include only the "inner"
>>>> functions for the benchmark, and expose those to BPF as kfuncs. Then the
>>>> test runner can be a BPF program that runs the tests, collects the data
>>>> and passes it to userspace via maps or a ringbuffer or something. That's
>>>> a nicer and more customisable interface than the printk output. And if
>>>> they're small enough, maybe we could even include the functions into the
>>>> page_pool code itself, instead of in a separate benchmark module?
>>>>
>>>> WDYT of that idea? :)
>>>
>>> ...but this sounds like an enormous amount of effort, for something
>>> that is a bit ugly but isn't THAT bad. Especially for me, I'm not that
>>> much of an expert that I know how to implement what you're referring
>>> to off the top of my head. I normally am open to spending time but
>>> this is not that high on my todolist and I have limited bandwidth to
>>> resolve this :(
>>>
>>> I also feel that this is something that could be improved post merge.
>>>
>>> I think it's very beneficial to have this merged in some form that can
>>> be improved later. Byungchul is making a lot of changes to these mm
>>> things and it would be nice to have an easy way to run the benchmark
>>> in tree and maybe even get automated results from nipa. If we could
>>> agree on mvp that is appropriate to merge without too much scope creep
>>> that would be ideal from my side at least.
>>
>> Right, fair. I guess we can merge it as-is, and then investigate whether
>> we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
>
> Thanks for the pliability. Reviewed-bys and comments welcome. Additionally
> Signed-off-by from Jesper is needed I think. Since most of this code is
> his, I retained his authorship.
>
> Jesper, whenever this looks good to me, a signed-off-by would be good and
> I would carry it to future versions. Changing authorship to me is also
> fine by me but I would think you want to retain the credit.

Okay, I think Ilias'es comment [1] and ACK convinced me, let us merge this
as-is.

We have been asking people to run it over several years before accepting
patches. We shouldn't be pointing people to use out-of-tree tests for
accepting patches. It is not perfect, but it has served us well for
benchmarking in the last approx 10 years (5 years for the page_pool test).
It is isolated as a selftest under
(tools/testing/selftests/net/bench/page_pool/).

Realistically we are all too busy to invent a new "perfect" benchmark for
page_pool. That said, I do encourage others with free cycles to integrate
a better benchmark test into `perf bench`. Then we can just remove this
module again.

Signed-off-by: Jesper Dangaard Brouer

[1] https://lore.kernel.org/all/cac_iwjlmo4xz_+pbacnxpvctmgknbslgyeeks2ptrrepn1u...@mail.gmail.com/

Thanks Mina for pushing this forward,
--Jesper
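Toke's kfunc idea, for whenever someone picks it up, could be prototyped
along these lines (a sketch under stated assumptions: bench_page_pool_fast()
and the id-set names are invented; only the kfunc registration machinery
is the real kernel API):

#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/module.h>

/* Inner benchmark loop exposed to BPF; body elided in this sketch. */
__bpf_kfunc u64 bench_page_pool_fast(u32 loops)
{
	u64 cycles_per_elem = 0;
	/* ... run the page_pool fast-path alloc/free loop, measure cost ... */
	return cycles_per_elem;
}

BTF_KFUNCS_START(bench_kfunc_ids)
BTF_ID_FLAGS(func, bench_page_pool_fast)
BTF_KFUNCS_END(bench_kfunc_ids)

static const struct btf_kfunc_id_set bench_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &bench_kfunc_ids,
};

static int __init bench_init(void)
{
	/* Let a syscall-type BPF test runner call the kfunc and ship
	 * results to userspace via a map or ring buffer. */
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
					 &bench_kfunc_set);
}
module_init(bench_init);

MODULE_LICENSE("GPL");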
Re: [PATCH v3 3/6] modpost: Make mod_device_table aliases more unique
On Tue, Jun 03, 2025 at 01:18:25AM +0900, Masahiro Yamada wrote:
> > > > Before these patches this was not a problem as non-unique characters are
> > > > in separate object files when the module is compiled separately.
> > > >
> > > > But when the modules are compiled into the kernel, there is a symbol
> > > > conflict when linking vmlinuz. We have modules that export multiple
> > > > device tables from different object files.
> > >
> > > This is because the __mod_device_table__* symbols are global, but
> > > I suspect they do not need to be.
> > >
> > > Let's test this
> > > https://lore.kernel.org/lkml/20250602105539.392362-1-masahi...@kernel.org/T/#u
> >
> > I tested this patch with the config:
> >
> >   make allmodconfig
> >   make mod2yesconfig
> >
> > and it works.
>
> Good.
> Then, __COUNTER__ is unnecessary.

I didn't immediately notice. The patch you suggested works, but these
symbols remain in System.map and it seems in vmlinuz.

--
Rgrds, legion
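For context on why the collisions happen: MODULE_DEVICE_TABLE() expands to
roughly the following alias (paraphrased from memory, not the exact
upstream macro):

/* Rough shape of the macro: emit an alias of the device table so modpost
 * can find it by name in the object file. If the alias is global, two
 * object files built into vmlinux that each define a table with the same
 * variable name produce duplicate symbols at link time -- hence the push
 * to make these symbols static. */
#define MODULE_DEVICE_TABLE(type, name)				\
extern typeof(name) __mod_device_table__##type##__##name	\
	__attribute__((unused, alias(#name)))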
Re: [PATCH 3/3] remoteproc: imx_rproc: Add support for i.MX95
On Wed, Jun 4, 2025 at 5:37 AM Peng Fan (OSS) wrote:
>
> From: Peng Fan
>
> Add imx_rproc_cfg_imx95_m7 and address (TCM and DDR) mapping.
> Add i.MX95 of_device_id entry.
>
> Signed-off-by: Peng Fan

Reviewed-by: Daniel Baluta
Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On Wed, Jun 04, 2025 at 10:09:05AM +0100, Lorenzo Stoakes wrote:
> Nice idea to move to mm/internal.h also :) I like this...
>
> Will fixup on respin

Dumb question but IIUC, walk_page_range_novma() will only be used by
ptdump from now on, so why not stick it into mm/ptdump.c, which is where
the only user of it lives?

--
Oscar Salvador
SUSE Labs
Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On Wed, Jun 04, 2025 at 02:26:57PM +0200, Oscar Salvador wrote:
> On Wed, Jun 04, 2025 at 10:09:05AM +0100, Lorenzo Stoakes wrote:
> > Nice idea to move to mm/internal.h also :) I like this...
> >
> > Will fixup on respin
>
> Dumb question but IIUC, walk_page_range_novma() will only be used by
> ptdump from now on, so why not stick it into mm/ptdump.c, which is where
> the only user of it lives?

There's no such thing as a dumb question :) Even though I have maybe
tested that concept in the past personally ;)

I think mm/internal.h is fine, would be weird to have the declaration in
mm/ptdump.c but implemented elsewhere, and obviously we need to keep the
page walking stuff together.

So I think it is best to keep it in internal.h

> --
> Oscar Salvador
> SUSE Labs
Re: [PATCH 2/3] remoteproc: imx_rproc: Add support for System Manager API
Hi Peng,

Thanks a lot for the patches. Comments inline:

On Wed, Jun 4, 2025 at 5:36 AM Peng Fan (OSS) wrote:
>
> From: Peng Fan
>
> i.MX95 features a Cortex-M33 core, six Cortex-A55 cores, and
> one Cortex-M7 core. The System Control Management Interface(SCMI)
> firmware runs on the M33 core. The i.MX95 SCMI firmware named System
> Manager(SM) includes vendor extension protocols, Logical Machine
> Management(LMM) protocol and CPU protocol and etc.
>
> There are three cases for M7:
> (1) M7 in a separate Logical Machine(LM) that Linux couldn't control it.

couldn't -> can't

> (2) M7 in a separate Logical Machine that Linux could control it using
>     LMM protocol

could -> can

> (3) M7 runs in same Logical Machine as A55, so Linux could control it
>     using CPU protocol

could -> can

>
> So extend the driver to using LMM and CPU protocol to manage the M7 core.
> - Add IMX_RPROC_SM to indicate the remotecores runs on a SoC that
>   has System Manager.

remotecores -> remote core

> - Compare linux LM ID(got using scmi_imx_lmm_info) and M7 LM ID(got
>   from DTB), if same, use CPU protocol to start/stop. Otherwise, use
>   LMM protocol to start/stop.
>   Whether using CPU or LMM protocol to start/stop, the M7 status
>   detection could use CPU protocol to detect started or not. So
>   in imx_rproc_detect_mode, use scmi_imx_cpu_started to check the
>   status of M7.
> - For above case 1 and 2, Use SCMI_IMX_LMM_POWER_ON to detect whether
>   the M7 LM is under control of A55 LM.
>
> Current setup relies on pre-Linux software(U-Boot) to do
> M7 TCM ECC initialization. In future, we could add the support in Linux
> to decouple U-Boot and Linux.
>
> Signed-off-by: Peng Fan
> ---
>  drivers/remoteproc/imx_rproc.c | 139 ++++++++++++++++++++++++++++++++-
>  drivers/remoteproc/imx_rproc.h |   2 +
>  2 files changed, 139 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/imx_rproc.c b/drivers/remoteproc/imx_rproc.c
> index 74299af1d7f10a0db794de494c52304b2323b89f..0649faa98725db99366946c65edf5b7daff78316 100644
> --- a/drivers/remoteproc/imx_rproc.c
> +++ b/drivers/remoteproc/imx_rproc.c
> @@ -8,6 +8,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -21,6 +22,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  #include "imx_rproc.h"
> @@ -91,6 +93,11 @@ struct imx_rproc_mem {
>  #define ATT_CORE_MASK	0x
>  #define ATT_CORE(I)	BIT((I))
>
> +/* Logical Machine Operation */
> +#define IMX_RPROC_FLAGS_SM_LMM_OP	BIT(0)
> +/* Linux has permission to handle the Logical Machine of remote cores */
> +#define IMX_RPROC_FLAGS_SM_LMM_AVAIL	BIT(1)
> +
>  static int imx_rproc_xtr_mbox_init(struct rproc *rproc, bool tx_block);
>  static void imx_rproc_free_mbox(struct rproc *rproc);
>
> @@ -115,6 +122,22 @@ struct imx_rproc {
>  	u32				entry;		/* cpu start address */
>  	u32				core_index;
>  	struct dev_pm_domain_list	*pd_list;
> +	/* For i.MX System Manager based systems */
> +	u32				cpuid;
> +	u32				lmid;
> +	u32				flags;
> +};
> +
> +static const struct imx_rproc_att imx_rproc_att_imx95_m7[] = {
> +	/* dev addr , sys addr , size , flags */
> +	/* TCM CODE NON-SECURE */
> +	{ 0x, 0x203C, 0x0004, ATT_OWN | ATT_IOMEM },
> +
> +	/* TCM SYS NON-SECURE*/
> +	{ 0x2000, 0x2040, 0x0004, ATT_OWN | ATT_IOMEM },
> +
> +	/* DDR */
> +	{ 0x8000, 0x8000, 0x5000, 0 },
> +};

^ this belongs to patch 3/3

Otherwise, patch looks good to me.
RE: [PATCH 2/3] remoteproc: imx_rproc: Add support for System Manager API
> Subject: Re: [PATCH 2/3] remoteproc: imx_rproc: Add support for
> System Manager API
>
> Hi Peng,
>
> Thanks a lot for the patches. Comments inline:
>
> On Wed, Jun 4, 2025 at 5:36 AM Peng Fan (OSS)
> wrote:
> >
> > From: Peng Fan
> >
> > i.MX95 features a Cortex-M33 core, six Cortex-A55 cores, and one
> > Cortex-M7 core. The System Control Management Interface(SCMI) firmware
> > runs on the M33 core. The i.MX95 SCMI firmware named System
> > Manager(SM) includes vendor extension protocols, Logical Machine
> > Management(LMM) protocol and CPU protocol and etc.
> >
> > There are three cases for M7:
> > (1) M7 in a separate Logical Machine(LM) that Linux couldn't control it.
>
> couldn't -> can't
>
> > (2) M7 in a separate Logical Machine that Linux could control it using
> >     LMM protocol
>
> could -> can
>
> > (3) M7 runs in same Logical Machine as A55, so Linux could control it
> >     using CPU protocol
>
> could -> can
>
> >
> > So extend the driver to using LMM and CPU protocol to manage the M7 core.
> > - Add IMX_RPROC_SM to indicate the remotecores runs on a SoC that
> >   has System Manager.
>
> remotecores -> remote core
>
> > - Compare linux LM ID(got using scmi_imx_lmm_info) and M7 LM ID(got
> >   from DTB), if same, use CPU protocol to start/stop. Otherwise, use
> >   LMM protocol to start/stop.
> >   Whether using CPU or LMM protocol to start/stop, the M7 status
> >   detection could use CPU protocol to detect started or not. So
> >   in imx_rproc_detect_mode, use scmi_imx_cpu_started to check the
> >   status of M7.
> > - For above case 1 and 2, Use SCMI_IMX_LMM_POWER_ON to detect whether
> >   the M7 LM is under control of A55 LM.
> >
> > Current setup relies on pre-Linux software(U-Boot) to do
> > M7 TCM ECC initialization. In future, we could add the support in
> > Linux to decouple U-Boot and Linux.
> >
> > Signed-off-by: Peng Fan
> > ---
> >  drivers/remoteproc/imx_rproc.c | 139 ++++++++++++++++++++++++++++++++-
> >  drivers/remoteproc/imx_rproc.h |   2 +
> >  2 files changed, 139 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/remoteproc/imx_rproc.c
> > b/drivers/remoteproc/imx_rproc.c index
> > 74299af1d7f10a0db794de494c52304b2323b89f..0649faa98725db99366946c65edf5b7daff78316 100644
> > --- a/drivers/remoteproc/imx_rproc.c
> > +++ b/drivers/remoteproc/imx_rproc.c
> > @@ -8,6 +8,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> >  #include
> > @@ -21,6 +22,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >
> >  #include "imx_rproc.h"
> > @@ -91,6 +93,11 @@ struct imx_rproc_mem {
> >  #define ATT_CORE_MASK	0x
> >  #define ATT_CORE(I)	BIT((I))
> >
> > +/* Logical Machine Operation */
> > +#define IMX_RPROC_FLAGS_SM_LMM_OP	BIT(0)
> > +/* Linux has permission to handle the Logical Machine of remote cores */
> > +#define IMX_RPROC_FLAGS_SM_LMM_AVAIL	BIT(1)
> > +
> >  static int imx_rproc_xtr_mbox_init(struct rproc *rproc, bool
> >  tx_block); static void imx_rproc_free_mbox(struct rproc *rproc);
> >
> > @@ -115,6 +122,22 @@ struct imx_rproc {
> >  	u32				entry;		/* cpu start address */
> >  	u32				core_index;
> >  	struct dev_pm_domain_list	*pd_list;
> > +	/* For i.MX System Manager based systems */
> > +	u32				cpuid;
> > +	u32				lmid;
> > +	u32				flags;
> > +};
> > +
> > +static const struct imx_rproc_att imx_rproc_att_imx95_m7[] = {
> > +	/* dev addr , sys addr , size , flags */
> > +	/* TCM CODE NON-SECURE */
> > +	{ 0x, 0x203C, 0x0004, ATT_OWN | ATT_IOMEM },
> > +
> > +	/* TCM SYS NON-SECURE*/
> > +	{ 0x2000, 0x2040, 0x0004, ATT_OWN | ATT_IOMEM },
> > +
> > +	/* DDR */
> > +	{ 0x8000, 0x8000, 0x5000, 0 },
> > +};
>
> ^ this belongs to patch 3/3
>
> Otherwise, patch looks good to me.

oops. I missed moving this to patch 3 when addressing your internal review
comments. Will fix in V2 together with the typos you mention above.

Thanks,
Peng.
Re: [PATCH net] virtio-net: drop the multi-buffer XDP packet in zerocopy
On 6/4/25 07:37, Jason Wang wrote: On Tue, Jun 3, 2025 at 11:07 PM Bui Quang Minh wrote: In virtio-net, we have not yet supported multi-buffer XDP packet in zerocopy mode when there is a binding XDP program. However, in that case, when receiving multi-buffer XDP packet, we skip the XDP program and return XDP_PASS. As a result, the packet is passed to normal network stack which is an incorrect behavior. This commit instead returns XDP_DROP in that case. Fixes: 99c861b44eb1 ("virtio_net: xsk: rx: support recv merge mode") Cc: sta...@vger.kernel.org Signed-off-by: Bui Quang Minh --- drivers/net/virtio_net.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index e53ba600605a..4c35324d6e5b 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1309,9 +1309,14 @@ static struct sk_buff *virtnet_receive_xsk_merge(struct net_device *dev, struct ret = XDP_PASS; It would be simpler to just assign XDP_DROP here? Or if you wish to stick to the way, we can simply remove this assignment. This XDP_PASS is returned for the case when there is no XDP program binding (!prog). rcu_read_lock(); prog = rcu_dereference(rq->xdp_prog); - /* TODO: support multi buffer. */ - if (prog && num_buf == 1) - ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats); + if (prog) { + /* TODO: support multi buffer. */ + if (num_buf == 1) + ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, + stats); + else + ret = XDP_DROP; + } rcu_read_unlock(); switch (ret) { -- 2.43.0 Thanks Thanks, Quang Minh.
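For reference, a compact variant of the same hunk along the lines being discussed: XDP_PASS stays the !prog default (preserving the pass-through semantics Minh describes), and the multi-buffer special case is folded into the prog branch. This is a sketch, not a tested patch:

```
/* Sketch: keep XDP_PASS for the !prog case, fold the multi-buffer
 * drop into the prog branch. */
ret = XDP_PASS;
rcu_read_lock();
prog = rcu_dereference(rq->xdp_prog);
if (prog) {
	/* TODO: support multi buffer. */
	ret = num_buf == 1 ?
	      virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats) :
	      XDP_DROP;
}
rcu_read_unlock();
```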
[PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
The walk_page_range_novma() function is rather confusing - it supports two
modes, one used often, the other used only for debugging.

The first mode is the common case of traversal of kernel page tables, which
is what nearly all callers use this for.

Secondly it provides an unusual debugging interface that allows for the
traversal of page tables in a userland range of memory even for that memory
which is not described by a VMA.

It is far from certain that such page tables should even exist, but perhaps
this is precisely why it is useful as a debugging mechanism.

As a result, this is utilised by ptdump only. Historically, things were
reversed - ptdump was the only user, and other parts of the kernel evolved
to use the kernel page table walking here.

Since we have some complicated and confusing locking rules for the novma
case, it makes sense to separate the two usages into their own functions.

Doing this also provides self-documentation as to the intent of the caller -
are they doing something rather unusual or are they simply doing a standard
kernel page table walk?

We therefore establish two separate functions - walk_page_range_debug() for
this single usage, and walk_kernel_page_table_range() for general kernel
page table walking.

We additionally make walk_page_range_debug() internal to mm.

Note that ptdump uses precisely the same function for kernel walking as a
convenience, so we permit this but make it very explicit by having
walk_page_range_debug() invoke walk_kernel_page_table_range() in this case.

Signed-off-by: Lorenzo Stoakes
Acked-by: Mike Rapoport (Microsoft)
---
v2:
* Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
* Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
* Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
  per David.

v1 resend:
* Actually cc'd lists...
* Fixed mistake in walk_page_range_novma() not handling kernel mappings and
  updated the commit message to reference this.
* Added Mike's off-list Acked-by.
* Fixed up comments as per Mike.
* Add some historic flavour to the commit message as per Mike.

https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoa...@oracle.com/

v1:
(accidentally sent off-list due to error in scripting)

 arch/loongarch/mm/pageattr.c |  2 +-
 arch/openrisc/kernel/dma.c   |  4 +-
 arch/riscv/mm/pageattr.c     |  8 +--
 include/linux/pagewalk.h     |  7 ++-
 mm/hugetlb_vmemmap.c         |  2 +-
 mm/internal.h                |  4 ++
 mm/pagewalk.c                | 98 ++++++++++++++++++++++++++-------------
 mm/ptdump.c                  |  3 +-
 8 files changed, 82 insertions(+), 46 deletions(-)

diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
index 99165903908a..f5e910b68229 100644
--- a/arch/loongarch/mm/pageattr.c
+++ b/arch/loongarch/mm/pageattr.c
@@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
 		return 0;

 	mmap_write_lock(&init_mm);
-	ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
+	ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
 	mmap_write_unlock(&init_mm);

 	flush_tlb_kernel_range(start, end);
diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index 3a7b5baaa450..af932a4ad306 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
 	 * them and setting the cache-inhibit bit.
 	 */
 	mmap_write_lock(&init_mm);
-	error = walk_page_range_novma(&init_mm, va, va + size,
+	error = walk_kernel_page_table_range(va, va + size,
 			&set_nocache_walk_ops, NULL, NULL);
 	mmap_write_unlock(&init_mm);

@@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)

 	mmap_write_lock(&init_mm);
 	/* walk_page_range shouldn't be able to fail here */
-	WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
+	WARN_ON(walk_kernel_page_table_range(va, va + size,
 			&clear_nocache_walk_ops, NULL, NULL));
 	mmap_write_unlock(&init_mm);
 }
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index d815448758a1..3f76db3d2769 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
 		if (ret)
 			goto unlock;

-		ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
+		ret = walk_kernel_page_table_range(lm_start, lm_end,
 					    &pageattr_ops, NULL, &masks);
 		if (ret)
 			goto unlock;
@@ -317,13 +317,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
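For readers following along, the resulting split API looks roughly like this; the signatures are inferred from the callers in the diff above, so treat them as a sketch rather than the final header contents:

```
/* General kernel page table walking; init_mm is implied. */
int walk_kernel_page_table_range(unsigned long start, unsigned long end,
				 const struct mm_walk_ops *ops,
				 pgd_t *pgd, void *private);

/* Debug-only traversal of possibly VMA-less user ranges; per the
 * commit message this now lives in mm/internal.h, not the public
 * include/linux/pagewalk.h. */
int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
			  unsigned long end, const struct mm_walk_ops *ops,
			  pgd_t *pgd, void *private);
```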
Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On Wed, Jun 4, 2025 at 7:21 AM Lorenzo Stoakes wrote: > > The walk_page_range_novma() function is rather confusing - it supports two > modes, one used often, the other used only for debugging. > > The first mode is the common case of traversal of kernel page tables, which > is what nearly all callers use this for. > > Secondly it provides an unusual debugging interface that allows for the > traversal of page tables in a userland range of memory even for that memory > which is not described by a VMA. > > It is far from certain that such page tables should even exist, but perhaps > this is precisely why it is useful as a debugging mechanism. > > As a result, this is utilised by ptdump only. Historically, things were > reversed - ptdump was the only user, and other parts of the kernel evolved > to use the kernel page table walking here. > > Since we have some complicated and confusing locking rules for the novma > case, it makes sense to separate the two usages into their own functions. > > Doing this also provide self-documentation as to the intent of the caller - > are they doing something rather unusual or are they simply doing a standard > kernel page table walk? > > We therefore establish two separate functions - walk_page_range_debug() for > this single usage, and walk_kernel_page_table_range() for general kernel > page table walking. > > We additionally make walk_page_range_debug() internal to mm. > > Note that ptdump uses the precise same function for kernel walking as a > convenience, so we permit this but make it very explicit by having > walk_page_range_novma() invoke walk_kernel_page_table_range() in this case. > > Signed-off-by: Lorenzo Stoakes > Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Suren Baghdasaryan > --- > v2: > * Renamed walk_page_range_novma() to walk_page_range_debug() as per David. > * Moved walk_page_range_debug() definition to mm/internal.h as per Mike. > * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as > per David. > > v1 resend: > * Actually cc'd lists... > * Fixed mistake in walk_page_range_novma() not handling kernel mappings and > update commit message to referene. > * Added Mike's off-list Acked-by. > * Fixed up comments as per Mike. > * Add some historic flavour to the commit message as per Mike. 
> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoa...@oracle.com/ > > v1: > (accidentally sent off-list due to error in scripting) > > arch/loongarch/mm/pageattr.c | 2 +- > arch/openrisc/kernel/dma.c | 4 +- > arch/riscv/mm/pageattr.c | 8 +-- > include/linux/pagewalk.h | 7 ++- > mm/hugetlb_vmemmap.c | 2 +- > mm/internal.h| 4 ++ > mm/pagewalk.c| 98 > mm/ptdump.c | 3 +- > 8 files changed, 82 insertions(+), 46 deletions(-) > > diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c > index 99165903908a..f5e910b68229 100644 > --- a/arch/loongarch/mm/pageattr.c > +++ b/arch/loongarch/mm/pageattr.c > @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, > pgprot_t set_mask, pgp > return 0; > > mmap_write_lock(&init_mm); > - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, > NULL, &masks); > + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, > &masks); > mmap_write_unlock(&init_mm); > > flush_tlb_kernel_range(start, end); > diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c > index 3a7b5baaa450..af932a4ad306 100644 > --- a/arch/openrisc/kernel/dma.c > +++ b/arch/openrisc/kernel/dma.c > @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size) > * them and setting the cache-inhibit bit. > */ > mmap_write_lock(&init_mm); > - error = walk_page_range_novma(&init_mm, va, va + size, > + error = walk_kernel_page_table_range(va, va + size, > &set_nocache_walk_ops, NULL, NULL); > mmap_write_unlock(&init_mm); > > @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size) > > mmap_write_lock(&init_mm); > /* walk_page_range shouldn't be able to fail here */ > - WARN_ON(walk_page_range_novma(&init_mm, va, va + size, > + WARN_ON(walk_kernel_page_table_range(va, va + size, > &clear_nocache_walk_ops, NULL, NULL)); > mmap_write_unlock(&init_mm); > } > diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c > index d815448758a1..3f76db3d2769 100644 > --- a/arch/riscv/mm/pageattr.c > +++ b/arch/riscv/mm/pageattr.c > @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, > pgprot_t set_mask, > if (ret) > goto unlock; > > - ret = walk_page_range_novma(&init_mm, lm_start, > lm_end, > + ret = walk_kern
Re: [PATCH v4 4/7] cxl/acpi: Add background worker to wait for cxl_pci and cxl_mem probe
On 6/4/25 2:40 AM, Zhijian Li (Fujitsu) wrote: > > > On 04/06/2025 06:19, Smita Koralahalli wrote: >> drivers/cxl/acpi.c | 23 +++ >> drivers/cxl/core/suspend.c | 21 + >> drivers/cxl/cxl.h | 2 ++ >> 3 files changed, 46 insertions(+) >> >> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c >> index cb14829bb9be..978f63b32b41 100644 >> --- a/drivers/cxl/acpi.c >> +++ b/drivers/cxl/acpi.c >> @@ -813,6 +813,24 @@ static int pair_cxl_resource(struct device *dev, void >> *data) >> return 0; >> } >> >> +static void cxl_softreserv_mem_work_fn(struct work_struct *work) >> +{ >> +/* Wait for cxl_pci and cxl_mem drivers to load */ >> +cxl_wait_for_pci_mem(); >> + >> +/* >> + * Wait for the driver probe routines to complete after cxl_pci >> + * and cxl_mem drivers are loaded. >> + */ >> +wait_for_device_probe(); >> +} >> +static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn); >> + >> +static void cxl_softreserv_mem_update(void) >> +{ >> +schedule_work(&cxl_sr_work); >> +} >> + >> static int cxl_acpi_probe(struct platform_device *pdev) >> { >> int rc; >> @@ -887,6 +905,10 @@ static int cxl_acpi_probe(struct platform_device *pdev) >> >> /* In case PCI is scanned before ACPI re-trigger memdev attach */ >> cxl_bus_rescan(); >> + >> +/* Update SOFT RESERVE resources that intersect with CXL regions */ >> +cxl_softreserv_mem_update(); >> + >> return 0; >> } >> >> @@ -918,6 +940,7 @@ static int __init cxl_acpi_init(void) >> >> static void __exit cxl_acpi_exit(void) >> { >> +cancel_work_sync(&cxl_sr_work); >> platform_driver_unregister(&cxl_acpi_driver); >> cxl_bus_drain(); >> } >> diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/suspend.c >> index 72818a2c8ec8..c0d8f70aed56 100644 >> --- a/drivers/cxl/core/suspend.c >> +++ b/drivers/cxl/core/suspend.c >> @@ -2,12 +2,15 @@ >> /* Copyright(c) 2022 Intel Corporation. All rights reserved. */ >> #include >> #include >> +#include >> #include "cxlmem.h" >> #include "cxlpci.h" >> >> static atomic_t mem_active; >> static atomic_t pci_loaded; >> >> +static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue); > > Given that this file (suspend.c) focuses on power management functions, > it might be more appropriate to move the wait queue declaration and its > related changes to acpi.c in where the its caller is. You mean drivers/cxl/acpi.c and not drivers/cxl/core/acpi.c right? The core one is my mistake and I'm going to create a patch to remove that. DJ > > > Thanks > Zhijian > >> + >> bool cxl_mem_active(void) >> { >> if (IS_ENABLED(CONFIG_CXL_MEM)) >> @@ -19,6 +22,7 @@ bool cxl_mem_active(void) >> void cxl_mem_active_inc(void) >> { >> atomic_inc(&mem_active); >> +wake_up(&cxl_wait_queue); >> } >> EXPORT_SYMBOL_NS_GPL(cxl_mem_active_inc, "CXL"); >> >> @@ -28,8 +32,25 @@ void cxl_mem_active_dec(void) >> } >> EXPORT_SYMBOL_NS_GPL(cxl_mem_active_dec, "CXL"); >> >> +static bool cxl_pci_loaded(void) >> +{ >> +if (IS_ENABLED(CONFIG_CXL_PCI)) >> +return atomic_read(&pci_loaded) != 0; >> + >> +return false; >> +} >> + >> void mark_cxl_pci_loaded(void) >> { >> atomic_inc(&pci_loaded); >> +wake_up(&cxl_wait_queue); >> } >> EXPORT_SYMBOL_NS_GPL(mark_cxl_pci_loaded, "CXL"); >> + >> +void cxl_wait_for_pci_mem(void) >> +{ >> +if (!wait_event_timeout(cxl_wait_queue, cxl_pci_loaded() && >> +cxl_mem_active(), 30 * HZ)) >> +pr_debug("Timeout waiting for cxl_pci or cxl_mem probing\n"); >> +}
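The coordination pattern being reviewed here, reduced to its core (kernel API fragment): wait_event_timeout() re-evaluates its condition after every wake-up, which is why both producers must call wake_up() after flipping their flag:

```
static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue);

/* producer side, e.g. cxl_mem_active_inc() or mark_cxl_pci_loaded() */
atomic_inc(&mem_active);
wake_up(&cxl_wait_queue);	/* re-arms the condition check below */

/* consumer side: returns 0 on timeout, >0 once both conditions hold */
if (!wait_event_timeout(cxl_wait_queue,
			cxl_pci_loaded() && cxl_mem_active(), 30 * HZ))
	pr_debug("Timeout waiting for cxl_pci or cxl_mem probing\n");
```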
[PATCH] module: Make sure relocations are applied to the per-CPU section
The per-CPU data section is handled differently than the other sections.
The memory allocation requires a special __percpu pointer and then the
section is copied into the view of each CPU. Therefore the SHF_ALLOC flag
is removed to ensure move_module() skips it.

Later, relocations are applied and apply_relocations() skips sections
without SHF_ALLOC because they have not been copied. This also skips the
per-CPU data section.

The missing relocations result in a NULL pointer on x86-64 and very small
values on x86-32. This results in a crash, because such a value is not
skipped the way a NULL pointer would be, yet it cannot be dereferenced.
Such an assignment happens during compile-time per-CPU lock initialisation
with lockdep enabled.

Add the SHF_ALLOC flag back for the per-CPU section after move_module().

Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202506041623.e45e4f7d-...@intel.com
Fixes: 8d8022e8aba85 ("module: do percpu allocation after uniqueness check. No, really!")
Signed-off-by: Sebastian Andrzej Siewior
---
 kernel/module/main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 5c6ab20240a6d..35abb5f13d7dc 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2816,6 +2816,9 @@ static struct module *layout_and_allocate(struct load_info *info, int flags)
 	if (err)
 		return ERR_PTR(err);

+	/* Add SHF_ALLOC back so that relocations are applied. */
+	info->sechdrs[info->index.pcpu].sh_flags |= SHF_ALLOC;
+
 	/* Module has been copied to its final place now: return it. */
 	mod = (void *)info->sechdrs[info->index.mod].sh_addr;
 	kmemleak_load_module(mod, info);
--
2.49.0
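For context, the skip that loses the per-CPU relocations looks roughly like this; a condensed sketch of apply_relocations(), not the verbatim kernel code:

```
/* Condensed sketch: relocation sections whose target section lost
 * SHF_ALLOC are silently skipped, which used to include .data..percpu. */
for (i = 1; i < info->hdr->e_shnum; i++) {
	unsigned int infosec = info->sechdrs[i].sh_info;

	/* Not a valid relocation section? */
	if (infosec >= info->hdr->e_shnum)
		continue;

	/* Don't bother with non-allocated sections. */
	if (!(info->sechdrs[infosec].sh_flags & SHF_ALLOC))
		continue;	/* <-- the per-CPU section was skipped here */

	/* ... apply_relocate() / apply_relocate_add() ... */
}
```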
Re: [GIT PULL] SPDX/LICENSES update for 6.16-rc1
The pull request you sent on Wed, 4 Jun 2025 10:36:44 +0200: > git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx.git > tags/spdx-6.16-rc1 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/f5ebe7bb87e04537b37573f0a2516baa90ee23c0 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html
Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On 03.06.25 21:22, Lorenzo Stoakes wrote: The walk_page_range_novma() function is rather confusing - it supports two modes, one used often, the other used only for debugging. The first mode is the common case of traversal of kernel page tables, which is what nearly all callers use this for. ... and what people should be using it for 🙂 Secondly it provides an unusual debugging interface that allows for the traversal of page tables in a userland range of memory even for that memory which is not described by a VMA. This is highly unusual and it is far from certain that such page tables should even exist, but perhaps this is precisely why it is useful as a debugging mechanism. As a result, this is utilised by ptdump only. Historically, things were reversed - ptdump was the only user, and other parts of the kernel evolved to use the kernel page table walking here. Since we have some complicated and confusing locking rules for the novma case, it makes sense to separate the two usages into their own functions. Doing this also provide self-documentation as to the intent of the caller - are they doing something rather unusual or are they simply doing a standard kernel page table walk? We therefore maintain walk_page_range_novma() for this single usage, and document the function as such. If we have to keep this dangerous interface, it should probably be walk_page_range_debug() or walk_page_range_dump() Note that ptdump uses the precise same function for kernel walking as a convenience, so we permit this but make it very explicit by having walk_page_range_novma() invoke walk_page_range_kernel() in this case. We introduce walk_page_range_kernel() for the far more common case of kernel page table traversal. I wonder if we should give it a completely different name scheme to highlight that this is something completely different. walk_kernel_page_table_range() etc. -- Cheers, David / dhildenb
Re: [PATCH net] virtio-net: drop the multi-buffer XDP packet in zerocopy
On Tue, Jun 3, 2025 at 8:09 AM Bui Quang Minh wrote: > > In virtio-net, we have not yet supported multi-buffer XDP packet in > zerocopy mode when there is a binding XDP program. However, in that > case, when receiving multi-buffer XDP packet, we skip the XDP program > and return XDP_PASS. As a result, the packet is passed to normal network > stack which is an incorrect behavior. This commit instead returns > XDP_DROP in that case. Does it make more sense to return XDP_ABORTED? This seems like an unexpected exception case to me, but I'm not familiar enough with virtio-net's multibuffer support. > > Fixes: 99c861b44eb1 ("virtio_net: xsk: rx: support recv merge mode") > Cc: sta...@vger.kernel.org > Signed-off-by: Bui Quang Minh > --- > drivers/net/virtio_net.c | 11 --- > 1 file changed, 8 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index e53ba600605a..4c35324d6e5b 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -1309,9 +1309,14 @@ static struct sk_buff > *virtnet_receive_xsk_merge(struct net_device *dev, struct > ret = XDP_PASS; > rcu_read_lock(); > prog = rcu_dereference(rq->xdp_prog); > - /* TODO: support multi buffer. */ > - if (prog && num_buf == 1) > - ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats); > + if (prog) { > + /* TODO: support multi buffer. */ > + if (num_buf == 1) > + ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, > + stats); > + else > + ret = XDP_DROP; > + } > rcu_read_unlock(); > > switch (ret) { > -- > 2.43.0 > >
Re: [Question] attributes encoding in BTF
On 6/4/25 2:02 AM, Alexis Lothoré wrote: Hi all, a simpler version of this series has been merged, and so I am now taking a look at the issue I have put aside in the merged version: dealing with more specific data layout for arguments passed on stack. For example, a function can pass small structs on stack, but this need special care when generating the corresponding bpf trampolines. Those structs have specific alignment specified by the target ABI, but it can also be altered with attributes packing the structure or modifying the alignment. Some platforms already support structs on stack (see tracing_struct_many_args test), but as discussed earlier, those may suffer from the same kind of issue mentioned above. On Wed Apr 23, 2025 at 9:24 PM CEST, Alexis Lothoré wrote: Hi Andrii, On Wed Apr 23, 2025 at 7:15 PM CEST, Andrii Nakryiko wrote: On Thu, Apr 17, 2025 at 12:14 AM Alexis Lothoré wrote: Hi Andrii, On Wed Apr 16, 2025 at 11:24 PM CEST, Andrii Nakryiko wrote: [...] I'd suggest looking at btf__align_of() in libbpf (tools/lib/bpf/btf.c) to see how we calculate alignment there. It seems to work decently enough. It won't cover any arch-specific extra rules like double needing 16-byte alignment (I vaguely remember something like that for some architectures, but I might be misremembering), or anything similar. It also won't detect (I don't think it's possible without DWARF) artificially increased alignment with attribute((aligned(N))). Thanks for the pointer, I'll take a look at it. The more we discuss this series, the less member size sounds relevant for what I'm trying to achieve here. Following Xu's comments, I have been thinking about how I could detect the custom alignments and packing on structures, and I was wondering if I could somehow benefit from __attribute__ encoding in BTF info ([1]). But following your hint, I also see some btf_is_struct_packed() in tools/lib/bpf/btf_dump.c that could help. I'll dig this further and see if I can manage to make something work with all of this. Andrii's comment above illustrates well my current issue: when functions pass arguments on stack, we are missing info for some of them to correctly build trampolines, especially for struct, which can have attributes like __attribute__((packed)) or __attribute__((align(x))). [1] seems to be a recent solution implemented for BTF to cover this need. IIUC it encodes any arbitratry attribute affecting a data type or function, so if I have some struct like this one in my kernel or a module: struct foo { short b int a; } __packed; I would expect the corresponding BTF data to have some BTF_KIND_DECL_TAG describing the "packed" attribute for the corresponding structure, but I fail to find any of those when running: $ bpftool btf dump file vmlinux format raw In there I see some DECL_TAG but those are mostly 'bpf_kfunc', I see not arbitrary attribute like 'packed' or 'aligned(x)'. What I really need to do in the end is to be able to parse those alignments attributes info in the kernel at runtime when generating trampolines, but I hoped to be able to see them with bpftool first to validate the concept. 
I started taking a look further at this and stumbled upon [2] in which Alan gives many more details about the feature, so I did the following checks: - kernel version 6.15.0-rc4 from bpf-next_base: it contains the updated Makefile.btf calling pahole with `--btf_features=attributes` - pahole v1.30 $ pahole --supported_btf_features encode_force,var,float,decl_tag,type_tag,enum64,optimized_func,consistent_func,decl_tag_kfuncs,reproducible_build,distilled_base,global_var,attributes This pahole comes from my distro pkg manager, but I have also done the same test with a freshly built pahole, while taking care of pulling the libbpf submodule. - bpftool v7.6.0 bpftool v7.6.0 using libbpf v1.6 features: llvm, skeletons Could I be missing something obvious ? Or did I misunderstand the actual attribute encoding feature ? Hi Alexis. The changes recently landed in pahole and libbpf re attributes had a very narrow goal: passing through particular attributes for some BPF kfuncs from the kernel source to vmlinux.h BTF now has a way of encoding any attribute (as opposed to only bpf type/decl tags) by setting type/decl tag kind flag [1]. So it is possible to represent attributes like packed and aligned in BTF. However, the BTF tags need to be generated by something, in case of vmlinux by pahole. Pahole generates BTF by parsing DWARF. And, as far as I understand, attributes are not (can not be?) represented in DWARF in a generic way, it really depends on specifics of the attribute. In order to support packed/aligned, pahole needs to know how to figure them out from DWARF input and add the tags to BTF. And this does not happen right now, which is why you don't see anything in bpftool output. [1] https://lore.kernel.org/bpf/20250130201239.1429648-1-ihor.solod...@linu
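To make the libbpf angle from earlier in the thread concrete: natural alignment can already be computed from BTF even though packed/aligned attributes are not encoded yet. A self-contained sketch using documented libbpf APIs (btf__load_vmlinux_btf(), btf__type_cnt(), btf__align_of()); the "size not a multiple of alignment" check is only a heuristic assumed for illustration, not an established detection method:

```
#include <bpf/btf.h>
#include <stdio.h>

int main(void)
{
	struct btf *btf = btf__load_vmlinux_btf(); /* /sys/kernel/btf/vmlinux */
	__u32 id, n;

	if (!btf)
		return 1;

	n = btf__type_cnt(btf);
	for (id = 1; id < n; id++) {
		const struct btf_type *t = btf__type_by_id(btf, id);
		int align;

		if (!btf_is_struct(t))
			continue;

		align = btf__align_of(btf, id);
		/* Heuristic only: a size that is not a multiple of the
		 * computed alignment hints at packing or forced alignment. */
		if (align > 0 && t->size % align)
			printf("odd layout: type id %u, size %u, align %d\n",
			       id, t->size, align);
	}

	btf__free(btf);
	return 0;
}
```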
Re: [PATCH] selftests: cachestat: add tests for mmap and /proc/cpuinfo
On Sat, May 24, 2025 at 1:36 AM Suresh K C wrote: > > From: Suresh K C > > Add a test case to verify cachestat behavior with memory-mapped files > using mmap(). This ensures that pages accessed via mmap are correctly > accounted for in the page cache. > > Also add a test for /proc/cpuinfo to validate cachestat's handling of > virtual files in pseudo-filesystems. This improves test coverage for Hmm, it's been awhile since I wrote these tests, but isn't there already a test for /proc/* files? > edge cases involving non-regular files. > > Tested on x86_64 with default kernel config. > > Signed-off-by: Suresh K C > --- > .../selftests/cachestat/test_cachestat.c | 69 ++- > 1 file changed, 67 insertions(+), 2 deletions(-) > > diff --git a/tools/testing/selftests/cachestat/test_cachestat.c > b/tools/testing/selftests/cachestat/test_cachestat.c > index 632ab44737ec..81e7f6dd2279 100644 > --- a/tools/testing/selftests/cachestat/test_cachestat.c > +++ b/tools/testing/selftests/cachestat/test_cachestat.c > @@ -22,7 +22,7 @@ > > static const char * const dev_files[] = { > "/dev/zero", "/dev/null", "/dev/urandom", > - "/proc/version", "/proc" > + "/proc/version","/proc/cpuinfo","/proc" > }; > > void print_cachestat(struct cachestat *cs) > @@ -202,6 +202,65 @@ static int test_cachestat(const char *filename, bool > write_random, bool create, > return ret; > } > > +bool test_cachestat_mmap(void){ > + > + size_t PS = sysconf(_SC_PAGESIZE); > + size_t filesize = PS * 512 * 2;; > + int syscall_ret; > + size_t compute_len = PS * 512; > + struct cachestat_range cs_range = { PS, compute_len }; > + char *filename = "tmpshmcstat"; > + unsigned long num_pages = compute_len / PS; > + struct cachestat cs; > + bool ret = true; > + int fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666); > + if (fd < 0) { > + ksft_print_msg("Unable to create mmap file.\n"); > + ret = false; > + goto out; > + } > + if (ftruncate(fd, filesize)) { > + ksft_print_msg("Unable to truncate mmap file.\n"); > + ret = false; > + goto close_fd; > + } > + if (!write_exactly(fd, filesize)) { > + ksft_print_msg("Unable to write to mmap file.\n"); > + ret = false; > + goto close_fd; > + } > + char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, > fd, 0); > + if (map == MAP_FAILED) { > + ksft_print_msg("mmap failed.\n"); > + ret = false; > + goto close_fd; > + } > + > + for (int i = 0; i < filesize; i++) { > + map[i] = 'A'; > + } > + map[filesize - 1] = 'X'; > + > + syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0); > + > + if (syscall_ret) { > + ksft_print_msg("Cachestat returned non-zero.\n"); > + ret = false; > + } else { > + print_cachestat(&cs); > + if (cs.nr_cache + cs.nr_evicted != num_pages) { > + ksft_print_msg("Total number of cached and evicted > pages is off.\n"); > + ret = false; > + } > + } > + > +close_fd: > + close(fd); > + unlink(filename); > +out: > + return ret; > +} > + This looks 90% the same as another test. Are we literally just adding the mmap step to test_cachestat and call it a new test? Can we at least refactor things?
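One possible shape for the refactor being asked for: factor the shared "call cachestat, compare totals" steps into a helper that both the existing test and the mmap test parameterize. A sketch only; the helper name is invented, not part of the current selftest:

```
/* Hypothetical shared helper for test_cachestat() and the new mmap
 * test: run the syscall against fd/range and check the page totals. */
static bool run_cachestat_check(int fd, struct cachestat_range *range,
				unsigned long expected_pages)
{
	struct cachestat cs;

	if (syscall(__NR_cachestat, fd, range, &cs, 0)) {
		ksft_print_msg("Cachestat returned non-zero.\n");
		return false;
	}

	print_cachestat(&cs);
	if (cs.nr_cache + cs.nr_evicted != expected_pages) {
		ksft_print_msg("Total number of cached and evicted pages is off.\n");
		return false;
	}
	return true;
}
```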
Re: [PATCH v4 4/7] cxl/acpi: Add background worker to wait for cxl_pci and cxl_mem probe
On 6/3/2025 4:45 PM, Dave Jiang wrote: On 6/3/25 3:19 PM, Smita Koralahalli wrote: Introduce a waitqueue mechanism to coordinate initialization between the cxl_pci and cxl_mem drivers. Launch a background worker from cxl_acpi_probe() that waits for both drivers to complete initialization before invoking wait_for_device_probe(). Without this, the probe completion wait could begin prematurely, before the drivers are present, leading to missed updates. Co-developed-by: Nathan Fontenot Signed-off-by: Nathan Fontenot Co-developed-by: Terry Bowman Signed-off-by: Terry Bowman Signed-off-by: Smita Koralahalli --- drivers/cxl/acpi.c | 23 +++ drivers/cxl/core/suspend.c | 21 + drivers/cxl/cxl.h | 2 ++ 3 files changed, 46 insertions(+) diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c index cb14829bb9be..978f63b32b41 100644 --- a/drivers/cxl/acpi.c +++ b/drivers/cxl/acpi.c @@ -813,6 +813,24 @@ static int pair_cxl_resource(struct device *dev, void *data) return 0; } +static void cxl_softreserv_mem_work_fn(struct work_struct *work) +{ + /* Wait for cxl_pci and cxl_mem drivers to load */ + cxl_wait_for_pci_mem(); + + /* +* Wait for the driver probe routines to complete after cxl_pci +* and cxl_mem drivers are loaded. +*/ + wait_for_device_probe(); +} +static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn); + +static void cxl_softreserv_mem_update(void) +{ + schedule_work(&cxl_sr_work); +} + static int cxl_acpi_probe(struct platform_device *pdev) { int rc; @@ -887,6 +905,10 @@ static int cxl_acpi_probe(struct platform_device *pdev) /* In case PCI is scanned before ACPI re-trigger memdev attach */ cxl_bus_rescan(); + + /* Update SOFT RESERVE resources that intersect with CXL regions */ + cxl_softreserv_mem_update(); + return 0; } @@ -918,6 +940,7 @@ static int __init cxl_acpi_init(void) static void __exit cxl_acpi_exit(void) { + cancel_work_sync(&cxl_sr_work); platform_driver_unregister(&cxl_acpi_driver); cxl_bus_drain(); } diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/suspend.c index 72818a2c8ec8..c0d8f70aed56 100644 --- a/drivers/cxl/core/suspend.c +++ b/drivers/cxl/core/suspend.c @@ -2,12 +2,15 @@ /* Copyright(c) 2022 Intel Corporation. All rights reserved. */ #include #include +#include #include "cxlmem.h" #include "cxlpci.h" static atomic_t mem_active; static atomic_t pci_loaded; +static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue); + bool cxl_mem_active(void) { if (IS_ENABLED(CONFIG_CXL_MEM)) @@ -19,6 +22,7 @@ bool cxl_mem_active(void) void cxl_mem_active_inc(void) { atomic_inc(&mem_active); + wake_up(&cxl_wait_queue); } EXPORT_SYMBOL_NS_GPL(cxl_mem_active_inc, "CXL"); @@ -28,8 +32,25 @@ void cxl_mem_active_dec(void) } EXPORT_SYMBOL_NS_GPL(cxl_mem_active_dec, "CXL"); +static bool cxl_pci_loaded(void) +{ + if (IS_ENABLED(CONFIG_CXL_PCI)) + return atomic_read(&pci_loaded) != 0; + + return false; +} + void mark_cxl_pci_loaded(void) { atomic_inc(&pci_loaded); + wake_up(&cxl_wait_queue); } EXPORT_SYMBOL_NS_GPL(mark_cxl_pci_loaded, "CXL"); + +void cxl_wait_for_pci_mem(void) +{ + if (!wait_event_timeout(cxl_wait_queue, cxl_pci_loaded() && + cxl_mem_active(), 30 * HZ)) I'm trying to understand why cxl_pci_loaded() is needed. cxl_mem_active() goes above 0 when a cxl_mem_probe() instance succeeds. cxl_mem_probe() being triggered implies that an instance of cxl_pci_probe() has been called since cxl_mem_probe() is triggered from devm_cxl_add_memdev() with memdev being added and cxl_mem driver also have been loaded. So does cxl_mem_active() not imply cxl_pci_loaded() and makes it unnecessary? 
Yeah you are right. I will remove this check. Thanks Smita DJ + pr_debug("Timeout waiting for cxl_pci or cxl_mem probing\n"); +} +EXPORT_SYMBOL_NS_GPL(cxl_wait_for_pci_mem, "CXL"); diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index a9ab46eb0610..1ba7d39c2991 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -902,6 +902,8 @@ void cxl_coordinates_combine(struct access_coordinate *out, bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port); +void cxl_wait_for_pci_mem(void); + /* * Unit test builds overrides this to __weak, find the 'strong' version * of these symbols in tools/testing/cxl/.
[PATCH vhost] vdpa/mlx5: Fix needs_teardown flag calculation
needs_teardown is a device flag that indicates when virtual queues need to be recreated. This happens for certain configuration changes: queue size and some specific features. Currently, the needs_teardown state can be incorrectly reset by subsequent .set_vq_num() calls. For example, for 1 rx VQ with size 512 and 1 tx VQ with size 256: .set_vq_num(0, 512) -> sets needs_teardown to true (rx queue has a non-default size) .set_vq_num(1, 256) -> sets needs_teardown to false (tx queue has a default size) This change takes into account the previous value of the needs_teardown flag when re-calculating it during VQ size configuration. Fixes: 0fe963d6fc16 ("vdpa/mlx5: Re-create HW VQs under certain conditions") Signed-off-by: Dragos Tatulea Reviewed-by: Shahar Shitrit Reviewed-by: Si-Wei Liu Tested-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 49a08a1a..efb5fa694f1e 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -2491,7 +2491,7 @@ static void mlx5_vdpa_set_vq_num(struct vdpa_device *vdev, u16 idx, u32 num) } mvq = &ndev->vqs[idx]; - ndev->needs_teardown = num != mvq->num_ent; + ndev->needs_teardown |= num != mvq->num_ent; mvq->num_ent = num; } -- 2.43.0
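The difference between `=` and `|=` here is easy to demonstrate outside the driver. A standalone toy program mirroring the two-queue example from the commit message (names invented for the demo):

```
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	bool teardown_assign = false, teardown_or = false;
	const int defaults[2]  = { 256, 256 };	/* default VQ sizes */
	const int requested[2] = { 512, 256 };	/* rx resized, tx left at default */

	for (int i = 0; i < 2; i++) {
		teardown_assign = requested[i] != defaults[i]; /* buggy: last call wins */
		teardown_or    |= requested[i] != defaults[i]; /* fixed: flag stays set */
	}

	/* prints "assign: 0, or: 1": plain assignment lost the rx resize */
	printf("assign: %d, or: %d\n", teardown_assign, teardown_or);
	return 0;
}
```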
Re: [PATCH v4 0/7] Add managed SOFT RESERVE resource handling
Hi Zhijian,

Thanks for testing my patches.

On 6/4/2025 1:43 AM, Zhijian Li (Fujitsu) wrote:

Smita, Thanks for your awesome work. I just tested the scenarios you listed, and they work as expected. Thanks again. (Minor comments inlined)

Tested-by: Li Zhijian

To the CXL community,

The scenarios mentioned here essentially cover what a correct firmware may provide. However, I would like to discuss one more scenario that I can simulate with a modified QEMU: the E820 exposes a SOFT RESERVED region which is the same as a CFMW, but the HDM decoders are not committed. This means no region will be auto-created during boot. As an example, after boot, the iomem tree is as follows:

1050000000-304fffffff : CXL Window 0
  1050000000-304fffffff : Soft Reserved

In this case, the SOFT RESERVED resource is not trimmed, so the end-user cannot create a new region. My question is: Is this scenario a problem? If it is, should we fix it in this patchset or create a new patch?

I believe firmware should handle this correctly by ensuring that any exposed SOFT RESERVED ranges correspond to committed HDM decoders and result in region creation. That said, I'd be interested in hearing what the rest of the community thinks.

On 04/06/2025 06:19, Smita Koralahalli wrote:

Add the ability to manage SOFT RESERVE iomem resources prior to them being added to the iomem resource tree. This allows drivers, such as CXL, to remove any pieces of the SOFT RESERVE resource that intersect with created CXL regions. The current approach of leaving the SOFT RESERVE resources as is can cause failures during hotplug of devices, such as CXL, because the resource is not available for reuse after teardown of the device. The approach is to add SOFT RESERVE resources to a separate tree during boot.

No special tree at all since V3

Will make changes. I overlooked the cover letter.

This allows any drivers to update the SOFT RESERVE resources before they are merged into the iomem resource tree. In addition a notifier chain is added so that drivers can be notified when these SOFT RESERVE resources are added to the iomem resource tree. The CXL driver is modified to use a worker thread that waits for the CXL PCI and CXL mem drivers to be loaded and for their probe routine to complete. Then the driver walks through any created CXL regions to trim any intersections with SOFT RESERVE resources in the iomem tree. The dax driver uses the new soft reserve notifier chain so it can consume any remaining SOFT RESERVES once they're added to the iomem tree.

The following scenarios have been tested:

Example 1: Exact alignment, soft reserved is a child of the region

|-- "Soft Reserved" ---|
|-- "Region #" --------|

Before:
1050000000-304fffffff : CXL Window 0
  1050000000-304fffffff : region0
    1050000000-304fffffff : Soft Reserved
      1080000000-2fffffffff : dax0.0
        1080000000-2fffffffff : System RAM (kmem)

BTW, I'm curious how to set up a dax with an address range different from its corresponding region.

Hmm, this configuration was provided directly by our BIOS. The DAX device was mapped to a subset of the region's address space as part of the platform's firmware setup, so I did not explicitly configure it.

After:
1050000000-304fffffff : CXL Window 0
  1050000000-304fffffff : region1
    1080000000-2fffffffff : dax0.0
      1080000000-2fffffffff : System RAM (kmem)

Example 2: Start and/or end aligned and soft reserved spans multiple regions

Tested

|--- "Soft Reserved" ---|
| "Region #" ---|
or
|--- "Soft Reserved" ---|
| "Region #" ---|

Typo? should be:

|--- "Soft Reserved" ---|
| "Region #" ---|

Yeah, Will fix.

Example 3: No alignment

|-- "Soft Reserved" --|
| "Region #" |

Tested.
Thanks Zhijian Thanks Smita
Re: [PATCH RFC net-next v2 2/3] vsock/test: Introduce get_transports()
On 6/4/25 11:07, Stefano Garzarella wrote: > On Wed, May 28, 2025 at 10:44:42PM +0200, Michal Luczaj wrote: >> +static int __get_transports(void) >> +{ >> +/* Order must match transports defined in util.h. >> + * man nm: "d" The symbol is in the initialized data section. >> + */ >> +const char * const syms[] = { >> +"d loopback_transport", >> +"d virtio_transport", >> +"d vhost_transport", >> +"d vmci_transport", >> +"d hvs_transport", >> +}; > > I would move this array (or a macro that define it), near the transport > defined in util.h, so they are near and we can easily update/review > changes. > > BTW what about adding static asserts to check we are aligned? Something like #define KNOWN_TRANSPORTS\ _(LOOPBACK, "loopback") \ _(VIRTIO, "virtio") \ _(VHOST, "vhost") \ _(VMCI, "vmci") \ _(HYPERV, "hvs") enum transport { TRANSPORT_COUNTER_BASE = __COUNTER__ + 1, #define _(name, symbol) \ TRANSPORT_##name = _BITUL(__COUNTER__ - TRANSPORT_COUNTER_BASE), KNOWN_TRANSPORTS TRANSPORT_NUM = __COUNTER__ - TRANSPORT_COUNTER_BASE, #undef _ }; static char * const transport_ksyms[] = { #define _(name, symbol) "d " symbol "_transport", KNOWN_TRANSPORTS #undef _ }; static_assert(ARRAY_SIZE(transport_ksyms) == TRANSPORT_NUM); ? Note that I keep pushing for naming HVS a TRANSPORT_HYPERV. Perhaps it's better to stick to TRANSPORT_HVS after all? >> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h >> index >> 0afe7cbae12e5194172c639ccfbeb8b81f7c25ac..63953e32c3e18e1aa5c2addcf6f09f433660fa84 >> 100644 >> --- a/tools/testing/vsock/util.h >> +++ b/tools/testing/vsock/util.h >> @@ -3,8 +3,19 @@ >> #define UTIL_H >> >> #include >> +#include >> #include >> >> +#define KALLSYMS_PATH "/proc/kallsyms" >> +#define KALLSYMS_LINE_LEN 512 > > We don't need to expose them in util.h IMO, we can keep in util.c OK, sure. Thanks, Michal
Re: [PATCH RFC net-next v2 3/3] vsock/test: Cover more CIDs in transport_uaf test
On 6/4/25 11:37, Stefano Garzarella wrote: > On Wed, May 28, 2025 at 10:44:43PM +0200, Michal Luczaj wrote: >> +static bool test_stream_transport_uaf(int cid) >> { >> int sockets[MAX_PORT_RETRIES]; >> struct sockaddr_vm addr; >> -int fd, i, alen; >> +socklen_t alen; >> +int fd, i, c; >> +bool ret; >> + >> +/* Probe for a transport by attempting a local CID bind. Unavailable >> + * transport (or more specifically: an unsupported transport/CID >> + * combination) results in EADDRNOTAVAIL, other errnos are fatal. >> + */ >> +fd = vsock_bind_try(cid, VMADDR_PORT_ANY, SOCK_STREAM); >> +if (fd < 0) { >> +if (errno != EADDRNOTAVAIL) { >> +perror("Unexpected bind() errno"); >> +exit(EXIT_FAILURE); >> +} >> >> -fd = vsock_bind(VMADDR_CID_ANY, VMADDR_PORT_ANY, SOCK_STREAM); >> +return false; >> +} >> >> alen = sizeof(addr); >> if (getsockname(fd, (struct sockaddr *)&addr, &alen)) { >> @@ -1735,38 +1746,73 @@ static void test_stream_transport_uaf_client(const >> struct test_opts *opts) >> exit(EXIT_FAILURE); >> } >> >> +/* Drain the autobind pool; see __vsock_bind_connectible(). */ >> for (i = 0; i < MAX_PORT_RETRIES; ++i) >> -sockets[i] = vsock_bind(VMADDR_CID_ANY, ++addr.svm_port, >> -SOCK_STREAM); >> +sockets[i] = vsock_bind(cid, ++addr.svm_port, SOCK_STREAM); >> >> close(fd); >> -fd = socket(AF_VSOCK, SOCK_STREAM, 0); >> +fd = socket(AF_VSOCK, SOCK_STREAM | SOCK_NONBLOCK, 0); > > Why we need this change? It's for the (void)connect() below (not the connect() expecting early EADDRNOTAVAIL, the second one). We're not connecting to anything anyway, so there's no point entering the main `while (sk->sk_state != TCP_ESTABLISHED` loop in vsock_connect(). >> if (fd < 0) { >> perror("socket"); >> exit(EXIT_FAILURE); >> } >> >> -if (!vsock_connect_fd(fd, addr.svm_cid, addr.svm_port)) { >> -perror("Unexpected connect() #1 success"); >> +/* Assign transport, while failing to autobind. Autobind pool was >> + * drained, so EADDRNOTAVAIL coming from __vsock_bind_connectible() is >> + * expected. >> + */ >> +addr.svm_port = VMADDR_PORT_ANY; (Ugh, this line looks useless...) >> +if (!connect(fd, (struct sockaddr *)&addr, alen)) { >> +fprintf(stderr, "Unexpected connect() success\n"); >> +exit(EXIT_FAILURE); >> +} else if (errno == ENODEV) { >> +/* Handle unhappy vhost_vsock */ > > Why it's unhappy? No peer? It's the case of test_stream_transport_uaf(VMADDR_CID_HOST) when only vhost_vsock transport is loaded. Before we even reach (and fail) vsock_auto_bind(), vsock_assign_transport() fails earlier: `new_transport` gets set to `transport_g2h` (NULL) and then it's `if (!new_transport) return -ENODEV`. So the idea was to swallow this errno and let the caller report that nothing went through. I guess we can narrow this down to `if (errno == ENODEV && cid == VMADDR_CID_HOST)`. >> +ret = false; >> +goto cleanup; >> +} else if (errno != EADDRNOTAVAIL) { >> +perror("Unexpected connect() errno"); >> exit(EXIT_FAILURE); >> } >> >> -/* Vulnerable system may crash now. */ >> -if (!vsock_connect_fd(fd, VMADDR_CID_HOST, VMADDR_PORT_ANY)) { >> -perror("Unexpected connect() #2 success"); >> -exit(EXIT_FAILURE); >> +/* Reassign transport, triggering old transport release and >> + * (potentially) unbinding of an unbound socket. >> + * >> + * Vulnerable system may crash now. 
>> + */ >> +for (c = VMADDR_CID_HYPERVISOR; c <= VMADDR_CID_HOST + 1; ++c) { >> +if (c != cid) { >> +addr.svm_cid = c; >> +(void)connect(fd, (struct sockaddr *)&addr, alen); >> +} >> } >> >> +ret = true; >> +cleanup: >> close(fd); >> while (i--) >> close(sockets[i]); >> >> -control_writeln("DONE"); >> +return ret; >> } >> >> -static void test_stream_transport_uaf_server(const struct test_opts *opts) >> +/* Test attempts to trigger a transport release for an unbound socket. This >> can >> + * lead to a reference count mishandling. >> + */ >> +static void test_stream_transport_uaf_client(const struct test_opts *opts) >> { >> -control_expectln("DONE"); >> +bool tested = false; >> +int cid, tr; >> + >> +for (cid = VMADDR_CID_HYPERVISOR; cid <= VMADDR_CID_HOST + 1; ++cid) >> +tested |= test_stream_transport_uaf(cid); >> + >> +tr = get_transports(); >> +if (!tr) >> +fprintf(stderr, "No transports detected\n"); >> +else if (tr == TRANSPORT_VIRTIO) >> +fprintf(stderr, "Setup unsupported: sole virtio transport\n"); >> +else if (!tested) >> +
Re: [PATCH v4 1/7] cxl/region: Avoid null pointer dereference in is_cxl_region()
On 6/3/2025 6:49 PM, Dave Jiang wrote:
>
> On 6/3/25 3:19 PM, Smita Koralahalli wrote:
>> Add a NULL check in is_cxl_region() to prevent potential null pointer
>> dereference if a caller passes a NULL device. This change ensures the
>> function safely returns false instead of triggering undefined behavior
>> when dev is NULL.
>
> Don't think this change is necessary. The code paths should not be hitting
> any NULL region devices unless it's a programming error.

I originally added this to the patchset during some initial development to handle possible NULL dev pointers when updating soft reserve resources, see cxl_region_softreserv_update() in patch 5/7. In the current form of that routine it appears we shouldn't execute the while loop if dev is NULL, so this could be dropped from the patch set.

-Nathan

>
>>
>> Co-developed-by: Nathan Fontenot
>> Signed-off-by: Nathan Fontenot
>> Co-developed-by: Terry Bowman
>> Signed-off-by: Terry Bowman
>> Signed-off-by: Smita Koralahalli
>> ---
>>  drivers/cxl/core/region.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> index c3f4dc244df7..109b8a98c4c7 100644
>> --- a/drivers/cxl/core/region.c
>> +++ b/drivers/cxl/core/region.c
>> @@ -2333,7 +2333,7 @@ const struct device_type cxl_region_type = {
>>
>>  bool is_cxl_region(struct device *dev)
>>  {
>> -	return dev->type == &cxl_region_type;
>> +	return dev && dev->type == &cxl_region_type;
>>  }
>>  EXPORT_SYMBOL_NS_GPL(is_cxl_region, "CXL");
>>
Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes
Hi Zenghui,

On Tue, 3 Jun 2025, Zenghui Yu wrote:
> On 2025/5/27 22:24, Sebastian Ott wrote:
>> Some small fixes for arch_timer_edge_cases that I stumbled upon while
>> debugging failures for this selftest on ampere-one.
>>
>> Changes since v1: modified patch 3 based on suggestions from Marc.
>>
>> I've done some tests with this on various machines - seems to be all
>> good, however on ampere-one I now hit this in 10% of the runs:
>>
>> Test Assertion Failure
>> arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
>> pid=166657 tid=166657 errno=4 - Interrupted system call
>> 1 0x00404db3: test_run at arch_timer_edge_cases.c:933
>> 2 0x00401f9f: main at arch_timer_edge_cases.c:1062
>> 3 0xaedd625b: ?? ??:0
>> 4 0xaedd633b: ?? ??:0
>> 5 0x004020af: _start at ??:?
>> timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
>>
>> This is not new, it was just hidden behind the other failure. I'll try
>> to figure out what this is about (seems to be independent of the wait
>> time)..
>
> Not sure if you have figured it out. I can easily reproduce it on my box
> and I *guess* it is that we have some random XVAL values when we enable
> the timer..

Yes, I think so, too.

> test_reprogramming_timer()
> {
>	local_irq_disable();
>	reset_timer_state(timer, DEF_CNT);

My first attempt was to also initialize cval here

>	/* Program the timer to DEF_CNT + delta_1_ms. */
>	set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
>	[...]
> }
>
> set_tval_irq()
> {
>	timer_set_ctl(timer, ctl);
>
> // There is a window that we enable the timer with *random* XVAL
> // values and we may get the unexpected interrupt.. And it's
> // unlikely that KVM can be aware of TVAL's change (and
> // re-evaluate the interrupt's pending state) before hitting the
> // GUEST_ASSERT().
>
>	timer_set_tval(timer, tval_cycles);

Yes, I stumbled over this as well. I've always assumed that this order is because of this from the architecture: "If CNTV_CTL_EL0.ENABLE is 0, the value returned is UNKNOWN." However, re-reading that part today I realized that this only concerns register reads. Maybe someone on cc knows why it's in that order?

I'm currently testing this with the above swapped and it's looking good, so far.

> }
>
> I'm not familiar with the test so I'm not 100% sure that this is the root
> cause. But I hope this helps with your analysis ;-) .

It did, thanks!

Sebastian
Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes
On Wed, 4 Jun 2025, Sebastian Ott wrote:
> On Tue, 3 Jun 2025, Zenghui Yu wrote:
>> [...]
>> test_reprogramming_timer()
>> {
>>	local_irq_disable();
>>	reset_timer_state(timer, DEF_CNT);
>
> My first attempt was to also initialize cval here

Forgot to mention that I did this because my tests have shown that the interrupt didn't only trigger early (like before the reprogrammed delta) but instantly. This seemed to work, but I think the order in set_tval_irq() is the actual issue.

>>	/* Program the timer to DEF_CNT + delta_1_ms. */
>>	set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
>>	[...]
>> }
>>
>> set_tval_irq()
>> {
>>	timer_set_ctl(timer, ctl);
>>
>> // There is a window that we enable the timer with *random* XVAL
>> // values and we may get the unexpected interrupt.. And it's
>> // unlikely that KVM can be aware of TVAL's change (and
>> // re-evaluate the interrupt's pending state) before hitting the
>> // GUEST_ASSERT().
>>
>>	timer_set_tval(timer, tval_cycles);
>
> Yes, I stumbled over this as well. I've always assumed that this order is
> because of this from the architecture: "If CNTV_CTL_EL0.ENABLE is 0, the
> value returned is UNKNOWN." However, re-reading that part today I realized
> that this only concerns register reads. Maybe someone on cc knows why it's
> in that order?
>
> I'm currently testing this with the above swapped and it's looking good,
> so far.
>
>> }
>>
>> I'm not familiar with the test so I'm not 100% sure that this is the
>> root cause. But I hope this helps with your analysis ;-) .
>
> It did, thanks!

Sebastian
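For concreteness, the swap being tested looks like this; a sketch only, the real helper lives in arch_timer_edge_cases.c and takes the test's arch_timer enum:

```
/* Program the comparator first, then enable, so the timer can never
 * be live while XVAL still holds a stale/unknown value. */
static void set_tval_irq(enum arch_timer timer, uint64_t tval_cycles,
			 uint32_t ctl)
{
	timer_set_tval(timer, tval_cycles);	/* was done after enable */
	timer_set_ctl(timer, ctl);		/* enable last */
}
```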
Re: [PATCH RESEND] selftests/mm/run_vmtests.sh: skip hugevm tests if write_to_hugetlbfs is missing
On 03/06/2025 5:12 am, Andrew Morton wrote:
> On Tue, 3 Jun 2025 02:22:32 +0300 Khaled Elnaggar wrote:
>
>> The hugevm tests 'charge_reserved_hugetlb.sh' and 'hugetlb_reparenting_test.sh'
>> depend on the 'write_to_hugetlbfs' binary to simulate writes to hugetlbfs
>> while checking reservations asynchronously in the background.
>>
>> When this binary is missing (e.g., excluded from the build), these tests hang
>> for up to 180 seconds. During this time, run_vmtests.sh is eventually killed
>> due to timeout, aborting all subsequent tests.
>>
>> This patch skips these tests if the binary is not found, preventing delays
>> and ensuring that the test suite runs to completion.
>
> OK, but why is write_to_hugetlbfs missing? If we're in a situation
> where we _could_ run it then we _should_ run it! The user wants to
> test stuff so we should test as much as we can.
> So I'm thinking that it would be preferable to make sure the dang thing
> is there?

Totally agree, let me try to explain how I understand the issue.

The write_to_hugetlbfs binary is built from selftests/mm/Makefile, lines 136-142. It is only compiled if ARCH matches one of the explicitly listed 64-bit architectures:

```
ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390))
TEST_GEN_FILES += va_high_addr_switch
ifneq ($(ARCH),riscv64)
TEST_GEN_FILES += virtual_address_range
endif
TEST_GEN_FILES += write_to_hugetlbfs
endif
```

However, when the MM selftests are run from the kernel's top-level Makefile, for example from the root directory:

make defconfig
sudo make kselftest TARGETS=mm

ARCH is resolved as x86, even on an x86_64 machine (Debian in my case). ARCH=x86 is the reason why the binary gets excluded from the build system.

As far as I understand, the top-level Makefile normalizes both i.86 and x86_64 to x86 for the ARCH variable:

Makefile, lines 383 and 403:

383: include $(srctree)/scripts/subarch.include
403: ARCH ?= $(SUBARCH)

scripts/subarch.include, line 7:

7: SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \

This mapping probably makes sense for selecting the correct arch/ directory (since we have arch/x86, not arch/x86_64) for the top-level Makefile. But the mm selftests Makefile expects ARCH to differentiate between x86 and x86_64 to decide whether to build certain binaries. As a result, 64-bit-only binaries like write_to_hugetlbfs are skipped during build, yet still expected at runtime (by run_vmtests.sh).

That's why this whole issue of the missing executable does not happen when run from selftests/mm using something like:

sudo make -C tools/testing/selftests/mm

because then the selftests/mm/Makefile arch detection runs as intended.

You're right - the proper fix is to improve how we propagate architecture information from the top-level Makefile to selftests. But since that's a larger change (and possibly beyond what I can safely attempt at this point), this patch is just a targeted workaround to avoid test stalls when the binary is missing.

I hope I haven't gone completely wrong with this analysis, happy to improve or revise it further if needed.
[PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any other task from concurrently modifying the address space. This guarantees coherent reporting of virtual address ranges, however it can block important updates from happening. Oftentimes /proc/pid/maps readers are low priority monitoring tasks, and their blocking of high priority tasks results in priority inversion.

Locking the entire address space is required to present a fully coherent picture of the address space, however even the current implementation does not strictly guarantee that, by outputting vmas in page-size chunks and dropping mmap_lock in between each chunk. Address space modifications are possible while mmap_lock is dropped and userspace reading the content is expected to deal with possible concurrent address space modifications. Considering these relaxed rules, holding mmap_lock is not strictly needed as long as we can guarantee that a concurrently modified vma is reported either in its original form or after it was modified.

This patchset switches from holding mmap_lock while reading /proc/pid/maps to taking per-vma locks as we walk the vma tree. This reduces contention with tasks modifying the address space because they would have to contend for the same vma as opposed to the entire address space. The same is done for the PROCMAP_QUERY ioctl which locks only the vma that fell into the requested range instead of the entire address space.

The previous version of this patchset [1] tried to perform /proc/pid/maps reading under RCU, however its implementation is quite complex and the results are worse than the new version because it still relied on mmap_lock speculation, which retries if any part of the address space gets modified. The new implementation is both simpler and results in less contention. Note that a similar approach would not work for /proc/pid/smaps reading, as it also walks the page table and that's not RCU-safe.

Paul McKenney designed a test [2] to measure mmap/munmap latencies while concurrently reading /proc/pid/maps. The test has a pair of processes scanning /proc/PID/maps, and another process unmapping and remapping 4K pages from a 128MB range of anonymous memory. At the end of each 10 second run, the latency of each mmap() or munmap() operation is measured, and for each run the maximum and mean latency is printed. The map/unmap process is started first, its PID is passed to the scanners, and then the map/unmap process waits until both scanners are running before starting its timed test. The scanners keep scanning until the specified /proc/PID/maps file disappears. This test registered close to a 10x improvement in update latencies:

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.455
0.011 0.008 0.472
0.011 0.008 0.535
0.011 0.009 0.545
...
0.011 0.014 2.875
0.011 0.014 2.913
0.011 0.014 3.007
0.011 0.015 3.018

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.036
0.006 0.005 0.039
0.006 0.005 0.039
0.006 0.005 0.039
...
0.006 0.006 0.403
0.006 0.006 0.474
0.006 0.006 0.479
0.006 0.006 0.498

The patchset also adds a number of tests to check for /proc/pid/maps data coherency. They are designed to detect any unexpected data tearing while performing some common address space modifications (vma split, resize and remap). Even before these changes, reading /proc/pid/maps might have inconsistent data because the file is read page-by-page with mmap_lock being dropped between the pages.
An example of user-visible inconsistency can be that the same vma is printed twice: once before it was modified and then after the modifications. For example if vma was extended, it might be found and reported twice. What is not expected is to see a gap where there should have been a vma both before and after modification. This patchset increases the chances of such tearing, therefore it's event more important now to test for unexpected inconsistencies. [1] https://lore.kernel.org/all/20250418174959.1431962-1-sur...@google.com/ [2] https://github.com/paulmckrcu/proc-mmap_sem-test Suren Baghdasaryan (7): selftests/proc: add /proc/pid/maps tearing from vma split test selftests/proc: extend /proc/pid/maps tearing test to include vma resizing selftests/proc: extend /proc/pid/maps tearing test to include vma remapping selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified selftests/proc: add verbose more for tests to facilitate debugging mm/maps: read proc/pid/maps under per-vma lock mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks fs/proc/internal.h | 6 + fs/proc/task_mmu.c | 233 +- tools/testing/selftests/proc/proc-pid-vm.c | 793 ++
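The page-at-a-time behavior the cover letter relies on is easy to observe from userspace. Below is a minimal reader sketch (an editor's illustration, not part of the patchset) that consumes /proc/PID/maps in page-sized chunks; between two read() calls the kernel regenerates the next chunk, which is exactly the tearing window the tests below probe:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64];
	char buf[4096];		/* one page per read, mirroring the kernel's batching */
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%s/maps", argc > 1 ? argv[1] : "self");
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* The address space may be modified between iterations, so the
	 * concatenated output is not guaranteed to be a coherent snapshot. */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	close(fd);
	return 0;
}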
[PATCH v4 1/7] selftests/proc: add /proc/pid/maps tearing from vma split test
The /proc/pid/maps file is generated page by page, with the mmap_lock released between pages. This can lead to inconsistent reads if the underlying vmas are concurrently modified. For instance, if a vma split or merge occurs at a page boundary while /proc/pid/maps is being read, the same vma might be seen twice: once before and once after the change. This duplication is considered acceptable for userspace handling. However, observing a "hole" where a vma should be (e.g., due to a vma being replaced and the space temporarily being empty) is unacceptable. Implement a test that: 1. Forks a child process which continuously modifies its address space, specifically targeting a vma at the boundary between two pages. 2. The parent process repeatedly reads the child's /proc/pid/maps. 3. The parent process checks the last vma of the first page and the first vma of the second page for consistency, looking for the effects of vma splits or merges. The test duration is configurable via the -d command-line parameter in seconds to increase the likelihood of catching the race condition. The default test duration is 5 seconds. Example Command: proc-pid-vm -d 10 Signed-off-by: Suren Baghdasaryan --- tools/testing/selftests/proc/proc-pid-vm.c | 430 - 1 file changed, 429 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c index d04685771952..6e3f06376a1f 100644 --- a/tools/testing/selftests/proc/proc-pid-vm.c +++ b/tools/testing/selftests/proc/proc-pid-vm.c @@ -27,6 +27,7 @@ #undef NDEBUG #include #include +#include #include #include #include @@ -34,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -70,6 +72,8 @@ static void make_private_tmp(void) } } +static unsigned long test_duration_sec = 5UL; +static int page_size; static pid_t pid = -1; static void ate(void) { @@ -281,11 +285,431 @@ static void vsyscall(void) } } -int main(void) +/* /proc/pid/maps parsing routines */ +struct page_content { + char *data; + ssize_t size; +}; + +#define LINE_MAX_SIZE 256 + +struct line_content { + char text[LINE_MAX_SIZE]; + unsigned long start_addr; + unsigned long end_addr; +}; + +static void read_two_pages(int maps_fd, struct page_content *page1, + struct page_content *page2) +{ + ssize_t bytes_read; + + assert(lseek(maps_fd, 0, SEEK_SET) >= 0); + bytes_read = read(maps_fd, page1->data, page_size); + assert(bytes_read > 0 && bytes_read < page_size); + page1->size = bytes_read; + + bytes_read = read(maps_fd, page2->data, page_size); + assert(bytes_read > 0 && bytes_read < page_size); + page2->size = bytes_read; +} + +static void copy_first_line(struct page_content *page, char *first_line) +{ + char *pos = strchr(page->data, '\n'); + + strncpy(first_line, page->data, pos - page->data); + first_line[pos - page->data] = '\0'; +} + +static void copy_last_line(struct page_content *page, char *last_line) +{ + /* Get the last line in the first page */ + const char *end = page->data + page->size - 1; + /* skip last newline */ + const char *pos = end - 1; + + /* search previous newline */ + while (pos[-1] != '\n') + pos--; + strncpy(last_line, pos, end - pos); + last_line[end - pos] = '\0'; +} + +/* Read the last line of the first page and the first line of the second page */ +static void read_boundary_lines(int maps_fd, struct page_content *page1, + struct page_content *page2, + struct line_content *last_line, + struct line_content *first_line) +{ + read_two_pages(maps_fd, page1, page2); + + copy_last_line(page1, 
last_line->text); + copy_first_line(page2, first_line->text); + + assert(sscanf(last_line->text, "%lx-%lx", &last_line->start_addr, + &last_line->end_addr) == 2); + assert(sscanf(first_line->text, "%lx-%lx", &first_line->start_addr, + &first_line->end_addr) == 2); +} + +/* Thread synchronization routines */ +enum test_state { + INIT, + CHILD_READY, + PARENT_READY, + SETUP_READY, + SETUP_MODIFY_MAPS, + SETUP_MAPS_MODIFIED, + SETUP_RESTORE_MAPS, + SETUP_MAPS_RESTORED, + TEST_READY, + TEST_DONE, +}; + +struct vma_modifier_info; + +typedef void (*vma_modifier_op)(const struct vma_modifier_info *mod_info); +typedef void (*vma_mod_result_check_op)(struct line_content *mod_last_line, + struct line_content *mod_first_line, + struct line_content *restored_last_line, + struct line_content *restored_first_line); + +struct vma_modifier_info { +
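The patch body is truncated here, before the synchronization helpers it goes on to use. A minimal sketch of how signal_state()/wait_for_state() could be implemented over the enum above, assuming mod_info lives in a MAP_SHARED mapping (so parent and child observe the same state word) and has a curr_state member; both are this sketch's assumptions, not confirmed details of the patch:

#include <unistd.h>

static void wait_for_state(struct vma_modifier_info *mod_info,
			   enum test_state state)
{
	/* Spin until the other process publishes the expected state. */
	while (__atomic_load_n(&mod_info->curr_state, __ATOMIC_ACQUIRE) != state)
		usleep(50);
}

static void signal_state(struct vma_modifier_info *mod_info,
			 enum test_state state)
{
	/* Publish the new state with release semantics. */
	__atomic_store_n(&mod_info->curr_state, state, __ATOMIC_RELEASE);
}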
[PATCH v4 2/7] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing
Test that /proc/pid/maps does not report unexpected holes in the address space when a vma at the edge of the page is being concurrently remapped. This remapping results in the vma shrinking and expanding from under the reader. We should always see either the shrunk or the expanded (original) version of the vma. Signed-off-by: Suren Baghdasaryan --- tools/testing/selftests/proc/proc-pid-vm.c | 83 ++ 1 file changed, 83 insertions(+) diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c index 6e3f06376a1f..39842e4ec45f 100644 --- a/tools/testing/selftests/proc/proc-pid-vm.c +++ b/tools/testing/selftests/proc/proc-pid-vm.c @@ -583,6 +583,86 @@ static void test_maps_tearing_from_split(int maps_fd, signal_state(mod_info, TEST_DONE); } +static inline void shrink_vma(const struct vma_modifier_info *mod_info) +{ + assert(mremap(mod_info->addr, page_size * 3, page_size, 0) != MAP_FAILED); +} + +static inline void expand_vma(const struct vma_modifier_info *mod_info) +{ + assert(mremap(mod_info->addr, page_size, page_size * 3, 0) != MAP_FAILED); +} + +static inline void check_shrink_result(struct line_content *mod_last_line, + struct line_content *mod_first_line, + struct line_content *restored_last_line, + struct line_content *restored_first_line) +{ + /* Make sure only the last vma of the first page is changing */ + assert(strcmp(mod_last_line->text, restored_last_line->text) != 0); + assert(strcmp(mod_first_line->text, restored_first_line->text) == 0); +} + +static void test_maps_tearing_from_resize(int maps_fd, + struct vma_modifier_info *mod_info, + struct page_content *page1, + struct page_content *page2, + struct line_content *last_line, + struct line_content *first_line) +{ + struct line_content shrunk_last_line; + struct line_content shrunk_first_line; + struct line_content restored_last_line; + struct line_content restored_first_line; + + wait_for_state(mod_info, SETUP_READY); + + /* re-read the file to avoid using stale data from previous test */ + read_boundary_lines(maps_fd, page1, page2, last_line, first_line); + + mod_info->vma_modify = shrink_vma; + mod_info->vma_restore = expand_vma; + mod_info->vma_mod_check = check_shrink_result; + + capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line, + &shrunk_last_line, &shrunk_first_line, + &restored_last_line, &restored_first_line); + + /* Now start concurrent modifications for test_duration_sec */ + signal_state(mod_info, TEST_READY); + + struct line_content new_last_line; + struct line_content new_first_line; + struct timespec start_ts, end_ts; + + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + do { + read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line); + + /* Check if we read vmas after shrinking it */ + if (!strcmp(new_last_line.text, shrunk_last_line.text)) { + /* +* The vmas should be consistent with shrunk results, +* however if the vma was concurrently restored, it +* can be reported twice (first as shrunk one, then +* as restored one) because we found it as the next vma +* again. In that case new first line will be the same +* as the last restored line. 
+*/ + assert(!strcmp(new_first_line.text, shrunk_first_line.text) || + !strcmp(new_first_line.text, restored_last_line.text)); + } else { + /* The vmas should be consistent with the original/restored state */ + assert(!strcmp(new_last_line.text, restored_last_line.text) && + !strcmp(new_first_line.text, restored_first_line.text)); + } + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); + } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec); + + /* Signal the modifier thread to stop and wait until it exits */ + signal_state(mod_info, TEST_DONE); +} + static int test_maps_tearing(void) { struct vma_modifier_info *mod_info; @@ -674,6 +754,9 @@ static int test_maps_tearing(void) test_maps_tearing_from_split(maps_fd, mod_info, &page1, &page2, &last_line, &first_line); + test_maps_tearing_from_resize(maps_f
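The patch is truncated above, but the mremap() pair driving this test is worth seeing in isolation. A standalone sketch (an editor's illustration, not test code) of the shrink/expand cycle on a 3-page anonymous mapping:

#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	void *addr = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(addr != MAP_FAILED);
	/* Shrink in place: 3 pages -> 1 page; only the vma's end moves. */
	assert(mremap(addr, 3 * psz, psz, 0) != MAP_FAILED);
	/* Expand in place: 1 page -> 3 pages; succeeds because the
	 * just-released range behind the vma is still free. */
	assert(mremap(addr, psz, 3 * psz, 0) != MAP_FAILED);
	return 0;
}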
[PATCH v4 3/7] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
Test that /proc/pid/maps does not report unexpected holes in the address space when we concurrently remap a part of a vma into the middle of another vma. This remapping results in the destination vma being split into three parts and the part in the middle being patched back, all done concurrently from under the reader. We should always see either the original vma or the split one, with no holes. Signed-off-by: Suren Baghdasaryan --- tools/testing/selftests/proc/proc-pid-vm.c | 92 ++ 1 file changed, 92 insertions(+) diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c index 39842e4ec45f..1aef2db7e893 100644 --- a/tools/testing/selftests/proc/proc-pid-vm.c +++ b/tools/testing/selftests/proc/proc-pid-vm.c @@ -663,6 +663,95 @@ static void test_maps_tearing_from_resize(int maps_fd, signal_state(mod_info, TEST_DONE); } +static inline void remap_vma(const struct vma_modifier_info *mod_info) +{ + /* +* Remap the last page of the next vma into the middle of the vma. +* This splits the current vma and the first and middle parts (the +* parts at lower addresses) become the last vma observed in the +* first page and the first vma observed in the last page. +*/ + assert(mremap(mod_info->next_addr + page_size * 2, page_size, + page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP, + mod_info->addr + page_size) != MAP_FAILED); +} + +static inline void patch_vma(const struct vma_modifier_info *mod_info) +{ + assert(!mprotect(mod_info->addr + page_size, page_size, +mod_info->prot)); +} + +static inline void check_remap_result(struct line_content *mod_last_line, + struct line_content *mod_first_line, + struct line_content *restored_last_line, + struct line_content *restored_first_line) +{ + /* Make sure vmas at the boundaries are changing */ + assert(strcmp(mod_last_line->text, restored_last_line->text) != 0); + assert(strcmp(mod_first_line->text, restored_first_line->text) != 0); +} + +static void test_maps_tearing_from_remap(int maps_fd, + struct vma_modifier_info *mod_info, + struct page_content *page1, + struct page_content *page2, + struct line_content *last_line, + struct line_content *first_line) +{ + struct line_content remapped_last_line; + struct line_content remapped_first_line; + struct line_content restored_last_line; + struct line_content restored_first_line; + + wait_for_state(mod_info, SETUP_READY); + + /* re-read the file to avoid using stale data from previous test */ + read_boundary_lines(maps_fd, page1, page2, last_line, first_line); + + mod_info->vma_modify = remap_vma; + mod_info->vma_restore = patch_vma; + mod_info->vma_mod_check = check_remap_result; + + capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line, + &remapped_last_line, &remapped_first_line, + &restored_last_line, &restored_first_line); + + /* Now start concurrent modifications for test_duration_sec */ + signal_state(mod_info, TEST_READY); + + struct line_content new_last_line; + struct line_content new_first_line; + struct timespec start_ts, end_ts; + + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + do { + read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line); + + /* Check if we read vmas after remapping it */ + if (!strcmp(new_last_line.text, remapped_last_line.text)) { + /* +* The vmas should be consistent with remap results, +* however if the vma was concurrently restored, it +* can be reported twice (first as split one, then +* as restored one) because we found it as the next vma +* again. 
In that case new first line will be the same +* as the last restored line. +*/ + assert(!strcmp(new_first_line.text, remapped_first_line.text) || + !strcmp(new_first_line.text, restored_last_line.text)); + } else { + /* The vmas should be consistent with the original/restored state */ + assert(!strcmp(new_last_line.text, restored_last_line.text) && + !strcmp(new_first_line.text, restored_first_line.text)); + } + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); +
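The MREMAP_DONTUNMAP move that remap_vma() performs can also be reproduced standalone. A sketch (an editor's illustration, not test code) of moving one page into the middle of a 3-page mapping, which splits the destination vma into three while leaving the source address mapped:

#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *dst = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *src = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(dst != MAP_FAILED && src != MAP_FAILED);
	/* Move src over the middle page of dst: dst's single vma becomes
	 * three (low part, moved page, high part); src stays mapped but
	 * empty thanks to MREMAP_DONTUNMAP (Linux 5.7+). */
	assert(mremap(src, psz, psz,
		      MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
		      dst + psz) != MAP_FAILED);
	return 0;
}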
[PATCH v4 4/7] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
Extend /proc/pid/maps tearing test to verify PROCMAP_QUERY ioctl operation correctness while the vma is being concurrently modified. Signed-off-by: Suren Baghdasaryan --- tools/testing/selftests/proc/proc-pid-vm.c | 60 ++ 1 file changed, 60 insertions(+) diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c index 1aef2db7e893..b582f40851fb 100644 --- a/tools/testing/selftests/proc/proc-pid-vm.c +++ b/tools/testing/selftests/proc/proc-pid-vm.c @@ -486,6 +486,21 @@ static void capture_mod_pattern(int maps_fd, assert(strcmp(restored_first_line->text, first_line->text) == 0); } +static void query_addr_at(int maps_fd, void *addr, + unsigned long *vma_start, unsigned long *vma_end) +{ + struct procmap_query q; + + memset(&q, 0, sizeof(q)); + q.size = sizeof(q); + /* Find the VMA at the split address */ + q.query_addr = (unsigned long long)addr; + q.query_flags = 0; + assert(!ioctl(maps_fd, PROCMAP_QUERY, &q)); + *vma_start = q.vma_start; + *vma_end = q.vma_end; +} + static inline void split_vma(const struct vma_modifier_info *mod_info) { assert(mmap(mod_info->addr, page_size, mod_info->prot | PROT_EXEC, @@ -546,6 +561,8 @@ static void test_maps_tearing_from_split(int maps_fd, do { bool last_line_changed; bool first_line_changed; + unsigned long vma_start; + unsigned long vma_end; read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line); @@ -576,6 +593,19 @@ first_line_changed = strcmp(new_first_line.text, first_line->text) != 0; assert(last_line_changed == first_line_changed); + /* Check if PROCMAP_QUERY ioctl() finds the right VMA */ + query_addr_at(maps_fd, mod_info->addr + page_size, + &vma_start, &vma_end); + /* +* The vma at the split address can be either the same as +* original one (if read before the split) or the same as the +* first line in the second page (if read after the split). +*/ + assert((vma_start == last_line->start_addr && + vma_end == last_line->end_addr) || + (vma_start == split_first_line.start_addr && + vma_end == split_first_line.end_addr)); + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec); @@ -637,6 +667,9 @@ static void test_maps_tearing_from_resize(int maps_fd, clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); do { + unsigned long vma_start; + unsigned long vma_end; + read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line); /* Check if we read vmas after shrinking it */ @@ -656,6 +689,17 @@ assert(!strcmp(new_last_line.text, restored_last_line.text) && !strcmp(new_first_line.text, restored_first_line.text)); } + + /* Check if PROCMAP_QUERY ioctl() finds the right VMA */ + query_addr_at(maps_fd, mod_info->addr, &vma_start, &vma_end); + /* +* The vma should stay at the same address and have either the +* original size of 3 pages or 1 page if read after shrinking. 
+*/ + assert(vma_start == last_line->start_addr && + (vma_end - vma_start == page_size * 3 || + vma_end - vma_start == page_size)); + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts); + } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec); @@ -726,6 +770,9 @@ static void test_maps_tearing_from_remap(int maps_fd, clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); do { + unsigned long vma_start; + unsigned long vma_end; + read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line); /* Check if we read vmas after remapping it */ @@ -745,6 +792,19 @@ assert(!strcmp(new_last_line.text, restored_last_line.text) && !strcmp(new_first_line.text, restored_first_line.text)); } + + /* Check if PROCMAP_QUERY ioctl() finds the right VMA */ + query_addr_at(maps_fd, mod_info->addr + page_size, &vma_start, &vma_end); + /* +* The vma should either stay at the same address and have the +* origina
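The email is truncated above. The PROCMAP_QUERY usage shown in query_addr_at() also works as a standalone program. A minimal sketch (an editor's illustration; it assumes a kernel and UAPI headers providing the ioctl, v6.11 or later):

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

int main(void)
{
	struct procmap_query q;
	int fd = open("/proc/self/maps", O_RDONLY);

	if (fd < 0)
		return 1;
	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);
	q.query_flags = 0;	/* only a vma covering the address matches */
	q.query_addr = (unsigned long long)(void *)&q;	/* a stack address */
	if (ioctl(fd, PROCMAP_QUERY, &q))
		return 1;
	printf("stack vma: %llx-%llx\n",
	       (unsigned long long)q.vma_start,
	       (unsigned long long)q.vma_end);
	return 0;
}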
[PATCH v4 5/7] selftests/proc: add verbose mode for tests to facilitate debugging
Add verbose mode to the proc tests to print debugging information. Usage: proc-pid-vm -v Signed-off-by: Suren Baghdasaryan --- tools/testing/selftests/proc/proc-pid-vm.c | 154 +++-- 1 file changed, 141 insertions(+), 13 deletions(-) diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c index b582f40851fb..97017f48cd70 100644 --- a/tools/testing/selftests/proc/proc-pid-vm.c +++ b/tools/testing/selftests/proc/proc-pid-vm.c @@ -73,6 +73,7 @@ static void make_private_tmp(void) } static unsigned long test_duration_sec = 5UL; +static bool verbose; static int page_size; static pid_t pid = -1; static void ate(void) @@ -452,6 +453,99 @@ static void stop_vma_modifier(struct vma_modifier_info *mod_info) signal_state(mod_info, SETUP_MODIFY_MAPS); } +static void print_first_lines(char *text, int nr) +{ + const char *end = text; + + while (nr && (end = strchr(end, '\n')) != NULL) { + nr--; + end++; + } + + if (end) { + int offs = end - text; + + text[offs] = '\0'; + printf(text); + text[offs] = '\n'; + printf("\n"); + } else { + printf(text); + } +} + +static void print_last_lines(char *text, int nr) +{ + const char *start = text + strlen(text); + + nr++; /* to ignore the last newline */ + while (nr) { + while (start > text && *start != '\n') + start--; + nr--; + start--; + } + printf(start); +} + +static void print_boundaries(const char *title, +struct page_content *page1, +struct page_content *page2) +{ + if (!verbose) + return; + + printf("%s", title); + /* Print 3 boundary lines from each page */ + print_last_lines(page1->data, 3); + printf("-page boundary-\n"); + print_first_lines(page2->data, 3); +} + +static bool print_boundaries_on(bool condition, const char *title, + struct page_content *page1, + struct page_content *page2) +{ + if (verbose && condition) + print_boundaries(title, page1, page2); + + return condition; +} + +static void report_test_start(const char *name) +{ + if (verbose) + printf(" %s \n", name); +} + +static struct timespec print_ts; + +static void start_test_loop(struct timespec *ts) +{ + if (verbose) + print_ts.tv_sec = ts->tv_sec; +} + +static void end_test_iteration(struct timespec *ts) +{ + if (!verbose) + return; + + /* Update every second */ + if (print_ts.tv_sec == ts->tv_sec) + return; + + printf("."); + fflush(stdout); + print_ts.tv_sec = ts->tv_sec; +} + +static void end_test_loop(void) +{ + if (verbose) + printf("\n"); +} + static void capture_mod_pattern(int maps_fd, struct vma_modifier_info *mod_info, struct page_content *page1, @@ -463,18 +557,24 @@ static void capture_mod_pattern(int maps_fd, struct line_content *restored_last_line, struct line_content *restored_first_line) { + print_boundaries("Before modification", page1, page2); + signal_state(mod_info, SETUP_MODIFY_MAPS); wait_for_state(mod_info, SETUP_MAPS_MODIFIED); /* Copy last line of the first page and first line of the last page */ read_boundary_lines(maps_fd, page1, page2, mod_last_line, mod_first_line); + print_boundaries("After modification", page1, page2); + signal_state(mod_info, SETUP_RESTORE_MAPS); wait_for_state(mod_info, SETUP_MAPS_RESTORED); /* Copy last line of the first page and first line of the last page */ read_boundary_lines(maps_fd, page1, page2, restored_last_line, restored_first_line); + print_boundaries("After restore", page1, page2); + mod_info->vma_mod_check(mod_last_line, mod_first_line, restored_last_line, restored_first_line); @@ -546,6 +646,7 @@ static void test_maps_tearing_from_split(int maps_fd, mod_info->vma_restore = 
merge_vma; mod_info->vma_mod_check = check_split_result; + report_test_start("Tearing from split"); capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line, &split_last_line, &split_first_line, &restored_last_line, &restored_first_line); @@ -558,6 +659,7 @@ static void test_maps_tearing_from_split(int maps_fd, struct timespec start_ts, end_ts; clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts); + start_test_loop(&sta
[PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
With maple_tree supporting vma tree traversal under RCU and per-vma locks, /proc/pid/maps can be read while holding individual vma locks instead of locking the entire address space. A completely lockless approach would be quite complex, the main issue being that get_vma_name() uses callbacks which might not work correctly with a stable vma copy and require the original (unstable) vma. When per-vma lock acquisition fails, we take the mmap_lock for reading, lock the vma, release the mmap_lock and continue. This guarantees that the reader makes forward progress even during lock contention. This will interfere with the writer, but only for a very short time while we are acquiring the per-vma lock, and only when there is contention on the vma the reader is interested in. One case requiring special handling is when a vma changes between the time it was found and the time it got locked. A problematic case would be if the vma got shrunk so that its start moved higher in the address space and a new vma was installed at the beginning:

reader found:          |--------VMA A--------|
VMA is modified:       |-VMA B-|----VMA A----|
reader locks modified VMA A
reader reports VMA A:  |  gap  |----VMA A----|

This would result in reporting a gap in the address space that does not exist. To prevent this we retry the lookup after locking the vma, but only when we identify a gap and detect that the address space was changed after we found the vma. This change is designed to reduce mmap_lock contention and prevent a process reading /proc/pid/maps files (often a low-priority task, such as monitoring/data collection services) from blocking address space updates. Note that this change has a userspace-visible disadvantage: it allows for sub-page data tearing as opposed to the previous mechanism, where data tearing could happen only between pages of generated output data. Since current userspace considers data tearing between pages to be acceptable, we assume it will be able to handle sub-page data tearing as well. Signed-off-by: Suren Baghdasaryan --- fs/proc/internal.h | 6 ++ fs/proc/task_mmu.c | 177 +++-- 2 files changed, 175 insertions(+), 8 deletions(-) diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 96122e91c645..3728c9012687 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -379,6 +379,12 @@ struct proc_maps_private { struct task_struct *task; struct mm_struct *mm; struct vma_iterator iter; + loff_t last_pos; +#ifdef CONFIG_PER_VMA_LOCK + bool mmap_locked; + unsigned int mm_wr_seq; + struct vm_area_struct *locked_vma; +#endif #ifdef CONFIG_NUMA struct mempolicy *task_mempolicy; #endif diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 27972c0749e7..36d883c4f394 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv) } #endif -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv, - loff_t *ppos) +#ifdef CONFIG_PER_VMA_LOCK + +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv, + struct vm_area_struct *vma, + unsigned long last_pos, + bool mm_unstable) +{ + vma = vma_start_read(priv->mm, vma); + if (IS_ERR_OR_NULL(vma)) + return NULL; + + /* Check if the vma we locked is the right one. */ + if (unlikely(vma->vm_mm != priv->mm)) + goto err; + + /* vma should not be ahead of the last search position. 
*/ + if (unlikely(last_pos >= vma->vm_end)) + goto err; + + /* +* vma ahead of last search position is possible but we need to +* verify that it was not shrunk after we found it, and another +* vma has not been installed ahead of it. Otherwise we might +* observe a gap that should not be there. +*/ + if (mm_unstable && last_pos < vma->vm_start) { + /* Verify only if the address space changed since vma lookup. */ + if ((priv->mm_wr_seq & 1) || + mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) { + vma_iter_init(&priv->iter, priv->mm, last_pos); + if (vma != vma_next(&priv->iter)) + goto err; + } + } + + priv->locked_vma = vma; + + return vma; +err: + vma_end_read(vma); + return NULL; +} + + +static void unlock_vma(struct proc_maps_private *priv) +{ + if (priv->locked_vma) { + vma_end_read(priv->locked_vma); + priv->locked_vma = NULL; + } +} + +static const struct seq_operations proc_pid_maps_op; + +static inline bool lock_content(struct seq_file *m, +
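The diff is cut off above, but the changelog fully describes the locking fallback. A simplified sketch of the shape of the lookup path, built on the helpers shown (the actual patch additionally tracks priv->mmap_locked, and the retry-on-gap logic lives in trylock_vma() as quoted):

static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
					   unsigned long last_pos)
{
	struct vm_area_struct *vma;

	vma = vma_next(&priv->iter);
	if (!vma)
		return NULL;

	/* Fast path: optimistic per-vma lock under RCU. */
	if (trylock_vma(priv, vma, last_pos, /* mm_unstable = */ true))
		return vma;

	/* Slow path: take mmap_lock just long enough to lock the vma,
	 * guaranteeing forward progress under contention. */
	if (mmap_read_lock_killable(priv->mm))
		return ERR_PTR(-EINTR);
	vma_iter_init(&priv->iter, priv->mm, last_pos);
	vma = vma_next(&priv->iter);
	if (vma)
		vma = trylock_vma(priv, vma, last_pos, /* mm_unstable = */ false);
	mmap_read_unlock(priv->mm);

	return vma;
}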
[PATCH v4 7/7] mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks
Utilize per-vma locks to stabilize vma after lookup without taking mmap_lock during PROCMAP_QUERY ioctl execution. While we might take mmap_lock for reading during contention, we do that momentarily only to lock the vma. This change is designed to reduce mmap_lock contention and prevent PROCMAP_QUERY ioctl calls from blocking address space updates. Signed-off-by: Suren Baghdasaryan --- fs/proc/task_mmu.c | 56 -- 1 file changed, 44 insertions(+), 12 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 36d883c4f394..93ba35a84975 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -550,28 +550,60 @@ static int pid_maps_open(struct inode *inode, struct file *file) PROCMAP_QUERY_VMA_FLAGS \ ) -static int query_vma_setup(struct mm_struct *mm) +#ifdef CONFIG_PER_VMA_LOCK + +static int query_vma_setup(struct proc_maps_private *priv) { - return mmap_read_lock_killable(mm); + rcu_read_lock(); + priv->locked_vma = NULL; + priv->mmap_locked = false; + + return 0; } -static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) +static void query_vma_teardown(struct proc_maps_private *priv) { - mmap_read_unlock(mm); + unlock_vma(priv); + rcu_read_unlock(); +} + +static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv, +unsigned long addr) +{ + vma_iter_init(&priv->iter, priv->mm, addr); + return get_next_vma(priv, addr); +} + +#else /* CONFIG_PER_VMA_LOCK */ + +static int query_vma_setup(struct proc_maps_private *priv) +{ + return mmap_read_lock_killable(priv->mm); +} + +static void query_vma_teardown(struct proc_maps_private *priv) +{ + mmap_read_unlock(priv->mm); } -static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) +static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv, +unsigned long addr) { - return find_vma(mm, addr); + return find_vma(priv->mm, addr); } -static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, +#endif /* CONFIG_PER_VMA_LOCK */ + +static struct vm_area_struct *query_matching_vma(struct proc_maps_private *priv, unsigned long addr, u32 flags) { struct vm_area_struct *vma; next_vma: - vma = query_vma_find_by_addr(mm, addr); + vma = query_vma_find_by_addr(priv, addr); + if (IS_ERR(vma)) + return vma; + if (!vma) goto no_vma; @@ -647,13 +679,13 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) if (!mm || !mmget_not_zero(mm)) return -ESRCH; - err = query_vma_setup(mm); + err = query_vma_setup(priv); if (err) { mmput(mm); return err; } - vma = query_matching_vma(mm, karg.query_addr, karg.query_flags); + vma = query_matching_vma(priv, karg.query_addr, karg.query_flags); if (IS_ERR(vma)) { err = PTR_ERR(vma); vma = NULL; @@ -738,7 +770,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) } /* unlock vma or mmap_lock, and put mm_struct before copying data to user */ - query_vma_teardown(mm, vma); + query_vma_teardown(priv); mmput(mm); if (karg.vma_name_size && copy_to_user(u64_to_user_ptr(karg.vma_name_addr), @@ -758,7 +790,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) return 0; out: - query_vma_teardown(mm, vma); + query_vma_teardown(priv); mmput(mm); kfree(name_buf); return err; -- 2.49.0.1266.g31b7d2e469-goog
Re: [PATCH v2 35/62] objtool: Refactor add_jump_destinations()
On Fri, May 23, 2025 at 07:46:19AM -0400, Joe Lawrence wrote: > > - } else if (reloc->sym->sec->idx) { > > - dest_sec = reloc->sym->sec; > > - dest_off = reloc->sym->sym.st_value + > > - arch_dest_reloc_offset(reloc_addend(reloc)); > > Way back in ("[PATCH v2 18/62] objtool: Fix x86 addend calculation"), > arch_dest_reloc_offset() was replaced with arch_insn_adjusted_addend(), > so I think that patch missed this callsite and breaks bisectability. Fixed, thanks. -- Josh
Re: [PATCH v2 18/62] objtool: Fix x86 addend calculation
On Mon, May 26, 2025 at 12:23:15PM +0200, Peter Zijlstra wrote: > On Fri, May 09, 2025 at 01:16:42PM -0700, Josh Poimboeuf wrote: > > On x86, arch_dest_reloc_offset() hardcodes the addend adjustment to > > four, but the actual adjustment depends on the relocation type. Fix > > that. > > > +s64 arch_insn_adjusted_addend(struct instruction *insn, struct reloc > > *reloc) > > { > > - return addend + 4; > > + s64 addend = reloc_addend(reloc); > > + > > + switch (reloc_type(reloc)) { > > + case R_X86_64_PC32: > > + case R_X86_64_PLT32: > > + addend += insn->offset + insn->len - reloc_offset(reloc); > > + break; > > + default: > > + break; > > + } > > + > > + return addend; > > } > > Should this not be something like: > > s64 arch_insn_adjusted_addend(struct instruction *insn, struct reloc *reloc) > { > s64 addend = reloc_addend(reloc); > > if (arch_pc_relative_reloc(reloc)) > addend += insn->offset + insn->len - reloc_offset(reloc); > > return addend; > } > > instead? > > AFAIU arch_pc_relative_reloc() is the exact same set of relocations. Yeah that's better, thanks. -- Josh
Re: [PATCH v2 28/62] objtool: Fix weak symbol hole detection for .cold functions
On Mon, May 26, 2025 at 12:38:22PM +0200, Peter Zijlstra wrote: > On Fri, May 09, 2025 at 01:16:52PM -0700, Josh Poimboeuf wrote: > > When ignore_unreachable_insn() looks for weak function holes which jump > > to their .cold functions, it assumes the parent function comes before > > the corresponding .cold function in the symbol table. That's not > > necessarily the case with -ffunction-sections. > > > > Mark all the holes beforehand (including .cold functions) so the > > ordering of the discovery doesn't matter. > > One of the things I have a 'todo' entry on, is rewriting all sections > that reference any one of these instructions. > > That is, things like fentry, alternatives, retpoline, static_call, > jump_label. Everything that can cause runtime code patching. > > Once we are sure none of those sections will contain references to this > dead code, we can go and wipe the actual code. Perhaps fill it with a > UD1 instruction with some identifying immediate. Yeah, that would be nice. -- Josh
Re: [PATCH v2 29/62] objtool: Mark prefix functions
On Mon, May 26, 2025 at 12:43:55PM +0200, Peter Zijlstra wrote: > > +++ b/tools/objtool/elf.c > > @@ -442,6 +442,11 @@ static void elf_add_symbol(struct elf *elf, struct > > symbol *sym) > > elf_hash_add(symbol, &sym->hash, sym->idx); > > elf_hash_add(symbol_name, &sym->name_hash, str_hash(sym->name)); > > > > + if (is_func_sym(sym) && sym->len == 16 && > > Where did that 'sym->len == 16' thing come from? I mean, they are, but > the old code didn't assert that. > > I would rather objtool issue a warn if not 16, but still consider these > as prefix. Ok: diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c index 9a1fc0392b7f..2953aa8d2b81 100644 --- a/tools/objtool/elf.c +++ b/tools/objtool/elf.c @@ -442,10 +442,12 @@ static void elf_add_symbol(struct elf *elf, struct symbol *sym) elf_hash_add(symbol, &sym->hash, sym->idx); elf_hash_add(symbol_name, &sym->name_hash, str_hash(sym->name)); - if (is_func_sym(sym) && sym->len == 16 && - (strstarts(sym->name, "__pfx") || strstarts(sym->name, "__cfi_"))) + if (is_func_sym(sym) && + (strstarts(sym->name, "__pfx_") || strstarts(sym->name, "__cfi_"))) { + if (sym->len != 16) + WARN("%s size %d != 16", sym->name, sym->len); sym->prefix = 1; - + } if (is_func_sym(sym) && strstr(sym->name, ".cold")) sym->cold = 1;
Re: [PATCH v2 32/62] objtool: Suppress section skipping warnings with --dryrun
On Wed, May 28, 2025 at 12:34:53PM +0200, Peter Zijlstra wrote: > On Mon, May 26, 2025 at 12:52:40PM +0200, Peter Zijlstra wrote: > > On Fri, May 09, 2025 at 01:16:56PM -0700, Josh Poimboeuf wrote: > > > It's common to use --dryrun on binaries that have already been > > > processed. Don't print the section skipping warnings in that case. > > > > Ah, I rather like this warning, it gives me an easy check to see if the > > file has already been processed. > > > > I typically do a OBJTOOL_ARGS="--backup" build and run dryrun debug > > sessions against those .orig files. > > Turns out, you already broke this.. :-( > > I'm now having a case where objtool fails on vmlinux.o and make happily > deletes vmlinux.o and I'm left empty handed. > > Let me go resurrect --backup Yeah, as I just mentioned in that other email, --verbose should give you what you need. It also prints the cmdline args, which is nice. But also, feel free to resurrect --backup, or you can yell at me to do it as the backup code changed a bit. -- Josh
Re: [PATCH v2 42/62] kbuild,x86: Fix module permissions for __jump_table and __bug_table
On Mon, May 26, 2025 at 01:06:34PM +0200, Peter Zijlstra wrote: > On Fri, May 09, 2025 at 01:17:06PM -0700, Josh Poimboeuf wrote: > > An upcoming patch will add the SHF_MERGE flag to x86 __jump_table and > > __bug_table so their entry sizes can be defined in inline asm. > > > > However, those sections have SHF_WRITE, which the Clang linker (lld) > > explicitly forbids combining with SHF_MERGE. > > > > Those sections are modified at runtime and must remain writable. While > > SHF_WRITE is ignored by vmlinux, it's still needed for modules. > > > > To work around the linker interference, remove SHF_WRITE during > > compilation and restore it after linking the module. > > *groan* > > This and the following patches marking a whole bunch of sections M, > seems to suggest you're going to rely on sh_entsize actually working. > > There was an ld.lld bug, and IIRC you need to enforce llvm-20 or later > if you want this to be so. Hm, ISTR this working with clang 18, I'll go test that again. -- Josh
Re: [PATCH v2 52/62] objtool/klp: Introduce klp diff subcommand for diffing object files
On Mon, May 26, 2025 at 08:22:59PM +0200, Peter Zijlstra wrote: > On Fri, May 09, 2025 at 01:17:16PM -0700, Josh Poimboeuf wrote: > > > Without '-ffunction-sections -fdata-sections', reliable object diffing > > would be infeasible due to toolchain limitations: > > > > - For intra-file+intra-section references, the compiler might > > occasionally generated hard-coded instruction offsets instead of > > relocations. > > > > - Section-symbol-based references can be ambiguous: > > > > - Overlapping or zero-length symbols create ambiguity as to which > > symbol is being referenced. > > > > - A reference to the end of a symbol (e.g., checking array bounds) > > can be misinterpreted as a reference to the next symbol, or vice > > versa. > > > > A potential future alternative to '-ffunction-sections -fdata-sections' > > would be to introduce a toolchain option that forces symbol-based > > (non-section) relocations. > > Urgh.. So the first issue we can fix with objtool, but the ambiguous > cases are indeed very hard to fix up in post. > > Did you already talk to toolchain people about this? For now, I want to stick with -ffunction-sections -fdata-sections, as that's what kpatch has done for 10+ years and it works well. That's the only option we have for current compilers anyway. The above mentioned possibility of diffing without -ffunction-sections -fdata-sections is theoretical, and needs more exploration. If it indeed works then we can try to get toolchain support for that. -- Josh
Re: [PATCH net] virtio-net: drop the multi-buffer XDP packet in zerocopy
On Wed, Jun 4, 2025 at 10:17 PM Bui Quang Minh wrote: > > On 6/4/25 07:37, Jason Wang wrote: > > On Tue, Jun 3, 2025 at 11:07 PM Bui Quang Minh > > wrote: > >> In virtio-net, we have not yet supported multi-buffer XDP packet in > >> zerocopy mode when there is a binding XDP program. However, in that > >> case, when receiving multi-buffer XDP packet, we skip the XDP program > >> and return XDP_PASS. As a result, the packet is passed to normal network > >> stack which is an incorrect behavior. This commit instead returns > >> XDP_DROP in that case. > >> > >> Fixes: 99c861b44eb1 ("virtio_net: xsk: rx: support recv merge mode") > >> Cc: sta...@vger.kernel.org > >> Signed-off-by: Bui Quang Minh > >> --- > >> drivers/net/virtio_net.c | 11 --- > >> 1 file changed, 8 insertions(+), 3 deletions(-) > >> > >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > >> index e53ba600605a..4c35324d6e5b 100644 > >> --- a/drivers/net/virtio_net.c > >> +++ b/drivers/net/virtio_net.c > >> @@ -1309,9 +1309,14 @@ static struct sk_buff > >> *virtnet_receive_xsk_merge(struct net_device *dev, struct > >> ret = XDP_PASS; > > It would be simpler to just assign XDP_DROP here? > > > > Or if you wish to stick to the way, we can simply remove this assignment. > > This XDP_PASS is returned for the case when there is no XDP program > binding (!prog). You're right. Acked-by: Jason Wang Thanks > > > > >> rcu_read_lock(); > >> prog = rcu_dereference(rq->xdp_prog); > >> - /* TODO: support multi buffer. */ > >> - if (prog && num_buf == 1) > >> - ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats); > >> + if (prog) { > >> + /* TODO: support multi buffer. */ > >> + if (num_buf == 1) > >> + ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, > >> + stats); > >> + else > >> + ret = XDP_DROP; > >> + } > >> rcu_read_unlock(); > >> > >> switch (ret) { > >> -- > >> 2.43.0 > >> > > Thanks > > > > > Thanks, > Quang Minh. > >
Re: [PATCH v2 32/62] objtool: Suppress section skipping warnings with --dryrun
On Mon, May 26, 2025 at 12:52:40PM +0200, Peter Zijlstra wrote: > On Fri, May 09, 2025 at 01:16:56PM -0700, Josh Poimboeuf wrote: > > It's common to use --dryrun on binaries that have already been > > processed. Don't print the section skipping warnings in that case. > > Ah, I rather like this warning, it gives me an easy check to see if the > file has already been processed. > > I typically do a OBJTOOL_ARGS="--backup" build and run dryrun debug > sessions against those .orig files. Ok. Though, note that as of a few months ago, --backup no longer exists. A backup is now automatically created with --verbose. But we can revive it if you want. -- Josh
Re: [PATCH v4 4/7] cxl/acpi: Add background worker to wait for cxl_pci and cxl_mem probe
On 04/06/2025 22:35, Dave Jiang wrote: >>> >>> +static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue); >> Given that this file (suspend.c) focuses on power management functions, >> it might be more appropriate to move the wait queue declaration and its >> related changes to acpi.c in where the its caller is. > You mean drivers/cxl/acpi.c and not drivers/cxl/core/acpi.c right? Yes > The core one is my mistake and I'm going to create a patch to remove that. Completely missed its existence - thank you for catching and cleaning that up! Thanks Zhijian > > DJ
Re: [PATCH vhost] vdpa/mlx5: Fix needs_teardown flag calculation
On Thu, Jun 5, 2025 at 2:48 AM Dragos Tatulea wrote: > > needs_teardown is a device flag that indicates when virtual queues need > to be recreated. This happens for certain configuration changes: queue > size and some specific features. > > Currently, the needs_teardown state can be incorrectly reset by > subsequent .set_vq_num() calls. For example, for 1 rx VQ with size 512 > and 1 tx VQ with size 256: > > .set_vq_num(0, 512) -> sets needs_teardown to true (rx queue has a >non-default size) > .set_vq_num(1, 256) -> sets needs_teardown to false (tx queue has a >default size) > > This change takes into account the previous value of the needs_teardown > flag when re-calculating it during VQ size configuration. > > Fixes: 0fe963d6fc16 ("vdpa/mlx5: Re-create HW VQs under certain conditions") > Signed-off-by: Dragos Tatulea > Reviewed-by: Shahar Shitrit > Reviewed-by: Si-Wei Liu > Tested-by: Si-Wei Liu > --- Acked-by: Jason Wang Thanks
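The bug shape described in the changelog is compact enough to show directly. A sketch (field and helper names are illustrative, not the driver's actual ones) of recomputing the flag from scratch versus accumulating it, which is what the fix does:

#include <stdbool.h>

struct dev_state {
	bool needs_teardown;
};

/* Buggy: each call overwrites the flag, so a default-sized tx queue
 * configured after a non-default rx queue clears the earlier result. */
static void set_vq_num_buggy(struct dev_state *d, int size, int def_size)
{
	d->needs_teardown = (size != def_size);
}

/* Fixed: fold the previous value in, so one non-default queue suffices. */
static void set_vq_num_fixed(struct dev_state *d, int size, int def_size)
{
	d->needs_teardown |= (size != def_size);
}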
Re: [PATCH v2 42/62] kbuild,x86: Fix module permissions for __jump_table and __bug_table
On Wed, Jun 04, 2025 at 05:22:15PM -0700, Josh Poimboeuf wrote: > On Mon, May 26, 2025 at 01:06:34PM +0200, Peter Zijlstra wrote: > > On Fri, May 09, 2025 at 01:17:06PM -0700, Josh Poimboeuf wrote: > > > An upcoming patch will add the SHF_MERGE flag to x86 __jump_table and > > > __bug_table so their entry sizes can be defined in inline asm. > > > > > > However, those sections have SHF_WRITE, which the Clang linker (lld) > > > explicitly forbids combining with SHF_MERGE. > > > > > > Those sections are modified at runtime and must remain writable. While > > > SHF_WRITE is ignored by vmlinux, it's still needed for modules. > > > > > > To work around the linker interference, remove SHF_WRITE during > > > compilation and restore it after linking the module. > > > > *groan* > > > > This and the following patches marking a whole bunch of sections M, > > seems to suggest you're going to rely on sh_entsize actually working. > > > > There was an ld.lld bug, and IIRC you need to enforce llvm-20 or later > > if you want this to be so. > > Hm, ISTR this working with clang 18, I'll go test that again. You're right, looks like sh_entsize is getting cleared by the linker with my Clang 18. I guess I tested with newer Clang. "objtool klp diff" fails with: vmlinux.o: error: objtool: .discard.annotate_insn: unknown entry size So yeah, non-buggy linker is already being enforced, though I should probably make the error more human friendly. -- Josh
Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On Wed, Jun 04, 2025 at 03:19:58PM +0100, Lorenzo Stoakes wrote: > The walk_page_range_novma() function is rather confusing - it supports two > modes, one used often, the other used only for debugging. > > The first mode is the common case of traversal of kernel page tables, which > is what nearly all callers use this for. > > Secondly it provides an unusual debugging interface that allows for the > traversal of page tables in a userland range of memory even for that memory > which is not described by a VMA. > > It is far from certain that such page tables should even exist, but perhaps > this is precisely why it is useful as a debugging mechanism. > > As a result, this is utilised by ptdump only. Historically, things were > reversed - ptdump was the only user, and other parts of the kernel evolved > to use the kernel page table walking here. > > Since we have some complicated and confusing locking rules for the novma > case, it makes sense to separate the two usages into their own functions. > > Doing this also provides self-documentation as to the intent of the caller - > are they doing something rather unusual or are they simply doing a standard > kernel page table walk? > > We therefore establish two separate functions - walk_page_range_debug() for > this single usage, and walk_kernel_page_table_range() for general kernel > page table walking. > > We additionally make walk_page_range_debug() internal to mm. > > Note that ptdump uses the precise same function for kernel walking as a > convenience, so we permit this but make it very explicit by having > walk_page_range_debug() invoke walk_kernel_page_table_range() in this case. > > Signed-off-by: Lorenzo Stoakes > Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE Labs
Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
On 6/4/25 10:19 PM, Lorenzo Stoakes wrote: > The walk_page_range_novma() function is rather confusing - it supports two modes, one used often, the other used only for debugging. > The first mode is the common case of traversal of kernel page tables, which is what nearly all callers use this for. > Secondly it provides an unusual debugging interface that allows for the traversal of page tables in a userland range of memory even for that memory which is not described by a VMA. > It is far from certain that such page tables should even exist, but perhaps this is precisely why it is useful as a debugging mechanism. > As a result, this is utilised by ptdump only. Historically, things were reversed - ptdump was the only user, and other parts of the kernel evolved to use the kernel page table walking here. > Since we have some complicated and confusing locking rules for the novma case, it makes sense to separate the two usages into their own functions. > Doing this also provides self-documentation as to the intent of the caller - are they doing something rather unusual or are they simply doing a standard kernel page table walk? > We therefore establish two separate functions - walk_page_range_debug() for this single usage, and walk_kernel_page_table_range() for general kernel page table walking. > We additionally make walk_page_range_debug() internal to mm. > Note that ptdump uses the precise same function for kernel walking as a convenience, so we permit this but make it very explicit by having walk_page_range_debug() invoke walk_kernel_page_table_range() in this case. > Signed-off-by: Lorenzo Stoakes > Acked-by: Mike Rapoport (Microsoft) > --- > v2: * Renamed walk_page_range_novma() to walk_page_range_debug() as per David. * Moved walk_page_range_debug() definition to mm/internal.h as per Mike. * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as per David. > v1 resend: * Actually cc'd lists... * Fixed mistake in walk_page_range_novma() not handling kernel mappings and update commit message to reference. * Added Mike's off-list Acked-by. * Fixed up comments as per Mike. * Add some historic flavour to the commit message as per Mike. https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoa...@oracle.com/ > v1: (accidentally sent off-list due to error in scripting) > arch/loongarch/mm/pageattr.c | 2 +- arch/openrisc/kernel/dma.c | 4 +- arch/riscv/mm/pageattr.c | 8 +-- include/linux/pagewalk.h | 7 ++- mm/hugetlb_vmemmap.c | 2 +- mm/internal.h| 4 ++ mm/pagewalk.c| 98 mm/ptdump.c | 3 +- 8 files changed, 82 insertions(+), 46 deletions(-) Acked-by: Qi Zheng Thanks.
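For readers following the thread, the resulting API split can be sketched as two declarations. These signatures are inferred from the changelog and the existing walk_page_range_novma(); details may differ from the final patch:

/* include/linux/pagewalk.h: the common case, kernel page tables only. */
int walk_kernel_page_table_range(unsigned long start, unsigned long end,
				 const struct mm_walk_ops *ops,
				 pgd_t *pgd, void *private);

/* mm/internal.h: ptdump's unusual VMA-less walk of userland ranges. */
int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
			  unsigned long end, const struct mm_walk_ops *ops,
			  pgd_t *pgd, void *private);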
[PATCH v2] module: Make sure relocations are applied to the per-CPU section
The per-CPU data section is handled differently from the other sections. The memory allocation requires a special __percpu pointer and then the section is copied into the view of each CPU. Therefore the SHF_ALLOC flag is removed to ensure move_module() skips it. Later, relocations are applied and apply_relocations() skips sections without SHF_ALLOC because they have not been copied. This also skips the per-CPU data section. The missing relocations result in a NULL pointer on x86-64 and very small values on x86-32. This results in a crash because such a pointer is not skipped the way a NULL pointer would be, yet it can't be dereferenced either. Such an assignment happens during static per-CPU lock initialisation with lockdep enabled. Add the SHF_ALLOC flag back for the per-CPU section (if found) after move_module(). Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202506041623.e45e4f7d-...@intel.com Fixes: 8d8022e8aba85 ("module: do percpu allocation after uniqueness check. No, really!") Signed-off-by: Sebastian Andrzej Siewior --- v1…v2: https://lore.kernel.org/all/20250604152707.cied9...@linutronix.de/ - Add the flag back only on SMP if the per-CPU section was found. kernel/module/main.c | 4 1 file changed, 4 insertions(+) diff --git a/kernel/module/main.c b/kernel/module/main.c index 5c6ab20240a6d..4f6554dedf8ea 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -2816,6 +2816,10 @@ static struct module *layout_and_allocate(struct load_info *info, int flags) if (err) return ERR_PTR(err); + /* Add SHF_ALLOC back so that relocations are applied. */ + if (IS_ENABLED(CONFIG_SMP) && info->index.pcpu) + info->sechdrs[info->index.pcpu].sh_flags |= SHF_ALLOC; + /* Module has been copied to its final place now: return it. */ mod = (void *)info->sechdrs[info->index.mod].sh_addr; kmemleak_load_module(mod, info); -- 2.49.0
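Why the flag matters is visible in the relocation pass itself. A simplified excerpt-style sketch, modeled on apply_relocations() in kernel/module/main.c (not verbatim kernel code):

/* Now do relocations. */
for (i = 1; i < info->hdr->e_shnum; i++) {
	unsigned int infosec = info->sechdrs[i].sh_info;

	/* Not a valid relocation section target? */
	if (infosec >= info->hdr->e_shnum)
		continue;

	/* Don't bother with non-allocated sections: without SHF_ALLOC,
	 * the per-CPU section was skipped here and kept stale offsets. */
	if (!(info->sechdrs[infosec].sh_flags & SHF_ALLOC))
		continue;

	/* ... apply_relocate()/apply_relocate_add() on section i ... */
}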
Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes
On Wed, 04 Jun 2025 21:58:48 +0100, Sebastian Ott wrote: > > Hi Zenghui, > > On Tue, 3 Jun 2025, Zenghui Yu wrote: > > On 2025/5/27 22:24, Sebastian Ott wrote: > >> Some small fixes for arch_timer_edge_cases that I stumbled upon > >> while debugging failures for this selftest on ampere-one. > >> > >> Changes since v1: modified patch 3 based on suggestions from Marc. > >> > >> I've done some tests with this on various machines - seems to be all > >> good, however on ampere-one I now hit this in 10% of the runs: > >> Test Assertion Failure > >> arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + > >> (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000) > >> pid=166657 tid=166657 errno=4 - Interrupted system call > >> 1 0x00404db3: test_run at arch_timer_edge_cases.c:933 > >> 2 0x00401f9f: main at arch_timer_edge_cases.c:1062 > >> 3 0xaedd625b: ?? ??:0 > >> 4 0xaedd633b: ?? ??:0 > >> 5 0x004020af: _start at ??:? > >> timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms) > >> > >> This is not new, it was just hidden behind the other failure. I'll > >> try to figure out what this is about (seems to be independent of > >> the wait time).. > > > > Not sure if you have figured it out. I can easily reproduce it on my box > > and I *guess* it is that we have some random XVAL values when we enable > > the timer.. > > Yes, I think so, too. > > > test_reprogramming_timer() > > { > > local_irq_disable(); > > reset_timer_state(timer, DEF_CNT); > > My first attempt was to also initialize cval here Note that CVAL and TVAL are two views of the same thing. It should be enough to initialise one of them (though TVAL is stupidly narrow). > > > > > /* Program the timer to DEF_CNT + delta_1_ms. */ > > set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE); > > > > [...] > > } > > > > set_tval_irq() > > { > > timer_set_ctl(timer, ctl); > > > > // There is a window that we enable the timer with *random* XVAL > > // values and we may get the unexpected interrupt.. And it's > > // unlikely that KVM can be aware of TVAL's change (and > > // re-evaluate the interrupt's pending state) before hitting the > > // GUEST_ASSERT(). > > > > timer_set_tval(timer, tval_cycles); > > Yes, I stumbled over this as well. I've always assumed that this order is > because of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0, > the value returned is UNKNOWN." However re-reading that part today I > realized that this only concerns register reads. > > Maybe someone on cc knows why it's in that order? I can't think of a valid reason. Enabling the timer without having set the deadline first is just silly. > I'm currently testing this with the above swapped and it's looking good, > so far. The swapped version (set xVAL, then enable the timer) is the only one that makes any sense. Thanks, M. -- Jazz isn't dead. It just smells funny.
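The swapped ordering under discussion is a two-line change. A sketch in the test's own helper vocabulary (the signature is assumed from the quoted snippet, not taken from the final patch):

static void set_tval_irq(enum arch_timer timer, uint64_t tval_cycles,
			 uint32_t ctl)
{
	/* Program the deadline first, so the timer is never armed with a
	 * stale or random compare value... */
	timer_set_tval(timer, tval_cycles);
	/* ...and only then enable it. */
	timer_set_ctl(timer, ctl);
}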