date:20250604

Smita,

Thanks for your awesome work. I just tested the scenarios you listed, and they 
work as expected. Thanks again.
(Minor comments inlined)

Tested-by: Li Zhijian 


To the CXL community,

The scenarios mentioned here essentially cover what a correct firmware may 
provide. However,
I would like to discuss one more scenario that I can simulate with a modified 
QEMU:
The E820 exposes a SOFT RESERVED region which is the same as a CFMW, but the 
HDM decoders are not committed. This means no region will be auto-created 
during boot.

As an example, after boot, the iomem tree is as follows:
105000-304fff : CXL Window 0
   105000-304fff : Soft Reserved
 

In this case, the SOFT RESERVED resource is not trimmed, so the end-user cannot 
create a new region.
My question is: Is this scenario a problem? If it is, should we fix it in this 
patchset or create a new patch?




On 04/06/2025 06:19, Smita Koralahalli wrote:
> Add the ability to manage SOFT RESERVE iomem resources prior to them being
> added to the iomem resource tree. This allows drivers, such as CXL, to
> remove any pieces of the SOFT RESERVE resource that intersect with created
> CXL regions.
> 
> The current approach of leaving the SOFT RESERVE resources as is can cause
> failures during hotplug of devices, such as CXL, because the resource is
> not available for reuse after teardown of the device.
> 
> The approach is to add SOFT RESERVE resources to a separate tree during
> boot.

No special tree at all since V3


> This allows any drivers to update the SOFT RESERVE resources before
> they are merged into the iomem resource tree. In addition a notifier chain
> is added so that drivers can be notified when these SOFT RESERVE resources
> are added to the ioeme resource tree.
> 
> The CXL driver is modified to use a worker thread that waits for the CXL
> PCI and CXL mem drivers to be loaded and for their probe routine to
> complete. Then the driver walks through any created CXL regions to trim any
> intersections with SOFT RESERVE resources in the iomem tree.
> 
> The dax driver uses the new soft reserve notifier chain so it can consume
> any remaining SOFT RESERVES once they're added to the iomem tree.
> 
> The following scenarios have been tested:
> 
> Example 1: Exact alignment, soft reserved is a child of the region
> 
> |-- "Soft Reserved" ---|
> |-- "Region #" |
> 
> Before:
>105000-304fff : CXL Window 0
>  105000-304fff : region0
>105000-304fff : Soft Reserved
>  108000-2f : dax0.0

BTW, I'm curious how to set up a dax with an address range different from its 
corresponding region.


>108000-2f : System RAM (kmem)
> 
> After:
>105000-304fff : CXL Window 0
>  105000-304fff : region1
>108000-2f : dax0.0
>  108000-2f : System RAM (kmem)
> 
> Example 2: Start and/or end aligned and soft reserved spans multiple
> regions

Tested

> 
> |--- "Soft Reserved" ---|
> | "Region #" ---|
> or
> |--- "Soft Reserved" ---|
> | "Region #" ---|

Typo? should be:
|--- "Soft Reserved" ---|
 | "Region #" ---|

> 
> Example 3: No alignment
> |-- "Soft Reserved" --|
>   | "Region #" |

Tested.


Thanks
Zhijian

Re: [PATCH] module: make __mod_device_table__* symbols static

2025-06-04 Thread Petr Pavlu

On 6/2/25 12:55 PM, Masahiro Yamada wrote:
> The __mod_device_table__* symbols are only parsed by modpost to generate
> MODULE_ALIAS() entries from MODULE_DEVICE_TABLE().
> 
> Therefore, these symbols do not need to be globally visible, or globally
> unique.
> 
> If they are in the global scope, we would worry about the symbol
> uniqueness, but modpost is fine with parsing multiple symbols with the
> same name.
> 
> Signed-off-by: Masahiro Yamada 

Reviewed-by: Petr Pavlu 

-- 
Thanks,
Petr

Re: [PATCH bpf-next] selftests/bpf: Fix compile error of bin_attribute::read/write()

2025-06-04 Thread Jiri Olsa

On Wed, Jun 04, 2025 at 01:53:22PM +0800, Rong Tao wrote:
> From: Rong Tao 
> 
> Since commit 97d06802d10a ("sysfs: constify bin_attribute argument of
> bin_attribute::read/write()"), make bin_attribute parameter of
> bin_attribute::read/write() const.

hi,
there's already fix for this in bpf/master

thanks,
jirka

> 
> Signed-off-by: Rong Tao 
> ---
>  tools/testing/selftests/bpf/test_kmods/bpf_testmod.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c 
> b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
> index e6c248e3ae54..e9e918cdf31f 100644
> --- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
> +++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
> @@ -385,7 +385,7 @@ int bpf_testmod_fentry_ok;
>  
>  noinline ssize_t
>  bpf_testmod_test_read(struct file *file, struct kobject *kobj,
> -   struct bin_attribute *bin_attr,
> +   const struct bin_attribute *bin_attr,
> char *buf, loff_t off, size_t len)
>  {
>   struct bpf_testmod_test_read_ctx ctx = {
> @@ -465,7 +465,7 @@ ALLOW_ERROR_INJECTION(bpf_testmod_test_read, ERRNO);
>  
>  noinline ssize_t
>  bpf_testmod_test_write(struct file *file, struct kobject *kobj,
> -   struct bin_attribute *bin_attr,
> +   const struct bin_attribute *bin_attr,
> char *buf, loff_t off, size_t len)
>  {
>   struct bpf_testmod_test_write_ctx ctx = {
> @@ -567,7 +567,7 @@ static void testmod_unregister_uprobe(void)
>  
>  static ssize_t
>  bpf_testmod_uprobe_write(struct file *file, struct kobject *kobj,
> -  struct bin_attribute *bin_attr,
> +  const struct bin_attribute *bin_attr,
>char *buf, loff_t off, size_t len)
>  {
>   unsigned long offset = 0;
> -- 
> 2.49.0
>

[Question] attributes encoding in BTF

2025-06-04 Thread Alexis Lothoré

Hi all,
a simpler version of this series has been merged, and so I am now taking a
look at the issue I have put aside in the merged version: dealing with more
specific data layout for arguments passed on stack. For example, a function
can pass small structs on stack, but this need special care when generating
the corresponding bpf trampolines. Those structs have specific alignment
specified by the target ABI, but it can also be altered with attributes
packing the structure or modifying the alignment.

Some platforms already support structs on stack (see
tracing_struct_many_args test), but as discussed earlier, those may suffer
from the same kind of issue mentioned above.

On Wed Apr 23, 2025 at 9:24 PM CEST, Alexis Lothoré wrote:
> Hi Andrii,
>
> On Wed Apr 23, 2025 at 7:15 PM CEST, Andrii Nakryiko wrote:
>> On Thu, Apr 17, 2025 at 12:14 AM Alexis Lothoré
>>  wrote:
>>>
>>> Hi Andrii,
>>>
>>> On Wed Apr 16, 2025 at 11:24 PM CEST, Andrii Nakryiko wrote:

[...]

>> I'd suggest looking at btf__align_of() in libbpf (tools/lib/bpf/btf.c)
>> to see how we calculate alignment there. It seems to work decently
>> enough. It won't cover any arch-specific extra rules like double
>> needing 16-byte alignment (I vaguely remember something like that for
>> some architectures, but I might be misremembering), or anything
>> similar. It also won't detect (I don't think it's possible without
>> DWARF) artificially increased alignment with attribute((aligned(N))).
>
> Thanks for the pointer, I'll take a look at it. The more we discuss this
> series, the less member size sounds relevant for what I'm trying to achieve
> here.
>
> Following Xu's comments, I have been thinking about how I could detect the
> custom alignments and packing on structures, and I was wondering if I could
> somehow benefit from __attribute__ encoding in BTF info ([1]). But
> following your hint, I also see some btf_is_struct_packed() in
> tools/lib/bpf/btf_dump.c that could help. I'll dig this further and see if
> I can manage to make something work with all of this.

Andrii's comment above illustrates well my current issue: when functions
pass arguments on stack, we are missing info for some of them to correctly
build trampolines, especially for struct, which can have attributes like
__attribute__((packed)) or __attribute__((align(x))). [1] seems to be a
recent solution implemented for BTF to cover this need. IIUC it encodes any
arbitratry attribute affecting a data type or function, so if I have some
struct like this one in my kernel or a module:

struct foo {
short b
int a;
} __packed;

I would expect the corresponding BTF data to have some BTF_KIND_DECL_TAG
describing the "packed" attribute for the corresponding structure, but I
fail to find any of those when running:

$ bpftool btf dump file vmlinux format raw

In there I see some DECL_TAG but those are mostly 'bpf_kfunc', I see not
arbitrary attribute like 'packed' or 'aligned(x)'.

What I really need to do in the end is to be able to parse those alignments
attributes info in the kernel at runtime when generating trampolines, but I
hoped to be able to see them with bpftool first to validate the concept.

I started taking a look further at this and stumbled upon [2] in which Alan
gives many more details about the feature, so I did the following checks:
- kernel version 6.15.0-rc4 from bpf-next_base: it contains the updated
  Makefile.btf calling pahole with `--btf_features=attributes`
- pahole v1.30
  $ pahole --supported_btf_features

encode_force,var,float,decl_tag,type_tag,enum64,optimized_func,consistent_func,decl_tag_kfuncs,reproducible_build,distilled_base,global_var,attributes
  This pahole comes from my distro pkg manager, but I have also done the
  same test with a freshly built pahole, while taking care of pulling the
  libbpf submodule.
- bpftool v7.6.0
bpftool v7.6.0
using libbpf v1.6
features: llvm, skeletons

Could I be missing something obvious ? Or did I misunderstand the actual
attribute encoding feature ?

Thanks,

Alexis

[1] 
https://lore.kernel.org/bpf/20250130201239.1429648-1-ihor.solod...@linux.dev/
[2] 
https://lore.kernel.org/all/CA+icZUW31vpS=r3zm6g4fmkzuiqovqtd+e-8ihwsk_a-qts...@mail.gmail.com/

-- 
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Re: [PATCH RFC net-next v2 2/3] vsock/test: Introduce get_transports()

2025-06-04 Thread Stefano Garzarella


On Wed, May 28, 2025 at 10:44:42PM +0200, Michal Luczaj wrote:

Return a bitmap of registered vsock transports. As guesstimated by grepping
/proc/kallsyms (CONFIG_KALLSYMS=y) for known symbols of type `struct
virtio_transport`.

Suggested-by: Stefano Garzarella 
Signed-off-by: Michal Luczaj 
---
tools/testing/vsock/util.c | 60 ++
tools/testing/vsock/util.h | 12 ++
2 files changed, 72 insertions(+)

diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 
b7b3fb2221c1682ecde58cf12e2f0b0ded1cff39..74fb52f566148b16436251216dd9d9275f0ec95b
 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -7,6 +7,7 @@
 * Author: Stefan Hajnoczi 
 */

+#include 
#include 
#include 
#include 
@@ -17,6 +18,7 @@
#include 
#include 
#include 
+#include 
#include 

#include "timeout.h"
@@ -854,3 +856,61 @@ void enable_so_linger(int fd, int timeout)
exit(EXIT_FAILURE);
}
}
+
+static int __get_transports(void)
+{
+   /* Order must match transports defined in util.h.
+* man nm: "d" The symbol is in the initialized data section.
+*/
+   const char * const syms[] = {
+   "d loopback_transport",
+   "d virtio_transport",
+   "d vhost_transport",
+   "d vmci_transport",
+   "d hvs_transport",
+   };


I would move this array (or a macro that define it), near the transport 
defined in util.h, so they are near and we can easily update/review 
changes.


BTW what about adding static asserts to check we are aligned?


+   char buf[KALLSYMS_LINE_LEN];
+   int ret = 0;
+   FILE *f;
+
+   f = fopen(KALLSYMS_PATH, "r");
+   if (!f) {
+   perror("Can't open " KALLSYMS_PATH);
+   exit(EXIT_FAILURE);
+   }
+
+   while (fgets(buf, sizeof(buf), f)) {
+   char *match;
+   int i;
+
+   assert(buf[strlen(buf) - 1] == '\n');
+
+   for (i = 0; i < ARRAY_SIZE(syms); ++i) {
+   match = strstr(buf, syms[i]);
+
+   /* Match should be followed by '\t' or '\n'.
+* See kallsyms.c:s_show().
+*/
+   if (match && isspace(match[strlen(syms[i])])) {
+   ret |= _BITUL(i);
+   break;
+   }
+   }
+   }
+
+   fclose(f);
+   return ret;
+}
+
+/* Return integer with TRANSPORT_* bit set for every (known) registered vsock
+ * transport.
+ */
+int get_transports(void)
+{
+   static int tr = -1;
+
+   if (tr == -1)
+   tr = __get_transports();
+
+   return tr;
+}
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index 
0afe7cbae12e5194172c639ccfbeb8b81f7c25ac..63953e32c3e18e1aa5c2addcf6f09f433660fa84
 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -3,8 +3,19 @@
#define UTIL_H

#include 
+#include 
#include 

+#define KALLSYMS_PATH  "/proc/kallsyms"
+#define KALLSYMS_LINE_LEN  512


We don't need to expose them in util.h IMO, we can keep in util.c


+
+/* All known transports */
+#define TRANSPORT_LOOPBACK _BITUL(0)
+#define TRANSPORT_VIRTIO   _BITUL(1)
+#define TRANSPORT_VHOST_BITUL(2)
+#define TRANSPORT_VMCI _BITUL(3)
+#define TRANSPORT_HYPERV   _BITUL(4)
+
/* Tests can either run as the client or the server */
enum test_mode {
TEST_MODE_UNSET,
@@ -82,4 +93,5 @@ void setsockopt_timeval_check(int fd, int level, int optname,
  struct timeval val, char const *errmsg);
void enable_so_zerocopy_check(int fd);
void enable_so_linger(int fd, int timeout);
+int get_transports(void);
#endif /* UTIL_H */

--
2.49.0

Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

On Wed, Jun 04, 2025 at 11:07:05AM +0300, Mike Rapoport wrote:
> On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> > On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > > The walk_page_range_novma() function is rather confusing - it supports two
> > > modes, one used often, the other used only for debugging.
> > >
> > > The first mode is the common case of traversal of kernel page tables, 
> > > which
> > > is what nearly all callers use this for.
> >
> > ... and what people should be using it for 🙂
> >
> > >
> > > Secondly it provides an unusual debugging interface that allows for the
> > > traversal of page tables in a userland range of memory even for that 
> > > memory
> > > which is not described by a VMA.
> > >
> > > This is highly unusual and it is far from certain that such page tables
> > > should even exist, but perhaps this is precisely why it is useful as a
> > > debugging mechanism.
> > >
> > > As a result, this is utilised by ptdump only. Historically, things were
> > > reversed - ptdump was the only user, and other parts of the kernel evolved
> > > to use the kernel page table walking here.
> > >
> > > Since we have some complicated and confusing locking rules for the novma
> > > case, it makes sense to separate the two usages into their own functions.
> > >
> > > Doing this also provide self-documentation as to the intent of the caller 
> > > -
> > > are they doing something rather unusual or are they simply doing a 
> > > standard
> > > kernel page table walk?
> > >
> > > We therefore maintain walk_page_range_novma() for this single usage, and
> > > document the function as such.
> >
> > If we have to keep this dangerous interface, it should probably be
> >
> > walk_page_range_debug() or walk_page_range_dump()
>
> We can also move it from include/linux/pagewalk.h to mm/internal.h

Yeah I was wondering about a rename actually, but then thought 'oh novma kinda
fits blah blah' but you're right, this is better.

Nice idea to move to mm/internal.h also :) I like this...

Will fixup on respin

>
> > >
> > > Note that ptdump uses the precise same function for kernel walking as a
> > > convenience, so we permit this but make it very explicit by having
> > > walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> > >
> > > We introduce walk_page_range_kernel() for the far more common case of
> > > kernel page table traversal.
> >
> > I wonder if we should give it a completely different name scheme to
> > highlight that this is something completely different.
> >
> > walk_kernel_page_table_range()
> >
> > etc.
> >
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> --
> Sincerely yours,
> Mike.

Re: [PATCH RFC net-next v2 1/3] vsock/test: Introduce vsock_bind_try() helper

2025-06-04 Thread Stefano Garzarella


On Wed, May 28, 2025 at 10:44:41PM +0200, Michal Luczaj wrote:

Create a socket and bind() it. If binding failed, gracefully return an
error code while preserving `errno`.

Base vsock_bind() on top of it.

Suggested-by: Stefano Garzarella 
Signed-off-by: Michal Luczaj 
---
tools/testing/vsock/util.c | 24 +---
tools/testing/vsock/util.h |  1 +
2 files changed, 22 insertions(+), 3 deletions(-)


LGTM!

Reviewed-by: Stefano Garzarella 



diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 
0c7e9cbcbc85cde9c8764fc3bb623cde2f6c77a6..b7b3fb2221c1682ecde58cf12e2f0b0ded1cff39
 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -121,15 +121,17 @@ bool vsock_wait_sent(int fd)
return !ret;
}

-/* Create socket , bind to  and return the file descriptor. */
-int vsock_bind(unsigned int cid, unsigned int port, int type)
+/* Create socket , bind to .
+ * Return the file descriptor, or -1 on error.
+ */
+int vsock_bind_try(unsigned int cid, unsigned int port, int type)
{
struct sockaddr_vm sa = {
.svm_family = AF_VSOCK,
.svm_cid = cid,
.svm_port = port,
};
-   int fd;
+   int fd, saved_errno;

fd = socket(AF_VSOCK, type, 0);
if (fd < 0) {
@@ -138,6 +140,22 @@ int vsock_bind(unsigned int cid, unsigned int port, int 
type)
}

if (bind(fd, (struct sockaddr *)&sa, sizeof(sa))) {
+   saved_errno = errno;
+   close(fd);
+   errno = saved_errno;
+   fd = -1;
+   }
+
+   return fd;
+}
+
+/* Create socket , bind to  and return the file descriptor. */
+int vsock_bind(unsigned int cid, unsigned int port, int type)
+{
+   int fd;
+
+   fd = vsock_bind_try(cid, port, type);
+   if (fd < 0) {
perror("bind");
exit(EXIT_FAILURE);
}
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index 
5e2db67072d5053804a9bb93934b625ea78bcd7a..0afe7cbae12e5194172c639ccfbeb8b81f7c25ac
 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -44,6 +44,7 @@ int vsock_connect(unsigned int cid, unsigned int port, int 
type);
int vsock_accept(unsigned int cid, unsigned int port,
 struct sockaddr_vm *clientaddrp, int type);
int vsock_stream_connect(unsigned int cid, unsigned int port);
+int vsock_bind_try(unsigned int cid, unsigned int port, int type);
int vsock_bind(unsigned int cid, unsigned int port, int type);
int vsock_bind_connect(unsigned int cid, unsigned int port,
   unsigned int bind_port, int type);

--
2.49.0

Re: [PATCH bpf-next] selftests/bpf: Fix compile error of bin_attribute::read/write()

2025-06-04 Thread Rong Tao




On 6/4/25 16:53, Jiri Olsa wrote:

On Wed, Jun 04, 2025 at 01:53:22PM +0800, Rong Tao wrote:

From: Rong Tao 

Since commit 97d06802d10a ("sysfs: constify bin_attribute argument of
bin_attribute::read/write()"), make bin_attribute parameter of
bin_attribute::read/write() const.

hi,
there's already fix for this in bpf/master

thanks,
jirka

I am confused, when should I use bpf/master[2] and when should I use
bpf-next/master[1]? thank you :)

[1] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
[2] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Re: [PATCH v4 3/7] cxl/pci: Add pci_loaded tracking to mark PCI driver readiness



On 04/06/2025 06:19, Smita Koralahalli wrote:
> Introduce a pci_loaded flag similar to mem_active, and define
> mark_cxl_pci_loaded() to indicate when the PCI driver has initialized.
> 
> This will be used by other CXL components, such as cxl_acpi and the soft
> reserved resource handling logic, to coordinate initialization and ensure
> that dependent operations proceed only after both cxl_pci and cxl_mem
> drivers are ready.
> 
> Co-developed-by: Nathan Fontenot 
> Signed-off-by: Nathan Fontenot 
> Co-developed-by: Terry Bowman 
> Signed-off-by: Terry Bowman 
> Signed-off-by: Smita Koralahalli 
> ---
>   drivers/cxl/core/suspend.c | 8 
>   drivers/cxl/cxlpci.h   | 1 +
>   drivers/cxl/pci.c  | 2 ++
>   3 files changed, 11 insertions(+)
> 
> diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/suspend.c
> index 5ba4b4de0e33..72818a2c8ec8 100644
> --- a/drivers/cxl/core/suspend.c
> +++ b/drivers/cxl/core/suspend.c
> @@ -3,8 +3,10 @@
>   #include 
>   #include 
>   #include "cxlmem.h"
> +#include "cxlpci.h"
>   
>   static atomic_t mem_active;
> +static atomic_t pci_loaded;


I find it odd to place these changes in suspend.c. Also, I noticed that in a
subsequent patch, DJ has mentioned (and I agree) that this patch is unnecessary.


Thanks
Zhijian


>   
>   bool cxl_mem_active(void)
>   {
> @@ -25,3 +27,9 @@ void cxl_mem_active_dec(void)
>   atomic_dec(&mem_active);
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_mem_active_dec, "CXL");
> +
> +void mark_cxl_pci_loaded(void)
> +{
> + atomic_inc(&pci_loaded);
> +}
> +EXPORT_SYMBOL_NS_GPL(mark_cxl_pci_loaded, "CXL");
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..5a811ac63fcf 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -135,4 +135,5 @@ void read_cdat_data(struct cxl_port *port);
>   void cxl_cor_error_detected(struct pci_dev *pdev);
>   pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>   pci_channel_state_t state);
> +void mark_cxl_pci_loaded(void);
>   #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 785aa2af5eaa..b019bd324dba 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1189,6 +1189,8 @@ static int __init cxl_pci_driver_init(void)
>   if (rc)
>   pci_unregister_driver(&cxl_pci_driver);
>   
> + mark_cxl_pci_loaded();
> +
>   return rc;
>   }
>

Re: [QUESTION] problems report: rcu_read_unlock_special() called in irq_exit() causes dead loop

2025-06-04 Thread Joel Fernandes

On 6/3/2025 11:37 PM, Qi Xi wrote:
> Hi Joel,
> 
> The patch works as expected. Previously, the issue triggered a soft lockup
> within ~10 minutes, but after applying the fix, the system ran stably for 30+
> minutes without any issues.

Great to hear! Thanks for testing. I/we will roll this into a proper patch and
will provide it once have something ready.

 - Joel


> 
> Thanks,
> Qi
> 
> On 2025/6/4 11:25, Xiongfeng Wang wrote:
>> On 2025/6/4 9:35, Joel Fernandes wrote:
>>> On Tue, Jun 03, 2025 at 03:22:42PM -0400, Joel Fernandes wrote:
 On 6/3/2025 3:03 PM, Joel Fernandes wrote:
> On 6/3/2025 2:59 PM, Joel Fernandes wrote:
>> On Fri, May 30, 2025 at 09:55:45AM +0800, Xiongfeng Wang wrote:
>>> Hi Joel,
>>>
>>> On 2025/5/29 0:30, Joel Fernandes wrote:
 On Wed, May 21, 2025 at 5:43 AM Xiongfeng Wang
  wrote:
> Hi RCU experts,
>
> When I ran syskaller in Linux 6.6 with CONFIG_PREEMPT_RCU enabled, I 
> got
> the following soft lockup. The Calltrace is too long. I put it in the 
> end.
> The issue can also be reproduced in the latest kernel.
>
> The issue is as follows. CPU3 is waiting for a spin_lock, which is 
> got by CPU1.
> But CPU1 stuck in the following dead loop.
>
> irq_exit()
>   __irq_exit_rcu()
> /* in_hardirq() returns false after this */
> preempt_count_sub(HARDIRQ_OFFSET)
> tick_irq_exit()
>   tick_nohz_irq_exit()
> tick_nohz_stop_sched_tick()
>   trace_tick_stop()  /* a bpf prog is hooked on this 
> trace point */
>__bpf_trace_tick_stop()
>   bpf_trace_run2()
> rcu_read_unlock_special()
>   /* will send a IPI to itself */
>   irq_work_queue_on(&rdp->defer_qs_iw, 
> rdp->cpu);
>
> /* after interrupt is enabled again, the irq_work is called */
> asm_sysvec_irq_work()
>   sysvec_irq_work()
> irq_exit() /* after handled the irq_work, we again enter into 
> irq_exit() */
>   __irq_exit_rcu()
> ...skip...
>/* we queue a irq_work again, and enter a dead loop */
>irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>>> The following is a candidate fix (among other fixes being
>>> considered/discussed). The change is to check if context tracking thinks
>>> we're in IRQ and if so, avoid the irq_work. IMO, this should be rare enough
>>> that it shouldn't be an issue and it is dangerous to self-IPI consistently
>>> while we're exiting an IRQ anyway.
>>>
>>> Thoughts? Xiongfeng, do you want to try it?
>> Thanks a lot for the fast response. My colleague is testing the modification.
>>  She will feedback the result.
>>
>> Thanks,
>> Xiongfeng
>>
>>> Btw, I could easily reproduce it as a boot hang by doing:
>>>
>>> --- a/kernel/softirq.c
>>> +++ b/kernel/softirq.c
>>> @@ -638,6 +638,10 @@ void irq_enter(void)
>>>  
>>>  static inline void tick_irq_exit(void)
>>>  {
>>> +   rcu_read_lock();
>>> +   WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
>>> +   rcu_read_unlock();
>>> +
>>>  #ifdef CONFIG_NO_HZ_COMMON
>>> int cpu = smp_processor_id();
>>>  
>>> ---8<---
>>>
>>> From: Joel Fernandes 
>>> Subject: [PATCH] Do not schedule irq_work when IRQ is exiting
>>>
>>> Signed-off-by: Joel Fernandes 
>>> ---
>>>  include/linux/context_tracking_irq.h |  2 ++
>>>  kernel/context_tracking.c| 12 
>>>  kernel/rcu/tree_plugin.h |  3 ++-
>>>  3 files changed, 16 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/context_tracking_irq.h 
>>> b/include/linux/context_tracking_irq.h
>>> index 197916ee91a4..35a5ad971514 100644
>>> --- a/include/linux/context_tracking_irq.h
>>> +++ b/include/linux/context_tracking_irq.h
>>> @@ -9,6 +9,7 @@ void ct_irq_enter_irqson(void);
>>>  void ct_irq_exit_irqson(void);
>>>  void ct_nmi_enter(void);
>>>  void ct_nmi_exit(void);
>>> +bool ct_in_irq(void);
>>>  #else
>>>  static __always_inline void ct_irq_enter(void) { }
>>>  static __always_inline void ct_irq_exit(void) { }
>>> @@ -16,6 +17,7 @@ static inline void ct_irq_enter_irqson(void) { }
>>>  static inline void ct_irq_exit_irqson(void) { }
>>>  static __always_inline void ct_nmi_enter(void) { }
>>>  static __always_inline void ct_nmi_exit(void) { }
>>> +static inline bool ct_in_irq(void) { return false; }
>>>  #endif
>>>  
>>>  #endif
>>> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
>>> index fb5be6e9b423..8e8055cf04af 100644
>>> --- a/kernel/context_tracking.c
>>> +++ b/kernel/context_tracking.c
>>> @@ -392,6 +392,18 @@ noinstr void ct_irq_exit(void)
>>> ct_nmi_exit();
>>>  }
>>>  
>>> +/**
>>> + * ct_in_irq - che

Re: [PATCH RFC net-next v2 3/3] vsock/test: Cover more CIDs in transport_uaf test

2025-06-04 Thread Stefano Garzarella


On Wed, May 28, 2025 at 10:44:43PM +0200, Michal Luczaj wrote:

Increase the coverage of test for UAF due to socket unbinding, and losing
transport in general. It's a follow up to commit 301a62dfb0d0 ("vsock/test:
Add test for UAF due to socket unbinding") and discussion in [1].

The idea remains the same: take an unconnected stream socket with a
transport assigned and then attempt to switch the transport by trying (and
failing) to connect to some other CID. Now do this iterating over all the
well known CIDs (plus one).

Note that having only a virtio transport loaded (without vhost_vsock) is
unsupported; test will always pass. Depending on transports available, a
variety of splats are possible on unpatched machines. After reverting
commit 78dafe1cf3af ("vsock: Orphan socket after transport release") and
commit fcdd2242c023 ("vsock: Keep the binding until socket destruction"):

BUG: KASAN: slab-use-after-free in __vsock_bind+0x61f/0x720
Read of size 4 at addr 88811ff46b54 by task vsock_test/1475
Call Trace:
dump_stack_lvl+0x68/0x90
print_report+0x170/0x53d
kasan_report+0xc2/0x180
__vsock_bind+0x61f/0x720
vsock_connect+0x727/0xc40
__sys_connect+0xe8/0x100
__x64_sys_connect+0x6e/0xc0
do_syscall_64+0x92/0x1c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53

WARNING: CPU: 0 PID: 1475 at net/vmw_vsock/virtio_transport_common.c:37 
virtio_transport_send_pkt_info+0xb2b/0x1160
Call Trace:
virtio_transport_connect+0x90/0xb0
vsock_connect+0x782/0xc40
__sys_connect+0xe8/0x100
__x64_sys_connect+0x6e/0xc0
do_syscall_64+0x92/0x1c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53

KASAN: null-ptr-deref in range [0x0010-0x0017]
RIP: 0010:sock_has_perm+0xa7/0x2a0
Call Trace:
selinux_socket_connect_helper.isra.0+0xbc/0x450
selinux_socket_connect+0x3b/0x70
security_socket_connect+0x31/0xd0
__sys_connect_file+0x79/0x1f0
__sys_connect+0xe8/0x100
__x64_sys_connect+0x6e/0xc0
do_syscall_64+0x92/0x1c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 7 PID: 1518 at lib/refcount.c:25 refcount_warn_saturate+0xdd/0x140
RIP: 0010:refcount_warn_saturate+0xdd/0x140
Call Trace:
__vsock_bind+0x65e/0x720
vsock_connect+0x727/0xc40
__sys_connect+0xe8/0x100
__x64_sys_connect+0x6e/0xc0
do_syscall_64+0x92/0x1c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53

refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 1475 at lib/refcount.c:28 
refcount_warn_saturate+0x12b/0x140
RIP: 0010:refcount_warn_saturate+0x12b/0x140
Call Trace:
vsock_remove_bound+0x18f/0x280
__vsock_release+0x371/0x480
vsock_release+0x88/0x120
__sock_release+0xaa/0x260
sock_close+0x14/0x20
__fput+0x35a/0xaa0
task_work_run+0xff/0x1c0
do_exit+0x849/0x24c0
make_task_dead+0xf3/0x110
rewind_stack_and_make_dead+0x16/0x20

[1]: 
https://lore.kernel.org/netdev/CAGxU2F5zhfWymY8u0hrKksW8PumXAYz-9_qRmW==92oax1b...@mail.gmail.com/

Suggested-by: Stefano Garzarella 
Signed-off-by: Michal Luczaj 
---
tools/testing/vsock/vsock_test.c | 83 +++-
1 file changed, 64 insertions(+), 19 deletions(-)

diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index 
f669baaa0dca3bebc678d00eafa80857d1f0fdd6..b58736023981ef7c4812e069ea577fcf2c0fe9fa
 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -1718,16 +1718,27 @@ static void 
test_stream_msgzcopy_leak_zcskb_server(const struct test_opts *opts)

#define MAX_PORT_RETRIES24  /* net/vmw_vsock/af_vsock.c */

-/* Test attempts to trigger a transport release for an unbound socket. This can
- * lead to a reference count mishandling.
- */
-static void test_stream_transport_uaf_client(const struct test_opts *opts)
+static bool test_stream_transport_uaf(int cid)
{
int sockets[MAX_PORT_RETRIES];
struct sockaddr_vm addr;
-   int fd, i, alen;
+   socklen_t alen;
+   int fd, i, c;
+   bool ret;
+
+   /* Probe for a transport by attempting a local CID bind. Unavailable
+* transport (or more specifically: an unsupported transport/CID
+* combination) results in EADDRNOTAVAIL, other errnos are fatal.
+*/
+   fd = vsock_bind_try(cid, VMADDR_PORT_ANY, SOCK_STREAM);
+   if (fd < 0) {
+   if (errno != EADDRNOTAVAIL) {
+   perror("Unexpected bind() errno");
+   exit(EXIT_FAILURE);
+   }

-   fd = vsock_bind(VMADDR_CID_ANY, VMADDR_PORT_ANY, SOCK_STREAM);
+   return false;
+   }

alen = sizeof(addr);
if (getsockname(fd, (struct sockaddr *)&addr, &alen)) {
@@ -1735,38 +1746,73 @@ static void test_stream_transport_uaf_client(const 
struct test_opts *opts)
exit(EXIT_FAILURE);
}

+   /* Drain the autobind pool; see __vsock_bind_connectible(). */
for (i = 0; i < MAX_PORT_RETRIES; ++i)
-   sockets[i] = vsock_bind(VMADDR_CID_ANY, ++addr.svm_port,
-   SOCK_STREAM)

Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > The walk_page_range_novma() function is rather confusing - it supports two
> > modes, one used often, the other used only for debugging.
> >
> > The first mode is the common case of traversal of kernel page tables, which
> > is what nearly all callers use this for.
>
> ... and what people should be using it for 🙂

:)

Yeah the whole intent of this patch is to detach the 'crazy debug' bit from
the 'used by arches all over the place' stuff.

Being super clear as to what you're doing matters.

>
> >
> > Secondly it provides an unusual debugging interface that allows for the
> > traversal of page tables in a userland range of memory even for that memory
> > which is not described by a VMA.
> >
> > This is highly unusual and it is far from certain that such page tables
> > should even exist, but perhaps this is precisely why it is useful as a
> > debugging mechanism.
> >
> > As a result, this is utilised by ptdump only. Historically, things were
> > reversed - ptdump was the only user, and other parts of the kernel evolved
> > to use the kernel page table walking here.
> >
> > Since we have some complicated and confusing locking rules for the novma
> > case, it makes sense to separate the two usages into their own functions.
> >
> > Doing this also provide self-documentation as to the intent of the caller -
> > are they doing something rather unusual or are they simply doing a standard
> > kernel page table walk?
> >
> > We therefore maintain walk_page_range_novma() for this single usage, and
> > document the function as such.
>
> If we have to keep this dangerous interface, it should probably be
>
> walk_page_range_debug() or walk_page_range_dump()

Ugh it's too early, I thought Mike suggested this :P but he suggested the
mm/internal.h bit.

But anyway I agree with both, will fix in v2.

>
> >
> > Note that ptdump uses the precise same function for kernel walking as a
> > convenience, so we permit this but make it very explicit by having
> > walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> >
> > We introduce walk_page_range_kernel() for the far more common case of
> > kernel page table traversal.
>
> I wonder if we should give it a completely different name scheme to
> highlight that this is something completely different.
>
> walk_kernel_page_table_range()

Yeah, I think this might be a good idea actually. This is doing something
'unusual' unlike all the other walk_kernel_xxx() handlers, so this should
highlight it even more clearly.

Will fixup in v2.

>
> etc.
>
>
> --
> Cheers,
>
> David / dhildenb
>
>
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv

Re: [PATCH v4 4/7] cxl/acpi: Add background worker to wait for cxl_pci and cxl_mem probe



On 04/06/2025 06:19, Smita Koralahalli wrote:
>   drivers/cxl/acpi.c | 23 +++
>   drivers/cxl/core/suspend.c | 21 +
>   drivers/cxl/cxl.h  |  2 ++
>   3 files changed, 46 insertions(+)
> 
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index cb14829bb9be..978f63b32b41 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -813,6 +813,24 @@ static int pair_cxl_resource(struct device *dev, void 
> *data)
>   return 0;
>   }
>   
> +static void cxl_softreserv_mem_work_fn(struct work_struct *work)
> +{
> + /* Wait for cxl_pci and cxl_mem drivers to load */
> + cxl_wait_for_pci_mem();
> +
> + /*
> +  * Wait for the driver probe routines to complete after cxl_pci
> +  * and cxl_mem drivers are loaded.
> +  */
> + wait_for_device_probe();
> +}
> +static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn);
> +
> +static void cxl_softreserv_mem_update(void)
> +{
> + schedule_work(&cxl_sr_work);
> +}
> +
>   static int cxl_acpi_probe(struct platform_device *pdev)
>   {
>   int rc;
> @@ -887,6 +905,10 @@ static int cxl_acpi_probe(struct platform_device *pdev)
>   
>   /* In case PCI is scanned before ACPI re-trigger memdev attach */
>   cxl_bus_rescan();
> +
> + /* Update SOFT RESERVE resources that intersect with CXL regions */
> + cxl_softreserv_mem_update();
> +
>   return 0;
>   }
>   
> @@ -918,6 +940,7 @@ static int __init cxl_acpi_init(void)
>   
>   static void __exit cxl_acpi_exit(void)
>   {
> + cancel_work_sync(&cxl_sr_work);
>   platform_driver_unregister(&cxl_acpi_driver);
>   cxl_bus_drain();
>   }
> diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/suspend.c
> index 72818a2c8ec8..c0d8f70aed56 100644
> --- a/drivers/cxl/core/suspend.c
> +++ b/drivers/cxl/core/suspend.c
> @@ -2,12 +2,15 @@
>   /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
>   #include 
>   #include 
> +#include 
>   #include "cxlmem.h"
>   #include "cxlpci.h"
>   
>   static atomic_t mem_active;
>   static atomic_t pci_loaded;
>   
> +static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue);

Given that this file (suspend.c) focuses on power management functions,
it might be more appropriate to move the wait queue declaration and its
related changes to acpi.c in where the its caller is.


Thanks
Zhijian

> +
>   bool cxl_mem_active(void)
>   {
>   if (IS_ENABLED(CONFIG_CXL_MEM))
> @@ -19,6 +22,7 @@ bool cxl_mem_active(void)
>   void cxl_mem_active_inc(void)
>   {
>   atomic_inc(&mem_active);
> + wake_up(&cxl_wait_queue);
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_mem_active_inc, "CXL");
>   
> @@ -28,8 +32,25 @@ void cxl_mem_active_dec(void)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_mem_active_dec, "CXL");
>   
> +static bool cxl_pci_loaded(void)
> +{
> + if (IS_ENABLED(CONFIG_CXL_PCI))
> + return atomic_read(&pci_loaded) != 0;
> +
> + return false;
> +}
> +
>   void mark_cxl_pci_loaded(void)
>   {
>   atomic_inc(&pci_loaded);
> + wake_up(&cxl_wait_queue);
>   }
>   EXPORT_SYMBOL_NS_GPL(mark_cxl_pci_loaded, "CXL");
> +
> +void cxl_wait_for_pci_mem(void)
> +{
> + if (!wait_event_timeout(cxl_wait_queue, cxl_pci_loaded() &&
> + cxl_mem_active(), 30 * HZ))
> + pr_debug("Timeout waiting for cxl_pci or cxl_mem probing\n");
> +}

Re: [PATCH RFC net-next v2] page_pool: import Jesper's page_pool benchmark

2025-06-04 Thread Jesper Dangaard Brouer





On 28/05/2025 21.46, Mina Almasry wrote:

On Wed, May 28, 2025 at 2:28 AM Toke Høiland-Jørgensen  wrote:


Mina Almasry  writes:


On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen  wrote:

Fast path results:
no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns

ptr_ring results:
no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns

slow path results:
no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns
```

Cc: Jesper Dangaard Brouer 
Cc: Ilias Apalodimas 
Cc: Jakub Kicinski 
Cc: Toke Høiland-Jørgensen 

Signed-off-by: Mina Almasry 


Back when you posted the first RFC, Jesper and I chatted about ways to
avoid the ugly "load module and read the output from dmesg" interface to
the test.



I agree the existing interface is ugly.


One idea we came up with was to make the module include only the "inner"
functions for the benchmark, and expose those to BPF as kfuncs. Then the
test runner can be a BPF program that runs the tests, collects the data
and passes it to userspace via maps or a ringbuffer or something. That's
a nicer and more customisable interface than the printk output. And if
they're small enough, maybe we could even include the functions into the
page_pool code itself, instead of in a separate benchmark module?

WDYT of that idea? :)


...but this sounds like an enormous amount of effort, for something
that is a bit ugly but isn't THAT bad. Especially for me, I'm not that
much of an expert that I know how to implement what you're referring
to off the top of my head. I normally am open to spending time but
this is not that high on my todolist and I have limited bandwidth to
resolve this :(

I also feel that this is something that could be improved post merge.
I think it's very beneficial to have this merged in some form that can
be improved later. Byungchul is making a lot of changes to these mm
things and it would be nice to have an easy way to run the benchmark
in tree and maybe even get automated results from nipa. If we could
agree on mvp that is appropriate to merge without too much scope creep
that would be ideal from my side at least.


Right, fair. I guess we can merge it as-is, and then investigate whether
we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)


Thanks for the pliability. Reviewed-bys and comments welcome.

Additionally Signed-off-by from Jesper is needed I think. Since most
of this code is his, I retained his authorship. Jesper, whenever this
looks good to me, a signed-off-by would be good and I would carry it
to future versions. Changing authorship to me is also fine by me but I
would think you want to retain the credit.


Okay, I think Ilias'es comment[1] and ACK convinced me, let us merge
this as-is.  We have been asking people to run it over several years
before accepting patches. We shouldn't be pointing people to use
out-of-tree tests for accepting patches.

It is not perfect, but it have served us well for benchmarking in the
last approx 10 years (5 years for page_pool test).  It is isolated as a
selftest under (tools/testing/selftests/net/bench/page_pool/).

Realistically we are all too busy inventing a new "perfect" benchmark
for page_pool. That said, I do encourage others with free cycles to
integrated a better benchmark test into `perf bench`.  Then we can just
remove this module again.

Signed-off-by: Jesper Dangaard Brouer 

 [1] 
https://lore.kernel.org/all/cac_iwjlmo4xz_+pbacnxpvctmgknbslgyeeks2ptrrepn1u...@mail.gmail.com/


Thanks Mina for pushing this forward,
--Jesper

Re: [PATCH v3 3/6] modpost: Make mod_device_table aliases more unique

2025-06-04 Thread Alexey Gladkov

On Tue, Jun 03, 2025 at 01:18:25AM +0900, Masahiro Yamada wrote:
> > > > Before these patches this was not a problem as non-unique characters are
> > > > in separate object files when the module is compiled separately.
> > > >
> > > > But when the modules are compiled into the kernel, there is a symbol
> > > > conflict when linking vmlinuz. We have modules that export multiple 
> > > > device
> > > > tables from different object files.
> > >
> > > This is because the __mod_device_table__* symbols are global, but
> > > I suspect they do not need to be.
> > >
> > > Let's test this
> > > https://lore.kernel.org/lkml/20250602105539.392362-1-masahi...@kernel.org/T/#u
> >
> > I tested this patch with the config:
> >
> > make allmodconfig
> > make mod2yesconfig
> >
> > and it works.
> 
> Good.
> Then, __COUNTER__ is unnecessary.

I didn't immediately notice. The patch you suggested works, but these
symbols remain in System.map and it seems in vmlinuz.

-- 
Rgrds, legion

Re: [PATCH 3/3] remoteproc: imx_rproc: Add support for i.MX95

2025-06-04 Thread Daniel Baluta

On Wed, Jun 4, 2025 at 5:37 AM Peng Fan (OSS)  wrote:
>
> From: Peng Fan 
>
> Add imx_rproc_cfg_imx95_m7 and address(TCM and DDR) mapping
> Add i.MX95 of_device_id entry
>
> Signed-off-by: Peng Fan 

Reviewed-by: Daniel Baluta

Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

2025-06-04 Thread Oscar Salvador

On Wed, Jun 04, 2025 at 10:09:05AM +0100, Lorenzo Stoakes wrote:
> Nice idea to move to mm/internal.h also :) I like this...
> 
> Will fixup on respin

Dumb question but IIUC, walk_page_range_novma() will only be used by
ptdump from now on, so why not stick it into mm/ptdump.c, which is where
the only user of it lives?

-- 
Oscar Salvador
SUSE Labs

Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

On Wed, Jun 04, 2025 at 02:26:57PM +0200, Oscar Salvador wrote:
> On Wed, Jun 04, 2025 at 10:09:05AM +0100, Lorenzo Stoakes wrote:
> > Nice idea to move to mm/internal.h also :) I like this...
> >
> > Will fixup on respin
>
> Dumb question but IIUC, walk_page_range_novma() will only be used by
> ptdump from now on, so why not stick it into mm/ptdump.c, which is where
> the only user of it lives?

There's no such thing as a dumb question :) Even though I have maybe tested
that concept in the past personally ;)

I think mm/internal.h is fine, would be weird to have the declaration in
mm/ptdump.c but implemented elsewhere and obviously we need to keep the
page walking stuff together.

So I think it is best to keep it in internal.h

>
> --
> Oscar Salvador
> SUSE Labs

Re: [PATCH 2/3] remoteproc: imx_rproc: Add support for System Manager API

2025-06-04 Thread Daniel Baluta

Hi Peng,

Thanks a lot for the patches. Comments inline:

On Wed, Jun 4, 2025 at 5:36 AM Peng Fan (OSS)  wrote:
>
> From: Peng Fan 
>
> i.MX95 features a Cortex-M33 core, six Cortex-A55 cores, and
> one Cortex-M7 core. The System Control Management Interface(SCMI)
> firmware runs on the M33 core. The i.MX95 SCMI firmware named System
> Manager(SM) includes vendor extension protocols, Logical Machine
> Management(LMM) protocol and CPU protocol and etc.
>
> There are three cases for M7:
>  (1) M7 in a separate Logical Machine(LM) that Linux couldn't control it.
couldn't -> can't

>  (2) M7 in a separate Logical Machine that Linux could control it using
>  LMM protocol
could -> can

>  (3) M7 runs in same Logical Machine as A55, so Linux could control it
>  using CPU protocol
could -> can

>
> So extend the driver to using LMM and CPU protocol to manage the M7 core.
>  - Add IMX_RPROC_SM to indicate the remotecores runs on a SoC that
>has System Manager.

remotecores -> remote core

>  - Compare linux LM ID(got using scmi_imx_lmm_info) and M7 LM ID(got
>from DTB), if same, use CPU protocol to start/stop. Otherwise, use
>LMM protocol to start/stop.
>Whether using CPU or LMM protocol to start/stop, the M7 status
>detection could use CPU protocol to detect started or not. So
>in imx_rproc_detect_mode, use scmi_imx_cpu_started to check the
>status of M7.
>  - For above case 1 and 2, Use SCMI_IMX_LMM_POWER_ON to detect whether
>the M7 LM is under control of A55 LM.
>
> Current setup relies on pre-Linux software(U-Boot) to do
> M7 TCM ECC initialization. In future, we could add the support in Linux
> to decouple U-Boot and Linux.
>
> Signed-off-by: Peng Fan 
> ---
>  drivers/remoteproc/imx_rproc.c | 139 
> -
>  drivers/remoteproc/imx_rproc.h |   2 +
>  2 files changed, 139 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/imx_rproc.c b/drivers/remoteproc/imx_rproc.c
> index 
> 74299af1d7f10a0db794de494c52304b2323b89f..0649faa98725db99366946c65edf5b7daff78316
>  100644
> --- a/drivers/remoteproc/imx_rproc.c
> +++ b/drivers/remoteproc/imx_rproc.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -21,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  #include "imx_rproc.h"
> @@ -91,6 +93,11 @@ struct imx_rproc_mem {
>  #define ATT_CORE_MASK   0x
>  #define ATT_CORE(I) BIT((I))
>
> +/* Logical Machine Operation */
> +#define IMX_RPROC_FLAGS_SM_LMM_OP  BIT(0)
> +/* Linux has permission to handle the Logical Machine of remote cores */
> +#define IMX_RPROC_FLAGS_SM_LMM_AVAIL   BIT(1)
> +
>  static int imx_rproc_xtr_mbox_init(struct rproc *rproc, bool tx_block);
>  static void imx_rproc_free_mbox(struct rproc *rproc);
>
> @@ -115,6 +122,22 @@ struct imx_rproc {
> u32 entry;  /* cpu start address 
> */
> u32 core_index;
> struct dev_pm_domain_list   *pd_list;
> +   /* For i.MX System Manager based systems */
> +   u32 cpuid;
> +   u32 lmid;
> +   u32 flags;
> +};
> +
> +static const struct imx_rproc_att imx_rproc_att_imx95_m7[] = {
> +   /* dev addr , sys addr  , size  , flags */
> +   /* TCM CODE NON-SECURE */
> +   { 0x, 0x203C, 0x0004, ATT_OWN | ATT_IOMEM },
> +
> +   /* TCM SYS NON-SECURE*/
> +   { 0x2000, 0x2040, 0x0004, ATT_OWN | ATT_IOMEM },
> +
> +   /* DDR */
> +   { 0x8000, 0x8000, 0x5000, 0 },
>  };

^ this belongs to patch 3/3

Otherwise, patch looks good to me.

RE: [PATCH 2/3] remoteproc: imx_rproc: Add support for System Manager API

2025-06-04 Thread Peng Fan

> Subject: Re: [PATCH 2/3] remoteproc: imx_rproc: Add support for
> System Manager API
> 
> Hi Peng,
> 
> Thanks a lot for the patches. Comments inline:
> 
> On Wed, Jun 4, 2025 at 5:36 AM Peng Fan (OSS)
>  wrote:
> >
> > From: Peng Fan 
> >
> > i.MX95 features a Cortex-M33 core, six Cortex-A55 cores, and one
> > Cortex-M7 core. The System Control Management Interface(SCMI)
> firmware
> > runs on the M33 core. The i.MX95 SCMI firmware named System
> > Manager(SM) includes vendor extension protocols, Logical Machine
> > Management(LMM) protocol and CPU protocol and etc.
> >
> > There are three cases for M7:
> >  (1) M7 in a separate Logical Machine(LM) that Linux couldn't control
> it.
> couldn't -> can't
> 
> >  (2) M7 in a separate Logical Machine that Linux could control it using
> >  LMM protocol
> could -> can
> 
> >  (3) M7 runs in same Logical Machine as A55, so Linux could control it
> >  using CPU protocol
> could -> can
> 



> >
> > So extend the driver to using LMM and CPU protocol to manage the
> M7 core.
> >  - Add IMX_RPROC_SM to indicate the remotecores runs on a SoC
> that
> >has System Manager.
> 
> remotecores -> remote core
> 
> >  - Compare linux LM ID(got using scmi_imx_lmm_info) and M7 LM
> ID(got
> >from DTB), if same, use CPU protocol to start/stop. Otherwise, use
> >LMM protocol to start/stop.
> >Whether using CPU or LMM protocol to start/stop, the M7 status
> >detection could use CPU protocol to detect started or not. So
> >in imx_rproc_detect_mode, use scmi_imx_cpu_started to check the
> >status of M7.
> >  - For above case 1 and 2, Use SCMI_IMX_LMM_POWER_ON to
> detect whether
> >the M7 LM is under control of A55 LM.
> >
> > Current setup relies on pre-Linux software(U-Boot) to do
> > M7 TCM ECC initialization. In future, we could add the support in
> > Linux to decouple U-Boot and Linux.
> >
> > Signed-off-by: Peng Fan 
> > ---
> >  drivers/remoteproc/imx_rproc.c | 139
> -
> >  drivers/remoteproc/imx_rproc.h |   2 +
> >  2 files changed, 139 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/remoteproc/imx_rproc.c
> > b/drivers/remoteproc/imx_rproc.c index
> >
> 74299af1d7f10a0db794de494c52304b2323b89f..0649faa98725db99
> 366946c65edf
> > 5b7daff78316 100644
> > --- a/drivers/remoteproc/imx_rproc.c
> > +++ b/drivers/remoteproc/imx_rproc.c
> > @@ -8,6 +8,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -21,6 +22,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >
> >  #include "imx_rproc.h"
> > @@ -91,6 +93,11 @@ struct imx_rproc_mem {
> >  #define ATT_CORE_MASK   0x
> >  #define ATT_CORE(I) BIT((I))
> >
> > +/* Logical Machine Operation */
> > +#define IMX_RPROC_FLAGS_SM_LMM_OP  BIT(0)
> > +/* Linux has permission to handle the Logical Machine of remote
> cores */
> > +#define IMX_RPROC_FLAGS_SM_LMM_AVAIL   BIT(1)
> > +
> >  static int imx_rproc_xtr_mbox_init(struct rproc *rproc, bool
> > tx_block);  static void imx_rproc_free_mbox(struct rproc *rproc);
> >
> > @@ -115,6 +122,22 @@ struct imx_rproc {
> > u32 entry;  /* cpu start 
> > address */
> > u32 core_index;
> > struct dev_pm_domain_list   *pd_list;
> > +   /* For i.MX System Manager based systems */
> > +   u32 cpuid;
> > +   u32 lmid;
> > +   u32 flags;
> > +};
> > +
> > +static const struct imx_rproc_att imx_rproc_att_imx95_m7[] = {
> > +   /* dev addr , sys addr  , size  , flags */
> > +   /* TCM CODE NON-SECURE */
> > +   { 0x, 0x203C, 0x0004, ATT_OWN |
> ATT_IOMEM },
> > +
> > +   /* TCM SYS NON-SECURE*/
> > +   { 0x2000, 0x2040, 0x0004, ATT_OWN |
> ATT_IOMEM },
> > +
> > +   /* DDR */
> > +   { 0x8000, 0x8000, 0x5000, 0 },
> >  };
> 
> ^ this belongs to patch 3/3
> 
> Otherwise, patch looks good to me.

oops. I missed to move this to patch 3 when addressing
your internal reviewing comments.

Fix in V2 together with the typo that you mention
above.

Thanks,
Peng.

Re: [PATCH net] virtio-net: drop the multi-buffer XDP packet in zerocopy

2025-06-04 Thread Bui Quang Minh


On 6/4/25 07:37, Jason Wang wrote:

On Tue, Jun 3, 2025 at 11:07 PM Bui Quang Minh  wrote:

In virtio-net, we have not yet supported multi-buffer XDP packet in
zerocopy mode when there is a binding XDP program. However, in that
case, when receiving multi-buffer XDP packet, we skip the XDP program
and return XDP_PASS. As a result, the packet is passed to normal network
stack which is an incorrect behavior. This commit instead returns
XDP_DROP in that case.

Fixes: 99c861b44eb1 ("virtio_net: xsk: rx: support recv merge mode")
Cc: sta...@vger.kernel.org
Signed-off-by: Bui Quang Minh 
---
  drivers/net/virtio_net.c | 11 ---
  1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e53ba600605a..4c35324d6e5b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1309,9 +1309,14 @@ static struct sk_buff *virtnet_receive_xsk_merge(struct 
net_device *dev, struct
 ret = XDP_PASS;

It would be simpler to just assign XDP_DROP here?

Or if you wish to stick to the way, we can simply remove this assignment.


This XDP_PASS is returned for the case when there is no XDP program 
binding (!prog).





 rcu_read_lock();
 prog = rcu_dereference(rq->xdp_prog);
-   /* TODO: support multi buffer. */
-   if (prog && num_buf == 1)
-   ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats);
+   if (prog) {
+   /* TODO: support multi buffer. */
+   if (num_buf == 1)
+   ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit,
+ stats);
+   else
+   ret = XDP_DROP;
+   }
 rcu_read_unlock();

 switch (ret) {
--
2.43.0


Thanks




Thanks,
Quang Minh.

[PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

The walk_page_range_novma() function is rather confusing - it supports two
modes, one used often, the other used only for debugging.

The first mode is the common case of traversal of kernel page tables, which
is what nearly all callers use this for.

Secondly it provides an unusual debugging interface that allows for the
traversal of page tables in a userland range of memory even for that memory
which is not described by a VMA.

It is far from certain that such page tables should even exist, but perhaps
this is precisely why it is useful as a debugging mechanism.

As a result, this is utilised by ptdump only. Historically, things were
reversed - ptdump was the only user, and other parts of the kernel evolved
to use the kernel page table walking here.

Since we have some complicated and confusing locking rules for the novma
case, it makes sense to separate the two usages into their own functions.

Doing this also provide self-documentation as to the intent of the caller -
are they doing something rather unusual or are they simply doing a standard
kernel page table walk?

We therefore establish two separate functions - walk_page_range_debug() for
this single usage, and walk_kernel_page_table_range() for general kernel
page table walking.

We additionally make walk_page_range_debug() internal to mm.

Note that ptdump uses the precise same function for kernel walking as a
convenience, so we permit this but make it very explicit by having
walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.

Signed-off-by: Lorenzo Stoakes 
Acked-by: Mike Rapoport (Microsoft) 
---
v2:
* Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
* Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
* Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
  per David.

v1 resend:
* Actually cc'd lists...
* Fixed mistake in walk_page_range_novma() not handling kernel mappings and
  update commit message to referene.
* Added Mike's off-list Acked-by.
* Fixed up comments as per Mike.
* Add some historic flavour to the commit message as per Mike.
https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoa...@oracle.com/

v1:
(accidentally sent off-list due to error in scripting)

 arch/loongarch/mm/pageattr.c |  2 +-
 arch/openrisc/kernel/dma.c   |  4 +-
 arch/riscv/mm/pageattr.c |  8 +--
 include/linux/pagewalk.h |  7 ++-
 mm/hugetlb_vmemmap.c |  2 +-
 mm/internal.h|  4 ++
 mm/pagewalk.c| 98 
 mm/ptdump.c  |  3 +-
 8 files changed, 82 insertions(+), 46 deletions(-)

diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
index 99165903908a..f5e910b68229 100644
--- a/arch/loongarch/mm/pageattr.c
+++ b/arch/loongarch/mm/pageattr.c
@@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, 
pgprot_t set_mask, pgp
return 0;

mmap_write_lock(&init_mm);
-   ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, 
&masks);
+   ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, 
&masks);
mmap_write_unlock(&init_mm);

flush_tlb_kernel_range(start, end);
diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index 3a7b5baaa450..af932a4ad306 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
 * them and setting the cache-inhibit bit.
 */
mmap_write_lock(&init_mm);
-   error = walk_page_range_novma(&init_mm, va, va + size,
+   error = walk_kernel_page_table_range(va, va + size,
&set_nocache_walk_ops, NULL, NULL);
mmap_write_unlock(&init_mm);

@@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)

mmap_write_lock(&init_mm);
/* walk_page_range shouldn't be able to fail here */
-   WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
+   WARN_ON(walk_kernel_page_table_range(va, va + size,
&clear_nocache_walk_ops, NULL, NULL));
mmap_write_unlock(&init_mm);
 }
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index d815448758a1..3f76db3d2769 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, 
pgprot_t set_mask,
if (ret)
goto unlock;

-   ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
+   ret = walk_kernel_page_table_range(lm_start, lm_end,
&pageattr_ops, NULL, 
&masks);
if (ret)
goto unlock;
@@ -317,13 +317,13 @@ static int __set_memory(unsigned long addr, int numpages, 
pgprot_t set_mask,

Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

On Wed, Jun 4, 2025 at 7:21 AM Lorenzo Stoakes
 wrote:
>
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
>
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use this for.
>
> Secondly it provides an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory even for that memory
> which is not described by a VMA.
>
> It is far from certain that such page tables should even exist, but perhaps
> this is precisely why it is useful as a debugging mechanism.
>
> As a result, this is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use the kernel page table walking here.
>
> Since we have some complicated and confusing locking rules for the novma
> case, it makes sense to separate the two usages into their own functions.
>
> Doing this also provide self-documentation as to the intent of the caller -
> are they doing something rather unusual or are they simply doing a standard
> kernel page table walk?
>
> We therefore establish two separate functions - walk_page_range_debug() for
> this single usage, and walk_kernel_page_table_range() for general kernel
> page table walking.
>
> We additionally make walk_page_range_debug() internal to mm.
>
> Note that ptdump uses the precise same function for kernel walking as a
> convenience, so we permit this but make it very explicit by having
> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>
> Signed-off-by: Lorenzo Stoakes 
> Acked-by: Mike Rapoport (Microsoft) 

Reviewed-by: Suren Baghdasaryan 

> ---
> v2:
> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>   per David.
>
> v1 resend:
> * Actually cc'd lists...
> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>   update commit message to referene.
> * Added Mike's off-list Acked-by.
> * Fixed up comments as per Mike.
> * Add some historic flavour to the commit message as per Mike.
> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoa...@oracle.com/
>
> v1:
> (accidentally sent off-list due to error in scripting)
>
>  arch/loongarch/mm/pageattr.c |  2 +-
>  arch/openrisc/kernel/dma.c   |  4 +-
>  arch/riscv/mm/pageattr.c |  8 +--
>  include/linux/pagewalk.h |  7 ++-
>  mm/hugetlb_vmemmap.c |  2 +-
>  mm/internal.h|  4 ++
>  mm/pagewalk.c| 98 
>  mm/ptdump.c  |  3 +-
>  8 files changed, 82 insertions(+), 46 deletions(-)
>
> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
> index 99165903908a..f5e910b68229 100644
> --- a/arch/loongarch/mm/pageattr.c
> +++ b/arch/loongarch/mm/pageattr.c
> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, 
> pgprot_t set_mask, pgp
> return 0;
>
> mmap_write_lock(&init_mm);
> -   ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, 
> NULL, &masks);
> +   ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, 
> &masks);
> mmap_write_unlock(&init_mm);
>
> flush_tlb_kernel_range(start, end);
> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> index 3a7b5baaa450..af932a4ad306 100644
> --- a/arch/openrisc/kernel/dma.c
> +++ b/arch/openrisc/kernel/dma.c
> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>  * them and setting the cache-inhibit bit.
>  */
> mmap_write_lock(&init_mm);
> -   error = walk_page_range_novma(&init_mm, va, va + size,
> +   error = walk_kernel_page_table_range(va, va + size,
> &set_nocache_walk_ops, NULL, NULL);
> mmap_write_unlock(&init_mm);
>
> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>
> mmap_write_lock(&init_mm);
> /* walk_page_range shouldn't be able to fail here */
> -   WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
> +   WARN_ON(walk_kernel_page_table_range(va, va + size,
> &clear_nocache_walk_ops, NULL, NULL));
> mmap_write_unlock(&init_mm);
>  }
> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> index d815448758a1..3f76db3d2769 100644
> --- a/arch/riscv/mm/pageattr.c
> +++ b/arch/riscv/mm/pageattr.c
> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, 
> pgprot_t set_mask,
> if (ret)
> goto unlock;
>
> -   ret = walk_page_range_novma(&init_mm, lm_start, 
> lm_end,
> +   ret = walk_kern

Re: [PATCH v4 4/7] cxl/acpi: Add background worker to wait for cxl_pci and cxl_mem probe

2025-06-04 Thread Dave Jiang




On 6/4/25 2:40 AM, Zhijian Li (Fujitsu) wrote:
> 
> 
> On 04/06/2025 06:19, Smita Koralahalli wrote:
>>   drivers/cxl/acpi.c | 23 +++
>>   drivers/cxl/core/suspend.c | 21 +
>>   drivers/cxl/cxl.h  |  2 ++
>>   3 files changed, 46 insertions(+)
>>
>> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
>> index cb14829bb9be..978f63b32b41 100644
>> --- a/drivers/cxl/acpi.c
>> +++ b/drivers/cxl/acpi.c
>> @@ -813,6 +813,24 @@ static int pair_cxl_resource(struct device *dev, void 
>> *data)
>>  return 0;
>>   }
>>   
>> +static void cxl_softreserv_mem_work_fn(struct work_struct *work)
>> +{
>> +/* Wait for cxl_pci and cxl_mem drivers to load */
>> +cxl_wait_for_pci_mem();
>> +
>> +/*
>> + * Wait for the driver probe routines to complete after cxl_pci
>> + * and cxl_mem drivers are loaded.
>> + */
>> +wait_for_device_probe();
>> +}
>> +static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn);
>> +
>> +static void cxl_softreserv_mem_update(void)
>> +{
>> +schedule_work(&cxl_sr_work);
>> +}
>> +
>>   static int cxl_acpi_probe(struct platform_device *pdev)
>>   {
>>  int rc;
>> @@ -887,6 +905,10 @@ static int cxl_acpi_probe(struct platform_device *pdev)
>>   
>>  /* In case PCI is scanned before ACPI re-trigger memdev attach */
>>  cxl_bus_rescan();
>> +
>> +/* Update SOFT RESERVE resources that intersect with CXL regions */
>> +cxl_softreserv_mem_update();
>> +
>>  return 0;
>>   }
>>   
>> @@ -918,6 +940,7 @@ static int __init cxl_acpi_init(void)
>>   
>>   static void __exit cxl_acpi_exit(void)
>>   {
>> +cancel_work_sync(&cxl_sr_work);
>>  platform_driver_unregister(&cxl_acpi_driver);
>>  cxl_bus_drain();
>>   }
>> diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/suspend.c
>> index 72818a2c8ec8..c0d8f70aed56 100644
>> --- a/drivers/cxl/core/suspend.c
>> +++ b/drivers/cxl/core/suspend.c
>> @@ -2,12 +2,15 @@
>>   /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
>>   #include 
>>   #include 
>> +#include 
>>   #include "cxlmem.h"
>>   #include "cxlpci.h"
>>   
>>   static atomic_t mem_active;
>>   static atomic_t pci_loaded;
>>   
>> +static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue);
> 
> Given that this file (suspend.c) focuses on power management functions,
> it might be more appropriate to move the wait queue declaration and its
> related changes to acpi.c in where the its caller is.

You mean drivers/cxl/acpi.c and not drivers/cxl/core/acpi.c right? The core one 
is my mistake and I'm going to create a patch to remove that.

DJ

> 
> 
> Thanks
> Zhijian
> 
>> +
>>   bool cxl_mem_active(void)
>>   {
>>  if (IS_ENABLED(CONFIG_CXL_MEM))
>> @@ -19,6 +22,7 @@ bool cxl_mem_active(void)
>>   void cxl_mem_active_inc(void)
>>   {
>>  atomic_inc(&mem_active);
>> +wake_up(&cxl_wait_queue);
>>   }
>>   EXPORT_SYMBOL_NS_GPL(cxl_mem_active_inc, "CXL");
>>   
>> @@ -28,8 +32,25 @@ void cxl_mem_active_dec(void)
>>   }
>>   EXPORT_SYMBOL_NS_GPL(cxl_mem_active_dec, "CXL");
>>   
>> +static bool cxl_pci_loaded(void)
>> +{
>> +if (IS_ENABLED(CONFIG_CXL_PCI))
>> +return atomic_read(&pci_loaded) != 0;
>> +
>> +return false;
>> +}
>> +
>>   void mark_cxl_pci_loaded(void)
>>   {
>>  atomic_inc(&pci_loaded);
>> +wake_up(&cxl_wait_queue);
>>   }
>>   EXPORT_SYMBOL_NS_GPL(mark_cxl_pci_loaded, "CXL");
>> +
>> +void cxl_wait_for_pci_mem(void)
>> +{
>> +if (!wait_event_timeout(cxl_wait_queue, cxl_pci_loaded() &&
>> +cxl_mem_active(), 30 * HZ))
>> +pr_debug("Timeout waiting for cxl_pci or cxl_mem probing\n");
>> +}

[PATCH] module: Make sure relocations are applied to the per-CPU section

2025-06-04 Thread Sebastian Andrzej Siewior

The per-CPU data section is handled differently than the other sections.
The memory allocations requires a special __percpu pointer and then the
section is copied into the view of each CPU. Therefore the SHF_ALLOC
flag is removed to ensure move_module() skips it.

Later, relocations are applied and apply_relocations() skips sections
without SHF_ALLOC because they have not been copied. This also skips the
per-CPU data section.

The missing relocations result in a NULL pointer on x86-64 and very
small values on x86-32. This results in a crash because it is not
skipped like NULL pointer would and it can't be dereferenced.

Such an assignment happens during compile time per-CPU lock
initialisation with lockdep enabled.

Add the SHF_ALLOC flag back for the per-CPU section after move_module().

Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-lkp/202506041623.e45e4f7d-...@intel.com
Fixes: 8d8022e8aba85 ("module: do percpu allocation after uniqueness check.  
No, really!")
Signed-off-by: Sebastian Andrzej Siewior 
---
 kernel/module/main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 5c6ab20240a6d..35abb5f13d7dc 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2816,6 +2816,9 @@ static struct module *layout_and_allocate(struct 
load_info *info, int flags)
if (err)
return ERR_PTR(err);
 
+   /* Add SHF_ALLOC back so that relocations are applied. */
+   info->sechdrs[info->index.pcpu].sh_flags |= SHF_ALLOC;
+
/* Module has been copied to its final place now: return it. */
mod = (void *)info->sechdrs[info->index.mod].sh_addr;
kmemleak_load_module(mod, info);
-- 
2.49.0

Re: [GIT PULL] SPDX/LICENSES update for 6.16-rc1

2025-06-04 Thread pr-tracker-bot

The pull request you sent on Wed, 4 Jun 2025 10:36:44 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx.git 
> tags/spdx-6.16-rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f5ebe7bb87e04537b37573f0a2516baa90ee23c0

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

2025-06-04 Thread David Hildenbrand


On 03.06.25 21:22, Lorenzo Stoakes wrote:

The walk_page_range_novma() function is rather confusing - it supports two
modes, one used often, the other used only for debugging.

The first mode is the common case of traversal of kernel page tables, which
is what nearly all callers use this for.


... and what people should be using it for 🙂



Secondly it provides an unusual debugging interface that allows for the
traversal of page tables in a userland range of memory even for that memory
which is not described by a VMA.

This is highly unusual and it is far from certain that such page tables
should even exist, but perhaps this is precisely why it is useful as a
debugging mechanism.

As a result, this is utilised by ptdump only. Historically, things were
reversed - ptdump was the only user, and other parts of the kernel evolved
to use the kernel page table walking here.

Since we have some complicated and confusing locking rules for the novma
case, it makes sense to separate the two usages into their own functions.

Doing this also provide self-documentation as to the intent of the caller -
are they doing something rather unusual or are they simply doing a standard
kernel page table walk?

We therefore maintain walk_page_range_novma() for this single usage, and
document the function as such.


If we have to keep this dangerous interface, it should probably be

walk_page_range_debug() or walk_page_range_dump()



Note that ptdump uses the precise same function for kernel walking as a
convenience, so we permit this but make it very explicit by having
walk_page_range_novma() invoke walk_page_range_kernel() in this case.

We introduce walk_page_range_kernel() for the far more common case of
kernel page table traversal.


I wonder if we should give it a completely different name scheme to
highlight that this is something completely different.

walk_kernel_page_table_range()

etc.


--
Cheers,

David / dhildenb

Re: [PATCH net] virtio-net: drop the multi-buffer XDP packet in zerocopy

2025-06-04 Thread Zvi Effron

On Tue, Jun 3, 2025 at 8:09 AM Bui Quang Minh  wrote:
>
> In virtio-net, we have not yet supported multi-buffer XDP packet in
> zerocopy mode when there is a binding XDP program. However, in that
> case, when receiving multi-buffer XDP packet, we skip the XDP program
> and return XDP_PASS. As a result, the packet is passed to normal network
> stack which is an incorrect behavior. This commit instead returns
> XDP_DROP in that case.

Does it make more sense to return XDP_ABORTED? This seems like an unexpected
exception case to me, but I'm not familiar enough with virtio-net's multibuffer
support.

>
> Fixes: 99c861b44eb1 ("virtio_net: xsk: rx: support recv merge mode")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Bui Quang Minh 
> ---
> drivers/net/virtio_net.c | 11 ---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index e53ba600605a..4c35324d6e5b 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1309,9 +1309,14 @@ static struct sk_buff 
> *virtnet_receive_xsk_merge(struct net_device *dev, struct
> ret = XDP_PASS;
> rcu_read_lock();
> prog = rcu_dereference(rq->xdp_prog);
> - /* TODO: support multi buffer. */
> - if (prog && num_buf == 1)
> - ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats);
> + if (prog) {
> + /* TODO: support multi buffer. */
> + if (num_buf == 1)
> + ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit,
> + stats);
> + else
> + ret = XDP_DROP;
> + }
> rcu_read_unlock();
>
> switch (ret) {
> --
> 2.43.0
>
>

Re: [Question] attributes encoding in BTF

2025-06-04 Thread Ihor Solodrai

On 6/4/25 2:02 AM, Alexis LothorÃ© wrote:

Hi all,
a simpler version of this series has been merged, and so I am now taking a
look at the issue I have put aside in the merged version: dealing with more
specific data layout for arguments passed on stack. For example, a function
can pass small structs on stack, but this need special care when generating
the corresponding bpf trampolines. Those structs have specific alignment
specified by the target ABI, but it can also be altered with attributes
packing the structure or modifying the alignment.

Some platforms already support structs on stack (see
tracing_struct_many_args test), but as discussed earlier, those may suffer
from the same kind of issue mentioned above.

On Wed Apr 23, 2025 at 9:24 PM CEST, Alexis Lothoré wrote:

Hi Andrii,

On Wed Apr 23, 2025 at 7:15 PM CEST, Andrii Nakryiko wrote:

On Thu, Apr 17, 2025 at 12:14 AM Alexis Lothoré
wrote:

Hi Andrii,

On Wed Apr 16, 2025 at 11:24 PM CEST, Andrii Nakryiko wrote:

[...]

I'd suggest looking at btf__align_of() in libbpf (tools/lib/bpf/btf.c)
to see how we calculate alignment there. It seems to work decently
enough. It won't cover any arch-specific extra rules like double
needing 16-byte alignment (I vaguely remember something like that for
some architectures, but I might be misremembering), or anything
similar. It also won't detect (I don't think it's possible without
DWARF) artificially increased alignment with attribute((aligned(N))).

Thanks for the pointer, I'll take a look at it. The more we discuss this
series, the less member size sounds relevant for what I'm trying to achieve
here.

Following Xu's comments, I have been thinking about how I could detect the
custom alignments and packing on structures, and I was wondering if I could
somehow benefit from __attribute__ encoding in BTF info ([1]). But
following your hint, I also see some btf_is_struct_packed() in
tools/lib/bpf/btf_dump.c that could help. I'll dig this further and see if
I can manage to make something work with all of this.

Andrii's comment above illustrates well my current issue: when functions
pass arguments on stack, we are missing info for some of them to correctly
build trampolines, especially for struct, which can have attributes like
__attribute__((packed)) or __attribute__((align(x))). [1] seems to be a
recent solution implemented for BTF to cover this need. IIUC it encodes any
arbitratry attribute affecting a data type or function, so if I have some
struct like this one in my kernel or a module:

struct foo {
short b
int a;
} __packed;

I would expect the corresponding BTF data to have some BTF_KIND_DECL_TAG
describing the "packed" attribute for the corresponding structure, but I
fail to find any of those when running:

$ bpftool btf dump file vmlinux format raw

In there I see some DECL_TAG but those are mostly 'bpf_kfunc', I see not
arbitrary attribute like 'packed' or 'aligned(x)'.

What I really need to do in the end is to be able to parse those alignments
attributes info in the kernel at runtime when generating trampolines, but I
hoped to be able to see them with bpftool first to validate the concept.

I started taking a look further at this and stumbled upon [2] in which Alan
gives many more details about the feature, so I did the following checks:
- kernel version 6.15.0-rc4 from bpf-next_base: it contains the updated
Makefile.btf calling pahole with `--btf_features=attributes`
- pahole v1.30
$ pahole --supported_btf_features

encode_force,var,float,decl_tag,type_tag,enum64,optimized_func,consistent_func,decl_tag_kfuncs,reproducible_build,distilled_base,global_var,attributes
This pahole comes from my distro pkg manager, but I have also done the
same test with a freshly built pahole, while taking care of pulling the
libbpf submodule.
- bpftool v7.6.0
bpftool v7.6.0
using libbpf v1.6
features: llvm, skeletons

Could I be missing something obvious ? Or did I misunderstand the actual
attribute encoding feature ?

Hi Alexis.

The changes recently landed in pahole and libbpf re attributes had a
very narrow goal: passing through particular attributes for some BPF
kfuncs from the kernel source to vmlinux.h

BTF now has a way of encoding any attribute (as opposed to only bpf
type/decl tags) by setting type/decl tag kind flag [1]. So it is
possible to represent attributes like packed and aligned in BTF.

However, the BTF tags need to be generated by something, in case of
vmlinux by pahole. Pahole generates BTF by parsing DWARF. And, as far as
I understand, attributes are not (can not be?) represented in DWARF in a
generic way, it really depends on specifics of the attribute.

In order to support packed/aligned, pahole needs to know how to figure
them out from DWARF input and add the tags to BTF. And this does not
happen right now, which is why you don't see anything in bpftool output.

[1]
https://lore.kernel.org/bpf/20250130201239.1429648-1-ihor.solod...@linu

Re: [PATCH] selftests: cachestat: add tests for mmap and /proc/cpuinfo

2025-06-04 Thread Nhat Pham

On Sat, May 24, 2025 at 1:36 AM Suresh K C
 wrote:
>
> From: Suresh K C 
>
> Add a test case to verify cachestat behavior with memory-mapped files
> using mmap(). This ensures that pages accessed via mmap are correctly
> accounted for in the page cache.
>
> Also add a test for /proc/cpuinfo to validate cachestat's handling of
> virtual files in pseudo-filesystems. This improves test coverage for

Hmm, it's been awhile since I wrote these tests, but isn't there
already a test for /proc/* files?

> edge cases involving non-regular files.
>
> Tested on x86_64 with default kernel config.
>
> Signed-off-by: Suresh K C 
> ---
>  .../selftests/cachestat/test_cachestat.c  | 69 ++-
>  1 file changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/cachestat/test_cachestat.c 
> b/tools/testing/selftests/cachestat/test_cachestat.c
> index 632ab44737ec..81e7f6dd2279 100644
> --- a/tools/testing/selftests/cachestat/test_cachestat.c
> +++ b/tools/testing/selftests/cachestat/test_cachestat.c
> @@ -22,7 +22,7 @@
>
>  static const char * const dev_files[] = {
> "/dev/zero", "/dev/null", "/dev/urandom",
> -   "/proc/version", "/proc"
> +   "/proc/version","/proc/cpuinfo","/proc"
>  };
>
>  void print_cachestat(struct cachestat *cs)
> @@ -202,6 +202,65 @@ static int test_cachestat(const char *filename, bool 
> write_random, bool create,
> return ret;
>  }
>
> +bool test_cachestat_mmap(void){
> +
> +   size_t PS = sysconf(_SC_PAGESIZE);
> +   size_t filesize = PS * 512 * 2;;
> +   int syscall_ret;
> +   size_t compute_len = PS * 512;
> +   struct cachestat_range cs_range = { PS, compute_len };
> +   char *filename = "tmpshmcstat";
> +   unsigned long num_pages = compute_len / PS;
> +   struct cachestat cs;
> +   bool ret = true;
> +   int fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666);
> +   if (fd < 0) {
> +   ksft_print_msg("Unable to create mmap file.\n");
> +   ret = false;
> +   goto out;
> +   }
> +   if (ftruncate(fd, filesize)) {
> +   ksft_print_msg("Unable to truncate mmap file.\n");
> +   ret = false;
> +   goto close_fd;
> +   }
> +   if (!write_exactly(fd, filesize)) {
> +   ksft_print_msg("Unable to write to mmap file.\n");
> +   ret = false;
> +   goto close_fd;
> +   }
> +   char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, 
> fd, 0);
> +   if (map == MAP_FAILED) {
> +   ksft_print_msg("mmap failed.\n");
> +   ret = false;
> +   goto close_fd;
> +   }
> +
> +   for (int i = 0; i < filesize; i++) {
> +   map[i] = 'A';
> +   }
> +   map[filesize - 1] = 'X';
> +
> +   syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0);
> +
> +   if (syscall_ret) {
> +   ksft_print_msg("Cachestat returned non-zero.\n");
> +   ret = false;
> +   } else {
> +   print_cachestat(&cs);
> +   if (cs.nr_cache + cs.nr_evicted != num_pages) {
> +   ksft_print_msg("Total number of cached and evicted 
> pages is off.\n");
> +   ret = false;
> +   }
> +   }
> +
> +close_fd:
> +   close(fd);
> +   unlink(filename);
> +out:
> +   return ret;
> +}
> +

This looks 90% the same as another test. Are we literally just adding
the mmap step to test_cachestat and call it a new test?

Can we at least refactor things?

Re: [PATCH v4 4/7] cxl/acpi: Add background worker to wait for cxl_pci and cxl_mem probe

2025-06-04 Thread Koralahalli Channabasappa, Smita


On 6/3/2025 4:45 PM, Dave Jiang wrote:



On 6/3/25 3:19 PM, Smita Koralahalli wrote:

Introduce a waitqueue mechanism to coordinate initialization between the
cxl_pci and cxl_mem drivers.

Launch a background worker from cxl_acpi_probe() that waits for both
drivers to complete initialization before invoking wait_for_device_probe().
Without this, the probe completion wait could begin prematurely, before
the drivers are present, leading to missed updates.

Co-developed-by: Nathan Fontenot 
Signed-off-by: Nathan Fontenot 
Co-developed-by: Terry Bowman 
Signed-off-by: Terry Bowman 
Signed-off-by: Smita Koralahalli 
---
  drivers/cxl/acpi.c | 23 +++
  drivers/cxl/core/suspend.c | 21 +
  drivers/cxl/cxl.h  |  2 ++
  3 files changed, 46 insertions(+)

diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index cb14829bb9be..978f63b32b41 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -813,6 +813,24 @@ static int pair_cxl_resource(struct device *dev, void 
*data)
return 0;
  }
  
+static void cxl_softreserv_mem_work_fn(struct work_struct *work)

+{
+   /* Wait for cxl_pci and cxl_mem drivers to load */
+   cxl_wait_for_pci_mem();
+
+   /*
+* Wait for the driver probe routines to complete after cxl_pci
+* and cxl_mem drivers are loaded.
+*/
+   wait_for_device_probe();
+}
+static DECLARE_WORK(cxl_sr_work, cxl_softreserv_mem_work_fn);
+
+static void cxl_softreserv_mem_update(void)
+{
+   schedule_work(&cxl_sr_work);
+}
+
  static int cxl_acpi_probe(struct platform_device *pdev)
  {
int rc;
@@ -887,6 +905,10 @@ static int cxl_acpi_probe(struct platform_device *pdev)
  
  	/* In case PCI is scanned before ACPI re-trigger memdev attach */

cxl_bus_rescan();
+
+   /* Update SOFT RESERVE resources that intersect with CXL regions */
+   cxl_softreserv_mem_update();
+
return 0;
  }
  
@@ -918,6 +940,7 @@ static int __init cxl_acpi_init(void)
  
  static void __exit cxl_acpi_exit(void)

  {
+   cancel_work_sync(&cxl_sr_work);
platform_driver_unregister(&cxl_acpi_driver);
cxl_bus_drain();
  }
diff --git a/drivers/cxl/core/suspend.c b/drivers/cxl/core/suspend.c
index 72818a2c8ec8..c0d8f70aed56 100644
--- a/drivers/cxl/core/suspend.c
+++ b/drivers/cxl/core/suspend.c
@@ -2,12 +2,15 @@
  /* Copyright(c) 2022 Intel Corporation. All rights reserved. */
  #include 
  #include 
+#include 
  #include "cxlmem.h"
  #include "cxlpci.h"
  
  static atomic_t mem_active;

  static atomic_t pci_loaded;
  
+static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue);

+
  bool cxl_mem_active(void)
  {
if (IS_ENABLED(CONFIG_CXL_MEM))
@@ -19,6 +22,7 @@ bool cxl_mem_active(void)
  void cxl_mem_active_inc(void)
  {
atomic_inc(&mem_active);
+   wake_up(&cxl_wait_queue);
  }
  EXPORT_SYMBOL_NS_GPL(cxl_mem_active_inc, "CXL");
  
@@ -28,8 +32,25 @@ void cxl_mem_active_dec(void)

  }
  EXPORT_SYMBOL_NS_GPL(cxl_mem_active_dec, "CXL");
  
+static bool cxl_pci_loaded(void)

+{
+   if (IS_ENABLED(CONFIG_CXL_PCI))
+   return atomic_read(&pci_loaded) != 0;
+
+   return false;
+}
+
  void mark_cxl_pci_loaded(void)
  {
atomic_inc(&pci_loaded);
+   wake_up(&cxl_wait_queue);
  }
  EXPORT_SYMBOL_NS_GPL(mark_cxl_pci_loaded, "CXL");
+
+void cxl_wait_for_pci_mem(void)
+{
+   if (!wait_event_timeout(cxl_wait_queue, cxl_pci_loaded() &&
+   cxl_mem_active(), 30 * HZ))


I'm trying to understand why cxl_pci_loaded() is needed. cxl_mem_active() goes 
above 0 when a cxl_mem_probe() instance succeeds. cxl_mem_probe() being 
triggered implies that an instance of cxl_pci_probe() has been called since 
cxl_mem_probe() is triggered from devm_cxl_add_memdev() with memdev being added 
and cxl_mem driver also have been loaded. So does cxl_mem_active() not imply 
cxl_pci_loaded() and makes it unnecessary?


Yeah you are right. I will remove this check.

Thanks
Smita


DJ



+   pr_debug("Timeout waiting for cxl_pci or cxl_mem probing\n");
+}
+EXPORT_SYMBOL_NS_GPL(cxl_wait_for_pci_mem, "CXL");
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index a9ab46eb0610..1ba7d39c2991 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -902,6 +902,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
  
  bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
  
+void cxl_wait_for_pci_mem(void);

+
  /*
   * Unit test builds overrides this to __weak, find the 'strong' version
   * of these symbols in tools/testing/cxl/.

[PATCH vhost] vdpa/mlx5: Fix needs_teardown flag calculation

2025-06-04 Thread Dragos Tatulea

needs_teardown is a device flag that indicates when virtual queues need
to be recreated. This happens for certain configuration changes: queue
size and some specific features.

Currently, the needs_teardown state can be incorrectly reset by
subsequent .set_vq_num() calls. For example, for 1 rx VQ with size 512
and 1 tx VQ with size 256:

.set_vq_num(0, 512) -> sets needs_teardown to true (rx queue has a
   non-default size)
.set_vq_num(1, 256) -> sets needs_teardown to false (tx queue has a
   default size)

This change takes into account the previous value of the needs_teardown
flag when re-calculating it during VQ size configuration.

Fixes: 0fe963d6fc16 ("vdpa/mlx5: Re-create HW VQs under certain conditions")
Signed-off-by: Dragos Tatulea 
Reviewed-by: Shahar Shitrit 
Reviewed-by: Si-Wei Liu 
Tested-by: Si-Wei Liu
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 49a08a1a..efb5fa694f1e 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2491,7 +2491,7 @@ static void mlx5_vdpa_set_vq_num(struct vdpa_device 
*vdev, u16 idx, u32 num)
 }
 
mvq = &ndev->vqs[idx];
-   ndev->needs_teardown = num != mvq->num_ent;
+   ndev->needs_teardown |= num != mvq->num_ent;
mvq->num_ent = num;
 }
 
-- 
2.43.0

Re: [PATCH v4 0/7] Add managed SOFT RESERVE resource handling

2025-06-04 Thread Koralahalli Channabasappa, Smita


Hi Zhijian,

Thanks for testing my patches.

On 6/4/2025 1:43 AM, Zhijian Li (Fujitsu) wrote:

Smita,

Thanks for your awesome work. I just tested the scenarios you listed, and they 
work as expected. Thanks again.
(Minor comments inlined)

Tested-by: Li Zhijian 


To the CXL community,

The scenarios mentioned here essentially cover what a correct firmware may 
provide. However,
I would like to discuss one more scenario that I can simulate with a modified 
QEMU:
The E820 exposes a SOFT RESERVED region which is the same as a CFMW, but the 
HDM decoders are not committed. This means no region will be auto-created 
during boot.

As an example, after boot, the iomem tree is as follows:
105000-304fff : CXL Window 0
105000-304fff : Soft Reserved
  

In this case, the SOFT RESERVED resource is not trimmed, so the end-user cannot 
create a new region.
My question is: Is this scenario a problem? If it is, should we fix it in this 
patchset or create a new patch?



I believe firmware should handle this correctly by ensuring that any 
exposed SOFT RESERVED ranges correspond to committed HDM decoders and 
result in region creation.


That said, I’d be interested in hearing what the rest of the community 
thinks.






On 04/06/2025 06:19, Smita Koralahalli wrote:

Add the ability to manage SOFT RESERVE iomem resources prior to them being
added to the iomem resource tree. This allows drivers, such as CXL, to
remove any pieces of the SOFT RESERVE resource that intersect with created
CXL regions.

The current approach of leaving the SOFT RESERVE resources as is can cause
failures during hotplug of devices, such as CXL, because the resource is
not available for reuse after teardown of the device.

The approach is to add SOFT RESERVE resources to a separate tree during
boot.


No special tree at all since V3


Will make changes. I overlooked the cover letter.





This allows any drivers to update the SOFT RESERVE resources before
they are merged into the iomem resource tree. In addition a notifier chain
is added so that drivers can be notified when these SOFT RESERVE resources
are added to the ioeme resource tree.

The CXL driver is modified to use a worker thread that waits for the CXL
PCI and CXL mem drivers to be loaded and for their probe routine to
complete. Then the driver walks through any created CXL regions to trim any
intersections with SOFT RESERVE resources in the iomem tree.

The dax driver uses the new soft reserve notifier chain so it can consume
any remaining SOFT RESERVES once they're added to the iomem tree.

The following scenarios have been tested:

Example 1: Exact alignment, soft reserved is a child of the region

|-- "Soft Reserved" ---|
|-- "Region #" |

Before:
105000-304fff : CXL Window 0
  105000-304fff : region0
105000-304fff : Soft Reserved
  108000-2f : dax0.0


BTW, I'm curious how to set up a dax with an address range different from its 
corresponding region.


Hmm, this configuration was provided directly by our BIOS. The DAX 
device was mapped to a subset of the region's address space as part of 
the platform's firmware setup, so I did not explicitly configure it..






108000-2f : System RAM (kmem)

After:
105000-304fff : CXL Window 0
  105000-304fff : region1
108000-2f : dax0.0
  108000-2f : System RAM (kmem)

Example 2: Start and/or end aligned and soft reserved spans multiple
regions


Tested



|--- "Soft Reserved" ---|
| "Region #" ---|
or
|--- "Soft Reserved" ---|
| "Region #" ---|


Typo? should be:
|--- "Soft Reserved" ---|
  | "Region #" ---|


Yeah, Will fix.





Example 3: No alignment
|-- "Soft Reserved" --|
| "Region #" |


Tested.


Thanks
Zhijian


Thanks
Smita

Re: [PATCH RFC net-next v2 2/3] vsock/test: Introduce get_transports()

2025-06-04 Thread Michal Luczaj

On 6/4/25 11:07, Stefano Garzarella wrote:
> On Wed, May 28, 2025 at 10:44:42PM +0200, Michal Luczaj wrote:
>> +static int __get_transports(void)
>> +{
>> +/* Order must match transports defined in util.h.
>> + * man nm: "d" The symbol is in the initialized data section.
>> + */
>> +const char * const syms[] = {
>> +"d loopback_transport",
>> +"d virtio_transport",
>> +"d vhost_transport",
>> +"d vmci_transport",
>> +"d hvs_transport",
>> +};
> 
> I would move this array (or a macro that define it), near the transport 
> defined in util.h, so they are near and we can easily update/review 
> changes.
> 
> BTW what about adding static asserts to check we are aligned?

Something like

#define KNOWN_TRANSPORTS\
_(LOOPBACK, "loopback") \
_(VIRTIO, "virtio") \
_(VHOST, "vhost")   \
_(VMCI, "vmci") \
_(HYPERV, "hvs")

enum transport {
TRANSPORT_COUNTER_BASE = __COUNTER__ + 1,
#define _(name, symbol) \
TRANSPORT_##name = _BITUL(__COUNTER__ - TRANSPORT_COUNTER_BASE),
KNOWN_TRANSPORTS
TRANSPORT_NUM = __COUNTER__ - TRANSPORT_COUNTER_BASE,
#undef _
};

static char * const transport_ksyms[] = {
#define _(name, symbol) "d " symbol "_transport",
KNOWN_TRANSPORTS
#undef _
};

static_assert(ARRAY_SIZE(transport_ksyms) == TRANSPORT_NUM);

?

Note that I keep pushing for naming HVS a TRANSPORT_HYPERV. Perhaps it's
better to stick to TRANSPORT_HVS after all?

>> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
>> index 
>> 0afe7cbae12e5194172c639ccfbeb8b81f7c25ac..63953e32c3e18e1aa5c2addcf6f09f433660fa84
>>  100644
>> --- a/tools/testing/vsock/util.h
>> +++ b/tools/testing/vsock/util.h
>> @@ -3,8 +3,19 @@
>> #define UTIL_H
>>
>> #include 
>> +#include 
>> #include 
>>
>> +#define KALLSYMS_PATH   "/proc/kallsyms"
>> +#define KALLSYMS_LINE_LEN   512
> 
> We don't need to expose them in util.h IMO, we can keep in util.c

OK, sure.

Thanks,
Michal

Re: [PATCH RFC net-next v2 3/3] vsock/test: Cover more CIDs in transport_uaf test

2025-06-04 Thread Michal Luczaj

On 6/4/25 11:37, Stefano Garzarella wrote:
> On Wed, May 28, 2025 at 10:44:43PM +0200, Michal Luczaj wrote:
>> +static bool test_stream_transport_uaf(int cid)
>> {
>>  int sockets[MAX_PORT_RETRIES];
>>  struct sockaddr_vm addr;
>> -int fd, i, alen;
>> +socklen_t alen;
>> +int fd, i, c;
>> +bool ret;
>> +
>> +/* Probe for a transport by attempting a local CID bind. Unavailable
>> + * transport (or more specifically: an unsupported transport/CID
>> + * combination) results in EADDRNOTAVAIL, other errnos are fatal.
>> + */
>> +fd = vsock_bind_try(cid, VMADDR_PORT_ANY, SOCK_STREAM);
>> +if (fd < 0) {
>> +if (errno != EADDRNOTAVAIL) {
>> +perror("Unexpected bind() errno");
>> +exit(EXIT_FAILURE);
>> +}
>>
>> -fd = vsock_bind(VMADDR_CID_ANY, VMADDR_PORT_ANY, SOCK_STREAM);
>> +return false;
>> +}
>>
>>  alen = sizeof(addr);
>>  if (getsockname(fd, (struct sockaddr *)&addr, &alen)) {
>> @@ -1735,38 +1746,73 @@ static void test_stream_transport_uaf_client(const 
>> struct test_opts *opts)
>>  exit(EXIT_FAILURE);
>>  }
>>
>> +/* Drain the autobind pool; see __vsock_bind_connectible(). */
>>  for (i = 0; i < MAX_PORT_RETRIES; ++i)
>> -sockets[i] = vsock_bind(VMADDR_CID_ANY, ++addr.svm_port,
>> -SOCK_STREAM);
>> +sockets[i] = vsock_bind(cid, ++addr.svm_port, SOCK_STREAM);
>>
>>  close(fd);
>> -fd = socket(AF_VSOCK, SOCK_STREAM, 0);
>> +fd = socket(AF_VSOCK, SOCK_STREAM | SOCK_NONBLOCK, 0);
> 
> Why we need this change?

It's for the (void)connect() below (not the connect() expecting early
EADDRNOTAVAIL, the second one). We're not connecting to anything anyway, so
there's no point entering the main `while (sk->sk_state != TCP_ESTABLISHED`
loop in vsock_connect().

>>  if (fd < 0) {
>>  perror("socket");
>>  exit(EXIT_FAILURE);
>>  }
>>
>> -if (!vsock_connect_fd(fd, addr.svm_cid, addr.svm_port)) {
>> -perror("Unexpected connect() #1 success");
>> +/* Assign transport, while failing to autobind. Autobind pool was
>> + * drained, so EADDRNOTAVAIL coming from __vsock_bind_connectible() is
>> + * expected.
>> + */
>> +addr.svm_port = VMADDR_PORT_ANY;

(Ugh, this line looks useless...)

>> +if (!connect(fd, (struct sockaddr *)&addr, alen)) {
>> +fprintf(stderr, "Unexpected connect() success\n");
>> +exit(EXIT_FAILURE);
>> +} else if (errno == ENODEV) {
>> +/* Handle unhappy vhost_vsock */
> 
> Why it's unhappy? No peer?

It's the case of test_stream_transport_uaf(VMADDR_CID_HOST) when only
vhost_vsock transport is loaded. Before we even reach (and fail)
vsock_auto_bind(), vsock_assign_transport() fails earlier: `new_transport`
gets set to `transport_g2h` (NULL) and then it's `if (!new_transport)
return -ENODEV`. So the idea was to swallow this errno and let the caller
report that nothing went through.

I guess we can narrow this down to `if (errno == ENODEV && cid ==
VMADDR_CID_HOST)`.

>> +ret = false;
>> +goto cleanup;
>> +} else if (errno != EADDRNOTAVAIL) {
>> +perror("Unexpected connect() errno");
>>  exit(EXIT_FAILURE);
>>  }
>>
>> -/* Vulnerable system may crash now. */
>> -if (!vsock_connect_fd(fd, VMADDR_CID_HOST, VMADDR_PORT_ANY)) {
>> -perror("Unexpected connect() #2 success");
>> -exit(EXIT_FAILURE);
>> +/* Reassign transport, triggering old transport release and
>> + * (potentially) unbinding of an unbound socket.
>> + *
>> + * Vulnerable system may crash now.
>> + */
>> +for (c = VMADDR_CID_HYPERVISOR; c <= VMADDR_CID_HOST + 1; ++c) {
>> +if (c != cid) {
>> +addr.svm_cid = c;
>> +(void)connect(fd, (struct sockaddr *)&addr, alen);
>> +}
>>  }
>>
>> +ret = true;
>> +cleanup:
>>  close(fd);
>>  while (i--)
>>  close(sockets[i]);
>>
>> -control_writeln("DONE");
>> +return ret;
>> }
>>
>> -static void test_stream_transport_uaf_server(const struct test_opts *opts)
>> +/* Test attempts to trigger a transport release for an unbound socket. This 
>> can
>> + * lead to a reference count mishandling.
>> + */
>> +static void test_stream_transport_uaf_client(const struct test_opts *opts)
>> {
>> -control_expectln("DONE");
>> +bool tested = false;
>> +int cid, tr;
>> +
>> +for (cid = VMADDR_CID_HYPERVISOR; cid <= VMADDR_CID_HOST + 1; ++cid)
>> +tested |= test_stream_transport_uaf(cid);
>> +
>> +tr = get_transports();
>> +if (!tr)
>> +fprintf(stderr, "No transports detected\n");
>> +else if (tr == TRANSPORT_VIRTIO)
>> +fprintf(stderr, "Setup unsupported: sole virtio transport\n");
>> +else if (!tested)
>> +

Re: [PATCH v4 1/7] cxl/region: Avoid null pointer dereference in is_cxl_region()

2025-06-04 Thread Nathan Fontenot

On 6/3/2025 6:49 PM, Dave Jiang wrote:
> 
> 
> On 6/3/25 3:19 PM, Smita Koralahalli wrote:
>> Add a NULL check in is_cxl_region() to prevent potential null pointer
>> dereference if a caller passes a NULL device. This change ensures the
>> function safely returns false instead of triggering undefined behavior
>> when dev is NULL.
> 
> Don't think this change is necessary. The code paths should not be hitting 
> any NULL region devices unless it's a programming error.

I originally added this to the patchset during some initial development to 
handle possible
NULL dev pointers when updating soft reserve resources, see 
cxl_region_softreserv_update()
in patch 5/7.

In the current form of the that routine it appears we shouldn't execute the 
while loop
if dev is NULL so this could get from the patch set.

-Nathan 

> 
>>
>> Co-developed-by: Nathan Fontenot 
>> Signed-off-by: Nathan Fontenot 
>> Co-developed-by: Terry Bowman 
>> Signed-off-by: Terry Bowman 
>> Signed-off-by: Smita Koralahalli 
>> ---
>>  drivers/cxl/core/region.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
>> index c3f4dc244df7..109b8a98c4c7 100644
>> --- a/drivers/cxl/core/region.c
>> +++ b/drivers/cxl/core/region.c
>> @@ -2333,7 +2333,7 @@ const struct device_type cxl_region_type = {
>>  
>>  bool is_cxl_region(struct device *dev)
>>  {
>> -return dev->type == &cxl_region_type;
>> +return dev && dev->type == &cxl_region_type;
>>  }
>>  EXPORT_SYMBOL_NS_GPL(is_cxl_region, "CXL");
>>  
>

Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes

2025-06-04 Thread Sebastian Ott


Hi Zenghui,

On Tue, 3 Jun 2025, Zenghui Yu wrote:

On 2025/5/27 22:24, Sebastian Ott wrote:

Some small fixes for arch_timer_edge_cases that I stumbled upon
while debugging failures for this selftest on ampere-one.

Changes since v1: modified patch 3 based on suggestions from Marc.

I've done some tests with this on various machines - seems to be all
good, however on ampere-one I now hit this in 10% of the runs:
 Test Assertion Failure 
  arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + 
(timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
  pid=166657 tid=166657 errno=4 - Interrupted system call
 1  0x00404db3: test_run at arch_timer_edge_cases.c:933
 2  0x00401f9f: main at arch_timer_edge_cases.c:1062
 3  0xaedd625b: ?? ??:0
 4  0xaedd633b: ?? ??:0
 5  0x004020af: _start at ??:?
  timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)

This is not new, it was just hidden behind the other failure. I'll
try to figure out what this is about (seems to be independent of
the wait time)..


Not sure if you have figured it out. I can easily reproduce it on my box
and I *guess* it is that we have some random XVAL values when we enable
the timer..


Yes, I think so, too.


test_reprogramming_timer()
{
local_irq_disable();
reset_timer_state(timer, DEF_CNT);


My first attempt was to also initialize cval here



/* Program the timer to DEF_CNT + delta_1_ms. */
set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);

[...]
}

set_tval_irq()
{
timer_set_ctl(timer, ctl);

// There is a window that we enable the timer with *random* XVAL
// values and we may get the unexpected interrupt.. And it's
// unlikely that KVM can be aware of TVAL's change (and
// re-evaluate the interrupt's pending state) before hitting the
// GUEST_ASSERT().

timer_set_tval(timer, tval_cycles);


Yes, I stumbled over this as well. I've always assumed that this order is
becauase of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0, the 
value returned is UNKNOWN." However re-reading that part today I realized

that this only concerns register reads.

Maybe somone on cc knows why it's in that order?

I'm currently testing this with the above swapped and it's looking good,
so far.


}

I'm not familiar with the test so I'm not 100% sure that this is the
root cause. But I hope this helps with your analysis ;-) .


It did, thanks!

Sebastian

Re: [PATCH v2 0/3] KVM: arm64: selftests: arch_timer_edge_cases fixes

2025-06-04 Thread Sebastian Ott


On Wed, 4 Jun 2025, Sebastian Ott wrote:

On Tue, 3 Jun 2025, Zenghui Yu wrote:

 On 2025/5/27 22:24, Sebastian Ott wrote:

 Some small fixes for arch_timer_edge_cases that I stumbled upon
 while debugging failures for this selftest on ampere-one.

 Changes since v1: modified patch 3 based on suggestions from Marc.

 I've done some tests with this on various machines - seems to be all
 good, however on ampere-one I now hit this in 10% of the runs:
  Test Assertion Failure 
   arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT +
   (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000)
   pid=166657 tid=166657 errno=4 - Interrupted system call
  1  0x00404db3: test_run at arch_timer_edge_cases.c:933
  2  0x00401f9f: main at arch_timer_edge_cases.c:1062
  3  0xaedd625b: ?? ??:0
  4  0xaedd633b: ?? ??:0
  5  0x004020af: _start at ??:?
   timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)

 This is not new, it was just hidden behind the other failure. I'll
 try to figure out what this is about (seems to be independent of
 the wait time)..


 Not sure if you have figured it out. I can easily reproduce it on my box
 and I *guess* it is that we have some random XVAL values when we enable
 the timer..


Yes, I think so, too.


 test_reprogramming_timer()
 {
  local_irq_disable();
  reset_timer_state(timer, DEF_CNT);


My first attempt was to also initialize cval here


Forgot to mention that I did this because my tests have shown
that the interrupt didn't only trigger early (like before the
reprogrammed delta) but instantly. This seemed to work but I think
the order in set_tval_irq() is the actual issue.






  /* Program the timer to DEF_CNT + delta_1_ms. */
  set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);

[...]
 }

 set_tval_irq()
 {
  timer_set_ctl(timer, ctl);

  // There is a window that we enable the timer with *random* XVAL
  // values and we may get the unexpected interrupt.. And it's
  // unlikely that KVM can be aware of TVAL's change (and
  // re-evaluate the interrupt's pending state) before hitting the
  // GUEST_ASSERT().

  timer_set_tval(timer, tval_cycles);


Yes, I stumbled over this as well. I've always assumed that this order is
becauase of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0, the 
value returned is UNKNOWN." However re-reading that part today I realized

that this only concerns register reads.

Maybe somone on cc knows why it's in that order?

I'm currently testing this with the above swapped and it's looking good,
so far.


 }

 I'm not familiar with the test so I'm not 100% sure that this is the
 root cause. But I hope this helps with your analysis ;-) .


It did, thanks!

Sebastian

Re: [PATCH RESEND] selftests/mm/run_vmtests.sh: skip hugevm tests if write_to_hugetlbfs is missing

2025-06-04 Thread Khaled Elnaggar

On 03/06/2025 5:12 am, Andrew Morton wrote:
> On Tue,  3 Jun 2025 02:22:32 +0300 Khaled Elnaggar 
>  wrote:
> 
>> The hugevm tests 'charge_reserved_hugetlb.sh' and 
>> 'hugetlb_reparenting_test.sh'
>> depend on the 'write_to_hugetlbfs' binary to simulate writes to hugetlbfs
>> while checking reservations asynchronously in the background.
>>
>> When this binary is missing (e.g., excluded from the build), these tests hang
>> for up to 180 seconds. During this time, run_vmtests.sh is eventually killed
>> due to timeout, aborting all subsequent tests.
>>
>> This patch skips these tests if the binary is not found, preventing delays
>> and ensuring that the test suite runs to completion.
> 
> OK, but why is write_to_hugetlbfs missing?  If we're in a situation
> where we _could_ run it then we _should_ run it!  The user wants to
> test stuff so we should test as much as we can.
> So I'm thinking that it would be preferable to make sure the dang thing
> is there?
> 

Totally agree, let me try to explain how I understand the issue.

The write_to_hugetlbfs binary is built from selftests/mm/Makefile,
lines 136–142. It is only compiled if ARCH matches one of the explicitly
listed 64-bit architectures:

```
   ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 
x86_64 s390))
   TEST_GEN_FILES += va_high_addr_switch
   ifneq ($(ARCH),riscv64)
   TEST_GEN_FILES += virtual_address_range
   endif
   TEST_GEN_FILES += write_to_hugetlbfs
   endif
```

However, when the MM selftests are run from the kernel’s top-level Makefile,
(root directory for example:

  make defconfig
  sudo make kselftest TARGETS=mm

ARCH is resolved as x86, even on an x86_64 machine (Debian in my case),
ARCH=x86 is the reason why the binary gets excluded from the build system.

As far as I understand, the top-level Makefile normalizes both
i.86 and x86_64 to x86 for ARCH variable:

Makfile: lines: 383,403
383:include $(srctree)/scripts/subarch.include
403:ARCH?= $(SUBARCH)

scripts/subarch.include: line: 7
7:SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \

This mapping probably makes sense for selecting the correct arch/ directory
(since we have arch/x86, not arch/x86_64) for top-level Makefile.

But the mm selftests Makefile expects ARCH to differentiate between x86 and 
x86_64
to decide whether to build certain binaries.
As a result, the 64-bit-only binaries like write_to_hugetlbfs are skipped
during build, yet still expected at runtime (by run_vmtests.sh).

That's why this whole issue of the missing executable does not happen when
ran from selftests/mm using something like:

  sudo make -C tools/testing/selftests/mm

Because then selftests/mm/Makefile arch detection rus as intended.

You're right — the proper fix is to improve how we propagate architecture
information from the top-level Makefile to selftests.
But since that's a larger change (and possibly beyond what I can safely
attempt at this point), this patch is just a targeted workaround to
avoid test stalls when the binary is missing.

I hope I haven't gone completely wrong with this analysis, happy to improve
or revise it further if needed.

[PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY

Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.

Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.

This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space. Same
is done for PROCMAP_QUERY ioctl which locks only the vma that fell into
the requested range instead of the entire address space. Previous version
of this patchset [1] tried to perform /proc/pid/maps reading under RCU,
however its implementation is quite complex and the results are worse than
the new version because it still relied on mmap_lock speculation which
retries if any part of the address space gets modified. New implementaion
is both simpler and results in less contention. Note that similar approach
would not work for /proc/pid/smaps reading as it also walks the page table
and that's not RCU-safe.

Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory.  At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test.  The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.455
0.011 0.008 0.472
0.011 0.008 0.535
0.011 0.009 0.545
...
0.011 0.014 2.875
0.011 0.014 2.913
0.011 0.014 3.007
0.011 0.015 3.018

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.036
0.006 0.005 0.039
0.006 0.005 0.039
0.006 0.005 0.039
...
0.006 0.006 0.403
0.006 0.006 0.474
0.006 0.006 0.479
0.006 0.006 0.498

The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's event more important
now to test for unexpected inconsistencies.

[1] https://lore.kernel.org/all/20250418174959.1431962-1-sur...@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test

Suren Baghdasaryan (7):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
  selftests/proc: add verbose more for tests to facilitate debugging
  mm/maps: read proc/pid/maps under per-vma lock
  mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h |   6 +
 fs/proc/task_mmu.c | 233 +-
 tools/testing/selftests/proc/proc-pid-vm.c | 793 ++

[PATCH v4 1/7] selftests/proc: add /proc/pid/maps tearing from vma split test

The /proc/pid/maps file is generated page by page, with the mmap_lock
released between pages.  This can lead to inconsistent reads if the
underlying vmas are concurrently modified. For instance, if a vma split
or merge occurs at a page boundary while /proc/pid/maps is being read,
the same vma might be seen twice: once before and once after the change.
This duplication is considered acceptable for userspace handling.
However, observing a "hole" where a vma should be (e.g., due to a vma
being replaced and the space temporarily being empty) is unacceptable.

Implement a test that:
1. Forks a child process which continuously modifies its address space,
specifically targeting a vma at the boundary between two pages.
2. The parent process repeatedly reads the child's /proc/pid/maps.
3. The parent process checks the last vma of the first page and
the first vma of the second page for consistency, looking for the
effects of vma splits or merges.

The test duration is configurable via the -d command-line parameter
in seconds to increase the likelihood of catching the race condition.
The default test duration is 5 seconds.

Example Command: proc-pid-vm -d 10

Signed-off-by: Suren Baghdasaryan 
---
 tools/testing/selftests/proc/proc-pid-vm.c | 430 -
 1 file changed, 429 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c 
b/tools/testing/selftests/proc/proc-pid-vm.c
index d04685771952..6e3f06376a1f 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -27,6 +27,7 @@
 #undef NDEBUG
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -34,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -70,6 +72,8 @@ static void make_private_tmp(void)
}
 }
 
+static unsigned long test_duration_sec = 5UL;
+static int page_size;
 static pid_t pid = -1;
 static void ate(void)
 {
@@ -281,11 +285,431 @@ static void vsyscall(void)
}
 }
 
-int main(void)
+/* /proc/pid/maps parsing routines */
+struct page_content {
+   char *data;
+   ssize_t size;
+};
+
+#define LINE_MAX_SIZE  256
+
+struct line_content {
+   char text[LINE_MAX_SIZE];
+   unsigned long start_addr;
+   unsigned long end_addr;
+};
+
+static void read_two_pages(int maps_fd, struct page_content *page1,
+  struct page_content *page2)
+{
+   ssize_t  bytes_read;
+
+   assert(lseek(maps_fd, 0, SEEK_SET) >= 0);
+   bytes_read = read(maps_fd, page1->data, page_size);
+   assert(bytes_read > 0 && bytes_read < page_size);
+   page1->size = bytes_read;
+
+   bytes_read = read(maps_fd, page2->data, page_size);
+   assert(bytes_read > 0 && bytes_read < page_size);
+   page2->size = bytes_read;
+}
+
+static void copy_first_line(struct page_content *page, char *first_line)
+{
+   char *pos = strchr(page->data, '\n');
+
+   strncpy(first_line, page->data, pos - page->data);
+   first_line[pos - page->data] = '\0';
+}
+
+static void copy_last_line(struct page_content *page, char *last_line)
+{
+   /* Get the last line in the first page */
+   const char *end = page->data + page->size - 1;
+   /* skip last newline */
+   const char *pos = end - 1;
+
+   /* search previous newline */
+   while (pos[-1] != '\n')
+   pos--;
+   strncpy(last_line, pos, end - pos);
+   last_line[end - pos] = '\0';
+}
+
+/* Read the last line of the first page and the first line of the second page 
*/
+static void read_boundary_lines(int maps_fd, struct page_content *page1,
+   struct page_content *page2,
+   struct line_content *last_line,
+   struct line_content *first_line)
+{
+   read_two_pages(maps_fd, page1, page2);
+
+   copy_last_line(page1, last_line->text);
+   copy_first_line(page2, first_line->text);
+
+   assert(sscanf(last_line->text, "%lx-%lx", &last_line->start_addr,
+ &last_line->end_addr) == 2);
+   assert(sscanf(first_line->text, "%lx-%lx", &first_line->start_addr,
+ &first_line->end_addr) == 2);
+}
+
+/* Thread synchronization routines */
+enum test_state {
+   INIT,
+   CHILD_READY,
+   PARENT_READY,
+   SETUP_READY,
+   SETUP_MODIFY_MAPS,
+   SETUP_MAPS_MODIFIED,
+   SETUP_RESTORE_MAPS,
+   SETUP_MAPS_RESTORED,
+   TEST_READY,
+   TEST_DONE,
+};
+
+struct vma_modifier_info;
+
+typedef void (*vma_modifier_op)(const struct vma_modifier_info *mod_info);
+typedef void (*vma_mod_result_check_op)(struct line_content *mod_last_line,
+   struct line_content *mod_first_line,
+   struct line_content *restored_last_line,
+   struct line_content 
*restored_first_line);
+
+struct vma_modifier_info {
+

[PATCH v4 2/7] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing

Test that /proc/pid/maps does not report unexpected holes in the address
space when a vma at the edge of the page is being concurrently remapped.
This remapping results in the vma shrinking and expanding from  under the
reader. We should always see either shrunk or expanded (original) version
of the vma.

Signed-off-by: Suren Baghdasaryan 
---
 tools/testing/selftests/proc/proc-pid-vm.c | 83 ++
 1 file changed, 83 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c 
b/tools/testing/selftests/proc/proc-pid-vm.c
index 6e3f06376a1f..39842e4ec45f 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -583,6 +583,86 @@ static void test_maps_tearing_from_split(int maps_fd,
signal_state(mod_info, TEST_DONE);
 }
 
+static inline void shrink_vma(const struct vma_modifier_info *mod_info)
+{
+   assert(mremap(mod_info->addr, page_size * 3, page_size, 0) != 
MAP_FAILED);
+}
+
+static inline void expand_vma(const struct vma_modifier_info *mod_info)
+{
+   assert(mremap(mod_info->addr, page_size, page_size * 3, 0) != 
MAP_FAILED);
+}
+
+static inline void check_shrink_result(struct line_content *mod_last_line,
+  struct line_content *mod_first_line,
+  struct line_content *restored_last_line,
+  struct line_content *restored_first_line)
+{
+   /* Make sure only the last vma of the first page is changing */
+   assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+   assert(strcmp(mod_first_line->text, restored_first_line->text) == 0);
+}
+
+static void test_maps_tearing_from_resize(int maps_fd,
+ struct vma_modifier_info *mod_info,
+ struct page_content *page1,
+ struct page_content *page2,
+ struct line_content *last_line,
+ struct line_content *first_line)
+{
+   struct line_content shrunk_last_line;
+   struct line_content shrunk_first_line;
+   struct line_content restored_last_line;
+   struct line_content restored_first_line;
+
+   wait_for_state(mod_info, SETUP_READY);
+
+   /* re-read the file to avoid using stale data from previous test */
+   read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+   mod_info->vma_modify = shrink_vma;
+   mod_info->vma_restore = expand_vma;
+   mod_info->vma_mod_check = check_shrink_result;
+
+   capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, 
first_line,
+   &shrunk_last_line, &shrunk_first_line,
+   &restored_last_line, &restored_first_line);
+
+   /* Now start concurrent modifications for test_duration_sec */
+   signal_state(mod_info, TEST_READY);
+
+   struct line_content new_last_line;
+   struct line_content new_first_line;
+   struct timespec start_ts, end_ts;
+
+   clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+   do {
+   read_boundary_lines(maps_fd, page1, page2, &new_last_line, 
&new_first_line);
+
+   /* Check if we read vmas after shrinking it */
+   if (!strcmp(new_last_line.text, shrunk_last_line.text)) {
+   /*
+* The vmas should be consistent with shrunk results,
+* however if the vma was concurrently restored, it
+* can be reported twice (first as shrunk one, then
+* as restored one) because we found it as the next vma
+* again. In that case new first line will be the same
+* as the last restored line.
+*/
+   assert(!strcmp(new_first_line.text, 
shrunk_first_line.text) ||
+  !strcmp(new_first_line.text, 
restored_last_line.text));
+   } else {
+   /* The vmas should be consistent with the 
original/resored state */
+   assert(!strcmp(new_last_line.text, 
restored_last_line.text) &&
+  !strcmp(new_first_line.text, 
restored_first_line.text));
+   }
+   clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+   } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+   /* Signal the modifyer thread to stop and wait until it exits */
+   signal_state(mod_info, TEST_DONE);
+}
+
 static int test_maps_tearing(void)
 {
struct vma_modifier_info *mod_info;
@@ -674,6 +754,9 @@ static int test_maps_tearing(void)
test_maps_tearing_from_split(maps_fd, mod_info, &page1, &page2,
 &last_line, &first_line);
 
+   test_maps_tearing_from_resize(maps_f

[PATCH v4 3/7] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping

Test that /proc/pid/maps does not report unexpected holes in the address
space when we concurrently remap a part of a vma into the middle of
another vma. This remapping results in the destination vma being split
into three parts and the part in the middle being patched back from,
all done concurrently from under the reader. We should always see either
original vma or the split one with no holes.

Signed-off-by: Suren Baghdasaryan 
---
 tools/testing/selftests/proc/proc-pid-vm.c | 92 ++
 1 file changed, 92 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c 
b/tools/testing/selftests/proc/proc-pid-vm.c
index 39842e4ec45f..1aef2db7e893 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -663,6 +663,95 @@ static void test_maps_tearing_from_resize(int maps_fd,
signal_state(mod_info, TEST_DONE);
 }
 
+static inline void remap_vma(const struct vma_modifier_info *mod_info)
+{
+   /*
+* Remap the last page of the next vma into the middle of the vma.
+* This splits the current vma and the first and middle parts (the
+* parts at lower addresses) become the last vma objserved in the
+* first page and the first vma observed in the last page.
+*/
+   assert(mremap(mod_info->next_addr + page_size * 2, page_size,
+ page_size, MREMAP_FIXED | MREMAP_MAYMOVE | 
MREMAP_DONTUNMAP,
+ mod_info->addr + page_size) != MAP_FAILED);
+}
+
+static inline void patch_vma(const struct vma_modifier_info *mod_info)
+{
+   assert(!mprotect(mod_info->addr + page_size, page_size,
+mod_info->prot));
+}
+
+static inline void check_remap_result(struct line_content *mod_last_line,
+ struct line_content *mod_first_line,
+ struct line_content *restored_last_line,
+ struct line_content *restored_first_line)
+{
+   /* Make sure vmas at the boundaries are changing */
+   assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+   assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
+}
+
+static void test_maps_tearing_from_remap(int maps_fd,
+   struct vma_modifier_info *mod_info,
+   struct page_content *page1,
+   struct page_content *page2,
+   struct line_content *last_line,
+   struct line_content *first_line)
+{
+   struct line_content remapped_last_line;
+   struct line_content remapped_first_line;
+   struct line_content restored_last_line;
+   struct line_content restored_first_line;
+
+   wait_for_state(mod_info, SETUP_READY);
+
+   /* re-read the file to avoid using stale data from previous test */
+   read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+   mod_info->vma_modify = remap_vma;
+   mod_info->vma_restore = patch_vma;
+   mod_info->vma_mod_check = check_remap_result;
+
+   capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, 
first_line,
+   &remapped_last_line, &remapped_first_line,
+   &restored_last_line, &restored_first_line);
+
+   /* Now start concurrent modifications for test_duration_sec */
+   signal_state(mod_info, TEST_READY);
+
+   struct line_content new_last_line;
+   struct line_content new_first_line;
+   struct timespec start_ts, end_ts;
+
+   clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+   do {
+   read_boundary_lines(maps_fd, page1, page2, &new_last_line, 
&new_first_line);
+
+   /* Check if we read vmas after remapping it */
+   if (!strcmp(new_last_line.text, remapped_last_line.text)) {
+   /*
+* The vmas should be consistent with remap results,
+* however if the vma was concurrently restored, it
+* can be reported twice (first as split one, then
+* as restored one) because we found it as the next vma
+* again. In that case new first line will be the same
+* as the last restored line.
+*/
+   assert(!strcmp(new_first_line.text, 
remapped_first_line.text) ||
+  !strcmp(new_first_line.text, 
restored_last_line.text));
+   } else {
+   /* The vmas should be consistent with the 
original/resored state */
+   assert(!strcmp(new_last_line.text, 
restored_last_line.text) &&
+  !strcmp(new_first_line.text, 
restored_first_line.text));
+   }
+   clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+

[PATCH v4 4/7] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified

Extend /proc/pid/maps tearing test to verify PROCMAP_QUERY ioctl operation
correctness while the vma is being concurrently modified.

Signed-off-by: Suren Baghdasaryan 
---
 tools/testing/selftests/proc/proc-pid-vm.c | 60 ++
 1 file changed, 60 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c 
b/tools/testing/selftests/proc/proc-pid-vm.c
index 1aef2db7e893..b582f40851fb 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -486,6 +486,21 @@ static void capture_mod_pattern(int maps_fd,
assert(strcmp(restored_first_line->text, first_line->text) == 0);
 }
 
+static void query_addr_at(int maps_fd, void *addr,
+ unsigned long *vma_start, unsigned long *vma_end)
+{
+   struct procmap_query q;
+
+   memset(&q, 0, sizeof(q));
+   q.size = sizeof(q);
+   /* Find the VMA at the split address */
+   q.query_addr = (unsigned long long)addr;
+   q.query_flags = 0;
+   assert(!ioctl(maps_fd, PROCMAP_QUERY, &q));
+   *vma_start = q.vma_start;
+   *vma_end = q.vma_end;
+}
+
 static inline void split_vma(const struct vma_modifier_info *mod_info)
 {
assert(mmap(mod_info->addr, page_size, mod_info->prot | PROT_EXEC,
@@ -546,6 +561,8 @@ static void test_maps_tearing_from_split(int maps_fd,
do {
bool last_line_changed;
bool first_line_changed;
+   unsigned long vma_start;
+   unsigned long vma_end;
 
read_boundary_lines(maps_fd, page1, page2, &new_last_line, 
&new_first_line);
 
@@ -576,6 +593,19 @@ static void test_maps_tearing_from_split(int maps_fd,
first_line_changed = strcmp(new_first_line.text, 
first_line->text) != 0;
assert(last_line_changed == first_line_changed);
 
+   /* Check if PROCMAP_QUERY ioclt() finds the right VMA */
+   query_addr_at(maps_fd, mod_info->addr + page_size,
+ &vma_start, &vma_end);
+   /*
+* The vma at the split address can be either the same as
+* original one (if read before the split) or the same as the
+* first line in the second page (if read after the split).
+*/
+   assert((vma_start == last_line->start_addr &&
+   vma_end == last_line->end_addr) ||
+  (vma_start == split_first_line.start_addr &&
+   vma_end == split_first_line.end_addr));
+
clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
 
@@ -637,6 +667,9 @@ static void test_maps_tearing_from_resize(int maps_fd,
 
clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
do {
+   unsigned long vma_start;
+   unsigned long vma_end;
+
read_boundary_lines(maps_fd, page1, page2, &new_last_line, 
&new_first_line);
 
/* Check if we read vmas after shrinking it */
@@ -656,6 +689,17 @@ static void test_maps_tearing_from_resize(int maps_fd,
assert(!strcmp(new_last_line.text, 
restored_last_line.text) &&
   !strcmp(new_first_line.text, 
restored_first_line.text));
}
+
+   /* Check if PROCMAP_QUERY ioclt() finds the right VMA */
+   query_addr_at(maps_fd, mod_info->addr, &vma_start, &vma_end);
+   /*
+* The vma should stay at the same address and have either the
+* original size of 3 pages or 1 page if read after shrinking.
+*/
+   assert(vma_start == last_line->start_addr &&
+  (vma_end - vma_start == page_size * 3 ||
+   vma_end - vma_start == page_size));
+
clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
 
@@ -726,6 +770,9 @@ static void test_maps_tearing_from_remap(int maps_fd,
 
clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
do {
+   unsigned long vma_start;
+   unsigned long vma_end;
+
read_boundary_lines(maps_fd, page1, page2, &new_last_line, 
&new_first_line);
 
/* Check if we read vmas after remapping it */
@@ -745,6 +792,19 @@ static void test_maps_tearing_from_remap(int maps_fd,
assert(!strcmp(new_last_line.text, 
restored_last_line.text) &&
   !strcmp(new_first_line.text, 
restored_first_line.text));
}
+
+   /* Check if PROCMAP_QUERY ioclt() finds the right VMA */
+   query_addr_at(maps_fd, mod_info->addr + page_size, &vma_start, 
&vma_end);
+   /*
+* The vma should either stay at the same address and have the
+* origina

[PATCH v4 5/7] selftests/proc: add verbose more for tests to facilitate debugging

Add verbose mode to the proc tests to print debugging information.
Usage: proc-pid-vm -v

Signed-off-by: Suren Baghdasaryan 
---
 tools/testing/selftests/proc/proc-pid-vm.c | 154 +++--
 1 file changed, 141 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c 
b/tools/testing/selftests/proc/proc-pid-vm.c
index b582f40851fb..97017f48cd70 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -73,6 +73,7 @@ static void make_private_tmp(void)
 }
 
 static unsigned long test_duration_sec = 5UL;
+static bool verbose;
 static int page_size;
 static pid_t pid = -1;
 static void ate(void)
@@ -452,6 +453,99 @@ static void stop_vma_modifier(struct vma_modifier_info 
*mod_info)
signal_state(mod_info, SETUP_MODIFY_MAPS);
 }
 
+static void print_first_lines(char *text, int nr)
+{
+   const char *end = text;
+
+   while (nr && (end = strchr(end, '\n')) != NULL) {
+   nr--;
+   end++;
+   }
+
+   if (end) {
+   int offs = end - text;
+
+   text[offs] = '\0';
+   printf(text);
+   text[offs] = '\n';
+   printf("\n");
+   } else {
+   printf(text);
+   }
+}
+
+static void print_last_lines(char *text, int nr)
+{
+   const char *start = text + strlen(text);
+
+   nr++; /* to ignore the last newline */
+   while (nr) {
+   while (start > text && *start != '\n')
+   start--;
+   nr--;
+   start--;
+   }
+   printf(start);
+}
+
+static void print_boundaries(const char *title,
+struct page_content *page1,
+struct page_content *page2)
+{
+   if (!verbose)
+   return;
+
+   printf("%s", title);
+   /* Print 3 boundary lines from each page */
+   print_last_lines(page1->data, 3);
+   printf("-page boundary-\n");
+   print_first_lines(page2->data, 3);
+}
+
+static bool print_boundaries_on(bool condition, const char *title,
+   struct page_content *page1,
+   struct page_content *page2)
+{
+   if (verbose && condition)
+   print_boundaries(title, page1, page2);
+
+   return condition;
+}
+
+static void report_test_start(const char *name)
+{
+   if (verbose)
+   printf(" %s \n", name);
+}
+
+static struct timespec print_ts;
+
+static void start_test_loop(struct timespec *ts)
+{
+   if (verbose)
+   print_ts.tv_sec = ts->tv_sec;
+}
+
+static void end_test_iteration(struct timespec *ts)
+{
+   if (!verbose)
+   return;
+
+   /* Update every second */
+   if (print_ts.tv_sec == ts->tv_sec)
+   return;
+
+   printf(".");
+   fflush(stdout);
+   print_ts.tv_sec = ts->tv_sec;
+}
+
+static void end_test_loop(void)
+{
+   if (verbose)
+   printf("\n");
+}
+
 static void capture_mod_pattern(int maps_fd,
struct vma_modifier_info *mod_info,
struct page_content *page1,
@@ -463,18 +557,24 @@ static void capture_mod_pattern(int maps_fd,
struct line_content *restored_last_line,
struct line_content *restored_first_line)
 {
+   print_boundaries("Before modification", page1, page2);
+
signal_state(mod_info, SETUP_MODIFY_MAPS);
wait_for_state(mod_info, SETUP_MAPS_MODIFIED);
 
/* Copy last line of the first page and first line of the last page */
read_boundary_lines(maps_fd, page1, page2, mod_last_line, 
mod_first_line);
 
+   print_boundaries("After modification", page1, page2);
+
signal_state(mod_info, SETUP_RESTORE_MAPS);
wait_for_state(mod_info, SETUP_MAPS_RESTORED);
 
/* Copy last line of the first page and first line of the last page */
read_boundary_lines(maps_fd, page1, page2, restored_last_line, 
restored_first_line);
 
+   print_boundaries("After restore", page1, page2);
+
mod_info->vma_mod_check(mod_last_line, mod_first_line,
restored_last_line, restored_first_line);
 
@@ -546,6 +646,7 @@ static void test_maps_tearing_from_split(int maps_fd,
mod_info->vma_restore = merge_vma;
mod_info->vma_mod_check = check_split_result;
 
+   report_test_start("Tearing from split");
capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, 
first_line,
&split_last_line, &split_first_line,
&restored_last_line, &restored_first_line);
@@ -558,6 +659,7 @@ static void test_maps_tearing_from_split(int maps_fd,
struct timespec start_ts, end_ts;
 
clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+   start_test_loop(&sta

[PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock

With maple_tree supporting vma tree traversal under RCU and per-vma
locks, /proc/pid/maps can be read while holding individual vma locks
instead of locking the entire address space.
Completely lockless approach would be quite complex with the main issue
being get_vma_name() using callbacks which might not work correctly with
a stable vma copy, requiring original (unstable) vma.
When per-vma lock acquisition fails, we take the mmap_lock for reading,
lock the vma, release the mmap_lock and continue. This guarantees the
reader to make forward progress even during lock contention. This will
interfere with the writer but for a very short time while we are
acquiring the per-vma lock and only when there was contention on the
vma reader is interested in.
One case requiring special handling is when vma changes between the
time it was found and the time it got locked. A problematic case would
be if vma got shrunk so that it's start moved higher in the address
space and a new vma was installed at the beginning:

reader found:   |VMA A|
VMA is modified:|-VMA B-|VMA A|
reader locks modified VMA A
reader reports VMA A:   |  gap  |VMA A|

This would result in reporting a gap in the address space that does not
exist. To prevent this we retry the lookup after locking the vma, however
we do that only when we identify a gap and detect that the address space
was changed after we found the vma.
This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such
as monitoring/data collection services) from blocking address space
updates. Note that this change has a userspace visible disadvantage:
it allows for sub-page data tearing as opposed to the previous mechanism
where data tearing could happen only between pages of generated output
data. Since current userspace considers data tearing between pages to be
acceptable, we assume is will be able to handle sub-page data tearing
as well.

Signed-off-by: Suren Baghdasaryan 
---
 fs/proc/internal.h |   6 ++
 fs/proc/task_mmu.c | 177 +++--
 2 files changed, 175 insertions(+), 8 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 96122e91c645..3728c9012687 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -379,6 +379,12 @@ struct proc_maps_private {
struct task_struct *task;
struct mm_struct *mm;
struct vma_iterator iter;
+   loff_t last_pos;
+#ifdef CONFIG_PER_VMA_LOCK
+   bool mmap_locked;
+   unsigned int mm_wr_seq;
+   struct vm_area_struct *locked_vma;
+#endif
 #ifdef CONFIG_NUMA
struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 27972c0749e7..36d883c4f394 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -127,13 +127,172 @@ static void release_task_mempolicy(struct 
proc_maps_private *priv)
 }
 #endif
 
-static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
-   loff_t *ppos)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
+ struct vm_area_struct *vma,
+ unsigned long last_pos,
+ bool mm_unstable)
+{
+   vma = vma_start_read(priv->mm, vma);
+   if (IS_ERR_OR_NULL(vma))
+   return NULL;
+
+   /* Check if the vma we locked is the right one. */
+   if (unlikely(vma->vm_mm != priv->mm))
+   goto err;
+
+   /* vma should not be ahead of the last search position. */
+   if (unlikely(last_pos >= vma->vm_end))
+   goto err;
+
+   /*
+* vma ahead of last search position is possible but we need to
+* verify that it was not shrunk after we found it, and another
+* vma has not been installed ahead of it. Otherwise we might
+* observe a gap that should not be there.
+*/
+   if (mm_unstable && last_pos < vma->vm_start) {
+   /* Verify only if the address space changed since vma lookup. */
+   if ((priv->mm_wr_seq & 1) ||
+   mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
+   vma_iter_init(&priv->iter, priv->mm, last_pos);
+   if (vma != vma_next(&priv->iter))
+   goto err;
+   }
+   }
+
+   priv->locked_vma = vma;
+
+   return vma;
+err:
+   vma_end_read(vma);
+   return NULL;
+}
+
+
+static void unlock_vma(struct proc_maps_private *priv)
+{
+   if (priv->locked_vma) {
+   vma_end_read(priv->locked_vma);
+   priv->locked_vma = NULL;
+   }
+}
+
+static const struct seq_operations proc_pid_maps_op;
+
+static inline bool lock_content(struct seq_file *m,
+

[PATCH v4 7/7] mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks

Utilize per-vma locks to stabilize vma after lookup without taking
mmap_lock during PROCMAP_QUERY ioctl execution. While we might take
mmap_lock for reading during contention, we do that momentarily only
to lock the vma.
This change is designed to reduce mmap_lock contention and prevent
PROCMAP_QUERY ioctl calls from blocking address space updates.

Signed-off-by: Suren Baghdasaryan 
---
 fs/proc/task_mmu.c | 56 --
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 36d883c4f394..93ba35a84975 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -550,28 +550,60 @@ static int pid_maps_open(struct inode *inode, struct file 
*file)
PROCMAP_QUERY_VMA_FLAGS \
 )
 
-static int query_vma_setup(struct mm_struct *mm)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static int query_vma_setup(struct proc_maps_private *priv)
 {
-   return mmap_read_lock_killable(mm);
+   rcu_read_lock();
+   priv->locked_vma = NULL;
+   priv->mmap_locked = false;
+
+   return 0;
 }
 
-static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct 
*vma)
+static void query_vma_teardown(struct proc_maps_private *priv)
 {
-   mmap_read_unlock(mm);
+   unlock_vma(priv);
+   rcu_read_unlock();
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private 
*priv,
+unsigned long addr)
+{
+   vma_iter_init(&priv->iter, priv->mm, addr);
+   return get_next_vma(priv, addr);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static int query_vma_setup(struct proc_maps_private *priv)
+{
+   return mmap_read_lock_killable(priv->mm);
+}
+
+static void query_vma_teardown(struct proc_maps_private *priv)
+{
+   mmap_read_unlock(priv->mm);
 }
 
-static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, 
unsigned long addr)
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private 
*priv,
+unsigned long addr)
 {
-   return find_vma(mm, addr);
+   return find_vma(priv->mm, addr);
 }
 
-static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
+#endif  /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *query_matching_vma(struct proc_maps_private 
*priv,
 unsigned long addr, u32 flags)
 {
struct vm_area_struct *vma;
 
 next_vma:
-   vma = query_vma_find_by_addr(mm, addr);
+   vma = query_vma_find_by_addr(priv, addr);
+   if (IS_ERR(vma))
+   return vma;
+
if (!vma)
goto no_vma;
 
@@ -647,13 +679,13 @@ static int do_procmap_query(struct proc_maps_private 
*priv, void __user *uarg)
if (!mm || !mmget_not_zero(mm))
return -ESRCH;
 
-   err = query_vma_setup(mm);
+   err = query_vma_setup(priv);
if (err) {
mmput(mm);
return err;
}
 
-   vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
+   vma = query_matching_vma(priv, karg.query_addr, karg.query_flags);
if (IS_ERR(vma)) {
err = PTR_ERR(vma);
vma = NULL;
@@ -738,7 +770,7 @@ static int do_procmap_query(struct proc_maps_private *priv, 
void __user *uarg)
}
 
/* unlock vma or mmap_lock, and put mm_struct before copying data to 
user */
-   query_vma_teardown(mm, vma);
+   query_vma_teardown(priv);
mmput(mm);
 
if (karg.vma_name_size && 
copy_to_user(u64_to_user_ptr(karg.vma_name_addr),
@@ -758,7 +790,7 @@ static int do_procmap_query(struct proc_maps_private *priv, 
void __user *uarg)
return 0;
 
 out:
-   query_vma_teardown(mm, vma);
+   query_vma_teardown(priv);
mmput(mm);
kfree(name_buf);
return err;
-- 
2.49.0.1266.g31b7d2e469-goog

Re: [PATCH v2 35/62] objtool: Refactor add_jump_destinations()

On Fri, May 23, 2025 at 07:46:19AM -0400, Joe Lawrence wrote:
> > -   } else if (reloc->sym->sec->idx) {
> > -   dest_sec = reloc->sym->sec;
> > -   dest_off = reloc->sym->sym.st_value +
> > -  arch_dest_reloc_offset(reloc_addend(reloc));
> 
> Way back in ("[PATCH v2 18/62] objtool: Fix x86 addend calculation"),
> arch_dest_reloc_offset() was replaced with arch_insn_adjusted_addend(),
> so I think that patch missed this callsite and breaks bisectability.

Fixed, thanks.

-- 
Josh

Re: [PATCH v2 18/62] objtool: Fix x86 addend calculation

On Mon, May 26, 2025 at 12:23:15PM +0200, Peter Zijlstra wrote:
> On Fri, May 09, 2025 at 01:16:42PM -0700, Josh Poimboeuf wrote:
> > On x86, arch_dest_reloc_offset() hardcodes the addend adjustment to
> > four, but the actual adjustment depends on the relocation type.  Fix
> > that.
> 
> > +s64 arch_insn_adjusted_addend(struct instruction *insn, struct reloc 
> > *reloc)
> >  {
> > -   return addend + 4;
> > +   s64 addend = reloc_addend(reloc);
> > +
> > +   switch (reloc_type(reloc)) {
> > +   case R_X86_64_PC32:
> > +   case R_X86_64_PLT32:
> > +   addend += insn->offset + insn->len - reloc_offset(reloc);
> > +   break;
> > +   default:
> > +   break;
> > +   }
> > +
> > +   return addend;
> >  }
> 
> Should this not be something like:
> 
> s64 arch_insn_adjusted_addend(struct instruction *insn, struct reloc *reloc)
> {
>   s64 addend = reloc_addend(reloc);
> 
>   if (arch_pc_relative_reloc(reloc))
>   addend += insn->offset + insn->len - reloc_offset(reloc);
> 
>   return addend;
> }
> 
> instead?
> 
> AFAIU arch_pc_relative_reloc() is the exact same set of relocations.

Yeah that's better, thanks.

-- 
Josh

Re: [PATCH v2 28/62] objtool: Fix weak symbol hole detection for .cold functions

On Mon, May 26, 2025 at 12:38:22PM +0200, Peter Zijlstra wrote:
> On Fri, May 09, 2025 at 01:16:52PM -0700, Josh Poimboeuf wrote:
> > When ignore_unreachable_insn() looks for weak function holes which jump
> > to their .cold functions, it assumes the parent function comes before
> > the corresponding .cold function in the symbol table.  That's not
> > necessarily the case with -ffunction-sections.
> > 
> > Mark all the holes beforehand (including .cold functions) so the
> > ordering of the discovery doesn't matter.
> 
> One of the things I have a 'todo' entry on, is rewriting all sections
> that reference any one of these instructions.
> 
> That is, things like fentry, alternatives, retpoline, static_call,
> jump_label.  Everything that can cause runtime code patching.
> 
> Once we are sure none of those sections will contain references to this
> dead code, we can go and wipe the actual code. Perhaps fill it with a
> UD1 instruction with some identifying immediate.

Yeah, that would be nice.

-- 
Josh

Re: [PATCH v2 29/62] objtool: Mark prefix functions

On Mon, May 26, 2025 at 12:43:55PM +0200, Peter Zijlstra wrote:
> > +++ b/tools/objtool/elf.c
> > @@ -442,6 +442,11 @@ static void elf_add_symbol(struct elf *elf, struct 
> > symbol *sym)
> > elf_hash_add(symbol, &sym->hash, sym->idx);
> > elf_hash_add(symbol_name, &sym->name_hash, str_hash(sym->name));
> >  
> > +   if (is_func_sym(sym) && sym->len == 16 &&
> 
> Where did that 'sym->len == 16' thing come from? I mean, they are, but
> the old code didn't assert that.
> 
> I would rather objtool issue a warn if not 16, but still consider these
> as prefix.

Ok:

diff --git a/tools/objtool/elf.c b/tools/objtool/elf.c
index 9a1fc0392b7f..2953aa8d2b81 100644
--- a/tools/objtool/elf.c
+++ b/tools/objtool/elf.c
@@ -442,10 +442,12 @@ static void elf_add_symbol(struct elf *elf, struct symbol 
*sym)
elf_hash_add(symbol, &sym->hash, sym->idx);
elf_hash_add(symbol_name, &sym->name_hash, str_hash(sym->name));
 
-   if (is_func_sym(sym) && sym->len == 16 &&
-   (strstarts(sym->name, "__pfx") || strstarts(sym->name, "__cfi_")))
+   if (is_func_sym(sym) &&
+   (strstarts(sym->name, "__pfx_") || strstarts(sym->name, "__cfi_"))) 
{
+   if (sym->len != 16)
+   WARN("%s size %d != 16", sym->name, sym->len);
sym->prefix = 1;
-
+   }
 
if (is_func_sym(sym) && strstr(sym->name, ".cold"))
sym->cold = 1;

Re: [PATCH v2 32/62] objtool: Suppress section skipping warnings with --dryrun

On Wed, May 28, 2025 at 12:34:53PM +0200, Peter Zijlstra wrote:
> On Mon, May 26, 2025 at 12:52:40PM +0200, Peter Zijlstra wrote:
> > On Fri, May 09, 2025 at 01:16:56PM -0700, Josh Poimboeuf wrote:
> > > It's common to use --dryrun on binaries that have already been
> > > processed.  Don't print the section skipping warnings in that case.
> > 
> > Ah, I rather like this warning, it gives me an easy check to see if the
> > file has already been processed.
> > 
> > I typically do a OBJTOOL_ARGS="--backup" build and run dryrun debug
> > sessions against those .orig files.
> 
> Turns out, you already broke this.. :-(
> 
> I'm now having a case where objtool fails on vmlinux.o and make happily
> deletes vmlinux.o and I'm left empty handed.
> 
> Let me go resurrect --backup

Yeah, as I just mentioned in that other email, --verbose should give you
what you need.  It also prints the cmdline args, which is nice.

But also, feel free to resurrect --backup, or you can yell at me to do
it as the backup code changed a bit.

-- 
Josh

Re: [PATCH v2 42/62] kbuild,x86: Fix module permissions for __jump_table and __bug_table

On Mon, May 26, 2025 at 01:06:34PM +0200, Peter Zijlstra wrote:
> On Fri, May 09, 2025 at 01:17:06PM -0700, Josh Poimboeuf wrote:
> > An upcoming patch will add the SHF_MERGE flag to x86 __jump_table and
> > __bug_table so their entry sizes can be defined in inline asm.
> > 
> > However, those sections have SHF_WRITE, which the Clang linker (lld)
> > explicitly forbids combining with SHF_MERGE.
> > 
> > Those sections are modified at runtime and must remain writable.  While
> > SHF_WRITE is ignored by vmlinux, it's still needed for modules.
> > 
> > To work around the linker interference, remove SHF_WRITE during
> > compilation and restore it after linking the module.
> 
> *groan*
> 
> This and the following patches marking a whole bunch of sections M,
> seems to suggest you're going to rely on sh_entsize actually working.
> 
> There was an ld.lld bug, and IIRC you need to enforce llvm-20 or later
> if you want this to be so.

Hm, ISTR this working with clang 18, I'll go test that again.

-- 
Josh

Re: [PATCH v2 52/62] objtool/klp: Introduce klp diff subcommand for diffing object files

On Mon, May 26, 2025 at 08:22:59PM +0200, Peter Zijlstra wrote:
> On Fri, May 09, 2025 at 01:17:16PM -0700, Josh Poimboeuf wrote:
> 
> > Without '-ffunction-sections -fdata-sections', reliable object diffing
> > would be infeasible due to toolchain limitations:
> > 
> >   - For intra-file+intra-section references, the compiler might
> > occasionally generated hard-coded instruction offsets instead of
> > relocations.
> > 
> >   - Section-symbol-based references can be ambiguous:
> > 
> > - Overlapping or zero-length symbols create ambiguity as to which
> >   symbol is being referenced.
> > 
> > - A reference to the end of a symbol (e.g., checking array bounds)
> >   can be misinterpreted as a reference to the next symbol, or vice
> >   versa.
> > 
> > A potential future alternative to '-ffunction-sections -fdata-sections'
> > would be to introduce a toolchain option that forces symbol-based
> > (non-section) relocations.
> 
> Urgh.. So the first issue we can fix with objtool, but the ambiguous
> cases are indeed very hard to fix up in post.
> 
> Did you already talk to toolchain people about this?

For now, I want to stick with -ffunction-sections -fdata-sections, as
that's what kpatch has done for 10+ years and it works well.  That's the
only option we have for current compilers anyway.

The above mentioned possibility of diffing without -ffunction-sections
-fdata-sections is theoretical, and needs more exploration.  If it
indeed works then we can try to get toolchain support for that.

-- 
Josh

Re: [PATCH net] virtio-net: drop the multi-buffer XDP packet in zerocopy

2025-06-04 Thread Jason Wang

On Wed, Jun 4, 2025 at 10:17 PM Bui Quang Minh  wrote:
>
> On 6/4/25 07:37, Jason Wang wrote:
> > On Tue, Jun 3, 2025 at 11:07 PM Bui Quang Minh  
> > wrote:
> >> In virtio-net, we have not yet supported multi-buffer XDP packet in
> >> zerocopy mode when there is a binding XDP program. However, in that
> >> case, when receiving multi-buffer XDP packet, we skip the XDP program
> >> and return XDP_PASS. As a result, the packet is passed to normal network
> >> stack which is an incorrect behavior. This commit instead returns
> >> XDP_DROP in that case.
> >>
> >> Fixes: 99c861b44eb1 ("virtio_net: xsk: rx: support recv merge mode")
> >> Cc: sta...@vger.kernel.org
> >> Signed-off-by: Bui Quang Minh 
> >> ---
> >>   drivers/net/virtio_net.c | 11 ---
> >>   1 file changed, 8 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index e53ba600605a..4c35324d6e5b 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1309,9 +1309,14 @@ static struct sk_buff 
> >> *virtnet_receive_xsk_merge(struct net_device *dev, struct
> >>  ret = XDP_PASS;
> > It would be simpler to just assign XDP_DROP here?
> >
> > Or if you wish to stick to the way, we can simply remove this assignment.
>
> This XDP_PASS is returned for the case when there is no XDP program
> binding (!prog).

You're right.

Acked-by: Jason Wang 

Thanks

>
> >
> >>  rcu_read_lock();
> >>  prog = rcu_dereference(rq->xdp_prog);
> >> -   /* TODO: support multi buffer. */
> >> -   if (prog && num_buf == 1)
> >> -   ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit, stats);
> >> +   if (prog) {
> >> +   /* TODO: support multi buffer. */
> >> +   if (num_buf == 1)
> >> +   ret = virtnet_xdp_handler(prog, xdp, dev, xdp_xmit,
> >> + stats);
> >> +   else
> >> +   ret = XDP_DROP;
> >> +   }
> >>  rcu_read_unlock();
> >>
> >>  switch (ret) {
> >> --
> >> 2.43.0
> >>
> > Thanks
> >
>
>
> Thanks,
> Quang Minh.
>
>

Re: [PATCH v2 32/62] objtool: Suppress section skipping warnings with --dryrun

On Mon, May 26, 2025 at 12:52:40PM +0200, Peter Zijlstra wrote:
> On Fri, May 09, 2025 at 01:16:56PM -0700, Josh Poimboeuf wrote:
> > It's common to use --dryrun on binaries that have already been
> > processed.  Don't print the section skipping warnings in that case.
> 
> Ah, I rather like this warning, it gives me an easy check to see if the
> file has already been processed.
> 
> I typically do a OBJTOOL_ARGS="--backup" build and run dryrun debug
> sessions against those .orig files.

Ok.

Though, note that as of a few months ago, --backup no longer exists.  A
backup is now automatically created with --verbose.  But we can revive
it if you want.

-- 
Josh

Re: [PATCH v4 4/7] cxl/acpi: Add background worker to wait for cxl_pci and cxl_mem probe



On 04/06/2025 22:35, Dave Jiang wrote:
>>>
>>> +static DECLARE_WAIT_QUEUE_HEAD(cxl_wait_queue);
>> Given that this file (suspend.c) focuses on power management functions,
>> it might be more appropriate to move the wait queue declaration and its
>> related changes to acpi.c in where the its caller is.
> You mean drivers/cxl/acpi.c and not drivers/cxl/core/acpi.c right? 

Yes


> The core one is my mistake and I'm going to create a patch to remove that.


Completely missed its existence - thank you for catching and cleaning that up!

Thanks
Zhijian


> 
> DJ

Re: [PATCH vhost] vdpa/mlx5: Fix needs_teardown flag calculation

2025-06-04 Thread Jason Wang

On Thu, Jun 5, 2025 at 2:48 AM Dragos Tatulea  wrote:
>
> needs_teardown is a device flag that indicates when virtual queues need
> to be recreated. This happens for certain configuration changes: queue
> size and some specific features.
>
> Currently, the needs_teardown state can be incorrectly reset by
> subsequent .set_vq_num() calls. For example, for 1 rx VQ with size 512
> and 1 tx VQ with size 256:
>
> .set_vq_num(0, 512) -> sets needs_teardown to true (rx queue has a
>non-default size)
> .set_vq_num(1, 256) -> sets needs_teardown to false (tx queue has a
>default size)
>
> This change takes into account the previous value of the needs_teardown
> flag when re-calculating it during VQ size configuration.
>
> Fixes: 0fe963d6fc16 ("vdpa/mlx5: Re-create HW VQs under certain conditions")
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Shahar Shitrit 
> Reviewed-by: Si-Wei Liu 
> Tested-by: Si-Wei Liu
> ---

Acked-by: Jason Wang 

Thanks

Re: [PATCH v2 42/62] kbuild,x86: Fix module permissions for __jump_table and __bug_table