Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-18 Thread Google
Hi Steve,

Can you review this series? Especially, [07/36] and [12/36] have been changed
a lot from your original patch.

Thank you,

On Mon, 15 Apr 2024 21:48:59 +0900
"Masami Hiramatsu (Google)"  wrote:

> Hi,
> 
> Here is the 9th version of the series to re-implement fprobe on top of
> the function-graph tracer. The previous version is:
> 
> https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/
> 
> This version is ported onto the latest kernel (v6.9-rc3 + probes/for-next)
> and fixes some bugs, and adds a performance optimization patch [36/36].
>  - [12/36] Fix to clear fgraph_array entry in registration failure, also
>return -ENOSPC when fgraph_array is full.
>  - [28/36] Add new store_fprobe_entry_data() for fprobe.
>  - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation.
>  - [36/36] Add new flag to skip timestamp recording.
> 
> Overview
> 
> This series makes 2 major changes: it enables multiple function-graph
> instances on ftrace (e.g. allows function-graph on sub-instances) and
> rewrites fprobe on top of this function-graph.
> 
> The former change was originally sent by Steven Rostedt 4 years ago (*);
> it allows users to configure the function-graph tracer (and other tracers
> based on function-graph) differently in each trace instance at the same
> time.
> 
> (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/
> 
> The purposes of the latter change are:
> 
>  1) Remove fprobe's dependency on rethook so that we can reduce
>the return hook code and shadow stack.
> 
>  2) Make 'ftrace_regs' the common trace interface for the function
>boundary.
> 
> 1) Currently we have 2 (or 3) different function return hook codes:
>  the function-graph tracer and rethook (and legacy kretprobe).
>  Since this is redundant and doubles the maintenance cost,
>  I would like to unify those. From the user's viewpoint, the function-
>  graph tracer is very useful for grasping the execution path. For this
>  purpose, it is hard to use the rethook in the function-graph
>  tracer, but the opposite is possible. (Strictly speaking, kretprobe
>  cannot use it because it requires 'pt_regs' for historical reasons.)
> 
> 2) Currently the fprobe provides 'pt_regs' to its handler, but that is
>  wrong for function entry and exit. Moreover, depending on the
>  architecture, there is no way to accurately reproduce 'pt_regs'
>  outside of interrupt or exception handlers. This means fprobe should
>  not use 'pt_regs' because it does not use such exceptions.
>  (Conversely, kprobe should use 'pt_regs' because it is an abstract
>   interface of the software breakpoint exception.)
> 
> This series changes fprobe to use the function-graph tracer for tracing
> function entry and exit, instead of a mixture of ftrace and rethook.
> Unlike the rethook, which is a per-task list of system-wide allocated
> nodes, the function graph's ret_stack is a per-task shadow stack.
> Thus it does not need 'nr_maxactive' (the number of pre-allocated
> nodes) to be set.
> Also the handlers will get 'ftrace_regs' instead of 'pt_regs'.
> Since eBPF multi_kprobe/multi_kretprobe events still use 'pt_regs' as
> their register interface, this series converts 'ftrace_regs' to
> 'pt_regs' for them. Of course this conversion makes an incomplete
> 'pt_regs', so users must access only the registers used for function
> parameters or the return value.
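The "incomplete pt_regs" idea above can be pictured with a toy userspace model. All struct and register names below are made up for illustration (loosely following the x86-64 calling convention); this is not kernel code — only the argument and return-value registers are populated, and everything else stays zeroed and untrustworthy:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-ins for ftrace_regs and pt_regs (made-up layouts). */
struct toy_ftrace_regs {
	unsigned long args[6];
	unsigned long retval;
	unsigned long ip;
};

struct toy_pt_regs {
	unsigned long di, si, dx, cx, r8, r9; /* argument registers */
	unsigned long ax;                     /* return value */
	unsigned long ip;
	unsigned long sp, bp, flags;          /* NOT reproduced: stay zero */
};

static struct toy_pt_regs toy_ftrace_regs_to_pt_regs(const struct toy_ftrace_regs *f)
{
	struct toy_pt_regs regs;

	/* Only copy what the function boundary actually provides. */
	memset(&regs, 0, sizeof(regs));
	regs.di = f->args[0];
	regs.si = f->args[1];
	regs.dx = f->args[2];
	regs.cx = f->args[3];
	regs.r8 = f->args[4];
	regs.r9 = f->args[5];
	regs.ax = f->retval;
	regs.ip = f->ip;
	return regs;
}
```

A consumer of such a structure must therefore read only `di`..`r9`, `ax`, and `ip`; fields like `sp` or `flags` carry no meaning.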
> 
> Design
> --
> Instead of using ftrace's function entry hook directly, the new fprobe
> is built on top of the function-graph's entry and return callbacks
> with 'ftrace_regs'.
> 
> Since the fprobe requires access to 'ftrace_regs', the architecture
> must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and
> CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enable calling the function-graph
> entry callback with 'ftrace_regs', and also
> CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to
> return_to_handler.
> 
> All fprobes share a single function-graph ops (meaning they share a
> common ftrace filter), similar to kprobe-on-ftrace. This needs another
> layer to find the corresponding fprobe in the common function-graph
> callbacks, but has much better scalability, since the number of
> registered function-graph ops is limited.
> 
> In the entry callback, the fprobe runs its entry_handler and saves the
> address of 'fprobe' on the function-graph's shadow stack as data. The
> return callback decodes the data to get the 'fprobe' address, and runs
> the exit_handler.
> 
> The fprobe introduces two hash-tables. One is for the entry callback,
> which searches for the fprobes related to the function address passed
> to the callback. The other is for the return callback, which checks
> whether the given 'fprobe' data structure pointer is still valid. Note
> that it is possible to unregister an fprobe before the return callback
> runs; thus the address must be validated before it is used in the
> return callback.
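The validation scheme described in the quoted design can be modeled in a few lines of standalone C. All names here are toy stand-ins (no locking, no collision handling; not kernel code) — the point is only that the return-callback side re-checks a registration table before trusting the pointer it decoded from shadow-stack data:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TABLE_SIZE 64

struct toy_fprobe { const char *name; };

/* Registration table consulted by the return callback. */
static struct toy_fprobe *toy_table[TOY_TABLE_SIZE];

static size_t toy_hash(const void *p)
{
	return ((size_t)p >> 4) % TOY_TABLE_SIZE; /* toy hash, no collisions handled */
}

static void toy_register(struct toy_fprobe *fp)
{
	toy_table[toy_hash(fp)] = fp;
}

static void toy_unregister(struct toy_fprobe *fp)
{
	if (toy_table[toy_hash(fp)] == fp)
		toy_table[toy_hash(fp)] = NULL;
}

/* Return-callback side: validate before dereferencing. */
static bool toy_fprobe_is_valid(struct toy_fprobe *fp)
{
	return toy_table[toy_hash(fp)] == fp;
}
```

If the fprobe was unregistered between entry and return, the lookup fails and the exit handler is simply skipped.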
> 
> This series can be applied against the probes/for-next branch, which
> is 

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-18 Thread Steven Rostedt
On Thu, 18 Apr 2024 09:55:55 +0300
Mike Rapoport  wrote:

Hi Mike,

Thanks for doing this review!

> > +/**
> > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > + * @meta_page_size:Size of this meta-page.
> > + * @meta_struct_len:   Size of this structure.
> > + * @subbuf_size:   Size of each sub-buffer.
> > + * @nr_subbufs:Number of subbufs in the ring-buffer, including the reader.
> > + * @reader.lost_events:Number of events lost at the time of the reader swap.
> > + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
> > + * @reader.read:   Number of bytes read on the reader subbuf.
> > + * @flags: Placeholder for now, 0 until new features are supported.
> > + * @entries:   Number of entries in the ring-buffer.
> > + * @overrun:   Number of entries lost in the ring-buffer.
> > + * @read:  Number of entries that have been read.
> > + * @Reserved1: Reserved for future use.
> > + * @Reserved2: Reserved for future use.
> > + */
> > +struct trace_buffer_meta {
> > +   __u32   meta_page_size;
> > +   __u32   meta_struct_len;
> > +
> > +   __u32   subbuf_size;
> > +   __u32   nr_subbufs;
> > +
> > +   struct {
> > +   __u64   lost_events;
> > +   __u32   id;
> > +   __u32   read;
> > +   } reader;
> > +
> > +   __u64   flags;
> > +
> > +   __u64   entries;
> > +   __u64   overrun;
> > +   __u64   read;
> > +
> > +   __u64   Reserved1;
> > +   __u64   Reserved2;  
> 
> Why do you need reserved fields? This structure always resides in the
> beginning of a page and the rest of the page is essentially "reserved".

So this code is also going to be used in Arm's pKVM hypervisor code,
where it will be using these fields. Since we are looking at keeping
the same interface between the two, we don't want these fields used by
this interface yet.

We probably should add a comment about that.

> 
> > +};
> > +
> > +#endif /* _TRACE_MMAP_H_ */
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index cc9ebe593571..793ecc454039 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c  
> 
> ... 
> 
> > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> > +  unsigned long *subbuf_ids)
> > +{
> > +   struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > +   unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> > +   struct buffer_page *first_subbuf, *subbuf;
> > +   int id = 0;
> > +
> > +   subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> > +   cpu_buffer->reader_page->id = id++;
> > +
> > +   first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> > +   do {
> > +   if (WARN_ON(id >= nr_subbufs))
> > +   break;
> > +
> > +   subbuf_ids[id] = (unsigned long)subbuf->page;
> > +   subbuf->id = id;
> > +
> > +   rb_inc_page(&subbuf);
> > +   id++;
> > +   } while (subbuf != first_subbuf);
> > +
> > +   /* install subbuf ID to kern VA translation */
> > +   cpu_buffer->subbuf_ids = subbuf_ids;
> > +
> > +   /* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> > +   meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;  
> 
> Isn't this a single page?

One thing we are doing is to make sure that the subbuffers are aligned
by their size. If a subbuffer is 3 pages, it should be aligned on 3
page boundaries. This was something that Linus suggested.
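That padding rule can be sketched in a standalone C model (made-up constants and helper names; not the kernel code): the meta-page is padded out to one sub-buffer size, so every sub-buffer that follows it in the mapping starts on a boundary of its own size.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL

/* Sub-buffer size is a power-of-two number of pages. */
static unsigned long toy_subbuf_size(unsigned int order)
{
	return TOY_PAGE_SIZE << order;
}

/* The meta-page occupies [0, subbuf_size); sub-buffer 'id' follows it,
 * so every sub-buffer offset is a multiple of the sub-buffer size. */
static unsigned long toy_subbuf_offset(unsigned int order, unsigned int id)
{
	return (1UL + id) * toy_subbuf_size(order);
}
```

With order = 2, for example, the meta-page is padded to 16 KiB and sub-buffer 0 starts exactly at offset 16 KiB, aligned to its own size.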

> 
> > +   meta->meta_struct_len = sizeof(*meta);
> > +   meta->nr_subbufs = nr_subbufs;
> > +   meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> > +
> > +   rb_update_meta_page(cpu_buffer);
> > +}  
> 
> ...
> 
> > +#define subbuf_page(off, start) \
> > +   virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> > +
> > +#define foreach_subbuf_page(sub_order, start, page)\  
> 
> Nit: usually iterators in kernel use for_each

Ah, good catch. Yeah, that should be changed. But then ...

> 
> > +   page = subbuf_page(0, (start)); \
> > +   for (int __off = 0; __off < (1 << (sub_order)); \
> > +__off++, page = subbuf_page(__off, (start)))  
> 
> The pages are allocated with alloc_pages_node(.. subbuf_order) are
> physically contiguous and struct pages for them are also contiguous, so
> inside a subbuf_order allocation you can just do page++.
> 

I'm wondering if we should just nuke the macro. It was there because of
the previous implementation did things twice. But now it's just done
once here:

+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page;
+
+   foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+ 

Re: [PATCH 1/2] tracing/user_events: Fix non-spaced field matching

2024-04-18 Thread Google
On Tue, 16 Apr 2024 22:41:01 +
Beau Belgrave  wrote:

> When the ABI was updated to prevent same name w/different args, it
> missed an important corner case when fields don't end with a space.
> Typically, space is used for fields to help separate them, like
> "u8 field1; u8 field2". If no spaces are used, like
> "u8 field1;u8 field2", then the parsing works for the first time.
> However, the match check fails on a subsequent register, leading to
> confusion.
> 
> This is because the match check uses argv_split() and assumes that all
> fields will be split upon the space. When spaces are used, we get back
> { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
> This causes a mismatch, and the user program gets back -EADDRINUSE.
> 
> Add a method to detect this case before calling argv_split(). If found
> force a space after the field separator character ';'. This ensures all
> cases work properly for matching.
> 
> With this fix, the following are all treated as matching:
> u8 field1;u8 field2
> u8 field1; u8 field2
> u8 field1;\tu8 field2
> u8 field1;\nu8 field2

Sounds good to me. I just have some nits.

> 
> Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different args event")
> Signed-off-by: Beau Belgrave 
> ---
>  kernel/trace/trace_events_user.c | 88 +++-
>  1 file changed, 87 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index 70d428c394b6..9184d3962b2a 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -1989,6 +1989,92 @@ static int user_event_set_tp_name(struct user_event *user)
>   return 0;
>  }
>  
> +/*
> + * Counts how many ';' without a trailing space are in the args.
> + */
> +static int count_semis_no_space(char *args)
> +{
> + int count = 0;
> +
> + while ((args = strchr(args, ';'))) {
> + args++;
> +
> + if (!isspace(*args))
> + count++;
> + }
> +
> + return count;
> +}
> +
> +/*
> + * Copies the arguments while ensuring all ';' have a trailing space.
> + */
> +static char *fix_semis_no_space(char *args, int count)

nit: This name does not represent what it does. 'insert_space_after_semis()'
is more self-descriptive.

> +{
> + char *fixed, *pos;
> + char c, last;
> + int len;
> +
> + len = strlen(args) + count;
> + fixed = kmalloc(len + 1, GFP_KERNEL);
> +
> + if (!fixed)
> + return NULL;
> +
> + pos = fixed;
> + last = '\0';
> +
> + while (len > 0) {
> + c = *args++;
> +
> + if (last == ';' && !isspace(c)) {
> + *pos++ = ' ';
> + len--;
> + }
> +
> + if (len > 0) {
> + *pos++ = c;
> + len--;
> + }
> +
> + last = c;
> + }

nit: This loop can be simpler, because we are sure 'fixed' has enough length:

/* insert a space after ';' if there is no space. */
while (*args) {
*pos = *args++;
if (*pos++ == ';' && !isspace(*args))
*pos++ = ' ';
}

> +
> + /*
> +  * len is the length of the copy excluding the null.
> +  * This ensures we always have room for a null.
> +  */
> + *pos = '\0';
> +
> + return fixed;
> +}
> +
> +static char **user_event_argv_split(char *args, int *argc)
> +{
> + /* Count how many ';' without a trailing space */
> + int count = count_semis_no_space(args);
> +
> + if (count) {

nit: it is better to exit fast, so 

if (!count)
return argv_split(GFP_KERNEL, args, argc);

...

Thank you,

OT: BTW, can this also simplify synthetic events?

> + /* We must fixup 'field;field' to 'field; field' */
> + char *fixed = fix_semis_no_space(args, count);
> + char **split;
> +
> + if (!fixed)
> + return NULL;
> +
> + /* We do a normal split afterwards */
> + split = argv_split(GFP_KERNEL, fixed, argc);
> +
> + /* We can free since argv_split makes a copy */
> + kfree(fixed);
> +
> + return split;
> + }
> +
> + /* No fixup is required */
> + return argv_split(GFP_KERNEL, args, argc);
> +}
> +
>  /*
>   * Parses the event name, arguments and flags then registers if successful.
>   * The name buffer lifetime is owned by this method for success cases only.
> @@ -2012,7 +2098,7 @@ static int user_event_parse(struct user_event_group *group, char *name,
>   return -EPERM;
>  
>   if (args) {
> - argv = argv_split(GFP_KERNEL, args, &argc);
> + argv = user_event_argv_split(args, &argc);
>  
>   if (!argv)
>   return -ENOMEM;
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Jason Xing
On Fri, Apr 19, 2024 at 7:26 AM Jason Xing  wrote:
>
> > When I said "If you feel the need to put them in a special group, this
> > is fine by me.",
> > this was really about partitioning the existing enum into groups, if
> > you prefer having a group of 'RES reasons'
>
> Are you suggesting copying what we need from enum skb_drop_reason{} to
> enum sk_rst_reason{}? Why not reuse them directly? I'm not sure what
> the side effects of the cast conversion itself would be.

Sorry for one more email. I'm worried my statement was not that clear,
so I wrote one simple snippet which can help me explain well :)

Allow me to give NO_SOCKET as an example:
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..2c9f7364de45 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
int code, __be32 info,
if (!fl4.saddr)
fl4.saddr = htonl(INADDR_DUMMY);

+   trace_icmp_send(skb_in, type, code);
icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
 ende:
ip_rt_put(rt);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1e650ec71d2f..d5963831280f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2160,6 +2160,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 {
struct net *net = dev_net(skb->dev);
enum skb_drop_reason drop_reason;
+   enum sk_rst_reason rst_reason;
int sdif = inet_sdif(skb);
int dif = inet_iif(skb);
const struct iphdr *iph;
@@ -2355,7 +2356,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v4_send_reset(NULL, skb);
+   rst_reason = RST_REASON_NO_SOCKET;
+   tcp_v4_send_reset(NULL, skb, rst_reason);
}

 discard_it:

As you can see, we need to add a new 'rst_reason' variable which
is actually the same as the drop reason. They are the same except for the
enum type... Such rst_reasons/drop_reasons are all over the place.

Eric, if you have a strong preference, I can do it as you said.

Well, how about explicitly casting them like this based on the current
series? It looks better and clearer, and is more helpful to people who are
reading the code to understand it:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 461b4d2b7cfe..eb125163d819 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;

 reset:
-   tcp_v4_send_reset(rsk, skb, (u32)reason);
+   tcp_v4_send_reset(rsk, skb, (enum sk_rst_reason)reason);
 discard:
kfree_skb_reason(skb, reason);
/* Be careful here. If this function gets more complicated and

Thanks for your patience again.

Jason



[PATCH 1/1] genirq/cpuhotplug: retry with online CPUs on irq_do_set_affinity failure

2024-04-18 Thread Dongli Zhang
When a CPU goes offline, its IRQs may migrate to other CPUs. Managed
IRQs are either migrated or shut down (if all CPUs in the managed IRQ's
affinity are offline). Regular IRQs are only migrated.

The migrate_one_irq() function first uses the pending_mask or affinity_mask of the IRQ.

104 if (irq_fixup_move_pending(desc, true))
105 affinity = irq_desc_get_pending_mask(desc);
106 else
107 affinity = irq_data_get_affinity_mask(d);

The migrate_one_irq() may use all online CPUs, if all CPUs in
pending_mask/affinity_mask are already offline.

113 if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
114 /*
115  * If the interrupt is managed, then shut it down and leave
116  * the affinity untouched.
117  */
118 if (irqd_affinity_is_managed(d)) {
119 irqd_set_managed_shutdown(d);
120 irq_shutdown_and_deactivate(desc);
121 return false;
122 }
123 affinity = cpu_online_mask;
124 brokeaff = true;
125 }

However, there is a corner case. Although some CPUs in
pending_mask/affinity_mask are still online, they may lack available
vectors. If the kernel continues calling irq_do_set_affinity() with those
CPUs, it gets an -ENOSPC error.

This is not reasonable as other online CPUs still have many available vectors.

name:   VECTOR
 size:   0
 mapped: 529
 flags:  0x0103
Online bitmaps:7
Global available:884
Global reserved:   6
Total allocated: 539
System: 36: 0-19,21,50,128,236,243-244,246-255
 | CPU | avl | man | mac | act | vectors
 0   147 0 0   55  32-49,51-87
 1   147 0 0   55  32-49,51-87
 2 0 0 0  202  32-49,51-127,129-235
 4   147 0 0   55  32-49,51-87
 5   147 0 0   55  32-49,51-87
 6   148 0 0   54  32-49,51-86
 7   148 0 0   54  32-49,51-86

This issue should not happen for managed IRQs because their vectors are
already reserved before CPU hotplug. For regular IRQs, retry with all
online CPUs if the prior irq_do_set_affinity() fails with -ENOSPC.
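The proposed fallback can be modeled with a small standalone sketch (stubbed set_affinity and made-up names; not the kernel code): try the IRQ's own mask first, and only on -ENOSPC retry with the online-CPU mask.

```c
#define TOY_ENOSPC 28

/* Pretend state: vectors remaining on the CPUs in the IRQ's own mask. */
static int vectors_left_in_irq_mask;

/* Stub for irq_do_set_affinity(): fails with -ENOSPC when the IRQ's own
 * mask is out of vectors, succeeds for the online mask. */
static int toy_do_set_affinity(int use_online_mask)
{
	if (!use_online_mask && vectors_left_in_irq_mask == 0)
		return -TOY_ENOSPC;
	return 0;
}

static int toy_migrate_one_irq(void)
{
	int err = toy_do_set_affinity(0);

	/* Only -ENOSPC triggers the fallback; other errors propagate. */
	if (err == -TOY_ENOSPC)
		err = toy_do_set_affinity(1); /* retry with all online CPUs */
	return err;
}
```

The retry is deliberately limited to -ENOSPC so that genuine affinity failures are still reported.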

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 kernel/irq/cpuhotplug.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 1ed2b1739363..d1666a6b73f4 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -130,6 +130,19 @@ static bool migrate_one_irq(struct irq_desc *desc)
 * CPU.
 */
err = irq_do_set_affinity(d, affinity, false);
+
+   if (err == -ENOSPC &&
+   !irqd_affinity_is_managed(d) &&
+   affinity != cpu_online_mask) {
+   affinity = cpu_online_mask;
+   brokeaff = true;
+
+   pr_debug("IRQ%u: set affinity failed for %*pbl, re-try with all online CPUs\n",
+d->irq, cpumask_pr_args(affinity));
+
+   err = irq_do_set_affinity(d, affinity, false);
+   }
+
if (err) {
pr_warn_ratelimited("IRQ%u: set affinity failed(%d).\n",
d->irq, err);
-- 
2.34.1




[PATCH 0/1] genirq/cpuhotplug: fix CPU hotplug set affinity failure issue

2024-04-18 Thread Dongli Zhang
Please refer to the commit message of the patch for details.

This cover letter demonstrates how to reproduce the issue on purpose with
QEMU/KVM + virtio-net (which is why virtualizat...@lists.linux.dev is CCed).

Thank you very much!



1. Build the mainline linux kernel.

$ make defconfig
$ scripts/config --file ".config" -e CONFIG_X86_X2APIC \
  -e CONFIG_GENERIC_IRQ_DEBUGFS
$ make olddefconfig
$ make -j24 > /dev/null

Confirm the config is enabled.

$ cat .config | grep CONFIG_GENERIC_IRQ_DEBUGFS
CONFIG_GENERIC_IRQ_DEBUGFS=y


2. Create the VM with the below QEMU command line. The libvirt virbr0 is used
as bridge for virtio-net.

---
$ cat qemu-ifup
#!/bin/sh
# Script to bring a network (tap) device for qemu up.

br="virbr0"
ifconfig $1 up
brctl addif $br "$1"
exit
---

/home/zhang/kvm/qemu-8.2.0/build/qemu-system-x86_64 \
-hda ubuntu2204.qcow2 -m 8G -smp 32 -vnc :5 -enable-kvm -cpu host \
-net nic -net user,hostfwd=tcp::5025-:22 \
-device virtio-net-pci,netdev=tapnet01,id=net01,mac=01:54:00:12:34:56,bus=pci.0,addr=0x4,mq=true,vectors=257 \
-netdev tap,id=tapnet01,ifname=tap01,script=qemu-ifup,downscript=no,queues=128,vhost=off \
-device virtio-net-pci,netdev=tapnet02,id=net02,mac=02:54:00:12:34:56,bus=pci.0,addr=0x5,mq=true,vectors=257 \
-netdev tap,id=tapnet02,ifname=tap02,script=qemu-ifup,downscript=no,queues=128,vhost=off \
-kernel /home/zhang/img/debug/mainline-linux/arch/x86_64/boot/bzImage \
-append "root=/dev/sda3 init=/sbin/init text loglevel=7 console=ttyS0" \
-serial stdio -name debug-threads=on


3. Use procfs to confirm the virtio IRQ numbers.

$ cat /proc/interrupts | grep virtio
 24: ... ... PCI-MSIX-:00:04.0   0-edge  virtio0-config
 25: ... ... PCI-MSIX-:00:04.0   1-edge  virtio0-input.0
... ...
537: ... ... PCI-MSIX-:00:05.0 256-edge  virtio1-output.127

Reset the affinity of IRQs 25-537 to CPUs=2,3.

---
#!/bin/sh

for irq in {25..537}
do
  echo $irq
  echo 2,3 > /proc/irq/$irq/smp_affinity_list
  cat /proc/irq/$irq/smp_affinity_list
  cat /proc/irq/$irq/effective_affinity_list
  echo ""
done
---

Now offline CPU=8-31.

---
#!/bin/sh

for cpu in {8..31}
do
  echo $cpu
  echo 0 > /sys/devices/system/cpu/cpu$cpu/online
done
---


The below is the current VECTOR debugfs.

# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 529
 flags:  0x0103
Online bitmaps:8
Global available:   1090
Global reserved:   6
Total allocated: 536
System: 36: 0-19,21,50,128,236,243-244,246-255
 | CPU | avl | man | mac | act | vectors
 0   169 0 0   33  32-49,51-65
 1   171 0 0   31  32-49,51-63
 2    26 0 0  176  32-49,52-127,129-210
 3    27 0 0  175  32-49,51-127,129-171,173-209
 4   175 0 0   27  32-49,51-59
 5   175 0 0   27  32-49,51-59
 6   172 0 0   30  32-49,51-62
 7   175 0 0   27  32-49,51-59


4. Now offline CPU=3.

# echo 0 > /sys/devices/system/cpu/cpu3/online

There are below from dmesg.

[   96.234045] IRQ151: set affinity failed(-28).
[   96.234064] IRQ156: set affinity failed(-28).
[   96.234078] IRQ158: set affinity failed(-28).
[   96.234091] IRQ159: set affinity failed(-28).
[   96.234105] IRQ161: set affinity failed(-28).
[   96.234118] IRQ162: set affinity failed(-28).
[   96.234132] IRQ163: set affinity failed(-28).
[   96.234145] IRQ164: set affinity failed(-28).
[   96.234159] IRQ165: set affinity failed(-28).
[   96.234172] IRQ166: set affinity failed(-28).
[   96.235013] IRQ fixup: irq 339 move in progress, old vector 48
[   96.237129] smpboot: CPU 3 is now offline


Although other CPUs have many available vectors, only CPU=2 is used.

# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 529
 flags:  0x0103
Online bitmaps:7
Global available:   1022
Global reserved:   6
Total allocated: 533
System: 36: 0-19,21,50,128,236,243-244,246-255
 | CPU | avl | man | mac | act | vectors
 0   168 0 0   34  32-49,51-53,55-57,59-68
 1   165 0 0   37  32-49,51-57,59-60,64-73
 2 0 0 0  202  32-49,51-127,129-235
 4   173 0 0   29  32-40,42-48,52-63,65
 5   171 0 0   31  32-49,51-54,56,58-62,64-66
 6   172 0 0   30  32-49,51-52,54-57,59-63,65
 7   173 0 0   29  32-49,51-52,54-58,60-62,64


Dongli Zhang





Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-18 Thread Huang, Kai




On 16/04/2024 3:20 pm, Haitao Huang wrote:

From: Kristen Carlson Accardi 

In cases where EPC pages need to be allocated during a page fault and the
cgroup usage is near its limit, an asynchronous reclamation needs to be
triggered to avoid blocking the page fault handling.

Create a workqueue, corresponding work item and function definitions
for the EPC cgroup to support the asynchronous reclamation.

In case the workqueue allocation fails during init, disable the cgroup.


It's fine and reasonable to disable the (SGX EPC) cgroup.  The problem is 
that "exactly what does this mean" isn't quite clear.


Given SGX EPC is just one type of MISC cgroup resource, we cannot just 
disable the MISC cgroup as a whole.


So, the first interpretation is we treat the entire MISC_CG_RES_SGX 
resource type doesn't exist, that is, we just don't show control files 
in the file system, and all EPC pages are tracked in the global list.


But it might not be straightforward to implement in the SGX driver, 
i.e., we might need more MISC cgroup core code changes to make it 
able to support disabling a particular resource at runtime -- I need 
to double check.


So if that is not worth doing, we will still need to live with the 
fact that the user is still able to create SGX cgroups in the 
hierarchy, see those control files, and read/write them.


The second interpretation I suppose is, although the SGX cgroup is still 
seen as supported in userspace, in the kernel we just treat it as if it 
doesn't exist.


Specifically, that means: 1) we always return the root SGX cgroup for 
any EPC page when allocating a new one; 2) as a result, we still track 
all EPC pages in a single global list.


But from the code below ...



  static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
  {
if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
@@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
  {
int ret;
  
+	/* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */

+   if (!sgx_cg_wq)
+   return 0;
+


..., IIUC you choose a (third) solution that is even one more step back:

It just makes try_charge() always succeed, but EPC pages are still 
managed in the "per-cgroup" list.


But this solution, AFAICT, doesn't work.  The reason is that when you fail 
to allocate an EPC page you will do the global reclaim, but now the global 
list is empty.


Am I missing anything?

So my thinking is, we have two options:

1) Modify the MISC cgroup core code to allow the kernel to disable one 
particular resource.  It shouldn't be hard, e.g., we can add a 
'disabled' flag to the 'struct misc_res'.


Hmm.. wait, after checking, the MISC cgroup won't show any control files 
if the "capacity" of the resource is 0:


"
 * Miscellaneous resources capacity for the entire machine. 0 capacity
 * means resource is not initialized or not present in the host.
"

So I really suppose we should go with this route, i.e., by just setting 
the EPC capacity to 0?


Note misc_cg_try_charge() will fail if the capacity is 0, but we can make 
it return success by explicitly checking whether the SGX cgroup is 
disabled with a helper, e.g., sgx_cgroup_disabled().
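A standalone sketch of that convention (toy names and a simplified limit check; not the actual kernel symbols) might look like:

```c
#include <stdbool.h>

/* Toy model: capacity == 0 means the SGX EPC cgroup is disabled,
 * mirroring the MISC cgroup convention quoted above. */
static unsigned long toy_epc_capacity; /* 0 => disabled */

static bool toy_sgx_cgroup_disabled(void)
{
	return toy_epc_capacity == 0;
}

/* try_charge succeeds unconditionally when the cgroup is disabled;
 * otherwise it enforces a (grossly simplified) capacity limit. */
static int toy_sgx_cgroup_try_charge(unsigned long *usage, unsigned long amount)
{
	if (toy_sgx_cgroup_disabled())
		return 0; /* no accounting at all when disabled */
	if (*usage + amount > toy_epc_capacity)
		return -1; /* over the limit */
	*usage += amount;
	return 0;
}
```

The important property is that the disabled path neither fails nor records any usage, so callers behave exactly as if the cgroup resource did not exist.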


And you always return the root SGX cgroup in sgx_get_current_cg() when 
sgx_cgroup_disabled() is true.


And in sgx_reclaim_pages_global(), you do something like:

static void sgx_reclaim_pages_global(..)
{
#ifdef CONFIG_CGROUP_MISC
if (sgx_cgroup_disabled())
sgx_reclaim_pages(&sgx_root_cg.lru);
else
sgx_cgroup_reclaim_pages(misc_cg_root());
#else
sgx_reclaim_pages(&sgx_global_list);
#endif
}

I am perhaps missing some other spots too but you got the idea.

At last, after typing all this, I believe we should have a separate patch 
to handle disabling the SGX cgroup at initialization time.  And you can 
even put this patch _somewhere_ after the patch


"x86/sgx: Implement basic EPC misc cgroup functionality"

and before this patch.

It makes sense to have such a patch anyway, because with it we can easily 
add a kernel command line "sgx_cgroup=disabled" if the user wants it 
disabled (when someone has such a requirement in the future).







Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Google
On Thu, 18 Apr 2024 12:09:09 -0700
Andrii Nakryiko  wrote:

> Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> that RCU is watching when trying to set up a rethook on a function entry.
> 
> One notable exception when we force rcu_is_watching() check is
> CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use
> old-style int3-based workflow instead of relying on ftrace, making RCU
> watching check important to validate.
> 
> This further (in addition to improvements in the previous patch)
> improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> by 2.3%, according to BPF benchmarks ([0]).
> 
>   [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> 
> Signed-off-by: Andrii Nakryiko 


Thanks for update! This looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thanks,

> ---
>  kernel/trace/rethook.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..a974605ad7a5 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>   if (unlikely(!handler))
>   return NULL;
>  
> +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
>   /*
>* This expects the caller will set up a rethook on a function entry.
>* When the function returns, the rethook will eventually be reclaimed
> @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>*/
>   if (unlikely(!rcu_is_watching()))
>   return NULL;
> +#endif
>  
> >   return (struct rethook_node *)objpool_pop(&rh->pool);
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-18 Thread Google
On Thu, 18 Apr 2024 12:10:59 +0100
Jonthan Haslam  wrote:

> Hi Masami,
> 
> > > > OK, then I'll push this to for-next at this moment.
> > > > Please share if you have a good idea for the batch interface which can 
> > > > be
> > > > backported. I guess it should involve updating userspace changes too.
> > > 
> > > Did you (or anyone else) need anything more from me on this one so that it
> > > can be pushed? I provided some benchmark numbers but happy to provide
> > > anything else that may be required.
> > 
> > Yeah, if you can update with the result, it looks better to me.
> > Or, can I update the description?
> 
> Just checking if you need me to do anything on this so that it can be
> pushed to for-next? Would be really great if we can get this in. Thanks
> for all your help.

Yes, the other uprobe performance improvements [1][2] have benchmark results,
but this patch doesn't. If you update this with the benchmark results, it
would be helpful to me.

[1] https://lore.kernel.org/all/20240318181728.2795838-3-and...@kernel.org/
[2] https://lore.kernel.org/all/20240318181728.2795838-4-and...@kernel.org/

Thank you,

> 
> Jon.
> 
> > 
> > Thank you,
> > 
> > > 
> > > Thanks!
> > > 
> > > Jon.
> > > 
> > > > 
> > > > Thank you!
> > > > 
> > > > > >
> > > > > > So I hope you can reconsider and accept improvements in this patch,
> > > > > > while Jonathan will keep working on even better final solution.
> > > > > > Thanks!
> > > > > >
> > > > > > > I look forward to your formalized results :)
> > > > > > >
> > > > > 
> > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > > > > and USDTs, reporting attach/detach timings:
> > > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > > > > bpf_testmod.ko is already unloaded.
> > > > > Loading bpf_testmod.ko...
> > > > > Successfully loaded bpf_testmod.ko.
> > > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > > > > test_bench_attach_uprobe: attached in   0.120s
> > > > > test_bench_attach_uprobe: detached in   0.092s
> > > > > #400/5   uprobe_multi_test/bench_uprobe:OK
> > > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > > > > test_bench_attach_usdt: attached in   0.124s
> > > > > test_bench_attach_usdt: detached in   0.064s
> > > > > #400/6   uprobe_multi_test/bench_usdt:OK
> > > > > #400 uprobe_multi_test:OK
> > > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > > > > Successfully unloaded bpf_testmod.ko.
> > > > > 
> > > > > So it should be easy for Jonathan to validate his changes with this.
> > > > > 
> > > > > > > Thank you,
> > > > > > >
> > > > > > > >
> > > > > > > > Jon.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thank you,
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock 
> > > > > > > > > > > overhead
> > > > > > > > > > > will depend on how much lock contention happens.
> > > > > > > > > > >
> > > > > > > > > > > Thank you,
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Jonathan Haslam 
> > > > > > > > > > > > 
> > > > > > > > > > > > ---
> > > > > > > > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > > > > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/kernel/events/uprobes.c 
> > > > > > > > > > > > b/kernel/events/uprobes.c
> > > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = 
> > > > > > > > > > > > RB_ROOT;
> > > > > > > > > > > >   */
> > > > > > > > > > > >  #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > > > > >
> > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
> > > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> > > > > > > > > > > >
> > > > > > > > > > > >  #define UPROBES_HASH_SZ  13
> > > > > > > > > > > >  /* serialize uprobe->pending_list */
> > > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > > > >  {
> > > > > > > > > > > >   struct uprobe *uprobe;
> > > > > > > > > > > >
> > > > > > > > > > > > -	spin_lock(&uprobes_treelock);
> > > > > > > > > > > > +	read_lock(&uprobes_treelock);
> > > > > > > 

[ANNOUNCE] 5.10.215-rt107

2024-04-18 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.215-rt107 stable release.

This release is simply an update to the new stable 5.10.215 version and no
RT-specific changes have been performed.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: b850d563e91f696a605c28e8d6a968a950cff491

Or to build 5.10.215-rt107 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.215.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.215-rt107.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-18 Thread Huang, Kai




Was requested by Jarkko:
https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t


[...]


Ah I missed that.  No problem to me.





--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include 


I don't see why you need  here.  Also, ...


+#include 
+#include 
+
+#include "sgx.h"


... "sgx.h" already includes 

[...]


right



+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+    /* get_current_misc_cg() never returns NULL when Kconfig enabled */
+    return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}


I spent some time looking into this.  And yes, if I read the code
correctly, get_current_misc_cg() should never return NULL when the
Kconfig option is on.

I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.

[...]

Thanks for checking this; I did a similar analysis and agree with the 
conclusion. I think this is confirmed also by Michal's description AFAICT:

"
The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist."


After looking I believe we can even disable MISC cgroup at runtime for a 
particular cgroup (haven't actually verified on real machine, though):


 # echo "-misc" > /sys/fs/cgroup/my_group/cgroup.subtree_control

And if you look at the MISC cgroup core code, many functions actually 
handle a NULL css, e.g., misc_cg_try_charge():


int misc_cg_try_charge(enum misc_res_type type,
struct misc_cg *cg, u64 amount)
{
...

if (!(valid_type(type) && cg &&
READ_ONCE(misc_res_capacity[type])))
return -EINVAL;

...
}

That's why I am still a little bit worried about this.  And it's better 
to have cgroup expert(s) to confirm here.


Btw, AMD SEV doesn't need to worry because it doesn't dereference @css 
but just pass it to MISC cgroup core functions like 
misc_cg_try_charge().  But for SGX, we actually dereference it directly.




Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Jason Xing
> When I said "If you feel the need to put them in a special group, this
> is fine by me.",
> this was really about partitioning the existing enum into groups, if
> you prefer having a group of 'RES reasons'

Are you suggesting copying what we need from enum skb_drop_reason{} into
enum sk_rst_reason{}? Why not reuse them directly? I don't see what side
effect the cast conversion itself would have.

If we do __not__ copy the reasons one by one, then for passive resets we
can rely entirely on the drop reason, which means that as we implement
more reasons for skb drops happening in reset cases, we don't need to
handle the reset cases over and over again (e.g. adding rst reasons right
after newly added drop reasons, as we would without cast conversions).
It's easier to maintain the reset reason part if we can apply the current
patch series.

Thank you.



Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Paul E. McKenney
On Thu, Apr 18, 2024 at 12:09:09PM -0700, Andrii Nakryiko wrote:
> Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> that RCU is watching when trying to set up a rethook on a function entry.
> 
> One notable exception when we force rcu_is_watching() check is
> CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use
> old-style int3-based workflow instead of relying on ftrace, making RCU
> watching check important to validate.
> 
> This further (in addition to improvements in the previous patch)
> improves BPF multi-kretprobe (which relies on rethook) runtime throughput
> by 2.3%, according to BPF benchmarks ([0]).
> 
>   [0] 
> https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> 
> Signed-off-by: Andrii Nakryiko 

Acked-by: Paul E. McKenney 

> ---
>  kernel/trace/rethook.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..a974605ad7a5 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>   if (unlikely(!handler))
>   return NULL;
>  
> +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
> defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
>   /*
>* This expects the caller will set up a rethook on a function entry.
>* When the function returns, the rethook will eventually be reclaimed
> @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>*/
>   if (unlikely(!rcu_is_watching()))
>   return NULL;
> +#endif
>  
>   return (struct rethook_node *)objpool_pop(&rh->pool);
>  }
> -- 
> 2.43.0
> 



Re: [RFC PATCH 0/4] perf: Correlating user process data to samples

2024-04-18 Thread Josh Poimboeuf
On Sat, Apr 13, 2024 at 08:48:57AM -0400, Steven Rostedt wrote:
> On Sat, 13 Apr 2024 12:53:38 +0200
> Peter Zijlstra  wrote:
> 
> > On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
> > 
> > > > Anyway, since we typically run stuff from NMI context, accessing user
> > > > data is 'interesting'. As such I would really like to make this work
> > > > depend on the call-graph rework that pushes all the user access bits
> > > > into return-to-user.  
> > > 
> > > Cool, I assume that's the SFRAME work? Are there pointers to work I
> > > could look at and think about what a rebase looks like? Or do you have
> > > someone in mind I should work with for this?  
> > 
> > I've been offline for a little while and still need to catch up with
> > things myself.
> > 
> > Josh was working on that when I dropped off IIRC, I'm not entirely sure
> > where things are at currently (and there is no way I can ever hope to
> > process the backlog).
> > 
> > Anybody know where we are with that?
> 
> It's still very much on my RADAR, but with layoffs and such, my
> priorities have unfortunately changed. I'm hoping to start helping out
> in the near future though (in a month or two).
> 
> Josh was working on it, but I think he got pulled off onto other
> priorities too :-p

Yeah, this is still a priority for me and I hope to get back to it over
the next few weeks (crosses fingers).

-- 
Josh



Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-18 Thread Haitao Huang

On Tue, 16 Apr 2024 08:22:06 -0500, Huang, Kai  wrote:


On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:

From: Kristen Carlson Accardi 

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Tested-by: Jarkko Sakkinen 


I don't see any big issue, so feel free to add:

Reviewed-by: Kai Huang 


Thanks


Nitpickings below:

[...]



--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include 
+#include 


It doesn't seem you need the above two here.

Probably they are needed in later patches; in that case we can move them to
the relevant patch(es) where they get used.

However I think it's better to explicitly include  since
kzalloc()/kfree() are used.

Btw, I am not sure whether you want to use  because looks
it contains a lot of unrelated staff.  Anyway I guess nobody cares.



I'll check and remove as needed.


+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;


The comment isn't necessary (sorry didn't notice before), because the code
is pretty clear saying that IMHO.



Was requested by Jarkko:
https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t


[...]



--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include 


I don't see why you need  here.  Also, ...


+#include 
+#include 
+
+#include "sgx.h"


... "sgx.h" already includes 

[...]


right



+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+   /* get_current_misc_cg() never returns NULL when Kconfig enabled */
+   return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}


I spent some time looking into this.  And yes, if I read the code
correctly, get_current_misc_cg() should never return NULL when the
Kconfig option is on.

I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.

[...]

Thanks for checking this; I did a similar analysis and agree with the
conclusion. I think this is confirmed also by Michal's description AFAICT:

"
The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist."




--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 


Is this needed?  I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?

[...]



right

[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on

yes

[...]
Thanks
Haitao



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Jason Xing
On Fri, Apr 19, 2024 at 6:29 AM Jason Xing  wrote:
>
> On Fri, Apr 19, 2024 at 2:51 AM Eric Dumazet  wrote:
> >
> > On Thu, Apr 18, 2024 at 6:24 PM Jason Xing  
> > wrote:
> > >
> > > On Thu, Apr 18, 2024 at 11:46 PM Jakub Kicinski  wrote:
> > > >
> > > > On Thu, 18 Apr 2024 11:30:02 +0800 Jason Xing wrote:
> > > > > I'm not sure why the patch series has been changed to 'Changes
> > > > > Requested', until now I don't think I need to change something.
> > > > >
> > > > > Should I repost this series (keeping the v6 tag) and then wait for
> > > > > more comments?
> > > >
> > > > If Eric doesn't like it - it's not getting merged.
> > >
> > > I'm not a native English speaker. If I understand correctly, it seems
> > > that Eric doesn't object to the patch series. Here is the quotation
> > > [1]:
> > > "If you feel the need to put them in a special group, this is fine by me."
> > >
> > > This rst reason mechanism can cover all the possible reasons for both
> > > TCP and MPTCP. We don't need to reinvent some definitions of reset
> > > reasons which are totally the same as drop reasons. Also, we don't
> > > need to reinvent something to cover MPTCP. If people are willing to
> > > contribute more rst reasons, they can find a good place.
> > >
> > > Reset is one big complicated 'issue' in production. I spent a lot of
> > > time handling all kinds of reset reasons daily. I'm apparently not the
> > > only one. I just want to make admins' lives easier, including me. This
> > > special/separate reason group is important because we can extend it in
> > > the future, which will not get confused.
> > >
> > > I hope it can have a chance to get merged :) Thank you.
> > >
> > > [1]: 
> > > https://lore.kernel.org/all/cann89i+alo_agyc8dr8dkfyi+6wpzcgrogysvgr8frfrvaa...@mail.gmail.com/
> > >
> > > Thanks,
> > > Jason
> >
> > My objection was these casts between enums. Especially if hiding with (u32)
>
> So I should explicitly cast it like this:
> tcp_v4_send_reset(rsk, skb, (enum sk_rst_reason)reason);
> ?
>
> >
> > I see no reason for adding these casts in TCP stack.
>
> Sorry, I don't know why the casts really make you so annoyed. But I
> still think it's not a bad way to handle all the cases for RST.
>
> Supposing we do not add an enum sk_rst_reason{}, passing drop reasons only
> works well in TCP for passive resets. For active reset cases (in the
> tcp_send_active_reset()), it's meaningless/confusing to insist on
> reusing the drop reason because I have to add some reset reasons (that
> are only used in RST cases) in the enum skb_drop_reason{}, which is
> really weird, in my view. The same problem exists in how to handle
> MPTCP. So I prefer putting them in a separate group like now. What do
> you think about it, right now?

The description in the previous email may be too long. The key point
is that we're not supporting only passive resets, right?

Thanks,
Jason



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Jason Xing
On Fri, Apr 19, 2024 at 2:51 AM Eric Dumazet  wrote:
>
> On Thu, Apr 18, 2024 at 6:24 PM Jason Xing  wrote:
> >
> > On Thu, Apr 18, 2024 at 11:46 PM Jakub Kicinski  wrote:
> > >
> > > On Thu, 18 Apr 2024 11:30:02 +0800 Jason Xing wrote:
> > > > I'm not sure why the patch series has been changed to 'Changes
> > > > Requested', until now I don't think I need to change something.
> > > >
> > > > Should I repost this series (keeping the v6 tag) and then wait for
> > > > more comments?
> > >
> > > If Eric doesn't like it - it's not getting merged.
> >
> > I'm not a native English speaker. If I understand correctly, it seems
> > that Eric doesn't object to the patch series. Here is the quotation
> > [1]:
> > "If you feel the need to put them in a special group, this is fine by me."
> >
> > This rst reason mechanism can cover all the possible reasons for both
> > TCP and MPTCP. We don't need to reinvent some definitions of reset
> > reasons which are totally the same as drop reasons. Also, we don't
> > need to reinvent something to cover MPTCP. If people are willing to
> > contribute more rst reasons, they can find a good place.
> >
> > Reset is one big complicated 'issue' in production. I spent a lot of
> > time handling all kinds of reset reasons daily. I'm apparently not the
> > only one. I just want to make admins' lives easier, including me. This
> > special/separate reason group is important because we can extend it in
> > the future, which will not get confused.
> >
> > I hope it can have a chance to get merged :) Thank you.
> >
> > [1]: 
> > https://lore.kernel.org/all/cann89i+alo_agyc8dr8dkfyi+6wpzcgrogysvgr8frfrvaa...@mail.gmail.com/
> >
> > Thanks,
> > Jason
>
> My objection was these casts between enums. Especially if hiding with (u32)

So I should explicitly cast it like this:
tcp_v4_send_reset(rsk, skb, (enum sk_rst_reason)reason);
?

>
> I see no reason for adding these casts in TCP stack.

Sorry, I don't know why the casts really make you so annoyed. But I
still think it's not a bad way to handle all the cases for RST.

Supposing we do not add an enum sk_rst_reason{}, passing drop reasons only
works well in TCP for passive resets. For active reset cases (in the
tcp_send_active_reset()), it's meaningless/confusing to insist on
reusing the drop reason because I have to add some reset reasons (that
are only used in RST cases) in the enum skb_drop_reason{}, which is
really weird, in my view. The same problem exists in how to handle
MPTCP. So I prefer putting them in a separate group like now. What do
you think about it, right now?

Thanks,
Jason



[PATCH] drivers: remoteproc: xlnx: Add Versal and Versal-NET support

2024-04-18 Thread Tanmay Shah
The AMD-Xilinx Versal platform is the successor of the ZynqMP platform.
The Real-time Processing Unit R5 cluster IP on Versal is the same as
on the ZynqMP platform. The power-domain IDs for the Versal platform
are different from ZynqMP.

The AMD-Xilinx Versal-NET platform is the successor of the Versal
platform. The Versal-NET Real-Time Processing Unit has two clusters
and each cluster contains dual-core ARM Cortex-R52 processors. Each
R52 core is assigned 128KB of TCM memory.

Signed-off-by: Tanmay Shah 
---
 drivers/remoteproc/xlnx_r5_remoteproc.c | 53 -
 1 file changed, 17 insertions(+), 36 deletions(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 7b1c12108bff..a6d8ac7394e7 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -300,36 +300,6 @@ static void zynqmp_r5_rproc_kick(struct rproc *rproc, int vqid)
dev_warn(dev, "failed to send message\n");
 }
 
-/*
- * zynqmp_r5_set_mode()
- *
- * set RPU cluster and TCM operation mode
- *
- * @r5_core: pointer to zynqmp_r5_core type object
- * @fw_reg_val: value expected by firmware to configure RPU cluster mode
- * @tcm_mode: value expected by fw to configure TCM mode (lockstep or split)
- *
- * Return: 0 for success and < 0 for failure
- */
-static int zynqmp_r5_set_mode(struct zynqmp_r5_core *r5_core,
- enum rpu_oper_mode fw_reg_val,
- enum rpu_tcm_comb tcm_mode)
-{
-   int ret;
-
-   ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val);
-   if (ret < 0) {
-   dev_err(r5_core->dev, "failed to set RPU mode\n");
-   return ret;
-   }
-
-   ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
-   if (ret < 0)
-   dev_err(r5_core->dev, "failed to configure TCM\n");
-
-   return ret;
-}
-
 /*
  * zynqmp_r5_rproc_start()
  * @rproc: single R5 core's corresponding rproc instance
@@ -941,7 +911,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster *cluster,
	/* Maintain backward compatibility for zynqmp by using hardcode TCM address. */
if (of_find_property(r5_core->np, "reg", NULL))
ret = zynqmp_r5_get_tcm_node_from_dt(cluster);
-   else
+   else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss"))
ret = zynqmp_r5_get_tcm_node(cluster);
 
if (ret) {
@@ -960,12 +930,21 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster *cluster,
return ret;
}
 
-   ret = zynqmp_r5_set_mode(r5_core, fw_reg_val, tcm_mode);
-   if (ret) {
-   dev_err(dev, "failed to set r5 cluster mode %d, err %d\n",
-   cluster->mode, ret);
+   ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val);
+   if (ret < 0) {
+   dev_err(r5_core->dev, "failed to set RPU mode\n");
return ret;
}
+
+   if (of_find_property(dev_of_node(dev), "xlnx,tcm-mode", NULL) ||
+   device_is_compatible(dev, "xlnx,zynqmp-r5fss")) {
+   ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id,
+  tcm_mode);
+   if (ret < 0) {
+   dev_err(r5_core->dev, "failed to configure TCM\n");
+   return ret;
+   }
+   }
}
 
return 0;
@@ -1022,7 +1001,7 @@ static int zynqmp_r5_cluster_init(struct zynqmp_r5_cluster *cluster)
ret = of_property_read_u32(dev_node, "xlnx,tcm-mode", (u32 *)&tcm_mode);
if (ret)
return ret;
-   } else {
+   } else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) {
if (cluster_mode == LOCKSTEP_MODE)
tcm_mode = PM_RPU_TCM_COMB;
else
@@ -1212,6 +1191,8 @@ static int zynqmp_r5_remoteproc_probe(struct platform_device *pdev)
 
 /* Match table for OF platform binding */
 static const struct of_device_id zynqmp_r5_remoteproc_match[] = {
+   { .compatible = "xlnx,versal-net-r52fss", },
+   { .compatible = "xlnx,versal-r5fss", },
{ .compatible = "xlnx,zynqmp-r5fss", },
{ /* end of list */ },
 };

base-commit: 912ebe48bec5927e2049e91b0e8a9cc682a709d2
-- 
2.25.1




Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport  wrote:
>
> On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > > >
> > > > > I'm looking at execmem_types more as definition of the consumers, 
> > > > > maybe I
> > > > > should have named the enum execmem_consumer at the first place.
> > > >
> > > > I think looking at execmem_type from consumers' point of view adds
> > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, 
> > > > kprobe,
> > > > and bpf (and maybe also module text) all have the same requirements.
> > > > Did I miss something?
> > >
> > > It's enough to have one architecture with different constraints for kprobes
> > > and bpf to warrant a type for each.
> >
> > AFAICT, some of these constraints can be changed without too much work.
>
> But why?
> I honestly don't understand what are you trying to optimize here. A few
> lines of initialization in execmem_info?

IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
harder for bpf and kprobe to share the same ROX page. In many use cases,
a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
module text. It is not efficient if we have to allocate separate pages for each
of these use cases. If this is not a problem, the current approach works.

Thanks,
Song



[ANNOUNCE] 5.10.214-rt106

2024-04-18 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.214-rt106 stable release.

This release is simply an update to the new stable 5.10.214 version and no
RT-specific changes have been performed.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: 3d208061796d4addeb543c78e0a4ec769b6ce6b2

Or to build 5.10.214-rt106 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.214.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.214-rt106.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Sean Christopherson
On Thu, Apr 18, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson 
> > >  wrote:
> > > > 
> > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  
> > > > > wrote:
> > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > > moment,
> > > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > > 'flush_on_ret'
> > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > lighter-weight and targetted TLBI in the architecture page-table 
> > > > > > code
> > > > > > when we actually update the ptes for small ranges.
> > > > > 
> > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > costly consequences).
> > > > > 
> > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > what needs to be invalidated and how -- specially on architectures
> > > > > that have a HW-broadcast facility like arm64.
> > > > 
> > > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > > simpler, more
> > > > straightforward solution would be to let architectures override 
> > > > flush_on_ret,
> > > > but I would prefer something like the below as x86 can also utilize a 
> > > > range-based
> > > > flush when running as a nested hypervisor.
> > 
> > ...
> > 
> > > I think this works for us on HW that has range invalidation, which
> > > would already be a positive move.
> > > 
> > > For the lesser HW that isn't range capable, it also gives the
> > > opportunity to perform the iteration ourselves or go for the nuclear
> > > option if the range is larger than some arbitrary constant (though
> > > this is additional work).
> > > 
> > > But this still considers the whole range as being affected by
> > > range->handler(). It'd be interesting to try and see whether more
> > > precise tracking is (or isn't) generally beneficial.
> > 
> > I assume the idea would be to let arch code do single-page invalidations of
> > stage-2 entries for each gfn?
> 
> Right, as it's the only code which knows which ptes actually ended up
> being aged.
> 
> > Unless I'm having a brain fart, x86 can't make use of that functionality.  
> > Intel
> > doesn't provide any way to do targeted invalidation of stage-2 mappings.  
> > AMD
> > provides an instruction to do broadcast invalidations, but it takes a 
> > virtual
> > address, i.e. a stage-1 address.  I can't tell if it's a host virtual 
> > address or
> > a guest virtual address, but it's a moot point because KVM doen't have the 
> > guest
> > virtual address, and if it's a host virtual address, there would need to be 
> > valid
> > mappings in the host page tables for it to work, which KVM can't guarantee.
> 
> Ah, so it sounds like it would need to be an arch opt-in then.

Even if x86 (or some other arch code) could use the precise tracking, I think it
would make sense to have the behavior be arch specific.  Adding infrastructure
to get information from arch code, only to turn around and give it back to arch
code would be odd.

Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
the best/easiest solution would be to let arm64 opt out of the common TLB flush
when a SPTE is made young.

With the range-based flushing bundled in, this?

---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 40 +---
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index afbc99264ffa..8fe5f5e16919 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+int kvm_arch_flush_tlb_if_young(void);
+
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38b498669ef9..5ebef8ef239c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,6 +595,11 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
+int __weak kvm_arch_flush_tlb_if_young(void)
+{
+   return true;
+}
+
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define 

Re: [RFC PATCH 3/7] module: [

2024-04-18 Thread Mike Rapoport
On Thu, Apr 18, 2024 at 10:31:16PM +0300, Nadav Amit wrote:
> 
> > On 18 Apr 2024, at 13:20, Mike Rapoport  wrote:
> > 
> > On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
> >> 
> >> 
> >> 
> >> I might be missing something, but it seems a bit racy.
> >> 
> >> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
> >> point, since you don’t hold the text_mutex, some might do text_poke(),
> >> e.g., by enabling/disabling static-key, and the update would be
> >> overwritten. No?
> > 
> > Right :(
> > Even worse, for UP case alternatives_smp_unlock() will "patch" still empty
> > area.
> > 
> > So I'm thinking about calling alternatives_smp_module_add() from an
> > additional callback after the execmem_update_copy().
> > 
> > Does it make sense to you?
> 
> Going over the code again - I might have just been wrong: I confused the
> alternatives and the jump-label mechanisms (as they do share a lot of
> code and characteristics).
> 
> The jump-labels are updated when prepare_coming_module() is called, which
> happens after post_relocation() [which means they would be updated using
> text_poke() “inefficiently” but should be safe].
> 
> The “alternatives” appear only to use text_poke() (in contrast to
> text_poke_early()) from a few very specific flows, e.g., 
> common_cpu_up() -> alternatives_enable_smp().
> 
> Do those flows pose a problem after boot?

Yes, common_cpu_up is called on CPU hotplug, so it's possible to have a
race between alternatives_smp_module_add() and common_cpu_up() ->
alternatives_enable_smp().

And in UP case alternatives_smp_module_add() will call
alternatives_smp_unlock() that will patch module text before it is updated.

> Anyhow, sorry for the noise.

On the contrary, I would have missed it.

-- 
Sincerely yours,
Mike.



Re: [RFC PATCH 3/7] module: [

2024-04-18 Thread Nadav Amit



> On 18 Apr 2024, at 13:20, Mike Rapoport  wrote:
> 
> On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
>> 
>> 
>> 
>> I might be missing something, but it seems a bit racy.
>> 
>> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
>> point, since you don’t hold the text_mutex, some might do text_poke(),
>> e.g., by enabling/disabling static-key, and the update would be
>> overwritten. No?
> 
> Right :(
> Even worse, for UP case alternatives_smp_unlock() will "patch" still empty
> area.
> 
> So I'm thinking about calling alternatives_smp_module_add() from an
> additional callback after the execmem_update_copy().
> 
> Does it make sense to you?

Going over the code again - I might have just been wrong: I confused the
alternatives and the jump-label mechanisms (as they do share a lot of
code and characteristics).

The jump-labels are updated when prepare_coming_module() is called, which
happens after post_relocation() [which means they would be updated using
text_poke() “inefficiently” but should be safe].

The “alternatives” appear only to use text_poke() (in contrast to
text_poke_early()) from a few very specific flows, e.g., 
common_cpu_up() -> alternatives_enable_smp().

Do those flows pose a problem after boot?

Anyhow, sorry for the noise.


[PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Andrii Nakryiko
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
that RCU is watching when trying to set up a rethook on a function entry.

One notable exception, in which we still force the rcu_is_watching()
check, is the CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case: there kretprobes
use the old-style int3-based workflow instead of relying on ftrace,
which makes the RCU watching check important to keep.

This further (in addition to improvements in the previous patch)
improves BPF multi-kretprobe (which relies on rethook) runtime throughput
by 2.3%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/rethook.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..a974605ad7a5 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!handler))
return NULL;
 
+#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
/*
 * This expects the caller will set up a rethook on a function entry.
 * When the function returns, the rethook will eventually be reclaimed
@@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
 */
if (unlikely(!rcu_is_watching()))
return NULL;
+#endif
 
	return (struct rethook_node *)objpool_pop(&rh->pool);
 }
-- 
2.43.0




[PATCH v4 1/2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-18 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level validation of ftrace subsystem invariants. For most users it
should probably be kept disabled to eliminate unnecessary runtime
overhead.

This improves BPF multi-kretprobe (relying on ftrace and rethook
infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..7aebd1b8f93e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace execution"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort of
+ protection against recursion. This option is only to verify that
+ ftrace (and other users of ftrace_test_recursion_trylock()) are not
+ called outside of RCU, as if they are, it can cause a race. But it
+ also has a noticeable overhead when enabled.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Eric Dumazet
On Thu, Apr 18, 2024 at 6:24 PM Jason Xing  wrote:
>
> On Thu, Apr 18, 2024 at 11:46 PM Jakub Kicinski  wrote:
> >
> > On Thu, 18 Apr 2024 11:30:02 +0800 Jason Xing wrote:
> > > I'm not sure why the patch series has been changed to 'Changes
> > > Requested'; so far I don't think I need to change anything.
> > >
> > > Should I repost this series (keeping the v6 tag) and then wait for
> > > more comments?
> >
> > If Eric doesn't like it - it's not getting merged.
>
I'm not a native English speaker. If I understand correctly, it seems
> that Eric doesn't object to the patch series. Here is the quotation
> [1]:
> "If you feel the need to put them in a special group, this is fine by me."
>
> This rst reason mechanism can cover all the possible reasons for both
> TCP and MPTCP. We don't need to reinvent some definitions of reset
> reasons which are totally the same as drop reasons. Also, we don't
> need to reinvent something to cover MPTCP. If people are willing to
> contribute more rst reasons, they can find a good place.
>
> Reset is one big complicated 'issue' in production. I spent a lot of
> time handling all kinds of reset reasons daily. I'm apparently not the
> only one. I just want to make admins' lives easier, including me. This
> special/separate reason group is important because we can extend it in
> the future without causing confusion.
>
> I hope it can have a chance to get merged :) Thank you.
>
> [1]: 
> https://lore.kernel.org/all/cann89i+alo_agyc8dr8dkfyi+6wpzcgrogysvgr8frfrvaa...@mail.gmail.com/
>
> Thanks,
> Jason

My objection was to these casts between enums, especially if hidden with (u32).

I see no reason for adding these casts in the TCP stack.

When I said "If you feel the need to put them in a special group, this
is fine by me.",
this was really about partitioning the existing enum into groups, if
you prefer having a group of 'RST reasons'.



Re: [PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Andrii Nakryiko
On Tue, Apr 9, 2024 at 3:48 PM Masami Hiramatsu  wrote:
>
> On Wed,  3 Apr 2024 15:03:28 -0700
> Andrii Nakryiko  wrote:
>
> > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> > that RCU is watching when trying to set up a rethook on a function entry.
> >
> > This further (in addition to improvements in the previous patch)
> > improves BPF multi-kretprobe (which relies on rethook) runtime throughput
> > by 2.3%, according to BPF benchmarks ([0]).
> >
> >   [0] 
> > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> >
>
> Hi Andrii,
>
> Can you make this part depends on !KPROBE_EVENTS_ON_NOTRACE (with this
> option, kretprobes can be used without ftrace, but with original int3) ?

Sorry for the late response, I was out on vacation. Makes sense about
KPROBE_EVENTS_ON_NOTRACE, I went with this condition:

#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)

Will send an updated revision shortly.

> This option should be set to N on production systems for safety,
> just for testing raw kretprobes.
>
> Thank you,
>
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  kernel/trace/rethook.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> > index fa03094e9e69..15b8aa4048d9 100644
> > --- a/kernel/trace/rethook.c
> > +++ b/kernel/trace/rethook.c
> > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >   if (unlikely(!handler))
> >   return NULL;
> >
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >   /*
> >* This expects the caller will set up a rethook on a function entry.
> >* When the function returns, the rethook will eventually be reclaimed
> > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >*/
> >   if (unlikely(!rcu_is_watching()))
> >   return NULL;
> > +#endif
> >
> >   return (struct rethook_node *)objpool_pop(&rh->pool);
> >  }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-18 Thread Andrii Nakryiko
On Mon, Apr 15, 2024 at 1:25 AM Jiri Olsa  wrote:
>
> On Tue, Apr 02, 2024 at 11:33:00AM +0200, Jiri Olsa wrote:
>
> SNIP
>
> >  #include 
> >  #include 
> > @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe 
> > *auprobe, struct insn *insn, bool
> >  }
> >
> >  #ifdef CONFIG_X86_64
> > +
> > +asm (
> > + ".pushsection .rodata\n"
> > + ".global uretprobe_syscall_entry\n"
> > + "uretprobe_syscall_entry:\n"
> > + "pushq %rax\n"
> > + "pushq %rcx\n"
> > + "pushq %r11\n"
> > + "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> > + "syscall\n"
> > + "popq %r11\n"
> > + "popq %rcx\n"
> > +
> > + /* The uretprobe syscall replaces stored %rax value with final
> > +  * return address, so we don't restore %rax in here and just
> > +  * call ret.
> > +  */
> > + "retq\n"
> > + ".global uretprobe_syscall_end\n"
> > + "uretprobe_syscall_end:\n"
> > + ".popsection\n"
> > +);
> > +
> > +extern u8 uretprobe_syscall_entry[];
> > +extern u8 uretprobe_syscall_end[];
> > +
> > +void *arch_uprobe_trampoline(unsigned long *psize)
> > +{
> > + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> > + return uretprobe_syscall_entry;
>
> fyi I realized this screws 32-bit programs, we either need to add
> compat trampoline, or keep the standard breakpoint for them:
>
> +   struct pt_regs *regs = task_pt_regs(current);
> +   static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> +
> +   if (user_64bit_mode(regs)) {
> +   *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> +   return uretprobe_syscall_entry;
> +   }
> +
> +   *psize = UPROBE_SWBP_INSN_SIZE;
> +   return &insn;
>
>
> not sure it's worth the effort to add the trampoline, I'll check
>

32-bit arch isn't a high-performance target anyway, so I'd probably
not bother and prioritize simplicity and long term maintenance.

>
> jirka



Re: [PATCH 2/4] iio: light: stk3310: Implement vdd supply and power it off during suspend

2024-04-18 Thread Andy Shevchenko
On Thu, Apr 18, 2024 at 8:50 PM Aren  wrote:
> On Thu, Apr 18, 2024 at 06:56:09PM +0300, Andy Shevchenko wrote:
> > On Thu, Apr 18, 2024 at 6:06 PM Aren  wrote:
> > > On Mon, Apr 15, 2024 at 05:04:53PM +0300, Andy Shevchenko wrote:
> > > > On Sun, Apr 14, 2024 at 8:57 PM Aren Moynihan  
> > > > wrote:

...

> > > > I forgot to check the order of freeing resources, be sure you have no
> > > > devm_*() releases happening before this call.
> > >
> > > If I understand what you're saying, this should be fine. The driver just
> > > uses devm to clean up acquired resources after remove is called. Or am I
> > > missing something and resources could be freed before calling
> > > stk3310_remove?
> >
> > I'm not objecting to that. The point here is that the resources should
> > be freed in the reversed order. devm-allocated resources are deferred
> > to be freed after the explicit driver ->remove() callback. At the end
> > it should not interleave with each other, i.o.w. it should be
> > probe: devm followed by non-devm
> > remove: non-devm only.
>
> I think what you're describing is already the case, with the exception
> of parts of the probe function not changed in this patch mixing
> acquiring resources through devm with configuring the device.

Okay, then we are fine!

> I hope I'm not being dense, thanks for the clarification

-- 
With Best Regards,
Andy Shevchenko



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Mike Rapoport
On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > >
> > > > I'm looking at execmem_types more as definition of the consumers, maybe 
> > > > I
> > > > should have named the enum execmem_consumer in the first place.
> > >
> > > I think looking at execmem_type from consumers' point of view adds
> > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > > and bpf (and maybe also module text) all have the same requirements.
> > > Did I miss something?
> >
> > It's enough to have one architecture with different constraints for kprobes
> > and bpf to warrant a type for each.
> 
> AFAICT, some of these constraints can be changed without too much work.

But why?
I honestly don't understand what you are trying to optimize here. A few
lines of initialization in execmem_info?
What is the advantage in forcing architectures to have imposed limits on
kprobes or bpf allocations?

> > Where do you see unnecessary complexity?
> >
> > > IOW, we have
> > >
> > > enum execmem_type {
> > > EXECMEM_DEFAULT,
> > > EXECMEM_TEXT,
> > > EXECMEM_KPROBES = EXECMEM_TEXT,
> > > EXECMEM_FTRACE = EXECMEM_TEXT,
> > > EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> > > _KPROBE, _FTRACE, _BPF */
> > > EXECMEM_DATA,  /* rw */
> > > EXECMEM_RO_DATA,
> > > EXECMEM_RO_AFTER_INIT,
> > > EXECMEM_TYPE_MAX,
> > > };
> > >
> > > Does this make sense?
> >
> > How do you suggest to deal with e.g. riscv that has separate address spaces
> > for modules, kprobes and bpf?
> 
> IIUC, modules and bpf use the same address space on riscv

Not exactly, bpf is a subset of modules on riscv.

> while kprobes use vmalloc address.

The whole point of using the entire vmalloc for kprobes is to avoid
pollution of the limited modules space.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.



Re: [PATCH 2/4] iio: light: stk3310: Implement vdd supply and power it off during suspend

2024-04-18 Thread Aren
On Thu, Apr 18, 2024 at 06:56:09PM +0300, Andy Shevchenko wrote:
> On Thu, Apr 18, 2024 at 6:06 PM Aren  wrote:
> > On Mon, Apr 15, 2024 at 05:04:53PM +0300, Andy Shevchenko wrote:
> > > On Sun, Apr 14, 2024 at 8:57 PM Aren Moynihan  
> > > wrote:
> 
> ...
> 
> > > > stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY);
> > > > +   if (data->vdd_reg)
> > > > +   regulator_disable(data->vdd_reg);
> > >
> > > I forgot to check the order of freeing resources, be sure you have no
> > > devm_*() releases happening before this call.
> >
> > If I understand what you're saying, this should be fine. The driver just
> > uses devm to clean up acquired resources after remove is called. Or am I
> > missing something and resources could be freed before calling
> > stk3310_remove?
> 
> I'm not objecting to that. The point here is that the resources should
> be freed in the reversed order. devm-allocated resources are deferred
> to be freed after the explicit driver ->remove() callback. At the end
> it should not interleave with each other, i.o.w. it should be
> probe: devm followed by non-devm
> remove: non-devm only.

I think what you're describing is already the case, with the exception
of parts of the probe function not changed in this patch mixing
acquiring resources through devm with configuring the device.

I hope I'm not being dense, thanks for the clarification
 - Aren



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Jason Xing
On Thu, Apr 18, 2024 at 11:46 PM Jakub Kicinski  wrote:
>
> On Thu, 18 Apr 2024 11:30:02 +0800 Jason Xing wrote:
> > I'm not sure why the patch series has been changed to 'Changes
> > Requested'; so far I don't think I need to change anything.
> >
> > Should I repost this series (keeping the v6 tag) and then wait for
> > more comments?
>
> If Eric doesn't like it - it's not getting merged.

I'm not a native English speaker. If I understand correctly, it seems
that Eric doesn't object to the patch series. Here is the quotation
[1]:
"If you feel the need to put them in a special group, this is fine by me."

This rst reason mechanism can cover all the possible reasons for both
TCP and MPTCP. We don't need to reinvent some definitions of reset
reasons which are totally the same as drop reasons. Also, we don't
need to reinvent something to cover MPTCP. If people are willing to
contribute more rst reasons, they can find a good place.

Reset is one big complicated 'issue' in production. I spent a lot of
time handling all kinds of reset reasons daily. I'm apparently not the
only one. I just want to make admins' lives easier, including me. This
special/separate reason group is important because we can extend it in
the future without causing confusion.

I hope it can have a chance to get merged :) Thank you.

[1]: 
https://lore.kernel.org/all/cann89i+alo_agyc8dr8dkfyi+6wpzcgrogysvgr8frfrvaa...@mail.gmail.com/

Thanks,
Jason



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
>
[...]
> >
> > Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> > really need
> > EXECMEM_ANYWHERE below?
> >
> > > >
> > > > * I'm not sure about BPF's requirements; it seems happy doing the same 
> > > > as
> > > >   modules.
> > >
> > > BPF are happy with vmalloc().
> > >
> > > > So if we *must* use a common execmem allocator, what we'd really want is 
> > > > our own
> > > > types, e.g.
> > > >
> > > >   EXECMEM_ANYWHERE
> > > >   EXECMEM_NOPLT
> > > >   EXECMEM_PREL32
> > > >
> > > > ... and then we use those in arch code to implement module_alloc() and 
> > > > friends.
> > >
> > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > should have named the enum execmem_consumer in the first place.
> >
> > I think looking at execmem_type from consumers' point of view adds
> > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > and bpf (and maybe also module text) all have the same requirements.
> > Did I miss something?
>
> It's enough to have one architecture with different constraints for kprobes
> and bpf to warrant a type for each.
>

AFAICT, some of these constraints can be changed without too much work.

> Where do you see unnecessary complexity?
>
> > IOW, we have
> >
> > enum execmem_type {
> > EXECMEM_DEFAULT,
> > EXECMEM_TEXT,
> > EXECMEM_KPROBES = EXECMEM_TEXT,
> > EXECMEM_FTRACE = EXECMEM_TEXT,
> > EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> > _KPROBE, _FTRACE, _BPF */
> > EXECMEM_DATA,  /* rw */
> > EXECMEM_RO_DATA,
> > EXECMEM_RO_AFTER_INIT,
> > EXECMEM_TYPE_MAX,
> > };
> >
> > Does this make sense?
>
> How do you suggest to deal with e.g. riscv that has separate address spaces
> for modules, kprobes and bpf?

IIUC, modules and bpf use the same address space on riscv, while kprobes use
vmalloc address. I haven't tried this yet, but I think we can let
kprobes use the
same space as modules and bpf, which is:

 | ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF

Did I get this right?

Thanks,
Song



Re: [PATCH 2/4] iio: light: stk3310: Implement vdd supply and power it off during suspend

2024-04-18 Thread Andy Shevchenko
On Thu, Apr 18, 2024 at 6:06 PM Aren  wrote:
> On Mon, Apr 15, 2024 at 05:04:53PM +0300, Andy Shevchenko wrote:
> > On Sun, Apr 14, 2024 at 8:57 PM Aren Moynihan  
> > wrote:

...

> > > stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY);
> > > +   if (data->vdd_reg)
> > > +   regulator_disable(data->vdd_reg);
> >
> > I forgot to check the order of freeing resources, be sure you have no
> > devm_*() releases happening before this call.
>
> If I understand what you're saying, this should be fine. The driver just
> uses devm to clean up acquired resources after remove is called. Or am I
> missing something and resources could be freed before calling
> stk3310_remove?

I'm not objecting to that. The point here is that the resources should
be freed in the reversed order. devm-allocated resources are deferred
to be freed after the explicit driver ->remove() callback. At the end
it should not interleave with each other, i.o.w. it should be
probe: devm followed by non-devm
remove: non-devm only.

-- 
With Best Regards,
Andy Shevchenko



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Jakub Kicinski
On Thu, 18 Apr 2024 11:30:02 +0800 Jason Xing wrote:
> I'm not sure why the patch series has been changed to 'Changes
> Requested'; so far I don't think I need to change anything.
> 
> Should I repost this series (keeping the v6 tag) and then wait for
> more comments?

If Eric doesn't like it - it's not getting merged.



Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-18 Thread Mike Rapoport
Hi Masami,

On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> Hi Mike,
> 
> On Thu, 11 Apr 2024 19:00:50 +0300
> Mike Rapoport  wrote:
> 
> > From: "Mike Rapoport (IBM)" 
> > 
> > kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > code.
> > 
> > Since code allocations are now implemented with execmem, kprobes can be
> > enabled in non-modular kernels.
> > 
> > Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> > modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > dependency of CONFIG_KPROBES on CONFIG_MODULES.
> 
> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> function body? We have enough dummy functions for that, so it should
> not make a problem.

I'll rebase and will try to reduce ifdefery where possible.

> Thank you,
> 
> > 
> > Signed-off-by: Mike Rapoport (IBM) 
> > ---
> >  arch/Kconfig|  2 +-
> >  kernel/kprobes.c| 43 +
> >  kernel/trace/trace_kprobe.c | 11 ++
> >  3 files changed, 37 insertions(+), 19 deletions(-)
> > 
> 
> -- 
> Masami Hiramatsu

-- 
Sincerely yours,
Mike.



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Mike Rapoport
On Wed, Apr 17, 2024 at 04:32:49PM -0700, Song Liu wrote:
> On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport  wrote:
> >
> > On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> > > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > > > +/**
> > > > > + * enum execmem_type - types of executable memory ranges
> > > > > + *
> > > > > + * There are several subsystems that allocate executable memory.
> > > > > + * Architectures define different restrictions on placement,
> > > > > + * permissions, alignment and other parameters for memory that can 
> > > > > be used
> > > > > + * by these subsystems.
> > > > > + * Types in this enum identify subsystems that allocate executable 
> > > > > memory
> > > > > + * and let architectures define parameters for ranges suitable for
> > > > > + * allocations by each subsystem.
> > > > > + *
> > > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types 
> > > > > that
> > > > > + * are not explicitly defined.
> > > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > > > + * @EXECMEM_BPF: parameters for BPF
> > > > > + * @EXECMEM_TYPE_MAX:
> > > > > + */
> > > > > +enum execmem_type {
> > > > > + EXECMEM_DEFAULT,
> > > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > > > + EXECMEM_KPROBES,
> > > > > + EXECMEM_FTRACE,
> > > > > + EXECMEM_BPF,
> > > > > + EXECMEM_TYPE_MAX,
> > > > > +};
> > > >
> > > > Can we please get a break-down of how all these types are actually
> > > > different from one another?
> > > >
> > > > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > > mind) and has less strict placement constraints for some of them?
> > >
> > > Yeah, and really I'd *much* rather deal with that in arch code, as I have 
> > > said
> > > several times.
> > >
> > > For arm64 we have two basic restrictions:
> > >
> > > 1) Direct branches can go +/-128M
> > >We can expand this range by having direct branches go to PLTs, at a
> > >performance cost.
> > >
> > > 2) PREL32 relocations can go +/-2G
> > >We cannot expand this further.
> > >
> > > * We don't need to allocate memory for ftrace. We do not use trampolines.
> > >
> > > * Kprobes XOL areas don't care about either of those; we don't place any
> > >   PC-relative instructions in those. Maybe we want to in future.
> > >
> > > * Modules care about both; we'd *prefer* to place them within +/-128M of 
> > > all
> > >   other kernel/module code, but if there's no space we can use PLTs and 
> > > expand
> > >   that to +/-2G. Since modules can reference other modules, that ends up
> > >   actually being halved, and modules have to fit within some 2G window 
> > > that
> > >   also covers the kernel.
> 
> Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> really need
> EXECMEM_ANYWHERE below?
> 
> > >
> > > * I'm not sure about BPF's requirements; it seems happy doing the same as
> > >   modules.
> >
> > BPF are happy with vmalloc().
> >
> > > So if we *must* use a common execmem allocator, what we'd really want is 
> > > our own
> > > types, e.g.
> > >
> > >   EXECMEM_ANYWHERE
> > >   EXECMEM_NOPLT
> > >   EXECMEM_PREL32
> > >
> > > ... and then we use those in arch code to implement module_alloc() and 
> > > friends.
> >
> > I'm looking at execmem_types more as definition of the consumers, maybe I
> should have named the enum execmem_consumer in the first place.
> 
> I think looking at execmem_type from consumers' point of view adds
> unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> and bpf (and maybe also module text) all have the same requirements.
> Did I miss something?

It's enough to have one architecture with different constraints for kprobes
and bpf to warrant a type for each.

Where do you see unnecessary complexity?
 
> IOW, we have
> 
> enum execmem_type {
> EXECMEM_DEFAULT,
> EXECMEM_TEXT,
> EXECMEM_KPROBES = EXECMEM_TEXT,
> EXECMEM_FTRACE = EXECMEM_TEXT,
> EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> _KPROBE, _FTRACE, _BPF */
> EXECMEM_DATA,  /* rw */
> EXECMEM_RO_DATA,
> EXECMEM_RO_AFTER_INIT,
> EXECMEM_TYPE_MAX,
> };
> 
> Does this make sense?
 
How do you suggest to deal with e.g. riscv that has separate address spaces
for modules, kprobes and bpf?

> Thanks,
> Song

-- 
Sincerely yours,
Mike.



Re: [PATCH 2/4] iio: light: stk3310: Implement vdd supply and power it off during suspend

2024-04-18 Thread Aren
On Mon, Apr 15, 2024 at 05:04:53PM +0300, Andy Shevchenko wrote:
> On Sun, Apr 14, 2024 at 8:57 PM Aren Moynihan  wrote:
> >
> > From: Ondrej Jirman 
> >
> > VDD power input can be used to completely power off the chip during
> > system suspend. Do so if available.
> 
> ...
> 
> >  #include 
> >  #include 
> >  #include 
> 
> > +#include 
> 
> Move it to be ordered and add a blank line to separate iio/*.h group.
> 
> ...
> 
> > +   data->vdd_reg = devm_regulator_get_optional(&client->dev, "vdd");
> > +   if (IS_ERR(data->vdd_reg)) {
> > +   ret = PTR_ERR(data->vdd_reg);
> > +   if (ret == -ENODEV)
> > +   data->vdd_reg = NULL;
> 
> > +   else
> 
> Redundant 'else' when you follow the pattern "check for error condition 
> first".
> 
> > +   return dev_err_probe(&client->dev, ret,
> > +"get regulator vdd failed\n");
> > +   }
> 
> ...
> 
> > +   if (data->vdd_reg) {
> > +   ret = regulator_enable(data->vdd_reg);
> > +   if (ret)
> > +   return dev_err_probe(&client->dev, ret,
> > +"regulator vdd enable failed\n");
> > +
> > +   usleep_range(1000, 2000);
> 
> fsleep()
> 
> > +   }
> 
> ...
> 
> > stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY);
> > +   if (data->vdd_reg)
> > +   regulator_disable(data->vdd_reg);
> 
> I forgot to check the order of freeing resources, be sure you have no
> devm_*() releases happening before this call.

If I understand what you're saying, this should be fine. The driver just
uses devm to clean up acquired resources after remove is called. Or am I
missing something and resources could be freed before calling
stk3310_remove?

> ...
> 
> > +   usleep_range(1000, 2000);
> 
> fsleep()

Everything else makes sense, I'll include those in v2 along with a patch
to switch stk3310_init to dev_err_probe.

Thanks for taking the time to review
 - Aren



Re: [PATCH net-next v2 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Aleksandr Mikhalitsyn
On Thu, Apr 18, 2024 at 3:23 PM Julian Anastasov  wrote:
>
>
> Hello,

Dear Julian,

>
> On Thu, 18 Apr 2024, Alexander Mikhalitsyn wrote:
>
> > Cc: Julian Anastasov 
> > Cc: Simon Horman 
> > Cc: Pablo Neira Ayuso 
> > Cc: Jozsef Kadlecsik 
> > Cc: Florian Westphal 
> > Suggested-by: Julian Anastasov 
> > Signed-off-by: Alexander Mikhalitsyn 
> > ---
> >  net/netfilter/ipvs/ip_vs_ctl.c | 12 +++-
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > index 143a341bbc0a..daa62b8b2dd1 100644
> > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
>
> > @@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
> >   /* si_swapinfo(); */
> >   /* availmem = availmem - (i.totalswap - i.freeswap); */
> >
> > - nomem = (availmem < ipvs->sysctl_amemthresh);
> > + amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> > + nomem = (availmem < amemthresh);
> >
> >   local_bh_disable();
> >
> > @@ -146,8 +148,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
> >   case 1:
> >   if (nomem) {
> >   ipvs->drop_rate = ipvs->drop_counter
> > - = ipvs->sysctl_amemthresh /
> > - (ipvs->sysctl_amemthresh-availmem);
> > + = amemthresh /
> > + (amemthresh-availmem);
>
> Thanks, both patches look ok except that the old styling
> is showing warnings for this patch:
>
> scripts/checkpatch.pl --strict /tmp/file1.patch
>
> It would be great if you silence them somehow in v3...

Yeah, I have fixed this in v3. Also, I had to split multiple assignments
into different lines because of:
>CHECK: multiple assignments should be avoided

Now everything looks fine.

>
> BTW, est_cpulist is masked with current->cpus_mask of the
> sysctl writer process, if that is of any help. That is why I skipped
> it but lets keep it read-only for now...

That's a good point! Probably I'm too conservative ;-)

>
> >   ipvs->sysctl_drop_packet = 2;
> >   } else {
> >   ipvs->drop_rate = 0;
> > @@ -156,8 +158,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
> >   case 2:
> >   if (nomem) {
> >   ipvs->drop_rate = ipvs->drop_counter
> > - = ipvs->sysctl_amemthresh /
> > - (ipvs->sysctl_amemthresh-availmem);
> > + = amemthresh /
> > + (amemthresh-availmem);
> >   } else {
> >   ipvs->drop_rate = 0;
> >   ipvs->sysctl_drop_packet = 1;
>
> Regards

Kind regards,
Alex

>
> --
> Julian Anastasov 
>



[PATCH net-next v3 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-04-18 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls writable even when the
network namespace is owned by a non-initial user namespace.

A few sysctls remain read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice

I'm trying to be conservative here to avoid introducing
any security issues. We may be able to make more sysctls
writable, but let's do that on demand, once we see a real
use case.

This patch is motivated by a user request in the LXC
project [1]. It helps with running some Kubernetes [2]
or Docker Swarm [3] workloads inside system containers.

Link: https://github.com/lxc/lxc/issues/4278 [1]
Link: 
https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
 [2]
Link: 
https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682
 [3]

Cc: Stéphane Graber 
Cc: Christian Brauner 
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 32be24f0d4e4..c3ba71aa2654 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4270,6 +4270,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
	struct ctl_table *tbl;
	int idx, ret;
	size_t ctl_table_size = ARRAY_SIZE(vs_vars);
+	bool unpriv = net->user_ns != &init_user_ns;
 
	atomic_set(&ipvs->dropentry, 0);
	spin_lock_init(&ipvs->dropentry_lock);
@@ -4284,12 +4285,6 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
		tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
		if (tbl == NULL)
			return -ENOMEM;
-
-		/* Don't export sysctls to unprivileged users */
-		if (net->user_ns != &init_user_ns) {
-			tbl[0].procname = NULL;
-			ctl_table_size = 0;
-		}
	} else
tbl = vs_vars;
/* Initialize sysctl defaults */
@@ -4315,10 +4310,17 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
	ipvs->sysctl_sync_ports = 1;
	tbl[idx++].data = &ipvs->sysctl_sync_ports;
	tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
+
	ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
+
	ipvs->sysctl_sync_sock_size = 0;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
+
	tbl[idx++].data = &ipvs->sysctl_cache_bypass;
	tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
	tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
@@ -4341,15 +4343,22 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
	tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
	tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
+
	ipvs->sysctl_run_estimation = 1;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx].extra2 = ipvs;
	tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
	ipvs->est_cpulist_valid = 0;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx].extra2 = ipvs;
	tbl[idx++].data = &ipvs->sysctl_est_cpulist;
 
	ipvs->sysctl_est_nice = IPVS_EST_NICE;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx].extra2 = ipvs;
	tbl[idx++].data = &ipvs->sysctl_est_nice;
 
-- 
2.34.1




[PATCH net-next v3 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Alexander Mikhalitsyn
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Suggested-by: Julian Anastasov 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..32be24f0d4e4 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 {
struct sysinfo i;
int availmem;
+   int amemthresh;
int nomem;
int to_change = -1;
 
@@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
/* si_swapinfo(); */
/* availmem = availmem - (i.totalswap - i.freeswap); */
 
-   nomem = (availmem < ipvs->sysctl_amemthresh);
+   amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
+   nomem = (availmem < amemthresh);
 
local_bh_disable();
 
@@ -145,9 +147,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
break;
case 1:
if (nomem) {
-   ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+			ipvs->drop_counter = amemthresh / (amemthresh - availmem);
+   ipvs->drop_rate = ipvs->drop_counter;
ipvs->sysctl_drop_packet = 2;
} else {
ipvs->drop_rate = 0;
@@ -155,9 +156,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
break;
case 2:
if (nomem) {
-   ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+			ipvs->drop_counter = amemthresh / (amemthresh - availmem);
+   ipvs->drop_rate = ipvs->drop_counter;
} else {
ipvs->drop_rate = 0;
ipvs->sysctl_drop_packet = 1;
-- 
2.34.1




[PATCH] ipvs: Fix checksumming on GSO of SCTP packets

2024-04-18 Thread Ismael Luceno
It was observed in the wild that pairs of consecutive packets would leave
the IPVS with the same wrong checksum, and the issue only went away when
disabling GSO.

IPVS needs to avoid computing the SCTP checksum when using GSO.

Co-developed-by: Firo Yang 
Signed-off-by: Ismael Luceno 
Tested-by: Andreas Taschner 
CC: Michal Kubeček 
CC: Simon Horman 
CC: Julian Anastasov 
CC: lvs-de...@vger.kernel.org
CC: netfilter-de...@vger.kernel.org
CC: net...@vger.kernel.org
CC: coret...@netfilter.org
---
 net/netfilter/ipvs/ip_vs_proto_sctp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c b/net/netfilter/ipvs/ip_vs_proto_sctp.c
index a0921adc31a9..3205b45ce161 100644
--- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
+++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
@@ -126,7 +126,8 @@ sctp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp,
if (sctph->source != cp->vport || payload_csum ||
skb->ip_summed == CHECKSUM_PARTIAL) {
sctph->source = cp->vport;
-   sctp_nat_csum(skb, sctph, sctphoff);
+   if (!skb_is_gso_sctp(skb))
+   sctp_nat_csum(skb, sctph, sctphoff);
} else {
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
@@ -174,7 +175,8 @@ sctp_dnat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp,
(skb->ip_summed == CHECKSUM_PARTIAL &&
 !(skb_dst(skb)->dev->features & NETIF_F_SCTP_CRC))) {
sctph->dest = cp->dport;
-   sctp_nat_csum(skb, sctph, sctphoff);
+   if (!skb_is_gso_sctp(skb))
+   sctp_nat_csum(skb, sctph, sctphoff);
} else if (skb->ip_summed != CHECKSUM_PARTIAL) {
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
-- 
2.43.0




Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Will Deacon
On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson  
> > wrote:
> > > 
> > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > would be really great to change the .clear_flush_young() callback so
> > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > moment,
> > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > 'flush_on_ret'
> > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > lighter-weight and targetted TLBI in the architecture page-table code
> > > > > when we actually update the ptes for small ranges.
> > > > 
> > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > costly consequences).
> > > > 
> > > > In general, it feels like the TLB invalidation should stay with the
> > > > code that deals with the page tables, as it has a pretty good idea of
> > > > what needs to be invalidated and how -- specially on architectures
> > > > that have a HW-broadcast facility like arm64.
> > > 
> > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > simpler, more
> > > straightforward solution would be to let architectures override 
> > > flush_on_ret,
> > > but I would prefer something like the below as x86 can also utilize a 
> > > range-based
> > > flush when running as a nested hypervisor.
> 
> ...
> 
> > I think this works for us on HW that has range invalidation, which
> > would already be a positive move.
> > 
> > For the lesser HW that isn't range capable, it also gives the
> > opportunity to perform the iteration ourselves or go for the nuclear
> > option if the range is larger than some arbitrary constant (though
> > this is additional work).
> > 
> > But this still considers the whole range as being affected by
> > range->handler(). It'd be interesting to try and see whether more
> > precise tracking is (or isn't) generally beneficial.
> 
> I assume the idea would be to let arch code do single-page invalidations of
> stage-2 entries for each gfn?

Right, as it's the only code which knows which ptes actually ended up
being aged.

> Unless I'm having a brain fart, x86 can't make use of that functionality.  
> Intel
> doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> provides an instruction to do broadcast invalidations, but it takes a virtual
> address, i.e. a stage-1 address.  I can't tell if it's a host virtual address 
> or
> a guest virtual address, but it's a moot point because KVM doen't have the 
> guest
> virtual address, and if it's a host virtual address, there would need to be 
> valid
> mappings in the host page tables for it to work, which KVM can't guarantee.

Ah, so it sounds like it would need to be an arch opt-in then.

Will



[PATCH net-next v6 7/7] rstreason: make it work in trace world

2024-04-18 Thread Jason Xing
From: Jason Xing 

Finally, make it work by introducing this reset reason in the
trace world.

One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED

Signed-off-by: Jason Xing 
Reviewed-by: Steven Rostedt (Google) 
---
 include/trace/events/tcp.h | 37 +
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/tcp_ipv6.c|  2 +-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 5c04a61a11c2..b1455cbc0634 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * tcp event with arguments sk and skb
@@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
TP_ARGS(sk, skb)
 );
 
+#undef FN1
+#define FN1(reason)TRACE_DEFINE_ENUM(SK_RST_REASON_##reason);
+#undef FN2
+#define FN2(reason)TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
+DEFINE_RST_REASON(FN1, FN1)
+
+#undef FN1
+#undef FNe1
+#define FN1(reason){ SK_RST_REASON_##reason, #reason },
+#define FNe1(reason)   { SK_RST_REASON_##reason, #reason }
+
+#undef FN2
+#undef FNe2
+#define FN2(reason){ SKB_DROP_REASON_##reason, #reason },
+#define FNe2(reason)   { SKB_DROP_REASON_##reason, #reason }
 /*
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
 TRACE_EVENT(tcp_send_reset,
 
-   TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
+   TP_PROTO(const struct sock *sk,
+const struct sk_buff *skb,
+const enum sk_rst_reason reason),
 
-   TP_ARGS(sk, skb),
+   TP_ARGS(sk, skb, reason),
 
TP_STRUCT__entry(
__field(const void *, skbaddr)
__field(const void *, skaddr)
__field(int, state)
+   __field(enum sk_rst_reason, reason)
__array(__u8, saddr, sizeof(struct sockaddr_in6))
__array(__u8, daddr, sizeof(struct sockaddr_in6))
),
@@ -113,14 +132,24 @@ TRACE_EVENT(tcp_send_reset,
 */
			TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, entry->saddr);
}
+   __entry->reason = reason;
),
 
-	TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+	TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s reason=%s",
		  __entry->skbaddr, __entry->skaddr,
		  __entry->saddr, __entry->daddr,
-		  __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN")
+		  __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN",
+		  __entry->reason < RST_REASON_START ?
+			__print_symbolic(__entry->reason, DEFINE_DROP_REASON(FN2, FNe2)) :
+			__print_symbolic(__entry->reason, DEFINE_RST_REASON(FN1, FNe1)))
 );
 
+#undef FN1
+#undef FNe1
+
+#undef FN2
+#undef FNe2
+
 /*
  * tcp event with arguments sk
  *
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d78412cf8566..461b4d2b7cfe 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
if (sk)
arg.bound_dev_if = sk->sk_bound_dev_if;
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 276d9d541b01..b08ffb17d5a0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3612,7 +3612,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority,
/* skb of trace_tcp_send_reset() keeps the skb that caused RST,
 * skb here is different to the troublesome skb, so use NULL
 */
-   trace_tcp_send_reset(sk, NULL);
+   trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED);
 }
 
 /* Send a crossed SYN-ACK during socket establishment.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c46095fb596c..6a4736ec3df0 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1133,7 +1133,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb,
label = ip6_flowlabel(ipv6h);
}
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
-- 
2.37.3




[PATCH net-next v6 6/7] mptcp: introducing a helper into active reset logic

2024-04-18 Thread Jason Xing
From: Jason Xing 

Since every MPTCP reset reason definition is mapped in
enum sk_rst_reason, introduce a new helper to cover the
places where subflow->reset_reason has already been set.

Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
SK_RST_REASON_MPTCP_RST_EUNSPEC: both mean "unknown".
So we can convert it directly.

Suggested-by: Paolo Abeni 
Signed-off-by: Jason Xing 
---
Link: 
https://lore.kernel.org/all/2d3ea199eef53cf6a0c48e21abdee0eefbdee927.ca...@redhat.com/
---
 net/mptcp/protocol.c |  4 +---
 net/mptcp/protocol.h | 11 +++
 net/mptcp/subflow.c  |  6 ++
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 065967086492..4b13ca362efa 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -21,7 +21,6 @@
 #endif
 #include 
 #include 
-#include 
 #include 
 #include "protocol.h"
 #include "mib.h"
@@ -2570,8 +2569,7 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 
slow = lock_sock_fast(tcp_sk);
if (tcp_sk->sk_state != TCP_CLOSE) {
-   tcp_send_active_reset(tcp_sk, GFP_ATOMIC,
- SK_RST_REASON_NOT_SPECIFIED);
+   mptcp_send_active_reset_reason(tcp_sk);
tcp_set_state(tcp_sk, TCP_CLOSE);
}
unlock_sock_fast(tcp_sk, slow);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index fdfa843e2d88..82ef2f42a1bc 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mptcp_pm_gen.h"
 
@@ -581,6 +582,16 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context *subflow)
WRITE_ONCE(subflow->local_id, -1);
 }
 
+static inline void
+mptcp_send_active_reset_reason(struct sock *sk)
+{
+   struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+   enum sk_rst_reason reason;
+
+   reason = convert_mptcp_reason(subflow->reset_reason);
+   tcp_send_active_reset(sk, GFP_ATOMIC, reason);
+}
+
 static inline u64
 mptcp_subflow_get_map_offset(const struct mptcp_subflow_context *subflow)
 {
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index bde4a7fdee82..4783d558863c 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -20,7 +20,6 @@
 #include 
 #endif
 #include 
-#include 
 
 #include "protocol.h"
 #include "mib.h"
@@ -424,7 +423,7 @@ void mptcp_subflow_reset(struct sock *ssk)
/* must hold: tcp_done() could drop last reference on parent */
sock_hold(sk);
 
-   tcp_send_active_reset(ssk, GFP_ATOMIC, SK_RST_REASON_NOT_SPECIFIED);
+   mptcp_send_active_reset_reason(ssk);
tcp_done(ssk);
	if (!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, &mptcp_sk(sk)->flags))
mptcp_schedule_work(sk);
@@ -1362,8 +1361,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
tcp_set_state(ssk, TCP_CLOSE);
		while ((skb = skb_peek(&ssk->sk_receive_queue)))
sk_eat_skb(ssk, skb);
-   tcp_send_active_reset(ssk, GFP_ATOMIC,
- SK_RST_REASON_NOT_SPECIFIED);
+   mptcp_send_active_reset_reason(ssk);
WRITE_ONCE(subflow->data_avail, false);
return false;
}
-- 
2.37.3




[PATCH net-next v6 5/7] mptcp: support rstreason for passive reset

2024-04-18 Thread Jason Xing
From: Jason Xing 

It relies on the reset options carried in the skb, as RFC 8684
specifies. Reusing this logic saves us a lot of duplicated work.
This patch replaces most of the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/mptcp/subflow.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index ac867d277860..bde4a7fdee82 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -309,8 +309,13 @@ static struct dst_entry *subflow_v4_route_req(const struct sock *sk,
return dst;
 
dst_release(dst);
-	if (!req->syncookie)
-		tcp_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   if (!req->syncookie) {
+   struct mptcp_ext *mpext = mptcp_get_ext(skb);
+   enum sk_rst_reason reason;
+
+   reason = convert_mptcp_reason(mpext->reset_reason);
+   tcp_request_sock_ops.send_reset(sk, skb, reason);
+   }
return NULL;
 }
 
@@ -377,8 +382,13 @@ static struct dst_entry *subflow_v6_route_req(const struct sock *sk,
return dst;
 
dst_release(dst);
-	if (!req->syncookie)
-		tcp6_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   if (!req->syncookie) {
+   struct mptcp_ext *mpext = mptcp_get_ext(skb);
+   enum sk_rst_reason reason;
+
+   reason = convert_mptcp_reason(mpext->reset_reason);
+   tcp6_request_sock_ops.send_reset(sk, skb, reason);
+   }
return NULL;
 }
 #endif
@@ -783,6 +793,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
struct mptcp_subflow_request_sock *subflow_req;
struct mptcp_options_received mp_opt;
bool fallback, fallback_is_fatal;
+   enum sk_rst_reason reason;
struct mptcp_sock *owner;
struct sock *child;
 
@@ -913,7 +924,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
tcp_rsk(req)->drop_req = true;
inet_csk_prepare_for_destroy_sock(child);
tcp_done(child);
-   req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   reason = convert_mptcp_reason(mptcp_get_ext(skb)->reset_reason);
+   req->rsk_ops->send_reset(sk, skb, reason);
 
/* The last child reference will be released by the caller */
return child;
-- 
2.37.3




[PATCH net-next v6 4/7] tcp: support rstreason for passive reset

2024-04-18 Thread Jason Xing
From: Jason Xing 

Reuse the drop-reason logic to report the exact reason for a TCP
reset, so we don't need to implement a duplicated set of reset
reasons. This patch replaces all the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/ipv4/tcp_ipv4.c | 8 
 net/ipv6/tcp_ipv6.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 418d11902fa7..d78412cf8566 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(rsk, skb, (u32)reason);
 discard:
kfree_skb_reason(skb, reason);
/* Be careful here. If this function gets more complicated and
@@ -2278,7 +2278,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-			tcp_v4_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(nsk, skb, (u32)drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -2357,7 +2357,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(NULL, skb, (u32)drop_reason);
}
 
 discard_it:
@@ -2409,7 +2409,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
tcp_v4_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(sk, skb, (u32)drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 017f6293b5f4..c46095fb596c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1680,7 +1680,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, (u32)reason);
 discard:
if (opt_skb)
__kfree_skb(opt_skb);
@@ -1865,7 +1865,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-			tcp_v6_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(nsk, skb, (u32)drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -1942,7 +1942,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v6_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(NULL, skb, (u32)drop_reason);
}
 
 discard_it:
@@ -1998,7 +1998,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
tcp_v6_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, (u32)drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:
-- 
2.37.3




[PATCH net-next v6 3/7] rstreason: prepare for active reset

2024-04-18 Thread Jason Xing
From: Jason Xing 

Like what we did for passive resets: pass the possible reset
reason along each active reset path.

No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/tcp.h |  3 ++-
 net/ipv4/tcp.c| 15 ++-
 net/ipv4/tcp_output.c |  3 ++-
 net/ipv4/tcp_timer.c  |  9 ++---
 net/mptcp/protocol.c  |  4 +++-
 net/mptcp/subflow.c   |  5 +++--
 6 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b935e1ae4caf..adeacc9aa28a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -670,7 +670,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 void tcp_send_probe0(struct sock *);
 int tcp_write_wakeup(struct sock *, int mib);
 void tcp_send_fin(struct sock *sk);
-void tcp_send_active_reset(struct sock *sk, gfp_t priority);
+void tcp_send_active_reset(struct sock *sk, gfp_t priority,
+  enum sk_rst_reason reason);
 int tcp_send_synack(struct sock *);
 void tcp_push_one(struct sock *, unsigned int mss_now);
 void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f23b9ea5..4ec0f4feee00 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2811,7 +2812,8 @@ void __tcp_close(struct sock *sk, long timeout)
/* Unread data was tossed, zap the connection. */
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, sk->sk_allocation);
+   tcp_send_active_reset(sk, sk->sk_allocation,
+ SK_RST_REASON_NOT_SPECIFIED);
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
/* Check zero linger _after_ checking for unread data. */
sk->sk_prot->disconnect(sk, 0);
@@ -2885,7 +2887,8 @@ void __tcp_close(struct sock *sk, long timeout)
struct tcp_sock *tp = tcp_sk(sk);
if (READ_ONCE(tp->linger2) < 0) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONLINGER);
} else {
@@ -2903,7 +2906,8 @@ void __tcp_close(struct sock *sk, long timeout)
if (sk->sk_state != TCP_CLOSE) {
if (tcp_check_oom(sk, 0)) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
} else if (!check_net(sock_net(sk))) {
@@ -3007,7 +3011,7 @@ int tcp_disconnect(struct sock *sk, int flags)
/* The last check adjusts for discrepancy of Linux wrt. RFC
 * states
 */
-   tcp_send_active_reset(sk, gfp_any());
+	tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_NOT_SPECIFIED);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (old_state == TCP_SYN_SENT)
WRITE_ONCE(sk->sk_err, ECONNRESET);
@@ -4564,7 +4568,8 @@ int tcp_abort(struct sock *sk, int err)
smp_wmb();
sk_error_report(sk);
if (tcp_need_reset(sk->sk_state))
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
tcp_done(sk);
}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 61119d42b0fd..276d9d541b01 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3586,7 +3586,8 @@ void tcp_send_fin(struct sock *sk)
  * was unread data in the receive queue.  This behavior is recommended
  * by RFC 2525, section 2.17.  -DaveM
  */
-void tcp_send_active_reset(struct sock *sk, gfp_t priority)
+void tcp_send_active_reset(struct sock *sk, gfp_t priority,
+  enum sk_rst_reason reason)
 {
struct sk_buff *skb;
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 976db57b95d4..83fe7f62f7f1 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
 {
@@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset)
(!tp->snd_wnd && !tp->packets_out))
  

[PATCH net-next v6 2/7] rstreason: prepare for passive reset

2024-04-18 Thread Jason Xing
From: Jason Xing 

Adjust the parameters to support passing a reset reason, which
for now is always NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/request_sock.h |  4 +++-
 net/dccp/ipv4.c| 10 ++
 net/dccp/ipv6.c| 10 ++
 net/dccp/minisocks.c   |  3 ++-
 net/ipv4/tcp_ipv4.c| 12 +++-
 net/ipv4/tcp_minisocks.c   |  3 ++-
 net/ipv6/tcp_ipv6.c| 15 +--
 net/mptcp/subflow.c|  8 +---
 8 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 004e651e6067..bdc737832da6 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -18,6 +18,7 @@
 #include 
 
 #include 
+#include 
 
 struct request_sock;
 struct sk_buff;
@@ -34,7 +35,8 @@ struct request_sock_ops {
void(*send_ack)(const struct sock *sk, struct sk_buff *skb,
struct request_sock *req);
void(*send_reset)(const struct sock *sk,
- struct sk_buff *skb);
+ struct sk_buff *skb,
+ enum sk_rst_reason reason);
void(*destructor)(struct request_sock *req);
void(*syn_ack_timeout)(const struct request_sock *req);
 };
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9fc9cea4c251..ff41bd6f99c3 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ackvec.h"
 #include "ccid.h"
@@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, struct request_sock *req
return err;
 }
 
-static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
+static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb,
+				   enum sk_rst_reason reason)
 {
int err;
const struct iphdr *rxiph;
@@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
kfree_skb(skb);
return 0;
 }
@@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index c8ca703dc331..85f4b8fdbe5e 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include <net/rstreason.h>
 
 #include "dccp.h"
 #include "ipv6.h"
@@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock *req)
kfree_skb(inet_rsk(req)->pktopts);
 }
 
-static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
+static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb,
+  enum sk_rst_reason reason)
 {
const struct ipv6hdr *rxip6h;
struct sk_buff *skb;
@@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 discard:
if (opt_skb != NULL)
__kfree_skb(opt_skb);
@@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/minisocks.c 

[PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent

2024-04-18 Thread Jason Xing
From: Jason Xing 

Add a new standalone file for easy future extension to support
both active and passive resets in the TCP/DCCP/MPTCP protocols.

This patch only does the preparation for the reset reason mechanism;
nothing else changes.

The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) reuse MP_TCPRST option for MPTCP
3) our own reasons

I will implement the basic code for active/passive reset reasons in
those three protocols, which is not complete at the moment. But it
provides a new chance for other people to add more reasons into
it:)

Signed-off-by: Jason Xing 
---
 include/net/rstreason.h | 93 +
 1 file changed, 93 insertions(+)
 create mode 100644 include/net/rstreason.h

diff --git a/include/net/rstreason.h b/include/net/rstreason.h
new file mode 100644
index ..0c3fa55fa62f
--- /dev/null
+++ b/include/net/rstreason.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _LINUX_RSTREASON_H
+#define _LINUX_RSTREASON_H
+#include <net/dropreason-core.h>
+
+#define DEFINE_RST_REASON(FN, FNe) \
+   FN(MPTCP_RST_EUNSPEC)   \
+   FN(MPTCP_RST_EMPTCP)\
+   FN(MPTCP_RST_ERESOURCE) \
+   FN(MPTCP_RST_EPROHIBIT) \
+   FN(MPTCP_RST_EWQ2BIG)   \
+   FN(MPTCP_RST_EBADPERF)  \
+   FN(MPTCP_RST_EMIDDLEBOX)\
+   FN(NOT_SPECIFIED)   \
+   FNe(MAX)
+
+#define RST_REASON_START (SKB_DROP_REASON_MAX + 1)
+
+/* There are three parts in order:
+ * 1) 0 - SKB_DROP_REASON_MAX: rely on drop reasons for passive reset in TCP
+ * 2) SKB_DROP_REASON_MAX + 1 - MPTCP_RST_EMIDDLEBOX: for MPTCP use
+ * 3) MPTCP_RST_EMIDDLEBOX - SK_RST_REASON_MAX: independent reset reason
+ */
+enum sk_rst_reason {
+   /* Leave this 'blank' part (0-SKB_DROP_REASON_MAX) for the reuse
+* of skb drop reason because rst reason relies on what drop reason
+* indicates exactly why it could happen.
+*/
+
+   /* Copy from include/uapi/linux/mptcp.h.
+* These reset fields will not be changed since they adhere to
+* RFC 8684. So do not touch them. I'm going to list each definition
+* of them respectively.
+*/
+   /* Unspecified error.
+* This is the default error; it implies that the subflow is no
+* longer available. The presence of this option shows that the
+* RST was generated by an MPTCP-aware device.
+*/
+   SK_RST_REASON_MPTCP_RST_EUNSPEC = RST_REASON_START,
+   /* MPTCP-specific error.
+* An error has been detected in the processing of MPTCP options.
+* This is the usual reason code to return in the cases where a RST
+* is being sent to close a subflow because of an invalid response.
+*/
+   SK_RST_REASON_MPTCP_RST_EMPTCP,
+   /* Lack of resources.
+* This code indicates that the sending host does not have enough
+* resources to support the terminated subflow.
+*/
+   SK_RST_REASON_MPTCP_RST_ERESOURCE,
+   /* Administratively prohibited.
+* This code indicates that the requested subflow is prohibited by
+* the policies of the sending host.
+*/
+   SK_RST_REASON_MPTCP_RST_EPROHIBIT,
+   /* Too much outstanding data.
+* This code indicates that there is an excessive amount of data
+* that needs to be transmitted over the terminated subflow while
+* having already been acknowledged over one or more other subflows.
+* This may occur if a path has been unavailable for a short period
+* and it is more efficient to reset and start again than it is to
+* retransmit the queued data.
+*/
+   SK_RST_REASON_MPTCP_RST_EWQ2BIG,
+   /* Unacceptable performance.
+* This code indicates that the performance of this subflow was
+* too low compared to the other subflows of this Multipath TCP
+* connection.
+*/
+   SK_RST_REASON_MPTCP_RST_EBADPERF,
+   /* Middlebox interference.
+* Middlebox interference has been detected over this subflow,
+* making MPTCP signaling invalid. For example, this may be sent
+* if the checksum does not validate.
+*/
+   SK_RST_REASON_MPTCP_RST_EMIDDLEBOX,
+
+   /* For the real standalone socket reset reason, we start from here */
+   SK_RST_REASON_NOT_SPECIFIED,
+
+   /* Maximum of socket reset reasons.
+* It shouldn't be used as a real 'reason'.
+*/
+   SK_RST_REASON_MAX,
+};
+
+static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
+{
+   return reason += RST_REASON_START;
+}
+#endif
-- 
2.37.3




[PATCH net-next RESEND v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Jason Xing
From: Jason Xing 

In production, there are many cases where an RST skb is sent, but we
don't have a very convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own.

In this series, I focus on the fundamental implementation, mostly on how
the rstreason mechanism works, and give the detailed passive part as an
example, not including the active reset part. In the future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

v6
1. add back casts, or else they are treated as error.
2. RESEND because the status of patchwork changed.

v5
Link: https://lore.kernel.org/all/2024045630.38420-1-kerneljasonx...@gmail.com/
1. address format issue (like reverse xmas tree) (Eric, Paolo)
2. remove unnecessary casts. (Eric)
3. introduce a helper used in mptcp active reset. See patch 6. (Paolo)

v4
Link: https://lore.kernel.org/all/20240409100934.37725-1-kerneljasonx...@gmail.com/
1. passing 'enum sk_rst_reason' for readability when tracing (Antoine)

v3
Link: https://lore.kernel.org/all/20240404072047.11490-1-kerneljasonx...@gmail.com/
1. rebase (mptcp part) and address what Mat suggested.

v2
Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/
1. rebase against the latest net-next tree



Jason Xing (7):
  net: introduce rstreason to detect why the RST is sent
  rstreason: prepare for passive reset
  rstreason: prepare for active reset
  tcp: support rstreason for passive reset
  mptcp: support rstreason for passive reset
  mptcp: introducing a helper into active reset logic
  rstreason: make it work in trace world

 include/net/request_sock.h |  4 +-
 include/net/rstreason.h| 93 ++
 include/net/tcp.h  |  3 +-
 include/trace/events/tcp.h | 37 +--
 net/dccp/ipv4.c| 10 ++--
 net/dccp/ipv6.c| 10 ++--
 net/dccp/minisocks.c   |  3 +-
 net/ipv4/tcp.c | 15 --
 net/ipv4/tcp_ipv4.c| 14 +++---
 net/ipv4/tcp_minisocks.c   |  3 +-
 net/ipv4/tcp_output.c  |  5 +-
 net/ipv4/tcp_timer.c   |  9 ++--
 net/ipv6/tcp_ipv6.c| 17 ---
 net/mptcp/protocol.c   |  2 +-
 net/mptcp/protocol.h   | 11 +
 net/mptcp/subflow.c| 27 ---
 16 files changed, 216 insertions(+), 47 deletions(-)
 create mode 100644 include/net/rstreason.h

-- 
2.37.3




[PATCH] Revert "tracing/trigger: Fix to return error if failed to alloc snapshot"

2024-04-18 Thread Siddh Raman Pant
This reverts commit b5085b5ac1d96ea2a8a6240f869655176ce44197.

The change has an incorrect assumption about the return value because
in the current stable trees for versions 5.15 and before, the following
commit responsible for making 0 a success value is not present:
b8cc44a4d3c1 ("tracing: Remove logic for registering multiple event triggers at a time")

The return value should be 0 on failure in the current tree, because in
the functions event_trigger_callback() and event_enable_trigger_func(),
we have:

ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
/*
 * The above returns on success the # of functions enabled,
 * but if it didn't find any functions it returns zero.
 * Consider no functions a failure too.
 */
if (!ret) {
ret = -ENOENT;

Cc: sta...@kernel.org # 5.15, 5.10, 5.4, 4.19
Signed-off-by: Siddh Raman Pant 
---
 kernel/trace/trace_events_trigger.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index dfdbcf1da216..106f9813841a 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1161,10 +1161,8 @@ register_snapshot_trigger(char *glob, struct event_trigger_ops *ops,
  struct event_trigger_data *data,
  struct trace_event_file *file)
 {
-   int ret = tracing_alloc_snapshot_instance(file->tr);
-
-   if (ret < 0)
-   return ret;
+   if (tracing_alloc_snapshot_instance(file->tr) != 0)
+   return 0;
 
return register_trigger(glob, ops, data, file);
 }
-- 
2.43.0




Re: [PATCH net-next v2 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Julian Anastasov


Hello,

On Thu, 18 Apr 2024, Alexander Mikhalitsyn wrote:

> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Suggested-by: Julian Anastasov 
> Signed-off-by: Alexander Mikhalitsyn 
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..daa62b8b2dd1 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c

> @@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   /* si_swapinfo(); */
>   /* availmem = availmem - (i.totalswap - i.freeswap); */
>  
> - nomem = (availmem < ipvs->sysctl_amemthresh);
> + amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> + nomem = (availmem < amemthresh);
>  
>   local_bh_disable();
>  
> @@ -146,8 +148,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   case 1:
>   if (nomem) {
>   ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + = amemthresh /
> + (amemthresh-availmem);

Thanks, both patches look ok except that the old styling
is showing warnings for this patch:

scripts/checkpatch.pl --strict /tmp/file1.patch

It would be great if you silence them somehow in v3...

BTW, est_cpulist is masked with current->cpus_mask of the
sysctl writer process, if that is of any help. That is why I skipped
it but lets keep it read-only for now...

>   ipvs->sysctl_drop_packet = 2;
>   } else {
>   ipvs->drop_rate = 0;
> @@ -156,8 +158,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   case 2:
>   if (nomem) {
>   ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + = amemthresh /
> + (amemthresh-availmem);
>   } else {
>   ipvs->drop_rate = 0;
>   ipvs->sysctl_drop_packet = 1;

Regards

--
Julian Anastasov 




Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-18 Thread Greg KH
On Thu, Apr 18, 2024 at 09:04:53AM +0200, Thorsten Leemhuis wrote:
> On 17.04.24 15:38, Greg KH wrote:
> > On Wed, Apr 17, 2024 at 03:21:12PM +0200, Thorsten Leemhuis wrote:
> >> On 17.04.24 14:52, Konstantin Ryabitsev wrote:
> >>> On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
>  Could you please create the email alias
>  do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
>  just like sta...@kernel.org does?
> 
>  To quote:
> 
> > How about:
> > cc:  # Reason goes here, and 
> > must be present
> >
> > and we can make that address be routed to /dev/null just like
> >  is?
> >>
> >> FWIW, we could go back to what I initially proposed: use the existing
> >> stable tag with a pre-defined comment to mark patches that AUTOSEL et.
> >> al. should not pick up:
> >> https://lore.kernel.org/all/c0a08b160b286e8c98549eedb37404c6e784cf8a.1712812895.git.li...@leemhuis.info/
> > 
> > If you can pick a better string, possibly, yes.
> 
> What did you think of Konstantin's
> 
> Cc: stable+noauto...@kernel.org # Reason
> 
> That looked like a good solution -- and I wondered why I did not come up
> with that idea myself. Sure, "autosel" would also imply/mean "the
> scripts/tools that look out for Fixes: tags", but does that matter?

We can live with this, sure.  That way no need to change anything on any
kernel.org backend.

thanks,

greg k-h



Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint

2024-04-18 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net-next.git (main)
by Paolo Abeni :

On Wed, 17 Apr 2024 15:18:22 +0800 you wrote:
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing
> XDP hints through kfuncs to allow XDP programs to access the RSS hash.
> 
> [...]

Here is the summary with links:
  - [net-next,v9] virtio_net: Support RX hash XDP hint
https://git.kernel.org/netdev/net-next/c/aa37f8916d20

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





Re: [PATCH 3/3] virtio_balloon: introduce memory scan/reclaim info

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

Expose memory scan/reclaim information to the host side via the
virtio balloon device.

Now we have a metric to analyze the memory performance:

y: counter increases
n: counter does not change
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
   the guest runs under really critical memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to cgroup, not the global memory
   pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
   the memory allocation stalls due to global memory pressure. The
   performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows
   quite effective memory reclaiming.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to global memory pressure.
   The ratio between DRCLM/DSCAN gets low and the guest OS is thrashing
   heavily; in serious cases this leads to poor performance and difficult
   troubleshooting. For example, sshd may block on memory allocation when
   accepting new connections, so a user can't log in to the VM via ssh.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
   the low ratio between ARCLM/ASCAN shows that the guest tries to
   reclaim more memory but can't. Once more memory is required in the
   future, it will struggle to reclaim memory.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c |  9 +
  include/uapi/linux/virtio_balloon.h | 12 ++--
  2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index e88e6573afa5..bc9332c1ae85 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -356,6 +356,15 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
stall += events[ALLOCSTALL_MOVABLE];
update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
  
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN,
+   pages_to_bytes(events[PGSCAN_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN,
+   pages_to_bytes(events[PGSCAN_DIRECT]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_DIRECT]));
+
  #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 487b893a160e..ee35a372805d 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -73,7 +73,11 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures */
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
  #define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocatoin */
-#define VIRTIO_BALLOON_S_NR   12
+#define VIRTIO_BALLOON_S_ASYNC_SCAN    12 /* Amount of memory scanned asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_SCAN   13 /* Amount of memory scanned directly */
+#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed directly */
+#define VIRTIO_BALLOON_S_NR   16
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
	VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -87,7 +91,11 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
-   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \
  }
  
  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")


Not an expert on these counters/events, but LGTM

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 2/3] virtio_balloon: introduce memory allocation stall counter

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

The memory allocation stall counter represents the performance/latency
of memory allocation; expose this counter to the host side via the
virtio balloon device in an out-of-band way.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 20 +++-
  include/uapi/linux/virtio_balloon.h |  6 --
  2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fd19934a847f..e88e6573afa5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
unsigned int idx = 0;
-   long available;
+   long available, stall = 0;
unsigned long caches;
  
  	all_vm_events(events);

@@ -338,6 +338,24 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
+
+   /* sum all the stall events */
+#ifdef CONFIG_ZONE_DMA
+   stall += events[ALLOCSTALL_DMA];
+#endif
+#ifdef CONFIG_ZONE_DMA32
+   stall += events[ALLOCSTALL_DMA32];
+#endif
+#ifdef CONFIG_HIGHMEM
+   stall += events[ALLOCSTALL_HIGH];
+#endif
+#ifdef CONFIG_ZONE_DEVICE
+   stall += events[ALLOCSTALL_DEVICE];
+#endif


Naive me would think that ALLOCSTALL_DEVICE is always 0. :)

Likely we should just do:

for (zid = 0; zid < MAX_NR_ZONES; zid++)
stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];

(see isolate_lru_folios() -> __count_zid_vm_events(), where we rely on 
the same ordering)


Apart from that, LGTM.

--
Cheers,

David / dhildenb




Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-18 Thread Jonthan Haslam
Hi Masami,

> > > OK, then I'll push this to for-next at this moment.
> > > Please share if you have a good idea for the batch interface which can be
> > > backported. I guess it should involve updating userspace changes too.
> > 
> > Did you (or anyone else) need anything more from me on this one so that it
> > can be pushed? I provided some benchmark numbers but happy to provide
> > anything else that may be required.
> 
> Yeah, if you can update with the result, it looks better to me.
> Or, can I update the description?

Just checking if you need me to do anything on this so that it can be
pushed to for-next? Would be really great if we can get this in. Thanks
for all your help.

Jon.

> 
> Thank you,
> 
> > 
> > Thanks!
> > 
> > Jon.
> > 
> > > 
> > > Thank you!
> > > 
> > > > >
> > > > > So I hope you can reconsider and accept improvements in this patch,
> > > > > while Jonathan will keep working on even better final solution.
> > > > > Thanks!
> > > > >
> > > > > > I look forward to your formalized results :)
> > > > > >
> > > > 
> > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > > > and USDTs, reporting attach/detach timings:
> > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > > > bpf_testmod.ko is already unloaded.
> > > > Loading bpf_testmod.ko...
> > > > Successfully loaded bpf_testmod.ko.
> > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > > > test_bench_attach_uprobe: attached in   0.120s
> > > > test_bench_attach_uprobe: detached in   0.092s
> > > > #400/5   uprobe_multi_test/bench_uprobe:OK
> > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > > > test_bench_attach_usdt: attached in   0.124s
> > > > test_bench_attach_usdt: detached in   0.064s
> > > > #400/6   uprobe_multi_test/bench_usdt:OK
> > > > #400 uprobe_multi_test:OK
> > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > > > Successfully unloaded bpf_testmod.ko.
> > > > 
> > > > So it should be easy for Jonathan to validate his changes with this.
> > > > 
> > > > > > Thank you,
> > > > > >
> > > > > > >
> > > > > > > Jon.
> > > > > > >
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > BTW, how did you measure the overhead? I think spinlock 
> > > > > > > > > > overhead
> > > > > > > > > > will depend on how much lock contention happens.
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Jonathan Haslam 
> > > > > > > > > > > ---
> > > > > > > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > > > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > > > > >   */
> > > > > > > > > > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > > > >
> > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
> > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> > > > > > > > > > >
> > > > > > > > > > >  #define UPROBES_HASH_SZ  13
> > > > > > > > > > >  /* serialize uprobe->pending_list */
> > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > > >  {
> > > > > > > > > > >   struct uprobe *uprobe;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > >   uprobe = __find_uprobe(inode, offset);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > >
> > > > > > > > > > >   return uprobe;
> > > > > > > > > > >  }
> > > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > > > >  {
> > > > > > > > > > >   struct uprobe *u;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > >   u = __insert_uprobe(uprobe);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);

[PATCH net-next v2 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-04-18 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls writable even when the
network namespace is owned by a non-initial user namespace.

Let's make a few sysctls read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice

I'm trying to be conservative with this to prevent
introducing any security issues. Maybe we can allow
more sysctls to be writable, but let's do this
on demand and when we see a real use case.

This patch is motivated by a user request in the LXC
project [1]. Having this can help with running some
Kubernetes [2] or Docker Swarm [3] workloads inside system
containers.

[1] https://github.com/lxc/lxc/issues/4278
[2] https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
[3] https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682

Cc: Stéphane Graber 
Cc: Christian Brauner 
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index daa62b8b2dd1..f84f091626ef 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4272,6 +4272,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
	struct ctl_table *tbl;
	int idx, ret;
	size_t ctl_table_size = ARRAY_SIZE(vs_vars);
+	bool unpriv = net->user_ns != &init_user_ns;
 
	atomic_set(&ipvs->dropentry, 0);
	spin_lock_init(&ipvs->dropentry_lock);
@@ -4286,12 +4287,6 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
	tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
	if (tbl == NULL)
		return -ENOMEM;
-
-	/* Don't export sysctls to unprivileged users */
-	if (net->user_ns != &init_user_ns) {
-		tbl[0].procname = NULL;
-		ctl_table_size = 0;
-	}
} else
tbl = vs_vars;
/* Initialize sysctl defaults */
@@ -4317,10 +4312,17 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
	ipvs->sysctl_sync_ports = 1;
	tbl[idx++].data = &ipvs->sysctl_sync_ports;
	tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
+
	ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
+
	ipvs->sysctl_sync_sock_size = 0;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
+
	tbl[idx++].data = &ipvs->sysctl_cache_bypass;
	tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
	tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
@@ -4343,15 +4345,22 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
	tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
	tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
+
	ipvs->sysctl_run_estimation = 1;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx].extra2 = ipvs;
	tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
	ipvs->est_cpulist_valid = 0;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx].extra2 = ipvs;
	tbl[idx++].data = &ipvs->sysctl_est_cpulist;
 
	ipvs->sysctl_est_nice = IPVS_EST_NICE;
+	if (unpriv)
+		tbl[idx].mode = 0444;
	tbl[idx].extra2 = ipvs;
	tbl[idx++].data = &ipvs->sysctl_est_nice;
 
-- 
2.34.1




Re: [PATCH net-next] ipvs: allow some sysctls in non-init user namespaces

2024-04-18 Thread Aleksandr Mikhalitsyn
On Wed, Apr 17, 2024 at 3:02 PM Julian Anastasov  wrote:
>
>
> Hello,

Dear Julian,

>
> On Tue, 16 Apr 2024, Alexander Mikhalitsyn wrote:
>
> > Let's make all IPVS sysctls visible and RO even when
> > network namespace is owned by non-initial user namespace.
> >
> > Let's make a few sysctls to be writable:
> > - conntrack
> > - conn_reuse_mode
> > - expire_nodest_conn
> > - expire_quiescent_template
> >
> > I'm trying to be conservative with this to prevent
> > introducing any security issues in there. Maybe,
> > we can allow more sysctls to be writable, but let's
> > do this on-demand and when we see real use-case.
> >
> > This list of sysctls was chosen because I can't
> > see any security risks allowing them and also
> > Kubernetes uses [2] these specific sysctls.
> >
> > This patch is motivated by user request in the LXC
> > project [1].
> >
> > [1] https://github.com/lxc/lxc/issues/4278
> > [2] 
> > https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
> >
> > Cc: Stéphane Graber 
> > Cc: Christian Brauner 
> > Cc: Julian Anastasov 
> > Cc: Simon Horman 
> > Cc: Pablo Neira Ayuso 
> > Cc: Jozsef Kadlecsik 
> > Cc: Florian Westphal 
> > Signed-off-by: Alexander Mikhalitsyn 
> > ---
> >  net/netfilter/ipvs/ip_vs_ctl.c | 18 +++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > index 143a341bbc0a..92a818c2f783 100644
> > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> > @@ -4285,10 +4285,22 @@ static int __net_init 
> > ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>
> As the list of privileged vars is short I prefer
> to use a bool and to make only some vars read-only:
>
> bool unpriv = false;
>
> >   if (tbl == NULL)
> >   return -ENOMEM;
> >
> > - /* Don't export sysctls to unprivileged users */
> > + /* Let's show all sysctls in non-init user namespace-owned
> > +  * net namespaces, but make them read-only.
> > +  *
> > +  * Allow only a few specific sysctls to be writable.
> > +  */
> >   if (net->user_ns != _user_ns) {
>
> Here we should just set: unpriv = true;
>
> > - tbl[0].procname = NULL;
> > - ctl_table_size = 0;
> > + for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
> > + if (!tbl[idx].procname)
> > + continue;
> > +
> > + if (!((strcmp(tbl[idx].procname, "conntrack") 
> > == 0) ||
> > +   (strcmp(tbl[idx].procname, 
> > "conn_reuse_mode") == 0) ||
> > +   (strcmp(tbl[idx].procname, 
> > "expire_nodest_conn") == 0) ||
> > +   (strcmp(tbl[idx].procname, 
> > "expire_quiescent_template") == 0)))
> > + tbl[idx].mode = 0444;
> > + }
> >   }
> >   } else
> >   tbl = vs_vars;
>
> And below at every place to use:
>
> if (unpriv)
> tbl[idx].mode = 0444;
>
> for the following 4 privileged sysctl vars:
>
> - sync_qlen_max:
> - allocates messages in kernel context
> - this needs better tunning in another patch
>
> - sync_sock_size:
> - allocates messages in kernel context
>
> - run_estimation:
> - for now, better init ns to decide if to use est stats
>
> - est_nice:
> - for now, better init ns to decide the value
>
> - debug_level:
> - already set to 0444
>
> I.e. these vars allocate resources (mem, CPU) without
> proper control, so for now we will just copy them from init ns
> without allowing writing. And they are vars that are not tuned
> often. Also we do not know which netns is supposed to be the
> privileged one, some solutions move all devices out of init_net,
> so we can not decide where to use lower limits.
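Julian's suggested shape can be modeled in plain user-space C. This is a sketch only: the entry names come from the quoted patch, but the struct and helper names here are made up, and the real kernel code operates on struct ctl_table entries instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Minimal user-space model of the review suggestion: in a netns owned by a
 * non-initial user namespace ("unpriv"), every sysctl becomes read-only
 * (0444) except a small allowlist that stays writable (0644). */
struct ctl_entry {
	const char *procname;
	int mode;
};

static const char *const writable_allowlist[] = {
	"conntrack", "conn_reuse_mode",
	"expire_nodest_conn", "expire_quiescent_template",
};

static void apply_unpriv_modes(struct ctl_entry *tbl, size_t n, bool unpriv)
{
	if (!unpriv)
		return;
	for (size_t i = 0; i < n; i++) {
		bool writable = false;

		for (size_t j = 0;
		     j < sizeof(writable_allowlist) / sizeof(writable_allowlist[0]);
		     j++)
			if (strcmp(tbl[i].procname, writable_allowlist[j]) == 0)
				writable = true;
		if (!writable)
			tbl[i].mode = 0444; /* read-only for unprivileged netns */
	}
}
```

An allowlist array keeps the check in one place instead of a chain of strcmp() calls, which is the readability point being debated above.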

I agree. I have also decided to forbid "est_cpulist" for unprivileged users.

>
> OTOH, "amemthresh" is not privileged but needs a single READ_ONCE
> for sysctl_amemthresh in update_defense_level() due to the possible
> div by zero if we allow writing to anyone, e.g.:
>
> int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> ...
> nomem = availmem < amemthresh;
> ... use only amemthresh
>
> All other vars can be writable.

Have fixed this and sent it as a separate patch! ;-)

Thanks a lot for such a quick review!

Kind regards,
Alex

>
> Regards
>
> --
> Julian Anastasov 



[PATCH net-next v2 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Alexander Mikhalitsyn
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Suggested-by: Julian Anastasov 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..daa62b8b2dd1 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 {
struct sysinfo i;
int availmem;
+   int amemthresh;
int nomem;
int to_change = -1;
 
@@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
/* si_swapinfo(); */
/* availmem = availmem - (i.totalswap - i.freeswap); */
 
-   nomem = (availmem < ipvs->sysctl_amemthresh);
+   amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
+   nomem = (availmem < amemthresh);
 
local_bh_disable();
 
@@ -146,8 +148,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
case 1:
if (nomem) {
ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   = amemthresh /
+   (amemthresh-availmem);
ipvs->sysctl_drop_packet = 2;
} else {
ipvs->drop_rate = 0;
@@ -156,8 +158,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
case 2:
if (nomem) {
ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   = amemthresh /
+   (amemthresh-availmem);
} else {
ipvs->drop_rate = 0;
ipvs->sysctl_drop_packet = 1;
-- 
2.34.1




Re: [PATCH 1/3] virtio_balloon: introduce oom-kill invocations

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

When the guest OS runs under critical memory pressure, the guest
starts to kill processes. A guest monitor agent may scan 'oom_kill'
from /proc/vmstat and report the OOM-kill event. However, the agent
may itself be killed, and we would then lose this critical event (and
later ones).

For now we can also grep for magic words in the guest kernel log from
the host side. Rather than relying on that unstable approach, let the
virtio balloon report OOM-kill invocations instead.
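The agent-side scanning described above looks roughly like the following user-space sketch of the approach being replaced; parse_vmstat_counter() is a hypothetical helper, not kernel code, and in practice the buffer would be read from /proc/vmstat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the fragile monitoring approach the patch replaces: an agent
 * periodically parses /proc/vmstat text for the "oom_kill" counter. If the
 * agent itself is OOM-killed, events between restarts are lost. */
static long parse_vmstat_counter(const char *buf, const char *name)
{
	const char *p = buf;
	char key[64];
	long val;

	while (p && *p) {
		/* each /proc/vmstat line is "<name> <value>" */
		if (sscanf(p, "%63s %ld", key, &val) == 2 &&
		    strcmp(key, name) == 0)
			return val;
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1; /* counter not present */
}
```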

Signed-off-by: zhenwei pi 


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [RFC PATCH 6/7] execmem: add support for cache of large ROX pages

2024-04-18 Thread Mike Rapoport
On Tue, Apr 16, 2024 at 09:52:34AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 15, 2024 at 08:00:26PM +0300, Mike Rapoport wrote:
> > On Mon, Apr 15, 2024 at 12:47:50PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 11, 2024 at 07:05:25PM +0300, Mike Rapoport wrote:
> > > 
> > > > To populate the cache, a writable large page is allocated from vmalloc 
> > > > with
> > > > VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped 
> > > > as
> > > > ROX.
> > > 
> > > > +static void execmem_invalidate(void *ptr, size_t size, bool writable)
> > > > +{
> > > > +   if (execmem_info->invalidate)
> > > > +   execmem_info->invalidate(ptr, size, writable);
> > > > +   else
> > > > +   memset(ptr, 0, size);
> > > > +}
> > > 
> > > +static void execmem_invalidate(void *ptr, size_t size, bool writeable)
> > > +{
> > > +   /* fill memory with INT3 instructions */
> > > +   if (writeable)
> > > +   memset(ptr, 0xcc, size);
> > > +   else
> > > +   text_poke_set(ptr, 0xcc, size);
> > > +}
> > > 
> > > Thing is, 0xcc (aka INT3_INSN_OPCODE) is not an invalid instruction.
> > > It raises #BP not #UD.
> > 
> > Do you mean that _invalidate is a poor name choice or that it's necessary
> > to use an instruction that raises #UD?
> 
> Poor naming, mostly. #BP handler will still scream bloody murder if the
> site is otherwise unclaimed.
> 
> It just isn't an invalid instruction.

Well, execmem_fill_with_insns_screaming_bloody_murder seems too long, how
about execmem_fill_trapping_insns?
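A user-space sketch of the renamed helper, with the read-only branch stubbed out since text_poke_set() only exists in the kernel:

```c
#include <assert.h>
#include <string.h>

/* The helper fills a code region with trapping instructions (0xCC, the x86
 * INT3 opcode) rather than "invalid" ones, matching the rename discussed
 * above. The writable path is a plain memset; the read-only path in the
 * kernel would go through text_poke_set() instead. */
static void execmem_fill_trapping_insns(unsigned char *ptr, size_t size,
					int writable)
{
	if (writable)
		memset(ptr, 0xcc, size);
	/* else: kernel code would use text_poke_set(ptr, 0xcc, size) */
}
```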

-- 
Sincerely yours,
Mike.



Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-18 Thread Mike Rapoport
On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
> 
> 
> > On 11 Apr 2024, at 19:05, Mike Rapoport  wrote:
> > 
> > @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const 
> > struct load_info *info)
> > add_kallsyms(mod, info);
> > 
> > /* Arch-specific module finalizing. */
> > -   return module_finalize(info->hdr, info->sechdrs, mod);
> > +   ret = module_finalize(info->hdr, info->sechdrs, mod);
> > +   if (ret)
> > +   return ret;
> > +
> > +   for_each_mod_mem_type(type) {
> > +   struct module_memory *mem = >mem[type];
> > +
> > +   if (mem->is_rox) {
> > +   if (!execmem_update_copy(mem->base, mem->rw_copy,
> > +mem->size))
> > +   return -ENOMEM;
> > +
> > +   vfree(mem->rw_copy);
> > +   mem->rw_copy = NULL;
> > +   }
> > +   }
> > +
> > +   return 0;
> > }
> 
> I might be missing something, but it seems a bit racy.
> 
> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
> point, since you don’t hold the text_mutex, some might do text_poke(),
> e.g., by enabling/disabling static-key, and the update would be
> overwritten. No?

Right :(
Even worse, in the UP case alternatives_smp_unlock() will "patch" a
still-empty area.

So I'm thinking about calling alternatives_smp_module_add() from an
additional callback after the execmem_update_copy().

Does it make sense to you?
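The proposed reordering can be illustrated with a toy model that only records the order of the two calls (the function names mirror the discussion; everything else here is made up):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the ordering fix discussed above: the call that exposes the
 * module text for patching (alternatives_smp_module_add() in the real code)
 * must not run until after execmem_update_copy() has copied the RW staging
 * copy into the final ROX mapping. Each step just records when it ran. */
static int step;
static int copied_at, exposed_at;

static void execmem_update_copy(void)         { copied_at  = ++step; }
static void alternatives_smp_module_add(void) { exposed_at = ++step; }

static void post_relocation_fixed(bool is_rox)
{
	if (is_rox)
		execmem_update_copy();     /* finalize the text first */
	alternatives_smp_module_add();     /* only then expose it for patching */
}
```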

-- 
Sincerely yours,
Mike.



Re: [PATCH 1/2] arm64: dts: qcom: pmi632: Add vibrator

2024-04-18 Thread Luca Weiss
On Thu Apr 18, 2024 at 12:01 PM CEST, Konrad Dybcio wrote:
> On 18.04.2024 8:36 AM, Luca Weiss wrote:
> > Add a node for the vibrator module found inside the PMI632.
> > 
> > Signed-off-by: Luca Weiss 
> > ---
>
> Reviewed-by: Konrad Dybcio 
>
> On a side note, this is a totally configuration-free peripheral that doesn't 
> do
> anything crazy until manually configured.
>
> In the slow quest to be (hopefully) more sane about the defaults, should we 
> keep
> them enabled by default? Bjorn?

But many (most?) devices don't have a vibration motor connected to
PMI632, some (like devboards) don't have anything, and other phones have
a separate chip that controls the vibration motor.

Enabling this by default would mean all devices with PMI632 would get an
input device for the vibrator that probably doesn't work?

Regards
Luca

>
> Konrad




Re: [PATCH v2 3/3] arm64: dts: msm8996: add fastrpc nodes

2024-04-18 Thread Konrad Dybcio
On 18.04.2024 8:44 AM, Dmitry Baryshkov wrote:
> From: Srinivas Kandagatla 
> 
> The ADSP provides fastrpc/compute capabilities. Enable support for the
> fastrpc on this DSP.
> 
> Signed-off-by: Srinivas Kandagatla 
> Signed-off-by: Dmitry Baryshkov 
> ---

Reviewed-by: Konrad Dybcio 

Konrad



Re: [PATCH 1/2] arm64: dts: qcom: pmi632: Add vibrator

2024-04-18 Thread Konrad Dybcio
On 18.04.2024 8:36 AM, Luca Weiss wrote:
> Add a node for the vibrator module found inside the PMI632.
> 
> Signed-off-by: Luca Weiss 
> ---

Reviewed-by: Konrad Dybcio 

On a side note, this is a totally configuration-free peripheral that doesn't do
anything crazy until manually configured.

In the slow quest to be (hopefully) more sane about the defaults, should we keep
them enabled by default? Bjorn?

Konrad



Re: [PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-18 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 03:14:37PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 4/17/2024 4:54 PM, David Stevens wrote:
> > Add support for the VIRTIO_F_SUSPEND feature. When this feature is
> > negotiated, power management can use it to suspend virtio devices
> > instead of resorting to resetting the devices entirely.
> > 
> > Signed-off-by: David Stevens 
> > ---
> >   drivers/virtio/virtio.c| 32 ++
> >   drivers/virtio/virtio_pci_common.c | 29 +++
> >   drivers/virtio/virtio_pci_modern.c | 19 ++
> >   include/linux/virtio.h |  2 ++
> >   include/uapi/linux/virtio_config.h | 10 +-
> >   5 files changed, 74 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > index f4080692b351..cd11495a5098 100644
> > --- a/drivers/virtio/virtio.c
> > +++ b/drivers/virtio/virtio.c
> > @@ -1,5 +1,6 @@
> >   // SPDX-License-Identifier: GPL-2.0-only
> >   #include 
> > +#include <linux/delay.h>
> >   #include 
> >   #include 
> >   #include 
> > @@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
> > return ret;
> >   }
> >   EXPORT_SYMBOL_GPL(virtio_device_restore);
> > +
> > +static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
> > enabled)
> > +{
> > +   u8 status, target;
> > +
> > +   status = dev->config->get_status(dev);
> > +   if (enabled)
> > +   target = status | VIRTIO_CONFIG_S_SUSPEND;
> > +   else
> > +   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
> > +   dev->config->set_status(dev, target);
> I think it is better to verify whether the device SUSPEND bit is
> already set or clear, we can just return if status == target.
> 
> Thanks
> Zhu Lingshan
> > +
> > +   while ((status = dev->config->get_status(dev)) != target) {
> > +   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
> > +   return -EIO;
> > +   mdelay(10);

Bad device state (set by surprise removal) should also
be handled here I think.


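Folding both review comments into the suspend helper might look like this user-space model. The status callbacks are mocked, VIRTIO_CONFIG_S_SUSPEND's value here is illustrative, and real code would keep the mdelay() plus an overall timeout:

```c
#include <assert.h>

/* User-space model of the suspend-bit handshake: return early when the
 * status already matches the request, and fail once the device flags
 * VIRTIO_CONFIG_S_NEEDS_RESET (e.g. broken or surprise-removed). */
#define VIRTIO_CONFIG_S_SUSPEND     0x10 /* illustrative value */
#define VIRTIO_CONFIG_S_NEEDS_RESET 0x40

struct mock_dev {
	unsigned char status;
};

static unsigned char get_status(struct mock_dev *d) { return d->status; }
static void set_status(struct mock_dev *d, unsigned char s) { d->status = s; }

static int set_suspend_bit(struct mock_dev *dev, int enabled)
{
	unsigned char status = get_status(dev);
	unsigned char target = enabled ? (status | VIRTIO_CONFIG_S_SUSPEND)
				       : (status & ~VIRTIO_CONFIG_S_SUSPEND);

	if (status == target)	/* already in the requested state */
		return 0;

	set_status(dev, target);
	while ((status = get_status(dev)) != target) {
		if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
			return -1; /* device broken or surprise-removed */
		/* real code: mdelay(10), bounded by an overall timeout */
	}
	return 0;
}
```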
> > +   }
> > +   return 0;
> > +}
> > +
> > +int virtio_device_suspend(struct virtio_device *dev)
> > +{
> > +   return virtio_device_set_suspend_bit(dev, true);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_device_suspend);
> > +
> > +int virtio_device_resume(struct virtio_device *dev)
> > +{
> > +   return virtio_device_set_suspend_bit(dev, false);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_device_resume);
> >   #endif
> >   static int virtio_init(void)
> > diff --git a/drivers/virtio/virtio_pci_common.c 
> > b/drivers/virtio/virtio_pci_common.c
> > index b655fccaf773..4d542de05970 100644
> > --- a/drivers/virtio/virtio_pci_common.c
> > +++ b/drivers/virtio/virtio_pci_common.c
> > @@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
> > return virtio_device_restore(&vp_dev->vdev);
> >   }
> > -static bool vp_supports_pm_no_reset(struct device *dev)
> > +static int virtio_pci_suspend(struct device *dev)
> >   {
> > struct pci_dev *pci_dev = to_pci_dev(dev);
> > -   u16 pmcsr;
> > -
> > -   if (!pci_dev->pm_cap)
> > -   return false;
> > -
> > -   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > -   if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > -   dev_err(dev, "Unable to query pmcsr");
> > -   return false;
> > -   }
> > +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> > -   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
> > -}
> > +   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
> > +   return virtio_device_suspend(&vp_dev->vdev);
> > -static int virtio_pci_suspend(struct device *dev)
> > -{
> > -   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
> > +   return virtio_pci_freeze(dev);
> >   }
> >   static int virtio_pci_resume(struct device *dev)
> >   {
> > -   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
> > +   struct pci_dev *pci_dev = to_pci_dev(dev);
> > +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> > +
> > +   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
> > +   return virtio_device_resume(&vp_dev->vdev);
> > +
> > +   return virtio_pci_restore(dev);
> >   }
> >   static const struct dev_pm_ops virtio_pci_pm_ops = {
> > diff --git a/drivers/virtio/virtio_pci_modern.c 
> > b/drivers/virtio/virtio_pci_modern.c
> > index f62b530aa3b5..ac8734526b8d 100644
> > --- a/drivers/virtio/virtio_pci_modern.c
> > +++ b/drivers/virtio/virtio_pci_modern.c
> > @@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct 
> > virtio_device *vdev)
> > __virtqueue_break(admin_vq->info.vq);
> >   }
> > +static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)
> > +{
> > +   u16 pmcsr;
> > +
> > +   if (!pci_dev->pm_cap)
> > +   return false;
> > +
> > +   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > +   if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > +   dev_err(&pci_dev->dev, "Unable to query pmcsr");
> > +   return false;

Re: [PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-18 Thread Zhu, Lingshan




On 4/17/2024 4:54 PM, David Stevens wrote:

Add support for the VIRTIO_F_SUSPEND feature. When this feature is
negotiated, power management can use it to suspend virtio devices
instead of resorting to resetting the devices entirely.

Signed-off-by: David Stevens 
---
  drivers/virtio/virtio.c| 32 ++
  drivers/virtio/virtio_pci_common.c | 29 +++
  drivers/virtio/virtio_pci_modern.c | 19 ++
  include/linux/virtio.h |  2 ++
  include/uapi/linux/virtio_config.h | 10 +-
  5 files changed, 74 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b351..cd11495a5098 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -1,5 +1,6 @@
  // SPDX-License-Identifier: GPL-2.0-only
  #include 
+#include <linux/delay.h>
  #include 
  #include 
  #include 
@@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
return ret;
  }
  EXPORT_SYMBOL_GPL(virtio_device_restore);
+
+static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
enabled)
+{
+   u8 status, target;
+
+   status = dev->config->get_status(dev);
+   if (enabled)
+   target = status | VIRTIO_CONFIG_S_SUSPEND;
+   else
+   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
+   dev->config->set_status(dev, target);

I think it is better to verify whether the device SUSPEND bit is
already set or clear, we can just return if status == target.

Thanks
Zhu Lingshan

+
+   while ((status = dev->config->get_status(dev)) != target) {
+   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
+   return -EIO;
+   mdelay(10);
+   }
+   return 0;
+}
+
+int virtio_device_suspend(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_suspend);
+
+int virtio_device_resume(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, false);
+}
+EXPORT_SYMBOL_GPL(virtio_device_resume);
  #endif
  
  static int virtio_init(void)

diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index b655fccaf773..4d542de05970 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
return virtio_device_restore(&vp_dev->vdev);
  }
  
-static bool vp_supports_pm_no_reset(struct device *dev)

+static int virtio_pci_suspend(struct device *dev)
  {
struct pci_dev *pci_dev = to_pci_dev(dev);
-   u16 pmcsr;
-
-   if (!pci_dev->pm_cap)
-   return false;
-
-   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
-   if (PCI_POSSIBLE_ERROR(pmcsr)) {
-   dev_err(dev, "Unable to query pmcsr");
-   return false;
-   }
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
  
-	return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;

-}
+   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_suspend(&vp_dev->vdev);
  
-static int virtio_pci_suspend(struct device *dev)

-{
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
+   return virtio_pci_freeze(dev);
  }
  
  static int virtio_pci_resume(struct device *dev)

  {
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
+   struct pci_dev *pci_dev = to_pci_dev(dev);
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+
+   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_resume(&vp_dev->vdev);
+
+   return virtio_pci_restore(dev);
  }
  
  static const struct dev_pm_ops virtio_pci_pm_ops = {

diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index f62b530aa3b5..ac8734526b8d 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct virtio_device 
*vdev)
__virtqueue_break(admin_vq->info.vq);
  }
  
+static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)

+{
+   u16 pmcsr;
+
+   if (!pci_dev->pm_cap)
+   return false;
+
+   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
+   if (PCI_POSSIBLE_ERROR(pmcsr)) {
+   dev_err(&pci_dev->dev, "Unable to query pmcsr");
+   return false;
+   }
+
+   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
+}
+
  static void vp_transport_features(struct virtio_device *vdev, u64 features)
  {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
@@ -223,6 +239,9 @@ static void vp_transport_features(struct virtio_device 
*vdev, u64 features)
  
  	if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ))

__virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ);
+
+   if (features & 

Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-18 Thread Thorsten Leemhuis
On 17.04.24 15:38, Greg KH wrote:
> On Wed, Apr 17, 2024 at 03:21:12PM +0200, Thorsten Leemhuis wrote:
>> On 17.04.24 14:52, Konstantin Ryabitsev wrote:
>>> On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
 Could you please create the email alias
 do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
 just like sta...@kernel.org does?

 To quote:

> How about:
>   cc:  # Reason goes here, and must be 
> present
>
> and we can make that address be routed to /dev/null just like
>  is?
>>
>> FWIW, we could go back to what I initially proposed: use the existing
>> stable tag with a pre-defined comment to mark patches that AUTOSEL et.
>> al. should not pick up:
>> https://lore.kernel.org/all/c0a08b160b286e8c98549eedb37404c6e784cf8a.1712812895.git.li...@leemhuis.info/
> 
> If you can pick a better string, possibly, yes.

What did you think of Konstantin's

Cc: stable+noauto...@kernel.org # Reason

That looked like a good solution -- and I wondered why I did not come up
with that idea myself. Sure, "autosel" would also imply/mean "the
scripts/tools that look out for Fixes: tags", but does that matter?

> But in the end, your proposal seems to imply:
> 
>   cc: sta...@kernel.org   # Psych!  Just kidding, never backport this!
> 
> but really, that's just mean, and again, this is a VERY rare case you
> are trying to automate here. We have MUCH better and simpler ways for> 
> maintainers to not have their subsystems scanned for stuff like this,
> why are we spending all of our time on this topic?

It started with various minor reasons -- and after some "this would be
nice to have" feedback it felt wrong to give up. It also looked like we
had some agreement already before a new discussion began.

Ciao, Thorsten



Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-18 Thread Mike Rapoport
On Sat, Apr 06, 2024 at 06:36:46PM +0100, Vincent Donnefort wrote:
> In preparation for allowing the user-space to map a ring-buffer, add
> a set of mapping functions:
> 
>   ring_buffer_{map,unmap}()
> 
> And controls on the ring-buffer:
> 
>   ring_buffer_map_get_reader()  /* swap reader and head */
> 
> Mapping the ring-buffer also involves:
> 
>   A unique ID for each subbuf of the ring-buffer, currently they are
>   only identified through their in-kernel VA.
> 
>   A meta-page, where ring-buffer statistics and a
>   description of the current reader are stored
> 
> The linear mapping exposes the meta-page, and each subbuf of the
> ring-buffer, ordered following their unique ID, assigned during the
> first mapping.
> 
> Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
> size will remain unmodified and the splice enabling functions will in
> reality simply memcpy the data instead of swapping subbufs.
> 
> CC: 
> Signed-off-by: Vincent Donnefort 
> 
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index dc5ae4e96aee..96d2140b471e 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -6,6 +6,8 @@
>  #include 
>  #include 
>  
> +#include <uapi/linux/trace_mmap.h>
> +
>  struct trace_buffer;
>  struct ring_buffer_iter;
>  
> @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
> hlist_node *node);
>  #define trace_rb_cpu_prepare NULL
>  #endif
>  
> +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> + struct vm_area_struct *vma);
> +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
> +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
>  #endif /* _LINUX_RING_BUFFER_H */
> diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
> new file mode 100644
> index ..ffcd8dfcaa4f
> --- /dev/null
> +++ b/include/uapi/linux/trace_mmap.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _TRACE_MMAP_H_
> +#define _TRACE_MMAP_H_
> +
> +#include <linux/types.h>
> +
> +/**
> + * struct trace_buffer_meta - Ring-buffer Meta-page description
> + * @meta_page_size:  Size of this meta-page.
> + * @meta_struct_len: Size of this structure.
> + * @subbuf_size: Size of each sub-buffer.
> + * @nr_subbufs:  Number of subbufs in the ring-buffer, including 
> the reader.
> + * @reader.lost_events:  Number of events lost at the time of the reader 
> swap.
> + * @reader.id:   subbuf ID of the current reader. ID range [0 : 
> @nr_subbufs - 1]
> + * @reader.read: Number of bytes read on the reader subbuf.
> + * @flags:   Placeholder for now, 0 until new features are supported.
> + * @entries: Number of entries in the ring-buffer.
> + * @overrun: Number of entries lost in the ring-buffer.
> + * @read:Number of entries that have been read.
> + * @Reserved1:   Reserved for future use.
> + * @Reserved2:   Reserved for future use.
> + */
> +struct trace_buffer_meta {
> + __u32   meta_page_size;
> + __u32   meta_struct_len;
> +
> + __u32   subbuf_size;
> + __u32   nr_subbufs;
> +
> + struct {
> + __u64   lost_events;
> + __u32   id;
> + __u32   read;
> + } reader;
> +
> + __u64   flags;
> +
> + __u64   entries;
> + __u64   overrun;
> + __u64   read;
> +
> + __u64   Reserved1;
> + __u64   Reserved2;

Why do you need reserved fields? This structure always resides in the
beginning of a page and the rest of the page is essentially "reserved".

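Since user space reads this structure directly out of the mmapped meta-page, its layout is ABI. A user-space mirror of the struct quoted above (with __u32/__u64 swapped for <stdint.h> types) makes the offsets checkable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the proposed trace_buffer_meta UAPI layout. The
 * field order and 64-bit alignment fix every offset, so both sides of the
 * mapping can verify them at build time. */
struct trace_buffer_meta {
	uint32_t meta_page_size;
	uint32_t meta_struct_len;

	uint32_t subbuf_size;
	uint32_t nr_subbufs;

	struct {
		uint64_t lost_events;
		uint32_t id;
		uint32_t read;
	} reader;

	uint64_t flags;

	uint64_t entries;
	uint64_t overrun;
	uint64_t read;

	uint64_t Reserved1;
	uint64_t Reserved2;
};
```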
> +};
> +
> +#endif /* _TRACE_MMAP_H_ */
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index cc9ebe593571..793ecc454039 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c

... 

> +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> +unsigned long *subbuf_ids)
> +{
> + struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> + unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> + struct buffer_page *first_subbuf, *subbuf;
> + int id = 0;
> +
> + subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> + cpu_buffer->reader_page->id = id++;
> +
> + first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> + do {
> + if (WARN_ON(id >= nr_subbufs))
> + break;
> +
> + subbuf_ids[id] = (unsigned long)subbuf->page;
> + subbuf->id = id;
> +
> + rb_inc_page(&subbuf);
> + id++;
> + } while (subbuf != first_subbuf);
> +
> + /* install subbuf ID to kern VA translation */
> + cpu_buffer->subbuf_ids = subbuf_ids;
> +
> + /* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> + meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;

Isn't this a 

[PATCH v2 2/3] arm64: dts: qcom: msm8996: add glink-edge nodes

2024-04-18 Thread Dmitry Baryshkov
MSM8996 provides limited glink support, so add corresponding device tree
nodes. For example the following interfaces are provided on db820c:

modem:
208.remoteproc:glink-edge.LOOPBACK_CTL_MPSS.-1.-1
208.remoteproc:glink-edge.glink_ssr.-1.-1
208.remoteproc:glink-edge.rpmsg_chrdev.0.0

adsp:
930.remoteproc:glink-edge.LOOPBACK_CTL_LPASS.-1.-1
930.remoteproc:glink-edge.glink_ssr.-1.-1
930.remoteproc:glink-edge.rpmsg_chrdev.0.0

Reviewed-by: Konrad Dybcio 
Signed-off-by: Dmitry Baryshkov 
---
 arch/arm64/boot/dts/qcom/msm8996.dtsi | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/msm8996.dtsi 
b/arch/arm64/boot/dts/qcom/msm8996.dtsi
index 1601e46549e7..7ae499fa7d91 100644
--- a/arch/arm64/boot/dts/qcom/msm8996.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8996.dtsi
@@ -2453,6 +2453,13 @@ slpi_pil: remoteproc@1c0 {
 
status = "disabled";
 
+   glink-edge {
+   interrupts = ;
+   label = "dsps";
+   qcom,remote-pid = <3>;
mboxes = <&apcs_glb 27>;
+   };
+
smd-edge {
interrupts = ;
 
@@ -2522,6 +2529,13 @@ metadata {
memory-region = <_mem>;
};
 
+   glink-edge {
+   interrupts = ;
+   label = "modem";
+   qcom,remote-pid = <1>;
mboxes = <&apcs_glb 15>;
+   };
+
smd-edge {
interrupts = ;
 
@@ -3467,6 +3481,14 @@ adsp_pil: remoteproc@930 {
 
status = "disabled";
 
+   glink-edge {
+   interrupts = ;
+   label = "lpass";
+   qcom,remote-pid = <2>;
mboxes = <&apcs_glb 9>;
+   };
+
+
smd-edge {
interrupts = ;
 

-- 
2.39.2




[PATCH v2 3/3] arm64: dts: msm8996: add fastrpc nodes

2024-04-18 Thread Dmitry Baryshkov
From: Srinivas Kandagatla 

The ADSP provides fastrpc/compute capabilities. Enable support for the
fastrpc on this DSP.

Signed-off-by: Srinivas Kandagatla 
Signed-off-by: Dmitry Baryshkov 
---
 arch/arm64/boot/dts/qcom/msm8996.dtsi | 57 +++
 1 file changed, 57 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/msm8996.dtsi 
b/arch/arm64/boot/dts/qcom/msm8996.dtsi
index 7ae499fa7d91..f9bbb191661b 100644
--- a/arch/arm64/boot/dts/qcom/msm8996.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8996.dtsi
@@ -3545,6 +3545,63 @@ q6routing: routing {
};
};
};
+
+   fastrpc {
+   compatible = "qcom,fastrpc";
+   qcom,smd-channels = 
"fastrpcsmd-apps-dsp";
+   label = "adsp";
+   qcom,non-secure-domain;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   cb@5 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <5>;
iommus = <&lpass_q6_smmu 5>;
+   };
+
+   cb@6 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <6>;
iommus = <&lpass_q6_smmu 6>;
+   };
+
+   cb@7 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <7>;
iommus = <&lpass_q6_smmu 7>;
+   };
+
+   cb@8 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <8>;
iommus = <&lpass_q6_smmu 8>;
+   };
+
+   cb@9 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <9>;
iommus = <&lpass_q6_smmu 9>;
+   };
+
+   cb@10 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <10>;
iommus = <&lpass_q6_smmu 10>;
+   };
+
+   cb@11 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <11>;
iommus = <&lpass_q6_smmu 11>;
+   };
+
+   cb@12 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <12>;
iommus = <&lpass_q6_smmu 12>;
+   };
+   };
};
};
 

-- 
2.39.2




[PATCH v2 0/3] arm64: dts: qcom: msm8996: enable fastrpc and glink-edge

2024-04-18 Thread Dmitry Baryshkov
Enable the FastRPC and glink-edge nodes on MSM8996 platform. Tested on
APQ8096 Dragonboard820c.

Signed-off-by: Dmitry Baryshkov 
---
Changes in v2:
- Fixed order of compute nodes (Konrad)
- Link to v1: 
https://lore.kernel.org/r/20240401-msm8996-remoteproc-v1-0-f02ab47fc...@linaro.org

---
Dmitry Baryshkov (2):
  dt-bindings: remoteproc: qcom,msm8996-mss-pil: allow glink-edge on msm8996
  arm64: dts: qcom: msm8996: add glink-edge nodes

Srinivas Kandagatla (1):
  arm64: dts: msm8996: add fastrpc nodes

 .../bindings/remoteproc/qcom,msm8996-mss-pil.yaml  |  1 -
 arch/arm64/boot/dts/qcom/msm8996.dtsi  | 79 ++
 2 files changed, 79 insertions(+), 1 deletion(-)
---
base-commit: 4eab358930711bbeb85bf5ee267d0d42d3394c2c
change-id: 20240320-msm8996-remoteproc-fccbc2b54ea1

Best regards,
-- 
Dmitry Baryshkov 




[PATCH v2 1/3] dt-bindings: remoteproc: qcom,msm8996-mss-pil: allow glink-edge on msm8996

2024-04-18 Thread Dmitry Baryshkov
MSM8996 has limited glink support, allow glink-edge node on MSM8996
platform.

Acked-by: Rob Herring 
Signed-off-by: Dmitry Baryshkov 
---
 Documentation/devicetree/bindings/remoteproc/qcom,msm8996-mss-pil.yaml | 1 -
 1 file changed, 1 deletion(-)

diff --git 
a/Documentation/devicetree/bindings/remoteproc/qcom,msm8996-mss-pil.yaml 
b/Documentation/devicetree/bindings/remoteproc/qcom,msm8996-mss-pil.yaml
index 971734085d51..4d2055f283ac 100644
--- a/Documentation/devicetree/bindings/remoteproc/qcom,msm8996-mss-pil.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/qcom,msm8996-mss-pil.yaml
@@ -231,7 +231,6 @@ allOf:
 - const: snoc_axi
 - const: mnoc_axi
 - const: qdss
-glink-edge: false
   required:
 - pll-supply
 - smd-edge

-- 
2.39.2




[PATCH 2/2] arm64: dts: qcom: sdm632-fairphone-fp3: Enable vibrator

2024-04-18 Thread Luca Weiss
Enable the vibrator on the PMI632 which is used on this phone.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts 
b/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts
index e2708c74e95a..2c1172aa97e4 100644
--- a/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts
+++ b/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts
@@ -143,6 +143,10 @@ _vbus {
status = "okay";
 };
 
+&pmi632_vib {
+   status = "okay";
+};
+
 _1 {
status = "okay";
vmmc-supply = <_l8>;

-- 
2.44.0




[PATCH 1/2] arm64: dts: qcom: pmi632: Add vibrator

2024-04-18 Thread Luca Weiss
Add a node for the vibrator module found inside the PMI632.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/pmi632.dtsi | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/pmi632.dtsi b/arch/arm64/boot/dts/qcom/pmi632.dtsi
index 94d53b1cf6c8..b4313728f3e7 100644
--- a/arch/arm64/boot/dts/qcom/pmi632.dtsi
+++ b/arch/arm64/boot/dts/qcom/pmi632.dtsi
@@ -200,5 +200,11 @@ pmi632_lpg: pwm {
 
status = "disabled";
};
+
+   pmi632_vib: vibrator@5700 {
+   compatible = "qcom,pmi632-vib";
+   reg = <0x5700>;
+   status = "disabled";
+   };
};
 };

-- 
2.44.0




[PATCH 0/2] Enable vibrator on PMI632 + Fairphone 3

2024-04-18 Thread Luca Weiss
With the patches to add vibration support for PMI632 finally applied,
let's enable this for the PMI632 PMIC and Fairphone 3 smartphone.

https://lore.kernel.org/linux-arm-msm/20240416-pm8xxx-vibrator-new-design-v11-0-7b1c951e1...@quicinc.com/

Patches have landed in the input tree:
https://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git/

Signed-off-by: Luca Weiss 
---
Luca Weiss (2):
  arm64: dts: qcom: pmi632: Add vibrator
  arm64: dts: qcom: sdm632-fairphone-fp3: Enable vibrator

 arch/arm64/boot/dts/qcom/pmi632.dtsi  | 6 ++++++
 arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts | 4 ++++
 2 files changed, 10 insertions(+)
---
base-commit: eecc5d90861b551d3b8fbd0d0e6f25c40496f3c0
change-id: 20240418-fp3-vibra-18c400889853

Best regards,
-- 
Luca Weiss 




[PATCH 3/3] virtio_balloon: introduce memory scan/reclaim info

2024-04-18 Thread zhenwei pi
Expose memory scan/reclaim information to the host side via virtio
balloon device.

Now we have metrics to analyze memory performance:

y: counter increases
n: counter does not change
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
  the guest runs under really critical memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
  the memory allocation stalls due to cgroup, not the global memory
  pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
  the memory allocation stalls due to global memory pressure. The
  performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows
  quite effective memory reclaiming.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
  the memory allocation stalls due to global memory pressure.
  The ratio between DRCLM/DSCAN gets low, and the guest OS is thrashing
  heavily; in serious cases this leads to poor performance and makes
  troubleshooting difficult. For example, sshd may block on memory
  allocation when accepting new connections, so a user cannot log in
  to the VM via ssh.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
  a low ratio between ARCLM/ASCAN shows that the guest tries to
  reclaim more memory but cannot. Once more memory is required in
  the future, it will struggle to reclaim it.

Signed-off-by: zhenwei pi 
---
 drivers/virtio/virtio_balloon.c |  9 +++++++++
 include/uapi/linux/virtio_balloon.h | 12 ++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index e88e6573afa5..bc9332c1ae85 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -356,6 +356,15 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
stall += events[ALLOCSTALL_MOVABLE];
update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
 
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN,
+   pages_to_bytes(events[PGSCAN_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN,
+   pages_to_bytes(events[PGSCAN_DIRECT]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_DIRECT]));
+
 #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 487b893a160e..ee35a372805d 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -73,7 +73,11 @@ struct virtio_balloon_config {
 #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures */
 #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
 #define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation */
-#define VIRTIO_BALLOON_S_NR   12
+#define VIRTIO_BALLOON_S_ASYNC_SCAN    12 /* Amount of memory scanned asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_SCAN   13 /* Amount of memory scanned directly */
+#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed directly */
+#define VIRTIO_BALLOON_S_NR   16
 
 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -87,7 +91,11 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
-   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \
 }
 
 #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
-- 
2.34.1




[PATCH 2/3] virtio_balloon: introduce memory allocation stall counter

2024-04-18 Thread zhenwei pi
The memory allocation stall counter represents the performance/latency
of memory allocation. Expose this counter to the host side through the
virtio balloon device, via the out-of-band stats mechanism.

Signed-off-by: zhenwei pi 
---
 drivers/virtio/virtio_balloon.c | 20 +++++++++++++++++++-
 include/uapi/linux/virtio_balloon.h |  6 ++++--
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fd19934a847f..e88e6573afa5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
unsigned int idx = 0;
-   long available;
+   long available, stall = 0;
unsigned long caches;
 
all_vm_events(events);
@@ -338,6 +338,24 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
+
+   /* sum all the stall events */
+#ifdef CONFIG_ZONE_DMA
+   stall += events[ALLOCSTALL_DMA];
+#endif
+#ifdef CONFIG_ZONE_DMA32
+   stall += events[ALLOCSTALL_DMA32];
+#endif
+#ifdef CONFIG_HIGHMEM
+   stall += events[ALLOCSTALL_HIGH];
+#endif
+#ifdef CONFIG_ZONE_DEVICE
+   stall += events[ALLOCSTALL_DEVICE];
+#endif
+   stall += events[ALLOCSTALL_NORMAL];
+   stall += events[ALLOCSTALL_MOVABLE];
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
+
 #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index b17bbe033697..487b893a160e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -72,7 +72,8 @@ struct virtio_balloon_config {
 #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
 #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures */
 #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
-#define VIRTIO_BALLOON_S_NR   11
+#define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation */
+#define VIRTIO_BALLOON_S_NR   12
 
 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -85,7 +86,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
-   VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
 }
 
 #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
-- 
2.34.1




[PATCH 1/3] virtio_balloon: introduce oom-kill invocations

2024-04-18 Thread zhenwei pi
When the guest OS runs under critical memory pressure, it starts
killing processes. A guest monitor agent may scan 'oom_kill' from
/proc/vmstat and report the OOM-kill event. However, the agent itself
may be killed, and we would then lose this critical event (and the
later events).

For now we can also grep for magic words in the guest kernel log from
the host side. Rather than relying on this unstable method, have the
virtio balloon device report OOM-kill invocations instead.

Signed-off-by: zhenwei pi 
---
 drivers/virtio/virtio_balloon.c | 1 +
 include/uapi/linux/virtio_balloon.h | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..fd19934a847f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -337,6 +337,7 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
pages_to_bytes(events[PSWPOUT]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
 #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index ddaa45e723c4..b17bbe033697 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -71,7 +71,8 @@ struct virtio_balloon_config {
 #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
 #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
 #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures */
-#define VIRTIO_BALLOON_S_NR   10
+#define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
+#define VIRTIO_BALLOON_S_NR   11
 
 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -83,7 +84,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
-   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \
+   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
 }
 
 #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
-- 
2.34.1




[PATCH 0/3] Improve memory statistics for virtio balloon

2024-04-18 Thread zhenwei pi
RFC -> v1:
- several text changes: oom-kill -> oom-kills, SCAN_ASYNC -> ASYNC_SCAN.
- move the VM event code under '#ifdef CONFIG_VM_EVENT_COUNTERS'

RFC version:
Link: https://lore.kernel.org/lkml/20240415084113.1203428-1-pizhen...@bytedance.com/T/#m1898963b3c27a989b1123db475135c3ca687ca84

zhenwei pi (3):
  virtio_balloon: introduce oom-kill invocations
  virtio_balloon: introduce memory allocation stall counter
  virtio_balloon: introduce memory scan/reclaim info

 drivers/virtio/virtio_balloon.c | 30 +++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_balloon.h | 16 ++++++++++++++--
 2 files changed, 43 insertions(+), 3 deletions(-)

-- 
2.34.1