Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Mike Rapoport
On Thu, Apr 18, 2024 at 02:01:22PM -0700, Song Liu wrote:
> On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport  wrote:
> >
> > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > > > >
> > > > > > I'm looking at execmem_types more as definition of the consumers,
> > > > > > maybe I should have named the enum execmem_consumer in the first place.
> > > > >
> > > > > I think looking at execmem_type from consumers' point of view adds
> > > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, 
> > > > > kprobe,
> > > > > and bpf (and maybe also module text) all have the same requirements.
> > > > > Did I miss something?
> > > >
> > > > It's enough to have one architecture with different constraints for
> > > > kprobes and bpf to warrant a type for each.
> > >
> > > AFAICT, some of these constraints can be changed without too much work.
> >
> > But why?
> > I honestly don't understand what you are trying to optimize here. A few
> > lines of initialization in execmem_info?
> 
> IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
> harder for bpf and kprobe to share the same ROX page. In many use cases,
> a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
> module text. It is not efficient if we have to allocate separate pages
> for each of these use cases. If this is not a problem, the current
> approach works.

The caching of large ROX pages does not need to be per type. 

In the POC I've posted for caching of large ROX pages on x86 [1], the cache is
global, and to make kprobes and bpf use it, it's enough to set a flag in
execmem_info.

[1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org
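For illustration, such an opt-in could look roughly like this in an
architecture's execmem_info (the EXECMEM_ROX_CACHE flag name and the exact
fields are assumptions here, not the series' final API):

	static struct execmem_info execmem_info __ro_after_init = {
		.ranges = {
			[EXECMEM_KPROBES] = {
				.start		= MODULES_VADDR,
				.end		= MODULES_END,
				.pgprot		= PAGE_KERNEL_ROX,
				.alignment	= 1,
				.flags		= EXECMEM_ROX_CACHE,	/* use shared cache */
			},
			[EXECMEM_BPF] = {
				.start		= MODULES_VADDR,
				.end		= MODULES_END,
				.pgprot		= PAGE_KERNEL_ROX,
				.alignment	= 1,
				.flags		= EXECMEM_ROX_CACHE,	/* same large pages */
			},
		},
	};

Both consumers would then be served from the same large-ROX-page cache while
still keeping their own execmem types.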

> Thanks,
> Song

-- 
Sincerely yours,
Mike.



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-18 Thread Google
Hi Steve,

Can you review this series? Especially, [07/36] and [12/36] have been changed
a lot from your original patch.

Thank you,

On Mon, 15 Apr 2024 21:48:59 +0900
"Masami Hiramatsu (Google)"  wrote:

> Hi,
> 
> Here is the 9th version of the series to re-implement the fprobe on
> function-graph tracer. The previous version is;
> 
> https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/
> 
> This version is ported on the latest kernel (v6.9-rc3 + probes/for-next)
> and fixed some bugs + performance optimization patch[36/36].
>  - [12/36] Fix to clear fgraph_array entry in registration failure, also
>return -ENOSPC when fgraph_array is full.
>  - [28/36] Add new store_fprobe_entry_data() for fprobe.
>  - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation.
>  - [36/36] Add new flag to skip timestamp recording.
> 
> Overview
> 
> This series makes 2 major changes: it enables multiple function-graphs on
> the ftrace (e.g. allows function-graph on sub instances) and rewrites the
> fprobe on top of this function-graph.
> 
> The former change was first sent by Steven Rostedt 4 years ago (*); it
> allows users to set different function-graph tracer settings (and settings
> of other tracers based on function-graph) in each trace-instance at the
> same time.
> 
> (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/
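A rough sketch of the registration side of that change (simplified; the real
code also handles synchronization and the callbacks, and the names here are
approximations, not the series' exact code):

	#define FGRAPH_ARRAY_SIZE	16

	static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];

	int register_ftrace_graph(struct fgraph_ops *gops)
	{
		int i;

		/* each function-graph user claims a free slot in the array */
		for (i = 0; i < FGRAPH_ARRAY_SIZE; i++) {
			if (!fgraph_array[i]) {
				fgraph_array[i] = gops;
				gops->idx = i;
				return 0;
			}
		}

		return -ENOSPC;	/* array full, cf. the [12/36] fix above */
	}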
> 
> The purposes of the latter change are:
> 
>  1) Remove dependency of the rethook from fprobe so that we can reduce
>the return hook code and shadow stack.
> 
>  2) Make 'ftrace_regs' the common trace interface for the function
>boundary.
> 
> 1) Currently we have 2 (or 3) different function return hook codes,
>  the function-graph tracer and rethook (and the legacy kretprobe).
>  But since this is redundant and doubles the maintenance cost,
>  I would like to unify those. From the user's viewpoint, the function-
>  graph tracer is very useful to grasp the execution path. For this
>  purpose, it is hard to use the rethook in the function-graph
>  tracer, but the opposite is possible. (Strictly speaking, kretprobe
>  can not use it because it requires 'pt_regs' for historical reasons.)
> 
> 2) Now the fprobe provides the 'pt_regs' for its handler, but that is
>  wrong for the function entry and exit. Moreover, depending on the
>  architecture, there is no way to accurately reproduce 'pt_regs'
>  outside of interrupt or exception handlers. This means fprobe should
>  not use 'pt_regs' because it does not use such exceptions.
>  (Conversely, kprobe should use 'pt_regs' because it is an abstract
>   interface of the software breakpoint exception.)
> 
> This series changes fprobe to use function-graph tracer for tracing
> function entry and exit, instead of mixture of ftrace and rethook.
> Unlike the rethook which is a per-task list of system-wide allocated
> nodes, the function graph's ret_stack is a per-task shadow stack.
> Thus it does not need to set 'nr_maxactive' (which is the number of
> pre-allocated nodes).
> Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'.
> Since eBPF multi_kprobe/multi_kretprobe events still use 'pt_regs' as
> their register interface, this series converts 'ftrace_regs' to
> 'pt_regs' for them. Of course this conversion makes an incomplete
> 'pt_regs', so users must access only the registers for function
> parameters or the return value.
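A sketch of that conversion on the BPF side (ftrace_partial_regs() is the
helper this relies on; treat the exact signature and surrounding code as
approximations, not the series' code):

	static void bpf_multi_kprobe_handler(struct ftrace_regs *fregs)
	{
		struct pt_regs regs_storage;
		struct pt_regs *regs;

		/* fill only the registers that ftrace_regs actually carries */
		regs = ftrace_partial_regs(fregs, &regs_storage);

		/*
		 * 'regs' is incomplete: only the function parameter and
		 * return value registers are valid here.
		 */
		/* ... run the BPF program with 'regs' ... */
	}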
> 
> Design
> --
> Instead of using ftrace's function entry hook directly, the new fprobe
> is built on top of the function-graph's entry and return callbacks
> with 'ftrace_regs'.
> 
> Since the fprobe requires access to 'ftrace_regs', the architecture
> must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and
> CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph
> entry callback with 'ftrace_regs', and also
> CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to
> return_to_handler.
> 
> All fprobes share a single function-graph ops (meaning they share a common
> ftrace filter), similar to the kprobe-on-ftrace. This needs another
> layer to find the corresponding fprobe in the common function-graph
> callbacks, but has much better scalability, since the number of
> registered function-graph ops is limited.
> 
> In the entry callback, the fprobe runs its entry_handler and saves the
> address of 'fprobe' on the function-graph's shadow stack as data. The
> return callback decodes the data to get the 'fprobe' address, and runs
> the exit_handler.
> 
> The fprobe introduces two hash-tables: one is for the entry callback, which
> searches for fprobes related to the given function address passed to the
> entry callback. The other is for the return callback, which checks if the
> given 'fprobe' data structure pointer is still valid. Note that it is
> possible to unregister an fprobe before the return callback runs. Thus
> the address validation must be done before using it in the return
> callback.
> 
> This series can be applied against the probes/for-next branch, which
> is based
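To make the callback scheme described above concrete, here is a minimal
sketch (NOT the series' actual code -- fprobe_hash_lookup(),
fgraph_reserve_data(), fgraph_retrieve_data() and fprobe_is_valid() are
illustrative assumptions):

	static int fprobe_fgraph_entry(unsigned long func, struct ftrace_regs *fregs)
	{
		struct fprobe *fp = fprobe_hash_lookup(func);	/* hash table #1 */
		struct fprobe **slot;

		if (!fp)
			return 0;	/* no fprobe here, skip the return hook */

		if (fp->entry_handler)
			fp->entry_handler(fp, func, fregs);

		/* stash the fprobe pointer as per-task shadow-stack data */
		slot = fgraph_reserve_data(sizeof(*slot));
		*slot = fp;
		return 1;
	}

	static void fprobe_fgraph_return(unsigned long func, struct ftrace_regs *fregs)
	{
		struct fprobe *fp = *(struct fprobe **)fgraph_retrieve_data();

		/* hash table #2: fp may have been unregistered meanwhile */
		if (!fprobe_is_valid(fp))
			return;

		if (fp->exit_handler)
			fp->exit_handler(fp, func, fregs);
	}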

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-18 Thread Steven Rostedt
On Thu, 18 Apr 2024 09:55:55 +0300
Mike Rapoport  wrote:

Hi Mike,

Thanks for doing this review!

> > +/**
> > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > + * @meta_page_size:Size of this meta-page.
> > + * @meta_struct_len:   Size of this structure.
> > + * @subbuf_size:   Size of each sub-buffer.
> > + * @nr_subbufs:Number of subbufs in the ring-buffer, including the reader.
> > + * @reader.lost_events:Number of events lost at the time of the reader swap.
> > + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
> > + * @reader.read:   Number of bytes read on the reader subbuf.
> > + * @flags: Placeholder for now, 0 until new features are supported.
> > + * @entries:   Number of entries in the ring-buffer.
> > + * @overrun:   Number of entries lost in the ring-buffer.
> > + * @read:  Number of entries that have been read.
> > + * @Reserved1: Reserved for future use.
> > + * @Reserved2: Reserved for future use.
> > + */
> > +struct trace_buffer_meta {
> > +   __u32   meta_page_size;
> > +   __u32   meta_struct_len;
> > +
> > +   __u32   subbuf_size;
> > +   __u32   nr_subbufs;
> > +
> > +   struct {
> > +   __u64   lost_events;
> > +   __u32   id;
> > +   __u32   read;
> > +   } reader;
> > +
> > +   __u64   flags;
> > +
> > +   __u64   entries;
> > +   __u64   overrun;
> > +   __u64   read;
> > +
> > +   __u64   Reserved1;
> > +   __u64   Reserved2;  
> 
> Why do you need reserved fields? This structure always resides in the
> beginning of a page and the rest of the page is essentially "reserved".

So this code is also going to be used in arm's pKVM hypervisor code, where
it will be using these fields. But since we are looking at keeping the same
interface between the two, we don't want them used by this interface.

We probably should add a comment about that.
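Something like this in the kernel-doc, for instance (our wording, purely
illustrative of such a comment):

 * @Reserved1:	Reserved for future use. Kept in the layout for the pKVM
 *		hypervisor variant of this interface; always 0 via this
 *		interface.
 * @Reserved2:	Same as @Reserved1.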

> 
> > +};
> > +
> > +#endif /* _TRACE_MMAP_H_ */
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index cc9ebe593571..793ecc454039 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c  
> 
> ... 
> 
> > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> > +  unsigned long *subbuf_ids)
> > +{
> > +   struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > +   unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> > +   struct buffer_page *first_subbuf, *subbuf;
> > +   int id = 0;
> > +
> > +   subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> > +   cpu_buffer->reader_page->id = id++;
> > +
> > +   first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> > +   do {
> > +   if (WARN_ON(id >= nr_subbufs))
> > +   break;
> > +
> > +   subbuf_ids[id] = (unsigned long)subbuf->page;
> > +   subbuf->id = id;
> > +
> > +   rb_inc_page(&subbuf);
> > +   id++;
> > +   } while (subbuf != first_subbuf);
> > +
> > +   /* install subbuf ID to kern VA translation */
> > +   cpu_buffer->subbuf_ids = subbuf_ids;
> > +
> > +   /* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> > +   meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;  
> 
> Isn't this a single page?

One thing we are doing is to make sure that the subbuffers are aligned
by their size. If a subbuffer is 3 pages, it should be aligned on 3
page boundaries. This was something that Linus suggested.
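As a concrete instance of that alignment (illustrative numbers, not from the
patch): with PAGE_SIZE = 4096 and subbuf_order = 1, each sub-buffer is 2
pages, so

	meta->meta_page_size = PAGE_SIZE << 1 = 8192

and the first sub-buffer in the user mapping then starts on an 8 KiB
boundary, i.e. aligned to its own size.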

> 
> > +   meta->meta_struct_len = sizeof(*meta);
> > +   meta->nr_subbufs = nr_subbufs;
> > +   meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> > +
> > +   rb_update_meta_page(cpu_buffer);
> > +}  
> 
> ...
> 
> > +#define subbuf_page(off, start) \
> > +   virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> > +
> > +#define foreach_subbuf_page(sub_order, start, page)\  
> 
> Nit: usually iterators in kernel use for_each

Ah, good catch. Yeah, that should be changed. But then ...

> 
> > +   page = subbuf_page(0, (start)); \
> > +   for (int __off = 0; __off < (1 << (sub_order)); \
> > +__off++, page = subbuf_page(__off, (start)))  
> 
> The pages are allocated with alloc_pages_node(.. subbuf_order) are
> physically contiguous and struct pages for them are also contiguous, so
> inside a subbuf_order allocation you can just do page++.
> 

I'm wondering if we should just nuke the macro. It was there because the
previous implementation did things twice. But now it's just done once here:

+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page;
+
+   foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
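For reference, a sketch of that same copy loop with the macro nuked, using
Mike's observation that the struct pages of a subbuf_order allocation are
contiguous (variable names follow the snippet above; illustrative only, not
the actual patch):

	while (s < nr_subbufs && p < nr_pages) {
		/* pages of one subbuf_order allocation are contiguous */
		struct page *page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
		int off;

		for (off = 0; off < (1 << subbuf_order) && p < nr_pages; off++)
			pages[p++] = page++;

		s++;
	}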

Re: [PATCH 1/2] tracing/user_events: Fix non-spaced field matching

2024-04-18 Thread Google
On Tue, 16 Apr 2024 22:41:01 +
Beau Belgrave  wrote:

> When the ABI was updated to prevent same name w/different args, it
> missed an important corner case when fields don't end with a space.
> Typically, space is used for fields to help separate them, like
> "u8 field1; u8 field2". If no spaces are used, like
> "u8 field1;u8 field2", then the parsing works for the first time.
> However, the match check fails on a subsequent register, leading to
> confusion.
> 
> This is because the match check uses argv_split() and assumes that all
> fields will be split upon the space. When spaces are used, we get back
> { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
> This causes a mismatch, and the user program gets back -EADDRINUSE.
> 
> Add a method to detect this case before calling argv_split(). If found
> force a space after the field separator character ';'. This ensures all
> cases work properly for matching.
> 
> With this fix, the following are all treated as matching:
> u8 field1;u8 field2
> u8 field1; u8 field2
> u8 field1;\tu8 field2
> u8 field1;\nu8 field2

Sounds good to me. I just have some nits.

> 
> Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different args event")
> Signed-off-by: Beau Belgrave 
> ---
>  kernel/trace/trace_events_user.c | 88 +++-
>  1 file changed, 87 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index 70d428c394b6..9184d3962b2a 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -1989,6 +1989,92 @@ static int user_event_set_tp_name(struct user_event *user)
>   return 0;
>  }
>  
> +/*
> + * Counts how many ';' without a trailing space are in the args.
> + */
> +static int count_semis_no_space(char *args)
> +{
> + int count = 0;
> +
> + while ((args = strchr(args, ';'))) {
> + args++;
> +
> + if (!isspace(*args))
> + count++;
> + }
> +
> + return count;
> +}
> +
> +/*
> + * Copies the arguments while ensuring all ';' have a trailing space.
> + */
> +static char *fix_semis_no_space(char *args, int count)

nit: This name does not represent what it does. 'insert_space_after_semis()'
is more self-descriptive.

> +{
> + char *fixed, *pos;
> + char c, last;
> + int len;
> +
> + len = strlen(args) + count;
> + fixed = kmalloc(len + 1, GFP_KERNEL);
> +
> + if (!fixed)
> + return NULL;
> +
> + pos = fixed;
> + last = '\0';
> +
> + while (len > 0) {
> + c = *args++;
> +
> + if (last == ';' && !isspace(c)) {
> + *pos++ = ' ';
> + len--;
> + }
> +
> + if (len > 0) {
> + *pos++ = c;
> + len--;
> + }
> +
> + last = c;
> + }

nit: This loop can be simpler, because we are sure 'fixed' has enough length:

	/* insert a space after ';' if there is no space. */
	while (*args) {
		*pos = *args++;
		if (*pos++ == ';' && !isspace(*args))
			*pos++ = ' ';
	}

> +
> + /*
> +  * len is the length of the copy excluding the null.
> +  * This ensures we always have room for a null.
> +  */
> + *pos = '\0';
> +
> + return fixed;
> +}
> +
> +static char **user_event_argv_split(char *args, int *argc)
> +{
> + /* Count how many ';' without a trailing space */
> + int count = count_semis_no_space(args);
> +
> + if (count) {

nit: it is better to exit fast, so 

if (!count)
return argv_split(GFP_KERNEL, args, argc);

...

Thank you,

OT: BTW, can this also simplify synthetic events?

> + /* We must fixup 'field;field' to 'field; field' */
> + char *fixed = fix_semis_no_space(args, count);
> + char **split;
> +
> + if (!fixed)
> + return NULL;
> +
> + /* We do a normal split afterwards */
> + split = argv_split(GFP_KERNEL, fixed, argc);
> +
> + /* We can free since argv_split makes a copy */
> + kfree(fixed);
> +
> + return split;
> + }
> +
> + /* No fixup is required */
> + return argv_split(GFP_KERNEL, args, argc);
> +}
> +
>  /*
>   * Parses the event name, arguments and flags then registers if successful.
>   * The name buffer lifetime is owned by this method for success cases only.
> @@ -2012,7 +2098,7 @@ static int user_event_parse(struct user_event_group *group, char *name,
>   return -EPERM;
>  
>   if (args) {
> - argv = argv_split(GFP_KERNEL, args, &argc);
> + argv = user_event_argv_split(args, &argc);
>  
>   if (!argv)
>   return -ENOMEM;
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 



[PATCH 1/1] genirq/cpuhotplug: retry with online CPUs on irq_do_set_affinity failure

2024-04-18 Thread Dongli Zhang
When a CPU goes offline, its IRQs may migrate to other CPUs. Managed IRQs
are migrated, or shut down if all CPUs in the managed IRQ's affinity are
offline. For regular IRQs, there will only be a migration.

migrate_one_irq() first uses the pending_mask or affinity_mask of the IRQ.

104 if (irq_fixup_move_pending(desc, true))
105 affinity = irq_desc_get_pending_mask(desc);
106 else
107 affinity = irq_data_get_affinity_mask(d);

migrate_one_irq() may fall back to all online CPUs if all CPUs in the
pending_mask/affinity_mask are already offline.

113 if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
114 /*
115  * If the interrupt is managed, then shut it down and leave
116  * the affinity untouched.
117  */
118 if (irqd_affinity_is_managed(d)) {
119 irqd_set_managed_shutdown(d);
120 irq_shutdown_and_deactivate(desc);
121 return false;
122 }
123 affinity = cpu_online_mask;
124 brokeaff = true;
125 }

However, there is a corner case: although some CPUs in the
pending_mask/affinity_mask are still online, they may lack available
vectors. If the kernel keeps calling irq_do_set_affinity() with those
CPUs, it fails with -ENOSPC.

This is not reasonable, as other online CPUs still have many available
vectors.

name:   VECTOR
 size:   0
 mapped: 529
 flags:  0x0103
Online bitmaps:        7
Global available:    884
Global reserved:       6
Total allocated:     539
System: 36: 0-19,21,50,128,236,243-244,246-255
 | CPU | avl | man | mac | act | vectors
     0   147     0     0    55  32-49,51-87
     1   147     0     0    55  32-49,51-87
     2     0     0     0   202  32-49,51-127,129-235
     4   147     0     0    55  32-49,51-87
     5   147     0     0    55  32-49,51-87
     6   148     0     0    54  32-49,51-86
     7   148     0     0    54  32-49,51-86

This issue should not happen for managed IRQs because their vectors are
already reserved before CPU hotplug. For regular IRQs, do a retry with all
online CPUs if the prior irq_do_set_affinity() failed with -ENOSPC.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 kernel/irq/cpuhotplug.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 1ed2b1739363..d1666a6b73f4 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -130,6 +130,19 @@ static bool migrate_one_irq(struct irq_desc *desc)
 * CPU.
 */
err = irq_do_set_affinity(d, affinity, false);
+
+   if (err == -ENOSPC &&
+       !irqd_affinity_is_managed(d) &&
+       affinity != cpu_online_mask) {
+           pr_debug("IRQ%u: set affinity failed for %*pbl, re-try with all online CPUs\n",
+                    d->irq, cpumask_pr_args(affinity));
+
+           affinity = cpu_online_mask;
+           brokeaff = true;
+
+           err = irq_do_set_affinity(d, affinity, false);
+   }
+
if (err) {
pr_warn_ratelimited("IRQ%u: set affinity failed(%d).\n",
d->irq, err);
-- 
2.34.1




[PATCH 0/1] genirq/cpuhotplug: fix CPU hotplug set affinity failure issue

2024-04-18 Thread Dongli Zhang
Please refer to the commit message of the patch for details.

The cover letter demonstrates how to reproduce the issue on purpose with
QEMU/KVM + virtio-net (that's why virtualizat...@lists.linux.dev is CCed).

Thank you very much!



1. Build the mainline linux kernel.

$ make defconfig
$ scripts/config --file ".config" -e CONFIG_X86_X2APIC \
  -e CONFIG_GENERIC_IRQ_DEBUGFS
$ make olddefconfig
$ make -j24 > /dev/null

Confirm the config is enabled.

$ cat .config | grep CONFIG_GENERIC_IRQ_DEBUGFS
CONFIG_GENERIC_IRQ_DEBUGFS=y


2. Create the VM with the below QEMU command line. The libvirt virbr0 is used
as the bridge for virtio-net.

---
$ cat qemu-ifup
#!/bin/sh
# Script to bring a network (tap) device for qemu up.

br="virbr0"
ifconfig $1 up
brctl addif $br "$1"
exit
---

/home/zhang/kvm/qemu-8.2.0/build/qemu-system-x86_64 \
-hda ubuntu2204.qcow2 -m 8G -smp 32 -vnc :5 -enable-kvm -cpu host \
-net nic -net user,hostfwd=tcp::5025-:22 \
-device virtio-net-pci,netdev=tapnet01,id=net01,mac=01:54:00:12:34:56,bus=pci.0,addr=0x4,mq=true,vectors=257 \
-netdev tap,id=tapnet01,ifname=tap01,script=qemu-ifup,downscript=no,queues=128,vhost=off \
-device virtio-net-pci,netdev=tapnet02,id=net02,mac=02:54:00:12:34:56,bus=pci.0,addr=0x5,mq=true,vectors=257 \
-netdev tap,id=tapnet02,ifname=tap02,script=qemu-ifup,downscript=no,queues=128,vhost=off \
-kernel /home/zhang/img/debug/mainline-linux/arch/x86_64/boot/bzImage \
-append "root=/dev/sda3 init=/sbin/init text loglevel=7 console=ttyS0" \
-serial stdio -name debug-threads=on


3. Use procfs to confirm the virtio IRQ numbers.

$ cat /proc/interrupts | grep virtio
 24: ... ... PCI-MSIX-:00:04.0   0-edge  virtio0-config
 25: ... ... PCI-MSIX-:00:04.0   1-edge  virtio0-input.0
... ...
537: ... ... PCI-MSIX-:00:05.0 256-edge  virtio1-output.127

Reset the affinity of IRQs 25-537 to CPUs=2,3.

---
#!/bin/sh

for irq in {25..537}
do
  echo $irq
  echo 2,3 > /proc/irq/$irq/smp_affinity_list
  cat /proc/irq/$irq/smp_affinity_list
  cat /proc/irq/$irq/effective_affinity_list
  echo ""
done
---

Now offline CPU=8-31.

---
#!/bin/sh

for cpu in {8..31}
do
  echo $cpu
  echo 0 > /sys/devices/system/cpu/cpu$cpu/online
done
---


The below is the current VECTOR debugfs.

# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 529
 flags:  0x0103
Online bitmaps:        8
Global available:   1090
Global reserved:       6
Total allocated:     536
System: 36: 0-19,21,50,128,236,243-244,246-255
 | CPU | avl | man | mac | act | vectors
     0   169     0     0    33  32-49,51-65
     1   171     0     0    31  32-49,51-63
     2    26     0     0   176  32-49,52-127,129-210
     3    27     0     0   175  32-49,51-127,129-171,173-209
     4   175     0     0    27  32-49,51-59
     5   175     0     0    27  32-49,51-59
     6   172     0     0    30  32-49,51-62
     7   175     0     0    27  32-49,51-59


4. Now offline CPU=3.

# echo 0 > /sys/devices/system/cpu/cpu3/online

There are below from dmesg.

[   96.234045] IRQ151: set affinity failed(-28).
[   96.234064] IRQ156: set affinity failed(-28).
[   96.234078] IRQ158: set affinity failed(-28).
[   96.234091] IRQ159: set affinity failed(-28).
[   96.234105] IRQ161: set affinity failed(-28).
[   96.234118] IRQ162: set affinity failed(-28).
[   96.234132] IRQ163: set affinity failed(-28).
[   96.234145] IRQ164: set affinity failed(-28).
[   96.234159] IRQ165: set affinity failed(-28).
[   96.234172] IRQ166: set affinity failed(-28).
[   96.235013] IRQ fixup: irq 339 move in progress, old vector 48
[   96.237129] smpboot: CPU 3 is now offline


Although other CPUs have many available vectors, only CPU=2 is used.

# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 529
 flags:  0x0103
Online bitmaps:        7
Global available:   1022
Global reserved:       6
Total allocated:     533
System: 36: 0-19,21,50,128,236,243-244,246-255
 | CPU | avl | man | mac | act | vectors
     0   168     0     0    34  32-49,51-53,55-57,59-68
     1   165     0     0    37  32-49,51-57,59-60,64-73
     2     0     0     0   202  32-49,51-127,129-235
     4   173     0     0    29  32-40,42-48,52-63,65
     5   171     0     0    31  32-49,51-54,56,58-62,64-66
     6   172     0     0    30  32-49,51-52,54-57,59-63,65
     7   173     0     0    29  32-49,51-52,54-58,60-62,64


Dongli Zhang





Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-18 Thread Huang, Kai




On 16/04/2024 3:20 pm, Haitao Huang wrote:

From: Kristen Carlson Accardi 

In cases where EPC pages need to be allocated during a page fault and the
cgroup usage is near its limit, an asynchronous reclamation needs to be
triggered to avoid blocking the page fault handling.

Create a workqueue, a corresponding work item, and function definitions
for the EPC cgroup to support the asynchronous reclamation.

In case the workqueue allocation fails during init, disable the cgroup.


It's fine and reasonable to disable (SGX EPC) cgroup.  The problem is 
"exactly what does this mean" isn't quite clear.


Given SGX EPC is just one type of MISC cgroup resources, we cannot just 
disable MISC cgroup as a whole.


So, the first interpretation is that we treat the entire MISC_CG_RES_SGX
resource type as if it doesn't exist, that is, we just don't show control
files in the file system, and all EPC pages are tracked in the global list.


But it might not be straightforward to implement in the SGX driver, i.e.,
we might need more MISC cgroup core code changes to support disabling a
particular resource at runtime -- I need to double check.


So if that is not worth doing, we will still need to live with the fact
that the user is still able to create SGX cgroups in the hierarchy, see
those control files, and read/write them.


The second interpretation I suppose is: although the SGX cgroup is still
seen as supported in userspace, in the kernel we just treat it as if it
doesn't exist.


Specifically, that means: 1) we always return the root SGX cgroup for 
any EPC page when allocating a new one; 2) as a result, we still track 
all EPC pages in a single global list.


But from the code below ...



  static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
  {
if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
@@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum 
sgx_reclaim reclaim)
  {
int ret;
  
+	/* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */

+   if (!sgx_cg_wq)
+   return 0;
+


..., IIUC you choose a (third) solution that is even one more step back:

It just makes try_charge() always succeed, but EPC pages are still 
managed in the "per-cgroup" list.


But this solution, AFAICT, doesn't work.  The reason is that when you fail
to allocate an EPC page you will do the global reclaim, but now the global
list is empty.


Am I missing anything?

So my thinking is, we have two options:

1) Modify the MISC cgroup core code to allow the kernel to disable one 
particular resource.  It shouldn't be hard, e.g., we can add a 
'disabled' flag to the 'struct misc_res'.


Hmm.. wait, after checking, the MISC cgroup won't show any control files 
if the "capacity" of the resource is 0:


"
 * Miscellaneous resources capacity for the entire machine. 0 capacity
 * means resource is not initialized or not present in the host.
"

So I really suppose we should go with this route, i.e., by just setting 
the EPC capacity to 0?


Note misc_cg_try_charge() will fail if the capacity is 0, but we can make it
return success by explicitly checking whether the SGX cgroup is disabled
with a helper, e.g., sgx_cgroup_disabled().


And you always return the root SGX cgroup in sgx_get_current_cg() when 
sgx_cgroup_disabled() is true.
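A minimal sketch of those pieces together (the 'sgx_cgroup_inactive' flag
and the helpers are assumptions for illustration, not code from the series):

	/* set when sgx_cgroup_init() fails, or by a future command line knob */
	static bool sgx_cgroup_inactive;

	static inline bool sgx_cgroup_disabled(void)
	{
		return !IS_ENABLED(CONFIG_CGROUP_MISC) || sgx_cgroup_inactive;
	}

	static inline struct sgx_cgroup *sgx_get_current_cg(void)
	{
		if (sgx_cgroup_disabled())
			return &sgx_cg_root;	/* everything charges the root */

		return sgx_cgroup_from_misc_cg(get_current_misc_cg());
	}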


And in sgx_reclaim_pages_global(), you do something like:

static void sgx_reclaim_pages_global(..)
{
#ifdef CONFIG_CGROUP_MISC
if (sgx_cgroup_disabled())
sgx_reclaim_pages(&sgx_root_cg.lru);
else
sgx_cgroup_reclaim_pages(misc_cg_root());
#else
sgx_reclaim_pages(&sgx_global_list);
#endif
}

I am perhaps missing some other spots too but you got the idea.

At last, after typing those, I believe we should have a separate patch to
handle disabling the SGX cgroup at initialization time.  And you can even
put this patch _somewhere_ after the patch


"x86/sgx: Implement basic EPC misc cgroup functionality"

and before this patch.

It makes sense to have such a patch anyway, because with it we can easily
add a kernel command line option 'sgx_cgroup=disabled' if the user wants it
disabled (when someone has such a requirement in the future).







Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-18 Thread Google
On Thu, 18 Apr 2024 12:10:59 +0100
Jonathan Haslam  wrote:

> Hi Masami,
> 
> > > > OK, then I'll push this to for-next at this moment.
> > > > Please share if you have a good idea for the batch interface which can 
> > > > be
> > > > backported. I guess it should involve updating userspace changes too.
> > > 
> > > Did you (or anyone else) need anything more from me on this one so that it
> > > can be pushed? I provided some benchmark numbers but happy to provide
> > > anything else that may be required.
> > 
> > Yeah, if you can update with the result, it looks better to me.
> > Or, can I update the description?
> 
> Just checking if you need me to do anything on this so that it can be
> pushed to for-next? Would be really great if we can get this in. Thanks
> for all your help.

Yes, other uprobe performance improvements [1][2] have benchmark results,
but this patch doesn't. If you update this with the benchmark results, it
would be helpful to me.

[1] https://lore.kernel.org/all/20240318181728.2795838-3-and...@kernel.org/
[2] https://lore.kernel.org/all/20240318181728.2795838-4-and...@kernel.org/

Thank you,

> 
> Jon.
> 
> > 
> > Thank you,
> > 
> > > 
> > > Thanks!
> > > 
> > > Jon.
> > > 
> > > > 
> > > > Thank you!
> > > > 
> > > > > >
> > > > > > So I hope you can reconsider and accept improvements in this patch,
> > > > > > while Jonathan will keep working on even better final solution.
> > > > > > Thanks!
> > > > > >
> > > > > > > I look forward to your formalized results :)
> > > > > > >
> > > > > 
> > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > > > > and USDTs, reporting attach/detach timings:
> > > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > > > > bpf_testmod.ko is already unloaded.
> > > > > Loading bpf_testmod.ko...
> > > > > Successfully loaded bpf_testmod.ko.
> > > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > > > > test_bench_attach_uprobe: attached in   0.120s
> > > > > test_bench_attach_uprobe: detached in   0.092s
> > > > > #400/5   uprobe_multi_test/bench_uprobe:OK
> > > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > > > > test_bench_attach_usdt: attached in   0.124s
> > > > > test_bench_attach_usdt: detached in   0.064s
> > > > > #400/6   uprobe_multi_test/bench_usdt:OK
> > > > > #400 uprobe_multi_test:OK
> > > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > > > > Successfully unloaded bpf_testmod.ko.
> > > > > 
> > > > > So it should be easy for Jonathan to validate his changes with this.
> > > > > 
> > > > > > > Thank you,
> > > > > > >
> > > > > > > >
> > > > > > > > Jon.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thank you,
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock 
> > > > > > > > > > > overhead
> > > > > > > > > > > will depend on how much lock contention happens.
> > > > > > > > > > >
> > > > > > > > > > > Thank you,
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Jonathan Haslam 
> > > > > > > > > > > > 
> > > > > > > > > > > > ---
> > > > > > > > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > > > > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > > > > > >   */
> > > > > > > > > > > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > > > > >
> > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);/* serialize rbtree access */
> > > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> > > > > > > > > > > >
> > > > > > > > > > > >  #define UPROBES_HASH_SZ  13
> > > > > > > > > > > >  /* serialize uprobe->pending_list */
> > > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > > > >  {
> > > > > > > > > > > >   struct uprobe *uprobe;
> > > > > > > > > > > >
> > > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > > + read_lock(&uprobes_treelock);

[ANNOUNCE] 5.10.215-rt107

2024-04-18 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.215-rt107 stable release.

This release is simply an update to the new stable 5.10.215 version and no
RT-specific changes have been performed.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: b850d563e91f696a605c28e8d6a968a950cff491

Or to build 5.10.215-rt107 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.215.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.215-rt107.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-18 Thread Huang, Kai




Was requested by Jarkko:
https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t


[...]


Ah I missed that.  No problem to me.





--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>


I don't see why you need <asm/sgx.h> here.  Also, ...


+#include <linux/cgroup.h>
+#include <linux/misc_cgroup.h>
+
+#include "sgx.h"


... "sgx.h" already includes 

[...]


right



+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+    /* get_current_misc_cg() never returns NULL when Kconfig enabled */
+    return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}


I spent some time looking into this.  And yes, if I was reading the code
correctly, get_current_misc_cg() should never return NULL when the Kconfig
option is on.

I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.

[...]

Thanks for checking this. I did a similar analysis and agree with the
conclusion. I think this is also confirmed by Michal's description AFAICT:

"
The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist."


After looking, I believe we can even disable the MISC cgroup at runtime for
a particular cgroup (I haven't actually verified this on a real machine,
though):


 # echo "-misc" > /sys/fs/cgroup/my_group/cgroup.subtree_control

And if you look at the MISC cgroup core code, many functions actually 
handle a NULL css, e.g., misc_cg_try_charge():


int misc_cg_try_charge(enum misc_res_type type,
struct misc_cg *cg, u64 amount)
{
...

if (!(valid_type(type) && cg &&
READ_ONCE(misc_res_capacity[type])))
return -EINVAL;

...
}

That's why I am still a little bit worried about this.  And it's better to
have cgroup expert(s) confirm here.


Btw, AMD SEV doesn't need to worry because it doesn't dereference @css but
just passes it to MISC cgroup core functions like misc_cg_try_charge().
But for SGX, we actually dereference it directly.




Re: [RFC PATCH 0/4] perf: Correlating user process data to samples

2024-04-18 Thread Josh Poimboeuf
On Sat, Apr 13, 2024 at 08:48:57AM -0400, Steven Rostedt wrote:
> On Sat, 13 Apr 2024 12:53:38 +0200
> Peter Zijlstra  wrote:
> 
> > On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
> > 
> > > > Anyway, since we typically run stuff from NMI context, accessing user
> > > > data is 'interesting'. As such I would really like to make this work
> > > > depend on the call-graph rework that pushes all the user access bits
> > > > into return-to-user.  
> > > 
> > > Cool, I assume that's the SFRAME work? Are there pointers to work I
> > > could look at and think about what a rebase looks like? Or do you have
> > > someone in mind I should work with for this?  
> > 
> > I've been offline for a little while and still need to catch up with
> > things myself.
> > 
> > Josh was working on that when I dropped off IIRC, I'm not entirely sure
> > where things are at currently (and there is no way I can ever hope to
> > process the backlog).
> > 
> > Anybody know where we are with that?
> 
> It's still very much on my RADAR, but with layoffs and such, my
> priorities have unfortunately changed. I'm hoping to start helping out
> in the near future though (in a month or two).
> 
> Josh was working on it, but I think he got pulled off onto other
> priorities too :-p

Yeah, this is still a priority for me and I hope to get back to it over
the next few weeks (crosses fingers).

-- 
Josh



Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-18 Thread Haitao Huang

On Tue, 16 Apr 2024 08:22:06 -0500, Huang, Kai  wrote:


On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:

From: Kristen Carlson Accardi 

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Tested-by: Jarkko Sakkinen 


I don't see any big issue, so feel free to add:

Reviewed-by: Kai Huang 


Thanks


Nitpickings below:

[...]



--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>


It doesn't seem you need the above two here.

Probably they are needed in later patches; in that case we can move them to
the relevant patch(es) where they get used.

However I think it's better to explicitly include <linux/slab.h> since
kzalloc()/kfree() are used.

Btw, I am not sure whether you want to use <linux/kernel.h> because it looks
like it contains a lot of unrelated stuff.  Anyway I guess nobody cares.



I'll check and remove as needed.


+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;


The comment isn't necessary (sorry, didn't notice before), because the code
is pretty clear about that IMHO.
is pretty clear saying that IMHO.



Was requested by Jarkko:
https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t


[...]



--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>


I don't see why you need <asm/sgx.h> here.  Also, ...


+#include <linux/cgroup.h>
+#include <linux/misc_cgroup.h>
+
+#include "sgx.h"


... "sgx.h" already includes 

[...]


right



+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+   /* get_current_misc_cg() never returns NULL when Kconfig enabled */
+   return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}


I spent some time looking into this.  And yes if I was reading code
correctly the get_current_misc_cg() should never return NULL when Kconfig
is on.

I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.

[...]

Thanks for checking this. I did a similar analysis and agree with the
conclusion. I think this is also confirmed by Michal's description AFAICT:

"
The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist."




--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include <linux/highmem.h>
 #include <linux/kthread.h>
 #include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>


Is this needed?  I believe the SGX variants in "epc_cgroup.h" should be
enough for sgx/main.c?
for sgx/main.c?

[...]



right

[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on

yes

[...]
Thanks
Haitao



[PATCH] drivers: remoteproc: xlnx: Add Versal and Versal-NET support

2024-04-18 Thread Tanmay Shah
The AMD-Xilinx Versal platform is the successor of the ZynqMP platform.
The Real-time Processing Unit R5 cluster IP on Versal is the same as on
the ZynqMP platform. Power-domain IDs for the Versal platform are
different than on ZynqMP.

The AMD-Xilinx Versal-NET platform is the successor of the Versal platform.
The Versal-NET Real-Time Processing Unit has two clusters and each
cluster contains dual-core ARM Cortex-R52 processors. Each R52
core is assigned 128KB of TCM memory.

Signed-off-by: Tanmay Shah 
---
 drivers/remoteproc/xlnx_r5_remoteproc.c | 53 -
 1 file changed, 17 insertions(+), 36 deletions(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 7b1c12108bff..a6d8ac7394e7 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -300,36 +300,6 @@ static void zynqmp_r5_rproc_kick(struct rproc *rproc, int vqid)
dev_warn(dev, "failed to send message\n");
 }
 
-/*
- * zynqmp_r5_set_mode()
- *
- * set RPU cluster and TCM operation mode
- *
- * @r5_core: pointer to zynqmp_r5_core type object
- * @fw_reg_val: value expected by firmware to configure RPU cluster mode
- * @tcm_mode: value expected by fw to configure TCM mode (lockstep or split)
- *
- * Return: 0 for success and < 0 for failure
- */
-static int zynqmp_r5_set_mode(struct zynqmp_r5_core *r5_core,
- enum rpu_oper_mode fw_reg_val,
- enum rpu_tcm_comb tcm_mode)
-{
-   int ret;
-
-   ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val);
-   if (ret < 0) {
-   dev_err(r5_core->dev, "failed to set RPU mode\n");
-   return ret;
-   }
-
-   ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
-   if (ret < 0)
-   dev_err(r5_core->dev, "failed to configure TCM\n");
-
-   return ret;
-}
-
 /*
  * zynqmp_r5_rproc_start()
  * @rproc: single R5 core's corresponding rproc instance
@@ -941,7 +911,7 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster *cluster,
	/* Maintain backward compatibility for zynqmp by using hardcoded TCM address. */
if (of_find_property(r5_core->np, "reg", NULL))
ret = zynqmp_r5_get_tcm_node_from_dt(cluster);
-   else
+   else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss"))
ret = zynqmp_r5_get_tcm_node(cluster);
 
if (ret) {
@@ -960,12 +930,21 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster *cluster,
return ret;
}
 
-   ret = zynqmp_r5_set_mode(r5_core, fw_reg_val, tcm_mode);
-   if (ret) {
-   dev_err(dev, "failed to set r5 cluster mode %d, err %d\n",
-   cluster->mode, ret);
+   ret = zynqmp_pm_set_rpu_mode(r5_core->pm_domain_id, fw_reg_val);
+   if (ret < 0) {
+   dev_err(r5_core->dev, "failed to set RPU mode\n");
return ret;
}
+
+   if (of_find_property(dev_of_node(dev), "xlnx,tcm-mode", NULL) ||
+   device_is_compatible(dev, "xlnx,zynqmp-r5fss")) {
+   ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id,
+  tcm_mode);
+   if (ret < 0) {
+   dev_err(r5_core->dev, "failed to configure TCM\n");
+   return ret;
+   }
+   }
}
 
return 0;
@@ -1022,7 +1001,7 @@ static int zynqmp_r5_cluster_init(struct zynqmp_r5_cluster *cluster)
	ret = of_property_read_u32(dev_node, "xlnx,tcm-mode", (u32 *)&tcm_mode);
if (ret)
return ret;
-   } else {
+   } else if (device_is_compatible(dev, "xlnx,zynqmp-r5fss")) {
if (cluster_mode == LOCKSTEP_MODE)
tcm_mode = PM_RPU_TCM_COMB;
else
@@ -1212,6 +1191,8 @@ static int zynqmp_r5_remoteproc_probe(struct platform_device *pdev)
 
 /* Match table for OF platform binding */
 static const struct of_device_id zynqmp_r5_remoteproc_match[] = {
+   { .compatible = "xlnx,versal-net-r52fss", },
+   { .compatible = "xlnx,versal-r5fss", },
{ .compatible = "xlnx,zynqmp-r5fss", },
{ /* end of list */ },
 };

base-commit: 912ebe48bec5927e2049e91b0e8a9cc682a709d2
-- 
2.25.1




Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport  wrote:
>
> On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > > >
> > > > > I'm looking at execmem_types more as definition of the consumers,
> > > > > maybe I should have named the enum execmem_consumer in the first place.
> > > >
> > > > I think looking at execmem_type from consumers' point of view adds
> > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, 
> > > > kprobe,
> > > > and bpf (and maybe also module text) all have the same requirements.
> > > > Did I miss something?
> > >
> > > It's enough to have one architecture with different constraints for kprobes
> > > and bpf to warrant a type for each.
> >
> > AFAICT, some of these constraints can be changed without too much work.
>
> But why?
> I honestly don't understand what you are trying to optimize here. A few
> lines of initialization in execmem_info?

IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
harder for bpf and kprobe to share the same ROX page. In many use cases,
a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
module text. It is not efficient if we have to allocate separate pages for each
of these use cases. If this is not a problem, the current approach works.

Thanks,
Song



[ANNOUNCE] 5.10.214-rt106

2024-04-18 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.214-rt106 stable release.

This release is simply an update to the new stable 5.10.214 version and no
RT-specific changes have been performed.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: 3d208061796d4addeb543c78e0a4ec769b6ce6b2

Or to build 5.10.214-rt106 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.214.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.214-rt106.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Sean Christopherson
On Thu, Apr 18, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson 
> > >  wrote:
> > > > 
> > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  
> > > > > wrote:
> > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > > moment,
> > > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > > 'flush_on_ret'
> > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > lighter-weight and targetted TLBI in the architecture page-table 
> > > > > > code
> > > > > > when we actually update the ptes for small ranges.
> > > > > 
> > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > costly consequences).
> > > > > 
> > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > what needs to be invalidated and how -- specially on architectures
> > > > > that have a HW-broadcast facility like arm64.
> > > > 
> > > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > > simpler, more
> > > > straightforward solution would be to let architectures override 
> > > > flush_on_ret,
> > > > but I would prefer something like the below as x86 can also utilize a 
> > > > range-based
> > > > flush when running as a nested hypervisor.
> > 
> > ...
> > 
> > > I think this works for us on HW that has range invalidation, which
> > > would already be a positive move.
> > > 
> > > For the lesser HW that isn't range capable, it also gives the
> > > opportunity to perform the iteration ourselves or go for the nuclear
> > > option if the range is larger than some arbitrary constant (though
> > > this is additional work).
> > > 
> > > But this still considers the whole range as being affected by
> > > range->handler(). It'd be interesting to try and see whether more
> > > precise tracking is (or isn't) generally beneficial.
> > 
> > I assume the idea would be to let arch code do single-page invalidations of
> > stage-2 entries for each gfn?
> 
> Right, as it's the only code which knows which ptes actually ended up
> being aged.
> 
> > Unless I'm having a brain fart, x86 can't make use of that functionality.  
> > Intel
> > doesn't provide any way to do targeted invalidation of stage-2 mappings.  
> > AMD
> > provides an instruction to do broadcast invalidations, but it takes a 
> > virtual
> > address, i.e. a stage-1 address.  I can't tell if it's a host virtual 
> > address or
> > a guest virtual address, but it's a moot point because KVM doen't have the 
> > guest
> > virtual address, and if it's a host virtual address, there would need to be 
> > valid
> > mappings in the host page tables for it to work, which KVM can't guarantee.
> 
> Ah, so it sounds like it would need to be an arch opt-in then.

Even if x86 (or some other arch code) could use the precise tracking, I think it
would make sense to have the behavior be arch specific.  Adding infrastructure
to get information from arch code, only to turn around and give it back to arch
code would be odd.

Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
the best/easiest solution would be to let arm64 opt out of the common TLB flush
when a SPTE is made young.

With the range-based flushing bundled in, this?

---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 40 +---
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index afbc99264ffa..8fe5f5e16919 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header 
kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+int kvm_arch_flush_tlb_if_young(void);
+
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38b498669ef9..5ebef8ef239c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,6 +595,11 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
+int __weak kvm_arch_flush_tlb_if_young(void)
+{
+   return true;
+}
+
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define kvm

Re: [RFC PATCH 3/7] module: [

2024-04-18 Thread Mike Rapoport
On Thu, Apr 18, 2024 at 10:31:16PM +0300, Nadav Amit wrote:
> 
> > On 18 Apr 2024, at 13:20, Mike Rapoport  wrote:
> > 
> > On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
> >> 
> >> 
> >> 
> >> I might be missing something, but it seems a bit racy.
> >> 
> >> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
> >> point, since you don’t hold the text_mutex, someone might do text_poke(),
> >> e.g., by enabling/disabling a static key, and the update would be
> >> overwritten. No?
> > 
> > Right :(
> > Even worse, for UP case alternatives_smp_unlock() will "patch" still empty
> > area.
> > 
> > So I'm thinking about calling alternatives_smp_module_add() from an
> > additional callback after the execmem_update_copy().
> > 
> > Does it make sense to you?
> 
> Going over the code again - I might have just been wrong: I confused the
> alternatives and the jump-label mechanisms (as they do share a lot of
> code and characteristics).
> 
> The jump-labels are updated when prepare_coming_module() is called, which
> happens after post_relocation() [which means they would be updated using
> text_poke() “inefficiently” but should be safe].
> 
> The “alternatives” appear only to use text_poke() (in contrast to
> text_poke_early()) from a few very specific flows, e.g.,
> common_cpu_up() -> alternatives_enable_smp().
> 
> Do those flows pose a problem after boot?

Yes, common_cpu_up is called on CPU hotplug, so it's possible to have a
race between alternatives_smp_module_add() and common_cpu_up() ->
alternatives_enable_smp().

And in UP case alternatives_smp_module_add() will call
alternatives_smp_unlock() that will patch module text before it is updated.

> Anyhow, sorry for the noise.

On the contrary, I would have missed it.

-- 
Sincerely yours,
Mike.



Re: [RFC PATCH 3/7] module: [

2024-04-18 Thread Nadav Amit



> On 18 Apr 2024, at 13:20, Mike Rapoport  wrote:
> 
> On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
>> 
>> 
>> 
>> I might be missing something, but it seems a bit racy.
>> 
>> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
>> point, since you don’t hold the text_mutex, someone might do text_poke(),
>> e.g., by enabling/disabling a static key, and the update would be
>> overwritten. No?
> 
> Right :(
> Even worse, for UP case alternatives_smp_unlock() will "patch" still empty
> area.
> 
> So I'm thinking about calling alternatives_smp_module_add() from an
> additional callback after the execmem_update_copy().
> 
> Does it make sense to you?

Going over the code again - I might have just been wrong: I confused the
alternatives and the jump-label mechanisms (as they do share a lot of
code and characteristics).

The jump-labels are updated when prepare_coming_module() is called, which
happens after post_relocation() [which means they would be updated using
text_poke() “inefficiently” but should be safe].

The “alternatives” appear only to use text_poke() (in contrast to
text_poke_early()) from a few very specific flows, e.g.,
common_cpu_up() -> alternatives_enable_smp().

Do those flows pose a problem after boot?

Anyhow, sorry for the noise.


Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-18 Thread Andrii Nakryiko
On Mon, Apr 15, 2024 at 1:25 AM Jiri Olsa  wrote:
>
> On Tue, Apr 02, 2024 at 11:33:00AM +0200, Jiri Olsa wrote:
>
> SNIP
>
> >  #include 
> >  #include 
> > @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe 
> > *auprobe, struct insn *insn, bool
> >  }
> >
> >  #ifdef CONFIG_X86_64
> > +
> > +asm (
> > + ".pushsection .rodata\n"
> > + ".global uretprobe_syscall_entry\n"
> > + "uretprobe_syscall_entry:\n"
> > + "pushq %rax\n"
> > + "pushq %rcx\n"
> > + "pushq %r11\n"
> > + "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> > + "syscall\n"
> > + "popq %r11\n"
> > + "popq %rcx\n"
> > +
> > + /* The uretprobe syscall replaces stored %rax value with final
> > +  * return address, so we don't restore %rax in here and just
> > +  * call ret.
> > +  */
> > + "retq\n"
> > + ".global uretprobe_syscall_end\n"
> > + "uretprobe_syscall_end:\n"
> > + ".popsection\n"
> > +);
> > +
> > +extern u8 uretprobe_syscall_entry[];
> > +extern u8 uretprobe_syscall_end[];
> > +
> > +void *arch_uprobe_trampoline(unsigned long *psize)
> > +{
> > + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> > + return uretprobe_syscall_entry;
>
> fyi I realized this screws 32-bit programs, we either need to add
> compat trampoline, or keep the standard breakpoint for them:
>
> +   struct pt_regs *regs = task_pt_regs(current);
> +   static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> +
> +   if (user_64bit_mode(regs)) {
> +   *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> +   return uretprobe_syscall_entry;
> +   }
> +
> +   *psize = UPROBE_SWBP_INSN_SIZE;
> +   return &insn;
>
>
> not sure it's worth the effort to add the trampoline, I'll check
>

32-bit arch isn't a high-performance target anyways, so I'd probably
not bother and prioritize simplicity and long term maintenance.

>
> jirka



Re: [PATCH 2/4] iio: light: stk3310: Implement vdd supply and power it off during suspend

2024-04-18 Thread Andy Shevchenko
On Thu, Apr 18, 2024 at 8:50 PM Aren  wrote:
> On Thu, Apr 18, 2024 at 06:56:09PM +0300, Andy Shevchenko wrote:
> > On Thu, Apr 18, 2024 at 6:06 PM Aren  wrote:
> > > On Mon, Apr 15, 2024 at 05:04:53PM +0300, Andy Shevchenko wrote:
> > > > On Sun, Apr 14, 2024 at 8:57 PM Aren Moynihan  
> > > > wrote:

...

> > > > I forgot to check the order of freeing resources, be sure you have no
> > > > devm_*() releases happening before this call.
> > >
> > > If I understand what you're saying, this should be fine. The driver just
> > > uses devm to clean up acquired resources after remove is called. Or am I
> > > missing something and resources could be freed before calling
> > > stk3310_remove?
> >
> > I'm not objecting to that. The point here is that the resources should
> > be freed in the reverse order. devm-allocated resources are deferred
> > to be freed after the explicit driver ->remove() callback. In the end
> > they should not interleave with each other, i.o.w. it should be
> > probe: devm followed by non-devm
> > remove: non-devm only.
>
> I think what you're describing is already the case, except that parts
> of the probe function not changed in this patch mix acquiring
> resources through devm with configuring the device.

Okay, then we are fine!

> I hope I'm not being dense, thanks for the clarification

-- 
With Best Regards,
Andy Shevchenko



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Mike Rapoport
On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > >
> > > > I'm looking at execmem_types more as definition of the consumers, maybe 
> > > > I
> > > > should have named the enum execmem_consumer at the first place.
> > >
> > > I think looking at execmem_type from consumers' point of view adds
> > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > > and bpf (and maybe also module text) all have the same requirements.
> > > Did I miss something?
> >
> > It's enough to have one architecture with different constraints for kprobes
> > and bpf to warrant a type for each.
> 
> AFAICT, some of these constraints can be changed without too much work.

But why?
I honestly don't understand what you are trying to optimize here. A few
lines of initialization in execmem_info?
What is the advantage in forcing architectures to have imposed limits on
kprobes or bpf allocations?

> > Where do you see unnecessary complexity?
> >
> > > IOW, we have
> > >
> > > enum execmem_type {
> > > EXECMEM_DEFAULT,
> > > EXECMEM_TEXT,
> > > EXECMEM_KPROBES = EXECMEM_TEXT,
> > > EXECMEM_FTRACE = EXECMEM_TEXT,
> > > EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> > > _KPROBE, _FTRACE, _BPF */
> > > EXECMEM_DATA,  /* rw */
> > > EXECMEM_RO_DATA,
> > > EXECMEM_RO_AFTER_INIT,
> > > EXECMEM_TYPE_MAX,
> > > };
> > >
> > > Does this make sense?
> >
> > How do you suggest to deal with e.g. riscv that has separate address spaces
> > for modules, kprobes and bpf?
> 
> IIUC, modules and bpf use the same address space on riscv

Not exactly, bpf is a subset of modules on riscv.

> while kprobes use vmalloc address.

The whole point of using the entire vmalloc space for kprobes is to
avoid polluting the limited modules space.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.



Re: [PATCH 2/4] iio: light: stk3310: Implement vdd supply and power it off during suspend

2024-04-18 Thread Aren
On Thu, Apr 18, 2024 at 06:56:09PM +0300, Andy Shevchenko wrote:
> On Thu, Apr 18, 2024 at 6:06 PM Aren  wrote:
> > On Mon, Apr 15, 2024 at 05:04:53PM +0300, Andy Shevchenko wrote:
> > > On Sun, Apr 14, 2024 at 8:57 PM Aren Moynihan  
> > > wrote:
> 
> ...
> 
> > > > stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY);
> > > > +   if (data->vdd_reg)
> > > > +   regulator_disable(data->vdd_reg);
> > >
> > > I forgot to check the order of freeing resources, be sure you have no
> > > devm_*() releases happening before this call.
> >
> > If I understand what you're saying, this should be fine. The driver just
> > uses devm to clean up acquired resources after remove is called. Or am I
> > missing something and resources could be freed before calling
> > stk3310_remove?
> 
> I'm not objecting to that. The point here is that the resources should
> be freed in the reverse order. devm-allocated resources are deferred
> to be freed after the explicit driver ->remove() callback. In the end
> they should not interleave with each other, i.o.w. it should be
> probe: devm followed by non-devm
> remove: non-devm only.

I think what you're describing is already the case, except that parts
of the probe function not changed in this patch mix acquiring
resources through devm with configuring the device.

I hope I'm not being dense, thanks for the clarification
 - Aren



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
>
[...]
> >
> > Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> > really need
> > EXECMEM_ANYWHERE below?
> >
> > > >
> > > > * I'm not sure about BPF's requirements; it seems happy doing the same 
> > > > as
> > > >   modules.
> > >
> > > BPF are happy with vmalloc().
> > >
> > > > So if we *must* use a common execmem allocator, what we'd really want is 
> > > > our own
> > > > types, e.g.
> > > >
> > > >   EXECMEM_ANYWHERE
> > > >   EXECMEM_NOPLT
> > > >   EXECMEM_PREL32
> > > >
> > > > ... and then we use those in arch code to implement module_alloc() and 
> > > > friends.
> > >
> > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > should have named the enum execmem_consumer at the first place.
> >
> > I think looking at execmem_type from consumers' point of view adds
> > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > and bpf (and maybe also module text) all have the same requirements.
> > Did I miss something?
>
> It's enough to have one architecture with different constraints for kprobes
> and bpf to warrant a type for each.
>

AFAICT, some of these constraints can be changed without too much work.

> Where do you see unnecessary complexity?
>
> > IOW, we have
> >
> > enum execmem_type {
> > EXECMEM_DEFAULT,
> > EXECMEM_TEXT,
> > EXECMEM_KPROBES = EXECMEM_TEXT,
> > EXECMEM_FTRACE = EXECMEM_TEXT,
> > EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> > _KPROBE, _FTRACE, _BPF */
> > EXECMEM_DATA,  /* rw */
> > EXECMEM_RO_DATA,
> > EXECMEM_RO_AFTER_INIT,
> > EXECMEM_TYPE_MAX,
> > };
> >
> > Does this make sense?
>
> How do you suggest to deal with e.g. riscv that has separate address spaces
> for modules, kprobes and bpf?

IIUC, modules and bpf use the same address space on riscv, while kprobes use
vmalloc address. I haven't tried this yet, but I think we can let
kprobes use the
same space as modules and bpf, which is:

 ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF

Did I get this right?

Thanks,
Song



Re: [PATCH 2/4] iio: light: stk3310: Implement vdd supply and power it off during suspend

2024-04-18 Thread Andy Shevchenko
On Thu, Apr 18, 2024 at 6:06 PM Aren  wrote:
> On Mon, Apr 15, 2024 at 05:04:53PM +0300, Andy Shevchenko wrote:
> > On Sun, Apr 14, 2024 at 8:57 PM Aren Moynihan  
> > wrote:

...

> > > stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY);
> > > +   if (data->vdd_reg)
> > > +   regulator_disable(data->vdd_reg);
> >
> > I forgot to check the order of freeing resources, be sure you have no
> > devm_*() releases happening before this call.
>
> If I understand what you're saying, this should be fine. The driver just
> uses devm to clean up acquired resources after remove is called. Or am I
> missing something and resources could be freed before calling
> stk3310_remove?

I'm not objecting to that. The point here is that the resources should
be freed in the reverse order. devm-allocated resources are deferred
to be freed after the explicit driver ->remove() callback. In the end
they should not interleave with each other, i.o.w. it should be
probe: devm followed by non-devm
remove: non-devm only.
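
A toy illustration of that rule (a made-up mini-driver, not stk3310
itself): everything devm-managed is acquired first and released by the
core only after ->remove() returns, so ->remove() undoes just the
manual steps, in reverse order of probe:

static int demo_probe(struct i2c_client *client)
{
	struct regulator *vdd;

	/* devm: released automatically after demo_remove() returns */
	vdd = devm_regulator_get_optional(&client->dev, "vdd");
	if (IS_ERR(vdd))
		return PTR_ERR(vdd);

	i2c_set_clientdata(client, vdd);

	/* non-devm: the last thing probe does */
	return regulator_enable(vdd);
}

static void demo_remove(struct i2c_client *client)
{
	struct regulator *vdd = i2c_get_clientdata(client);

	/* non-devm teardown only; devm cleanup runs after this returns */
	regulator_disable(vdd);
}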

-- 
With Best Regards,
Andy Shevchenko



Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-18 Thread Mike Rapoport
Hi Masami,

On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> Hi Mike,
> 
> On Thu, 11 Apr 2024 19:00:50 +0300
> Mike Rapoport  wrote:
> 
> > From: "Mike Rapoport (IBM)" 
> > 
> > kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > code.
> > 
> > Since code allocations are now implemented with execmem, kprobes can be
> > enabled in non-modular kernels.
> > 
> > Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> > modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > dependency of CONFIG_KPROBES on CONFIG_MODULES.
> 
> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> function body? We have enough dummy functions for that, so it should
> not make a problem.

I'll rebase and will try to reduce ifdefery where possible.
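
For illustration, the kind of conversion meant here (a sketch, not the
actual patch; the helper name is made up). The !CONFIG_MODULES stub of
__module_text_address() returns NULL, so the branch compiles away:

static bool kprobe_in_module_text(unsigned long addr)
{
	if (!IS_ENABLED(CONFIG_MODULES))
		return false;

	/* real code would take the usual preempt/module_mutex precautions */
	return __module_text_address(addr) != NULL;
}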

> Thank you,
> 
> > 
> > Signed-off-by: Mike Rapoport (IBM) 
> > ---
> >  arch/Kconfig|  2 +-
> >  kernel/kprobes.c| 43 +
> >  kernel/trace/trace_kprobe.c | 11 ++
> >  3 files changed, 37 insertions(+), 19 deletions(-)
> > 
> 
> -- 
> Masami Hiramatsu

-- 
Sincerely yours,
Mike.



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Mike Rapoport
On Wed, Apr 17, 2024 at 04:32:49PM -0700, Song Liu wrote:
> On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport  wrote:
> >
> > On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> > > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > > > +/**
> > > > > + * enum execmem_type - types of executable memory ranges
> > > > > + *
> > > > > + * There are several subsystems that allocate executable memory.
> > > > > + * Architectures define different restrictions on placement,
> > > > > + * permissions, alignment and other parameters for memory that can 
> > > > > be used
> > > > > + * by these subsystems.
> > > > > + * Types in this enum identify subsystems that allocate executable 
> > > > > memory
> > > > > + * and let architectures define parameters for ranges suitable for
> > > > > + * allocations by each subsystem.
> > > > > + *
> > > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types 
> > > > > that
> > > > > + * are not explicitly defined.
> > > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > > > + * @EXECMEM_BPF: parameters for BPF
> > > > > + * @EXECMEM_TYPE_MAX:
> > > > > + */
> > > > > +enum execmem_type {
> > > > > + EXECMEM_DEFAULT,
> > > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > > > + EXECMEM_KPROBES,
> > > > > + EXECMEM_FTRACE,
> > > > > + EXECMEM_BPF,
> > > > > + EXECMEM_TYPE_MAX,
> > > > > +};
> > > >
> > > > Can we please get a break-down of how all these types are actually
> > > > different from one another?
> > > >
> > > > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > > mind) and has less strict placement constraints for some of them?
> > >
> > > Yeah, and really I'd *much* rather deal with that in arch code, as I have 
> > > said
> > > several times.
> > >
> > > For arm64 we have two basic restrictions:
> > >
> > > 1) Direct branches can go +/-128M
> > >We can expand this range by having direct branches go to PLTs, at a
> > >performance cost.
> > >
> > > 2) PREL32 relocations can go +/-2G
> > >We cannot expand this further.
> > >
> > > * We don't need to allocate memory for ftrace. We do not use trampolines.
> > >
> > > * Kprobes XOL areas don't care about either of those; we don't place any
> > >   PC-relative instructions in those. Maybe we want to in future.
> > >
> > > * Modules care about both; we'd *prefer* to place them within +/-128M of 
> > > all
> > >   other kernel/module code, but if there's no space we can use PLTs and 
> > > expand
> > >   that to +/-2G. Since modules can reference other modules, that ends up
> > >   actually being halved, and modules have to fit within some 2G window 
> > > that
> > >   also covers the kernel.
> 
> Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> really need
> EXECMEM_ANYWHERE below?
> 
> > >
> > > * I'm not sure about BPF's requirements; it seems happy doing the same as
> > >   modules.
> >
> > BPF are happy with vmalloc().
> >
> > > So if we *must* use a common execmem allocator, what we'd really want is 
> > > our own
> > > types, e.g.
> > >
> > >   EXECMEM_ANYWHERE
> > >   EXECMEM_NOPLT
> > >   EXECMEM_PREL32
> > >
> > > ... and then we use those in arch code to implement module_alloc() and 
> > > friends.
> >
> > I'm looking at execmem_types more as definition of the consumers, maybe I
> > should have named the enum execmem_consumer at the first place.
> 
> I think looking at execmem_type from consumers' point of view adds
> unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> and bpf (and maybe also module text) all have the same requirements.
> Did I miss something?

It's enough to have one architecture with different constraints for kprobes
and bpf to warrant a type for each.

Where do you see unnecessary complexity?
 
> IOW, we have
> 
> enum execmem_type {
> EXECMEM_DEFAULT,
> EXECMEM_TEXT,
> EXECMEM_KPROBES = EXECMEM_TEXT,
> EXECMEM_FTRACE = EXECMEM_TEXT,
> EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> _KPROBE, _FTRACE, _BPF */
> EXECMEM_DATA,  /* rw */
> EXECMEM_RO_DATA,
> EXECMEM_RO_AFTER_INIT,
> EXECMEM_TYPE_MAX,
> };
> 
> Does this make sense?
 
How do you suggest to deal with e.g. riscv that has separate address spaces
for modules, kprobes and bpf?
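
One possible shape of the answer, sketched along the lines of the
execmem_arch_setup() hook in this series (field names follow the posted
patches; the riscv region macros are assumptions based on its current
layout, where the BPF JIT window is carved out of the modules space):

static struct execmem_info execmem_info __ro_after_init;

struct execmem_info __init *execmem_arch_setup(void)
{
	execmem_info = (struct execmem_info){
		.ranges = {
			[EXECMEM_DEFAULT] = {		/* modules */
				.start     = MODULES_VADDR,
				.end       = MODULES_END,
				.pgprot    = PAGE_KERNEL,
				.alignment = 1,
			},
			[EXECMEM_KPROBES] = {		/* anywhere in vmalloc */
				.start     = VMALLOC_START,
				.end       = VMALLOC_END,
				.pgprot    = PAGE_KERNEL_READ_EXEC,
				.alignment = 1,
			},
			[EXECMEM_BPF] = {		/* dedicated JIT window */
				.start     = BPF_JIT_REGION_START,
				.end       = BPF_JIT_REGION_END,
				.pgprot    = PAGE_KERNEL,
				.alignment = PAGE_SIZE,
			},
		},
	};

	return &execmem_info;
}

Each consumer keeps its own constraints without any of them having to
share the others' limits.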

> Thanks,
> Song

-- 
Sincerely yours,
Mike.



Re: [PATCH 2/4] iio: light: stk3310: Implement vdd supply and power it off during suspend

2024-04-18 Thread Aren
On Mon, Apr 15, 2024 at 05:04:53PM +0300, Andy Shevchenko wrote:
> On Sun, Apr 14, 2024 at 8:57 PM Aren Moynihan  wrote:
> >
> > From: Ondrej Jirman 
> >
> > VDD power input can be used to completely power off the chip during
> > system suspend. Do so if available.
> 
> ...
> 
> >  #include 
> >  #include 
> >  #include 
> 
> > +#include 
> 
> Move it to be ordered and add a blank line to separate the iio/*.h group.
> 
> ...
> 
> > +   data->vdd_reg = devm_regulator_get_optional(&client->dev, "vdd");
> > +   if (IS_ERR(data->vdd_reg)) {
> > +   ret = PTR_ERR(data->vdd_reg);
> > +   if (ret == -ENODEV)
> > +   data->vdd_reg = NULL;
> 
> > +   else
> 
> Redundant 'else' when you follow the pattern "check for error condition 
> first".
> 
> > +   return dev_err_probe(&client->dev, ret,
> > +"get regulator vdd failed\n");
> > +   }
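
(For reference, the error-first shape being asked for, with the same
calls just rearranged:

	data->vdd_reg = devm_regulator_get_optional(&client->dev, "vdd");
	if (IS_ERR(data->vdd_reg)) {
		ret = PTR_ERR(data->vdd_reg);
		if (ret != -ENODEV)
			return dev_err_probe(&client->dev, ret,
					     "get regulator vdd failed\n");
		/* the optional supply is simply absent */
		data->vdd_reg = NULL;
	}
)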
> 
> ...
> 
> > +   if (data->vdd_reg) {
> > +   ret = regulator_enable(data->vdd_reg);
> > +   if (ret)
> > +   return dev_err_probe(&client->dev, ret,
> > +"regulator vdd enable 
> > failed\n");
> > +
> > +   usleep_range(1000, 2000);
> 
> fsleep()
> 
> > +   }
> 
> ...
> 
> > stk3310_set_state(iio_priv(indio_dev), STK3310_STATE_STANDBY);
> > +   if (data->vdd_reg)
> > +   regulator_disable(data->vdd_reg);
> 
> I forgot to check the order of freeing resources, be sure you have no
> devm_*() releases happening before this call.

If I understand what you're saying, this should be fine. The driver just
uses devm to clean up acquired resources after remove is called. Or am I
missing something and resources could be freed before calling
stk3310_remove?

> ...
> 
> > +   usleep_range(1000, 2000);
> 
> fsleep()

Everything else makes sense, I'll include those in v2 along with a patch
to switch stk3310_init to dev_err_probe.

Thanks for taking the time to review
 - Aren



Re: [PATCH net-next v2 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Aleksandr Mikhalitsyn
On Thu, Apr 18, 2024 at 3:23 PM Julian Anastasov  wrote:
>
>
> Hello,

Dear Julian,

>
> On Thu, 18 Apr 2024, Alexander Mikhalitsyn wrote:
>
> > Cc: Julian Anastasov 
> > Cc: Simon Horman 
> > Cc: Pablo Neira Ayuso 
> > Cc: Jozsef Kadlecsik 
> > Cc: Florian Westphal 
> > Suggested-by: Julian Anastasov 
> > Signed-off-by: Alexander Mikhalitsyn 
> > ---
> >  net/netfilter/ipvs/ip_vs_ctl.c | 12 +++-
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > index 143a341bbc0a..daa62b8b2dd1 100644
> > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
>
> > @@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs 
> > *ipvs)
> >   /* si_swapinfo(&i); */
> >   /* availmem = availmem - (i.totalswap - i.freeswap); */
> >
> > - nomem = (availmem < ipvs->sysctl_amemthresh);
> > + amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> > + nomem = (availmem < amemthresh);
> >
> >   local_bh_disable();
> >
> > @@ -146,8 +148,8 @@ static void update_defense_level(struct netns_ipvs 
> > *ipvs)
> >   case 1:
> >   if (nomem) {
> >   ipvs->drop_rate = ipvs->drop_counter
> > - = ipvs->sysctl_amemthresh /
> > - (ipvs->sysctl_amemthresh-availmem);
> > + = amemthresh /
> > + (amemthresh-availmem);
>
> Thanks, both patches look ok except that the old styling
> is showing warnings for this patch:
>
> scripts/checkpatch.pl --strict /tmp/file1.patch
>
> It would be great if you silence them somehow in v3...

Yeah, I have fixed this in v3. Also, I had to split multiple
assignments into different
lines because of:
>CHECK: multiple assignments should be avoided

Now everything looks fine.

>
> BTW, est_cpulist is masked with current->cpus_mask of the
> sysctl writer process, if that is of any help. That is why I skipped
> it but let's keep it read-only for now...

That's a good point! Probably I'm too conservative ;-)

>
> >   ipvs->sysctl_drop_packet = 2;
> >   } else {
> >   ipvs->drop_rate = 0;
> > @@ -156,8 +158,8 @@ static void update_defense_level(struct netns_ipvs 
> > *ipvs)
> >   case 2:
> >   if (nomem) {
> >   ipvs->drop_rate = ipvs->drop_counter
> > - = ipvs->sysctl_amemthresh /
> > - (ipvs->sysctl_amemthresh-availmem);
> > + = amemthresh /
> > + (amemthresh-availmem);
> >   } else {
> >   ipvs->drop_rate = 0;
> >   ipvs->sysctl_drop_packet = 1;
>
> Regards

Kind regards,
Alex

>
> --
> Julian Anastasov 
>



[PATCH net-next v3 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-04-18 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls writable even when the
network namespace is owned by a non-initial user namespace.

Let's make a few sysctls read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice

I'm trying to be conservative with this to avoid
introducing any security issues. Maybe we can allow
more sysctls to be writable, but let's do that on
demand, when we see a real use case.

This patch is motivated by a user request in the LXC
project [1]. Having this can help with running some
Kubernetes [2] or Docker Swarm [3] workloads inside
system containers.

Link: https://github.com/lxc/lxc/issues/4278 [1]
Link: 
https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
 [2]
Link: 
https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682
 [3]

Cc: Stéphane Graber 
Cc: Christian Brauner 
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 32be24f0d4e4..c3ba71aa2654 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4270,6 +4270,7 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
struct ctl_table *tbl;
int idx, ret;
size_t ctl_table_size = ARRAY_SIZE(vs_vars);
+   bool unpriv = net->user_ns != &init_user_ns;
 
atomic_set(&ipvs->dropentry, 0);
spin_lock_init(&ipvs->dropentry_lock);
@@ -4284,12 +4285,6 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
if (tbl == NULL)
return -ENOMEM;
-
-   /* Don't export sysctls to unprivileged users */
-   if (net->user_ns != &init_user_ns) {
-   tbl[0].procname = NULL;
-   ctl_table_size = 0;
-   }
} else
tbl = vs_vars;
/* Initialize sysctl defaults */
@@ -4315,10 +4310,17 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
ipvs->sysctl_sync_ports = 1;
tbl[idx++].data = &ipvs->sysctl_sync_ports;
tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
+
ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
+
ipvs->sysctl_sync_sock_size = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
+
tbl[idx++].data = &ipvs->sysctl_cache_bypass;
tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
@@ -4341,15 +4343,22 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
+
ipvs->sysctl_run_estimation = 1;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
ipvs->est_cpulist_valid = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_cpulist;
 
ipvs->sysctl_est_nice = IPVS_EST_NICE;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_nice;
 
-- 
2.34.1




[PATCH net-next v3 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Alexander Mikhalitsyn
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Suggested-by: Julian Anastasov 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..32be24f0d4e4 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 {
struct sysinfo i;
int availmem;
+   int amemthresh;
int nomem;
int to_change = -1;
 
@@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
/* si_swapinfo(&i); */
/* availmem = availmem - (i.totalswap - i.freeswap); */
 
-   nomem = (availmem < ipvs->sysctl_amemthresh);
+   amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
+   nomem = (availmem < amemthresh);
 
local_bh_disable();
 
@@ -145,9 +147,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
break;
case 1:
if (nomem) {
-   ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   ipvs->drop_counter = amemthresh / (amemthresh - 
availmem);
+   ipvs->drop_rate = ipvs->drop_counter;
ipvs->sysctl_drop_packet = 2;
} else {
ipvs->drop_rate = 0;
@@ -155,9 +156,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
break;
case 2:
if (nomem) {
-   ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   ipvs->drop_counter = amemthresh / (amemthresh - 
availmem);
+   ipvs->drop_rate = ipvs->drop_counter;
} else {
ipvs->drop_rate = 0;
ipvs->sysctl_drop_packet = 1;
-- 
2.34.1




[PATCH] ipvs: Fix checksumming on GSO of SCTP packets

2024-04-18 Thread Ismael Luceno
It was observed in the wild that pairs of consecutive packets would leave
the IPVS with the same wrong checksum, and the issue only went away when
disabling GSO.

IPVS needs to avoid computing the SCTP checksum when using GSO.

Co-developed-by: Firo Yang 
Signed-off-by: Ismael Luceno 
Tested-by: Andreas Taschner 
CC: Michal Kubeček 
CC: Simon Horman 
CC: Julian Anastasov 
CC: lvs-de...@vger.kernel.org
CC: netfilter-de...@vger.kernel.org
CC: net...@vger.kernel.org
CC: coret...@netfilter.org
---
 net/netfilter/ipvs/ip_vs_proto_sctp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c 
b/net/netfilter/ipvs/ip_vs_proto_sctp.c
index a0921adc31a9..3205b45ce161 100644
--- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
+++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
@@ -126,7 +126,8 @@ sctp_snat_handler(struct sk_buff *skb, struct 
ip_vs_protocol *pp,
if (sctph->source != cp->vport || payload_csum ||
skb->ip_summed == CHECKSUM_PARTIAL) {
sctph->source = cp->vport;
-   sctp_nat_csum(skb, sctph, sctphoff);
+   if (!skb_is_gso_sctp(skb))
+   sctp_nat_csum(skb, sctph, sctphoff);
} else {
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
@@ -174,7 +175,8 @@ sctp_dnat_handler(struct sk_buff *skb, struct 
ip_vs_protocol *pp,
(skb->ip_summed == CHECKSUM_PARTIAL &&
 !(skb_dst(skb)->dev->features & NETIF_F_SCTP_CRC))) {
sctph->dest = cp->dport;
-   sctp_nat_csum(skb, sctph, sctphoff);
+   if (!skb_is_gso_sctp(skb))
+   sctp_nat_csum(skb, sctph, sctphoff);
} else if (skb->ip_summed != CHECKSUM_PARTIAL) {
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
-- 
2.43.0




Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Will Deacon
On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson  
> > wrote:
> > > 
> > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > would be really great to change the .clear_flush_young() callback so
> > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > moment,
> > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > 'flush_on_ret'
> > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > > when we actually update the ptes for small ranges.
> > > > 
> > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > costly consequences).
> > > > 
> > > > In general, it feels like the TLB invalidation should stay with the
> > > > code that deals with the page tables, as it has a pretty good idea of
> > > > what needs to be invalidated and how -- specially on architectures
> > > > that have a HW-broadcast facility like arm64.
> > > 
> > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > simpler, more
> > > straightforward solution would be to let architectures override 
> > > flush_on_ret,
> > > but I would prefer something like the below as x86 can also utilize a 
> > > range-based
> > > flush when running as a nested hypervisor.
> 
> ...
> 
> > I think this works for us on HW that has range invalidation, which
> > would already be a positive move.
> > 
> > For the lesser HW that isn't range capable, it also gives the
> > opportunity to perform the iteration ourselves or go for the nuclear
> > option if the range is larger than some arbitrary constant (though
> > this is additional work).
> > 
> > But this still considers the whole range as being affected by
> > range->handler(). It'd be interesting to try and see whether more
> > precise tracking is (or isn't) generally beneficial.
> 
> I assume the idea would be to let arch code do single-page invalidations of
> stage-2 entries for each gfn?

Right, as it's the only code which knows which ptes actually ended up
being aged.

> Unless I'm having a brain fart, x86 can't make use of that functionality.  
> Intel
> doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> provides an instruction to do broadcast invalidations, but it takes a virtual
> address, i.e. a stage-1 address.  I can't tell if it's a host virtual address 
> or
> a guest virtual address, but it's a moot point because KVM doen't have the 
> guest
> virtual address, and if it's a host virtual address, there would need to be 
> valid
> mappings in the host page tables for it to work, which KVM can't guarantee.

Ah, so it sounds like it would need to be an arch opt-in then.
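
A sketch of that opt-in, mirroring the weak kvm_arch_flush_tlb_if_young()
hook from the diff quoted earlier in this thread (the arm64 side here is
hypothetical):

/*
 * arm64's stage-2 page-table code would issue its own targeted TLBIs
 * while aging ptes, so it can suppress the blanket VMID flush that
 * kvm_handle_hva_range() otherwise does via flush_on_ret.
 */
int kvm_arch_flush_tlb_if_young(void)
{
	return false;
}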

Will



[PATCH] Revert "tracing/trigger: Fix to return error if failed to alloc snapshot"

2024-04-18 Thread Siddh Raman Pant
This reverts commit b5085b5ac1d96ea2a8a6240f869655176ce44197.

The change makes an incorrect assumption about the return value because
in the current stable trees for versions 5.15 and before, the following
commit responsible for making 0 a success value is not present:
b8cc44a4d3c1 ("tracing: Remove logic for registering multiple event triggers at 
a time")

The return value should be 0 on failure in the current tree, because in
the functions event_trigger_callback() and event_enable_trigger_func(),
we have:

ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
/*
 * The above returns on success the # of functions enabled,
 * but if it didn't find any functions it returns zero.
 * Consider no functions a failure too.
 */
if (!ret) {
ret = -ENOENT;

Cc: sta...@kernel.org # 5.15, 5.10, 5.4, 4.19
Signed-off-by: Siddh Raman Pant 
---
 kernel/trace/trace_events_trigger.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_events_trigger.c 
b/kernel/trace/trace_events_trigger.c
index dfdbcf1da216..106f9813841a 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1161,10 +1161,8 @@ register_snapshot_trigger(char *glob, struct 
event_trigger_ops *ops,
  struct event_trigger_data *data,
  struct trace_event_file *file)
 {
-   int ret = tracing_alloc_snapshot_instance(file->tr);
-
-   if (ret < 0)
-   return ret;
+   if (tracing_alloc_snapshot_instance(file->tr) != 0)
+   return 0;
 
return register_trigger(glob, ops, data, file);
 }
-- 
2.43.0




Re: [PATCH net-next v2 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Julian Anastasov


Hello,

On Thu, 18 Apr 2024, Alexander Mikhalitsyn wrote:

> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Suggested-by: Julian Anastasov 
> Signed-off-by: Alexander Mikhalitsyn 
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..daa62b8b2dd1 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c

> @@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   /* si_swapinfo(&i); */
>   /* availmem = availmem - (i.totalswap - i.freeswap); */
>  
> - nomem = (availmem < ipvs->sysctl_amemthresh);
> + amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> + nomem = (availmem < amemthresh);
>  
>   local_bh_disable();
>  
> @@ -146,8 +148,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   case 1:
>   if (nomem) {
>   ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + = amemthresh /
> + (amemthresh-availmem);

Thanks, both patches look ok except that the old styling
is showing warnings for this patch:

scripts/checkpatch.pl --strict /tmp/file1.patch

It would be great if you silence them somehow in v3...

BTW, est_cpulist is masked with current->cpus_mask of the
sysctl writer process, if that is of any help. That is why I skipped
it but let's keep it read-only for now...

>   ipvs->sysctl_drop_packet = 2;
>   } else {
>   ipvs->drop_rate = 0;
> @@ -156,8 +158,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   case 2:
>   if (nomem) {
>   ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + = amemthresh /
> + (amemthresh-availmem);
>   } else {
>   ipvs->drop_rate = 0;
>   ipvs->sysctl_drop_packet = 1;

Regards

--
Julian Anastasov 




Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-18 Thread Greg KH
On Thu, Apr 18, 2024 at 09:04:53AM +0200, Thorsten Leemhuis wrote:
> On 17.04.24 15:38, Greg KH wrote:
> > On Wed, Apr 17, 2024 at 03:21:12PM +0200, Thorsten Leemhuis wrote:
> >> On 17.04.24 14:52, Konstantin Ryabitsev wrote:
> >>> On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
>  Could you please create the email alias
>  do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
>  just like sta...@kernel.org does?
> 
>  To quote:
> 
> > How about:
> > cc:  # Reason goes here, and 
> > must be present
> >
> > and we can make that address be routed to /dev/null just like
> >  is?
> >>
> >> FWIW, we could go back to what I initially proposed: use the existing
> >> stable tag with a pre-defined comment to mark patches that AUTOSEL et.
> >> al. should not pick up:
> >> https://lore.kernel.org/all/c0a08b160b286e8c98549eedb37404c6e784cf8a.1712812895.git.li...@leemhuis.info/
> > 
> > If you can pick a better string, possibly, yes.
> 
> What did you think of Konstantin's
> 
> Cc: stable+noauto...@kernel.org # Reason
> 
> That looked like a good solution -- and I wondered why I did not come up
> with that idea myself. Sure, "autosel" would also imply/mean "the
> scripts/tools that look out for Fixes: tags", but does that matter?

We can live with this, sure.  That way no need to change anything on any
kernel.org backend.

thanks,

greg k-h



Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint

2024-04-18 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net-next.git (main)
by Paolo Abeni :

On Wed, 17 Apr 2024 15:18:22 +0800 you wrote:
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing
> XDP hints through kfuncs to allow XDP programs to access the RSS hash.
> 
> [...]

Here is the summary with links:
  - [net-next,v9] virtio_net: Support RX hash XDP hint
https://git.kernel.org/netdev/net-next/c/aa37f8916d20

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





Re: [PATCH 3/3] virtio_balloon: introduce memory scan/reclaim info

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

Expose memory scan/reclaim information to the host side via virtio
balloon device.

Now we have metrics to analyze memory performance:

y: counter increases
n: counter does not change
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
   the guest runs under really critical memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to a cgroup limit, not global
   memory pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
   the memory allocation stalls due to global memory pressure. The
   performance gets hurt a lot. A high DRCLM/DSCAN ratio shows quite
   effective memory reclaim.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to global memory pressure.
   The DRCLM/DSCAN ratio gets low and the guest OS is thrashing
   heavily; in serious cases this leads to poor performance and
   difficult troubleshooting. E.g., sshd may block on memory
   allocation when accepting new connections, so a user can't log
   in to the VM over ssh.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
   the low ARCLM/ASCAN ratio shows that the guest tries to reclaim
   more memory, but it can't. Once more memory is required in the
   future, it will struggle to reclaim it.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c |  9 +
  include/uapi/linux/virtio_balloon.h | 12 ++--
  2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index e88e6573afa5..bc9332c1ae85 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -356,6 +356,15 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
stall += events[ALLOCSTALL_MOVABLE];
update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
  
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN,
+   pages_to_bytes(events[PGSCAN_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN,
+   pages_to_bytes(events[PGSCAN_DIRECT]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_DIRECT]));
+
  #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 487b893a160e..ee35a372805d 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -73,7 +73,11 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
#define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocation 
*/
-#define VIRTIO_BALLOON_S_NR   12
+#define VIRTIO_BALLOON_S_ASYNC_SCAN12 /* Amount of memory scanned 
asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_SCAN   13 /* Amount of memory scanned directly 
*/
+#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed 
asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed 
directly */
+#define VIRTIO_BALLOON_S_NR   16
  
#define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -87,7 +91,11 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
-   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \
  }
  
  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")


Not an expert on these counters/events, but LGTM

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 2/3] virtio_balloon: introduce memory allocation stall counter

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

The memory allocation stall counter represents the performance/latency
of memory allocation; expose this counter to the host side via the
virtio balloon device, out of band.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 20 +++-
  include/uapi/linux/virtio_balloon.h |  6 --
  2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fd19934a847f..e88e6573afa5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
unsigned int idx = 0;
-   long available;
+   long available, stall = 0;
unsigned long caches;
  
  	all_vm_events(events);

@@ -338,6 +338,24 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
+
+   /* sum all the stall events */
+#ifdef CONFIG_ZONE_DMA
+   stall += events[ALLOCSTALL_DMA];
+#endif
+#ifdef CONFIG_ZONE_DMA32
+   stall += events[ALLOCSTALL_DMA32];
+#endif
+#ifdef CONFIG_HIGHMEM
+   stall += events[ALLOCSTALL_HIGH];
+#endif
+#ifdef CONFIG_ZONE_DEVICE
+   stall += events[ALLOCSTALL_DEVICE];
+#endif


Naive me would think that ALLOCSTALL_DEVICE is always 0. :)

Likely we should just do:

for (zid = 0; zid < MAX_NR_ZONES; zid++)
stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];

(see isolate_lru_folios() -> __count_zid_vm_events(), where we rely on 
the same ordering)
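
Spelled out, the suggestion might read (a sketch; it assumes each
zone's ALLOCSTALL event follows the zone enum order, as in
isolate_lru_folios()):

	enum zone_type zid;

	/* sum the allocation-stall events across all zones */
	for (zid = 0; zid < MAX_NR_ZONES; zid++)
		stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];
	update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);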


Apart from that, LGTM.

--
Cheers,

David / dhildenb




Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-18 Thread Jonathan Haslam
Hi Masami,

> > > OK, then I'll push this to for-next at this moment.
> > > Please share if you have a good idea for the batch interface which can be
> > > backported. I guess it should involve updating userspace too.
> > 
> > Did you (or anyone else) need anything more from me on this one so that it
> > can be pushed? I provided some benchmark numbers but happy to provide
> > anything else that may be required.
> 
> Yeah, if you can update with the result, it looks better to me.
> Or, can I update the description?

Just checking if you need me to do anything on this so that it can be
pushed to for-next? Would be really great if we can get this in. Thanks
for all your help.

Jon.

> 
> Thank you,
> 
> > 
> > Thanks!
> > 
> > Jon.
> > 
> > > 
> > > Thank you!
> > > 
> > > > >
> > > > > So I hope you can reconsider and accept improvements in this patch,
> > > > > while Jonathan will keep working on even better final solution.
> > > > > Thanks!
> > > > >
> > > > > > I look forward to your formalized results :)
> > > > > >
> > > > 
> > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > > > and USDTs, reporting attach/detach timings:
> > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > > > bpf_testmod.ko is already unloaded.
> > > > Loading bpf_testmod.ko...
> > > > Successfully loaded bpf_testmod.ko.
> > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > > > test_bench_attach_uprobe: attached in   0.120s
> > > > test_bench_attach_uprobe: detached in   0.092s
> > > > #400/5   uprobe_multi_test/bench_uprobe:OK
> > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > > > test_bench_attach_usdt: attached in   0.124s
> > > > test_bench_attach_usdt: detached in   0.064s
> > > > #400/6   uprobe_multi_test/bench_usdt:OK
> > > > #400 uprobe_multi_test:OK
> > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > > > Successfully unloaded bpf_testmod.ko.
> > > > 
> > > > So it should be easy for Jonathan to validate his changes with this.
> > > > 
> > > > > > Thank you,
> > > > > >
> > > > > > >
> > > > > > > Jon.
> > > > > > >
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > BTW, how did you measure the overhead? I think spinlock 
> > > > > > > > > > overhead
> > > > > > > > > > will depend on how much lock contention happens.
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Jonathan Haslam 
> > > > > > > > > > > ---
> > > > > > > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > > > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/kernel/events/uprobes.c 
> > > > > > > > > > > b/kernel/events/uprobes.c
> > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = 
> > > > > > > > > > > RB_ROOT;
> > > > > > > > > > >   */
> > > > > > > > > > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > > > >
> > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);/* 
> > > > > > > > > > > serialize rbtree access */
> > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock);  /* 
> > > > > > > > > > > serialize rbtree access */
> > > > > > > > > > >
> > > > > > > > > > >  #define UPROBES_HASH_SZ  13
> > > > > > > > > > >  /* serialize uprobe->pending_list */
> > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe 
> > > > > > > > > > > *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > > >  {
> > > > > > > > > > >   struct uprobe *uprobe;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > >   uprobe = __find_uprobe(inode, offset);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > >
> > > > > > > > > > >   return uprobe;
> > > > > > > > > > >  }
> > > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe 
> > > > > > > > > > > *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > > > >  {
> > > > > > > > > > >   struct uprobe *u;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > >   u = __insert_uprobe(uprobe

[PATCH net-next v2 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-04-18 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls writable even when the
network namespace is owned by a non-initial user namespace.

Let's make a few sysctls read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice

I'm trying to be conservative with this to avoid
introducing any security issues. Maybe we can allow
more sysctls to be writable, but let's do that on
demand, when we see a real use case.

This patch is motivated by a user request in the LXC
project [1]. Having this can help with running some
Kubernetes [2] or Docker Swarm [3] workloads inside
system containers.

[1] https://github.com/lxc/lxc/issues/4278
[2] 
https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
[3] 
https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682

Cc: Stéphane Graber 
Cc: Christian Brauner 
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index daa62b8b2dd1..f84f091626ef 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4272,6 +4272,7 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
struct ctl_table *tbl;
int idx, ret;
size_t ctl_table_size = ARRAY_SIZE(vs_vars);
+   bool unpriv = net->user_ns != &init_user_ns;
 
atomic_set(&ipvs->dropentry, 0);
spin_lock_init(&ipvs->dropentry_lock);
@@ -4286,12 +4287,6 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
if (tbl == NULL)
return -ENOMEM;
-
-   /* Don't export sysctls to unprivileged users */
-   if (net->user_ns != &init_user_ns) {
-   tbl[0].procname = NULL;
-   ctl_table_size = 0;
-   }
} else
tbl = vs_vars;
/* Initialize sysctl defaults */
@@ -4317,10 +4312,17 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
ipvs->sysctl_sync_ports = 1;
tbl[idx++].data = &ipvs->sysctl_sync_ports;
tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
+
ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
+
ipvs->sysctl_sync_sock_size = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
+
tbl[idx++].data = &ipvs->sysctl_cache_bypass;
tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
@@ -4343,15 +4345,22 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
+
ipvs->sysctl_run_estimation = 1;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
ipvs->est_cpulist_valid = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_cpulist;
 
ipvs->sysctl_est_nice = IPVS_EST_NICE;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_nice;
 
-- 
2.34.1




Re: [PATCH net-next] ipvs: allow some sysctls in non-init user namespaces

2024-04-18 Thread Aleksandr Mikhalitsyn
On Wed, Apr 17, 2024 at 3:02 PM Julian Anastasov  wrote:
>
>
> Hello,

Dear Julian,

>
> On Tue, 16 Apr 2024, Alexander Mikhalitsyn wrote:
>
> > Let's make all IPVS sysctls visible and RO even when the
> > network namespace is owned by a non-initial user namespace.
> >
> > Let's make a few sysctls writable:
> > - conntrack
> > - conn_reuse_mode
> > - expire_nodest_conn
> > - expire_quiescent_template
> >
> > I'm trying to be conservative with this to avoid
> > introducing any security issues. Maybe we can allow
> > more sysctls to be writable, but let's do that on
> > demand, when we see a real use case.
> >
> > This list of sysctls was chosen because I can't
> > see any security risk in allowing them, and
> > Kubernetes also uses these specific sysctls [2].
> >
> > This patch is motivated by user request in the LXC
> > project [1].
> >
> > [1] https://github.com/lxc/lxc/issues/4278
> > [2] 
> > https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
> >
> > Cc: Stéphane Graber 
> > Cc: Christian Brauner 
> > Cc: Julian Anastasov 
> > Cc: Simon Horman 
> > Cc: Pablo Neira Ayuso 
> > Cc: Jozsef Kadlecsik 
> > Cc: Florian Westphal 
> > Signed-off-by: Alexander Mikhalitsyn 
> > ---
> >  net/netfilter/ipvs/ip_vs_ctl.c | 18 +++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > index 143a341bbc0a..92a818c2f783 100644
> > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> > @@ -4285,10 +4285,22 @@ static int __net_init 
> > ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>
> As the list of privileged vars is short I prefer
> to use a bool and to make only some vars read-only:
>
> bool unpriv = false;
>
> >   if (tbl == NULL)
> >   return -ENOMEM;
> >
> > - /* Don't export sysctls to unprivileged users */
> > + /* Let's show all sysctls in non-init user namespace-owned
> > +  * net namespaces, but make them read-only.
> > +  *
> > +  * Allow only a few specific sysctls to be writable.
> > +  */
> >   if (net->user_ns != &init_user_ns) {
>
> Here we should just set: unpriv = true;
>
> > - tbl[0].procname = NULL;
> > - ctl_table_size = 0;
> > + for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
> > + if (!tbl[idx].procname)
> > + continue;
> > +
> > + if (!((strcmp(tbl[idx].procname, "conntrack") 
> > == 0) ||
> > +   (strcmp(tbl[idx].procname, 
> > "conn_reuse_mode") == 0) ||
> > +   (strcmp(tbl[idx].procname, 
> > "expire_nodest_conn") == 0) ||
> > +   (strcmp(tbl[idx].procname, 
> > "expire_quiescent_template") == 0)))
> > + tbl[idx].mode = 0444;
> > + }
> >   }
> >   } else
> >   tbl = vs_vars;
>
> And below at every place to use:
>
> if (unpriv)
> tbl[idx].mode = 0444;
>
> for the following 4 privileged sysctl vars:
>
> - sync_qlen_max:
> - allocates messages in kernel context
> - this needs better tuning in another patch
>
> - sync_sock_size:
> - allocates messages in kernel context
>
> - run_estimation:
> - for now, better init ns to decide if to use est stats
>
> - est_nice:
> - for now, better init ns to decide the value
>
> - debug_level:
> - already set to 0444
>
> I.e. these vars allocate resources (mem, CPU) without
> proper control, so for now we will just copy them from init ns
> without allowing writing. And they are vars that are not tuned
> often. Also we do not know which netns is supposed to be the
> privileged one; some solutions move all devices out of init_net,
> so we cannot decide where to use lower limits.

I agree. I have also decided to forbid "est_cpulist" for unprivileged users.

>
> OTOH, "amemthresh" is not privileged but needs a single READ_ONCE
> for sysctl_amemthresh in update_defense_level() due to the possible
> div by zero if we allow writing to anyone, e.g.:
>
> int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> ...
> nomem = availmem < amemthresh;
> ... use only amemthresh
>
> All other vars can be writable.

Have fixed this and sent it as a separate patch! ;-)

Thanks a lot for such a quick review!

Kind regards,
Alex

>
> Regards
>
> --
> Julian Anastasov 



[PATCH net-next v2 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Alexander Mikhalitsyn
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Suggested-by: Julian Anastasov 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..daa62b8b2dd1 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 {
struct sysinfo i;
int availmem;
+   int amemthresh;
int nomem;
int to_change = -1;
 
@@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
/* si_swapinfo(&i); */
/* availmem = availmem - (i.totalswap - i.freeswap); */
 
-   nomem = (availmem < ipvs->sysctl_amemthresh);
+   amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
+   nomem = (availmem < amemthresh);
 
local_bh_disable();
 
@@ -146,8 +148,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
case 1:
if (nomem) {
ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   = amemthresh /
+   (amemthresh-availmem);
ipvs->sysctl_drop_packet = 2;
} else {
ipvs->drop_rate = 0;
@@ -156,8 +158,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
case 2:
if (nomem) {
ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   = amemthresh /
+   (amemthresh-availmem);
} else {
ipvs->drop_rate = 0;
ipvs->sysctl_drop_packet = 1;
-- 
2.34.1
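
For reference, the race the READ_ONCE() closes: without it the compiler may
reload the plain field between the memory check and the division, so a write
from the sysctl handler can land in between. A sketch of the before/after
pattern in update_defense_level():

	/* Racy: two independent loads of the same field. */
	nomem = (availmem < ipvs->sysctl_amemthresh);	/* load #1 */
	/* ... */
	ipvs->drop_rate = ipvs->sysctl_amemthresh /	/* load #2 may see a   */
			  (ipvs->sysctl_amemthresh - availmem); /* new value: div by 0 */

	/* Fixed: one clamped snapshot used consistently, so the nomem guard
	 * and the division always see the same value. */
	int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
	nomem = (availmem < amemthresh);
	/* ... */
	ipvs->drop_rate = amemthresh / (amemthresh - availmem);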




Re: [PATCH 1/3] virtio_balloon: introduce oom-kill invocations

2024-04-18 Thread David Hildenbrand

On 18.04.24 08:26, zhenwei pi wrote:

When the guest OS runs under critical memory pressure, the guest
starts to kill processes. A guest monitor agent may scan 'oom_kill'
from /proc/vmstat and report the OOM-kill event. However, the agent
itself may be killed, and we would then lose this critical event (and
the later ones).

For now we can also grep for magic words in the guest kernel log from
the host side. Rather than relying on that unstable method, have the
virtio balloon device report OOM-kill invocations instead.

Signed-off-by: zhenwei pi 


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb
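
For context, the fragile userspace side that the commit message describes
amounts to something like this (a sketch of a guest monitor agent, not part
of the patch; if this process is itself OOM-killed, events are lost, which
is exactly the gap the balloon report closes):

	#include <stdio.h>
	#include <string.h>

	/* Return the cumulative oom_kill counter from /proc/vmstat. */
	static unsigned long long read_oom_kill(void)
	{
		FILE *f = fopen("/proc/vmstat", "r");
		char key[64];
		unsigned long long val = 0, ret = 0;

		if (!f)
			return 0;
		while (fscanf(f, "%63s %llu", key, &val) == 2)
			if (!strcmp(key, "oom_kill"))
				ret = val;
		fclose(f);
		return ret;
	}

	int main(void)
	{
		printf("oom_kill=%llu\n", read_oom_kill());
		return 0;
	}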




Re: [RFC PATCH 6/7] execmem: add support for cache of large ROX pages

2024-04-18 Thread Mike Rapoport
On Tue, Apr 16, 2024 at 09:52:34AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 15, 2024 at 08:00:26PM +0300, Mike Rapoport wrote:
> > On Mon, Apr 15, 2024 at 12:47:50PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 11, 2024 at 07:05:25PM +0300, Mike Rapoport wrote:
> > > 
> > > > To populate the cache, a writable large page is allocated from vmalloc
> > > > with VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then
> > > > remapped as ROX.
> > > 
> > > > +static void execmem_invalidate(void *ptr, size_t size, bool writable)
> > > > +{
> > > > +   if (execmem_info->invalidate)
> > > > +   execmem_info->invalidate(ptr, size, writable);
> > > > +   else
> > > > +   memset(ptr, 0, size);
> > > > +}
> > > 
> > > +static void execmem_invalidate(void *ptr, size_t size, bool writeable)
> > > +{
> > > +   /* fill memory with INT3 instructions */
> > > +   if (writeable)
> > > +   memset(ptr, 0xcc, size);
> > > +   else
> > > +   text_poke_set(ptr, 0xcc, size);
> > > +}
> > > 
> > > Thing is, 0xcc (aka INT3_INSN_OPCODE) is not an invalid instruction.
> > > It raises #BP not #UD.
> > 
> > Do you mean that _invalidate is a poor name choice or that it's necessary
> > to use an instruction that raises #UD?
> 
> Poor naming, mostly. #BP handler will still scream bloody murder if the
> site is otherwise unclaimed.
> 
> It just isn't an invalid instruction.

Well, execmem_fill_with_insns_screaming_bloody_murder seems too long; how
about execmem_fill_trapping_insns?

-- 
Sincerely yours,
Mike.
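
Under that name, the x86 callback from the quoted hunk would read as below (a
sketch; the behavior is unchanged, and as Peter notes INT3 raises #BP rather
than #UD):

	/* Fill code memory with trapping instructions. 0xcc is
	 * INT3_INSN_OPCODE; a single-byte filler is convenient because any
	 * entry offset into the region still decodes to INT3. */
	static void execmem_fill_trapping_insns(void *ptr, size_t size, bool writeable)
	{
		if (writeable)
			memset(ptr, 0xcc, size);
		else
			text_poke_set(ptr, 0xcc, size);
	}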



Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-18 Thread Mike Rapoport
On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
> 
> 
> > On 11 Apr 2024, at 19:05, Mike Rapoport  wrote:
> > 
> > @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const struct load_info *info)
> > add_kallsyms(mod, info);
> > 
> > /* Arch-specific module finalizing. */
> > -   return module_finalize(info->hdr, info->sechdrs, mod);
> > +   ret = module_finalize(info->hdr, info->sechdrs, mod);
> > +   if (ret)
> > +   return ret;
> > +
> > +   for_each_mod_mem_type(type) {
> > +   struct module_memory *mem = &mod->mem[type];
> > +
> > +   if (mem->is_rox) {
> > +   if (!execmem_update_copy(mem->base, mem->rw_copy,
> > +mem->size))
> > +   return -ENOMEM;
> > +
> > +   vfree(mem->rw_copy);
> > +   mem->rw_copy = NULL;
> > +   }
> > +   }
> > +
> > +   return 0;
> > }
> 
> I might be missing something, but it seems a bit racy.
> 
> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
> point, since you don’t hold the text_mutex, someone might do text_poke(),
> e.g., by enabling/disabling a static key, and the update would be
> overwritten. No?

Right :(
Even worse, in the UP case alternatives_smp_unlock() will "patch" a
still-empty area.

So I'm thinking about calling alternatives_smp_module_add() from an
additional callback after the execmem_update_copy().

Does it make sense to you?

-- 
Sincerely yours,
Mike.
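
A rough sketch of that ordering (the extra arch callback and its name are
assumptions for illustration, not code from the series):

	ret = module_finalize(info->hdr, info->sechdrs, mod);
	if (ret)
		return ret;

	for_each_mod_mem_type(type) {
		struct module_memory *mem = &mod->mem[type];

		if (mem->is_rox) {
			if (!execmem_update_copy(mem->base, mem->rw_copy,
						 mem->size))
				return -ENOMEM;

			vfree(mem->rw_copy);
			mem->rw_copy = NULL;
		}
	}

	/* Hypothetical hook: runs only after the ROX copy is published, so
	 * alternatives_smp_module_add() can no longer race with the copy
	 * above, and the UP alternatives_smp_unlock() no longer patches a
	 * still-empty area. */
	return module_arch_post_copy(info->hdr, info->sechdrs, mod);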



Re: [PATCH 1/2] arm64: dts: qcom: pmi632: Add vibrator

2024-04-18 Thread Luca Weiss
On Thu Apr 18, 2024 at 12:01 PM CEST, Konrad Dybcio wrote:
> On 18.04.2024 8:36 AM, Luca Weiss wrote:
> > Add a node for the vibrator module found inside the PMI632.
> > 
> > Signed-off-by: Luca Weiss 
> > ---
>
> Reviewed-by: Konrad Dybcio 
>
> On a side note, this is a totally configuration-free peripheral that doesn't 
> do
> anything crazy until manually configured.
>
> In the slow quest to be (hopefully) more sane about the defaults, should we 
> keep
> them enabled by default? Bjorn?

But many (most?) devices don't have a vibration motor connected to
PMI632, some (like devboards) don't have anything, and other phones have
a separate chip that controls the vibration motor.

Enabling this by default would mean all devices with PMI632 would get an
input device for the vibrator that probably doesn't work?

Regards
Luca

>
> Konrad




Re: [PATCH v2 3/3] arm64: dts: msm8996: add fastrpc nodes

2024-04-18 Thread Konrad Dybcio
On 18.04.2024 8:44 AM, Dmitry Baryshkov wrote:
> From: Srinivas Kandagatla 
> 
> The ADSP provides fastrpc/compute capabilities. Enable support for the
> fastrpc on this DSP.
> 
> Signed-off-by: Srinivas Kandagatla 
> Signed-off-by: Dmitry Baryshkov 
> ---

Reviewed-by: Konrad Dybcio 

Konrad



Re: [PATCH 1/2] arm64: dts: qcom: pmi632: Add vibrator

2024-04-18 Thread Konrad Dybcio
On 18.04.2024 8:36 AM, Luca Weiss wrote:
> Add a node for the vibrator module found inside the PMI632.
> 
> Signed-off-by: Luca Weiss 
> ---

Reviewed-by: Konrad Dybcio 

On a side note, this is a totally configuration-free peripheral that doesn't do
anything crazy until manually configured.

In the slow quest to be (hopefully) more sane about the defaults, should we keep
them enabled by default? Bjorn?

Konrad



Re: [PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-18 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 03:14:37PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 4/17/2024 4:54 PM, David Stevens wrote:
> > Add support for the VIRTIO_F_SUSPEND feature. When this feature is
> > negotiated, power management can use it to suspend virtio devices
> > instead of resorting to resetting the devices entirely.
> > 
> > Signed-off-by: David Stevens 
> > ---
> >   drivers/virtio/virtio.c| 32 ++
> >   drivers/virtio/virtio_pci_common.c | 29 +++
> >   drivers/virtio/virtio_pci_modern.c | 19 ++
> >   include/linux/virtio.h |  2 ++
> >   include/uapi/linux/virtio_config.h | 10 +-
> >   5 files changed, 74 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > index f4080692b351..cd11495a5098 100644
> > --- a/drivers/virtio/virtio.c
> > +++ b/drivers/virtio/virtio.c
> > @@ -1,5 +1,6 @@
> >   // SPDX-License-Identifier: GPL-2.0-only
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
> > return ret;
> >   }
> >   EXPORT_SYMBOL_GPL(virtio_device_restore);
> > +
> > +static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool enabled)
> > +{
> > +   u8 status, target;
> > +
> > +   status = dev->config->get_status(dev);
> > +   if (enabled)
> > +   target = status | VIRTIO_CONFIG_S_SUSPEND;
> > +   else
> > +   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
> > +   dev->config->set_status(dev, target);
> I think it is better to verify whether the device SUSPEND bit is
> already set or cleared; we can just return if status == target.
> 
> Thanks
> Zhu Lingshan
> > +
> > +   while ((status = dev->config->get_status(dev)) != target) {
> > +   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
> > +   return -EIO;
> > +   mdelay(10);

Bad device state (set by surprise removal) should also
be handled here I think.


> > +   }
> > +   return 0;
> > +}
> > +
> > +int virtio_device_suspend(struct virtio_device *dev)
> > +{
> > +   return virtio_device_set_suspend_bit(dev, true);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_device_suspend);
> > +
> > +int virtio_device_resume(struct virtio_device *dev)
> > +{
> > +   return virtio_device_set_suspend_bit(dev, false);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_device_resume);
> >   #endif
> >   static int virtio_init(void)
> > diff --git a/drivers/virtio/virtio_pci_common.c 
> > b/drivers/virtio/virtio_pci_common.c
> > index b655fccaf773..4d542de05970 100644
> > --- a/drivers/virtio/virtio_pci_common.c
> > +++ b/drivers/virtio/virtio_pci_common.c
> > @@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
> > return virtio_device_restore(&vp_dev->vdev);
> >   }
> > -static bool vp_supports_pm_no_reset(struct device *dev)
> > +static int virtio_pci_suspend(struct device *dev)
> >   {
> > struct pci_dev *pci_dev = to_pci_dev(dev);
> > -   u16 pmcsr;
> > -
> > -   if (!pci_dev->pm_cap)
> > -   return false;
> > -
> > -   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > -   if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > -   dev_err(dev, "Unable to query pmcsr");
> > -   return false;
> > -   }
> > +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> > -   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
> > -}
> > +   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
> > +   return virtio_device_suspend(&vp_dev->vdev);
> > -static int virtio_pci_suspend(struct device *dev)
> > -{
> > -   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
> > +   return virtio_pci_freeze(dev);
> >   }
> >   static int virtio_pci_resume(struct device *dev)
> >   {
> > -   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
> > +   struct pci_dev *pci_dev = to_pci_dev(dev);
> > +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> > +
> > +   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
> > +   return virtio_device_resume(&vp_dev->vdev);
> > +
> > +   return virtio_pci_restore(dev);
> >   }
> >   static const struct dev_pm_ops virtio_pci_pm_ops = {
> > diff --git a/drivers/virtio/virtio_pci_modern.c 
> > b/drivers/virtio/virtio_pci_modern.c
> > index f62b530aa3b5..ac8734526b8d 100644
> > --- a/drivers/virtio/virtio_pci_modern.c
> > +++ b/drivers/virtio/virtio_pci_modern.c
> > @@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct virtio_device *vdev)
> > __virtqueue_break(admin_vq->info.vq);
> >   }
> > +static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)
> > +{
> > +   u16 pmcsr;
> > +
> > +   if (!pci_dev->pm_cap)
> > +   return false;
> > +
> > +   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > +   if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > +   dev_err(&pci_dev->dev, "Unable to query pmcsr");

Re: [PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-18 Thread Zhu, Lingshan




On 4/17/2024 4:54 PM, David Stevens wrote:

Add support for the VIRTIO_F_SUSPEND feature. When this feature is
negotiated, power management can use it to suspend virtio devices
instead of resorting to resetting the devices entirely.

Signed-off-by: David Stevens 
---
  drivers/virtio/virtio.c| 32 ++
  drivers/virtio/virtio_pci_common.c | 29 +++
  drivers/virtio/virtio_pci_modern.c | 19 ++
  include/linux/virtio.h |  2 ++
  include/uapi/linux/virtio_config.h | 10 +-
  5 files changed, 74 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b351..cd11495a5098 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -1,5 +1,6 @@
  // SPDX-License-Identifier: GPL-2.0-only
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
return ret;
  }
  EXPORT_SYMBOL_GPL(virtio_device_restore);
+
+static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool enabled)
+{
+   u8 status, target;
+
+   status = dev->config->get_status(dev);
+   if (enabled)
+   target = status | VIRTIO_CONFIG_S_SUSPEND;
+   else
+   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
+   dev->config->set_status(dev, target);

I think it is better to verify whether the device SUSPEND bit is
already set or cleared; we can just return if status == target.

Thanks
Zhu Lingshan

+
+   while ((status = dev->config->get_status(dev)) != target) {
+   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
+   return -EIO;
+   mdelay(10);
+   }
+   return 0;
+}
+
+int virtio_device_suspend(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_suspend);
+
+int virtio_device_resume(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, false);
+}
+EXPORT_SYMBOL_GPL(virtio_device_resume);
  #endif
  
  static int virtio_init(void)

diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index b655fccaf773..4d542de05970 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
return virtio_device_restore(&vp_dev->vdev);
  }
  
-static bool vp_supports_pm_no_reset(struct device *dev)

+static int virtio_pci_suspend(struct device *dev)
  {
struct pci_dev *pci_dev = to_pci_dev(dev);
-   u16 pmcsr;
-
-   if (!pci_dev->pm_cap)
-   return false;
-
-   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
-   if (PCI_POSSIBLE_ERROR(pmcsr)) {
-   dev_err(dev, "Unable to query pmcsr");
-   return false;
-   }
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
  
-	return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;

-}
+   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_suspend(&vp_dev->vdev);
  
-static int virtio_pci_suspend(struct device *dev)

-{
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
+   return virtio_pci_freeze(dev);
  }
  
  static int virtio_pci_resume(struct device *dev)

  {
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
+   struct pci_dev *pci_dev = to_pci_dev(dev);
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+
+   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_resume(&vp_dev->vdev);
+
+   return virtio_pci_restore(dev);
  }
  
  static const struct dev_pm_ops virtio_pci_pm_ops = {

diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index f62b530aa3b5..ac8734526b8d 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct virtio_device *vdev)
__virtqueue_break(admin_vq->info.vq);
  }
  
+static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)

+{
+   u16 pmcsr;
+
+   if (!pci_dev->pm_cap)
+   return false;
+
+   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
+   if (PCI_POSSIBLE_ERROR(pmcsr)) {
+   dev_err(&pci_dev->dev, "Unable to query pmcsr");
+   return false;
+   }
+
+   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
+}
+
  static void vp_transport_features(struct virtio_device *vdev, u64 features)
  {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
@@ -223,6 +239,9 @@ static void vp_transport_features(struct virtio_device *vdev, u64 features)
  
  	if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ))

__virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ);
+
+   
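
Folding both review comments into the helper gives roughly the following (a
sketch: the early return is Zhu's suggestion, and the all-ones status check
is one assumed way to catch the surprise-removal case Michael mentions):

	static int virtio_device_set_suspend_bit(struct virtio_device *dev,
						 bool enabled)
	{
		u8 status, target;

		status = dev->config->get_status(dev);
		if (enabled)
			target = status | VIRTIO_CONFIG_S_SUSPEND;
		else
			target = status & ~VIRTIO_CONFIG_S_SUSPEND;

		/* Nothing to do if the SUSPEND bit already matches. */
		if (status == target)
			return 0;

		dev->config->set_status(dev, target);

		while ((status = dev->config->get_status(dev)) != target) {
			if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
				return -EIO;
			/* Assumed check for a surprise-removed device. */
			if (status == 0xff)
				return -ENODEV;
			mdelay(10);
		}
		return 0;
	}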

Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-18 Thread Thorsten Leemhuis
On 17.04.24 15:38, Greg KH wrote:
> On Wed, Apr 17, 2024 at 03:21:12PM +0200, Thorsten Leemhuis wrote:
>> On 17.04.24 14:52, Konstantin Ryabitsev wrote:
>>> On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
 Could you please create the email alias
 do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
 just like sta...@kernel.org does?

 To quote:

> How about:
>   cc:  # Reason goes here, and must be present
>
> and we can make that address be routed to /dev/null just like
>  is?
>>
>> FWIW, we could go back to what I initially proposed: use the existing
>> stable tag with a pre-defined comment to mark patches that AUTOSEL et.
>> al. should not pick up:
>> https://lore.kernel.org/all/c0a08b160b286e8c98549eedb37404c6e784cf8a.1712812895.git.li...@leemhuis.info/
> 
> If you can pick a better string, possibly, yes.

What did you think of Konstantin's

Cc: stable+noauto...@kernel.org # Reason

That looked like a good solution -- and I wondered why I did not come up
with that idea myself. Sure, "autosel" would also imply/mean "the
scripts/tools that look out for Fixes: tags", but does that matter?

> But in the end, your proposal seems to imply:
> 
>   cc: sta...@kernel.org   # Psych!  Just kidding, never backport this!
> 
> but really, that's just mean, and again, this is a VERY rare case you
> are trying to automate here. We have MUCH better and simpler ways for
> maintainers to not have their subsystems scanned for stuff like this,
> why are we spending all of our time on this topic?

It started with various minor reasons -- and after some "this would be
nice to have" feedback it felt wrong to give up. It also looked like we
had some agreement already before a new discussion began.

Ciao, Thorsten