Re: [PATCH] test/vsock: add install target

2024-07-11 Thread Stefano Garzarella

CCing Stefan.

On Wed, Jul 10, 2024 at 07:00:59PM GMT, Jakub Kicinski wrote:

On Wed, 10 Jul 2024 13:58:39 +0200 Stefano Garzarella wrote:

There is a comment there:

 # Avoid changing the rest of the logic here and lib.mk.

Added by commit 17eac6c2db8b2cdfe33d40229bdda2acd86b304a.

IIUC they re-used INSTALL_PATH, just to avoid too many changes in that
file and in tools/testing/selftests/lib.mk

So, IMHO we should not care about it and only use VSOCK_INSTALL_PATH if
you don't want to conflict with INSTALL_PATH.


Any reason why vsock isn't part of selftests in the first place?



Usually the vsock tests exercise both the driver (virtio-vsock) in the guest
and the device (vhost-vsock) in the host kernel. So I usually run the tests
in 2 nested VMs to test the latest changes for both the guest and the
host.


I don't know selftests well enough, but do you think it would be possible
to integrate them?


CCing Stefan, who is the original author and may remember more of the
reasoning behind this choice.


Thanks,
Stefano




Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-11 Thread Peter Hilber
On 10.07.24 18:01, David Woodhouse wrote:
> On Wed, 2024-07-10 at 15:07 +0200, Peter Hilber wrote:
>> On 08.07.24 11:27, David Woodhouse wrote:
>>> From: David Woodhouse 
>>>
>>> The vmclock "device" provides a shared memory region with precision clock
>>> information. By using shared memory, it is safe across Live Migration.
>>>
>>> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
>>> KVM clock values. Unlike the KVM PTP clock, it does so only when that is
>>> actually helpful.
>>>
>>> The memory region of the device is also exposed to userspace so it can be
>>> read or memory mapped by applications which need reliable notification of
>>> clock disruptions.
>>>
>>> Signed-off-by: David Woodhouse 
>>
>> [...]
>>
>>> +
>>> +struct vmclock_abi {
>>> +   /* CONSTANT FIELDS */
>>> +   uint32_t magic;
>>> +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
>>> +   uint32_t size;  /* Size of region containing this structure */
>>> +   uint16_t version;   /* 1 */
>>> +   uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */
>>> +#define VMCLOCK_COUNTER_ARM_VCNT   0
>>> +#define VMCLOCK_COUNTER_X86_TSC    1
>>> +#define VMCLOCK_COUNTER_INVALID    0xff
>>> +   uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
>>> +#define VMCLOCK_TIME_UTC   0   /* Since 1970-01-01 00:00:00z */
>>> +#define VMCLOCK_TIME_TAI   1   /* Since 1970-01-01 00:00:00z */
>>> +#define VMCLOCK_TIME_MONOTONIC 2   /* Since undefined epoch */
>>> +#define VMCLOCK_TIME_INVALID_SMEARED   3   /* Not supported */
>>> +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4   /* Not supported */
>>> +
>>> +   /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
>>> +   uint32_t seq_count; /* Low bit means an update is in progress */
>>> +   /*
>>> +    * This field changes to another non-repeating value when the CPU
>>> +    * counter is disrupted, for example on live migration. This lets
>>> +    * the guest know that it should discard any calibration it has
>>> +    * performed of the counter against external sources (NTP/PTP/etc.).
>>> +    */
>>> +   uint64_t disruption_marker;
>>> +   uint64_t flags;
>>> +   /* Indicates that the tai_offset_sec field is valid */
>>> +#define VMCLOCK_FLAG_TAI_OFFSET_VALID  (1 << 0)
>>> +   /*
>>> +    * Optionally used to notify guests of pending maintenance events.
>>> +    * A guest which provides latency-sensitive services may wish to
>>> +    * remove itself from service if an event is coming up. Two flags
>>> +    * indicate the approximate imminence of the event.
>>> +    */
>>> +#define VMCLOCK_FLAG_DISRUPTION_SOON   (1 << 1) /* About a day */
>>> +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT   (1 << 2) /* About an hour */
>>> +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID (1 << 3)
>>> +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID (1 << 4)
>>> +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID   (1 << 5)
>>> +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID   (1 << 6)
>>> +   /*
>>> +    * Even regardless of leap seconds, the time presented through this
>>> +    * mechanism may not be strictly monotonic. If the counter slows down
>>> +    * and the host adapts to this discovery, the time calculated from
>>> +    * the value of the counter immediately after an update to this
>>> +    * structure, may appear to be *earlier* than a calculation just
>>> +    * before the update (while the counter was believed to be running
>>> +    * faster than it now is). A guest operating system will typically
>>> +    * *skew* its own system clock back towards the reference clock
>>> +    * exposed here, rather than following this clock directly. If,
>>> +    * however, this structure is being populated from such a system
>>> +    * clock which is already handled in such a fashion and the results
>>> +    * *are* guaranteed to be monotonic, such monotonicity can be
>>> +    * advertised by setting this bit.
>>> +    */
>>
>> I wonder if this might be difficult to define in a standard.
> 
> I'm sure we could do better than my attempt above, but surely it isn't
> *so* hard to define monotonicity?
> 
>> Is there a need to define device and driver behavior in more detail? What
>> would happen if e.g. the device first decides how to update the clock, but
>> is then slow to update the SHM?
> 
> That's OK, isn't it?
> 
> The key in the VMCLOCK_FLAG_TIME_MONOTONIC flag is that by setting it,
> the host guarantees that the time calculated according to this
> structure at any given moment, shall not appear to be later than the
> time calculated via the structure at any *later* moment.

IMHO this phrasing is better, since it directly refers to the state of the
structure.

AFAIU if there would be abnormal delays in store buffers, causing some
driver to still see the old clock for some time, the monotonicity could be
violated:

1. device writes new, much slower clock to store buffer
2. some time passes
3. driver reads old, much faster clock
4. device writes store buffer to cache
5. driver reads new, much slower clock

But I hope such delays do not occur.

Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

2024-07-11 Thread David Woodhouse
On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
> 
> IMHO this phrasing is better, since it directly refers to the state of the
> structure.

Thanks. I'll update it.

> AFAIU if there would be abnormal delays in store buffers, causing some
> driver to still see the old clock for some time, the monotonicity could be
> violated:
> 
> 1. device writes new, much slower clock to store buffer
> 2. some time passes
> 3. driver reads old, much faster clock
> 4. device writes store buffer to cache
> 5. driver reads new, much slower clock
> 
> But I hope such delays do not occur.

For the case of the hypervisor←→guest interface this should be handled
by the use of memory barriers and the seqcount lock.

The guest driver reads the seqcount, performs a read memory barrier,
then reads the contents of the structure. Then performs *another* read
memory barrier, and checks the seqcount hasn't changed:
https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351
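
For illustration, a minimal sketch of that read side, assuming the
vmclock_abi layout quoted earlier in the thread (not the actual
ptp_vmclock.c code, just the protocol it implements):

	/* Take a consistent snapshot of the shared clock region. */
	static void vmclock_read_snapshot(const volatile struct vmclock_abi *clk,
					  struct vmclock_abi *snap)
	{
		uint32_t seq;

		do {
			/* Low bit set means an update is in progress. */
			while ((seq = READ_ONCE(clk->seq_count)) & 1)
				cpu_relax();
			smp_rmb();	/* read seq_count before the payload */
			memcpy(snap, (const void *)clk, sizeof(*snap));
			smp_rmb();	/* read the payload before re-checking */
		} while (READ_ONCE(clk->seq_count) != seq);
	}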

The converse happens with write barriers on the hypervisor side:
https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68
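
For illustration again, the writer side is the usual seqcount pattern; a
sketch under the same assumptions (not the actual QEMU code):

	vmclock->seq_count++;		/* odd: update in progress */
	smp_wmb();			/* order the bump before the payload writes */
	vmclock->disruption_marker++;	/* ...update the clock fields... */
	smp_wmb();			/* order the payload before the final bump */
	vmclock->seq_count++;		/* even again: snapshot is consistent */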

Do we need to think harder about the ordering across a real PCI bus? It
isn't entirely unreasonable for this to be implemented in hardware if
we eventually add a counter_id value for a bus-visible counter like the
Intel Always Running Timer (ART). I'm also OK with saying that device
implementations may only provide the shared memory structure if they
can ensure memory ordering.




Re: [PATCH 0/3] uprobes: future cleanups for review

2024-07-11 Thread Peter Zijlstra
On Wed, Jul 10, 2024 at 06:30:22PM +0200, Oleg Nesterov wrote:
> On 07/10, Oleg Nesterov wrote:
> >
> > Peter, these simple cleanups should not conflict with your changes,
> > but I can resend them later if it causes any inconvenience.
> 
> In fact I would like to push 2 more cleanups before the more significant
> changes, but they certainly conflict with your ongoing work, albeit only
> textually.
> 
> Let me send the patches for review anyway, perhaps you can take at least
> the 1st one.
> 
> 3/3 is only compile tested so far. Andrii, can you take a look?

I was going to post a new version today, but I can wait and rebase on
top of / include these 5 patches (or more, these things tend to grow).




[PATCH v2] bootconfig: Remove duplicate included header file linux/bootconfig.h

2024-07-11 Thread Thorsten Blum
The header file linux/bootconfig.h is included whether __KERNEL__ is
defined or not.

Include it only once before the #ifdef/#else/#endif preprocessor
directives and remove the following make includecheck warning:

  linux/bootconfig.h is included more than once
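
For reference, the warning can be reproduced with the kernel's
includecheck target; a sketch of the invocation (the output comes from
scripts/checkincludes.pl, so the exact format may differ):

  $ make includecheck | grep bootconfig
  ./lib/bootconfig.c: linux/bootconfig.h is included more than once.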

Move the comment to the top and delete the now empty #else block.

Signed-off-by: Thorsten Blum 
---
Changes in v2:
- Move comment and delete #else as suggested by Masami Hiramatsu
- Link to v1: 
https://lore.kernel.org/linux-kernel/20240711002152.800028-2-thorsten.b...@toblux.com/
---
 lib/bootconfig.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 97f8911ea339..81f29c29f47b 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -4,8 +4,16 @@
  * Masami Hiramatsu 
  */
 
-#ifdef __KERNEL__
+/*
+ * NOTE: This is only for tools/bootconfig, because tools/bootconfig will
+ * run the parser sanity test.
+ * This does NOT mean lib/bootconfig.c is available in the user space.
+ * However, if you change this file, please make sure the tools/bootconfig
+ * has no issue on building and running.
+ */
 #include <linux/bootconfig.h>
+
+#ifdef __KERNEL__
 #include 
 #include 
 #include 
@@ -24,16 +32,6 @@ const char * __init xbc_get_embedded_bootconfig(size_t *size)
return (*size) ? embedded_bootconfig_data : NULL;
 }
 #endif
-
-#else /* !__KERNEL__ */
-/*
- * NOTE: This is only for tools/bootconfig, because tools/bootconfig will
- * run the parser sanity test.
- * This does NOT mean lib/bootconfig.c is available in the user space.
- * However, if you change this file, please make sure the tools/bootconfig
- * has no issue on building and running.
- */
-#include <linux/bootconfig.h>
 #endif
 
 /*
-- 
2.39.2




Re: [PATCH 0/3] uprobes: future cleanups for review

2024-07-11 Thread Oleg Nesterov
On 07/11, Peter Zijlstra wrote:
>
> On Wed, Jul 10, 2024 at 06:30:22PM +0200, Oleg Nesterov wrote:
> >
> > In fact I would like to push 2 more cleanups before the more significant
> > changes, but they certainly conflict with your ongoing work, albeit only
> > textually.
> >
> > Let me send the patches for review anyway, perhaps you can take at least
> > the 1st one.
> >
> > 3/3 is only compile tested so far. Andrii, can you take a look?
>
> I was going to post a new version today, but I can wait and rebase on

No, I do not want to delay your changes, please send the new version,
I'll rebase these cleanups on top of it.

Oleg.




Re: [PATCH v3 0/2] vdpa: support set mac address from vdpa tool

2024-07-11 Thread Leonardo Milleri
Sorry for the noise, resending the email in text format

Hi All,

My answers inline below

>> Any specific reason to pre-create those large number of vdpa devices of the 
>> pool?
>> I was hoping to create vdpa device with needed attributes, when spawning a 
>> kubevirt instance.
>> K8s DRA infrastructure [1] can be used to create the needed vdpa device. 
>> Have you considered using the DRA of [1]?

The vhost-vdpa devices are created in the host before spawning the
kubevirt VM. This is achieved by using:
- sriov-network-operator: load kernel drivers, create vdpa devices
(with MAC address), etc
- sriov-device-plugin: create pool of resources (vdpa devices in this
case), advertise devices to k8s, allocate devices during pod creation
(by the way, isn't this mechanism very similar to DRA?)

Then we create the kubevirt VM by defining an interface with the
following attributes:
- type:vdpa
- mac
- source: vhost-vdpa path

So the issue is: how do we make sure the MAC in the VM is the same as the
MAC of the vdpa device?
Two options:
- ensure kubevirt interface mac is equal to vdpa mac: this is not
possible because of the device plugin resource pool. You can have a
few devices in the pool and the device plugin picks one randomly.
- change the vdpa MAC address at a later stage, to make sure it is aligned
with the kubevirt interface MAC. I don't know if there is already specific
code in kubevirt to do that or whether it needs to be implemented.

Hope this helps to clarify

Thanks
Leonardo


On Wed, Jul 10, 2024 at 10:46 AM Cindy Lu  wrote:
>
> On Wed, 10 Jul 2024 at 14:10, Michael S. Tsirkin  wrote:
> >
> > On Wed, Jul 10, 2024 at 11:05:48AM +0800, Jason Wang wrote:
> > > On Tue, Jul 9, 2024 at 8:42 PM Michael S. Tsirkin  wrote:
> > > >
> > > > On Tue, Jul 09, 2024 at 02:19:19PM +0800, Cindy Lu wrote:
> > > > > On Tue, 9 Jul 2024 at 11:59, Parav Pandit  wrote:
> > > > > >
> > > > > > Hi Cindy,
> > > > > >
> > > > > > > From: Cindy Lu 
> > > > > > > Sent: Monday, July 8, 2024 12:17 PM
> > > > > > >
> > > > > > > Add support for setting the MAC address using the VDPA tool.
> > > > > > > This feature will allow setting the MAC address using the VDPA 
> > > > > > > tool.
> > > > > > > For example, in vdpa_sim_net, the implementation sets the MAC 
> > > > > > > address to
> > > > > > > the config space. However, for other drivers, they can implement 
> > > > > > > their own
> > > > > > > function, not limited to the config space.
> > > > > > >
> > > > > > > Changelog v2
> > > > > > >  - Changed the function name to prevent misunderstanding
> > > > > > >  - Added check for blk device
> > > > > > >  - Addressed the comments
> > > > > > > Changelog v3
> > > > > > >  - Split the function of the net device from 
> > > > > > > vdpa_nl_cmd_dev_attr_set_doit
> > > > > > >  - Add a lock for the network device's dev_set_attr operation
> > > > > > >  - Address the comments
> > > > > > >
> > > > > > > Cindy Lu (2):
> > > > > > >   vdpa: support set mac address from vdpa tool
> > > > > > >   vdpa_sim_net: Add the support of set mac address
> > > > > > >
> > > > > > >  drivers/vdpa/vdpa.c  | 81 
> > > > > > > 
> > > > > > >  drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 19 ++-
> > > > > > >  include/linux/vdpa.h |  9 
> > > > > > >  include/uapi/linux/vdpa.h|  1 +
> > > > > > >  4 files changed, 109 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > --
> > > > > > > 2.45.0
> > > > > >
> > > > > > Mlx5 device already allows setting the mac and mtu during the vdpa 
> > > > > > device creation time.
> > > > > > Once the vdpa device is created, it binds to vdpa bus and other 
> > > > > > driver vhost_vdpa etc bind to it.
> > > > > > So there was no good reason in the past to support explicit config
> > > > > > after device add; it would complicate the flow for synchronizing this.
> > > > > >
> > > > > > The user who wants a device with new attributes can just as well destroy
> > > > > > and recreate the vdpa device with the new desired attributes.
> > > > > >
> > > > > > vdpa_sim_net can also be extended for similar way when adding the 
> > > > > > vdpa device.
> > > > > >
> > > > > > Have you considered using the existing tool and kernel in place 
> > > > > > since 2021?
> > > > > > Such as commit d8ca2fa5be1.
> > > > > >
> > > > > > An example of it is,
> > > > > > $ vdpa dev add name bar mgmtdev vdpasim_net mac 00:11:22:33:44:55 
> > > > > > mtu 9000
> > > > > >
> > > > > Hi Parav
> > > > > Really thanks for your comments. The reason for adding this function
> > > > > is to support Kubevirt.
> > > > > the problem we meet is that kubevirt chooses one random vdpa device
> > > > > from the pool and we don't know which one it is going to pick. That means
> > > > > we can't know the MAC address before it is created. So we plan
> > > > > to have this function to change the MAC address after it is created.
> > > > > Thanks
> > > > > cindy
> > > >
> > > > Well you will need to change kubevirt to teach it to set
> > > > mac

Re: [PATCH 3/3] uprobes: make uprobe_register() return struct uprobe *

2024-07-11 Thread Oleg Nesterov
On 07/10, Oleg Nesterov wrote:
>
> -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>  {
> - struct uprobe *uprobe;
> -
> - uprobe = find_uprobe(inode, offset);
> - if (WARN_ON(!uprobe))
> - return;
> -
>   down_write(&uprobe->register_rwsem);
>   __uprobe_unregister(uprobe, uc);
>   up_write(&uprobe->register_rwsem);
> - put_uprobe(uprobe);

OK, this is obviously wrong, needs get_uprobe/put_uprobe. __uprobe_unregister()
can free this uprobe, so up_write(&uprobe->register_rwsem) is not safe.

I'll send V2 on top of Peter's new version.

Oleg.




[PATCH v2 05/11] perf/uprobe: Simplify UPROBE_HANDLER_REMOVE logic

2024-07-11 Thread Peter Zijlstra
Specifically, get rid of the uprobe->consumers re-load, which isn't
sound under RCU.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/events/uprobes.c |   17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2101,6 +2101,7 @@ static void handler_chain(struct uprobe
struct uprobe_consumer *uc;
int remove = UPROBE_HANDLER_REMOVE;
bool need_prep = false; /* prepare return uprobe, when needed */
+   bool had_handler = false;
 
down_read(&uprobe->register_rwsem);
for (uc = uprobe->consumers; uc; uc = uc->next) {
@@ -2115,16 +2116,26 @@ static void handler_chain(struct uprobe
if (uc->ret_handler)
need_prep = true;
 
+   /*
+* A single handler that does not mask out REMOVE, means the
+* probe stays.
+*/
+   had_handler = true;
remove &= rc;
}
 
+   /*
+* If there were no handlers called, nobody asked for it to be removed
+* but also nobody got to mask the value. Fix it up.
+*/
+   if (!had_handler)
+   remove = 0;
+
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
 
-   if (remove && uprobe->consumers) {
-   WARN_ON(!uprobe_is_active(uprobe));
+   if (remove)
unapply_uprobe(uprobe, current->mm);
-   }
up_read(&uprobe->register_rwsem);
 }
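
For context, a consumer opts in or out of removal through its handler's
return value; a minimal sketch with a hypothetical consumer (not part of
this series):

	static bool sample_done;	/* hypothetical condition */

	static int sample_handler(struct uprobe_consumer *uc, struct pt_regs *regs)
	{
		if (sample_done)
			return UPROBE_HANDLER_REMOVE;	/* unhook from current->mm */
		return 0;				/* keep the probe */
	}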
 





[PATCH v2 09/11] srcu: Add __srcu_clone_read_lock()

2024-07-11 Thread Peter Zijlstra
In order to support carrying an srcu_read_lock() section across fork,
where both the parent and child process will do srcu_read_unlock(),
the extra decrement needs to be accounted for with an extra increment
at fork time.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/srcu.h |1 +
 include/linux/srcutiny.h |   10 ++
 kernel/rcu/srcutree.c|5 +
 3 files changed, 16 insertions(+)

--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -55,6 +55,7 @@ void call_srcu(struct srcu_struct *ssp,
void (*func)(struct rcu_head *head));
 void cleanup_srcu_struct(struct srcu_struct *ssp);
 int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
+void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx);
 void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
 void synchronize_srcu(struct srcu_struct *ssp);
 unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp);
--- a/include/linux/srcutiny.h
+++ b/include/linux/srcutiny.h
@@ -71,6 +71,16 @@ static inline int __srcu_read_lock(struc
return idx;
 }
 
+static inline void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx)
+{
+   int newval;
+
+   preempt_disable();  // Needed for PREEMPT_AUTO
+   newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1;
+   WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
+   preempt_enable();
+}
+
 static inline void synchronize_srcu_expedited(struct srcu_struct *ssp)
 {
synchronize_srcu(ssp);
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -720,6 +720,11 @@ int __srcu_read_lock(struct srcu_struct
 }
 EXPORT_SYMBOL_GPL(__srcu_read_lock);
 
+void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx)
+{
+   this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);
+}
+
 /*
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
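
For illustration, the intended pairing across fork() is roughly this
(a lifecycle sketch, not code from the series):

	/* parent, inside the SRCU read-side section: */
	idx = __srcu_read_lock(ssp);
	/* fork() duplicates the state carrying idx, so account for the
	 * child's future unlock: */
	__srcu_clone_read_lock(ssp, idx);
	/* eventually, parent and child each do exactly one: */
	__srcu_read_unlock(ssp, idx);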





[PATCH v2 00/11] perf/uprobe: Optimize uprobes

2024-07-11 Thread Peter Zijlstra
Hi!

These patches implement the (S)RCU based proposal to optimize uprobes.

On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func' in a
tight loop:

  perf probe -x ./uprobes test=func
  perf stat -ae probe_uprobe:test  -- sleep 1

  perf probe -x ./uprobes test=func%return
  perf stat -ae probe_uprobe:test__return -- sleep 1

PRE:

  4,038,804  probe_uprobe:test
  2,356,275  probe_uprobe:test__return

POST:

  7,216,579  probe_uprobe:test
  6,744,786  probe_uprobe:test__return

(copy-paste FTW, I didn't do new numbers because the fast paths didn't change --
 and a quick test run shows similar numbers)

Patches also available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes


Changes since last time:
 - better split with intermediate inc_not_zero()
 - fix UPROBE_HANDLER_REMOVE
 - restored the lost rcu_assign_pointer()
 - avoid lockdep for uretprobe_srcu
 - add missing put_uprobe() -> srcu_read_unlock() conversion
 - actually initialize return_instance::has_ref
 - a few comments
 - things I don't remember





[PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe()

2024-07-11 Thread Peter Zijlstra
With handle_swbp() triggering concurrently on (all) CPUs, tree_lock
becomes a bottleneck. Avoid treelock by doing RCU lookups of the
uprobe.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/events/uprobes.c |   49 +++-
 1 file changed, 40 insertions(+), 9 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_
 #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);    /* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
 #define UPROBES_HASH_SZ    13
 /* serialize uprobe->pending_list */
@@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
 struct uprobe {
 struct rb_node  rb_node;    /* node in the rb tree */
refcount_t  ref;
+   struct rcu_head rcu;
struct rw_semaphore register_rwsem;
struct rw_semaphore consumer_rwsem;
 struct list_head    pending_list;
@@ -587,12 +589,25 @@ set_orig_insn(struct arch_uprobe *auprob
*(uprobe_opcode_t *)&auprobe->insn);
 }
 
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+   if (refcount_inc_not_zero(&uprobe->ref))
+   return uprobe;
+   return NULL;
+}
+
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
refcount_inc(&uprobe->ref);
return uprobe;
 }
 
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+   struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+   kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
if (refcount_dec_and_test(&uprobe->ref)) {
@@ -604,7 +619,7 @@ static void put_uprobe(struct uprobe *up
mutex_lock(&delayed_uprobe_lock);
delayed_uprobe_remove(uprobe, NULL);
mutex_unlock(&delayed_uprobe_lock);
-   kfree(uprobe);
+   call_rcu(&uprobe->rcu, uprobe_free_rcu);
}
 }
 
@@ -653,10 +668,10 @@ static struct uprobe *__find_uprobe(stru
.inode = inode,
.offset = offset,
};
-   struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
+   struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
 
if (node)
-   return get_uprobe(__node_2_uprobe(node));
+   return try_get_uprobe(__node_2_uprobe(node));
 
return NULL;
 }
@@ -667,20 +682,32 @@ static struct uprobe *__find_uprobe(stru
  */
 static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
-   struct uprobe *uprobe;
+   unsigned int seq;
 
-   read_lock(&uprobes_treelock);
-   uprobe = __find_uprobe(inode, offset);
-   read_unlock(&uprobes_treelock);
+   guard(rcu)();
 
-   return uprobe;
+   do {
+   seq = read_seqcount_begin(&uprobes_seqcount);
+   struct uprobe *uprobe = __find_uprobe(inode, offset);
+   if (uprobe) {
+   /*
+* Lockless RB-tree lookups are prone to false-negatives.
+* If they find something, it's good. If they do not find,
+* it needs to be validated.
+*/
+   return uprobe;
+   }
+   } while (read_seqcount_retry(&uprobes_seqcount, seq));
+
+   /* Really didn't find anything. */
+   return NULL;
 }
 
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
struct rb_node *node;
 
-   node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+   node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
if (node)
return get_uprobe(__node_2_uprobe(node));
 
@@ -702,7 +729,9 @@ static struct uprobe *insert_uprobe(stru
struct uprobe *u;
 
write_lock(&uprobes_treelock);
+   write_seqcount_begin(&uprobes_seqcount);
u = __insert_uprobe(uprobe);
+   write_seqcount_end(&uprobes_seqcount);
write_unlock(&uprobes_treelock);
 
return u;
@@ -936,7 +965,9 @@ static void delete_uprobe(struct uprobe
return;
 
write_lock(&uprobes_treelock);
+   write_seqcount_begin(&uprobes_seqcount);
rb_erase(&uprobe->rb_node, &uprobes_tree);
+   write_seqcount_end(&uprobes_seqcount);
write_unlock(&uprobes_treelock);
RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
put_uprobe(uprobe);





[PATCH v2 01/11] perf/uprobe: Re-indent labels

2024-07-11 Thread Peter Zijlstra
Remove the silly label indenting.

 s/^\ \([[:alnum:]]*\):$/\1:/g

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/events/uprobes.c |   26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -205,7 +205,7 @@ static int __replace_page(struct vm_area
folio_put(old_folio);
 
err = 0;
- unlock:
+unlock:
mmu_notifier_invalidate_range_end(&range);
folio_unlock(old_folio);
return err;
@@ -857,7 +857,7 @@ static int prepare_uprobe(struct uprobe
smp_wmb(); /* pairs with the smp_rmb() in handle_swbp() */
set_bit(UPROBE_COPY_INSN, &uprobe->flags);
 
- out:
+out:
up_write(&uprobe->consumer_rwsem);
 
return ret;
@@ -965,7 +965,7 @@ build_map_info(struct address_space *map
struct map_info *info;
int more = 0;
 
- again:
+again:
i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
@@ -1019,7 +1019,7 @@ build_map_info(struct address_space *map
} while (--more);
 
goto again;
- out:
+out:
while (prev)
prev = free_map_info(prev);
return curr;
@@ -1068,13 +1068,13 @@ register_for_each_vma(struct uprobe *upr
 err |= remove_breakpoint(uprobe, mm, info->vaddr);
}
 
- unlock:
+unlock:
mmap_write_unlock(mm);
- free:
+free:
mmput(mm);
info = free_map_info(info);
}
- out:
+out:
percpu_up_write(&dup_mmap_sem);
return err;
 }
@@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
return -EINVAL;
 
- retry:
+retry:
uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
if (!uprobe)
return -ENOMEM;
@@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
ret = 0;
/* pairs with get_xol_area() */
smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
- fail:
+fail:
mmap_write_unlock(mm);
 
return ret;
@@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
kfree(area->bitmap);
  free_area:
kfree(area);
- out:
+out:
return NULL;
 }
 
@@ -1915,7 +1915,7 @@ static void prepare_uretprobe(struct upr
utask->return_instances = ri;
 
return;
- fail:
+fail:
kfree(ri);
 }
 
@@ -2031,7 +2031,7 @@ static int is_trap_at_addr(struct mm_str
 
copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
put_page(page);
- out:
+out:
/* This needs to return true for any variant of the trap insn */
return is_trap_insn(&opcode);
 }
@@ -2159,7 +2159,7 @@ static void handle_trampoline(struct pt_
utask->return_instances = ri;
return;
 
- sigill:
+sigill:
uprobe_warn(current, "handle uretprobe, sending SIGILL.");
force_sig(SIGILL);
 





[PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU

2024-07-11 Thread Peter Zijlstra
With handle_swbp() hitting concurrently on (all) CPUs, potentially on
the same uprobe, the uprobe->refcount can get *very* hot. Move the
struct uprobe lifetime into uprobes_srcu such that it covers both the
uprobe and the uprobe->consumers list.

With this, handle_swbp() can use a single large SRCU critical section
to avoid taking a refcount on the uprobe for its duration.

Notably, the single-step and uretprobe paths need a reference that
leaves handle_swbp() and will, for now, still use ->refcount.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/events/uprobes.c |   68 
 1 file changed, 41 insertions(+), 27 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -51,7 +51,7 @@ static struct mutex uprobes_mmap_mutex[U
 DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
 /*
- * Covers uprobe->consumers lifetime.
+ * Covers uprobe->consumers lifetime as well as struct uprobe.
  */
 DEFINE_STATIC_SRCU(uprobes_srcu);
 
@@ -626,7 +626,7 @@ static void put_uprobe(struct uprobe *up
mutex_lock(&delayed_uprobe_lock);
delayed_uprobe_remove(uprobe, NULL);
mutex_unlock(&delayed_uprobe_lock);
-   call_rcu(&uprobe->rcu, uprobe_free_rcu);
+   call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
}
 }
 
@@ -678,7 +678,7 @@ static struct uprobe *__find_uprobe(stru
 struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
 
if (node)
-   return try_get_uprobe(__node_2_uprobe(node));
+   return __node_2_uprobe(node);
 
return NULL;
 }
@@ -691,7 +691,7 @@ static struct uprobe *find_uprobe(struct
 {
unsigned int seq;
 
-   guard(rcu)();
+   lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
 
do {
seq = read_seqcount_begin(&uprobes_seqcount);
@@ -1142,6 +1142,8 @@ void uprobe_unregister_nosync(struct ino
 {
struct uprobe *uprobe;
 
+   guard(srcu)(&uprobes_srcu);
+
uprobe = find_uprobe(inode, offset);
if (WARN_ON(!uprobe))
return;
@@ -1151,7 +1153,6 @@ void uprobe_unregister_nosync(struct ino
__uprobe_unregister(uprobe, uc);
raw_write_seqcount_end(&uprobe->register_seq);
up_write(&uprobe->register_rwsem);
-   put_uprobe(uprobe);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
 
@@ -1263,6 +1264,8 @@ int uprobe_apply(struct inode *inode, lo
struct uprobe_consumer *con;
int ret = -ENOENT;
 
+   guard(srcu)(&uprobes_srcu);
+
uprobe = find_uprobe(inode, offset);
if (WARN_ON(!uprobe))
return ret;
@@ -1275,7 +1278,6 @@ int uprobe_apply(struct inode *inode, lo
ret = register_for_each_vma(uprobe, add ? uc : NULL);
raw_write_seqcount_end(&uprobe->register_seq);
up_write(&uprobe->register_rwsem);
-   put_uprobe(uprobe);
 
return ret;
 }
@@ -1929,10 +1931,14 @@ static void prepare_uretprobe(struct upr
if (!ri)
return;
 
+   ri->uprobe = try_get_uprobe(uprobe);
+   if (!ri->uprobe)
+   goto err_mem;
+
trampoline_vaddr = get_trampoline_vaddr();
 orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
if (orig_ret_vaddr == -1)
-   goto fail;
+   goto err_uprobe;
 
/* drop the entries invalidated by longjmp() */
chained = (orig_ret_vaddr == trampoline_vaddr);
@@ -1950,12 +1956,11 @@ static void prepare_uretprobe(struct upr
 * attack from user-space.
 */
uprobe_warn(current, "handle tail call");
-   goto fail;
+   goto err_uprobe;
}
orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
}
 
-   ri->uprobe = get_uprobe(uprobe);
ri->func = instruction_pointer(regs);
ri->stack = user_stack_pointer(regs);
ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1966,7 +1971,10 @@ static void prepare_uretprobe(struct upr
utask->return_instances = ri;
 
return;
-fail:
+
+err_uprobe:
+   put_uprobe(ri->uprobe);
+err_mem:
kfree(ri);
 }
 
@@ -1982,22 +1990,31 @@ pre_ssout(struct uprobe *uprobe, struct
if (!utask)
return -ENOMEM;
 
+   utask->active_uprobe = try_get_uprobe(uprobe);
+   if (!utask->active_uprobe)
+   return -ESRCH;
+
xol_vaddr = xol_get_insn_slot(uprobe);
-   if (!xol_vaddr)
-   return -ENOMEM;
+   if (!xol_vaddr) {
+   err = -ENOMEM;
+   goto err_uprobe;
+   }
 
utask->xol_vaddr = xol_vaddr;
utask->vaddr = bp_vaddr;
 
err = arch_uprobe_pre_xol(&uprobe->arch, regs);
-   if (unlikely(err)) {
-   xol_free_insn_slot(current);
-   return err;
-   }
+  

[PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list

2024-07-11 Thread Peter Zijlstra
With handle_swbp() hitting concurrently on (all) CPUs the
uprobe->register_rwsem can get very contended. Add an SRCU instance to
cover the consumer list and consumer lifetime.

Since the consumers are externally embedded structures, unregister will
have to suffer a synchronize_srcu().

A notable complication is the UPROBE_HANDLER_REMOVE logic which can
race against uprobe_register() such that it might want to remove a
freshly installed handler that didn't get called. In order to close
this hole, a seqcount is added. With that, the removal path can tell
if anything changed and bail out of the removal.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/events/uprobes.c |   60 
 1 file changed, 50 insertions(+), 10 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -49,6 +50,11 @@ static struct mutex uprobes_mmap_mutex[U
 
 DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
+/*
+ * Covers uprobe->consumers lifetime.
+ */
+DEFINE_STATIC_SRCU(uprobes_srcu);
+
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN   0
 
@@ -57,6 +63,7 @@ struct uprobe {
refcount_t  ref;
struct rcu_head rcu;
struct rw_semaphore register_rwsem;
+   seqcount_t  register_seq;
struct rw_semaphore consumer_rwsem;
 struct list_head    pending_list;
struct uprobe_consumer  *consumers;
@@ -760,6 +767,7 @@ static struct uprobe *alloc_uprobe(struc
uprobe->offset = offset;
uprobe->ref_ctr_offset = ref_ctr_offset;
init_rwsem(&uprobe->register_rwsem);
+   seqcount_init(&uprobe->register_seq);
init_rwsem(&uprobe->consumer_rwsem);
 
/* add to uprobes_tree, sorted on inode:offset */
@@ -782,8 +790,8 @@ static struct uprobe *alloc_uprobe(struc
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
down_write(&uprobe->consumer_rwsem);
-   uc->next = uprobe->consumers;
-   uprobe->consumers = uc;
+   WRITE_ONCE(uc->next, uprobe->consumers);
+   rcu_assign_pointer(uprobe->consumers, uc);
up_write(&uprobe->consumer_rwsem);
 }
 
@@ -800,7 +808,7 @@ static bool consumer_del(struct uprobe *
down_write(&uprobe->consumer_rwsem);
for (con = &uprobe->consumers; *con; con = &(*con)->next) {
if (*con == uc) {
-   *con = uc->next;
+   WRITE_ONCE(*con, uc->next);
ret = true;
break;
}
@@ -1139,9 +1147,13 @@ void uprobe_unregister(struct inode *ino
return;
 
down_write(&uprobe->register_rwsem);
+   raw_write_seqcount_begin(&uprobe->register_seq);
__uprobe_unregister(uprobe, uc);
+   raw_write_seqcount_end(&uprobe->register_seq);
up_write(&uprobe->register_rwsem);
put_uprobe(uprobe);
+
+   synchronize_srcu(&uprobes_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
@@ -1204,10 +1216,12 @@ static int __uprobe_register(struct inod
down_write(&uprobe->register_rwsem);
ret = -EAGAIN;
if (likely(uprobe_is_active(uprobe))) {
+   raw_write_seqcount_begin(&uprobe->register_seq);
consumer_add(uprobe, uc);
ret = register_for_each_vma(uprobe, uc);
if (ret)
__uprobe_unregister(uprobe, uc);
+   raw_write_seqcount_end(&uprobe->register_seq);
}
up_write(&uprobe->register_rwsem);
put_uprobe(uprobe);
@@ -1250,10 +1264,12 @@ int uprobe_apply(struct inode *inode, lo
return ret;
 
down_write(&uprobe->register_rwsem);
+   raw_write_seqcount_begin(&uprobe->register_seq);
for (con = uprobe->consumers; con && con != uc ; con = con->next)
;
if (con)
ret = register_for_each_vma(uprobe, add ? uc : NULL);
+   raw_write_seqcount_end(&uprobe->register_seq);
up_write(&uprobe->register_rwsem);
put_uprobe(uprobe);
 
@@ -2096,15 +2112,23 @@ static struct uprobe *find_active_uprobe
return uprobe;
 }
 
+#define for_each_consumer_rcu(pos, head) \
+   for (pos = rcu_dereference_raw(head); pos; \
+pos = rcu_dereference_raw(pos->next))
+
 static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 {
struct uprobe_consumer *uc;
int remove = UPROBE_HANDLER_REMOVE;
bool need_prep = false; /* prepare return uprobe, when needed */
bool had_handler = false;
+   unsigned int seq;
 
-   down_read(&uprobe->register_rwsem);
-   for (uc = uprobe->consumers; uc; uc = uc->next) {
+   guard(srcu)(&uprobes_srcu);
+
+   seq = raw_read_seqcount_begin(&uprobe->register_seq);
+
+   for_each_consumer_rcu(uc, uprobe->consumers) {
int rc = 
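
For context, the bail-out described in the changelog above would look
roughly like this sketch (the corresponding hunk is cut off in this
excerpt):

	if (remove) {
		/*
		 * If a register/unregister slipped in while the handlers
		 * ran, a freshly installed handler may not have been
		 * called; bail out of the removal in that case.
		 */
		if (read_seqcount_retry(&uprobe->register_seq, seq))
			remove = 0;
		else
			unapply_uprobe(uprobe, current->mm);
	}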

[PATCH v2 11/11] perf/uprobe: Add uretprobe timer

2024-07-11 Thread Peter Zijlstra
In order to put a bound on the uretprobe_srcu critical section, add a
timer to uprobe_task. Upon every RI added or removed the timer is
pushed forward to now + 1s. If the timer were ever to fire, it would
convert the SRCU 'reference' to a refcount reference if possible.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/uprobes.h |8 +
 kernel/events/uprobes.c |   67 
 2 files changed, 69 insertions(+), 6 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct vm_area_struct;
 struct mm_struct;
@@ -79,6 +80,10 @@ struct uprobe_task {
struct return_instance  *return_instances;
 unsigned int    depth;
 unsigned int    active_srcu_idx;
+
+   struct timer_list   ri_timer;
+   struct callback_headri_task_work;
+   struct task_struct  *task;
 };
 
 struct return_instance {
@@ -86,7 +91,8 @@ struct return_instance {
unsigned long   func;
unsigned long   stack;  /* stack pointer */
unsigned long   orig_ret_vaddr; /* original return address */
-   bool    chained;    /* true, if instance is nested */
+   u8  chained;    /* true, if instance is nested */
+   u8  has_ref;
int srcu_idx;
 
struct return_instance  *next;  /* keep as stack */
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1761,7 +1761,12 @@ unsigned long uprobe_get_trap_addr(struc
 static struct return_instance *free_ret_instance(struct return_instance *ri)
 {
struct return_instance *next = ri->next;
-   __srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+   if (ri->uprobe) {
+   if (ri->has_ref)
+   put_uprobe(ri->uprobe);
+   else
+   __srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+   }
kfree(ri);
return next;
 }
@@ -1785,11 +1790,48 @@ void uprobe_free_utask(struct task_struc
while (ri)
ri = free_ret_instance(ri);
 
+   timer_delete_sync(&utask->ri_timer);
+   task_work_cancel(utask->task, &utask->ri_task_work);
xol_free_insn_slot(t);
kfree(utask);
t->utask = NULL;
 }
 
+static void return_instance_task_work(struct callback_head *head)
+{
+   struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
+   struct return_instance *ri;
+
+   for (ri = utask->return_instances; ri; ri = ri->next) {
+   if (!ri->uprobe)
+   continue;
+   if (ri->has_ref)
+   continue;
+   if (refcount_inc_not_zero(&ri->uprobe->ref))
+   ri->has_ref = true;
+   else
+   ri->uprobe = NULL;
+   __srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+   }
+}
+
+static void return_instance_timer(struct timer_list *timer)
+{
+   struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
+   task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
+}
+
+static struct uprobe_task *alloc_utask(struct task_struct *task)
+{
+   struct uprobe_task *utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+   if (!utask)
+   return NULL;
+   timer_setup(&utask->ri_timer, return_instance_timer, 0);
+   init_task_work(&utask->ri_task_work, return_instance_task_work);
+   utask->task = task;
+   return utask;
+}
+
 /*
  * Allocate a uprobe_task object for the task if necessary.
  * Called when the thread hits a breakpoint.
@@ -1801,7 +1843,7 @@ void uprobe_free_utask(struct task_struc
 static struct uprobe_task *get_utask(void)
 {
if (!current->utask)
-   current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+   current->utask = alloc_utask(current);
return current->utask;
 }
 
@@ -1810,7 +1852,7 @@ static int dup_utask(struct task_struct
struct uprobe_task *n_utask;
struct return_instance **p, *o, *n;
 
-   n_utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+   n_utask = alloc_utask(t);
if (!n_utask)
return -ENOMEM;
t->utask = n_utask;
@@ -1822,13 +1864,20 @@ static int dup_utask(struct task_struct
return -ENOMEM;
 
*n = *o;
-   __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
+   if (n->uprobe) {
+   if (n->has_ref)
+   get_uprobe(n->uprobe);
+   else
+   __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
+   }
n->next = NULL;
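
For context, the "push the timer forward" behaviour described in the
changelog is cut off in this excerpt; it would look roughly like this
sketch (ri_timer_push is a hypothetical helper name):

	/* Re-arm the watchdog to now + 1s on every RI add/remove. */
	static void ri_timer_push(struct uprobe_task *utask)
	{
		mod_timer(&utask->ri_timer, jiffies + HZ);
	}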

[PATCH v2 02/11] perf/uprobe: Remove spurious whitespace

2024-07-11 Thread Peter Zijlstra


Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/events/uprobes.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -67,7 +67,7 @@ struct uprobe {
 * The generic code assumes that it has two members of unknown type
 * owned by the arch-specific code:
 *
-*  insn -  copy_insn() saves the original instruction here for
+*  insn -  copy_insn() saves the original instruction here for
 *  arch_uprobe_analyze_insn().
 *
 *  ixol -  potentially modified instruction to execute out of
@@ -95,18 +95,18 @@ static LIST_HEAD(delayed_uprobe_list);
  * allocated.
  */
 struct xol_area {
-   wait_queue_head_t   wq; /* if all slots are busy */
-   atomic_t    slot_count; /* number of in-use slots */
-   unsigned long   *bitmap;    /* 0 = free slot */
+   wait_queue_head_t   wq; /* if all slots are busy */
+   atomic_t    slot_count; /* number of in-use slots */
+   unsigned long   *bitmap;    /* 0 = free slot */
 
struct vm_special_mapping   xol_mapping;
-   struct page *pages[2];
+   struct page *pages[2];
/*
 * We keep the vma's vm_start rather than a pointer to the vma
 * itself.  The probed process or a naughty kernel module could make
 * the vma go away, and we must handle that reasonably gracefully.
 */
-   unsigned long   vaddr;  /* Page(s) of instruction slots */
+   unsigned long   vaddr;  /* Page(s) of instruction slots */
 };
 
 /*





[PATCH v2 07/11] perf/uprobe: Split uprobe_unregister()

2024-07-11 Thread Peter Zijlstra
With uprobe_unregister() having grown a synchronize_srcu(), it becomes
fairly slow to call. Esp. since both users of this API call it in a
loop.

Peel off the sync_srcu() and do it once, after the loop.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Masami Hiramatsu (Google) 
---
 include/linux/uprobes.h |8 ++--
 kernel/events/uprobes.c |8 ++--
 kernel/trace/bpf_trace.c|6 --
 kernel/trace/trace_uprobe.c |6 +-
 4 files changed, 21 insertions(+), 7 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -113,7 +113,8 @@ extern int uprobe_write_opcode(struct ar
 extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern void uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern void uprobe_unregister_sync(void);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -163,7 +164,10 @@ uprobe_apply(struct inode *inode, loff_t
return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+{
+}
+static inline void uprobe_unregister_sync(void)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1138,7 +1138,7 @@ __uprobe_unregister(struct uprobe *uprob
  * @offset: offset from the start of the file.
  * @uc: identify which probe if multiple probes are colocated.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+void uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
 {
struct uprobe *uprobe;
 
@@ -1152,10 +1152,14 @@ void uprobe_unregister(struct inode *ino
raw_write_seqcount_end(&uprobe->register_seq);
up_write(&uprobe->register_rwsem);
put_uprobe(uprobe);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
 
+void uprobe_unregister_sync(void)
+{
synchronize_srcu(&uprobes_srcu);
 }
-EXPORT_SYMBOL_GPL(uprobe_unregister);
+EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
 
 /*
  * __uprobe_register - register a probe
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3181,9 +3181,11 @@ static void bpf_uprobe_unregister(struct
u32 i;
 
for (i = 0; i < cnt; i++) {
-   uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
- &uprobes[i].consumer);
+   uprobe_unregister_nosync(d_real_inode(path->dentry), uprobes[i].offset,
+        &uprobes[i].consumer);
}
+   if (cnt)
+   uprobe_unregister_sync();
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct tr
 static void __probe_event_disable(struct trace_probe *tp)
 {
struct trace_uprobe *tu;
+   bool sync = false;
 
tu = container_of(tp, struct trace_uprobe, tp);
WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
@@ -1112,9 +1113,12 @@ static void __probe_event_disable(struct
if (!tu->inode)
continue;
 
-   uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
+   uprobe_unregister_nosync(tu->inode, tu->offset, &tu->consumer);
+   sync = true;
tu->inode = NULL;
}
+   if (sync)
+   uprobe_unregister_sync();
 }
 
 static int probe_event_enable(struct trace_event_call *call,





[PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu()

2024-07-11 Thread Peter Zijlstra
Much like latch_tree, add two RCU methods for the regular RB-tree,
which can be used in conjunction with a seqcount to provide lockless
lookups.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Masami Hiramatsu (Google) 
---
 include/linux/rbtree.h |   67 +
 1 file changed, 67 insertions(+)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -245,6 +245,42 @@ rb_find_add(struct rb_node *node, struct
 }
 
 /**
+ * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
+ * @node: node to look-for / insert
+ * @tree: tree to search / modify
+ * @cmp: operator defining the node order
+ *
+ * Adds a Store-Release for link_node.
+ *
+ * Returns the rb_node matching @node, or NULL when no match is found and @node
+ * is inserted.
+ */
+static __always_inline struct rb_node *
+rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
+   int (*cmp)(struct rb_node *, const struct rb_node *))
+{
+   struct rb_node **link = &tree->rb_node;
+   struct rb_node *parent = NULL;
+   int c;
+
+   while (*link) {
+   parent = *link;
+   c = cmp(node, parent);
+
+   if (c < 0)
+   link = &parent->rb_left;
+   else if (c > 0)
+   link = &parent->rb_right;
+   else
+   return parent;
+   }
+
+   rb_link_node_rcu(node, parent, link);
+   rb_insert_color(node, tree);
+   return NULL;
+}
+
+/**
  * rb_find() - find @key in tree @tree
  * @key: key to match
  * @tree: tree to search
@@ -268,6 +304,37 @@ rb_find(const void *key, const struct rb
else
return node;
}
+
+   return NULL;
+}
+
+/**
+ * rb_find_rcu() - find @key in tree @tree
+ * @key: key to match
+ * @tree: tree to search
+ * @cmp: operator defining the node order
+ *
+ * Notably, tree descent vs concurrent tree rotations is unsound and can result
+ * in false-negatives.
+ *
+ * Returns the rb_node matching @key or NULL.
+ */
+static __always_inline struct rb_node *
+rb_find_rcu(const void *key, const struct rb_root *tree,
+   int (*cmp)(const void *key, const struct rb_node *))
+{
+   struct rb_node *node = tree->rb_node;
+
+   while (node) {
+   int c = cmp(key, node);
+
+   if (c < 0)
+   node = rcu_dereference_raw(node->rb_left);
+   else if (c > 0)
+   node = rcu_dereference_raw(node->rb_right);
+   else
+   return node;
+   }
 
return NULL;
 }





[PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU

2024-07-11 Thread Peter Zijlstra
Both single-step and uretprobes take a refcount on struct uprobe in
handle_swbp() in order to ensure struct uprobe stays extant until a
next trap.

Since uprobe_unregister() only cares about the uprobe_consumer
life-time, and these intra-trap sections can be arbitrarily large,
create a second SRCU domain to cover these.

Notably, a uretprobe with a registered return_instance that never
triggers -- because userspace -- will currently pin the
return_instance and related uprobe until the task dies. With this
conversion to SRCU this behaviour will inhibit freeing of all uprobes.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/uprobes.h |2 +
 kernel/events/uprobes.c |   60 +++-
 2 files changed, 31 insertions(+), 31 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -78,6 +78,7 @@ struct uprobe_task {
 
struct return_instance  *return_instances;
 unsigned int    depth;
+   unsigned int    active_srcu_idx;
 };
 
 struct return_instance {
@@ -86,6 +87,7 @@ struct return_instance {
unsigned long   stack;  /* stack pointer */
unsigned long   orig_ret_vaddr; /* original return address */
 bool    chained;    /* true, if instance is nested */
+   int srcu_idx;
 
struct return_instance  *next;  /* keep as stack */
 };
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -54,6 +54,15 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
  * Covers uprobe->consumers lifetime as well as struct uprobe.
  */
 DEFINE_STATIC_SRCU(uprobes_srcu);
+/*
+ * Covers return_instance->uprobe and utask->active_uprobe. Separate from
+ * uprobe_srcu because uprobe_unregister() doesn't need to wait for this
+ * and these lifetimes can be fairly long.
+ *
+ * Notably, these sections span userspace and as such use
+ * __srcu_read_{,un}lock() to elide lockdep.
+ */
+DEFINE_STATIC_SRCU(uretprobes_srcu);
 
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN   0
@@ -596,25 +605,24 @@ set_orig_insn(struct arch_uprobe *auprob
*(uprobe_opcode_t *)&auprobe->insn);
 }
 
-static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
-{
-   if (refcount_inc_not_zero(&uprobe->ref))
-   return uprobe;
-   return NULL;
-}
-
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
refcount_inc(&uprobe->ref);
return uprobe;
 }
 
-static void uprobe_free_rcu(struct rcu_head *rcu)
+static void uprobe_free_stage2(struct rcu_head *rcu)
 {
struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
kfree(uprobe);
 }
 
+static void uprobe_free_stage1(struct rcu_head *rcu)
+{
+   struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+   call_srcu(&uretprobes_srcu, &uprobe->rcu, uprobe_free_stage2);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
if (refcount_dec_and_test(&uprobe->ref)) {
@@ -626,7 +634,7 @@ static void put_uprobe(struct uprobe *up
mutex_lock(&delayed_uprobe_lock);
delayed_uprobe_remove(uprobe, NULL);
mutex_unlock(&delayed_uprobe_lock);
-   call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
+   call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);
}
 }
 
@@ -1753,7 +1761,7 @@ unsigned long uprobe_get_trap_addr(struc
 static struct return_instance *free_ret_instance(struct return_instance *ri)
 {
struct return_instance *next = ri->next;
-   put_uprobe(ri->uprobe);
+   __srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
kfree(ri);
return next;
 }
@@ -1771,7 +1779,7 @@ void uprobe_free_utask(struct task_struc
return;
 
if (utask->active_uprobe)
-   put_uprobe(utask->active_uprobe);
+   __srcu_read_unlock(&uretprobes_srcu, utask->active_srcu_idx);
 
ri = utask->return_instances;
while (ri)
@@ -1814,7 +1822,7 @@ static int dup_utask(struct task_struct
return -ENOMEM;
 
*n = *o;
-   get_uprobe(n->uprobe);
+   __srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
n->next = NULL;
 
*p = n;
@@ -1931,14 +1939,10 @@ static void prepare_uretprobe(struct upr
if (!ri)
return;
 
-   ri->uprobe = try_get_uprobe(uprobe);
-   if (!ri->uprobe)
-   goto err_mem;
-
trampoline_vaddr = get_trampoline_vaddr();
 orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
if (orig_ret_vaddr == -1)
-   goto err_uprobe;
+   goto err_mem;
 
/* drop the entries invalidated by longjmp() */
chained = (orig_ret_vaddr == trampoline_vaddr);
@@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct u

Re: [PATCH v2 01/11] perf/uprobe: Re-indent labels

2024-07-11 Thread Jiri Olsa
On Thu, Jul 11, 2024 at 01:02:36PM +0200, Peter Zijlstra wrote:

SNIP

> @@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
>   if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
>   return -EINVAL;
>  
> - retry:
> +retry:
>   uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
>   if (!uprobe)
>   return -ENOMEM;
> @@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
>   ret = 0;
>   /* pairs with get_xol_area() */
>   smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
> - fail:
> +fail:
>   mmap_write_unlock(mm);
>  
>   return ret;
> @@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
>   kfree(area->bitmap);
>   free_area:

hi,
missed this one and another one a few lines before that ;-)

jirka

>   kfree(area);
> - out:
> +out:
>   return NULL;
>  }
>  
> @@ -1915,7 +1915,7 @@ static void prepare_uretprobe(struct upr
>   utask->return_instances = ri;
>  
>   return;
> - fail:
> +fail:
>   kfree(ri);
>  }
>  
> @@ -2031,7 +2031,7 @@ static int is_trap_at_addr(struct mm_str
>  
>   copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
>   put_page(page);
> - out:
> +out:
>   /* This needs to return true for any variant of the trap insn */
>   return is_trap_insn(&opcode);
>  }
> @@ -2159,7 +2159,7 @@ static void handle_trampoline(struct pt_
>   utask->return_instances = ri;
>   return;
>  
> - sigill:
> +sigill:
>   uprobe_warn(current, "handle uretprobe, sending SIGILL.");
>   force_sig(SIGILL);
>  
> 
> 



Re: [PATCH v2 01/11] perf/uprobe: Re-indent labels

2024-07-11 Thread Peter Zijlstra
On Thu, Jul 11, 2024 at 01:58:04PM +0200, Jiri Olsa wrote:
> On Thu, Jul 11, 2024 at 01:02:36PM +0200, Peter Zijlstra wrote:
> 
> SNIP
> 
> > @@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
> > if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> > return -EINVAL;
> >  
> > - retry:
> > +retry:
> > uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> > if (!uprobe)
> > return -ENOMEM;
> > @@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
> > ret = 0;
> > /* pairs with get_xol_area() */
> > smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
> > - fail:
> > +fail:
> > mmap_write_unlock(mm);
> >  
> > return ret;
> > @@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
> > kfree(area->bitmap);
> >   free_area:
> 
> hi,
> missed this one and another one a few lines before that ;-)

Bah, _ isn't in [[:alnum:]]. I'll go fix.



Re: [PATCH v2 58/60] i2c: virtio: reword according to newest specification

2024-07-11 Thread Andi Shyti
Hi Wolfram,

On Sat, Jul 06, 2024 at 01:20:58PM GMT, Wolfram Sang wrote:
> Change the wording of this driver wrt. the newest I2C v7 and SMBus 3.2
> specifications and replace "master/slave" with more appropriate terms.
> 
> Signed-off-by: Wolfram Sang 

Reviewed-by: Andi Shyti 

Thanks,
Andi



Re: [PATCH v2 00/60] i2c: reword first drivers according to newest specification

2024-07-11 Thread Andi Shyti
Hi Wolfram,

pushed in i2c/i2c-host.

Thanks for this big piece of work; in the end it turned out quite nice and
I'm happy with the outcome!

Thanks
Andi

On Sat, Jul 06, 2024 at 01:20:00PM GMT, Wolfram Sang wrote:
> Start changing the wording of the I2C main header wrt. the newest I2C
> v7 and SMBus 3.2 specifications and replace "master/slave" with more
> appropriate terms. This first step renames the members of struct
> i2c_algorithm. Once all in-tree users are converted, the anonymous union
> will go away again. All this work will also pave the way for finally
> separating the monolithic header into more fine-grained headers like
> "i2c/clients.h" etc. So, this is not a simple renaming exercise but
> also a chance to update the I2C core to recent Linux standards.
> 
> Changes since v1:
> 
> * changed wording according to the terminology we agreed on and defined
>   upstream. That means consistent use of "controller/target", and no
>   more "host/client". I added "local/remote target" where necessary.
> * added tags which I kept despite some changes in wording. The approach
>   and code changes (if necessary) did not change.
> * rebased to Andi's for-next branch
> * this series only contains patches which convert the drivers fully. If
>   all goes well, no more updates for them are needed. The previous
>   series converted all users of "master_xfer". But to avoid tons of
>   incremental patches to one driver, I will incrementally improve i2c.h
>   and see which drivers can be fully converted step-by-step.
> * do not mention I3C specs in commit messages, not really relevant here
> 
> Please note that I am not super strict with the 80 char limit. And, as
> agreed, I did not convert occasions where old terminology is used in
> register names or bits etc. or in function names outside of the I2C
> realm.
> 
> The outcome is that before this series 115 drivers use old terminology,
> after this only 54. Hooray.
> 
> And a comment to all janitors: Do not convert I2C drivers outside of
> drivers/i2c yet. Let us first gain experience here and present the
> well-tested results of what we figured out to other maintainers then.
> This ensures they have to deal with way less patch revisions.
> 
> Thanks and happy hacking!
> 
> 
> Wolfram Sang (60):
>   i2c: reword i2c_algorithm according to newest specification
>   i2c: ali15x3: reword according to newest specification
>   i2c: altera: reword according to newest specification
>   i2c: au1550: reword according to newest specification
>   i2c: bcm-kona: reword according to newest specification
>   i2c: bcm2835: reword according to newest specification
>   i2c: brcmstb: reword according to newest specification
>   i2c: cht-wc: reword according to newest specification
>   i2c: cp2615: reword according to newest specification
>   i2c: cros-ec-tunnel: reword according to newest specification
>   i2c: davinci: reword according to newest specification
>   i2c: digicolor: reword according to newest specification
>   i2c: diolan-u2c: reword according to newest specification
>   i2c: dln2: reword according to newest specification
>   i2c: fsi: reword according to newest specification
>   i2c: gpio: reword according to newest specification
>   i2c: highlander: reword according to newest specification
>   i2c: hisi: reword according to newest specification
>   i2c: hix5hd2: reword according to newest specification
>   i2c: i801: reword according to newest specification
>   i2c: ibm_iic: reword according to newest specification
>   i2c: iop3xx: reword according to newest specification
>   i2c: isch: reword according to newest specification
>   i2c: jz4780: reword according to newest specification
>   i2c: kempld: reword according to newest specification
>   i2c: ljca: reword according to newest specification
>   i2c: lpc2k: reword according to newest specification
>   i2c: ls2x: reword according to newest specification
>   i2c: mlxcpld: reword according to newest specification
>   i2c: mpc: reword according to newest specification
>   i2c: mt7621: reword according to newest specification
>   i2c: mv64xxx: reword according to newest specification
>   i2c: ocores: reword according to newest specification
>   i2c: octeon: reword according to newest specification
>   i2c: opal: reword according to newest specification
>   i2c: owl: reword according to newest specification
>   i2c: pasemi: reword according to newest specification
>   i2c: piix4: reword according to newest specification
>   i2c: powermac: reword according to newest specification
>   i2c: pxa-pci: reword according to newest specification
>   i2c: riic: reword according to newest specification
>   i2c: rk3x: reword according to newest specification
>   i2c: robotfuzz-osif: reword according to newest specification
>   i2c: rzv2m: reword according to newest specification
>   i2c: sis5595: reword according to newest specification
>   i2c: sprd: reword according to newest specification
>   i2c: stm32f4: reword according to newest specification

Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer

2024-07-11 Thread Oleg Nesterov
Not sure I read this patch correctly, but at first glance it looks
suspicious..

On 07/11, Peter Zijlstra wrote:
>
> +static void return_instance_timer(struct timer_list *timer)
> +{
> + struct uprobe_task *utask = container_of(timer, struct uprobe_task, 
> ri_timer);
> + task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
> +}

What if utask->task sleeps in TASK_STOPPED/TASK_TRACED state before
return from the ret-probed function?

In this case it won't react to TWA_SIGNAL until debugger or SIGCONT
wakes it up.

---
And it seems that even task_work_add() itself is not safe...

Suppose we have 2 ret-probed functions

void f2() { ... }
void f1() { ...; f2(); }

A task T calls f1(), hits the bp, and calls prepare_uretprobe() which does

mod_timer(&utask->ri_timer, jiffies + HZ);

Then later it calls f2() and the pending timer expires after it enters the
kernel, but before the next prepare_uretprobe() -> mod_timer().

In this case ri_task_work is already queued and the timer is pending again.

Now. Even if T goes to the exit_to_user_mode_loop() path immediately, in
theory nothing can guarantee that it will call get_signal/task_work_run
in less than 1 second, it can be preempted.

But T can sleep in xol_take_insn_slot() before return from handle_swbp(),
and this is not so theoretical.

Oleg.




Re: [PATCH v2 1/2] virtio_balloon: add work around for out of spec QEMU

2024-07-11 Thread kernel test robot
Hi Michael,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20240710]
[cannot apply to uml/next remoteproc/rproc-next s390/features linus/master 
uml/fixes v6.10-rc7 v6.10-rc6 v6.10-rc5 v6.10-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Michael-S-Tsirkin/virtio_balloon-add-work-around-for-out-of-spec-QEMU/20240711-004346
base:   next-20240710
patch link:
https://lore.kernel.org/r/19d916257b76148f89de7386389eeb7267b1b61c.1720611677.git.mst%40redhat.com
patch subject: [PATCH v2 1/2] virtio_balloon: add work around for out of spec 
QEMU
config: i386-randconfig-005-20240711 
(https://download.01.org/0day-ci/archive/20240711/202407112126.plguwi8i-...@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 
617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240711/202407112126.plguwi8i-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202407112126.plguwi8i-...@intel.com/

All errors (new ones prefixed by >>):

>> drivers/virtio/virtio_balloon.c:603:55: error: too few arguments to function 
>> call, expected 5, have 4
 602 | err = virtio_find_vqs(vb->vdev,
 |   ~~~
 603 |   
VIRTIO_BALLOON_VQ_REPORTING, vqs_info, NULL);
 |  
^
   include/linux/virtio_config.h:225:5: note: 'virtio_find_vqs' declared here
 225 | int virtio_find_vqs(struct virtio_device *vdev, unsigned int nvqs,
 | ^   ~~
 226 | struct virtqueue *vqs[],
 | 
 227 | struct virtqueue_info vqs_info[],
 | ~
 228 | struct irq_affinity *desc)
 | ~
   1 error generated.
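
Judging by the five-argument prototype quoted in the note above, the failing
call has presumably just lost the vqs array argument; a hedged guess at the
intended call, not the actual fix:

	err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_REPORTING,
			      vqs, vqs_info, NULL);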


vim +603 drivers/virtio/virtio_balloon.c

   560  
   561  static int init_vqs(struct virtio_balloon *vb)
   562  {
   563  struct virtqueue_info vqs_info[VIRTIO_BALLOON_VQ_MAX] = {};
   564  struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
   565  int err;
   566  
   567  /*
   568   * Inflateq and deflateq are used unconditionally. The names[]
   569   * will be NULL if the related feature is not enabled, which 
will
   570   * cause no allocation for the corresponding virtqueue in 
find_vqs.
   571   */
   572  vqs_info[VIRTIO_BALLOON_VQ_INFLATE].callback = balloon_ack;
   573  vqs_info[VIRTIO_BALLOON_VQ_INFLATE].name = "inflate";
   574  vqs_info[VIRTIO_BALLOON_VQ_DEFLATE].callback = balloon_ack;
   575  vqs_info[VIRTIO_BALLOON_VQ_DEFLATE].name = "deflate";
   576  
   577  if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
   578  vqs_info[VIRTIO_BALLOON_VQ_STATS].name = "stats";
   579  vqs_info[VIRTIO_BALLOON_VQ_STATS].callback = 
stats_request;
   580  }
   581  
   582  if (virtio_has_feature(vb->vdev, 
VIRTIO_BALLOON_F_FREE_PAGE_HINT))
   583  vqs_info[VIRTIO_BALLOON_VQ_FREE_PAGE].name = 
"free_page_vq";
   584  
   585  if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
   586  vqs_info[VIRTIO_BALLOON_VQ_REPORTING].name = 
"reporting_vq";
   587  vqs_info[VIRTIO_BALLOON_VQ_REPORTING].callback = 
balloon_ack;
   588  }
   589  
   590  err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs,
   591vqs_info, NULL);
   592  if (err) {
   593  /*
   594   * Try to work around QEMU bug which since 2020 
confused vq numbers
   595   * when VIRTIO_BALLOON_F_REPORTING but not
   596   * VIRTIO_BALLOON_F_FREE_PAGE_HINT are offered.
   597   */
   598  if (virtio_has_feature(vb->vdev, 
VIRTIO_BALLOON_F_REPORTING) &&
   599  !virtio_has_feature(vb->vdev, 
VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
   600  vqs_info[VIRTIO_BALLOON_VQ_FREE_PAGE].name = 
"reporting_vq";
   

Re: [PATCH v2 2/2] virtio: fix vq # for balloon

2024-07-11 Thread kernel test robot
Hi Michael,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20240710]
[cannot apply to uml/next remoteproc/rproc-next s390/features linus/master 
uml/fixes v6.10-rc7 v6.10-rc6 v6.10-rc5 v6.10-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Michael-S-Tsirkin/virtio_balloon-add-work-around-for-out-of-spec-QEMU/20240711-004346
base:   next-20240710
patch link:
https://lore.kernel.org/r/3d655be73ce220f176b2c163839d83699f8faf43.1720611677.git.mst%40redhat.com
patch subject: [PATCH v2 2/2] virtio: fix vq # for balloon
config: i386-randconfig-014-20240711 
(https://download.01.org/0day-ci/archive/20240711/202407112113.szspddlk-...@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 
617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240711/202407112113.szspddlk-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202407112113.szspddlk-...@intel.com/

All errors (new ones prefixed by >>):

>> drivers/virtio/virtio_pci_common.c:391:1: error: version control conflict 
>> marker in file
 391 | <<<<<<< HEAD
 | ^
>> drivers/virtio/virtio_pci_common.c:392:30: error: use of undeclared 
>> identifier 'queue_idx'
 392 | vqs[i] = vp_setup_vq(vdev, queue_idx++, 
vqi->callback,
 |^
   2 errors generated.


vim +391 drivers/virtio/virtio_pci_common.c

   365  
   366  static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int 
nvqs,
   367  struct virtqueue *vqs[],
   368  struct virtqueue_info vqs_info[])
   369  {
   370  struct virtio_pci_device *vp_dev = to_vp_device(vdev);
   371  int i, err;
   372  
   373  vp_dev->vqs = kcalloc(nvqs, sizeof(*vp_dev->vqs), GFP_KERNEL);
   374  if (!vp_dev->vqs)
   375  return -ENOMEM;
   376  
   377  err = request_irq(vp_dev->pci_dev->irq, vp_interrupt, 
IRQF_SHARED,
   378  dev_name(&vdev->dev), vp_dev);
   379  if (err)
   380  goto out_del_vqs;
   381  
   382  vp_dev->intx_enabled = 1;
   383  vp_dev->per_vq_vectors = false;
   384  for (i = 0; i < nvqs; ++i) {
   385  struct virtqueue_info *vqi = &vqs_info[i];
   386  
   387  if (!vqi->name) {
   388  vqs[i] = NULL;
   389  continue;
   390  }
 > 391  <<<<<<< HEAD
 > 392  vqs[i] = vp_setup_vq(vdev, queue_idx++, vqi->callback,
   393   vqi->name, vqi->ctx,
   394  =======
   395  vqs[i] = vp_setup_vq(vdev, i, callbacks[i], names[i],
   396   ctx ? ctx[i] : false,
   397  >>>>>>> f814759f80b7... virtio: fix vq # for balloon
   398   VIRTIO_MSI_NO_VECTOR);
   399  if (IS_ERR(vqs[i])) {
   400  err = PTR_ERR(vqs[i]);
   401  goto out_del_vqs;
   402  }
   403  }
   404  
   405  return 0;
   406  out_del_vqs:
   407  vp_del_vqs(vdev);
   408  return err;
   409  }
   410  
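
A plausible resolution of the conflict shown above, hedged (it combines the
new vqs_info style with the per-index vq number the patch subject implies):

	vqs[i] = vp_setup_vq(vdev, i, vqi->callback,
			     vqi->name, vqi->ctx,
			     VIRTIO_MSI_NO_VECTOR);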

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Re: [PATCH] test/vsock: add install target

2024-07-11 Thread Stefan Hajnoczi
On Thu, Jul 11, 2024 at 09:07:04AM +0200, Stefano Garzarella wrote:
> CCing Stefan.
> 
> On Wed, Jul 10, 2024 at 07:00:59PM GMT, Jakub Kicinski wrote:
> > On Wed, 10 Jul 2024 13:58:39 +0200 Stefano Garzarella wrote:
> > > There is a comment there:
> > > 
> > >  # Avoid changing the rest of the logic here and lib.mk.
> > > 
> > > Added by commit 17eac6c2db8b2cdfe33d40229bdda2acd86b304a.
> > > 
> > > IIUC they re-used INSTALL_PATH, just to avoid too many changes in that
> > > file and in tools/testing/selftests/lib.mk
> > > 
> > > So, IMHO we should not care about it and only use VSOCK_INSTALL_PATH if
> > > you don't want to conflict with INSTALL_PATH.
> > 
> > Any reason why vsock isn't part of selftests in the first place?
> > 
> 
> Usually vsock tests test both the driver (virtio-vsock) in the guest and the
> device in the host kernel (vhost-vsock). So I usually run the tests in 2
> nested VMs to test the latest changes for both the guest and the host.
> 
> I don't know enough selftests, but do you think it is possible to integrate
> them?
> 
> CCing Stefan who is the original author and may remember more reasons about
> this choice.

It's probably because of the manual steps in tools/testing/vsock/README:

  The following prerequisite steps are not automated and must be performed prior
  to running tests:

  1. Build the kernel, make headers_install, and build these tests.
  2. Install the kernel and tests on the host.
  3. Install the kernel and tests inside the guest.
  4. Boot the guest and ensure that the AF_VSOCK transport is enabled.

If you want to automate this for QEMU, VMware, and Hyper-V that would be
great. It relies on having a guest running under these hypervisors and
that's not trivial to automate (plus it involves proprietary software
for VMware and Hyper-V that may not be available without additional
license agreements and/or payment).

Stefan




Re: [PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe()

2024-07-11 Thread Google
On Thu, 11 Jul 2024 13:02:39 +0200
Peter Zijlstra  wrote:

> With handle_swbp() triggering concurrently on (all) CPUs, tree_lock
> becomes a bottleneck. Avoid treelock by doing RCU lookups of the
> uprobe.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thanks,

> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/events/uprobes.c |   49 
> +++-
>  1 file changed, 40 insertions(+), 9 deletions(-)
> 
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_
>  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
>  
>  static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = 
> SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>  
>  #define UPROBES_HASH_SZ  13
>  /* serialize uprobe->pending_list */
> @@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
>  struct uprobe {
>   struct rb_node  rb_node;/* node in the rb tree */
>   refcount_t  ref;
> + struct rcu_head rcu;
>   struct rw_semaphore register_rwsem;
>   struct rw_semaphore consumer_rwsem;
>   struct list_headpending_list;
> @@ -587,12 +589,25 @@ set_orig_insn(struct arch_uprobe *auprob
>   *(uprobe_opcode_t *)&auprobe->insn);
>  }
>  
> +static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
> +{
> + if (refcount_inc_not_zero(&uprobe->ref))
> + return uprobe;
> + return NULL;
> +}
> +
>  static struct uprobe *get_uprobe(struct uprobe *uprobe)
>  {
>   refcount_inc(&uprobe->ref);
>   return uprobe;
>  }
>  
> +static void uprobe_free_rcu(struct rcu_head *rcu)
> +{
> + struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> + kfree(uprobe);
> +}
> +
>  static void put_uprobe(struct uprobe *uprobe)
>  {
>   if (refcount_dec_and_test(&uprobe->ref)) {
> @@ -604,7 +619,7 @@ static void put_uprobe(struct uprobe *up
>   mutex_lock(&delayed_uprobe_lock);
>   delayed_uprobe_remove(uprobe, NULL);
>   mutex_unlock(&delayed_uprobe_lock);
> - kfree(uprobe);
> + call_rcu(&uprobe->rcu, uprobe_free_rcu);
>   }
>  }
>  
> @@ -653,10 +668,10 @@ static struct uprobe *__find_uprobe(stru
>   .inode = inode,
>   .offset = offset,
>   };
> - struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> + struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, 
> __uprobe_cmp_key);
>  
>   if (node)
> - return get_uprobe(__node_2_uprobe(node));
> + return try_get_uprobe(__node_2_uprobe(node));
>  
>   return NULL;
>  }
> @@ -667,20 +682,32 @@ static struct uprobe *__find_uprobe(stru
>   */
>  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
>  {
> - struct uprobe *uprobe;
> + unsigned int seq;
>  
> - read_lock(&uprobes_treelock);
> - uprobe = __find_uprobe(inode, offset);
> - read_unlock(&uprobes_treelock);
> + guard(rcu)();
>  
> - return uprobe;
> + do {
> + seq = read_seqcount_begin(&uprobes_seqcount);
> + struct uprobe *uprobe = __find_uprobe(inode, offset);
> + if (uprobe) {
> + /*
> +  * Lockless RB-tree lookups are prone to 
> false-negatives.
> +  * If they find something, it's good. If they do not 
> find,
> +  * it needs to be validated.
> +  */
> + return uprobe;
> + }
> + } while (read_seqcount_retry(&uprobes_seqcount, seq));
> +
> + /* Really didn't find anything. */
> + return NULL;
>  }
>  
>  static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
>  {
>   struct rb_node *node;
>  
> - node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> + node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
>   if (node)
>   return get_uprobe(__node_2_uprobe(node));
>  
> @@ -702,7 +729,9 @@ static struct uprobe *insert_uprobe(stru
>   struct uprobe *u;
>  
>   write_lock(&uprobes_treelock);
> + write_seqcount_begin(&uprobes_seqcount);
>   u = __insert_uprobe(uprobe);
> + write_seqcount_end(&uprobes_seqcount);
>   write_unlock(&uprobes_treelock);
>  
>   return u;
> @@ -936,7 +965,9 @@ static void delete_uprobe(struct uprobe
>   return;
>  
>   write_lock(&uprobes_treelock);
> + write_seqcount_begin(&uprobes_seqcount);
>   rb_erase(&uprobe->rb_node, &uprobes_tree);
> + write_seqcount_end(&uprobes_seqcount);
>   write_unlock(&uprobes_treelock);
>   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
>   put_uprobe(uprobe);
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU

2024-07-11 Thread Jiri Olsa
On Thu, Jul 11, 2024 at 01:02:43PM +0200, Peter Zijlstra wrote:

SNIP

>   /* Tracing handlers use ->utask to communicate with fetch methods */
>   if (!get_utask())
> - goto out;
> + return;
>  
>   if (arch_uprobe_ignore(&uprobe->arch, regs))
> - goto out;
> + return;
>  
>   handler_chain(uprobe, regs);
>  
>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> - goto out;
> -
> - if (!pre_ssout(uprobe, regs, bp_vaddr))
>   return;
>  
> - /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> -out:
> - put_uprobe(uprobe);
> + pre_ssout(uprobe, regs, bp_vaddr);

pre_ssout could now return void

jirka

>  }
>  
>  /*
> 
> 



Re: [PATCH v2] ring-buffer: Limit time with disabled interrupts in rb_check_pages()

2024-07-11 Thread Steven Rostedt
On Thu, 4 Jul 2024 13:03:47 +0200
Petr Pavlu  wrote:

> > I'm dumb. What's an "era"?  
> 
> I meant it as a calendar era or epoch. The idea was to hint this is
> a number that identifies some structural state of the pages list. Maybe
> pages_gen ("generation") or another name would be better?

Ah, out of context I thought it was short for something. Perhaps just use
"cnt" with a comment, as that can be generic enough for what it is.

Thanks,

-- Steve



Re: [PATCH] test/vsock: add install target

2024-07-11 Thread Jakub Kicinski
On Thu, 11 Jul 2024 15:38:01 +0200 Stefan Hajnoczi wrote:
> > Usually vsock tests test both the driver (virtio-vsock) in the guest and the
> > device in the host kernel (vhost-vsock). So I usually run the tests in 2
> > nested VMs to test the latest changes for both the guest and the host.
> > 
> > I don't know enough selftests, but do you think it is possible to integrate
> > them?
> > 
> > CCing Stefan who is the original author and may remember more reasons about
> > this choice.  
> 
> It's probably because of the manual steps in tools/testing/vsock/README:
> 
>   The following prerequisite steps are not automated and must be performed 
> prior
>   to running tests:
> 
>   1. Build the kernel, make headers_install, and build these tests.
>   2. Install the kernel and tests on the host.
>   3. Install the kernel and tests inside the guest.
>   4. Boot the guest and ensure that the AF_VSOCK transport is enabled.
> 
> If you want to automate this for QEMU, VMware, and Hyper-V that would be
> great. It relies on having a guest running under these hypervisors and
> that's not trivial to automate (plus it involves proprietary software
> for VMware and Hyper-V that may not be available without additional
> license agreements and/or payment).

Not sure if there's a requirement that the full process is automated.
Or at least if there is, we are already breaking it in networking,
because for some tests we need the user to export some env variables
to point the test at the right interfaces and even at a remote machine
to generate traffic. If the env isn't set up, tests return 4 (SKIP).
I don't feel strongly that the ksft + env approach is better, but at
least it gives us easy access to the basic build and packaging
features from ksft. Up to you, but thought I'd ask.
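
As a rough illustration of that convention (a sketch only: the variable name
is hypothetical, the include path assumes the usual selftests layout, and
ksft_exit_skip() exits with the kselftest SKIP code 4):

	#include <stdlib.h>
	#include "../kselftest.h"

	int main(void)
	{
		/* Skip gracefully when the environment isn't provisioned. */
		if (!getenv("VSOCK_TEST_PEER"))
			ksft_exit_skip("VSOCK_TEST_PEER not set\n");

		/* ... drive the actual vsock tests here ... */
		return 0;
	}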



Re: [PATCH RFC 1/1] remoteproc: mediatek: Support reserved CMA regions

2024-07-11 Thread Mathieu Poirier
On Wed, 10 Jul 2024 at 02:42, Shun-yi Wang  wrote:
>
> From: "shun-yi.wang" 
>
> Reserve specific Contiguous Memory Allocator (CMA) regions for
> hardware use. When the name of a reserved region contains "cma",
> a corresponding CMA heap is added.
>
> Signed-off-by: shun-yi.wang 
> ---
>  drivers/remoteproc/mtk_scp.c | 38 
>  1 file changed, 30 insertions(+), 8 deletions(-)
>

I'm not sure what to do with this patch...  Is it a superset of your
other patch [1]?  And if so why is it labelled as an RFC?

Please read the documentation on submitting patches [2] and subscribe
to the remoteproc mailing list to give you an idea of the patch
workflow that is expected.

Thanks,
Mathieu

[1]. [PATCH 1/1] remoteproc: mediatek: Support multiple reserved memory regions
[2]. 
https://elixir.bootlin.com/linux/latest/source/Documentation/process/submitting-patches.rst

> diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c
> index e744c07507ee..ca0a4a52ece9 100644
> --- a/drivers/remoteproc/mtk_scp.c
> +++ b/drivers/remoteproc/mtk_scp.c
> @@ -4,6 +4,7 @@
>
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1006,22 +1007,43 @@ EXPORT_SYMBOL_GPL(scp_mapping_dm_addr);
>
>  static int scp_map_memory_region(struct mtk_scp *scp)
>  {
> -   int ret;
> +   int ret, i, err;
> const struct mtk_scp_sizes_data *scp_sizes;
> +   struct device_node *node = scp->dev->of_node;
> +   struct of_phandle_iterator it;
> +
> +   i = 0;
> +   of_for_each_phandle(&it, err, node, "memory-region", NULL, 0) {
> +   ret = of_reserved_mem_device_init_by_idx(scp->dev, node, i);
> +
> +   if (ret) {
> +   dev_err(scp->dev, "failed to assign memory-region: %s\n",
> +   it.node->name);
> +   of_node_put(it.node);
> +   return -ENOMEM;
> +   }
> +
> +#ifdef CONFIG_DMABUF_HEAPS_CMA
> +   if (strstr(it.node->name, "cma")) {
> +   /* Reserved cma memory region */
> +   ret = dma_heap_add_cma(scp->dev);
> +   if (ret) {
> +   dev_err(scp->dev, "Failed to add reserved cma");
> +   of_node_put(it.node);
> +   return ret;
> +   }
> +   }
> +#endif /* CONFIG_DMABUF_HEAPS_CMA */
>
> -   ret = of_reserved_mem_device_init(scp->dev);
> +   i++;
> +   }
>
> /* reserved memory is optional. */
> -   if (ret == -ENODEV) {
> +   if (!i) {
> dev_info(scp->dev, "skipping reserved memory initialization.");
> return 0;
> }
>
> -   if (ret) {
> -   dev_err(scp->dev, "failed to assign memory-region: %d\n", ret);
> -   return -ENOMEM;
> -   }
> -
> /* Reserved SCP code size */
> scp_sizes = scp->data->scp_sizes;
> scp->cpu_addr = dma_alloc_coherent(scp->dev, scp_sizes->max_dram_size,
> --
> 2.18.0
>



Re: [syzbot] [mm?] possible deadlock in __mmap_lock_do_trace_released

2024-07-11 Thread Steven Rostedt
On Thu, 4 Jul 2024 22:12:45 +0200
Jesper Dangaard Brouer  wrote:

> > 
> > WARNING: possible recursive locking detected
> > 6.10.0-rc2-syzkaller-00797-ga12978712d90 #0 Not tainted
> > 
> > syz-executor646/5097 is trying to acquire lock:
> > 8880b94387e8 (lock#9){+.+.}-{2:2}, at: local_lock_acquire 
> > include/linux/local_lock_internal.h:29 [inline]
> > 8880b94387e8 (lock#9){+.+.}-{2:2}, at: 
> > __mmap_lock_do_trace_released+0x83/0x620 mm/mmap_lock.c:243
> > 
> > but task is already holding lock:
> > 8880b94387e8 (lock#9){+.+.}-{2:2}, at: local_lock_acquire 
> > include/linux/local_lock_internal.h:29 [inline]
> > 8880b94387e8 (lock#9){+.+.}-{2:2}, at: 
> > __mmap_lock_do_trace_released+0x83/0x620 mm/mmap_lock.c:243
> > 
> > other info that might help us debug this:
> >   Possible unsafe locking scenario:
> > 
> > CPU0
> > 
> >lock(lock#9);
> >lock(lock#9);
> > 
> >   *** DEADLOCK ***

Looks like it's trying to take the rwsem mm->mmap_lock recursively. And
rwsems are *not* allowed to be recursively taken, as once there's a writer,
all new acquires of the reader will block. Then you can have:

   CPU0                           CPU1
   ----                           ----
   down_read(lockA);
                                  down_write(lockA); // blocks
   down_read(lockA); // blocks

DEADLOCK!
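
A minimal userspace analogue of that scenario, as a sketch (glibc needs the
writer-preferring lock kind selected explicitly to match the kernel rwsem's
queueing of readers behind a waiting writer):

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <unistd.h>

	static pthread_rwlock_t lock;

	static void *writer(void *arg)
	{
		pthread_rwlock_wrlock(&lock);	/* queues behind the first reader */
		pthread_rwlock_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_rwlockattr_t attr;
		pthread_t t;

		pthread_rwlockattr_init(&attr);
		pthread_rwlockattr_setkind_np(&attr,
				PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
		pthread_rwlock_init(&lock, &attr);

		pthread_rwlock_rdlock(&lock);		/* CPU0: first read acquire */
		pthread_create(&t, NULL, writer, NULL);	/* CPU1 */
		sleep(1);				/* let the writer start waiting */
		pthread_rwlock_rdlock(&lock);		/* recursive read: deadlocks here */
		return 0;
	}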


> > 
> >   May be due to missing lock nesting notation
> >   
> 
> To me, this looks like a lockdep false-positive, but I might be wrong.
> 
> Could someone with more LOCKDEP knowledge give their interpretation?
> 
> The commit[1] adds a fairly standard trylock scheme.
> Do I need to lockdep-annotate trylocks in some special way?
> 
>   [1] https://git.kernel.org/torvalds/c/21c38a3bd4ee3fb733
> 
> Also notice the change uses raw_spin_lock, which might be harder for lockdep?
> So, I also enabled CONFIG_PROVE_RAW_LOCK_NESTING in my testlab to help
> with this, and CONFIG_PROVE_LOCKING.
> (And obviously I also enabled LOCKDEP*)
> 
> --Jesper
> 
> > 5 locks held by syz-executor646/5097:
> >   #0: 8880182eb118 (&mm->mmap_lock){}-{3:3}, at: mmap_read_lock 
> > include/linux/mmap_lock.h:144 [inline]
> >   #0: 8880182eb118 (&mm->mmap_lock){}-{3:3}, at: 
> > acct_collect+0x1cf/0x830 kernel/acct.c:563
> >   #1: 8880b94387e8 (lock#9){+.+.}-{2:2}, at: local_lock_acquire 
> > include/linux/local_lock_internal.h:29 [inline]
> >   #1: 8880b94387e8 (lock#9){+.+.}-{2:2}, at: 
> > __mmap_lock_do_trace_released+0x83/0x620 mm/mmap_lock.c:243
> >   #2: 8e333fa0 (rcu_read_lock){}-{1:2}, at: rcu_lock_acquire 
> > include/linux/rcupdate.h:329 [inline]
> >   #2: 8e333fa0 (rcu_read_lock){}-{1:2}, at: rcu_read_lock 
> > include/linux/rcupdate.h:781 [inline]
> >   #2: 8e333fa0 (rcu_read_lock){}-{1:2}, at: get_memcg_path_buf 
> > mm/mmap_lock.c:139 [inline]
> >   #2: 8e333fa0 (rcu_read_lock){}-{1:2}, at: 
> > get_mm_memcg_path+0xb1/0x600 mm/mmap_lock.c:209
> >   #3: 8e333fa0 (rcu_read_lock){}-{1:2}, at: 
> > trace_call_bpf+0xbc/0x8a0
> >   #4: 8880182eb118 (&mm->mmap_lock){}-{3:3}, at: mmap_read_trylock 
> > include/linux/mmap_lock.h:163 [inline]
> >   #4: 8880182eb118 (&mm->mmap_lock){}-{3:3}, at: 
> > stack_map_get_build_id_offset+0x237/0x9d0 kernel/bpf/stackmap.c:141
> > 
> > stack backtrace:
> > CPU: 0 PID: 5097 Comm: syz-executor646 Not tainted 
> > 6.10.0-rc2-syzkaller-00797-ga12978712d90 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 06/07/2024
> > Call Trace:
> >   
> >   __dump_stack lib/dump_stack.c:88 [inline]
> >   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
> >   check_deadlock kernel/locking/lockdep.c:3062 [inline]
> >   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3856
> >   __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
> >   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
> >   local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
> >   __mmap_lock_do_trace_released+0x9c/0x620 mm/mmap_lock.c:243

Here we have:

  static inline void mmap_read_lock(struct mm_struct *mm)
  {
__mmap_lock_trace_start_locking(mm, false);
down_read(&mm->mmap_lock);
__mmap_lock_trace_acquire_returned(mm, false, true);
  }

Which is taking the mm->mmap_lock for read.

> >   __mmap_lock_trace_released include/linux/mmap_lock.h:42 [inline]
> >   mmap_read_unlock include/linux/mmap_lock.h:170 [inline]
> >   bpf_mmap_unlock_mm kernel/bpf/mmap_unlock_work.h:52 [inline]
> >   stack_map_get_build_id_offset+0x9c7/0x9d0 kernel/bpf/stackmap.c:173
> >   __bpf_get_stack+0x4ad/0x5a0 kernel/bpf/stackmap.c:449
> >   bpf_prog_e6cf5f9c69743609+0x42/0x46
> >   bpf_dispatcher_nop_func include/linux/bpf.h:1243 [inline]
> >   __bpf_prog_run include/linux/filter.h:691 [inline]
> >   bpf_prog_run include/linux/filter.h:698 [inline]
> >   bpf

Re: [PATCH v2 00/60] i2c: reword first drivers according to newest specification

2024-07-11 Thread Wolfram Sang

> Thanks for this big work, at the end it turned out quite nice and
> I'm happy of the outcome!

Me too. And thanks for the enormous review work!





[PATCH net-next v3 0/2] vsock: avoid queuing on workqueue if possible

2024-07-11 Thread Luigi Leonardi via B4 Relay
This series introduces an optimization for vsock/virtio to reduce latency
and increase throughput: when the guest sends a packet to the host and
the workqueue is empty, the packet is put directly on the virtqueue,
provided there is enough space.

v2->v3
- Performed more experiments using iperf3 using multiple streams
- Handling of reply packets removed from virtio_transport_send_skb,
  as it is needed just by the worker.
- Removed atomic_inc/atomic_sub when queuing directly to the vq.
- Introduced virtio_transport_send_skb_fast_path that handles the
  steps for sending on the vq. 
- Fixed a missing mutex_unlock in error path.
- Changed authorship of the second commit
- Rebased on latest net-next

v1->v2
In this v2 I replaced a mutex_lock with a mutex_trylock because it was
inside an RCU critical section. I also added a check on tx_run, so if the
module is being removed the packet is not queued. I'd like to thank Stefano
for reporting the tx_run issue.

Applied all Stefano's suggestions:
- Minor code style changes
- Minor commit text rewrite
Performed more experiments:
 - Check if all the packets go directly to the vq (Matias' suggestion)
 - Used iperf3 to see if there is any improvement in overall throughput
  from guest to host
 - Pinned the vhost process to a pCPU.
 - Run fio using 512B payload
Rebased on latest net-next

To: Stefan Hajnoczi 
To: Stefano Garzarella 
To: David S. Miller 
To: Eric Dumazet 
To: Jakub Kicinski 
To: Paolo Abeni 
Cc: k...@vger.kernel.org
Cc: virtualizat...@lists.linux.dev
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Signed-off-by: Luigi Leonardi 
---
Luigi Leonardi (1):
  vsock/virtio: avoid queuing packets when work queue is empty

Marco Pinna (1):
  vsock/virtio: refactor virtio_transport_send_pkt_work

 net/vmw_vsock/virtio_transport.c | 143 +--
 1 file changed, 93 insertions(+), 50 deletions(-)
---
base-commit: 58f9416d413aa2c20b2515233ce450a1607ef843
change-id: 20240711-pinna-49bf0ab09909

Best regards,
-- 
Luigi Leonardi 





[PATCH net-next v3 2/2] vsock/virtio: avoid queuing packets when work queue is empty

2024-07-11 Thread Luigi Leonardi via B4 Relay
From: Luigi Leonardi 

Introduce an optimization in virtio_transport_send_pkt:
when the work queue (send_pkt_queue) is empty, the packet is
put directly on the virtqueue, increasing the throughput.

In the following benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.

All vCPUs pinned individually to pCPUs.
vhost process pinned to a pCPU
fio process pinned both inside the host and the guest system.

Host CPU: Intel i7-10700KF CPU @ 3.80GHz
Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before: Linux 6.9.7

Payload 512B:

        1st perc.  overall  99th perc.   (ns)
Before  370        810.15   8656
After   374        780.29   8741

Payload 4K:

        1st perc.  overall  99th perc.   (ns)
Before  460        1720.23  42752
After   460        1520.84  36096

The performance improvement is related to this optimization;
I used eBPF to verify that each packet was sent directly to the
virtqueue.
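
For illustration, that kind of check can be done with a bpftrace one-liner
along these lines (a sketch: it assumes neither function gets inlined; the
names are the ones introduced by this series):

	bpftrace -e '
	kprobe:virtio_transport_send_skb_fast_path { @fast = count(); }
	kprobe:virtio_transport_send_pkt_work      { @worker = count(); }'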

Throughput: iperf-vsock
The size represents the buffer length (-l) to read/write
P represents the number of parallel streams

P=1
        4K     64K    128K
Before  6.87   29.3   29.5   Gb/s
After   10.5   39.4   39.9   Gb/s

P=2
        4K     64K    128K
Before  10.5   32.8   33.2   Gb/s
After   17.8   47.7   48.5   Gb/s

P=4
        4K     64K    128K
Before  12.7   33.6   34.2   Gb/s
After   16.9   48.1   50.5   Gb/s

Co-developed-by: Marco Pinna 
Signed-off-by: Marco Pinna 
Signed-off-by: Luigi Leonardi 
---
 net/vmw_vsock/virtio_transport.c | 38 ++
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index c4205c22f40b..d75727fdc35f 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -208,6 +208,29 @@ virtio_transport_send_pkt_work(struct work_struct *work)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
 }
 
+/* Caller needs to hold RCU for vsock.
+ * Returns 0 if the packet is successfully put on the vq.
+ */
+static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struct sk_buff *skb)
+{
+   struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
+   int ret;
+
+   /* Inside RCU, can't sleep! */
+   ret = mutex_trylock(&vsock->tx_lock);
+   if (unlikely(ret == 0))
+   return -EBUSY;
+
+   ret = virtio_transport_send_skb(skb, vq, vsock);
+
+   mutex_unlock(&vsock->tx_lock);
+
+   /* Kick if virtio_transport_send_skb succeeded */
+   if (ret == 0)
+   virtqueue_kick(vq);
+   return ret;
+}
+
 static int
 virtio_transport_send_pkt(struct sk_buff *skb)
 {
@@ -231,11 +254,18 @@ virtio_transport_send_pkt(struct sk_buff *skb)
goto out_rcu;
}
 
-   if (virtio_vsock_skb_reply(skb))
-   atomic_inc(&vsock->queued_replies);
+   /* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.
+* Just put it on the virtqueue using virtio_transport_send_skb_fast_path.
+*/
 
-   virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-   queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+   if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
+   virtio_transport_send_skb_fast_path(vsock, skb)) {
+   /* Packet must be queued */
+   if (virtio_vsock_skb_reply(skb))
+   atomic_inc(&vsock->queued_replies);
+   virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+   queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+   }
 
 out_rcu:
rcu_read_unlock();

-- 
2.45.2





[PATCH net-next v3 1/2] vsock/virtio: refactor virtio_transport_send_pkt_work

2024-07-11 Thread Luigi Leonardi via B4 Relay
From: Marco Pinna 

Preliminary patch to introduce an optimization to the
enqueue system.

All the code used to enqueue a packet into the virtqueue
is removed from virtio_transport_send_pkt_work()
and moved to the new virtio_transport_send_skb() function.

Co-developed-by: Luigi Leonardi 
Signed-off-by: Luigi Leonardi 
Signed-off-by: Marco Pinna 
---
 net/vmw_vsock/virtio_transport.c | 105 ++-
 1 file changed, 59 insertions(+), 46 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 43d405298857..c4205c22f40b 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -94,6 +94,63 @@ static u32 virtio_transport_get_local_cid(void)
return ret;
 }
 
+/* Caller needs to hold vsock->tx_lock on vq */
+static int virtio_transport_send_skb(struct sk_buff *skb, struct virtqueue *vq,
+struct virtio_vsock *vsock)
+{
+   int ret, in_sg = 0, out_sg = 0;
+   struct scatterlist **sgs;
+
+   sgs = vsock->out_sgs;
+   sg_init_one(sgs[out_sg], virtio_vsock_hdr(skb),
+   sizeof(*virtio_vsock_hdr(skb)));
+   out_sg++;
+
+   if (!skb_is_nonlinear(skb)) {
+   if (skb->len > 0) {
+   sg_init_one(sgs[out_sg], skb->data, skb->len);
+   out_sg++;
+   }
+   } else {
+   struct skb_shared_info *si;
+   int i;
+
+   /* If skb is nonlinear, then its buffer must contain
+* only header and nothing more. Data is stored in
+* the fragged part.
+*/
+   WARN_ON_ONCE(skb_headroom(skb) != sizeof(*virtio_vsock_hdr(skb)));
+
+   si = skb_shinfo(skb);
+
+   for (i = 0; i < si->nr_frags; i++) {
+   skb_frag_t *skb_frag = &si->frags[i];
+   void *va;
+
+   /* We will use 'page_to_virt()' for the userspace page
+* here, because virtio or dma-mapping layers will call
+* 'virt_to_phys()' later to fill the buffer descriptor.
+* We don't touch memory at "virtual" address of this page.
+*/
+   va = page_to_virt(skb_frag_page(skb_frag));
+   sg_init_one(sgs[out_sg],
+   va + skb_frag_off(skb_frag),
+   skb_frag_size(skb_frag));
+   out_sg++;
+   }
+   }
+
+   ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
+   /* Usually this means that there is no more space available in
+* the vq
+*/
+   if (ret < 0)
+   return ret;
+
+   virtio_transport_deliver_tap_pkt(skb);
+   return 0;
+}
+
 static void
 virtio_transport_send_pkt_work(struct work_struct *work)
 {
@@ -111,66 +168,22 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];
 
for (;;) {
-   int ret, in_sg = 0, out_sg = 0;
-   struct scatterlist **sgs;
struct sk_buff *skb;
bool reply;
+   int ret;
 
skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
if (!skb)
break;
 
reply = virtio_vsock_skb_reply(skb);
-   sgs = vsock->out_sgs;
-   sg_init_one(sgs[out_sg], virtio_vsock_hdr(skb),
-   sizeof(*virtio_vsock_hdr(skb)));
-   out_sg++;
-
-   if (!skb_is_nonlinear(skb)) {
-   if (skb->len > 0) {
-   sg_init_one(sgs[out_sg], skb->data, skb->len);
-   out_sg++;
-   }
-   } else {
-   struct skb_shared_info *si;
-   int i;
-
-   /* If skb is nonlinear, then its buffer must contain
-* only header and nothing more. Data is stored in
-* the fragged part.
-*/
-   WARN_ON_ONCE(skb_headroom(skb) != sizeof(*virtio_vsock_hdr(skb)));
-
-   si = skb_shinfo(skb);
 
-   for (i = 0; i < si->nr_frags; i++) {
-   skb_frag_t *skb_frag = &si->frags[i];
-   void *va;
-
-   /* We will use 'page_to_virt()' for the userspace page
-* here, because virtio or dma-mapping layers will call
-* 'virt_to_phys()' later to fill the buffer descriptor.
-* We don't touch memory at "virtual" address of this page.
-*/
-  

Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer

2024-07-11 Thread Peter Zijlstra
On Thu, Jul 11, 2024 at 03:19:19PM +0200, Oleg Nesterov wrote:
> Not sure I read this patch correctly, but at first glance it looks
> suspicious..
> 
> On 07/11, Peter Zijlstra wrote:
> >
> > +static void return_instance_timer(struct timer_list *timer)
> > +{
> > +   struct uprobe_task *utask = container_of(timer, struct uprobe_task, 
> > ri_timer);
> > +   task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
> > +}
> 
> What if utask->task sleeps in TASK_STOPPED/TASK_TRACED state before
> return from the ret-probed function?
> 
> In this case it won't react to TWA_SIGNAL until debugger or SIGCONT
> wakes it up.

Or FROZEN etc.. Yeah.

> ---
> And it seems that even task_work_add() itself is not safe...
> 
> Suppose we have 2 ret-probed functions
> 
>   void f2() { ... }
>   void f1() { ...; f2(); }
> 
> A task T calls f1(), hits the bp, and calls prepare_uretprobe() which does
> 
>   mod_timer(&utask->ri_timer, jiffies + HZ);
> 
> Then later it calls f2() and the pending timer expires after it enters the
> kernel, but before the next prepare_uretprobe() -> mod_timer().
> 
> In this case ri_task_work is already queued and the timer is pending again.

You're saying we can hit a double enqueue, right? Yeah, that's a
problem. But that can be fairly easily rectified.

> Now. Even if T goes to the exit_to_user_mode_loop() path immediately, in
> theory nothing can guarantee that it will call get_signal/task_work_run
> in less than 1 second, it can be preempted.
> 
> But T can sleep in xol_take_insn_slot() before return from handle_swbp(),
> and this is not so theoretical.

So the assumption is that kernel code makes forward progress. If we get
preempted, we'll get scheduled again. If the machine is so overloaded
this takes more than a second, stretching the SRCU period is the least
of your problems.

Same with sleeps, it'll get a wakeup.

The only thing that is out of our control is userspace. And yes, I had
not considered STOPPED/TRACED/FROZEN.

So the reason I did that task_work is because the return_instance list
is strictly current, so a random timer cannot safely poke at it. And
barring those pesky states, it does as desired.

Let me ponder that a little, I *can* make it work, but all 'solutions'
I've come up with so far are really rather vile.



[ANNOUNCE] 5.10.221-rt113

2024-07-11 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.221-rt113 stable release.

This release is just an update to the new stable 5.10.221
version and no RT specific changes have been made.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: ea80a8cd61dcdda428f61d90b3ace2c4d7682d2c

Or to build 5.10.221-rt113 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.221.xz

  
  https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.221-rt113.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH 2/3] uprobes: simplify error handling for alloc_uprobe()

2024-07-11 Thread Google
On Wed, 10 Jul 2024 18:31:11 +0200
Oleg Nesterov  wrote:

> From: Andrii Nakryiko 
> 
> Return -ENOMEM instead of NULL, which makes caller's error handling just
> a touch simpler.
> 
> Signed-off-by: Andrii Nakryiko 
> Signed-off-by: Oleg Nesterov 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

> ---
>  kernel/events/uprobes.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 4950decebe5c..e5b7c6239970 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, 
> loff_t offset,
>  
>   uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
>   if (!uprobe)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
>  
>   uprobe->inode = inode;
>   uprobe->offset = offset;
> @@ -1166,8 +1166,6 @@ int uprobe_register(struct inode *inode, loff_t offset, 
> loff_t ref_ctr_offset,
>  
>   retry:
>   uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> - if (!uprobe)
> - return -ENOMEM;
>   if (IS_ERR(uprobe))
>   return PTR_ERR(uprobe);
>  
> -- 
> 2.25.1.362.g51ebf55
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer

2024-07-11 Thread Peter Zijlstra
On Thu, Jul 11, 2024 at 05:00:54PM +0200, Peter Zijlstra wrote:

> Let me ponder that a little, I *can* make it work, but all 'solutions'
> I've come up with so far are really rather vile.

This is the least horrible solution I could come up with...

---
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -83,6 +83,7 @@ struct uprobe_task {
 
struct timer_list   ri_timer;
struct callback_headri_task_work;
+   boolri_work_pending;
struct task_struct  *task;
 };
 
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1797,9 +1797,8 @@ void uprobe_free_utask(struct task_struc
t->utask = NULL;
 }
 
-static void return_instance_task_work(struct callback_head *head)
+static void __return_instance_put(struct uprobe_task *utask)
 {
-   struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
struct return_instance *ri;
 
for (ri = utask->return_instances; ri; ri = ri->next) {
@@ -1815,9 +1814,43 @@ static void return_instance_task_work(st
}
 }
 
+static void return_instance_task_work(struct callback_head *head)
+{
+   struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
+   utask->ri_work_pending = false;
+   __return_instance_put(utask);
+}
+
+static int return_instance_blocked(struct task_struct *p, void *arg)
+{
+   unsigned int state = READ_ONCE(p->__state);
+
+   if (state == TASK_RUNNING || state == TASK_WAKING)
+   return 0;
+
+   smp_rmb();
+   if (p->on_rq)
+   return 0;
+
+   /*
+* Per __task_needs_rq_locked() we now have: !p->on_cpu and only hold
+* p->pi_lock, and can consider the task fully blocked.
+*/
+
+   __return_instance_put(p->utask);
+   return 1;
+}
+
 static void return_instance_timer(struct timer_list *timer)
 {
	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
+   if (utask->ri_work_pending)
+   return;
+
+   if (task_call_func(utask->task, return_instance_blocked, NULL))
+   return;
+
+   utask->ri_work_pending = true;
task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
 }
 



Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer

2024-07-11 Thread Peter Zijlstra
On Thu, Jul 11, 2024 at 05:55:42PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2024 at 05:00:54PM +0200, Peter Zijlstra wrote:
> 
> > Let me ponder that a little, I *can* make it work, but all 'solutions'
> > I've come up with so far are really rather vile.
> 
> This is the least horrible solution I could come up with...
> 
> ---
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -83,6 +83,7 @@ struct uprobe_task {
>  
>   struct timer_list   ri_timer;
>   struct callback_headri_task_work;
> + boolri_work_pending;
>   struct task_struct  *task;
>  };
>  
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1797,9 +1797,8 @@ void uprobe_free_utask(struct task_struc
>   t->utask = NULL;
>  }
>  
> -static void return_instance_task_work(struct callback_head *head)
> +static void __return_instance_put(struct uprobe_task *utask)
>  {
> - struct uprobe_task *utask = container_of(head, struct uprobe_task, 
> ri_task_work);
>   struct return_instance *ri;
>  
>   for (ri = utask->return_instances; ri; ri = ri->next) {
> @@ -1815,9 +1814,43 @@ static void return_instance_task_work(st
>   }
>  }
>  
> +static void return_instance_task_work(struct callback_head *head)
> +{
> + struct uprobe_task *utask = container_of(head, struct uprobe_task, 
> ri_task_work);
> + utask->ri_work_pending = false;
> + __return_instance_put(utask);
> +}
> +
> +static int return_instance_blocked(struct task_struct *p, void *arg)
> +{
> + unsigned int state = READ_ONCE(p->__state);
> +
> + if (state == TASK_RUNNING || state == TASK_WAKING)
> + return 0;
> +
> + smp_rmb();
> + if (p->on_rq)
> + return 0;
> +
> + /*
> +  * Per __task_needs_rq_locked() we now have: !p->on_cpu and only hold
> +  * p->pi_lock, and can consiter the task fully blocked.
> +  */
> +
> + __return_instance_put(p->utask);

PREEMPT_RT might not like this though, doing the full RI iteration under
a raw_spinlock_t...

I just can't think of anything saner just now. Oh well, let me go make
dinner, perhaps something will come to me.



Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU

2024-07-11 Thread Oleg Nesterov
I'll try to actually apply the whole series and read the code tomorrow.
Right now I can't understand this change... Just one question for now.

On 07/11, Peter Zijlstra wrote:
>
> @@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct upr
>* attack from user-space.
>*/
>   uprobe_warn(current, "handle tail call");
> - goto err_uprobe;
> + goto err_mem;
>   }
>   orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
>   }
>
> + ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
> + ri->uprobe = uprobe;

It seems that, if we race with _unregister, this __srcu_read_lock()
can happen after call_srcu(uprobes_srcu, uprobe, uprobe_free_stage1)
was already called...

In this case read_lock "has no effect" in that uprobe_free_stage1()
can run before free_ret_instance() does srcu_read_unlock(ri->srcu_idx).

Perhaps it is fine, uprobe_free_stage1() does another call_srcu(),
but somehow I got lost.

Could you re-check this logic? Most probably I missed something, but still...

Oleg.




Re: [PATCH 3/3] uprobes: make uprobe_register() return struct uprobe *

2024-07-11 Thread Andrii Nakryiko
On Thu, Jul 11, 2024 at 2:28 AM Oleg Nesterov  wrote:
>
> On 07/10, Oleg Nesterov wrote:
> >
> > -void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> > uprobe_consumer *uc)
> > +void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> >  {
> > - struct uprobe *uprobe;
> > -
> > - uprobe = find_uprobe(inode, offset);
> > - if (WARN_ON(!uprobe))
> > - return;
> > -
> >   down_write(&uprobe->register_rwsem);
> >   __uprobe_unregister(uprobe, uc);
> >   up_write(&uprobe->register_rwsem);
> > - put_uprobe(uprobe);
>
> OK, this is obviously wrong, needs get_uprobe/put_uprobe. 
> __uprobe_unregister()
> can free this uprobe, so up_write(&uprobe->register_rwsem) is not safe.

uprobe_register(), given it returns an uprobe instance to the caller
should keep refcount on it (it belongs to uprobe_consumer). That's
what I did for my patches, are you going to do that as well?

We basically do the same thing, just interfaces look a bit different.


>
> I'll send V2 on top of Peter's new version.
>
> Oleg.
>



Re: [PATCH] lib: test_objpool: add missing MODULE_DESCRIPTION() macro

2024-07-11 Thread Jeff Johnson

On 6/2/24 23:45, Masami Hiramatsu (Google) wrote:

On Mon, 3 Jun 2024 11:25:59 +0800
"wuqiang.matt"  wrote:


On 2024/6/1 08:31, Jeff Johnson wrote:

make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_objpool.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson 
---
   lib/test_objpool.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/test_objpool.c b/lib/test_objpool.c
index bfdb81599832..5a3f6961a70f 100644
--- a/lib/test_objpool.c
+++ b/lib/test_objpool.c
@@ -687,4 +687,5 @@ static void __exit ot_mod_exit(void)
   module_init(ot_mod_init);
   module_exit(ot_mod_exit);
   
-MODULE_LICENSE("GPL");

\ No newline at end of file
+MODULE_DESCRIPTION("Test module for lockless object pool");
+MODULE_LICENSE("GPL");

---
base-commit: b050496579632f86ee1ef7e7501906db579f3457
change-id: 20240531-md-lib-test_objpool-338d937f8666



Looks good to me. Thanks for the update.

I added Masami Hiramatsu and linux-trace in the loop.

Reviewed-by: Matt Wu 


Thanks, let me pick this to probes/for-next branch.

Following up since I don't see this in linux-next.
I'm hoping to have these warnings fixed tree-wide in 6.11.

/jeff




Re: [PATCH 3/3] uprobes: make uprobe_register() return struct uprobe *

2024-07-11 Thread Oleg Nesterov
On 07/11, Andrii Nakryiko wrote:
>
> On Thu, Jul 11, 2024 at 2:28 AM Oleg Nesterov  wrote:
> >
> > On 07/10, Oleg Nesterov wrote:
> > >
> > > -void uprobe_unregister(struct inode *inode, loff_t offset, struct 
> > > uprobe_consumer *uc)
> > > +void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > >  {
> > > - struct uprobe *uprobe;
> > > -
> > > - uprobe = find_uprobe(inode, offset);
> > > - if (WARN_ON(!uprobe))
> > > - return;
> > > -
> > >   down_write(&uprobe->register_rwsem);
> > >   __uprobe_unregister(uprobe, uc);
> > >   up_write(&uprobe->register_rwsem);
> > > - put_uprobe(uprobe);
> >
> > OK, this is obviously wrong, needs get_uprobe/put_uprobe. 
> > __uprobe_unregister()
> > can free this uprobe, so up_write(&uprobe->register_rwsem) is not safe.
>
> uprobe_register(), given it returns an uprobe instance to the caller
> should keep refcount on it (it belongs to uprobe_consumer).

Of course. And again, this patch doesn't change the current behaviour.

> That's
> what I did for my patches, are you going to do that as well?
>
> We basically do the same thing, just interfaces look a bit different.

Not sure. Well I do not really know, I didn't read your series to the
end, sorry ;) The same for V1/V2 from Peter so far.

But let me say this just in case... With or without this change,
currently uprobe_consumer doesn't have an "individual" ref to uprobe.
The fact that uprobe->consumers != NULL adds a reference.

Let's not discuss if this is good or bad right now; this cleanup is
only cleanup.


Now, let me add another "just in case" note to explain what I am going
to do in V2.

So. this patch should turn uprobe_unregister() into something like

void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer 
*uc)
{
		// Ugly, please kill me!!!
get_uprobe(uprobe);
down_write(&uprobe->register_rwsem);
__uprobe_unregister(uprobe, uc);
up_write(&uprobe->register_rwsem);
put_uprobe(uprobe);
}

to simplify this change. And the next (simple) patch will kill these
get_uprobe + put_uprobe, we just need to shift the (possibly) final
put_uprobe() from delete_uprobe() to unregister().

But of course, I will recheck before I send V2.

Oleg.




Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU

2024-07-11 Thread Peter Zijlstra
On Thu, Jul 11, 2024 at 06:06:53PM +0200, Oleg Nesterov wrote:
> I'll try to actually apply the whole series and read the code tomorrow.
> Right now I can't understand this change... Just one question for now.
> 
> On 07/11, Peter Zijlstra wrote:
> >
> > @@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct upr
> >  * attack from user-space.
> >  */
> > uprobe_warn(current, "handle tail call");
> > -   goto err_uprobe;
> > +   goto err_mem;
> > }
> > orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
> > }
> >
> > +   ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
> > +   ri->uprobe = uprobe;
> 
> It seems that, if we race with _unregister, this __srcu_read_lock()
> can happen after call_srcu(uprobes_srcu, uprobe, uprobe_free_stage1)
> was already called...
> 
> In this case read_lock "has no effect" in that uprobe_free_stage1()
> can run before free_ret_instance() does srcu_read_unlock(ri->srcu_idx).
> 
> Perhaps it is fine, uprobe_free_stage1() does another call_srcu(),
> but somehow I got lost.
> 
> Could you re-check this logic? Most probably I missed something, but still...


  handle_swbp()
guard(srcu)(&uprobes_srcu);
handle_chain();
  prepare_uretprobe()
__srcu_read_lock(&uretprobe_srcu);


vs

  uprobe_free_stage2
kfree(uprobe)

  uprobe_free_stage1
call_srcu(&uretprobe_srcu, &uprobe->rcu, uprobe_free_stage2);

  put_uprobe()
if (refcount_dec_and_test)
  call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);
  

So my thinking was since we take uretprobe_srcu *inside* uprobe_srcu,
this reference must be visible before we execute stage1, and as such
stage2 cannot complete prematurely.




Re: [PATCH] virtio: add missing MODULE_DESCRIPTION() macro

2024-07-11 Thread Jeff Johnson

On 6/23/24 10:36, Jeff Johnson wrote:

On 6/2/2024 1:25 PM, Jeff Johnson wrote:

make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in 
drivers/virtio/virtio_dma_buf.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson 
---
  drivers/virtio/virtio_dma_buf.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/virtio/virtio_dma_buf.c b/drivers/virtio/virtio_dma_buf.c
index 2521a75009c3..3034a2f605c8 100644
--- a/drivers/virtio/virtio_dma_buf.c
+++ b/drivers/virtio/virtio_dma_buf.c
@@ -85,5 +85,6 @@ int virtio_dma_buf_get_uuid(struct dma_buf *dma_buf,
  }
  EXPORT_SYMBOL(virtio_dma_buf_get_uuid);
  
+MODULE_DESCRIPTION("dma-bufs for virtio exported objects");

  MODULE_LICENSE("GPL");
  MODULE_IMPORT_NS(DMA_BUF);

---
base-commit: 83814698cf48ce3aadc5d88a3f577f04482ff92a
change-id: 20240602-md-virtio_dma_buf-b3552ca6c5d5



Following up to see if anything else is needed from me.
Hoping to see this in linux-next :)


I still don't see this in linux-next so following up to see if anything 
else is needed to get this merged. Adding Greg KH since he's signed off 
on this file before and he's taken quite a few of my cleanups through 
his trees.


I'm hoping to have all of these warnings fixed tree-wide in 6.11.

/jeff




Re: [PATCH] virtio: add missing MODULE_DESCRIPTION() macro

2024-07-11 Thread Michael S. Tsirkin
On Thu, Jul 11, 2024 at 11:43:18AM -0700, Jeff Johnson wrote:
> On 6/23/24 10:36, Jeff Johnson wrote:
> > On 6/2/2024 1:25 PM, Jeff Johnson wrote:
> > > make allmodconfig && make W=1 C=1 reports:
> > > WARNING: modpost: missing MODULE_DESCRIPTION() in 
> > > drivers/virtio/virtio_dma_buf.o
> > > 
> > > Add the missing invocation of the MODULE_DESCRIPTION() macro.
> > > 
> > > Signed-off-by: Jeff Johnson 
> > > ---
> > >   drivers/virtio/virtio_dma_buf.c | 1 +
> > >   1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/drivers/virtio/virtio_dma_buf.c 
> > > b/drivers/virtio/virtio_dma_buf.c
> > > index 2521a75009c3..3034a2f605c8 100644
> > > --- a/drivers/virtio/virtio_dma_buf.c
> > > +++ b/drivers/virtio/virtio_dma_buf.c
> > > @@ -85,5 +85,6 @@ int virtio_dma_buf_get_uuid(struct dma_buf *dma_buf,
> > >   }
> > >   EXPORT_SYMBOL(virtio_dma_buf_get_uuid);
> > > +MODULE_DESCRIPTION("dma-bufs for virtio exported objects");
> > >   MODULE_LICENSE("GPL");
> > >   MODULE_IMPORT_NS(DMA_BUF);
> > > 
> > > ---
> > > base-commit: 83814698cf48ce3aadc5d88a3f577f04482ff92a
> > > change-id: 20240602-md-virtio_dma_buf-b3552ca6c5d5
> > > 
> > 
> > Following up to see if anything else is needed from me.
> > Hoping to see this in linux-next :)
> 
> I still don't see this in linux-next so following up to see if anything else
> is needed to get this merged. Adding Greg KH since he's signed off on this
> file before and he's taken quite a few of my cleanups through his trees.
> 
> I'm hoping to have all of these warnings fixed tree-wide in 6.11.
> 
> /jeff

Not sure why I tag it and it gets cleared again.
Tagged again; hope it holds now.

-- 
MST




[PATCH 1/2] MAINTAINERS: i2c-virtio: Drop Conghui Chen from Maintainers

2024-07-11 Thread Andi Shyti
E-mails to Conghui Chen have bounced back:

  : host mgamail.eglb.intel.com[198.175.65.14] said: 550
  #5.1.0 Address rejected. (in reply to RCPT TO command)

Remove him as maintainer of the Virtio I2C driver in the
MAINTAINERS file.

Signed-off-by: Andi Shyti 
Cc: Viresh Kumar 
Cc: Wolfram Sang 
Cc: linux-...@vger.kernel.org
Cc: virtualizat...@lists.linux.dev
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3e26555e52bfd..96745f7971100 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23859,7 +23859,6 @@ S:  Maintained
 F: drivers/vhost/scsi.c
 
 VIRTIO I2C DRIVER
-M: Conghui Chen 
 M: Viresh Kumar 
 L: linux-...@vger.kernel.org
 L: virtualizat...@lists.linux.dev
-- 
2.45.2




Re: [PATCH v2] bootconfig: Remove duplicate included header file linux/bootconfig.h

2024-07-11 Thread Google
On Thu, 11 Jul 2024 10:43:16 +0200
Thorsten Blum  wrote:

> The header file linux/bootconfig.h is included whether __KERNEL__ is
> defined or not.
> 
> Include it only once before the #ifdef/#else/#endif preprocessor
> directives and remove the following make includecheck warning:
> 
>   linux/bootconfig.h is included more than once
> 
> Move the comment to the top and delete the now empty #else block.
> 
> Signed-off-by: Thorsten Blum 

Thanks, looks good to me. Let me pick it up.

> ---
> Changes in v2:
> - Move comment and delete #else as suggested by Masami Hiramatsu
> - Link to v1: 
> https://lore.kernel.org/linux-kernel/20240711002152.800028-2-thorsten.b...@toblux.com/
> ---
>  lib/bootconfig.c | 20 +---
>  1 file changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/lib/bootconfig.c b/lib/bootconfig.c
> index 97f8911ea339..81f29c29f47b 100644
> --- a/lib/bootconfig.c
> +++ b/lib/bootconfig.c
> @@ -4,8 +4,16 @@
>   * Masami Hiramatsu 
>   */
>  
> -#ifdef __KERNEL__
> +/*
> + * NOTE: This is only for tools/bootconfig, because tools/bootconfig will
> + * run the parser sanity test.
> + * This does NOT mean lib/bootconfig.c is available in the user space.
> + * However, if you change this file, please make sure the tools/bootconfig
> + * has no issue on building and running.
> + */
>  #include 
> +
> +#ifdef __KERNEL__
>  #include 
>  #include 
>  #include 
> @@ -24,16 +32,6 @@ const char * __init xbc_get_embedded_bootconfig(size_t 
> *size)
>   return (*size) ? embedded_bootconfig_data : NULL;
>  }
>  #endif
> -
> -#else /* !__KERNEL__ */
> -/*
> - * NOTE: This is only for tools/bootconfig, because tools/bootconfig will
> - * run the parser sanity test.
> - * This does NOT mean lib/bootconfig.c is available in the user space.
> - * However, if you change this file, please make sure the tools/bootconfig
> - * has no issue on building and running.
> - */
> -#include 
>  #endif
>  
>  /*
> -- 
> 2.39.2
> 


-- 
Masami Hiramatsu (Google) 



[PATCH 0/2] Cleanup the MAINTAINER's file

2024-07-11 Thread Andi Shyti
Hi,

while reviewing Wolfram's series, I received some delivery
failure notifications for e-mail addresses that don't exist anymore.

With this series I'm removing:

 - Conghui Chen 
 - Thor Thayer 

unfortunately both from Intel :-(

In the case of Altera's subsystems (except for the i2c one), I didn't
really know what to do with them, so I marked them as
Orphan.

Andi

Cc: Andy Shevchenko 
Cc: Krzysztof Kozlowski 
Cc: Lee Jones 
Cc: Linus Walleij 
Cc: Philipp Zabel 
Cc: Viresh Kumar 
Cc: Wolfram Sang 
Cc: linux-g...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: virtualizat...@lists.linux.dev

Andi Shyti (2):
  MAINTAINERS: i2c-virtio: Drop Conghui Chen from Maintainers
  MAINTAINERS: Drop Thor Thayer from maintainers

 MAINTAINERS | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

-- 
2.45.2




[PATCH] MAINTAINERS: Add uprobes entry

2024-07-11 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Add a uprobes entry to MAINTAINERS to clarify who its maintainers are.

Suggested-by: Peter Zijlstra 
Signed-off-by: Masami Hiramatsu (Google) 
---
 MAINTAINERS |   13 +
 1 file changed, 13 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index da5352dbd4f3..ae731fa2328c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23105,6 +23105,19 @@ F: drivers/mtd/ubi/
 F: include/linux/mtd/ubi.h
 F: include/uapi/mtd/ubi-user.h
 
+UPROBES
+M: Masami Hiramatsu 
+M: Oleg Nesterov 
+M: Peter Zijlstra 
+L: linux-kernel@vger.kernel.org
+L: linux-trace-ker...@vger.kernel.org
+S: Maintained
+F: arch/*/include/asm/uprobes.h
+F: arch/*/kernel/probes/uprobes.c
+F: arch/*/kernel/uprobes.c
+F: include/linux/uprobes.h
+F: kernel/events/uprobes.c
+
 USB "USBNET" DRIVER FRAMEWORK
 M: Oliver Neukum 
 L: net...@vger.kernel.org




Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes

2024-07-11 Thread Andrii Nakryiko
On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra  wrote:
>
> Hi!
>
> These patches implement the (S)RCU based proposal to optimize uprobes.
>
> On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func' in a
> tight loop:
>
>   perf probe -x ./uprobes test=func
>   perf stat -ae probe_uprobe:test  -- sleep 1
>
>   perf probe -x ./uprobes test=func%return
>   perf stat -ae probe_uprobe:test__return -- sleep 1
>
> PRE:
>
>   4,038,804  probe_uprobe:test
>   2,356,275  probe_uprobe:test__return
>
> POST:
>
>   7,216,579  probe_uprobe:test
>   6,744,786  probe_uprobe:test__return
>
> (copy-paste FTW, I didn't do new numbers because the fast paths didn't
>  change -- and a quick test run shows similar numbers)
>
> Patches also available here:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes
>
>
> Changes since last time:
>  - better split with intermediate inc_not_zero()
>  - fix UPROBE_HANDLER_REMOVE
>  - restored the lost rcu_assign_pointer()
>  - avoid lockdep for uretprobe_srcu
>  - add missing put_uprobe() -> srcu_read_unlock() conversion
>  - actually initialize return_instance::has_ref
>  - a few comments
>  - things I don't remember
>
>

Hey Peter!

Thanks for the v2, I plan to look at it more thoroughly tomorrow. But
meanwhile I spent a good chunk of today writing an uprobes
stress-test, so we can validate that we are not regressing anything
(yes, I don't trust lockless code and people in general ;)

Anyways, if you'd like to use it, it's at [0]. All you should need to
build and run it is:

  $ cd examples/c
  $ make -j$(nproc) uprobe-stress
  $ sudo ./uprobe-stress -tN -aM -mP -fR


N, M, P, and R are the numbers of threads dedicated to each of the four
functions of the stress test: triggering user space functions (N),
attaching/detaching various random subsets of uprobes (M), mmap()ing
parts of the executable with uprobes (P), and forking the process and
triggering uprobes for a little bit (R). The idea is to test various
timings and interleavings of uprobe-related logic.

You should only need a not-too-old Clang to build everything (Clang 12+
should work, I believe). But do let me know if you run into trouble.

I did run this stress test for a little while on current
bpf-next/master with no issues detected (yay!).

But then I also ran it on Linux built from perf/uprobes branch (these
patches), and after a few seconds I see that there is no more
attachment/detachment happening. Eventually I got splats, which you
can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3` command
to run it inside my QEMU image.

So there is still something off, hopefully this will help to debug and
hammer out any remaining kinks. Thanks!

  [0] 
https://github.com/libbpf/libbpf-bootstrap/commit/2f88cef90f9728ec8c7bee7bd48fdbcf197806c3
  [1] https://gist.github.com/anakryiko/f761690addf7aa5f08caec95fda9ef1a



Re: [PATCH 0/2] Cleanup the MAINTAINER's file

2024-07-11 Thread Wolfram Sang
On Fri, Jul 12, 2024 at 01:19:24AM +0200, Andi Shyti wrote:
> Hi,
> 
> while reviewing Wolfram's series, I received some delivery
> failure notifications for e-mails that don't exist anymore.
> 
> With this series I'm removing:
> 
>  - Conghui Chen 
>  - Thor Thayer 

Fixes for these two are already in my for-current branch. (And the
patches were on the i2c list, Andi ;))


