Re: [PATCH] test/vsock: add install target
CCing Stefan.

On Wed, Jul 10, 2024 at 07:00:59PM GMT, Jakub Kicinski wrote:
> On Wed, 10 Jul 2024 13:58:39 +0200 Stefano Garzarella wrote:
> > There is a comment there:
> >     # Avoid changing the rest of the logic here and lib.mk.
> > Added by commit 17eac6c2db8b2cdfe33d40229bdda2acd86b304a.
> > IIUC they re-used INSTALL_PATH just to avoid too many changes in that
> > file and in tools/testing/selftests/lib.mk. So, IMHO, we should not
> > care about it and only use VSOCK_INSTALL_PATH if you don't want to
> > conflict with INSTALL_PATH.
>
> Any reason why vsock isn't part of selftests in the first place?

Usually the vsock tests exercise both the driver (virtio-vsock) in the
guest and the device in the host kernel (vhost-vsock), so I usually run
the tests in 2 nested VMs to test the latest changes for both the guest
and the host.

I don't know enough about selftests, but do you think it is possible to
integrate them?

CCing Stefan, who is the original author and may remember more about the
reasons for this choice.

Thanks,
Stefano
Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support
On 10.07.24 18:01, David Woodhouse wrote:
> On Wed, 2024-07-10 at 15:07 +0200, Peter Hilber wrote:
>> On 08.07.24 11:27, David Woodhouse wrote:
>>> From: David Woodhouse
>>>
>>> The vmclock "device" provides a shared memory region with precision clock
>>> information. By using shared memory, it is safe across Live Migration.
>>>
>>> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
>>> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
>>> actually helpful.
>>>
>>> The memory region of the device is also exposed to userspace so it can be
>>> read or memory mapped by applications which need reliable notification of
>>> clock disruptions.
>>>
>>> Signed-off-by: David Woodhouse
>>
>> [...]
>>
>>> +
>>> +struct vmclock_abi {
>>> +	/* CONSTANT FIELDS */
>>> +	uint32_t magic;
>>> +#define VMCLOCK_MAGIC	0x4b4c4356 /* "VCLK" */
>>> +	uint32_t size;		/* Size of region containing this structure */
>>> +	uint16_t version;	/* 1 */
>>> +	uint8_t counter_id;	/* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */
>>> +#define VMCLOCK_COUNTER_ARM_VCNT	0
>>> +#define VMCLOCK_COUNTER_X86_TSC		1
>>> +#define VMCLOCK_COUNTER_INVALID		0xff
>>> +	uint8_t time_type;	/* Matches VIRTIO_RTC_TYPE_xxx */
>>> +#define VMCLOCK_TIME_UTC			0	/* Since 1970-01-01 00:00:00z */
>>> +#define VMCLOCK_TIME_TAI			1	/* Since 1970-01-01 00:00:00z */
>>> +#define VMCLOCK_TIME_MONOTONIC			2	/* Since undefined epoch */
>>> +#define VMCLOCK_TIME_INVALID_SMEARED		3	/* Not supported */
>>> +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED	4	/* Not supported */
>>> +
>>> +	/* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
>>> +	uint32_t seq_count;	/* Low bit means an update is in progress */
>>> +	/*
>>> +	 * This field changes to another non-repeating value when the CPU
>>> +	 * counter is disrupted, for example on live migration. This lets
>>> +	 * the guest know that it should discard any calibration it has
>>> +	 * performed of the counter against external sources (NTP/PTP/etc.).
>>> +	 */
>>> +	uint64_t disruption_marker;
>>> +	uint64_t flags;
>>> +	/* Indicates that the tai_offset_sec field is valid */
>>> +#define VMCLOCK_FLAG_TAI_OFFSET_VALID		(1 << 0)
>>> +	/*
>>> +	 * Optionally used to notify guests of pending maintenance events.
>>> +	 * A guest which provides latency-sensitive services may wish to
>>> +	 * remove itself from service if an event is coming up. Two flags
>>> +	 * indicate the approximate imminence of the event.
>>> +	 */
>>> +#define VMCLOCK_FLAG_DISRUPTION_SOON		(1 << 1) /* About a day */
>>> +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT	(1 << 2) /* About an hour */
>>> +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID	(1 << 3)
>>> +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID	(1 << 4)
>>> +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID	(1 << 5)
>>> +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID	(1 << 6)
>>> +	/*
>>> +	 * Even regardless of leap seconds, the time presented through this
>>> +	 * mechanism may not be strictly monotonic. If the counter slows down
>>> +	 * and the host adapts to this discovery, the time calculated from
>>> +	 * the value of the counter immediately after an update to this
>>> +	 * structure, may appear to be *earlier* than a calculation just
>>> +	 * before the update (while the counter was believed to be running
>>> +	 * faster than it now is). A guest operating system will typically
>>> +	 * *skew* its own system clock back towards the reference clock
>>> +	 * exposed here, rather than following this clock directly. If,
>>> +	 * however, this structure is being populated from such a system
>>> +	 * clock which is already handled in such a fashion and the results
>>> +	 * *are* guaranteed to be monotonic, such monotonicity can be
>>> +	 * advertised by setting this bit.
>>> +	 */
>>
>> I wonder if this might be difficult to define in a standard.
>
> I'm sure we could do better than my attempt above, but surely it isn't
> *so* hard to define monotonicity?
>
>> Is there a need to define device and driver behavior in more detail? What
>> would happen if e.g. the device first decides how to update the clock, but
>> is then slow to update the SHM?
>
> That's OK, isn't it?
>
> The key in the VMCLOCK_FLAG_TIME_MONOTONIC flag is that by setting it,
> the host guarantees that the time calculated according to this
> structure at any given moment, shall not appear to be later than the
> time calculated via the structure at any *later* moment.

IMHO this phrasing is better, since it directly refers to the state of the
structure.

AFAIU if there w
Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support
On Thu, 2024-07-11 at 09:25 +0200, Peter Hilber wrote:
>
> IMHO this phrasing is better, since it directly refers to the state of the
> structure.

Thanks. I'll update it.

> AFAIU if there would be abnormal delays in store buffers, causing some
> driver to still see the old clock for some time, the monotonicity could be
> violated:
>
> 1. device writes new, much slower clock to store buffer
> 2. some time passes
> 3. driver reads old, much faster clock
> 4. device writes store buffer to cache
> 5. driver reads new, much slower clock
>
> But I hope such delays do not occur.

For the case of the hypervisor←→guest interface this should be handled
by the use of memory barriers and the seqcount lock. The guest driver
reads the seqcount, performs a read memory barrier, then reads the
contents of the structure. Then performs *another* read memory barrier,
and checks the seqcount hasn't changed:

https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=drivers/ptp/ptp_vmclock.c;hb=vmclock#l351

The converse happens with write barriers on the hypervisor side:

https://git.infradead.org/?p=users/dwmw2/qemu.git;a=blob;f=hw/acpi/vmclock.c;hb=vmclock#l68

Do we need to think harder about the ordering across a real PCI bus? It
isn't entirely unreasonable for this to be implemented in hardware if we
eventually add a counter_id value for a bus-visible counter like the
Intel Always Running Timer (ART).

I'm also OK with saying that device implementations may only provide the
shared memory structure if they can ensure memory ordering.
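For illustration only (this is a minimal sketch, not the actual
ptp_vmclock.c code), the read protocol described above looks roughly like
this, assuming a guest pointer "vm" to the mapped structure; field names
follow the struct quoted earlier in the thread:

static uint64_t vmclock_read_disruption_marker(struct vmclock_abi *vm)
{
	uint32_t seq;
	uint64_t marker;

	do {
		/* Low bit set means an update is in progress; wait it out. */
		while ((seq = READ_ONCE(vm->seq_count)) & 1)
			cpu_relax();

		smp_rmb();	/* read seq_count before the payload */
		marker = vm->disruption_marker;
		smp_rmb();	/* read the payload before re-checking seq_count */
	} while (READ_ONCE(vm->seq_count) != seq);

	return marker;
}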
Re: [PATCH 0/3] uprobes: future cleanups for review
On Wed, Jul 10, 2024 at 06:30:22PM +0200, Oleg Nesterov wrote:
> On 07/10, Oleg Nesterov wrote:
> >
> > Peter, these simple cleanups should not conflict with your changes,
> > but I can resend them later if it causes any inconvenience.
>
> In fact I would like to push 2 more cleanups before the more significant
> changes, but they certainly conflict with your ongoing work, albeit only
> textually.
>
> Let me send the patches for review anyway, perhaps you can take at least
> the 1st one.
>
> 3/3 is only compile tested so far. Andrii, can you take a look?

I was going to post a new version today, but I can wait and rebase on
top of / include these 5 patches (or more, these things tend to grow).
[PATCH v2] bootconfig: Remove duplicate included header file linux/bootconfig.h
The header file linux/bootconfig.h is included whether __KERNEL__ is
defined or not. Include it only once before the #ifdef/#else/#endif
preprocessor directives and remove the following make includecheck
warning:

  linux/bootconfig.h is included more than once

Move the comment to the top and delete the now empty #else block.

Signed-off-by: Thorsten Blum
---
Changes in v2:
- Move comment and delete #else as suggested by Masami Hiramatsu
- Link to v1: https://lore.kernel.org/linux-kernel/20240711002152.800028-2-thorsten.b...@toblux.com/
---
 lib/bootconfig.c | 20 +++++++++---------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 97f8911ea339..81f29c29f47b 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -4,8 +4,16 @@
  * Masami Hiramatsu
  */

-#ifdef __KERNEL__
+/*
+ * NOTE: This is only for tools/bootconfig, because tools/bootconfig will
+ * run the parser sanity test.
+ * This does NOT mean lib/bootconfig.c is available in the user space.
+ * However, if you change this file, please make sure the tools/bootconfig
+ * has no issue on building and running.
+ */
 #include <linux/bootconfig.h>
+
+#ifdef __KERNEL__
 #include
 #include
 #include
@@ -24,16 +32,6 @@ const char * __init xbc_get_embedded_bootconfig(size_t *size)
 	return (*size) ? embedded_bootconfig_data : NULL;
 }
 #endif
-
-#else /* !__KERNEL__ */
-/*
- * NOTE: This is only for tools/bootconfig, because tools/bootconfig will
- * run the parser sanity test.
- * This does NOT mean lib/bootconfig.c is available in the user space.
- * However, if you change this file, please make sure the tools/bootconfig
- * has no issue on building and running.
- */
-#include <linux/bootconfig.h>
 #endif

 /*
--
2.39.2
Re: [PATCH 0/3] uprobes: future cleanups for review
On 07/11, Peter Zijlstra wrote:
>
> On Wed, Jul 10, 2024 at 06:30:22PM +0200, Oleg Nesterov wrote:
> >
> > In fact I would like to push 2 more cleanups before the more significant
> > changes, but they certainly conflict with your ongoing work, albeit only
> > textually.
> >
> > Let me send the patches for review anyway, perhaps you can take at least
> > the 1st one.
> >
> > 3/3 is only compile tested so far. Andrii, can you take a look?
>
> I was going to post a new version today, but I can wait and rebase on

No, I do not want to delay your changes, please send the new version,
I'll rebase these cleanups on top of it.

Oleg.
Re: [PATCH v3 0/2] vdpa: support set mac address from vdpa tool
Sorry for the noise, resending the email in text format.

Hi All,

My answers inline below.

>> Any specific reason to pre-create those large number of vdpa devices of the pool?
>> I was hoping to create vdpa device with needed attributes, when spawning a
>> kubevirt instance.
>> K8s DRA infrastructure [1] can be used to create the needed vdpa device.
>> Have you considered using the DRA of [1]?

The vhost-vdpa devices are created in the host before spawning the
kubevirt VM. This is achieved by using:
- sriov-network-operator: load kernel drivers, create vdpa devices (with
  MAC address), etc.
- sriov-device-plugin: create a pool of resources (vdpa devices in this
  case), advertise devices to k8s, allocate devices during pod creation
  (by the way, isn't this mechanism very similar to DRA?)

Then we create the kubevirt VM by defining an interface with the
following attributes:
- type: vdpa
- mac
- source: vhost-vdpa path

So the issue is: how to make sure the mac in the VM is the same mac as
the vdpa device? Two options:
- ensure the kubevirt interface mac is equal to the vdpa mac: this is
  not possible because of the device plugin resource pool. You can have
  a few devices in the pool and the device plugin picks one randomly.
- change the vdpa mac address at a later stage, to make sure it is
  aligned with the kubevirt interface mac. I don't know if there is
  already specific code in kubevirt to do that or if it needs to be
  implemented.

Hope this helps to clarify.

Thanks,
Leonardo

On Wed, Jul 10, 2024 at 10:46 AM Cindy Lu wrote:
>
> On Wed, 10 Jul 2024 at 14:10, Michael S. Tsirkin wrote:
> >
> > On Wed, Jul 10, 2024 at 11:05:48AM +0800, Jason Wang wrote:
> > > On Tue, Jul 9, 2024 at 8:42 PM Michael S. Tsirkin wrote:
> > > >
> > > > On Tue, Jul 09, 2024 at 02:19:19PM +0800, Cindy Lu wrote:
> > > > > On Tue, 9 Jul 2024 at 11:59, Parav Pandit wrote:
> > > > > >
> > > > > > Hi Cindy,
> > > > > >
> > > > > > > From: Cindy Lu
> > > > > > > Sent: Monday, July 8, 2024 12:17 PM
> > > > > > >
> > > > > > > Add support for setting the MAC address using the VDPA tool.
> > > > > > > This feature will allow setting the MAC address using the VDPA tool.
> > > > > > > For example, in vdpa_sim_net, the implementation sets the MAC address to
> > > > > > > the config space. However, for other drivers, they can implement their own
> > > > > > > function, not limited to the config space.
> > > > > > >
> > > > > > > Changelog v2
> > > > > > > - Changed the function name to prevent misunderstanding
> > > > > > > - Added check for blk device
> > > > > > > - Addressed the comments
> > > > > > > Changelog v3
> > > > > > > - Split the function of the net device from vdpa_nl_cmd_dev_attr_set_doit
> > > > > > > - Add a lock for the network device's dev_set_attr operation
> > > > > > > - Address the comments
> > > > > > >
> > > > > > > Cindy Lu (2):
> > > > > > >   vdpa: support set mac address from vdpa tool
> > > > > > >   vdpa_sim_net: Add the support of set mac address
> > > > > > >
> > > > > > >  drivers/vdpa/vdpa.c                  | 81 ++++++++++++++++++++++++++++
> > > > > > >  drivers/vdpa/vdpa_sim/vdpa_sim_net.c | 19 ++++++-
> > > > > > >  include/linux/vdpa.h                 |  9 ++++
> > > > > > >  include/uapi/linux/vdpa.h            |  1 +
> > > > > > >  4 files changed, 109 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > --
> > > > > > > 2.45.0
> > > > > >
> > > > > > Mlx5 device already allows setting the mac and mtu during the vdpa
> > > > > > device creation time.
> > > > > > Once the vdpa device is created, it binds to vdpa bus and other
> > > > > > driver vhost_vdpa etc bind to it.
> > > > > > So there was no good reason in the past to support explicit config
> > > > > > after device add complicate the flow for synchronizing this.
> > > > > >
> > > > > > The user who wants a device with new attributes, as well destroy
> > > > > > and recreate the vdpa device with new desired attributes.
> > > > > >
> > > > > > vdpa_sim_net can also be extended for similar way when adding the
> > > > > > vdpa device.
> > > > > >
> > > > > > Have you considered using the existing tool and kernel in place
> > > > > > since 2021?
> > > > > > Such as commit d8ca2fa5be1.
> > > > > >
> > > > > > An example of it is,
> > > > > > $ vdpa dev add name bar mgmtdev vdpasim_net mac 00:11:22:33:44:55 mtu 9000
> > > > > >
> > > > > Hi Parav
> > > > > Really thanks for your comments. The reason for adding this function
> > > > > is to support Kubevirt.
> > > > > the problem we meet is that kubevirt chooses one random vdpa device
> > > > > from the pool and we don't know which one it going to pick. That means
> > > > > we can't get to know the Mac address before it is created. So we plan
> > > > > to have this function to change the mac address after it is created
> > > > > Thanks
> > > > > cindy
> > > >
> > > > Well you will need to change kubevirt to teach it to set
> > > > mac
Re: [PATCH 3/3] uprobes: make uprobe_register() return struct uprobe *
On 07/10, Oleg Nesterov wrote:
>
> -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> {
> -	struct uprobe *uprobe;
> -
> -	uprobe = find_uprobe(inode, offset);
> -	if (WARN_ON(!uprobe))
> -		return;
> -
> 	down_write(&uprobe->register_rwsem);
> 	__uprobe_unregister(uprobe, uc);
> 	up_write(&uprobe->register_rwsem);
> -	put_uprobe(uprobe);

OK, this is obviously wrong, needs get_uprobe/put_uprobe.
__uprobe_unregister() can free this uprobe, so
up_write(&uprobe->register_rwsem) is not safe.

I'll send V2 on top of Peter's new version.

Oleg.
[PATCH v2 05/11] perf/uprobe: Simplify UPROBE_HANDLER_REMOVE logic
Specifically, get rid of the uprobe->consumers re-load, which isn't
sound under RCU.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/events/uprobes.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2101,6 +2101,7 @@ static void handler_chain(struct uprobe
 	struct uprobe_consumer *uc;
 	int remove = UPROBE_HANDLER_REMOVE;
 	bool need_prep = false; /* prepare return uprobe, when needed */
+	bool had_handler = false;
 
 	down_read(&uprobe->register_rwsem);
 	for (uc = uprobe->consumers; uc; uc = uc->next) {
@@ -2115,16 +2116,26 @@ static void handler_chain(struct uprobe
 		if (uc->ret_handler)
 			need_prep = true;
 
+		/*
+		 * A single handler that does not mask out REMOVE, means the
+		 * probe stays.
+		 */
+		had_handler = true;
 		remove &= rc;
 	}
 
+	/*
+	 * If there were no handlers called, nobody asked for it to be removed
+	 * but also nobody got to mask the value. Fix it up.
+	 */
+	if (!had_handler)
+		remove = 0;
+
 	if (need_prep && !remove)
 		prepare_uretprobe(uprobe, regs); /* put bp at return */
 
-	if (remove && uprobe->consumers) {
-		WARN_ON(!uprobe_is_active(uprobe));
+	if (remove)
 		unapply_uprobe(uprobe, current->mm);
-	}
 
 	up_read(&uprobe->register_rwsem);
 }
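For illustration, a stand-alone user-space sketch of the remove-mask
accumulation above (names and values are illustrative, not the kernel's
API): without the !had_handler fixup, an empty handler run would leave
remove at its initial UPROBE_HANDLER_REMOVE and wrongly unapply the probe.

#include <stdio.h>

#define UPROBE_HANDLER_REMOVE 1

/* Each rcs[i] stands in for one consumer handler's return value. */
static int should_remove(const int *rcs, int n)
{
	int remove = UPROBE_HANDLER_REMOVE;
	int had_handler = 0;

	for (int i = 0; i < n; i++) {
		had_handler = 1;
		remove &= rcs[i];	/* one rc of 0 keeps the probe */
	}

	/* No handler ran: nobody asked for removal, nobody masked it. */
	if (!had_handler)
		remove = 0;

	return remove;
}

int main(void)
{
	int keep[] = { UPROBE_HANDLER_REMOVE, 0 };	/* 2nd consumer keeps it */

	printf("%d\n", should_remove(keep, 2));	/* 0: probe stays */
	printf("%d\n", should_remove(NULL, 0));	/* 0 only thanks to the fixup */
	return 0;
}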
[PATCH v2 09/11] srcu: Add __srcu_clone_read_lock()
In order to support carrying an srcu_read_lock() section across fork,
where both the parent and child process will do: srcu_read_unlock(), it
is needed to account for the extra decrement with an extra increment at
fork time.

Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/srcu.h     |  1 +
 include/linux/srcutiny.h | 10 ++++++++++
 kernel/rcu/srcutree.c    |  5 +++++
 3 files changed, 16 insertions(+)

--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -55,6 +55,7 @@ void call_srcu(struct srcu_struct *ssp, void (*func)(struct rcu_head *head));
 void cleanup_srcu_struct(struct srcu_struct *ssp);
 int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
+void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx);
 void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
 void synchronize_srcu(struct srcu_struct *ssp);
 unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp);
--- a/include/linux/srcutiny.h
+++ b/include/linux/srcutiny.h
@@ -71,6 +71,16 @@ static inline int __srcu_read_lock(struc
 	return idx;
 }
 
+static inline void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx)
+{
+	int newval;
+
+	preempt_disable();  // Needed for PREEMPT_AUTO
+	newval = READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1;
+	WRITE_ONCE(ssp->srcu_lock_nesting[idx], newval);
+	preempt_enable();
+}
+
 static inline void synchronize_srcu_expedited(struct srcu_struct *ssp)
 {
 	synchronize_srcu(ssp);
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -720,6 +720,11 @@ int __srcu_read_lock(struct srcu_struct
 }
 EXPORT_SYMBOL_GPL(__srcu_read_lock);
 
+void __srcu_clone_read_lock(struct srcu_struct *ssp, int idx)
+{
+	this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter);
+}
+
 /*
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct. Note that this may well be a different
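A hedged sketch of the intended usage (struct and function names here are
illustrative, not from the patch): a parent holds an SRCU read-side
section whose matching unlock will eventually be run once by the parent
*and* once by the fork child, so the child's copy gets an extra increment
at fork time.

struct pinned_thing {
	int srcu_idx;
	/* ... data kept alive by my_srcu ... */
};

DEFINE_STATIC_SRCU(my_srcu);

static void pin(struct pinned_thing *t)
{
	/* open a read-side section that outlives this function */
	t->srcu_idx = __srcu_read_lock(&my_srcu);
}

static void clone_at_fork(struct pinned_thing *child,
			  const struct pinned_thing *parent)
{
	*child = *parent;
	/* account for the child's future __srcu_read_unlock() */
	__srcu_clone_read_lock(&my_srcu, child->srcu_idx);
}

static void unpin(struct pinned_thing *t)
{
	/* run by parent and child alike, once each */
	__srcu_read_unlock(&my_srcu, t->srcu_idx);
}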
[PATCH v2 00/11] perf/uprobe: Optimize uprobes
Hi!

These patches implement the (S)RCU based proposal to optimize uprobes.

On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func' in a
tight loop:

  perf probe -x ./uprobes test=func
  perf stat -ae probe_uprobe:test -- sleep 1

  perf probe -x ./uprobes test=func%return
  perf stat -ae probe_uprobe:test__return -- sleep 1

PRE:

  4,038,804   probe_uprobe:test
  2,356,275   probe_uprobe:test__return

POST:

  7,216,579   probe_uprobe:test
  6,744,786   probe_uprobe:test__return

(copy-paste FTW, I didn't do new numbers because the fast paths didn't
change -- and a quick test run shows similar numbers)

Patches also available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes

Changes since last time:
 - better split with intermediate inc_not_zero()
 - fix UPROBE_HANDLER_REMOVE
 - restored the lost rcu_assign_pointer()
 - avoid lockdep for uretprobe_srcu
 - add missing put_uprobe() -> srcu_read_unlock() conversion
 - actually initialize return_instance::has_ref
 - a few comments
 - things I don't remember
[PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe()
With handle_swbp() triggering concurrently on (all) CPUs, tree_lock
becomes a bottleneck. Avoid treelock by doing RCU lookups of the uprobe.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/events/uprobes.c | 49 ++++++++++++++++++++++++++++++++--------
 1 file changed, 40 insertions(+), 9 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	refcount_t		ref;
+	struct rcu_head		rcu;
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
@@ -587,12 +589,25 @@ set_orig_insn(struct arch_uprobe *auprob
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+	if (refcount_inc_not_zero(&uprobe->ref))
+		return uprobe;
+	return NULL;
+}
+
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
 	refcount_inc(&uprobe->ref);
 	return uprobe;
}
 
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+	kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (refcount_dec_and_test(&uprobe->ref)) {
@@ -604,7 +619,7 @@ static void put_uprobe(struct uprobe *up
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
+		call_rcu(&uprobe->rcu, uprobe_free_rcu);
 	}
 }
@@ -653,10 +668,10 @@ static struct uprobe *__find_uprobe(stru
 		.inode = inode,
 		.offset = offset,
 	};
-	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
+	struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		return try_get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
@@ -667,20 +682,32 @@ static struct uprobe *__find_uprobe(stru
  */
 static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
-	struct uprobe *uprobe;
+	unsigned int seq;
 
-	read_lock(&uprobes_treelock);
-	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	guard(rcu)();
 
-	return uprobe;
+	do {
+		seq = read_seqcount_begin(&uprobes_seqcount);
+		struct uprobe *uprobe = __find_uprobe(inode, offset);
+		if (uprobe) {
+			/*
+			 * Lockless RB-tree lookups are prone to false-negatives.
+			 * If they find something, it's good. If they do not find,
+			 * it needs to be validated.
+			 */
+			return uprobe;
+		}
+	} while (read_seqcount_retry(&uprobes_seqcount, seq));
+
+	/* Really didn't find anything. */
+	return NULL;
 }
 
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
 
-	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
+	node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node)
 		return get_uprobe(__node_2_uprobe(node));
@@ -702,7 +729,9 @@ static struct uprobe *insert_uprobe(stru
 	struct uprobe *u;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	u = __insert_uprobe(uprobe);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 
 	return u;
@@ -936,7 +965,9 @@ static void delete_uprobe(struct uprobe
 		return;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
 	put_uprobe(uprobe);
[PATCH v2 01/11] perf/uprobe: Re-indent labels
Remove the silly label indenting.

  s/^\ \([[:alnum:]]*\):$/\1:/g

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/events/uprobes.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -205,7 +205,7 @@ static int __replace_page(struct vm_area
 		folio_put(old_folio);
 
 	err = 0;
- unlock:
+unlock:
 	mmu_notifier_invalidate_range_end(&range);
 	folio_unlock(old_folio);
 	return err;
@@ -857,7 +857,7 @@ static int prepare_uprobe(struct uprobe
 	smp_wmb(); /* pairs with the smp_rmb() in handle_swbp() */
 	set_bit(UPROBE_COPY_INSN, &uprobe->flags);
 
- out:
+out:
 	up_write(&uprobe->consumer_rwsem);
 
 	return ret;
@@ -965,7 +965,7 @@ build_map_info(struct address_space *map
 	struct map_info *info;
 	int more = 0;
 
- again:
+again:
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		if (!valid_vma(vma, is_register))
@@ -1019,7 +1019,7 @@ build_map_info(struct address_space *map
 	} while (--more);
 
 	goto again;
- out:
+out:
 	while (prev)
 		prev = free_map_info(prev);
 	return curr;
@@ -1068,13 +1068,13 @@ register_for_each_vma(struct uprobe *upr
 			err |= remove_breakpoint(uprobe, mm, info->vaddr);
 		}
 
- unlock:
+unlock:
 		mmap_write_unlock(mm);
- free:
+free:
 		mmput(mm);
 		info = free_map_info(info);
 	}
- out:
+out:
 	percpu_up_write(&dup_mmap_sem);
 	return err;
 }
@@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return -EINVAL;
 
- retry:
+retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (!uprobe)
 		return -ENOMEM;
@@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
 	ret = 0;
 	/* pairs with get_xol_area() */
 	smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
- fail:
+fail:
 	mmap_write_unlock(mm);
 
 	return ret;
@@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
 	kfree(area->bitmap);
 free_area:
 	kfree(area);
- out:
+out:
 	return NULL;
 }
@@ -1915,7 +1915,7 @@ static void prepare_uretprobe(struct upr
 	utask->return_instances = ri;
 
 	return;
- fail:
+fail:
 	kfree(ri);
 }
@@ -2031,7 +2031,7 @@ static int is_trap_at_addr(struct mm_str
 
 	copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 	put_page(page);
- out:
+out:
 	/* This needs to return true for any variant of the trap insn */
 	return is_trap_insn(&opcode);
 }
@@ -2159,7 +2159,7 @@ static void handle_trampoline(struct pt_
 	utask->return_instances = ri;
 	return;
 
- sigill:
+sigill:
 	uprobe_warn(current, "handle uretprobe, sending SIGILL.");
 	force_sig(SIGILL);
[PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU
With handle_swbp() hitting concurrently on (all) CPUs, potentially on
the same uprobe, the uprobe->refcount can get *very* hot.

Move the struct uprobe lifetime into uprobes_srcu such that it covers
both the uprobe and the uprobe->consumers list.

With this, handle_swbp() can use a single large SRCU critical section to
avoid taking a refcount on the uprobe for its duration.

Notably, the single-step and uretprobe paths need a reference that
leaves handle_swbp() and will, for now, still use ->refcount.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/events/uprobes.c | 68 ++++++++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 27 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -51,7 +51,7 @@ static struct mutex uprobes_mmap_mutex[U
 DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
 /*
- * Covers uprobe->consumers lifetime.
+ * Covers uprobe->consumers lifetime as well as struct uprobe.
  */
 DEFINE_STATIC_SRCU(uprobes_srcu);
@@ -626,7 +626,7 @@ static void put_uprobe(struct uprobe *up
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		call_rcu(&uprobe->rcu, uprobe_free_rcu);
+		call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
 	}
 }
@@ -678,7 +678,7 @@ static struct uprobe *__find_uprobe(stru
 	struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return try_get_uprobe(__node_2_uprobe(node));
+		return __node_2_uprobe(node);
 
 	return NULL;
 }
@@ -691,7 +691,7 @@ static struct uprobe *find_uprobe(struct
 {
 	unsigned int seq;
 
-	guard(rcu)();
+	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
 
 	do {
 		seq = read_seqcount_begin(&uprobes_seqcount);
@@ -1142,6 +1142,8 @@ void uprobe_unregister_nosync(struct ino
 {
 	struct uprobe *uprobe;
 
+	guard(srcu)(&uprobes_srcu);
+
 	uprobe = find_uprobe(inode, offset);
 	if (WARN_ON(!uprobe))
 		return;
@@ -1151,7 +1153,6 @@ void uprobe_unregister_nosync(struct ino
 	__uprobe_unregister(uprobe, uc);
 	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
@@ -1263,6 +1264,8 @@ int uprobe_apply(struct inode *inode, lo
 	struct uprobe_consumer *con;
 	int ret = -ENOENT;
 
+	guard(srcu)(&uprobes_srcu);
+
 	uprobe = find_uprobe(inode, offset);
 	if (WARN_ON(!uprobe))
 		return ret;
@@ -1275,7 +1278,6 @@ int uprobe_apply(struct inode *inode, lo
 	ret = register_for_each_vma(uprobe, add ? uc : NULL);
 	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
 	return ret;
 }
@@ -1929,10 +1931,14 @@ static void prepare_uretprobe(struct upr
 	if (!ri)
 		return;
 
+	ri->uprobe = try_get_uprobe(uprobe);
+	if (!ri->uprobe)
+		goto err_mem;
+
 	trampoline_vaddr = get_trampoline_vaddr();
 	orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
 	if (orig_ret_vaddr == -1)
-		goto fail;
+		goto err_uprobe;
 
 	/* drop the entries invalidated by longjmp() */
 	chained = (orig_ret_vaddr == trampoline_vaddr);
@@ -1950,12 +1956,11 @@ static void prepare_uretprobe(struct upr
 			 * attack from user-space.
			 */
 			uprobe_warn(current, "handle tail call");
-			goto fail;
+			goto err_uprobe;
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
 
-	ri->uprobe = get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
@@ -1966,7 +1971,10 @@ static void prepare_uretprobe(struct upr
 	utask->return_instances = ri;
 
 	return;
-fail:
+
+err_uprobe:
+	put_uprobe(ri->uprobe);
+err_mem:
 	kfree(ri);
 }
@@ -1982,22 +1990,31 @@ pre_ssout(struct uprobe *uprobe, struct
 	if (!utask)
 		return -ENOMEM;
 
+	utask->active_uprobe = try_get_uprobe(uprobe);
+	if (!utask->active_uprobe)
+		return -ESRCH;
+
 	xol_vaddr = xol_get_insn_slot(uprobe);
-	if (!xol_vaddr)
-		return -ENOMEM;
+	if (!xol_vaddr) {
+		err = -ENOMEM;
+		goto err_uprobe;
+	}
 
 	utask->xol_vaddr = xol_vaddr;
 	utask->vaddr = bp_vaddr;
 
 	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
-	if (unlikely(err)) {
-		xol_free_insn_slot(current);
-		return err;
-	}
+
[PATCH v2 06/11] perf/uprobe: SRCU-ify uprobe->consumer list
With handle_swbp() hitting concurrently on (all) CPUs the
uprobe->register_rwsem can get very contended. Add an SRCU instance to
cover the consumer list and consumer lifetime.

Since the consumers are externally embedded structures, unregister will
have to suffer a synchronize_srcu().

A notable complication is the UPROBE_HANDLER_REMOVE logic which can race
against uprobe_register() such that it might want to remove a freshly
installed handler that didn't get called. In order to close this hole, a
seqcount is added. With that, the removal path can tell if anything
changed and bail out of the removal.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/events/uprobes.c | 60 ++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 10 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 
 #include
@@ -49,6 +50,11 @@ static struct mutex uprobes_mmap_mutex[U
 
 DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
+/*
+ * Covers uprobe->consumers lifetime.
+ */
+DEFINE_STATIC_SRCU(uprobes_srcu);
+
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN	0
@@ -57,6 +63,7 @@ struct uprobe {
 	refcount_t		ref;
 	struct rcu_head		rcu;
 	struct rw_semaphore	register_rwsem;
+	seqcount_t		register_seq;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
@@ -760,6 +767,7 @@ static struct uprobe *alloc_uprobe(struc
 	uprobe->offset = offset;
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
+	seqcount_init(&uprobe->register_seq);
 	init_rwsem(&uprobe->consumer_rwsem);
 
 	/* add to uprobes_tree, sorted on inode:offset */
@@ -782,8 +790,8 @@ static struct uprobe *alloc_uprobe(struc
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
-	uc->next = uprobe->consumers;
-	uprobe->consumers = uc;
+	WRITE_ONCE(uc->next, uprobe->consumers);
+	rcu_assign_pointer(uprobe->consumers, uc);
 	up_write(&uprobe->consumer_rwsem);
 }
@@ -800,7 +808,7 @@ static bool consumer_del(struct uprobe *
 	down_write(&uprobe->consumer_rwsem);
 	for (con = &uprobe->consumers; *con; con = &(*con)->next) {
 		if (*con == uc) {
-			*con = uc->next;
+			WRITE_ONCE(*con, uc->next);
 			ret = true;
 			break;
 		}
@@ -1139,9 +1147,13 @@ void uprobe_unregister(struct inode *ino
 		return;
 
 	down_write(&uprobe->register_rwsem);
+	raw_write_seqcount_begin(&uprobe->register_seq);
 	__uprobe_unregister(uprobe, uc);
+	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
 	put_uprobe(uprobe);
+
+	synchronize_srcu(&uprobes_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
@@ -1204,10 +1216,12 @@ static int __uprobe_register(struct inod
 	down_write(&uprobe->register_rwsem);
 	ret = -EAGAIN;
 	if (likely(uprobe_is_active(uprobe))) {
+		raw_write_seqcount_begin(&uprobe->register_seq);
 		consumer_add(uprobe, uc);
 		ret = register_for_each_vma(uprobe, uc);
 		if (ret)
 			__uprobe_unregister(uprobe, uc);
+		raw_write_seqcount_end(&uprobe->register_seq);
 	}
 	up_write(&uprobe->register_rwsem);
 	put_uprobe(uprobe);
@@ -1250,10 +1264,12 @@ int uprobe_apply(struct inode *inode, lo
 		return ret;
 
 	down_write(&uprobe->register_rwsem);
+	raw_write_seqcount_begin(&uprobe->register_seq);
 	for (con = uprobe->consumers; con && con != uc ; con = con->next)
 		;
 	if (con)
 		ret = register_for_each_vma(uprobe, add ? uc : NULL);
+	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
 	put_uprobe(uprobe);
@@ -2096,15 +2112,23 @@ static struct uprobe *find_active_uprobe
 	return uprobe;
 }
 
+#define for_each_consumer_rcu(pos, head) \
+	for (pos = rcu_dereference_raw(head); pos; \
+	     pos = rcu_dereference_raw(pos->next))
+
 static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 {
 	struct uprobe_consumer *uc;
 	int remove = UPROBE_HANDLER_REMOVE;
 	bool need_prep = false; /* prepare return uprobe, when needed */
 	bool had_handler = false;
+	unsigned int seq;
 
-	down_read(&uprobe->register_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	guard(srcu)(&uprobes_srcu);
+
+	seq = raw_read_seqcount_begin(&uprobe->register_seq);
+
+	for_each_consumer_rcu(uc, uprobe->consumers) {
		int rc =
[PATCH v2 11/11] perf/uprobe: Add uretprobe timer
In order to put a bound on the uretprobe_srcu critical section, add a
timer to uprobe_task. Upon every RI added or removed the timer is pushed
forward to now + 1s. If the timer were ever to fire, it would convert
the SRCU 'reference' to a refcount reference if possible.

Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/uprobes.h |  8 +++++
 kernel/events/uprobes.c | 67 +++++++++++++++++++++++++++++++++++------
 2 files changed, 69 insertions(+), 6 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 
 struct vm_area_struct;
 struct mm_struct;
@@ -79,6 +80,10 @@ struct uprobe_task {
 	struct return_instance		*return_instances;
 	unsigned int			depth;
 	unsigned int			active_srcu_idx;
+
+	struct timer_list		ri_timer;
+	struct callback_head		ri_task_work;
+	struct task_struct		*task;
 };
 
 struct return_instance {
@@ -86,7 +91,8 @@ struct return_instance {
 	unsigned long		func;
 	unsigned long		stack;		/* stack pointer */
 	unsigned long		orig_ret_vaddr; /* original return address */
-	bool			chained;	/* true, if instance is nested */
+	u8			chained;	/* true, if instance is nested */
+	u8			has_ref;
 	int			srcu_idx;
 
 	struct return_instance	*next;	/* keep as stack */
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1761,7 +1761,12 @@ unsigned long uprobe_get_trap_addr(struc
 static struct return_instance *free_ret_instance(struct return_instance *ri)
 {
 	struct return_instance *next = ri->next;
-	__srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+	if (ri->uprobe) {
+		if (ri->has_ref)
+			put_uprobe(ri->uprobe);
+		else
+			__srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+	}
 	kfree(ri);
 	return next;
 }
@@ -1785,11 +1790,48 @@ void uprobe_free_utask(struct task_struc
 	while (ri)
 		ri = free_ret_instance(ri);
 
+	timer_delete_sync(&utask->ri_timer);
+	task_work_cancel(utask->task, &utask->ri_task_work);
 	xol_free_insn_slot(t);
 	kfree(utask);
 	t->utask = NULL;
 }
 
+static void return_instance_task_work(struct callback_head *head)
+{
+	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
+	struct return_instance *ri;
+
+	for (ri = utask->return_instances; ri; ri = ri->next) {
+		if (!ri->uprobe)
+			continue;
+		if (ri->has_ref)
+			continue;
+		if (refcount_inc_not_zero(&ri->uprobe->ref))
+			ri->has_ref = true;
+		else
+			ri->uprobe = NULL;
+		__srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
+	}
+}
+
+static void return_instance_timer(struct timer_list *timer)
+{
+	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
+	task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
+}
+
+static struct uprobe_task *alloc_utask(struct task_struct *task)
+{
+	struct uprobe_task *utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+	if (!utask)
+		return NULL;
+	timer_setup(&utask->ri_timer, return_instance_timer, 0);
+	init_task_work(&utask->ri_task_work, return_instance_task_work);
+	utask->task = task;
+	return utask;
+}
+
 /*
  * Allocate a uprobe_task object for the task if necessary.
  * Called when the thread hits a breakpoint.
@@ -1801,7 +1843,7 @@ void uprobe_free_utask(struct task_struc
 static struct uprobe_task *get_utask(void)
 {
 	if (!current->utask)
-		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+		current->utask = alloc_utask(current);
 	return current->utask;
 }
 
@@ -1810,7 +1852,7 @@ static int dup_utask(struct task_struct
 	struct uprobe_task *n_utask;
 	struct return_instance **p, *o, *n;
 
-	n_utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+	n_utask = alloc_utask(t);
 	if (!n_utask)
 		return -ENOMEM;
 	t->utask = n_utask;
@@ -1822,13 +1864,20 @@ static int dup_utask(struct task_struct
 			return -ENOMEM;
 
 		*n = *o;
-		__srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
+		if (n->uprobe) {
+			if (n->has_ref)
+				get_uprobe(n->uprobe);
+			else
+				__srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
+		}
 		n->next = NULL;
[PATCH v2 02/11] perf/uprobe: Remove spurious whitespace
Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/events/uprobes.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -67,7 +67,7 @@ struct uprobe {
 	 * The generic code assumes that it has two members of unknown type
 	 * owned by the arch-specific code:
 	 *
-	 * insn -	copy_insn() saves the original instruction here for 
+	 * insn -	copy_insn() saves the original instruction here for
 	 *		arch_uprobe_analyze_insn().
 	 *
 	 * ixol -	potentially modified instruction to execute out of
@@ -95,18 +95,18 @@ static LIST_HEAD(delayed_uprobe_list);
  * allocated.
 */
 struct xol_area {
-	wait_queue_head_t 		wq;		/* if all slots are busy */
-	atomic_t 			slot_count;	/* number of in-use slots */
-	unsigned long 			*bitmap;	/* 0 = free slot */
+	wait_queue_head_t		wq;		/* if all slots are busy */
+	atomic_t			slot_count;	/* number of in-use slots */
+	unsigned long			*bitmap;	/* 0 = free slot */
 	struct vm_special_mapping	xol_mapping;
-	struct page 			*pages[2];
+	struct page			*pages[2];
 	/*
 	 * We keep the vma's vm_start rather than a pointer to the vma
 	 * itself. The probed process or a naughty kernel module could make
 	 * the vma go away, and we must handle that reasonably gracefully.
 	 */
-	unsigned long 			vaddr;		/* Page(s) of instruction slots */
+	unsigned long			vaddr;		/* Page(s) of instruction slots */
 };

 /*
[PATCH v2 07/11] perf/uprobe: Split uprobe_unregister()
With uprobe_unregister() having grown a synchronize_srcu(), it becomes
fairly slow to call. Esp. since both users of this API call it in a
loop.

Peel off the sync_srcu() and do it once, after the loop.

Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Masami Hiramatsu (Google)
---
 include/linux/uprobes.h     |  8 ++++++--
 kernel/events/uprobes.c     |  8 ++++++--
 kernel/trace/bpf_trace.c    |  6 ++++--
 kernel/trace/trace_uprobe.c |  6 +++++-
 4 files changed, 21 insertions(+), 7 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -113,7 +113,8 @@ extern int uprobe_write_opcode(struct ar
 extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern void uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern void uprobe_unregister_sync(void);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -163,7 +164,10 @@ uprobe_apply(struct inode *inode, loff_t
 	return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+{
+}
+static inline void uprobes_unregister_sync(void)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1138,7 +1138,7 @@ __uprobe_unregister(struct uprobe *uprob
  * @offset: offset from the start of the file.
  * @uc: identify which probe if multiple probes are colocated.
 */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+void uprobe_unregister_nosync(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
 {
 	struct uprobe *uprobe;
@@ -1152,10 +1152,14 @@ void uprobe_unregister(struct inode *ino
 	raw_write_seqcount_end(&uprobe->register_seq);
 	up_write(&uprobe->register_rwsem);
 	put_uprobe(uprobe);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
 
+void uprobe_unregister_sync(void)
+{
 	synchronize_srcu(&uprobes_srcu);
 }
-EXPORT_SYMBOL_GPL(uprobe_unregister);
+EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
 
 /*
  * __uprobe_register - register a probe
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3181,9 +3181,11 @@ static void bpf_uprobe_unregister(struct
 	u32 i;
 
 	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
-				  &uprobes[i].consumer);
+		uprobe_unregister_nosync(d_real_inode(path->dentry), uprobes[i].offset,
+					 &uprobes[i].consumer);
 	}
+	if (cnt)
+		uprobe_unregister_sync();
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1104,6 +1104,7 @@ static int trace_uprobe_enable(struct tr
 static void __probe_event_disable(struct trace_probe *tp)
 {
 	struct trace_uprobe *tu;
+	bool sync = false;
 
 	tu = container_of(tp, struct trace_uprobe, tp);
 	WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
@@ -1112,9 +1113,12 @@ static void __probe_event_disable(struct
 		if (!tu->inode)
 			continue;
 
-		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
+		uprobe_unregister_nosync(tu->inode, tu->offset, &tu->consumer);
+		sync = true;
 		tu->inode = NULL;
 	}
+	if (sync)
+		uprobe_unregister_sync();
 }
 
 static int probe_event_enable(struct trace_event_call *call,
[PATCH v2 03/11] rbtree: Provide rb_find_rcu() / rb_find_add_rcu()
Much like latch_tree, add two RCU methods for the regular RB-tree,
which can be used in conjunction with a seqcount to provide lockless
lookups.

Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Masami Hiramatsu (Google)
---
 include/linux/rbtree.h | 67 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -245,6 +245,42 @@ rb_find_add(struct rb_node *node, struct
 }
 
 /**
+ * rb_find_add_rcu() - find equivalent @node in @tree, or add @node
+ * @node: node to look-for / insert
+ * @tree: tree to search / modify
+ * @cmp: operator defining the node order
+ *
+ * Adds a Store-Release for link_node.
+ *
+ * Returns the rb_node matching @node, or NULL when no match is found and @node
+ * is inserted.
+ */
+static __always_inline struct rb_node *
+rb_find_add_rcu(struct rb_node *node, struct rb_root *tree,
+		int (*cmp)(struct rb_node *, const struct rb_node *))
+{
+	struct rb_node **link = &tree->rb_node;
+	struct rb_node *parent = NULL;
+	int c;
+
+	while (*link) {
+		parent = *link;
+		c = cmp(node, parent);
+
+		if (c < 0)
+			link = &parent->rb_left;
+		else if (c > 0)
+			link = &parent->rb_right;
+		else
+			return parent;
+	}
+
+	rb_link_node_rcu(node, parent, link);
+	rb_insert_color(node, tree);
+	return NULL;
+}
+
+/**
  * rb_find() - find @key in tree @tree
  * @key: key to match
  * @tree: tree to search
@@ -268,6 +304,37 @@ rb_find(const void *key, const struct rb
 		else
 			return node;
 	}
 
 	return NULL;
 }
+
+/**
+ * rb_find_rcu() - find @key in tree @tree
+ * @key: key to match
+ * @tree: tree to search
+ * @cmp: operator defining the node order
+ *
+ * Notably, tree descent vs concurrent tree rotations is unsound and can result
+ * in false-negatives.
+ *
+ * Returns the rb_node matching @key or NULL.
+ */
+static __always_inline struct rb_node *
+rb_find_rcu(const void *key, const struct rb_root *tree,
+	    int (*cmp)(const void *key, const struct rb_node *))
+{
+	struct rb_node *node = tree->rb_node;
+
+	while (node) {
+		int c = cmp(key, node);
+
+		if (c < 0)
+			node = rcu_dereference_raw(node->rb_left);
+		else if (c > 0)
+			node = rcu_dereference_raw(node->rb_right);
+		else
+			return node;
+	}
+
+	return NULL;
+}
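A hedged usage sketch (not from the patch; "my_lock", "my_seq", "my_tree"
and the cmp functions are illustrative names) of how these pair with a
seqcount to recover from the false-negatives the kerneldoc warns about:

static DEFINE_SPINLOCK(my_lock);
static seqcount_spinlock_t my_seq = SEQCNT_SPINLOCK_ZERO(my_seq, &my_lock);
static struct rb_root my_tree = RB_ROOT;

static int my_cmp(const void *key, const struct rb_node *node);
static int my_node_cmp(struct rb_node *a, const struct rb_node *b);

static struct rb_node *my_lookup(const void *key)
{
	unsigned int seq;
	struct rb_node *node;

	guard(rcu)();
	do {
		seq = read_seqcount_begin(&my_seq);
		node = rb_find_rcu(key, &my_tree, my_cmp);
		if (node)
			return node;	/* positive results need no retry */
	} while (read_seqcount_retry(&my_seq, seq));

	return NULL;	/* validated negative */
}

static void my_insert(struct rb_node *node)
{
	guard(spinlock)(&my_lock);
	write_seqcount_begin(&my_seq);
	rb_find_add_rcu(node, &my_tree, my_node_cmp);	/* duplicate ignored here */
	write_seqcount_end(&my_seq);
}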
[PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
Both single-step and uretprobes take a refcount on struct uprobe in
handle_swbp() in order to ensure struct uprobe stays extant until a
next trap.

Since uprobe_unregister() only cares about the uprobe_consumer
life-time, and these intra-trap sections can be arbitrarily large,
create a second SRCU domain to cover these.

Notably, a uretprobe with a registered return_instance that never
triggers -- because userspace -- will currently pin the return_instance
and related uprobe until the task dies. With this conversion to SRCU
this behaviour will inhibit freeing of all uprobes.

Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/uprobes.h |  2 +
 kernel/events/uprobes.c | 60 +++++++++++++++++++++++------------------
 2 files changed, 31 insertions(+), 31 deletions(-)

--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -78,6 +78,7 @@ struct uprobe_task {
 
 	struct return_instance		*return_instances;
 	unsigned int			depth;
+	unsigned int			active_srcu_idx;
 };
 
 struct return_instance {
@@ -86,6 +87,7 @@ struct return_instance {
 	unsigned long		stack;		/* stack pointer */
 	unsigned long		orig_ret_vaddr; /* original return address */
 	bool			chained;	/* true, if instance is nested */
+	int			srcu_idx;
 
 	struct return_instance	*next;	/* keep as stack */
};
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -54,6 +54,15 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem)
  * Covers uprobe->consumers lifetime as well as struct uprobe.
  */
 DEFINE_STATIC_SRCU(uprobes_srcu);
+/*
+ * Covers return_instance->uprobe and utask->active_uprobe. Separate from
+ * uprobe_srcu because uprobe_unregister() doesn't need to wait for this
+ * and these lifetimes can be fairly long.
+ *
+ * Notably, these sections span userspace and as such use
+ * __srcu_read_{,un}lock() to elide lockdep.
+ */
+DEFINE_STATIC_SRCU(uretprobes_srcu);
 
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN	0
@@ -596,25 +605,24 @@ set_orig_insn(struct arch_uprobe *auprob
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
-static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
-{
-	if (refcount_inc_not_zero(&uprobe->ref))
-		return uprobe;
-	return NULL;
-}
-
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
 	refcount_inc(&uprobe->ref);
 	return uprobe;
 }
 
-static void uprobe_free_rcu(struct rcu_head *rcu)
+static void uprobe_free_stage2(struct rcu_head *rcu)
 {
 	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
 	kfree(uprobe);
 }
 
+static void uprobe_free_stage1(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+	call_srcu(&uretprobes_srcu, &uprobe->rcu, uprobe_free_stage2);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (refcount_dec_and_test(&uprobe->ref)) {
@@ -626,7 +634,7 @@ static void put_uprobe(struct uprobe *up
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
+		call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);
 	}
 }
@@ -1753,7 +1761,7 @@ unsigned long uprobe_get_trap_addr(struc
 static struct return_instance *free_ret_instance(struct return_instance *ri)
 {
 	struct return_instance *next = ri->next;
-	put_uprobe(ri->uprobe);
+	__srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
 	kfree(ri);
 	return next;
 }
@@ -1771,7 +1779,7 @@ void uprobe_free_utask(struct task_struc
 		return;
 
 	if (utask->active_uprobe)
-		put_uprobe(utask->active_uprobe);
+		__srcu_read_unlock(&uretprobes_srcu, utask->active_srcu_idx);
 
 	ri = utask->return_instances;
 	while (ri)
@@ -1814,7 +1822,7 @@ static int dup_utask(struct task_struct
 			return -ENOMEM;
 
 		*n = *o;
-		get_uprobe(n->uprobe);
+		__srcu_clone_read_lock(&uretprobes_srcu, n->srcu_idx);
 		n->next = NULL;
 
 		*p = n;
@@ -1931,14 +1939,10 @@ static void prepare_uretprobe(struct upr
 	if (!ri)
 		return;
 
-	ri->uprobe = try_get_uprobe(uprobe);
-	if (!ri->uprobe)
-		goto err_mem;
-
 	trampoline_vaddr = get_trampoline_vaddr();
 	orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
 	if (orig_ret_vaddr == -1)
-		goto err_uprobe;
+		goto err_mem;
 
 	/* drop the entries invalidated by longjmp() */
 	chained = (orig_ret_vaddr == trampoline_vaddr);
@@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct u
Re: [PATCH v2 01/11] perf/uprobe: Re-indent labels
On Thu, Jul 11, 2024 at 01:02:36PM +0200, Peter Zijlstra wrote:

SNIP

> @@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
> 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> 		return -EINVAL;
>
> - retry:
> +retry:
> 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> 	if (!uprobe)
> 		return -ENOMEM;
> @@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
> 	ret = 0;
> 	/* pairs with get_xol_area() */
> 	smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
> - fail:
> +fail:
> 	mmap_write_unlock(mm);
>
> 	return ret;
> @@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
> 	kfree(area->bitmap);
> free_area:

hi,
missed this one and another one few lines before that ;-)

jirka

> 	kfree(area);
> - out:
> +out:
> 	return NULL;
> }
>
> @@ -1915,7 +1915,7 @@ static void prepare_uretprobe(struct upr
> 	utask->return_instances = ri;
>
> 	return;
> - fail:
> +fail:
> 	kfree(ri);
> }
>
> @@ -2031,7 +2031,7 @@ static int is_trap_at_addr(struct mm_str
>
> 	copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> 	put_page(page);
> - out:
> +out:
> 	/* This needs to return true for any variant of the trap insn */
> 	return is_trap_insn(&opcode);
> }
> @@ -2159,7 +2159,7 @@ static void handle_trampoline(struct pt_
> 	utask->return_instances = ri;
> 	return;
>
> - sigill:
> +sigill:
> 	uprobe_warn(current, "handle uretprobe, sending SIGILL.");
> 	force_sig(SIGILL);
>
>
Re: [PATCH v2 01/11] perf/uprobe: Re-indent labels
On Thu, Jul 11, 2024 at 01:58:04PM +0200, Jiri Olsa wrote:
> On Thu, Jul 11, 2024 at 01:02:36PM +0200, Peter Zijlstra wrote:
>
> SNIP
>
> > @@ -1159,7 +1159,7 @@ static int __uprobe_register(struct inod
> > 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> > 		return -EINVAL;
> >
> > - retry:
> > +retry:
> > 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> > 	if (!uprobe)
> > 		return -ENOMEM;
> > @@ -1468,7 +1468,7 @@ static int xol_add_vma(struct mm_struct
> > 	ret = 0;
> > 	/* pairs with get_xol_area() */
> > 	smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
> > - fail:
> > +fail:
> > 	mmap_write_unlock(mm);
> >
> > 	return ret;
> > @@ -1512,7 +1512,7 @@ static struct xol_area *__create_xol_are
> > 	kfree(area->bitmap);
> > free_area:
>
> hi,
> missed this one and another one few lines before that ;-)

Bah, _ isn't in [[:alnum:]]. I'll go fix.
Re: [PATCH v2 58/60] i2c: virtio: reword according to newest specification
Hi Wolfram,

On Sat, Jul 06, 2024 at 01:20:58PM GMT, Wolfram Sang wrote:
> Change the wording of this driver wrt. the newest I2C v7 and SMBus 3.2
> specifications and replace "master/slave" with more appropriate terms.
>
> Signed-off-by: Wolfram Sang

Reviewed-by: Andi Shyti

Thanks,
Andi
Re: [PATCH v2 00/60] i2c: reword first drivers according to newest specification
Hi Wolfram,

pushed in i2c/i2c-host. Thanks for this big work; in the end it turned
out quite nice and I'm happy with the outcome!

Thanks,
Andi

On Sat, Jul 06, 2024 at 01:20:00PM GMT, Wolfram Sang wrote:
> Start changing the wording of the I2C main header wrt. the newest I2C
> v7 and SMBus 3.2 specifications and replace "master/slave" with more
> appropriate terms. This first step renames the members of struct
> i2c_algorithm. Once all in-tree users are converted, the anonymous union
> will go away again. All this work will also pave the way for finally
> separating the monolithic header into more fine-grained headers like
> "i2c/clients.h" etc. So, this is not a simple renaming exercise but
> also a chance to update the I2C core to recent Linux standards.
>
> Changes since v1:
>
> * changed wording according to the terminology we agreed on and defined
>   upstream. That means consistent use of "controller/target", and no
>   more "host/client". I added "local/remote target" where necessary.
> * added tags which I kept despite some changes in wording. The approach
>   and code changes (if necessary) did not change.
> * rebased to Andi's for-next branch
> * this series only contains patches which convert the drivers fully. If
>   all goes well, no more updates for them are needed. The previous
>   series converted all users of "master_xfer". But to avoid tons of
>   incremental patches to one driver, I will incrementally improve i2c.h
>   and see which drivers can be fully converted step-by-step.
> * do not mention I3C specs in commit messages, not really relevant here
>
> Please note that I am not super strict with the 80 char limit. And, as
> agreed, I did not convert occasions where old terminology is used in
> register names or bits etc. or in function names outside of the I2C
> realm.
>
> The outcome is that before this series 115 drivers use old terminology,
> after this only 54. Hooray.
>
> And a comment to all janitors: Do not convert I2C drivers outside of
> drivers/i2c yet. Let us first gain experience here and present the
> well-tested results of what we figured out to other maintainers then.
> This ensures they have to deal with way less patch revisions.
>
> Thanks and happy hacking!
>
>
> Wolfram Sang (60):
>   i2c: reword i2c_algorithm according to newest specification
>   i2c: ali15x3: reword according to newest specification
>   i2c: altera: reword according to newest specification
>   i2c: au1550: reword according to newest specification
>   i2c: bcm-kona: reword according to newest specification
>   i2c: bcm2835: reword according to newest specification
>   i2c: brcmstb: reword according to newest specification
>   i2c: cht-wc: reword according to newest specification
>   i2c: cp2615: reword according to newest specification
>   i2c: cros-ec-tunnel: reword according to newest specification
>   i2c: davinci: reword according to newest specification
>   i2c: digicolor: reword according to newest specification
>   i2c: diolan-u2c: reword according to newest specification
>   i2c: dln2: reword according to newest specification
>   i2c: fsi: reword according to newest specification
>   i2c: gpio: reword according to newest specification
>   i2c: highlander: reword according to newest specification
>   i2c: hisi: reword according to newest specification
>   i2c: hix5hd2: reword according to newest specification
>   i2c: i801: reword according to newest specification
>   i2c: ibm_iic: reword according to newest specification
>   i2c: iop3xx: reword according to newest specification
>   i2c: isch: reword according to newest specification
>   i2c: jz4780: reword according to newest specification
>   i2c: kempld: reword according to newest specification
>   i2c: ljca: reword according to newest specification
>   i2c: lpc2k: reword according to newest specification
>   i2c: ls2x: reword according to newest specification
>   i2c: mlxcpld: reword according to newest specification
>   i2c: mpc: reword according to newest specification
>   i2c: mt7621: reword according to newest specification
>   i2c: mv64xxx: reword according to newest specification
>   i2c: ocores: reword according to newest specification
>   i2c: octeon: reword according to newest specification
>   i2c: opal: reword according to newest specification
>   i2c: owl: reword according to newest specification
>   i2c: pasemi: reword according to newest specification
>   i2c: piix4: reword according to newest specification
>   i2c: powermac: reword according to newest specification
>   i2c: pxa-pci: reword according to newest specification
>   i2c: riic: reword according to newest specification
>   i2c: rk3x: reword according to newest specification
>   i2c: robotfuzz-osif: reword according to newest specification
>   i2c: rzv2m: reword according to newest specification
>   i2c: sis5595: reword according to newest specification
>   i2c: sprd: reword according to newest specification
>   i2c: stm32f4: reword according to newest speci
Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
Not sure I read this patch correctly, but at first glance it looks
suspicious...

On 07/11, Peter Zijlstra wrote:
>
> +static void return_instance_timer(struct timer_list *timer)
> +{
> +	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
> +	task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
> +}

What if utask->task sleeps in TASK_STOPPED/TASK_TRACED state before return
from the ret-probed function? In this case it won't react to TWA_SIGNAL
until a debugger or SIGCONT wakes it up.

---
And it seems that even task_work_add() itself is not safe...

Suppose we have 2 ret-probed functions

	void f2() { ... }
	void f1() { ...; f2(); }

A task T calls f1(), hits the bp, and calls prepare_uretprobe() which does

	mod_timer(&utask->ri_timer, jiffies + HZ);

Then later it calls f2() and the pending timer expires after it enters the
kernel, but before the next prepare_uretprobe() -> mod_timer(). In this case
ri_task_work is already queued and the timer is pending again.

Now. Even if T goes to the exit_to_user_mode_loop() path immediately, in
theory nothing can guarantee that it will call get_signal/task_work_run in
less than 1 second, it can be preempted.

But T can sleep in xol_take_insn_slot() before return from handle_swbp(),
and this is not so theoretical.

Oleg.
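For reference, task_work_add() does not detect an already-queued
callback_head, so the double-enqueue Oleg describes needs an explicit guard.
A minimal sketch of one such guard follows; the unsigned long ri_work_queued
field is invented for illustration (Peter's follow-up later in this thread
uses a plain bool serviced by the task_work itself):

	/*
	 * Sketch only: avoid double-queueing utask->ri_task_work when the
	 * timer fires again before the previous task_work has run. The
	 * bit-based guard (ri_work_queued) is hypothetical, not from the
	 * posted patch.
	 */
	static void return_instance_timer(struct timer_list *timer)
	{
		struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);

		/* only one instance of ri_task_work may be queued at a time */
		if (test_and_set_bit(0, &utask->ri_work_queued))
			return;

		task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
	}

	static void return_instance_task_work(struct callback_head *head)
	{
		struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);

		clear_bit(0, &utask->ri_work_queued);	/* allow re-queueing */
		/* ... walk and release utask->return_instances ... */
	}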
Re: [PATCH v2 1/2] virtio_balloon: add work around for out of spec QEMU
Hi Michael, kernel test robot noticed the following build errors: [auto build test ERROR on next-20240710] [cannot apply to uml/next remoteproc/rproc-next s390/features linus/master uml/fixes v6.10-rc7 v6.10-rc6 v6.10-rc5 v6.10-rc7] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Michael-S-Tsirkin/virtio_balloon-add-work-around-for-out-of-spec-QEMU/20240711-004346 base: next-20240710 patch link: https://lore.kernel.org/r/19d916257b76148f89de7386389eeb7267b1b61c.1720611677.git.mst%40redhat.com patch subject: [PATCH v2 1/2] virtio_balloon: add work around for out of spec QEMU config: i386-randconfig-005-20240711 (https://download.01.org/0day-ci/archive/20240711/202407112126.plguwi8i-...@intel.com/config) compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240711/202407112126.plguwi8i-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202407112126.plguwi8i-...@intel.com/ All errors (new ones prefixed by >>): >> drivers/virtio/virtio_balloon.c:603:55: error: too few arguments to function >> call, expected 5, have 4 602 | err = virtio_find_vqs(vb->vdev, | ~~~ 603 | VIRTIO_BALLOON_VQ_REPORTING, vqs_info, NULL); | ^ include/linux/virtio_config.h:225:5: note: 'virtio_find_vqs' declared here 225 | int virtio_find_vqs(struct virtio_device *vdev, unsigned int nvqs, | ^ ~~ 226 | struct virtqueue *vqs[], | 227 | struct virtqueue_info vqs_info[], | ~ 228 | struct irq_affinity *desc) | ~ 1 error generated. vim +603 drivers/virtio/virtio_balloon.c 560 561 static int init_vqs(struct virtio_balloon *vb) 562 { 563 struct virtqueue_info vqs_info[VIRTIO_BALLOON_VQ_MAX] = {}; 564 struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX]; 565 int err; 566 567 /* 568 * Inflateq and deflateq are used unconditionally. The names[] 569 * will be NULL if the related feature is not enabled, which will 570 * cause no allocation for the corresponding virtqueue in find_vqs. 571 */ 572 vqs_info[VIRTIO_BALLOON_VQ_INFLATE].callback = balloon_ack; 573 vqs_info[VIRTIO_BALLOON_VQ_INFLATE].name = "inflate"; 574 vqs_info[VIRTIO_BALLOON_VQ_DEFLATE].callback = balloon_ack; 575 vqs_info[VIRTIO_BALLOON_VQ_DEFLATE].name = "deflate"; 576 577 if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { 578 vqs_info[VIRTIO_BALLOON_VQ_STATS].name = "stats"; 579 vqs_info[VIRTIO_BALLOON_VQ_STATS].callback = stats_request; 580 } 581 582 if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) 583 vqs_info[VIRTIO_BALLOON_VQ_FREE_PAGE].name = "free_page_vq"; 584 585 if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) { 586 vqs_info[VIRTIO_BALLOON_VQ_REPORTING].name = "reporting_vq"; 587 vqs_info[VIRTIO_BALLOON_VQ_REPORTING].callback = balloon_ack; 588 } 589 590 err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs, 591vqs_info, NULL); 592 if (err) { 593 /* 594 * Try to work around QEMU bug which since 2020 confused vq numbers 595 * when VIRTIO_BALLOON_F_REPORTING but not 596 * VIRTIO_BALLOON_F_FREE_PAGE_HINT are offered. 
597 */ 598 if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING) && 599 !virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) { 600 vqs_info[VIRTIO_BALLOON_VQ_FREE_PAGE].name = "reporting_vq";
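The diagnostic itself points at the fix: the retry call passes four arguments
where the virtio_find_vqs() declaration quoted by the robot takes five,
dropping the vqs array. A sketch of the presumably intended call, keeping the
other arguments as in the posted patch (this is an assumption, not the
author's resolution):

	/*
	 * Presumed fix for the retry path: pass the vqs array so the call
	 * matches the five-argument virtio_find_vqs() declaration quoted
	 * in the report above.
	 */
	err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_REPORTING,
			      vqs, vqs_info, NULL);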
Re: [PATCH v2 2/2] virtio: fix vq # for balloon
Hi Michael, kernel test robot noticed the following build errors: [auto build test ERROR on next-20240710] [cannot apply to uml/next remoteproc/rproc-next s390/features linus/master uml/fixes v6.10-rc7 v6.10-rc6 v6.10-rc5 v6.10-rc7] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Michael-S-Tsirkin/virtio_balloon-add-work-around-for-out-of-spec-QEMU/20240711-004346 base: next-20240710 patch link: https://lore.kernel.org/r/3d655be73ce220f176b2c163839d83699f8faf43.1720611677.git.mst%40redhat.com patch subject: [PATCH v2 2/2] virtio: fix vq # for balloon config: i386-randconfig-014-20240711 (https://download.01.org/0day-ci/archive/20240711/202407112113.szspddlk-...@intel.com/config) compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240711/202407112113.szspddlk-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202407112113.szspddlk-...@intel.com/ All errors (new ones prefixed by >>): >> drivers/virtio/virtio_pci_common.c:391:1: error: version control conflict >> marker in file 391 | <<<<<<< HEAD | ^ >> drivers/virtio/virtio_pci_common.c:392:30: error: use of undeclared >> identifier 'queue_idx' 392 | vqs[i] = vp_setup_vq(vdev, queue_idx++, vqi->callback, |^ 2 errors generated. vim +391 drivers/virtio/virtio_pci_common.c 365 366 static int vp_find_vqs_intx(struct virtio_device *vdev, unsigned int nvqs, 367 struct virtqueue *vqs[], 368 struct virtqueue_info vqs_info[]) 369 { 370 struct virtio_pci_device *vp_dev = to_vp_device(vdev); 371 int i, err; 372 373 vp_dev->vqs = kcalloc(nvqs, sizeof(*vp_dev->vqs), GFP_KERNEL); 374 if (!vp_dev->vqs) 375 return -ENOMEM; 376 377 err = request_irq(vp_dev->pci_dev->irq, vp_interrupt, IRQF_SHARED, 378 dev_name(&vdev->dev), vp_dev); 379 if (err) 380 goto out_del_vqs; 381 382 vp_dev->intx_enabled = 1; 383 vp_dev->per_vq_vectors = false; 384 for (i = 0; i < nvqs; ++i) { 385 struct virtqueue_info *vqi = &vqs_info[i]; 386 387 if (!vqi->name) { 388 vqs[i] = NULL; 389 continue; 390 } > 391 <<<<<<< HEAD > 392 vqs[i] = vp_setup_vq(vdev, queue_idx++, vqi->callback, 393 vqi->name, vqi->ctx, 394 === 395 vqs[i] = vp_setup_vq(vdev, i, callbacks[i], names[i], 396 ctx ? ctx[i] : false, 397 >>>>>>> f814759f80b7... virtio: fix vq # for balloon 398 VIRTIO_MSI_NO_VECTOR); 399 if (IS_ERR(vqs[i])) { 400 err = PTR_ERR(vqs[i]); 401 goto out_del_vqs; 402 } 403 } 404 405 return 0; 406 out_del_vqs: 407 vp_del_vqs(vdev); 408 return err; 409 } 410 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
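Here the build failure is a leftover merge conflict rather than an API
mismatch: the `<<<<<<< HEAD` marker made it into the patch. Since the
surrounding file already uses the virtqueue_info style and the point of this
patch is to fix the vq number, the resolved hunk would presumably keep the
new-style call with the loop index; a sketch, under that assumption:

	/*
	 * Presumed resolution of the conflicted hunk: keep the
	 * virtqueue_info style used by the surrounding code, with the
	 * per-loop index 'i' that this patch introduces for the vq number.
	 */
	vqs[i] = vp_setup_vq(vdev, i, vqi->callback,
			     vqi->name, vqi->ctx,
			     VIRTIO_MSI_NO_VECTOR);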
Re: [PATCH] test/vsock: add install target
On Thu, Jul 11, 2024 at 09:07:04AM +0200, Stefano Garzarella wrote:
> CCing Stefan.
>
> On Wed, Jul 10, 2024 at 07:00:59PM GMT, Jakub Kicinski wrote:
> > On Wed, 10 Jul 2024 13:58:39 +0200 Stefano Garzarella wrote:
> > > There is a comment there:
> > >
> > > # Avoid changing the rest of the logic here and lib.mk.
> > >
> > > Added by commit 17eac6c2db8b2cdfe33d40229bdda2acd86b304a.
> > >
> > > IIUC they re-used INSTALL_PATH, just to avoid too many changes in that
> > > file and in tools/testing/selftests/lib.mk
> > >
> > > So, IMHO we should not care about it and only use VSOCK_INSTALL_PATH if
> > > you don't want to conflict with INSTALL_PATH.
> >
> > Any reason why vsock isn't part of selftests in the first place?
>
> Usually vsock tests test both the driver (virtio-vsock) in the guest and the
> device in the host kernel (vhost-vsock). So I usually run the tests in 2
> nested VMs to test the latest changes for both the guest and the host.
>
> I don't know enough selftests, but do you think it is possible to integrate
> them?
>
> CCing Stefan who is the original author and may remember more reasons about
> this choice.

It's probably because of the manual steps in tools/testing/vsock/README:

  The following prerequisite steps are not automated and must be performed
  prior to running tests:

  1. Build the kernel, make headers_install, and build these tests.
  2. Install the kernel and tests on the host.
  3. Install the kernel and tests inside the guest.
  4. Boot the guest and ensure that the AF_VSOCK transport is enabled.

If you want to automate this for QEMU, VMware, and Hyper-V that would be
great. It relies on having a guest running under these hypervisors and
that's not trivial to automate (plus it involves proprietary software for
VMware and Hyper-V that may not be available without additional license
agreements and/or payment).

Stefan
Re: [PATCH v2 04/11] perf/uprobe: RCU-ify find_uprobe()
On Thu, 11 Jul 2024 13:02:39 +0200 Peter Zijlstra wrote: > With handle_swbp() triggering concurrently on (all) CPUs, tree_lock > becomes a bottleneck. Avoid treelock by doing RCU lookups of the > uprobe. > Looks good to me. Acked-by: Masami Hiramatsu (Google) Thanks, > Signed-off-by: Peter Zijlstra (Intel) > --- > kernel/events/uprobes.c | 49 > +++- > 1 file changed, 40 insertions(+), 9 deletions(-) > > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > +static seqcount_rwlock_t uprobes_seqcount = > SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock); > > #define UPROBES_HASH_SZ 13 > /* serialize uprobe->pending_list */ > @@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem) > struct uprobe { > struct rb_node rb_node;/* node in the rb tree */ > refcount_t ref; > + struct rcu_head rcu; > struct rw_semaphore register_rwsem; > struct rw_semaphore consumer_rwsem; > struct list_headpending_list; > @@ -587,12 +589,25 @@ set_orig_insn(struct arch_uprobe *auprob > *(uprobe_opcode_t *)&auprobe->insn); > } > > +static struct uprobe *try_get_uprobe(struct uprobe *uprobe) > +{ > + if (refcount_inc_not_zero(&uprobe->ref)) > + return uprobe; > + return NULL; > +} > + > static struct uprobe *get_uprobe(struct uprobe *uprobe) > { > refcount_inc(&uprobe->ref); > return uprobe; > } > > +static void uprobe_free_rcu(struct rcu_head *rcu) > +{ > + struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu); > + kfree(uprobe); > +} > + > static void put_uprobe(struct uprobe *uprobe) > { > if (refcount_dec_and_test(&uprobe->ref)) { > @@ -604,7 +619,7 @@ static void put_uprobe(struct uprobe *up > mutex_lock(&delayed_uprobe_lock); > delayed_uprobe_remove(uprobe, NULL); > mutex_unlock(&delayed_uprobe_lock); > - kfree(uprobe); > + call_rcu(&uprobe->rcu, uprobe_free_rcu); > } > } > > @@ -653,10 +668,10 @@ static struct uprobe *__find_uprobe(stru > .inode = inode, > .offset = offset, > }; > - struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key); > + struct rb_node *node = rb_find_rcu(&key, &uprobes_tree, > __uprobe_cmp_key); > > if (node) > - return get_uprobe(__node_2_uprobe(node)); > + return try_get_uprobe(__node_2_uprobe(node)); > > return NULL; > } > @@ -667,20 +682,32 @@ static struct uprobe *__find_uprobe(stru > */ > static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > { > - struct uprobe *uprobe; > + unsigned int seq; > > - read_lock(&uprobes_treelock); > - uprobe = __find_uprobe(inode, offset); > - read_unlock(&uprobes_treelock); > + guard(rcu)(); > > - return uprobe; > + do { > + seq = read_seqcount_begin(&uprobes_seqcount); > + struct uprobe *uprobe = __find_uprobe(inode, offset); > + if (uprobe) { > + /* > + * Lockless RB-tree lookups are prone to > false-negatives. > + * If they find something, it's good. If they do not > find, > + * it needs to be validated. > + */ > + return uprobe; > + } > + } while (read_seqcount_retry(&uprobes_seqcount, seq)); > + > + /* Really didn't find anything. 
*/ > + return NULL; > } > > static struct uprobe *__insert_uprobe(struct uprobe *uprobe) > { > struct rb_node *node; > > - node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp); > + node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp); > if (node) > return get_uprobe(__node_2_uprobe(node)); > > @@ -702,7 +729,9 @@ static struct uprobe *insert_uprobe(stru > struct uprobe *u; > > write_lock(&uprobes_treelock); > + write_seqcount_begin(&uprobes_seqcount); > u = __insert_uprobe(uprobe); > + write_seqcount_end(&uprobes_seqcount); > write_unlock(&uprobes_treelock); > > return u; > @@ -936,7 +965,9 @@ static void delete_uprobe(struct uprobe > return; > > write_lock(&uprobes_treelock); > + write_seqcount_begin(&uprobes_seqcount); > rb_erase(&uprobe->rb_node, &uprobes_tree); > + write_seqcount_end(&uprobes_seqcount); > write_unlock(&uprobes_treelock); > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > put_uprobe(uprobe); > > -- Masami Hiramatsu (Google)
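As an aside for readers following the diff: the lookup idiom above
consolidates to the shape below. A lockless rb_find_rcu() can miss a node
that is concurrently being rotated (a false negative), so only a miss needs
validation against the write-side sequence count and a retry. This sketch is
reconstructed from the patch, not new behavior:

	/*
	 * Consolidated view of the read side after the patch: positive
	 * results from the lockless rbtree lookup can be trusted; negative
	 * results are only accepted when no writer ran in the meantime.
	 */
	static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
	{
		unsigned int seq;

		guard(rcu)();

		do {
			seq = read_seqcount_begin(&uprobes_seqcount);
			struct uprobe *uprobe = __find_uprobe(inode, offset);
			if (uprobe)
				return uprobe;	/* found: no validation needed */
		} while (read_seqcount_retry(&uprobes_seqcount, seq));

		return NULL;	/* validated miss */
	}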
Re: [PATCH v2 08/11] perf/uprobe: Convert (some) uprobe->refcount to SRCU
On Thu, Jul 11, 2024 at 01:02:43PM +0200, Peter Zijlstra wrote: SNIP > /* Tracing handlers use ->utask to communicate with fetch methods */ > if (!get_utask()) > - goto out; > + return; > > if (arch_uprobe_ignore(&uprobe->arch, regs)) > - goto out; > + return; > > handler_chain(uprobe, regs); > > if (arch_uprobe_skip_sstep(&uprobe->arch, regs)) > - goto out; > - > - if (!pre_ssout(uprobe, regs, bp_vaddr)) > return; > > - /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */ > -out: > - put_uprobe(uprobe); > + pre_ssout(uprobe, regs, bp_vaddr); pre_ssout could now return void jirka > } > > /* > >
Re: [PATCH v2] ring-buffer: Limit time with disabled interrupts in rb_check_pages()
On Thu, 4 Jul 2024 13:03:47 +0200 Petr Pavlu wrote: > > I'm dumb. What's an "era"? > > I meant it as a calendar era or epoch. The idea was to hint this is > a number that identifies some structural state of the pages list. Maybe > pages_gen ("generation") or another name would be better? Ah, out of context I thought it was short for something. Perhaps just use "cnt" with a comment, as that can be generic enough for what it is. Thanks, -- Steve
Re: [PATCH] test/vsock: add install target
On Thu, 11 Jul 2024 15:38:01 +0200 Stefan Hajnoczi wrote: > > Usually vsock tests test both the driver (virtio-vsock) in the guest and the > > device in the host kernel (vhost-vsock). So I usually run the tests in 2 > > nested VMs to test the latest changes for both the guest and the host. > > > > I don't know enough selftests, but do you think it is possible to integrate > > them? > > > > CCing Stefan who is the original author and may remember more reasons about > > this choice. > > It's probably because of the manual steps in tools/testing/vsock/README: > > The following prerequisite steps are not automated and must be performed > prior > to running tests: > > 1. Build the kernel, make headers_install, and build these tests. > 2. Install the kernel and tests on the host. > 3. Install the kernel and tests inside the guest. > 4. Boot the guest and ensure that the AF_VSOCK transport is enabled. > > If you want to automate this for QEMU, VMware, and Hyper-V that would be > great. It relies on having a guest running under these hypervisors and > that's not trivial to automate (plus it involves proprietary software > for VMware and Hyper-V that may not be available without additional > license agreements and/or payment). Not sure if there's a requirement that full process is automated. Or at least if there is we are already breaking it in networking because for some tests we need user to export some env variables to point the test to the right interfaces and even a remote machine to generate traffic. If the env isn't set up tests return 4 (SKIP). I don't feel strongly that ksft + env approach is better but at least it gives us easy access to the basic build and packaging features from ksft. Up to you but thought I'd ask.
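For reference, the env-gating pattern Jakub describes maps onto the kselftest
convention of exiting with code 4 for SKIP. A minimal sketch of how a vsock
test could gate on its environment, assuming the usual kselftest.h helpers;
the environment variable name is invented for illustration:

	/*
	 * Sketch of the "skip if the environment isn't set up" pattern from
	 * kselftests. VSOCK_TEST_PEER_CID is a hypothetical variable name;
	 * ksft_exit_skip() reports the standard kselftest SKIP result (4).
	 */
	#include <stdlib.h>
	#include "../kselftest.h"

	int main(void)
	{
		if (!getenv("VSOCK_TEST_PEER_CID"))
			ksft_exit_skip("vsock test environment not configured\n");

		/* ... run the actual guest/host tests here ... */
		ksft_exit_pass();
	}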
Re: [PATCH RFC 1/1] remoteproc: mediatek: Support reserved CMA regions
On Wed, 10 Jul 2024 at 02:42, Shun-yi Wang wrote: > > From: "shun-yi.wang" > > In order to reserve specific Contiguous Memory Allocator (CMA) regions > for hardware use. When the name of the reserved region contains "cma", > then a corresponding CMA heap is added. > > Signed-off-by: shun-yi.wang > --- > drivers/remoteproc/mtk_scp.c | 38 > 1 file changed, 30 insertions(+), 8 deletions(-) > I'm not sure what to do with this patch... Is it a superset of your other patch [1]? And if so why is it labelled as an RFC? Please read the documentation on submitting patches [2] and subscribe to the remoteproc mailing list to give you an idea of the patch workflow that is expected. Thanks, Mathieu [1]. [PATCH 1/1] remoteproc: mediatek: Support multiple reserved memory regions [2]. https://elixir.bootlin.com/linux/latest/source/Documentation/process/submitting-patches.rst > diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c > index e744c07507ee..ca0a4a52ece9 100644 > --- a/drivers/remoteproc/mtk_scp.c > +++ b/drivers/remoteproc/mtk_scp.c > @@ -4,6 +4,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -1006,22 +1007,43 @@ EXPORT_SYMBOL_GPL(scp_mapping_dm_addr); > > static int scp_map_memory_region(struct mtk_scp *scp) > { > - int ret; > + int ret, i, err; > const struct mtk_scp_sizes_data *scp_sizes; > + struct device_node *node = scp->dev->of_node; > + struct of_phandle_iterator it; > + > + i = 0; > + of_for_each_phandle(&it, err, node, "memory-region", NULL, 0) { > + ret = of_reserved_mem_device_init_by_idx(scp->dev, node, i); > + > + if (ret) { > + dev_err(scp->dev, "failed to assign memory-region: > %s\n", > + it.node->name); > + of_node_put(it.node); > + return -ENOMEM; > + } > + > +#ifdef CONFIG_DMABUF_HEAPS_CMA > + if (strstr(it.node->name, "cma")) { > + /* Reserved cma memory region */ > + ret = dma_heap_add_cma(scp->dev); > + if (ret) { > + dev_err(scp->dev, "Failed to add reserved > cma"); > + of_node_put(it.node); > + return ret; > + } > + } > +#endif /* CONFIG_DMABUF_HEAPS_CMA */ > > - ret = of_reserved_mem_device_init(scp->dev); > + i++; > + } > > /* reserved memory is optional. */ > - if (ret == -ENODEV) { > + if (!i) { > dev_info(scp->dev, "skipping reserved memory > initialization."); > return 0; > } > > - if (ret) { > - dev_err(scp->dev, "failed to assign memory-region: %d\n", > ret); > - return -ENOMEM; > - } > - > /* Reserved SCP code size */ > scp_sizes = scp->data->scp_sizes; > scp->cpu_addr = dma_alloc_coherent(scp->dev, scp_sizes->max_dram_size, > -- > 2.18.0 >
Re: [syzbot] [mm?] possible deadlock in __mmap_lock_do_trace_released
On Thu, 4 Jul 2024 22:12:45 +0200 Jesper Dangaard Brouer wrote: > > > > WARNING: possible recursive locking detected > > 6.10.0-rc2-syzkaller-00797-ga12978712d90 #0 Not tainted > > > > syz-executor646/5097 is trying to acquire lock: > > 8880b94387e8 (lock#9){+.+.}-{2:2}, at: local_lock_acquire > > include/linux/local_lock_internal.h:29 [inline] > > 8880b94387e8 (lock#9){+.+.}-{2:2}, at: > > __mmap_lock_do_trace_released+0x83/0x620 mm/mmap_lock.c:243 > > > > but task is already holding lock: > > 8880b94387e8 (lock#9){+.+.}-{2:2}, at: local_lock_acquire > > include/linux/local_lock_internal.h:29 [inline] > > 8880b94387e8 (lock#9){+.+.}-{2:2}, at: > > __mmap_lock_do_trace_released+0x83/0x620 mm/mmap_lock.c:243 > > > > other info that might help us debug this: > > Possible unsafe locking scenario: > > > > CPU0 > > > >lock(lock#9); > >lock(lock#9); > > > > *** DEADLOCK *** Looks like it's trying to take the rwsem mm->mmap_lock recursively. And rwsems are *not* allowed to be recursively taken, as once there's a writer, all new acquires of the reader will block. Then you can have: CPU0 CPU1 down_read(lockA); down_write(lockA); // blocks down_read(lockA); //blocks DEADLOCK! > > > > May be due to missing lock nesting notation > > > > To me, this looks like a lockdep false-positive, but I might be wrong. > > Could someone with more LOCKDEP knowledge give their interpretation? > > The commit[1] adds a fairly standard trylock scheme. > Do I need to lockdep annotate trylock's in some special way? > > [1] https://git.kernel.org/torvalds/c/21c38a3bd4ee3fb733 > > Also notice change uses raw_spin_lock, which might be harder for lockdep? > So, I also enabled CONFIG_PROVE_RAW_LOCK_NESTING in my testlab to help > with this, and CONFIG_PROVE_LOCKING. > (And obviously I also enabled LOCKDEP*) > > --Jesper > > > 5 locks held by syz-executor646/5097: > > #0: 8880182eb118 (&mm->mmap_lock){}-{3:3}, at: mmap_read_lock > > include/linux/mmap_lock.h:144 [inline] > > #0: 8880182eb118 (&mm->mmap_lock){}-{3:3}, at: > > acct_collect+0x1cf/0x830 kernel/acct.c:563 > > #1: 8880b94387e8 (lock#9){+.+.}-{2:2}, at: local_lock_acquire > > include/linux/local_lock_internal.h:29 [inline] > > #1: 8880b94387e8 (lock#9){+.+.}-{2:2}, at: > > __mmap_lock_do_trace_released+0x83/0x620 mm/mmap_lock.c:243 > > #2: 8e333fa0 (rcu_read_lock){}-{1:2}, at: rcu_lock_acquire > > include/linux/rcupdate.h:329 [inline] > > #2: 8e333fa0 (rcu_read_lock){}-{1:2}, at: rcu_read_lock > > include/linux/rcupdate.h:781 [inline] > > #2: 8e333fa0 (rcu_read_lock){}-{1:2}, at: get_memcg_path_buf > > mm/mmap_lock.c:139 [inline] > > #2: 8e333fa0 (rcu_read_lock){}-{1:2}, at: > > get_mm_memcg_path+0xb1/0x600 mm/mmap_lock.c:209 > > #3: 8e333fa0 (rcu_read_lock){}-{1:2}, at: > > trace_call_bpf+0xbc/0x8a0 > > #4: 8880182eb118 (&mm->mmap_lock){}-{3:3}, at: mmap_read_trylock > > include/linux/mmap_lock.h:163 [inline] > > #4: 8880182eb118 (&mm->mmap_lock){}-{3:3}, at: > > stack_map_get_build_id_offset+0x237/0x9d0 kernel/bpf/stackmap.c:141 > > > > stack backtrace: > > CPU: 0 PID: 5097 Comm: syz-executor646 Not tainted > > 6.10.0-rc2-syzkaller-00797-ga12978712d90 #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 06/07/2024 > > Call Trace: > > > > __dump_stack lib/dump_stack.c:88 [inline] > > dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 > > check_deadlock kernel/locking/lockdep.c:3062 [inline] > > validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3856 > > __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137 > > 
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754 > > local_lock_acquire include/linux/local_lock_internal.h:29 [inline] > > __mmap_lock_do_trace_released+0x9c/0x620 mm/mmap_lock.c:243 Here we have: static inline void mmap_read_lock(struct mm_struct *mm) { __mmap_lock_trace_start_locking(mm, false); down_read(&mm->mmap_lock); __mmap_lock_trace_acquire_returned(mm, false, true); } Which is taking the mm->mmap_lock for read. > > __mmap_lock_trace_released include/linux/mmap_lock.h:42 [inline] > > mmap_read_unlock include/linux/mmap_lock.h:170 [inline] > > bpf_mmap_unlock_mm kernel/bpf/mmap_unlock_work.h:52 [inline] > > stack_map_get_build_id_offset+0x9c7/0x9d0 kernel/bpf/stackmap.c:173 > > __bpf_get_stack+0x4ad/0x5a0 kernel/bpf/stackmap.c:449 > > bpf_prog_e6cf5f9c69743609+0x42/0x46 > > bpf_dispatcher_nop_func include/linux/bpf.h:1243 [inline] > > __bpf_prog_run include/linux/filter.h:691 [inline] > > bpf_prog_run include/linux/filter.h:698 [inline] > > bpf
Re: [PATCH v2 00/60] i2c: reword first drivers according to newest specification
> Thanks for this big work; in the end it turned out quite nice and
> I'm happy with the outcome!

Me too. And thanks for the enormous review work!
[PATCH net-next v3 0/2] vsock: avoid queuing on workqueue if possible
This series introduces an optimization for vsock/virtio to reduce latency
and increase throughput: when the guest sends a packet to the host, and the
workqueue is empty, if there is enough space, the packet is put directly in
the virtqueue.

v2->v3
- Performed more experiments using iperf3 with multiple streams
- Handling of reply packets removed from virtio_transport_send_skb,
  as it is needed just by the worker.
- Removed atomic_inc/atomic_sub when queuing directly to the vq.
- Introduced virtio_transport_send_skb_fast_path that handles the steps
  for sending on the vq.
- Fixed a missing mutex_unlock in the error path.
- Changed authorship of the second commit
- Rebased on latest net-next

v1->v2
In this v2 I replaced a mutex_lock with a mutex_trylock because it was
inside a RCU critical section. I also added a check on tx_run, so if the
module is being removed the packet is not queued. I'd like to thank Stefano
for reporting the tx_run issue.

Applied all Stefano's suggestions:
- Minor code style changes
- Minor commit text rewrite

Performed more experiments:
- Check if all the packets go directly to the vq (Matias' suggestion)
- Used iperf3 to see if there is any improvement in overall throughput
  from guest to host
- Pinned the vhost process to a pCPU.
- Ran fio using a 512B payload

Rebased on latest net-next

To: Stefan Hajnoczi
To: Stefano Garzarella
To: David S. Miller
To: Eric Dumazet
To: Jakub Kicinski
To: Paolo Abeni
Cc: k...@vger.kernel.org
Cc: virtualizat...@lists.linux.dev
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Signed-off-by: Luigi Leonardi

---
Luigi Leonardi (1):
      vsock/virtio: avoid queuing packets when work queue is empty

Marco Pinna (1):
      vsock/virtio: refactor virtio_transport_send_pkt_work

 net/vmw_vsock/virtio_transport.c | 143 +--
 1 file changed, 93 insertions(+), 50 deletions(-)
---
base-commit: 58f9416d413aa2c20b2515233ce450a1607ef843
change-id: 20240711-pinna-49bf0ab09909

Best regards,
--
Luigi Leonardi
[PATCH net-next v3 2/2] vsock/virtio: avoid queuing packets when work queue is empty
From: Luigi Leonardi

Introduce an optimization in virtio_transport_send_pkt: when the work
queue (send_pkt_queue) is empty, the packet is put directly in the
virtqueue, increasing the throughput.

In the following benchmark (pingpong mode) the host sends a payload to the
guest and waits for the same payload back.

All vCPUs pinned individually to pCPUs.
vhost process pinned to a pCPU.
fio process pinned both inside the host and the guest system.

Host CPU: Intel i7-10700KF CPU @ 3.80GHz
Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before: Linux 6.9.7

Payload 512B:

         1st perc.  overall  99th perc.
Before   370        810.15   8656 ns
After    374        780.29   8741 ns

Payload 4K:

         1st perc.  overall  99th perc.
Before   460        1720.23  42752 ns
After    460        1520.84  36096 ns

The performance improvement is related to this optimization; I used eBPF
to check that each packet was sent directly to the virtqueue.

Throughput: iperf-vsock

The size represents the buffer length (-l) to read/write;
P represents the number of parallel streams.

P=1
         4K    64K   128K
Before   6.87  29.3  29.5  Gb/s
After    10.5  39.4  39.9  Gb/s

P=2
         4K    64K   128K
Before   10.5  32.8  33.2  Gb/s
After    17.8  47.7  48.5  Gb/s

P=4
         4K    64K   128K
Before   12.7  33.6  34.2  Gb/s
After    16.9  48.1  50.5  Gb/s

Co-developed-by: Marco Pinna
Signed-off-by: Marco Pinna
Signed-off-by: Luigi Leonardi
---
 net/vmw_vsock/virtio_transport.c | 38 ++
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index c4205c22f40b..d75727fdc35f 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -208,6 +208,29 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
 }
 
+/* Caller need to hold RCU for vsock.
+ * Returns 0 if the packet is successfully put on the vq.
+ */
+static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struct sk_buff *skb)
+{
+	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
+	int ret;
+
+	/* Inside RCU, can't sleep! */
+	ret = mutex_trylock(&vsock->tx_lock);
+	if (unlikely(ret == 0))
+		return -EBUSY;
+
+	ret = virtio_transport_send_skb(skb, vq, vsock);
+
+	mutex_unlock(&vsock->tx_lock);
+
+	/* Kick if virtio_transport_send_skb succeeded */
+	if (ret == 0)
+		virtqueue_kick(vq);
+	return ret;
+}
+
 static int
 virtio_transport_send_pkt(struct sk_buff *skb)
 {
@@ -231,11 +254,18 @@ virtio_transport_send_pkt(struct sk_buff *skb)
 		goto out_rcu;
 	}
 
-	if (virtio_vsock_skb_reply(skb))
-		atomic_inc(&vsock->queued_replies);
+	/* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.
+	 * Just put it on the virtqueue using virtio_transport_send_skb_fast_path.
+	 */
 
-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
+	    virtio_transport_send_skb_fast_path(vsock, skb)) {
+		/* Packet must be queued */
+		if (virtio_vsock_skb_reply(skb))
+			atomic_inc(&vsock->queued_replies);
+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	}
 
 out_rcu:
 	rcu_read_unlock();
-- 
2.45.2
[PATCH net-next v3 1/2] vsock/virtio: refactor virtio_transport_send_pkt_work
From: Marco Pinna Preliminary patch to introduce an optimization to the enqueue system. All the code used to enqueue a packet into the virtqueue is removed from virtio_transport_send_pkt_work() and moved to the new virtio_transport_send_skb() function. Co-developed-by: Luigi Leonardi Signed-off-by: Luigi Leonardi Signed-off-by: Marco Pinna --- net/vmw_vsock/virtio_transport.c | 105 ++- 1 file changed, 59 insertions(+), 46 deletions(-) diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c index 43d405298857..c4205c22f40b 100644 --- a/net/vmw_vsock/virtio_transport.c +++ b/net/vmw_vsock/virtio_transport.c @@ -94,6 +94,63 @@ static u32 virtio_transport_get_local_cid(void) return ret; } +/* Caller need to hold vsock->tx_lock on vq */ +static int virtio_transport_send_skb(struct sk_buff *skb, struct virtqueue *vq, +struct virtio_vsock *vsock) +{ + int ret, in_sg = 0, out_sg = 0; + struct scatterlist **sgs; + + sgs = vsock->out_sgs; + sg_init_one(sgs[out_sg], virtio_vsock_hdr(skb), + sizeof(*virtio_vsock_hdr(skb))); + out_sg++; + + if (!skb_is_nonlinear(skb)) { + if (skb->len > 0) { + sg_init_one(sgs[out_sg], skb->data, skb->len); + out_sg++; + } + } else { + struct skb_shared_info *si; + int i; + + /* If skb is nonlinear, then its buffer must contain +* only header and nothing more. Data is stored in +* the fragged part. +*/ + WARN_ON_ONCE(skb_headroom(skb) != sizeof(*virtio_vsock_hdr(skb))); + + si = skb_shinfo(skb); + + for (i = 0; i < si->nr_frags; i++) { + skb_frag_t *skb_frag = &si->frags[i]; + void *va; + + /* We will use 'page_to_virt()' for the userspace page +* here, because virtio or dma-mapping layers will call +* 'virt_to_phys()' later to fill the buffer descriptor. +* We don't touch memory at "virtual" address of this page. +*/ + va = page_to_virt(skb_frag_page(skb_frag)); + sg_init_one(sgs[out_sg], + va + skb_frag_off(skb_frag), + skb_frag_size(skb_frag)); + out_sg++; + } + } + + ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL); + /* Usually this means that there is no more space available in +* the vq +*/ + if (ret < 0) + return ret; + + virtio_transport_deliver_tap_pkt(skb); + return 0; +} + static void virtio_transport_send_pkt_work(struct work_struct *work) { @@ -111,66 +168,22 @@ virtio_transport_send_pkt_work(struct work_struct *work) vq = vsock->vqs[VSOCK_VQ_TX]; for (;;) { - int ret, in_sg = 0, out_sg = 0; - struct scatterlist **sgs; struct sk_buff *skb; bool reply; + int ret; skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue); if (!skb) break; reply = virtio_vsock_skb_reply(skb); - sgs = vsock->out_sgs; - sg_init_one(sgs[out_sg], virtio_vsock_hdr(skb), - sizeof(*virtio_vsock_hdr(skb))); - out_sg++; - - if (!skb_is_nonlinear(skb)) { - if (skb->len > 0) { - sg_init_one(sgs[out_sg], skb->data, skb->len); - out_sg++; - } - } else { - struct skb_shared_info *si; - int i; - - /* If skb is nonlinear, then its buffer must contain -* only header and nothing more. Data is stored in -* the fragged part. -*/ - WARN_ON_ONCE(skb_headroom(skb) != sizeof(*virtio_vsock_hdr(skb))); - - si = skb_shinfo(skb); - for (i = 0; i < si->nr_frags; i++) { - skb_frag_t *skb_frag = &si->frags[i]; - void *va; - - /* We will use 'page_to_virt()' for the userspace page -* here, because virtio or dma-mapping layers will call -* 'virt_to_phys()' later to fill the buffer descriptor. -* We don't touch memory at "virtual" address of this page. -*/ -
Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
On Thu, Jul 11, 2024 at 03:19:19PM +0200, Oleg Nesterov wrote: > Not sure I read this patch correctly, but at first glance it looks > suspicious.. > > On 07/11, Peter Zijlstra wrote: > > > > +static void return_instance_timer(struct timer_list *timer) > > +{ > > + struct uprobe_task *utask = container_of(timer, struct uprobe_task, > > ri_timer); > > + task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL); > > +} > > What if utask->task sleeps in TASK_STOPPED/TASK_TRACED state before > return from the ret-probed function? > > In this case it won't react to TWA_SIGNAL until debugger or SIGCONT > wakes it up. Or FROZEN etc.. Yeah. > --- > And it seems that even task_work_add() itself is not safe... > > Suppose we have 2 ret-probed functions > > void f2() { ... } > void f1() { ...; f2(); } > > A task T calls f1(), hits the bp, and calls prepare_uretprobe() which does > > mod_timer(&utask->ri_timer, jiffies + HZ); > > Then later it calls f2() and the pending timer expires after it enters the > kernel, but before the next prepare_uretprobe() -> mod_timer(). > > In this case ri_task_work is already queued and the timer is pending again. You're saying we can hit a double enqueue, right? Yeah, that's a problem. But that can be fairly easily rectified. > Now. Even if T goes to the exit_to_user_mode_loop() path immediately, in > theory nothing can guarantee that it will call get_signal/task_work_run > in less than 1 second, it can be preempted. > > But T can sleep in xol_take_insn_slot() before return from handle_swbp(), > and this is not so theoretical. So the assumption is that kernel code makes forward progress. If we get preempted, we'll get scheduled again. If the machine is so overloaded this takes more than a second, stretching the SRCU period is the least of your problems. Same with sleeps, it'll get a wakeup. The only thing that is out of our control is userspace. And yes, I had not considered STOPPED/TRACED/FROZEN. So the reason I did that task_work is because the return_instance list is strictly current, so a random timer cannot safely poke at it. And barring those pesky states, it does as desired. Let me ponder that a little, I *can* make it work, but all 'solutions' I've come up with so far are really rather vile.
[ANNOUNCE] 5.10.221-rt113
Hello RT-list! I'm pleased to announce the 5.10.221-rt113 stable release. This release is just an update to the new stable 5.10.221 version and no RT specific changes have been made. You can get this release via the git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git branch: v5.10-rt Head SHA1: ea80a8cd61dcdda428f61d90b3ace2c4d7682d2c Or to build 5.10.221-rt113 directly, the following patches should be applied: https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.221.xz https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.221-rt113.patch.xz Signing key fingerprint: 9354 0649 9972 8D31 D464 D140 F394 A423 F8E6 7C26 All keys used for the above files and repositories can be found on the following git repository: git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git Enjoy! Luis
Re: [PATCH 2/3] uprobes: simplify error handling for alloc_uprobe()
On Wed, 10 Jul 2024 18:31:11 +0200 Oleg Nesterov wrote: > From: Andrii Nakryiko > > Return -ENOMEM instead of NULL, which makes caller's error handling just > a touch simpler. > > Signed-off-by: Andrii Nakryiko > Signed-off-by: Oleg Nesterov Looks good to me. Reviewed-by: Masami Hiramatsu (Google) > --- > kernel/events/uprobes.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 4950decebe5c..e5b7c6239970 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, > loff_t offset, > > uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); > if (!uprobe) > - return NULL; > + return ERR_PTR(-ENOMEM); > > uprobe->inode = inode; > uprobe->offset = offset; > @@ -1166,8 +1166,6 @@ int uprobe_register(struct inode *inode, loff_t offset, > loff_t ref_ctr_offset, > > retry: > uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); > - if (!uprobe) > - return -ENOMEM; > if (IS_ERR(uprobe)) > return PTR_ERR(uprobe); > > -- > 2.25.1.362.g51ebf55 > > -- Masami Hiramatsu (Google)
Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
On Thu, Jul 11, 2024 at 05:00:54PM +0200, Peter Zijlstra wrote:

> Let me ponder that a little, I *can* make it work, but all 'solutions'
> I've come up with so far are really rather vile.

This is the least horrible solution I could come up with...

---
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -83,6 +83,7 @@ struct uprobe_task {
 
 	struct timer_list	ri_timer;
 	struct callback_head	ri_task_work;
+	bool			ri_work_pending;
 	struct task_struct	*task;
 };
 
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1797,9 +1797,8 @@ void uprobe_free_utask(struct task_struc
 	t->utask = NULL;
 }
 
-static void return_instance_task_work(struct callback_head *head)
+static void __return_instance_put(struct uprobe_task *utask)
 {
-	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
 	struct return_instance *ri;
 
 	for (ri = utask->return_instances; ri; ri = ri->next) {
@@ -1815,9 +1814,43 @@ static void return_instance_task_work(st
 	}
 }
 
+static void return_instance_task_work(struct callback_head *head)
+{
+	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
+	utask->ri_work_pending = false;
+	__return_instance_put(utask);
+}
+
+static int return_instance_blocked(struct task_struct *p, void *arg)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	if (state == TASK_RUNNING || state == TASK_WAKING)
+		return 0;
+
+	smp_rmb();
+	if (p->on_rq)
+		return 0;
+
+	/*
+	 * Per __task_needs_rq_locked() we now have: !p->on_cpu and only hold
+	 * p->pi_lock, and can consider the task fully blocked.
+	 */
+
+	__return_instance_put(p->utask);
+	return 1;
+}
+
 static void return_instance_timer(struct timer_list *timer)
 {
 	struct uprobe_task *utask = container_of(timer, struct uprobe_task, ri_timer);
+	if (utask->ri_work_pending)
+		return;
+
+	if (task_call_func(utask->task, return_instance_blocked, NULL))
+		return;
+
+	utask->ri_work_pending = true;
 	task_work_add(utask->task, &utask->ri_task_work, TWA_SIGNAL);
 }
Re: [PATCH v2 11/11] perf/uprobe: Add uretprobe timer
On Thu, Jul 11, 2024 at 05:55:42PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2024 at 05:00:54PM +0200, Peter Zijlstra wrote:
>
> > Let me ponder that a little, I *can* make it work, but all 'solutions'
> > I've come up with so far are really rather vile.
>
> This is the least horrible solution I could come up with...
>
> ---
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -83,6 +83,7 @@ struct uprobe_task {
>
>  	struct timer_list	ri_timer;
>  	struct callback_head	ri_task_work;
> +	bool			ri_work_pending;
>  	struct task_struct	*task;
>  };
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1797,9 +1797,8 @@ void uprobe_free_utask(struct task_struc
>  	t->utask = NULL;
>  }
>
> -static void return_instance_task_work(struct callback_head *head)
> +static void __return_instance_put(struct uprobe_task *utask)
>  {
> -	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
>  	struct return_instance *ri;
>
>  	for (ri = utask->return_instances; ri; ri = ri->next) {
> @@ -1815,9 +1814,43 @@ static void return_instance_task_work(st
>  	}
>  }
>
> +static void return_instance_task_work(struct callback_head *head)
> +{
> +	struct uprobe_task *utask = container_of(head, struct uprobe_task, ri_task_work);
> +	utask->ri_work_pending = false;
> +	__return_instance_put(utask);
> +}
> +
> +static int return_instance_blocked(struct task_struct *p, void *arg)
> +{
> +	unsigned int state = READ_ONCE(p->__state);
> +
> +	if (state == TASK_RUNNING || state == TASK_WAKING)
> +		return 0;
> +
> +	smp_rmb();
> +	if (p->on_rq)
> +		return 0;
> +
> +	/*
> +	 * Per __task_needs_rq_locked() we now have: !p->on_cpu and only hold
> +	 * p->pi_lock, and can consider the task fully blocked.
> +	 */
> +
> +	__return_instance_put(p->utask);

PREEMPT_RT might not like this though, doing the full RI iteration under a
raw_spinlock_t...

I just can't think of anything saner just now. Oh well, let me go make
dinner, perhaps something will come to me.
Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
I'll try to actually apply the whole series and read the code tomorrow. Right now I can't understand this change... Just one question for now. On 07/11, Peter Zijlstra wrote: > > @@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct upr >* attack from user-space. >*/ > uprobe_warn(current, "handle tail call"); > - goto err_uprobe; > + goto err_mem; > } > orig_ret_vaddr = utask->return_instances->orig_ret_vaddr; > } > > + ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu); > + ri->uprobe = uprobe; It seems that, if we race with _unregister, this __srcu_read_lock() can happen after call_srcu(uprobes_srcu, uprobe, uprobe_free_stage1) was already called... In this case read_lock "has no effect" in that uprobe_free_stage1() can run before free_ret_instance() does srcu_read_unlock(ri->srcu_idx). Perhaps it is fine, uprobe_free_stage1() does another call_srcu(), but somehow I got lost. Could you re-check this logic? Most probably I missed something, but still... Oleg.
Re: [PATCH 3/3] uprobes: make uprobe_register() return struct uprobe *
On Thu, Jul 11, 2024 at 2:28 AM Oleg Nesterov wrote: > > On 07/10, Oleg Nesterov wrote: > > > > -void uprobe_unregister(struct inode *inode, loff_t offset, struct > > uprobe_consumer *uc) > > +void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) > > { > > - struct uprobe *uprobe; > > - > > - uprobe = find_uprobe(inode, offset); > > - if (WARN_ON(!uprobe)) > > - return; > > - > > down_write(&uprobe->register_rwsem); > > __uprobe_unregister(uprobe, uc); > > up_write(&uprobe->register_rwsem); > > - put_uprobe(uprobe); > > OK, this is obviously wrong, needs get_uprobe/put_uprobe. > __uprobe_unregister() > can free this uprobe, so up_write(&uprobe->register_rwsem) is not safe. uprobe_register(), given it returns an uprobe instance to the caller should keep refcount on it (it belongs to uprobe_consumer). That's what I did for my patches, are you going to do that as well? We basically do the same thing, just interfaces look a bit different. > > I'll send V2 on top of Peter's new version. > > Oleg. >
Re: [PATCH] lib: test_objpool: add missing MODULE_DESCRIPTION() macro
On 6/2/24 23:45, Masami Hiramatsu (Google) wrote:
> On Mon, 3 Jun 2024 11:25:59 +0800 "wuqiang.matt" wrote:
>> On 2024/6/1 08:31, Jeff Johnson wrote:
>>> make allmodconfig && make W=1 C=1 reports:
>>>
>>> WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_objpool.o
>>>
>>> Add the missing invocation of the MODULE_DESCRIPTION() macro.
>>>
>>> Signed-off-by: Jeff Johnson
>>> ---
>>>  lib/test_objpool.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/test_objpool.c b/lib/test_objpool.c
>>> index bfdb81599832..5a3f6961a70f 100644
>>> --- a/lib/test_objpool.c
>>> +++ b/lib/test_objpool.c
>>> @@ -687,4 +687,5 @@ static void __exit ot_mod_exit(void)
>>>  module_init(ot_mod_init);
>>>  module_exit(ot_mod_exit);
>>> -MODULE_LICENSE("GPL");
>>> \ No newline at end of file
>>> +MODULE_DESCRIPTION("Test module for lockless object pool");
>>> +MODULE_LICENSE("GPL");
>>>
>>> ---
>>> base-commit: b050496579632f86ee1ef7e7501906db579f3457
>>> change-id: 20240531-md-lib-test_objpool-338d937f8666
>>
>> Looks good to me. Thanks for the update. I added Masami Hiramatsu and
>> linux-trace to the loop.
>>
>> Reviewed-by: Matt Wu
>
> Thanks, let me pick this to probes/for-next branch.

Following up since I don't see this in linux-next. I'm hoping to have
these warnings fixed tree-wide in 6.11.

/jeff
Re: [PATCH 3/3] uprobes: make uprobe_register() return struct uprobe *
On 07/11, Andrii Nakryiko wrote:
>
> On Thu, Jul 11, 2024 at 2:28 AM Oleg Nesterov wrote:
> >
> > On 07/10, Oleg Nesterov wrote:
> > >
> > > -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> > > +void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> > > {
> > > -	struct uprobe *uprobe;
> > > -
> > > -	uprobe = find_uprobe(inode, offset);
> > > -	if (WARN_ON(!uprobe))
> > > -		return;
> > > -
> > >	down_write(&uprobe->register_rwsem);
> > >	__uprobe_unregister(uprobe, uc);
> > >	up_write(&uprobe->register_rwsem);
> > > -	put_uprobe(uprobe);
> >
> > OK, this is obviously wrong, needs get_uprobe/put_uprobe. __uprobe_unregister()
> > can free this uprobe, so up_write(&uprobe->register_rwsem) is not safe.
>
> uprobe_register(), given it returns an uprobe instance to the caller,
> should keep refcount on it (it belongs to uprobe_consumer).

Of course. And again, this patch doesn't change the current behaviour.

> That's
> what I did for my patches, are you going to do that as well?
>
> We basically do the same thing, just interfaces look a bit different.

Not sure. Well I do not really know, I didn't read your series to the end,
sorry ;) The same for V1/V2 from Peter so far.

But let me say this just in case... With or without this change, currently
uprobe_consumer doesn't have an "individual" ref to uprobe. The fact that
uprobe->consumers != NULL adds a reference. Let's not discuss if this is
good or bad right now, this cleanup is only cleanup.

Now, let me add another "just in case" note to explain what I am going to
do in V2. So, this patch should turn uprobe_unregister() into something
like

	void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
	{
		// Ugly please kill me!!!
		get_uprobe(uprobe);

		down_write(&uprobe->register_rwsem);
		__uprobe_unregister(uprobe, uc);
		up_write(&uprobe->register_rwsem);
		put_uprobe(uprobe);
	}

to simplify this change. And the next (simple) patch will kill these
get_uprobe + put_uprobe, we just need to shift the (possibly) final
put_uprobe() from delete_uprobe() to unregister().

But of course, I will recheck before I send V2.

Oleg.
Re: [PATCH v2 10/11] perf/uprobe: Convert single-step and uretprobe to SRCU
On Thu, Jul 11, 2024 at 06:06:53PM +0200, Oleg Nesterov wrote:
> I'll try to actually apply the whole series and read the code tomorrow.
> Right now I can't understand this change... Just one question for now.
>
> On 07/11, Peter Zijlstra wrote:
> >
> > @@ -1956,11 +1960,13 @@ static void prepare_uretprobe(struct upr
> >	 * attack from user-space.
> >	 */
> >	uprobe_warn(current, "handle tail call");
> > -	goto err_uprobe;
> > +	goto err_mem;
> >	}
> >	orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
> >	}
> >
> > +	ri->srcu_idx = __srcu_read_lock(&uretprobes_srcu);
> > +	ri->uprobe = uprobe;
>
> It seems that, if we race with _unregister, this __srcu_read_lock()
> can happen after call_srcu(uprobes_srcu, uprobe, uprobe_free_stage1)
> was already called...
>
> In this case read_lock "has no effect" in that uprobe_free_stage1()
> can run before free_ret_instance() does srcu_read_unlock(ri->srcu_idx).
>
> Perhaps it is fine, uprobe_free_stage1() does another call_srcu(),
> but somehow I got lost.
>
> Could you re-check this logic? Most probably I missed something, but still...

	handle_swbp()
	  guard(srcu)(&uprobes_srcu);
	  handle_chain();
	    prepare_uretprobe()
	      __srcu_read_lock(&uretprobe_srcu);

vs

	uprobe_free_stage2
	  kfree(uprobe)

	uprobe_free_stage1
	  call_srcu(&uretprobe_srcu, &uprobe->rcu, uprobe_free_stage2);

	put_uprobe()
	  if (refcount_dec_and_test)
	    call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);

So my thinking was since we take uretprobe_srcu *inside* uprobe_srcu, this
reference must be visible before we execute stage1, and as such stage2
cannot complete prematurely.
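To make the two-stage teardown in the diagram concrete, a sketch of the
chained call_srcu() callbacks is below, reconstructed from the names in the
diagram; the struct layout and exact bodies are assumed rather than copied
from the posted series:

	/*
	 * Sketch of the two-stage free from the diagram above. Stage 1 runs
	 * once all uprobes_srcu readers are done and then waits out the
	 * uretprobes_srcu readers; stage 2 performs the final kfree().
	 */
	static void uprobe_free_stage2(struct rcu_head *rcu)
	{
		struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);

		kfree(uprobe);
	}

	static void uprobe_free_stage1(struct rcu_head *rcu)
	{
		struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);

		/* No new uprobes_srcu readers can see this uprobe anymore;
		 * still wait for uretprobes_srcu readers before freeing. */
		call_srcu(&uretprobes_srcu, &uprobe->rcu, uprobe_free_stage2);
	}

	static void put_uprobe(struct uprobe *uprobe)
	{
		if (refcount_dec_and_test(&uprobe->ref))
			call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_stage1);
	}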
Re: [PATCH] virtio: add missing MODULE_DESCRIPTION() macro
On 6/23/24 10:36, Jeff Johnson wrote:
> On 6/2/2024 1:25 PM, Jeff Johnson wrote:
>> make allmodconfig && make W=1 C=1 reports:
>>
>> WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/virtio/virtio_dma_buf.o
>>
>> Add the missing invocation of the MODULE_DESCRIPTION() macro.
>>
>> Signed-off-by: Jeff Johnson
>> ---
>>  drivers/virtio/virtio_dma_buf.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/virtio/virtio_dma_buf.c b/drivers/virtio/virtio_dma_buf.c
>> index 2521a75009c3..3034a2f605c8 100644
>> --- a/drivers/virtio/virtio_dma_buf.c
>> +++ b/drivers/virtio/virtio_dma_buf.c
>> @@ -85,5 +85,6 @@ int virtio_dma_buf_get_uuid(struct dma_buf *dma_buf,
>>  }
>>  EXPORT_SYMBOL(virtio_dma_buf_get_uuid);
>> +MODULE_DESCRIPTION("dma-bufs for virtio exported objects");
>>  MODULE_LICENSE("GPL");
>>  MODULE_IMPORT_NS(DMA_BUF);
>>
>> ---
>> base-commit: 83814698cf48ce3aadc5d88a3f577f04482ff92a
>> change-id: 20240602-md-virtio_dma_buf-b3552ca6c5d5
>
> Following up to see if anything else is needed from me.
> Hoping to see this in linux-next :)

I still don't see this in linux-next so following up to see if anything else
is needed to get this merged. Adding Greg KH since he's signed off on this
file before and he's taken quite a few of my cleanups through his trees.

I'm hoping to have all of these warnings fixed tree-wide in 6.11.

/jeff
Re: [PATCH] virtio: add missing MODULE_DESCRIPTION() macro
On Thu, Jul 11, 2024 at 11:43:18AM -0700, Jeff Johnson wrote: > On 6/23/24 10:36, Jeff Johnson wrote: > > On 6/2/2024 1:25 PM, Jeff Johnson wrote: > > > make allmodconfig && make W=1 C=1 reports: > > > WARNING: modpost: missing MODULE_DESCRIPTION() in > > > drivers/virtio/virtio_dma_buf.o > > > > > > Add the missing invocation of the MODULE_DESCRIPTION() macro. > > > > > > Signed-off-by: Jeff Johnson > > > --- > > > drivers/virtio/virtio_dma_buf.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/drivers/virtio/virtio_dma_buf.c > > > b/drivers/virtio/virtio_dma_buf.c > > > index 2521a75009c3..3034a2f605c8 100644 > > > --- a/drivers/virtio/virtio_dma_buf.c > > > +++ b/drivers/virtio/virtio_dma_buf.c > > > @@ -85,5 +85,6 @@ int virtio_dma_buf_get_uuid(struct dma_buf *dma_buf, > > > } > > > EXPORT_SYMBOL(virtio_dma_buf_get_uuid); > > > +MODULE_DESCRIPTION("dma-bufs for virtio exported objects"); > > > MODULE_LICENSE("GPL"); > > > MODULE_IMPORT_NS(DMA_BUF); > > > > > > --- > > > base-commit: 83814698cf48ce3aadc5d88a3f577f04482ff92a > > > change-id: 20240602-md-virtio_dma_buf-b3552ca6c5d5 > > > > > > > Following up to see if anything else is needed from me. > > Hoping to see this in linux-next :) > > I still don't see this in linux-next so following up to see if anything else > is needed to get this merged. Adding Greg KH since he's signed off on this > file before and he's taken quite a few of my cleanups through his trees. > > I'm hoping to have all of these warnings fixed tree-wide in 6.11. > > /jeff not sure why I tag it and it gets cleared again. tagged again hope it holds now. -- MST
[PATCH 1/2] MAINTAINERS: i2c-virtio: Drop Conghui Chen from Maintainers
E-mails to Conghui Chen have bounced back: : host mgamail.eglb.intel.com[198.175.65.14] said: 550 #5.1.0 Address rejected. (in reply to RCPT TO command) Remove him as maintainer of the i2c Virtio driver in the MAINTAINERS file. Signed-off-by: Andi Shyti Cc: Viresh Kumar Cc: Wolfram Sang Cc: linux-...@vger.kernel.org Cc: virtualizat...@lists.linux.dev --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 3e26555e52bfd..96745f7971100 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -23859,7 +23859,6 @@ S: Maintained F: drivers/vhost/scsi.c VIRTIO I2C DRIVER -M: Conghui Chen M: Viresh Kumar L: linux-...@vger.kernel.org L: virtualizat...@lists.linux.dev -- 2.45.2
Re: [PATCH v2] bootconfig: Remove duplicate included header file linux/bootconfig.h
On Thu, 11 Jul 2024 10:43:16 +0200 Thorsten Blum wrote: > The header file linux/bootconfig.h is included whether __KERNEL__ is > defined or not. > > Include it only once before the #ifdef/#else/#endif preprocessor > directives and remove the following make includecheck warning: > > linux/bootconfig.h is included more than once > > Move the comment to the top and delete the now empty #else block. > > Signed-off-by: Thorsten Blum Thanks, looks good to me. Let me pick it up. > --- > Changes in v2: > - Move comment and delete #else as suggested by Masami Hiramatsu > - Link to v1: > https://lore.kernel.org/linux-kernel/20240711002152.800028-2-thorsten.b...@toblux.com/ > --- > lib/bootconfig.c | 20 +--- > 1 file changed, 9 insertions(+), 11 deletions(-) > > diff --git a/lib/bootconfig.c b/lib/bootconfig.c > index 97f8911ea339..81f29c29f47b 100644 > --- a/lib/bootconfig.c > +++ b/lib/bootconfig.c > @@ -4,8 +4,16 @@ > * Masami Hiramatsu > */ > > -#ifdef __KERNEL__ > +/* > + * NOTE: This is only for tools/bootconfig, because tools/bootconfig will > + * run the parser sanity test. > + * This does NOT mean lib/bootconfig.c is available in the user space. > + * However, if you change this file, please make sure the tools/bootconfig > + * has no issue on building and running. > + */ > #include > + > +#ifdef __KERNEL__ > #include > #include > #include > @@ -24,16 +32,6 @@ const char * __init xbc_get_embedded_bootconfig(size_t > *size) > return (*size) ? embedded_bootconfig_data : NULL; > } > #endif > - > -#else /* !__KERNEL__ */ > -/* > - * NOTE: This is only for tools/bootconfig, because tools/bootconfig will > - * run the parser sanity test. > - * This does NOT mean lib/bootconfig.c is available in the user space. > - * However, if you change this file, please make sure the tools/bootconfig > - * has no issue on building and running. > - */ > -#include > #endif > > /* > -- > 2.39.2 > -- Masami Hiramatsu (Google)
[PATCH 0/2] Cleanup the MAINTAINER's file
Hi,

while reviewing Wolfram's series, I received some delivery failure
notifications for e-mails that don't exist anymore.

With this series I'm removing:

- Conghui Chen
- Thor Thayer

unfortunately both from Intel :-(

In the case of Altera's subsystems (except for the i2c), I didn't really
know what to do with them, so I marked them as Orphan.

Andi

Cc: Andy Shevchenko
Cc: Krzysztof Kozlowski
Cc: Lee Jones
Cc: Linus Walleij
Cc: Philipp Zabel
Cc: Viresh Kumar
Cc: Wolfram Sang
Cc: linux-g...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: virtualizat...@lists.linux.dev

Andi Shyti (2):
  MAINTAINERS: i2c-virtio: Drop Conghui Chen from Maintainers
  MAINTAINERS: Drop Thor Thayer from maintainers

 MAINTAINERS | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

--
2.45.2
[PATCH] MAINTAINERS: Add uprobes entry
From: Masami Hiramatsu (Google) Add uprobes entry to MAINTAINERS to clarify the maintainers. Suggested-by: Peter Zijlstra Signed-off-by: Masami Hiramatsu (Google) --- MAINTAINERS | 13 + 1 file changed, 13 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index da5352dbd4f3..ae731fa2328c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -23105,6 +23105,19 @@ F: drivers/mtd/ubi/ F: include/linux/mtd/ubi.h F: include/uapi/mtd/ubi-user.h +UPROBES +M: Masami Hiramatsu +M: Oleg Nesterov +M: Peter Zijlstra +L: linux-kernel@vger.kernel.org +L: linux-trace-ker...@vger.kernel.org +S: Maintained +F: arch/*/include/asm/uprobes.h +F: arch/*/kernel/probes/uprobes.c +F: arch/*/kernel/uprobes.c +F: include/linux/uprobes.h +F: kernel/events/uprobes.c + USB "USBNET" DRIVER FRAMEWORK M: Oliver Neukum L: net...@vger.kernel.org
Re: [PATCH v2 00/11] perf/uprobe: Optimize uprobes
On Thu, Jul 11, 2024 at 4:07 AM Peter Zijlstra wrote:
>
> Hi!
>
> These patches implement the (S)RCU based proposal to optimize uprobes.
>
> On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func' in a
> tight loop:
>
>   perf probe -x ./uprobes test=func
>   perf stat -ae probe_uprobe:test -- sleep 1
>
>   perf probe -x ./uprobes test=func%return
>   perf stat -ae probe_uprobe:test__return -- sleep 1
>
> PRE:
>
>   4,038,804 probe_uprobe:test
>   2,356,275 probe_uprobe:test__return
>
> POST:
>
>   7,216,579 probe_uprobe:test
>   6,744,786 probe_uprobe:test__return
>
> (copy-paste FTW, I didn't do new numbers because the fast paths didn't change --
> and a quick test run shows similar numbers)
>
> Patches also available here:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes
>
> Changes since last time:
>  - better split with intermediate inc_not_zero()
>  - fix UPROBE_HANDLER_REMOVE
>  - restored the lost rcu_assign_pointer()
>  - avoid lockdep for uretprobe_srcu
>  - add missing put_uprobe() -> srcu_read_unlock() conversion
>  - actually initialize return_instance::has_ref
>  - a few comments
>  - things I don't remember

Hey Peter!

Thanks for the v2, I plan to look at it more thoroughly tomorrow. But
meanwhile I spent a good chunk of today writing an uprobes stress test,
so we can validate that we are not regressing anything (yes, I don't trust
lockless code and people in general ;)

Anyways, if you'd like to use it, it's at [0]. All you should need to build
and run it is:

  $ cd examples/c
  $ make -j$(nproc) uprobe-stress
  $ sudo ./uprobe-stress -tN -aM -mP -fR

N, M, P, R are the number of threads dedicated to one of four functions of
the stress test: triggering user space functions (N), attaching/detaching
various random subsets of uprobes (M), mmap()ing parts of the executable
with uprobes (P), and forking the process and triggering uprobes for a
little bit (R). The idea is to test various timings and interleavings of
uprobe-related logic.

You should only need not-too-old Clang to build everything (Clang 12+
should work, I believe). But do let me know if you run into trouble.

I did run this stress test for a little while on current bpf-next/master
with no issues detected (yay!). But then I also ran it on Linux built from
perf/uprobes branch (these patches), and after a few seconds I see that
there is no more attachment/detachment happening. Eventually I got splats,
which you can see in [1]. I used `sudo ./uprobe-stress -a10 -t5 -m5 -f3`
command to run it inside my QEMU image.

So there is still something off, hopefully this will help to debug and
hammer out any remaining kinks. Thanks!

[0] https://github.com/libbpf/libbpf-bootstrap/commit/2f88cef90f9728ec8c7bee7bd48fdbcf197806c3
[1] https://gist.github.com/anakryiko/f761690addf7aa5f08caec95fda9ef1a
Re: [PATCH 0/2] Cleanup the MAINTAINER's file
On Fri, Jul 12, 2024 at 01:19:24AM +0200, Andi Shyti wrote: > Hi, > > while reviewing Wolfram's series, I received some delivery > failure notifications for e-mails that don't exist anymore. > > With this series I'm removing: > > - Conghui Chen > - Thor Thayer Fixes for these two are already in my for-current branch. (And the patches were on the i2c list, Andi ;))