Re: [PATCH] uprobes: reduce contention on uprobes_tree access
Hi Masami,

> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> Looks good to me.
>
> Acked-by: Masami Hiramatsu (Google)
>
> BTW, how did you measure the overhead? I think spinlock overhead
> will depend on how much lock contention happens.

Absolutely. I have the original production workload to test this with and
a derived one that mimics this test case. The production case has ~24
threads running on a 192 core system which access 14 USDTs around 1.5
million times per second in total (across all USDTs). My test case is
similar but can drive a higher rate of USDT access across more threads
and therefore generate higher contention.

All measurements are done using bpftrace scripts around relevant parts of
code in uprobes.c and application code.

Jon.

> Thank you,
>
> >
> > [0] https://docs.kernel.org/locking/spinlocks.html
> >
> > Signed-off-by: Jonathan Haslam
> > ---
> >  kernel/events/uprobes.c | 22 +++---
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..42bf9b6e8bc0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> >   */
> >  #define no_uprobe_events()      RB_EMPTY_ROOT(&uprobes_tree)
> >
> > -static DEFINE_SPINLOCK(uprobes_treelock);       /* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock);         /* serialize rbtree access */
> >
> >  #define UPROBES_HASH_SZ 13
> >  /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> >  {
> >          struct uprobe *uprobe;
> >
> > -        spin_lock(&uprobes_treelock);
> > +        read_lock(&uprobes_treelock);
> >          uprobe = __find_uprobe(inode, offset);
> > -        spin_unlock(&uprobes_treelock);
> > +        read_unlock(&uprobes_treelock);
> >
> >          return uprobe;
> >  }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> >  {
> >          struct uprobe *u;
> >
> > -        spin_lock(&uprobes_treelock);
> > +        write_lock(&uprobes_treelock);
> >          u = __insert_uprobe(uprobe);
> > -        spin_unlock(&uprobes_treelock);
> > +        write_unlock(&uprobes_treelock);
> >
> >          return u;
> >  }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> >          if (WARN_ON(!uprobe_is_active(uprobe)))
> >                  return;
> >
> > -        spin_lock(&uprobes_treelock);
> > +        write_lock(&uprobes_treelock);
> >          rb_erase(&uprobe->rb_node, &uprobes_tree);
> > -        spin_unlock(&uprobes_treelock);
> > +        write_unlock(&uprobes_treelock);
> >          RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> >          put_uprobe(uprobe);
> >  }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> >          min = vaddr_to_offset(vma, start);
> >          max = min + (end - start) - 1;
> >
> > -        spin_lock(&uprobes_treelock);
> > +        read_lock(&uprobes_treelock);
> >          n = find_node_in_range(inode, min, max);
> >          if (n) {
> >                  for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> >                          get_uprobe(u);
> >                  }
> >          }
> > -        spin_unlock(&uprobes_treelock);
> > +        read_unlock(&uprobes_treelock);
> >  }
> >
> >  /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> >          min = vaddr_to_offset(vma, start);
> >          max = min + (end - start) - 1;
> >
> > -        spin_lock(&uprobes_treelock);
> > +        read_lock(&uprobes_treelock);
> >          n = find_node_in_range(inode, min, max);
> > -        spin_unlock(&uprobes_treelock);
> > +        read_unlock(&uprobes_treelock);
> >
> >          return !!n;
> >  }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google)
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
Hi Ingo,

> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> Have you considered/measured per-CPU RW semaphores?

No I hadn't but thanks hugely for suggesting it! In initial measurements
it seems to be between 20-100% faster than the RW spinlocks! Apologies for
all the exclamation marks but I'm very excited. I'll do some more testing
tomorrow but so far it's looking very good. Thanks again for the input.

Jon.
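For readers skimming the thread, here is a minimal sketch of what the
per-CPU rw-semaphore variant being discussed could look like, assuming the
kernel's percpu-rwsem API from <linux/percpu-rwsem.h> and the same
__find_uprobe()/__insert_uprobe() helpers and lock sites as the diff
above. This is illustrative only, not the follow-up patch that was
actually measured:

/*
 * Illustrative sketch, not the posted patch: uprobes_treelock as a
 * per-CPU rw-semaphore. Readers take a cheap per-CPU reference;
 * writers go through the RCU machinery, which is the write-side cost
 * discussed later in the thread.
 */
#include <linux/percpu-rwsem.h>

static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);	/* serialize rbtree access */

static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
	struct uprobe *uprobe;

	percpu_down_read(&uprobes_treelock);	/* hot path: lookup only */
	uprobe = __find_uprobe(inode, offset);
	percpu_up_read(&uprobes_treelock);

	return uprobe;
}

static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
	struct uprobe *u;

	percpu_down_write(&uprobes_treelock);	/* cold path: tree modification */
	u = __insert_uprobe(uprobe);
	percpu_up_write(&uprobes_treelock);

	return u;
}

delete_uprobe() and the build_probe_list()/vma_has_uprobes() lookups would
change in the same way; the read side may sleep on its slow path, which
should be acceptable here since these paths run in task context.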
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
> > > Have you considered/measured per-CPU RW semaphores?
> >
> > No I hadn't but thanks hugely for suggesting it! In initial measurements
> > it seems to be between 20-100% faster than the RW spinlocks! Apologies for
> > all the exclamation marks but I'm very excited. I'll do some more testing
> > tomorrow but so far it's looking very good.
>
> Documentation ([0]) says that locking for writing calls
> synchronize_rcu(), is that right? If that's true, attaching multiple
> uprobes (including just attaching a single BPF multi-uprobe) will take
> a really long time. We need to confirm we are not significantly
> regressing this. And if we do, we need to take measures in the BPF
> multi-uprobe attachment code path to make sure that a single
> multi-uprobe attachment is still fast.
>
> If my worries above turn out to be true, it still feels like a first
> good step should be landing this patch as is (and get it backported to
> older kernels), and then have percpu rw-semaphore as a final (and a
> bit more invasive) solution (it's RCU-based, so feels like a good
> primitive to settle on), making sure to not regress multi-uprobes
> (we'll probably will need some batched API for multiple uprobes).
>
> Thoughts?

Agreed. In the percpu_down_write() path we call rcu_sync_enter() which is
what calls into synchronize_rcu(). I haven't done the measurements yet but
I would imagine this has to regress probe attachment, at least in the
uncontended case. Of course, reads are by far the dominant mode here but
we probably shouldn't punish writes excessively. I will do some
measurements to quantify the write penalty here.

I agree that a batched interface for probe attachment is needed here. The
usual mode of operation for us is that we have a number of USDTs (uprobes)
in hand and we want to enable and disable them in one shot. Removing the
need to do multiple locking operations is definitely an efficiency
improvement that needs to be done. Tie that together with per-CPU RW
semaphores and this should scale extremely well in both a read and write
case.

Jon.
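To make the batching idea concrete, here is a purely hypothetical sketch;
the insert_uprobe_batch() name and calling convention are invented for
illustration and are not an existing kernel or BPF API. The point is
simply to pay for one write-side critical section (and, with a per-CPU
rw-semaphore, one synchronize_rcu()) per batch rather than per probe:

/*
 * Hypothetical sketch only - no such helper exists in the posted patch.
 * It amortizes a single write_lock()/write_unlock() (or a single
 * percpu_down_write(), and hence one synchronize_rcu()) over N probes.
 * Duplicate detection and refcounting, which the per-probe
 * insert_uprobe() handles today, are omitted for brevity.
 */
static void insert_uprobe_batch(struct uprobe **uprobes, int cnt)
{
	int i;

	write_lock(&uprobes_treelock);		/* one write section for the whole batch */
	for (i = 0; i < cnt; i++)
		__insert_uprobe(uprobes[i]);	/* rbtree insert, as in the diff above */
	write_unlock(&uprobes_treelock);
}

A user-visible batched attach (e.g. for BPF multi-uprobe) would sit on top
of something like this internally, which is the direction discussed later
in the thread.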
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
> > Masami,
> >
> > Given the discussion around per-cpu rw semaphore and need for
> > (internal) batched attachment API for uprobes, do you think you can
> > apply this patch as is for now? We can then gain initial improvements
> > in scalability that are also easy to backport, and Jonathan will work
> > on a more complete solution based on per-cpu RW semaphore, as
> > suggested by Ingo.
>
> Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> I would like to wait for the next version.

My initial tests show a nice improvement over the RW spinlocks but a
significant regression in acquiring a write lock. I've got a few days
vacation over Easter but I'll aim to get some more formalised results out
to the thread toward the end of next week.

Jon.

> Thank you,
>
[...]
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
> > > > Given the discussion around per-cpu rw semaphore and need for
> > > > (internal) batched attachment API for uprobes, do you think you can
> > > > apply this patch as is for now? We can then gain initial improvements
> > > > in scalability that are also easy to backport, and Jonathan will work
> > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > suggested by Ingo.
> > >
> > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > I would like to wait for the next version.
> >
> > My initial tests show a nice improvement over the RW spinlocks but a
> > significant regression in acquiring a write lock. I've got a few days
> > vacation over Easter but I'll aim to get some more formalised results out
> > to the thread toward the end of next week.
>
> As far as the write lock is only on the cold path, I think you can choose
> per-cpu RW semaphore. Since it does not do busy wait, the total system
> performance impact will be small.
> I look forward to your formalized results :)

Sorry for the delay in getting back to you on this Masami. I have used one
of the bpf selftest benchmarks to provide some form of comparison of the 3
different approaches (spinlock, RW spinlock and per-cpu RW semaphore). The
benchmark used here is the 'trig-uprobe-nop' benchmark which just executes
a single uprobe with a minimal bpf program attached. The tests were done
on a 32 core qemu/kvm instance.

Things to note about the results:

- The results are slightly variable so don't get too caught up on
  individual thread count - it's the trend that is important.
- In terms of throughput with this specific benchmark a *very* macro view
  is that the RW spinlock provides 40-60% more throughput than the
  spinlock. The per-CPU RW semaphore provides in the order of 50-100%
  more throughput than the spinlock.
- This doesn't fully reflect the large reduction in latency that we have
  seen in application based measurements. However, it does demonstrate
  that even the trivial change of going to a RW spinlock provides
  significant benefits.

I haven't included the measurements on per-CPU RW semaphore write
performance as they are completely in line with those that Paul McKenney
posted on his journal [0]. On a 32 core system I see semaphore writes take
in the order of 25-28 millisecs - the cost of the synchronize_rcu().

Each block of results below shows 1 line per execution of the benchmark
(the "Summary" line) and each line is a run with one more thread added - a
thread is a "producer". The lines are edited to remove extraneous output
that adds no value here.

The tests were executed with this driver script:

for num_threads in {1..20}
do
        sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
done


spinlock

Summary: hits    1.453 ± 0.005M/s (  1.453M/prod)
Summary: hits    2.087 ± 0.005M/s (  1.043M/prod)
Summary: hits    2.701 ± 0.012M/s (  0.900M/prod)
Summary: hits    1.917 ± 0.011M/s (  0.479M/prod)
Summary: hits    2.105 ± 0.003M/s (  0.421M/prod)
Summary: hits    1.615 ± 0.006M/s (  0.269M/prod)
Summary: hits    2.046 ± 0.004M/s (  0.292M/prod)
Summary: hits    1.818 ± 0.002M/s (  0.227M/prod)
Summary: hits    1.867 ± 0.024M/s (  0.207M/prod)
Summary: hits    1.692 ± 0.003M/s (  0.169M/prod)
Summary: hits    1.778 ± 0.004M/s (  0.162M/prod)
Summary: hits    1.710 ± 0.011M/s (  0.142M/prod)
Summary: hits    1.687 ± 0.022M/s (  0.130M/prod)
Summary: hits    1.697 ± 0.004M/s (  0.121M/prod)
Summary: hits    1.645 ± 0.011M/s (  0.110M/prod)
Summary: hits    1.671 ± 0.002M/s (  0.104M/prod)
Summary: hits    1.692 ± 0.014M/s (  0.100M/prod)
Summary: hits    1.700 ± 0.015M/s (  0.094M/prod)
Summary: hits    1.668 ± 0.005M/s (  0.088M/prod)
Summary: hits    1.644 ± 0.004M/s (  0.082M/prod)


RW spinlock

Summary: hits    1.465 ± 0.008M/s (  1.465M/prod)
Summary: hits    1.750 ± 0.035M/s (  0.875M/prod)
Summary: hits    2.164 ± 0.008M/s (  0.721M/prod)
Summary: hits    2.235 ± 0.014M/s (  0.559M/prod)
Summary: hits    2.202 ± 0.005M/s (  0.440M/prod)
Summary: hits    2.816 ± 0.003M/s (  0.469M/prod)
Summary: hits    2.364 ± 0.002M/s (  0.338M/prod)
Summary: hits    2.327 ± 0.008M/s (  0.291M/prod)
Summary: hits    2.147 ± 0.005M/s (  0.239M/prod)
Summary: hits    2.266 ± 0.011M/s (  0.227M/prod)
Summary: hits    2.483 ± 0.003M/s (  0.226M/prod)
Summary: hits    2.573 ± 0.008M/s (  0.214M/prod)
Summary: hits    2.569 ± 0.004M/s (  0.198M/prod)
Summary: hits    2.507 ± 0.013M/s (  0.179M/prod)
Summary: hits    2.165 ± 0.008M/s (  0.144M/prod)
Summary: hits    2.524 ± 0.004M/s (  0.158M/prod)
Summary: hits    2.059 ± 0.020M/s (  0.121M/prod)
Summary: hits    2.332 ± 0.007M/s (  0.130M/prod)
Summary: hits    2.404 ± 0.017M/s (  0.127M/prod)
Summary: hits    2.187 ± 0.002M/s (  0.109M/prod)


per-CPU RW semaphore

Summary: hits    1.341 ± 0.017M/s (  1.341M/prod)
Summary: hits    2.397 ± 0.011M/s (  1.198M/prod)
Summary: hits    3.547 ± 0.041M/s (  1.182M/prod)
Summary: hits    4.108 ±
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
> > Things to note about the results:
> >
> > - The results are slightly variable so don't get too caught up on
> >   individual thread count - it's the trend that is important.
> > - In terms of throughput with this specific benchmark a *very* macro view
> >   is that the RW spinlock provides 40-60% more throughput than the
> >   spinlock. The per-CPU RW semaphore provides in the order of 50-100%
> >   more throughput than the spinlock.
> > - This doesn't fully reflect the large reduction in latency that we have
> >   seen in application based measurements. However, it does demonstrate
> >   that even the trivial change of going to a RW spinlock provides
> >   significant benefits.
>
> This is probably because trig-uprobe-nop creates a single uprobe that
> is triggered on many CPUs. While in production we have also *many*
> uprobes running on many CPUs. In this benchmark, besides contention on
> uprobes_treelock, we are also hammering on other per-uprobe locks
> (register_rwsem, also if you don't have [0] patch locally, there will
> be another filter lock taken each time, filter->rwlock). There is also
> atomic refcounting going on, which when you have the same uprobe
> across all CPUs at the same time will cause a bunch of cache line
> bouncing.
>
> So yes, it's understandable that in practice in production you see an
> even larger effect of optimizing uprobe_treelock than in this
> micro-benchmark.
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

Thanks for the reply and the thoughts on this Andrii. Yes, I do have the
filter->rwlock fix applied but, as you say, there are no doubt other
effects at play here, as is to be expected in such a synthetic workload.
I'm pleased with the outcomes though as they show a good result even if
they are at the lower end of what I expect.

The results also show that pursuing an RCU solution is definitely worth it
but that write penalty is brutal in the case of a full synchronize_rcu()!
Should be fun.

> > for num_threads in {1..20}
> > do
> >         sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
>
> just want to mention -a (affinity) option that you can pass a bench
> tool, it will pin each thread on its own CPU. It generally makes tests
> more uniform, eliminating CPU migrations variability.

Thanks for pointing that flag out!

Jon.

> > done
> >
> >
> > spinlock
> >
> > Summary: hits    1.453 ± 0.005M/s (  1.453M/prod)
> > Summary: hits    2.087 ± 0.005M/s (  1.043M/prod)
> > Summary: hits    2.701 ± 0.012M/s (  0.900M/prod)
>
> I also wanted to point out that the first measurement (1.453M/s in
> this row) is total throughput across all threads, while value in
> parenthesis (0.900M/prod) is averaged throughput per each thread. So
> this M/prod value is the most interesting in this benchmark where we
> assess the effect of reducing contention.
>
> > Summary: hits    1.917 ± 0.011M/s (  0.479M/prod)
> > Summary: hits    2.105 ± 0.003M/s (  0.421M/prod)
> > Summary: hits    1.615 ± 0.006M/s (  0.269M/prod)
>
> [...]
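For completeness, re-running the sweep with the affinity pinning Andrii
mentions would presumably just mean adding -a to the existing driver loop,
e.g.:

for num_threads in {1..20}
do
        # -a pins each producer thread to its own CPU (per Andrii's note above)
        sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary
done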
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
Hi Masami,

> > Which is why I was asking to land this patch as is, as it relieves the
> > scalability pains in production and is easy to backport to old
> > kernels. And then we can work on batched APIs and switch to per-CPU rw
> > semaphore.
>
> OK, then I'll push this to for-next at this moment.
> Please share if you have a good idea for the batch interface which can be
> backported. I guess it should involve updating userspace changes too.

Did you (or anyone else) need anything more from me on this one so that it
can be pushed? I provided some benchmark numbers but happy to provide
anything else that may be required.

Thanks!

Jon.

> Thank you!
>
> > So I hope you can reconsider and accept improvements in this patch,
> > while Jonathan will keep working on even better final solution.
> > Thanks!
> >
> > > I look forward to your formalized results :)
> >
> > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > and USDTs, reporting attach/detach timings:
> >
> > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > bpf_testmod.ko is already unloaded.
> > Loading bpf_testmod.ko...
> > Successfully loaded bpf_testmod.ko.
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > test_bench_attach_uprobe: attached in 0.120s
> > test_bench_attach_uprobe: detached in 0.092s
> > #400/5 uprobe_multi_test/bench_uprobe:OK
> > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > test_bench_attach_usdt: attached in 0.124s
> > test_bench_attach_usdt: detached in 0.064s
> > #400/6 uprobe_multi_test/bench_usdt:OK
> > #400 uprobe_multi_test:OK
> > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > Successfully unloaded bpf_testmod.ko.
> >
> > So it should be easy for Jonathan to validate his changes with this.
>
> Thank you,
>
[...]
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
> > > OK, then I'll push this to for-next at this moment.
> > > Please share if you have a good idea for the batch interface which can be
> > > backported. I guess it should involve updating userspace changes too.
> >
> > Did you (or anyone else) need anything more from me on this one so that it
> > can be pushed? I provided some benchmark numbers but happy to provide
> > anything else that may be required.
>
> Yeah, if you can update with the result, it looks better to me.
> Or, can I update the description?

Sure, please feel free to update the description yourself.

Jon.

> Thank you,
>
[...]
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
Hi Masami,

> > > OK, then I'll push this to for-next at this moment.
> > > Please share if you have a good idea for the batch interface which can be
> > > backported. I guess it should involve updating userspace changes too.
> >
> > Did you (or anyone else) need anything more from me on this one so that it
> > can be pushed? I provided some benchmark numbers but happy to provide
> > anything else that may be required.
>
> Yeah, if you can update with the result, it looks better to me.
> Or, can I update the description?

Just checking if you need me to do anything on this so that it can be
pushed to for-next? Would be really great if we can get this in. Thanks
for all your help.

Jon.

> Thank you,
>
[...]