[PATCH] uprobes: Remove redundant spinlock in uprobe_deny_signal

2024-07-31 Thread Liao Chang
Since clearing a bit in thread_info is an atomic operation, the spinlock
is redundant and can be removed; reducing lock contention is also good
for performance.

Signed-off-by: Liao Chang 
---
 kernel/events/uprobes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 73cc47708679..76a51a1f51e2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1979,9 +1979,7 @@ bool uprobe_deny_signal(void)
WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
if (task_sigpending(t)) {
-   spin_lock_irq(&t->sighand->siglock);
clear_tsk_thread_flag(t, TIF_SIGPENDING);
-   spin_unlock_irq(&t->sighand->siglock);
 
if (__fatal_signal_pending(t) || 
arch_uprobe_xol_was_trapped(t)) {
utask->state = UTASK_SSTEP_TRAPPED;
-- 
2.34.1




[PATCH] uprobes: Improve scalability by reducing the contention on siglock

2024-08-01 Thread Liao Chang
The profiling result of the BPF selftests on an ARM64 platform reveals that
the significant contention on current->sighand->siglock within
handle_singlestep() is the scalability bottleneck. The reason is
straightforward: all producer threads of the benchmark have to contend for
that spinlock in order to restore the TIF_SIGPENDING bit in thread_info that
might have been cleared in uprobe_deny_signal().

This patch introduces UTASK_SSTEP_DENY_SIGNAL to mark that TIF_SIGPENDING was
suppressed temporarily during uprobe single-stepping. Once the single-step has
been handled and UTASK_SSTEP_DENY_SIGNAL is confirmed, TIF_SIGPENDING can be
restored directly without acquiring the siglock in most cases, reducing
contention and improving overall performance.

I've used the script developed by Andrii in [1] to run the benchmark. The CPU
used was a Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz, running
upstream kernel v6.11-rc1 plus my optimization [2] for get_xol_insn_slot().

before-opt
--
uprobe-nop  ( 1 cpus):0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop  ( 2 cpus):1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop  ( 4 cpus):3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop  ( 8 cpus):4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop  (16 cpus):3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop  (32 cpus):2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop  (64 cpus):1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push ( 1 cpus):0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push ( 2 cpus):1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push ( 4 cpus):3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push ( 8 cpus):3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push (16 cpus):2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push (32 cpus):2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push (64 cpus):1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret  ( 1 cpus):1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret  ( 2 cpus):3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret  ( 4 cpus):6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret  ( 8 cpus):6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret  (16 cpus):5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret  (32 cpus):5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret  (64 cpus):4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
-
uprobe-nop  ( 1 cpus):0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop  ( 2 cpus):1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop  ( 4 cpus):3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop  ( 8 cpus):5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop  (16 cpus):6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop  (32 cpus):4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop  (64 cpus):2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push ( 1 cpus):0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push ( 2 cpus):1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push ( 4 cpus):3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push ( 8 cpus):4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push (16 cpus):5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push (32 cpus):5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push (64 cpus):2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret  ( 1 cpus):1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret  ( 2 cpus):3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret  ( 4 cpus):6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret  ( 8 cpus):6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret  (16 cpus):6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret  (32 cpus):5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret  (64 cpus):4.743 ± 0.016M/s  (  0.074M/s/cpu)

The benchmark results above demonstrate an obvious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
rises from 4.5M/s to 6.4M/s and from 3.3M/s to 5.1M/s respectively.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-and...@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaocha...@huawei.com

Signed-off-by: Liao Chang 
---
 include/linux/uprobes.h |  1 +
 kernel/events/uprobes.c | 18 +++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..50acbf96bccd 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -53,6 +53,7 @@ enum uprobe_task_state {
UTASK_SSTEP,
UTASK_SSTEP_ACK,
UTASK_SSTEP_TRAPPED,
+   UTASK_SSTEP_DENY_SIGNAL,
 };
 
 /*
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 76a51a1f51e2..4f9c10b3c7b9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1980,6 +1980,7 @@ bool uprobe_deny_signal(void)
 
if (task_sigpending(t)) {
clear_tsk_thread_flag(t, TIF_SIGPENDING);
+   utask->state = UTASK_SSTEP_DENY_SIGNAL;
 

Re: [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU

2024-08-01 Thread Liao, Chang
ite_lock(mm);
>   vma = find_vma(mm, info->vaddr);
> @@ -1885,9 +1889,13 @@ static void prepare_uretprobe(struct uprobe *uprobe, 
> struct pt_regs *regs)
>   return;
>   }
>  
> + /* we need to bump refcount to store uprobe in utask */
> + if (!try_get_uprobe(uprobe))
> + return;
> +
>   ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL);
>   if (!ri)
> - return;
> + goto fail;
>  
>   trampoline_vaddr = uprobe_get_trampoline_vaddr();
>   orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, 
> regs);
> @@ -1914,11 +1922,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, 
> struct pt_regs *regs)
>   }
>   orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
>   }
> -  /*
> -   * uprobe's refcnt is positive, held by caller, so it's safe to
> -   * unconditionally bump it one more time here
> -   */
> - ri->uprobe = get_uprobe(uprobe);
> + ri->uprobe = uprobe;
>   ri->func = instruction_pointer(regs);
>   ri->stack = user_stack_pointer(regs);
>   ri->orig_ret_vaddr = orig_ret_vaddr;
> @@ -1929,8 +1933,9 @@ static void prepare_uretprobe(struct uprobe *uprobe, 
> struct pt_regs *regs)
>   utask->return_instances = ri;
>  
>   return;
> - fail:
> +fail:
>   kfree(ri);
> + put_uprobe(uprobe);
>  }
>  
>  /* Prepare to single-step probed instruction out of line. */
> @@ -1945,9 +1950,14 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, 
> unsigned long bp_vaddr)
>   if (!utask)
>   return -ENOMEM;
>  
> + if (!try_get_uprobe(uprobe))
> + return -EINVAL;
> +
>   xol_vaddr = xol_get_insn_slot(uprobe);
> - if (!xol_vaddr)
> - return -ENOMEM;
> + if (!xol_vaddr) {
> + err = -ENOMEM;
> + goto err_out;
> + }
>  
>   utask->xol_vaddr = xol_vaddr;
>   utask->vaddr = bp_vaddr;
> @@ -1955,12 +1965,15 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs 
> *regs, unsigned long bp_vaddr)
>   err = arch_uprobe_pre_xol(&uprobe->arch, regs);
>   if (unlikely(err)) {
>   xol_free_insn_slot(current);
> - return err;
> + goto err_out;
>   }
>  
>   utask->active_uprobe = uprobe;
>   utask->state = UTASK_SSTEP;
>   return 0;
> +err_out:
> + put_uprobe(uprobe);
> + return err;
>  }
>  
>  /*
> @@ -2044,7 +2057,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
> unsigned long vaddr)
>   return is_trap_insn(&opcode);
>  }
>  
> -static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int 
> *is_swbp)
> +/* assumes being inside RCU protected region */
> +static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int 
> *is_swbp)
>  {
>   struct mm_struct *mm = current->mm;
>   struct uprobe *uprobe = NULL;
> @@ -2057,7 +2071,7 @@ static struct uprobe *find_active_uprobe(unsigned long 
> bp_vaddr, int *is_swbp)
>   struct inode *inode = file_inode(vma->vm_file);
>   loff_t offset = vaddr_to_offset(vma, bp_vaddr);
>  
> - uprobe = find_uprobe(inode, offset);
> + uprobe = find_uprobe_rcu(inode, offset);
>   }
>  
>   if (!uprobe)
> @@ -2201,13 +2215,15 @@ static void handle_swbp(struct pt_regs *regs)
>  {
>   struct uprobe *uprobe;
>   unsigned long bp_vaddr;
> - int is_swbp;
> + int is_swbp, srcu_idx;
>  
>   bp_vaddr = uprobe_get_swbp_addr(regs);
>   if (bp_vaddr == uprobe_get_trampoline_vaddr())
>   return uprobe_handle_trampoline(regs);
>  
> - uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> + srcu_idx = srcu_read_lock(&uprobes_srcu);
> +
> + uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
>   if (!uprobe) {
>   if (is_swbp > 0) {
>   /* No matching uprobe; signal SIGTRAP. */
> @@ -2223,6 +2239,7 @@ static void handle_swbp(struct pt_regs *regs)
>    */
>   instruction_pointer_set(regs, bp_vaddr);
>   }
> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
>   return;
>   }
>  
> @@ -2258,12 +2275,12 @@ static void handle_swbp(struct pt_regs *regs)
>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>   goto out;
>  
> - if (!pre_ssout(uprobe, regs, bp_vaddr))
> - return;
> + if (pre_ssout(uprobe, regs, bp_vaddr))
> + goto out;
>  

Regardless of what pre_ssout() returns, it always reaches the label 'out', so
the if block is unnecessary.


> - /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
>  out:
> - put_uprobe(uprobe);
> + /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
>  }
>  
>  /*

-- 
BR
Liao, Chang



Re: [PATCH 3/8] uprobes: protected uprobe lifetime with SRCU

2024-08-01 Thread Liao, Chang



On 2024/8/2 0:49, Andrii Nakryiko wrote:
> On Thu, Aug 1, 2024 at 5:23 AM Liao, Chang  wrote:
>>
>>
>>
>>> On 2024/8/1 5:42, Andrii Nakryiko wrote:
>>> To avoid unnecessarily taking a (brief) refcount on uprobe during
>>> breakpoint handling in handle_swbp for entry uprobes, make find_uprobe()
>>> not take refcount, but protect the lifetime of a uprobe instance with
>>> RCU. This improves scalability, as refcount gets quite expensive due to
>>> cache line bouncing between multiple CPUs.
>>>
>>> Specifically, we utilize our own uprobe-specific SRCU instance for this
>>> RCU protection. put_uprobe() will delay actual kfree() using call_srcu().
>>>
>>> For now, uretprobe and single-stepping handling will still acquire
>>> refcount as necessary. We'll address these issues in follow up patches
>>> by making them use SRCU with timeout.
>>>
>>> Signed-off-by: Andrii Nakryiko 
>>> ---
>>>  kernel/events/uprobes.c | 93 -
>>>  1 file changed, 55 insertions(+), 38 deletions(-)
>>>
> 
> [...]
> 
>>>
>>> @@ -2258,12 +2275,12 @@ static void handle_swbp(struct pt_regs *regs)
>>>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>>   goto out;
>>>
>>> - if (!pre_ssout(uprobe, regs, bp_vaddr))
>>> - return;
>>> + if (pre_ssout(uprobe, regs, bp_vaddr))
>>> + goto out;
>>>
>>
>> Regardless what pre_ssout() returns, it always reach the label 'out', so the
>> if block is unnecessary.
> 
> yep, I know, but I felt like
> 
> if (something is wrong)
> goto out;
> 
> pattern was important to keep for each possible failing step for consistency.
> 
> so unless this is a big deal, I'd keep it as is, as in the future
> there might be some other steps after pre_ssout() before returning, so
> this is a bit more "composable"
> 
OK, I would say this conditional-check pattern is likely to be optimized away
by a modern compiler.

Thanks.

> 
>>
>>
>>> - /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep 
>>> */
>>>  out:
>>> - put_uprobe(uprobe);
>>> + /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep 
>>> */
>>> + srcu_read_unlock(&uprobes_srcu, srcu_idx);
>>>  }
>>>
>>>  /*
>>
>> --
>> BR
>> Liao, Chang

-- 
BR
Liao, Chang



Re: [PATCH] uprobes: Improve scalability by reducing the contention on siglock

2024-08-01 Thread Liao, Chang



On 2024/8/1 22:06, Oleg Nesterov wrote:
> On 08/01, Liao Chang wrote:
>>
>> @@ -2276,22 +2277,25 @@ static void handle_singlestep(struct uprobe_task 
>> *utask, struct pt_regs *regs)
>>  int err = 0;
>>  
>>  uprobe = utask->active_uprobe;
>> -if (utask->state == UTASK_SSTEP_ACK)
>> +switch (utask->state) {
>> +case UTASK_SSTEP_ACK:
>>  err = arch_uprobe_post_xol(&uprobe->arch, regs);
>> -else if (utask->state == UTASK_SSTEP_TRAPPED)
>> +break;
>> +case UTASK_SSTEP_TRAPPED:
>>  arch_uprobe_abort_xol(&uprobe->arch, regs);
>> -else
>> +fallthrough;
>> +case UTASK_SSTEP_DENY_SIGNAL:
>> +set_tsk_thread_flag(current, TIF_SIGPENDING);
>> +break;
>> +default:
>>  WARN_ON_ONCE(1);
>> +}
> 
> Liao, at first glance this change looks "obviously wrong" to me.

Oleg, did I overlook something obvious here?

> 
> But let me read this patch more carefully and reply on weekend,
> I am a bit busy right now.

Sure, thanks.

> 
> Thanks,
> 
> Oleg.
> 
> 

-- 
BR
Liao, Chang



Re: [PATCH 6/8] perf/uprobe: split uprobe_unregister()

2024-08-01 Thread Liao, Chang
; &uprobes[i].consumer);
> +
> + if (cnt)
> + uprobe_unregister_sync();
>  }
>  
>  static void bpf_uprobe_multi_link_release(struct bpf_link *link)
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 7eb79e0a5352..f7443e996b1b 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1097,6 +1097,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, 
> filter_func_t filter)
>  static void __probe_event_disable(struct trace_probe *tp)
>  {
>   struct trace_uprobe *tu;
> + bool sync = false;
>  
>   tu = container_of(tp, struct trace_uprobe, tp);
>   WARN_ON(!uprobe_filter_is_empty(tu->tp.event->filter));
> @@ -1105,9 +1106,12 @@ static void __probe_event_disable(struct trace_probe 
> *tp)
>   if (!tu->uprobe)
>   continue;
>  
> - uprobe_unregister(tu->uprobe, &tu->consumer);
> + uprobe_unregister_nosync(tu->uprobe, &tu->consumer);
> + sync = true;
>   tu->uprobe = NULL;
>   }
> + if (sync)
> + uprobe_unregister_sync();
>  }
>  
>  static int probe_event_enable(struct trace_event_call *call,
> diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
> b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> index 73a6b041bcce..928c73cde32e 100644
> --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> @@ -478,7 +478,8 @@ static void testmod_unregister_uprobe(void)
>   mutex_lock(&testmod_uprobe_mutex);
>  
>   if (uprobe.uprobe) {
> - uprobe_unregister(uprobe.uprobe, &uprobe.consumer);
> + uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer);
> + uprobe_unregister_sync();
>   uprobe.offset = 0;
>   uprobe.uprobe = NULL;
>   }

-- 
BR
Liao, Chang



Re: [PATCH 6/8] perf/uprobe: split uprobe_unregister()

2024-08-05 Thread Liao, Chang



On 2024/8/6 4:01, Andrii Nakryiko wrote:
> On Fri, Aug 2, 2024 at 8:05 AM Andrii Nakryiko
>  wrote:
>>
>> On Thu, Aug 1, 2024 at 7:41 PM Liao, Chang  wrote:
>>>
>>>
>>>
>>> On 2024/8/1 5:42, Andrii Nakryiko wrote:
>>>> From: Peter Zijlstra 
>>>>
>>>> With uprobe_unregister() having grown a synchronize_srcu(), it becomes
>>>> fairly slow to call. Esp. since both users of this API call it in a
>>>> loop.
>>>>
>>>> Peel off the sync_srcu() and do it once, after the loop.
>>>>
>>>> With recent uprobe_register()'s error handling reusing full
>>>> uprobe_unregister() call, we need to be careful about returning to the
>>>> caller before we have a guarantee that partially attached consumer won't
>>>> be called anymore. So add uprobe_unregister_sync() in the error handling
>>>> path. This is an unlikely slow path and this should be totally fine to
>>>> be slow in the case of an failed attach.
>>>>
>>>> Signed-off-by: Peter Zijlstra (Intel) 
>>>> Co-developed-by: Andrii Nakryiko 
>>>> Signed-off-by: Andrii Nakryiko 
>>>> ---
>>>>  include/linux/uprobes.h|  8 ++--
>>>>  kernel/events/uprobes.c| 18 ++
>>>>  kernel/trace/bpf_trace.c   |  5 -
>>>>  kernel/trace/trace_uprobe.c|  6 +-
>>>>  .../selftests/bpf/bpf_testmod/bpf_testmod.c|  3 ++-
>>>>  5 files changed, 31 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
>>>> index a1686c1ebcb6..8f1999eb9d9f 100644
>>>> --- a/include/linux/uprobes.h
>>>> +++ b/include/linux/uprobes.h
>>>> @@ -105,7 +105,8 @@ extern unsigned long uprobe_get_trap_addr(struct 
>>>> pt_regs *regs);
>>>>  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct 
>>>> mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
>>>>  extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, 
>>>> loff_t ref_ctr_offset, struct uprobe_consumer *uc);
>>>>  extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer 
>>>> *uc, bool);
>>>> -extern void uprobe_unregister(struct uprobe *uprobe, struct 
>>>> uprobe_consumer *uc);
>>>> +extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct 
>>>> uprobe_consumer *uc);
>>>> +extern void uprobe_unregister_sync(void);
>>>
>>> [...]
>>>
>>>>  static inline void
>>>> -uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>>>> +uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer 
>>>> *uc)
>>>> +{
>>>> +}
>>>> +static inline void uprobes_unregister_sync(void)
>>>
>>> *uprobes*_unregister_sync, is it a typo?
>>>
>>
>> I think the idea behind this is that you do a lot of individual uprobe
>> unregistrations with multiple uprobe_unregister() calls, and then
>> follow with a single *uprobes*_unregister_sync(), because in general
>> it is meant to sync multiple uprobes unregistrations.
> 
> Ah, I think you were trying to say that only static inline
> implementation here is called uprobes_unregister_sync, while all the
> other ones are uprobe_unregister_sync(). I fixed it up, kept it as
> singular uprobe_unregister_sync().
> 

Yes, that's exactly what I meant :)

>>
>>>>  {
>>>>  }
>>>>  static inline int uprobe_mmap(struct vm_area_struct *vma)
>>>> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
>>>> index 3b42fd355256..b0488d356399 100644
>>>> --- a/kernel/events/uprobes.c
>>>> +++ b/kernel/events/uprobes.c
> 
> [...]

-- 
BR
Liao, Chang



Re: [PATCH] uprobes: Improve scalability by reducing the contention on siglock

2024-08-05 Thread Liao, Chang



On 2024/8/2 17:24, Oleg Nesterov wrote:
> On 08/02, Liao, Chang wrote:
>>
>>
>> On 2024/8/1 22:06, Oleg Nesterov wrote:
>>> On 08/01, Liao Chang wrote:
>>>>
>>>> @@ -2276,22 +2277,25 @@ static void handle_singlestep(struct uprobe_task 
>>>> *utask, struct pt_regs *regs)
>>>>int err = 0;
>>>>
>>>>uprobe = utask->active_uprobe;
>>>> -  if (utask->state == UTASK_SSTEP_ACK)
>>>> +  switch (utask->state) {
>>>> +  case UTASK_SSTEP_ACK:
>>>>err = arch_uprobe_post_xol(&uprobe->arch, regs);
>>>> -  else if (utask->state == UTASK_SSTEP_TRAPPED)
>>>> +  break;
>>>> +  case UTASK_SSTEP_TRAPPED:
>>>>arch_uprobe_abort_xol(&uprobe->arch, regs);
>>>> -  else
>>>> +  fallthrough;
>>>> +  case UTASK_SSTEP_DENY_SIGNAL:
>>>> +  set_tsk_thread_flag(current, TIF_SIGPENDING);
>>>> +  break;
>>>> +  default:
>>>>WARN_ON_ONCE(1);
>>>> +  }
>>>
>>> Liao, at first glance this change looks "obviously wrong" to me.
>>
>> Oleg. Did i overlook some thing obvious here?
> 
> OK, lets suppose uprobe_deny_signal() sets UTASK_SSTEP_DENY_SIGNAL.
> 
> In this case handle_singlestep() will only set TIF_SIGPENDING and
> do nothing else. This is wrong, either _post_xol() or _abort_xol()
> must be called.
> 
> But I think handle_singlestep() will never hit this case. In the
> likely case uprobe_post_sstep_notifier() will replace _DENY_SIGNAL
> with _ACK, and this means that handle_singlestep() won't restore
> TIF_SIGPENDING cleared by uprobe_deny_signal().

You're absolutely right. handle_singlestep() has no chance to handle
_DENY_SIGNAL unless it is followed by setting TIF_UPROBE in
uprobe_deny_signal(). This means _DENY_SIGNAL is likely replaced during the
next uprobe single-step.

I believe introducing _DENY_SIGNAL as the intermediate state between
UTASK_SSTEP and UTASK_SSTEP_ACK is still necessary. This allows
uprobe_post_sstep_notifier() to correctly restore TIF_SIGPENDING upon
completion of the single-step.

A revised implementation would look like this:

--%<--
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1980,6 +1980,7 @@ bool uprobe_deny_signal(void)

if (task_sigpending(t)) {
clear_tsk_thread_flag(t, TIF_SIGPENDING);
+   utask->state = UTASK_SSTEP_DENY_SIGNAL;

if (__fatal_signal_pending(t) || 
arch_uprobe_xol_was_trapped(t)) {
utask->state = UTASK_SSTEP_TRAPPED;
@@ -2276,22 +2277,23 @@ static void handle_singlestep(struct uprobe_task 
*utask, struct pt_regs *regs)
int err = 0;

uprobe = utask->active_uprobe;
-   if (utask->state == UTASK_SSTEP_ACK)
+   switch (utask->state) {
+   case UTASK_SSTEP_ACK:
err = arch_uprobe_post_xol(&uprobe->arch, regs);
-   else if (utask->state == UTASK_SSTEP_TRAPPED)
+   break;
+   case UTASK_SSTEP_TRAPPED:
arch_uprobe_abort_xol(&uprobe->arch, regs);
-   else
+   set_thread_flag(TIF_SIGPENDING);
+   break;
+   default:
WARN_ON_ONCE(1);
+   }

put_uprobe(uprobe);
utask->active_uprobe = NULL;
utask->state = UTASK_RUNNING;
xol_free_insn_slot(current);

-   spin_lock_irq(&current->sighand->siglock);
-   recalc_sigpending(); /* see uprobe_deny_signal() */
-   spin_unlock_irq(&current->sighand->siglock);
-
if (unlikely(err)) {
uprobe_warn(current, "execute the probed insn, sending 
SIGILL.");
force_sig(SIGILL);
@@ -2351,6 +2353,8 @@ int uprobe_post_sstep_notifier(struct pt_regs *regs)
/* task is currently not uprobed */
return 0;

+   if (utask->state == UTASK_SSTEP_DENY_SIGNAL)
+   set_thread_flag(TIF_SIGPENDING);
utask->state = UTASK_SSTEP_ACK;
set_thread_flag(TIF_UPROBE);
return 1;

-->%--

> 
> Oleg.
> 
> 

-- 
BR
Liao, Chang



Re: [PATCH] uprobes: Improve scalability by reducing the contention on siglock

2024-08-08 Thread Liao, Chang



On 2024/8/7 18:17, Oleg Nesterov wrote:
> So. Liao, I am sorry, but I dislike your patch/approach in any case.
> 
> UTASK_SSTEP_DENY_SIGNAL complicates the state machine. And I don't like the 
> fact
> that set_thread_flag(TIF_SIGPENDING) is called twice, from handle_singlestep()
> and uprobe_post_sstep_notifier(), this complicates the logic even more.
> 
> We need a flag, not the new state.
> 
> And if I read this patch correctly it is wrong:
> 
>   - uprobe_deny_signal() clears TIF_SIGPENDING and sets 
> UTASK_SSTEP_DENY_SIGNAL
> 
>   - another signal cames after that and sets TIF_SIGPENDING again
> 
>   - in this case the task won't return to user-space and execute the 
> probed
> insn, exit_to_user_mode_loop() will notice another TIF_SIGPENDING and
> call arch_do_signal_or_restart()->get_signal() again.
> 
>   - get_signal() will call uprobe_deny_signal() again hit
> 
>   WARN_ON_ONCE(utask->state != UTASK_SSTEP);
> 
> 
> And no, we shouldn't change this check into UTASK_SSTEP || 
> UTASK_SSTEP_DENY_SIGNAL.
> Again, the fact that uprobe_deny_signal() cleared TIF_SIGPENDING must not be 
> the
> new state.

Oleg, if I understand correctly, the current state machine expects the
single-step handling to end up with either _ACK or _TRAPPED, and any new state
would disrupt that logic. If so, I'm convinced that adding a new state is
unnecessary. As you mentioned, I propose using a boolean flag in the
uprobe_task data to track whether a signal should be restored, at the cost of
a slightly larger structure. Here's an outline of the changes, with a rough
sketch after the list:

  - pre_ssout() resets the deny signal flag

  - uprobe_deny_signal() sets the deny signal flag when TIF_SIGPENDING is cleared.

  - handle_singlestep() checks the deny signal flag and restores TIF_SIGPENDING if necessary.
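
To make the outline concrete, here is a rough, non-compilable sketch (unrelated
code elided with "...") of where the flag would be touched. The field name
deny_signal and its placement in struct uprobe_task are assumptions at this
point, not the final patch:

	static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
	{
		...
		/* start each single-step with a clean flag */
		utask->deny_signal = false;
		...
	}

	bool uprobe_deny_signal(void)
	{
		...
		if (task_sigpending(t)) {
			/* remember that TIF_SIGPENDING was suppressed for this task */
			utask->deny_signal = true;
			clear_tsk_thread_flag(t, TIF_SIGPENDING);
			...
		}
		...
	}

	static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
	{
		...
		/* restore the suppressed signal without taking siglock */
		if (utask->deny_signal) {
			set_thread_flag(TIF_SIGPENDING);
			utask->deny_signal = false;
		}
		...
	}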

Does this approach look correct to you, or do you have any other way to
implement the "flag"?

Thanks.

> 
> Oleg.
> 
> On 08/06, Oleg Nesterov wrote:
>>
>> On 08/06, Liao, Chang wrote:
>>>
>>> You're absolutely right. handle_signlestep() has chance to handle 
>>> _DENY_SIGANL
>>> unless it followed by setting TIF_UPROBE in uprobe_deny_signal(). This means
>>> _DENY_SIGNAL is likey replaced during next uprobe single-stepping.
>>>
>>> I believe introducing _DENY_SIGNAL as the immediate state between 
>>> UTASK_SSTEP
>>> and UTASK_SSTEP_ACK is still necessary. This allow 
>>> uprobe_post_sstep_notifier()
>>> to correctly restore TIF_SIGPENDING upon the completion of single-step.
>>>
>>> A revised implementation would look like this:
>>
>> Still looks "obviously wrong" to me... even the approach itself.
>>
>> Perhaps I am wrong, yet another day when I can't even read emails on lkml
>> carefully, sorry.
>>
>> But can you please send the patch which I could actually apply? This one
>> looks white-space damaged...
>>
>> I'll try to reply with more details as soon I convince myself I fully
>> understand what does your patch actually do, but most probably not tomorrow.
>>
>> Thanks,
>>
>> Oleg.
>>
>>> --%<--
>>> --- a/kernel/events/uprobes.c
>>> +++ b/kernel/events/uprobes.c
>>> @@ -1980,6 +1980,7 @@ bool uprobe_deny_signal(void)
>>>
>>> if (task_sigpending(t)) {
>>> clear_tsk_thread_flag(t, TIF_SIGPENDING);
>>> +   utask->state = UTASK_SSTEP_DENY_SIGNAL;
>>>
>>> if (__fatal_signal_pending(t) || 
>>> arch_uprobe_xol_was_trapped(t)) {
>>> utask->state = UTASK_SSTEP_TRAPPED;
>>> @@ -2276,22 +2277,23 @@ static void handle_singlestep(struct uprobe_task 
>>> *utask, struct pt_regs *regs)
>>> int err = 0;
>>>
>>> uprobe = utask->active_uprobe;
>>> -   if (utask->state == UTASK_SSTEP_ACK)
>>> +   switch (utask->state) {
>>> +   case UTASK_SSTEP_ACK:
>>> err = arch_uprobe_post_xol(&uprobe->arch, regs);
>>> -   else if (utask->state == UTASK_SSTEP_TRAPPED)
>>> +   break;
>>> +   case UTASK_SSTEP_TRAPPED:
>>> arch_uprobe_abort_xol(&uprobe->arch, regs);
>>> -   else
>>> +   set_thread_flag(TIF_SIGPENDING);
>>> +   break;
>>> +   default:
>>> WARN_ON_ONCE(1);
>>> +   }
>>>
>>> put_uprobe(upr

Re: [PATCH 0/8] uprobes: RCU-protected hot path optimizations

2024-08-08 Thread Liao, Chang



On 2024/8/8 1:31, Andrii Nakryiko wrote:
> On Wed, Aug 7, 2024 at 10:11 AM Oleg Nesterov  wrote:
>>
>> On 08/07, Andrii Nakryiko wrote:
>>>
>>> Yes, I was waiting for more of Peter's comments, but I guess I'll just
>>> send a v2 today.
>>
>> OK,
>>
>>> I'll probably include the SRCU+timeout logic for
>>> return_instances, and maybe lockless VMA parts as well.
>>
>> Well, feel free to do what you think right, but perhaps it would be
>> better to push this series first? at least 1-4.
> 
> Ok, I can send those first 4 patches first and hopefully we can land
> them soon and move to the next part. I just also wrote up details
> about that crash in rb_find_rcu().
> 
>>
>> As for lockless VMA. To me this needs more discussions. I didn't read
> 
> We are still discussing, feel free to join the conversation.
> 
>> your conversation with Peter and Suren carefully, but I too have some
>> concerns. Most probably I am wrong, and until I saw this thread I didn't
>> even know that vm_area_free() uses call_rcu() if CONFIG_PER_VMA_LOCK,
>> but still.
>>
>>>> As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
>>>
>>> Honestly curious, why the preference?
>>
>> Well, you can safely ignore me, but since you have asked ;)
>>
>> I understand what SRCU does, and years ago I even understood (I hope)
>> the implementation. More or less the same for rcu_tasks. But as for
>> the _trace flavour, I simply fail to understand its semantics.
> 
> Ok, I won't try to repeat Paul's explanations. If you are curious you
> can find them in comments to my previous batch register/unregister API
> patches.
> 
>>
>>> BTW, while you are here :) What can you say about
>>> current->sighand->siglock use in handle_singlestep()?
>>
>> It should die, and this looks simple. I disagree with the patches
>> from Liao, see the
>> https://lore.kernel.org/all/20240801082407.1618451-1-liaocha...@huawei.com/
>> thread, but I agree with the intent.
> 
> I wasn't aware of this patch, thanks for mentioning it. Strange that
> me or at least b...@vger.kernel.org wasn't CC'ed.
> 
> Liao, please cc bpf@ mailing list for future patches like that.

OK, sorry about that.

> 
>>
>> IMO, we need a simple "bool restore_sigpending" in uprobe_task, it will make 
>> the
>> necessary changes really simple.
> 

[...]

>>
>> (To clarify. In fact I think that a new TIF_ or even PF_ flag makes more 
>> sense,
>>  afaics it can have more users. But I don't think that uprobes can provide 
>> enough
>>  justification for that right now)

I also faced the same choice when Oleg suggested adding a new flag to track
the denied signal. Since I haven't encountered scenarios outside of uprobes
that would deny a signal, I'm not confident about introducing a new TIF_ flag
without fully understanding the potential impacts.

>>
>> Oleg.
>>

-- 
BR
Liao, Chang



Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-08 Thread Liao, Chang
Hi Andrii and Oleg.

The patch I sent two weeks ago also aims to optimize the performance of
uprobes on arm64. I've noticed recent discussions on the performance and
scalability of uprobes on the mailing list. Considering this interest, I've
added you and other relevant maintainers to the CC list for broader visibility
and potential collaboration.

Thanks.

On 2024/7/27 17:44, Liao Chang wrote:
> The profiling result of single-thread model of selftests bench reveals
> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> ARM64. On my local testing machine, 5% of CPU time is consumed by
> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> 
> This patch introduce struct uprobe_breakpoint to track previously
> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> need for redundant insn_slot writes and subsequent expensive cache
> flush, especially on architecture like ARM64. This patch has been tested
> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> bench and Redis GET/SET benchmark result below reveal obivious
> performance gain.
> 
> before-opt
> --
> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> Redis SET (RPS) uprobe: 42728.52
> Redis GET (RPS) uprobe: 43640.18
> Redis SET (RPS) uretprobe: 40624.54
> Redis GET (RPS) uretprobe: 41180.56
> 
> after-opt
> -
> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> Redis SET (RPS) uprobe: 43939.69
> Redis GET (RPS) uprobe: 45200.80
> Redis SET (RPS) uretprobe: 41658.58
> Redis GET (RPS) uretprobe: 42805.80
> 
> While some uprobes might still need to share the same insn_slot, this
> patch compare the instructions in the resued insn_slot with the
> instructions execute out-of-line firstly to decides allocate a new one
> or not.
> 
> Additionally, this patch use a rbtree associated with each thread that
> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> rbtree of uprobe_breakpoints has smaller node, better locality and less
> contention, it result in faster lookup times compared to find_uprobe().
> 
> The other part of this patch are some necessary memory management for
> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> hit uprobe that doesn't already have a corresponding node in rbtree. All
> uprobe_breakpoints will be freed when thread exit.
> 
> Signed-off-by: Liao Chang 
> ---
>  include/linux/uprobes.h |   3 +
>  kernel/events/uprobes.c | 246 +---
>  2 files changed, 211 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index f46e0ca0169c..04ee465980af 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -78,6 +78,9 @@ struct uprobe_task {
>  
>   struct return_instance  *return_instances;
>   unsigned intdepth;
> +
> + struct rb_root  breakpoints_tree;
> + rwlock_tbreakpoints_treelock;
>  };
>  
>  struct return_instance {
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 2c83ba776fc7..3f1a6dc2a327 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -33,6 +33,7 @@
>  #define MAX_UPROBE_XOL_SLOTS UINSNS_PER_PAGE
>  
>  static struct rb_root uprobes_tree = RB_ROOT;
> +
>  /*
>   * allows us to skip the uprobe_mmap if there are no uprobe events active
>   * at this time.  Probably a fine grained per inode count is better?
> @@ -886,6 +887,174 @@ static bool filter_chain(struct uprobe *uprobe,
>   return ret;
>  }
>  
> +static struct uprobe_task *get_utask(void);
> +static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr);
> +static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int 
> *is_swbp);
> +
> +struct uprobe_breakpoint {
> + struct rb_node  rb_node;
> + refcount_t  ref;
> + unsigned long   slot;
> + unsigned long   vaddr;
> + struct uprobe   *uprobe;
> +};
> +

Re: [PATCH] uprobes: Improve scalability by reducing the contention on siglock

2024-08-08 Thread Liao, Chang



On 2024/8/8 18:28, Oleg Nesterov wrote:
> On 08/08, Liao, Chang wrote:
>>
>>   - pre_ssout() resets the deny signal flag
>>
>>   - uprobe_deny_signal() sets the deny signal flag when TIF_SIGPENDING is 
>> cleared.
>>
>>   - handle_singlestep() check the deny signal flag and restore 
>> TIF_SIGPENDING if necessary.
>>
>> Does this approach look correct to you,do do you have any other way to 
>> implement the "flag"?
> 
> Yes. But I don't think pre_ssout() needs to clear this flag. 
> handle_singlestep() resets/clears
> state, active_uprobe, frees insn slot. So I guess we only need
> 
> 
> --- x/kernel/events/uprobes.c
> +++ x/kernel/events/uprobes.c
> @@ -2308,9 +2308,10 @@ static void handle_singlestep(struct upr
>   utask->state = UTASK_RUNNING;
>   xol_free_insn_slot(current);
>  
> - spin_lock_irq(&current->sighand->siglock);
> - recalc_sigpending(); /* see uprobe_deny_signal() */
> - spin_unlock_irq(&current->sighand->siglock);
> + if (utask->xxx) {
> + set_thread_flag(TIF_SIGPENDING);
> + utask->xxx = 0;
> + }

Agreed. If there is no more discussion about this flag, I will just send v2 today.

Thanks.

>  
>   if (unlikely(err)) {
>   uprobe_warn(current, "execute the probed insn, sending 
> SIGILL.");
> 
> and that is all.
> 
> Oleg.
> 
> 

-- 
BR
Liao, Chang



[PATCH v2 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()

2024-08-08 Thread Liao Chang
Since clearing a bit in thread_info is an atomic operation, the spinlock
is redundant and can be removed; reducing lock contention is also good
for performance.

Acked-by: Oleg Nesterov 
Signed-off-by: Liao Chang 
---
 kernel/events/uprobes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 73cc47708679..76a51a1f51e2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1979,9 +1979,7 @@ bool uprobe_deny_signal(void)
WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
if (task_sigpending(t)) {
-   spin_lock_irq(&t->sighand->siglock);
clear_tsk_thread_flag(t, TIF_SIGPENDING);
-   spin_unlock_irq(&t->sighand->siglock);
 
if (__fatal_signal_pending(t) || 
arch_uprobe_xol_was_trapped(t)) {
utask->state = UTASK_SSTEP_TRAPPED;
-- 
2.34.1




[PATCH v2 2/2] uprobes: Remove the spinlock within handle_singlestep()

2024-08-08 Thread Liao Chang
This patch introduces a flag to track that TIF_SIGPENDING was suppressed
temporarily during uprobe single-stepping. Once the single-step has been
handled and the flag is confirmed, TIF_SIGPENDING can be restored directly
without acquiring the siglock in most cases, reducing contention and improving
overall performance.

I've used the script developed by Andrii in [1] to run the benchmark. The CPU
used was a Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz, running the
kernel from the next tree plus the optimization for get_xol_insn_slot() [2].

before-opt
--
uprobe-nop  ( 1 cpus):0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop  ( 2 cpus):1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop  ( 4 cpus):3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop  ( 8 cpus):4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop  (16 cpus):3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop  (32 cpus):2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop  (64 cpus):1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push ( 1 cpus):0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push ( 2 cpus):1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push ( 4 cpus):3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push ( 8 cpus):3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push (16 cpus):2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push (32 cpus):2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push (64 cpus):1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret  ( 1 cpus):1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret  ( 2 cpus):3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret  ( 4 cpus):6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret  ( 8 cpus):6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret  (16 cpus):5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret  (32 cpus):5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret  (64 cpus):4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
-
uprobe-nop  ( 1 cpus):0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop  ( 2 cpus):1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop  ( 4 cpus):3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop  ( 8 cpus):5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop  (16 cpus):6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop  (32 cpus):4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop  (64 cpus):2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push ( 1 cpus):0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push ( 2 cpus):1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push ( 4 cpus):3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push ( 8 cpus):4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push (16 cpus):5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push (32 cpus):5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push (64 cpus):2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret  ( 1 cpus):1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret  ( 2 cpus):3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret  ( 4 cpus):6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret  ( 8 cpus):6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret  (16 cpus):6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret  (32 cpus):5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret  (64 cpus):4.743 ± 0.016M/s  (  0.074M/s/cpu)

The benchmark results above demonstrate an obvious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
rises from 4.5M/s to 6.4M/s and from 3.3M/s to 5.1M/s respectively.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-and...@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaocha...@huawei.com

Signed-off-by: Liao Chang 
---
 include/linux/uprobes.h | 1 +
 kernel/events/uprobes.c | 8 +---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..49403e68307b 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -75,6 +75,7 @@ struct uprobe_task {
 
struct uprobe   *active_uprobe;
unsigned long   xol_vaddr;
+   booldeny_signal;
 
struct return_instance  *return_instances;
unsigned intdepth;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 76a51a1f51e2..77934fbd1370 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1979,6 +1979,7 @@ bool uprobe_deny_signal(void)
WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
if (task_sigpending(t)) {
+   utask->deny_signal = true;
clear_tsk_thread_flag(t, TIF_SIGPENDING);
 
if (__fatal_signal_pending(t) || 
arch_uprobe_xol_was_trapped(t)) {
@@ -2288,9 +2289,10 @@ static void handle_singlestep(struct uprobe_task *utask, 
struct pt_regs *regs)
utask->state = UTASK_RUNNING;
xol_free_insn_slot(current);
 
-   sp

[PATCH v2 0/2] uprobes: Improve scalability by reducing the contention on siglock

2024-08-08 Thread Liao Chang
The profiling result of the BPF selftests on an ARM64 platform reveals that
the significant contention on current->sighand->siglock is the scalability
bottleneck. The reason is straightforward: all producer threads of the
benchmark have to contend for that spinlock in order to restore the
TIF_SIGPENDING bit in thread_info that might have been cleared in
uprobe_deny_signal().

This contention on current->sighand->siglock is unnecessary, and this series
removes it entirely. I've used the script developed by Andrii in [1] to run
the benchmark. The CPU used was a Kunpeng916 (Hi1616), 4 NUMA nodes,
64 cores@2.4GHz, running the kernel from the next tree plus the optimization
in [2] for get_xol_insn_slot().

before-opt
--
uprobe-nop  ( 1 cpus):0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop  ( 2 cpus):1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop  ( 4 cpus):3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop  ( 8 cpus):4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop  (16 cpus):3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop  (32 cpus):2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop  (64 cpus):1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push ( 1 cpus):0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push ( 2 cpus):1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push ( 4 cpus):3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push ( 8 cpus):3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push (16 cpus):2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push (32 cpus):2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push (64 cpus):1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret  ( 1 cpus):1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret  ( 2 cpus):3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret  ( 4 cpus):6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret  ( 8 cpus):6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret  (16 cpus):5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret  (32 cpus):5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret  (64 cpus):4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
-
uprobe-nop  ( 1 cpus):0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop  ( 2 cpus):1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop  ( 4 cpus):3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop  ( 8 cpus):5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop  (16 cpus):6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop  (32 cpus):4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop  (64 cpus):2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push ( 1 cpus):0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push ( 2 cpus):1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push ( 4 cpus):3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push ( 8 cpus):4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push (16 cpus):5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push (32 cpus):5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push (64 cpus):2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret  ( 1 cpus):1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret  ( 2 cpus):3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret  ( 4 cpus):6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret  ( 8 cpus):6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret  (16 cpus):6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret  (32 cpus):5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret  (64 cpus):4.743 ± 0.016M/s  (  0.074M/s/cpu)

The benchmark results above demonstrate an obvious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
rises from 4.5M/s to 6.4M/s and from 3.3M/s to 5.1M/s respectively.

v1->v2:
Oleg pointed out that _DENY_SIGNAL will be replaced by _ACK upon completion of
the single-step, which leaves handle_singlestep() no chance to restore the
removed TIF_SIGPENDING [3], along with some other problematic cases. So this
revision proposes using a flag in uprobe_task to track the denied
TIF_SIGPENDING instead of a new UPROBE_SSTEP state.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-and...@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaocha...@huawei.com
[3] https://lore.kernel.org/all/20240801082407.1618451-1-liaocha...@huawei.com

Liao Chang (2):
  uprobes: Remove redundant spinlock in uprobe_deny_signal()
  uprobes: Remove the spinlock within handle_singlestep()

 include/linux/uprobes.h |  1 +
 kernel/events/uprobes.c | 10 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

-- 
2.34.1




Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-08 Thread Liao, Chang



On 2024/8/9 2:49, Oleg Nesterov wrote:
> Hi Liao,
> 
> To be honest I didn't even try to look at your patch, sorry.
> 
> Because I think it would be better to delay it in an case. Until Andrii
> finishes/pushes his optimization changes which (in particular) include
> find_active_uprobe_rcu/etc.
> 
> Then you can rebease and re-benchmark your patch on top of these changes.

Oleg, thanks for your guidance. I'll keep an eye on the development of
Andrii's changes and re-evaluate my patch. To ensure minimal disruption, I'll
hold off on pushing the patch further until Andrii's changes settle down.

> 
> Oleg.
> 
> 
> On 08/08, Liao, Chang wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch sent by me two weeks ago also aim to optimize the performance of 
>> uprobe
>> on arm64. I notice recent discussions on the performance and scalability of 
>> uprobes
>> within the mailing list. Considering this interest, I've added you and other 
>> relevant
>> maintainers to the CC list for broader visibility and potential 
>> collaboration.
>>
>> Thanks.
>>
>> On 2024/7/27 17:44, Liao Chang wrote:
>>> The profiling result of single-thread model of selftests bench reveals
>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduce struct uprobe_breakpoint to track previously
>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>> need for redundant insn_slot writes and subsequent expensive cache
>>> flush, especially on architecture like ARM64. This patch has been tested
>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>> bench and Redis GET/SET benchmark result below reveal obivious
>>> performance gain.
>>>
>>> before-opt
>>> --
>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>> Redis SET (RPS) uprobe: 42728.52
>>> Redis GET (RPS) uprobe: 43640.18
>>> Redis SET (RPS) uretprobe: 40624.54
>>> Redis GET (RPS) uretprobe: 41180.56
>>>
>>> after-opt
>>> -
>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>> Redis SET (RPS) uprobe: 43939.69
>>> Redis GET (RPS) uprobe: 45200.80
>>> Redis SET (RPS) uretprobe: 41658.58
>>> Redis GET (RPS) uretprobe: 42805.80
>>>
>>> While some uprobes might still need to share the same insn_slot, this
>>> patch compare the instructions in the resued insn_slot with the
>>> instructions execute out-of-line firstly to decides allocate a new one
>>> or not.
>>>
>>> Additionally, this patch use a rbtree associated with each thread that
>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>> contention, it result in faster lookup times compared to find_uprobe().
>>>
>>> The other part of this patch are some necessary memory management for
>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>> uprobe_breakpoints will be freed when thread exit.
>>>
>>> Signed-off-by: Liao Chang 
>>> ---
>>>  include/linux/uprobes.h |   3 +
>>>  kernel/events/uprobes.c | 246 +---
>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
>>> index f46e0ca0169c..04ee465980af 100644
>>> --- a/include/linux/uprobes.h
>>> +++ b/include/linux/uprobes.h
>>> @@ -78,6 +78,9 @@ struct uprobe_task {
>>>
>

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-09 Thread Liao, Chang



On 2024/8/9 2:26, Andrii Nakryiko wrote:
> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang  wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch sent by me two weeks ago also aim to optimize the performance of 
>> uprobe
>> on arm64. I notice recent discussions on the performance and scalability of 
>> uprobes
>> within the mailing list. Considering this interest, I've added you and other 
>> relevant
>> maintainers to the CC list for broader visibility and potential 
>> collaboration.
>>
> 
> Hi Liao,
> 
> As you can see there is an active work to improve uprobes, that
> changes lifetime management of uprobes, removes a bunch of locks taken
> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> hold off a bit with your changes until all that lands. And then
> re-benchmark, as costs might shift.

Andrii, I'm trying to integrate your lockless changes into the upstream
next-20240806 kernel tree and ran into some conflicts. Please let me know
which kernel tree you're currently working on.

Thanks.

> 
> But also see some remarks below.
> 
>> Thanks.
>>
>> On 2024/7/27 17:44, Liao Chang wrote:
>>> The profiling result of single-thread model of selftests bench reveals
>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduce struct uprobe_breakpoint to track previously
>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>> need for redundant insn_slot writes and subsequent expensive cache
>>> flush, especially on architecture like ARM64. This patch has been tested
>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>> bench and Redis GET/SET benchmark result below reveal obivious
>>> performance gain.
>>>
>>> before-opt
>>> --
>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> 
> I'm surprised that nop and push variants are much slower than ret
> variant. This is exactly opposite on x86-64. Do you have an
> explanation why this might be happening? I see you are trying to
> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> variant of uprobe that normally shouldn't be used. Typically uprobe is
> installed on nop (for USDT) and on function entry (which would be push
> variant, `push %rbp` instruction).
> 
> ret variant, for x86-64, causes one extra step to go back to user
> space to execute original instruction out-of-line, and then trapping
> back to kernel for running uprobe. Which is what you normally want to
> avoid.
> 
> What I'm getting at here. It seems like maybe arm arch is missing fast
> emulated implementations for nops/push or whatever equivalents for
> ARM64 that is. Please take a look at that and see why those are slow
> and whether you can make those into fast uprobe cases?

I will spend the weekend figuring out the questions you raised. Thanks for
pointing them out.

> 
>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>> Redis SET (RPS) uprobe: 42728.52
>>> Redis GET (RPS) uprobe: 43640.18
>>> Redis SET (RPS) uretprobe: 40624.54
>>> Redis GET (RPS) uretprobe: 41180.56
>>>
>>> after-opt
>>> -
>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>> Redis SET (RPS) uprobe: 43939.69
>>> Redis GET (RPS) uprobe: 45200.80
>>> Redis SET (RPS) uretprobe: 41658.58
>>> Redis GET (RPS) uretprobe: 42805.80
>>>
>>> While some uprobes might still need to share the same insn_slot, this
>>> patch compare the instructions in the resued insn_slot with the
>>> instructions execute out-of-line firstly to decides allocate a new one
>>> or not.
>>>
>>> Additionally, this patch use a rbtree associated with each thread that
&g

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-12 Thread Liao, Chang



On 2024/8/9 2:26, Andrii Nakryiko wrote:
> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang  wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch sent by me two weeks ago also aim to optimize the performance of 
>> uprobe
>> on arm64. I notice recent discussions on the performance and scalability of 
>> uprobes
>> within the mailing list. Considering this interest, I've added you and other 
>> relevant
>> maintainers to the CC list for broader visibility and potential 
>> collaboration.
>>
> 
> Hi Liao,
> 
> As you can see there is an active work to improve uprobes, that
> changes lifetime management of uprobes, removes a bunch of locks taken
> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> hold off a bit with your changes until all that lands. And then
> re-benchmark, as costs might shift.
> 
> But also see some remarks below.
> 
>> Thanks.
>>
>> On 2024/7/27 17:44, Liao Chang wrote:
>>> The profiling result of single-thread model of selftests bench reveals
>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduce struct uprobe_breakpoint to track previously
>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>> need for redundant insn_slot writes and subsequent expensive cache
>>> flush, especially on architecture like ARM64. This patch has been tested
>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>> bench and Redis GET/SET benchmark result below reveal obivious
>>> performance gain.
>>>
>>> before-opt
>>> --
>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> 
> I'm surprised that nop and push variants are much slower than ret
> variant. This is exactly opposite on x86-64. Do you have an
> explanation why this might be happening? I see you are trying to
> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> variant of uprobe that normally shouldn't be used. Typically uprobe is
> installed on nop (for USDT) and on function entry (which would be push
> variant, `push %rbp` instruction).
> 
> ret variant, for x86-64, causes one extra step to go back to user
> space to execute original instruction out-of-line, and then trapping
> back to kernel for running uprobe. Which is what you normally want to
> avoid.
> 
> What I'm getting at here. It seems like maybe arm arch is missing fast
> emulated implementations for nops/push or whatever equivalents for
> ARM64 that is. Please take a look at that and see why those are slow
> and whether you can make those into fast uprobe cases?

Hi Andrii,

As you correctly pointed out, the benchmark result on Arm64 is
counterintuitive compared to the x86 behavior. My investigation revealed that
the root cause lies in arch_uprobe_analyse_insn(), which excludes the Arm64
equivalents of the 'nop' and 'push' instructions from the emulatable
instruction list. This forces the kernel to single-step these instructions
out-of-line in userspace after the breakpoint exception is handled, leading to
a significant performance overhead compared to the 'ret' variant, which is
already emulated (see the handle_swbp() excerpt below).

To address this issue, I've developed a patch supports  the emulation of 'nop' 
and
'push' variants. The benchmark results below indicates the performance gain of
emulation is obivious.

xol (1 cpus)

uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
---
uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)

As you can see, the performance gap between the nop/push and ret variants has
been significantly reduced. Because the emulation of the 'push' instruction
needs to access userspace memory, it spends more cycles than the others.
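
To make that cost concrete, here is a minimal, hypothetical sketch of what
emulating the entry 'stp x29, x30, [sp, #imm]!' involves; the helper name and
signature below are my own illustration, not the code from my patch:

/*
 * Hedged sketch, not the actual patch: emulating the ARM64 function-entry
 * push means writing 16 bytes to the user stack, which is why it costs
 * more than emulating 'nop'.
 */
#include <linux/types.h>
#include <linux/bitops.h>
#include <linux/uaccess.h>
#include <asm/insn.h>
#include <asm/ptrace.h>

static bool sketch_emulate_stp_fp_lr(u32 insn, struct pt_regs *regs)
{
	/* imm7 of a pre-index STP lives in bits [21:15], scaled by 8 (64-bit) */
	s64 offset = sign_extend64((insn >> 15) & 0x7f, 6) * 8;
	u64 frame[2] = { regs->regs[29], regs->regs[30] };	/* fp, lr */
	u64 new_sp = regs->sp + offset;				/* pre-index update */

	/* The uaccess write is the extra cost compared to the 'nop' case. */
	if (copy_to_user((void __user *)new_sp, frame, sizeof(frame)))
		return false;		/* caller would fall back to single-step */

	regs->sp = new_sp;
	regs->pc += AARCH64_INSN_SIZE;	/* skip the emulated instruction */
	return true;
}

By contrast, emulating 'nop' only needs the pc adjustment, which matches the
remaining gap between the two emulated numbers above.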

> 
>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-12 Thread Liao, Chang



在 2024/8/10 2:40, Andrii Nakryiko 写道:
> On Fri, Aug 9, 2024 at 11:34 AM Andrii Nakryiko
>  wrote:
>>
>> On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang  wrote:
>>>
>>>
>>>
>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang  wrote:
>>>>>
>>>>> Hi Andrii and Oleg.
>>>>>
>>>>> This patch sent by me two weeks ago also aim to optimize the performance 
>>>>> of uprobe
>>>>> on arm64. I notice recent discussions on the performance and scalability 
>>>>> of uprobes
>>>>> within the mailing list. Considering this interest, I've added you and 
>>>>> other relevant
>>>>> maintainers to the CC list for broader visibility and potential 
>>>>> collaboration.
>>>>>
>>>>
>>>> Hi Liao,
>>>>
>>>> As you can see there is an active work to improve uprobes, that
>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>> hold off a bit with your changes until all that lands. And then
>>>> re-benchmark, as costs might shift.
>>>
>>> Andrii, I'm trying to integrate your lockless changes into the upstream
>>> next-20240806 kernel tree. And I ran into some conflicts. please let me
>>> know which kernel you're currently working on.
>>>
>>
>> My patches are  based on tip/perf/core. But I also just pushed all the
>> changes I have accumulated (including patches I haven't sent for
>> review just yet), plus your patches for sighand lock removed applied
>> on top into [0]. So you can take a look and use that as a base for
>> now. Keep in mind, a bunch of those patches might still change, but
>> this should give you the best currently achievable performance with
>> uprobes/uretprobes. E.g., I'm getting something like below on x86-64
>> (note somewhat linear scalability with number of CPU cores, with
>> per-CPU performance *slowly* declining):
>>
>> uprobe-nop( 1 cpus):3.565 ± 0.004M/s  (  3.565M/s/cpu)
>> uprobe-nop( 2 cpus):6.742 ± 0.009M/s  (  3.371M/s/cpu)
>> uprobe-nop( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
>> uprobe-nop( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
>> uprobe-nop( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
>> uprobe-nop( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
>> uprobe-nop( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
>> uprobe-nop( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
>> uprobe-nop(10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
>> uprobe-nop(12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
>> uprobe-nop(14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
>> uprobe-nop(16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
>> uprobe-nop(24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
>> uprobe-nop(32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
>> uprobe-nop(40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
>> uprobe-nop(48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
>> uprobe-nop(56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
>> uprobe-nop(64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
>> uprobe-nop(72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
>> uprobe-nop(80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)
>>
>> uretprobe-nop ( 1 cpus):1.987 ± 0.001M/s  (  1.987M/s/cpu)
>> uretprobe-nop ( 2 cpus):3.766 ± 0.003M/s  (  1.883M/s/cpu)
>> uretprobe-nop ( 3 cpus):5.638 ± 0.002M/s  (  1.879M/s/cpu)
>> uretprobe-nop ( 4 cpus):7.275 ± 0.003M/s  (  1.819M/s/cpu)
>> uretprobe-nop ( 5 cpus):9.124 ± 0.004M/s  (  1.825M/s/cpu)
>> uretprobe-nop ( 6 cpus):   10.818 ± 0.007M/s  (  1.803M/s/cpu)
>> uretprobe-nop ( 7 cpus):   12.721 ± 0.014M/s  (  1.817M/s/cpu)
>> uretprobe-nop ( 8 cpus):   13.639 ± 0.007M/s  (  1.705M/s/cpu)
>> uretprobe-nop (10 cpus):   17.023 ± 0.009M/s  (  1.702M/s/cpu)
>> uretprobe-nop (12 cpus):   18.576 ± 0.014M/s  (  1.548M/s/cpu)
>> uretprobe-nop (14 cpus):   21.660 ± 0.004M/s  (  1.547M/s/cpu)
>> uretprobe-nop (16 cpus):   22.922 ± 0.013M/s  (  1.433M/s/cpu)
>> uretprobe-nop (24 cpus):   34.756 ± 0.069M/s  (  1.448M/s/c

Re: [PATCH v2 2/2] uprobes: Remove the spinlock within handle_singlestep()

2024-08-13 Thread Liao, Chang



在 2024/8/12 19:29, Oleg Nesterov 写道:
> On 08/09, Liao Chang wrote:
>>
>> --- a/include/linux/uprobes.h
>> +++ b/include/linux/uprobes.h
>> @@ -75,6 +75,7 @@ struct uprobe_task {
>>  
>>  struct uprobe   *active_uprobe;
>>  unsigned long   xol_vaddr;
>> +booldeny_signal;
> 
> Ack, but... I can't believe I am arguing with the naming ;)
> Can we have a better name for this flag?
> 
>   utask->signal_denied ?
>   utask->restore_sigpending ?

I prefer the name "signal_denied" as it more accurately reflects
what happened.

> 
> or just
> 
>   utask->sigpending ?
> 
> utask->deny_signal looks as if handle_singlestep/whatever should
> "deny" the pending signal cleared by uprobe_deny_signal(), while
> it fact it should restore TIF_SIGPENDING.

Makes sense. I will change the name in v3.

> 
> Oleg.
> 
>>  
>>  struct return_instance  *return_instances;
>>  unsigned intdepth;
>> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
>> index 76a51a1f51e2..77934fbd1370 100644
>> --- a/kernel/events/uprobes.c
>> +++ b/kernel/events/uprobes.c
>> @@ -1979,6 +1979,7 @@ bool uprobe_deny_signal(void)
>>  WARN_ON_ONCE(utask->state != UTASK_SSTEP);
>>  
>>  if (task_sigpending(t)) {
>> +utask->deny_signal = true;
>>  clear_tsk_thread_flag(t, TIF_SIGPENDING);
>>  
>>  if (__fatal_signal_pending(t) || 
>> arch_uprobe_xol_was_trapped(t)) {
>> @@ -2288,9 +2289,10 @@ static void handle_singlestep(struct uprobe_task 
>> *utask, struct pt_regs *regs)
>>  utask->state = UTASK_RUNNING;
>>  xol_free_insn_slot(current);
>>  
>> -spin_lock_irq(¤t->sighand->siglock);
>> -recalc_sigpending(); /* see uprobe_deny_signal() */
>> -spin_unlock_irq(¤t->sighand->siglock);
>> +if (utask->deny_signal) {
>> +set_thread_flag(TIF_SIGPENDING);
>> +utask->deny_signal = false;
>> +}
>>  
>>  if (unlikely(err)) {
>>  uprobe_warn(current, "execute the probed insn, sending 
>> SIGILL.");
>> -- 
>> 2.34.1
>>
> 
> 

-- 
BR
Liao, Chang



Re: [PATCH v2 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()

2024-08-13 Thread Liao, Chang



在 2024/8/12 20:07, Oleg Nesterov 写道:
> On 08/09, Liao Chang wrote:
>>
>> Since clearing a bit in thread_info is an atomic operation, the spinlock
>> is redundant and can be removed, reducing lock contention is good for
>> performance.
> 
> My ack still stays, but let me add some notes...
> 
> sighand->siglock doesn't protect clear_bit() per se. It was used to not
> break the "the state of TIF_SIGPENDING of every thread is stable with
> sighand->siglock held" rule.
> 
> But we already have the lockless users of clear_thread_flag(TIF_SIGPENDING)
> (some if not most of them look buggy), and afaics in this (very special)
> case it should be fine.

Oleg, your explanation is more accurate. So I will reword the commit log and
quote some of your notes like this:

  Since we already have lockless users of clear_thread_flag(TIF_SIGPENDING),
  and the uprobe single-step case doesn't break the rule of "the state of
  TIF_SIGPENDING of every thread is stable with sighand->siglock held", drop
  sighand->siglock here to reduce contention for better performance.

> 
> Oleg.
> 
>> Acked-by: Oleg Nesterov 
>> Signed-off-by: Liao Chang 
>> ---
>>  kernel/events/uprobes.c | 2 --
>>  1 file changed, 2 deletions(-)
>>
>> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
>> index 73cc47708679..76a51a1f51e2 100644
>> --- a/kernel/events/uprobes.c
>> +++ b/kernel/events/uprobes.c
>> @@ -1979,9 +1979,7 @@ bool uprobe_deny_signal(void)
>>  WARN_ON_ONCE(utask->state != UTASK_SSTEP);
>>  
>>  if (task_sigpending(t)) {
>> -spin_lock_irq(&t->sighand->siglock);
>>  clear_tsk_thread_flag(t, TIF_SIGPENDING);
>> -spin_unlock_irq(&t->sighand->siglock);
>>  
>>  if (__fatal_signal_pending(t) || 
>> arch_uprobe_xol_was_trapped(t)) {
>>  utask->state = UTASK_SSTEP_TRAPPED;
>> -- 
>> 2.34.1
>>
> 
> 

-- 
BR
Liao, Chang



Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-13 Thread Liao, Chang



在 2024/8/13 1:49, Andrii Nakryiko 写道:
> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang  wrote:
>>
>>
>>
>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang  wrote:
>>>>
>>>> Hi Andrii and Oleg.
>>>>
>>>> This patch sent by me two weeks ago also aim to optimize the performance 
>>>> of uprobe
>>>> on arm64. I notice recent discussions on the performance and scalability 
>>>> of uprobes
>>>> within the mailing list. Considering this interest, I've added you and 
>>>> other relevant
>>>> maintainers to the CC list for broader visibility and potential 
>>>> collaboration.
>>>>
>>>
>>> Hi Liao,
>>>
>>> As you can see there is an active work to improve uprobes, that
>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>> hold off a bit with your changes until all that lands. And then
>>> re-benchmark, as costs might shift.
>>>
>>> But also see some remarks below.
>>>
>>>> Thanks.
>>>>
>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>
>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>> performance gain.
>>>>>
>>>>> before-opt
>>>>> --
>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>
>>> I'm surprised that nop and push variants are much slower than ret
>>> variant. This is exactly opposite on x86-64. Do you have an
>>> explanation why this might be happening? I see you are trying to
>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>> installed on nop (for USDT) and on function entry (which would be push
>>> variant, `push %rbp` instruction).
>>>
>>> ret variant, for x86-64, causes one extra step to go back to user
>>> space to execute original instruction out-of-line, and then trapping
>>> back to kernel for running uprobe. Which is what you normally want to
>>> avoid.
>>>
>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>> emulated implementations for nops/push or whatever equivalents for
>>> ARM64 that is. Please take a look at that and see why those are slow
>>> and whether you can make those into fast uprobe cases?
>>
>> Hi Andrii,
>>
>> As you correctly pointed out, the benchmark result on Arm64 is 
>> counterintuitive
>> compared to X86 behavior. My investigation revealed that the root cause lies 
>> in
>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents 
>> instructions
>> of 'nop' and 'push' from the emulatable instruction list. This forces the 
>> kernel
>> to handle these instructions out-of-line in userspace upon breakpoint 
>> exception
>> is handled, leading to a significant performance overhead compared to 'ret' 
>> variant,
>> which is already emulated.
>>
>> To address this issue, I've developed a patch supports  the emulation of 
>> 'nop' and
>> 'push' variants. The benchmark results below indicates the performance gain 
>> of
>> emulation is obivious.
>>
>> xol (1 cpus)
>> 
>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>> uprobe-push: 0.908 ± 0.001M

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-13 Thread Liao, Chang



在 2024/8/13 1:57, Andrii Nakryiko 写道:
> On Mon, Aug 12, 2024 at 5:05 AM Liao, Chang  wrote:
>>
>>
>>
>> 在 2024/8/10 2:40, Andrii Nakryiko 写道:
>>> On Fri, Aug 9, 2024 at 11:34 AM Andrii Nakryiko
>>>  wrote:
>>>>
>>>> On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang  wrote:
>>>>>
>>>>>
>>>>>
>>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang  wrote:
>>>>>>>
>>>>>>> Hi Andrii and Oleg.
>>>>>>>
>>>>>>> This patch sent by me two weeks ago also aim to optimize the 
>>>>>>> performance of uprobe
>>>>>>> on arm64. I notice recent discussions on the performance and 
>>>>>>> scalability of uprobes
>>>>>>> within the mailing list. Considering this interest, I've added you and 
>>>>>>> other relevant
>>>>>>> maintainers to the CC list for broader visibility and potential 
>>>>>>> collaboration.
>>>>>>>
>>>>>>
>>>>>> Hi Liao,
>>>>>>
>>>>>> As you can see there is an active work to improve uprobes, that
>>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>>> hold off a bit with your changes until all that lands. And then
>>>>>> re-benchmark, as costs might shift.
>>>>>
>>>>> Andrii, I'm trying to integrate your lockless changes into the upstream
>>>>> next-20240806 kernel tree. And I ran into some conflicts. please let me
>>>>> know which kernel you're currently working on.
>>>>>
>>>>
>>>> My patches are  based on tip/perf/core. But I also just pushed all the
>>>> changes I have accumulated (including patches I haven't sent for
>>>> review just yet), plus your patches for sighand lock removed applied
>>>> on top into [0]. So you can take a look and use that as a base for
>>>> now. Keep in mind, a bunch of those patches might still change, but
>>>> this should give you the best currently achievable performance with
>>>> uprobes/uretprobes. E.g., I'm getting something like below on x86-64
>>>> (note somewhat linear scalability with number of CPU cores, with
>>>> per-CPU performance *slowly* declining):
>>>>
>>>> uprobe-nop( 1 cpus):3.565 ± 0.004M/s  (  3.565M/s/cpu)
>>>> uprobe-nop( 2 cpus):6.742 ± 0.009M/s  (  3.371M/s/cpu)
>>>> uprobe-nop( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
>>>> uprobe-nop( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
>>>> uprobe-nop( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
>>>> uprobe-nop( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
>>>> uprobe-nop( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
>>>> uprobe-nop( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
>>>> uprobe-nop(10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
>>>> uprobe-nop(12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
>>>> uprobe-nop(14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
>>>> uprobe-nop(16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
>>>> uprobe-nop(24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
>>>> uprobe-nop(32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
>>>> uprobe-nop(40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
>>>> uprobe-nop(48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
>>>> uprobe-nop(56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
>>>> uprobe-nop(64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
>>>> uprobe-nop(72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
>>>> uprobe-nop(80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)
>>>>
>>>> uretprobe-nop ( 1 cpus):1.987 ± 0.001M/s  (  1.987M/s/cpu)
>>>> uretprobe-nop ( 2 cpus):3.766 ± 0.003M/s  (  1.883M/s/cpu)
>>>> uretprobe-nop ( 3 cpus):5.638 ± 0.002M/s  (  1.879M/s/cpu)
>>>> uretprobe-nop ( 4 cpus):7.275 ± 0.003M/s  (  1.819M/s/cpu)
>>>> uretpro

[PATCH] arm64: insn: Simulate nop and push instruction for better uprobe performance

2024-08-14 Thread Liao Chang
As Andrii pointed out, the uprobe/uretprobe selftest bench runs into a
counterintuitive result that the nop and push variants are much slower than
the ret variant [0]. The root cause lies in arch_probe_analyse_insn(),
which excludes 'nop' and 'stp' from the emulatable instruction list.
This forces the kernel to return to userspace and execute them out-of-line,
then trap back into the kernel to run the uprobe callback functions. This
leads to a significant performance overhead compared to the 'ret' variant,
which is already emulated.

Typically, a uprobe is installed on 'nop' for USDT and on function entry,
which starts with the instruction 'stp x29, x30, [sp, #imm]!' to push lr
and fp onto the stack, regardless of kernel or userspace binary. In order
to improve the performance of handling uprobes for these common use cases,
this patch supports the emulation of the Arm64 equivalents of 'nop' and
'push'. The benchmark results below indicate the performance gain of
emulation is obvious.

On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.

xol (1 cpus)

uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
---
uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)

As shown above, the performance gap between the 'nop'/'push' and 'ret'
variants has been significantly reduced. Because the emulation of the
'push' instruction needs to access userspace memory, it spends more
cycles than the others.

[0] 
https://lore.kernel.org/all/caef4bzao4eg6hr2hzxypn+7uer4chs0r99zln02ezz5yruv...@mail.gmail.com/

Signed-off-by: Liao Chang 
---
 arch/arm64/include/asm/insn.h| 21 ++
 arch/arm64/kernel/probes/decode-insn.c   | 18 +--
 arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
 arch/arm64/kernel/probes/simulate-insn.c | 28 
 arch/arm64/kernel/probes/simulate-insn.h |  2 ++
 arch/arm64/kernel/probes/uprobes.c   |  2 +-
 6 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 8c0a36f72d6f..a246e6e550ba 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 
insn)
   aarch64_insn_is_prfm_lit(insn);
 }
 
+static __always_inline bool aarch64_insn_is_nop(u32 insn)
+{
+   /* nop */
+   return aarch64_insn_is_hint(insn) &&
+  ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
+}
+
+static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
+{
+   /*
+* The 1st instruction on function entry often follows the
+* pattern 'stp x29, x30, [sp, #imm]!' that pushes fp and lr
+* onto the stack.
+*/
+   return aarch64_insn_is_stp_pre(insn) &&
+  (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
+  (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
+  (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x30 */
+  (((insn >>  0) & 0x1F) == 29);   /* Rt  is x29 */
+}
+
 enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
 u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
diff --git a/arch/arm64/kernel/probes/decode-insn.c 
b/arch/arm64/kernel/probes/decode-insn.c
index 968d5fffe233..df7ca16fc763 100644
--- a/arch/arm64/kernel/probes/decode-insn.c
+++ b/arch/arm64/kernel/probes/decode-insn.c
@@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
  *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
  */
 enum probe_insn __kprobes
-arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
+arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
+ bool kernel)
 {
+   /*
+* While 'nop' and 'stp x29, x30, [sp, #imm]!' instructions can
+* execute in the out-of-line slot, simulating them in breakpoint
+* handling offers better performance.
+*/
+   if (aarch64_insn_is_nop(insn)) {
+   api->handler = simulate_nop;
+   return INSN_GOOD_NO_SLOT;
+   } else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
+   api->handler = simulate_stp_fp

[PATCH v3 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()

2024-08-14 Thread Liao Chang
Since clearing a bit in thread_info is an atomic operation, the spinlock
is redundant and can be removed; reducing lock contention is good for
performance.

Acked-by: Oleg Nesterov 
Signed-off-by: Liao Chang 
---
 kernel/events/uprobes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 73cc47708679..76a51a1f51e2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1979,9 +1979,7 @@ bool uprobe_deny_signal(void)
WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
if (task_sigpending(t)) {
-   spin_lock_irq(&t->sighand->siglock);
clear_tsk_thread_flag(t, TIF_SIGPENDING);
-   spin_unlock_irq(&t->sighand->siglock);
 
if (__fatal_signal_pending(t) || 
arch_uprobe_xol_was_trapped(t)) {
utask->state = UTASK_SSTEP_TRAPPED;
-- 
2.34.1




[PATCH v3 2/2] uprobes: Remove the spinlock within handle_singlestep()

2024-08-14 Thread Liao Chang
This patch introduces a flag to track whether TIF_SIGPENDING was
suppressed temporarily during uprobe single-stepping. Once the uprobe
single-step is handled and the flag is confirmed, TIF_SIGPENDING can be
restored directly without acquiring the siglock in most cases, reducing
contention and improving overall performance.

I've used the script developed by Andrii in [1] to run the benchmark. The CPU
used was Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz, running the
kernel from the next tree plus the optimization for get_xol_insn_slot() [2].

before-opt
--
uprobe-nop  ( 1 cpus):0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop  ( 2 cpus):1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop  ( 4 cpus):3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop  ( 8 cpus):4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop  (16 cpus):3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop  (32 cpus):2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop  (64 cpus):1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push ( 1 cpus):0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push ( 2 cpus):1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push ( 4 cpus):3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push ( 8 cpus):3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push (16 cpus):2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push (32 cpus):2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push (64 cpus):1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret  ( 1 cpus):1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret  ( 2 cpus):3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret  ( 4 cpus):6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret  ( 8 cpus):6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret  (16 cpus):5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret  (32 cpus):5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret  (64 cpus):4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
-
uprobe-nop  ( 1 cpus):0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop  ( 2 cpus):1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop  ( 4 cpus):3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop  ( 8 cpus):5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop  (16 cpus):6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop  (32 cpus):4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop  (64 cpus):2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push ( 1 cpus):0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push ( 2 cpus):1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push ( 4 cpus):3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push ( 8 cpus):4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push (16 cpus):5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push (32 cpus):5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push (64 cpus):2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret  ( 1 cpus):1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret  ( 2 cpus):3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret  ( 4 cpus):6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret  ( 8 cpus):6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret  (16 cpus):6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret  (32 cpus):5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret  (64 cpus):4.743 ± 0.016M/s  (  0.074M/s/cpu)

The benchmark results above demonstrate an obvious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
rises from 4.5M/s to 6.4M/s and from 3.3M/s to 5.1M/s respectively.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-and...@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaocha...@huawei.com

Acked-by: Oleg Nesterov 
Signed-off-by: Liao Chang 
---
 include/linux/uprobes.h | 1 +
 kernel/events/uprobes.c | 8 +---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..e4f57117d9c3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -75,6 +75,7 @@ struct uprobe_task {
 
struct uprobe   *active_uprobe;
unsigned long   xol_vaddr;
+   boolsignal_denied;
 
struct return_instance  *return_instances;
unsigned intdepth;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 76a51a1f51e2..589aa2af1a99 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1979,6 +1979,7 @@ bool uprobe_deny_signal(void)
WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
if (task_sigpending(t)) {
+   utask->signal_denied = true;
clear_tsk_thread_flag(t, TIF_SIGPENDING);
 
if (__fatal_signal_pending(t) || 
arch_uprobe_xol_was_trapped(t)) {
@@ -2288,9 +2289,10 @@ static void handle_singlestep(struct uprobe_task *utask, 
struct pt_regs *regs)
utask->state = UTASK_RUNNING;
xol_free_ins
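
(The hunk above is truncated in this archive. Based on the v2 hunk quoted
earlier in the thread and the v3 rename to signal_denied, the restore it
performs is expected to look roughly like the sketch below, not a verbatim
copy of the applied diff:)

#include <linux/sched/signal.h>
#include <linux/uprobes.h>

/*
 * Sketch of the lockless restore that replaces the siglock'd
 * recalc_sigpending() at the end of single-step handling; reconstructed
 * from the v2 hunk quoted earlier, with the v3 flag name.
 */
static void sketch_restore_sigpending(struct uprobe_task *utask)
{
	if (utask->signal_denied) {
		set_thread_flag(TIF_SIGPENDING);	/* see uprobe_deny_signal() */
		utask->signal_denied = false;
	}
}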

[PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock

2024-08-14 Thread Liao Chang
The profiling result of the BPF selftest on the ARM64 platform reveals
that the significant contention on current->sighand->siglock is the
scalability bottleneck. The reason is straightforward: all producer
threads of the benchmark have to contend for the spinlock in order to
restore the TIF_SIGPENDING bit in thread_info that might have been
cleared in uprobe_deny_signal().

This contention on current->sighand->siglock is unnecessary, and this
series removes it thoroughly. I've used the script developed by Andrii
in [1] to run the benchmark. The CPU used was Kunpeng916 (Hi1616),
4 NUMA nodes, 64 cores@2.4GHz, running the kernel from the next tree
plus the optimization in [2] for get_xol_insn_slot().

before-opt
--
uprobe-nop  ( 1 cpus):0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop  ( 2 cpus):1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop  ( 4 cpus):3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop  ( 8 cpus):4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop  (16 cpus):3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop  (32 cpus):2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop  (64 cpus):1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push ( 1 cpus):0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push ( 2 cpus):1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push ( 4 cpus):3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push ( 8 cpus):3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push (16 cpus):2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push (32 cpus):2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push (64 cpus):1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret  ( 1 cpus):1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret  ( 2 cpus):3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret  ( 4 cpus):6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret  ( 8 cpus):6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret  (16 cpus):5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret  (32 cpus):5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret  (64 cpus):4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
-
uprobe-nop  ( 1 cpus):0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop  ( 2 cpus):1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop  ( 4 cpus):3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop  ( 8 cpus):5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop  (16 cpus):6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop  (32 cpus):4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop  (64 cpus):2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push ( 1 cpus):0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push ( 2 cpus):1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push ( 4 cpus):3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push ( 8 cpus):4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push (16 cpus):5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push (32 cpus):5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push (64 cpus):2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret  ( 1 cpus):1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret  ( 2 cpus):3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret  ( 4 cpus):6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret  ( 8 cpus):6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret  (16 cpus):6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret  (32 cpus):5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret  (64 cpus):4.743 ± 0.016M/s  (  0.074M/s/cpu)

The benchmark results above demonstrate an obvious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
rises from 4.5M/s to 6.4M/s and from 3.3M/s to 5.1M/s respectively.

v2->v3:
Rename the flag in [2/2], s/deny_signal/signal_denied/g.

v1->v2:
Oleg pointed out that _DENY_SIGNAL would be replaced by _ACK upon the
completion of the single-step, which leaves handle_singlestep() no
chance to restore the removed TIF_SIGPENDING [3] in some cases in
question. So this revision proposes to use a flag in uprobe_task to
track the denied TIF_SIGPENDING instead of a new UPROBE_SSTEP state.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-and...@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaocha...@huawei.com
[3] https://lore.kernel.org/all/20240801082407.1618451-1-liaocha...@huawei.com

Liao Chang (2):
  uprobes: Remove redundant spinlock in uprobe_deny_signal()
  uprobes: Remove the spinlock within handle_singlestep()

 include/linux/uprobes.h |  1 +
 kernel/events/uprobes.c | 10 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

-- 
2.34.1




Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-14 Thread Liao, Chang



在 2024/8/15 2:42, Andrii Nakryiko 写道:
> On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang  wrote:
>>
>>
>>
>> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
>>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang  wrote:
>>>>
>>>>
>>>>
>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang  wrote:
>>>>>>
>>>>>> Hi Andrii and Oleg.
>>>>>>
>>>>>> This patch sent by me two weeks ago also aim to optimize the performance 
>>>>>> of uprobe
>>>>>> on arm64. I notice recent discussions on the performance and scalability 
>>>>>> of uprobes
>>>>>> within the mailing list. Considering this interest, I've added you and 
>>>>>> other relevant
>>>>>> maintainers to the CC list for broader visibility and potential 
>>>>>> collaboration.
>>>>>>
>>>>>
>>>>> Hi Liao,
>>>>>
>>>>> As you can see there is an active work to improve uprobes, that
>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>> hold off a bit with your changes until all that lands. And then
>>>>> re-benchmark, as costs might shift.
>>>>>
>>>>> But also see some remarks below.
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>>>
>>>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>>>> performance gain.
>>>>>>>
>>>>>>> before-opt
>>>>>>> --
>>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>>>
>>>>> I'm surprised that nop and push variants are much slower than ret
>>>>> variant. This is exactly opposite on x86-64. Do you have an
>>>>> explanation why this might be happening? I see you are trying to
>>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>>>> installed on nop (for USDT) and on function entry (which would be push
>>>>> variant, `push %rbp` instruction).
>>>>>
>>>>> ret variant, for x86-64, causes one extra step to go back to user
>>>>> space to execute original instruction out-of-line, and then trapping
>>>>> back to kernel for running uprobe. Which is what you normally want to
>>>>> avoid.
>>>>>
>>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>>>> emulated implementations for nops/push or whatever equivalents for
>>>>> ARM64 that is. Please take a look at that and see why those are slow
>>>>> and whether you can make those into fast uprobe cases?
>>>>
>>>> Hi Andrii,
>>>>
>>>> As you correctly pointed out, the benchmark result on Arm64 is 
>>>> counterintuitive
>>>> compared to X86 behavior. My investigation revealed that the root cause 
>>>> lies in
>>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents 
>>>> instructions
>>>> o

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-15 Thread Liao, Chang



在 2024/8/15 0:57, Andrii Nakryiko 写道:
> On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang  wrote:
>>
>>
>>
>> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
>>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang  wrote:
>>>>
>>>>
>>>>
>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang  wrote:
>>>>>>
>>>>>> Hi Andrii and Oleg.
>>>>>>
>>>>>> This patch sent by me two weeks ago also aim to optimize the performance 
>>>>>> of uprobe
>>>>>> on arm64. I notice recent discussions on the performance and scalability 
>>>>>> of uprobes
>>>>>> within the mailing list. Considering this interest, I've added you and 
>>>>>> other relevant
>>>>>> maintainers to the CC list for broader visibility and potential 
>>>>>> collaboration.
>>>>>>
>>>>>
>>>>> Hi Liao,
>>>>>
>>>>> As you can see there is an active work to improve uprobes, that
>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>> hold off a bit with your changes until all that lands. And then
>>>>> re-benchmark, as costs might shift.
>>>>>
>>>>> But also see some remarks below.
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>>>
>>>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>>>> performance gain.
>>>>>>>
>>>>>>> before-opt
>>>>>>> --
>>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>>>
>>>>> I'm surprised that nop and push variants are much slower than ret
>>>>> variant. This is exactly opposite on x86-64. Do you have an
>>>>> explanation why this might be happening? I see you are trying to
>>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>>>> installed on nop (for USDT) and on function entry (which would be push
>>>>> variant, `push %rbp` instruction).
>>>>>
>>>>> ret variant, for x86-64, causes one extra step to go back to user
>>>>> space to execute original instruction out-of-line, and then trapping
>>>>> back to kernel for running uprobe. Which is what you normally want to
>>>>> avoid.
>>>>>
>>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>>>> emulated implementations for nops/push or whatever equivalents for
>>>>> ARM64 that is. Please take a look at that and see why those are slow
>>>>> and whether you can make those into fast uprobe cases?
>>>>
>>>> Hi Andrii,
>>>>
>>>> As you correctly pointed out, the benchmark result on Arm64 is 
>>>> counterintuitive
>>>> compared to X86 behavior. My investigation revealed that the root cause 
>>>> lies in
>>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents 
>>>> instructions
>>>> o

Re: [PATCH] arm64: insn: Simulate nop and push instruction for better uprobe performance

2024-08-21 Thread Liao, Chang
Hi, Mark

My bad for taking so long to reply. I generally agree with your suggestions on
STP emulation.

在 2024/8/15 17:58, Mark Rutland 写道:
> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>> counterintuitive result that nop and push variants are much slower than
>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>> This force the kernel returns to userspace and execute them out-of-line,
>> then trapping back to kernel for running uprobe callback functions. This
>> leads to a significant performance overhead compared to 'ret' variant,
>> which is already emulated.
> 
> I appreciate this might be surprising, but does it actually matter
> outside of a microbenchmark?

I just did a simple comparison of the performance impact of single-stepped and
emulated STP on my local machine. Three use cases were measured: Redis GET and
SET throughput (requests per second, RPS), and the time taken to run a grep
for the "arch_uprobe_copy_xol" string within the kernel source.

Redis GET (higher is better)

No uprobe: 49149.71 RPS
Single-stepped STP: 46750.82 RPS
Emulated STP: 48981.19 RPS

Redis SET (larger is better)

No uprobe: 49761.14 RPS
Single-stepped STP: 45255.01 RPS
Emulated stp: 48619.21 RPS

Grep (lower is better)
--
No uprobe: 2.165s
Single-stepped STP: 15.314s
Emulated STP: 2.216s

The result reveals that single-stepping the STP instruction used to push fp/lr
onto the stack significantly impacts Redis and grep performance, leading to a
notable decrease in RPS and an increase in runtime respectively. So emulating
STP at function entry might be a more viable option for uprobes.

> 
>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>> and fp into stack regardless kernel or userspace binary. 
> 
> Function entry doesn't always start with a STP; these days it's often a
> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
> the compiler), it could be any arbitrary instruction. This might happen
> to be the common case today, but there are certain;y codebases where it
> is not.

Sure, if the kernel, CPU and compiler support BTI and PAC, the entry instruction
is definitely not STP. But for CPUs and kernels lacking this support, STP as
the entry instruction is still the common case. I profiled the entry
instruction of all leaf and non-leaf functions: the ratio of STP is 64.5%
for Redis and 76.1% for the BPF selftest bench. So I think it is still
useful to emulate the STP at function entry. Perhaps, for CPUs and kernels
with BTI and PAC enabled, uprobe could choose the slower single-stepping to
execute the STP that pushes the stack.

> 
> STP (or any instruction that accesses memory) is fairly painful to
> emulate because you need to ensure that the correct atomicity and
> ordering properties are provided (e.g. an aligned STP should be
> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
> chance), and that the correct VMSA behaviour is provided (e.g. when
> interacting with MTE, POE, etc, while the uaccess primitives don't try
> to be 100% equivalent to instructions in userspace).
Agreed, but I don't think it has to strictly emulate the single-copy-atomic
property of the STP that is used to push fp/lr onto the stack. In most cases,
only one CPU will push registers to a given position on the stack, and I can
hardly see why other CPUs would depend on the ordering of those stack stores.
So the atomicity and ordering are not so important for this scenario. Regarding
MTE and POE, a similar strategy to BTI and PAC can be applied: for CPUs and
kernels with MTE and POE enabled, uprobe chooses the slower single-stepping to
execute the STP that pushes the stack.
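
To make the fallback policy from the last two paragraphs concrete, here is a
hypothetical sketch of how the decode step could gate the emulation;
uprobe_may_emulate_stp() and simulate_stp_fp_lr_sp_64b() are assumed names of
mine, not code from the posted patch:

/*
 * Hypothetical sketch of the fallback policy described above: emulate the
 * function-entry 'stp x29, x30, [sp, #imm]!' only when nothing (BTI/PAC/
 * MTE/POE) requires the architecturally exact store, otherwise keep the
 * existing out-of-line single-step.
 */
static enum probe_insn uprobe_decode_entry_insn(u32 insn,
						struct arch_probe_insn *api)
{
	if (aarch64_insn_is_nop(insn)) {
		api->handler = simulate_nop;
		return INSN_GOOD_NO_SLOT;	/* always safe to emulate */
	}

	if (aarch64_insn_is_stp_fp_lr_sp_64b(insn) &&
	    uprobe_may_emulate_stp()) {		/* e.g. no MTE/POE in effect */
		api->handler = simulate_stp_fp_lr_sp_64b;
		return INSN_GOOD_NO_SLOT;
	}

	return INSN_GOOD;	/* fall back to single-stepping out of line */
}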

> 
> For those reasons, in general I don't think we should be emulating any
> instruction which accesses memory, and we should not try to emulate the
> STP, but I think it's entirely reasonable to emulate NOP.
> 
>> In order to
>> improve the performance of handling uprobe for common usecases. This
>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>> and 'push'. The benchmark results below indicates the performance gain
>> of emulation is obvious.
>>
>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>
>> xol (1 cpus)
>> 
>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

2024-08-21 Thread Liao, Chang



在 2024/8/16 0:53, Andrii Nakryiko 写道:
> On Wed, Aug 14, 2024 at 7:58 PM Liao, Chang  wrote:
>>
>>
>>
>> 在 2024/8/15 2:42, Andrii Nakryiko 写道:
>>> On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang  wrote:
>>>>
>>>>
>>>>
>>>> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
>>>>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang  wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Andrii and Oleg.
>>>>>>>>
>>>>>>>> This patch sent by me two weeks ago also aim to optimize the 
>>>>>>>> performance of uprobe
>>>>>>>> on arm64. I notice recent discussions on the performance and 
>>>>>>>> scalability of uprobes
>>>>>>>> within the mailing list. Considering this interest, I've added you and 
>>>>>>>> other relevant
>>>>>>>> maintainers to the CC list for broader visibility and potential 
>>>>>>>> collaboration.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Liao,
>>>>>>>
>>>>>>> As you can see there is an active work to improve uprobes, that
>>>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>>>> hold off a bit with your changes until all that lands. And then
>>>>>>> re-benchmark, as costs might shift.
>>>>>>>
>>>>>>> But also see some remarks below.
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() 
>>>>>>>>> on
>>>>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>>>>>
>>>>>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce 
>>>>>>>>> the
>>>>>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>>>>>> flush, especially on architecture like ARM64. This patch has been 
>>>>>>>>> tested
>>>>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>>>>>> performance gain.
>>>>>>>>>
>>>>>>>>> before-opt
>>>>>>>>> --
>>>>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>>>>>
>>>>>>> I'm surprised that nop and push variants are much slower than ret
>>>>>>> variant. This is exactly opposite on x86-64. Do you have an
>>>>>>> explanation why this might be happening? I see you are trying to
>>>>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>>>>>> installed on nop (for USDT) and on function entry (which would be push
>>>>>>> variant, `push %rbp` instruction).
>>>>>>>
>>>>>>> ret variant, for x86-64, causes one extra step to go back to user
>>>>>>> space to execute original instruction out-of-line, and then trapping
>>>>>>> back to kernel for running uprobe. Which is what you normally want to
>>>&

Re: [PATCH] arm64: insn: Simulate nop and push instruction for better uprobe performance

2024-08-27 Thread Liao, Chang
Hi, Mark

Would you like to discuss this patch further, or do you still believe emulating
STP to push FP/LR onto the stack in the kernel is not a good idea?

Thanks.


在 2024/8/21 15:55, Liao, Chang 写道:
> Hi, Mark
> 
> My bad for taking so long to rely, I generally agree with your suggestions to
> STP emulation.
> 
> 在 2024/8/15 17:58, Mark Rutland 写道:
>> On Wed, Aug 14, 2024 at 08:03:56AM +, Liao Chang wrote:
>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>>> counterintuitive result that nop and push variants are much slower than
>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>>> This force the kernel returns to userspace and execute them out-of-line,
>>> then trapping back to kernel for running uprobe callback functions. This
>>> leads to a significant performance overhead compared to 'ret' variant,
>>> which is already emulated.
>>
>> I appreciate this might be surprising, but does it actually matter
>> outside of a microbenchmark?
> 
> I just do a simple comparsion the performance impact of single-stepped and
> emulated STP on my local machine. Three user cases were measured: Redis GET 
> and
> SET throughput (Request Per Second, RPS), and the time taken to execute a grep
> command on the "arch_uprobe_copy_xol" string within the kernel source.
> 
> Redis GET (higher is better)
> 
> No uprobe: 49149.71 RPS
> Single-stepped STP: 46750.82 RPS
> Emulated STP: 48981.19 RPS
> 
> Redis SET (larger is better)
> 
> No uprobe: 49761.14 RPS
> Single-stepped STP: 45255.01 RPS
> Emulated stp: 48619.21 RPS
> 
> Grep (lower is better)
> --
> No uprobe: 2.165s
> Single-stepped STP: 15.314s
> Emualted STP: 2.216s
> 
> The result reveals single-stepped STP instruction that used to push fp/lr into
> stack significantly impacts the Redis and grep performance, leading to a 
> notable
> notable decrease RPS and increase time individually. So emulating STP on the
> function entry might be a more viable option for uprobe.
> 
>>
>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>>> and fp into stack regardless kernel or userspace binary. 
>>
>> Function entry doesn't always start with a STP; these days it's often a
>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>> the compiler), it could be any arbitrary instruction. This might happen
>> to be the common case today, but there are certain;y codebases where it
>> is not.
> 
> Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
> is definitly not STP. But for CPU and kernel lack of these supports, STP as
> the entry instruction is still the common case. And I profiled the entry
> instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
> for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
> useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
> with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
> STP for pushing stack.
> 
>>
>> STP (or any instruction that accesses memory) is fairly painful to
>> emulate because you need to ensure that the correct atomicity and
>> ordering properties are provided (e.g. an aligned STP should be
>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
>> chance), and that the correct VMSA behaviour is provided (e.g. when
>> interacting with MTE, POE, etc, while the uaccess primitives don't try
>> to be 100% equivalent to instructions in userspace).
> Agreed, but I don't think it has to emulate strictly the single-copy-atomic
> feature of STP that is used to push fp/lr into stack. In most cases, only one
> CPU will push registers to the same position on stack. And I barely understand
> why other CPUs would depends on the ordering of pushing data into stack. So it
> means the atomicity and ordering is not so important for this scenario. 
> Regarding
> MTE and POE, a similar stragety to BTI and PAC can be applied: for CPUs and 
> kernel
> with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
> STP for pushing stack.
> 
>>
>> For those reasons, in general I don't think we should be emulating any
>> instruction which accesses memory, and we should not try to emulate the
>> STP, b

Re: [PATCH] arm64: insn: Simulate nop and push instruction for better uprobe performance

2024-08-30 Thread Liao, Chang



在 2024/8/30 3:26, Andrii Nakryiko 写道:
> On Tue, Aug 27, 2024 at 4:34 AM Liao, Chang  wrote:
>>
>> Hi, Mark
>>
>> Would you like to discuss this patch further, or do you still believe 
>> emulating
>> STP to push FP/LR into the stack in kernel is not a good idea?
>>
> 
> Please send an updated version of your patches taking into account
> various smaller issues Mark pointed out. But please keep STP
> emulation, I think it's very important, even if it's not covering all
> possible uprobe tracing scenarios.

OK, I will send it out over the weekend, including an enhancement of the STP
emulation that addresses the MTE and POE issues Mark mentioned. I hope this
will lead to more feedback from him.

> 
>> Thanks.
>>
>>
>> 在 2024/8/21 15:55, Liao, Chang 写道:
>>> Hi, Mark
>>>
>>> My bad for taking so long to rely, I generally agree with your suggestions 
>>> to
>>> STP emulation.
>>>
>>> 在 2024/8/15 17:58, Mark Rutland 写道:
>>>> On Wed, Aug 14, 2024 at 08:03:56AM +, Liao Chang wrote:
>>>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>>>>> counterintuitive result that nop and push variants are much slower than
>>>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>>>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>>>>> This force the kernel returns to userspace and execute them out-of-line,
>>>>> then trapping back to kernel for running uprobe callback functions. This
>>>>> leads to a significant performance overhead compared to 'ret' variant,
>>>>> which is already emulated.
>>>>
>>>> I appreciate this might be surprising, but does it actually matter
>>>> outside of a microbenchmark?
>>>
>>> I just do a simple comparsion the performance impact of single-stepped and
>>> emulated STP on my local machine. Three user cases were measured: Redis GET 
>>> and
>>> SET throughput (Request Per Second, RPS), and the time taken to execute a 
>>> grep
>>> command on the "arch_uprobe_copy_xol" string within the kernel source.
>>>
>>> Redis GET (higher is better)
>>> 
>>> No uprobe: 49149.71 RPS
>>> Single-stepped STP: 46750.82 RPS
>>> Emulated STP: 48981.19 RPS
>>>
>>> Redis SET (larger is better)
>>> 
>>> No uprobe: 49761.14 RPS
>>> Single-stepped STP: 45255.01 RPS
>>> Emulated stp: 48619.21 RPS
>>>
>>> Grep (lower is better)
>>> --
>>> No uprobe: 2.165s
>>> Single-stepped STP: 15.314s
>>> Emualted STP: 2.216s
>>>
>>> The result reveals single-stepped STP instruction that used to push fp/lr 
>>> into
>>> stack significantly impacts the Redis and grep performance, leading to a 
>>> notable
>>> notable decrease RPS and increase time individually. So emulating STP on the
>>> function entry might be a more viable option for uprobe.
>>>
>>>>
>>>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>>>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>>>>> and fp into stack regardless kernel or userspace binary.
>>>>
>>>> Function entry doesn't always start with a STP; these days it's often a
>>>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>>>> the compiler), it could be any arbitrary instruction. This might happen
>>>> to be the common case today, but there are certain;y codebases where it
>>>> is not.
>>>
>>> Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
>>> is definitly not STP. But for CPU and kernel lack of these supports, STP as
>>> the entry instruction is still the common case. And I profiled the entry
>>> instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
>>> for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
>>> useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
>>> with BTI and PAC enabled, uprobe chooses the slower single-stepping to 
>>> execute
>>> STP for pushing stack.
>>>
>>>>
>>>> STP (or any instruction that accesses memory) is fairly painful to
>>>> em

[PATCH] riscv/kprobe: fix kernel panic when invoking sys_read traced by kprobe

2021-03-30 Thread Liao Chang
The execution of sys_read ends up hitting a BUG_ON() in __find_get_block
after installing a kprobe at sys_read; the BUG message looks like the following:

[   65.708663] [ cut here ]
[   65.709987] kernel BUG at fs/buffer.c:1251!
[   65.711283] Kernel BUG [#1]
[   65.712032] Modules linked in:
[   65.712925] CPU: 0 PID: 51 Comm: sh Not tainted 5.12.0-rc4 #1
[   65.714407] Hardware name: riscv-virtio,qemu (DT)
[   65.715696] epc : __find_get_block+0x218/0x2c8
[   65.716835]  ra : __getblk_gfp+0x1c/0x4a
[   65.717831] epc : ffe00019f11e ra : ffe00019f56a sp : 
ffe002437930
[   65.719553]  gp : ffe000f06030 tp : ffe0015abc00 t0 : 
ffe00191e038
[   65.721290]  t1 : ffe00191e038 t2 : 000a s0 : 
ffe002437960
[   65.723051]  s1 : ffe00160ad00 a0 : ffe00160ad00 a1 : 
012a
[   65.724772]  a2 : 0400 a3 : 0008 a4 : 
0040
[   65.726545]  a5 :  a6 : ffe00191e000 a7 : 

[   65.728308]  s2 : 012a s3 : 0400 s4 : 
0008
[   65.730049]  s5 : 006c s6 : ffe00240f800 s7 : 
ffe000f080a8
[   65.731802]  s8 : 0001 s9 : 012a s10: 
0008
[   65.733516]  s11: 0008 t3 : 03ff t4 : 
000f
[   65.734434]  t5 : 03ff t6 : 0004
[   65.734613] status: 0100 badaddr:  cause: 
0003
[   65.734901] Call Trace:
[   65.735076] [] __find_get_block+0x218/0x2c8
[   65.735417] [] __ext4_get_inode_loc+0xb2/0x2f6
[   65.735618] [] ext4_get_inode_loc+0x3a/0x8a
[   65.735802] [] ext4_reserve_inode_write+0x2e/0x8c
[   65.735999] [] __ext4_mark_inode_dirty+0x4c/0x18e
[   65.736208] [] ext4_dirty_inode+0x46/0x66
[   65.736387] [] __mark_inode_dirty+0x12c/0x3da
[   65.736576] [] touch_atime+0x146/0x150
[   65.736748] [] filemap_read+0x234/0x246
[   65.736920] [] generic_file_read_iter+0xc0/0x114
[   65.737114] [] ext4_file_read_iter+0x42/0xea
[   65.737310] [] new_sync_read+0xe2/0x15a
[   65.737483] [] vfs_read+0xca/0xf2
[   65.737641] [] ksys_read+0x5e/0xc8
[   65.737816] [] sys_read+0xe/0x16
[   65.737973] [] ret_from_syscall+0x0/0x2
[   65.738858] ---[ end trace fe93f985456c935d ]---

A simple reproducer looks like:
echo 'p:myprobe sys_read fd=%a0 buf=%a1 count=%a2' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
cat /sys/kernel/debug/tracing/trace

Here's what happens to hit that BUG_ON():

1) After installing a kprobe at the entry of sys_read, the first instruction
   is replaced by the 'ebreak' instruction on the riscv64 platform.

2) Once the kernel reaches the 'ebreak' instruction at the entry of sys_read,
   it traps into the riscv breakpoint handler, where it sets up the coming
   single-step of the original instruction, including backing up 'sstatus'
   in pt_regs, followed by disabling interrupts during the single-stepping
   by clearing the 'SIE' bit of 'sstatus' in pt_regs.

3) Then the kernel returns to the instruction slot, which contains two
   instructions: the original instruction from the entry of sys_read and an
   'ebreak'. Here it triggers an 'Instruction page fault' exception ('scause'
   is '0xc') if the page table entry for that slot has not been filled yet.

4) Again the kernel traps into the page fault exception handler, where it
   chooses a different policy according to the state of the running kprobe.
   Because the state after 2) is KPROBE_HIT_SS, the kernel resets the
   current kprobe and 'pc' points back to the probe address.

5) Because 'epc' points back to the 'ebreak' instruction at the sys_read
   probe, the kernel traps into the breakpoint handler again and repeats the
   operations of 2). However, the 'sstatus' kept at 4) has 'SIE' cleared,
   which causes the real 'sstatus' saved at 2) to be overwritten by the one
   without 'SIE'.

6) When the kernel gets past the probe, the 'sstatus' CSR is restored with
   the value without 'SIE', and execution reaches __find_get_block, which
   requires interrupts to be enabled.

The fix is trivial: just restore the value of 'sstatus' in pt_regs with the
one backed up at 2) when the instruction being single-stepped causes a page
fault.
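
For reference, the save/restore pair involved looks roughly like the sketch
below. It is paraphrased rather than copied from
arch/riscv/kernel/probes/kprobes.c, so treat the field and symbol names
(kcb->saved_status, SR_SIE) as assumptions about that code.

#include <linux/kprobes.h>	/* struct kprobe_ctlblk */
#include <asm/ptrace.h>		/* riscv struct pt_regs, regs->status */
#include <asm/csr.h>		/* SR_SIE */

/* Rough sketch, not a verbatim copy of the kernel helpers. */
static void kprobe_save_local_irqflag(struct kprobe_ctlblk *kcb,
				      struct pt_regs *regs)
{
	kcb->saved_status = regs->status;	/* back up 'sstatus' at step 2) */
	regs->status &= ~SR_SIE;		/* single-step with interrupts off */
}

static void kprobes_restore_local_irqflag(struct kprobe_ctlblk *kcb,
					  struct pt_regs *regs)
{
	regs->status = kcb->saved_status;	/* undo the 'SIE' clearing */
}

Calling the restore helper next to reset_current_kprobe() in
kprobe_fault_handler() means that a later re-entry at 2) backs up an
'sstatus' that still has 'SIE' set.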

Fixes: c22b0bcb1dd02 ("riscv: Add kprobes supported")
Signed-off-by: Liao Chang 
---
 arch/riscv/kernel/probes/kprobes.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c
index 7e2c78e2ca6b..d71f7c49a721 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -260,8 +260,10 @@ int __kprobes kprobe_fault_handler(struct pt_regs *regs, unsigned int trapnr)

[PATCH v2] riscv/kprobe: Restore local irqflag if kprobe is cancelled

2021-04-16 Thread Liao Chang
The execution of sys_read ends up hitting a BUG_ON() in __find_get_block after
installing a probe at sys_read via kprobe; the BUG message looks like the following:

[   65.708663] [ cut here ]
[   65.709987] kernel BUG at fs/buffer.c:1251!
[   65.711283] Kernel BUG [#1]
[   65.712032] Modules linked in:
[   65.712925] CPU: 0 PID: 51 Comm: sh Not tainted 5.12.0-rc4 #1
[   65.714407] Hardware name: riscv-virtio,qemu (DT)
[   65.715696] epc : __find_get_block+0x218/0x2c8
[   65.716835]  ra : __getblk_gfp+0x1c/0x4a
[   65.717831] epc : ffe00019f11e ra : ffe00019f56a sp : 
ffe002437930
[   65.719553]  gp : ffe000f06030 tp : ffe0015abc00 t0 : 
ffe00191e038
[   65.721290]  t1 : ffe00191e038 t2 : 000a s0 : 
ffe002437960
[   65.723051]  s1 : ffe00160ad00 a0 : ffe00160ad00 a1 : 
012a
[   65.724772]  a2 : 0400 a3 : 0008 a4 : 
0040
[   65.726545]  a5 :  a6 : ffe00191e000 a7 : 

[   65.728308]  s2 : 012a s3 : 0400 s4 : 
0008
[   65.730049]  s5 : 006c s6 : ffe00240f800 s7 : 
ffe000f080a8
[   65.731802]  s8 : 0001 s9 : 012a s10: 
0008
[   65.733516]  s11: 0008 t3 : 03ff t4 : 
000f
[   65.734434]  t5 : 03ff t6 : 0004
[   65.734613] status: 0100 badaddr:  cause: 
0003
[   65.734901] Call Trace:
[   65.735076] [] __find_get_block+0x218/0x2c8
[   65.735417] [] __ext4_get_inode_loc+0xb2/0x2f6
[   65.735618] [] ext4_get_inode_loc+0x3a/0x8a
[   65.735802] [] ext4_reserve_inode_write+0x2e/0x8c
[   65.735999] [] __ext4_mark_inode_dirty+0x4c/0x18e
[   65.736208] [] ext4_dirty_inode+0x46/0x66
[   65.736387] [] __mark_inode_dirty+0x12c/0x3da
[   65.736576] [] touch_atime+0x146/0x150
[   65.736748] [] filemap_read+0x234/0x246
[   65.736920] [] generic_file_read_iter+0xc0/0x114
[   65.737114] [] ext4_file_read_iter+0x42/0xea
[   65.737310] [] new_sync_read+0xe2/0x15a
[   65.737483] [] vfs_read+0xca/0xf2
[   65.737641] [] ksys_read+0x5e/0xc8
[   65.737816] [] sys_read+0xe/0x16
[   65.737973] [] ret_from_syscall+0x0/0x2
[   65.738858] ---[ end trace fe93f985456c935d ]---

A simple reproducer looks like:
echo 'p:myprobe sys_read fd=%a0 buf=%a1 count=%a2' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
cat trace

Here's what happens to hit that BUG_ON():

If the instruction being single-stepped causes a page fault, the
kprobe is cancelled to let the page fault handler continue as a
normal page fault. But the local irqflags are disabled, so the CPU
will restore 'sstatus' with 'SIE' masked. After the page fault is
serviced, the kprobe is triggered again and we overwrite the saved
irqflag by calling kprobe_save_local_irqflag(). Note that 'SIE' is
masked in this newly saved irqflag. After the kprobe is serviced,
the CPU's 'sstatus' is restored with 'SIE' masked. This overwritten
'sstatus' causes the BUG_ON() in __find_get_block.

This bug is already fixed on arm64 by Jisheng Zhang.

Fixes: c22b0bcb1dd02 ("riscv: Add kprobes supported")
Signed-off-by: Liao Chang 
---

Changes in v2:
- Reorganize commit message.

 arch/riscv/kernel/probes/kprobes.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c
index 7e2c78e2ca6b..d71f7c49a721 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -260,8 +260,10 @@ int __kprobes kprobe_fault_handler(struct pt_regs *regs, unsigned int trapnr)
 
 		if (kcb->kprobe_status == KPROBE_REENTER)
 			restore_previous_kprobe(kcb);
-		else
+		else {
+			kprobes_restore_local_irqflag(kcb, regs);
 			reset_current_kprobe();
+		}
 
 		break;
 	case KPROBE_HIT_ACTIVE:
-- 
2.17.1