On Sun, May 10, 2026 at 2:25 PM Jiri Olsa <[email protected]> wrote:
>
> On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > user code may keep temporary data without adjusting rsp.
> >
> > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > also does not provide a return address. Replace the single trampoline
> > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > assigned slot, the slot moves rsp below the red zone, saves the registers
> > clobbered by syscall, and invokes the uprobe syscall:
> >
> >   Probe site:  jmp slot_N               (5B, replaces nop5)
> >
> >   Slot N:      lea -128(%rsp), %rsp     (5B) skip red zone
> >                push %rcx                (1B) save (syscall clobbers)
> >                push %r11                (2B) save (syscall clobbers)
> >                push %rax                (1B) save (syscall uses for nr)
> >                mov $336, %eax           (5B) uprobe syscall number
> >                syscall                  (2B)
> >
> > All slots contain identical code at different offsets, so the trampoline
> > page is generated once at boot and mapped read-execute into each process.
> > The syscall handler identifies the slot from regs->ip, which points just
> > after the syscall instruction, and uses a per-mm slot table to recover the
> > original probe address.
> >
> > The uprobe syscall does not return to the trampoline slot. The handler
> > restores the probe-site register state, runs the uprobe consumers, sets
> > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > execution, and returns directly through the IRET path. This preserves
> > general purpose registers, including rcx and r11, without requiring any
> > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > shadow stack concerns.
> >
> > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > without taking mmap_lock. The optimized-instruction detection path also
> > walks the trampoline list under an RCU read-side lock. Since that path
> > starts from the JMP target, it translates the slot start to the post-syscall
> > IP expected by the shared resolver before checking the trampoline mapping.
> >
> > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > to their first probe address and are reused only when the same address is
> > probed again. Reassigning detached slots is deliberately avoided because a
> > thread can remain in a trampoline for an unbounded time due to ptrace,
> > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > path.
> >
> > Require the entire trampoline page to be reachable by a rel32 JMP before
> > reusing it for a probe. This keeps every slot in the page within the range
> > that can be encoded at the probe site.
> >
> > Change the error code returned when the uprobe syscall is invoked outside
> > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > similar libraries distinguish fixed kernels from kernels with the
> > red-zone-clobbering implementation and enable nop5 optimization only on
> > fixed kernels.
> >
> > Performance (usdt single-thread, M/s):
> >
> >                    usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> >   Skylake             3.149           6.422          4.865       -24.3%  39.1%
> >   Milan               2.910           3.443          3.820       +11.0%  24.3%
> >   Sapphire Rapids     1.896           4.023          3.693        -8.2%  24.9%
> >   Bergamo             3.393           3.895          3.849        -1.2%  24.5%
> >
> > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > measured systems. The regression relative to the old CALL-based trampoline
> > comes from IRET being more expensive than SYSRET, most noticeably on older
> > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > and AMD Milan improves because removing mmap_lock from the hot path more
> > than offsets the IRET cost.
> >
> > Multi-threaded throughput scales nearly linearly with the number of CPUs,
> > like it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
>
> hi,
> thanks a lot for the fix
>
> FWIW we discussed also an option to have 10-bytes nop and do:
>   [rsp+0x80, call trampoline]
>
> we would not need the slots re-use logic, but not sure what other
> surprises there are with 10-bytes nop
>
> I tried that change [1], it seems to work, but it has other
> difficulties, like I think the unoptimized path needs to do:
>   [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> instead of patching back the 10-byte nop, because some thread
> could be inside the nop area already.
>

Yeah, nop10 with this jump-over-nop10 approach is an alternative. I
don't have strong feelings about it, apart from the ridiculousness of a
10-byte nop :) Did you get a chance to benchmark your nop10 approach?
I'm curious what the numbers look like.

> jirka
>
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=redzone_fix&id=74b09240289dba8368c2783b771e678b2cc31574
>
> >
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Andrii Nakryiko <[email protected]>
> > ---
> >  arch/x86/include/asm/uprobes.h                |  18 ++
> >  arch/x86/kernel/uprobes.c                     | 262 ++++++++++--------
> >  tools/lib/bpf/features.c                      |   8 +-
> >  .../selftests/bpf/prog_tests/uprobe_syscall.c |   5 +-
> >  tools/testing/selftests/bpf/prog_tests/usdt.c |   2 +-
> >  5 files changed, 181 insertions(+), 114 deletions(-)
> > [...]
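
For anyone following along, the slot arithmetic from the quoted commit
message can be sketched in standalone userspace C roughly as below. This
is only an illustration, not code from the patch: slot_from_ip(),
rel32_reachable(), page_reachable() and all macro names are hypothetical.
The only inputs taken from the description are the 16-byte slot size, the
256 slots per page, the 5-byte JMP, and the fact that regs->ip points just
after the syscall instruction, i.e. at the end of the slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the commit message; the names are made up for this sketch. */
#define SLOT_SIZE       16u   /* 5 + 1 + 2 + 1 + 5 + 2 bytes of code per slot */
#define SLOTS_PER_PAGE  256u
#define TRAMP_PAGE_SIZE (SLOT_SIZE * SLOTS_PER_PAGE)
#define JMP_LEN         5u    /* jmp rel32 written over the nop5 */

static_assert(TRAMP_PAGE_SIZE == 4096, "256 slots of 16 bytes fill one 4 KiB page");

/*
 * Map the post-syscall ip to a slot index. syscall is the last instruction
 * in a slot, so a valid ip is tramp_base + (slot + 1) * SLOT_SIZE.
 * Returns -1 when ip is not the end of a slot in this page.
 */
static int slot_from_ip(uint64_t tramp_base, uint64_t ip)
{
	uint64_t off;

	if (ip <= tramp_base)
		return -1;
	off = ip - tramp_base;
	if (off > TRAMP_PAGE_SIZE || off % SLOT_SIZE != 0)
		return -1;
	return (int)(off / SLOT_SIZE) - 1;
}

/* A rel32 displacement is relative to the end of the 5-byte JMP. */
static bool rel32_reachable(uint64_t probe_addr, uint64_t target)
{
	int64_t disp = (int64_t)(target - (probe_addr + JMP_LEN));

	return disp >= INT32_MIN && disp <= INT32_MAX;
}

/* Require the first and last byte of the page to be in range. */
static bool page_reachable(uint64_t probe_addr, uint64_t tramp_base)
{
	return rel32_reachable(probe_addr, tramp_base) &&
	       rel32_reachable(probe_addr, tramp_base + TRAMP_PAGE_SIZE - 1);
}
```

With a page at 0x1000, an ip of 0x1010 maps to slot 0 and an ip of 0x2000
maps to slot 255, while anything not aligned to a slot end is rejected.
Checking both ends of the page in page_reachable() mirrors the stated
requirement that every slot be encodable from the probe site; the real
check in the patch may be structured differently.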
