Mahe reported a missing function in the stack trace on top of a kprobe
multi program. The missing function is the very first one in the
stack trace, the one the bpf program is attached to.
# bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
('*' is used for kprobe_multi attachment)
The reason is that the previous change (the Fixes commit) fixed
stack unwinding for tracepoints, but removed the attached function's
address from stack traces on top of kprobe multi programs, which I
also overlooked in the related test (see the following patch).
Tracepoints and kprobe_multi have different stack setups, but use the
same unwind path. I think it's better to keep the previous change,
which fixed the tracepoint unwind, and instead change the kprobe multi
unwind as explained below.
The bpf program stack unwind calls perf_callchain_kernel for the kernel
portion, which follows one of two unwind paths based on the
X86_EFLAGS_FIXED bit in pt_regs.flags.
When the bit is set, we unwind from the stack represented by the pt_regs
argument; otherwise we unwind the currently executed stack up to the
'first_frame' boundary.
The 'first_frame' value is taken from the regs.rsp value, but the
ftrace_caller and ftrace_regs_caller (ftrace trampoline) functions set
regs.rsp to the previous stack frame, so we skip the attached function
entry.
If we switch the kprobe_multi unwind to use the X86_EFLAGS_FIXED bit,
we can control the start of the unwind and get back the attached
function address. As another benefit, we also cut the extra unwind
cycles needed to reach the 'first_frame' boundary.
The speedup can be measured with the trigger bench for a kprobe_multi
program with stacktrace support.
- without bpf_get_stackid call:
# ./bench -w2 -d5 -a -p1 trig-kprobe-multi
Summary: hits 0.857 ± 0.003M/s ( 0.857M/prod), drops 0.000 ± 0.000M/s,
total operations 0.857 ± 0.003M/s
- with bpf_get_stackid call:
# ./bench -w2 -d5 -a -g -p1 trig-kprobe-multi
Summary: hits 1.302 ± 0.002M/s ( 1.302M/prod), drops 0.000 ± 0.000M/s,
total operations 1.302 ± 0.002M/s
Note the '-g' option for stacktrace is added in a following change.
To recreate the same stack setup for the return probe as we have for
the entry probe, we set the instruction pointer to the attached
function address, which gets us the same unwind setup and the same
stack trace.
With the fix, entry probe:
# bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
__x64_sys_newuname+9
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
return probe:
# bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
__x64_sys_newuname+4
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
Fixes: 6d08340d1e35 ("Revert "perf/x86: Always store regs->ip in
perf_callchain_kernel()"")
Reported-by: Mahe Tardy <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
---
arch/x86/include/asm/ftrace.h | 2 +-
kernel/trace/fgraph.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index b08c95872eed..c56e1e63b893 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -57,7 +57,7 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
}
#define arch_ftrace_partial_regs(regs) do { \
- regs->flags &= ~X86_EFLAGS_FIXED; \
+ regs->flags |= X86_EFLAGS_FIXED; \
regs->cs = __KERNEL_CS; \
} while (0)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index cc48d16be43e..6279e0a753cf 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -825,7 +825,7 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
}
if (fregs)
- ftrace_regs_set_instruction_pointer(fregs, ret);
+ ftrace_regs_set_instruction_pointer(fregs, trace.func);
bit = ftrace_test_recursion_trylock(trace.func, ret);
/*
--
2.52.0