Re: Getting empty callchain from perf_callchain_kernel()
On Wed, Jun 12, 2019 at 09:50:08AM -0500, Josh Poimboeuf wrote:
> > Other than that, the same note as before, the 32bit JIT still seems
> > buggered, but I'm not sure you (or anybody else) cares enough about that
> > to fix it though. It seems to use ebp as its own frame pointer, which
> > completely defeats an unwinder.
> 
> I'm still trying to decide if I care about 32-bit.  It does indeed use
> ebp everywhere.  But I'm not sure if I want to poke the beehive...  Also
> factoring into the equation is the fact that I'll be on PTO next week
> :-)  If I have time in the next couple days then I may take a look.

32-bit actually looks much easier to fix than 64-bit was.  I haven't
tested it yet though, but I'll be gone next week so I'll just drop it
here in case anybody wants to try it.

diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index b29e82f190c7..8c1de7786e49 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -169,6 +169,10 @@ static const u8 bpf2ia32[][2] = {
 #define src_hi	src[1]
 
 #define STACK_ALIGNMENT	8
+
+/* Size of callee-saved register space (except EBP) */
+#define CALLEE_SAVE_SIZE 12
+
 /*
  * Stack space for BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4,
  * BPF_REG_5, BPF_REG_6, BPF_REG_7, BPF_REG_8, BPF_REG_9,
@@ -176,13 +180,14 @@ static const u8 bpf2ia32[][2] = {
  */
 #define SCRATCH_SIZE 96
 
-/* Total stack size used in JITed code */
+/* Total stack size used in JITed code (except callee-saved) */
 #define _STACK_SIZE	(stack_depth + SCRATCH_SIZE)
 
 #define STACK_SIZE ALIGN(_STACK_SIZE, STACK_ALIGNMENT)
 
-/* Get the offset of eBPF REGISTERs stored on scratch space. */
-#define STACK_VAR(off) (off)
+/* Offset of eBPF REGISTERs stored in scratch space, relative to EBP */
+//FIXME: rename to EBP_OFFSET
+#define STACK_VAR(off) (off - CALLEE_SAVE_SIZE - SCRATCH_SIZE)
 
 /* Encode 'dst_reg' register into IA32 opcode 'byte' */
 static u8 add_1reg(u8 byte, u32 dst_reg)
@@ -1408,7 +1413,7 @@ struct jit_context {
 #define BPF_MAX_INSN_SIZE	128
 #define BPF_INSN_SAFETY		64
 
-#define PROLOGUE_SIZE 35
+#define PROLOGUE_SIZE 32
 
 /*
  * Emit prologue code for BPF program and check its size.
@@ -1436,8 +1441,6 @@ static void emit_prologue(u8 **pprog, u32 stack_depth)
 
 	/* sub esp,STACK_SIZE */
 	EMIT2_off32(0x81, 0xEC, STACK_SIZE);
-	/* sub ebp,SCRATCH_SIZE+12*/
-	EMIT3(0x83, add_1reg(0xE8, IA32_EBP), SCRATCH_SIZE + 12);
 
 	/* xor ebx,ebx */
 	EMIT2(0x31, add_2reg(0xC0, IA32_EBX, IA32_EBX));
@@ -1470,18 +1473,21 @@ static void emit_epilogue(u8 **pprog, u32 stack_depth)
 	/* mov edx,dword ptr [ebp+off]*/
 	EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), STACK_VAR(r0[1]));
 
-	/* add ebp,SCRATCH_SIZE+12*/
-	EMIT3(0x83, add_1reg(0xC0, IA32_EBP), SCRATCH_SIZE + 12);
+	/* add esp, STACK_SIZE */
+	EMIT2_off32(0x81, 0xC4, STACK_SIZE);
+
+	/* pop ebx */
+	EMIT1(0x5b);
+	/* pop esi */
+	EMIT1(0x5e);
+	/* pop edi */
+	EMIT1(0x5f);
+	/* pop ebp */
+	EMIT1(0x5d);
 
-	/* mov ebx,dword ptr [ebp-12]*/
-	EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX), -12);
-	/* mov esi,dword ptr [ebp-8]*/
-	EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ESI), -8);
-	/* mov edi,dword ptr [ebp-4]*/
-	EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDI), -4);
+	/* ret */
+	EMIT1(0xC3);
 
-	EMIT1(0xC9); /* leave */
-	EMIT1(0xC3); /* ret */
 	*pprog = prog;
 }
Re: Getting empty callchain from perf_callchain_kernel()
On Wed, Jun 12, 2019 at 10:54:23AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 11, 2019 at 10:05:01PM -0500, Josh Poimboeuf wrote:
> > On Fri, May 24, 2019 at 10:53:19AM +0200, Peter Zijlstra wrote:
> > > > For ORC, I'm thinking we may be able to just require that all generated
> > > > code (BPF and others) always use frame pointers.  Then when ORC doesn't
> > > > recognize a code address, it could try using the frame pointer as a
> > > > fallback.
> > > 
> > > Yes, this seems like a sensible approach. We'd also have to audit the
> > > ftrace and kprobe trampolines, IIRC they only do framepointer setup for
> > > CONFIG_FRAME_POINTER currently, which should be easy to fix (after the
> > > patches I have to fix the FP generation in the first place:
> > > 
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/wip
> > 
> > Right now, ftrace has a special hook in the ORC unwinder
> > (orc_ftrace_find).  It would be great if we could get rid of that in
> > favor of the "always use frame pointers" approach.  I'll hold off on
> > doing the kpatch/kprobe trampoline conversions in my patches since it
> > would conflict with yours.
> > 
> > Though, hm, because of pt_regs I guess ORC would need to be able to
> > decode an encoded frame pointer?  I was hoping we could leave those
> > encoded frame pointers behind in CONFIG_FRAME_POINTER-land forever...
> 
> Ah, I see.. could a similar approach work for the kprobe trampolines
> perhaps?

If you mean requiring that kprobes trampolines always use frame
pointers, I think it should work.

> > Here are my latest BPF unwinder patches in case anybody wants a sneak
> > peek:
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/log/?h=bpf-orc-fix
> 
> On a quick read-through, that looks good to me. A minor nit:
> 
> 	/* mov dst_reg, %r11 */
> 	EMIT_mov(dst_reg, AUX_REG);
> 
> The disparity between %r11 and AUX_REG is jarring. I understand the
> whole bpf register mapping thing, but it is just weird when reading
> this.

True, but there are several cases where the r11 is hard-coded in the
instruction encoding itself, like:

	/* mov imm32, %r11 */
	EMIT3_off32(0x49, 0xC7, 0xC3, imm32);

If the code were more decoupled, like if it had helpers where you could
always just pass AUX_REG, and the code never had to know what the value
of AUX_REG is, then using "AUX_REG" in the comments would make sense.
But since there are inconsistencies, with hard-coded register mapping
knowledge in many places, I find it easier to follow what's going on
when the specific register name is always shown in the comments.

> Other than that, the same note as before, the 32bit JIT still seems
> buggered, but I'm not sure you (or anybody else) cares enough about that
> to fix it though. It seems to use ebp as its own frame pointer, which
> completely defeats an unwinder.

I'm still trying to decide if I care about 32-bit.  It does indeed use
ebp everywhere.  But I'm not sure if I want to poke the beehive...  Also
factoring into the equation is the fact that I'll be on PTO next week
:-)  If I have time in the next couple days then I may take a look.

-- 
Josh
Re: Getting empty callchain from perf_callchain_kernel()
On Wed, Jun 12, 2019 at 09:10:23AM -0400, Steven Rostedt wrote:
> On Tue, 11 Jun 2019 22:05:01 -0500
> Josh Poimboeuf wrote:
> 
> > Right now, ftrace has a special hook in the ORC unwinder
> > (orc_ftrace_find).  It would be great if we could get rid of that in
> > favor of the "always use frame pointers" approach.  I'll hold off on
> > doing the kpatch/kprobe trampoline conversions in my patches since it
> > would conflict with yours.
> 
> Basically, IIUC, what you are saying is that the ftrace trampoline
> should always store the %sp in %rbp even when CONFIG_FRAME_POINTER is
> not enabled? And this can allow you to remove the ftrace specific code
> from the orc unwinder?

Basically, yes.  Though the frame pointer encoding which Peter is
adding to the ftrace/kprobes trampolines might complicate things a bit.

-- 
Josh
Re: Getting empty callchain from perf_callchain_kernel()
On Tue, 11 Jun 2019 22:05:01 -0500
Josh Poimboeuf wrote:

> Right now, ftrace has a special hook in the ORC unwinder
> (orc_ftrace_find).  It would be great if we could get rid of that in
> favor of the "always use frame pointers" approach.  I'll hold off on
> doing the kpatch/kprobe trampoline conversions in my patches since it
> would conflict with yours.

Basically, IIUC, what you are saying is that the ftrace trampoline
should always store the %sp in %rbp even when CONFIG_FRAME_POINTER is
not enabled? And this can allow you to remove the ftrace specific code
from the orc unwinder?

-- Steve

> 
> Though, hm, because of pt_regs I guess ORC would need to be able to
> decode an encoded frame pointer?  I was hoping we could leave those
> encoded frame pointers behind in CONFIG_FRAME_POINTER-land forever...
Re: Getting empty callchain from perf_callchain_kernel()
On Tue, Jun 11, 2019 at 10:05:01PM -0500, Josh Poimboeuf wrote:
> On Fri, May 24, 2019 at 10:53:19AM +0200, Peter Zijlstra wrote:
> > > For ORC, I'm thinking we may be able to just require that all generated
> > > code (BPF and others) always use frame pointers.  Then when ORC doesn't
> > > recognize a code address, it could try using the frame pointer as a
> > > fallback.
> > 
> > Yes, this seems like a sensible approach. We'd also have to audit the
> > ftrace and kprobe trampolines, IIRC they only do framepointer setup for
> > CONFIG_FRAME_POINTER currently, which should be easy to fix (after the
> > patches I have to fix the FP generation in the first place:
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/wip
> 
> Right now, ftrace has a special hook in the ORC unwinder
> (orc_ftrace_find).  It would be great if we could get rid of that in
> favor of the "always use frame pointers" approach.  I'll hold off on
> doing the kpatch/kprobe trampoline conversions in my patches since it
> would conflict with yours.
> 
> Though, hm, because of pt_regs I guess ORC would need to be able to
> decode an encoded frame pointer?  I was hoping we could leave those
> encoded frame pointers behind in CONFIG_FRAME_POINTER-land forever...

Ah, I see.. could a similar approach work for the kprobe trampolines
perhaps?

> Here are my latest BPF unwinder patches in case anybody wants a sneak
> peek:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/log/?h=bpf-orc-fix

On a quick read-through, that looks good to me. A minor nit:

	/* mov dst_reg, %r11 */
	EMIT_mov(dst_reg, AUX_REG);

The disparity between %r11 and AUX_REG is jarring. I understand the
whole bpf register mapping thing, but it is just weird when reading
this.

Other than that, the same note as before, the 32bit JIT still seems
buggered, but I'm not sure you (or anybody else) cares enough about that
to fix it though. It seems to use ebp as its own frame pointer, which
completely defeats an unwinder.
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 24, 2019 at 10:53:19AM +0200, Peter Zijlstra wrote:
> > For ORC, I'm thinking we may be able to just require that all generated
> > code (BPF and others) always use frame pointers.  Then when ORC doesn't
> > recognize a code address, it could try using the frame pointer as a
> > fallback.
> 
> Yes, this seems like a sensible approach. We'd also have to audit the
> ftrace and kprobe trampolines, IIRC they only do framepointer setup for
> CONFIG_FRAME_POINTER currently, which should be easy to fix (after the
> patches I have to fix the FP generation in the first place:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/wip

Right now, ftrace has a special hook in the ORC unwinder
(orc_ftrace_find).  It would be great if we could get rid of that in
favor of the "always use frame pointers" approach.  I'll hold off on
doing the kpatch/kprobe trampoline conversions in my patches since it
would conflict with yours.

Though, hm, because of pt_regs I guess ORC would need to be able to
decode an encoded frame pointer?  I was hoping we could leave those
encoded frame pointers behind in CONFIG_FRAME_POINTER-land forever...

Here are my latest BPF unwinder patches in case anybody wants a sneak
peek:

  https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/log/?h=bpf-orc-fix

-- 
Josh
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, Jun 06, 2019 at 04:04:48PM +0000, Song Liu wrote:
> Hi Josh,
> 
> Have you got luck fixing the ORC side?

Here's the ORC fix.  It's needed in addition to the bpf frame pointer
fix (the previous patch).  I'll clean the patches up and post them soon.

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 33b66b5c5aec..a2337460b1a3 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -128,6 +128,16 @@ static struct orc_entry null_orc_entry = {
 	.type = ORC_TYPE_CALL
 };
 
+/* Fake frame pointer entry -- used as a fallback for generated code */
+static struct orc_entry orc_fp_entry = {
+	.type		= ORC_TYPE_CALL,
+	.sp_reg		= ORC_REG_BP,
+	.sp_offset	= 16,
+	.bp_reg		= ORC_REG_PREV_SP,
+	.bp_offset	= -16,
+	.end		= 0,
+};
+
 static struct orc_entry *orc_find(unsigned long ip)
 {
 	static struct orc_entry *orc;
@@ -392,8 +402,16 @@ bool unwind_next_frame(struct unwind_state *state)
 	 * calls and calls to noreturn functions.
 	 */
 	orc = orc_find(state->signal ? state->ip : state->ip - 1);
-	if (!orc)
-		goto err;
+	if (!orc) {
+		/*
+		 * As a fallback, try to assume rbp is the frame pointer.  This
+		 * is useful for generated code like BPF.  Since this is a
+		 * guess, the rest of the unwind is no longer considered
+		 * reliable.
+		 */
+		orc = &orc_fp_entry;
+		state->error = true;
+	}
 
 	/* End-of-stack check for kernel threads: */
 	if (orc->sp_reg == ORC_REG_UNDEFINED) {
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, Jun 06, 2019 at 04:04:48PM +0000, Song Liu wrote:
> >> Here's a tentative BPF fix for the JIT frame pointer issue.  It was a
> >> bit harder than I expected.  Encoding r12 as a base register requires a
> >> SIB byte, so I had to add support for encoding that.  I also simplified
> >> the prologue to resemble a GCC prologue, which decreases the prologue
> >> size quite a bit.
> >> 
> >> Next week I can work on the corresponding ORC change.  Then I can clean
> >> all the patches up and submit them properly.
> 
> Hi Josh,
> 
> Have you got luck fixing the ORC side?

Sorry, I've been traveling this week and I haven't had a chance to work
on it.  It's at the top of my TODO list for next week.

-- 
Josh
Re: Getting empty callchain from perf_callchain_kernel()
> On May 27, 2019, at 4:57 AM, Kairui Song wrote:
> 
> On Sat, May 25, 2019 at 7:23 AM Josh Poimboeuf wrote:
>> 
>> On Fri, May 24, 2019 at 10:20:52AM +0800, Kairui Song wrote:
>>> On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf wrote:
>>>> 
>>>> On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
>>>>> On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf wrote:
>>>>>> 
>>>>>> On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
>>>>>>>>> Hi Josh, this still won't fix the problem.
>>>>>>>>> 
>>>>>>>>> Problem is not (or not only) with ___bpf_prog_run, what actually went
>>>>>>>>> wrong is with the JITed bpf code.
>>>>>>>> 
>>>>>>>> There seem to be a bunch of issues.  My patch at least fixes the failing
>>>>>>>> selftest reported by Alexei for ORC.
>>>>>>>> 
>>>>>>>> How can I recreate your issue?
>>>>>>> 
>>>>>>> Hmm, I used bcc's example to attach bpf to trace point, and with that
>>>>>>> fix stack trace is still invalid.
>>>>>>> 
>>>>>>> CMD I used with bcc:
>>>>>>> python3 ./tools/stackcount.py t:sched:sched_fork
>>>>>> 
>>>>>> I've had problems in the past getting bcc to build, so I was hoping it
>>>>>> was reproducible with a standalone selftest.
>>>>>> 
>>>>>>> And I just had another try applying your patch, self test is also
>>>>>>> failing.
>>>>>> 
>>>>>> Is it the same selftest reported by Alexei?
>>>>>> 
>>>>>>   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
>>>>>> 
>>>>>>> I'm applying on my local master branch, a few days older than
>>>>>>> upstream, I can update and try again, am I missing anything?
>>>>>> 
>>>>>> The above patch had some issues, so with some configs you might see an
>>>>>> objtool warning for ___bpf_prog_run(), in which case the patch doesn't
>>>>>> fix the test_stacktrace_map selftest.
>>>>>> 
>>>>>> Here's the latest version which should fix it in all cases (based on
>>>>>> tip/master):
>>>>>> 
>>>>>>   https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
>>>>> 
>>>>> Hmm, I still get the failure:
>>>>> 
>>>>>   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
>>>>> 
>>>>> And I didn't see how this will fix the issue.  As long as ORC need to
>>>>> unwind through the JITed code it will fail.  And that will happen
>>>>> before reaching ___bpf_prog_run.
>>>> 
>>>> Ok, I was able to recreate by doing
>>>> 
>>>>   echo 1 > /proc/sys/net/core/bpf_jit_enable
>>>> 
>>>> first.  I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
>>> 
>>> Yes, with JIT off it will be fixed. I can confirm that.
>> 
>> Here's a tentative BPF fix for the JIT frame pointer issue.  It was a
>> bit harder than I expected.  Encoding r12 as a base register requires a
>> SIB byte, so I had to add support for encoding that.  I also simplified
>> the prologue to resemble a GCC prologue, which decreases the prologue
>> size quite a bit.
>> 
>> Next week I can work on the corresponding ORC change.  Then I can clean
>> all the patches up and submit them properly.

Hi Josh,

Have you got luck fixing the ORC side?

Thanks,
Song

>> 
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index afabf597c855..c9b4503558c9 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -104,9 +104,8 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>>  /*
>>   * The following table maps BPF registers to x86-64 registers.
>>   *
>> - * x86-64 register R12 is unused, since if used as base address
>> - * register in load/store instructions, it always needs an
>> - * extra byte of encoding and is callee saved.
>> + * RBP isn't used; it needs to be preserved to allow the unwinder to move
>> + * through generated code stacks.
>>   *
>>   * Also x86-64 register R9 is unused.  x86-64 register R10 is
>>   * used for blinding (if enabled).
>> @@ -122,7 +121,7 @@ static const int reg2hex[] = {
>>  	[BPF_REG_7] = 5,  /* R13 callee saved */
>>  	[BPF_REG_8] = 6,  /* R14 callee saved */
>>  	[BPF_REG_9] = 7,  /* R15 callee saved */
>> -	[BPF_REG_FP] = 5, /* RBP readonly */
>> +	[BPF_REG_FP] = 4, /* R12 readonly */
>>  	[BPF_REG_AX] = 2, /* R10 temp register */
>>  	[AUX_REG] = 3,    /* R11 temp register */
>>  };
>> @@ -139,6 +138,7 @@ static bool is_ereg(u32 reg)
>>  			     BIT(BPF_REG_7) |
>>  			     BIT(BPF_REG_8) |
>>  			     BIT(BPF_REG_9) |
>> +			     BIT(BPF_REG_FP) |
>>  			     BIT(BPF_REG_AX));
>>  }
>> 
>> @@ -147,6 +147,11 @@ static bool is_axreg(u32 reg)
>>  	return reg == BPF_REG_0;
>>  }
>> 
>> +static bool is_sib_reg(u32 reg)
>> +{
>> +	return reg == BPF_REG_FP;
>> +}
>> +
>>  /* Add modifiers if 'reg' maps to x86-64 registers R8..R15 */
>>  static u8 add_1mod(u8 byte, u32 reg)
>>  {
>> @@ -190,15 +195,13 @@ struct jit_context {
>>  #define BPF_MAX_INSN_SIZE	128
>>  #define
Re: Getting empty callchain from perf_callchain_kernel()
On Sat, May 25, 2019 at 7:23 AM Josh Poimboeuf wrote:
> 
> On Fri, May 24, 2019 at 10:20:52AM +0800, Kairui Song wrote:
> > On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf wrote:
> > > 
> > > On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> > > > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf wrote:
> > > > > 
> > > > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > > > > Hi Josh, this still won't fix the problem.
> > > > > > > > 
> > > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > > > > > > wrong is with the JITed bpf code.
> > > > > > > 
> > > > > > > There seem to be a bunch of issues.  My patch at least fixes the failing
> > > > > > > selftest reported by Alexei for ORC.
> > > > > > > 
> > > > > > > How can I recreate your issue?
> > > > > > 
> > > > > > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > > > > > fix stack trace is still invalid.
> > > > > > 
> > > > > > CMD I used with bcc:
> > > > > > python3 ./tools/stackcount.py t:sched:sched_fork
> > > > > 
> > > > > I've had problems in the past getting bcc to build, so I was hoping it
> > > > > was reproducible with a standalone selftest.
> > > > > 
> > > > > > And I just had another try applying your patch, self test is also failing.
> > > > > 
> > > > > Is it the same selftest reported by Alexei?
> > > > > 
> > > > >   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > > > > 
> > > > > > I'm applying on my local master branch, a few days older than
> > > > > > upstream, I can update and try again, am I missing anything?
> > > > > 
> > > > > The above patch had some issues, so with some configs you might see an
> > > > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > > > > fix the test_stacktrace_map selftest.
> > > > > 
> > > > > Here's the latest version which should fix it in all cases (based on
> > > > > tip/master):
> > > > > 
> > > > >   https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> > > > 
> > > > Hmm, I still get the failure:
> > > > 
> > > >   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > > > 
> > > > And I didn't see how this will fix the issue.  As long as ORC need to
> > > > unwind through the JITed code it will fail.  And that will happen
> > > > before reaching ___bpf_prog_run.
> > > 
> > > Ok, I was able to recreate by doing
> > > 
> > >   echo 1 > /proc/sys/net/core/bpf_jit_enable
> > > 
> > > first.  I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
> > 
> > Yes, with JIT off it will be fixed. I can confirm that.
> 
> Here's a tentative BPF fix for the JIT frame pointer issue.  It was a
> bit harder than I expected.  Encoding r12 as a base register requires a
> SIB byte, so I had to add support for encoding that.  I also simplified
> the prologue to resemble a GCC prologue, which decreases the prologue
> size quite a bit.
> 
> Next week I can work on the corresponding ORC change.  Then I can clean
> all the patches up and submit them properly.
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index afabf597c855..c9b4503558c9 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -104,9 +104,8 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>  /*
>   * The following table maps BPF registers to x86-64 registers.
>   *
> - * x86-64 register R12 is unused, since if used as base address
> - * register in load/store instructions, it always needs an
> - * extra byte of encoding and is callee saved.
> + * RBP isn't used; it needs to be preserved to allow the unwinder to move
> + * through generated code stacks.
>   *
>   * Also x86-64 register R9 is unused.  x86-64 register R10 is
>   * used for blinding (if enabled).
> @@ -122,7 +121,7 @@ static const int reg2hex[] = {
>  	[BPF_REG_7] = 5,  /* R13 callee saved */
>  	[BPF_REG_8] = 6,  /* R14 callee saved */
>  	[BPF_REG_9] = 7,  /* R15 callee saved */
> -	[BPF_REG_FP] = 5, /* RBP readonly */
> +	[BPF_REG_FP] = 4, /* R12 readonly */
>  	[BPF_REG_AX] = 2, /* R10 temp register */
>  	[AUX_REG] = 3,    /* R11 temp register */
>  };
> @@ -139,6 +138,7 @@ static bool is_ereg(u32 reg)
>  			     BIT(BPF_REG_7) |
>  			     BIT(BPF_REG_8) |
>  			     BIT(BPF_REG_9) |
> +			     BIT(BPF_REG_FP) |
>  			     BIT(BPF_REG_AX));
>  }
> 
> @@ -147,6 +147,11 @@ static bool is_axreg(u32 reg)
>  	return reg == BPF_REG_0;
>  }
> 
> +static bool is_sib_reg(u32 reg)
> +{
> +	return reg == BPF_REG_FP;
> +}
> +
>  /* Add modifiers if 'reg' maps to x86-64 registers R8..R15 */
>  static u8 add_1mod(u8 byte, u32 reg)
>  {
> @@ -190,15 +195,13 @@ struct jit_context {
>  #define
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 24, 2019 at 10:20:52AM +0800, Kairui Song wrote:
> On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf wrote:
> > 
> > On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> > > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf wrote:
> > > > 
> > > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > > > Hi Josh, this still won't fix the problem.
> > > > > > > 
> > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went
> > > > > > > wrong is with the JITed bpf code.
> > > > > > 
> > > > > > There seem to be a bunch of issues.  My patch at least fixes the failing
> > > > > > selftest reported by Alexei for ORC.
> > > > > > 
> > > > > > How can I recreate your issue?
> > > > > 
> > > > > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > > > > fix stack trace is still invalid.
> > > > > 
> > > > > CMD I used with bcc:
> > > > > python3 ./tools/stackcount.py t:sched:sched_fork
> > > > 
> > > > I've had problems in the past getting bcc to build, so I was hoping it
> > > > was reproducible with a standalone selftest.
> > > > 
> > > > > And I just had another try applying your patch, self test is also failing.
> > > > 
> > > > Is it the same selftest reported by Alexei?
> > > > 
> > > >   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > > > 
> > > > > I'm applying on my local master branch, a few days older than
> > > > > upstream, I can update and try again, am I missing anything?
> > > > 
> > > > The above patch had some issues, so with some configs you might see an
> > > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > > > fix the test_stacktrace_map selftest.
> > > > 
> > > > Here's the latest version which should fix it in all cases (based on
> > > > tip/master):
> > > > 
> > > >   https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> > > 
> > > Hmm, I still get the failure:
> > > 
> > >   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > > 
> > > And I didn't see how this will fix the issue.  As long as ORC need to
> > > unwind through the JITed code it will fail.  And that will happen
> > > before reaching ___bpf_prog_run.
> > 
> > Ok, I was able to recreate by doing
> > 
> >   echo 1 > /proc/sys/net/core/bpf_jit_enable
> > 
> > first.  I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.
> 
> Yes, with JIT off it will be fixed. I can confirm that.

Here's a tentative BPF fix for the JIT frame pointer issue.  It was a
bit harder than I expected.  Encoding r12 as a base register requires a
SIB byte, so I had to add support for encoding that.  I also simplified
the prologue to resemble a GCC prologue, which decreases the prologue
size quite a bit.

Next week I can work on the corresponding ORC change.  Then I can clean
all the patches up and submit them properly.

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index afabf597c855..c9b4503558c9 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -104,9 +104,8 @@ static int bpf_size_to_x86_bytes(int bpf_size)
 /*
  * The following table maps BPF registers to x86-64 registers.
  *
- * x86-64 register R12 is unused, since if used as base address
- * register in load/store instructions, it always needs an
- * extra byte of encoding and is callee saved.
+ * RBP isn't used; it needs to be preserved to allow the unwinder to move
+ * through generated code stacks.
  *
  * Also x86-64 register R9 is unused.  x86-64 register R10 is
  * used for blinding (if enabled).
@@ -122,7 +121,7 @@ static const int reg2hex[] = {
 	[BPF_REG_7] = 5,  /* R13 callee saved */
 	[BPF_REG_8] = 6,  /* R14 callee saved */
 	[BPF_REG_9] = 7,  /* R15 callee saved */
-	[BPF_REG_FP] = 5, /* RBP readonly */
+	[BPF_REG_FP] = 4, /* R12 readonly */
 	[BPF_REG_AX] = 2, /* R10 temp register */
 	[AUX_REG] = 3,    /* R11 temp register */
 };
@@ -139,6 +138,7 @@ static bool is_ereg(u32 reg)
 			     BIT(BPF_REG_7) |
 			     BIT(BPF_REG_8) |
 			     BIT(BPF_REG_9) |
+			     BIT(BPF_REG_FP) |
 			     BIT(BPF_REG_AX));
 }
 
@@ -147,6 +147,11 @@ static bool is_axreg(u32 reg)
 	return reg == BPF_REG_0;
 }
 
+static bool is_sib_reg(u32 reg)
+{
+	return reg == BPF_REG_FP;
+}
+
 /* Add modifiers if 'reg' maps to x86-64 registers R8..R15 */
 static u8 add_1mod(u8 byte, u32 reg)
 {
@@ -190,15 +195,13 @@ struct jit_context {
 #define BPF_MAX_INSN_SIZE	128
 #define BPF_INSN_SAFETY		64
 
-#define AUX_STACK_SPACE		40 /* Space for RBX, R13, R14, R15, tailcnt */
-
-#define PROLOGUE_SIZE		37
+#define PROLOGUE_SIZE		25
 
 /*
  * Emit x86-64 prologue code for BPF program and check its size.
  * bpf_tail_call
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 24, 2019 at 10:53:19AM +0200, Peter Zijlstra wrote:
> On Thu, May 23, 2019 at 10:24:13AM -0500, Josh Poimboeuf wrote:
> 
> > Here's the latest version which should fix it in all cases (based on
> > tip/master):
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> 
> That patch suffers an inconsistency, the comment states:
> 
>   'if they have "jump_table" in the name'
> 
> while the actual code implements:
> 
>   'if the name starts with "jump_table"'
> 
> Other than that, I suppose that works just fine ;-)

The thing is, gcc converts a static local variable named "jump_table"
to an ELF symbol with a numbered suffix, something like
"jump_table.12345".  But yeah I should at least clarify that in the
comment.

-- 
Josh
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 10:24:13AM -0500, Josh Poimboeuf wrote:

> Here's the latest version which should fix it in all cases (based on
> tip/master):
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix

That patch suffers an inconsistency, the comment states:

  'if they have "jump_table" in the name'

while the actual code implements:

  'if the name starts with "jump_table"'

Other than that, I suppose that works just fine ;-)

> There's no need to put special cases in the FP unwinder when we can
> instead just fix the frame pointer usage in the JIT code.
> 
> For ORC, I'm thinking we may be able to just require that all generated
> code (BPF and others) always use frame pointers.  Then when ORC doesn't
> recognize a code address, it could try using the frame pointer as a
> fallback.

Yes, this seems like a sensible approach. We'd also have to audit the
ftrace and kprobe trampolines, IIRC they only do framepointer setup for
CONFIG_FRAME_POINTER currently, which should be easy to fix (after the
patches I have to fix the FP generation in the first place:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/wip

)
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 24, 2019 at 1:27 AM Josh Poimboeuf wrote:
> 
> On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote:
> > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf wrote:
> > > 
> > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote:
> > > > > > Hi Josh, this still won't fix the problem.
> > > > > > 
> > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually
> > > > > > went wrong is with the JITed bpf code.
> > > > > 
> > > > > There seem to be a bunch of issues.  My patch at least fixes the
> > > > > failing selftest reported by Alexei for ORC.
> > > > > 
> > > > > How can I recreate your issue?
> > > > 
> > > > Hmm, I used bcc's example to attach bpf to trace point, and with that
> > > > fix stack trace is still invalid.
> > > > 
> > > > CMD I used with bcc:
> > > > python3 ./tools/stackcount.py t:sched:sched_fork
> > > 
> > > I've had problems in the past getting bcc to build, so I was hoping it
> > > was reproducible with a standalone selftest.
> > > 
> > > > And I just had another try applying your patch, self test is also
> > > > failing.
> > > 
> > > Is it the same selftest reported by Alexei?
> > > 
> > >   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > > 
> > > > I'm applying on my local master branch, a few days older than
> > > > upstream, I can update and try again, am I missing anything?
> > > 
> > > The above patch had some issues, so with some configs you might see an
> > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't
> > > fix the test_stacktrace_map selftest.
> > > 
> > > Here's the latest version which should fix it in all cases (based on
> > > tip/master):
> > > 
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix
> > 
> > Hmm, I still get the failure:
> > 
> >   test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2
> > 
> > And I didn't see how this will fix the issue.  As long as ORC need to
> > unwind through the JITed code it will fail.  And that will happen
> > before reaching ___bpf_prog_run.
> 
> Ok, I was able to recreate by doing
> 
>   echo 1 > /proc/sys/net/core/bpf_jit_enable
> 
> first.  I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.

Yes, with JIT off it will be fixed. I can confirm that.

-- 
Best Regards,
Kairui Song
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 24, 2019 at 12:41:59AM +0800, Kairui Song wrote: > On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf wrote: > > > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote: > > > > > Hi Josh, this still won't fix the problem. > > > > > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went > > > > > wrong is with the JITed bpf code. > > > > > > > > There seem to be a bunch of issues. My patch at least fixes the failing > > > > selftest reported by Alexei for ORC. > > > > > > > > How can I recreate your issue? > > > > > > Hmm, I used bcc's example to attach bpf to trace point, and with that > > > fix stack trace is still invalid. > > > > > > CMD I used with bcc: > > > python3 ./tools/stackcount.py t:sched:sched_fork > > > > I've had problems in the past getting bcc to build, so I was hoping it > > was reproducible with a standalone selftest. > > > > > And I just had another try applying your patch, self test is also failing. > > > > Is it the same selftest reported by Alexei? > > > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err > > -1 errno 2 > > > > > I'm applying on my local master branch, a few days older than > > > upstream, I can update and try again, am I missing anything? > > > > The above patch had some issues, so with some configs you might see an > > objtool warning for ___bpf_prog_run(), in which case the patch doesn't > > fix the test_stacktrace_map selftest. > > > > Here's the latest version which should fix it in all cases (based on > > tip/master): > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix > > Hmm, I still get the failure: > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap > err -1 errno 2 > > And I didn't see how this will fix the issue. As long as ORC need to > unwind through the JITed code it will fail. And that will happen > before reaching ___bpf_prog_run. 
Ok, I was able to recreate by doing

  echo 1 > /proc/sys/net/core/bpf_jit_enable

first.  I'm guessing you have CONFIG_BPF_JIT_ALWAYS_ON.

--
Josh
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 11:24 PM Josh Poimboeuf wrote: > > On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote: > > > > Hi Josh, this still won't fix the problem. > > > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went > > > > wrong is with the JITed bpf code. > > > > > > There seem to be a bunch of issues. My patch at least fixes the failing > > > selftest reported by Alexei for ORC. > > > > > > How can I recreate your issue? > > > > Hmm, I used bcc's example to attach bpf to trace point, and with that > > fix stack trace is still invalid. > > > > CMD I used with bcc: > > python3 ./tools/stackcount.py t:sched:sched_fork > > I've had problems in the past getting bcc to build, so I was hoping it > was reproducible with a standalone selftest. > > > And I just had another try applying your patch, self test is also failing. > > Is it the same selftest reported by Alexei? > > test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 > errno 2 > > > I'm applying on my local master branch, a few days older than > > upstream, I can update and try again, am I missing anything? > > The above patch had some issues, so with some configs you might see an > objtool warning for ___bpf_prog_run(), in which case the patch doesn't > fix the test_stacktrace_map selftest. > > Here's the latest version which should fix it in all cases (based on > tip/master): > > > https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix Hmm, I still get the failure: test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2 And I didn't see how this will fix the issue. As long as ORC need to unwind through the JITed code it will fail. And that will happen before reaching ___bpf_prog_run. > > > > > For frame pointer unwinder, it seems the JITed bpf code will have a > > > > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can > > > > unshift it properly then it will work. 
> > > > > > Yeah, that looks like a frame pointer bug in emit_prologue(). > > > > > > > I tried below code, and problem is fixed (only for frame pointer > > > > unwinder though). Need to find a better way to detect and do any > > > > similar trick for bpf part, if this is a feasible way to fix it: > > > > > > > > diff --git a/arch/x86/kernel/unwind_frame.c > > > > b/arch/x86/kernel/unwind_frame.c > > > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644 > > > > --- a/arch/x86/kernel/unwind_frame.c > > > > +++ b/arch/x86/kernel/unwind_frame.c > > > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state) > > > > } > > > > > > > > /* Move to the next frame if it's safe: */ > > > > - if (!update_stack_state(state, next_bp)) > > > > - goto bad_address; > > > > + if (!update_stack_state(state, next_bp)) { > > > > + // Try again with shifted BP > > > > + state->bp += 5; // see AUX_STACK_SPACE > > > > + next_bp = (unsigned long > > > > *)READ_ONCE_TASK_STACK(state->task, *state->bp); > > > > + // Clean and refetch stack info, it's marked as error > > > > outed > > > > + state->stack_mask = 0; > > > > + get_stack_info(next_bp, state->task, > > > > >stack_info, >stack_mask); > > > > + if (!update_stack_state(state, next_bp)) { > > > > + goto bad_address; > > > > + } > > > > + } > > > > > > > > return true; > > > > > > Nack. > > > > > > > For ORC unwinder, I think the unwinder can't find any info about the > > > > JITed part. Maybe if can let it just skip the JITed part and go to > > > > kernel context, then should be good enough. > > > > > > If it's starting from a fake pt_regs then that's going to be a > > > challenge. > > > > > > Will the JIT code always have the same stack layout? If so then we > > > could hard code that knowledge in ORC. Or even better, create a generic > > > interface for ORC to query the creator of the generated code about the > > > stack layout. > > > > I think yes. 
> > > > Not sure why we have the BP shift yet, if the prolog code could be > > tweaked to work with frame pointer unwinder it will be good to have. > > But still not for ORC. > > > > Will it be a good idea to have a region reserved for the JITed code? > > Currently it shares the region with "module mapping space". If let it > > have a separate region, when the unwinder meet any code in that region > > it will know it's JITed code and then can do something special about > > it. > > > > This should make it much easier for both frame pointer and ORC unwinder to > > work. > > There's no need to put special cases in the FP unwinder when we can > instead just fix the frame pointer usage in the JIT code. > > For ORC, I'm thinking we may be able to just require that all generated > code (BPF and others) always use frame pointers. Then when ORC doesn't > recognize a code
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 10:50:24PM +0800, Kairui Song wrote: > > > Hi Josh, this still won't fix the problem. > > > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went > > > wrong is with the JITed bpf code. > > > > There seem to be a bunch of issues. My patch at least fixes the failing > > selftest reported by Alexei for ORC. > > > > How can I recreate your issue? > > Hmm, I used bcc's example to attach bpf to trace point, and with that > fix stack trace is still invalid. > > CMD I used with bcc: > python3 ./tools/stackcount.py t:sched:sched_fork I've had problems in the past getting bcc to build, so I was hoping it was reproducible with a standalone selftest. > And I just had another try applying your patch, self test is also failing. Is it the same selftest reported by Alexei? test_stacktrace_map:FAIL:compare_map_keys stackid_hmap vs. stackmap err -1 errno 2 > I'm applying on my local master branch, a few days older than > upstream, I can update and try again, am I missing anything? The above patch had some issues, so with some configs you might see an objtool warning for ___bpf_prog_run(), in which case the patch doesn't fix the test_stacktrace_map selftest. Here's the latest version which should fix it in all cases (based on tip/master): https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git/commit/?h=bpf-orc-fix > > > For frame pointer unwinder, it seems the JITed bpf code will have a > > > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can > > > unshift it properly then it will work. > > > > Yeah, that looks like a frame pointer bug in emit_prologue(). > > > > > I tried below code, and problem is fixed (only for frame pointer > > > unwinder though). 
Need to find a better way to detect and do any > > > similar trick for bpf part, if this is a feasible way to fix it: > > > > > > diff --git a/arch/x86/kernel/unwind_frame.c > > > b/arch/x86/kernel/unwind_frame.c > > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644 > > > --- a/arch/x86/kernel/unwind_frame.c > > > +++ b/arch/x86/kernel/unwind_frame.c > > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state) > > > } > > > > > > /* Move to the next frame if it's safe: */ > > > - if (!update_stack_state(state, next_bp)) > > > - goto bad_address; > > > + if (!update_stack_state(state, next_bp)) { > > > + // Try again with shifted BP > > > + state->bp += 5; // see AUX_STACK_SPACE > > > + next_bp = (unsigned long > > > *)READ_ONCE_TASK_STACK(state->task, *state->bp); > > > + // Clean and refetch stack info, it's marked as error > > > outed > > > + state->stack_mask = 0; > > > + get_stack_info(next_bp, state->task, > > > >stack_info, >stack_mask); > > > + if (!update_stack_state(state, next_bp)) { > > > + goto bad_address; > > > + } > > > + } > > > > > > return true; > > > > Nack. > > > > > For ORC unwinder, I think the unwinder can't find any info about the > > > JITed part. Maybe if can let it just skip the JITed part and go to > > > kernel context, then should be good enough. > > > > If it's starting from a fake pt_regs then that's going to be a > > challenge. > > > > Will the JIT code always have the same stack layout? If so then we > > could hard code that knowledge in ORC. Or even better, create a generic > > interface for ORC to query the creator of the generated code about the > > stack layout. > > I think yes. > > Not sure why we have the BP shift yet, if the prolog code could be > tweaked to work with frame pointer unwinder it will be good to have. > But still not for ORC. > > Will it be a good idea to have a region reserved for the JITed code? > Currently it shares the region with "module mapping space". 
> If let it have a separate region, when the unwinder meets any code in
> that region it will know it's JITed code and then can do something
> special about it.
>
> This should make it much easier for both frame pointer and ORC unwinder
> to work.

There's no need to put special cases in the FP unwinder when we can
instead just fix the frame pointer usage in the JIT code.

For ORC, I'm thinking we may be able to just require that all generated
code (BPF and others) always use frame pointers. Then when ORC doesn't
recognize a code address, it could try using the frame pointer as a
fallback.

Once I get a reproducer I can do the patches for all that.

--
Josh
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 9:32 PM Josh Poimboeuf wrote: > > On Thu, May 23, 2019 at 02:48:11PM +0800, Kairui Song wrote: > > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf wrote: > > > > > > On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote: > > > > On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote: > > > > > The one that is broken is prog_tests/stacktrace_map.c > > > > > There we attach bpf to standard tracepoint where > > > > > kernel suppose to collect pt_regs before calling into bpf. > > > > > And that's what bpf_get_stackid_tp() is doing. > > > > > It passes pt_regs (that was collected before any bpf) > > > > > into bpf_get_stackid() which calls get_perf_callchain(). > > > > > Same thing with kprobes, uprobes. > > > > > > > > Is it trying to unwind through ___bpf_prog_run()? > > > > > > > > If so, that would at least explain why ORC isn't working. Objtool > > > > currently ignores that function because it can't follow the jump table. > > > > > > Here's a tentative fix (for ORC, at least). I'll need to make sure this > > > doesn't break anything else. 
> > > > > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > > > index 242a643af82f..1d9a7cc4b836 100644 > > > --- a/kernel/bpf/core.c > > > +++ b/kernel/bpf/core.c > > > @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct > > > bpf_insn *insn, u64 *stack) > > > BUG_ON(1); > > > return 0; > > > } > > > -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */ > > > > > > #define PROG_NAME(stack_size) __bpf_prog_run##stack_size > > > #define DEFINE_BPF_PROG_RUN(stack_size) \ > > > diff --git a/tools/objtool/check.c b/tools/objtool/check.c > > > index 172f99195726..2567027fce95 100644 > > > --- a/tools/objtool/check.c > > > +++ b/tools/objtool/check.c > > > @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct > > > objtool_file *file, > > > if (text_rela->type == R_X86_64_PC32) > > > table_offset += 4; > > > > > > - /* > > > -* Make sure the .rodata address isn't associated with a > > > -* symbol. gcc jump tables are anonymous data. > > > -*/ > > > - if (find_symbol_containing(rodata_sec, table_offset)) > > > - continue; > > > - > > > rodata_rela = find_rela_by_dest(rodata_sec, table_offset); > > > if (rodata_rela) { > > > /* > > > > Hi Josh, this still won't fix the problem. > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went > > wrong is with the JITed bpf code. > > There seem to be a bunch of issues. My patch at least fixes the failing > selftest reported by Alexei for ORC. > > How can I recreate your issue? Hmm, I used bcc's example to attach bpf to trace point, and with that fix stack trace is still invalid. CMD I used with bcc: python3 ./tools/stackcount.py t:sched:sched_fork And I just had another try applying your patch, self test is also failing. I'm applying on my local master branch, a few days older than upstream, I can update and try again, am I missing anything? > > > For frame pointer unwinder, it seems the JITed bpf code will have a > > shifted "BP" register? 
(arch/x86/net/bpf_jit_comp.c:217), so if we can > > unshift it properly then it will work. > > Yeah, that looks like a frame pointer bug in emit_prologue(). > > > I tried below code, and problem is fixed (only for frame pointer > > unwinder though). Need to find a better way to detect and do any > > similar trick for bpf part, if this is a feasible way to fix it: > > > > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644 > > --- a/arch/x86/kernel/unwind_frame.c > > +++ b/arch/x86/kernel/unwind_frame.c > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state) > > } > > > > /* Move to the next frame if it's safe: */ > > - if (!update_stack_state(state, next_bp)) > > - goto bad_address; > > + if (!update_stack_state(state, next_bp)) { > > + // Try again with shifted BP > > + state->bp += 5; // see AUX_STACK_SPACE > > + next_bp = (unsigned long > > *)READ_ONCE_TASK_STACK(state->task, *state->bp); > > + // Clean and refetch stack info, it's marked as error outed > > + state->stack_mask = 0; > > + get_stack_info(next_bp, state->task, > > >stack_info, >stack_mask); > > + if (!update_stack_state(state, next_bp)) { > > + goto bad_address; > > + } > > + } > > > > return true; > > Nack. > > > For ORC unwinder, I think the unwinder can't find any info about the > > JITed part. Maybe if can let it just skip the JITed part and go to > > kernel context, then should be good enough. > > If it's
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 02:48:11PM +0800, Kairui Song wrote: > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf wrote: > > > > On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote: > > > On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote: > > > > The one that is broken is prog_tests/stacktrace_map.c > > > > There we attach bpf to standard tracepoint where > > > > kernel suppose to collect pt_regs before calling into bpf. > > > > And that's what bpf_get_stackid_tp() is doing. > > > > It passes pt_regs (that was collected before any bpf) > > > > into bpf_get_stackid() which calls get_perf_callchain(). > > > > Same thing with kprobes, uprobes. > > > > > > Is it trying to unwind through ___bpf_prog_run()? > > > > > > If so, that would at least explain why ORC isn't working. Objtool > > > currently ignores that function because it can't follow the jump table. > > > > Here's a tentative fix (for ORC, at least). I'll need to make sure this > > doesn't break anything else. > > > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > > index 242a643af82f..1d9a7cc4b836 100644 > > --- a/kernel/bpf/core.c > > +++ b/kernel/bpf/core.c > > @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct > > bpf_insn *insn, u64 *stack) > > BUG_ON(1); > > return 0; > > } > > -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */ > > > > #define PROG_NAME(stack_size) __bpf_prog_run##stack_size > > #define DEFINE_BPF_PROG_RUN(stack_size) \ > > diff --git a/tools/objtool/check.c b/tools/objtool/check.c > > index 172f99195726..2567027fce95 100644 > > --- a/tools/objtool/check.c > > +++ b/tools/objtool/check.c > > @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct > > objtool_file *file, > > if (text_rela->type == R_X86_64_PC32) > > table_offset += 4; > > > > - /* > > -* Make sure the .rodata address isn't associated with a > > -* symbol. gcc jump tables are anonymous data. 
> > -*/ > > - if (find_symbol_containing(rodata_sec, table_offset)) > > - continue; > > - > > rodata_rela = find_rela_by_dest(rodata_sec, table_offset); > > if (rodata_rela) { > > /* > > Hi Josh, this still won't fix the problem. > > Problem is not (or not only) with ___bpf_prog_run, what actually went > wrong is with the JITed bpf code. There seem to be a bunch of issues. My patch at least fixes the failing selftest reported by Alexei for ORC. How can I recreate your issue? > For frame pointer unwinder, it seems the JITed bpf code will have a > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can > unshift it properly then it will work. Yeah, that looks like a frame pointer bug in emit_prologue(). > I tried below code, and problem is fixed (only for frame pointer > unwinder though). Need to find a better way to detect and do any > similar trick for bpf part, if this is a feasible way to fix it: > > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c > index 9b9fd4826e7a..2c0fa2aaa7e4 100644 > --- a/arch/x86/kernel/unwind_frame.c > +++ b/arch/x86/kernel/unwind_frame.c > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state) > } > > /* Move to the next frame if it's safe: */ > - if (!update_stack_state(state, next_bp)) > - goto bad_address; > + if (!update_stack_state(state, next_bp)) { > + // Try again with shifted BP > + state->bp += 5; // see AUX_STACK_SPACE > + next_bp = (unsigned long > *)READ_ONCE_TASK_STACK(state->task, *state->bp); > + // Clean and refetch stack info, it's marked as error outed > + state->stack_mask = 0; > + get_stack_info(next_bp, state->task, > >stack_info, >stack_mask); > + if (!update_stack_state(state, next_bp)) { > + goto bad_address; > + } > + } > > return true; Nack. > For ORC unwinder, I think the unwinder can't find any info about the > JITed part. Maybe if can let it just skip the JITed part and go to > kernel context, then should be good enough. 
If it's starting from a fake pt_regs then that's going to be a challenge.

Will the JIT code always have the same stack layout?  If so then we could
hard code that knowledge in ORC.  Or even better, create a generic
interface for ORC to query the creator of the generated code about the
stack layout.

--
Josh
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 4:28 PM Song Liu wrote: > > > On May 22, 2019, at 11:48 PM, Kairui Song wrote: > > > > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf wrote: > >> > >> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote: > >>> On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote: > The one that is broken is prog_tests/stacktrace_map.c > There we attach bpf to standard tracepoint where > kernel suppose to collect pt_regs before calling into bpf. > And that's what bpf_get_stackid_tp() is doing. > It passes pt_regs (that was collected before any bpf) > into bpf_get_stackid() which calls get_perf_callchain(). > Same thing with kprobes, uprobes. > >>> > >>> Is it trying to unwind through ___bpf_prog_run()? > >>> > >>> If so, that would at least explain why ORC isn't working. Objtool > >>> currently ignores that function because it can't follow the jump table. > >> > >> Here's a tentative fix (for ORC, at least). I'll need to make sure this > >> doesn't break anything else. > >> > >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > >> index 242a643af82f..1d9a7cc4b836 100644 > >> --- a/kernel/bpf/core.c > >> +++ b/kernel/bpf/core.c > >> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct > >> bpf_insn *insn, u64 *stack) > >>BUG_ON(1); > >>return 0; > >> } > >> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */ > >> > >> #define PROG_NAME(stack_size) __bpf_prog_run##stack_size > >> #define DEFINE_BPF_PROG_RUN(stack_size) \ > >> diff --git a/tools/objtool/check.c b/tools/objtool/check.c > >> index 172f99195726..2567027fce95 100644 > >> --- a/tools/objtool/check.c > >> +++ b/tools/objtool/check.c > >> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct > >> objtool_file *file, > >>if (text_rela->type == R_X86_64_PC32) > >>table_offset += 4; > >> > >> - /* > >> -* Make sure the .rodata address isn't associated with a > >> -* symbol. gcc jump tables are anonymous data. 
> >> -*/ > >> - if (find_symbol_containing(rodata_sec, table_offset)) > >> - continue; > >> - > >>rodata_rela = find_rela_by_dest(rodata_sec, table_offset); > >>if (rodata_rela) { > >>/* > > > > Hi Josh, this still won't fix the problem. > > > > Problem is not (or not only) with ___bpf_prog_run, what actually went > > wrong is with the JITed bpf code. > > > > For frame pointer unwinder, it seems the JITed bpf code will have a > > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can > > unshift it properly then it will work. > > > > I tried below code, and problem is fixed (only for frame pointer > > unwinder though). Need to find a better way to detect and do any > > similar trick for bpf part, if this is a feasible way to fix it: > > > > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c > > index 9b9fd4826e7a..2c0fa2aaa7e4 100644 > > --- a/arch/x86/kernel/unwind_frame.c > > +++ b/arch/x86/kernel/unwind_frame.c > > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state) > >} > > > >/* Move to the next frame if it's safe: */ > > - if (!update_stack_state(state, next_bp)) > > - goto bad_address; > > + if (!update_stack_state(state, next_bp)) { > > + // Try again with shifted BP > > + state->bp += 5; // see AUX_STACK_SPACE > > + next_bp = (unsigned long > > *)READ_ONCE_TASK_STACK(state->task, *state->bp); > > + // Clean and refetch stack info, it's marked as error outed > > + state->stack_mask = 0; > > + get_stack_info(next_bp, state->task, > > >stack_info, >stack_mask); > > + if (!update_stack_state(state, next_bp)) { > > + goto bad_address; > > + } > > + } > > > >return true; > > > > For ORC unwinder, I think the unwinder can't find any info about the > > JITed part. Maybe if can let it just skip the JITed part and go to > > kernel context, then should be good enough. > > In this case (tracepoint), the callchain bpf_get_stackid() fetches is the > callchain at the tracepoint. 
> So we don't need the JITed part.

We don't want the JITed part indeed, but the unwinder is now being called
to start directly within the BPF JITed program, so it has to trace back
past the JITed part, ignoring it.

> BPF program passes the regs at the tracepoint to perf_callchain_kernel().
> However, perf_callchain_kernel() only uses regs->sp for !perf_hw_regs()
> case. This is probably expected, as passing regs in doesn't really help.

ORC does not work with partially dumped regs, and without frame pointers
it can't get a valid BP value at the tracepoint. So it only used
Re: Getting empty callchain from perf_callchain_kernel()
> On May 22, 2019, at 11:48 PM, Kairui Song wrote: > > On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf wrote: >> >> On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote: >>> On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote: The one that is broken is prog_tests/stacktrace_map.c There we attach bpf to standard tracepoint where kernel suppose to collect pt_regs before calling into bpf. And that's what bpf_get_stackid_tp() is doing. It passes pt_regs (that was collected before any bpf) into bpf_get_stackid() which calls get_perf_callchain(). Same thing with kprobes, uprobes. >>> >>> Is it trying to unwind through ___bpf_prog_run()? >>> >>> If so, that would at least explain why ORC isn't working. Objtool >>> currently ignores that function because it can't follow the jump table. >> >> Here's a tentative fix (for ORC, at least). I'll need to make sure this >> doesn't break anything else. >> >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c >> index 242a643af82f..1d9a7cc4b836 100644 >> --- a/kernel/bpf/core.c >> +++ b/kernel/bpf/core.c >> @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct >> bpf_insn *insn, u64 *stack) >>BUG_ON(1); >>return 0; >> } >> -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */ >> >> #define PROG_NAME(stack_size) __bpf_prog_run##stack_size >> #define DEFINE_BPF_PROG_RUN(stack_size) \ >> diff --git a/tools/objtool/check.c b/tools/objtool/check.c >> index 172f99195726..2567027fce95 100644 >> --- a/tools/objtool/check.c >> +++ b/tools/objtool/check.c >> @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct >> objtool_file *file, >>if (text_rela->type == R_X86_64_PC32) >>table_offset += 4; >> >> - /* >> -* Make sure the .rodata address isn't associated with a >> -* symbol. gcc jump tables are anonymous data. 
>> -*/ >> - if (find_symbol_containing(rodata_sec, table_offset)) >> - continue; >> - >>rodata_rela = find_rela_by_dest(rodata_sec, table_offset); >>if (rodata_rela) { >>/* > > Hi Josh, this still won't fix the problem. > > Problem is not (or not only) with ___bpf_prog_run, what actually went > wrong is with the JITed bpf code. > > For frame pointer unwinder, it seems the JITed bpf code will have a > shifted "BP" register? (arch/x86/net/bpf_jit_comp.c:217), so if we can > unshift it properly then it will work. > > I tried below code, and problem is fixed (only for frame pointer > unwinder though). Need to find a better way to detect and do any > similar trick for bpf part, if this is a feasible way to fix it: > > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c > index 9b9fd4826e7a..2c0fa2aaa7e4 100644 > --- a/arch/x86/kernel/unwind_frame.c > +++ b/arch/x86/kernel/unwind_frame.c > @@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state) >} > >/* Move to the next frame if it's safe: */ > - if (!update_stack_state(state, next_bp)) > - goto bad_address; > + if (!update_stack_state(state, next_bp)) { > + // Try again with shifted BP > + state->bp += 5; // see AUX_STACK_SPACE > + next_bp = (unsigned long > *)READ_ONCE_TASK_STACK(state->task, *state->bp); > + // Clean and refetch stack info, it's marked as error outed > + state->stack_mask = 0; > + get_stack_info(next_bp, state->task, > >stack_info, >stack_mask); > + if (!update_stack_state(state, next_bp)) { > + goto bad_address; > + } > + } > >return true; > > For ORC unwinder, I think the unwinder can't find any info about the > JITed part. Maybe if can let it just skip the JITed part and go to > kernel context, then should be good enough. In this case (tracepoint), the callchain bpf_get_stackid() fetches is the callchain at the tracepoint. So we don't need the JITed part. BPF program passes the regs at the tracepoint to perf_callchain_kernel(). 
However, perf_callchain_kernel() only uses regs->sp for the
!perf_hw_regs() case. This is probably expected, as passing regs in
doesn't really help. There are multiple cases in
unwind_orc.c:__unwind_start(), which I don't understand very well.

Does the above make sense? Did I mis-understand anything?

@Alexei, do you remember some rough time/version at which the ORC
unwinder worked well for tracepoints? Maybe we can dig into that version
to see the difference.

Thanks,
Song

> --
> Best Regards,
> Kairui Song
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 23, 2019 at 7:46 AM Josh Poimboeuf wrote: > > On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote: > > On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote: > > > The one that is broken is prog_tests/stacktrace_map.c > > > There we attach bpf to standard tracepoint where > > > kernel suppose to collect pt_regs before calling into bpf. > > > And that's what bpf_get_stackid_tp() is doing. > > > It passes pt_regs (that was collected before any bpf) > > > into bpf_get_stackid() which calls get_perf_callchain(). > > > Same thing with kprobes, uprobes. > > > > Is it trying to unwind through ___bpf_prog_run()? > > > > If so, that would at least explain why ORC isn't working. Objtool > > currently ignores that function because it can't follow the jump table. > > Here's a tentative fix (for ORC, at least). I'll need to make sure this > doesn't break anything else. > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > index 242a643af82f..1d9a7cc4b836 100644 > --- a/kernel/bpf/core.c > +++ b/kernel/bpf/core.c > @@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct > bpf_insn *insn, u64 *stack) > BUG_ON(1); > return 0; > } > -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */ > > #define PROG_NAME(stack_size) __bpf_prog_run##stack_size > #define DEFINE_BPF_PROG_RUN(stack_size) \ > diff --git a/tools/objtool/check.c b/tools/objtool/check.c > index 172f99195726..2567027fce95 100644 > --- a/tools/objtool/check.c > +++ b/tools/objtool/check.c > @@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct > objtool_file *file, > if (text_rela->type == R_X86_64_PC32) > table_offset += 4; > > - /* > -* Make sure the .rodata address isn't associated with a > -* symbol. gcc jump tables are anonymous data. > -*/ > - if (find_symbol_containing(rodata_sec, table_offset)) > - continue; > - > rodata_rela = find_rela_by_dest(rodata_sec, table_offset); > if (rodata_rela) { > /* Hi Josh, this still won't fix the problem. 
Problem is not (or not only) with ___bpf_prog_run; what actually went
wrong is the JITed bpf code.

For the frame pointer unwinder, it seems the JITed bpf code will have a
shifted "BP" register (arch/x86/net/bpf_jit_comp.c:217), so if we can
unshift it properly then it will work.

I tried the code below, and the problem is fixed (only for the frame
pointer unwinder though). Need to find a better way to detect and do any
similar trick for the bpf part, if this is a feasible way to fix it:

diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 9b9fd4826e7a..2c0fa2aaa7e4 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -330,8 +330,17 @@ bool unwind_next_frame(struct unwind_state *state)
 	}
 
 	/* Move to the next frame if it's safe: */
-	if (!update_stack_state(state, next_bp))
-		goto bad_address;
+	if (!update_stack_state(state, next_bp)) {
+		// Try again with shifted BP
+		state->bp += 5; // see AUX_STACK_SPACE
+		next_bp = (unsigned long *)READ_ONCE_TASK_STACK(state->task, *state->bp);
+		// Clean and refetch stack info, it's marked as errored out
+		state->stack_mask = 0;
+		get_stack_info(next_bp, state->task, &state->stack_info, &state->stack_mask);
+		if (!update_stack_state(state, next_bp)) {
+			goto bad_address;
+		}
+	}
 
 	return true;

For the ORC unwinder, I think the unwinder can't find any info about the
JITed part. Maybe if we can let it just skip the JITed part and go to the
kernel context, that should be good enough.

--
Best Regards,
Kairui Song
Re: Getting empty callchain from perf_callchain_kernel()
On Wed, May 22, 2019 at 12:45:17PM -0500, Josh Poimboeuf wrote:
> On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote:
> > The one that is broken is prog_tests/stacktrace_map.c
> > There we attach bpf to standard tracepoint where
> > kernel suppose to collect pt_regs before calling into bpf.
> > And that's what bpf_get_stackid_tp() is doing.
> > It passes pt_regs (that was collected before any bpf)
> > into bpf_get_stackid() which calls get_perf_callchain().
> > Same thing with kprobes, uprobes.
>
> Is it trying to unwind through ___bpf_prog_run()?
>
> If so, that would at least explain why ORC isn't working. Objtool
> currently ignores that function because it can't follow the jump table.

Here's a tentative fix (for ORC, at least). I'll need to make sure this
doesn't break anything else.

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 242a643af82f..1d9a7cc4b836 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1562,7 +1562,6 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 	BUG_ON(1);
 	return 0;
 }
-STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */

 #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
 #define DEFINE_BPF_PROG_RUN(stack_size) \
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 172f99195726..2567027fce95 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1033,13 +1033,6 @@ static struct rela *find_switch_table(struct objtool_file *file,
 		if (text_rela->type == R_X86_64_PC32)
 			table_offset += 4;

-		/*
-		 * Make sure the .rodata address isn't associated with a
-		 * symbol. gcc jump tables are anonymous data.
-		 */
-		if (find_symbol_containing(rodata_sec, table_offset))
-			continue;
-
 		rodata_rela = find_rela_by_dest(rodata_sec, table_offset);
 		if (rodata_rela) {
 			/*
Re: Getting empty callchain from perf_callchain_kernel()
On 5/22/19 11:07 AM, Josh Poimboeuf wrote: > On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: >> On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra wrote: >>> >>> On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote: On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote: > Hi, > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map > (on bpf/master > branch). > > After digging into the code, we found that perf_callchain_kernel() is > giving empty > callchain for tracepoint sched/sched_switch. And it seems related to > commit > > d15d356887e770c5f2dcf963b52c7cb510c9e42d > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER") > > Before this commit, perf_callchain_kernel() returns callchain with > regs->ip. With > this commit, regs->ip is not sent for !perf_hw_regs(regs) case. So while I think the below is indeed right; we should store regs->ip regardless of the unwind path chosen, I still think there's something fishy if this results in just the 1 entry. The sched/sched_switch event really should have a non-trivial stack. Let me see if I can reproduce with just perf. >>> >>> $ perf record -g -e "sched:sched_switch" -- make clean >>> $ perf report -D >>> >>> 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: >>> 0x81c29562 period: 1 addr: 0 >>> ... FP chain: nr:10 >>> . 0: ff80 >>> . 1: 81c29562 >>> . 2: 81c29933 >>> . 3: 8111f688 >>> . 4: 81120b9d >>> . 5: 81120ce5 >>> . 6: 8100254a >>> . 7: 81e0007d >>> . 8: fe00 >>> . 9: 7f9b6cd9682a >>> ... thread: sh:7236 >>> .. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux >>> >>> >>> IOW, it seems to 'work'. >>> >> >> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe >> some other bfp functions) is now broken, or, strating an unwind >> directly inside a bpf program will end up strangely. 
It have following >> kernel message: >> >> WARNING: kernel stack frame pointer at 70cad07c in >> test_progs:1242 has bad value ffd4497e >> >> And in the debug message: >> >> [ 160.460287] 6e117175: aa23a0e8 >> (get_perf_callchain+0x148/0x280) >> [ 160.460287] 02e8715f: 00c6bba0 (0xc6bba0) >> [ 160.460288] b3d07758: 9ce3f979 (0x9ce3f979) >> [ 160.460289] 55c97836: 9ce3f979 (0x9ce3f979) >> [ 160.460289] 7cbb140a: 0001007f (0x1007f) >> [ 160.460290] 7dc62ac9: ... >> [ 160.460290] 6b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000) >> [ 160.460291] f23d1826: d89cffc3ae80 (0xd89cffc3ae80) >> [ 160.460292] f9a16017: 007f (0x7f) >> [ 160.460292] a8e62d44: ... >> [ 160.460293] cbc83c97: b89d00d8d000 (0xb89d00d8d000) >> [ 160.460293] 92842522: 007f (0x7f) >> [ 160.460294] dafbec89: b89d00c6bc50 (0xb89d00c6bc50) >> [ 160.460296] 0777e4cf: aa225d97 >> (bpf_get_stackid+0x77/0x470) >> [ 160.460296] 9942ea16: ... >> [ 160.460297] a08006b1: 0001 (0x1) >> [ 160.460298] 9f03b438: b89d00c6bc30 (0xb89d00c6bc30) >> [ 160.460299] 6caf8b32: aa074fe8 (__do_page_fault+0x58/0x90) >> [ 160.460300] 3a13d702: ... >> [ 160.460300] e2e2496d: 9ce3 (0x9ce3) >> [ 160.460301] 8ee6b7c2: d89cffc3ae80 (0xd89cffc3ae80) >> [ 160.460301] a8cf6d55: ... >> [ 160.460302] 59078076: d89cffc3ae80 (0xd89cffc3ae80) >> [ 160.460303] c6aac739: 9ce3f1e18eb0 (0x9ce3f1e18eb0) >> [ 160.460303] a39aff92: b89d00c6bc60 (0xb89d00c6bc60) >> [ 160.460305] 97498bf2: aa1f4791 >> (bpf_get_stackid_tp+0x11/0x20) >> [ 160.460306] 6992de1e: b89d00c6bc78 (0xb89d00c6bc78) >> [ 160.460307] dacd0ce5: c0405676 (0xc0405676) >> [ 160.460307] a81f2714: ... >> >> # Note here is the invalid frame pointer >> [ 160.460308] 70cad07c: b89d00a1d000 (0xb89d00a1d000) >> [ 160.460308] f8f194e4: 0001 (0x1) >> [ 160.460309] 2134f42a: d89cffc3ae80 (0xd89cffc3ae80) >> [ 160.460310] f9345889: 9ce3f1e18eb0 (0x9ce3f1e18eb0) >> [ 160.460310] 8ad22a42: ... 
>> [ 160.460311] 73808173: b89d00c6bce0 (0xb89d00c6bce0) >> [ 160.460312] c9e4: aa1f48d1 (trace_call_bpf+0x81/0x100) >> [ 160.460313] c5d8ebd1: b89d00c6bcc0 (0xb89d00c6bcc0) >> [ 160.460315] bce0b072:
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: > On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra wrote: > > > > On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote: > > > On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote: > > > > Hi, > > > > > > > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map > > > > (on bpf/master > > > > branch). > > > > > > > > After digging into the code, we found that perf_callchain_kernel() is > > > > giving empty > > > > callchain for tracepoint sched/sched_switch. And it seems related to > > > > commit > > > > > > > > d15d356887e770c5f2dcf963b52c7cb510c9e42d > > > > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER") > > > > > > > > Before this commit, perf_callchain_kernel() returns callchain with > > > > regs->ip. With > > > > this commit, regs->ip is not sent for !perf_hw_regs(regs) case. > > > > > > So while I think the below is indeed right; we should store regs->ip > > > regardless of the unwind path chosen, I still think there's something > > > fishy if this results in just the 1 entry. > > > > > > The sched/sched_switch event really should have a non-trivial stack. > > > > > > Let me see if I can reproduce with just perf. > > > > $ perf record -g -e "sched:sched_switch" -- make clean > > $ perf report -D > > > > 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: > > 0x81c29562 period: 1 addr: 0 > > ... FP chain: nr:10 > > . 0: ff80 > > . 1: 81c29562 > > . 2: 81c29933 > > . 3: 8111f688 > > . 4: 81120b9d > > . 5: 81120ce5 > > . 6: 8100254a > > . 7: 81e0007d > > . 8: fe00 > > . 9: 7f9b6cd9682a > > ... thread: sh:7236 > > .. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux > > > > > > IOW, it seems to 'work'. > > > > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe > some other bfp functions) is now broken, or, strating an unwind > directly inside a bpf program will end up strangely. 
It have following > kernel message: > > WARNING: kernel stack frame pointer at 70cad07c in > test_progs:1242 has bad value ffd4497e > > And in the debug message: > > [ 160.460287] 6e117175: aa23a0e8 > (get_perf_callchain+0x148/0x280) > [ 160.460287] 02e8715f: 00c6bba0 (0xc6bba0) > [ 160.460288] b3d07758: 9ce3f979 (0x9ce3f979) > [ 160.460289] 55c97836: 9ce3f979 (0x9ce3f979) > [ 160.460289] 7cbb140a: 0001007f (0x1007f) > [ 160.460290] 7dc62ac9: ... > [ 160.460290] 6b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000) > [ 160.460291] f23d1826: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460292] f9a16017: 007f (0x7f) > [ 160.460292] a8e62d44: ... > [ 160.460293] cbc83c97: b89d00d8d000 (0xb89d00d8d000) > [ 160.460293] 92842522: 007f (0x7f) > [ 160.460294] dafbec89: b89d00c6bc50 (0xb89d00c6bc50) > [ 160.460296] 0777e4cf: aa225d97 (bpf_get_stackid+0x77/0x470) > [ 160.460296] 9942ea16: ... > [ 160.460297] a08006b1: 0001 (0x1) > [ 160.460298] 9f03b438: b89d00c6bc30 (0xb89d00c6bc30) > [ 160.460299] 6caf8b32: aa074fe8 (__do_page_fault+0x58/0x90) > [ 160.460300] 3a13d702: ... > [ 160.460300] e2e2496d: 9ce3 (0x9ce3) > [ 160.460301] 8ee6b7c2: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460301] a8cf6d55: ... > [ 160.460302] 59078076: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460303] c6aac739: 9ce3f1e18eb0 (0x9ce3f1e18eb0) > [ 160.460303] a39aff92: b89d00c6bc60 (0xb89d00c6bc60) > [ 160.460305] 97498bf2: aa1f4791 > (bpf_get_stackid_tp+0x11/0x20) > [ 160.460306] 6992de1e: b89d00c6bc78 (0xb89d00c6bc78) > [ 160.460307] dacd0ce5: c0405676 (0xc0405676) > [ 160.460307] a81f2714: ... > > # Note here is the invalid frame pointer > [ 160.460308] 70cad07c: b89d00a1d000 (0xb89d00a1d000) > [ 160.460308] f8f194e4: 0001 (0x1) > [ 160.460309] 2134f42a: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460310] f9345889: 9ce3f1e18eb0 (0x9ce3f1e18eb0) > [ 160.460310] 8ad22a42: ... 
> [ 160.460311] 73808173: b89d00c6bce0 (0xb89d00c6bce0) > [ 160.460312] c9e4: aa1f48d1 (trace_call_bpf+0x81/0x100) > [ 160.460313] c5d8ebd1: b89d00c6bcc0 (0xb89d00c6bcc0) > [ 160.460315] bce0b072: ab651be0 > (event_sched_migrate_task+0xa0/0xa0)
Re: Getting empty callchain from perf_callchain_kernel()
On Wed, May 22, 2019 at 02:49:07PM +, Alexei Starovoitov wrote: > The one that is broken is prog_tests/stacktrace_map.c > There we attach bpf to standard tracepoint where > kernel suppose to collect pt_regs before calling into bpf. > And that's what bpf_get_stackid_tp() is doing. > It passes pt_regs (that was collected before any bpf) > into bpf_get_stackid() which calls get_perf_callchain(). > Same thing with kprobes, uprobes. Is it trying to unwind through ___bpf_prog_run()? If so, that would at least explain why ORC isn't working. Objtool currently ignores that function because it can't follow the jump table. -- Josh
Re: Getting empty callchain from perf_callchain_kernel()
On 5/22/19 7:02 AM, Peter Zijlstra wrote: > >> If the unwinder could trace back through the bpf func call then there >> will be no such problem. > > Why couldn't it trace back through the bpf stuff? And how can we fix > that? Most of the time there is no 'tracing through bpf stuff'. bpf infra is preserving 'pt_regs*' that was collected before any bpf things start executing. The reason is that bpf can be executed via interpreter and directly when JITed. In both cases collected stack traces should be the same or it's confusing the users and they cannot compensate for such difference. The only exception is raw_tracepoint, since it's the most minimalistic way of calling bpf and kernel side doesn't do anything before calling into bpf. Only in such case bpf side has to call perf_fetch_caller_regs(). See bpf_get_stackid_raw_tp(). But this test case is actually working! It's covered by prog_tests/stacktrace_map_raw_tp.c and it passes. The one that is broken is prog_tests/stacktrace_map.c There we attach bpf to standard tracepoint where kernel suppose to collect pt_regs before calling into bpf. And that's what bpf_get_stackid_tp() is doing. It passes pt_regs (that was collected before any bpf) into bpf_get_stackid() which calls get_perf_callchain(). Same thing with kprobes, uprobes.
Re: Getting empty callchain from perf_callchain_kernel()
On Mon, May 20, 2019 at 02:06:54AM +0800, Kairui Song wrote: > On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra wrote: > > > > On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: > > > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe > > > some other bfp functions) is now broken, or, strating an unwind > > > directly inside a bpf program will end up strangely. It have following > > > kernel message: > > > > Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't > > follow. > > bpf_get_stackid_tp will just use the regs passed to it from the trace > point. And then it will eventually call perf_get_callchain to get the > call chain. > With a tracepoint we have the fake regs, so unwinder will start from > where it is called, and use the fake regs as the indicator of the > target frame it want, and keep unwinding until reached the actually > callsite. > > But if the stack trace is started withing a bpf func call then it's broken... I'm confused.. how is this broken? Surely we should eventually find the original stack frame and be good again, right? > If the unwinder could trace back through the bpf func call then there > will be no such problem. Why couldn't it trace back through the bpf stuff? And how can we fix that?
Re: Getting empty callchain from perf_callchain_kernel()
On Mon, May 20, 2019 at 05:22:12PM +, Song Liu wrote: > I think this is still the best fix/workaround here? And only one level > of stack trace should be OK for tracepoint? No. That's still completely broken.
Re: Getting empty callchain from perf_callchain_kernel()
> On May 19, 2019, at 11:07 AM, Kairui Song wrote: > > On Sat, May 18, 2019 at 5:48 AM Song Liu wrote: >> >> >> >>> On May 17, 2019, at 2:06 PM, Alexei Starovoitov wrote: >>> >>> On 5/17/19 11:40 AM, Song Liu wrote: +Alexei, Daniel, and bpf > On May 17, 2019, at 2:10 AM, Peter Zijlstra wrote: > > On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: >> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe >> some other bfp functions) is now broken, or, strating an unwind >> directly inside a bpf program will end up strangely. It have following >> kernel message: > > Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't > follow. I guess we need something like the following? (we should be able to optimize the PER_CPU stuff). Thanks, Song diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c index f92d6ad5e080..c525149028a7 100644 --- i/kernel/trace/bpf_trace.c +++ w/kernel/trace/bpf_trace.c @@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = { .arg5_type = ARG_CONST_SIZE_OR_ZERO, }; +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs); BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map, u64, flags) { - struct pt_regs *regs = *(struct pt_regs **)tp_buff; + struct pt_regs *regs = this_cpu_ptr(_stackid_tp_regs); + perf_fetch_caller_regs(regs); >>> >>> No. pt_regs is already passed in. It's the first argument. >>> If we call perf_fetch_caller_regs() again the stack trace will be wrong. >>> bpf prog should not see itself, interpreter or all the frames in between. >> >> Thanks Alexei! I get it now. 
>> >> In bpf_get_stackid_tp(), the pt_regs is get by dereferencing the first field >> of tp_buff: >> >>struct pt_regs *regs = *(struct pt_regs **)tp_buff; >> >> tp_buff points to something like >> >>struct sched_switch_args { >>unsigned long long pad; >>char prev_comm[16]; >>int prev_pid; >>int prev_prio; >>long long prev_state; >>char next_comm[16]; >>int next_pid; >>int next_prio; >>}; >> >> where the first field "pad" is a pointer to pt_regs. >> >> @Kairui, I think you confirmed that current code will give empty call trace >> with ORC unwinder? If that's the case, can we add regs->ip back? (as in the >> first email of this thread. >> >> Thanks, >> Song >> > > Hi thanks for the suggestion, yes we can add it should be good an idea > to always have IP when stack trace is not available. > But stack trace is actually still broken, it will always give only one > level of stacktrace (the IP). I think this is still the best fix/workaround here? And only one level of stack trace should be OK for tracepoint? Thanks, Song > > -- > Best Regards, > Kairui Song
Re: Getting empty callchain from perf_callchain_kernel()
Sorry for previous empty email.. Clicked send by accident. > On May 19, 2019, at 11:06 AM, Kairui Song wrote: > > On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra wrote: >> >> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: >>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe >>> some other bfp functions) is now broken, or, strating an unwind >>> directly inside a bpf program will end up strangely. It have following >>> kernel message: >> >> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't >> follow. > > bpf_get_stackid_tp will just use the regs passed to it from the trace > point. And then it will eventually call perf_get_callchain to get the > call chain. > With a tracepoint we have the fake regs, so unwinder will start from > where it is called, and use the fake regs as the indicator of the > target frame it want, and keep unwinding until reached the actually > callsite. > > But if the stack trace is started withing a bpf func call then it's broken... > > If the unwinder could trace back through the bpf func call then there > will be no such problem. > > For frame pointer unwinder, set the indicator flag (X86_EFLAGS_FIXED) > before bpf call, and ensure bp is also dumped could fix it (so it will > start using the regs for bpf calls, like before the commit > d15d356887e7). But for ORC I don't see a clear way to fix the problem. > First though is maybe dump some callee's regs for ORC (IP, BP, SP, DI, > DX, R10, R13, else?) in the trace point handler, then use the flag to > indicate ORC to do one more unwind (because can't get caller's regs, > so get callee's regs instaed) before actually giving output? > > I had a try, for framepointer unwinder, mark the indicator flag before > calling bpf functions, and dump bp as well in the trace point. 
Then > with frame pointer, it works, test passed: > > diff --git a/arch/x86/include/asm/perf_event.h > b/arch/x86/include/asm/perf_event.h > index 1392d5e6e8d6..6f1192e9776b 100644 > --- a/arch/x86/include/asm/perf_event.h > +++ b/arch/x86/include/asm/perf_event.h > @@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct > pt_regs *regs); > > #include > > +#ifdef CONFIG_FRAME_POINTER > +static inline unsigned long caller_frame_pointer(void) > +{ > + return (unsigned long)__builtin_frame_address(1); > +} > +#else > +static inline unsigned long caller_frame_pointer(void) > +{ > + return 0; > +} > +#endif > + > /* > * We abuse bit 3 from flags to pass exact information, see perf_misc_flags > * and the comment with PERF_EFLAGS_EXACT. > */ > #define perf_arch_fetch_caller_regs(regs, __ip){ \ >(regs)->ip = (__ip);\ > + (regs)->bp = caller_frame_pointer();\ >(regs)->sp = (unsigned long)__builtin_frame_address(0); \ >(regs)->cs = __KERNEL_CS; \ >regs->flags = 0;\ > diff --git a/kernel/events/core.c b/kernel/events/core.c > index abbd4b3b96c2..ca7b95ee74f0 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data, > int size, int rctx, > struct task_struct *task) > { >if (bpf_prog_array_valid(call)) { > + regs->flags |= X86_EFLAGS_FIXED; >*(struct pt_regs **)raw_data = regs; >if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) { >perf_swevent_put_recursion_context(rctx); > @@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event > *event, >int ret = 0; > >ctx.regs = perf_arch_bpf_user_pt_regs(regs); > + ctx.regs->flags |= X86_EFLAGS_FIXED; > + >preempt_disable(); >if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) >goto out; > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index f92d6ad5e080..e1fa656677dc 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, 
u64 > flags, void *meta, u64 meta_size, >}; > >perf_fetch_caller_regs(regs); > + regs->flags |= X86_EFLAGS_FIXED; > + >perf_sample_data_init(sd, 0, 0); >sd->raw = > > @@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct > bpf_raw_tracepoint_args *, args, >struct pt_regs *regs = this_cpu_ptr(_raw_tp_regs); > >perf_fetch_caller_regs(regs); > + regs->flags |= X86_EFLAGS_FIXED; > + >return bpf_perf_event_output(regs, map, flags, data, size); > } > > @@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct > bpf_raw_tracepoint_args *, args, >struct pt_regs *regs = this_cpu_ptr(_raw_tp_regs); > >perf_fetch_caller_regs(regs); > + regs->flags |= X86_EFLAGS_FIXED; > + >
Re: Getting empty callchain from perf_callchain_kernel()
On Sat, May 18, 2019 at 5:48 AM Song Liu wrote: > > > > > On May 17, 2019, at 2:06 PM, Alexei Starovoitov wrote: > > > > On 5/17/19 11:40 AM, Song Liu wrote: > >> +Alexei, Daniel, and bpf > >> > >>> On May 17, 2019, at 2:10 AM, Peter Zijlstra wrote: > >>> > >>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe > some other bfp functions) is now broken, or, strating an unwind > directly inside a bpf program will end up strangely. It have following > kernel message: > >>> > >>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't > >>> follow. > >> > >> I guess we need something like the following? (we should be able to > >> optimize the PER_CPU stuff). > >> > >> Thanks, > >> Song > >> > >> > >> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c > >> index f92d6ad5e080..c525149028a7 100644 > >> --- i/kernel/trace/bpf_trace.c > >> +++ w/kernel/trace/bpf_trace.c > >> @@ -696,11 +696,13 @@ static const struct bpf_func_proto > >> bpf_perf_event_output_proto_tp = { > >> .arg5_type = ARG_CONST_SIZE_OR_ZERO, > >> }; > >> > >> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs); > >> BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map, > >>u64, flags) > >> { > >> - struct pt_regs *regs = *(struct pt_regs **)tp_buff; > >> + struct pt_regs *regs = this_cpu_ptr(_stackid_tp_regs); > >> > >> + perf_fetch_caller_regs(regs); > > > > No. pt_regs is already passed in. It's the first argument. > > If we call perf_fetch_caller_regs() again the stack trace will be wrong. > > bpf prog should not see itself, interpreter or all the frames in between. > > Thanks Alexei! I get it now. 
>
> In bpf_get_stackid_tp(), the pt_regs is get by dereferencing the first field
> of tp_buff:
>
>    struct pt_regs *regs = *(struct pt_regs **)tp_buff;
>
> tp_buff points to something like
>
>    struct sched_switch_args {
>        unsigned long long pad;
>        char prev_comm[16];
>        int prev_pid;
>        int prev_prio;
>        long long prev_state;
>        char next_comm[16];
>        int next_pid;
>        int next_prio;
>    };
>
> where the first field "pad" is a pointer to pt_regs.
>
> @Kairui, I think you confirmed that current code will give empty call trace
> with ORC unwinder? If that's the case, can we add regs->ip back? (as in the
> first email of this thread.)
>
> Thanks,
> Song
>

Hi, thanks for the suggestion. Yes, we can add it; it should be a good idea
to always have the IP when the stack trace is not available.
But the stack trace is actually still broken: it will always give only one
level of stacktrace (the IP).

--
Best Regards,
Kairui Song
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra wrote: > > On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote: > > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe > > some other bfp functions) is now broken, or, strating an unwind > > directly inside a bpf program will end up strangely. It have following > > kernel message: > > Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't > follow. bpf_get_stackid_tp will just use the regs passed to it from the trace point. And then it will eventually call perf_get_callchain to get the call chain. With a tracepoint we have the fake regs, so unwinder will start from where it is called, and use the fake regs as the indicator of the target frame it want, and keep unwinding until reached the actually callsite. But if the stack trace is started withing a bpf func call then it's broken... If the unwinder could trace back through the bpf func call then there will be no such problem. For frame pointer unwinder, set the indicator flag (X86_EFLAGS_FIXED) before bpf call, and ensure bp is also dumped could fix it (so it will start using the regs for bpf calls, like before the commit d15d356887e7). But for ORC I don't see a clear way to fix the problem. First though is maybe dump some callee's regs for ORC (IP, BP, SP, DI, DX, R10, R13, else?) in the trace point handler, then use the flag to indicate ORC to do one more unwind (because can't get caller's regs, so get callee's regs instaed) before actually giving output? I had a try, for framepointer unwinder, mark the indicator flag before calling bpf functions, and dump bp as well in the trace point. 
Then with frame pointer, it works, test passed:

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1392d5e6e8d6..6f1192e9776b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);

 #include <asm/stacktrace.h>

+#ifdef CONFIG_FRAME_POINTER
+static inline unsigned long caller_frame_pointer(void)
+{
+	return (unsigned long)__builtin_frame_address(1);
+}
+#else
+static inline unsigned long caller_frame_pointer(void)
+{
+	return 0;
+}
+#endif
+
 /*
  * We abuse bit 3 from flags to pass exact information, see perf_misc_flags
  * and the comment with PERF_EFLAGS_EXACT.
  */
 #define perf_arch_fetch_caller_regs(regs, __ip)		{	\
 	(regs)->ip = (__ip);					\
+	(regs)->bp = caller_frame_pointer();			\
 	(regs)->sp = (unsigned long)__builtin_frame_address(0);	\
 	(regs)->cs = __KERNEL_CS;				\
 	regs->flags = 0;					\
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3b96c2..ca7b95ee74f0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
 			       struct task_struct *task)
 {
 	if (bpf_prog_array_valid(call)) {
+		regs->flags |= X86_EFLAGS_FIXED;
 		*(struct pt_regs **)raw_data = regs;
 		if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
 			perf_swevent_put_recursion_context(rctx);
@@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
 	int ret = 0;

 	ctx.regs = perf_arch_bpf_user_pt_regs(regs);
+	ctx.regs->flags |= X86_EFLAGS_FIXED;
+
 	preempt_disable();
 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
 		goto out;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..e1fa656677dc 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 	};

 	perf_fetch_caller_regs(regs);
+	regs->flags |= X86_EFLAGS_FIXED;
+
 	perf_sample_data_init(sd, 0, 0);
 	sd->raw = &raw;

@@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
 	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

 	perf_fetch_caller_regs(regs);
+	regs->flags |= X86_EFLAGS_FIXED;
+
 	return bpf_perf_event_output(regs, map, flags, data, size);
 }

@@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
 	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

 	perf_fetch_caller_regs(regs);
+	regs->flags |= X86_EFLAGS_FIXED;
+
@@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
 	struct pt_regs *regs =
Re: Getting empty callchain from perf_callchain_kernel()
> On May 17, 2019, at 2:06 PM, Alexei Starovoitov wrote:
>
> On 5/17/19 11:40 AM, Song Liu wrote:
>> +Alexei, Daniel, and bpf
>>
>>> On May 17, 2019, at 2:10 AM, Peter Zijlstra wrote:
>>>
>>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>>> some other bpf functions) is now broken, or, starting an unwind
>>>> directly inside a bpf program will end up strangely. It has the
>>>> following kernel message:
>>>
>>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>>> follow.
>>
>> I guess we need something like the following? (we should be able to
>> optimize the PER_CPU stuff).
>>
>> Thanks,
>> Song
>>
>>
>> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
>> index f92d6ad5e080..c525149028a7 100644
>> --- i/kernel/trace/bpf_trace.c
>> +++ w/kernel/trace/bpf_trace.c
>> @@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
>>  	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
>>  };
>>
>> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
>>  BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
>>  	   u64, flags)
>>  {
>> -	struct pt_regs *regs = *(struct pt_regs **)tp_buff;
>> +	struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);
>>
>> +	perf_fetch_caller_regs(regs);
>
> No. pt_regs is already passed in. It's the first argument.
> If we call perf_fetch_caller_regs() again the stack trace will be wrong.
> bpf prog should not see itself, interpreter or all the frames in between.

Thanks Alexei! I get it now.

In bpf_get_stackid_tp(), the pt_regs is obtained by dereferencing the
first field of tp_buff:

	struct pt_regs *regs = *(struct pt_regs **)tp_buff;

tp_buff points to something like

	struct sched_switch_args {
		unsigned long long pad;
		char prev_comm[16];
		int prev_pid;
		int prev_prio;
		long long prev_state;
		char next_comm[16];
		int next_pid;
		int next_prio;
	};

where the first field "pad" is a pointer to pt_regs.
@Kairui, I think you confirmed that the current code gives an empty call
trace with the ORC unwinder? If that's the case, can we add regs->ip
back (as in the first email of this thread)?

Thanks,
Song
Re: Getting empty callchain from perf_callchain_kernel()
On 5/17/19 11:40 AM, Song Liu wrote:
> +Alexei, Daniel, and bpf
>
>> On May 17, 2019, at 2:10 AM, Peter Zijlstra wrote:
>>
>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>> some other bpf functions) is now broken, or, starting an unwind
>>> directly inside a bpf program will end up strangely. It has the
>>> following kernel message:
>>
>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>> follow.
>
> I guess we need something like the following? (we should be able to
> optimize the PER_CPU stuff).
>
> Thanks,
> Song
>
>
> diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
> index f92d6ad5e080..c525149028a7 100644
> --- i/kernel/trace/bpf_trace.c
> +++ w/kernel/trace/bpf_trace.c
> @@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
>  	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
>  };
>
> +static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
>  BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
>  	   u64, flags)
>  {
> -	struct pt_regs *regs = *(struct pt_regs **)tp_buff;
> +	struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);
>
> +	perf_fetch_caller_regs(regs);

No. pt_regs is already passed in. It's the first argument.
If we call perf_fetch_caller_regs() again the stack trace will be wrong.
bpf prog should not see itself, interpreter or all the frames in between.
Re: Getting empty callchain from perf_callchain_kernel()
+Alexei, Daniel, and bpf

> On May 17, 2019, at 2:10 AM, Peter Zijlstra wrote:
>
> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>> some other bpf functions) is now broken, or, starting an unwind
>> directly inside a bpf program will end up strangely. It has the
>> following kernel message:
>
> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
> follow.

I guess we need something like the following? (we should be able to
optimize the PER_CPU stuff).

Thanks,
Song

diff --git i/kernel/trace/bpf_trace.c w/kernel/trace/bpf_trace.c
index f92d6ad5e080..c525149028a7 100644
--- i/kernel/trace/bpf_trace.c
+++ w/kernel/trace/bpf_trace.c
@@ -696,11 +696,13 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_tp = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };

+static DEFINE_PER_CPU(struct pt_regs, bpf_stackid_tp_regs);
 BPF_CALL_3(bpf_get_stackid_tp, void *, tp_buff, struct bpf_map *, map,
 	   u64, flags)
 {
-	struct pt_regs *regs = *(struct pt_regs **)tp_buff;
+	struct pt_regs *regs = this_cpu_ptr(&bpf_stackid_tp_regs);

+	perf_fetch_caller_regs(regs);
 	/*
 	 * Same comment as in bpf_perf_event_output_tp(), only that this time
 	 * the other helper's function body cannot be inlined due to being
@@ -719,10 +721,13 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };

+static DEFINE_PER_CPU(struct pt_regs, bpf_stack_tp_regs);
 BPF_CALL_4(bpf_get_stack_tp, void *, tp_buff, void *, buf, u32, size,
 	   u64, flags)
 {
-	struct pt_regs *regs = *(struct pt_regs **)tp_buff;
+	struct pt_regs *regs = this_cpu_ptr(&bpf_stack_tp_regs);
+
+	perf_fetch_caller_regs(regs);

 	return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
 			     (unsigned long) size, flags, 0);
Re: Getting empty callchain from perf_callchain_kernel()
> On May 17, 2019, at 1:10 AM, Peter Zijlstra wrote:
>
> On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
>> On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote:
>>> Hi,
>>>
>>> We found a failure with selftests/bpf/tests_prog in test_stacktrace_map
>>> (on bpf/master branch).
>>>
>>> After digging into the code, we found that perf_callchain_kernel() is
>>> giving empty callchain for tracepoint sched/sched_switch. And it seems
>>> related to commit
>>>
>>> d15d356887e770c5f2dcf963b52c7cb510c9e42d
>>> ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
>>>
>>> Before this commit, perf_callchain_kernel() returns callchain with
>>> regs->ip. With this commit, regs->ip is not sent for !perf_hw_regs(regs)
>>> case.
>>
>> So while I think the below is indeed right; we should store regs->ip
>> regardless of the unwind path chosen, I still think there's something
>> fishy if this results in just the 1 entry.
>>
>> The sched/sched_switch event really should have a non-trivial stack.
>>
>> Let me see if I can reproduce with just perf.
>
> $ perf record -g -e "sched:sched_switch" -- make clean
> $ perf report -D
>
> 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0x81c29562 period: 1 addr: 0
> ... FP chain: nr:10
> .  0: ff80
> .  1: 81c29562
> .  2: 81c29933
> .  3: 8111f688
> .  4: 81120b9d
> .  5: 81120ce5
> .  6: 8100254a
> .  7: 81e0007d
> .  8: fe00
> .  9: 7f9b6cd9682a
> ... thread: sh:7236
> .. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux

Hmm... I also get this FP chain even with CONFIG_UNWINDER_ORC=y. I guess
it is actually from ORC?

Thanks,
Song
Re: Getting empty callchain from perf_callchain_kernel()
> On May 17, 2019, at 1:32 AM, Kairui Song wrote:
>
> On Fri, May 17, 2019 at 4:15 PM Kairui Song wrote:
>>
>> On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra wrote:
>>>
>>> On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
>>>> On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote:
>>>>> Hi,
>>>>>
>>>>> We found a failure with selftests/bpf/tests_prog in test_stacktrace_map
>>>>> (on bpf/master branch).
>>>>>
>>>>> After digging into the code, we found that perf_callchain_kernel() is
>>>>> giving empty callchain for tracepoint sched/sched_switch. And it seems
>>>>> related to commit
>>>>>
>>>>> d15d356887e770c5f2dcf963b52c7cb510c9e42d
>>>>> ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
>>>>>
>>>>> Before this commit, perf_callchain_kernel() returns callchain with
>>>>> regs->ip. With this commit, regs->ip is not sent for !perf_hw_regs(regs)
>>>>> case.
>>>>
>>>> So while I think the below is indeed right; we should store regs->ip
>>>> regardless of the unwind path chosen, I still think there's something
>>>> fishy if this results in just the 1 entry.
>>>>
>>>> The sched/sched_switch event really should have a non-trivial stack.
>>>>
>>>> Let me see if I can reproduce with just perf.
>>>
>>> $ perf record -g -e "sched:sched_switch" -- make clean
>>> $ perf report -D
>>>
>>> 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0x81c29562 period: 1 addr: 0
>>> ... FP chain: nr:10
>>> .  0: ff80
>>> .  1: 81c29562
>>> .  2: 81c29933
>>> .  3: 8111f688
>>> .  4: 81120b9d
>>> .  5: 81120ce5
>>> .  6: 8100254a
>>> .  7: 81e0007d
>>> .  8: fe00
>>> .  9: 7f9b6cd9682a
>>> ... thread: sh:7236
>>> .. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
>>>
>>> IOW, it seems to 'work'.
>>
>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>> some other bpf functions) is now broken, or, starting an unwind
>> directly inside a bpf program will end up strangely.
It have following >> kernel message: >> >> WARNING: kernel stack frame pointer at 70cad07c in >> test_progs:1242 has bad value ffd4497e >> >> And in the debug message: >> >> [ 160.460287] 6e117175: aa23a0e8 >> (get_perf_callchain+0x148/0x280) >> [ 160.460287] 02e8715f: 00c6bba0 (0xc6bba0) >> [ 160.460288] b3d07758: 9ce3f979 (0x9ce3f979) >> [ 160.460289] 55c97836: 9ce3f979 (0x9ce3f979) >> [ 160.460289] 7cbb140a: 0001007f (0x1007f) >> [ 160.460290] 7dc62ac9: ... >> [ 160.460290] 6b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000) >> [ 160.460291] f23d1826: d89cffc3ae80 (0xd89cffc3ae80) >> [ 160.460292] f9a16017: 007f (0x7f) >> [ 160.460292] a8e62d44: ... >> [ 160.460293] cbc83c97: b89d00d8d000 (0xb89d00d8d000) >> [ 160.460293] 92842522: 007f (0x7f) >> [ 160.460294] dafbec89: b89d00c6bc50 (0xb89d00c6bc50) >> [ 160.460296] 0777e4cf: aa225d97 >> (bpf_get_stackid+0x77/0x470) >> [ 160.460296] 9942ea16: ... >> [ 160.460297] a08006b1: 0001 (0x1) >> [ 160.460298] 9f03b438: b89d00c6bc30 (0xb89d00c6bc30) >> [ 160.460299] 6caf8b32: aa074fe8 (__do_page_fault+0x58/0x90) >> [ 160.460300] 3a13d702: ... >> [ 160.460300] e2e2496d: 9ce3 (0x9ce3) >> [ 160.460301] 8ee6b7c2: d89cffc3ae80 (0xd89cffc3ae80) >> [ 160.460301] a8cf6d55: ... >> [ 160.460302] 59078076: d89cffc3ae80 (0xd89cffc3ae80) >> [ 160.460303] c6aac739: 9ce3f1e18eb0 (0x9ce3f1e18eb0) >> [ 160.460303] a39aff92: b89d00c6bc60 (0xb89d00c6bc60) >> [ 160.460305] 97498bf2: aa1f4791 >> (bpf_get_stackid_tp+0x11/0x20) >> [ 160.460306] 6992de1e: b89d00c6bc78 (0xb89d00c6bc78) >> [ 160.460307] dacd0ce5: c0405676 (0xc0405676) >> [ 160.460307] a81f2714: ... >> >> # Note here is the invalid frame pointer >> [ 160.460308] 70cad07c: b89d00a1d000 (0xb89d00a1d000) >> [ 160.460308] f8f194e4: 0001 (0x1) >> [ 160.460309] 2134f42a: d89cffc3ae80 (0xd89cffc3ae80) >> [ 160.460310] f9345889: 9ce3f1e18eb0 (0x9ce3f1e18eb0) >> [ 160.460310] 8ad22a42: ... 
>> [ 160.460311] 73808173: b89d00c6bce0 (0xb89d00c6bce0) >> [ 160.460312] c9e4: aa1f48d1 (trace_call_bpf+0x81/0x100) >> [ 160.460313] c5d8ebd1: b89d00c6bcc0 (0xb89d00c6bcc0) >> [
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> some other bpf functions) is now broken, or, starting an unwind
> directly inside a bpf program will end up strangely. It has the
> following kernel message:

Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
follow.
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 17, 2019 at 4:15 PM Kairui Song wrote:
>
> On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra wrote:
> >
> > On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote:
> > > > Hi,
> > > >
> > > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map
> > > > (on bpf/master branch).
> > > >
> > > > After digging into the code, we found that perf_callchain_kernel() is
> > > > giving empty callchain for tracepoint sched/sched_switch. And it seems
> > > > related to commit
> > > >
> > > > d15d356887e770c5f2dcf963b52c7cb510c9e42d
> > > > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> > > >
> > > > Before this commit, perf_callchain_kernel() returns callchain with
> > > > regs->ip. With this commit, regs->ip is not sent for !perf_hw_regs(regs)
> > > > case.
> > >
> > > So while I think the below is indeed right; we should store regs->ip
> > > regardless of the unwind path chosen, I still think there's something
> > > fishy if this results in just the 1 entry.
> > >
> > > The sched/sched_switch event really should have a non-trivial stack.
> > >
> > > Let me see if I can reproduce with just perf.
> >
> > $ perf record -g -e "sched:sched_switch" -- make clean
> > $ perf report -D
> >
> > 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0x81c29562 period: 1 addr: 0
> > ... FP chain: nr:10
> > .  0: ff80
> > .  1: 81c29562
> > .  2: 81c29933
> > .  3: 8111f688
> > .  4: 81120b9d
> > .  5: 81120ce5
> > .  6: 8100254a
> > .  7: 81e0007d
> > .  8: fe00
> > .  9: 7f9b6cd9682a
> > ... thread: sh:7236
> > .. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
> >
> > IOW, it seems to 'work'.
>
> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> some other bpf functions) is now broken, or, starting an unwind
> directly inside a bpf program will end up strangely.
It have following > kernel message: > > WARNING: kernel stack frame pointer at 70cad07c in > test_progs:1242 has bad value ffd4497e > > And in the debug message: > > [ 160.460287] 6e117175: aa23a0e8 > (get_perf_callchain+0x148/0x280) > [ 160.460287] 02e8715f: 00c6bba0 (0xc6bba0) > [ 160.460288] b3d07758: 9ce3f979 (0x9ce3f979) > [ 160.460289] 55c97836: 9ce3f979 (0x9ce3f979) > [ 160.460289] 7cbb140a: 0001007f (0x1007f) > [ 160.460290] 7dc62ac9: ... > [ 160.460290] 6b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000) > [ 160.460291] f23d1826: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460292] f9a16017: 007f (0x7f) > [ 160.460292] a8e62d44: ... > [ 160.460293] cbc83c97: b89d00d8d000 (0xb89d00d8d000) > [ 160.460293] 92842522: 007f (0x7f) > [ 160.460294] dafbec89: b89d00c6bc50 (0xb89d00c6bc50) > [ 160.460296] 0777e4cf: aa225d97 (bpf_get_stackid+0x77/0x470) > [ 160.460296] 9942ea16: ... > [ 160.460297] a08006b1: 0001 (0x1) > [ 160.460298] 9f03b438: b89d00c6bc30 (0xb89d00c6bc30) > [ 160.460299] 6caf8b32: aa074fe8 (__do_page_fault+0x58/0x90) > [ 160.460300] 3a13d702: ... > [ 160.460300] e2e2496d: 9ce3 (0x9ce3) > [ 160.460301] 8ee6b7c2: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460301] a8cf6d55: ... > [ 160.460302] 59078076: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460303] c6aac739: 9ce3f1e18eb0 (0x9ce3f1e18eb0) > [ 160.460303] a39aff92: b89d00c6bc60 (0xb89d00c6bc60) > [ 160.460305] 97498bf2: aa1f4791 > (bpf_get_stackid_tp+0x11/0x20) > [ 160.460306] 6992de1e: b89d00c6bc78 (0xb89d00c6bc78) > [ 160.460307] dacd0ce5: c0405676 (0xc0405676) > [ 160.460307] a81f2714: ... > > # Note here is the invalid frame pointer > [ 160.460308] 70cad07c: b89d00a1d000 (0xb89d00a1d000) > [ 160.460308] f8f194e4: 0001 (0x1) > [ 160.460309] 2134f42a: d89cffc3ae80 (0xd89cffc3ae80) > [ 160.460310] f9345889: 9ce3f1e18eb0 (0x9ce3f1e18eb0) > [ 160.460310] 8ad22a42: ... 
> [ 160.460311] 73808173: b89d00c6bce0 (0xb89d00c6bce0) > [ 160.460312] c9e4: aa1f48d1 (trace_call_bpf+0x81/0x100) > [ 160.460313] c5d8ebd1: b89d00c6bcc0 (0xb89d00c6bcc0) > [ 160.460315] bce0b072: ab651be0 > (event_sched_migrate_task+0xa0/0xa0) > [
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 17, 2019 at 4:11 PM Peter Zijlstra wrote:
>
> On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
> > On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote:
> > > Hi,
> > >
> > > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map
> > > (on bpf/master branch).
> > >
> > > After digging into the code, we found that perf_callchain_kernel() is
> > > giving empty callchain for tracepoint sched/sched_switch. And it seems
> > > related to commit
> > >
> > > d15d356887e770c5f2dcf963b52c7cb510c9e42d
> > > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> > >
> > > Before this commit, perf_callchain_kernel() returns callchain with
> > > regs->ip. With this commit, regs->ip is not sent for !perf_hw_regs(regs)
> > > case.
> >
> > So while I think the below is indeed right; we should store regs->ip
> > regardless of the unwind path chosen, I still think there's something
> > fishy if this results in just the 1 entry.
> >
> > The sched/sched_switch event really should have a non-trivial stack.
> >
> > Let me see if I can reproduce with just perf.
>
> $ perf record -g -e "sched:sched_switch" -- make clean
> $ perf report -D
>
> 12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0x81c29562 period: 1 addr: 0
> ... FP chain: nr:10
> .  0: ff80
> .  1: 81c29562
> .  2: 81c29933
> .  3: 8111f688
> .  4: 81120b9d
> .  5: 81120ce5
> .  6: 8100254a
> .  7: 81e0007d
> .  8: fe00
> .  9: 7f9b6cd9682a
> ... thread: sh:7236
> .. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux
>
> IOW, it seems to 'work'.
>

Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
some other bpf functions) is now broken, or, starting an unwind
directly inside a bpf program will end up strangely.
It have following kernel message: WARNING: kernel stack frame pointer at 70cad07c in test_progs:1242 has bad value ffd4497e And in the debug message: [ 160.460287] 6e117175: aa23a0e8 (get_perf_callchain+0x148/0x280) [ 160.460287] 02e8715f: 00c6bba0 (0xc6bba0) [ 160.460288] b3d07758: 9ce3f979 (0x9ce3f979) [ 160.460289] 55c97836: 9ce3f979 (0x9ce3f979) [ 160.460289] 7cbb140a: 0001007f (0x1007f) [ 160.460290] 7dc62ac9: ... [ 160.460290] 6b41e346: 1c7da01cd70c4000 (0x1c7da01cd70c4000) [ 160.460291] f23d1826: d89cffc3ae80 (0xd89cffc3ae80) [ 160.460292] f9a16017: 007f (0x7f) [ 160.460292] a8e62d44: ... [ 160.460293] cbc83c97: b89d00d8d000 (0xb89d00d8d000) [ 160.460293] 92842522: 007f (0x7f) [ 160.460294] dafbec89: b89d00c6bc50 (0xb89d00c6bc50) [ 160.460296] 0777e4cf: aa225d97 (bpf_get_stackid+0x77/0x470) [ 160.460296] 9942ea16: ... [ 160.460297] a08006b1: 0001 (0x1) [ 160.460298] 9f03b438: b89d00c6bc30 (0xb89d00c6bc30) [ 160.460299] 6caf8b32: aa074fe8 (__do_page_fault+0x58/0x90) [ 160.460300] 3a13d702: ... [ 160.460300] e2e2496d: 9ce3 (0x9ce3) [ 160.460301] 8ee6b7c2: d89cffc3ae80 (0xd89cffc3ae80) [ 160.460301] a8cf6d55: ... [ 160.460302] 59078076: d89cffc3ae80 (0xd89cffc3ae80) [ 160.460303] c6aac739: 9ce3f1e18eb0 (0x9ce3f1e18eb0) [ 160.460303] a39aff92: b89d00c6bc60 (0xb89d00c6bc60) [ 160.460305] 97498bf2: aa1f4791 (bpf_get_stackid_tp+0x11/0x20) [ 160.460306] 6992de1e: b89d00c6bc78 (0xb89d00c6bc78) [ 160.460307] dacd0ce5: c0405676 (0xc0405676) [ 160.460307] a81f2714: ... # Note here is the invalid frame pointer [ 160.460308] 70cad07c: b89d00a1d000 (0xb89d00a1d000) [ 160.460308] f8f194e4: 0001 (0x1) [ 160.460309] 2134f42a: d89cffc3ae80 (0xd89cffc3ae80) [ 160.460310] f9345889: 9ce3f1e18eb0 (0x9ce3f1e18eb0) [ 160.460310] 8ad22a42: ... 
[ 160.460311] 73808173: b89d00c6bce0 (0xb89d00c6bce0) [ 160.460312] c9e4: aa1f48d1 (trace_call_bpf+0x81/0x100) [ 160.460313] c5d8ebd1: b89d00c6bcc0 (0xb89d00c6bcc0) [ 160.460315] bce0b072: ab651be0 (event_sched_migrate_task+0xa0/0xa0) [ 160.460316] 355cf319: ... [ 160.460316] 3b67f2ad: d89cffc3ae80 (0xd89cffc3ae80) [ 160.460316] 9a77e20b: 9ce3fba25000 (0x9ce3fba25000) [ 160.460317] 32cf7376: 0001 (0x1) [ 160.460317]
Re: Getting empty callchain from perf_callchain_kernel()
On Fri, May 17, 2019 at 09:46:00AM +0200, Peter Zijlstra wrote:
> On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote:
> > Hi,
> >
> > We found a failure with selftests/bpf/tests_prog in test_stacktrace_map
> > (on bpf/master branch).
> >
> > After digging into the code, we found that perf_callchain_kernel() is
> > giving empty callchain for tracepoint sched/sched_switch. And it seems
> > related to commit
> >
> > d15d356887e770c5f2dcf963b52c7cb510c9e42d
> > ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
> >
> > Before this commit, perf_callchain_kernel() returns callchain with
> > regs->ip. With this commit, regs->ip is not sent for !perf_hw_regs(regs)
> > case.
>
> So while I think the below is indeed right; we should store regs->ip
> regardless of the unwind path chosen, I still think there's something
> fishy if this results in just the 1 entry.
>
> The sched/sched_switch event really should have a non-trivial stack.
>
> Let me see if I can reproduce with just perf.

$ perf record -g -e "sched:sched_switch" -- make clean
$ perf report -D

12 904071759467 0x1790 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 7236/7236: 0x81c29562 period: 1 addr: 0
... FP chain: nr:10
.  0: ff80
.  1: 81c29562
.  2: 81c29933
.  3: 8111f688
.  4: 81120b9d
.  5: 81120ce5
.  6: 8100254a
.  7: 81e0007d
.  8: fe00
.  9: 7f9b6cd9682a
... thread: sh:7236
.. dso: /lib/modules/5.1.0-12177-g41bbb9129767/build/vmlinux

IOW, it seems to 'work'.
Re: Getting empty callchain from perf_callchain_kernel()
On Thu, May 16, 2019 at 11:51:55PM +, Song Liu wrote:
> Hi,
>
> We found a failure with selftests/bpf/tests_prog in test_stacktrace_map
> (on bpf/master branch).
>
> After digging into the code, we found that perf_callchain_kernel() is
> giving empty callchain for tracepoint sched/sched_switch. And it seems
> related to commit
>
> d15d356887e770c5f2dcf963b52c7cb510c9e42d
> ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
>
> Before this commit, perf_callchain_kernel() returns callchain with
> regs->ip. With this commit, regs->ip is not sent for !perf_hw_regs(regs)
> case.

So while I think the below is indeed right; we should store regs->ip
regardless of the unwind path chosen, I still think there's something
fishy if this results in just the 1 entry.

The sched/sched_switch event really should have a non-trivial stack.

Let me see if I can reproduce with just perf.
Getting empty callchain from perf_callchain_kernel()
Hi,

We found a failure with selftests/bpf/tests_prog in test_stacktrace_map
(on bpf/master branch).

After digging into the code, we found that perf_callchain_kernel() is
giving an empty callchain for tracepoint sched/sched_switch. And it
seems related to commit

d15d356887e770c5f2dcf963b52c7cb510c9e42d
("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")

Before this commit, perf_callchain_kernel() returns a callchain with
regs->ip. With this commit, regs->ip is not sent for the
!perf_hw_regs(regs) case.

I found the following change fixes the selftest. But I am not very sure
it is the best solution here. Please share comments and suggestions on
this. Thanks in advance!

Song

diff --git i/arch/x86/events/core.c w/arch/x86/events/core.c
index f315425d8468..7b8a9eb4d5fd 100644
--- i/arch/x86/events/core.c
+++ w/arch/x86/events/core.c
@@ -2402,9 +2402,9 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 		return;
 	}

+	if (perf_callchain_store(entry, regs->ip))
+		return;
 	if (perf_hw_regs(regs)) {
-		if (perf_callchain_store(entry, regs->ip))
-			return;
 		unwind_start(&state, current, regs, NULL);
 	} else {
 		unwind_start(&state, current, NULL, (void *)regs->sp);