On Fri, Jul 22, 2016 at 04:39:00PM -0700, Andy Lutomirski wrote: > On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf <jpoim...@redhat.com> wrote: > > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote: > >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf <jpoim...@redhat.com> > >> wrote: > >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote: > >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf <jpoim...@redhat.com> > >> >> wrote: > >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote: > >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf > >> >> >> <jpoim...@redhat.com> wrote: > >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote: > >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf > >> >> >> >> <jpoim...@redhat.com> wrote: > >> >> >> >> > Now that we can find pt_regs registers in the middle of the > >> >> >> >> > stack due to > >> >> >> >> > an interrupt or exception, we can print them. Here's what it > >> >> >> >> > looks > >> >> >> >> > like: > >> >> >> >> > > >> >> >> >> > ... > >> >> >> >> > [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0 > >> >> >> >> > [<ffffffff8189f558>] async_page_fault+0x28/0x30 > >> >> >> >> > RIP: 0010:[<ffffffff814529e2>] [<ffffffff814529e2>] > >> >> >> >> > __clear_user+0x42/0x70 > >> >> >> >> > RSP: 0018:ffff88007876fd38 EFLAGS: 00010202 > >> >> >> >> > RAX: 0000000000000000 RBX: 0000000000000138 RCX: > >> >> >> >> > 0000000000000138 > >> >> >> >> > RDX: 0000000000000000 RSI: 0000000000000008 RDI: > >> >> >> >> > 000000000061b640 > >> >> >> >> > RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: > >> >> >> >> > 0000000000000000 > >> >> >> >> > R10: 0000000000000001 R11: 0000000000000000 R12: > >> >> >> >> > 000000000061b640 > >> >> >> >> > R13: 0000000000000000 R14: ffff880078770000 R15: > >> >> >> >> > ffff880079947200 > >> >> >> >> > [<ffffffff814529e2>] ? __clear_user+0x42/0x70 > >> >> >> >> > [<ffffffff814529c3>] ? __clear_user+0x23/0x70 > >> >> >> >> > [<ffffffff81452a7b>] clear_user+0x2b/0x40 > >> >> >> >> > ... > >> >> >> >> > >> >> >> >> This looks wrong. Here are some theories: > >> >> >> >> > >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: > >> >> >> >> .... > >> >> >> >> Then it's found again as an unreliable address as "? > >> >> >> >> __clear_user+0x42/0x70" by scanning the stack. "? > >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the > >> >> >> >> stack. > >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been > >> >> >> >> suppressed because it matched a reliable address? > >> >> >> >> > >> >> >> >> (b) You actually intended for all the addresses to be printed, in > >> >> >> >> which case "? __clear_user+0x42/0x70" should have been > >> >> >> >> "__clear_user+0x42/0x70" and you have a bug. In this case, it's > >> >> >> >> plausible that your state machine got a bit lost leading to "? > >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact -- > >> >> >> >> it's a real frame and you didn't find it). > >> >> >> >> > >> >> >> >> (c) Something else and I'm confused. > >> >> >> > > >> >> >> > So there's a subtle difference between addresses reported by > >> >> >> > regs->ip > >> >> >> > and normal return addresses. For example: > >> >> >> > > >> >> >> > ... > >> >> >> > [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50 > >> >> >> > [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0 > >> >> >> > RIP: 0010:[<ffffffff8129b350>] [<ffffffff8129b350>] > >> >> >> > path_init+0x0/0x750 > >> >> >> > RSP: 0018:ffff880036a3fd80 EFLAGS: 00000296 > >> >> >> > RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08 > >> >> >> > RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0 > >> >> >> > RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010 > >> >> >> > R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0 > >> >> >> > R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000 > >> >> >> > <EOI> > >> >> >> > [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0 > >> >> >> > [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120 > >> >> >> > [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180 > >> >> >> > ... > >> >> >> > > >> >> >> > In this case the irq hit right after path_lookupat() called into > >> >> >> > path_init(). So the "path_init+0x0" printed by __show_regs() is > >> >> >> > right. > >> >> >> > > >> >> >> > Note the backtrace reports the same address, but it instead > >> >> >> > describes it > >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast(). That's > >> >> >> > because normally, such an address after a call instruction at the > >> >> >> > end of > >> >> >> > a function would indicate a tail call (e.g., to a noreturn > >> >> >> > function). > >> >> >> > If that were the case, printing "path_init+0x0" would be completely > >> >> >> > wrong, because path_init() just happens to be the function located > >> >> >> > immediately after lookup_fast(). > >> >> >> > > >> >> >> > Maybe I could add some special logic to say: "if this return > >> >> >> > address was > >> >> >> > from a call, use printk_stack_address(); else if it was from a > >> >> >> > pt_regs, > >> >> >> > use printk_address()." (The former prints the preceding function, > >> >> >> > the > >> >> >> > latter prints the current function.) Then we could remove the > >> >> >> > question > >> >> >> > mark. > >> >> >> > > >> >> >> > There's also the question of whether or not the address should be > >> >> >> > printed again, after it's already been printed by __show_regs(). I > >> >> >> > don't have a strong opinion either way. > >> >> >> > > >> >> >> > >> >> >> IIRC we don't show the actual faulting function in the call trace, so > >> >> >> we probably shouldn't duplicate the entry after the show_regs. > >> >> > > >> >> > Just to clarify, that's true today for cases where the stack dump > >> >> > starts > >> >> > from a handler which has regs. It starts dumping based on regs->ip > >> >> > and > >> >> > regs->bp, so the regs themselves aren't dumped. > >> >> > > >> >> > But for cases where regs are in the middle of the stack, they aren't > >> >> > detected today, and you'll still see the value of regs->ip dumped > >> >> > with a > >> >> > question mark. > >> >> > > >> >> > That said, with this patch, now that regs in the middle of the stack > >> >> > *are* being printed, I can't think of a good reason to print the > >> >> > return > >> >> > address twice: both in regs and the stack trace. So removing it from > >> >> > the stack trace is fine with me. > >> >> > > >> >> >> That being said, I'm still confused by the question marks. What > >> >> >> exactly is going on? Is the code really doing the right thing wrt > >> >> >> resuming the unwind? Is there a git tree with these patches applied > >> >> >> somewhere so I can look at it easily in context? > >> >> > > >> >> > show_trace_log_lvl() is doing two things in parallel: scanning all > >> >> > kernel text addresses on the stack while simultaneously using the > >> >> > unwinder to walk the frame pointers. Only those scanned addresses > >> >> > which > >> >> > are also found by the unwinder are printed without question marks. > >> >> > > >> >> > The pt_regs aren't in a frame of their own; they're just data inside > >> >> > of > >> >> > a bigger frame. (You may recall that you objected to my proposal to > >> >> > put > >> >> > them in their own frame :-)) So that's why the address stored in > >> >> > regs->ip was printed with a question mark: it's not in the header of a > >> >> > real frame; it's just data. > >> >> > >> >> It wasn't the separate frame part I was objecting to -- it was their > >> >> encoding on the stack. Maybe they should unwind as though they're a > >> >> separate frame. For example, the unwind API could give the frame that > >> >> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame > >> >> could literally list regs->ip as its retaddr (and maybe that frame or > >> >> even the following one should be the one with non-NULL > >> >> unwind_get_entry_regs). > >> > > >> > Having the unwinder treat the pt_regs as a "fake" frame is problematic: > >> > > >> > - As I described above, you can't treat regs->ip as a normal return > >> > value anyway. > >> > > >> > - Also, for exceptions and nested interrupts, the regs are stored on the > >> > interrupting stack. But for non-nested interrupts, they're stored on > >> > the thread stack. So the regs aren't always on the same stack as the > >> > corresponding encoded pt_regs pointer. Another issue is that there's > >> > not always a frame after the regs. For those reasons, creating a > >> > "fake" frame abstraction in the state machine is quite a bit trickier > >> > than just dealing with those details in the only place that cares > >> > about them: show_trace_log_lvl(). > >> > > >> >> In some sense, the regs belong to the frame that got interrupted, not > >> >> the frame that did the interrupting. But maybe that's backwards -- if > >> >> we have DWARF, then the regs correspond to the regs at the time of a > >> >> call, and those regs are reasonably likely to contain the arguments to > >> >> the called function. > >> >> > >> >> But regardless of which way this goes, it seems quite awkward to me > >> >> that regs->ip never shows up as the return addr of any frame as > >> >> exposed by the unwind API. > >> > > >> > Again, regs->ip is special. It's not a call return address and we > >> > shouldn't force it to be. > >> > >> This is only mostly true. If the exception was a trap, then it is > >> (e.g. if a function ends in int3, then regs->ip will be off the end). > >> But that's just me being pedantic. > >> > >> More relevantly, regs->ip is a reliable address indicating a function > >> that will be returned to if we ever return, and both > >> show_trace_log_lvl() and the livepatch stuff should interpret it as > >> such. > > > > Actually livepatch doesn't care; once it sees that there are regs, it > > will bail because the stack is unreliable. > > Would it be better for livepatch not to bail some day?
Not until we have a DWARF unwinder. > >> Whether this means the unwinder should change or > >> show_trace_log_lvl() should change isn't a big deal, but I think one > >> of them should change so we get this right. > > > > I have no problem doing so, but can you clarify what you mean? Earlier > > you said: > > > > "IIRC we don't show the actual faulting function in the call trace, so > > we probably shouldn't duplicate the entry after the show_regs." > > > > Maybe I'm misunderstanding, but that seems to contradict what you're > > saying now. So which is it? Do you want the RIP address printed twice > > (both in the regs printout and in the stack trace)? Or not? > > I don't have a stong preference as to how many times it's printed. > But I think we need to get rid of the question mark. I think that > means there are two options: > > a) Teach show_stack_log_lvl() that regs->ip is a "reliable" entry and > print it again. That will get confused if it's the first instruction > in a function, so maybe it's not so great. I proposed a fix for this above, so that it would print regs->ip one way and a normal return address another way, to avoid the confusion: > >> >> >> > Maybe I could add some special logic to say: "if this return > >> >> >> > address was > >> >> >> > from a call, use printk_stack_address(); else if it was from a > >> >> >> > pt_regs, > >> >> >> > use printk_address()." (The former prints the preceding function, > >> >> >> > the > >> >> >> > latter prints the current function.) Then we could remove the > >> >> >> > question > >> >> >> > mark. > b) Teach show_stack_log_lvl() that regs->ip is a thing that we just > printed (via show_regs) and skip the ? entry. > > Option b probably makes more sense. I think I'm starting to > understand all this, but maybe I'm still missing something. I also think b is a good option. I'll do it for v2 unless others disagree. -- Josh