On Tue, Jul 26, 2016 at 9:26 AM, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> On Mon, Jul 25, 2016 at 05:09:44PM -0700, Andy Lutomirski wrote:
>> On Sat, Jul 23, 2016 at 7:04 AM, Josh Poimboeuf <jpoim...@redhat.com> wrote:
>> >> > Unless I'm missing something, I think it should be fine for nested NMIs,
>> >> > since they're all on the same stack. I can try to test it. What in
>> >> > particular are you worried about?
>> >> >
>> >>
>> >> The top of the NMI frame contains no less than *three* (SS, SP, FLAGS,
>> >> CS, IP) records. Off the top of my head, the record that matters is
>> >> the third one, so it should be regs[-15]. If an MCE hits while this
>> >> mess is being set up, good luck unwinding *that*. If you really want
>> >> to know, take a deep breath, read the long comment in entry_64.S after
>> >> .Lnmi_from_kernel, then give up on x86 and start hacking on some other
>> >> architecture.
>> >
>> > I did read that comment. Fortunately there's a big difference between
>> > reading and understanding so I can go on being an ignorant x86 hacker!
>> >
>> > For nested NMIs, it does look like the stack of the exception which
>> > interrupted the first NMI would get skipped by the stack dump. (But
>> > that's a general problem, not specific to my patch set.)
>>
>> If we end up with task -> IST -> NMI -> same IST, we're doomed and
>> we're going to crash, so it doesn't matter much whether the unwinder
>> works. Is that what you mean?
>
> I read the NMI entry code again, and now I realize my comment was
> completely misinformed, so never mind.
>
> Is "IST -> NMI -> same IST" even possible, since the other IST's are
> higher priority than NMI?
Priority only matters for events that happen concurrently. Synchronous
things like #DB will always fire if the conditions that trigger them
are hit.

>
>> > Am I correct in understanding that there can only be one level of NMI
>> > nesting at any given time? If so, could we make it easier on the
>> > unwinder by putting the nested NMI on a separate software stack, so the
>> > "next stack" pointers are always in the same place? Or am I just being
>> > naive?
>>
>> I think you're being naive :)
>>
>> But we don't really need the unwinder to be 100% faithful here. If we have:
>>
>> task stack
>> NMI
>> nested NMI
>>
>> then the nested NMI code won't call into C and thus it should be
>> impossible to ever invoke your unwinder on that state. Instead the
>> nested NMI code will fiddle with the saved regs and return in such a
>> way that the outer NMI will be forced to loop through again. So it
>> *should* (assuming I'm remembering all this crap correctly) be
>> sufficient to find the "outermost" pt_regs, which is sitting at
>> (struct pt_regs *)(end - 12) - 1 or thereabouts, and look at its ->sp
>> value. This ought to be the same thing that the frame-based unwinder
>> would naturally try to do. But if you make this change, ISTM you
>> should make it separately because it does change behavior and Linus is
>> understandably leery about that.
>
> Ok, I think that makes sense to me now. As I understand it, the
> "outermost" RIP is the authoritative one, because it was written by the
> original NMI. Any nested NMIs will update the original and/or iret
> RIPs, which will only ever point to NMI entry code, and so they should
> be ignored.
>
> But I think there's a case where this wouldn't work:
>
> task stack
> NMI
> IST
> stack dump
>
> If the IST interrupt hits before the NMI has a chance to update the
> outermost regs, the authoritative RIP would be the original one written
> by HW, right?

This should be impossible unless that last entry is MCE.
If we actually fire an event that isn't MCE early in NMI entry,
something already went very wrong. For NMI -> MCE -> stack dump, the
frame-based unwinder will do better than get_stack_info() unless
get_stack_info() learns to use the top-of-stack hardware copy if the
actual RSP it finds is too high (above the "outermost" frame).

>
>> Hmm. I wonder if it would make sense to decode this thing both ways
>> and display it. So show_trace_log_lvl() could print something like:
>>
>> <#DB (0xffffwhatever000-0xffffwhateverfff), next frame is at 0xffffsomething>
>>
>> and, in the case where the frame unwinder disagrees, it'll at least be
>> visible in that 0xffffsomething won't be between 0xffffwhatever000 and
>> 0xffffwhateverfff.
>>
>> Then Linus is happy because the unwinder works just like it did before
>> but people like me are also happy because it's clearer what's going
>> on. FWIW, I've personally debugged crashes in the NMI code where the
>> current unwinder falls apart completely and it's not fun -- having a
>> display indicating that the unwinder got confused would be nice.
>
> Hm, maybe. Another idea would be to have the unwinder print some kind
> of warning if it goes off the rails. It should be able to detect that,
> because every stack trace should end at a user pt_regs.

I like that. Be careful, though: kernel threads might not have a "user"
pt_regs in the "user_mode() returns true" sense. Checking that the
trace ends either at a user_mode() pt_regs or at task_pt_regs() might
be a good condition.

--Andy