> On Jan 25, 2021, at 9:39 AM, Steven Rostedt <rost...@goodmis.org> wrote:
>
> On Mon, 25 Jan 2021 15:45:06 +0800
> Lai Jiangshan <jiangshan...@gmail.com> wrote:
>
>> From: Lai Jiangshan <la...@linux.alibaba.com>
>>
>> The commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
>> the NMI code by changing paravirt code into native code and left a comment
>> about "inspecting RIP instead". But until now, "inspecting RIP instead"
>> has not been made to happen, and this patch tries to complete it.
>>
>> Comments in the code are from Andy Lutomirski. Thanks!
>>
>> Signed-off-by: Lai Jiangshan <la...@linux.alibaba.com>
>> ---
>>  arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
>>  1 file changed, 11 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index cad08703c4ad..21f67ea62341 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
>>  	je	nested_nmi
>>
>>  	/*
>> -	 * Now test if the previous stack was an NMI stack. This covers
>> -	 * the case where we interrupt an outer NMI after it clears
>> -	 * "NMI executing" but before IRET. We need to be careful, though:
>> -	 * there is one case in which RSP could point to the NMI stack
>> -	 * despite there being no NMI active: naughty userspace controls
>> -	 * RSP at the very beginning of the SYSCALL targets. We can
>> -	 * pull a fast one on naughty userspace, though: we program
>> -	 * SYSCALL to mask DF, so userspace cannot cause DF to be set
>> -	 * if it controls the kernel's RSP. We set DF before we clear
>> -	 * "NMI executing".
>> +	 * Now test if we interrupted an outer NMI that just cleared "NMI
>> +	 * executing" and is about to IRET. This is a single-instruction
>> +	 * window. This check does not handle the case in which we get a
>> +	 * nested interrupt (#MC, #VE, #VC, etc.) after clearing
>> +	 * "NMI executing" but before the outer NMI executes IRET.
>>  	 */
>> -	lea	6*8(%rsp), %rdx
>> -	/* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
>> -	cmpq	%rdx, 4*8(%rsp)
>> -	/* If the stack pointer is above the NMI stack, this is a normal NMI */
>> -	ja	first_nmi
>> -
>> -	subq	$EXCEPTION_STKSZ, %rdx
>> -	cmpq	%rdx, 4*8(%rsp)
>> -	/* If it is below the NMI stack, it is a normal NMI */
>> -	jb	first_nmi
>> -
>> -	/* Ah, it is within the NMI stack. */
>> -
>> -	testb	$(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
>> -	jz	first_nmi	/* RSP was user controlled. */
>
> So we no longer check to see if the current stack is on the NMI stack.
> Makes sense, since this beginning of the NMI code cannot be interrupted,
> as there are no breakpoints or faults that can occur when that happens. The
> $nmi_executing is set in all other locations except for:
>
> repeat_nmi - end_repeat_nmi
> and for the iretq itself (.Lnmi_iret).
>
> Thus, by just checking the nmi_executing variable on the stack along with
> the RIP compared to repeat_nmi-end_repeat_nmi + .Lnmi_iret, we can safely
> tell if the NMI is nested or not.
>
> I was working on rewriting the beginning comments to reflect these updates,
> and discovered a possible bug that exists (unrelated to this patch).
>
>> +	cmpq	$.Lnmi_iret, 8(%rsp)
>> +	jne	first_nmi
>>
>
> On triggering an NMI from user space, I see the switch to the thread stack
> is done, and "exc_nmi" is called.
>
> The problem I see with this is that exc_nmi is called with the thread
> stack; if it were to take an exception, NMIs would be enabled, allowing for
> a nested NMI to run. From what I can tell, I don't see anything stopping
> that NMI from executing over the currently running NMI. That is, this means
> that NMI handlers are now re-entrant.
That was intentional on my part. The C code checks for this condition and handles it, just like it does on 32-bit kernels.