> On Jan 25, 2021, at 9:39 AM, Steven Rostedt <rost...@goodmis.org> wrote:
> 
> On Mon, 25 Jan 2021 15:45:06 +0800
> Lai Jiangshan <jiangshan...@gmail.com> wrote:
> 
>> From: Lai Jiangshan <la...@linux.alibaba.com>
>> 
>> The commit 929bacec21478("x86/entry/64: De-Xen-ify our NMI code") simplified
>> the NMI code by changing paravirt code into native code and left a comment
>> about "inspecting RIP instead".  But until now, "inspecting RIP instead"
>> has not been made happened and this patch tries to complete it.
>> 
>> Comments in the code was from Andy Lutomirski.  Thanks!
>> 
>> Signed-off-by: Lai Jiangshan <la...@linux.alibaba.com>
>> ---
>> arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
>> 1 file changed, 11 insertions(+), 33 deletions(-)
>> 
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index cad08703c4ad..21f67ea62341 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
>>    je    nested_nmi
>> 
>>    /*
>> -     * Now test if the previous stack was an NMI stack.  This covers
>> -     * the case where we interrupt an outer NMI after it clears
>> -     * "NMI executing" but before IRET.  We need to be careful, though:
>> -     * there is one case in which RSP could point to the NMI stack
>> -     * despite there being no NMI active: naughty userspace controls
>> -     * RSP at the very beginning of the SYSCALL targets.  We can
>> -     * pull a fast one on naughty userspace, though: we program
>> -     * SYSCALL to mask DF, so userspace cannot cause DF to be set
>> -     * if it controls the kernel's RSP.  We set DF before we clear
>> -     * "NMI executing".
>> +     * Now test if we interrupted an outer NMI that just cleared "NMI
>> +     * executing" and is about to IRET.  This is a single-instruction
>> +     * window.  This check does not handle the case in which we get a
>> +     * nested interrupt (#MC, #VE, #VC, etc.) after clearing
>> +     * "NMI executing" but before the outer NMI executes IRET.
>>     */
>> -    lea    6*8(%rsp), %rdx
>> -    /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) 
>> */
>> -    cmpq    %rdx, 4*8(%rsp)
>> -    /* If the stack pointer is above the NMI stack, this is a normal NMI */
>> -    ja    first_nmi
>> -
>> -    subq    $EXCEPTION_STKSZ, %rdx
>> -    cmpq    %rdx, 4*8(%rsp)
>> -    /* If it is below the NMI stack, it is a normal NMI */
>> -    jb    first_nmi
>> -
>> -    /* Ah, it is within the NMI stack. */
>> -
>> -    testb    $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
>> -    jz    first_nmi    /* RSP was user controlled. */
> 
> So we no longer check to see if the current stack is on the NMI stack.
> Makes sense, since this beginning of the NMI code can not be interrupted,
> as there's no breakpoints or faults that can occur when that happens. The
> $nmi_executing is set in all other locations except for:
> 
>  repeat_nmi - end_repeat_nmi
>  and for the iretq itself (.Lnmi_iret).
> 
> Thus, by just checking the nmi_executing variable on the stack along with
> the RIP compared to repeat_nim-end_repeat_nmi + .Lnmi_iret, we can safely
> tell if the NMI is nested or not.
> 
> I was working on rewriting the beginning comments to reflect these updates,
> and discovered a possible bug that exists (unrelated to this patch).
> 
>> +    cmpq    $.Lnmi_iret, 8(%rsp)
>> +    jne    first_nmi
>> 
> 
> On triggering an NMI from user space, I see the switch to the thread stack
> is done, and "exc_nmi" is called.
> 
> The problem I see with this is that exc_nmi is called with the thread
> stack, if it were to take an exception, NMIs would be enabled allowing for
> a nested NMI to run. From what I can tell, I don't see anything stopping
> that NMI from executing over the currently running NMI. That is, this means
> that NMI handlers are now re-entrant.

That was intentional in my part. The C code checks for this condition and 
handles it, just like it does on 32-bit kernels.

Reply via email to