Re: [PATCH v2 09/10] x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
On Sat, Mar 05, 2016 at 09:52:22PM -0800, Andy Lutomirski wrote: > Right after SYSENTER, we can get a #DB or NMI. On x86_32, there's no IST, > so the exception handler is invoked on the temporary SYSENTER stack. > > Because the SYSENTER stack is very small, we have a fixup to switch > off the stack quickly when this happens. The old fixup had several issues: > > 1. It checked the interrupt frame's CS and EIP. This wasn't >obviously correct on Xen or if vm86 mode was in use [1]. > > 2. In the NMI handler, it did some frightening digging into the >stack frame. I'm not convinced this digging was correct. > > 3. The fixup didn't switch stacks and then switch back. Instead, it >synthesized a brand new stack frame that would redirect the IRET >back to the SYSENTER code. That frame was highly questionable. >For one thing, if NMI nested inside #DB, we would effectively >abort the #DB prologue, which was probably safe but was >frightening. For another, the code used PUSHFL to write the >FLAGS portion of the frame, which was simply bogus -- by the time >PUSHFL was called, at least TF, NT, VM, and all of the arithmetic >flags were clobbered. > > Simplify this considerably. Instead of looking at the saved frame > to see where we came from, check the hardware ESP register against > the SYSENTER stack directly. Malicious user code cannot spoof the > kernel ESP register, and by moving the check after SAVE_ALL, we can > use normal PER_CPU accesses to find all the relevant addresses. > > With this patch applied, the improved syscall_nt_32 test finally > passes on 32-bit kernels. > > [1] It isn't obviously correct, but it is nonetheless safe from vm86 > shenanigans as far as I can tell. A user can't point EIP at > entry_SYSENTER_32 while in vm86 mode because entry_SYSENTER_32, > like all kernel addresses, is greater than 0x and would thus > violate the CS segment limit. > > Signed-off-by: Andy Lutomirski> --- > arch/x86/entry/entry_32.S| 114 > ++- > arch/x86/kernel/asm-offsets_32.c | 5 ++ > 2 files changed, 56 insertions(+), 63 deletions(-) > > diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S > index 7700cf695d8c..ad9a85e62128 100644 > --- a/arch/x86/entry/entry_32.S > +++ b/arch/x86/entry/entry_32.S > @@ -987,51 +987,48 @@ error_code: > jmp ret_from_exception > END(page_fault) > > -/* > - * Debug traps and NMI can happen at the one SYSENTER instruction > - * that sets up the real kernel stack. Check here, since we can't > - * allow the wrong stack to be used. > - * > - * "TSS_sysenter_sp0+12" is because the NMI/debug handler will have > - * already pushed 3 words if it hits on the sysenter instruction: > - * eflags, cs and eip. > - * > - * We just load the right stack, and push the three (known) values > - * by hand onto the new stack - while updating the return eip past > - * the instruction that would have done it for sysenter. > - */ > -.macro FIX_STACK offset ok label > - cmpw$__KERNEL_CS, 4(%esp) > - jne \ok > -\label: > - movlTSS_sysenter_sp0 + \offset(%esp), %esp > - pushfl > - pushl $__KERNEL_CS > - pushl $sysenter_past_esp > -.endm > - > ENTRY(debug) > + /* > + * #DB can happen at the first instruction of > + * entry_SYSENTER_32 or in Xen's SYSENTER prologue. If this > + * happens, then we will be running on a very small stack. We > + * need to detect this condition and switch to the thread > + * stack before calling any C code at all. > + * > + * If you edit this code, keep in mind that NMIs can happen in here. > + */ Btw, I think it is a bit more readable if this comment is over ENTRY(debug) like the one over ENTRY(nmi) below. > ASM_CLAC > - cmpl$entry_SYSENTER_32, (%esp) > - jne debug_stack_correct > - FIX_STACK 12, debug_stack_correct, debug_esp_fix_insn > -debug_stack_correct: > pushl $-1 # mark this as an int > SAVE_ALL > - TRACE_IRQS_OFF > xorl%edx, %edx # error code 0 > movl%esp, %eax # pt_regs pointer > + > + /* Are we currently on the SYSENTER stack? */ > + PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx) > + subl%eax, %ecx /* ecx = (end of SYENTER_stack) - esp */ ^^^ Just a typo for the committer of this patch to fix: SYSENTER_stack > + cmpl$SIZEOF_SYSENTER_stack, %ecx > + jb .Ldebug_from_sysenter_stack > + > + TRACE_IRQS_OFF > + calldo_debug > + jmp ret_from_exception > + > +.Ldebug_from_sysenter_stack: > + /* We're on the SYSENTER stack. Switch off. */ > + movl%esp, %ebp > + movlPER_CPU_VAR(cpu_current_top_of_stack), %esp > +
Re: [PATCH v2 09/10] x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
On Sat, Mar 05, 2016 at 09:52:22PM -0800, Andy Lutomirski wrote: > Right after SYSENTER, we can get a #DB or NMI. On x86_32, there's no IST, > so the exception handler is invoked on the temporary SYSENTER stack. > > Because the SYSENTER stack is very small, we have a fixup to switch > off the stack quickly when this happens. The old fixup had several issues: > > 1. It checked the interrupt frame's CS and EIP. This wasn't >obviously correct on Xen or if vm86 mode was in use [1]. > > 2. In the NMI handler, it did some frightening digging into the >stack frame. I'm not convinced this digging was correct. > > 3. The fixup didn't switch stacks and then switch back. Instead, it >synthesized a brand new stack frame that would redirect the IRET >back to the SYSENTER code. That frame was highly questionable. >For one thing, if NMI nested inside #DB, we would effectively >abort the #DB prologue, which was probably safe but was >frightening. For another, the code used PUSHFL to write the >FLAGS portion of the frame, which was simply bogus -- by the time >PUSHFL was called, at least TF, NT, VM, and all of the arithmetic >flags were clobbered. > > Simplify this considerably. Instead of looking at the saved frame > to see where we came from, check the hardware ESP register against > the SYSENTER stack directly. Malicious user code cannot spoof the > kernel ESP register, and by moving the check after SAVE_ALL, we can > use normal PER_CPU accesses to find all the relevant addresses. > > With this patch applied, the improved syscall_nt_32 test finally > passes on 32-bit kernels. > > [1] It isn't obviously correct, but it is nonetheless safe from vm86 > shenanigans as far as I can tell. A user can't point EIP at > entry_SYSENTER_32 while in vm86 mode because entry_SYSENTER_32, > like all kernel addresses, is greater than 0x and would thus > violate the CS segment limit. > > Signed-off-by: Andy Lutomirski > --- > arch/x86/entry/entry_32.S| 114 > ++- > arch/x86/kernel/asm-offsets_32.c | 5 ++ > 2 files changed, 56 insertions(+), 63 deletions(-) > > diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S > index 7700cf695d8c..ad9a85e62128 100644 > --- a/arch/x86/entry/entry_32.S > +++ b/arch/x86/entry/entry_32.S > @@ -987,51 +987,48 @@ error_code: > jmp ret_from_exception > END(page_fault) > > -/* > - * Debug traps and NMI can happen at the one SYSENTER instruction > - * that sets up the real kernel stack. Check here, since we can't > - * allow the wrong stack to be used. > - * > - * "TSS_sysenter_sp0+12" is because the NMI/debug handler will have > - * already pushed 3 words if it hits on the sysenter instruction: > - * eflags, cs and eip. > - * > - * We just load the right stack, and push the three (known) values > - * by hand onto the new stack - while updating the return eip past > - * the instruction that would have done it for sysenter. > - */ > -.macro FIX_STACK offset ok label > - cmpw$__KERNEL_CS, 4(%esp) > - jne \ok > -\label: > - movlTSS_sysenter_sp0 + \offset(%esp), %esp > - pushfl > - pushl $__KERNEL_CS > - pushl $sysenter_past_esp > -.endm > - > ENTRY(debug) > + /* > + * #DB can happen at the first instruction of > + * entry_SYSENTER_32 or in Xen's SYSENTER prologue. If this > + * happens, then we will be running on a very small stack. We > + * need to detect this condition and switch to the thread > + * stack before calling any C code at all. > + * > + * If you edit this code, keep in mind that NMIs can happen in here. > + */ Btw, I think it is a bit more readable if this comment is over ENTRY(debug) like the one over ENTRY(nmi) below. > ASM_CLAC > - cmpl$entry_SYSENTER_32, (%esp) > - jne debug_stack_correct > - FIX_STACK 12, debug_stack_correct, debug_esp_fix_insn > -debug_stack_correct: > pushl $-1 # mark this as an int > SAVE_ALL > - TRACE_IRQS_OFF > xorl%edx, %edx # error code 0 > movl%esp, %eax # pt_regs pointer > + > + /* Are we currently on the SYSENTER stack? */ > + PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx) > + subl%eax, %ecx /* ecx = (end of SYENTER_stack) - esp */ ^^^ Just a typo for the committer of this patch to fix: SYSENTER_stack > + cmpl$SIZEOF_SYSENTER_stack, %ecx > + jb .Ldebug_from_sysenter_stack > + > + TRACE_IRQS_OFF > + calldo_debug > + jmp ret_from_exception > + > +.Ldebug_from_sysenter_stack: > + /* We're on the SYSENTER stack. Switch off. */ > + movl%esp, %ebp > + movlPER_CPU_VAR(cpu_current_top_of_stack), %esp > + TRACE_IRQS_OFF >