On Mon, 25 Mar 2019, Thomas Gleixner wrote:
> On Fri, 15 Mar 2019, Chang S. Bae wrote:
> >  ENTRY(paranoid_exit)
> >     UNWIND_HINT_REGS
> >     DISABLE_INTERRUPTS(CLBR_ANY)
> >     TRACE_IRQS_OFF_DEBUG
> > +   ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase",  "nop",\
> > +           X86_FEATURE_FSGSBASE
> > +   wrgsbase        %rbx
> > +   jmp     .Lparanoid_exit_no_swapgs;
> 
> Again. A few newlines would make it more readable.
> 
> This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
> the usage sites there is the following code in the nmi maze:
> 
>       /*
>        * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
>        * as we should not be calling schedule in NMI context.
>        * Even with normal interrupts enabled. An NMI should not be
>        * setting NEED_RESCHED or anything that normal interrupts and
>        * exceptions might do.
>        */
>       call    paranoid_entry
>       UNWIND_HINT_REGS
> 
>       /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
>       movq    %rsp, %rdi
>       movq    $-1, %rsi
>       call    do_nmi
> 
>       /* Always restore stashed CR3 value (see paranoid_entry) */
>       RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
> 
>       testl   %ebx, %ebx                      /* swapgs needed? */
>       jnz     nmi_restore
> nmi_swapgs:
>       SWAPGS_UNSAFE_STACK
> nmi_restore:
>       POP_REGS
> 
> I might be missing something, but how is that supposed to work when
> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
> there is a big fat comment missing explaining why.

So this _is_ broken.

   On entry:

      rbx = rdgsbase()
      wrgsbase(KERNEL_GS)

   On exit:

      if (ebx == 0)
         swapgs

The resulting matrix:

   |  ENTRY GS  | RBX           | EXIT          | GS on IRET    | RESULT
   |            |               |               |               |
 1 |  KERNEL_GS | KERNEL_GS     | EBX == 0      | USER_GS       | FAIL
   |            |               |               |               |
 2 |  KERNEL_GS | KERNEL_GS     | EBX != 0      | KERNEL_GS     | ok
   |            |               |               |               |
 3 |  USER_GS   | USER_GS       | EBX == 0      | USER_GS       | ok
   |            |               |               |               |
 4 |  USER_GS   | USER_GS       | EBX != 0      | KERNEL_GS     | FAIL
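
The matrix can be reproduced with a small model. This is a minimal Python
sketch, not kernel code: the function names and the example base values are
hypothetical, and swapgs is simplified to "switch from the kernel base to the
user base", which is the same simplification the matrix makes.

```python
MASK32 = 0xFFFFFFFF


def gs_on_iret(entry_gs, kernel_gs, user_gs):
    """Model the entry/exit sequence and return the GS base left at IRET."""
    rbx = entry_gs              # entry: rbx = rdgsbase()
    gs = kernel_gs              # entry: wrgsbase(KERNEL_GS)
    if (rbx & MASK32) == 0:     # exit: testl %ebx, %ebx sees only the low 32 bits
        gs = user_gs            # exit: swapgs (simplified: kernel -> user)
    return gs


def result(entry_gs, kernel_gs, user_gs):
    """'ok' when the GS base at IRET equals the GS base at entry."""
    return "ok" if gs_on_iret(entry_gs, kernel_gs, user_gs) == entry_gs else "FAIL"


# Hypothetical base values, chosen only to hit each row of the matrix.
KGS_LOW_ZERO = 0xFFFF888100000000   # kernel base with low 32 bits == 0
KGS_LOW_SET = 0xFFFF88813FC00000    # kernel base with low 32 bits != 0
UGS_LOW_ZERO = 0x0000000000000000   # user base with low 32 bits == 0
UGS_LOW_SET = 0x00007F0000001000    # user base with low 32 bits != 0

ROWS = [
    (1, KGS_LOW_ZERO, KGS_LOW_ZERO, UGS_LOW_ZERO),
    (2, KGS_LOW_SET, KGS_LOW_SET, UGS_LOW_ZERO),
    (3, UGS_LOW_ZERO, KGS_LOW_SET, UGS_LOW_ZERO),
    (4, UGS_LOW_SET, KGS_LOW_SET, UGS_LOW_SET),
]

if __name__ == "__main__":
    for row, entry_gs, kernel_gs, user_gs in ROWS:
        print(row, hex(entry_gs), result(entry_gs, kernel_gs, user_gs))
```

The exit path never looks at which GS base was active on entry, only at
whether the low half of the saved value happens to be zero, so rows 1 and 4
come out wrong.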


#1 Just works by chance because it's unlikely that the lower 32 bits of a
   per-CPU kernel GS base are all 0.

   But it's just a question of probability until this turns into a
   non-debuggable once-per-year crash (think KASLR).
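
   To make the condition concrete: the exit check only tests EBX, so any
   base that is a multiple of 4 GiB looks like "swapgs needed". A minimal
   Python sketch; the helper name and both addresses are made up for
   illustration and are not taken from a real system.

```python
MASK32 = 0xFFFFFFFF


def exit_would_swapgs(rbx):
    """Mirror `testl %ebx, %ebx`: only the low 32 bits of RBX are examined."""
    return (rbx & MASK32) == 0


# Hypothetical per-CPU kernel GS base values.
ALIGNED_BASE = 0xFFFF888100000000   # multiple of 4 GiB, low 32 bits are zero
OTHER_BASE = 0xFFFF88813FC00000     # low 32 bits are non-zero

if __name__ == "__main__":
    for base in (ALIGNED_BASE, OTHER_BASE):
        print(hex(base), exit_would_swapgs(base))
```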

#4 This can happen when the NMI hits the kernel in some other entry code
   _BEFORE_ or _AFTER_ swapgs.

   User space using GS addressing with GS[31:0] != 0 will crash and burn.

   IIRC FSGSBASE is about fast user space GS switching with (almost) no
   limits on the value ...

Oh well.

        tglx
