64: Use the per-CPU base as GSBASE at the paranoid_entry

Andy Lutomirski Wed, 01 May 2019 10:41:32 -0700

On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> wrote:
>
>
> > On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
> >
> >
> >
> >> On Apr 5, 2019, at 2:35 AM, Thomas Gleixner <[email protected]> wrote:
> >>
> >>> On Mon, 25 Mar 2019, Thomas Gleixner wrote:
> >>>> On Fri, 15 Mar 2019, Chang S. Bae wrote:
> >>>> ENTRY(paranoid_exit)
> >>>>   UNWIND_HINT_REGS
> >>>>   DISABLE_INTERRUPTS(CLBR_ANY)
> >>>>   TRACE_IRQS_OFF_DEBUG
> >>>> +    ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase",    "nop",\
> >>>> +        X86_FEATURE_FSGSBASE
> >>>> +    wrgsbase    %rbx
> >>>> +    jmp    .Lparanoid_exit_no_swapgs;
> >>>
> >>> Again. A few newlines would make it more readable.
> >>>
> >>> This modifies the semantics of paranoid_entry and paranoid_exit. Looking 
> >>> at
> >>> the usage sites there is the following code in the nmi maze:
> >>>
> >>>   /*
> >>>    * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
> >>>    * as we should not be calling schedule in NMI context.
> >>>    * Even with normal interrupts enabled. An NMI should not be
> >>>    * setting NEED_RESCHED or anything that normal interrupts and
> >>>    * exceptions might do.
> >>>    */
> >>>   call    paranoid_entry
> >>>   UNWIND_HINT_REGS
> >>>
> >>>   /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
> >>>   movq    %rsp, %rdi
> >>>   movq    $-1, %rsi
> >>>   call    do_nmi
> >>>
> >>>   /* Always restore stashed CR3 value (see paranoid_entry) */
> >>>   RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
> >>>
> >>>   testl    %ebx, %ebx            /* swapgs needed? */
> >>>   jnz    nmi_restore
> >>> nmi_swapgs:
> >>>   SWAPGS_UNSAFE_STACK
> >>> nmi_restore:
> >>>   POP_REGS
> >>>
> >>> I might be missing something, but how is that supposed to work when
> >>> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
> >>> there is a big fat comment missing explaining why.
> >>
> >> So this _is_ broken.
> >>
> >>  On entry:
> >>
> >>     rbx = rdgsbase()
> >>     wrgsbase(KERNEL_GS)
> >>
> >>  On exit:
> >>
> >>     if (ebx == 0)
> >>          swapgs
> >>
> >> The resulting matrix:
> >>
> >>  |  ENTRY GS    | RBX        | EXIT        | GS on IRET    | RESULT
> >>  |        |        |        |        |
> >> 1 |  KERNEL_GS    | KERNEL_GS    | EBX == 0    | USER_GS    | FAIL
> >>  |        |        |        |        |
> >> 2 |  KERNEL_GS    | KERNEL_GS    | EBX != 0    | KERNEL_GS    | ok
> >>  |        |        |        |        |
> >> 3 |  USER_GS    | USER_GS    | EBX == 0    | USER_GS    | ok
> >>  |        |        |        |        |
> >> 4 |  USER_GS    | USER_GS    | EBX != 0    | KERNEL_GS    | FAIL
> >>
> >>
> >> #1 Just works by chance because it's unlikely that the lower 32bits of a
> >>  per CPU kernel GS are all 0.
> >>
> >>  But it's just a question of probability that this turns into a
> >>  non-debuggable once per year crash (think KASLR).
> >>
> >> #4 This can happen when the NMI hits the kernel in some other entry code
> >>  _BEFORE_ or _AFTER_ swapgs.
> >>
> >>  User space using GS addressing with GS[31:0] != 0 will crash and burn.
> >>
> >>
> >
> > Hi all-
> >
> > In a previous incarnation of these patches, I complained about the use of 
> > SWAPGS in the paranoid path. Now I’m putting my maintainer foot down.  On a 
> > non-FSGSBASE system, the paranoid path known, definitively, which GS is 
> > where, so SWAPGS is annoying. With FSGSBASE, unless you start looking at 
> > the RIP that you interrupted, you cannot know whether you have user or 
> > kernel GSBASE live, since they can have literally the same value.  One of 
> > the numerous versions of this patch compared the values and just said 
> > “well, it’s harmless to SWAPGS if user code happens to use the same value 
> > as the kernel”.  I complained that it was far too fragile.
> >
> > So I’m putting my foot down. If you all want my ack, you’re going to save 
> > the old GS, load the new one with WRGSBASE, and, on return, you’re going to 
> > restore the old one with WRGSBASE. You will not use SWAPGS in the paranoid 
> > path.
> >
> > Obviously, for the non-paranoid path, it all keeps working exactly like it 
> > does now.
>
> Although I can see some other concerns with this, looks like it is still 
> worth pursuing.
>
> >
> > Furthermore, if you folks even want me to review this series, the ptrace 
> > tests need to be in place.  On inspection of the current code (after the 
> > debacle a few releases back), it appears the SETREGSET’s effect depends on 
> > the current values in the registers — it does not actually seem to reliably 
> > load the whole state. So my confidence will be greatly increased if your 
> > series first adds a test that detects that bug (and fails!), then fixes the 
> > bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
> >
>
> I think I need to understand the issue. Appreciate if you can elaborate a 
> little bit.
>


This patch series gives a particular behavior to PTRACE_SETREGS and
PTRACE_POKEUSER.  There should be a test case that validates that
behavior, including testing the weird cases where gs != 0 and gsbase
contains unusual values.  Some existing tests might be pretty close to
doing what's needed.

Beyond that, the current putreg() code does this:

    case offsetof(struct user_regs_struct,gs_base):
        /*
         * Exactly the same here as the %fs handling above.
         */
        if (value >= TASK_SIZE_MAX)
            return -EIO;
        if (child->thread.gsbase != value)
            return do_arch_prctl_64(child, ARCH_SET_GS, value);
        return 0;

and do_arch_prctl_64(), in turn, does this:

    case ARCH_SET_GS: {
        if (unlikely(arg2 >= TASK_SIZE_MAX))
            return -EPERM;

        preempt_disable();
        /*
         * ARCH_SET_GS has always overwritten the index
         * and the base. Zero is the most sensible value
         * to put in the index, and is the only value that
         * makes any sense if FSGSBASE is unavailable.
         */
        if (task == current) {
         [not used for ptrace]
        } else {
            task->thread.gsindex = 0;
            x86_gsbase_write_task(task, arg2);
        }

        ...

So writing the value that was already there to gsbase via putreg()
does nothing, but writing a *different* value implicitly clears gs,
but writing a different value will clear gs.

This behavior is, AFAICT, complete nonsense.  It happens to work
because usually gdb writes the same value back, and, in any case, gs
comes *after* gsbase in user_regs_struct, so gs gets replaced anyway.
But I think that this behavior should be fixed up and probably tested.
Certainly the behavior should *not* be the same on a fsgsbase kernel,
and and the fsgsbase behavior definitely needs a selftest.

Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry

Reply via email to