64: Use the per-CPU base as GSBASE at the paranoid_entry

Bae, Chang Seok Wed, 01 May 2019 11:02:28 -0700

> On May 1, 2019, at 10:40, Andy Lutomirski <[email protected]> wrote:
> 
> On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> 
> wrote:
>> 
>> 
>>> On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
>>> 
>>> 
>>> 
>>>> On Apr 5, 2019, at 2:35 AM, Thomas Gleixner <[email protected]> wrote:
>>>> 
>>>>> On Mon, 25 Mar 2019, Thomas Gleixner wrote:
>>>>>> On Fri, 15 Mar 2019, Chang S. Bae wrote:
>>>>>> ENTRY(paranoid_exit)
>>>>>>  UNWIND_HINT_REGS
>>>>>>  DISABLE_INTERRUPTS(CLBR_ANY)
>>>>>>  TRACE_IRQS_OFF_DEBUG
>>>>>> +    ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase",    "nop",\
>>>>>> +        X86_FEATURE_FSGSBASE
>>>>>> +    wrgsbase    %rbx
>>>>>> +    jmp    .Lparanoid_exit_no_swapgs;
>>>>> 
>>>>> Again. A few newlines would make it more readable.
>>>>> 
>>>>> This modifies the semantics of paranoid_entry and paranoid_exit. Looking 
>>>>> at
>>>>> the usage sites there is the following code in the nmi maze:
>>>>> 
>>>>>  /*
>>>>>   * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
>>>>>   * as we should not be calling schedule in NMI context.
>>>>>   * Even with normal interrupts enabled. An NMI should not be
>>>>>   * setting NEED_RESCHED or anything that normal interrupts and
>>>>>   * exceptions might do.
>>>>>   */
>>>>>  call    paranoid_entry
>>>>>  UNWIND_HINT_REGS
>>>>> 
>>>>>  /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
>>>>>  movq    %rsp, %rdi
>>>>>  movq    $-1, %rsi
>>>>>  call    do_nmi
>>>>> 
>>>>>  /* Always restore stashed CR3 value (see paranoid_entry) */
>>>>>  RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>>>>> 
>>>>>  testl    %ebx, %ebx            /* swapgs needed? */
>>>>>  jnz    nmi_restore
>>>>> nmi_swapgs:
>>>>>  SWAPGS_UNSAFE_STACK
>>>>> nmi_restore:
>>>>>  POP_REGS
>>>>> 
>>>>> I might be missing something, but how is that supposed to work when
>>>>> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
>>>>> there is a big fat comment missing explaining why.
>>>> 
>>>> So this _is_ broken.
>>>> 
>>>> On entry:
>>>> 
>>>>    rbx = rdgsbase()
>>>>    wrgsbase(KERNEL_GS)
>>>> 
>>>> On exit:
>>>> 
>>>>    if (ebx == 0)
>>>>         swapgs
>>>> 
>>>> The resulting matrix:
>>>> 
>>>> |  ENTRY GS    | RBX        | EXIT        | GS on IRET    | RESULT
>>>> |        |        |        |        |
>>>> 1 |  KERNEL_GS    | KERNEL_GS    | EBX == 0    | USER_GS    | FAIL
>>>> |        |        |        |        |
>>>> 2 |  KERNEL_GS    | KERNEL_GS    | EBX != 0    | KERNEL_GS    | ok
>>>> |        |        |        |        |
>>>> 3 |  USER_GS    | USER_GS    | EBX == 0    | USER_GS    | ok
>>>> |        |        |        |        |
>>>> 4 |  USER_GS    | USER_GS    | EBX != 0    | KERNEL_GS    | FAIL
>>>> 
>>>> 
>>>> #1 Just works by chance because it's unlikely that the lower 32bits of a
>>>> per CPU kernel GS are all 0.
>>>> 
>>>> But it's just a question of probability that this turns into a
>>>> non-debuggable once per year crash (think KASLR).
>>>> 
>>>> #4 This can happen when the NMI hits the kernel in some other entry code
>>>> _BEFORE_ or _AFTER_ swapgs.
>>>> 
>>>> User space using GS addressing with GS[31:0] != 0 will crash and burn.
>>>> 
>>>> 
>>> 
>>> Hi all-
>>> 
>>> In a previous incarnation of these patches, I complained about the use of 
>>> SWAPGS in the paranoid path. Now I’m putting my maintainer foot down.  On a 
>>> non-FSGSBASE system, the paranoid path known, definitively, which GS is 
>>> where, so SWAPGS is annoying. With FSGSBASE, unless you start looking at 
>>> the RIP that you interrupted, you cannot know whether you have user or 
>>> kernel GSBASE live, since they can have literally the same value.  One of 
>>> the numerous versions of this patch compared the values and just said 
>>> “well, it’s harmless to SWAPGS if user code happens to use the same value 
>>> as the kernel”.  I complained that it was far too fragile.
>>> 
>>> So I’m putting my foot down. If you all want my ack, you’re going to save 
>>> the old GS, load the new one with WRGSBASE, and, on return, you’re going to 
>>> restore the old one with WRGSBASE. You will not use SWAPGS in the paranoid 
>>> path.
>>> 
>>> Obviously, for the non-paranoid path, it all keeps working exactly like it 
>>> does now.
>> 
>> Although I can see some other concerns with this, looks like it is still 
>> worth pursuing.
>> 
>>> 
>>> Furthermore, if you folks even want me to review this series, the ptrace 
>>> tests need to be in place.  On inspection of the current code (after the 
>>> debacle a few releases back), it appears the SETREGSET’s effect depends on 
>>> the current values in the registers — it does not actually seem to reliably 
>>> load the whole state. So my confidence will be greatly increased if your 
>>> series first adds a test that detects that bug (and fails!), then fixes the 
>>> bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
>>> 
>> 
>> I think I need to understand the issue. Appreciate if you can elaborate a 
>> little bit.
>> 
> 
> This patch series gives a particular behavior to PTRACE_SETREGS and
> PTRACE_POKEUSER.  There should be a test case that validates that
> behavior, including testing the weird cases where gs != 0 and gsbase
> contains unusual values.  Some existing tests might be pretty close to
> doing what's needed.
> 
> Beyond that, the current putreg() code does this:
> 
>    case offsetof(struct user_regs_struct,gs_base):
>        /*
>         * Exactly the same here as the %fs handling above.
>         */
>        if (value >= TASK_SIZE_MAX)
>            return -EIO;
>        if (child->thread.gsbase != value)
>            return do_arch_prctl_64(child, ARCH_SET_GS, value);
>        return 0;
> 
> and do_arch_prctl_64(), in turn, does this:
> 
>    case ARCH_SET_GS: {
>        if (unlikely(arg2 >= TASK_SIZE_MAX))
>            return -EPERM;
> 
>        preempt_disable();
>        /*
>         * ARCH_SET_GS has always overwritten the index
>         * and the base. Zero is the most sensible value
>         * to put in the index, and is the only value that
>         * makes any sense if FSGSBASE is unavailable.
>         */
>        if (task == current) {
>         [not used for ptrace]
>        } else {
>            task->thread.gsindex = 0;
>            x86_gsbase_write_task(task, arg2);
>        }
> 
>        ...
> 
> So writing the value that was already there to gsbase via putreg()
> does nothing, but writing a *different* value implicitly clears gs,
> but writing a different value will clear gs.
> 
> This behavior is, AFAICT, complete nonsense.  It happens to work
> because usually gdb writes the same value back, and, in any case, gs
> comes *after* gsbase in user_regs_struct, so gs gets replaced anyway.
> But I think that this behavior should be fixed up and probably tested.
> Certainly the behavior should *not* be the same on a fsgsbase kernel,
> and and the fsgsbase behavior definitely needs a selftest.


Okay, got the point; now crystal clear.

I have my own test case for that though, need to find a very simple and
acceptable solution.

Thanks,
Chang

Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry

Reply via email to