On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> wrote:
>
>
> > On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote:
> >
> >
> >
> >> On Apr 5, 2019, at 2:35 AM, Thomas Gleixner <[email protected]> wrote:
> >>
> >>> On Mon, 25 Mar 2019, Thomas Gleixner wrote:
> >>>> On Fri, 15 Mar 2019, Chang S. Bae wrote:
> >>>> ENTRY(paranoid_exit)
> >>>> UNWIND_HINT_REGS
> >>>> DISABLE_INTERRUPTS(CLBR_ANY)
> >>>> TRACE_IRQS_OFF_DEBUG
> >>>> + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
> >>>> + X86_FEATURE_FSGSBASE
> >>>> + wrgsbase %rbx
> >>>> + jmp .Lparanoid_exit_no_swapgs;
> >>>
> >>> Again. A few newlines would make it more readable.
> >>>
> >>> This modifies the semantics of paranoid_entry and paranoid_exit. Looking
> >>> at
> >>> the usage sites there is the following code in the nmi maze:
> >>>
> >>> /*
> >>> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
> >>> * as we should not be calling schedule in NMI context.
> >>> * Even with normal interrupts enabled. An NMI should not be
> >>> * setting NEED_RESCHED or anything that normal interrupts and
> >>> * exceptions might do.
> >>> */
> >>> call paranoid_entry
> >>> UNWIND_HINT_REGS
> >>>
> >>> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
> >>> movq %rsp, %rdi
> >>> movq $-1, %rsi
> >>> call do_nmi
> >>>
> >>> /* Always restore stashed CR3 value (see paranoid_entry) */
> >>> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
> >>>
> >>> testl %ebx, %ebx /* swapgs needed? */
> >>> jnz nmi_restore
> >>> nmi_swapgs:
> >>> SWAPGS_UNSAFE_STACK
> >>> nmi_restore:
> >>> POP_REGS
> >>>
> >>> I might be missing something, but how is that supposed to work when
> >>> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
> >>> there is a big fat comment missing explaining why.
> >>
> >> So this _is_ broken.
> >>
> >> On entry:
> >>
> >> rbx = rdgsbase()
> >> wrgsbase(KERNEL_GS)
> >>
> >> On exit:
> >>
> >> if (ebx == 0)
> >> swapgs
> >>
> >> The resulting matrix:
> >>
> >> | ENTRY GS | RBX | EXIT | GS on IRET | RESULT
> >> | | | | |
> >> 1 | KERNEL_GS | KERNEL_GS | EBX == 0 | USER_GS | FAIL
> >> | | | | |
> >> 2 | KERNEL_GS | KERNEL_GS | EBX != 0 | KERNEL_GS | ok
> >> | | | | |
> >> 3 | USER_GS | USER_GS | EBX == 0 | USER_GS | ok
> >> | | | | |
> >> 4 | USER_GS | USER_GS | EBX != 0 | KERNEL_GS | FAIL
> >>
> >>
> >> #1 Just works by chance because it's unlikely that the lower 32bits of a
> >> per CPU kernel GS are all 0.
> >>
> >> But it's just a question of probability that this turns into a
> >> non-debuggable once per year crash (think KASLR).
> >>
> >> #4 This can happen when the NMI hits the kernel in some other entry code
> >> _BEFORE_ or _AFTER_ swapgs.
> >>
> >> User space using GS addressing with GS[31:0] != 0 will crash and burn.
> >>
> >>
> >
> > Hi all-
> >
> > In a previous incarnation of these patches, I complained about the use of
> > SWAPGS in the paranoid path. Now I’m putting my maintainer foot down. On a
> > non-FSGSBASE system, the paranoid path known, definitively, which GS is
> > where, so SWAPGS is annoying. With FSGSBASE, unless you start looking at
> > the RIP that you interrupted, you cannot know whether you have user or
> > kernel GSBASE live, since they can have literally the same value. One of
> > the numerous versions of this patch compared the values and just said
> > “well, it’s harmless to SWAPGS if user code happens to use the same value
> > as the kernel”. I complained that it was far too fragile.
> >
> > So I’m putting my foot down. If you all want my ack, you’re going to save
> > the old GS, load the new one with WRGSBASE, and, on return, you’re going to
> > restore the old one with WRGSBASE. You will not use SWAPGS in the paranoid
> > path.
> >
> > Obviously, for the non-paranoid path, it all keeps working exactly like it
> > does now.
>
> Although I can see some other concerns with this, looks like it is still
> worth pursuing.
>
> >
> > Furthermore, if you folks even want me to review this series, the ptrace
> > tests need to be in place. On inspection of the current code (after the
> > debacle a few releases back), it appears the SETREGSET’s effect depends on
> > the current values in the registers — it does not actually seem to reliably
> > load the whole state. So my confidence will be greatly increased if your
> > series first adds a test that detects that bug (and fails!), then fixes the
> > bug in a tiny little patch, then adds FSGSBASE, and keeps the test working.
> >
>
> I think I need to understand the issue. Appreciate if you can elaborate a
> little bit.
>
This patch series gives a particular behavior to PTRACE_SETREGS and
PTRACE_POKEUSER. There should be a test case that validates that
behavior, including testing the weird cases where gs != 0 and gsbase
contains unusual values. Some existing tests might be pretty close to
doing what's needed.
Beyond that, the current putreg() code does this:
case offsetof(struct user_regs_struct,gs_base):
/*
* Exactly the same here as the %fs handling above.
*/
if (value >= TASK_SIZE_MAX)
return -EIO;
if (child->thread.gsbase != value)
return do_arch_prctl_64(child, ARCH_SET_GS, value);
return 0;
and do_arch_prctl_64(), in turn, does this:
case ARCH_SET_GS: {
if (unlikely(arg2 >= TASK_SIZE_MAX))
return -EPERM;
preempt_disable();
/*
* ARCH_SET_GS has always overwritten the index
* and the base. Zero is the most sensible value
* to put in the index, and is the only value that
* makes any sense if FSGSBASE is unavailable.
*/
if (task == current) {
[not used for ptrace]
} else {
task->thread.gsindex = 0;
x86_gsbase_write_task(task, arg2);
}
...
So writing the value that was already there to gsbase via putreg()
does nothing, but writing a *different* value implicitly clears gs,
but writing a different value will clear gs.
This behavior is, AFAICT, complete nonsense. It happens to work
because usually gdb writes the same value back, and, in any case, gs
comes *after* gsbase in user_regs_struct, so gs gets replaced anyway.
But I think that this behavior should be fixed up and probably tested.
Certainly the behavior should *not* be the same on a fsgsbase kernel,
and and the fsgsbase behavior definitely needs a selftest.