> On May 1, 2019, at 10:40, Andy Lutomirski <[email protected]> wrote: > > On Wed, May 1, 2019 at 6:52 AM Bae, Chang Seok <[email protected]> > wrote: >> >> >>> On Apr 5, 2019, at 06:50, Andy Lutomirski <[email protected]> wrote: >>> >>> >>> >>>> On Apr 5, 2019, at 2:35 AM, Thomas Gleixner <[email protected]> wrote: >>>> >>>>> On Mon, 25 Mar 2019, Thomas Gleixner wrote: >>>>>> On Fri, 15 Mar 2019, Chang S. Bae wrote: >>>>>> ENTRY(paranoid_exit) >>>>>> UNWIND_HINT_REGS >>>>>> DISABLE_INTERRUPTS(CLBR_ANY) >>>>>> TRACE_IRQS_OFF_DEBUG >>>>>> + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\ >>>>>> + X86_FEATURE_FSGSBASE >>>>>> + wrgsbase %rbx >>>>>> + jmp .Lparanoid_exit_no_swapgs; >>>>> >>>>> Again. A few newlines would make it more readable. >>>>> >>>>> This modifies the semantics of paranoid_entry and paranoid_exit. Looking >>>>> at >>>>> the usage sites there is the following code in the nmi maze: >>>>> >>>>> /* >>>>> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit >>>>> * as we should not be calling schedule in NMI context. >>>>> * Even with normal interrupts enabled. An NMI should not be >>>>> * setting NEED_RESCHED or anything that normal interrupts and >>>>> * exceptions might do. >>>>> */ >>>>> call paranoid_entry >>>>> UNWIND_HINT_REGS >>>>> >>>>> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */ >>>>> movq %rsp, %rdi >>>>> movq $-1, %rsi >>>>> call do_nmi >>>>> >>>>> /* Always restore stashed CR3 value (see paranoid_entry) */ >>>>> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 >>>>> >>>>> testl %ebx, %ebx /* swapgs needed? */ >>>>> jnz nmi_restore >>>>> nmi_swapgs: >>>>> SWAPGS_UNSAFE_STACK >>>>> nmi_restore: >>>>> POP_REGS >>>>> >>>>> I might be missing something, but how is that supposed to work when >>>>> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then >>>>> there is a big fat comment missing explaining why. >>>> >>>> So this _is_ broken. >>>> >>>> On entry: >>>> >>>> rbx = rdgsbase() >>>> wrgsbase(KERNEL_GS) >>>> >>>> On exit: >>>> >>>> if (ebx == 0) >>>> swapgs >>>> >>>> The resulting matrix: >>>> >>>> | ENTRY GS | RBX | EXIT | GS on IRET | RESULT >>>> | | | | | >>>> 1 | KERNEL_GS | KERNEL_GS | EBX == 0 | USER_GS | FAIL >>>> | | | | | >>>> 2 | KERNEL_GS | KERNEL_GS | EBX != 0 | KERNEL_GS | ok >>>> | | | | | >>>> 3 | USER_GS | USER_GS | EBX == 0 | USER_GS | ok >>>> | | | | | >>>> 4 | USER_GS | USER_GS | EBX != 0 | KERNEL_GS | FAIL >>>> >>>> >>>> #1 Just works by chance because it's unlikely that the lower 32bits of a >>>> per CPU kernel GS are all 0. >>>> >>>> But it's just a question of probability that this turns into a >>>> non-debuggable once per year crash (think KASLR). >>>> >>>> #4 This can happen when the NMI hits the kernel in some other entry code >>>> _BEFORE_ or _AFTER_ swapgs. >>>> >>>> User space using GS addressing with GS[31:0] != 0 will crash and burn. >>>> >>>> >>> >>> Hi all- >>> >>> In a previous incarnation of these patches, I complained about the use of >>> SWAPGS in the paranoid path. Now I’m putting my maintainer foot down. On a >>> non-FSGSBASE system, the paranoid path known, definitively, which GS is >>> where, so SWAPGS is annoying. With FSGSBASE, unless you start looking at >>> the RIP that you interrupted, you cannot know whether you have user or >>> kernel GSBASE live, since they can have literally the same value. One of >>> the numerous versions of this patch compared the values and just said >>> “well, it’s harmless to SWAPGS if user code happens to use the same value >>> as the kernel”. I complained that it was far too fragile. >>> >>> So I’m putting my foot down. If you all want my ack, you’re going to save >>> the old GS, load the new one with WRGSBASE, and, on return, you’re going to >>> restore the old one with WRGSBASE. You will not use SWAPGS in the paranoid >>> path. >>> >>> Obviously, for the non-paranoid path, it all keeps working exactly like it >>> does now. >> >> Although I can see some other concerns with this, looks like it is still >> worth pursuing. >> >>> >>> Furthermore, if you folks even want me to review this series, the ptrace >>> tests need to be in place. On inspection of the current code (after the >>> debacle a few releases back), it appears the SETREGSET’s effect depends on >>> the current values in the registers — it does not actually seem to reliably >>> load the whole state. So my confidence will be greatly increased if your >>> series first adds a test that detects that bug (and fails!), then fixes the >>> bug in a tiny little patch, then adds FSGSBASE, and keeps the test working. >>> >> >> I think I need to understand the issue. Appreciate if you can elaborate a >> little bit. >> > > This patch series gives a particular behavior to PTRACE_SETREGS and > PTRACE_POKEUSER. There should be a test case that validates that > behavior, including testing the weird cases where gs != 0 and gsbase > contains unusual values. Some existing tests might be pretty close to > doing what's needed. > > Beyond that, the current putreg() code does this: > > case offsetof(struct user_regs_struct,gs_base): > /* > * Exactly the same here as the %fs handling above. > */ > if (value >= TASK_SIZE_MAX) > return -EIO; > if (child->thread.gsbase != value) > return do_arch_prctl_64(child, ARCH_SET_GS, value); > return 0; > > and do_arch_prctl_64(), in turn, does this: > > case ARCH_SET_GS: { > if (unlikely(arg2 >= TASK_SIZE_MAX)) > return -EPERM; > > preempt_disable(); > /* > * ARCH_SET_GS has always overwritten the index > * and the base. Zero is the most sensible value > * to put in the index, and is the only value that > * makes any sense if FSGSBASE is unavailable. > */ > if (task == current) { > [not used for ptrace] > } else { > task->thread.gsindex = 0; > x86_gsbase_write_task(task, arg2); > } > > ... > > So writing the value that was already there to gsbase via putreg() > does nothing, but writing a *different* value implicitly clears gs, > but writing a different value will clear gs. > > This behavior is, AFAICT, complete nonsense. It happens to work > because usually gdb writes the same value back, and, in any case, gs > comes *after* gsbase in user_regs_struct, so gs gets replaced anyway. > But I think that this behavior should be fixed up and probably tested. > Certainly the behavior should *not* be the same on a fsgsbase kernel, > and and the fsgsbase behavior definitely needs a selftest.
Okay, got the point; now crystal clear. I have my own test case for that though, need to find a very simple and acceptable solution. Thanks, Chang

