* Andy Lutomirski <l...@kernel.org> wrote:

> The main things that are missing are that I haven't done the 32-bit parts 
> (anyone want to help?) and therefore I haven't deleted the old C code.  I 
> also think this may break UML for trivial reasons.

So I'd suggest moving most of the SYSRET fast path to C too.

This is what it looks like now, after your patches:

        testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
        jnz     tracesys
entry_SYSCALL_64_fastpath:
#if __SYSCALL_MASK == ~0
        cmpq    $__NR_syscall_max, %rax
#else
        andl    $__SYSCALL_MASK, %eax
        cmpl    $__NR_syscall_max, %eax
#endif
        ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
        movq    %r10, %rcx
        call    *sys_call_table(, %rax, 8)
        movq    %rax, RAX(%rsp)
1:
/*
 * Syscall return path ending with SYSRET (fast path).
 * Has incompletely filled pt_regs.
 */
        LOCKDEP_SYS_EXIT
        /*
         * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
         * it is too small to ever cause noticeable irq latency.
         */
        DISABLE_INTERRUPTS(CLBR_NONE)

        /*
         * We must check ti flags with interrupts (or at least preemption)
         * off because we must *never* return to userspace without
         * processing exit work that is enqueued if we're preempted here.
         * In particular, returning to userspace with any of the one-shot
         * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
         * very bad.
         */
        testl   $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
        jnz     int_ret_from_sys_call_irqs_off  /* Go to the slow path */

Most of that can be done in C.
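
To make that concrete, here is a rough, user-space sketch of the dispatch and
exit-work test written as C. It is not kernel code: every name ending in
_sketch, the two toy syscalls, the table size and the work-flag mask value are
made up for illustration. The real flag test would also have to run with
interrupts disabled, which plain C like this cannot express.

```c
#include <errno.h>
#include <stdbool.h>

#define NR_SYSCALL_MAX_SKETCH   1UL     /* hypothetical: two toy syscalls */
#define TIF_ALLWORK_MASK_SKETCH 0xffUL  /* hypothetical mask value */

/* Hypothetical, cut-down register frame; not the kernel's pt_regs. */
struct fast_regs_sketch {
	unsigned long ax, di, si, dx, r10, r8, r9;
};

typedef long (*sys_call_ptr_sketch)(unsigned long, unsigned long,
				    unsigned long, unsigned long,
				    unsigned long, unsigned long);

/* Toy syscall 0: returns the sum of its first two arguments. */
static long sys_add_sketch(unsigned long a, unsigned long b, unsigned long c,
			   unsigned long d, unsigned long e, unsigned long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (long)(a + b);
}

/* Toy syscall 1: always fails with -EINVAL. */
static long sys_fail_sketch(unsigned long a, unsigned long b, unsigned long c,
			    unsigned long d, unsigned long e, unsigned long f)
{
	(void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
	return -EINVAL;
}

static const sys_call_ptr_sketch sys_call_table_sketch[] = {
	sys_add_sketch,
	sys_fail_sketch,
};

/*
 * Dispatch syscall 'nr' and report whether the fast return path may be
 * used.  As in the assembly, regs->ax is assumed to hold -ENOSYS on
 * entry, so an out-of-range number needs no extra store.
 */
static bool do_syscall_fast_sketch(struct fast_regs_sketch *regs,
				   unsigned long nr, unsigned long ti_flags)
{
	if (nr <= NR_SYSCALL_MAX_SKETCH)
		regs->ax = (unsigned long)sys_call_table_sketch[nr](
				regs->di, regs->si, regs->dx,
				regs->r10, regs->r8, regs->r9);

	/* Any pending exit work forces the slow path. */
	return !(ti_flags & TIF_ALLWORK_MASK_SKETCH);
}
```

The caller would take the SYSRET fast path on a true return and fall through
to the slow path otherwise.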

And I think we could convert the IRET syscall return slow path to C as well:

GLOBAL(int_ret_from_sys_call)
        SAVE_EXTRA_REGS
        movq    %rsp, %rdi
        call    syscall_return_slowpath /* returns with IRQs disabled */
        RESTORE_EXTRA_REGS

        /*
         * Try to use SYSRET instead of IRET if we're returning to
         * a completely clean 64-bit userspace context.
         */
        movq    RCX(%rsp), %rcx
        movq    RIP(%rsp), %r11
        cmpq    %rcx, %r11                      /* RCX == RIP */
        jne     opportunistic_sysret_failed

        /*
         * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
         * in kernel space.  This essentially lets the user take over
         * the kernel, since userspace controls RSP.
         *
         * If width of "canonical tail" ever becomes variable, this will need
         * to be updated to remain correct on both old and new CPUs.
         */
        .ifne __VIRTUAL_MASK_SHIFT - 47
        .error "virtual address width changed -- SYSRET checks need update"
        .endif

        /* Change top 16 bits to be the sign-extension of 47th bit */
        shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
        sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

        /* If this changed %rcx, it was not canonical */
        cmpq    %rcx, %r11
        jne     opportunistic_sysret_failed

        cmpq    $__USER_CS, CS(%rsp)            /* CS must match SYSRET */
        jne     opportunistic_sysret_failed

        movq    R11(%rsp), %r11
        cmpq    %r11, EFLAGS(%rsp)              /* R11 == RFLAGS */
        jne     opportunistic_sysret_failed

        /*
         * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
         * restoring TF results in a trap from userspace immediately after
         * SYSRET.  This would cause an infinite loop whenever #DB happens
         * with register state that satisfies the opportunistic SYSRET
         * conditions.  For example, single-stepping this user code:
         *
         *           movq       $stuck_here, %rcx
         *           pushfq
         *           popq %r11
         *   stuck_here:
         *
         * would never get past 'stuck_here'.
         */
        testq   $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
        jnz     opportunistic_sysret_failed

        /* nothing to check for RSP */

        cmpq    $__USER_DS, SS(%rsp)            /* SS must match SYSRET */
        jne     opportunistic_sysret_failed

        /*
         * We win! This label is here just for ease of understanding
         * perf profiles. Nothing jumps here.
         */
syscall_return_via_sysret:
        /* rcx and r11 are already restored (see code above) */
        RESTORE_C_REGS_EXCEPT_RCX_R11
        movq    RSP(%rsp), %rsp
        USERGS_SYSRET64

opportunistic_sysret_failed:
        SWAPGS
        jmp     restore_c_regs_and_iret
END(entry_SYSCALL_64)


Basically there would be a single C function we'd call, which returns a
condition (or fixes up its return address on the stack directly) that selects
between the SYSRET and IRET return paths.
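
As a hedged illustration of what such a function might look like, here is a
stand-alone C sketch of the same opportunistic-SYSRET tests the assembly above
performs. The struct and all *_sketch names are invented for this example; the
selector and flag values are the usual x86-64 ones, restated locally so the
snippet compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Usual x86-64 Linux values; the *_SKETCH names are made up here. */
#define USER_CS_SKETCH            0x33u
#define USER_DS_SKETCH            0x2bu
#define EFLAGS_TF_SKETCH          0x00000100u
#define EFLAGS_RF_SKETCH          0x00010000u
#define VIRTUAL_MASK_SHIFT_SKETCH 47

/* Mirrors the .ifne/.error guard in the assembly. */
static_assert(VIRTUAL_MASK_SHIFT_SKETCH == 47,
	      "virtual address width changed -- SYSRET checks need update");

/* Hypothetical subset of the saved user register frame. */
struct regs_sketch {
	uint64_t r11, cx, ip, cs, flags, sp, ss;
};

/* Canonical means bits 63..47 are all zeros or all ones. */
static bool is_canonical_sketch(uint64_t addr)
{
	uint64_t top = addr >> VIRTUAL_MASK_SHIFT_SKETCH;

	return top == 0 || top == 0x1ffff;
}

/* Returns true if SYSRET may be used, false if IRET is required. */
static bool can_use_sysret_sketch(const struct regs_sketch *regs)
{
	if (regs->cx != regs->ip)		/* RCX == RIP */
		return false;
	if (!is_canonical_sketch(regs->cx))	/* would #GP in kernel mode */
		return false;
	if (regs->cs != USER_CS_SKETCH)		/* CS must match SYSRET */
		return false;
	if (regs->r11 != regs->flags)		/* R11 == RFLAGS */
		return false;
	if (regs->r11 & (EFLAGS_RF_SKETCH | EFLAGS_TF_SKETCH))
		return false;			/* SYSRET can't handle RF/TF */
	if (regs->ss != USER_DS_SKETCH)		/* SS must match SYSRET */
		return false;
	return true;				/* nothing to check for RSP */
}
```

The assembly stub would then only need to test the returned value and branch
to either the SYSRET or the IRET register-restore sequence.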

Moving this to C too has immediate benefits: that way we could easily add 
instrumentation to see how efficient these various return methods are, etc.
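
For instance, counting how often each return method is taken becomes a couple
of lines once the decision lives in C. The sketch below is only an assumption
about how that could look: the enum, the counter array and both helpers are
hypothetical names, and a real kernel version would presumably use per-CPU
counters rather than a plain global array.

```c
#include <stdint.h>

/* Hypothetical identifiers for the two return methods. */
enum exit_path_sketch {
	EXIT_VIA_SYSRET_SKETCH,
	EXIT_VIA_IRET_SKETCH,
	EXIT_PATH_COUNT_SKETCH
};

/* One counter per return method (a plain global, for illustration only). */
static uint64_t exit_path_hits_sketch[EXIT_PATH_COUNT_SKETCH];

/* Called once per syscall return with the path that was chosen. */
static void count_exit_path_sketch(enum exit_path_sketch path)
{
	exit_path_hits_sketch[path]++;
}

/* Share of returns that used SYSRET, in whole percent (0 if none yet). */
static unsigned int sysret_percent_sketch(void)
{
	uint64_t sysret = exit_path_hits_sketch[EXIT_VIA_SYSRET_SKETCH];
	uint64_t total = sysret + exit_path_hits_sketch[EXIT_VIA_IRET_SKETCH];

	return total ? (unsigned int)(sysret * 100 / total) : 0;
}
```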

I.e. I don't think there are two ways about this: once the entry code moves to
the domain of C code, we get the best benefits by moving as much of it as
possible.

The only low level bits remaining in assembly will be low level hardware ABI 
details: saving registers and restoring registers to the expected format - no 
'active' code whatsoever.

Thanks,

        Ingo