* Andy Lutomirski <l...@kernel.org> wrote:

> The main things that are missing are that I haven't done the 32-bit parts
> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
> think this may break UML for trivial reasons.

So I'd suggest moving most of the SYSRET fast path to C too. This is what it
looks like now after your patches:

	testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
	jnz	tracesys
entry_SYSCALL_64_fastpath:
#if __SYSCALL_MASK == ~0
	cmpq	$__NR_syscall_max, %rax
#else
	andl	$__SYSCALL_MASK, %eax
	cmpl	$__NR_syscall_max, %eax
#endif
	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
	movq	%r10, %rcx
	call	*sys_call_table(, %rax, 8)
	movq	%rax, RAX(%rsp)
1:
/*
 * Syscall return path ending with SYSRET (fast path).
 * Has incompletely filled pt_regs.
 */
	LOCKDEP_SYS_EXIT
	/*
	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
	 * it is too small to ever cause noticeable irq latency.
	 */
	DISABLE_INTERRUPTS(CLBR_NONE)

	/*
	 * We must check ti flags with interrupts (or at least preemption)
	 * off because we must *never* return to userspace without
	 * processing exit work that is enqueued if we're preempted here.
	 * In particular, returning to userspace with any of the one-shot
	 * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
	 * very bad.
	 */
	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
	jnz	int_ret_from_sys_call_irqs_off	/* Go to the slow path */

Most of that can be done in C.

And I think we could also convert the IRET syscall return slow path to C:

GLOBAL(int_ret_from_sys_call)
	SAVE_EXTRA_REGS
	movq	%rsp, %rdi
	call	syscall_return_slowpath	/* returns with IRQs disabled */
	RESTORE_EXTRA_REGS

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.
	 */
	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11
	cmpq	%rcx, %r11			/* RCX == RIP */
	jne	opportunistic_sysret_failed

	/*
	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
	 * in kernel space.  This essentially lets the user take over
	 * the kernel, since userspace controls RSP.
	 *
	 * If width of "canonical tail" ever becomes variable, this will need
	 * to be updated to remain correct on both old and new CPUs.
	 */
	.ifne __VIRTUAL_MASK_SHIFT - 47
	.error "virtual address width changed -- SYSRET checks need update"
	.endif

	/* Change top 16 bits to be the sign-extension of 47th bit */
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

	/* If this changed %rcx, it was not canonical */
	cmpq	%rcx, %r11
	jne	opportunistic_sysret_failed

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	opportunistic_sysret_failed

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	opportunistic_sysret_failed

	/*
	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
	 * restoring TF results in a trap from userspace immediately after
	 * SYSRET.  This would cause an infinite loop whenever #DB happens
	 * with register state that satisfies the opportunistic SYSRET
	 * conditions.  For example, single-stepping this user code:
	 *
	 *           movq	$stuck_here, %rcx
	 *           pushfq
	 *           popq %r11
	 *   stuck_here:
	 *
	 * would never get past 'stuck_here'.
	 */
	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	opportunistic_sysret_failed

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	opportunistic_sysret_failed

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
	/* rcx and r11 are already restored (see code above) */
	RESTORE_C_REGS_EXCEPT_RCX_R11
	movq	RSP(%rsp), %rsp
	USERGS_SYSRET64

opportunistic_sysret_failed:
	SWAPGS
	jmp	restore_c_regs_and_iret
END(entry_SYSCALL_64)

Basically there would be a single C function we'd call, which returns a
condition (or fixes up its return address on the stack directly) to determine
between the SYSRET and IRET return paths.

Moving this to C too has immediate benefits: that way we could easily add
instrumentation to see how efficient these various return methods are, etc.

I.e. I don't think there's two ways about this: once the entry code moves to
the domain of C code, we get the best benefits by moving as much of it as
possible.

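To make the "single C function that returns a condition" idea a bit more
concrete, here is a rough, untested userspace sketch of how the opportunistic
SYSRET checks quoted above might read in C. Everything prefixed with sketch_
or SKETCH_ is a hypothetical name made up for illustration, the register
struct is a stand-in for pt_regs rather than the real layout, and the selector
and flag values are assumed to match the usual x86-64 definitions. The
canonical-address test uses a top-bits comparison instead of the shl/sar pair,
which should be equivalent for a 47-bit mask shift:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, mirroring the usual x86-64 definitions. */
#define SKETCH_VIRTUAL_MASK_SHIFT 47
#define SKETCH_USER_CS            0x33u
#define SKETCH_USER_DS            0x2bu
#define SKETCH_EFLAGS_TF          0x00000100u
#define SKETCH_EFLAGS_RF          0x00010000u

/* Same intent as the .ifne/.error guard in the assembly version. */
static_assert(SKETCH_VIRTUAL_MASK_SHIFT == 47,
	      "virtual address width changed -- SYSRET checks need update");

/* Hypothetical stand-in for the saved user register frame. */
struct sketch_regs {
	uint64_t r11;
	uint64_t cx;
	uint64_t ip;
	uint64_t cs;
	uint64_t flags;
	uint64_t sp;
	uint64_t ss;
};

/* Canonical means bits 63..47 are all zeroes or all ones. */
static bool sketch_is_canonical(uint64_t addr)
{
	uint64_t top = addr >> SKETCH_VIRTUAL_MASK_SHIFT;

	return top == 0 || top == (UINT64_MAX >> SKETCH_VIRTUAL_MASK_SHIFT);
}

/* Returns true if SYSRET looks safe, false if the caller should use IRET. */
bool sketch_can_use_sysret(const struct sketch_regs *regs)
{
	if (regs->cx != regs->ip)		/* RCX == RIP */
		return false;
	if (!sketch_is_canonical(regs->cx))	/* non-canonical RCX would #GP */
		return false;
	if (regs->cs != SKETCH_USER_CS)		/* CS must match SYSRET */
		return false;
	if (regs->r11 != regs->flags)		/* R11 == RFLAGS */
		return false;
	if (regs->r11 & (SKETCH_EFLAGS_RF | SKETCH_EFLAGS_TF))
		return false;			/* SYSRET can't handle RF/TF */
	/* nothing to check for RSP */
	if (regs->ss != SKETCH_USER_DS)		/* SS must match SYSRET */
		return false;
	return true;
}
```

The assembly side would then only test the return value and branch to either
the SYSRET or the IRET exit, and a counter for each outcome could be bumped
inside the function if we wanted the instrumentation mentioned above.
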
The only low level bits remaining in assembly will be low level hardware ABI
details: saving registers and restoring registers to the expected format - no
'active' code whatsoever.

Thanks,

	Ingo