On 06/18/2015 11:31 AM, Ingo Molnar wrote:
>> If it takes, say, 20 cycles to pull data from e.g. L3 cache to ECX,
>> then SYSRET can't possibly complete sooner than in 20 cycles.
>
> Yeah, that's true, but my point is: SYSRET has to do a lot of other things
> (permission checks, loading the user mode state - most of which are
> unrelated to R11/RCX), which take dozens of cycles,
SYSRET was designed to avoid doing that. It does not check permissions -
it slam-dunks CPL3 and resets CS and SS to preset values. It does not
touch the stack register or restore any other GP register.

Having said that, I'd try to get cold hard facts, i.e. experimentally
measure SYSRET latency.

> and which are probably overlapped with any
> cache misses on arguments such as R11/RCX.
>
> It's not impossible that reordering helps, for example if SYSRET has
> some internal dependencies that make its parallelism worse than ideal -
> but I'd complicate this code only if it gives a measurable improvement
> for cache-cold syscall performance.

I attempted to test it. With the patch which moves the RCX and R11 loads
all the way down:

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index f2064bd..0ea09a3 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -139,9 +139,6 @@ sysexit_from_sys_call:
 	 * with 'sysenter' and it uses the SYSENTER calling convention.
 	 */
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	/* Prepare registers for SYSRET insn */
-	movl	RIP(%rsp), %ecx		/* User %eip */
-	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	/* Restore registers per SYSEXIT ABI requirements: */
 	/* arg1 (ebx): preserved by virtue of being a callee-saved register */
 	/* arg2 (ecx): used by SYSEXIT to restore esp (and by SYSRET to restore eip) */
@@ -155,6 +152,9 @@ sysexit_from_sys_call:
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
+	/* Prepare registers for SYSRET insn */
+	movl	RIP(%rsp), %ecx		/* User %eip */
+	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	TRACE_IRQS_ON
 	/*
@@ -374,9 +374,6 @@ cstar_dispatch:
 sysretl_from_sys_call:
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	/* Prepare registers for SYSRET insn */
-	movl	RIP(%rsp), %ecx		/* User %eip */
-	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	/* Restore registers per SYSRET ABI requirements: */
 	/* arg1 (ebx): preserved by virtue of being a callee-saved register */
 	/* arg2 (ebp): preserved (already restored, see above) */
@@ -388,6 +385,9 @@ sysretl_from_sys_call:
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
+	/* Prepare registers for SYSRET insn */
+	movl	RIP(%rsp), %ecx		/* User %eip */
+	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	TRACE_IRQS_ON
 	movl	RSP(%rsp), %esp

This does not change instruction sizes, and therefore does not change
code cacheline alignments over the entire bzImage.

Testing getpid() in a loop (IOW: a cache-hot test) did show that with
this patch it is slower, but by a statistically insignificant amount:

before the patch, it's 61.92 ns per syscall;
after the patch, it's 61.99 ns per syscall.

That's less than one cycle - more like 0.15 cycles. However, it is
reproducible.

I did not figure out how to do a cache-cold test. I tried a 65kbyte-ish
read from /dev/zero. That takes ~3885 ns, and its variability of +-10 ns
drowns out any possible difference.