32: Replace RESTORE_RSI_RDI[_RDX] with open-coded 32-bit reads

Denys Vlasenko Thu, 18 Jun 2015 04:01:26 -0700

On 06/18/2015 11:31 AM, Ingo Molnar wrote:
>> If it takes, say, 20 cycles to pull data from e.g. L3 cache to ECX,
>> then SYSRET can't possibly complete sooner than in 20 cycles.
>
> Yeah, that's true, but my point is: SYSRET has to do a lot of other things
> (permission checks, loading the user mode state - most of which are unrelated 
> to
> R11/RCX), which take dozens of cycles,


SYSRET was designed to avoid doing that. It does not check permissions
- it slam-dunks CPL3 and resets CS and SS to preset values.
It does not touch stack register or restores any other GP register.

Having said that, I'd try to get cold hard facts, i.e. experimentally
measure SYSRET latency.


> and which are probably overlapped with any
> cache misses on arguments such as R11/RCX.
>
> It's not impossible that reordering helps, for example if SYSRET has some 
> internal 
> dependencies that makes it parallelism worse than ideal - but I'd complicate 
> this 
> code only if it gives a measurable improvement for cache-cold syscall 
> performance.

I attempted to test it. With the patch which moves RCX and R11 loads all the 
way down:

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index f2064bd..0ea09a3 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -139,9 +139,6 @@ sysexit_from_sys_call:
         * with 'sysenter' and it uses the SYSENTER calling convention.
         */
        andl    $~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-       /* Prepare registers for SYSRET insn */
-       movl    RIP(%rsp), %ecx         /* User %eip */
-       movl    EFLAGS(%rsp), %r11d     /* User eflags *
        /* Restore registers per SYSEXIT ABI requirements: */
        /* arg1 (ebx): preserved by virtue of being a callee-saved register */
        /* arg2 (ecx): used by SYSEXIT to restore esp (and by SYSRET to restore 
eip) */
@@ -155,6 +152,9 @@ sysexit_from_sys_call:
        xorl    %r8d, %r8d
        xorl    %r9d, %r9d
        xorl    %r10d, %r10d
+       /* Prepare registers for SYSRET insn */
+       movl    RIP(%rsp), %ecx         /* User %eip */
+       movl    EFLAGS(%rsp), %r11d     /* User eflags *
        TRACE_IRQS_ON

        /*
@@ -374,9 +374,6 @@ cstar_dispatch:

 sysretl_from_sys_call:
        andl    $~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-       /* Prepare registers for SYSRET insn */
-       movl    RIP(%rsp), %ecx         /* User %eip */
-       movl    EFLAGS(%rsp), %r11d     /* User eflags */
        /* Restore registers per SYSRET ABI requirements: */
        /* arg1 (ebx): preserved by virtue of being a callee-saved register */
        /* arg2 (ebp): preserved (already restored, see above) */
@@ -388,6 +385,9 @@ sysretl_from_sys_call:
        xorl    %r8d, %r8d
        xorl    %r9d, %r9d
        xorl    %r10d, %r10d
+       /* Prepare registers for SYSRET insn */
+       movl    RIP(%rsp), %ecx         /* User %eip */
+       movl    EFLAGS(%rsp), %r11d     /* User eflags */
        TRACE_IRQS_ON
        movl    RSP(%rsp), %esp
        /*

This does not change instructions sizes and therefore code
cacheline alignments over entire bzImage.


Testing getpid() in a loop (IOW: cache-hot test) did show that with
this patch it is slower, but by statistically insignificant amount:

before patch, it's 61.92 ns per syscall.
after patch, it's  61.99 ns per syscall.

That's less than one cycle, more like 0.15 cycles.
However, it is reproducible.

I did not figure out how to do a cache-cold test.
Tried a 65kbyte-ish read from "/dev/zero". That takes ~3885 ns
and its variability of +-10 ns drowns out a possible difference.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/5] x86/asm/entry/32: Replace RESTORE_RSI_RDI[_RDX] with open-coded 32-bit reads

Reply via email to