On Thu, Apr 23, 2015 at 07:15:01PM -0700, Andy Lutomirski wrote:
> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
> with SS == 0 results in an invalid usermode state in which SS is
> apparently equal to __USER_DS but causes #SS if used.
>
> Work around the issue by replacing NULL SS values with __KERNEL_DS
> in __switch_to, thus ensuring that SYSRET never happens with SS set
> to NULL.
>
> This was exposed by a recent vDSO cleanup.
>
> Fixes: e7d6eefaaa44 ("x86/vdso32/syscall.S: Do not load __USER32_DS to %ss")
> Signed-off-by: Andy Lutomirski <l...@kernel.org>
> ---
>
> Tested only on Intel, which isn't very interesting. I'll tidy up
> and send a test case, too, once Borislav confirms that it works.
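For context, the workaround boils down to something like this in
__switch_to() - a sketch only, under the assumption that the erratum
gets a bug flag (X86_BUG_SYSRET_SS_ATTRS is a placeholder name here);
savesegment()/loadsegment() are the existing segment accessor helpers:

	if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
		unsigned short ss_sel;

		/*
		 * If the outgoing task left SS == NULL, reload it with
		 * __KERNEL_DS so that a later SYSRET cannot drop back to
		 * usermode with an unusable SS.
		 */
		savesegment(ss, ss_sel);
		if (ss_sel == 0)
			loadsegment(ss, __KERNEL_DS);
	}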
So I did some benchmarking today. Custom kernel build measured with
perf stat, 10 builds with --pre doing

$ cat pre-build-kernel.sh
make -s clean
echo 3 > /proc/sys/vm/drop_caches

$ cat measure.sh
EVENTS="cpu-clock,task-clock,cycles,instructions,branches,branch-misses,context-switches,migrations"
perf stat -e $EVENTS --sync -a --repeat 10 --pre ~/kernel/pre-build-kernel.sh make -s -j64

I've prepended the perf stat output with markers A:, B: or C: for
easier comparing. The markers mean:

A: Linus' master from a couple of days ago + tip/master + tip/x86/asm
B: with Andy's SYSRET patch on top
C: without the RCX canonicalness check (see patch at the end)

Numbers are from an AMD F16h box:

A:  2835570.145246  cpu-clock (msec)                             ( +- 0.02% ) [100.00%]
B:  2833364.074970  cpu-clock (msec)                             ( +- 0.04% ) [100.00%]
C:  2834708.335431  cpu-clock (msec)                             ( +- 0.02% ) [100.00%]

This is interesting: the SYSRET SS fix makes it minimally better and
the C patch is a bit worse again. Net win is 861 msec, almost a second,
oh well.

A:  2835570.099981  task-clock (msec)  # 3.996 CPUs utilized     ( +- 0.02% ) [100.00%]
B:  2833364.073633  task-clock (msec)  # 3.996 CPUs utilized     ( +- 0.04% ) [100.00%]
C:  2834708.350387  task-clock (msec)  # 3.996 CPUs utilized     ( +- 0.02% ) [100.00%]

Similar thing observable here.

A:  5,591,213,166,613  cycles  # 1.972 GHz                       ( +- 0.03% ) [75.00%]
B:  5,585,023,802,888  cycles  # 1.971 GHz                       ( +- 0.03% ) [75.00%]
C:  5,587,983,212,758  cycles  # 1.971 GHz                       ( +- 0.02% ) [75.00%]

Net win is a 3,229,953,855 cycles drop.

A:  3,106,707,101,530  instructions  # 0.56 insns per cycle      ( +- 0.01% ) [75.00%]
B:  3,106,632,251,528  instructions  # 0.56 insns per cycle      ( +- 0.00% ) [75.00%]
C:  3,106,265,958,142  instructions  # 0.56 insns per cycle      ( +- 0.00% ) [75.00%]

This looks like it would make sense - instruction count drops from
A -> B -> C.

A:  683,676,044,429  branches  # 241.107 M/sec                   ( +- 0.01% ) [75.00%]
B:  683,670,899,595  branches  # 241.293 M/sec                   ( +- 0.01% ) [75.00%]
C:  683,675,772,858  branches  # 241.180 M/sec                   ( +- 0.01% ) [75.00%]

Also makes sense - the C patch adds an unconditional JMP over the
RCX canonicalness check.

A:  43,829,535,008  branch-misses  # 6.41% of all branches       ( +- 0.02% ) [75.00%]
B:  43,844,118,416  branch-misses  # 6.41% of all branches       ( +- 0.03% ) [75.00%]
C:  43,819,871,086  branch-misses  # 6.41% of all branches       ( +- 0.02% ) [75.00%]

And this is nice: branch misses are the smallest with C, cool. It makes
sense again - the C patch adds an unconditional JMP which doesn't miss.

A:  2,030,357  context-switches  # 0.716 K/sec                   ( +- 0.06% ) [100.00%]
B:  2,029,313  context-switches  # 0.716 K/sec                   ( +- 0.05% ) [100.00%]
C:  2,028,566  context-switches  # 0.716 K/sec                   ( +- 0.06% ) [100.00%]

Those look good.

A:  52,421  migrations  # 0.018 K/sec                            ( +- 1.13% )
B:  52,049  migrations  # 0.018 K/sec                            ( +- 1.02% )
C:  51,365  migrations  # 0.018 K/sec                            ( +- 0.92% )

Same here.

A:  709.528485252 seconds time elapsed                           ( +- 0.02% )
B:  708.976557288 seconds time elapsed                           ( +- 0.04% )
C:  709.312844791 seconds time elapsed                           ( +- 0.02% )

Interestingly, the unconditional JMP kinda costs...

Btw, I'm not sure whether a kernel build is the optimal workload for
benchmarking here, but I don't see why not - it does a lot of syscalls,
so it should exercise the SYSRET path sufficiently.

Anyway, we can do this below. Or not - I'm sitting on the fence about
that one.
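Just for reference: the check being jumped over on AMD is the usual
sign-extend-and-compare trick. A minimal C sketch of what the shl/sar
pair in entry_64.S computes (illustrative helper with a made-up name,
not kernel code):

#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical when bits 63:48 are copies of bit 47
 * (__VIRTUAL_MASK_SHIFT == 47 on current 64-bit parts).
 */
static bool is_canonical(uint64_t addr)
{
	const int shift = 64 - (47 + 1);

	/*
	 * shl moves bit 47 into the sign-bit position; sar shifts back,
	 * so bits 63:48 become copies of bit 47. If the round trip
	 * changes the value, the address was not canonical. Relies on
	 * arithmetic right shift of signed values, as gcc implements.
	 */
	return (uint64_t)((int64_t)(addr << shift) >> shift) == addr;
}

SYSRET is attempted only when %rcx survives that round trip; the
ALTERNATIVE in the patch below replaces the whole check with a "jmp 1f"
on CPUs which don't have X86_BUG_CANONICAL_RCX set.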
---
From: Borislav Petkov <b...@suse.de>
Date: Sat, 25 Apr 2015 19:30:33 +0200
Subject: [PATCH] x86/entry: Avoid canonical RCX check on AMD

It is not needed on AMD as RCX canonicalness is not checked during
SYSRET there.

Signed-off-by: Borislav Petkov <b...@suse.de>
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/kernel/cpu/intel.c       |  2 ++
 arch/x86/kernel/entry_64.S        | 13 +++++++++----
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ee9b94d9921..8d555b046fe9 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -265,6 +265,7 @@
 #define X86_BUG_11AP		X86_BUG(5) /* Bad local APIC aka 11AP */
 #define X86_BUG_FXSAVE_LEAK	X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
 #define X86_BUG_CLFLUSH_MONITOR	X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
+#define X86_BUG_CANONICAL_RCX	X86_BUG(8) /* SYSRET #GPs when %RCX non-canonical */
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 50163fa9034f..109a51815e92 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -159,6 +159,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 		pr_info("Disabling PGE capability bit\n");
 		setup_clear_cpu_cap(X86_FEATURE_PGE);
 	}
+
+	set_cpu_bug(c, X86_BUG_CANONICAL_RCX);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e952f6bf1d6d..d01fb6c1362f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -415,16 +415,20 @@ syscall_return:
 	jne opportunistic_sysret_failed
 
 	/*
-	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
-	 * in kernel space. This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.
-	 *
 	 * If width of "canonical tail" ever becomes variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
 	 */
 	.ifne __VIRTUAL_MASK_SHIFT - 47
 	.error "virtual address width changed -- SYSRET checks need update"
 	.endif
+
+	/*
+	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
+	 * in kernel space. This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.
+	 */
+	ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
+
 	/* Change top 16 bits to be the sign-extension of 47th bit */
 	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
@@ -432,6 +436,7 @@ syscall_return:
 	cmpq	%rcx, %r11
 	jne	opportunistic_sysret_failed
 
+1:
 	cmpq	$__USER_CS,CS(%rsp)	/* CS must match SYSRET */
 	jne	opportunistic_sysret_failed
 
-- 
2.3.5

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.