Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

Borislav Petkov Sat, 25 Apr 2015 14:12:51 -0700

On Thu, Apr 23, 2015 at 07:15:01PM -0700, Andy Lutomirski wrote:
> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
> with SS == 0 results in an invalid usermode state in which SS is
> apparently equal to __USER_DS but causes #SS if used.
> 
> Work around the issue by replacing NULL SS values with __KERNEL_DS
> in __switch_to, thus ensuring that SYSRET never happens with SS set
> to NULL.
> 
> This was exposed by a recent vDSO cleanup.
> 
> Fixes: e7d6eefaaa44 x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> 
> Tested only on Intel, which isn't very interesting.  I'll tidy up
> and send a test case, too, once Borislav confirms that it works.


So I did some benchmarking today. Custom kernel build measured with perf
stat, 10 builds with --pre doing

$ cat pre-build-kernel.sh
make -s clean
echo 3 > /proc/sys/vm/drop_caches

$ cat measure.sh
EVENTS="cpu-clock,task-clock,cycles,instructions,branches,branch-misses,context-switches,migrations"
perf stat -e $EVENTS --sync -a --repeat 10 --pre ~/kernel/pre-build-kernel.sh \
	make -s -j64

I've prepended the perf stat output with markers A:, B: or C: for easier
comparison. The markers mean:

A: Linus' master from a couple of days ago + tip/master + tip/x86/asm
B: With Andy's SYSRET patch on top
C: Without RCX canonicalness check (see patch at the end).
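
For readers who want to see what the RCX canonicalness check in C is
testing, here is a minimal sketch in Python. It mimics the shl/sar pair
from entry_64.S on a 64-bit value, assuming the 47-bit virtual address
width that the patch asserts via __VIRTUAL_MASK_SHIFT. The function names
are made up for illustration and do not exist in the kernel.

```python
# Illustrative only: models the sign-extension trick used in entry_64.S.
VIRTUAL_MASK_SHIFT = 47          # assumed, matches the .ifne check in the patch
MASK64 = (1 << 64) - 1           # emulate a 64-bit register


def sign_extend_from_bit47(addr):
    """Emulate `shl $16, %rcx; sar $16, %rcx` on a 64-bit value."""
    shift = 64 - (VIRTUAL_MASK_SHIFT + 1)    # 16 for a 47-bit mask shift
    shifted = (addr << shift) & MASK64       # shl drops the top 16 bits
    if shifted & (1 << 63):                  # reinterpret as signed 64-bit
        shifted -= 1 << 64
    return (shifted >> shift) & MASK64       # sar replicates the sign bit


def is_canonical(addr):
    """An address is canonical if sign-extending bit 47 leaves it unchanged."""
    return sign_extend_from_bit47(addr) == addr


if __name__ == "__main__":
    for addr in (0x00007FFFFFFFFFFF, 0x0000800000000000, 0xFFFF800000000000):
        print(hex(addr), is_canonical(addr))
```

The kernel does the comparison with cmpq against a saved copy of the
return address and bails out to opportunistic_sysret_failed on a
mismatch; the sketch only models the arithmetic, not the control flow.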

Numbers are from an AMD F16h box:

A:    2835570.145246      cpu-clock (msec)                ( +-  0.02% ) [100.00%]
B:    2833364.074970      cpu-clock (msec)                ( +-  0.04% ) [100.00%]
C:    2834708.335431      cpu-clock (msec)                ( +-  0.02% ) [100.00%]

This is interesting: the SYSRET SS fix makes it minimally better and
the C-patch is a bit worse again. Net win (A -> C) is 861 msec, almost a
second, oh well.

A:    2835570.099981      task-clock (msec)         #    3.996 CPUs utilized      ( +-  0.02% ) [100.00%]
B:    2833364.073633      task-clock (msec)         #    3.996 CPUs utilized      ( +-  0.04% ) [100.00%]
C:    2834708.350387      task-clock (msec)         #    3.996 CPUs utilized      ( +-  0.02% ) [100.00%]

Similar thing observable here.

A: 5,591,213,166,613      cycles                    #    1.972 GHz                ( +-  0.03% ) [75.00%]
B: 5,585,023,802,888      cycles                    #    1.971 GHz                ( +-  0.03% ) [75.00%]
C: 5,587,983,212,758      cycles                    #    1.971 GHz                ( +-  0.02% ) [75.00%]

Net win (A -> C) is a drop of 3,229,953,855 cycles.
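
The net-win figures quoted so far can be re-derived from the perf stat
means with a few lines of Python. This is just a sanity-check sketch: the
numbers are copied from the A/B/C output above, and the helper name
delta_vs_baseline is made up for illustration.

```python
# Means copied from the perf stat output above (A = baseline).
cpu_clock_msec = {"A": 2835570.145246, "B": 2833364.074970, "C": 2834708.335431}
cycles = {"A": 5_591_213_166_613, "B": 5_585_023_802_888, "C": 5_587_983_212_758}


def delta_vs_baseline(samples, baseline="A"):
    """Return baseline minus each other variant; positive means a win."""
    base = samples[baseline]
    return {name: base - value for name, value in samples.items() if name != baseline}


clock_deltas = delta_vs_baseline(cpu_clock_msec)
cycle_deltas = delta_vs_baseline(cycles)

if __name__ == "__main__":
    for name in ("B", "C"):
        print(f"A - {name}: {clock_deltas[name]:.3f} msec cpu-clock, "
              f"{cycle_deltas[name]:,} cycles")
```

The A - C row reproduces the cpu-clock and cycles net wins mentioned in
the text; the A - B row shows the somewhat larger gain from the SYSRET SS
fix alone.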

A: 3,106,707,101,530      instructions              #    0.56  insns per cycle    ( +-  0.01% ) [75.00%]
B: 3,106,632,251,528      instructions              #    0.56  insns per cycle    ( +-  0.00% ) [75.00%]
C: 3,106,265,958,142      instructions              #    0.56  insns per cycle    ( +-  0.00% ) [75.00%]

This looks like it would make sense - instruction count drops from A -> B -> C.

A:   683,676,044,429      branches                  #  241.107 M/sec              ( +-  0.01% ) [75.00%]
B:   683,670,899,595      branches                  #  241.293 M/sec              ( +-  0.01% ) [75.00%]
C:   683,675,772,858      branches                  #  241.180 M/sec              ( +-  0.01% ) [75.00%]

Also makes sense - the C patch adds an unconditional JMP over the
RCX-canonicalness check.

A:    43,829,535,008      branch-misses             #    6.41% of all branches    ( +-  0.02% ) [75.00%]
B:    43,844,118,416      branch-misses             #    6.41% of all branches    ( +-  0.03% ) [75.00%]
C:    43,819,871,086      branch-misses             #    6.41% of all branches    ( +-  0.02% ) [75.00%]

And this is nice, branch misses are the smallest with C, cool. It makes
sense again - the C patch adds an unconditional JMP which doesn't miss.

A:         2,030,357      context-switches          #    0.716 K/sec              ( +-  0.06% ) [100.00%]
B:         2,029,313      context-switches          #    0.716 K/sec              ( +-  0.05% ) [100.00%]
C:         2,028,566      context-switches          #    0.716 K/sec              ( +-  0.06% ) [100.00%]

Those look good.

A:            52,421      migrations                #    0.018 K/sec              ( +-  1.13% )
B:            52,049      migrations                #    0.018 K/sec              ( +-  1.02% )
C:            51,365      migrations                #    0.018 K/sec              ( +-  0.92% )

Same here.

A:     709.528485252 seconds time elapsed                                         ( +-  0.02% )
B:     708.976557288 seconds time elapsed                                         ( +-  0.04% )
C:     709.312844791 seconds time elapsed                                         ( +-  0.02% )

Interestingly, the unconditional JMP kinda costs... Btw, I'm not sure if
kernel build is the optimal workload for benchmarking here but I don't
see why not - it does a lot of syscalls so it should exercise the SYSRET
path sufficiently.
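
One rough way to judge how much the elapsed-time differences mean is to
set them against the +- percentages perf stat reports. The sketch below
does that in Python. Treating the sum of the two +- spreads as a noise
bound is a crude assumption made here for illustration, not a proper
significance test, and the helper names are made up.

```python
# (mean seconds, reported +- percent) copied from the elapsed-time lines above.
elapsed = {
    "A": (709.528485252, 0.02),
    "B": (708.976557288, 0.04),
    "C": (709.312844791, 0.02),
}


def spread_seconds(mean, pct):
    """Convert perf stat's +- percentage into seconds."""
    return mean * pct / 100.0


def compare(x, y):
    """Return (x - y in seconds, combined +- spread of x and y in seconds)."""
    (mean_x, pct_x), (mean_y, pct_y) = elapsed[x], elapsed[y]
    delta = mean_x - mean_y
    noise = spread_seconds(mean_x, pct_x) + spread_seconds(mean_y, pct_y)
    return delta, noise


if __name__ == "__main__":
    for x, y in (("A", "B"), ("A", "C"), ("C", "B")):
        delta, noise = compare(x, y)
        verdict = "outside" if abs(delta) > noise else "within"
        print(f"{x} - {y}: {delta:+.3f} s, combined spread {noise:.3f} s ({verdict})")
```

Under that assumption the A - B difference sits outside the combined
spread while A - C sits within it, so the extra cost of the JMP variant
is best read as a hint rather than a measured regression.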

Anyway, we can do this below. Or not, I'm sitting on the fence about
that one.

---
From: Borislav Petkov <[email protected]>
Date: Sat, 25 Apr 2015 19:30:33 +0200
Subject: [PATCH] x86/entry: Avoid canonical RCX check on AMD

It is not needed on AMD as RCX canonicalness is not checked during
SYSRET there.

Signed-off-by: Borislav Petkov <[email protected]>
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/kernel/cpu/intel.c       |  2 ++
 arch/x86/kernel/entry_64.S        | 13 +++++++++----
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ee9b94d9921..8d555b046fe9 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -265,6 +265,7 @@
 #define X86_BUG_11AP           X86_BUG(5) /* Bad local APIC aka 11AP */
 #define X86_BUG_FXSAVE_LEAK    X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
 #define X86_BUG_CLFLUSH_MONITOR        X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
+#define X86_BUG_CANONICAL_RCX  X86_BUG(8) /* SYSRET #GPs when %RCX non-canonical */
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
 
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 50163fa9034f..109a51815e92 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -159,6 +159,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
                pr_info("Disabling PGE capability bit\n");
                setup_clear_cpu_cap(X86_FEATURE_PGE);
        }
+
+       set_cpu_bug(c, X86_BUG_CANONICAL_RCX);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e952f6bf1d6d..d01fb6c1362f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -415,16 +415,20 @@ syscall_return:
        jne opportunistic_sysret_failed
 
        /*
-        * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
-        * in kernel space.  This essentially lets the user take over
-        * the kernel, since userspace controls RSP.
-        *
         * If width of "canonical tail" ever becomes variable, this will need
         * to be updated to remain correct on both old and new CPUs.
         */
        .ifne __VIRTUAL_MASK_SHIFT - 47
        .error "virtual address width changed -- SYSRET checks need update"
        .endif
+
+       /*
+        * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
+        * in kernel space.  This essentially lets the user take over
+        * the kernel, since userspace controls RSP.
+        */
+       ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
+
        /* Change top 16 bits to be the sign-extension of 47th bit */
        shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
        sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
@@ -432,6 +436,7 @@ syscall_return:
        cmpq    %rcx, %r11
        jne     opportunistic_sysret_failed
 
+1:
        cmpq $__USER_CS,CS(%rsp)        /* CS must match SYSRET */
        jne opportunistic_sysret_failed
 
-- 
2.3.5

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
