Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-05-03 Thread Borislav Petkov
On Thu, Apr 30, 2015 at 02:39:07PM -0700, H. Peter Anvin wrote:
> This is the microbenchmark I used.
> 
> For the record, Intel's intention going forward is that 0F 1F will
> always be as fast or faster than any other alternative.

It looks like this is the case on AMD too.

So I took your benchmark and extended it to measure all sizes of K8 and
P6 NOPs. Also, I'm doing 10^6 iterations and taking the minimum. The
results speak for themselves, especially from 5-byte NOPs onwards, where
we have to repeat the K8 NOP but can still use a single P6 NOP.

And I'm going to move all relevant AMD hw to use the P6 NOPs for the
alternatives.

Unless I've done something wrong, of course. Please double-check, I'm
attaching the microbenchmark too.
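
To make the tables below easier to read, here is the 5-byte case where
the two schemes diverge - a minimal illustration, with the encodings
taken straight from the table rows (the helper names are just for
illustration):

/* 5-byte padding: the K8 scheme needs two instructions (3+2 bytes),
 * the P6 scheme is a single NOPL instruction. */
static void k8_nop5(void)
{
	asm volatile(".byte 0x66,0x66,0x90\n\t"	/* 66 66 90 */
		     ".byte 0x66,0x90");	/* 66 90 */
}

static void p6_nop5(void)
{
	asm volatile(".byte 0x0f,0x1f,0x44,0x00,0x00");	/* 0f 1f 44 00 00 */
}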

Anyway, here's a patch:

---
From: Borislav Petkov <b...@suse.de>
Date: Sat, 2 May 2015 23:55:40 +0200
Subject: [PATCH] x86/alternatives: Switch AMD F15h and later to the P6 NOPs

Software optimization guides for both F15h and F16h cite those NOPs as
the optimal ones. A microbenchmark confirms that even older families
are actually better with the single-insn NOPs, so switch to them for
the alternatives.

The cycle counts below include the loop overhead of the measurement,
but that overhead is the same across all runs.

F10h, revE:
-----------
Running NOP tests, 1000 NOPs x 100 repetitions

K8:
  90 288.212282 cycles
   66 90 288.220840 cycles
66 66 90 288.219447 cycles
 66 66 66 90 288.223204 cycles
  66 66 90 66 90 571.393424 cycles
   66 66 90 66 66 90 571.374919 cycles
66 66 66 90 66 66 90 572.249281 cycles
 66 66 66 90 66 66 66 90 571.388651 cycles

P6:
  90 288.214193 cycles
   66 90 288.225550 cycles
0f 1f 00 288.224441 cycles
 0f 1f 40 00 288.225030 cycles
  0f 1f 44 00 00 288.233558 cycles
   66 0f 1f 44 00 00 324.792342 cycles
0f 1f 80 00 00 00 00 325.657462 cycles
 0f 1f 84 00 00 00 00 00 430.246643 cycles

F14h:
-----
Running NOP tests, 1000 NOPs x 100 repetitions

K8:
  90 510.404890 cycles
   66 90 510.432117 cycles
66 66 90 510.561858 cycles
 66 66 66 90 510.541865 cycles
  66 66 90 66 90 1014.192782 cycles
   66 66 90 66 66 90 1014.226546 cycles
66 66 66 90 66 66 90 1014.334299 cycles
 66 66 66 90 66 66 66 90 1014.381205 cycles

P6:
  90 510.436710 cycles
   66 90 510.448229 cycles
0f 1f 00 510.545100 cycles
 0f 1f 40 00 510.502792 cycles
  0f 1f 44 00 00 510.589517 cycles
   66 0f 1f 44 00 00 510.611462 cycles
0f 1f 80 00 00 00 00 511.166794 cycles
 0f 1f 84 00 00 00 00 00 511.651641 cycles

F15h:
-----
Running NOP tests, 1000 NOPs x 100 repetitions

K8:
  90 243.128396 cycles
   66 90 243.129883 cycles
66 66 90 243.131631 cycles
 66 66 66 90 242.499324 cycles
  66 66 90 66 90 481.829083 cycles
   66 66 90 66 66 90 481.884413 cycles
66 66 66 90 66 66 90 481.851446 cycles
 66 66 66 90 66 66 66 90 481.409220 cycles

P6:
  90 243.127026 cycles
   66 90 243.130711 cycles
0f 1f 00 243.122747 cycles
 0f 1f 40 00 242.497617 cycles
  0f 1f 44 00 00 245.354461 cycles
   66 0f 1f 44 00 00 361.930417 cycles
0f 1f 80 00 00 00 00 362.844944 cycles
 0f 1f 84 00 00 00 00 00 480.514948 cycles

F16h:
-----
Running NOP tests, 1000 NOPs x 100 repetitions

K8:
  90 507.793298 cycles
   66 90 507.789636 cycles
66 66 90 507.826490 cycles
 66 66 66 90 507.859075 cycles
  66 66 90 66 90 1008.663129 cycles
   66 66 90 66 66 90 1008.696259 cycles
66 66 66 90 66 66 90 1008.692517 cycles
 66 66 66 90 66 66 66 90 1008.755399 cycles

P6:
  90 507.795232 cycles
   66 90 507.794761 cycles
0f 1f 00 507.834901 cycles
 0f 1f 40 00 507.822629 cycles
  0f 1f 44 00 00 507.838493 cycles
   66 0f 1f 44 00 00 507.908597 cycles
0f 1f 80 00 00 00 00 507.946417 cycles
 0f 1f 84 00 00 00 00 00 507.954960 cycles

Signed-off-by: Borislav Petkov <b...@suse.de>
Cc: Aravind Gopalakrishnan <aravind.gopalakrish...@amd.com>
---
 arch/x86/kernel/alternative.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index aef653193160..b0932c4341b3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -227,6 +227,15 @@ void __init arch_init_ideal_nops(void)
 #endif
 		}
 		break;
+
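
(The rest of the hunk is cut off in the archive. A minimal sketch of
what the addition described above would look like, reusing the names
already present in arch_init_ideal_nops(); the exact family cutoff is
an assumption here:)

	case X86_VENDOR_AMD:
		/* Sketch only - the archived hunk is truncated; the family
		 * check is inferred from the commit message above. */
		if (boot_cpu_data.x86 > 0xf) {
			ideal_nops = p6_nops;
			return;
		}

		/* fall through to the K8 NOPs on older families */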

Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-05-01 Thread Borislav Petkov
On Thu, Apr 30, 2015 at 04:23:26PM -0700, H. Peter Anvin wrote:
> I probably should have added that the microbenchmark specifically tests
> for an atomic 5-byte NOP (as required by tracepoints etc.)  If the
> requirement for 5-byte atomic is dropped there might be faster
> combinations, e.g. 66 66 66 90 might work better on some platforms, or
> using very long 0F 1F sequences.

Right, so I was thinking of extending it for all sizes 1 - 15 and then
run it on boxes. I'll keep you posted.

Thanks for the toy - now I get to have some fun. :-)

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-30 Thread H. Peter Anvin
On 04/30/2015 02:39 PM, H. Peter Anvin wrote:
> This is the microbenchmark I used.
> 
> For the record, Intel's intention going forward is that 0F 1F will
> always be as fast or faster than any other alternative.
> 

I probably should have added that the microbenchmark specifically tests
for an atomic 5-byte NOP (as required by tracepoints etc.)  If the
requirement for 5-byte atomic is dropped there might be faster
combinations, e.g. 66 66 66 90 might work better on some platforms, or
using very long 0F 1F sequences.
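
To illustrate the constraint (an assumed example, not from the thread's
measurements): a patch site that toggles between a 5-byte jump and its
NOP replacement must stay a single instruction, since another CPU may
execute the site mid-patch. Two shorter NOPs could be observed
half-rewritten:

/* The two states of a hypothetical 5-byte patch site, shown back to
 * back for their encodings; a real site holds exactly one of them. */
static void patch_site_states(void)
{
	asm volatile(".byte 0xe9,0x00,0x00,0x00,0x00");	/* jmp +0 (enabled) */
	asm volatile(".byte 0x0f,0x1f,0x44,0x00,0x00");	/* 5-byte NOPL (disabled) */
}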

-hpa




Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-30 Thread H. Peter Anvin
This is the microbenchmark I used.

For the record, Intel's intention going forward is that 0F 1F will
always be as fast or faster than any other alternative.

-hpa

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdbool.h>
#include <sys/time.h>

static void nop_p6(void)
{
  asm volatile(".rept 1000\n"
   ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
   ".endr");
}

static void nop_k8(void)
{
  asm volatile(".rept 1000\n"
   ".byte 0x66,0x66,0x66,0x66,0x90\n"
   ".endr");
}

static void nop_lea(void)
{
#ifdef __x86_64__
  asm volatile(".rept 1000\n"
   ".byte 0x48,0x8d,0x74,0x26,0x00\n"
   ".endr");
#else
  asm volatile(".rept 1000\n"
   ".byte 0x3e,0x8d,0x74,0x26,0x00\n"
   ".endr");
#endif
}

static void nop_jmp5(void)
{
  asm volatile(".rept 1000\n"
   ".byte 0xe9,0,0,0,0\n"
   ".endr");
}

static void nop_jmp2(void)
{
  asm volatile(".rept 1000\n"
   ".byte 0xeb,3,0x90,0x90,0x90\n"
   ".endr");
}

static void nop_xchg(void)
{
  asm volatile(".rept 1000\n"
   ".byte 0x66,0x66,0x66,0x87,0xc0\n"
   ".endr");
}

static void nop_mov(void)
{
  asm volatile(".rept 1000\n"
   ".byte 0x66,0x66,0x66,0x89,0xc0\n"
   ".endr");
}

static void nop_fdisi(void)
{
  asm volatile(".rept 1000\n"
   ".byte 0x66,0x66,0x66,0xdb,0xe1\n"
   ".endr");
}
  
static void nop_feni(void)
{
  asm volatile(".rept 1000\n"
   ".byte 0x66,0x66,0x66,0xdb,0xe0\n"
   ".endr");
}

struct test_list {
  const char *name;
  void (*func)(void);
};

static const struct test_list tests[] = {
  { "P6 NOPs (NOPL)", nop_p6 },
  { "K8 NOPs (66 90)", nop_k8 },
  { "LEA", nop_lea },
  { "XCHG", nop_xchg },
  { "MOV", nop_mov },
  { "FDISI", nop_fdisi },
  { "FENI", nop_feni },
  { "E9 JMP", nop_jmp5 },
  { "EB JMP", nop_jmp2 },
  { NULL, NULL }
};

static void benchmark(const struct test_list *test, bool warmup)
{
  struct timeval tv0, tv1;
  int i;
  const int reps = 10;
  unsigned long long us;
  
  gettimeofday(&tv0, NULL);
  for (i = 0; i < reps; i++)
test->func();
  gettimeofday(&tv1, NULL);

  us = (tv1.tv_sec - tv0.tv_sec) * 1000000ULL +
	((int)tv1.tv_usec - (int)tv0.tv_usec);

  if (!warmup)
printf("%s: %d repetitions at %llu us\n", test->name, reps, us);
}

int main(void)
{
  const struct test_list *test;
  
  for (test = tests; test->func; test++) {
benchmark(test, true);
benchmark(test, false);
  }

  return 0;
}


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Borislav Petkov
On Tue, Apr 28, 2015 at 10:16:33AM -0700, Linus Torvalds wrote:
> I suspect it might be related to things like getting performance
> counters and instruction debug traps etc right. There are quite
> possibly also simply constraints where the front end has to generate
> *something* just to keep the back end happy.
> 
> The front end can generally not just totally remove things without any
> tracking, since the front end doesn't know if things are speculative
> etc. So you can't do instruction debug traps in the front end afaik.
> Or rather, I'm sure you *could*, but in general I suspect the best way
> to handle nops without making them *too* special is to bunch up
> several to make them look like one big instruction, and then associate
> that bunch with some minimal tracking uop that uses minimal resources
> in the back end without losing sight of the original nop entirely, so
> that you can still do checks at retirement time.

Yeah, I was thinking about a simplified uop for tracking - makes most
sense ...

> So I think the "you can do ~5 nops per cycle" is not unreasonable.
> Even in the uop cache, the nops have to take some space, and have to
> do things like update eip, so I don't think they'll ever be entirely
> free, the best you can do is minimize their impact.

... exactly! So something needs to increment rIP so you either need to
special-handle that and remember by how many bytes to increment and
exactly *when* at retire time or simply use a barebones, simplified uop
which does that for you for free and flows down the pipe. Yeah, that
makes a lot of sense!

> Yeah. That looks somewhat reasonable. I think the 16h architecture
> technically decodes just two instructions per cycle,

Yeah, fetch 32B and look at two 16B for max 2 insns per cycle. I.e.,
two-way.

> but I wouldn't be surprised if there's some simple nop special casing
> going on so that it can decode three nops in one go when things line
> up right.

Right.

> So you might get 0.33 cycles for the best case, but then 0.5 cycles
> when it crosses a 16-byte boundary or something. So you might have
> some pattern where it decodes 32 bytes worth of nops as 12/8/12 bytes
> (3/2/3 instructions), which would come out to 0.38 cycles. Add some
> random overhead for the loop, and I could see the 0.39 cycles.
>
> That was wild handwaving with no data to back it up, but I'm trying
> to explain to myself why you could get some odd number like that. It
> seems _possible_ at least.

Yep, that makes sense.

Now if only we had some numbers to back this up with... I'll play with
this more.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Linus Torvalds
On Tue, Apr 28, 2015 at 9:58 AM, Borislav Petkov <b...@alien8.de> wrote:
>
> Well, AFAIK, NOPs do require resources for tracking in the machine. I
> was hoping that hw would be smarter and discard at decode time but there
> probably are reasons that it can't be done (...yet).

I suspect it might be related to things like getting performance
counters and instruction debug traps etc right. There are quite
possibly also simply constraints where the front end has to generate
*something* just to keep the back end happy.

The front end can generally not just totally remove things without any
tracking, since the front end doesn't know if things are speculative
etc. So you can't do instruction debug traps in the front end afaik.
Or rather, I'm sure you *could*, but in general I suspect the best way
to handle nops without making them *too* special is to bunch up
several to make them look like one big instruction, and then associate
that bunch with some minimal tracking uop that uses minimal resources
in the back end without losing sight of the original nop entirely, so
that you can still do checks at retirement time.

So I think the "you can do ~5 nops per cycle" is not unreasonable.
Even in the uop cache, the nops have to take some space, and have to
do things like update eip, so I don't think they'll ever be entirely
free, the best you can do is minimize their impact.

> $ taskset -c 3 ./t
> Running 60 times, 100 loops per run.
> nop_0x90 average: 0.390625
> nop_3_byte average: 0.390625
>
> and those exact numbers are actually reproducible pretty reliably.

Yeah. That looks somewhat reasonable. I think the 16h architecture
technically decodes just two instructions per cycle, but I wouldn't be
surprised if there's some simple nop special casing going on so that
it can decode three nops in one go when things line up right. So you
might get 0.33 cycles for the best case, but then 0.5 cycles when it
crosses a 16-byte boundary or something. So you might have some
pattern where it decodes 32 bytes worth of nops as 12/8/12 bytes
(3/2/3 instructions), which would come out to 0.38 cycles. Add some
random overhead for the loop, and I could see the 0.39 cycles.

That was wild handwaving with no data to back it up, but I'm trying to
explain to myself why you could get some odd number like that. It
seems _possible_ at least.

Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Borislav Petkov
On Tue, Apr 28, 2015 at 09:28:52AM -0700, Linus Torvalds wrote:
> On Tue, Apr 28, 2015 at 8:55 AM, Borislav Petkov <b...@alien8.de> wrote:
> >
> > Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
> > better than the 0F 1F 00 suggested by the manual (Haha!):
> 
> That's which AMD CPU?

F16h.

> On my intel i7-4770S, they are the same cost (I cut down your loop
> numbers by an order of magnitude each because I couldn't be arsed to
> wait for it, so it might be off by a cycle or two):
> 
> Running 60 times, 100 loops per run.
> nop_0x90 average: 81.065681
> nop_3_byte average: 80.230101
> 
> That said, I think your benchmark tests the speed of "rdtsc" rather
> than the no-ops. Putting the read_tsc inside the inner loop basically
> makes it swamp everything else.

Whoops, now that you mention it... of course, that RDTSC *along* with
the barriers around it is much much more expensive than the NOPs.

> > $ taskset -c 3 ./nops
> > Running 600 times, 1000 loops per run.
> > nop_0x90 average: 439.805220
> > nop_3_byte average: 442.412915
> 
> I think that's in the noise, and could be explained by random
> alignment of the loop too, or even random factors like "the CPU heated
> up, so the later run was slightly slower". The difference between 439
> and 442 doesn't strike me as all that significant.
> 
> It might be better to *not* inline, and instead make a real function
> call to something that has a lot of no-ops (do some preprocessor magic
> to make more no-ops in one go). At least that way the alignment is
> likely the same for the two cases.

malloc a page, populate it with NOPs, slap a RET at the end and jump to
it? Maybe even more than 1 page?
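
Roughly like this, I guess - a sketch of that idea (hypothetical
helper; mmap instead of malloc because the page must be executable,
and the read_tsc() timing around the call is left out):

#include <string.h>
#include <sys/mman.h>

static void run_nop_page(void)
{
	unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void (*fn)(void);

	if (p == MAP_FAILED)
		return;

	memset(p, 0x90, 4095);		/* fill the page with 1-byte NOPs */
	p[4095] = 0xc3;			/* slap a RET at the end */

	fn = (void (*)(void))p;
	fn();				/* jump to it */

	munmap(p, 4096);
}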

> Or if not that, then I think you're better off with something like
> 
> p1 = read_tsc();
> for (i = 0; i < LOOPS; i++) {
> nop_0x90();
> 
> }
> p2 = read_tsc();
> r = (p2 - p1);
> 
> because while you're now measuring the loop overhead too, that's
> *much* smaller than the rdtsc overhead. So I get something like

Yap, that looks better.

> Running 600 times, 100 loops per run.
> nop_0x90 average: 3.786935
> nop_3_byte average: 3.677228
> 
> and notice the difference between "~80 cycles" and "~3.7 cycles".
> Yeah, that's rdtsc. I bet your 440 is about the same thing too.
> 
> Btw, the whole thing about "averaging cycles" is not the right thing
> to do either. You should probably take the *minimum* cycles count, not
> the average, because anything non-minimal means "some perturbation"
> (ie interrupt etc).

My train of thought was: if you do a *lot* of runs, perturbations would
average out. But ok, noted.

> So I think something like the attached would be better. It gives an
> approximate "cycles per one four-byte nop", and I get
> 
> [torvalds@i7 ~]$ taskset -c 3 ./a.out
> Running 60 times, 100 loops per run.
> nop_0x90 average: 0.200479
> nop_3_byte average: 0.199694
> 
> which sounds suspiciously good to me (5 nops per cycle? uop cache and
> nop compression, I guess).

Well, AFAIK, NOPs do require resources for tracking in the machine. I
was hoping that hw would be smarter and discard at decode time but there
probably are reasons that it can't be done (...yet).

So they most likely get discarded at retire time, and I can't imagine
what an otherwise relatively idle core's ROB would look like with a
gazillion NOPs in it. Those things need hw traces. Maybe in another
life. :-)

$ taskset -c 3 ./t
Running 60 times, 100 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

and those exact numbers are actually reproducible pretty reliably.

$ taskset -c 3 ./t
Running 60 times, 100 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625
$ taskset -c 3 ./t
Running 60 times, 100 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625
$ taskset -c 3 ./t
Running 60 times, 100 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

Hmm, so what are we saying? Modern CPUs should use one set of NOPs and
that's it...

Maybe we need to do more measurements...

Hmmm.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Linus Torvalds
On Tue, Apr 28, 2015 at 8:55 AM, Borislav Petkov <b...@alien8.de> wrote:
>
> Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
> better than the 0F 1F 00 suggested by the manual (Haha!):

That's which AMD CPU?

On my intel i7-4770S, they are the same cost (I cut down your loop
numbers by an order of magnitude each because I couldn't be arsed to
wait for it, so it might be off by a cycle or two):

Running 60 times, 100 loops per run.
nop_0x90 average: 81.065681
nop_3_byte average: 80.230101

That said, I think your benchmark tests the speed of "rdtsc" rather
than the no-ops. Putting the read_tsc inside the inner loop basically
makes it swamp everything else.

> $ taskset -c 3 ./nops
> Running 600 times, 1000 loops per run.
> nop_0x90 average: 439.805220
> nop_3_byte average: 442.412915

I think that's in the noise, and could be explained by random
alignment of the loop too, or even random factors like "the CPU heated
up, so the later run was slightly slower". The difference between 439
and 442 doesn't strike me as all that significant.

It might be better to *not* inline, and instead make a real function
call to something that has a lot of no-ops (do some preprocessor magic
to make more no-ops in one go). At least that way the alignment is
likely the same for the two cases.

Or if not that, then I think you're better off with something like

p1 = read_tsc();
for (i = 0; i < LOOPS; i++) {
nop_0x90();

}
p2 = read_tsc();
r = (p2 - p1);

because while you're now measuring the loop overhead too, that's
*much* smaller than the rdtsc overhead. So I get something like

Running 600 times, 100 loops per run.
nop_0x90 average: 3.786935
nop_3_byte average: 3.677228

and notice the difference between "~80 cycles" and "~3.7 cycles".
Yeah, that's rdtsc. I bet your 440 is about the same thing too.

Btw, the whole thing about "averaging cycles" is not the right thing
to do either. You should probably take the *minimum* cycles count, not
the average, because anything non-minimal means "some perturbation"
(ie interrupt etc).

So I think something like the attached would be better. It gives an
approximate "cycles per one four-byte nop", and I get

[torvalds@i7 ~]$ taskset -c 3 ./a.out
Running 60 times, 100 loops per run.
nop_0x90 average: 0.200479
nop_3_byte average: 0.199694

which sounds suspiciously good to me (5 nops per cycle? uop cache and
nop compression, I guess).

Linus
/*
 * $ taskset -c 3 ./nops
 * Running 600 times, 1000 loops per run.
 * nop_0x90 average: 439.805220
 * nop_3_byte average: 442.412915
 *
 * How to run:
 *
 * taskset -c <cpunum> argv0
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <unistd.h>

typedef unsigned long long u64;

#define TWO(a) a; a;
#define FOUR(a) TWO(TWO(a))
#define SIXTEEN(a) FOUR(FOUR(a))
#define TWOFIVESIX(a) SIXTEEN(SIXTEEN(a))

#define DECLARE_ARGS(val, low, high)	unsigned low, high
#define EAX_EDX_VAL(val, low, high)	((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)	"a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)	"=a" (low), "=d" (high)

static __always_inline unsigned long long rdtsc(void)
{
DECLARE_ARGS(val, low, high);

asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));

return EAX_EDX_VAL(val, low, high);
}

static inline u64 read_tsc(void)
{
	u64 ret;

	asm volatile("mfence");
	ret = rdtsc();
	asm volatile("mfence");

	return ret;
}

static void nop_0x90(void)
{
	TWOFIVESIX(asm volatile(".byte 0x66, 0x66, 0x90"))
}

static void nop_3_byte(void)
{
	TWOFIVESIX(asm volatile(".byte 0x0f, 0x1f, 0x00"))
}

int main()
{
	int i, j;
	u64 p1, p2;
	u64 r, min;

#define TIMES 60
#define LOOPS 100ULL

	printf("Running %d times, %lld loops per run.\n", TIMES, LOOPS);

	min = ~0ULL;	/* start at max so the first measurement wins */

	for (r = 0, j = 0; j < TIMES; j++) {
		p1 = read_tsc();
		for (i = 0; i < LOOPS; i++) {
			nop_0x90();

		}
		p2 = read_tsc();
		r = (p2 - p1);

		if (r < min)
			min = r;
	}

	printf("nop_0x90 average: %f\n", min / (double) LOOPS / 256);

	min = ~0ULL;

	for (r = 0, j = 0; j < TIMES; j++) {
		p1 = read_tsc();
		for (i = 0; i < LOOPS; i++) {
			nop_3_byte();
		}
		p2 = read_tsc();

		r = (p2 - p1);
		if (r < min)
			min = r;
	}

	printf("nop_3_byte average: %f\n", min / (double) LOOPS / 256);

	return 0;
}


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 01:14:51PM -0700, H. Peter Anvin wrote:
> I did a microbenchmark in user space... let's see if I can find it.

How about the simple one below?

Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
better than the 0F 1F 00 suggested by the manual (Haha!):

$ taskset -c 3 ./nops
Running 600 times, 1000 loops per run.
nop_0x90 average: 439.805220
nop_3_byte average: 442.412915

---
/*
 * How to run:
 *
 * taskset -c <cpunum> argv0
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <unistd.h>

typedef unsigned long long u64;

#define DECLARE_ARGS(val, low, high)	unsigned low, high
#define EAX_EDX_VAL(val, low, high)	((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)	"a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)	"=a" (low), "=d" (high)

static __always_inline unsigned long long rdtsc(void)
{
DECLARE_ARGS(val, low, high);

asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));

return EAX_EDX_VAL(val, low, high);
}

static inline u64 read_tsc(void)
{
u64 ret;

asm volatile("mfence");
ret = rdtsc();
asm volatile("mfence");

return ret;
}

static inline void nop_0x90(void)
{
asm volatile(
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"

".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"

".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
".byte 0x66, 0x66, 0x90\n\t"
);
}

static inline void nop_3_byte(void)
{
asm volatile(
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"

".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"

".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
".byte 0x0f, 0x1f, 0x00\n\t"
);
}

int main()
{
int i, j;
u64 p1, p2;
u64 r;
double avg, t;

#define TIMES 600
#define LOOPS 1000ULL

printf("Running %d times, %lld loops per run.\n", TIMES, LOOPS);

avg = 0;

for (r = 0, j = 0; j < TIMES; j++) {
for (i = 0; i < LOOPS; i++) {
p1 = read_tsc();
nop_0x90();
p2 = read_tsc();

r += (p2 - p1);
}

t = (double)r / LOOPS;

//  printf("NOP cycles: %lld, cycles/nop_0x90: %f\n", r, t);
avg += t;
r = 0;
}

printf("nop_0x90 average: %f\n", avg/TIMES);

avg = 0;

for (r = 0, j = 0; j < TIMES; j++) {
for (i = 0; i < LOOPS; i++) {
p1 = read_tsc();
nop_3_byte();
p2 = read_tsc();

r += (p2 - p1);
}

t = (double)r / LOOPS;

//  printf("NOP cycles: %lld, cycles/nop_3_byte: %f\n", r, t);
avg += t;
r = 0;
}

printf("nop_3_byte average: %f\n", avg/TIMES);

return 0;
}

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 09:45:12PM +0200, Borislav Petkov wrote:
> > Maybe you are measuring random noise.
> 
> Yeah. Last exercise tomorrow. Let's see what those numbers would look
> like.

Right, so with Mel's help, I did a simple microbenchmark to measure how
many cycles a syscall (getpid()) needs on 4.1-rc1 and with your patch.

4.1-rc1
---
Running 60 times, 1000 loops per run.
Cycles: 3977233027, cycles/syscall: 397.723303
Cycles: 3964979519, cycles/syscall: 396.497952
Cycles: 3962470261, cycles/syscall: 396.247026
Cycles: 3963524693, cycles/syscall: 396.352469
Cycles: 3962853704, cycles/syscall: 396.285370
Cycles: 3964603727, cycles/syscall: 396.460373
Cycles: 3964758926, cycles/syscall: 396.475893
Cycles: 3965268019, cycles/syscall: 396.526802
Cycles: 3962198683, cycles/syscall: 396.219868
...


4.1-rc1 + 17be0aec74fb036eb4eb32c2268f3420a034762b from tip
---
Running 60 times, 1000 loops per run.
Cycles: 3973575441, cycles/syscall: 397.357544
Cycles: 3963999393, cycles/syscall: 396.399939
Cycles: 3962613575, cycles/syscall: 396.261357
Cycles: 3963440408, cycles/syscall: 396.344041
Cycles: 3963475255, cycles/syscall: 396.347526
Cycles: 3964471785, cycles/syscall: 396.447179
Cycles: 3962890513, cycles/syscall: 396.289051
Cycles: 3964940114, cycles/syscall: 396.494011
Cycles: 3964186426, cycles/syscall: 396.418643
...

So yeah, your patch is fine - provided I've done everything right. Here's
the microbenchmark:

---
/*
 * How to run:
 *
 * taskset -c <cpunum> ./sys
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <unistd.h>

typedef unsigned long long u64;

#define DECLARE_ARGS(val, low, high)	unsigned low, high
#define EAX_EDX_VAL(val, low, high)	((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)	"a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)	"=a" (low), "=d" (high)

static __always_inline unsigned long long rdtsc(void)
{
DECLARE_ARGS(val, low, high);

asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));

return EAX_EDX_VAL(val, low, high);
}

static long my_getpid(void)
{
  long ret;
  asm volatile ("syscall" :
"=a" (ret) :
"a" (SYS_getpid) :
"memory", "cc", "rcx", "r11");
  return ret;
}

static inline u64 read_tsc(void)
{
u64 ret;

asm volatile("mfence");
ret = rdtsc();
asm volatile("mfence");

return ret;
}


int main()
{
int i, j;
u64 p1, p2;
u64 count = 0;

#define TIMES 60
#define LOOPS 1000ULL

printf("Running %d times, %lld loops per run.\n", TIMES, LOOPS);

for (j = 0; j < TIMES; j++) {
for (i = 0; i < LOOPS; i++) {
p1 = read_tsc();
my_getpid();
p2 = read_tsc();

count += (p2 - p1);
}

printf("Cycles: %lld, cycles/syscall: %f\n",
count, (double)count / LOOPS);

count = 0;
}

return 0;
}

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 09:45:12PM +0200, Borislav Petkov wrote:
  Maybe you are measuring random noise.
 
 Yeah. Last exercise tomorrow. Let's see what those numbers would look
 like.

Right, so with Mel's help, I did a simple microbenchmark to measure how
many cycles a syscall (getpid()) needs on 4.1-rc1 and with your patch.

4.1-rc1
---
Running 60 times, 1000 loops per run.
Cycles: 3977233027, cycles/syscall: 397.723303
Cycles: 3964979519, cycles/syscall: 396.497952
Cycles: 3962470261, cycles/syscall: 396.247026
Cycles: 3963524693, cycles/syscall: 396.352469
Cycles: 3962853704, cycles/syscall: 396.285370
Cycles: 3964603727, cycles/syscall: 396.460373
Cycles: 3964758926, cycles/syscall: 396.475893
Cycles: 3965268019, cycles/syscall: 396.526802
Cycles: 3962198683, cycles/syscall: 396.219868
...


4.1-rc1 + 17be0aec74fb036eb4eb32c2268f3420a034762b from tip
---
Running 60 times, 1000 loops per run.
Cycles: 3973575441, cycles/syscall: 397.357544
Cycles: 3963999393, cycles/syscall: 396.399939
Cycles: 3962613575, cycles/syscall: 396.261357
Cycles: 3963440408, cycles/syscall: 396.344041
Cycles: 3963475255, cycles/syscall: 396.347526
Cycles: 3964471785, cycles/syscall: 396.447179
Cycles: 3962890513, cycles/syscall: 396.289051
Cycles: 3964940114, cycles/syscall: 396.494011
Cycles: 3964186426, cycles/syscall: 396.418643
...

So yeah, your patch is fine - provided I've done everything right. Here's
the microbenchmark:

---
/*
 * How to run:
 *
 * taskset -c cpunum ./sys
 */
#include stdio.h
#include sys/syscall.h
#include stdlib.h
#include unistd.h

typedef unsigned long long u64;

#define DECLARE_ARGS(val, low, high)unsigned low, high
#define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high)  32))
#define EAX_EDX_ARGS(val, low, high)a (low), d (high)
#define EAX_EDX_RET(val, low, high) =a (low), =d (high)

static __always_inline unsigned long long rdtsc(void)
{
DECLARE_ARGS(val, low, high);

asm volatile(rdtsc : EAX_EDX_RET(val, low, high));

return EAX_EDX_VAL(val, low, high);
}

static long my_getpid(void)
{
  long ret;
  asm volatile (syscall :
=a (ret) :
a (SYS_getpid) :
memory, cc, rcx, r11);
  return ret;
}

static inline u64 read_tsc(void)
{
u64 ret;

asm volatile(mfence);
ret = rdtsc();
asm volatile(mfence);

return ret;
}


int main()
{
int i, j;
u64 p1, p2;
u64 count = 0;

#define TIMES 60
#define LOOPS 1000ULL

printf(Running %d times, %lld loops per run.\n, TIMES, LOOPS);

for (j = 0; j  TIMES; j++) {
for (i = 0; i  LOOPS; i++) {
p1 = read_tsc();
my_getpid();
p2 = read_tsc();

count += (p2 - p1);
}

printf(Cycles: %lld, cycles/syscall: %f\n,
count, (double)count / LOOPS);

count = 0;
}

return 0;
}

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 01:14:51PM -0700, H. Peter Anvin wrote:
 I did a microbenchmark in user space... let's see if I can find it.

How about the simple one below?

Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
better than the 0F 1F 00 suggested by the manual (Haha!):

$ taskset -c 3 ./nops
Running 600 times, 1000 loops per run.
nop_0x90 average: 439.805220
nop_3_byte average: 442.412915

---
/*
 * How to run:
 *
 * taskset -c cpunum argv0
 */
#include stdio.h
#include sys/syscall.h
#include stdlib.h
#include unistd.h

typedef unsigned long long u64;

#define DECLARE_ARGS(val, low, high)unsigned low, high
#define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high)  32))
#define EAX_EDX_ARGS(val, low, high)a (low), d (high)
#define EAX_EDX_RET(val, low, high) =a (low), =d (high)

static __always_inline unsigned long long rdtsc(void)
{
DECLARE_ARGS(val, low, high);

asm volatile(rdtsc : EAX_EDX_RET(val, low, high));

return EAX_EDX_VAL(val, low, high);
}

static inline u64 read_tsc(void)
{
u64 ret;

asm volatile(mfence);
ret = rdtsc();
asm volatile(mfence);

return ret;
}

static inline void nop_0x90(void)
{
asm volatile(
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t

.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t

.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
.byte 0x66, 0x66, 0x90\n\t
);
}

static inline void nop_3_byte(void)
{
asm volatile(
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t

.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t

.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
.byte 0x0f, 0x1f, 0x00\n\t
);
}

int main()
{
int i, j;
u64 p1, p2;
u64 r;
double avg, t;

#define TIMES 600
#define LOOPS 1000ULL

printf(Running %d times, %lld loops per run.\n, TIMES, LOOPS);

avg = 0;

for (r = 0, j = 0; j  TIMES; j++) {
for (i = 0; i  LOOPS; i++) {
p1 = read_tsc();
nop_0x90();
p2 = read_tsc();

r += (p2 - p1);
}

t = (double)r / LOOPS;

//  printf(NOP cycles: %lld, cycles/nop_0x90: %f\n, r, t);
avg += t;
r = 0;
}

printf(nop_0x90 average: %f\n, avg/TIMES);

avg = 0;

for (r = 0, j = 0; j  TIMES; j++) {
for (i = 0; i  LOOPS; i++) {
p1 = read_tsc();
nop_3_byte();
p2 = read_tsc();

r += (p2 - p1);
}

t = (double)r / LOOPS;

//  printf(NOP cycles: %lld, cycles/nop_3_byte: %f\n, r, t);
avg += t;
r = 0;
}

printf(nop_3_byte average: %f\n, avg/TIMES);

return 0;
}

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Linus Torvalds
On Tue, Apr 28, 2015 at 8:55 AM, Borislav Petkov b...@alien8.de wrote:

 Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
 better than the 0F 1F 00 suggested by the manual (Haha!):

That's which AMD CPU?

On my intel i7-4770S, they are the same cost (I cut down your loop
numbers by an order of magnitude each because I couldn't be arsed to
wait for it, so it might be off by a cycle or two):

Running 60 times, 100 loops per run.
nop_0x90 average: 81.065681
nop_3_byte average: 80.230101

That said, I think your benchmark tests the speed of rdtsc rather
than the no-ops. Putting the read_tsc inside the inner loop basically
makes it swamp everything else.

 $ taskset -c 3 ./nops
 Running 600 times, 1000 loops per run.
 nop_0x90 average: 439.805220
 nop_3_byte average: 442.412915

I think that's in the noise, and could be explained by random
alignment of the loop too, or even random factors like the CPU heated
up, so the later run was slightly slower. The difference between 439
and 442 doesn't strike me as all that significant.

It might be better to *not* inline, and instead make a real function
call to something that has a lot of no-ops (do some preprocessor magic
to make more no-ops in one go). At least that way the alignment is
likely the same for the two cases.

Or if not that, then I think you're better off with something like

p1 = read_tsc();
for (i = 0; i  LOOPS; i++) {
nop_0x90();

}
p2 = read_tsc();
r = (p2 - p1);

because while you're now measuring the loop overhead too, that's
*much* smaller than the rdtsc overhead. So I get something like

Running 600 times, 100 loops per run.
nop_0x90 average: 3.786935
nop_3_byte average: 3.677228

and notice the difference between ~80 cycles and ~3.7 cycles.
Yeah, that's rdtsc. I bet your 440 is about the same thing too.

Btw, the whole thing about averaging cycles is not the right thing
to do either. You should probably take the *minimum* cycles count, not
the average, because anything non-minimal means some perturbation
(ie interrupt etc).

So I think something like the attached would be better. It gives an
approximate cycles per one four-byte nop, and I get

[torvalds@i7 ~]$ taskset -c 3 ./a.out
Running 60 times, 100 loops per run.
nop_0x90 average: 0.200479
nop_3_byte average: 0.199694

which sounds suspiciously good to me (5 nops per cycle? uop cache and
nop compression, I guess).

Linus
/*
 * $ taskset -c 3 ./nops
 * Running 600 times, 1000 loops per run.
 * nop_0x90 average: 439.805220
 * nop_3_byte average: 442.412915
 *
 * How to run:
 *
 * taskset -c cpunum argv0
 */
#include stdio.h
#include sys/syscall.h
#include stdlib.h
#include unistd.h

typedef unsigned long long u64;

#define TWO(a) a; a;
#define FOUR(a) TWO(TWO(a))
#define SIXTEEN(a) FOUR(FOUR(a))
#define TWOFIVESIX(a) SIXTEEN(SIXTEEN(a))

#define DECLARE_ARGS(val, low, high)unsigned low, high
#define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high)  32))
#define EAX_EDX_ARGS(val, low, high)a (low), d (high)
#define EAX_EDX_RET(val, low, high) =a (low), =d (high)

static __always_inline unsigned long long rdtsc(void)
{
DECLARE_ARGS(val, low, high);

asm volatile(rdtsc : EAX_EDX_RET(val, low, high));

return EAX_EDX_VAL(val, low, high);
}

static inline u64 read_tsc(void)
{
	u64 ret;

	asm volatile(mfence);
	ret = rdtsc();
	asm volatile(mfence);

	return ret;
}

static void nop_0x90(void)
{
	TWOFIVESIX(asm volatile(.byte 0x66, 0x66, 0x90))
}

static void nop_3_byte(void)
{
	TWOFIVESIX(asm volatile(.byte 0x0f, 0x1f, 0x00))
}

int main()
{
	int i, j;
	u64 p1, p2;
	u64 r, min;

#define TIMES 60
#define LOOPS 100ULL

	printf(Running %d times, %lld loops per run.\n, TIMES, LOOPS);

	min = 1;

	for (r = 0, j = 0; j  TIMES; j++) {
		p1 = read_tsc();
		for (i = 0; i  LOOPS; i++) {
			nop_0x90();

		}
		p2 = read_tsc();
		r = (p2 - p1);

		if (r  min)
			min = r;
	}

	printf(nop_0x90 average: %f\n, min / (double) LOOPS / 256);

	min = 1;

	for (r = 0, j = 0; j  TIMES; j++) {
		p1 = read_tsc();
		for (i = 0; i  LOOPS; i++) {
			nop_3_byte();
		}
		p2 = read_tsc();

		r = (p2 - p1);
		if (r  min)
			min = r;
	}

	printf(nop_3_byte average: %f\n, min / (double) LOOPS / 256);

	return 0;
}


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Borislav Petkov
On Tue, Apr 28, 2015 at 09:28:52AM -0700, Linus Torvalds wrote:
 On Tue, Apr 28, 2015 at 8:55 AM, Borislav Petkov b...@alien8.de wrote:
 
  Provided it is correct, it shows that the 0x66-prefixed 3-byte NOPs are
  better than the 0F 1F 00 suggested by the manual (Haha!):
 
 That's which AMD CPU?

F16h.

 On my intel i7-4770S, they are the same cost (I cut down your loop
 numbers by an order of magnitude each because I couldn't be arsed to
 wait for it, so it might be off by a cycle or two):
 
 Running 60 times, 100 loops per run.
 nop_0x90 average: 81.065681
 nop_3_byte average: 80.230101
 
 That said, I think your benchmark tests the speed of rdtsc rather
 than the no-ops. Putting the read_tsc inside the inner loop basically
 makes it swamp everything else.

Whoops, now that you mention it... of course, that RDTSC *along* with
the barriers around it is much much more expensive than the NOPs.

  $ taskset -c 3 ./nops
  Running 600 times, 1000 loops per run.
  nop_0x90 average: 439.805220
  nop_3_byte average: 442.412915
 
 I think that's in the noise, and could be explained by random
 alignment of the loop too, or even random factors like the CPU heated
 up, so the later run was slightly slower. The difference between 439
 and 442 doesn't strike me as all that significant.
 
 It might be better to *not* inline, and instead make a real function
 call to something that has a lot of no-ops (do some preprocessor magic
 to make more no-ops in one go). At least that way the alignment is
 likely the same for the two cases.

malloc a page, populate it with NOPs, slap a RET at the end and jump to
it? Maybe even more than 1 page?

 Or if not that, then I think you're better off with something like
 
 p1 = read_tsc();
 for (i = 0; i  LOOPS; i++) {
 nop_0x90();
 
 }
 p2 = read_tsc();
 r = (p2 - p1);
 
 because while you're now measuring the loop overhead too, that's
 *much* smaller than the rdtsc overhead. So I get something like

Yap, that looks better.

 Running 600 times, 100 loops per run.
 nop_0x90 average: 3.786935
 nop_3_byte average: 3.677228
 
 and notice the difference between ~80 cycles and ~3.7 cycles.
 Yeah, that's rdtsc. I bet your 440 is about the same thing too.
 
 Btw, the whole thing about averaging cycles is not the right thing
 to do either. You should probably take the *minimum* cycles count, not
 the average, because anything non-minimal means some perturbation
 (ie interrupt etc).

My train of thought was: if you do a *lot* of runs, perturbations would
average out. But ok, noted.

 So I think something like the attached would be better. It gives an
 approximate cycles per one four-byte nop, and I get
 
 [torvalds@i7 ~]$ taskset -c 3 ./a.out
 Running 60 times, 100 loops per run.
 nop_0x90 average: 0.200479
 nop_3_byte average: 0.199694
 
 which sounds suspiciously good to me (5 nops per cycle? uop cache and
 nop compression, I guess).

Well, AFAIK, NOPs do require resources for tracking in the machine. I
was hoping that hw would be smarter and discard at decode time but there
probably are reasons that it can't be done (...yet).

So they most likely get discarted at retire time and I can't imagine how
an otherwise relatively idle core's ROB with gazillion of NOPs would
look like. Those things need hw traces. Maybe in another life. :-)

$ taskset -c 3 ./t
Running 60 times, 100 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

and those exact numbers are actually reproducible pretty reliably.

$ taskset -c 3 ./t
Running 60 times, 100 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625
$ taskset -c 3 ./t
Running 60 times, 100 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625
$ taskset -c 3 ./t
Running 60 times, 100 loops per run.
nop_0x90 average: 0.390625
nop_3_byte average: 0.390625

Hmm, so what are we saying? Modern CPUs should use one set of NOPs and
that's it...

Maybe we need to do more measurements...

Hmmm.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Linus Torvalds
On Tue, Apr 28, 2015 at 9:58 AM, Borislav Petkov b...@alien8.de wrote:

 Well, AFAIK, NOPs do require resources for tracking in the machine. I
 was hoping that hw would be smarter and discard at decode time but there
 probably are reasons that it can't be done (...yet).

I suspect it might be related to things like getting performance
counters and instruction debug traps etc right. There are quite
possibly also simply constraints where the front end has to generate
*something* just to keep the back end happy.

The front end can generally not just totally remove things without any
tracking, since the front end doesn't know if things are speculative
etc. So you can't do instruction debug traps in the front end afaik.
Or rather, I'm sure you *could*, but in general I suspect the best way
to handle nops without making them *too* special is to bunch up
several to make them look like one big instruction, and then associate
that bunch with some minimal tracking uop that uses minimal resources
in the back end without losing sight of the original nop entirely, so
that you can still do checks at retirement time.

So I think the you can do ~5 nops per cycle is not unreasonable.
Even in the uop cache, the nops have to take some space, and have to
do things like update eip, so I don't think they'll ever be entirely
free, the best you can do is minimize their impact.

 $ taskset -c 3 ./t
 Running 60 times, 100 loops per run.
 nop_0x90 average: 0.390625
 nop_3_byte average: 0.390625

 and those exact numbers are actually reproducible pretty reliably.

Yeah. That looks somewhat reasonable. I think the 16h architecture
technically decodes just two instructions per cycle, but I wouldn't be
surprised if there's some simple nop special casing going on so that
it can decode three nops in one go when things line up right. So you
might get 0.33 cycles for the best case, but then 0.5 cycles when it
crosses a 16-byte boundary or something. So you might have some
pattern where it decodes 32 bytes worth of nops as 12/8/12 bytes
(3/2/3 instructions), which would come out to 0.38 cycles. Add some
random overhead for the loop, and I could see the 0.39 cycles.

That was wild handwaving with no data to back it up, but I'm trying to
explain to myself why you could get some odd number like that. It
seems _possiible_ at least.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-28 Thread Borislav Petkov
On Tue, Apr 28, 2015 at 10:16:33AM -0700, Linus Torvalds wrote:
 I suspect it might be related to things like getting performance
 counters and instruction debug traps etc right. There are quite
 possibly also simply constraints where the front end has to generate
 *something* just to keep the back end happy.
 
 The front end can generally not just totally remove things without any
 tracking, since the front end doesn't know if things are speculative
 etc. So you can't do instruction debug traps in the front end afaik.
 Or rather, I'm sure you *could*, but in general I suspect the best way
 to handle nops without making them *too* special is to bunch up
 several to make them look like one big instruction, and then associate
 that bunch with some minimal tracking uop that uses minimal resources
 in the back end without losing sight of the original nop entirely, so
 that you can still do checks at retirement time.

Yeah, I was thinking about a simplified uop for tracking - makes most
sense ...

 So I think the you can do ~5 nops per cycle is not unreasonable.
 Even in the uop cache, the nops have to take some space, and have to
 do things like update eip, so I don't think they'll ever be entirely
 free, the best you can do is minimize their impact.

... exactly! So something needs to increment rIP, so you either need to
special-handle that (and remember by how many bytes to increment and
exactly *when* at retire time) or simply use a barebones, simplified uop
which does that for you for free and flows down the pipe. Yeah, that
makes a lot of sense!

> Yeah. That looks somewhat reasonable. I think the 16h architecture
> technically decodes just two instructions per cycle,

Yeah, fetch 32B and look at two 16B for max 2 insns per cycle. I.e.,
two-way.

> but I wouldn't be surprised if there's some simple nop special casing
> going on so that it can decode three nops in one go when things line
> up right.

Right.

> So you might get 0.33 cycles for the best case, but then 0.5 cycles
> when it crosses a 16-byte boundary or something. So you might have
> some pattern where it decodes 32 bytes worth of nops as 12/8/12 bytes
> (3/2/3 instructions), which would come out to 0.38 cycles. Add some
> random overhead for the loop, and I could see the 0.39 cycles.
>
> That was wild handwaving with no data to back it up, but I'm trying
> to explain to myself why you could get some odd number like that. It
> seems _possible_ at least.

Yep, that makes sense.

Now if only we had some numbers to back this up with... I'll play with
this more.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread H. Peter Anvin
I did a microbenchmark in user space... let's see if I can find it.

On April 27, 2015 1:03:29 PM PDT, Borislav Petkov  wrote:
>On Mon, Apr 27, 2015 at 12:59:11PM -0700, H. Peter Anvin wrote:
>> It really comes down to this: it seems older cores from both Intel
>> and AMD perform better with 66 66 66 90, whereas the 0F 1F series is
>> better on newer cores.
>>
>> When I measured it, the differences were sometimes dramatic.
>
>How did you measure that? I should probably do the same on some newer
>AMD machines... We're using k8_nops on all AMD, even though the
>optimization manuals cite different NOPs on newer AMD families.
>
>Thanks.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.
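
A minimal sketch of what such a userspace NOP microbenchmark can look
like; this is a reconstruction of the approach, not hpa's actual code.
The buffer setup, the CPUID serialization before RDTSC, and taking the
minimum over runs are all assumptions:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static inline uint64_t rdtsc_serialized(void)
{
	uint32_t lo, hi;

	/* CPUID serializes the pipeline before RDTSC samples the TSC. */
	asm volatile("cpuid\n\trdtsc"
		     : "=a" (lo), "=d" (hi)
		     : "a" (0)
		     : "rbx", "rcx");
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	static const uint8_t nop4[] = { 0x66, 0x66, 0x66, 0x90 };
	enum { NOPS = 1000, RUNS = 100 };
	size_t len = NOPS * sizeof(nop4) + 1;
	uint8_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uint64_t best = ~0ULL;

	if (buf == MAP_FAILED)
		return 1;

	for (int i = 0; i < NOPS; i++)		/* build a NOP sled */
		memcpy(buf + i * sizeof(nop4), nop4, sizeof(nop4));
	buf[len - 1] = 0xc3;			/* terminate it with RET */

	for (int run = 0; run < RUNS; run++) {
		uint64_t t0 = rdtsc_serialized();

		((void (*)(void))buf)();	/* execute the NOP sled */

		uint64_t t1 = rdtsc_serialized();
		if (t1 - t0 < best)
			best = t1 - t0;
	}

	printf("%.3f cycles/NOP (min of %d runs)\n", (double)best / NOPS, RUNS);
	return 0;
}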


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 12:59:11PM -0700, H. Peter Anvin wrote:
> It really comes down to this: it seems older cores from both Intel
> and AMD perform better with 66 66 66 90, whereas the 0F 1F series is
> better on newer cores.
>
> When I measured it, the differences were sometimes dramatic.

How did you measure that? I should probably do the same on some newer
AMD machines... We're using k8_nops on all AMD, even though the
optimization manuals cite different NOPs on newer AMD families.

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread H. Peter Anvin
It really comes down to this: it seems older cores from both Intel and AMD 
perform better with 66 66 66 90, whereas the 0F 1F series is better on newer 
cores.

When I measured it, the differences were sometimes dramatic.

On April 27, 2015 11:53:44 AM PDT, Borislav Petkov  wrote:
>On Mon, Apr 27, 2015 at 11:47:30AM -0700, Linus Torvalds wrote:
>> On Mon, Apr 27, 2015 at 11:38 AM, Borislav Petkov wrote:
>> >
>> > So our current NOP-infrastructure does ASM_NOP_MAX NOPs of 8 bytes so
>> > without more invasive changes, our longest NOPs are 8 byte long and then
>> > we have to repeat.
>> 
>> Btw (and I'm too lazy to check) do we take alignment into account?
>> 
>> Because if you have to split, and use multiple nops, it is *probably*
>> a good idea to try to avoid 16-byte boundaries, since that can be
>> the I$ fetch granularity from L1 (although I guess 32B is getting
>> more common).
>
>Yeah, on F16h you have 32B fetch but the paths later in the machine
>get narrower, so to speak.
>
>> So the exact split might depend on the alignment of the nop replacement..
>
>Yeah, no. Our add_nops() is trivial:
>
>/* Use this to add nops to a buffer, then text_poke the whole buffer. */
>static void __init_or_module add_nops(void *insns, unsigned int len)
>{
>	while (len > 0) {
>		unsigned int noplen = len;
>		if (noplen > ASM_NOP_MAX)
>			noplen = ASM_NOP_MAX;
>		memcpy(insns, ideal_nops[noplen], noplen);
>		insns += noplen;
>		len -= noplen;
>	}
>}
>
>> Can we perhaps get rid of the distinction entirely, and just use one
>> set of 64-bit nops for both Intel/AMD?
>
>I *think* hpa would have an opinion here. I'm judging by looking at
>comments like this one in the code:
>
>/*
> * Due to a decoder implementation quirk, some
> * specific Intel CPUs actually perform better with
> * the "k8_nops" than with the SDM-recommended NOPs.
> */
>
>which is a fun one in itself. :-)

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 09:21:34PM +0200, Denys Vlasenko wrote:
> On 04/27/2015 09:11 PM, Borislav Petkov wrote:
> > A: 709.528485252 seconds time elapsed  ( +-  0.02% )
> > B: 708.976557288 seconds time elapsed  ( +-  0.04% )
> > C: 709.312844791 seconds time elapsed  ( +-  0.02% )
> > D: 709.400050112 seconds time elapsed  ( +-  0.01% )
> > E: 708.914562508 seconds time elapsed  ( +-  0.06% )
> > F: 709.602255085 seconds time elapsed  ( +-  0.02% )
> 
> That's about 0.2% variance. Very small.

Right, I'm doubtful this is the right workload for this, and whether any
workload would show a serious difference at all. Perhaps it all doesn't
really matter and we shouldn't do anything.

> Sounds obvious, but. Did you try running a test several times?

All runs so far are done with perf stat ... --repeat 10, so 10 kernel
builds and the results are averaged.

> Maybe you are measuring random noise.

Yeah. Last exercise tomorrow. Let's see what those numbers would look
like.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 09:11 PM, Borislav Petkov wrote:
> A: 709.528485252 seconds time elapsed  ( +-  0.02% )
> B: 708.976557288 seconds time elapsed  ( +-  0.04% )
> C: 709.312844791 seconds time elapsed  ( +-  0.02% )
> D: 709.400050112 seconds time elapsed  ( +-  0.01% )
> E: 708.914562508 seconds time elapsed  ( +-  0.06% )
> F: 709.602255085 seconds time elapsed  ( +-  0.02% )

That's about 0.2% variance. Very small.

Sounds obvious, but. Did you try running a test several times?
Maybe you are measuring random noise.




Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 08:38:54PM +0200, Borislav Petkov wrote:
> I'm running them now and will report numbers relative to the last run
> once it is done. And those numbers should in practice get even better if
> we revert to the simpler canonical-ness check but let's see...

Results are done. New row is F: which is with the F16h NOPs.

With all things equal and with this change ontop:

---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index aef653193160..d713080005ef 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -227,6 +227,14 @@ void __init arch_init_ideal_nops(void)
 #endif
}
break;
+
+   case X86_VENDOR_AMD:
+   if (boot_cpu_data.x86 == 0x16) {
+   ideal_nops = p6_nops;
+   return;
+   }
+
+   /* fall through */
default:
 #ifdef CONFIG_X86_64
ideal_nops = k8_nops;
---

... cycles, instructions, branches, branch-misses, context-switches
drop or remain roughly the same. BUT(!) timings increase:
cpu-clock/task-clock and the duration of the workload are the worst of
all possible cases.

So either those NOPs are not really optimal (so much for trusting the
manuals :-)) or it is their alignment.

But look at the chapter in the manual - "2.7.2.1 Encoding Padding for
Loop Alignment" - those NOPs are supposed to be used as padding, so
they themselves will not necessarily be aligned when you use them to
pad stuff.

Or maybe the longer NOPs are worse than the shorter 4-byte ones with
three 0x66 prefixes, which should "flow" more easily through the pipe
due to their smaller length.

Or something completely different...

Oh well, enough measurements for today - will do the rc1 measurement
tomorrow.

Thanks.

---
 Performance counter stats for 'system wide' (10 runs):

A: 2835570.145246  cpu-clock (msec)  ( +-  0.02% ) [100.00%]
B: 2833364.074970  cpu-clock (msec)  ( +-  0.04% ) [100.00%]
C: 2834708.335431  cpu-clock (msec)  ( +-  0.02% ) [100.00%]
D: 2835055.118431  cpu-clock (msec)  ( +-  0.01% ) [100.00%]
E: 2833115.118624  cpu-clock (msec)  ( +-  0.06% ) [100.00%]
F: 2835863.670798  cpu-clock (msec)  ( +-  0.02% ) [100.00%]

A: 2835570.099981  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.02% ) [100.00%]
B: 2833364.073633  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.04% ) [100.00%]
C: 2834708.350387  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.02% ) [100.00%]
D: 2835055.094383  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.01% ) [100.00%]
E: 2833115.145292  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.06% ) [100.00%]
F: 2835863.719556  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.02% ) [100.00%]

A: 5,591,213,166,613  cycles  # 1.972 GHz  ( +-  0.03% ) [75.00%]
B: 5,585,023,802,888  cycles  # 1.971 GHz  ( +-  0.03% ) [75.00%]
C: 5,587,983,212,758  cycles  # 1.971 GHz  ( +-  0.02% ) [75.00%]
D: 5,584,838,532,936  cycles  # 1.970 GHz  ( +-  0.03% ) [75.00%]
E: 5,583,979,727,842  cycles  # 1.971 GHz  ( +-  0.05% ) [75.00%]
F: 5,581,639,840,197  cycles  # 1.968 GHz  ( +-  0.03% ) [75.00%]

A: 3,106,707,101,530  instructions  # 0.56  insns per cycle  ( +-  0.01% ) [75.00%]
B: 3,106,632,251,528  instructions  # 0.56  insns per cycle  ( +-  0.00% ) [75.00%]
C: 3,106,265,958,142  instructions  # 0.56  insns per cycle  ( +-  0.00% ) [75.00%]
D: 3,106,294,801,185  instructions  # 0.56  insns per cycle  ( +-  0.00% ) [75.00%]
E: 3,106,381,223,355  instructions  # 0.56  insns per cycle  ( +-  0.01% ) [75.00%]
F: 3,105,996,162,436  instructions  # 0.56  insns per cycle  ( +-  0.00% ) [75.00%]

A:   683,676,044,429  branches  #  241.107 M/sec  ( +-  0.01% ) [75.00%]
B:   683,670,899,595  branches  #  241.293 M/sec  ( +-  0.01% ) [75.00%]
C:   683,675,772,858  branches  #  241.180 M/sec  ( +-  0.01% ) [75.00%]
D:   683,683,533,664  branches  #

Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 11:47:30AM -0700, Linus Torvalds wrote:
> On Mon, Apr 27, 2015 at 11:38 AM, Borislav Petkov  wrote:
> >
> > So our current NOP-infrastructure does ASM_NOP_MAX NOPs of 8 bytes so
> > without more invasive changes, our longest NOPs are 8 byte long and then
> > we have to repeat.
> 
> Btw (and I'm too lazy to check) do we take alignment into account?
> 
> Because if you have to split, and use multiple nops, it is *probably*
> a good idea to try to avoid 16-byte boundaries, since that can be
> the I$ fetch granularity from L1 (although I guess 32B is getting more
> common).

Yeah, on F16h you have 32B fetch but the paths later in the machine
get narrower, so to speak.

> So the exact split might depend on the alignment of the nop replacement..

Yeah, no. Our add_nops() is trivial:

/* Use this to add nops to a buffer, then text_poke the whole buffer. */
static void __init_or_module add_nops(void *insns, unsigned int len)
{
	while (len > 0) {
		unsigned int noplen = len;
		if (noplen > ASM_NOP_MAX)
			noplen = ASM_NOP_MAX;
		memcpy(insns, ideal_nops[noplen], noplen);
		insns += noplen;
		len -= noplen;
	}
}

> Can we perhaps get rid of the distinction entirely, and just use one
> set of 64-bit nops for both Intel/AMD?

I *think* hpa would have an opinion here. I'm judging by looking at
comments like this one in the code:

/*
 * Due to a decoder implementation quirk, some
 * specific Intel CPUs actually perform better with
 * the "k8_nops" than with the SDM-recommended NOPs.
 */

which is a fun one in itself. :-)

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
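
For illustration, an alignment-aware split could clamp each NOP so that it
ends at a 16-byte fetch boundary instead of straddling it. A sketch of the
idea only, reusing ideal_nops/ASM_NOP_MAX from above; this is not proposed
kernel code:

static void __init_or_module add_nops_aligned(void *insns, unsigned int len)
{
	while (len > 0) {
		/* bytes left in the current 16B fetch line */
		unsigned int room = 16 - ((unsigned long)insns & 15);
		unsigned int noplen = len;

		if (noplen > ASM_NOP_MAX)
			noplen = ASM_NOP_MAX;
		/* don't let a multi-byte NOP straddle the boundary */
		if (noplen > room)
			noplen = room;

		memcpy(insns, ideal_nops[noplen], noplen);
		insns += noplen;
		len -= noplen;
	}
}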


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 11:12:05AM -0700, Linus Torvalds wrote:
> So if one or two cycles in this code doesn't matter, then why are we
> adding alternate instructions just to avoid a few ALU instructions and
> a conditional branch that predicts perfectly? And if it does matter,
> then the 6-byte option looks clearly better..

You know what? I haven't even measured the tree *without* Denys'
stricter RCX canonical-ness patch. All numbers so far are from 4.0+ with
tip/master ontop which has Denys' patch.

And I *should* measure once with plain 4.1-rc1 and then with Denys'
patch ontop to see whether all this jumping through alternatives hoops
is even worth it.

If nothing else, it was a nice exercise. :-)

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 11:38 AM, Borislav Petkov  wrote:
>
> So our current NOP-infrastructure does ASM_NOP_MAX NOPs of 8 bytes so
> without more invasive changes, our longest NOPs are 8 byte long and then
> we have to repeat.

Btw (and I'm too lazy to check) do we take alignment into account?

Because if you have to split, and use multiple nops, it is *probably*
a good idea to try to avoid 16-byte boundaries, since that can be
the I$ fetch granularity from L1 (although I guess 32B is getting more
common).

So the exact split might depend on the alignment of the nop replacement..

> You can recognize the p6_nops being the same as in-the-manual-suggested
> F16h ones.

Can we perhaps get rid of the distinction entirely, and just use one
set of 64-bit nops for both Intel/AMD?

Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 11:14:15AM -0700, Linus Torvalds wrote:
> Btw, please don't use the "more than three 66h overrides" version.

Oh yeah, a notorious "frontend choker".

> Sure, that's what the optimization manual suggests if you want
> single-instruction decode for all sizes up to 15 bytes, but I think
> we're better off with the two-nop case for sizes 12-15 (4-byte nop
> followed by 8-11 byte nop).

Yeah, so says the manual. Although I wouldn't trust those manuals
blindly but that's another story.

> Because the "more than three 66h prefixes" really performs abysmally
> on some cores, iirc.

Right.

So our current NOP-infrastructure does ASM_NOP_MAX NOPs of 8 bytes so
without more invasive changes, our longest NOPs are 8 byte long and then
we have to repeat. This is consistent with what the code looks like here
after alternatives application:

815b9084 :
...

815b90ac:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
815b90b3:   00 
815b90b4:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
815b90bb:   00 
815b90bc:   90  nop

You can recognize the p6_nops being the same as in-the-manual-suggested
F16h ones.

:-)

I'm running them now and will report numbers relative to the last run
once it is done. And those numbers should in practice get even better if
we revert to the simpler canonical-ness check but let's see...

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 9:40 AM, Borislav Petkov  wrote:
>
> Either way, the NOPs-version is faster and I'm running the test with the
> F16h-specific NOPs to see how they perform.

Btw, please don't use the "more than three 66h overrides" version.
Sure, that's what the optimization manual suggests if you want
single-instruction decode for all sizes up to 15 bytes, but I think
we're better off with the two-nop case for sizes 12-15 (4-byte nop
followed by 8-11 byte nop).

Because the "more than three 66h prefixes" really performs abysmally
on some cores, iirc.

   Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 9:12 AM, Denys Vlasenko  wrote:
>
> It is smaller, but not by much. It is two instructions smaller.

Ehh. That's _half_.

And on a decoding side, it's the difference between 6 bytes that
decode cleanly and can be decoded in parallel with other things
(assuming the 6-byte nop), and 13 bytes that will need at least 2 nops
(unless you want to do lots of prefixes, which is slow on some cores),
_and_ which is likely big enough that you will basically not be
decoding anything else that cycle.

So on the whole, your "smaller, but not by much" is all relative. It's
a relatively big difference.

So if one or two cycles in this code doesn't matter, then why are we
adding alternate instructions just to avoid a few ALU instructions and
a conditional branch that predicts perfectly? And if it does matter,
then the 6-byte option looks clearly better..

 Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 09:00:08AM -0700, Linus Torvalds wrote:
> On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov  wrote:
> >
> > Right, what about the false positives:
> 
> Anybody who tries to return to kernel addresses with sysret is
> suspect. It's more likely to be an attack vector than anything else
> (ie somebody who is trying to take advantage of a CPU bug).
> 
> I don't think there are any false positives. The only valid sysret
> targets are in normal user space.
> 
> There's the "vsyscall" area, I guess, but we are actively discouraging

vsyscall=native on old glibc, says Andy.

> people from using it (it's emulated by default) and using iret to
> return from it is fine if somebody ends up using it natively. It was a
> mistake to have fixed addresses with known code in it, so I don't
> think we should care.
> 
> We've had the inexact version for a long time, and the exact canonical
> address check hasn't even hit my tree yet. I wouldn't worry about it.
> 
> And since we haven't even merged the "better check for canonical
> addresses" it cannot even be a regression if we never really use it.

Well, it certainly sounds to me like it's not really worth the trouble
to do the exact check. So I'm all for dropping it. Unless Andy comes up
with a "but but, there's this use case..."

Either way, the NOPs-version is faster and I'm running the test with the
F16h-specific NOPs to see how they perform.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 04:57 PM, Linus Torvalds wrote:
> On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov  wrote:
>>
>> /*
>>  * Change top 16 bits to be the sign-extension of 47th bit, if this
>>  * changed %rcx, it was not canonical.
>>  */
>> ALTERNATIVE "", \
>> "shl    $(64 - (47+1)), %rcx; \
>>  sar    $(64 - (47+1)), %rcx; \
>>  cmpq   %rcx, %r11; \
>>  jne    opportunistic_sysret_failed", 
>> X86_BUG_SYSRET_CANON_RCX
> 
> Guys, if we're looking at cycles for this, then don't do the "exact
> canonical test". and go back to just doing
> 
> shr $__VIRTUAL_MASK_SHIFT, %rcx
> jnz opportunistic_sysret_failed
> 
> which is much smaller.

It is smaller, but not by much. It is two instructions smaller.
On disassembly level, the changes are:

cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
shr $0x2f,%rcx  -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
mov 0x58(%rsp),%rcx -> (eliminated)
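
In C terms, the two checks being compared look roughly like this sketch
(an illustration, not kernel code; with 48-bit virtual addresses a
canonical address has bits 63:48 equal to bit 47):

#include <stdbool.h>
#include <stdint.h>

#define __VIRTUAL_MASK_SHIFT	47

/* Exact check (the shl/sar pair): sign-extend from bit 47 and compare.
 * True for canonical user *and* kernel-half addresses. */
static bool canonical_exact(uint64_t rip)
{
	return (uint64_t)((int64_t)(rip << 16) >> 16) == rip;
}

/* Inexact check (the single shr): fails anything with a bit set at or
 * above bit 47, including canonical kernel-half addresses - those are
 * the "false positives" discussed elsewhere in this thread. */
static bool canonical_inexact(uint64_t rip)
{
	return (rip >> __VIRTUAL_MASK_SHIFT) == 0;
}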




Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 06:04 PM, Brian Gerst wrote:
> On Mon, Apr 27, 2015 at 11:56 AM, Andy Lutomirski  wrote:
>> On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov  wrote:
>>> On Mon, Apr 27, 2015 at 07:57:36AM -0700, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov  wrote:
>
> /*
>  * Change top 16 bits to be the sign-extension of 47th bit, if this
>  * changed %rcx, it was not canonical.
>  */
> ALTERNATIVE "", \
> "shl    $(64 - (47+1)), %rcx; \
>  sar    $(64 - (47+1)), %rcx; \
>  cmpq   %rcx, %r11; \
>  jne    opportunistic_sysret_failed", 
> X86_BUG_SYSRET_CANON_RCX

 Guys, if we're looking at cycles for this, then don't do the "exact
 canonical test". and go back to just doing

 shr $__VIRTUAL_MASK_SHIFT, %rcx
 jnz opportunistic_sysret_failed

 which is much smaller.
>>>
>>> Right, what about the false positives:
>>>
>>> 17be0aec74fb ("x86/asm/entry/64: Implement better check for canonical 
>>> addresses")
>>>
>>> ? We don't care?
>>
>> The false positives only matter for very strange workloads, e.g.
>> vsyscall=native with old libc.  If it's a measurable regression, we
>> could revert it.
>>
>> --Andy
> 
> Another alternative is to do the canonical check in the paths that can
> set user RIP with an untrusted value, ie, sigreturn and exec.

It is already done only on that path. Fast path doesn't check
RCX for canonicalness.



Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Brian Gerst
On Mon, Apr 27, 2015 at 11:56 AM, Andy Lutomirski  wrote:
> On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov  wrote:
>> On Mon, Apr 27, 2015 at 07:57:36AM -0700, Linus Torvalds wrote:
>>> On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov  wrote:
>>> >
>>> > /*
>>> >  * Change top 16 bits to be the sign-extension of 47th bit, if this
>>> >  * changed %rcx, it was not canonical.
>>> >  */
>>> > ALTERNATIVE "", \
>>> > "shl    $(64 - (47+1)), %rcx; \
>>> >  sar    $(64 - (47+1)), %rcx; \
>>> >  cmpq   %rcx, %r11; \
>>> >  jne    opportunistic_sysret_failed", 
>>> > X86_BUG_SYSRET_CANON_RCX
>>>
>>> Guys, if we're looking at cycles for this, then don't do the "exact
>>> canonical test". and go back to just doing
>>>
>>> shr $__VIRTUAL_MASK_SHIFT, %rcx
>>> jnz opportunistic_sysret_failed
>>>
>>> which is much smaller.
>>
>> Right, what about the false positives:
>>
>> 17be0aec74fb ("x86/asm/entry/64: Implement better check for canonical 
>> addresses")
>>
>> ? We don't care?
>
> The false positives only matter for very strange workloads, e.g.
> vsyscall=native with old libc.  If it's a measurable regression, we
> could revert it.
>
> --Andy

Another alternative is to do the canonical check in the paths that can
set user RIP with an untrusted value, ie, sigreturn and exec.

--
Brian Gerst


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov  wrote:
>
> Right, what about the false positives:

Anybody who tries to return to kernel addresses with sysret is
suspect. It's more likely to be an attack vector than anything else
(ie somebody who is trying to take advantage of a CPU bug).

I don't think there are any false positives. The only valid sysret
targets are in normal user space.

There's the "vsyscall" area, I guess, but we are actively discouraging
people from using it (it's emulated by default) and using iret to
return from it is fine if somebody ends up using it natively. It was a
mistake to have fixed addresses with known code in it, so I don't
think we should care.

We've had the inexact version for a long time, and the exact canonical
address check hasn't even hit my tree yet. I wouldn't worry about it.

And since we haven't even merged the "better check for canonical
addresses" it cannot even be a regression if we never really use it.

   Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Andy Lutomirski
On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov  wrote:
> On Mon, Apr 27, 2015 at 07:57:36AM -0700, Linus Torvalds wrote:
>> On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov  wrote:
>> >
>> > /*
>> >  * Change top 16 bits to be the sign-extension of 47th bit, if this
>> >  * changed %rcx, it was not canonical.
>> >  */
>> > ALTERNATIVE "", \
>> > "shl    $(64 - (47+1)), %rcx; \
>> >  sar    $(64 - (47+1)), %rcx; \
>> >  cmpq   %rcx, %r11; \
>> >  jne    opportunistic_sysret_failed", 
>> > X86_BUG_SYSRET_CANON_RCX
>>
>> Guys, if we're looking at cycles for this, then don't do the "exact
>> canonical test". and go back to just doing
>>
>> shr $__VIRTUAL_MASK_SHIFT, %rcx
>> jnz opportunistic_sysret_failed
>>
>> which is much smaller.
>
> Right, what about the false positives:
>
> 17be0aec74fb ("x86/asm/entry/64: Implement better check for canonical 
> addresses")
>
> ? We don't care?

The false positives only matter for very strange workloads, e.g.
vsyscall=native with old libc.  If it's a measurable regression, we
could revert it.

--Andy


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 07:57:36AM -0700, Linus Torvalds wrote:
> On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov  wrote:
> >
> > /*
> >  * Change top 16 bits to be the sign-extension of 47th bit, if this
> >  * changed %rcx, it was not canonical.
> >  */
> > ALTERNATIVE "", \
> > "shl    $(64 - (47+1)), %rcx; \
> >  sar    $(64 - (47+1)), %rcx; \
> >  cmpq   %rcx, %r11; \
> >  jne    opportunistic_sysret_failed", 
> > X86_BUG_SYSRET_CANON_RCX
> 
> Guys, if we're looking at cycles for this, then don't do the "exact
> canonical test". and go back to just doing
> 
> shr $__VIRTUAL_MASK_SHIFT, %rcx
> jnz opportunistic_sysret_failed
> 
> which is much smaller.

Right, what about the false positives:

17be0aec74fb ("x86/asm/entry/64: Implement better check for canonical 
addresses")

? We don't care?

> In fact, aim to make the conditional jump be a
> two-byte one (jump forward to another jump if required - it's a
> slow-path that doesn't matter at *all* for the taken case), and the
> end result is just six bytes. That way you can use alternative to
> replace it with one single noop on AMD.

Well, even with the non-optimal NOPs (we end up with 4 3-byte NOPs and
one single-byte), we're still better than the unconditional JMP I had
there before:

https://lkml.kernel.org/r/20150427143905.gk6...@pd.tnic

(you might want to look at the raw email - marc.info breaks lines)

I'll retest with the F16h NOPs to see whether there's any difference.

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 08:06:16AM -0700, Linus Torvalds wrote:
> So maybe our AMD nop tables should be updated?

Ho-humm, we're using k8_nops on all 64-bit AMD. I better do some
opt-guide staring.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 7:57 AM, Linus Torvalds wrote:
>
> ..end result is just six bytes. That way you can use alternative to
> replace it with one single noop on AMD.

Actually, it looks like we have no good 6-byte no-ops on AMD. So you'd
get two three-byte ones. Oh well. It's still better than five nops
that can't even be decoded all at once.

That said, our NOP tables look to be old for AMD. Looking at the AMD
optimization guide (family 16h), it says to use

  66 0F 1F 44 00 00

which seems to be the same as Intel (it's "nopw   0x0(%rax,%rax,1)").

So maybe our AMD nop tables should be updated?

 Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov  wrote:
>
> /*
>  * Change top 16 bits to be the sign-extension of 47th bit, if this
>  * changed %rcx, it was not canonical.
>  */
> ALTERNATIVE "", \
> "shl    $(64 - (47+1)), %rcx; \
>  sar    $(64 - (47+1)), %rcx; \
>  cmpq   %rcx, %r11; \
>  jne    opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX

Guys, if we're looking at cycles for this, then don't do the "exact
canonical test". and go back to just doing

shr $__VIRTUAL_MASK_SHIFT, %rcx
jnz opportunistic_sysret_failed

which is much smaller. In fact, aim to make the conditional jump be a
two-byte one (jump forward to another jump if required - it's a
slow-path that doesn't matter at *all* for the taken case), and the
end result is just six bytes. That way you can use alternative to
replace it with one single noop on AMD.

Because dammit, if we're playing these kinds of games, let's do it
*right*. No half measures.

  Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Sun, Apr 26, 2015 at 04:39:38PM -0700, Andy Lutomirski wrote:
> I know it would be ugly, but would it be worth saving two bytes by
> using ALTERNATIVE "jmp 1f", "shl ...", ...?

Damn, it is actually visible even that saving the unconditional forward
JMP makes the numbers marginally nicer (E: row). So I guess we'll be
dropping the forward JMP too.

A: 2835570.145246  cpu-clock (msec)  ( +-  0.02% ) [100.00%]
B: 2833364.074970  cpu-clock (msec)  ( +-  0.04% ) [100.00%]
C: 2834708.335431  cpu-clock (msec)  ( +-  0.02% ) [100.00%]
D: 2835055.118431  cpu-clock (msec)  ( +-  0.01% ) [100.00%]
E: 2833115.118624  cpu-clock (msec)  ( +-  0.06% ) [100.00%]

A: 2835570.099981  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.02% ) [100.00%]
B: 2833364.073633  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.04% ) [100.00%]
C: 2834708.350387  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.02% ) [100.00%]
D: 2835055.094383  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.01% ) [100.00%]
E: 2833115.145292  task-clock (msec)  # 3.996 CPUs utilized  ( +-  0.06% ) [100.00%]

A: 5,591,213,166,613  cycles  # 1.972 GHz  ( +-  0.03% ) [75.00%]
B: 5,585,023,802,888  cycles  # 1.971 GHz  ( +-  0.03% ) [75.00%]
C: 5,587,983,212,758  cycles  # 1.971 GHz  ( +-  0.02% ) [75.00%]
D: 5,584,838,532,936  cycles  # 1.970 GHz  ( +-  0.03% ) [75.00%]
E: 5,583,979,727,842  cycles  # 1.971 GHz  ( +-  0.05% ) [75.00%]

cycles is the lowest, nice.

A: 3,106,707,101,530  instructions  # 0.56  insns per cycle  ( +-  0.01% ) [75.00%]
B: 3,106,632,251,528  instructions  # 0.56  insns per cycle  ( +-  0.00% ) [75.00%]
C: 3,106,265,958,142  instructions  # 0.56  insns per cycle  ( +-  0.00% ) [75.00%]
D: 3,106,294,801,185  instructions  # 0.56  insns per cycle  ( +-  0.00% ) [75.00%]
E: 3,106,381,223,355  instructions  # 0.56  insns per cycle  ( +-  0.01% ) [75.00%]

Understandable - we end up executing 5 insns more:

815b90ac:   66 66 66 90 data16 data16 xchg %ax,%ax
815b90b0:   66 66 66 90 data16 data16 xchg %ax,%ax
815b90b4:   66 66 66 90 data16 data16 xchg %ax,%ax
815b90b8:   66 66 66 90 data16 data16 xchg %ax,%ax
815b90bc:   90  nop


A:   683,676,044,429  branches  #  241.107 M/sec  ( +-  0.01% ) [75.00%]
B:   683,670,899,595  branches  #  241.293 M/sec  ( +-  0.01% ) [75.00%]
C:   683,675,772,858  branches  #  241.180 M/sec  ( +-  0.01% ) [75.00%]
D:   683,683,533,664  branches  #  241.154 M/sec  ( +-  0.00% ) [75.00%]
E:   683,648,518,667  branches  #  241.306 M/sec  ( +-  0.01% ) [75.00%]

Lowest.

A:    43,829,535,008  branch-misses  # 6.41% of all branches  ( +-  0.02% ) [75.00%]
B:    43,844,118,416  branch-misses  # 6.41% of all branches  ( +-  0.03% ) [75.00%]
C:    43,819,871,086  branch-misses  # 6.41% of all branches  ( +-  0.02% ) [75.00%]
D:    43,795,107,998  branch-misses  # 6.41% of all branches  ( +-  0.02% ) [75.00%]
E:    43,801,985,070  branch-misses  # 6.41% of all branches  ( +-  0.02% ) [75.00%]

That looks like noise to me - we shouldn't be getting more branch misses
with the E: version.

A: 2,030,357  context-switches  # 0.716 K/sec  ( +-  0.06% ) [100.00%]
B: 2,029,313  context-switches  # 0.716 K/sec  ( +-  0.05% ) [100.00%]
C: 2,028,566  context-switches  # 0.716 K/sec  ( +-  0.06% ) [100.00%]
D: 2,028,895  context-switches  # 0.716 K/sec  ( +-  0.06% ) [100.00%]
E: 2,031,008  context-switches  # 0.717 K/sec  ( +-  0.09% ) [100.00%]

A: 52,421  migrations  # 0.018 K/sec  ( +-  1.13% )
B: 52,049  migrations  #

Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 02:08:40PM +0200, Denys Vlasenko wrote:
> > 819ef40c:   48 c1 e1 10         shl    $0x10,%rcx
> > 819ef410:   48 c1 f9 10         sar    $0x10,%rcx
> > 819ef414:   49 39 cb            cmp    %rcx,%r11
> > 819ef417:   0f 85 ff 9c bc ff   jne    815b911c <opportunistic_sysret_failed>
> > 
> 
> This looks strange. opportunistic_sysret_failed label is just a few
> instructions below. Why are you getting "ff 9c bc ff" offset in JNE
> instead of short jump of 0x5f bytes I see without ALTERNATIVE?

Because the replacement instructions are placed far away in the section
.altinstr_replacement and since we have relative JMPs, gas generates
JMP from that section to opportunistic_sysret_failed. That's why it is
negative too.

And by looking at this more, I'm afraid even this current version won't
work: recompute_jump(), which I added recently and which is supposed to
fix up the JMPs and even make them smaller, won't detect this JMP
because it is the 4th instruction and not at the first byte. (And even
if it were, it still wouldn't be handled because we're not looking at
conditional JMPs, i.e. Jcc, yet.)

What we could do is something like this instead:

jne opportunistic_sysret_failed - 1f
1:

so that the offset is correct. Need to experiment with this a bit first
though, for the exact placement of the label but it should show the
idea.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 01:35 PM, Borislav Petkov wrote:
> On Mon, Apr 27, 2015 at 10:53:05AM +0200, Borislav Petkov wrote:
>>  ALTERNATIVE "",
>>  "shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
>>   sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
>>   cmpq    %rcx, %r11 \
>>   jne opportunistic_sysret_failed"
>>   X86_BUG_SYSRET_CANONICAL_RCX
> 
> Right, so I can do this:
> 
> /*
>  * Change top 16 bits to be the sign-extension of 47th bit, if this
>  * changed %rcx, it was not canonical.
>  */
> ALTERNATIVE "", \
> "shl    $(64 - (47+1)), %rcx; \
>  sar    $(64 - (47+1)), %rcx; \
>  cmpq   %rcx, %r11; \
>  jne    opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX
> 
> If I use the __VIRTUAL_MASK_SHIFT macro *in* the ALTERNATIVE macro, I get some
> really cryptic gas error:
> 
> arch/x86/kernel/entry_64.S: Assembler messages:
> arch/x86/kernel/entry_64.S:441: Error: can't resolve `L0' {*ABS* section} - `L0' {*UND* section}
> scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' failed
> make[1]: *** [arch/x86/kernel/entry_64.o] Error 1
> Makefile:1536: recipe for target 'arch/x86/kernel/entry_64.o' failed
> make: *** [arch/x86/kernel/entry_64.o] Error 2
> 
> but I guess we can simply use the naked "47" because a couple of lines
> above, we already have the sanity-check:
> 
> .ifne __VIRTUAL_MASK_SHIFT - 47
> .error "virtual address width changed -- SYSRET checks need update"
> .endif
> 
> so we should be guarded just fine.
> 
> Anyway, if we do it this way, we get 17 NOPs added at build time which is the
> length of the 4 instructions:
> 
> 819ef40c:   48 c1 e1 10         shl    $0x10,%rcx
> 819ef410:   48 c1 f9 10         sar    $0x10,%rcx
> 819ef414:   49 39 cb            cmp    %rcx,%r11
> 819ef417:   0f 85 ff 9c bc ff   jne    815b911c <opportunistic_sysret_failed>
> 

This looks strange. opportunistic_sysret_failed label is just a few
instructions below. Why are you getting "ff 9c bc ff" offset in JNE
instead of short jump of 0x5f bytes I see without ALTERNATIVE?


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 10:53:05AM +0200, Borislav Petkov wrote:
>   ALTERNATIVE "",
>   "shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
>sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
>    cmpq    %rcx, %r11 \
>jne opportunistic_sysret_failed"
>X86_BUG_SYSRET_CANONICAL_RCX

Right, so I can do this:

/*
 * Change top 16 bits to be the sign-extension of 47th bit, if this
 * changed %rcx, it was not canonical.
 */
ALTERNATIVE "", \
"shl    $(64 - (47+1)), %rcx; \
 sar    $(64 - (47+1)), %rcx; \
 cmpq   %rcx, %r11; \
 jne    opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX

If I use the __VIRTUAL_MASK_SHIFT macro *in* the ALTERNATIVE macro, I get some
really cryptic gas error:

arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:441: Error: can't resolve `L0' {*ABS* section} - `L0' {*UND* section}
scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' failed
make[1]: *** [arch/x86/kernel/entry_64.o] Error 1
Makefile:1536: recipe for target 'arch/x86/kernel/entry_64.o' failed
make: *** [arch/x86/kernel/entry_64.o] Error 2

but I guess we can simply use the naked "47" because a couple of lines
above, we already have the sanity-check:

.ifne __VIRTUAL_MASK_SHIFT - 47
.error "virtual address width changed -- SYSRET checks need update"
.endif

so we should be guarded just fine.

Anyway, if we do it this way, we get 17 NOPs added at build time which is the
length of the 4 instructions:

819ef40c:   48 c1 e1 10         shl    $0x10,%rcx
819ef410:   48 c1 f9 10         sar    $0x10,%rcx
819ef414:   49 39 cb            cmp    %rcx,%r11
819ef417:   0f 85 ff 9c bc ff   jne    815b911c <opportunistic_sysret_failed>


and the 17 NOPs should be optimized at boot time on AMD.

I was initially afraid that the JMP in the 4th line might be wrong but
apparently since we're using a global label, gas/gcc does generate the
offset properly (0x815b911c):

815b911c <opportunistic_sysret_failed>:
815b911c:   ff 15 fe a4 26 00   callq  *0x26a4fe(%rip)        # 81823620
815b9122:   e9 01 09 00 00      jmpq   815b9a28

815b9127:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
815b912e:   00 00

So, on to do some tracing.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 12:07:14PM +0200, Denys Vlasenko wrote:
>   /* Only three 0x66 prefixes for NOP for fast decode on all CPUs */
>   ALTERNATIVE ".byte 0x66,0x66,0x66,0x90 \
>   .byte 0x66,0x66,0x66,0x90 \
>   .byte 0x66,0x66,0x66,0x90",
>   "shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
>   sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
>   cmpq    %rcx, %r11 \
>   jne opportunistic_sysret_failed"
>   X86_BUG_SYSRET_CANONICAL_RCX
> 
> would be better than letting ALTERNATIVE generate 13 one-byte NOPs.

We already do everything automagically, see:
arch/x86/kernel/alternative.c: optimize_nops()

:-)

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
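
The idea behind optimize_nops() is roughly the following simplified
sketch; the real function in arch/x86/kernel/alternative.c knows the
padding length from struct alt_instr rather than scanning, not least
because a 0x90 byte can also occur inside a real instruction:

static void optimize_nops_sketch(u8 *instr, unsigned int len)
{
	unsigned int i = 0;

	/* skip the real instruction bytes, find the 0x90 padding tail */
	while (i < len && instr[i] != 0x90)
		i++;
	if (i == len)
		return;

	/* rewrite the one-byte padding NOPs with the ideal NOPs */
	add_nops(instr + i, len - i);
}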


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 10:53 AM, Borislav Petkov wrote:
> On Sun, Apr 26, 2015 at 04:39:38PM -0700, Andy Lutomirski wrote:
>>> +#define X86_BUG_CANONICAL_RCX  X86_BUG(8) /* SYSRET #GPs when %RCX 
>>> non-canonical */
>>
>> I think that "sysret" should appear in the name.
> 
> Yeah, I thought about it too, will fix.
> 
>> Oh no!  My laptop is currently bug-free, and you're breaking it! :)
> 
> Muahahahhahaha...
> 
>>> +
>>> +   /*
>>> +* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
>>> +* in kernel space.  This essentially lets the user take over
>>> +* the kernel, since userspace controls RSP.
>>> +*/
>>> +   ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
>>> +
>>
>> I know it would be ugly, but would it be worth saving two bytes by
>> using ALTERNATIVE "jmp 1f", "shl ...", ...?
>>
>>> /* Change top 16 bits to be the sign-extension of 47th bit */
>>> shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>>> sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>>> @@ -432,6 +436,7 @@ syscall_return:
>>> cmpq%rcx, %r11
>>> jne opportunistic_sysret_failed
> 
> You want to stick all 4 insns in the alternative? Yeah, it should work
but it might be even more unreadable than it is now.
> 
> Btw, we can do this too:
> 
>   ALTERNATIVE "",
>   "shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
>sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
>    cmpq    %rcx, %r11 \
>jne opportunistic_sysret_failed"
>X86_BUG_SYSRET_CANONICAL_RCX
> 
> which will replace the 2-byte JMP with a lot of NOPs on AMD.

The instructions you want to NOP out are translated to these bytes:

 2c2:   48 c1 e1 10         shl    $0x10,%rcx
 2c6:   48 c1 f9 10         sar    $0x10,%rcx
 2ca:   49 39 cb            cmp    %rcx,%r11
 2cd:   75 5f               jne    32e <opportunistic_sysret_failed>

According to http://instlatx64.atw.hu/
CPUs from both AMD and Intel are happy to eat "66,66,66,90" NOPs
with maximum throughput; more than three 66 prefixes slow decode down,
sometimes horrifically (from 3 insns per cycle to one insn per ~10 cycles).

Probably doing something like this

/* Only three 0x66 prefixes for NOP for fast decode on all CPUs */
ALTERNATIVE ".byte 0x66,0x66,0x66,0x90 \
.byte 0x66,0x66,0x66,0x90 \
.byte 0x66,0x66,0x66,0x90",
"shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
cmpq    %rcx, %r11 \
jne opportunistic_sysret_failed"
X86_BUG_SYSRET_CANONICAL_RCX

would be better than letting ALTERNATIVE generate 13 one-byte NOPs.

-- 
vda



Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Sun, Apr 26, 2015 at 04:39:38PM -0700, Andy Lutomirski wrote:
> > +#define X86_BUG_CANONICAL_RCX  X86_BUG(8) /* SYSRET #GPs when %RCX 
> > non-canonical */
> 
> I think that "sysret" should appear in the name.

Yeah, I thought about it too, will fix.

> Oh no!  My laptop is currently bug-free, and you're breaking it! :)

Muahahahhahaha...

> > +
> > +   /*
> > +* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> > +* in kernel space.  This essentially lets the user take over
> > +* the kernel, since userspace controls RSP.
> > +*/
> > +   ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
> > +
> 
> I know it would be ugly, but would it be worth saving two bytes by
> using ALTERNATIVE "jmp 1f", "shl ...", ...?
> 
> > /* Change top 16 bits to be the sign-extension of 47th bit */
> > shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> > sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> > @@ -432,6 +436,7 @@ syscall_return:
> > cmpq%rcx, %r11
> > jne opportunistic_sysret_failed

You want to stick all 4 insns in the alternative? Yeah, it should work
but it might be even more unreadable than it is now.

Btw, we can do this too:

ALTERNATIVE "",
"shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
 sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
 cmpq    %rcx, %r11 \
 jne opportunistic_sysret_failed"
 X86_BUG_SYSRET_CANONICAL_RCX

which will replace the 2-byte JMP with a lot of NOPs on AMD.

I'll trace it again to see which one is worse :-)

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 10:53 AM, Borislav Petkov wrote:
 On Sun, Apr 26, 2015 at 04:39:38PM -0700, Andy Lutomirski wrote:
 +#define X86_BUG_CANONICAL_RCX  X86_BUG(8) /* SYSRET #GPs when %RCX 
 non-canonical */

 I think that sysret should appear in the name.
 
 Yeah, I thought about it too, will fix.
 
 Oh no!  My laptop is currently bug-free, and you're breaking it! :)
 
 Muahahahhahaha...
 
 +
 +   /*
 +* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
 +* in kernel space.  This essentially lets the user take over
 +* the kernel, since userspace controls RSP.
 +*/
 +   ALTERNATIVE jmp 1f, , X86_BUG_CANONICAL_RCX
 +

 I know it would be ugly, but would it be worth saving two bytes by
 using ALTERNATIVE jmp 1f, shl ..., ...?

 /* Change top 16 bits to be the sign-extension of 47th bit */
 shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 @@ -432,6 +436,7 @@ syscall_return:
 cmpq%rcx, %r11
 jne opportunistic_sysret_failed
 
 You want to stick all 4 insns in the alternative? Yeah, it should work
 but it might even more unreadable than it is now.
 
 Btw, we can do this too:
 
   ALTERNATIVE ,
   shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
cmpq%rcx, %r11 \
jne opportunistic_sysret_failed
X86_BUG_SYSRET_CANONICAL_RCX
 
 which will replace the 2-byte JMP with a lot of NOPs on AMD.

The instructions you want to NOP out are translated to these bytes:

 2c2:   48 c1 e1 10 shl$0x10,%rcx
 2c6:   48 c1 f9 10 sar$0x10,%rcx
 2ca:   49 39 cbcmp%rcx,%r11
 2cd:   75 5f   jne32e opportunistic_sysret_failed

According to http://instlatx64.atw.hu/
CPUs from both AMD and Intel are happy to eat 66,66,66,90 NOPs
with maximum throughput; more than three 66 prefixes slow decode down,
sometimes horrifically (from 3 insns per cycle to one insn per ~10 cycles).

Probably doing something like this

/* Only three 0x66 prefixes for NOP for fast decode on all CPUs */
ALTERNATIVE .byte 0x66,0x66,0x66,0x90 \
.byte 0x66,0x66,0x66,0x90 \
.byte 0x66,0x66,0x66,0x90,
shl$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
cmpq%rcx, %r11 \
jne opportunistic_sysret_failed
X86_BUG_SYSRET_CANONICAL_RCX

would be better than letting ALTERNATIVE to generate 13 one-byte NOPs.

-- 
vda

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 12:07:14PM +0200, Denys Vlasenko wrote:
   /* Only three 0x66 prefixes for NOP for fast decode on all CPUs */
   ALTERNATIVE .byte 0x66,0x66,0x66,0x90 \
   .byte 0x66,0x66,0x66,0x90 \
   .byte 0x66,0x66,0x66,0x90,
   shl$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
   sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
   cmpq%rcx, %r11 \
   jne opportunistic_sysret_failed
   X86_BUG_SYSRET_CANONICAL_RCX
 
 would be better than letting ALTERNATIVE to generate 13 one-byte NOPs.

We already do everything automagically, see:
arch/x86/kernel/alternative.c: optimize_nops()

:-)

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 12:59:11PM -0700, H. Peter Anvin wrote:
 It really comes down to this: it seems older cores from both Intel
 and AMD perform better with 66 66 66 90, whereas the 0F 1F series is
 better on newer cores.

 When I measured it, the differences were sometimes dramatic.

How did you measure that? I should probably do the same on some newer
AMD machines... We're using k8_nops on all AMD, even though the
optimization manuals cite different NOPs on newer AMD families.

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread H. Peter Anvin
It really comes down to this: it seems older cores from both Intel and AMD 
perform better with 66 66 66 90, whereas the 0F 1F series is better on newer 
cores.

When I measured it, the differences were sometimes dramatic.

On April 27, 2015 11:53:44 AM PDT, Borislav Petkov b...@alien8.de wrote:
On Mon, Apr 27, 2015 at 11:47:30AM -0700, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 11:38 AM, Borislav Petkov b...@alien8.de
wrote:
 
  So our current NOP-infrastructure does ASM_NOP_MAX NOPs of 8 bytes
so
  without more invasive changes, our longest NOPs are 8 byte long and
then
  we have to repeat.
 
 Btw (and I'm too lazy to check) do we take alignment into account?
 
 Because if you have to split, and use multiple nops, it is *probably*
 a good idea to try to avoid 16-byte boundaries, since that's can be
 the I$ fetch granularity from L1 (although I guess 32B is getting
more
 common).

Yeah, on F16h you have 32B fetch but the paths later in the machine
gets narrower, so to speak.

 So the exact split might depend on the alignment of the nop
replacement..

Yeah, no. Our add_nops() is trivial:

/* Use this to add nops to a buffer, then text_poke the whole buffer.
*/
static void __init_or_module add_nops(void *insns, unsigned int len)
{
while (len  0) {
unsigned int noplen = len;
if (noplen  ASM_NOP_MAX)
noplen = ASM_NOP_MAX;
memcpy(insns, ideal_nops[noplen], noplen);
insns += noplen;
len -= noplen;
}
}

 Can we perhaps get rid of the distinction entirely, and just use one
 set of 64-bit nops for both Intel/AMD?

I *think* hpa would have an opinion here. I'm judging by looking at
comments like this one in the code:

/*
 * Due to a decoder implementation quirk, some
 * specific Intel CPUs actually perform better with
 * the k8_nops than with the SDM-recommended NOPs.
 */

which is a fun one in itself. :-)

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 09:21:34PM +0200, Denys Vlasenko wrote:
 On 04/27/2015 09:11 PM, Borislav Petkov wrote:
  A: 709.528485252 seconds time elapsed   
 ( +-  0.02% )
  B: 708.976557288 seconds time elapsed   
 ( +-  0.04% )
  C: 709.312844791 seconds time elapsed   
 ( +-  0.02% )
  D: 709.400050112 seconds time elapsed   
 ( +-  0.01% )
  E: 708.914562508 seconds time elapsed   
 ( +-  0.06% )
  F: 709.602255085 seconds time elapsed   
 ( +-  0.02% )
 
 That's about 0.2% variance. Very small.

Right, I'm doubtful this is the right workload for this - and whether
any workload would show any serious difference at all. Perhaps it all
doesn't really matter and we shouldn't do anything at all.

 Sounds obvious, but. Did you try running a test several times?

All runs so far are done with perf stat ... --repeat 10, so 10 kernel
builds and results are averaged.

 Maybe you are measuring random noise.

Yeah. Last exercise tomorrow. Let's see what those numbers would look
like.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread H. Peter Anvin
I did a microbenchmark in user space... let's see if I can find it.
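
Roughly along these lines, perhaps (an illustrative sketch, not the
actual benchmark - the choice of the 3-byte P6 NOP and the plain RDTSC
timing are assumptions):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define NOPS 1000
#define REPS 100

int main(void)
{
        /* 0f 1f 00 is the 3-byte P6 NOP; c3 is RET */
        static const unsigned char nop3[] = { 0x0f, 0x1f, 0x00 };
        unsigned char *buf = mmap(NULL, 4096,
                                  PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uint64_t best = UINT64_MAX;
        int i;

        if (buf == MAP_FAILED)
                return 1;

        /* fill an executable buffer with NOPS copies of the NOP, then RET */
        for (i = 0; i < NOPS; i++)
                memcpy(buf + i * sizeof(nop3), nop3, sizeof(nop3));
        buf[NOPS * sizeof(nop3)] = 0xc3;

        /* object-to-function-pointer cast is a GCCism, fine for a test */
        void (*fn)(void) = (void (*)(void))buf;

        for (i = 0; i < REPS; i++) {
                uint64_t t0 = __rdtsc();
                fn();
                uint64_t t1 = __rdtsc();
                if (t1 - t0 < best)
                        best = t1 - t0;
        }
        printf("%d x 3-byte NOP: %llu cycles (best of %d runs)\n",
               NOPS, (unsigned long long)best, REPS);
        return 0;
}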

On April 27, 2015 1:03:29 PM PDT, Borislav Petkov b...@alien8.de wrote:
On Mon, Apr 27, 2015 at 12:59:11PM -0700, H. Peter Anvin wrote:
 It really comes down to this: it seems older cores from both Intel
 and AMD perform better with 66 66 66 90, whereas the 0F 1F series is
 better on newer cores.

 When I measured it, the differences were sometimes dramatic.

How did you measure that? I should probably do the same on some newer
AMD machines... We're using k8_nops on all AMD, even though the
optimization manuals cite different NOPs on newer AMD families.

Thanks.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Sun, Apr 26, 2015 at 04:39:38PM -0700, Andy Lutomirski wrote:
  +#define X86_BUG_CANONICAL_RCX  X86_BUG(8) /* SYSRET #GPs when %RCX 
  non-canonical */
 
 I think that "sysret" should appear in the name.

Yeah, I thought about it too, will fix.

 Oh no!  My laptop is currently bug-free, and you're breaking it! :)

Muahahahhahaha...

  +
  +   /*
  +* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
  +* in kernel space.  This essentially lets the user take over
  +* the kernel, since userspace controls RSP.
  +*/
  +   ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
  +
 
 I know it would be ugly, but would it be worth saving two bytes by
 using ALTERNATIVE "jmp 1f", "shl ...", ...?
 
  /* Change top 16 bits to be the sign-extension of 47th bit */
  shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
  sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
  @@ -432,6 +436,7 @@ syscall_return:
  cmpq%rcx, %r11
  jne opportunistic_sysret_failed

You want to stick all 4 insns in the alternative? Yeah, it should work
but it might be even more unreadable than it is now.

Btw, we can do this too:

ALTERNATIVE "",
"shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
 sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
 cmpq %rcx, %r11 \
 jne opportunistic_sysret_failed",
 X86_BUG_SYSRET_CANONICAL_RCX

which will replace the 2-byte JMP with a lot of NOPs on AMD.

I'll trace it again to see which one is worse :-)

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 02:08:40PM +0200, Denys Vlasenko wrote:
  819ef40c:   48 c1 e1 10             shl    $0x10,%rcx
  819ef410:   48 c1 f9 10             sar    $0x10,%rcx
  819ef414:   49 39 cb                cmp    %rcx,%r11
  819ef417:   0f 85 ff 9c bc ff       jne    815b911c <opportunistic_sysret_failed>
 
 This looks strange. opportunistic_sysret_failed label is just a few
 instructions below. Why are you getting ff 9c bc ff offset in JNE
 instead of short jump of 0x5f bytes I see without ALTERNATIVE?

Because the replacement instructions are placed far away in the section
.altinstr_replacement and since we have relative JMPs, gas generates
JMP from that section to opportunistic_sysret_failed. That's why it is
negative too.

And by looking at this more, I'm afraid even this current version won't
work: even though I added recompute_jump() recently, which is supposed
to fix up the JMPs and even make them smaller, it won't help in this
case because it won't detect the JMP - it is the 4th instruction and
not the first byte. (And even if it were, it still won't detect it
because we're not looking at conditional JMPs, i.e. Jcc, yet.)

What we could do is something like this instead:

jne opportunistic_sysret_failed - 1f
1:

so that the offset is correct. Need to experiment with this a bit first
though, for the exact placement of the label but it should show the
idea.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 01:35 PM, Borislav Petkov wrote:
 On Mon, Apr 27, 2015 at 10:53:05AM +0200, Borislav Petkov wrote:
  ALTERNATIVE "",
  "shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
   sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
   cmpq %rcx, %r11 \
   jne opportunistic_sysret_failed",
   X86_BUG_SYSRET_CANONICAL_RCX
 
 Right, so I can do this:
 
 /*
  * Change top 16 bits to be the sign-extension of 47th bit, if this
  * changed %rcx, it was not canonical.
  */
 ALTERNATIVE "", \
 "shl $(64 - (47+1)), %rcx; \
  sar $(64 - (47+1)), %rcx; \
  cmpq %rcx, %r11; \
  jne opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX
 
 If I use the __VIRTUAL_MASK_SHIFT macro *in* the ALTERNATIVE macro, I get some
 really cryptic gas error:
 
 arch/x86/kernel/entry_64.S: Assembler messages:
 arch/x86/kernel/entry_64.S:441: Error: can't resolve `L0' {*ABS* section} - 
 `L0' {*UND* section}
 scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' 
 failed
 make[1]: *** [arch/x86/kernel/entry_64.o] Error 1
 Makefile:1536: recipe for target 'arch/x86/kernel/entry_64.o' failed
 make: *** [arch/x86/kernel/entry_64.o] Error 2
 
 but I guess we can simply use the naked 47 because a couple of lines
 above, we already have the sanity-check:
 
 .ifne __VIRTUAL_MASK_SHIFT - 47
 .error "virtual address width changed -- SYSRET checks need update"
 .endif
 
 so we should be guarded just fine.
 
 Anyway, if we do it this way, we get 17 NOPs added at build time which is the
 length of the 4 instructions:
 
 819ef40c:   48 c1 e1 10             shl    $0x10,%rcx
 819ef410:   48 c1 f9 10             sar    $0x10,%rcx
 819ef414:   49 39 cb                cmp    %rcx,%r11
 819ef417:   0f 85 ff 9c bc ff       jne    815b911c <opportunistic_sysret_failed>

This looks strange. opportunistic_sysret_failed label is just a few
instructions below. Why are you getting ff 9c bc ff offset in JNE
instead of short jump of 0x5f bytes I see without ALTERNATIVE?


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 10:53:05AM +0200, Borislav Petkov wrote:
   ALTERNATIVE "",
   "shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
    sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx \
    cmpq %rcx, %r11 \
    jne opportunistic_sysret_failed",
    X86_BUG_SYSRET_CANONICAL_RCX

Right, so I can do this:

/*
 * Change top 16 bits to be the sign-extension of 47th bit, if this
 * changed %rcx, it was not canonical.
 */
ALTERNATIVE "", \
"shl $(64 - (47+1)), %rcx; \
 sar $(64 - (47+1)), %rcx; \
 cmpq %rcx, %r11; \
 jne opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX

If I use the __VIRTUAL_MASK_SHIFT macro *in* the ALTERNATIVE macro, I get some
really cryptic gas error:

arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:441: Error: can't resolve `L0' {*ABS* section} - 
`L0' {*UND* section}
scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' 
failed
make[1]: *** [arch/x86/kernel/entry_64.o] Error 1
Makefile:1536: recipe for target 'arch/x86/kernel/entry_64.o' failed
make: *** [arch/x86/kernel/entry_64.o] Error 2

but I guess we can simply use the naked 47 because a couple of lines
above, we already have the sanity-check:

.ifne __VIRTUAL_MASK_SHIFT - 47
.error "virtual address width changed -- SYSRET checks need update"
.endif

so we should be guarded just fine.

Anyway, if we do it this way, we get 17 NOPs added at build time which is the
length of the 4 instructions:

819ef40c:   48 c1 e1 10             shl    $0x10,%rcx
819ef410:   48 c1 f9 10             sar    $0x10,%rcx
819ef414:   49 39 cb                cmp    %rcx,%r11
819ef417:   0f 85 ff 9c bc ff       jne    815b911c <opportunistic_sysret_failed>

and the 17 NOPs should be optimized at boot time on AMD.

I was initially afraid that the JMP in the 4th line might be wrong but
apparently since we're using a global label, gas/gcc does generate the
offset properly (0x815b911c):

815b911c <opportunistic_sysret_failed>:
815b911c:   ff 15 fe a4 26 00       callq  *0x26a4fe(%rip)        # 81823620 <pv_cpu_ops+0x120>
815b9122:   e9 01 09 00 00          jmpq   815b9a28 <restore_c_regs_and_iret>
815b9127:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
815b912e:   00 00

So, on to do some tracing.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov b...@alien8.de wrote:

 /*
  * Change top 16 bits to be the sign-extension of 47th bit, if this
  * changed %rcx, it was not canonical.
  */
 ALTERNATIVE "", \
 "shl $(64 - (47+1)), %rcx; \
  sar $(64 - (47+1)), %rcx; \
  cmpq %rcx, %r11; \
  jne opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX

Guys, if we're looking at cycles for this, then don't do the exact
canonical test, and go back to just doing

shr $__VIRTUAL_MASK_SHIFT, %rcx
jnz opportunistic_sysret_failed

which is much smaller. In fact, aim to make the conditional jump be a
two-byte one (jump forward to another jump if required - it's a
slow-path that doesn't matter at *all* for the taken case), and the
end result is just six bytes. That way you can use alternative to
replace it with one single noop on AMD.
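
For reference, the two checks in C (my sketch, not code from the
thread; it assumes the 48-bit virtual address width and gcc's
arithmetic right shift of signed values):

#include <stdbool.h>
#include <stdint.h>

/* exact test (the shl/sar pair): canonical iff sign-extending bit 47
 * into the top 16 bits reproduces the original value */
static bool rcx_canonical_exact(uint64_t rcx)
{
        return (uint64_t)((int64_t)(rcx << 16) >> 16) == rcx;
}

/* inexact test (shr/jnz): pass only if the top 17 bits are all clear,
 * which also rejects canonical kernel addresses - hence the possible
 * false positives mentioned elsewhere in the thread */
static bool rcx_sysret_ok_inexact(uint64_t rcx)
{
        return (rcx >> 47) == 0;
}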

Because dammit, if we're playing these kinds of games, let's do it
*right*. No half measures.

  Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 7:57 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 ..end result is just six bytes. That way you can use alternative to
 replace it with one single noop on AMD.

Actually, it looks like we have no good 6-byte no-ops on AMD. So you'd
get two three-byte ones. Oh well. It's still better than five nops
that can't even be decoded all at once.

That said, our NOP tables look to be old for AMD. Looking at the AMD
optimization guide (family 16h), it says to use

  66 0F 1F 44 00 00

which seems to be the same as Intel (it's "nopw   0x0(%rax,%rax,1)").

So maybe our AMD nop tables should be updated?

 Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Sun, Apr 26, 2015 at 04:39:38PM -0700, Andy Lutomirski wrote:
 I know it would be ugly, but would it be worth saving two bytes by
 using ALTERNATIVE "jmp 1f", "shl ...", ...?

Damn, it is actually visible: even saving just the unconditional forward
JMP makes the numbers marginally nicer (E: row). So I guess we'll be
dropping the forward JMP too.

A:2835570.145246  cpu-clock (msec)  
( +-  0.02% ) [100.00%]
B:2833364.074970  cpu-clock (msec)  
( +-  0.04% ) [100.00%]
C:2834708.335431  cpu-clock (msec)  
( +-  0.02% ) [100.00%]
D:2835055.118431  cpu-clock (msec)  
( +-  0.01% ) [100.00%]
E:2833115.118624  cpu-clock (msec)  
( +-  0.06% ) [100.00%]

A:2835570.099981  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]
B:2833364.073633  task-clock (msec) #3.996 CPUs utilized
( +-  0.04% ) [100.00%]
C:2834708.350387  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]
D:2835055.094383  task-clock (msec) #3.996 CPUs utilized
( +-  0.01% ) [100.00%]
E:2833115.145292  task-clock (msec) #3.996 CPUs utilized
( +-  0.06% ) [100.00%]

A: 5,591,213,166,613  cycles#1.972 GHz  
( +-  0.03% ) [75.00%]
B: 5,585,023,802,888  cycles#1.971 GHz  
( +-  0.03% ) [75.00%]
C: 5,587,983,212,758  cycles#1.971 GHz  
( +-  0.02% ) [75.00%]
D: 5,584,838,532,936  cycles#1.970 GHz  
( +-  0.03% ) [75.00%]
E: 5,583,979,727,842  cycles#1.971 GHz  
( +-  0.05% ) [75.00%]

"cycles" is the lowest, nice.

A: 3,106,707,101,530  instructions  #0.56  insns per cycle  
( +-  0.01% ) [75.00%]
B: 3,106,632,251,528  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
C: 3,106,265,958,142  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
D: 3,106,294,801,185  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
E: 3,106,381,223,355  instructions  #0.56  insns per cycle  
( +-  0.01% ) [75.00%]

Understandable - we end up executing 5 insns more:

815b90ac:   66 66 66 90 data16 data16 xchg %ax,%ax
815b90b0:   66 66 66 90 data16 data16 xchg %ax,%ax
815b90b4:   66 66 66 90 data16 data16 xchg %ax,%ax
815b90b8:   66 66 66 90 data16 data16 xchg %ax,%ax
815b90bc:   90  nop


A:   683,676,044,429  branches  #  241.107 M/sec
( +-  0.01% ) [75.00%]
B:   683,670,899,595  branches  #  241.293 M/sec
( +-  0.01% ) [75.00%]
C:   683,675,772,858  branches  #  241.180 M/sec
( +-  0.01% ) [75.00%]
D:   683,683,533,664  branches  #  241.154 M/sec
( +-  0.00% ) [75.00%]
E:   683,648,518,667  branches  #  241.306 M/sec
( +-  0.01% ) [75.00%]

Lowest.

A:43,829,535,008  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]
B:43,844,118,416  branch-misses #6.41% of all branches  
( +-  0.03% ) [75.00%]
C:43,819,871,086  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]
D:43,795,107,998  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]
E:43,801,985,070  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]

That looks like noise to me - we shouldn't be getting more branch misses
with the E: version.

A: 2,030,357  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]
B: 2,029,313  context-switches  #0.716 K/sec
( +-  0.05% ) [100.00%]
C: 2,028,566  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]
D: 2,028,895  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]
E: 2,031,008  context-switches  #0.717 K/sec
( +-  0.09% ) [100.00%]

A:52,421  migrations#0.018 K/sec
( +-  1.13% )
B:52,049  migrations#0.018 

Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Brian Gerst
On Mon, Apr 27, 2015 at 11:56 AM, Andy Lutomirski l...@amacapital.net wrote:
 On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov b...@alien8.de wrote:
 On Mon, Apr 27, 2015 at 07:57:36AM -0700, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov b...@alien8.de wrote:
 
  /*
   * Change top 16 bits to be the sign-extension of 47th bit, if this
   * changed %rcx, it was not canonical.
   */
  ALTERNATIVE "", \
  "shl $(64 - (47+1)), %rcx; \
   sar $(64 - (47+1)), %rcx; \
   cmpq %rcx, %r11; \
   jne opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX

 Guys, if we're looking at cycles for this, then don't do the exact
 canonical test, and go back to just doing

 shr $__VIRTUAL_MASK_SHIFT, %rcx
 jnz opportunistic_sysret_failed

 which is much smaller.

 Right, what about the false positives:

 17be0aec74fb (x86/asm/entry/64: Implement better check for canonical 
 addresses)

 ? We don't care?

 The false positives only matter for very strange workloads, e.g.
 vsyscall=native with old libc.  If it's a measurable regression, we
 could revert it.

 --Andy

Another alternative is to do the canonical check in the paths that can
set user RIP with an untrusted value, ie, sigreturn and exec.

--
Brian Gerst


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 07:57:36AM -0700, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov b...@alien8.de wrote:
 
  /*
   * Change top 16 bits to be the sign-extension of 47th bit, if this
   * changed %rcx, it was not canonical.
   */
 ALTERNATIVE "", \
 "shl $(64 - (47+1)), %rcx; \
  sar $(64 - (47+1)), %rcx; \
  cmpq %rcx, %r11; \
  jne opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX
 
 Guys, if we're looking at cycles for this, then don't do the exact
 canonical test, and go back to just doing
 
 shr $__VIRTUAL_MASK_SHIFT, %rcx
 jnz opportunistic_sysret_failed
 
 which is much smaller.

Right, what about the false positives:

17be0aec74fb (x86/asm/entry/64: Implement better check for canonical 
addresses)

? We don't care?

 In fact, aim to make the conditional jump be a
 two-byte one (jump forward to another jump if required - it's a
 slow-path that doesn't matter at *all* for the taken case), and the
 end result is just six bytes. That way you can use alternative to
 replace it with one single noop on AMD.

Well, even with the non-optimal NOPs (we end up with 4 3-byte NOPs and
one single-byte), we're still better than the unconditional JMP I had
there before:

https://lkml.kernel.org/r/20150427143905.gk6...@pd.tnic

(you might want to look at the raw email - marc.info breaks lines)

I'll retest with the F16h NOPs to see whether there's any difference.

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Andy Lutomirski
On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov b...@alien8.de wrote:
 On Mon, Apr 27, 2015 at 07:57:36AM -0700, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov b...@alien8.de wrote:
 
  /*
   * Change top 16 bits to be the sign-extension of 47th bit, if this
   * changed %rcx, it was not canonical.
   */
  ALTERNATIVE "", \
  "shl $(64 - (47+1)), %rcx; \
   sar $(64 - (47+1)), %rcx; \
   cmpq %rcx, %r11; \
   jne opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX

 Guys, if we're looking at cycles for this, then don't do the exact
 canonical test, and go back to just doing

 shr $__VIRTUAL_MASK_SHIFT, %rcx
 jnz opportunistic_sysret_failed

 which is much smaller.

 Right, what about the false positives:

 17be0aec74fb (x86/asm/entry/64: Implement better check for canonical 
 addresses)

 ? We don't care?

The false positives only matter for very strange workloads, e.g.
vsyscall=native with old libc.  If it's a measurable regression, we
could revert it.

--Andy


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov b...@alien8.de wrote:

 Right, what about the false positives:

Anybody who tries to return to kernel addresses with sysret is
suspect. It's more likely to be an attack vector than anything else
(ie somebody who is trying to take advantage of a CPU bug).

I don't think there are any false positives. The only valid sysret
targets are in normal user space.

There's the vsyscall area, I guess, but we are actively discouraging
people from using it (it's emulated by default) and using iret to
return from it is fine if somebody ends up using it natively. It was a
mistake to have fixed addresses with known code in it, so I don't
think we should care.

We've had the inexact version for a long time, and the exact canonical
address check hasn't even hit my tree yet. I wouldn't worry about it.

And since we haven't even merged the better check for canonical
addresses it cannot even be a regression if we never really use it.

   Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 08:06:16AM -0700, Linus Torvalds wrote:
 So maybe our AMD nop tables should be updated?

Ho-humm, we're using k8_nops on all 64-bit AMD. I better do some
opt-guide staring.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 06:04 PM, Brian Gerst wrote:
 On Mon, Apr 27, 2015 at 11:56 AM, Andy Lutomirski l...@amacapital.net wrote:
 On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov b...@alien8.de wrote:
 On Mon, Apr 27, 2015 at 07:57:36AM -0700, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov b...@alien8.de wrote:

 /*
  * Change top 16 bits to be the sign-extension of 47th bit, if this
  * changed %rcx, it was not canonical.
  */
 ALTERNATIVE "", \
 "shl $(64 - (47+1)), %rcx; \
  sar $(64 - (47+1)), %rcx; \
  cmpq %rcx, %r11; \
  jne opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX

 Guys, if we're looking at cycles for this, then don't do the exact
 canonical test, and go back to just doing

 shr $__VIRTUAL_MASK_SHIFT, %rcx
 jnz opportunistic_sysret_failed

 which is much smaller.

 Right, what about the false positives:

 17be0aec74fb (x86/asm/entry/64: Implement better check for canonical 
 addresses)

 ? We don't care?

 The false positives only matter for very strange workloads, e.g.
 vsyscall=native with old libc.  If it's a measurable regression, we
 could revert it.

 --Andy
 
 Another alternative is to do the canonical check in the paths that can
 set user RIP with an untrusted value, ie, sigreturn and exec.

It is already done only on that path. Fast path doesn't check
RCX for canonicalness.



Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 04:57 PM, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 4:35 AM, Borislav Petkov b...@alien8.de wrote:

 /*
  * Change top 16 bits to be the sign-extension of 47th bit, if this
  * changed %rcx, it was not canonical.
  */
 ALTERNATIVE "", \
 "shl $(64 - (47+1)), %rcx; \
  sar $(64 - (47+1)), %rcx; \
  cmpq %rcx, %r11; \
  jne opportunistic_sysret_failed", X86_BUG_SYSRET_CANON_RCX
 
 Guys, if we're looking at cycles for this, then don't do the exact
  canonical test, and go back to just doing
 
 shr $__VIRTUAL_MASK_SHIFT, %rcx
 jnz opportunistic_sysret_failed
 
 which is much smaller.

It is smaller, but not by much. It is two instructions smaller.
On disassembly level, the changes are:

cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
mov 0x58(%rsp),%rcx -> (eliminated)




Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 09:00:08AM -0700, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 8:46 AM, Borislav Petkov b...@alien8.de wrote:
 
  Right, what about the false positives:
 
 Anybody who tries to return to kernel addresses with sysret is
 suspect. It's more likely to be an attack vector than anything else
 (ie somebody who is trying to take advantage of a CPU bug).
 
 I don't think there are any false positives. The only valid sysret
 targets are in normal user space.
 
 There's the vsyscall area, I guess, but we are actively discouraging

vsyscall=native on old glibc, says Andy.

 people from using it (it's emulated by default) and using iret to
 return from it is fine if somebody ends up using it natively. It was a
 mistake to have fixed addresses with known code in it, so I don't
 think we should care.
 
 We've had the inexact version for a long time, and the exact canonical
 address check hasn't even hit my tree yet. I wouldn't worry about it.
 
 And since we haven't even merged the better check for canonical
 addresses it cannot even be a regression if we never really use it.

Well, it certainly sounds to me like it's not really worth the trouble
to do the exact check. So I'm all for dropping it. Unless Andy comes up
with a "but but, there's this use case..."

Either way, the NOPs-version is faster and I'm running the test with the
F16h-specific NOPs to see how they perform.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 11:14:15AM -0700, Linus Torvalds wrote:
 Btw, please don't use the "more than three 66h overrides" version.

Oh yeah, a notorious frontend choker.

 Sure, that's what the optimization manual suggests if you want
 single-instruction decode for all sizes up to 15 bytes, but I think
 we're better off with the two-nop case for sizes 12-15 (4-byte nop
 followed by 8-11 byte nop).

Yeah, so says the manual. Although I wouldn't trust those manuals
blindly but that's another story.

 Because the "more than three 66b prefixes" really performs abysmally
 on some cores, iirc.

Right.

So our current NOP-infrastructure does ASM_NOP_MAX NOPs of 8 bytes so
without more invasive changes, our longest NOPs are 8 byte long and then
we have to repeat. This is consistent with what the code looks like here
after alternatives application:

815b9084 <syscall_return>:
...

815b90ac:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
815b90b3:   00 
815b90b4:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
815b90bb:   00 
815b90bc:   90                      nop

You can see that the p6_nops are the same as the F16h ones suggested
in the manual.

:-)

I'm running them now and will report numbers relative to the last run
once it is done. And those numbers should in practice get even better if
we revert to the simpler canonical-ness check but let's see...

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 11:38 AM, Borislav Petkov b...@alien8.de wrote:

 So our current NOP-infrastructure does ASM_NOP_MAX NOPs of 8 bytes so
 without more invasive changes, our longest NOPs are 8 byte long and then
 we have to repeat.

Btw (and I'm too lazy to check) do we take alignment into account?

Because if you have to split, and use multiple nops, it is *probably*
a good idea to try to avoid 16-byte boundaries, since that can be
the I$ fetch granularity from L1 (although I guess 32B is getting more
common).

So the exact split might depend on the alignment of the nop replacement..

 You can see that the p6_nops are the same as the F16h ones suggested
 in the manual.

Can we perhaps get rid of the distinction entirely, and just use one
set of 64-bit nops for both Intel/AMD?

Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 11:12:05AM -0700, Linus Torvalds wrote:
 So if one or two cycles in this code doesn't matter, then why are we
 adding alternate instructions just to avoid a few ALU instructions and
 a conditional branch that predicts perfectly? And if it does matter,
 then the 6-byte option looks clearly better..

You know what? I haven't even measured the tree *without* Denys'
stricter RCX canonical-ness patch. All numbers so far are from 4.0+ with
tip/master ontop which has Denys' patch.

And I *should* measure once with plain 4.1-rc1 and then with Denys'
patch ontop to see whether this all jumping-thru-alternatives-hoops is
even worth it.

If nothing else, it was a nice exercise. :-)

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 11:47:30AM -0700, Linus Torvalds wrote:
 On Mon, Apr 27, 2015 at 11:38 AM, Borislav Petkov b...@alien8.de wrote:
 
  So our current NOP-infrastructure does ASM_NOP_MAX NOPs of 8 bytes so
  without more invasive changes, our longest NOPs are 8 byte long and then
  we have to repeat.
 
 Btw (and I'm too lazy to check) do we take alignment into account?
 
 Because if you have to split, and use multiple nops, it is *probably*
 a good idea to try to avoid 16-byte boundaries, since that can be
 the I$ fetch granularity from L1 (although I guess 32B is getting more
 common).

Yeah, on F16h you have 32B fetch but the paths later in the machine
get narrower, so to speak.

 So the exact split might depend on the alignment of the nop replacement..

Yeah, no. Our add_nops() is trivial:

/* Use this to add nops to a buffer, then text_poke the whole buffer. */
static void __init_or_module add_nops(void *insns, unsigned int len)
{
        while (len > 0) {
                unsigned int noplen = len;
                if (noplen > ASM_NOP_MAX)
                        noplen = ASM_NOP_MAX;
                memcpy(insns, ideal_nops[noplen], noplen);
                insns += noplen;
                len -= noplen;
        }
}
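
If we wanted to teach it about fetch boundaries, it could look
something like this - a sketch only, untested, reusing the same
ideal_nops/ASM_NOP_MAX machinery and assuming the caller passes the
target address:

/* sketch: like add_nops(), but don't let a NOP straddle a 16-byte
 * fetch boundary; @addr is where @insns will land in .text */
static void __init_or_module add_nops_aligned(void *insns, unsigned int len,
                                              unsigned long addr)
{
        while (len > 0) {
                unsigned int noplen = len;
                unsigned int room = 16 - (addr & 15); /* bytes to the boundary */

                if (noplen > ASM_NOP_MAX)
                        noplen = ASM_NOP_MAX;
                if (noplen > room) /* end this NOP at the boundary instead */
                        noplen = room;
                memcpy(insns, ideal_nops[noplen], noplen);
                insns += noplen;
                addr += noplen;
                len -= noplen;
        }
}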

 Can we perhaps get rid of the distinction entirely, and just use one
 set of 64-bit nops for both Intel/AMD?

I *think* hpa would have an opinion here. I'm judging by looking at
comments like this one in the code:

/*
 * Due to a decoder implementation quirk, some
 * specific Intel CPUs actually perform better with
 * the k8_nops than with the SDM-recommended NOPs.
 */

which is a fun one in itself. :-)

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 9:12 AM, Denys Vlasenko dvlas...@redhat.com wrote:

 It is smaller, but not by much. It is two instructions smaller.

Ehh. That's _half_.

And on a decoding side, it's the difference between 6 bytes that
decode cleanly and can be decoded in parallel with other things
(assuming the 6-byte nop), and 13 bytes that will need at least 2 nops
(unless you want to do lots of prefixes, which is slow on some cores),
_and_ which is likely big enough that you will basically not be
decoding anything else that cycle.

So on the whole, your "smaller, but not by much" is all relative. It's
a relatively big difference.

So if one or two cycles in this code doesn't matter, then why are we
adding alternate instructions just to avoid a few ALU instructions and
a conditional branch that predicts perfectly? And if it does matter,
then the 6-byte option looks clearly better..

 Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Linus Torvalds
On Mon, Apr 27, 2015 at 9:40 AM, Borislav Petkov b...@alien8.de wrote:

 Either way, the NOPs-version is faster and I'm running the test with the
 F16h-specific NOPs to see how they perform.

Btw, please don't use the "more than three 66h overrides" version.
Sure, that's what the optimization manual suggests if you want
single-instruction decode for all sizes up to 15 bytes, but I think
we're better off with the two-nop case for sizes 12-15 (4-byte nop
followed by 8-11 byte nop).

Because the "more than three 66b prefixes" really performs abysmally
on some cores, iirc.
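
Concretely, that split in byte terms (the byte patterns are the
standard 0F 1F NOP encodings; the pairing itself is just my reading of
the suggestion above):

/* cover 12..15 padding bytes with two NOPs instead of one
 * heavily-prefixed instruction: a 4-byte NOP plus an 8..11-byte NOP
 * (the 8-byte 0f 1f 84 00 form with zero to three 0x66 prefixes) */
static const unsigned char nop4[] = {
        0x0f, 0x1f, 0x40, 0x00                  /* nopl 0x0(%rax) */
};
static const unsigned char nop8[] = {
        0x0f, 0x1f, 0x84, 0x00, 0, 0, 0, 0      /* nopl 0x0(%rax,%rax,1) */
};
/* 12 = nop4 + nop8, 13 = nop4 + 66 nop8, 14 = nop4 + 66 66 nop8,
 * 15 = nop4 + 66 66 66 nop8 */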

   Linus


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Borislav Petkov
On Mon, Apr 27, 2015 at 08:38:54PM +0200, Borislav Petkov wrote:
 I'm running them now and will report numbers relative to the last run
 once it is done. And those numbers should in practice get even better if
 we revert to the simpler canonical-ness check but let's see...

Results are done. New row is F: which is with the F16h NOPs.

With all things equal and with this change ontop:

---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index aef653193160..d713080005ef 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -227,6 +227,14 @@ void __init arch_init_ideal_nops(void)
 #endif
}
break;
+
+   case X86_VENDOR_AMD:
+   if (boot_cpu_data.x86 == 0x16) {
+   ideal_nops = p6_nops;
+   return;
+   }
+
+   /* fall through */
default:
 #ifdef CONFIG_X86_64
ideal_nops = k8_nops;
---

... cycles, instructions, branches, branch-misses, context-switches
drop or remain roughly the same. BUT(!) timings increase.
cpu-clock/task-clock and duration of the workload are all the worst of
all possible cases.

So either those NOPs are not really optimal (i.e., trusting the manuals
and so on :-)) or it is their alignment.

But look at the chapter in the manual - 2.7.2.1 "Encoding Padding for
Loop Alignment" - those NOPs are supposed to be used as padding, so
they themselves will not necessarily be aligned when you use them to
pad stuff.

Or maybe using the longer NOPs is probably worse than the shorter 4-byte
ones with 3 0x66 prefixes, which should flow more easily through the
pipe due to their smaller length.

Or something completely different...

Oh well, enough measurements for today - will do the rc1 measurement
tomorrow.

Thanks.

---
 Performance counter stats for 'system wide' (10 runs):

A:2835570.145246  cpu-clock (msec)  
( +-  0.02% ) [100.00%]
B:2833364.074970  cpu-clock (msec)  
( +-  0.04% ) [100.00%]
C:2834708.335431  cpu-clock (msec)  
( +-  0.02% ) [100.00%]
D:2835055.118431  cpu-clock (msec)  
( +-  0.01% ) [100.00%]
E:2833115.118624  cpu-clock (msec)  
( +-  0.06% ) [100.00%]
F:2835863.670798  cpu-clock (msec)  
( +-  0.02% ) [100.00%]

A:2835570.099981  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]
B:2833364.073633  task-clock (msec) #3.996 CPUs utilized
( +-  0.04% ) [100.00%]
C:2834708.350387  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]
D:2835055.094383  task-clock (msec) #3.996 CPUs utilized
( +-  0.01% ) [100.00%]
E:2833115.145292  task-clock (msec) #3.996 CPUs utilized
( +-  0.06% ) [100.00%]
F:2835863.719556  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]

A: 5,591,213,166,613  cycles#1.972 GHz  
( +-  0.03% ) [75.00%]
B: 5,585,023,802,888  cycles#1.971 GHz  
( +-  0.03% ) [75.00%]
C: 5,587,983,212,758  cycles#1.971 GHz  
( +-  0.02% ) [75.00%]
D: 5,584,838,532,936  cycles#1.970 GHz  
( +-  0.03% ) [75.00%]
E: 5,583,979,727,842  cycles#1.971 GHz  
( +-  0.05% ) [75.00%]
F: 5,581,639,840,197  cycles#1.968 GHz  
( +-  0.03% ) [75.00%]

A: 3,106,707,101,530  instructions  #0.56  insns per cycle  
( +-  0.01% ) [75.00%]
B: 3,106,632,251,528  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
C: 3,106,265,958,142  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
D: 3,106,294,801,185  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
E: 3,106,381,223,355  instructions  #0.56  insns per cycle  
( +-  0.01% ) [75.00%]
F: 3,105,996,162,436  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]

A:   683,676,044,429  branches  #  241.107 M/sec
( +-  0.01% ) [75.00%]
B:   683,670,899,595  branches  #  241.293 M/sec
( +-  0.01% ) [75.00%]
C:   683,675,772,858  branches  #  241.180 M/sec
( +-  0.01% ) [75.00%]
D:   683,683,533,664  branches  #  241.154 M/sec
( +-  0.00% ) [75.00%]

Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-27 Thread Denys Vlasenko
On 04/27/2015 09:11 PM, Borislav Petkov wrote:
 A: 709.528485252 seconds time elapsed 
  ( +-  0.02% )
 B: 708.976557288 seconds time elapsed 
  ( +-  0.04% )
 C: 709.312844791 seconds time elapsed 
  ( +-  0.02% )
 D: 709.400050112 seconds time elapsed 
  ( +-  0.01% )
 E: 708.914562508 seconds time elapsed 
  ( +-  0.06% )
 F: 709.602255085 seconds time elapsed 
  ( +-  0.02% )

That's about 0.2% variance. Very small.

Sounds obvious, but. Did you try running a test several times?
Maybe you are measuring random noise.




Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-26 Thread Andy Lutomirski
>
> diff --git a/arch/x86/include/asm/cpufeature.h 
> b/arch/x86/include/asm/cpufeature.h
> index 7ee9b94d9921..8d555b046fe9 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -265,6 +265,7 @@
>  #define X86_BUG_11AP           X86_BUG(5) /* Bad local APIC aka 11AP */
>  #define X86_BUG_FXSAVE_LEAK    X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
>  #define X86_BUG_CLFLUSH_MONITOR        X86_BUG(7) /* AAI65, CLFLUSH required 
> before MONITOR */
> +#define X86_BUG_CANONICAL_RCX  X86_BUG(8) /* SYSRET #GPs when %RCX 
> non-canonical */

I think that "sysret" should appear in the name.


>
>  #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
>
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 50163fa9034f..109a51815e92 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -159,6 +159,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
> pr_info("Disabling PGE capability bit\n");
> setup_clear_cpu_cap(X86_FEATURE_PGE);
> }
> +
> +   set_cpu_bug(c, X86_BUG_CANONICAL_RCX);

Oh no!  My laptop is currently bug-free, and you're breaking it! :)

>  }
>
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index e952f6bf1d6d..d01fb6c1362f 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -415,16 +415,20 @@ syscall_return:
> jne opportunistic_sysret_failed
>
> /*
> -* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> -* in kernel space.  This essentially lets the user take over
> -* the kernel, since userspace controls RSP.
> -*
>  * If width of "canonical tail" ever becomes variable, this will need
>  * to be updated to remain correct on both old and new CPUs.
>  */
> .ifne __VIRTUAL_MASK_SHIFT - 47
> .error "virtual address width changed -- SYSRET checks need update"
> .endif
> +
> +   /*
> +* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> +* in kernel space.  This essentially lets the user take over
> +* the kernel, since userspace controls RSP.
> +*/
> +   ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
> +

I know it would be ugly, but would it be worth saving two bytes by
using ALTERNATIVE "jmp 1f", "shl ...", ...?

> /* Change top 16 bits to be the sign-extension of 47th bit */
> shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> @@ -432,6 +436,7 @@ syscall_return:
> cmpq%rcx, %r11
> jne opportunistic_sysret_failed
>
> +1:
> cmpq $__USER_CS,CS(%rsp)/* CS must match SYSRET */
> jne opportunistic_sysret_failed

--Andy


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-26 Thread Andy Lutomirski
On Fri, Apr 24, 2015 at 7:17 PM, Denys Vlasenko
 wrote:
> On Fri, Apr 24, 2015 at 10:50 PM, Andy Lutomirski  wrote:
>> On Fri, Apr 24, 2015 at 1:46 PM, Denys Vlasenko
>>>> This might be way more trouble than it's worth.
>>>
>>> Exactly my feeling. What are you trying to save? About four CPU
>>> cycles of checking %ss != __KERNEL_DS on each switch_to?
>>> That's not worth bothering about. Your last patch seems to be perfect.
>>
>> We'll have to do the write to ss almost every time an AMD CPU sleeps
>> in a syscall.
>
> Why do you think so?
> Scheduling from a syscall which decided to block won't require
> writing to %ss, since in this case %ss isn't NULL.
>
> Writing to %ss will happen every time we schedule from an interrupt.
> With timer interrupt every 1 ms it means scheduling at most ~1000
> times per second, if _every_ such interrupt causes task switch.
> This is still not often enough to worry.

OK, you've convinced me.  I still think it can happen much more than
~1k times per second due to hrtimers (see my test case) or things like
page faults, but those are all slow paths.

v2 coming.

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-26 Thread Denys Vlasenko
On Fri, Apr 24, 2015 at 4:18 AM, Andy Lutomirski  wrote:
> On Thu, Apr 23, 2015 at 7:15 PM, Andy Lutomirski  wrote:
> Even if the issue affects SYSRETQ, it could be that we don't care.  If
> the extent of the info leak is whether we context switched during a
> 64-bit syscall to a non-syscall context, then this is basically
> uninteresting.  In that case, we could either ignore the 64-bit issue
> entirely or fix it the other way: force SS to NULL on context switch
> (much faster, I presume) and fix it up before SYSRETL as Denys
> suggested.
>
> We clearly don't have a spate of crashes in programs that do SYSCALL
> from a 64-bit CS and then far jump/return to a 32-bit CS without first
> reloading SS, since this bug has been here forever.  I agree that the
> issue is ugly (if it exists in the first place), but maybe we don't
> need to fix it.

If you feel that concerned by the speed impact,
you can also disable this fix for !CONFIG_IA32_EMULATION kernels.
If the kernel builder declared they don't want 32-bit userspace,
they won't do any ridiculous "I will temporarily
switch to 32-bit mode in 64-bit tasks" stuff either.


perf numbers (was: Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue)

2015-04-26 Thread Borislav Petkov
On Sat, Apr 25, 2015 at 11:12:06PM +0200, Borislav Petkov wrote:
> I've prepended the perf stat output with markers A:, B: or C: for easier
> comparing. The markers mean:
> 
> A: Linus' master from a couple of days ago + tip/master + tip/x86/asm
> B: With Andy's SYSRET patch ontop
> C: Without RCX canonicalness check (see patch at the end).

What was missing is D = B+C results, here they are:

A:2835570.145246  cpu-clock (msec)  
( +-  0.02% ) [100.00%]
B:2833364.074970  cpu-clock (msec)  
( +-  0.04% ) [100.00%]
C:2834708.335431  cpu-clock (msec)  
( +-  0.02% ) [100.00%]
D:2835055.118431  cpu-clock (msec)  
( +-  0.01% ) [100.00%]

A:2835570.099981  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]
B:2833364.073633  task-clock (msec) #3.996 CPUs utilized
( +-  0.04% ) [100.00%]
C:2834708.350387  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]
D:2835055.094383  task-clock (msec) #3.996 CPUs utilized
( +-  0.01% ) [100.00%]

A: 5,591,213,166,613  cycles#1.972 GHz  
( +-  0.03% ) [75.00%]
B: 5,585,023,802,888  cycles#1.971 GHz  
( +-  0.03% ) [75.00%]
C: 5,587,983,212,758  cycles#1.971 GHz  
( +-  0.02% ) [75.00%]
D: 5,584,838,532,936  cycles#1.970 GHz  
( +-  0.03% ) [75.00%]

A: 3,106,707,101,530  instructions  #0.56  insns per cycle  
( +-  0.01% ) [75.00%]
B: 3,106,632,251,528  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
C: 3,106,265,958,142  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
D: 3,106,294,801,185  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]

A:   683,676,044,429  branches  #  241.107 M/sec
( +-  0.01% ) [75.00%]
B:   683,670,899,595  branches  #  241.293 M/sec
( +-  0.01% ) [75.00%]
C:   683,675,772,858  branches  #  241.180 M/sec
( +-  0.01% ) [75.00%]
D:   683,683,533,664  branches  #  241.154 M/sec
( +-  0.00% ) [75.00%]

A:43,829,535,008  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]
B:43,844,118,416  branch-misses #6.41% of all branches  
( +-  0.03% ) [75.00%]
C:43,819,871,086  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]
D:43,795,107,998  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]

A: 2,030,357  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]
B: 2,029,313  context-switches  #0.716 K/sec
( +-  0.05% ) [100.00%]
C: 2,028,566  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]
D: 2,028,895  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]

A:52,421  migrations#0.018 K/sec
( +-  1.13% )
B:52,049  migrations#0.018 K/sec
( +-  1.02% )
C:51,365  migrations#0.018 K/sec
( +-  0.92% )
D:51,766  migrations#0.018 K/sec
( +-  1.11% )

A: 709.528485252 seconds time elapsed   
   ( +-  0.02% )
B: 708.976557288 seconds time elapsed   
   ( +-  0.04% )
C: 709.312844791 seconds time elapsed   
   ( +-  0.02% )
D: 709.400050112 seconds time elapsed   
   ( +-  0.01% )

So in all events except "branches" - which is comprehensible, btw -
we have a minimal net win when looking at how the numbers in A have
improved in D *with* *both* patches applied.

And those numbers are pretty nice IMO - even if the net win is
measurement artefact and not really a win, we certainly don't have a net
loss so unless anyone objects, I'm going to apply both patches but wait
for Andy's v2 with better comments and changed ss_sel test as per Denys'
suggestion.

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" 

perf numbers (was: Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue)

2015-04-26 Thread Borislav Petkov
On Sat, Apr 25, 2015 at 11:12:06PM +0200, Borislav Petkov wrote:
 I've prepended the perf stat output with markers A:, B: or C: for easier
 comparing. The markers mean:
 
 A: Linus' master from a couple of days ago + tip/master + tip/x86/asm
 B: With Andy's SYSRET patch ontop
 C: Without RCX canonicalness check (see patch at the end).

What was missing is D = B+C results, here they are:

A:2835570.145246  cpu-clock (msec)  
( +-  0.02% ) [100.00%]
B:2833364.074970  cpu-clock (msec)  
( +-  0.04% ) [100.00%]
C:2834708.335431  cpu-clock (msec)  
( +-  0.02% ) [100.00%]
D:2835055.118431  cpu-clock (msec)  
( +-  0.01% ) [100.00%]

A:2835570.099981  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]
B:2833364.073633  task-clock (msec) #3.996 CPUs utilized
( +-  0.04% ) [100.00%]
C:2834708.350387  task-clock (msec) #3.996 CPUs utilized
( +-  0.02% ) [100.00%]
D:2835055.094383  task-clock (msec) #3.996 CPUs utilized
( +-  0.01% ) [100.00%]

A: 5,591,213,166,613  cycles#1.972 GHz  
( +-  0.03% ) [75.00%]
B: 5,585,023,802,888  cycles#1.971 GHz  
( +-  0.03% ) [75.00%]
C: 5,587,983,212,758  cycles#1.971 GHz  
( +-  0.02% ) [75.00%]
D: 5,584,838,532,936  cycles#1.970 GHz  
( +-  0.03% ) [75.00%]

A: 3,106,707,101,530  instructions  #0.56  insns per cycle  
( +-  0.01% ) [75.00%]
B: 3,106,632,251,528  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
C: 3,106,265,958,142  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]
D: 3,106,294,801,185  instructions  #0.56  insns per cycle  
( +-  0.00% ) [75.00%]

A:   683,676,044,429  branches  #  241.107 M/sec
( +-  0.01% ) [75.00%]
B:   683,670,899,595  branches  #  241.293 M/sec
( +-  0.01% ) [75.00%]
C:   683,675,772,858  branches  #  241.180 M/sec
( +-  0.01% ) [75.00%]
D:   683,683,533,664  branches  #  241.154 M/sec
( +-  0.00% ) [75.00%]

A:43,829,535,008  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]
B:43,844,118,416  branch-misses #6.41% of all branches  
( +-  0.03% ) [75.00%]
C:43,819,871,086  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]
D:43,795,107,998  branch-misses #6.41% of all branches  
( +-  0.02% ) [75.00%]

A: 2,030,357  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]
B: 2,029,313  context-switches  #0.716 K/sec
( +-  0.05% ) [100.00%]
C: 2,028,566  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]
D: 2,028,895  context-switches  #0.716 K/sec
( +-  0.06% ) [100.00%]

A:52,421  migrations#0.018 K/sec
( +-  1.13% )
B:52,049  migrations#0.018 K/sec
( +-  1.02% )
C:51,365  migrations#0.018 K/sec
( +-  0.92% )
D:51,766  migrations#0.018 K/sec
( +-  1.11% )

A: 709.528485252 seconds time elapsed   
   ( +-  0.02% )
B: 708.976557288 seconds time elapsed   
   ( +-  0.04% )
C: 709.312844791 seconds time elapsed   
   ( +-  0.02% )
D: 709.400050112 seconds time elapsed   
   ( +-  0.01% )

So in all events except branches - which is comprehensible, btw -
we have a minimal net win when looking at how the numbers in A have
improved in D *with* *both* patches applied.

And those numbers are pretty nice IMO - even if the net win is
measurement artefact and not really a win, we certainly don't have a net
loss so unless anyone objects, I'm going to apply both patches but wait
for Andy's v2 with better comments and changed ss_sel test as per Denys'
suggestion.

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the 

Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-26 Thread Denys Vlasenko
On Fri, Apr 24, 2015 at 4:18 AM, Andy Lutomirski l...@amacapital.net wrote:
 On Thu, Apr 23, 2015 at 7:15 PM, Andy Lutomirski l...@kernel.org wrote:
 Even if the issue affects SYSRETQ, it could be that we don't care.  If
 the extent of the info leak is whether we context switched during a
 64-bit syscall to a non-syscall context, then this is basically
 uninteresting.  In that case, we could either ignore the 64-bit issue
 entirely or fix it the other way: force SS to NULL on context switch
 (much faster, I presume) and fix it up before SYSRETL as Denys
 suggested.

 We clearly don't have a spate of crashes in programs that do SYSCALL
 from a 64-bit CS and then far jump/return to a 32-bit CS without first
 reloading SS, since this bug has been here forever.  I agree that the
 issue is ugly (if it exists in the first place), but maybe we don't
 need to fix it.

If you feel that concerned by the speed impact,
you can also disable this fix for !CONFIG_IA32_EMULATION kernels.
If kernel builder declared they don't want 32-bit userspace,
they won't do any ridiculous I will temporarily
switch to 32-bit mode in 64-bit tasks stuff either.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-26 Thread Andy Lutomirski
On Fri, Apr 24, 2015 at 7:17 PM, Denys Vlasenko
vda.li...@googlemail.com wrote:
 On Fri, Apr 24, 2015 at 10:50 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Fri, Apr 24, 2015 at 1:46 PM, Denys Vlasenko
 This might be way more trouble than it's worth.

 Exactly my feeling. What are you trying to save? About four CPU
 cycles of checking %ss != __KERNEL_DS on each switch_to?
 That's not worth bothering about. Your last patch seems to be perfect.

 We'll have to do the write to ss almost every time an AMD CPU sleeps
 in a syscall.

 Why do you think so?
 Scheduling from a syscall which decided to block won't require
 writing to %ss, since in this case %ss isn't NULL.

 Writing to %ss will happen every time we schedule from an interrupt.
 With timer interrupt every 1 ms it means scheduling at most ~1000
 times per second, if _every_ such interrupt causes task switch.
 This is still not often enough to worry.

OK, you've convinced me.  I still think it can happen much more than
~1k times per second due to hrtimers (see my test case) or things like
page faults, but those are all slow paths.

v2 coming.

-- 
Andy Lutomirski
AMA Capital Management, LLC
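
As an aside on the cost question being argued here: the per-switch check
is a read of %ss, and only the rare fixup writes the register. A rough
userspace probe of the two costs -- a sketch, not anything posted in this
thread; the numbers include loop overhead and rdtsc isn't serialized, so
treat them as ballpark:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { N = 1000000 };
	uint16_t sel;
	uint64_t t0, t1, t2;
	int i;

	t0 = rdtsc();
	for (i = 0; i < N; i++)		/* what the check costs */
		asm volatile("mov %%ss, %0" : "=r" (sel));
	t1 = rdtsc();
	for (i = 0; i < N; i++)		/* what the fixup costs */
		asm volatile("mov %0, %%ss" : : "r" (sel));
	t2 = rdtsc();

	printf("read  %%ss: %.1f cycles/iter\n", (double)(t1 - t0) / N);
	printf("write %%ss: %.1f cycles/iter\n", (double)(t2 - t1) / N);
	return 0;
}

(Reloading %ss with its current, valid selector is legal at CPL 3, so
this runs as an ordinary userspace binary.)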


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-26 Thread Andy Lutomirski

> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index 7ee9b94d9921..8d555b046fe9 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -265,6 +265,7 @@
>  #define X86_BUG_11AP            X86_BUG(5) /* Bad local APIC aka 11AP */
>  #define X86_BUG_FXSAVE_LEAK     X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
>  #define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
> +#define X86_BUG_CANONICAL_RCX   X86_BUG(8) /* SYSRET #GPs when %RCX non-canonical */

I think that sysret should appear in the name.

>  #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
>
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 50163fa9034f..109a51815e92 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -159,6 +159,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>  		pr_info("Disabling PGE capability bit\n");
>  		setup_clear_cpu_cap(X86_FEATURE_PGE);
>  	}
> +
> +	set_cpu_bug(c, X86_BUG_CANONICAL_RCX);

Oh no!  My laptop is currently bug-free, and you're breaking it! :)

>  }
>
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index e952f6bf1d6d..d01fb6c1362f 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -415,16 +415,20 @@ syscall_return:
>  	jne opportunistic_sysret_failed
>
>  	/*
> -	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> -	 * in kernel space.  This essentially lets the user take over
> -	 * the kernel, since userspace controls RSP.
> -	 *
>  	 * If width of canonical tail ever becomes variable, this will need
>  	 * to be updated to remain correct on both old and new CPUs.
>  	 */
>  	.ifne __VIRTUAL_MASK_SHIFT - 47
>  	.error "virtual address width changed -- SYSRET checks need update"
>  	.endif
> +
> +	/*
> +	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> +	 * in kernel space.  This essentially lets the user take over
> +	 * the kernel, since userspace controls RSP.
> +	 */
> +	ALTERNATIVE "jmp 1f", "", X86_BUG_CANONICAL_RCX
> +

I know it would be ugly, but would it be worth saving two bytes by
using ALTERNATIVE "jmp 1f", "shl ...", ...?

>  	/* Change top 16 bits to be the sign-extension of 47th bit */
>  	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>  	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> @@ -432,6 +436,7 @@ syscall_return:
>  	cmpq	%rcx, %r11
>  	jne	opportunistic_sysret_failed
>
> +1:
>  	cmpq	$__USER_CS,CS(%rsp)	/* CS must match SYSRET */
>  	jne	opportunistic_sysret_failed

--Andy
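
For readers following the asm: the shl/sar pair that the ALTERNATIVE
jumps over is just a sign-extension round-trip. The test it implements,
sketched in C with the 48-bit virtual address width hard-coded exactly
as the .ifne above insists (and relying, as kernel code does, on GCC's
arithmetic right shift of signed values):

#include <stdbool.h>
#include <stdint.h>

/*
 * What the shl/sar pair computes: shifting the top 16 bits out and
 * sign-extending them back reproduces the input iff bits 63..48 are
 * all copies of bit 47, i.e. iff the address is canonical.  The asm
 * then compares the shifted-back %rcx against the unmodified copy of
 * the return RIP in %r11 and takes the slow IRET path on mismatch.
 */
static bool rip_is_canonical(uint64_t rip)
{
	return (uint64_t)(((int64_t)rip << 16) >> 16) == rip;
}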


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-25 Thread Borislav Petkov
On Thu, Apr 23, 2015 at 07:15:01PM -0700, Andy Lutomirski wrote:
> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
> with SS == 0 results in an invalid usermode state in which SS is
> apparently equal to __USER_DS but causes #SS if used.
> 
> Work around the issue by replacing NULL SS values with __KERNEL_DS
> in __switch_to, thus ensuring that SYSRET never happens with SS set
> to NULL.
> 
> This was exposed by a recent vDSO cleanup.
> 
> Fixes: e7d6eefaaa44 ("x86/vdso32/syscall.S: Do not load __USER32_DS to %ss")
> Signed-off-by: Andy Lutomirski 
> ---
> 
> Tested only on Intel, which isn't very interesting.  I'll tidy up
> and send a test case, too, once Borislav confirms that it works.

So I did some benchmarking today. Custom kernel build measured with perf
stat, 10 builds with --pre doing

$ cat pre-build-kernel.sh
make -s clean
echo 3 > /proc/sys/vm/drop_caches

$ cat measure.sh
EVENTS="cpu-clock,task-clock,cycles,instructions,branches,branch-misses,context-switches,migrations"
perf stat -e $EVENTS --sync -a --repeat 10 --pre ~/kernel/pre-build-kernel.sh \
make -s -j64

I've prepended the perf stat output with markers A:, B: or C: for easier
comparing. The markers mean:

A: Linus' master from a couple of days ago + tip/master + tip/x86/asm
B: With Andy's SYSRET patch ontop
C: Without RCX canonicalness check (see patch at the end).

Numbers are from an AMD F16h box:

A:   2835570.145246  cpu-clock (msec)                          ( +-  0.02% ) [100.00%]
B:   2833364.074970  cpu-clock (msec)                          ( +-  0.04% ) [100.00%]
C:   2834708.335431  cpu-clock (msec)                          ( +-  0.02% ) [100.00%]

This is interesting - the SYSRET SS fix makes it minimally better and
the C-patch is a bit worse again. Net win is 861 msec, almost a second,
oh well.

A:   2835570.099981  task-clock (msec)  #  3.996 CPUs utilized  ( +-  0.02% ) [100.00%]
B:   2833364.073633  task-clock (msec)  #  3.996 CPUs utilized  ( +-  0.04% ) [100.00%]
C:   2834708.350387  task-clock (msec)  #  3.996 CPUs utilized  ( +-  0.02% ) [100.00%]

Similar thing observable here.

A:   5,591,213,166,613  cycles  #  1.972 GHz  ( +-  0.03% ) [75.00%]
B:   5,585,023,802,888  cycles  #  1.971 GHz  ( +-  0.03% ) [75.00%]
C:   5,587,983,212,758  cycles  #  1.971 GHz  ( +-  0.02% ) [75.00%]

Net win is a 3,229,953,855-cycle drop.

A:   3,106,707,101,530  instructions  #  0.56  insns per cycle  ( +-  0.01% ) [75.00%]
B:   3,106,632,251,528  instructions  #  0.56  insns per cycle  ( +-  0.00% ) [75.00%]
C:   3,106,265,958,142  instructions  #  0.56  insns per cycle  ( +-  0.00% ) [75.00%]

This looks like it would make sense - instruction count drops from A -> B -> C.

A:   683,676,044,429  branches  #  241.107 M/sec  ( +-  0.01% ) [75.00%]
B:   683,670,899,595  branches  #  241.293 M/sec  ( +-  0.01% ) [75.00%]
C:   683,675,772,858  branches  #  241.180 M/sec  ( +-  0.01% ) [75.00%]

Also makes sense - the C patch adds an unconditional JMP over the
RCX-canonicalness check.

A:   43,829,535,008  branch-misses  #  6.41% of all branches  ( +-  0.02% ) [75.00%]
B:   43,844,118,416  branch-misses  #  6.41% of all branches  ( +-  0.03% ) [75.00%]
C:   43,819,871,086  branch-misses  #  6.41% of all branches  ( +-  0.02% ) [75.00%]

And this is nice, branch misses are the smallest with C, cool. It makes
sense again - the C patch adds an unconditional JMP which doesn't miss.

A:   2,030,357  context-switches  #  0.716 K/sec  ( +-  0.06% ) [100.00%]
B:   2,029,313  context-switches  #  0.716 K/sec  ( +-  0.05% ) [100.00%]
C:   2,028,566  context-switches  #  0.716 K/sec  ( +-  0.06% ) [100.00%]

Those look good.

A:   52,421  migrations  #  0.018 K/sec  ( +-  1.13% )
B:   52,049  migrations  #  0.018 K/sec  ( +-  1.02% )
C:   51,365  migrations  #  0.018 K/sec  ( +-  0.92% )

Same here.

A:   709.528485252 seconds time elapsed  ( +-  0.02% )
B:   708.976557288 seconds time elapsed  ( +-  0.04% )
C:   709.312844791 seconds time elapsed  ( +-  0.02% )

Interestingly, the unconditional JMP kinda 

Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-24 Thread Denys Vlasenko
On Fri, Apr 24, 2015 at 10:50 PM, Andy Lutomirski  wrote:
> On Fri, Apr 24, 2015 at 1:46 PM, Denys Vlasenko
>>> This might be way more trouble than it's worth.
>>
>> Exactly my feeling. What are you trying to save? About four CPU
>> cycles of checking %ss != __KERNEL_DS on each switch_to?
>> That's not worth bothering about. Your last patch seems to be perfect.
>
> We'll have to do the write to ss almost every time an AMD CPU sleeps
> in a syscall.

Why do you think so?
Scheduling from a syscall which decided to block won't require
writing to %ss, since in this case %ss isn't NULL.

Writing to %ss will happen every time we schedule from an interrupt.
With timer interrupt every 1 ms it means scheduling at most ~1000
times per second, if _every_ such interrupt causes task switch.
This is still not often enough to worry.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-24 Thread H. Peter Anvin
On 04/24/2015 01:50 PM, Andy Lutomirski wrote:
>>
>> Exactly my feeling. What are you trying to save? About four CPU
>> cycles of checking %ss != __KERNEL_DS on each switch_to?
>> That's not worth bothering about. Your last patch seems to be perfect.
> 
> We'll have to do the write to ss almost every time an AMD CPU sleeps
> in a syscall.  Maybe that's still not a big deal.
> 

Once you're sleeping anyway, I don't think so.

-hpa




Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-24 Thread Linus Torvalds
On Fri, Apr 24, 2015 at 1:21 PM, Andy Lutomirski  wrote:
>
> 2. SYSRETQ.  The only way that I know of to see the problem is SYSRETQ
> followed by a far jump or return.  This is presumably *extremely*
> rare.
>
> What if we fixed #2 up in do_stack_segment.  We should double-check
> the docs, but I think that this will only ever manifest as #SS(0) with
> regs->ss == __USER_DS and !user_mode_64bit(regs).

Hmm. It smells a bit "too clever" for me, and in particular, I think
you need a good test-case for this. But yeah, I guess that gets things
out of any possibly critical paths.

That said, I wouldn't even be sure about the SS(0). The rules about
when you get SS and when you get GP are odd.

Also, you need to check what happens when somebody does something like

movl $-1,%eax
ss ; movl (%eax),%eax

because I think that gets a #SS(0) too with the situation you expect
to be "unique", because the address wraps.. I dunno.

So care and testing needed. I think the scheduler approach is a *lot*
more obvious, quite frankly.

  Linus
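
A probe along the lines Linus asks for might look like the sketch below.
It is only a sketch: build it with gcc -m32 so the 4-byte load at
0xffffffff actually wraps, and note that the signal mapping assumed in
the messages (the kernel has traditionally delivered #SS as SIGBUS and
#GP/#PF as SIGSEGV) is itself one of the things to double-check:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void handler(int sig)
{
	/* async-signal-safe reporting only */
	if (sig == SIGBUS) {
		static const char m[] = "SIGBUS (likely #SS)\n";
		write(1, m, sizeof(m) - 1);
	} else {
		static const char m[] = "SIGSEGV (likely #GP or #PF)\n";
		write(1, m, sizeof(m) - 1);
	}
	_exit(0);
}

int main(void)
{
	int scratch;

	signal(SIGBUS, handler);
	signal(SIGSEGV, handler);

	/* Linus' example: an SS-relative 4-byte load at -1, which wraps */
	asm volatile("movl $-1, %%eax\n\t"
		     "movl %%ss:(%%eax), %%eax"
		     : "=a" (scratch) : : "memory");

	return 1;	/* not reached if the access faults */
}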


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-24 Thread Andy Lutomirski
On Fri, Apr 24, 2015 at 1:46 PM, Denys Vlasenko
 wrote:
> On Fri, Apr 24, 2015 at 10:21 PM, Andy Lutomirski  wrote:
>> On Thu, Apr 23, 2015 at 7:15 PM, Andy Lutomirski  wrote:
>>> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
>>> with SS == 0 results in an invalid usermode state in which SS is
>>> apparently equal to __USER_DS but causes #SS if used.
>>>
>>> Work around the issue by replacing NULL SS values with __KERNEL_DS
>>> in __switch_to, thus ensuring that SYSRET never happens with SS set
>>> to NULL.
>>>
>>> This was exposed by a recent vDSO cleanup.
>>>
>>> Fixes: e7d6eefaaa44 ("x86/vdso32/syscall.S: Do not load __USER32_DS to %ss")
>>> Signed-off-by: Andy Lutomirski 
>>> ---
>>>
>>> Tested only on Intel, which isn't very interesting.  I'll tidy up
>>> and send a test case, too, once Borislav confirms that it works.
>>>
>>> Please don't actually apply this until we're sure we understand the
>>> scope of the issue.  If this doesn't affect SYSRETQ, then we might
>>> want to fix it up before SYSRETL to avoid impacting 64-bit processes
>>> at all.
>>>
>>
>> After sleeping on it, I think I want to offer a different, more
>> complicated approach.  AFAIK there are really only two ways that this
>> issue can be visible:
>>
>> 1. SYSRETL.  We can fix that up in the AMD SYSRETL path.  I think
>> there's a decent argument that that path is less performance-critical
>> than context switches.
>>
>> 2. SYSRETQ.  The only way that I know of to see the problem is SYSRETQ
>> followed by a far jump or return.  This is presumably *extremely*
>> rare.
>>
>> What if we fixed #2 up in do_stack_segment.  We should double-check
>> the docs, but I think that this will only ever manifest as #SS(0) with
>> regs->ss == __USER_DS and !user_mode_64bit(regs).  We need to avoid
>> infinite retry loops, but this might be okay.  I think that #SS(0)
>> from userspace under those conditions can *only* happen as a result of
>> this issue.  Even if not, we could come up with a way to only retry
>> once per syscall (e.g. set some ti->status flag in the 64-bit syscall
>> path on AMD and clear it in do_stack_segment).
>>
>> This might be way more trouble than it's worth.
>
> Exactly my feeling. What are you trying to save? About four CPU
> cycles of checking %ss != __KERNEL_DS on each switch_to?
> That's not worth bothering about. Your last patch seems to be perfect.

We'll have to do the write to ss almost every time an AMD CPU sleeps
in a syscall.  Maybe that's still not a big deal.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-24 Thread Denys Vlasenko
On Fri, Apr 24, 2015 at 10:21 PM, Andy Lutomirski  wrote:
> On Thu, Apr 23, 2015 at 7:15 PM, Andy Lutomirski  wrote:
>> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
>> with SS == 0 results in an invalid usermode state in which SS is
>> apparently equal to __USER_DS but causes #SS if used.
>>
>> Work around the issue by replacing NULL SS values with __KERNEL_DS
>> in __switch_to, thus ensuring that SYSRET never happens with SS set
>> to NULL.
>>
>> This was exposed by a recent vDSO cleanup.
>>
>> Fixes: e7d6eefaaa44 ("x86/vdso32/syscall.S: Do not load __USER32_DS to %ss")
>> Signed-off-by: Andy Lutomirski 
>> ---
>>
>> Tested only on Intel, which isn't very interesting.  I'll tidy up
>> and send a test case, too, once Borislav confirms that it works.
>>
>> Please don't actually apply this until we're sure we understand the
>> scope of the issue.  If this doesn't affect SYSRETQ, then we might
>> want to fix it up before SYSRETL to avoid impacting 64-bit processes
>> at all.
>>
>
> After sleeping on it, I think I want to offer a different, more
> complicated approach.  AFAIK there are really only two ways that this
> issue can be visible:
>
> 1. SYSRETL.  We can fix that up in the AMD SYSRETL path.  I think
> there's a decent argument that that path is less performance-critical
> than context switches.
>
> 2. SYSRETQ.  The only way that I know of to see the problem is SYSRETQ
> followed by a far jump or return.  This is presumably *extremely*
> rare.
>
> What if we fixed #2 up in do_stack_segment.  We should double-check
> the docs, but I think that this will only ever manifest as #SS(0) with
> regs->ss == __USER_DS and !user_mode_64bit(regs).  We need to avoid
> infinite retry loops, but this might be okay.  I think that #SS(0)
> from userspace under those conditions can *only* happen as a result of
> this issue.  Even if not, we could come up with a way to only retry
> once per syscall (e.g. set some ti->status flag in the 64-bit syscall
> path on AMD and clear it in do_stack_segment).
>
> This might be way more trouble than it's worth.

Exactly my feeling. What are you trying to save? About four CPU
cycles of checking %ss != __KERNEL_DS on each switch_to?
That's not worth bothering about. Your last patch seems to be perfect.


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-24 Thread Andy Lutomirski
On Thu, Apr 23, 2015 at 7:15 PM, Andy Lutomirski  wrote:
> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
> with SS == 0 results in an invalid usermode state in which SS is
> apparently equal to __USER_DS but causes #SS if used.
>
> Work around the issue by replacing NULL SS values with __KERNEL_DS
> in __switch_to, thus ensuring that SYSRET never happens with SS set
> to NULL.
>
> This was exposed by a recent vDSO cleanup.
>
> Fixes: e7d6eefaaa44 ("x86/vdso32/syscall.S: Do not load __USER32_DS to %ss")
> Signed-off-by: Andy Lutomirski 
> ---
>
> Tested only on Intel, which isn't very interesting.  I'll tidy up
> and send a test case, too, once Borislav confirms that it works.
>
> Please don't actually apply this until we're sure we understand the
> scope of the issue.  If this doesn't affect SYSRETQ, then we might
> want to fix it up before SYSRETL to avoid impacting 64-bit processes
> at all.
>

After sleeping on it, I think I want to offer a different, more
complicated approach.  AFAIK there are really only two ways that this
issue can be visible:

1. SYSRETL.  We can fix that up in the AMD SYSRETL path.  I think
there's a decent argument that that path is less performance-critical
than context switches.

2. SYSRETQ.  The only way that I know of to see the problem is SYSRETQ
followed by a far jump or return.  This is presumably *extremely*
rare.

What if we fixed #2 up in do_stack_segment.  We should double-check
the docs, but I think that this will only ever manifest as #SS(0) with
regs->ss == __USER_DS and !user_mode_64bit(regs).  We need to avoid
infinite retry loops, but this might be okay.  I think that #SS(0)
from userspace under those conditions can *only* happen as a result of
this issue.  Even if not, we could come up with a way to only retry
once per syscall (e.g. set some ti->status flag in the 64-bit syscall
path on AMD and clear it in do_stack_segment).

This might be way more trouble than it's worth.  For one thing, we
need to be careful with the IRET fixup.  Ick.  So maybe this should be
written off as my useless ramblings.

NB: I suspect that all of this is irrelevant on Xen.  Xen does its own
thing wrt sysret.

--Andy
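
For concreteness, the fixup Andy describes might be shaped like the
sketch below. Everything in it is hypothetical: TS_SYSRET_SS is an
invented flag (set in the 64-bit syscall path on AMD, cleared here,
exactly as he suggests), user_64bit_mode() is the tree's spelling of
the check he names, and mainline never took this route -- the
__switch_to fix won:

/* Hypothetical do_stack_segment() fixup, encoding the conditions above. */
dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
{
	struct thread_info *ti = current_thread_info();

	if (user_mode(regs) && error_code == 0 &&
	    regs->ss == __USER_DS && !user_64bit_mode(regs) &&
	    (ti->status & TS_SYSRET_SS)) {
		/*
		 * Plausibly AMD's stale-SS-attributes case: SS reads as
		 * __USER_DS but its cached attributes are unusable.
		 * Clear the flag so a second #SS can't loop, then just
		 * return: the IRET back to userspace reloads SS from
		 * regs->ss with fresh attributes and retries the
		 * faulting instruction.
		 */
		ti->status &= ~TS_SYSRET_SS;
		return;
	}

	/* otherwise, the normal #SS handling (signal or die) */
	do_error_trap(regs, error_code, "stack segment", X86_TRAP_SS, SIGBUS);
}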


Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

2015-04-24 Thread Borislav Petkov
On Fri, Apr 24, 2015 at 12:59:06PM +0200, Borislav Petkov wrote:
> Yeah, that makes more sense. So I tested Andy's patch but changed it as
> above and I get
> 
> $ taskset -c 0 ./sysret_ss_attrs_32
> [RUN]   Syscalls followed by SS validation
> [OK]    We survived

Andy, you wanted the 64-bit version too:

$ taskset -c 0 ./sysret_ss_attrs_64
[RUN]   Syscalls followed by SS validation
[OK]    We survived
$ taskset -c 0 ./sysret_ss_attrs_64
[RUN]   Syscalls followed by SS validation
[OK]    We survived

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
  1   2   >