Re: [PATCH 2/2] arch/x86: arch/sparc: tools/perf: fix typos in comments
On 4/8/21 7:28 PM, Thomas Tai wrote: s/insted/instead/ s/maintaing/maintaining/ Signed-off-by: Thomas Tai --- arch/sparc/vdso/vdso2c.c | 2 +- arch/x86/entry/vdso/vdso2c.c | 2 +- arch/x86/kernel/cpu/intel.c | 2 +- tools/perf/arch/x86/util/perf_regs.c | 4 ++-- 4 files changed, 5 insertions(+), 5 deletions(-) Reviewed-by: Alexandre Chartre alex.
Re: [PATCH 1/2] x86/traps: call cond_local_irq_disable before returning from exc_general_protection and math_error
On 4/8/21 7:28 PM, Thomas Tai wrote: This fixes commit 334872a09198 ("x86/traps: Attempt to fixup exceptions in vDSO before signaling") which added return statements without calling cond_local_irq_disable(). According to commit ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code"), cond_local_irq_disable() is needed because the ASM return code no longer disables interrupts. Follow the existing code as an example to use "goto exit" instead of "return" statement. Signed-off-by: Thomas Tai --- arch/x86/kernel/traps.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Reviewed-by: Alexandre Chartre And it is probably worth adding a 'Fixes:' tag: Fixes: 334872a09198 ("x86/traps: Attempt to fixup exceptions in vDSO before signaling") alex.
Re: [for-stable-4.19 PATCH 1/2] vmlinux.lds.h: Create section for protection against instrumentation
On 3/19/21 11:39 AM, Greg Kroah-Hartman wrote: On Fri, Mar 19, 2021 at 07:54:15AM +0800, Nicolas Boichat wrote: From: Thomas Gleixner commit 655389433e7efec589838b400a2a652b3ffa upstream. Some code paths, especially the low level entry code, must be protected against instrumentation for various reasons: - Low level entry code can be a fragile beast, especially on x86. - With NO_HZ_FULL RCU state needs to be established before using it. Having a dedicated section for such code allows validating with tooling that no unsafe functions are invoked. Add the .noinstr.text section and the noinstr attribute to mark functions. noinstr implies notrace. Kprobes will gain a section check later. Also provide a set of markers: instrumentation_begin()/end(). These are used to mark code inside a noinstr function which calls into regular instrumentable text sections as safe. The instrumentation markers are only active when CONFIG_DEBUG_ENTRY is enabled as the end marker emits a NOP to prevent the compiler from merging the annotation points. This means the objtool verification requires a kernel compiled with this option. Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Acked-by: Peter Zijlstra Link: https://lkml.kernel.org/r/20200505134100.075416...@linutronix.de [Nicolas: context conflicts in: arch/powerpc/kernel/vmlinux.lds.S include/asm-generic/vmlinux.lds.h include/linux/compiler.h include/linux/compiler_types.h] Signed-off-by: Nicolas Boichat

Did you build this on x86? I get the following build error:

  ld:./arch/x86/kernel/vmlinux.lds:20: syntax error

And that line looks like:

  . = ALIGN(8);
  *(.text.hot .text.hot.*)
  *(.text .text.fixup)
  *(.text.unlikely .text.unlikely.*)
  *(.text.unknown .text.unknown.*)
  . = ALIGN(8);
  __noinstr_text_start = .;
  *(.__attribute__((noinline)) __attribute__((no_instrument_function)) __attribute((__section__(".noinstr.text"))).text)
  __noinstr_text_end = .;
  *(.text..refcount)
  *(.ref.text)
  *(.meminit.text*)
  *(.memexit.text*)

In the NOINSTR_TEXT macro, noinstr is expanded with the value of the noinstr macro from linux/compiler_types.h while it shouldn't be. The problem is possibly that the noinstr macro is defined for assembly. Make sure that the macro is not defined for assembly, e.g.:

  #ifndef __ASSEMBLY__
  /* Section for code which can't be instrumented at all */
  #define noinstr \
          noinline notrace __attribute((__section__(".noinstr.text")))
  #endif

alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/19/20 8:10 PM, Thomas Gleixner wrote: On Mon, Nov 16 2020 at 19:10, Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Coming late, but this does not make any sense to me. Unless you map most of the kernel into the user page-table, sleeping with the user page-table _cannot_ work. And if you do that you broke KPTI. You cannot pick arbitrary points in the C code of an exception handler to switch to the kernel mapping unless you mapped everything which might be touched before that into user space. How is that supposed to work? Sorry, I mixed up a few things; I got confused with my own code, which is not a good sign... It's not sleeping with the user page-table which, as you mentioned, doesn't make sense; it's sleeping with the kernel page-table but with the PTI stack. Basically, it is:
- entering C code with (user page-table, PTI stack);
- then it switches to the kernel page-table, so we have (kernel page-table, PTI stack);
- and then it switches to the kernel stack, so we have (kernel page-table, kernel stack).
As this is all C code, some of which is executed with the PTI stack, we need the PTI stack to be per-task so that the stack is preserved in case that C code does a sleep/schedule (no matter if this happens when using the PTI stack or the kernel stack). alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/19/20 5:06 PM, Andy Lutomirski wrote: On Thu, Nov 19, 2020 at 4:06 AM Alexandre Chartre wrote: On 11/19/20 9:05 AM, Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper into the kernel handler, where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable.
I finally remember why I have introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need a per-task stack to enter (and return from) the C handler, as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code, before and after calling the C function handler (also called from assembly). The noinstr part of the C entry code won't sleep. But the handler itself can sleep, and if it does we will need to preserve the trampoline stack (even if we switch to the per-task kernel stack to execute the handler). Example:

#define DEFINE_IDTENTRY(func)                                         \
static __always_inline void __##func(struct pt_regs *regs);           \
                                                                      \
__visible noinstr void func(struct pt_regs *regs)                     \
{                                                                     \
        irqentry_state_t state;          -+                           \
                                          |                           \
        user_pagetable_escape(regs);      |  use trampoline stack (1) \
        state = irqentry_enter(regs);     |                           \
        instrumentation_begin();         -+                           \
        run_idt(__##func, regs);         |===|  run __func() on kernel stack (this can sleep)
        instrumentation_end();           -+                           \
        irqentry_exit(regs, state);       |  use trampoline stack (2) \
        user_pagetable_return(regs);     -+                           \
}

Between (1) and (2) we need to preserve and use the same trampoline stack in case __func() went sleeping. Why? Right now, we have the percpu entry stack, and we do just fine if we enter on one percpu stack and exit from a different one. We would need to call from asm to C on the entry stack, return back to asm, and then switch stacks. That's the problem: I didn't want to return back to asm, so that the pagetable switch can be done anywhere in the C handler. So yes, returning to asm to switch the stack is the solution if we want to avoid having a per-task trampoline stack. The drawback is that this forces doing the page-table switch at the beginning and end of the handler; the pagetable switch cannot be moved deeper down into the C handler.
But that's probably a good first step (effectively just moving CR3 switch to C without adding per-task trampoline stack). I will update the patches to do that, and we can defer the per-task trampoline stack to later if there's an effective need for it. That might not be a good first step after all... Calling CR3 switch C functions from assembly introduces extra pt_regs copies between the trampoline stack and the kernel stack. Currently when entering syscall, we immediately sw
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/19/20 9:05 AM, Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper into the kernel handler, where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable. I finally remember why I have introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler.
To do so, we need a per-task stack to enter (and return from) the C handler, as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code, before and after calling the C function handler (also called from assembly). The noinstr part of the C entry code won't sleep. But the handler itself can sleep, and if it does we will need to preserve the trampoline stack (even if we switch to the per-task kernel stack to execute the handler). Example:

#define DEFINE_IDTENTRY(func)                                         \
static __always_inline void __##func(struct pt_regs *regs);           \
                                                                      \
__visible noinstr void func(struct pt_regs *regs)                     \
{                                                                     \
        irqentry_state_t state;          -+                           \
                                          |                           \
        user_pagetable_escape(regs);      |  use trampoline stack (1) \
        state = irqentry_enter(regs);     |                           \
        instrumentation_begin();         -+                           \
        run_idt(__##func, regs);         |===|  run __func() on kernel stack (this can sleep)
        instrumentation_end();           -+                           \
        irqentry_exit(regs, state);       |  use trampoline stack (2) \
        user_pagetable_return(regs);     -+                           \
}

Between (1) and (2) we need to preserve and use the same trampoline stack in case __func() went sleeping. Why? Right now, we have the percpu entry stack, and we do just fine if we enter on one percpu stack and exit from a different one. We would need to call from asm to C on the entry stack, return back to asm, and then switch stacks. That's the problem: I didn't want to return back to asm, so that the pagetable switch can be done anywhere in the C handler. So yes, returning to asm to switch the stack is the solution if we want to avoid having a per-task trampoline stack. The drawback is that this forces doing the page-table switch at the beginning and end of the handler; the pagetable switch cannot be moved deeper down into the C handler. But that's probably a good first step (effectively just moving the CR3 switch to C without adding a per-task trampoline stack).
I will update the patches to do that, and we can defer the per-task trampoline stack to later if there's an effective need for it. That might not be a good first step after all... Calling CR3 switch C functions from assembly introduces extra pt_regs copies between the trampoline stack and the kernel stack. Currently, when entering a syscall, we immediately switch CR3 and build pt_regs directly on the kernel stack. On return, registers are restored from pt_regs from the k
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/19/20 2:49 AM, Andy Lutomirski wrote: On Tue, Nov 17, 2020 at 8:59 AM Alexandre Chartre wrote: On 11/17/20 4:52 PM, Andy Lutomirski wrote: On Tue, Nov 17, 2020 at 7:07 AM Alexandre Chartre wrote: On 11/16/20 7:34 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper into the kernel handler, where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack.
The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable. I finally remember why I have introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need a per-task stack to enter (and return from) the C handler, as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code, before and after calling the C function handler (also called from assembly). The noinstr part of the C entry code won't sleep. But the handler itself can sleep, and if it does we will need to preserve the trampoline stack (even if we switch to the per-task kernel stack to execute the handler). Example:

#define DEFINE_IDTENTRY(func)                                         \
static __always_inline void __##func(struct pt_regs *regs);           \
                                                                      \
__visible noinstr void func(struct pt_regs *regs)                     \
{                                                                     \
        irqentry_state_t state;          -+                           \
                                          |                           \
        user_pagetable_escape(regs);      |  use trampoline stack (1) \
        state = irqentry_enter(regs);     |                           \
        instrumentation_begin();         -+                           \
        run_idt(__##func, regs);         |===|  run __func() on kernel stack (this can sleep)
        instrumentation_end();           -+                           \
        irqentry_exit(regs, state);       |  use trampoline stack (2) \
        user_pagetable_return(regs);     -+                           \
}

Between (1) and (2) we need to preserve and use the same trampoline stack in case __func() went sleeping. Why? Right now, we have the percpu entry stack, and we do just fine if we enter on one percpu stack and exit from a different one. We would need to call from asm to C on the entry stack, return back to asm, and then switch stacks. That's the problem: I didn't want to return back to asm, so that the pagetable switch can be done anywhere in the C handler. So yes, returning to asm to switch the stack is the solution if we want to avoid having a per-task trampoline stack.
The drawback is that this forces doing the page-table switch at the beginning and end of the handler; the pagetable switch cannot be moved deeper down into the C handler. But that's probably a good first step (effectively just moving the CR3 switch to C without adding a per-task trampoline stack). I will update the patches to do that, and we can defer the per-task trampoline stack to later if there's an effective need for it. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 12:29 PM, Borislav Petkov wrote: On Wed, Nov 18, 2020 at 08:41:42AM +0100, Alexandre Chartre wrote: Well, it looks like I wrongly assumed that KPTI was a well-known performance overhead since it was introduced (because it adds extra page-table switches), but you are right, I should be presenting my own numbers. Here's one recipe, courtesy of Mel: https://github.com/gormanm/mmtests Thanks for the detailed information, I have run the test and I see the same difference as with the tools/perf and libMICRO results I already sent: there's a 150% difference for getpid() with and without pti. alex.
-
# ../../compare-kernels.sh --baseline test-nopti --compare test-pti

poundsyscall
                        test                  test
                       nopti                   pti
Min       2      1.99 (   0.00%)      5.08 (-155.28%)
Min       4      1.02 (   0.00%)      2.60 (-154.90%)
Min       6      0.94 (   0.00%)      2.07 (-120.21%)
Min       8      0.81 (   0.00%)      1.60 ( -97.53%)
Min       12     0.85 (   0.00%)      1.65 ( -94.12%)
Min       18     0.82 (   0.00%)      1.61 ( -96.34%)
Min       24     0.81 (   0.00%)      1.60 ( -97.53%)
Min       30     0.81 (   0.00%)      1.60 ( -97.53%)
Min       32     0.81 (   0.00%)      1.60 ( -97.53%)
Amean     2      2.02 (   0.00%)      5.10 *-151.83%*
Amean     4      1.03 (   0.00%)      2.61 *-151.98%*
Amean     6      0.96 (   0.00%)      2.07 *-116.74%*
Amean     8      0.82 (   0.00%)      1.60 * -96.56%*
Amean     12     0.87 (   0.00%)      1.67 * -91.73%*
Amean     18     0.82 (   0.00%)      1.63 * -97.94%*
Amean     24     0.81 (   0.00%)      1.60 * -97.41%*
Amean     30     0.82 (   0.00%)      1.60 * -96.93%*
Amean     32     0.82 (   0.00%)      1.60 * -96.56%*
Stddev    2      0.02 (   0.00%)      0.02 (  33.78%)
Stddev    4      0.01 (   0.00%)      0.01 (   7.18%)
Stddev    6      0.01 (   0.00%)      0.00 (  68.77%)
Stddev    8      0.01 (   0.00%)      0.01 (  10.56%)
Stddev    12     0.01 (   0.00%)      0.02 ( -12.69%)
Stddev    18     0.01 (   0.00%)      0.01 (-107.25%)
Stddev    24     0.00 (   0.00%)      0.00 ( -14.56%)
Stddev    30     0.01 (   0.00%)      0.01 (   0.00%)
Stddev    32     0.01 (   0.00%)      0.00 (  20.00%)
CoeffVar  2      1.17 (   0.00%)      0.31 (  73.70%)
CoeffVar  4      0.82 (   0.00%)      0.30 (  63.16%)
CoeffVar  6      1.41 (   0.00%)      0.20 (  85.59%)
CoeffVar  8      0.87 (   0.00%)      0.39 (  54.50%)
CoeffVar  12     1.66 (   0.00%)      0.98 (  41.23%)
CoeffVar  18     0.85 (   0.00%)      0.89 (  -4.71%)
CoeffVar  24     0.52 (   0.00%)      0.30 (  41.97%)
CoeffVar  30     0.65 (   0.00%)      0.33 (  49.22%)
CoeffVar  32     0.65 (   0.00%)      0.26 (  59.30%)
Max       2      2.04 (   0.00%)      5.13 (-151.47%)
Max       4      1.04 (   0.00%)      2.62 (-151.92%)
Max       6      0.98 (   0.00%)      2.08 (-112.24%)
Max       8      0.83 (   0.00%)      1.62 ( -95.18%)
Max       12     0.89 (   0.00%)      1.70 ( -91.01%)
Max       18     0.84 (   0.00%)      1.66 ( -97.62%)
Max       24     0.82 (   0.00%)      1.61 ( -96.34%)
Max       30     0.82 (   0.00%)      1.61 ( -96.34%)
Max       32     0.82 (   0.00%)      1.61 ( -96.34%)
BAmean-50 2      2.01 (   0.00%)      5.09 (-153.39%)
BAmean-50 4      1.03 (   0.00%)      2.60 (-152.62%)
BAmean-50 6      0.95 (   0.00%)      2.07 (-118.82%)
BAmean-50 8      0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-50 12     0.86 (   0.00%)      1.66 ( -92.79%)
BAmean-50 18     0.82 (   0.00%)      1.62 ( -97.56%)
BAmean-50 24     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-50 30     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-50 32     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-95 2      2.02 (   0.00%)      5.09 (-151.87%)
BAmean-95 4      1.03 (   0.00%)      2.61 (-151.99%)
BAmean-95 6      0.95 (   0.00%)      2.07 (-117.25%)
BAmean-95 8      0.81 (   0.00%)      1.60 ( -96.72%)
BAmean-95 12     0.87 (   0.00%)      1.67 ( -91.82%)
BAmean-95 18     0.82 (   0.00%)      1.63 ( -97.97%)
BAmean-95 24     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-95 30     0.81 (   0.00%)      1.60 ( -97.00%)
BAmean-95 32     0.81 (   0.00%)      1.60 ( -96.59%)
BAmean-99 2      2.02 (   0.00%)      5.09 (-151.87%)
BAmean-99 4      1.03 (   0.00%)      2.61 (-151.99%)
BAmean-99 6      0.95 (   0.00%)      2.07 (-117.25%)
BAmean-99 8      0.81 (   0.00%)      1.60 ( -96.72%)
BAmean-99 12     0.87 (   0.00%)      1.67 ( -91.82%)
BAmean-99 18     0.82 (   0.00%)      1.63 ( -97.97%)
BAmean-99 24     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-99 30     0.8
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 2:22 PM, David Laight wrote: From: Alexandre Chartre Sent: 18 November 2020 10:30 ... Correct, this RFC is not changing the overhead. However, it is a step forward for being able to execute some selected syscalls or interrupt handlers without switching to the kernel page-table. The next step would be to identify and add the necessary mappings to the user page-table so that specified syscalls can be executed without switching the page-table. Remember that without PTI user space can read all kernel memory. (I'm not 100% sure you can force a cache-line read.) It isn't even that slow. (Even I can understand how it works.) So if you are worried about user space doing that you can't really run anything on the user page tables. Yes, without PTI, userspace can read all kernel memory. But to run some parts of the kernel you don't need to have all kernel mappings. Also, a lot of the kernel contains non-sensitive information which can be safely exposed to userspace. So there's probably some room for running carefully selected syscalls with the user page-table (and hopefully useful ones). System calls like getpid() are irrelevant - they aren't used (much). Even the time of day ones are implemented in the VDSO without a context switch. getpid()/getppid() is interesting because it shows the amount of overhead PTI is adding. But the impact can be more important if some TLB flushing is also required (as you mentioned below). So the overheads come from other system calls that 'do work' without actually sleeping. I'm guessing things like read, write, sendmsg, recvmsg. The only interesting system call I can think of is futex. As well as all the calls that return immediately because the mutex has been released while entering the kernel, I suspect that being pre-empted by a different thread (of the same process) doesn't actually need CR3 reloading (without PTI). I also suspect that it isn't just the CR3 reload that costs.
There could (depending on the cpu) be associated TLB and/or cache invalidations that have a much larger effect on programs with large working sets than on simple benchmark programs. Right, the TLB flush is mitigated with PCID, but the impact is larger if there's no PCID. Now bits of data that you are 'more worried about' could be kept in physical memory that isn't normally mapped (or referenced by a TLB) and only mapped when needed. But that doesn't help the general case. Note that having syscalls which can be executed without switching the page-table is just one benefit you can get from this RFC. But the main benefit is for integrating Address Space Isolation (ASI), which will be much more complex if ASI has to plug into the current assembly CR3 switch. Thanks, alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 10:30 AM, David Laight wrote: From: Alexandre Chartre Sent: 18 November 2020 07:42 On 11/17/20 10:26 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote: Some benchmarks are available, in particular from phoronix: What I was expecting was benchmarks *you* have run which show that perf penalty, not something one can find quickly on the internet and something one cannot always reproduce her-/himself. You do know that presenting convincing numbers with a patchset greatly improves its chances of getting it upstreamed, right? Well, it looks like I wrongly assumed that KPTI was a well-known performance overhead since it was introduced (because it adds extra page-table switches), but you are right, I should be presenting my own numbers. IIRC the penalty comes from the page table switch. Doing it at a different time is unlikely to make much difference. Correct, this RFC is not changing the overhead. However, it is a step forward for being able to execute some selected syscalls or interrupt handlers without switching to the kernel page-table. The next step would be to identify and add the necessary mappings to the user page-table so that specified syscalls can be executed without switching the page-table. For some workloads the penalty is massive - getting on for 50%. We are still using old kernels on AWS. Here are some micro benchmarks of the getppid and getpid syscalls which highlight the PTI overhead. This uses the kernel tools/perf command, and the getpid command from libMICRO (https://github.com/redhat-performance/libMicro):

system running 5.10-rc4 booted with nopti:
--
# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10,000,000 getppid() calls
     Total time: 0.792 [sec]

       0.079223 usecs/op
       12622549 ops/sec

# getpid -B 10
              prc  thr   usecs/call  samples  errors  cnt/samp
getpid         1    1      0.08029      102       0        10

We can see that the getpid and getppid syscalls have the same execution time, around 0.08 usecs.
These syscalls are very small and just return a value, so the time is mostly spent entering/exiting the kernel.

same system booted with pti:
--
# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10,000,000 getppid() calls
     Total time: 2.025 [sec]

       0.202527 usecs/op
        4937605 ops/sec

# getpid -B 10
              prc  thr   usecs/call  samples  errors  cnt/samp
getpid         1    1      0.20241      102       0        10

With PTI, the execution time jumps to 0.20 usecs (+0.12 usecs = +150%). That's a very extreme case because these are very small syscalls, and in that case the overhead of switching page-tables is significant compared to the execution time of the syscall. So with an overhead of +0.12 usecs per syscall, the PTI impact is significant for workloads which use a lot of short syscalls. But if you use longer syscalls, for example with an average execution time of 2.0 usecs per syscall, then you have a lower overhead of 6%. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 10:26 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote: Some benchmarks are available, in particular from phoronix: What I was expecting was benchmarks *you* have run which show that perf penalty, not something one can find quickly on the internet and something one cannot always reproduce her-/himself. You do know that presenting convincing numbers with a patchset greatly improves its chances of getting it upstreamed, right? Well, it looks like I wrongly assumed that KPTI was a well-known performance overhead since it was introduced (because it adds extra page-table switches), but you are right, I should be presenting my own numbers. Thanks, alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 10:23 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 08:02:51PM +0100, Alexandre Chartre wrote: No. This prevents the guest VM from gathering data from the host kernel on the same cpu-thread. But there's no mitigation for a guest VM running on a cpu-thread attacking another cpu-thread (which can be running another guest VM or the host kernel) from the same cpu-core. You cannot use flush/clear barriers because the two cpu-threads are running in parallel. Now there's your justification for why you're doing this. It took a while... The "why" should always be part of the 0th message to provide reviewers/maintainers with answers to the question of what this pile of patches is all about. Please always add this rationale to your patchset in the future. Sorry about that, I will definitely try to do better next time. :-} Thanks, alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 7:28 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote: Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at the moment. In particular, this allows a guest VM to attack another guest VM or the host kernel running on a sibling cpu-thread. Core Scheduling will mitigate the guest-to-guest attack but not the guest-to-host attack. I see in vmx_vcpu_enter_exit(): /* L1D Flush includes CPU buffer clear to mitigate MDS */ if (static_branch_unlikely(&vmx_l1d_should_flush)) vmx_l1d_flush(vcpu); else if (static_branch_unlikely(&mds_user_clear)) mds_clear_cpu_buffers(); Is that not enough? No. This prevents the guest VM from gathering data from the host kernel on the same cpu-thread. But there's no mitigation for a guest VM running on a cpu-thread attacking another cpu-thread (which can be running another guest VM or the host kernel) from the same cpu-core. You cannot use flush/clear barriers because the two cpu-threads are running in parallel. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 6:07 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 09:19:01AM +0100, Alexandre Chartre wrote: We are not reversing PTI, we are extending it. You're reversing it in the sense that you're mapping more kernel memory into the user page table than what is mapped now. PTI removes all kernel mappings from the user page-table. However, there's no issue with mapping some kernel data into the user page-table as long as this data contains no sensitive information. I hope that is the case. Actually, PTI is already doing that, but with a very limited scope. PTI adds into the user page-table some kernel mappings which are needed for userland to enter the kernel (such as the kernel entry text, the ESPFIX, the CPU_ENTRY_AREA_BASE...). So here, we are extending the PTI mapping so that we can execute more kernel code while using the user page-table; it's a kind of PTI on steroids. And this is what bothers me - someone else might come after you and say, but but, I need to map more stuff into the user pgt because I wanna do X... and so on. Agree, any addition should be strictly checked. I have been careful to expand it to the minimum I needed. The minimum size would be 1 page (4KB) as this is the minimum mapping size. It's certainly enough for now as the usage of the PTI stack is limited, but we will need a larger stack if we want to execute more kernel code with the user page-table. So on a big machine with a million tasks, that's at least a million pages more which is what, ~4 Gb? There better be a very good justification for the additional memory consumption... Yeah, adding a per-task allocation is my main concern, hence this RFC. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 5:55 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 08:56:23AM +0100, Alexandre Chartre wrote: The main goal of ASI is to provide KVM address space isolation to mitigate guest-to-host speculative attacks like L1TF or MDS. Because the current L1TF and MDS mitigations are lacking or why? Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at the moment. In particular, this allows a guest VM to attack another guest VM or the host kernel running on a sibling cpu-thread. Core Scheduling will mitigate the guest-to-guest attack but not the guest-to-host attack. Address Space Isolation provides a mitigation for the guest-to-host attack. The current proposal of ASI is plugged into the CR3 switch assembly macro, which makes the code brittle and complex (see [1]). I also expect this might help with some other ideas like having syscalls (or interrupt handlers) which can run without switching the page-table. I still fail to see why we need all that. I read, "this does this and that" but I don't read "the current problem is this" and "this is our suggested solution for it". So what is the issue which needs addressing in the current kernel which is going to justify adding all that code? The main issue this is trying to address is that the CR3 switch is currently done in assembly code from contexts which are very restrictive: the CR3 switch is often done when only one or two registers are available for use, sometimes no stack is available. For example, the syscall entry switches CR3 with a single register available (%sp) and no stack. Because of this, it is fairly tricky to expand the logic for switching CR3. This is a problem that we have faced while implementing Address Space Isolation (ASI) where we need extra logic to drive the page-table switch. We have successfully implemented ASI with the current CR3 switching assembly code, but this requires complex assembly constructions.
Hence this proposal to defer CR3 switching to C code so that it can be more easily expanded. Hopefully this can also contribute to making the assembly entry code less complex, and be beneficial to other projects. PTI has a measured overhead of roughly 5% for most workloads, but it can be much higher in some cases. "it can be"? Where? Actual use case? Some benchmarks are available, in particular from phoronix: https://www.phoronix.com/scan.php?page=article&item=linux-more-x86pti https://www.phoronix.com/scan.php?page=news_item&px=x86-PTI-Initial-Gaming-Tests https://www.phoronix.com/scan.php?page=article&item=linux-kpti-kvm https://medium.com/@loganaden/linux-kpti-performance-hit-on-real-workloads-8da185482df3 The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged directly into the CR3 switch assembly macro. We are working on a new implementation, based on these changes, which avoids having to deal with assembly code and makes the implementation more robust. This still doesn't answer my questions. I read a lot of "could be used for" formulations but I still don't know why we need that. So what is the problem that the kernel currently has which you're trying to address with this? Hopefully this is clearer with the answer I provided above. Thanks, alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/17/20 4:52 PM, Andy Lutomirski wrote: On Tue, Nov 17, 2020 at 7:07 AM Alexandre Chartre wrote: On 11/16/20 7:34 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings, which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper in the kernel handler where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack.
The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable. I finally remember why I introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need a per-task stack to enter (and return from) the C handler as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code before and after calling the C function handler (also called from assembly). The noinstr part of the C entry code won't sleep. But the noinstr part of the handler can sleep, and if it does we will need to preserve the trampoline stack (even if we switch to the per-task kernel stack to execute the noinstr part). Example:

#define DEFINE_IDTENTRY(func)						\
static __always_inline void __##func(struct pt_regs *regs);		\
									\
__visible noinstr void func(struct pt_regs *regs)			\
{									\
	irqentry_state_t state;				-+		\
							 |		\
	user_pagetable_escape(regs);			 | use trampoline stack (1)
	state = irqentry_enter(regs);			 |		\
	instrumentation_begin();			-+		\
	run_idt(__##func, regs);			|===| run __func() on kernel stack (this can sleep)
	instrumentation_end();				-+		\
	irqentry_exit(regs, state);			 | use trampoline stack (2)
	user_pagetable_return(regs);			-+		\
}

Between (1) and (2) we need to preserve and use the same trampoline stack in case __func() went to sleep. alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/16/20 7:34 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings, which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper in the kernel handler where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable.
I finally remember why I introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need a per-task stack to enter (and return from) the C handler as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code before and after calling the C function handler (also called from assembly). alex.
Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings
On 11/17/20 12:06 AM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 12:18 PM Alexandre Chartre wrote: On 11/16/20 8:48 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre wrote: Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, per cpu offsets (__per_cpu_offset, which is used in some entry code), the stack canary, and the PTI stack (which is defined per task). Does anything unmap the PTI stack? Mapping is easy, and unmapping could be a pretty big mess. No, there's no unmap. The mapping exists as long as the task page-table does (i.e. as long as the task mm exists). I assume that the task stack and mm are freed at the same time but that's not something I have checked. Nope. A multi-threaded mm will free task stacks when the task exits, but the mm may outlive the individual tasks. Additionally, if you allocate page tables as part of mapping PTI stacks, you need to make sure the pagetables are freed. So I think I just need to unmap the PTI stack from the user page-table when the task exits. Everything else is handled because the kernel and PTI stack are allocated in a single chunk (referenced by task->stack). Finally, you need to make sure that the PTI stacks have appropriate guard pages -- just doubling the allocation is not safe enough. The PTI stack does have guard pages because it maps only a part of the task stack into the user page-table, so pages around the PTI stack are not mapped into the user-pagetable (the page below is the task stack guard, and the page above is part of the kernel-only stack so it's never mapped into the user page-table).
+ *    +-------------+
+ *    |             |   ^                     ^
+ *    | kernel-only |   | KERNEL_STACK_SIZE   |
+ *    |    stack    |   |                     |
+ *    |             |   V                     |
+ *    +-------------+ <- top of kernel stack  | THREAD_SIZE
+ *    |             |   ^                     |
+ *    | kernel and  |   | KERNEL_STACK_SIZE   |
+ *    |  PTI stack  |   |                     |
+ *    |             |   V                     v
+ *    +-------------+ <- top of stack

My intuition is that this is going to be far more complexity than is justified. Sounds like only the PTI stack unmap is missing, which is hopefully not that bad. I will check that. alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/16/20 10:24 PM, David Laight wrote: From: Alexandre Chartre Sent: 16 November 2020 18:10 On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. Isn't that going to allocate a lot more kernel memory? That's one of my concerns, hence this RFC. The current code is doubling the task stack (this was an easy solution), so that's +8KB per task. See my reply to Boris, it has a bit more details. alex. ISTR some thoughts about using dynamically allocated kernel stacks when (at least some) wakeups are done by directly restarting the system call - so that the sleeping thread doesn't even need a kernel stack. (I can't remember if that was linux or one of the BSDs) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/16/20 9:24 PM, Borislav Petkov wrote: On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote: Deferring CR3 switch to C code means that we need to run more of the kernel entry code with the user page-table. To do so, we need to: - map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code); - map additional data used in the entry code (such as the stack canary); - run more entry code on the trampoline stack (which is mapped both in the kernel and in the user page-table) until we switch to the kernel page-table and then switch to the kernel stack; So PTI was added exactly to *not* have kernel memory mapped in the user page table. You're partially reversing that... We are not reversing PTI, we are extending it. PTI removes all kernel mappings from the user page-table. However there's no issue with mapping some kernel data into the user page-table as long as these data have no sensitive information. Actually, PTI is already doing that but with a very limited scope. PTI adds into the user page-table some kernel mappings which are needed for userland to enter the kernel (such as the kernel entry text, the ESPFIX, the CPU_ENTRY_AREA_BASE...). So here, we are extending the PTI mapping so that we can execute more kernel code while using the user page-table; it's a kind of PTI on steroids. - have a per-task trampoline stack instead of a per-cpu trampoline stack, so the task can be scheduled out while it hasn't switched to the kernel stack. per-task? How much more memory is that per task? Currently, this is done by doubling the size of the task stack (patch 8), so that's an extra 8KB. Half of the stack is used as the regular kernel stack, and the other half is used as the PTI stack:

+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ *    +-------------+
+ *    |             |   ^                     ^
+ *    | kernel-only |   | KERNEL_STACK_SIZE   |
+ *    |    stack    |   |                     |
+ *    |             |   V                     |
+ *    +-------------+ <- top of kernel stack  | THREAD_SIZE
+ *    |             |   ^                     |
+ *    | kernel and  |   | KERNEL_STACK_SIZE   |
+ *    |  PTI stack  |   |                     |
+ *    |             |   V                     v
+ *    +-------------+ <- top of stack
+ */

The minimum size would be 1 page (4KB) as this is the minimum mapping size. It's certainly enough for now as the usage of the PTI stack is limited, but we will need a larger stack if we want to execute more kernel code with the user page-table. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/16/20 9:17 PM, Borislav Petkov wrote: On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote: This RFC proposes to defer the PTI CR3 switch until we reach C code. The benefit is that this simplifies the assembly entry code, and makes the PTI CR3 switch code easier to understand. This also paves the way for further possible projects such as an easier integration of Address Space Isolation (ASI), or the possibility to execute some selected syscall or interrupt handlers without switching to the kernel page-table What for? What is this going to be used for in the end? In addition to simplifying the assembly entry code, this will also simplify the integration of Address Space Isolation (ASI) which will certainly be the primary beneficiary of this change. The main goal of ASI is to provide KVM address space isolation to mitigate guest-to-host speculative attacks like L1TF or MDS. The current proposal of ASI is plugged into the CR3 switch assembly macro, which makes the code brittle and complex (see [1]). I also expect this might help with some other ideas like having syscalls (or interrupt handlers) which can run without switching the page-table. (and thus avoid the PTI page-table switch overhead). Overhead of how much? Why do we care? PTI has a measured overhead of roughly 5% for most workloads, but it can be much higher in some cases. The overhead is mostly due to the page-table switch (even with PCID) so if we can run a syscall or an interrupt handler without switching the page-table then we can get this kind of performance back. What is the big picture justification for this diffstat 21 files changed, 874 insertions(+), 314 deletions(-) and the diffstat for the ASI enablement? The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged directly into the CR3 switch assembly macro. We are working on a new implementation, based on these changes, which avoids having to deal with assembly code and makes the implementation more robust. alex.
[1] ASI RFCv4 - https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.char...@oracle.com/
Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings
On 11/16/20 8:48 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre wrote: Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, per cpu offsets (__per_cpu_offset, which is used in some entry code), the stack canary, and the PTI stack (which is defined per task). Does anything unmap the PTI stack? Mapping is easy, and unmapping could be a pretty big mess. No, there's no unmap. The mapping exists as long as the task page-table does (i.e. as long as the task mm exists). I assume that the task stack and mm are freed at the same time but that's not something I have checked. alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/16/20 7:34 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings, which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper in the kernel handler where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable. Where is the code that allocates and frees these stacks hiding?
I think I should at least read it. Stacks are allocated/freed with the task stack, this code is unchanged (see alloc_thread_stack_node()). The trick is that I have doubled the THREAD_SIZE (patch 8 "x86/pti: Introduce per-task PTI trampoline stack"). Half the stack is used as the kernel stack (mapped only in the kernel page-table), the other half is used as the PTI stack (mapped in the kernel and user page-table). The mapping to the user page-table is done in mm_map_task() in fork.c (patch 11 "x86/pti: Extend PTI user mappings"). alex.
Re: [RFC][PATCH v2 21/21] x86/pti: Use a different stack canary with the user and kernel page-table
On 11/16/20 5:56 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:48 AM Alexandre Chartre wrote: Using stack protector requires the stack canary to be mapped into the current page-table. Now that the page-table switch between the user and kernel page-table is deferred to C code, stack protector can be used while the user page-table is active and so the stack canary is mapped into the user page-table. To prevent leaking the stack canary used with the kernel page-table, use a different canary with the user and kernel page-table. The stack canary is changed when switching the page-table. Unless I've missed something, this doesn't have the security properties we want. One CPU can be executing with kernel CR3, and another CPU can read the stack canary using Meltdown. I think you are right because we have the mapping to the stack canary in the user page-table. From userspace, we will only read the user stack canary, but using Meltdown we can speculatively read the kernel stack canary which will be stored at the same place. I think that doing this safely requires mapping a different page with the stack canary in the two pagetables. Right. alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings, which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper in the kernel handler where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. alex.
[RFC][PATCH v2 20/21] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
With PTI, syscall/interrupt/exception entries switch the CR3 register to change the page-table in assembly code. Move the CR3 register switch inside the C code of syscall/interrupt/exception entry handlers.

Signed-off-by: Alexandre Chartre
---
 arch/x86/entry/common.c             | 15 ---
 arch/x86/entry/entry_64.S           | 23 +--
 arch/x86/entry/entry_64_compat.S    | 22 --
 arch/x86/include/asm/entry-common.h | 13 +
 arch/x86/include/asm/idtentry.h     | 25 -
 arch/x86/kernel/cpu/mce/core.c      |  2 ++
 arch/x86/kernel/nmi.c               |  2 ++
 arch/x86/kernel/traps.c             |  6 ++
 arch/x86/mm/fault.c                 |  9 +++--
 9 files changed, 67 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1aba02ecb806..6ef5afc42b82 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs,
 		regs->ax = 0;
 	}
 	syscall_exit_to_user_mode(regs);
+	user_pagetable_enter();
 }

 static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
@@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
+	user_pagetable_exit();
 	nr = syscall_enter_from_user_mode(regs, nr);

 	instrumentation_begin();
@@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 	instrumentation_end();
 	syscall_exit_to_user_mode(regs);
+	user_pagetable_enter();
 }
 #endif

 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
 static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
 {
+	user_pagetable_exit();
 	if (IS_ENABLED(CONFIG_IA32_EMULATION))
 		current_thread_info()->status |= TS_COMPAT;
@@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 	do_syscall_32_irqs_on(regs, nr);

 	syscall_exit_to_user_mode(regs);
+	user_pagetable_enter();
 }

-static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
+static noinstr bool __do_fast_syscall_32(struct pt_regs *regs, long nr)
 {
-	unsigned int nr = syscall_32_enter(regs);
 	int res;

 	/*
@@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
 __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 {
+	unsigned int nr = syscall_32_enter(regs);
+	bool syscall_done;
+
 	/*
 	 * Called using the internal vDSO SYSENTER/SYSCALL32 calling
 	 * convention. Adjust regs so it looks like we entered using int80.
@@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 	regs->ip = landing_pad;

 	/* Invoke the syscall. If it failed, keep it simple: use IRET. */
-	if (!__do_fast_syscall_32(regs))
+	syscall_done = __do_fast_syscall_32(regs, nr);
+	user_pagetable_enter();
+	if (!syscall_done)
 		return 0;

 #ifdef CONFIG_X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1715bc0cefff..b7d9a019d001 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64)
 	swapgs
 	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp

 SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
@@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	 */
 syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
-	POP_REGS pop_rdi=0 skip_r11rcx=1
+	POP_REGS skip_r11rcx=1

 	/*
-	 * We are on the trampoline stack. All regs except RDI are live.
+	 * We are on the trampoline stack. All regs except RSP are live.
 	 * We can do future final exit work right here.
 	 */
 	STACKLEAK_ERASE_NOCLOBBER

-	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-
-	popq	%rdi
 	movq	RSP-ORIG_RAX(%rsp), %rsp
 	USERGS_SYSRET64
 SYM_CODE_END(entry_SYSCALL_64)
@@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork)
 	swapgs
 	cld
 	FENCE_SWAPGS_USER_ENTRY
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -594,19 +588,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 	ud2
 1:
 #endif
-	POP_REGS pop_rdi=0
+	POP_REGS
+	addq
[RFC][PATCH v2 14/21] x86/pti: Execute IDT handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to the userland through the stack. For now, this only changes IDT handlers which have no argument other than the pt_regs registers.

Signed-off-by: Alexandre Chartre
---
 arch/x86/include/asm/idtentry.h | 43 +++--
 arch/x86/kernel/cpu/mce/core.c  |  2 +-
 arch/x86/kernel/traps.c         |  4 +--
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4b4aca2b1420..3595a31947b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -10,10 +10,49 @@
 #include
 #include
+#include

 bool idtentry_enter_nmi(struct pt_regs *regs);
 void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);

+/*
+ * The CALL_ON_STACK_* macros call the specified function either directly
+ * if no stack is provided, or on the specified stack.
+ */
+#define CALL_ON_STACK_1(stack, func, arg1)				\
+	((stack) ?							\
+	 asm_call_on_stack_1(stack,					\
+		(void (*)(void))(func), (void *)(arg1)) :		\
+	 func(arg1))
+
+/*
+ * Functions to return the top of the kernel stack if we are using the
+ * user page-table (and thus not running with the kernel stack). If we
+ * are using the kernel page-table (and so already using the kernel
+ * stack) then it returns NULL.
+ */
+static __always_inline void *pti_kernel_stack(struct pt_regs *regs)
+{
+	unsigned long stack;
+
+	if (pti_enabled() && user_mode(regs)) {
+		stack = (unsigned long)task_top_of_kernel_stack(current);
+		return (void *)(stack - 8);
+	} else {
+		return NULL;
+	}
+}
+
+/*
+ * Wrappers to run an IDT handler on the kernel stack if we are not
+ * already using this stack.
+ */
+static __always_inline
+void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs)
+{
+	CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs);
+}
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -55,7 +94,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
-	__##func (regs);						\
+	run_idt(__##func, regs);					\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
 }									\
@@ -271,7 +310,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	instrumentation_begin();					\
 	__irq_enter_raw();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
-	__##func (regs);						\
+	run_idt(__##func, regs);					\
 	__irq_exit_raw();						\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4102b866e7c0..9407c3cd9355 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2035,7 +2035,7 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
 	unsigned long dr7;

 	dr7 = local_db_save();
-	exc_machine_check_user(regs);
+	run_idt(exc_machine_check_user, regs);
 	local_db_restore(dr7);
 }
 #else
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 09b22a611d99..5161385b3670 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -257,7 +257,7 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 	state = irqentry_enter(regs);
 	instrumentation_begin();
-	handle_invalid_op(regs);
+	run_idt(handle_invalid_op, regs);
 	instrumentation_end();
 	irqentry_exit(regs, state);
 }
@@ -647,7 +647,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 	if (user_mode(regs)) {
 		irqentry_enter_from_user_mode(regs);
 		instrumentation_begin();
-		do_int3_user(regs);
+		run_idt(do_int3_us
[RFC][PATCH v2 19/21] x86/pti: Defer CR3 switch to C code for IST entries
IST entries from the kernel use paranoid entry and exit assembly functions to ensure the CR3 and GS registers are updated with correct values for the kernel. Move the update of the CR3 inside the C code of IST handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 34 ++ arch/x86/kernel/cpu/mce/core.c | 3 +++ arch/x86/kernel/nmi.c | 18 +++--- arch/x86/kernel/sev-es.c | 13 - arch/x86/kernel/traps.c| 30 ++ 5 files changed, 58 insertions(+), 40 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6b88a0eb8975..1715bc0cefff 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -900,23 +900,6 @@ SYM_CODE_START_LOCAL(paranoid_entry) PUSH_AND_CLEAR_REGS save_ret=1 ENCODE_FRAME_POINTER 8 - /* -* Always stash CR3 in %r14. This value will be restored, -* verbatim, at exit. Needed if paranoid_entry interrupted -* another entry that already switched to the user CR3 value -* but has not yet returned to userspace. -* -* This is also why CS (stashed in the "iret frame" by the -* hardware at entry) can not be used: this may be a return -* to kernel code, but with a user CR3 value. -* -* Switching CR3 does not depend on kernel GSBASE so it can -* be done before switching to the kernel GSBASE. This is -* required for FSGSBASE because the kernel GSBASE has to -* be retrieved from a kernel internal table. -*/ - SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14 - /* * Handling GSBASE depends on the availability of FSGSBASE. * @@ -956,9 +939,7 @@ SYM_CODE_START_LOCAL(paranoid_entry) SWAPGS /* -* The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an -* unconditional CR3 write, even in the PTI case. So do an lfence -* to prevent GS speculation, regardless of whether PTI is enabled. +* Do an lfence prevent GS speculation. */ FENCE_SWAPGS_KERNEL_ENTRY @@ -989,14 +970,10 @@ SYM_CODE_END(paranoid_entry) SYM_CODE_START_LOCAL(paranoid_exit) UNWIND_HINT_REGS /* -* The order of operations is important. 
RESTORE_CR3 requires -* kernel GSBASE. -* * NB to anyone to try to optimize this code: this code does * not execute at all for exceptions from user mode. Those * exceptions go through error_exit instead. */ - RESTORE_CR3 scratch_reg=%rax save_reg=%r14 /* Handle the three GSBASE cases */ ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE @@ -1119,10 +1096,6 @@ SYM_CODE_END(error_return) /* * Runs on exception stack. Xen PV does not go through this path at all, * so we can use real assembly here. - * - * Registers: - * %r14: Used to save/restore the CR3 of the interrupted context - * when PAGE_TABLE_ISOLATION is in use. Do not clobber. */ SYM_CODE_START(asm_exc_nmi) /* @@ -1173,7 +1146,7 @@ SYM_CODE_START(asm_exc_nmi) * We also must not push anything to the stack before switching * stacks lest we corrupt the "NMI executing" variable. */ - ist_entry_user exc_nmi + ist_entry_user exc_nmi_user /* NMI from kernel */ @@ -1385,9 +1358,6 @@ end_repeat_nmi: movq$-1, %rsi callexc_nmi - /* Always restore stashed CR3 value (see paranoid_entry) */ - RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 - /* * The above invocation of paranoid_entry stored the GSBASE * related information in R/EBX depending on the availability diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 9407c3cd9355..31ac01c1155d 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -2022,11 +2022,14 @@ static __always_inline void exc_machine_check_user(struct pt_regs *regs) /* MCE hit kernel mode */ DEFINE_IDTENTRY_MCE(exc_machine_check) { + unsigned long saved_cr3; unsigned long dr7; + saved_cr3 = save_and_switch_to_kernel_cr3(); dr7 = local_db_save(); exc_machine_check_kernel(regs); local_db_restore(dr7); + restore_cr3(saved_cr3); } /* The user mode variant. 
*/ diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index be0f654c3095..523d88c3fea1 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state); static DEFINE_PER_CPU(unsigned long, nmi_cr2); static DEFINE_PER_CPU(unsigned long, nmi_dr7); -DEFINE_IDTENTRY_RAW(exc_nmi) +static noinstr void handle_nmi(struct pt_regs *regs) { bool irq_state; @@
[RFC][PATCH v2 09/21] x86/pti: Function to clone page-table entries from a specified mm
PTI has a function to clone page-table entries but only from the init_mm page-table. Provide a new function to clone page-table entries from a specified mm page-table. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/pti.h | 10 ++ arch/x86/mm/pti.c | 32 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 07375b476c4f..5484e69ff8d3 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -4,9 +4,19 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_PAGE_TABLE_ISOLATION + +enum pti_clone_level { + PTI_CLONE_PMD, + PTI_CLONE_PTE, +}; + +struct mm_struct; + extern void pti_init(void); extern void pti_check_boottime_disable(void); extern void pti_finalize(void); +extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level); #else static inline void pti_check_boottime_disable(void) { } #endif diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 1aab92930569..ebc8cd2f1cd8 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void) static void __init pti_setup_vsyscall(void) { } #endif -enum pti_clone_level { - PTI_CLONE_PMD, - PTI_CLONE_PTE, -}; - -static void -pti_clone_pgtable(unsigned long start, unsigned long end, - enum pti_clone_level level) +void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level) { unsigned long addr; @@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end, if (addr < start) break; - pgd = pgd_offset_k(addr); + pgd = pgd_offset(mm, addr); if (WARN_ON(pgd_none(*pgd))) return; p4d = p4d_offset(pgd, addr); @@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end, } } +static void pti_clone_init_pgtable(unsigned long start, unsigned long end, + enum pti_clone_level level) +{ + pti_clone_pgtable(&init_mm, start, end, level); +} + #ifdef 
CONFIG_X86_64 /* * Clone a single p4d (i.e. a top-level entry on 4-level systems and a @@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void) start = CPU_ENTRY_AREA_BASE; end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES); - pti_clone_pgtable(start, end, PTI_CLONE_PMD); + pti_clone_init_pgtable(start, end, PTI_CLONE_PMD); } #endif /* CONFIG_X86_64 */ @@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void) */ static void pti_clone_entry_text(void) { - pti_clone_pgtable((unsigned long) __entry_text_start, - (unsigned long) __entry_text_end, - PTI_CLONE_PMD); + pti_clone_init_pgtable((unsigned long) __entry_text_start, + (unsigned long) __entry_text_end, + PTI_CLONE_PMD); } /* @@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void) * pti_set_kernel_image_nonglobal() did to clear the * global bit. */ - pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); + pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); /* -* pti_clone_pgtable() will set the global bit in any PMDs -* that it clones, but we also need to get any PTEs in +* pti_clone_init_pgtable() will set the global bit in any +* PMDs that it clones, but we also need to get any PTEs in * the last level for areas that are not huge-page-aligned. */ -- 2.18.4
[RFC][PATCH v2 06/21] x86/pti: Provide C variants of PTI switch CR3 macros
Page Table Isolation (PTI) uses assembly macros to switch the CR3 register between kernel and user page-tables. Add C functions which implement the same features. For now, these C functions are not used but they will eventually replace the assembly macros. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/entry-common.h | 127 1 file changed, 127 insertions(+) diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 6fe54b2813c1..46682b1433a4 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -7,6 +7,7 @@ #include #include #include +#include /* Check that the stack and regs on entry from user mode are sane. */ static __always_inline void arch_check_user_regs(struct pt_regs *regs) @@ -81,4 +82,130 @@ static __always_inline void arch_exit_to_user_mode(void) } #define arch_exit_to_user_mode arch_exit_to_user_mode +#ifndef MODULE +#ifdef CONFIG_PAGE_TABLE_ISOLATION + +/* + * PAGE_TABLE_ISOLATION PGDs are 8k. Flip bit 12 to switch between the two + * halves: + */ +#define PTI_USER_PGTABLE_BIT PAGE_SHIFT +#define PTI_USER_PGTABLE_MASK (1 << PTI_USER_PGTABLE_BIT) +#define PTI_USER_PCID_BIT X86_CR3_PTI_PCID_USER_BIT +#define PTI_USER_PCID_MASK (1 << PTI_USER_PCID_BIT) +#define PTI_USER_PGTABLE_AND_PCID_MASK \ + (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK) + +static __always_inline void write_kernel_cr3(unsigned long cr3) +{ + if (static_cpu_has(X86_FEATURE_PCID)) + cr3 |= X86_CR3_PCID_NOFLUSH; + + native_write_cr3(cr3); +} + +static __always_inline void write_user_cr3(unsigned long cr3) +{ + unsigned short mask; + unsigned long asid; + + if (static_cpu_has(X86_FEATURE_PCID)) { + /* +* Test if the ASID needs a flush. 
+*/ + asid = cr3 & 0x7ff; + mask = this_cpu_read(cpu_tlbstate.user_pcid_flush_mask); + if (mask & (1 << asid)) { + /* Flush needed, clear the bit */ + this_cpu_and(cpu_tlbstate.user_pcid_flush_mask, +~(1 << asid)); + } else { + cr3 |= X86_CR3_PCID_NOFLUSH; + } + } + + native_write_cr3(cr3); +} + +static __always_inline void switch_to_kernel_cr3(unsigned long cr3) +{ + /* +* Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3 +* at kernel pagetables. +*/ + write_kernel_cr3(cr3 & ~PTI_USER_PGTABLE_AND_PCID_MASK); +} + +static __always_inline void switch_to_user_cr3(unsigned long cr3) +{ + if (static_cpu_has(X86_FEATURE_PCID)) { + /* Flip the ASID to the user version */ + cr3 |= PTI_USER_PCID_MASK; + } + + /* Flip the PGD to the user version */ + write_user_cr3(cr3 | PTI_USER_PGTABLE_MASK); +} + +static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) +{ + unsigned long cr3; + + if (!static_cpu_has(X86_FEATURE_PTI)) + return 0; + + cr3 = __native_read_cr3(); + if (cr3 & PTI_USER_PGTABLE_MASK) + switch_to_kernel_cr3(cr3); + + return cr3; +} + +static __always_inline void restore_cr3(unsigned long cr3) +{ + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + if (cr3 & PTI_USER_PGTABLE_MASK) { + switch_to_user_cr3(cr3); + } else { + /* +* The CR3 write could be avoided when not changing +* its value, but would require a CR3 read. 
+*/ + write_kernel_cr3(cr3); + } +} + +static __always_inline void user_pagetable_enter(void) +{ + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + switch_to_user_cr3(__native_read_cr3()); +} + +static __always_inline void user_pagetable_exit(void) +{ + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + switch_to_kernel_cr3(__native_read_cr3()); +} + + +#else /* CONFIG_PAGE_TABLE_ISOLATION */ + +static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) +{ + return 0; +} +static __always_inline void restore_cr3(unsigned long cr3) {} + +static __always_inline void user_pagetable_enter(void) {}; +static __always_inline void user_pagetable_exit(void) {}; + +#endif /* CONFIG_PAGE_TABLE_ISOLATION */ +#endif /* MODULE */ + #endif -- 2.18.4
[RFC][PATCH v2 17/21] x86/pti: Execute page fault handler on the kernel stack
After a page fault from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the page fault handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 17 + arch/x86/mm/fault.c | 2 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 0c5d9f027112..a6725afaaec0 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -31,6 +31,13 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ func(arg1, arg2)) +#define CALL_ON_STACK_3(stack, func, arg1, arg2, arg3) \ + ((stack) ? \ +asm_call_on_stack_3(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2), \ + (void *)(arg3)) : \ +func(arg1, arg2, arg3)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack). 
If we @@ -66,6 +73,16 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_idt_pagefault(void (*func)(struct pt_regs *, unsigned long, + unsigned long), + struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + CALL_ON_STACK_3(pti_kernel_stack(regs), + func, regs, error_code, address); +} + static __always_inline void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) { diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 82bf37a5c9ec..b9d03603d95d 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1482,7 +1482,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) state = irqentry_enter(regs); instrumentation_begin(); - handle_page_fault(regs, error_code, address); + run_idt_pagefault(handle_page_fault, regs, error_code, address); instrumentation_end(); irqentry_exit(regs, state); -- 2.18.4
[RFC][PATCH v2 13/21] x86/pti: Execute syscall functions on the kernel stack
During a syscall, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the syscall function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 11 ++- arch/x86/entry/entry_64.S| 1 + arch/x86/include/asm/irq_stack.h | 3 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 7ee15a12c115..1aba02ecb806 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { + unsigned long stack; + if (!sysfunc) return; - regs->ax = sysfunc(regs); + if (!pti_enabled()) { + regs->ax = sysfunc(regs); + return; + } + + stack = (unsigned long)task_top_of_kernel_stack(current); + regs->ax = asm_call_syscall_on_stack((void *)(stack - 8), +sysfunc, regs); } #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 29beab46bedd..6b88a0eb8975 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2) SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) +SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL) /* * Save the frame pointer unconditionally. This allows the ORC * unwinder to handle the stack switch. 
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 359427216336..108d9da7c01c 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -5,6 +5,7 @@ #include #include +#include #ifdef CONFIG_X86_64 static __always_inline bool irqstack_active(void) @@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), struct irq_desc *desc); +long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func, + struct pt_regs *regs); static __always_inline void __run_on_irqstack(void (*func)(void)) { -- 2.18.4
[RFC][PATCH v2 15/21] x86/pti: Execute IDT handlers with error code on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leak to the userland through the stack. Changes IDT handlers which have an error code. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 18 -- arch/x86/kernel/traps.c | 2 +- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 3595a31947b3..a82e31b45442 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1)) : \ func(arg1)) +#define CALL_ON_STACK_2(stack, func, arg1, arg2) \ + ((stack) ? \ +asm_call_on_stack_2(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ +func(arg1, arg2)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack). 
If we @@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); } +static __always_inline +void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), +struct pt_regs *regs, unsigned long error_code) +{ + CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs, \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs, error_code);\ + run_idt_errcode(__##func, regs, error_code);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs, \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs, (u8)error_code);\ + run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \ + regs, (u8)error_code); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 5161385b3670..9a51aa016fb3 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug) /* User entry, runs on regular task stack */ DEFINE_IDTENTRY_DEBUG_USER(exc_debug) { - exc_debug_user(regs, debug_read_clear_dr6()); + run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6()); } #else /* 32 bit does not have separate entry points. */ -- 2.18.4
[RFC][PATCH v2 16/21] x86/pti: Execute system vector handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. This changes system vector handlers to execute on the kernel stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a82e31b45442..0c5d9f027112 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) +{ + void *stack = pti_kernel_stack(regs); + + if (stack) + asm_call_on_stack_1(stack, (void (*)(void))func, regs); + else + run_sysvec_on_irqstack_cond(func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - run_sysvec_on_irqstack_cond(__##func, regs);\ + run_sysvec(__##func, regs); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ -- 2.18.4
[RFC][PATCH v2 10/21] x86/pti: Function to map per-cpu page-table entry
Wrap the code used by PTI to map a per-cpu page-table entry into a new function so that this code can be re-used to map other per-cpu entries. Signed-off-by: Alexandre Chartre --- arch/x86/mm/pti.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index ebc8cd2f1cd8..71ca245d7b38 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr) *user_p4d = *kernel_p4d; } +/* + * Clone a single percpu page. + */ +static void __init pti_clone_percpu_page(void *addr) +{ + phys_addr_t pa = per_cpu_ptr_to_phys(addr); + pte_t *target_pte; + + target_pte = pti_user_pagetable_walk_pte((unsigned long)addr); + if (WARN_ON(!target_pte)) + return; + + *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); +} + /* * Clone the CPU_ENTRY_AREA and associated data into the user space visible * page table. @@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void) * This is done for all possible CPUs during boot to ensure * that it's propagated to all mms. */ + pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); - unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu); - phys_addr_t pa = per_cpu_ptr_to_phys((void *)va); - pte_t *target_pte; - - target_pte = pti_user_pagetable_walk_pte(va); - if (WARN_ON(!target_pte)) - return; - - *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); } } -- 2.18.4
[RFC][PATCH v2 02/21] x86/entry: Update asm_call_on_stack to support more function arguments
Update the asm_call_on_stack() function so that it can be invoked with a function having up to three arguments instead of only one. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 15 +++ arch/x86/include/asm/irq_stack.h | 8 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index cad08703c4ad..c42948aca0a8 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -759,9 +759,14 @@ SYM_CODE_END(.Lbad_gs) /* * rdi: New stack pointer points to the top word of the stack * rsi: Function pointer - * rdx: Function argument (can be NULL if none) + * rdx: Function argument 1 (can be NULL if none) + * rcx: Function argument 2 (can be NULL if none) + * r8 : Function argument 3 (can be NULL if none) */ SYM_FUNC_START(asm_call_on_stack) +SYM_FUNC_START(asm_call_on_stack_1) +SYM_FUNC_START(asm_call_on_stack_2) +SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) /* @@ -777,15 +782,17 @@ SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) */ mov %rsp, (%rdi) mov %rdi, %rsp - /* Move the argument to the right place */ + mov %rsi, %rax + /* Move arguments to the right place */ mov %rdx, %rdi - + mov %rcx, %rsi + mov %r8, %rdx 1: .pushsection .discard.instr_begin .long 1b - . 
.popsection - CALL_NOSPEC rsi + CALL_NOSPEC rax 2: .pushsection .discard.instr_end diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 775816965c6a..359427216336 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -13,6 +13,14 @@ static __always_inline bool irqstack_active(void) } void asm_call_on_stack(void *sp, void (*func)(void), void *arg); + +void asm_call_on_stack_1(void *sp, void (*func)(void), +void *arg1); +void asm_call_on_stack_2(void *sp, void (*func)(void), +void *arg1, void *arg2); +void asm_call_on_stack_3(void *sp, void (*func)(void), +void *arg1, void *arg2, void *arg3); + void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), -- 2.18.4
[RFC][PATCH v2 21/21] x86/pti: Use a different stack canary with the user and kernel page-table
Using stack protector requires the stack canary to be mapped into the current page-table. Now that the page-table switch between the user and kernel page-table is deferred to C code, stack protector can be used while the user page-table is active and so the stack canary is mapped into the user page-table. To prevent leaking the stack canary used with the kernel page-table, use a different canary with the user and kernel page-table. The stack canary is changed when switching the page-table. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/entry-common.h | 56 ++- arch/x86/include/asm/stackprotector.h | 35 +++-- arch/x86/kernel/sev-es.c | 18 + include/linux/sched.h | 8 kernel/fork.c | 3 ++ 5 files changed, 107 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index e01735a181b8..5b4d0e3237a3 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -96,6 +96,52 @@ static __always_inline void arch_exit_to_user_mode(void) #define PTI_USER_PGTABLE_AND_PCID_MASK \ (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK) +/* + * Functions to set the stack canary to the kernel or user value: + * + * The kernel stack canary should be used when running with the kernel + * page-table, and the user stack canary should be used when running + * with the user page-table. Also the kernel stack canary should not + * leak to the user page-table. + * + * So the stack canary should be set to the kernel value when entering + * the kernel from userspace *after* switching to the kernel page-table. + * And the stack canary should be set to the user value when returning + * to userspace *before* switching to the user page-table. + * + * In both cases, there is a window (between the page-table switch and + * the stack canary setting) where we will be running with the kernel + * page-table and the user stack canary. 
This window should be as small + * as possible and, ideally, it should: + * - not call functions which require the stack protector to be used; + * - have interrupt disabled to prevent interrupt handlers from being + * processed with the user stack canary (but there is nothing we can + * do for NMIs). + */ +static __always_inline void set_stack_canary_kernel(void) +{ + this_cpu_write(fixed_percpu_data.stack_canary, + current->stack_canary); +} + +static __always_inline void set_stack_canary_user(void) +{ + this_cpu_write(fixed_percpu_data.stack_canary, + current->stack_canary_user); +} + +static __always_inline void switch_to_kernel_stack_canary(unsigned long cr3) +{ + if (cr3 & PTI_USER_PGTABLE_MASK) + set_stack_canary_kernel(); +} + +static __always_inline void restore_stack_canary(unsigned long cr3) +{ + if (cr3 & PTI_USER_PGTABLE_MASK) + set_stack_canary_user(); +} + static __always_inline void write_kernel_cr3(unsigned long cr3) { if (static_cpu_has(X86_FEATURE_PCID)) @@ -155,8 +201,10 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) return 0; cr3 = __native_read_cr3(); - if (cr3 & PTI_USER_PGTABLE_MASK) + if (cr3 & PTI_USER_PGTABLE_MASK) { switch_to_kernel_cr3(cr3); + set_stack_canary_kernel(); + } return cr3; } @@ -167,6 +215,7 @@ static __always_inline void restore_cr3(unsigned long cr3) return; if (cr3 & PTI_USER_PGTABLE_MASK) { + set_stack_canary_user(); switch_to_user_cr3(cr3); } else { /* @@ -182,6 +231,7 @@ static __always_inline void user_pagetable_enter(void) if (!static_cpu_has(X86_FEATURE_PTI)) return; + set_stack_canary_user(); switch_to_user_cr3(__native_read_cr3()); } @@ -191,6 +241,7 @@ static __always_inline void user_pagetable_exit(void) return; switch_to_kernel_cr3(__native_read_cr3()); + set_stack_canary_kernel(); } static __always_inline void user_pagetable_return(struct pt_regs *regs) @@ -218,6 +269,9 @@ static __always_inline void user_pagetable_exit(void) {}; static __always_inline void 
user_pagetable_return(struct pt_regs *regs) {}; static __always_inline void user_pagetable_escape(struct pt_regs *regs) {}; +static __always_inline void switch_to_kernel_stack_canary(unsigned long cr3) {} +static __always_inline void restore_stack_canary(unsigned long cr3) {} + #endif /* CONFIG_PAGE_TABLE_ISOLATION */ #endif /* MODULE */ diff --git a/arch/x86/include/asm/stackprotector.h b/arch/x86/include/asm/stackprotector.h index 7fb482f0f25b..be6c051bafe3 100644 --- a/arch/x86/include/asm/stackprotector.h +++ b/arch/x86/include/asm/stackprotector.h @@ -52,6 +52,25 @@ #de
[RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings
Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, per cpu offsets (__per_cpu_offset, which is used some in entry code), the stack canary, and the PTI stack (which is defined per task). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 2 -- arch/x86/mm/pti.c | 19 +++ kernel/fork.c | 22 ++ 3 files changed, 41 insertions(+), 2 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6e0b5b010e0b..458af12ed9a1 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -274,7 +274,6 @@ SYM_FUNC_END(__switch_to_asm) * rbx: kernel thread func (NULL for user thread) * r12: kernel thread arg */ -.pushsection .text, "ax" SYM_CODE_START(ret_from_fork) UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs */ @@ -284,7 +283,6 @@ SYM_CODE_START(ret_from_fork) callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode SYM_CODE_END(ret_from_fork) -.popsection .macro DEBUG_ENTRY_ASSERT_IRQS_OFF #ifdef CONFIG_DEBUG_ENTRY diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 71ca245d7b38..e4c6cb4a4840 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -449,6 +449,7 @@ static void __init pti_clone_percpu_page(void *addr) */ static void __init pti_clone_user_shared(void) { + unsigned long start, end; unsigned int cpu; pti_clone_p4d(CPU_ENTRY_AREA_BASE); @@ -465,7 +466,16 @@ static void __init pti_clone_user_shared(void) */ pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); + /* +* Map fixed_percpu_data to get the stack canary. 
+*/ + if (IS_ENABLED(CONFIG_STACKPROTECTOR)) + pti_clone_percpu_page(&per_cpu(fixed_percpu_data, cpu)); } + + start = (unsigned long)__per_cpu_offset; + end = start + sizeof(__per_cpu_offset); + pti_clone_init_pgtable(start, end, PTI_CLONE_PTE); } #else /* CONFIG_X86_64 */ @@ -505,6 +515,15 @@ static void pti_clone_entry_text(void) pti_clone_init_pgtable((unsigned long) __entry_text_start, (unsigned long) __entry_text_end, PTI_CLONE_PMD); + + /* + * Syscall and interrupt entry code (which is in the noinstr + * section) will be entered with the user page-table, so that + * code has to be mapped in. + */ + pti_clone_init_pgtable((unsigned long) __noinstr_text_start, + (unsigned long) __noinstr_text_end, + PTI_CLONE_PMD); } /* diff --git a/kernel/fork.c b/kernel/fork.c index 6d266388d380..31cd77dbdba3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -999,6 +999,25 @@ static void mm_init_uprobes_state(struct mm_struct *mm) #endif } +static void mm_map_task(struct mm_struct *mm, struct task_struct *tsk) +{ +#ifdef CONFIG_PAGE_TABLE_ISOLATION + unsigned long addr; + + if (!tsk || !static_cpu_has(X86_FEATURE_PTI)) + return; + + /* +* Map the task stack after the kernel stack into the user +* address space, so that this stack can be used when entering +* syscall or interrupt from user mode. 
+*/ + BUG_ON(!task_stack_page(tsk)); + addr = (unsigned long)task_top_of_kernel_stack(tsk); + pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE); +#endif +} + static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns) { @@ -1043,6 +1062,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (init_new_context(p, mm)) goto fail_nocontext; + mm_map_task(mm, p); + mm->user_ns = get_user_ns(user_ns); return mm; @@ -1404,6 +1425,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk) vmacache_flush(tsk); if (clone_flags & CLONE_VM) { + mm_map_task(oldmm, tsk); mmget(oldmm); mm = oldmm; goto good_mm; -- 2.18.4
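The mappings added above (per-cpu pages, the top half of the task stack) all come down to handing a page-aligned virtual address range to a pti_clone_* helper. The sketch below is a userspace illustration only — the helper name is hypothetical and 4K pages are assumed — of the range computation such a clone would need:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Hypothetical stand-in for the address math behind calls like
 * pti_clone_percpu_page(): given an object's address and size, compute
 * the page-aligned [start, end) range that would have to be mapped
 * into the user page-table. */
static void pti_clone_object_range(uintptr_t addr, size_t size,
                                   uintptr_t *start, uintptr_t *end)
{
    *start = addr & PAGE_MASK;                        /* round down */
    *end = (addr + size + PAGE_SIZE - 1) & PAGE_MASK; /* round up   */
}
```

An object that straddles a page boundary simply widens the cloned range to cover every page it touches.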
[RFC][PATCH v2 18/21] x86/pti: Execute NMI handler on the kernel stack
After an NMI from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the NMI handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to the userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/kernel/nmi.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 4bc77aaf1303..be0f654c3095 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi) inc_irq_stat(__nmi_count); - if (!ignore_nmis) - default_do_nmi(regs); + if (!ignore_nmis) { + if (user_mode(regs)) { + /* +* If we come from userland then we are on the +* trampoline stack, switch to the kernel stack +* to execute the NMI handler. +*/ + run_idt(default_do_nmi, regs); + } else { + default_do_nmi(regs); + } + } idtentry_exit_nmi(regs, irq_state); -- 2.18.4
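The control flow of the hunk above can be modeled in plain C: the CPL bits of the saved CS decide whether the handler must be re-run on the kernel stack. This is only a userspace sketch — run_idt() here merely records the dispatch, whereas the real helper switches RSP to the kernel stack before invoking the handler:

```c
#include <assert.h>
#include <stdbool.h>

struct pt_regs { unsigned long cs; };

static int ignore_nmis;
static int calls_on_kernel_stack, calls_in_place;

/* The low two bits of CS hold the CPL: 3 means we came from userland. */
static bool user_mode(const struct pt_regs *regs)
{
    return (regs->cs & 3) == 3;
}

static void default_do_nmi(struct pt_regs *regs) { (void)regs; }

/* Toy stand-in for run_idt(): only counts the stack-switching path. */
static void run_idt(void (*handler)(struct pt_regs *), struct pt_regs *regs)
{
    calls_on_kernel_stack++;
    handler(regs);
}

static void exc_nmi(struct pt_regs *regs)
{
    if (!ignore_nmis) {
        if (user_mode(regs)) {
            /* entered on the trampoline/PTI stack */
            run_idt(default_do_nmi, regs);
        } else {
            /* already on a kernel-only stack */
            calls_in_place++;
            default_do_nmi(regs);
        }
    }
}
```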
[RFC][PATCH v2 05/21] x86/entry: Implement ret_from_fork body with C code
ret_from_fork is a mix of assembly code and calls to C functions. Re-implement ret_from_fork so that it calls a single C function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 18 ++ arch/x86/entry/entry_64.S | 28 +--- 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d12908ad..7ee15a12c115 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,24 @@ #include #include +__visible noinstr void return_from_fork(struct pt_regs *regs, + struct task_struct *prev, + void (*kfunc)(void *), void *kargs) +{ + schedule_tail(prev); + if (kfunc) { + /* kernel thread */ + kfunc(kargs); + /* +* A kernel thread is allowed to return here after +* successfully calling kernel_execve(). Exit to +* userspace to complete the execve() syscall. +*/ + regs->ax = 0; + } + syscall_exit_to_user_mode(regs); +} + static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 274384644b5e..73e9cd47dc83 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -276,31 +276,13 @@ SYM_FUNC_END(__switch_to_asm) */ .pushsection .text, "ax" SYM_CODE_START(ret_from_fork) - UNWIND_HINT_EMPTY - movq%rax, %rdi - callschedule_tail /* rdi: 'prev' task parameter */ - - testq %rbx, %rbx /* from kernel_thread? 
*/ - jnz 1f /* kernel threads are uncommon */ - -2: UNWIND_HINT_REGS - movq%rsp, %rdi - callsyscall_exit_to_user_mode /* returns with IRQs disabled */ + movq%rsp, %rdi /* pt_regs */ + movq%rax, %rsi /* 'prev' task parameter */ + movq%rbx, %rdx /* kernel thread func */ + movq%r12, %rcx /* kernel thread arg */ + callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode - -1: - /* kernel thread */ - UNWIND_HINT_EMPTY - movq%r12, %rdi - CALL_NOSPEC rbx - /* -* A kernel thread is allowed to return here after successfully -* calling kernel_execve(). Exit to userspace to complete the execve() -* syscall. -*/ - movq$0, RAX(%rsp) - jmp 2b SYM_CODE_END(ret_from_fork) .popsection -- 2.18.4
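The new C body is small enough to exercise outside the kernel. The sketch below mirrors the patch's return_from_fork() with stubbed-out schedule_tail() and syscall_exit_to_user_mode() (both reduced to markers here, so this is an illustration, not the kernel implementation): a non-NULL kfunc marks a kernel thread, and returning from kfunc() corresponds to a successful kernel_execve():

```c
#include <assert.h>
#include <stddef.h>

struct pt_regs { long ax; };
struct task_struct { int dummy; };

static int tail_called, exits_to_user;

/* Stubs standing in for the real scheduler/entry functions. */
static void schedule_tail(struct task_struct *prev)
{
    (void)prev;
    tail_called++;
}

static void syscall_exit_to_user_mode(struct pt_regs *regs)
{
    (void)regs;
    exits_to_user++;
}

static void return_from_fork(struct pt_regs *regs, struct task_struct *prev,
                             void (*kfunc)(void *), void *kargs)
{
    schedule_tail(prev);
    if (kfunc) {
        kfunc(kargs);  /* kernel thread body */
        regs->ax = 0;  /* execve() success shows up as syscall return 0 */
    }
    syscall_exit_to_user_mode(regs);
}

/* Toy kernel-thread body for the test below. */
static void kthread_body(void *arg) { *(int *)arg = 1; }
```

Both the user-thread path (kfunc == NULL) and the kernel-thread path end in the same exit-to-usermode call, which is what lets the assembly collapse to a single call.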
[RFC][PATCH v2 04/21] x86/sev-es: Define a setup stack function for the VC idtentry
The #VC exception assembly entry code uses C code (vc_switch_off_ist) to get and configure a stack, then returns to assembly to switch to that stack and finally invokes the C exception handler function. To pave the way for deferring the CR3 switch from assembly to C code, define a setup stack function for the VC idtentry. This function is used to get and configure the stack before invoking the idtentry handler. For now, the setup stack function is just a wrapper around the vc_switch_off_ist() function but it will eventually also contain the C code to switch CR3. The vc_switch_off_ist() function is also refactored to just return the stack pointer, and the stack configuration is done in the setup stack function (so that the stack can also be used to propagate CR3 switch information to the idtentry handler for switching CR3 back). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 8 +++- arch/x86/include/asm/idtentry.h | 14 ++ arch/x86/include/asm/traps.h| 2 +- arch/x86/kernel/sev-es.c| 34 + arch/x86/kernel/traps.c | 19 +++--- 5 files changed, 55 insertions(+), 22 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 51df9f1871c6..274384644b5e 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -546,13 +546,11 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_REGS /* -* Switch off the IST stack to make it free for nested exceptions. The -* vc_switch_off_ist() function will switch back to the interrupted -* stack if it is safe to do so. If not it switches to the VC fall-back -* stack. +* Call the setup stack function. It configures and returns +* the stack we should be using to run the exception handler.
*/ movq%rsp, %rdi /* pt_regs pointer */ - callvc_switch_off_ist + callsetup_stack_\cfunc movq%rax, %rsp /* Switch to new stack */ UNWIND_HINT_REGS diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index b2442eb0ac2f..4b4aca2b1420 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -318,6 +318,7 @@ static __always_inline void __##func(struct pt_regs *regs) */ #define DECLARE_IDTENTRY_VC(vector, func) \ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func); \ + __visible noinstr unsigned long setup_stack_##func(struct pt_regs *regs); \ __visible noinstr void ist_##func(struct pt_regs *regs, unsigned long error_code); \ __visible noinstr void safe_stack_##func(struct pt_regs *regs, unsigned long error_code) @@ -380,6 +381,19 @@ static __always_inline void __##func(struct pt_regs *regs) #define DEFINE_IDTENTRY_VC_IST(func) \ DEFINE_IDTENTRY_RAW_ERRORCODE(ist_##func) +/** + * DEFINE_IDTENTRY_VC_SETUP_STACK - Emit code for setting up the stack to + run the VMM communication handler + * @func: Function name of the entry point + * + * The stack setup code is executed before the VMM communication handler. + * It configures and returns the stack to switch to before running the + * VMM communication handler. 
+ */ +#define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ + __visible noinstr \ + unsigned long setup_stack_##func(struct pt_regs *regs) + /** * DEFINE_IDTENTRY_VC - Emit code for VMM communication handler * @func: Function name of the entry point diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 7f7200021bd1..cfcc9d34d2a0 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -15,7 +15,7 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); asmlinkage __visible notrace struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s); void __init trap_init(void); -asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs); +asmlinkage __visible noinstr unsigned long vc_switch_off_ist(struct pt_regs *eregs); #endif #ifdef CONFIG_X86_F00F_BUG diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 0bd1a0fc587e..bd977c917cd6 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -1349,6 +1349,40 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication) instrumentation_end(); } +struct exc_vc_frame { + /* pt_regs should be first */ + struct pt_regs regs; +}; + +DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication) +{ + struct exc_vc_frame *frame; + unsigned long sp; + + /* +* Switch off the IST stack to make it free for nested exceptions. +* The vc_switch_off_ist() function will switch back to the +* interrupted stack if
[RFC][PATCH v2 08/21] x86/pti: Introduce per-task PTI trampoline stack
Double the size of the kernel stack when using PTI. The entire stack is mapped into the kernel address space, and the top half of the stack (the PTI stack) is also mapped into the user address space. The PTI stack will be used as a per-task trampoline stack instead of the current per-cpu trampoline stack. This will allow running more code on the trampoline stack, in particular code that schedules the task out. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/page_64_types.h | 36 +++- arch/x86/include/asm/processor.h | 3 +++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 3f49dac03617..733accc20fdb 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -12,7 +12,41 @@ #define KASAN_STACK_ORDER 0 #endif -#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) +#ifdef CONFIG_PAGE_TABLE_ISOLATION +/* + * PTI doubles the size of the stack. The entire stack is mapped into + * the kernel address space. However, only the top half of the stack is + * mapped into the user address space. + * + * On syscall or interrupt, user mode enters the kernel with the user + * page-table, and the stack pointer is switched to the top of the + * stack (which is mapped in the user address space and in the kernel). + * The syscall/interrupt handler will then later decide when to switch + * to the kernel address space, and to switch to the top of the kernel + * stack which is only mapped in the kernel. 
+ *
+ * +-------------+
+ * |             |  ^                       ^
+ * | kernel-only |  | KERNEL_STACK_SIZE     |
+ * |    stack    |  |                       |
+ * |             |  V                       |
+ * +-------------+ <- top of kernel stack   | THREAD_SIZE
+ * |             |  ^                       |
+ * | kernel and  |  | KERNEL_STACK_SIZE     |
+ * |  PTI stack  |  |                       |
+ * |             |  V                       v
+ * +-------------+ <- top of stack
+ */
+#define PTI_STACK_ORDER 1
+#else
+#define PTI_STACK_ORDER 0
+#endif
+
+#define KERNEL_STACK_ORDER 2
+#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER)
+
+#define THREAD_SIZE_ORDER \
+	(KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER)
 #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

 #define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..47b1b806535b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x)
 #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))

+#define task_top_of_kernel_stack(task) \
+	((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE))
+
 #define task_pt_regs(task) \
 ({ \
 	unsigned long __ptr = (unsigned long)task_stack_page(task); \
-- 2.18.4
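With 4K pages and the orders above (PTI enabled, KASAN off), the size arithmetic can be checked directly. The snippet below just restates the patch's macros in userspace to make the doubling visible:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define KASAN_STACK_ORDER 0   /* assume CONFIG_KASAN=n */
#define PTI_STACK_ORDER 1     /* assume CONFIG_PAGE_TABLE_ISOLATION=y */
#define KERNEL_STACK_ORDER 2
#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER)

#define THREAD_SIZE_ORDER \
	(KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

/* Mirror of the patch's task_top_of_kernel_stack(): the kernel-only
 * region sits KERNEL_STACK_SIZE bytes above the base of the stack area. */
static unsigned long task_top_of_kernel_stack(unsigned long stack_page)
{
    return stack_page + KERNEL_STACK_SIZE;
}
```

So with PTI on, the task stack area grows from 16K to 32K, and only the upper 16K (above task_top_of_kernel_stack) stays kernel-only.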
[RFC][PATCH v2 01/21] x86/syscall: Add wrapper for invoking syscall function
Add a wrapper function for invoking a syscall function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 870efeec8bda..d12908ad 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,15 @@ #include #include +static __always_inline void run_syscall(sys_call_ptr_t sysfunc, + struct pt_regs *regs) +{ + if (!sysfunc) + return; + + regs->ax = sysfunc(regs); +} + #ifdef CONFIG_X86_64 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) { @@ -43,15 +52,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) instrumentation_begin(); if (likely(nr < NR_syscalls)) { nr = array_index_nospec(nr, NR_syscalls); - regs->ax = sys_call_table[nr](regs); + run_syscall(sys_call_table[nr], regs); #ifdef CONFIG_X86_X32_ABI } else if (likely((nr & __X32_SYSCALL_BIT) && (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) { nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT, X32_NR_syscalls); - regs->ax = x32_sys_call_table[nr](regs); + run_syscall(x32_sys_call_table[nr], regs); #endif } + instrumentation_end(); syscall_exit_to_user_mode(regs); } @@ -75,7 +85,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs, if (likely(nr < IA32_NR_syscalls)) { instrumentation_begin(); nr = array_index_nospec(nr, IA32_NR_syscalls); - regs->ax = ia32_sys_call_table[nr](regs); + run_syscall(ia32_sys_call_table[nr], regs); instrumentation_end(); } } -- 2.18.4
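The wrapper's value is the NULL guard: a missing table entry no longer dereferences a NULL function pointer, and regs->ax is left untouched. A self-contained userspace model of the dispatch (table contents and register layout are toy stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct pt_regs { long ax; long di; };
typedef long (*sys_call_ptr_t)(struct pt_regs *);

/* Toy "syscall": returns twice its first argument. */
static long sys_double(struct pt_regs *regs) { return 2 * regs->di; }

/* Sparse table: entry 1 is deliberately NULL. */
static const sys_call_ptr_t sys_call_table[] = { sys_double, NULL };
#define NR_syscalls (sizeof(sys_call_table) / sizeof(sys_call_table[0]))

/* Mirror of the patch's run_syscall(): tolerate a NULL table entry
 * and leave regs->ax untouched in that case. */
static void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs)
{
    if (!sysfunc)
        return;
    regs->ax = sysfunc(regs);
}

static void do_syscall(unsigned long nr, struct pt_regs *regs)
{
    if (nr < NR_syscalls)
        run_syscall(sys_call_table[nr], regs);
}
```

(The real do_syscall_64 additionally clamps the index with array_index_nospec() before the table load, which this sketch omits.)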
[RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
Version 2 addressing comments from Andy:

- paranoid_entry/exit is back to assembly code. This avoids having a C version of SWAPGS and the need to disable stack-protector. (remove patches 8, 9, 21 from v1).
- SAVE_AND_SWITCH_TO_KERNEL_CR3 and RESTORE_CR3 are removed from paranoid_entry/exit and moved to C (patch 19).
- __per_cpu_offset is mapped into the user page-table (patch 11) so that paranoid_entry can update GS before CR3 is switched.
- use a different stack canary with the user and kernel page-tables. This is a new patch in v2 so as not to leak the kernel stack canary in the user page-table (patch 21).

Patches are now based on v5.10-rc4.

With Page Table Isolation (PTI), syscalls as well as interrupts and exceptions occurring in userspace enter the kernel with a user page-table. The kernel entry code will then switch the page-table from the user page-table to the kernel page-table by updating the CR3 control register. This CR3 switch is currently done early in the kernel entry sequence using assembly code.

This RFC proposes to defer the PTI CR3 switch until we reach C code. The benefit is that this simplifies the assembly entry code and makes the PTI CR3 switch code easier to understand. This also paves the way for further possible projects such as an easier integration of Address Space Isolation (ASI), or the possibility to execute some selected syscall or interrupt handlers without switching to the kernel page-table (and thus avoiding the PTI page-table switch overhead).

Deferring the CR3 switch to C code means that we need to run more of the kernel entry code with the user page-table.
To do so, we need to:

- map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code);
- map additional data used in the entry code (such as the stack canary);
- run more entry code on the trampoline stack (which is mapped both in the kernel and in the user page-table) until we switch to the kernel page-table and then switch to the kernel stack;
- have a per-task trampoline stack instead of a per-cpu trampoline stack, so the task can be scheduled out while it hasn't switched to the kernel stack.

Note that, for now, the CR3 switch can only be pushed as far as interrupts remain disabled in the entry code. This is because the CR3 switch is done based on the privilege level in the CS register of the interrupt frame. I plan to fix this but that's some extra complication (need to track whether the user page-table is used or not).

The proposed patchset is in RFC state to get early feedback about this proposal. The code survives running a kernel build and LTP. Note that changes are only for 64-bit at the moment; I haven't looked at 32-bit yet but I will definitely check it.

Patches are based on v5.10-rc4.

Thanks,

alex.
----- Alexandre Chartre (21): x86/syscall: Add wrapper for invoking syscall function x86/entry: Update asm_call_on_stack to support more function arguments x86/entry: Consolidate IST entry from userspace x86/sev-es: Define a setup stack function for the VC idtentry x86/entry: Implement ret_from_fork body with C code x86/pti: Provide C variants of PTI switch CR3 macros x86/entry: Fill ESPFIX stack using C code x86/pti: Introduce per-task PTI trampoline stack x86/pti: Function to clone page-table entries from a specified mm x86/pti: Function to map per-cpu page-table entry x86/pti: Extend PTI user mappings x86/pti: Use PTI stack instead of trampoline stack x86/pti: Execute syscall functions on the kernel stack x86/pti: Execute IDT handlers on the kernel stack x86/pti: Execute IDT handlers with error code on the kernel stack x86/pti: Execute system vector handlers on the kernel stack x86/pti: Execute page fault handler on the kernel stack x86/pti: Execute NMI handler on the kernel stack x86/pti: Defer CR3 switch to C code for IST entries x86/pti: Defer CR3 switch to C code for non-IST and syscall entries x86/pti: Use a different stack canary with the user and kernel page-table arch/x86/entry/common.c | 58 - arch/x86/entry/entry_64.S | 346 +++--- arch/x86/entry/entry_64_compat.S | 22 -- arch/x86/include/asm/entry-common.h | 194 +++ arch/x86/include/asm/idtentry.h | 130 +- arch/x86/include/asm/irq_stack.h | 11 + arch/x86/include/asm/page_64_types.h | 36 ++- arch/x86/include/asm/processor.h | 3 + arch/x86/include/asm/pti.h| 18 ++ arch/x86/include/asm/stackprotector.h | 35 ++- arch/x86/include/asm/switch_to.h | 7 +- arch/x86/include/asm/traps.h | 2 +- arch/x86/kernel/cpu/mce/core.c| 7 +- arch/x86/kernel/espfix_64.c | 41 +++ arch/x86/kernel/nmi.c | 34 ++- arch/x86/kernel/sev-es.c | 63 + arch/x86/kernel/traps.c | 61 +++-- arch/x86/mm/fault.c | 11
[RFC][PATCH v2 07/21] x86/entry: Fill ESPFIX stack using C code
The ESPFIX stack is filled using assembly code. Move this code to a C function so that it is easier to read and modify. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 62 ++--- arch/x86/kernel/espfix_64.c | 41 2 files changed, 72 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 73e9cd47dc83..6e0b5b010e0b 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -684,8 +684,10 @@ native_irq_return_ldt: * long (see ESPFIX_STACK_SIZE). espfix_waddr points to the bottom * of the ESPFIX stack. * -* We clobber RAX and RDI in this code. We stash RDI on the -* normal stack and RAX on the ESPFIX stack. +* We call into C code to fill the ESPFIX stack. We stash registers +* that the C function can clobber on the normal stack. The user RAX +* is stashed first so that it is adjacent to the iret frame which +* will be copied to the ESPFIX stack. * * The ESPFIX stack layout we set up looks like this: * @@ -699,39 +701,37 @@ native_irq_return_ldt: * --- bottom of ESPFIX stack --- */ - pushq %rdi/* Stash user RDI */ - SWAPGS /* to kernel GS */ - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi /* to kernel CR3 */ - - movqPER_CPU_VAR(espfix_waddr), %rdi - movq%rax, (0*8)(%rdi) /* user RAX */ - movq(1*8)(%rsp), %rax /* user RIP */ - movq%rax, (1*8)(%rdi) - movq(2*8)(%rsp), %rax /* user CS */ - movq%rax, (2*8)(%rdi) - movq(3*8)(%rsp), %rax /* user RFLAGS */ - movq%rax, (3*8)(%rdi) - movq(5*8)(%rsp), %rax /* user SS */ - movq%rax, (5*8)(%rdi) - movq(4*8)(%rsp), %rax /* user RSP */ - movq%rax, (4*8)(%rdi) - /* Now RAX == RSP. */ - - andl$0x, %eax /* RAX = (RSP & 0x) */ + /* save registers */ + pushq %rax + pushq %rdi + pushq %rsi + pushq %rdx + pushq %rcx + pushq %r8 + pushq %r9 + pushq %r10 + pushq %r11 /* -* espfix_stack[31:16] == 0. The page tables are set up such that -* (espfix_stack | (X & 0x)) points to a read-only alias of -* espfix_waddr for any X. 
That is, there are 65536 RO aliases of -* the same page. Set up RSP so that RSP[31:16] contains the -* respective 16 bits of the /userspace/ RSP and RSP nonetheless -* still points to an RO alias of the ESPFIX stack. +* fill_espfix_stack will copy the iret+rax frame to the ESPFIX +* stack and return with RAX containing a pointer to the ESPFIX +* stack. */ - orq PER_CPU_VAR(espfix_stack), %rax + leaq8*8(%rsp), %rdi /* points to the iret+rax frame */ + callfill_espfix_stack - SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi - SWAPGS /* to user GS */ - popq%rdi/* Restore user RDI */ + /* +* RAX contains a pointer to the ESPFIX, so restore registers but +* RAX. RAX will be restored from the ESPFIX stack. +*/ + popq%r11 + popq%r10 + popq%r9 + popq%r8 + popq%rcx + popq%rdx + popq%rsi + popq%rdi movq%rax, %rsp UNWIND_HINT_IRET_REGS offset=8 diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index 4fe7af58cfe1..ff4b5160b39c 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -33,6 +33,7 @@ #include #include #include +#include /* * Note: we only need 6*8 = 48 bytes for the espfix stack, but round @@ -205,3 +206,43 @@ void init_espfix_ap(int cpu) per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page + (addr & ~PAGE_MASK); } + +/* + * iret frame with an additional user_rax register. + */ +struct iret_rax_frame { + unsigned long user_rax; + unsigned long rip; + unsigned long cs; + unsigned long rflags; + unsigned long rsp; + unsigned long ss; +}; + +noinstr unsigned long fill_espfix_stack(struct iret_rax_frame *frame) +{ + struct iret_rax_frame *espfix_frame; + unsigned long rsp; + + native_swapgs(); + user_pagetable_exit(); + + espfix_frame = (struct iret_rax_frame *)this_cpu_read(espfix_waddr); + *espfix_frame = *frame; + + /* +* espfix_stack[31:16] == 0. The page tables are set up such that +* (espfix_stack | (X & 0x)) points to a re
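Moving the fill into C lets a single struct assignment replace the six movq pairs of the old assembly. A minimal userspace sketch of just that copy pattern (the real fill_espfix_stack() additionally swaps GS, switches CR3, and returns the mangled ESPFIX stack pointer, none of which is modeled here):

```c
#include <assert.h>
#include <string.h>

/* Same layout as the patch's iret_rax_frame: the saved user RAX sits
 * directly below the hardware iret frame on the stack. */
struct iret_rax_frame {
    unsigned long user_rax;
    unsigned long rip;
    unsigned long cs;
    unsigned long rflags;
    unsigned long rsp;
    unsigned long ss;
};

/* One struct assignment does what the old asm did field by field;
 * 'waddr' stands in for the per-cpu espfix_waddr write alias. */
static void copy_frame(struct iret_rax_frame *waddr,
                       const struct iret_rax_frame *frame)
{
    *waddr = *frame;
}
```

Keeping the frame as a C struct also makes the layout explicit, instead of encoding it in `(N*8)(%rdi)` offsets.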
[RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Additional changes will be made later to switch to the kernel stack (which is only mapped in the kernel page-table). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 42 +--- arch/x86/include/asm/pti.h | 8 ++ arch/x86/include/asm/switch_to.h | 7 +- 3 files changed, 26 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 458af12ed9a1..29beab46bedd 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -194,19 +194,9 @@ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ POP_REGS pop_rdi=0 skip_r11rcx=1 - /* -* Now all regs are restored except RSP and RDI. -* Save old stack pointer and switch to trampoline stack. -*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - pushq RSP-RDI(%rdi) /* RSP */ - pushq (%rdi) /* RDI */ - /* * We are on the trampoline stack. All regs except RDI are live. +* We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. */ STACKLEAK_ERASE_NOCLOBBER @@ -214,7 +204,7 @@ syscall_return_via_sysret: SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi popq%rdi - popq%rsp + movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) #endif POP_REGS pop_rdi=0 - /* -* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS. -* Save old stack pointer and switch to trampoline stack.
-*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - /* Copy the IRET frame to the trampoline stack. */ - pushq 6*8(%rdi) /* SS */ - pushq 5*8(%rdi) /* RSP */ - pushq 4*8(%rdi) /* EFLAGS */ - pushq 3*8(%rdi) /* CS */ - pushq 2*8(%rdi) /* RIP */ - - /* Push user RDI on the trampoline stack. */ - pushq (%rdi) - /* * We are on the trampoline stack. All regs except RDI are live. * We can do future final exit work right here. @@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) /* Restore RDI. */ popq%rdi + addq$8, %rsp/* skip regs->orig_ax */ SWAPGS INTERRUPT_RETURN @@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry) SWITCH_TO_KERNEL_CR3 scratch_reg=%rax .Lerror_entry_from_usermode_after_swapgs: + /* +* We are on the trampoline stack. With PTI, the trampoline +* stack is a per-thread stack so we are all set and we can +* return. +* +* Without PTI, the trampoline stack is a per-cpu stack and +* we need to switch to the normal thread stack. +*/ + ALTERNATIVE "", "ret", X86_FEATURE_PTI /* Put us onto the real thread stack. 
*/ popq%r12/* save return addr in %12 */ movq%rsp, %rdi /* arg0 = pt_regs pointer */ diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 5484e69ff8d3..ed211fcc3a50 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void); extern void pti_finalize(void); extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, unsigned long end, enum pti_clone_level level); +static inline bool pti_enabled(void) +{ + return static_cpu_has(X86_FEATURE_PTI); +} #else static inline void pti_check_boottime_disable(void) { } +static inline bool pti_enabled(void) +{ + return false; +} #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h index 9f69cc497f4b..457458228462 100644 --- a/arch/x86/include/asm/switch_to.h +++ b/arch/x86/include/asm/switch_to.h @@ -3,6 +3,7 @@ #define _ASM_X86_SWITCH_TO_H #include +#include struct task_struct; /* one of the stranger aspects of C forward declarations */ @@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task) * doesn't wo
[RFC][PATCH v2 03/21] x86/entry: Consolidate IST entry from userspace
Most IST entries (NMI, MCE, DEBUG, VC but not DF) handle an entry from userspace the same way: they switch from the IST stack to the kernel stack, call the handler and then return to userspace. However, NMI, MCE/DEBUG and VC implement this same behavior using different code paths, so consolidate this code into a single assembly macro. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 137 +- 1 file changed, 75 insertions(+), 62 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index c42948aca0a8..51df9f1871c6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -316,6 +316,72 @@ SYM_CODE_END(ret_from_fork) #endif .endm +/* + * Macro to handle an IDT entry defined with the IST mechanism. It should + * be invoked at the beginning of the IDT handler with a pointer to the C + * function (cfunc_user) to invoke if the IDT was entered from userspace. + * + * If the IDT was entered from userspace, the macro will switch from the + * IST stack to the regular task stack, call the provided function and + * return to userland. + * + * If IDT was entered from the kernel, the macro will just return. + */ +.macro ist_entry_user cfunc_user has_error_code=0 + UNWIND_HINT_IRET_REGS + ASM_CLAC + + /* only process entry from userspace */ + .if \has_error_code == 1 + testb $3, CS-ORIG_RAX(%rsp) + jz .List_entry_from_kernel_\@ + .else + testb $3, CS-RIP(%rsp) + jz .List_entry_from_kernel_\@ + pushq $-1 /* ORIG_RAX: no syscall to restart */ + .endif + + /* Use %rdx as a temp variable */ + pushq %rdx + + /* +* Switch from the IST stack to the regular task stack and +* use the provided entry point. 
+*/ + swapgs + cld + FENCE_SWAPGS_USER_ENTRY + SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx + movq%rsp, %rdx + movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp + UNWIND_HINT_IRET_REGS base=%rdx offset=8 + pushq 6*8(%rdx) /* pt_regs->ss */ + pushq 5*8(%rdx) /* pt_regs->rsp */ + pushq 4*8(%rdx) /* pt_regs->flags */ + pushq 3*8(%rdx) /* pt_regs->cs */ + pushq 2*8(%rdx) /* pt_regs->rip */ + UNWIND_HINT_IRET_REGS + pushq 1*8(%rdx) /* pt_regs->orig_ax */ + PUSH_AND_CLEAR_REGS rdx=(%rdx) + ENCODE_FRAME_POINTER + + /* +* At this point we no longer need to worry about stack damage +* due to nesting -- we're on the normal thread stack and we're +* done with the IST stack. +*/ + + mov %rsp, %rdi + .if \has_error_code == 1 + movqORIG_RAX(%rsp), %rsi/* get error code into 2nd argument*/ + movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ + .endif + call\cfunc_user + jmp swapgs_restore_regs_and_return_to_usermode + +.List_entry_from_kernel_\@: +.endm + /** * idtentry_body - Macro to emit code calling the C function * @cfunc: C function to be called @@ -417,18 +483,15 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_mce_db vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - - pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user noist_\cfunc + /* Entry from kernel */ + + pushq $-1 /* ORIG_RAX: no syscall to restart */ /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ callparanoid_entry @@ -440,10 +503,6 @@ SYM_CODE_START(\asmsym) jmp paranoid_exit - /* Switch to the regular task stack and use the noist entry point */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body noist_\cfunc, has_error_code=0 - _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) .endm @@ -472,15 +531,11 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_vc vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user safe_stack_\cfunc, has_error_code=1 /* * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. @@ -517,10 +572,6 @@ SYM_CODE_START(\asmsym) */ jmp paranoid_exit - /* Switch to the regular task stack */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body safe_stack_\cfunc, has_e
Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
On 11/10/20 11:42 PM, Joel Fernandes wrote: On Tue, Nov 10, 2020 at 10:35:17AM +0100, Alexandre Chartre wrote: [..] ---8<--- From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 27 Jul 2020 17:56:14 -0400 Subject: [PATCH] kernel/entry: Add support for core-wide protection of kernel-mode [..] diff --git a/include/linux/sched.h b/include/linux/sched.h index d38e904dd603..fe6f225bfbf9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq); const struct cpumask *sched_trace_rd_span(struct root_domain *rd); +#ifdef CONFIG_SCHED_CORE +void sched_core_unsafe_enter(void); +void sched_core_unsafe_exit(void); +bool sched_core_wait_till_safe(unsigned long ti_check); +bool sched_core_kernel_protected(void); +#else +#define sched_core_unsafe_enter(ignore) do { } while (0) +#define sched_core_unsafe_exit(ignore) do { } while (0) +#define sched_core_wait_till_safe(ignore) do { } while (0) +#define sched_core_kernel_protected(ignore) do { } while (0) +#endif + #endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 0a1e20f8d4e8..a18ed60cedea 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs) instrumentation_begin(); trace_hardirqs_off_finish(); + if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */ + sched_core_unsafe_enter(); instrumentation_end(); } @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void) /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal(struct pt_regs *regs) { } +unsigned long exit_to_user_get_work(void) Function should be static. Fixed. 
+{ + unsigned long ti_work = READ_ONCE(current_thread_info()->flags); + + if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected()) + || !_TIF_UNSAFE_RET) + return ti_work; + +#ifdef CONFIG_SCHED_CORE + ti_work &= EXIT_TO_USER_MODE_WORK; + if ((ti_work & _TIF_UNSAFE_RET) == ti_work) { + sched_core_unsafe_exit(); + if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) { + sched_core_unsafe_enter(); /* not exiting to user yet. */ + } + } + + return READ_ONCE(current_thread_info()->flags); +#endif +} + static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work) { @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, * enabled above. */ local_irq_disable_exit_to_user(); - ti_work = READ_ONCE(current_thread_info()->flags); + ti_work = exit_to_user_get_work(); } What happens if the task is scheduled out in exit_to_user_mode_loop? (e.g. if it has _TIF_NEED_RESCHED set). It will have called sched_core_unsafe_enter() and forced siblings to wait for it. So shouldn't sched_core_unsafe_exit() be called when the task is scheduled out? (because it won't run anymore) And sched_core_unsafe_enter() when the task is scheduled back in? No, when the task is scheduled out, it will be in kernel mode on the task being scheduled in. That task (being scheduled-in) would have already done a sched_core_unsafe_enter(). When that task returns to user mode, it will do a sched_core_unsafe_exit(). When all tasks go to sleep, the last task that enters the idle loop will do a sched_core_unsafe_exit(). Just to note: the "unsafe kernel context" is per-CPU and not per-task. Does that answer your question? Ok, I think I get it: it works because when a task is scheduled out then the scheduler will schedule in a new tagged task (because we have core scheduling). So that new task should be accounted for core-wide protection the same way as the previous one.
+static inline void init_sched_core_irq_work(struct rq *rq) +{ + init_irq_work(&rq->core_irq_work, sched_core_irq_work); +} + +/* + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core + * exits the core-wide unsafe state. Obviously the CPU calling this function + * should not be responsible for the core being in the core-wide unsafe state + * otherwise it will deadlock. + * + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of + *the loop if TIF flags are set and notify caller about it. + * + * IRQs should be disabled. + */ +bool sched_core_wait_till_safe(unsigned long ti_check) +{ + bool restart = false; + struct rq *rq; + int cpu; + + /* We clear the thread flag only at the end, so need to check for it. */ Do you mean "
Re: [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
On 11/11/20 12:39 AM, Andy Lutomirski wrote: On 11/9/20 6:28 PM, Andy Lutomirski wrote: On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre wrote: Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, Probably fine. per cpu offsets (__per_cpu_offset, which is used in some entry code), This likely already leaks due to vulnerable CPUs leaking address space layout info. I forgot to update the comment, I am not mapping __per_cpu_offset anymore. However, if we do map __per_cpu_offset then we don't need to enforce the ordering in paranoid_entry to switch CR3 before GS. I'm okay with mapping __per_cpu_offset. Good. That way I can move the GS update back to assembly code (paranoid_entry/exit will be mostly reduced to updating GS), and I probably won't need to disable the stack protector. the stack canary, That's going to be a very tough sell. I can get rid of this, but this will require disabling the stack protector for any function that we can call while using the user page-table, as already done in patch 21 (x86/entry: Disable stack-protector for IST entry C handlers). You could probably get away with using a different stack protector canary before and after the CR3 switch as long as you are careful to have the canary restored when you return from whatever function is involved. I was thinking about doing that. I will give it a try. Thanks, alex.
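Andy's suggestion — use a different canary while running on the user page-table and restore the original before returning — boils down to a save/switch/restore pattern. A minimal userspace sketch, where __toy_stack_chk_guard and TOY_USER_PT_CANARY are stand-ins for the real __stack_chk_guard / per-CPU canary storage (none of these names are in the patch set):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: swap in a canary value that is usable while the user CR3 is
 * active, and restore the original canary before returning, so the
 * compiler-emitted stack-protector checks keep working on both sides
 * of the page-table switch.
 */

static uintptr_t __toy_stack_chk_guard = 0xdeadbeef;

#define TOY_USER_PT_CANARY 0x1badc0de /* canary usable with the user CR3 */

static uintptr_t toy_switch_canary(uintptr_t new_canary)
{
	uintptr_t old = __toy_stack_chk_guard;

	__toy_stack_chk_guard = new_canary;
	return old;
}

/* Shape of a function that runs partly with the user page-table. */
static void toy_paranoid_entry(void)
{
	uintptr_t saved = toy_switch_canary(TOY_USER_PT_CANARY);

	/* ... work done while the user CR3 is active would go here ... */

	toy_switch_canary(saved); /* restore before returning */
}
```

The care Andy mentions is exactly the restore step: every return path out of the function must put the original canary back.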
Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
On 11/3/20 2:20 AM, Joel Fernandes wrote: Hi Alexandre, Sorry for the late reply as I was working on the snapshotting patch... On Fri, Oct 30, 2020 at 11:29:26AM +0100, Alexandre Chartre wrote: On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote: Core-scheduling prevents hyperthreads in usermode from attacking each other, but it does not do anything about one of the hyperthreads entering the kernel for any reason. This leaves the door open for MDS and L1TF attacks with concurrent execution sequences between hyperthreads. This patch therefore adds support for protecting all syscall and IRQ kernel mode entries. Care is taken to track the outermost usermode exit and entry using per-cpu counters. In cases where one of the hyperthreads enters the kernel, no additional IPIs are sent. Further, IPIs are avoided when not needed - example: idle and non-cookie HTs do not need to be forced into kernel mode. Hi Joel, In order to protect syscall/IRQ kernel mode entries, shouldn't we have a call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't see such a call. Am I missing something? Yes, this is a known bug and is fixed in v9, which I'll post soon. Meanwhile the updated patch is appended below: See comments below about the updated patch. ---8<--- From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 27 Jul 2020 17:56:14 -0400 Subject: [PATCH] kernel/entry: Add support for core-wide protection of kernel-mode Core-scheduling prevents hyperthreads in usermode from attacking each other, but it does not do anything about one of the hyperthreads entering the kernel for any reason. This leaves the door open for MDS and L1TF attacks with concurrent execution sequences between hyperthreads. This patch therefore adds support for protecting all syscall and IRQ kernel mode entries. Care is taken to track the outermost usermode exit and entry using per-cpu counters.
In cases where one of the hyperthreads enters the kernel, no additional IPIs are sent. Further, IPIs are avoided when not needed - example: idle and non-cookie HTs do not need to be forced into kernel mode. More information about attacks: For MDS, it is possible for syscalls, IRQ and softirq handlers to leak data to either host or guest attackers. For L1TF, it is possible to leak to guest attackers. There is no possible mitigation involving flushing of buffers to avoid this since the execution of attacker and victims happens concurrently on 2 or more HTs. Cc: Julien Desfossez Cc: Tim Chen Cc: Aaron Lu Cc: Aubrey Li Cc: Tim Chen Cc: Paul E. McKenney Co-developed-by: Vineeth Pillai Tested-by: Julien Desfossez Signed-off-by: Vineeth Pillai Signed-off-by: Joel Fernandes (Google) --- .../admin-guide/kernel-parameters.txt | 9 + include/linux/entry-common.h | 6 +- include/linux/sched.h | 12 + kernel/entry/common.c | 28 ++- kernel/sched/core.c | 230 ++ kernel/sched/sched.h | 3 + 6 files changed, 285 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3236427e2215..a338d5d64c3d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4678,6 +4678,15 @@ sbni= [NET] Granch SBNI12 leased line adapter + sched_core_protect_kernel= + [SCHED_CORE] Pause SMT siblings of a core running in + user mode, if at least one of the siblings of the core + is running in kernel mode. This is to guarantee that + kernel data is not leaked to tasks which are not trusted + by the kernel. A value of 0 disables protection, 1 + enables protection. The default is 1. Note that protection + depends on the arch defining the _TIF_UNSAFE_RET flag. + sched_debug [KNL] Enables verbose scheduler debug messages. schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 474f29638d2c..62278c5b3b5f 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -33,6 +33,10 @@ # define _TIF_PATCH_PENDING (0) #endif +#ifndef _TIF_UNSAFE_RET +# define _TIF_UNSAFE_RET (0) +#endif + #ifndef _TIF_UPROBE # define _TIF_UPROBE (0) #endif @@ -69,7 +73,7 @@ #define EXIT_TO_USER_MODE_WORK \ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ -_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | \ +_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET | \
Re: [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
[Copying the reply to Andy in the thread with the right email addresses] On 11/9/20 6:38 PM, Andy Lutomirski wrote: On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre wrote: SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions of these macros (swapgs() and swapgs_unsafe_stack()). This needs a very good justification. It also needs some kind of static verification that these helpers are only used by noinstr code, and they need to be __always_inline. And I cannot fathom how C code could possibly use SWAPGS_UNSAFE_STACK in a meaningful way. You're right, I probably need to revisit the usage of SWAPGS_UNSAFE_STACK in C code, that doesn't make sense. Looks like only SWAPGS is then needed. Or maybe we can just use native_swapgs() instead? I have added a C version of SWAPGS for moving paranoid_entry() to C because, in this function, we need to switch CR3 before updating GS. But I really wonder if we need a paravirt swapgs here, and we can probably just use native_swapgs(). Also, if we map the per cpu offsets (__per_cpu_offset) in the user page-table then we will be able to update GS before switching CR3. That way we can keep the GS update in assembly code, and just do the CR3 switch in C code. This would also avoid having to disable stack-protector (patch 21). alex.
Re: [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
[Copying the reply to Andy in the thread with the right email addresses] On 11/9/20 6:28 PM, Andy Lutomirski wrote: On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre wrote: Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, Probably fine. per cpu offsets (__per_cpu_offset, which is used in some entry code), This likely already leaks due to vulnerable CPUs leaking address space layout info. I forgot to update the comment, I am not mapping __per_cpu_offset anymore. However, if we do map __per_cpu_offset then we don't need to enforce the ordering in paranoid_entry to switch CR3 before GS. the stack canary, That's going to be a very tough sell. I can get rid of this, but this will require disabling the stack protector for any function that we can call while using the user page-table, as already done in patch 21 (x86/entry: Disable stack-protector for IST entry C handlers). alex.
Re: [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code
On 11/9/20 8:35 PM, Dave Hansen wrote: On 11/9/20 6:44 AM, Alexandre Chartre wrote: - map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code); This seems like the thing we'd want to tag explicitly rather than make it implicit with 'noinstr' code. Worst-case, shouldn't this be: #define __entry_func noinstr or something? Yes. I used the easy solution of just using noinstr because noinstr is mostly used for entry functions. But if we want to use the user page-table beyond the entry functions then we will definitely need a dedicated tag. I'd also like to see a lot more discussion about what the rules are for the C code and the compiler. We can't, for instance, do a normal printk() in these entry functions. Should we stick them in a special section and have objtool look for suspect patterns or references? I'm most worried about things like this: if (something_weird) pr_warn("this will oops the kernel\n"); That would be similar to noinstr which uses the .noinstr.text section, and if I remember correctly objtool detects if a noinstr function calls a non-noinstr function. Similarly here, an entry function should not call a non-entry function. alex.
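The explicit tag Dave suggests would work like noinstr's placement of code in .noinstr.text: a section attribute that tooling such as objtool can then police. A hypothetical sketch — the __entry_func name and the section name are illustrative, not part of the patch set:

```c
#include <assert.h>

/*
 * Hypothetical marker placing a function in a dedicated section, in
 * the same spirit as noinstr. A checker could then flag any call from
 * .entry.text out to ordinary .text (e.g. printk()).
 */
#define __entry_func __attribute__((__section__(".entry.text")))

__entry_func static int toy_entry_handler(int vector)
{
	/*
	 * Would have to avoid printk() and friends: only callees that
	 * are themselves in .entry.text are safe to call from here.
	 */
	return vector + 1;
}
```

The section placement itself changes nothing at runtime; its value is that it gives static tooling an unambiguous boundary to verify.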
Re: [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function
On 11/9/20 6:25 PM, Andy Lutomirski wrote: Hi Alexander- You appear to be infected by corporate malware that has inserted the string "@aserv0122.oracle.com" at the end of all the email addresses in your to: list. "l...@kernel.org"@aserv0122.oracle.com, for example, is not me. Can you fix this? I know, I messed up :-( I have already resent the entire RFC with correct addresses. Sorry about that. alex. On Mon, Nov 9, 2020 at 3:21 AM Alexandre Chartre wrote: Add a wrapper function for invoking a syscall function. This needs some explanation of why.
[RFC][PATCH 15/24] x86/pti: Execute syscall functions on the kernel stack
During a syscall, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the syscall function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 11 ++- arch/x86/entry/entry_64.S| 1 + arch/x86/include/asm/irq_stack.h | 3 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 54d0931801e1..ead6a4c72e6a 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { + unsigned long stack; + if (!sysfunc) return; - regs->ax = sysfunc(regs); + if (!pti_enabled()) { + regs->ax = sysfunc(regs); + return; + } + + stack = (unsigned long)task_top_of_kernel_stack(current); + regs->ax = asm_call_syscall_on_stack((void *)(stack - 8), +sysfunc, regs); } #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 29beab46bedd..6b88a0eb8975 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2) SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) +SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL) /* * Save the frame pointer unconditionally. This allows the ORC * unwinder to handle the stack switch.
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 359427216336..108d9da7c01c 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -5,6 +5,7 @@ #include #include +#include #ifdef CONFIG_X86_64 static __always_inline bool irqstack_active(void) @@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), struct irq_desc *desc); +long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func, + struct pt_regs *regs); static __always_inline void __run_on_irqstack(void (*func)(void)) { -- 2.18.4
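The core idea of asm_call_syscall_on_stack() — run one function on a different stack, then come back — can be mimicked in userspace with ucontext. This is purely illustrative: the kernel does the stack hop with a small assembly trampoline, not with ucontext, and all names here are made up:

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

/*
 * Userspace analogue of "run this function on another stack and come
 * back". The "syscall" body runs on a private heap-allocated stack,
 * and control returns to the caller's stack afterwards.
 */

static long result; /* stand-in for regs->ax = sysfunc(regs) */

static void on_other_stack(void)
{
	result = 42; /* "syscall" body, running on the private stack */
}

static long call_on_stack(void (*func)(void), size_t stack_size)
{
	ucontext_t main_ctx, alt_ctx;
	void *stack = malloc(stack_size);

	if (!stack)
		return -1;

	getcontext(&alt_ctx);
	alt_ctx.uc_stack.ss_sp = stack;
	alt_ctx.uc_stack.ss_size = stack_size;
	alt_ctx.uc_link = &main_ctx; /* resume here when func returns */
	makecontext(&alt_ctx, func, 0);
	swapcontext(&main_ctx, &alt_ctx); /* hop onto the private stack */

	free(stack);
	return result;
}
```

In the patch the motivation for the hop is isolation, not convenience: the syscall body must execute on a stack that is only mapped in the kernel page-table.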
[RFC][PATCH 18/24] x86/pti: Execute system vector handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Changes system vector handlers to execute on the kernel stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a82e31b45442..0c5d9f027112 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) +{ + void *stack = pti_kernel_stack(regs); + + if (stack) + asm_call_on_stack_1(stack, (void (*)(void))func, regs); + else + run_sysvec_on_irqstack_cond(func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - run_sysvec_on_irqstack_cond(__##func, regs);\ + run_sysvec(__##func, regs); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ -- 2.18.4
[RFC][PATCH 12/24] x86/pti: Function to map per-cpu page-table entry
Wrap the code used by PTI to map a per-cpu page-table entry into a new function so that this code can be re-used to map other per-cpu entries. Signed-off-by: Alexandre Chartre --- arch/x86/mm/pti.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index ebc8cd2f1cd8..71ca245d7b38 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr) *user_p4d = *kernel_p4d; } +/* + * Clone a single percpu page. + */ +static void __init pti_clone_percpu_page(void *addr) +{ + phys_addr_t pa = per_cpu_ptr_to_phys(addr); + pte_t *target_pte; + + target_pte = pti_user_pagetable_walk_pte((unsigned long)addr); + if (WARN_ON(!target_pte)) + return; + + *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); +} + /* * Clone the CPU_ENTRY_AREA and associated data into the user space visible * page table. @@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void) * This is done for all possible CPUs during boot to ensure * that it's propagated to all mms. */ + pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); - unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu); - phys_addr_t pa = per_cpu_ptr_to_phys((void *)va); - pte_t *target_pte; - - target_pte = pti_user_pagetable_walk_pte(va); - if (WARN_ON(!target_pte)) - return; - - *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); } } -- 2.18.4
[RFC][PATCH 11/24] x86/pti: Function to clone page-table entries from a specified mm
PTI has a function to clone page-table entries but only from the init_mm page-table. Provide a new function to clone page-table entries from a specified mm page-table. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/pti.h | 10 ++ arch/x86/mm/pti.c | 32 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 07375b476c4f..5484e69ff8d3 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -4,9 +4,19 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_PAGE_TABLE_ISOLATION + +enum pti_clone_level { + PTI_CLONE_PMD, + PTI_CLONE_PTE, +}; + +struct mm_struct; + extern void pti_init(void); extern void pti_check_boottime_disable(void); extern void pti_finalize(void); +extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level); #else static inline void pti_check_boottime_disable(void) { } #endif diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 1aab92930569..ebc8cd2f1cd8 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void) static void __init pti_setup_vsyscall(void) { } #endif -enum pti_clone_level { - PTI_CLONE_PMD, - PTI_CLONE_PTE, -}; - -static void -pti_clone_pgtable(unsigned long start, unsigned long end, - enum pti_clone_level level) +void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level) { unsigned long addr; @@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end, if (addr < start) break; - pgd = pgd_offset_k(addr); + pgd = pgd_offset(mm, addr); if (WARN_ON(pgd_none(*pgd))) return; p4d = p4d_offset(pgd, addr); @@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end, } } +static void pti_clone_init_pgtable(unsigned long start, unsigned long end, + enum pti_clone_level level) +{ + pti_clone_pgtable(&init_mm, start, end, level); +} + #ifdef 
CONFIG_X86_64 /* * Clone a single p4d (i.e. a top-level entry on 4-level systems and a @@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void) start = CPU_ENTRY_AREA_BASE; end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES); - pti_clone_pgtable(start, end, PTI_CLONE_PMD); + pti_clone_init_pgtable(start, end, PTI_CLONE_PMD); } #endif /* CONFIG_X86_64 */ @@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void) */ static void pti_clone_entry_text(void) { - pti_clone_pgtable((unsigned long) __entry_text_start, - (unsigned long) __entry_text_end, - PTI_CLONE_PMD); + pti_clone_init_pgtable((unsigned long) __entry_text_start, + (unsigned long) __entry_text_end, + PTI_CLONE_PMD); } /* @@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void) * pti_set_kernel_image_nonglobal() did to clear the * global bit. */ - pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); + pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); /* -* pti_clone_pgtable() will set the global bit in any PMDs -* that it clones, but we also need to get any PTEs in +* pti_clone_init_pgtable() will set the global bit in any +* PMDs that it clones, but we also need to get any PTEs in * the last level for areas that are not huge-page-aligned. */ -- 2.18.4
[RFC][PATCH 17/24] x86/pti: Execute IDT handlers with error code on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Changes IDT handlers which have an error code. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 18 -- arch/x86/kernel/traps.c | 2 +- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 3595a31947b3..a82e31b45442 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1)) : \ func(arg1)) +#define CALL_ON_STACK_2(stack, func, arg1, arg2) \ + ((stack) ? \ +asm_call_on_stack_2(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ +func(arg1, arg2)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack).
If we @@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); } +static __always_inline +void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), +struct pt_regs *regs, unsigned long error_code) +{ + CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs, \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs, error_code);\ + run_idt_errcode(__##func, regs, error_code);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs, \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs, (u8)error_code);\ + run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \ + regs, (u8)error_code); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 5161385b3670..9a51aa016fb3 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug) /* User entry, runs on regular task stack */ DEFINE_IDTENTRY_DEBUG_USER(exc_debug) { - exc_debug_user(regs, debug_read_clear_dr6()); + run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6()); } #else /* 32 bit does not have separate entry points. */ -- 2.18.4
[RFC][PATCH 24/24] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
With PTI, syscall/interrupt/exception entries switch the CR3 register to change the page-table in assembly code. Move the CR3 register switch inside the C code of syscall/interrupt/exception entry handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 15 --- arch/x86/entry/entry_64.S | 23 +-- arch/x86/entry/entry_64_compat.S| 22 -- arch/x86/include/asm/entry-common.h | 14 ++ arch/x86/include/asm/idtentry.h | 25 - arch/x86/kernel/cpu/mce/core.c | 2 ++ arch/x86/kernel/nmi.c | 2 ++ arch/x86/kernel/traps.c | 6 ++ arch/x86/mm/fault.c | 9 +++-- 9 files changed, 68 insertions(+), 50 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index ead6a4c72e6a..3f4788dbbde7 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, regs->ax = 0; } syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } static __always_inline void run_syscall(sys_call_ptr_t sysfunc, @@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc, #ifdef CONFIG_X86_64 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) { + switch_to_kernel_cr3(); nr = syscall_enter_from_user_mode(regs, nr); instrumentation_begin(); @@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) instrumentation_end(); syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } #endif #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION) static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs) { + switch_to_kernel_cr3(); if (IS_ENABLED(CONFIG_IA32_EMULATION)) current_thread_info()->status |= TS_COMPAT; @@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs) do_syscall_32_irqs_on(regs, nr); syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } -static noinstr bool __do_fast_syscall_32(struct pt_regs *regs) +static noinstr bool __do_fast_syscall_32(struct 
pt_regs *regs, long nr) { - unsigned int nr = syscall_32_enter(regs); int res; /* @@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs) /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs) { + unsigned int nr = syscall_32_enter(regs); + bool syscall_done; + /* * Called using the internal vDSO SYSENTER/SYSCALL32 calling * convention. Adjust regs so it looks like we entered using int80. @@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs) regs->ip = landing_pad; /* Invoke the syscall. If it failed, keep it simple: use IRET. */ - if (!__do_fast_syscall_32(regs)) + syscall_done = __do_fast_syscall_32(regs, nr); + switch_to_user_cr3(); + if (!syscall_done) return 0; #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 797effbe65b6..4be15a5ffe68 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64) swapgs /* tss.sp2 is scratch space. */ movq%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2) - SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL) @@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL) */ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ - POP_REGS pop_rdi=0 skip_r11rcx=1 + POP_REGS skip_r11rcx=1 /* -* We are on the trampoline stack. All regs except RDI are live. * We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. 
*/ STACKLEAK_ERASE_NOCLOBBER - SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi - - popq%rdi movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork) swapgs cld FENCE_SWAPGS_USER_ENTRY - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx movq%rsp, %rdx movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp UNWIND_HINT_IRET_REGS base=%rdx offset=8 @@ -592,19 +586,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) ud2 1: #endif - POP_REGS pop_rdi=0 + POP_REGS + addq
[RFC][PATCH 23/24] x86/entry: Remove paranoid_entry and paranoid_exit
The paranoid_entry and paranoid_exit assembly functions have been replaced by the kernel_paranoid_entry() and kernel_paranoid_exit() C functions. Now paranoid_entry/exit are not used anymore and can be removed. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 131 -- 1 file changed, 131 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 9ea8187d4405..797effbe65b6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -882,137 +882,6 @@ SYM_CODE_START(xen_failsafe_callback) SYM_CODE_END(xen_failsafe_callback) #endif /* CONFIG_XEN_PV */ -/* - * Save all registers in pt_regs. Return GSBASE related information - * in EBX depending on the availability of the FSGSBASE instructions: - * - * FSGSBASER/EBX - * N0 -> SWAPGS on exit - * 1 -> no SWAPGS on exit - * - * YGSBASE value at entry, must be restored in paranoid_exit - */ -SYM_CODE_START_LOCAL(paranoid_entry) - UNWIND_HINT_FUNC - cld - PUSH_AND_CLEAR_REGS save_ret=1 - ENCODE_FRAME_POINTER 8 - - /* -* Always stash CR3 in %r14. This value will be restored, -* verbatim, at exit. Needed if paranoid_entry interrupted -* another entry that already switched to the user CR3 value -* but has not yet returned to userspace. -* -* This is also why CS (stashed in the "iret frame" by the -* hardware at entry) can not be used: this may be a return -* to kernel code, but with a user CR3 value. -* -* Switching CR3 does not depend on kernel GSBASE so it can -* be done before switching to the kernel GSBASE. This is -* required for FSGSBASE because the kernel GSBASE has to -* be retrieved from a kernel internal table. -*/ - SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14 - - /* -* Handling GSBASE depends on the availability of FSGSBASE. -* -* Without FSGSBASE the kernel enforces that negative GSBASE -* values indicate kernel GSBASE. With FSGSBASE no assumptions -* can be made about the GSBASE value when entering from user -* space. 
-*/ - ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE - - /* -* Read the current GSBASE and store it in %rbx unconditionally, -* retrieve and set the current CPUs kernel GSBASE. The stored value -* has to be restored in paranoid_exit unconditionally. -* -* The unconditional write to GS base below ensures that no subsequent -* loads based on a mispredicted GS base can happen, therefore no LFENCE -* is needed here. -*/ - SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx - ret - -.Lparanoid_entry_checkgs: - /* EBX = 1 -> kernel GSBASE active, no restore required */ - movl$1, %ebx - /* -* The kernel-enforced convention is a negative GSBASE indicates -* a kernel value. No SWAPGS needed on entry and exit. -*/ - movl$MSR_GS_BASE, %ecx - rdmsr - testl %edx, %edx - jns .Lparanoid_entry_swapgs - ret - -.Lparanoid_entry_swapgs: - SWAPGS - - /* -* The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an -* unconditional CR3 write, even in the PTI case. So do an lfence -* to prevent GS speculation, regardless of whether PTI is enabled. -*/ - FENCE_SWAPGS_KERNEL_ENTRY - - /* EBX = 0 -> SWAPGS required on exit */ - xorl%ebx, %ebx - ret -SYM_CODE_END(paranoid_entry) - -/* - * "Paranoid" exit path from exception stack. This is invoked - * only on return from non-NMI IST interrupts that came - * from kernel space. - * - * We may be returning to very strange contexts (e.g. very early - * in syscall entry), so checking for preemption here would - * be complicated. Fortunately, there's no good reason to try - * to handle preemption here. - * - * R/EBX contains the GSBASE related information depending on the - * availability of the FSGSBASE instructions: - * - * FSGSBASER/EBX - * N0 -> SWAPGS on exit - * 1 -> no SWAPGS on exit - * - * YUser space GSBASE, must be restored unconditionally - */ -SYM_CODE_START_LOCAL(paranoid_exit) - UNWIND_HINT_REGS - /* -* The order of operations is important. RESTORE_CR3 requires -* kernel GSBASE. 
-* -* NB to anyone to try to optimize this code: this code does -* not execute at all for exceptions from user mode. Those -* exceptions go through error_exit instead. -*/ - RESTORE_CR3 scratch_reg=%rax save_reg=%r14 - - /* Handle the three GSBASE cases */ - ALTERNATIVE "jmp .Lparanoid_exit_checkgs"
[RFC][PATCH 22/24] x86/entry: Defer paranoid entry/exit to C code
IST entries from the kernel use paranoid entry and exit assembly functions to ensure the CR3 and GS registers are updated with correct values for the kernel. Move the update of the CR3 and GS registers inside the C code of IST handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 72 ++ arch/x86/kernel/cpu/mce/core.c | 3 ++ arch/x86/kernel/nmi.c | 18 +++-- arch/x86/kernel/sev-es.c | 20 +- arch/x86/kernel/traps.c| 30 -- 5 files changed, 83 insertions(+), 60 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6b88a0eb8975..9ea8187d4405 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -462,16 +462,16 @@ SYM_CODE_START(\asmsym) /* Entry from kernel */ pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* paranoid_entry returns GS information for paranoid_exit in EBX. */ - callparanoid_entry - + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs pointer */ call\cfunc - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -507,12 +507,9 @@ SYM_CODE_START(\asmsym) */ ist_entry_user safe_stack_\cfunc, has_error_code=1 - /* -* paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. -* EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS -*/ - callparanoid_entry - + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS /* @@ -538,7 +535,7 @@ SYM_CODE_START(\asmsym) * identical to the stack in the IRET frame or the VC fall-back stack, * so it is definitly mapped even with PTI enabled. */ - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -555,8 +552,9 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_IRET_REGS offset=8 ASM_CLAC - /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ - callparanoid_entry + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs pointer into first argument */ @@ -564,7 +562,7 @@ SYM_CODE_START(\asmsym) movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ call\cfunc - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -1119,10 +1117,6 @@ SYM_CODE_END(error_return) /* * Runs on exception stack. Xen PV does not go through this path at all, * so we can use real assembly here. - * - * Registers: - * %r14: Used to save/restore the CR3 of the interrupted context - * when PAGE_TABLE_ISOLATION is in use. Do not clobber. */ SYM_CODE_START(asm_exc_nmi) /* @@ -1173,7 +1167,7 @@ SYM_CODE_START(asm_exc_nmi) * We also must not push anything to the stack before switching * stacks lest we corrupt the "NMI executing" variable. */ - ist_entry_user exc_nmi + ist_entry_user exc_nmi_user /* NMI from kernel */ @@ -1346,9 +1340,7 @@ repeat_nmi: * * RSP is pointing to "outermost RIP". gsbase is unknown, but, if * we're repeating an NMI, gsbase has the same value that it had on -* the first iteration. paranoid_entry will load the kernel -* gsbase if needed before we call exc_nmi(). "NMI executing" -* is zero. +* the first iteration. "NMI executing" is zero. */ movq$1, 10*8(%rsp) /* Set "NMI executing". */ @@ -1372,44 +1364,20 @@ end_repeat_nmi: pushq $-1 /* ORIG_RAX: no syscall to restart */ /* -* Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit -* as we should not be calling schedule in NMI context. -* Even with normal interrupts enabled. An NMI should not be -* setting NEED_RESCHED or anything that normal interrupts and +* We should not be calling schedule in NMI context. Even with +* normal interrupts enabled. An NMI should not be setting +* NEED_RESCHED or anything that normal interrupts and * exceptions might do. 
*/ - callparanoid_entry + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi movq$-1, %rsi callexc_nmi - /* Always restore stashed CR3 value (see paranoid_entry) */ - RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 - - /* -* The above invocation of pa
[RFC][PATCH 14/24] x86/pti: Use PTI stack instead of trampoline stack
When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Additional changes will be made to later to switch to the kernel stack (which is only mapped in the kernel page-table). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 42 +--- arch/x86/include/asm/pti.h | 8 ++ arch/x86/include/asm/switch_to.h | 7 +- 3 files changed, 26 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 458af12ed9a1..29beab46bedd 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -194,19 +194,9 @@ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ POP_REGS pop_rdi=0 skip_r11rcx=1 - /* -* Now all regs are restored except RSP and RDI. -* Save old stack pointer and switch to trampoline stack. -*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - pushq RSP-RDI(%rdi) /* RSP */ - pushq (%rdi) /* RDI */ - /* * We are on the trampoline stack. All regs except RDI are live. +* We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. */ STACKLEAK_ERASE_NOCLOBBER @@ -214,7 +204,7 @@ syscall_return_via_sysret: SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi popq%rdi - popq%rsp + movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) #endif POP_REGS pop_rdi=0 - /* -* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS. -* Save old stack pointer and switch to trampoline stack. 
-*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - /* Copy the IRET frame to the trampoline stack. */ - pushq 6*8(%rdi) /* SS */ - pushq 5*8(%rdi) /* RSP */ - pushq 4*8(%rdi) /* EFLAGS */ - pushq 3*8(%rdi) /* CS */ - pushq 2*8(%rdi) /* RIP */ - - /* Push user RDI on the trampoline stack. */ - pushq (%rdi) - /* * We are on the trampoline stack. All regs except RDI are live. * We can do future final exit work right here. @@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) /* Restore RDI. */ popq%rdi + addq$8, %rsp/* skip regs->orig_ax */ SWAPGS INTERRUPT_RETURN @@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry) SWITCH_TO_KERNEL_CR3 scratch_reg=%rax .Lerror_entry_from_usermode_after_swapgs: + /* +* We are on the trampoline stack. With PTI, the trampoline +* stack is a per-thread stack so we are all set and we can +* return. +* +* Without PTI, the trampoline stack is a per-cpu stack and +* we need to switch to the normal thread stack. +*/ + ALTERNATIVE "", "ret", X86_FEATURE_PTI /* Put us onto the real thread stack. 
*/ popq%r12/* save return addr in %12 */ movq%rsp, %rdi /* arg0 = pt_regs pointer */ diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 5484e69ff8d3..ed211fcc3a50 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void); extern void pti_finalize(void); extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, unsigned long end, enum pti_clone_level level); +static inline bool pti_enabled(void) +{ + return static_cpu_has(X86_FEATURE_PTI); +} #else static inline void pti_check_boottime_disable(void) { } +static inline bool pti_enabled(void) +{ + return false; +} #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h index 9f69cc497f4b..457458228462 100644 --- a/arch/x86/include/asm/switch_to.h +++ b/arch/x86/include/asm/switch_to.h @@ -3,6 +3,7 @@ #define _ASM_X86_SWITCH_TO_H #include +#include struct task_struct; /* one of the stranger aspects of C forward declarations */ @@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task) * doesn't wo
[RFC][PATCH 21/24] x86/entry: Disable stack-protector for IST entry C handlers
The stack-protector option adds a stack canary to functions vulnerable to stack buffer overflow. The stack canary is defined through the GS register. Add an attribute to disable the stack-protector option; it will be used for C functions which can be called while the GS register might not be properly configured yet. The GS register is not properly configured for the kernel when we enter the kernel from userspace. The assembly entry code sets the GS register for the kernel using the swapgs instruction or the paranoid_entry function, and so, currently, the GS register is correctly configured when assembly entry code subsequently transfers control to C code. Deferring the CR3 register switch from assembly to C code will require reimplementing paranoid_entry in C and hence also deferring the GS register setup for IST entries to C code. To prepare for this change, disable stack-protector for IST entry C handlers where the GS register setup will eventually happen. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 25 - arch/x86/kernel/nmi.c | 2 +- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a6725afaaec0..647af7ea3bf1 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -94,6 +94,21 @@ void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) run_sysvec_on_irqstack_cond(func, regs); } +/* + * Attribute to disable the stack-protector option. The option is + * disabled using the optimize attribute which clears all optimize + * options. So we need to specify the optimize option to disable but + * also optimize options we want to preserve. + * + * The stack-protector option adds a stack canary to functions + * vulnerable to stack buffer overflow. The stack canary is defined + * through the GS register. 
So the attribute is used to disable the + * stack-protector option for functions which can be called while the + * GS register might not be properly configured yet. + */ +#define no_stack_protector \ + __attribute__ ((optimize("-O2,-fno-stack-protector,-fno-omit-frame-pointer"))) + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -410,7 +425,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW */ #define DEFINE_IDTENTRY_IST(func) \ - DEFINE_IDTENTRY_RAW(func) + no_stack_protector DEFINE_IDTENTRY_RAW(func) /** * DEFINE_IDTENTRY_NOIST - Emit code for NOIST entry points which @@ -440,7 +455,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE */ #define DEFINE_IDTENTRY_DF(func) \ - DEFINE_IDTENTRY_RAW_ERRORCODE(func) + no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func) /** * DEFINE_IDTENTRY_VC_SAFE_STACK - Emit code for VMM communication handler @@ -472,7 +487,7 @@ static __always_inline void __##func(struct pt_regs *regs) * VMM communication handler. 
*/ #define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ - __visible noinstr \ + no_stack_protector __visible noinstr\ unsigned long setup_stack_##func(struct pt_regs *regs) /** @@ -482,7 +497,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE */ #define DEFINE_IDTENTRY_VC(func) \ - DEFINE_IDTENTRY_RAW_ERRORCODE(func) + no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func) #else /* CONFIG_X86_64 */ @@ -517,7 +532,7 @@ __visible noinstr void func(struct pt_regs *regs, \ /* C-Code mapping */ #define DECLARE_IDTENTRY_NMI DECLARE_IDTENTRY_RAW -#define DEFINE_IDTENTRY_NMIDEFINE_IDTENTRY_RAW +#define DEFINE_IDTENTRY_NMIno_stack_protector DEFINE_IDTENTRY_RAW #ifdef CONFIG_X86_64 #define DECLARE_IDTENTRY_MCE DECLARE_IDTENTRY_IST diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index be0f654c3095..b6291b683be1 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state); static DEFINE_PER_CPU(unsigned long, nmi_cr2); static DEFINE_PER_CPU(unsigned long, nmi_dr7); -DEFINE_IDTENTRY_RAW(exc_nmi) +DEFINE_IDTENTRY_NMI(exc_nmi) { bool irq_state; -- 2.18.4
[RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions of these macros (swapgs() and swapgs_unsafe_stack()). Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/paravirt.h | 15 +++ arch/x86/include/asm/paravirt_types.h | 17 - 2 files changed, 27 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d25cc6830e89..a4898130b36b 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -145,6 +145,21 @@ static inline void __write_cr4(unsigned long x) PVOP_VCALL1(cpu.write_cr4, x); } +static inline void swapgs(void) +{ + PVOP_VCALL0(cpu.swapgs); +} + +/* + * If swapgs is used while the userspace stack is still current, + * there's no way to call a pvop. The PV replacement *must* be + * inlined, or the swapgs instruction must be trapped and emulated. + */ +static inline void swapgs_unsafe_stack(void) +{ + PVOP_VCALL0_ALT(cpu.swapgs, "swapgs"); +} + static inline void arch_safe_halt(void) { PVOP_VCALL0(irq.safe_halt); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 0fad9f61c76a..eea9acc942a3 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -532,12 +532,12 @@ int paravirt_disable_iospace(void); pre, post, ##__VA_ARGS__) -#define PVOP_VCALL(op, clbr, call_clbr, extra_clbr, pre, post, ...) \ +#define PVOP_VCALL(op, insn, clbr, call_clbr, extra_clbr, pre, post, ...) \ ({ \ PVOP_VCALL_ARGS;\ PVOP_TEST_NULL(op); \ asm volatile(pre\ -paravirt_alt(PARAVIRT_CALL)\ +paravirt_alt(insn) \ post \ : call_clbr, ASM_CALL_CONSTRAINT \ : paravirt_type(op), \ @@ -547,12 +547,17 @@ int paravirt_disable_iospace(void); }) #define __PVOP_VCALL(op, pre, post, ...) 
\ - PVOP_VCALL(op, CLBR_ANY, PVOP_VCALL_CLOBBERS, \ - VEXTRA_CLOBBERS, \ + PVOP_VCALL(op, PARAVIRT_CALL, CLBR_ANY, \ + PVOP_VCALL_CLOBBERS, VEXTRA_CLOBBERS,\ pre, post, ##__VA_ARGS__) +#define __PVOP_VCALL_ALT(op, insn) \ + PVOP_VCALL(op, insn, CLBR_ANY, \ + PVOP_VCALL_CLOBBERS, VEXTRA_CLOBBERS,\ + "", "") + #define __PVOP_VCALLEESAVE(op, pre, post, ...) \ - PVOP_VCALL(op.func, CLBR_RET_REG, \ + PVOP_VCALL(op.func, PARAVIRT_CALL, CLBR_RET_REG,\ PVOP_VCALLEE_CLOBBERS, , \ pre, post, ##__VA_ARGS__) @@ -562,6 +567,8 @@ int paravirt_disable_iospace(void); __PVOP_CALL(rettype, op, "", "") #define PVOP_VCALL0(op) \ __PVOP_VCALL(op, "", "") +#define PVOP_VCALL0_ALT(op, insn) \ + __PVOP_VCALL_ALT(op, insn) #define PVOP_CALLEE0(rettype, op) \ __PVOP_CALLEESAVE(rettype, op, "", "") -- 2.18.4
[RFC][PATCH 05/24] x86/entry: Implement ret_from_fork body with C code
ret_from_fork is a mix of assembly code and calls to C functions. Re-implement ret_from_fork so that it calls a single C function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 18 ++ arch/x86/entry/entry_64.S | 28 +--- 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d12908ad..7ee15a12c115 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,24 @@ #include #include +__visible noinstr void return_from_fork(struct pt_regs *regs, + struct task_struct *prev, + void (*kfunc)(void *), void *kargs) +{ + schedule_tail(prev); + if (kfunc) { + /* kernel thread */ + kfunc(kargs); + /* +* A kernel thread is allowed to return here after +* successfully calling kernel_execve(). Exit to +* userspace to complete the execve() syscall. +*/ + regs->ax = 0; + } + syscall_exit_to_user_mode(regs); +} + static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 274384644b5e..73e9cd47dc83 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -276,31 +276,13 @@ SYM_FUNC_END(__switch_to_asm) */ .pushsection .text, "ax" SYM_CODE_START(ret_from_fork) - UNWIND_HINT_EMPTY - movq%rax, %rdi - callschedule_tail /* rdi: 'prev' task parameter */ - - testq %rbx, %rbx /* from kernel_thread? 
*/ - jnz 1f /* kernel threads are uncommon */ - -2: UNWIND_HINT_REGS - movq%rsp, %rdi - callsyscall_exit_to_user_mode /* returns with IRQs disabled */ + movq%rsp, %rdi /* pt_regs */ + movq%rax, %rsi /* 'prev' task parameter */ + movq%rbx, %rdx /* kernel thread func */ + movq%r12, %rcx /* kernel thread arg */ + callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode - -1: - /* kernel thread */ - UNWIND_HINT_EMPTY - movq%r12, %rdi - CALL_NOSPEC rbx - /* -* A kernel thread is allowed to return here after successfully -* calling kernel_execve(). Exit to userspace to complete the execve() -* syscall. -*/ - movq$0, RAX(%rsp) - jmp 2b SYM_CODE_END(ret_from_fork) .popsection -- 2.18.4
[RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, per cpu offsets (__per_cpu_offset, which is used some in entry code), the stack canary, and the PTI stack (which is defined per task). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 2 -- arch/x86/mm/pti.c | 14 ++ kernel/fork.c | 22 ++ 3 files changed, 36 insertions(+), 2 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6e0b5b010e0b..458af12ed9a1 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -274,7 +274,6 @@ SYM_FUNC_END(__switch_to_asm) * rbx: kernel thread func (NULL for user thread) * r12: kernel thread arg */ -.pushsection .text, "ax" SYM_CODE_START(ret_from_fork) UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs */ @@ -284,7 +283,6 @@ SYM_CODE_START(ret_from_fork) callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode SYM_CODE_END(ret_from_fork) -.popsection .macro DEBUG_ENTRY_ASSERT_IRQS_OFF #ifdef CONFIG_DEBUG_ENTRY diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 71ca245d7b38..f4f3d9ae4449 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -465,6 +465,11 @@ static void __init pti_clone_user_shared(void) */ pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); + /* +* Map fixed_percpu_data to get the stack canary. +*/ + if (IS_ENABLED(CONFIG_STACKPROTECTOR)) + pti_clone_percpu_page(&per_cpu(fixed_percpu_data, cpu)); } } @@ -505,6 +510,15 @@ static void pti_clone_entry_text(void) pti_clone_init_pgtable((unsigned long) __entry_text_start, (unsigned long) __entry_text_end, PTI_CLONE_PMD); + + /* + * Syscall and interrupt entry code (which is in the noinstr + * section) will be entered with the user page-table, so that + * code has to be mapped in. 
+ */ + pti_clone_init_pgtable((unsigned long) __noinstr_text_start, + (unsigned long) __noinstr_text_end, + PTI_CLONE_PMD); } /* diff --git a/kernel/fork.c b/kernel/fork.c index 6d266388d380..31cd77dbdba3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -999,6 +999,25 @@ static void mm_init_uprobes_state(struct mm_struct *mm) #endif } +static void mm_map_task(struct mm_struct *mm, struct task_struct *tsk) +{ +#ifdef CONFIG_PAGE_TABLE_ISOLATION + unsigned long addr; + + if (!tsk || !static_cpu_has(X86_FEATURE_PTI)) + return; + + /* +* Map the task stack after the kernel stack into the user +* address space, so that this stack can be used when entering +* syscall or interrupt from user mode. +*/ + BUG_ON(!task_stack_page(tsk)); + addr = (unsigned long)task_top_of_kernel_stack(tsk); + pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE); +#endif +} + static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns) { @@ -1043,6 +1062,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (init_new_context(p, mm)) goto fail_nocontext; + mm_map_task(mm, p); + mm->user_ns = get_user_ns(user_ns); return mm; @@ -1404,6 +1425,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk) vmacache_flush(tsk); if (clone_flags & CLONE_VM) { + mm_map_task(oldmm, tsk); mmget(oldmm); mm = oldmm; goto good_mm; -- 2.18.4
[RFC][PATCH 10/24] x86/pti: Introduce per-task PTI trampoline stack
Double the size of the kernel stack when using PTI. The entire stack is mapped into the kernel address space, and the top half of the stack (the PTI stack) is also mapped into the user address space. The PTI stack will be used as a per-task trampoline stack instead of the current per-cpu trampoline stack. This will allow running more code on the trampoline stack, in particular code that schedules the task out. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/page_64_types.h | 36 +++- arch/x86/include/asm/processor.h | 3 +++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 3f49dac03617..733accc20fdb 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -12,7 +12,41 @@ #define KASAN_STACK_ORDER 0 #endif -#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) +#ifdef CONFIG_PAGE_TABLE_ISOLATION +/* + * PTI doubles the size of the stack. The entire stack is mapped into + * the kernel address space. However, only the top half of the stack is + * mapped into the user address space. + * + * On syscall or interrupt, user mode enters the kernel with the user + * page-table, and the stack pointer is switched to the top of the + * stack (which is mapped in the user address space and in the kernel). + * The syscall/interrupt handler will then later decide when to switch + * to the kernel address space, and to switch to the top of the kernel + * stack which is only mapped in the kernel. 
+ * + * +-+ + * | | ^ ^ + * | kernel-only | | KERNEL_STACK_SIZE | + * |stack| | | + * | | V | + * +-+ <- top of kernel stack | THREAD_SIZE + * | | ^ | + * | kernel and | | KERNEL_STACK_SIZE | + * | PTI stack | | | + * | | V v + * +-+ <- top of stack + */ +#define PTI_STACK_ORDER 1 +#else +#define PTI_STACK_ORDER 0 +#endif + +#define KERNEL_STACK_ORDER 2 +#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER) + +#define THREAD_SIZE_ORDER \ + (KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER) #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) #define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 82a08b585818..47b1b806535b 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x) #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1)) +#define task_top_of_kernel_stack(task) \ + ((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE)) + #define task_pt_regs(task) \ ({ \ unsigned long __ptr = (unsigned long)task_stack_page(task); \ -- 2.18.4
[RFC][PATCH 19/24] x86/pti: Execute page fault handler on the kernel stack
After a page fault from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the page fault handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leak to the userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 17 + arch/x86/mm/fault.c | 2 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 0c5d9f027112..a6725afaaec0 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -31,6 +31,13 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ func(arg1, arg2)) +#define CALL_ON_STACK_3(stack, func, arg1, arg2, arg3) \ + ((stack) ? \ +asm_call_on_stack_3(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2), \ + (void *)(arg3)) : \ +func(arg1, arg2, arg3)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack). 
If we @@ -66,6 +73,16 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_idt_pagefault(void (*func)(struct pt_regs *, unsigned long, + unsigned long), + struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + CALL_ON_STACK_3(pti_kernel_stack(regs), + func, regs, error_code, address); +} + static __always_inline void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) { diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 82bf37a5c9ec..b9d03603d95d 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1482,7 +1482,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) state = irqentry_enter(regs); instrumentation_begin(); - handle_page_fault(regs, error_code, address); + run_idt_pagefault(handle_page_fault, regs, error_code, address); instrumentation_end(); irqentry_exit(regs, state); -- 2.18.4
[RFC][PATCH 20/24] x86/pti: Execute NMI handler on the kernel stack
After an NMI from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the NMI handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/kernel/nmi.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 4bc77aaf1303..be0f654c3095 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi) inc_irq_stat(__nmi_count); - if (!ignore_nmis) - default_do_nmi(regs); + if (!ignore_nmis) { + if (user_mode(regs)) { + /* +* If we come from userland then we are on the +* trampoline stack, switch to the kernel stack +* to execute the NMI handler. +*/ + run_idt(default_do_nmi, regs); + } else { + default_do_nmi(regs); + } + } idtentry_exit_nmi(regs, irq_state); -- 2.18.4
[RFC][PATCH 16/24] x86/pti: Execute IDT handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leak to the userland through the stack. For now, only changes IDT handlers which have no argument other than the pt_regs registers. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 43 +++-- arch/x86/kernel/cpu/mce/core.c | 2 +- arch/x86/kernel/traps.c | 4 +-- 3 files changed, 44 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 4b4aca2b1420..3595a31947b3 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -10,10 +10,49 @@ #include #include +#include bool idtentry_enter_nmi(struct pt_regs *regs); void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); +/* + * The CALL_ON_STACK_* macro call the specified function either directly + * if no stack is provided, or on the specified stack. + */ +#define CALL_ON_STACK_1(stack, func, arg1) \ + ((stack) ? \ +asm_call_on_stack_1(stack, \ + (void (*)(void))(func), (void *)(arg1)) : \ +func(arg1)) + +/* + * Functions to return the top of the kernel stack if we are using the + * user page-table (and thus not running with the kernel stack). If we + * are using the kernel page-table (and so already using the kernel + * stack) when it returns NULL. + */ +static __always_inline void *pti_kernel_stack(struct pt_regs *regs) +{ + unsigned long stack; + + if (pti_enabled() && user_mode(regs)) { + stack = (unsigned long)task_top_of_kernel_stack(current); + return (void *)(stack - 8); + } else { + return NULL; + } +} + +/* + * Wrappers to run an IDT handler on the kernel stack if we are not + * already using this stack. 
+ */ +static __always_inline +void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) +{ + CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -55,7 +94,7 @@ __visible noinstr void func(struct pt_regs *regs) \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs);\ + run_idt(__##func, regs);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -271,7 +310,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ __irq_enter_raw(); \ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs);\ + run_idt(__##func, regs);\ __irq_exit_raw(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 4102b866e7c0..9407c3cd9355 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -2035,7 +2035,7 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check) unsigned long dr7; dr7 = local_db_save(); - exc_machine_check_user(regs); + run_idt(exc_machine_check_user, regs); local_db_restore(dr7); } #else diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 09b22a611d99..5161385b3670 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -257,7 +257,7 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op) state = irqentry_enter(regs); instrumentation_begin(); - handle_invalid_op(regs); + run_idt(handle_invalid_op, regs); instrumentation_end(); irqentry_exit(regs, state); } @@ -647,7 +647,7 @@ DEFINE_IDTENTRY_RAW(exc_int3) if (user_mode(regs)) { irqentry_enter_from_user_mode(regs); instrumentation_begin(); - do_int3_user(regs); + run_idt(do_int3_us
[RFC][PATCH 06/24] x86/pti: Provide C variants of PTI switch CR3 macros
Page Table Isolation (PTI) uses assembly macros to switch the CR3 register between kernel and user page-tables. Add C functions which implement the same features. For now, these C functions are not used but they will eventually replace the assembly macros. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 44 +++ arch/x86/include/asm/entry-common.h | 84 + 2 files changed, 128 insertions(+) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 7ee15a12c115..d09b1ded5287 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -343,3 +343,47 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs) } } #endif /* CONFIG_XEN_PV */ + +#ifdef CONFIG_PAGE_TABLE_ISOLATION + +static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) +{ + unsigned long cr3, saved_cr3; + + if (!static_cpu_has(X86_FEATURE_PTI)) + return 0; + + saved_cr3 = cr3 = __read_cr3(); + if (cr3 & PTI_USER_PGTABLE_MASK) { + adjust_kernel_cr3(&cr3); + native_write_cr3(cr3); + } + + return saved_cr3; +} + +static __always_inline void restore_cr3(unsigned long cr3) +{ + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + if (static_cpu_has(X86_FEATURE_PCID)) { + if (cr3 & PTI_USER_PGTABLE_MASK) + adjust_user_cr3(&cr3); + else + cr3 |= X86_CR3_PCID_NOFLUSH; + } + + native_write_cr3(cr3); +} + +#else /* CONFIG_PAGE_TABLE_ISOLATION */ + +static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) +{ + return 0; +} + +static __always_inline void restore_cr3(unsigned long cr3) {} + +#endif /* CONFIG_PAGE_TABLE_ISOLATION */ diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 6fe54b2813c1..b05b212f5ebc 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -7,6 +7,7 @@ #include #include #include +#include /* Check that the stack and regs on entry from user mode are sane. 
*/ static __always_inline void arch_check_user_regs(struct pt_regs *regs) @@ -81,4 +82,87 @@ static __always_inline void arch_exit_to_user_mode(void) } #define arch_exit_to_user_mode arch_exit_to_user_mode +#ifndef MODULE +#ifdef CONFIG_PAGE_TABLE_ISOLATION + +/* + * PAGE_TABLE_ISOLATION PGDs are 8k. Flip bit 12 to switch between the two + * halves: + */ +#define PTI_USER_PGTABLE_BIT PAGE_SHIFT +#define PTI_USER_PGTABLE_MASK (1 << PTI_USER_PGTABLE_BIT) +#define PTI_USER_PCID_BIT X86_CR3_PTI_PCID_USER_BIT +#define PTI_USER_PCID_MASK (1 << PTI_USER_PCID_BIT) +#define PTI_USER_PGTABLE_AND_PCID_MASK \ + (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK) + +static __always_inline void adjust_kernel_cr3(unsigned long *cr3) +{ + if (static_cpu_has(X86_FEATURE_PCID)) + *cr3 |= X86_CR3_PCID_NOFLUSH; + + /* +* Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3 +* at kernel pagetables. +*/ + *cr3 &= ~PTI_USER_PGTABLE_AND_PCID_MASK; +} + +static __always_inline void adjust_user_cr3(unsigned long *cr3) +{ + unsigned short mask; + unsigned long asid; + + /* +* Test if the ASID needs a flush. 
+*/ + asid = *cr3 & 0x7ff; + mask = this_cpu_read(cpu_tlbstate.user_pcid_flush_mask); + if (mask & (1 << asid)) { + /* Flush needed, clear the bit */ + this_cpu_and(cpu_tlbstate.user_pcid_flush_mask, ~(1 << asid)); + } else { + *cr3 |= X86_CR3_PCID_NOFLUSH; + } +} + +static __always_inline void switch_to_kernel_cr3(void) +{ + unsigned long cr3; + + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + cr3 = __read_cr3(); + adjust_kernel_cr3(&cr3); + native_write_cr3(cr3); +} + +static __always_inline void switch_to_user_cr3(void) +{ + unsigned long cr3; + + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + cr3 = __read_cr3(); + if (static_cpu_has(X86_FEATURE_PCID)) { + adjust_user_cr3(&cr3); + /* Flip the ASID to the user version */ + cr3 |= PTI_USER_PCID_MASK; + } + + /* Flip the PGD to the user version */ + cr3 |= PTI_USER_PGTABLE_MASK; + native_write_cr3(cr3); +} + +#else /* CONFIG_PAGE_TABLE_ISOLATION */ + +static inline void switch_to_kernel_cr3(void) {} +static inline void switch_to_user_cr3(void) {} + +#endif /* CONFIG_PAGE_TABLE_ISOLATION */ +#endif /* MODULE */ + #endif -- 2.18.4
[RFC][PATCH 04/24] x86/sev-es: Define a setup stack function for the VC idtentry
The #VC exception assembly entry code uses C code (vc_switch_off_ist) to get and configure a stack, then returns to assembly to switch to that stack and finally invokes the C exception handler function. To pave the way for deferring CR3 switch from assembly to C code, define a setup stack function for the VC idtentry. This function is used to get and configure the stack before invoking the idtentry handler. For now, the setup stack function is just a wrapper around the vc_switch_off_ist() function but it will eventually also contain the C code to switch CR3. The vc_switch_off_ist() function is also refactored to just return the stack pointer, and the stack configuration is done in the setup stack function (so that the stack can also be used to propagate CR3 switch information to the idtentry handler for switching CR3 back). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 8 +++- arch/x86/include/asm/idtentry.h | 14 ++ arch/x86/include/asm/traps.h| 2 +- arch/x86/kernel/sev-es.c| 34 + arch/x86/kernel/traps.c | 19 +++--- 5 files changed, 55 insertions(+), 22 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 51df9f1871c6..274384644b5e 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -546,13 +546,11 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_REGS /* -* Switch off the IST stack to make it free for nested exceptions. The -* vc_switch_off_ist() function will switch back to the interrupted -* stack if it is safe to do so. If not it switches to the VC fall-back -* stack. +* Call the setup stack function. It configures and returns +* the stack we should be using to run the exception handler. 
*/ movq%rsp, %rdi /* pt_regs pointer */ - callvc_switch_off_ist + callsetup_stack_\cfunc movq%rax, %rsp /* Switch to new stack */ UNWIND_HINT_REGS diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index b2442eb0ac2f..4b4aca2b1420 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -318,6 +318,7 @@ static __always_inline void __##func(struct pt_regs *regs) */ #define DECLARE_IDTENTRY_VC(vector, func) \ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func); \ + __visible noinstr unsigned long setup_stack_##func(struct pt_regs *regs); \ __visible noinstr void ist_##func(struct pt_regs *regs, unsigned long error_code); \ __visible noinstr void safe_stack_##func(struct pt_regs *regs, unsigned long error_code) @@ -380,6 +381,19 @@ static __always_inline void __##func(struct pt_regs *regs) #define DEFINE_IDTENTRY_VC_IST(func) \ DEFINE_IDTENTRY_RAW_ERRORCODE(ist_##func) +/** + * DEFINE_IDTENTRY_VC_SETUP_STACK - Emit code for setting up the stack to + run the VMM communication handler + * @func: Function name of the entry point + * + * The stack setup code is executed before the VMM communication handler. + * It configures and returns the stack to switch to before running the + * VMM communication handler. 
+ */ +#define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ + __visible noinstr \ + unsigned long setup_stack_##func(struct pt_regs *regs) + /** * DEFINE_IDTENTRY_VC - Emit code for VMM communication handler * @func: Function name of the entry point diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 7f7200021bd1..cfcc9d34d2a0 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -15,7 +15,7 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); asmlinkage __visible notrace struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s); void __init trap_init(void); -asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs); +asmlinkage __visible noinstr unsigned long vc_switch_off_ist(struct pt_regs *eregs); #endif #ifdef CONFIG_X86_F00F_BUG diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 0bd1a0fc587e..bd977c917cd6 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -1349,6 +1349,40 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication) instrumentation_end(); } +struct exc_vc_frame { + /* pt_regs should be first */ + struct pt_regs regs; +}; + +DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication) +{ + struct exc_vc_frame *frame; + unsigned long sp; + + /* +* Switch off the IST stack to make it free for nested exceptions. +* The vc_switch_off_ist() function will switch back to the +* interrupted stack if
[RFC][PATCH 02/24] x86/entry: Update asm_call_on_stack to support more function arguments
Update the asm_call_on_stack() function so that it can be invoked with a function having up to three arguments instead of only one. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 15 +++ arch/x86/include/asm/irq_stack.h | 8 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index cad08703c4ad..c42948aca0a8 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -759,9 +759,14 @@ SYM_CODE_END(.Lbad_gs) /* * rdi: New stack pointer points to the top word of the stack * rsi: Function pointer - * rdx: Function argument (can be NULL if none) + * rdx: Function argument 1 (can be NULL if none) + * rcx: Function argument 2 (can be NULL if none) + * r8 : Function argument 3 (can be NULL if none) */ SYM_FUNC_START(asm_call_on_stack) +SYM_FUNC_START(asm_call_on_stack_1) +SYM_FUNC_START(asm_call_on_stack_2) +SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) /* @@ -777,15 +782,17 @@ SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) */ mov %rsp, (%rdi) mov %rdi, %rsp - /* Move the argument to the right place */ + mov %rsi, %rax + /* Move arguments to the right place */ mov %rdx, %rdi - + mov %rcx, %rsi + mov %r8, %rdx 1: .pushsection .discard.instr_begin .long 1b - . 
.popsection - CALL_NOSPEC rsi + CALL_NOSPEC rax 2: .pushsection .discard.instr_end diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 775816965c6a..359427216336 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -13,6 +13,14 @@ static __always_inline bool irqstack_active(void) } void asm_call_on_stack(void *sp, void (*func)(void), void *arg); + +void asm_call_on_stack_1(void *sp, void (*func)(void), +void *arg1); +void asm_call_on_stack_2(void *sp, void (*func)(void), +void *arg1, void *arg2); +void asm_call_on_stack_3(void *sp, void (*func)(void), +void *arg1, void *arg2, void *arg3); + void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), -- 2.18.4
[RFC][PATCH 07/24] x86/entry: Fill ESPFIX stack using C code
The ESPFIX stack is filled using assembly code. Move this code to a C function so that it is easier to read and modify. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 62 ++--- arch/x86/kernel/espfix_64.c | 41 2 files changed, 72 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 73e9cd47dc83..6e0b5b010e0b 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -684,8 +684,10 @@ native_irq_return_ldt: * long (see ESPFIX_STACK_SIZE). espfix_waddr points to the bottom * of the ESPFIX stack. * -* We clobber RAX and RDI in this code. We stash RDI on the -* normal stack and RAX on the ESPFIX stack. +* We call into C code to fill the ESPFIX stack. We stash registers +* that the C function can clobber on the normal stack. The user RAX +* is stashed first so that it is adjacent to the iret frame which +* will be copied to the ESPFIX stack. * * The ESPFIX stack layout we set up looks like this: * @@ -699,39 +701,37 @@ native_irq_return_ldt: * --- bottom of ESPFIX stack --- */ - pushq %rdi/* Stash user RDI */ - SWAPGS /* to kernel GS */ - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi /* to kernel CR3 */ - - movqPER_CPU_VAR(espfix_waddr), %rdi - movq%rax, (0*8)(%rdi) /* user RAX */ - movq(1*8)(%rsp), %rax /* user RIP */ - movq%rax, (1*8)(%rdi) - movq(2*8)(%rsp), %rax /* user CS */ - movq%rax, (2*8)(%rdi) - movq(3*8)(%rsp), %rax /* user RFLAGS */ - movq%rax, (3*8)(%rdi) - movq(5*8)(%rsp), %rax /* user SS */ - movq%rax, (5*8)(%rdi) - movq(4*8)(%rsp), %rax /* user RSP */ - movq%rax, (4*8)(%rdi) - /* Now RAX == RSP. */ - - andl$0xffff0000, %eax /* RAX = (RSP & 0xffff0000) */ + /* save registers */ + pushq %rax + pushq %rdi + pushq %rsi + pushq %rdx + pushq %rcx + pushq %r8 + pushq %r9 + pushq %r10 + pushq %r11 /* -* espfix_stack[31:16] == 0. The page tables are set up such that -* (espfix_stack | (X & 0xffff0000)) points to a read-only alias of -* espfix_waddr for any X. 
That is, there are 65536 RO aliases of -* the same page. Set up RSP so that RSP[31:16] contains the -* respective 16 bits of the /userspace/ RSP and RSP nonetheless -* still points to an RO alias of the ESPFIX stack. +* fill_espfix_stack will copy the iret+rax frame to the ESPFIX +* stack and return with RAX containing a pointer to the ESPFIX +* stack. */ - orq PER_CPU_VAR(espfix_stack), %rax + leaq8*8(%rsp), %rdi /* points to the iret+rax frame */ + callfill_espfix_stack - SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi - SWAPGS /* to user GS */ - popq%rdi/* Restore user RDI */ + /* +* RAX contains a pointer to the ESPFIX, so restore registers but +* RAX. RAX will be restored from the ESPFIX stack. +*/ + popq%r11 + popq%r10 + popq%r9 + popq%r8 + popq%rcx + popq%rdx + popq%rsi + popq%rdi movq%rax, %rsp UNWIND_HINT_IRET_REGS offset=8 diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index 4fe7af58cfe1..6a81c4bd1542 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -33,6 +33,7 @@ #include #include #include +#include /* * Note: we only need 6*8 = 48 bytes for the espfix stack, but round @@ -205,3 +206,43 @@ void init_espfix_ap(int cpu) per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page + (addr & ~PAGE_MASK); } + +/* + * iret frame with an additional user_rax register. + */ +struct iret_rax_frame { + unsigned long user_rax; + unsigned long rip; + unsigned long cs; + unsigned long rflags; + unsigned long rsp; + unsigned long ss; +}; + +noinstr unsigned long fill_espfix_stack(struct iret_rax_frame *frame) +{ + struct iret_rax_frame *espfix_frame; + unsigned long rsp; + + native_swapgs(); + switch_to_kernel_cr3(); + + espfix_frame = (struct iret_rax_frame *)this_cpu_read(espfix_waddr); + *espfix_frame = *frame; + + /* +* espfix_stack[31:16] == 0. The page tables are set up such that +* (espfix_stack | (X & 0xffff0000)) points to a re
[RFC][PATCH 09/24] x86/entry: Add C version of paranoid_entry/exit
paranoid_entry/exit are assembly macros. Provide C versions of these macros (kernel_paranoid_entry() and kernel_paranoid_exit()). The C functions are functionally equivalent to the assembly macros, except that kernel_paranoid_entry() doesn't save registers in pt_regs like paranoid_entry does. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 157 arch/x86/include/asm/entry-common.h | 10 ++ 2 files changed, 167 insertions(+) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d09b1ded5287..54d0931801e1 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -387,3 +387,160 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) static __always_inline void restore_cr3(unsigned long cr3) {} #endif /* CONFIG_PAGE_TABLE_ISOLATION */ + +/* + * "Paranoid" entry path from exception stack. Ensure that the CR3 and + * GS registers are correctly set for the kernel. Return GSBASE related + * information in kernel_entry_state depending on the availability of + * the FSGSBASE instructions: + * + * FSGSBASEkernel_entry_state + * Nswapgs=true -> SWAPGS on exit + * swapgs=false -> no SWAPGS on exit + * + * Ygsbase=GSBASE value at entry, must be restored in + * kernel_paranoid_exit() + * + * Note that per-cpu variables are accessed using the GS register, + * so paranoid entry code cannot access per-cpu variables before + * kernel_paranoid_entry() has been called. + */ +noinstr void kernel_paranoid_entry(struct kernel_entry_state *state) +{ + unsigned long gsbase; + unsigned int cpu; + + /* +* Save CR3 in the kernel entry state. This value will be +* restored, verbatim, at exit. Needed if the paranoid entry +* interrupted another entry that already switched to the user +* CR3 value but has not yet returned to userspace. +* +* This is also why CS (stashed in the "iret frame" by the +* hardware at entry) can not be used: this may be a return +* to kernel code, but with a user CR3 value. 
+* +* Switching CR3 does not depend on kernel GSBASE so it can +* be done before switching to the kernel GSBASE. This is +* required for FSGSBASE because the kernel GSBASE has to +* be retrieved from a kernel internal table. +*/ + state->cr3 = save_and_switch_to_kernel_cr3(); + + /* +* Handling GSBASE depends on the availability of FSGSBASE. +* +* Without FSGSBASE the kernel enforces that negative GSBASE +* values indicate kernel GSBASE. With FSGSBASE no assumptions +* can be made about the GSBASE value when entering from user +* space. +*/ + if (static_cpu_has(X86_FEATURE_FSGSBASE)) { + /* +* Read the current GSBASE and store it in the kernel +* entry state unconditionally, retrieve and set the +* current CPUs kernel GSBASE. The stored value has to +* be restored at exit unconditionally. +* +* The unconditional write to GS base below ensures that +* no subsequent loads based on a mispredicted GS base +* can happen, therefore no LFENCE is needed here. +*/ + state->gsbase = rdgsbase(); + + /* +* Fetch the per-CPU GSBASE value for this processor. We +* normally use %gs for accessing per-CPU data, but we +* are setting up %gs here and obviously can not use %gs +* itself to access per-CPU data. +*/ + if (IS_ENABLED(CONFIG_SMP)) { + /* +* Load CPU from the GDT. Do not use RDPID, +* because KVM loads guest's TSC_AUX on vm-entry +* and may not restore the host's value until +* the CPU returns to userspace. Thus the kernel +* would consume a guest's TSC_AUX if an NMI +* arrives while running KVM's run loop. +*/ + asm_inline volatile ("lsl %[seg],%[p]" +: [p] "=r" (cpu) +: [seg] "r" (__CPUNODE_SEG)); + + cpu &= VDSO_CPUNODE_MASK; + gsbase = __per_cpu_offset[cpu]; + } else { + gsbase = *pcpu_unit_offsets; + } + + wrgsbase(gsbase); + + } else { + /* +* The kernel-enforced convention is a negative GSBASE +* indicates a kernel value. No SWAPGS needed on entry
[RFC][PATCH 03/24] x86/entry: Consolidate IST entry from userspace
Most IST entries (NMI, MCE, DEBUG, VC but not DF) handle an entry from userspace the same way: they switch from the IST stack to the kernel stack, call the handler and then return to userspace. However, NMI, MCE/DEBUG and VC implement this same behavior using different code paths, so consolidate this code into a single assembly macro. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 137 +- 1 file changed, 75 insertions(+), 62 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index c42948aca0a8..51df9f1871c6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -316,6 +316,72 @@ SYM_CODE_END(ret_from_fork) #endif .endm +/* + * Macro to handle an IDT entry defined with the IST mechanism. It should + * be invoked at the beginning of the IDT handler with a pointer to the C + * function (cfunc_user) to invoke if the IDT was entered from userspace. + * + * If the IDT was entered from userspace, the macro will switch from the + * IST stack to the regular task stack, call the provided function and + * return to userland. + * + * If IDT was entered from the kernel, the macro will just return. + */ +.macro ist_entry_user cfunc_user has_error_code=0 + UNWIND_HINT_IRET_REGS + ASM_CLAC + + /* only process entry from userspace */ + .if \has_error_code == 1 + testb $3, CS-ORIG_RAX(%rsp) + jz .List_entry_from_kernel_\@ + .else + testb $3, CS-RIP(%rsp) + jz .List_entry_from_kernel_\@ + pushq $-1 /* ORIG_RAX: no syscall to restart */ + .endif + + /* Use %rdx as a temp variable */ + pushq %rdx + + /* +* Switch from the IST stack to the regular task stack and +* use the provided entry point. 
+*/ + swapgs + cld + FENCE_SWAPGS_USER_ENTRY + SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx + movq%rsp, %rdx + movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp + UNWIND_HINT_IRET_REGS base=%rdx offset=8 + pushq 6*8(%rdx) /* pt_regs->ss */ + pushq 5*8(%rdx) /* pt_regs->rsp */ + pushq 4*8(%rdx) /* pt_regs->flags */ + pushq 3*8(%rdx) /* pt_regs->cs */ + pushq 2*8(%rdx) /* pt_regs->rip */ + UNWIND_HINT_IRET_REGS + pushq 1*8(%rdx) /* pt_regs->orig_ax */ + PUSH_AND_CLEAR_REGS rdx=(%rdx) + ENCODE_FRAME_POINTER + + /* +* At this point we no longer need to worry about stack damage +* due to nesting -- we're on the normal thread stack and we're +* done with the IST stack. +*/ + + mov %rsp, %rdi + .if \has_error_code == 1 + movqORIG_RAX(%rsp), %rsi/* get error code into 2nd argument*/ + movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ + .endif + call\cfunc_user + jmp swapgs_restore_regs_and_return_to_usermode + +.List_entry_from_kernel_\@: +.endm + /** * idtentry_body - Macro to emit code calling the C function * @cfunc: C function to be called @@ -417,18 +483,15 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_mce_db vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - - pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user noist_\cfunc + /* Entry from kernel */ + + pushq $-1 /* ORIG_RAX: no syscall to restart */ /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ callparanoid_entry @@ -440,10 +503,6 @@ SYM_CODE_START(\asmsym) jmp paranoid_exit - /* Switch to the regular task stack and use the noist entry point */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body noist_\cfunc, has_error_code=0 - _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) .endm @@ -472,15 +531,11 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_vc vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user safe_stack_\cfunc, has_error_code=1 /* * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. @@ -517,10 +572,6 @@ SYM_CODE_START(\asmsym) */ jmp paranoid_exit - /* Switch to the regular task stack */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body safe_stack_\cfunc, has_e
[RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code
[Resending without messing up email addresses (hopefully!), Please reply using this email thread to have correct emails. Sorry for the noise.] With Page Table Isolation (PTI), syscalls as well as interrupts and exceptions occurring in userspace enter the kernel with a user page-table. The kernel entry code will then switch the page-table from the user page-table to the kernel page-table by updating the CR3 control register. This CR3 switch is currently done early in the kernel entry sequence using assembly code. This RFC proposes to defer the PTI CR3 switch until we reach C code. The benefit is that this simplifies the assembly entry code, and makes the PTI CR3 switch code easier to understand. This also paves the way for further possible projects such as an easier integration of Address Space Isolation (ASI), or the possibility to execute some selected syscall or interrupt handlers without switching to the kernel page-table (and thus avoid the PTI page-table switch overhead). Deferring CR3 switch to C code means that we need to run more of the kernel entry code with the user page-table. To do so, we need to: - map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code); - map additional data used in the entry code (such as stack canary); - run more entry code on the trampoline stack (which is mapped both in the kernel and in the user page-table) until we switch to the kernel page-table and then switch to the kernel stack; - have a per-task trampoline stack instead of a per-cpu trampoline stack, so the task can be scheduled out while it hasn't switched to the kernel stack. Note that, for now, the CR3 switch can only be pushed as far as interrupts remain disabled in the entry code. This is because the CR3 switch is done based on the privilege level from the CS register from the interrupt frame. I plan to fix this but that's some extra complication (need to track if the user page-table is used or not). 
The proposed patchset is in RFC state to get early feedback about this proposal. The code survives running a kernel build and LTP. Note that changes are only for 64-bit at the moment; I haven't looked at 32-bit yet but I will definitely check it. Code is based on v5.10-rc3. Thanks, alex. ----- Alexandre Chartre (24): x86/syscall: Add wrapper for invoking syscall function x86/entry: Update asm_call_on_stack to support more function arguments x86/entry: Consolidate IST entry from userspace x86/sev-es: Define a setup stack function for the VC idtentry x86/entry: Implement ret_from_fork body with C code x86/pti: Provide C variants of PTI switch CR3 macros x86/entry: Fill ESPFIX stack using C code x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK x86/entry: Add C version of paranoid_entry/exit x86/pti: Introduce per-task PTI trampoline stack x86/pti: Function to clone page-table entries from a specified mm x86/pti: Function to map per-cpu page-table entry x86/pti: Extend PTI user mappings x86/pti: Use PTI stack instead of trampoline stack x86/pti: Execute syscall functions on the kernel stack x86/pti: Execute IDT handlers on the kernel stack x86/pti: Execute IDT handlers with error code on the kernel stack x86/pti: Execute system vector handlers on the kernel stack x86/pti: Execute page fault handler on the kernel stack x86/pti: Execute NMI handler on the kernel stack x86/entry: Disable stack-protector for IST entry C handlers x86/entry: Defer paranoid entry/exit to C code x86/entry: Remove paranoid_entry and paranoid_exit x86/pti: Defer CR3 switch to C code for non-IST and syscall entries arch/x86/entry/common.c | 259 - arch/x86/entry/entry_64.S | 513 -- arch/x86/entry/entry_64_compat.S | 22 -- arch/x86/include/asm/entry-common.h | 108 ++ arch/x86/include/asm/idtentry.h | 153 +++- arch/x86/include/asm/irq_stack.h | 11 + arch/x86/include/asm/page_64_types.h | 36 +- arch/x86/include/asm/paravirt.h | 15 + arch/x86/include/asm/paravirt_types.h | 17 +- 
arch/x86/include/asm/processor.h | 3 + arch/x86/include/asm/pti.h| 18 + arch/x86/include/asm/switch_to.h | 7 +- arch/x86/include/asm/traps.h | 2 +- arch/x86/kernel/cpu/mce/core.c| 7 +- arch/x86/kernel/espfix_64.c | 41 ++ arch/x86/kernel/nmi.c | 34 +- arch/x86/kernel/sev-es.c | 52 +++ arch/x86/kernel/traps.c | 61 +-- arch/x86/mm/fault.c | 11 +- arch/x86/mm/pti.c | 71 ++-- kernel/fork.c | 22 ++ 21 files changed, 1002 insertions(+), 461 deletions(-) -- 2.18.4
[RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function
Add a wrapper function for invoking a syscall function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 870efeec8bda..d12908ad 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,15 @@ #include #include +static __always_inline void run_syscall(sys_call_ptr_t sysfunc, + struct pt_regs *regs) +{ + if (!sysfunc) + return; + + regs->ax = sysfunc(regs); +} + #ifdef CONFIG_X86_64 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) { @@ -43,15 +52,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) instrumentation_begin(); if (likely(nr < NR_syscalls)) { nr = array_index_nospec(nr, NR_syscalls); - regs->ax = sys_call_table[nr](regs); + run_syscall(sys_call_table[nr], regs); #ifdef CONFIG_X86_X32_ABI } else if (likely((nr & __X32_SYSCALL_BIT) && (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) { nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT, X32_NR_syscalls); - regs->ax = x32_sys_call_table[nr](regs); + run_syscall(x32_sys_call_table[nr], regs); #endif } + instrumentation_end(); syscall_exit_to_user_mode(regs); } @@ -75,7 +85,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs, if (likely(nr < IA32_NR_syscalls)) { instrumentation_begin(); nr = array_index_nospec(nr, IA32_NR_syscalls); - regs->ax = ia32_sys_call_table[nr](regs); + run_syscall(ia32_sys_call_table[nr], regs); instrumentation_end(); } } -- 2.18.4
Re: [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code
Sorry but it looks like email addresses are messed up in my emails. Our email server has a new security "feature" which has the good idea to change external email addresses. I will resend the patches with the correct addresses once I've found how to prevent this mess. alex. On 11/9/20 12:22 PM, Alexandre Chartre wrote: With Page Table Isolation (PTI), syscalls as well as interrupts and exceptions occurring in userspace enter the kernel with a user page-table. The kernel entry code will then switch the page-table from the user page-table to the kernel page-table by updating the CR3 control register. This CR3 switch is currently done early in the kernel entry sequence using assembly code. This RFC proposes to defer the PTI CR3 switch until we reach C code. The benefit is that this simplifies the assembly entry code, and make the PTI CR3 switch code easier to understand. This also paves the way for further possible projects such an easier integration of Address Space Isolation (ASI), or the possibilily to execute some selected syscall or interrupt handlers without switching to the kernel page-table (and thus avoid the PTI page-table switch overhead). Deferring CR3 switch to C code means that we need to run more of the kernel entry code with the user page-table. To do so, we need to: - map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code); - map additional data used in the entry code (such as stack canary); - run more entry code on the trampoline stack (which is mapped both in the kernel and in the user page-table) until we switch to the kernel page-table and then switch to the kernel stack; - have a per-task trampoline stack instead of a per-cpu trampoline stack, so the task can be scheduled out while it hasn't switched to the kernel stack. Note that, for now, the CR3 switch can only be pushed as far as interrupts remain disabled in the entry code. 
This is because the CR3 switch is done based on the privilege level from the CS register from the interrupt frame. I plan to fix this but that's some extra complication (need to track if the user page-table is used or not). The proposed patchset is in RFC state to get early feedback about this proposal. The code survives running a kernel build and LTP. Note that changes are only for 64-bit at the moment, I haven't looked at 32-bit yet but I will definitively check it. Code is based on v5.10-rc3. Thanks, alex. - Alexandre Chartre (24): x86/syscall: Add wrapper for invoking syscall function x86/entry: Update asm_call_on_stack to support more function arguments x86/entry: Consolidate IST entry from userspace x86/sev-es: Define a setup stack function for the VC idtentry x86/entry: Implement ret_from_fork body with C code x86/pti: Provide C variants of PTI switch CR3 macros x86/entry: Fill ESPFIX stack using C code x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK x86/entry: Add C version of paranoid_entry/exit x86/pti: Introduce per-task PTI trampoline stack x86/pti: Function to clone page-table entries from a specified mm x86/pti: Function to map per-cpu page-table entry x86/pti: Extend PTI user mappings x86/pti: Use PTI stack instead of trampoline stack x86/pti: Execute syscall functions on the kernel stack x86/pti: Execute IDT handlers on the kernel stack x86/pti: Execute IDT handlers with error code on the kernel stack x86/pti: Execute system vector handlers on the kernel stack x86/pti: Execute page fault handler on the kernel stack x86/pti: Execute NMI handler on the kernel stack x86/entry: Disable stack-protector for IST entry C handlers x86/entry: Defer paranoid entry/exit to C code x86/entry: Remove paranoid_entry and paranoid_exit x86/pti: Defer CR3 switch to C code for non-IST and syscall entries arch/x86/entry/common.c | 259 - arch/x86/entry/entry_64.S | 513 -- arch/x86/entry/entry_64_compat.S | 22 -- arch/x86/include/asm/entry-common.h | 108 ++ 
arch/x86/include/asm/idtentry.h | 153 +++- arch/x86/include/asm/irq_stack.h | 11 + arch/x86/include/asm/page_64_types.h | 36 +- arch/x86/include/asm/paravirt.h | 15 + arch/x86/include/asm/paravirt_types.h | 17 +- arch/x86/include/asm/processor.h | 3 + arch/x86/include/asm/pti.h| 18 + arch/x86/include/asm/switch_to.h | 7 +- arch/x86/include/asm/traps.h | 2 +- arch/x86/kernel/cpu/mce/core.c| 7 +- arch/x86/kernel/espfix_64.c | 41 ++ arch/x86/kernel/nmi.c | 34 +- arch/x86/kernel/sev-es.c | 52 +++ arch/x86/kernel/traps.c | 61 +-- arch/x86/mm/fault.c | 11 +- arch/x86/mm/pti.c | 71 ++-- kernel/fork.c |
[RFC][PATCH 15/24] x86/pti: Execute syscall functions on the kernel stack
During a syscall, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the syscall function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 11 ++- arch/x86/entry/entry_64.S| 1 + arch/x86/include/asm/irq_stack.h | 3 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 54d0931801e1..ead6a4c72e6a 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { + unsigned long stack; + if (!sysfunc) return; - regs->ax = sysfunc(regs); + if (!pti_enabled()) { + regs->ax = sysfunc(regs); + return; + } + + stack = (unsigned long)task_top_of_kernel_stack(current); + regs->ax = asm_call_syscall_on_stack((void *)(stack - 8), +sysfunc, regs); } #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 29beab46bedd..6b88a0eb8975 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2) SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) +SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL) /* * Save the frame pointer unconditionally. This allows the ORC * unwinder to handle the stack switch. 
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 359427216336..108d9da7c01c 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -5,6 +5,7 @@ #include #include +#include #ifdef CONFIG_X86_64 static __always_inline bool irqstack_active(void) @@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), struct irq_desc *desc); +long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func, + struct pt_regs *regs); static __always_inline void __run_on_irqstack(void (*func)(void)) { -- 2.18.4
[RFC][PATCH 12/24] x86/pti: Function to map per-cpu page-table entry
Wrap the code used by PTI to map a per-cpu page-table entry into a new function so that this code can be re-used to map other per-cpu entries. Signed-off-by: Alexandre Chartre --- arch/x86/mm/pti.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index ebc8cd2f1cd8..71ca245d7b38 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr) *user_p4d = *kernel_p4d; } +/* + * Clone a single percpu page. + */ +static void __init pti_clone_percpu_page(void *addr) +{ + phys_addr_t pa = per_cpu_ptr_to_phys(addr); + pte_t *target_pte; + + target_pte = pti_user_pagetable_walk_pte((unsigned long)addr); + if (WARN_ON(!target_pte)) + return; + + *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); +} + /* * Clone the CPU_ENTRY_AREA and associated data into the user space visible * page table. @@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void) * This is done for all possible CPUs during boot to ensure * that it's propagated to all mms. */ + pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); - unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu); - phys_addr_t pa = per_cpu_ptr_to_phys((void *)va); - pte_t *target_pte; - - target_pte = pti_user_pagetable_walk_pte(va); - if (WARN_ON(!target_pte)) - return; - - *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); } } -- 2.18.4
[RFC][PATCH 10/24] x86/pti: Introduce per-task PTI trampoline stack
Double the size of the kernel stack when using PTI. The entire stack is mapped into the kernel address space, and the top half of the stack (the PTI stack) is also mapped into the user address space. The PTI stack will be used as a per-task trampoline stack instead of the current per-cpu trampoline stack. This will allow running more code on the trampoline stack, in particular code that schedules the task out. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/page_64_types.h | 36 +++- arch/x86/include/asm/processor.h | 3 +++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 3f49dac03617..733accc20fdb 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -12,7 +12,41 @@ #define KASAN_STACK_ORDER 0 #endif -#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) +#ifdef CONFIG_PAGE_TABLE_ISOLATION +/* + * PTI doubles the size of the stack. The entire stack is mapped into + * the kernel address space. However, only the top half of the stack is + * mapped into the user address space. + * + * On syscall or interrupt, user mode enters the kernel with the user + * page-table, and the stack pointer is switched to the top of the + * stack (which is mapped in the user address space and in the kernel). + * The syscall/interrupt handler will then later decide when to switch + * to the kernel address space, and to switch to the top of the kernel + * stack which is only mapped in the kernel. 
+ * + * +-+ + * | | ^ ^ + * | kernel-only | | KERNEL_STACK_SIZE | + * |stack| | | + * | | V | + * +-+ <- top of kernel stack | THREAD_SIZE + * | | ^ | + * | kernel and | | KERNEL_STACK_SIZE | + * | PTI stack | | | + * | | V v + * +-+ <- top of stack + */ +#define PTI_STACK_ORDER 1 +#else +#define PTI_STACK_ORDER 0 +#endif + +#define KERNEL_STACK_ORDER 2 +#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER) + +#define THREAD_SIZE_ORDER \ + (KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER) #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) #define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 82a08b585818..47b1b806535b 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x) #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1)) +#define task_top_of_kernel_stack(task) \ + ((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE)) + #define task_pt_regs(task) \ ({ \ unsigned long __ptr = (unsigned long)task_stack_page(task); \ -- 2.18.4
[RFC][PATCH 11/24] x86/pti: Function to clone page-table entries from a specified mm
PTI has a function to clone page-table entries but only from the init_mm page-table. Provide a new function to clone page-table entries from a specified mm page-table. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/pti.h | 10 ++ arch/x86/mm/pti.c | 32 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 07375b476c4f..5484e69ff8d3 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -4,9 +4,19 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_PAGE_TABLE_ISOLATION + +enum pti_clone_level { + PTI_CLONE_PMD, + PTI_CLONE_PTE, +}; + +struct mm_struct; + extern void pti_init(void); extern void pti_check_boottime_disable(void); extern void pti_finalize(void); +extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level); #else static inline void pti_check_boottime_disable(void) { } #endif diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 1aab92930569..ebc8cd2f1cd8 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void) static void __init pti_setup_vsyscall(void) { } #endif -enum pti_clone_level { - PTI_CLONE_PMD, - PTI_CLONE_PTE, -}; - -static void -pti_clone_pgtable(unsigned long start, unsigned long end, - enum pti_clone_level level) +void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level) { unsigned long addr; @@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end, if (addr < start) break; - pgd = pgd_offset_k(addr); + pgd = pgd_offset(mm, addr); if (WARN_ON(pgd_none(*pgd))) return; p4d = p4d_offset(pgd, addr); @@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end, } } +static void pti_clone_init_pgtable(unsigned long start, unsigned long end, + enum pti_clone_level level) +{ + pti_clone_pgtable(&init_mm, start, end, level); +} + #ifdef 
CONFIG_X86_64 /* * Clone a single p4d (i.e. a top-level entry on 4-level systems and a @@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void) start = CPU_ENTRY_AREA_BASE; end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES); - pti_clone_pgtable(start, end, PTI_CLONE_PMD); + pti_clone_init_pgtable(start, end, PTI_CLONE_PMD); } #endif /* CONFIG_X86_64 */ @@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void) */ static void pti_clone_entry_text(void) { - pti_clone_pgtable((unsigned long) __entry_text_start, - (unsigned long) __entry_text_end, - PTI_CLONE_PMD); + pti_clone_init_pgtable((unsigned long) __entry_text_start, + (unsigned long) __entry_text_end, + PTI_CLONE_PMD); } /* @@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void) * pti_set_kernel_image_nonglobal() did to clear the * global bit. */ - pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); + pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); /* -* pti_clone_pgtable() will set the global bit in any PMDs -* that it clones, but we also need to get any PTEs in +* pti_clone_init_pgtable() will set the global bit in any +* PMDs that it clones, but we also need to get any PTEs in * the last level for areas that are not huge-page-aligned. */ -- 2.18.4
[RFC][PATCH 03/24] x86/entry: Consolidate IST entry from userspace
Most IST entries (NMI, MCE, DEBUG, VC but not DF) handle an entry from userspace the same way: they switch from the IST stack to the kernel stack, call the handler and then return to userspace. However, NMI, MCE/DEBUG and VC implement this same behavior using different code paths, so consolidate this code into a single assembly macro. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 137 +- 1 file changed, 75 insertions(+), 62 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index c42948aca0a8..51df9f1871c6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -316,6 +316,72 @@ SYM_CODE_END(ret_from_fork) #endif .endm +/* + * Macro to handle an IDT entry defined with the IST mechanism. It should + * be invoked at the beginning of the IDT handler with a pointer to the C + * function (cfunc_user) to invoke if the IDT was entered from userspace. + * + * If the IDT was entered from userspace, the macro will switch from the + * IST stack to the regular task stack, call the provided function and + * return to userland. + * + * If IDT was entered from the kernel, the macro will just return. + */ +.macro ist_entry_user cfunc_user has_error_code=0 + UNWIND_HINT_IRET_REGS + ASM_CLAC + + /* only process entry from userspace */ + .if \has_error_code == 1 + testb $3, CS-ORIG_RAX(%rsp) + jz .List_entry_from_kernel_\@ + .else + testb $3, CS-RIP(%rsp) + jz .List_entry_from_kernel_\@ + pushq $-1 /* ORIG_RAX: no syscall to restart */ + .endif + + /* Use %rdx as a temp variable */ + pushq %rdx + + /* +* Switch from the IST stack to the regular task stack and +* use the provided entry point. 
+*/ + swapgs + cld + FENCE_SWAPGS_USER_ENTRY + SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx + movq%rsp, %rdx + movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp + UNWIND_HINT_IRET_REGS base=%rdx offset=8 + pushq 6*8(%rdx) /* pt_regs->ss */ + pushq 5*8(%rdx) /* pt_regs->rsp */ + pushq 4*8(%rdx) /* pt_regs->flags */ + pushq 3*8(%rdx) /* pt_regs->cs */ + pushq 2*8(%rdx) /* pt_regs->rip */ + UNWIND_HINT_IRET_REGS + pushq 1*8(%rdx) /* pt_regs->orig_ax */ + PUSH_AND_CLEAR_REGS rdx=(%rdx) + ENCODE_FRAME_POINTER + + /* +* At this point we no longer need to worry about stack damage +* due to nesting -- we're on the normal thread stack and we're +* done with the IST stack. +*/ + + mov %rsp, %rdi + .if \has_error_code == 1 + movqORIG_RAX(%rsp), %rsi/* get error code into 2nd argument*/ + movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ + .endif + call\cfunc_user + jmp swapgs_restore_regs_and_return_to_usermode + +.List_entry_from_kernel_\@: +.endm + /** * idtentry_body - Macro to emit code calling the C function * @cfunc: C function to be called @@ -417,18 +483,15 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_mce_db vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - - pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user noist_\cfunc + /* Entry from kernel */ + + pushq $-1 /* ORIG_RAX: no syscall to restart */ /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ callparanoid_entry @@ -440,10 +503,6 @@ SYM_CODE_START(\asmsym) jmp paranoid_exit - /* Switch to the regular task stack and use the noist entry point */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body noist_\cfunc, has_error_code=0 - _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) .endm @@ -472,15 +531,11 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_vc vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user safe_stack_\cfunc, has_error_code=1 /* * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. @@ -517,10 +572,6 @@ SYM_CODE_START(\asmsym) */ jmp paranoid_exit - /* Switch to the regular task stack */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body safe_stack_\cfunc, has_e
[RFC][PATCH 05/24] x86/entry: Implement ret_from_fork body with C code
ret_from_fork is a mix of assembly code and calls to C functions. Re-implement ret_from_fork so that it calls a single C function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 18 ++ arch/x86/entry/entry_64.S | 28 +--- 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d12908ad..7ee15a12c115 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,24 @@ #include #include +__visible noinstr void return_from_fork(struct pt_regs *regs, + struct task_struct *prev, + void (*kfunc)(void *), void *kargs) +{ + schedule_tail(prev); + if (kfunc) { + /* kernel thread */ + kfunc(kargs); + /* +* A kernel thread is allowed to return here after +* successfully calling kernel_execve(). Exit to +* userspace to complete the execve() syscall. +*/ + regs->ax = 0; + } + syscall_exit_to_user_mode(regs); +} + static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 274384644b5e..73e9cd47dc83 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -276,31 +276,13 @@ SYM_FUNC_END(__switch_to_asm) */ .pushsection .text, "ax" SYM_CODE_START(ret_from_fork) - UNWIND_HINT_EMPTY - movq%rax, %rdi - callschedule_tail /* rdi: 'prev' task parameter */ - - testq %rbx, %rbx /* from kernel_thread? 
*/ - jnz 1f /* kernel threads are uncommon */ - -2: UNWIND_HINT_REGS - movq%rsp, %rdi - callsyscall_exit_to_user_mode /* returns with IRQs disabled */ + movq%rsp, %rdi /* pt_regs */ + movq%rax, %rsi /* 'prev' task parameter */ + movq%rbx, %rdx /* kernel thread func */ + movq%r12, %rcx /* kernel thread arg */ + callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode - -1: - /* kernel thread */ - UNWIND_HINT_EMPTY - movq%r12, %rdi - CALL_NOSPEC rbx - /* -* A kernel thread is allowed to return here after successfully -* calling kernel_execve(). Exit to userspace to complete the execve() -* syscall. -*/ - movq$0, RAX(%rsp) - jmp 2b SYM_CODE_END(ret_from_fork) .popsection -- 2.18.4
[RFC][PATCH 09/24] x86/entry: Add C version of paranoid_entry/exit
paranoid_entry/exit are assembly macros. Provide C versions of these macros (kernel_paranoid_entry() and kernel_paranoid_exit()). The C functions are functionally equivalent to the assembly macros, except that kernel_paranoid_entry() doesn't save registers in pt_regs like paranoid_entry does. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 157 arch/x86/include/asm/entry-common.h | 10 ++ 2 files changed, 167 insertions(+) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d09b1ded5287..54d0931801e1 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -387,3 +387,160 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) static __always_inline void restore_cr3(unsigned long cr3) {} #endif /* CONFIG_PAGE_TABLE_ISOLATION */ + +/* + * "Paranoid" entry path from exception stack. Ensure that the CR3 and + * GS registers are correctly set for the kernel. Return GSBASE related + * information in kernel_entry_state depending on the availability of + * the FSGSBASE instructions: + * + * FSGSBASEkernel_entry_state + * Nswapgs=true -> SWAPGS on exit + * swapgs=false -> no SWAPGS on exit + * + * Ygsbase=GSBASE value at entry, must be restored in + * kernel_paranoid_exit() + * + * Note that per-cpu variables are accessed using the GS register, + * so paranoid entry code cannot access per-cpu variables before + * kernel_paranoid_entry() has been called. + */ +noinstr void kernel_paranoid_entry(struct kernel_entry_state *state) +{ + unsigned long gsbase; + unsigned int cpu; + + /* +* Save CR3 in the kernel entry state. This value will be +* restored, verbatim, at exit. Needed if the paranoid entry +* interrupted another entry that already switched to the user +* CR3 value but has not yet returned to userspace. +* +* This is also why CS (stashed in the "iret frame" by the +* hardware at entry) can not be used: this may be a return +* to kernel code, but with a user CR3 value. 
+* +* Switching CR3 does not depend on kernel GSBASE so it can +* be done before switching to the kernel GSBASE. This is +* required for FSGSBASE because the kernel GSBASE has to +* be retrieved from a kernel internal table. +*/ + state->cr3 = save_and_switch_to_kernel_cr3(); + + /* +* Handling GSBASE depends on the availability of FSGSBASE. +* +* Without FSGSBASE the kernel enforces that negative GSBASE +* values indicate kernel GSBASE. With FSGSBASE no assumptions +* can be made about the GSBASE value when entering from user +* space. +*/ + if (static_cpu_has(X86_FEATURE_FSGSBASE)) { + /* +* Read the current GSBASE and store it in the kernel +* entry state unconditionally, retrieve and set the +* current CPUs kernel GSBASE. The stored value has to +* be restored at exit unconditionally. +* +* The unconditional write to GS base below ensures that +* no subsequent loads based on a mispredicted GS base +* can happen, therefore no LFENCE is needed here. +*/ + state->gsbase = rdgsbase(); + + /* +* Fetch the per-CPU GSBASE value for this processor. We +* normally use %gs for accessing per-CPU data, but we +* are setting up %gs here and obviously can not use %gs +* itself to access per-CPU data. +*/ + if (IS_ENABLED(CONFIG_SMP)) { + /* +* Load CPU from the GDT. Do not use RDPID, +* because KVM loads guest's TSC_AUX on vm-entry +* and may not restore the host's value until +* the CPU returns to userspace. Thus the kernel +* would consume a guest's TSC_AUX if an NMI +* arrives while running KVM's run loop. +*/ + asm_inline volatile ("lsl %[seg],%[p]" +: [p] "=r" (cpu) +: [seg] "r" (__CPUNODE_SEG)); + + cpu &= VDSO_CPUNODE_MASK; + gsbase = __per_cpu_offset[cpu]; + } else { + gsbase = *pcpu_unit_offsets; + } + + wrgsbase(gsbase); + + } else { + /* +* The kernel-enforced convention is a negative GSBASE +* indicates a kernel value. No SWAPGS needed on entry
[RFC][PATCH 04/24] x86/sev-es: Define a setup stack function for the VC idtentry
The #VC exception assembly entry code uses C code (vc_switch_off_ist) to get and configure a stack, then returns to assembly to switch to that stack and finally invokes the C exception handler function. To pave the way for deferring the CR3 switch from assembly to C code, define a setup stack function for the VC idtentry. This function is used to get and configure the stack before invoking the idtentry handler. For now, the setup stack function is just a wrapper around the vc_switch_off_ist() function but it will eventually also contain the C code to switch CR3. The vc_switch_off_ist() function is also refactored to just return the stack pointer, and the stack configuration is done in the setup stack function (so that the stack can also be used to propagate CR3 switch information to the idtentry handler for switching CR3 back). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 8 +++- arch/x86/include/asm/idtentry.h | 14 ++ arch/x86/include/asm/traps.h| 2 +- arch/x86/kernel/sev-es.c| 34 + arch/x86/kernel/traps.c | 19 +++--- 5 files changed, 55 insertions(+), 22 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 51df9f1871c6..274384644b5e 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -546,13 +546,11 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_REGS /* -* Switch off the IST stack to make it free for nested exceptions. The -* vc_switch_off_ist() function will switch back to the interrupted -* stack if it is safe to do so. If not it switches to the VC fall-back -* stack. +* Call the setup stack function. It configures and returns +* the stack we should be using to run the exception handler. 
*/ movq%rsp, %rdi /* pt_regs pointer */ - callvc_switch_off_ist + callsetup_stack_\cfunc movq%rax, %rsp /* Switch to new stack */ UNWIND_HINT_REGS diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index b2442eb0ac2f..4b4aca2b1420 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -318,6 +318,7 @@ static __always_inline void __##func(struct pt_regs *regs) */ #define DECLARE_IDTENTRY_VC(vector, func) \ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func); \ + __visible noinstr unsigned long setup_stack_##func(struct pt_regs *regs); \ __visible noinstr void ist_##func(struct pt_regs *regs, unsigned long error_code); \ __visible noinstr void safe_stack_##func(struct pt_regs *regs, unsigned long error_code) @@ -380,6 +381,19 @@ static __always_inline void __##func(struct pt_regs *regs) #define DEFINE_IDTENTRY_VC_IST(func) \ DEFINE_IDTENTRY_RAW_ERRORCODE(ist_##func) +/** + * DEFINE_IDTENTRY_VC_SETUP_STACK - Emit code for setting up the stack to + run the VMM communication handler + * @func: Function name of the entry point + * + * The stack setup code is executed before the VMM communication handler. + * It configures and returns the stack to switch to before running the + * VMM communication handler. 
+ */ +#define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ + __visible noinstr \ + unsigned long setup_stack_##func(struct pt_regs *regs) + /** * DEFINE_IDTENTRY_VC - Emit code for VMM communication handler * @func: Function name of the entry point diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 7f7200021bd1..cfcc9d34d2a0 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -15,7 +15,7 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); asmlinkage __visible notrace struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s); void __init trap_init(void); -asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs); +asmlinkage __visible noinstr unsigned long vc_switch_off_ist(struct pt_regs *eregs); #endif #ifdef CONFIG_X86_F00F_BUG diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 0bd1a0fc587e..bd977c917cd6 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -1349,6 +1349,40 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication) instrumentation_end(); } +struct exc_vc_frame { + /* pt_regs should be first */ + struct pt_regs regs; +}; + +DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication) +{ + struct exc_vc_frame *frame; + unsigned long sp; + + /* +* Switch off the IST stack to make it free for nested exceptions. +* The vc_switch_off_ist() function will switch back to the +* interrupted stack if
[RFC][PATCH 22/24] x86/entry: Defer paranoid entry/exit to C code
IST entries from the kernel use paranoid entry and exit assembly functions to ensure the CR3 and GS registers are updated with correct values for the kernel. Move the update of the CR3 and GS registers inside the C code of IST handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 72 ++ arch/x86/kernel/cpu/mce/core.c | 3 ++ arch/x86/kernel/nmi.c | 18 +++-- arch/x86/kernel/sev-es.c | 20 +- arch/x86/kernel/traps.c| 30 -- 5 files changed, 83 insertions(+), 60 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6b88a0eb8975..9ea8187d4405 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -462,16 +462,16 @@ SYM_CODE_START(\asmsym) /* Entry from kernel */ pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* paranoid_entry returns GS information for paranoid_exit in EBX. */ - callparanoid_entry - + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs pointer */ call\cfunc - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -507,12 +507,9 @@ SYM_CODE_START(\asmsym) */ ist_entry_user safe_stack_\cfunc, has_error_code=1 - /* -* paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. -* EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS -*/ - callparanoid_entry - + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS /* @@ -538,7 +535,7 @@ SYM_CODE_START(\asmsym) * identical to the stack in the IRET frame or the VC fall-back stack, * so it is definitly mapped even with PTI enabled. */ - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -555,8 +552,9 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_IRET_REGS offset=8 ASM_CLAC - /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ - callparanoid_entry + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs pointer into first argument */ @@ -564,7 +562,7 @@ SYM_CODE_START(\asmsym) movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ call\cfunc - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -1119,10 +1117,6 @@ SYM_CODE_END(error_return) /* * Runs on exception stack. Xen PV does not go through this path at all, * so we can use real assembly here. - * - * Registers: - * %r14: Used to save/restore the CR3 of the interrupted context - * when PAGE_TABLE_ISOLATION is in use. Do not clobber. */ SYM_CODE_START(asm_exc_nmi) /* @@ -1173,7 +1167,7 @@ SYM_CODE_START(asm_exc_nmi) * We also must not push anything to the stack before switching * stacks lest we corrupt the "NMI executing" variable. */ - ist_entry_user exc_nmi + ist_entry_user exc_nmi_user /* NMI from kernel */ @@ -1346,9 +1340,7 @@ repeat_nmi: * * RSP is pointing to "outermost RIP". gsbase is unknown, but, if * we're repeating an NMI, gsbase has the same value that it had on -* the first iteration. paranoid_entry will load the kernel -* gsbase if needed before we call exc_nmi(). "NMI executing" -* is zero. +* the first iteration. "NMI executing" is zero. */ movq$1, 10*8(%rsp) /* Set "NMI executing". */ @@ -1372,44 +1364,20 @@ end_repeat_nmi: pushq $-1 /* ORIG_RAX: no syscall to restart */ /* -* Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit -* as we should not be calling schedule in NMI context. -* Even with normal interrupts enabled. An NMI should not be -* setting NEED_RESCHED or anything that normal interrupts and +* We should not be calling schedule in NMI context. Even with +* normal interrupts enabled. An NMI should not be setting +* NEED_RESCHED or anything that normal interrupts and * exceptions might do. 
*/ - callparanoid_entry + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi movq$-1, %rsi callexc_nmi - /* Always restore stashed CR3 value (see paranoid_entry) */ - RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 - - /* -* The above invocation of pa
[RFC][PATCH 24/24] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
With PTI, syscall/interrupt/exception entries switch the CR3 register to change the page-table in assembly code. Move the CR3 register switch inside the C code of syscall/interrupt/exception entry handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 15 --- arch/x86/entry/entry_64.S | 23 +-- arch/x86/entry/entry_64_compat.S| 22 -- arch/x86/include/asm/entry-common.h | 14 ++ arch/x86/include/asm/idtentry.h | 25 - arch/x86/kernel/cpu/mce/core.c | 2 ++ arch/x86/kernel/nmi.c | 2 ++ arch/x86/kernel/traps.c | 6 ++ arch/x86/mm/fault.c | 9 +++-- 9 files changed, 68 insertions(+), 50 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index ead6a4c72e6a..3f4788dbbde7 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, regs->ax = 0; } syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } static __always_inline void run_syscall(sys_call_ptr_t sysfunc, @@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc, #ifdef CONFIG_X86_64 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) { + switch_to_kernel_cr3(); nr = syscall_enter_from_user_mode(regs, nr); instrumentation_begin(); @@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) instrumentation_end(); syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } #endif #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION) static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs) { + switch_to_kernel_cr3(); if (IS_ENABLED(CONFIG_IA32_EMULATION)) current_thread_info()->status |= TS_COMPAT; @@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs) do_syscall_32_irqs_on(regs, nr); syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } -static noinstr bool __do_fast_syscall_32(struct pt_regs *regs) +static noinstr bool __do_fast_syscall_32(struct 
pt_regs *regs, long nr) { - unsigned int nr = syscall_32_enter(regs); int res; /* @@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs) /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs) { + unsigned int nr = syscall_32_enter(regs); + bool syscall_done; + /* * Called using the internal vDSO SYSENTER/SYSCALL32 calling * convention. Adjust regs so it looks like we entered using int80. @@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs) regs->ip = landing_pad; /* Invoke the syscall. If it failed, keep it simple: use IRET. */ - if (!__do_fast_syscall_32(regs)) + syscall_done = __do_fast_syscall_32(regs, nr); + switch_to_user_cr3(); + if (!syscall_done) return 0; #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 797effbe65b6..4be15a5ffe68 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64) swapgs /* tss.sp2 is scratch space. */ movq%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2) - SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL) @@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL) */ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ - POP_REGS pop_rdi=0 skip_r11rcx=1 + POP_REGS skip_r11rcx=1 /* -* We are on the trampoline stack. All regs except RDI are live. * We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. 
*/ STACKLEAK_ERASE_NOCLOBBER - SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi - - popq%rdi movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork) swapgs cld FENCE_SWAPGS_USER_ENTRY - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx movq%rsp, %rdx movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp UNWIND_HINT_IRET_REGS base=%rdx offset=8 @@ -592,19 +586,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) ud2 1: #endif - POP_REGS pop_rdi=0 + POP_REGS + addq
[RFC][PATCH 18/24] x86/pti: Execute system vector handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Change system vector handlers to execute on the kernel stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a82e31b45442..0c5d9f027112 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) +{ + void *stack = pti_kernel_stack(regs); + + if (stack) + asm_call_on_stack_1(stack, (void (*)(void))func, regs); + else + run_sysvec_on_irqstack_cond(func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - run_sysvec_on_irqstack_cond(__##func, regs);\ + run_sysvec(__##func, regs); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ -- 2.18.4
[RFC][PATCH 23/24] x86/entry: Remove paranoid_entry and paranoid_exit
The paranoid_entry and paranoid_exit assembly functions have been replaced by the kernel_paranoid_entry() and kernel_paranoid_exit() C functions. Now paranoid_entry/exit are not used anymore and can be removed. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 131 -- 1 file changed, 131 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 9ea8187d4405..797effbe65b6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -882,137 +882,6 @@ SYM_CODE_START(xen_failsafe_callback) SYM_CODE_END(xen_failsafe_callback) #endif /* CONFIG_XEN_PV */ -/* - * Save all registers in pt_regs. Return GSBASE related information - * in EBX depending on the availability of the FSGSBASE instructions: - * - * FSGSBASER/EBX - * N0 -> SWAPGS on exit - * 1 -> no SWAPGS on exit - * - * YGSBASE value at entry, must be restored in paranoid_exit - */ -SYM_CODE_START_LOCAL(paranoid_entry) - UNWIND_HINT_FUNC - cld - PUSH_AND_CLEAR_REGS save_ret=1 - ENCODE_FRAME_POINTER 8 - - /* -* Always stash CR3 in %r14. This value will be restored, -* verbatim, at exit. Needed if paranoid_entry interrupted -* another entry that already switched to the user CR3 value -* but has not yet returned to userspace. -* -* This is also why CS (stashed in the "iret frame" by the -* hardware at entry) can not be used: this may be a return -* to kernel code, but with a user CR3 value. -* -* Switching CR3 does not depend on kernel GSBASE so it can -* be done before switching to the kernel GSBASE. This is -* required for FSGSBASE because the kernel GSBASE has to -* be retrieved from a kernel internal table. -*/ - SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14 - - /* -* Handling GSBASE depends on the availability of FSGSBASE. -* -* Without FSGSBASE the kernel enforces that negative GSBASE -* values indicate kernel GSBASE. With FSGSBASE no assumptions -* can be made about the GSBASE value when entering from user -* space. 
-*/ - ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE - - /* -* Read the current GSBASE and store it in %rbx unconditionally, -* retrieve and set the current CPUs kernel GSBASE. The stored value -* has to be restored in paranoid_exit unconditionally. -* -* The unconditional write to GS base below ensures that no subsequent -* loads based on a mispredicted GS base can happen, therefore no LFENCE -* is needed here. -*/ - SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx - ret - -.Lparanoid_entry_checkgs: - /* EBX = 1 -> kernel GSBASE active, no restore required */ - movl$1, %ebx - /* -* The kernel-enforced convention is a negative GSBASE indicates -* a kernel value. No SWAPGS needed on entry and exit. -*/ - movl$MSR_GS_BASE, %ecx - rdmsr - testl %edx, %edx - jns .Lparanoid_entry_swapgs - ret - -.Lparanoid_entry_swapgs: - SWAPGS - - /* -* The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an -* unconditional CR3 write, even in the PTI case. So do an lfence -* to prevent GS speculation, regardless of whether PTI is enabled. -*/ - FENCE_SWAPGS_KERNEL_ENTRY - - /* EBX = 0 -> SWAPGS required on exit */ - xorl%ebx, %ebx - ret -SYM_CODE_END(paranoid_entry) - -/* - * "Paranoid" exit path from exception stack. This is invoked - * only on return from non-NMI IST interrupts that came - * from kernel space. - * - * We may be returning to very strange contexts (e.g. very early - * in syscall entry), so checking for preemption here would - * be complicated. Fortunately, there's no good reason to try - * to handle preemption here. - * - * R/EBX contains the GSBASE related information depending on the - * availability of the FSGSBASE instructions: - * - * FSGSBASER/EBX - * N0 -> SWAPGS on exit - * 1 -> no SWAPGS on exit - * - * YUser space GSBASE, must be restored unconditionally - */ -SYM_CODE_START_LOCAL(paranoid_exit) - UNWIND_HINT_REGS - /* -* The order of operations is important. RESTORE_CR3 requires -* kernel GSBASE. 
-* -* NB to anyone to try to optimize this code: this code does -* not execute at all for exceptions from user mode. Those -* exceptions go through error_exit instead. -*/ - RESTORE_CR3 scratch_reg=%rax save_reg=%r14 - - /* Handle the three GSBASE cases */ - ALTERNATIVE "jmp .Lparanoid_exit_checkgs"
[RFC][PATCH 20/24] x86/pti: Execute NMI handler on the kernel stack
After an NMI from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the NMI handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/kernel/nmi.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 4bc77aaf1303..be0f654c3095 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi) inc_irq_stat(__nmi_count); - if (!ignore_nmis) - default_do_nmi(regs); + if (!ignore_nmis) { + if (user_mode(regs)) { + /* +* If we come from userland then we are on the +* trampoline stack, switch to the kernel stack +* to execute the NMI handler. +*/ + run_idt(default_do_nmi, regs); + } else { + default_do_nmi(regs); + } + } idtentry_exit_nmi(regs, irq_state); -- 2.18.4
[RFC][PATCH 14/24] x86/pti: Use PTI stack instead of trampoline stack
When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Additional changes will be made later to switch to the kernel stack (which is only mapped in the kernel page-table). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 42 +--- arch/x86/include/asm/pti.h | 8 ++ arch/x86/include/asm/switch_to.h | 7 +- 3 files changed, 26 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 458af12ed9a1..29beab46bedd 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -194,19 +194,9 @@ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ POP_REGS pop_rdi=0 skip_r11rcx=1 - /* -* Now all regs are restored except RSP and RDI. -* Save old stack pointer and switch to trampoline stack. -*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - pushq RSP-RDI(%rdi) /* RSP */ - pushq (%rdi) /* RDI */ - /* * We are on the trampoline stack. All regs except RDI are live. +* We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. */ STACKLEAK_ERASE_NOCLOBBER @@ -214,7 +204,7 @@ syscall_return_via_sysret: SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi popq%rdi - popq%rsp + movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) #endif POP_REGS pop_rdi=0 - /* -* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS. -* Save old stack pointer and switch to trampoline stack.
-*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - /* Copy the IRET frame to the trampoline stack. */ - pushq 6*8(%rdi) /* SS */ - pushq 5*8(%rdi) /* RSP */ - pushq 4*8(%rdi) /* EFLAGS */ - pushq 3*8(%rdi) /* CS */ - pushq 2*8(%rdi) /* RIP */ - - /* Push user RDI on the trampoline stack. */ - pushq (%rdi) - /* * We are on the trampoline stack. All regs except RDI are live. * We can do future final exit work right here. @@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) /* Restore RDI. */ popq%rdi + addq$8, %rsp/* skip regs->orig_ax */ SWAPGS INTERRUPT_RETURN @@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry) SWITCH_TO_KERNEL_CR3 scratch_reg=%rax .Lerror_entry_from_usermode_after_swapgs: + /* +* We are on the trampoline stack. With PTI, the trampoline +* stack is a per-thread stack so we are all set and we can +* return. +* +* Without PTI, the trampoline stack is a per-cpu stack and +* we need to switch to the normal thread stack. +*/ + ALTERNATIVE "", "ret", X86_FEATURE_PTI /* Put us onto the real thread stack. 
*/ popq%r12/* save return addr in %12 */ movq%rsp, %rdi /* arg0 = pt_regs pointer */ diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 5484e69ff8d3..ed211fcc3a50 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void); extern void pti_finalize(void); extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, unsigned long end, enum pti_clone_level level); +static inline bool pti_enabled(void) +{ + return static_cpu_has(X86_FEATURE_PTI); +} #else static inline void pti_check_boottime_disable(void) { } +static inline bool pti_enabled(void) +{ + return false; +} #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h index 9f69cc497f4b..457458228462 100644 --- a/arch/x86/include/asm/switch_to.h +++ b/arch/x86/include/asm/switch_to.h @@ -3,6 +3,7 @@ #define _ASM_X86_SWITCH_TO_H #include +#include struct task_struct; /* one of the stranger aspects of C forward declarations */ @@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task) * doesn't wo
[RFC][PATCH 16/24] x86/pti: Execute IDT handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. For now, this only changes IDT handlers which have no argument other than the pt_regs registers. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 43 +++-- arch/x86/kernel/cpu/mce/core.c | 2 +- arch/x86/kernel/traps.c | 4 +-- 3 files changed, 44 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 4b4aca2b1420..3595a31947b3 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -10,10 +10,49 @@ #include #include +#include bool idtentry_enter_nmi(struct pt_regs *regs); void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); +/* + * The CALL_ON_STACK_* macros call the specified function either directly + * if no stack is provided, or on the specified stack. + */ +#define CALL_ON_STACK_1(stack, func, arg1) \ + ((stack) ? \ +asm_call_on_stack_1(stack, \ + (void (*)(void))(func), (void *)(arg1)) : \ +func(arg1)) + +/* + * Functions to return the top of the kernel stack if we are using the + * user page-table (and thus not running with the kernel stack). If we + * are using the kernel page-table (and so already using the kernel + * stack), then they return NULL. + */ +static __always_inline void *pti_kernel_stack(struct pt_regs *regs) +{ + unsigned long stack; + + if (pti_enabled() && user_mode(regs)) { + stack = (unsigned long)task_top_of_kernel_stack(current); + return (void *)(stack - 8); + } else { + return NULL; + } +} + +/* + * Wrappers to run an IDT handler on the kernel stack if we are not + * already using this stack.
+ */ +static __always_inline +void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) +{ + CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -55,7 +94,7 @@ __visible noinstr void func(struct pt_regs *regs) \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs);\ + run_idt(__##func, regs);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -271,7 +310,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ __irq_enter_raw(); \ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs);\ + run_idt(__##func, regs);\ __irq_exit_raw(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 4102b866e7c0..9407c3cd9355 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -2035,7 +2035,7 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check) unsigned long dr7; dr7 = local_db_save(); - exc_machine_check_user(regs); + run_idt(exc_machine_check_user, regs); local_db_restore(dr7); } #else diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 09b22a611d99..5161385b3670 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -257,7 +257,7 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op) state = irqentry_enter(regs); instrumentation_begin(); - handle_invalid_op(regs); + run_idt(handle_invalid_op, regs); instrumentation_end(); irqentry_exit(regs, state); } @@ -647,7 +647,7 @@ DEFINE_IDTENTRY_RAW(exc_int3) if (user_mode(regs)) { irqentry_enter_from_user_mode(regs); instrumentation_begin(); - do_int3_user(regs); + run_idt(do_int3_us
[RFC][PATCH 19/24] x86/pti: Execute page fault handler on the kernel stack
After a page fault from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the page fault handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 17 + arch/x86/mm/fault.c | 2 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 0c5d9f027112..a6725afaaec0 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -31,6 +31,13 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ func(arg1, arg2)) +#define CALL_ON_STACK_3(stack, func, arg1, arg2, arg3) \ + ((stack) ? \ +asm_call_on_stack_3(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2), \ + (void *)(arg3)) : \ +func(arg1, arg2, arg3)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack).
If we @@ -66,6 +73,16 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_idt_pagefault(void (*func)(struct pt_regs *, unsigned long, + unsigned long), + struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + CALL_ON_STACK_3(pti_kernel_stack(regs), + func, regs, error_code, address); +} + static __always_inline void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) { diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 82bf37a5c9ec..b9d03603d95d 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1482,7 +1482,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) state = irqentry_enter(regs); instrumentation_begin(); - handle_page_fault(regs, error_code, address); + run_idt_pagefault(handle_page_fault, regs, error_code, address); instrumentation_end(); irqentry_exit(regs, state); -- 2.18.4
[RFC][PATCH 21/24] x86/entry: Disable stack-protector for IST entry C handlers
The stack-protector option adds a stack canary to functions vulnerable to stack buffer overflow. The stack canary is defined through the GS register. Add an attribute to disable the stack-protector option; it will be used for C functions which can be called while the GS register might not be properly configured yet. The GS register is not properly configured for the kernel when we enter the kernel from userspace. The assembly entry code sets the GS register for the kernel using the swapgs instruction or the paranoid_entry function, and so, currently, the GS register is correctly configured when assembly entry code subsequently transfers control to C code. Deferring the CR3 register switch from assembly to C code will require reimplementing paranoid_entry in C and hence also deferring the GS register setup for IST entries to C code. To prepare for this change, disable stack-protector for IST entry C handlers where the GS register setup will eventually happen. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 25 - arch/x86/kernel/nmi.c | 2 +- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a6725afaaec0..647af7ea3bf1 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -94,6 +94,21 @@ void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) run_sysvec_on_irqstack_cond(func, regs); } +/* + * Attribute to disable the stack-protector option. The option is + * disabled using the optimize attribute which clears all optimize + * options. So we need to specify the optimize option to disable but + * also optimize options we want to preserve. + * + * The stack-protector option adds a stack canary to functions + * vulnerable to stack buffer overflow. The stack canary is defined + * through the GS register.
So the attribute is used to disable the + * stack-protector option for functions which can be called while the + * GS register might not be properly configured yet. + */ +#define no_stack_protector \ + __attribute__ ((optimize("-O2,-fno-stack-protector,-fno-omit-frame-pointer"))) + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -410,7 +425,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW */ #define DEFINE_IDTENTRY_IST(func) \ - DEFINE_IDTENTRY_RAW(func) + no_stack_protector DEFINE_IDTENTRY_RAW(func) /** * DEFINE_IDTENTRY_NOIST - Emit code for NOIST entry points which @@ -440,7 +455,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE */ #define DEFINE_IDTENTRY_DF(func) \ - DEFINE_IDTENTRY_RAW_ERRORCODE(func) + no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func) /** * DEFINE_IDTENTRY_VC_SAFE_STACK - Emit code for VMM communication handler @@ -472,7 +487,7 @@ static __always_inline void __##func(struct pt_regs *regs) * VMM communication handler. 
*/ #define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ - __visible noinstr \ + no_stack_protector __visible noinstr\ unsigned long setup_stack_##func(struct pt_regs *regs) /** @@ -482,7 +497,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE */ #define DEFINE_IDTENTRY_VC(func) \ - DEFINE_IDTENTRY_RAW_ERRORCODE(func) + no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func) #else /* CONFIG_X86_64 */ @@ -517,7 +532,7 @@ __visible noinstr void func(struct pt_regs *regs, \ /* C-Code mapping */ #define DECLARE_IDTENTRY_NMI DECLARE_IDTENTRY_RAW -#define DEFINE_IDTENTRY_NMIDEFINE_IDTENTRY_RAW +#define DEFINE_IDTENTRY_NMIno_stack_protector DEFINE_IDTENTRY_RAW #ifdef CONFIG_X86_64 #define DECLARE_IDTENTRY_MCE DECLARE_IDTENTRY_IST diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index be0f654c3095..b6291b683be1 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state); static DEFINE_PER_CPU(unsigned long, nmi_cr2); static DEFINE_PER_CPU(unsigned long, nmi_dr7); -DEFINE_IDTENTRY_RAW(exc_nmi) +DEFINE_IDTENTRY_NMI(exc_nmi) { bool irq_state; -- 2.18.4
[RFC][PATCH 17/24] x86/pti: Execute IDT handlers with error code on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. This changes IDT handlers which have an error code. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 18 -- arch/x86/kernel/traps.c | 2 +- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 3595a31947b3..a82e31b45442 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1)) : \ func(arg1)) +#define CALL_ON_STACK_2(stack, func, arg1, arg2) \ + ((stack) ? \ +asm_call_on_stack_2(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ +func(arg1, arg2)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack).
If we @@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); } +static __always_inline +void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), +struct pt_regs *regs, unsigned long error_code) +{ + CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs, \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs, error_code);\ + run_idt_errcode(__##func, regs, error_code);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs, \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs, (u8)error_code);\ + run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \ + regs, (u8)error_code); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 5161385b3670..9a51aa016fb3 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug) /* User entry, runs on regular task stack */ DEFINE_IDTENTRY_DEBUG_USER(exc_debug) { - exc_debug_user(regs, debug_read_clear_dr6()); + run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6()); } #else /* 32 bit does not have separate entry points. */ -- 2.18.4