Re: [PATCH 2/2] arch/x86: arch/sparc: tools/perf: fix typos in comments
On 4/8/21 7:28 PM, Thomas Tai wrote: s/insted/instead/ s/maintaing/maintaining/ Signed-off-by: Thomas Tai --- arch/sparc/vdso/vdso2c.c | 2 +- arch/x86/entry/vdso/vdso2c.c | 2 +- arch/x86/kernel/cpu/intel.c | 2 +- tools/perf/arch/x86/util/perf_regs.c | 4 ++-- 4 files changed, 5 insertions(+), 5 deletions(-) Reviewed-by: Alexandre Chartre alex.
Re: [PATCH 1/2] x86/traps: call cond_local_irq_disable before returning from exc_general_protection and math_error
On 4/8/21 7:28 PM, Thomas Tai wrote: This fixes commit 334872a09198 ("x86/traps: Attempt to fixup exceptions in vDSO before signaling") which added return statements without calling cond_local_irq_disable(). According to commit ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code"), cond_local_irq_disable() is needed because the ASM return code no longer disables interrupts. Follow the existing code as an example to use "goto exit" instead of "return" statement. Signed-off-by: Thomas Tai --- arch/x86/kernel/traps.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Reviewed-by: Alexandre Chartre And it is probably worth adding a 'Fixes:' tag: Fixes: 334872a09198 ("x86/traps: Attempt to fixup exceptions in vDSO before signaling") alex.
Re: [for-stable-4.19 PATCH 1/2] vmlinux.lds.h: Create section for protection against instrumentation
On 3/19/21 11:39 AM, Greg Kroah-Hartman wrote: On Fri, Mar 19, 2021 at 07:54:15AM +0800, Nicolas Boichat wrote: From: Thomas Gleixner commit 655389433e7efec589838b400a2a652b3ffa upstream. Some code paths, especially the low level entry code, must be protected against instrumentation for various reasons: - Low level entry code can be a fragile beast, especially on x86. - With NO_HZ_FULL RCU state needs to be established before using it. Having a dedicated section for such code allows validating with tooling that no unsafe functions are invoked. Add the .noinstr.text section and the noinstr attribute to mark functions. noinstr implies notrace. Kprobes will gain a section check later. Also provide a set of markers: instrumentation_begin()/end(). These are used to mark code inside a noinstr function which calls into regular instrumentable text sections as safe. The instrumentation markers are only active when CONFIG_DEBUG_ENTRY is enabled as the end marker emits a NOP to prevent the compiler from merging the annotation points. This means the objtool verification requires a kernel compiled with this option. Signed-off-by: Thomas Gleixner Reviewed-by: Alexandre Chartre Acked-by: Peter Zijlstra Link: https://lkml.kernel.org/r/20200505134100.075416...@linutronix.de [Nicolas: context conflicts in: arch/powerpc/kernel/vmlinux.lds.S include/asm-generic/vmlinux.lds.h include/linux/compiler.h include/linux/compiler_types.h] Signed-off-by: Nicolas Boichat

Did you build this on x86? I get the following build error:

  ld:./arch/x86/kernel/vmlinux.lds:20: syntax error

And that line looks like:

  . = ALIGN(8);
  *(.text.hot .text.hot.*)
  *(.text .text.fixup)
  *(.text.unlikely .text.unlikely.*)
  *(.text.unknown .text.unknown.*)
  . = ALIGN(8);
  __noinstr_text_start = .;
  *(.__attribute__((noinline)) __attribute__((no_instrument_function)) __attribute((__section__(".noinstr.text"))).text)
  __noinstr_text_end = .;
  *(.text..refcount)
  *(.ref.text)
  *(.meminit.text*)
  *(.memexit.text*)

In the NOINSTR_TEXT macro, noinstr is expanded with the value of the noinstr macro from linux/compiler_types.h while it shouldn't be. The problem is possibly that the noinstr macro is defined for assembly. Make sure that the macro is not defined for assembly, e.g.:

  #ifndef __ASSEMBLY__
  /* Section for code which can't be instrumented at all */
  #define noinstr \
          noinline notrace __attribute((__section__(".noinstr.text")))
  #endif

alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/19/20 8:10 PM, Thomas Gleixner wrote: On Mon, Nov 16 2020 at 19:10, Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Coming late, but this does not make any sense to me. Unless you map most of the kernel into the user page-table, sleeping with the user page-table _cannot_ work. And if you do that you broke KPTI. You cannot pick arbitrary points in the C code of an exception handler to switch to the kernel mapping unless you mapped everything which might be touched before that into user space. How is that supposed to work? Sorry, I mixed up a few things; I got confused with my own code, which is not a good sign... It's not sleeping with the user page-table which, as you mentioned, doesn't make sense; it's sleeping with the kernel page-table but with the PTI stack. Basically, it is:
- entering C code with (user page-table, PTI stack);
- then it switches to the kernel page-table, so we have (kernel page-table, PTI stack);
- and then it switches to the kernel stack, so we have (kernel page-table, kernel stack).
As this is all C code, some of which is executed with the PTI stack, we need the PTI stack to be per-task so that the stack is preserved in case that C code does a sleep/schedule (no matter if this happens when using the PTI stack or the kernel stack). alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/19/20 5:06 PM, Andy Lutomirski wrote: On Thu, Nov 19, 2020 at 4:06 AM Alexandre Chartre wrote: On 11/19/20 9:05 AM, Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper into the kernel handler, where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable.
I finally remember why I have introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need a per-task stack to enter (and return from) the C handler, as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code, before and after calling the C function handler (also called from assembly). The noinstr part of the C entry code won't sleep. But the handler itself can sleep, and if it does we will need to preserve the trampoline stack (even if we switch to the per-task kernel stack to execute the handler). Example:

#define DEFINE_IDTENTRY(func)                                         \
static __always_inline void __##func(struct pt_regs *regs);           \
                                                                      \
__visible noinstr void func(struct pt_regs *regs)                     \
{                                                                     \
        irqentry_state_t state;          -+                           \
                                          |                           \
        user_pagetable_escape(regs);      |  use trampoline stack (1) \
        state = irqentry_enter(regs);     |                           \
        instrumentation_begin();         -+                           \
        run_idt(__##func, regs);         |===|  run __func() on kernel stack (this can sleep)
        instrumentation_end();           -+                           \
        irqentry_exit(regs, state);       |  use trampoline stack (2) \
        user_pagetable_return(regs);     -+                           \
}

Between (1) and (2) we need to preserve and use the same trampoline stack in case __func() went sleeping. Why? Right now, we have the percpu entry stack, and we do just fine if we enter on one percpu stack and exit from a different one. We would need to call from asm to C on the entry stack, return back to asm, and then switch stacks. That's the problem: I didn't want to return back to asm, so that the pagetable switch can be done anywhere in the C handler. So yes, returning to asm to switch the stack is the solution if we want to avoid having a per-task trampoline stack. The drawback is that this forces doing the page-table switch at the beginning and end of the handler; the pagetable switch cannot be moved deeper down into the C handler.
But that's probably a good first step (effectively just moving CR3 switch to C without adding per-task trampoline stack). I will update the patches to do that, and we can defer the per-task trampoline stack to later if there's an effective need for it. That might not be a good first step after all... Calling CR3 switch C functions from assembly introduces extra pt_regs copies between the trampoline stack and the kernel stack. Currently when entering syscall, we immediately sw
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/19/20 9:05 AM, Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper into the kernel handler, where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable. I finally remember why I have introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler.
To do so, we need a per-task stack to enter (and return from) the C handler, as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code, before and after calling the C function handler (also called from assembly). The noinstr part of the C entry code won't sleep. But the handler itself can sleep, and if it does we will need to preserve the trampoline stack (even if we switch to the per-task kernel stack to execute the handler). Example:

#define DEFINE_IDTENTRY(func)                                         \
static __always_inline void __##func(struct pt_regs *regs);           \
                                                                      \
__visible noinstr void func(struct pt_regs *regs)                     \
{                                                                     \
        irqentry_state_t state;          -+                           \
                                          |                           \
        user_pagetable_escape(regs);      |  use trampoline stack (1) \
        state = irqentry_enter(regs);     |                           \
        instrumentation_begin();         -+                           \
        run_idt(__##func, regs);         |===|  run __func() on kernel stack (this can sleep)
        instrumentation_end();           -+                           \
        irqentry_exit(regs, state);       |  use trampoline stack (2) \
        user_pagetable_return(regs);     -+                           \
}

Between (1) and (2) we need to preserve and use the same trampoline stack in case __func() went sleeping. Why? Right now, we have the percpu entry stack, and we do just fine if we enter on one percpu stack and exit from a different one. We would need to call from asm to C on the entry stack, return back to asm, and then switch stacks. That's the problem: I didn't want to return back to asm, so that the pagetable switch can be done anywhere in the C handler. So yes, returning to asm to switch the stack is the solution if we want to avoid having a per-task trampoline stack. The drawback is that this forces doing the page-table switch at the beginning and end of the handler; the pagetable switch cannot be moved deeper down into the C handler. But that's probably a good first step (effectively just moving the CR3 switch to C without adding a per-task trampoline stack).
I will update the patches to do that, and we can defer the per-task trampoline stack to later if there's an effective need for it. That might not be a good first step after all... Calling CR3 switch C functions from assembly introduces extra pt_regs copies between the trampoline stack and the kernel stack. Currently, when entering a syscall, we immediately switch CR3 and build pt_regs directly on the kernel stack. On return, registers are restored from pt_regs from the k
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/19/20 2:49 AM, Andy Lutomirski wrote: On Tue, Nov 17, 2020 at 8:59 AM Alexandre Chartre wrote: On 11/17/20 4:52 PM, Andy Lutomirski wrote: On Tue, Nov 17, 2020 at 7:07 AM Alexandre Chartre wrote: On 11/16/20 7:34 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper into the kernel handler, where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack.
The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable. I finally remember why I have introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need a per-task stack to enter (and return from) the C handler, as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code, before and after calling the C function handler (also called from assembly). The noinstr part of the C entry code won't sleep. But the handler itself can sleep, and if it does we will need to preserve the trampoline stack (even if we switch to the per-task kernel stack to execute the handler). Example:

#define DEFINE_IDTENTRY(func)                                         \
static __always_inline void __##func(struct pt_regs *regs);           \
                                                                      \
__visible noinstr void func(struct pt_regs *regs)                     \
{                                                                     \
        irqentry_state_t state;          -+                           \
                                          |                           \
        user_pagetable_escape(regs);      |  use trampoline stack (1) \
        state = irqentry_enter(regs);     |                           \
        instrumentation_begin();         -+                           \
        run_idt(__##func, regs);         |===|  run __func() on kernel stack (this can sleep)
        instrumentation_end();           -+                           \
        irqentry_exit(regs, state);       |  use trampoline stack (2) \
        user_pagetable_return(regs);     -+                           \
}

Between (1) and (2) we need to preserve and use the same trampoline stack in case __func() went sleeping. Why? Right now, we have the percpu entry stack, and we do just fine if we enter on one percpu stack and exit from a different one. We would need to call from asm to C on the entry stack, return back to asm, and then switch stacks. That's the problem: I didn't want to return back to asm, so that the pagetable switch can be done anywhere in the C handler. So yes, returning to asm to switch the stack is the solution if we want to avoid having a per-task trampoline stack.
The drawback is that this forces doing the page-table switch at the beginning and end of the handler; the pagetable switch cannot be moved deeper down into the C handler. But that's probably a good first step (effectively just moving the CR3 switch to C without adding a per-task trampoline stack). I will update the patches to do that, and we can defer the per-task trampoline stack to later if there's an effective need for it. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 12:29 PM, Borislav Petkov wrote: On Wed, Nov 18, 2020 at 08:41:42AM +0100, Alexandre Chartre wrote: Well, it looks like I wrongly assumed that KPTI was a well-known performance overhead since it was introduced (because it adds extra page-table switches), but you are right, I should be presenting my own numbers. Here's one recipe, courtesy of Mel: https://github.com/gormanm/mmtests Thanks for the detailed information, I have run the test and I see the same difference as with the tools/perf and libMICRO results I already sent: there's a 150% difference for getpid() with and without pti. alex.
-
# ../../compare-kernels.sh --baseline test-nopti --compare test-pti

poundsyscall
                        test                  test
                       nopti                   pti
Min       2      1.99 (   0.00%)      5.08 (-155.28%)
Min       4      1.02 (   0.00%)      2.60 (-154.90%)
Min       6      0.94 (   0.00%)      2.07 (-120.21%)
Min       8      0.81 (   0.00%)      1.60 ( -97.53%)
Min       12     0.85 (   0.00%)      1.65 ( -94.12%)
Min       18     0.82 (   0.00%)      1.61 ( -96.34%)
Min       24     0.81 (   0.00%)      1.60 ( -97.53%)
Min       30     0.81 (   0.00%)      1.60 ( -97.53%)
Min       32     0.81 (   0.00%)      1.60 ( -97.53%)
Amean     2      2.02 (   0.00%)      5.10 *-151.83%*
Amean     4      1.03 (   0.00%)      2.61 *-151.98%*
Amean     6      0.96 (   0.00%)      2.07 *-116.74%*
Amean     8      0.82 (   0.00%)      1.60 * -96.56%*
Amean     12     0.87 (   0.00%)      1.67 * -91.73%*
Amean     18     0.82 (   0.00%)      1.63 * -97.94%*
Amean     24     0.81 (   0.00%)      1.60 * -97.41%*
Amean     30     0.82 (   0.00%)      1.60 * -96.93%*
Amean     32     0.82 (   0.00%)      1.60 * -96.56%*
Stddev    2      0.02 (   0.00%)      0.02 (  33.78%)
Stddev    4      0.01 (   0.00%)      0.01 (   7.18%)
Stddev    6      0.01 (   0.00%)      0.00 (  68.77%)
Stddev    8      0.01 (   0.00%)      0.01 (  10.56%)
Stddev    12     0.01 (   0.00%)      0.02 ( -12.69%)
Stddev    18     0.01 (   0.00%)      0.01 (-107.25%)
Stddev    24     0.00 (   0.00%)      0.00 ( -14.56%)
Stddev    30     0.01 (   0.00%)      0.01 (   0.00%)
Stddev    32     0.01 (   0.00%)      0.00 (  20.00%)
CoeffVar  2      1.17 (   0.00%)      0.31 (  73.70%)
CoeffVar  4      0.82 (   0.00%)      0.30 (  63.16%)
CoeffVar  6      1.41 (   0.00%)      0.20 (  85.59%)
CoeffVar  8      0.87 (   0.00%)      0.39 (  54.50%)
CoeffVar  12     1.66 (   0.00%)      0.98 (  41.23%)
CoeffVar  18     0.85 (   0.00%)      0.89 (  -4.71%)
CoeffVar  24     0.52 (   0.00%)      0.30 (  41.97%)
CoeffVar  30     0.65 (   0.00%)      0.33 (  49.22%)
CoeffVar  32     0.65 (   0.00%)      0.26 (  59.30%)
Max       2      2.04 (   0.00%)      5.13 (-151.47%)
Max       4      1.04 (   0.00%)      2.62 (-151.92%)
Max       6      0.98 (   0.00%)      2.08 (-112.24%)
Max       8      0.83 (   0.00%)      1.62 ( -95.18%)
Max       12     0.89 (   0.00%)      1.70 ( -91.01%)
Max       18     0.84 (   0.00%)      1.66 ( -97.62%)
Max       24     0.82 (   0.00%)      1.61 ( -96.34%)
Max       30     0.82 (   0.00%)      1.61 ( -96.34%)
Max       32     0.82 (   0.00%)      1.61 ( -96.34%)
BAmean-50 2      2.01 (   0.00%)      5.09 (-153.39%)
BAmean-50 4      1.03 (   0.00%)      2.60 (-152.62%)
BAmean-50 6      0.95 (   0.00%)      2.07 (-118.82%)
BAmean-50 8      0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-50 12     0.86 (   0.00%)      1.66 ( -92.79%)
BAmean-50 18     0.82 (   0.00%)      1.62 ( -97.56%)
BAmean-50 24     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-50 30     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-50 32     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-95 2      2.02 (   0.00%)      5.09 (-151.87%)
BAmean-95 4      1.03 (   0.00%)      2.61 (-151.99%)
BAmean-95 6      0.95 (   0.00%)      2.07 (-117.25%)
BAmean-95 8      0.81 (   0.00%)      1.60 ( -96.72%)
BAmean-95 12     0.87 (   0.00%)      1.67 ( -91.82%)
BAmean-95 18     0.82 (   0.00%)      1.63 ( -97.97%)
BAmean-95 24     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-95 30     0.81 (   0.00%)      1.60 ( -97.00%)
BAmean-95 32     0.81 (   0.00%)      1.60 ( -96.59%)
BAmean-99 2      2.02 (   0.00%)      5.09 (-151.87%)
BAmean-99 4      1.03 (   0.00%)      2.61 (-151.99%)
BAmean-99 6      0.95 (   0.00%)      2.07 (-117.25%)
BAmean-99 8      0.81 (   0.00%)      1.60 ( -96.72%)
BAmean-99 12     0.87 (   0.00%)      1.67 ( -91.82%)
BAmean-99 18     0.82 (   0.00%)      1.63 ( -97.97%)
BAmean-99 24     0.81 (   0.00%)      1.60 ( -97.53%)
BAmean-99 30     0.8
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 2:22 PM, David Laight wrote: From: Alexandre Chartre Sent: 18 November 2020 10:30 ... Correct, this RFC is not changing the overhead. However, it is a step forward for being able to execute some selected syscalls or interrupt handlers without switching to the kernel page-table. The next step would be to identify and add the necessary mappings to the user page-table so that specified syscalls can be executed without switching the page-table. Remember that without PTI user space can read all kernel memory. (I'm not 100% sure you can force a cache-line read.) It isn't even that slow. (Even I can understand how it works.) So if you are worried about user space doing that you can't really run anything on the user page tables. Yes, without PTI, userspace can read all kernel memory. But to run some parts of the kernel you don't need to have all kernel mappings. Also, a lot of the kernel contains non-sensitive information which can be safely exposed to userspace. So there's probably some room for running carefully selected syscalls with the user page-table (and hopefully useful ones). System calls like getpid() are irrelevant - they aren't used (much). Even the time of day ones are implemented in the VDSO without a context switch. getpid()/getppid() is interesting because it shows the amount of overhead PTI is adding. But the impact can be more important if some TLB flushing is also required (as you mentioned below). So the overheads come from other system calls that 'do work' without actually sleeping. I'm guessing things like read, write, sendmsg, recvmsg. The only interesting system call I can think of is futex. As well as all the calls that return immediately because the mutex has been released while entering the kernel, I suspect that being pre-empted by a different thread (of the same process) doesn't actually need CR3 reloading (without PTI). I also suspect that it isn't just the CR3 reload that costs.
There could (depending on the cpu) be associated TLB and/or cache invalidations that have a much larger effect on programs with large working sets than on simple benchmark programs. Right, the TLB flush is mitigated with PCID, but the impact is larger if there's no PCID. Now bits of data that you are 'more worried about' could be kept in physical memory that isn't normally mapped (or referenced by a TLB) and only mapped when needed. But that doesn't help the general case. Note that having syscalls which can be executed without switching the page-table is just one benefit you can get from this RFC. But the main benefit is for integrating Address Space Isolation (ASI), which will be much more complex if ASI has to plug into the current assembly CR3 switch. Thanks, alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 10:30 AM, David Laight wrote: From: Alexandre Chartre Sent: 18 November 2020 07:42 On 11/17/20 10:26 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote: Some benchmarks are available, in particular from phoronix: What I was expecting was benchmarks *you* have run which show that perf penalty, not something one can find quickly on the internet and something one cannot always reproduce her-/himself. You do know that presenting convincing numbers with a patchset greatly improves its chances of getting it upstreamed, right? Well, it looks like I wrongly assumed that KPTI was a well-known performance overhead since it was introduced (because it adds extra page-table switches), but you are right, I should be presenting my own numbers. IIRC the penalty comes from the page table switch. Doing it at a different time is unlikely to make much difference. Correct, this RFC is not changing the overhead. However, it is a step forward for being able to execute some selected syscalls or interrupt handlers without switching to the kernel page-table. The next step would be to identify and add the necessary mappings to the user page-table so that specified syscalls can be executed without switching the page-table. For some workloads the penalty is massive - getting on for 50%. We are still using old kernels on AWS. Here are some micro benchmarks of the getppid and getpid syscalls which highlight the PTI overhead. This uses the kernel tools/perf command, and the getpid command from libMICRO (https://github.com/redhat-performance/libMicro):

system running 5.10-rc4 booted with nopti:
--
# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10,000,000 getppid() calls
     Total time: 0.792 [sec]

       0.079223 usecs/op
       12622549 ops/sec

# getpid -B 10
              prc  thr   usecs/call  samples  errors  cnt/samp
getpid         1    1      0.08029      102       0        10

We can see that the getpid and getppid syscalls have the same execution time, around 0.08 usecs.
These syscalls are very small and just return a value, so the time is mostly spent entering/exiting the kernel.

same system booted with pti:
--
# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 10,000,000 getppid() calls
     Total time: 2.025 [sec]

       0.202527 usecs/op
        4937605 ops/sec

# getpid -B 10
              prc  thr   usecs/call  samples  errors  cnt/samp
getpid         1    1      0.20241      102       0        10

With PTI, the execution time jumps to 0.20 usecs (+0.12 usecs = +150%). That's a very extreme case because these are very small syscalls, and in that case the overhead of switching page-tables is significant compared to the execution time of the syscall. So with an overhead of +0.12 usecs per syscall, the PTI impact is significant for workloads which use a lot of short syscalls. But if you use longer syscalls, for example with an average execution time of 2.0 usecs per syscall, then you have a lower overhead of 6%. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 10:26 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote: Some benchmarks are available, in particular from phoronix: What I was expecting was benchmarks *you* have run which show that perf penalty, not something one can find quickly on the internet and something one cannot always reproduce her-/himself. You do know that presenting convincing numbers with a patchset greatly improves its chances of getting it upstreamed, right? Well, it looks like I wrongly assumed that KPTI was a well-known performance overhead since it was introduced (because it adds extra page-table switches), but you are right, I should be presenting my own numbers. Thanks, alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 10:23 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 08:02:51PM +0100, Alexandre Chartre wrote: No. This prevents the guest VM from gathering data from the host kernel on the same cpu-thread. But there's no mitigation for a guest VM running on a cpu-thread attacking another cpu-thread (which can be running another guest VM or the host kernel) from the same cpu-core. You cannot use flush/clear barriers because the two cpu-threads are running in parallel. Now there's your justification for why you're doing this. It took a while... The "why" should always be part of the 0th message to provide reviewers/maintainers with answers to the question of what this pile of patches is all about. Please always add this rationale to your patchset in the future. Sorry about that, I will definitely try to do better next time. :-} Thanks, alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 7:28 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote: Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at the moment. In particular, this allows a guest VM to attack another guest VM or the host kernel running on a sibling cpu-thread. Core Scheduling will mitigate the guest-to-guest attack but not the guest-to-host attack. I see in vmx_vcpu_enter_exit(): /* L1D Flush includes CPU buffer clear to mitigate MDS */ if (static_branch_unlikely(&vmx_l1d_should_flush)) vmx_l1d_flush(vcpu); else if (static_branch_unlikely(&mds_user_clear)) mds_clear_cpu_buffers(); Is that not enough? No. This prevents the guest VM from gathering data from the host kernel on the same cpu-thread. But there's no mitigation for a guest VM running on a cpu-thread attacking another cpu-thread (which can be running another guest VM or the host kernel) from the same cpu-core. You cannot use flush/clear barriers because the two cpu-threads are running in parallel. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 6:07 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 09:19:01AM +0100, Alexandre Chartre wrote: We are not reversing PTI, we are extending it. You're reversing it in the sense that you're mapping more kernel memory into the user page table than what is mapped now. PTI removes all kernel mappings from the user page-table. However, there's no issue with mapping some kernel data into the user page-table as long as this data contains no sensitive information. I hope that is the case. Actually, PTI is already doing that, but with a very limited scope. PTI adds into the user page-table some kernel mappings which are needed for userland to enter the kernel (such as the kernel entry text, the ESPFIX, the CPU_ENTRY_AREA_BASE...). So here, we are extending the PTI mapping so that we can execute more kernel code while using the user page-table; it's a kind of PTI on steroids. And this is what bothers me - someone else might come after you and say, but but, I need to map more stuff into the user pgt because I wanna do X... and so on. Agree, any addition should be strictly checked. I have been careful to expand it to the minimum I needed. The minimum size would be 1 page (4KB) as this is the minimum mapping size. It's certainly enough for now as the usage of the PTI stack is limited, but we will need a larger stack if we want to execute more kernel code with the user page-table. So on a big machine with a million tasks, that's at least a million pages more which is what, ~4 Gb? There better be a very good justification for the additional memory consumption... Yeah, adding a per-task allocation is my main concern, hence this RFC. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 5:55 PM, Borislav Petkov wrote: On Tue, Nov 17, 2020 at 08:56:23AM +0100, Alexandre Chartre wrote: The main goal of ASI is to provide KVM address space isolation to mitigate guest-to-host speculative attacks like L1TF or MDS. Because the current L1TF and MDS mitigations are lacking or why? Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at the moment. In particular, this allows a guest VM to attack another guest VM or the host kernel running on a sibling cpu-thread. Core Scheduling will mitigate the guest-to-guest attack but not the guest-to-host attack. Address Space Isolation provides a mitigation for the guest-to-host attack. The current proposal of ASI is plugged into the CR3 switch assembly macro, which makes the code brittle and complex (see [1]). I also expect this might help with some other ideas like having syscalls (or interrupt handlers) which can run without switching the page-table. I still fail to see why we need all that. I read, "this does this and that" but I don't read "the current problem is this" and "this is our suggested solution for it". So what is the issue which needs addressing in the current kernel which is going to justify adding all that code? The main issue this is trying to address is that the CR3 switch is currently done in assembly code from contexts which are very restrictive: the CR3 switch is often done when only one or two registers are available for use, sometimes no stack is available. For example, the syscall entry switches CR3 with a single register available (%sp) and no stack. Because of this, it is fairly tricky to expand the logic for switching CR3. This is a problem that we have faced while implementing Address Space Isolation (ASI) where we need extra logic to drive the page-table switch. We have successfully implemented ASI with the current CR3 switching assembly code, but this requires complex assembly constructions.
Hence this proposal to defer CR3 switching to C code so that it can be more easily expanded. Hopefully this can also contribute to making the assembly entry code less complex, and be beneficial to other projects. PTI has a measured overhead of roughly 5% for most workloads, but it can be much higher in some cases. "it can be"? Where? Actual use case? Some benchmarks are available, in particular from phoronix: https://www.phoronix.com/scan.php?page=article&item=linux-more-x86pti https://www.phoronix.com/scan.php?page=news_item&px=x86-PTI-Initial-Gaming-Tests https://www.phoronix.com/scan.php?page=article&item=linux-kpti-kvm https://medium.com/@loganaden/linux-kpti-performance-hit-on-real-workloads-8da185482df3 The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged directly into the CR3 switch assembly macro. We are working on a new implementation, based on these changes, which avoids having to deal with assembly code and makes the implementation more robust. This still doesn't answer my questions. I read a lot of "could be used for" formulations but I still don't know why we need that. So what is the problem that the kernel currently has which you're trying to address with this? Hopefully this is clearer with the answer I provided above. Thanks, alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/17/20 4:52 PM, Andy Lutomirski wrote: On Tue, Nov 17, 2020 at 7:07 AM Alexandre Chartre wrote: On 11/16/20 7:34 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings, which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper in the kernel handler where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack.
The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable. I finally remember why I introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need a per-task stack to enter (and return from) the C handler as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code before and after calling the C function handler (also called from assembly). The noinstr part of the C entry code won't sleep. But the noinstr part of the handler can sleep, and if it does we will need to preserve the trampoline stack (even if we switch to the per-task kernel stack to execute the noinstr part). Example:

#define DEFINE_IDTENTRY(func)						\
static __always_inline void __##func(struct pt_regs *regs);		\
									\
__visible noinstr void func(struct pt_regs *regs)			\
{									\
	irqentry_state_t state;				-+		\
							 |		\
	user_pagetable_escape(regs);			 | use trampoline stack (1)
	state = irqentry_enter(regs);			 |		\
	instrumentation_begin();			-+		\
	run_idt(__##func, regs);			|===| run __func() on kernel stack (this can sleep)
	instrumentation_end();				-+		\
	irqentry_exit(regs, state);			 | use trampoline stack (2)
	user_pagetable_return(regs);			-+		\
}

Between (1) and (2) we need to preserve and use the same trampoline stack in case __func() went to sleep. alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/16/20 7:34 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings, which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper in the kernel handler where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable.
I finally remember why I introduced a per-task PTI trampoline stack right now: that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need a per-task stack to enter (and return from) the C handler as the handler can potentially go to sleep. Without a per-task trampoline stack, we would be limited to calling the CR3 switch functions from the assembly entry code before and after calling the C function handler (also called from assembly). alex.
Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings
On 11/17/20 12:06 AM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 12:18 PM Alexandre Chartre wrote: On 11/16/20 8:48 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre wrote: Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, per cpu offsets (__per_cpu_offset, which is used in some entry code), the stack canary, and the PTI stack (which is defined per task). Does anything unmap the PTI stack? Mapping is easy, and unmapping could be a pretty big mess. No, there's no unmap. The mapping exists as long as the task page-table does (i.e. as long as the task mm exists). I assume that the task stack and mm are freed at the same time but that's not something I have checked. Nope. A multi-threaded mm will free task stacks when the task exits, but the mm may outlive the individual tasks. Additionally, if you allocate page tables as part of mapping PTI stacks, you need to make sure the pagetables are freed. So I think I just need to unmap the PTI stack from the user page-table when the task exits. Everything else is handled because the kernel and PTI stack are allocated in a single chunk (referenced by task->stack). Finally, you need to make sure that the PTI stacks have appropriate guard pages -- just doubling the allocation is not safe enough. The PTI stack does have guard pages because it maps only a part of the task stack into the user page-table, so pages around the PTI stack are not mapped into the user-pagetable (the page below is the task stack guard, and the page above is part of the kernel-only stack so it's never mapped into the user page-table).
+ *    +-------------+
+ *    |             |   ^                     ^
+ *    | kernel-only |   | KERNEL_STACK_SIZE   |
+ *    |    stack    |   |                     |
+ *    |             |   V                     |
+ *    +-------------+ <- top of kernel stack  | THREAD_SIZE
+ *    |             |   ^                     |
+ *    | kernel and  |   | KERNEL_STACK_SIZE   |
+ *    |  PTI stack  |   |                     |
+ *    |             |   V                     v
+ *    +-------------+ <- top of stack

My intuition is that this is going to be far more complexity than is justified. Sounds like only the PTI stack unmap is missing, which is hopefully not that bad. I will check that. alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/16/20 10:24 PM, David Laight wrote: From: Alexandre Chartre Sent: 16 November 2020 18:10 On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. Isn't that going to allocate a lot more kernel memory? That's one of my concerns, hence this RFC. The current code is doubling the task stack (this was an easy solution), so that's +8KB per task. See my reply to Boris, it has a bit more details. alex. ISTR some thoughts about using dynamically allocated kernel stacks when (at least some) wakeups are done by directly restarting the system call - so that the sleeping thread doesn't even need a kernel stack. (I can't remember if that was linux or one of the BSDs) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/16/20 9:24 PM, Borislav Petkov wrote: On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote: Deferring CR3 switch to C code means that we need to run more of the kernel entry code with the user page-table. To do so, we need to: - map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code); - map additional data used in the entry code (such as the stack canary); - run more entry code on the trampoline stack (which is mapped both in the kernel and in the user page-table) until we switch to the kernel page-table and then switch to the kernel stack; So PTI was added exactly to *not* have kernel memory mapped in the user page table. You're partially reversing that... We are not reversing PTI, we are extending it. PTI removes all kernel mappings from the user page-table. However there's no issue with mapping some kernel data into the user page-table as long as these data have no sensitive information. Actually, PTI is already doing that but with a very limited scope. PTI adds into the user page-table some kernel mappings which are needed for userland to enter the kernel (such as the kernel entry text, the ESPFIX, the CPU_ENTRY_AREA_BASE...). So here, we are extending the PTI mapping so that we can execute more kernel code while using the user page-table; it's a kind of PTI on steroids. - have a per-task trampoline stack instead of a per-cpu trampoline stack, so the task can be scheduled out while it hasn't switched to the kernel stack. per-task? How much more memory is that per task? Currently, this is done by doubling the size of the task stack (patch 8), so that's an extra 8KB. Half of the stack is used as the regular kernel stack, and the other half is used as the PTI stack:

+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ *    +-------------+
+ *    |             |   ^                     ^
+ *    | kernel-only |   | KERNEL_STACK_SIZE   |
+ *    |    stack    |   |                     |
+ *    |             |   V                     |
+ *    +-------------+ <- top of kernel stack  | THREAD_SIZE
+ *    |             |   ^                     |
+ *    | kernel and  |   | KERNEL_STACK_SIZE   |
+ *    |  PTI stack  |   |                     |
+ *    |             |   V                     v
+ *    +-------------+ <- top of stack
+ */

The minimum size would be 1 page (4KB) as this is the minimum mapping size. It's certainly enough for now as the usage of the PTI stack is limited, but we will need a larger stack if we want to execute more kernel code with the user page-table. alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/16/20 9:17 PM, Borislav Petkov wrote: On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote: This RFC proposes to defer the PTI CR3 switch until we reach C code. The benefit is that this simplifies the assembly entry code, and makes the PTI CR3 switch code easier to understand. This also paves the way for further possible projects such as an easier integration of Address Space Isolation (ASI), or the possibility to execute some selected syscall or interrupt handlers without switching to the kernel page-table What for? What is this going to be used for in the end? In addition to simplifying the assembly entry code, this will also simplify the integration of Address Space Isolation (ASI) which will certainly be the primary beneficiary of this change. The main goal of ASI is to provide KVM address space isolation to mitigate guest-to-host speculative attacks like L1TF or MDS. The current proposal of ASI is plugged into the CR3 switch assembly macro, which makes the code brittle and complex (see [1]). I also expect this might help with some other ideas like having syscalls (or interrupt handlers) which can run without switching the page-table. (and thus avoid the PTI page-table switch overhead). Overhead of how much? Why do we care? PTI has a measured overhead of roughly 5% for most workloads, but it can be much higher in some cases. The overhead is mostly due to the page-table switch (even with PCID) so if we can run a syscall or an interrupt handler without switching the page-table then we can get this kind of performance back. What is the big picture justification for this diffstat 21 files changed, 874 insertions(+), 314 deletions(-) and the diffstat for the ASI enablement? The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged directly into the CR3 switch assembly macro. We are working on a new implementation, based on these changes, which avoids having to deal with assembly code and makes the implementation more robust. alex.
[1] ASI RFCv4 - https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.char...@oracle.com/
Re: [RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings
On 11/16/20 8:48 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:49 AM Alexandre Chartre wrote: Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, per cpu offsets (__per_cpu_offset, which is used in some entry code), the stack canary, and the PTI stack (which is defined per task). Does anything unmap the PTI stack? Mapping is easy, and unmapping could be a pretty big mess. No, there's no unmap. The mapping exists as long as the task page-table does (i.e. as long as the task mm exists). I assume that the task stack and mm are freed at the same time but that's not something I have checked. alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/16/20 7:34 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 10:10 AM Alexandre Chartre wrote: On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings, which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper in the kernel handler where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. Seems reasonable. Where is the code that allocates and frees these stacks hiding?
I think I should at least read it. Stacks are allocated/freed with the task stack, this code is unchanged (see alloc_thread_stack_node()). The trick is that I have doubled the THREAD_SIZE (patch 8 "x86/pti: Introduce per-task PTI trampoline stack"). Half the stack is used as the kernel stack (mapped only in the kernel page-table), the other half is used as the PTI stack (mapped in the kernel and user page-table). The mapping to the user page-table is done in mm_map_task() in fork.c (patch 11 "x86/pti: Extend PTI user mappings"). alex.
Re: [RFC][PATCH v2 21/21] x86/pti: Use a different stack canary with the user and kernel page-table
On 11/16/20 5:56 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:48 AM Alexandre Chartre wrote: Using stack protector requires the stack canary to be mapped into the current page-table. Now that the page-table switch between the user and kernel page-table is deferred to C code, stack protector can be used while the user page-table is active and so the stack canary is mapped into the user page-table. To prevent leaking the stack canary used with the kernel page-table, use a different canary with the user and kernel page-table. The stack canary is changed when switching the page-table. Unless I've missed something, this doesn't have the security properties we want. One CPU can be executing with kernel CR3, and another CPU can read the stack canary using Meltdown. I think you are right because we have the mapping to the stack canary in the user page-table. From userspace, we will only read the user stack canary, but using Meltdown we can speculatively read the kernel stack canary which will be stored at the same place. I think that doing this safely requires mapping a different page with the stack canary in the two pagetables. Right. alex.
Re: [RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
On 11/16/20 5:57 PM, Andy Lutomirski wrote: On Mon, Nov 16, 2020 at 6:47 AM Alexandre Chartre wrote: When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Why? When executing more code in the kernel, we are likely to reach a point where we need to sleep while we are using the user page-table, so we need to be using a per-thread stack. I can't immediately evaluate how nasty the page table setup is because it's not in this patch. The page-table is the regular page-table as introduced by PTI. It is just augmented with a few additional mappings, which are in patch 11 (x86/pti: Extend PTI user mappings). But AFAICS the only thing that this enables is sleeping with user pagetables. That's precisely the point, it allows sleeping with the user page-table. Do we really need to do that? Actually, probably not with this particular patchset, because I do the page-table switch at the very beginning and end of the C handler. I had some code where I moved the page-table switch deeper in the kernel handler where you definitely can sleep (for example, if you switch back to the user page-table before exit_to_user_mode_prepare()). So a first step should probably be to not introduce the per-task PTI trampoline stack, and stick with the existing trampoline stack. The per-task PTI trampoline stack can be introduced later when the page-table switch is moved deeper in the C handler and we can effectively sleep while using the user page-table. alex.
[RFC][PATCH v2 20/21] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
With PTI, syscall/interrupt/exception entries switch the CR3 register to change the page-table in assembly code. Move the CR3 register switch inside the C code of syscall/interrupt/exception entry handlers.

Signed-off-by: Alexandre Chartre
---
 arch/x86/entry/common.c             | 15 ---
 arch/x86/entry/entry_64.S           | 23 +--
 arch/x86/entry/entry_64_compat.S    | 22 --
 arch/x86/include/asm/entry-common.h | 13 +
 arch/x86/include/asm/idtentry.h     | 25 -
 arch/x86/kernel/cpu/mce/core.c      |  2 ++
 arch/x86/kernel/nmi.c               |  2 ++
 arch/x86/kernel/traps.c             |  6 ++
 arch/x86/mm/fault.c                 |  9 +++--
 9 files changed, 67 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1aba02ecb806..6ef5afc42b82 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs,
 		regs->ax = 0;
 	}
 	syscall_exit_to_user_mode(regs);
+	user_pagetable_enter();
 }

 static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
@@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
+	user_pagetable_exit();
 	nr = syscall_enter_from_user_mode(regs, nr);

 	instrumentation_begin();
@@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 	instrumentation_end();
 	syscall_exit_to_user_mode(regs);
+	user_pagetable_enter();
 }
 #endif

 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
 static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
 {
+	user_pagetable_exit();
 	if (IS_ENABLED(CONFIG_IA32_EMULATION))
 		current_thread_info()->status |= TS_COMPAT;
@@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 	do_syscall_32_irqs_on(regs, nr);

 	syscall_exit_to_user_mode(regs);
+	user_pagetable_enter();
 }

-static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
+static noinstr bool __do_fast_syscall_32(struct pt_regs *regs, long nr)
 {
-	unsigned int nr = syscall_32_enter(regs);
 	int res;

 	/*
@@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
 __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 {
+	unsigned int nr = syscall_32_enter(regs);
+	bool syscall_done;
+
 	/*
 	 * Called using the internal vDSO SYSENTER/SYSCALL32 calling
 	 * convention. Adjust regs so it looks like we entered using int80.
@@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 	regs->ip = landing_pad;

 	/* Invoke the syscall. If it failed, keep it simple: use IRET. */
-	if (!__do_fast_syscall_32(regs))
+	syscall_done = __do_fast_syscall_32(regs, nr);
+	user_pagetable_enter();
+	if (!syscall_done)
 		return 0;

 #ifdef CONFIG_X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1715bc0cefff..b7d9a019d001 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64)
 	swapgs
 	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp

 SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
@@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	 */
 syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
-	POP_REGS pop_rdi=0 skip_r11rcx=1
+	POP_REGS skip_r11rcx=1

 	/*
-	 * We are on the trampoline stack. All regs except RDI are live.
+	 * We are on the trampoline stack. All regs except RSP are live.
 	 * We can do future final exit work right here.
 	 */
 	STACKLEAK_ERASE_NOCLOBBER

-	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-
-	popq	%rdi
 	movq	RSP-ORIG_RAX(%rsp), %rsp
 	USERGS_SYSRET64
 SYM_CODE_END(entry_SYSCALL_64)
@@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork)
 	swapgs
 	cld
 	FENCE_SWAPGS_USER_ENTRY
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -594,19 +588,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 	ud2
 1:
 #endif
-	POP_REGS pop_rdi=0
+	POP_REGS
+	addq
[RFC][PATCH v2 14/21] x86/pti: Execute IDT handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to the userland through the stack. For now, this only changes IDT handlers which have no argument other than the pt_regs registers.

Signed-off-by: Alexandre Chartre
---
 arch/x86/include/asm/idtentry.h | 43 +++--
 arch/x86/kernel/cpu/mce/core.c  |  2 +-
 arch/x86/kernel/traps.c         |  4 +--
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4b4aca2b1420..3595a31947b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -10,10 +10,49 @@
 #include
 #include
+#include

 bool idtentry_enter_nmi(struct pt_regs *regs);
 void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);

+/*
+ * The CALL_ON_STACK_* macros call the specified function either directly
+ * if no stack is provided, or on the specified stack.
+ */
+#define CALL_ON_STACK_1(stack, func, arg1)				\
+	((stack) ?							\
+	 asm_call_on_stack_1(stack,					\
+		(void (*)(void))(func), (void *)(arg1)) :		\
+	 func(arg1))
+
+/*
+ * Functions to return the top of the kernel stack if we are using the
+ * user page-table (and thus not running with the kernel stack). If we
+ * are using the kernel page-table (and so already using the kernel
+ * stack) then it returns NULL.
+ */
+static __always_inline void *pti_kernel_stack(struct pt_regs *regs)
+{
+	unsigned long stack;
+
+	if (pti_enabled() && user_mode(regs)) {
+		stack = (unsigned long)task_top_of_kernel_stack(current);
+		return (void *)(stack - 8);
+	} else {
+		return NULL;
+	}
+}
+
+/*
+ * Wrappers to run an IDT handler on the kernel stack if we are not
+ * already using this stack.
+ */
+static __always_inline
+void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs)
+{
+	CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs);
+}
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -55,7 +94,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
-	__##func (regs);						\
+	run_idt(__##func, regs);					\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
 }									\
@@ -271,7 +310,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	instrumentation_begin();					\
 	__irq_enter_raw();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
-	__##func (regs);						\
+	run_idt(__##func, regs);					\
 	__irq_exit_raw();						\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4102b866e7c0..9407c3cd9355 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2035,7 +2035,7 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
 	unsigned long dr7;

 	dr7 = local_db_save();
-	exc_machine_check_user(regs);
+	run_idt(exc_machine_check_user, regs);
 	local_db_restore(dr7);
 }
 #else
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 09b22a611d99..5161385b3670 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -257,7 +257,7 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 	state = irqentry_enter(regs);
 	instrumentation_begin();
-	handle_invalid_op(regs);
+	run_idt(handle_invalid_op, regs);
 	instrumentation_end();
 	irqentry_exit(regs, state);
 }
@@ -647,7 +647,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 	if (user_mode(regs)) {
 		irqentry_enter_from_user_mode(regs);
 		instrumentation_begin();
-		do_int3_user(regs);
+		run_idt(do_int3_us
[RFC][PATCH v2 19/21] x86/pti: Defer CR3 switch to C code for IST entries
IST entries from the kernel use paranoid entry and exit assembly functions to ensure the CR3 and GS registers are updated with correct values for the kernel. Move the update of the CR3 inside the C code of IST handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 34 ++ arch/x86/kernel/cpu/mce/core.c | 3 +++ arch/x86/kernel/nmi.c | 18 +++--- arch/x86/kernel/sev-es.c | 13 - arch/x86/kernel/traps.c| 30 ++ 5 files changed, 58 insertions(+), 40 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6b88a0eb8975..1715bc0cefff 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -900,23 +900,6 @@ SYM_CODE_START_LOCAL(paranoid_entry) PUSH_AND_CLEAR_REGS save_ret=1 ENCODE_FRAME_POINTER 8 - /* -* Always stash CR3 in %r14. This value will be restored, -* verbatim, at exit. Needed if paranoid_entry interrupted -* another entry that already switched to the user CR3 value -* but has not yet returned to userspace. -* -* This is also why CS (stashed in the "iret frame" by the -* hardware at entry) can not be used: this may be a return -* to kernel code, but with a user CR3 value. -* -* Switching CR3 does not depend on kernel GSBASE so it can -* be done before switching to the kernel GSBASE. This is -* required for FSGSBASE because the kernel GSBASE has to -* be retrieved from a kernel internal table. -*/ - SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14 - /* * Handling GSBASE depends on the availability of FSGSBASE. * @@ -956,9 +939,7 @@ SYM_CODE_START_LOCAL(paranoid_entry) SWAPGS /* -* The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an -* unconditional CR3 write, even in the PTI case. So do an lfence -* to prevent GS speculation, regardless of whether PTI is enabled. +* Do an lfence prevent GS speculation. */ FENCE_SWAPGS_KERNEL_ENTRY @@ -989,14 +970,10 @@ SYM_CODE_END(paranoid_entry) SYM_CODE_START_LOCAL(paranoid_exit) UNWIND_HINT_REGS /* -* The order of operations is important. 
RESTORE_CR3 requires -* kernel GSBASE. -* * NB to anyone to try to optimize this code: this code does * not execute at all for exceptions from user mode. Those * exceptions go through error_exit instead. */ - RESTORE_CR3 scratch_reg=%rax save_reg=%r14 /* Handle the three GSBASE cases */ ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE @@ -1119,10 +1096,6 @@ SYM_CODE_END(error_return) /* * Runs on exception stack. Xen PV does not go through this path at all, * so we can use real assembly here. - * - * Registers: - * %r14: Used to save/restore the CR3 of the interrupted context - * when PAGE_TABLE_ISOLATION is in use. Do not clobber. */ SYM_CODE_START(asm_exc_nmi) /* @@ -1173,7 +1146,7 @@ SYM_CODE_START(asm_exc_nmi) * We also must not push anything to the stack before switching * stacks lest we corrupt the "NMI executing" variable. */ - ist_entry_user exc_nmi + ist_entry_user exc_nmi_user /* NMI from kernel */ @@ -1385,9 +1358,6 @@ end_repeat_nmi: movq$-1, %rsi callexc_nmi - /* Always restore stashed CR3 value (see paranoid_entry) */ - RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 - /* * The above invocation of paranoid_entry stored the GSBASE * related information in R/EBX depending on the availability diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 9407c3cd9355..31ac01c1155d 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -2022,11 +2022,14 @@ static __always_inline void exc_machine_check_user(struct pt_regs *regs) /* MCE hit kernel mode */ DEFINE_IDTENTRY_MCE(exc_machine_check) { + unsigned long saved_cr3; unsigned long dr7; + saved_cr3 = save_and_switch_to_kernel_cr3(); dr7 = local_db_save(); exc_machine_check_kernel(regs); local_db_restore(dr7); + restore_cr3(saved_cr3); } /* The user mode variant. 
*/ diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index be0f654c3095..523d88c3fea1 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state); static DEFINE_PER_CPU(unsigned long, nmi_cr2); static DEFINE_PER_CPU(unsigned long, nmi_dr7); -DEFINE_IDTENTRY_RAW(exc_nmi) +static noinstr void handle_nmi(struct pt_regs *regs) { bool irq_state; @@
[RFC][PATCH v2 09/21] x86/pti: Function to clone page-table entries from a specified mm
PTI has a function to clone page-table entries but only from the init_mm page-table. Provide a new function to clone page-table entries from a specified mm page-table. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/pti.h | 10 ++ arch/x86/mm/pti.c | 32 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 07375b476c4f..5484e69ff8d3 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -4,9 +4,19 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_PAGE_TABLE_ISOLATION + +enum pti_clone_level { + PTI_CLONE_PMD, + PTI_CLONE_PTE, +}; + +struct mm_struct; + extern void pti_init(void); extern void pti_check_boottime_disable(void); extern void pti_finalize(void); +extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level); #else static inline void pti_check_boottime_disable(void) { } #endif diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 1aab92930569..ebc8cd2f1cd8 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void) static void __init pti_setup_vsyscall(void) { } #endif -enum pti_clone_level { - PTI_CLONE_PMD, - PTI_CLONE_PTE, -}; - -static void -pti_clone_pgtable(unsigned long start, unsigned long end, - enum pti_clone_level level) +void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level) { unsigned long addr; @@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end, if (addr < start) break; - pgd = pgd_offset_k(addr); + pgd = pgd_offset(mm, addr); if (WARN_ON(pgd_none(*pgd))) return; p4d = p4d_offset(pgd, addr); @@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end, } } +static void pti_clone_init_pgtable(unsigned long start, unsigned long end, + enum pti_clone_level level) +{ + pti_clone_pgtable(&init_mm, start, end, level); +} + #ifdef 
CONFIG_X86_64 /* * Clone a single p4d (i.e. a top-level entry on 4-level systems and a @@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void) start = CPU_ENTRY_AREA_BASE; end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES); - pti_clone_pgtable(start, end, PTI_CLONE_PMD); + pti_clone_init_pgtable(start, end, PTI_CLONE_PMD); } #endif /* CONFIG_X86_64 */ @@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void) */ static void pti_clone_entry_text(void) { - pti_clone_pgtable((unsigned long) __entry_text_start, - (unsigned long) __entry_text_end, - PTI_CLONE_PMD); + pti_clone_init_pgtable((unsigned long) __entry_text_start, + (unsigned long) __entry_text_end, + PTI_CLONE_PMD); } /* @@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void) * pti_set_kernel_image_nonglobal() did to clear the * global bit. */ - pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); + pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); /* -* pti_clone_pgtable() will set the global bit in any PMDs -* that it clones, but we also need to get any PTEs in +* pti_clone_init_pgtable() will set the global bit in any +* PMDs that it clones, but we also need to get any PTEs in * the last level for areas that are not huge-page-aligned. */ -- 2.18.4
[RFC][PATCH v2 06/21] x86/pti: Provide C variants of PTI switch CR3 macros
Page Table Isolation (PTI) uses assembly macros to switch the CR3 register between kernel and user page-tables. Add C functions which implement the same features. For now, these C functions are not used but they will eventually replace the assembly macros. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/entry-common.h | 127 1 file changed, 127 insertions(+) diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 6fe54b2813c1..46682b1433a4 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -7,6 +7,7 @@ #include #include #include +#include /* Check that the stack and regs on entry from user mode are sane. */ static __always_inline void arch_check_user_regs(struct pt_regs *regs) @@ -81,4 +82,130 @@ static __always_inline void arch_exit_to_user_mode(void) } #define arch_exit_to_user_mode arch_exit_to_user_mode +#ifndef MODULE +#ifdef CONFIG_PAGE_TABLE_ISOLATION + +/* + * PAGE_TABLE_ISOLATION PGDs are 8k. Flip bit 12 to switch between the two + * halves: + */ +#define PTI_USER_PGTABLE_BIT PAGE_SHIFT +#define PTI_USER_PGTABLE_MASK (1 << PTI_USER_PGTABLE_BIT) +#define PTI_USER_PCID_BIT X86_CR3_PTI_PCID_USER_BIT +#define PTI_USER_PCID_MASK (1 << PTI_USER_PCID_BIT) +#define PTI_USER_PGTABLE_AND_PCID_MASK \ + (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK) + +static __always_inline void write_kernel_cr3(unsigned long cr3) +{ + if (static_cpu_has(X86_FEATURE_PCID)) + cr3 |= X86_CR3_PCID_NOFLUSH; + + native_write_cr3(cr3); +} + +static __always_inline void write_user_cr3(unsigned long cr3) +{ + unsigned short mask; + unsigned long asid; + + if (static_cpu_has(X86_FEATURE_PCID)) { + /* +* Test if the ASID needs a flush. 
+*/ + asid = cr3 & 0x7ff; + mask = this_cpu_read(cpu_tlbstate.user_pcid_flush_mask); + if (mask & (1 << asid)) { + /* Flush needed, clear the bit */ + this_cpu_and(cpu_tlbstate.user_pcid_flush_mask, +~(1 << asid)); + } else { + cr3 |= X86_CR3_PCID_NOFLUSH; + } + } + + native_write_cr3(cr3); +} + +static __always_inline void switch_to_kernel_cr3(unsigned long cr3) +{ + /* +* Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3 +* at kernel pagetables. +*/ + write_kernel_cr3(cr3 & ~PTI_USER_PGTABLE_AND_PCID_MASK); +} + +static __always_inline void switch_to_user_cr3(unsigned long cr3) +{ + if (static_cpu_has(X86_FEATURE_PCID)) { + /* Flip the ASID to the user version */ + cr3 |= PTI_USER_PCID_MASK; + } + + /* Flip the PGD to the user version */ + write_user_cr3(cr3 | PTI_USER_PGTABLE_MASK); +} + +static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) +{ + unsigned long cr3; + + if (!static_cpu_has(X86_FEATURE_PTI)) + return 0; + + cr3 = __native_read_cr3(); + if (cr3 & PTI_USER_PGTABLE_MASK) + switch_to_kernel_cr3(cr3); + + return cr3; +} + +static __always_inline void restore_cr3(unsigned long cr3) +{ + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + if (cr3 & PTI_USER_PGTABLE_MASK) { + switch_to_user_cr3(cr3); + } else { + /* +* The CR3 write could be avoided when not changing +* its value, but would require a CR3 read. 
+*/ + write_kernel_cr3(cr3); + } +} + +static __always_inline void user_pagetable_enter(void) +{ + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + switch_to_user_cr3(__native_read_cr3()); +} + +static __always_inline void user_pagetable_exit(void) +{ + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + switch_to_kernel_cr3(__native_read_cr3()); +} + + +#else /* CONFIG_PAGE_TABLE_ISOLATION */ + +static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) +{ + return 0; +} +static __always_inline void restore_cr3(unsigned long cr3) {} + +static __always_inline void user_pagetable_enter(void) {}; +static __always_inline void user_pagetable_exit(void) {}; + +#endif /* CONFIG_PAGE_TABLE_ISOLATION */ +#endif /* MODULE */ + #endif -- 2.18.4
[RFC][PATCH v2 17/21] x86/pti: Execute page fault handler on the kernel stack
After a page fault from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the page fault handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 17 + arch/x86/mm/fault.c | 2 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 0c5d9f027112..a6725afaaec0 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -31,6 +31,13 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ func(arg1, arg2)) +#define CALL_ON_STACK_3(stack, func, arg1, arg2, arg3) \ + ((stack) ? \ +asm_call_on_stack_3(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2), \ + (void *)(arg3)) : \ +func(arg1, arg2, arg3)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack). 
If we @@ -66,6 +73,16 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_idt_pagefault(void (*func)(struct pt_regs *, unsigned long, + unsigned long), + struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + CALL_ON_STACK_3(pti_kernel_stack(regs), + func, regs, error_code, address); +} + static __always_inline void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) { diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 82bf37a5c9ec..b9d03603d95d 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1482,7 +1482,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) state = irqentry_enter(regs); instrumentation_begin(); - handle_page_fault(regs, error_code, address); + run_idt_pagefault(handle_page_fault, regs, error_code, address); instrumentation_end(); irqentry_exit(regs, state); -- 2.18.4
[RFC][PATCH v2 13/21] x86/pti: Execute syscall functions on the kernel stack
During a syscall, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the syscall function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 11 ++- arch/x86/entry/entry_64.S| 1 + arch/x86/include/asm/irq_stack.h | 3 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 7ee15a12c115..1aba02ecb806 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { + unsigned long stack; + if (!sysfunc) return; - regs->ax = sysfunc(regs); + if (!pti_enabled()) { + regs->ax = sysfunc(regs); + return; + } + + stack = (unsigned long)task_top_of_kernel_stack(current); + regs->ax = asm_call_syscall_on_stack((void *)(stack - 8), +sysfunc, regs); } #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 29beab46bedd..6b88a0eb8975 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2) SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) +SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL) /* * Save the frame pointer unconditionally. This allows the ORC * unwinder to handle the stack switch. 
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 359427216336..108d9da7c01c 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -5,6 +5,7 @@ #include #include +#include #ifdef CONFIG_X86_64 static __always_inline bool irqstack_active(void) @@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), struct irq_desc *desc); +long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func, + struct pt_regs *regs); static __always_inline void __run_on_irqstack(void (*func)(void)) { -- 2.18.4
[RFC][PATCH v2 15/21] x86/pti: Execute IDT handlers with error code on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leak to the userland through the stack. Changes IDT handlers which have an error code. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 18 -- arch/x86/kernel/traps.c | 2 +- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 3595a31947b3..a82e31b45442 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1)) : \ func(arg1)) +#define CALL_ON_STACK_2(stack, func, arg1, arg2) \ + ((stack) ? \ +asm_call_on_stack_2(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ +func(arg1, arg2)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack). 
If we @@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); } +static __always_inline +void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), +struct pt_regs *regs, unsigned long error_code) +{ + CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs, \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs, error_code);\ + run_idt_errcode(__##func, regs, error_code);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs, \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs, (u8)error_code);\ + run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \ + regs, (u8)error_code); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 5161385b3670..9a51aa016fb3 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug) /* User entry, runs on regular task stack */ DEFINE_IDTENTRY_DEBUG_USER(exc_debug) { - exc_debug_user(regs, debug_read_clear_dr6()); + run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6()); } #else /* 32 bit does not have separate entry points. */ -- 2.18.4
[RFC][PATCH v2 16/21] x86/pti: Execute system vector handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. This changes system vector handlers to execute on the kernel stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a82e31b45442..0c5d9f027112 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) +{ + void *stack = pti_kernel_stack(regs); + + if (stack) + asm_call_on_stack_1(stack, (void (*)(void))func, regs); + else + run_sysvec_on_irqstack_cond(func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - run_sysvec_on_irqstack_cond(__##func, regs);\ + run_sysvec(__##func, regs); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ -- 2.18.4
[RFC][PATCH v2 10/21] x86/pti: Function to map per-cpu page-table entry
Wrap the code used by PTI to map a per-cpu page-table entry into a new function so that this code can be re-used to map other per-cpu entries. Signed-off-by: Alexandre Chartre --- arch/x86/mm/pti.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index ebc8cd2f1cd8..71ca245d7b38 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr) *user_p4d = *kernel_p4d; } +/* + * Clone a single percpu page. + */ +static void __init pti_clone_percpu_page(void *addr) +{ + phys_addr_t pa = per_cpu_ptr_to_phys(addr); + pte_t *target_pte; + + target_pte = pti_user_pagetable_walk_pte((unsigned long)addr); + if (WARN_ON(!target_pte)) + return; + + *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); +} + /* * Clone the CPU_ENTRY_AREA and associated data into the user space visible * page table. @@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void) * This is done for all possible CPUs during boot to ensure * that it's propagated to all mms. */ + pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); - unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu); - phys_addr_t pa = per_cpu_ptr_to_phys((void *)va); - pte_t *target_pte; - - target_pte = pti_user_pagetable_walk_pte(va); - if (WARN_ON(!target_pte)) - return; - - *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); } } -- 2.18.4
[RFC][PATCH v2 02/21] x86/entry: Update asm_call_on_stack to support more function arguments
Update the asm_call_on_stack() function so that it can be invoked with a function having up to three arguments instead of only one. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 15 +++ arch/x86/include/asm/irq_stack.h | 8 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index cad08703c4ad..c42948aca0a8 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -759,9 +759,14 @@ SYM_CODE_END(.Lbad_gs) /* * rdi: New stack pointer points to the top word of the stack * rsi: Function pointer - * rdx: Function argument (can be NULL if none) + * rdx: Function argument 1 (can be NULL if none) + * rcx: Function argument 2 (can be NULL if none) + * r8 : Function argument 3 (can be NULL if none) */ SYM_FUNC_START(asm_call_on_stack) +SYM_FUNC_START(asm_call_on_stack_1) +SYM_FUNC_START(asm_call_on_stack_2) +SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) /* @@ -777,15 +782,17 @@ SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) */ mov %rsp, (%rdi) mov %rdi, %rsp - /* Move the argument to the right place */ + mov %rsi, %rax + /* Move arguments to the right place */ mov %rdx, %rdi - + mov %rcx, %rsi + mov %r8, %rdx 1: .pushsection .discard.instr_begin .long 1b - . 
.popsection - CALL_NOSPEC rsi + CALL_NOSPEC rax 2: .pushsection .discard.instr_end diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 775816965c6a..359427216336 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -13,6 +13,14 @@ static __always_inline bool irqstack_active(void) } void asm_call_on_stack(void *sp, void (*func)(void), void *arg); + +void asm_call_on_stack_1(void *sp, void (*func)(void), +void *arg1); +void asm_call_on_stack_2(void *sp, void (*func)(void), +void *arg1, void *arg2); +void asm_call_on_stack_3(void *sp, void (*func)(void), +void *arg1, void *arg2, void *arg3); + void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), -- 2.18.4
[RFC][PATCH v2 21/21] x86/pti: Use a different stack canary with the user and kernel page-table
Using stack protector requires the stack canary to be mapped into the current page-table. Now that the page-table switch between the user and kernel page-table is deferred to C code, stack protector can be used while the user page-table is active and so the stack canary is mapped into the user page-table. To prevent leaking the stack canary used with the kernel page-table, use a different canary with the user and kernel page-table. The stack canary is changed when switching the page-table. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/entry-common.h | 56 ++- arch/x86/include/asm/stackprotector.h | 35 +++-- arch/x86/kernel/sev-es.c | 18 + include/linux/sched.h | 8 kernel/fork.c | 3 ++ 5 files changed, 107 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index e01735a181b8..5b4d0e3237a3 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -96,6 +96,52 @@ static __always_inline void arch_exit_to_user_mode(void) #define PTI_USER_PGTABLE_AND_PCID_MASK \ (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK) +/* + * Functions to set the stack canary to the kernel or user value: + * + * The kernel stack canary should be used when running with the kernel + * page-table, and the user stack canary should be used when running + * with the user page-table. Also the kernel stack canary should not + * leak to the user page-table. + * + * So the stack canary should be set to the kernel value when entering + * the kernel from userspace *after* switching to the kernel page-table. + * And the stack canary should be set to the user value when returning + * to userspace *before* switching to the user page-table. + * + * In both cases, there is a window (between the page-table switch and + * the stack canary setting) where we will be running with the kernel + * page-table and the user stack canary. 
This window should be as small + * as possible and, ideally, it should: + * - not call functions which require the stack protector to be used; + * - have interrupt disabled to prevent interrupt handlers from being + * processed with the user stack canary (but there is nothing we can + * do for NMIs). + */ +static __always_inline void set_stack_canary_kernel(void) +{ + this_cpu_write(fixed_percpu_data.stack_canary, + current->stack_canary); +} + +static __always_inline void set_stack_canary_user(void) +{ + this_cpu_write(fixed_percpu_data.stack_canary, + current->stack_canary_user); +} + +static __always_inline void switch_to_kernel_stack_canary(unsigned long cr3) +{ + if (cr3 & PTI_USER_PGTABLE_MASK) + set_stack_canary_kernel(); +} + +static __always_inline void restore_stack_canary(unsigned long cr3) +{ + if (cr3 & PTI_USER_PGTABLE_MASK) + set_stack_canary_user(); +} + static __always_inline void write_kernel_cr3(unsigned long cr3) { if (static_cpu_has(X86_FEATURE_PCID)) @@ -155,8 +201,10 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) return 0; cr3 = __native_read_cr3(); - if (cr3 & PTI_USER_PGTABLE_MASK) + if (cr3 & PTI_USER_PGTABLE_MASK) { switch_to_kernel_cr3(cr3); + set_stack_canary_kernel(); + } return cr3; } @@ -167,6 +215,7 @@ static __always_inline void restore_cr3(unsigned long cr3) return; if (cr3 & PTI_USER_PGTABLE_MASK) { + set_stack_canary_user(); switch_to_user_cr3(cr3); } else { /* @@ -182,6 +231,7 @@ static __always_inline void user_pagetable_enter(void) if (!static_cpu_has(X86_FEATURE_PTI)) return; + set_stack_canary_user(); switch_to_user_cr3(__native_read_cr3()); } @@ -191,6 +241,7 @@ static __always_inline void user_pagetable_exit(void) return; switch_to_kernel_cr3(__native_read_cr3()); + set_stack_canary_kernel(); } static __always_inline void user_pagetable_return(struct pt_regs *regs) @@ -218,6 +269,9 @@ static __always_inline void user_pagetable_exit(void) {}; static __always_inline void 
user_pagetable_return(struct pt_regs *regs) {}; static __always_inline void user_pagetable_escape(struct pt_regs *regs) {}; +static __always_inline void switch_to_kernel_stack_canary(unsigned long cr3) {} +static __always_inline void restore_stack_canary(unsigned long cr3) {} + #endif /* CONFIG_PAGE_TABLE_ISOLATION */ #endif /* MODULE */ diff --git a/arch/x86/include/asm/stackprotector.h b/arch/x86/include/asm/stackprotector.h index 7fb482f0f25b..be6c051bafe3 100644 --- a/arch/x86/include/asm/stackprotector.h +++ b/arch/x86/include/asm/stackprotector.h @@ -52,6 +52,25 @@ #de
[RFC][PATCH v2 11/21] x86/pti: Extend PTI user mappings
Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, per cpu offsets (__per_cpu_offset, which is used some in entry code), the stack canary, and the PTI stack (which is defined per task). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 2 -- arch/x86/mm/pti.c | 19 +++ kernel/fork.c | 22 ++ 3 files changed, 41 insertions(+), 2 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6e0b5b010e0b..458af12ed9a1 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -274,7 +274,6 @@ SYM_FUNC_END(__switch_to_asm) * rbx: kernel thread func (NULL for user thread) * r12: kernel thread arg */ -.pushsection .text, "ax" SYM_CODE_START(ret_from_fork) UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs */ @@ -284,7 +283,6 @@ SYM_CODE_START(ret_from_fork) callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode SYM_CODE_END(ret_from_fork) -.popsection .macro DEBUG_ENTRY_ASSERT_IRQS_OFF #ifdef CONFIG_DEBUG_ENTRY diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 71ca245d7b38..e4c6cb4a4840 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -449,6 +449,7 @@ static void __init pti_clone_percpu_page(void *addr) */ static void __init pti_clone_user_shared(void) { + unsigned long start, end; unsigned int cpu; pti_clone_p4d(CPU_ENTRY_AREA_BASE); @@ -465,7 +466,16 @@ static void __init pti_clone_user_shared(void) */ pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); + /* +* Map fixed_percpu_data to get the stack canary. 
+*/ + if (IS_ENABLED(CONFIG_STACKPROTECTOR)) + pti_clone_percpu_page(&per_cpu(fixed_percpu_data, cpu)); } + + start = (unsigned long)__per_cpu_offset; + end = start + sizeof(__per_cpu_offset); + pti_clone_init_pgtable(start, end, PTI_CLONE_PTE); } #else /* CONFIG_X86_64 */ @@ -505,6 +515,15 @@ static void pti_clone_entry_text(void) pti_clone_init_pgtable((unsigned long) __entry_text_start, (unsigned long) __entry_text_end, PTI_CLONE_PMD); + + /* + * Syscall and interrupt entry code (which is in the noinstr + * section) will be entered with the user page-table, so that + * code has to be mapped in. + */ + pti_clone_init_pgtable((unsigned long) __noinstr_text_start, + (unsigned long) __noinstr_text_end, + PTI_CLONE_PMD); } /* diff --git a/kernel/fork.c b/kernel/fork.c index 6d266388d380..31cd77dbdba3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -999,6 +999,25 @@ static void mm_init_uprobes_state(struct mm_struct *mm) #endif } +static void mm_map_task(struct mm_struct *mm, struct task_struct *tsk) +{ +#ifdef CONFIG_PAGE_TABLE_ISOLATION + unsigned long addr; + + if (!tsk || !static_cpu_has(X86_FEATURE_PTI)) + return; + + /* +* Map the task stack after the kernel stack into the user +* address space, so that this stack can be used when entering +* syscall or interrupt from user mode. 
+*/ + BUG_ON(!task_stack_page(tsk)); + addr = (unsigned long)task_top_of_kernel_stack(tsk); + pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE); +#endif +} + static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns) { @@ -1043,6 +1062,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (init_new_context(p, mm)) goto fail_nocontext; + mm_map_task(mm, p); + mm->user_ns = get_user_ns(user_ns); return mm; @@ -1404,6 +1425,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk) vmacache_flush(tsk); if (clone_flags & CLONE_VM) { + mm_map_task(oldmm, tsk); mmget(oldmm); mm = oldmm; goto good_mm; -- 2.18.4
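The mappings added above (per-cpu pages, the top half of the task stack) all come down to handing a page-aligned virtual address range to a pti_clone_* helper. The sketch below is a userspace illustration only — the helper name is hypothetical and 4K pages are assumed — of the range computation such a clone would need:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Hypothetical stand-in for the address math behind calls like
 * pti_clone_percpu_page(): given an object's address and size, compute
 * the page-aligned [start, end) range that would have to be mapped
 * into the user page-table. */
static void pti_clone_object_range(uintptr_t addr, size_t size,
                                   uintptr_t *start, uintptr_t *end)
{
    *start = addr & PAGE_MASK;                        /* round down */
    *end = (addr + size + PAGE_SIZE - 1) & PAGE_MASK; /* round up   */
}
```

An object that straddles a page boundary simply widens the cloned range to cover every page it touches.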
[RFC][PATCH v2 18/21] x86/pti: Execute NMI handler on the kernel stack
After an NMI from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the NMI handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to the userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/kernel/nmi.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 4bc77aaf1303..be0f654c3095 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi) inc_irq_stat(__nmi_count); - if (!ignore_nmis) - default_do_nmi(regs); + if (!ignore_nmis) { + if (user_mode(regs)) { + /* +* If we come from userland then we are on the +* trampoline stack, switch to the kernel stack +* to execute the NMI handler. +*/ + run_idt(default_do_nmi, regs); + } else { + default_do_nmi(regs); + } + } idtentry_exit_nmi(regs, irq_state); -- 2.18.4
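The control flow of the hunk above can be modeled in plain C: the CPL bits of the saved CS decide whether the handler must be re-run on the kernel stack. This is only a userspace sketch — run_idt() here merely records the dispatch, whereas the real helper switches RSP to the kernel stack before invoking the handler:

```c
#include <assert.h>
#include <stdbool.h>

struct pt_regs { unsigned long cs; };

static int ignore_nmis;
static int calls_on_kernel_stack, calls_in_place;

/* The low two bits of CS hold the CPL: 3 means we came from userland. */
static bool user_mode(const struct pt_regs *regs)
{
    return (regs->cs & 3) == 3;
}

static void default_do_nmi(struct pt_regs *regs) { (void)regs; }

/* Toy stand-in for run_idt(): only counts the stack-switching path. */
static void run_idt(void (*handler)(struct pt_regs *), struct pt_regs *regs)
{
    calls_on_kernel_stack++;
    handler(regs);
}

static void exc_nmi(struct pt_regs *regs)
{
    if (!ignore_nmis) {
        if (user_mode(regs)) {
            /* entered on the trampoline/PTI stack */
            run_idt(default_do_nmi, regs);
        } else {
            /* already on a kernel-only stack */
            calls_in_place++;
            default_do_nmi(regs);
        }
    }
}
```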
[RFC][PATCH v2 05/21] x86/entry: Implement ret_from_fork body with C code
ret_from_fork is a mix of assembly code and calls to C functions. Re-implement ret_from_fork so that it calls a single C function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 18 ++ arch/x86/entry/entry_64.S | 28 +--- 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d12908ad..7ee15a12c115 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,24 @@ #include #include +__visible noinstr void return_from_fork(struct pt_regs *regs, + struct task_struct *prev, + void (*kfunc)(void *), void *kargs) +{ + schedule_tail(prev); + if (kfunc) { + /* kernel thread */ + kfunc(kargs); + /* +* A kernel thread is allowed to return here after +* successfully calling kernel_execve(). Exit to +* userspace to complete the execve() syscall. +*/ + regs->ax = 0; + } + syscall_exit_to_user_mode(regs); +} + static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 274384644b5e..73e9cd47dc83 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -276,31 +276,13 @@ SYM_FUNC_END(__switch_to_asm) */ .pushsection .text, "ax" SYM_CODE_START(ret_from_fork) - UNWIND_HINT_EMPTY - movq%rax, %rdi - callschedule_tail /* rdi: 'prev' task parameter */ - - testq %rbx, %rbx /* from kernel_thread? 
*/ - jnz 1f /* kernel threads are uncommon */ - -2: UNWIND_HINT_REGS - movq%rsp, %rdi - callsyscall_exit_to_user_mode /* returns with IRQs disabled */ + movq%rsp, %rdi /* pt_regs */ + movq%rax, %rsi /* 'prev' task parameter */ + movq%rbx, %rdx /* kernel thread func */ + movq%r12, %rcx /* kernel thread arg */ + callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode - -1: - /* kernel thread */ - UNWIND_HINT_EMPTY - movq%r12, %rdi - CALL_NOSPEC rbx - /* -* A kernel thread is allowed to return here after successfully -* calling kernel_execve(). Exit to userspace to complete the execve() -* syscall. -*/ - movq$0, RAX(%rsp) - jmp 2b SYM_CODE_END(ret_from_fork) .popsection -- 2.18.4
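The new C body is small enough to exercise outside the kernel. The sketch below mirrors the patch's return_from_fork() with stubbed-out schedule_tail() and syscall_exit_to_user_mode() (both reduced to markers here, so this is an illustration, not the kernel implementation): a non-NULL kfunc marks a kernel thread, and returning from kfunc() corresponds to a successful kernel_execve():

```c
#include <assert.h>
#include <stddef.h>

struct pt_regs { long ax; };
struct task_struct { int dummy; };

static int tail_called, exits_to_user;

/* Stubs standing in for the real scheduler/entry functions. */
static void schedule_tail(struct task_struct *prev)
{
    (void)prev;
    tail_called++;
}

static void syscall_exit_to_user_mode(struct pt_regs *regs)
{
    (void)regs;
    exits_to_user++;
}

static void return_from_fork(struct pt_regs *regs, struct task_struct *prev,
                             void (*kfunc)(void *), void *kargs)
{
    schedule_tail(prev);
    if (kfunc) {
        kfunc(kargs);  /* kernel thread body */
        regs->ax = 0;  /* execve() success shows up as syscall return 0 */
    }
    syscall_exit_to_user_mode(regs);
}

/* Toy kernel-thread body for the test below. */
static void kthread_body(void *arg) { *(int *)arg = 1; }
```

Both the user-thread path (kfunc == NULL) and the kernel-thread path end in the same exit-to-usermode call, which is what lets the assembly collapse to a single call.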
[RFC][PATCH v2 04/21] x86/sev-es: Define a setup stack function for the VC idtentry
The #VC exception assembly entry code uses C code (vc_switch_off_ist) to get and configure a stack, then returns to assembly to switch to that stack and finally invokes the C exception handler function. To pave the way for deferring the CR3 switch from assembly to C code, define a setup stack function for the VC idtentry. This function is used to get and configure the stack before invoking the idtentry handler. For now, the setup stack function is just a wrapper around the vc_switch_off_ist() function but it will eventually also contain the C code to switch CR3. The vc_switch_off_ist() function is also refactored to just return the stack pointer, and the stack configuration is done in the setup stack function (so that the stack can also be used to propagate CR3 switch information to the idtentry handler for switching CR3 back). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 8 +++- arch/x86/include/asm/idtentry.h | 14 ++ arch/x86/include/asm/traps.h| 2 +- arch/x86/kernel/sev-es.c| 34 + arch/x86/kernel/traps.c | 19 +++--- 5 files changed, 55 insertions(+), 22 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 51df9f1871c6..274384644b5e 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -546,13 +546,11 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_REGS /* -* Switch off the IST stack to make it free for nested exceptions. The -* vc_switch_off_ist() function will switch back to the interrupted -* stack if it is safe to do so. If not it switches to the VC fall-back -* stack. +* Call the setup stack function. It configures and returns +* the stack we should be using to run the exception handler.
*/ movq%rsp, %rdi /* pt_regs pointer */ - callvc_switch_off_ist + callsetup_stack_\cfunc movq%rax, %rsp /* Switch to new stack */ UNWIND_HINT_REGS diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index b2442eb0ac2f..4b4aca2b1420 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -318,6 +318,7 @@ static __always_inline void __##func(struct pt_regs *regs) */ #define DECLARE_IDTENTRY_VC(vector, func) \ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func); \ + __visible noinstr unsigned long setup_stack_##func(struct pt_regs *regs); \ __visible noinstr void ist_##func(struct pt_regs *regs, unsigned long error_code); \ __visible noinstr void safe_stack_##func(struct pt_regs *regs, unsigned long error_code) @@ -380,6 +381,19 @@ static __always_inline void __##func(struct pt_regs *regs) #define DEFINE_IDTENTRY_VC_IST(func) \ DEFINE_IDTENTRY_RAW_ERRORCODE(ist_##func) +/** + * DEFINE_IDTENTRY_VC_SETUP_STACK - Emit code for setting up the stack to + run the VMM communication handler + * @func: Function name of the entry point + * + * The stack setup code is executed before the VMM communication handler. + * It configures and returns the stack to switch to before running the + * VMM communication handler. 
+ */ +#define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ + __visible noinstr \ + unsigned long setup_stack_##func(struct pt_regs *regs) + /** * DEFINE_IDTENTRY_VC - Emit code for VMM communication handler * @func: Function name of the entry point diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 7f7200021bd1..cfcc9d34d2a0 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -15,7 +15,7 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); asmlinkage __visible notrace struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s); void __init trap_init(void); -asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs); +asmlinkage __visible noinstr unsigned long vc_switch_off_ist(struct pt_regs *eregs); #endif #ifdef CONFIG_X86_F00F_BUG diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 0bd1a0fc587e..bd977c917cd6 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -1349,6 +1349,40 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication) instrumentation_end(); } +struct exc_vc_frame { + /* pt_regs should be first */ + struct pt_regs regs; +}; + +DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication) +{ + struct exc_vc_frame *frame; + unsigned long sp; + + /* +* Switch off the IST stack to make it free for nested exceptions. +* The vc_switch_off_ist() function will switch back to the +* interrupted stack if
[RFC][PATCH v2 08/21] x86/pti: Introduce per-task PTI trampoline stack
Double the size of the kernel stack when using PTI. The entire stack is mapped into the kernel address space, and the top half of the stack (the PTI stack) is also mapped into the user address space. The PTI stack will be used as a per-task trampoline stack instead of the current per-cpu trampoline stack. This will allow running more code on the trampoline stack, in particular code that schedules the task out. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/page_64_types.h | 36 +++- arch/x86/include/asm/processor.h | 3 +++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 3f49dac03617..733accc20fdb 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -12,7 +12,41 @@ #define KASAN_STACK_ORDER 0 #endif -#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) +#ifdef CONFIG_PAGE_TABLE_ISOLATION +/* + * PTI doubles the size of the stack. The entire stack is mapped into + * the kernel address space. However, only the top half of the stack is + * mapped into the user address space. + * + * On syscall or interrupt, user mode enters the kernel with the user + * page-table, and the stack pointer is switched to the top of the + * stack (which is mapped in the user address space and in the kernel). + * The syscall/interrupt handler will then later decide when to switch + * to the kernel address space, and to switch to the top of the kernel + * stack which is only mapped in the kernel. 
+ *
+ * +-------------+
+ * |             |  ^                       ^
+ * | kernel-only |  | KERNEL_STACK_SIZE     |
+ * |    stack    |  |                       |
+ * |             |  V                       |
+ * +-------------+ <- top of kernel stack   | THREAD_SIZE
+ * |             |  ^                       |
+ * | kernel and  |  | KERNEL_STACK_SIZE     |
+ * |  PTI stack  |  |                       |
+ * |             |  V                       v
+ * +-------------+ <- top of stack
+ */
+#define PTI_STACK_ORDER 1
+#else
+#define PTI_STACK_ORDER 0
+#endif
+
+#define KERNEL_STACK_ORDER 2
+#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER)
+
+#define THREAD_SIZE_ORDER \
+	(KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER)
 #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

 #define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..47b1b806535b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x)
 #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))

+#define task_top_of_kernel_stack(task) \
+	((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE))
+
 #define task_pt_regs(task) \
 ({ \
 	unsigned long __ptr = (unsigned long)task_stack_page(task); \
-- 2.18.4
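With 4K pages and the orders above (PTI enabled, KASAN off), the size arithmetic can be checked directly. The snippet below just restates the patch's macros in userspace to make the doubling visible:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define KASAN_STACK_ORDER 0   /* assume CONFIG_KASAN=n */
#define PTI_STACK_ORDER 1     /* assume CONFIG_PAGE_TABLE_ISOLATION=y */
#define KERNEL_STACK_ORDER 2
#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER)

#define THREAD_SIZE_ORDER \
	(KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

/* Mirror of the patch's task_top_of_kernel_stack(): the kernel-only
 * region sits KERNEL_STACK_SIZE bytes above the base of the stack area. */
static unsigned long task_top_of_kernel_stack(unsigned long stack_page)
{
    return stack_page + KERNEL_STACK_SIZE;
}
```

So with PTI on, the task stack area grows from 16K to 32K, and only the upper 16K (above task_top_of_kernel_stack) stays kernel-only.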
[RFC][PATCH v2 01/21] x86/syscall: Add wrapper for invoking syscall function
Add a wrapper function for invoking a syscall function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 870efeec8bda..d12908ad 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,15 @@ #include #include +static __always_inline void run_syscall(sys_call_ptr_t sysfunc, + struct pt_regs *regs) +{ + if (!sysfunc) + return; + + regs->ax = sysfunc(regs); +} + #ifdef CONFIG_X86_64 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) { @@ -43,15 +52,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) instrumentation_begin(); if (likely(nr < NR_syscalls)) { nr = array_index_nospec(nr, NR_syscalls); - regs->ax = sys_call_table[nr](regs); + run_syscall(sys_call_table[nr], regs); #ifdef CONFIG_X86_X32_ABI } else if (likely((nr & __X32_SYSCALL_BIT) && (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) { nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT, X32_NR_syscalls); - regs->ax = x32_sys_call_table[nr](regs); + run_syscall(x32_sys_call_table[nr], regs); #endif } + instrumentation_end(); syscall_exit_to_user_mode(regs); } @@ -75,7 +85,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs, if (likely(nr < IA32_NR_syscalls)) { instrumentation_begin(); nr = array_index_nospec(nr, IA32_NR_syscalls); - regs->ax = ia32_sys_call_table[nr](regs); + run_syscall(ia32_sys_call_table[nr], regs); instrumentation_end(); } } -- 2.18.4
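The wrapper's value is the NULL guard: a missing table entry no longer dereferences a NULL function pointer, and regs->ax is left untouched. A self-contained userspace model of the dispatch (table contents and register layout are toy stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct pt_regs { long ax; long di; };
typedef long (*sys_call_ptr_t)(struct pt_regs *);

/* Toy "syscall": returns twice its first argument. */
static long sys_double(struct pt_regs *regs) { return 2 * regs->di; }

/* Sparse table: entry 1 is deliberately NULL. */
static const sys_call_ptr_t sys_call_table[] = { sys_double, NULL };
#define NR_syscalls (sizeof(sys_call_table) / sizeof(sys_call_table[0]))

/* Mirror of the patch's run_syscall(): tolerate a NULL table entry
 * and leave regs->ax untouched in that case. */
static void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs)
{
    if (!sysfunc)
        return;
    regs->ax = sysfunc(regs);
}

static void do_syscall(unsigned long nr, struct pt_regs *regs)
{
    if (nr < NR_syscalls)
        run_syscall(sys_call_table[nr], regs);
}
```

(The real do_syscall_64 additionally clamps the index with array_index_nospec() before the table load, which this sketch omits.)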
[RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
Version 2 addressing comments from Andy:

- paranoid_entry/exit is back to assembly code. This avoids having a C version of SWAPGS and the need to disable stack-protector. (remove patches 8, 9, 21 from v1).
- SAVE_AND_SWITCH_TO_KERNEL_CR3 and RESTORE_CR3 are removed from paranoid_entry/exit and moved to C (patch 19).
- __per_cpu_offset is mapped into the user page-table (patch 11) so that paranoid_entry can update GS before CR3 is switched.
- use a different stack canary with the user and kernel page-tables. This is a new patch in v2 so as not to leak the kernel stack canary in the user page-table (patch 21).

Patches are now based on v5.10-rc4.

With Page Table Isolation (PTI), syscalls as well as interrupts and exceptions occurring in userspace enter the kernel with a user page-table. The kernel entry code will then switch the page-table from the user page-table to the kernel page-table by updating the CR3 control register. This CR3 switch is currently done early in the kernel entry sequence using assembly code.

This RFC proposes to defer the PTI CR3 switch until we reach C code. The benefit is that this simplifies the assembly entry code and makes the PTI CR3 switch code easier to understand. This also paves the way for further possible projects such as an easier integration of Address Space Isolation (ASI), or the possibility to execute some selected syscall or interrupt handlers without switching to the kernel page-table (and thus avoiding the PTI page-table switch overhead).

Deferring the CR3 switch to C code means that we need to run more of the kernel entry code with the user page-table.
To do so, we need to:

- map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code);
- map additional data used in the entry code (such as the stack canary);
- run more entry code on the trampoline stack (which is mapped both in the kernel and in the user page-table) until we switch to the kernel page-table and then switch to the kernel stack;
- have a per-task trampoline stack instead of a per-cpu trampoline stack, so the task can be scheduled out while it hasn't switched to the kernel stack.

Note that, for now, the CR3 switch can only be pushed as far as interrupts remain disabled in the entry code. This is because the CR3 switch is done based on the privilege level in the CS register of the interrupt frame. I plan to fix this but that's some extra complication (need to track whether the user page-table is used or not).

The proposed patchset is in RFC state to get early feedback about this proposal. The code survives running a kernel build and LTP. Note that changes are only for 64-bit at the moment; I haven't looked at 32-bit yet but I will definitely check it.

Patches are based on v5.10-rc4.

Thanks,

alex.
----- Alexandre Chartre (21): x86/syscall: Add wrapper for invoking syscall function x86/entry: Update asm_call_on_stack to support more function arguments x86/entry: Consolidate IST entry from userspace x86/sev-es: Define a setup stack function for the VC idtentry x86/entry: Implement ret_from_fork body with C code x86/pti: Provide C variants of PTI switch CR3 macros x86/entry: Fill ESPFIX stack using C code x86/pti: Introduce per-task PTI trampoline stack x86/pti: Function to clone page-table entries from a specified mm x86/pti: Function to map per-cpu page-table entry x86/pti: Extend PTI user mappings x86/pti: Use PTI stack instead of trampoline stack x86/pti: Execute syscall functions on the kernel stack x86/pti: Execute IDT handlers on the kernel stack x86/pti: Execute IDT handlers with error code on the kernel stack x86/pti: Execute system vector handlers on the kernel stack x86/pti: Execute page fault handler on the kernel stack x86/pti: Execute NMI handler on the kernel stack x86/pti: Defer CR3 switch to C code for IST entries x86/pti: Defer CR3 switch to C code for non-IST and syscall entries x86/pti: Use a different stack canary with the user and kernel page-table arch/x86/entry/common.c | 58 - arch/x86/entry/entry_64.S | 346 +++--- arch/x86/entry/entry_64_compat.S | 22 -- arch/x86/include/asm/entry-common.h | 194 +++ arch/x86/include/asm/idtentry.h | 130 +- arch/x86/include/asm/irq_stack.h | 11 + arch/x86/include/asm/page_64_types.h | 36 ++- arch/x86/include/asm/processor.h | 3 + arch/x86/include/asm/pti.h| 18 ++ arch/x86/include/asm/stackprotector.h | 35 ++- arch/x86/include/asm/switch_to.h | 7 +- arch/x86/include/asm/traps.h | 2 +- arch/x86/kernel/cpu/mce/core.c| 7 +- arch/x86/kernel/espfix_64.c | 41 +++ arch/x86/kernel/nmi.c | 34 ++- arch/x86/kernel/sev-es.c | 63 + arch/x86/kernel/traps.c | 61 +++-- arch/x86/mm/fault.c | 11
[RFC][PATCH v2 07/21] x86/entry: Fill ESPFIX stack using C code
The ESPFIX stack is filled using assembly code. Move this code to a C function so that it is easier to read and modify. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 62 ++--- arch/x86/kernel/espfix_64.c | 41 2 files changed, 72 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 73e9cd47dc83..6e0b5b010e0b 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -684,8 +684,10 @@ native_irq_return_ldt: * long (see ESPFIX_STACK_SIZE). espfix_waddr points to the bottom * of the ESPFIX stack. * -* We clobber RAX and RDI in this code. We stash RDI on the -* normal stack and RAX on the ESPFIX stack. +* We call into C code to fill the ESPFIX stack. We stash registers +* that the C function can clobber on the normal stack. The user RAX +* is stashed first so that it is adjacent to the iret frame which +* will be copied to the ESPFIX stack. * * The ESPFIX stack layout we set up looks like this: * @@ -699,39 +701,37 @@ native_irq_return_ldt: * --- bottom of ESPFIX stack --- */ - pushq %rdi/* Stash user RDI */ - SWAPGS /* to kernel GS */ - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi /* to kernel CR3 */ - - movqPER_CPU_VAR(espfix_waddr), %rdi - movq%rax, (0*8)(%rdi) /* user RAX */ - movq(1*8)(%rsp), %rax /* user RIP */ - movq%rax, (1*8)(%rdi) - movq(2*8)(%rsp), %rax /* user CS */ - movq%rax, (2*8)(%rdi) - movq(3*8)(%rsp), %rax /* user RFLAGS */ - movq%rax, (3*8)(%rdi) - movq(5*8)(%rsp), %rax /* user SS */ - movq%rax, (5*8)(%rdi) - movq(4*8)(%rsp), %rax /* user RSP */ - movq%rax, (4*8)(%rdi) - /* Now RAX == RSP. */ - - andl$0x, %eax /* RAX = (RSP & 0x) */ + /* save registers */ + pushq %rax + pushq %rdi + pushq %rsi + pushq %rdx + pushq %rcx + pushq %r8 + pushq %r9 + pushq %r10 + pushq %r11 /* -* espfix_stack[31:16] == 0. The page tables are set up such that -* (espfix_stack | (X & 0x)) points to a read-only alias of -* espfix_waddr for any X. 
That is, there are 65536 RO aliases of -* the same page. Set up RSP so that RSP[31:16] contains the -* respective 16 bits of the /userspace/ RSP and RSP nonetheless -* still points to an RO alias of the ESPFIX stack. +* fill_espfix_stack will copy the iret+rax frame to the ESPFIX +* stack and return with RAX containing a pointer to the ESPFIX +* stack. */ - orq PER_CPU_VAR(espfix_stack), %rax + leaq8*8(%rsp), %rdi /* points to the iret+rax frame */ + callfill_espfix_stack - SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi - SWAPGS /* to user GS */ - popq%rdi/* Restore user RDI */ + /* +* RAX contains a pointer to the ESPFIX, so restore registers but +* RAX. RAX will be restored from the ESPFIX stack. +*/ + popq%r11 + popq%r10 + popq%r9 + popq%r8 + popq%rcx + popq%rdx + popq%rsi + popq%rdi movq%rax, %rsp UNWIND_HINT_IRET_REGS offset=8 diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index 4fe7af58cfe1..ff4b5160b39c 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -33,6 +33,7 @@ #include #include #include +#include /* * Note: we only need 6*8 = 48 bytes for the espfix stack, but round @@ -205,3 +206,43 @@ void init_espfix_ap(int cpu) per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page + (addr & ~PAGE_MASK); } + +/* + * iret frame with an additional user_rax register. + */ +struct iret_rax_frame { + unsigned long user_rax; + unsigned long rip; + unsigned long cs; + unsigned long rflags; + unsigned long rsp; + unsigned long ss; +}; + +noinstr unsigned long fill_espfix_stack(struct iret_rax_frame *frame) +{ + struct iret_rax_frame *espfix_frame; + unsigned long rsp; + + native_swapgs(); + user_pagetable_exit(); + + espfix_frame = (struct iret_rax_frame *)this_cpu_read(espfix_waddr); + *espfix_frame = *frame; + + /* +* espfix_stack[31:16] == 0. The page tables are set up such that +* (espfix_stack | (X & 0x)) points to a re
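Moving the fill into C lets a single struct assignment replace the six movq pairs of the old assembly. A minimal userspace sketch of just that copy pattern (the real fill_espfix_stack() additionally swaps GS, switches CR3, and returns the mangled ESPFIX stack pointer, none of which is modeled here):

```c
#include <assert.h>
#include <string.h>

/* Same layout as the patch's iret_rax_frame: the saved user RAX sits
 * directly below the hardware iret frame on the stack. */
struct iret_rax_frame {
    unsigned long user_rax;
    unsigned long rip;
    unsigned long cs;
    unsigned long rflags;
    unsigned long rsp;
    unsigned long ss;
};

/* One struct assignment does what the old asm did field by field;
 * 'waddr' stands in for the per-cpu espfix_waddr write alias. */
static void copy_frame(struct iret_rax_frame *waddr,
                       const struct iret_rax_frame *frame)
{
    *waddr = *frame;
}
```

Keeping the frame as a C struct also makes the layout explicit, instead of encoding it in `(N*8)(%rdi)` offsets.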
[RFC][PATCH v2 12/21] x86/pti: Use PTI stack instead of trampoline stack
When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Additional changes will be made later to switch to the kernel stack (which is only mapped in the kernel page-table). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 42 +--- arch/x86/include/asm/pti.h | 8 ++ arch/x86/include/asm/switch_to.h | 7 +- 3 files changed, 26 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 458af12ed9a1..29beab46bedd 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -194,19 +194,9 @@ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ POP_REGS pop_rdi=0 skip_r11rcx=1 - /* -* Now all regs are restored except RSP and RDI. -* Save old stack pointer and switch to trampoline stack. -*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - pushq RSP-RDI(%rdi) /* RSP */ - pushq (%rdi) /* RDI */ - /* * We are on the trampoline stack. All regs except RDI are live. +* We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. */ STACKLEAK_ERASE_NOCLOBBER @@ -214,7 +204,7 @@ syscall_return_via_sysret: SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi popq%rdi - popq%rsp + movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) #endif POP_REGS pop_rdi=0 - /* -* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS. -* Save old stack pointer and switch to trampoline stack.
-*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - /* Copy the IRET frame to the trampoline stack. */ - pushq 6*8(%rdi) /* SS */ - pushq 5*8(%rdi) /* RSP */ - pushq 4*8(%rdi) /* EFLAGS */ - pushq 3*8(%rdi) /* CS */ - pushq 2*8(%rdi) /* RIP */ - - /* Push user RDI on the trampoline stack. */ - pushq (%rdi) - /* * We are on the trampoline stack. All regs except RDI are live. * We can do future final exit work right here. @@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) /* Restore RDI. */ popq%rdi + addq$8, %rsp/* skip regs->orig_ax */ SWAPGS INTERRUPT_RETURN @@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry) SWITCH_TO_KERNEL_CR3 scratch_reg=%rax .Lerror_entry_from_usermode_after_swapgs: + /* +* We are on the trampoline stack. With PTI, the trampoline +* stack is a per-thread stack so we are all set and we can +* return. +* +* Without PTI, the trampoline stack is a per-cpu stack and +* we need to switch to the normal thread stack. +*/ + ALTERNATIVE "", "ret", X86_FEATURE_PTI /* Put us onto the real thread stack. 
*/ popq%r12/* save return addr in %12 */ movq%rsp, %rdi /* arg0 = pt_regs pointer */ diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 5484e69ff8d3..ed211fcc3a50 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void); extern void pti_finalize(void); extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, unsigned long end, enum pti_clone_level level); +static inline bool pti_enabled(void) +{ + return static_cpu_has(X86_FEATURE_PTI); +} #else static inline void pti_check_boottime_disable(void) { } +static inline bool pti_enabled(void) +{ + return false; +} #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h index 9f69cc497f4b..457458228462 100644 --- a/arch/x86/include/asm/switch_to.h +++ b/arch/x86/include/asm/switch_to.h @@ -3,6 +3,7 @@ #define _ASM_X86_SWITCH_TO_H #include +#include struct task_struct; /* one of the stranger aspects of C forward declarations */ @@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task) * doesn't wo
[RFC][PATCH v2 03/21] x86/entry: Consolidate IST entry from userspace
Most IST entries (NMI, MCE, DEBUG, VC but not DF) handle an entry from userspace the same way: they switch from the IST stack to the kernel stack, call the handler and then return to userspace. However, NMI, MCE/DEBUG and VC implement this same behavior using different code paths, so consolidate this code into a single assembly macro. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 137 +- 1 file changed, 75 insertions(+), 62 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index c42948aca0a8..51df9f1871c6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -316,6 +316,72 @@ SYM_CODE_END(ret_from_fork) #endif .endm +/* + * Macro to handle an IDT entry defined with the IST mechanism. It should + * be invoked at the beginning of the IDT handler with a pointer to the C + * function (cfunc_user) to invoke if the IDT was entered from userspace. + * + * If the IDT was entered from userspace, the macro will switch from the + * IST stack to the regular task stack, call the provided function and + * return to userland. + * + * If IDT was entered from the kernel, the macro will just return. + */ +.macro ist_entry_user cfunc_user has_error_code=0 + UNWIND_HINT_IRET_REGS + ASM_CLAC + + /* only process entry from userspace */ + .if \has_error_code == 1 + testb $3, CS-ORIG_RAX(%rsp) + jz .List_entry_from_kernel_\@ + .else + testb $3, CS-RIP(%rsp) + jz .List_entry_from_kernel_\@ + pushq $-1 /* ORIG_RAX: no syscall to restart */ + .endif + + /* Use %rdx as a temp variable */ + pushq %rdx + + /* +* Switch from the IST stack to the regular task stack and +* use the provided entry point. 
+*/ + swapgs + cld + FENCE_SWAPGS_USER_ENTRY + SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx + movq%rsp, %rdx + movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp + UNWIND_HINT_IRET_REGS base=%rdx offset=8 + pushq 6*8(%rdx) /* pt_regs->ss */ + pushq 5*8(%rdx) /* pt_regs->rsp */ + pushq 4*8(%rdx) /* pt_regs->flags */ + pushq 3*8(%rdx) /* pt_regs->cs */ + pushq 2*8(%rdx) /* pt_regs->rip */ + UNWIND_HINT_IRET_REGS + pushq 1*8(%rdx) /* pt_regs->orig_ax */ + PUSH_AND_CLEAR_REGS rdx=(%rdx) + ENCODE_FRAME_POINTER + + /* +* At this point we no longer need to worry about stack damage +* due to nesting -- we're on the normal thread stack and we're +* done with the IST stack. +*/ + + mov %rsp, %rdi + .if \has_error_code == 1 + movqORIG_RAX(%rsp), %rsi/* get error code into 2nd argument*/ + movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ + .endif + call\cfunc_user + jmp swapgs_restore_regs_and_return_to_usermode + +.List_entry_from_kernel_\@: +.endm + /** * idtentry_body - Macro to emit code calling the C function * @cfunc: C function to be called @@ -417,18 +483,15 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_mce_db vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - - pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user noist_\cfunc + /* Entry from kernel */ + + pushq $-1 /* ORIG_RAX: no syscall to restart */ /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ callparanoid_entry @@ -440,10 +503,6 @@ SYM_CODE_START(\asmsym) jmp paranoid_exit - /* Switch to the regular task stack and use the noist entry point */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body noist_\cfunc, has_error_code=0 - _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) .endm @@ -472,15 +531,11 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_vc vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user safe_stack_\cfunc, has_error_code=1 /* * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. @@ -517,10 +572,6 @@ SYM_CODE_START(\asmsym) */ jmp paranoid_exit - /* Switch to the regular task stack */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body safe_stack_\cfunc, has_e
Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
On 11/10/20 11:42 PM, Joel Fernandes wrote: On Tue, Nov 10, 2020 at 10:35:17AM +0100, Alexandre Chartre wrote: [..] ---8<--- From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 27 Jul 2020 17:56:14 -0400 Subject: [PATCH] kernel/entry: Add support for core-wide protection of kernel-mode [..] diff --git a/include/linux/sched.h b/include/linux/sched.h index d38e904dd603..fe6f225bfbf9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq); const struct cpumask *sched_trace_rd_span(struct root_domain *rd); +#ifdef CONFIG_SCHED_CORE +void sched_core_unsafe_enter(void); +void sched_core_unsafe_exit(void); +bool sched_core_wait_till_safe(unsigned long ti_check); +bool sched_core_kernel_protected(void); +#else +#define sched_core_unsafe_enter(ignore) do { } while (0) +#define sched_core_unsafe_exit(ignore) do { } while (0) +#define sched_core_wait_till_safe(ignore) do { } while (0) +#define sched_core_kernel_protected(ignore) do { } while (0) +#endif + #endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 0a1e20f8d4e8..a18ed60cedea 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -28,6 +28,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs) instrumentation_begin(); trace_hardirqs_off_finish(); + if (_TIF_UNSAFE_RET) /* Kernel protection depends on arch defining the flag. */ + sched_core_unsafe_enter(); instrumentation_end(); } @@ -137,6 +139,27 @@ static __always_inline void exit_to_user_mode(void) /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal(struct pt_regs *regs) { } +unsigned long exit_to_user_get_work(void) Function should be static. Fixed. 
+{ + unsigned long ti_work = READ_ONCE(current_thread_info()->flags); + + if ((IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected()) + || !_TIF_UNSAFE_RET) + return ti_work; + +#ifdef CONFIG_SCHED_CORE + ti_work &= EXIT_TO_USER_MODE_WORK; + if ((ti_work & _TIF_UNSAFE_RET) == ti_work) { + sched_core_unsafe_exit(); + if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) { + sched_core_unsafe_enter(); /* not exiting to user yet. */ + } + } + + return READ_ONCE(current_thread_info()->flags); +#endif +} + static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work) { @@ -175,7 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, * enabled above. */ local_irq_disable_exit_to_user(); - ti_work = READ_ONCE(current_thread_info()->flags); + ti_work = exit_to_user_get_work(); } What happens if the task is scheduled out in exit_to_user_mode_loop? (e.g. if it has _TIF_NEED_RESCHED set). It will have called sched_core_unsafe_enter() and forced siblings to wait for it. So shouldn't sched_core_unsafe_exit() be called when the task is scheduled out? (because it won't run anymore) And sched_core_unsafe_enter() when the task is scheduled back in? No, when the task is scheduled out, it will be in kernel mode on the task being scheduled in. That task (being scheduled-in) would have already done a sched_core_unsafe_enter(). When that task returns to user mode, it will do a sched_core_unsafe_exit(). When all tasks go to sleep, the last task that enters the idle loop will do a sched_core_unsafe_exit(). Just to note: the "unsafe kernel context" is per-CPU and not per-task. Does that answer your question? Ok, I think I get it: it works because when a task is scheduled out then the scheduler will schedule in a new tagged task (because we have core scheduling). So that new task should be accounted for core-wide protection the same way as the previous one.
+static inline void init_sched_core_irq_work(struct rq *rq) +{ + init_irq_work(&rq->core_irq_work, sched_core_irq_work); +} + +/* + * sched_core_wait_till_safe - Pause the caller's hyperthread until the core + * exits the core-wide unsafe state. Obviously the CPU calling this function + * should not be responsible for the core being in the core-wide unsafe state + * otherwise it will deadlock. + * + * @ti_check: We spin here with IRQ enabled and preempt disabled. Break out of + *the loop if TIF flags are set and notify caller about it. + * + * IRQs should be disabled. + */ +bool sched_core_wait_till_safe(unsigned long ti_check) +{ + bool restart = false; + struct rq *rq; + int cpu; + + /* We clear the thread flag only at the end, so need to check for it. */ Do you mean "
Re: [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
On 11/11/20 12:39 AM, Andy Lutomirski wrote: On 11/9/20 6:28 PM, Andy Lutomirski wrote: On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre wrote: Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, Probably fine. per cpu offsets (__per_cpu_offset, which is used in some entry code), This likely already leaks due to vulnerable CPUs leaking address space layout info. I forgot to update the comment, I am not mapping __per_cpu_offset anymore. However, if we do map __per_cpu_offset then we don't need to enforce the ordering in paranoid_entry to switch CR3 before GS. I'm okay with mapping __per_cpu_offset. Good. That way I can move the GS update back to assembly code (paranoid_entry/exit will be mostly reduced to updating GS), and I probably won't need to disable the stack protector. the stack canary, That's going to be a very tough sell. I can get rid of this, but this will require disabling the stack protector for any function that we can call while using the user page-table, as already done in patch 21 (x86/entry: Disable stack-protector for IST entry C handlers). You could probably get away with using a different stack protector canary before and after the CR3 switch as long as you are careful to have the canary restored when you return from whatever function is involved. I was thinking about doing that. I will give it a try. Thanks, alex.
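Andy's suggestion — use a different canary while running on the user page-table and restore the original before returning — boils down to a save/switch/restore pattern. A minimal userspace sketch, where __toy_stack_chk_guard and TOY_USER_PT_CANARY are stand-ins for the real __stack_chk_guard / per-CPU canary storage (none of these names are in the patch set):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: swap in a canary value that is usable while the user CR3 is
 * active, and restore the original canary before returning, so the
 * compiler-emitted stack-protector checks keep working on both sides
 * of the page-table switch.
 */

static uintptr_t __toy_stack_chk_guard = 0xdeadbeef;

#define TOY_USER_PT_CANARY 0x1badc0de /* canary usable with the user CR3 */

static uintptr_t toy_switch_canary(uintptr_t new_canary)
{
	uintptr_t old = __toy_stack_chk_guard;

	__toy_stack_chk_guard = new_canary;
	return old;
}

/* Shape of a function that runs partly with the user page-table. */
static void toy_paranoid_entry(void)
{
	uintptr_t saved = toy_switch_canary(TOY_USER_PT_CANARY);

	/* ... work done while the user CR3 is active would go here ... */

	toy_switch_canary(saved); /* restore before returning */
}
```

The care Andy mentions is exactly the restore step: every return path out of the function must put the original canary back.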
Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode
On 11/3/20 2:20 AM, Joel Fernandes wrote: Hi Alexandre, Sorry for the late reply as I was working on the snapshotting patch... On Fri, Oct 30, 2020 at 11:29:26AM +0100, Alexandre Chartre wrote: On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote: Core-scheduling prevents hyperthreads in usermode from attacking each other, but it does not do anything about one of the hyperthreads entering the kernel for any reason. This leaves the door open for MDS and L1TF attacks with concurrent execution sequences between hyperthreads. This patch therefore adds support for protecting all syscall and IRQ kernel mode entries. Care is taken to track the outermost usermode exit and entry using per-cpu counters. In cases where one of the hyperthreads enters the kernel, no additional IPIs are sent. Further, IPIs are avoided when not needed - example: idle and non-cookie HTs do not need to be forced into kernel mode. Hi Joel, In order to protect syscall/IRQ kernel mode entries, shouldn't we have a call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't see such a call. Am I missing something? Yes, this is a known bug and is fixed in v9, which I'll post soon. Meanwhile the updated patch is appended below: See comments below about the updated patch. ---8<--- From b2835a587a28405ffdf8fc801e798129a014a8c8 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 27 Jul 2020 17:56:14 -0400 Subject: [PATCH] kernel/entry: Add support for core-wide protection of kernel-mode Core-scheduling prevents hyperthreads in usermode from attacking each other, but it does not do anything about one of the hyperthreads entering the kernel for any reason. This leaves the door open for MDS and L1TF attacks with concurrent execution sequences between hyperthreads. This patch therefore adds support for protecting all syscall and IRQ kernel mode entries. Care is taken to track the outermost usermode exit and entry using per-cpu counters.
In cases where one of the hyperthreads enters the kernel, no additional IPIs are sent. Further, IPIs are avoided when not needed - example: idle and non-cookie HTs do not need to be forced into kernel mode. More information about attacks: For MDS, it is possible for syscalls, IRQ and softirq handlers to leak data to either host or guest attackers. For L1TF, it is possible to leak to guest attackers. There is no possible mitigation involving flushing of buffers to avoid this since the execution of attacker and victims happens concurrently on 2 or more HTs. Cc: Julien Desfossez Cc: Tim Chen Cc: Aaron Lu Cc: Aubrey Li Cc: Tim Chen Cc: Paul E. McKenney Co-developed-by: Vineeth Pillai Tested-by: Julien Desfossez Signed-off-by: Vineeth Pillai Signed-off-by: Joel Fernandes (Google) --- .../admin-guide/kernel-parameters.txt | 9 + include/linux/entry-common.h | 6 +- include/linux/sched.h | 12 + kernel/entry/common.c | 28 ++- kernel/sched/core.c | 230 ++ kernel/sched/sched.h | 3 + 6 files changed, 285 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3236427e2215..a338d5d64c3d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4678,6 +4678,15 @@ sbni= [NET] Granch SBNI12 leased line adapter + sched_core_protect_kernel= + [SCHED_CORE] Pause SMT siblings of a core running in + user mode, if at least one of the siblings of the core + is running in kernel mode. This is to guarantee that + kernel data is not leaked to tasks which are not trusted + by the kernel. A value of 0 disables protection, 1 + enables protection. The default is 1. Note that protection + depends on the arch defining the _TIF_UNSAFE_RET flag. + sched_debug [KNL] Enables verbose scheduler debug messages. schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 474f29638d2c..62278c5b3b5f 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -33,6 +33,10 @@ # define _TIF_PATCH_PENDING (0) #endif +#ifndef _TIF_UNSAFE_RET +# define _TIF_UNSAFE_RET (0) +#endif + #ifndef _TIF_UPROBE # define _TIF_UPROBE (0) #endif @@ -69,7 +73,7 @@ #define EXIT_TO_USER_MODE_WORK \ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ -_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | \ +_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET | \
Re: [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
[Copying the reply to Andy in the thread with the right email addresses] On 11/9/20 6:38 PM, Andy Lutomirski wrote: On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre wrote: SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions of these macros (swapgs() and swapgs_unsafe_stack()). This needs a very good justification. It also needs some kind of static verification that these helpers are only used by noinstr code, and they need to be __always_inline. And I cannot fathom how C code could possibly use SWAPGS_UNSAFE_STACK in a meaningful way. You're right, I probably need to revisit the usage of SWAPGS_UNSAFE_STACK in C code, that doesn't make sense. Looks like only SWAPGS is then needed. Or maybe we can just use native_swapgs() instead? I have added a C version of SWAPGS for moving paranoid_entry() to C because, in this function, we need to switch CR3 before updating GS. But I really wonder if we need a paravirt swapgs here, and we can probably just use native_swapgs(). Also, if we map the per cpu offsets (__per_cpu_offset) in the user page-table then we will be able to update GS before switching CR3. That way we can keep the GS update in assembly code, and just do the CR3 switch in C code. This would also avoid having to disable stack-protector (patch 21). alex.
Re: [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
[Copying the reply to Andy in the thread with the right email addresses] On 11/9/20 6:28 PM, Andy Lutomirski wrote: On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre wrote: Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, Probably fine. per cpu offsets (__per_cpu_offset, which is used in some entry code), This likely already leaks due to vulnerable CPUs leaking address space layout info. I forgot to update the comment, I am not mapping __per_cpu_offset anymore. However, if we do map __per_cpu_offset then we don't need to enforce the ordering in paranoid_entry to switch CR3 before GS. the stack canary, That's going to be a very tough sell. I can get rid of this, but this will require disabling the stack protector for any function that we can call while using the user page-table, as already done in patch 21 (x86/entry: Disable stack-protector for IST entry C handlers). alex.
Re: [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code
On 11/9/20 8:35 PM, Dave Hansen wrote: On 11/9/20 6:44 AM, Alexandre Chartre wrote: - map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code); This seems like the thing we'd want to tag explicitly rather than make it implicit with 'noinstr' code. Worst-case, shouldn't this be: #define __entry_func noinstr or something? Yes. I used the easy solution of just using noinstr because noinstr is mostly used for entry functions. But if we want to use the user page-table beyond the entry functions then we will definitely need a dedicated tag. I'd also like to see a lot more discussion about what the rules are for the C code and the compiler. We can't, for instance, do a normal printk() in these entry functions. Should we stick them in a special section and have objtool look for suspect patterns or references? I'm most worried about things like this: if (something_weird) pr_warn("this will oops the kernel\n"); That would be similar to noinstr which uses the .noinstr.text section, and if I remember correctly objtool detects if a noinstr function calls a non-noinstr function. Similarly here, an entry function should not call a non-entry function. alex.
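The explicit tag Dave suggests would work like noinstr's placement of code in .noinstr.text: a section attribute that tooling such as objtool can then police. A hypothetical sketch — the __entry_func name and the section name are illustrative, not part of the patch set:

```c
#include <assert.h>

/*
 * Hypothetical marker placing a function in a dedicated section, in
 * the same spirit as noinstr. A checker could then flag any call from
 * .entry.text out to ordinary .text (e.g. printk()).
 */
#define __entry_func __attribute__((__section__(".entry.text")))

__entry_func static int toy_entry_handler(int vector)
{
	/*
	 * Would have to avoid printk() and friends: only callees that
	 * are themselves in .entry.text are safe to call from here.
	 */
	return vector + 1;
}
```

The section placement itself changes nothing at runtime; its value is that it gives static tooling an unambiguous boundary to verify.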
Re: [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function
On 11/9/20 6:25 PM, Andy Lutomirski wrote: Hi Alexander- You appear to be infected by corporate malware that has inserted the string "@aserv0122.oracle.com" at the end of all the email addresses in your to: list. "l...@kernel.org"@aserv0122.oracle.com, for example, is not me. Can you fix this? I know, I messed up :-( I have already resent the entire RFC with correct addresses. Sorry about that. alex. On Mon, Nov 9, 2020 at 3:21 AM Alexandre Chartre wrote: Add a wrapper function for invoking a syscall function. This needs some explanation of why.
[RFC][PATCH 15/24] x86/pti: Execute syscall functions on the kernel stack
During a syscall, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the syscall function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 11 ++- arch/x86/entry/entry_64.S| 1 + arch/x86/include/asm/irq_stack.h | 3 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 54d0931801e1..ead6a4c72e6a 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { + unsigned long stack; + if (!sysfunc) return; - regs->ax = sysfunc(regs); + if (!pti_enabled()) { + regs->ax = sysfunc(regs); + return; + } + + stack = (unsigned long)task_top_of_kernel_stack(current); + regs->ax = asm_call_syscall_on_stack((void *)(stack - 8), +sysfunc, regs); } #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 29beab46bedd..6b88a0eb8975 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2) SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) +SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL) /* * Save the frame pointer unconditionally. This allows the ORC * unwinder to handle the stack switch.
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 359427216336..108d9da7c01c 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -5,6 +5,7 @@ #include #include +#include #ifdef CONFIG_X86_64 static __always_inline bool irqstack_active(void) @@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), struct irq_desc *desc); +long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func, + struct pt_regs *regs); static __always_inline void __run_on_irqstack(void (*func)(void)) { -- 2.18.4
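The core idea of asm_call_syscall_on_stack() — run one function on a different stack, then come back — can be mimicked in userspace with ucontext. This is purely illustrative: the kernel does the stack hop with a small assembly trampoline, not with ucontext, and all names here are made up:

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

/*
 * Userspace analogue of "run this function on another stack and come
 * back". The "syscall" body runs on a private heap-allocated stack,
 * and control returns to the caller's stack afterwards.
 */

static long result; /* stand-in for regs->ax = sysfunc(regs) */

static void on_other_stack(void)
{
	result = 42; /* "syscall" body, running on the private stack */
}

static long call_on_stack(void (*func)(void), size_t stack_size)
{
	ucontext_t main_ctx, alt_ctx;
	void *stack = malloc(stack_size);

	if (!stack)
		return -1;

	getcontext(&alt_ctx);
	alt_ctx.uc_stack.ss_sp = stack;
	alt_ctx.uc_stack.ss_size = stack_size;
	alt_ctx.uc_link = &main_ctx; /* resume here when func returns */
	makecontext(&alt_ctx, func, 0);
	swapcontext(&main_ctx, &alt_ctx); /* hop onto the private stack */

	free(stack);
	return result;
}
```

In the patch the motivation for the hop is isolation, not convenience: the syscall body must execute on a stack that is only mapped in the kernel page-table.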
[RFC][PATCH 18/24] x86/pti: Execute system vector handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Changes system vector handlers to execute on the kernel stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a82e31b45442..0c5d9f027112 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) +{ + void *stack = pti_kernel_stack(regs); + + if (stack) + asm_call_on_stack_1(stack, (void (*)(void))func, regs); + else + run_sysvec_on_irqstack_cond(func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - run_sysvec_on_irqstack_cond(__##func, regs);\ + run_sysvec(__##func, regs); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ -- 2.18.4
[RFC][PATCH 12/24] x86/pti: Function to map per-cpu page-table entry
Wrap the code used by PTI to map a per-cpu page-table entry into a new function so that this code can be re-used to map other per-cpu entries. Signed-off-by: Alexandre Chartre --- arch/x86/mm/pti.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index ebc8cd2f1cd8..71ca245d7b38 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr) *user_p4d = *kernel_p4d; } +/* + * Clone a single percpu page. + */ +static void __init pti_clone_percpu_page(void *addr) +{ + phys_addr_t pa = per_cpu_ptr_to_phys(addr); + pte_t *target_pte; + + target_pte = pti_user_pagetable_walk_pte((unsigned long)addr); + if (WARN_ON(!target_pte)) + return; + + *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); +} + /* * Clone the CPU_ENTRY_AREA and associated data into the user space visible * page table. @@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void) * This is done for all possible CPUs during boot to ensure * that it's propagated to all mms. */ + pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); - unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu); - phys_addr_t pa = per_cpu_ptr_to_phys((void *)va); - pte_t *target_pte; - - target_pte = pti_user_pagetable_walk_pte(va); - if (WARN_ON(!target_pte)) - return; - - *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); } } -- 2.18.4
[RFC][PATCH 11/24] x86/pti: Function to clone page-table entries from a specified mm
PTI has a function to clone page-table entries but only from the init_mm page-table. Provide a new function to clone page-table entries from a specified mm page-table. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/pti.h | 10 ++ arch/x86/mm/pti.c | 32 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 07375b476c4f..5484e69ff8d3 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -4,9 +4,19 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_PAGE_TABLE_ISOLATION + +enum pti_clone_level { + PTI_CLONE_PMD, + PTI_CLONE_PTE, +}; + +struct mm_struct; + extern void pti_init(void); extern void pti_check_boottime_disable(void); extern void pti_finalize(void); +extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level); #else static inline void pti_check_boottime_disable(void) { } #endif diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 1aab92930569..ebc8cd2f1cd8 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void) static void __init pti_setup_vsyscall(void) { } #endif -enum pti_clone_level { - PTI_CLONE_PMD, - PTI_CLONE_PTE, -}; - -static void -pti_clone_pgtable(unsigned long start, unsigned long end, - enum pti_clone_level level) +void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level) { unsigned long addr; @@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end, if (addr < start) break; - pgd = pgd_offset_k(addr); + pgd = pgd_offset(mm, addr); if (WARN_ON(pgd_none(*pgd))) return; p4d = p4d_offset(pgd, addr); @@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end, } } +static void pti_clone_init_pgtable(unsigned long start, unsigned long end, + enum pti_clone_level level) +{ + pti_clone_pgtable(&init_mm, start, end, level); +} + #ifdef 
CONFIG_X86_64 /* * Clone a single p4d (i.e. a top-level entry on 4-level systems and a @@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void) start = CPU_ENTRY_AREA_BASE; end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES); - pti_clone_pgtable(start, end, PTI_CLONE_PMD); + pti_clone_init_pgtable(start, end, PTI_CLONE_PMD); } #endif /* CONFIG_X86_64 */ @@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void) */ static void pti_clone_entry_text(void) { - pti_clone_pgtable((unsigned long) __entry_text_start, - (unsigned long) __entry_text_end, - PTI_CLONE_PMD); + pti_clone_init_pgtable((unsigned long) __entry_text_start, + (unsigned long) __entry_text_end, + PTI_CLONE_PMD); } /* @@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void) * pti_set_kernel_image_nonglobal() did to clear the * global bit. */ - pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); + pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); /* -* pti_clone_pgtable() will set the global bit in any PMDs -* that it clones, but we also need to get any PTEs in +* pti_clone_init_pgtable() will set the global bit in any +* PMDs that it clones, but we also need to get any PTEs in * the last level for areas that are not huge-page-aligned. */ -- 2.18.4
[RFC][PATCH 17/24] x86/pti: Execute IDT handlers with error code on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Changes IDT handlers which have an error code. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 18 -- arch/x86/kernel/traps.c | 2 +- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 3595a31947b3..a82e31b45442 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1)) : \ func(arg1)) +#define CALL_ON_STACK_2(stack, func, arg1, arg2) \ + ((stack) ? \ +asm_call_on_stack_2(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ +func(arg1, arg2)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack).
If we @@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); } +static __always_inline +void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), +struct pt_regs *regs, unsigned long error_code) +{ + CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs, \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs, error_code);\ + run_idt_errcode(__##func, regs, error_code);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs, \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs, (u8)error_code);\ + run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \ + regs, (u8)error_code); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 5161385b3670..9a51aa016fb3 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug) /* User entry, runs on regular task stack */ DEFINE_IDTENTRY_DEBUG_USER(exc_debug) { - exc_debug_user(regs, debug_read_clear_dr6()); + run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6()); } #else /* 32 bit does not have separate entry points. */ -- 2.18.4
[RFC][PATCH 24/24] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
With PTI, syscall/interrupt/exception entries switch the CR3 register to change the page-table in assembly code. Move the CR3 register switch inside the C code of syscall/interrupt/exception entry handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 15 --- arch/x86/entry/entry_64.S | 23 +-- arch/x86/entry/entry_64_compat.S| 22 -- arch/x86/include/asm/entry-common.h | 14 ++ arch/x86/include/asm/idtentry.h | 25 - arch/x86/kernel/cpu/mce/core.c | 2 ++ arch/x86/kernel/nmi.c | 2 ++ arch/x86/kernel/traps.c | 6 ++ arch/x86/mm/fault.c | 9 +++-- 9 files changed, 68 insertions(+), 50 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index ead6a4c72e6a..3f4788dbbde7 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, regs->ax = 0; } syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } static __always_inline void run_syscall(sys_call_ptr_t sysfunc, @@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc, #ifdef CONFIG_X86_64 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) { + switch_to_kernel_cr3(); nr = syscall_enter_from_user_mode(regs, nr); instrumentation_begin(); @@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) instrumentation_end(); syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } #endif #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION) static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs) { + switch_to_kernel_cr3(); if (IS_ENABLED(CONFIG_IA32_EMULATION)) current_thread_info()->status |= TS_COMPAT; @@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs) do_syscall_32_irqs_on(regs, nr); syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } -static noinstr bool __do_fast_syscall_32(struct pt_regs *regs) +static noinstr bool __do_fast_syscall_32(struct 
pt_regs *regs, long nr) { - unsigned int nr = syscall_32_enter(regs); int res; /* @@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs) /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs) { + unsigned int nr = syscall_32_enter(regs); + bool syscall_done; + /* * Called using the internal vDSO SYSENTER/SYSCALL32 calling * convention. Adjust regs so it looks like we entered using int80. @@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs) regs->ip = landing_pad; /* Invoke the syscall. If it failed, keep it simple: use IRET. */ - if (!__do_fast_syscall_32(regs)) + syscall_done = __do_fast_syscall_32(regs, nr); + switch_to_user_cr3(); + if (!syscall_done) return 0; #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 797effbe65b6..4be15a5ffe68 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64) swapgs /* tss.sp2 is scratch space. */ movq%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2) - SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL) @@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL) */ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ - POP_REGS pop_rdi=0 skip_r11rcx=1 + POP_REGS skip_r11rcx=1 /* -* We are on the trampoline stack. All regs except RDI are live. * We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. 
*/ STACKLEAK_ERASE_NOCLOBBER - SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi - - popq%rdi movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork) swapgs cld FENCE_SWAPGS_USER_ENTRY - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx movq%rsp, %rdx movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp UNWIND_HINT_IRET_REGS base=%rdx offset=8 @@ -592,19 +586,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) ud2 1: #endif - POP_REGS pop_rdi=0 + POP_REGS + addq
[RFC][PATCH 23/24] x86/entry: Remove paranoid_entry and paranoid_exit
The paranoid_entry and paranoid_exit assembly functions have been replaced by the kernel_paranoid_entry() and kernel_paranoid_exit() C functions. Now paranoid_entry/exit are not used anymore and can be removed. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 131 -- 1 file changed, 131 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 9ea8187d4405..797effbe65b6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -882,137 +882,6 @@ SYM_CODE_START(xen_failsafe_callback) SYM_CODE_END(xen_failsafe_callback) #endif /* CONFIG_XEN_PV */ -/* - * Save all registers in pt_regs. Return GSBASE related information - * in EBX depending on the availability of the FSGSBASE instructions: - * - * FSGSBASER/EBX - * N0 -> SWAPGS on exit - * 1 -> no SWAPGS on exit - * - * YGSBASE value at entry, must be restored in paranoid_exit - */ -SYM_CODE_START_LOCAL(paranoid_entry) - UNWIND_HINT_FUNC - cld - PUSH_AND_CLEAR_REGS save_ret=1 - ENCODE_FRAME_POINTER 8 - - /* -* Always stash CR3 in %r14. This value will be restored, -* verbatim, at exit. Needed if paranoid_entry interrupted -* another entry that already switched to the user CR3 value -* but has not yet returned to userspace. -* -* This is also why CS (stashed in the "iret frame" by the -* hardware at entry) can not be used: this may be a return -* to kernel code, but with a user CR3 value. -* -* Switching CR3 does not depend on kernel GSBASE so it can -* be done before switching to the kernel GSBASE. This is -* required for FSGSBASE because the kernel GSBASE has to -* be retrieved from a kernel internal table. -*/ - SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14 - - /* -* Handling GSBASE depends on the availability of FSGSBASE. -* -* Without FSGSBASE the kernel enforces that negative GSBASE -* values indicate kernel GSBASE. With FSGSBASE no assumptions -* can be made about the GSBASE value when entering from user -* space. 
-*/ - ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE - - /* -* Read the current GSBASE and store it in %rbx unconditionally, -* retrieve and set the current CPUs kernel GSBASE. The stored value -* has to be restored in paranoid_exit unconditionally. -* -* The unconditional write to GS base below ensures that no subsequent -* loads based on a mispredicted GS base can happen, therefore no LFENCE -* is needed here. -*/ - SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx - ret - -.Lparanoid_entry_checkgs: - /* EBX = 1 -> kernel GSBASE active, no restore required */ - movl$1, %ebx - /* -* The kernel-enforced convention is a negative GSBASE indicates -* a kernel value. No SWAPGS needed on entry and exit. -*/ - movl$MSR_GS_BASE, %ecx - rdmsr - testl %edx, %edx - jns .Lparanoid_entry_swapgs - ret - -.Lparanoid_entry_swapgs: - SWAPGS - - /* -* The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an -* unconditional CR3 write, even in the PTI case. So do an lfence -* to prevent GS speculation, regardless of whether PTI is enabled. -*/ - FENCE_SWAPGS_KERNEL_ENTRY - - /* EBX = 0 -> SWAPGS required on exit */ - xorl%ebx, %ebx - ret -SYM_CODE_END(paranoid_entry) - -/* - * "Paranoid" exit path from exception stack. This is invoked - * only on return from non-NMI IST interrupts that came - * from kernel space. - * - * We may be returning to very strange contexts (e.g. very early - * in syscall entry), so checking for preemption here would - * be complicated. Fortunately, there's no good reason to try - * to handle preemption here. - * - * R/EBX contains the GSBASE related information depending on the - * availability of the FSGSBASE instructions: - * - * FSGSBASER/EBX - * N0 -> SWAPGS on exit - * 1 -> no SWAPGS on exit - * - * YUser space GSBASE, must be restored unconditionally - */ -SYM_CODE_START_LOCAL(paranoid_exit) - UNWIND_HINT_REGS - /* -* The order of operations is important. RESTORE_CR3 requires -* kernel GSBASE. 
-* -* NB to anyone to try to optimize this code: this code does -* not execute at all for exceptions from user mode. Those -* exceptions go through error_exit instead. -*/ - RESTORE_CR3 scratch_reg=%rax save_reg=%r14 - - /* Handle the three GSBASE cases */ - ALTERNATIVE "jmp .Lparanoid_exit_checkgs"
[RFC][PATCH 22/24] x86/entry: Defer paranoid entry/exit to C code
IST entries from the kernel use paranoid entry and exit assembly functions to ensure the CR3 and GS registers are updated with correct values for the kernel. Move the update of the CR3 and GS registers inside the C code of IST handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 72 ++ arch/x86/kernel/cpu/mce/core.c | 3 ++ arch/x86/kernel/nmi.c | 18 +++-- arch/x86/kernel/sev-es.c | 20 +- arch/x86/kernel/traps.c| 30 -- 5 files changed, 83 insertions(+), 60 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6b88a0eb8975..9ea8187d4405 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -462,16 +462,16 @@ SYM_CODE_START(\asmsym) /* Entry from kernel */ pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* paranoid_entry returns GS information for paranoid_exit in EBX. */ - callparanoid_entry - + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs pointer */ call\cfunc - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -507,12 +507,9 @@ SYM_CODE_START(\asmsym) */ ist_entry_user safe_stack_\cfunc, has_error_code=1 - /* -* paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. -* EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS -*/ - callparanoid_entry - + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS /* @@ -538,7 +535,7 @@ SYM_CODE_START(\asmsym) * identical to the stack in the IRET frame or the VC fall-back stack, * so it is definitly mapped even with PTI enabled. */ - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -555,8 +552,9 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_IRET_REGS offset=8 ASM_CLAC - /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ - callparanoid_entry + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs pointer into first argument */ @@ -564,7 +562,7 @@ SYM_CODE_START(\asmsym) movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ call\cfunc - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -1119,10 +1117,6 @@ SYM_CODE_END(error_return) /* * Runs on exception stack. Xen PV does not go through this path at all, * so we can use real assembly here. - * - * Registers: - * %r14: Used to save/restore the CR3 of the interrupted context - * when PAGE_TABLE_ISOLATION is in use. Do not clobber. */ SYM_CODE_START(asm_exc_nmi) /* @@ -1173,7 +1167,7 @@ SYM_CODE_START(asm_exc_nmi) * We also must not push anything to the stack before switching * stacks lest we corrupt the "NMI executing" variable. */ - ist_entry_user exc_nmi + ist_entry_user exc_nmi_user /* NMI from kernel */ @@ -1346,9 +1340,7 @@ repeat_nmi: * * RSP is pointing to "outermost RIP". gsbase is unknown, but, if * we're repeating an NMI, gsbase has the same value that it had on -* the first iteration. paranoid_entry will load the kernel -* gsbase if needed before we call exc_nmi(). "NMI executing" -* is zero. +* the first iteration. "NMI executing" is zero. */ movq$1, 10*8(%rsp) /* Set "NMI executing". */ @@ -1372,44 +1364,20 @@ end_repeat_nmi: pushq $-1 /* ORIG_RAX: no syscall to restart */ /* -* Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit -* as we should not be calling schedule in NMI context. -* Even with normal interrupts enabled. An NMI should not be -* setting NEED_RESCHED or anything that normal interrupts and +* We should not be calling schedule in NMI context. Even with +* normal interrupts enabled. An NMI should not be setting +* NEED_RESCHED or anything that normal interrupts and * exceptions might do. 
*/ - callparanoid_entry + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi movq$-1, %rsi callexc_nmi - /* Always restore stashed CR3 value (see paranoid_entry) */ - RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 - - /* -* The above invocation of pa
[RFC][PATCH 14/24] x86/pti: Use PTI stack instead of trampoline stack
When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Additional changes will be made to later to switch to the kernel stack (which is only mapped in the kernel page-table). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 42 +--- arch/x86/include/asm/pti.h | 8 ++ arch/x86/include/asm/switch_to.h | 7 +- 3 files changed, 26 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 458af12ed9a1..29beab46bedd 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -194,19 +194,9 @@ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ POP_REGS pop_rdi=0 skip_r11rcx=1 - /* -* Now all regs are restored except RSP and RDI. -* Save old stack pointer and switch to trampoline stack. -*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - pushq RSP-RDI(%rdi) /* RSP */ - pushq (%rdi) /* RDI */ - /* * We are on the trampoline stack. All regs except RDI are live. +* We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. */ STACKLEAK_ERASE_NOCLOBBER @@ -214,7 +204,7 @@ syscall_return_via_sysret: SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi popq%rdi - popq%rsp + movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) #endif POP_REGS pop_rdi=0 - /* -* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS. -* Save old stack pointer and switch to trampoline stack. 
-*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - /* Copy the IRET frame to the trampoline stack. */ - pushq 6*8(%rdi) /* SS */ - pushq 5*8(%rdi) /* RSP */ - pushq 4*8(%rdi) /* EFLAGS */ - pushq 3*8(%rdi) /* CS */ - pushq 2*8(%rdi) /* RIP */ - - /* Push user RDI on the trampoline stack. */ - pushq (%rdi) - /* * We are on the trampoline stack. All regs except RDI are live. * We can do future final exit work right here. @@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) /* Restore RDI. */ popq%rdi + addq$8, %rsp/* skip regs->orig_ax */ SWAPGS INTERRUPT_RETURN @@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry) SWITCH_TO_KERNEL_CR3 scratch_reg=%rax .Lerror_entry_from_usermode_after_swapgs: + /* +* We are on the trampoline stack. With PTI, the trampoline +* stack is a per-thread stack so we are all set and we can +* return. +* +* Without PTI, the trampoline stack is a per-cpu stack and +* we need to switch to the normal thread stack. +*/ + ALTERNATIVE "", "ret", X86_FEATURE_PTI /* Put us onto the real thread stack. 
*/ popq%r12/* save return addr in %12 */ movq%rsp, %rdi /* arg0 = pt_regs pointer */ diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 5484e69ff8d3..ed211fcc3a50 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void); extern void pti_finalize(void); extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, unsigned long end, enum pti_clone_level level); +static inline bool pti_enabled(void) +{ + return static_cpu_has(X86_FEATURE_PTI); +} #else static inline void pti_check_boottime_disable(void) { } +static inline bool pti_enabled(void) +{ + return false; +} #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h index 9f69cc497f4b..457458228462 100644 --- a/arch/x86/include/asm/switch_to.h +++ b/arch/x86/include/asm/switch_to.h @@ -3,6 +3,7 @@ #define _ASM_X86_SWITCH_TO_H #include +#include struct task_struct; /* one of the stranger aspects of C forward declarations */ @@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task) * doesn't wo
[RFC][PATCH 21/24] x86/entry: Disable stack-protector for IST entry C handlers
The stack-protector option adds a stack canary to functions vulnerable to stack buffer overflow. The stack canary is defined through the GS register. Add an attribute to disable the stack-protector option; it will be used for C functions which can be called while the GS register might not be properly configured yet. The GS register is not properly configured for the kernel when we enter the kernel from userspace. The assembly entry code sets the GS register for the kernel using the swapgs instruction or the paranoid_entry function, and so, currently, the GS register is correctly configured when assembly entry code subsequently transfers control to C code. Deferring the CR3 register switch from assembly to C code will require reimplementing paranoid_entry in C and hence also deferring the GS register setup for IST entries to C code. To prepare for this change, disable stack-protector for IST entry C handlers where the GS register setup will eventually happen. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 25 - arch/x86/kernel/nmi.c | 2 +- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a6725afaaec0..647af7ea3bf1 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -94,6 +94,21 @@ void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) run_sysvec_on_irqstack_cond(func, regs); } +/* + * Attribute to disable the stack-protector option. The option is + * disabled using the optimize attribute which clears all optimize + * options. So we need to specify the optimize option to disable but + * also optimize options we want to preserve. + * + * The stack-protector option adds a stack canary to functions + * vulnerable to stack buffer overflow. The stack canary is defined + * through the GS register. 
So the attribute is used to disable the + * stack-protector option for functions which can be called while the + * GS register might not be properly configured yet. + */ +#define no_stack_protector \ + __attribute__ ((optimize("-O2,-fno-stack-protector,-fno-omit-frame-pointer"))) + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -410,7 +425,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW */ #define DEFINE_IDTENTRY_IST(func) \ - DEFINE_IDTENTRY_RAW(func) + no_stack_protector DEFINE_IDTENTRY_RAW(func) /** * DEFINE_IDTENTRY_NOIST - Emit code for NOIST entry points which @@ -440,7 +455,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE */ #define DEFINE_IDTENTRY_DF(func) \ - DEFINE_IDTENTRY_RAW_ERRORCODE(func) + no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func) /** * DEFINE_IDTENTRY_VC_SAFE_STACK - Emit code for VMM communication handler @@ -472,7 +487,7 @@ static __always_inline void __##func(struct pt_regs *regs) * VMM communication handler. 
*/ #define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ - __visible noinstr \ + no_stack_protector __visible noinstr\ unsigned long setup_stack_##func(struct pt_regs *regs) /** @@ -482,7 +497,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE */ #define DEFINE_IDTENTRY_VC(func) \ - DEFINE_IDTENTRY_RAW_ERRORCODE(func) + no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func) #else /* CONFIG_X86_64 */ @@ -517,7 +532,7 @@ __visible noinstr void func(struct pt_regs *regs, \ /* C-Code mapping */ #define DECLARE_IDTENTRY_NMI DECLARE_IDTENTRY_RAW -#define DEFINE_IDTENTRY_NMIDEFINE_IDTENTRY_RAW +#define DEFINE_IDTENTRY_NMIno_stack_protector DEFINE_IDTENTRY_RAW #ifdef CONFIG_X86_64 #define DECLARE_IDTENTRY_MCE DECLARE_IDTENTRY_IST diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index be0f654c3095..b6291b683be1 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state); static DEFINE_PER_CPU(unsigned long, nmi_cr2); static DEFINE_PER_CPU(unsigned long, nmi_dr7); -DEFINE_IDTENTRY_RAW(exc_nmi) +DEFINE_IDTENTRY_NMI(exc_nmi) { bool irq_state; -- 2.18.4
[RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions of these macros (swapgs() and swapgs_unsafe_stack()). Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/paravirt.h | 15 +++ arch/x86/include/asm/paravirt_types.h | 17 - 2 files changed, 27 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d25cc6830e89..a4898130b36b 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -145,6 +145,21 @@ static inline void __write_cr4(unsigned long x) PVOP_VCALL1(cpu.write_cr4, x); } +static inline void swapgs(void) +{ + PVOP_VCALL0(cpu.swapgs); +} + +/* + * If swapgs is used while the userspace stack is still current, + * there's no way to call a pvop. The PV replacement *must* be + * inlined, or the swapgs instruction must be trapped and emulated. + */ +static inline void swapgs_unsafe_stack(void) +{ + PVOP_VCALL0_ALT(cpu.swapgs, "swapgs"); +} + static inline void arch_safe_halt(void) { PVOP_VCALL0(irq.safe_halt); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 0fad9f61c76a..eea9acc942a3 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -532,12 +532,12 @@ int paravirt_disable_iospace(void); pre, post, ##__VA_ARGS__) -#define PVOP_VCALL(op, clbr, call_clbr, extra_clbr, pre, post, ...) \ +#define PVOP_VCALL(op, insn, clbr, call_clbr, extra_clbr, pre, post, ...) \ ({ \ PVOP_VCALL_ARGS;\ PVOP_TEST_NULL(op); \ asm volatile(pre\ -paravirt_alt(PARAVIRT_CALL)\ +paravirt_alt(insn) \ post \ : call_clbr, ASM_CALL_CONSTRAINT \ : paravirt_type(op), \ @@ -547,12 +547,17 @@ int paravirt_disable_iospace(void); }) #define __PVOP_VCALL(op, pre, post, ...) 
\ - PVOP_VCALL(op, CLBR_ANY, PVOP_VCALL_CLOBBERS, \ - VEXTRA_CLOBBERS, \ + PVOP_VCALL(op, PARAVIRT_CALL, CLBR_ANY, \ + PVOP_VCALL_CLOBBERS, VEXTRA_CLOBBERS,\ pre, post, ##__VA_ARGS__) +#define __PVOP_VCALL_ALT(op, insn) \ + PVOP_VCALL(op, insn, CLBR_ANY, \ + PVOP_VCALL_CLOBBERS, VEXTRA_CLOBBERS,\ + "", "") + #define __PVOP_VCALLEESAVE(op, pre, post, ...) \ - PVOP_VCALL(op.func, CLBR_RET_REG, \ + PVOP_VCALL(op.func, PARAVIRT_CALL, CLBR_RET_REG,\ PVOP_VCALLEE_CLOBBERS, , \ pre, post, ##__VA_ARGS__) @@ -562,6 +567,8 @@ int paravirt_disable_iospace(void); __PVOP_CALL(rettype, op, "", "") #define PVOP_VCALL0(op) \ __PVOP_VCALL(op, "", "") +#define PVOP_VCALL0_ALT(op, insn) \ + __PVOP_VCALL_ALT(op, insn) #define PVOP_CALLEE0(rettype, op) \ __PVOP_CALLEESAVE(rettype, op, "", "") -- 2.18.4
[RFC][PATCH 05/24] x86/entry: Implement ret_from_fork body with C code
ret_from_fork is a mix of assembly code and calls to C functions. Re-implement ret_from_fork so that it calls a single C function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 18 ++ arch/x86/entry/entry_64.S | 28 +--- 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d12908ad..7ee15a12c115 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,24 @@ #include #include +__visible noinstr void return_from_fork(struct pt_regs *regs, + struct task_struct *prev, + void (*kfunc)(void *), void *kargs) +{ + schedule_tail(prev); + if (kfunc) { + /* kernel thread */ + kfunc(kargs); + /* +* A kernel thread is allowed to return here after +* successfully calling kernel_execve(). Exit to +* userspace to complete the execve() syscall. +*/ + regs->ax = 0; + } + syscall_exit_to_user_mode(regs); +} + static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 274384644b5e..73e9cd47dc83 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -276,31 +276,13 @@ SYM_FUNC_END(__switch_to_asm) */ .pushsection .text, "ax" SYM_CODE_START(ret_from_fork) - UNWIND_HINT_EMPTY - movq%rax, %rdi - callschedule_tail /* rdi: 'prev' task parameter */ - - testq %rbx, %rbx /* from kernel_thread? 
*/ - jnz 1f /* kernel threads are uncommon */ - -2: UNWIND_HINT_REGS - movq%rsp, %rdi - callsyscall_exit_to_user_mode /* returns with IRQs disabled */ + movq%rsp, %rdi /* pt_regs */ + movq%rax, %rsi /* 'prev' task parameter */ + movq%rbx, %rdx /* kernel thread func */ + movq%r12, %rcx /* kernel thread arg */ + callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode - -1: - /* kernel thread */ - UNWIND_HINT_EMPTY - movq%r12, %rdi - CALL_NOSPEC rbx - /* -* A kernel thread is allowed to return here after successfully -* calling kernel_execve(). Exit to userspace to complete the execve() -* syscall. -*/ - movq$0, RAX(%rsp) - jmp 2b SYM_CODE_END(ret_from_fork) .popsection -- 2.18.4
[RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
Extend PTI user mappings so that more kernel entry code can be executed with the user page-table. To do so, we need to map syscall and interrupt entry code, per cpu offsets (__per_cpu_offset, which is used some in entry code), the stack canary, and the PTI stack (which is defined per task). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 2 -- arch/x86/mm/pti.c | 14 ++ kernel/fork.c | 22 ++ 3 files changed, 36 insertions(+), 2 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6e0b5b010e0b..458af12ed9a1 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -274,7 +274,6 @@ SYM_FUNC_END(__switch_to_asm) * rbx: kernel thread func (NULL for user thread) * r12: kernel thread arg */ -.pushsection .text, "ax" SYM_CODE_START(ret_from_fork) UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs */ @@ -284,7 +283,6 @@ SYM_CODE_START(ret_from_fork) callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode SYM_CODE_END(ret_from_fork) -.popsection .macro DEBUG_ENTRY_ASSERT_IRQS_OFF #ifdef CONFIG_DEBUG_ENTRY diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 71ca245d7b38..f4f3d9ae4449 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -465,6 +465,11 @@ static void __init pti_clone_user_shared(void) */ pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); + /* +* Map fixed_percpu_data to get the stack canary. +*/ + if (IS_ENABLED(CONFIG_STACKPROTECTOR)) + pti_clone_percpu_page(&per_cpu(fixed_percpu_data, cpu)); } } @@ -505,6 +510,15 @@ static void pti_clone_entry_text(void) pti_clone_init_pgtable((unsigned long) __entry_text_start, (unsigned long) __entry_text_end, PTI_CLONE_PMD); + + /* + * Syscall and interrupt entry code (which is in the noinstr + * section) will be entered with the user page-table, so that + * code has to be mapped in. 
+ */ + pti_clone_init_pgtable((unsigned long) __noinstr_text_start, + (unsigned long) __noinstr_text_end, + PTI_CLONE_PMD); } /* diff --git a/kernel/fork.c b/kernel/fork.c index 6d266388d380..31cd77dbdba3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -999,6 +999,25 @@ static void mm_init_uprobes_state(struct mm_struct *mm) #endif } +static void mm_map_task(struct mm_struct *mm, struct task_struct *tsk) +{ +#ifdef CONFIG_PAGE_TABLE_ISOLATION + unsigned long addr; + + if (!tsk || !static_cpu_has(X86_FEATURE_PTI)) + return; + + /* +* Map the task stack after the kernel stack into the user +* address space, so that this stack can be used when entering +* syscall or interrupt from user mode. +*/ + BUG_ON(!task_stack_page(tsk)); + addr = (unsigned long)task_top_of_kernel_stack(tsk); + pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE); +#endif +} + static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns) { @@ -1043,6 +1062,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (init_new_context(p, mm)) goto fail_nocontext; + mm_map_task(mm, p); + mm->user_ns = get_user_ns(user_ns); return mm; @@ -1404,6 +1425,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk) vmacache_flush(tsk); if (clone_flags & CLONE_VM) { + mm_map_task(oldmm, tsk); mmget(oldmm); mm = oldmm; goto good_mm; -- 2.18.4
[RFC][PATCH 10/24] x86/pti: Introduce per-task PTI trampoline stack
Double the size of the kernel stack when using PTI. The entire stack is mapped into the kernel address space, and the top half of the stack (the PTI stack) is also mapped into the user address space. The PTI stack will be used as a per-task trampoline stack instead of the current per-cpu trampoline stack. This will allow running more code on the trampoline stack, in particular code that schedules the task out. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/page_64_types.h | 36 +++- arch/x86/include/asm/processor.h | 3 +++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 3f49dac03617..733accc20fdb 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -12,7 +12,41 @@ #define KASAN_STACK_ORDER 0 #endif -#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) +#ifdef CONFIG_PAGE_TABLE_ISOLATION +/* + * PTI doubles the size of the stack. The entire stack is mapped into + * the kernel address space. However, only the top half of the stack is + * mapped into the user address space. + * + * On syscall or interrupt, user mode enters the kernel with the user + * page-table, and the stack pointer is switched to the top of the + * stack (which is mapped in the user address space and in the kernel). + * The syscall/interrupt handler will then later decide when to switch + * to the kernel address space, and to switch to the top of the kernel + * stack which is only mapped in the kernel. 
+ * + * +-+ + * | | ^ ^ + * | kernel-only | | KERNEL_STACK_SIZE | + * |stack| | | + * | | V | + * +-+ <- top of kernel stack | THREAD_SIZE + * | | ^ | + * | kernel and | | KERNEL_STACK_SIZE | + * | PTI stack | | | + * | | V v + * +-+ <- top of stack + */ +#define PTI_STACK_ORDER 1 +#else +#define PTI_STACK_ORDER 0 +#endif + +#define KERNEL_STACK_ORDER 2 +#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER) + +#define THREAD_SIZE_ORDER \ + (KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER) #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) #define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 82a08b585818..47b1b806535b 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x) #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1)) +#define task_top_of_kernel_stack(task) \ + ((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE)) + #define task_pt_regs(task) \ ({ \ unsigned long __ptr = (unsigned long)task_stack_page(task); \ -- 2.18.4
[RFC][PATCH 19/24] x86/pti: Execute page fault handler on the kernel stack
After a page fault from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the page fault handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leak to the userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 17 + arch/x86/mm/fault.c | 2 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 0c5d9f027112..a6725afaaec0 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -31,6 +31,13 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ func(arg1, arg2)) +#define CALL_ON_STACK_3(stack, func, arg1, arg2, arg3) \ + ((stack) ? \ +asm_call_on_stack_3(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2), \ + (void *)(arg3)) : \ +func(arg1, arg2, arg3)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack). 
If we @@ -66,6 +73,16 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_idt_pagefault(void (*func)(struct pt_regs *, unsigned long, + unsigned long), + struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + CALL_ON_STACK_3(pti_kernel_stack(regs), + func, regs, error_code, address); +} + static __always_inline void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) { diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 82bf37a5c9ec..b9d03603d95d 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1482,7 +1482,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) state = irqentry_enter(regs); instrumentation_begin(); - handle_page_fault(regs, error_code, address); + run_idt_pagefault(handle_page_fault, regs, error_code, address); instrumentation_end(); irqentry_exit(regs, state); -- 2.18.4
[RFC][PATCH 20/24] x86/pti: Execute NMI handler on the kernel stack
After an NMI from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the NMI handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/kernel/nmi.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 4bc77aaf1303..be0f654c3095 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi) inc_irq_stat(__nmi_count); - if (!ignore_nmis) - default_do_nmi(regs); + if (!ignore_nmis) { + if (user_mode(regs)) { + /* +* If we come from userland then we are on the +* trampoline stack, switch to the kernel stack +* to execute the NMI handler. +*/ + run_idt(default_do_nmi, regs); + } else { + default_do_nmi(regs); + } + } idtentry_exit_nmi(regs, irq_state); -- 2.18.4
[RFC][PATCH 16/24] x86/pti: Execute IDT handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leak to the userland through the stack. For now, only changes IDT handlers which have no argument other than the pt_regs registers. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 43 +++-- arch/x86/kernel/cpu/mce/core.c | 2 +- arch/x86/kernel/traps.c | 4 +-- 3 files changed, 44 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 4b4aca2b1420..3595a31947b3 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -10,10 +10,49 @@ #include #include +#include bool idtentry_enter_nmi(struct pt_regs *regs); void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); +/* + * The CALL_ON_STACK_* macro call the specified function either directly + * if no stack is provided, or on the specified stack. + */ +#define CALL_ON_STACK_1(stack, func, arg1) \ + ((stack) ? \ +asm_call_on_stack_1(stack, \ + (void (*)(void))(func), (void *)(arg1)) : \ +func(arg1)) + +/* + * Functions to return the top of the kernel stack if we are using the + * user page-table (and thus not running with the kernel stack). If we + * are using the kernel page-table (and so already using the kernel + * stack) when it returns NULL. + */ +static __always_inline void *pti_kernel_stack(struct pt_regs *regs) +{ + unsigned long stack; + + if (pti_enabled() && user_mode(regs)) { + stack = (unsigned long)task_top_of_kernel_stack(current); + return (void *)(stack - 8); + } else { + return NULL; + } +} + +/* + * Wrappers to run an IDT handler on the kernel stack if we are not + * already using this stack. 
+ */ +static __always_inline +void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) +{ + CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -55,7 +94,7 @@ __visible noinstr void func(struct pt_regs *regs) \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs);\ + run_idt(__##func, regs);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -271,7 +310,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ __irq_enter_raw(); \ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs);\ + run_idt(__##func, regs);\ __irq_exit_raw(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 4102b866e7c0..9407c3cd9355 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -2035,7 +2035,7 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check) unsigned long dr7; dr7 = local_db_save(); - exc_machine_check_user(regs); + run_idt(exc_machine_check_user, regs); local_db_restore(dr7); } #else diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 09b22a611d99..5161385b3670 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -257,7 +257,7 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op) state = irqentry_enter(regs); instrumentation_begin(); - handle_invalid_op(regs); + run_idt(handle_invalid_op, regs); instrumentation_end(); irqentry_exit(regs, state); } @@ -647,7 +647,7 @@ DEFINE_IDTENTRY_RAW(exc_int3) if (user_mode(regs)) { irqentry_enter_from_user_mode(regs); instrumentation_begin(); - do_int3_user(regs); + run_idt(do_int3_us
[RFC][PATCH 06/24] x86/pti: Provide C variants of PTI switch CR3 macros
Page Table Isolation (PTI) uses assembly macros to switch the CR3 register between kernel and user page-tables. Add C functions which implement the same features. For now, these C functions are not used but they will eventually replace the assembly macros. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 44 +++ arch/x86/include/asm/entry-common.h | 84 + 2 files changed, 128 insertions(+) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 7ee15a12c115..d09b1ded5287 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -343,3 +343,47 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs) } } #endif /* CONFIG_XEN_PV */ + +#ifdef CONFIG_PAGE_TABLE_ISOLATION + +static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) +{ + unsigned long cr3, saved_cr3; + + if (!static_cpu_has(X86_FEATURE_PTI)) + return 0; + + saved_cr3 = cr3 = __read_cr3(); + if (cr3 & PTI_USER_PGTABLE_MASK) { + adjust_kernel_cr3(&cr3); + native_write_cr3(cr3); + } + + return saved_cr3; +} + +static __always_inline void restore_cr3(unsigned long cr3) +{ + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + if (static_cpu_has(X86_FEATURE_PCID)) { + if (cr3 & PTI_USER_PGTABLE_MASK) + adjust_user_cr3(&cr3); + else + cr3 |= X86_CR3_PCID_NOFLUSH; + } + + native_write_cr3(cr3); +} + +#else /* CONFIG_PAGE_TABLE_ISOLATION */ + +static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) +{ + return 0; +} + +static __always_inline void restore_cr3(unsigned long cr3) {} + +#endif /* CONFIG_PAGE_TABLE_ISOLATION */ diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 6fe54b2813c1..b05b212f5ebc 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -7,6 +7,7 @@ #include #include #include +#include /* Check that the stack and regs on entry from user mode are sane. 
*/ static __always_inline void arch_check_user_regs(struct pt_regs *regs) @@ -81,4 +82,87 @@ static __always_inline void arch_exit_to_user_mode(void) } #define arch_exit_to_user_mode arch_exit_to_user_mode +#ifndef MODULE +#ifdef CONFIG_PAGE_TABLE_ISOLATION + +/* + * PAGE_TABLE_ISOLATION PGDs are 8k. Flip bit 12 to switch between the two + * halves: + */ +#define PTI_USER_PGTABLE_BIT PAGE_SHIFT +#define PTI_USER_PGTABLE_MASK (1 << PTI_USER_PGTABLE_BIT) +#define PTI_USER_PCID_BIT X86_CR3_PTI_PCID_USER_BIT +#define PTI_USER_PCID_MASK (1 << PTI_USER_PCID_BIT) +#define PTI_USER_PGTABLE_AND_PCID_MASK \ + (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK) + +static __always_inline void adjust_kernel_cr3(unsigned long *cr3) +{ + if (static_cpu_has(X86_FEATURE_PCID)) + *cr3 |= X86_CR3_PCID_NOFLUSH; + + /* +* Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3 +* at kernel pagetables. +*/ + *cr3 &= ~PTI_USER_PGTABLE_AND_PCID_MASK; +} + +static __always_inline void adjust_user_cr3(unsigned long *cr3) +{ + unsigned short mask; + unsigned long asid; + + /* +* Test if the ASID needs a flush. 
+*/ + asid = *cr3 & 0x7ff; + mask = this_cpu_read(cpu_tlbstate.user_pcid_flush_mask); + if (mask & (1 << asid)) { + /* Flush needed, clear the bit */ + this_cpu_and(cpu_tlbstate.user_pcid_flush_mask, ~(1 << asid)); + } else { + *cr3 |= X86_CR3_PCID_NOFLUSH; + } +} + +static __always_inline void switch_to_kernel_cr3(void) +{ + unsigned long cr3; + + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + cr3 = __read_cr3(); + adjust_kernel_cr3(&cr3); + native_write_cr3(cr3); +} + +static __always_inline void switch_to_user_cr3(void) +{ + unsigned long cr3; + + if (!static_cpu_has(X86_FEATURE_PTI)) + return; + + cr3 = __read_cr3(); + if (static_cpu_has(X86_FEATURE_PCID)) { + adjust_user_cr3(&cr3); + /* Flip the ASID to the user version */ + cr3 |= PTI_USER_PCID_MASK; + } + + /* Flip the PGD to the user version */ + cr3 |= PTI_USER_PGTABLE_MASK; + native_write_cr3(cr3); +} + +#else /* CONFIG_PAGE_TABLE_ISOLATION */ + +static inline void switch_to_kernel_cr3(void) {} +static inline void switch_to_user_cr3(void) {} + +#endif /* CONFIG_PAGE_TABLE_ISOLATION */ +#endif /* MODULE */ + #endif -- 2.18.4
[RFC][PATCH 04/24] x86/sev-es: Define a setup stack function for the VC idtentry
The #VC exception assembly entry code uses C code (vc_switch_off_ist) to get and configure a stack, then returns to assembly to switch to that stack and finally invokes the C exception handler function. To pave the way for deferring CR3 switch from assembly to C code, define a setup stack function for the VC idtentry. This function is used to get and configure the stack before invoking the idtentry handler. For now, the setup stack function is just a wrapper around the vc_switch_off_ist() function but it will eventually also contain the C code to switch CR3. The vc_switch_off_ist() function is also refactored to just return the stack pointer, and the stack configuration is done in the setup stack function (so that the stack can also be used to propagate CR3 switch information to the idtentry handler for switching CR3 back). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 8 +++- arch/x86/include/asm/idtentry.h | 14 ++ arch/x86/include/asm/traps.h| 2 +- arch/x86/kernel/sev-es.c| 34 + arch/x86/kernel/traps.c | 19 +++--- 5 files changed, 55 insertions(+), 22 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 51df9f1871c6..274384644b5e 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -546,13 +546,11 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_REGS /* -* Switch off the IST stack to make it free for nested exceptions. The -* vc_switch_off_ist() function will switch back to the interrupted -* stack if it is safe to do so. If not it switches to the VC fall-back -* stack. +* Call the setup stack function. It configures and returns +* the stack we should be using to run the exception handler. 
*/ movq%rsp, %rdi /* pt_regs pointer */ - callvc_switch_off_ist + callsetup_stack_\cfunc movq%rax, %rsp /* Switch to new stack */ UNWIND_HINT_REGS diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index b2442eb0ac2f..4b4aca2b1420 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -318,6 +318,7 @@ static __always_inline void __##func(struct pt_regs *regs) */ #define DECLARE_IDTENTRY_VC(vector, func) \ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func); \ + __visible noinstr unsigned long setup_stack_##func(struct pt_regs *regs); \ __visible noinstr void ist_##func(struct pt_regs *regs, unsigned long error_code); \ __visible noinstr void safe_stack_##func(struct pt_regs *regs, unsigned long error_code) @@ -380,6 +381,19 @@ static __always_inline void __##func(struct pt_regs *regs) #define DEFINE_IDTENTRY_VC_IST(func) \ DEFINE_IDTENTRY_RAW_ERRORCODE(ist_##func) +/** + * DEFINE_IDTENTRY_VC_SETUP_STACK - Emit code for setting up the stack to + run the VMM communication handler + * @func: Function name of the entry point + * + * The stack setup code is executed before the VMM communication handler. + * It configures and returns the stack to switch to before running the + * VMM communication handler. 
+ */ +#define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ + __visible noinstr \ + unsigned long setup_stack_##func(struct pt_regs *regs) + /** * DEFINE_IDTENTRY_VC - Emit code for VMM communication handler * @func: Function name of the entry point diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 7f7200021bd1..cfcc9d34d2a0 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -15,7 +15,7 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); asmlinkage __visible notrace struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s); void __init trap_init(void); -asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs); +asmlinkage __visible noinstr unsigned long vc_switch_off_ist(struct pt_regs *eregs); #endif #ifdef CONFIG_X86_F00F_BUG diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 0bd1a0fc587e..bd977c917cd6 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -1349,6 +1349,40 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication) instrumentation_end(); } +struct exc_vc_frame { + /* pt_regs should be first */ + struct pt_regs regs; +}; + +DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication) +{ + struct exc_vc_frame *frame; + unsigned long sp; + + /* +* Switch off the IST stack to make it free for nested exceptions. +* The vc_switch_off_ist() function will switch back to the +* interrupted stack if
[RFC][PATCH 02/24] x86/entry: Update asm_call_on_stack to support more function arguments
Update the asm_call_on_stack() function so that it can be invoked with a function having up to three arguments instead of only one. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 15 +++ arch/x86/include/asm/irq_stack.h | 8 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index cad08703c4ad..c42948aca0a8 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -759,9 +759,14 @@ SYM_CODE_END(.Lbad_gs) /* * rdi: New stack pointer points to the top word of the stack * rsi: Function pointer - * rdx: Function argument (can be NULL if none) + * rdx: Function argument 1 (can be NULL if none) + * rcx: Function argument 2 (can be NULL if none) + * r8 : Function argument 3 (can be NULL if none) */ SYM_FUNC_START(asm_call_on_stack) +SYM_FUNC_START(asm_call_on_stack_1) +SYM_FUNC_START(asm_call_on_stack_2) +SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) /* @@ -777,15 +782,17 @@ SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) */ mov %rsp, (%rdi) mov %rdi, %rsp - /* Move the argument to the right place */ + mov %rsi, %rax + /* Move arguments to the right place */ mov %rdx, %rdi - + mov %rcx, %rsi + mov %r8, %rdx 1: .pushsection .discard.instr_begin .long 1b - . 
.popsection - CALL_NOSPEC rsi + CALL_NOSPEC rax 2: .pushsection .discard.instr_end diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 775816965c6a..359427216336 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -13,6 +13,14 @@ static __always_inline bool irqstack_active(void) } void asm_call_on_stack(void *sp, void (*func)(void), void *arg); + +void asm_call_on_stack_1(void *sp, void (*func)(void), +void *arg1); +void asm_call_on_stack_2(void *sp, void (*func)(void), +void *arg1, void *arg2); +void asm_call_on_stack_3(void *sp, void (*func)(void), +void *arg1, void *arg2, void *arg3); + void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), -- 2.18.4
[RFC][PATCH 07/24] x86/entry: Fill ESPFIX stack using C code
The ESPFIX stack is filled using assembly code. Move this code to a C function so that it is easier to read and modify. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 62 ++--- arch/x86/kernel/espfix_64.c | 41 2 files changed, 72 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 73e9cd47dc83..6e0b5b010e0b 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -684,8 +684,10 @@ native_irq_return_ldt: * long (see ESPFIX_STACK_SIZE). espfix_waddr points to the bottom * of the ESPFIX stack. * -* We clobber RAX and RDI in this code. We stash RDI on the -* normal stack and RAX on the ESPFIX stack. +* We call into C code to fill the ESPFIX stack. We stash registers +* that the C function can clobber on the normal stack. The user RAX +* is stashed first so that it is adjacent to the iret frame which +* will be copied to the ESPFIX stack. * * The ESPFIX stack layout we set up looks like this: * @@ -699,39 +701,37 @@ native_irq_return_ldt: * --- bottom of ESPFIX stack --- */ - pushq %rdi/* Stash user RDI */ - SWAPGS /* to kernel GS */ - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi /* to kernel CR3 */ - - movqPER_CPU_VAR(espfix_waddr), %rdi - movq%rax, (0*8)(%rdi) /* user RAX */ - movq(1*8)(%rsp), %rax /* user RIP */ - movq%rax, (1*8)(%rdi) - movq(2*8)(%rsp), %rax /* user CS */ - movq%rax, (2*8)(%rdi) - movq(3*8)(%rsp), %rax /* user RFLAGS */ - movq%rax, (3*8)(%rdi) - movq(5*8)(%rsp), %rax /* user SS */ - movq%rax, (5*8)(%rdi) - movq(4*8)(%rsp), %rax /* user RSP */ - movq%rax, (4*8)(%rdi) - /* Now RAX == RSP. */ - - andl$0xffff0000, %eax /* RAX = (RSP & 0xffff0000) */ + /* save registers */ + pushq %rax + pushq %rdi + pushq %rsi + pushq %rdx + pushq %rcx + pushq %r8 + pushq %r9 + pushq %r10 + pushq %r11 /* -* espfix_stack[31:16] == 0. The page tables are set up such that -* (espfix_stack | (X & 0xffff0000)) points to a read-only alias of -* espfix_waddr for any X. 
That is, there are 65536 RO aliases of -* the same page. Set up RSP so that RSP[31:16] contains the -* respective 16 bits of the /userspace/ RSP and RSP nonetheless -* still points to an RO alias of the ESPFIX stack. +* fill_espfix_stack will copy the iret+rax frame to the ESPFIX +* stack and return with RAX containing a pointer to the ESPFIX +* stack. */ - orq PER_CPU_VAR(espfix_stack), %rax + leaq8*8(%rsp), %rdi /* points to the iret+rax frame */ + callfill_espfix_stack - SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi - SWAPGS /* to user GS */ - popq%rdi/* Restore user RDI */ + /* +* RAX contains a pointer to the ESPFIX, so restore registers but +* RAX. RAX will be restored from the ESPFIX stack. +*/ + popq%r11 + popq%r10 + popq%r9 + popq%r8 + popq%rcx + popq%rdx + popq%rsi + popq%rdi movq%rax, %rsp UNWIND_HINT_IRET_REGS offset=8 diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index 4fe7af58cfe1..6a81c4bd1542 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -33,6 +33,7 @@ #include #include #include +#include /* * Note: we only need 6*8 = 48 bytes for the espfix stack, but round @@ -205,3 +206,43 @@ void init_espfix_ap(int cpu) per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page + (addr & ~PAGE_MASK); } + +/* + * iret frame with an additional user_rax register. + */ +struct iret_rax_frame { + unsigned long user_rax; + unsigned long rip; + unsigned long cs; + unsigned long rflags; + unsigned long rsp; + unsigned long ss; +}; + +noinstr unsigned long fill_espfix_stack(struct iret_rax_frame *frame) +{ + struct iret_rax_frame *espfix_frame; + unsigned long rsp; + + native_swapgs(); + switch_to_kernel_cr3(); + + espfix_frame = (struct iret_rax_frame *)this_cpu_read(espfix_waddr); + *espfix_frame = *frame; + + /* +* espfix_stack[31:16] == 0. The page tables are set up such that +* (espfix_stack | (X & 0xffff0000)) points to a re
[RFC][PATCH 09/24] x86/entry: Add C version of paranoid_entry/exit
paranoid_entry/exit are assembly macros. Provide C versions of these macros (kernel_paranoid_entry() and kernel_paranoid_exit()). The C functions are functionally equivalent to the assembly macros, except that kernel_paranoid_entry() doesn't save registers in pt_regs like paranoid_entry does. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 157 arch/x86/include/asm/entry-common.h | 10 ++ 2 files changed, 167 insertions(+) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d09b1ded5287..54d0931801e1 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -387,3 +387,160 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) static __always_inline void restore_cr3(unsigned long cr3) {} #endif /* CONFIG_PAGE_TABLE_ISOLATION */ + +/* + * "Paranoid" entry path from exception stack. Ensure that the CR3 and + * GS registers are correctly set for the kernel. Return GSBASE related + * information in kernel_entry_state depending on the availability of + * the FSGSBASE instructions: + * + * FSGSBASEkernel_entry_state + * Nswapgs=true -> SWAPGS on exit + * swapgs=false -> no SWAPGS on exit + * + * Ygsbase=GSBASE value at entry, must be restored in + * kernel_paranoid_exit() + * + * Note that per-cpu variables are accessed using the GS register, + * so paranoid entry code cannot access per-cpu variables before + * kernel_paranoid_entry() has been called. + */ +noinstr void kernel_paranoid_entry(struct kernel_entry_state *state) +{ + unsigned long gsbase; + unsigned int cpu; + + /* +* Save CR3 in the kernel entry state. This value will be +* restored, verbatim, at exit. Needed if the paranoid entry +* interrupted another entry that already switched to the user +* CR3 value but has not yet returned to userspace. +* +* This is also why CS (stashed in the "iret frame" by the +* hardware at entry) can not be used: this may be a return +* to kernel code, but with a user CR3 value. 
+* +* Switching CR3 does not depend on kernel GSBASE so it can +* be done before switching to the kernel GSBASE. This is +* required for FSGSBASE because the kernel GSBASE has to +* be retrieved from a kernel internal table. +*/ + state->cr3 = save_and_switch_to_kernel_cr3(); + + /* +* Handling GSBASE depends on the availability of FSGSBASE. +* +* Without FSGSBASE the kernel enforces that negative GSBASE +* values indicate kernel GSBASE. With FSGSBASE no assumptions +* can be made about the GSBASE value when entering from user +* space. +*/ + if (static_cpu_has(X86_FEATURE_FSGSBASE)) { + /* +* Read the current GSBASE and store it in the kernel +* entry state unconditionally, retrieve and set the +* current CPUs kernel GSBASE. The stored value has to +* be restored at exit unconditionally. +* +* The unconditional write to GS base below ensures that +* no subsequent loads based on a mispredicted GS base +* can happen, therefore no LFENCE is needed here. +*/ + state->gsbase = rdgsbase(); + + /* +* Fetch the per-CPU GSBASE value for this processor. We +* normally use %gs for accessing per-CPU data, but we +* are setting up %gs here and obviously can not use %gs +* itself to access per-CPU data. +*/ + if (IS_ENABLED(CONFIG_SMP)) { + /* +* Load CPU from the GDT. Do not use RDPID, +* because KVM loads guest's TSC_AUX on vm-entry +* and may not restore the host's value until +* the CPU returns to userspace. Thus the kernel +* would consume a guest's TSC_AUX if an NMI +* arrives while running KVM's run loop. +*/ + asm_inline volatile ("lsl %[seg],%[p]" +: [p] "=r" (cpu) +: [seg] "r" (__CPUNODE_SEG)); + + cpu &= VDSO_CPUNODE_MASK; + gsbase = __per_cpu_offset[cpu]; + } else { + gsbase = *pcpu_unit_offsets; + } + + wrgsbase(gsbase); + + } else { + /* +* The kernel-enforced convention is a negative GSBASE +* indicates a kernel value. No SWAPGS needed on entry
[RFC][PATCH 03/24] x86/entry: Consolidate IST entry from userspace
Most IST entries (NMI, MCE, DEBUG, VC but not DF) handle an entry from userspace the same way: they switch from the IST stack to the kernel stack, call the handler and then return to userspace. However, NMI, MCE/DEBUG and VC implement this same behavior using different code paths, so consolidate this code into a single assembly macro. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 137 +- 1 file changed, 75 insertions(+), 62 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index c42948aca0a8..51df9f1871c6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -316,6 +316,72 @@ SYM_CODE_END(ret_from_fork) #endif .endm +/* + * Macro to handle an IDT entry defined with the IST mechanism. It should + * be invoked at the beginning of the IDT handler with a pointer to the C + * function (cfunc_user) to invoke if the IDT was entered from userspace. + * + * If the IDT was entered from userspace, the macro will switch from the + * IST stack to the regular task stack, call the provided function and + * return to userland. + * + * If IDT was entered from the kernel, the macro will just return. + */ +.macro ist_entry_user cfunc_user has_error_code=0 + UNWIND_HINT_IRET_REGS + ASM_CLAC + + /* only process entry from userspace */ + .if \has_error_code == 1 + testb $3, CS-ORIG_RAX(%rsp) + jz .List_entry_from_kernel_\@ + .else + testb $3, CS-RIP(%rsp) + jz .List_entry_from_kernel_\@ + pushq $-1 /* ORIG_RAX: no syscall to restart */ + .endif + + /* Use %rdx as a temp variable */ + pushq %rdx + + /* +* Switch from the IST stack to the regular task stack and +* use the provided entry point. 
+*/ + swapgs + cld + FENCE_SWAPGS_USER_ENTRY + SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx + movq%rsp, %rdx + movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp + UNWIND_HINT_IRET_REGS base=%rdx offset=8 + pushq 6*8(%rdx) /* pt_regs->ss */ + pushq 5*8(%rdx) /* pt_regs->rsp */ + pushq 4*8(%rdx) /* pt_regs->flags */ + pushq 3*8(%rdx) /* pt_regs->cs */ + pushq 2*8(%rdx) /* pt_regs->rip */ + UNWIND_HINT_IRET_REGS + pushq 1*8(%rdx) /* pt_regs->orig_ax */ + PUSH_AND_CLEAR_REGS rdx=(%rdx) + ENCODE_FRAME_POINTER + + /* +* At this point we no longer need to worry about stack damage +* due to nesting -- we're on the normal thread stack and we're +* done with the IST stack. +*/ + + mov %rsp, %rdi + .if \has_error_code == 1 + movqORIG_RAX(%rsp), %rsi/* get error code into 2nd argument*/ + movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ + .endif + call\cfunc_user + jmp swapgs_restore_regs_and_return_to_usermode + +.List_entry_from_kernel_\@: +.endm + /** * idtentry_body - Macro to emit code calling the C function * @cfunc: C function to be called @@ -417,18 +483,15 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_mce_db vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - - pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user noist_\cfunc + /* Entry from kernel */ + + pushq $-1 /* ORIG_RAX: no syscall to restart */ /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ callparanoid_entry @@ -440,10 +503,6 @@ SYM_CODE_START(\asmsym) jmp paranoid_exit - /* Switch to the regular task stack and use the noist entry point */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body noist_\cfunc, has_error_code=0 - _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) .endm @@ -472,15 +531,11 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_vc vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user safe_stack_\cfunc, has_error_code=1 /* * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. @@ -517,10 +572,6 @@ SYM_CODE_START(\asmsym) */ jmp paranoid_exit - /* Switch to the regular task stack */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body safe_stack_\cfunc, has_e
[RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code
[Resending without messing up email addresses (hopefully!), Please reply using this email thread to have correct emails. Sorry for the noise.] With Page Table Isolation (PTI), syscalls as well as interrupts and exceptions occurring in userspace enter the kernel with a user page-table. The kernel entry code will then switch the page-table from the user page-table to the kernel page-table by updating the CR3 control register. This CR3 switch is currently done early in the kernel entry sequence using assembly code. This RFC proposes to defer the PTI CR3 switch until we reach C code. The benefit is that this simplifies the assembly entry code, and makes the PTI CR3 switch code easier to understand. This also paves the way for further possible projects such as an easier integration of Address Space Isolation (ASI), or the possibility to execute some selected syscall or interrupt handlers without switching to the kernel page-table (and thus avoid the PTI page-table switch overhead). Deferring CR3 switch to C code means that we need to run more of the kernel entry code with the user page-table. To do so, we need to: - map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code); - map additional data used in the entry code (such as stack canary); - run more entry code on the trampoline stack (which is mapped both in the kernel and in the user page-table) until we switch to the kernel page-table and then switch to the kernel stack; - have a per-task trampoline stack instead of a per-cpu trampoline stack, so the task can be scheduled out while it hasn't switched to the kernel stack. Note that, for now, the CR3 switch can only be pushed as far as interrupts remain disabled in the entry code. This is because the CR3 switch is done based on the privilege level from the CS register from the interrupt frame. I plan to fix this but that's some extra complication (need to track if the user page-table is used or not). 
The proposed patchset is in RFC state to get early feedback about this proposal. The code survives running a kernel build and LTP. Note that changes are only for 64-bit at the moment; I haven't looked at 32-bit yet but I will definitely check it. Code is based on v5.10-rc3. Thanks, alex. ----- Alexandre Chartre (24): x86/syscall: Add wrapper for invoking syscall function x86/entry: Update asm_call_on_stack to support more function arguments x86/entry: Consolidate IST entry from userspace x86/sev-es: Define a setup stack function for the VC idtentry x86/entry: Implement ret_from_fork body with C code x86/pti: Provide C variants of PTI switch CR3 macros x86/entry: Fill ESPFIX stack using C code x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK x86/entry: Add C version of paranoid_entry/exit x86/pti: Introduce per-task PTI trampoline stack x86/pti: Function to clone page-table entries from a specified mm x86/pti: Function to map per-cpu page-table entry x86/pti: Extend PTI user mappings x86/pti: Use PTI stack instead of trampoline stack x86/pti: Execute syscall functions on the kernel stack x86/pti: Execute IDT handlers on the kernel stack x86/pti: Execute IDT handlers with error code on the kernel stack x86/pti: Execute system vector handlers on the kernel stack x86/pti: Execute page fault handler on the kernel stack x86/pti: Execute NMI handler on the kernel stack x86/entry: Disable stack-protector for IST entry C handlers x86/entry: Defer paranoid entry/exit to C code x86/entry: Remove paranoid_entry and paranoid_exit x86/pti: Defer CR3 switch to C code for non-IST and syscall entries arch/x86/entry/common.c | 259 - arch/x86/entry/entry_64.S | 513 -- arch/x86/entry/entry_64_compat.S | 22 -- arch/x86/include/asm/entry-common.h | 108 ++ arch/x86/include/asm/idtentry.h | 153 +++- arch/x86/include/asm/irq_stack.h | 11 + arch/x86/include/asm/page_64_types.h | 36 +- arch/x86/include/asm/paravirt.h | 15 + arch/x86/include/asm/paravirt_types.h | 17 +- 
arch/x86/include/asm/processor.h | 3 + arch/x86/include/asm/pti.h| 18 + arch/x86/include/asm/switch_to.h | 7 +- arch/x86/include/asm/traps.h | 2 +- arch/x86/kernel/cpu/mce/core.c| 7 +- arch/x86/kernel/espfix_64.c | 41 ++ arch/x86/kernel/nmi.c | 34 +- arch/x86/kernel/sev-es.c | 52 +++ arch/x86/kernel/traps.c | 61 +-- arch/x86/mm/fault.c | 11 +- arch/x86/mm/pti.c | 71 ++-- kernel/fork.c | 22 ++ 21 files changed, 1002 insertions(+), 461 deletions(-) -- 2.18.4
[RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function
Add a wrapper function for invoking a syscall function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 870efeec8bda..d12908ad 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,15 @@ #include #include +static __always_inline void run_syscall(sys_call_ptr_t sysfunc, + struct pt_regs *regs) +{ + if (!sysfunc) + return; + + regs->ax = sysfunc(regs); +} + #ifdef CONFIG_X86_64 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) { @@ -43,15 +52,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) instrumentation_begin(); if (likely(nr < NR_syscalls)) { nr = array_index_nospec(nr, NR_syscalls); - regs->ax = sys_call_table[nr](regs); + run_syscall(sys_call_table[nr], regs); #ifdef CONFIG_X86_X32_ABI } else if (likely((nr & __X32_SYSCALL_BIT) && (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) { nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT, X32_NR_syscalls); - regs->ax = x32_sys_call_table[nr](regs); + run_syscall(x32_sys_call_table[nr], regs); #endif } + instrumentation_end(); syscall_exit_to_user_mode(regs); } @@ -75,7 +85,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs, if (likely(nr < IA32_NR_syscalls)) { instrumentation_begin(); nr = array_index_nospec(nr, IA32_NR_syscalls); - regs->ax = ia32_sys_call_table[nr](regs); + run_syscall(ia32_sys_call_table[nr], regs); instrumentation_end(); } } -- 2.18.4
Re: [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code
Sorry but it looks like email addresses are messed up in my emails. Our email server has a new security "feature" which has the good idea to change external email addresses. I will resend the patches with the correct addresses once I've found how to prevent this mess. alex. On 11/9/20 12:22 PM, Alexandre Chartre wrote: With Page Table Isolation (PTI), syscalls as well as interrupts and exceptions occurring in userspace enter the kernel with a user page-table. The kernel entry code will then switch the page-table from the user page-table to the kernel page-table by updating the CR3 control register. This CR3 switch is currently done early in the kernel entry sequence using assembly code. This RFC proposes to defer the PTI CR3 switch until we reach C code. The benefit is that this simplifies the assembly entry code, and make the PTI CR3 switch code easier to understand. This also paves the way for further possible projects such an easier integration of Address Space Isolation (ASI), or the possibilily to execute some selected syscall or interrupt handlers without switching to the kernel page-table (and thus avoid the PTI page-table switch overhead). Deferring CR3 switch to C code means that we need to run more of the kernel entry code with the user page-table. To do so, we need to: - map more syscall, interrupt and exception entry code into the user page-table (map all noinstr code); - map additional data used in the entry code (such as stack canary); - run more entry code on the trampoline stack (which is mapped both in the kernel and in the user page-table) until we switch to the kernel page-table and then switch to the kernel stack; - have a per-task trampoline stack instead of a per-cpu trampoline stack, so the task can be scheduled out while it hasn't switched to the kernel stack. Note that, for now, the CR3 switch can only be pushed as far as interrupts remain disabled in the entry code. 
This is because the CR3 switch is done based on the privilege level from the CS register from the interrupt frame. I plan to fix this but that's some extra complication (need to track if the user page-table is used or not). The proposed patchset is in RFC state to get early feedback about this proposal. The code survives running a kernel build and LTP. Note that changes are only for 64-bit at the moment, I haven't looked at 32-bit yet but I will definitively check it. Code is based on v5.10-rc3. Thanks, alex. - Alexandre Chartre (24): x86/syscall: Add wrapper for invoking syscall function x86/entry: Update asm_call_on_stack to support more function arguments x86/entry: Consolidate IST entry from userspace x86/sev-es: Define a setup stack function for the VC idtentry x86/entry: Implement ret_from_fork body with C code x86/pti: Provide C variants of PTI switch CR3 macros x86/entry: Fill ESPFIX stack using C code x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK x86/entry: Add C version of paranoid_entry/exit x86/pti: Introduce per-task PTI trampoline stack x86/pti: Function to clone page-table entries from a specified mm x86/pti: Function to map per-cpu page-table entry x86/pti: Extend PTI user mappings x86/pti: Use PTI stack instead of trampoline stack x86/pti: Execute syscall functions on the kernel stack x86/pti: Execute IDT handlers on the kernel stack x86/pti: Execute IDT handlers with error code on the kernel stack x86/pti: Execute system vector handlers on the kernel stack x86/pti: Execute page fault handler on the kernel stack x86/pti: Execute NMI handler on the kernel stack x86/entry: Disable stack-protector for IST entry C handlers x86/entry: Defer paranoid entry/exit to C code x86/entry: Remove paranoid_entry and paranoid_exit x86/pti: Defer CR3 switch to C code for non-IST and syscall entries arch/x86/entry/common.c | 259 - arch/x86/entry/entry_64.S | 513 -- arch/x86/entry/entry_64_compat.S | 22 -- arch/x86/include/asm/entry-common.h | 108 ++ 
arch/x86/include/asm/idtentry.h | 153 +++- arch/x86/include/asm/irq_stack.h | 11 + arch/x86/include/asm/page_64_types.h | 36 +- arch/x86/include/asm/paravirt.h | 15 + arch/x86/include/asm/paravirt_types.h | 17 +- arch/x86/include/asm/processor.h | 3 + arch/x86/include/asm/pti.h| 18 + arch/x86/include/asm/switch_to.h | 7 +- arch/x86/include/asm/traps.h | 2 +- arch/x86/kernel/cpu/mce/core.c| 7 +- arch/x86/kernel/espfix_64.c | 41 ++ arch/x86/kernel/nmi.c | 34 +- arch/x86/kernel/sev-es.c | 52 +++ arch/x86/kernel/traps.c | 61 +-- arch/x86/mm/fault.c | 11 +- arch/x86/mm/pti.c | 71 ++-- kernel/fork.c |
[RFC][PATCH 15/24] x86/pti: Execute syscall functions on the kernel stack
During a syscall, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the syscall function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 11 ++- arch/x86/entry/entry_64.S| 1 + arch/x86/include/asm/irq_stack.h | 3 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 54d0931801e1..ead6a4c72e6a 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { + unsigned long stack; + if (!sysfunc) return; - regs->ax = sysfunc(regs); + if (!pti_enabled()) { + regs->ax = sysfunc(regs); + return; + } + + stack = (unsigned long)task_top_of_kernel_stack(current); + regs->ax = asm_call_syscall_on_stack((void *)(stack - 8), +sysfunc, regs); } #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 29beab46bedd..6b88a0eb8975 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2) SYM_FUNC_START(asm_call_on_stack_3) SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL) SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL) +SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL) /* * Save the frame pointer unconditionally. This allows the ORC * unwinder to handle the stack switch. 
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h index 359427216336..108d9da7c01c 100644 --- a/arch/x86/include/asm/irq_stack.h +++ b/arch/x86/include/asm/irq_stack.h @@ -5,6 +5,7 @@ #include #include +#include #ifdef CONFIG_X86_64 static __always_inline bool irqstack_active(void) @@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs), struct pt_regs *regs); void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc), struct irq_desc *desc); +long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func, + struct pt_regs *regs); static __always_inline void __run_on_irqstack(void (*func)(void)) { -- 2.18.4
[RFC][PATCH 12/24] x86/pti: Function to map per-cpu page-table entry
Wrap the code used by PTI to map a per-cpu page-table entry into a new function so that this code can be re-used to map other per-cpu entries. Signed-off-by: Alexandre Chartre --- arch/x86/mm/pti.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index ebc8cd2f1cd8..71ca245d7b38 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr) *user_p4d = *kernel_p4d; } +/* + * Clone a single percpu page. + */ +static void __init pti_clone_percpu_page(void *addr) +{ + phys_addr_t pa = per_cpu_ptr_to_phys(addr); + pte_t *target_pte; + + target_pte = pti_user_pagetable_walk_pte((unsigned long)addr); + if (WARN_ON(!target_pte)) + return; + + *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); +} + /* * Clone the CPU_ENTRY_AREA and associated data into the user space visible * page table. @@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void) * This is done for all possible CPUs during boot to ensure * that it's propagated to all mms. */ + pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu)); - unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu); - phys_addr_t pa = per_cpu_ptr_to_phys((void *)va); - pte_t *target_pte; - - target_pte = pti_user_pagetable_walk_pte(va); - if (WARN_ON(!target_pte)) - return; - - *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL); } } -- 2.18.4
[RFC][PATCH 10/24] x86/pti: Introduce per-task PTI trampoline stack
Double the size of the kernel stack when using PTI. The entire stack is mapped into the kernel address space, and the top half of the stack (the PTI stack) is also mapped into the user address space. The PTI stack will be used as a per-task trampoline stack instead of the current per-cpu trampoline stack. This will allow running more code on the trampoline stack, in particular code that schedules the task out. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/page_64_types.h | 36 +++- arch/x86/include/asm/processor.h | 3 +++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 3f49dac03617..733accc20fdb 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -12,7 +12,41 @@ #define KASAN_STACK_ORDER 0 #endif -#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) +#ifdef CONFIG_PAGE_TABLE_ISOLATION +/* + * PTI doubles the size of the stack. The entire stack is mapped into + * the kernel address space. However, only the top half of the stack is + * mapped into the user address space. + * + * On syscall or interrupt, user mode enters the kernel with the user + * page-table, and the stack pointer is switched to the top of the + * stack (which is mapped in the user address space and in the kernel). + * The syscall/interrupt handler will then later decide when to switch + * to the kernel address space, and to switch to the top of the kernel + * stack which is only mapped in the kernel. 
+ * + * +-+ + * | | ^ ^ + * | kernel-only | | KERNEL_STACK_SIZE | + * |stack| | | + * | | V | + * +-+ <- top of kernel stack | THREAD_SIZE + * | | ^ | + * | kernel and | | KERNEL_STACK_SIZE | + * | PTI stack | | | + * | | V v + * +-+ <- top of stack + */ +#define PTI_STACK_ORDER 1 +#else +#define PTI_STACK_ORDER 0 +#endif + +#define KERNEL_STACK_ORDER 2 +#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER) + +#define THREAD_SIZE_ORDER \ + (KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER) #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) #define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 82a08b585818..47b1b806535b 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x) #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1)) +#define task_top_of_kernel_stack(task) \ + ((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE)) + #define task_pt_regs(task) \ ({ \ unsigned long __ptr = (unsigned long)task_stack_page(task); \ -- 2.18.4
[RFC][PATCH 11/24] x86/pti: Function to clone page-table entries from a specified mm
PTI has a function to clone page-table entries but only from the init_mm page-table. Provide a new function to clone page-table entries from a specified mm page-table. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/pti.h | 10 ++ arch/x86/mm/pti.c | 32 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 07375b476c4f..5484e69ff8d3 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -4,9 +4,19 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_PAGE_TABLE_ISOLATION + +enum pti_clone_level { + PTI_CLONE_PMD, + PTI_CLONE_PTE, +}; + +struct mm_struct; + extern void pti_init(void); extern void pti_check_boottime_disable(void); extern void pti_finalize(void); +extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level); #else static inline void pti_check_boottime_disable(void) { } #endif diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 1aab92930569..ebc8cd2f1cd8 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void) static void __init pti_setup_vsyscall(void) { } #endif -enum pti_clone_level { - PTI_CLONE_PMD, - PTI_CLONE_PTE, -}; - -static void -pti_clone_pgtable(unsigned long start, unsigned long end, - enum pti_clone_level level) +void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, + unsigned long end, enum pti_clone_level level) { unsigned long addr; @@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end, if (addr < start) break; - pgd = pgd_offset_k(addr); + pgd = pgd_offset(mm, addr); if (WARN_ON(pgd_none(*pgd))) return; p4d = p4d_offset(pgd, addr); @@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end, } } +static void pti_clone_init_pgtable(unsigned long start, unsigned long end, + enum pti_clone_level level) +{ + pti_clone_pgtable(&init_mm, start, end, level); +} + #ifdef 
CONFIG_X86_64 /* * Clone a single p4d (i.e. a top-level entry on 4-level systems and a @@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void) start = CPU_ENTRY_AREA_BASE; end = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES); - pti_clone_pgtable(start, end, PTI_CLONE_PMD); + pti_clone_init_pgtable(start, end, PTI_CLONE_PMD); } #endif /* CONFIG_X86_64 */ @@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void) */ static void pti_clone_entry_text(void) { - pti_clone_pgtable((unsigned long) __entry_text_start, - (unsigned long) __entry_text_end, - PTI_CLONE_PMD); + pti_clone_init_pgtable((unsigned long) __entry_text_start, + (unsigned long) __entry_text_end, + PTI_CLONE_PMD); } /* @@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void) * pti_set_kernel_image_nonglobal() did to clear the * global bit. */ - pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); + pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE); /* -* pti_clone_pgtable() will set the global bit in any PMDs -* that it clones, but we also need to get any PTEs in +* pti_clone_init_pgtable() will set the global bit in any +* PMDs that it clones, but we also need to get any PTEs in * the last level for areas that are not huge-page-aligned. */ -- 2.18.4
[RFC][PATCH 03/24] x86/entry: Consolidate IST entry from userspace
Most IST entries (NMI, MCE, DEBUG, VC but not DF) handle an entry from userspace the same way: they switch from the IST stack to the kernel stack, call the handler and then return to userspace. However, NMI, MCE/DEBUG and VC implement this same behavior using different code paths, so consolidate this code into a single assembly macro. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 137 +- 1 file changed, 75 insertions(+), 62 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index c42948aca0a8..51df9f1871c6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -316,6 +316,72 @@ SYM_CODE_END(ret_from_fork) #endif .endm +/* + * Macro to handle an IDT entry defined with the IST mechanism. It should + * be invoked at the beginning of the IDT handler with a pointer to the C + * function (cfunc_user) to invoke if the IDT was entered from userspace. + * + * If the IDT was entered from userspace, the macro will switch from the + * IST stack to the regular task stack, call the provided function and + * return to userland. + * + * If IDT was entered from the kernel, the macro will just return. + */ +.macro ist_entry_user cfunc_user has_error_code=0 + UNWIND_HINT_IRET_REGS + ASM_CLAC + + /* only process entry from userspace */ + .if \has_error_code == 1 + testb $3, CS-ORIG_RAX(%rsp) + jz .List_entry_from_kernel_\@ + .else + testb $3, CS-RIP(%rsp) + jz .List_entry_from_kernel_\@ + pushq $-1 /* ORIG_RAX: no syscall to restart */ + .endif + + /* Use %rdx as a temp variable */ + pushq %rdx + + /* +* Switch from the IST stack to the regular task stack and +* use the provided entry point. 
+*/ + swapgs + cld + FENCE_SWAPGS_USER_ENTRY + SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx + movq%rsp, %rdx + movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp + UNWIND_HINT_IRET_REGS base=%rdx offset=8 + pushq 6*8(%rdx) /* pt_regs->ss */ + pushq 5*8(%rdx) /* pt_regs->rsp */ + pushq 4*8(%rdx) /* pt_regs->flags */ + pushq 3*8(%rdx) /* pt_regs->cs */ + pushq 2*8(%rdx) /* pt_regs->rip */ + UNWIND_HINT_IRET_REGS + pushq 1*8(%rdx) /* pt_regs->orig_ax */ + PUSH_AND_CLEAR_REGS rdx=(%rdx) + ENCODE_FRAME_POINTER + + /* +* At this point we no longer need to worry about stack damage +* due to nesting -- we're on the normal thread stack and we're +* done with the IST stack. +*/ + + mov %rsp, %rdi + .if \has_error_code == 1 + movqORIG_RAX(%rsp), %rsi/* get error code into 2nd argument*/ + movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ + .endif + call\cfunc_user + jmp swapgs_restore_regs_and_return_to_usermode + +.List_entry_from_kernel_\@: +.endm + /** * idtentry_body - Macro to emit code calling the C function * @cfunc: C function to be called @@ -417,18 +483,15 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_mce_db vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - - pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user noist_\cfunc + /* Entry from kernel */ + + pushq $-1 /* ORIG_RAX: no syscall to restart */ /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ callparanoid_entry @@ -440,10 +503,6 @@ SYM_CODE_START(\asmsym) jmp paranoid_exit - /* Switch to the regular task stack and use the noist entry point */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body noist_\cfunc, has_error_code=0 - _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) .endm @@ -472,15 +531,11 @@ SYM_CODE_END(\asmsym) */ .macro idtentry_vc vector asmsym cfunc SYM_CODE_START(\asmsym) - UNWIND_HINT_IRET_REGS - ASM_CLAC - /* * If the entry is from userspace, switch stacks and treat it as * a normal entry. */ - testb $3, CS-ORIG_RAX(%rsp) - jnz .Lfrom_usermode_switch_stack_\@ + ist_entry_user safe_stack_\cfunc, has_error_code=1 /* * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. @@ -517,10 +572,6 @@ SYM_CODE_START(\asmsym) */ jmp paranoid_exit - /* Switch to the regular task stack */ -.Lfrom_usermode_switch_stack_\@: - idtentry_body safe_stack_\cfunc, has_e
[RFC][PATCH 05/24] x86/entry: Implement ret_from_fork body with C code
ret_from_fork is a mix of assembly code and calls to C functions. Re-implement ret_from_fork so that it calls a single C function. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 18 ++ arch/x86/entry/entry_64.S | 28 +--- 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d12908ad..7ee15a12c115 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -35,6 +35,24 @@ #include #include +__visible noinstr void return_from_fork(struct pt_regs *regs, + struct task_struct *prev, + void (*kfunc)(void *), void *kargs) +{ + schedule_tail(prev); + if (kfunc) { + /* kernel thread */ + kfunc(kargs); + /* +* A kernel thread is allowed to return here after +* successfully calling kernel_execve(). Exit to +* userspace to complete the execve() syscall. +*/ + regs->ax = 0; + } + syscall_exit_to_user_mode(regs); +} + static __always_inline void run_syscall(sys_call_ptr_t sysfunc, struct pt_regs *regs) { diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 274384644b5e..73e9cd47dc83 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -276,31 +276,13 @@ SYM_FUNC_END(__switch_to_asm) */ .pushsection .text, "ax" SYM_CODE_START(ret_from_fork) - UNWIND_HINT_EMPTY - movq%rax, %rdi - callschedule_tail /* rdi: 'prev' task parameter */ - - testq %rbx, %rbx /* from kernel_thread? 
*/ - jnz 1f /* kernel threads are uncommon */ - -2: UNWIND_HINT_REGS - movq%rsp, %rdi - callsyscall_exit_to_user_mode /* returns with IRQs disabled */ + movq%rsp, %rdi /* pt_regs */ + movq%rax, %rsi /* 'prev' task parameter */ + movq%rbx, %rdx /* kernel thread func */ + movq%r12, %rcx /* kernel thread arg */ + callreturn_from_fork/* returns with IRQs disabled */ jmp swapgs_restore_regs_and_return_to_usermode - -1: - /* kernel thread */ - UNWIND_HINT_EMPTY - movq%r12, %rdi - CALL_NOSPEC rbx - /* -* A kernel thread is allowed to return here after successfully -* calling kernel_execve(). Exit to userspace to complete the execve() -* syscall. -*/ - movq$0, RAX(%rsp) - jmp 2b SYM_CODE_END(ret_from_fork) .popsection -- 2.18.4
[RFC][PATCH 09/24] x86/entry: Add C version of paranoid_entry/exit
paranoid_entry/exit are assembly macros. Provide C versions of these macros (kernel_paranoid_entry() and kernel_paranoid_exit()). The C functions are functionally equivalent to the assembly macros, except that kernel_paranoid_entry() doesn't save registers in pt_regs like paranoid_entry does. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 157 arch/x86/include/asm/entry-common.h | 10 ++ 2 files changed, 167 insertions(+) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index d09b1ded5287..54d0931801e1 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -387,3 +387,160 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void) static __always_inline void restore_cr3(unsigned long cr3) {} #endif /* CONFIG_PAGE_TABLE_ISOLATION */ + +/* + * "Paranoid" entry path from exception stack. Ensure that the CR3 and + * GS registers are correctly set for the kernel. Return GSBASE related + * information in kernel_entry_state depending on the availability of + * the FSGSBASE instructions: + * + * FSGSBASEkernel_entry_state + * Nswapgs=true -> SWAPGS on exit + * swapgs=false -> no SWAPGS on exit + * + * Ygsbase=GSBASE value at entry, must be restored in + * kernel_paranoid_exit() + * + * Note that per-cpu variables are accessed using the GS register, + * so paranoid entry code cannot access per-cpu variables before + * kernel_paranoid_entry() has been called. + */ +noinstr void kernel_paranoid_entry(struct kernel_entry_state *state) +{ + unsigned long gsbase; + unsigned int cpu; + + /* +* Save CR3 in the kernel entry state. This value will be +* restored, verbatim, at exit. Needed if the paranoid entry +* interrupted another entry that already switched to the user +* CR3 value but has not yet returned to userspace. +* +* This is also why CS (stashed in the "iret frame" by the +* hardware at entry) can not be used: this may be a return +* to kernel code, but with a user CR3 value. 
+* +* Switching CR3 does not depend on kernel GSBASE so it can +* be done before switching to the kernel GSBASE. This is +* required for FSGSBASE because the kernel GSBASE has to +* be retrieved from a kernel internal table. +*/ + state->cr3 = save_and_switch_to_kernel_cr3(); + + /* +* Handling GSBASE depends on the availability of FSGSBASE. +* +* Without FSGSBASE the kernel enforces that negative GSBASE +* values indicate kernel GSBASE. With FSGSBASE no assumptions +* can be made about the GSBASE value when entering from user +* space. +*/ + if (static_cpu_has(X86_FEATURE_FSGSBASE)) { + /* +* Read the current GSBASE and store it in the kernel +* entry state unconditionally, retrieve and set the +* current CPUs kernel GSBASE. The stored value has to +* be restored at exit unconditionally. +* +* The unconditional write to GS base below ensures that +* no subsequent loads based on a mispredicted GS base +* can happen, therefore no LFENCE is needed here. +*/ + state->gsbase = rdgsbase(); + + /* +* Fetch the per-CPU GSBASE value for this processor. We +* normally use %gs for accessing per-CPU data, but we +* are setting up %gs here and obviously can not use %gs +* itself to access per-CPU data. +*/ + if (IS_ENABLED(CONFIG_SMP)) { + /* +* Load CPU from the GDT. Do not use RDPID, +* because KVM loads guest's TSC_AUX on vm-entry +* and may not restore the host's value until +* the CPU returns to userspace. Thus the kernel +* would consume a guest's TSC_AUX if an NMI +* arrives while running KVM's run loop. +*/ + asm_inline volatile ("lsl %[seg],%[p]" +: [p] "=r" (cpu) +: [seg] "r" (__CPUNODE_SEG)); + + cpu &= VDSO_CPUNODE_MASK; + gsbase = __per_cpu_offset[cpu]; + } else { + gsbase = *pcpu_unit_offsets; + } + + wrgsbase(gsbase); + + } else { + /* +* The kernel-enforced convention is a negative GSBASE +* indicates a kernel value. No SWAPGS needed on entry
[RFC][PATCH 04/24] x86/sev-es: Define a setup stack function for the VC idtentry
The #VC exception assembly entry code uses C code (vc_switch_off_ist) to get and configure a stack, then returns to assembly to switch to that stack and finally invokes the C exception handler function. To pave the way for deferring the CR3 switch from assembly to C code, define a setup stack function for the VC idtentry. This function is used to get and configure the stack before invoking the idtentry handler. For now, the setup stack function is just a wrapper around the vc_switch_off_ist() function but it will eventually also contain the C code to switch CR3. The vc_switch_off_ist() function is also refactored to just return the stack pointer, and the stack configuration is done in the setup stack function (so that the stack can also be used to propagate CR3 switch information to the idtentry handler for switching CR3 back). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 8 +++- arch/x86/include/asm/idtentry.h | 14 ++ arch/x86/include/asm/traps.h| 2 +- arch/x86/kernel/sev-es.c| 34 + arch/x86/kernel/traps.c | 19 +++--- 5 files changed, 55 insertions(+), 22 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 51df9f1871c6..274384644b5e 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -546,13 +546,11 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_REGS /* -* Switch off the IST stack to make it free for nested exceptions. The -* vc_switch_off_ist() function will switch back to the interrupted -* stack if it is safe to do so. If not it switches to the VC fall-back -* stack. +* Call the setup stack function. It configures and returns +* the stack we should be using to run the exception handler. 
*/ movq%rsp, %rdi /* pt_regs pointer */ - callvc_switch_off_ist + callsetup_stack_\cfunc movq%rax, %rsp /* Switch to new stack */ UNWIND_HINT_REGS diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index b2442eb0ac2f..4b4aca2b1420 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -318,6 +318,7 @@ static __always_inline void __##func(struct pt_regs *regs) */ #define DECLARE_IDTENTRY_VC(vector, func) \ DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func); \ + __visible noinstr unsigned long setup_stack_##func(struct pt_regs *regs); \ __visible noinstr void ist_##func(struct pt_regs *regs, unsigned long error_code); \ __visible noinstr void safe_stack_##func(struct pt_regs *regs, unsigned long error_code) @@ -380,6 +381,19 @@ static __always_inline void __##func(struct pt_regs *regs) #define DEFINE_IDTENTRY_VC_IST(func) \ DEFINE_IDTENTRY_RAW_ERRORCODE(ist_##func) +/** + * DEFINE_IDTENTRY_VC_SETUP_STACK - Emit code for setting up the stack to + run the VMM communication handler + * @func: Function name of the entry point + * + * The stack setup code is executed before the VMM communication handler. + * It configures and returns the stack to switch to before running the + * VMM communication handler. 
+ */ +#define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ + __visible noinstr \ + unsigned long setup_stack_##func(struct pt_regs *regs) + /** * DEFINE_IDTENTRY_VC - Emit code for VMM communication handler * @func: Function name of the entry point diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 7f7200021bd1..cfcc9d34d2a0 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -15,7 +15,7 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs); asmlinkage __visible notrace struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s); void __init trap_init(void); -asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs); +asmlinkage __visible noinstr unsigned long vc_switch_off_ist(struct pt_regs *eregs); #endif #ifdef CONFIG_X86_F00F_BUG diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 0bd1a0fc587e..bd977c917cd6 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -1349,6 +1349,40 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication) instrumentation_end(); } +struct exc_vc_frame { + /* pt_regs should be first */ + struct pt_regs regs; +}; + +DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication) +{ + struct exc_vc_frame *frame; + unsigned long sp; + + /* +* Switch off the IST stack to make it free for nested exceptions. +* The vc_switch_off_ist() function will switch back to the +* interrupted stack if
[RFC][PATCH 22/24] x86/entry: Defer paranoid entry/exit to C code
IST entries from the kernel use paranoid entry and exit assembly functions to ensure the CR3 and GS registers are updated with correct values for the kernel. Move the update of the CR3 and GS registers inside the C code of IST handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 72 ++ arch/x86/kernel/cpu/mce/core.c | 3 ++ arch/x86/kernel/nmi.c | 18 +++-- arch/x86/kernel/sev-es.c | 20 +- arch/x86/kernel/traps.c| 30 -- 5 files changed, 83 insertions(+), 60 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 6b88a0eb8975..9ea8187d4405 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -462,16 +462,16 @@ SYM_CODE_START(\asmsym) /* Entry from kernel */ pushq $-1 /* ORIG_RAX: no syscall to restart */ - /* paranoid_entry returns GS information for paranoid_exit in EBX. */ - callparanoid_entry - + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs pointer */ call\cfunc - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -507,12 +507,9 @@ SYM_CODE_START(\asmsym) */ ist_entry_user safe_stack_\cfunc, has_error_code=1 - /* -* paranoid_entry returns SWAPGS flag for paranoid_exit in EBX. -* EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS -*/ - callparanoid_entry - + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS /* @@ -538,7 +535,7 @@ SYM_CODE_START(\asmsym) * identical to the stack in the IRET frame or the VC fall-back stack, * so it is definitly mapped even with PTI enabled. */ - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -555,8 +552,9 @@ SYM_CODE_START(\asmsym) UNWIND_HINT_IRET_REGS offset=8 ASM_CLAC - /* paranoid_entry returns GS information for paranoid_exit in EBX. 
*/ - callparanoid_entry + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi /* pt_regs pointer into first argument */ @@ -564,7 +562,7 @@ SYM_CODE_START(\asmsym) movq$-1, ORIG_RAX(%rsp) /* no syscall to restart */ call\cfunc - jmp paranoid_exit + jmp restore_regs_and_return_to_kernel _ASM_NOKPROBE(\asmsym) SYM_CODE_END(\asmsym) @@ -1119,10 +1117,6 @@ SYM_CODE_END(error_return) /* * Runs on exception stack. Xen PV does not go through this path at all, * so we can use real assembly here. - * - * Registers: - * %r14: Used to save/restore the CR3 of the interrupted context - * when PAGE_TABLE_ISOLATION is in use. Do not clobber. */ SYM_CODE_START(asm_exc_nmi) /* @@ -1173,7 +1167,7 @@ SYM_CODE_START(asm_exc_nmi) * We also must not push anything to the stack before switching * stacks lest we corrupt the "NMI executing" variable. */ - ist_entry_user exc_nmi + ist_entry_user exc_nmi_user /* NMI from kernel */ @@ -1346,9 +1340,7 @@ repeat_nmi: * * RSP is pointing to "outermost RIP". gsbase is unknown, but, if * we're repeating an NMI, gsbase has the same value that it had on -* the first iteration. paranoid_entry will load the kernel -* gsbase if needed before we call exc_nmi(). "NMI executing" -* is zero. +* the first iteration. "NMI executing" is zero. */ movq$1, 10*8(%rsp) /* Set "NMI executing". */ @@ -1372,44 +1364,20 @@ end_repeat_nmi: pushq $-1 /* ORIG_RAX: no syscall to restart */ /* -* Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit -* as we should not be calling schedule in NMI context. -* Even with normal interrupts enabled. An NMI should not be -* setting NEED_RESCHED or anything that normal interrupts and +* We should not be calling schedule in NMI context. Even with +* normal interrupts enabled. An NMI should not be setting +* NEED_RESCHED or anything that normal interrupts and * exceptions might do. 
*/ - callparanoid_entry + cld + PUSH_AND_CLEAR_REGS + ENCODE_FRAME_POINTER UNWIND_HINT_REGS movq%rsp, %rdi movq$-1, %rsi callexc_nmi - /* Always restore stashed CR3 value (see paranoid_entry) */ - RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 - - /* -* The above invocation of pa
[RFC][PATCH 24/24] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
With PTI, syscall/interrupt/exception entries switch the CR3 register to change the page-table in assembly code. Move the CR3 register switch inside the C code of syscall/interrupt/exception entry handlers. Signed-off-by: Alexandre Chartre --- arch/x86/entry/common.c | 15 --- arch/x86/entry/entry_64.S | 23 +-- arch/x86/entry/entry_64_compat.S| 22 -- arch/x86/include/asm/entry-common.h | 14 ++ arch/x86/include/asm/idtentry.h | 25 - arch/x86/kernel/cpu/mce/core.c | 2 ++ arch/x86/kernel/nmi.c | 2 ++ arch/x86/kernel/traps.c | 6 ++ arch/x86/mm/fault.c | 9 +++-- 9 files changed, 68 insertions(+), 50 deletions(-) diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index ead6a4c72e6a..3f4788dbbde7 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs, regs->ax = 0; } syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } static __always_inline void run_syscall(sys_call_ptr_t sysfunc, @@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc, #ifdef CONFIG_X86_64 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) { + switch_to_kernel_cr3(); nr = syscall_enter_from_user_mode(regs, nr); instrumentation_begin(); @@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs) instrumentation_end(); syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } #endif #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION) static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs) { + switch_to_kernel_cr3(); if (IS_ENABLED(CONFIG_IA32_EMULATION)) current_thread_info()->status |= TS_COMPAT; @@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs) do_syscall_32_irqs_on(regs, nr); syscall_exit_to_user_mode(regs); + switch_to_user_cr3(); } -static noinstr bool __do_fast_syscall_32(struct pt_regs *regs) +static noinstr bool __do_fast_syscall_32(struct 
pt_regs *regs, long nr) { - unsigned int nr = syscall_32_enter(regs); int res; /* @@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs) /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs) { + unsigned int nr = syscall_32_enter(regs); + bool syscall_done; + /* * Called using the internal vDSO SYSENTER/SYSCALL32 calling * convention. Adjust regs so it looks like we entered using int80. @@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs) regs->ip = landing_pad; /* Invoke the syscall. If it failed, keep it simple: use IRET. */ - if (!__do_fast_syscall_32(regs)) + syscall_done = __do_fast_syscall_32(regs, nr); + switch_to_user_cr3(); + if (!syscall_done) return 0; #ifdef CONFIG_X86_64 diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 797effbe65b6..4be15a5ffe68 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64) swapgs /* tss.sp2 is scratch space. */ movq%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2) - SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL) @@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL) */ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ - POP_REGS pop_rdi=0 skip_r11rcx=1 + POP_REGS skip_r11rcx=1 /* -* We are on the trampoline stack. All regs except RDI are live. * We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. 
*/ STACKLEAK_ERASE_NOCLOBBER - SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi - - popq%rdi movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork) swapgs cld FENCE_SWAPGS_USER_ENTRY - SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx movq%rsp, %rdx movqPER_CPU_VAR(cpu_current_top_of_stack), %rsp UNWIND_HINT_IRET_REGS base=%rdx offset=8 @@ -592,19 +586,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) ud2 1: #endif - POP_REGS pop_rdi=0 + POP_REGS + addq
[RFC][PATCH 18/24] x86/pti: Execute system vector handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Change system vector handlers to execute on the kernel stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a82e31b45442..0c5d9f027112 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) +{ + void *stack = pti_kernel_stack(regs); + + if (stack) + asm_call_on_stack_1(stack, (void (*)(void))func, regs); + else + run_sysvec_on_irqstack_cond(func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - run_sysvec_on_irqstack_cond(__##func, regs);\ + run_sysvec(__##func, regs); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ -- 2.18.4
[RFC][PATCH 23/24] x86/entry: Remove paranoid_entry and paranoid_exit
The paranoid_entry and paranoid_exit assembly functions have been replaced by the kernel_paranoid_entry() and kernel_paranoid_exit() C functions. Now paranoid_entry/exit are not used anymore and can be removed. Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S | 131 -- 1 file changed, 131 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 9ea8187d4405..797effbe65b6 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -882,137 +882,6 @@ SYM_CODE_START(xen_failsafe_callback) SYM_CODE_END(xen_failsafe_callback) #endif /* CONFIG_XEN_PV */ -/* - * Save all registers in pt_regs. Return GSBASE related information - * in EBX depending on the availability of the FSGSBASE instructions: - * - * FSGSBASER/EBX - * N0 -> SWAPGS on exit - * 1 -> no SWAPGS on exit - * - * YGSBASE value at entry, must be restored in paranoid_exit - */ -SYM_CODE_START_LOCAL(paranoid_entry) - UNWIND_HINT_FUNC - cld - PUSH_AND_CLEAR_REGS save_ret=1 - ENCODE_FRAME_POINTER 8 - - /* -* Always stash CR3 in %r14. This value will be restored, -* verbatim, at exit. Needed if paranoid_entry interrupted -* another entry that already switched to the user CR3 value -* but has not yet returned to userspace. -* -* This is also why CS (stashed in the "iret frame" by the -* hardware at entry) can not be used: this may be a return -* to kernel code, but with a user CR3 value. -* -* Switching CR3 does not depend on kernel GSBASE so it can -* be done before switching to the kernel GSBASE. This is -* required for FSGSBASE because the kernel GSBASE has to -* be retrieved from a kernel internal table. -*/ - SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14 - - /* -* Handling GSBASE depends on the availability of FSGSBASE. -* -* Without FSGSBASE the kernel enforces that negative GSBASE -* values indicate kernel GSBASE. With FSGSBASE no assumptions -* can be made about the GSBASE value when entering from user -* space. 
-*/ - ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE - - /* -* Read the current GSBASE and store it in %rbx unconditionally, -* retrieve and set the current CPUs kernel GSBASE. The stored value -* has to be restored in paranoid_exit unconditionally. -* -* The unconditional write to GS base below ensures that no subsequent -* loads based on a mispredicted GS base can happen, therefore no LFENCE -* is needed here. -*/ - SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx - ret - -.Lparanoid_entry_checkgs: - /* EBX = 1 -> kernel GSBASE active, no restore required */ - movl$1, %ebx - /* -* The kernel-enforced convention is a negative GSBASE indicates -* a kernel value. No SWAPGS needed on entry and exit. -*/ - movl$MSR_GS_BASE, %ecx - rdmsr - testl %edx, %edx - jns .Lparanoid_entry_swapgs - ret - -.Lparanoid_entry_swapgs: - SWAPGS - - /* -* The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an -* unconditional CR3 write, even in the PTI case. So do an lfence -* to prevent GS speculation, regardless of whether PTI is enabled. -*/ - FENCE_SWAPGS_KERNEL_ENTRY - - /* EBX = 0 -> SWAPGS required on exit */ - xorl%ebx, %ebx - ret -SYM_CODE_END(paranoid_entry) - -/* - * "Paranoid" exit path from exception stack. This is invoked - * only on return from non-NMI IST interrupts that came - * from kernel space. - * - * We may be returning to very strange contexts (e.g. very early - * in syscall entry), so checking for preemption here would - * be complicated. Fortunately, there's no good reason to try - * to handle preemption here. - * - * R/EBX contains the GSBASE related information depending on the - * availability of the FSGSBASE instructions: - * - * FSGSBASER/EBX - * N0 -> SWAPGS on exit - * 1 -> no SWAPGS on exit - * - * YUser space GSBASE, must be restored unconditionally - */ -SYM_CODE_START_LOCAL(paranoid_exit) - UNWIND_HINT_REGS - /* -* The order of operations is important. RESTORE_CR3 requires -* kernel GSBASE. 
-* -* NB to anyone to try to optimize this code: this code does -* not execute at all for exceptions from user mode. Those -* exceptions go through error_exit instead. -*/ - RESTORE_CR3 scratch_reg=%rax save_reg=%r14 - - /* Handle the three GSBASE cases */ - ALTERNATIVE "jmp .Lparanoid_exit_checkgs"
[RFC][PATCH 20/24] x86/pti: Execute NMI handler on the kernel stack
After an NMI from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the NMI handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/kernel/nmi.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 4bc77aaf1303..be0f654c3095 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi) inc_irq_stat(__nmi_count); - if (!ignore_nmis) - default_do_nmi(regs); + if (!ignore_nmis) { + if (user_mode(regs)) { + /* +* If we come from userland then we are on the +* trampoline stack, switch to the kernel stack +* to execute the NMI handler. +*/ + run_idt(default_do_nmi, regs); + } else { + default_do_nmi(regs); + } + } idtentry_exit_nmi(regs, irq_state); -- 2.18.4
[RFC][PATCH 14/24] x86/pti: Use PTI stack instead of trampoline stack
When entering the kernel from userland, use the per-task PTI stack instead of the per-cpu trampoline stack. Like the trampoline stack, the PTI stack is mapped both in the kernel and in the user page-table. Using a per-task stack which is mapped into the kernel and the user page-table instead of a per-cpu stack will allow executing more code before switching to the kernel stack and to the kernel page-table. Additional changes will be made later to switch to the kernel stack (which is only mapped in the kernel page-table). Signed-off-by: Alexandre Chartre --- arch/x86/entry/entry_64.S| 42 +--- arch/x86/include/asm/pti.h | 8 ++ arch/x86/include/asm/switch_to.h | 7 +- 3 files changed, 26 insertions(+), 31 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 458af12ed9a1..29beab46bedd 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -194,19 +194,9 @@ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ POP_REGS pop_rdi=0 skip_r11rcx=1 - /* -* Now all regs are restored except RSP and RDI. -* Save old stack pointer and switch to trampoline stack. -*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - pushq RSP-RDI(%rdi) /* RSP */ - pushq (%rdi) /* RDI */ - /* * We are on the trampoline stack. All regs except RDI are live. +* We are on the trampoline stack. All regs except RSP are live. * We can do future final exit work right here. */ STACKLEAK_ERASE_NOCLOBBER @@ -214,7 +204,7 @@ syscall_return_via_sysret: SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi popq%rdi - popq%rsp + movqRSP-ORIG_RAX(%rsp), %rsp USERGS_SYSRET64 SYM_CODE_END(entry_SYSCALL_64) @@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) #endif POP_REGS pop_rdi=0 - /* -* The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS. -* Save old stack pointer and switch to trampoline stack.
-*/ - movq%rsp, %rdi - movqPER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp - UNWIND_HINT_EMPTY - - /* Copy the IRET frame to the trampoline stack. */ - pushq 6*8(%rdi) /* SS */ - pushq 5*8(%rdi) /* RSP */ - pushq 4*8(%rdi) /* EFLAGS */ - pushq 3*8(%rdi) /* CS */ - pushq 2*8(%rdi) /* RIP */ - - /* Push user RDI on the trampoline stack. */ - pushq (%rdi) - /* * We are on the trampoline stack. All regs except RDI are live. * We can do future final exit work right here. @@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) /* Restore RDI. */ popq%rdi + addq$8, %rsp/* skip regs->orig_ax */ SWAPGS INTERRUPT_RETURN @@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry) SWITCH_TO_KERNEL_CR3 scratch_reg=%rax .Lerror_entry_from_usermode_after_swapgs: + /* +* We are on the trampoline stack. With PTI, the trampoline +* stack is a per-thread stack so we are all set and we can +* return. +* +* Without PTI, the trampoline stack is a per-cpu stack and +* we need to switch to the normal thread stack. +*/ + ALTERNATIVE "", "ret", X86_FEATURE_PTI /* Put us onto the real thread stack. 
*/ popq%r12/* save return addr in %12 */ movq%rsp, %rdi /* arg0 = pt_regs pointer */ diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h index 5484e69ff8d3..ed211fcc3a50 100644 --- a/arch/x86/include/asm/pti.h +++ b/arch/x86/include/asm/pti.h @@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void); extern void pti_finalize(void); extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start, unsigned long end, enum pti_clone_level level); +static inline bool pti_enabled(void) +{ + return static_cpu_has(X86_FEATURE_PTI); +} #else static inline void pti_check_boottime_disable(void) { } +static inline bool pti_enabled(void) +{ + return false; +} #endif #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h index 9f69cc497f4b..457458228462 100644 --- a/arch/x86/include/asm/switch_to.h +++ b/arch/x86/include/asm/switch_to.h @@ -3,6 +3,7 @@ #define _ASM_X86_SWITCH_TO_H #include +#include struct task_struct; /* one of the stranger aspects of C forward declarations */ @@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task) * doesn't wo
[RFC][PATCH 16/24] x86/pti: Execute IDT handlers on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. For now, this only changes IDT handlers which have no argument other than the pt_regs registers. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 43 +++-- arch/x86/kernel/cpu/mce/core.c | 2 +- arch/x86/kernel/traps.c | 4 +-- 3 files changed, 44 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 4b4aca2b1420..3595a31947b3 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -10,10 +10,49 @@ #include #include +#include bool idtentry_enter_nmi(struct pt_regs *regs); void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); +/* + * The CALL_ON_STACK_* macros call the specified function either directly + * if no stack is provided, or on the specified stack. + */ +#define CALL_ON_STACK_1(stack, func, arg1) \ + ((stack) ? \ +asm_call_on_stack_1(stack, \ + (void (*)(void))(func), (void *)(arg1)) : \ +func(arg1)) + +/* + * Functions to return the top of the kernel stack if we are using the + * user page-table (and thus not running with the kernel stack). If we + * are using the kernel page-table (and so already using the kernel + * stack), then they return NULL. + */ +static __always_inline void *pti_kernel_stack(struct pt_regs *regs) +{ + unsigned long stack; + + if (pti_enabled() && user_mode(regs)) { + stack = (unsigned long)task_top_of_kernel_stack(current); + return (void *)(stack - 8); + } else { + return NULL; + } +} + +/* + * Wrappers to run an IDT handler on the kernel stack if we are not + * already using this stack.
+ */ +static __always_inline +void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) +{ + CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -55,7 +94,7 @@ __visible noinstr void func(struct pt_regs *regs) \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs);\ + run_idt(__##func, regs);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -271,7 +310,7 @@ __visible noinstr void func(struct pt_regs *regs) \ instrumentation_begin();\ __irq_enter_raw(); \ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs);\ + run_idt(__##func, regs);\ __irq_exit_raw(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 4102b866e7c0..9407c3cd9355 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -2035,7 +2035,7 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check) unsigned long dr7; dr7 = local_db_save(); - exc_machine_check_user(regs); + run_idt(exc_machine_check_user, regs); local_db_restore(dr7); } #else diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 09b22a611d99..5161385b3670 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -257,7 +257,7 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op) state = irqentry_enter(regs); instrumentation_begin(); - handle_invalid_op(regs); + run_idt(handle_invalid_op, regs); instrumentation_end(); irqentry_exit(regs, state); } @@ -647,7 +647,7 @@ DEFINE_IDTENTRY_RAW(exc_int3) if (user_mode(regs)) { irqentry_enter_from_user_mode(regs); instrumentation_begin(); - do_int3_user(regs); + run_idt(do_int3_us
[RFC][PATCH 19/24] x86/pti: Execute page fault handler on the kernel stack
After a page fault from userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the page fault handler, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 17 + arch/x86/mm/fault.c | 2 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 0c5d9f027112..a6725afaaec0 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -31,6 +31,13 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ func(arg1, arg2)) +#define CALL_ON_STACK_3(stack, func, arg1, arg2, arg3) \ + ((stack) ? \ +asm_call_on_stack_3(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2), \ + (void *)(arg3)) : \ +func(arg1, arg2, arg3)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack).
If we @@ -66,6 +73,16 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); } +static __always_inline +void run_idt_pagefault(void (*func)(struct pt_regs *, unsigned long, + unsigned long), + struct pt_regs *regs, unsigned long error_code, + unsigned long address) +{ + CALL_ON_STACK_3(pti_kernel_stack(regs), + func, regs, error_code, address); +} + static __always_inline void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) { diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 82bf37a5c9ec..b9d03603d95d 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1482,7 +1482,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) state = irqentry_enter(regs); instrumentation_begin(); - handle_page_fault(regs, error_code, address); + run_idt_pagefault(handle_page_fault, regs, error_code, address); instrumentation_end(); irqentry_exit(regs, state); -- 2.18.4
[RFC][PATCH 21/24] x86/entry: Disable stack-protector for IST entry C handlers
The stack-protector option adds a stack canary to functions vulnerable to stack buffer overflow. The stack canary is defined through the GS register. Add an attribute to disable the stack-protector option; it will be used for C functions which can be called while the GS register might not be properly configured yet. The GS register is not properly configured for the kernel when we enter the kernel from userspace. The assembly entry code sets the GS register for the kernel using the swapgs instruction or the paranoid_entry function, and so, currently, the GS register is correctly configured when assembly entry code subsequently transfers control to C code. Deferring the CR3 register switch from assembly to C code will require reimplementing paranoid_entry in C and hence also deferring the GS register setup for IST entries to C code. To prepare for this change, disable stack-protector for IST entry C handlers where the GS register setup will eventually happen. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 25 - arch/x86/kernel/nmi.c | 2 +- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a6725afaaec0..647af7ea3bf1 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -94,6 +94,21 @@ void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs) run_sysvec_on_irqstack_cond(func, regs); } +/* + * Attribute to disable the stack-protector option. The option is + * disabled using the optimize attribute which clears all optimize + * options. So we need to specify the optimize option to disable but + * also optimize options we want to preserve. + * + * The stack-protector option adds a stack canary to functions + * vulnerable to stack buffer overflow. The stack canary is defined + * through the GS register.
So the attribute is used to disable the + * stack-protector option for functions which can be called while the + * GS register might not be properly configured yet. + */ +#define no_stack_protector \ + __attribute__ ((optimize("-O2,-fno-stack-protector,-fno-omit-frame-pointer"))) + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -410,7 +425,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW */ #define DEFINE_IDTENTRY_IST(func) \ - DEFINE_IDTENTRY_RAW(func) + no_stack_protector DEFINE_IDTENTRY_RAW(func) /** * DEFINE_IDTENTRY_NOIST - Emit code for NOIST entry points which @@ -440,7 +455,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE */ #define DEFINE_IDTENTRY_DF(func) \ - DEFINE_IDTENTRY_RAW_ERRORCODE(func) + no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func) /** * DEFINE_IDTENTRY_VC_SAFE_STACK - Emit code for VMM communication handler @@ -472,7 +487,7 @@ static __always_inline void __##func(struct pt_regs *regs) * VMM communication handler. 
*/ #define DEFINE_IDTENTRY_VC_SETUP_STACK(func) \ - __visible noinstr \ + no_stack_protector __visible noinstr\ unsigned long setup_stack_##func(struct pt_regs *regs) /** @@ -482,7 +497,7 @@ static __always_inline void __##func(struct pt_regs *regs) * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE */ #define DEFINE_IDTENTRY_VC(func) \ - DEFINE_IDTENTRY_RAW_ERRORCODE(func) + no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func) #else /* CONFIG_X86_64 */ @@ -517,7 +532,7 @@ __visible noinstr void func(struct pt_regs *regs, \ /* C-Code mapping */ #define DECLARE_IDTENTRY_NMI DECLARE_IDTENTRY_RAW -#define DEFINE_IDTENTRY_NMIDEFINE_IDTENTRY_RAW +#define DEFINE_IDTENTRY_NMIno_stack_protector DEFINE_IDTENTRY_RAW #ifdef CONFIG_X86_64 #define DECLARE_IDTENTRY_MCE DECLARE_IDTENTRY_IST diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index be0f654c3095..b6291b683be1 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state); static DEFINE_PER_CPU(unsigned long, nmi_cr2); static DEFINE_PER_CPU(unsigned long, nmi_dr7); -DEFINE_IDTENTRY_RAW(exc_nmi) +DEFINE_IDTENTRY_NMI(exc_nmi) { bool irq_state; -- 2.18.4
[RFC][PATCH 17/24] x86/pti: Execute IDT handlers with error code on the kernel stack
After an interrupt/exception in userland, the kernel is entered and it switches the stack to the PTI stack which is mapped both in the kernel and in the user page-table. When executing the interrupt function, switch to the kernel stack (which is mapped only in the kernel page-table) so that no kernel data leaks to userland through the stack. This changes IDT handlers which have an error code. Signed-off-by: Alexandre Chartre --- arch/x86/include/asm/idtentry.h | 18 -- arch/x86/kernel/traps.c | 2 +- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 3595a31947b3..a82e31b45442 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state); (void (*)(void))(func), (void *)(arg1)) : \ func(arg1)) +#define CALL_ON_STACK_2(stack, func, arg1, arg2) \ + ((stack) ? \ +asm_call_on_stack_2(stack, \ + (void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \ +func(arg1, arg2)) + /* * Functions to return the top of the kernel stack if we are using the * user page-table (and thus not running with the kernel stack).
If we @@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs) CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs); } +static __always_inline +void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long), +struct pt_regs *regs, unsigned long error_code) +{ + CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code); +} + /** * DECLARE_IDTENTRY - Declare functions for simple IDT entry points * No error code pushed by hardware @@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs, \ irqentry_state_t state = irqentry_enter(regs); \ \ instrumentation_begin();\ - __##func (regs, error_code);\ + run_idt_errcode(__##func, regs, error_code);\ instrumentation_end(); \ irqentry_exit(regs, state); \ } \ @@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs, \ instrumentation_begin();\ irq_enter_rcu();\ kvm_set_cpu_l1tf_flush_l1d(); \ - __##func (regs, (u8)error_code);\ + run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \ + regs, (u8)error_code); \ irq_exit_rcu(); \ instrumentation_end(); \ irqentry_exit(regs, state); \ diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 5161385b3670..9a51aa016fb3 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug) /* User entry, runs on regular task stack */ DEFINE_IDTENTRY_DEBUG_USER(exc_debug) { - exc_debug_user(regs, debug_read_clear_dr6()); + run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6()); } #else /* 32 bit does not have separate entry points. */ -- 2.18.4