Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Tue, Nov 17 2020 at 09:19, Alexandre Chartre wrote:
> On 11/16/20 9:24 PM, Borislav Petkov wrote:
>> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> So PTI was added exactly to *not* have kernel memory mapped in the user
>> page table. You're partially reversing that...
>
> We are not reversing PTI, we are extending it.

You widen the exposure surface without providing an argument why it is
safe.

> PTI removes all kernel mapping from the user page-table. However there's
> no issue with mapping some kernel data into the user page-table as long as
> these data have no sensitive information.

Define sensitive information.

> Actually, PTI is already doing that but with a very limited scope. PTI adds
> into the user page-table some kernel mappings which are needed for userland
> to enter the kernel (such as the kernel entry text, the ESPFIX, the
> CPU_ENTRY_AREA_BASE...).
>
> So here, we are extending the PTI mapping so that we can execute more kernel
> code while using the user page-table; it's a kind of PTI on steroids.

Let's just look at a syscall:

noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
{
	long ret;

	enter_from_user_mode(regs);
		lockdep_hardirqs_off();
		user_exit_irqoff();
		trace_hardirqs_off_finish();

So just looking at the 3 calls above, how are you going to guarantee
that everything these callchains touch is mapped into user space?

Not to talk about everything which comes after that.

Thanks,

        tglx
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 12:29 PM, Borislav Petkov wrote:
> On Wed, Nov 18, 2020 at 08:41:42AM +0100, Alexandre Chartre wrote:
>> Well, it looks like I wrongfully assume that KPTI was a well known performance
>> overhead since it was introduced (because it adds extra page-table switches),
>> but you are right I should be presenting my own numbers.
>
> Here's one recipe, courtesy of Mel:
>
> https://github.com/gormanm/mmtests

Thanks for the detailed information, I have run the test and I see the same
difference as with the tools/perf and libMICRO I already sent: there's a 150%
difference for getpid() with and without pti.

alex.

-

# ../../compare-kernels.sh --baseline test-nopti --compare test-pti

poundsyscall
                      test                 test
                     nopti                  pti
Min        2     1.99 (   0.00%)     5.08 (-155.28%)
Min        4     1.02 (   0.00%)     2.60 (-154.90%)
Min        6     0.94 (   0.00%)     2.07 (-120.21%)
Min        8     0.81 (   0.00%)     1.60 ( -97.53%)
Min       12     0.85 (   0.00%)     1.65 ( -94.12%)
Min       18     0.82 (   0.00%)     1.61 ( -96.34%)
Min       24     0.81 (   0.00%)     1.60 ( -97.53%)
Min       30     0.81 (   0.00%)     1.60 ( -97.53%)
Min       32     0.81 (   0.00%)     1.60 ( -97.53%)
Amean      2     2.02 (   0.00%)     5.10 *-151.83%*
Amean      4     1.03 (   0.00%)     2.61 *-151.98%*
Amean      6     0.96 (   0.00%)     2.07 *-116.74%*
Amean      8     0.82 (   0.00%)     1.60 * -96.56%*
Amean     12     0.87 (   0.00%)     1.67 * -91.73%*
Amean     18     0.82 (   0.00%)     1.63 * -97.94%*
Amean     24     0.81 (   0.00%)     1.60 * -97.41%*
Amean     30     0.82 (   0.00%)     1.60 * -96.93%*
Amean     32     0.82 (   0.00%)     1.60 * -96.56%*
Stddev     2     0.02 (   0.00%)     0.02 (  33.78%)
Stddev     4     0.01 (   0.00%)     0.01 (   7.18%)
Stddev     6     0.01 (   0.00%)     0.00 (  68.77%)
Stddev     8     0.01 (   0.00%)     0.01 (  10.56%)
Stddev    12     0.01 (   0.00%)     0.02 ( -12.69%)
Stddev    18     0.01 (   0.00%)     0.01 (-107.25%)
Stddev    24     0.00 (   0.00%)     0.00 ( -14.56%)
Stddev    30     0.01 (   0.00%)     0.01 (   0.00%)
Stddev    32     0.01 (   0.00%)     0.00 (  20.00%)
CoeffVar   2     1.17 (   0.00%)     0.31 (  73.70%)
CoeffVar   4     0.82 (   0.00%)     0.30 (  63.16%)
CoeffVar   6     1.41 (   0.00%)     0.20 (  85.59%)
CoeffVar   8     0.87 (   0.00%)     0.39 (  54.50%)
CoeffVar  12     1.66 (   0.00%)     0.98 (  41.23%)
CoeffVar  18     0.85 (   0.00%)     0.89 (  -4.71%)
CoeffVar  24     0.52 (   0.00%)     0.30 (  41.97%)
CoeffVar  30     0.65 (   0.00%)     0.33 (  49.22%)
CoeffVar  32     0.65 (   0.00%)     0.26 (  59.30%)
Max        2     2.04 (   0.00%)     5.13 (-151.47%)
Max        4     1.04 (   0.00%)     2.62 (-151.92%)
Max        6     0.98 (   0.00%)     2.08 (-112.24%)
Max        8     0.83 (   0.00%)     1.62 ( -95.18%)
Max       12     0.89 (   0.00%)     1.70 ( -91.01%)
Max       18     0.84 (   0.00%)     1.66 ( -97.62%)
Max       24     0.82 (   0.00%)     1.61 ( -96.34%)
Max       30     0.82 (   0.00%)     1.61 ( -96.34%)
Max       32     0.82 (   0.00%)     1.61 ( -96.34%)
BAmean-50  2     2.01 (   0.00%)     5.09 (-153.39%)
BAmean-50  4     1.03 (   0.00%)     2.60 (-152.62%)
BAmean-50  6     0.95 (   0.00%)     2.07 (-118.82%)
BAmean-50  8     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-50 12     0.86 (   0.00%)     1.66 ( -92.79%)
BAmean-50 18     0.82 (   0.00%)     1.62 ( -97.56%)
BAmean-50 24     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-50 30     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-50 32     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-95  2     2.02 (   0.00%)     5.09 (-151.87%)
BAmean-95  4     1.03 (   0.00%)     2.61 (-151.99%)
BAmean-95  6     0.95 (   0.00%)     2.07 (-117.25%)
BAmean-95  8     0.81 (   0.00%)     1.60 ( -96.72%)
BAmean-95 12     0.87 (   0.00%)     1.67 ( -91.82%)
BAmean-95 18     0.82 (   0.00%)     1.63 ( -97.97%)
BAmean-95 24     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-95 30     0.81 (   0.00%)     1.60 ( -97.00%)
BAmean-95 32     0.81 (   0.00%)     1.60 ( -96.59%)
BAmean-99  2     2.02 (   0.00%)     5.09 (-151.87%)
BAmean-99  4     1.03 (   0.00%)     2.61 (-151.99%)
BAmean-99  6     0.95 (   0.00%)     2.07 (-117.25%)
BAmean-99  8     0.81 (   0.00%)     1.60 ( -96.72%)
BAmean-99 12     0.87 (   0.00%)     1.67 ( -91.82%)
BAmean-99 18     0.82 (   0.00%)     1.63 ( -97.97%)
BAmean-99 24     0.81 (   0.00%)     1.60 ( -97.53%)
BAmean-99 30     0.81 (   0.00%)     1
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 2:22 PM, David Laight wrote:
> From: Alexandre Chartre
>> Sent: 18 November 2020 10:30
> ...
>> Correct, this RFC is not changing the overhead. However, it is a step forward
>> for being able to execute some selected syscalls or interrupt handlers without
>> switching to the kernel page-table. The next step would be to identify and add
>> the necessary mapping to the user page-table so that specified syscalls can be
>> executed without switching the page-table.
>
> Remember that without PTI user space can read all kernel memory.
> (I'm not 100% sure you can force a cache-line read.)
> It isn't even that slow.
> (Even I can understand how it works.)
>
> So if you are worried about user space doing that you can't really
> run anything on the user page tables.

Yes, without PTI, userspace can read all kernel memory. But to run some
part of the kernel you don't need to have all kernel mappings. Also a lot
of the kernel contains non-sensitive information which can be safely exposed
to userspace. So there's probably some room for running carefully selected
syscalls with the user page-table (and hopefully useful ones).

> System calls like getpid() are irrelevant - they aren't used (much).
> Even the time of day ones are implemented in the VDSO without a
> context switch.

getpid()/getppid() is interesting because it provides the amount of overhead
PTI is adding. But the impact can be more important if some TLB flushing is
also required (as you mentioned below).

> So the overheads come from other system calls that 'do work'
> without actually sleeping.
> I'm guessing things like read, write, sendmsg, recvmsg.
>
> The only interesting system call I can think of is futex.
> As well as all the calls that return immediately because the
> mutex has been released while entering the kernel, I suspect
> that being pre-empted by a different thread (of the same process)
> doesn't actually need CR3 reloading (without PTI).
>
> I also suspect that it isn't just the CR3 reload that costs.
> There could (depending on the cpu) be associated TLB and/or cache
> invalidations that have a much larger effect on programs with
> large working sets than on simple benchmark programs.

Right. The TLB flush is mitigated with PCID, but it has more impact if
there's no PCID.

> Now bits of data that you are 'more worried about' could be kept
> in physical memory that isn't normally mapped (or referenced by
> a TLB) and only mapped when needed.
> But that doesn't help the general case.

Note that having syscalls which could be done without switching the
page-table is just one benefit you can get from this RFC. But the main
benefit is for integrating Address Space Isolation (ASI), which will be
much more complex if ASI has to plug into the current assembly CR3 switch.

Thanks,

alex.
RE: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
From: Alexandre Chartre
> Sent: 18 November 2020 10:30
...
> Correct, this RFC is not changing the overhead. However, it is a step forward
> for being able to execute some selected syscalls or interrupt handlers without
> switching to the kernel page-table. The next step would be to identify and add
> the necessary mapping to the user page-table so that specified syscalls can be
> executed without switching the page-table.

Remember that without PTI user space can read all kernel memory.
(I'm not 100% sure you can force a cache-line read.)
It isn't even that slow.
(Even I can understand how it works.)

So if you are worried about user space doing that you can't really
run anything on the user page tables.

System calls like getpid() are irrelevant - they aren't used (much).
Even the time of day ones are implemented in the VDSO without a
context switch.

So the overheads come from other system calls that 'do work'
without actually sleeping.
I'm guessing things like read, write, sendmsg, recvmsg.

The only interesting system call I can think of is futex.
As well as all the calls that return immediately because the
mutex has been released while entering the kernel, I suspect
that being pre-empted by a different thread (of the same process)
doesn't actually need CR3 reloading (without PTI).

I also suspect that it isn't just the CR3 reload that costs.
There could (depending on the cpu) be associated TLB and/or cache
invalidations that have a much larger effect on programs with
large working sets than on simple benchmark programs.

Now bits of data that you are 'more worried about' could be kept
in physical memory that isn't normally mapped (or referenced by
a TLB) and only mapped when needed.
But that doesn't help the general case.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Wed, Nov 18, 2020 at 08:41:42AM +0100, Alexandre Chartre wrote:
> Well, it looks like I wrongfully assume that KPTI was a well known performance
> overhead since it was introduced (because it adds extra page-table switches),
> but you are right I should be presenting my own numbers.

Here's one recipe, courtesy of Mel:

https://github.com/gormanm/mmtests

"
./run-mmtests.sh --no-monitor --config configs/config-workload-poundsyscall test-default
# reboot the machine with pti disabled
./run-mmtests.sh --no-monitor --config configs/config-workload-poundsyscall test-nopti

poundsyscall just calls getppid() so it's a light-weight syscall and a
proxy measure for syscall entry/exit costs.

To do the actual compare

cd work/log
../../compare-kernels.sh

and see what gain there is from disabling pti. If you want to compare
the other direction

../../compare-kernels.sh --baseline test-nopti --compare test-default

If you get an error about BinarySearch

(echo y;echo o conf prerequisites_policy follow;echo o conf commit)|cpan
yes | cpan List::BinarySearch

Only use the second line if you want to interactively confirm what cpan
should download and install."

I've CCed him should you have any questions.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/18/20 10:30 AM, David Laight wrote:
> From: Alexandre Chartre
>> Sent: 18 November 2020 07:42
>>
>> On 11/17/20 10:26 PM, Borislav Petkov wrote:
>>> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>>>> Some benchmarks are available, in particular from phoronix:
>>>
>>> What I was expecting was benchmarks *you* have run which show that
>>> perf penalty, not something one can find quickly on the internet and
>>> something one cannot always reproduce her-/himself.
>>>
>>> You do know that presenting convincing numbers with a patchset greatly
>>> improves its chances of getting it upstreamed, right?
>>
>> Well, it looks like I wrongfully assume that KPTI was a well known performance
>> overhead since it was introduced (because it adds extra page-table switches),
>> but you are right I should be presenting my own numbers.
>
> IIRC the penalty comes from the page table switch.
> Doing it at a different time is unlikely to make much difference.

Correct, this RFC is not changing the overhead. However, it is a step forward
for being able to execute some selected syscalls or interrupt handlers without
switching to the kernel page-table. The next step would be to identify and add
the necessary mapping to the user page-table so that specified syscalls can be
executed without switching the page-table.

> For some workloads the penalty is massive - getting on for 50%.
> We are still using old kernels on AWS.

Here are some micro benchmarks of the getppid and getpid syscalls which
highlight the PTI overhead. This uses the kernel tools/perf command, and the
getpid command from libMICRO (https://github.com/redhat-performance/libMicro):

system running 5.10-rc4 booted with nopti:
--

# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 1000 getppid() calls
     Total time: 0.792 [sec]

       0.079223 usecs/op
       12622549 ops/sec

# getpid -B 10
prc thr usecs/call samples errors cnt/samp
getpid 1 1 0.08029 1020 10

We can see that the getpid and getppid syscalls have the same execution
time, around 0.08 usecs. These syscalls are very small and just return
a value, so the time is mostly spent entering/exiting the kernel.

same system booted with pti:

# perf bench syscall basic
# Running 'syscall/basic' benchmark:
# Executed 1000 getppid() calls
     Total time: 2.025 [sec]

       0.202527 usecs/op
        4937605 ops/sec

# getpid -B 10
prc thr usecs/call samples errors cnt/samp
getpid 1 1 0.20241 1020 10

With PTI, the execution time jumps to 0.20 usecs (+0.12 usecs = +150%).

That's a very extreme case because these are very small syscalls, and
in that case the overhead to switch page-tables is significant compared
to the execution time of the syscall.

So with an overhead of +0.12 usecs per syscall, the PTI impact is significant
with workloads which use a lot of short syscalls. But if you use longer
syscalls, for example with an average execution time of 2.0 usecs per syscall,
then you have a lower overhead of 6%.

alex.
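The +150% and 6% figures above follow from simple ratios. A minimal sketch of that arithmetic, for illustration only: the two function names are hypothetical (they do not come from the thread or from the kernel), and the inputs are the rounded numbers quoted in the message (0.08 and 0.20 usecs, a fixed cost of 0.12 usecs).

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the thread): slowdown in percent when a
 * syscall that takes base_usecs without PTI takes pti_usecs with PTI.
 */
double pti_slowdown_percent(double base_usecs, double pti_usecs)
{
	assert(base_usecs > 0.0);
	return (pti_usecs - base_usecs) / base_usecs * 100.0;
}

/*
 * Hypothetical helper: weight in percent of a fixed per-syscall cost
 * (overhead_usecs) relative to the syscall's own execution time.
 */
double fixed_overhead_percent(double syscall_usecs, double overhead_usecs)
{
	assert(syscall_usecs > 0.0);
	return overhead_usecs / syscall_usecs * 100.0;
}
```

Calling pti_slowdown_percent(0.08, 0.20) gives roughly 150, and fixed_overhead_percent(2.0, 0.12) gives roughly 6, which matches the two figures quoted above. The same fixed cost weighs less as the syscall itself gets longer.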
RE: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
From: Alexandre Chartre
> Sent: 18 November 2020 07:42
>
> On 11/17/20 10:26 PM, Borislav Petkov wrote:
> > On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> >> Some benchmarks are available, in particular from phoronix:
> >
> > What I was expecting was benchmarks *you* have run which show that
> > perf penalty, not something one can find quickly on the internet and
> > something one cannot always reproduce her-/himself.
> >
> > You do know that presenting convincing numbers with a patchset greatly
> > improves its chances of getting it upstreamed, right?
> >
>
> Well, it looks like I wrongfully assume that KPTI was a well known performance
> overhead since it was introduced (because it adds extra page-table switches),
> but you are right I should be presenting my own numbers.

IIRC the penalty comes from the page table switch.
Doing it at a different time is unlikely to make much difference.

For some workloads the penalty is massive - getting on for 50%.
We are still using old kernels on AWS.

	David
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 10:26 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>> Some benchmarks are available, in particular from phoronix:
>
> What I was expecting was benchmarks *you* have run which show that
> perf penalty, not something one can find quickly on the internet and
> something one cannot always reproduce her-/himself.
>
> You do know that presenting convincing numbers with a patchset greatly
> improves its chances of getting it upstreamed, right?

Well, it looks like I wrongfully assume that KPTI was a well known performance
overhead since it was introduced (because it adds extra page-table switches),
but you are right I should be presenting my own numbers.

Thanks,

alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 10:23 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 08:02:51PM +0100, Alexandre Chartre wrote:
>> No. This prevents the guest VM from gathering data from the host
>> kernel on the same cpu-thread. But there's no mitigation for a guest
>> VM running on a cpu-thread attacking another cpu-thread (which can be
>> running another guest VM or the host kernel) from the same cpu-core.
>> You cannot use flush/clear barriers because the two cpu-threads are
>> running in parallel.
>
> Now there's your justification for why you're doing this. It took
> a while...
>
> The "why" should always be part of the 0th message to provide
> reviewers/maintainers with answers to the question, what this pile of
> patches is all about. Please always add this rationale to your patchset
> in the future.

Sorry about that, I will definitely try to do better next time. :-}

Thanks,

alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> Some benchmarks are available, in particular from phoronix:

What I was expecting was benchmarks *you* have run which show that
perf penalty, not something one can find quickly on the internet and
something one cannot always reproduce her-/himself.

You do know that presenting convincing numbers with a patchset greatly
improves its chances of getting it upstreamed, right?

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Tue, Nov 17, 2020 at 08:02:51PM +0100, Alexandre Chartre wrote:
> No. This prevents the guest VM from gathering data from the host
> kernel on the same cpu-thread. But there's no mitigation for a guest
> VM running on a cpu-thread attacking another cpu-thread (which can be
> running another guest VM or the host kernel) from the same cpu-core.
> You cannot use flush/clear barriers because the two cpu-threads are
> running in parallel.

Now there's your justification for why you're doing this. It took
a while...

The "why" should always be part of the 0th message to provide
reviewers/maintainers with answers to the question, what this pile of
patches is all about. Please always add this rationale to your patchset
in the future.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 7:28 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
>> Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at
>> the moment. In particular, this allows a guest VM to attack another guest VM
>> or the host kernel running on a sibling cpu-thread. Core Scheduling will
>> mitigate the guest-to-guest attack but not the guest-to-host attack.
>
> I see in vmx_vcpu_enter_exit():
>
>	/* L1D Flush includes CPU buffer clear to mitigate MDS */
>	if (static_branch_unlikely(&vmx_l1d_should_flush))
>		vmx_l1d_flush(vcpu);
>	else if (static_branch_unlikely(&mds_user_clear))
>		mds_clear_cpu_buffers();
>
> Is that not enough?

No. This prevents the guest VM from gathering data from the host kernel on the
same cpu-thread. But there's no mitigation for a guest VM running on a
cpu-thread attacking another cpu-thread (which can be running another guest VM
or the host kernel) from the same cpu-core. You cannot use flush/clear barriers
because the two cpu-threads are running in parallel.

alex.
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Tue, Nov 17, 2020 at 07:12:07PM +0100, Alexandre Chartre wrote:
> Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at
> the moment. In particular, this allows a guest VM to attack another guest VM
> or the host kernel running on a sibling cpu-thread. Core Scheduling will
> mitigate the guest-to-guest attack but not the guest-to-host attack.

I see in vmx_vcpu_enter_exit():

	/* L1D Flush includes CPU buffer clear to mitigate MDS */
	if (static_branch_unlikely(&vmx_l1d_should_flush))
		vmx_l1d_flush(vcpu);
	else if (static_branch_unlikely(&mds_user_clear))
		mds_clear_cpu_buffers();

Is that not enough?

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 6:07 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 09:19:01AM +0100, Alexandre Chartre wrote:
>> We are not reversing PTI, we are extending it.
>
> You're reversing it in the sense that you're mapping more kernel memory
> into the user page table than what is mapped now.
>
>> PTI removes all kernel mapping from the user page-table. However there's
>> no issue with mapping some kernel data into the user page-table as long as
>> these data have no sensitive information.
>
> I hope that is the case.
>
>> Actually, PTI is already doing that but with a very limited scope. PTI adds
>> into the user page-table some kernel mappings which are needed for userland
>> to enter the kernel (such as the kernel entry text, the ESPFIX, the
>> CPU_ENTRY_AREA_BASE...).
>>
>> So here, we are extending the PTI mapping so that we can execute more kernel
>> code while using the user page-table; it's a kind of PTI on steroids.
>
> And this is what bothers me - someone else might come after you and say,
> but but, I need to map more stuff into the user pgt because I wanna do
> X... and so on.

Agree, any addition should be strictly checked. I have been careful to expand
it to the minimum I needed.

>> The minimum size would be 1 page (4KB) as this is the minimum mapping size.
>> It's certainly enough for now as the usage of the PTI stack is limited, but
>> we will need a larger stack if we want to execute more kernel code with the
>> user page-table.
>
> So on a big machine with a million tasks, that's at least a million
> pages more which is what, ~4 GB?
>
> There better be a very good justification for the additional memory
> consumption...

Yeah, adding a per-task allocation is my main concern, hence this RFC.

alex.
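The memory estimate in the exchange above (a million tasks, one extra 4KB page each) can be checked with a few lines of C. This is only a sketch of the arithmetic: `extra_stack_bytes` and `PAGE_SIZE_BYTES` are names made up for this note, not kernel symbols, and the task count is the hypothetical one from the message.

```c
#include <assert.h>
#include <stdint.h>

/* 4KB, the minimum mapping size mentioned in the message above. */
#define PAGE_SIZE_BYTES 4096ULL

/*
 * Hypothetical helper: extra memory in bytes if every task gets
 * pages_per_task additional pages for its PTI stack.
 */
uint64_t extra_stack_bytes(uint64_t nr_tasks, uint64_t pages_per_task)
{
	assert(pages_per_task > 0);
	/* Guard against 64-bit overflow for absurdly large inputs. */
	assert(nr_tasks <= UINT64_MAX / (pages_per_task * PAGE_SIZE_BYTES));
	return nr_tasks * pages_per_task * PAGE_SIZE_BYTES;
}
```

With one page per task, extra_stack_bytes(1000000, 1) is 4,096,000,000 bytes, which is the "~4 GB" floor quoted above. With two pages per task (the extra 8KB that the RFC describes for its current stack doubling) the figure is twice that.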
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/17/20 5:55 PM, Borislav Petkov wrote:
> On Tue, Nov 17, 2020 at 08:56:23AM +0100, Alexandre Chartre wrote:
>> The main goal of ASI is to provide KVM address space isolation to
>> mitigate guest-to-host speculative attacks like L1TF or MDS.
>
> Because the current L1TF and MDS mitigations are lacking or why?

Yes. L1TF/MDS allow some inter cpu-thread attacks which are not mitigated at
the moment. In particular, this allows a guest VM to attack another guest VM
or the host kernel running on a sibling cpu-thread. Core Scheduling will
mitigate the guest-to-guest attack but not the guest-to-host attack.
Address Space Isolation provides a mitigation for guest-to-host attack.

>> Current proposal of ASI is plugged into the CR3 switch assembly macro
>> which makes the code brittle and complex. (see [1])
>>
>> I am also expecting this might help with some other ideas like having
>> syscall (or interrupt handler) which can run without switching the
>> page-table.
>
> I still fail to see why we need all that. I read, "this does this and
> that" but I don't read "the current problem is this" and "this is our
> suggested solution for it".
>
> So what is the issue which needs addressing in the current kernel which
> is going to justify adding all that code?

The main issue this is trying to address is that the CR3 switch is currently
done in assembly code from contexts which are very restrictive: the CR3 switch
is often done when only one or two registers are available for use, sometimes
no stack is available. For example, the syscall entry switches CR3 with a single
register available (%sp) and no stack.

Because of this, it is fairly tricky to expand the logic for switching CR3.
This is a problem that we have faced while implementing Address Space Isolation
(ASI) where we need extra logic to drive the page-table switch. We have
successfully implemented ASI with the current CR3 switching assembly code, but
this requires complex assembly construction. Hence this proposal to defer CR3
switching to C code so that it can be more easily expandable.

Hopefully this can also contribute to make the assembly entry code less complex,
and be beneficial to other projects.

>> PTI has a measured overhead of roughly 5% for most workloads, but it can
>> be much higher in some cases.
>
> "it can be"? Where? Actual use case?

Some benchmarks are available, in particular from phoronix:

https://www.phoronix.com/scan.php?page=article&item=linux-more-x86pti
https://www.phoronix.com/scan.php?page=news_item&px=x86-PTI-Initial-Gaming-Tests
https://www.phoronix.com/scan.php?page=article&item=linux-kpti-kvm
https://medium.com/@loganaden/linux-kpti-performance-hit-on-real-workloads-8da185482df3

>> The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged
>> directly into the CR3 switch assembly macro. We are working on a new
>> implementation, based on these changes which avoid having to deal with
>> assembly code and makes the implementation more robust.
>
> This still doesn't answer my questions. I read a lot of "could be used
> for" formulations but I still don't know why we need that. So what is
> the problem that the kernel currently has which you're trying to address
> with this?

Hopefully this is clearer with the answer I provided above.

Thanks,

alex.
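For readers unfamiliar with what the CR3 switch itself involves, here is a rough sketch of the bit manipulation, written as plain C over an integer value. It is not the code from this RFC. The mask names and bit positions follow, as far as I recall, the existing definitions in arch/x86/entry/calling.h (user page-table selected by bit 12, user PCID by bit 11); please check them against the tree you are using. The three function names are hypothetical, nothing here touches the real CR3 register, and the PCID no-flush handling done by the real entry code is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match mainline: user PGD is one page above the kernel PGD,
 * and user PCIDs have bit 11 set. */
#define PTI_USER_PGTABLE_BIT	12
#define PTI_USER_PCID_BIT	11
#define PTI_USER_PGTABLE_MASK	(1ULL << PTI_USER_PGTABLE_BIT)
#define PTI_USER_PCID_MASK	(1ULL << PTI_USER_PCID_BIT)
#define PTI_USER_PGTABLE_AND_PCID_MASK \
	(PTI_USER_PGTABLE_MASK | PTI_USER_PCID_MASK)

/* Hypothetical helper: does this CR3 value select the user page-table? */
int cr3_is_user(uint64_t cr3)
{
	return (cr3 & PTI_USER_PGTABLE_MASK) != 0;
}

/* Hypothetical helper: user CR3 value -> kernel CR3 value (clear both bits). */
uint64_t cr3_to_kernel(uint64_t cr3)
{
	return cr3 & ~PTI_USER_PGTABLE_AND_PCID_MASK;
}

/* Hypothetical helper: kernel CR3 value -> user CR3 value (set both bits). */
uint64_t cr3_to_user(uint64_t cr3)
{
	assert(!cr3_is_user(cr3));
	return cr3 | PTI_USER_PGTABLE_AND_PCID_MASK;
}
```

The computation is trivial. The difficulty described in the message above is doing it, plus any additional decision logic such as ASI, with one scratch register and no stack, which is what writing it in C would relax.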
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Tue, Nov 17, 2020 at 09:19:01AM +0100, Alexandre Chartre wrote:
> We are not reversing PTI, we are extending it.

You're reversing it in the sense that you're mapping more kernel memory
into the user page table than what is mapped now.

> PTI removes all kernel mapping from the user page-table. However there's
> no issue with mapping some kernel data into the user page-table as long as
> these data have no sensitive information.

I hope that is the case.

> Actually, PTI is already doing that but with a very limited scope. PTI adds
> into the user page-table some kernel mappings which are needed for userland
> to enter the kernel (such as the kernel entry text, the ESPFIX, the
> CPU_ENTRY_AREA_BASE...).
>
> So here, we are extending the PTI mapping so that we can execute more kernel
> code while using the user page-table; it's a kind of PTI on steroids.

And this is what bothers me - someone else might come after you and say,
but but, I need to map more stuff into the user pgt because I wanna do
X... and so on.

> The minimum size would be 1 page (4KB) as this is the minimum mapping size.
> It's certainly enough for now as the usage of the PTI stack is limited, but
> we will need a larger stack if we want to execute more kernel code with the
> user page-table.

So on a big machine with a million tasks, that's at least a million
pages more which is what, ~4 GB?

There better be a very good justification for the additional memory
consumption...

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Tue, Nov 17, 2020 at 08:56:23AM +0100, Alexandre Chartre wrote:
> The main goal of ASI is to provide KVM address space isolation to
> mitigate guest-to-host speculative attacks like L1TF or MDS.

Because the current L1TF and MDS mitigations are lacking or why?

> Current proposal of ASI is plugged into the CR3 switch assembly macro
> which makes the code brittle and complex. (see [1])
>
> I am also expecting this might help with some other ideas like having
> syscall (or interrupt handler) which can run without switching the
> page-table.

I still fail to see why we need all that. I read, "this does this and
that" but I don't read "the current problem is this" and "this is our
suggested solution for it".

So what is the issue which needs addressing in the current kernel which
is going to justify adding all that code?

> PTI has a measured overhead of roughly 5% for most workloads, but it can
> be much higher in some cases.

"it can be"? Where? Actual use case?

> The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged
> directly into the CR3 switch assembly macro. We are working on a new
> implementation, based on these changes which avoid having to deal with
> assembly code and makes the implementation more robust.

This still doesn't answer my questions. I read a lot of "could be used
for" formulations but I still don't know why we need that. So what is
the problem that the kernel currently has which you're trying to address
with this?

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/16/20 9:24 PM, Borislav Petkov wrote:
> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> Deferring CR3 switch to C code means that we need to run more of the
>> kernel entry code with the user page-table. To do so, we need to:
>>
>> - map more syscall, interrupt and exception entry code into the user
>>   page-table (map all noinstr code);
>>
>> - map additional data used in the entry code (such as stack canary);
>>
>> - run more entry code on the trampoline stack (which is mapped both
>>   in the kernel and in the user page-table) until we switch to the
>>   kernel page-table and then switch to the kernel stack;
>
> So PTI was added exactly to *not* have kernel memory mapped in the user
> page table. You're partially reversing that...

We are not reversing PTI, we are extending it.

PTI removes all kernel mapping from the user page-table. However there's
no issue with mapping some kernel data into the user page-table as long as
these data have no sensitive information.

Actually, PTI is already doing that but with a very limited scope. PTI adds
into the user page-table some kernel mappings which are needed for userland
to enter the kernel (such as the kernel entry text, the ESPFIX, the
CPU_ENTRY_AREA_BASE...).

So here, we are extending the PTI mapping so that we can execute more kernel
code while using the user page-table; it's a kind of PTI on steroids.

>> - have a per-task trampoline stack instead of a per-cpu trampoline
>>   stack, so the task can be scheduled out while it hasn't switched
>>   to the kernel stack.
>
> per-task? How much more memory is that per task?

Currently, this is done by doubling the size of the task stack (patch 8),
so that's an extra 8KB. Half of the stack is used as the regular kernel
stack, and the other half used as the PTI stack:

+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ *   +-------------+
+ *   |             | ^                       ^
+ *   | kernel-only | | KERNEL_STACK_SIZE     |
+ *   |    stack    | |                       |
+ *   |             | V                       |
+ *   +-------------+ <- top of kernel stack  | THREAD_SIZE
+ *   |             | ^                       |
+ *   | kernel and  | | KERNEL_STACK_SIZE     |
+ *   | PTI stack   | |                       |
+ *   |             | V                       v
+ *   +-------------+ <- top of stack
+ */

The minimum size would be 1 page (4KB) as this is the minimum mapping size.
It's certainly enough for now as the usage of the PTI stack is limited, but
we will need a larger stack if we want to execute more kernel code with the
user page-table.

alex.
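The layout described in the patch comment can be expressed as a few address computations. The sketch below is only an illustration under stated assumptions: KERNEL_STACK_SIZE is set to 8KB to match the "extra 8KB" mentioned in the message, the three function names are hypothetical, and the stack base is an arbitrary number rather than a real allocation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: each half is 8KB, per the "extra 8KB" in the message. */
#define KERNEL_STACK_SIZE	(8u * 1024u)
/* PTI doubles the stack: kernel-only half plus kernel-and-PTI half. */
#define THREAD_SIZE		(2u * KERNEL_STACK_SIZE)

/* Stack pointer on kernel entry: top of the whole allocation, i.e. the
 * top of the half that is mapped in both page-tables. */
uintptr_t pti_stack_top(uintptr_t stack_base)
{
	return stack_base + THREAD_SIZE;
}

/* Stack pointer after the switch to the kernel page-table: top of the
 * half that is mapped only in the kernel. */
uintptr_t kernel_stack_top(uintptr_t stack_base)
{
	return stack_base + KERNEL_STACK_SIZE;
}

/* Is this stack address also visible through the user page-table? */
int mapped_in_user_pgtable(uintptr_t stack_base, uintptr_t addr)
{
	assert(addr >= stack_base && addr < stack_base + THREAD_SIZE);
	return addr >= stack_base + KERNEL_STACK_SIZE;
}
```

For a stack based at 0x100000, entry code starts at pti_stack_top() = 0x104000 and pushes downward inside the upper half. Once it has switched page-tables it moves to kernel_stack_top() = 0x102000, below which nothing is visible with the user page-table.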
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On 11/16/20 9:17 PM, Borislav Petkov wrote:
> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
>> This RFC proposes to defer the PTI CR3 switch until we reach C code.
>> The benefit is that this simplifies the assembly entry code, and makes
>> the PTI CR3 switch code easier to understand. This also paves the way
>> for further possible projects such as an easier integration of Address
>> Space Isolation (ASI), or the possibility to execute some selected
>> syscall or interrupt handlers without switching to the kernel page-table
>
> What for? What is this going to be used for in the end?

In addition to simplifying the assembly entry code, this will also simplify
the integration of Address Space Isolation (ASI) which will certainly
be the primary beneficiary of this change. The main goal of ASI is to
provide KVM address space isolation to mitigate guest-to-host speculative
attacks like L1TF or MDS. Current proposal of ASI is plugged into the
CR3 switch assembly macro which makes the code brittle and complex.
(see [1])

I am also expecting this might help with some other ideas like having
syscall (or interrupt handler) which can run without switching the
page-table.

>> (and thus avoid the PTI page-table switch overhead).
>
> Overhead of how much? Why do we care?

PTI has a measured overhead of roughly 5% for most workloads, but it can
be much higher in some cases. The overhead is mostly due to the page-table
switch (even with PCID) so if we can run a syscall or an interrupt handler
without switching the page-table then we can get this kind of performance
back.

> What is the big picture justification for this diffstat
>
>> 21 files changed, 874 insertions(+), 314 deletions(-)
>
> and the diffstat for the ASI enablement?

The latest ASI RFC (RFC v4) is here [1]. This RFC has ASI plugged
directly into the CR3 switch assembly macro. We are working on a new
implementation, based on these changes which avoid having to deal with
assembly code and makes the implementation more robust.

alex.

[1] ASI RFCv4 - https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.char...@oracle.com/
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
> Deferring CR3 switch to C code means that we need to run more of the
> kernel entry code with the user page-table. To do so, we need to:
>
> - map more syscall, interrupt and exception entry code into the user
>   page-table (map all noinstr code);
>
> - map additional data used in the entry code (such as stack canary);
>
> - run more entry code on the trampoline stack (which is mapped both
>   in the kernel and in the user page-table) until we switch to the
>   kernel page-table and then switch to the kernel stack;

So PTI was added exactly to *not* have kernel memory mapped in the user
page table. You're partially reversing that...

> - have a per-task trampoline stack instead of a per-cpu trampoline
>   stack, so the task can be scheduled out while it hasn't switched
>   to the kernel stack.

per-task? How much more memory is that per task?

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code
On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
> This RFC proposes to defer the PTI CR3 switch until we reach C code.
> The benefit is that this simplifies the assembly entry code, and makes
> the PTI CR3 switch code easier to understand. This also paves the way
> for further possible projects such as an easier integration of Address
> Space Isolation (ASI), or the possibility to execute some selected
> syscall or interrupt handlers without switching to the kernel page-table

What for? What is this going to be used for in the end?

> (and thus avoid the PTI page-table switch overhead).

Overhead of how much? Why do we care?

What is the big picture justification for this diffstat

> 21 files changed, 874 insertions(+), 314 deletions(-)

and the diffstat for the ASI enablement?

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette