Re: [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster
On 11/16/2017 11:19 AM, Andrea Arcangeli wrote:
> On Fri, Nov 10, 2017 at 11:31:50AM -0800, Dave Hansen wrote:
>> Hugh Dickins also points out that PCIDs really have two distinct
>> use-cases in the context of KAISER. The first way they can be used
> I don't see why you try to retain such a minor optimization for newer
> Intel chips when at the same you prevent KAISER to run with good
> performance on older Intel chips like SandyBridge/IvyBridge which
> would create a major performance regression for those two.

This was more straightforward to do. The other way requires having
*TWO* PCID modes. So, we need to disambiguate the two modes in the
existing infrastructure in addition to adding KAISER. Had I gone and
done that, my fear was that we would be left with no usable PCIDs on
*any* hardware.

So, this was easier, I went and did it first, and I'd love to see
someone add support for PCIDs on those older non-INVPCID systems.
"Someone" may even be me, but it'll be in v2. Patches welcome before
then. :)
Re: [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster
Hello,

On Fri, Nov 10, 2017 at 11:31:50AM -0800, Dave Hansen wrote:
> Hugh Dickins also points out that PCIDs really have two distinct
> use-cases in the context of KAISER. The first way they can be used

I don't see why you try to retain such a minor optimization for newer
Intel chips when at the same time you prevent KAISER from running with
good performance on older Intel chips like SandyBridge/IvyBridge, which
would create a major performance regression for those two.

I'd prefer if you reverse the PCID feature of v4.14 when KAISER is
enabled (at build time would be enough initially), and you use just two
ASIDs to only accelerate kernel enter/exit, and you flush the whole TLB
on mm switch like Hugh suggested. It may not even be worth flushing via
CR4, as you've only two ASIDs to deal with anyway.
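
One possible reading of the two-ASID scheme suggested above can be sketched as plain CR3 arithmetic. This is a user-space illustration only, not code from the patch: the names build_cr3(), cr3_enter_kernel(), cr3_exit_to_user(), cr3_switch_mm() and the ASID values are hypothetical. The only architectural facts assumed are that, with CR4.PCIDE set, CR3 bits 11:0 carry the PCID and bit 63 of the written value requests that TLB entries be preserved.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural layout with CR4.PCIDE=1: PCID in bits 11:0, and bit 63
 * of the value written to CR3 means "do not flush". */
#define CR3_PCID_MASK 0xFFFULL
#define CR3_NOFLUSH   (1ULL << 63)

/* Hypothetical fixed ASIDs for the two-ASID scheme; values are arbitrary. */
#define ASID_KERN 1u
#define ASID_USER 2u

/* Compose a CR3 value from a page-table physical address and an ASID. */
static uint64_t build_cr3(uint64_t pgd_pa, unsigned int asid, int noflush)
{
	assert((pgd_pa & CR3_PCID_MASK) == 0);	/* tables are 4 KiB aligned */
	assert(asid <= CR3_PCID_MASK);
	return pgd_pa | asid | (noflush ? CR3_NOFLUSH : 0);
}

/* Kernel entry/exit stays within one mm, so both ASIDs keep their entries. */
static uint64_t cr3_enter_kernel(uint64_t kern_pgd_pa)
{
	return build_cr3(kern_pgd_pa, ASID_KERN, 1);
}

static uint64_t cr3_exit_to_user(uint64_t user_pgd_pa)
{
	return build_cr3(user_pgd_pa, ASID_USER, 1);
}

/* The two CR3 values an mm switch would write, in order. */
struct mm_switch_writes {
	uint64_t first;		/* flushing write for the user ASID */
	uint64_t second;	/* flushing write for the kernel ASID */
};

/* mm switch: one flushing write per ASID, ending on the kernel ASID
 * because the switch itself runs in the kernel. No CR4 toggle is used. */
static struct mm_switch_writes cr3_switch_mm(uint64_t kern_pgd_pa,
					     uint64_t user_pgd_pa)
{
	struct mm_switch_writes w = {
		.first  = build_cr3(user_pgd_pa, ASID_USER, 0),
		.second = build_cr3(kern_pgd_pa, ASID_KERN, 0),
	};
	return w;
}
```

With only two ASIDs, the mm switch costs two flushing CR3 writes, which is the alternative to the CR4 flush mentioned in the last sentence of the mail. Global pages are not covered by this sketch.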
[PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster
From: Dave Hansen

Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls. Speed them up. Makes context switches and TLB
flushing slower.

Background:

KAISER keeps two copies of the page tables. Switches between the
copies are performed by writing to the CR3 register. But, CR3 was
really designed for context switches and writes to it also flush the
entire TLB (modulo global pages). This TLB flush increases the cost of
interrupts and context switches. For syscall-heavy microbenchmarks it
can cut the rate of syscalls by 2/3.

The kernel recently gained support for an Intel CPU feature called
Process Context IDentifiers (PCID) thanks to Andy Lutomirski. This
feature is intended to allow you to switch between contexts without
flushing the TLB.

Implementation:

PCIDs can be used to avoid flushing the TLB at kernel entry/exit. This
speeds up both interrupts and syscalls.

First, the kernel and userspace must be assigned different ASIDs. On
entry from userspace, move over to the kernel page tables *and* ASID.
On exit, restore the user page tables and ASID. Fortunately, the ASID
is programmed via CR3, which is already being used to switch between
the user and kernel page tables. This gives us convenient, one-stop
shopping.

The CR3 write which is used to switch between processes provides all
the TLB flushing normally required at context switch time. But, with
KAISER, that CR3 write only flushes the current (kernel) ASID. An
extra TLB flush operation is now required in order to flush the user
ASID. This new instruction (INVPCID) is probably ~100 cycles, but this
is done with the assumption that the time lost in context switches is
more than made up for by lower cost of interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs. However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", or called fourth-generation Core). This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use. TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive. This is because
   there is no way to flush a kernel address that might be loaded in
   *EVERY* PCID. Right now, there are "only" ~12 of these per-cpu, but
   that is too many to flush one by one with MOV-to-CR3. That leaves
   only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for the kernel
      ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the flush
      done for each. For instance, what is currently a single
      instruction without KAISER:

		invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

		invpcid_flush_one(current_kern_pcid, addr);
		invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, fully disable PCIDs with KAISER when INVPCID is not
available. This is fixable, but it's an optimization that can be
performed later.

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER. The first way they can be used is
as "TLB preservation across context-switch", which is what Andy
Lutomirski's 4.14 PCID code does. They can also be used as a "KAISER
syscall/interrupt accelerator". If we just use them to speed up
syscall/interrupts (and ignore the context-switch TLB preservation),
then the deficiency of not having INVPCID becomes much less onerous.
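
The reason a single-page flush doubles under KAISER can be illustrated with a toy, user-space model of a PCID-tagged TLB. Nothing here is kernel code: struct tlb, tlb_fill(), tlb_hit(), model_invpcid_flush_one() and model_kaiser_flush_one() are hypothetical names, and the model ignores global pages, page sizes and replacement policy. It only assumes that an individual-address INVPCID removes the translation for one address under one PCID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_SLOTS 8

/* One cached translation, tagged with the PCID it was created under. */
struct tlb_entry {
	bool valid;
	uint16_t pcid;
	uint64_t vaddr;
};

struct tlb {
	struct tlb_entry slot[TLB_SLOTS];
};

/* Record a translation; returns false if the toy TLB is full. */
static bool tlb_fill(struct tlb *t, uint16_t pcid, uint64_t vaddr)
{
	assert(t != NULL);
	for (size_t i = 0; i < TLB_SLOTS; i++) {
		if (!t->slot[i].valid) {
			t->slot[i] = (struct tlb_entry){ true, pcid, vaddr };
			return true;
		}
	}
	return false;
}

/* Is there a cached translation for this address under this PCID? */
static bool tlb_hit(const struct tlb *t, uint16_t pcid, uint64_t vaddr)
{
	assert(t != NULL);
	for (size_t i = 0; i < TLB_SLOTS; i++) {
		const struct tlb_entry *e = &t->slot[i];

		if (e->valid && e->pcid == pcid && e->vaddr == vaddr)
			return true;
	}
	return false;
}

/* Model of an individual-address INVPCID: one address, one PCID. */
static void model_invpcid_flush_one(struct tlb *t, uint16_t pcid,
				    uint64_t vaddr)
{
	assert(t != NULL);
	for (size_t i = 0; i < TLB_SLOTS; i++) {
		struct tlb_entry *e = &t->slot[i];

		if (e->valid && e->pcid == pcid && e->vaddr == vaddr)
			e->valid = false;
	}
}

/* With KAISER the same address may be cached under both ASIDs, so a
 * single-page flush has to be issued once per ASID. */
static void model_kaiser_flush_one(struct tlb *t, uint16_t kern_pcid,
				   uint16_t user_pcid, uint64_t vaddr)
{
	model_invpcid_flush_one(t, kern_pcid, vaddr);
	model_invpcid_flush_one(t, user_pcid, vaddr);
}
```

In this model, flushing an address under the kernel PCID alone leaves the entry tagged with the user PCID in place, which is the stale translation the second invpcid_flush_one() call in the description is there to remove.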
Signed-off-by: Dave Hansen
Cc: Moritz Lipp
Cc: Daniel Gruss
Cc: Michael Schwarz
Cc: Richard Fellner
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Hugh Dickins
Cc: x...@kernel.org
---

 b/arch/x86/entry/calling.h                    |   25 +++-
 b/arch/x86/entry/entry_64.S                   |    1
 b/arch/x86/include/asm/cpufeatures.h          |    1
 b/arch/x86/include/asm/pgtable_types.h        |   11 ++
 b/arch/x86/include/asm/tlbflush.h             |  137 +-
 b/arch/x86/include/uapi/asm/processor-flags.h |    3
 b/arch/x86/kvm/x86.c                          |    3
 b/arch/x86/mm/init.c                          |   75 +-
 b/arch/x86/mm/tlb.c                           |   66
 9 files changed, 262 insertions(+), 60 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-pcid arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-pcid 2017-11-10
[PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster
From: Dave Hansen

Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls. Speed them up. Makes context switches and TLB
flushing slower.

Background:

KAISER keeps two copies of the page tables. We switch between them
with the CR3 register. But, CR3 was really designed for context
switches and changing it also flushes the entire TLB (modulo global
pages). This TLB flush increases the cost of interrupts and context
switches. For syscall-heavy microbenchmarks it can cut the rate of
syscalls by 2/3.

But, now we have support for an Intel CPU feature called Process
Context IDentifiers (PCID) in the kernel thanks to Andy Lutomirski.
This feature is intended to allow you to switch between contexts
without flushing the TLB.

Implementation:

We can use PCIDs to avoid flushing the TLB at kernel entry/exit. This
speeds up both interrupts and syscalls.

We do this by assigning the kernel and userspace different ASIDs. On
entry from userspace, we move over to the kernel page tables *and*
ASID. On exit, we restore the user page tables and ASID. Fortunately,
the ASID is programmed via CR3, which we are already using to switch
between the page table copies. So, we get one-stop shopping.

In current kernels, CR3 is used to switch between processes which also
provides all the TLB flushing that we need at a context switch. But,
with KAISER, that CR3 move only flushes the current (kernel) ASID. We
need an extra TLB flushing operation to flush the user ASID: invpcid.
This is probably ~100 cycles, but this is done with the assumption
that the time we lose in context switches is more than made up for in
interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs. However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", or called fourth-generation Core). This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use. TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive. This is because
   there is no way to flush a kernel address that might be loaded in
   *EVERY* PCID. Right now, there are "only" ~12 of these per-cpu, but
   that is too many to flush one by one with MOV-to-CR3. That leaves
   only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for the kernel
      ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the flush
      done for each. For instance, what is currently a single
      instruction without KAISER:

		invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

		invpcid_flush_one(current_kern_pcid, addr);
		invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, we fully disable PCIDs with KAISER when INVPCID is not
available. This is fixable, but it's an optimization that we can do
later.

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER. The first way they can be used is
as "TLB preservation across context-switch", which is what Andy
Lutomirski's 4.14 PCID code does. They can also be used as a "KAISER
syscall/interrupt accelerator". If we just use them to speed up
syscall/interrupts (and ignore the context-switch TLB preservation),
then the deficiency of not having INVPCID becomes much less onerous.
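
The "Support" policy above (PCIDs are disabled under KAISER unless INVPCID is present) can be written down as a small decision function. This is a sketch under stated assumptions, not the code in the patch: choose_pcid_mode(), enum pcid_mode and pcid_policy_examples() are hypothetical names, and the caller is assumed to pass in raw CPUID register values. The two bit positions are the architectural ones: CPUID.01H:ECX bit 17 for PCID and CPUID.(EAX=07H,ECX=0):EBX bit 10 for INVPCID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural CPUID feature bits. */
#define CPUID_1_ECX_PCID	(1u << 17)	/* CPUID.01H:ECX[17] */
#define CPUID_7_0_EBX_INVPCID	(1u << 10)	/* CPUID.(07H,0):EBX[10] */

enum pcid_mode {
	PCID_OFF,
	PCID_ON,
};

/* Decide whether PCIDs are used, given raw CPUID output and whether
 * KAISER is enabled. Mirrors the rule stated in the description:
 * with KAISER, PCID is only kept when INVPCID is also available. */
static enum pcid_mode choose_pcid_mode(uint32_t cpuid1_ecx,
				       uint32_t cpuid7_ebx,
				       bool kaiser_enabled)
{
	bool has_pcid = (cpuid1_ecx & CPUID_1_ECX_PCID) != 0;
	bool has_invpcid = (cpuid7_ebx & CPUID_7_0_EBX_INVPCID) != 0;

	if (!has_pcid)
		return PCID_OFF;
	if (kaiser_enabled && !has_invpcid)
		return PCID_OFF;
	return PCID_ON;
}

/* Usage examples with synthetic register values for the two CPU
 * classes named in the description. */
static void pcid_policy_examples(void)
{
	/* Sandybridge-like: PCID but no INVPCID. */
	assert(choose_pcid_mode(CPUID_1_ECX_PCID, 0, true) == PCID_OFF);
	assert(choose_pcid_mode(CPUID_1_ECX_PCID, 0, false) == PCID_ON);

	/* Haswell-like: both features present. */
	assert(choose_pcid_mode(CPUID_1_ECX_PCID, CPUID_7_0_EBX_INVPCID,
				true) == PCID_ON);
}
```

The Sandybridge-like case is the one raised in the replies: under this rule those CPUs lose PCIDs entirely once KAISER is on, even though the syscall/interrupt accelerator use-case would not strictly need INVPCID.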
Signed-off-by: Dave Hansen
Cc: Moritz Lipp
Cc: Daniel Gruss
Cc: Michael Schwarz
Cc: Richard Fellner
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Hugh Dickins
Cc: x...@kernel.org
---

 b/arch/x86/entry/calling.h                    |   25 +++-
 b/arch/x86/entry/entry_64.S                   |    1
 b/arch/x86/include/asm/cpufeatures.h          |    1
 b/arch/x86/include/asm/pgtable_types.h        |   11 ++
 b/arch/x86/include/asm/tlbflush.h             |  141 +-
 b/arch/x86/include/uapi/asm/processor-flags.h |    3
 b/arch/x86/kvm/x86.c                          |    3
 b/arch/x86/mm/init.c                          |   75 +
 b/arch/x86/mm/tlb.c                           |   66 +++-
 9 files changed, 264 insertions(+), 62 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-pcid arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-pcid 2017-11-08 10:45:38.410681372 -0800
+++ b/arch/x86/entry/calling.h 2017-11-08