Re: [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster

2017-11-16 Thread Dave Hansen
On 11/16/2017 11:19 AM, Andrea Arcangeli wrote:
> On Fri, Nov 10, 2017 at 11:31:50AM -0800, Dave Hansen wrote:
>> Hugh Dickins also points out that PCIDs really have two distinct
>> use-cases in the context of KAISER.  The first way they can be used
> I don't see why you try to retain such a minor optimization for newer
> Intel chips when at the same time you prevent KAISER from running with
> good performance on older Intel chips like SandyBridge/IvyBridge, which
> would create a major performance regression for those two.

This was more straightforward to do.

The other way requires having *TWO* PCID modes.  So, we need to
disambiguate the two modes in the existing infrastructure in addition to
adding KAISER.

Had I gone and done that, my fear was that we would be left with no
usable PCIDs on *any* hardware.  So, since this was easier, I did it
first.  I'd love to see someone add support for PCIDs on those older
non-INVPCID systems.  "Someone" may even be me, but it'll be in v2.

Patches welcome before then. :)


Re: [PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster

2017-11-16 Thread Andrea Arcangeli
Hello,

On Fri, Nov 10, 2017 at 11:31:50AM -0800, Dave Hansen wrote:
> Hugh Dickins also points out that PCIDs really have two distinct
> use-cases in the context of KAISER.  The first way they can be used

I don't see why you try to retain such a minor optimization for newer
Intel chips when at the same time you prevent KAISER from running with
good performance on older Intel chips like SandyBridge/IvyBridge, which
would create a major performance regression for those two. I'd prefer
if you reverted the PCID feature of v4.14 when KAISER is enabled (at
build time would be enough initially), and used just two ASIDs to
only accelerate kernel enter/exit, flushing the whole TLB on mm
switch like Hugh suggested. It may not even be worth flushing via
CR4, as you've only two ASIDs to deal with anyway.


[PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster

2017-11-10 Thread Dave Hansen

From: Dave Hansen 

Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls.  Speed them up.  Makes context switches
and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  Switches between the
copies are performed by writing to the CR3 register.  But, CR3
was really designed for context switches and writes to it also
flush the entire TLB (modulo global pages).  This TLB flush
increases the cost of interrupts and context switches.  For
syscall-heavy microbenchmarks it can cut the rate of syscalls by
2/3.

The kernel recently gained support for an Intel CPU feature
called Process Context IDentifiers (PCID) thanks to Andy
Lutomirski.  This feature is intended to allow you to switch
between contexts without flushing the TLB.

Implementation:

PCIDs can be used to avoid flushing the TLB at kernel entry/exit.
This speeds up both interrupts and syscalls.

First, the kernel and userspace must be assigned different ASIDs.
On entry from userspace, move over to the kernel page tables
*and* ASID.  On exit, restore the user page tables and ASID.
Fortunately, the ASID is programmed via CR3, which is already
being used to switch between the user and kernel page tables.
This gives us convenient, one-stop shopping.
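
To make the "one-stop shopping" concrete, here is a minimal sketch of
how a CR3 value carries both the page table base and the ASID.  The
bit positions (PCID in bits 0-11, no-flush request in bit 63) are
architectural when CR4.PCIDE is set; the helper name build_cr3() and
the macro names are made up for illustration and are not taken from
this patch.

```c
#include <assert.h>
#include <stdint.h>

/* With CR4.PCIDE=1, CR3 bits 0-11 hold the PCID (ASID). */
#define CR3_PCID_MASK  0xFFFULL
/* Bit 63 set on a CR3 write asks the CPU to keep that PCID's TLB entries. */
#define CR3_NOFLUSH    (1ULL << 63)

/*
 * Hypothetical helper: combine a page-aligned page table base with an
 * ASID, optionally requesting a non-flushing switch.  One such value
 * would be written at kernel entry and another at kernel exit.
 */
static uint64_t build_cr3(uint64_t pgd_phys, uint16_t asid, int noflush)
{
    assert((pgd_phys & CR3_PCID_MASK) == 0);  /* low 12 bits are for the PCID */
    assert(asid <= CR3_PCID_MASK);

    uint64_t cr3 = pgd_phys | asid;
    if (noflush)
        cr3 |= CR3_NOFLUSH;
    return cr3;
}
```

The same write therefore selects the page table copy and the ASID,
which is why no extra instruction is needed on the entry/exit path.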

The CR3 write which is used to switch between processes provides
all the TLB flushing normally required at context switch time.
But, with KAISER, that CR3 write only flushes the current
(kernel) ASID.  An extra TLB flush operation is now required in
order to flush the user ASID.  This new instruction (INVPCID) is
probably ~100 cycles, but this is done with the assumption that
the time lost in context switches is more than made up for by
lower cost of interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", or called fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.
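
For reference, a rough sketch of what INVPCID takes as input.  The
instruction uses a 128-bit memory descriptor plus a type in a
register; the four type values below are the architectural ones.  The
struct, enum and function names are illustrative only, and the sketch
deliberately stops short of executing the (privileged) instruction.

```c
#include <assert.h>
#include <stdint.h>

/* The four INVPCID invalidation types (names here are illustrative). */
enum invpcid_kind {
    INVPCID_KIND_INDIVIDUAL_ADDR = 0, /* one address in one PCID */
    INVPCID_KIND_SINGLE_CONTEXT  = 1, /* all non-global entries of one PCID */
    INVPCID_KIND_ALL_INCL_GLOBAL = 2, /* every PCID, including global pages */
    INVPCID_KIND_ALL_NON_GLOBAL  = 3, /* every PCID, except global pages */
};

/* 128-bit descriptor: PCID in bits 0-11, linear address in bits 64-127. */
struct invpcid_desc {
    uint64_t pcid;  /* bits 12-63 are reserved and must be zero */
    uint64_t addr;  /* only consulted for the individual-address type */
};

/* Hypothetical helper that fills in a descriptor for a given PCID. */
static struct invpcid_desc make_invpcid_desc(uint16_t pcid, uint64_t addr)
{
    assert(pcid <= 0xFFF);  /* PCIDs are 12 bits wide */
    struct invpcid_desc d = { .pcid = pcid, .addr = addr };
    return d;
}
```

The single-context type is what lets the user ASID be flushed while
the kernel ASID stays loaded in CR3, and the all-including-global type
is what replaces the double MOV-to-CR4.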

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive.  This is because
   there is no way to flush a kernel address that might be loaded
   in *EVERY* PCID.  Right now, there are "only" ~12 of these per-cpu,
   but that's too painful to use the MOV-to-CR3 to flush them.  That
   leaves only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for
      the kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the
      flush done for each.  For instance, what is currently a
      single instruction without KAISER:

        invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

        invpcid_flush_one(current_kern_pcid, addr);
        invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

        __native_flush_tlb_single(addr);
        write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
        __native_flush_tlb_single(addr);
        write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, fully disable PCIDs with KAISER when INVPCID is not
available.  This is fixable, but it's an optimization that can be
performed later.
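
A small sketch of the resulting policy, assuming the decision is made
purely from CPUID bits.  The bit positions are architectural
(CPUID.01H:ECX[17] for PCID, CPUID.(EAX=07H,ECX=0):EBX[10] for
INVPCID); the function and enum names are hypothetical, and the
operation counts simply tally the single-page userspace flush
sequences listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID_01_ECX_PCID     (1u << 17)  /* CPUID.01H:ECX[17] */
#define CPUID_07_EBX_INVPCID  (1u << 10)  /* CPUID.(EAX=07H,ECX=0):EBX[10] */

/* Hypothetical labels for the three cases in the changelog. */
enum flush_mode {
    FLUSH_PCID_NO_KAISER,     /* today's code: one INVPCID */
    FLUSH_KAISER_INVPCID,     /* kernel + user ASID: two INVPCIDs */
    FLUSH_KAISER_NO_INVPCID,  /* two single-page flushes + two CR3 writes */
};

/* With KAISER, PCIDs are only used when INVPCID is there as well. */
static bool kaiser_pcid_usable(uint32_t leaf1_ecx, uint32_t leaf7_ebx)
{
    return (leaf1_ecx & CPUID_01_ECX_PCID) &&
           (leaf7_ebx & CPUID_07_EBX_INVPCID);
}

/* Operations needed to flush one userspace page in each case. */
static int ops_per_user_page_flush(enum flush_mode mode)
{
    switch (mode) {
    case FLUSH_PCID_NO_KAISER:    return 1;
    case FLUSH_KAISER_INVPCID:    return 2;
    case FLUSH_KAISER_NO_INVPCID: return 4;
    }
    assert(!"unknown flush mode");
    return -1;
}
```

On a CPU that reports PCID but not INVPCID, kaiser_pcid_usable()
returns false, which corresponds to the "fully disable PCIDs" case.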

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-switch", which is what
Andy Lutomirski's 4.14 PCID code does.  They can also be used as
a "KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscall/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID
becomes much less onerous.
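
To illustrate the second use-case, here is a toy model (not kernel
code) of an ASID-tagged TLB with only two ASIDs.  Kernel entry/exit
switches the ASID without flushing, and an mm switch simply drops
everything, so no cross-ASID invalidation is needed.  All names are
invented for the sketch, and global pages are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TLB_SLOTS 8
enum { ASID_KERN = 0, ASID_USER = 1 };

struct tlb_entry { bool valid; uint16_t asid; uint64_t vpn; };
struct toy_tlb   { struct tlb_entry e[TLB_SLOTS]; uint16_t cur_asid; };

static void tlb_init(struct toy_tlb *t)
{
    memset(t, 0, sizeof(*t));
    t->cur_asid = ASID_KERN;
}

/* Cache a translation, tagged with whichever ASID is current. */
static void tlb_fill(struct toy_tlb *t, uint64_t vpn)
{
    int slot = 0;  /* if the toy TLB is full, slot 0 is overwritten */
    for (int i = 0; i < TLB_SLOTS; i++) {
        if (!t->e[i].valid) { slot = i; break; }
    }
    t->e[slot] = (struct tlb_entry){ true, t->cur_asid, vpn };
}

/* A lookup only matches entries tagged with the current ASID. */
static bool tlb_hit(const struct toy_tlb *t, uint64_t vpn)
{
    for (int i = 0; i < TLB_SLOTS; i++) {
        if (t->e[i].valid && t->e[i].asid == t->cur_asid &&
            t->e[i].vpn == vpn)
            return true;
    }
    return false;
}

/* Kernel entry/exit: change ASID, keep all entries (bit 63 set). */
static void switch_asid_noflush(struct toy_tlb *t, uint16_t asid)
{
    assert(asid == ASID_KERN || asid == ASID_USER);
    t->cur_asid = asid;
}

/* mm switch in "accelerator only" mode: flush the whole TLB. */
static void mm_switch_flush_all(struct toy_tlb *t)
{
    for (int i = 0; i < TLB_SLOTS; i++)
        t->e[i].valid = false;
}
```

In this model a user translation survives a syscall round trip but
not a context switch, which is the trade-off being described.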

Signed-off-by: Dave Hansen 
Cc: Moritz Lipp 
Cc: Daniel Gruss 
Cc: Michael Schwarz 
Cc: Richard Fellner 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Kees Cook 
Cc: Hugh Dickins 
Cc: x...@kernel.org
---

 b/arch/x86/entry/calling.h                    |   25 +++-
 b/arch/x86/entry/entry_64.S                   |    1
 b/arch/x86/include/asm/cpufeatures.h          |    1
 b/arch/x86/include/asm/pgtable_types.h        |   11 ++
 b/arch/x86/include/asm/tlbflush.h             |  137 +-
 b/arch/x86/include/uapi/asm/processor-flags.h |    3
 b/arch/x86/kvm/x86.c                          |    3
 b/arch/x86/mm/init.c                          |   75 +-
 b/arch/x86/mm/tlb.c                           |   66
 9 files changed, 262 insertions(+), 60 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-pcid arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-pcid  2017-11-10 

[PATCH 23/30] x86, kaiser: use PCID feature to make user and kernel switches faster

2017-11-08 Thread Dave Hansen

From: Dave Hansen 

Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls.  Speed them up.  Makes context switches
and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  We switch between them
with the CR3 register.  But, CR3 was really designed for context
switches and changing it also flushes the entire TLB (modulo global
pages).  This TLB flush increases the cost of interrupts and context
switches.  For syscall-heavy microbenchmarks it can cut the rate of
syscalls by 2/3.

But, now we have support for an Intel CPU feature called Process
Context IDentifiers (PCID) in the kernel thanks to Andy Lutomirski.
This feature is intended to allow you to switch between contexts
without flushing the TLB.

Implementation:

We can use PCIDs to avoid flushing the TLB at kernel entry/exit.
This speeds up both interrupts and syscalls.

We do this by assigning the kernel and userspace different ASIDs.  On
entry from userspace, we move over to the kernel page tables *and*
ASID.  On exit, we restore the user page tables and ASID.  Fortunately,
the ASID is programmed via CR3, which we are already using to switch
between the page table copies.  So, we get one-stop shopping.

In current kernels, CR3 is used to switch between processes which also
provides all the TLB flushing that we need at a context switch.  But,
with KAISER, that CR3 move only flushes the current (kernel) ASID.  We
need an extra TLB flushing operation to flush the user ASID: invpcid.
This is probably ~100 cycles, but this is done with the assumption that
the time we lose in context switches is more than made up for in
interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", or called fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive.  This is because
   there is no way to flush a kernel address that might be loaded
   in *EVERY* PCID.  Right now, there are "only" ~12 of these per-cpu,
   but that's too painful to use the MOV-to-CR3 to flush them.  That
   leaves only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for
      the kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the
      flush done for each.  For instance, what is currently a
      single instruction without KAISER:

        invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

        invpcid_flush_one(current_kern_pcid, addr);
        invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

        __native_flush_tlb_single(addr);
        write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
        __native_flush_tlb_single(addr);
        write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, we fully disable PCIDs with KAISER when INVPCID is
not available.  This is fixable, but it's an optimization that
we can do later.

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-switch", which is what
Andy Lutomirski's 4.14 PCID code does.  They can also be used as
a "KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscall/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID
becomes much less onerous.

Signed-off-by: Dave Hansen 
Cc: Moritz Lipp 
Cc: Daniel Gruss 
Cc: Michael Schwarz 
Cc: Richard Fellner 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Kees Cook 
Cc: Hugh Dickins 
Cc: x...@kernel.org
---

 b/arch/x86/entry/calling.h                    |   25 +++-
 b/arch/x86/entry/entry_64.S                   |    1
 b/arch/x86/include/asm/cpufeatures.h          |    1
 b/arch/x86/include/asm/pgtable_types.h        |   11 ++
 b/arch/x86/include/asm/tlbflush.h             |  141 +-
 b/arch/x86/include/uapi/asm/processor-flags.h |    3
 b/arch/x86/kvm/x86.c                          |    3
 b/arch/x86/mm/init.c                          |   75 +
 b/arch/x86/mm/tlb.c                           |   66 +++-
 9 files changed, 264 insertions(+), 62 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-pcid arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-pcid  2017-11-08 10:45:38.410681372 -0800
+++ b/arch/x86/entry/calling.h  2017-11-08