INVPCID is considerably slower than INVLPG of a single PTE, but it is currently used to flush PTEs in the user page-table when PTI is used.

Instead, it is possible to defer TLB flushes until after the user page-tables are loaded. Preventing speculation over the TLB flushes should keep the whole thing safe. In some cases, deferring TLB flushes in such a way can result in more full TLB flushes, but arguably this behavior is oftentimes beneficial.

These patches are based on, and were evaluated on top of, the concurrent TLB-flushes v4 patch-set.

I will provide more results later, but it might be easier to look at the time an isolated TLB flush takes. These numbers are from Skylake, and show the number of cycles taken by an madvise(DONTNEED) call that results in local TLB flushes:

n_pages    concurrent    +deferred-pti    change
-------    ----------    -------------    ------
 1         2119          1986             -6.7%
 10        6791          5417             -20%

Please let me know if I missed something that affects security or performance.

[ Yes, I know there is another pending RFC for async TLB flushes, but I think it might be easier to merge this one first ]

RFC v1 -> RFC v2:
  * Wrong patches were sent before

Nadav Amit (3):
  x86/mm/tlb: Change __flush_tlb_one_user interface
  x86/mm/tlb: Defer PTI flushes
  x86/mm/tlb: Avoid deferring PTI flushes on shootdown

 arch/x86/entry/calling.h              |  52 +++++++++++-
 arch/x86/include/asm/paravirt.h       |   5 +-
 arch/x86/include/asm/paravirt_types.h |   3 +-
 arch/x86/include/asm/tlbflush.h       |  55 +++++++-----
 arch/x86/kernel/asm-offsets.c         |   3 +
 arch/x86/kernel/paravirt.c            |   7 +-
 arch/x86/mm/tlb.c                     | 117 ++++++++++++++++++++++++--
 arch/x86/xen/mmu_pv.c                 |  21 +++--
 8 files changed, 218 insertions(+), 45 deletions(-)

--
2.17.1
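
For readers who want a concrete picture of the deferral idea in the cover letter above, the following is a toy user-space model, not the code from these patches. It assumes a single tracked range per CPU and a simple escalation rule (a second, non-adjacent request turns the pending flush into a full flush); the struct, enum and function names are all hypothetical and exist only for illustration. It shows why deferring can trade several selective flushes for one full flush.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12

/* Hypothetical per-CPU bookkeeping; field names are illustrative only. */
struct deferred_flush {
	bool pending;		/* a user page-table flush awaits the next exit */
	bool full;		/* the pending flush was escalated to a full flush */
	unsigned long start;	/* first address of the pending range */
	unsigned long end;	/* one past the last address of the range */
};

enum flush_kind { FLUSH_NONE, FLUSH_RANGE, FLUSH_FULL };

/*
 * Record a flush request instead of flushing the user page-table
 * entries immediately (which is where INVPCID would be needed).
 */
static void defer_user_flush(struct deferred_flush *d,
			     unsigned long start, unsigned long end)
{
	assert(start < end);

	if (!d->pending) {
		d->pending = true;
		d->full = false;
		d->start = start;
		d->end = end;
		return;
	}
	if (d->full)
		return;		/* already as broad as it can get */

	/*
	 * Assumed toy policy: only one range is tracked. Overlapping or
	 * adjacent requests are merged; anything else becomes a full flush.
	 */
	if (start <= d->end && end >= d->start) {
		if (start < d->start)
			d->start = start;
		if (end > d->end)
			d->end = end;
	} else {
		d->full = true;
	}
}

/*
 * Model of the return to user mode, after the user page-tables are
 * loaded. Real code would flush here (per-page INVLPG or a full flush)
 * and then prevent speculation past the flush; this model only reports
 * which kind of flush would be performed and over how many pages.
 */
static enum flush_kind flush_on_return_to_user(struct deferred_flush *d,
					       unsigned long *n_pages)
{
	enum flush_kind kind = FLUSH_NONE;

	*n_pages = 0;
	if (d->pending) {
		if (d->full) {
			kind = FLUSH_FULL;
		} else {
			kind = FLUSH_RANGE;
			*n_pages = (d->end - d->start) >> PAGE_SHIFT;
		}
	}
	d->pending = false;
	d->full = false;
	return kind;
}
```

In this model, two adjacent single-page requests before the next return to user mode collapse into one two-page ranged flush, while two distant requests collapse into a single full flush. The second case is the "more full TLB flushes" trade-off mentioned above, under the stated single-range assumption.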