Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

2019-08-27 Thread Andy Lutomirski
On Tue, Aug 27, 2019 at 4:55 PM Nadav Amit  wrote:
>
> > On Aug 27, 2019, at 4:13 PM, Andy Lutomirski  wrote:
> >
> > On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit  wrote:
> >> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
> >> flush the user page-tables when PTI is enabled therefore introduces
> >> significant overhead.
> >>
> >> Instead, unless page-tables are released, it is possible to defer the
> >> flushing of the user page-tables until the code returns to userspace.
> >> These page tables are not in use while execution stays in the kernel,
> >> so deferring their flush is not a security hazard.
> >
> > I agree and, in fact, I argued against ever using INVPCID in the
> > original PTI code.
> >
> > However, I don't see what freeing page tables has to do with this.  If
> > the CPU can actually do speculative page walks based on the contents
> > of non-current-PCID TLB entries, then we have major problems, since we
> > don't actively flush the TLB for non-running mms at all.
>
> That was not my concern.
>
> >
> > I suppose that, if we free a page table, then we can't activate the
> > PCID by writing to CR3 before flushing things.  But we can still defer
> > the flush and just set the flush bit when we write to CR3.
>
This was my concern. I can change the behavior so that the code flushes the
whole TLB instead. I just tried not to change the existing behavior too
much.
>

We do this anyway if we don't have INVPCID_SINGLE, so it doesn't seem
so bad to also do it if there's a freed page table.
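
A minimal sketch of the scheme the thread converges on (the names here are
hypothetical, not the patch's actual code): single-PTE flushes of the user
page-tables are queued per-CPU instead of issued via INVPCID, and a freed
page table (or queue overflow) promotes the pending work to a full flush:

/*
 * Hypothetical sketch, not the actual patch: queue user-PTE flushes
 * per-CPU and replay them on the way back to userspace.  Freeing a
 * page table promotes the pending work to a full user-TLB flush.
 */
#include <linux/kernel.h>
#include <linux/percpu.h>

#define DEFERRED_MAX	16	/* arbitrary queue size for the sketch */

struct deferred_user_flush {
	unsigned int nr;	/* pending entries; UINT_MAX = flush everything */
	unsigned long addr[DEFERRED_MAX];
};

static DEFINE_PER_CPU(struct deferred_user_flush, deferred_user_flush);

/* Called with interrupts disabled, as flush_tlb_func() would be. */
static void defer_user_flush(unsigned long addr, bool freed_tables)
{
	struct deferred_user_flush *f = this_cpu_ptr(&deferred_user_flush);

	if (freed_tables || f->nr >= DEFERRED_MAX) {
		f->nr = UINT_MAX;	/* promote to a full user-TLB flush */
		return;
	}
	f->addr[f->nr++] = addr;
}

On the way back to userspace, once the user page-tables are active, the
queued addresses can be replayed with plain INVLPG (or collapsed into one
full flush if the entry was promoted), avoiding INVPCID entirely.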


Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

2019-08-27 Thread Nadav Amit
> On Aug 27, 2019, at 4:13 PM, Andy Lutomirski  wrote:
> 
> On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit  wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
>> flush the user page-tables when PTI is enabled therefore introduces
>> significant overhead.
>> 
>> Instead, unless page-tables are released, it is possible to defer the
>> flushing of the user page-tables until the code returns to userspace.
>> These page tables are not in use while execution stays in the kernel,
>> so deferring their flush is not a security hazard.
> 
> I agree and, in fact, I argued against ever using INVPCID in the
> original PTI code.
> 
> However, I don't see what freeing page tables has to do with this.  If
> the CPU can actually do speculative page walks based on the contents
> of non-current-PCID TLB entries, then we have major problems, since we
> don't actively flush the TLB for non-running mms at all.

That was not my concern.

> 
> I suppose that, if we free a page table, then we can't activate the
> PCID by writing to CR3 before flushing things.  But we can still defer
> the flush and just set the flush bit when we write to CR3.

This was my concern. I can change the behavior so that the code flushes the
whole TLB instead. I just tried not to change the existing behavior too
much.



Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

2019-08-27 Thread Andy Lutomirski
On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit  wrote:
>
> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
> flush the user page-tables when PTI is enabled therefore introduces
> significant overhead.
>
> Instead, unless page-tables are released, it is possible to defer the
> flushing of the user page-tables until the code returns to userspace.
> These page tables are not in use while execution stays in the kernel,
> so deferring their flush is not a security hazard.

I agree and, in fact, I argued against ever using INVPCID in the
original PTI code.

However, I don't see what freeing page tables has to do with this.  If
the CPU can actually do speculative page walks based on the contents
of non-current-PCID TLB entries, then we have major problems, since we
don't actively flush the TLB for non-running mms at all.

I suppose that, if we free a page table, then we can't activate the
PCID by writing to CR3 before flushing things.  But we can still defer
the flush and just set the flush bit when we write to CR3.

--Andy
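
The "flush bit" here is the inverse of CR3's NOFLUSH bit (bit 63): with
PCIDs enabled, setting bit 63 on a CR3 write preserves that PCID's cached
translations. A simplified sketch in the spirit of the kernel's build_cr3()
and build_cr3_noflush() helpers (the PCID handling is abbreviated, so treat
the details as an assumption):

#include <linux/types.h>

#define X86_CR3_PCID_NOFLUSH	(1ULL << 63)	/* keep TLB entries for this PCID */

/* Simplified: the real helpers derive the PCID from a per-CPU ASID. */
static u64 build_cr3(u64 pgd_pa, u16 pcid)
{
	return pgd_pa | pcid;
}

/* Reuse cached translations; only safe if no page tables were freed. */
static u64 build_cr3_noflush(u64 pgd_pa, u16 pcid)
{
	return build_cr3(pgd_pa, pcid) | X86_CR3_PCID_NOFLUSH;
}

Deferring then amounts to writing CR3 without the NOFLUSH bit whenever a
flush is pending, rather than issuing INVPCID beforehand.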


Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

2019-08-27 Thread Nadav Amit
> On Aug 27, 2019, at 11:28 AM, Dave Hansen  wrote:
> 
> On 8/23/19 3:52 PM, Nadav Amit wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
>> flush the user page-tables when PTI is enabled therefore introduces
>> significant overhead.
> 
> I'm not sure this is worth all the churn, especially in the entry code.
> For large flushes (> tlb_single_page_flush_ceiling), we don't do
> INVPCIDs in the first place.

It is possible to jump from flush_tlb_func() into the trampoline page
instead of flushing the TLB in the entry code. However, that induces higher
overhead (switching CR3s), so it is only worthwhile if multiple TLB entries
are flushed at once. It also prevents us from exploiting opportunities to
promote individual entry flushes into a full-TLB flush when multiple flushes
are issued, or when a context switch takes place before returning to
userspace.

There are workloads that flush multiple (but not too many) TLB entries on
every syscall, for instance ones issuing msync() or running the Apache web
server. So I am not sure that tlb_single_page_flush_ceiling saves the day.
Besides, you may want to recalibrate (lower) tlb_single_page_flush_ceiling
when PTI is used.
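
As an illustration of the recalibration point (halving the ceiling below is
an assumption made for the sketch, not a measured value):

#include <asm/cpufeature.h>

/*
 * The kernel's knob; it is static in arch/x86/mm/tlb.c and declared
 * here only so the sketch is self-contained.
 */
extern unsigned long tlb_single_page_flush_ceiling;

/*
 * With PTI, every single-page flush pays for work on both the kernel
 * and the user PCID, so the break-even point against a full flush drops.
 */
static unsigned long effective_flush_ceiling(void)
{
	unsigned long ceiling = tlb_single_page_flush_ceiling;

	if (static_cpu_has(X86_FEATURE_PTI))
		ceiling /= 2;	/* illustrative, not tuned */
	return ceiling;
}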

> I'd really want to understand what the heck is going on that makes
> INVPCID so slow, first.

INVPCID-single is slow (even more than the 133 cycles slower than INVLPG
that you mentioned; I don't have the numbers in front of me). I thought this
was a known fact, although, obviously, it does not make much sense.
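
For reference, INVPCID takes its operand as a 16-byte in-memory descriptor,
one structural difference from INVLPG (whether that alone explains the cost
is unclear). A sketch along the lines of the kernel's invpcid_flush_one()
helper:

#include <linux/types.h>

#define INVPCID_TYPE_INDIV_ADDR	0	/* invalidate one address in one PCID */

/* Flush the translation for @addr under @pcid. */
static inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
{
	/* INVPCID reads a 128-bit descriptor: { PCID, linear address }. */
	struct { u64 d[2]; } desc = { { pcid, addr } };

	asm volatile("invpcid %[desc], %[type]"
		     : /* no outputs */
		     : [desc] "m" (desc),
		       [type] "r" ((unsigned long)INVPCID_TYPE_INDIV_ADDR)
		     : "memory");
}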



Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

2019-08-27 Thread Dave Hansen
On 8/23/19 3:52 PM, Nadav Amit wrote:
> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
> flush the user page-tables when PTI is enabled therefore introduces
> significant overhead.

I'm not sure this is worth all the churn, especially in the entry code.
For large flushes (> tlb_single_page_flush_ceiling), we don't do
INVPCIDs in the first place.

I'd really want to understand what the heck is going on that makes
INVPCID so slow, first.
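
For context, the decision being referenced looks roughly like the
range-flush logic in flush_tlb_func_common() (a simplified sketch; the
helper names approximate the kernel's of that era):

#include <asm/tlbflush.h>

extern unsigned long tlb_single_page_flush_ceiling;	/* ~33 pages by default */

/* Flush [start, end): per-page INVLPG below the ceiling, full flush above. */
static void flush_range_sketch(unsigned long start, unsigned long end)
{
	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
	unsigned long addr;

	if (nr_pages > tlb_single_page_flush_ceiling) {
		local_flush_tlb();	/* whole TLB; no INVPCIDs issued */
		return;
	}

	for (addr = start; addr < end; addr += PAGE_SIZE)
		__flush_tlb_one_user(addr);	/* INVLPG, plus user-PCID work with PTI */
}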