> On May 31, 2019, at 2:14 PM, Andy Lutomirski <l...@kernel.org> wrote:
>
> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <na...@vmware.com> wrote:
>> When we flush userspace mappings, we can defer the TLB flushes, as long
>> the following conditions are met:
>>
>> 1. No tables are freed, since otherwise speculative page walks might
>> cause machine-checks.
>>
>> 2. No one would access userspace before flush takes place. Specifically,
>> NMI handlers and kprobes would avoid accessing userspace.
>
> I think I need to ask the big picture question. When someone calls
> flush_tlb_mm_range() (or the other entry points), if no page tables
> were freed, they want the guarantee that future accesses (initiated
> observably after the flush returns) will not use paging entries that
> were replaced by stores ordered before flush_tlb_mm_range(). We also
> need the guarantee that any effects from any memory access using the
> old paging entries will become globally visible before
> flush_tlb_mm_range().
>
> I'm wondering if receipt of an IPI is enough to guarantee any of this.
> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
> dirty bit? An interrupt entry today is fully serializing by the time
> it finishes, but interrupt entries are epicly slow, and I don't know
> if the APIC waits long enough. Heck, what if IRQs are off on the
> remote CPU? There are a handful of places where we touch user memory
> with IRQs off, and it's (sadly) possible for user code to turn off
> IRQs with iopl().
>
> I *think* that Intel has stated recently that SMT siblings are
> guaranteed to stop speculating when you write to the APIC ICR to poke
> them, but SMT is very special.
>
> My general conclusion is that I think the code needs to document what
> is guaranteed and why.

I think I may have confused you with a bug I introduced (a last-minute
bug from when I was doing some cleanup). The bug does not affect
performance much, but it might have led you to think that I use the
sending of the IPI through the APIC as synchronization.

The idea is not to rely on the write to the ICR as something
serializing. The flow should be as follows:

    CPU0                                    CPU1

    flush_tlb_mm_range()
     __smp_call_function_many()
      [ prepare call_single_data (csd) ]
      [ lock csd ]
      [ send IPI ]
            (*)
      [ wait for csd to be unlocked ]
                                            [ interrupt ]
                                            [ copy csd info to stack ]
                                            [ csd unlock ]
      [ find csd is unlocked ]
      [ continue (**) ]
                                            [ flush TLB ]

At (**) the pages might be recycled, written back to disk, etc. Note
that during (*), CPU0 might do some local TLB flushes, making it very
likely that the csd will already be unlocked by the time CPU0 starts
waiting for it.

As you can see, I don't rely on any special micro-architectural
behavior. The synchronization is done purely in software.

Does it make more sense now?
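
For illustration only, the csd handshake in the flow above can be modeled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel code: a pthread stands in for CPU1, an atomic flag stands in for the IPI, and all names (`struct csd_model`, `ipi_pending`, `remote_cpu`, `run_model`, `flushed_start`, `flushed_end`) are hypothetical. It only shows that the ordering comes from a release/acquire pair on the lock word rather than from the IPI itself.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a call_single_data (csd) entry. */
struct csd_model {
    atomic_int locked;     /* 1 while CPU1 may still read start/end */
    unsigned long start;   /* flush range, standing in for the csd payload */
    unsigned long end;
};

static struct csd_model csd;
static atomic_int ipi_pending;                    /* stands in for the IPI */
static unsigned long flushed_start, flushed_end;  /* what CPU1 "flushed" */

/* CPU1 side of the flow. */
static void *remote_cpu(void *arg)
{
    (void)arg;

    /* [ interrupt ]: wait until the model IPI arrives. */
    while (!atomic_load_explicit(&ipi_pending, memory_order_acquire))
        sched_yield();

    /* [ copy csd info to stack ] */
    unsigned long start = csd.start;
    unsigned long end = csd.end;

    /* [ csd unlock ]: release orders the copies above before the unlock. */
    atomic_store_explicit(&csd.locked, 0, memory_order_release);

    /* [ flush TLB ]: uses only the stack copies, never the csd again. */
    flushed_start = start;
    flushed_end = end;
    return NULL;
}

/* CPU0 side of the flow. Returns 0 on success, -1 if no thread was made. */
static int run_model(void)
{
    pthread_t cpu1;

    atomic_store(&ipi_pending, 0);

    /* [ prepare call_single_data (csd) ] and [ lock csd ] */
    csd.start = 0x1000UL;
    csd.end = 0x3000UL;
    atomic_store_explicit(&csd.locked, 1, memory_order_relaxed);

    if (pthread_create(&cpu1, NULL, remote_cpu, NULL) != 0)
        return -1;

    /* [ send IPI ]: release publishes the csd contents to CPU1. */
    atomic_store_explicit(&ipi_pending, 1, memory_order_release);

    /* (*) local work could be done here. */

    /* [ wait for csd to be unlocked ] */
    while (atomic_load_explicit(&csd.locked, memory_order_acquire))
        sched_yield();

    /* (**) the csd can be reused; CPU1 works from its own copy. */
    csd.start = csd.end = 0;

    pthread_join(cpu1, NULL);
    assert(flushed_start == 0x1000UL && flushed_end == 0x3000UL);
    return 0;
}
```

Calling `run_model()` from a `main()` and building with something like `cc -pthread` should be enough to try it. Note that the model, like the flow above, lets CPU0 continue as soon as the csd is unlocked, which may be before CPU1 has finished its flush.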