On Fri, Sep 8, 2017 at 6:39 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>
>
>> On Sep 8, 2017, at 6:05 PM, Linus Torvalds <torva...@linux-foundation.org>
>> wrote:
>>
>>> On Fri, Sep 8, 2017 at 5:00 PM, Andy Lutomirski <l...@kernel.org> wrote:
>>>
>>> I'm not convinced.  The SDM says (Vol 3, 11.3, under WC):
>>>
>>> If the WC buffer is partially filled, the writes may be delayed until
>>> the next occurrence of a serializing event; such as, an SFENCE or
>>> MFENCE instruction, CPUID execution, a read or write to uncached
>>> memory, an interrupt occurrence, or a LOCK instruction execution.
>>>
>>> Thanks, Intel, for defining "serializing event" differently here than
>>> anywhere else in the whole manual.
>>
>> Yeah, it's really badly defined. Ok, maybe a locked instruction does
>> actually wait for it.. It should be invisible to anything, regardless.
>>
>>> 1. The kernel wants to reclaim a page of normal memory, so it unmaps
>>> it and flushes.  Another CPU has an entry for that page in its WC
>>> buffer.  I don't think we care whether the flush causes the WC write
>>> to really hit RAM because it's unobservable -- we just need to make
>>> sure it is ordered, as seen by software, before the flush operation
>>> completes.  From the quote above, I think we're okay here.
>>
>> Agreed.
>>
>>> 2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
>>> It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
>>> are really done writing to it.  Here I'm less convinced.  The SDM
>>> quote certainly suggests to me that we have a promise that the WC
>>> write has *started* before flush_tlb_mm_range returns, but I'm not
>>> sure I believe that it's guaranteed to have retired.
>>
>> If others have writable TLB entries, what keeps them from just
>> continuing to write for a long time afterwards?
>
> Whoever unmaps the resource by kicking out their drm fd?  I admit I'm just
> trying to think of the worst case.
>
>>
>>> I'd prefer to leave it as is except on the buggy AMD CPUs, though,
>>> since the current code is nice and fast.
>>
>> So is there a patch to detect the 383 erratum and serialize for those?
>> I may have missed that part.
>>
>
> The patch is in my head.  It's imaginarily attached to this email.

After contemplating the info from Boris and Markus, I think I need to
add a #3 to the list of reasons my patch could be problematic:

3. If a CPU frees a page table (or PUD or PMD or whatever), that CPU
will flush before the memory goes back to the system.  If that flush
is deferred on a different CPU that has the pointer to the freed table
cached in its TLB, then that CPU can speculatively load complete
garbage into its TLB.  I don't think this should be observable, but I
can easily imagine it triggering errata or weird ill-advised machine
checks.

Anyway, if I need to change the behavior back, I can do it in one of
two ways.  I can just switch to init_mm instead of going lazy, which is
expensive, but not *that* expensive on CPUs with PCID.  Or I can do it
the way we used to do it and send the flush IPI to lazy CPUs.  The
latter costs nothing until a flush actually happens, but each flush
then takes a much bigger performance hit.

--Andy
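
For illustration only, the trade-off between the two fallback options
can be sketched as a toy user-space model.  This is not kernel code:
every name below (toy_state, toy_policy, toy_stats, toy_go_idle,
toy_flush) is hypothetical, and the model assumes the only costs that
matter are one mm switch per CPU that goes idle and one IPI per remote
CPU that still has the mm loaded when a flush happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not kernel APIs. */
enum toy_state {
	TOY_ACTIVE,	/* running a task that uses the mm */
	TOY_LAZY,	/* idle or kernel thread, old mm still loaded */
	TOY_INIT_MM,	/* switched to init_mm, cannot walk the old tables */
};

enum toy_policy {
	TOY_SWITCH_TO_INIT_MM,	/* option 1: never stay lazy on the old mm */
	TOY_IPI_LAZY_CPUS,	/* option 2: stay lazy, IPI lazy CPUs on flush */
};

struct toy_stats {
	int mm_switches;	/* cost paid whenever a CPU goes idle */
	int flush_ipis;		/* cost paid only when a flush happens */
};

/* A CPU stops running the mm's task and picks a state per the policy. */
static void toy_go_idle(enum toy_state *cpu, enum toy_policy policy,
			struct toy_stats *stats)
{
	assert(*cpu == TOY_ACTIVE);
	if (policy == TOY_SWITCH_TO_INIT_MM) {
		*cpu = TOY_INIT_MM;
		stats->mm_switches++;
	} else {
		*cpu = TOY_LAZY;
	}
}

/*
 * CPU `self` flushes the mm.  Every other CPU that still has the mm
 * loaded (active or lazy) is counted as receiving one flush IPI.
 */
static void toy_flush(const enum toy_state cpus[], size_t ncpus,
		      size_t self, struct toy_stats *stats)
{
	assert(self < ncpus);
	for (size_t i = 0; i < ncpus; i++) {
		if (i != self && cpus[i] != TOY_INIT_MM)
			stats->flush_ipis++;
	}
}
```

In this model, with four CPUs of which two go idle before CPU 0
flushes, option 1 pays two mm switches and one IPI, while option 2
pays no switches and three IPIs.  The model says nothing about the
relative cost of a switch versus an IPI, which is the part that
actually decides between the two.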