On Mon, Aug 3, 2020 at 9:37 AM Dave Hansen <dave.han...@intel.com> wrote: > > On 8/3/20 8:12 AM, Andy Lutomirski wrote: > > I could easily be convinced that the PASID fixup is so trivial and so > > obviously free of misfiring in a way that causes an infinite loop that > > this code is fine. But I think we first need to answer the bigger > > question of why we're doing a lazy fixup in the first place. > > There was an (internal to Intel) implementation of this about a year ago > that used smp_call_function_many() to force the MSR state into all > threads of a process. I took one look at it, decided there was a 0% > chance of it actually functioning and recommended we find another way. > While I'm sure it could be done more efficiently, the implementation I > looked at took ~200 lines of code and comments. It started to look too > much like another instance of mm->cpumask for my comfort.
If I were implementing this, I would try making switch_mm_irqs_off() do, roughly: void load_mm_pasid(...) { if (cpu_feature_enabled(X86_FEATURE_ENQCMD)) tsk->xstate[offset] = READ_ONCE(next->context.pasid); } This costs one cache miss, although the cache line in question is about to be read anyway. It might be faster to, instead, do: void load_mm_pasid(...) { u32 pasid = READ_ONCE(next->context.pasid); if (tsk->xstate[offset] != pasid) tsk->state[offset] = pasid; } so we don't dirty the cache line in the common case. The actual generated code ought to be pretty good -- surely the offset of PASID in XSTATE is an entry in an array somewhere that can be found with a single read, right? The READ_ONCE is because this could race against a write to context.pasid, so this code needs to be at the end of the function where it's protected by mm_cpumask. With all this done, the pasid update is just on_each_cpu_mask(mm_cpumask(mm), load_mm_pasid, mm, true). This looks like maybe 20 lines of code. As an added bonus, it lets us free PASIDs early if we ever decide we want to. May I take this opportunity to ask Intel to please put some real thought into future pieces of CPU state? Here's a summary of some things we have: - Normal extended state (FPU, XMM, etc): genuinely per thread and only ever used explicitly. Actually makes sense with XSAVE(C/S). - PKRU: affects CPL0-originated memory accesses, so it needs to be eagerly loaded in the kernel. Does not make sense with XRSTOR(C/S), but it's in there anyway. - CR3: per-mm state. Makes some sense in CR3, but it's tangled up with CR4 in nasty ways. - LDTR: per-mm on Linux and mostly obsolete everyone. In it's own register, so it's not a big deal. - PASID: per-mm state (surely Intel always intended it to be per-mm, since it's for shared _virtual memory_!). But for some reason it's in an MSR (which is slow), and it's cleverly, but not that cleverly, accessible with XSAVES/XRSTORS. Doesn't actually make sense. Also, PASID is lazy-loadable, but the mechanism for telling the kernel that a lazy load is needed got flubbed. - TILE: genuinely per-thread, but it's expensive so it's lazy-loadable. But the lazy-load mechanism reuses #NM, and it's not fully disambiguated from the other use of #NM. So it sort of works, but it's gross. - "KERNEL_GS_BASE", i.e. the shadow GS base. This is logically per-user-thread state, but it's only accessible in MSRs. For some reason this is *not* in XSAVES/XRSTORS state, nor is there any efficient way to access it at all. - Segment registers: can't be properly saved except by hypervisors, and can almost, but not quite, be properly loaded (assuming the state was sane to begin with) by normal kernels. Just don't try to load 1, 2, or 3 into any of them. Sometimes I think that this is all intended to be as confusing as possible and that it's all a ploy to keep context switches slow and complicated. Maybe Intel doesn't actually want to compete with other architectures that can context switch quickly? It would be really nice if we had a clean way to load per-mm state (see my private emails about this), a clean way to load CPL3 register state, and a clean way to load per-user-thread *kernel* register state (e.g. PKRU and probably PKRS). And there should be an exception that says "user code accessed a lazy-loaded resource that isn't loaded, and this is the resource it tried to access".