On Wed, Sep 11, 2019 at 12:49 AM Thomas Hellström (VMware)
<thomas...@shipmail.org> wrote:
>
> Hi, Andy.
>
> On 9/11/19 6:18 AM, Andy Lutomirski wrote:
> >
> > As a for-real example, take a look at arch/x86/entry/vdso/vma.c. The
> > "vvar" VMA contains multiple pages that are backed by different types
> > of memory. There's a page of ordinary kernel memory. Historically
> > there was a page of genuine MMIO memory, but that's gone now. If the
> > system is a SEV guest with pvclock enabled, then there's a page of
> > decrypted memory. They all share a VMA, they're instantiated in
> > .fault, and there is no hackery involved. Look at vvar_fault().
>
> So this is conceptually identical to TTM. The difference is that it uses
> vmf_insert_pfn_prot() instead of vmf_insert_mixed() with the vma hack.
> Had there been a vmf_insert_mixed_prot(), the hack in TTM wouldn't be
> needed. I guess providing a vmf_insert_mixed_prot() is a to-do for me to
> pick up. :)
>
> Having said that, the code you point out is as fragile and suffers from
> the same shortcomings as TTM since
> a) Running things like split_huge_pmd() that takes the vm_page_prot and
> applies it to new PTEs will make things break, (although probably never
> applicable in this case).

Hmm. There are no vvar huge pages, so this is okay. I wonder how hard it
would be to change the huge page splitting code to copy the encryption
and cacheability bits from the source entry instead of getting them from
vm_page_prot, at least in the cases relevant to VM_MIXEDMAP and
VM_PFNMAP.

> b) Running mprotect() on that VMA will only work if sme_me_mask is part
> of _PAGE_CHG_MASK (which is addressed in a reliable way in my recent
> patchset), otherwise, the encryption setting will be overwritten.
>

Indeed. Thanks for the fix!

>
> >> We could probably get away with a WRITE_ONCE() update of the
> >> vm_page_prot before taking the page table lock since
> >>
> >> a) We're locking out all other writers.
> >> b) We can't race with another fault to the same vma since we hold an
> >> address space lock ("buffer object reservation")
> >> c) When we need to update there are no valid page table entries in the
> >> vma, since it only happens directly after mmap(), or after an
> >> unmap_mapping_range() with the same address space lock. When another
> >> reader (for example split_huge_pmd()) sees a valid page table entry, it
> >> also sees the new page protection and things are fine.
> >>
> >> But that would really be a special case. To solve this properly we'd
> >> probably need an additional lock to protect the vm_flags and
> >> vm_page_prot, taken after mmap_sem and i_mmap_lock.
> >>
> > This is all horrible IMO.
>
> I'd prefer to call it fragile and potentially troublesome to maintain.

Fair enough.

>
> That distinction is important because if it ever comes to a choice
> between adding a new lock to protect vm_page_prot (and consequently slow
> down the whole vm system) and using the WRITE_ONCE solution in TTM, we
> should know what the implications are. As it turns out previous choices
> in this area actually seem to have opted for the lockless WRITE_ONCE /
> READ_ONCE / ptl solution. See __split_huge_pmd_locked() and
> vma_set_page_prot().

I think it would be even better if the whole thing could work without
ever writing to vm_page_prot. This would be a requirement for vvar in
the unlikely event that the vvar vma ever supported splittable huge
pages. Fortunately, that seems unlikely :)
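
For reference, a rough sketch of what that vmf_insert_mixed_prot() to-do
might look like. This is hypothetical -- no such helper exists today --
and it assumes the internal __vm_insert_mixed() grows a pgprot argument,
mirroring the existing vmf_insert_pfn() / vmf_insert_pfn_prot() split:

#include <linux/mm.h>
#include <linux/pfn_t.h>

/*
 * Hypothetical sketch, not an existing API: insert a mixed-map pfn with
 * a caller-supplied protection instead of vma->vm_page_prot, so a driver
 * like TTM can supply per-fault cacheability/encryption bits without
 * rewriting vm_page_prot from its fault handler.
 */
vm_fault_t vmf_insert_mixed_prot(struct vm_area_struct *vma,
				 unsigned long addr, pfn_t pfn,
				 pgprot_t pgprot)
{
	/* Assumes __vm_insert_mixed() is extended to take @pgprot. */
	return __vm_insert_mixed(vma, addr, pfn, pgprot, false);
}

A TTM-style fault handler could then call vmf_insert_mixed_prot(vma,
address, pfn, prot) with prot derived from the bo's caching mode, rather
than patching vma->vm_page_prot.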
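
And to make the lockless alternative concrete: the pattern Thomas points
at (vma_set_page_prot() on the writer side, __split_huge_pmd_locked() on
the reader side) publishes the new protection with WRITE_ONCE() while no
valid PTEs exist, and readers that later find a valid entry under the
page table lock use READ_ONCE(). A minimal illustration of the writer
side, assuming the caller holds the address-space lock described in
points a)-c) above (the helper name here is made up for illustration):

/*
 * Illustration only of the WRITE_ONCE publication step; the real code
 * lives in vma_set_page_prot() in mm/mmap.c.  Caller must guarantee
 * that no valid page table entries exist in the VMA, e.g. directly
 * after mmap() or after unmap_mapping_range().
 */
static void publish_page_prot(struct vm_area_struct *vma, pgprot_t prot)
{
	/* Readers (e.g. __split_huge_pmd_locked()) use READ_ONCE() under ptl. */
	WRITE_ONCE(vma->vm_page_prot, prot);
}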