Andrea Arcangeli wrote:
> I thought some more about this.
>
> BTW, for completeness: normally (with the exception of vm_destroy)
> the put_page run by rmap_remove won't be the last one, but the page
> can still go into the freelist a moment after put_page runs (leading
> to the same problem). The VM is prevented from freeing the page while
> it's pinned, but the VM can do the final free on the page before
> rmap_remove returns. And w/o mmu notifiers there's no serialization
> that makes the core VM stop on the mmu_lock to wait for the tlb flush
> to run before the VM finally executes the last free of the page. mmu
> notifiers fix this race for regular swapping, as the core VM will
> block on the mmu_lock waiting for the tlb flush (for this to work the
> tlb flush must always happen inside the mmu_lock, unless the order is
> exactly "spte = nonpresent; tlbflush; put_page"). A VM_LOCKED on the
> vmas backing the anonymous memory will fix this for regular swapping
> too (I did something like this in a patch at the end as a band-aid).
>
> But thinking about it more: the moment we intend to allow anybody to
> randomly __munmap__ any part of the guest physical address space, like
> for ballooning while the guest runs (think of an unprivileged user
> owning /dev/kvm and running munmap at will), not even VM_LOCKED
> (ignored by munmap) and not even the mmu notifiers can prevent the
> page from being queued in the kernel freelists immediately after
> rmap_remove returns. This is because rmap_remove may run on a
> different host cpu in between unmap_vmas and invalidate_range_end.
>
> Running the ioctl before munmap won't help to prevent the race as the
> guest can still re-instantiate the sptes with page faults between the
> ioctl and munmap.
>
> However, we have invalidate_range_begin. If we invalidate all sptes in
> invalidate_range_begin and hold off page faults in between
> _begin/_end, then we can fix this with the mmu notifiers.
>   
>   

This can be done by taking mmu_lock in _begin and releasing it in _end, 
unless there's a lock dependency issue.
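
Something along these lines, roughly (just a sketch to show the locking
idea, not a tested patch; the notifier names/signatures and the
kvm_unmap_hva_range() helper are made up here):

/*
 * Rough sketch only: take mmu_lock in _begin so that page faults
 * cannot re-instantiate sptes while the core VM unmaps the range,
 * and only release it in _end.
 */
static void kvm_mmu_notifier_invalidate_range_begin(struct mmu_notifier *mn,
						    struct mm_struct *mm,
						    unsigned long start,
						    unsigned long end)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	spin_lock(&kvm->mmu_lock);
	/* spte = nonpresent, rmap_remove and the tlb flush all run here */
	kvm_unmap_hva_range(kvm, start, end);	/* hypothetical helper */
	/* mmu_lock deliberately left held until _end */
}

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
						  struct mm_struct *mm,
						  unsigned long start,
						  unsigned long end)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	spin_unlock(&kvm->mmu_lock);
}

The lock dependency worry is exactly that mmu_lock is a spinlock, so
nothing the core VM does between _begin and _end may sleep while it's
held; otherwise this would need a sleepable lock or some
generation-count scheme instead.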

> So I think I can allow munmap safely (for unprivileged users too) by
> using _range_begin somehow. For this to work, any relevant tlb flush
> must happen inside the _same_ mmu_lock critical section where
> spte=nonpresent and rmap_remove run (thanks to the mmu_lock, the
> ordering of those operations won't matter anymore, and no queue will
> be needed).
>
>   

I don't understand your conclusion: you prove that mlock() is not good 
enough, then post a patch to do it?

I'll take another shot at fixing rmap_remove(); I don't want to cripple 
swapping for 2.6.25 (though it will only be really dependable in .26).
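
For illustration, the constraint you describe would look roughly like
this (stand-in helpers, and rmap_remove is assumed not to drop the page
reference itself here; this is not the actual fix):

/*
 * Sketch of the ordering constraint quoted above: the remote tlb flush
 * happens in the same mmu_lock critical section as clearing the spte
 * and the rmap, so the final put_page() cannot hand the page back to
 * the freelists while some vcpu still has a stale translation for it.
 */
static void zap_spte(struct kvm *kvm, u64 *sptep)
{
	struct page *page;

	spin_lock(&kvm->mmu_lock);
	page = spte_to_page(*sptep);		/* stand-in helper */
	set_shadow_pte(sptep, shadow_trap_nonpresent_pte);	/* spte = nonpresent */
	rmap_remove(kvm, sptep);		/* assumed not to drop the ref */
	kvm_flush_remote_tlbs(kvm);		/* flush before the last free */
	spin_unlock(&kvm->mmu_lock);
	put_page(page);
}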

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

