On Fri, Aug 01, 2008 at 02:47:39PM -0500, Anthony Liguori wrote:
> Is there a way to detect MMU notifiers from userspace? I don't think it's
> currently safe to madvise unconditionally.

There is no way to detect mmu notifiers from userspace (well, strictly speaking you could check /proc/kallsyms), but the point is that without mmu notifiers madvise won't be enough: the memory may not be freed without the other ioctl proposed by Marcelo that also flushes sptes. Especially with the plan to pin pages during memslot allocation (the next step in the kvm-userland compatibility effort when built against an old kernel), madvise will be a noop because all memory in the memslot will remain pinned.

The safety issue with madvise was later found to be the same issue that affects all rmap_remove invocations (not just the ones done by Marcelo's ioctl). Whenever the put_page in rmap_remove happens to be the last free of the page, we have a tlb race: we flush the other vcpus' tlbs after the page was already freed by rmap_remove. This is usually only an issue for an smp guest on an smp host.

So in short, madvise run on a memslot wasn't "less" safe, but the ioctl proposed to remove sptes so that memory could be released wasn't capable of fixing this race for madvise either. mmu notifiers, on the other hand, allow madvise to free all memory reliably and also fix this race as a whole (not just for madvise but for swapping and all other operations too). The compatibility plan for host kernels without mmu notifiers is going to prevent swapping entirely, and since it will then never allow rmap_remove to do the last free of a page, it will also fix the tlb troubles.

So if you currently just madvise, right now it won't be less safe than without madvise (it's not safe regardless), but without mmu notifiers madvise won't remove sptes, so it won't be reliable. This is why Marcelo once proposed the ioctl to call after/before madvise, but that ioctl still had the same issues that every other rmap_remove had without mmu notifiers.
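For what it's worth, the /proc/kallsyms check mentioned above could look roughly like the sketch below. Treat it as an illustration only: looking for the mmu_notifier_register symbol is my assumption of what to search for, and the helper names (kallsyms_has_symbol, host_has_mmu_notifiers) are made up for this example. Even when the symbol is found, it only suggests the host kernel was built with mmu notifiers; it says nothing about whether the kvm module in use actually registers one.

```python
# Sketch only: the symbol name is an assumption, not a documented KVM interface.
ASSUMED_SYMBOL = "mmu_notifier_register"


def kallsyms_has_symbol(kallsyms_text, symbol=ASSUMED_SYMBOL):
    """Return True if `symbol` is listed in /proc/kallsyms-style text."""
    for line in kallsyms_text.splitlines():
        # Each line looks like: "<address> <type> <name> [module]"
        fields = line.split()
        if len(fields) >= 3 and fields[2] == symbol:
            return True
    return False


def host_has_mmu_notifiers(path="/proc/kallsyms"):
    """Best-effort check; an unreadable file is treated as 'not found'."""
    try:
        with open(path) as f:
            return kallsyms_has_symbol(f.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("mmu notifier symbol found:", host_has_mmu_notifiers())
```

A "not found" answer is the only one that means much here: it tells you madvise can't be reliable on that host, for the reasons given above.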
With the next plan of pinning all pages you can still madvise just fine, and it will be safe too (like every other rmap_remove will be safe), but it will be a guaranteed noop because the memslot will be entirely pinned.

So madvise on old kernels is never less safe, but if we want to provide reliable ballooning to host kernels without mmu notifiers under the next compatibility plan that Avi suggested, we should "trim" the memslot first (to unpin the pages) instead of calling madvise, and then you can munmap the range instead of madvise. Or you can add a new call that only drops the sptes in the range and _later_ unpins the pages (that's similar to Marcelo's ioctl, but having the unpin happen later should fix its previous safety issues). However, if the spte removal is done without touching the memslot, that will complicate the memslot semantics quite a bit, because the memory mapped by a memslot won't be guaranteed to be entirely pinned anymore and we'll have to track separately the ballooned ranges that aren't pinned.

So I guess teaching set_memory_region to trim a memslot (currently it can't) sounds like one approach that could be used to allow ballooning on kernels without mmu notifiers, if we go with Avi's backwards compatibility suggestion of memslot pinning. Then it's up to you whether to munmap or madvise; it'll be the same at that point.

It's a bit complicated, so if something isn't clear about what the issues are without mmu notifiers, don't hesitate to ask for more details.