Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2
On Fri, Nov 21, 2014 at 11:05:45PM +, Peter Maydell wrote:

> If it's mapped and readable-but-not-writable then it should still
> fault on write accesses, though? These are cases we currently get
> SEGV for, anyway.

Yes, then it'll work just fine.

> Ah, I guess we have a terminology difference. I was considering
> "page fault" to mean (roughly) "anything that causes the CPU to
> take an exception on an attempted load/store" and expected that
> userfaultfd would notify userspace of any of those. (Well, not
> alignment faults, maybe, but I'm definitely surprised that
> access permission issues don't get reported the same way as
> page-completely-missing issues. In other words I was expecting
> that this was "everything previously reported via SIGSEGV or
> SIGBUS now comes via userfaultfd".)

Just not PROT_NONE SIGSEGV faults, i.e. PROT_NONE would still SIGSEGV currently, because it's not a not-present fault (the page is present, just not mapped readable) and it's not a wrprotect fault either (it is trapped via the vma vm_flags permission bits before the actual page fault handler is invoked). userfaultfd hooks into the common code of the page fault handler.

> > Temporarily removing/moving the page with remap_anon_pages shall be
> > much better than using PROT_NONE for this (or alternative syscall name
> > to differentiate it further from remap_file_pages, or equivalent
> > userfaultfd command if we decide to hide the pte/pmd mangling as
> > userfaultfd commands instead of adding new standalone syscalls).
>
> We don't use PROT_NONE for the linux-user situation, we just use
> mprotect() to remove the PAGE_WRITE permission so it's still
> readable.

As said above, it'll work just fine then.
> I suspect actually linux-user would be better off implementing
> something like "if this is a page which we've mapped read-only
> because we translated code out of it, then go ahead and remap
> it r/w and throw away the translation and retry the access,
> otherwise report SEGV to the guest", because taking SEGVs shouldn't
> be a fast path in the guest binary. That would let us work without
> architecture-specific junk and without requiring new kernel
> features either. So you can ignore this whole tangent thread :-)

You might still get a significant boost if you use userfaultfd. For postcopy live snapshot and postcopy live migration the main benefit is the removal of mprotect as a whole, and the performance improvement is a secondary benefit. You can cap the max size of the JIT translation cache (and in turn the maximal number of vmas generated by the mprotects), but we can't cap the address space fragmentation. The faults may invoke way too many mprotects and we may fragment the vmas to the point where we get -ENOMEM.

Marking a page wrprotected however is always tricky, no matter if it's fork doing it or KSM or something else. KSM just skips pages that could be under gup pins and retries them at the next pass. Fork simply won't work right currently, and it needs MADV_DONTFORK to avoid the wrprotection entirely where you may use O_DIRECT mixed with threads and fork. For this new vma-less syscall (or ufd command) the best we could do is to print a warning if any page marked wrprotected could be under GUP pin (the warning could generate false positives as a result of speculative cache lookups that run lockless get_page_unless_zero() on any pfn).

To avoid races in the postcopy live snapshot feature, I think it should be enough to wait for all in-flight I/O to complete before marking the guest address space readonly (the KVM gup() side can be taken care of by marking the shadow MMU readonly, which is a must anyway; the mmu notifier will take care of that part).
The postcopy live snapshot will have to copy the page, so it's effectively a COW in userland, and in turn it must ensure there's no O_DIRECT in flight still writing to the page (even though we marked it readonly) while the wrprotection syscall runs. For your case there's probably no gup() in the equation unless you use O_DIRECT (I don't think you use the shadow MMU in the kernel in linux-user), so you don't have to worry about those races and it's just simpler.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2
On 21 November 2014 20:14, Andrea Arcangeli wrote:
> Hi Peter,
>
> On Wed, Oct 29, 2014 at 05:56:59PM +, Peter Maydell wrote:
>> On 29 October 2014 17:46, Andrea Arcangeli wrote:
>> > After some chat during the KVMForum I've been already thinking it
>> > could be beneficial for some usage to give userland the information
>> > about the fault being read or write
>>
>> ...I wonder if that would let us replace the current nasty
>> mess we use in linux-user to detect read vs write faults
>> (which uses a bunch of architecture-specific hacks including
>> in some cases "look at the insn that triggered this SEGV and
>> decode it to see if it was a load or a store"; see the
>> various cpu_signal_handler() implementations in user-exec.c).
>
> There's currently no plan to deliver to userland read access
> notifications of a present page, simply because the task of the
> userfaultfd is to handle the page fault in userland, but if the page
> is mapped and readable it won't fault in the first place :). I just
> mean it's not like gdb's read watchpoints.

If it's mapped and readable-but-not-writable then it should still fault on write accesses, though? These are cases we currently get SEGV for, anyway.

> Even if the region were set to PROT_NONE it would still SEGV
> without triggering a userfault (after all pte_present would still
> be true because the page is still mapped despite not being readable,
> so in any case it wouldn't be considered a not-present page fault).

Ah, I guess we have a terminology difference. I was considering "page fault" to mean (roughly) "anything that causes the CPU to take an exception on an attempted load/store" and expected that userfaultfd would notify userspace of any of those. (Well, not alignment faults, maybe, but I'm definitely surprised that access permission issues don't get reported the same way as page-completely-missing issues.
In other words I was expecting that this was "everything previously reported via SIGSEGV or SIGBUS now comes via userfaultfd".)

> Temporarily removing/moving the page with remap_anon_pages shall be
> much better than using PROT_NONE for this (or alternative syscall name
> to differentiate it further from remap_file_pages, or equivalent
> userfaultfd command if we decide to hide the pte/pmd mangling as
> userfaultfd commands instead of adding new standalone syscalls).

We don't use PROT_NONE for the linux-user situation, we just use mprotect() to remove the PAGE_WRITE permission so it's still readable.

I suspect linux-user would actually be better off implementing something like "if this is a page which we've mapped read-only because we translated code out of it, then go ahead and remap it r/w, throw away the translation and retry the access, otherwise report SEGV to the guest", because taking SEGVs shouldn't be a fast path in the guest binary. That would let us work without architecture-specific junk and without requiring new kernel features either. So you can ignore this whole tangent thread :-)

thanks
-- PMM
Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2
Hi Peter,

On Wed, Oct 29, 2014 at 05:56:59PM +, Peter Maydell wrote:
> On 29 October 2014 17:46, Andrea Arcangeli wrote:
> > After some chat during the KVMForum I've been already thinking it
> > could be beneficial for some usage to give userland the information
> > about the fault being read or write
>
> ...I wonder if that would let us replace the current nasty
> mess we use in linux-user to detect read vs write faults
> (which uses a bunch of architecture-specific hacks including
> in some cases "look at the insn that triggered this SEGV and
> decode it to see if it was a load or a store"; see the
> various cpu_signal_handler() implementations in user-exec.c).

There's currently no plan to deliver to userland read access notifications of a present page, simply because the task of the userfaultfd is to handle the page fault in userland, but if the page is mapped and readable it won't fault in the first place :). I just mean it's not like gdb's read watchpoints.

Even if the region were set to PROT_NONE it would still SEGV without triggering a userfault (after all pte_present would still be true because the page is still mapped despite not being readable, so in any case it wouldn't be considered a not-present page fault). If you temporarily remove the page (which requires an unavoidable TLB flush, also considering that if the page was previously mapped the TLB could still resolve it for reads) it would work, because the plan is to provide read/write fault information through the userfaultfd. In theory it would be possible to deliver PROT_NONE faults through userfault too, but it doesn't make much sense because PROT_NONE still requires a TLB flush, in addition to the vma modifications/splitting/rbtree-rebalance and taking the mmap_sem for writing as well.
Temporarily removing/moving the page with remap_anon_pages shall be much better than using PROT_NONE for this (or alternative syscall name to differentiate it further from remap_file_pages, or equivalent userfaultfd command if we decide to hide the pte/pmd mangling as userfaultfd commands instead of adding new standalone syscalls). It would have the only constraint that you must mark the region MADV_DONTFORK if you intend linux-user to ever fork or it won't work reliably (that constraint is to eliminate the need of additional rmap complexity, precisely so that it doesn't turn into something more intrusive like remap_file_pages). I assume that would be a fine constraint for linux-user.

Thanks,
Andrea
Re: [PATCH 00/17] RFC: userfault v2
On 2014/11/21 1:38, Andrea Arcangeli wrote:
> Hi,
>
> On Thu, Nov 20, 2014 at 10:54:29AM +0800, zhanghailiang wrote:
>> Yes, you are right. This is what i really want, bypass all non-present
>> faults and only track strict wrprotect faults. ;)
>>
>> So, do you plan to support that in the userfault API?
>
> Yes I think it's a good idea to support wrprotect/COW faults too.

Great! Then i can expect your patches. ;)

> I just wanted to understand if there was any other reason why you
> needed only wrprotect faults, because the non-present faults didn't
> look like a big performance concern if they triggered in addition to
> wrprotect faults, but it's certainly ok to optimize them away so it's
> fully optimal.

Er, you have got the answer: nothing special, it's only for optimality.

> All it takes to differentiate the behavior should be one more bit
> during registration so you can select non-present faults, wrprotect
> faults or both. postcopy live migration would select only non-present
> faults, postcopy live snapshot would select only wrprotect faults,
> anything like distributed shared memory supporting shared readonly
> access and exclusive write access would select both flags.

It is really flexible in this way.

> I just sent an (unfortunately) longish but way more detailed email
> about live snapshotting with userfaultfd but I just wanted to give a
> shorter answer here too :).

Thanks for your explanation, and your patience. It is really useful; now i know more details about why the 'fork & dump live snapshot' scenario is not acceptable. Thanks :)

> Thanks,
> Andrea
Re: [PATCH 00/17] RFC: userfault v2
Hi,

On Thu, Nov 20, 2014 at 10:54:29AM +0800, zhanghailiang wrote:
> Yes, you are right. This is what i really want, bypass all non-present faults
> and only track strict wrprotect faults. ;)
>
> So, do you plan to support that in the userfault API?

Yes I think it's a good idea to support wrprotect/COW faults too. I just wanted to understand if there was any other reason why you needed only wrprotect faults, because the non-present faults didn't look like a big performance concern if they triggered in addition to wrprotect faults, but it's certainly ok to optimize them away so it's fully optimal.

All it takes to differentiate the behavior should be one more bit during registration so you can select non-present faults, wrprotect faults or both. Postcopy live migration would select only non-present faults, postcopy live snapshot would select only wrprotect faults, and anything like distributed shared memory supporting shared readonly access and exclusive write access would select both flags.

I just sent an (unfortunately) longish but way more detailed email about live snapshotting with userfaultfd but I just wanted to give a shorter answer here too :).

Thanks,
Andrea
Re: [PATCH 00/17] RFC: userfault v2
Hi,

On Fri, Oct 31, 2014 at 12:39:32PM -0700, Peter Feiner wrote:
> On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
> > Agreed, but for doing live memory snapshot (the VM is running when doing
> > the snapshot), we have to do this (block the write action), because we
> > have to save the page before it is dirtied by the writing action. This
> > is the difference, compared to pre-copy migration.
>
> Ah ha, I understand the difference now. I suppose that you have considered
> doing a traditional pre-copy migration (that is, passes over memory saving
> dirty pages, followed by a pause and a final dump of remaining dirty pages)
> to a file. Your approach has the advantage of having the VM pause time
> bounded by the time it takes to handle the userfault and do the write, as
> opposed to pre-copy migration which has a pause time bounded by the time it
> takes to do the final dump of dirty pages, which, in the worst case, is the
> time it takes to dump all of the guest memory!

It sounds like a really similar issue as live migration: one can implement a precopy live snapshot, a precopy+postcopy live snapshot, or a pure postcopy live snapshot. The decision on the amount of precopy done before engaging postcopy (zero passes, 1 pass, or more passes) would have similar tradeoffs too, except instead of having to re-transmit the re-dirtied pages over the wire, it would need to overwrite them to disk.

The more precopy passes, the longer it takes for the live snapshotting process to finish and the more I/O there will be (for live migration it'd be network bandwidth usage instead of amount of I/O), but the shorter the postcopy runtime will be (and the shorter the postcopy runtime is, the fewer userfaults will end up triggering on writes, in turn reducing the slowdown and the artificial fault latency introduced to the guest runtime).
But the more precopy passes there are, the more overwriting will happen during the "longer" precopy stage and the more overall load there will be for the host (the otherwise idle part of the host). For the postcopy live snapshot the wrprotect faults are quite equivalent to the not-present faults of the postcopy live migration logic.

> You could use the old fork & dump trick. Given that the guest's memory is
> backed by private VMA (as of a year ago when I last looked, is always the
> case for QEMU), you can have the kernel do the write protection for you.
> Essentially, you fork Qemu and, in the child process, dump the guest memory
> then exit. If the parent (including the guest) writes to guest memory, then
> it will fault and the kernel will copy the page.
>
> The fork & dump approach will give you the best performance w.r.t. guest
> pause times (i.e., just pausing for the COW fault handler), but it does have
> the distinct disadvantage of potentially using 2x the guest memory (i.e., if
> the parent process races ahead and writes to all of the pages before you
> finish the dump). To mitigate memory copying, you could madvise MADV_DONTNEED
> the child memory as you copy it.

This is a very good point. fork must be evaluated first because it literally already provides you a readonly memory snapshot of the guest memory.

The memory cons mentioned above could lead to -ENOMEM if too many guests run live snapshots at the same time in the same host, unless overcommit_memory is set to 1 (0 by default). Even then, if too many live snapshots are running in parallel you could hit the OOM killer if there are just a bit too many faults at the same time, or you could hit heavy swapping, which isn't ideal either. In fact the -ENOMEM avoidance (with qemu failing) is one of the two critical reasons why qemu always sets the guest memory as MADV_DONTFORK. But that's not the only reason.
To use the fork() trick you'd need to undo the MADV_DONTFORK first, but that would open another problem: there's a race condition between fork(), O_DIRECT and a <4k hardblocksize of virtio-blk. If there's any read() syscall with O_DIRECT with len=512 while fork() is running (think if the aio runs in parallel with the live snapshot thread that forks the child to dump the snapshot), and if the guest writes with the CPU to any 512-byte fragment of the same page that is the destination buffer of the read(len=512) (on two different 512-byte areas of the same guest page), the O_DIRECT DMA write will get lost.

So to use fork we'd need to fix this longstanding race (I tried, but in the end we declared it a userland issue because it's not exploitable to bypass permissions or corrupt kernel or unrelated memory). Or you'd need to add locking between the dataplane/aio threads and the live snapshot thread to ensure no direct-io I/O is ever in-flight while fork runs. Such locking however would only help if it's qemu TCG; if it's KVM it's not even enough to stop O_DIRECT reads. KVM would use gup(write=1) from the async-pf all the time... and then the shadow pagetables would go out of sync (it won't destabilize the host of course, but the guest memory would be corrupted).
Re: [PATCH 00/17] RFC: userfault v2
On 2014/11/20 2:49, Andrea Arcangeli wrote: Hi Zhang, On Fri, Oct 31, 2014 at 09:26:09AM +0800, zhanghailiang wrote: On 2014/10/30 20:49, Dr. David Alan Gilbert wrote: * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: Hi Zhanghailiang, On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: Hi Andrea, Thanks for your hard work on userfault;) This is really a useful API. I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. I think this will help supporting vhost-scsi,ivshmem for migration, we can trace dirty pages in userspace. Actually, i'm trying to realize live memory snapshot based on pre-copy and userfault, but reading memory from the migration thread will also trigger userfault. It will be easy to implement live memory snapshot, if we support configuring userfault for writing memory only. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthwhile feature to have here. After some chat during the KVMForum I've been already thinking it could be beneficial for some usage to give userland the information about the fault being read or write, combined with the ability of mapping pages wrprotected to mcopy_atomic (that would work without false positives only with MADV_DONTFORK also set, but it's already set in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be checked also in the wrprotect faults, not just in the not present faults, but it's not a massive change. Returning the read/write information is also not a massive change. This will then pay off mostly if there's also a way to remove the memory atomically (kind of remap_anon_pages). Would that be enough?
I mean are you still ok if non present read faults trap too (you'd be notified it's a read) and you get notification for both wrprotect and non present faults?

Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe i didn't describe clearly. What i really need for live memory snapshot is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing the write action*.

My initial solution scheme for live memory snapshot is:
(1) pause VM
(2) using userfaultfd to mark all memory of the VM wrprotect (readonly)
(3) save device state to snapshot file
(4) resume VM
(5) snapshot thread begins to save pages of memory to the snapshot file
(6) VM is going to run, and it is OK for the VM or another thread to read ram (no fault trap), but if the VM tries to write a page (dirty the page), there will be a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd, it will copy the content of the page to some buffer, and then remove the page's wrprotect limit (still using the userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which now can be written.
(9) snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all of the VM's memory is saved to the snapshot file.

Hmm, I can see the same process being useful for fault-tolerance schemes like COLO, it needs a memory state snapshot.

So, what i need from userfault is supporting only the wrprotect fault. i don't want to get notifications for non-present reading faults, it will influence the VM's performance and the efficiency of doing the snapshot.

What pages would be non-present at this point - just balloon?

Er, sorry, it should be 'non-present page faults';) Could you elaborate?

The balloon pages or not yet allocated pages in the guest, if they fault too (in addition to the wrprotect faults) it doesn't sound like a big deal, as it's not so common (balloon especially shouldn't happen except during balloon deflating during the live snapshotting).
We could bypass non-present faults though, and only track strict wrprotect faults.

Yes, you are right. This is what i really want, bypass all non-present faults and only track strict wrprotect faults. ;)

So, do you plan to support that in the userfault API?

Thanks,
zhanghailiang
Re: [PATCH 00/17] RFC: userfault v2
Hi Zhang, On Fri, Oct 31, 2014 at 09:26:09AM +0800, zhanghailiang wrote: > On 2014/10/30 20:49, Dr. David Alan Gilbert wrote: > > * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: > >> On 2014/10/30 1:46, Andrea Arcangeli wrote: > >>> Hi Zhanghailiang, > >>> > >>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: > Hi Andrea, > > Thanks for your hard work on userfault;) > > This is really a useful API. > > I want to confirm a question: > Can we support distinguishing between writing and reading memory for > userfault? > That is, we can decide whether writing a page, reading a page or both > trigger userfault. > > I think this will help supporting vhost-scsi,ivshmem for migration, > we can trace dirty pages in userspace. > > Actually, i'm trying to realize live memory snapshot based on pre-copy > and userfault, > but reading memory from the migration thread will also trigger userfault. > It will be easy to implement live memory snapshot, if we support > configuring > userfault for writing memory only. > >>> > >>> Mail is going to be long enough already so I'll just assume tracking > >>> dirty memory in userland (instead of doing it in kernel) is a worthwhile > >>> feature to have here. > >>> > >>> After some chat during the KVMForum I've been already thinking it > >>> could be beneficial for some usage to give userland the information > >>> about the fault being read or write, combined with the ability of > >>> mapping pages wrprotected to mcopy_atomic (that would work without > >>> false positives only with MADV_DONTFORK also set, but it's already set > >>> in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be > >>> checked also in the wrprotect faults, not just in the not present > >>> faults, but it's not a massive change. Returning the read/write > >>> information is also not a massive change. This will then pay off mostly > >>> if there's also a way to remove the memory atomically (kind of > >>> remap_anon_pages).
> >>> > >>> Would that be enough? I mean are you still ok if non present read > >>> faults trap too (you'd be notified it's a read) and you get > >>> notification for both wrprotect and non present faults? > >>> > >> Hi Andrea, > >> > >> Thanks for your reply, and your patience;) > >> > >> Er, maybe i didn't describe clearly. What i really need for live memory > >> snapshot > >> is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing > >> the write action*. > >> > >> My initial solution scheme for live memory snapshot is: > >> (1) pause VM > >> (2) using userfaultfd to mark all memory of the VM wrprotect (readonly) > >> (3) save device state to snapshot file > >> (4) resume VM > >> (5) snapshot thread begins to save pages of memory to the snapshot file > >> (6) VM is going to run, and it is OK for the VM or another thread to read ram > >> (no fault trap), > >> but if the VM tries to write a page (dirty the page), there will be > >> a userfault trap notification. > >> (7) a fault-handle-thread reads the page request from userfaultfd, > >> it will copy the content of the page to some buffer, and then remove the > >> page's > >> wrprotect limit (still using the userfaultfd to tell the kernel). > >> (8) after step (7), the VM can continue to write the page, which now can be > >> written. > >> (9) snapshot thread saves the page cached in step (7) > >> (10) repeat steps (5)~(9) until all of the VM's memory is saved to the snapshot file. > > > > Hmm, I can see the same process being useful for the fault-tolerance schemes > > like COLO, it needs a memory state snapshot. > > > >> So, what i need from userfault is supporting only the wrprotect fault. i don't > >> want to get notifications for non-present reading faults, it will influence > >> the VM's performance and the efficiency of doing the snapshot. > > > > What pages would be non-present at this point - just balloon? > > > > Er, sorry, it should be 'non-present page faults';) Could you elaborate?
The balloon pages or not yet allocated pages in the guest, if they fault too (in addition to the wrprotect faults) it doesn't sound a big deal, as it's not so common (balloon especially shouldn't happen except during balloon deflating during the live snapshotting). We could bypass non-present faults though, and only track strict wrprotect faults.
Re: [PATCH 00/17] RFC: userfault v2
Hi Andrea, Is there any news about this discussion? ;) Do you plan to support 'only wrprotect fault' in the userfault API? Thanks, zhanghailiang On 2014/10/30 19:31, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: Hi Zhanghailiang, On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: Hi Andrea, Thanks for your hard work on userfault;) This is really a useful API. I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. I think this will help supporting vhost-scsi,ivshmem for migration, we can trace dirty pages in userspace. Actually, i'm trying to realize live memory snapshot based on pre-copy and userfault, but reading memory from the migration thread will also trigger userfault. It will be easy to implement live memory snapshot, if we support configuring userfault for writing memory only. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthwhile feature to have here. After some chat during the KVMForum I've been already thinking it could be beneficial for some usage to give userland the information about the fault being read or write, combined with the ability of mapping pages wrprotected to mcopy_atomic (that would work without false positives only with MADV_DONTFORK also set, but it's already set in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be checked also in the wrprotect faults, not just in the not present faults, but it's not a massive change. Returning the read/write information is also not a massive change. This will then pay off mostly if there's also a way to remove the memory atomically (kind of remap_anon_pages). Would that be enough? I mean are you still ok if non present read faults trap too (you'd be notified it's a read) and you get notification for both wrprotect and non present faults?
Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe i didn't describe clearly. What i really need for live memory snapshot is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing the write action*.

My initial solution scheme for live memory snapshot is:
(1) pause VM
(2) using userfaultfd to mark all memory of the VM wrprotect (readonly)
(3) save device state to snapshot file
(4) resume VM
(5) snapshot thread begins to save pages of memory to the snapshot file
(6) VM is going to run, and it is OK for the VM or another thread to read ram (no fault trap), but if the VM tries to write a page (dirty the page), there will be a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd, it will copy the content of the page to some buffer, and then remove the page's wrprotect limit (still using the userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which now can be written.
(9) snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all of the VM's memory is saved to the snapshot file.

So, what i need from userfault is supporting only the wrprotect fault. i don't want to get notifications for non-present reading faults, it will influence the VM's performance and the efficiency of doing the snapshot. Also, i think this feature will benefit the migration of ivshmem and vhost-scsi, which have no dirty-page tracing now.

The question then is how you mark the memory readonly to let the wrprotect faults trap if the memory already existed and you didn't map it yourself in the guest with mcopy_atomic with a readonly flag.
My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the fast path check in the not-present and wrprotect page fault
- if VM_USERFAULT is set, find if there's a userfaultfd registered into that vma too; if yes, engage the userfaultfd protocol, otherwise raise SIGBUS (single threaded apps should be fine with SIGBUS and it'll avoid them having to spawn a thread in order to talk the userfaultfd protocol)
- if the userfaultfd protocol is engaged, return read|write fault + fault address to read(ufd) syscalls
- leave the "userfault" resolution mechanism independent of the userfaultfd protocol, so we keep the two problems separated and we don't mix them in the same API, which makes it even harder to finalize it
- add mcopy_atomic (with a flag to map the page readonly too)

The alternative would be to hide mcopy_atomic (and even remap_anon_pages in order to "remove" the memory atomically for the externalization into the cloud) as userfaultfd commands to write into the fd. But then there would be not much point in keeping MADV_USERFAULT around if I do so and I could just remove it too, and it doesn't look clean having to open the userfaultfd just to issue a hidden mcopy_atomic. So it becomes a decision whether the basic SIGBUS mode for single threaded apps should be supported.
Re: [PATCH 00/17] RFC: userfault v2
On 2014/11/1 3:39, Peter Feiner wrote: On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote: Agreed, but for doing a live memory snapshot (the VM is running when we snapshot), we have to do this (block the write action), because we have to save the page before it is dirtied by the write. This is the difference compared to pre-copy migration. Ah ha, I understand the difference now. I suppose that you have considered doing a traditional pre-copy migration (that is, passes over memory saving dirty pages, followed by a pause and a final dump of remaining dirty pages) to a file. Your approach has the advantage of having the VM pause time bounded by the time it takes to handle the userfault and do the write, as opposed to pre-copy migration, which has a pause time bounded by the time it takes to do the final dump of dirty pages, which, in the worst case, is the time it takes to dump all of the guest memory! Right! Strictly speaking, migrating a VM's state into a file (fd) is not a snapshot, because its timing is not deterministic (it depends on when the migration finishes). A VM snapshot's time should be deterministic: it should be the moment I fire the snapshot command. A snapshot is very much like taking a photo, capturing the VM's state at that instant;) You could use the old fork & dump trick. Given that the guest's memory is backed by private VMAs (as of a year ago when I last looked, that is always the case for QEMU), you can have the kernel do the write protection for you. Essentially, you fork Qemu and, in the child process, dump the guest memory then exit. If the parent (including the guest) writes to guest memory, then it will fault and the kernel will copy the page. It is difficult to fork the QEMU process, which is multi-threaded and holds all kinds of locks; actually, this scheme was discussed in the community a long time ago, and it was not accepted. The fork & dump approach will give you the best performance w.r.t.
guest pause times (i.e., just pausing for the COW fault handler), but it does have the distinct disadvantage of potentially using 2x the guest memory (i.e., if the parent process races ahead and writes to all of the pages before you finish the dump). Agreed! This is the second reason why the community did not accept it. To mitigate memory copying, you could madvise MADV_DONTNEED the child memory as you copy it. IMHO, the scheme I mentioned in the previous email may be the simplest and most efficient way, if userfault could support only the wrprotect fault. We can also do some optimization to reduce the impact on the VM during the snapshot, such as caching the requested pages in a memory buffer, etc. Great! Do you plan to issue your patches to the community? I mean, is your work based on qemu? Or an independent tool (CRIU migration?) for live migration? Maybe I could fix the migration problem for ivshmem in qemu now, based on the softdirty mechanism. I absolutely plan on releasing these patches :-) CRIU was the first open-source userland I had planned on integrating with. At Google, I'm working with our home-grown Qemu replacement. However, I'd be happy to help with an effort to get softdirty integrated in Qemu in the future. Great;) Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. I have read them cursorily; they are useful for pre-copy indeed, but it seems they cannot meet my need for snapshot. To make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page. How can I find the API? Has it been merged into the kernel's master branch already? Negative. I'll be sure to CC you when I start sending this stuff upstream. OK, I look forward to it:) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
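The fork & dump trick is easy to demonstrate: the child's address space is a copy-on-write snapshot taken at fork() time, so the parent can keep dirtying pages while the child dumps the old contents. A minimal sketch, where a dict stands in for guest RAM and a pipe stands in for the dump file:

```python
# Demonstrates the fork & dump trick: the child sees the pre-fork state
# even while the parent keeps writing, because the kernel COWs the pages.
import json
import os

guest_ram = {"page0": "clean", "page1": "clean"}

r, w = os.pipe()
pid = os.fork()
if pid == 0:                     # child: "dump the guest memory then exit"
    os.close(r)
    os.write(w, json.dumps(guest_ram).encode())
    os.close(w)
    os._exit(0)

os.close(w)
guest_ram["page0"] = "dirty"     # parent races ahead and dirties a page;
                                 # the child's COW copy is unaffected
chunks = []
while True:
    chunk = os.read(r, 4096)
    if not chunk:
        break
    chunks.append(chunk)
os.close(r)
os.waitpid(pid, 0)

snapshot = json.loads(b"".join(chunks).decode())
assert snapshot == {"page0": "clean", "page1": "clean"}   # pre-fork state
assert guest_ram["page0"] == "dirty"                      # parent's live state
```

The 2x-memory worst case discussed above corresponds to the parent dirtying every page before the child finishes the dump, forcing the kernel to COW-copy all of them.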
Re: [PATCH 00/17] RFC: userfault v2
On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote: > Agreed, but for doing live memory snapshot (VM is running when do snapsphot), > we have to do this (block the write action), because we have to save the page > before it > is dirtied by writing action. This is the difference, compared to pre-copy > migration. Ah ha, I understand the difference now. I suppose that you have considered doing a traditional pre-copy migration (that is, passes over memory saving dirty pages, followed by a pause and a final dump of remaining dirty pages) to a file. Your approach has the advantage of having the VM pause time bounded by the time it takes to handle the userfault and do the write, as opposed to pre-copy migration which has a pause time bounded by the time it takes to do the final dump of dirty pages, which, in the worst case, is the time it takes to dump all of the guest memory! You could use the old fork & dump trick. Given that the guest's memory is backed by private VMA (as of a year ago when I last looked, is always the case for QEMU), you can have the kernel do the write protection for you. Essentially, you fork Qemu and, in the child process, dump the guest memory then exit. If the parent (including the guest) writes to guest memory, then it will fault and the kernel will copy the page. The fork & dump approach will give you the best performance w.r.t. guest pause times (i.e., just pausing for the COW fault handler), but it does have the distinct disadvantage of potentially using 2x the guest memory (i.e., if the parent process races ahead and writes to all of the pages before you finish the dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child memory as you copy it. > Great! Do you plan to issue your patches to community? I mean is your work > based on > qemu? or an independent tool (CRIU migration?) for live-migration? > Maybe i could fix the migration problem for ivshmem in qemu now, > based on softdirty mechanism. 
I absolutely plan on releasing these patches :-) CRIU was the first open-source userland I had planned on integrating with. At Google, I'm working with our home-grown Qemu replacement. However, I'd be happy to help with an effort to get softdirty integrated in Qemu in the future. > >Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. > >To > > I have read them cursorily, it is useful for pre-copy indeed. But it seems > that > it can not meet my need for snapshot. > >make softdirty usable for live migration, I've added an API to atomically > >test-and-clear the bit and write protect the page. > > How can i find the API? Is it been merged in kernel's master branch already? Negative. I'll be sure to CC you when I start sending this stuff upstream. Peter
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/31 13:17, Andres Lagar-Cavilla wrote: On Thu, Oct 30, 2014 at 9:38 PM, zhanghailiang wrote: On 2014/10/31 11:29, zhanghailiang wrote: On 2014/10/31 10:23, Peter Feiner wrote: On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthwhile feature to have here. I'll open that can of worms :-) [...] Er, maybe I didn't describe it clearly. What I really need for live memory snapshot is only the wrprotect fault, like KVM's dirty tracking mechanism, *only tracing write actions*. So, what I need from userfault is support for only the wrprotect fault. I don't want notifications for non-present read faults; they would hurt the VM's performance and the efficiency of taking the snapshot. Given that you do care about performance Zhanghailiang, I don't think that a userfault handler is a good place to track dirty memory. Every dirtying write will block on the userfault handler, which is an expensively slow proposition compared to an in-kernel approach. Agreed, but for doing a live memory snapshot (the VM is running when we snapshot), we have to do this (block the write action), because we have to save the page before it is dirtied by the write. This is the difference compared to pre-copy migration. Again;) For the snapshot, I don't use its dirty tracing ability; I just use it to block the write action, save the page, and then remove the write protection. You could do a CoW in the kernel, post a notification, keep going, and expose an interface for user-space to mmap the preserved copy.
Getting the life-cycle of the preserved page(s) right is tricky, but doable. Anyway, it's easy to hand-wave without knowing your specific requirements. Yes, what I need is very much like a user-space COW feature, but I don't want to modify any KVM code to realize COW; userfault is a more generic and more graceful way. Besides, I'm not an expert in the kernel:( Opening the discussion a bit, this does look similar to the xen-access interface, in which a xen domain vcpu could be stopped in its tracks (Right;)) while user-space was notified of (and acknowledged) a variety of scenarios: page was written to, page was read from, vcpu is attempting to execute from page, etc. Very applicable to anti-viruses right away, for example you can enforce W^X properties on pages. I don't know that Andrea wants to open the game so broadly for userfault, and the code right now is very specific to triggering on pte_none(), but that's a nice reward down this road. I hope he will consider it. IMHO, it is a good extension for userfault (the write fault);) Best Regards, zhanghailiang Also, I think this feature will benefit migration of ivshmem and vhost-scsi, which have no dirty-page tracking now. I do agree wholeheartedly with you here. Manually tracking non-guest writes adds to the complexity of device emulation code. A central fault-driven means for dirty tracking writes from the guest and host would be a welcome simplification to implementing pre-copy migration. Indeed, that's exactly what I'm working on! I'm using the softdirty bit, which was introduced recently for CRIU migration, to replace the use of KVM's dirty logging and manual dirty tracking by the VMM during pre-copy migration. Great! Do you plan to issue your patches to the community? I mean, is your work based on qemu? Or an independent tool (CRIU migration?) for live migration? Maybe I could fix the migration problem for ivshmem in qemu now, based on the softdirty mechanism. See
Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. I have read them cursorily; they are useful for pre-copy indeed, but it seems they cannot meet my need for snapshot. To make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page. How can I find the API? Has it been merged into the kernel's master branch already? Thanks, zhanghailiang
Re: [PATCH 00/17] RFC: userfault v2
On Thu, Oct 30, 2014 at 9:38 PM, zhanghailiang wrote: > On 2014/10/31 11:29, zhanghailiang wrote: >> >> On 2014/10/31 10:23, Peter Feiner wrote: >>> >>> On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: > > On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: >> >> I want to confirm a question: >> Can we support distinguishing between writing and reading memory for >> userfault? >> That is, we can decide whether writing a page, reading a page or both >> trigger userfault. > > Mail is going to be long enough already so I'll just assume tracking > dirty memory in userland (instead of doing it in kernel) is worthy > feature to have here. >>> >>> >>> I'll open that can of worms :-) >>> [...] Er, maybe i didn't describe clearly. What i really need for live memory snapshot is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*. So, what i need for userfault is supporting only wrprotect fault. i don't want to get notification for non present reading faults, it will influence VM's performance and the efficiency of doing snapshot. >>> >>> >>> Given that you do care about performance Zhanghailiang, I don't think >>> that a >>> userfault handler is a good place to track dirty memory. Every dirtying >>> write >>> will block on the userfault handler, which is an expensively slow >>> proposition >>> compared to an in-kernel approach. >>> >> >> Agreed, but for doing live memory snapshot (VM is running when do >> snapsphot), >> we have to do this (block the write action), because we have to save the >> page before it >> is dirtied by writing action. This is the difference, compared to pre-copy >> migration. >> > > Again;) For snapshot, i don't use its dirty tracing ability, i just use it > to block write action, > and save page, and then i will remove its write protect. 
You could do a CoW in the kernel, post a notification, keep going, and expose an interface for user-space to mmap the preserved copy. Getting the life-cycle of the preserved page(s) right is tricky, but doable. Anyway, it's easy to hand-wave without knowing your specific requirements. Opening the discussion a bit, this does look similar to the xen-access interface, in which a xen domain vcpu could be stopped in its tracks while user-space was notified (and acknowledged) a variety of scenarios: page was written to, page was read from, vcpu is attempting to execute from page, etc. Very applicable to anti-viruses right away, for example you can enforce W^X properties on pages. I don't know that Andrea wants to open the game so broadly for userfault, and the code right now is very specific to triggering on pte_none(), but that's a nice reward down this road. Andres > Also, i think this feature will benefit for migration of ivshmem and vhost-scsi which have no dirty-page-tracing now. >>> >>> >>> I do agree wholeheartedly with you here. Manually tracking non-guest >>> writes >>> adds to the complexity of device emulation code. A central fault-driven >>> means >>> for dirty tracking writes from the guest and host would be a welcome >>> simplification to implementing pre-copy migration. Indeed, that's exactly >>> what >>> I'm working on! I'm using the softdirty bit, which was introduced >>> recently for >>> CRIU migration, to replace the use of KVM's dirty logging and manual >>> dirty >>> tracking by the VMM during pre-copy migration. See >> >> >> Great! Do you plan to issue your patches to community? I mean is your work >> based on >> qemu? or an independent tool (CRIU migration?) for live-migration? >> Maybe i could fix the migration problem for ivshmem in qemu now, >> based on softdirty mechanism. >> >>> Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't >>> familiar. To >> >> >> I have read them cursorily, it is useful for pre-copy indeed. 
But it seems >> that >> it can not meet my need for snapshot. >> >>> make softdirty usable for live migration, I've added an API to atomically >>> test-and-clear the bit and write protect the page. >> >> How can i find the API? Is it been merged in kernel's master branch >> already? >> >> Thanks, >> zhanghailiang -- Andres Lagar-Cavilla | Google Kernel Team | andre...@google.com
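The xen-access-style interface discussed in this message (pause a vcpu on read/write/exec of a monitored page, notify user-space, let a policy acknowledge or deny) can be sketched as an event/policy loop. Everything below is invented for illustration; the real interface delivers events over a shared ring and pauses the vcpu, but the sketch shows how a user-space policy could enforce a W^X property:

```python
# Illustrative sketch of a xen-access-style policy hook: each access to a
# monitored page becomes an event a user-space policy can allow or deny.
W, X = "write", "exec"

class MonitoredMemory:
    def __init__(self, policy):
        self.policy = policy
        self.history = {}               # page -> set of access kinds seen

    def access(self, page, kind):
        seen = self.history.setdefault(page, set())
        if not self.policy(page, kind, seen):
            return "denied"             # vcpu would be paused / fault injected
        seen.add(kind)
        return "allowed"

def wxorx_policy(page, kind, seen):
    # W^X: deny executing a page that was written, and writing one that executed
    if kind == X and W in seen:
        return False
    if kind == W and X in seen:
        return False
    return True

mem = MonitoredMemory(wxorx_policy)
assert mem.access("code", X) == "allowed"
assert mem.access("code", W) == "denied"    # no writing executed pages
assert mem.access("data", W) == "allowed"
assert mem.access("data", X) == "denied"    # no executing written pages
```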
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/31 11:29, zhanghailiang wrote: On 2014/10/31 10:23, Peter Feiner wrote: On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthwhile feature to have here. I'll open that can of worms :-) [...] Er, maybe I didn't describe it clearly. What I really need for live memory snapshot is only the wrprotect fault, like KVM's dirty tracking mechanism, *only tracing write actions*. So, what I need from userfault is support for only the wrprotect fault. I don't want notifications for non-present read faults; they would hurt the VM's performance and the efficiency of taking the snapshot. Given that you do care about performance Zhanghailiang, I don't think that a userfault handler is a good place to track dirty memory. Every dirtying write will block on the userfault handler, which is an expensively slow proposition compared to an in-kernel approach. Agreed, but for doing a live memory snapshot (the VM is running when we snapshot), we have to do this (block the write action), because we have to save the page before it is dirtied by the write. This is the difference compared to pre-copy migration. Again;) For the snapshot, I don't use its dirty tracing ability; I just use it to block the write action, save the page, and then remove the write protection. Also, I think this feature will benefit migration of ivshmem and vhost-scsi, which have no dirty-page tracking now. I do agree wholeheartedly with you here. Manually tracking non-guest writes adds to the complexity of device emulation code.
A central fault-driven means for dirty tracking writes from the guest and host would be a welcome simplification to implementing pre-copy migration. Indeed, that's exactly what I'm working on! I'm using the softdirty bit, which was introduced recently for CRIU migration, to replace the use of KVM's dirty logging and manual dirty tracking by the VMM during pre-copy migration. See Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. Great! Do you plan to issue your patches to the community? I mean, is your work based on qemu? Or an independent tool (CRIU migration?) for live migration? Maybe I could fix the migration problem for ivshmem in qemu now, based on the softdirty mechanism. I have read them cursorily; they are useful for pre-copy indeed, but it seems they cannot meet my need for snapshot. To make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page. How can I find the API? Has it been merged into the kernel's master branch already? Thanks, zhanghailiang
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/31 10:23, Peter Feiner wrote: On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthwhile feature to have here. I'll open that can of worms :-) [...] Er, maybe I didn't describe it clearly. What I really need for live memory snapshot is only the wrprotect fault, like KVM's dirty tracking mechanism, *only tracing write actions*. So, what I need from userfault is support for only the wrprotect fault. I don't want notifications for non-present read faults; they would hurt the VM's performance and the efficiency of taking the snapshot. Given that you do care about performance Zhanghailiang, I don't think that a userfault handler is a good place to track dirty memory. Every dirtying write will block on the userfault handler, which is an expensively slow proposition compared to an in-kernel approach. Agreed, but for doing a live memory snapshot (the VM is running when we snapshot), we have to do this (block the write action), because we have to save the page before it is dirtied by the write. This is the difference compared to pre-copy migration. Also, I think this feature will benefit migration of ivshmem and vhost-scsi, which have no dirty-page tracking now. I do agree wholeheartedly with you here. Manually tracking non-guest writes adds to the complexity of device emulation code. A central fault-driven means for dirty tracking writes from the guest and host would be a welcome simplification to implementing pre-copy migration. Indeed, that's exactly what I'm working on!
I'm using the softdirty bit, which was introduced recently for CRIU migration, to replace the use of KVM's dirty logging and manual dirty tracking by the VMM during pre-copy migration. See Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. Great! Do you plan to issue your patches to the community? I mean, is your work based on qemu? Or an independent tool (CRIU migration?) for live migration? Maybe I could fix the migration problem for ivshmem in qemu now, based on the softdirty mechanism. I have read them cursorily; they are useful for pre-copy indeed, but it seems they cannot meet my need for snapshot. To make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page. How can I find the API? Has it been merged into the kernel's master branch already? Thanks, zhanghailiang
Re: [PATCH 00/17] RFC: userfault v2
On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: > On 2014/10/30 1:46, Andrea Arcangeli wrote: > >On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: > >>I want to confirm a question: > >>Can we support distinguishing between writing and reading memory for > >>userfault? > >>That is, we can decide whether writing a page, reading a page or both > >>trigger userfault. > >Mail is going to be long enough already so I'll just assume tracking > >dirty memory in userland (instead of doing it in kernel) is worthy > >feature to have here. I'll open that can of worms :-) > [...] > Er, maybe i didn't describe clearly. What i really need for live memory > snapshot > is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing > write action*. > > So, what i need for userfault is supporting only wrprotect fault. i don't > want to get notification for non present reading faults, it will influence > VM's performance and the efficiency of doing snapshot. Given that you do care about performance Zhanghailiang, I don't think that a userfault handler is a good place to track dirty memory. Every dirtying write will block on the userfault handler, which is an expensively slow proposition compared to an in-kernel approach. > Also, i think this feature will benefit for migration of ivshmem and > vhost-scsi > which have no dirty-page-tracing now. I do agree wholeheartedly with you here. Manually tracking non-guest writes adds to the complexity of device emulation code. A central fault-driven means for dirty tracking writes from the guest and host would be a welcome simplification to implementing pre-copy migration. Indeed, that's exactly what I'm working on! I'm using the softdirty bit, which was introduced recently for CRIU migration, to replace the use of KVM's dirty logging and manual dirty tracking by the VMM during pre-copy migration. See Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. 
To make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page.
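The semantics of such an atomic "test-and-clear soft-dirty and re-write-protect" primitive can be sketched with a toy tracker. The class and function names below are invented; the real mechanism lives in /proc/<pid>/clear_refs and the soft-dirty bit in pagemap, as the documents referenced above describe.

```python
# Toy model of an atomic test-and-clear soft-dirty primitive driving a
# pre-copy loop: each pass sends exactly the pages dirtied since the last.
class SoftDirtyTracker:
    def __init__(self, npages):
        self.dirty = set(range(npages))     # everything starts dirty

    def write(self, pfn):
        self.dirty.add(pfn)                 # first write after a clear re-marks

    def test_and_clear(self):
        # atomically fetch the dirty set and re-write-protect those pages
        snapshot, self.dirty = self.dirty, set()
        return snapshot

def precopy(tracker, writes_between_passes):
    passes = []
    for writes in writes_between_passes + [[]]:  # final pass with VM paused
        passes.append(sorted(tracker.test_and_clear()))
        for pfn in writes:
            tracker.write(pfn)
    return passes

t = SoftDirtyTracker(4)
# pass 1 sends everything; the guest dirties pages 1 and 3 meanwhile;
# pass 2 resends only those
assert precopy(t, [[1, 3]]) == [[0, 1, 2, 3], [1, 3]]
```

The atomicity matters because a write landing between "test" and "write-protect" would be lost; folding both into one operation closes that window.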
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/30 20:49, Dr. David Alan Gilbert wrote: * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: Hi Zhanghailiang, On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: Hi Andrea, Thanks for your hard work on userfault;) This is really a useful API. I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. I think this will help support vhost-scsi and ivshmem for migration; we can trace dirty pages in userspace. Actually, I'm trying to realize live memory snapshot based on pre-copy and userfault, but reading memory from the migration thread will also trigger userfault. It would be easy to implement live memory snapshot if we supported configuring userfault for writing memory only. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthwhile feature to have here. After some chat during the KVMForum I've been already thinking it could be beneficial for some usage to give userland the information about the fault being read or write, combined with the ability of mapping pages wrprotected to mcopy_atomic (that would work without false positives only with MADV_DONTFORK also set, but it's already set in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be checked also in the wrprotect faults, not just in the not-present faults, but it's not a massive change. Returning the read/write information is also not a massive change. This will then pay off mostly if there's also a way to remove the memory atomically (kind of remap_anon_pages). Would that be enough? I mean, are you still ok if non-present read faults trap too (you'd be notified it's a read) and you get notification for both wrprotect and non-present faults?
Hi Andrea, Thanks for your reply, and your patience;) Er, maybe I didn't describe it clearly. What I really need for live memory snapshot is only the wrprotect fault, like KVM's dirty tracking mechanism: *only tracing write actions*. My initial scheme for live memory snapshot is:
(1) pause the VM
(2) use userfaultfd to mark all of the VM's memory wrprotect (readonly)
(3) save device state to the snapshot file
(4) resume the VM
(5) the snapshot thread begins to save pages of memory to the snapshot file
(6) the VM runs; it is OK for the VM or another thread to read RAM (no fault trap), but if the VM tries to write a page (dirty the page), there will be a userfault trap notification.
(7) a fault-handling thread reads the page request from userfaultfd, copies the content of the page to some buffer, and then removes the page's wrprotect limit (still using userfaultfd to tell the kernel).
(8) after step (7), the VM can continue writing the page, which is now writable.
(9) the snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all of the VM's memory is saved to the snapshot file.
Hmm, I can see the same process being useful for fault-tolerance schemes like COLO; it needs a memory state snapshot. So, what I need from userfault is support for only the wrprotect fault. I don't want notifications for non-present read faults; they would hurt the VM's performance and the efficiency of taking the snapshot. What pages would be non-present at this point - just balloon? Er, sorry, it should be 'non-present page faults';) Dave Also, I think this feature will benefit migration of ivshmem and vhost-scsi, which have no dirty-page tracking now. The question then is how you mark the memory readonly to let the wrprotect faults trap if the memory already existed and you didn't map it yourself in the guest with mcopy_atomic with a readonly flag.
My current plan would be:
- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the fast-path check in the not-present and wrprotect page faults
- if VM_USERFAULT is set, find whether there's a userfaultfd registered into that vma too; if yes, engage the userfaultfd protocol, otherwise raise SIGBUS (single-threaded apps should be fine with SIGBUS, and it saves them from spawning a thread just to talk the userfaultfd protocol)
- if the userfaultfd protocol is engaged, return the read|write fault type plus the fault address to read(ufd) syscalls
- leave the "userfault" resolution mechanism independent of the userfaultfd protocol, so we keep the two problems separated and don't mix them in the same API, which would make it even harder to finalize
- add mcopy_atomic (with a flag to map the page readonly too)
The alternative would be to hide mcopy_atomic (and even remap_anon_pages, in order to "remove" the memory atomically for the externalization into the cloud) as userfaultfd commands written into the fd. But then there would be not much point in keeping MADV_USERFAULT around if I do so and I could just remove it too or
Re: [PATCH 00/17] RFC: userfault v2
* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: > On 2014/10/30 1:46, Andrea Arcangeli wrote: > >Hi Zhanghailiang, > > > >On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: > >>Hi Andrea, > >> > >>Thanks for your hard work on userfault;) > >> > >>This is really a useful API. > >> > >>I want to confirm a question: > >>Can we support distinguishing between writing and reading memory for > >>userfault? > >>That is, we can decide whether writing a page, reading a page or both > >>trigger userfault. > >> > >>I think this will help supporting vhost-scsi,ivshmem for migration, > >>we can trace dirty page in userspace. > >> > >>Actually, i'm trying to relize live memory snapshot based on pre-copy and > >>userfault, > >>but reading memory from migration thread will also trigger userfault. > >>It will be easy to implement live memory snapshot, if we support configuring > >>userfault for writing memory only. > > > >Mail is going to be long enough already so I'll just assume tracking > >dirty memory in userland (instead of doing it in kernel) is worthy > >feature to have here. > > > >After some chat during the KVMForum I've been already thinking it > >could be beneficial for some usage to give userland the information > >about the fault being read or write, combined with the ability of > >mapping pages wrprotected to mcopy_atomic (that would work without > >false positives only with MADV_DONTFORK also set, but it's already set > >in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be > >checked also in the wrprotect faults, not just in the not present > >faults, but it's not a massive change. Returning the read/write > >information is also a not massive change. This will then payoff mostly > >if there's also a way to remove the memory atomically (kind of > >remap_anon_pages). > > > >Would that be enough? 
I mean are you still ok if non present read > >fault traps too (you'd be notified it's a read) and you get > >notification for both wrprotect and non present faults? > > Hi Andrea, > > Thanks for your reply, and your patience;) > > Er, maybe i didn't describe clearly. What i really need for live memory snapshot > is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*. > > My initial solution scheme for live memory snapshot is:
> (1) pause VM
> (2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
> (3) save deivce state to snapshot file
> (4) resume VM
> (5) snapshot thread begin to save page of memory to snapshot file
> (6) VM is going to run, and it is OK for VM or other thread to read ram (no fault trap), but if VM try to write page (dirty the page), there will be a userfault trap notification.
> (7) a fault-handle-thread reads the page request from userfaultfd, it will copy content of the page to some buffers, and then remove the page's wrprotect limit (still using the userfaultfd to tell kernel).
> (8) after step (7), VM can continue to write the page which is now can be write.
> (9) snapshot thread save the page cached in step (7)
> (10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.
Hmm, I can see the same process being useful for fault-tolerance schemes like COLO; it needs a memory state snapshot. > So, what i need for userfault is supporting only wrprotect fault. i don't want to get notification for non present reading faults, it will influence VM's performance and the efficiency of doing snapshot. What pages would be non-present at this point - just balloon? Dave > Also, i think this feature will benefit for migration of ivshmem and vhost-scsi > which have no dirty-page-tracing now.
> > The question then is how you mark the memory readonly to let the
> > wrprotect faults trap if the memory already existed and you didn't map
> > it yourself in the guest with mcopy_atomic with a readonly flag.
> >
> > My current plan would be:
> >
> > - keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
> >   fast path check in the not-present and wrprotect page fault
> >
> > - if VM_USERFAULT is set, find if there's a userfaultfd registered
> >   into that vma too
> >
> >   if yes, engage the userfaultfd protocol
> >
> >   otherwise raise SIGBUS (single-threaded apps should be fine with
> >   SIGBUS, and it'll avoid them having to spawn a thread in order to
> >   talk the userfaultfd protocol)
> >
> > - if the userfaultfd protocol is engaged, return read|write fault +
> >   fault address to read(ufd) syscalls
> >
> > - leave the "userfault" resolution mechanism independent of the
> >   userfaultfd protocol, so we keep the two problems separated and
> >   don't mix them in the same API, which would make it even harder to
> >   finalize
> >
> >   add mcopy_atomic (with a flag to map the page readonly too)
> >
> >   The alternative would be to hide mcopy_atomic (and even
> >   remap_anon_pages, in order to "remove" the memory atomically for
> >   the externalization into the cloud) as userfaultfd commands written
> >   into the fd.
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea,

Thanks for your hard work on userfault;)

This is really a useful API.

I want to confirm a question: can we support distinguishing between
writing and reading memory for userfault? That is, can we decide whether
writing a page, reading a page, or both trigger userfault?

I think this will help support vhost-scsi and ivshmem for migration; we
can trace dirty pages in userspace.

Actually, I'm trying to realize live memory snapshot based on pre-copy
and userfault, but reading memory from the migration thread will also
trigger userfault. It would be easy to implement live memory snapshot if
we supported configuring userfault for writing memory only.

Mail is going to be long enough already, so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthwhile
feature to have here.

After some chat during the KVM Forum I've already been thinking it could
be beneficial for some usages to give userland the information about the
fault being read or write, combined with the ability of mapping pages
wrprotected with mcopy_atomic (that would work without false positives
only with MADV_DONTFORK also set, but it's already set in qemu). That
will require "vma->vm_flags & VM_USERFAULT" to be checked also in the
wrprotect faults, not just in the not-present faults, but it's not a
massive change. Returning the read/write information is also not a
massive change. This will then pay off mostly if there's also a way to
remove the memory atomically (kind of remap_anon_pages).

Would that be enough? I mean, are you still OK if a non-present read
fault traps too (you'd be notified it's a read) and you get
notifications for both wrprotect and non-present faults?

Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe I didn't describe it clearly.
What I really need for live memory snapshot is only the wrprotect fault,
like KVM's dirty tracking mechanism: *only tracing write actions*.

My initial solution scheme for live memory snapshot is:
(1) pause the VM
(2) use userfaultfd to mark all of the VM's memory wrprotect (readonly)
(3) save the device state to the snapshot file
(4) resume the VM
(5) the snapshot thread begins to save memory pages to the snapshot file
(6) the VM keeps running, and it is OK for the VM or other threads to
    read RAM (no fault trap), but if the VM tries to write a page (dirty
    the page), there will be a userfault trap notification.
(7) a fault-handling thread reads the page request from userfaultfd; it
    copies the content of the page to some buffer and then removes the
    page's wrprotect limit (again using userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which is now
    writable.
(9) the snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all the VM's memory is saved to the
     snapshot file.

So, what I need from userfault is supporting only the wrprotect fault. I
don't want to get notifications for non-present read faults; that would
hurt the VM's performance and the efficiency of taking the snapshot.

Also, I think this feature will benefit migration of ivshmem and
vhost-scsi, which have no dirty-page tracking now.

The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.
My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the fast
  path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered into
  that vma too

  if yes, engage the userfaultfd protocol

  otherwise raise SIGBUS (single-threaded apps should be fine with
  SIGBUS, and it'll avoid them having to spawn a thread in order to talk
  the userfaultfd protocol)

- if the userfaultfd protocol is engaged, return read|write fault +
  fault address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
  userfaultfd protocol, so we keep the two problems separated and don't
  mix them in the same API, which would make it even harder to finalize

  add mcopy_atomic (with a flag to map the page readonly too)

The alternative would be to hide mcopy_atomic (and even remap_anon_pages,
in order to "remove" the memory atomically for the externalization into
the cloud) as userfaultfd commands written into the fd. But then there
would not be much point in keeping MADV_USERFAULT around if I did so,
and I could just remove it too; and it doesn't look clean having to open
the userfaultfd just to issue a hidden mcopy_atomic. So it becomes a
decision of whether the basic SIGBUS mode for single-threaded apps
should be supported or not. As long as we support SIGBUS too and we
don't force userfaultfd as the only mechanism to be notified about
userfaults, having a separate mcopy_atomic syscall sounds cleaner.
Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2
On 29 October 2014 17:46, Andrea Arcangeli wrote:
> After some chat during the KVMForum I've been already thinking it
> could be beneficial for some usage to give userland the information
> about the fault being read or write

...I wonder if that would let us replace the current nasty mess we use
in linux-user to detect read vs write faults (which uses a bunch of
architecture-specific hacks, including in some cases "look at the insn
that triggered this SEGV and decode it to see if it was a load or a
store"; see the various cpu_signal_handler() implementations in
user-exec.c).

-- PMM
Re: [PATCH 00/17] RFC: userfault v2
Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> Hi Andrea,
>
> Thanks for your hard work on userfault;)
>
> This is really a useful API.
>
> I want to confirm a question:
> Can we support distinguishing between writing and reading memory for
> userfault? That is, we can decide whether writing a page, reading a
> page, or both trigger userfault.
>
> I think this will help support vhost-scsi and ivshmem for migration;
> we can trace dirty pages in userspace.
>
> Actually, I'm trying to realize live memory snapshot based on pre-copy
> and userfault, but reading memory from the migration thread will also
> trigger userfault. It would be easy to implement live memory snapshot
> if we supported configuring userfault for writing memory only.

Mail is going to be long enough already, so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthwhile
feature to have here.

After some chat during the KVM Forum I've already been thinking it could
be beneficial for some usages to give userland the information about the
fault being read or write, combined with the ability of mapping pages
wrprotected with mcopy_atomic (that would work without false positives
only with MADV_DONTFORK also set, but it's already set in qemu). That
will require "vma->vm_flags & VM_USERFAULT" to be checked also in the
wrprotect faults, not just in the not-present faults, but it's not a
massive change. Returning the read/write information is also not a
massive change. This will then pay off mostly if there's also a way to
remove the memory atomically (kind of remap_anon_pages).

Would that be enough? I mean, are you still OK if a non-present read
fault traps too (you'd be notified it's a read) and you get
notifications for both wrprotect and non-present faults?
The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the fast
  path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered into
  that vma too

  if yes, engage the userfaultfd protocol

  otherwise raise SIGBUS (single-threaded apps should be fine with
  SIGBUS, and it'll avoid them having to spawn a thread in order to talk
  the userfaultfd protocol)

- if the userfaultfd protocol is engaged, return read|write fault +
  fault address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
  userfaultfd protocol, so we keep the two problems separated and don't
  mix them in the same API, which would make it even harder to finalize

  add mcopy_atomic (with a flag to map the page readonly too)

The alternative would be to hide mcopy_atomic (and even remap_anon_pages,
in order to "remove" the memory atomically for the externalization into
the cloud) as userfaultfd commands written into the fd. But then there
would not be much point in keeping MADV_USERFAULT around if I did so,
and I could just remove it too; and it doesn't look clean having to open
the userfaultfd just to issue a hidden mcopy_atomic. So it becomes a
decision of whether the basic SIGBUS mode for single-threaded apps
should be supported or not. As long as we support SIGBUS too and we
don't force userfaultfd as the only mechanism to be notified about
userfaults, having a separate mcopy_atomic syscall sounds cleaner.
Perhaps mcopy_atomic could be used in other cases that may arise later
that may not be connected with userfaults.

Questions to double-check the above plan is OK:

1) should I drop the SIGBUS behavior and MADV_USERFAULT?

2) should I hide mcopy_atomic as a write into the userfaultfd?
NOTE: even if I hide mcopy_atomic as a userfaultfd command written into
the fd, the buffer pointer passed to the write() syscall would still
_not_ point to the data like a regular write; it would point to a
command structure that in turn points to the source and destination data
of the "hidden" mcopy_atomic. The only advantage is that perhaps I could
wake up the blocked page faults without requiring an additional syscall.
The standalone mcopy_atomic would still require a write into the
userfaultfd, as happens now after remap_anon_pages returns, in order to
wake up the stopped page faults.

3) should I add a registration command to trap only write faults?

The protocol can always be extended later anyway in a backwards
compatible way, but it's better if we get it fully featured from the
start.

For completeness, some answers to other questions I've seen floating
around but that weren't posted on the list yet (you can skip reading the
part below if not interested):

- open("/dev/userfault") instead of sys_userfaultfd(): I don't see the
  benefit: userfaul
Re: [PATCH 00/17] RFC: userfault v2
Hi Andrea,

Thanks for your hard work on userfault;)

This is really a useful API.

I want to confirm a question: can we support distinguishing between
writing and reading memory for userfault? That is, can we decide whether
writing a page, reading a page, or both trigger userfault?

I think this will help support vhost-scsi and ivshmem for migration; we
can trace dirty pages in userspace.

Actually, I'm trying to realize live memory snapshot based on pre-copy
and userfault, but reading memory from the migration thread will also
trigger userfault. It would be easy to implement live memory snapshot if
we supported configuring userfault for writing memory only.

Thanks,
zhanghailiang

On 2014/10/4 1:07, Andrea Arcangeli wrote:

Hello everyone,

There's a large To/Cc list for this RFC because it adds two new syscalls
(userfaultfd and remap_anon_pages) and MADV_USERFAULT/MADV_NOUSERFAULT,
so suggestions on changes are welcome sooner rather than later.

The major change compared to the previous RFC I sent a few months ago is
that the userfaultfd protocol now supports dynamic range registration.
So you can have an unlimited number of userfaults for each process, and
each shared library can use its own userfaultfd on its own memory
independently from other shared libraries or the main program. This
functionality was suggested by Andy Lutomirski (more details on this are
in the commit header of the last patch of this patchset).

In addition, the mmap_sem complexities have been sorted out. In fact the
real userfault patchset starts from patch number 7. Patches 1-6 will be
submitted separately for merging, and if applied standalone they provide
a scalability improvement by reducing the mmap_sem hold times during
I/O. I included patches 1-6 here too because they're a hard dependency
for the userfault patchset.
The userfaultfd syscall depends on the first fault always having
FAULT_FLAG_ALLOW_RETRY set (the later retry faults don't matter; it's
fine to clear FAULT_FLAG_ALLOW_RETRY on the retry faults, following the
current model).

The combination of these features is what I would propose to implement
postcopy live migration in qemu, and in general demand paging of remote
memory hosted in different cloud nodes.

If the access could ever happen in kernel context through syscalls (not
just from userland context), then userfaultfd has to be used on top of
MADV_USERFAULT, to make the userfault unnoticeable to the syscall (no
error will be returned). This latter feature is more advanced than what
volatile ranges alone could do with SIGBUS so far (but it's optional: if
the process doesn't register the memory in a userfaultfd, the regular
SIGBUS will fire, and if the fd is closed SIGBUS will also fire for any
blocked userfault that was waiting for a userfaultfd_write ack).

userfaultfd is also a generic enough feature that it allows KVM to
implement postcopy live migration without having to modify a single line
of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all other
GUP features work just fine in combination with userfaults (userfaults
trigger async page faults in the guest scheduler, so those guest
processes that aren't waiting for userfaults can keep running in the
guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults (it's
not mandatory; vmsplice will likely still be used in the case of local
postcopy live migration just to upgrade the qemu binary, but
remap_anon_pages is faster and ideal for transferring memory across the
network: it's zerocopy and doesn't touch the vma, it only holds the
mmap_sem for reading).

The current behavior of remap_anon_pages is very strict, to avoid any
chance of memory corruption going unnoticed.
mremap is not strict like that: if there's a synchronization bug it
would drop the destination range silently, resulting in subtle memory
corruption, for example. remap_anon_pages would return -EEXIST in that
case. If there are holes in the source range, remap_anon_pages will
return -ENOENT.

If remap_anon_pages is always used with 2M naturally aligned addresses,
transparent hugepages will not be split. If there could be 4k (or any
size) holes in the 2M (or any size) source range, remap_anon_pages
should be used with the RAP_ALLOW_SRC_HOLES flag to relax some of its
strict checks (-ENOENT won't be returned if RAP_ALLOW_SRC_HOLES is set;
remap_anon_pages will then just behave as a noop on any hole in the
source range). This flag is generally useful when implementing
userfaults with THP granularity, but it shouldn't be set if doing the
userfaults with PAGE_SIZE granularity if the developer wants to benefit
from the strict -ENOENT behavior.

The remap_anon_pages syscall API is not vectored, as I expect it to be
used mainly for demand paging (where there can be just one faulting
range per userfault) or for large ranges (with the THP model, as an
alternative to zapping re-dirtied pages with MADV_DONTNEED at 4k
granularity before starting the guest in the destination node) where
vectoring isn't going to provide much performance advantage (thanks to
the THP coarser granularity).
[PATCH 00/17] RFC: userfault v2
Hello everyone,

There's a large To/Cc list for this RFC because it adds two new syscalls
(userfaultfd and remap_anon_pages) and MADV_USERFAULT/MADV_NOUSERFAULT,
so suggestions on changes are welcome sooner rather than later.

The major change compared to the previous RFC I sent a few months ago is
that the userfaultfd protocol now supports dynamic range registration.
So you can have an unlimited number of userfaults for each process, and
each shared library can use its own userfaultfd on its own memory
independently from other shared libraries or the main program. This
functionality was suggested by Andy Lutomirski (more details on this are
in the commit header of the last patch of this patchset).

In addition, the mmap_sem complexities have been sorted out. In fact the
real userfault patchset starts from patch number 7. Patches 1-6 will be
submitted separately for merging, and if applied standalone they provide
a scalability improvement by reducing the mmap_sem hold times during
I/O. I included patches 1-6 here too because they're a hard dependency
for the userfault patchset.

The userfaultfd syscall depends on the first fault always having
FAULT_FLAG_ALLOW_RETRY set (the later retry faults don't matter; it's
fine to clear FAULT_FLAG_ALLOW_RETRY on the retry faults, following the
current model).

The combination of these features is what I would propose to implement
postcopy live migration in qemu, and in general demand paging of remote
memory hosted in different cloud nodes.

If the access could ever happen in kernel context through syscalls (not
just from userland context), then userfaultfd has to be used on top of
MADV_USERFAULT, to make the userfault unnoticeable to the syscall (no
error will be returned).
This latter feature is more advanced than what volatile ranges alone
could do with SIGBUS so far (but it's optional: if the process doesn't
register the memory in a userfaultfd, the regular SIGBUS will fire, and
if the fd is closed SIGBUS will also fire for any blocked userfault that
was waiting for a userfaultfd_write ack).

userfaultfd is also a generic enough feature that it allows KVM to
implement postcopy live migration without having to modify a single line
of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all other
GUP features work just fine in combination with userfaults (userfaults
trigger async page faults in the guest scheduler, so those guest
processes that aren't waiting for userfaults can keep running in the
guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults (it's
not mandatory; vmsplice will likely still be used in the case of local
postcopy live migration just to upgrade the qemu binary, but
remap_anon_pages is faster and ideal for transferring memory across the
network: it's zerocopy and doesn't touch the vma, it only holds the
mmap_sem for reading).

The current behavior of remap_anon_pages is very strict, to avoid any
chance of memory corruption going unnoticed. mremap is not strict like
that: if there's a synchronization bug it would drop the destination
range silently, resulting in subtle memory corruption, for example.
remap_anon_pages would return -EEXIST in that case. If there are holes
in the source range, remap_anon_pages will return -ENOENT.

If remap_anon_pages is always used with 2M naturally aligned addresses,
transparent hugepages will not be split. If there could be 4k (or any
size) holes in the 2M (or any size) source range, remap_anon_pages
should be used with the RAP_ALLOW_SRC_HOLES flag to relax some of its
strict checks (-ENOENT won't be returned if RAP_ALLOW_SRC_HOLES is set;
remap_anon_pages will then just behave as a noop on any hole in the
source range).
This flag is generally useful when implementing userfaults with THP
granularity, but it shouldn't be set if doing the userfaults with
PAGE_SIZE granularity if the developer wants to benefit from the strict
-ENOENT behavior.

The remap_anon_pages syscall API is not vectored, as I expect it to be
used mainly for demand paging (where there can be just one faulting
range per userfault) or for large ranges (with the THP model, as an
alternative to zapping re-dirtied pages with MADV_DONTNEED at 4k
granularity before starting the guest in the destination node) where
vectoring isn't going to provide much performance advantage (thanks to
the THP coarser granularity).

On the rmap side remap_anon_pages doesn't add much complexity: there's
no need of nonlinear anon vmas to support it, because I added the
constraint that it will fail if the mapcount is more than 1. So in
general the source range of remap_anon_pages should be marked
MADV_DONTFORK to prevent any risk of failure if the process ever forks
(as qemu can in some cases).

The MADV_USERFAULT feature should be generic enough that it can provide
the userfaults to the Android volatile range feature too, on access of
reclaimed volatile pages. Or it could be used for other similar things
with tmpfs in the future. I've been discussin