Hi Isaku,

On Tue, May 07, 2013 at 07:07:40PM +0900, Isaku Yamahata wrote:
> On Mon, May 06, 2013 at 09:56:57PM +0200, Andrea Arcangeli wrote:
> > Hello everyone,
> >
> > this is a patchset to implement two new kernel features:
> > MADV_USERFAULT and remap_anon_pages.
> >
> > The combination of the two features are what I would propose to
> > implement postcopy live migration, and in general demand paging of
> > remote memory, hosted in different cloud nodes with KSM. It might also
> > be used without virt to offload parts of memory to different nodes
> > using some userland library and a network memory manager.
>
> Interesting. The API you are proposing handles only user fault.
> How do you think about kernel case. I mean that KVM kernel module issues
> get_user_pages().
> Exit to qemu with dedicated reason?
Correct. It's possible we want a more meaningful retval from
get_user_pages too (right now sigbus would make gup return an overly
generic -EFAULT), by introducing a FOLL_USERFAULT in gup_flags. So the
KVM bits are still missing at this point.

Gleb also wants to enable the async page fault in the postcopy stage,
so we immediately schedule a different guest process if the current
guest process hits a userfault within KVM. So the protocol with the
postcopy thread will tell it "fill this pfn async" or "fill it
synchronous". And Gleb would like KVM to talk to the postcopy thread
(through a pipe?) directly, to avoid exiting to userland. But we could
also return to userland; if we do, we don't need to teach the kernel
about the postcopy thread protocol to request new pages synchronously
(after running out of async page faults) or asynchronously (while
async page faults are still available). Clearly staying in the kernel
is more efficient as it avoids an enter/exit cycle, and KVM can be
restarted immediately after a 9 byte write to the pipe with the
postcopy thread.

> In case of precopy + postcopy optimization, dirty bitmap is sent after
> precopy phase and then clean pages are populated. In this population phase,
> vecotored API can be utilized. I'm not sure how much vectored API will
> contribute to shorten VM-switch time, though.

But the network transfer won't be vectored, will it? If we pay an
enter/exit kernel cycle for the network transfer, I assume we'd run a
remap_anon_pages after each chunk. Also the postcopy thread won't
transfer too much data in the background at once: it needs to react
quickly to an "urgent" userfault request coming from a vcpu thread.

> It would be desirable to avoid complex thing in signal handler.
> Like sending page request to remote, receiving pages from remote.
> So signal handler would just queue requests to those dedicated threads
> and wait and requests would be serialized. Such strictness is not

Exactly, that's the idea: a separate thread will do the network
transfer and then run remap_anon_pages. And if we immediately use
async page faults, it won't need to block until we run out of async
page faults.

> very critical, I guess. But others might find other use case...

It's still somewhat useful to be strict in my view, as it will verify
that we handle correctly the case of many vcpus userfaulting on the
same address at the same time: everyone except the first shouldn't run
remap_anon_pages.

To make the above a bit more concrete I'm appending a few rough
sketches below. None of this is in the patchset yet, so take all the
names and numbers in them as guesses.

Thanks!
Andrea
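===

1) The FOLL_USERFAULT idea: a gup_flags bit so that a gup caller like
KVM can tell a userfault apart from a plain bad address. The flag
value, the replacement errno and the exact hook point below are all
made up, this only shows the shape of it (the VM_USERFAULT check
mirrors what MADV_USERFAULT would set on the vma):

#include <linux/mm.h>

/* Hypothetical: not in the patchset, bit value picked arbitrarily. */
#define FOLL_USERFAULT	0x1000	/* report userfaults distinctly */

/*
 * Sketch of a check in the gup fault path, where today a
 * VM_USERFAULT vma would simply yield the generic -EFAULT.
 */
static int faultin_page_sketch(struct vm_area_struct *vma,
			       unsigned int gup_flags)
{
	if (vma->vm_flags & VM_USERFAULT) {
		if (gup_flags & FOLL_USERFAULT)
			return -ENOENT;	/* placeholder for a distinct retval */
		return -EFAULT;		/* current, overly generic behavior */
	}
	return 0;	/* fall through to the normal fault path */
}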
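2) What I mean by the 9 byte write: one command byte to pick sync vs
async, plus the 64bit faulting address. The struct layout and command
values are only an example of why 9 bytes is enough, nothing is
specified yet:

#include <stdint.h>
#include <unistd.h>

/* Hypothetical command values. */
#define UFAULT_FILL_SYNC	0	/* vcpu blocked: fill this page now */
#define UFAULT_FILL_ASYNC	1	/* async page fault queued: fill later */

struct ufault_msg {
	uint8_t  cmd;		/* UFAULT_FILL_SYNC or UFAULT_FILL_ASYNC */
	uint64_t addr;		/* faulting guest address */
} __attribute__((packed));	/* 1 + 8 = 9 bytes on the wire */

/* What KVM (or userland) would do to ask the postcopy thread for a page. */
static int request_page(int pipe_fd, uint64_t addr, int sync)
{
	struct ufault_msg msg = {
		.cmd  = sync ? UFAULT_FILL_SYNC : UFAULT_FILL_ASYNC,
		.addr = addr,
	};
	return write(pipe_fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}

A nice side effect of staying at 9 bytes is that it's well below
PIPE_BUF, so the write is atomic and many vcpus can share the one pipe
to the postcopy thread without any locking.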
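3) The strict userland scheme: the signal handler does nothing complex,
it only queues the faulting address to the dedicated thread and waits;
the thread receives the page and runs remap_anon_pages, and a duplicate
fault on the same address doesn't run remap_anon_pages twice. The
network receive is stubbed out, the syscall number is a placeholder,
and the dst/src/len argument order is assumed:

#include <signal.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define PAGE_SIZE 4096UL

#ifndef __NR_remap_anon_pages
#define __NR_remap_anon_pages -1	/* placeholder: real number TBD */
#endif

static int req_pipe[2], ack_pipe[2];	/* set up with pipe() at startup */

/* Stub standing in for the network receive; a real one would read from
 * the migration socket into a page-aligned bounce buffer. */
static void *fetch_page_from_source(uintptr_t addr)
{
	static char buf[PAGE_SIZE] __attribute__((aligned(4096)));
	(void)addr;
	return buf;
}

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	uintptr_t addr = (uintptr_t)si->si_addr & ~(PAGE_SIZE - 1);
	char ack;

	/* Only async-signal-safe calls here: queue the request and wait. */
	write(req_pipe[1], &addr, sizeof(addr));
	read(ack_pipe[0], &ack, 1);	/* blocks until the page is mapped */
}

static void *postcopy_thread(void *arg)
{
	uintptr_t addr, last = 0;

	for (;;) {
		if (read(req_pipe[0], &addr, sizeof(addr)) != sizeof(addr))
			break;
		if (addr != last) {
			void *src = fetch_page_from_source(addr);
			syscall(__NR_remap_anon_pages, addr,
				(uintptr_t)src, PAGE_SIZE);
			last = addr;
		}
		/* ack every request, even duplicates, to wake each faulter */
		write(ack_pipe[1], "", 1);
	}
	return NULL;
}

The "addr != last" check only dedups back to back requests, but that's
enough to exercise the "only the first vcpu runs remap_anon_pages"
requirement above.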