Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-11-25 Thread Andrea Arcangeli
On Fri, Nov 21, 2014 at 11:05:45PM +, Peter Maydell wrote:
> If it's mapped and readable-but-not-writable then it should still
> fault on write accesses, though? These are cases we currently get
> SEGV for, anyway.

Yes, then it'll work just fine.

> Ah, I guess we have a terminology difference. I was considering
> "page fault" to mean (roughly) "anything that causes the CPU to
> take an exception on an attempted load/store" and expected that
> userfaultfd would notify userspace of any of those. (Well, not
> alignment faults, maybe, but I'm definitely surprised that
> access permission issues don't get reported the same way as
> page-completely-missing issues. In other words I was expecting
> that this was "everything previously reported via SIGSEGV or
> SIGBUS now comes via userfaultfd".)

Just not PROT_NONE SIGSEGV faults; i.e., PROT_NONE would still SIGSEGV
currently, because it's not a not-present fault (the page is present,
just not mapped readable) and it's not a wrprotect fault either (it is
trapped with the vma vm_flags permission bits before the actual page
fault handler is invoked). userfaultfd hooks into the common code of
the page fault handler.

> > Temporarily removing/moving the page with remap_anon_pages shall be
> > much better than using PROT_NONE for this (or alternative syscall name
> > to differentiate it further from remap_file_pages, or equivalent
> > userfaultfd command if we decide to hide the pte/pmd mangling as
> > userfaultfd commands instead of adding new standalone syscalls).
> 
> We don't use PROT_NONE for the linux-user situation, we just use
> mprotect() to remove the PAGE_WRITE permission so it's still
> readable.

As I said above, it'll work just fine then.

> I suspect actually linux-user would be better off implementing
> something like "if this is a page which we've mapped read-only
> because we translated code out of it, then go ahead and remap
> it r/w and throw away the translation and retry the access,
> otherwise report SEGV to the guest", because taking SEGVs shouldn't
> be a fast path in the guest binary. That would let us work without
> architecture-specific junk and without requiring new kernel
> features either. So you can ignore this whole tangent thread :-)

You might get a significant boost if you use userfaultfd.

For postcopy live snapshot and postcopy live migration the main
benefit is the removal of mprotect as a whole; the performance
improvement is a secondary benefit.

You can cap the max size of the JIT translation cache (and in turn the
maximum number of vmas generated by the mprotects), but we can't cap
the address space fragmentation. The faults may invoke way too many
mprotect calls and we may fragment the vmas so much that we get
-ENOMEM.
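
For concreteness, a minimal sketch of the registration and event loop,
written against the ioctl-based userfaultfd API as it was eventually
merged upstream rather than the RFC syscalls in this thread; error
handling omitted:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static void watch_region(void *area, unsigned long len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);          /* protocol handshake */

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,  /* not-present faults */
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffd_msg msg;
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;
        /* the fault address arrives with a read/write flag, which is
         * exactly the information the SEGV hacks try to recover */
        int is_write = msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE;
        (void)is_write;
        /* ... resolve the fault (e.g. UFFDIO_COPY) to wake the
         * faulting thread ... */
    }
}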

Marking a page wrprotected however is always tricky, no matter whether
it's fork doing it, or KSM, or something else. KSM just skips pages
that could be under gup pins and retries them at the next pass. Fork
simply won't work right currently, and it needs MADV_DONTFORK to avoid
the wrprotection entirely where you may use O_DIRECT mixed with
threads and fork.

For this new vma-less syscall (or ufd command) the best we could do is
to print a warning if any page marked wrprotected could be under a GUP
pin (the warning could generate false positives as a result of
speculative cache lookups that run lockless get_page_unless_zero() on
any pfn).

To avoid races in the postcopy live snapshot feature, I think it
should be enough to wait for all in-flight I/O to complete before
marking the guest address space readonly (the KVM gup() side can be
taken care of by marking the shadow MMU readonly, which is a must
anyway; the mmu notifier will take care of that part).

The postcopy live snapshot will have to copy the page, so it's
effectively a COW in userland, and in turn it must ensure there's no
O_DIRECT in flight still writing to the page (despite it being marked
readonly) while the wrprotection syscall runs.

In your case there's probably no gup() in the equation unless you use
O_DIRECT (I don't think linux-user uses the in-kernel shadow MMU), so
you don't have to worry about those races and it's just simpler.


Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-11-21 Thread Peter Maydell
On 21 November 2014 20:14, Andrea Arcangeli  wrote:
> Hi Peter,
>
> On Wed, Oct 29, 2014 at 05:56:59PM +, Peter Maydell wrote:
>> On 29 October 2014 17:46, Andrea Arcangeli  wrote:
>> > After some chat during the KVM Forum I've already been thinking it
>> > could be beneficial for some usage to give userland the information
>> > about the fault being read or write
>>
>> ...I wonder if that would let us replace the current nasty
>> mess we use in linux-user to detect read vs write faults
>> (which uses a bunch of architecture-specific hacks including
>> in some cases "look at the insn that triggered this SEGV and
>> decode it to see if it was a load or a store"; see the
>> various cpu_signal_handler() implementations in user-exec.c).
>
> There's currently no plan to deliver read access notifications of a
> present page to userland, simply because the task of userfaultfd is
> to handle the page fault in userland; but if the page is mapped and
> readable it won't fault in the first place :). I just mean it's not
> like gdb's read watchpoints.

If it's mapped and readable-but-not-writable then it should still
fault on write accesses, though? These are cases we currently get
SEGV for, anyway.

> Even if the region were set to PROT_NONE it would still SEGV without
> triggering a userfault (after all, pte_present would still be true
> because the page is still mapped despite not being readable, so in
> any case it wouldn't be considered a not-present page fault).

Ah, I guess we have a terminology difference. I was considering
"page fault" to mean (roughly) "anything that causes the CPU to
take an exception on an attempted load/store" and expected that
userfaultfd would notify userspace of any of those. (Well, not
alignment faults, maybe, but I'm definitely surprised that
access permission issues don't get reported the same way as
page-completely-missing issues. In other words I was expecting
that this was "everything previously reported via SIGSEGV or
SIGBUS now comes via userfaultfd".)

> Temporarily removing/moving the page with remap_anon_pages shall be
> much better than using PROT_NONE for this (or alternative syscall name
> to differentiate it further from remap_file_pages, or equivalent
> userfaultfd command if we decide to hide the pte/pmd mangling as
> userfaultfd commands instead of adding new standalone syscalls).

We don't use PROT_NONE for the linux-user situation, we just use
mprotect() to remove the PAGE_WRITE permission so it's still
readable.

I suspect actually linux-user would be better off implementing
something like "if this is a page which we've mapped read-only
because we translated code out of it, then go ahead and remap
it r/w and throw away the translation and retry the access,
otherwise report SEGV to the guest", because taking SEGVs shouldn't
be a fast path in the guest binary. That would let us work without
architecture-specific junk and without requiring new kernel
features either. So you can ignore this whole tangent thread :-)
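
For reference, the x86-64 flavor of that architecture-specific junk
looks roughly like this (a hedged sketch modeled on the
cpu_signal_handler() idea, not QEMU's actual code; other architectures
have to decode the faulting instruction instead):

#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;
    /* page-fault error code saved by the CPU: bit 1 set means write */
    int is_write = uc->uc_mcontext.gregs[REG_ERR] & 0x2;
    void *fault_addr = info->si_addr;

    if (is_write) {
        /* remap r/w, throw away the translation, return to retry */
    } else {
        /* forward a real SEGV to the guest */
    }
    (void)fault_addr; (void)sig;
}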

thanks
-- PMM


Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-11-21 Thread Andrea Arcangeli
Hi Peter,

On Wed, Oct 29, 2014 at 05:56:59PM +, Peter Maydell wrote:
> On 29 October 2014 17:46, Andrea Arcangeli  wrote:
> > After some chat during the KVM Forum I've already been thinking it
> > could be beneficial for some usage to give userland the information
> > about the fault being read or write
> 
> ...I wonder if that would let us replace the current nasty
> mess we use in linux-user to detect read vs write faults
> (which uses a bunch of architecture-specific hacks including
> in some cases "look at the insn that triggered this SEGV and
> decode it to see if it was a load or a store"; see the
> various cpu_signal_handler() implementations in user-exec.c).

There's currently no plan to deliver read access notifications of a
present page to userland, simply because the task of userfaultfd is to
handle the page fault in userland; but if the page is mapped and
readable it won't fault in the first place :). I just mean it's not
like gdb's read watchpoints.

Even if the region were set to PROT_NONE it would still SEGV without
triggering a userfault (after all, pte_present would still be true
because the page is still mapped despite not being readable, so in any
case it wouldn't be considered a not-present page fault).

If you temporarily remove the page (which requires an unavoidable TLB
flush, also considering that if the page was previously mapped the TLB
could still resolve it for reads) it would work, because the plan is
to provide read/write fault information through the userfaultfd.

In theory it would be possible to deliver PROT_NONE faults through
userfault too, but it doesn't make much sense because PROT_NONE still
requires a TLB flush, in addition to the vma
modifications/splitting/rbtree rebalancing and taking the mmap_sem for
writing as well.

Temporarily removing/moving the page with remap_anon_pages shall be
much better than using PROT_NONE for this (or alternative syscall name
to differentiate it further from remap_file_pages, or equivalent
userfaultfd command if we decide to hide the pte/pmd mangling as
userfaultfd commands instead of adding new standalone syscalls). It
would have the only constraint that you must mark the region
MADV_DONTFORK if you intend linux-user to ever fork, or it won't work
reliably (that constraint exists to eliminate the need for additional
rmap complexity, precisely so that it doesn't turn into something more
intrusive like remap_file_pages). I assume that would be a fine
constraint for linux-user.
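
That constraint amounts to a single madvise() call on the region
before any fork(), e.g.:

#include <stdio.h>
#include <sys/mman.h>

static int exclude_from_fork(void *area, size_t len)
{
    /* child processes won't inherit this mapping at all, so no
     * fork-time wrprotection can interfere with the region */
    if (madvise(area, len, MADV_DONTFORK)) {
        perror("madvise(MADV_DONTFORK)");
        return -1;
    }
    return 0;
}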

Thanks,
Andrea


Re: [PATCH 00/17] RFC: userfault v2

2014-11-20 Thread zhanghailiang

On 2014/11/21 1:38, Andrea Arcangeli wrote:

Hi,

On Thu, Nov 20, 2014 at 10:54:29AM +0800, zhanghailiang wrote:

Yes, you are right. This is what I really want: bypass all non-present faults
and only track strict wrprotect faults. ;)

So, do you plan to support that in the userfault API?


Yes, I think it's a good idea to support wrprotect/COW faults too.



Great! Then I can expect your patches. ;)


I just wanted to understand if there was any other reason why you
needed only wrprotect faults, because the non-present faults didn't
look like a big performance concern if they triggered in addition to
wrprotect faults, but it's certainly ok to optimize them away so it's
fully optimal.



Er, you've got the answer: nothing special, it's only for optimality.


All it takes to differentiate the behavior should be one more bit
during registration so you can select non-present faults, wrprotect
faults, or both. Postcopy live migration would select only non-present
faults, postcopy live snapshot would select only wrprotect faults, and
anything like distributed shared memory supporting shared readonly
access and exclusive write access would select both flags.



It is really flexible this way.


I just sent an (unfortunately) longish but way more detailed email
about live snapshotting with userfaultfd but I just wanted to give a
shorter answer here too :).



Thanks for your explanation, and your patience. It is really useful;
now I know more details about why the 'fork & dump live snapshot'
scenario is not acceptable. Thanks :)


Thanks,
Andrea



Re: [PATCH 00/17] RFC: userfault v2

2014-11-20 Thread Andrea Arcangeli
Hi,

On Thu, Nov 20, 2014 at 10:54:29AM +0800, zhanghailiang wrote:
> Yes, you are right. This is what I really want: bypass all non-present faults
> and only track strict wrprotect faults. ;)
> 
> So, do you plan to support that in the userfault API?

Yes, I think it's a good idea to support wrprotect/COW faults too.

I just wanted to understand if there was any other reason why you
needed only wrprotect faults, because the non-present faults didn't
look like a big performance concern if they triggered in addition to
wrprotect faults, but it's certainly ok to optimize them away so it's
fully optimal.

All it takes to differentiate the behavior should be one more bit
during registration so you can select non-present faults, wrprotect
faults, or both. Postcopy live migration would select only non-present
faults, postcopy live snapshot would select only wrprotect faults, and
anything like distributed shared memory supporting shared readonly
access and exclusive write access would select both flags.
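
In the API as it later stabilized, that extra bit became the mode
field of UFFDIO_REGISTER; a sketch of the three cases (note that
UFFDIO_REGISTER_MODE_WP landed upstream much later than MODE_MISSING):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* mode selects which faults get reported on the range */
static int register_range(int uffd, unsigned long start, unsigned long len,
                          unsigned long long mode)
{
    struct uffdio_register reg = {
        .range = { .start = start, .len = len },
        .mode  = mode,
    };
    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* postcopy live migration:   UFFDIO_REGISTER_MODE_MISSING
 * postcopy live snapshot:    UFFDIO_REGISTER_MODE_WP
 * distributed shared memory: MODE_MISSING | MODE_WP */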

I just sent an (unfortunately) longish but way more detailed email
about live snapshotting with userfaultfd but I just wanted to give a
shorter answer here too :).

Thanks,
Andrea


Re: [PATCH 00/17] RFC: userfault v2

2014-11-20 Thread Andrea Arcangeli
Hi,

On Fri, Oct 31, 2014 at 12:39:32PM -0700, Peter Feiner wrote:
> On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
> > Agreed, but for doing live memory snapshot (the VM is running when
> > doing the snapshot), we have to do this (block the write action),
> > because we have to save the page before it is dirtied by the writing
> > action. This is the difference, compared to pre-copy migration.
> 
> Ah ha, I understand the difference now. I suppose that you have considered
> doing a traditional pre-copy migration (that is, passes over memory saving
> dirty pages, followed by a pause and a final dump of remaining dirty pages) to
> a file. Your approach has the advantage of having the VM pause time bounded by
> the time it takes to handle the userfault and do the write, as opposed to
> pre-copy migration which has a pause time bounded by the time it takes to do
> the final dump of dirty pages, which, in the worst case, is the time it takes
> to dump all of the guest memory!

It sounds like a really similar issue to live migration: one can
implement a precopy live snapshot, a precopy+postcopy live snapshot,
or a pure postcopy live snapshot.

The decision on the amount of precopy done before engaging postcopy
(zero passes, 1 pass, or more passes) would have similar tradeoffs
too, except that instead of having to re-transmit the re-dirtied pages
over the wire, it would need to overwrite them on disk.

The more precopy passes, the longer it takes for the live snapshotting
process to finish and the more I/O there will be (for live migration
it'd be network bandwidth usage instead of amount of I/O), but the
shorter the postcopy runtime will be (and the shorter the postcopy
runtime is, the fewer userfaults will end up triggering on writes, in
turn reducing the slowdown and the artificial fault latency introduced
into the guest runtime). But the more precopy passes, the more
overwriting will happen during the "longer" precopy stage and the more
overall load there will be on the host (the otherwise idle part of the
host).

For the postcopy live snapshot the wrprotect faults are quite
equivalent to the not-present faults of postcopy live migration logic.

> You could use the old fork & dump trick. Given that the guest's memory is
> backed by private VMA (as of a year ago when I last looked, is always the case
> for QEMU), you can have the kernel do the write protection for you.
> Essentially, you fork Qemu and, in the child process, dump the guest memory
> then exit. If the parent (including the guest) writes to guest memory, then it
> will fault and the kernel will copy the page. 
> 
> The fork & dump approach will give you the best performance w.r.t. guest pause
> times (i.e., just pausing for the COW fault handler), but it does have the
> distinct disadvantage of potentially using 2x the guest memory (i.e., if the
> parent process races ahead and writes to all of the pages before you finish 
> the
> dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
> memory as you copy it.

This is a very good point. fork must be evaluated first because it
literally already provides you with a readonly memory snapshot of the
guest memory.

The memory cons mentioned above could lead to -ENOMEM if too many
guests run live snapshots at the same time on the same host, unless
overcommit_memory is set to 1 (0 by default). Even then, if too many
live snapshots are running in parallel you could hit the OOM killer if
there are just a bit too many faults at the same time, or you could
hit heavy swapping, which isn't ideal either.

In fact the -ENOMEM avoidance (with qemu failing) is one of the two
critical reasons why qemu always sets the guest memory as
MADV_DONTFORK. But that's not the only reason.

To use the fork() trick you'd need to undo the MADV_DONTFORK first,
but that would open another problem: there's a race condition between
fork(), O_DIRECT and the <4k hardblocksize of virtio-blk. If there's
any read() syscall with O_DIRECT with len=512 while fork() is running
(think of the aio running in parallel with the live snapshot thread
that forks the child to dump the snapshot), and the guest writes with
the CPU to any 512-byte fragment of the same page that is the
destination buffer of the read(len=512) (on two different 512-byte
areas of the same guest page), the O_DIRECT read data will get lost.

So to use fork we'd need to fix this longstanding race (I tried, but
in the end we declared it a userland issue because it's not
exploitable to bypass permissions or corrupt kernel or unrelated
memory). Or you'd need to add locking between the dataplane/aio
threads and the live snapshot thread to ensure no direct-io I/O is
ever in-flight while fork runs.

Avoiding the O_DIRECT however would only help if it's qemu TCG; if
it's KVM it's not even enough to stop the O_DIRECT reads. KVM would
use gup(write=1) from the async-pf all the time... and then the shadow
pagetables would go out of sync (it won't destabilize the host of
course, but the guest memor

Re: [PATCH 00/17] RFC: userfault v2

2014-11-19 Thread zhanghailiang

On 2014/11/20 2:49, Andrea Arcangeli wrote:

Hi Zhang,

On Fri, Oct 31, 2014 at 09:26:09AM +0800, zhanghailiang wrote:

On 2014/10/30 20:49, Dr. David Alan Gilbert wrote:

* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea,

Thanks for your hard work on userfault;)

This is really a useful API.

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

I think this will help support vhost-scsi and ivshmem for migration;
we can trace dirty pages in userspace.

Actually, I'm trying to realize live memory snapshot based on pre-copy
and userfault, but reading memory from the migration thread will also
trigger userfault. It would be easy to implement live memory snapshot
if we supported configuring userfault for writing memory only.


Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.

After some chat during the KVM Forum I've already been thinking it
could be beneficial for some usage to give userland the information
about the fault being read or write, combined with the ability of
mapping pages wrprotected to mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but that's already
set in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not-present
faults, but it's not a massive change. Returning the read/write
information is also not a massive change. This will then pay off
mostly if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean, are you still OK if non-present read
faults trap too (you'd be notified it's a read) and you get
notification for both wrprotect and non-present faults?


Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe I didn't describe it clearly. What I really need for live
memory snapshot is only the wrprotect fault, like kvm's dirty tracing
mechanism: *only tracing the write action*.

My initial solution scheme for live memory snapshot is:
(1) pause the VM
(2) use userfaultfd to mark all memory of the VM wrprotect (readonly)
(3) save the device state to the snapshot file
(4) resume the VM
(5) the snapshot thread begins to save pages of memory to the snapshot file
(6) the VM keeps running, and it is OK for the VM or another thread to
  read ram (no fault trap), but if the VM tries to write a page
  (dirtying the page), there will be a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd;
  it copies the content of the page to some buffer, and then removes
  the page's wrprotect limit (still using the userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which is now
  writable.
(9) the snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all the VM's memory is saved to the snapshot file.
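
A hedged sketch of the fault-handle-thread in steps (6)-(9), written
against the uffd write-protect API that was eventually merged (Linux
5.7+); save_page() is a placeholder for "copy content of the page to
some buffer":

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

extern void save_page(void *addr, size_t len);  /* placeholder */

static void handle_wp_faults(int uffd, long page_size)
{
    struct uffd_msg msg;

    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            continue;

        unsigned long addr = msg.arg.pagefault.address & ~(page_size - 1);

        save_page((void *)addr, page_size);    /* step (7): cache the page */

        struct uffdio_writeprotect wp = {
            .range = { .start = addr, .len = page_size },
            .mode  = 0,  /* clear the write protection and wake the vcpu */
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp); /* step (8): VM writes again */
    }
}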


Hmm, I can see the same process being useful for fault-tolerance
schemes like COLO; it needs a memory state snapshot.


So, what I need from userfault is support for only the wrprotect
fault. I don't want to get notifications for non-present read faults;
that would influence the VM's performance and the efficiency of doing
the snapshot.


What pages would be non-present at this point - just balloon?



Er, sorry, it should be 'non-present page faults' ;)


Could you elaborate? The balloon pages or not-yet-allocated pages in
the guest, if they fault too (in addition to the wrprotect faults), it
doesn't sound like a big deal, as it's not so common (balloon
especially shouldn't happen except during balloon deflation during the
live snapshotting). We could bypass non-present faults though, and
only track strict wrprotect faults.



Yes, you are right. This is what I really want: bypass all non-present faults
and only track strict wrprotect faults. ;)

So, do you plan to support that in the userfault API?

Thanks,
zhanghailiang




Re: [PATCH 00/17] RFC: userfault v2

2014-11-19 Thread Andrea Arcangeli
Hi Zhang,

On Fri, Oct 31, 2014 at 09:26:09AM +0800, zhanghailiang wrote:
> On 2014/10/30 20:49, Dr. David Alan Gilbert wrote:
> > * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> >> On 2014/10/30 1:46, Andrea Arcangeli wrote:
> >>> Hi Zhanghailiang,
> >>>
> >>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>  Hi Andrea,
> 
>  Thanks for your hard work on userfault;)
> 
>  This is really a useful API.
> 
>  I want to confirm a question:
>  Can we support distinguishing between writing and reading memory for 
>  userfault?
>  That is, we can decide whether writing a page, reading a page or both 
>  trigger userfault.
> 
>  I think this will help support vhost-scsi and ivshmem for migration;
>  we can trace dirty pages in userspace.
> 
>  Actually, I'm trying to realize live memory snapshot based on pre-copy
>  and userfault, but reading memory from the migration thread will also
>  trigger userfault. It would be easy to implement live memory snapshot
>  if we supported configuring userfault for writing memory only.
> >>>
> >>> Mail is going to be long enough already so I'll just assume tracking
> >>> dirty memory in userland (instead of doing it in kernel) is a worthy
> >>> feature to have here.
> >>>
> >>> After some chat during the KVM Forum I've already been thinking it
> >>> could be beneficial for some usage to give userland the information
> >>> about the fault being read or write, combined with the ability of
> >>> mapping pages wrprotected to mcopy_atomic (that would work without
> >>> false positives only with MADV_DONTFORK also set, but that's already
> >>> set in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
> >>> checked also in the wrprotect faults, not just in the not-present
> >>> faults, but it's not a massive change. Returning the read/write
> >>> information is also not a massive change. This will then pay off
> >>> mostly if there's also a way to remove the memory atomically (kind of
> >>> remap_anon_pages).
> >>>
> >>> Would that be enough? I mean, are you still OK if non-present read
> >>> faults trap too (you'd be notified it's a read) and you get
> >>> notification for both wrprotect and non-present faults?
> >>>
> >> Hi Andrea,
> >>
> >> Thanks for your reply, and your patience;)
> >>
> >> Er, maybe I didn't describe it clearly. What I really need for live
> >> memory snapshot is only the wrprotect fault, like kvm's dirty tracing
> >> mechanism: *only tracing the write action*.
> >>
> >> My initial solution scheme for live memory snapshot is:
> >> (1) pause the VM
> >> (2) use userfaultfd to mark all memory of the VM wrprotect (readonly)
> >> (3) save the device state to the snapshot file
> >> (4) resume the VM
> >> (5) the snapshot thread begins to save pages of memory to the snapshot file
> >> (6) the VM keeps running, and it is OK for the VM or another thread to
> >>  read ram (no fault trap), but if the VM tries to write a page
> >>  (dirtying the page), there will be a userfault trap notification.
> >> (7) a fault-handle-thread reads the page request from userfaultfd;
> >>  it copies the content of the page to some buffer, and then removes
> >>  the page's wrprotect limit (still using the userfaultfd to tell the kernel).
> >> (8) after step (7), the VM can continue to write the page, which is now
> >>  writable.
> >> (9) the snapshot thread saves the page cached in step (7)
> >> (10) repeat steps (5)~(9) until all the VM's memory is saved to the snapshot file.
> >
> > Hmm, I can see the same process being useful for fault-tolerance
> > schemes like COLO; it needs a memory state snapshot.
> >
> >> So, what I need from userfault is support for only the wrprotect
> >> fault. I don't want to get notifications for non-present read faults;
> >> that would influence the VM's performance and the efficiency of doing
> >> the snapshot.
> >
> > What pages would be non-present at this point - just balloon?
> >
> 
> Er, sorry, it should be 'non-present page faults' ;)

Could you elaborate? The balloon pages or not-yet-allocated pages in
the guest, if they fault too (in addition to the wrprotect faults), it
doesn't sound like a big deal, as it's not so common (balloon
especially shouldn't happen except during balloon deflation during the
live snapshotting). We could bypass non-present faults though, and
only track strict wrprotect faults.


Re: [PATCH 00/17] RFC: userfault v2

2014-11-11 Thread zhanghailiang

Hi Andrea,

Is there any news about this discussion? ;)

Do you plan to support 'only wrprotect fault' in the userfault API?

Thanks,
zhanghailiang

On 2014/10/30 19:31, zhanghailiang wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea,

Thanks for your hard work on userfault;)

This is really a useful API.

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

I think this will help support vhost-scsi and ivshmem for migration;
we can trace dirty pages in userspace.

Actually, I'm trying to realize live memory snapshot based on pre-copy
and userfault, but reading memory from the migration thread will also
trigger userfault. It would be easy to implement live memory snapshot
if we supported configuring userfault for writing memory only.


Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.

After some chat during the KVM Forum I've already been thinking it
could be beneficial for some usage to give userland the information
about the fault being read or write, combined with the ability of
mapping pages wrprotected to mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but that's already
set in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not-present
faults, but it's not a massive change. Returning the read/write
information is also not a massive change. This will then pay off
mostly if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean, are you still OK if non-present read
faults trap too (you'd be notified it's a read) and you get
notification for both wrprotect and non-present faults?


Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe I didn't describe it clearly. What I really need for live
memory snapshot is only the wrprotect fault, like kvm's dirty tracing
mechanism: *only tracing the write action*.

My initial solution scheme for live memory snapshot is:
(1) pause the VM
(2) use userfaultfd to mark all memory of the VM wrprotect (readonly)
(3) save the device state to the snapshot file
(4) resume the VM
(5) the snapshot thread begins to save pages of memory to the snapshot file
(6) the VM keeps running, and it is OK for the VM or another thread to
 read ram (no fault trap), but if the VM tries to write a page
 (dirtying the page), there will be a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd;
 it copies the content of the page to some buffer, and then removes
 the page's wrprotect limit (still using the userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which is now
 writable.
(9) the snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all the VM's memory is saved to the snapshot file.

So, what I need from userfault is support for only the wrprotect
fault. I don't want to get notifications for non-present read faults;
that would influence the VM's performance and the efficiency of doing
the snapshot.

Also, I think this feature will benefit the migration of ivshmem and
vhost-scsi, which have no dirty-page tracing now.


The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
   fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
   into that vma too

 if yes engage userfaultfd protocol

 otherwise raise SIGBUS (single-threaded apps should be fine with
 SIGBUS, and it'll avoid them having to spawn a thread in order to
 talk the userfaultfd protocol)

- if userfaultfd protocol is engaged, return read|write fault + fault
   address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
   userfaultfd protocol so we keep the two problems separated and we
   don't mix them in the same API which makes it even harder to
   finalize it.

 add mcopy_atomic (with a flag to map the page readonly too)

 The alternative would be to hide mcopy_atomic (and even
 remap_anon_pages, in order to "remove" the memory atomically for
 the externalization into the cloud) as userfaultfd commands to
 write into the fd. But then there would be not much point in keeping
 MADV_USERFAULT around if I did so, and I could just remove it
 too; besides, it doesn't look clean having to open the userfaultfd
 just to issue a hidden mcopy_atomic.

 So it becomes a decision if the basic SIGBUS mode for single
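
For reference, the mcopy_atomic primitive in the plan above eventually
landed upstream as the UFFDIO_COPY ioctl (and the readonly-mapping
flag as UFFDIO_COPY_MODE_WP); a hedged sketch of filling one missing
page:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int fill_page(int uffd, void *dst, void *src, unsigned long page_size)
{
    struct uffdio_copy copy = {
        .dst  = (unsigned long)dst,   /* faulting page, page-aligned */
        .src  = (unsigned long)src,   /* buffer holding the contents */
        .len  = page_size,
        .mode = 0,         /* or UFFDIO_COPY_MODE_WP to map it readonly */
    };
    /* atomically allocates, copies and maps the page, then wakes any
     * thread blocked on the fault */
    return ioctl(uffd, UFFDIO_COPY, &copy);
}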

Re: [PATCH 00/17] RFC: userfault v2

2014-11-01 Thread zhanghailiang

On 2014/11/1 3:39, Peter Feiner wrote:

On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:

Agreed, but for doing live memory snapshot (the VM is running when
doing the snapshot), we have to do this (block the write action),
because we have to save the page before it is dirtied by the writing
action. This is the difference, compared to pre-copy migration.


Ah ha, I understand the difference now. I suppose that you have considered
doing a traditional pre-copy migration (that is, passes over memory saving
dirty pages, followed by a pause and a final dump of remaining dirty pages) to
a file. Your approach has the advantage of having the VM pause time bounded by
the time it takes to handle the userfault and do the write, as opposed to
pre-copy migration which has a pause time bounded by the time it takes to do
the final dump of dirty pages, which, in the worst case, is the time it takes
to dump all of the guest memory!



Right! Strictly speaking, migrating a VM's state into a file (fd) is
not a snapshot, because its time is not fixed (it depends on when the
migration finishes). The time of a VM's snapshot should be fixed: it
should be the moment I fire the snapshot command.
A snapshot is very much like taking a photo, capturing the VM's state
at that moment ;)


You could use the old fork & dump trick. Given that the guest's memory is
backed by private VMA (as of a year ago when I last looked, is always the case
for QEMU), you can have the kernel do the write protection for you.
Essentially, you fork Qemu and, in the child process, dump the guest memory
then exit. If the parent (including the guest) writes to guest memory, then it
will fault and the kernel will copy the page.



It is difficult to do fork in the qemu process, which has multiple
threads and holds all kinds of locks. Actually, this scheme was
discussed in the community a long time ago. It was not accepted.


The fork & dump approach will give you the best performance w.r.t. guest pause
times (i.e., just pausing for the COW fault handler), but it does have the
distinct disadvantage of potentially using 2x the guest memory (i.e., if the


Agreed! This is the second reason why the community did not accept it.


parent process races ahead and writes to all of the pages before you finish the
dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
memory as you copy it.



IMHO, the scheme I mentioned in the previous email may be the simplest
and most efficient way, if userfault could support only the wrprotect
fault. We could also do some optimization to reduce the influence on
the VM when doing a snapshot, such as caching the requested pages in a
memory buffer, etc.


Great! Do you plan to issue your patches to the community? I mean, is
your work based on qemu, or an independent tool (CRIU migration?) for
live migration? Maybe I could fix the migration problem for ivshmem in
qemu now, based on the softdirty mechanism.


I absolutely plan on releasing these patches :-) CRIU was the first open-source
userland I had planned on integrating with. At Google, I'm working with our
home-grown Qemu replacement. However, I'd be happy to help with an effort to
get softdirty integrated in Qemu in the future.



Great;)


Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To


I have read them cursorily; they are useful for pre-copy indeed. But
it seems they cannot meet my need for snapshot.



make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.


Where can I find the API? Has it been merged into the kernel's master branch already?


Negative. I'll be sure to CC you when I start sending this stuff upstream.




OK, I look forward to it:)




Re: [PATCH 00/17] RFC: userfault v2

2014-10-31 Thread Peter Feiner
On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
> Agreed, but for doing live memory snapshot (the VM is running when
> doing the snapshot), we have to do this (block the write action),
> because we have to save the page before it is dirtied by the writing
> action. This is the difference, compared to pre-copy migration.

Ah ha, I understand the difference now. I suppose that you have considered
doing a traditional pre-copy migration (that is, passes over memory saving
dirty pages, followed by a pause and a final dump of remaining dirty pages) to
a file. Your approach has the advantage of having the VM pause time bounded by
the time it takes to handle the userfault and do the write, as opposed to
pre-copy migration which has a pause time bounded by the time it takes to do
the final dump of dirty pages, which, in the worst case, is the time it takes
to dump all of the guest memory!

You could use the old fork & dump trick. Given that the guest's memory is
backed by private VMA (as of a year ago when I last looked, is always the case
for QEMU), you can have the kernel do the write protection for you.
Essentially, you fork Qemu and, in the child process, dump the guest memory
then exit. If the parent (including the guest) writes to guest memory, then it
will fault and the kernel will copy the page. 

The fork & dump approach will give you the best performance w.r.t. guest pause
times (i.e., just pausing for the COW fault handler), but it does have the
distinct disadvantage of potentially using 2x the guest memory (i.e., if the
parent process races ahead and writes to all of the pages before you finish the
dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
memory as you copy it.
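
A hedged sketch of the trick, with illustrative names (guest_ram,
guest_len, dump_fd) and error/short-write handling omitted:

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* The child inherits a COW view of the private guest mapping, dumps
 * it, and drops each chunk after writing it so the worst-case extra
 * memory stays bounded by one chunk instead of the whole guest. */
static pid_t snapshot_fork_dump(char *guest_ram, size_t guest_len, int dump_fd)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;               /* parent: guest keeps running on COW */

    const size_t chunk = 1 << 20;
    for (size_t off = 0; off < guest_len; off += chunk) {
        size_t n = guest_len - off < chunk ? guest_len - off : chunk;
        write(dump_fd, guest_ram + off, n);
        /* release our reference so a racing parent write doesn't keep
         * two copies of this range alive */
        madvise(guest_ram + off, n, MADV_DONTNEED);
    }
    _exit(0);
}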

> Great! Do you plan to issue your patches to the community? I mean, is
> your work based on qemu, or an independent tool (CRIU migration?) for
> live migration? Maybe I could fix the migration problem for ivshmem in
> qemu now, based on the softdirty mechanism.

I absolutely plan on releasing these patches :-) CRIU was the first open-source
userland I had planned on integrating with. At Google, I'm working with our
home-grown Qemu replacement. However, I'd be happy to help with an effort to
get softdirty integrated in Qemu in the future.

> >Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. 
> >To
> 
> I have read them cursorily; they are useful for pre-copy indeed. But
> it seems they cannot meet my need for snapshot.

> >make softdirty usable for live migration, I've added an API to atomically
> >test-and-clear the bit and write protect the page.
> 
> Where can I find the API? Has it been merged into the kernel's master branch already?

Negative. I'll be sure to CC you when I start sending this stuff upstream.

Peter


Re: [PATCH 00/17] RFC: userfault v2

2014-10-31 Thread zhanghailiang

On 2014/10/31 13:17, Andres Lagar-Cavilla wrote:

On Thu, Oct 30, 2014 at 9:38 PM, zhanghailiang
 wrote:

On 2014/10/31 11:29, zhanghailiang wrote:


On 2014/10/31 10:23, Peter Feiner wrote:


On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:


On 2014/10/30 1:46, Andrea Arcangeli wrote:


On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:


I want to confirm a question:
Can we support distinguishing between writing and reading memory for
userfault?
That is, we can decide whether writing a page, reading a page or both
trigger userfault.


Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.



I'll open that can of worms :-)


[...]
Er, maybe I didn't describe it clearly. What I really need for live
memory snapshot is only the wrprotect fault, like kvm's dirty tracing
mechanism: *only tracing the write action*.

So, what I need from userfault is support for only the wrprotect
fault. I don't want to get notifications for non-present read faults;
that would influence the VM's performance and the efficiency of doing
the snapshot.



Given that you do care about performance Zhanghailiang, I don't think
that a
userfault handler is a good place to track dirty memory. Every dirtying
write
will block on the userfault handler, which is an expensively slow
proposition
compared to an in-kernel approach.



Agreed, but for doing live memory snapshot (the VM is running when
doing the snapshot), we have to do this (block the write action),
because we have to save the page before it is dirtied by the writing
action. This is the difference, compared to pre-copy migration.



Again ;) For snapshot, I don't use its dirty tracing ability; I just
use it to block the write action and save the page, and then I will
remove its write protection.


You could do a CoW in the kernel, post a notification, keep going, and
expose an interface for user-space to mmap the preserved copy. Getting
the life-cycle of the preserved page(s) right is tricky, but doable.
Anyway, it's easy to hand-wave without knowing your specific
requirements.



Yes, what I need is very much like a user-space COW feature, but I
don't want to modify any kvm code to realize COW; userfault is a more
generic and more graceful way. Besides, I'm not an expert in the
kernel :(


Opening the discussion a bit, this does look similar to the xen-access
interface, in which a xen domain vcpu could be stopped in its tracks


Right;)


while user-space was notified (and acknowledged) a variety of
scenarios: page was written to, page was read from, vcpu is attempting
to execute from page, etc. Very applicable to anti-viruses right away,
for example you can enforce W^X properties on pages.

I don't know that Andrea wants to open the game so broadly for
userfault, and the code right now is very specific to triggering on
pte_none(), but that's a nice reward down this road.



I hope he will consider it. IMHO, it is a good extension for userfault
(write fault);)

Best Regards,
zhanghailiang




Also, I think this feature will benefit the migration of ivshmem and
vhost-scsi, which have no dirty-page tracing now.



I do agree wholeheartedly with you here. Manually tracking non-guest
writes
adds to the complexity of device emulation code. A central fault-driven
means
for dirty tracking writes from the guest and host would be a welcome
simplification to implementing pre-copy migration. Indeed, that's exactly
what
I'm working on! I'm using the softdirty bit, which was introduced
recently for
CRIU migration, to replace the use of KVM's dirty logging and manual
dirty
tracking by the VMM during pre-copy migration. See



Great! Do you plan to issue your patches to the community? I mean, is
your work based on qemu, or an independent tool (CRIU migration?) for
live migration? Maybe I could fix the migration problem for ivshmem in
qemu now, based on the softdirty mechanism.


Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't
familiar. To



I have read them cursorily; they are useful for pre-copy indeed. But
it seems they cannot meet my need for snapshot.


make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.



Where can I find the API? Has it been merged into the kernel's master
branch already?


Thanks,
zhanghailiang



Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread Andres Lagar-Cavilla
On Thu, Oct 30, 2014 at 9:38 PM, zhanghailiang
 wrote:
> On 2014/10/31 11:29, zhanghailiang wrote:
>>
>> On 2014/10/31 10:23, Peter Feiner wrote:
>>>
>>> On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:

 On 2014/10/30 1:46, Andrea Arcangeli wrote:
>
> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>>
>> I want to confirm a question:
>> Can we support distinguishing between writing and reading memory for
>> userfault?
>> That is, we can decide whether writing a page, reading a page or both
>> trigger userfault.
>
> Mail is going to be long enough already so I'll just assume tracking
> dirty memory in userland (instead of doing it in kernel) is a worthy
> feature to have here.
>>>
>>>
>>> I'll open that can of worms :-)
>>>
 [...]
 Er, maybe I didn't describe it clearly. What I really need for live
 memory snapshot is only the wrprotect fault, like kvm's dirty tracing
 mechanism: *only tracing the write action*.

 So, what I need from userfault is support for only the wrprotect
 fault. I don't want to get notifications for non-present read faults;
 that would influence the VM's performance and the efficiency of doing
 the snapshot.
>>>
>>>
>>> Given that you do care about performance Zhanghailiang, I don't think
>>> that a
>>> userfault handler is a good place to track dirty memory. Every dirtying
>>> write
>>> will block on the userfault handler, which is an expensively slow
>>> proposition
>>> compared to an in-kernel approach.
>>>
>>
>> Agreed, but for doing live memory snapshot (the VM is running when
>> doing the snapshot), we have to do this (block the write action),
>> because we have to save the page before it is dirtied by the writing
>> action. This is the difference, compared to pre-copy migration.
>>
>
> Again ;) For snapshot, I don't use its dirty tracing ability; I just
> use it to block the write action and save the page, and then I will
> remove its write protection.

You could do a CoW in the kernel, post a notification, keep going, and
expose an interface for user-space to mmap the preserved copy. Getting
the life-cycle of the preserved page(s) right is tricky, but doable.
Anyway, it's easy to hand-wave without knowing your specific
requirements.

Opening the discussion a bit, this does look similar to the xen-access
interface, in which a xen domain vcpu could be stopped in its tracks
while user-space was notified (and acknowledged) a variety of
scenarios: page was written to, page was read from, vcpu is attempting
to execute from page, etc. Very applicable to anti-viruses right away,
for example you can enforce W^X properties on pages.

I don't know that Andrea wants to open the game so broadly for
userfault, and the code right now is very specific to triggering on
pte_none(), but that's a nice reward down this road.

Andres

>
 Also, I think this feature will benefit the migration of ivshmem and
 vhost-scsi, which have no dirty-page tracing now.
>>>
>>>
>>> I do agree wholeheartedly with you here. Manually tracking non-guest
>>> writes
>>> adds to the complexity of device emulation code. A central fault-driven
>>> means
>>> for dirty tracking writes from the guest and host would be a welcome
>>> simplification to implementing pre-copy migration. Indeed, that's exactly
>>> what
>>> I'm working on! I'm using the softdirty bit, which was introduced
>>> recently for
>>> CRIU migration, to replace the use of KVM's dirty logging and manual
>>> dirty
>>> tracking by the VMM during pre-copy migration. See
>>
>>
>> Great! Do you plan to issue your patches to the community? I mean, is
>> your work based on qemu, or an independent tool (CRIU migration?) for
>> live migration? Maybe I could fix the migration problem for ivshmem in
>> qemu now, based on the softdirty mechanism.
>>
>>> Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't
>>> familiar. To
>>
>>
>> I have read them cursorily; they are useful for pre-copy indeed. But
>> it seems they cannot meet my need for snapshot.
>>
>>> make softdirty usable for live migration, I've added an API to atomically
>>> test-and-clear the bit and write protect the page.
>>
>>
>> Where can I find the API? Has it been merged into the kernel's master
>> branch already?
>>
>>
>> Thanks,
>> zhanghailiang
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> .
>>
>



-- 
Andres Lagar-Cavilla | Google Kernel Team | andre...@google.com


Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread zhanghailiang

On 2014/10/31 11:29, zhanghailiang wrote:

On 2014/10/31 10:23, Peter Feiner wrote:

On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.


I'll open that can of worms :-)


[...]
Er, maybe I didn't describe it clearly. What I really need for live
memory snapshot is only the wrprotect fault, like kvm's dirty tracing
mechanism: *only tracing the write action*.

So, what I need from userfault is support for only the wrprotect
fault. I don't want to get notifications for non-present read faults;
that would influence the VM's performance and the efficiency of doing
the snapshot.


Given that you do care about performance Zhanghailiang, I don't think that a
userfault handler is a good place to track dirty memory. Every dirtying write
will block on the userfault handler, which is an expensively slow proposition
compared to an in-kernel approach.



Agreed, but for doing live memory snapshot (the VM is running when
doing the snapshot), we have to do this (block the write action),
because we have to save the page before it is dirtied by the writing
action. This is the difference, compared to pre-copy migration.



Again ;) For snapshot, I don't use its dirty tracing ability; I just
use it to block the write action and save the page, and then I will
remove its write protection.


Also, I think this feature will benefit the migration of ivshmem and
vhost-scsi, which have no dirty-page tracing now.


I do agree wholeheartedly with you here. Manually tracking non-guest writes
adds to the complexity of device emulation code. A central fault-driven means
for dirty tracking writes from the guest and host would be a welcome
simplification to implementing pre-copy migration. Indeed, that's exactly what
I'm working on! I'm using the softdirty bit, which was introduced recently for
CRIU migration, to replace the use of KVM's dirty logging and manual dirty
tracking by the VMM during pre-copy migration. See


Great! Do you plan to issue your patches to the community? I mean, is
your work based on qemu, or an independent tool (CRIU migration?) for
live migration? Maybe I could fix the migration problem for ivshmem in
qemu now, based on the softdirty mechanism.


Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To


I have read them cursorily; they are useful for pre-copy indeed. But
it seems they cannot meet my need for snapshot.


make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.


Where can I find the API? Has it been merged into the kernel's master branch already?


Thanks,
zhanghailiang



Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread zhanghailiang

On 2014/10/31 10:23, Peter Feiner wrote:

On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

I want to confirm a question:
Can we support distinguishing between writing and reading memory for userfault?
That is, we can decide whether writing a page, reading a page or both trigger 
userfault.

Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.


I'll open that can of worms :-)


[...]
Er, maybe I didn't describe it clearly. What I really need for live
memory snapshot is only the wrprotect fault, like kvm's dirty tracing
mechanism: *only tracing the write action*.

So, what I need from userfault is support for only the wrprotect
fault. I don't want to get notifications for non-present read faults;
that would influence the VM's performance and the efficiency of doing
the snapshot.


Given that you do care about performance Zhanghailiang, I don't think that a
userfault handler is a good place to track dirty memory. Every dirtying write
will block on the userfault handler, which is an expensively slow proposition
compared to an in-kernel approach.



Agreed, but for doing live memory snapshot (the VM is running when
doing the snapshot), we have to do this (block the write action),
because we have to save the page before it is dirtied by the writing
action. This is the difference, compared to pre-copy migration.


Also, I think this feature will benefit the migration of ivshmem and
vhost-scsi, which have no dirty-page tracing now.


I do agree wholeheartedly with you here. Manually tracking non-guest writes
adds to the complexity of device emulation code. A central fault-driven means
for dirty tracking writes from the guest and host would be a welcome
simplification to implementing pre-copy migration. Indeed, that's exactly what
I'm working on! I'm using the softdirty bit, which was introduced recently for
CRIU migration, to replace the use of KVM's dirty logging and manual dirty
tracking by the VMM during pre-copy migration. See


Great! Do you plan to issue your patches to the community? I mean, is
your work based on qemu, or an independent tool (CRIU migration?) for
live migration? Maybe I could fix the migration problem for ivshmem in
qemu now, based on the softdirty mechanism.


Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To


I have read them cursorily; they are useful for pre-copy indeed. But
it seems they cannot meet my need for snapshot.


make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.


Where can I find the API? Has it been merged into the kernel's master branch already?


Thanks,
zhanghailiang



Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread Peter Feiner
On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:
> On 2014/10/30 1:46, Andrea Arcangeli wrote:
> >On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> >>I want to confirm a question:
> >>Can we support distinguishing between writing and reading memory for 
> >>userfault?
> >>That is, we can decide whether writing a page, reading a page or both 
> >>trigger userfault.
> >Mail is going to be long enough already so I'll just assume tracking
> >dirty memory in userland (instead of doing it in kernel) is a worthy
> >feature to have here.

I'll open that can of worms :-)

> [...]
> Er, maybe I didn't describe it clearly. What I really need for live
> memory snapshot is only the wrprotect fault, like kvm's dirty tracing
> mechanism: *only tracing the write action*.
> 
> So, what I need from userfault is support for only the wrprotect
> fault. I don't want to get notifications for non-present read faults;
> that would influence the VM's performance and the efficiency of doing
> the snapshot.

Given that you do care about performance Zhanghailiang, I don't think that a
userfault handler is a good place to track dirty memory. Every dirtying write
will block on the userfault handler, which is an expensively slow proposition
compared to an in-kernel approach.

> Also, I think this feature will benefit migration of ivshmem and vhost-scsi,
> which have no dirty page tracking now.

I do agree wholeheartedly with you here. Manually tracking non-guest writes
adds to the complexity of device emulation code. A central fault-driven means
for dirty tracking writes from the guest and host would be a welcome
simplification to implementing pre-copy migration. Indeed, that's exactly what
I'm working on! I'm using the softdirty bit, which was introduced recently for
CRIU migration, to replace the use of KVM's dirty logging and manual dirty
tracking by the VMM during pre-copy migration. See
Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To
make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.
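
For readers who haven't used it, a minimal sketch of the stock softdirty
interface described in those two documents looks like the following; this is
the pre-existing two-step usage, not the atomic test-and-clear extension I
mention above:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Reset the soft-dirty bits for every page of the process; writing
 * "4" to clear_refs is the documented trigger (soft-dirty.txt). */
static void softdirty_clear(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    write(fd, "4", 1);
    close(fd);
}

/* Return 1 if the page containing addr was written since the last
 * clear; bit 55 of a pagemap entry is the soft-dirty bit (pagemap.txt). */
static int softdirty_test(void *addr)
{
    uint64_t entry = 0;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    off_t off = (uintptr_t)addr / getpagesize() * sizeof(entry);

    pread(fd, &entry, sizeof(entry), off);
    close(fd);
    return (entry >> 55) & 1;
}

The gap between softdirty_test() and the next softdirty_clear() is exactly
where a write can escape the log, which is why live migration needs the
atomic test-and-clear.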


Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread zhanghailiang

On 2014/10/30 20:49, Dr. David Alan Gilbert wrote:

* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea,

Thanks for your hard work on userfault ;)

This is really a useful API.

I want to confirm a question:
Can we support distinguishing between write and read memory accesses for
userfault? That is, can we decide whether writing a page, reading a page, or
both triggers a userfault?

I think this will help support migration of vhost-scsi and ivshmem;
we could track dirty pages in userspace.

Actually, I'm trying to realize live memory snapshots based on pre-copy and
userfault, but reading memory from the migration thread would also trigger a
userfault. It would be easy to implement live memory snapshots if we supported
configuring userfault for memory writes only.


Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.

After some chat during the KVM Forum I've already been thinking it
could be beneficial for some usages to give userland the information
about the fault being a read or a write, combined with the ability of
mapping pages wrprotected with mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but that's already set
in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not-present
faults, but it's not a massive change. Returning the read/write
information is also not a massive change. This will then pay off mostly
if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean, are you still OK if a not-present read
fault traps too (you'd be notified it's a read) and you get
notifications for both wrprotect and not-present faults?


Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe I didn't describe it clearly. What I really need for live memory snapshot
is only the wrprotect fault, like KVM's dirty tracking mechanism: *only tracing
write actions*.

My initial scheme for live memory snapshot is:
(1) pause the VM
(2) use userfaultfd to mark all of the VM's memory write-protected (read-only)
(3) save device state to the snapshot file
(4) resume the VM
(5) the snapshot thread begins saving memory pages to the snapshot file
(6) the VM keeps running, and it is OK for the VM or other threads to read RAM
    (no fault trap), but if the VM tries to write a page (dirtying it), there
    will be a userfault trap notification
(7) a fault-handling thread reads the page request from the userfaultfd,
    copies the content of the page to a buffer, and then removes the page's
    write protection (still using the userfaultfd to tell the kernel);
    a sketch of this loop follows the list
(8) after step (7), the VM can continue to write the page, which is now writable
(9) the snapshot thread saves the page cached in step (7)
(10) repeat steps (5)-(9) until all of the VM's memory is saved to the snapshot file.
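
To make step (7) concrete, here is a minimal sketch of such a fault-handling
thread. Every UFFD* name below is an assumption modeled on the write-protect
userfaultfd API as it was shaped much later; no wrprotect support exists in
the v2 RFC being discussed, and snapshot_buffer_reserve() is a hypothetical
helper:

#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

extern void *snapshot_buffer_reserve(void *guest_page); /* hypothetical */

void handle_wp_faults(int ufd, uint64_t page_size)
{
    struct uffd_msg msg;

    for (;;) {
        if (read(ufd, &msg, sizeof(msg)) != sizeof(msg))
            continue;                   /* interrupted read; retry */
        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            continue;                   /* only write faults matter here */

        void *page = (void *)(uintptr_t)(msg.arg.pagefault.address &
                                         ~(page_size - 1));

        /* step (7): save the pre-dirtying copy for the snapshot thread */
        memcpy(snapshot_buffer_reserve(page), page, page_size);

        /* lift the write protection; this also wakes the faulting vcpu */
        struct uffdio_writeprotect wp = {
            .range = { .start = (uint64_t)(uintptr_t)page,
                       .len   = page_size },
            .mode  = 0,                 /* 0 = remove write protection */
        };
        ioctl(ufd, UFFDIO_WRITEPROTECT, &wp);
    }
}

The vcpu that faulted stays blocked only for the memcpy() plus one ioctl();
dropping the write protection is also what wakes it.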


Hmm, I can see the same process being useful for fault-tolerance schemes
like COLO; it needs a memory state snapshot.


So, what I need from userfault is support for wrprotect faults only. I don't
want notifications for not-present read faults; they would hurt the
VM's performance and the efficiency of taking the snapshot.


What pages would be non-present at this point - just balloon?



Er, sorry, it should be 'not-present page faults' ;)


Dave


Also, I think this feature will benefit migration of ivshmem and vhost-scsi,
which have no dirty page tracking now.


The question then is how you mark the memory read-only to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
   fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
   into that vma too

 if yes, engage the userfaultfd protocol

 otherwise raise SIGBUS (single-threaded apps should be fine with
 SIGBUS, and it saves them having to spawn a thread just to speak the
 userfaultfd protocol)

- if the userfaultfd protocol is engaged, return read|write fault + fault
   address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
   userfaultfd protocol so we keep the two problems separated and we
   don't mix them in the same API, which makes it even harder to
   finalize.

 add mcopy_atomic (with a flag to map the page readonly too)

 The alternative would be to hide mcopy_atomic (and even
 remap_anon_pages in order to "remove" the memory atomically for
 the externalization into the cloud) as userfaultfd commands to
 write into the fd. But then there would not be much point in keeping
 MADV_USERFAULT around if I did so and I could just remove it
 too or 

Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread Dr. David Alan Gilbert
* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> On 2014/10/30 1:46, Andrea Arcangeli wrote:
> >Hi Zhanghailiang,
> >
> >On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> >>Hi Andrea,
> >>
> >>Thanks for your hard work on userfault;)
> >>
> >>This is really a useful API.
> >>
> >>I want to confirm a question:
> >>Can we support distinguishing between write and read memory accesses for
> >>userfault?
> >>That is, can we decide whether writing a page, reading a page, or both
> >>triggers a userfault?
> >>
> >>I think this will help support migration of vhost-scsi and ivshmem;
> >>we could track dirty pages in userspace.
> >>
> >>Actually, I'm trying to realize live memory snapshots based on pre-copy and
> >>userfault,
> >>but reading memory from the migration thread would also trigger a userfault.
> >>It would be easy to implement live memory snapshots if we supported configuring
> >>userfault for memory writes only.
> >
> >Mail is going to be long enough already so I'll just assume tracking
> >dirty memory in userland (instead of doing it in kernel) is a worthy
> >feature to have here.
> >
> >After some chat during the KVM Forum I've already been thinking it
> >could be beneficial for some usages to give userland the information
> >about the fault being a read or a write, combined with the ability of
> >mapping pages wrprotected with mcopy_atomic (that would work without
> >false positives only with MADV_DONTFORK also set, but that's already set
> >in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
> >checked also in the wrprotect faults, not just in the not-present
> >faults, but it's not a massive change. Returning the read/write
> >information is also not a massive change. This will then pay off mostly
> >if there's also a way to remove the memory atomically (kind of
> >remap_anon_pages).
> >
> >Would that be enough? I mean, are you still OK if a not-present read
> >fault traps too (you'd be notified it's a read) and you get
> >notifications for both wrprotect and not-present faults?
> >
> Hi Andrea,
> 
> Thanks for your reply, and your patience;)
> 
> Er, maybe I didn't describe it clearly. What I really need for live memory
> snapshot is only the wrprotect fault, like KVM's dirty tracking mechanism:
> *only tracing write actions*.
> 
> My initial scheme for live memory snapshot is:
> (1) pause the VM
> (2) use userfaultfd to mark all of the VM's memory write-protected (read-only)
> (3) save device state to the snapshot file
> (4) resume the VM
> (5) the snapshot thread begins saving memory pages to the snapshot file
> (6) the VM keeps running, and it is OK for the VM or other threads to read
> RAM (no fault trap), but if the VM tries to write a page (dirtying it),
> there will be a userfault trap notification
> (7) a fault-handling thread reads the page request from the userfaultfd,
> copies the content of the page to a buffer, and then removes the page's
> write protection (still using the userfaultfd to tell the kernel)
> (8) after step (7), the VM can continue to write the page, which is now
> writable
> (9) the snapshot thread saves the page cached in step (7)
> (10) repeat steps (5)-(9) until all of the VM's memory is saved to the
> snapshot file.

Hmm, I can see the same process being useful for fault-tolerance schemes
like COLO; it needs a memory state snapshot.

> So, what I need from userfault is support for wrprotect faults only. I don't
> want notifications for not-present read faults; they would hurt the
> VM's performance and the efficiency of taking the snapshot.

What pages would be non-present at this point - just balloon?

Dave

> Also, I think this feature will benefit migration of ivshmem and vhost-scsi,
> which have no dirty page tracking now.
> 
> >The question then is how you mark the memory read-only to let the
> >wrprotect faults trap if the memory already existed and you didn't map
> >it yourself in the guest with mcopy_atomic with a readonly flag.
> >
> >My current plan would be:
> >
> >- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
> >  fast path check in the not-present and wrprotect page fault
> >
> >- if VM_USERFAULT is set, find if there's a userfaultfd registered
> >  into that vma too
> >
> > if yes, engage the userfaultfd protocol
> >
> > otherwise raise SIGBUS (single-threaded apps should be fine with
> > SIGBUS, and it saves them having to spawn a thread just to speak the
> > userfaultfd protocol)
> >
> >- if the userfaultfd protocol is engaged, return read|write fault + fault
> >  address to read(ufd) syscalls
> >
> >- leave the "userfault" resolution mechanism independent of the
> >  userfaultfd protocol so we keep the two problems separated and we
> >  don't mix them in the same API, which makes it even harder to
> >  finalize.
> >
> > add mcopy_atomic (with a flag to map the page readonly too)
> >
> > The alternative would be to hide mcopy_atomic (and even
> > remap_anon_pages in order to "remove" the memory atomically for
> > the externali

Re: [PATCH 00/17] RFC: userfault v2

2014-10-30 Thread zhanghailiang

On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea,

Thanks for your hard work on userfault ;)

This is really a useful API.

I want to confirm a question:
Can we support distinguishing between write and read memory accesses for
userfault? That is, can we decide whether writing a page, reading a page, or
both triggers a userfault?

I think this will help support migration of vhost-scsi and ivshmem;
we could track dirty pages in userspace.

Actually, I'm trying to realize live memory snapshots based on pre-copy and
userfault, but reading memory from the migration thread would also trigger a
userfault. It would be easy to implement live memory snapshots if we supported
configuring userfault for memory writes only.


Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.

After some chat during the KVM Forum I've already been thinking it
could be beneficial for some usages to give userland the information
about the fault being a read or a write, combined with the ability of
mapping pages wrprotected with mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but that's already set
in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not-present
faults, but it's not a massive change. Returning the read/write
information is also not a massive change. This will then pay off mostly
if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean, are you still OK if a not-present read
fault traps too (you'd be notified it's a read) and you get
notifications for both wrprotect and not-present faults?


Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe I didn't describe it clearly. What I really need for live memory snapshot
is only the wrprotect fault, like KVM's dirty tracking mechanism: *only tracing
write actions*.

My initial scheme for live memory snapshot is:
(1) pause the VM
(2) use userfaultfd to mark all of the VM's memory write-protected (read-only)
(3) save device state to the snapshot file
(4) resume the VM
(5) the snapshot thread begins saving memory pages to the snapshot file
(6) the VM keeps running, and it is OK for the VM or other threads to read RAM
(no fault trap), but if the VM tries to write a page (dirtying it), there
will be a userfault trap notification
(7) a fault-handling thread reads the page request from the userfaultfd,
copies the content of the page to a buffer, and then removes the page's
write protection (still using the userfaultfd to tell the kernel)
(8) after step (7), the VM can continue to write the page, which is now writable
(9) the snapshot thread saves the page cached in step (7)
(10) repeat steps (5)-(9) until all of the VM's memory is saved to the snapshot file.

So, what I need from userfault is support for wrprotect faults only. I don't
want notifications for not-present read faults; they would hurt the
VM's performance and the efficiency of taking the snapshot.

Also, I think this feature will benefit migration of ivshmem and vhost-scsi,
which have no dirty page tracking now.


The question then is how you mark the memory read-only to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
   fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
   into that vma too

 if yes, engage the userfaultfd protocol

 otherwise raise SIGBUS (single-threaded apps should be fine with
 SIGBUS, and it saves them having to spawn a thread just to speak the
 userfaultfd protocol)

- if the userfaultfd protocol is engaged, return read|write fault + fault
   address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
   userfaultfd protocol so we keep the two problems separated and we
   don't mix them in the same API, which makes it even harder to
   finalize.

 add mcopy_atomic (with a flag to map the page readonly too)

 The alternative would be to hide mcopy_atomic (and even
 remap_anon_pages in order to "remove" the memory atomically for
 the externalization into the cloud) as userfaultfd commands to
 write into the fd. But then there would not be much point in keeping
 MADV_USERFAULT around if I did so and I could just remove it
 too, or it doesn't look clean having to open the userfaultfd just
 to issue a hidden mcopy_atomic.

 So it becomes a decision whether the basic SIGBUS mode for
 single-threaded apps should be supported or not. As long as we support
 SIGBUS too and we don't force the use of userfaultfd as the only
 mechanism to be notified about userfaults, having a separate
 

Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-10-29 Thread Peter Maydell
On 29 October 2014 17:46, Andrea Arcangeli  wrote:
> After some chat during the KVMForum I've been already thinking it
> could be beneficial for some usage to give userland the information
> about the fault being read or write

...I wonder if that would let us replace the current nasty
mess we use in linux-user to detect read vs write faults
(which uses a bunch of architecture-specific hacks including
in some cases "look at the insn that triggered this SEGV and
decode it to see if it was a load or a store"; see the
various cpu_signal_handler() implementations in user-exec.c).

-- PMM


Re: [PATCH 00/17] RFC: userfault v2

2014-10-29 Thread Andrea Arcangeli
Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> Hi Andrea,
> 
> Thanks for your hard work on userfault;)
> 
> This is really a useful API.
> 
> I want to confirm a question:
> Can we support distinguishing between write and read memory accesses for
> userfault?
> That is, can we decide whether writing a page, reading a page, or both
> triggers a userfault?
> 
> I think this will help support migration of vhost-scsi and ivshmem;
> we could track dirty pages in userspace.
> 
> Actually, I'm trying to realize live memory snapshots based on pre-copy and
> userfault,
> but reading memory from the migration thread would also trigger a userfault.
> It would be easy to implement live memory snapshots if we supported configuring
> userfault for memory writes only.

Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is a worthy
feature to have here.

After some chat during the KVM Forum I've already been thinking it
could be beneficial for some usages to give userland the information
about the fault being a read or a write, combined with the ability of
mapping pages wrprotected with mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but that's already set
in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not-present
faults, but it's not a massive change. Returning the read/write
information is also not a massive change. This will then pay off mostly
if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean, are you still OK if a not-present read
fault traps too (you'd be notified it's a read) and you get
notifications for both wrprotect and not-present faults?

The question then is how you mark the memory read-only to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
  fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
  into that vma too

if yes, engage the userfaultfd protocol

otherwise raise SIGBUS (single-threaded apps should be fine with
SIGBUS, and it saves them having to spawn a thread just to speak the
userfaultfd protocol)

- if the userfaultfd protocol is engaged, return read|write fault + fault
  address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
  userfaultfd protocol so we keep the two problems separated and we
  don't mix them in the same API, which makes it even harder to
  finalize.

add mcopy_atomic (with a flag to map the page readonly too)

The alternative would be to hide mcopy_atomic (and even
remap_anon_pages in order to "remove" the memory atomically for
the externalization into the cloud) as userfaultfd commands to
write into the fd. But then there would not be much point in keeping
MADV_USERFAULT around if I did so and I could just remove it
too, or it doesn't look clean having to open the userfaultfd just
to issue a hidden mcopy_atomic.

So it becomes a decision whether the basic SIGBUS mode for
single-threaded apps should be supported or not. As long as we support
SIGBUS too and we don't force the use of userfaultfd as the only
mechanism to be notified about userfaults, having a separate
mcopy_atomic syscall sounds cleaner.
 
Perhaps mcopy_atomic could also be used in other cases that arise later
and that aren't connected with userfaults.
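
To picture the read(ufd) side of this plan, here is a sketch assuming, purely
for illustration, that each read() yields one fixed-size record carrying the
fault address and a read/write bit; the record layout is not finalized
anywhere in this thread, and note_dirty_page()/mcopy_from_source() are
hypothetical helpers:

#include <stdint.h>
#include <unistd.h>

struct ufault_rec {                /* hypothetical wire format */
    uint64_t address;              /* faulting virtual address */
    uint64_t flags;                /* bit 0: 1 = write, 0 = read */
};

extern void note_dirty_page(uint64_t addr);     /* hypothetical */
extern void mcopy_from_source(uint64_t addr);   /* hypothetical */

static void drain_userfaults(int ufd)
{
    struct ufault_rec rec;

    while (read(ufd, &rec, sizeof(rec)) == sizeof(rec)) {
        if (rec.flags & 1)
            note_dirty_page(rec.address);    /* wrprotect fault */
        else
            mcopy_from_source(rec.address);  /* not-present fault,
                                                resolved with mcopy_atomic */
    }
}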

Questions to double-check that the above plan is OK:

1) should I drop the SIGBUS behavior and MADV_USERFAULT?

2) should I hide mcopy_atomic as a write into the userfaultfd?

   NOTE: even if I hide mcopy_atomic as a userfaultfd command to write
   into the fd, the buffer pointer passed to the write() syscall would
   still _not_ point to the data like a regular write; it
   would be a pointer to a command structure that points to the source
   and destination data of the "hidden" mcopy_atomic. The only
   advantage is that perhaps I could wake up the blocked page faults
   without requiring an additional syscall.

   The standalone mcopy_atomic would still require a write into the
   userfaultfd, as happens now after remap_anon_pages returns, in
   order to wake up the stopped page faults.

3) should I add a registration command to trap only write faults?

   The protocol can always be extended later in a backwards-compatible
   way anyway, but it's better if we get it fully featured from the
   start.
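
   As a sketch of what question 3 could look like at the API level (all
   names and numbers below are invented for illustration; nothing like
   this is in the patchset):

#include <stdint.h>
#include <sys/ioctl.h>

struct ufd_register_sketch {       /* hypothetical registration command */
    uint64_t start, len;           /* range to register */
    uint64_t mode;                 /* which fault types to trap */
};
#define UFD_MODE_MISSING (1ULL << 0)    /* not-present faults */
#define UFD_MODE_WP      (1ULL << 1)    /* wrprotect (write) faults */
#define UFD_REGISTER     0xaa00         /* placeholder ioctl number */

/* The live snapshot use case discussed above would register with
 * UFD_MODE_WP alone and never see not-present read faults. */
static int trap_writes_only(int ufd, void *ram, uint64_t len)
{
    struct ufd_register_sketch reg = {
        .start = (uint64_t)(uintptr_t)ram,
        .len   = len,
        .mode  = UFD_MODE_WP,
    };
    return ioctl(ufd, UFD_REGISTER, &reg);
}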

For completeness, some answers to other questions I've seen floating
around but that weren't posted on the list yet (you can skip reading
the part below if not interested):

- open("/dev/userfault") instead of sys_userfaultfd(), I don't see the
  benefit: userfaul

Re: [PATCH 00/17] RFC: userfault v2

2014-10-27 Thread zhanghailiang

Hi Andrea,

Thanks for your hard work on userfault ;)

This is really a useful API.

I want to confirm a question:
Can we support distinguishing between write and read memory accesses for
userfault? That is, can we decide whether writing a page, reading a page, or
both triggers a userfault?

I think this will help support migration of vhost-scsi and ivshmem;
we could track dirty pages in userspace.

Actually, I'm trying to realize live memory snapshots based on pre-copy and
userfault, but reading memory from the migration thread would also trigger a
userfault. It would be easy to implement live memory snapshots if we supported
configuring userfault for memory writes only.


Thanks,
zhanghailiang

On 2014/10/4 1:07, Andrea Arcangeli wrote:

Hello everyone,

There's a large To/Cc list for this RFC because this adds two new
syscalls (userfaultfd and remap_anon_pages) and
MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes are welcome
sooner than later.

The major change compared to the previous RFC I sent a few months ago
is that the userfaultfd protocol now supports dynamic range
registration. So you can have an unlimited number of userfaults for
each process, so each shared library can use its own userfaultfd on
its own memory independently from other shared libraries or the main
program. This functionality was suggested by Andy Lutomirski (more
details on this are in the commit header of the last patch of this
patchset).

In addition the mmap_sem complexities have been sorted out. In fact the
real userfault patchset starts from patch number 7. Patches 1-6 will
be submitted separately for merging and if applied standalone they
provide a scalability improvement by reducing the mmap_sem hold times
during I/O. I included patches 1-6 here too because they're a hard
dependency for the userfault patchset. The userfaultfd syscall depends
on the first fault to always have FAULT_FLAG_ALLOW_RETRY set (the
later retry faults don't matter, it's fine to clear
FAULT_FLAG_ALLOW_RETRY with the retry faults, following the current
model).

The combination of these features is what I would propose to
implement postcopy live migration in qemu, and in general demand
paging of remote memory, hosted in different cloud nodes.

If the access could ever happen in kernel context through syscalls
(not just from userland context), then userfaultfd has to be used
on top of MADV_USERFAULT, to make the userfault unnoticeable to the
syscall (no error will be returned). This latter feature is more
advanced than what volatile ranges alone could do with SIGBUS so far
(but it's optional, if the process doesn't register the memory in a
userfaultfd, the regular SIGBUS will fire, if the fd is closed SIGBUS
will also fire for any blocked userfault that was waiting for a
userfaultfd_write ack).

userfaultfd is also a generic enough feature, that it allows KVM to
implement postcopy live migration without having to modify a single
line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
other GUP features work just fine in combination with userfaults
(userfaults trigger async page faults in the guest scheduler so those
guest processes that aren't waiting for userfaults can keep running in
the guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults (it's
not mandatory, vmsplice will likely still be used in the case of local
postcopy live migration just to upgrade the qemu binary, but
remap_anon_pages is faster and ideal for transferring memory across
the network, it's zerocopy and doesn't touch the vma: it only holds
the mmap_sem for reading).

The current behavior of remap_anon_pages is very strict to avoid any
chance of memory corruption going unnoticed. mremap is not strict like
that: if there's a synchronization bug it would drop the destination
range silently resulting in subtle memory corruption for
example. remap_anon_pages would return -EEXIST in that case. If there
are holes in the source range remap_anon_pages will return -ENOENT.

If remap_anon_pages is always used with 2M naturally aligned
addresses, transparent hugepages will not be split. If there could
be 4k (or any size) holes in the 2M (or any size) source range,
remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
relax some of its strict checks (-ENOENT won't be returned if
RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
a noop on any hole in the source range). This flag is generally useful
when implementing userfaults with THP granularity, but it shouldn't be
set if doing the userfaults with PAGE_SIZE granularity if the
developer wants to benefit from the strict -ENOENT behavior.

The remap_anon_pages syscall API is not vectored, as I expect it to be
used mainly for demand paging (where there can be just one faulting
range per userfault) or for large ranges (with the THP model as an
alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
granularity before starting the guest in the destination node) where

[PATCH 00/17] RFC: userfault v2

2014-10-03 Thread Andrea Arcangeli
Hello everyone,

There's a large To/Cc list for this RFC because this adds two new
syscalls (userfaultfd and remap_anon_pages) and
MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes are welcome
sooner than later.

The major change compared to the previous RFC I sent a few months ago
is that the userfaultfd protocol now supports dynamic range
registration. So you can have an unlimited number of userfaults for
each process, so each shared library can use its own userfaultfd on
its own memory independently from other shared libraries or the main
program. This functionality was suggested by Andy Lutomirski (more
details on this are in the commit header of the last patch of this
patchset).

In addition the mmap_sem complexities have been sorted out. In fact the
real userfault patchset starts from patch number 7. Patches 1-6 will
be submitted separately for merging and if applied standalone they
provide a scalability improvement by reducing the mmap_sem hold times
during I/O. I included patches 1-6 here too because they're a hard
dependency for the userfault patchset. The userfaultfd syscall depends
on the first fault to always have FAULT_FLAG_ALLOW_RETRY set (the
later retry faults don't matter, it's fine to clear
FAULT_FLAG_ALLOW_RETRY with the retry faults, following the current
model).

The combination of these features is what I would propose to
implement postcopy live migration in qemu, and in general demand
paging of remote memory, hosted in different cloud nodes.

If the access could ever happen in kernel context through syscalls
(not just from userland context), then userfaultfd has to be used
on top of MADV_USERFAULT, to make the userfault unnoticeable to the
syscall (no error will be returned). This latter feature is more
advanced than what volatile ranges alone could do with SIGBUS so far
(but it's optional, if the process doesn't register the memory in a
userfaultfd, the regular SIGBUS will fire, if the fd is closed SIGBUS
will also fire for any blocked userfault that was waiting for a
userfaultfd_write ack).
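
A sketch of that basic non-userfaultfd mode follows; the MADV_USERFAULT
value below is a placeholder (the real constant comes from this patchset),
and the handler body is a stub rather than a real page-materialization path:

#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MADV_USERFAULT 0x100   /* placeholder: use the patchset's value */

/* Single-threaded fallback: with no userfaultfd registered, a fault
 * on a not-present page in a MADV_USERFAULT range raises SIGBUS. */
static void on_sigbus(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si; (void)ctx;
    /* si->si_addr is the faulting address; a real handler would
     * materialize the page here before returning */
    _exit(1);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(p, 4096, MADV_USERFAULT);  /* arm userfault on the range */

    return p[0];  /* first touch of the not-present page -> SIGBUS */
}

With a userfaultfd registered on the range, the same fault would be
delivered through the fd instead of as a signal.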

userfaultfd is also a generic enough feature, that it allows KVM to
implement postcopy live migration without having to modify a single
line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
other GUP features work just fine in combination with userfaults
(userfaults trigger async page faults in the guest scheduler so those
guest processes that aren't waiting for userfaults can keep running in
the guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults (it's
not mandatory, vmsplice will likely still be used in the case of local
postcopy live migration just to upgrade the qemu binary, but
remap_anon_pages is faster and ideal for transferring memory across
the network, it's zerocopy and doesn't touch the vma: it only holds
the mmap_sem for reading).

The current behavior of remap_anon_pages is very strict to avoid any
chance of memory corruption going unnoticed. mremap is not strict like
that: if there's a synchronization bug it would drop the destination
range silently resulting in subtle memory corruption for
example. remap_anon_pages would return -EEXIST in that case. If there
are holes in the source range remap_anon_pages will return -ENOENT.
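
For demand paging, resolving a fault with these proposed syscalls could look
like the sketch below; the (dst, src, len, flags) argument order, the syscall
number, and the recv_remote_page()/ufd_wake_range() helpers are all
assumptions for illustration:

#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_remap_anon_pages 317   /* placeholder syscall number */

extern void recv_remote_page(void *dst, size_t len);          /* hypothetical */
extern void ufd_wake_range(int ufd, void *addr, size_t len);  /* hypothetical */

static void resolve_userfault(int ufd, void *fault_addr,
                              void *staging, size_t len)
{
    /* demand-paged data arrives in a private staging area first */
    recv_remote_page(staging, len);

    /* move it atomically into the faulting range; strict by design:
     * -EEXIST if dst is already mapped, -ENOENT on source holes
     * (unless RAP_ALLOW_SRC_HOLES is passed) */
    if (syscall(__NR_remap_anon_pages, (unsigned long)fault_addr,
                (unsigned long)staging, len, 0UL) < 0)
        abort();

    /* per the protocol, a write into the userfaultfd then wakes the
     * page faults blocked on this range */
    ufd_wake_range(ufd, fault_addr, len);
}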

If remap_anon_pages is always used with 2M naturally aligned
addresses, transparent hugepages will not be split. If there could
be 4k (or any size) holes in the 2M (or any size) source range,
remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
relax some of its strict checks (-ENOENT won't be returned if
RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
a noop on any hole in the source range). This flag is generally useful
when implementing userfaults with THP granularity, but it shouldn't be
set if doing the userfaults with PAGE_SIZE granularity if the
developer wants to benefit from the strict -ENOENT behavior.

The remap_anon_pages syscall API is not vectored, as I expect it to be
used mainly for demand paging (where there can be just one faulting
range per userfault) or for large ranges (with the THP model as an
alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
granularity before starting the guest in the destination node) where
vectoring isn't going to provide much of a performance advantage (thanks
to the coarser THP granularity).

On the rmap side remap_anon_pages doesn't add much complexity: there's
no need for nonlinear anon vmas to support it because I added the
constraint that it will fail if the mapcount is more than 1. So in
general the source range of remap_anon_pages should be marked
MADV_DONTFORK to prevent any risk of failure if the process ever
forks (as qemu can in some cases).

The MADV_USERFAULT feature should be generic enough that it can
provide the userfaults to the Android volatile range feature too, on
access of reclaimed volatile pages. Or it could be used for other
similar things with tmpfs in the future. I've been discussin