Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt

2015-12-04 Thread Andrea Arcangeli
Hello Michael,

On Fri, Dec 04, 2015 at 04:50:03PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Andrea,
> 
> On 09/11/2015 10:47 AM, Michael Kerrisk (man-pages) wrote:
> > On 05/14/2015 07:30 PM, Andrea Arcangeli wrote:
> >> Add documentation.
> > 
> > Hi Andrea,
> > 
> > I do not recall... Did you write a man page also for this new system call?
> 
> No response to my last mail, so I'll try again... Did you 
> write any man page for this interface?

I wish I could answer with the manpage itself to give a more
satisfactory reply, but the answer is still no at this time. Right now
the write protection tracking feature has been posted to linux-mm and
I'm currently reviewing it. It's worth documenting that part in the
manpage too, as it's going to happen sooner rather than later.

The lack of a manpage so far hasn't prevented userland from using it
(qemu postcopy is already in upstream qemu and depends on userfaultfd),
nor has it prevented review of the code or other kernel contributors
from extending the syscall API. Other users have started testing the
syscall too. This is just to explain why unfortunately the manpage
hasn't gotten top priority yet; nevertheless the manpage should happen
and it's important. Advice on how to proceed is welcome.

Thanks,
Andrea


Re: [PATCH 00/11] KVM: x86: track guest page access

2015-12-01 Thread Andrea Arcangeli
On Tue, Dec 01, 2015 at 11:17:30AM +0100, Paolo Bonzini wrote:
> 
> 
> On 30/11/2015 19:26, Xiao Guangrong wrote:
> > This patchset introduces the feature which allows us to track page
> > access in guest. Currently, only write access tracking is implemented
> > in this version.
> > 
> > Four APIs are introduced:
> > - kvm_page_track_add_page(kvm, gfn, mode), single guest page @gfn is
> >   added into the track pool of the guest instance represented by @kvm,
> >   @mode specifies which kind of access on the @gfn is tracked
> >   
> > - kvm_page_track_remove_page(kvm, gfn, mode), is the opposite operation
> >   of kvm_page_track_add_page(), which removes @gfn from the tracking pool.
> >   @gfn is no longer tracked after its last user is gone
> > 
> > - kvm_page_track_register_notifier(kvm, n), register a notifier so that
> >   the event triggered by page tracking will be received, at that time,
> >   the callback of n->track_write() will be called
> > 
> > - kvm_page_track_unregister_notifier(kvm, n), does the opposite operation
> >   of kvm_page_track_register_notifier(); it unlinks the notifier and
> >   stops receiving tracked events
> > 
> > The first user of page track is non-leaf shadow page tables, as they are
> > always write protected. It also gains a performance improvement because
> > page track speeds up the page fault handler for the tracked pages. The
> > performance result of kernel building is as follows:
> > 
> >before   after
> > real 461.63   real 455.48
> > user 4529.55  user 4557.88
> > sys 1995.39   sys 1922.57
> 
> For KVM-GT, as far as I know Andrea Arcangeli is working on extending
> userfaultfd to tracking write faults only.  Perhaps KVM-GT can do

I was a bit busy lately with the KSMscale design change and with
fixing a purely theoretical THP issue, but the userfaultfd write
tracking has already become available here:

http://www.spinics.net/lists/linux-mm/msg97422.html

I'll be merging it soon in my tree after a thoughtful review.

> something similar, where KVM gets the write tracking functionality for
> free through the MMU notifiers.  Any thoughts on this?
> 
> Applying your technique to non-leaf shadow pages actually makes this
> series quite interesting. :)  Shadow paging is still in use for nested
> EPT, so it's always a good idea to speed it up.

I don't have the full picture yet of how the userfaultfd write
tracking could also fit into the leaf/non-leaf shadow pagetable write
tracking, but it's good to think about it.

In the userfaultfd case the write notification would always arrive
first through the uffd and be received by the qemu userfault thread.
It's then the uffd memory protect ioctl, invoked by the qemu userfault
thread (to handle the write fault in userland and wake up the thread
stuck in handle_userfault()), that would also flush the secondary MMU
TLB through the MMU notifier and get rid of the readonly spte (or
update it to read-write with change_pte in the best case).
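To make that flow concrete, here is a rough userland-side sketch of the
qemu userfault thread handling such a write fault. It assumes the
write-protect ioctl from the posted series (the UFFDIO_WRITEPROTECT
name and the uffdio_writeprotect layout are taken from that patchset,
not from a released kernel); track_dirty_page(), uffd and page_size are
this sketch's own names:

	struct uffd_msg msg;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
		err(1, "read uffd");
	if (msg.event == UFFD_EVENT_PAGEFAULT &&
	    (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
		/* handle the write fault in userland first */
		track_dirty_page(msg.arg.pagefault.address);

		/*
		 * Then remove the write protection: this is the ioctl that
		 * flushes the secondary MMU TLB through the MMU notifier,
		 * drops the readonly spte and wakes the thread stuck in
		 * handle_userfault().
		 */
		struct uffdio_writeprotect wp = {
			.range.start = msg.arg.pagefault.address & ~(page_size - 1),
			.range.len = page_size,
			.mode = 0,	/* clear WP and wake */
		};
		if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
			err(1, "UFFDIO_WRITEPROTECT");
	}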

Thanks,
Andrea


Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390

2015-11-11 Thread Andrea Arcangeli
On Wed, Nov 11, 2015 at 08:47:34PM +0100, Christian Borntraeger wrote:
> Acked-by: Christian Borntraeger 
> Who is going to take this patch? If I should take the patch, I need an
> ACK from the memory mgmt folks.

I would suggest resending with Andrew in CC so it can be merged in -mm
after taking care of the below, as it touches common mm code.

> 
> Christian
> 
> 
> >> ---
> >>  mm/huge_memory.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index c29ddeb..a8b5347 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >>/*
> >> * Be somewhat over-protective like KSM for now!
> >> */
> >> -  if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
> >> +  if (*vm_flags & VM_NO_THP)
> >>return -EINVAL;
> >>*vm_flags &= ~VM_HUGEPAGE;
> >>*vm_flags |= VM_NOHUGEPAGE;

If we make this change, MADV_HUGEPAGE must be taken care of too, or it
doesn't make sense to treat them differently.
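For reference, the symmetric MADV_HUGEPAGE hunk could look roughly like
this (sketch only, not the submitted patch; the khugepaged registration
that follows in the real function is omitted):

	case MADV_HUGEPAGE:
		/*
		 * Mirror the MADV_NOHUGEPAGE change: only VM_NO_THP should
		 * make the madvise fail, not VM_HUGEPAGE already being set.
		 */
		if (*vm_flags & VM_NO_THP)
			return -EINVAL;
		*vm_flags &= ~VM_NOHUGEPAGE;
		*vm_flags |= VM_HUGEPAGE;
		break;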

After taking care of MADV_HUGEPAGE you can add my Reviewed-by when you
resubmit to Andrew.

Thanks,
Andrea


Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390

2015-11-11 Thread Andrea Arcangeli
On Wed, Nov 11, 2015 at 09:01:44PM +0100, Christian Borntraeger wrote:
> Am 11.11.2015 um 18:30 schrieb Andrea Arcangeli:
> > Hi Jason,
> > 
> > On Wed, Nov 11, 2015 at 10:35:16AM -0500, Jason J. Herne wrote:
> >> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> >> hugepage but hugepage_madvise() takes the error path when we ask to turn
> >> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
> > 
> > I wonder why KVM disables transparent hugepages on s390. It sounds
> > weird to disable transparent hugepages with KVM. In fact on x86 we
> > call MADV_HUGEPAGE to be sure transparent hugepages are enabled on the
> > guest physical memory, even if the transparent_hugepage/enabled ==
> > madvise.
> > 
> >> new postcopy migration feature to fail on s390 because its first action is
> >> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> >> code so that the operation succeeds without error now.
> > 
> > The other way is to change qemu to keep track it already called
> > MADV_NOHUGEPAGE and not to call it again. I don't have a strong
> > opinion on this, I think it's ok to return 0 but it's a visible change
> > to userland, I can't imagine it to break anything though. It sounds
> > very unlikely that an app could error out if it notices the kernel
> > doesn't error out on the second call of MADV_NOHUGEPAGE.
> > 
> > Glad to hear KVM postcopy live migration is already running on s390 too.
> 
> Sometimes we have some issues with userfaultfd, which we are currently addressing.
> One place is interesting: the kvm code might have to call fixup_user_fault
> for a guest address (to map the page writable). Right now we do not pass
> FAULT_FLAG_ALLOW_RETRY, which can trigger a warning like
> 
> [  119.414573] FAULT_FLAG_ALLOW_RETRY missing 1
> [  119.414577] CPU: 42 PID: 12853 Comm: qemu-system-s39 Not tainted 4.3.0+ 
> #315
> [  119.414579]00011c4579b8 00011c457a48 0002 
>  
>   00011c457ae8 00011c457a60 00011c457a60 
> 00113e26 
>   02cf 009feef8 00a1e054 
> 000b 
>   00011c457aa8 00011c457a48  
>  
>    00113e26 00011c457a48 
> 00011c457aa8 
> [  119.414590] Call Trace:
> [  119.414596] ([<00113d16>] show_trace+0xf6/0x148)
> [  119.414598]  [<00113dda>] show_stack+0x72/0xf0
> [  119.414600]  [<00551b9e>] dump_stack+0x6e/0x90
> [  119.414605]  [<0032d168>] handle_userfault+0xe0/0x448
> [  119.414609]  [<0029a2d4>] handle_mm_fault+0x16e4/0x1798
> [  119.414611]  [<002930be>] fixup_user_fault+0x86/0x118
> [  119.414614]  [<00126bb8>] gmap_ipte_notify+0xa0/0x170
> [  119.414617]  [<0013ae90>] kvm_arch_vcpu_ioctl_run+0x448/0xc58
> [  119.414619]  [<0012e4dc>] kvm_vcpu_ioctl+0x37c/0x668
> [  119.414622]  [<002eba68>] do_vfs_ioctl+0x3a8/0x508
> [  119.414624]  [<002ebc6c>] SyS_ioctl+0xa4/0xb8
> [  119.414627]  [<00815c56>] system_call+0xd6/0x264
> [  119.414629]  [<03ff9628721a>] 0x3ff9628721a
> 
> I think we can rework this to use something that sets FAULT_FLAG_ALLOW_RETRY,
> but this begs the question if a futex operation on userfault backed memory 
> would also be broken. The futex code also does fixup_user_fault without 
> FAULT_FLAG_ALLOW_RETRY as far as I can tell.

That's a good point, but qemu never does futex on the guest physical
memory, so it can't be a problem for postcopy at least; it's also not
destabilizing in any way (and the stack dump only happens if you have
DEBUG_VM selected).

The userfaultfd stress test could actually end up using futex on the
userfault memory, but it never triggered anything; it doesn't reach
that fixup_user_fault at runtime.

Still it should be fixed for s390 and futex.

It's probably better to add a fixup_user_fault_unlocked that works
like get_user_pages_unlocked, i.e. it keeps the details of the
mmap_sem locking internal to the function and handles VM_FAULT_RETRY
automatically by re-taking the mmap_sem and repeating the
fixup_user_fault after switching FAULT_FLAG_ALLOW_RETRY to
FAULT_FLAG_TRIED.

Note that the FAULT_FLAG_TRIED logic will need an overhaul soon, as we
must be able to release the mmap_sem in handle_userfault() even if
FAULT_FLAG_TRIED is passed instead of FAULT_FLAG_ALLOW_RETRY, for two
reasons:

1) the theoretical scheduler race,

Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Andrea Arcangeli
On Thu, Oct 22, 2015 at 05:15:09PM +0200, Peter Zijlstra wrote:
> Indefinitely is such a long time, we should try and finish
> computation before the computer dies etc. :-)

Indefinitely in the same sense as read_seqcount_retry: eventually it
makes progress.

Even returning 0 from the page fault can trigger it again
indefinitely, so VM_FAULT_RETRY isn't fundamentally different from
returning 0 and retrying the page fault again later. So it's not clear
why VM_FAULT_RETRY can only try once more.

FAULT_FLAG_TRIED as a message to the VM, so that it starts to do heavy
locking and blocks more aggressively, is actually useful as such, but
it shouldn't be a replacement for FAULT_FLAG_ALLOW_RETRY. What I meant
by removing FAULT_FLAG_TRIED is really converting it into a hint that
doesn't control whether the page fault can keep retrying in-kernel.


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Andrea Arcangeli
On Thu, Oct 22, 2015 at 03:38:24PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2015 at 03:20:15PM +0200, Andrea Arcangeli wrote:
> 
> > If schedule spontaneously wakes up a task in TASK_KILLABLE state that
> > would be a bug in the scheduler in my view. Luckily there doesn't seem
> > to be such a bug, or at least we never experienced it.
> 
> Well, there will be a wakeup, just not the one you were hoping for.
> 
> We have code that does:
> 
>   @cond = true;
>   get_task_struct(p);
>   queue(p)
> 
>   /* random wait somewhere */
>   for (;;) {
>   prepare_to_wait();
>   if (@cond)
> break;
> 
>   ...
> 
>   handle_userfault()
> ...
> schedule();
>   ...
> 
>   dequeue(p)
>   wake_up_process(p) ---> wakeup without userfault wakeup
> 
> 
> These races are (extremely) rare, but they do exist. Therefore one must
> never assume schedule() will not spuriously wake because of these
> things.
> 
> Also, see:
> 
> lkml.kernel.org/r/CA+55aFwHkOo+YGWKYROmce1-H_uG3KfEUmCkJUerTj=ojy2...@mail.gmail.com

With one more spinlock taken in the fast path we could recheck whether
the waitqueue entry is still queued, i.e. whether this is one of those
extremely rare false-positive spurious wakeups, and in that case set
the state back to TASK_KILLABLE and schedule again.
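In handle_userfault() terms the short term fix could look roughly like
this (sketch only, simplified locking, relying on the auto-remove
behaviour of the uwq waitqueue entry):

	wake_up_poll(&ctx->fd_wqh, POLLIN);
	for (;;) {
		schedule();
		if (ACCESS_ONCE(ctx->released) || fatal_signal_pending(current))
			break;
		spin_lock(&ctx->fault_wqh.lock);
		if (list_empty(&uwq.wq.task_list)) {
			/* dequeued by a real userfault wakeup */
			spin_unlock(&ctx->fault_wqh.lock);
			break;
		}
		/* still queued: spurious schedule() wakeup, sleep again */
		set_current_state(TASK_KILLABLE);
		spin_unlock(&ctx->fault_wqh.lock);
	}
	__set_current_state(TASK_RUNNING);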

However in the long term such a spinlock should be removed, because
it's faster to stick with the current lockless list_empty_careful and
not recheck the auto-remove waitqueue; but then we must be able to
re-enter handle_userfault() even if FAULT_FLAG_TRIED was set
(currently we can't return VM_FAULT_RETRY if FAULT_FLAG_TRIED is set,
and that's the problem). This change has been planned for a long time,
as we need it to arm the vma-less write protection while the app is
running, so I'm not sure it's worth going for the short term fix if
this is extremely rare.

The risk of memory corruption is still zero no matter what happens
here; in the extremely rare case the app will get a SIGBUS or a
syscall will return -EFAULT. The kernel also cannot crash. So it's not
a very severe concern if it happens extremely rarely (we never
reproduced it and stress testing has run for months). Of course in the
longer term this would have been fixed regardless, as said in the
previous email.

I think going for the longer term fix that was already planned is
better than doing a short term fix, and the real question is how I
should proceed to change the arch code and gup to cope with
handle_userfault() being re-entered.

The simplest thing is to drop FAULT_FLAG_TRIED as a whole. Or I could
add a new VM_FAULT_USERFAULT flag specific to handle_userfault that
would be returned even if FAULT_FLAG_TRIED is set, so that only
userfaults will be allowed to be repeated indefinitely (and then
VM_FAULT_USERFAULT shouldn't trigger a transition to FAULT_FLAG_TRIED,
unlike VM_FAULT_RETRY does).
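In the arch fault handler the second option would look roughly like
this (VM_FAULT_USERFAULT is the hypothetical flag proposed above; it
does not exist in the kernel):

	fault = handle_mm_fault(mm, vma, address, flags);
	if (fault & (VM_FAULT_RETRY | VM_FAULT_USERFAULT)) {
		/* the mmap_sem was released in both cases */
		if (fault & VM_FAULT_RETRY) {
			/* ordinary retry: allowed only once */
			flags &= ~FAULT_FLAG_ALLOW_RETRY;
			flags |= FAULT_FLAG_TRIED;
		}
		/* a VM_FAULT_USERFAULT retry can be repeated indefinitely */
		goto retry;
	}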

This is all about being allowed to drop the mmap_sem.

If we checked the waitqueue with the spinlock held (to be sure the
wakeup isn't happening from under us while we check whether we got a
userfault wakeup or a spurious schedule), we could also limit
VM_FAULT_RETRY to at most 2 events by adding a FAULT_FLAG_TRIED2 and
still using VM_FAULT_RETRY (instead of VM_FAULT_USERFAULT).

Being able to return VM_FAULT_RETRY indefinitely is only needed if we
don't handle the extremely rare wakeup race condition in
handle_userfault by taking the spinlock one more time in the fast path
(i.e. after the schedule).

I'm not exactly sure why we allow VM_FAULT_RETRY only once currently
so I'm tempted to drop FAULT_FLAG_TRIED entirely.

I've no real preference on how to tweak the page fault code to be able
to return VM_FAULT_RETRY indefinitely, and I would aim for the
smallest change possible, so if you have suggestions now is a good
time.

Thanks,
Andrea


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Andrea Arcangeli
On Thu, Oct 22, 2015 at 02:10:56PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2015 at 07:31:11PM +0200, Andrea Arcangeli wrote:
> > @@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, 
> > unsigned long address,
> >  * through poll/read().
> >  */
> > __add_wait_queue(&ctx->fault_wqh, &uwq.wq);
> > -   for (;;) {
> > -   set_current_state(TASK_KILLABLE);
> > -   if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> > -   fatal_signal_pending(current))
> > -   break;
> > -   spin_unlock(&ctx->fault_wqh.lock);
> > +   set_current_state(TASK_KILLABLE);
> > +   spin_unlock(&ctx->fault_wqh.lock);
> >  
> > +   if (likely(!ACCESS_ONCE(ctx->released) &&
> > +  !fatal_signal_pending(current))) {
> > wake_up_poll(&ctx->fd_wqh, POLLIN);
> > schedule();
> > +   ret |= VM_FAULT_MAJOR;
> > +   }
> 
> So what happens here if schedule() spontaneously wakes for no reason?
> 
> I'm not sure enough of userfaultfd semantics to say if that would be
> bad, but the code looks suspiciously like it relies on schedule() not to
> do that; which is wrong.

That would repeat the fault and trigger the DEBUG_VM printk above,
complaining that FAULT_FLAG_ALLOW_RETRY is not set. It is only a
problem for kernel faults (copy_user/get_user_pages*). Userland faults
won't error out in such a way because for them we return 0 rather than
VM_FAULT_RETRY. So this only matters when the task schedules in
TASK_KILLABLE state.

If schedule spontaneously wakes up a task in TASK_KILLABLE state that
would be a bug in the scheduler in my view. Luckily there doesn't seem
to be such a bug, or at least we never experienced it.

Overall this dependency on the scheduler will be lifted soon, as it
must be lifted in order to track the write protect faults, so longer
term this is not a concern and this is not a design issue, but it
remains an implementation detail that avoided changing the arch code
and gup.

If you send a reproducer to show how the current scheduler can wake up
the task in TASK_KILLABLE despite not receiving a wakeup, that would
help too as we never experienced that.

The reason this dependency will be lifted soon is that the userfaultfd
write protect tracking may be armed at any time while the app is
running so we may already be in the middle of a page fault that
returned VM_FAULT_RETRY by the time we arm the write protect
tracking. So longer term the arch page fault and
__get_user_pages_locked must allow handle_userfault() to return
VM_FAULT_RETRY even if FAULT_FLAG_TRIED is set. Then we don't care if
the task is woken.

Waking a task in TASK_KILLABLE state will still waste CPU, so the
scheduler still shouldn't do that. All load balancing works better if
the task isn't running anyway, so I can't imagine a good reason for
wanting to run a task in TASK_KILLABLE state before it gets the
wakeup.

Trying to predict that a wakeup will always happen in less time than
it takes to schedule the task off the CPU sounds like a very CPU
intensive thing to measure, and it's probably better to leave those
heuristics to the caller, e.g. by spinning on a lock for a while
before blocking.

Thanks,
Andrea


Re: [PATCH 0/7] userfault21 update

2015-10-19 Thread Andrea Arcangeli
Hello Patrick,

On Mon, Oct 12, 2015 at 11:04:11AM -0400, Patrick Donnelly wrote:
> Hello Andrea,
> 
> On Mon, Jun 15, 2015 at 1:22 PM, Andrea Arcangeli  wrote:
> > This is an incremental update to the userfaultfd code in -mm.
> 
> Sorry I'm late to this party. I'm curious how a ptrace monitor might
> use a userfaultfd to handle faults in all of its tracees. Is this
> possible without having each (newly forked) tracee "cooperate" by
> creating a userfaultfd and passing that to the tracer?

To make the non cooperative usage work, userfaultfd also needs more
features to track fork() and mremap() syscalls and such, as the
monitor needs to be aware of modifications to the address space of
each "mm" it is managing, and of newly forked "mm"s as well. So fork()
won't need to call userfaultfd once we add those features, but the
monitor still doesn't need to know about the "pid". The uffd_msg
already has padding to add the features you need for that.
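A sketch of what the monitor loop could look like once those features
exist; the event names and message fields (UFFD_EVENT_FORK,
UFFD_EVENT_REMAP, msg.arg.fork.ufd, msg.arg.remap.*) are taken from
the non cooperative patches under discussion, and resolve_fault(),
add_tracked_mm() and update_mapping() are this sketch's own helpers:

	struct uffd_msg msg;

	while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
		switch (msg.event) {
		case UFFD_EVENT_PAGEFAULT:
			/* resolve with UFFDIO_COPY/UFFDIO_ZEROPAGE as usual */
			resolve_fault(uffd, msg.arg.pagefault.address);
			break;
		case UFFD_EVENT_FORK:
			/* a new "mm" appeared: its uffd is delivered in the msg */
			add_tracked_mm(msg.arg.fork.ufd);
			break;
		case UFFD_EVENT_REMAP:
			/* keep the monitor's view of the address space in sync */
			update_mapping(msg.arg.remap.from, msg.arg.remap.to,
				       msg.arg.remap.len);
			break;
		}
	}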

Pavel invented and developed those features for the non cooperative
usage to implement postcopy live migration of containers. He posted a
patchset to the lists too, but it probably needs to be rebased on
upstream.

The ptrace monitor thread can also fault into the userfault area if it
wants to (but only if it's not the userfault manager thread as well).
I didn't expect the ptrace monitor to want to be a userfault manager
too though.

On a side note, the signals the ptrace monitor sends to the tracee
(SIGCONT|STOP included) will only be executed by the tracee without
waiting for userfault resolution from the userfault manager if the
tracee's userfault wasn't triggered in kernel context (and in a non
cooperative usage that's not an assumption you can make). If the
tracee hits a userfault while running in kernel context, the
userfault manager must resolve the userfault before any signal (except
SIGKILL of course) can be executed by the tracee. Only SIGKILL is
instantly executed by all tracees no matter whether the userfault
happened in kernel or user context. That may be another reason for not
wanting the ptrace monitor and the userfault manager in the same
thread (they can still run in two different threads of the same
external process).

> Have you considered using one userfaultfd for an entire tree of
> processes (signaled through a flag)? Would not a process id included
> in the include/uapi/linux/userfaultfd.h:struct uffd_msg be sufficient
> to disambiguate faults?

I recently got a private email asking a corollary question about
having the faulting IP address in the uffd_msg, which I answered, and
I take the opportunity to quote it below as well, as it's somewhat
connected with your "pid" question and adds more context.

===

At times it's the kernel accessing the page (copy-user, get user
pages), for example when the buffer is a parameter to the write or
read syscalls.

The IP address triggering the fault isn't necessarily a userland
address. Furthermore not even the pid is known, so you don't know
which process accessed it.

userfaultfd only notifies userland that a certain page is requested
and must be mapped ASAP. You don't know why or who touched it.

===

Now about adding the "pid": the association between "pid" and "mm"
isn't so strict in the kernel. You can tell which "pid"s share the
same "mm", but if you look from userland you can't always tell which
"mm"(/process) a pid belongs to. At times async io threads or
vhost-net threads can impersonate the "mm", in effect becoming part of
the process, and you'd get the random "pid"s of those kernel threads.

It could also be a ptrace that triggers a userfault, with a "pid" that
isn't part of the application, and the manager must still work
seamlessly no matter who or which "pid" triggered the userfault.

So overall dealing with "pid"s doesn't sound very clean, as the same
kernel thread "pid" can impersonate multiple "mm"s and you wouldn't
get the information of which "mm" the "address" belongs to.

When userfaultfd() is called, it literally binds to the "mm" the
process is running on and it's pid agnostic. Then, when a kernel
thread impersonating the "mm" faults into it with get_user_pages or
copy_user, or when a ptrace faults into it, the userfault manager
won't even see the difference.

Thanks,
Andrea


Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-08-11 Thread Andrea Arcangeli
Hello Bharata,

On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> Maybe it is a bit late to bring this up, but I needed the following fix
> to userfault21 branch of your git tree to compile on powerpc.

Not late, just in time. I had increased the number of syscalls in
earlier versions; it must have gotten lost during a rebase that
produced rejects, sorry.

I applied it to my tree and it can be applied to -mm and linux-next,
thanks!

The syscall numbers for arm32 are also ready and on their way to the
arm tree; the testsuite worked fine there. ppc should also work fine,
and it'd be interesting if you could confirm that, just beware that I
got a typo in the testcase:

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 76071b1..925c3c9 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -70,7 +70,7 @@
 #define __NR_userfaultfd 323
 #elif defined(__i386__)
 #define __NR_userfaultfd 374
-#elif defined(__powewrpc__)
+#elif defined(__powerpc__)
 #define __NR_userfaultfd 364
 #else
 #error "missing __NR_userfaultfd definition"



> 
> 
> powerpc: Bump up __NR_syscalls to account for __NR_userfaultfd
> 
> From: Bharata B Rao 
> 
> With userfaultfd syscall, the number of syscalls will be 365 on PowerPC.
> Reflect the same in __NR_syscalls.
> 
> Signed-off-by: Bharata B Rao 
> ---
>  arch/powerpc/include/asm/unistd.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/unistd.h 
> b/arch/powerpc/include/asm/unistd.h
> index f4f8b66..4a055b6 100644
> --- a/arch/powerpc/include/asm/unistd.h
> +++ b/arch/powerpc/include/asm/unistd.h
> @@ -12,7 +12,7 @@
>  #include 
>  
>  
> -#define __NR_syscalls364
> +#define __NR_syscalls365
>  
>  #define __NR__exit __NR_exit
>  #define NR_syscalls  __NR_syscalls

Reviewed-by: Andrea Arcangeli 


Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization

2015-06-23 Thread Andrea Arcangeli
Hi Dave,

On Tue, Jun 23, 2015 at 12:00:19PM -0700, Dave Hansen wrote:
> Down in userfaultfd_wake_function(), it looks like you intended for a
> len=0 to mean "wake all".  But the validate_range() that we do from
> userspace has a !len check in it, which keeps us from passing a len=0 in
> from userspace.
> Was that "wake all" for some internal use, or is the check too strict?

It's for internal use, e.g. userfaultfd_release which has to wake them
all (after setting ctx->released) if the uffd is closed. It avoids
enlarging the structure by depending on the invariant that userland
cannot pass len=0.

If we accepted len=0 from userland as valid, it'd be safer if it did
nothing, like in madvise; I doubt we want to expose this non standard
kernel internal behavior to userland.

> I was trying to use the wake ioctl after an madvise() (as opposed to
> filling things in using a userfd copy).

madvise will return 0 if len=0, mremap would return -EINVAL if new_len
is zero, and mmap also returns -EINVAL if len is 0, so not all MM
syscalls are as permissive as madvise. Can't you pass the same len you
pass to madvise to UFFDIO_WAKE (or just skip the call if the madvise
len is zero)?
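Something along these lines on the userland side should do (sketch;
addr/len are the same arguments passed to madvise and uffd is the
userfaultfd file descriptor):

	if (len) {
		if (madvise(addr, len, MADV_DONTNEED))
			err(1, "madvise");

		/* wake with the very same range; skip everything if len == 0 */
		struct uffdio_range range = {
			.start = (unsigned long) addr,
			.len = len,
		};
		if (ioctl(uffd, UFFDIO_WAKE, &range))
			err(1, "UFFDIO_WAKE");
	}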

Thanks,
Andrea


Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads

2015-06-16 Thread Andrea Arcangeli
On Mon, Jun 15, 2015 at 08:41:24PM -1000, Linus Torvalds wrote:
> On Mon, Jun 15, 2015 at 12:19 PM, Andrea Arcangeli  
> wrote:
> >
> > Yes, it would leave the other blocked, how is it different from having
> > just 1 reader and it gets killed?
> 
> Either is completely wrong. But the read() code can at least see that
> "I'm returning early due to a signal, so I'll wake up any other
> waiters".
> 
> Poll simply *cannot* do that. Because by definition poll always
> returns without actually clearing the thing that caused the wakeup.
> 
> So for "poll()", using exclusive waits is wrong very much by
> definition. For read(), you *can* use exclusive waits correctly, it
> just requires you to wake up others if you don't read all the data
> (either due to being killed by a signal, or because the read was
> incomplete).

There's no interface to do wakeone with poll, so frankly I haven't
thought much about it, but intuitively it didn't look radically
different as long as poll checks every fd revent it gets. If I were to
patch poll to introduce wakeone I'd think more about it, of course.
Perhaps I've been overoptimistic here.

> What does qemu have to do with anything?
> 
> We don't implement kernel interfaces that are broken, and that can
> leave processes blocked when they shouldn't be blocked. We also don't
> implement kernel interfaces that only work with one program and then
> say "if that program is broken, it's not our problem".

I'm testing with the stresstest application, not with qemu; qemu
cannot take advantage of this anyway because it uses a single thread
so far and it uses poll, not blocking reads. The stresstest suite
listens for events with one thread per CPU, interleaving poll usage
with blocking reads at every bounce, and it's working correctly so
far.

> > I'm not saying doing wakeone is easy [...]
> 
> Bullshit, Andrea.
> 
> That's *exactly* what you said in the commit message for the broken
> patch that I complained about. And I quote:

Please don't quote me out of context, and quote me in full if you
quote me:

"I'm not saying doing wakeone is easy and it's enough to flip a switch
everywhere to get it everywhere"

In the above paragraph (which you quoted in a truncated form) I was
talking in general, not specifically about userfaultfd. I meant that,
in general, wakeone is not easy.

> patch that I complained about. And I quote:
> 
>   "Blocking reads can easily use exclusive wakeups. Poll in theory
> could too but there's no poll_wait_exclusive in common code yet"

With "I'm not saying doing wakeone is easy and it's enough to flip a
switch everywhere to get it everywhere" I intended exactly to clarify
that "Blocking reads can easily use exclusive wakeups" was in the
context of userfaultfd only.

With regard to the patch, you still haven't told me exactly what
runtime breakage I should expect from my simple change. The case you
mentioned about a thread that gets killed is irrelevant, because a
userfault would get missed anyway if a task listening to userfaultfd
gets killed after it received any uffd_msg structure. Wakeone or
wakeall won't move the needle for that case. There's no broadcast of
userfaults to all readers even with wakeall: only the first reader
that wakes up gets the messages, the others return to sleep
immediately.


Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads

2015-06-15 Thread Andrea Arcangeli
On Mon, Jun 15, 2015 at 08:19:07AM -1000, Linus Torvalds wrote:
> What if the process doing the polling never doors anything with the end
> result? Maybe it meant to, but it got killed before it could? Are you going
> to leave everybody else blocked, even though there are pending events?

Yes, it would leave the others blocked; how is that different from
having just 1 reader that gets killed?

If any qemu thread gets killed the thing is going to be noticeable;
there's no fault-tolerance double thread for anything. If one wants to
use more threads for fault tolerance of this scenario with
userfaultfd, one just needs to add a feature flag to
uffdio_api.features to request it and change the behavior to wakeall,
but by default, if we can do wakeone, I think we should.

> The same is true of read() too. What if the reader only reads part of the
> message? The wake didn't wake anybody else, so now people are (again)
> blocked despite there being data.

I totally agree that for a normal read that would be a concern, but
the wakeone only applies to the uffd. I'm not even trying to change
other read methods.

The uffd can't short-read. Lengths that aren't a multiple of
sizeof(struct uffd_msg) immediately return -EINVAL. read will return
one or more events, each sizeof(struct uffd_msg). Signal interruptions
are only reported if read is about to block and has found nothing.
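A reader honouring those rules could look like this (sketch only;
handle_event() is this sketch's own helper):

	struct uffd_msg msgs[16];
	ssize_t len, i;

	/* the length must be a multiple of sizeof(struct uffd_msg) */
	len = read(uffd, msgs, sizeof(msgs));
	if (len < 0)
		err(1, "read uffd");
	/* read returned one or more whole events, never a partial one */
	for (i = 0; i < len / (ssize_t) sizeof(struct uffd_msg); i++)
		handle_event(&msgs[i]);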

> So no, exclusive waiting is never "simple". You have to 100% guarantee that
> you will consume all the data that caused the wake event (or perhaps wake
> the next person up if you don't).

I don't see where it goes wrong.

Now if __wake_up_common didn't check the retval of
default_wake_function->try_to_wake_up before decrementing and checking
nr_exclusive, I would see where the problem about the next guy is, but
it does this:

if (curr->func(curr, mode, wake_flags, key) &&
(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;

Every new blocking userfault (and at most 1 event to read is generated
for each new blocking userfault) wakes one more reader, and each
reader is guaranteed to be blocked only if the pending (pending as in
not read yet) waitqueue is truly empty. Where does it misbehave?

Yes, each reader is then required to handle whatever userfault event
it got from read (or to pass it to another thread before quitting),
but that is a must anyway. This is because after the userfault is read
it is moved from the pending fault queue to the normal fault queue, so
it won't ever be read again; if that weren't the case read would loop
forever and could never block (the same applies to poll, which blocks
after the pending event has been read).

The testsuite can reproduce the bug fixed in 4/7 in about 3 seconds,
and it's 100% reproducible. And the window for such a bug is really
small: the two waitqueue_active calls must run on the other CPU
exactly in between the list_del(); list_add. So it's hard to imagine
that, if this had some major issue, the testsuite wouldn't show it. In
fact the load also seems to scale more evenly across all uffd threads,
with no apparent downside.

qemu uses just one reader, and it's even using poll, so this is not
needed for the short term production code, and it's totally fine to
defer this patch.

I'm not saying doing wakeone is easy and it's enough to flip a switch
everywhere to get it everywhere, and perhaps there's still something
wrong; I just don't see where the actual bug is or how it would work
better without this patch, but it's certainly fine to drop the patch
anyway (at least for now).

Thanks,
Andrea


Re: [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls

2015-06-15 Thread Andrea Arcangeli
On Mon, Jun 15, 2015 at 08:11:50AM -1000, Linus Torvalds wrote:
> On Jun 15, 2015 7:22 AM, "Andrea Arcangeli"  wrote:
> >
> > +   if (cmd != UFFDIO_API) {
> > +   if (ctx->state == UFFD_STATE_WAIT_API)
> > +   return -EINVAL;
> > +   BUG_ON(ctx->state != UFFD_STATE_RUNNING);
> > +   }
> 
> NAK.
> 
> Once again: we don't add BUG_ON() as some kind of assert. If your
> non-critical code has a bug in it, you do WARN_ONCE() and you return. You
> don't kill the machine just because of some "this can't happen" situation.
> 
> It turns out "this can't happen" happens way too often, just because code
> changes, or programmers didn't think all the cases through. And killing the
> machine is just NOT ACCEPTABLE.
> 
> People need to stop adding machine-killing checks to code that just doesn't
> merit killing the machine.
> 
> And if you are so damn sure that it really cannot happen ever, then you
> damn well had better remove the test too!
> 
> BUG_ON is not a debugging tool, or a "I think this would be bad" helper.

Several times I've had very hard-to-reproduce bugs noticed purely
because of a BUG_ON (not VM_BUG_ON) inserted out of pure paranoia, so
I know as a matter of fact that they're worth the little cost. It's
hard to tell whether things would have gotten worse had the workload
continued, or even whether I would have gotten a bug report in the
first place with only a WARN_ON variant, precisely because a WARN_ON
isn't necessarily a bug.

Example: when a WARN_ON in the network code shows up (and they do once
in a while, as there are so many), nobody panics because we assume it
may not actually be a bug, so we can cross our fingers that it goes
away at the next git fetch... I'm not even sure they all get reported
in the first place.

BUG_ONs are terribly annoying when they trigger, and even worse if
they're false positives, but they're worth the pain in my view.

Of course what's unacceptable is a BUG_ON that can be triggered at
will by userland; that would be a security issue. Just in case, I
verified running two UFFDIO_API in a row and a UFFDIO_REGISTER without
an UFFDIO_API before it, and no BUG_ON triggers with this code
inserted.

That said, it's your choice, so I'm not going to argue further about
this and I'm fine with WARN_ONCE too; there were a few more to convert
in the state machine invariant checks. While at it I can also use
VM_WARN_ONCE to cover my performance concern.
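For the hunk quoted above, the conversion would be something along
these lines (sketch of the suggestion, not necessarily the code that
ends up merged):

	if (cmd != UFFDIO_API) {
		if (ctx->state == UFFD_STATE_WAIT_API)
			return -EINVAL;
		/* "can't happen": don't kill the machine, just refuse the ioctl */
		if (WARN_ON_ONCE(ctx->state != UFFD_STATE_RUNNING))
			return -EINVAL;
	}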

Thanks,
Andrea


[PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads

2015-06-15 Thread Andrea Arcangeli
)  #   17.430 CPUs utilized            ( +-  0.72% )
           190,388  context-switches          #    0.004 M/sec       ( +-  0.85% )
            26,983  cpu-migrations            #    0.595 K/sec       ( +-  1.02% )
           135,232  page-faults               #    0.003 M/sec       ( +-  1.46% )
   113,920,663,471  cycles                    #    2.513 GHz         ( +-  0.66% ) [83.32%]
    83,823,483,951  stalled-cycles-frontend   #   73.58% frontend cycles idle  ( +-  0.65% ) [83.41%]
    35,786,661,114  stalled-cycles-backend    #   31.41% backend  cycles idle  ( +-  0.86% ) [67.29%]
    59,478,650,192  instructions              #    0.52  insns per cycle
                                              #    1.41  stalled cycles per insn  ( +-  0.65% ) [84.04%]
    11,635,219,658  branches                  #  256.660 M/sec       ( +-  0.71% ) [83.69%]
        59,203,898  branch-misses             #    0.51% of all branches  ( +-  2.03% ) [83.54%]

       2.600912438 seconds time elapsed       ( +-  0.02% )

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f9e11ec..a66c4be 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -501,6 +501,7 @@ static unsigned int userfaultfd_poll(struct file *file, 
poll_table *wait)
struct userfaultfd_ctx *ctx = file->private_data;
unsigned int ret;
 
+   /* FIXME: poll_wait_exclusive doesn't exist yet in common code */
poll_wait(file, &ctx->fd_wqh, wait);
 
switch (ctx->state) {
@@ -542,7 +543,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx 
*ctx, int no_wait,
 
/* always take the fd_wqh lock before the fault_pending_wqh lock */
spin_lock(&ctx->fd_wqh.lock);
-   __add_wait_queue(&ctx->fd_wqh, &wait);
+   __add_wait_queue_exclusive(&ctx->fd_wqh, &wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
spin_lock(&ctx->fault_pending_wqh.lock);


[PATCH 7/7] userfaultfd: selftest

2015-06-15 Thread Andrea Arcangeli
This test allocates two virtual areas and bounces the physical memory
across the two virtual areas using only userfaultfd.

This exposed a race condition in the refile of the userfault in
userfaultfd_read and an alignment issue with the address returned to
userland with THP enabled. It also allowed testing the interruption of
userfaults by signals (like running the testcase under gdb).

As expected no sign of memory corruption has ever materialized no
matter how I changed the stress test while developing it. The two bugs
had no impact on the safety and correctness of the memory being
tracked by userfaultfd. The fix for those two bugs was also
straightforward and required no design change of any sort.

Signed-off-by: Andrea Arcangeli 
---
 tools/testing/selftests/vm/Makefile  |   4 +-
 tools/testing/selftests/vm/userfaultfd.c | 669 +++
 2 files changed, 672 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/userfaultfd.c

diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index a5ce953..6f19ecc 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -1,12 +1,14 @@
 # Makefile for vm selftests
 
 CFLAGS = -Wall
-BINARIES = hugepage-mmap hugepage-shm map_hugetlb thuge-gen hugetlbfstest
+BINARIES = hugepage-mmap hugepage-shm map_hugetlb thuge-gen hugetlbfstest 
userfaultfd
 BINARIES += transhuge-stress
 
 all: $(BINARIES)
 %: %.c
$(CC) $(CFLAGS) -o $@ $^ -lrt
+userfaultfd: userfaultfd.c
+   $(CC) $(CFLAGS) -o $@ $^ -g -lrt -lpthread
 
 TEST_PROGS := run_vmtests
 TEST_FILES := $(BINARIES)
diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
new file mode 100644
index 000..418dc33
--- /dev/null
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -0,0 +1,669 @@
+/*
+ * Stress userfaultfd syscall.
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ * This test allocates two virtual areas and bounces the physical
+ * memory across the two virtual areas (from area_src to area_dst)
+ * using userfaultfd.
+ *
+ * There are three threads running per CPU:
+ *
+ * 1) one per-CPU thread takes a per-page pthread_mutex in a random
+ *page of the area_dst (while the physical page may still be in
+ *area_src), and increments a per-page counter in the same page,
+ *and checks its value against a verification region.
+ *
+ * 2) another per-CPU thread handles the userfaults generated by
+ *thread 1 above. userfaultfd blocking reads or poll() modes are
+ *exercised interleaved.
+ *
+ * 3) one last per-CPU thread transfers the memory in the background
+ *at maximum bandwidth (if not already transferred by thread
+ *2). Each cpu thread takes cares of transferring a portion of the
+ *area.
+ *
+ * When all threads of type 3 completed the transfer, one bounce is
+ * complete. area_src and area_dst are then swapped. All threads are
+ * respawned and so the bounce is immediately restarted in the
+ * opposite direction.
+ *
+ * The per-CPU threads of type 1, by triggering userfaults inside
+ * pthread_mutex_lock, will also verify the atomicity of the memory
+ * transfer (UFFDIO_COPY).
+ *
+ * The program takes two parameters: the amounts of physical memory in
+ * megabytes (MiB) of the area and the number of bounces to execute.
+ *
+ * # 100MiB 9 bounces
+ * ./userfaultfd 100 9
+ *
+ * # 1GiB 99 bounces
+ * ./userfaultfd 1000 99
+ *
+ * # 10MiB-~6GiB 999 bounces, continue forever unless an error triggers
+ * while ./userfaultfd $[RANDOM % 6000 + 10] 999; do true; done
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../../../../include/uapi/linux/userfaultfd.h"
+
+#ifdef __x86_64__
+#define __NR_userfaultfd 323
+#elif defined(__i386__)
+#define __NR_userfaultfd 359
+#elif defined(__powewrpc__)
+#define __NR_userfaultfd 364
+#else
+#error "missing __NR_userfaultfd definition"
+#endif
+
+static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
+
+#define BOUNCE_RANDOM  (1<<0)
+#define BOUNCE_RACINGFAULTS(1<<1)
+#define BOUNCE_VERIFY  (1<<2)
+#define BOUNCE_POLL(1<<3)
+static int bounces;
+
+static unsigned long long *count_verify;
+static int uffd, finished, *pipefd;
+static char *area_src, *area_dst;
+static char *zeropage;
+pthread_attr_t attr;
+
+/* pthread_mutex_t starts at page offset 0 */
+#define area_mutex(___area, ___nr) \
+   ((pthread_mutex_t *) ((___area) + (___nr)*page_size))
+/*
+ * count is placed in the page after pthread_mutex_t naturally aligned
+ * to avoid non alignment faults on non-x86 ar

[PATCH 4/7] userfaultfd: avoid missing wakeups during refile in userfaultfd_read

2015-06-15 Thread Andrea Arcangeli
During the refile in userfaultfd_read both waitqueues could look empty
to the lockless wake_userfault(). Use a seqcount to prevent this false
negative, which could leave a userfault blocked.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 8286ec8..f9e11ec 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -45,6 +45,8 @@ struct userfaultfd_ctx {
wait_queue_head_t fault_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
+   /* a refile sequence protected by fault_pending_wqh lock */
+   struct seqcount refile_seq;
/* pseudo fd refcounting */
atomic_t refcount;
/* userfaultfd syscall flags */
@@ -547,6 +549,15 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx 
*ctx, int no_wait,
uwq = find_userfault(ctx);
if (uwq) {
/*
+* Use a seqcount to repeat the lockless check
+* in wake_userfault() to avoid missing
+* wakeups because during the refile both
+* waitqueue could become empty if this is the
+* only userfault.
+*/
+   write_seqcount_begin(&ctx->refile_seq);
+
+   /*
 * The fault_pending_wqh.lock prevents the uwq
 * to disappear from under us.
 *
@@ -570,6 +581,8 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx 
*ctx, int no_wait,
list_del(&uwq->wq.task_list);
__add_wait_queue(&ctx->fault_wqh, &uwq->wq);
 
+   write_seqcount_end(&ctx->refile_seq);
+
/* careful to always initialize msg if ret == 0 */
*msg = uwq->msg;
spin_unlock(&ctx->fault_pending_wqh.lock);
@@ -648,6 +661,9 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
 static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
   struct userfaultfd_wake_range *range)
 {
+   unsigned seq;
+   bool need_wakeup;
+
/*
 * To be sure waitqueue_active() is not reordered by the CPU
 * before the pagetable update, use an explicit SMP memory
@@ -663,8 +679,13 @@ static __always_inline void wake_userfault(struct 
userfaultfd_ctx *ctx,
 * userfaults yet. So we take the spinlock only when we're
 * sure we've userfaults to wake.
 */
-   if (waitqueue_active(&ctx->fault_pending_wqh) ||
-   waitqueue_active(&ctx->fault_wqh))
+   do {
+   seq = read_seqcount_begin(&ctx->refile_seq);
+   need_wakeup = waitqueue_active(&ctx->fault_pending_wqh) ||
+   waitqueue_active(&ctx->fault_wqh);
+   cond_resched();
+   } while (read_seqcount_retry(&ctx->refile_seq, seq));
+   if (need_wakeup)
__wake_userfault(ctx, range);
 }
 
@@ -1223,6 +1244,7 @@ static void init_once_userfaultfd_ctx(void *mem)
init_waitqueue_head(&ctx->fault_pending_wqh);
init_waitqueue_head(&ctx->fault_wqh);
init_waitqueue_head(&ctx->fd_wqh);
+   seqcount_init(&ctx->refile_seq);
 }
 
 /**


[PATCH 2/7] userfaultfd: propagate the full address in THP faults

2015-06-15 Thread Andrea Arcangeli
The THP faults were not propagating the original fault address. The latest
version of the API with uffd.arg.pagefault.address is supposed to propagate the
full address through THP faults.

This was not a kernel crashing bug and it didn't risk corrupting user
memory, but it would cause a SIGBUS failure because the wrong page was
being copied.

For various reasons this wasn't easily reproducible in the qemu
workload, but the stresstest exposed the problem immediately.

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 80d4ae1..73eb404 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -717,13 +717,14 @@ static inline pmd_t mk_huge_pmd(struct page *page, 
pgprot_t prot)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
-   unsigned long haddr, pmd_t *pmd,
+   unsigned long address, pmd_t *pmd,
struct page *page, gfp_t gfp,
unsigned int flags)
 {
struct mem_cgroup *memcg;
pgtable_t pgtable;
spinlock_t *ptl;
+   unsigned long haddr = address & HPAGE_PMD_MASK;
 
VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -765,7 +766,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct 
*mm,
mem_cgroup_cancel_charge(page, memcg);
put_page(page);
pte_free(mm, pgtable);
-   ret = handle_userfault(vma, haddr, flags,
+   ret = handle_userfault(vma, address, flags,
   VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
return ret;
@@ -841,7 +842,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (pmd_none(*pmd)) {
if (userfaultfd_missing(vma)) {
spin_unlock(ptl);
-   ret = handle_userfault(vma, haddr, flags,
+   ret = handle_userfault(vma, address, flags,
   VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
} else {
@@ -865,7 +866,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
-   return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp, 
flags);
+   return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
+   flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,


[PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls

2015-06-15 Thread Andrea Arcangeli
UFFDIO_API was already required before read/poll could work. This
makes the code stricter, requiring it for all other ioctls as well.

All users would already have been required to call UFFDIO_API before
invoking other ioctls, but this makes it more explicit.

This will ensure we can change all ioctls (all but UFFDIO_API/struct
uffdio_api) with a bump of uffdio_api.api.

There's no actual plan or need to change the API or the ioctls; the
current API should already cover the non cooperative usage fine, but
this helps the longer term future just in case.
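From userland the required handshake is just the following (sketch;
uffd is the fd returned by the userfaultfd syscall):

	struct uffdio_api api = {
		.api = UFFD_API,
		.features = 0,
	};

	/* must be the first ioctl on the fd or everything else fails */
	if (ioctl(uffd, UFFDIO_API, &api) || api.api != UFFD_API)
		errx(1, "userfaultfd API mismatch");
	/* only from here on UFFDIO_REGISTER and friends are allowed */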

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5f11678..b69d236 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1115,6 +1115,12 @@ static long userfaultfd_ioctl(struct file *file, 
unsigned cmd,
int ret = -EINVAL;
struct userfaultfd_ctx *ctx = file->private_data;
 
+   if (cmd != UFFDIO_API) {
+   if (ctx->state == UFFD_STATE_WAIT_API)
+   return -EINVAL;
+   BUG_ON(ctx->state != UFFD_STATE_RUNNING);
+   }
+
switch(cmd) {
case UFFDIO_API:
ret = userfaultfd_api(ctx, arg);


[PATCH 6/7] userfaultfd: Revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"

2015-06-15 Thread Andrea Arcangeli
This reverts commit 855c5a9026b0fce58c8de5382ef8ce00f74c1331 and
adapts fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we have full control of what
we insert into the two waitqueue heads of the blocked userfaults. No
exclusive waitqueue entry risks being inserted into those two
waitqueue heads, so we can just as well stick to the "nr == 1" of the
old code and rely purely on the fact that no waitqueue entry inserted
into the two waitqueue heads we must enforce as wakeall has
wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 8 
 include/linux/wait.h | 5 ++---
 kernel/sched/wait.c  | 7 +++
 net/sunrpc/sched.c   | 2 +-
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a66c4be..e601dd8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -467,8 +467,8 @@ static int userfaultfd_release(struct inode *inode, struct 
file *file)
 * the fault_*wqh.
 */
spin_lock(&ctx->fault_pending_wqh.lock);
-   __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, 0, &range);
-   __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
+   __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, &range);
+   __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, &range);
spin_unlock(&ctx->fault_pending_wqh.lock);
 
wake_up_poll(&ctx->fd_wqh, POLLHUP);
@@ -652,10 +652,10 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
spin_lock(&ctx->fault_pending_wqh.lock);
/* wake all in the range and autoremove */
if (waitqueue_active(&ctx->fault_pending_wqh))
-   __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, 0,
+   __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
 range);
if (waitqueue_active(&ctx->fault_wqh))
-   __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+   __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, range);
spin_unlock(&ctx->fault_pending_wqh.lock);
 }
 
diff --git a/include/linux/wait.h b/include/linux/wait.h
index cf884cf..2db8334 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,8 +147,7 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t 
*old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
- void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void 
*key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -180,7 +179,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)  \
-   __wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
+   __wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
 #define wake_up_interruptible_poll(x, m)   \
__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)  \
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 6da208dd2..852143a 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -106,10 +106,9 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int 
mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
- void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
 {
-   __wake_up_common(q, mode, nr, 0, key);
+   __wake_up_common(q, mode, 1, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -284,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, 
wait_queue_t *wait,
if (!list_empty(&wait->task_list))
list_del_init(&wait->task_list);
else if (waitqueue_active(q))
-   __wake_up_locked_key(q, mode, 1, key);
+   __wake_up_locked_key(q, mode, key);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index b140c09..337ca85 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_c

[PATCH 3/7] userfaultfd: allow signals to interrupt a userfault

2015-06-15 Thread Andrea Arcangeli
This is only simple to achieve if the userfault is going to return to
userland (not to the kernel), because we can avoid returning
VM_FAULT_RETRY despite having temporarily released the mmap_sem. The
fault would then just be retried by userland. This is safe at least on
x86 and powerpc (the two archs with the syscall implemented so far).

Hint to verify for which archs this is safe: after handle_mm_fault
returns, the fault code in arch/*/mm/fault.c must not access any data
structures protected by the mmap_sem until up_read(&mm->mmap_sem) is
called.

This has two main benefits: signals can run with lower latency in
production (signals aren't blocked by userfaults and userfaults are
immediately repeated after signal processing), and gdb can then
trivially debug threads blocked in this kind of userfault coming
directly from userland.

On a side note: while gdb needs signals to be processed, coredumps
have always worked perfectly with userfaults, no matter whether the
userfault is triggered by GUP, a kernel copy_user, or directly from
userland.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 35 ---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b69d236..8286ec8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -262,7 +262,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned 
long address,
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
-   bool must_wait;
+   bool must_wait, return_to_userland;
 
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -327,6 +327,9 @@ int handle_userfault(struct vm_area_struct *vma, unsigned 
long address,
uwq.msg = userfault_msg(address, flags, reason);
uwq.ctx = ctx;
 
+   return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+   (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+
spin_lock(&ctx->fault_pending_wqh.lock);
/*
 * After the __add_wait_queue the uwq is visible to userland
@@ -338,14 +341,16 @@ int handle_userfault(struct vm_area_struct *vma, unsigned 
long address,
 * following the spin_unlock to happen before the list_add in
 * __add_wait_queue.
 */
-   set_current_state(TASK_KILLABLE);
+   set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
+ TASK_KILLABLE);
spin_unlock(&ctx->fault_pending_wqh.lock);
 
must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
up_read(&mm->mmap_sem);
 
if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
-  !fatal_signal_pending(current))) {
+  (return_to_userland ? !signal_pending(current) :
+		    !fatal_signal_pending(current)))) {
wake_up_poll(&ctx->fd_wqh, POLLIN);
schedule();
ret |= VM_FAULT_MAJOR;
@@ -353,6 +358,30 @@ int handle_userfault(struct vm_area_struct *vma, unsigned 
long address,
 
__set_current_state(TASK_RUNNING);
 
+   if (return_to_userland) {
+   if (signal_pending(current) &&
+   !fatal_signal_pending(current)) {
+   /*
+* If we got a SIGSTOP or SIGCONT and this is
+* a normal userland page fault, just let
+* userland return so the signal will be
+* handled and gdb debugging works.  The page
+* fault code immediately after we return from
+* this function is going to release the
+* mmap_sem and it's not depending on it
+* (unlike gup would if we were not to return
+* VM_FAULT_RETRY).
+*
+* If a fatal signal is pending we still take
+* the streamlined VM_FAULT_RETRY failure path
+* and there's no need to retake the mmap_sem
+* in such case.
+*/
+   down_read(&mm->mmap_sem);
+   ret = 0;
+   }
+   }
+
/*
 * Here we race with the list_del; list_add in
 * userfaultfd_ctx_read(), however because we don't ever run
--


[PATCH 0/7] userfault21 update

2015-06-15 Thread Andrea Arcangeli
This is an incremental update to the userfaultfd code in -mm.

This fixes two bugs that could cause some malfunction (but nothing
that could cause memory corruption or kernel crashes of any sort,
neither in kernel nor userland).

This also introduces some enhancements: gdb now runs fine, signals can
interrupt userfaults (userfaults are retried when the signal handler
returns), blocking reads got wake-one behavior (with benchmark results
in the commit header), the UFFDIO_API invocation is enforced before any
other ioctl can run (to enforce future backwards compatibility just in
case of API bumps), and one dependency on a scheduler change has been
reverted.

Notably this introduces the testsuite as well. A good way to run the
testsuite is:

# it will use 10MiB-~6GiB 999 bounces, continue forever unless an error triggers
while ./userfaultfd $[RANDOM % 6000 + 10] 999; do true; done

What cost a significant amount of time had nothing to do with
userfaultfd. The testsuite exposed erratic memcmp/bcmp retvals if part
of the strings being compared can change under memcmp/bcmp (while
still differing in other parts of the string that aren't actually
changing). I will provide a separate standalone testcase for this not
using userfaultfd (I already created it to be sure it isn't a bug in
userfaultfd, and nevertheless my my_bcmp works fine even with
userfaultfd). Trusting that memcmp/bcmp would eventually return the
correct result, I initially (but erroneously) translated the symptom,
in kernel-speak, into a missing TLB flush (or cache flush, but that's
unlikely on x86), a pagefault somehow hitting the zeropage, or some
other subtle kernel bug. Eventually I had to consider the possibility
that the memcmp or bcmp core library functions were broken, despite
how unlikely this
sounds. It might be possible that this only happens if the memory
changing is inside the "len" range being compared and that nothing
goes wrong if the data changing is beyond the end of the "len" even if
in the same cacheline. So it might be possible that it's perfectly
correct in C standard terms, but the total erratic result is
unacceptable to me and it makes memcmp/bcmp very risky to use in
multithreaded programs. I will ensure this gets fixed in my systems
with perhaps slower versions of memcmp/bcmp. If the two pages never
actually are the same at any given time (no matter if they're
changing) both bcmp and memcmp can't keep returning an erratic racy 0
here. If this is safe by C standard, this still wouldn't be safe
enough for me. It's unclear how this erratic result materializes at
this point and if SIMD instructions have special restrictions on
memory that is modified by other CPUs. CPU bugs in SIMD cannot be
ruled out either yet.
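
For reference, this is the kind of byte-by-byte comparison loop (a
sketch along the lines of the my_bcmp used in the selftest, not the
exact selftest code) that never returned the erratic 0 in the same
test:

static int my_bcmp(const char *str1, const char *str2, unsigned long n)
{
	unsigned long i;

	/* plain byte-by-byte compare, no SIMD: no erratic retvals observed */
	for (i = 0; i < n; i++)
		if (str1[i] != str2[i])
			return 1;
	return 0;
}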

Andrea Arcangeli (7):
  userfaultfd: require UFFDIO_API before other ioctls
  userfaultfd: propagate the full address in THP faults
  userfaultfd: allow signals to interrupt a userfault
  userfaultfd: avoid missing wakeups during refile in userfaultfd_read
  userfaultfd: switch to exclusive wakeup for blocking reads
  userfaultfd: Revert "userfaultfd: waitqueue: add nr wake parameter to
__wake_up_locked_key"
  userfaultfd: selftest

 fs/userfaultfd.c |  78 +++-
 include/linux/wait.h |   5 +-
 kernel/sched/wait.c  |   7 +-
 mm/huge_memory.c |  10 +-
 net/sunrpc/sched.c   |   2 +-
 tools/testing/selftests/vm/Makefile  |   4 +-
 tools/testing/selftests/vm/userfaultfd.c | 669 +++
 7 files changed, 752 insertions(+), 23 deletions(-)
 create mode 100644 tools/testing/selftests/vm/userfaultfd.c
--


Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

2015-05-22 Thread Andrea Arcangeli
On Fri, May 22, 2015 at 02:18:30PM -0700, Andrew Morton wrote:
> 
> There's a more serious failure with i386 allmodconfig:
> 
> fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON'
>   BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);
> 
> I'm surprised the feature is even reachable on i386 builds?

Unless we risk running out of vma->vm_flags there's no particular
reason not to enable it on 32bit (even if we run out, making vm_flags
an unsigned long long is a few-line patch). Certainly it's less useful
on 32bit as there's a 3G limit, but the max vmas per process are still
a small fraction of that. Especially if used for the volatile pages
on-demand notification of page reclaim, it could end up useful on
arm32 (the S6 is 64bit I think and the latest snapdragon is too, so
perhaps it's too late anyway, but again it's not a big deal).

Removing the BUILD_BUG_ON I think is not ok here because, while I'm ok
with supporting 32bit archs, I don't want translation: the 64bit kernel
should talk with the 32bit app directly, without a layer in between.

I had tried to avoid using packed, thinking that without packed I
could not get the alignment wrong (and a future union couldn't get it
wrong either), and that I could avoid those reserved1/2/3 fields, but
it's more robust to use packed in combination with the BUILD_BUG_ON to
detect right away problems like this with 32bit builds that align
things differently.

I'm actually surprised the buildbot that sends me email about all
archs didn't send me anything about this for 32bit x86. Perhaps I'm
overlooking something, or x86 32bit (or any other 32bit arch for that
matter) isn't being checked? This is actually a fairly recent change;
perhaps the buildbot was shut down recently? That buildbot was very
useful for detecting problems like this.

===
>From 2f0a48670dc515932dec8b983871ec35caeba553 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Sat, 23 May 2015 02:26:32 +0200
Subject: [PATCH] userfaultfd: update the uffd_msg structure to be the same on
 32/64bit

Avoiding to using packed allowed the code to be nicer and it avoided
the reserved1/2/3 but the structure must be the same for 32bit and
64bit archs so x86 applications built with the 32bit ABI can run on
the 64bit kernel without requiring translation of the data read
through the read syscall.

$ gcc -m64 p.c && ./a.out
32
0
16
8
8
16
24
$ gcc -m32 p.c && ./a.out
32
0
16
8
8
16
24

/* p.c: struct uffd_msg comes from the patched include/uapi/linux/userfaultfd.h */
#include <stdio.h>
#include <linux/userfaultfd.h>

int main()
{
	printf("%lu\n", sizeof(struct uffd_msg));
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->event);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.pagefault.address);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.pagefault.flags);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved1);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved2);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved3);
}

Reported-by: Andrew Morton 
Signed-off-by: Andrea Arcangeli 
---
 include/uapi/linux/userfaultfd.h | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index c8a543f..00d28e2 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -59,9 +59,13 @@
 struct uffd_msg {
 	__u8	event;
 
+	__u8	reserved1;
+	__u16	reserved2;
+	__u32	reserved3;
+
union {
struct {
-   __u32   flags;
+   __u64   flags;
__u64   address;
} pagefault;
 
@@ -72,7 +76,7 @@ struct uffd_msg {
__u64   reserved3;
} reserved;
} arg;
-};
+} __attribute__((packed));
 
 /*
  * Start at 0x12 and not at 0 to be more strict against bugs.


--


Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

2015-05-22 Thread Andrea Arcangeli
On Fri, May 22, 2015 at 01:18:22PM -0700, Andrew Morton wrote:
> On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli  
> wrote:
> 
> > If the rwsem starves writers it wasn't strictly a bug but lockdep
> > doesn't like it and this avoids depending on lowlevel implementation
> > details of the lock.
> > 
> > ...
> >
> > @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct 
> > mm_struct *dst_mm,
> >  
> > if (!zeropage)
> > err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> > -  dst_addr, src_addr);
> > +  dst_addr, src_addr, &page);
> > else
> > err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
> >  dst_addr);
> >  
> > cond_resched();
> >  
> > +   if (unlikely(err == -EFAULT)) {
> > +   void *page_kaddr;
> > +
> > +   BUILD_BUG_ON(zeropage);
> 
> I'm not sure what this is trying to do.  BUILD_BUG_ON(local_variable)?
> 
> It goes bang in my build.  I'll just delete it.

Yes, it has to be a false positive failure, so it's fine to drop
it. My gcc 4.8.4 can look inside the called static function and see that
only mcopy_atomic_pte can return -EFAULT. RHEL7 (4.8.3) gcc didn't
complain either. Perhaps to make the BUILD_BUG_ON work with older gcc
it requires a local variable set explicitly in the callee, but it's
not worth it.

It would be bad if we end up in the -EFAULT path in the zeropage case
(if somebody later adds an apparently innocent -EFAULT retval and
unexpectedly ends up in the mcopy_atomic_pte retry logic), but it's
not important, the caller should be reviewed before improvising new
retvals anyway.

The retry loop addition and the BUILD_BUG_ON are all about the
copy_from_user run while we already hold the mmap_sem (potentially of
a different process in the non-cooperative case, but it's a problem if
it's the current task's mmap_sem in case the rwlock implementation
changes to avoid write starvation and becomes non-reentrant). lockdep
definitely complains (even if I think in practice it'd be safe to
read-lock recurse; we only ever got lockdep complaints, never
deadlocks in fact). I didn't want to call gup_fast as copy_from_user
is faster and I get a usable user mapping with the TLB entry likely
hot too. The
lockdep warnings we hit I think were associated with NUMA hinting
faults or something infrequent like that, the fast path doesn't need
to retry.

Thanks,
Andrea
--


Re: [PATCH 00/23] userfaultfd v4

2015-05-21 Thread Andrea Arcangeli
Hi Kirill,

On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
> Sorry for maybe speaking up too late, but here is additional real

Not too late, in fact I don't think there's any change required for
this at this stage, but it'd be great if you could help me to review.

> Since arrays can be large, it would be slow and thus not practical to
[..]
> So I've implemented a scheme where array data is initially PROT_READ
> protected, then we catch SIGSEGV, if it is write and area belongs to array

In the case of postcopy live migration (for qemu and/or containers) and
postcopy live snapshotting, splitting the vmas is not an option
because we may run out of them.

If your PROT_READ areas are limited perhaps this isn't an issue, but
with hundreds-of-GB guests (currently plenty in production) that need
to live migrate fully reliably and fast, the vmas could exceed the
limit if we were to use mprotect. If your arrays are very large and
the PROT_READ areas aren't limited, using userfaultfd isn't only an
optimization for you either, it's actually a must to avoid a potential
-ENOMEM.
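
As a rough illustration (a sketch, not something from the original
thread), per-page mprotect on a large mapping stops working as soon as
the vma count hits the vm.max_map_count sysctl (65530 by default),
because every protection change on an isolated page can split one vma
into up to three:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len = 1UL << 30;	/* 1GiB: far more pages than max_map_count */
	size_t off;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	for (off = 0; off < len; off += 2 * page) {
		/* every other page gets a different protection -> vma splits */
		if (mprotect(p + off, page, PROT_READ)) {
			printf("mprotect: %s after %zu splits\n",
			       strerror(errno), off / (2 * page));
			return 0;
		}
	}
	printf("never hit the vma limit\n");
	return 0;
}

On a default-configured kernel the loop fails with ENOMEM long before
the end of the mapping, which is the -ENOMEM mentioned above.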

> Also, since arrays could be large - bigger than RAM, and only sparse
> parts of it could be needed to get needed information, for reading it
> also makes sense to lazily load data in SIGSEGV handler with initial
> PROT_NONE protection.

Similarly I heard somebody wrote a fastresume to load the suspended
(on disk) guest ram using userfaultfd. That is a slightly less
fundamental case than postcopy because you could do it also with
MAP_SHARED, but it's still interesting in that it allows compressing or
decompressing the suspended ram on the fly with lz4 for example,
something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
additional benefit of not having an orphaned inode left open even if
the file is deleted, which would prevent unmounting the filesystem for
the whole lifetime of the guest).

> This is very similar to how memory mapped files work, but adds
> transactionality which, as far as I know, is not provided by any
> currently in-kernel filesystem on Linux.

That's another benefit yes.

> The gist of virtual memory-manager is this:
> 
> 
> https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
> https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  
> (vma_on_pagefault)

I'll check it more in detail ASAP, thanks for the pointers!

> For operations it currently needs
> 
> - establishing virtual memory areas and connecting to tracking it

That's the UFFDIO_REGISTER/UNREGISTER.

> - changing pages protection
> 
> PROT_NONE or absent - initially

absent is what works with -mm already. The lazy loading already works.

> PROT_NONE   -> PROT_READ- after read

Current UFFDIO_COPY will map it using vma->vm_page_prot.

We'll need a new flag for UFFDIO_COPY to map it readonly. This is
already contemplated:

/*
 * There will be a wrprotection flag later that allows to map
 * pages wrprotected on the fly. And such a flag will be
 * available if the wrprotection ioctl are implemented for the
 * range according to the uffdio_register.ioctls.
 */
#define UFFDIO_COPY_MODE_DONTWAKE   ((__u64)1<<0)
__u64 mode;

If the memory protection framework exists (either through the
uffdio_register.ioctl out value, or through uffdio_api.features
out-only value) you can pass a new flag (MODE_WP) above to transition
from "absent" to "PROT_READ".

> PROT_READ   -> PROT_READWRITE   - after write

This will need to add UFFDIO_MPROTECT.

> PROT_READWRITE  -> PROT_READ- after commit

UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a
slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want
to run this UFFDIO_MPROTECT without stopping the threads that are
accessing the memory concurrently).

And this should only work if the uffdio_register.mode had MODE_WP set,
so we don't run into the races created by COWs (gup vs fork race).

> PROT_READWRITE  -> PROT_NONE or absent (again)  - after abort

UFFDIO_MPROTECT again, but you won't be able to read the page contents
inside the memory manager thread (the one working with
userfaultfd).

The manager is at all times forbidden to touch the memory it is
tracking with userfaultfd (if it does it'll deadlock, but kill -9 will
get rid of it). gdb, ironically because it is using an underoptimized
access_process_vm, wouldn't hang: FAULT_FLAG_RETRY won't be set in
handle_userfault in the gdb context, and it'll just receive a sigbus
if by mistake the user tries to touch the memory. Even if it will hang
later once get_user_pages_locked|unlocked gets used there too, kill -9
would solve gdb too.

Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE)
memory: to do that a new ioctl should be required. I'd rather not go
back to the route of 

[PATCH 20/23] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI

2015-05-14 Thread Andrea Arcangeli
This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.

Signed-off-by: Andrea Arcangeli 
---
 include/uapi/linux/userfaultfd.h | 42 +++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 8e42bc3..c8a543f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -21,7 +21,9 @@
 (__u64)1 << _UFFDIO_UNREGISTER |   \
 (__u64)1 << _UFFDIO_API)
 #define UFFD_API_RANGE_IOCTLS  \
-   ((__u64)1 << _UFFDIO_WAKE)
+   ((__u64)1 << _UFFDIO_WAKE | \
+(__u64)1 << _UFFDIO_COPY | \
+(__u64)1 << _UFFDIO_ZEROPAGE)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -34,6 +36,8 @@
 #define _UFFDIO_REGISTER   (0x00)
 #define _UFFDIO_UNREGISTER (0x01)
 #define _UFFDIO_WAKE   (0x02)
+#define _UFFDIO_COPY   (0x03)
+#define _UFFDIO_ZEROPAGE   (0x04)
 #define _UFFDIO_API(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -46,6 +50,10 @@
 struct uffdio_range)
 #define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
 				     struct uffdio_range)
+#define UFFDIO_COPY		_IOWR(UFFDIO, _UFFDIO_COPY,	\
+				      struct uffdio_copy)
+#define UFFDIO_ZEROPAGE	_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
+				      struct uffdio_zeropage)
 
 /* read() structure */
 struct uffd_msg {
@@ -118,4 +126,36 @@ struct uffdio_register {
__u64 ioctls;
 };
 
+struct uffdio_copy {
+   __u64 dst;
+   __u64 src;
+   __u64 len;
+   /*
+* There will be a wrprotection flag later that allows to map
+* pages wrprotected on the fly. And such a flag will be
+* available if the wrprotection ioctl are implemented for the
+* range according to the uffdio_register.ioctls.
+*/
+#define UFFDIO_COPY_MODE_DONTWAKE  ((__u64)1<<0)
+   __u64 mode;
+
+   /*
+* "copy" is written by the ioctl and must be at the end: the
+* copy_from_user will not read the last 8 bytes.
+*/
+   __s64 copy;
+};
+
+struct uffdio_zeropage {
+   struct uffdio_range range;
+#define UFFDIO_ZEROPAGE_MODE_DONTWAKE  ((__u64)1<<0)
+   __u64 mode;
+
+   /*
+* "zeropage" is written by the ioctl and must be at the end:
+* the copy_from_user will not read the last 8 bytes.
+*/
+   __s64 zeropage;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
--
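
To make the new uAPI concrete, here is a minimal userland sketch (not
part of the patch; the function and parameter names are made up for
illustration, and it assumes the uapi header from this series is
installed) of how a manager thread could resolve a missing fault with
UFFDIO_COPY:

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int resolve_missing_fault(int uffd, void *fault_addr,
				 const void *src_page, long page_size)
{
	struct uffdio_copy copy;

	/* dst must be page aligned, src can be any readable buffer */
	copy.dst = (unsigned long) fault_addr & ~(page_size - 1);
	copy.src = (unsigned long) src_page;
	copy.len = page_size;
	copy.mode = 0;	/* no DONTWAKE: wake the blocked faulting thread */

	if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
		return -1;
	/* "copy" is written back by the kernel at the end of the struct */
	return copy.copy == page_size ? 0 : -1;
}

The same pattern applies to UFFDIO_ZEROPAGE with a struct
uffdio_zeropage and its range field.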


[PATCH 17/23] userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read

2015-05-14 Thread Andrea Arcangeli
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.

Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API perspective; furthermore, solving it in
the kernel allows removing a whole bunch of mutex and bitmap code from
qemu, making it faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.

Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.

Without this patch qemu, in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
became REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received a userfault).

vcpu                    background_thr                  userfault_thr
----                    --------------                  -------------
vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000
                          (no wakeup, still pending)

vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked

                                                        poll() -> POLLIN
                                                        read() -> 0x7fb76a139000
                                                        postcopy_pmi_change_state(MISSING,
                                                          REQUESTED) -> REQUESTED

                        tmp_state = postcopy_pmi_change_state(old_state,
                          RECEIVED) -> REQUESTED
                        /* check that no userfault raced with UFFDIO_COPY */
                        if (old_state == MISSING && tmp_state == REQUESTED)
                                UFFDIO_WAKE from background thread

And a second case where a UFFDIO_WAKE would be needed is in the userfault 
thread:

vcpu                    background_thr                  userfault_thr
----                    --------------                  -------------
vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000
                          (no wakeup, still pending)
                        tmp_state = postcopy_pmi_change_state(old_state,
                          RECEIVED) -> RECEIVED

vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked

                                                        poll() -> POLLIN
                                                        read() -> 0x7fb76a139000

                                                        if (postcopy_pmi_change_state(MISSING,
                                                          REQUESTED) == RECEIVED)
                                                                UFFDIO_WAKE from userfault thread

This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 81 +---
 1 file changed, 66 insertions(+), 15 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5542fe7..6772c22 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -167,6 +167,67 @@ static inline struct uffd_msg userfault_msg(unsigned long 
address,
 }
 
 /*
+ * Verify the pagetables are still not ok after having registered into
+ * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
+ * userfault that has already been resolved, if userfaultfd_read and
+ * UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different
+ * threads.
+ */
+static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
+unsigned long address,
+unsigned long flags,
+unsigned long reason)
+{
+   struct mm_struct *mm = ctx->mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd, _pmd;
+   pte_t *pte;
+   bool ret = true;
+
+   VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+   pgd = pgd_offset(mm, address);
+   if (!pgd_present(*pgd))
+   goto out;
+   pud = pud_offset(pgd, address);
+   if (!pud_present(*pud))
+   goto out;
+   pmd = pmd_offset(pud, address);
+   /*
+* READ_ONCE must function as a barrier with narrower scope
+* and it must be equivalent to:
+*  _pmd = *pmd; barrier();
+*
+* This is to deal with the instability (as in
+* pmd_trans_unstable) of the pmd.
+*/
+   _pmd = READ_ONCE(*pmd);
+   if (!pmd_present(_pmd))
+   goto out;
+
+   ret = false;
+   if (pmd_trans_huge(_pmd))
+   goto out;
+
+   /*
+   

[PATCH 06/23] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP

2015-05-14 Thread Andrea Arcangeli
These two flags get set in vma->vm_flags to tell the VM common code
if the userfaultfd is armed and in which mode (only tracking missing
faults, only tracking wrprotect faults, or both). If neither flag is
set it means the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli 
---
 fs/proc/task_mmu.c | 2 ++
 include/linux/mm.h | 2 ++
 kernel/fork.c  | 2 +-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d..58be92e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -597,6 +597,8 @@ static void show_smap_vma_flags(struct seq_file *m, struct 
vm_area_struct *vma)
[ilog2(VM_HUGEPAGE)]= "hg",
[ilog2(VM_NOHUGEPAGE)]  = "nh",
[ilog2(VM_MERGEABLE)]   = "mg",
+   [ilog2(VM_UFFD_MISSING)]= "um",
+   [ilog2(VM_UFFD_WP)] = "uw",
};
size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2923a51..7fccd22 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -123,8 +123,10 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MAYSHARE	0x00000080
 
 #define VM_GROWSDOWN	0x00000100	/* general info on the segment */
+#define VM_UFFD_MISSING	0x00000200	/* missing pages tracking */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
+#define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
 #define VM_IO		0x00004000	/* Memory mapped I/O or similar */
diff --git a/kernel/fork.c b/kernel/fork.c
index 430141b..56d1ddf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -449,7 +449,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct 
*oldmm)
tmp->vm_mm = mm;
if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
-   tmp->vm_flags &= ~VM_LOCKED;
+   tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
tmp->vm_next = tmp->vm_prev = NULL;
tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
file = tmp->vm_file;
--


[PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI

2015-03-05 Thread Andrea Arcangeli
This implements the uABI of UFFDIO_REMAP.

Notably one mode bitflag is also forwarded (and in turn known) by the
lowlevel remap_pages method.

Signed-off-by: Andrea Arcangeli 
---
 include/uapi/linux/userfaultfd.h | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 61251e6..db6e99a 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
 #define UFFD_API_RANGE_IOCTLS  \
((__u64)1 << _UFFDIO_WAKE | \
 (__u64)1 << _UFFDIO_COPY | \
-(__u64)1 << _UFFDIO_ZEROPAGE)
+(__u64)1 << _UFFDIO_ZEROPAGE | \
+(__u64)1 << _UFFDIO_REMAP)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -34,6 +35,7 @@
 #define _UFFDIO_WAKE   (0x02)
 #define _UFFDIO_COPY   (0x03)
 #define _UFFDIO_ZEROPAGE   (0x04)
+#define _UFFDIO_REMAP  (0x05)
 #define _UFFDIO_API(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -50,6 +52,8 @@
  struct uffdio_copy)
 #define UFFDIO_ZEROPAGE	_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_REMAP		_IOWR(UFFDIO, _UFFDIO_REMAP,	\
+				      struct uffdio_remap)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -122,4 +126,25 @@ struct uffdio_zeropage {
__s64 wake;
 };
 
+struct uffdio_remap {
+   __u64 dst;
+   __u64 src;
+   __u64 len;
+   /*
+* Especially if used to atomically remove memory from the
+* address space the wake on the dst range is not needed.
+*/
+#define UFFDIO_REMAP_MODE_DONTWAKE ((__u64)1<<0)
+#define UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES  ((__u64)1<<1)
+   __u64 mode;
+
+   /*
+* "remap" and "wake" are written by the ioctl and must be at
+* the end: the copy_from_user will not read the last 16
+* bytes.
+*/
+   __s64 remap;
+   __s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
--


[PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed

2015-03-05 Thread Andrea Arcangeli
If userfaultfd is armed on a certain vma we can't "fill" the holes
with zeroes or we'll break the userland on-demand paging. The holes,
if the userfault is armed, are really missing information (not zeroes)
that the userland has to load from the network or elsewhere.

The same issue happens for wrprotected ptes that we can't just convert
into a single writable pmd_trans_huge.

We could however in theory still merge across zeropages if only
VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be
slightly improved but it'd be much more complex code for a tiny corner
case.

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5374132..8f1b6a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2145,7 +2145,8 @@ static int __collapse_huge_page_isolate(struct 
vm_area_struct *vma,
 _pte++, address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-   if (++none_or_zero <= khugepaged_max_ptes_none)
+   if (!userfaultfd_armed(vma) &&
+   ++none_or_zero <= khugepaged_max_ptes_none)
continue;
else
goto out;
@@ -2593,7 +2594,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 _pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-   if (++none_or_zero <= khugepaged_max_ptes_none)
+   if (!userfaultfd_armed(vma) &&
+   ++none_or_zero <= khugepaged_max_ptes_none)
continue;
else
goto out_unmap;
--


Re: [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Andrea Arcangeli
On Thu, Mar 05, 2015 at 09:39:48AM -0800, Linus Torvalds wrote:
> Is this really worth it? On real loads? That people are expected to use?

I fully agree that it's not worth merging UFFDIO_REMAP upstream until
(and unless) a real world usage for it shows up. To further clarify:
had this not been an RFC, the patchset would have stopped at patch
number 15/21 included.

Merging UFFDIO_REMAP with no real life users, would just increase the
attack vector surface of the kernel for no good.

Thanks for your idea that the UFFDIO_COPY is faster, the userland code
we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE, it never uses
UFFDIO_REMAP. I immediately agreed about UFFDIO_COPY being preferable
after you mentioned it during review of the previous RFC.

However this being a RFC with a large audience, and UFFDIO_REMAP
allowing to "remove" memory (think like externalizing memory into to
ceph with deduplication or such), I still added it just in case there
are real world use cases that may justify me keeping it around (even
if I would definitely not have submitted it for merging in the short
term regardless).

In addition of dropping the parts that aren't suitable for merging in
the short term like UFFDIO_REMAP, for any further submits that won't
substantially alter the API like it happened between the v2 to v3
RFCs, I'll also shrink the To/Cc list considerably.

> Considering how we just got rid of one special magic VM remapping
> thing that nobody actually used, I'd really hate to add a new one.

Having to define an API somehow, I tried to think of all possible
future usages and make sure the API would allow for those if needed.

> Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
> argue for not merging the remap parts. I just don't see the point. It
> doesn't seem to add anything that is semantically very important -
> it's *potentially* a faster copy, but even that is
> 
>   (a) questionable in the first place

Yes, we already measured the UFFDIO_COPY is faster than UFFDIO_REMAP,
the userfault latency decreases -20%.

> 
> and
> 
>  (b) unclear why anybody would ever care about performance of
> infrastructure that nobody actually uses today, and future use isn't
> even clear or shown to be particualrly performance-sensitive.

The only potential _theoretical_ case that justifies the existence of
UFFDIO_REMAP is "removing" memory from the address space. To "add"
memory, UFFDIO_COPY and UFFDIO_ZEROPAGE are always preferable, like
you suggested.

> So basically I'd like to see better documentation, a few real use
> cases (and by real I very much do *not* mean "you can use it for
> this", but actual patches to actual projects that matter and that are
> expected to care and merge them), and a simplified series that doesn't
> do the remap thing.

So far I wrote some doc in 2/21 and in the cover letter, but certainly
more docs are necessary. Trinity is also needed (I got trinity running
on the v2 API but I haven't adapted to the new API yet).

About the real world usages, this is the primary one:

http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

And it actually cannot be merged in qemu until userfaultfd is merged
in the kernel. There's simply no safe way to implement postcopy live
migration without something equivalent to the userfaultfd if all Linux
VM features are intended to be retained in the destination node.

> Because *every* time we add a new clever interface, we end up with
> approximately zero users and just pain down the line. Examples:
> splice, mremap, yadda yadda.

Aside from mremap which I think is widely used, I totally agree in
principle.

For now I can quite comfortably guarantee the above real life user for
userfaultfd (qemu), but there are potentially 5 of them. And none needs
UFFDIO_REMAP, which is again why I totally agree about not submitting
it for merging; it was intended only for the initial RFC, to share the
idea of "removing" the memory with a larger audience before I shrink
the Cc/To list for further updates.

Thanks,
Andrea
--


[PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct

2015-03-05 Thread Andrea Arcangeli
This adds the vm_userfaultfd_ctx to the vm_area_struct.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm_types.h | 11 +++
 kernel/fork.c|  1 +
 2 files changed, 12 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03a..fbf21f5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -247,6 +247,16 @@ struct vm_region {
* this region */
 };
 
+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+   struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -313,6 +323,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
 #endif
+   struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 };
 
 struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..cb215c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct 
*oldmm)
goto fail_nomem_anon_vma_fork;
tmp->vm_flags &= ~VM_LOCKED;
tmp->vm_next = tmp->vm_prev = NULL;
+   tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
file = tmp->vm_file;
if (file) {
struct inode *inode = file_inode(file);
--


[PATCH 03/21] userfaultfd: uAPI

2015-03-05 Thread Andrea Arcangeli
Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

Signed-off-by: Andrea Arcangeli 
---
 Documentation/ioctl/ioctl-number.txt |  1 +
 include/uapi/linux/userfaultfd.h | 81 
 2 files changed, 82 insertions(+)
 create mode 100644 include/uapi/linux/userfaultfd.h

diff --git a/Documentation/ioctl/ioctl-number.txt 
b/Documentation/ioctl/ioctl-number.txt
index 8136e1f..be2d4a2 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -301,6 +301,7 @@ Code  Seq#(hex) Include FileComments
 0xA3   80-8F   Port ACLin development:
<mailto:tle...@mindspring.com>
 0xA3   90-9F   linux/dtlk.h
+0xAA   00-3F   linux/uapi/linux/userfaultfd.h
 0xAB   00-1F   linux/nbd.h
 0xAC   00-1F   linux/raw.h
 0xAD   00  Netfilter devicein development:
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
new file mode 100644
index 000..9a8cd56
--- /dev/null
+++ b/include/uapi/linux/userfaultfd.h
@@ -0,0 +1,81 @@
+/*
+ *  include/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#define UFFD_API ((__u64)0xAA)
+/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_IOCTLS\
+   ((__u64)1 << _UFFDIO_REGISTER | \
+(__u64)1 << _UFFDIO_UNREGISTER |   \
+(__u64)1 << _UFFDIO_API)
+#define UFFD_API_RANGE_IOCTLS  \
+   ((__u64)1 << _UFFDIO_WAKE)
+
+/*
+ * Valid ioctl command number range with this API is from 0x00 to
+ * 0x3F.  UFFDIO_API is the fixed number, everything else can be
+ * changed by implementing a different UFFD_API. If sticking to the
+ * same UFFD_API more ioctl can be added and userland will be aware of
+ * which ioctl the running kernel implements through the ioctl command
+ * bitmask written by the UFFDIO_API.
+ */
+#define _UFFDIO_REGISTER		(0x00)
+#define _UFFDIO_UNREGISTER		(0x01)
+#define _UFFDIO_WAKE			(0x02)
+#define _UFFDIO_API			(0x3F)
+
+/* userfaultfd ioctl ids */
+#define UFFDIO 0xAA
+#define UFFDIO_API		_IOWR(UFFDIO, _UFFDIO_API,	\
+				      struct uffdio_api)
+#define UFFDIO_REGISTER		_IOWR(UFFDIO, _UFFDIO_REGISTER,	\
+				      struct uffdio_register)
+#define UFFDIO_UNREGISTER	_IOR(UFFDIO, _UFFDIO_UNREGISTER,	\
+				     struct uffdio_range)
+#define UFFDIO_WAKE		_IOR(UFFDIO, _UFFDIO_WAKE,	\
+				     struct uffdio_range)
+
+/*
+ * Valid bits below PAGE_SHIFT in the userfault address read through
+ * the read() syscall.
+ */
+#define UFFD_BIT_WRITE	(1<<0)	/* this was a write fault, MISSING or WP */
+#define UFFD_BIT_WP	(1<<1)	/* handle_userfault() reason VM_UFFD_WP */
+#define UFFD_BITS	2	/* two above bits used for UFFD_BIT_* mask */
+
+struct uffdio_api {
+   /* userland asks for an API number */
+   __u64 api;
+
+   /* kernel answers below with the available features for the API */
+   __u64 bits;
+   __u64 ioctls;
+};
+
+struct uffdio_range {
+   __u64 start;
+   __u64 len;
+};
+
+struct uffdio_register {
+   struct uffdio_range range;
+#define UFFDIO_REGISTER_MODE_MISSING   ((__u64)1<<0)
+#define UFFDIO_REGISTER_MODE_WP((__u64)1<<1)
+   __u64 mode;
+
+   /*
+* kernel answers which ioctl commands are available for the
+* range, keep at the end as the last 8 bytes aren't read.
+*/
+   __u64 ioctls;
+};
+
+#endif /* _LINUX_USERFAULTFD_H */
--
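
To make the protocol above concrete, here is a minimal userland sketch
(not part of the patch): it performs the UFFDIO_API handshake and then
registers a range for missing-fault tracking. The __NR_userfaultfd
number only exists once the syscall patches of this series are applied,
and the helper name and its area/len parameters are made up for the
example:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int uffd_setup(void *area, unsigned long len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	int uffd;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -1;

	/* handshake: the kernel fills api.bits and api.ioctls */
	if (ioctl(uffd, UFFDIO_API, &api)) {
		close(uffd);
		return -1;
	}

	/* track missing pages in [area, area+len) */
	reg.range.start = (unsigned long) area;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		close(uffd);
		return -1;
	}
	/* reg.ioctls now reports which range ioctls work on this range */
	return uffd;
}

After this, missing faults in the registered range show up through
read() on the uffd as addresses carrying the UFFD_BIT_* flags in the
low bits, per the comment in the header above.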


[PATCH 11/21] userfaultfd: buildsystem activation

2015-03-05 Thread Andrea Arcangeli
This allows selecting the userfaultfd syscall at configuration time so that it gets built.

Signed-off-by: Andrea Arcangeli 
---
 fs/Makefile  |  1 +
 init/Kconfig | 11 +++
 2 files changed, 12 insertions(+)

diff --git a/fs/Makefile b/fs/Makefile
index a88ac48..ba8ab62 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES) += anon_inodes.o
 obj-$(CONFIG_SIGNALFD) += signalfd.o
 obj-$(CONFIG_TIMERFD)  += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
+obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
 obj-$(CONFIG_FS_DAX)   += dax.o
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..580dae7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1550,6 +1550,17 @@ config ADVISE_SYSCALLS
  applications use these syscalls, you can disable this option to save
  space.
 
+config USERFAULTFD
+   bool "Enable userfaultfd() system call"
+   select ANON_INODES
+   default y
+   depends on MMU
+   help
+ Enable the userfaultfd() system call that allows to intercept and
+ handle page faults in userland.
+
+ If unsure, say Y.
+
 config PCI_QUIRKS
default y
bool "Enable PCI quirk workarounds" if EXPERT
--


[PATCH 20/21] userfaultfd: UFFDIO_REMAP

2015-03-05 Thread Andrea Arcangeli
This remap ioctl allows atomically moving a page in or out of a
userfaultfd address space. It's more expensive than "copy" (and of
course more expensive than "zerofill") as it requires a TLB flush on
the source range for each ioctl, which is an expensive operation on
SMP. Especially if copying only a few pages at a time, copying without
a TLB flush is faster.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6230f22..b4c7f25 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -892,6 +892,54 @@ out:
return ret;
 }
 
+static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
+unsigned long arg)
+{
+   __s64 ret;
+   struct uffdio_remap uffdio_remap;
+   struct uffdio_remap __user *user_uffdio_remap;
+   struct userfaultfd_wake_range range;
+
+   user_uffdio_remap = (struct uffdio_remap __user *) arg;
+
+   ret = -EFAULT;
+   if (copy_from_user(&uffdio_remap, user_uffdio_remap,
+  /* don't copy "remap" and "wake" last field */
+  sizeof(uffdio_remap)-sizeof(__s64)*2))
+   goto out;
+
+   ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
+   if (ret)
+   goto out;
+   ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
+   if (ret)
+   goto out;
+   ret = -EINVAL;
+   if (uffdio_remap.mode & ~(UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES|
+ UFFDIO_REMAP_MODE_DONTWAKE))
+   goto out;
+
+   ret = remap_pages(ctx->mm, current->mm,
+ uffdio_remap.dst, uffdio_remap.src,
+ uffdio_remap.len, uffdio_remap.mode);
+   if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
+   return -EFAULT;
+   if (ret < 0)
+   goto out;
+   /* len == 0 would wake all */
+   BUG_ON(!ret);
+   range.len = ret;
+   if (!(uffdio_remap.mode & UFFDIO_REMAP_MODE_DONTWAKE)) {
+   range.start = uffdio_remap.dst;
+   ret = wake_userfault(ctx, &range);
+   if (unlikely(put_user(ret, &user_uffdio_remap->wake)))
+   return -EFAULT;
+   }
+   ret = range.len == uffdio_remap.len ? 0 : -EAGAIN;
+out:
+   return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -955,6 +1003,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned 
cmd,
case UFFDIO_ZEROPAGE:
ret = userfaultfd_zeropage(ctx, arg);
break;
+   case UFFDIO_REMAP:
+   ret = userfaultfd_remap(ctx, arg);
+   break;
}
return ret;
 }
--


[PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation

2015-03-05 Thread Andrea Arcangeli
Provide a new swapfile method for remap_pages() to verify the swap
entry is mapped in only one vma before relocating the swap entry to a
different virtual address. Otherwise, if the swap entry is mapped in
multiple vmas, when the page is swapped back in it could get mapped
in a non-linear way in some anon_vma.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/swap.h |  6 ++
 mm/swapfile.c| 13 +
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4759491..9adda11 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -436,6 +436,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -527,6 +528,11 @@ static inline int page_swapcount(struct page *page)
return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+   return 0;
+}
+
 #define reuse_swap_page(page)  (page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63f55cc..04c7621 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -874,6 +874,19 @@ int page_swapcount(struct page *page)
return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+   int count = 0;
+   struct swap_info_struct *p;
+
+   p = swap_info_get(entry);
+   if (p) {
+   count = swap_count(p->swap_map[swp_offset(entry)]);
+   spin_unlock(&p->lock);
+   }
+   return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content
--


[PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults

2015-03-05 Thread Andrea Arcangeli
This is where the page faults must be modified to call
handle_userfault() if userfaultfd_missing() is true (so if the
vma->vm_flags had VM_UFFD_MISSING set).

handle_userfault() then takes care of blocking the page fault and
delivering it to userland.

The fault flags must also be passed as a parameter so the "read|write"
kind of fault can be reported to userland.

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 68 ++--
 mm/memory.c  | 16 +
 2 files changed, 62 insertions(+), 22 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f0207cf..5374132 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -708,7 +709,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t 
prot)
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
-   struct page *page)
+   struct page *page, unsigned int flags)
 {
struct mem_cgroup *memcg;
pgtable_t pgtable;
@@ -716,12 +717,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct 
*mm,
 
VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-   if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
-   return VM_FAULT_OOM;
+   if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg)) {
+   put_page(page);
+   count_vm_event(THP_FAULT_FALLBACK);
+   return VM_FAULT_FALLBACK;
+   }
 
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable)) {
mem_cgroup_cancel_charge(page, memcg);
+   put_page(page);
return VM_FAULT_OOM;
}
 
@@ -741,6 +746,21 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct 
*mm,
pte_free(mm, pgtable);
} else {
pmd_t entry;
+
+   /* Deliver the page fault to userland */
+   if (userfaultfd_missing(vma)) {
+   int ret;
+
+   spin_unlock(ptl);
+   mem_cgroup_cancel_charge(page, memcg);
+   put_page(page);
+   pte_free(mm, pgtable);
+   ret = handle_userfault(vma, haddr, flags,
+  VM_UFFD_MISSING);
+   VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+   return ret;
+   }
+
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr);
@@ -751,6 +771,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct 
*mm,
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
atomic_long_inc(&mm->nr_ptes);
spin_unlock(ptl);
+   count_vm_event(THP_FAULT_ALLOC);
}
 
return 0;
@@ -762,19 +783,16 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, 
gfp_t extra_gfp)
 }
 
 /* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
struct page *zero_page)
 {
pmd_t entry;
-   if (!pmd_none(*pmd))
-   return false;
entry = mk_pmd(zero_page, vma->vm_page_prot);
entry = pmd_mkhuge(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
atomic_long_inc(&mm->nr_ptes);
-   return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct 
*vma,
@@ -797,6 +815,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
pgtable_t pgtable;
struct page *zero_page;
bool set;
+   int ret;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
return VM_FAULT_OOM;
@@ -807,14 +826,28 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, 
struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
}
ptl = pmd_lock(mm, pmd);
-   set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-   zero_page);
-   spin_unlock(ptl);
+   ret = 0;
+   set = false;
+   if (pmd_none(*pmd)) {
+   if (userfaultfd_missing(vma)) {
+   spin_unlock(ptl);
+   

[PATCH 16/21] userfaultfd: remap_pages: rmap preparation

2015-03-05 Thread Andrea Arcangeli
As far as the rmap code is concerned, remap_pages only alters the
page->mapping and page->index. It does so while holding the page
lock. However there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places that are
doing rmap walks without taking the page lock first must be updated
to re-check that the page->mapping didn't change after they obtained
the anon_vma lock. remap_pages takes the anon_vma lock for writing
before altering the page->mapping, so if the page->mapping is still
the same after obtaining the anon_vma lock (without the page lock),
the rmap walks can go ahead safely (and remap_pages will wait for them
to complete before proceeding).

remap_pages serializes against itself with the page lock.

All other places taking the anon_vma lock while holding the mmap_sem
for writing, don't need to check if the page->mapping has changed
after taking the anon_vma lock, regardless of the page lock, because
remap_pages holds the mmap_sem for reading.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_pages must be mapped only in one vma, but
this is not a limitation when used to handle userland page faults. The
source addresses passed to remap_pages should be set as VM_DONTCOPY
with MADV_DONTFORK to avoid any risk of the mapcount of the pages
increasing, if fork runs in parallel in another thread, before or
while remap_pages runs.

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 23 +++
 mm/rmap.c|  9 +
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f1b6a5..1e25cb3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1902,6 +1902,7 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 {
struct anon_vma *anon_vma;
int ret = 1;
+   struct address_space *mapping;
 
BUG_ON(is_huge_zero_page(page));
BUG_ON(!PageAnon(page));
@@ -1913,10 +1914,24 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 * page_lock_anon_vma_read except the write lock is taken to serialise
 * against parallel split or collapse operations.
 */
-   anon_vma = page_get_anon_vma(page);
-   if (!anon_vma)
-   goto out;
-   anon_vma_lock_write(anon_vma);
+   for (;;) {
+   mapping = ACCESS_ONCE(page->mapping);
+   anon_vma = page_get_anon_vma(page);
+   if (!anon_vma)
+   goto out;
+   anon_vma_lock_write(anon_vma);
+   /*
+* We don't hold the page lock here so
+* remap_pages_huge_pmd can change the anon_vma from
+* under us until we obtain the anon_vma lock. Verify
+* that we obtained the anon_vma lock before
+* remap_pages did.
+*/
+   if (likely(mapping == ACCESS_ONCE(page->mapping)))
+   break;
+   anon_vma_unlock_write(anon_vma);
+   put_anon_vma(anon_vma);
+   }
 
ret = 0;
if (!PageCompound(page))
diff --git a/mm/rmap.c b/mm/rmap.c
index 5e3e090..5ab2df1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -492,6 +492,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
struct anon_vma *root_anon_vma;
unsigned long anon_mapping;
 
+repeat:
rcu_read_lock();
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -530,6 +531,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
rcu_read_unlock();
anon_vma_lock_read(anon_vma);
 
+   /* check if remap_anon_pages changed the anon_vma */
+   if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != 
anon_mapping)) {
+   anon_vma_unlock_read(anon_vma);
+   put_anon_vma(anon_vma);
+   anon_vma = NULL;
+   goto repeat;
+   }
+
if (atomic_dec_and_test(&anon_vma->refcount)) {
/*
 * Oops, we held the last refcount, release the lock
--


[PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers

2015-03-05 Thread Andrea Arcangeli
These helpers will be used to know whether to call handle_userfault()
during wrprotect faults, in order to deliver the wrprotect faults to
userland.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/userfaultfd_k.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3c39a4f..81f0d11 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -65,6 +65,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct 
*vma)
return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -92,6 +97,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct 
*vma)
return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
return false;
--


[PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP

2015-03-05 Thread Andrea Arcangeli
These two flags get set in vma->vm_flags to tell the VM common code
if the userfaultfd is armed and in which mode (only tracking missing
faults, only tracking wrprotect faults, or both). If neither flag is
set it means the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h | 2 ++
 kernel/fork.c  | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..762ef9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -123,8 +123,10 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MAYSHARE	0x00000080
 
 #define VM_GROWSDOWN	0x00000100	/* general info on the segment */
+#define VM_UFFD_MISSING	0x00000200	/* missing pages tracking */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
+#define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
 #define VM_IO		0x00004000	/* Memory mapped I/O or similar */
diff --git a/kernel/fork.c b/kernel/fork.c
index cb215c0..cfab6e9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -423,7 +423,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct 
*oldmm)
tmp->vm_mm = mm;
if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
-   tmp->vm_flags &= ~VM_LOCKED;
+   tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
tmp->vm_next = tmp->vm_prev = NULL;
tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
file = tmp->vm_file;


[PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx

2015-03-05 Thread Andrea Arcangeli
vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware of, so that we can merge vmas back to how they were
originally, before the userfaultfd was armed on some memory range.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h |  2 +-
 mm/madvise.c   |  3 ++-
 mm/mempolicy.c |  4 ++--
 mm/mlock.c |  3 ++-
 mm/mmap.c  | 39 +++
 mm/mprotect.c  |  3 ++-
 6 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 762ef9d..26cef61 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1879,7 +1879,7 @@ extern int vma_adjust(struct vm_area_struct *vma, 
unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-   struct mempolicy *);
+   struct mempolicy *, struct vm_userfaultfd_ctx);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..10f62b7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -102,7 +102,8 @@ static long madvise_behavior(struct vm_area_struct *vma,
 
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
-   vma->vm_file, pgoff, vma_policy(vma));
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4721046..e1a2e9b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -722,8 +722,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long 
start,
pgoff = vma->vm_pgoff +
((vmstart - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff,
- new_pol);
+vma->anon_vma, vma->vm_file, pgoff,
+new_pol, vma->vm_userfaultfd_ctx);
if (prev) {
vma = prev;
next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73cf098..9725abe 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -566,7 +566,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct 
vm_area_struct **prev,
 
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
- vma->vm_file, pgoff, vma_policy(vma));
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index da9990a..135c2fa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -921,7 +922,8 @@ again:  remove_next = 1 + (end > 
next->vm_end);
  * per-vma resources, so we don't attempt to merge those.
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
-   struct file *file, unsigned long vm_flags)
+   struct file *file, unsigned long vm_flags,
+   struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
/*
 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -937,6 +939,8 @@ static inline int is_mergeable_vma(struct vm_area_struct 
*vma,
return 0;
if (vma->vm_ops && vma->vm_ops->close)
return 0;
+   if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+   return 0;
return 1;
 }
 
@@ -967,9 +971,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma 
*anon_vma1,
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-   struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+struct anon_vma *anon_vma, struct file *file,
+pgoff_t vm_pgoff,
+struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-   if (is_mergeable_vma(vma, file, vm_flags) &&
+   if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
if (vma->vm_pgoff == vm_pgoff)
   

[PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI

2015-03-05 Thread Andrea Arcangeli
This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.

Signed-off-by: Andrea Arcangeli 
---
 include/uapi/linux/userfaultfd.h | 46 +++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9a8cd56..61251e6 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -17,7 +17,9 @@
 (__u64)1 << _UFFDIO_UNREGISTER |   \
 (__u64)1 << _UFFDIO_API)
 #define UFFD_API_RANGE_IOCTLS  \
-   ((__u64)1 << _UFFDIO_WAKE)
+   ((__u64)1 << _UFFDIO_WAKE | \
+(__u64)1 << _UFFDIO_COPY | \
+(__u64)1 << _UFFDIO_ZEROPAGE)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -30,6 +32,8 @@
 #define _UFFDIO_REGISTER   (0x00)
 #define _UFFDIO_UNREGISTER (0x01)
 #define _UFFDIO_WAKE   (0x02)
+#define _UFFDIO_COPY   (0x03)
+#define _UFFDIO_ZEROPAGE   (0x04)
 #define _UFFDIO_API(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -42,6 +46,10 @@
 struct uffdio_range)
 #define UFFDIO_WAKE_IOR(UFFDIO, _UFFDIO_WAKE,  \
 struct uffdio_range)
+#define UFFDIO_COPY_IOWR(UFFDIO, _UFFDIO_COPY, \
+ struct uffdio_copy)
+#define UFFDIO_ZEROPAGE_IOWR(UFFDIO, _UFFDIO_ZEROPAGE, \
+ struct uffdio_zeropage)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -78,4 +86,40 @@ struct uffdio_register {
__u64 ioctls;
 };
 
+struct uffdio_copy {
+   __u64 dst;
+   __u64 src;
+   __u64 len;
+   /*
+* There will be a wrprotection flag later that allows mapping
+* pages wrprotected on the fly. Such a flag will be
+* available if the wrprotection ioctls are implemented for the
+* range, according to the uffdio_register.ioctls.
+*/
+#define UFFDIO_COPY_MODE_DONTWAKE  ((__u64)1<<0)
+   __u64 mode;
+
+   /*
+* "copy" and "wake" are written by the ioctl and must be at
+* the end: the copy_from_user will not read the last 16
+* bytes.
+*/
+   __s64 copy;
+   __s64 wake;
+};
+
+struct uffdio_zeropage {
+   struct uffdio_range range;
+#define UFFDIO_ZEROPAGE_MODE_DONTWAKE  ((__u64)1<<0)
+   __u64 mode;
+
+   /*
+* "zeropage" and "wake" are written by the ioctl and must be
+* at the end: the copy_from_user will not read the last 16
+* bytes.
+*/
+   __s64 zeropage;
+   __s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
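
For reference, a hedged userland sketch of how these structures are meant to
be filled in and read back (uffd_copy_one is a hypothetical helper name;
"ufd" is assumed to be an open userfaultfd and "dst" a page aligned address
inside a registered range):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Sketch only: resolve one missing page with UFFDIO_COPY. */
static int uffd_copy_one(int ufd, unsigned long dst, void *src,
			 unsigned long page_size)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = dst;			/* page aligned destination */
	copy.src = (unsigned long) src;	/* source buffer to copy from */
	copy.len = page_size;
	copy.mode = 0;			/* 0 = also wake the faulting threads */

	if (ioctl(ufd, UFFDIO_COPY, &copy) == -1)
		return -1;
	/* "copy" and "wake" were written back by the kernel */
	return copy.copy == (__s64) page_size ? 0 : -1;
}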


[PATCH 00/21] RFC: userfaultfd v3

2015-03-05 Thread Andrea Arcangeli
l usages, the
simplest solution would be that if FAULT_FLAG_TRIED is set
VM_FAULT_RETRY can still be returned (but only by handle_userfault
that has a legitimate reason for insisting a second time in a row with
VM_FAULT_RETRY). That would require some change to the FAULT_FLAG
semantics. Again userland could cope with this detail but it'd be
inefficient to solve it in userland. This would be a fully backwards
compatible change and it's only strictly required by the wrprotect
tracking mode, so it's no problem to solve this later. Because of its
inherent racy nature, nobody could possibly depend on a racy SIGBUS
being raised now, when it won't be raised anymore later.

Andrea Arcangeli (21):
  userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
  userfaultfd: linux/Documentation/vm/userfaultfd.txt
  userfaultfd: uAPI
  userfaultfd: linux/userfaultfd_k.h
  userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
  userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
  userfaultfd: call handle_userfault() for userfaultfd_missing() faults
  userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
  userfaultfd: prevent khugepaged to merge if userfaultfd is armed
  userfaultfd: add new syscall to provide memory externalization
  userfaultfd: buildsystem activation
  userfaultfd: activate syscall
  userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
  userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE
preparation
  userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
  userfaultfd: remap_pages: rmap preparation
  userfaultfd: remap_pages: swp_entry_swapcount() preparation
  userfaultfd: UFFDIO_REMAP uABI
  userfaultfd: remap_pages: UFFDIO_REMAP preparation
  userfaultfd: UFFDIO_REMAP
  userfaultfd: add userfaultfd_wp mm helpers

 Documentation/ioctl/ioctl-number.txt   |1 +
 Documentation/vm/userfaultfd.txt   |   97 +++
 arch/powerpc/include/asm/systbl.h  |1 +
 arch/powerpc/include/asm/unistd.h  |2 +-
 arch/powerpc/include/uapi/asm/unistd.h |1 +
 arch/x86/syscalls/syscall_32.tbl   |1 +
 arch/x86/syscalls/syscall_64.tbl   |1 +
 fs/Makefile|1 +
 fs/userfaultfd.c   | 1128 
 include/linux/mm.h |4 +-
 include/linux/mm_types.h   |   11 +
 include/linux/swap.h   |6 +
 include/linux/syscalls.h   |1 +
 include/linux/userfaultfd_k.h  |  112 
 include/linux/wait.h   |5 +-
 include/uapi/linux/userfaultfd.h   |  150 +
 init/Kconfig   |   11 +
 kernel/fork.c  |3 +-
 kernel/sched/wait.c|7 +-
 kernel/sys_ni.c|1 +
 mm/Makefile|1 +
 mm/huge_memory.c   |  217 +-
 mm/madvise.c   |3 +-
 mm/memory.c|   16 +
 mm/mempolicy.c |4 +-
 mm/mlock.c |3 +-
 mm/mmap.c  |   39 +-
 mm/mprotect.c  |3 +-
 mm/rmap.c  |9 +
 mm/swapfile.c  |   13 +
 mm/userfaultfd.c   |  793 ++
 net/sunrpc/sched.c |2 +-
 32 files changed, 2593 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/vm/userfaultfd.txt
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd_k.h
 create mode 100644 include/uapi/linux/userfaultfd.h
 create mode 100644 mm/userfaultfd.c



[PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt

2015-03-05 Thread Andrea Arcangeli
Add documentation.

Signed-off-by: Andrea Arcangeli 
---
 Documentation/vm/userfaultfd.txt | 97 
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/vm/userfaultfd.txt

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 000..2ec296c
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,97 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow the implementation of on-demand paging from userland
+and, more generally, they allow userland to take control of various
+memory page faults, something otherwise only the kernel code could do.
+
+For example userfaults allow a proper and more optimal implementation
+of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides for two primary functionalities:
+
+1) a read/POLLIN protocol to notify a userland thread of the faults
+   happening
+
+2) various UFFDIO_* ioctls that can manipulate the virtual memory
+   regions registered in the userfaultfd, allowing userland to
+   efficiently resolve the userfaults it receives via 1) or to modify
+   the virtual memory in the background
+
+The real advantage of userfaults compared to regular virtual memory
+management with mremap/mprotect is that userfaults in all their
+operations never involve heavyweight structures like vmas (in fact the
+userfaultfd runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page(or hugepage)-granular fault tracking
+when dealing with virtual address spaces that could span
+Terabytes. Too many vmas would be needed for that.
+
+The userfaultfd, once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware of what is going on
+(well, of course, unless they later try to use the userfaultfd
+themselves on the same region the manager is already tracking, which
+is a corner case that would currently return -EBUSY).
+
+== API ==
+
+When first opened the userfaultfd must be enabled by invoking the
+UFFDIO_API ioctl with an uffdio_api.api value set to UFFD_API, which
+specifies the read/POLLIN protocol userland intends to speak on the
+UFFD. The UFFDIO_API ioctl, if successful (i.e. if the requested
+uffdio_api.api is also spoken by the running kernel), will return into
+uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks describing,
+respectively, the activated feature bits below PAGE_SHIFT in the
+userfault addresses returned by read(2) and the generic ioctls
+available.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to modify the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means a userfault
+could trigger just before userland maps the user-faulted page in the
+background. To avoid POLLIN resulting in an unexpected blocking
+read (if the UFFD is not opened in nonblocking mode in the first
+place), we don't allow the background thread to wake userfaults that
+haven't been read by userland yet. If we did that, the UFFDIO_WAKE
+ioctl could likely be dropped. This may change in the future
+(with a UFFD_API protocol bump combined with the removal of the
+UFFDIO_WAKE ioctl) if it is demonstrated that this is a valid
+optimization and worth forcing userland to always use the UFFD in
+nonblocking mode combined with POLLIN.
+
+userfaultfd is also a generic enough feature that it allows KVM to
+implement postcopy live migration (one form of memory externalization
+consisting of a virtual machine running with part or all of its memory
+residing on a different node in the cloud) without having to modify a
+single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
+and all other GUP features work just fine in combination with
+userfaults (userfaults trigger async page faults in the guest
+scheduler so those guest processes that aren't waiting for userfaults
+can keep running in the guest vcpus).
[PATCH 10/21] userfaultfd: add new syscall to provide memory externalization

2015-03-05 Thread Andrea Arcangeli
Once a userfaultfd has been created and certain regions of the process
virtual address space have been registered with it, the thread
responsible for doing the memory externalization can manage the page
faults in userland by talking to the kernel using the userfaultfd
protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 977 +++
 1 file changed, 977 insertions(+)
 create mode 100644 fs/userfaultfd.c

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 000..6b31967
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,977 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+enum userfaultfd_state {
+   UFFD_STATE_WAIT_API,
+   UFFD_STATE_RUNNING,
+};
+
+struct userfaultfd_ctx {
+   /* pseudo fd refcounting */
+   atomic_t refcount;
+   /* waitqueue head for the userfaultfd page faults */
+   wait_queue_head_t fault_wqh;
+   /* waitqueue head for the pseudo fd to wakeup poll/read */
+   wait_queue_head_t fd_wqh;
+   /* userfaultfd syscall flags */
+   unsigned int flags;
+   /* state machine */
+   enum userfaultfd_state state;
+   /* released */
+   bool released;
+   /* mm with one ore more vmas attached to this userfaultfd_ctx */
+   struct mm_struct *mm;
+};
+
+struct userfaultfd_wait_queue {
+   unsigned long address;
+   wait_queue_t wq;
+   bool pending;
+   struct userfaultfd_ctx *ctx;
+};
+
+struct userfaultfd_wake_range {
+   unsigned long start;
+   unsigned long len;
+};
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+int wake_flags, void *key)
+{
+   struct userfaultfd_wake_range *range = key;
+   int ret;
+   struct userfaultfd_wait_queue *uwq;
+   unsigned long start, len;
+
+   uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+   ret = 0;
+   /* don't wake the pending ones to avoid reads to block */
+   if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
+   goto out;
+   /* len == 0 means wake all */
+   start = range->start;
+   len = range->len;
+   if (len && (start > uwq->address || start + len <= uwq->address))
+   goto out;
+   ret = wake_up_state(wq->private, mode);
+   if (ret)
+   /* wake only once, autoremove behavior */
+   list_del_init(&wq->task_list);
+out:
+   return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: In case of success, returns not zero.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+   if (!atomic_inc_not_zero(&ctx->refcount))
+   BUG();
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+   if (atomic_dec_and_test(&ctx->refcount)) {
+   mmdrop(ctx->mm);
+   kfree(ctx);
+   }
+}
+
+static inline unsigned long userfault_address(unsigned long address,
+ unsigned int flags,
+ unsigned long reason)
+{
+   BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
+   address &= PAGE_MASK;
+   if (flags & FAULT_FLAG_WRITE)
+   /*
+* Encode "write" fault information in the LSB of the
+* address read by userland, without depending on
+* FAULT_FLAG_WRITE kernel internal value.
+*/
+   address |= UFFD_BIT_WRITE;
+   if (reason & VM_UFFD_WP)
+   /*
+* Encode "reason" fault information as bit number 1
+* in the address read by userland. If bit number 1 is
+* clear it means the reason is a VM_FAULT_MISSING
+* fault.
+*/
+   address |= UFFD_BIT_WP;
+   return addres

[PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Andrea Arcangeli
remap_pages is the lowlevel mm helper needed to implement
UFFDIO_REMAP.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/userfaultfd_k.h |  17 ++
 mm/huge_memory.c  | 120 ++
 mm/userfaultfd.c  | 526 ++
 3 files changed, 663 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480a..3c39a4f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,6 +36,23 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
  unsigned long dst_start,
  unsigned long len);
 
+/* remap_pages */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern ssize_t remap_pages(struct mm_struct *dst_mm,
+  struct mm_struct *src_mm,
+  unsigned long dst_start,
+  unsigned long src_start,
+  unsigned long len, __u64 flags);
+extern int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+   struct mm_struct *src_mm,
+   pmd_t *dst_pmd, pmd_t *src_pmd,
+   pmd_t dst_pmdval,
+   struct vm_area_struct *dst_vma,
+   struct vm_area_struct *src_vma,
+   unsigned long dst_addr,
+   unsigned long src_addr);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e25cb3..08c8afc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1531,6 +1531,124 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+/*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but it must return after releasing the
+ * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+struct mm_struct *src_mm,
+pmd_t *dst_pmd, pmd_t *src_pmd,
+pmd_t dst_pmdval,
+struct vm_area_struct *dst_vma,
+struct vm_area_struct *src_vma,
+unsigned long dst_addr,
+unsigned long src_addr)
+{
+   pmd_t _dst_pmd, src_pmdval;
+   struct page *src_page;
+   struct anon_vma *src_anon_vma, *dst_anon_vma;
+   spinlock_t *src_ptl, *dst_ptl;
+   pgtable_t pgtable;
+
+   src_pmdval = *src_pmd;
+   src_ptl = pmd_lockptr(src_mm, src_pmd);
+
+   BUG_ON(!pmd_trans_huge(src_pmdval));
+   BUG_ON(pmd_trans_splitting(src_pmdval));
+   BUG_ON(!pmd_none(dst_pmdval));
+   BUG_ON(!spin_is_locked(src_ptl));
+   BUG_ON(!rwsem_is_locked(&src_mm->mmap_sem));
+   BUG_ON(!rwsem_is_locked(&dst_mm->mmap_sem));
+
+   src_page = pmd_page(src_pmdval);
+   BUG_ON(!PageHead(src_page));
+   BUG_ON(!PageAnon(src_page));
+   if (unlikely(page_mapcount(src_page) != 1)) {
+   spin_unlock(src_ptl);
+   return -EBUSY;
+   }
+
+   get_page(src_page);
+   spin_unlock(src_ptl);
+
+   mmu_notifier_invalidate_range_start(src_mm, src_addr,
+   src_addr + HPAGE_PMD_SIZE);
+
+   /* block all concurrent rmap walks */
+   lock_page(src_page);
+
+   /*
+* split_huge_page walks the anon_vma chain without the page
+* lock. Serialize against it with the anon_vma lock, the page
+* lock is not enough.
+*/
+   src_anon_vma = page_get_anon_vma(src_page);
+   if (!src_anon_vma) {
+   unlock_page(src_page);
+   put_page(src_page);
+   mmu_notifier_invalidate_range_end(src_mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+   return -EAGAIN;
+   }
+   anon_vma_lock_write(src_anon_vma);
+
+   dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
+   double_pt_lock(src_ptl, dst_ptl);
+   if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+!pmd_same(*dst_pmd, dst_pmdval) ||
+page_mapcount(src_page) != 1)) {
+   double_pt_unlock(src_ptl, dst_ptl);
+   anon_vma_unlock_write(src_anon_vma);
+   put_anon_vma(src_anon_vma);
+   u

[PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key

2015-03-05 Thread Andrea Arcangeli
userfaultfd needs to be able to wake all waiters in a waitqueue (by
passing 0 as the nr parameter), instead of the currently hardcoded 1
(which would wake just the first waiter in the head list).

Signed-off-by: Andrea Arcangeli 
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2db8334..cf884cf 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,7 +147,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t 
*old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void 
*key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -179,7 +180,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)  \
-   __wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+   __wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)   \
__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)  \
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 852143a..6da208dd2 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -106,9 +106,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int 
mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key)
 {
-   __wake_up_common(q, mode, 1, 0, key);
+   __wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -283,7 +284,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, 
wait_queue_t *wait,
if (!list_empty(&wait->task_list))
list_del_init(&wait->task_list);
else if (waitqueue_active(q))
-   __wake_up_locked_key(q, mode, key);
+   __wake_up_locked_key(q, mode, 1, key);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index b91fd9c..dead9e0 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
ret = atomic_dec_and_test(&task->tk_count);
if (waitqueue_active(wq))
-   __wake_up_locked_key(wq, TASK_NORMAL, &k);
+   __wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
spin_unlock_irqrestore(&wq->lock, flags);
return ret;
 }
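
For context, a hedged sketch (not part of this patch) of the intended caller:
the ctx and range types mirror the fs/userfaultfd.c patch of this series, and
nr == 0 is what allows waking every waiter matching the range rather than
only the first one:

/* Sketch: wake all waiters whose faulting address falls inside "range". */
static void example_wake_all(struct userfaultfd_ctx *ctx,
			     struct userfaultfd_wake_range *range)
{
	unsigned long flags;

	spin_lock_irqsave(&ctx->fault_wqh.lock, flags);
	if (waitqueue_active(&ctx->fault_wqh))
		/* nr == 0: wake every entry matched by the key, not just one */
		__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
	spin_unlock_irqrestore(&ctx->fault_wqh.lock, flags);
}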


[PATCH 04/21] userfaultfd: linux/userfaultfd_k.h

2015-03-05 Thread Andrea Arcangeli
Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/userfaultfd_k.h | 79 +++
 1 file changed, 79 insertions(+)
 create mode 100644 include/linux/userfaultfd_k.h

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
new file mode 100644
index 000..e1e4360
--- /dev/null
+++ b/include/linux/userfaultfd_k.h
@@ -0,0 +1,79 @@
+/*
+ *  include/linux/userfaultfd_k.h
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_K_H
+#define _LINUX_USERFAULTFD_K_H
+
+#ifdef CONFIG_USERFAULTFD
+
+#include  /* linux/include/uapi/linux/userfaultfd.h */
+
+#include 
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
+
+extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+   unsigned int flags, unsigned long reason);
+
+/* mm helpers */
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+   struct vm_userfaultfd_ctx vm_ctx)
+{
+   return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & VM_UFFD_MISSING;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+}
+
+#else /* CONFIG_USERFAULTFD */
+
+/* mm helpers */
+static inline int handle_userfault(struct vm_area_struct *vma,
+  unsigned long address,
+  unsigned int flags,
+  unsigned long reason)
+{
+   return VM_FAULT_SIGBUS;
+}
+
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+   struct vm_userfaultfd_ctx vm_ctx)
+{
+   return true;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+   return false;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+   return false;
+}
+
+#endif /* CONFIG_USERFAULTFD */
+
+#endif /* _LINUX_USERFAULTFD_K_H */
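
As an illustration of how the VM common code is expected to consume these
helpers (the real hook is added by the separate "call handle_userfault() for
userfaultfd_missing() faults" patch; the function below is hypothetical):

/* Sketch only: an anonymous not-present fault path consulting the helpers. */
static int example_anon_fault(struct vm_area_struct *vma,
			      unsigned long address, unsigned int flags)
{
	if (userfaultfd_missing(vma))
		/*
		 * Not-present fault on a registered range: let the
		 * monitor thread provide the page contents through
		 * UFFDIO_COPY or UFFDIO_ZEROPAGE instead of allocating
		 * it here.
		 */
		return handle_userfault(vma, address, flags, VM_UFFD_MISSING);

	return 0;	/* normal anonymous page allocation continues */
}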


[PATCH 12/21] userfaultfd: activate syscall

2015-03-05 Thread Andrea Arcangeli
This activates the userfaultfd syscall.

Signed-off-by: Andrea Arcangeli 
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 arch/x86/syscalls/syscall_32.tbl   | 1 +
 arch/x86/syscalls/syscall_64.tbl   | 1 +
 include/linux/syscalls.h   | 1 +
 kernel/sys_ni.c| 1 +
 7 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 91062ee..7f21cfd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -367,3 +367,4 @@ SYSCALL_SPU(getrandom)
 SYSCALL_SPU(memfd_create)
 SYSCALL_SPU(bpf)
 COMPAT_SYS(execveat)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 36b79c3..f4f8b66 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define __NR_syscalls  363
+#define __NR_syscalls  364
 
 #define __NR__exit __NR_exit
 #define NR_syscalls__NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index ef5b5b1..4b4f21e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -385,5 +385,6 @@
 #define __NR_memfd_create  360
 #define __NR_bpf   361
 #define __NR_execveat  362
+#define __NR_userfaultfd   363
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..a20f0b8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat		stub32_execveat
+359	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..f320b19 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..adf5901 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -810,6 +810,7 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct 
itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int 
flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user 
*, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..2a10e42 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -204,6 +204,7 @@ cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
 cond_syscall(sys_memfd_create);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
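
Since no glibc wrapper exists for the new system call, userland invokes it
directly through syscall(2); a minimal hedged sketch (the fallback define is
only needed where the just-added x86_64 number has not reached the installed
headers yet):

#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_userfaultfd
#define __NR_userfaultfd 323	/* x86_64 number added by this patch */
#endif

static int userfaultfd(int flags)
{
	return syscall(__NR_userfaultfd, flags);
}

/* typical usage: int ufd = userfaultfd(O_CLOEXEC | O_NONBLOCK); */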


[PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation

2015-03-05 Thread Andrea Arcangeli
This implements mcopy_atomic and mfill_zeropage, the lowlevel VM
methods invoked by the UFFDIO_COPY and UFFDIO_ZEROPAGE userfaultfd
commands respectively.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/userfaultfd_k.h |   6 +
 mm/Makefile   |   1 +
 mm/userfaultfd.c  | 267 ++
 3 files changed, 274 insertions(+)
 create mode 100644 mm/userfaultfd.c

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e1e4360..587480a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -30,6 +30,12 @@
 extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags, unsigned long reason);
 
+extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+   unsigned long src_start, unsigned long len);
+extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
+ unsigned long dst_start,
+ unsigned long len);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/Makefile b/mm/Makefile
index 3c1caa2..ea9828e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,3 +76,4 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)  += cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
+obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
new file mode 100644
index 000..3f4c0ef
--- /dev/null
+++ b/mm/userfaultfd.c
@@ -0,0 +1,267 @@
+/*
+ *  mm/userfaultfd.c
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "internal.h"
+
+static int mcopy_atomic_pte(struct mm_struct *dst_mm,
+   pmd_t *dst_pmd,
+   struct vm_area_struct *dst_vma,
+   unsigned long dst_addr,
+   unsigned long src_addr)
+{
+   struct mem_cgroup *memcg;
+   pte_t _dst_pte, *dst_pte;
+   spinlock_t *ptl;
+   struct page *page;
+   void *page_kaddr;
+   int ret;
+
+   ret = -ENOMEM;
+   page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
+   if (!page)
+   goto out;
+
+   page_kaddr = kmap(page);
+   ret = -EFAULT;
+   if (copy_from_user(page_kaddr, (const void __user *) src_addr,
+  PAGE_SIZE))
+   goto out_kunmap_release;
+   kunmap(page);
+
+   /*
+* The memory barrier inside __SetPageUptodate makes sure that
+* preceding stores to the page contents become visible before
+* the set_pte_at() write.
+*/
+   __SetPageUptodate(page);
+
+   ret = -ENOMEM;
+   if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+   goto out_release;
+
+   _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+   if (dst_vma->vm_flags & VM_WRITE)
+   _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+
+   ret = -EEXIST;
+   dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+   if (!pte_none(*dst_pte))
+   goto out_release_uncharge_unlock;
+
+   inc_mm_counter(dst_mm, MM_ANONPAGES);
+   page_add_new_anon_rmap(page, dst_vma, dst_addr);
+   mem_cgroup_commit_charge(page, memcg, false);
+   lru_cache_add_active_or_unevictable(page, dst_vma);
+
+   set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+   /* No need to invalidate - it was non-present before */
+   update_mmu_cache(dst_vma, dst_addr, dst_pte);
+
+   pte_unmap_unlock(dst_pte, ptl);
+   ret = 0;
+out:
+   return ret;
+out_release_uncharge_unlock:
+   pte_unmap_unlock(dst_pte, ptl);
+   mem_cgroup_cancel_charge(page, memcg);
+out_release:
+   page_cache_release(page);
+   goto out;
+out_kunmap_release:
+   kunmap(page);
+   goto out_release;
+}
+
+static int mfill_zeropage_pte(struct mm_struct *dst_mm,
+ pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr)
+{
+   pte_t _dst_pte, *dst_pte;
+   spinlock_t *ptl;
+   int ret;
+
+   _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
+dst_vma->vm_page_prot));
+   ret = -EEXIST;
+   dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+   if (!pte_none(*dst_pte))
+   goto out_unlock;
+   set_pte_at(

[PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE

2015-03-05 Thread Andrea Arcangeli
These two ioctls allow the thread that opened the userfaultfd to
resolve userfaults by either atomically copying pages or mapping
zeropages into the virtual address space.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 100 +++
 1 file changed, 100 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6b31967..6230f22 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -798,6 +798,100 @@ out:
return ret;
 }
 
+static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
+   unsigned long arg)
+{
+   __s64 ret;
+   struct uffdio_copy uffdio_copy;
+   struct uffdio_copy __user *user_uffdio_copy;
+   struct userfaultfd_wake_range range;
+
+   user_uffdio_copy = (struct uffdio_copy __user *) arg;
+
+   ret = -EFAULT;
+   if (copy_from_user(&uffdio_copy, user_uffdio_copy,
+  /* don't copy "copy" and "wake" last field */
+  sizeof(uffdio_copy)-sizeof(__s64)*2))
+   goto out;
+
+   ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
+   if (ret)
+   goto out;
+   /*
+* double check for wraparound just in case. copy_from_user()
+* will later check uffdio_copy.src + uffdio_copy.len to fit
+* in the userland range.
+*/
+   ret = -EINVAL;
+   if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
+   goto out;
+   if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+   goto out;
+
+   ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
+  uffdio_copy.len);
+   if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
+   return -EFAULT;
+   if (ret < 0)
+   goto out;
+   BUG_ON(!ret);
+   /* len == 0 would wake all */
+   range.len = ret;
+   if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
+   range.start = uffdio_copy.dst;
+   ret = wake_userfault(ctx, &range);
+   if (unlikely(put_user(ret, &user_uffdio_copy->wake)))
+   return -EFAULT;
+   }
+   ret = range.len == uffdio_copy.len ? 0 : -EAGAIN;
+out:
+   return ret;
+}
+
+static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
+   unsigned long arg)
+{
+   __s64 ret;
+   struct uffdio_zeropage uffdio_zeropage;
+   struct uffdio_zeropage __user *user_uffdio_zeropage;
+   struct userfaultfd_wake_range range;
+
+   user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
+
+   ret = -EFAULT;
+   if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
+  /* don't copy "zeropage" and "wake" last field */
+  sizeof(uffdio_zeropage)-sizeof(__s64)*2))
+   goto out;
+
+   ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
+uffdio_zeropage.range.len);
+   if (ret)
+   goto out;
+   ret = -EINVAL;
+   if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+   goto out;
+
+   ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
+uffdio_zeropage.range.len);
+   if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
+   return -EFAULT;
+   if (ret < 0)
+   goto out;
+   /* len == 0 would wake all */
+   BUG_ON(!ret);
+   range.len = ret;
+   if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
+   range.start = uffdio_zeropage.range.start;
+   ret = wake_userfault(ctx, &range);
+   if (unlikely(put_user(ret, &user_uffdio_zeropage->wake)))
+   return -EFAULT;
+   }
+   ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
+out:
+   return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -855,6 +949,12 @@ static long userfaultfd_ioctl(struct file *file, unsigned 
cmd,
case UFFDIO_WAKE:
ret = userfaultfd_wake(ctx, arg);
break;
+   case UFFDIO_COPY:
+   ret = userfaultfd_copy(ctx, arg);
+   break;
+   case UFFDIO_ZEROPAGE:
+   ret = userfaultfd_zeropage(ctx, arg);
+   break;
}
return ret;
 }
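
Because partial progress is reported back through uffdio_copy.copy while the
ioctl itself returns -EAGAIN, a caller is expected to retry from where the
copy stopped; a hedged userland sketch (uffd_copy_range is a hypothetical
helper name):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Sketch: retry UFFDIO_COPY until the whole range has been resolved. */
static int uffd_copy_range(int ufd, __u64 dst, __u64 src, __u64 len)
{
	while (len) {
		struct uffdio_copy copy = {
			.dst = dst, .src = src, .len = len, .mode = 0,
		};

		if (ioctl(ufd, UFFDIO_COPY, &copy) == 0)
			return 0;		/* fully copied and woken */
		if (errno != EAGAIN || copy.copy <= 0)
			return -1;		/* hard error */
		/* partial progress: copy.copy bytes were mapped, continue */
		dst += copy.copy;
		src += copy.copy;
		len -= copy.copy;
	}
	return 0;
}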


Re: [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify

2015-01-08 Thread Andrea Arcangeli
On Thu, Jan 08, 2015 at 11:59:06AM +, Marc Zyngier wrote:
> From: Steve Capper 
> 
> ptep_clear_flush_young_notify and pmdp_clear_flush_young_notify both
> call the notifiers *after* the pte/pmd has been made young.
> 

On x86 on EPT without hardware access bit (!shadow_accessed_mask),
we'll trigger a KVM page fault (gup_fast) which would mark the page
referenced to give it higher priority in the LRU (or set the accessed
bit if it's a THP).

If we drop the KVM shadow pagetable before clearing the accessed bit
in the linux pte, there's a window where we won't set the young bit
for THP. For non-THP it's less of an issue because gup_fast calls
mark_page_accessed which rolls the lrus and sets the referenced bit in
the struct page, so the effect of mark_page_accessed doesn't get
lost when the linux pte accessed bit is cleared.

We could also consider using mark_page_accessed in
follow_trans_huge_pmd to work around the problem. I think setting the
young bit in gup_fast is correct and would be more similar to a real
CPU access (which is what gup_fast simulates anyway), so the patch
below is literally introducing a race condition, even if it's going to
be lost in the noise and isn't a problem.

> This can cause problems with KVM that relies on being able to block
> MMU notifiers when carrying out maintenance of second stage
> descriptors.
> 
> This patch ensures that the MMU notifiers are called before ptes and
> pmds are made old.

Unfortunately I don't understand why this is needed.

The only difference this can make to KVM is that without the patch,
kvm_age_rmapp is called while the linux pte is less likely to have the
accessed bit set (current behavior). It can still be set by hardware
through another CPU touching the memory before the mmu notifier is
invoked.

With the patch the linux pte is more likely to have the accessed bit
set as it's not cleared before calling the mmu notifier.

In both cases (at least in x86 where the accessed bit is always set in
hardware) the accessed bit may or may not be set. The pte can not
otherwise change as it's called with the PT lock.

So again it looks a noop and it introduces a mostly theoretical race
condition for THP young bit in the linux pte with EPT and
!shadow_accessed_mask.

Clearly there must be some obscure ARM detail I'm not aware of that
makes this helpful, but the description in the commit header isn't
enough to understand what's up with blocking mmu notifiers or such.
Could you elaborate?

Thanks!
Andrea

> 
> Signed-off-by: Steve Capper 
> ---
>  include/linux/mmu_notifier.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 95243d2..c454c76 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -290,11 +290,11 @@ static inline void mmu_notifier_mm_destroy(struct 
> mm_struct *mm)
>   int __young;\
>   struct vm_area_struct *___vma = __vma;  \
>   unsigned long ___address = __address;   \
> - __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
> - __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,\
> + __young = mmu_notifier_clear_flush_young(___vma->vm_mm, \
> ___address,   \
> ___address +  \
>   PAGE_SIZE); \
> + __young |= ptep_clear_flush_young(___vma, ___address, __ptep);  \
>   __young;\
>  })
>  
> @@ -303,11 +303,11 @@ static inline void mmu_notifier_mm_destroy(struct 
> mm_struct *mm)
>   int __young;\
>   struct vm_area_struct *___vma = __vma;  \
>   unsigned long ___address = __address;   \
> - __young = pmdp_clear_flush_young(___vma, ___address, __pmdp);   \
> - __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,\
> + __young = mmu_notifier_clear_flush_young(___vma->vm_mm, \
> ___address,   \
> ___address +  \
>   PMD_SIZE);  \
> + __young |= pmdp_clear_flush_young(___vma, ___address, __pmdp);  \
>   __young;\
>  })
>  
> -- 
> 2.1.4
> 

Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-11-25 Thread Andrea Arcangeli
On Fri, Nov 21, 2014 at 11:05:45PM +, Peter Maydell wrote:
> If it's mapped and readable-but-not-writable then it should still
> fault on write accesses, though? These are cases we currently get
> SEGV for, anyway.

Yes then it'll work just fine.

> Ah, I guess we have a terminology difference. I was considering
> "page fault" to mean (roughly) "anything that causes the CPU to
> take an exception on an attempted load/store" and expected that
> userfaultfd would notify userspace of any of those. (Well, not
> alignment faults, maybe, but I'm definitely surprised that
> access permission issues don't get reported the same way as
> page-completely-missing issues. In other words I was expecting
> that this was "everything previously reported via SIGSEGV or
> SIGBUS now comes via userfaultfd".)

Just not PROT_NONE SIGSEGV faults, i.e. PROT_NONE would still SIGSEGV
currently, because it's not a not-present fault (the page is present,
just not mapped readable) and it's not a wrprotect fault either (it is
trapped via the vma vm_flags permission bits before the actual page
fault handler is invoked). userfaultfd hooks into the common code of
the page fault handler.

> > Temporarily removing/moving the page with remap_anon_pages shall be
> > much better than using PROT_NONE for this (or alternative syscall name
> > to differentiate it further from remap_file_pages, or equivalent
> > userfaultfd command if we decide to hide the pte/pmd mangling as
> > userfaultfd commands instead of adding new standalone syscalls).
> 
> We don't use PROT_NONE for the linux-user situation, we just use
> mprotect() to remove the PAGE_WRITE permission so it's still
> readable.

Like said above it'll work just fine then.

> I suspect actually linux-user would be better off implementing
> something like "if this is a page which we've mapped read-only
> because we translated code out of it, then go ahead and remap
> it r/w and throw away the translation and retry the access,
> otherwise report SEGV to the guest", because taking SEGVs shouldn't
> be a fast path in the guest binary. That would let us work without
> architecture-specific junk and without requiring new kernel
> features either. So you can ignore this whole tangent thread :-)

You might get a significant boost if you use userfaultfd.

For postcopy live snapshot and postcopy live migration the main
benefit is the removal of mprotect as a whole, and the performance
improvement is a secondary benefit.

You can cap the max size of the JIT translated cache (and in turn the
maximal number of vmas generated by the mprotects), but we can't cap
the address space fragmentation. The faults may invoke way too many
mprotect calls and we may fragment the vmas so much that we get
-ENOMEM.

Marking a page wrprotected however is always tricky, no matter if it's
fork doing it or KSM or something else. KSM just skips pages that could
be under gup pins and retries them at the next pass. Fork simply won't
work right currently, and it needs MADV_DONTFORK to avoid the
wrprotection entirely when you mix O_DIRECT with threads and fork.

For this new vma-less syscall (or ufd command) the best we could do is
to print a warning if any page marked wrprotected could be under a GUP
pin (the warning could generate false positives as a result of
speculative cache lookups that run lockless get_page_unless_zero() on
any pfn).

To avoid races in the postcopy live snapshot feature, I think it
should be enough to wait for all in-flight I/O to complete before
marking the guest address space readonly (the KVM gup() side can be
taken care of by marking the shadow MMU readonly, which is a must
anyway; the mmu notifier will take care of that part).

The postcopy live snapshot will have to copy the page, so it's
effectively a COW in userland, and in turn it must ensure there's no
O_DIRECT in flight still writing to the page (despite it being marked
readonly) while the wrprotection syscall runs.

For your case probably there's no gup() in the equation unless you use
O_DIRECT (I don't think you use shadow-MMU in the kernel in
linux-user) so you don't have to worry about those races and it's just
simpler.


Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

2014-11-21 Thread Andrea Arcangeli
Hi Peter,

On Wed, Oct 29, 2014 at 05:56:59PM +, Peter Maydell wrote:
> On 29 October 2014 17:46, Andrea Arcangeli  wrote:
> > After some chat during the KVMForum I've been already thinking it
> > could be beneficial for some usage to give userland the information
> > about the fault being read or write
> 
> ...I wonder if that would let us replace the current nasty
> mess we use in linux-user to detect read vs write faults
> (which uses a bunch of architecture-specific hacks including
> in some cases "look at the insn that triggered this SEGV and
> decode it to see if it was a load or a store"; see the
> various cpu_signal_handler() implementations in user-exec.c).

There's currently no plan to deliver to userland read access
notifications of a present page, simply because the task of the
userfaultfd is to handle the page fault in userland, but if the page
is mapped and readable it won't fault in the first place :). I just
mean it's not like gdb read watch.

Even if the region were set to PROT_NONE it would still SEGV
without triggering a userfault (after all pte_present would still be
true because the page is still mapped despite not being readable, so
in any case it wouldn't be considered a not-present page fault).

If you temporarily remove the page (which requires an unavoidable TLB
flush, also considering that if the page was previously mapped the TLB
could still resolve it for reads), then it would work, because the plan
is to provide read/write fault information through the userfaultfd.

In theory it would be possible to deliver PROT_NONE faults through
userfault too but it doesn't make much sense because PROT_NONE still
requires a TLB flush, in addition to the vma
modifications/splitting/rbtree-rebalance and the mmap_sem for writing
as well.

Temporarily removing/moving the page with remap_anon_pages shall be
much better than using PROT_NONE for this (or alternative syscall name
to differentiate it further from remap_file_pages, or equivalent
userfaultfd command if we decide to hide the pte/pmd mangling as
userfaultfd commands instead of adding new standalone syscalls). Its
only constraint would be that you must mark the region MADV_DONTFORK
if you intend linux-user to ever fork, or it won't work reliably (that
constraint is to eliminate the need for additional rmap
complexity, precisely so that it doesn't turn into something more
intrusive like remap_file_pages). I assume that would be a fine
constraint for linux-user.

Thanks,
Andrea


Re: [PATCH 00/17] RFC: userfault v2

2014-11-20 Thread Andrea Arcangeli
Hi,

On Thu, Nov 20, 2014 at 10:54:29AM +0800, zhanghailiang wrote:
> Yes, you are right. This is what i really want, bypass all non-present faults
> and only track strict wrprotect faults. ;)
> 
> So, do you plan to support that in the userfault API?

Yes, I think it's a good idea to support wrprotect/COW faults too.

I just wanted to understand if there was any other reason why you
needed only wrprotect faults, because the non-present faults didn't
look like a big performance concern if they triggered in addition to
wrprotect faults, but it's certainly ok to optimize them away so it's
fully optimal.

All it should take to differentiate the behavior is one more bit
during registration, so you can select non-present faults, wrprotect
faults, or both. Postcopy live migration would select only non-present
faults, postcopy live snapshot would select only wrprotect faults, and
anything like distributed shared memory supporting shared readonly
access and exclusive write access would select both flags.
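
A hedged sketch of what that one-more-bit selection could look like at
registration time, reusing the uffdio_register layout from this patchset
(UFFDIO_REGISTER_MODE_WP is a hypothetical name here, since the wrprotect
mode flag hasn't been added yet):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Sketch only: pick the tracking mode when registering a range. */
static int uffd_register_mode(int ufd, unsigned long start, unsigned long len,
			      __u64 mode)
{
	struct uffdio_register reg;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = start;
	reg.range.len = len;
	reg.mode = mode;
	return ioctl(ufd, UFFDIO_REGISTER, &reg);
}

/*
 * postcopy live migration:    mode = UFFDIO_REGISTER_MODE_MISSING
 * postcopy live snapshot:     mode = UFFDIO_REGISTER_MODE_WP (hypothetical)
 * distributed shared memory:  mode = MISSING | WP
 */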

I just sent an (unfortunately) longish but way more detailed email
about live snapshotting with userfaultfd but I just wanted to give a
shorter answer here too :).

Thanks,
Andrea


Re: [PATCH 00/17] RFC: userfault v2

2014-11-20 Thread Andrea Arcangeli
Hi,

On Fri, Oct 31, 2014 at 12:39:32PM -0700, Peter Feiner wrote:
> On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
> > Agreed, but for doing live memory snapshot (VM is running when do 
> > snapsphot),
> > we have to do this (block the write action), because we have to save the 
> > page before it
> > is dirtied by writing action. This is the difference, compared to pre-copy 
> > migration.
> 
> Ah ha, I understand the difference now. I suppose that you have considered
> doing a traditional pre-copy migration (that is, passes over memory saving
> dirty pages, followed by a pause and a final dump of remaining dirty pages) to
> a file. Your approach has the advantage of having the VM pause time bounded by
> the time it takes to handle the userfault and do the write, as opposed to
> pre-copy migration which has a pause time bounded by the time it takes to do
> the final dump of dirty pages, which, in the worst case, is the time it takes
> to dump all of the guest memory!

It sounds like a really similar issue to live migration: one can
implement a precopy live snapshot, a precopy+postcopy live snapshot,
or a pure postcopy live snapshot.

The decision on the amount of precopy done before engaging postcopy
(zero passes, 1 pass, or more passes) would have similar tradeoffs
too, except instead of having to re-transmit the re-dirtied pages over
the wire, it would need to overwrite them to disk.

The more precopy passes, the longer it takes for the live snapshotting
process to finish and the more I/O there will be (for live migration it'd
be network bandwidth usage instead of amount of I/O), but the shorter
the postcopy runtime will be (and the shorter the postcopy runtime is,
the fewer userfaults will end up triggering on writes, in turn reducing
the slowdown and the artificial fault latency introduced to the guest
runtime). But the more precopy passes there are, the more overwriting
will happen during the "longer" precopy stage and the more overall load
there will be for the host (the otherwise idle part of the host).

For the postcopy live snapshot the wrprotect faults are quite
equivalent to the not-present faults of postcopy live migration logic.

> You could use the old fork & dump trick. Given that the guest's memory is
> backed by private VMA (as of a year ago when I last looked, is always the case
> for QEMU), you can have the kernel do the write protection for you.
> Essentially, you fork Qemu and, in the child process, dump the guest memory
> then exit. If the parent (including the guest) writes to guest memory, then it
> will fault and the kernel will copy the page. 
> 
> The fork & dump approach will give you the best performance w.r.t. guest pause
> times (i.e., just pausing for the COW fault handler), but it does have the
> distinct disadvantage of potentially using 2x the guest memory (i.e., if the
> parent process races ahead and writes to all of the pages before you finish 
> the
> dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
> memory as you copy it.

This is a very good point. fork must be evaluated first because it
literally already provides you a readonly memory snapshot of the guest
memory.
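
As a rough illustration of the fork & dump approach described above,
here is a minimal sketch. It assumes the guest RAM is a single private
anonymous mapping and that MADV_DONTFORK has not been set on it (as
discussed below, qemu normally does set it); error handling and the
real qemu integration are omitted.

===
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Sketch of fork & dump: the child inherits a COW view of guest memory
 * and writes it out, while the parent (and the guest) keeps running. */
static int dump_snapshot(char *guest_mem, size_t guest_size, FILE *out)
{
        pid_t pid = fork();
        if (pid < 0)
                return -1;
        if (pid == 0) {
                size_t chunk = 2UL << 20;       /* dump in 2MB chunks */
                for (size_t off = 0; off < guest_size; off += chunk) {
                        size_t n = guest_size - off < chunk ?
                                   guest_size - off : chunk;
                        fwrite(guest_mem + off, 1, n, out);
                        /* mitigate the 2x memory worst case: drop each
                         * chunk from the child as soon as it's written */
                        madvise(guest_mem + off, n, MADV_DONTNEED);
                }
                fflush(out);
                _exit(0);
        }
        /* parent: the guest keeps running; its writes only trigger COW
         * faults, so the child's view stays a consistent snapshot */
        int status;
        waitpid(pid, &status, 0);
        return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}
===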

The memory cons mentioned above could lead to -ENOMEM if too many
guests run live snapshots at the same time on the same host, unless
overcommit_memory is set to 1 (0 by default). Even then, if too many
live snapshots are running in parallel you could hit the OOM killer if
there are just a bit too many faults at the same time, or you could
hit heavy swapping, which isn't ideal either.

In fact the -ENOMEM avoidance (with qemu failing) is one of the two
critical reasons why qemu always sets the guest memory as
MADV_DONTFORK. But that's not the only reason.

To use the fork() trick you'd need to undo the MADV_DONTFORK first,
but that would open another problem: there's a race condition between
fork(), O_DIRECT and the <4k hardblocksize of virtio-blk. If any
read() syscall with O_DIRECT and len=512 is in flight while fork() is
running (think of the aio running in parallel with the live snapshot
thread that forks the child to dump the snapshot), and the guest
writes with the CPU to any 512-byte fragment of the same page that is
the destination buffer of that read(len=512) (i.e. two different
512-byte areas of the same guest page), the data from the O_DIRECT
read will get lost.

So to use fork we'd need to fix this longstanding race (I tried, but
in the end we declared it a userland issue because it's not
exploitable to bypass permissions or corrupt kernel or unrelated
memory). Or you'd need to add locking between the dataplane/aio
threads and the live snapshot thread to ensure no direct-io I/O is
ever in-flight while fork runs.

Handling the O_DIRECT case, however, would only help with qemu TCG;
with KVM it's not even enough to stop the O_DIRECT reads. KVM would
use gup(write=1) from the async-pf all the time... and then the shadow
pagetables would go out of sync (it won't destabilize the host of
course, but the guest memor

Re: [PATCH 00/17] RFC: userfault v2

2014-11-19 Thread Andrea Arcangeli
Hi Zhang,

On Fri, Oct 31, 2014 at 09:26:09AM +0800, zhanghailiang wrote:
> On 2014/10/30 20:49, Dr. David Alan Gilbert wrote:
> > * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> >> On 2014/10/30 1:46, Andrea Arcangeli wrote:
> >>> Hi Zhanghailiang,
> >>>
> >>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> >>>> Hi Andrea,
> >>>>
> >>>> Thanks for your hard work on userfault;)
> >>>>
> >>>> This is really a useful API.
> >>>>
> >>>> I want to confirm a question:
> >>>> Can we support distinguishing between writing and reading memory for 
> >>>> userfault?
> >>>> That is, we can decide whether writing a page, reading a page or both 
> >>>> trigger userfault.
> >>>>
> >>>> I think this will help supporting vhost-scsi,ivshmem for migration,
> >>>> we can trace dirty page in userspace.
> >>>>
> >>>> Actually, i'm trying to relize live memory snapshot based on pre-copy 
> >>>> and userfault,
> >>>> but reading memory from migration thread will also trigger userfault.
> >>>> It will be easy to implement live memory snapshot, if we support 
> >>>> configuring
> >>>> userfault for writing memory only.
> >>>
> >>> Mail is going to be long enough already so I'll just assume tracking
> >>> dirty memory in userland (instead of doing it in kernel) is worthy
> >>> feature to have here.
> >>>
> >>> After some chat during the KVMForum I've been already thinking it
> >>> could be beneficial for some usage to give userland the information
> >>> about the fault being read or write, combined with the ability of
> >>> mapping pages wrprotected to mcopy_atomic (that would work without
> >>> false positives only with MADV_DONTFORK also set, but it's already set
> >>> in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
> >>> checked also in the wrprotect faults, not just in the not present
> >>> faults, but it's not a massive change. Returning the read/write
> >>> information is also a not massive change. This will then payoff mostly
> >>> if there's also a way to remove the memory atomically (kind of
> >>> remap_anon_pages).
> >>>
> >>> Would that be enough? I mean are you still ok if non present read
> >>> fault traps too (you'd be notified it's a read) and you get
> >>> notification for both wrprotect and non present faults?
> >>>
> >> Hi Andrea,
> >>
> >> Thanks for your reply, and your patience;)
> >>
> >> Er, maybe i didn't describe clearly. What i really need for live memory 
> >> snapshot
> >> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing 
> >> write action*.
> >>
> >> My initial solution scheme for live memory snapshot is:
> >> (1) pause VM
> >> (2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
> >> (3) save deivce state to snapshot file
> >> (4) resume VM
> >> (5) snapshot thread begin to save page of memory to snapshot file
> >> (6) VM is going to run, and it is OK for VM or other thread to read ram 
> >> (no fault trap),
> >>  but if VM try to write page (dirty the page), there will be
> >>  a userfault trap notification.
> >> (7) a fault-handle-thread reads the page request from userfaultfd,
> >>  it will copy content of the page to some buffers, and then remove the 
> >> page's
> >>  wrprotect limit(still using the userfaultfd to tell kernel).
> >> (8) after step (7), VM can continue to write the page which is now can be 
> >> write.
> >> (9) snapshot thread save the page cached in step (7)
> >> (10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.
> >
> > Hmm, I can see the same process being useful for the fault-tolerance schemes
> > like COLO, it needs a memory state snapshot.
> >
> >> So, what i need for userfault is supporting only wrprotect fault. i don't
> >> want to get notification for non present reading faults, it will influence
> >> VM's performance and the efficiency of doing snapshot.
> >
> > What pages would be non-present at this point - just balloon?
> >
> 
> Er, sorry, it should be 'no-present page faults';)

Could you elaborate? If the balloon pages or the not-yet-allocated
pages in the guest fault too (in addition to the wrprotect faults), it
doesn't sound like a big deal, as it's not so common (the balloon case
especially shouldn't happen except while the balloon is deflated
during the live snapshotting). We could bypass non-present faults
though, and only track strict wrprotect faults.
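
For reference, the fault-handling thread of steps (7)-(9) above could
look roughly like the sketch below. This is only a sketch: the
wrprotect userfaultfd commands did not exist yet at this point, so
read_wp_fault() and remove_write_protection() are hypothetical
placeholders, as are the snapshot_state helpers.

===
#include <string.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

struct wp_fault { unsigned long addr; };        /* page-aligned fault address */
struct snapshot_state { int uffd; };

/* hypothetical helpers, assumed to be provided elsewhere */
int  read_wp_fault(int uffd, struct wp_fault *f);
int  remove_write_protection(int uffd, unsigned long addr, unsigned long len);
int  snapshot_complete(struct snapshot_state *s);
void queue_for_writeout(struct snapshot_state *s, unsigned long addr, void *buf);

static void *snapshot_fault_thread(void *opaque)
{
        struct snapshot_state *s = opaque;

        while (!snapshot_complete(s)) {
                struct wp_fault f;

                /* step (7): wait for a write fault on a still
                 * write-protected page */
                if (read_wp_fault(s->uffd, &f) < 0)
                        break;

                /* copy the pristine content aside before the guest is
                 * allowed to dirty it */
                void *buf = malloc(PAGE_SIZE);
                memcpy(buf, (void *)f.addr, PAGE_SIZE);
                queue_for_writeout(s, f.addr, buf);

                /* step (8): drop the write protection so the faulting
                 * vcpu can resume; step (9) writes "buf" out later */
                remove_write_protection(s->uffd, f.addr, PAGE_SIZE);
        }
        return NULL;
}
===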


Re: [PATCH 00/17] RFC: userfault v2

2014-10-29 Thread Andrea Arcangeli
Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> Hi Andrea,
> 
> Thanks for your hard work on userfault;)
> 
> This is really a useful API.
> 
> I want to confirm a question:
> Can we support distinguishing between writing and reading memory for 
> userfault?
> That is, we can decide whether writing a page, reading a page or both trigger 
> userfault.
> 
> I think this will help supporting vhost-scsi,ivshmem for migration,
> we can trace dirty page in userspace.
> 
> Actually, i'm trying to relize live memory snapshot based on pre-copy and 
> userfault,
> but reading memory from migration thread will also trigger userfault.
> It will be easy to implement live memory snapshot, if we support configuring
> userfault for writing memory only.

The mail is going to be long enough already, so I'll just assume that
tracking dirty memory in userland (instead of doing it in kernel) is a
worthwhile feature to have here.

After some chat during the KVM Forum I had already been thinking it
could be beneficial for some usages to give userland the information
about the fault being a read or a write, combined with the ability of
mapping pages wrprotected with mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but that's already
set in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not-present
faults, but it's not a massive change. Returning the read/write
information is not a massive change either. This will then pay off
mostly if there's also a way to remove the memory atomically (kind of
like remap_anon_pages).

Would that be enough? I mean, are you still ok if non-present read
faults trap too (you'd be notified it's a read) and you get
notifications for both wrprotect and non-present faults?

The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
  fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
  into that vma too

if yes engage userfaultfd protocol

otherwise raise SIGBUS (single-threaded apps should be fine with
SIGBUS, and it spares them from spawning a thread in order to talk
the userfaultfd protocol)

- if userfaultfd protocol is engaged, return read|write fault + fault
  address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
  userfaultfd protocol so we keep the two problems separated and we
  don't mix them in the same API which makes it even harder to
  finalize it.

add mcopy_atomic (with a flag to map the page readonly too)

The alternative would be to hide mcopy_atomic (and even
remap_anon_pages, in order to "remove" the memory atomically for
the externalization into the cloud) as userfaultfd commands written
into the fd. But then there would be not much point in keeping
MADV_USERFAULT around, and I could just remove it too; besides, it
doesn't look clean having to open the userfaultfd just to issue a
hidden mcopy_atomic.

So it becomes a decision if the basic SIGBUS mode for single
threaded apps should be supported or not. As long as we support
SIGBUS too and we don't force to use userfaultfd as the only
mechanism to be notified about userfaults, having a separate
mcopy_atomic syscall sounds cleaner.
 
Perhaps mcopy_atomic could be used in other cases that may arise
later that may not be connected with the userfault.
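
To illustrate the SIGBUS mode with a standalone mcopy_atomic, here is
a rough sketch of a single-threaded app following the plan above.
MADV_USERFAULT and mcopy_atomic() are the proposed (not yet merged)
interfaces, the MADV_USERFAULT value is the one used by this patchset
for most architectures, and fetch_page_from_source() is a hypothetical
helper that fills the bounce buffer (e.g. from the network).

===
#include <signal.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL
#ifndef MADV_USERFAULT
#define MADV_USERFAULT 18       /* value proposed in this patchset */
#endif

/* proposed syscall: atomically map a copy of "src" at "dst" */
extern long mcopy_atomic(void *dst, void *src, unsigned long len);
/* hypothetical helper: fill "tmp" with the content of the missing page */
extern void fetch_page_from_source(void *tmp, void *faulting_addr);

static char *tmp;       /* bounce buffer, never marked MADV_USERFAULT */

static void userfault_sigbus(int sig, siginfo_t *si, void *ctx)
{
        void *fault_page = (void *)((unsigned long)si->si_addr &
                                    ~(PAGE_SIZE - 1));
        (void)sig; (void)ctx;

        fetch_page_from_source(tmp, fault_page);
        /* atomically materialize the page; the faulting access retries */
        mcopy_atomic(fault_page, tmp, PAGE_SIZE);
}

int main(void)
{
        size_t size = 64 * PAGE_SIZE;
        char *region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct sigaction sa;

        tmp = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = userfault_sigbus;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        madvise(region, size, MADV_USERFAULT);

        /* the first touch raises SIGBUS, the handler resolves it */
        return region[0];
}
===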

Questions to double check the above plan is ok:

1) should I drop the SIGBUS behavior and MADV_USERFAULT?

2) should I hide mcopy_atomic as a write into the userfaultfd?

   NOTE: even if I hide mcopy_atomic as a userfaultfd command written
   into the fd, the buffer pointer passed to the write() syscall would
   still _not_ point to the data like in a regular write; it would be
   a pointer to a command structure that points to the source and
   destination data of the "hidden" mcopy_atomic. The only advantage
   is that perhaps I could wake up the blocked page faults without
   requiring an additional syscall.

   The standalone mcopy_atomic would still require a write into the
   userfaultfd, as happens now after remap_anon_pages returns, in
   order to wake up the stopped page faults.

3) should I add a registration command to trap only write faults?

   The protocol can always be extended later anyway in a backwards
   compatible way but it's better if we get it fully featured from the
   start.

For completeness, some answers for other questions I've seen floating
around but that weren't posted on the list yet (you can skip reading
the below part if not interested):

- open("/dev/userfault") instead of sys_userfaultfd(), I don't see the
  benefit: userfaul

Re: [PATCH 2/4] mm: gup: add get_user_pages_locked and get_user_pages_unlocked

2014-10-29 Thread Andrea Arcangeli
On Thu, Oct 09, 2014 at 12:50:37PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> 
> > +static inline long __get_user_pages_locked(struct task_struct *tsk,
> > +  struct mm_struct *mm,
> > +  unsigned long start,
> > +  unsigned long nr_pages,
> > +  int write, int force,
> > +  struct page **pages,
> > +  struct vm_area_struct **vmas,
> > +  int *locked,
> > +  bool notify_drop)
> > +{
> 
> > +   if (notify_drop && lock_dropped && *locked) {
> > +   /*
> > +* We must let the caller know we temporarily dropped the lock
> > +* and so the critical section protected by it was lost.
> > +*/
> > +   up_read(&mm->mmap_sem);
> > +   *locked = 0;
> > +   }
> > +   return pages_done;
> > +}
> 
> > +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> > +  unsigned long start, unsigned long nr_pages,
> > +  int write, int force, struct page **pages,
> > +  int *locked)
> > +{
> > +   return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> > +  pages, NULL, locked, true);
> > +}
> 
> > +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> > +unsigned long start, unsigned long nr_pages,
> > +int write, int force, struct page **pages)
> > +{
> > +   long ret;
> > +   int locked = 1;
> > +   down_read(&mm->mmap_sem);
> > +   ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> > + pages, NULL, &locked, false);
> > +   if (locked)
> > +   up_read(&mm->mmap_sem);
> > +   return ret;
> > +}
> 
> >  long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > unsigned long start, unsigned long nr_pages, int write,
> > int force, struct page **pages, struct vm_area_struct **vmas)
> >  {
> > +   return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> > +  pages, vmas, NULL, false);
> >  }
> 
> I'm wondering about that notify_drop parameter, what's the added
> benefit? If you look at these 3 callers we can do away with it, since in
> the second called where we have locked but !notify_drop we seem to do

The second (and third) callers pass notify_drop=false, so the
notify_drop parameter is always a noop for them. They certainly could
get away without it.

> the exact same thing afterwards anyway.

It makes a difference only to the first caller; if it weren't for the
first caller, notify_drop could be dropped. The first caller does this:

return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
   pages, NULL, locked, true, FOLL_TOUCH);
^ notify_drop = true

Without "notify_drop=true" the first caller could make its own
respective caller think the lock has never been dropped, just because
it is locked by the time get_user_pages_locked returned. But the
caller must be made aware that the lock has been dropped during the
call and in turn any "vma" it got before inside the mmap_sem critical
section is now stale. That's all notify_drop achieves.
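
A minimal caller-side sketch of what those semantics buy the first
caller (not taken from the patchset, purely illustrative): when
*locked comes back 0, the caller knows its earlier vma lookup is
stale.

===
#include <linux/mm.h>
#include <linux/sched.h>

/* Illustrative only: a caller that must know whether mmap_sem was
 * dropped, because it cached a vma before calling gup. */
static long example_caller(struct task_struct *tsk, struct mm_struct *mm,
                           unsigned long start, struct page **pages,
                           unsigned long nr_pages)
{
        struct vm_area_struct *vma;
        int locked = 1;
        long ret;

        down_read(&mm->mmap_sem);
        vma = find_vma(mm, start);      /* valid only while mmap_sem is held */

        ret = get_user_pages_locked(tsk, mm, start, nr_pages,
                                    1 /* write */, 0 /* force */,
                                    pages, &locked);
        if (!locked) {
                /*
                 * The lock was dropped (and released) during the call:
                 * "vma" is stale and must not be dereferenced. Re-take
                 * the lock and look the vma up again if still needed.
                 */
                down_read(&mm->mmap_sem);
                vma = find_vma(mm, start);
        }
        /* ... use vma and pages ... */
        up_read(&mm->mmap_sem);
        return ret;
}
===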

Thanks,
Andrea


Re: [PATCH 2/4] mm: gup: add get_user_pages_locked and get_user_pages_unlocked

2014-10-29 Thread Andrea Arcangeli
On Thu, Oct 09, 2014 at 12:47:23PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> > +static inline long __get_user_pages_locked(struct task_struct *tsk,
> > +  struct mm_struct *mm,
> > +  unsigned long start,
> > +  unsigned long nr_pages,
> > +  int write, int force,
> > +  struct page **pages,
> > +  struct vm_area_struct **vmas,
> > +  int *locked,
> > +  bool notify_drop)
> 
> You might want to consider __always_inline to make sure it does indeed
> get inlined and constant propagation works for @locked and @notify_drop.

Ok, that's included in the last patchset submit.

Thanks,
Andrea


Re: [PATCH 3/4] mm: gup: use get_user_pages_fast and get_user_pages_unlocked

2014-10-12 Thread Andrea Arcangeli
On Thu, Oct 09, 2014 at 12:52:45PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 01, 2014 at 10:56:36AM +0200, Andrea Arcangeli wrote:
> > Just an optimization.
> 
> Does it make sense to split the thing in two? One where you apply
> _unlocked and then one where you apply _fast?

Yes but I already dropped the _fast optimization, as the latency
enhancements to gup_fast were NAKed earlier in this thread. So this
patch has already been updated to only apply _unlocked.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=bc2e0473b601c6a330ddb4adbcf4c048b2233d4e


Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

2014-10-07 Thread Andrea Arcangeli
On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
> mremap like interface, or file+commands protocol interface. I tend to
> like mremap more, that's why I opted for a remap_anon_pages syscall
> kept orthogonal to the userfaultfd functionality (remap_anon_pages
> could be also used standalone as an accelerated mremap in some
> circumstances) but nothing prevents to just embed the same mechanism

Sorry for the self followup, but something else comes to mind to
elaborate this further.

In terms of interfaces, the most efficient I could think of to
minimize kernel enter/exits would be to append the "source address" of
the data received from the network transport to the userfaultfd_write()
command (by appending 8 bytes to the wakeup command). That said,
mixing the mechanism to be notified about userfaults with the
mechanism to resolve a userfault looks like a complication to me. I
kind of liked keeping the userfaultfd protocol very simple and doing
just its thing. The userfaultfd doesn't need to know how the userfault
was resolved; even mremap would work theoretically (until we run out
of vmas). I thought it was simpler to keep it that way. However, if we
want to resolve the fault with a "write()" syscall, this may be the
most efficient way to do it: we're already doing a write() into the
pseudofd to wake up the page fault, and that write already contains
the destination address, so I'd just need to append the source address
to the wakeup command.

I probably grossly overestimated the benefits of resolving the
userfault with a zerocopy page move, sorry. So if we entirely drop the
zerocopy behavior and the TLB flush of the old page like you
suggested, the way to keep the userfaultfd mechanism decoupled from
the userfault resolution mechanism would be to implement an
atomic-copy syscall. That would then work for SIGBUS userfaults too,
without requiring a pseudofd. It would be enough to call
mcopy_atomic(userfault_addr, tmp_addr, len) with the only constraint
that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
wouldn't page fault or call GUP into the destination address (it
can't, otherwise the in-flight partial copy would be visible to the
process, breaking the atomicity of the copy), but it would fill in the
pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
currently has (in turn it would by design bypass the VM_USERFAULT
check and be ideal for resolving userfaults).

mcopy_atomic could then also be extended to tmpfs; it would work
without requiring the source page to be a tmpfs page too and without
having to convert page types on the fly.

If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
course so it'd be even less intrusive than the current
remap_anon_pages and it would require zero TLB flush during its
runtime (it would just require an atomic copy).

So should I try to embed a mcopy_atomic inside userfault_write or can
I expose it to userland as a standalone new syscall? Or should I do
something different? Comments?

Thanks,
Andrea


Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

2014-10-07 Thread Andrea Arcangeli
Hello,

On Tue, Oct 07, 2014 at 08:47:59AM -0400, Linus Torvalds wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli  wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
> 
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface
> 
> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.

I see what you mean. The only con I see is that we then couldn't use
recv(tmp_addr, PAGE_SIZE) followed by remap_anon_pages(faultaddr,
tmp_addr, PAGE_SIZE, ..) and retain the zerocopy behavior. Or how
could we? There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

Ideally if we could prevent the page data coming from the network to
ever become visible in the kernel we could avoid the TLB flush and
also be zerocopy but I can't see how we could achieve that.

The page data could come through a ssh pipe or anything (qemu supports
all kind of network transports for live migration), this is why
leaving the network protocol into userland is preferable.

As things stand now, I'm afraid that with a write() syscall we
couldn't do it zerocopy. We'd still need to receive the memory in a
temporary page and then copy it to a kernel page (invisible to
userland while we write to it) to later map into the userfault
address.

If it weren't for the TLB flush of the old page, the remap_anon_pages
variant would be more efficient than doing a copy through a write
syscall. Is the copy cheaper than a TLB flush? I probably naively
assumed the TLB flush was always cheaper.

Now another idea that comes to mind, to add the ability to switch
between copy and TLB flush, is a RAP_FORCE_COPY flag that would do a
copy inside remap_anon_pages and leave the original page mapped in
place... (such a flag would also disable the -EBUSY error if
page_mapcount is > 1).

So then, if the RAP_FORCE_COPY flag is set, remap_anon_pages would
behave like you suggested (but with a mremap-like interface instead
of a write syscall) and we could benchmark the difference between copy
and TLB flush too. We could even periodically benchmark it at runtime
and switch over to the faster method (the more CPUs there are in the
host and the more threads the process has, the faster the copy will be
compared to the TLB flush).

Of course in terms of API I could implement the exact same mechanism
as described above for remap_anon_pages inside a write() to the
userfaultfd (it's a pseudo inode). It'd need two different commands to
prepare for the coming write (with a len multiple of PAGE_SIZE), to
specify the address where the page should be mapped and whether to
behave zerocopy or to skip the TLB flush and do a copy.

Because the copy vs TLB flush trade-off is achievable with both
interfaces, I think it really boils down to choosing between a
mremap-like interface or a file+commands protocol interface. I tend to
like mremap more; that's why I opted for a remap_anon_pages syscall
kept orthogonal to the userfaultfd functionality (remap_anon_pages
could also be used standalone as an accelerated mremap in some
circumstances), but nothing prevents just embedding the same mechanism
inside userfaultfd if a file+commands API is preferable. Or we could
add a different syscall (separate from userfaultfd) that creates
another pseudofd to write a command plus the page data into. I just
wouldn't see the point of creating a pseudofd only to copy a page
atomically; the write() syscall would look more appealing if the
userfaultfd is already open for other reasons and the pseudofd
overhead is required anyway.

The last thing to keep in mind is that if userfaults are used with
SIGBUS and without userfaultfd, remap_anon_pages would still have been
useful, so if we retain the SIGBUS behavior for volatile pages and we
don't force the usage of userfaultfd, it may be cleaner to use a
separate pseudofd for the write() syscall rather than the userfaultfd.
Otherwise the app would need to open the userfaultfd to resolve the
fault even though it's not using the userfaultfd protocol, which
doesn't look like an intuitive interface to me.

Comments welcome.

Thanks,
Andrea


Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages

2014-10-07 Thread Andrea Arcangeli
Hi Kirill,

On Tue, Oct 07, 2014 at 02:10:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote:
> > There's one constraint enforced to allow this simplification: the
> > source pages passed to remap_anon_pages must be mapped only in one
> > vma, but this is not a limitation when used to handle userland page
> > faults with MADV_USERFAULT. The source addresses passed to
> > remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
> > avoid any risk of the mapcount of the pages increasing, if fork runs
> > in parallel in another thread, before or while remap_anon_pages runs.
> 
> Have you considered triggering COW instead of adding limitation on
> pages' mapcount? The limitation looks artificial from interface POV.

I haven't considered it, mostly because I see it as a feature that it
returns -EBUSY. I prefer to avoid the risk of userland getting a
successful retval while internally the kernel silently behaved
non-zerocopy by mistake, because some userland bug forgot to set
MADV_DONTFORK on the src_vma.

COW would not be zerocopy so it's not ok. We get sub-1msec latency for
userfaults through 10gbit and we don't want to risk wasting CPU
caches.

I did however consider allowing the strict behavior (i.e. the feature)
to be extended later in a backwards compatible way. We could provide a
non-zerocopy behavior with a RAP_ALLOW_COW flag that would then turn
the -EBUSY error into a copy.

It's also more complex to implement the COW now, and it would make the
code that really matters harder to review. So it may be preferable to
extend this later in a backwards compatible way with a new
RAP_ALLOW_COW flag.

The current handling of the flags is already written in a way that
should allow backwards compatible extension with RAP_ALLOW_*:

#define RAP_ALLOW_SRC_HOLES (1UL<<0)

SYSCALL_DEFINE4(remap_anon_pages,
unsigned long, dst_start, unsigned long, src_start,
unsigned long, len, unsigned long, flags)
[..]
long err = -EINVAL;
[..]
if (flags & ~RAP_ALLOW_SRC_HOLES)
return err;


Re: [PATCH 08/17] mm: madvise MADV_USERFAULT

2014-10-07 Thread Andrea Arcangeli
Hi Kirill,

On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > userland touches a still unmapped virtual address, a sigbus signal is
> > sent instead of allocating a new page. The sigbus signal handler will
> > then resolve the page fault in userland by calling the
> > remap_anon_pages syscall.
> 
> Hm. I wounder if this functionality really fits madvise(2) interface: as
> far as I understand it, it provides a way to give a *hint* to kernel which
> may or may not trigger an action from kernel side. I don't think an
> application will behaive reasonably if kernel ignore the *advise* and will
> not send SIGBUS, but allocate memory.
> 
> I would suggest to consider to use some other interface for the
> functionality: a new syscall or, perhaps, mprotect().

I didn't feel like adding PROT_USERFAULT to mprotect, which looks
hardwired to just these flags:

   PROT_NONE  The memory cannot be accessed at all.

   PROT_READ  The memory can be read.

   PROT_WRITE The memory can be modified.

   PROT_EXEC  The memory can be executed.

Normally mprotect doesn't just alter the vmas, it also alters the
pte/hugepmd protection bits; that is never needed with VM_USERFAULT,
so I don't feel VM_USERFAULT is a protection change to the VMA.

mprotect is also hardwired to mangle only the VM_READ|WRITE|EXEC
flags, while madvise is ideal for setting arbitrary vma flags.

From an implementation standpoint the perfect place to set a flag in a
vma is madvise. This is what MADV_DONTFORK (it sets VM_DONTCOPY)
already does too, in an identical way to MADV_USERFAULT/VM_USERFAULT.

MADV_DONTFORK is as critical as MADV_USERFAULT because people depend
on it, for example to prevent the O_DIRECT vs fork race condition that
results in silent data corruption during I/O with threads that may
fork. The other reason why MADV_DONTFORK is critical is that fork()
would otherwise fail with OOM unless full overcommit is enabled
(i.e. pci hotplug crashes the guest if you forget to set
MADV_DONTFORK).

Another madvise that would generate a failure if not obeyed by the
kernel is MADV_DONTNEED: if it did nothing it could lead to OOM
killing. We don't inflate virt balloons using munmap, just to make an
example. Various other apps (maybe JVM garbage collection too) make
extensive use of MADV_DONTNEED and depend on it.

That said, I can change it to mprotect; the only thing I don't like is
that it'll result in a less clean patch, and I can't possibly see a
practical risk in keeping it simpler with madvise, as long as we
always return -EINVAL whenever we encounter a vma type that cannot
raise userfaults yet (that is something I already enforce).

Yet another option would be to drop MADV_USERFAULT and
vm_flags&VM_USERFAULT entirely, and in turn the ability to handle
userfaults with SIGBUS, and retain only the userfaultfd. The new
userfaultfd protocol requires registering each created userfaultfd
into its own private virtual memory ranges (that is to allow an
unlimited number of userfaultfds per process). Currently the
userfaultfd engages iff the fault address intersects both the
MADV_USERFAULT range and the userfaultfd registered ranges. So I could
drop MADV_USERFAULT and VM_USERFAULT and just check for
vma->vm_userfaultfd_ctx!=NULL to know if the userfaultfd protocol
needs to be engaged during the first page fault for a still unmapped
virtual address. I just thought it would be more flexible to also
allow SIGBUS without forcing people to use userfaultfd (that's in fact
the only reason to still retain madvise(MADV_USERFAULT)!).

The earlier volatile pages patches only supported the SIGBUS behavior,
for example, and I didn't intend to force them to use userfaultfd if
they're guaranteed to access the memory with the CPU and never through
a kernel syscall (that is something the app can enforce by
design). userfaultfd becomes necessary the moment you want to handle
userfaults through syscalls/gup etc... qemu obviously requires
userfaultfd and it never uses the userfaultfd-less SIGBUS behavior, as
it touches the memory in all possible ways (first and foremost with
the KVM page fault, which uses almost all variants of gup..).

So here somebody should comment and choose between:

1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of
   the current madvise(MADV_USERFAULT)

2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the
   userfaultfd protocol as the only way for userland to catch
   userfaults (each userfaultfd must already register itself into its
   own virtual memory ranges so it's a trivial change for userfaultfd
   users that deletes jus

Re: [PATCH 08/17] mm: madvise MADV_USERFAULT

2014-10-06 Thread Andrea Arcangeli
Hi,

On Sat, Oct 04, 2014 at 08:13:36AM +0900, Mike Hommey wrote:
> On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > userland touches a still unmapped virtual address, a sigbus signal is
> > sent instead of allocating a new page. The sigbus signal handler will
> > then resolve the page fault in userland by calling the
> > remap_anon_pages syscall.
> 
> What does "unmapped virtual address" mean in this context?

To clarify this I added a second sentence to the commit header:

"still unmapped virtual address" in the previous sentence means, in
this context, that the pte/trans_huge_pmd is null. It means it's a
hole inside the anonymous vma (the kind of hole that doesn't account
for RSS but only for the virtual size of the process). It is the same
state all anonymous virtual memory is in right after mmap; the same
state that, if you read from it, will map a zeropage into the faulting
virtual address. If the page is swapped out, it will not trigger
userfaults.
If something isn't clear let me know.

Thanks,
Andrea


Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

2014-10-06 Thread Andrea Arcangeli
Hello,

On Mon, Oct 06, 2014 at 09:55:41AM +0100, Dr. David Alan Gilbert wrote:
> * Linus Torvalds (torva...@linux-foundation.org) wrote:
> > On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli  
> > wrote:
> > >
> > > Overall this looks a fairly small change to the rmap code, notably
> > > less intrusive than the nonlinear vmas created by remap_file_pages.
> > 
> > Considering that remap_file_pages() was an unmitigated disaster, and
> > -mm has a patch to remove it entirely, I'm not at all convinced this
> > is a good argument.
> > 
> > We thought remap_file_pages() was a good idea, and it really really
> > really wasn't. Almost nobody used it, why would the anonymous page
> > case be any different?
> 
> I've posted code that uses this interface to qemu-devel and it works nicely;
> so chalk up at least one user.
> 
> For the postcopy case I'm using it for, we need to place a page, atomically
>   some thread might try and access it, and must either
>  1) get caught by userfault etc or
>  2) must succeed in it's access
> 
> and we'll have that happening somewhere between thousands and millions of 
> times
> to pages in no particular order, so we need to avoid creating millions of 
> mappings.

Yes, that's our current use case.

Of course if somebody has better ideas on how to resolve an anonymous
userfault they're welcome.

How to resolve a userfault is orthogonal to how to detect it, how to
notify userland about it, and how to be notified when the userfault
has been resolved. The latter is what the userfault and userfaultfd
do. The former is what remap_anon_pages is used for, but we could use
something else too if there are better ways. mremap would clearly work
too, but it would be less strict (it could lead to silent data
corruption if there are bugs in the userland code), it would be slower
and it would eventually hit a -ENOMEM failure because there would be
too many vmas.

I could in theory drop remap_anon_pages from this patchset, but
without an optimal way to resolve a userfault, the rest isn't so
useful.

We're currently discussing what would be the best way to resolve a
MAP_SHARED userfault on tmpfs in fact (that's not sorted yet), but so
far, it seems remap_anon_pages fits the bill for anonymous memory.

remap_anon_pages is not as problematic to maintain as remap_file_pages
for the reason explained in the commit header, but there are other
reasons too: it doesn't require the special pte_file and it changes
nothing in how anonymous page faults work. All it requires is a loop
to catch a changed page->index (previously page->index couldn't
change, now it can; that's the only thing it changes).

remap_file_pages' complexity derives from not being allowed to change
page->index during a move because the page mapcount may be bigger than
1, while that is precisely what remap_anon_pages does.

As long as this "rmap preparation" is the only constraint that
remap_anon_pages introduces in terms of rmap, it looks like a nice,
not-too-intrusive solution to resolve anonymous userfaults
efficiently.

Introducing remap_anon_pages in fact doesn't reduce the
simplification derived from the removal of remap_file_pages.

By contrast, removing remap_anon_pages later would only have the
benefit of removing this very patch (10/17) and no other benefit.

In short remap_anon_pages does this (heavily simplified):

   pte = *src_pte;
   *src_pte = 0;
   pte_page(pte)->index = adjusted according to src_vma/dst_vma->vm_pgoff
   *dst_pte = pte;

It guarantees not to modify the vmas, and in turn it doesn't require
taking the mmap_sem for writing.

To use remap_anon_pages, each thread has to create its own temporary
vma with MADV_DONTFORK set on it (not formally required by the
syscall's strict checks, but the application must then never fork if
MADV_DONTFORK isn't set, or remap_anon_pages could return -EBUSY:
there's no risk of silent data corruption even if the thread forks
without setting MADV_DONTFORK) as the source region where data is
received from the network. Then, after the data is fully received,
remap_anon_pages moves the page from the temporary vma to the address
where the userfault triggered, atomically (other threads may be
attempting to access the userfault address too; thanks to
remap_anon_pages' atomic behavior they never risk seeing partial data
coming from the network).

remap_anon_pages as a side effect creates a hole in the temporary
(source) vma, so the next recv() syscall receiving data from the
network will fault in a new anonymous page without requiring any
further malloc/free or other kind of vma mangling.
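
Put together, the per-thread receive pattern described above would
look roughly like the sketch below. remap_anon_pages is the syscall
proposed in this series (its x86_64 number is taken from patch 12/17);
the socket plumbing is simplified and the final write() into the
userfaultfd that wakes up the blocked fault is omitted.

===
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/socket.h>

#define PAGE_SIZE 4096UL
#ifndef __NR_remap_anon_pages
#define __NR_remap_anon_pages 321       /* x86_64 number in this patchset */
#endif

/* receive one page for "fault_addr" and map it there atomically */
static int resolve_one_userfault(int sockfd, void *fault_addr)
{
        /* per-thread temporary source vma, MADV_DONTFORK so a concurrent
         * fork can never raise the page mapcount and trigger -EBUSY */
        static __thread void *tmp;

        if (!tmp) {
                tmp = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (tmp == MAP_FAILED)
                        return -1;
                madvise(tmp, PAGE_SIZE, MADV_DONTFORK);
        }

        /* receive the page content from the migration stream */
        if (recv(sockfd, tmp, PAGE_SIZE, MSG_WAITALL) != (ssize_t)PAGE_SIZE)
                return -1;

        /* atomically move the page under the faulting address; this
         * leaves a hole in tmp, so the next recv() simply faults in a
         * fresh anonymous page there */
        if (syscall(__NR_remap_anon_pages, fault_addr, tmp, PAGE_SIZE, 0) !=
            (long)PAGE_SIZE)
                return -1;

        return 0;
}
===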

Thanks,
Andrea


Re: [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious

2014-10-06 Thread Andrea Arcangeli
Hello,

On Fri, Oct 03, 2014 at 11:23:53AM -0700, Linus Torvalds wrote:
> On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli  wrote:
> > This teaches gup_fast and __gup_fast to re-enable irqs and
> > cond_resched() if possible every BATCH_PAGES.
> 
> This is disgusting.
> 
> Many (most?) __gup_fast() users just want a single page, and the
> stupid overhead of the multi-page version is already unnecessary.
> This just makes things much worse.
> 
> Quite frankly, we should make a single-page version of __gup_fast(),
> and convert existign users to use that. After that, the few multi-page
> users could have this extra latency control stuff.

Ok. I didn't think of a better way to add the latency control other
than to reduce nr_pages in an outer loop instead of altering the inner
calls, but this is what I got after implementing it... If somebody has
a cleaner way to implement the latency control stuff that's welcome,
and I'd be glad to replace it.

> And yes, the single-page version of get_user_pages_fast() is actually
> latency-critical. shared futexes hit it hard, and yes, I've seen this
> in profiles.

KVM would save a few cycles from a single-page version too. I just
thought further optimizations could be added later and this was better
than nothing.

Considering I've no better idea how to implement the latency control
stuff, for now I'll just drop this controversial patch, and I'll
convert those get_user_pages to gup_unlocked instead of converting
them to gup_fast, which is more than enough to obtain the mmap_sem
holding scalability improvement (that also solves the mmap_sem trouble
for the userfaultfd). gup_unlocked isn't as good as gup_fast but it's
at least better than the current get_user_pages().

I got into this gup_fast latency control stuff purely because there
were a few get_user_pages calls that could have been converted to
get_user_pages_fast, as they were using "current" and "current->mm" as
the first two parameters, except for the risk of disabling irqs for
too long. So I tried to do the right thing and fix gup_fast, but I'll
leave this further optimization queued for later.

About the missing commit header for the other patch: Paolo already
replied to it. To clarify a bit further, in short I expect the
FOLL_TRIED flag to be merged through the KVM git tree, which already
contains it. I'll add a comment to the commit header to specify
that. Sorry for the confusion about that patch.

Thanks,
Andrea


[PATCH 08/17] mm: madvise MADV_USERFAULT

2014-10-03 Thread Andrea Arcangeli
MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
userland touches a still unmapped virtual address, a sigbus signal is
sent instead of allocating a new page. The sigbus signal handler will
then resolve the page fault in userland by calling the
remap_anon_pages syscall.

This functionality is needed to reliably implement postcopy live
migration in KVM (without having to use a special chardevice that
would disable all advanced Linux VM features, like swapping, KSM, THP,
automatic NUMA balancing, etc...).

MADV_USERFAULT could also be used to offload parts of anonymous memory
regions to remote nodes or to implement network distributed shared
memory.

Here I enlarged vm_flags to 64bit as we ran out of bits (a noop on
64bit kernels). An alternative is to find some combination of flags
that are mutually exclusive if set.

Signed-off-by: Andrea Arcangeli 
---
 arch/alpha/include/uapi/asm/mman.h |  3 ++
 arch/mips/include/uapi/asm/mman.h  |  3 ++
 arch/parisc/include/uapi/asm/mman.h|  3 ++
 arch/xtensa/include/uapi/asm/mman.h|  3 ++
 fs/proc/task_mmu.c |  1 +
 include/linux/mm.h |  1 +
 include/uapi/asm-generic/mman-common.h |  3 ++
 mm/huge_memory.c   | 60 +-
 mm/madvise.c   | 17 ++
 mm/memory.c| 13 
 10 files changed, 85 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h 
b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..a10313c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
   overrides the coredump filter bits */
 #define MADV_DODUMP17  /* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT 18  /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE   0
 
diff --git a/arch/mips/include/uapi/asm/mman.h 
b/arch/mips/include/uapi/asm/mman.h
index cfcb876..d9d11a4 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -84,6 +84,9 @@
   overrides the coredump filter bits */
 #define MADV_DODUMP17  /* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT 18  /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE   0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h 
b/arch/parisc/include/uapi/asm/mman.h
index 294d251..7bc7b7b 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -66,6 +66,9 @@
   overrides the coredump filter bits */
 #define MADV_DODUMP70  /* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT 71  /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 72/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE   0
 #define MAP_VARIABLE   0
diff --git a/arch/xtensa/include/uapi/asm/mman.h 
b/arch/xtensa/include/uapi/asm/mman.h
index 00eed67..5448d88 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -90,6 +90,9 @@
   overrides the coredump filter bits */
 #define MADV_DODUMP17  /* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT 18  /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE   0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee1c3a2..6033cb8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -568,6 +568,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct 
vm_area_struct *vma)
[ilog2(VM_HUGEPAGE)]= "hg",
[ilog2(VM_NOHUGEPAGE)]  = "nh",
[ilog2(VM_MERGEABLE)]   = "mg",
+   [ilog2(VM_USERFAULT)]   = "uf",
};
size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8900ba9..bf3df07 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -139,6 +139,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HUGEPAGE0x2000  /* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE  0x4000  /* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE   0x8000  /* KSM may merge identical pages */
+#define VM_USERFAULT   0x1ULL  /* Trigger user faults if not mapped */
 
 #if defined(CONFIG_X86)
 # define VM_PATVM_ARCH_1   /* PAT re

[PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER

2014-10-03 Thread Andrea Arcangeli
This adds two protocol commands to the userfaultfd protocol.

To register memory regions into userfaultfd you can write 16 bytes as:

 [ start|0x1, end ]

to unregister write:

 [ start|0x2, end ]

End is "start+len" (not start+len-1). Same as vma->vm_end.

This also enforces the constraint that start and end must both be page
aligned (so the last two bits become available to implement the
USERFAULTFD_RANGE_REGISTER|UNREGISTER commands).
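
For example, registering and later unregistering a range could look
like the sketch below, based on the 16-byte command layout described
above; the protocol handshake that must precede it is omitted and the
UFFD_CMD_* names are just local shorthands for the 0x1/0x2 low bits.

===
#include <stdint.h>
#include <unistd.h>

#define UFFD_CMD_REGISTER   0x1ULL
#define UFFD_CMD_UNREGISTER 0x2ULL

/* write a 16 byte [ start|cmd, end ] command into the userfaultfd;
 * start and len must be page aligned so the low two bits carry cmd */
static int uffd_range_cmd(int ufd, uint64_t cmd, uint64_t start, uint64_t len)
{
        uint64_t msg[2] = { start | cmd, start + len };

        return write(ufd, msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}

/* usage (after the protocol handshake):
 *      uffd_range_cmd(ufd, UFFD_CMD_REGISTER, (uint64_t)addr, size);
 *      ...
 *      uffd_range_cmd(ufd, UFFD_CMD_UNREGISTER, (uint64_t)addr, size);
 * write() failing with EBUSY means the range is already registered
 * into a different userfaultfd. */
===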

This way there can be multiple userfaultfd for each process and each
one can register into its own virtual memory ranges.

If a userfaultfd tries to register into a virtual memory range
already registered into a different userfaultfd, -EBUSY will be
returned by the write() syscall.

A userfaultfd can register into allocated ranges that don't have
MADV_USERFAULT set, but if MADV_USERFAULT is not set, no userfault
will fire on those.

Only if MADV_USERFAULT is set on the virtual memory range and a
userfaultfd is registered into the same range will the userfaultfd
protocol engage.

If only MADV_USERFAULT is set and there's no userfaultfd registered on
a memory range, only a SIGBUS will be raised and the page fault will
not engage the userfaultfd protocol.

This also makes handle_userfault() safe against race conditions with
regard to the mmap_sem by requiring FAULT_FLAG_ALLOW_RETRY to be set
the first time a fault is raised by any thread. In turn, to work
reliably, the userfaultfd depends on the gup_locked|unlocked patchset
being applied.

If get_user_pages() is run on virtual memory ranges registered into
the userfaultfd, handle_userfault() will return VM_FAULT_SIGBUS and
gup() will return -EFAULT, because get_user_pages() doesn't allow
handle_userfault() to release the mmap_sem and in turn we cannot
safely engage the userfaultfd protocol. So the remaining
get_user_pages() calls must be restricted to memory ranges that we
know are not tracked through the userfaultfd protocol for the
userfaultfd to be reliable.

The only exception of a get_user_pages() that can safely run into a
userfaultfd and trigger -EFAULT is ptrace. ptrace would otherwise
hang, so it's actually ok if it gets -EFAULT instead of hanging. But
it would also be ok to phase out get_user_pages() completely and have
ptrace hang on the userfault (the hang can be resolved by sending
SIGKILL to gdb or whatever process is calling ptrace). We could also
decide to retain the current -EFAULT behavior of ptrace by using
get_user_pages_locked with a NULL locked parameter, so the
FAULT_FLAG_ALLOW_RETRY flag will not be set. Either way would be safe.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c| 411 +++-
 include/linux/mm.h  |   2 +-
 include/linux/mm_types.h|  11 ++
 include/linux/userfaultfd.h |  19 +-
 mm/madvise.c|   3 +-
 mm/mempolicy.c  |   4 +-
 mm/mlock.c  |   3 +-
 mm/mmap.c   |  39 +++--
 mm/mprotect.c   |   3 +-
 9 files changed, 320 insertions(+), 175 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2667d0d..49bbd3b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct userfaultfd_ctx {
/* pseudo fd refcounting */
@@ -37,6 +38,8 @@ struct userfaultfd_ctx {
unsigned int state;
/* released */
bool released;
+   /* mm with one ore more vmas attached to this userfaultfd_ctx */
+   struct mm_struct *mm;
 };
 
 struct userfaultfd_wait_queue {
@@ -49,6 +52,10 @@ struct userfaultfd_wait_queue {
 #define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
 #define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
 
+#define USERFAULTFD_RANGE_REGISTER ((__u64) 0x1)
+#define USERFAULTFD_RANGE_UNREGISTER ((__u64) 0x2)
+#define USERFAULTFD_RANGE_MASK (~((__u64) 0x3))
+
 enum {
USERFAULTFD_STATE_ASK_PROTOCOL,
USERFAULTFD_STATE_ACK_PROTOCOL,
@@ -56,43 +63,6 @@ enum {
USERFAULTFD_STATE_RUNNING,
 };
 
-/**
- * struct mm_slot - userlandfd information per mm that is being scanned
- * @link: link to the mm_slots hash list
- * @mm: the mm that this information is valid for
- * @ctx: userfaultfd context for this mm
- */
-struct mm_slot {
-   struct hlist_node link;
-   struct mm_struct *mm;
-   struct userfaultfd_ctx ctx;
-   struct rcu_head rcu_head;
-};
-
-#define MM_USERLANDFD_HASH_BITS 10
-static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
-
-static DEFINE_MUTEX(mm_userlandfd_mutex);
-
-static struct mm_slot *get_mm_slot(struct mm_struct *mm)
-{
-   struct mm_slot *slot;
-
-   hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
-  (unsigned long)mm)
-   if (slot->mm == mm)
-   return slot;
-
-   return NULL;
-}
-
-static void insert_to_mm_

[PATCH 12/17] mm: sys_remap_anon_pages

2014-10-03 Thread Andrea Arcangeli
OT_WRITE,
   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (tmp == MAP_FAILED)
perror("mmap"), exit(1);
 #else
ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
if (ret)
perror("posix_memalign"), exit(1);
ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
if (ret)
perror("posix_memalign"), exit(1);
 #endif
/*
 * MADV_USERFAULT must run before memset, to avoid THP 2m
 * faults to map memory into "tmp", if "tmp" isn't allocated
 * with hugepage alignment.
 */
if (madvise((void *)c, SIZE, MADV_USERFAULT))
perror("madvise"), exit(1);
memset((void *)tmp, 0xaa, SIZE);

sa.sa_sigaction = userfault_sighandler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_SIGINFO;
sigaction(SIGBUS, &sa, NULL);

 #ifndef USE_USERFAULT
ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
if (ret != SIZE)
perror("remap_anon_pages"), exit(1);
 #endif

for (i = 0; i < SIZE; i += 4096) {
if ((i/4096) % 2) {
/* exercise read and write MADV_USERFAULT */
c[i+1] = 0xbb;
}
if (c[i] != 0xaa)
    printf("error %x offset %lu\n", c[i], i), exit(1);
}
printf("remap_anon_pages functions correctly\n");

return 0;
 }
===

Signed-off-by: Andrea Arcangeli 
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/huge_mm.h  |   7 +
 include/linux/syscalls.h |   4 +
 kernel/sys_ni.c  |   1 +
 mm/fremap.c  | 477 +++
 mm/huge_memory.c | 110 +
 7 files changed, 601 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b781..2d0594c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,4 @@
 354i386seccomp sys_seccomp
 355i386getrandom   sys_getrandom
 356i386memfd_createsys_memfd_create
+357i386remap_anon_pagessys_remap_anon_pages
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922..41e8f3e 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,7 @@
 318common  getrandom   sys_getrandom
 319common  memfd_createsys_memfd_create
 320common  kexec_file_load sys_kexec_file_load
+321common  remap_anon_pagessys_remap_anon_pages
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3aa10e0..8a85fc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot,
int prot_numa);
+extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+pmd_t *dst_pmd, pmd_t *src_pmd,
+pmd_t dst_pmdval,
+struct vm_area_struct *dst_vma,
+struct vm_area_struct *src_vma,
+unsigned long dst_addr,
+unsigned long src_addr);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85..3d4bb05 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -451,6 +451,10 @@ asmlinkage long sys_mremap(unsigned long addr,
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
+asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
+unsigned long src_start,
+unsigned long len,
+unsigned long flags);
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
 asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
 asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int 
advice);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 391d4dd..2bc7bef 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -178,6 +178,7 @@ cond_syscall(sys_mincore);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_rema

[PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked

2014-10-03 Thread Andrea Arcangeli
We can leverage the VM_FAULT_RETRY functionality in the page fault
paths better by using either get_user_pages_locked or
get_user_pages_unlocked.

The former allow conversion of get_user_pages invocations that will
have to pass a "&locked" parameter to know if the mmap_sem was dropped
during the call. Example from:

down_read(&mm->mmap_sem);
do_something()
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

to:

int locked = 1;
down_read(&mm->mmap_sem);
do_something()
get_user_pages_locked(tsk, mm, ..., pages, &locked);
if (locked)
up_read(&mm->mmap_sem);

The latter is suitable only as a drop in replacement of the form:

down_read(&mm->mmap_sem);
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

into:

get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." parameters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must
be NULL for get_user_pages_locked|unlocked to be usable (the latter
original form wouldn't have been safe anyway if vmas wasn't NULL; for
the former we just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

This patch then applies the new forms in various places, in some cases
also replacing them with get_user_pages_fast whenever tsk and mm are
current and current->mm. get_user_pages_unlocked varies from
get_user_pages_fast only if mm is not current->mm (like when
get_user_pages works on some other process's mm). Whenever tsk and mm
match current and current->mm, get_user_pages_fast must always be used
to increase performance and get the page lockless (only with irqs
disabled).

Signed-off-by: Andrea Arcangeli 
Reviewed-by: Andres Lagar-Cavilla 
Reviewed-by: Peter Feiner 
---
 include/linux/mm.h |   7 +++
 mm/gup.c   | 178 +
 mm/nommu.c |  23 +++
 3 files changed, 197 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f4196a..8900ba9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1196,6 +1196,13 @@ long get_user_pages(struct task_struct *tsk, struct 
mm_struct *mm,
unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages,
struct vm_area_struct **vmas);
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, unsigned long nr_pages,
+   int write, int force, struct page **pages,
+   int *locked);
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, unsigned long nr_pages,
+   int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
 struct kvec;
diff --git a/mm/gup.c b/mm/gup.c
index af7ea3e..6f2f757 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -580,6 +580,166 @@ int fixup_user_fault(struct task_struct *tsk, struct 
mm_struct *mm,
return 0;
 }
 
+static inline long __get_user_pages_locked(struct task_struct *tsk,
+  struct mm_struct *mm,
+  unsigned long start,
+  unsigned long nr_pages,
+  int write, int force,
+  struct page **pages,
+  struct vm_area_struct **vmas,
+  int *locked,
+  bool notify_drop)
+{
+   int flags = FOLL_TOUCH;
+   long ret, pages_done;
+   bool lock_dropped;
+
+   if (locked) {
+   /* if VM_FAULT_RETRY can be returned, vmas become invalid */
+   BUG_ON(vmas);
+   /* check caller initialized locked */
+   BUG_ON(*locked != 1);
+   }
+
+   if (pages)
+   flags |= FOLL_GET;
+   if (write)
+   flags |= FOLL_WRITE;
+   if (force)
+   flags |= FOLL_FORCE;
+
+   pages_done = 0;
+   lock_dropped = false;
+   for (;;) {
+   ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
+  vmas, locked);
+   if (!locked)
+   /* VM_FAULT_RETRY couldn't trigger, bypass */
+   return ret;
+
+   /* VM_FAULT_RETRY cannot return errors */
+   if (!*locked) {
+   BUG_ON(ret < 0);
+   BUG_ON(ret >= nr_pages);
+   }
+
+   if (!pages)
+   

[PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked

2014-10-03 Thread Andrea Arcangeli
Just an optimization.

Signed-off-by: Andrea Arcangeli 
---
 drivers/dma/iovlock.c  | 10 ++
 drivers/iommu/amd_iommu_v2.c   |  6 ++
 drivers/media/pci/ivtv/ivtv-udma.c |  6 ++
 drivers/scsi/st.c  | 10 ++
 drivers/video/fbdev/pvr2fb.c   |  5 +
 mm/process_vm_access.c |  7 ++-
 mm/util.c  | 10 ++
 net/ceph/pagevec.c |  9 -
 8 files changed, 17 insertions(+), 46 deletions(-)

diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index bb48a57..12ea7c3 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec 
*iov, size_t len)
pages += page_list->nr_pages;
 
/* pin pages down */
-   down_read(&current->mm->mmap_sem);
-   ret = get_user_pages(
-   current,
-   current->mm,
+   ret = get_user_pages_fast(
(unsigned long) iov[i].iov_base,
page_list->nr_pages,
1,  /* write */
-   0,  /* force */
-   page_list->pages,
-   NULL);
-   up_read(&current->mm->mmap_sem);
+   page_list->pages);
 
if (ret != page_list->nr_pages)
goto unpin;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5f578e8..6963b73 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)
 
write = !!(fault->flags & PPR_FAULT_WRITE);
 
-   down_read(&fault->state->mm->mmap_sem);
-   npages = get_user_pages(NULL, fault->state->mm,
-   fault->address, 1, write, 0, &page, NULL);
-   up_read(&fault->state->mm->mmap_sem);
+   npages = get_user_pages_unlocked(NULL, fault->state->mm,
+fault->address, 1, write, 0, &page);
 
if (npages == 1) {
put_page(page);
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c 
b/drivers/media/pci/ivtv/ivtv-udma.c
index 7338cb2..96d866b 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long 
ivtv_dest_addr,
}
 
/* Get user pages for DMA Xfer */
-   down_read(&current->mm->mmap_sem);
-   err = get_user_pages(current, current->mm,
-   user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, 
NULL);
-   up_read(&current->mm->mmap_sem);
+   err = get_user_pages_unlocked(current, current->mm,
+   user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
 
if (user_dma.page_count != err) {
IVTV_DEBUG_WARN("failed to map user pages, returned %d instead 
of %d\n",
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index aff9689..c89dcfa 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
return -ENOMEM;
 
 /* Try to fault in all of the necessary pages */
-   down_read(&current->mm->mmap_sem);
 /* rw==READ means read from drive, write into memory area */
-   res = get_user_pages(
-   current,
-   current->mm,
+   res = get_user_pages_fast(
uaddr,
nr_pages,
rw == READ,
-   0, /* don't force */
-   pages,
-   NULL);
-   up_read(&current->mm->mmap_sem);
+   pages);
 
/* Errors and no page mapped should return here */
if (res < nr_pages)
diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index 167cfff..ff81f65 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const 
char *buf,
if (!pages)
return -ENOMEM;
 
-   down_read(&current->mm->mmap_sem);
-   ret = get_user_pages(current, current->mm, (unsigned long)buf,
-nr_pages, WRITE, 0, pages, NULL);
-   up_read(&current->mm->mmap_sem);
+   ret = get_user_pages_fast((unsigned long)buf, nr_pages, WRITE, pages);
 
if (ret < nr_pages) {
nr_pages = ret;
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 5077afc..b159769 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
size_t bytes;
 
/* Get the pages we're interested in 

[PATCH 10/17] mm: rmap preparation for remap_anon_pages

2014-10-03 Thread Andrea Arcangeli
remap_anon_pages (unlike remap_file_pages) tries to be non-intrusive
in the rmap code.

As far as the rmap code is concerned, remap_anon_pages only alters the
page->mapping and page->index. It does so while holding the page
lock. However there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places that do rmap
walks without taking the page lock first must be updated to re-check
that the page->mapping didn't change after they obtained the anon_vma
lock. remap_anon_pages takes the anon_vma lock for writing before
altering the page->mapping, so if the page->mapping is still the same
after obtaining the anon_vma lock (without the page lock), the rmap
walks can go ahead safely (and remap_anon_pages will wait for them to
complete before proceeding).

remap_anon_pages serializes against itself with the page lock.

All other places taking the anon_vma lock while holding the mmap_sem
for writing, don't need to check if the page->mapping has changed
after taking the anon_vma lock, regardless of the page lock, because
remap_anon_pages holds the mmap_sem for reading.

Overall this looks a fairly small change to the rmap code, notably
less intrusive than the nonlinear vmas created by remap_file_pages.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_anon_pages must be mapped in only one
vma, but this is not a limitation when it is used to handle userland
page faults with MADV_USERFAULT. The source addresses passed to
remap_anon_pages should be marked VM_DONTCOPY with MADV_DONTFORK, to
avoid any risk of the mapcount of the pages increasing if fork runs in
parallel in another thread before or while remap_anon_pages runs.

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 24 
 mm/rmap.c|  9 +
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b402d60..4277ed7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1921,6 +1921,7 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 {
struct anon_vma *anon_vma;
int ret = 1;
+   struct address_space *mapping;
 
BUG_ON(is_huge_zero_page(page));
BUG_ON(!PageAnon(page));
@@ -1932,10 +1933,24 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 * page_lock_anon_vma_read except the write lock is taken to serialise
 * against parallel split or collapse operations.
 */
-   anon_vma = page_get_anon_vma(page);
-   if (!anon_vma)
-   goto out;
-   anon_vma_lock_write(anon_vma);
+   for (;;) {
+   mapping = ACCESS_ONCE(page->mapping);
+   anon_vma = page_get_anon_vma(page);
+   if (!anon_vma)
+   goto out;
+   anon_vma_lock_write(anon_vma);
+   /*
+* We don't hold the page lock here so
+* remap_anon_pages_huge_pmd can change the anon_vma
+* from under us until we obtain the anon_vma
+* lock. Verify that we obtained the anon_vma lock
+* before remap_anon_pages did.
+*/
+   if (likely(mapping == ACCESS_ONCE(page->mapping)))
+   break;
+   anon_vma_unlock_write(anon_vma);
+   put_anon_vma(anon_vma);
+   }
 
ret = 0;
if (!PageCompound(page))
@@ -2460,6 +2475,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 * Prevent all access to pagetables with the exception of
 * gup_fast later hanlded by the ptep_clear_flush and the VM
 * handled by the anon_vma lock + PG_lock.
+* remap_anon_pages is prevented to race as well by the mmap_sem.
 */
down_write(&mm->mmap_sem);
if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 3e8491c..6d875eb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -450,6 +450,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
struct anon_vma *root_anon_vma;
unsigned long anon_mapping;
 
+repeat:
rcu_read_lock();
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -488,6 +489,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
rcu_read_unlock();
anon_vma_lock_read(anon_vma);
 
+   /* check if remap_anon_pages changed the anon_vma */
+   if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != 
anon_mapping)) {
+   anon_vma_unlock_read(anon_vma);
+   put_anon_vma(anon_vma);
+   anon_vma = NULL;
+   goto repeat;
+   }
+
if (atomic_dec_and_test(&anon_vma->refcount)) {
  

[PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits

2014-10-03 Thread Andrea Arcangeli
We are running out of 32 bits in vm_flags; this is a noop change for 64bit archs.

Signed-off-by: Andrea Arcangeli 
---
 fs/proc/task_mmu.c   | 4 ++--
 include/linux/huge_mm.h  | 4 ++--
 include/linux/ksm.h  | 4 ++--
 include/linux/mm_types.h | 2 +-
 mm/huge_memory.c | 2 +-
 mm/ksm.c | 2 +-
 mm/madvise.c | 2 +-
 mm/mremap.c  | 2 +-
 8 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c341568..ee1c3a2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
/*
 * Don't forget to update Documentation/ on changes.
 */
-   static const char mnemonics[BITS_PER_LONG][2] = {
+   static const char mnemonics[BITS_PER_LONG+1][2] = {
/*
 * In case if we meet a flag we don't know about.
 */
-   [0 ... (BITS_PER_LONG-1)] = "??",
+   [0 ... (BITS_PER_LONG)] = "??",
 
[ilog2(VM_READ)]= "rd",
[ilog2(VM_WRITE)]   = "wr",
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb..3aa10e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -121,7 +121,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, 
unsigned long address,
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
 extern int hugepage_madvise(struct vm_area_struct *vma,
-   unsigned long *vm_flags, int advice);
+   vm_flags_t *vm_flags, int advice);
 extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
@@ -183,7 +183,7 @@ static inline int split_huge_page(struct page *page)
 #define split_huge_page_pmd_mm(__mm, __address, __pmd) \
do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
-  unsigned long *vm_flags, int advice)
+  vm_flags_t *vm_flags, int advice)
 {
BUG();
return 0;
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 3be6bb1..8b35253 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -18,7 +18,7 @@ struct mem_cgroup;
 
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end, int advice, unsigned long *vm_flags);
+   unsigned long end, int advice, vm_flags_t *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
 
@@ -94,7 +94,7 @@ static inline int PageKsm(struct page *page)
 
 #ifdef CONFIG_MMU
 static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end, int advice, unsigned long *vm_flags)
+   unsigned long end, int advice, vm_flags_t *vm_flags)
 {
return 0;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286..2c876d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -217,7 +217,7 @@ struct page_frag {
 #endif
 };
 
-typedef unsigned long __nocast vm_flags_t;
+typedef unsigned long long __nocast vm_flags_t;
 
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..e913a19 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1942,7 +1942,7 @@ out:
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-unsigned long *vm_flags, int advice)
+vm_flags_t *vm_flags, int advice)
 {
switch (advice) {
case MADV_HUGEPAGE:
diff --git a/mm/ksm.c b/mm/ksm.c
index fb75902..faf319e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1736,7 +1736,7 @@ static int ksm_scan_thread(void *nothing)
 }
 
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end, int advice, unsigned long *vm_flags)
+   unsigned long end, int advice, vm_flags_t *vm_flags)
 {
struct mm_struct *mm = vma->vm_mm;
int err;
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30..d5aee71 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -49,7 +49,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
int error = 0;
pgoff_t pgoff;
-   unsigned long new_flags = vma->vm_flags;
+   vm_flags_t new_flags = vma->vm_flags;
 
switch (behavior) {
case MADV_NORMAL:
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..fa7db87 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -239,7 +239,7 @@ static unsigned long move_vma(struct vm_a

[PATCH 09/17] mm: PT lock: export double_pt_lock/unlock

2014-10-03 Thread Andrea Arcangeli
Those two helpers are needed by remap_anon_pages.
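
For context, a rough sketch of how remap_anon_pages would use these
helpers (a fragment only; the mm, dst_pmd and src_pmd variables are
assumed to be set up by the caller):

	spinlock_t *dst_ptl = pte_lockptr(mm, dst_pmd);
	spinlock_t *src_ptl = pte_lockptr(mm, src_pmd);

	double_pt_lock(dst_ptl, src_ptl);
	/* ... move the pte and the anon page from src to dst ... */
	double_pt_unlock(dst_ptl, src_ptl);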

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h |  4 
 mm/fremap.c| 29 +
 2 files changed, 33 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf3df07..71dbe03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1408,6 +1408,10 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, 
pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
 
+/* mm/fremap.c */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
diff --git a/mm/fremap.c b/mm/fremap.c
index 72b8fa3..1e509f7 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -281,3 +281,32 @@ out_freed:
 
return err;
 }
+
+void double_pt_lock(spinlock_t *ptl1,
+   spinlock_t *ptl2)
+   __acquires(ptl1)
+   __acquires(ptl2)
+{
+   spinlock_t *ptl_tmp;
+
+   if (ptl1 > ptl2) {
+   /* exchange ptl1 and ptl2 */
+   ptl_tmp = ptl1;
+   ptl1 = ptl2;
+   ptl2 = ptl_tmp;
+   }
+   /* lock in virtual address order to avoid lock inversion */
+   spin_lock(ptl1);
+   if (ptl1 != ptl2)
+   spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+ spinlock_t *ptl2)
+   __releases(ptl1)
+   __releases(ptl2)
+{
+   spin_unlock(ptl1);
+   if (ptl1 != ptl2)
+   spin_unlock(ptl2);
+}
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/17] RFC: userfault v2

2014-10-03 Thread Andrea Arcangeli
Or it could be used for other
similar things with tmpfs in the future. I've been discussing how to
extend it to tmpfs for example. Currently if MADV_USERFAULT is set on
a non-anonymous vma, it will return -EINVAL and that's enough to
provide backwards compatibility once MADV_USERFAULT is extended
to tmpfs. An orthogonal problem then will be to identify the optimal
mechanism to atomically resolve a tmpfs backed userfault (like
remap_anon_pages does it optimally for anonymous memory) but that's
beyond the scope of the userfault functionality (in theory
remap_anon_pages is also orthogonal and I could split it off in a
separate patchset if somebody prefers). Of course remap_file_pages
should do it fine too, but it would create rmap nonlinearity which
isn't optimal.

The code can be found here:

git clone --reference linux 
git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 

The branch is rebased so you can get updates for example with:

git fetch && git checkout -f origin/userfault

Comments welcome, thanks!
Andrea

Andrea Arcangeli (15):
  mm: gup: add get_user_pages_locked and get_user_pages_unlocked
  mm: gup: use get_user_pages_unlocked within get_user_pages_fast
  mm: gup: make get_user_pages_fast and __get_user_pages_fast latency
conscious
  mm: gup: use get_user_pages_fast and get_user_pages_unlocked
  mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
  mm: madvise MADV_USERFAULT
  mm: PT lock: export double_pt_lock/unlock
  mm: rmap preparation for remap_anon_pages
  mm: swp_entry_swapcount
  mm: sys_remap_anon_pages
  waitqueue: add nr wake parameter to __wake_up_locked_key
  userfaultfd: add new syscall to provide memory externalization
  userfaultfd: make userfaultfd_write non blocking
  powerpc: add remap_anon_pages and userfaultfd
  userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER

Andres Lagar-Cavilla (2):
  mm: gup: add FOLL_TRIED
  kvm: Faults which trigger IO release the mmap_sem

 arch/alpha/include/uapi/asm/mman.h |   3 +
 arch/mips/include/uapi/asm/mman.h  |   3 +
 arch/mips/mm/gup.c |   8 +-
 arch/parisc/include/uapi/asm/mman.h|   3 +
 arch/powerpc/include/asm/systbl.h  |   2 +
 arch/powerpc/include/asm/unistd.h  |   2 +-
 arch/powerpc/include/uapi/asm/unistd.h |   2 +
 arch/powerpc/mm/gup.c  |   6 +-
 arch/s390/kvm/kvm-s390.c   |   4 +-
 arch/s390/mm/gup.c |   6 +-
 arch/sh/mm/gup.c   |   6 +-
 arch/sparc/mm/gup.c|   6 +-
 arch/x86/mm/gup.c  | 235 +++
 arch/x86/syscalls/syscall_32.tbl   |   2 +
 arch/x86/syscalls/syscall_64.tbl   |   2 +
 arch/xtensa/include/uapi/asm/mman.h|   3 +
 drivers/dma/iovlock.c  |  10 +-
 drivers/iommu/amd_iommu_v2.c   |   6 +-
 drivers/media/pci/ivtv/ivtv-udma.c |   6 +-
 drivers/scsi/st.c  |  10 +-
 drivers/video/fbdev/pvr2fb.c   |   5 +-
 fs/Makefile|   1 +
 fs/proc/task_mmu.c |   5 +-
 fs/userfaultfd.c   | 722 +
 include/linux/huge_mm.h|  11 +-
 include/linux/ksm.h|   4 +-
 include/linux/mm.h |  15 +-
 include/linux/mm_types.h   |  13 +-
 include/linux/swap.h   |   6 +
 include/linux/syscalls.h   |   5 +
 include/linux/userfaultfd.h|  55 +++
 include/linux/wait.h   |   5 +-
 include/uapi/asm-generic/mman-common.h |   3 +
 init/Kconfig   |  11 +
 kernel/sched/wait.c|   7 +-
 kernel/sys_ni.c|   2 +
 mm/fremap.c| 506 +++
 mm/gup.c   | 182 -
 mm/huge_memory.c   | 208 --
 mm/ksm.c   |   2 +-
 mm/madvise.c   |  22 +-
 mm/memory.c|  14 +
 mm/mempolicy.c |   4 +-
 mm/mlock.c |   3 +-
 mm/mmap.c  |  39 +-
 mm/mprotect.c  |   3 +-
 mm/mremap.c|   2 +-
 mm/nommu.c |  23 ++
 mm/process_vm_access.c |   7 +-
 mm/rmap.c  |   9 +
 mm/swapfile.c  |  13 +
 mm/util.c  |  10 +-
 net/ceph/pagevec.c |   9 +-
 net/sunrpc/sched.c |   2 +-
 virt/kvm/async_pf.c|   4 +-
 virt/kvm/kvm_main.c|   4 +-
 56 files changed, 2025 insertions(+), 236 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h
--
To unsubscribe from this list: send t

[PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem

2014-10-03 Thread Andrea Arcangeli
From: Andres Lagar-Cavilla 

When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.

If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.

This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.

Reviewed-by: Radim Krčmář 
Signed-off-by: Andres Lagar-Cavilla 
Signed-off-by: Andrea Arcangeli 
---
 virt/kvm/async_pf.c | 4 +---
 virt/kvm/kvm_main.c | 4 ++--
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..44660ae 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
 
might_sleep();
 
-   down_read(&mm->mmap_sem);
-   get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-   up_read(&mm->mmap_sem);
+   get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
kvm_async_page_present_sync(vcpu, apf);
 
spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 95519bc..921bce7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool 
*async, bool write_fault,
  addr, write_fault, page);
up_read(&current->mm->mmap_sem);
} else
-   npages = get_user_pages_fast(addr, 1, write_fault,
-page);
+   npages = get_user_pages_unlocked(current, current->mm, addr, 1,
+write_fault, 0, page);
if (npages != 1)
return npages;
 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious

2014-10-03 Thread Andrea Arcangeli
This teaches gup_fast and __gup_fast to re-enable irqs and
cond_resched() if possible every BATCH_PAGES pages.

This must be implemented by the other archs as well, and it's a
requirement before converting more get_user_pages() callers to
get_user_pages_fast() as an optimization (instead of using
get_user_pages_unlocked, which would be slower).

Signed-off-by: Andrea Arcangeli 
---
 arch/x86/mm/gup.c | 234 ++
 1 file changed, 149 insertions(+), 85 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 2ab183b..917d8c1 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -12,6 +12,12 @@
 
 #include 
 
+/*
+ * Keep irq disabled for no more than BATCH_PAGES pages.
+ * Matches PTRS_PER_PTE (or half in non-PAE kernels).
+ */
+#define BATCH_PAGES512
+
 static inline pte_t gup_get_pte(pte_t *ptep)
 {
 #ifndef CONFIG_X86_PAE
@@ -250,6 +256,40 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
return 1;
 }
 
+static inline int __get_user_pages_fast_batch(unsigned long start,
+ unsigned long end,
+ int write, struct page **pages)
+{
+   struct mm_struct *mm = current->mm;
+   unsigned long next;
+   unsigned long flags;
+   pgd_t *pgdp;
+   int nr = 0;
+
+   /*
+* This doesn't prevent pagetable teardown, but does prevent
+* the pagetables and pages from being freed on x86.
+*
+* So long as we atomically load page table pointers versus teardown
+* (which we do on x86, with the above PAE exception), we can follow the
+* address down to the the page and take a ref on it.
+*/
+   local_irq_save(flags);
+   pgdp = pgd_offset(mm, start);
+   do {
+   pgd_t pgd = *pgdp;
+
+   next = pgd_addr_end(start, end);
+   if (pgd_none(pgd))
+   break;
+   if (!gup_pud_range(pgd, start, next, write, pages, &nr))
+   break;
+   } while (pgdp++, start = next, start != end);
+   local_irq_restore(flags);
+
+   return nr;
+}
+
 /*
  * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
  * back to the regular GUP.
@@ -257,31 +297,55 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages)
 {
-   struct mm_struct *mm = current->mm;
-   unsigned long addr, len, end;
-   unsigned long next;
-   unsigned long flags;
-   pgd_t *pgdp;
-   int nr = 0;
+   unsigned long len, end, batch_pages;
+   int nr, ret;
 
start &= PAGE_MASK;
-   addr = start;
len = (unsigned long) nr_pages << PAGE_SHIFT;
end = start + len;
+   /*
+* get_user_pages() handles nr_pages == 0 gracefully, but
+* gup_fast starts walking the first pagetable in a do {}
+* while() fashion so it's not robust to handle nr_pages ==
+* 0. There's no point in being permissive about end < start
+* either. So this check verifies both nr_pages being non
+* zero, and that "end" didn't overflow.
+*/
+   VM_BUG_ON(end <= start);
if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
(void __user *)start, len)))
return 0;
 
-   /*
-* XXX: batch / limit 'nr', to avoid large irq off latency
-* needs some instrumenting to determine the common sizes used by
-* important workloads (eg. DB2), and whether limiting the batch size
-* will decrease performance.
-*
-* It seems like we're in the clear for the moment. Direct-IO is
-* the main guy that batches up lots of get_user_pages, and even
-* they are limited to 64-at-a-time which is not so many.
-*/
+   ret = 0;
+   for (;;) {
+   batch_pages = nr_pages;
+   if (batch_pages > BATCH_PAGES && !irqs_disabled())
+   batch_pages = BATCH_PAGES;
+   len = (unsigned long) batch_pages << PAGE_SHIFT;
+   end = start + len;
+   nr = __get_user_pages_fast_batch(start, end, write, pages);
+   VM_BUG_ON(nr > batch_pages);
+   nr_pages -= nr;
+   ret += nr;
+   if (!nr_pages || nr != batch_pages)
+   break;
+   start += len;
+   pages += batch_pages;
+   }
+
+   return ret;
+}
+
+static inline int get_user_pages_fast_batch(unsigned long start,
+   unsigned long end,
+   int write, struct page **pages)
+{
+   st

[PATCH 14/17] userfaultfd: add new syscall to provide memory externalization

2014-10-03 Thread Andrea Arcangeli
Once a userfaultfd is created, MADV_USERFAULT regions talk through
the userfaultfd protocol with the thread responsible for doing the
memory externalization of the process.

The protocol starts with userland writing the requested/preferred
USERFAULT_PROTOCOL version into the userfault fd (64bit write). If the
kernel knows that version, it acks it by letting userland read a 64bit
value from the userfault fd containing the same USERFAULT_PROTOCOL
version that userland asked for. Otherwise userland will read the
__u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it will have
to try again, writing an older protocol version that is still suitable
for its usage and reading the ack back, until it stops reading
-1ULL. After that the userfaultfd protocol starts.

In the protocol, reads from the userfault fd are 64bit in size and
provide userland the fault addresses. After a userfault address has
been read and the fault has been resolved by userland, the application
must write back 128 bits in the form of a [ start, end ] range (64bit
each) that tells the kernel such a range has been mapped. Multiple
read userfaults can be resolved in a single range write. poll() can be
used to know when there are new userfaults to read (POLLIN) and when
there are threads waiting for a wakeup through a range write (POLLOUT).
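
For illustration, a rough userland sketch of the handshake and of the
fault-read/range-write loop described above (error handling omitted;
the zero flags argument to the syscall and the use of remap_anon_pages
to resolve the fault are assumptions of this sketch):

long page_size = sysconf(_SC_PAGESIZE);
int ufd = syscall(__NR_userfaultfd, 0);
__u64 proto = USERFAULTFD_PROTOCOL, ack, addr, range[2];

/* handshake: propose a protocol version, read back the ack */
write(ufd, &proto, sizeof(proto));
read(ufd, &ack, sizeof(ack));
if (ack == USERFAULTFD_UNKNOWN_PROTOCOL) {
	/* would retry here with an older protocol version (not shown) */
}

for (;;) {
	/* blocks until a thread faults in a MADV_USERFAULT region */
	read(ufd, &addr, sizeof(addr));
	/* ... resolve the fault, e.g. with remap_anon_pages ... */
	range[0] = addr & ~(page_size - 1);
	range[1] = range[0] + page_size;
	/* tell the kernel the range is mapped, waking the faulting threads */
	write(ufd, range, sizeof(range));
}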

Signed-off-by: Andrea Arcangeli 
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/Makefile  |   1 +
 fs/userfaultfd.c | 643 +++
 include/linux/syscalls.h |   1 +
 include/linux/userfaultfd.h  |  42 +++
 init/Kconfig |  11 +
 kernel/sys_ni.c  |   1 +
 mm/huge_memory.c |  24 +-
 mm/memory.c  |   5 +-
 10 files changed, 720 insertions(+), 10 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 2d0594c..782038c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -364,3 +364,4 @@
355     i386    getrandom               sys_getrandom
356     i386    memfd_create            sys_memfd_create
357     i386    remap_anon_pages        sys_remap_anon_pages
+358    i386    userfaultfd             sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 41e8f3e..3d5601f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -328,6 +328,7 @@
319     common  memfd_create            sys_memfd_create
320     common  kexec_file_load         sys_kexec_file_load
321     common  remap_anon_pages        sys_remap_anon_pages
+322    common  userfaultfd             sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 90c8852..00dfe77 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES) += anon_inodes.o
 obj-$(CONFIG_SIGNALFD) += signalfd.o
 obj-$(CONFIG_TIMERFD)  += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
+obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 000..62b827e
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,643 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct userfaultfd_ctx {
+   /* pseudo fd refcounting */
+   atomic_t refcount;
+   /* waitqueue head for the userfaultfd page faults */
+   wait_queue_head_t fault_wqh;
+   /* waitqueue head for the pseudo fd to wakeup poll/read */
+   wait_queue_head_t fd_wqh;
+   /* userfaultfd syscall flags */
+   unsigned int flags;
+   /* state machine */
+   unsigned int state;
+   /* released */
+   bool released;
+};
+
+struct userfaultfd_wait_queue {
+   unsigned long address;
+   wait_queue_t wq;
+   bool pending;
+   struct userfaultfd_ctx *ctx;
+};
+
+#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
+#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
+
+enum {
+   USERFAULTFD_STATE_ASK_PROTOCOL,
+   USERFAULTFD_STATE_ACK_PROTOCOL,
+   USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL,
+   USERFAULTFD_STATE_RUNNING,
+};
+
+/**
+ * struct mm_slot - userlandfd information per

[PATCH 01/17] mm: gup: add FOLL_TRIED

2014-10-03 Thread Andrea Arcangeli
From: Andres Lagar-Cavilla 

Reviewed-by: Radim Krčmář 
Signed-off-by: Andres Lagar-Cavilla 
Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h | 1 +
 mm/gup.c   | 4 
 2 files changed, 5 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0f4196a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1985,6 +1985,7 @@ static inline struct page *follow_page(struct 
vm_area_struct *vma,
 #define FOLL_HWPOISON  0x100   /* check page is hwpoisoned */
 #define FOLL_NUMA  0x200   /* force NUMA hinting page fault */
 #define FOLL_MIGRATION 0x400   /* wait for page to replace migration entry */
+#define FOLL_TRIED 0x800   /* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct 
vm_area_struct *vma,
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+   if (*flags & FOLL_TRIED) {
+   VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+   fault_flags |= FAULT_FLAG_TRIED;
+   }
 
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 15/17] userfaultfd: make userfaultfd_write non blocking

2014-10-03 Thread Andrea Arcangeli
It is generally inefficient to ask for the wakeup of userfault ranges
where not a single userfault address was read through
userfaultfd_read earlier and is in turn waiting for a wakeup. However
it may come in handy to wake the same userfault range twice when
multiple threads fault on the same address. But we should still
return an error, so that if the application thinks this occurrence can
never happen it will know it hit a bug. So just return -ENOENT instead
of blocking.
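
For illustration, this is how a monitor thread might consume the new
return value (a userland sketch, not part of the patch):

__u64 range[2] = { start, end };

if (write(ufd, range, sizeof(range)) < 0 && errno == ENOENT) {
	/*
	 * Nobody was waiting on this range: either another thread
	 * already woke it, or the monitor is waking a range that
	 * never faulted and has therefore hit a bug.
	 */
}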

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 34 +-
 1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 62b827e..2667d0d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -458,9 +458,7 @@ static ssize_t userfaultfd_write(struct file *file, const 
char __user *buf,
 size_t count, loff_t *ppos)
 {
struct userfaultfd_ctx *ctx = file->private_data;
-   ssize_t res;
__u64 range[2];
-   DECLARE_WAITQUEUE(wait, current);
 
if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
__u64 protocol;
@@ -488,34 +486,12 @@ static ssize_t userfaultfd_write(struct file *file, const 
char __user *buf,
if (range[0] >= range[1])
return -ERANGE;
 
-   spin_lock(&ctx->fd_wqh.lock);
-   __add_wait_queue(&ctx->fd_wqh, &wait);
-   for (;;) {
-   set_current_state(TASK_INTERRUPTIBLE);
-   /* always take the fd_wqh lock before the fault_wqh lock */
-   if (find_userfault(ctx, NULL, POLLOUT)) {
-   if (!wake_userfault(ctx, range)) {
-   res = sizeof(range);
-   break;
-   }
-   }
-   if (signal_pending(current)) {
-   res = -ERESTARTSYS;
-   break;
-   }
-   if (file->f_flags & O_NONBLOCK) {
-   res = -EAGAIN;
-   break;
-   }
-   spin_unlock(&ctx->fd_wqh.lock);
-   schedule();
-   spin_lock(&ctx->fd_wqh.lock);
-   }
-   __remove_wait_queue(&ctx->fd_wqh, &wait);
-   __set_current_state(TASK_RUNNING);
-   spin_unlock(&ctx->fd_wqh.lock);
+   /* always take the fd_wqh lock before the fault_wqh lock */
+   if (find_userfault(ctx, NULL, POLLOUT))
+   if (!wake_userfault(ctx, range))
+   return sizeof(range);
 
-   return res;
+   return -ENOENT;
 }
 
 #ifdef CONFIG_PROC_FS
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast

2014-10-03 Thread Andrea Arcangeli
Signed-off-by: Andrea Arcangeli 
---
 arch/mips/mm/gup.c   | 8 +++-
 arch/powerpc/mm/gup.c| 6 ++
 arch/s390/kvm/kvm-s390.c | 4 +---
 arch/s390/mm/gup.c   | 6 ++
 arch/sh/mm/gup.c | 6 ++
 arch/sparc/mm/gup.c  | 6 ++
 arch/x86/mm/gup.c| 7 +++
 7 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c..20884f5 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -301,11 +301,9 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;
 
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-   (end - start) >> PAGE_SHIFT,
-   write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT,
+ write, 0, pages);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..b70c34a 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
start += nr << PAGE_SHIFT;
pages += nr;
 
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-nr_pages - nr, write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+ nr_pages - nr, write, 0, pages);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..37ca29a 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t 
gpa, int writable)
hva = gmap_fault(gpa, vcpu->arch.gmap);
if (IS_ERR_VALUE(hva))
return (long)hva;
-   down_read(&mm->mmap_sem);
-   rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
-   up_read(&mm->mmap_sem);
+   rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);
 
return rc < 0 ? rc : 0;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce46..5c586c7 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;
pages += nr;
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-nr_pages - nr, write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+nr_pages - nr, write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0)
ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index 37458f3..e15f52a 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -257,10 +257,8 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;
 
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-   (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+   (end - start) >> PAGE_SHIFT, write, 0, pages);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed043..fa7de7d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -219,10 +219,8 @@ slow:
start += nr << PAGE_SHIFT;
pages += nr;
 
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-   (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+   (end - start) >> PAGE_SHIFT, write, 0, pages);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..2ab183b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -388,10 +388,9 @@ slow_irqon:
start += nr <<

[PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd

2014-10-03 Thread Andrea Arcangeli
Add the syscall numbers.

Signed-off-by: Andrea Arcangeli 
---
 arch/powerpc/include/asm/systbl.h  | 2 ++
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 7d8a600..ef03a80 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -365,3 +365,5 @@ SYSCALL_SPU(renameat2)
 SYSCALL_SPU(seccomp)
 SYSCALL_SPU(getrandom)
 SYSCALL_SPU(memfd_create)
+SYSCALL_SPU(remap_anon_pages)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 4e9af3f..36b79c3 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define __NR_syscalls  361
+#define __NR_syscalls  363
 
 #define __NR__exit __NR_exit
 #define NR_syscalls__NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 0688fc0..5514c57 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -383,5 +383,7 @@
 #define __NR_seccomp   358
 #define __NR_getrandom 359
 #define __NR_memfd_create  360
+#define __NR_remap_anon_pages  361
+#define __NR_userfaultfd   362
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key

2014-10-03 Thread Andrea Arcangeli
Userfaultfd needs to wake all waiters (pass 0 as the nr parameter),
instead of the current hardcoded 1 (which would wake just the first
waiter in the head list).
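
As a sketch of the intended use, this is roughly how the userfaultfd
wakeup path would call it with the new parameter (simplified; the ctx
and range variables come from the userfaultfd patch later in this
series, and the locking context is trimmed down here):

	spin_lock(&ctx->fault_wqh.lock);
	/* nr == 0: wake every waiter whose key matches, not just the first */
	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
	spin_unlock(&ctx->fault_wqh.lock);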

Signed-off-by: Andrea Arcangeli 
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 6fb1ba5..f8271cb 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -144,7 +144,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t 
*old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void 
*key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -175,7 +176,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)  \
-   __wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+   __wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)   \
__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)  \
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 15cab1a..d848738 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -105,9 +105,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int 
mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key)
 {
-   __wake_up_common(q, mode, 1, 0, key);
+   __wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -282,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, 
wait_queue_t *wait,
if (!list_empty(&wait->task_list))
list_del_init(&wait->task_list);
else if (waitqueue_active(q))
-   __wake_up_locked_key(q, mode, key);
+   __wake_up_locked_key(q, mode, 1, key);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79..39b7496 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
ret = atomic_dec_and_test(&task->tk_count);
if (waitqueue_active(wq))
-   __wake_up_locked_key(wq, TASK_NORMAL, &k);
+   __wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
spin_unlock_irqrestore(&wq->lock, flags);
return ret;
 }
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/17] mm: swp_entry_swapcount

2014-10-03 Thread Andrea Arcangeli
Provide a new swapfile method for remap_anon_pages to verify that a
swap entry is mapped in only one vma before relocating the swap entry
to a different virtual address. Otherwise, if the swap entry is mapped
in multiple vmas, when the page is swapped back in it could get mapped
in a non-linear way in some anon_vma.
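
For illustration, the kind of check remap_anon_pages could do with the
new helper (a sketch; the surrounding locals and the error label are
invented here):

	if (is_swap_pte(*src_pte)) {
		swp_entry_t entry = pte_to_swp_entry(*src_pte);

		/* refuse to move a swap entry shared by more than one mapping */
		if (!non_swap_entry(entry) && swp_entry_swapcount(entry) > 1) {
			err = -EBUSY;
			goto out;
		}
	}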

Signed-off-by: Andrea Arcangeli 
---
 include/linux/swap.h |  6 ++
 mm/swapfile.c| 13 +
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8197452..af9977c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -458,6 +458,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -559,6 +560,11 @@ static inline int page_swapcount(struct page *page)
return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+   return 0;
+}
+
 #define reuse_swap_page(page)  (page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e..4cc9af6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -874,6 +874,19 @@ int page_swapcount(struct page *page)
return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+   int count = 0;
+   struct swap_info_struct *p;
+
+   p = swap_info_get(entry);
+   if (p) {
+   count = swap_count(p->swap_map[swp_offset(entry)]);
+   spin_unlock(&p->lock);
+   }
+   return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY

2014-10-02 Thread Andrea Arcangeli
On Thu, Oct 02, 2014 at 02:56:38PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 02, 2014 at 02:50:52PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 02, 2014 at 02:31:17PM +0200, Andrea Arcangeli wrote:
> > > On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
> > > > For all these and the other _fast() users, is there an actual limit to
> > > > the nr_pages passed in? Because we used to have the 64 pages limit from
> > > > DIO, but without that we get rather long IRQ-off latencies.
> > > 
> > > Ok, I would tend to think this is an issue to solve in gup_fast
> > > implementation, I wouldn't blame or modify the callers for it.
> > > 
> > > I don't think there's anything that prevents gup_fast to enable irqs
> > > after certain number of pages have been taken, nop; and disable the
> > > irqs again.
> > > 
> > 
> > Agreed, I once upon a time had a patch set converting the 2 (x86 and
> > powerpc) gup_fast implementations at the time, but somehow that never
> > got anywhere.
> > 
> > Just saying we should probably do that before we add callers with
> > unlimited nr_pages.
> 
> https://lkml.org/lkml/2009/6/24/457
> 
> Clearly there's more work these days. Many more archs grew a gup.c

What about this? The alternative is that I do s/gup_fast/gup_unlocked/
to still retain the mmap_sem scalability benefit. It'd still be better
than the current plain gup() (and it would be equivalent from the
userfaultfd point of view).

Or if the below is ok, should I modify all the other archs too, or are
the respective maintainers going to fix it themselves? For example the
arm* gup_fast is a moving target in development on linux-mm right now
and I should only patch the gup_rcu version that didn't hit upstream
yet. In fact after that gup_rcu merge, supposedly the powerpc and sparc
gup_fast can be dropped from arch/* entirely and they can use the
generic version (otherwise having the arm gup_fast in mm/ instead of
arch/ would be a mistake). Right now, I wouldn't touch at least
arm/sparc/powerpc until the gup_rcu work hits upstream, as those are
all about to disappear.

Thanks,
Andrea

From 2f6079396a59e64a380ff06e6107276dfa67b3ba Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Thu, 2 Oct 2014 16:58:00 +0200
Subject: [PATCH] mm: gup: make get_user_pages_fast and __get_user_pages_fast
 latency conscious

This teaches gup_fast and __gup_fast to re-enable irqs and
cond_resched() if possible every BATCH_PAGES pages.

Signed-off-by: Andrea Arcangeli 
---
 arch/x86/mm/gup.c | 239 +++---
 1 file changed, 154 insertions(+), 85 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 2ab183b..e05d7b0 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -12,6 +12,12 @@
 
 #include 
 
+/*
+ * Keep irq disabled for no more than BATCH_PAGES pages.
+ * Matches PTRS_PER_PTE (or half in non-PAE kernels).
+ */
+#define BATCH_PAGES512
+
 static inline pte_t gup_get_pte(pte_t *ptep)
 {
 #ifndef CONFIG_X86_PAE
@@ -250,6 +256,40 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
return 1;
 }
 
+static inline int __get_user_pages_fast_batch(unsigned long start,
+ unsigned long end,
+ int write, struct page **pages)
+{
+   struct mm_struct *mm = current->mm;
+   unsigned long next;
+   unsigned long flags;
+   pgd_t *pgdp;
+   int nr = 0;
+
+   /*
+* This doesn't prevent pagetable teardown, but does prevent
+* the pagetables and pages from being freed on x86.
+*
+* So long as we atomically load page table pointers versus teardown
+* (which we do on x86, with the above PAE exception), we can follow the
+* address down to the the page and take a ref on it.
+*/
+   local_irq_save(flags);
+   pgdp = pgd_offset(mm, start);
+   do {
+   pgd_t pgd = *pgdp;
+
+   next = pgd_addr_end(start, end);
+   if (pgd_none(pgd))
+   break;
+   if (!gup_pud_range(pgd, start, next, write, pages, &nr))
+   break;
+   } while (pgdp++, start = next, start != end);
+   local_irq_restore(flags);
+
+   return nr;
+}
+
 /*
  * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
  * back to the regular GUP.
@@ -257,31 +297,57 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages)
 {
-   struct mm_struct *mm = current->mm;
-   unsigned long addr, len, end;
-   unsigned long next;
-   unsigned long f

Re: [PATCH 2/4] mm: gup: add get_user_pages_locked and get_user_pages_unlocked

2014-10-02 Thread Andrea Arcangeli
On Wed, Oct 01, 2014 at 10:06:27AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Oct 1, 2014 at 8:51 AM, Peter Feiner  wrote:
> > On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> >> + /* VM_FAULT_RETRY cannot return errors */
> >> + if (!*locked) {
> >> + BUG_ON(ret < 0);
> >> + BUG_ON(nr_pages == 1 && ret);
> >
> > If I understand correctly, this second BUG_ON is asserting that when
> > __get_user_pages is asked for a single page and it is successfully gets the
> > page, then it shouldn't have dropped the mmap_sem. If that's the case, then
> > you could generalize this assertion to
> >
> > BUG_ON(nr_pages == ret);

Agreed.

> 
> Even more strict:
>  BUG_ON(ret >= nr_pages);

Agreed too, plus this should be quicker than my weaker check.

Maybe some of the BUG_ONs can be deleted later or converted to
VM_BUG_ON, but initially I feel safer with the BUG_ON, considering
that this is a slow path.

> Reviewed-by: Andres Lagar-Cavilla 

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY

2014-10-02 Thread Andrea Arcangeli
On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
> For all these and the other _fast() users, is there an actual limit to
> the nr_pages passed in? Because we used to have the 64 pages limit from
> DIO, but without that we get rather long IRQ-off latencies.

Ok, I would tend to think this is an issue to solve in the gup_fast
implementation; I wouldn't blame or modify the callers for it.

I don't think there's anything that prevents gup_fast from re-enabling
irqs after a certain number of pages have been taken, and then
disabling them again.

If the TLB flush runs in parallel with gup_fast the result is
undefined anyway, so there's no point in waiting for all pages to be
taken before letting the TLB flush go through. All that matters is
that gup_fast doesn't take pages that have been invalidated after the
tlb_flush returns on the other side. So I don't see issues in
releasing irqs and being latency friendly inside the gup_fast fast
path loop.

In fact gup_fast should also cond_resched() after releasing irqs; it's
not just an irq latency matter.

I could fix x86-64 for it in the same patchset, unless somebody sees a
problem in releasing irqs inside the gup_fast fast path loop.

__gup_fast is an entirely different beast and that one does need the
callers to be fixed, but I didn't alter its callers.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] mm: gup: use get_user_pages_fast and get_user_pages_unlocked

2014-10-01 Thread Andrea Arcangeli
On Wed, Oct 01, 2014 at 10:56:36AM +0200, Andrea Arcangeli wrote:
> diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> index f74fc0c..cd20669 100644
> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct 
> *vma,
>  #else
>   *pageshift = PAGE_SHIFT;
>  #endif
> - if (get_user_pages
> - (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
> + if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
>   return -EFAULT;
>   *paddr = page_to_phys(page);
>   put_page(page);

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8f5330d..6606c10 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned 
> long addr)
>   struct page *p;
>   int err;
>  
> - err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
> + err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
>   if (err >= 0) {
>   err = page_to_nid(p);
>   put_page(p);

I just noticed I need to revert the above two changes... (neither was
exercised during the testing).
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] mm: gup: use get_user_pages_fast and get_user_pages_unlocked

2014-10-01 Thread Andrea Arcangeli
Just an optimization.

Signed-off-by: Andrea Arcangeli 
---
 drivers/dma/iovlock.c  | 10 ++
 drivers/iommu/amd_iommu_v2.c   |  6 ++
 drivers/media/pci/ivtv/ivtv-udma.c |  6 ++
 drivers/misc/sgi-gru/grufault.c|  3 +--
 drivers/scsi/st.c  | 10 ++
 drivers/video/fbdev/pvr2fb.c   |  5 +
 mm/mempolicy.c |  2 +-
 mm/process_vm_access.c |  7 ++-
 mm/util.c  | 10 ++
 net/ceph/pagevec.c |  9 -
 10 files changed, 19 insertions(+), 49 deletions(-)

diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index bb48a57..12ea7c3 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec 
*iov, size_t len)
pages += page_list->nr_pages;
 
/* pin pages down */
-   down_read(&current->mm->mmap_sem);
-   ret = get_user_pages(
-   current,
-   current->mm,
+   ret = get_user_pages_fast(
(unsigned long) iov[i].iov_base,
page_list->nr_pages,
1,  /* write */
-   0,  /* force */
-   page_list->pages,
-   NULL);
-   up_read(&current->mm->mmap_sem);
+   page_list->pages);
 
if (ret != page_list->nr_pages)
goto unpin;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5f578e8..6963b73 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)
 
write = !!(fault->flags & PPR_FAULT_WRITE);
 
-   down_read(&fault->state->mm->mmap_sem);
-   npages = get_user_pages(NULL, fault->state->mm,
-   fault->address, 1, write, 0, &page, NULL);
-   up_read(&fault->state->mm->mmap_sem);
+   npages = get_user_pages_unlocked(NULL, fault->state->mm,
+fault->address, 1, write, 0, &page);
 
if (npages == 1) {
put_page(page);
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c 
b/drivers/media/pci/ivtv/ivtv-udma.c
index 7338cb2..96d866b 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long 
ivtv_dest_addr,
}
 
/* Get user pages for DMA Xfer */
-   down_read(&current->mm->mmap_sem);
-   err = get_user_pages(current, current->mm,
-   user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, 
NULL);
-   up_read(&current->mm->mmap_sem);
+   err = get_user_pages_unlocked(current, current->mm,
+   user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
 
if (user_dma.page_count != err) {
IVTV_DEBUG_WARN("failed to map user pages, returned %d instead 
of %d\n",
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index f74fc0c..cd20669 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
 #else
*pageshift = PAGE_SHIFT;
 #endif
-   if (get_user_pages
-   (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+   if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
return -EFAULT;
*paddr = page_to_phys(page);
put_page(page);
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index aff9689..c89dcfa 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
return -ENOMEM;
 
 /* Try to fault in all of the necessary pages */
-   down_read(&current->mm->mmap_sem);
 /* rw==READ means read from drive, write into memory area */
-   res = get_user_pages(
-   current,
-   current->mm,
+   res = get_user_pages_fast(
uaddr,
nr_pages,
rw == READ,
-   0, /* don't force */
-   pages,
-   NULL);
-   up_read(&current->mm->mmap_sem);
+   pages);
 
/* Errors and no page mapped should return here */
if (res < nr_pages)
diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index 167cfff..ff81f65 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const 
char *buf,
if (!pages)
return -ENOMEM;
 
-

[PATCH 4/4] mm: gup: use get_user_pages_unlocked within get_user_pages_fast

2014-10-01 Thread Andrea Arcangeli
Signed-off-by: Andrea Arcangeli 
---
 arch/mips/mm/gup.c   | 8 +++-
 arch/powerpc/mm/gup.c| 6 ++
 arch/s390/kvm/kvm-s390.c | 4 +---
 arch/s390/mm/gup.c   | 6 ++
 arch/sh/mm/gup.c | 6 ++
 arch/sparc/mm/gup.c  | 6 ++
 arch/x86/mm/gup.c| 7 +++
 7 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c..20884f5 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -301,11 +301,9 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;
 
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-   (end - start) >> PAGE_SHIFT,
-   write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT,
+ write, 0, pages);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..b70c34a 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
start += nr << PAGE_SHIFT;
pages += nr;
 
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-nr_pages - nr, write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+ nr_pages - nr, write, 0, pages);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..37ca29a 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t 
gpa, int writable)
hva = gmap_fault(gpa, vcpu->arch.gmap);
if (IS_ERR_VALUE(hva))
return (long)hva;
-   down_read(&mm->mmap_sem);
-   rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
-   up_read(&mm->mmap_sem);
+   rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);
 
return rc < 0 ? rc : 0;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce46..5c586c7 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;
pages += nr;
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-nr_pages - nr, write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+nr_pages - nr, write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0)
ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index 37458f3..e15f52a 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -257,10 +257,8 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;
 
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-   (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+   (end - start) >> PAGE_SHIFT, write, 0, pages);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed043..fa7de7d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -219,10 +219,8 @@ slow:
start += nr << PAGE_SHIFT;
pages += nr;
 
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, start,
-   (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_unlocked(current, mm, start,
+   (end - start) >> PAGE_SHIFT, write, 0, pages);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..2ab183b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -388,10 +388,9 @@ slow_irqon:
start += nr <<

[PATCH 0/4] leverage FAULT_FOLL_ALLOW_RETRY in get_user_pages

2014-10-01 Thread Andrea Arcangeli
FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading to reduce the mmap_sem contention (for writing), like while
waiting for I/O completion. The problem is that right now practically
no get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not
leveraging that nifty feature.

Andres fixed it for the KVM page fault. However get_user_pages_fast
remains uncovered, and 99% of other get_user_pages aren't using it
either (the only exception being FOLL_NOWAIT in KVM which is really
nonblocking and in fact it doesn't even release the mmap_sem).

So this patchsets extends the optimization Andres did in the KVM page
fault to the whole kernel. It makes most important places (including
gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold
times during I/O.

The only few places that remain uncovered are drivers like v4l and
other exceptions that tend to work on their own memory rather than on
random user memory (unlike, for example, O_DIRECT, which uses gup_fast
and is fully covered by this patch).

A follow-up patch should probably also add a printk_once warning to
get_user_pages, which should go obsolete and be phased out
eventually. The "vmas" parameter of get_user_pages makes it
fundamentally incompatible with FAULT_FOLL_ALLOW_RETRY (the vmas array
becomes meaningless the moment the mmap_sem is released).

While this is just an optimization, it becomes an absolute
requirement for the userfaultfd. The userfaultfd allows blocking the
page fault, and in order to do so I need to drop the mmap_sem
first. So this patch also ensures that, for all memory where a
userfaultfd could be registered by KVM, the very first fault (no
matter if it is a regular page fault or a get_user_pages) always
has FAULT_FOLL_ALLOW_RETRY set. Then the userfaultfd blocks and it is
woken only when the pagetable is already mapped. The second fault
attempt after the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's
ok to retry without it.
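
As a rough sketch of that two-pass flow (illustrative only, not code
from this series; the helper name is made up), a caller looks like:

/*
 * First attempt with FAULT_FLAG_ALLOW_RETRY: the fault handler may
 * drop the mmap_sem (e.g. to wait for I/O or for a userfault to be
 * resolved) and return VM_FAULT_RETRY.  The caller then re-takes the
 * semaphore, re-looks up the vma (it became stale when the lock was
 * dropped) and retries with FAULT_FLAG_TRIED and without ALLOW_RETRY.
 */
static int fault_in_addr_sketch(struct mm_struct *mm,
				unsigned long address, bool write)
{
	unsigned int flags = FAULT_FLAG_ALLOW_RETRY |
			     (write ? FAULT_FLAG_WRITE : 0);
	struct vm_area_struct *vma;
	int ret;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	if (!vma || vma->vm_start > address) {
		up_read(&mm->mmap_sem);
		return -EFAULT;
	}
	ret = handle_mm_fault(mm, vma, address, flags);
	if (ret & VM_FAULT_RETRY) {
		/* the fault handler released the mmap_sem before blocking */
		down_read(&mm->mmap_sem);
		vma = find_vma(mm, address);
		if (!vma || vma->vm_start > address) {
			up_read(&mm->mmap_sem);
			return -EFAULT;
		}
		flags = (flags & ~FAULT_FLAG_ALLOW_RETRY) | FAULT_FLAG_TRIED;
		ret = handle_mm_fault(mm, vma, address, flags);
	}
	up_read(&mm->mmap_sem);

	return (ret & VM_FAULT_ERROR) ? -EFAULT : 0;
}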

So I need this merged before I can attempt to merge the userfaultfd.

This has been running fully stable on a heavy KVM postcopy live
migration workload that also exercises the new userfaultfd API. That
API allows an unlimited number of userfaultfds per process, and each
one can register and unregister memory ranges at any time, so each
thread or each shared lib can do userfaults in its own private memory
independently of the others and independently of the main process.
This is also the same load that exposed the nfs silent memory
corruption, and it uses O_DIRECT on nfs too, so get_user_pages_fast
and all sorts of get_user_pages are exercised both by NFS and KVM at
the same time on the userfaultfd backed memory.

Reviews would be welcome, thanks,
Andrea

Andrea Arcangeli (3):
  mm: gup: add get_user_pages_locked and get_user_pages_unlocked
  mm: gup: use get_user_pages_fast and get_user_pages_unlocked
  mm: gup: use get_user_pages_unlocked within get_user_pages_fast

Andres Lagar-Cavilla (1):
  mm: gup: add FOLL_TRIED

 arch/mips/mm/gup.c |   8 +-
 arch/powerpc/mm/gup.c  |   6 +-
 arch/s390/kvm/kvm-s390.c   |   4 +-
 arch/s390/mm/gup.c |   6 +-
 arch/sh/mm/gup.c   |   6 +-
 arch/sparc/mm/gup.c|   6 +-
 arch/x86/mm/gup.c  |   7 +-
 drivers/dma/iovlock.c  |  10 +-
 drivers/iommu/amd_iommu_v2.c   |   6 +-
 drivers/media/pci/ivtv/ivtv-udma.c |   6 +-
 drivers/misc/sgi-gru/grufault.c|   3 +-
 drivers/scsi/st.c  |  10 +-
 drivers/video/fbdev/pvr2fb.c   |   5 +-
 include/linux/mm.h |   8 ++
 mm/gup.c   | 182 ++---
 mm/mempolicy.c |   2 +-
 mm/nommu.c |  23 +
 mm/process_vm_access.c |   7 +-
 mm/util.c  |  10 +-
 net/ceph/pagevec.c |   9 +-
 20 files changed, 236 insertions(+), 88 deletions(-)



[PATCH 1/4] mm: gup: add FOLL_TRIED

2014-10-01 Thread Andrea Arcangeli
From: Andres Lagar-Cavilla 

Reviewed-by: Radim Krčmář 
Signed-off-by: Andres Lagar-Cavilla 
Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h | 1 +
 mm/gup.c   | 4 
 2 files changed, 5 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0f4196a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1985,6 +1985,7 @@ static inline struct page *follow_page(struct 
vm_area_struct *vma,
 #define FOLL_HWPOISON  0x100   /* check page is hwpoisoned */
 #define FOLL_NUMA  0x200   /* force NUMA hinting page fault */
 #define FOLL_MIGRATION 0x400   /* wait for page to replace migration entry */
+#define FOLL_TRIED 0x800   /* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct 
vm_area_struct *vma,
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+   if (*flags & FOLL_TRIED) {
+   VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+   fault_flags |= FAULT_FLAG_TRIED;
+   }
 
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {


[PATCH 2/4] mm: gup: add get_user_pages_locked and get_user_pages_unlocked

2014-10-01 Thread Andrea Arcangeli
We can leverage the VM_FAULT_RETRY functionality in the page fault
paths better by using either get_user_pages_locked or
get_user_pages_unlocked.

The former allows conversion of get_user_pages invocations that will
have to pass a "&locked" parameter to know if the mmap_sem was dropped
during the call. Example from:

down_read(&mm->mmap_sem);
do_something()
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

to:

int locked = 1;
down_read(&mm->mmap_sem);
do_something()
get_user_pages_locked(tsk, mm, ..., pages, &locked);
if (locked)
up_read(&mm->mmap_sem);

The latter is suitable only as a drop-in replacement of the form:

down_read(&mm->mmap_sem);
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

into:

get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." parameters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must
be NULL for get_user_pages_locked|unlocked to be usable (the latter
original form wouldn't have been safe anyway if vmas wasn't NULL; for
the former we just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

This patch then applies the new forms in various places, in some cases
also replacing it with get_user_pages_fast whenever tsk and mm are
current and current->mm. get_user_pages_unlocked varies from
get_user_pages_fast only if mm is not current->mm (like when
get_user_pages works on some other process's mm). Whenever tsk and mm
match current and current->mm, get_user_pages_fast must always be
used to increase performance and get the page lockless (with only irqs
disabled).

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h |   7 +++
 mm/gup.c   | 178 +
 mm/nommu.c |  23 +++
 3 files changed, 197 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f4196a..8900ba9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1196,6 +1196,13 @@ long get_user_pages(struct task_struct *tsk, struct 
mm_struct *mm,
unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages,
struct vm_area_struct **vmas);
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, unsigned long nr_pages,
+   int write, int force, struct page **pages,
+   int *locked);
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long start, unsigned long nr_pages,
+   int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
 struct kvec;
diff --git a/mm/gup.c b/mm/gup.c
index af7ea3e..b6d0076 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -580,6 +580,166 @@ int fixup_user_fault(struct task_struct *tsk, struct 
mm_struct *mm,
return 0;
 }
 
+static inline long __get_user_pages_locked(struct task_struct *tsk,
+  struct mm_struct *mm,
+  unsigned long start,
+  unsigned long nr_pages,
+  int write, int force,
+  struct page **pages,
+  struct vm_area_struct **vmas,
+  int *locked,
+  bool notify_drop)
+{
+   int flags = FOLL_TOUCH;
+   long ret, pages_done;
+   bool lock_dropped;
+
+   if (locked) {
+   /* if VM_FAULT_RETRY can be returned, vmas become invalid */
+   BUG_ON(vmas);
+   /* check caller initialized locked */
+   BUG_ON(*locked != 1);
+   }
+
+   if (pages)
+   flags |= FOLL_GET;
+   if (write)
+   flags |= FOLL_WRITE;
+   if (force)
+   flags |= FOLL_FORCE;
+
+   pages_done = 0;
+   lock_dropped = false;
+   for (;;) {
+   ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
+  vmas, locked);
+   if (!locked)
+   /* VM_FAULT_RETRY couldn't trigger, bypass */
+   return ret;
+
+   /* VM_FAULT_RETRY cannot return errors */
+   if (!*locked) {
+   BUG_ON(ret < 0);
+   BUG_ON(nr_pages == 1 && ret);
+   }
+
+   if (!pages)
+   /* If it's a prefault don't 

Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY

2014-09-28 Thread Andrea Arcangeli
On Fri, Sep 26, 2014 at 12:54:46PM -0700, Andres Lagar-Cavilla wrote:
> On Fri, Sep 26, 2014 at 10:25 AM, Andrea Arcangeli  
> wrote:
> > On Thu, Sep 25, 2014 at 02:50:29PM -0700, Andres Lagar-Cavilla wrote:
> >> It's nearly impossible to name it right because 1) it indicates we can
> >> relinquish 2) it returns whether we still hold the mmap semaphore.
> >>
> >> I'd prefer it'd be called mmap_sem_hold, which conveys immediately
> >> what this is about ("nonblocking" or "locked" could be about a whole
> >> lot of things)
> >
> > To me FOLL_NOWAIT/FAULT_FLAG_RETRY_NOWAIT is nonblocking,
> > "locked"/FAULT_FLAG_ALLOW_RETRY is still very much blocking, just
> > without the mmap_sem, so I called it "locked"... but I'm fine to
> > change the name to mmap_sem_hold. Just get_user_pages_mmap_sem_hold
> > seems less friendly than get_user_pages_locked(..., &locked). locked
> > as you used comes intuitive when you do later "if (locked) up_read".
> >
> 
> Heh. I was previously referring to the int *locked param , not the
> _(un)locked suffix. That param is all about the mmap semaphore, so why
> not name it less ambiguously. It's essentially a tristate.

I got that you were referring to the parameter name; the problem is I
didn't want to call the function get_user_pages_mmap_sem_hold(), and
if I call it get_user_pages_locked(), calling the parameter "*locked"
just like you did in your patch looked more intuitive.

Suggestions for a better name than gup_locked are welcome though!

> My suggestion is that you just make gup behave as your proposed
> gup_locked, and no need to introduce another call. But I understand if
> you want to phase this out politely.

Yes, replacing gup would be ideal, but there are various drivers that
make use of it with a larger critical section, and I didn't want to
have to deal with all that immediately. Not to mention anything that
uses the "vmas" parameter, which prevents releasing the mmap_sem by
design and will require larger modifications to get rid of.

So with this patch there's an optimal version of gup_locked|unlocked,
and a "nonscalable" one that allows for a large critical section
before and after gup with the vmas parameter too.

> > Then I added an _unlocked kind which is a drop in replacement for many
> > places just to clean it up.
> >
> > get_user_pages_unlocked and get_user_pages_fast are equivalent in
> > semantics, so any call of get_user_pages_unlocked(current,
> > current->mm, ...) has no reason to exist and should be replaced to
> > get_user_pages_fast unless "force = 1" (gup_fast has no force param
> > just to make the argument list a bit more confusing across the various
> > versions of gup).
> >
> > get_user_pages over time should be phased out and dropped.
> 
> Please. Too many variants. So the end goal is
> * __gup_fast
> * gup_fast == __gup_fast + gup_unlocked for fallback
> * gup (or gup_locked)
> * gup_unlocked
> (and flat __gup remains buried in the impl)?

That's exactly the end goal, yes.

> Basically all this discussion should go into the patch as comments.
> Help people shortcut git blame.

Sure, I added the comments of the commit header inline too.

> > +static inline long __get_user_pages_locked(struct task_struct *tsk,
> > +  struct mm_struct *mm,
> > +  unsigned long start,
> > +  unsigned long nr_pages,
> > +  int write, int force,
> > +  struct page **pages,
> > +  struct vm_area_struct **vmas,
> > +  int *locked,
> > +  bool immediate_unlock)
> s/immediate_unlock/notify_drop/

Applied.

> > +{
> > +   int flags = FOLL_TOUCH;
> > +   long ret, pages_done;
> > +   bool lock_dropped;
> s/lock_dropped/sem_dropped/

Well, this sounds a bit more confusing actually, unless we stop
calling the parameter "locked" first.

I mean, it's the very "locked" parameter that the "lock_dropped"
variable refers to. So I wouldn't bother changing it to "sem" and
would stick to the generic concept of locked/unlocked regardless of
the underlying implementation (the rwsem taken for reading).

> > +
> > +   if (locked) {
> > +   /* if VM_FAULT_RETRY can be returned, vmas become invalid */
> > +   BUG_ON(vmas);
> > +   /* check caller i

RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY

2014-09-26 Thread Andrea Arcangeli
On Thu, Sep 25, 2014 at 02:50:29PM -0700, Andres Lagar-Cavilla wrote:
> It's nearly impossible to name it right because 1) it indicates we can
> relinquish 2) it returns whether we still hold the mmap semaphore.
> 
> I'd prefer it'd be called mmap_sem_hold, which conveys immediately
> what this is about ("nonblocking" or "locked" could be about a whole
> lot of things)

To me FOLL_NOWAIT/FAULT_FLAG_RETRY_NOWAIT is nonblocking,
"locked"/FAULT_FLAG_ALLOW_RETRY is still very much blocking, just
without the mmap_sem, so I called it "locked"... but I'm fine to
change the name to mmap_sem_hold. Just get_user_pages_mmap_sem_hold
seems less friendly than get_user_pages_locked(..., &locked). locked
as you used comes intuitive when you do later "if (locked) up_read".

Then I added an _unlocked kind which is a drop in replacement for many
places just to clean it up.

get_user_pages_unlocked and get_user_pages_fast are equivalent in
semantics, so any call of get_user_pages_unlocked(current,
current->mm, ...) has no reason to exist and should be replaced to
get_user_pages_fast unless "force = 1" (gup_fast has no force param
just to make the argument list a bit more confusing across the various
versions of gup).

get_user_pages over time should be phased out and dropped.
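
Concretely, the conversion this rule asks for looks like the following
(illustrative snippet, not taken from a specific driver; addr and page
are placeholders):

	/* equivalent in semantics when tsk/mm are current/current->mm
	   and force == 0 ... */
	ret = get_user_pages_unlocked(current, current->mm, addr,
				      1, 1, 0, &page);

	/* ... so it should simply become: */
	ret = get_user_pages_fast(addr, 1, 1, &page);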

> I can see that. My background for coming into this is very similar: in
> a previous life we had a file system shim that would kick up into
> userspace for servicing VM memory. KVM just wouldn't let the file
> system give up the mmap semaphore. We had /proc readers hanging up all
> over the place while userspace was servicing. Not happy.
> 
> With KVM (now) and the standard x86 fault giving you ALLOW_RETRY, what
> stands in your way? Methinks that gup_fast has no slowpath fallback
> that turns on ALLOW_RETRY. What would oppose that being the global
> behavior?

It should become the global behavior. It just doesn't need to become
the global behavior immediately for all kinds of gups (i.e.
video4linux drivers will never need to poke into the KVM guest user
memory, so it doesn't matter if they don't use gup_locked
immediately). Even then we can still support
get_user_pages_locked(..., locked=NULL) for ptrace/coredump and other
things that may not want to trigger the userfaultfd protocol and just
get an immediate VM_FAULT_SIGBUS.

Userfaults will just return VM_FAULT_SIGBUS (translated to -EFAULT by
all gup invocations) and not invoke the userfaultfd protocol if
FAULT_FLAG_ALLOW_RETRY is not set. So any gup_locked with locked ==
NULL, or gup() (without the locked parameter), will not invoke the
userfaultfd protocol.
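
In other words the check at fault time amounts to something like this
(a hand-written sketch of the behavior described above, not the actual
handle_userfault() body):

	if (!(flags & FAULT_FLAG_ALLOW_RETRY))
		/* can't drop the mmap_sem, so can't block: fail the
		   fault instead of invoking the userfaultfd protocol;
		   gup callers will see -EFAULT */
		return VM_FAULT_SIGBUS;

	/* otherwise: queue the userfault, release the mmap_sem, wait
	   for userland to resolve it, then return VM_FAULT_RETRY */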

But I need gup_fast to use FAULT_FLAG_ALLOW_RETRY because core places
like O_DIRECT use it.

I tried to do an RFC patch below that goes in this direction and
should be enough for a start to solve all my issues with the mmap_sem
holding inside handle_userfault(); comments welcome.

===
From 41918f7d922d1e7fc70f117db713377e7e2af6e9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Fri, 26 Sep 2014 18:36:53 +0200
Subject: [PATCH 1/2] mm: gup: add get_user_pages_locked and
 get_user_pages_unlocked

We can leverage the VM_FAULT_RETRY functionality in the page fault
paths better by using either get_user_pages_locked or
get_user_pages_unlocked.

The former allows conversion of get_user_pages invocations that will
have to pass a "&locked" parameter to know if the mmap_sem was dropped
during the call. Example from:

down_read(&mm->mmap_sem);
do_something()
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

to:

int locked = 1;
down_read(&mm->mmap_sem);
do_something()
get_user_pages_locked(tsk, mm, ..., pages, &locked);
if (locked)
up_read(&mm->mmap_sem);

The latter is suitable only as a drop-in replacement of the form:

down_read(&mm->mmap_sem);
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

into:

get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." parameters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must
be NULL for get_user_pages_locked|unlocked to be usable (the latter
original form wouldn't have been safe anyway if vmas wasn't NULL; for
the former we just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

This patch then applies the new forms in various places, in some cases
also replacing it with get_user_pages_fast whenever tsk and mm are
current and current->mm. get_user_pages_unlocked varies from
get_user_pages_fast only if mm is not current->mm (like when
get_user_pages works on some other process's mm). Whenever tsk and mm
match current and current->mm get_user_pag

Re: [PATCH] kvm: Fix kvm_get_page_retry_io __gup retval check

2014-09-25 Thread Andrea Arcangeli
On Thu, Sep 25, 2014 at 03:26:50PM -0700, Andres Lagar-Cavilla wrote:
> Confusion around -EBUSY and zero (inside a BUG_ON no less).
> 
> Reported-by: AndreA Arcangeli 
> Signed-off-by: Andres Lagar-Cavilla 
> ---
>  virt/kvm/kvm_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Andrea Arcangeli 

Valid for this and the previous patch in Message-Id:
<1410976308-7683-1-git-send-email-andre...@google.com> as well.

Thanks,
Andrea

> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3f16f56..a1cf53e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1146,7 +1146,7 @@ int kvm_get_user_page_io(struct task_struct *tsk, 
> struct mm_struct *mm,
>   npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
> &locked);
>   if (!locked) {
> - VM_BUG_ON(npages != -EBUSY);
> + VM_BUG_ON(npages);
>  
>   if (!pagep)
>   return 0;
> -- 
> 2.1.0.rc2.206.gedb03e5
> 


Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem

2014-09-25 Thread Andrea Arcangeli
Hi Andres,

On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
> + if (!locked) {
> + VM_BUG_ON(npages != -EBUSY);
> +

Shouldn't this be VM_BUG_ON(npages)?

Alternatively we could patch gup to do:

case -EHWPOISON:
+   case -EBUSY:
return i ? i : ret;
-   case -EBUSY:
-   return i;

I need to fix gup_fast slow path to start with FAULT_FLAG_ALLOW_RETRY
similarly to what you did to the KVM slow path.

gup_fast is called without the mmap_sem (incidentally its whole point
is to only disable irqs and not take the locks), so enabling the
FAULT_FLAG_ALLOW_RETRY initial pass inside gup_fast should be entirely
self contained. It shouldn't concern KVM, which should already be fine
with your patch, but it will allow the userfaultfd to intercept all
O_DIRECT gup_fast in addition to the KVM faults covered by your patch.

Eventually get_user_pages should be obsoleted in favor of
get_user_pages_locked (or whatever we decide to call it) so the
userfaultfd can intercept all kinds of gups. gup_locked is the same as
gup except for one more "locked" parameter at the end; I called the
parameter locked instead of nonblocking because it'd be more proper to
call "nonblocking" the FOLL_NOWAIT kind of gup, which is quite the
opposite (in fact the mmap_sem cannot be dropped in the nonblocking
version).

ptrace ironically is better off sticking with a NULL locked parameter
and getting a sigbus instead of risking a hang on the userfaultfd
(which would be safe as it can be killed, but it'd be annoying if
erroneously poking into a hole during a gdb session). It's still
possible to pass NULL as the parameter to get_user_pages_locked to
achieve that. So the fact that some callers won't block in
handle_userfault, because FAULT_FLAG_ALLOW_RETRY is not set and the
userfault cannot block, may come in handy.

What I'm trying to solve in this context is that the userfault cannot
safely block without FAULT_FLAG_ALLOW_RETRY, because we can't allow
userland to take the mmap_sem for an unlimited amount of time without
requiring special privileges. So if handle_userfault wants to block
within a gup invocation, it must first release the mmap_sem; hence
FAULT_FLAG_ALLOW_RETRY is always required at the first attempt for any
virtual address.

With regard to the last sentence, there's actually a race with
MADV_DONTNEED too; I'd need to change the code to always pass
FAULT_FLAG_ALLOW_RETRY (your code would also need to loop and insist
with the __get_user_pages(locked) version to solve it). In the worst
case the KVM ioctl would get an -EFAULT if the race materializes, for
example. It's not concerning though because that can be solved in
userland somehow by separating ballooning and live migration
activities.

Thanks,
Andrea


Re: [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-03 Thread Andrea Arcangeli
Hi Andy,

thanks for CC'ing linux-api.

On Wed, Jul 02, 2014 at 06:56:03PM -0700, Andy Lutomirski wrote:
> On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> > Once a userfaultfd is created, MADV_USERFAULT regions talk through
> > the userfaultfd protocol with the thread responsible for doing the
> > memory externalization of the process.
> > 
> > The protocol starts with userland writing the requested/preferred
> > USERFAULT_PROTOCOL version into the userfault fd (64bit write); if
> > the kernel knows it, it will ack it by allowing userland to read
> > 64bit from the userfault fd that will contain the same 64bit
> > USERFAULT_PROTOCOL version that userland asked for. Otherwise
> > userland will read the __u64 value -1ULL (aka
> > USERFAULTFD_UNKNOWN_PROTOCOL) and will have to try again by writing
> > an older protocol version, if suitable for its usage too, and read
> > it back again until it stops reading -1ULL. After that the
> > userfaultfd protocol starts.
> > 
> > The protocol consists of 64bit reads from the userfault fd that
> > provide userland the fault addresses. After a userfault address has
> > been read and the fault has been resolved by userland, the
> > application must write back 128 bits in the form of a [ start, end ]
> > range (64bit each) that will tell the kernel such a range has been
> > mapped. Multiple read userfaults can be resolved in a single range
> > write. poll() can be used to know when there are new userfaults to
> > read (POLLIN) and when there are threads waiting for a wakeup
> > through a range write (POLLOUT).
> > 
> > Signed-off-by: Andrea Arcangeli 
> 
> > +#ifdef CONFIG_PROC_FS
> > +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> > +{
> > +   struct userfaultfd_ctx *ctx = f->private_data;
> > +   int ret;
> > +   wait_queue_t *wq;
> > +   struct userfaultfd_wait_queue *uwq;
> > +   unsigned long pending = 0, total = 0;
> > +
> > +   spin_lock(&ctx->fault_wqh.lock);
> > +   list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> > +   uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> > +   if (uwq->pending)
> > +   pending++;
> > +   total++;
> > +   }
> > +   spin_unlock(&ctx->fault_wqh.lock);
> > +
> > +   ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
> 
> This should show the protocol version, too.

Ok, does the below look ok?

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 388553e..f9d3e9f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -493,7 +493,13 @@ static int userfaultfd_show_fdinfo(struct seq_file *m, 
struct file *f)
}
spin_unlock(&ctx->fault_wqh.lock);
 
-   ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
+   /*
+* If more protocols will be added, there will be all shown
+* separated by a space. Like this:
+*  protocols: 0xaa 0xbb
+*/
+   ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nprotocols:\t%Lx\n",
+pending, total, USERFAULTFD_PROTOCOL);
 
return ret;
 }


> > +
> > +SYSCALL_DEFINE1(userfaultfd, int, flags)
> > +{
> > +   int fd, error;
> > +   struct file *file;
> 
> This looks like it can't be used more than once in a process.  That will

It can't be used more than once, correct.

file = ERR_PTR(-EBUSY);
if (get_mm_slot(current->mm))
goto out_free_unlock;

If a userfaultfd is already registered for the current mm the second
one gets -EBUSY.

> be unfortunate for libraries.  Would it be feasible to either have

So you envision two userfaultfd memory managers for the same process?
I assume each one would claim separate ranges of memory?

For that case the demultiplexing of userfaults can be entirely managed
by userland.

One libuserfault library can actually register the userfaultfd, and
then the two libs can register into libuserfault and claim their own
ranges. It could run the code of the two libs in the thread context
that waits on the userfaultfd with zero overhead, or message passing
across threads can be used to run both libs in parallel in their own
thread. The demultiplexing code wouldn't be CPU intensive. The
downside is the two schedule events required if they want to run their
lib code in a separate thread. If we claimed the two different ranges
in the kernel for two different userfaultfds, the kernel would be
speaking directly with each library thread. That'd be the only
advantage if they don't want to run in the context of the thread that
waits on the userfaultfd.
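
For reference, a minimal sketch of what such a demultiplexing thread
would do with the read/write protocol described in this patchset
(USERFAULT_PROTOCOL, page_size and resolve_fault() are placeholders;
this is the 64bit-read/128bit-write protocol described above, not a
final API):

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

static void userfault_loop_sketch(int ufd, long page_size)
{
	uint64_t proto = USERFAULT_PROTOCOL;	/* placeholder constant */

	/* handshake: propose a protocol version, read back the ack */
	write(ufd, &proto, sizeof(proto));
	read(ufd, &proto, sizeof(proto));
	if (proto == (uint64_t)-1)
		return;	/* kernel doesn't speak this protocol version */

	for (;;) {
		struct pollfd pfd = { .fd = ufd, .events = POLLIN };
		uint64_t addr, range[2];

		poll(&pfd, 1, -1);	/* wait for a userfault (POLLIN) */
		if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
			continue;

		resolve_fault(addr);	/* map the missing page somehow */

		/* tell the kernel the range is mapped, waking the faulters */
		range[0] = addr & ~((uint64_t)page_size - 1);
		range[1] = range[0] + page_size;
		write(ufd, range, sizeof(range));
	}
}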

To increase SMP scalability 

Re: [PATCH 0/2] KVM: async_pf: use_mm/mm_users fixes

2014-04-28 Thread Andrea Arcangeli
On Mon, Apr 28, 2014 at 01:06:05PM +0200, Paolo Bonzini wrote:
> Patch 1 will be for 3.16 only, I'd like a review from Marcelo or Andrea 
> though (that's "KVM: async_pf: kill the unnecessary use_mm/unuse_mm 
> async_pf_execute()" for easier googling).

Patch 1:

Reviewed-by: Andrea Arcangeli 

I think current->NULL would be better too.


