Re: [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-03 Thread Andrea Arcangeli
Hi Andy,

thanks for CC'ing linux-api.

On Wed, Jul 02, 2014 at 06:56:03PM -0700, Andy Lutomirski wrote:
> On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> > Once a userfaultfd is created, MADV_USERFAULT regions talk through
> > the userfaultfd protocol with the thread responsible for doing the
> > memory externalization of the process.
> > 
> > The protocol starts with userland writing the requested/preferred
> > USERFAULT_PROTOCOL version into the userfault fd (a 64bit write);
> > if the kernel knows it, it acks it by allowing userland to read a
> > 64bit value from the userfault fd containing the same 64bit
> > USERFAULT_PROTOCOL version that userland asked for. Otherwise
> > userland will read the __u64 value -1ULL (aka
> > USERFAULTFD_UNKNOWN_PROTOCOL) and will have to try again, writing
> > an older protocol version if that is also suitable for its usage,
> > and reading back until it stops reading -1ULL. After that the
> > userfaultfd protocol starts.
> > 
> > The protocol consists of 64bit reads from the userfault fd that
> > provide userland the fault addresses. After a userfault address has
> > been read and the fault has been resolved by userland, the
> > application must write back 128 bits in the form of a [start, end]
> > range (64bit each) that tells the kernel such a range has been
> > mapped. Multiple read userfaults can be resolved in a single range
> > write. poll() can be used to know when there are new userfaults to
> > read (POLLIN) and when there are threads waiting for a wakeup
> > through a range write (POLLOUT).
> > 
> > Signed-off-by: Andrea Arcangeli 
> 
> > +#ifdef CONFIG_PROC_FS
> > +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> > +{
> > +   struct userfaultfd_ctx *ctx = f->private_data;
> > +   int ret;
> > +   wait_queue_t *wq;
> > +   struct userfaultfd_wait_queue *uwq;
> > +   unsigned long pending = 0, total = 0;
> > +
> > +   spin_lock(&ctx->fault_wqh.lock);
> > +   list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> > +   uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> > +   if (uwq->pending)
> > +   pending++;
> > +   total++;
> > +   }
> > +   spin_unlock(&ctx->fault_wqh.lock);
> > +
> > +   ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
> 
> This should show the protocol version, too.

Ok, does the below look ok?

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 388553e..f9d3e9f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -493,7 +493,13 @@ static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 	}
 	spin_unlock(&ctx->fault_wqh.lock);
 
-	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
+	/*
+	 * If more protocols are added, they will all be shown
+	 * separated by a space, like this:
+	 *  protocols: 0xaa 0xbb
+	 */
+	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nprotocols:\t%Lx\n",
+			 pending, total, USERFAULTFD_PROTOCOL);
 
 	return ret;
 }
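With that change, /proc/<pid>/fdinfo/<fd> for a userfaultfd would read
along these lines (the numbers and the protocol value are made up here
for illustration):

pending:	2
total:	5
protocols:	aa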


> > +
> > +SYSCALL_DEFINE1(userfaultfd, int, flags)
> > +{
> > +   int fd, error;
> > +   struct file *file;
> 
> This looks like it can't be used more than once in a process.  That will

It can't be used more than once, correct.

file = ERR_PTR(-EBUSY);
if (get_mm_slot(current->mm))
	goto out_free_unlock;

If a userfaultfd is already registered for the current mm, the second
one gets -EBUSY.
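In code terms (a sketch, reusing the assumed __NR_userfaultfd from the
example above):

	int ufd1 = syscall(__NR_userfaultfd, 0);	/* first fd for this mm: ok */
	int ufd2 = syscall(__NR_userfaultfd, 0);	/* this mm already has one */

	/* here ufd2 == -1 and errno == EBUSY */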

> be unfortunate for libraries.  Would it be feasible to either have

So you envision two userfaultfd memory managers for the same process?
I assume each one would claim separate ranges of memory?

For that case the demultiplexing of userfaults can be entirely managed
by userland.

One libuserfault library can actually register the userfaultfd, and
then the two libs can register into libuserfault and claim their own
ranges. libuserfault could run the code of the two libs in the context
of the thread that waits on the userfaultfd with zero overhead, or
message passing across threads could be used to run the two libs in
parallel, each in its own thread. The demultiplexing code wouldn't be
CPU intensive. The downside is the two schedule events required if they
want to run their lib code in a separate thread. If we claimed the two
different ranges in the kernel for two different userfaultfds, the
kernel would speak directly with each library thread; that would be the
only advantage, and only if the libs don't want to run in the context
of the thread that waits on the userfaultfd.
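A hypothetical libuserfault demultiplexer along those lines could look
like the sketch below (the structures and the handler registration are
made up for illustration; only the range matching and the final range
write follow the protocol described in the patch):

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

#define PAGE_SIZE	4096ULL			/* assumed page size */
#define NR_HANDLERS	2

/* one entry per library that claimed a range of the address space */
struct uf_handler {
	uint64_t start, end;
	void (*resolve)(uint64_t addr);		/* maps the page, then returns */
};

static struct uf_handler handlers[NR_HANDLERS];	/* filled in by the two libs */

/* runs in the single thread that waits on the userfaultfd */
static void demux_one_fault(int ufd, uint64_t addr)
{
	size_t i;

	for (i = 0; i < NR_HANDLERS; i++) {
		if (addr < handlers[i].start || addr >= handlers[i].end)
			continue;

		/*
		 * Run the library code in this context (zero overhead),
		 * or hand addr over to the library's own thread and let
		 * that thread do the range write instead.
		 */
		handlers[i].resolve(addr);

		/* tell the kernel the faulting page is now mapped */
		uint64_t range[2];
		range[0] = addr & ~(PAGE_SIZE - 1);
		range[1] = range[0] + PAGE_SIZE;
		if (write(ufd, range, sizeof(range)) != sizeof(range))
			return;
		return;
	}
}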

To increase SMP scalability in the future we could also add a
UFFD_LOAD_BALANCE flag to distribute userfaults across different
userfaultfds; if used, it could relax the -EBUSY (but it still wouldn't
provide two separately claimed ranges for two different libs).

If UFFD_LOAD_BALANCE were passed to the current code, sys_userfaultfd
would return -EINVAL. I haven't implemented it because I'm not sure
such a thing would ever be needed. Compared to distributing the user
Re: [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-02 Thread Andy Lutomirski
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Once a userfaultfd is created, MADV_USERFAULT regions talk through
> the userfaultfd protocol with the thread responsible for doing the
> memory externalization of the process.
> 
> The protocol starts with userland writing the requested/preferred
> USERFAULT_PROTOCOL version into the userfault fd (a 64bit write); if
> the kernel knows it, it acks it by allowing userland to read a 64bit
> value from the userfault fd containing the same 64bit
> USERFAULT_PROTOCOL version that userland asked for. Otherwise
> userland will read the __u64 value -1ULL (aka
> USERFAULTFD_UNKNOWN_PROTOCOL) and will have to try again, writing an
> older protocol version if that is also suitable for its usage, and
> reading back until it stops reading -1ULL. After that the userfaultfd
> protocol starts.
> 
> The protocol consists of 64bit reads from the userfault fd that
> provide userland the fault addresses. After a userfault address has
> been read and the fault has been resolved by userland, the
> application must write back 128 bits in the form of a [start, end]
> range (64bit each) that tells the kernel such a range has been
> mapped. Multiple read userfaults can be resolved in a single range
> write. poll() can be used to know when there are new userfaults to
> read (POLLIN) and when there are threads waiting for a wakeup through
> a range write (POLLOUT).
> 
> Signed-off-by: Andrea Arcangeli 

> +#ifdef CONFIG_PROC_FS
> +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> +{
> + struct userfaultfd_ctx *ctx = f->private_data;
> + int ret;
> + wait_queue_t *wq;
> + struct userfaultfd_wait_queue *uwq;
> + unsigned long pending = 0, total = 0;
> +
> + spin_lock(&ctx->fault_wqh.lock);
> + list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> + uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> + if (uwq->pending)
> + pending++;
> + total++;
> + }
> + spin_unlock(&ctx->fault_wqh.lock);
> +
> + ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.

> +
> +SYSCALL_DEFINE1(userfaultfd, int, flags)
> +{
> + int fd, error;
> + struct file *file;

This looks like it can't be used more than once in a process.  That will
be unfortunate for libraries.  Would it be feasible to either have
userfaultfd claim a range of addresses or for a vma to be explicitly
associated with a userfaultfd?  (In the latter case, giant PROT_NONE
MAP_NORESERVE mappings could be used.)
