Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)

2019-01-31 Thread Mike Rapoport
Hi Peter,

On Wed, Jan 30, 2019 at 05:23:02PM +0800, Peter Xu wrote:
> On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> > 
> > If we are to discuss userfaultfd, I'd also like to bring up the subject
> > of COW mappings.
> > The pages populated with UFFDIO_COPY cannot be COW-shared between related
> > processes, which unnecessarily increases the memory footprint of a
> > migrated process tree.
> > I've posted a patch [1] a (real) while ago, but nobody reacted and I've
> > put this aside.
> > Maybe it's time to discuss it again :)
> 
> Hi, Mike,
> 
> It's interesting to learn about such work...
> 
> Since I don't have much context on this, sorry if I'm asking a silly
> question... but when reading this I'm thinking of KSM.  I think KSM
> does not suit this case, since UFFDIO_COPY_COW carries explicit hinting
> information, while KSM only scans over the pages of the processes,
> which seems to be O(N*N) assuming there are two processes.  However,
> would it make any sense to provide a general interface to scan for
> identical pages between any two processes within a specific range and
> merge them if found (rather than an interface specific to userfaultfd
> only)?  Then it might even be used by KSM admins (just as an example)
> when the admin knows that the memory range (addr1, len) of process A
> very probably has the same contents as the memory range (addr2, len)
> of process B?

I haven't really thought about using KSM in our case. Our goal was to make
the VM layout of the migrated processes as close as possible to the
original, including the COW sharing between the parent process and its
descendants. For that, UFFDIO_COPY_COW seems to be a more natural fit than
KSM.
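
For readers less familiar with the interface, here is a minimal sketch of
how a page is filled with UFFDIO_COPY today; the UFFDIO_COPY_MODE_COW flag
named in the comment is a hypothetical stand-in for the proposed extension,
not an existing kernel interface:

#include <stddef.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Resolve a missing-page fault by copying data into the faulting mm. */
static int fill_page(int uffd, unsigned long dst, void *src, size_t len)
{
	struct uffdio_copy copy = {
		.dst  = dst,                /* faulting address, page aligned */
		.src  = (unsigned long)src, /* monitor-side buffer with the data */
		.len  = len,
		.mode = 0,                  /* a UFFDIO_COPY_MODE_COW could go here */
	};

	return ioctl(uffd, UFFDIO_COPY, &copy);
}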

> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.



Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)

2019-01-30 Thread Andrea Arcangeli
Hello Mike,

On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> We (CRIU) have some concerns about obsoleting soft-dirty in favor of
> uffd-wp. If there are other soft-dirty users, these concerns would be
> relevant to them as well.
> 
> With soft-dirty we collect the information about the changed memory every
> pre-dump iteration in the following manner:
> * freeze the tasks
> * find entries in /proc/pid/pagemap with SOFT_DIRTY set
> * unfreeze the tasks
> * dump the modified pages to disk/remote host
> 
> While we do need to traverse the /proc/pid/pagemap to identify dirty pages,
> in between the pre-dump iterations and during the actual memory dump the
> tasks are running freely.
> 
> If we are to switch to uffd-wp, every write by the snapshotted/migrated
> task will incur latency of uffd-wp processing by the monitor.

That's a valid concern indeed.

I didn't go into the details of what is needed in addition to what is
already present in Peter's current patchset, but you're correct that in
order to perform the softdirty equivalent well, we'll also need to add
an async event model.

The async event model would be selected during UFFD registration. It'd
work like async signals: uffd events are simply queued up in the kernel,
allocated from a slab object (not on the kernel stack of the faulting
process). Only if the monitor doesn't read() them fast enough would the
write-protect fault eventually block and release the mmap_sem, but even
in that case the page fault would always be resolved by the kernel. For
the monitor there'd be just a stream of uffd_msg structures to read, in
multiples of the uffd_msg structure size, with a single syscall per
wakeup of the monitor. Conceptually it'd work the same way PML works
for EPT.

The main downside will be an allocation per fault (soft dirty doesn't
need such an allocation), but no userland round-trip latency will be
added to the wrprotect fault that needs to be logged.
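
To make the model concrete, below is a sketch of what the monitor side
could look like, reusing today's uffd_msg read semantics; the async
registration itself, the WP flag test, and log_dirty() are assumptions
about the not-yet-written interface:

#include <linux/userfaultfd.h>
#include <unistd.h>

extern void log_dirty(unsigned long addr);  /* hypothetical dirty logger */

/* Drain a batch of queued write-protect events with a single read(). */
static void drain_wp_events(int uffd)
{
	struct uffd_msg msgs[64];
	ssize_t n = read(uffd, msgs, sizeof(msgs));

	if (n <= 0)
		return;
	for (size_t i = 0; i < (size_t)n / sizeof(msgs[0]); i++) {
		if (msgs[i].event == UFFD_EVENT_PAGEFAULT &&
		    (msgs[i].arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
			log_dirty(msgs[i].arg.pagefault.address);
	}
}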

We need the synchronous/blocking uffd-wp for other things that aren't
related to soft dirty and that can't be achieved with an async model
like softdirty's. Adding an async model later would be a self-contained
feature inside uffd.

So the idea would be to ignore any comparison with softdirty until
uffd-wp is finalized, and then evaluate the possibility of adding an
async model, which would be a simple thing to add in comparison to the
uffd-wp feature itself.

The theoretical expectation would be that softdirty would perform
better for small processes (but for those the overall logging overhead
is small anyway), while in the hundred-gigabyte/terabyte range async
uffd-wp should perform much better.
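
For reference, the soft-dirty cycle being compared against boils down to
the following in user space, using only the documented interfaces
(writing "4" to /proc/PID/clear_refs resets the bits; bit 55 of each
64-bit /proc/PID/pagemap entry reports a write since the last reset);
freezing the tasks and the actual dump are elided:

#include <stdint.h>
#include <stdio.h>

#define PM_SOFT_DIRTY (1ULL << 55)  /* bit 55 of a pagemap entry */

/* One pre-dump pass over [start, end): report pages written since the
 * last "echo 4 > /proc/PID/clear_refs". */
static void scan_soft_dirty(FILE *pagemap, unsigned long start,
			    unsigned long end, unsigned long page_size)
{
	uint64_t entry;

	fseek(pagemap, start / page_size * sizeof(entry), SEEK_SET);
	for (unsigned long va = start; va < end; va += page_size) {
		if (fread(&entry, sizeof(entry), 1, pagemap) != 1)
			break;
		if (entry & PM_SOFT_DIRTY)
			printf("dirty: %#lx\n", va);
	}
}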

Thanks,
Andrea


Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)

2019-01-30 Thread Peter Xu
On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> Hi,
> 
> (changed the subject and added CRIU folks)
> 
> On Tue, Jan 29, 2019 at 06:40:58PM -0500, Andrea Arcangeli wrote:
> > Hello,
> > 
> > --
> > 
> > In addition to the above "NUMA remote THP vs NUMA local non-THP
> > tradeoff" topic, there are other developments in "userfaultfd" land that
> > are approaching merge readiness and that it would be possible to
> > provide a short overview about:
> > 
> > - Peter Xu made significant progress in finalizing the userfaultfd-WP
> >   support over the last few months. That feature was planned from the
> >   start and it will allow userland to do some new things that weren't
> >   possible to achieve before. In addition to synchronously blocking
> >   write faults to be resolved by a userland manager, it also has the
> >   ability to obsolete the softdirty feature, because it can provide
> >   the same information, but with O(1) complexity (as opposed to the
> >   current softdirty O(N) complexity), similarly to what Page
> >   Modification Logging (PML) does in hardware for EPT write accesses.
>  
> We (CRIU) have some concerns about obsoleting soft-dirty in favor of
> uffd-wp. If there are other soft-dirty users, these concerns would be
> relevant to them as well.
> 
> With soft-dirty we collect the information about the changed memory every
> pre-dump iteration in the following manner:
> * freeze the tasks
> * find entries in /proc/pid/pagemap with SOFT_DIRTY set
> * unfreeze the tasks
> * dump the modified pages to disk/remote host
> 
> While we do need to traverse the /proc/pid/pagemap to identify dirty pages,
> in between the pre-dump iterations and during the actual memory dump the
> tasks are running freely.
> 
> If we are to switch to uffd-wp, every write by the snapshotted/migrated
> task will incur latency of uffd-wp processing by the monitor.
> 
> We'd need to see how this affects overall slowdown of the workload under
> migration before moving forward with obsoleting soft-dirty.
> 
> > - Blake Caldwell maintained the UFFDIO_REMAP support to atomically
> >   remove memory from a mapping with userfaultfd (which can't be done
> >   with a copy as in UFFDIO_COPY, and which requires a slow TLB flush
> >   to be safe) as an alternative to host swapping (which of course also
> >   requires a TLB flush for similar reasons). Notably, UFFDIO_REMAP was
> >   rightfully NAKed early on and quickly replaced by UFFDIO_COPY, which
> >   is more optimal for adding memory to a mapping in small chunks, but
> >   we can't remove memory with UFFDIO_COPY, and UFFDIO_REMAP should be
> >   as efficient as it gets when it comes to removing memory from a
> >   mapping.
> 
> If we are to discuss userfaultfd, I'd also like to bring up the subject
> of COW mappings.
> The pages populated with UFFDIO_COPY cannot be COW-shared between related
> processes, which unnecessarily increases the memory footprint of a
> migrated process tree.
> I've posted a patch [1] a (real) while ago, but nobody reacted and I've
> put this aside.
> Maybe it's time to discuss it again :)

Hi, Mike,

It's interesting to learn about such work...

Since I don't have much context on this, sorry if I'm asking a silly
question... but when reading this I'm thinking of KSM.  I think KSM
does not suit this case, since UFFDIO_COPY_COW carries explicit hinting
information, while KSM only scans over the pages of the processes,
which seems to be O(N*N) assuming there are two processes.  However,
would it make any sense to provide a general interface to scan for
identical pages between any two processes within a specific range and
merge them if found (rather than an interface specific to userfaultfd
only)?  Then it might even be used by KSM admins (just as an example)
when the admin knows that the memory range (addr1, len) of process A
very probably has the same contents as the memory range (addr2, len)
of process B?
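
For context, the closest existing mechanism is KSM's per-process opt-in:
a task marks one of its own ranges as a merge candidate for the ksmd
scanner (enabled via /sys/kernel/mm/ksm/run), with no way to name a peer
(addr2, len) range in another process as suggested above:

#include <sys/mman.h>

/* Ask ksmd to consider [addr, addr + len) for merging; each process
 * can only advise its own ranges. */
int mark_mergeable(void *addr, size_t len)
{
	return madvise(addr, len, MADV_MERGEABLE);
}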

Thanks,

-- 
Peter Xu


Re: [LSF/MM TOPIC] userfaultfd

2015-01-15 Thread Austin S Hemmelgarn

On 2015-01-14 18:01, Andrea Arcangeli wrote:

> 7) distributed shared memory that could allow simultaneous mapping of
>    regions marked readonly and collapse them on the first exclusive
>    write. I'm mentioning it as a corollary, because I'm not aware of
>    anybody who is planning to use it that way (still I'd like that
>    this will be possible too just in case it finds its way later on).

While I haven't actually written any code for it yet, I've been thinking
about the possibility of using this to allow qemu to do distributed
emulation of a NUMA system (i.e., you could run qemu on a Beowulf cluster
and make it look to the guest OS like it's running on a big NUMA system,
essentially SSI clustering for people who don't have a multi-million
dollar budget).  Having userfaultfd to work with would make this
exponentially easier to implement.






Re: [LSF/MM TOPIC] userfaultfd

2015-01-15 Thread Pavel Emelyanov
On 01/15/2015 02:01 AM, Andrea Arcangeli wrote:
> Hello,
> 
> I would like to attend this year (2015) LSF/MM summit. I'm
> particularly interested in the MM track, in order to get help in
> finalizing the userfaultfd feature I've been working on lately.

I'd like to +1 this. I'm also interested in this topic, especially
in item 5 below.

> 5) postcopy live migration of binaries inside linux containers
>(provided there is a userfaultfd command [not an external syscall
>like the original implementation] that allows to copy memory
>atomically in the userfaultfd "mm" and not in the manager "mm",
>hence the main reason the external syscalls are going away, and in
>turn MADV_USERFAULT fd-less is going away as well).

We've started to play with userfaultfd in the CRIU project [1] to do
the post-copy live migration of whole containers (and their parts).

One more use case I've seen on the CRIU mailing list is the restore of
a container from on-disk images w/o getting the whole memory in at
restore time. The memory is to be put into the tasks' address space in
an on-demand manner later. It's claimed that such a restore decreases
the restore time significantly.


One more thing that userfaultfd can help with is restoring COW areas.
Right now, if we have two tasks that share a phys page but have it
mapped RO to do the COW later, we do complex tricks: restoring the page
in a common ancestor, then inheriting it on fork()-s and mremap-ing it.
Probably it's an API misuse, but it seems it would be much simpler if
the page could be just "sent" to the remote mm via userfaultfd.
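
To illustrate, here is a toy version of the ancestor trick (the target
addresses are made up): the page content is restored once before fork(),
so both tasks COW-share it, and each then mremap()s it into place:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	void *tmp = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(tmp, 0x5a, psz);        /* restore the page content once */

	if (fork() == 0)               /* the child inherits the page COW-shared */
		mremap(tmp, psz, psz, MREMAP_MAYMOVE | MREMAP_FIXED,
		       (void *)0x70000a000000);  /* child's target (made up) */
	else
		mremap(tmp, psz, psz, MREMAP_MAYMOVE | MREMAP_FIXED,
		       (void *)0x70000b000000);  /* parent's target (made up) */
	return 0;
}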

[1] http://criu.org/Main_Page

Thanks,
Pavel
