On Tue, Jul 11, 2017 at 12:22:32PM +0800, Peter Xu wrote:
> On Wed, Jun 28, 2017 at 08:00:40PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilb...@redhat.com>
> >
> > Cause the vhost-user client to be woken up whenever:
> >   a) We place a page in postcopy mode
>
> Just to make sure I understand it correctly - UFFDIO_COPY will only
> wake up the waiters on the same userfaultfd context, so we don't need
> to wake up QEMU userfaultfd (vcpu threads), but we need to explicitly
> wake up other ufds/threads, like vhost-user backends. Am I right?

Yes. Every "uffd" represents one and only one "mm" (i.e. a process).
So there is no way a single UFFDIO_COPY can wake the faults happening
on a process different from the "mm" the uffd is associated with.

vhost-bridge, being a different process, requires a UFFDIO_WAKE on its
own uffd (the one it passed to qemu), in addition to the UFFDIO_COPY
that, like you said, implicitly wakes the userfaults happening on the
qemu process (vcpus, iothread, dataplane, etc.).

On a side note, there's a way not to wake userfaults implicitly in
UFFDIO_COPY in case you want to wake userfaults in batches, but nobody
uses that for now (uffdio_copy.mode |= UFFDIO_COPY_MODE_DONTWAKE).

It'd be theoretically nice to optimize away the additional enter/exit
kernel introduced by the UFFDIO_WAKE, and the translation table as
well.

What we could do is add a UFFDIO_BIND that takes an "fd" as parameter
to the ioctl, to bind the two uffds together. Then we could push
logical offsets in addition to the virtual address ranges when calling
UFFDIO_REGISTER_LOGICAL (the logical offsets would then match the
guest physical addresses), so that UFFDIO_COPY_LOGICAL would be able
to get a logical range to wake up, which the kernel would translate
into virtual addresses for all uffds bound together. Pushing offsets
into UFFDIO_REGISTER was David's idea.

That would eliminate the enter/exit kernel for the explicit
UFFDIO_WAKE, and calling a single UFFDIO_COPY would be enough.

Alternatively we could make the uffd work based on file offsets
instead of virtual addresses, but that would involve changes to
filesystems and it would only move the needle on top of tmpfs
(shared=on/off makes no difference) and hugetlbfs. It would be enough
for vhost-bridge.

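To make the current scheme concrete (one UFFDIO_COPY on the qemu uffd,
then an explicit UFFDIO_WAKE on the client's uffd at the translated
address), here is a rough sketch in C. It is only an illustration, not
the code from the patch series: struct region_map, translate_to_client()
and place_page_and_wake_client() are hypothetical names, and the
translation table is assumed to have been exchanged by userland
beforehand. Only the ioctls and structs from <linux/userfaultfd.h> are
real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical translation-table entry: one shared memory region as it
 * is mapped in qemu and in the vhost-user client process. */
struct region_map {
    uint64_t qemu_base;    /* start of the mapping in qemu */
    uint64_t client_base;  /* start of the same memory in the client */
    uint64_t len;          /* length of the region in bytes */
};

/* Translate a qemu virtual address into the client's address space.
 * Returns 0 on success, -1 if no shared region contains the address. */
int translate_to_client(const struct region_map *tab, size_t n,
                        uint64_t qemu_addr, uint64_t *client_addr)
{
    for (size_t i = 0; i < n; i++) {
        if (qemu_addr >= tab[i].qemu_base &&
            qemu_addr - tab[i].qemu_base < tab[i].len) {
            *client_addr = tab[i].client_base +
                           (qemu_addr - tab[i].qemu_base);
            return 0;
        }
    }
    return -1;
}

/* Place one page through qemu's uffd, then wake the client's uffd.
 * Returns 0 on success, -1 if either ioctl fails (errno is set). */
int place_page_and_wake_client(int qemu_uffd, int client_uffd,
                               const struct region_map *tab, size_t n,
                               uint64_t qemu_dst, uint64_t src,
                               uint64_t pagesize)
{
    /* Page sizes are powers of two; catch obviously bad arguments. */
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);

    /* mode == 0: the copy implicitly wakes faults on qemu's own mm. */
    struct uffdio_copy copy = {
        .dst = qemu_dst, .src = src, .len = pagesize, .mode = 0,
    };
    if (ioctl(qemu_uffd, UFFDIO_COPY, &copy) < 0)
        return -1;

    uint64_t client_addr;
    if (translate_to_client(tab, n, qemu_dst, &client_addr) < 0)
        return 0;  /* page is not shared with the client: nothing to wake */

    /* The client is a different mm, so its waiters need an explicit
     * wake on the uffd it handed over. This is the extra syscall. */
    struct uffdio_range range = { .start = client_addr, .len = pagesize };
    return ioctl(client_uffd, UFFDIO_WAKE, &range);
}
```

The second ioctl() is the additional enter/exit kernel discussed above,
and translate_to_client() stands in for the translation table that a
bound pair of uffds would make unnecessary.
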
Usually the uffd fault lives at the higher level of the virtual memory
subsystem and never deals with file offsets, so if we can get away
with logical ranges per-uffd for UFFDIO_REGISTER and UFFDIO_COPY, it
may be simpler and easier to extend automatically to all memory types
supported by uffd (including anon, which has no file offset).

No major improvement is to be expected from such an enhancement
though, so it's not very high priority to implement. It's not even
clear if the complexity is worth it. Doing one more syscall per page,
I think, might be measurable only on a very fast network.

The current way of operation, where the uffds are independent of each
other and the translation table is transferred by userland means, is
quite optimal already and much simpler. Furthermore, for hugetlbfs the
performance difference most certainly wouldn't be measurable, as the
enter/exit kernel would be diluted by a factor of 512 (2M hugepages)
compared to 4k userfaults.

Thanks,
Andrea