Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
On 20/06/2025 16:21, Peter Xu wrote:
Hi, Nikita,
On Fri, Jun 20, 2025 at 01:00:24PM +0100, Nikita Kalyazin wrote:
Thanks for explaining that. I played a bit with it myself and it appears to
be working for the MISSING mode for both shmem and guest_memfd. Attaching
my sketch below. Please let me know if this is how you see it.
I found that arguments and return values are significantly different between
the two request types, which may be a bit confusing, although we do not
expect many callers of those.
Indeed. Actually since I didn't yet get your reply, early this week I gave
it a shot, and then I found the same thing that it'll be nice to keep the
type checks all over the place. It'll also be awkward if we want to add
MISSING into the picture with the current req() interface (btw, IIUC you
meant MINOR above, not MISSING).
I did indeed, sorry for the typo.
Please have a look at what I came up with. I haven't yet had a chance to
post it, but it did compile all fine and pass the smoke tests here. Feel
free to take it over if you think that makes sense to you, or I can also
post it officially after some more tests.
This looks much cleaner. Please go ahead and post it. I will make use
of the interface in guest_memfd and run some tests as well meanwhile.
So, ultimately I introduced a vm_uffd_ops to keep all the type checks. I
don't think I like the uffd_copy() interfacing too much, but it should
still be the minimum changeset I can think of to generalize shmem as a
userfault user / module, and finally drop "linux/shmem_fs.h" inclusion in
the last patch.
It's also unfortunate that hugetlb won't be able to use the API yet, for
the same reason that hugetlb's fault() is a BUG() and its fault handling is
hard-coded in handle_mm_fault(). However it'll at least start to use the
rest of the API just fine, so as to generalize some hugetlb checks.
The shmem definition looks like this:
static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features	= __VM_UFFD_FLAGS,
	.uffd_ioctls	= BIT(_UFFDIO_COPY) |
			  BIT(_UFFDIO_ZEROPAGE) |
			  BIT(_UFFDIO_WRITEPROTECT) |
			  BIT(_UFFDIO_CONTINUE) |
			  BIT(_UFFDIO_POISON),
	.uffd_get_folio	= shmem_uffd_get_folio,
	.uffd_copy	= shmem_mfill_atomic_pte,
};
Then guest-memfd can set (1) VM_UFFD_MINOR, (2) _UFFDIO_CONTINUE and
provide uffd_get_folio() for supporting MINOR.
Let me know what you think.
Thanks,
===8<===
From ca500177de122d32194f8bf4589faceeaaae2c0c Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Thu, 12 Jun 2025 11:51:58 -0400
Subject: [PATCH 1/4] mm: Introduce vm_uffd_ops API
Introduce a generic userfaultfd API for vm_operations_struct, so that a
VMA type, especially one implemented in a module, can support userfaults
without modifying the core files. This matters most when the module can be
compiled out of the kernel.
So, instead of having core mm referencing modules that may not ever exist,
we need to have modules opt-in on core mm hooks instead.
After this API is applied, if a module wants to support userfaultfd, the
module should only need to touch its own file and properly define
vm_uffd_ops, instead of changing anything in core mm.
Note that this API will not work for anonymous memory. Core mm will process
anonymous memory separately for userfault operations like before.
This patch only introduces the API alone so that we can start to move
existing users over but without breaking them.
Signed-off-by: Peter Xu
---
 include/linux/mm.h            | 71 +++
 include/linux/userfaultfd_k.h | 12 --
 2 files changed, 71 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307..8dfd83f01d3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -576,6 +576,70 @@ struct vm_fault {
*/
};
+#ifdef CONFIG_USERFAULTFD
+/* A combined operation mode + behavior flags. */
+typedef unsigned int __bitwise uffd_flags_t;
+
+enum mfill_atomic_mode {
+	MFILL_ATOMIC_COPY,
+	MFILL_ATOMIC_ZEROPAGE,
+	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
+	NR_MFILL_ATOMIC_MODES,
+};
+
+/* VMA userfaultfd operations */
+typedef struct {
+	/**
+	 * @uffd_features: features supported in bitmask.
+	 *
+	 * When the ops is defined, the driver must set non-zero features
+	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
+	 */
+	unsigned long uffd_features;
+	/**
+	 * @uffd_ioctls: ioctls supported in bitmask.
+	 *
+	 * Userfaultfd ioctls supported by the module. Below will always
+	 * be supported by default whenever a module provides vm_uffd_ops:
+	 *
+	 * _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
+	 *
+	 * The module needs to provide all th
Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
Hi, Nikita,
On Fri, Jun 20, 2025 at 01:00:24PM +0100, Nikita Kalyazin wrote:
> Thanks for explaining that. I played a bit with it myself and it appears to
> be working for the MISSING mode for both shmem and guest_memfd. Attaching
> my sketch below. Please let me know if this is how you see it.
>
> I found that arguments and return values are significantly different between
> the two request types, which may be a bit confusing, although we do not
> expect many callers of those.
Indeed. Actually since I didn't yet get your reply, early this week I gave
it a shot, and then I found the same thing that it'll be nice to keep the
type checks all over the place. It'll also be awkward if we want to add
MISSING into the picture with the current req() interface (btw, IIUC you
meant MINOR above, not MISSING).
Please have a look at what I came up with. I haven't yet had a chance to
post it, but it did compile all fine and pass the smoke tests here. Feel
free to take it over if you think that makes sense to you, or I can also
post it officially after some more tests.
So, ultimately I introduced a vm_uffd_ops to keep all the type checks. I
don't think I like the uffd_copy() interfacing too much, but it should
still be the minimum changeset I can think of to generalize shmem as a
userfault user / module, and finally drop "linux/shmem_fs.h" inclusion in
the last patch.
It's also unfortunate that hugetlb won't be able to use the API yet, for
the same reason that hugetlb's fault() is a BUG() and its fault handling is
hard-coded in handle_mm_fault(). However it'll at least start to use the
rest of the API just fine, so as to generalize some hugetlb checks.
The shmem definition looks like this:
static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features	= __VM_UFFD_FLAGS,
	.uffd_ioctls	= BIT(_UFFDIO_COPY) |
			  BIT(_UFFDIO_ZEROPAGE) |
			  BIT(_UFFDIO_WRITEPROTECT) |
			  BIT(_UFFDIO_CONTINUE) |
			  BIT(_UFFDIO_POISON),
	.uffd_get_folio	= shmem_uffd_get_folio,
	.uffd_copy	= shmem_mfill_atomic_pte,
};
Then guest-memfd can set (1) VM_UFFD_MINOR, (2) _UFFDIO_CONTINUE and
provide uffd_get_folio() for supporting MINOR.
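
For illustration only, a MINOR-only opt-in of the kind described above might be sketched in plain userspace C as follows. The bit values, the reduced vm_uffd_ops layout, the uffd_get_folio() signature and every gmem_* / uffd_*_supported name are assumptions made for this sketch; none of them are taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for kernel definitions; values are placeholders. */
#define BIT(n)           (1UL << (n))
#define VM_UFFD_MISSING  BIT(0)
#define VM_UFFD_WP       BIT(1)
#define VM_UFFD_MINOR    BIT(2)

enum { _UFFDIO_COPY = 3, _UFFDIO_CONTINUE = 7 };

struct inode;   /* opaque in this sketch */
struct folio;   /* opaque in this sketch */

/* Reduced vm_uffd_ops: only the fields named in the mail above. */
typedef struct {
	unsigned long uffd_features;
	unsigned long uffd_ioctls;
	/* Assumed signature: look up the folio cached at @pgoff, if any. */
	int (*uffd_get_folio)(struct inode *inode, unsigned long pgoff,
			      struct folio **foliop);
} vm_uffd_ops;

/* Hypothetical guest_memfd hook; this stub finds nothing in its cache. */
static int gmem_uffd_get_folio(struct inode *inode, unsigned long pgoff,
			       struct folio **foliop)
{
	(void)inode;
	(void)pgoff;
	*foliop = NULL;
	return -ENOENT;
}

/* MINOR-only opt-in: one feature bit, one ioctl bit, one lookup hook. */
static const vm_uffd_ops gmem_uffd_ops = {
	.uffd_features	= VM_UFFD_MINOR,
	.uffd_ioctls	= BIT(_UFFDIO_CONTINUE),
	.uffd_get_folio	= gmem_uffd_get_folio,
};

/* Core-side check: every requested mode must be advertised by the ops. */
static bool uffd_modes_supported(const vm_uffd_ops *ops, unsigned long modes)
{
	return ops && modes && (modes & ~ops->uffd_features) == 0;
}

/* Core-side check: is ioctl number @nr advertised by the ops? */
static bool uffd_ioctl_supported(const vm_uffd_ops *ops, unsigned int nr)
{
	return ops && (ops->uffd_ioctls & BIT(nr)) != 0;
}
```

With such a table, a registration request for MISSING or WP on this VMA type would be rejected by the mode check, while MINOR plus UFFDIO_CONTINUE would pass, without core code naming guest_memfd anywhere.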
Let me know what you think.
Thanks,
===8<===
From ca500177de122d32194f8bf4589faceeaaae2c0c Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Thu, 12 Jun 2025 11:51:58 -0400
Subject: [PATCH 1/4] mm: Introduce vm_uffd_ops API
Introduce a generic userfaultfd API for vm_operations_struct, so that a
VMA type, especially one implemented in a module, can support userfaults
without modifying the core files. This matters most when the module can be
compiled out of the kernel.
So, instead of having core mm referencing modules that may not ever exist,
we need to have modules opt-in on core mm hooks instead.
After this API is applied, if a module wants to support userfaultfd, the
module should only need to touch its own file and properly define
vm_uffd_ops, instead of changing anything in core mm.
Note that this API will not work for anonymous memory. Core mm will process
anonymous memory separately for userfault operations like before.
This patch only introduces the API alone so that we can start to move
existing users over but without breaking them.
Signed-off-by: Peter Xu
---
 include/linux/mm.h            | 71 +++
 include/linux/userfaultfd_k.h | 12 --
 2 files changed, 71 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307..8dfd83f01d3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -576,6 +576,70 @@ struct vm_fault {
*/
};
+#ifdef CONFIG_USERFAULTFD
+/* A combined operation mode + behavior flags. */
+typedef unsigned int __bitwise uffd_flags_t;
+
+enum mfill_atomic_mode {
+	MFILL_ATOMIC_COPY,
+	MFILL_ATOMIC_ZEROPAGE,
+	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
+	NR_MFILL_ATOMIC_MODES,
+};
+
+/* VMA userfaultfd operations */
+typedef struct {
+	/**
+	 * @uffd_features: features supported in bitmask.
+	 *
+	 * When the ops is defined, the driver must set non-zero features
+	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
+	 */
+	unsigned long uffd_features;
+	/**
+	 * @uffd_ioctls: ioctls supported in bitmask.
+	 *
+	 * Userfaultfd ioctls supported by the module. Below will always
+	 * be supported by default whenever a module provides vm_uffd_ops:
+	 *
+	 * _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
+	 *
+	 * The module needs to provide all the rest optionally supported
+	 * ioctls. For example, when VM_UFFD_MISSING was supported,
+	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
+	 * is optional.
+	 */
+	unsigned lon
Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
On 11/06/2025 13:56, Peter Xu wrote:
On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
On 10/06/2025 23:22, Peter Xu wrote:
On Fri, Apr 04, 2025 at 03:43:47PM +, Nikita Kalyazin wrote:
Remove shmem-specific code from UFFDIO_CONTINUE implementation for
non-huge pages by calling vm_ops->fault(). A new VMF flag,
FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
handle_userfault().
It's not clear yet on why this is needed to be generalized out of the blue.
Some mentioning of guest_memfd use case might help for other reviewers, or
some mention of the need to introduce userfaultfd support in kernel
modules.
Hi Peter,
Sounds fair, thank you.
Suggested-by: James Houghton
Signed-off-by: Nikita Kalyazin
---
include/linux/mm_types.h | 4
mm/hugetlb.c | 2 +-
mm/shmem.c | 9 ++---
mm/userfaultfd.c | 37 +++--
4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                        We should only access orig_pte if this flag set.
  * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ *                                 minor handler as it is being called by the
+ *                                 userfaultfd code itself.
We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
as it's not easy to follow when without userfault minor context. It might
be better to use generic terms like NO_USERFAULT.
Yes, I agree, can name it more generically.
That said, if we'll need to add a vm_ops hook anyway in a later patch, I
wonder whether we can also avoid reusing fault() and instead resolve the
page faults using the vm_ops hook too. That might be helpful because then
we can avoid this new FAULT_FLAG_*, which is totally not useful to
non-userfault users, and we also don't need to hand-cook the vm_fault
struct below just to suit the current fault() interface.
I'm not sure I fully understand that. Calling fault() op helps us reuse the
FS specifics when resolving the fault. I get that the new op can imply the
userfault flag so the flag doesn't need to be exposed to mm, but doing so
will bring duplication of the logic within FSes between this new op and the
fault(), unless we attempt to factor common parts out. For example, for
shmem_get_folio_gfp(), we would still need to find a way to suppress the
call to handle_userfault() when shmem_get_folio_gfp() is called from the new
op. Is that what you're proposing?
Yes, that is what I was proposing. shmem_get_folio_gfp() already has that
handling: when vmf==NULL, vma is also NULL and userfault is skipped.
So what I was thinking is one vm_ops.userfaultfd_request(req), where req
can be:
(1) UFFD_REQ_GET_SUPPORTED: for existing RAM-FSes this should return
    all of MISSING/WP/MINOR. Here WP should mean sync-wp tracking; async
    has so far been supported by default almost everywhere except
    VM_DROPPABLE. For guest-memfd in the future, we can return MINOR only
    as of now (even if I think it shouldn't be hard to support the other
    two..).
(2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but be well
    defined to suit userfault's needs on fault resolution. It likely
    doesn't need vmf as the parameter, but likely (when anon isn't taken
    into account; after all, anon has vm_ops==NULL..) the inode and
    offsets. Perhaps some flag would be needed to identify MISSING or
    MINOR faults, for example.
Maybe some more.
I was even thinking whether we could merge hugetlb into the picture too by
generalizing its fault resolution. Hugetlb was always special; maybe this
is a chance to make it generalized too, but it doesn't need to happen in
one shot even if it could work. We could start with shmem.
So this does sound slightly involved, and I'm not yet 100% sure this
will work, but it likely will. If you want, I can take a stab at this this
week or next just to see whether it'll work in general. I also don't expect
this to depend on guest-memfd at all - it can stand alone as a refactoring
making userfault module-ready.
Thanks for explaining that. I played a bit with it myself and it
appears to be working for the MISSING mode for both shmem and
guest_memfd. Attaching my sketch below. Please let me know if this is
how you see it.
I found that arguments and return values are significantly different
between the two request types, which may be a bit confusing, although we
do not expect many callers of those.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09
Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
>
>
> On 10/06/2025 23:22, Peter Xu wrote:
> > On Fri, Apr 04, 2025 at 03:43:47PM +, Nikita Kalyazin wrote:
> > > Remove shmem-specific code from UFFDIO_CONTINUE implementation for
> > > non-huge pages by calling vm_ops->fault(). A new VMF flag,
> > > FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
> > > handle_userfault().
> >
> > It's not clear yet on why this is needed to be generalized out of the blue.
> >
> > Some mentioning of guest_memfd use case might help for other reviewers, or
> > some mention of the need to introduce userfaultfd support in kernel
> > modules.
>
> Hi Peter,
>
> Sounds fair, thank you.
>
> > >
> > > Suggested-by: James Houghton
> > > Signed-off-by: Nikita Kalyazin
> > > ---
> > > include/linux/mm_types.h | 4
> > > mm/hugetlb.c | 2 +-
> > > mm/shmem.c | 9 ++---
> > > mm/userfaultfd.c | 37 +++--
> > > 4 files changed, 38 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 0234f14f2aa6..2f26ee9742bf 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
> > >   * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
> > >   *                        We should only access orig_pte if this flag set.
> > >   * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
> > > + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
> > > + *                                 minor handler as it is being called by the
> > > + *                                 userfaultfd code itself.
> >
> > We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
> > as it's not easy to follow when without userfault minor context. It might
> > be better to use generic terms like NO_USERFAULT.
>
> Yes, I agree, can name it more generically.
>
> > That said, if we'll need to add a vm_ops hook anyway in a later patch, I
> > wonder whether we can also avoid reusing fault() and instead resolve the
> > page faults using the vm_ops hook too. That might be helpful because then
> > we can avoid this new FAULT_FLAG_*, which is totally not useful to
> > non-userfault users, and we also don't need to hand-cook the vm_fault
> > struct below just to suit the current fault() interface.
>
> I'm not sure I fully understand that. Calling fault() op helps us reuse the
> FS specifics when resolving the fault. I get that the new op can imply the
> userfault flag so the flag doesn't need to be exposed to mm, but doing so
> will bring duplication of the logic within FSes between this new op and the
> fault(), unless we attempt to factor common parts out. For example, for
> shmem_get_folio_gfp(), we would still need to find a way to suppress the
> call to handle_userfault() when shmem_get_folio_gfp() is called from the new
> op. Is that what you're proposing?
Yes, that is what I was proposing. shmem_get_folio_gfp() already has that
handling: when vmf==NULL, vma is also NULL and userfault is skipped.
So what I was thinking is one vm_ops.userfaultfd_request(req), where req
can be:
(1) UFFD_REQ_GET_SUPPORTED: for existing RAM-FSes this should return
    all of MISSING/WP/MINOR. Here WP should mean sync-wp tracking; async
    has so far been supported by default almost everywhere except
    VM_DROPPABLE. For guest-memfd in the future, we can return MINOR only
    as of now (even if I think it shouldn't be hard to support the other
    two..).
(2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but be well
    defined to suit userfault's needs on fault resolution. It likely
    doesn't need vmf as the parameter, but likely (when anon isn't taken
    into account; after all, anon has vm_ops==NULL..) the inode and
    offsets. Perhaps some flag would be needed to identify MISSING or
    MINOR faults, for example.
Maybe some more.
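
To make the shape of such a single-hook interface concrete, here is a rough userspace sketch, not kernel code. The struct uffd_req layout, the UFFD_FEAT_* bits, the toy page cache and all toy_* names are invented for illustration; only the two request names come from the list above.

```c
#include <errno.h>
#include <stddef.h>

/* Placeholder feature bits for the sketch, not the kernel's VM_UFFD_*. */
#define UFFD_FEAT_MISSING (1UL << 0)
#define UFFD_FEAT_WP      (1UL << 1)
#define UFFD_FEAT_MINOR   (1UL << 2)

enum uffd_req_type { UFFD_REQ_GET_SUPPORTED, UFFD_REQ_FAULT_RESOLVE };

/* One request struct shared by both types; each type uses different fields. */
struct uffd_req {
	enum uffd_req_type type;
	unsigned long supported;  /* out: GET_SUPPORTED */
	unsigned long pgoff;      /* in:  FAULT_RESOLVE */
	int minor;                /* in:  FAULT_RESOLVE, nonzero means MINOR */
	void *folio;              /* out: FAULT_RESOLVE */
};

/* Toy backing store: one byte stands in for each page-cache folio. */
#define TOY_NR_PAGES 4
static char toy_cache[TOY_NR_PAGES];
static int toy_present[TOY_NR_PAGES];

/* Hypothetical per-VMA-type hook dispatching on the request type. */
static int toy_userfaultfd_request(struct uffd_req *req)
{
	switch (req->type) {
	case UFFD_REQ_GET_SUPPORTED:
		req->supported = UFFD_FEAT_MISSING | UFFD_FEAT_MINOR;
		return 0;
	case UFFD_REQ_FAULT_RESOLVE:
		if (req->pgoff >= TOY_NR_PAGES)
			return -EINVAL;
		/* MINOR resolves against a page that is already cached. */
		if (req->minor && !toy_present[req->pgoff])
			return -ENOENT;
		/* MISSING populates a page that is not cached yet. */
		if (!req->minor && toy_present[req->pgoff])
			return -EEXIST;
		toy_present[req->pgoff] = 1;
		req->folio = &toy_cache[req->pgoff];
		return 0;
	}
	return -EINVAL;
}
```

Even in this toy form the two requests share little beyond the entry point: one fills in a bitmask, the other consumes an offset and a mode and hands back a folio.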
I was even thinking whether we could merge hugetlb into the picture too by
generalizing its fault resolution. Hugetlb was always special; maybe this
is a chance to make it generalized too, but it doesn't need to happen in
one shot even if it could work. We could start with shmem.
So this does sound slightly involved, and I'm not yet 100% sure this
will work, but it likely will. If you want, I can take a stab at this this
week or next just to see whether it'll work in general. I also don't expect
this to depend on guest-memfd at all - it can stand alone as a refactoring
making userfault module-ready.
Thanks,
--
Peter Xu
Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
On 10/06/2025 23:22, Peter Xu wrote:
On Fri, Apr 04, 2025 at 03:43:47PM +, Nikita Kalyazin wrote:
Remove shmem-specific code from UFFDIO_CONTINUE implementation for
non-huge pages by calling vm_ops->fault(). A new VMF flag,
FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
handle_userfault().
It's not clear yet on why this is needed to be generalized out of the blue.
Some mentioning of guest_memfd use case might help for other reviewers, or
some mention of the need to introduce userfaultfd support in kernel
modules.
Hi Peter,
Sounds fair, thank you.
Suggested-by: James Houghton
Signed-off-by: Nikita Kalyazin
---
include/linux/mm_types.h | 4
mm/hugetlb.c | 2 +-
mm/shmem.c | 9 ++---
mm/userfaultfd.c | 37 +++--
4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                        We should only access orig_pte if this flag set.
  * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ *                                 minor handler as it is being called by the
+ *                                 userfaultfd code itself.
We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
as it's not easy to follow when without userfault minor context. It might
be better to use generic terms like NO_USERFAULT.
Yes, I agree, can name it more generically.
That said, if we'll need to add a vm_ops hook anyway in a later patch, I
wonder whether we can also avoid reusing fault() and instead resolve the
page faults using the vm_ops hook too. That might be helpful because then
we can avoid this new FAULT_FLAG_*, which is totally not useful to
non-userfault users, and we also don't need to hand-cook the vm_fault
struct below just to suit the current fault() interface.
I'm not sure I fully understand that. Calling fault() op helps us reuse
the FS specifics when resolving the fault. I get that the new op can
imply the userfault flag so the flag doesn't need to be exposed to mm,
but doing so will bring duplication of the logic within FSes between
this new op and the fault(), unless we attempt to factor common parts
out. For example, for shmem_get_folio_gfp(), we would still need to
find a way to suppress the call to handle_userfault() when
shmem_get_folio_gfp() is called from the new op. Is that what you're
proposing?
Thanks,
--
Peter Xu
Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
On Fri, Apr 04, 2025 at 03:43:47PM +, Nikita Kalyazin wrote:
> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
> non-huge pages by calling vm_ops->fault(). A new VMF flag,
> FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
> handle_userfault().
It's not clear yet on why this is needed to be generalized out of the blue.
Some mentioning of guest_memfd use case might help for other reviewers, or
some mention of the need to introduce userfaultfd support in kernel
modules.
>
> Suggested-by: James Houghton
> Signed-off-by: Nikita Kalyazin
> ---
> include/linux/mm_types.h | 4
> mm/hugetlb.c | 2 +-
> mm/shmem.c | 9 ++---
> mm/userfaultfd.c | 37 +++--
> 4 files changed, 38 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0234f14f2aa6..2f26ee9742bf 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
>   * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>   *                        We should only access orig_pte if this flag set.
>   * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
> + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
> + *                                 minor handler as it is being called by the
> + *                                 userfaultfd code itself.
We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
as it's not easy to follow when without userfault minor context. It might
be better to use generic terms like NO_USERFAULT.
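
As a toy model of what such a generically named flag would do, the userspace sketch below shows a fault path that hands minor faults to userfaultfd unless the caller says it is userfaultfd itself. FAULT_FLAG_NO_USERFAULT and its value, struct toy_vma, and the toy_* functions are all hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Hypothetical flag name and value, used only in this sketch. */
#define FAULT_FLAG_NO_USERFAULT (1u << 0)

enum toy_result { TOY_MAPPED, TOY_HANDED_TO_USERFAULT };

/* Minimal stand-in for the VMA state the fault path consults. */
struct toy_vma {
	bool uffd_minor;     /* registered for minor faults */
	bool page_in_cache;  /* page cache already holds the page */
};

/* Model of a fault() handler: divert minor faults unless told not to. */
static enum toy_result toy_fault(const struct toy_vma *vma, unsigned int flags)
{
	if (vma->page_in_cache && vma->uffd_minor &&
	    !(flags & FAULT_FLAG_NO_USERFAULT))
		return TOY_HANDED_TO_USERFAULT;
	return TOY_MAPPED;
}

/* Model of the CONTINUE path: reuse fault() but suppress the diversion. */
static enum toy_result toy_uffdio_continue(const struct toy_vma *vma)
{
	return toy_fault(vma, FAULT_FLAG_NO_USERFAULT);
}
```

Without the flag, the CONTINUE path would re-enter the userfault handling it is trying to resolve; with it, the same fault() code installs the mapping directly.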
That said, if we'll need to add a vm_ops hook anyway in a later patch, I
wonder whether we can also avoid reusing fault() and instead resolve the
page faults using the vm_ops hook too. That might be helpful because then
we can avoid this new FAULT_FLAG_*, which is totally not useful to
non-userfault users, and we also don't need to hand-cook the vm_fault
struct below just to suit the current fault() interface.
Thanks,
--
Peter Xu

