On Fri, Mar 6, 2026 at 9:19 AM Mike Rapoport <[email protected]> wrote:
>
> From: Nikita Kalyazin <[email protected]>
>
> userfaultfd notifications about page faults are used for live migration
> and snapshotting of VMs.
>
> MISSING mode allows post-copy live migration, and MINOR mode allows
> optimization of post-copy live migration for VMs backed with shared
> hugetlbfs or tmpfs mappings, as described in detail in commit
> 7677f7fd8be7 ("userfaultfd: add minor fault registration mode").
>
> To use the same mechanisms for VMs that use guest_memfd to map their
> memory, guest_memfd should support userfaultfd operations.
>
> Add an implementation of vm_uffd_ops to guest_memfd.
>
> Signed-off-by: Nikita Kalyazin <[email protected]>
> Co-developed-by: Mike Rapoport (Microsoft) <[email protected]>
> Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Overall looks fine to me, but I am slightly concerned about in-place
conversion [1], and I think you're going to want to implement a
kvm_gmem_folio_present() op or something (like I was saying on the
previous patch [2]).

[1]: https://lore.kernel.org/kvm/[email protected]/
[2]: https://lore.kernel.org/linux-mm/cadrl8hvuj5fl97d9ytxp2wxos6hs+u+ycpsi5vxffsw9vac...@mail.gmail.com/

Some in-line comments below.

> ---
>  mm/filemap.c           |  1 +
>  virt/kvm/guest_memfd.c | 84 +++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 83 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4ada..19dfcebcd23f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -262,6 +262,7 @@ void filemap_remove_folio(struct folio *folio)
>
>  	filemap_free_folio(mapping, folio);
>  }
> +EXPORT_SYMBOL_FOR_MODULES(filemap_remove_folio, "kvm");
>
>  /*
>   * page_cache_delete_batch - delete several folios from page cache
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 017d84a7adf3..46582feeed75 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -7,6 +7,7 @@
>  #include <linux/mempolicy.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/pagemap.h>
> +#include <linux/userfaultfd_k.h>
>
>  #include "kvm_mm.h"
>
> @@ -107,6 +108,12 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	return __kvm_gmem_prepare_folio(kvm, slot, index, folio);
>  }
>
> +static struct folio *kvm_gmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
> +{
> +	return __filemap_get_folio(inode->i_mapping, pgoff,
> +				   FGP_LOCK | FGP_ACCESSED, 0);
> +}

When in-place conversion is supported, I wonder what the userfault
semantics should be here. Upon a userspace access to a file offset
that is populated but private, should we get a userfault or a SIGBUS?
I guess getting a userfault is strictly more useful for userspace, but
I'm not sure which choice is more correct.
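To make the suggestion concrete, here's roughly the shape of op I have
in mind (a sketch only, not tested; kvm_gmem_is_shared() is a made-up
helper standing in for whatever shareability tracking the in-place
conversion series ends up adding):

```c
/*
 * Hypothetical sketch: with in-place conversion, a folio can be
 * present in the filemap but still private to the guest.  A
 * ->folio_present() op would let the userfaultfd code distinguish
 * "absent" (raise a userfault) from "present but not faultable"
 * (SIGBUS, or userfault, depending on the semantics we pick),
 * instead of inferring state from the filemap lookup alone.
 */
static bool kvm_gmem_folio_present(struct inode *inode, pgoff_t pgoff)
{
	struct folio *folio;
	bool present;

	folio = filemap_get_folio(inode->i_mapping, pgoff);
	if (IS_ERR(folio))
		return false;

	/*
	 * kvm_gmem_is_shared() is hypothetical: present in the
	 * filemap, but only userfault-able if currently shared.
	 */
	present = kvm_gmem_is_shared(inode, pgoff);
	folio_put(folio);
	return present;
}
```

Whether the "present but private" case should count as present here is
exactly the semantics question above, so I'm not attached to this
particular answer.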
> +
>  /*
>   * Returns a locked folio on success. The caller is responsible for
>   * setting the up-to-date flag before the memory is mapped into the guest.
> @@ -126,8 +133,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  	 * Fast-path: See if folio is already present in mapping to avoid
>  	 * policy_lookup.
>  	 */
> -	folio = __filemap_get_folio(inode->i_mapping, index,
> -				    FGP_LOCK | FGP_ACCESSED, 0);
> +	folio = kvm_gmem_get_folio_noalloc(inode, index);
>  	if (!IS_ERR(folio))
>  		return folio;
>
> @@ -457,12 +463,86 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
>  }
>  #endif /* CONFIG_NUMA */
>
> +#ifdef CONFIG_USERFAULTFD
> +static bool kvm_gmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +
> +	/*
> +	 * Only support userfaultfd for guest_memfd with INIT_SHARED flag.
> +	 * This ensures the memory can be mapped to userspace.
> +	 */
> +	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
> +		return false;
> +
> +	return true;
> +}
> +
> +static struct folio *kvm_gmem_folio_alloc(struct vm_area_struct *vma,
> +					  unsigned long addr)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	pgoff_t pgoff = linear_page_index(vma, addr);
> +	struct mempolicy *mpol;
> +	struct folio *folio;
> +	gfp_t gfp;
> +
> +	if (unlikely(pgoff >= (i_size_read(inode) >> PAGE_SHIFT)))
> +		return NULL;
> +
> +	gfp = mapping_gfp_mask(inode->i_mapping);
> +	mpol = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
> +	mpol = mpol ?: get_task_policy(current);
> +	folio = filemap_alloc_folio(gfp, 0, mpol);
> +	mpol_cond_put(mpol);
> +
> +	return folio;
> +}
> +
> +static int kvm_gmem_filemap_add(struct folio *folio,
> +				struct vm_area_struct *vma,
> +				unsigned long addr)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct address_space *mapping = inode->i_mapping;
> +	pgoff_t pgoff = linear_page_index(vma, addr);
> +	int err;
> +
> +	__folio_set_locked(folio);
> +	err = filemap_add_folio(mapping, folio, pgoff, GFP_KERNEL);

This is going to get more interesting with in-place conversion. I'm
not really sure how to synchronize with it, but we'll probably need to
take the invalidate lock for reading. And then we'll need a separate
uffd_op to drop it after we install the PTE... I think.

> +	if (err) {
> +		folio_unlock(folio);
> +		return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static void kvm_gmem_filemap_remove(struct folio *folio,
> +				    struct vm_area_struct *vma)
> +{
> +	filemap_remove_folio(folio);
> +	folio_unlock(folio);
> +}
> +
> +static const struct vm_uffd_ops kvm_gmem_uffd_ops = {
> +	.can_userfault		= kvm_gmem_can_userfault,
> +	.get_folio_noalloc	= kvm_gmem_get_folio_noalloc,
> +	.alloc_folio		= kvm_gmem_folio_alloc,
> +	.filemap_add		= kvm_gmem_filemap_add,
> +	.filemap_remove		= kvm_gmem_filemap_remove,
> +};
> +#endif /* CONFIG_USERFAULTFD */
> +
>  static const struct vm_operations_struct kvm_gmem_vm_ops = {
>  	.fault = kvm_gmem_fault_user_mapping,
>  #ifdef CONFIG_NUMA
>  	.get_policy = kvm_gmem_get_policy,
>  	.set_policy = kvm_gmem_set_policy,
>  #endif
> +#ifdef CONFIG_USERFAULTFD
> +	.uffd_ops = &kvm_gmem_uffd_ops,
> +#endif
>  };
>
>  static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> --
> 2.51.0
>
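To spell out the synchronization I was gesturing at, something like
the below is what I'm imagining (again, only a sketch and untested;
->uffd_complete() is a made-up op name, and the exact lock ordering
against the folio lock would need checking):

```c
/*
 * Sketch: take the invalidate lock shared around the filemap
 * insertion so the newly added folio can't race with an in-place
 * conversion's truncation/invalidation, which takes the same lock
 * exclusively.  The shared lock would then have to be held until
 * the PTE is installed, hence the separate (hypothetical)
 * ->uffd_complete() op to drop it afterwards.
 */
static int kvm_gmem_filemap_add(struct folio *folio,
				struct vm_area_struct *vma,
				unsigned long addr)
{
	struct address_space *mapping = file_inode(vma->vm_file)->i_mapping;
	pgoff_t pgoff = linear_page_index(vma, addr);
	int err;

	filemap_invalidate_lock_shared(mapping);

	__folio_set_locked(folio);
	err = filemap_add_folio(mapping, folio, pgoff, GFP_KERNEL);
	if (err) {
		folio_unlock(folio);
		filemap_invalidate_unlock_shared(mapping);
		return err;
	}

	/* Still holding the invalidate lock; see ->uffd_complete(). */
	return 0;
}

/* Hypothetical op, called by uffd after the PTE has been installed. */
static void kvm_gmem_uffd_complete(struct vm_area_struct *vma)
{
	filemap_invalidate_unlock_shared(file_inode(vma->vm_file)->i_mapping);
}
```

Returning to userspace (or sleeping) with the invalidate lock held is
obviously not great, so maybe there's a better scheme; this is just to
show where I think the race window is.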

