Re: [f2fs-dev] [PATCH kvm-next V11 4/7] KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
On Thu, Sep 25, 2025, David Hildenbrand wrote:
> On 25.09.25 15:41, Sean Christopherson wrote:
> > Regarding timing, how much do people care about getting this into 6.18 in
> > particular?
>
> I think it will be beneficial if we start getting stuff upstream. But
> waiting a bit longer probably doesn't hurt.
>
> > AFAICT, this hasn't gotten any coverage in -next, which makes me a
> > little nervous.
>
> Right.
>
> If we agree, then Shivank can just respin a new version after the merge
> window.

Actually, if Shivank is ok with it, I'd be happy to post the next version(s).
I'll be focusing on the in-place conversion support for the next 1-2 weeks, and
have some (half-baked) refactoring changes to better leverage the inode support
from this series.

I can also plop the first three patches (the non-KVM changes) in a topic branch
straightaway, but not feed it into -next until the merge window closes.  The
0-day bot scrapes kvm-x86, so that'd get us some early build-bot exposure, and
we can stop bugging the non-KVM folks.  Then when the dust settles on the KVM
changes, I can throw them into the same topic branch.
Re: [f2fs-dev] [PATCH kvm-next V11 0/7] Add NUMA mempolicy support for KVM guest-memfd
On Wed, 27 Aug 2025 17:52:41 +, Shivank Garg wrote:
> This series introduces NUMA-aware memory placement support for KVM guests
> with guest_memfd memory backends. It builds upon Fuad Tabba's work (V17)
> that enabled host-mapping for guest_memfd memory [1] and can be applied
> directly on the KVM tree [2] (branch kvm-next, base commit: a6ad5413,
> Merge branch 'guest-memfd-mmap' into HEAD).
>
> == Background ==
> KVM's guest-memfd memory backend currently lacks support for NUMA policy
> enforcement, causing guest memory allocations to be distributed across host
> nodes according to the kernel's default behavior, irrespective of any policy
> specified by the VMM. This limitation arises because conventional userspace
> NUMA control mechanisms like mbind(2) don't work, since the memory isn't
> directly mapped to userspace when allocations occur.
>
> Fuad's work [1] provides the necessary mmap capability, and this series
> leverages it to enable mbind(2).
>
> [...]

Applied the non-KVM changes to kvm-x86 gmem.  We're still tweaking and
iterating on the KVM changes, but I fully expect them to land in 6.19.

Holler if you object to taking these through the kvm tree.

[1/7] mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
      https://github.com/kvm-x86/linux/commit/601aa29f762f
[2/7] mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies
      https://github.com/kvm-x86/linux/commit/2bb25703e5bd
[3/7] mm/mempolicy: Export memory policy symbols
      https://github.com/kvm-x86/linux/commit/e1b4cf7d6be3

--
https://github.com/kvm-x86/linux/tree/next
Re: [f2fs-dev] [PATCH kvm-next V11 6/7] KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
On Thu, Sep 25, 2025, Sean Christopherson wrote:
> On Wed, Aug 27, 2025, Shivank Garg wrote:
> > @@ -26,6 +28,9 @@ static inline struct kvm_gmem_inode_info
> > *KVM_GMEM_I(struct inode *inode)
> > return container_of(inode, struct kvm_gmem_inode_info, vfs_inode);
> > }
> >
> > +static struct mempolicy *kvm_gmem_get_pgoff_policy(struct kvm_gmem_inode_info *info,
> > +						    pgoff_t index);
> > +
> > /**
> > * folio_file_pfn - like folio_file_page, but return a pfn.
> > * @folio: The folio which contains this index.
> > @@ -112,7 +117,25 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> > static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> > {
> > /* TODO: Support huge pages. */
> > - return filemap_grab_folio(inode->i_mapping, index);
> > + struct mempolicy *policy;
> > + struct folio *folio;
> > +
> > +	/*
> > +	 * Fast-path: See if folio is already present in mapping to avoid
> > +	 * policy_lookup.
> > +	 */
> > + folio = __filemap_get_folio(inode->i_mapping, index,
> > + FGP_LOCK | FGP_ACCESSED, 0);
> > + if (!IS_ERR(folio))
> > + return folio;
> > +
> > + policy = kvm_gmem_get_pgoff_policy(KVM_GMEM_I(inode), index);
> > +	folio = __filemap_get_folio_mpol(inode->i_mapping, index,
> > +					 FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
> > +					 mapping_gfp_mask(inode->i_mapping), policy);
> > + mpol_cond_put(policy);
> > +
> > + return folio;
> > }
> >
> > static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> > @@ -372,8 +395,45 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
> > return ret;
> > }
> >
> > +#ifdef CONFIG_NUMA
> > +static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> > +{
> > + struct inode *inode = file_inode(vma->vm_file);
> > +
> > + return mpol_set_shared_policy(&KVM_GMEM_I(inode)->policy, vma, mpol);
> > +}
> > +
> > +static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> > +					     unsigned long addr, pgoff_t *pgoff)
> > +{
> > + struct inode *inode = file_inode(vma->vm_file);
> > +
> > + *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> > + return mpol_shared_policy_lookup(&KVM_GMEM_I(inode)->policy, *pgoff);
> > +}
> > +
> > +static struct mempolicy *kvm_gmem_get_pgoff_policy(struct kvm_gmem_inode_info *info,
> > +						    pgoff_t index)
>
> I keep reading this as "page offset policy", as opposed to "policy given a page
> offset".  Another oddity that is confusing is that this helper explicitly does
> get_task_policy(current), while kvm_gmem_get_policy() lets the caller do that.
> The end result is the same, but I think it would be helpful for gmem to be
> internally consistent.
>
> If we have kvm_gmem_get_policy() use this helper, then we can kill two birds
> with one stone:
>
> static struct mempolicy *__kvm_gmem_get_policy(struct gmem_inode *gi,
> 					       pgoff_t index)
> {
> struct mempolicy *mpol;
>
> mpol = mpol_shared_policy_lookup(&gi->policy, index);
> return mpol ? mpol : get_task_policy(current);
> }
>
> static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> 					     unsigned long addr, pgoff_t *pgoff)
> {
> *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
>
> 	return __kvm_gmem_get_policy(GMEM_I(file_inode(vma->vm_file)), *pgoff);
> }

Argh! This breaks the selftest because do_get_mempolicy() very specifically
falls back to the default_policy, NOT to the current task's policy. That is
*exactly* the type of subtle detail that needs to be commented, because there's
no way some random KVM developer is going to know that returning NULL here is
important with respect to get_mempolicy() ABI.
On a happier note, I'm very glad you wrote a testcase :-)
I've got this as fixup-to-the-fixup:
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e796cc552a96..61130a52553f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -114,8 +114,8 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
return r;
}
-static struct mempolicy *__kvm_gmem_get_policy(struct gmem_inode *gi,
- pgoff_t index)
+static struct mempolicy *kvm_gmem_get_folio_policy(struct gmem_inode *gi,
+ pgoff_t index)
{
#ifdef CONFIG_NUMA
struct mempolicy *mpol;
@@ -151,7 +151,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
if (!IS_ERR(folio))
return folio;
- policy = __
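
To spell out the fix, a minimal sketch of the resulting pair of helpers; the
gmem_inode/GMEM_I() names come from the renaming discussion elsewhere in this
thread, so treat this as an assumption, not the final patch:

/* Allocation paths: falling back to the current task's policy is fine here. */
static struct mempolicy *kvm_gmem_get_folio_policy(struct gmem_inode *gi,
						   pgoff_t index)
{
	struct mempolicy *mpol = mpol_shared_policy_lookup(&gi->policy, index);

	return mpol ? mpol : get_task_policy(current);
}

/*
 * ->get_policy(): must return NULL if no policy was explicitly set, so that
 * do_get_mempolicy() falls back to default_policy and the get_mempolicy()
 * uABI is preserved.
 */
static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
					     unsigned long addr, pgoff_t *pgoff)
{
	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);

	return mpol_shared_policy_lookup(&GMEM_I(file_inode(vma->vm_file))->policy,
					 *pgoff);
}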
Re: [f2fs-dev] [PATCH kvm-next V11 5/7] KVM: guest_memfd: Add slab-allocated inode cache
On Thu, Sep 25, 2025, Sean Christopherson wrote:
> On Wed, Aug 27, 2025, Shivank Garg wrote:
> > Add dedicated inode structure (kvm_gmem_inode_info) and slab-allocated
> > inode cache for guest memory backing, similar to how shmem handles inodes.
> >
> > This adds the necessary allocation/destruction functions and prepares
> > for upcoming guest_memfd NUMA policy support changes.
> >
> > Signed-off-by: Shivank Garg
> > ---
> >  virt/kvm/guest_memfd.c | 70 ++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 68 insertions(+), 2 deletions(-)
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 6c66a0974055..356947d36a47 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -17,6 +17,15 @@ struct kvm_gmem {
> > struct list_head entry;
> > };
> >
> > +struct kvm_gmem_inode_info {
>
> What about naming this simply gmem_inode?
Heh, after looking through other filesystems, they're fairly even on appending
_info or not. My vote is definitely for gmem_inode.
Before we accumulate more inode usage, e.g. for in-place conversion (which is
actually why I started looking at this code), I think we should also settle on
naming for gmem_file and gmem_inode variables.
As below, "struct kvm_gmem *gmem" gets quite confusing once inodes are in the
picture, especially since that structure isn't _the_ gmem instance, rather it's
a VM's view of that gmem instance. And on the other side, "info" for the inode
is a bit imprecise, e.g. doesn't immediately make me think of inodes.
A few ideas:
(a)
struct gmem_inode *gmem;
struct gmem_file *f;
(b)
struct gmem_inode *gi;
struct gmem_file *f;
(c)
struct gmem_inode *gi;
struct gmem_file *gf;
(d)
struct gmem_inode *gmem_i;
struct gmem_file *gmem_f;
I think my vote would be for (a) or (b).  Option (c) seems like it would be hard
to visually differentiate between "gi" and "gf", and gmem_{i,f} are a bit
verbose IMO.
> > + struct inode vfs_inode;
> > +};
> > +
> > +static inline struct kvm_gmem_inode_info *KVM_GMEM_I(struct inode *inode)
>
> And then GMEM_I()?
>
> And then (in a later follow-up if we target this for 6.18, or as a prep patch
> if we push this out to 6.19), rename kvm_gmem to gmem_file?
>
> That would make guest_memfd look a bit more like other filesystems, and I
> don't see a need to preface the local structures and helpers with "kvm_",
> e.g. GMEM_I() is analogous to x86's to_vmx() and to_svm().
>
> As for renaming kvm_gmem => gmem_file, I wandered back into this code via
> Ackerley's in-place conversion series, and it took me a good long while to
> remember the roles of files vs. inodes in gmem.  That's probably a sign that
> the code needs clarification given that I wrote the original code. :-)
>
> Leveraging an old discussion[*], my thought is to get to this:
>
> /*
> * A guest_memfd instance can be associated with multiple VMs, each with its own
> * "view" of the underlying physical memory.
> *
> * The gmem's inode is effectively the raw underlying physical storage, and is
> * used to track properties of the physical memory, while each gmem file is
> * effectively a single VM's view of that storage, and is used to track assets
> * specific to its associated VM, e.g. memslots=>gmem bindings.
> */
> struct gmem_file {
> struct kvm *kvm;
> struct xarray bindings;
> struct list_head entry;
> };
>
> struct gmem_inode {
> struct shared_policy policy;
> struct inode vfs_inode;
> };
>
> [*] https://lore.kernel.org/all/[email protected]
Re: [f2fs-dev] [PATCH kvm-next V11 7/7] KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support
On Wed, Aug 27, 2025, Shivank Garg wrote:
> Add tests for NUMA memory policy binding and NUMA aware allocation in
> guest_memfd. This extends the existing selftests by adding proper
> validation for:
> - KVM GMEM set_policy and get_policy() vm_ops functionality using
> mbind() and get_mempolicy()
> - NUMA policy application before and after memory allocation
>
> These tests help ensure NUMA support for guest_memfd works correctly.
>
> Signed-off-by: Shivank Garg
> ---
> tools/testing/selftests/kvm/Makefile.kvm | 1 +
> >  .../testing/selftests/kvm/guest_memfd_test.c | 121 ++++++++++++++++++++++
> 2 files changed, 122 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> index 90f03f00cb04..c46cef2a7cd7 100644
> --- a/tools/testing/selftests/kvm/Makefile.kvm
> +++ b/tools/testing/selftests/kvm/Makefile.kvm
> @@ -275,6 +275,7 @@ pgste-option = $(call try-run, echo 'int main(void) { return 0; }' | \
> 		$(CC) -Werror -Wl$(comma)--s390-pgste -x c - -o "$$TMP",-Wl$(comma)--s390-pgste)
>
> LDLIBS += -ldl
> +LDLIBS += -lnuma
Hrm, this is going to be very annoying.  I don't have libnuma-dev installed on
any of my systems, and I doubt I'm alone.  Installing the package is trivial,
but I'm a little wary of foisting that requirement on all KVM developers and
build bots.
I'd be especially curious what ARM and RISC-V think, as NUMA is likely a bit
less prevalent there.
> LDFLAGS += -pthread $(no-pie-option) $(pgste-option)
>
> LIBKVM_C := $(filter %.c,$(LIBKVM))
> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
> index b3ca6737f304..9640d04ec293 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -7,6 +7,8 @@
> #include
> #include
> #include
> +#include <numa.h>
> +#include <numaif.h>
> #include
> #include
> #include
> @@ -19,6 +21,7 @@
> #include
> #include
> #include
> +#include
>
> #include "kvm_util.h"
> #include "test_util.h"
> @@ -72,6 +75,122 @@ static void test_mmap_supported(int fd, size_t page_size, size_t total_size)
> TEST_ASSERT(!ret, "munmap() should succeed.");
> }
>
> +#define TEST_REQUIRE_NUMA_MULTIPLE_NODES() \
> + TEST_REQUIRE(numa_available() != -1 && numa_max_node() >= 1)
Using TEST_REQUIRE() here will result in skipping the _entire_ test. Ideally
this test would use fixtures so that each testcase can run in a child process
and thus can use TEST_REQUIRE(), but that's a conversion for another day.
Easiest thing would probably be to turn this into a common helper and then bail
early.
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 9640d04ec293..6acb186e5300 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -7,7 +7,6 @@
#include
#include
#include
-#include <numa.h>
#include
#include
#include
@@ -75,9 +74,6 @@ static void test_mmap_supported(int fd, size_t page_size, size_t total_size)
TEST_ASSERT(!ret, "munmap() should succeed.");
}
-#define TEST_REQUIRE_NUMA_MULTIPLE_NODES() \
- TEST_REQUIRE(numa_available() != -1 && numa_max_node() >= 1)
-
static void test_mbind(int fd, size_t page_size, size_t total_size)
{
unsigned long nodemask = 1; /* nid: 0 */
@@ -87,7 +83,8 @@ static void test_mbind(int fd, size_t page_size, size_t total_size)
char *mem;
int ret;
- TEST_REQUIRE_NUMA_MULTIPLE_NODES();
+ if (!is_multi_numa_node_system())
+ return;
mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
TEST_ASSERT(mem != MAP_FAILED, "mmap for mbind test should succeed");
@@ -136,7 +133,8 @@ static void test_numa_allocation(int fd, size_t page_size, size_t total_size)
char *mem;
int ret, i;
- TEST_REQUIRE_NUMA_MULTIPLE_NODES();
+ if (!is_multi_numa_node_system())
+ return;
 	/* Clean slate: deallocate all file space, if any */
 	ret = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
 			total_size);
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 23a506d7eca3..d7051607e6bf 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -21,6 +21,7 @@
#include
#include
+#include <numa.h>
#include
#include "kvm_util_arch.h"
@@ -633,6 +634,11 @@ static inline bool is_smt_on(void)
 	return false;
 }
+static inline bool is_multi_numa_node_system(void)
+{
+ return numa_available() != -1 && numa_max_node() >= 1;
+}
+
void vm_create_irqchip(struct kvm_vm *vm);
static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
Re: [f2fs-dev] [PATCH kvm-next V11 7/7] KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support
On Thu, Sep 25, 2025, Sean Christopherson wrote:
> On Wed, Aug 27, 2025, Shivank Garg wrote:
> > Add tests for NUMA memory policy binding and NUMA aware allocation in
> > guest_memfd. This extends the existing selftests by adding proper
> > validation for:
> > - KVM GMEM set_policy and get_policy() vm_ops functionality using
> > mbind() and get_mempolicy()
> > - NUMA policy application before and after memory allocation
> >
> > These tests help ensure NUMA support for guest_memfd works correctly.
> >
> > Signed-off-by: Shivank Garg
> > ---
> > tools/testing/selftests/kvm/Makefile.kvm | 1 +
> >  .../testing/selftests/kvm/guest_memfd_test.c | 121 ++++++++++++++++++++++
> > 2 files changed, 122 insertions(+)
> >
> > diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> > index 90f03f00cb04..c46cef2a7cd7 100644
> > --- a/tools/testing/selftests/kvm/Makefile.kvm
> > +++ b/tools/testing/selftests/kvm/Makefile.kvm
> > @@ -275,6 +275,7 @@ pgste-option = $(call try-run, echo 'int main(void) { return 0; }' | \
> > 		$(CC) -Werror -Wl$(comma)--s390-pgste -x c - -o "$$TMP",-Wl$(comma)--s390-pgste)
> >
> > LDLIBS += -ldl
> > +LDLIBS += -lnuma
>
> Hrm, this is going to be very annoying.  I don't have libnuma-dev installed
> on any of my systems, and I doubt I'm alone.  Installing the package is
> trivial, but I'm a little wary of foisting that requirement on all KVM
> developers and build bots.
>
> I'd be especially curious what ARM and RISC-V think, as NUMA is likely a bit
> less prevalent there.
Ugh, and it doesn't play nice with static linking. I haven't tried running on a
NUMA system yet, so maybe it's benign?
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/14/../../../x86_64-linux-gnu/libnuma.a(affinity.o): in function `affinity_ip':
(.text+0x629): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Re: [f2fs-dev] [PATCH kvm-next V11 7/7] KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support
On Thu, Sep 25, 2025, Jason Gunthorpe wrote:
> On Thu, Sep 25, 2025 at 02:35:19PM -0700, Sean Christopherson wrote:
> > >  LDLIBS += -ldl
> > > +LDLIBS += -lnuma
> >
> > Hrm, this is going to be very annoying.  I don't have libnuma-dev installed
> > on any of my systems, and I doubt I'm alone.  Installing the package is
> > trivial, but I'm a little wary of foisting that requirement on all KVM
> > developers and build bots.
>
> Wouldn't it be great if the kselftest build system used something like
> meson and could work around these little issues without breaking the
> whole build? :(
>
> Does anyone else think this?
>
> Every time I try to build kselftests I just ignore all the errors that
> fly by because the one bit I wanted did build properly anyhow.

I'm indifferent, as I literally never build all of kselftests, I just build
KVM selftests.  But I'm probably in the minority for the kernel overall.
Re: [f2fs-dev] [PATCH kvm-next V11 4/7] KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
My apologies for the super late feedback. None of this is critical (mechanical
things that can be cleaned up after the fact), so if there's any urgency to
getting this series into 6.18, just ignore it.
On Wed, Aug 27, 2025, Ackerley Tng wrote:
> Shivank Garg writes:
> @@ -463,11 +502,70 @@ bool __weak kvm_arch_supports_gmem_mmap(struct kvm *kvm)
> return true;
> }
>
> +static struct inode *kvm_gmem_inode_create(const char *name, loff_t size,
> +					   u64 flags)
> +{
> + struct inode *inode;
> +
> + inode = anon_inode_make_secure_inode(kvm_gmem_mnt->mnt_sb, name, NULL);
> + if (IS_ERR(inode))
> + return inode;
> +
> + inode->i_private = (void *)(unsigned long)flags;
> + inode->i_op = &kvm_gmem_iops;
> + inode->i_mapping->a_ops = &kvm_gmem_aops;
> + inode->i_mode |= S_IFREG;
> + inode->i_size = size;
> + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> + mapping_set_inaccessible(inode->i_mapping);
> + /* Unmovable mappings are supposed to be marked unevictable as well. */
> + WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +
> + return inode;
> +}
> +
> +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> +						  u64 flags)
> +{
> + static const char *name = "[kvm-gmem]";
> + struct inode *inode;
> + struct file *file;
> + int err;
> +
> + err = -ENOENT;
> + /* __fput() will take care of fops_put(). */
> + if (!fops_get(&kvm_gmem_fops))
> + goto err;
> +
> + inode = kvm_gmem_inode_create(name, size, flags);
> + if (IS_ERR(inode)) {
> + err = PTR_ERR(inode);
> + goto err_fops_put;
> + }
> +
> + file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
> + &kvm_gmem_fops);
> + if (IS_ERR(file)) {
> + err = PTR_ERR(file);
> + goto err_put_inode;
> + }
> +
> + file->f_flags |= O_LARGEFILE;
> + file->private_data = priv;
> +
> + return file;
> +
> +err_put_inode:
> + iput(inode);
> +err_fops_put:
> + fops_put(&kvm_gmem_fops);
> +err:
> + return ERR_PTR(err);
> +}
I don't see any reason to add two helpers.  It requires quite a few more lines
of code due to adding more error paths and local variables, and IMO doesn't
make the code any easier to read.
Passing in "gmem" as @priv is especially ridiculous, as it adds code and
obfuscates what file->private_data is set to.
I get the sense that the code was written to be a "replacement" for common APIs,
but that is nonsensical (no pun intended).
> static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> {
> - const char *anon_name = "[kvm-gmem]";
> struct kvm_gmem *gmem;
> - struct inode *inode;
> struct file *file;
> int fd, err;
>
> @@ -481,32 +579,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> 		goto err_fd;
> 	}
>
> - file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
> - O_RDWR, NULL);
> + file = kvm_gmem_inode_create_getfile(gmem, size, flags);
> if (IS_ERR(file)) {
> err = PTR_ERR(file);
> goto err_gmem;
> }
>
> - file->f_flags |= O_LARGEFILE;
> -
> - inode = file->f_inode;
> - WARN_ON(file->f_mapping != inode->i_mapping);
> -
> - inode->i_private = (void *)(unsigned long)flags;
> - inode->i_op = &kvm_gmem_iops;
> - inode->i_mapping->a_ops = &kvm_gmem_aops;
> - inode->i_mode |= S_IFREG;
> - inode->i_size = size;
> - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> - mapping_set_inaccessible(inode->i_mapping);
> - /* Unmovable mappings are supposed to be marked unevictable as well. */
> - WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> -
> kvm_get_kvm(kvm);
> gmem->kvm = kvm;
> xa_init(&gmem->bindings);
> - list_add(&gmem->entry, &inode->i_mapping->i_private_list);
> + list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
I don't understand this change? Isn't file_inode(file) == inode?
Compile tested only, and again not critical, but it's -40 LoC...
---
 include/uapi/linux/magic.h |  1 +
 virt/kvm/guest_memfd.c     | 75 +++++++++++++++++++++++++++++++++++--------
 virt/kvm/kvm_main.c        |  7 +++-
 virt/kvm/kvm_mm.h          |  9 +++--
 4 files changed, 76 insertions(+), 16 deletions(-)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..638ca21b7a90 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
 #define SECRETMEM_MAGIC	0x5345434d	/* "SECM" */
#define PID_FS_MAGIC 0x50494446 /* "PIDF" */
+#define GUEST_MEMFD_MAGIC 0x474d454d
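
To illustrate the direction of the -40 LoC consolidation, a sketch of a single
creation helper, with the caller setting file->private_data per the @priv
complaint above.  The name and exact error handling are assumptions, not the
actual patch:

static struct file *kvm_gmem_create_file(loff_t size, u64 flags)
{
	static const char *name = "[kvm-gmem]";
	struct inode *inode;
	struct file *file;

	inode = anon_inode_make_secure_inode(kvm_gmem_mnt->mnt_sb, name, NULL);
	if (IS_ERR(inode))
		return ERR_CAST(inode);

	inode->i_private = (void *)(unsigned long)flags;
	inode->i_op = &kvm_gmem_iops;
	inode->i_mapping->a_ops = &kvm_gmem_aops;
	inode->i_mode |= S_IFREG;
	inode->i_size = size;
	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
	mapping_set_inaccessible(inode->i_mapping);
	/* Unmovable mappings are supposed to be marked unevictable as well. */
	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));

	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
				 &kvm_gmem_fops);
	if (IS_ERR(file))
		iput(inode);

	return file;
}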
Re: [f2fs-dev] [PATCH kvm-next V11 0/7] Add NUMA mempolicy support for KVM guest-memfd
On Wed, Oct 15, 2025, Sean Christopherson wrote:
> On Wed, 27 Aug 2025 17:52:41 +, Shivank Garg wrote:
> > This series introduces NUMA-aware memory placement support for KVM guests
> > with guest_memfd memory backends. It builds upon Fuad Tabba's work (V17)
> > that enabled host-mapping for guest_memfd memory [1] and can be applied
> > directly on the KVM tree [2] (branch kvm-next, base commit: a6ad5413,
> > Merge branch 'guest-memfd-mmap' into HEAD).
> >
> > == Background ==
> > KVM's guest-memfd memory backend currently lacks support for NUMA policy
> > enforcement, causing guest memory allocations to be distributed across host
> > nodes according to the kernel's default behavior, irrespective of any policy
> > specified by the VMM. This limitation arises because conventional userspace
> > NUMA control mechanisms like mbind(2) don't work, since the memory isn't
> > directly mapped to userspace when allocations occur.
> >
> > Fuad's work [1] provides the necessary mmap capability, and this series
> > leverages it to enable mbind(2).
> >
> > [...]
>
> Applied the non-KVM changes to kvm-x86 gmem.  We're still tweaking and
> iterating on the KVM changes, but I fully expect them to land in 6.19.
>
> Holler if you object to taking these through the kvm tree.
>
> [1/7] mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
>       https://github.com/kvm-x86/linux/commit/601aa29f762f
> [2/7] mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies
>       https://github.com/kvm-x86/linux/commit/2bb25703e5bd
> [3/7] mm/mempolicy: Export memory policy symbols
>       https://github.com/kvm-x86/linux/commit/e1b4cf7d6be3

FYI, I rebased these onto 6.18-rc2 to avoid a silly merge.  New hashes:

[1/3] mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
      https://github.com/kvm-x86/linux/commit/7f3779a3ac3e
[2/3] mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies
      https://github.com/kvm-x86/linux/commit/16a542e22339
[3/3] mm/mempolicy: Export memory policy symbols
      https://github.com/kvm-x86/linux/commit/f634f10809ec
Re: [f2fs-dev] [PATCH kvm-next V11 6/7] KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
On Wed, Oct 15, 2025, Gregory Price wrote:
> On Fri, Sep 26, 2025 at 12:36:27PM -0700, Sean Christopherson via
> Linux-f2fs-devel wrote:
> > >
> > > static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> > >unsigned long addr, pgoff_t *pgoff)
> > > {
> > > *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> > >
> > > return __kvm_gmem_get_policy(GMEM_I(file_inode(vma->vm_file)), *pgoff);
> >
> > Argh!  This breaks the selftest because do_get_mempolicy() very specifically
> > falls back to the default_policy, NOT to the current task's policy.  That is
> > *exactly* the type of subtle detail that needs to be commented, because
> > there's no way some random KVM developer is going to know that returning
> > NULL here is important with respect to get_mempolicy() ABI.
> >
>
> Do_get_mempolicy was designed to be accessed by the syscall, not as an
> in-kernel ABI.
Ya, by "get_mempolicy() ABI" I meant the uABI for the get_mempolicy syscall.
> get_task_policy also returns the default policy if there's nothing
> there, because that's what applies.
>
> I have dangerous questions:
Not dangerous at all, I find them very helpful!
> why is __kvm_gmem_get_policy using
> mpol_shared_policy_lookup()
> instead of
> get_vma_policy()
With the disclaimer that I haven't followed the gory details of this series
super closely, my understanding is...
Because the VMA is a means to an end, and we want the policy to persist even if
the VMA goes away.

With guest_memfd, KVM effectively inverts the standard MMU model.  Instead of
mm/ being the primary MMU and KVM being a secondary MMU, guest_memfd is the
primary MMU and any VMAs are secondary (mostly; it's probably more like 1a and
1b).  This allows KVM to map guest_memfd memory into a guest without a VMA, or
with more permissions than are granted to host userspace, e.g. guest_memfd
memory could be writable by the guest, but read-only for userspace.
But we still want to support things like mbind() so that userspace can ensure
guest_memfd allocations align with the vNUMA topology presented to the guest,
or are bound to the NUMA node where the VM will run.  We considered adding
equivalent file-based syscalls, e.g. fbind(), but IIRC the consensus was that
doing so was unnecessary (and potentially messy?) since we were planning on
eventually adding mmap() support to guest_memfd anyways.
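
To make that flow concrete, a hypothetical VMM snippet that binds a guest_memfd
range to node 0 via mbind(2); the fd, size, and error handling here are made up
for illustration:

#include <numaif.h>
#include <sys/mman.h>

static int bind_gmem_to_node0(int gmem_fd, size_t size)
{
	unsigned long nodemask = 1;	/* bit 0 => NUMA node 0 */
	void *mem;

	/* mmap() support is what makes mbind() usable on guest_memfd. */
	mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, 0);
	if (mem == MAP_FAILED)
		return -1;

	/*
	 * The policy lands in gmem's shared policy via ->set_policy(), so the
	 * binding survives the munmap() below.
	 */
	if (mbind(mem, size, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
		return -1;

	return munmap(mem, size);
}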
> get_vma_policy does this all for you
I assume that doesn't work if the intent is for new VMAs to pick up the existing
policy from guest_memfd? And more importantly, guest_memfd needs to hook
->set_policy so that changes through e.g. mbind() persist beyond the lifetime of
the VMA.
> struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> unsigned long addr, int order, pgoff_t *ilx)
> {
> struct mempolicy *pol;
>
> pol = __get_vma_policy(vma, addr, ilx);
> if (!pol)
> pol = get_task_policy(current);
> if (pol->mode == MPOL_INTERLEAVE ||
> pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> *ilx += vma->vm_pgoff >> order;
> *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> }
> return pol;
> }
>
> Of course you still have the same issue: get_task_policy will return the
> default, because that's what applies.
>
> do_get_mempolicy just seems like the completely incorrect interface to
> be using here.
Re: [f2fs-dev] [PATCH kvm-next V11 4/7] KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
On Thu, Sep 25, 2025, David Hildenbrand wrote:
> On 25.09.25 13:44, Garg, Shivank wrote:
> > On 9/25/2025 8:20 AM, Sean Christopherson wrote:
> > I did functional testing and it works fine.
>
> I can queue this instead. I guess I can reuse the patch description and add
> Sean as author + add his SOB (if he agrees).

Eh, Ackerley and Fuad did all the work.  If I had provided feedback earlier,
this would have been handled in a new version.  If they are ok with the
changes, I would prefer they remain co-authors.

Regarding timing, how much do people care about getting this into 6.18 in
particular?

AFAICT, this hasn't gotten any coverage in -next, which makes me a little
nervous.
Re: [f2fs-dev] [PATCH kvm-next V11 5/7] KVM: guest_memfd: Add slab-allocated inode cache
On Wed, Aug 27, 2025, Shivank Garg wrote:
> Add dedicated inode structure (kvm_gmem_inode_info) and slab-allocated
> inode cache for guest memory backing, similar to how shmem handles inodes.
>
> This adds the necessary allocation/destruction functions and prepares
> for upcoming guest_memfd NUMA policy support changes.
>
> Signed-off-by: Shivank Garg
> ---
>  virt/kvm/guest_memfd.c | 70 ++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 6c66a0974055..356947d36a47 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -17,6 +17,15 @@ struct kvm_gmem {
> struct list_head entry;
> };
>
> +struct kvm_gmem_inode_info {
What about naming this simply gmem_inode?
> + struct inode vfs_inode;
> +};
> +
> +static inline struct kvm_gmem_inode_info *KVM_GMEM_I(struct inode *inode)
And then GMEM_I()?
And then (in a later follow-up if we target this for 6.18, or as a prep patch if
we push this out to 6.19), rename kvm_gmem to gmem_file?
That would make guest_memfd look a bit more like other filesystems, and I don't
see a need to preface the local structures and helpers with "kvm_", e.g.
GMEM_I() is analogous to x86's to_vmx() and to_svm().
As for renaming kvm_gmem => gmem_file, I wandered back into this code via
Ackerley's in-place conversion series, and it took me a good long while to
remember the roles of files vs. inodes in gmem.  That's probably a sign that
the code needs clarification given that I wrote the original code. :-)
Leveraging an old discussion[*], my thought is to get to this:
/*
* A guest_memfd instance can be associated with multiple VMs, each with its own
* "view" of the underlying physical memory.
*
* The gmem's inode is effectively the raw underlying physical storage, and is
* used to track properties of the physical memory, while each gmem file is
* effectively a single VM's view of that storage, and is used to track assets
* specific to its associated VM, e.g. memslots=>gmem bindings.
*/
struct gmem_file {
struct kvm *kvm;
struct xarray bindings;
struct list_head entry;
};
struct gmem_inode {
struct shared_policy policy;
struct inode vfs_inode;
};
[*] https://lore.kernel.org/all/[email protected]
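
For reference, the accessor that would pair with that layout, as a sketch
assuming the gmem_inode naming above:

static inline struct gmem_inode *GMEM_I(struct inode *inode)
{
	return container_of(inode, struct gmem_inode, vfs_inode);
}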
Re: [f2fs-dev] [PATCH kvm-next V11 6/7] KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
On Wed, Aug 27, 2025, Shivank Garg wrote:
> @@ -26,6 +28,9 @@ static inline struct kvm_gmem_inode_info *KVM_GMEM_I(struct inode *inode)
> return container_of(inode, struct kvm_gmem_inode_info, vfs_inode);
> }
>
> +static struct mempolicy *kvm_gmem_get_pgoff_policy(struct kvm_gmem_inode_info *info,
> +						    pgoff_t index);
> +
> /**
> * folio_file_pfn - like folio_file_page, but return a pfn.
> * @folio: The folio which contains this index.
> @@ -112,7 +117,25 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> {
> /* TODO: Support huge pages. */
> - return filemap_grab_folio(inode->i_mapping, index);
> + struct mempolicy *policy;
> + struct folio *folio;
> +
> + /*
> + * Fast-path: See if folio is already present in mapping to avoid
> + * policy_lookup.
> + */
> + folio = __filemap_get_folio(inode->i_mapping, index,
> + FGP_LOCK | FGP_ACCESSED, 0);
> + if (!IS_ERR(folio))
> + return folio;
> +
> +	policy = kvm_gmem_get_pgoff_policy(KVM_GMEM_I(inode), index);
> +	folio = __filemap_get_folio_mpol(inode->i_mapping, index,
> +					 FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
> +					 mapping_gfp_mask(inode->i_mapping), policy);
> + mpol_cond_put(policy);
> +
> + return folio;
> }
>
> static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> @@ -372,8 +395,45 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
> return ret;
> }
>
> +#ifdef CONFIG_NUMA
> +static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> +
> + return mpol_set_shared_policy(&KVM_GMEM_I(inode)->policy, vma, mpol);
> +}
> +
> +static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> + unsigned long addr, pgoff_t *pgoff)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> +
> + *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> + return mpol_shared_policy_lookup(&KVM_GMEM_I(inode)->policy, *pgoff);
> +}
> +
> +static struct mempolicy *kvm_gmem_get_pgoff_policy(struct kvm_gmem_inode_info *info,
> +						    pgoff_t index)
I keep reading this as "page offset policy", as opposed to "policy given a page
offset".  Another oddity that is confusing is that this helper explicitly does
get_task_policy(current), while kvm_gmem_get_policy() lets the caller do that.
The end result is the same, but I think it would be helpful for gmem to be
internally consistent.
If we have kvm_gmem_get_policy() use this helper, then we can kill two birds
with one stone:
static struct mempolicy *__kvm_gmem_get_policy(struct gmem_inode *gi,
pgoff_t index)
{
struct mempolicy *mpol;
mpol = mpol_shared_policy_lookup(&gi->policy, index);
return mpol ? mpol : get_task_policy(current);
}
static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
unsigned long addr, pgoff_t *pgoff)
{
*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
return __kvm_gmem_get_policy(GMEM_I(file_inode(vma->vm_file)), *pgoff);
}
