On 03.03.22 18:21, Michael S. Tsirkin wrote:
> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:
>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>> option is set.
>>
>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
>> ---
>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>  include/hw/boards.h |  1 +
>>  qemu-options.hx     |  6 ++++++
>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>  softmmu/vl.c        |  1 +
>>  trace-events        |  1 +
>>  util/qemu-config.c  |  4 ++++
>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 53a99ab..7739d88 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>      ms->mem_merge = value;
>>  }
>>
>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    return ms->memfd_alloc;
>> +}
>> +
>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    ms->memfd_alloc = value;
>> +}
>> +
>>  static bool machine_get_usb(Object *obj, Error **errp)
>>  {
>>      MachineState *ms = MACHINE(obj);
>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>      object_class_property_set_description(oc, "mem-merge",
>>          "Enable/disable memory merge support");
>>
>> +    object_class_property_add_bool(oc, "memfd-alloc",
>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>> +    object_class_property_set_description(oc, "memfd-alloc",
>> +        "Enable/disable allocating anonymous memory using memfd_create");
>> +
>>      object_class_property_add_bool(oc, "usb",
>>          machine_get_usb, machine_set_usb);
>>      object_class_property_set_description(oc, "usb",
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index 9c1c190..a57d7a0 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -327,6 +327,7 @@ struct MachineState {
>>      char *dt_compatible;
>>      bool dump_guest_core;
>>      bool mem_merge;
>> +    bool memfd_alloc;
>>      bool usb;
>>      bool usb_disabled;
>>      char *firmware;
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 7d47510..33c8173 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>      " vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>      " dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>      " mem-merge=on|off controls memory merge support (default: on)\n"
>> +    " memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
>
> Question: are there any disadvantages associated with using
> memfd_create? I guess we are using up an fd, but that seems minor. Any
> reason not to set to on by default? maybe with a fallback option to
> disable that?
>
> I am concerned that it's actually a kind of memory backend, this flag
> seems to instead be closer to the deprecated mem-prealloc. E.g.
> it does not work with a mem path, does it?
We had a RH-internal discussion some time ago, here is my writeup (note
the TMPFS/SHMEM discussion):

--- snip ---

In QEMU, we specify the type of guest RAM via
* -object memory-backend-ram,...
* -object memory-backend-file,...
* -object memory-backend-memfd,...

We can specify whether to share the memory (share=on -- MAP_SHARED), or
whether to keep modifications local to QEMU (share=off -- MAP_PRIVATE).
Using "share=off" (or using the default) with files/memfd can have some
serious side-effects.

ALERT: "share=off" is the default in QEMU for memory-backend-ram and
memory-backend-file. "share=on" is the default in QEMU only for
memory-backend-memfd.


I. MAP_SHARED vs. MAP_PRIVATE

MAP_SHARED: when reading, read file content; when writing, modify file
content.

MAP_PRIVATE: when reading, read file content, except if there was a
local/private change. When writing, keep change local/private and don't
modify file content.

MAP_PRIVATE sounds like a snapshot, however, in some cases it really
behaves differently -- especially with tmpfs/shmem and when QEMU
discards memory (e.g., with virtio-balloon or during postcopy live
migration).

There is some connection between MAP_PRIVATE and NUMA bindings that I
have yet to fully explore. We could have issues with some MAP_SHARED
mappings and NUMA bindings (IOW: policy getting ignored).


II. Impact on different memory backends/types

II.1. Anonymous memory

Usage:
 -object memory-backend-ram,...

We really want "share=off" in 99.99% of all cases. Shared anonymous RAM
-- i.e., sharing RAM with your child processes -- does not really apply
to QEMU and there are some cases that are broken in QEMU [1]; there is
only a single use case in the context of RDMA -- whereby we only need
shared anonymous memory to make mremap() work, not for actually sharing
RAM with someone else.

II.2. TMPFS/SHMEM

Usage:
 -object memory-backend-memfd,...
 -object memory-backend-file,mem-path=/dev/shm/FILE,...

We really want "share=on" in 99.99999% of all cases.
There is a serious issue when using private mappings on an empty shmem
file, whereby we can get a double memory consumption. The issue is that
even when reading via a private mapping, we will allocate memory for the
actual file (== RAM for tmpfs) -- even if it's just allocating blocks
filled with zero.

So doing a
 -object memory-backend-file,mem-path=/dev/shm/FILE
will in the worst case consume 4G, even though we have an anonymous file
-- *we have to use share=on*.

II.3. Hugetlb

Usage:
 -object memory-backend-memfd,hugetlb=on,hugetlbsize=2M,...
 -object memory-backend-file,mem-path=/dev/shm/FILE,...

We usually want "share=on". However, there seems to be nothing wrong
about using "memory-backend-memfd" -- IOW an anonymous file; it works as
expected in my tests (fallocate() behaves in weird ways, but that's a
different story).

II.4. "Ordinary" Files

Usage:
 -object memory-backend-file,mem-path=/some/file,...

We usually want "share=on" in 99.9% of all cases, to have modifications
go back to the file -- for example, for the "big file" use case where we
want to use the actual file storage as memory backend (for example, when
swapping is not desired), such that we can use the page cache where
possible, but write back the file content to disk when under memory
pressure.

II.5. DAX/PMEM

Usage:
 -object memory-backend-file,mem-path=/dev/dax,...

We want "share=on" in 99.99999% of all cases when using dax/pmem in an
emulated NVDIMM for our guest. We want the changes to go back to
dax/pmem a.k.a. the actual NVDIMM (not some mixture of pmem and system
RAM).


III. MAP_PRIVATE vs. virtio-balloon and postcopy live migration

Dave told me about a use case where we
a) Start a VM with a MAP_SHARED file as guest RAM until it is booted up
b) Save the VM state, *excluding guest RAM*
c) Start multiple VMs using the VM state and the MAP_PRIVATE file as
   guest RAM

This is essentially a fast "guest snapshot".
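
To illustrate the MAP_SHARED vs. MAP_PRIVATE semantics that the snapshot
approach above relies on, here is a minimal sketch in C. It is not QEMU
code: make_ram_file() and file_byte_after_write() are hypothetical helper
names, the one-page memfd merely stands in for a guest RAM file, and it
assumes Linux with a glibc that provides memfd_create().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a one-page memfd standing in for guest RAM. */
static int make_ram_file(void)
{
    int fd = memfd_create("guest-ram-demo", MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, sysconf(_SC_PAGESIZE)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Hypothetical helper: write 'value' at offset 0 through a mapping created
 * with 'flags' (MAP_SHARED or MAP_PRIVATE), then return the byte that the
 * file itself holds at offset 0 afterwards. Returns -1 on error.
 */
static int file_byte_after_write(int fd, int flags, unsigned char value)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char from_file;
    unsigned char *p;

    assert(flags == MAP_SHARED || flags == MAP_PRIVATE);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    p[0] = value;       /* write through the mapping */
    munmap(p, len);

    /* Read the byte back from the file itself, bypassing the mapping. */
    if (pread(fd, &from_file, 1, 0) != 1) {
        return -1;
    }
    return from_file;
}
```

On a fresh fd from make_ram_file(), a write via MAP_PRIVATE should leave
the file content untouched (the change stays local to the mapping),
whereas the same write via MAP_SHARED ends up in the file.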
But beware if you end up discarding memory in QEMU via
ram_block_discard_range(), e.g., via virtio-balloon or via postcopy live
migration. In QEMU, we always discard file content and modified pages in
private mappings.

Problem: If one VM discards memory, it will modify the snapshot. The
snapshot will be broken. New VMs and running VMs will be affected!

Note: We cannot easily teach QEMU to not modify file content when
discarding memory of private mappings. This would break postcopy live
migration in some cases completely.

--- snip ---

-- 
Thanks,

David / dhildenb