On 8/13/2024 2:46 PM, Steven Sistare wrote:
On 8/13/2024 1:00 PM, Alex Williamson wrote:
On Tue, 13 Aug 2024 11:35:15 -0400
Peter Xu <pet...@redhat.com> wrote:

On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
On 8/8/2024 2:32 PM, Steven Sistare wrote:
On 7/29/2024 8:29 AM, Igor Mammedov wrote:
On Sat, 20 Jul 2024 16:28:25 -0400
Steven Sistare <steven.sist...@oracle.com> wrote:
On 7/16/2024 5:19 AM, Igor Mammedov wrote:
On Sun, 30 Jun 2024 12:40:24 -0700
Steve Sistare <steven.sist...@oracle.com> wrote:
Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This affects
memory-backend-ram objects, guest RAM created with the global -m option
but without an associated memory-backend object and without the
-mem-path option.
Nowadays, all machines have been converted to use a memory backend for
VM RAM, so the -m option implicitly creates a memory-backend object,
which will be either MEMORY_BACKEND_FILE if -mem-path is present or
MEMORY_BACKEND_RAM otherwise.
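(For illustration, my example assuming current QEMU syntax: a bare
"-m 4G" on a PC machine behaves roughly like an explicit
"-object memory-backend-ram,id=pc.ram,size=4G" selected as the
machine's memory backend.)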

Yes.  I dropped an important adjective, "implicit".

     "guest RAM created with the global -m option but without an explicit 
associated
     memory-backend object and without the -mem-path option"
To access the same memory in the old and new QEMU processes, the memory
must be mapped shared.  Therefore, the implementation always sets
RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
user must explicitly specify the share option.  In lieu of defining a
new RAM flag, at the lowest level the implementation uses RAM_SHARED
with fd=-1 as the condition for calling memfd_create.
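As a hedged sketch of that last condition (the names are my guesses
from the description, not the actual patch):

      /* Shared anonymous RAM with no backing fd yet: back it with an memfd. */
      if ((new_block->flags & RAM_SHARED) && new_block->fd == -1) {
          new_block->fd = qemu_memfd_create(name, size, false, 0, 0, &err);
      }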
so the statement at the top, that memory-backend-ram is affected, is
not really valid?

memory-backend-ram is affected by alloc-anon.  But in addition, the
user must explicitly add the "share" option.  I don't implicitly set
share in this case, because I would be overriding the user's
specification of the memory object's property, which would be private
if omitted.

Instead of touching implicit RAM (-m), it would be better to error out
and ask the user to provide a properly configured memory-backend
explicitly.

In general I do dislike adding yet another option that will affect
guest RAM allocation (memory-backends  should be sufficient).

However I do see that you need memfd for device memory (vram, roms, ...).
Can we just use memfd/shared unconditionally for those and
avoid introducing a new confusing option?

The Linux kernel has different tunables for backing memfd's with huge
pages, so we could hurt performance if we unconditionally change to
memfd.  The user should have a choice for any segment that is large
enough for huge pages to improve performance, which potentially is any
memory-backend-object.  The non memory-backend objects are small, and
it would be OK to use memfd unconditionally for them.
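(For reference, the tunables differ as follows: THP for anonymous
memory is governed by /sys/kernel/mm/transparent_hugepage/enabled,
while shmem/memfd memory is governed separately by
/sys/kernel/mm/transparent_hugepage/shmem_enabled, so a host where the
latter is "never" would lose huge pages that MAP_ANON memory would
have received.)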

Thanks everyone for your feedback.  The common theme is that you
dislike that the new option modifies the allocation of
memory-backend-objects.  OK, accepted.  I propose to remove that
interaction, and document in the QAPI which backends work for CPR.
Specifically, a memory-backend-memfd or memory-backend-file object is
required, with share=on (which is the default for memory-backend-memfd).
CPR will be blocked otherwise.  The legacy -m option without an
explicit memory-backend-object will not support CPR.
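For concreteness (my example, assuming current option syntax), a
CPR-compatible configuration would look something like:

      -object memory-backend-memfd,id=ram0,size=4G,share=on \
      -machine memory-backend=ram0

rather than a bare -m 4G.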

Non memory-backend-objects (ramblocks not described on the qemu command
line) will always be allocated using memfd_create (on Linux only).  The
alloc-anon option is deleted.
The logic in ram_block_add becomes:

      if (!new_block->host) {
          if (xen_enabled()) {
              ...
          } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
                                          TYPE_MEMORY_BACKEND)) {
              /* internal ramblock: always fd-based so CPR can preserve it */
              qemu_memfd_create(...);
          } else {
              /* user-configured backend: allocation is unchanged */
              qemu_anon_ram_alloc(...);
          }
      }
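(For context: object_dynamic_cast() returns NULL when the object is not
an instance of the given type, so the memfd path above is taken exactly
for ramblocks whose owner is not a user-visible TYPE_MEMORY_BACKEND,
i.e. the internal ROM/vram/flash blocks listed below.)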

Is that acceptable to everyone?  Igor, Peter, Daniel?

Sorry for the late reply.

I think this may not work, as David pointed out: AFAIU it will switch
many old anon use cases to memfd, aka shmem, and it might be
problematic when share=off, since we have a double memory consumption
issue with shmem and private mappings (each written page gets a private
CoW copy on top of the shmem page, which stays in the page cache).

I assume that includes things like "-m", "memory-backend-ram", and
maybe more.  IIUC memory consumption of the VM will double with them.


In a simple test here are the NON-memory-backend-object ramblocks which
are allocated with memfd_create in my new proposal:

   memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
   memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
   memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
   memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
   memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
   memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
   memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
   memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw

Of those, only a subset are mapped for DMA, per the existing QEMU logic,
no changes from me:

   dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
   dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
   dma_map: vga.rom 65536 @ 0x7fffe0400000 ro

I wonder whether there's any case where the "rom"s can be a DMA target
at all...  I understand it's logically possible for them to be READ
from as ROMs, but I am curious what happens if we don't map them at all
when they're ROMs, or whether there's any device that can (in real
life) DMA from device ROMs, and for what use.

   dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
   dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw

system.flash0 is excluded by the vfio listener because it is a rom_device.
The rom@etc blocks are excluded because their MemoryRegions are not added to
any container region, so the flatmem traversal of the AS used by the listener
does not see them.

The BARs should not be mapped IMO, and I propose excluding them in the
iommufd series:
https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sist...@oracle.com/

It looks clear now that they should be there.


Note that the old-QEMU contents of all ramblocks must be preserved,
just like in live migration.  Live migration copies the contents in the
stream.  Live update preserves the contents in place by preserving the
memfd.  Thus memfd serves two purposes: preserving old contents, and
preserving DMA-mapped pinned pages.
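As a minimal standalone illustration of the preservation mechanism (my
sketch, not code from the series): an memfd's pages live in the page
cache for as long as the fd is open, and memfd_create without
MFD_CLOEXEC yields a descriptor that survives exec, so a new QEMU that
inherits the fd can re-mmap it and find the old contents in place:

      #define _GNU_SOURCE
      #include <assert.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void)
      {
          /* No MFD_CLOEXEC, so the fd would survive an exec of new QEMU. */
          int fd = memfd_create("ramblock", 0);
          assert(fd >= 0);
          assert(ftruncate(fd, 4096) == 0);

          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          assert(p != MAP_FAILED);
          strcpy(p, "written by old QEMU");
          munmap(p, 4096);                  /* the old mapping can go away */

          /* A fresh mapping of the same fd still sees the old contents. */
          p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          assert(p != MAP_FAILED && strcmp(p, "written by old QEMU") == 0);
          return 0;
      }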

IMHO the 1st purpose is a fake one.  IOW:

   - Preserving content will be important on large RAM/ROM regions.  When
     it's small, it shouldn't matter a huge deal, IMHO, because this is
     about "how fast we can migrate / live upgrade".  IOW, this is not a
     functional requirement.

Regardless of the size of a ROM region, how would it ever be faster to
migrate ROMs rather than reload them from stable media on the target?
Furthermore, what mechanism other than migrating the ROM do we have to
guarantee the contents of the ROM are identical?

I have a hard time accepting that ROMs are only migrated for
performance and there isn't some aspect of migrating them to ensure the
contents remain identical, and by that token CPR would also need to
preserve the contents to provide the same guarantee.  Thanks,

I agree.  Any ramblock may change across QEMU versions if the contents
are read from a file in the QEMU distribution, or if the contents are
composed by QEMU code.  Live migration guards against this by sending
the old ramblock contents in the migration stream.

Our emails just crossed.  I will repost this to your new reply so we
keep a single conversation thread.

- Steve


   - DMA mapped pinned pages: this, instead, is a hard requirement; we
     must make sure these pages are fd-based, because only an fd-based
     mapping can persist the pages (via the page cache).

IMHO we shouldn't conflate them, and we should start by sticking with
the 2nd goal here.  To be explicit, if we can find a good replacement
for -alloc-anon, IMHO we could still migrate the ramblocks that fall
only into the 1st-purpose category, e.g. device ROMs; hopefully, even
if they're pinned, they are never DMAed to/from.

Thanks,


