On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
> On 8/8/2024 2:32 PM, Steven Sistare wrote:
> > On 7/29/2024 8:29 AM, Igor Mammedov wrote:
> > > On Sat, 20 Jul 2024 16:28:25 -0400
> > > Steven Sistare <steven.sist...@oracle.com> wrote:
> > > 
> > > > On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> > > > > On Sun, 30 Jun 2024 12:40:24 -0700
> > > > > Steve Sistare <steven.sist...@oracle.com> wrote:
> > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create 
> > > > > > depending
> > > > > > on the value of the anon-alloc machine property.  This affects
> > > > > > memory-backend-ram objects, guest RAM created with the global -m 
> > > > > > option
> > > > > > but without an associated memory-backend object and without the 
> > > > > > -mem-path
> > > > > > option
> > > > > nowadays, all machines were converted to use memory backend for VM 
> > > > > RAM.
> > > > > so -m option implicitly creates memory-backend object,
> > > > > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > > > > or MEMORY_BACKEND_RAM otherwise.
> > > > 
> > > > Yes.  I dropped an an important adjective, "implicit".
> > > > 
> > > >     "guest RAM created with the global -m option but without an 
> > > > explicit associated
> > > >     memory-backend object and without the -mem-path option"
> > > > 
> > > > > > To access the same memory in the old and new QEMU processes, the 
> > > > > > memory
> > > > > > must be mapped shared.  Therefore, the implementation always sets
> > > > > > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, 
> > > > > > where the
> > > > > > user must explicitly specify the share option.  In lieu of defining 
> > > > > > a new
> > > > > so statement at the top that memory-backend-ram is affected is not
> > > > > really valid?
> > > > 
> > > > memory-backend-ram is affected by alloc-anon.  But in addition, the 
> > > > user must
> > > > explicitly add the "share" option.  I don't implicitly set share in 
> > > > this case,
> > > > because I would be overriding the user's specification of the memory 
> > > > object's property,
> > > > which would be private if omitted.
> > > 
> > > instead of touching implicit RAM (-m), it would be better to error out
> > > and ask user to provide properly configured memory-backend explicitly.
> > > 
> > > > 
> > > > > > RAM flag, at the lowest level the implementation uses RAM_SHARED 
> > > > > > with fd=-1
> > > > > > as the condition for calling memfd_create.
> > > > > 
> > > > > In general I do dislike adding yet another option that will affect
> > > > > guest RAM allocation (memory-backends  should be sufficient).
> > > > > 
> > > > > However I do see that you need memfd for device memory (vram, roms, 
> > > > > ...).
> > > > > Can we just use memfd/shared unconditionally for those and
> > > > > avoid introducing a new confusing option?
> > > > 
> > > > The Linux kernel has different tunables for backing memfd's with huge 
> > > > pages, so we
> > > > could hurt performance if we unconditionally change to memfd.  The user 
> > > > should have
> > > > a choice for any segment that is large enough for huge pages to improve 
> > > > performance,
> > > > which potentially is any memory-backend-object.  The non memory-backend 
> > > > objects are
> > > > small, and it would be OK to use memfd unconditionally for them.
> > 
> > Thanks everyone for your feedback.  The common theme is that you dislike 
> > that the
> > new option modifies the allocation of memory-backend-objects.  OK, 
> > accepted.  I propose
> > to remove that interaction, and document in the QAPI which backends work 
> > for CPR.
> > Specifically, memory-backend-memfd or memory-backend-file object is 
> > required,
> > with share=on (which is the default for memory-backend-memfd).  CPR will be 
> > blocked
> > otherwise.  The legacy -m option without an explicit memory-backend-object 
> > will not
> > support CPR.
> > 
> > Non memory-backend-objects (ramblocks not described on the qemu command 
> > line) will always
> > be allocated using memfd_create (on Linux only).  The alloc-anon option is 
> > deleted.
> > The logic in ram_block_add becomes:
> > 
> >      if (!new_block->host) {
> >          if (xen_enabled()) {
> >              ...
> >          } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
> >                                          TYPE_MEMORY_BACKEND)) {
> >              qemu_memfd_create()
> >          } else {
> >              qemu_anon_ram_alloc()
> >          }
> > 
> > Is that acceptable to everyone?  Igor, Peter, Daniel?

Sorry for a late reply.

I think this may not work as David pointed out? Where AFAIU it will switch
many old anon use cases to use memfd, aka, shmem, and it might be
problematic when share=off: we have double memory consumption issue with
shmem with private mapping.

I assume that includes things like "-m", "memory-backend-ram", and maybe
more.  IIUC memory consumption of the VM will double with them.

> 
> In a simple test here are the NON-memory-backend-object ramblocks which
> are allocated with memfd_create in my new proposal:
> 
>   memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
>   memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
>   memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
>   memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
>   memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
>   memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
>   memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
>   memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
> 
> Of those, only a subset are mapped for DMA, per the existing QEMU logic,
> no changes from me:
> 
>   dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
>   dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
>   dma_map: vga.rom 65536 @ 0x7fffe0400000 ro

I wonder whether there's any case that the "rom"s can be DMA target at
all..  I understand it's logically possible to be READ from as ROMs, but I
am curious what happens if we don't map them at all when they're ROMs, or
whether there's any device that can (in real life) DMA from device ROMs,
and for what use.

>   dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
>   dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
> 
> system.flash0 is excluded by the vfio listener because it is a rom_device.
> The rom@etc blocks are excluded because their MemoryRegions are not added to
> any container region, so the flatmem traversal of the AS used by the listener
> does not see them.
> 
> The BARs should not be mapped IMO, and I propose excluding them in the
> iommufd series:
>   
> https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sist...@oracle.com/

Looks like this is clear now that they should be there.

> 
> Note that the old-QEMU contents of all ramblocks must be preserved, just like
> in live migration.  Live migration copies the contents in the stream.  Live 
> update
> preserves the contents in place by preserving the memfd.  Thus memfd serves
> two purposes: preserving old contents, and preserving DMA mapped pinned pages.

IMHO the 1st purpose is a fake one.  IOW:

  - Preserving content will be important on large RAM/ROM regions.  When
    it's small, it shouldn't matter a huge deal, IMHO, because this is
    about "how fast we can migrate / live upgrade'.  IOW, this is not a
    functional requirement.

  - DMA mapped pinned pages: instead this is a hard requirement that we
    must make sure these pages are fd-based, because only a fd-based
    mapping can persist the pages (via page cache).

IMHO we shouldn't mangle them, and we should start with sticking with the
2nd goal here.  To be explicit, if we can find a good replacement for
-alloc-anon, IMHO we could still migrate the ramblocks only fall into the
1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
they should never be DMAed to/from.

Thanks,

-- 
Peter Xu


Reply via email to