On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand (da...@redhat.com) wrote:
> On 13.10.21 10:13, david.dai wrote:
> > On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand
> > (da...@redhat.com) wrote:
> > > > >
> > > > > virtio-mem currently relies on having a single sparse memory region (anon mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can share memory with other processes, sharing with other VMs is not intended. Instead of actually mmaping parts dynamically (which can be quite expensive), virtio-mem relies on punching holes into the backend and dynamically allocating memory/file blocks/... on access.
> > > > >
> > > > > So the easy way to make it work is:
> > > > >
> > > > > a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device memory getting managed by the buddy on a separate NUMA node.
> > > > >
> > > >
> > > > Linux kernel buddy system? How to guarantee other applications don't allocate memory from it?
> > >
> > > Excellent question. Usually, you would online the memory to ZONE_MOVABLE, such that even if some other allocation ended up there, it could get migrated somewhere else.
> > >
> > > For example, "daxctl reconfigure-device" tries doing that as default:
> > >
> > > https://pmem.io/ndctl/daxctl-reconfigure-device.html
> > >
> > > However, I agree that we might actually want to tell the system not to use this CPU-less node as a fallback for other allocations, and that we might not want to swap out such memory etc.
> > >
> > >
> > > But, in the end, all that virtio-mem needs to work in the hypervisor is:
> > >
> > > a) A sparse memmap (anonymous RAM, memfd, file)
> > > b) A way to populate memory within that sparse memmap (e.g., on fault, using madvise(MADV_POPULATE_WRITE), fallocate())
> > > c) A way to discard memory (madvise(MADV_DONTNEED), fallocate(FALLOC_FL_PUNCH_HOLE))
> > >
> > > So instead of using anonymous memory+mbind, you can also mmap a sparse file and rely on populate-on-demand. One alternative for your use case would be to create a DAX filesystem on that CXL memory (IIRC that should work) and simply provide virtio-mem with a sparse file located on that filesystem.
> > >
> > > Of course, you can also use some other mechanism as you might have in your approach, as long as it supports a, b and c.
> > >
> > > > >
> > > > > b) (optional) allocate huge pages on that separate NUMA node.
> > > > > c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages), *binding* the memory backend to that special NUMA node.
> > > > >
> > > > "-object memory-backend-ram or memory-backend-memfd, id=mem0, size=768G"
> > > > How to bind backend memory to a NUMA node?
> > > >
> > >
> > > I think the syntax is "policy=bind,host-nodes=X"
> > >
> > > whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for "5" "host-nodes=0x20", etc.
> > >
> > > > >
> > > > > This will dynamically allocate memory from that special NUMA node, resulting in the virtio-mem device being completely backed by that device memory, being able to dynamically resize the memory allocation.
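
(Putting the pieces together for my own understanding: assuming the CXL memory shows up as a CPU-less host node 1 after dax/kmem, and that we want 2M huge pages, I imagine the option fragments would look roughly like the sketch below. The ids, sizes and node numbers are only placeholders and I have not tested this.)

    # memory backend bound to the CXL node (optional huge pages, step b):
    -object memory-backend-memfd,id=mem0,size=768G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1

    # virtio-mem device backed by it; requested-size can be grown/shrunk later:
    -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0

    # plus something like "-m 4G,maxmem=772G" so there is room to plug the 768G

(host-nodes=1 is meant to select host node 1, the dax/kmem CXL node in this made-up setup.)
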
> > > > >
> > > > > Exposing an actual devdax to the virtio-mem device, shared by multiple VMs, isn't really what we want and won't work without major design changes. Also, I'm not so sure it's a very clean design: exposing memory belonging to other VMs to unrelated QEMU processes. This sounds like a serious security hole: if you managed to escalate to the QEMU process from inside the VM, you can access unrelated VM memory quite happily. You want an abstraction in between that makes sure each VM/QEMU process only sees private memory: for example, the buddy via dax/kmem.
> > > > >
> > > > Hi David,
> > > > Thanks for your suggestion, and sorry for my delayed reply due to my long vacation.
> > > > How does the current virtio-mem dynamically attach memory to the guest, via page fault?
> > >
> > > Essentially you have a large sparse mmap. Within that mmap, memory is populated on demand. Instead of mmap/munmap, you perform a single large mmap and then dynamically populate/discard memory.
> > >
> > > Right now, memory is populated via page faults on access. This is sub-optimal when dealing with limited resources (e.g., hugetlbfs, file blocks) and you might run out of backend memory.
> > >
> > > I'm working on a "prealloc" mode, which will preallocate/populate the memory necessary for exposing the next block of memory to the VM, and which fails gracefully if preallocation/population fails in the case of such limited resources.
> > >
> > > The patch resides at:
> > > https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
> > >
> > > commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
> > > Author: David Hildenbrand <da...@redhat.com>
> > > Date:   Mon Aug 2 19:51:36 2021 +0200
> > >
> > >     virtio-mem: support "prealloc=on" option
> > >
> > >     Especially for hugetlb, but also for file-based memory backends, we'd
> > >     like to be able to prealloc memory, especially to make user errors less
> > >     severe: crashing the VM when there are not sufficient huge pages around.
> > >
> > >     A common option for hugetlb will be using "reserve=off,prealloc=off" for
> > >     the memory backend and "prealloc=on" for the virtio-mem device. This
> > >     way, no huge pages will be reserved for the process, but we can recover
> > >     if there are no actual huge pages when plugging memory.
> > >
> > >     Signed-off-by: David Hildenbrand <da...@redhat.com>
> > >
> > > --
> > > Thanks,
> > >
> > > David / dhildenb
> >
> > Hi David,
> >
> > After reading the virtio-mem code, I understand what you have expressed. Please allow me to describe my understanding of virtio-mem, so that we have an aligned view.
> >
> > Virtio-mem:
> > The virtio-mem device initializes and reserves a memory area (GPA); later dynamic memory growing/shrinking will not exceed this scope. memory-backend-ram maps anonymous memory over the whole area, but no RAM is attached because Linux has a policy of delayed allocation.
>
> Right, but it can also be any sparse file (memory-backend-memfd, memory-backend-file).
>
> > When the virtio-mem driver wants to dynamically add memory to the guest, it first requests a region from the reserved memory area, then notifies the virtio-mem device to record the information (the virtio-mem device doesn't make a real memory allocation).
> > After receiving a response from the virtio-mem device, the virtio-mem driver will online the requested region and add it to the Linux page allocator.
>
> In the upcoming prealloc=on mode I referenced, the allocation will happen before the guest is notified about success and starts using the memory.
>
> With vfio/mdev support, the allocation already happens today, when vfio/mdev is notified about the populated memory ranges (see RamDiscardManager). That's essentially what makes virtio-mem device passthrough work.
>
> > Real RAM allocation will happen via page fault when a guest CPU accesses it. Memory shrinking will be achieved by madvise().
>
> Right, but you could write a custom virtio-mem driver that pools this memory differently.
>
> Memory shrinking in the hypervisor is done using either madvise(MADV_DONTNEED) or fallocate(FALLOC_FL_PUNCH_HOLE).
>
> >
> > Questions:
> > 1. Heterogeneous computing: memory may be accessed by CPUs on the host side and the device side, so delayed memory allocation is not suitable. Host software (for instance, OpenCL) may allocate a buffer for the computing device to place a computing result in.
>
> That already works with virtio-mem with vfio/mdev via the RamDiscardManager infrastructure introduced recently. With "prealloc=on", the delayed memory allocation can also be avoided without vfio/mdev.
>
> > 2. We hope to build our own page allocator in the host kernel, so that it can offer a customized mmap() method to build VA->PA mappings in the MMU and IOMMU.
>
> Theoretically, you can wire up pretty much any driver in QEMU like vfio/mdev via the RamDiscardManager. From there, you can issue whatever syscall you need to populate memory when plugging new memory blocks. All you need to support is a sparse mmap and a way to populate/discard memory. Populate/discard could be wired up in the QEMU virtio-mem code as you need it.
>
> > 3. Some potential requirements also require our driver to manage memory, so that page-size granularity can be controlled to fit a small device IOTLB cache.
> > CXL has a bias mode for HDM (host-managed device memory), which needs physical addresses to switch bias mode between host access and device access. These tell us that having the driver manage memory is mandatory.
>
> I think if you write your driver in a certain way and wire it up in QEMU virtio-mem accordingly (e.g., using a new memory-backend-whatever), that shouldn't be an issue.
>
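To make sure I understand the a/b/c contract our backend would have to satisfy before we wire anything up, this is the minimal sequence I have in mind. Anonymous memory stands in for our allocator here; block size and error handling are only illustrative, and I have not tried to build this:

    #include <stddef.h>
    #include <sys/mman.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23   /* Linux 5.14+; define in case headers are older */
    #endif

    #define BLOCK_SIZE (2UL * 1024 * 1024)   /* one virtio-mem block, example value */

    int main(void)
    {
        size_t region_size = 16 * BLOCK_SIZE;

        /* a) One large sparse mapping covering the whole device region;
         *    nothing is backed by actual RAM yet. */
        char *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (region == MAP_FAILED)
            return 1;

        /* b) Populate one block when it gets plugged, instead of relying on
         *    page faults later (fallocate() would be the file-backed analog). */
        if (madvise(region, BLOCK_SIZE, MADV_POPULATE_WRITE))
            return 1;

        /* ... the guest would now use this block ... */

        /* c) Discard the block again on unplug, freeing the backing memory
         *    while keeping the sparse mapping itself intact
         *    (fallocate(FALLOC_FL_PUNCH_HOLE) for file-backed memory). */
        if (madvise(region, BLOCK_SIZE, MADV_DONTNEED))
            return 1;

        munmap(region, region_size);
        return 0;
    }

If our page allocator's mmap() can provide that mapping and our driver gives us populate/discard hooks with the same semantics, then the rest should indeed only be QEMU wiring.
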
Thanks a lot, so let me have a try.

> --
> Thanks,
>
> David / dhildenb
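
P.S. For the hugetlb case, once I am on your virtio-mem-next branch, I read the commit message as suggesting option fragments along these lines (ids, size and path are again only placeholders):

    # backend: don't reserve or preallocate huge pages up front
    -object memory-backend-file,id=mem0,size=256G,mem-path=/dev/hugepages,reserve=off,prealloc=off

    # device: preallocate only when actually plugging memory, so running out
    # of huge pages fails gracefully instead of crashing the VM
    -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0,prealloc=on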