On Wed, Oct 13, 2021 at 10:33:39AM +0200, David Hildenbrand (da...@redhat.com) wrote:
> On 13.10.21 10:13, david.dai wrote:
> > On Mon, Oct 11, 2021 at 09:43:53AM +0200, David Hildenbrand
> > (da...@redhat.com) wrote:
> > > > >
> > > > > virtio-mem currently relies on having a single sparse memory region (anon mmap, mmaped file, mmaped huge pages, mmap shmem) per VM. Although we can share memory with other processes, sharing with other VMs is not intended. Instead of actually mmaping parts dynamically (which can be quite expensive), virtio-mem relies on punching holes into the backend and dynamically allocating memory/file blocks/... on access.
> > > > >
> > > > > So the easy way to make it work is:
> > > > >
> > > > > a) Exposing the CXL memory to the buddy via dax/kmem, resulting in device memory getting managed by the buddy on a separate NUMA node.
> > > > >
> > > >
> > > > Linux kernel buddy system? How to guarantee other applications don't allocate memory from it?
> > >
> > > Excellent question. Usually, you would online the memory to ZONE_MOVABLE, such that even if some other allocation ended up there, it could get migrated somewhere else.
> > >
> > > For example, "daxctl reconfigure-device" tries doing that as default:
> > >
> > > https://pmem.io/ndctl/daxctl-reconfigure-device.html
> > >
> > > However, I agree that we might actually want to tell the system not to use this CPU-less node as a fallback for other allocations, and that we might not want to swap out such memory etc.
> > >
> > >
> > > But, in the end, all that virtio-mem needs to work in the hypervisor is:
> > >
> > > a) A sparse memmap (anonymous RAM, memfd, file)
> > > b) A way to populate memory within that sparse memmap (e.g., on fault, using madvise(MADV_POPULATE_WRITE), fallocate())
> > > c) A way to discard memory (madvise(MADV_DONTNEED), fallocate(FALLOC_FL_PUNCH_HOLE))
> > >
> > > So instead of using anonymous memory+mbind, you can also mmap a sparse file and rely on populate-on-demand. One alternative for your use case would be to create a DAX filesystem on that CXL memory (IIRC that should work) and simply provide virtio-mem with a sparse file located on that filesystem.
> > >
> > > Of course, you can also use some other mechanism as you might have in your approach, as long as it supports a, b and c.
> > >
> > > > >
> > > > > b) (optional) allocate huge pages on that separate NUMA node.
> > > > > c) Use ordinary memory-device-ram or memory-device-memfd (for huge pages), *binding* the memory backend to that special NUMA node.
> > > > >
> > > > "-object memory-backend-ram or memory-backend-memfd, id=mem0, size=768G"
> > > > How to bind backend memory to a NUMA node?
> > > >
> > >
> > > I think the syntax is "policy=bind,host-nodes=X"
> > >
> > > whereby X is a node mask. So for node "0" you'd use "host-nodes=0x1", for "5" "host-nodes=0x20", etc.
> > >
> > > > >
> > > > > This will dynamically allocate memory from that special NUMA node, resulting in the virtio-mem device being completely backed by that device memory, being able to dynamically resize the memory allocation.
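
(Putting the pieces together for my own understanding: assuming the CXL memory shows up as a CPU-less host node 1 after dax/kmem, and that we want 2M huge pages, I imagine the option fragments would look roughly like the sketch below. The ids, sizes and node numbers are only placeholders and I have not tested this.)

    # memory backend bound to the CXL node (optional huge pages, step b):
    -object memory-backend-memfd,id=mem0,size=768G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1

    # virtio-mem device backed by it; requested-size can be grown/shrunk later:
    -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0

    # plus something like "-m 4G,maxmem=772G" so there is room to plug the 768G

(host-nodes=1 is meant to select host node 1, the dax/kmem CXL node in this made-up setup.)
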
> > > > >
> > > > > Exposing an actual devdax to the virtio-mem device, shared by multiple VMs, isn't really what we want and won't work without major design changes. Also, I'm not so sure it's a very clean design: exposing memory belonging to other VMs to unrelated QEMU processes. This sounds like a serious security hole: if you managed to escalate to the QEMU process from inside the VM, you can access unrelated VM memory quite happily. You want an abstraction in between that makes sure each VM/QEMU process only sees private memory: for example, the buddy via dax/kmem.
> > > > >
> > > > Hi David,
> > > > Thanks for your suggestion, and sorry for my delayed reply due to my long vacation.
> > > > How does the current virtio-mem dynamically attach memory to the guest, via page fault?
> > >
> > > Essentially you have a large sparse mmap. Within that mmap, memory is populated on demand. Instead of mmap/munmap, you perform a single large mmap and then dynamically populate/discard memory.
> > >
> > > Right now, memory is populated via page faults on access. This is sub-optimal when dealing with limited resources (e.g., hugetlbfs, file blocks) and you might run out of backend memory.
> > >
> > > I'm working on a "prealloc" mode, which will preallocate/populate the memory necessary for exposing the next block of memory to the VM, and which fails gracefully if preallocation/population fails in the case of such limited resources.
> > >
> > > The patch resides at:
> > > https://github.com/davidhildenbrand/qemu/tree/virtio-mem-next
> > >
> > > commit ded0e302c14ae1b68bdce9059dcca344e0a5f5f0
> > > Author: David Hildenbrand <da...@redhat.com>
> > > Date:   Mon Aug 2 19:51:36 2021 +0200
> > >
> > >     virtio-mem: support "prealloc=on" option
> > >
> > >     Especially for hugetlb, but also for file-based memory backends, we'd
> > >     like to be able to prealloc memory, especially to make user errors less
> > >     severe: crashing the VM when there are not sufficient huge pages around.
> > >
> > >     A common option for hugetlb will be using "reserve=off,prealloc=off" for
> > >     the memory backend and "prealloc=on" for the virtio-mem device. This
> > >     way, no huge pages will be reserved for the process, but we can recover
> > >     if there are no actual huge pages when plugging memory.
> > >
> > >     Signed-off-by: David Hildenbrand <da...@redhat.com>
> > >
> > > --
> > > Thanks,
> > >
> > > David / dhildenb
> >
> > Hi David,
> >
> > After reading the virtio-mem code, I understand what you have expressed. Please allow me to describe my understanding of virtio-mem, so that we have an aligned view.
> >
> > Virtio-mem:
> > The virtio-mem device initializes and reserves a memory area (GPA); later dynamic memory growing/shrinking will not exceed this scope. memory-backend-ram maps anonymous memory over the whole area, but no RAM is attached because Linux has a policy of delayed allocation.
>
> Right, but it can also be any sparse file (memory-backend-memfd, memory-backend-file).
>
> > When the virtio-mem driver wants to dynamically add memory to the guest, it first requests a region from the reserved memory area, then notifies the virtio-mem device to record the information (the virtio-mem device doesn't make a real memory allocation).
> > After receiving a response from the virtio-mem device, the virtio-mem driver will online the requested region and add it to the Linux page allocator.
>
> In the upcoming prealloc=on mode I referenced, the allocation will happen before the guest is notified about success and starts using the memory.
>
> With vfio/mdev support, the allocation already happens today, when vfio/mdev is notified about the populated memory ranges (see RamDiscardManager). That's essentially what makes virtio-mem device passthrough work.
>
> > Real RAM allocation will happen via page fault when a guest CPU accesses it. Memory shrinking will be achieved by madvise().
>
> Right, but you could write a custom virtio-mem driver that pools this memory differently.
>
> Memory shrinking in the hypervisor is done using either madvise(MADV_DONTNEED) or fallocate(FALLOC_FL_PUNCH_HOLE).
>
> >
> > Questions:
> > 1. Heterogeneous computing: memory may be accessed by CPUs on the host side and the device side, so delayed memory allocation is not suitable. Host software (for instance, OpenCL) may allocate a buffer for the computing device to place a computing result in.
>
> That already works with virtio-mem with vfio/mdev via the RamDiscardManager infrastructure introduced recently. With "prealloc=on", the delayed memory allocation can also be avoided without vfio/mdev.
>
> > 2. We hope to build our own page allocator in the host kernel, so that it can offer a customized mmap() method to build VA->PA mappings in the MMU and IOMMU.
>
> Theoretically, you can wire up pretty much any driver in QEMU like vfio/mdev via the RamDiscardManager. From there, you can issue whatever syscall you need to populate memory when plugging new memory blocks. All you need to support is a sparse mmap and a way to populate/discard memory. Populate/discard could be wired up in the QEMU virtio-mem code as you need it.
>
> > 3. Some potential requirements also require our driver to manage memory, so that page-size granularity can be controlled to fit a small device IOTLB cache.
> > CXL has a bias mode for HDM (host-managed device memory), which needs physical addresses to switch bias mode between host access and device access. These tell us that having the driver manage memory is mandatory.
>
> I think if you write your driver in a certain way and wire it up in QEMU virtio-mem accordingly (e.g., using a new memory-backend-whatever), that shouldn't be an issue.
>
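To make sure I understand the a/b/c contract our backend would have to satisfy before we wire anything up, this is the minimal sequence I have in mind. Anonymous memory stands in for our allocator here; block size and error handling are only illustrative, and I have not tried to build this:

    #include <stddef.h>
    #include <sys/mman.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23   /* Linux 5.14+; define in case headers are older */
    #endif

    #define BLOCK_SIZE (2UL * 1024 * 1024)   /* one virtio-mem block, example value */

    int main(void)
    {
        size_t region_size = 16 * BLOCK_SIZE;

        /* a) One large sparse mapping covering the whole device region;
         *    nothing is backed by actual RAM yet. */
        char *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (region == MAP_FAILED)
            return 1;

        /* b) Populate one block when it gets plugged, instead of relying on
         *    page faults later (fallocate() would be the file-backed analog). */
        if (madvise(region, BLOCK_SIZE, MADV_POPULATE_WRITE))
            return 1;

        /* ... the guest would now use this block ... */

        /* c) Discard the block again on unplug, freeing the backing memory
         *    while keeping the sparse mapping itself intact
         *    (fallocate(FALLOC_FL_PUNCH_HOLE) for file-backed memory). */
        if (madvise(region, BLOCK_SIZE, MADV_DONTNEED))
            return 1;

        munmap(region, region_size);
        return 0;
    }

If our page allocator's mmap() can provide that mapping and our driver gives us populate/discard hooks with the same semantics, then the rest should indeed only be QEMU wiring.
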
Thanks a lot, so let me have a try.

> --
> Thanks,
>
> David / dhildenb
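
P.S. For the hugetlb case, once I am on your virtio-mem-next branch, I read the commit message as suggesting option fragments along these lines (ids, size and path are again only placeholders):

    # backend: don't reserve or preallocate huge pages up front
    -object memory-backend-file,id=mem0,size=256G,mem-path=/dev/hugepages,reserve=off,prealloc=off

    # device: preallocate only when actually plugging memory, so running out
    # of huge pages fails gracefully instead of crashing the VM
    -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0,prealloc=on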