Re: [PATCH v4 for v7.6.0 00/14] Introduce virtio-mem model

David Hildenbrand Fri, 10 Sep 2021 07:39:25 -0700

Hi,


sorry for replying this late.

Thanks for looking into this. It's a fairly long list, so it's
understandable that it took a while. :)



5. Slot handling.

As already discussed, virtio-mem and virtio-pmem don't need slots. Yet,
the "slots" definition is required and libvirt reserves one slot for
each such device ("error: unsupported configuration: memory device count
'2' exceeds slots count '1'"). This is certainly future work, if we ever
want to change that.


I can look into this.


Yeah, but it can certainly be considered as future work as well.


7. File source gets silently dropped

     <source>
       <path>/dev/shmem/vm0</path>
     </source>

The statement gets silently dropped, which is somewhat surprising.
However, I did not test what happens with DIMMs, maybe it's the same.


Yeah, this is somewhat expected. I mean, the part that's expected is
that libvirt drops parts it doesn't parse. Sometimes it is pretty
obvious (<source someRandomAttribute='value'/>) and sometimes it's not
so (like in your example when <path/> makes sense for other memory
models like virtio-pmem). But just to be sure - since virtio-mem can be
backed by any memory-backing-* backend, does it make sense to have
<path/> there? So far my code would use memory-backend-file for
hugepages only.

So it could be backed by a file residing on a filesystem that supports
sparse files (shmem, hugetlbfs, ext4, ...) -- IOW anything modern :)

It's supposed to work, but if it makes your life easier, we can consider
supporting other file sources future work.



8. Global preallocation of memory

With

<memoryBacking>
     <allocation mode="immediate"/>
</memoryBacking>

we also get "prealloc=on" set for the memory backends of the virtio-mem
devices, which is sub-optimal, because we end up preallocating all
memory of the memory backend (which is unexpected for a virtio-mem
device) and virtio-mem will then discard all memory immediately again.
So it's essentially a dangerous NOP -- dangerous because we temporarily
consume a lot of memory.

In an ideal world, we would not set this for the memory backend used for
the virtio-mem devices, but for the virtio-mem devices themselves, such
that preallocation happens when new memory blocks are actually exposed
to the VM.

As virtio-mem does not support "prealloc=on" for virtio-mem devices yet,
this is future work. We might want to error out, though, if <allocation
mode="immediate"/> is used along with virtio-mem devices for now. I'm
planning on implementing this in QEMU soon. Until then, it might also be
good enough to simply document that this setup should be avoided.


Right. Meanwhile this was implemented in QEMU and thus I can drop
prealloc=on. But then my question is what happens when user wants to
expose additional memory to the guest but doesn't have enough free
hugepages in the pool? Libvirt's using prealloc=on so that QEMU doesn't
get killed later, after the guest booted up.

So "prealloc=on" support for virtio-mem is unfortunately not part of
QEMU yet (only "reserve=off" for memory backends). As you correctly
state, until that is in place, huge pages cannot be used in a safe way
with virtio-mem, which is why they are not officially supported yet by
virtio-mem.

The idea is to specify "prealloc=on" on the virtio-mem device level once
supported, instead of on the memory backend level. So virtio-mem will
preallocate the relevant memory before actually giving new blocks to the
guest via virtio-mem, not when creating the memory backend. If
preallocation fails at that point, no new blocks are given to the guest
and we won't get killed.

Think of it like this: you defer preallocation to the point where you
actually use the memory and still handle preallocation errors in a safe
way.


More details are below.



9. Memfd and huge pages

<memoryBacking>
     <source type="memfd"/>
</memoryBacking>

and

<memory model='virtio-mem' access='shared'>
   <source>
     <pagesize unit='KiB'>2048</pagesize>
   </source>
   ...
</memory>


I get on the QEMU cmdline

"-object
{"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"prealloc":true,"size":17179869184}"


Dropping the "memfd" source, I get on the QEMU cmdline:

-object {"qom-type":"memory-backend-file","id":"memvirtiomem0","mem-path":"/dev/hugepages/libvirt/qemu/2-Fedora34-2","share":true,"size":17179869184}


"prealloc":true should not have been added for virtio-mem in case of
memfd. !memfd does what's expected.



Okay, I will fix this. But can you shed more light here? I mean, why the
difference?

Assume you want a 1TB virtio-mem device backed by huge pages but
initially only expose 1GB to the VM.

When setting prealloc=on on the memory backend, we will preallocate 1TB
of huge pages when starting QEMU, only to discard the memory immediately
again within virtio-mem startup code (first thing it does is make sure
there is no memory backing at all, meaning the memory backend is
completely "empty"). We end up with no preallocated memory, having
temporarily allocated 1TB.

When setting "prealloc=on" (once supported) on the virtio-mem device
instead, we'll preallocate memory dynamically as we hand it to the VM --
so initially only 1GB.

Assume we want to give the VM an additional 16GB via that virtio-mem
device. virtio-mem will dynamically try preallocating the memory before
giving the guest 16GB. Assume only 8GB could be preallocated, then the
VM will only get an additional 8GB and we won't crash.
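
To make the two cases above concrete, here is a toy model in Python. It
is only a sketch: the huge page pool is a plain counter in GiB, and the
class and function names are made up for illustration. Nothing here is
QEMU or libvirt code.

```python
class HugePagePool:
    """Counter standing in for a pool of free huge pages (sizes in GiB)."""

    def __init__(self, free_gib):
        self.free_gib = free_gib
        self.used_gib = 0
        self.peak_used_gib = 0

    def allocate_up_to(self, want_gib):
        """Allocate as much as is available; return the amount obtained."""
        got = min(want_gib, self.free_gib)
        self.free_gib -= got
        self.used_gib += got
        self.peak_used_gib = max(self.peak_used_gib, self.used_gib)
        return got

    def release(self, gib):
        self.free_gib += gib
        self.used_gib -= gib


def start_with_backend_prealloc(pool, device_gib):
    """prealloc=on on the backend: allocate everything, then discard it all."""
    got = pool.allocate_up_to(device_gib)
    pool.release(got)  # virtio-mem startup empties the backend again
    if got < device_gib:
        raise MemoryError("backend preallocation failed at startup")


def plug_with_device_prealloc(pool, request_gib):
    """prealloc=on on the device: plug only what could be preallocated."""
    return pool.allocate_up_to(request_gib)


# Backend-level: a 1TB device briefly needs 1TB of huge pages at startup.
big_pool = HugePagePool(free_gib=1024)
start_with_backend_prealloc(big_pool, device_gib=1024)
print(big_pool.peak_used_gib, big_pool.used_gib)  # 1024 0

# Device-level: 9GB free; expose 1GB initially, then ask for 16GB more.
small_pool = HugePagePool(free_gib=9)
initial = plug_with_device_prealloc(small_pool, 1)
extra = plug_with_device_prealloc(small_pool, 16)
print(initial, extra)  # 1 8
```

In the backend-level case the peak usage equals the full device size
even though nothing stays allocated afterwards. In the device-level
case the request for 16GB is only partially satisfied, and nothing
fails.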


[...]

11. Reservation of memory

With new QEMU versions we'll want to pass "reserve=off" for the memory
backend used, especially with hugepages and private mappings. While this
change was merged into QEMU, it's not part of an official release yet.
Future work.

https://lore.kernel.org/qemu-devel/20210510114328.21835-1-da...@redhat.com/

Otherwise, when we don't have the "size" currently in free and
"unreserved" hugepages, we'll fail with "qemu-system-x86_64: unable to
map backing store for guest RAM: Cannot allocate memory". The same thing
can easily happen on anonymous memory when memory overcommit isn't
disabled.

So this is future work, but at least the QEMU part is already upstream.



So what's the difference between reserve and prealloc?


It's complicated, especially because of the way huge pages work.

Say you mmap() 1TB of huge pages. Linux will "reserve" 1TB of huge pages
and fail mmap() if it can't. BUT it will not preallocate huge pages yet,
they are only accounted for in the OS as reserved for this mapping.

The idea is that you cannot really overcommit huge pages in the
traditional sense, so the reservation mechanism was implemented to make
it harder for user space to do something stupid. BUT we still need
preallocation because the whole "huge page reservation" code is broken
and not NUMA aware!

This, however, breaks the idea of virtio-mem, where you want to
dynamically decide how much memory you actually give to a VM. If you
reserve all huge pages of the memory backend upfront, they cannot be
used for anything else in the meantime and you can just stop using
virtio-mem and use a large DIMM instead.
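
The difference between reserving and allocating can be sketched with a
small accounting model. This is a deliberately simplified picture, not
the kernel's implementation: the class and method names are
hypothetical, NUMA is ignored, and a failed fault is modelled as a
Python exception, whereas a real process would receive SIGBUS.

```python
class HugetlbPool:
    """Simplified accounting: `free` pages, of which `reserved` are promised."""

    def __init__(self, total_pages):
        self.free = total_pages
        self.reserved = 0

    def mmap(self, pages, reserve=True):
        """Create a mapping; with reserve=True, fail early if pages are short."""
        if reserve:
            if self.free - self.reserved < pages:
                raise MemoryError("Cannot allocate memory")
            self.reserved += pages
        return {"pages": pages, "reserved_left": pages if reserve else 0}

    def fault(self, mapping):
        """Populate one page of the mapping (first touch or preallocation)."""
        if mapping["reserved_left"] > 0:
            # Consume a page that was promised to this mapping at mmap() time.
            mapping["reserved_left"] -= 1
            self.reserved -= 1
        elif self.free - self.reserved < 1:
            raise MemoryError("no free huge page at fault time")
        self.free -= 1


# With reservation: nothing is allocated yet, but every page is promised.
pool = HugetlbPool(total_pages=8)
big = pool.mmap(8, reserve=True)
print(pool.free, pool.reserved)  # 8 8
try:
    pool.mmap(1, reserve=True)
    second_mapping_ok = True
except MemoryError:
    second_mapping_ok = False

# Without reservation: an oversized mapping succeeds, faults may fail later.
pool2 = HugetlbPool(total_pages=8)
lazy = pool2.mmap(1024, reserve=False)
for _ in range(8):
    pool2.fault(lazy)
try:
    pool2.fault(lazy)
    ninth_fault_ok = True
except MemoryError:
    ninth_fault_ok = False
```

The first half shows why reserving the whole backend blocks every other
user of the pool, and the second half shows why dropping the
reservation only makes sense if something else (preallocation at the
virtio-mem device level) catches the shortage in a safe way.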


In the end, what we want in virtio-mem with huge pages in the future is:
* reserve=off for the memory backend: don't reserve any huge pages in
  the OS, we'll be preallocating instead.
* prealloc=on for the virtio-mem device: preallocate memory dynamically
  when really about to be used by the VM and fail in a safe way if
  preallocation fails.
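
As a rough illustration of that target setup, the following Python
snippet builds the two JSON property sets. It reuses the backend id,
mem-path and size from the cmdline quoted earlier. The device id
"virtiomem0" is made up, the "prealloc" key on the device is exactly
the property that is not supported yet, and passing the device as JSON
to -device is an assumption as well, so treat the whole thing as a
sketch of intent rather than a working cmdline.

```python
import json

# Memory backend: no "prealloc" here, and no huge page reservation.
backend = {
    "qom-type": "memory-backend-file",
    "id": "memvirtiomem0",
    "mem-path": "/dev/hugepages/libvirt/qemu/2-Fedora34-2",
    "share": True,
    "reserve": False,
    "size": 17179869184,
}

# virtio-mem device: preallocation moves here (hypothetical for now).
# Other device properties are omitted on purpose.
device = {
    "driver": "virtio-mem-pci",
    "id": "virtiomem0",       # hypothetical id
    "memdev": backend["id"],
    "prealloc": True,         # assumed future property, see above
}

object_arg = json.dumps(backend, separators=(",", ":"))
device_arg = json.dumps(device, separators=(",", ":"))
print("-object", object_arg)
print("-device", device_arg)
```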

In addition to that, "reserve=off" can be useful with virtio-mem also
when backed by ordinary system RAM where we don't use preallocation,
just due to the way some memory overcommit modes work. But that's also
stuff for the future to optimize and you don't have to bother about that
just now. :)


--
Thanks,

David / dhildenb
