On 22.09.2020 09:26, David Hildenbrand wrote:
> On 22.09.20 00:22, Maciej S. Szmigiero wrote:
>> Hi David,
>>
>> Thank you for your comments.
>>
(...)
>>
>> The idea is to use virtual DIMM sticks for hot-adding extra memory at
>> runtime, while using ballooning for runtime adjustment of the guest
>> memory size within the current maximum.
>>
>> When the guest is rebooted the virtual DIMMs configuration is adjusted
>> by the software controlling QEMU (some are removed and / or some are
>> added) to give the guest the same effective memory size as it had before
>> the reboot.
> 
> Okay, so while "the ACPI DIMM slot limit does not apply", the KVM memory
> slot limit (currently) applies, resulting in exactly the same behavior.
>
> The only (conceptual difference) I am able to spot is then a
> notification to the user on reboot, so the guest memory layout can be
> adjusted (which I consider very ugly, but it's the same thing when
> mixing ballooning and DIMMs - which is why it's usually never done).

If you want to shrink a guest at runtime you'll pretty much have to use
ballooning as {ACPI-based PC, virtual} DIMM stick sizes are far too
large to make anything but rough adjustments to the guest memory size.

In addition to that, with ACPI-based PC DIMM hotplug it is the host that
chooses which particular DIMM stick to unplug while having no feedback
from the guest about how much of each DIMM stick memory range is
currently in use and so will have to be copied somewhere else.

I know that this is a source of significant hot removal slowdown, especially
when a "ripple effect" happens on removal:
1) There are 3 extra DIMMs plugged into the guest: A, B, C.
   A and B are nearly empty, but C is nearly full.

2) The host does not know which DIMM is empty and which is full,
   so it requests the guest to unplug the stick C.

3) The guest copies the content of the stick C to the stick B.

4) Once again, the host does not know which DIMM is empty and
   which is full, so it requests the guest to unplug the stick B.

5) The guest now has to copy the same data from the stick B to the
   stick A, once again.

With virtual DIMM sticks + this driver it is the guest which chooses
which particular pages to release, hopefully choosing the already unused
ones.
Once the whole memory behind a DIMM stick is released the host knows
that it can be unplugged now without any copying.
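
As a rough illustration of the ripple effect described above, here is a
toy model in Python. It is not taken from the driver: the stick
capacity, the usage figures and the copy_cost() helper are all
hypothetical, chosen only to make the numbers easy to follow.

```python
def copy_cost(used_mb, plan, capacity_mb=1024):
    """Return the total MB copied while executing an unplug plan.

    used_mb maps stick name -> MB currently in use (hypothetical data).
    plan is a list of (source, target) pairs: the source stick is
    unplugged and its content is migrated to the target stick.
    """
    used = dict(used_mb)  # work on a copy, leave the input untouched
    total = 0
    for src, dst in plan:
        moved = used.pop(src)
        if used[dst] + moved > capacity_mb:
            raise ValueError(f"stick {dst} cannot hold data from {src}")
        used[dst] += moved
        total += moved
    return total


# A and B are nearly empty, C is nearly full (made-up figures).
USED_MB = {"A": 50, "B": 50, "C": 900}

# The host picks blindly: C first, then B (steps 2 to 5 above).
RIPPLE_PLAN = [("C", "B"), ("B", "A")]

# What a host with usage feedback could have picked instead.
INFORMED_PLAN = [("A", "C"), ("B", "C")]

print("blind order:   ", copy_cost(USED_MB, RIPPLE_PLAN), "MB copied")
print("informed order:", copy_cost(USED_MB, INFORMED_PLAN), "MB copied")
```

In this model the blind order copies the content of C twice, while the
informed order only moves the small amounts stored in A and B. With the
guest itself choosing which pages to release, ideally no copying is
needed at all.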

While it might seem like this will cause a lot of fragmentation, in
practice Windows seems to try to give out the largest contiguous range
of pages it is able to find.

One can also see in the hv_balloon client driver from the Linux kernel
that this driver tries to do 2 MB allocations for as long as it can
before giving out single pages.
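
The "large chunks first, single pages last" strategy can be sketched
roughly like this. This is a simplified model and not the actual
hv_balloon code: plan_balloon() and its free_2mb_blocks parameter are
hypothetical, and 4 KB base pages are assumed.

```python
PAGES_PER_2MB = 512  # 2 MB / 4 KB, assuming 4 KB base pages


def plan_balloon(pages_requested, free_2mb_blocks):
    """Split a balloon request into 2 MB chunks and single pages.

    Returns (chunks_2mb, single_pages). As many 2 MB chunks as are
    available are used first; the remainder is covered by single
    4 KB pages.
    """
    chunks = min(pages_requested // PAGES_PER_2MB, free_2mb_blocks)
    singles = pages_requested - chunks * PAGES_PER_2MB
    return chunks, singles


# Plenty of contiguous memory: most of the request uses 2 MB chunks.
print(plan_balloon(1300, free_2mb_blocks=10))
# Fragmented guest: only one 2 MB chunk, the rest are single pages.
print(plan_balloon(1300, free_2mb_blocks=1))
```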

The reason why ballooning and DIMMs weren't being used together previously
is probably because virtio-balloon will (at least on Windows) mark the
ballooned out pages as in use inside the guest, preventing the removal
of the DIMM stick backing them.

In addition to the above, virtio-balloon is also very slow, as the whole
protocol operates on single pages only, not on page ranges.
There might also be some interference with Windows memory management
causing an extra slowdown in comparison to the native Windows DM
protocol.
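
To show why operating on page ranges rather than single pages matters,
here is a small sketch that coalesces page frame numbers into
(start, count) ranges. It does not reproduce the wire format of either
protocol; coalesce() is a hypothetical helper written only for this
example.

```python
def coalesce(pfns):
    """Merge page frame numbers into sorted (start, count) ranges."""
    ranges = []
    for pfn in sorted(set(pfns)):
        if ranges and ranges[-1][0] + ranges[-1][1] == pfn:
            # pfn directly follows the last range: extend it
            start, count = ranges[-1]
            ranges[-1] = (start, count + 1)
        else:
            # gap before pfn: start a new range
            ranges.append((pfn, 1))
    return ranges


# One fully contiguous 2 MB region (512 pages of 4 KB each).
pfns = list(range(0x1000, 0x1000 + 512))
print(len(pfns), "per-page entries vs", len(coalesce(pfns)), "range")
```

The more contiguous the released memory is, the fewer entries a
range-based protocol has to exchange for the same amount of memory.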

If the KVM slot limit starts to be a problem in practice then we can
think about what can be done about it.
It's always one obstacle less.

I see that the same KVM slot limit probably applies also for virtio-mem,
since it uses memory-backend-ram as its backing memory device, too,
right?

If not, then how do you map a totally new memory range into the guest
address space without consuming a KVM memory slot?
If that's somehow possible then maybe the same mechanism can simply be
reused for this driver.

> [...]
> 
>>
>> So, yes, it will be a problem if the user expands their running guest
>> ~256 times, each time making it even bigger than previously, without
>> rebooting it even once, but this does seem to be an edge use case.
> 
> IIRC, that's exactly what dynamic memory under Windows does in automatic
> mode, no? Monitor the guests, distribute memory accordingly - usually in
> smaller steps. But I am no expert on Hyper-V.

Yes, they call their automatic mode "Dynamic Memory" in recent Windows
versions.

This is a bit confusing because even if you disable this feature
the Hyper-V hypervisor will still provide this Dynamic Memory Protocol
service and use it to resize the guest on (user) demand.
It just won't do such a resize on its own, but only when explicitly
requested.

I don't know if they internally have any limit that is similar to the KVM
memory slot limit, though.

>>
>> In the future it would be better to automatically turn the current
>> effective guest size into its boot memory size when the VM restarts
>> (the VM will then have no virtual DIMMs inserted after a reboot), but
>> doing this requires quite a few changes to QEMU, that's why it isn't
>> there yet.
> 
> Will most probably never happen as reshuffling the layout of your boot
> memory (especially with NUMA) within QEMU can break live migration in
> various ways.

That's why this functionality is not in the current driver version as
it is a bit hard to implement :)

> If you already notify the user on a reboot, the user can just kill the
> VM and start it with an adjusted boot memory size. Yeah, that's ugly,
> but so is the whole "adjust DIMM/balloon configuration during a reboot
> from outside QEMU".
>
> BTW, how would you handle: Start guest with 10G. Inflate balloon to 5G.
> Reboot. There are no virtual DIMMs to adjust.

You'll typically want to avoid relaunching QEMU as much as possible
since things like chardev sockets and a VNC connection will disconnect
if the QEMU process exits.
Not to mention that it takes some time for it to actually start again.

However, there is a trade-off here: one can either start the guest with
a relatively large boot memory size, but then shrinking the guest means
that it will see the whole boot memory size again during reboot, until
it is ballooned down again after it has connected to the DM protocol.

Or it can be started with a small boot memory size, but this means that
a few virtual DIMMs might always be inserted (their size and / or count
can be optimized during the next reboot or if they become unused due
to ballooning).

Or one can choose some point in between these two scenarios.
 
I think a virtio-mem user has to choose a similar trade-off between
the boot memory size and the size and count of plugged-in virtio-mem
devices, right?

>>
>> The above is basically how Hyper-V hypervisor handles its memory size
>> changes and it seems to be as close to having a transparently resizable
>> guest as reasonably possible.
> 
> "having a transparently resizable _Windows_ guests right now" :)

Right.

(...)
> 
>>> I assume these numbers apply with Windows guests only. IIRC Linux
>>> hv_balloon does not support page migration/compaction, while
>>> virtio-balloon does. So you might end up with quite some fragmented
>>> memory with hv_balloon in Linux guests - of course, usually only in
>>> corner cases.
>>
>> As I previously mentioned, this driver targets mainly Windows guests.
> 
> ... and you cannot enforce that people will only use it with Windows
> guests :)

If people want to run this driver with Linux or port the hv_balloon
client driver from the Linux kernel to, for example, GNU Hurd and run
the DM protocol there then they are free to do so.
It just really isn't this driver's target environment.

> [...]
> 
>> Windows will generally leave some memory free when processing balloon
>> requests, although the precise amount varies between few hundred MB to
>> values like 1+ GB.
>>
>> Usually it runs stable even with these few hundred MBs of free memory
>> remaining but I have seen occasional crashes at shutdown time in this
>> case (probably something critical failing to initialize due to the
>> system running out of memory).
>>
>> While the above command was just a quick example, I personally think
>> it is the guest who should be enforcing a balloon floor since it is
>> the guest that knows its internal memory requirements, not the host.
> 
> Even the guest has no idea about the (future) working set size. That's a
> known problem.
> 
> There are always cases where the calculation is wrong, and if the
> monitoring process isn't fast enough to react and adjust the guest size,
> your things will end up baldy in your guest. Just as the reboot case you
> mentioned, where the VM crashes.

The actual Hyper-V hypervisor somehow manages not to over-balloon its
guests to the point that they run out of memory and crash.
So this is definitely doable (with a margin of safety).

However, such heuristics are really an issue for the software
controlling QEMU and so are outside the scope of this driver.

By the way, that's why DM guests emit a STATUS message each second
with various memory counters (translated into a QMP event by this driver)
- to give the host hints about the guest memory pressure.

> [...]
> 
>>>>
>>>> Future directions:
>>>> * Allow sharing the ballooning QEMU interface between hv-balloon and
>>>>   virtio-balloon drivers.
>>>>   Currently, only one of them can be added to the VM at the same time.
>>>
>>> Yeah, that makes sense. Only one at a time.
>>
>> Having only one *active* at a time makes sense, however it ultimately
>> would be nice to be able to have them both inserted into a VM:
>> one for Windows guests and one for Linux ones.
>> Even though only one obviously would be active at the same time.
> 
> I don't think that's the right way forward - that should be configured
> when the VM is started.
> 
> Personal opinion: I can understand the motivation to implement
> hypervisor-specific devices to better support closed-source operating
> systems. But I doubt we want to introduce+support ten different
> proprietary devices based on proprietary standards doing roughly the
> same thing just because closed-source operating systems are too lazy to
> support open standards properly.
> 

What do you mean by "ten" proprietary devices?
Is there another balloon protocol driver currently in the tree other
than virtio-balloon running over various buses?

People are running Windows guests using QEMU, too.

That's why there are a dozen or so Hyper-V enlightenments implemented,
even though they duplicate KVM PV stuff, and why there is kvmvapic
with its Windows guest live-patching.

Not to mention many, many devices like e1000 or VMware vmxnet3 even
though virtio-net exists or PIIX IDE even though virtio-{blk,scsi} exist.
Or the applesmc driver, which is clearly designed to help run just
one proprietary OS.

Thanks,
Maciej
