Hi all,
Device hotplugging in KVM is a much-wanted and highly appreciated
feature; thank you very much for making this real! However, there are a
couple of thoughts/comments I'd like to share with you.
The recent issue #757, as well as a couple of reports from Debian
users, has uncovered a fundamental weakness in the implementation of
hotplugging in 2.10. In short, the problem is that KVMHypervisor has
changed its default behavior from letting qemu manage all device
assignments on the PCI bus to a hybrid model of manual assignment for
NICs and disks¹ and automatic assignment for everything else.
¹ and recently soundhw, spice and virtio-balloon-pci
Now, as issue #757 showed, this is not very straightforward to get
right, because we have a single resource pool (the PCI slots) managed by
two entities in a non-cooperative manner, under a number of assumptions
that are not necessarily true, like:
• The assumption that the qemu-reserved PCI slots are exactly 3, no
less and no more. This is clearly in no way future-proof, as the way
qemu internally assigns whatever devices it needs may change in the
future.
• The assumption that there will always be exactly 3 sound card types
that occupy a PCI slot is also not guaranteed to hold in the future.
• The assumption that balloon and SPICE occupy exactly one PCI slot
each. Balloon might some day be e.g. multiplexed over virtio-scsi and
SPICE might add more devices at some point.
• The assumption that the user will not pass any PCI-backed devices
(like virtio-rng-pci) via kvm_extra (which is still evaluated before
the NICs and disks).
If any of the above assumptions fails, the instance(s) become
unbootable, which is a pretty serious issue. Also, fixes like 3a72e34
will almost certainly be needed every time someone tries out a new
combination of devices. Additionally, 3a72e34 is a not-that-pretty
workaround: it adds unnecessary complexity to the code (moving more
devices to manual assignment mode) and hard-codes more of qemu's
internal information in Ganeti (like which sound cards use PCI), making
the code even more fragile and error-prone.
Now, what can we do to make things better? The way I see it, there are
two approaches: the "correct" one and the "robust workaround".
The "correct" solution
----------------------
The "correct" solution would be to have qemu manage everything. qemu's
device_add does not require a PCI slot to be passed, instead it assigns
the first free slot on demand. Using this, the "correct" solution would
be to start qemu as normal and simply hotplug the devices without
specifying PCI slots.
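As a rough sketch, the difference is just the absence of the "addr"
property in the device_add arguments (the driver/netdev/id names below
are purely illustrative):

```python
import json

# QMP "device_add" with no "addr" property: qemu picks the first free
# PCI slot by itself. The netdev/id values here are made up.
cmd = {
    "execute": "device_add",
    "arguments": {
        "driver": "virtio-net-pci",
        "netdev": "hostnet1",  # assumes a netdev with this id exists
        "id": "nic1",
        # no "addr": the slot is chosen by qemu on demand
    },
}
print(json.dumps(cmd))
```

With manual assignment we would additionally pass e.g. "addr": "0x5" and
take on the bookkeeping ourselves.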
The obvious benefit of this approach is that we have to make no
assumptions whatsoever about the internals of qemu. The significant
drawback, however, is that because the PCI layout may end up with a
different device order than what is in the configuration (because of
device hot-removal²), we cannot just rely on the instance runtime config
during live migrations; instead, the PCI state must be explicitly
extracted from "info pci" and reconstructed on the secondary node,
adding significant complexity to the code. I assume this is exactly what
the original implementation wanted to avoid (and for this adopted manual
slot assignments).
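As a sketch of what that extraction might look like (the field names
follow my reading of QMP's query-pci output; the sample response is made
up, not captured from a real VM):

```python
# Map qdev ids to PCI slots from a parsed "query-pci" QMP response, so
# the layout can be reproduced with explicit addr= on the secondary node.
def pci_addrs(query_pci_result):
    addrs = {}
    for bus in query_pci_result:
        for dev in bus["devices"]:
            if dev.get("qdev_id"):  # qemu-internal devices have no id
                addrs[dev["qdev_id"]] = dev["slot"]
    return addrs

# Illustrative sample, not real qemu output.
sample = [{"bus": 0, "devices": [
    {"slot": 3, "qdev_id": "nic-0"},
    {"slot": 4, "qdev_id": "disk-0"},
    {"slot": 5, "qdev_id": ""},  # e.g. a qemu-managed device
]}]

print(pci_addrs(sample))  # {'nic-0': 3, 'disk-0': 4}
```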
² this is actually a behavior present in the current implementation as
well. Take the following scenario as an example (PCI slots: I=qemu
internal, _=free, D=disk, N=nic; lowercase marks the device added in the
current step)
0. startup:
iiind__________________________
1. gnt-instance modify --hotplug --net add
IIINDn_________________________
2. gnt-instance modify --hotplug --disk add:size=2g
IIINDNd________________________
3. gnt-instance modify --hotplug --net 1:remove
IIIND_D________________________
4. gnt-instance modify --hotplug --disk add:size=2g
IIINDdD________________________
The disk in step 4 will be inserted before the disk in step 2, because
of the hole left by step 3.
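The scenario above can be reproduced with a tiny first-free-slot model
(the 32-slot bus and the initial I/I/I/N/D layout are taken from the
diagram):

```python
SLOTS = 32

def hotplug(bus, dev):
    """Place dev in the first free slot, as device_add does without addr=."""
    slot = bus.index(None)
    bus[slot] = dev
    return slot

bus = ["I", "I", "I", "N", "D"] + [None] * (SLOTS - 5)  # step 0 layout

hotplug(bus, "N")         # step 1: --net add        -> slot 5
hotplug(bus, "D")         # step 2: --disk add       -> slot 6
bus[5] = None             # step 3: --net 1:remove leaves a hole at slot 5
slot = hotplug(bus, "D")  # step 4: --disk add fills the hole
print(slot)               # 5, i.e. before the step-2 disk in slot 6
```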
The "robust workaround"
-----------------------
This is a minimal workaround that will fix the current situation and
allow some robustness, while keeping the current semantics, in a true
"Worse is Better" spirit. The idea is to keep manual assignments for
NICs and disks, but partition the PCI space in two areas: one managed by
qemu and one managed by Ganeti's hotplug code. Ideally, using a
second PCI bus for hotplugging would be perfect, but this seems not to
be supported by qemu. An alternative would be adding a multi-function
pci-bridge, but I am not sure about the stability and scalability of
this (qemu 1.7 crashed on me a couple of times while experimenting).
Since we currently support at most 8 NICs and 16 disks, and virtio
disks need one PCI slot each, it makes sense to reserve the high 24 PCI
slots for these. This leaves qemu with 8 slots for dynamic assignments,
which are
enough for the standard devices + spice + balloon + soundhw + a couple
more user-defined devices via kvm_extra. We can further partition the
hotplug range into 8 NIC slots and 16 disk slots (as different pools to
assign from) to avoid the interference between NIC and disk hotplugging
as described in ² above.
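A minimal sketch of the partitioning (the concrete ranges are one
possible layout consistent with the numbers above, not a final design):

```python
# 32 PCI slots total: the low 8 stay qemu-managed, the high 24 are split
# into separate NIC and disk pools so the two cannot interfere.
QEMU_SLOTS = range(0, 8)    # qemu internals, balloon, spice, kvm_extra, ...
NIC_SLOTS = range(8, 16)    # Ganeti-managed, 8 NICs max
DISK_SLOTS = range(16, 32)  # Ganeti-managed, 16 disks max

def first_free(pool, used):
    """Lowest free slot in a pool; exhausting a pool is a hard error."""
    for slot in pool:
        if slot not in used:
            return slot
    raise IndexError("no free slot left in pool")

used = set()
nic0 = first_free(NIC_SLOTS, used)
used.add(nic0)
disk0 = first_free(DISK_SLOTS, used)
used.add(disk0)
print(nic0, disk0)  # 8 16
```

Since each pool hands out its own lowest free slot, removing a NIC can
never shift where the next disk lands, and vice versa.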
The downside of this is that all instances will see all disk and NIC PCI
slots changed on the first boot with the new model (although relative
order will be preserved). Linux guests will cope with this fine, but I'm
not sure how Windows guests will do.
Note that while we essentially still rely on some assumptions regarding
qemu's internals, we leave a big-enough buffer zone of 5 slots for qemu
to manage. Also note that moving from virtio-blk to virtio-scsi will
eventually allow us to free the 16 disk slots and substitute them with
only one for the virtio SCSI HBA, so the number of qemu-assignable slots
will likely increase in the future.
Finally, as an added bonus we get stable PCI slots for NICs across
boots, regardless of what may be in kvm_extra, or turning ballooning on
and off.
Conclusion
----------
Apologies for the long analysis (thanks for reading this far!), but I
wanted to outline the problem in detail. Qemu is a fast-moving target,
and being both future-proof and backwards-compatible is a difficult
task. IMHO, we should take the "robust workaround" course, together with
adding virtio-scsi support. I already have a patchset reverting 3a72e34
and implementing assignments from the high slots, which I can submit if
we agree on the above. I can also investigate virtio-scsi and post a
follow-up on it.
Please bring forth any comments and/or proposals! :-)
Thanks,
Apollon