Hi all,

Device hotplugging in KVM is a much-wanted and highly appreciated 
feature; thank you very much for making this real! However, there are a 
couple of thoughts/comments I'd like to share with you.

The recent issue #757, as well as a couple of reports from Debian users 
have uncovered a fundamental weakness in the implementation of 
hotplugging in 2.10. In short, the problem is that KVMHypervisor has 
changed its default behavior from letting qemu manage all device 
assignments on the PCI bus to a hybrid model of manual assignment for 
NICs and disks¹ and automatic assignment for the rest.  

 ¹ and recently soundhw, spice and virtio-balloon-pci

Now, as issue #757 showed, this is not very straightforward to get 
right, because we have a single resource pool (the PCI slots) managed by 
two entities in a non-cooperative manner, under a number of assumptions 
that are not necessarily true, like:

 • The assumption that qemu reserves exactly 3 PCI slots for its 
   internal devices, no more and no less. This is clearly in no way 
   future-proof, as the way qemu internally assigns whatever devices it 
   needs may change in the future.

 • The assumption that there will always be exactly 3 sound card types 
   occupying a PCI slot. This, too, is not guaranteed to hold in the 
   future.

 • The assumption that balloon and SPICE occupy exactly one PCI slot 
   each. Balloon might some day be e.g. multiplexed over virtio-scsi and 
   SPICE might add more devices at some point.

 • The assumption that the user will not pass any PCI-backed devices 
   (like virtio-rng-pci) via kvm_extra (which is still evaluated before 
   the NICs and disks).

If any of the above stops holding, the instance(s) become unbootable, 
which is a serious problem. Also, fixes like 3a72e34 will almost 
certainly come up every time someone tries out a new combination of 
devices. Additionally, 3a72e34 is a not-that-pretty workaround: it adds 
unnecessary complexity to the code (moving more devices to manual 
assignment mode) and hard-codes more of qemu's internal information in 
Ganeti (like which soundcards use PCI), making the code even more 
fragile and error-prone.

Now, what can we do to make things better? The way I see it, there are 
two approaches: the "correct" one and the "robust workaround".

The "correct" solution
----------------------

The "correct" solution would be to have qemu manage everything. qemu's 
device_add does not require a PCI slot to be passed; instead, it 
assigns the first free slot on demand. Using this, the "correct" 
solution would be to start qemu as normal and simply hotplug the 
devices without specifying PCI slots.
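For illustration, this is roughly what slot-less hotplugging looks like 
at the QMP level. A minimal sketch: the hotplug_nic helper and the 
qmp_send transport are made up for the example, but the key point is 
real qemu behavior, namely that device_add with no "addr" argument lets 
qemu pick the first free slot itself:

```python
import json

def hotplug_nic(qmp_send, devid, netdev):
    """Hotplug a virtio NIC without pinning it to a PCI slot.

    Omitting the 'addr' property leaves slot selection entirely to
    qemu (helper and transport names here are illustrative).
    """
    return qmp_send({
        "execute": "device_add",
        "arguments": {
            "driver": "virtio-net-pci",
            "id": devid,
            "netdev": netdev,
            # no "addr": qemu assigns the first free PCI slot on its own
        },
    })

# With a stub transport, we can inspect the command that would be sent:
sent = []
hotplug_nic(lambda cmd: sent.append(cmd), "nic1", "hostnet1")
print(json.dumps(sent[0], sort_keys=True))
```

Note there is no "addr" key anywhere in the serialized command; that is 
the whole trick of the "correct" solution.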

The obvious benefit of this approach is that we have to make no 
assumptions whatsoever about the internals of qemu. The significant 
drawback, however, is that because the PCI layout may end up with a 
different device order than what is in the configuration (because of 
device hot-removal²), we cannot just rely on the instance runtime 
config during live migrations; instead, the PCI state must be 
explicitly extracted from "info pci" and reconstructed on the secondary 
node, adding significant complexity to the code. I assume this is 
exactly what the original implementation wanted to avoid (and for this 
reason adopted manual slot assignments).
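To give a feel for what the extraction step would involve, here is a 
sketch of pulling a slot-to-device map out of the QMP counterpart of 
"info pci" (query-pci). The response shape is from memory of qemu's QMP 
documentation, so treat the field names as assumptions, not gospel:

```python
def pci_layout(qmp_send):
    """Build a {slot: qdev id} map from QMP 'query-pci', so the layout
    could be recreated with explicit 'addr' values on the target node.
    Field names ('devices', 'slot', 'qdev_id') are assumed, not checked
    against a live qemu."""
    layout = {}
    for bus in qmp_send({"execute": "query-pci"}):
        for dev in bus["devices"]:
            layout[dev["slot"]] = dev.get("qdev_id", "")
    return layout

# A stub response mimicking the assumed shape:
sample = [{"bus": 0, "devices": [
    {"slot": 3, "qdev_id": "nic0"},
    {"slot": 4, "qdev_id": "disk0"},
]}]
print(pci_layout(lambda cmd: sample))
```

Even in this toy form it shows the added moving part: a query, a parse, 
and a reconstruction step that the current config-only approach does 
not need.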

 ² this is actually a behavior present in the current implementation as 
   well.  Take the following scenario as an example (PCI slots: I=qemu 
   internal, _=free, D=disk, N=nic, lowercase denotes current step)

   0. startup:
            iiind__________________________

   1. gnt-instance modify --hotplug --net add
            IIINDn_________________________

   2. gnt-instance modify --hotplug --disk add:size=2g
            IIINDNd________________________

   3. gnt-instance modify --hotplug --net 1:remove
            IIIND_D________________________

   4. gnt-instance modify --hotplug --disk add:size=2g
            IIINDdD________________________

   The disk in step 4 will be inserted before the disk in step 2, because 
   of the hole left by step 3.
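The scenario above can be reproduced with a few lines of Python. This 
is a toy model of qemu's first-free-slot policy, not actual Ganeti 
code, using the same single-letter notation as the diagram:

```python
def first_free(slots):
    """Return the lowest free PCI slot, mimicking device_add's policy."""
    return slots.index('_')

slots = list('IIIND' + '_' * 27)   # step 0: 3 internal + NIC + disk
slots[first_free(slots)] = 'N'     # step 1: hotplug NIC   -> slot 5
slots[first_free(slots)] = 'D'     # step 2: hotplug disk  -> slot 6
slots[5] = '_'                     # step 3: remove the step-1 NIC
slots[first_free(slots)] = 'd'     # step 4: new disk fills the hole
print(''.join(slots)[:8])          # -> IIINDdD_
```

The step-4 disk lands in slot 5, before the step-2 disk in slot 6, 
which is exactly the reordering that makes the runtime config 
insufficient for migration.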

The "robust workaround"
-----------------------

This is a minimal workaround that will fix the current situation and 
add some robustness, while keeping the current semantics, in a true 
"Worse is Better" spirit. The idea is to keep manual assignments for 
NICs and disks, but partition the PCI space in two areas: one managed 
by qemu and one managed by Ganeti's hotplug code. Using a second PCI 
bus for hotplugging would be ideal, but this does not seem to be 
supported by qemu. An alternative would be adding a multi-function 
pci-bridge, but I am not sure about the stability and scalability of 
this approach (qemu 1.7 crashed on me a couple of times while 
experimenting).

Since we currently support 8 NICs and 16 disks max and virtio disks need 
one PCI slot each, it makes sense to reserve the high 24 PCI slots for 
these. This leaves qemu with 8 slots for dynamic assignments, which are 
enough for the standard devices + spice + balloon + soundhw + a couple 
more user-defined devices via kvm_extra. We can further partition the 
hotplug range into 8 NIC slots and 16 disk slots (as different pools to 
assign from) to avoid interference between NIC and disk hotplugging, as 
described in ² above.
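In code terms, the proposed split could look roughly like this. The 
exact boundaries (0-7 for qemu, 8-15 for NICs, 16-31 for disks) and the 
pci_slot helper are this sketch's assumptions about how the proposal 
might be implemented, not settled design:

```python
# Proposed partition of the 32 PCI slots (boundaries are illustrative):
QEMU_SLOTS = range(0, 8)    # qemu-managed: internal devs, spice,
                            # balloon, soundhw, kvm_extra
NIC_SLOTS  = range(8, 16)   # Ganeti-managed: up to 8 NICs
DISK_SLOTS = range(16, 32)  # Ganeti-managed: up to 16 virtio disks

def pci_slot(kind, index):
    """Return the fixed PCI slot for the index-th NIC or disk.

    Separate pools mean NIC hotplug can never steal a disk slot
    (and vice versa), avoiding the reordering problem.
    """
    pool = NIC_SLOTS if kind == "nic" else DISK_SLOTS
    if index >= len(pool):
        raise ValueError("no %s slot for index %d" % (kind, index))
    return pool[index]

print(pci_slot("nic", 0), pci_slot("disk", 0))   # -> 8 16
```

With fixed per-pool offsets like these, the i-th NIC always gets the 
same slot regardless of how many disks exist or what qemu did in its 
own range.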

The downside of this is that all instances will see all disk and NIC PCI 
slots changed on the first boot with the new model (although relative 
order will be preserved). Linux guests will cope with this fine, but I'm 
not sure how Windows guests will do.

Note that while we essentially still rely on some assumptions regarding 
qemu's internals, we leave a big-enough buffer zone of 5 slots for qemu 
to manage. Also note that moving from virtio-blk to virtio-scsi will 
eventually allow us to free the 16 disk slots and substitute them with 
only one for the virtio SCSI HBA, so the number of qemu-assignable slots 
will likely increase in the future.

Finally, as an added bonus we get stable PCI slots for NICs across 
boots, regardless of what may be in kvm_extra, or turning ballooning on 
and off.

Conclusion
----------

Apologies for the long analysis (thanks for reading this far!), but I 
wanted to outline the problem in detail. Qemu is a fast-moving target, 
and being both future-proof and backwards-compatible is a difficult 
task. IMHO, we should take the "robust workaround" course, together 
with adding virtio-scsi support. I already have a patchset reverting 
3a72e34 and implementing assignments from the high slots, which I can 
submit if we agree on the above. I can also investigate virtio-scsi and 
post a follow-up on it.

Please bring forth any comments and/or proposals! :-)

Thanks,
Apollon
