Hi Apollon,

Thanks for the in-depth analysis! Please see my comments inline:
On Thu, Mar 27, 2014 at 12:04 PM, Apollon Oikonomopoulos <[email protected]> wrote:

> Hi all,
>
> Device hotplugging in KVM is a much-wanted and highly appreciated
> feature, thank you very much for making this real! However, there are a
> couple of thoughts/comments I'd like to share with you.
>
> The recent issue #757, as well as a couple of reports from Debian users,
> have uncovered a fundamental weakness in the implementation of
> hotplugging in 2.10. In short, the problem is that KVMHypervisor has
> changed its default behavior from letting qemu manage all device
> assignments on the PCI bus to a hybrid model of manual assignment for
> NICs and disks¹ and automatic assignment for the rest.
>
> ¹ and recently soundhw, spice and virtio-balloon-pci
>
> Now, as issue #757 showed, this is not very straightforward to get
> right, because we have a single resource pool (the PCI slots) managed by
> two entities in a non-cooperative manner, under a number of assumptions
> that are not necessarily true, like:
>
> • The fact that the qemu-reserved PCI slots are no less and no more
>   than 3. This assumption is clearly in no way future-proof, as the way
>   qemu internally assigns whatever devices it needs may change in the
>   future.
>
> • The assumption that there will always be exactly 3 sound card types
>   that occupy a PCI slot is also not guaranteed to hold in the future.
>
> • The assumption that balloon and SPICE occupy exactly one PCI slot
>   each. Balloon might some day be e.g. multiplexed over virtio-scsi and
>   SPICE might add more devices at some point.
>
> • The assumption that the user will not pass any PCI-backed devices
>   (like virtio-rng-pci) via kvm_extra (which is still evaluated before
>   the NICs and disks).
>
> If any of the above doesn't hold, then the instance(s) become
> unbootable, which is a pretty important issue. Also, fixes like 3a72e34
> will almost certainly come up every time someone tries out a new
> combination of devices.
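To make the failure mode concrete, here is a quick hypothetical sketch (the function and slot counts are illustrative, not Ganeti's actual code): if the management layer hard-codes the assumption that qemu reserves exactly 3 slots, any additional qemu-internal device collides with the first manually assigned slot.

```python
# Hypothetical sketch: two uncooperative allocators sharing one 32-slot PCI bus.
# The management layer assumes qemu always reserves exactly slots 0-2.

QEMU_RESERVED_ASSUMED = 3  # hard-coded assumption on the management side

def manual_assignments(num_devices, qemu_reserved_actual):
    """Assign slots for NICs/disks starting right after the *assumed*
    qemu-reserved range, and report collisions with what qemu really used."""
    qemu_slots = set(range(qemu_reserved_actual))
    assigned = [QEMU_RESERVED_ASSUMED + i for i in range(num_devices)]
    collisions = [s for s in assigned if s in qemu_slots]
    return assigned, collisions

# Assumption holds (qemu really uses 3 slots): no collision.
print(manual_assignments(2, qemu_reserved_actual=3))  # -> ([3, 4], [])
# qemu starts reserving 4 slots (e.g. a new internal device): slot 3
# collides and the instance becomes unbootable.
print(manual_assignments(2, qemu_reserved_actual=4))  # -> ([3, 4], [3])
```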
> Additionally, 3a72e34 is a not-that-pretty workaround: it adds
> unnecessary complexity to the code (moving more devices to manual
> assignment mode) and hard-codes more of qemu's internal information in
> Ganeti (like which soundcards use PCI), making the code even more
> fragile and error-prone.
>
> Now, what can we do to get things better? The way I see things, there
> are two approaches, the "correct" one and the "robust workaround".
>
> The "correct" solution
> ----------------------
>
> The "correct" solution would be to have qemu manage everything. qemu's
> device_add does not require a PCI slot to be passed; instead it assigns
> the first free slot on demand. Using this, the "correct" solution would
> be to start qemu as normal and simply hotplug the devices without
> specifying PCI slots.
>
> The obvious benefit of this approach is that we have to make no
> assumptions whatsoever about the internals of qemu. The significant
> drawback, however, is that because the PCI layout may end up with a
> different device order than what is in the configuration (because of
> device hot-removal²), we cannot just rely on the instance runtime config
> during live migrations; instead, the PCI state must be explicitly
> extracted from "info pci" and reconstructed on the secondary node,
> adding significant complexity to the code.

I assume this is exactly what the original implementation wanted to
avoid (and for this adopted manual slot assignments).

> ² this is actually a behavior present in the current implementation as
> well. Take the following scenario as an example (PCI slots: I=qemu
> internal, _=free, D=disk, N=nic, lowercase denotes current step)
>
> 0. startup:
>    iiind__________________________
>
> 1. gnt-instance modify --hotplug --net add
>    IIINDn_________________________
>
> 2. gnt-instance modify --hotplug --disk add:size=2g
>    IIINDNd________________________
>
> 3. gnt-instance modify --hotplug --net 1:remove
>    IIIND_D________________________
>
> 4. gnt-instance modify --hotplug --disk add:size=2g
>    IIINDdD________________________
>
> The disk in step 4 will be inserted before the disk in step 2, because
> of the hole left by step 3.
>
> The "robust workaround"
> -----------------------
>
> This is a minimal workaround that will fix the current situation and
> allow some robustness, while keeping the current semantics, in a true
> "Worse is Better" spirit. The idea is to keep manual assignments for
> NICs and disks, but partition the PCI space in two areas: one managed by
> qemu and one managed by Ganeti's hotplug code. Ideally, using a second
> PCI bus for hotplugging would be perfect, but this seems not to be
> supported by qemu. An alternative would be adding a multi-function
> pci-bridge, but I am not sure about the stability and scalability of
> this (qemu 1.7 crashed on me a couple of times while experimenting).
>
> Since we currently support 8 NICs and 16 disks max and virtio disks need
> one PCI slot each, it makes sense to reserve the high 24 PCI slots for
> these. This leaves qemu with 8 slots for dynamic assignments, which are
> enough for the standard devices + spice + balloon + soundhw + a couple
> more user-defined devices via kvm_extra. We can further partition the
> hotplug range into 8 NIC slots and 16 disk slots (as different pools to
> assign from) to avoid the interference between NIC and disk hotplugging
> as described in ² above.
>
> The downside of this is that all instances will see all disk and NIC PCI
> slots changed on the first boot with the new model (although relative
> order will be preserved). Linux guests will cope with this fine, but I'm
> not sure how Windows guests will do.

I think the same holds true for normal reboots right now: on the master
node, we can't tell in which PCI slot the disk actually ended up.
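As an aside, the scenario in ² can be replayed with a few lines of Python. This is only an illustration of qemu's first-free-slot behavior, not code from either proposal:

```python
# Illustrative model of qemu's "first free slot" hotplug behavior,
# replaying the scenario from footnote ² above.

def hotplug(bus, dev):
    """Insert dev into the first free slot, as qemu's device_add does
    when no PCI address is given."""
    bus[bus.index('_')] = dev

bus = list('IIIND' + '_' * 27)  # step 0: startup (3 internal, 1 NIC, 1 disk)
hotplug(bus, 'N')               # step 1: --net add            -> slot 5
hotplug(bus, 'D')               # step 2: --disk add:size=2g   -> slot 6
bus[5] = '_'                    # step 3: --net 1:remove frees slot 5
hotplug(bus, 'D')               # step 4: --disk add:size=2g reuses the hole

# The step-4 disk lands in slot 5, *before* the step-2 disk in slot 6.
print(''.join(bus)[:8])  # -> IIINDDD_
```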
So after stopping the instance, a (re-)start will put the disks in the
PCI slots in the order in which they appear in the cluster
configuration, not in the order they were in when they were first
hotplugged. Or am I wrong here? Also, I'm not sure that right now the
relative order is preserved, as you can come up with an example like the
one above for the current implementation as well. In this regard, the
above solution is not better, as the PCI slot assignment is not put into
the master configuration either.

> Note that while we essentially still rely on some assumptions regarding
> qemu's internals, we leave a big-enough buffer zone of 5 slots for qemu
> to manage. Also note that moving from virtio-blk to virtio-scsi will
> eventually allow us to free the 16 disk slots and substitute them with
> only one for the virtio SCSI HBA, so the number of qemu-assignable slots
> will likely increase in the future.
>
> Finally, as an added bonus we get stable PCI slots for NICs across
> boots, regardless of what may be in kvm_extra, or turning ballooning on
> and off.
>
> Conclusion
> ----------
>
> Apologies for the long analysis (thanks for reading this far!), but I
> wanted to outline the problem in detail. Qemu is a fast-moving target
> and being both future-proof and backwards-compatible is a difficult
> task. IMHO, we should take the "robust workaround" course, together with
> adding virtio-scsi support. I already have a patchset reverting 3a72e34
> and implementing assignments from the high slots, which I can submit if
> we agree on the above. I can also investigate virtio-scsi and post a
> follow-up on it.

I think in terms of robustness, the "robust workaround" is probably even
more robust than the first proposal. We would rely less on output
parsing of QEMU and would interact less with QEMU in general, but only
go with the assumption that QEMU allocates PCI slots from the bottom up
and that QEMU should never eat up more than 8 slots.
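For illustration, a small sketch of the proposed partitioning on a 32-slot bus. The exact ranges (and which half of the reserved area goes to NICs vs. disks) are my reading of the proposal, not settled:

```python
# Hypothetical sketch of the "robust workaround" slot partitioning:
# slots 0-7 stay with qemu, the high 24 are reserved for hotplugging.
# The NIC/disk split direction within the reserved range is my assumption.

QEMU_POOL = range(0, 8)     # dynamic assignment by qemu
NIC_POOL = range(8, 16)     # 8 NIC slots, managed by Ganeti
DISK_POOL = range(16, 32)   # 16 disk slots, managed by Ganeti

def assign(pool, in_use):
    """Hand out the lowest free slot from a dedicated pool."""
    for slot in pool:
        if slot not in in_use:
            in_use.add(slot)
            return slot
    raise IndexError("pool exhausted")

in_use = set()
nic0 = assign(NIC_POOL, in_use)    # -> 8
disk0 = assign(DISK_POOL, in_use)  # -> 16
nic1 = assign(NIC_POOL, in_use)    # -> 9
# Removing a NIC and then adding a disk can no longer shift disk
# numbering, because the two pools never overlap.
in_use.discard(nic0)
disk1 = assign(DISK_POOL, in_use)  # -> 17, not 8
print(nic0, disk0, nic1, disk1)
```

This is what avoids the interference described in ²: a freed NIC slot can only ever be reused by another NIC.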
The alternative is to trust QEMU's 'info pci' output across older and
newer versions. So I would go for the "robust workaround" option as
well.

Thanks,
Thomas

> Please bring forth any comments and/or proposals! :-)
>
> Thanks,
> Apollon

--
Thomas Thrainer | Software Engineer | [email protected] |

Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
