Hi,

* Apollon Oikonomopoulos <[email protected]> [2014-03-27 13:04:15 +0200]:
> Hi all,
>
> Device hotplugging in KVM is a much-wanted and highly appreciated
> feature, thank you very much for making this real! However, there are
> a couple of thoughts/comments I'd like to share with you.

Thanks.

> The recent issue #757, as well as a couple of reports from Debian
> users, has uncovered a fundamental weakness in the implementation of
> hotplugging in 2.10. In short, the problem is that KVMHypervisor has
> changed its default behavior from letting qemu manage all device
> assignments on the PCI bus to a hybrid model of manual assignment for
> NICs and disks¹ and automatic assignment for the rest.
>
> ¹ and recently soundhw, spice and virtio-balloon-pci
>
> Now, as issue #757 showed, this is not very straightforward to get
> right, because we have a single resource pool (the PCI slots) managed
> by two entities in a non-cooperative manner, under a number of
> assumptions that are not necessarily true, like:
>
>  • The assumption that the qemu-reserved PCI slots are exactly 3, no
>    less and no more. This is clearly in no way future-proof, as the
>    way qemu internally assigns whatever devices it needs may change
>    in the future.
>
>  • The assumption that there will always be exactly 3 sound card
>    types that occupy a PCI slot is also not guaranteed to hold in the
>    future.
>
>  • The assumption that balloon and SPICE occupy exactly one PCI slot
>    each. Balloon might some day be e.g. multiplexed over virtio-scsi,
>    and SPICE might add more devices at some point.
>
>  • The assumption that the user will not pass any PCI-backed devices
>    (like virtio-rng-pci) via kvm_extra (which is still evaluated
>    before the NICs and disks).

Well, all the use cases mentioned in issue #757 were found via
testing, due to the lack of documentation. I agree that we cannot
count on those corner cases staying the same in the near future.

> If any of the above doesn't hold, then the instance(s) become
> unbootable, which is a pretty important issue.
> Also, fixes like 3a72e34 will almost certainly come up every time
> someone tries out a new combination of devices. Additionally, 3a72e34
> is a not-that-pretty workaround: it adds unnecessary complexity to
> the code (moving more devices to manual assignment mode) and
> hard-codes more of qemu's internal information in Ganeti (like which
> sound cards use PCI), making the code even more fragile and
> error-prone.
>
> Now, what can we do to make things better? The way I see it, there
> are two approaches: the "correct" one and the "robust workaround".
>
> The "correct" solution
> ----------------------

Correct, but too complicated to implement.

> The "correct" solution would be to have qemu manage everything.
> qemu's device_add does not require a PCI slot to be passed; instead,
> it assigns the first free slot on demand. Using this, the "correct"
> solution would be to start qemu as usual and simply hotplug the
> devices without specifying PCI slots.
>
> The obvious benefit of this approach is that we have to make no
> assumptions whatsoever about the internals of qemu. The significant
> drawback, however, is that the PCI layout may end up with a different
> device order than what is in the configuration (because of device
> hot-removal²), so we cannot just rely on the instance runtime config
> during live migrations; instead, the PCI state must be explicitly
> extracted from "info pci" and reconstructed on the secondary node,
> adding significant complexity to the code. I assume this is exactly
> what the original implementation wanted to avoid (and for this
> adopted manual slot assignments).
>
> ² this behavior is actually present in the current implementation as
> well. Take the following scenario as an example (PCI slots: I=qemu
> internal, _=free, D=disk, N=NIC, lowercase denotes the current step):
>
> 0. startup:
>    iiind__________________________
>
> 1. gnt-instance modify --hotplug --net add
>    IIINDn_________________________
>
> 2. gnt-instance modify --hotplug --disk add:size=2g
>    IIINDNd________________________
>
> 3. gnt-instance modify --hotplug --net 1:remove
>    IIIND_D________________________
>
> 4. gnt-instance modify --hotplug --disk add:size=2g
>    IIINDdD________________________
>
> The disk in step 4 will be inserted before the disk in step 2,
> because of the hole left by step 3.
>
> The "robust workaround"
> -----------------------
>
> This is a minimal workaround that will fix the current situation and
> allow some robustness, while keeping the current semantics, in a true
> "Worse is Better" spirit. The idea is to keep manual assignments for
> NICs and disks, but partition the PCI space into two areas: one
> managed by qemu and one managed by Ganeti's hotplug code. Ideally,
> using a second PCI bus for hotplugging would be perfect, but this
> does not seem to be supported by qemu. An alternative would be adding
> a multi-function pci-bridge, but I am not sure about the stability
> and scalability of this (qemu 1.7 crashed on me a couple of times
> while experimenting).
>
> Since we currently support at most 8 NICs and 16 disks, and virtio
> disks need one PCI slot each, it makes sense to reserve the high 24
> PCI slots for these. This leaves qemu with 8 slots for dynamic
> assignments, which are enough for the standard devices + spice +
> balloon + soundhw + a couple more user-defined devices via kvm_extra.
> We can further partition the hotplug range into 8 NIC slots and 16
> disk slots (as different pools to assign from), to avoid the
> interference between NIC and disk hotplugging described in ² above.
>
> The downside of this is that all instances will see all disk and NIC
> PCI slots changed on the first boot with the new model (although the
> relative order will be preserved). Linux guests will cope with this
> fine, but I'm not sure how Windows guests will do.
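As a side note, the partitioned layout you describe could be sketched
roughly as below. The names, pool boundaries and the allocator are
illustrative only (they are not Ganeti's actual code), but they show
why disjoint NIC/disk pools avoid the hole-reuse interference from ²:

```python
# Hypothetical sketch of the proposed slot partitioning: 32 PCI slots,
# the low 8 left to qemu, the next 8 reserved for NICs, the high 16
# for virtio disks. Not Ganeti's actual implementation.
QEMU_SLOTS = range(0, 8)     # qemu-managed: internal devices, spice, balloon, kvm_extra
NIC_SLOTS = range(8, 16)     # Ganeti-managed pool, up to 8 NICs
DISK_SLOTS = range(16, 32)   # Ganeti-managed pool, up to 16 virtio disks


def allocate(pool, used):
    """Return the first free slot in *pool* (first-free assignment)."""
    for slot in pool:
        if slot not in used:
            used.add(slot)
            return slot
    raise ValueError("no free slot left in pool")


used = set()
disk0 = allocate(DISK_SLOTS, used)  # first disk -> slot 16
nic0 = allocate(NIC_SLOTS, used)    # first NIC -> slot 8
nic1 = allocate(NIC_SLOTS, used)    # hotplugged NIC -> slot 9
used.discard(nic1)                  # hot-remove nic1: leaves a hole at slot 9
disk1 = allocate(DISK_SLOTS, used)  # next disk -> slot 17, NOT the NIC hole at 9
```

Because disks can never fall into a hole left by a removed NIC (and
vice versa), the step-4 reordering from the scenario above cannot
occur.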
Well, currently the first disk allocates the first available PCI slot,
followed by the other disks, and then the NICs (see
_GenerateKVMRuntime()). Under the assumption that the first disk is
the bootable one and that it has never been removed, shifting all
devices some slots higher will not cause any boot problems.

> Note that while we essentially still rely on some assumptions
> regarding qemu's internals, we leave a big-enough buffer zone of 5
> slots for qemu to manage. Also note that moving from virtio-blk to
> virtio-scsi will eventually allow us to free the 16 disk slots and
> substitute them with only one for the virtio SCSI HBA, so the number
> of qemu-assignable slots will likely increase in the future.

Here you mean that the `virtio-scsi` device uses the `virtio-bus`
instead of the PCI bus, right? The `virtio-net-device` could be used
too, I guess.

> Finally, as an added bonus we get stable PCI slots for NICs across
> boots, regardless of what may be in kvm_extra, or of turning
> ballooning on and off.
>
> Conclusion
> ----------
>
> Apologies for the long analysis (thanks for reading this far!), but I
> wanted to outline the problem in detail. Qemu is a fast-moving
> target, and being both future-proof and backwards-compatible is a
> difficult task. IMHO, we should take the "robust workaround" course,
> together with adding virtio-scsi support. I already have a patchset
> reverting 3a72e34 and implementing assignments from the high slots,
> which I can submit if we agree on the above. I can also investigate
> virtio-scsi and post a follow-up on it.
>
> Please bring forth any comments and/or proposals!
:-)

So, from what I understand, the proposed change will:

1) affect only virtio devices (NICs and disks)
2) keep the current hotplug rationale (only NodeD knows about PCI
   info)
3) change _DEFAULT_PCI_RESERVATIONS to reserve the first 8 slots
4) run `info pci` to get the current PCI status, find the available
   slots, and "OR" them with _DEFAULT_PCI_RESERVATIONS
5) revert the commit that fixes #757, in the sense that virtio-balloon
   and spice won't get PCI info

If I'm not missing anything here, I agree with all of the above. So
yes, please proceed with the patches.

Just one note here: currently *only* virtio devices obtain PCI info
and are thus hotplug-able. Originally, Ganeti used only the `-drive`
option for disks; hotplug support added the `-device` option so that
disks can obtain PCI info. Should we adopt the `-device` option for
all devices? Would this make any sense?

Regards,
dimara

> Thanks,
> Apollon

----- End forwarded message -----
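P.S. A rough sketch of what the `info pci` merge in point (4) could
look like. Both the sample monitor output and the bit-string
reservation format are assumptions for illustration, not Ganeti's
actual code or captured qemu output:

```python
import re

# Example "info pci" human-monitor output; the exact format is an
# assumption based on qemu's typical layout.
SAMPLE_INFO_PCI = """\
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
  Bus  0, device   3, function 0:
    Ethernet controller: PCI device 1af4:1000
  Bus  0, device  10, function 0:
    SCSI controller: PCI device 1af4:1001
"""

# Hypothetical reservation map: "1" = reserved, string index = slot.
_DEFAULT_PCI_RESERVATIONS = "1" * 8 + "0" * 24


def occupied_slots(info_pci):
    """Return the set of device (slot) numbers seen in 'info pci' output."""
    return {int(m.group(1))
            for m in re.finditer(r"Bus\s+\d+, device\s+(\d+), function",
                                 info_pci)}


def merged_reservations(info_pci, default=_DEFAULT_PCI_RESERVATIONS):
    """OR the slots occupied per 'info pci' into the default reservations."""
    used = occupied_slots(info_pci)
    return "".join("1" if c == "1" or i in used else "0"
                   for i, c in enumerate(default))
```

With the sample output above, the merged map keeps the low 8 slots
reserved and additionally marks slot 10 (occupied at runtime) as
taken, so hotplug would never try to reuse it.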