Hi,

* Apollon Oikonomopoulos <[email protected]> [2014-03-27 13:04:15 +0200]:

> Hi all,
> 
> Device hotplugging in KVM is a much-wanted and highly appreciated 
> feature, thank you very much for making this real! However, there are a 
> couple of thoughts/comments I'd like to share with you.
> 

Thanks.

> The recent issue #757, as well as a couple of reports from Debian users 
> have uncovered a fundamental weakness in the implementation of 
> hotplugging in 2.10. In short, the problem is that KVMHypervisor has 
> changed its default behavior from letting qemu manage all device 
> assignments on the PCI bus to a hybrid model of manual assignment for 
> NICs and disks¹ and automatic assignment for the rest.  
> 
>  ¹ and recently soundhw, spice and virtio-balloon-pci
> 
> Now, as issue #757 showed, this is not very straightforward to get 
> right, because we have a single resource pool (the PCI slots) managed by 
> two entities in a non-cooperative manner, under a number of assumptions 
> that are not necessarily true, like:
> 
>  • The fact that the qemu-reserved PCI slots are exactly 3. This 
>    assumption is clearly in no way future-proof, as the way 
>    qemu internally assigns whatever devices it needs may change in the 
>    future.
> 
>  • The assumption that there will always be exactly 3 sound card types 
>    that occupy a PCI slot is also not guaranteed to hold in the future.
> 
>  • The assumption that balloon and SPICE occupy exactly one PCI slot 
>    each. Balloon might some day be e.g. multiplexed over virtio-scsi and 
>    SPICE might add more devices at some point.
> 
>  • The assumption that the user will not pass any PCI-backed devices 
>    (like virtio-rng-pci) via kvm_extra (which is still evaluated before 
>    the NICs and disks).
> 

Well, all the corner cases mentioned in issue #757 were found via
testing, due to the lack of documentation. I agree that we cannot count
on those corner cases staying the same in the near future.

> If any of the above doesn't hold, then the instance(s) become 
> unbootable, which is a pretty important issue. Also, fixes like 3a72e34 
> will almost certainly come up every time someone tries out a new 
> combination of devices. Additionally, 3a72e34 is a not-that-pretty 
> workaround: it adds unnecessary complexity to the code (moving more 
> devices to manual assignment mode) and hard-codes more of qemu's 
> internal information in Ganeti (like which soundcards use PCI), making 
> the code even more fragile and error-prone.
> 
> Now, what can we do to get things better? The way I see things, there 
> are two approaches, the "correct" one and the "robust workaround".
> 
> The "correct" solution
> ----------------------
> 

Correct but too complicated to implement.

> The "correct" solution would be to have qemu manage everything. qemu's 
> device_add does not require a PCI slot to be passed, instead it assigns 
> the first free slot on demand. Using this, the "correct" solution would 
> be to start qemu as normal and simply hotplug the devices without 
> specifying PCI slots.
> 
> The obvious benefit of this approach is that we have to make no 
> assumptions whatsoever about the internals of qemu. The significant 
> draw-back however, is that because the PCI layout may end up with a 
> different device order than what is in the configuration (because of 
> device hot-removal²), we cannot just rely on the instance runtime config 
> during live migrations; instead, the PCI state must be explicitly 
> extracted from "info pci" and reconstructed on the secondary node, 
> adding significant complexity to the code. I assume this is exactly what 
> the original implementation wanted to avoid (and for this adopted manual 
> slot assignments).
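
Just to illustrate what the state extraction would involve: a rough
sketch of recovering the device -> PCI slot mapping from the monitor's
"info pci" output, so it could be replayed on the migration target. The
exact text format assumed here is a guess and varies across qemu
versions; this is not production-ready parsing.

```python
import re

# Assumed "info pci" line shapes (may differ per qemu version):
#   Bus  0, device   4, function 0:
#       id "disk0"
_SLOT_RE = re.compile(r"Bus\s+(\d+), device\s+(\d+), function\s+(\d+)")
_ID_RE = re.compile(r'id "([^"]*)"')

def parse_info_pci(text):
    """Return {qdev_id: (bus, slot)} for devices that carry a qdev id."""
    mapping = {}
    slot = None
    for line in text.splitlines():
        m = _SLOT_RE.search(line)
        if m:
            slot = (int(m.group(1)), int(m.group(2)))
            continue
        m = _ID_RE.search(line)
        if m and m.group(1) and slot is not None:
            mapping[m.group(1)] = slot
    return mapping
```

All of this mapping would then have to be shipped to the secondary node
and turned back into explicit addr= arguments, which is exactly the
complexity mentioned above.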
> 
>  ² this is actually a behavior present in the current implementation as 
>    well.  Take the following scenario as an example (PCI slots: I=qemu 
>    internal, _=free, D=disk, N=nic, lowercase denotes current step)
> 
>    0. startup:
>             iiind__________________________
> 
>    1. gnt-instance modify --hotplug --net add
>             IIINDn_________________________
> 
>    2. gnt-instance modify --hotplug --disk add:size=2g
>             IIINDNd________________________
> 
>    3. gnt-instance modify --hotplug --net 1:remove
>             IIIND_D________________________
> 
>    4. gnt-instance modify --hotplug --disk add:size=2g
>             IIINDdD________________________
> 
>    The disk in step 4 will be inserted before the disk in step 2, because 
>    of the hole left by step 3.
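
The four steps above can be reproduced with a tiny simulation of the
"first free slot" policy (illustrative only; slot contents as in the
diagram: I=qemu internal, N=nic, D=disk, _=free):

```python
def first_free(slots):
    """Index of the lowest free slot, as qemu's auto-assignment picks it."""
    return slots.index("_")

slots = list("IIIND" + "_" * 27)    # step 0: startup

slots[first_free(slots)] = "N"      # step 1: --net add        -> slot 5
slots[first_free(slots)] = "D"      # step 2: --disk add       -> slot 6
slots[5] = "_"                      # step 3: --net 1:remove frees slot 5
slots[first_free(slots)] = "D"      # step 4: --disk add reuses the hole

print("".join(slots[:8]))           # -> IIINDDD_
```

The disk from step 4 really does land in slot 5, before the disk from
step 2.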
> 
> The "robust workaround"
> -----------------------
> 
> This is a minimal workaround that will fix the current situation and 
> allow some robustness, while keeping the current semantics, in a true 
> "Worse is Better" spirit. The idea is to keep manual assignments for 
> NICs and disks, but partition the PCI space in two areas: one managed by 
> qemu and one managed by the Ganeti's hotplug code. Ideally, using a 
> second PCI bus for hotplugging would be perfect, but this seems not to 
> be supported by qemu. An alternative would be adding a multi-function 
> pci-bridge, but I am not sure about the stability and scalability of 
> this (qemu 1.7 crashed on me a couple of times while experimenting).
> 
> Since we currently support 8 NICs and 16 disks max and virtio disks need 
> one PCI slot each, it makes sense to reserve the high 24 PCI slots for 
> these. This leaves qemu with 8 slots for dynamic assignments, which are 
> enough for the standard devices + spice + balloon + soundhw + a couple 
> more user-defined devices via kvm_extra. We can further partition the 
> hotplug range into 8 NIC slots and 16 disk slots (as different pools to 
> assign from) to avoid the interference between NIC and disk hotplugging 
> as described in ² above.
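
If I understand the proposed partitioning correctly, the pools would
behave roughly like this (an illustrative sketch with made-up names,
not actual Ganeti code):

```python
# Assumed layout: slots 0-7 left to qemu, 8-15 NIC pool, 16-31 disk pool.
QEMU_SLOTS = list(range(0, 8))
NIC_SLOTS = list(range(8, 16))      # 8 NICs max
DISK_SLOTS = list(range(16, 32))    # 16 disks max

def assign(pool, used):
    """Pick the lowest free slot from a dedicated pool."""
    for slot in pool:
        if slot not in used:
            used.add(slot)
            return slot
    raise IndexError("pool exhausted")

used = set()
d0 = assign(DISK_SLOTS, used)   # -> 16
n0 = assign(NIC_SLOTS, used)    # -> 8
used.discard(n0)                # hot-remove the NIC
d1 = assign(DISK_SLOTS, used)   # -> 17, NIC removal cannot shift disks
```

With separate pools, the hole left by a removed NIC can never be reused
by a disk, which avoids the reordering from ² above.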
> 
> The downside of this is that all instances will see all disk and NIC PCI 
> slots changed on the first boot with the new model (although relative 
> order will be preserved). Linux guests will cope with this fine, but I'm 
> not sure how Windows guests will do.
> 

Well, currently the first disk gets the first available PCI slot,
followed by the other disks, and then the NICs (see _GenerateKVMRuntime()).
Assuming that the first disk is the bootable one and that it has never
been removed, shifting all devices a few slots higher will not cause
any boot problems.

> Note that while we essentially still rely on some assumptions regarding 
> qemu's internals, we leave a big-enough buffer zone of 5 slots for qemu 
> to manage. Also note that moving from virtio-blk to virtio-scsi will 
> eventually allow us to free the 16 disk slots and substitute them with 
> only one for the virtio SCSI HBA, so the number of qemu-assignable slots 
> will likely increase in the future.
> 

Here you mean that the `virtio-scsi` device uses the `virtio-bus` instead of
the PCI bus, right? The `virtio-net-device` could be used too, I guess.

> Finally, as an added bonus we get stable PCI slots for NICs across 
> boots, regardless of what may be in kvm_extra, or turning ballooning on 
> and off.
> 
> Conclusion
> ----------
> 
> Apologies for the long analysis (thanks for reading this far!), but I 
> wanted to outline the problem in detail. Qemu is a fast-moving target 
> and being both future-proof and backwards-compatible is a difficult 
> task. IMHO, we should take the "robust workaround" course, together with 
> adding virtio-scsi support. I already have a patchset reverting 3a72e34 
> and implementing assignments from the high slots, that I can submit if 
> we agree on the above. I can also investigate virtio-scsi and post a 
> follow-up on it.
> 
> Please bring forth any comments and/or proposals! :-)
> 

So from what I understand, the proposed change will:

1) affect only virtio devices (NICs and disks)
2) keep the current hotplug rationale (only NodeD knows about PCI info)
3) change _DEFAULT_PCI_RESERVATIONS to reserve the first 8 slots
4) run `info pci` to get the current PCI status, find the available
   slots, and "OR" them with _DEFAULT_PCI_RESERVATIONS
5) revert the commit that fixes #757, in the sense that virtio-balloon
   and spice won't get PCI info.
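
A minimal sketch of what I understand step 4 to mean (illustrative
names only, not the actual Ganeti symbols or the real bitarray
representation):

```python
# Assumed: first 8 slots statically reserved for qemu, 32 slots total.
DEFAULT_PCI_RESERVATIONS = [True] * 8 + [False] * 24

def free_slots(occupied):
    """OR the runtime-occupied slots with the static reservations."""
    return [i for i, reserved in enumerate(DEFAULT_PCI_RESERVATIONS)
            if not reserved and i not in occupied]

# e.g. with a disk already hotplugged at slot 16:
print(free_slots({16})[:3])   # -> [8, 9, 10]
```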

If I don't miss anything here, I agree with all the above. So yes,
please proceed with patches.

Just one note here:

Currently *only* virtio devices obtain PCI info and thus are
hotplug-able. Originally, Ganeti used only the `-drive` option for
disks; hotplug support added the `-device` option so that they can
obtain PCI info. Should we extend the `-device` option to all devices?
Would this make any sense?

Regards,
dimara


> Thanks,
> Apollon


