On 12:23 Thu 08 Oct , Dimitris Aragiorgis wrote:
> Hi Apollon,
> 
> * Apollon Oikonomopoulos <[email protected]> [2015-10-02 12:36:44 +0300]:
> 
> > Hi Dimitris,
> > 
> > Thanks a lot for this work! Please see comments inline.
> > 
> > On 19:52 Thu 01 Oct , Dimitris Aragiorgis wrote:
> > > This is a design document detailing the refactoring of device
> > > handling in the KVM Hypervisor. More specifically, it will use the
> > > latest QEMU device model and modify the current hotplug implementation
> > > so that both PCI and SCSI devices can be managed.
> > >
> > > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > > ---
> > >  Makefile.am             |   1 +
> > >  doc/design-draft.rst    |   1 +
> > >  doc/design-scsi-kvm.rst | 239 +++++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 241 insertions(+)
> > >  create mode 100644 doc/design-scsi-kvm.rst
> > >
> > > diff --git a/Makefile.am b/Makefile.am
> > > index a6b9fbd..4506289 100644
> > > --- a/Makefile.am
> > > +++ b/Makefile.am
> > > @@ -717,6 +717,7 @@ docinput = \
> > >  	doc/design-reservations.rst \
> > >  	doc/design-resource-model.rst \
> > >  	doc/design-restricted-commands.rst \
> > > +	doc/design-scsi-kvm.rst \
> > >  	doc/design-shared-storage.rst \
> > >  	doc/design-shared-storage-redundancy.rst \
> > >  	doc/design-ssh-ports.rst \
> > > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > > index 3f2a1f3..83a392c 100644
> > > --- a/doc/design-draft.rst
> > > +++ b/doc/design-draft.rst
> > > @@ -25,6 +25,7 @@ Design document drafts
> > >     design-configlock.rst
> > >     design-multi-storage-htools.rst
> > >     design-repaird.rst
> > > +   design-scsi-kvm.rst
> > >
> > >  .. vim: set textwidth=72 :
> > >  .. Local Variables:
> > > diff --git a/doc/design-scsi-kvm.rst b/doc/design-scsi-kvm.rst
> > > new file mode 100644
> > > index 0000000..30cc7de
> > > --- /dev/null
> > > +++ b/doc/design-scsi-kvm.rst
> > > @@ -0,0 +1,239 @@
> > > +==========
> > > +KVM + SCSI
> > > +==========
> > > +
> > > +.. contents:: :depth: 4
> > > +
> > > +This is a design document detailing the refactoring of device
> > > +handling in the KVM Hypervisor. More specifically, it will use
> > > +the latest QEMU device model and modify the hotplug implementation
> > > +so that both PCI and SCSI devices can be managed.
> > > +
> > > +
> > > +Current state and shortcomings
> > > +==============================
> > > +
> > > +Ganeti currently supports SCSI virtual devices in the KVM hypervisor by
> > > +setting the `disk_type` hvparam to `scsi`. Ganeti will eventually
> > > +instruct QEMU to use the deprecated device model (i.e. -drive if=scsi),
> > > +which will expose the backing store as an emulated SCSI device. This
> > > +means that currently SCSI pass-through is not supported.
> > > +
> > > +On the other hand, the current hotplug implementation
> > > +:doc:`design-hotplug` uses the latest QEMU
> > > +device model (via the -device option) and is tailored to paravirtual
> > > +devices, which leads to buggy behavior: if we hotplug a disk to an
> > > +instance that is configured with the disk_type=scsi hvparam, the
> > > +disk that eventually gets hot-plugged will be a VirtIO device
> > > +(i.e., virtio-blk-pci) on the PCI bus.
> > > +
> > > +The current implementation of creating the QEMU command line is
> > > +error-prone, since an instance might not be able to boot due to PCI slot
> > > +congestion.
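
A side note on this section, mostly for the archives: with the deprecated
model the SCSI controller is implicit, i.e. a line like the following
(reusing the example path that appears further down in the patch)

  -drive file=/var/run/ganeti/instance-disks/test:0,if=scsi,format=raw

makes QEMU instantiate an lsi adapter behind the scenes and attach an
emulated disk to it. This is why neither SCSI pass-through nor controlled
hotplugging is possible with it, in contrast with the explicit -device
model proposed below.
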
> > > +
> > > +
> > > +Proposed changes
> > > +================
> > > +
> > > +We change the way that the KVM hypervisor handles block devices by
> > > +introducing the latest QEMU device model for SCSI devices as well, so that
> > > +scsi-cd, scsi-hd, scsi-block, and scsi-generic device drivers are
> > > +supported too. Additionally we refactor the hotplug implementation in
> > > +order to support hotplugging of SCSI devices too. Finally, we change the
> > > +way we keep track of device info inside runtime files, and the way we
> > > +place each device upon instance startup.
> > > +
> > > +Design decisions
> > > +================
> > > +
> > > +How to identify each device?
> > > +
> > > +Currently KVM does not support arbitrary IDs for devices; supported are
> > > +only names starting with a letter, with max 32 chars length, and only
> > > +including the '.', '_', '-' special chars. Currently we generate an ID
> > > +with the following format: <device type>-<part of uuid>-pci-<slot>.
> > > +This assumes that the device will be plugged in a certain slot on the
> > > +PCI bus. Since we want to support devices on a SCSI bus too and adding
> > > +the PCI slot to the ID is redundant, we keep only the first two parts of
> > > +the existing ID.
> > > +
> > > +
> > > +Which buses does the guest eventually see?
> > > +
> > > +By default QEMU starts with a single PCI bus named "pci.0". In case a
> > > +SCSI controller is added on this bus, a SCSI bus is created with
> > > +the corresponding name: "scsi.0".
> > > +Any SCSI disks will be attached on this SCSI bus. Currently Ganeti does
> > > +not explicitly use a SCSI controller via a command line option, but lets
> > > +QEMU add one automatically if needed. Here, in case we have a SCSI disk,
> > > +a SCSI controller is explicitly added via the -device option. For the
> > > +SCSI controller, we do not specify the PCI slot to use, but let QEMU find
> > > +the first available one (see below).
> > > +
> > > +
> > > +What type of SCSI controller to use?
> > > +
> > > +QEMU uses the `lsi` controller by default. To make this configurable we
> > > +add a new hvparam, `scsi_controller_type`. The available types will be
> > > +`lsi`, `megasas`, and `virtio-scsi-pci`.
> > > +
> > > +
> > > +Where to place the devices upon instance startup?
> > > +
> > > +By default QEMU boots with the first three slots of the default PCI bus
> > > +(pci.0) occupied: slot 0 for Host bridge, slot 1 for ISA bridge, and
> > > +slot 2 for VGA controller. Thereafter, the slots depend on the QEMU
> > > +options passed in the command line.
> > 
> > Note that this is true for the `pc' and `pc-i440fx-*' machine models.
> > However, the `q35' machine model emulates an ICH9-based motherboard that
> > also has an additional multifunction device on PCI slot 31, hosting an
> > ISA bridge, a SATA controller and an SMBus adapter.
> > 
> 
> I looked into the issue you mentioned with the q35 machine type a bit
> more. Currently, even before this patch, Ganeti cannot even start a q35
> machine with paravirtual devices.
> Apparently no Ganeti user has used the q35 machine model seriously yet :)
> 
> The problem is that q35 defines a *PCI Express* bus, not a PCI bus.
> The current code assumes the presence of a PCI bus (pci.0).
> So this discussion that started at GanetiCon seems irrelevant.
> 
> It is not a matter of reserving PCI slot 31; q35 machines expose
> a PCI Express bus (pcie.0) that does not even support hotplugging:
> http://www.linux-kvm.org/page/PCITodo
> 
> So handling the q35 machine type is another story, and certainly
> outside the scope of this patch set.
> 
> After this patch set has been merged, a separate patch can explore
> supporting the q35 machine type. This however may prove more
> complicated than it looks:
> 
> 1. The pcie.0 bus is not hotpluggable so we cannot hotplug devices
>    directly onto it.
> 
> 2. One could work around it by ensuring a PCI bus (pci.0) is always
>    available:
>    pcie.0 -> dmi-to-pci-bridge -> pci-to-pci-bridge (this will be pci.0).
>    This would allow the machine to boot but still wouldn't solve the
>    hotplug problem. PCI bridges do not yet support hotplugging in QEMU.
>    Libvirt seems to follow this approach.

Good, I didn't go that far with my research :-) I guess for the time
being we need to add an erratum indicating that Ganeti versions with
hotplugging support do not support the Q35 machine model as of qemu 2.4.
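
For future reference, the bridge chain from (2) would look roughly like
this on the QEMU command line (untested, and the bridge device names are
from memory, so treat it as a sketch only):

  -M q35 \
  -device i82801b11-bridge,id=dmi2pci,bus=pcie.0,addr=0x1e \
  -device pci-bridge,id=pci.0,bus=dmi2pci,chassis_nr=1

i.e. a DMI-to-PCI bridge hanging off pcie.0 and a plain PCI bridge behind
it, providing a pci.0 for our devices. As you say, this would only get the
machine to boot; it would not make hotplug work.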

> > > +The main reason that we want to be fully aware of the configuration of a
> > > +running instance (machine type, PCI and SCSI bus state, devices, etc.)
> > > +is that in case of migration a QEMU process with the exact same
> > > +configuration should be created on the target node. The configuration is
> > > +kept in the runtime file created just before starting the instance.
> > > +Since hotplug has been introduced, the only thing that can change after
> > > +starting an instance is the configuration related to NICs and Disks.
> > > +
> > > +Before implementing hotplug, Ganeti did not specify PCI slots
> > > +explicitly, but let QEMU decide how to place the devices on the
> > > +corresponding bus. This does not work if we want to have hot-pluggable
> > > +devices and migratable VMs. Currently, upon runtime file creation, we
> > > +try to reserve PCI slots based on the hvparams, the disks, and the NICs
> > > +of the instance. This has three major shortcomings: first, we have to be
> > > +aware of which options modify the PCI bus, which is practically impossible
> > > +due to the huge amount of QEMU options; second, QEMU may change the
> > > +default PCI configuration from version to version; and third, we cannot
> > > +know if the extra options passed by the user via the `kvm_extra` hvparam
> > > +modify the PCI bus.
> > > +
> > > +All the above makes the current implementation error-prone: an instance
> > > +might not be able to boot if we explicitly add a NIC/Disk on a specific
> > > +PCI slot that QEMU has already used for another device while parsing
> > > +its command line options. Besides that, we now want to use the SCSI bus
> > > +as well, so the above mechanism is insufficient. Here, we decide to put
> > > +only disks and NICs on specific slots on the corresponding bus, and let
> > > +QEMU put everything else automatically. To this end, we decide to let
> > > +the first 12 PCI slots be managed by QEMU, and we start adding PCI
> > > +devices (VirtIO block and network devices) from the 13th slot onwards.
> > > +As far as the SCSI bus is concerned, we decide to put each SCSI
> > > +disk on a different scsi-id (which corresponds to a different target
> > > +number in SCSI terminology). The SCSI bus will not have any default
> > > +reservations.
> > 
> > Please note that different SCSI controllers have different device
> > limits: IIRC the lsi controller supports up to 7 targets with a single
> > LUN per target, while virtio-scsi-pci supports up to 255 targets with
> > 16383 LUNs each. This means that we (currently) cannot offer the
> > theoretical maximum of 16 disks (constants.MAX_DISKS) with a single lsi
> > controller.
> > 
> 
> The current situation is that QEMU uses an lsi controller implicitly.
> So we are bound by the limit of 7 SCSI disks anyway.
> 
> This patch increases the number of usable SCSI disks to the hard limit
> imposed by constants.MAX_DISKS anyway. The only requirement is that the
> user actually sets scsi_controller_type to a value which can support
> that many disks, i.e., anything other than lsi.

And this should be documented, thanks!
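
Apart from the documentation, a cheap sanity check when building the
runtime file would also help users. A rough sketch of what I have in mind
(the helper and the constant below are made up; the limits are the ones
discussed above, and megasas would still need to be looked up):

  from ganeti import errors

  _SCSI_CONTROLLER_MAX_DISKS = {
    "lsi": 7,                        # 7 targets with a single LUN each
    "virtio-scsi-pci": 255 * 16383,  # 255 targets with 16383 LUNs each
    }

  def _CheckScsiDiskLimit(scsi_controller_type, num_scsi_disks):
    """Fail early if the controller cannot address that many disks."""
    limit = _SCSI_CONTROLLER_MAX_DISKS.get(scsi_controller_type)
    if limit is not None and num_scsi_disks > limit:
      raise errors.HypervisorError("Cannot use %d SCSI disks with a single"
                                   " %s controller (max %d)" %
                                   (num_scsi_disks, scsi_controller_type,
                                    limit))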

> > > +
> > > +
> > > +How to support more than 16 PCI devices?
> > > +
> > > +In order to be able to have more than 16 devices on the PCI bus we add
> > > +the new `default_pci_reservations` hvparam to denote how many PCI slots
> > > +QEMU will handle implicitly. The rest will be available for disks and
> > > +NICs inserted explicitly by Ganeti. By default the number of PCI
> > > +reservations will be 12, as explained above.
> > 
> > It's not clear to me why 16 is special here. 32 - 12 = 20, and what
> > might be of interest for us is being able to hotplug 24 devices (16
> > (MAX_DISKS using virtio-blk-pci) + 8 (MAX_NICS)).
> > 
> 
> 16 default reservations was the choice of the first implementation.
> This will change, since Riba suggested (and I agreed, since this is now
> configurable) to go with 12 default PCI reservations.
> So I will rename this section to "How to support the theoretical maximum
> of devices, 16 disks and 8 NICs?" and mention the usage of
> kvm_pci_reservations.
> 
> > > +How to keep track of the bus state of a running instance?
> > > +
> > > +To be able to hotplug a device, we need to know which slot is
> > > +available on the desired bus. Until now, we were using the ``query-pci``
> > > +QMP command that returns the state of the PCI buses (i.e., which devices
> > > +occupy which slots). Unfortunately, there is no equivalent for the SCSI
> > > +buses. We could use the ``info qtree`` HMP command that practically
> > > +dumps the whole device tree in plain text, which makes it really hard to
> > > +parse. So we decide to generate the bus state of a running instance
> > > +through our local runtime files.
> > 
> > You don't need to treat all buses the same way, as they have some
> > fundamental differences: the PCI bus is de facto shared with the QEMU
> > process (and the user/admin via kvm_extra); however, it is safe and
> > reasonable to assume that a SCSI bus on a controller plugged by us is
> > exclusively owned by us. That said, it is reasonable to store SCSI IDs,
> > but you should always query the PCI bus for robustness.
> > 
> > As discussed at GanetiCon and as I wrote earlier [1], PCI bus management
> > has a number of caveats we should be aware of. I'm not particularly fond
> > of partitioning the PCI space, mostly because it is not robust and it
> > also leads to underutilization: we have to reserve space for things we
> > will likely never use, and let the user work her way around this by
> > adjusting an additional HV parameter (default_pci_reservations) if she
> > wants to do something fancy with kvm_extra. Note that the PCI slot
> > requirements vary with the disk/NIC type, from 1 PCI slot per disk for
> > virtio-blk-pci, to 1 PCI slot per 7 disks for lsi and per 255 * 16383
> > disks for virtio-scsi-pci, so it's really hard to pick a good default
> > for pre-reserved PCI slots.
> > 
> > I still believe that the best approach would be to manage the SCSI
> > bus(es) exclusively, but let QEMU manage hotplugging on the PCI bus and
> > reconstruct the layout during live migration on the receiving side, as
> > outlined in [1]. This would do away with partitioning the PCI slot space
> > and would also guarantee maximum flexibility and robustness regardless
> > of what the user might pass via kvm_extra or what future qemu machine
> > models might place on the PCI bus.
> > 
> > [1]
> > https://groups.google.com/forum/#!searchin/ganeti-devel/hotplugging/ganeti-devel/8zvzAfQyoxU/B7_fcbr2-kcJ
> > 
> 
> I see that back then, you considered partitioning the PCI space a
> "robust workaround".
> Copying from [1]: "The idea is to keep manual assignments for NICs and
> disks, but partition the PCI space in two areas: one managed by qemu
> and one managed by the Ganeti's hotplug code."

Yes, but a workaround is a workaround, not a correct solution. Anyway,
please proceed as you see fit; at this point both solutions will improve
the current situation.

> 
> I agree that partitioning the PCI space may lead to underutilization
> in extreme cases. However, as you also mentioned in [1], the "correct
> solution" of reconstructing all PCI state during migration would
> require significant effort, and would lead to major refactoring and
> less homogeneous code.

I don't recall saying anything about code homogeneity and refactoring.
I'm just pointing out the drawbacks of each solution, so that we are all
aware of their limitations in advance.

> > > +
> > > +What info should be kept in runtime files?
> > > +
> > > +Runtime files are used for instance migration (to run a QEMU process on
> > > +the target node with the same configuration) and for hotplug actions (to
> > > +update the configuration of a running instance so that it can be
> > > +migrated). Until now we were using devices only on the PCI bus, so only
> > > +each device's PCI slot had to be kept in the runtime file. This is
> > > +obviously not enough. We decide to replace the `pci` slot of Disk and
> > > +NIC configuration objects with an `hvinfo` dict. It will contain all
> > > +necessary info for constructing the appropriate -device QEMU option.
> > > +Specifically the `driver`, `id`, and `bus` parameters will be present
> > > +for all kinds of devices. PCI devices will have the `addr` parameter, SCSI
> > > +devices will have `channel`, `scsi-id`, and `lun`. NICs and Disks will
> > > +have the extra `netdev` and `drive` parameters respectively.
> > > +
> > > +
> > > +How to deal with existing instances?
> > > +
> > > +Only existing instances with paravirtual devices (configured via the
> > > +disk_type and nic_type hvparams) use the latest QEMU device model. Only
> > > +these have the `pci` slot filled. We will use the existing
> > > +_UpgradeSerializedRuntime() method to migrate the old runtime format
> > > +with `pci` slot in Disk and NIC configuration objects to the new one
> > > +with `hvinfo` instead. The new hvinfo will contain the old driver
> > > +(either virtio-blk-pci or virtio-net-pci), the old id
> > > +(hotdisk-123456-pci-4), the default PCI bus (pci.0), and the old PCI
> > > +slot (addr=4). This way those devices will still be hot-pluggable, and
> > > +the instance will still be migratable. When those instances are
> > > +rebooted, the hvinfo will be re-generated.
> > > +
> > > +
> > > +How to support downgrades?
> > > +
> > > +There are two possible ways, both not very pretty. The first one is to
> > > +use _UpgradeSerializedRuntime() to remove the hvinfo slot. This would
> > > +require patching all Ganeti versions down to 2.10, which is practically
> > > +impossible. Another way is to ssh to all nodes and remove this slot upon
> > > +a cluster downgrade. This ugly hack would go away in 2.17, since we support
> > > +downgrades only to the previous minor version.
> > 
> > IIRC, you don't need to do anything special for downgrades, as long as
> > you don't add an array member to the top-level array in the runtime (as
> > was done in 2.10 to introduce hotplugging). An additional dict field
> > should be ignored by previous versions.
> > IOW, if you just keep the old
> > PCI-based IDs in place, there should be no problems, but I'm not 100%
> > sure about that.
> > 
> 
> _AnalyzeSerializedRuntime() uses the FromDict() method of Disk and NIC
> objects. Such objects would not have an hvinfo dict on versions < 2.16.
> So, if a VM started with 2.16 (including 'hvinfo' in its runtime file)
> needs to be migrated after a downgrade, it will fail while reading the
> runtime file. Riba suggested the workaround; we know it is ugly, but it
> will go away in 2.17.
> 
> > > +
> > > +
> > > +Configuration changes
> > > +---------------------
> > > +
> > > +The ``NIC`` and ``Disk`` objects get one extra slot: ``hvinfo``. It is
> > > +hypervisor-specific and will never reach config.data. In the case of the KVM
> > > +hypervisor it will contain all necessary info for constructing the -device
> > > +QEMU option. Existing entries in runtime files that had a `pci` slot
> > > +will be upgraded to have the corresponding `hvinfo` (see above).
> > > +
> > > +The new `scsi_controller_type` hvparam is added to denote what type of
> > > +SCSI controller should be added to the PCI bus if we have a SCSI disk.
> > > +Allowed values will be `lsi`, `virtio-scsi-pci`, and `megasas`.
> > > +We decide to use `lsi` by default since this is the one that QEMU
> > > +adds automatically if not specified explicitly by an option.
> > > +
> > > +
> > > +Hypervisor changes
> > > +------------------
> > > +
> > > +The current implementation verifies whether a hotplug action has succeeded
> > > +by scanning the PCI bus and searching for a specific device ID. This
> > > +will change, and we will use the ``query-block`` QMP command along with
> > > +``query-pci`` to also find block devices that are attached to the
> > > +SCSI bus.
> > > +
> > > +Up until now, if the `disk_type` hvparam was set to `scsi`, QEMU would use the
> > > +deprecated device model and end up using SCSI emulation, e.g.:
> > > +
> > > + ::
> > > +
> > > + -drive file=/var/run/ganeti/instance-disks/test:0,if=scsi,format=raw
> > > +
> > > +Now the equivalent, which will also enable hotplugging, will be to set
> > > +disk_type to `scsi-hd`. The QEMU command line will include:
> > > +
> > > + ::
> > > +
> > > + -drive file=/var/run/ganeti/instance-disks/test:0,if=none,format=raw,id=hotdisk-12345
> > > + -device scsi-hd,id=hotdisk-12345,drive=hotdisk-12345,bus=scsi.0,channel=0,scsi-id=0,lun=0
> > 
> > Can we drop the `hot' part from `hotdisk'? This will allow us to use a
> > couple more bits from the UUID, reducing the chances of a collision. I
> > would go as far as using plain `d-' for disks and `n-' for NICs.
> > 
> 
> OK, I have already removed the 'hot' part, and will add the next 2 parts
> of the UUID so that the chance of collisions gets reduced significantly.
> 
> I prefer to leave the 'disk-' and 'nic-' prefixes for reasons of
> readability.

Ack, thanks.

> 
> To sum up, I will make the following modifications to the current design doc:
> 
> - Reword the section "How to support more than 16 devices?"
> - Update the section "How to identify each device?"
> - Update the examples with the proper device IDs
> 
> After this patch has been merged, and you have time, I would propose
> that you amend the design doc with a "future work" section, so that your
> thoughts are persisted for anyone willing to contribute in the future.
> 
> Thanks,
> dimara

LGTM, thanks.
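
P.S. To make sure we mean the same thing by `hvinfo`: here is roughly what
I would expect to end up in the runtime file for the two disk classes (all
values below are made up, only the keys follow the design text):

  # a virtio-blk-pci disk, placed on the PCI bus
  {"driver": "virtio-blk-pci", "id": "disk-9e87ba60", "bus": "pci.0",
   "addr": "0xc", "drive": "disk-9e87ba60"}

  # a scsi-hd disk, placed behind the SCSI controller
  {"driver": "scsi-hd", "id": "disk-3b5f9a10", "bus": "scsi.0",
   "channel": 0, "scsi-id": 0, "lun": 0, "drive": "disk-3b5f9a10"}

NICs would look like the first one, with `netdev` instead of `drive`.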
