Hi Apollon,

* Apollon Oikonomopoulos <[email protected]> [2015-10-02 12:36:44 +0300]:
> Hi Dimitris,
>
> Thanks a lot for this work! Please see comments inline.
>
> On 19:52 Thu 01 Oct , Dimitris Aragiorgis wrote:
> > This is a design document detailing the refactoring of device
> > handling in the KVM Hypervisor. More specifically, it will use the
> > latest QEMU device model and modify the current hotplug
> > implementation so that both PCI and SCSI devices can be managed.
> >
> > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > ---
> >  Makefile.am             |   1 +
> >  doc/design-draft.rst    |   1 +
> >  doc/design-scsi-kvm.rst | 239 +++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 241 insertions(+)
> >  create mode 100644 doc/design-scsi-kvm.rst
> >
> > diff --git a/Makefile.am b/Makefile.am
> > index a6b9fbd..4506289 100644
> > --- a/Makefile.am
> > +++ b/Makefile.am
> > @@ -717,6 +717,7 @@ docinput = \
> >  	doc/design-reservations.rst \
> >  	doc/design-resource-model.rst \
> >  	doc/design-restricted-commands.rst \
> > +	doc/design-scsi-kvm.rst \
> >  	doc/design-shared-storage.rst \
> >  	doc/design-shared-storage-redundancy.rst \
> >  	doc/design-ssh-ports.rst \
> > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > index 3f2a1f3..83a392c 100644
> > --- a/doc/design-draft.rst
> > +++ b/doc/design-draft.rst
> > @@ -25,6 +25,7 @@ Design document drafts
> >     design-configlock.rst
> >     design-multi-storage-htools.rst
> >     design-repaird.rst
> > +   design-scsi-kvm.rst
> >
> >  .. vim: set textwidth=72 :
> >  .. Local Variables:
> > diff --git a/doc/design-scsi-kvm.rst b/doc/design-scsi-kvm.rst
> > new file mode 100644
> > index 0000000..30cc7de
> > --- /dev/null
> > +++ b/doc/design-scsi-kvm.rst
> > @@ -0,0 +1,239 @@
> > +==========
> > +KVM + SCSI
> > +==========
> > +
> > +.. contents:: :depth: 4
> > +
> > +This is a design document detailing the refactoring of device
> > +handling in the KVM Hypervisor. More specifically, it will use
> > +the latest QEMU device model and modify the hotplug implementation
> > +so that both PCI and SCSI devices can be managed.
> > +
> > +
> > +Current state and shortcomings
> > +==============================
> > +
> > +Ganeti currently supports SCSI virtual devices in the KVM
> > +hypervisor by setting the `disk_type` hvparam to `scsi`. Ganeti
> > +will then instruct QEMU to use the deprecated device model (i.e.
> > +-drive if=scsi), which will expose the backing store as an
> > +emulated SCSI device. This means that SCSI pass-through is
> > +currently not supported.
> > +
> > +On the other hand, the current hotplug implementation
> > +:doc:`design-hotplug` uses the latest QEMU device model (via the
> > +-device option) and is tailored to paravirtual devices, which
> > +leads to buggy behavior: if we hotplug a disk to an instance that
> > +is configured with the disk_type=scsi hvparam, the disk that
> > +eventually gets hot-plugged will be a VirtIO device (i.e.,
> > +virtio-blk-pci) on the PCI bus.
> > +
> > +The current way of creating the QEMU command line is error-prone,
> > +since an instance might not be able to boot due to PCI slot
> > +congestion.
> > +
> > +
> > +Proposed changes
> > +================
> > +
> > +We change the way the KVM hypervisor handles block devices by
> > +introducing the latest QEMU device model for SCSI devices as well,
> > +so that the scsi-cd, scsi-hd, scsi-block, and scsi-generic device
> > +drivers are supported too. Additionally, we refactor the hotplug
> > +implementation in order to support hotplugging of SCSI devices.
> > +Finally, we change the way we keep track of device info inside
> > +runtime files, and the way we place each device upon instance
> > +startup.
> > +
> > +
> > +Design decisions
> > +================
> > +
> > +How to identify each device?
> > +
> > +Currently KVM does not support arbitrary IDs for devices; only
> > +names that start with a letter, are at most 32 characters long,
> > +and include only the '.', '_', and '-' special characters are
> > +supported. Currently we generate an ID with the following format:
> > +<device type>-<part of uuid>-pci-<slot>. This assumes that the
> > +device will be plugged in a certain slot on the PCI bus. Since we
> > +want to support devices on a SCSI bus too, and adding the PCI slot
> > +to the ID is redundant, we keep only the first two parts of the
> > +existing ID.
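
For illustration, deriving such a bus-agnostic ID could look roughly
like this (a sketch only; kvm_device_id is a hypothetical name, not
actual Ganeti code):

import re

# QEMU accepts only IDs that start with a letter, are at most 32
# characters long, and contain no special characters besides '.',
# '_' and '-'.
_QEMU_ID_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9._-]{0,31}$")

def kvm_device_id(dev_type, uuid):
    """Build a bus-agnostic QEMU ID: <device type>-<part of uuid>."""
    dev_id = "%s-%s" % (dev_type, uuid.split("-")[0])
    if not _QEMU_ID_RE.match(dev_id):
        raise ValueError("Invalid QEMU device ID: %s" % dev_id)
    return dev_id

# kvm_device_id("disk", "9f3b54c2-16dd-4ef1-a5f8-6c3a67a41f17")
# returns "disk-9f3b54c2"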
> > +
> > +
> > +Which buses does the guest eventually see?
> > +
> > +By default QEMU starts with a single PCI bus named "pci.0". In
> > +case a SCSI controller is added on this bus, a SCSI bus is created
> > +with the corresponding name: "scsi.0". Any SCSI disks will be
> > +attached to this SCSI bus. Currently Ganeti does not explicitly
> > +add a SCSI controller via a command line option, but lets QEMU add
> > +one automatically if needed. Here, in case we have a SCSI disk, a
> > +SCSI controller is explicitly added via the -device option. For
> > +the SCSI controller, we do not specify the PCI slot to use, but
> > +let QEMU find the first available one (see below).
> > +
> > +
> > +What type of SCSI controller to use?
> > +
> > +QEMU uses the `lsi` controller by default. To make this
> > +configurable we add a new hvparam, `scsi_controller_type`. The
> > +available types will be `lsi`, `megasas`, and `virtio-scsi-pci`.
> > +
> > +
> > +Where to place the devices upon instance startup?
> > +
> > +By default QEMU boots with the first three slots of the default
> > +PCI bus (pci.0) occupied: slot 0 for the Host bridge, slot 1 for
> > +the ISA bridge, and slot 2 for the VGA controller. Thereafter, the
> > +slots depend on the QEMU options passed in the command line.
>
> Note that this is true for the `pc' and `pc-i440fx-*' machine models.
> However, the `q35' machine model emulates an ICH9-based motherboard
> that also has an additional multifunction device on PCI slot 31,
> hosting an ISA bridge, a SATA controller and an SMBus adapter.
>

I looked into the issue you mentioned with the q35 machine type a bit
more. Currently, even before this patch, Ganeti cannot even start a
q35 machine with paravirtual devices. Apparently no Ganeti user has
used the q35 machine model seriously yet :)

The problem is that q35 defines a *PCI Express* bus, not a PCI bus.
The current code assumes the presence of a PCI bus (pci.0). So this
discussion that started at GanetiCon seems irrelevant. It is not a
matter of reserving PCI slot 31; q35 machines expose a PCI Express
bus (pcie.0) that does not even support hotplugging:

http://www.linux-kvm.org/page/PCITodo

So handling the q35 machine type is another story, and certainly
outside the scope of this patch set. After this patch set has been
merged, a separate patch can explore supporting the q35 machine type.
This however may prove more complicated than it looks:

1. The pcie.0 bus is not hotpluggable, so we cannot hotplug devices
   directly onto it.

2. One could work around this by ensuring a PCI bus (pci.0) is always
   available: pcie.0 -> dmi-to-pci-bridge -> pci-to-pci-bridge (this
   will be pci.0). This would allow the machine to boot but still
   wouldn't solve the hotplug problem: PCI bridges do not yet support
   hotplugging in QEMU. Libvirt seems to follow this approach; a
   rough sketch of the bridge chain follows.
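
Untested sketch of that bridge chain, expressed as the QEMU arguments
we would have to generate (the device names are the ones QEMU provides
for this purpose; nothing here is generated by Ganeti today):

# Hang a conventional PCI bus off the q35 PCI Express root complex.
q35_bridge_args = [
    "-machine", "q35",
    # DMI-to-PCI bridge plugged into the PCI Express root (pcie.0)
    "-device", "i82801b11-bridge,id=dmi2pci,bus=pcie.0",
    # conventional PCI bus behind a PCI-to-PCI bridge; devices meant
    # to be hotpluggable would go on this bus (id=pci.0)
    "-device", "pci-bridge,id=pci.0,bus=dmi2pci,chassis_nr=1",
]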
> > +
> > +The main reason that we want to be fully aware of the
> > +configuration of a running instance (machine type, PCI and SCSI
> > +bus state, devices, etc.) is that in case of migration a QEMU
> > +process with the exact same configuration should be created on
> > +the target node. The configuration is kept in the runtime file
> > +created just before starting the instance. Since hotplug has been
> > +introduced, the only thing that can change after starting an
> > +instance is the configuration related to NICs and Disks.
> > +
> > +Before implementing hotplug, Ganeti did not specify PCI slots
> > +explicitly, but let QEMU decide how to place the devices on the
> > +corresponding bus. This does not work if we want to have
> > +hotplug-able devices and migrate-able VMs. Currently, upon runtime
> > +file creation, we try to reserve PCI slots based on the hvparams,
> > +the disks, and the NICs of the instance. This has three major
> > +shortcomings: first, we have to be aware of which options modify
> > +the PCI bus, which is practically impossible given the huge number
> > +of QEMU options; second, QEMU may change the default PCI
> > +configuration from version to version; and third, we cannot know
> > +if the extra options passed by the user via the `kvm_extra`
> > +hvparam modify the PCI bus.
> > +
> > +All the above makes the current implementation error-prone: an
> > +instance might not be able to boot if we explicitly add a NIC/Disk
> > +on a specific PCI slot that QEMU has already used for another
> > +device while parsing its command line options. Besides that, we
> > +now want to use the SCSI bus as well, so the above mechanism is
> > +insufficient. Here, we decide to put only disks and NICs on
> > +specific slots on the corresponding bus, and let QEMU place
> > +everything else automatically. To this end, we decide to let the
> > +first 12 PCI slots be managed by QEMU, and we start adding PCI
> > +devices (VirtIO block and network devices) from the 13th slot
> > +onwards. As far as the SCSI bus is concerned, we decide to put
> > +each SCSI disk on a different scsi-id (which corresponds to a
> > +different target number in SCSI terminology). The SCSI bus will
> > +not have any default reservations.
>
> Please note that different SCSI controllers have different device
> limits: IIRC the lsi controller supports up to 7 targets with a
> single LUN per target, while virtio-scsi-pci supports up to 255
> targets with 16383 LUNs each. This means that we (currently) cannot
> offer the theoretical maximum of 16 disks (constants.MAX_DISKS) with
> a single lsi controller.
>

The current situation is that QEMU uses an lsi controller implicitly,
so we are bound by the limit of 7 SCSI disks anyway. This patch
increases the number of usable SCSI disks to the hard limit imposed
by constants.MAX_DISKS. The only requirement is that the user
actually sets scsi_controller_type to a value which can support that
many disks, i.e., anything other than lsi.
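
By the way, to make the placement policy described in the quoted
section concrete, here is a minimal sketch (hypothetical names, not
actual Ganeti code):

# QEMU owns the first 12 PCI slots; Ganeti assigns disks/NICs from
# slot 12 (i.e. the 13th slot) onwards. On the SCSI bus, each disk
# simply gets its own target (scsi-id), with no default reservations.
DEFAULT_PCI_RESERVATIONS = 12
PCI_BUS_SLOTS = 32

def next_free_pci_slot(occupied_slots):
    """Return the first PCI slot that is neither reserved nor in use."""
    for slot in range(DEFAULT_PCI_RESERVATIONS, PCI_BUS_SLOTS):
        if slot not in occupied_slots:
            return slot
    raise RuntimeError("No free slot on the PCI bus")

def next_free_scsi_id(occupied_ids):
    """Return the first unused scsi-id (target) on the SCSI bus."""
    scsi_id = 0
    while scsi_id in occupied_ids:
        scsi_id += 1
    return scsi_id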
> > +
> > +
> > +How to support more than 16 PCI devices?
> > +
> > +In order to be able to have more than 16 devices on the PCI bus,
> > +we add the new `default_pci_reservations` hvparam to denote how
> > +many PCI slots QEMU will handle implicitly. The rest will be
> > +available for disks and NICs inserted explicitly by Ganeti. By
> > +default the PCI reservations will be 12, as explained above.
>
> It's not clear to me why 16 is special here. 32 - 12 = 20, and what
> might be of interest for us is being able to hotplug 24 devices (16
> (MAX_DISKS using virtio-blk-pci) + 8 (MAX_NICS)).
>

16 default reservations was the choice of the first implementation.
This will change, since Riba suggested going with 12 default PCI
reservations (and I agreed, since this is now configurable). So I
will rename this section to "How to support the theoretical maximum
of devices, 16 disks and 8 NICs?" and mention the usage of
kvm_pci_reservations.

> > +
> > +
> > +How to keep track of the bus state of a running instance?
> > +
> > +To be able to hotplug a device, we need to know which slot is
> > +available on the desired bus. Until now, we were using the
> > +``query-pci`` QMP command that returns the state of the PCI buses
> > +(i.e., which devices occupy which slots). Unfortunately, there is
> > +no equivalent for the SCSI buses. We could use the ``info qtree``
> > +HMP command, which practically dumps the whole device tree in
> > +plain text; this makes it really hard to parse. So we decide to
> > +generate the bus state of a running instance from our local
> > +runtime files.
>
> You don't need to treat all buses the same way, as they have some
> fundamental differences: the PCI bus is de-facto shared with the
> QEMU process (and the user/admin via kvm_extra), however it is safe
> and reasonable to assume that a SCSI bus on a controller plugged by
> us is exclusively owned by us. That said, it is reasonable to store
> SCSI IDs, but you should always query the PCI bus for robustness.
>
> As discussed at GanetiCon and as I wrote earlier [1], PCI bus
> management has a number of caveats we should be aware of. I'm not
> particularly fond of partitioning the PCI space, mostly because it
> is not robust and it also leads to underutilization: we have to
> reserve space for things we will likely never use, and let the user
> work her way around this by adjusting an additional HV parameter
> (default_pci_reservations) if she wants to do something fancy with
> kvm_extra. Note that PCI device counts vary depending on the
> disk/NIC type: from 1 PCI slot per disk for virtio-blk-pci, to 1 PCI
> slot per 7 disks for lsi, and 1 PCI slot per 255 * 16383 disks for
> virtio-scsi-pci, so it's really hard to pick a good default for
> pre-reserved PCI slots.
>
> I still believe that the best approach would be to manage the SCSI
> bus(es) exclusively, but let QEMU manage hotplugging on the PCI bus
> and reconstruct the layout during live migration on the receiving
> side, as outlined in [1]. This would do away with partitioning the
> PCI slot space and would also guarantee maximum flexibility and
> robustness regardless of what the user might pass via kvm_extra or
> what future QEMU machine models might place on the PCI bus.
>
> [1] https://groups.google.com/forum/#!searchin/ganeti-devel/hotplugging/ganeti-devel/8zvzAfQyoxU/B7_fcbr2-kcJ
>

I see that back then you considered partitioning the PCI space a
"robust workaround". Copying from [1]:

"The idea is to keep manual assignments for NICs and disks, but
partition the PCI space in two areas: one managed by qemu and one
managed by the Ganeti's hotplug code."

I agree that partitioning the PCI space may lead to underutilization
in extreme cases. However, as you also mentioned in [1], the "correct
solution" of reconstructing all PCI state during migration would
require significant effort, and would lead to major refactoring and
less homogeneous code.
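
On your point about always querying the PCI bus: that is indeed easy
to do over QMP. A rough sketch (the QMP client and its Execute()
method are modeled after Ganeti's QmpConnection, but treat the names
as illustrative):

# query-pci returns one entry per PCI bus, each with a "devices"
# list; there is no SCSI equivalent, hence the runtime-file approach
# for the SCSI bus.
def occupied_pci_slots(qmp):
    """Return the set of occupied slots on the first PCI bus."""
    buses = qmp.Execute("query-pci")
    return set(dev["slot"] for dev in buses[0]["devices"])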
> > +
> > +
> > +What info should be kept in runtime files?
> > +
> > +Runtime files are used for instance migration (to run a QEMU
> > +process on the target node with the same configuration) and for
> > +hotplug actions (to update the configuration of a running
> > +instance so that it can be migrated). Until now we were using
> > +devices only on the PCI bus, so only each device's PCI slot had
> > +to be kept in the runtime file. This is obviously not enough
> > +anymore. We decide to replace the `pci` slot of Disk and NIC
> > +configuration objects with an `hvinfo` dict. It will contain all
> > +necessary info for constructing the appropriate -device QEMU
> > +option. Specifically, the `driver`, `id`, and `bus` parameters
> > +will be present for all kinds of devices. PCI devices will have
> > +the `addr` parameter, and SCSI devices will have `channel`,
> > +`scsi-id`, and `lun`. NICs and Disks will have the extra `netdev`
> > +and `drive` parameters correspondingly.
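
As an illustration, the hvinfo of a SCSI disk could look like this
(the values are made-up examples; the key set follows the design
above):

hvinfo = {
    "driver": "scsi-hd",
    "id": "disk-9f3b54c2",
    "bus": "scsi.0",
    "channel": 0,
    "scsi-id": 0,
    "lun": 0,
    "drive": "disk-9f3b54c2",  # the -drive entry the device attaches to
}
# ...which translates to:
# -device scsi-hd,id=disk-9f3b54c2,drive=disk-9f3b54c2,bus=scsi.0,
#         channel=0,scsi-id=0,lun=0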
> > +
> > +
> > +How to deal with existing instances?
> > +
> > +Only existing instances with paravirtual devices (configured via
> > +the disk_type and nic_type hvparams) use the latest QEMU device
> > +model. Only these have the `pci` slot filled. We will use the
> > +existing _UpgradeSerializedRuntime() method to migrate the old
> > +runtime format with a `pci` slot in Disk and NIC configuration
> > +objects to the new one with `hvinfo` instead. The new hvinfo will
> > +contain the old driver (either virtio-blk-pci or virtio-net-pci),
> > +the old id (hotdisk-123456-pci-4), the default PCI bus (pci.0),
> > +and the old PCI slot (addr=4). This way those devices will still
> > +be hotplug-able, and the instance will still be migrate-able.
> > +When those instances are rebooted, the hvinfo will be
> > +re-generated.
> > +
> > +
> > +How to support downgrades?
> > +
> > +There are two possible ways, both not very pretty. The first one
> > +is to use _UpgradeSerializedRuntime() to remove the hvinfo slot.
> > +This would require patching all Ganeti versions down to 2.10,
> > +which is practically impossible. Another way is to ssh to all
> > +nodes and remove this slot upon a cluster downgrade. This ugly
> > +hack would go away in 2.17, since we support downgrades only to
> > +the previous minor version.
>
> IIRC, you don't need to do anything special for downgrades, as long
> as you don't add an array member to the top-level array in the
> runtime (as was done in 2.10 to introduce hotplugging). An
> additional dict field should be ignored by previous versions. IOW,
> if you just keep the old PCI-based IDs in place, there should be no
> problems, but I'm not 100% sure about that.
>

_AnalyzeSerializedRuntime() uses the FromDict() method of the Disk
and NIC objects. Such objects would not have an hvinfo dict on
versions < 2.16. So, if a VM started with 2.16 (including 'hvinfo' in
its runtime file) needs to be migrated after a downgrade, migration
will fail while reading the runtime file. Riba suggested the
workaround; we know it is ugly, but it will go away in 2.17.

> > +
> > +
> > +Configuration changes
> > +---------------------
> > +
> > +The ``NIC`` and ``Disk`` objects get one extra slot: ``hvinfo``.
> > +It is hypervisor-specific and will never reach config.data. In
> > +case of the KVM hypervisor it will contain all necessary info for
> > +constructing the -device QEMU option. Existing entries in runtime
> > +files that had a `pci` slot will be upgraded to have the
> > +corresponding `hvinfo` (see above).
> > +
> > +The new `scsi_controller_type` hvparam is added to denote what
> > +type of SCSI controller should be added to the PCI bus if we have
> > +a SCSI disk. Allowed values will be `lsi`, `virtio-scsi-pci`, and
> > +`megasas`. We decide to use `lsi` by default, since this is the
> > +one that QEMU adds automatically if not specified explicitly by
> > +an option.
> > +
> > +
> > +Hypervisor changes
> > +------------------
> > +
> > +The current implementation verifies whether a hotplug action has
> > +succeeded by scanning the PCI bus and searching for a specific
> > +device ID. This will change, and we will use the ``query-block``
> > +QMP command along with ``query-pci`` to find block devices that
> > +are attached to the SCSI bus as well.
> > +
> > +Up until now, if the `disk_type` hvparam was set to `scsi`, QEMU
> > +would use the deprecated device model and end up using SCSI
> > +emulation, e.g.:
> > +
> > +  ::
> > +
> > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=scsi,format=raw
> > +
> > +Now the equivalent, which will also enable hotplugging, will be
> > +to set disk_type to `scsi-hd`. The QEMU command line will
> > +include:
> > +
> > +  ::
> > +
> > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=none,format=raw,id=hotdisk-12345
> > +    -device scsi-hd,id=hotdisk-12345,drive=hotdisk-12345,bus=scsi.0,channel=0,scsi-id=0,lun=0
>
> Can we drop the `hot' part from `hotdisk'? This will allow us to use
> a couple more bits from the UUID, reducing the chances of a
> collision. I would go as far as using plain `d-' for disks and `n-'
> for NICs.
>

OK, I have already removed the 'hot' part, and will add the next two
parts of the UUID, so the chance of a collision is reduced
significantly. I prefer to keep the 'disk-' and 'nic-' prefixes for
readability.

To sum up, I will make the following modifications to the current
design doc:

 - Reword the section "How to support more than 16 devices?"
 - Update the section "How to identify each device?"
 - Update the examples with the proper device IDs

After this patch has been merged, and when you have time, I would
propose that you amend the design doc with a "future work" section,
so that your thoughts are persisted for anyone willing to contribute
in the future.

Thanks,
dimara

> Thanks,
> Apollon
>
> P.S.: I will be AFK for the next week or so, so I won't be able to
> provide much feedback.