On 12:23 Thu 08 Oct, Dimitris Aragiorgis wrote:
> Hi Apollon,
> 
> * Apollon Oikonomopoulos <[email protected]> [2015-10-02 12:36:44 +0300]:
> 
> > Hi Dimitris,
> > 
> > Thanks a lot for this work! Please see comments inline.
> > 
> > On 19:52 Thu 01 Oct, Dimitris Aragiorgis wrote:
> > > This is a design document detailing the refactoring of device
> > > handling in the KVM Hypervisor. More specifically, it will use the
> > > latest QEMU device model and modify the current hotplug implementation
> > > so that both PCI and SCSI devices can be managed.
> > > 
> > > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > > ---
> > >  Makefile.am             |    1 +
> > >  doc/design-draft.rst    |    1 +
> > >  doc/design-scsi-kvm.rst |  239 +++++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 241 insertions(+)
> > >  create mode 100644 doc/design-scsi-kvm.rst
> > > 
> > > diff --git a/Makefile.am b/Makefile.am
> > > index a6b9fbd..4506289 100644
> > > --- a/Makefile.am
> > > +++ b/Makefile.am
> > > @@ -717,6 +717,7 @@ docinput = \
> > >   doc/design-reservations.rst \
> > >   doc/design-resource-model.rst \
> > >   doc/design-restricted-commands.rst \
> > > + doc/design-scsi-kvm.rst \
> > >   doc/design-shared-storage.rst \
> > >   doc/design-shared-storage-redundancy.rst \
> > >   doc/design-ssh-ports.rst \
> > > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > > index 3f2a1f3..83a392c 100644
> > > --- a/doc/design-draft.rst
> > > +++ b/doc/design-draft.rst
> > > @@ -25,6 +25,7 @@ Design document drafts
> > >     design-configlock.rst
> > >     design-multi-storage-htools.rst
> > >     design-repaird.rst
> > > +   design-scsi-kvm.rst
> > >  
> > >  .. vim: set textwidth=72 :
> > >  .. Local Variables:
> > > diff --git a/doc/design-scsi-kvm.rst b/doc/design-scsi-kvm.rst
> > > new file mode 100644
> > > index 0000000..30cc7de
> > > --- /dev/null
> > > +++ b/doc/design-scsi-kvm.rst
> > > @@ -0,0 +1,239 @@
> > > +==========
> > > +KVM + SCSI
> > > +==========
> > > +
> > > +.. contents:: :depth: 4
> > > +
> > > +This is a design document detailing the refactoring of device
> > > +handling in the KVM Hypervisor. More specifically, it will use
> > > +the latest QEMU device model and modify the hotplug implementation
> > > +so that both PCI and SCSI devices can be managed.
> > > +
> > > +
> > > +Current state and shortcomings
> > > +==============================
> > > +
> > > +Ganeti currently supports SCSI virtual devices in the KVM hypervisor by
> > > +setting the `disk_type` hvparam to `scsi`. Ganeti then instructs QEMU
> > > +to use the deprecated device model (i.e., -drive if=scsi), which
> > > +exposes the backing store as an emulated SCSI device. This means that
> > > +SCSI pass-through is currently not supported.
> > > +
> > > +On the other hand, the current hotplug implementation
> > > +(:doc:`design-hotplug`) uses the latest QEMU device model (via the
> > > +-device option) and is tailored to paravirtual devices, which leads
> > > +to buggy behavior: if we hotplug a disk to an instance configured
> > > +with the disk_type=scsi hvparam, the disk that actually gets
> > > +hot-plugged is a VirtIO device (i.e., virtio-blk-pci) on the PCI bus.
> > > +
> > > +The current way of constructing the QEMU command line is error-prone:
> > > +an instance might not be able to boot due to PCI slot congestion.
> > > +
> > > +
> > > +Proposed changes
> > > +================
> > > +
> > > +We change the way the KVM hypervisor handles block devices by
> > > +introducing the latest QEMU device model for SCSI devices as well, so
> > > +that the scsi-cd, scsi-hd, scsi-block, and scsi-generic device
> > > +drivers are supported too. Additionally, we refactor the hotplug
> > > +implementation to support hotplugging of SCSI devices. Finally, we
> > > +change the way we keep track of device info inside runtime files, and
> > > +the way we place each device upon instance startup.
> > > +
> > > +Design decisions
> > > +================
> > > +
> > > +How to identify each device?
> > > +
> > > +Currently KVM does not support arbitrary IDs for devices: an ID must
> > > +start with a letter, be at most 32 characters long, and may contain
> > > +only the '.', '_', and '-' special characters. We currently generate
> > > +an ID with the following format: <device type>-<part of uuid>-pci-<slot>.
> > > +This assumes that the device will be plugged into a certain slot on
> > > +the PCI bus. Since we want to support devices on a SCSI bus too, and
> > > +since adding the PCI slot to the ID is redundant, we keep only the
> > > +first two parts of the existing ID.
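> > > +
> > > +A minimal sketch of the new, bus-agnostic ID generation (the helper
> > > +name is hypothetical, not necessarily the final implementation):
> > > +
> > > +  ::
> > > +
> > > +    def _GenerateDeviceID(dev_type, uuid):
> > > +      # e.g. ("hotdisk", "123456-...") -> "hotdisk-123456"
> > > +      return "%s-%s" % (dev_type, uuid.split("-")[0])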
> > > +
> > > +
> > > +Which buses does the guest eventually see?
> > > +
> > > +By default QEMU starts with a single PCI bus named "pci.0". If a SCSI
> > > +controller is added to this bus, a SCSI bus is created with the
> > > +corresponding name, "scsi.0", and any SCSI disks are attached to it.
> > > +Currently Ganeti does not explicitly add a SCSI controller via a
> > > +command-line option, but lets QEMU add one automatically if needed.
> > > +Here, in case we have a SCSI disk, a SCSI controller is explicitly
> > > +added via the -device option. For the SCSI controller we do not
> > > +specify the PCI slot to use, but let QEMU find the first available
> > > +one (see below).
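> > > +
> > > +As an illustration (a sketch; the exact options Ganeti emits may
> > > +differ), explicitly adding a virtio-scsi controller would look like:
> > > +
> > > +  ::
> > > +
> > > +    -device virtio-scsi-pci,id=scsi
> > > +
> > > +after which QEMU names the controller's SCSI bus "scsi.0" and SCSI
> > > +disks can be attached to it with bus=scsi.0.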
> > > +
> > > +
> > > +What type of SCSI controller to use?
> > > +
> > > +QEMU uses the `lsi` controller by default. To make this configurable we
> > > +add a new hvparam, `scsi_controller_type`. The available types will be
> > > +`lsi`, `megasas`, and `virtio-scsi-pci`.
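> > > +
> > > +Assuming the standard hvparam syntax, the controller type could then
> > > +be chosen per instance with something like:
> > > +
> > > +  ::
> > > +
> > > +    gnt-instance modify -H scsi_controller_type=virtio-scsi-pci inst1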
> > > +
> > > +
> > > +Where to place the devices upon instance startup?
> > > +
> > > +By default QEMU boots with the first three slots of the default PCI
> > > +bus (pci.0) occupied: slot 0 for the Host bridge, slot 1 for the ISA
> > > +bridge, and slot 2 for the VGA controller. Which further slots get
> > > +occupied depends on the QEMU options passed on the command line.
> > 
> > Note that this is true for the `pc' and `pc-i440fx-*' machine models.  
> > However, the `q35' machine model emulates an ICH9-based motherboard that 
> > also has an additional multifunction device on PCI slot 31, hosting an
> > ISA bridge, a SATA controller and an SMBus adapter.
> > 
> 
> 
> I looked into the issue you mentioned with the q35 machine type a bit
> more. Currently, even before this patch, Ganeti cannot start a q35
> machine with paravirtual devices.
> Apparently no Ganeti user has used the q35 machine model seriously yet :)
> 
> The problem is that q35 defines a *PCI Express* bus, not a PCI bus.
> The current code assumes the presence of a PCI bus (pci.0).
> So this discussion that started at GanetiCon seems irrelevant.
> 
> It is not a matter of reserving PCI slot 31; q35 machines expose
> a PCI Express bus (pcie.0) that does not even support hotplugging:
> http://www.linux-kvm.org/page/PCITodo
> 
> So handling the q35 machine type is another story, and certainly
> outside the scope of this patch set.
> 
> After this patch set has been merged, a separate patch can explore
> supporting the q35 machine type. This however may prove more
> complicated than it looks:
> 
> 1. The pcie.0 bus is not hotpluggable so we cannot hotplug devices
>    directly onto it.
> 
> 2. One could work around this by ensuring a PCI bus (pci.0) is always
>    available:
>    pcie.0 -> dmi-to-pci-bridge -> pci-to-pci-bridge (this will be pci.0).
>    This would allow the machine to boot but still wouldn't solve the
>    hotplug problem. PCI bridges do not yet support hotplugging in QEMU.
>    Libvirt seems to follow this approach.

Good, I didn't go that far with my research :-) I guess for the time 
being we need to add an erratum indicating that Ganeti versions with 
hotplugging support do not support the Q35 machine model as of qemu 2.4.

> > > +The main reason that we want to be fully aware of the configuration of a
> > > +running instance (machine type, PCI and SCSI bus state, devices, etc.)
> > > +is that in case of migration a QEMU process with the exact same
> > > +configuration should be created on the target node. The configuration is
> > > +kept in the runtime file created just before starting the instance.
> > > +Since hotplug has been introduced, the only thing that can change after
> > > +starting an instance is the configuration related to NICs and Disks.
> > > +
> > > +Before implementing hotplug, Ganeti did not specify PCI slots
> > > +explicitly, but let QEMU decide how to place the devices on the
> > > +corresponding bus. This does not work if we want to have hotplug-able
> > > +devices and migrate-able VMs. Currently, upon runtime file creation, we
> > > +try to reserve PCI slots based on the hvparams, the disks, and the NICs
> > > +of the instance. This has three major shortcomings: first, we have
> > > +to be aware of which options modify the PCI bus, which is practically
> > > +impossible given the huge number of QEMU options; second, QEMU may
> > > +change the default PCI configuration from version to version; and
> > > +third, we cannot know whether the extra options passed by the user
> > > +via the `kvm_extra` hvparam modify the PCI bus.
> > > +
> > > +All of the above makes the current implementation error-prone: an
> > > +instance might not be able to boot if we explicitly add a NIC/Disk on
> > > +a specific PCI slot that QEMU has already used for another device
> > > +while parsing its command-line options. Besides that, since we now
> > > +want to use the SCSI bus as well, the above mechanism is
> > > +insufficient. Here, we decide to place only disks and NICs on
> > > +specific slots of the corresponding bus, and let QEMU place
> > > +everything else automatically. To this end, we let the first 12 PCI
> > > +slots be managed by QEMU, and start adding PCI devices (VirtIO block
> > > +and network devices) from the 13th slot onwards. As far as the SCSI
> > > +bus is concerned, we put each SCSI disk on a different scsi-id (which
> > > +corresponds to a different target number in SCSI terminology). The
> > > +SCSI bus will not have any default reservations.
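> > > +
> > > +A minimal sketch of this placement policy (helper and constant names
> > > +are illustrative, not the final code):
> > > +
> > > +  ::
> > > +
> > > +    _DEFAULT_PCI_RESERVATIONS = 12  # slots 0-11 are left to QEMU
> > > +
> > > +    def _GetFreePciSlot(used_slots):
> > > +      # Pick the first Ganeti-managed slot (the 13th onwards) that is
> > > +      # not already occupied by a disk or NIC.
> > > +      for slot in range(_DEFAULT_PCI_RESERVATIONS, 32):
> > > +        if slot not in used_slots:
> > > +          return slot
> > > +      raise RuntimeError("PCI bus is full")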
> > 
> > Please note that different SCSI controllers have different device 
> > limits: IIRC the lsi controller supports up to 7 targets with a single 
> > LUN per target, while virtio-scsi-pci supports up to 255 targets with 
> > 16383 LUNs each. This means that we (currently) cannot offer the 
> > theoretical maximum of 16 disks (constants.MAX_DISKS) with a single lsi 
> > controller.
> > 
> 
> The current situation is that QEMU uses an lsi controller implicitly.
> So we are bound by the limit of 7 SCSI disks anyway.
> 
> This patch increases the number of usable SCSI disks up to the hard
> limit imposed by constants.MAX_DISKS. The only requirement is that the
> user actually sets scsi_controller_type to a value that can support
> that many disks, i.e., anything other than lsi.

And this should be documented, thanks!

> > > +
> > > +
> > > +How to support more than 16 PCI devices?
> > > +
> > > +In order to be able to have more than 16 devices on the PCI bus, we
> > > +add the new `default_pci_reservations` hvparam to denote how many PCI
> > > +slots QEMU will handle implicitly. The rest will be available for
> > > +disks and NICs inserted explicitly by Ganeti. By default, the number
> > > +of reserved PCI slots will be 12, as explained above.
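> > > +
> > > +Assuming the standard hvparam syntax, a user who needs more room for
> > > +options passed via kvm_extra could, for example, raise the number of
> > > +QEMU-managed slots:
> > > +
> > > +  ::
> > > +
> > > +    gnt-instance modify -H default_pci_reservations=20 inst1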
> > 
> > It's not clear to me why 16 is special here. 32 - 12 = 20, and what 
> > might be of interest for us is being able to hotplug 24 devices (16 
> > (MAX_DISKS using virtio-blk-pci) + 8 (MAX_NICS)).
> > 
> 
> 16 default reservations was the choice of the first implementation.
> This will change, since Riba suggested (and I agreed since this is now
> configurable) to go with 12 default PCI reservations.
> So I will rename this section to "How to support the theoretical maximum
> of devices, 16 disks and 8 NICs?" and mention the usage of
> kvm_pci_reservations.
> 
> > > +How to keep track of the bus state of a running instance?
> > > +
> > > +To be able to hotplug a device, we need to know which slot is
> > > +available on the desired bus. Until now, we were using the
> > > +``query-pci`` QMP command, which returns the state of the PCI buses
> > > +(i.e., which devices occupy which slots). Unfortunately, there is no
> > > +equivalent for SCSI buses. We could use the ``info qtree`` HMP
> > > +command, which dumps the whole device tree in plain text, but this
> > > +makes it really hard to parse. So we decide to reconstruct the bus
> > > +state of a running instance from our local runtime files.
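> > > +
> > > +A sketch of how the occupied slots could be reconstructed from the
> > > +runtime entries (the helper is hypothetical; the field names follow
> > > +the `hvinfo` layout described below):
> > > +
> > > +  ::
> > > +
> > > +    def _GetOccupiedSlots(devices):
> > > +      pci_slots, scsi_ids = set(), set()
> > > +      for dev in devices:
> > > +        if "addr" in dev.hvinfo:        # a PCI device
> > > +          pci_slots.add(dev.hvinfo["addr"])
> > > +        elif "scsi-id" in dev.hvinfo:   # a SCSI device
> > > +          scsi_ids.add(dev.hvinfo["scsi-id"])
> > > +      return pci_slots, scsi_ids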
> > 
> > You don't need to treat all buses the same way, as they have some 
> > fundamental differences: the PCI bus is de-facto shared with the QEMU 
> > process (and the user/admin via kvm_extra), however it is safe and 
> > reasonable to assume that a SCSI bus on a controller plugged by us is 
> > exclusively owned by us. That said, it is reasonable to store SCSI IDs, 
> > but you should always query the PCI bus for robustness.
> > 
> > As discussed in Ganeticon and as I wrote earlier[1], PCI bus management 
> > has a number of caveats we should be aware of. I'm not particularly fond 
> > of partitioning the PCI space, mostly because it is not robust and it 
> > also leads to underutilization: we have to reserve space for things we 
> > will likely never use, and let the user work her way around this by 
> > adjusting an additional HV parameter (default_pci_reservations) if she 
> > wants to do something fancy with kvm_extra. Note that the number of
> > PCI slots consumed varies with the disk/NIC type, from 1 PCI slot per
> > disk for virtio-blk-pci, to 1 PCI slot per 7 disks for lsi and per
> > 255 * 16383 disks for virtio-scsi-pci, so it's really hard to pick a
> > good default for pre-reserved PCI slots.
> > 
> > I still believe that the best approach would be to manage the SCSI 
> > bus(es) exclusively, but let QEMU manage hotplugging on the PCI bus and 
> > reconstruct the layout during live migration on the receiving side, as 
> > outlined in [1]. This would do away with partitioning the PCI slot space 
> > and would also guarantee maximum flexibility and robustness regardless 
> > of what the user might pass via kvm_extra or what future qemu machine 
> > models might place on the PCI bus.
> > 
> > [1] 
> > https://groups.google.com/forum/#!searchin/ganeti-devel/hotplugging/ganeti-devel/8zvzAfQyoxU/B7_fcbr2-kcJ
> > 
> 
> I see that back then, you considered partitioning the PCI space a
> "robust workaround". Copying from [1]: "The idea is to keep manual
> assignments for NICs and disks, but partition the PCI space in two
> areas: one managed by qemu and one managed by the Ganeti's hotplug
> code."

Yes, but a workaround is a workaround, not a correct solution. Anyway, 
please proceed as you see fit; at this point, both solutions will
improve the current situation.

> 
> I agree that partitioning the PCI space may lead to underutilization
> in extreme cases. However, as you also mentioned in [1], the "correct
> solution" of reconstructing all PCI state during migration would
> require significant effort, and would lead to major refactoring and
> less homogeneous code.

I don't recall saying anything about code homogeneity and refactoring.
I'm just pointing out the drawbacks of each solution, so that we are all 
aware of their limitations in advance.

> 
> > > +
> > > +What info should be kept in runtime files?
> > > +
> > > +Runtime files are used for instance migration (to run a QEMU process on
> > > +the target node with the same configuration) and for hotplug actions (to
> > > +update the configuration of a running instance so that it can be
> > > +migrated). Until now we were using devices only on the PCI bus, so
> > > +keeping each device's PCI slot in the runtime file was enough. This
> > > +is obviously no longer the case. We decide to replace the `pci` slot
> > > +of the Disk and NIC configuration objects with an `hvinfo` dict. It
> > > +will contain all necessary info for constructing the appropriate
> > > +-device QEMU option. Specifically, the `driver`, `id`, and `bus`
> > > +parameters will be present for all kinds of devices. PCI devices
> > > +will have the `addr` parameter; SCSI devices will have `channel`,
> > > +`scsi-id`, and `lun`. NICs and Disks will have the extra `netdev`
> > > +and `drive` parameters respectively.
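> > > +
> > > +For illustration, the `hvinfo` of a SCSI disk could look roughly like
> > > +this (the values are made up):
> > > +
> > > +  ::
> > > +
> > > +    {
> > > +      "driver": "scsi-hd",
> > > +      "id": "hotdisk-123456",
> > > +      "bus": "scsi.0",
> > > +      "channel": 0,
> > > +      "scsi-id": 0,
> > > +      "lun": 0,
> > > +      "drive": "hotdisk-123456",
> > > +    }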
> > > +
> > > +
> > > +How to deal with existing instances?
> > > +
> > > +Only existing instances with paravirtual devices (configured via the
> > > +disk_type and nic_type hvparams) use the latest QEMU device model,
> > > +and only these have the `pci` slot filled. We will use the existing
> > > +_UpgradeSerializedRuntime() method to migrate the old runtime format,
> > > +with a `pci` slot in the Disk and NIC configuration objects, to the
> > > +new one with `hvinfo` instead. The new hvinfo will contain the old
> > > +driver (either virtio-blk-pci or virtio-net-pci), the old id (e.g.,
> > > +hotdisk-123456-pci-4), the default PCI bus (pci.0), and the old PCI
> > > +slot (addr=4). This way those devices will still be hotplug-able, and
> > > +the instance will still be migrate-able. When those instances are
> > > +rebooted, the hvinfo will be re-generated.
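> > > +
> > > +A rough sketch of that upgrade step (the helper name is hypothetical;
> > > +the real logic would live in _UpgradeSerializedRuntime()):
> > > +
> > > +  ::
> > > +
> > > +    def _PciToHvinfo(dev, driver, old_id):
> > > +      # driver is virtio-blk-pci or virtio-net-pci; old_id is the
> > > +      # existing ID, e.g. "hotdisk-123456-pci-4". The old PCI slot is
> > > +      # kept as `addr` so the device stays where it was.
> > > +      dev.hvinfo = {
> > > +        "driver": driver,
> > > +        "id": old_id,
> > > +        "bus": "pci.0",
> > > +        "addr": dev.pci,
> > > +      }
> > > +      dev.pci = None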
> > > +
> > > +
> > > +How to support downgrades?
> > > +
> > > +There are two possible ways, neither of them very pretty. The first
> > > +is to use _UpgradeSerializedRuntime() to remove the hvinfo slot. This
> > > +would require patching all Ganeti versions down to 2.10, which is
> > > +practically impossible. The other is to ssh to all nodes and remove
> > > +this slot upon a cluster downgrade. This ugly hack would go away in
> > > +2.17, since we support downgrades only to the previous minor version.
> > 
> > IIRC, you don't need to do anything special for downgrades, as long as 
> > you don't add an array member to the top-level array in the runtime (as 
> > was done in 2.10 to introduce hotplugging). An additional dict field 
> > should be ignored by previous versions. IOW, if you just keep the old 
> > PCI-based IDs in place, there should be no problems, but I'm not 100% 
> > sure about that.
> > 
> 
> _AnalyzeSerializedRuntime() uses the FromDict() method of Disk and NIC
> objects. Such objects would not have an hvinfo dict on versions < 2.16.
> So, if a VM started with 2.16 (including 'hvinfo' in its runtime file)
> needs to be migrated after a downgrade, the migration will fail while
> reading the runtime file. Riba suggested the workaround; we know it is
> ugly, but it will go away in 2.17.
> 
> > > +
> > > +
> > > +Configuration changes
> > > +---------------------
> > > +
> > > +The ``NIC`` and ``Disk`` objects get one extra slot: ``hvinfo``. It
> > > +is hypervisor-specific and will never reach config.data. In the case
> > > +of the KVM hypervisor, it will contain all necessary info for
> > > +constructing the -device QEMU option. Existing entries in runtime
> > > +files that had a `pci` slot will be upgraded to have the
> > > +corresponding `hvinfo` (see above).
> > > +
> > > +The new `scsi_controller_type` hvparam is added to denote what type
> > > +of SCSI controller should be added to the PCI bus if we have a SCSI
> > > +disk. Allowed values will be `lsi`, `virtio-scsi-pci`, and `megasas`.
> > > +We decide to use `lsi` by default, since this is the one QEMU adds
> > > +automatically if none is specified explicitly by an option.
> > > +
> > > +
> > > +Hypervisor changes
> > > +------------------
> > > +
> > > +The current implementation verifies whether a hotplug action has
> > > +succeeded by scanning the PCI bus and searching for a specific device
> > > +ID. This will change: we will use the ``query-block`` QMP command
> > > +along with ``query-pci``, so that block devices attached to the SCSI
> > > +bus are found as well.
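> > > +
> > > +For example (a sketch, assuming a QMP connection object with an
> > > +Execute() method as used elsewhere in the KVM code), verifying that a
> > > +hotplugged SCSI disk showed up could look like:
> > > +
> > > +  ::
> > > +
> > > +    result = qmp.Execute("query-block")
> > > +    found = any(blk.get("device") == "hotdisk-123456" for blk in result)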
> > > +
> > > +Up until now, if the `disk_type` hvparam was set to `scsi`, QEMU
> > > +would use the deprecated device model and end up using SCSI
> > > +emulation, e.g.:
> > > +
> > > +  ::
> > > +
> > > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=scsi,format=raw
> > > +
> > > +Now the equivalent, which will also enable hotplugging, will be to set
> > > +disk_type to `scsi-hd`. The QEMU command line will include:
> > > +
> > > +  ::
> > > +
> > > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=none,format=raw,id=hotdisk-12345
> > > +    -device scsi-hd,id=hotdisk-12345,drive=hotdisk-12345,bus=scsi.0,channel=0,scsi-id=0,lun=0
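> > > +
> > > +Hotplugging such a disk would then boil down to something like the
> > > +following monitor commands (a sketch of the general flow, not the
> > > +exact invocations Ganeti will issue):
> > > +
> > > +  ::
> > > +
> > > +    drive_add dummy file=/var/run/ganeti/instance-disks/test:1,if=none,format=raw,id=hotdisk-67890
> > > +    device_add scsi-hd,id=hotdisk-67890,drive=hotdisk-67890,bus=scsi.0,scsi-id=1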
> > 
> > Can we drop the `hot' part from `hotdisk'? This will allow us to use a 
> > couple more bits from the UUID, reducing the chances of a collision. I 
> > would go as far as using plain `d-' for disks and `n-' for NICs.
> > 
> 
> 
> OK, I have already removed the 'hot' part, and will add the next 2 parts
> of the UUID so the chance of collisions gets reduced significantly.
> 
> I prefer to leave the 'disk-' and 'nic-' prefixes for reasons of
> readability.

Ack, thanks.

> 
> To sum up, I will make the following modifications to the current design doc:
> 
>  - Reword the section "How to support more than 16 devices?"
>  - Update the section "How to identify each device?"
>  - Update the examples with the proper device IDs
> 
> After this patch has been merged, and once you have time, I would
> propose that you amend the design doc with a "future work" section, so
> that your thoughts are preserved for anyone willing to contribute in
> the future.
> 
> Thanks,
> dimara

LGTM, thanks.
