Confirming the LGTM as well.

On Mon, Oct 12, 2015 at 8:17 AM, Apollon Oikonomopoulos <[email protected]> wrote:

> On 12:23 Thu 08 Oct     , Dimitris Aragiorgis wrote:
> > Hi Apollon,
> >
> > * Apollon Oikonomopoulos <[email protected]> [2015-10-02 12:36:44 +0300]:
> >
> > > Hi Dimitris,
> > >
> > > Thanks a lot for this work! Please see comments inline.
> > >
> > > On 19:52 Thu 01 Oct     , Dimitris Aragiorgis wrote:
> > > > This is a design document detailing the refactoring of device
> > > > handling in the KVM Hypervisor. More specifically, it will use the
> > > > latest QEMU device model and modify the current hotplug
> > > > implementation so that both PCI and SCSI devices can be managed.
> > > >
> > > > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > > > ---
> > > >  Makefile.am             |    1 +
> > > >  doc/design-draft.rst    |    1 +
> > > >  doc/design-scsi-kvm.rst |  239 +++++++++++++++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 241 insertions(+)
> > > >  create mode 100644 doc/design-scsi-kvm.rst
> > > >
> > > > diff --git a/Makefile.am b/Makefile.am
> > > > index a6b9fbd..4506289 100644
> > > > --- a/Makefile.am
> > > > +++ b/Makefile.am
> > > > @@ -717,6 +717,7 @@ docinput = \
> > > >   doc/design-reservations.rst \
> > > >   doc/design-resource-model.rst \
> > > >   doc/design-restricted-commands.rst \
> > > > + doc/design-scsi-kvm.rst \
> > > >   doc/design-shared-storage.rst \
> > > >   doc/design-shared-storage-redundancy.rst \
> > > >   doc/design-ssh-ports.rst \
> > > > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > > > index 3f2a1f3..83a392c 100644
> > > > --- a/doc/design-draft.rst
> > > > +++ b/doc/design-draft.rst
> > > > @@ -25,6 +25,7 @@ Design document drafts
> > > >     design-configlock.rst
> > > >     design-multi-storage-htools.rst
> > > >     design-repaird.rst
> > > > +   design-scsi-kvm.rst
> > > >
> > > >  .. vim: set textwidth=72 :
> > > >  .. Local Variables:
> > > > diff --git a/doc/design-scsi-kvm.rst b/doc/design-scsi-kvm.rst
> > > > new file mode 100644
> > > > index 0000000..30cc7de
> > > > --- /dev/null
> > > > +++ b/doc/design-scsi-kvm.rst
> > > > @@ -0,0 +1,239 @@
> > > > +==========
> > > > +KVM + SCSI
> > > > +==========
> > > > +
> > > > +.. contents:: :depth: 4
> > > > +
> > > > +This is a design document detailing the refactoring of device
> > > > +handling in the KVM Hypervisor. More specifically, it will use
> > > > +the latest QEMU device model and modify the hotplug implementation
> > > > +so that both PCI and SCSI devices can be managed.
> > > > +
> > > > +
> > > > +Current state and shortcomings
> > > > +==============================
> > > > +
> > > > +Ganeti currently supports SCSI virtual devices in the KVM hypervisor
> > > > +by setting the `disk_type` hvparam to `scsi`. Ganeti will then
> > > > +instruct QEMU to use the deprecated device model (i.e. -drive
> > > > +if=scsi), which will expose the backing store as an emulated SCSI
> > > > +device. This means that currently SCSI pass-through is not supported.
> > > > +
> > > > +On the other hand, the current hotplug implementation (see
> > > > +:doc:`design-hotplug`) uses the latest QEMU device model (via the
> > > > +-device option) and is tailored to paravirtual devices, which leads
> > > > +to buggy behavior: if we hotplug a disk to an instance configured
> > > > +with the disk_type=scsi hvparam, the hot-plugged disk will end up
> > > > +being a VirtIO device (i.e., virtio-blk-pci) on the PCI bus.
> > > > +
> > > > +The current implementation of creating the QEMU command line is
> > > > +error-prone, since an instance might not be able to boot due to PCI
> > > > +slot congestion.
> > > > +
> > > > +
> > > > +Proposed changes
> > > > +================
> > > > +
> > > > +We change the way that the KVM hypervisor handles block devices by
> > > > +introducing the latest QEMU device model for SCSI devices as well, so
> > > > +that the scsi-cd, scsi-hd, scsi-block, and scsi-generic device
> > > > +drivers are supported too. Additionally, we refactor the hotplug
> > > > +implementation in order to support hotplugging of SCSI devices too.
> > > > +Finally, we change the way we keep track of device info inside
> > > > +runtime files, and the way we place each device upon instance
> > > > +startup.
> > > > +
> > > > +Design decisions
> > > > +================
> > > > +
> > > > +How to identify each device?
> > > > +
> > > > +Currently KVM does not support arbitrary IDs for devices; it only
> > > > +accepts names that start with a letter, are at most 32 characters
> > > > +long, and contain only the '.', '_', and '-' special characters.
> > > > +Currently we generate an ID with the following format:
> > > > +<device type>-<part of uuid>-pci-<slot>. This assumes that the device
> > > > +will be plugged in a certain slot on the PCI bus. Since we want to
> > > > +support devices on a SCSI bus too, and adding the PCI slot to the ID
> > > > +is redundant, we keep only the first two parts of the existing ID.
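> > > > +
> > > > +A minimal sketch of the new ID scheme (the helper name and the exact
> > > > +UUID slice are illustrative assumptions, not the final
> > > > +implementation):
> > > > +
> > > > +  ::
> > > > +
> > > > +    def _GenerateDeviceId(dev_type, uuid):
> > > > +      # Keep only the device type and part of the UUID, e.g.
> > > > +      # "disk" + "6f5a..." -> "disk-6f5a". No bus-specific suffix,
> > > > +      # so the same ID works on both the PCI and the SCSI bus.
> > > > +      return "%s-%s" % (dev_type, uuid.split("-")[0])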
> > > > +
> > > > +
> > > > +Which buses does the guest eventually see?
> > > > +
> > > > +By default QEMU starts with a single PCI bus named "pci.0". In case
> > > > +a SCSI controller is added on this bus, a SCSI bus is created with
> > > > +the corresponding name: "scsi.0". Any SCSI disks will be attached to
> > > > +this SCSI bus. Currently Ganeti does not explicitly use a SCSI
> > > > +controller via a command line option, but lets QEMU add one
> > > > +automatically if needed. Here, in case we have a SCSI disk, a SCSI
> > > > +controller is explicitly added via the -device option. For the SCSI
> > > > +controller, we do not specify the PCI slot to use, but let QEMU find
> > > > +the first available one (see below).
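> > > > +
> > > > +For illustration, a sketch of how the controller option could be
> > > > +built (the id value and the resulting bus name are assumptions for
> > > > +the example):
> > > > +
> > > > +  ::
> > > > +
> > > > +    def _ScsiControllerOption(controller_type):
> > > > +      # e.g. "-device lsi,id=scsi"; QEMU then names the SCSI bus
> > > > +      # behind the controller "scsi.0". No addr= is given, so QEMU
> > > > +      # picks the first free PCI slot itself.
> > > > +      return "-device %s,id=scsi" % controller_type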
> > > > +
> > > > +
> > > > +What type of SCSI controller to use?
> > > > +
> > > > +QEMU uses the `lsi` controller by default. To make this configurable
> > > > +we add a new hvparam, `scsi_controller_type`. The available types
> > > > +will be `lsi`, `megasas`, and `virtio-scsi-pci`.
> > > > +
> > > > +
> > > > +Where to place the devices upon instance startup?
> > > > +
> > > > +By default QEMU boots with the first three slots of the default PCI
> > > > +bus (pci.0) occupied: slot 0 for the Host bridge, slot 1 for the ISA
> > > > +bridge, and slot 2 for the VGA controller. Thereafter, the slots
> > > > +depend on the QEMU options passed in the command line.
> > >
> > > Note that this is true for the `pc' and `pc-i440fx-*' machine models.
> > > However, the `q35' machine model emulates an ICH9-based motherboard
> > > that also has an additional multifunction device on PCI slot 31,
> > > hosting an ISA bridge, a SATA controller and an SMBus adapter.
> > >
> >
> >
> > I looked into the issue you mentioned with the q35 machine type a bit
> > more. Currently, even before this patch, Ganeti cannot start a q35
> > machine with paravirtual devices.
> > Apparently no Ganeti user has used the q35 machine model seriously yet :)
> >
> > The problem is that q35 defines a *PCI Express* bus, not a PCI bus.
> > The current code assumes the presence of a PCI bus (pci.0).
> > So this discussion that started at GanetiCon seems irrelevant.
> >
> > It is not a matter of reserving PCI slot 31; q35 machines expose
> > a PCI Express bus (pcie.0) that does not even support hotplugging:
> > http://www.linux-kvm.org/page/PCITodo
> >
> > So handling the q35 machine type is another story, and certainly
> > outside the scope of this patch set.
> >
> > After this patch set has been merged, a separate patch can explore
> > supporting the q35 machine type. This however may prove more
> > complicated than it looks:
> >
> > 1. The pcie.0 bus is not hotpluggable so we cannot hotplug devices
> >    directly onto it.
> >
> > 2. One could work around it by ensuring a PCI bus (pci.0) is always
> >    available:
> >    pcie.0 -> dmi-to-pci-bridge -> pci-to-pci-bridge (this will be pci.0).
> >    This would allow the machine to boot, but still wouldn't solve the
> >    hotplug problem: PCI bridges do not yet support hotplugging in QEMU.
> >    Libvirt seems to follow this approach (a rough sketch follows).
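> >
> >    For illustration, the bridge chain could look roughly like this on
> >    the QEMU command line (an untested sketch; device names as found in
> >    qemu 2.4):
> >
> >      -device i82801b11-bridge,id=dmi2pci,bus=pcie.0
> >      -device pci-bridge,id=pci.0,bus=dmi2pci,chassis_nr=1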
>
> Good, I didn't go that far with my research :-) I guess for the time
> being we need to add an erratum indicating that Ganeti versions with
> hotplugging support do not support the Q35 machine model as of qemu 2.4.
>
> > > > +The main reason that we want to be fully aware of the configuration
> > > > +of a running instance (machine type, PCI and SCSI bus state, devices,
> > > > +etc.) is that in case of migration a QEMU process with the exact same
> > > > +configuration should be created on the target node. The
> > > > +configuration is kept in the runtime file created just before
> > > > +starting the instance. Since hotplug has been introduced, the only
> > > > +thing that can change after starting an instance is the configuration
> > > > +related to NICs and Disks.
> > > > +
> > > > +Before implementing hotplug, Ganeti did not specify PCI slots
> > > > +explicitly, but let QEMU decide how to place the devices on the
> > > > +corresponding bus. This does not work if we want to have hotplug-able
> > > > +devices and migrate-able VMs. Currently, upon runtime file creation,
> > > > +we try to reserve PCI slots based on the hvparams, the disks, and
> > > > +the NICs of the instance. This has three major shortcomings: first,
> > > > +we have to know which options modify the PCI bus, which is
> > > > +practically impossible given the large number of QEMU options;
> > > > +second, QEMU may change the default PCI configuration from version
> > > > +to version; and third, we cannot know if the extra options passed by
> > > > +the user via the `kvm_extra` hvparam modify the PCI bus.
> > > > +
> > > > +All the above makes the current implementation error-prone: an
> > > > +instance might not be able to boot if we explicitly add a NIC/Disk
> > > > +on a specific PCI slot that QEMU has already used for another device
> > > > +while parsing its command line options. Besides that, we now want to
> > > > +use the SCSI bus as well, so the above mechanism is insufficient.
> > > > +Here, we decide to put only disks and NICs on specific slots on the
> > > > +corresponding bus, and let QEMU place everything else automatically.
> > > > +To this end, we let the first 12 PCI slots be managed by QEMU, and
> > > > +we start adding PCI devices (VirtIO block and network devices) from
> > > > +the 13th slot onwards. As far as the SCSI bus is concerned, we
> > > > +decide to put each SCSI disk on a different scsi-id (which
> > > > +corresponds to a different target number in SCSI terminology). The
> > > > +SCSI bus will not have any default reservations.
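> > > > +
> > > > +A minimal sketch of the slot-picking logic (the constant names and
> > > > +the 32-slot bus size are assumptions for the example):
> > > > +
> > > > +  ::
> > > > +
> > > > +    PCI_RESERVATIONS = 12  # slots 0-11 are left to QEMU
> > > > +    PCI_SLOTS = 32         # total slots on a PCI bus
> > > > +
> > > > +    def _NextFreePciSlot(occupied):
> > > > +      # First Ganeti-managed slot (>= 12) not already in use
> > > > +      for slot in range(PCI_RESERVATIONS, PCI_SLOTS):
> > > > +        if slot not in occupied:
> > > > +          return slot
> > > > +      raise Exception("No free PCI slot left")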
> > >
> > > Please note that different SCSI controllers have different device
> > > limits: IIRC the lsi controller supports up to 7 targets with a single
> > > LUN per target, while virtio-scsi-pci supports up to 255 targets with
> > > 16383 LUNs each. This means that we (currently) cannot offer the
> > > theoretical maximum of 16 disks (constants.MAX_DISKS) with a single lsi
> > > controller.
> > >
> >
> > The current situation is that QEMU uses an lsi controller implicitly.
> > So we are bound by the limit of 7 SCSI disks anyway.
> >
> > This patch increases the number of usable SCSI disks up to the hard
> > limit imposed by constants.MAX_DISKS. The only requirement is that the
> > user actually sets scsi_controller_type to a value which can support
> > that many disks, i.e., anything other than lsi.
>
> And this should be documented, thanks!
>
> > > > +
> > > > +
> > > > +How to support more than 16 PCI devices?
> > > > +
> > > > +In order to be able to have more than 16 devices on the PCI bus, we
> > > > +add the new `default_pci_reservations` hvparam to denote how many
> > > > +PCI slots QEMU will handle implicitly. The rest will be available
> > > > +for disks and NICs inserted explicitly by Ganeti. By default, 12 PCI
> > > > +slots will be reserved, as explained above.
> > >
> > > It's not clear to me why 16 is special here. 32 - 12 = 20, and what
> > > might be of interest for us is being able to hotplug 24 devices (16
> > > (MAX_DISKS using virtio-blk-pci) + 8 (MAX_NICS)).
> > >
> >
> > 16 default reservations was the choice of the first implementation.
> > This will change: Riba suggested (and I agreed, since this is now
> > configurable) going with 12 default PCI reservations.
> > So I will rename this section to "How to support the theoretical
> > maximum of devices, 16 disks and 8 NICs?" and mention the usage of
> > kvm_pci_reservations.
> >
> > > > +How to keep track of the bus state of a running instance?
> > > > +
> > > > +To be able to hotplug a device, we need to know which slot is
> > > > +available on the desired bus. Until now, we were using the
> > > > +``query-pci`` QMP command that returns the state of the PCI buses
> > > > +(i.e., which devices occupy which slots). Unfortunately, there is no
> > > > +equivalent for the SCSI buses. We could use the ``info qtree`` HMP
> > > > +command that practically dumps the whole device tree in plain text,
> > > > +which makes it really hard to parse. So we decide to generate the
> > > > +bus state of a running instance from our local runtime files.
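> > > > +
> > > > +A rough sketch of deriving the SCSI bus state from the runtime file
> > > > +(the hvinfo layout follows the section further below; the helper
> > > > +name is an assumption):
> > > > +
> > > > +  ::
> > > > +
> > > > +    def _OccupiedScsiIds(runtime_devices):
> > > > +      # scsi-ids already used by devices living on the SCSI bus
> > > > +      return set(dev.hvinfo["scsi-id"] for dev in runtime_devices
> > > > +                 if dev.hvinfo.get("bus") == "scsi.0")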
> > >
> > > You don't need to treat all buses the same way, as they have some
> > > fundamental differences: the PCI bus is de facto shared with the QEMU
> > > process (and the user/admin via kvm_extra); however, it is safe and
> > > reasonable to assume that a SCSI bus on a controller plugged by us is
> > > exclusively owned by us. That said, it is reasonable to store SCSI
> > > IDs, but you should always query the PCI bus for robustness.
> > >
> > > As discussed at GanetiCon and as I wrote earlier[1], PCI bus
> > > management has a number of caveats we should be aware of. I'm not
> > > particularly fond of partitioning the PCI space, mostly because it is
> > > not robust and it also leads to underutilization: we have to reserve
> > > space for things we will likely never use, and let the user work her
> > > way around this by adjusting an additional HV parameter
> > > (default_pci_reservations) if she wants to do something fancy with
> > > kvm_extra. Note that the PCI device count varies depending on the
> > > disk/NIC type, from 1 PCI slot per disk with virtio-blk-pci, to 1 PCI
> > > slot per 7 disks with lsi and per 255 * 16383 disks with
> > > virtio-scsi-pci, so it's really hard to pick a good default for
> > > pre-reserved PCI slots.
> > >
> > > I still believe that the best approach would be to manage the SCSI
> > > bus(es) exclusively, but let QEMU manage hotplugging on the PCI bus
> > > and reconstruct the layout during live migration on the receiving
> > > side, as outlined in [1]. This would do away with partitioning the
> > > PCI slot space and would also guarantee maximum flexibility and
> > > robustness regardless of what the user might pass via kvm_extra or
> > > what future qemu machine models might place on the PCI bus.
> > >
> > > [1] https://groups.google.com/forum/#!searchin/ganeti-devel/hotplugging/ganeti-devel/8zvzAfQyoxU/B7_fcbr2-kcJ
> > >
> >
> > I see that back then, you considered partitioning the PCI space a
> > "robust workaround". Copying from [1]: "The idea is to keep manual
> > assignments for NICs and disks, but partition the PCI space in two
> > areas: one managed by qemu and one managed by the Ganeti's hotplug
> > code."
>
> Yes, but a workaround is a workaround, not a correct solution. Anyway,
> please proceed as you see fit, at this point both solutions will improve
> the current situation.
>
> >
> > I agree that partitioning the PCI space may lead to underutilization
> > in extreme cases. However, as you also mentioned in [1], the "correct
> > solution" of reconstructing all PCI state during migration would
> > require significant effort, and would lead to major refactoring and
> > less homogeneous code.
>
> I don't recall saying anything about code homogeneity and refactoring.
> I'm just pointing out the drawbacks of each solution, so that we are all
> aware of their limitations in advance.
>
> >
> > > > +
> > > > +What info should be kept in runtime files?
> > > > +
> > > > +Runtime files are used for instance migration (to run a QEMU
> > > > +process on the target node with the same configuration) and for
> > > > +hotplug actions (to update the configuration of a running instance
> > > > +so that it can be migrated). Until now we were using devices only on
> > > > +the PCI bus, so only each device's PCI slot had to be kept in the
> > > > +runtime file. This is obviously not enough anymore. We decide to
> > > > +replace the `pci` slot of Disk and NIC configuration objects with an
> > > > +`hvinfo` dict. It will contain all necessary info for constructing
> > > > +the appropriate -device QEMU option. Specifically, the `driver`,
> > > > +`id`, and `bus` parameters will be present for all kinds of devices.
> > > > +PCI devices will have the `addr` parameter; SCSI devices will have
> > > > +`channel`, `scsi-id`, and `lun`. NICs and Disks will have the extra
> > > > +`netdev` and `drive` parameters respectively.
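> > > > +
> > > > +For illustration, hvinfo could look as follows for a VirtIO disk on
> > > > +the PCI bus and for a SCSI disk (all values are made up for the
> > > > +example):
> > > > +
> > > > +  ::
> > > > +
> > > > +    # virtio-blk-pci disk on the PCI bus
> > > > +    {"driver": "virtio-blk-pci", "id": "disk-6f5a-8f3c",
> > > > +     "bus": "pci.0", "addr": "0xc", "drive": "disk-6f5a-8f3c"}
> > > > +
> > > > +    # scsi-hd disk on the SCSI bus
> > > > +    {"driver": "scsi-hd", "id": "disk-9a2b-1d4e", "bus": "scsi.0",
> > > > +     "channel": 0, "scsi-id": 2, "lun": 0, "drive": "disk-9a2b-1d4e"}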
> > > > +
> > > > +
> > > > +How to deal with existing instances?
> > > > +
> > > > +Only existing instances with paravirtual devices (configured via the
> > > > +disk_type and nic_type hvparams) use the latest QEMU device model,
> > > > +and only these have the `pci` slot filled. We will use the existing
> > > > +_UpgradeSerializedRuntime() method to migrate the old runtime format
> > > > +with a `pci` slot in Disk and NIC configuration objects to the new
> > > > +one with `hvinfo` instead. The new hvinfo will contain the old
> > > > +driver (either virtio-blk-pci or virtio-net-pci), the old id
> > > > +(hotdisk-123456-pci-4), the default PCI bus (pci.0), and the old PCI
> > > > +slot (addr=4). This way those devices will still be hotplug-able,
> > > > +and the instance will still be migrate-able. When those instances
> > > > +are rebooted, the hvinfo will be re-generated.
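> > > > +
> > > > +A rough sketch of the upgrade step (how exactly
> > > > +_UpgradeSerializedRuntime() is extended is an assumption here):
> > > > +
> > > > +  ::
> > > > +
> > > > +    def _UpgradeDevice(dev, driver):
> > > > +      # Turn the legacy `pci` slot into an equivalent hvinfo dict,
> > > > +      # keeping the old ID and PCI address so that hotplug and
> > > > +      # migration of the running instance keep working.
> > > > +      if dev.pci is not None and not getattr(dev, "hvinfo", None):
> > > > +        dev.hvinfo = {
> > > > +          "driver": driver,  # virtio-blk-pci or virtio-net-pci
> > > > +          "id": "hotdisk-%s-pci-%d" % (dev.uuid.split("-")[0], dev.pci),
> > > > +          "bus": "pci.0",
> > > > +          "addr": hex(dev.pci),  # e.g. 4 -> "0x4"
> > > > +        }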
> > > > +
> > > > +
> > > > +How to support downgrades?
> > > > +
> > > > +There are two possible ways, both not very pretty. The first one is
> > > > +to use _UpgradeSerializedRuntime() to remove the hvinfo slot. This
> > > > +would require patching all Ganeti versions down to 2.10, which is
> > > > +practically impossible. Another way is to ssh to all nodes and
> > > > +remove this slot upon a cluster downgrade. This ugly hack will go
> > > > +away in 2.17, since we support downgrades only to the previous minor
> > > > +version.
> > >
> > > IIRC, you don't need to do anything special for downgrades, as long as
> > > you don't add an array member to the top-level array in the runtime (as
> > > was done in 2.10 to introduce hotplugging). An additional dict field
> > > should be ignored by previous versions. IOW, if you just keep the old
> > > PCI-based IDs in place, there should be no problems, but I'm not 100%
> > > sure about that.
> > >
> >
> > _AnalyzeSerializedRuntime() uses the FromDict() method of Disk and NIC
> > objects. Such objects would not have an hvinfo dict on versions < 2.16.
> > So, if a VM started with 2.16 (including 'hvinfo' in its runtime file)
> > needs to be migrated after a downgrade, it will fail while reading the
> > runtime file. Riba suggested the workaround; we know it is ugly but it
> > will go away in 2.17.
> >
> > > > +
> > > > +
> > > > +Configuration changes
> > > > +---------------------
> > > > +
> > > > +The ``NIC`` and ``Disk`` objects get one extra slot: ``hvinfo``. It
> > > > +is hypervisor-specific and will never reach config.data. In case of
> > > > +the KVM Hypervisor it will contain all necessary info for
> > > > +constructing the -device QEMU option. Existing entries in runtime
> > > > +files that had a `pci` slot will be upgraded to have the
> > > > +corresponding `hvinfo` (see above).
> > > > +
> > > > +The new `scsi_controller_type` hvparam is added to denote what type
> > > > +of SCSI controller should be added to the PCI bus if we have a SCSI
> > > > +disk. Allowed values will be `lsi`, `virtio-scsi-pci`, and
> > > > +`megasas`. We decide to use `lsi` by default, since this is the one
> > > > +that QEMU adds automatically if not specified explicitly by an
> > > > +option.
> > > > +
> > > > +
> > > > +Hypervisor changes
> > > > +------------------
> > > > +
> > > > +The current implementation verifies whether a hotplug action has
> > > > +succeeded by scanning the PCI bus and searching for a specific
> > > > +device ID. This will change, and we will use the ``query-block`` QMP
> > > > +command along with ``query-pci`` to find block devices that are
> > > > +attached to the SCSI bus as well.
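> > > > +
> > > > +A rough sketch of such a check (the QMP wrapper and the result field
> > > > +names here are assumptions, not the final implementation):
> > > > +
> > > > +  ::
> > > > +
> > > > +    def _HotplugSucceeded(qmp, device_id):
> > > > +      # Look for the device ID on the PCI buses first, then among
> > > > +      # the block devices (this covers disks on the SCSI bus).
> > > > +      for bus in qmp.Execute("query-pci"):
> > > > +        if any(d.get("qdev_id") == device_id for d in bus["devices"]):
> > > > +          return True
> > > > +      return any(blk.get("qdev") == device_id
> > > > +                 for blk in qmp.Execute("query-block"))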
> > > > +
> > > > +Up until now, if the `disk_type` hvparam was set to `scsi`, QEMU
> > > > +would use the deprecated device model and end up using SCSI
> > > > +emulation, e.g.:
> > > > +
> > > > +  ::
> > > > +
> > > > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=scsi,format=raw
> > > > +
> > > > +Now the equivalent, which will also enable hotplugging, will be to
> > > > +set disk_type to `scsi-hd`. The QEMU command line will include:
> > > > +
> > > > +  ::
> > > > +
> > > > +    -drive file=/var/run/ganeti/instance-disks/test:0,if=none,format=raw,id=hotdisk-12345
> > > > +    -device scsi-hd,id=hotdisk-12345,drive=hotdisk-12345,bus=scsi.0,channel=0,scsi-id=0,lun=0
> > >
> > > Can we drop the `hot' part from `hotdisk'? This will allow us to use a
> > > couple more bits from the UUID, reducing the chances of a collision. I
> > > would go as far as using plain `d-' for disks and `n-' for NICs.
> > >
> >
> >
> > OK, I have already removed the 'hot' part, and will add the next 2 parts
> > of the UUID so the chance of collisions gets reduced significantly.
> >
> > I prefer to leave the 'disk-' and 'nic-' prefixes for reasons of
> > readability.
>
> Ack, thanks.
>
> >
> > To sum up, I will make the following modifications to the current
> > design doc:
> >
> >  - Reword the section "How to support more than 16 devices?"
> >  - Update the section "How to identify each device?"
> >  - Update the examples with the proper device IDs
> >
> > After this patch has been merged, and when you have time, I would
> > propose that you amend the design doc with a "future work" section, so
> > your thoughts are preserved for anyone willing to contribute in the
> > future.
> >
> > Thanks,
> > dimara
>
> LGTM, thanks.
>

Hrvoje Ribicic
Ganeti Engineering
Google Germany GmbH