On Sat, Jul 13, 2013 at 12:54 PM, Dimitris Aragiorgis <[email protected]> wrote:
> This is the design doc detailing the implementation
> of device hotplugging in Ganeti.
>
> Signed-off-by: Dimitris Aragiorgis <[email protected]>
> ---
>  Makefile.am            |    1 +
>  doc/design-draft.rst   |    1 +
>  doc/design-hotplug.rst |  250 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 252 insertions(+)
>  create mode 100644 doc/design-hotplug.rst
>
> diff --git a/Makefile.am b/Makefile.am
> index 49cd09e..35fd787 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -440,6 +440,7 @@ docinput = \
>         doc/design-cpu-pinning.rst \
>         doc/design-device-uuid-name.rst \
>         doc/design-draft.rst \
> +       doc/design-hotplug.rst \
>         doc/design-htools-2.3.rst \
>         doc/design-http-server.rst \
>         doc/design-impexp2.rst \
> diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> index 0e454cd..4c1c692 100644
> --- a/doc/design-draft.rst
> +++ b/doc/design-draft.rst
> @@ -20,6 +20,7 @@ Design document drafts
>     design-internal-shutdown.rst
>     design-glusterfs-ganeti-support.rst
>     design-openvswitch.rst
> +   design-hotplug.rst
>

This part needs to be rebased/resent.

>  .. vim: set textwidth=72 :
>  .. Local Variables:
> diff --git a/doc/design-hotplug.rst b/doc/design-hotplug.rst
> new file mode 100644
> index 0000000..75dc928
> --- /dev/null
> +++ b/doc/design-hotplug.rst
> @@ -0,0 +1,250 @@
> +=======
> +Hotplug
> +=======
> +
> +.. contents:: :depth: 4
> +
> +This is a design document detailing the implementation of device
> +hotplugging in Ganeti. The logic used is hypervisor agnostic but still
> +the initial implementation will target the KVM hypervisor. The
> +implementation adds ``python-fdsend`` as a new dependency. In case
> +it is not installed hotplug will not be possible and the user will
> +be notified with a warning.
> +
> +
> +Current state and shortcomings
> +==============================
> +
> +Currently, Ganeti supports addition/removal/modification of devices
> +(NICs, Disks) but the actual modification takes place only after
> +rebooting the instance. As a result, an instance cannot change network,
> +get a new disk etc. without a hard reboot.
> +
> +Until now, in the case of the KVM hypervisor, the code neither names
> +devices nor places them in specific PCI slots. Devices are appended to
> +the KVM command and Ganeti lets KVM decide where to place them. This
> +means that a device residing in PCI slot 5 may, after a reboot (due to
> +the removal of another device), be moved to another PCI slot and
> +probably get renamed too (due to udev rules, etc.).
> +
> +In order migration to succeed, the process on the target node should be

s/order/order for a/

> +started with exactly the same machine version, CPU architecture and PCI
> +configuration as the running process. During instance creation/startup
> +Ganeti creates a KVM runtime file with all the necessary information to
> +generate the KVM command. This runtime file is used during instance
> +migration to start a new identical KVM process. The current format
> +includes the fixed part of the final KVM command, a list of NICs,
> +and the hvparams dict. It does not favor easy manipulations concerning
> +disks, because they are encapsulated in the fixed KVM command.
> +
> +Proposed changes
> +================
> +
> +For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the
> +instance. Disks and NICs occupy some of these slots. Recent versions of
> +QEMU have introduced monitor commands that allow addition/removal of PCI
> +devices. Devices are referenced based on their name or position on the
> +virtual PCI bus. To be able to use these commands, we need to be able to
> +assign each device a unique name.
> +
> +To keep track of where each device is plugged in, we add the
> +``pci`` slot to Disk and NIC objects, but we save it only in runtime
> +files, since it is hypervisor specific info. This is added for easy
> +object manipulation and is ensured not to be written back to the config.
> +
> +We propose to make use of QEMU 1.0 monitor commands so that
> +modifications to devices take effect instantly without the need for hard
> +reboot. The only change exposed to the end-user will be the addition of
> +a ``--hotplug`` option to the ``gnt-instance modify`` command.
> +
> +Upon hotplugging the PCI configuration of an instance is changed.
> +Runtime files should be updated correspondingly. Currently this is
> +impossible in case of disk hotplug because disks are included in command
> +line entry of the runtime file, contrary to NICs that are correctly
> +treated separately. We change the format of runtime files: we remove
> +disks from the fixed KVM command and create a new entry containing them
> +only. KVM options concerning disks are generated during
> +``_ExecuteKVMCommand()``, just like NICs.
> +
> +Design decisions
> +================
> +
> +What should each device ID be? Currently KVM does not support arbitrary
> +IDs for devices; only names that start with a letter, are at most 32
> +chars long, and include only the special chars '.', '_' and '-' are
> +supported. For debugging purposes and in order to be more informative,
> +devices will be named: <device type>-<part of uuid>-pci-<slot>.
> +
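
A minimal sketch of the naming rule described above, in case it helps the discussion. The function name ``make_device_id`` and the choice of 8 UUID characters are assumptions for illustration only; the design text does not say how much of the UUID is used.

```python
import re

MAX_ID_LEN = 32  # limit stated in the design text
# Must start with a letter; only letters, digits, '.', '_' and '-' allowed.
_VALID_ID = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def make_device_id(dev_type, dev_uuid, pci_slot, uuid_chars=8):
    """Build '<device type>-<part of uuid>-pci-<slot>' and validate it.

    uuid_chars is a hypothetical parameter: the design only says
    "part of uuid".
    """
    dev_id = f"{dev_type}-{dev_uuid[:uuid_chars]}-pci-{pci_slot}"
    if len(dev_id) > MAX_ID_LEN or not _VALID_ID.fullmatch(dev_id):
        raise ValueError(f"invalid device id: {dev_id!r}")
    return dev_id
```

With a device type such as ``nic`` or ``disk`` the result always starts with a letter, so the validation mainly guards against over-long or unexpected type strings.
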
> +Who decides where to hotplug each device? Since this is a
> +hypervisor specific matter, there is no point in the master node
> +deciding such a thing. The master node just has to request noded to
> +hotplug a device. To this end, hypervisor specific code should parse the
> +current PCI configuration (i.e. ``info pci`` QEMU monitor command), find
> +the first available slot and hotplug the device. By having noded decide
> +where to hotplug a device we ensure that no error will occur due to
> +duplicate slot assignment (if masterd kept track of PCI reservations and
> +noded failed to return the PCI slot that the device was plugged into,
> +the next hotplug would fail).
> +
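
For illustration, the slot selection described above could look roughly like the sketch below. It assumes that ``info pci`` prints one header line per device of the form ``Bus  0, device   3, function 0:`` and that all devices sit on a single bus; both the regular expression and the helper name ``first_free_slot`` are assumptions, not part of the patch.

```python
import re

PCI_SLOTS = 32  # number of slots QEMU exposes, per the design text

# Assumed shape of the per-device header line in `info pci` output.
_SLOT_RE = re.compile(r"Bus\s+\d+,\s+device\s+(\d+),\s+function\s+\d+")


def first_free_slot(info_pci_output, total_slots=PCI_SLOTS):
    """Return the lowest PCI slot number not listed in the monitor output."""
    occupied = {int(m.group(1)) for m in _SLOT_RE.finditer(info_pci_output)}
    for slot in range(total_slots):
        if slot not in occupied:
            return slot
    raise RuntimeError("no free PCI slot available")


# Hypothetical monitor output with four pre-occupied slots.
SAMPLE = """\
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1013:00b8
  Bus  0, device   3, function 0:
    RAM controller: PCI device 1af4:1002
"""
```

Because the decision is made from the live monitor output on the node, nothing has to be reserved on the master side.
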
> +Where should we keep track of devices' PCI slots? As already mentioned,
> +we must keep track of devices' PCI slots to successfully migrate
> +instances. The first option is to save this info to config data, which
> +would allow us to place each device at the same PCI slot after reboot.
> +This would require making the hypervisor return the PCI slot chosen for
> +each device and storing this information in config data. Additionally,
> +the whole instance configuration should be returned with PCI slots
> +filled after instance start, and each instance should keep track of
> +current PCI reservations. We decide not to go in this direction, in
> +order to keep things simple and not add hypervisor specific info to
> +configuration data (``pci_reservations`` at instance level and ``pci``
> +at device level). For the aforementioned reason, we decide to store this
> +info only in KVM runtime files.
> +
> +Where to place the devices upon instance startup? QEMU has by default 4
> +pre-occupied PCI slots. So, hypervisor can use the remaining ones for
> +disks and NICs. Currently, PCI configuration is not preserved after
> +reboot.  Each time an instance starts, KVM assigns PCI slots to devices
> +based on their ordering in Ganeti configuration, i.e. the second disk
> +will be placed after the first, the third NIC after the second, etc.
> +Since we decided that there is no need to keep track of devices' PCI
> +slots, there is no need to change current functionality.
> +
> +How to deal with existing instances? Hotplug depends on runtime file
> +manipulation. It stores there the PCI info and every device the KVM process
> +is currently using. Existing files have no PCI info in devices and have block
> +devices encapsulated inside the kvm_cmd entry. Thus hotplugging of existing
> +devices will not be possible. Still, migration and hotplugging of new devices
> +will succeed. The workaround will happen upon loading the KVM runtime: if we
> +detect the old style format we will add an empty list for block devices, and
> +upon saving the KVM runtime we will include this empty list as well. Switching
> +entirely to the new format will happen upon instance reboot.
> +
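
The load-time workaround could be sketched as follows. This assumes the deserialized runtime is a plain list of ``[kvm_cmd, nics, hvparams]`` in the old format with a fourth block device entry in the new one; the actual on-disk layout and the helper name ``upgrade_runtime`` are assumptions made for the example.

```python
def upgrade_runtime(runtime):
    """Return the runtime in the new four-entry format.

    Old-style runtimes (three entries) get an empty block device list
    appended; their disks stay embedded in kvm_cmd until the next reboot.
    """
    if len(runtime) == 3:
        kvm_cmd, nics, hvparams = runtime
        return [kvm_cmd, nics, hvparams, []]
    if len(runtime) == 4:
        return list(runtime)
    raise ValueError(f"unexpected runtime format with {len(runtime)} entries")
```

Saving the upgraded list back keeps the empty entry, so disks hotplugged later have a place to be recorded.
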
> +
> +Configuration changes
> +---------------------
> +
> +The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It refers to
> +PCI slot that the device gets plugged into.
> +
> +In order to be able to live migrate successfully, runtime files should
> +be updated every time a live modification (hotplug) takes place. To this
> +end we change the format of runtime files. The KVM options referring to
> +instance's disks are no longer recorded as part of the KVM command line.
> +Disks are treated separately, just as we treat NICs right now. We insert
> +and remove entries to reflect the current PCI configuration.
> +
> +
> +Backend changes
> +---------------
> +
> +Introduce one new RPC call:
> +
> +- hotplug_device(DEVICE_TYPE, ACTION, device, ...)
> +
> +where DEVICE_TYPE can be either NIC or Disk, and ACTION either REMOVE or ADD.
> +
> +Hypervisor changes
> +------------------
> +
> +We implement hotplug on top of the KVM hypervisor. We take advantage of
> +QEMU 1.0 monitor commands (``device_add``, ``device_del``,
> +``drive_add``, ``drive_del``, ``netdev_add``, ``netdev_del``). QEMU
> +refers to devices based on their id. We use ``uuid`` to name them
> +properly. If a device is about to be hotplugged we parse the output of
> +``info pci`` and find the occupied PCI slots. We choose the first
> +available and the whole device object is appended to the corresponding
> +entry in the runtime file.
> +
> +Concerning NIC handling, we build on top of the existing logic
> +(first create a tap with _OpenTap() and then pass its file descriptor to
> +the KVM process). To this end we need to pass access rights to the
> +corresponding file descriptor over the monitor socket (UNIX domain
> +socket). The open file is passed as a socket-level control message
> +(SCM), using the ``fdsend`` python library.
> +
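
To make the descriptor passing concrete, here is a small self-contained sketch of sending an open file descriptor over a UNIX domain socket with ``SCM_RIGHTS``. It uses the Python 3 standard library (``socket.sendmsg``/``recvmsg``) rather than ``fdsend``, and a socket pair plus a pipe stand in for the monitor socket and the tap device; the helper names are made up for the example.

```python
import array
import os
import socket


def send_fd(sock, fd, payload=b"x"):
    """Send payload plus one file descriptor as SCM_RIGHTS ancillary data."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    sock.sendmsg([payload], ancillary)


def recv_fd(sock, bufsize=1024):
    """Receive payload and the single descriptor attached to it."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing partial integer in the control data.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return msg, fds[0]


def demo():
    """Pass the read end of a pipe across a socket pair and read through it."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    send_fd(sender, read_end)
    _msg, received = recv_fd(receiver)
    os.write(write_end, b"hello")
    data = os.read(received, 5)
    for fd in (read_end, write_end, received):
        os.close(fd)
    sender.close()
    receiver.close()
    return data
```

In the hotplug case the payload accompanying the descriptor would be the monitor command rather than a placeholder byte, and the receiving side is the KVM process itself.
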
> +
> +User interface
> +--------------
> +
> +The new ``--hotplug`` option to gnt-instance modify is introduced, which
> +forces live modifications.
> +
> +
> +Enabling hotplug
> +++++++++++++++++
> +
> +Hotplug will be optional during gnt-instance modify.  For existing
> +instances, after installing a version that supports hotplugging we
> +have the restriction that hotplug will not be supported for existing
> +devices. The reason is that old runtime files lack:
> +
> +1. Device pci configuration info.
> +
> +2. Separate block device entry.
> +
> +Hotplug will be supported only for KVM in the first implementation. For
> +all other hypervisors, backend will raise an Exception in case hotplug is
> +requested.
> +
> +
> +NIC Hotplug
> ++++++++++++
> +
> +The user can add/modify/remove NICs either with hotplugging or not. If a
> +NIC is to be added a tap is created first and configured properly with
> +kvm-vif-bridge script. Then the instance gets a new network interface.
> +Since there is no QEMU monitor command to modify a NIC, we modify a NIC
> +by temporarily removing the existing one and adding a new one with the
> +new configuration. When removing a NIC the corresponding tap gets
> +removed as well.
> +
> +::
> +
> + gnt-instance modify --net add --hotplug test
> + gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test
> + gnt-instance modify --net 1:remove --hotplug test
> +
> +
> +Disk Hotplug
> +++++++++++++
> +
> +The user can add and remove disks with hotplugging or not. QEMU monitor
> +supports resizing of disks, however the initial implementation will
> +support only disk addition/deletion.
> +
> +::
> +
> + gnt-instance modify --disk add:size=1G --hotplug test
> + gnt-instance modify --disk 1:remove --hotplug test
> +
> +
> +Dealing with chroot and uid pool
> +--------------------------------
> +
> +The design so far covers all issues that arise without addressing the
> +case where the kvm process will not run with root privileges.
> +Specifically:
> +
> +- in case of chroot, the kvm process cannot see the newly created device
> +
> +- in case of uid pool security model, the kvm process is not allowed
> +  to access the device
> +
> +For NIC hotplug we address this problem by using the ``getfd`` monitor
> +command and passing the file descriptor to the kvm process over the
> +monitor socket using SCM_RIGHTS. For disk hotplug and in case of uid
> +pool we can let the hypervisor code temporarily ``chown()`` the device
> +before the actual hotplug. Still this is insufficient in case of chroot.
> +In this case, we need to ``mknod()`` the device inside the chroot. Both
> +workarounds can be avoided, if we make use of the ``add-fd`` qemu
> +monitor command, that was introduced in version 1.3. This command is the
> +equivalent of NICs' ``getfd`` for disks and will allow disk hotplug in
> +every case. So, if the qemu monitor does not support the ``add-fd``
> +command, we will not allow disk hotplug for chroot and uid security
> +model and notify the user with the corresponding warning.
> +

LGTM.

Thanks,

Guido
