On Sat, Jul 13, 2013 at 12:54 PM, Dimitris Aragiorgis <[email protected]> wrote:
> This is the design doc detailing the implementation
> of device hotplugging in Ganeti.
>
> Signed-off-by: Dimitris Aragiorgis <[email protected]>
> ---
>  Makefile.am            |   1 +
>  doc/design-draft.rst   |   1 +
>  doc/design-hotplug.rst | 250 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 252 insertions(+)
>  create mode 100644 doc/design-hotplug.rst
>
> diff --git a/Makefile.am b/Makefile.am
> index 49cd09e..35fd787 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -440,6 +440,7 @@ docinput = \
>    doc/design-cpu-pinning.rst \
>    doc/design-device-uuid-name.rst \
>    doc/design-draft.rst \
> +  doc/design-hotplug.rst \
>    doc/design-htools-2.3.rst \
>    doc/design-http-server.rst \
>    doc/design-impexp2.rst \
> diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> index 0e454cd..4c1c692 100644
> --- a/doc/design-draft.rst
> +++ b/doc/design-draft.rst
> @@ -20,6 +20,7 @@ Design document drafts
>     design-internal-shutdown.rst
>     design-glusterfs-ganeti-support.rst
>     design-openvswitch.rst
> +   design-hotplug.rst
>

This part needs to be rebased/resent.

>  .. vim: set textwidth=72 :
>  .. Local Variables:
> diff --git a/doc/design-hotplug.rst b/doc/design-hotplug.rst
> new file mode 100644
> index 0000000..75dc928
> --- /dev/null
> +++ b/doc/design-hotplug.rst
> @@ -0,0 +1,250 @@
> +=======
> +Hotplug
> +=======
> +
> +.. contents:: :depth: 4
> +
> +This is a design document detailing the implementation of device
> +hotplugging in Ganeti. The logic used is hypervisor agnostic but still
> +the initial implementation will target the KVM hypervisor. The
> +implementation adds ``python-fdsend`` as a new dependency. In case
> +it is not installed hotplug will not be possible and the user will
> +be notified with a warning.
> +
> +
> +Current state and shortcomings
> +==============================
> +
> +Currently, Ganeti supports addition/removal/modification of devices
> +(NICs, Disks) but the actual modification takes place only after
> +rebooting the instance. To this end an instance cannot change network,
> +get a new disk etc. without a hard reboot.
> +
> +Until now, in case of KVM hypervisor, code does not name devices nor
> +places them in specific PCI slots. Devices are appended in the KVM
> +command and Ganeti lets KVM decide where to place them. This means that
> +there is a possibility a device that resides in PCI slot 5, after a
> +reboot (due to another device removal) to be moved to another PCI slot
> +and probably get renamed too (due to udev rules, etc.).
> +
> +In order migration to succeed, the process on the target node should be

s/order/order for a/

> +started with exactly the same machine version, CPU architecture and PCI
> +configuration with the running process. During instance creation/startup
> +ganeti creates a KVM runtime file with all the necessary information to
> +generate the KVM command. This runtime file is used during instance
> +migration to start a new identical KVM process.
> +The current format
> +includes the fixed part of the final KVM command, a list of NICs,
> +and the hvparams dict. It does not favor easy manipulations concerning
> +disks, because they are encapsulated in the fixed KVM command.
> +
> +Proposed changes
> +================
> +
> +For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the
> +instance. Disks and NICs occupy some of these slots. Recent versions of
> +QEMU have introduced monitor commands that allow addition/removal of PCI
> +devices. Devices are referenced based on their name or position on the
> +virtual PCI bus. To be able to use these commands, we need to be able to
> +assign each device a unique name.
> +
> +To keep track of where each device is plugged in, we add the
> +``pci`` slot to Disk and NIC objects, but we save it only in runtime
> +files, since it is hypervisor specific info. This is added for easy
> +object manipulation and is ensured not to be written back to the config.
> +
> +We propose to make use of QEMU 1.0 monitor commands so that
> +modifications to devices take effect instantly without the need for a hard
> +reboot. The only change exposed to the end-user will be the addition of
> +a ``--hotplug`` option to the ``gnt-instance modify`` command.
> +
> +Upon hotplugging the PCI configuration of an instance is changed.
> +Runtime files should be updated correspondingly. Currently this is
> +impossible in case of disk hotplug because disks are included in the command
> +line entry of the runtime file, contrary to NICs that are correctly
> +treated separately. We change the format of runtime files: we remove
> +disks from the fixed KVM command and create a new entry containing only
> +them. KVM options concerning disks are generated during
> +``_ExecuteKVMCommand()``, just like NICs.
> +
> +Design decisions
> +================
> +
> +What should each device ID be?
> +Currently KVM does not support arbitrary
> +IDs for devices; supported are only names starting with a letter, max 32
> +chars length, and only including '.' '_' '-' special chars.
> +For debugging purposes and in order to be more informative, devices will be
> +named after: <device type>-<part of uuid>-pci-<slot>.
> +
> +Who decides where to hotplug each device? As long as this is a
> +hypervisor specific matter, there is no point for the master node to
> +decide such a thing. Master node just has to request noded to hotplug a
> +device. To this end, hypervisor specific code should parse the current
> +PCI configuration (i.e. ``info pci`` QEMU monitor command), find the first
> +available slot and hotplug the device. By having noded decide where to
> +hotplug a device we ensure that no error will occur due to duplicate
> +slot assignment (if masterd keeps track of PCI reservations and noded
> +fails to return the PCI slot that the device was plugged into then the next
> +hotplug will fail).
> +
> +Where should we keep track of devices' PCI slots? As already mentioned,
> +we must keep track of devices' PCI slots to successfully migrate
> +instances. First option is to save this info to config data, which would
> +allow us to place each device at the same PCI slot after reboot. This
> +would require making the hypervisor return the PCI slot chosen for each
> +device, and storing this information to config data. Additionally the
> +whole instance configuration should be returned with PCI slots filled
> +after instance start and each instance should keep track of current PCI
> +reservations. We decide not to go in this direction in order to
> +keep it simple and not add hypervisor specific info to configuration
> +data (``pci_reservations`` at instance level and ``pci`` at device
> +level). For the aforementioned reason, we decide to store this info only
> +in KVM runtime files.
> +
> +Where to place the devices upon instance startup?
> +QEMU has by default 4
> +pre-occupied PCI slots. So, hypervisor can use the remaining ones for
> +disks and NICs. Currently, PCI configuration is not preserved after
> +reboot. Each time an instance starts, KVM assigns PCI slots to devices
> +based on their ordering in Ganeti configuration, i.e. the second disk
> +will be placed after the first, the third NIC after the second, etc.
> +Since we decided that there is no need to keep track of devices' PCI
> +slots, there is no need to change current functionality.
> +
> +How to deal with existing instances? Hotplug depends on runtime file
> +manipulation. It stores there pci info and every device the kvm process is
> +currently using. Existing files have no pci info in devices and have block
> +devices encapsulated inside kvm_cmd entry. Thus hotplugging of existing devices
> +will not be possible. Still migration and hotplugging of new devices will
> +succeed. The workaround will happen upon loading kvm runtime: if we detect old
> +style format we will add an empty list for block devices and upon saving kvm
> +runtime we will include this empty list as well. Switching entirely to new
> +format will happen upon instance reboot.
> +
> +
> +Configuration changes
> +---------------------
> +
> +The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It refers to
> +the PCI slot that the device gets plugged into.
> +
> +In order to be able to live migrate successfully, runtime files should
> +be updated every time a live modification (hotplug) takes place. To this
> +end we change the format of runtime files. The KVM options referring to
> +instance's disks are no longer recorded as part of the KVM command line.
> +Disks are treated separately, just as we treat NICs right now. We insert
> +and remove entries to reflect the current PCI configuration.
> +
> +
> +Backend changes
> +---------------
> +
> +Introduce one new RPC call:
> +
> +- hotplug_device(DEVICE_TYPE, ACTION, device, ...)
> +
> +where DEVICE_TYPE can be either NIC or Disk, and ACTION either REMOVE or ADD.
> +
> +Hypervisor changes
> +------------------
> +
> +We implement hotplug on top of the KVM hypervisor. We take advantage of
> +QEMU 1.0 monitor commands (``device_add``, ``device_del``,
> +``drive_add``, ``drive_del``, ``netdev_add``, ``netdev_del``). QEMU
> +refers to devices based on their id. We use ``uuid`` to name them
> +properly. If a device is about to be hotplugged we parse the output of
> +``info pci`` and find the occupied PCI slots. We choose the first
> +available and the whole device object is appended to the corresponding
> +entry in the runtime file.
> +
> +Concerning NIC handling, we build on top of the existing logic
> +(first create a tap with _OpenTap() and then pass its file descriptor to
> +the KVM process). To this end we need to pass access rights to the
> +corresponding file descriptor over the monitor socket (UNIX domain
> +socket). The open file is passed as a socket-level control message
> +(SCM), using the ``fdsend`` python library.
> +
> +
> +User interface
> +--------------
> +
> +The new ``--hotplug`` option to gnt-instance modify is introduced, which
> +forces live modifications.
> +
> +
> +Enabling hotplug
> +++++++++++++++++
> +
> +Hotplug will be optional during gnt-instance modify. For existing
> +instances, after installing a version that supports hotplugging we
> +have the restriction that hotplug will not be supported for existing
> +devices. The reason is that old runtime files lack:
> +
> +1. Device pci configuration info.
> +
> +2. Separate block device entry.
> +
> +Hotplug will be supported only for KVM in the first implementation. For
> +all other hypervisors, backend will raise an Exception in case hotplug is
> +requested.
> +
> +
> +NIC Hotplug
> ++++++++++++
> +
> +The user can add/modify/remove NICs either with hotplugging or not.
> +If a NIC is to be added, a tap is created first and configured properly with
> +the kvm-vif-bridge script. Then the instance gets a new network interface.
> +Since there is no QEMU monitor command to modify a NIC, we modify a NIC
> +by temporarily removing the existing one and adding a new one with the new
> +configuration. When removing a NIC the corresponding tap gets removed as
> +well.
> +
> +::
> +
> +  gnt-instance modify --net add --hotplug test
> +  gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test
> +  gnt-instance modify --net 1:remove --hotplug test
> +
> +
> +Disk Hotplug
> +++++++++++++
> +
> +The user can add and remove disks with hotplugging or not. QEMU monitor
> +supports resizing of disks, however the initial implementation will
> +support only disk addition/deletion.
> +
> +::
> +
> +  gnt-instance modify --disk add:size=1G --hotplug test
> +  gnt-instance modify --disk 1:remove --hotplug test
> +
> +
> +Dealing with chroot and uid pool
> +--------------------------------
> +
> +The design so far covers all issues that arise without addressing the
> +case where the kvm process will not run with root privileges.
> +Specifically:
> +
> +- in case of chroot, the kvm process cannot see the newly created device
> +
> +- in case of uid pool security model, the kvm process is not allowed
> +  to access the device
> +
> +For NIC hotplug we address this problem by using the ``getfd`` monitor
> +command and passing the file descriptor to the kvm process over the
> +monitor socket using SCM_RIGHTS. For disk hotplug and in case of uid
> +pool we can let the hypervisor code temporarily ``chown()`` the device
> +before the actual hotplug. Still this is insufficient in case of chroot.
> +In this case, we need to ``mknod()`` the device inside the chroot. Both
> +workarounds can be avoided, if we make use of the ``add-fd`` qemu
> +monitor command, that was introduced in version 1.3.
> +This command is the
> +equivalent of NICs' ``getfd`` for disks and will allow disk hotplug in
> +every case. So, if the qemu monitor does not support the ``add-fd``
> +command, we will not allow disk hotplug for chroot and uid security
> +model and notify the user with the corresponding warning.
> +

LGTM.

Thanks,
Guido
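
P.S. For anyone following the SCM_RIGHTS part of the NIC handling section, here is a rough sketch of the mechanism. It is not the patch's code and it does not use ``python-fdsend``: it relies on ``socket.sendmsg``/``recvmsg`` from the Python 3 standard library instead. The socket pair stands in for the QEMU monitor socket, a pipe stands in for the tap device, and the helper names (``send_fd``, ``recv_fd``, ``demo``) and the fd name ``demo-fd`` are made up for illustration.

```python
import array
import os
import socket


def send_fd(sock, fd, payload):
    """Send payload plus one open descriptor as an SCM_RIGHTS control message."""
    fds = array.array("i", [fd])
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock, bufsize=64):
    """Receive a payload and the single descriptor attached to it."""
    fds = array.array("i")
    payload, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial integer before decoding.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return payload, fds[0]


def demo():
    # Assumption: the socket pair plays the role of the monitor socket and
    # the pipe plays the role of the tap device opened by the hypervisor code.
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    received = None
    try:
        # Illustrative payload only; the fd name is hypothetical.
        send_fd(sender, write_end, b"getfd demo-fd\n")
        payload, received = recv_fd(receiver)
        # The receiver now holds its own descriptor for the same open file.
        os.write(received, b"tap")
        return payload, os.read(read_end, 3)
    finally:
        for fd in (read_end, write_end, received):
            if fd is not None:
                os.close(fd)
        sender.close()
        receiver.close()
```

The point of the sketch is that the receiving process never opens the device itself, which is why this approach keeps working with chroot and the uid pool security model.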
