On Thu, Jul 11, 2013 at 12:27:00pm +0200, Guido Trotter wrote: > +list > > > ---------- Forwarded message ---------- > From: Guido Trotter <[email protected]> > Date: Thu, Jul 11, 2013 at 12:26 PM > Subject: Re: [PATCH master] Add hotplug design doc > To: Dimitris Aragiorgis <[email protected]> > > > On Fri, Jul 5, 2013 at 7:48 PM, Dimitris Aragiorgis <[email protected]> wrote: > > + lists > > > > * Guido Trotter <[email protected]> [2013-07-05 12:57:33 +0200]: > > > >> On Fri, Jul 5, 2013 at 8:50 AM, Dimitris Aragiorgis <[email protected]> > >> wrote: > >> > This is the design behind the first hotplug implementation > >> > for the KVM hypervisor. > >> > > >> > Signed-off-by: Dimitris Aragiorgis <[email protected]> > >> > --- > >> > > >> > Hello team, > >> > > >> > This is the updated design doc for hotplug. It includes all > >> > modifications/ suggestions that have been discussed in the last thread. > >> > I will wait for your comments or eventually the final ACK so that I can > >> > proceed with implementation patches. 
> >> > > >> > Thanks a lot, > >> > dimara > >> > > >> > Makefile.am | 1 + > >> > doc/design-draft.rst | 1 + > >> > doc/design-hotplug.rst | 222 > >> > ++++++++++++++++++++++++++++++++++++++++++++++++ > >> > 3 files changed, 224 insertions(+) > >> > create mode 100644 doc/design-hotplug.rst > >> > > >> > diff --git a/Makefile.am b/Makefile.am > >> > index 91f3f37..fda6f58 100644 > >> > --- a/Makefile.am > >> > +++ b/Makefile.am > >> > @@ -435,6 +435,7 @@ docinput = \ > >> > doc/design-cpu-pinning.rst \ > >> > doc/design-device-uuid-name.rst \ > >> > doc/design-draft.rst \ > >> > + doc/design-hotplug.rst \ > >> > doc/design-htools-2.3.rst \ > >> > doc/design-http-server.rst \ > >> > doc/design-impexp2.rst \ > >> > diff --git a/doc/design-draft.rst b/doc/design-draft.rst > >> > index 0e454cd..4c1c692 100644 > >> > --- a/doc/design-draft.rst > >> > +++ b/doc/design-draft.rst > >> > @@ -20,6 +20,7 @@ Design document drafts > >> > design-internal-shutdown.rst > >> > design-glusterfs-ganeti-support.rst > >> > design-openvswitch.rst > >> > + design-hotplug.rst > >> > > >> > .. vim: set textwidth=72 : > >> > .. Local Variables: > >> > diff --git a/doc/design-hotplug.rst b/doc/design-hotplug.rst > >> > new file mode 100644 > >> > index 0000000..ff4ff95 > >> > --- /dev/null > >> > +++ b/doc/design-hotplug.rst > >> > @@ -0,0 +1,222 @@ > >> > +======= > >> > +Hotplug > >> > +======= > >> > + > >> > +.. contents:: :depth: 4 > >> > + > >> > +This is a design document detailing the implementation of device > >> > +hotplugging in Ganeti. The logic used is hypervisor agnostic but still > >> > +the initial implementation will target the KVM hypervisor. The > >> > +implementation adds ``python-fdsend`` as a new dependency. > >> > + > >> > >> Can you please specify, as we agreed, that python-fdsend is an > >> optional dependency, and if not present Ganeti will still work, but > >> hotplugging won't be possible? > >> > > > > Yes sure. Just like affinity module. 
BTW only NIC hotplug depends on fdsend > > so we could still support disk hotplug. > > > >> > + > >> > +Current state and shortcomings > >> > +============================== > >> > + > >> > +Currently, Ganeti supports addition/removal/modification of devices > >> > +(NICs, Disks) but the actual modification takes place only after > >> > +rebooting the instance. To this end an instance cannot change network, > >> > +get a new disk etc. without a hard reboot. > >> > + > >> > +Until now, in case of KVM hypervisor, code does not name devices nor > >> > +places them in specific PCI slots. Devices are appended in the KVM > >> > +command and Ganeti lets KVM decide where to place them. This means that > >> > +there is a possibility a device that resides in PCI slot 5, after a > >> > +reboot (due to another device removal) to be moved to another PCI slot > >> > +and probably get renamed too (due to udev rules, etc.). > >> > + > >> > +In order migration to succeed, the process on the target node should be > >> > +started with exactly the same machine version, CPU architecture and PCI > >> > +configuration with the running process. During instance creation/startup > >> > +ganeti creates a KVM runtime file with all the necessary information to > >> > +generate the KVM command. This runtime file is used during instance > >> > +migration to start a new identical KVM process. The current format > >> > +includes the fixed part of the final KVM command, a list of NICs', > >> > +and hvparams dict. It does not favor easy manipulations concerning > >> > +disks, because they are encapsulated in the fixed KVM command. > >> > + > >> > +Proposed changes > >> > +================ > >> > + > >> > +For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the > >> > +instance. Disks and NICs occupy some of these slots. Recent versions of > >> > +QEMU have introduced monitor commands that allow addition/removal of PCI > >> > +devices. 
Devices are referenced based on their name or position on the > >> > +virtual PCI bus. To be able to use these commands, we need to be able to > >> > +assign each device a unique name. > >> > + > >> > +To keep track where each device is plugged into, we add the > >> > +``pci`` slot to Disk and NIC objects, but we save it only in runtime > >> > +files, since it is hypervisor specific info. This is added for easy > >> > +object manipulation and is ensured not to be written back to the config. > >> > + > >> > +We propose to make use of QEMU 1.0 monitor commands so that > >> > +modifications to devices take effect instantly without the need for hard > >> > +reboot. The only change exposed to the end-user will be the addition of > >> > +a ``--hotplug`` option to the ``gnt-instance modify`` command. > >> > + > >> > +Upon hotplugging the PCI configuration of an instance is changed. > >> > +Runtime files should be updated correspondingly. Currently this is > >> > +impossible in case of disk hotplug because disks are included in command > >> > +line entry of the runtime file, contrary to NICs that are correctly > >> > +treated separately. We change the format of runtime files, we remove > >> > +disks from the fixed KVM command and create new entry containing them > >> > +only. KVM options concerning disk are generated during > >> > +``_ExecuteKVMCommand()``, just like NICs. > >> > + > >> > +Design decisions > >> > +================ > >> > + > >> > +Which should be each device ID? Currently KVM does not support arbitrary > >> > +IDs for devices; supported are only names starting with a letter, max 32 > >> > +chars length, and only including '.' '_' '-' special chars. > >> > +We use the device pci slot and name it after <device type>-pci-<slot> > >> > +(for debugging purposes we could add a part of uuid as well). > >> > >> Didn't we decide for just <device-type>-<part-of-uuid>-<slot> ? 
> >> > > > > Well I did that in order kvm command to be readable and not full of > > random numbers. Adding the part of uuid is simple just another line of code > > and > > nothing more. OK. I 'll change it to <device-type>-<part-of-uuid>-<slot>. > > > >> > + > >> > +Who decides where to hotplug each device? As long as this is a > >> > +hypervisor specific matter, there is no point for the master node to > >> > +decide such a thing. Master node just has to request noded to hotplug a > >> > +device. To this end, hypervisor specific code should parse the current > >> > +PCI configuration (i.e. ``info pci`` QEMU monitor command), find the > >> > first > >> > +available slot and hotplug the device. Having noded to decide where to > >> > +hotplug a device we ensure that no error will occur due to duplicate > >> > +slot assignment (if masterd keeps track of PCI reservations and noded > >> > +fails to return the PCI slot that the device was plugged into then next > >> > +hotplug will fail). > >> > + > >> > +Where should we keep track of devices' PCI slots? As already mentioned, > >> > +we must keep track of devices PCI slots to successfully migrate > >> > +instances. First option is to save this info to config data, which would > >> > +allow us to place each device at the same PCI slot after reboot. This > >> > +would require to make the hypervisor return the PCI slot chosen for each > >> > +device, and storing this information to config data. Additionally the > >> > +whole instance configuration should be returned with PCI slots filled > >> > +after instance start and each instance should keep track of current PCI > >> > +reservations. We decide not to go towards this direction in order to > >> > +keep it simple and do not add hypervisor specific info to configuration > >> > +data (``pci_reservations`` at instance level and ``pci`` at device > >> > +level). For the aforementioned reason, we decide to store this info only > >> > +in KVM runtime files. 
> >> > + > >> > +Where to place the devices upon instance startup? QEMU has by default 4 > >> > +pre-occupied PCI slots. So, hypervisor can use the remaining ones for > >> > +disks and NICs. Currently, PCI configuration is not preserved after > >> > +reboot. Each time an instance starts, KVM assigns PCI slots to devices > >> > +based on their ordering in Ganeti configuration, i.e. the second disk > >> > +will be placed after the first, the third NIC after the second, etc. > >> > +Since we decided that there is no need to keep track of devices PCI > >> > +slots, there is no need to change current functionality. > >> > + > >> > +How to deal with existing instances? Hotplug depends on runtime file > >> > +manipulation. It stores there pci info and every device the kvm process > >> > is > >> > +currently using. Existing files have no pci info in devices and have > >> > block > >> > +devices encapsulated inside kvm_cmd entry. Thus hotplugging of existing > >> > devices > >> > +will not be possible. > >> > Still migration and hotplugging of new devices will > >> > +succeed. The workaround will happen upon loading kvm runtime: if we > >> > detect old > >> > +style format we will add an empty list for block devices and upon > >> > saving kvm > >> > +runtime we will include this empty list as well. Switching entirely to > >> > new > >> > +format will happen upon instance reboot. > >> > + > >> > + > >> > +Configuration changes > >> > +--------------------- > >> > + > >> > +The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It refers > >> > to > >> > +PCI slot that the device gets plugged into. > >> > + > >> > +In order to be able to live migrate successfully, runtime files should > >> > +be updated every time a live modification (hotplug) takes place. To this > >> > +end we change the format of runtime files. The KVM options referring to > >> > +instance's disks are no longer recorded as part of the KVM command line. 
> >> > +Disks are treated separately, just as we treat NICs right now. We insert > >> > +and remove entries to reflect the current PCI configuration. > >> > + > >> > + > >> > +Backend changes > >> > +--------------- > >> > + > >> > +Introduce one new RPC call: > >> > + > >> > +- hotplug_device(DEVICE_TYPE, ACTION, device, ...) > >> > + > >> > +where DEVICE_TYPE can be either NIC or Disk, and ACTION either REMOVE > >> > or ADD. > >> > + > >> > +Hypervisor changes > >> > +------------------ > >> > + > >> > +We implement hotplug on top of the KVM hypervisor. We take advantage of > >> > +QEMU 1.0 monitor commands (``device_add``, ``device_del``, > >> > +``drive_add``, ``drive_del``, ``netdev_add``,`` netdev_del``). QEMU > >> > +refers to devices based on their id. We use ``uuid`` to name them > >> > +properly. If a device is about to be hotplugged we parse the output of > >> > +``info pci`` and find the occupied PCI slots. We choose the first > >> > +available and the whole device object is appended to the corresponding > >> > +entry in the runtime file. > >> > + > >> > +Concerning NIC handling, we build on the top of the existing logic > >> > +(first create a tap with _OpenTap() and then pass its file descriptor to > >> > +the KVM process). To this end we need to pass access rights to the > >> > +corresponding file descriptor over the monitor socket (UNIX domain > >> > +socket). The open file is passed as a socket-level control message > >> > +(SCM), using the ``fdsend`` python library. > >> > + > >> > + > >> > +User interface > >> > +-------------- > >> > + > >> > +The new ``--hotplug`` option to gnt-instance modify is introduced, which > >> > +forces live modifications. > >> > + > >> > + > >> > +Enabling hotplug > >> > +++++++++++++++++ > >> > + > >> > +Hotplug will be optional during gnt-instance modify. 
For existing > >> > +instance, after installing a version that supports hotplugging we > >> > +have the restriction that hotplug will not be supported for existing > >> > +devices. The reason is that old runtime files lack of: > >> > + > >> > +1. Device pci configuration info. > >> > + > >> > +2. Separate block device entry. > >> > + > >> > +Hotplug will be supported only for KVM in the first implementation. For > >> > +all other hypervisors, backend will raise an Exception case hotplug is > >> > +requested. > >> > + > >> > + > >> > +NIC hotplug > >> > ++++++++++++ > >> > + > >> > +The user can add/modify/remove NICs either with hotplugging or not. If a > >> > +NIC is to be added a tap is created first and configured properly with > >> > +kvm-vif-bridge script. Then the instance gets a new network interface. > >> > +Since there is no QEMU monitor command to modify a NIC, we modify a NIC > >> > +by temporary removing the existing one and adding a new with the new > >> > +configuration. When removing a NIC the corresponding tap gets removed as > >> > +well. > >> > + > >> > >> Please specify that this (modify as add/remove) is a potentially > >> dangerous operation and there will be warnings. > >> > > > > I will handle it just like we handle migrations. On the client side add > > a "BIG WARNING. Continue? [y/N]" > > > >> > +:: > >> > + > >> > + gnt-instance modify --net add --hotplug test > >> > + gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test > >> > + gnt-instance modify --net 1:remove --hotplug test > >> > + > >> > + > >> > +Disk hotplug > >> > +++++++++++++ > >> > + > >> > +The user can add and remove disks with hotplugging or not. QEMU monitor > >> > +supports resizing of disks, however the initial implementation will > >> > +support only disk addition/deletion. > >> > + > >> > +:: > >> > + > >> > + gnt-instance modify --disk add:size=1G --hotplug test > >> > + gnt-instance modify --net 1:remove --hotplug test > >> > + > >> > +.. 
vim: set textwidth=72 : > >> > +.. Local Variables: > >> > +.. mode: rst > >> > +.. fill-column: 72 > >> > +.. End: > >> > >> Please finally specify the status about supporting non-root and chroot > >> with hotplug. > >> Will this work from the first version? > >> > > > > Well after testing and digging (most of it done by psomas [cc]) we mention > > the > > following: > > > > - nic hotplugging succeeds both with uid pools and chroot. > > - disk hotplugging will fail. I propose for those cases, check hvparams > > on hypervisor level and if security_model is other than SM_NONE or if > > use_chroot is True just report a warning and continue. The device will > > be available after reboot. > > > > KVM 1.2 or 1.3 has introduced add-fd command which may solve the problem but > > I haven't tested it at all. Debian jessie has still 1.1.2 so there is > > no reason to hurry. > > > > Are you OK with the above? If yes can I could send you a design doc > > interdiff and then the rest of the patches. > > > > All sounds good, with one note: there were reports that it was > possible to do hotplug of disks inside a chroot by linking or creating > the device there, hotplugging it, and then removing the device. Would > it make sense to do this? (or do say that it's going to be done in the > next version). > > Thanks, > > Guido >
Hello Guido,
we've actually had an extensive discussion internally about the problems
with device hotplugging when uid pooling, chroot or both are enabled.
These problems do not have anything to do with hotplugging per se;
they are a direct result of the way uid pooling and chroot work.
I'll try to summarize the problems here and provide a few approaches we
want to explore. The summary: We will extend the design doc so that it
mentions that hotplugging is not supported when KVM runs in a chroot, or
when the security model is not None. We will also incorporate a
discussion of possible solutions, as outlined below, but I think it'd be
best not to commit to a specific solution yet ("it's going to be done
<this way> in the next version").
The problems:
* uid pool: KVM starts as root and opens e.g.
/var/run/ganeti/instance-disks/snf-40416:0, which points to /dev/drbd3
and is owned by root:disk; it then setuid()s itself to user 'pool152'.
After a while, we ask the KVM process to hotplug a new disk, e.g.,
/var/run/ganeti/instance-disks/snf-40416:1 -> /dev/drbd4,
which is owned by root:disk. KVM cannot open the block device, because
it no longer runs as root.
* chroot: same as above, but this time KVM chroots itself into the empty
directory /var/run/chroot-hypervisor after initialization. We ask it to hotplug
/var/run/ganeti/instance-disks/snf-40416:1 -> /dev/drbd4
and it fails, because there is no such device node in the chroot
directory.
Possible solutions:
* Latest KVM supports passing file descriptors through the monitor UNIX
domain socket. This way, the privileged Ganeti noded can open the
block device itself, then pass the open file descriptor through the
monitor socket to the chrooted KVM process which runs under a pooled
uid. This will not work with earlier KVM versions, e.g., the one in
Debian Jessie.
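The fd-passing approach could be sketched roughly as below. This is a
hypothetical helper (the name send_fd_to_monitor is ours, not Ganeti's),
assuming a Python 3 noded and QEMU's "getfd" monitor command; on Python 2
the actual code would use the python-fdsend library mentioned in the
design doc instead of socket.sendmsg():

```python
import array
import os
import socket


def send_fd_to_monitor(monitor_sock, fd, fdname):
    # Hypothetical helper: pass an already-open file descriptor to the
    # QEMU process over its monitor UNIX domain socket, attached as an
    # SCM_RIGHTS ancillary message.  QEMU's "getfd" command registers
    # the received descriptor under 'fdname' for later use.
    cmd = ("getfd %s\n" % fdname).encode("ascii")
    monitor_sock.sendmsg(
        [cmd],
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
          array.array("i", [fd]).tobytes())])
```

The key point is that noded opens the block device while still running
as root, so the chrooted, setuid'd KVM process never has to open the
path itself.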
* For uid pool, we can have noded chown() the block device to the
corresponding uid, so the KVM process can open() it successfully.
Once KVM has opened the device, we can chown() the device node back to
its previous owner, presumably root. This has the disadvantage that
should the operation be aborted (e.g., noded dies), the device node
will be left with a non-root owner, which could be a security risk.
Another workaround could be chown()ing the device node in a pre-hook,
but this is a very ugly hack that has all of the disadvantages of the
above solution and none of its advantages.
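A minimal sketch of the temporary-chown idea, with a hypothetical helper
name (the real code would live in noded's KVM hypervisor class):

```python
import os


def with_temporary_owner(path, uid, fn):
    # Hypothetical helper: hand ownership of the device node to the
    # pooled uid only for as long as fn() needs it (i.e. until KVM has
    # open()ed the device), then restore the previous owner.
    st = os.stat(path)
    os.chown(path, uid, st.st_gid)
    try:
        return fn()
    finally:
        # Restore ownership even if fn() raised, so a failed hotplug
        # does not leave the node owned by a pool uid.
        os.chown(path, st.st_uid, st.st_gid)
```

Note that the try/finally only narrows the window: if noded itself dies
between the two chown() calls, the node is still left with a non-root
owner, which is exactly the risk described above.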
* For chroot, we have to create the device nodes inside the chroot
itself. Once KVM has opened the device node, we can remove it, for
security reasons. Linking the device node will not work, since name
resolution for symlinks is also restricted by the chroot. Again,
if the process is aborted halfway, the device node remains. We could
also do a bind mount of /dev inside the chroot, but this would defeat
the whole purpose of running inside a chroot in the first place.
Finally, we can combine both approaches: if a KVM process runs as
non-root inside a chroot, we have to mknod() and chown() a device node
inside the chroot, and remove it immediately afterwards.
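The combined mknod()/chown() approach might look roughly like this. All
names here are hypothetical; mknod() needs root (CAP_MKNOD), and real
code would additionally have to handle races and pre-existing nodes:

```python
import os


def chroot_device_path(chroot_dir, device):
    # The node must appear inside the chroot under the same path that
    # the monitor command will later reference.
    return os.path.join(chroot_dir, device.lstrip("/"))


def hotplug_into_chroot(chroot_dir, device, uid, plug_fn):
    # Sketch of the combined workaround: replicate the device node
    # inside the chroot, chown it to the pooled uid, let KVM open it
    # via plug_fn(), then remove the node again for security reasons.
    st = os.stat(device)
    inner = chroot_device_path(chroot_dir, device)
    parent = os.path.dirname(inner)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    os.mknod(inner, st.st_mode, st.st_rdev)  # needs CAP_MKNOD (root)
    os.chown(inner, uid, st.st_gid)
    try:
        plug_fn(device)  # e.g. issue drive_add over the monitor
    finally:
        os.unlink(inner)  # do not leave a device node in the chroot
```

As with the plain chown() variant, an aborted run between mknod() and
unlink() still leaves a stray node behind, so this only shrinks the
exposure window rather than eliminating it.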
We will extend the design doc to mention specifically that hotplugging
is incompatible with uid pooling or chroot, due to the way they work,
and also warn the user at runtime. We will also include a summary of the
options above, which we will implement in a later version.
How does all this sound?
Thanks,
Vangelis.
