On Thu, Jul 11, 2013 at 02:39:24pm +0200, Guido Trotter wrote:
> On Thu, Jul 11, 2013 at 2:31 PM, Vangelis Koukis <[email protected]> wrote:
> > On Thu, Jul 11, 2013 at 12:27:00pm +0200, Guido Trotter wrote:
> >> +list
> >>
> >>
> >> ---------- Forwarded message ----------
> >> From: Guido Trotter <[email protected]>
> >> Date: Thu, Jul 11, 2013 at 12:26 PM
> >> Subject: Re: [PATCH master] Add hotplug design doc
> >> To: Dimitris Aragiorgis <[email protected]>
> >>
> >>
> >> On Fri, Jul 5, 2013 at 7:48 PM, Dimitris Aragiorgis <[email protected]> 
> >> wrote:
> >> > + lists
> >> >
> >> > * Guido Trotter <[email protected]> [2013-07-05 12:57:33 +0200]:
> >> >
> >> >> On Fri, Jul 5, 2013 at 8:50 AM, Dimitris Aragiorgis <[email protected]> 
> >> >> wrote:
> >> >> > This is the design behind the first hotplug implementation
> >> >> > for the KVM hypervisor.
> >> >> >
> >> >> > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> >> >> > ---
> >> >> >
> >> >> > Hello team,
> >> >> >
> >> >> > This is the updated design doc for hotplug. It includes all
> >> >> > modifications/ suggestions that have been discussed in the last 
> >> >> > thread.
> >> >> > I will wait for your comments or eventually the final ACK so that I 
> >> >> > can
> >> >> > proceed with implementation patches.
> >> >> >
> >> >> > Thanks a lot,
> >> >> > dimara
> >> >> >
> >> >> >  Makefile.am            |    1 +
> >> >> >  doc/design-draft.rst   |    1 +
> >> >> >  doc/design-hotplug.rst |  222 ++++++++++++++++++++++++++++++++++++++++++++++++
> >> >> >  3 files changed, 224 insertions(+)
> >> >> >  create mode 100644 doc/design-hotplug.rst
> >> >> >
> >> >> > diff --git a/Makefile.am b/Makefile.am
> >> >> > index 91f3f37..fda6f58 100644
> >> >> > --- a/Makefile.am
> >> >> > +++ b/Makefile.am
> >> >> > @@ -435,6 +435,7 @@ docinput = \
> >> >> >         doc/design-cpu-pinning.rst \
> >> >> >         doc/design-device-uuid-name.rst \
> >> >> >         doc/design-draft.rst \
> >> >> > +       doc/design-hotplug.rst \
> >> >> >         doc/design-htools-2.3.rst \
> >> >> >         doc/design-http-server.rst \
> >> >> >         doc/design-impexp2.rst \
> >> >> > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> >> >> > index 0e454cd..4c1c692 100644
> >> >> > --- a/doc/design-draft.rst
> >> >> > +++ b/doc/design-draft.rst
> >> >> > @@ -20,6 +20,7 @@ Design document drafts
> >> >> >     design-internal-shutdown.rst
> >> >> >     design-glusterfs-ganeti-support.rst
> >> >> >     design-openvswitch.rst
> >> >> > +   design-hotplug.rst
> >> >> >
> >> >> >  .. vim: set textwidth=72 :
> >> >> >  .. Local Variables:
> >> >> > diff --git a/doc/design-hotplug.rst b/doc/design-hotplug.rst
> >> >> > new file mode 100644
> >> >> > index 0000000..ff4ff95
> >> >> > --- /dev/null
> >> >> > +++ b/doc/design-hotplug.rst
> >> >> > @@ -0,0 +1,222 @@
> >> >> > +=======
> >> >> > +Hotplug
> >> >> > +=======
> >> >> > +
> >> >> > +.. contents:: :depth: 4
> >> >> > +
> >> >> > +This is a design document detailing the implementation of device
> >> >> > +hotplugging in Ganeti. The logic used is hypervisor-agnostic, but
> >> >> > +the initial implementation will still target the KVM hypervisor.
> >> >> > +The implementation adds ``python-fdsend`` as a new dependency.
> >> >> > +
> >> >>
> >> >> Can you please specify, as we agreed, that python-fdsend is an
> >> >> optional dependency, and if not present Ganeti will still work, but
> >> >> hotplugging won't be possible?
> >> >>
> >> >
> >> > Yes, sure. Just like the affinity module. BTW, only NIC hotplug
> >> > depends on fdsend, so we could still support disk hotplug.
> >> >
> >> >> > +
> >> >> > +Current state and shortcomings
> >> >> > +==============================
> >> >> > +
> >> >> > +Currently, Ganeti supports addition/removal/modification of devices
> >> >> > +(NICs, Disks) but the actual modification takes place only after
> >> >> > +rebooting the instance. As a result, an instance cannot change
> >> >> > +network, get a new disk, etc. without a hard reboot.
> >> >> > +
> >> >> > +Until now, in the case of the KVM hypervisor, the code neither
> >> >> > +names devices nor places them in specific PCI slots. Devices are
> >> >> > +appended to the KVM command line and Ganeti lets KVM decide where
> >> >> > +to place them. This means that a device residing in PCI slot 5
> >> >> > +may, after a reboot (due to another device's removal), be moved to
> >> >> > +another PCI slot and probably get renamed too (due to udev rules,
> >> >> > +etc.).
> >> >> > +
> >> >> > +In order for migration to succeed, the process on the target node
> >> >> > +should be started with exactly the same machine version, CPU
> >> >> > +architecture and PCI configuration as the running process. During
> >> >> > +instance creation/startup Ganeti creates a KVM runtime file with
> >> >> > +all the information necessary to generate the KVM command. This
> >> >> > +runtime file is used during instance migration to start a new,
> >> >> > +identical KVM process. The current format includes the fixed part
> >> >> > +of the final KVM command, a list of NICs, and the hvparams dict.
> >> >> > +It does not allow easy manipulation of disks, because they are
> >> >> > +encapsulated in the fixed KVM command.
> >> >> > +
> >> >> > +Proposed changes
> >> >> > +================
> >> >> > +
> >> >> > +In the case of the KVM hypervisor, QEMU exposes 32 PCI slots to
> >> >> > +the instance. Disks and NICs occupy some of these slots. Recent
> >> >> > +versions of QEMU have introduced monitor commands that allow the
> >> >> > +addition/removal of PCI devices. Devices are referenced based on
> >> >> > +their name or position on the virtual PCI bus. To be able to use
> >> >> > +these commands, we need to be able to assign each device a unique
> >> >> > +name.
> >> >> > +
> >> >> > +To keep track of where each device is plugged in, we add a
> >> >> > +``pci`` slot to the Disk and NIC objects, but we save it only in
> >> >> > +runtime files, since it is hypervisor-specific info. This is added
> >> >> > +for easy object manipulation and is guaranteed not to be written
> >> >> > +back to the config.
> >> >> > +
> >> >> > +We propose to make use of QEMU 1.0 monitor commands so that
> >> >> > +modifications to devices take effect instantly, without the need
> >> >> > +for a hard reboot. The only change exposed to the end-user will be
> >> >> > +the addition of a ``--hotplug`` option to the ``gnt-instance
> >> >> > +modify`` command.
> >> >> > +
> >> >> > +Upon hotplugging, the PCI configuration of an instance is changed
> >> >> > +and the runtime files should be updated correspondingly. Currently
> >> >> > +this is impossible in the case of disk hotplug, because disks are
> >> >> > +included in the command-line entry of the runtime file, contrary
> >> >> > +to NICs, which are correctly treated separately. We change the
> >> >> > +format of runtime files: we remove disks from the fixed KVM
> >> >> > +command and create a new entry containing only them. The KVM
> >> >> > +options concerning disks are generated during
> >> >> > +``_ExecuteKVMCommand()``, just like for NICs.
> >> >> > +
> >> >> > +Design decisions
> >> >> > +================
> >> >> > +
> >> >> > +What should each device's ID be? Currently KVM does not support
> >> >> > +arbitrary IDs for devices; only names starting with a letter, at
> >> >> > +most 32 chars long, and including only the special chars '.' '_'
> >> >> > +'-' are supported. We use the device's PCI slot and name it
> >> >> > +<device type>-pci-<slot> (for debugging purposes we could add a
> >> >> > +part of the uuid as well).
> >> >>
> >> >> Didn't we decide for just <device-type>-<part-of-uuid>-<slot> ?
> >> >>
> >> >
> >> > Well, I did that so that the KVM command is readable and not full of
> >> > random numbers. Adding the part of the uuid is simple, just another
> >> > line of code and nothing more. OK, I'll change it to
> >> > <device-type>-<part-of-uuid>-<slot>.
> >> >
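[Editor's note: the ID constraints above are easy to enforce in code. A minimal sketch of the agreed <device-type>-<part-of-uuid>-<slot> scheme, with hypothetical helper names that are not part of the patch:]

```python
import re

# KVM device IDs: must start with a letter, be at most 32 chars long,
# and may contain only alphanumerics plus the chars '.', '_' and '-'.
_KVM_ID_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9._-]{0,31}$")

def kvm_device_id(dev_type, uuid, slot):
    """Build an ID like 'nic-d2ffbc05-6' (hypothetical helper)."""
    dev_id = "%s-%s-%d" % (dev_type, uuid.split("-")[0], slot)
    if not _KVM_ID_RE.match(dev_id):
        raise ValueError("invalid KVM device id: %s" % dev_id)
    return dev_id
```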
> >> >> > +
> >> >> > +Who decides where to hotplug each device? Since this is a
> >> >> > +hypervisor-specific matter, there is no point in having the master
> >> >> > +node decide such a thing. The master node just has to request that
> >> >> > +noded hotplug a device. To this end, hypervisor-specific code
> >> >> > +should parse the current PCI configuration (i.e. the ``info pci``
> >> >> > +QEMU monitor command), find the first available slot and hotplug
> >> >> > +the device. By having noded decide where to hotplug a device, we
> >> >> > +ensure that no error will occur due to duplicate slot assignment
> >> >> > +(if masterd kept track of PCI reservations and noded failed to
> >> >> > +return the PCI slot that a device was plugged into, then the next
> >> >> > +hotplug would fail).
> >> >> > +
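[Editor's note: to illustrate the slot-picking step above, a rough sketch of how noded could find the first free slot from ``info pci`` output. The line format assumed here ("Bus 0, device N, function 0") would need verification against the actual QEMU version; the function name is made up.]

```python
import re

# Assumed line shape in "info pci" output: "  Bus  0, device   3, function 0:"
_SLOT_RE = re.compile(r"Bus\s+\d+,\s+device\s+(\d+),\s+function")

def first_free_pci_slot(info_pci_output, max_slots=32):
    """Return the lowest PCI slot not listed in the monitor output."""
    used = set(int(s) for s in _SLOT_RE.findall(info_pci_output))
    for slot in range(max_slots):
        if slot not in used:
            return slot
    raise RuntimeError("all %d PCI slots are occupied" % max_slots)
```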
> >> >> > +Where should we keep track of devices' PCI slots? As already
> >> >> > +mentioned, we must keep track of devices' PCI slots to
> >> >> > +successfully migrate instances. The first option is to save this
> >> >> > +info in config data, which would allow us to place each device at
> >> >> > +the same PCI slot after a reboot. This would require making the
> >> >> > +hypervisor return the PCI slot chosen for each device, and storing
> >> >> > +this information in config data. Additionally, the whole instance
> >> >> > +configuration would have to be returned with PCI slots filled in
> >> >> > +after instance start, and each instance would have to keep track
> >> >> > +of its current PCI reservations. We decide not to go in this
> >> >> > +direction, in order to keep things simple and not add
> >> >> > +hypervisor-specific info to configuration data
> >> >> > +(``pci_reservations`` at instance level and ``pci`` at device
> >> >> > +level). For the aforementioned reasons, we decide to store this
> >> >> > +info only in KVM runtime files.
> >> >> > +
> >> >> > +Where should the devices be placed upon instance startup? QEMU
> >> >> > +has 4 pre-occupied PCI slots by default, so the hypervisor can use
> >> >> > +the remaining ones for disks and NICs. Currently, the PCI
> >> >> > +configuration is not preserved after a reboot. Each time an
> >> >> > +instance starts, KVM assigns PCI slots to devices based on their
> >> >> > +ordering in the Ganeti configuration, i.e. the second disk will be
> >> >> > +placed after the first, the third NIC after the second, etc. Since
> >> >> > +we decided that there is no need to keep track of devices' PCI
> >> >> > +slots, there is no need to change the current functionality.
> >> >> > +
> >> >> > +How do we deal with existing instances? Hotplug depends on
> >> >> > +runtime file manipulation: the runtime file stores PCI info along
> >> >> > +with every device the KVM process is currently using. Existing
> >> >> > +files have no PCI info for devices and have block devices
> >> >> > +encapsulated inside the kvm_cmd entry. Thus, hotplugging of
> >> >> > +existing devices will not be possible, but migration and
> >> >> > +hotplugging of new devices will still succeed. The workaround
> >> >> > +happens upon loading the KVM runtime: if we detect the old-style
> >> >> > +format, we add an empty list for block devices, and upon saving
> >> >> > +the KVM runtime we include this empty list as well. Switching
> >> >> > +entirely to the new format happens upon instance reboot.
> >> >> > +
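[Editor's note: the upgrade-on-load workaround described above can be sketched as follows. The runtime layout used here (a JSON list of [kvm_cmd, nics, hvparams], with a fourth disk entry in the new format) is an assumption for illustration, not the exact Ganeti serialization.]

```python
import json

def load_kvm_runtime(serialized):
    """Load a KVM runtime file, upgrading old-style content on the fly.

    Assumed layout: old files hold [kvm_cmd, nics, hvparams]; new
    files append a fourth entry holding the disk list.  Old files get
    an empty disk list, so hotplug of pre-existing disks stays
    unsupported until the instance is rebooted under the new format."""
    loaded = json.loads(serialized)
    if len(loaded) == 3:  # old-style runtime: no separate disk entry
        loaded.append([])
    kvm_cmd, nics, hvparams, disks = loaded
    return kvm_cmd, nics, hvparams, disks
```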
> >> >> > +
> >> >> > +Configuration changes
> >> >> > +---------------------
> >> >> > +
> >> >> > +The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It
> >> >> > +refers to the PCI slot that the device gets plugged into.
> >> >> > +
> >> >> > +In order to be able to live migrate successfully, runtime files
> >> >> > +should be updated every time a live modification (hotplug) takes
> >> >> > +place. To this end we change the format of runtime files. The KVM
> >> >> > +options referring to the instance's disks are no longer recorded
> >> >> > +as part of the KVM command line. Disks are treated separately,
> >> >> > +just as we treat NICs right now. We insert and remove entries to
> >> >> > +reflect the current PCI configuration.
> >> >> > +
> >> >> > +
> >> >> > +Backend changes
> >> >> > +---------------
> >> >> > +
> >> >> > +Introduce one new RPC call:
> >> >> > +
> >> >> > +- hotplug_device(DEVICE_TYPE, ACTION, device, ...)
> >> >> > +
> >> >> > +where DEVICE_TYPE can be either NIC or Disk, and ACTION either
> >> >> > +REMOVE or ADD.
> >> >> > +
> >> >> > +Hypervisor changes
> >> >> > +------------------
> >> >> > +
> >> >> > +We implement hotplug on top of the KVM hypervisor. We take
> >> >> > +advantage of the QEMU 1.0 monitor commands (``device_add``,
> >> >> > +``device_del``, ``drive_add``, ``drive_del``, ``netdev_add``,
> >> >> > +``netdev_del``). QEMU refers to devices by their id; we use the
> >> >> > +``uuid`` to name them properly. If a device is about to be
> >> >> > +hotplugged, we parse the output of ``info pci`` and find the
> >> >> > +occupied PCI slots. We choose the first available one, and the
> >> >> > +whole device object is appended to the corresponding entry in the
> >> >> > +runtime file.
> >> >> > +
> >> >> > +Concerning NIC handling, we build on top of the existing logic
> >> >> > +(first create a tap with _OpenTap() and then pass its file
> >> >> > +descriptor to the KVM process). To this end we need to pass access
> >> >> > +rights to the corresponding file descriptor over the monitor
> >> >> > +socket (a UNIX domain socket). The open file is passed as a
> >> >> > +socket-level control message (SCM_RIGHTS), using the ``fdsend``
> >> >> > +python library.
> >> >> > +
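[Editor's note: the SCM_RIGHTS mechanism itself can be demonstrated with the stdlib alone; ``fdsend`` wraps the same primitive. A sketch, not the actual Ganeti code, using ``socket.send_fds``/``recv_fds`` (Python 3.9+):]

```python
import os
import socket

def pass_fd(sock, fd, payload=b"getfd"):
    """Send an open file descriptor over a UNIX socket as an
    SCM_RIGHTS control message (what fdsend does under the hood)."""
    socket.send_fds(sock, [payload], [fd])

def receive_fd(sock):
    """Receive one descriptor from the peer; returns (payload, fd)."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return msg, fds[0]
```

In Ganeti's case the receiving end would be the QEMU process listening on its monitor socket, with the payload carrying the monitor command that names the descriptor.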
> >> >> > +
> >> >> > +User interface
> >> >> > +--------------
> >> >> > +
> >> >> > +A new ``--hotplug`` option to ``gnt-instance modify`` is
> >> >> > +introduced, which forces live modifications.
> >> >> > +
> >> >> > +
> >> >> > +Enabling hotplug
> >> >> > +++++++++++++++++
> >> >> > +
> >> >> > +Hotplug will be optional during gnt-instance modify. For existing
> >> >> > +instances, after installing a version that supports hotplugging,
> >> >> > +we have the restriction that hotplug will not be supported for
> >> >> > +existing devices. The reason is that old runtime files lack:
> >> >> > +
> >> >> > +1. Device pci configuration info.
> >> >> > +
> >> >> > +2. Separate block device entry.
> >> >> > +
> >> >> > +Hotplug will be supported only for KVM in the first
> >> >> > +implementation. For all other hypervisors, the backend will raise
> >> >> > +an exception in case hotplug is requested.
> >> >> > +
> >> >> > +
> >> >> > +NIC hotplug
> >> >> > ++++++++++++
> >> >> > +
> >> >> > +The user can add/modify/remove NICs either with hotplugging or
> >> >> > +not. If a NIC is to be added, a tap is created first and
> >> >> > +configured properly with the kvm-vif-bridge script. Then the
> >> >> > +instance gets a new network interface. Since there is no QEMU
> >> >> > +monitor command to modify a NIC, we modify a NIC by temporarily
> >> >> > +removing the existing one and adding a new one with the new
> >> >> > +configuration. When removing a NIC, the corresponding tap gets
> >> >> > +removed as well.
> >> >> > +
> >> >>
> >> >> Please specify that this (modify as add/remove) is a potentially
> >> >> dangerous operation and there will be warnings.
> >> >>
> >> >
> >> > I will handle it just like we handle migrations: on the client side,
> >> > add a "BIG WARNING. Continue? [y/N]" prompt.
> >> >
> >> >> > +::
> >> >> > +
> >> >> > + gnt-instance modify --net add --hotplug test
> >> >> > + gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test
> >> >> > + gnt-instance modify --net 1:remove --hotplug test
> >> >> > +
> >> >> > +
> >> >> > +Disk hotplug
> >> >> > +++++++++++++
> >> >> > +
> >> >> > +The user can add and remove disks with hotplugging or not. The
> >> >> > +QEMU monitor supports resizing of disks; however, the initial
> >> >> > +implementation will support only disk addition/deletion.
> >> >> > +
> >> >> > +::
> >> >> > +
> >> >> > + gnt-instance modify --disk add:size=1G --hotplug test
> >> >> > + gnt-instance modify --disk 1:remove --hotplug test
> >> >> > +
> >> >> > +.. vim: set textwidth=72 :
> >> >> > +.. Local Variables:
> >> >> > +.. mode: rst
> >> >> > +.. fill-column: 72
> >> >> > +.. End:
> >> >>
> >> >> Please finally specify the status about supporting non-root and chroot
> >> >> with hotplug.
> >> >> Will this work from the first version?
> >> >>
> >> >
> >> > Well, after testing and digging (most of it done by psomas [cc]) we
> >> > can state the following:
> >> >
> >> > - nic hotplugging succeeds both with uid pools and chroot.
> >> > - disk hotplugging will fail. For those cases I propose that we check
> >> > hvparams at the hypervisor level, and if security_model is other than
> >> > SM_NONE or if use_chroot is True, just report a warning and continue.
> >> > The device will be available after reboot.
> >> >
> >> > KVM 1.2 or 1.3 introduced the add-fd command, which may solve the
> >> > problem, but I haven't tested it at all. Debian jessie still has
> >> > 1.1.2, so there is no reason to hurry.
> >> >
> >> > Are you OK with the above? If yes, I could send you a design doc
> >> > interdiff and then the rest of the patches.
> >> >
> >>
> >> All sounds good, with one note: there were reports that it was
> >> possible to do hotplug of disks inside a chroot by linking or creating
> >> the device there, hotplugging it, and then removing the device. Would
> >> it make sense to do this? (Or say that it's going to be done in the
> >> next version.)
> >>
> >> Thanks,
> >>
> >> Guido
> >>
> >
> > Hello Guido,
> >
> > we've actually had an extensive discussion internally about the problems
> > with device hotplugging when uid pooling, chroot or both are enabled.
> >
> 
> Thanks for the extra information.
> 
> > These problems do not have anything to do with hotplugging per se,
> > they're a direct result of the way uid pooling and chroot work.
> >
> > I'll try to summarize the problems here and provide a few approaches we
> > want to explore. The summary: We will extend the design doc so that it
> > mentions that hotplugging is not supported when KVM runs in a chroot, or
> > when the security model is not None. We will also incorporate a
> > discussion of possible solutions, as outlined below, but I think it'd be
> > best not to commit to a specific solution yet ("it's going to be done
> > <this way> in the next version").
> >
> 
> Yes, please add the content below to the design doc, as it's going to
> be useful for the future.
> 
> > The problems:
> > * uid pool: KVM starts as root, opens e.g.,
> > /var/run/ganeti/instance-disks/snf-40416:0 which points to /dev/drbd3,
> > which is owned by root:disk, then it setuid()s itself to user 'pool152'.
> > After a while, we ask the KVM process to hotplug a new disk, e.g.,
> > /var/run/ganeti/instance-disks/snf-40416:1 -> /dev/drbd4,
> > which is owned by root:disk. KVM cannot open the block device, because
> > it no longer runs as root.
> >
> > * chroot: same as above; this time KVM chroot()s itself to the empty
> > directory /var/run/chroot-hypervisor after initialization. We ask it to 
> > hotplug
> > /var/run/ganeti/instance-disks/snf-40416:1 -> /dev/drbd4
> > and it fails, because there is no such device node in the chroot
> > directory.
> >
> > Possible solutions:
> > * Latest KVM supports passing file descriptors through the monitor UNIX
> > domain socket. This way, the privileged Ganeti noded can open the
> > block device itself, then pass the open file descriptor through the
> > monitor socket to the chrooted KVM process which runs under a pooled
> > uid. This will not work with earlier KVM versions, e.g., the one in
> > Debian Jessie.
> >
> 
> No problem at all, we should still start supporting it as we can, as
> there will definitely be backports for wheezy, or people using it on
> other distros. Plus with our "use feature when detected" method it
> should be easy to use it "when it's there"
> 
> > * For uid pool, we can have noded chown() the block device to the
> > corresponding uid, so the KVM process can open() it successfully.
> > Once KVM has opened the device, we can chown() the device node back to
> > its previous owner, presumably root. This has the disadvantage that
> > should the operation be aborted (e.g., noded dies), the device node
> > will be left with a non-root owner, which could be a security risk.
> >
> 
> Ack, indeed. Perhaps this behavior could be used as a fallback *if*
> the above doesn't work. Or even not supported at all, and we could say
> "hotplug+uid pool only works with the right version of qemu"
> 
> > Another workaround could be chown()ing the device node in a pre hook,
> > but this is a very ugly hack that has all of the disadvantages of the
> > above solution, without any of the advantages.
> >
> > * For chroot, we have to create the device nodes inside the chroot
> > itself. Once KVM has opened the device node, we can remove it, for
> > security reasons. Linking the device node will not work, since name
> > resolution for symlinks is also restricted by the chroot. Again,
> > if the process is aborted half way, the device node remains. We could
> > also do a bind mount of /dev inside the chroot, but this would defeat
> > the whole purpose of running inside a chroot in the first place.
> 
> Indeed, it should be just "create" and then "delete". Of course in
> this case and the one above cluster verify should check whether there
> are leftovers.
> 
> >
> > Finally, we can combine both approaches: If a KVM process runs as
> > non-root inside a chroot, we have to mknod() and chown() a device node
> > inside the chroot, and remove it immediately afterwards.
> >
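[Editor's note: a sketch of what that combined mknod()/chown() workaround could look like; all names are hypothetical, and the mknod()/chown() calls require root:]

```python
import os
import stat

def chroot_node_path(chroot_dir, dev_path):
    """Path of the temporary device node inside the chroot."""
    return os.path.join(chroot_dir, os.path.basename(dev_path))

def expose_device_in_chroot(chroot_dir, dev_path, uid, gid):
    """Recreate a block device node inside the chroot and chown it to
    the pooled uid, so the confined KVM process can open() it.  The
    caller must unlink the node as soon as KVM holds the descriptor,
    to avoid leaving it behind if the operation is aborted."""
    st = os.stat(dev_path)
    node = chroot_node_path(chroot_dir, dev_path)
    os.mknod(node, stat.S_IFBLK | 0o600, st.st_rdev)  # requires root
    os.chown(node, uid, gid)
    return node
```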
> > We will extend the design doc to mention specifically that hotplugging
> > is incompatible with uid pooling or chroot, due to the way they work,
> > and also warn the user at runtime. We will also include a summary of the
> > options above, which we will implement in a later version.
> >
> > How does all this sound?
> >
> 
> Sounds perfect. I'd be even ok with just the fd version being
> implemented, saying that we depend on a newer version of qemu: nobody
> says we have to support all possible features on plain wheezy.
> 
> Thanks,
> 
> Guido

[+list]

Thanks Guido,

We will send a final update to the design doc right away, then patches
for hotplug support. We'll continue with implementing the fd version,
and the mknod()/chown() version as a fallback in case of older qemu.

Vangelis.
