* Guido Trotter <[email protected]> [2013-05-24 17:01:29 +0200]:

> +list, sorry for dropping it.
> 
> 
> On Fri, May 24, 2013 at 2:34 PM, Dimitris Aragiorgis <[email protected]>wrote:
> 
> > Hi.
> >
> > Comments inline.
> >
> > * Guido Trotter <[email protected]> [2013-05-21 15:08:56 +0200]:
> >
> > > On Sat, May 18, 2013 at 3:25 AM, Dimitris Aragiorgis <[email protected]
> > >wrote:
> > >
> > > > Hi,
> > > >
> > > > since device UUIDs have been merged upstream, I think we could
> > > > reheat the conversation about hotplug. This is the updated
> > > > design doc. The implementation is ready but it is wiser to
> > > > agree on the design first and make any changes if needed.
> > > >
> > > > Looking forward to your feedback.
> > > >
> > > > Thanks in advance,
> > > > dimara
> > > >
> > > > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > > > ---
> > > >  Makefile.am            |    1 +
> > > >  doc/design-draft.rst   |    1 +
> > > >  doc/design-hotplug.rst |  220 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 222 insertions(+)
> > > >  create mode 100644 doc/design-hotplug.rst
> > > >
> > > > diff --git a/Makefile.am b/Makefile.am
> > > > index 037cf53..155577d 100644
> > > > --- a/Makefile.am
> > > > +++ b/Makefile.am
> > > > @@ -407,6 +407,7 @@ docinput = \
> > > >         doc/design-cpu-pinning.rst \
> > > >         doc/design-device-uuid-name.rst \
> > > >         doc/design-draft.rst \
> > > > +       doc/design-hotplug.rst \
> > > >         doc/design-htools-2.3.rst \
> > > >         doc/design-http-server.rst \
> > > >         doc/design-impexp2.rst \
> > > > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > > > index ccb2f93..3ec9c46 100644
> > > > --- a/doc/design-draft.rst
> > > > +++ b/doc/design-draft.rst
> > > > @@ -19,6 +19,7 @@ Design document drafts
> > > >     design-storagetypes.rst
> > > >     design-reason-trail.rst
> > > >     design-device-uuid-name.rst
> > > > +   design-hotplug.rst
> > > >
> > > >  .. vim: set textwidth=72 :
> > > >  .. Local Variables:
> > > > diff --git a/doc/design-hotplug.rst b/doc/design-hotplug.rst
> > > > new file mode 100644
> > > > index 0000000..d34179f
> > > > --- /dev/null
> > > > +++ b/doc/design-hotplug.rst
> > > > @@ -0,0 +1,220 @@
> > > > +=======
> > > > +Hotplug
> > > > +=======
> > > > +
> > > > +.. contents:: :depth: 4
> > > > +
> > > > +This is a design document detailing the implementation of device
> > > > +hotplugging in Ganeti. The logic used is hypervisor-agnostic, but the
> > > > +initial implementation will target the KVM hypervisor. The
> > > > +implementation adds ``python-fdsend`` as a new dependency.
> > > > +
> > > > +
> > > > +Current state and shortcomings
> > > > +==============================
> > > > +
> > > > +Currently, Ganeti supports addition/removal/modification of devices
> > > > +(NICs, Disks), but the actual modification takes place only after
> > > > +rebooting the instance. As a result, an instance cannot change its
> > > > +network, get a new disk, etc. without a hard reboot.
> > > > +
> > > > +Until now, in the case of the KVM hypervisor, the code neither names
> > > > +devices nor places them in specific PCI slots. Devices are appended to
> > > > +the KVM command line and Ganeti lets KVM decide where to place them.
> > > > +This means that a device residing in PCI slot 5 may, after a reboot
> > > > +(due to another device's removal), be moved to another PCI slot and
> > > > +probably get renamed too (due to udev rules, etc.).
> > > > +
> > > > +In order for migration to succeed, the process on the target node
> > > > +should be started with exactly the same machine version, CPU
> > > > +architecture and PCI configuration as the running process. During
> > > > +instance creation/startup Ganeti creates a KVM runtime file with all
> > > > +the necessary information to generate the KVM command. This runtime
> > > > +file is used during instance migration to start a new, identical KVM
> > > > +process. The current format includes the fixed part of the final KVM
> > > > +command, a list of NICs, and the hvparams dict. It does not allow easy
> > > > +manipulation of disks, because they are encapsulated in the fixed KVM
> > > > +command.
> > > > +
> > > > +Proposed changes
> > > > +================
> > > > +
> > > > +For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the
> > > > +instance. Disks and NICs occupy some of these slots. Recent versions
> > > > +of QEMU have introduced monitor commands that allow addition/removal
> > > > +of PCI devices. Devices are referenced based on their name or position
> > > > +on the virtual PCI bus. To be able to use these commands, we need to
> > > > +be able to assign each device a unique name. For that we build on top
> > > > +of the devices' UUIDs.
> > > > +
> > > > +Finally, to keep track of where each device is plugged in, we add a
> > > > +``pci`` slot to the Disk and NIC objects, but we save it only in
> > > > +runtime files, since it is hypervisor-specific info.
> > > > +
> > > >
> > >
> > > What is the purpose of a "pci" slot on disk and nic *objects* if these
> > > are never saved in the config, but only kept in KVM-specific runtime
> > > files?
> > >
> >
> > In order to easily create NIC and Disk objects from serialized (json) data.
> >
> 
> Nack on this one. We already have horrible hacks there (eg. the logical_id
> and physical_id that make sense only at different points) and this is going
> to make it worse, with something that doesn't make sense at all at the
> config/rpc level but just in the backend.
> We should just have the backend's own object that contains a disk and all
> the extra backend info, that can be serialized and deserialized just in
> noded without changing the frontend.
> 
> 

Currently runtime files contain [kvm_cmd, [n.ToDict() for n in nics], hv_params].
I propose [kvm_cmd, [n.ToDict() for n in nics], hv_params,
[(b.ToDict(), link) for b in block_devices]].
For this to happen we need the NIC and Disk objects to have a pci slot that
exists only in the backend. Otherwise we have to complicate things and save
tuples of (object.ToDict(), pci). I really cannot see why we shouldn't go for
the simple way. The pci slot is not propagated back to the master.
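
To make the proposal concrete, here is a minimal sketch of the round-trip
(names and layout are illustrative, not the final Ganeti code):

```python
import json

def serialize_runtime(kvm_cmd, nic_dicts, hvparams, disk_entries):
    # disk_entries is a list of (disk_dict, link) pairs; each disk_dict
    # carries the backend-only "pci" slot so that migration can rebuild
    # the exact PCI layout on the target node.
    return json.dumps([kvm_cmd, nic_dicts, hvparams, disk_entries])

def load_runtime(data):
    kvm_cmd, nic_dicts, hvparams, disk_entries = json.loads(data)
    # JSON turns tuples into lists, so normalize back to pairs.
    return kvm_cmd, nic_dicts, hvparams, [tuple(e) for e in disk_entries]

raw = serialize_runtime(
    ["kvm", "-m", "512"],
    [{"mac": "aa:00:00:55:44:33", "pci": 6}],
    {"acpi": True},
    [({"size": 1024, "pci": 5}, "/dev/disk/by-uuid/x9e7a3f24")])
cmd, nics, hvp, disks = load_runtime(raw)
```

The point is that the disks entry is manipulated exactly like the NICs one,
instead of being baked into the fixed command line.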


> >
> > > > +We propose to make use of QEMU 1.0 monitor commands so that
> > > > +modifications to devices take effect instantly, without the need for
> > > > +a hard reboot. The only change exposed to the end-user will be the
> > > > +addition of a ``--hotplug`` option to the ``gnt-instance modify``
> > > > +command.
> > > > +
> > > > +Upon hotplugging, the PCI configuration of an instance is changed.
> > > > +Runtime files should be updated correspondingly. Currently this is
> > > > +impossible in the case of disk hotplug, because disks are included in
> > > > +the command-line entry of the runtime file, contrary to NICs, which
> > > > +are correctly treated separately. We change the format of runtime
> > > > +files: we remove disks from the fixed KVM command and create a new
> > > > +entry containing them only. The KVM options concerning disks are
> > > > +generated during ``_ExecuteKVMCommand()``, just like for NICs.
> > > > +
> > > > +Design decisions
> > > > +================
> > > > +
> > > > +What should each device's ID be? Currently KVM does not support
> > > > +arbitrary IDs for devices; supported are only names starting with a
> > > > +letter, of at most 32 chars, including only the '.' and '_' special
> > > > +chars. To this end we use a helper function that converts a UUID to
> > > > +an accepted device id. For the sake of simplicity and readability we
> > > > +use the part of the UUID up to the first dash and insert an "x" at
> > > > +the beginning of the string.
> > > > +
> > > >
> > >
> > > How do we guarantee that no two devices will have the same name? We
> > > should double-check this, to be sure there are no conflicts.
> > >
> > >
> >
> > Well, we actually don't. Trimming the system-generated uuid raises our
> > chances of duplicate IDs. If there is a way to generate a uuid of at most
> > 32 chars starting with a letter, based on uuids of the
> > /proc/sys/kernel/random/uuid type, then it would be great. Until we find
> > one, we could disable hotplug for devices that have a conflicting uuid
> > prefix.
> >
> 
> That's going to be a problem, as it will generate errors for no reason.
> (sorry, you can't hotplug your device, because you were unlucky in the uuid
> roulette. please try again being luckier)
> 
> How about using some information that is specific to the instance and can't
> change? eg.
> 
> x<pci-id>_<start-of-uuid>
> or

Well, it seems that if we obtain the device's PCI slot before naming it, we
could definitely use it as an identifier. In that case, UUIDs are not a
prerequisite. And it seems doable. Good point.
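
A sketch of the resulting naming helper, following the x<pci-id>_<start-of-uuid>
suggestion (the exact format is still open; this is illustrative only):

```python
import re

# KVM constraint discussed above: ids start with a letter, are at most
# 32 chars, and may contain only alphanumerics plus '.' and '_'.
_VALID_KVM_ID = re.compile(r"^[a-zA-Z][a-zA-Z0-9._]{0,31}$")

def kvm_device_id(pci_slot, uuid):
    """Build a unique KVM device id from the PCI slot and the UUID prefix."""
    dev_id = "x%d_%s" % (pci_slot, uuid.split("-")[0])
    if not _VALID_KVM_ID.match(dev_id):
        raise ValueError("invalid KVM device id: %s" % dev_id)
    return dev_id
```

Since the PCI slot is unique within a running instance, two devices cannot
collide even if their UUID prefixes happen to match.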

> x<arbitrary-integer>_<start-of-uuid> (where the arbitrary integer is just
> used for deduplication)
> 
> 
> 
> >
> > >
> > > > +Who decides where to hotplug each device? Since this is a
> > > > +hypervisor-specific matter, there is no point in the master node
> > > > +deciding such a thing. The master node just has to request that noded
> > > > +hotplug a device. To this end, hypervisor-specific code should parse
> > > > +the current PCI configuration (i.e. the ``info pci`` QEMU monitor
> > > > +command), find the first available slot and hotplug the device there.
> > > > +By having noded decide where to hotplug a device, we ensure that no
> > > > +error will occur due to duplicate slot assignment (if masterd kept
> > > > +track of PCI reservations and noded failed to return the PCI slot a
> > > > +device was plugged into, the next hotplug would fail).
> > > > +
> > > >
> > >
> > > Ack.
> > >
> > >
> > > > +Where should we keep track of devices' PCI slots? As already
> > > > +mentioned, we must keep track of devices' PCI slots to successfully
> > > > +migrate instances. The first option is to save this info in the
> > > > +config data, which would allow us to place each device at the same
> > > > +PCI slot after a reboot. This would require making the hypervisor
> > > > +return the PCI slot chosen for each device, and storing this
> > > > +information in the config data. Additionally, the whole instance
> > > > +configuration would have to be returned with PCI slots filled in
> > > > +after instance start, and each instance would have to keep track of
> > > > +its current PCI reservations. We decide not to go in this direction,
> > > > +in order to keep things simple and not add hypervisor-specific info
> > > > +to the configuration data (``pci_reservations`` at instance level and
> > > > +``pci`` at device level). For the aforementioned reasons, we decide
> > > > +to store this info only in the KVM runtime files.
> > > > +
> > > >
> > >
> > > Ack. qq: do we need to store them at all, or could we get them from
> > > kvm on the source machine, and use them on the destination machine,
> > > upon migration?
> > >
> > >
> >
> > Well, the PCI reservations are actually needed upon instance startup as
> > well. Let's say we have an instance with pci slots 5, 6, 8, 10 reserved
> > and the corresponding device objects with their pci slot up-to-date. We
> > bring this machine down. We add another device. Upon instance startup,
> > how do we know where to place that device? The QEMU monitor cannot be
> > asked because the instance is down. Before hotplug, devices were placed
> > automatically by QEMU, first the disks, then the NICs. This resulted in
> > pci renumbering.
> >
> >
> Yes. But there are two different problems addressed by this design then:
> 1) hotplug
> 2) pci renumbering
> 
> If you want to solve (2) across instance reboots you can't do it at the
> hypervisor level, as when the instance is down these data would not be
> saved. Even if it was then it would not survive a failover, because the
> original node would be likely down.
> 
> So you have two options:
> 
> 1) You accept that pci renumbering is going to happen at each instance start
> 2) You save&track pci numbers in the master
> 
> At the hypervisor level you can only guarantee the data until the instance
> is up, nothing more.
> 

You are right. Since we don't want to save hypervisor-specific info in the
master, pci renumbering is going to happen on hard reboots.

> 
> > > > +Where to place the devices upon instance startup? QEMU has by
> > > > +default 4 pre-occupied PCI slots, so the hypervisor can use the
> > > > +remaining ones for disks and NICs. Currently, the PCI configuration
> > > > +is not preserved after reboot. Each time an instance starts, KVM
> > > > +assigns PCI slots to devices based on their ordering in the Ganeti
> > > > +configuration, i.e. the second disk will be placed after the first,
> > > > +the third NIC after the second, etc. Since we decided that there is
> > > > +no need to keep track of devices' PCI slots, there is no need to
> > > > +change the current functionality.
> > > > +
> > > > +How to deal with existing instances? Migration and hotplugging of
> > > > +existing instances in a cluster, after installing the version that
> > > > +supports hotplug, will not be supported until a specific script is
> > > > +run on all nodes that adds a new entry for the block devices
> > > > +(e.g. []) to the existing runtime files, or until the instances
> > > > +suffer a hard reboot.
> > > >
> > >
> > > Ouch. Can we support this better? :) If we go for "read current values
> > > from the qemu monitor" rather than "store this info outside" we get
> > > this magically working, no?
> > >
> > >
> >
> > Got lost :) What is "this"? "Read current values from the qemu monitor"
> > works only if the instance is running. See my previous comment.
> >
> 
> Ok, let's have this discussion in that part of the thread then.
> 
> 
> > Existing instances do not have a pci configuration. Still they have the
> > old runtime file format (that does not treat disks as a different
> > entry).  This is the reason migration will fail.
> >
> >
> Just saying that we should make this work somehow. But again, it doesn't
> make sense to discuss it until we have answers to that part, so let's pause
> this part of the conversation for a minute.
> 
> 
> > > > +Configuration changes
> > > > +---------------------
> > > > +
> > > > +The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It
> > > > +refers to the PCI slot that the device gets plugged into.
> > > > +
> > > >
> > >
> > > Why? You seem not to be using this anyway.
> > >
> >
> > Explained before.
> >
> > >
> > > > +In order to be able to live-migrate successfully, runtime files
> > > > +should be updated every time a live modification (hotplug) takes
> > > > +place. To this end we change the format of runtime files. The KVM
> > > > +options referring to the instance's disks are no longer recorded as
> > > > +part of the KVM command line. Disks are treated separately, just as
> > > > +we treat NICs right now. We insert and remove entries to reflect the
> > > > +current PCI configuration.
> > > >
> > >
> > > Ack.
> > >
> > >
> > > > +
> > > > +
> > > > +Backend changes
> > > > +---------------
> > > > +
> > > > +Introduce 4 new RPC calls:
> > > > +
> > > > +- hot_add_nic
> > > > +- hot_del_nic
> > > > +- hot_add_disk
> > > > +- hot_del_disk
> > > > +
> > > >
> > >
> > > Does it make sense to have one call for device type? Could we not use one
> > > call for both?
> >
> > Well, I implemented it this way so that it is more straightforward. We
> > could have hotplug_device(device, action) and act depending on the
> > device type (NIC, Disk) and the action (ADD, REMOVE).
> >
> >
> Ack, let's go for this route please, at least for the device type. This
> will make life easier if we ever have to hotplug a CPU, e.g. :)
> 
> 

Right.
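
For reference, the merged call could look roughly like this (a sketch with a
stand-in hypervisor object; the real method names on the KVM hypervisor class
are still to be defined):

```python
# Constants mirroring the device types and actions discussed above.
NIC, DISK = "NIC", "DISK"
ADD, REMOVE = "ADD", "REMOVE"

def hotplug_device(hv, instance, device, dev_type, action):
    """Single RPC entry point replacing the four hot_{add,del}_{nic,disk}
    calls; dispatches on (device type, action)."""
    dispatch = {
        (NIC, ADD): hv.HotAddNic,
        (NIC, REMOVE): hv.HotDelNic,
        (DISK, ADD): hv.HotAddDisk,
        (DISK, REMOVE): hv.HotDelDisk,
    }
    try:
        fn = dispatch[(dev_type, action)]
    except KeyError:
        raise ValueError("unsupported hotplug: %s %s" % (dev_type, action))
    return fn(instance, device)

class FakeKVMHypervisor:
    """Recording stand-in; the real class would live in hv_kvm."""
    def __init__(self):
        self.calls = []
    def HotAddNic(self, instance, device):
        self.calls.append(("add_nic", instance))
    def HotDelNic(self, instance, device):
        self.calls.append(("del_nic", instance))
    def HotAddDisk(self, instance, device):
        self.calls.append(("add_disk", instance))
    def HotDelDisk(self, instance, device):
        self.calls.append(("del_disk", instance))
```

Adding a new device type later (e.g. CPU) then only extends the dispatch
table instead of adding two more RPCs.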

> > > As for del, do we have any concerns? (This can easily break an
> > > instance I suppose, am I wrong? Can it fail if the device is in use?
> > > How do we handle these cases?)
> > >
> >
> > Unplugging a device has concerns, but there are cases where we want this
> > to happen: disabling an instance's networking, renumbering an instance's
> > IP, removing a disk (like unplugging a USB drive), etc. With the
> > existing implementation, when we remove a disk from a running instance
> > it fails, because the disk is in use by QEMU. With hotplug we
> > hot_del_disk, then _ShutdownInstanceDisks(), then _AnnotateDiskParams()
> > and then blockdev_remove().
> >
> >
> ACK, as long as there are enough scary warnings.
> 
> 

I'll do my best :)
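
To make the ordering explicit, a sketch with injected callbacks standing in
for the real helpers named above (their actual signatures differ):

```python
def remove_disk_live(instance, disk, hot_del, shutdown, annotate,
                     blockdev_remove):
    """Live disk removal: detach from the running KVM process first, so
    QEMU releases the block device, then tear down the backing storage."""
    hot_del(instance, disk)          # QEMU monitor: device_del/drive_del
    shutdown(instance, [disk])       # e.g. _ShutdownInstanceDisks
    anno = annotate(instance, disk)  # e.g. _AnnotateDiskParams
    blockdev_remove(anno)            # finally drop the block device

# Recording callbacks demonstrate the order of operations.
calls = []
remove_disk_live(
    "inst1", "disk0",
    hot_del=lambda i, d: calls.append("hot_del"),
    shutdown=lambda i, ds: calls.append("shutdown"),
    annotate=lambda i, d: (calls.append("annotate"), d)[1],
    blockdev_remove=lambda d: calls.append("remove"))
```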

> > >
> > > > +
> > > > +Hypervisor changes
> > > > +------------------
> > > > +
> > > > +We implement hotplug on top of the KVM hypervisor. We take advantage
> > > > +of the QEMU 1.0 monitor commands (``device_add``, ``device_del``,
> > > > +``drive_add``, ``drive_del``, ``netdev_add``, ``netdev_del``). QEMU
> > > > +refers to devices based on their id. We use the ``uuid`` to name
> > > > +them properly. If a device is about to be hotplugged, we parse the
> > > > +output of ``info pci`` and find the occupied PCI slots. We choose
> > > > +the first available one, and the whole device object is appended to
> > > > +the corresponding entry in the runtime file.
> > > > +
> > > > +Concerning NIC handling, we build on top of the existing logic
> > > > +(first create a tap with _OpenTap() and then pass its file
> > > > +descriptor to the KVM process). To this end we need to pass access
> > > > +rights to the corresponding file descriptor over the monitor socket
> > > > +(a UNIX domain socket). The open file is passed as a socket-level
> > > > +control message (SCM), using the ``fdsend`` python library.
> > > > +
> > > >
> > >
> > > Is there any way to have this as a recommends but not depends (aka:
> > > hotplugging won't work if this is not present, but everything else will)?
> > >
> >
> > Sure.
> >
> >
> Thanks.
> 
> 
> > >
> > > > +
> > > > +User interface
> > > > +--------------
> > > > +
> > > > +The new ``--hotplug`` option to gnt-instance modify is introduced,
> > > > +which forces live modifications.
> > > > +
> > > >
> > >
> > > What happens if they are unsupported (eg. other hypervisors), does the
> > > change fail?
> >
> > In cmdlib we get the instance's hypervisor before hotplugging devices.
> > There we could see if the corresponding hypervisor supports it or not.
> >
> >
> Yes, what the design should say is what happens by design. Do we proceed
> without hotplug, or do we fail the operation?
> 
> 

I don't know. Both ways are reasonable and doable. I would suggest printing a
warning and proceeding without hotplug. Then the original message, that
modifications take place after reboot, will be shown.
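
Sketched as code (the helper names are illustrative, not existing cmdlib
functions):

```python
import logging

def apply_modification(hv_supports_hotplug, want_hotplug, do_live, do_offline):
    """If --hotplug was requested but the hypervisor lacks support, warn
    and fall back to the offline path instead of failing the operation."""
    if want_hotplug and not hv_supports_hotplug:
        logging.warning("Hotplug is not supported by this hypervisor; "
                        "modifications will take effect after a reboot")
        want_hotplug = False
    return do_live() if want_hotplug else do_offline()
```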


> > > Should there be a way to use hotplug by default? (now, or later perhaps)?
> > >
> >
> > Yes. Switch the --hotplug option to --no-hotplug. :)
> >
> >
> Yes, but then perhaps there should be a cluster level flag or an instance
> level one, or a hypervisor parameter, saying "hotplug when possible"? Or
> just per-device? Anyway, no need to implement this in the first version.
> 
> 

If anything, I would go with an hv param.

> > >
> > > > +
> > > > +Enabling hotplug
> > > > +++++++++++++++++
> > > > +
> > > > +Hotplug will be optional during gnt-instance modify. For existing
> > > > +instances inside an old ganeti cluster we have two options after
> > > > +upgrading to a version that supports hotplugging:
> > > > +
> > > > +1. The instances in the cluster are hard-rebooted, or
> > > > +
> > > > +2. a specific script changes their runtime format before they can
> > > > +support hotplugging of devices.
> > > > +
> > > > +If neither happens, these instances will support neither migration
> > > > +nor hotplug.
> > > > +
> > > >
> > >
> > > not supporting *migration*: definitely bad.
> > >
> > >
> >
> > Agree. The old format of the runtime files is to blame. :)
> >
> >
> > > > +
> > > > +NIC hotplug
> > > > ++++++++++++
> > > > +
> > > > +The user can add/modify/remove NICs either with hotplugging or not.
> > > > +If a NIC is to be added, a tap is created first and configured
> > > > +properly with the kvm-vif-bridge script. Then the instance gets a
> > > > +new network interface. Since there is no QEMU monitor command to
> > > > +modify a NIC, we modify a NIC by temporarily removing the existing
> > > > +one and adding a new one with the new configuration. When removing a
> > > > +NIC, the corresponding tap gets removed as well.
> > > > +
> > > > +::
> > > > +
> > > > + gnt-instance modify --net add --hotplug test
> > > > + gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test
> > > > + gnt-instance modify --net 1:remove --hotplug test
> > > > +
> > > >
> > >
> > > s/test/instance.example.com/ (so it's clear what it is)
> > > Also is it safe to "just" add and remove a nic from an instance? This is
> > > going to break things, I suppose.
> > >
> > >
> >
> > Not really. Just the instance will see an interface disappear. If this
> > is the primary interface, then it will lose its internet connection
> > (if any).
> >
> >
> Sure, the problem here is that if you remove and re-add an interface just
> because a parameter was changed, the instance risks not ever reconfiguring
> it, and as such never having connectivity again until the next reboot. As
> such, also here a pretty scary warning should be in place: this is not
> *guaranteed* to work as intended at all. :)
> 
> 

Agree. I have tested it under Debian (with and without network-manager) and
on Windows, and it works. Warnings can be added for sure.

> 
> > > > +
> > > > +Disk hotplug
> > > > +++++++++++++
> > > > +
> > > > +The user can add and remove disks with hotplugging or not. The QEMU
> > > > +monitor supports resizing of disks; however, the initial
> > > > +implementation will support only disk addition/deletion.
> > > > +
> > > >
> > >
> > > It would definitely be nice to support resizing too, but ok, it can go in
> > > later.
> > >
> > >
> >
> > AGREE! But please later....
> >
> >
> Ok.
> 
> 
> > > > +::
> > > > +
> > > > + gnt-instance modify --disk add:size=1G --hotplug test
> > > > + gnt-instance modify --disk 1:remove --hotplug test
> > > > +
> > > >
> > >
> > > Thanks,
> > >
> >
> > I'll try to implement something of the aforementioned and rebase it to
> > current
> > master, you know, the one with NO cmdlib.py!
> >
> >
> Wait until we have ironed out the whole design, please. As you see, there
> are still a few concerns, I think.
> Also let's try to address Sasha's concerns above in the thread. In
> particular I share his view that hotplug with higher security shouldn't
> break (this is not mentioned in this design). The other points he makes
> are good, but indeed you're right that we have to start with a few things
> and then iterate and add. On the other hand, let's design in a way that's
> extensible to those use cases. :)
> 

OK. Sasha's feedback is more than helpful. I'll try to address the issues he
raised, but I still think they extend the initial implementation discussed
here rather than affecting the basic design.


To sum up:

1) The pci slot of each device must be saved in the backend for the sake of
migrations. Should this be a slot in the NIC and Disk objects, or should it
be saved as a tuple in the serialized data?

2) If we want disk hotplug, the format of runtime files should change.
block_devices should be treated separately, just like NICs. Migrating old
instances without first rebooting them still remains an issue.

3) Merge all rpc calls into one hotplug_device(device, action).

4) Do not include disk resizing in the first implementation.

5) Add an hv param for enabling hotplug or not.

6) Warnings in case of live disk/NIC removal.

7) Warnings in case hotplug is not supported but the --hotplug option is
passed.

8) Address Sasha's concerns:
   * Parse kvm help output to see if hotplug is supported.
   * Dig into the case where an instance does not run as root.
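
For the first bullet of (8), a minimal sketch of such a probe (a heuristic
only; the exact detection method is still to be decided):

```python
def kvm_supports_hotplug(help_output):
    """Heuristically decide whether this KVM build can hotplug devices,
    based on its help output: hotplug relies on the qdev model, so we
    look for the -device flag."""
    return "-device " in help_output
```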


Please let me know if I missed something.

Thanks a lot for the discussion,
dimara
