+list, sorry for dropping it.

On Fri, May 24, 2013 at 2:34 PM, Dimitris Aragiorgis <[email protected]>wrote:

> Hi.
>
> Comments inline.
>
> * Guido Trotter <[email protected]> [2013-05-21 15:08:56 +0200]:
>
> > On Sat, May 18, 2013 at 3:25 AM, Dimitris Aragiorgis <[email protected]
> >wrote:
> >
> > > Hi,
> > >
> > > since device UUIDs have been merged upstream, I think we could
> > > reheat the conversation about hotplug. This is the updated
> > > design doc. The implementation is ready but it is wiser to
> > > agree on the design first and make any changes if needed.
> > >
> > > Looking forward to your feedback.
> > >
> > > Thanks in advance,
> > > dimara
> > >
> > > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > > ---
> > >  Makefile.am            |    1 +
> > >  doc/design-draft.rst   |    1 +
> > >  doc/design-hotplug.rst |  220
> > > ++++++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 222 insertions(+)
> > >  create mode 100644 doc/design-hotplug.rst
> > >
> > > diff --git a/Makefile.am b/Makefile.am
> > > index 037cf53..155577d 100644
> > > --- a/Makefile.am
> > > +++ b/Makefile.am
> > > @@ -407,6 +407,7 @@ docinput = \
> > >         doc/design-cpu-pinning.rst \
> > >         doc/design-device-uuid-name.rst \
> > >         doc/design-draft.rst \
> > > +       doc/design-hotplug.rst \
> > >         doc/design-htools-2.3.rst \
> > >         doc/design-http-server.rst \
> > >         doc/design-impexp2.rst \
> > > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > > index ccb2f93..3ec9c46 100644
> > > --- a/doc/design-draft.rst
> > > +++ b/doc/design-draft.rst
> > > @@ -19,6 +19,7 @@ Design document drafts
> > >     design-storagetypes.rst
> > >     design-reason-trail.rst
> > >     design-device-uuid-name.rst
> > > +   design-hotplug.rst
> > >
> > >  .. vim: set textwidth=72 :
> > >  .. Local Variables:
> > > diff --git a/doc/design-hotplug.rst b/doc/design-hotplug.rst
> > > new file mode 100644
> > > index 0000000..d34179f
> > > --- /dev/null
> > > +++ b/doc/design-hotplug.rst
> > > @@ -0,0 +1,220 @@
> > > +=======
> > > +Hotplug
> > > +=======
> > > +
> > > +.. contents:: :depth: 4
> > > +
> > > +This is a design document detailing the implementation of device
> > > +hotplugging in Ganeti. The logic used is hypervisor agnostic but still
> > > +the initial implementation will target the KVM hypervisor. The
> > > +implementation adds ``python-fdsend`` as a new dependency.
> > > +
> > > +
> > > +Current state and shortcomings
> > > +==============================
> > > +
> > > +Currently, Ganeti supports addition/removal/modification of devices
> > > +(NICs, Disks) but the actual modification takes place only after
> > > +rebooting the instance. To this end an instance cannot change network,
> > > +get a new disk etc. without a hard reboot.
> > > +
> > > +Until now, in case of KVM hypervisor, code does not name devices nor
> > > +places them in specific PCI slots. Devices are appended in the KVM
> > > +command and Ganeti lets KVM decide where to place them. This means
> that
> > > +there is a possibility a device that resides in PCI slot 5, after a
> > > +reboot (due to another device removal) to be moved to another PCI slot
> > > +and probably get renamed too (due to udev rules, etc.).
> > > +
> > > +In order migration to succeed, the process on the target node should
> be
> > > +started with exactly the same machine version, CPU architecture and
> PCI
> > > +configuration with the running process. During instance
> creation/startup
> > > +ganeti creates a KVM runtime file with all the necessary information
> to
> > > +generate the KVM command. This runtime file is used during instance
> > > +migration to start a new identical KVM process. The current format
> > > +includes the fixed part of the final KVM command a list of NICs',
> > > +and hvparams dict. It does not favor easy manipulations concerning
> > > +disks, because they are encapsulated in the fixed KVM command.
> > > +
> > > +Proposed changes
> > > +================
> > > +
> > > +For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the
> > > +instance. Disks and NICs occupy some of these slots. Recent versions
> of
> > > +QEMU have introduced monitor commands that allow addition/removal of
> PCI
> > > +devices. Devices are referenced based on their name or position on the
> > > +virtual PCI bus. To be able to use these commands, we need to be able
> to
> > > +assign each device a unique name. For that we build on top of devices'
> > > +UUIDs.
> > > +
> > > +Finally, to keep track where each device is plugged into, we add the
> > > +``pci`` slot to Disk and NIC objects, but we save it only in runtime
> > > +files, since it is hypervisor specific info.
> > > +
> > >
> >
> > What is the purpose of a "pci" slot to disk and nic *objects* if these
> are
> > never saved in the config, but only kept in KVM specific runtime files?
> >
>
> In order to easily create NIC and Disk objects from serialized (json) data.
>

Nack on this one. We already have horrible hacks there (eg. the logical_id
and physical_id that make sense only at different points) and this is going
to make it worse, with something that doesn't make sense at all at the
config/rpc level but just in the backend.
We should just have the backend's own object that contains a disk and all
the extra backend info, that can be serialized and deserialized just in
noded without changing the frontend.


>
> > +We propose to make use of QEMU 1.0 monitor commands so that
> > > +modifications to devices take effect instantly without the need for
> hard
> > > +reboot. The only change exposed to the end-user will be the addition
> of
> > > +a ``--hotplug`` option to the ``gnt-instance modify`` command.
> > > +
> > > +Upon hotplugging the PCI configuration of an instance is changed.
> > > +Runtime files should be updated correspondingly. Currently this is
> > > +impossible in case of disk hotplug because disks are included in
> command
> > > +line entry of the runtime file, contrary to NICs that are correctly
> > > +treated separately. We change the format of runtime files, we remove
> > > +disks from the fixed KVM command and create new entry containing them
> > > +only. KVM options concerning disk are generated during
> > > +``_ExecuteKVMCommand()``, just like NICs.
> > > +
> > > +Design decisions
> > > +================
> > > +
> > > +Which should be each device ID? Currently KVM does not support
> arbitrary
> > > +IDs for devices; supported are only names starting with a letter, max
> 32
> > > +chars length, and only including '.' '_' special chars. To this end we
> > > +use a helper function that converts UUID to accepted device id. For
> the
> > > +sake of simplicity and readability we use the part of UUID until the
> > > +first dash and insert a "x" at the beginning of the string.
> > > +
> > >
> >
> > How do we guarantee that no two devices will have the same name? We
> should
> > double check this, to be sure there are no conflicts.
> >
> >
>
> Well we actually don't. Trimming system's generated uuid rises our chances
> for duplicate IDs. If there is a way to generate a uuid of max 32 chars
> starting with a letter based on the uuids of type
> /proc/sys/kernel/random/uuid
> then it would be great. Until we find one,  we could disable hotplug for
> devices
> that have the uuid prefix.
>

That's going to be a problem, as it will generate errors for no reason.
(sorry, you can't hotplug your device, because you were unlucky in the uuid
roulette. please try again being luckier)

How about using some information that is specific to the instance and can't
change? eg.

x<pci-id>_<start-of-uuid>
or
x<arbitrary-integer>_<start-of-uuid> (where the arbitrary integer is just
used for deduplication)



>
> >
> > > +Who decides where to hotplug each device? As long as this is a
> > > +hypervisor specific matter, there is no point for the master node to
> > > +decide such a thing. Master node just has to request noded to hotplug
> a
> > > +device. To this end, hypervisor specific code should parse the current
> > > +PCI configuration (i.e. ``info pci`` QEMU monitor command), find the
> first
> > > +available slot and hotplug the device. Having noded to decide where to
> > > +hotplug a device we ensure that no error will occur due to duplicate
> > > +slot assignment (if masterd keeps track of PCI reservations and noded
> > > +fails to return the PCI slot that the device was plugged into then
> next
> > > +hotplug will fail).
> > > +
> > >
> >
> > Ack.
> >
> >
> > > +Where should we keep track of devices' PCI slots? As already
> mentioned,
> > > +we must keep track of devices PCI slots to successfully migrate
> > > +instances. First option is to save this info to config data, which
> would
> > > +allow us to place each device at the same PCI slot after reboot. This
> > > +would require to make the hypervisor return the PCI slot chosen for
> each
> > > +device, and storing this information to config data. Additionally the
> > > +whole instance configuration should be returned with PCI slots filled
> > > +after instance start and each instance should keep track of current
> PCI
> > > +reservations. We decide not to go towards this direction in order to
> > > +keep it simple and do not add hypervisor specific info to
> configuration
> > > +data (``pci_reservations`` at instance level and ``pci`` at device
> > > +level). For the aforementioned reason, we decide to store this info
> only
> > > +in KVM runtime files.
> > > +
> > >
> >
> > Ack. qq: do we need to store them at all, or could we get them from kvm
> on
> > the source machine, and use them on the destination machine, upon
> migration?
> >
> >
>
> Well the pci_resevations is actually needed upon instance startup as well.
> Let's say we have an instance with pci slots 5, 6, 8, 10 reserved and the
> corresponding device objects with their pci slot up-to-date. We bring this
> machine down. We add another device. Upon instance startup how do we
> know where to place that device? QEMU monitor cannot be asked because
> instance
> is down. Before hotplugging devices were placed automatically by qemu.
> First
> the disks then the nics. This resulted to pci renumbering.
>
>
Yes. But there are two different problems addressed by this design then:
1) hotplug
2) pci renumbering

If you want to solve (2) across instance reboots you can't do it at the
hypervisor level, as when the instance is down these data would not be
saved. Even if it was then it would not survive a failover, because the
original node would be likely down.

So you have two options:

1) You accept that pci renumbering is going to happen at each instance start
2) You save&track pci numbers in the master

At the hypervisor level you can only guarantee the data until the instance
is up, nothing more.


> > > +Where to place the devices upon instance startup? QEMU has by default
> 4
> > > +pre-occupied PCI slots. So, hypervisor can use the remaining ones for
> > > +disks and NICs. Currently, PCI configuration is not preserved after
> > > +reboot.  Each time an instance starts, KVM assigns PCI slots to
> devices
> > > +based on their ordering in Ganeti configuration, i.e. the second disk
> > > +will be placed after the first, the third NIC after the second, etc.
> > > +Since we decided that there is no need to keep track of devices PCI
> > > +slots, there is no need to change current functionality.
> > > +
> > > +How to deal with existing instances? Migration and hotplugging of
> > > +existing instances in a cluster after installing the version that
> > > +supports hotplug will not be supported until we run a specific script
> in
> > > +all nodes that adds a new entry for the block devices (e.g. []) in
> > > +existing runtime files or else until instances suffer a hard reboot.
> > >
> >
> > Ouch. Can we support this better? :) If we go for "read current values
> from
> > the qemu monitor" rather than "store this info outside" we get this
> > magically working, no?
> >
> >
>
> Got lost :) What is "this"? "read current values from qemu monitor"
> works only if instance is running. Look previous comment.
>

Ok, let's have this discussion in that part of the thread then.


> Existing instances do not have a pci configuration. Still they have the
> old runtime file format (that does not treat disks as a different
> entry).  This is the reason migration will fail.
>
>
Just saying that we should make this work somehow. But again, it doesn't
make sense to discuss it until we have answers to that part, so let's pause
this part of the conversation for a minute.


> > > +Configuration changes
> > > +---------------------
> > > +
> > > +The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It
> refers to
> > > +PCI slot that the device gets plugged into.
> > > +
> > >
> >
> > Why? You seem not to be using this anyway.
> >
>
> Explained before.
>
> >
> > > +In order to be able to live migrate successfully, runtime files should
> > > +be updated every time a live modification (hotplug) takes place. To
> this
> > > +end we change the format of runtime files. The KVM options referring
> to
> > > +instance's disks are no longer recorded as part of the KVM command
> line.
> > > +Disks are treated separately, just as we treat NICs right now. We
> insert
> > > +and remove entries to reflect the current PCI configuration.
> > >
> >
> > Ack.
> >
> >
> > > +
> > > +
> > > +Backend changes
> > > +---------------
> > > +
> > > +Introduce 4 new RPC calls:
> > > +
> > > +- hot_add_nic
> > > +- hot_del_nic
> > > +- hot_add_disk
> > > +- hot_del_disk
> > > +
> > >
> >
> > Does it make sense to have one call for device type? Could we not use one
> > call for both?
>
> Well I implemented it so that it is more straighforward. We could have
> hotplug_device(device, action)
> and act depending the device type (NIC, Disk) and the action (ADD, REMOVE)
>
>
Ack, let's go for this route please, at least for the device type. This
will make life easier if we have ever to hotplug a cpu, eg. :)


> > As for del, do we have any concerns? (this can easily break an instance I
> > suppose, am I wrong? can it fail if the device is in use? How do we
> handle
> > these cases?)
> >
>
> Unplugging a device has concerns but there are cases we want this to
> happen.
> disable an instance networking, renumber an instance IP, remove a disk
> (like
> unplugging an USB), etc. With existing implementation when we remove a
> disk from a running instance it fails because it is used by QEMU. With
> hotplug
> we hot_del_disk then _ShutdownInstanceDisks() then _AnnotateDiskParams()
> and then blockdev_remove().
>
>
ACK, as long as there are enough scary warnings.


> >
> > > +
> > > +Hypervisor changes
> > > +------------------
> > > +
> > > +We implement hotplug on top of the KVM hypervisor. We take advantage
> of
> > > +QEMU 1.0 monitor commands (``device_add``, ``device_del``,
> > > +``drive_add``, ``drive_del``, ``netdev_add``,`` netdev_del``). QEMU
> > > +refers to devices based on their id. We use ``uuid`` to name them
> > > +properly. If a device is about to be hotplugged we parse the output of
> > > +``info pci`` and find the occupied PCI slots. We choose the first
> > > +available and the whole device object is appended to the corresponding
> > > +entry in the runtime file.
> > > +
> > > +Concerning NIC handling, we build on the top of the existing logic
> > > +(first create a tap with _OpenTap() and then pass its file descriptor
> to
> > > +the KVM process). To this end we need to pass access rights to the
> > > +corresponding file descriptor over the monitor socket (UNIX domain
> > > +socket). The open file is passed as a socket-level control message
> > > +(SCM), using the ``fdsend`` python library.
> > > +
> > >
> >
> > Is there any way to have this as a recommends but not depends (aka:
> > hotplugging won't work if this is not present, but everything else will)?
> >
>
> Sure.
>
>
Thanks.


> >
> > > +
> > > +User interface
> > > +--------------
> > > +
> > > +The new ``--hotplug`` option to gnt-instance modify is introduced,
> which
> > > +forces live modifications.
> > > +
> > >
> >
> > What happens if they are unsupported (eg. other hypervisors), does the
> > change fail?
>
> In cmdlib we get instance's hypervisor  before hotplugging devices.
> There we could see if the corresponding hypervisor supports it or not.
>
>
Yes, what the design should say is what happens by design. Do we proceed
without hotplug, or do we fail the operation?


> > Should there be a way to use hotplug by default? (now, or later perhaps)?
> >
>
> Yes. Swith --hotplug option to --no-hotplug. :)
>
>
Yes, but then perhaps there should be a cluster level flag or an instance
level one, or a hypervisor parameter, saying "hotplug when possible"? Or
just per-device? Anyway, no need to implement this in the first version.


> >
> > > +
> > > +Enabling hotplug
> > > +++++++++++++++++
> > > +
> > > +Hotplug will be optional during gnt-instance modify.  For existing
> > > +instances inside an old ganeti cluster we have two options after
> > > +upgrading to a version that supports hotplugging:
> > > +
> > > +1. Instances in the cluster should be hard rebooted or
> > > +
> > > +2. A specific script should change their runtime format before they
> can
> > > +support hotplugging of devices.
> > > +
> > > +If neither happens these instances will not support migration nor
> > > +hotplug.
> > > +
> > >
> >
> > not supporting *migration*: definitely bad.
> >
> >
>
> Aggre. The old format of the runtime files to blame.. :)
>
>
> > > +
> > > +NIC hotplug
> > > ++++++++++++
> > > +
> > > +The user can add/modify/remove NICs either with hotplugging or not.
> If a
> > > +NIC is to be added a tap is created first and configured properly with
> > > +kvm-vif-bridge script. Then the instance gets a new network interface.
> > > +Since there is no QEMU monitor command to modify a NIC, we modify a
> NIC
> > > +by temporary removing the existing one and adding a new with the new
> > > +configuration. When removing a NIC the corresponding tap gets removed
> as
> > > +well.
> > > +
> > > +::
> > > +
> > > + gnt-instance modify --net add --hotplug test
> > > + gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test
> > > + gnt-instance modify --net 1:remove --hotplug test
> > > +
> > >
> >
> > s/test/instance.example.com/ (so it's clear what it is)
> > Also is it safe to "just" add and remove a nic from an instance? This is
> > going to break things, I suppose.
> >
> >
>
> Not really. Just the instance will see an interface to dissapear. If this
> is the primary interface then it will lose internet connection (if any).
>
>
Sure, the problem here is that if you remove and readd an instance just
because a parameter was changed, the instance risks not ever reconfiguring
it, and as such never having connectivity again until the next reboot. As
such also here a pretty scary warning should be in place: this is not
*guaranteed* to work as intended at all. :)



> > > +
> > > +Disk hotplug
> > > +++++++++++++
> > > +
> > > +The user can add and remove disks with hotplugging or not. QEMU
> monitor
> > > +supports resizing of disks, however the initial implementation will
> > > +support only disk addition/deletion.
> > > +
> > >
> >
> > It would definitely be nice to support resizing too, but ok, it can go in
> > later.
> >
> >
>
> AGREE! But please later....
>
>
Ok.


> > > +::
> > > +
> > > + gnt-instance modify --disk add:size=1G --hotplug test
> > > + gnt-instance modify --disk 1:remove --hotplug test
> > > +
> > >
> >
> > Thanks,
> >
>
> I'll try to implement something of the aforementioned and rebase it to
> current
> master, you know, the one with NO cmdlib.py!
>
>
Wait until we ironed out all the design, please. As you see there's still a
bit of concerns, I think.
Also let's try to address sasha's concerns above in the thread. In
particular I share with him the view that hotplug with higher security
shouldn't break (this is not mentioned in this design). Other points he
makes are good, but indeed you're right that we have to start with a few
things and then iterate and add. On the other hand let's design in a way
that's extensible to those use cases. :)

Thanks a lot,

Guido

Reply via email to