* Guido Trotter <[email protected]> [2013-05-24 17:01:29 +0200]: > +list, sorry for dropping it. > > > On Fri, May 24, 2013 at 2:34 PM, Dimitris Aragiorgis <[email protected]>wrote: > > > Hi. > > > > Comments inline. > > > > * Guido Trotter <[email protected]> [2013-05-21 15:08:56 +0200]: > > > > > On Sat, May 18, 2013 at 3:25 AM, Dimitris Aragiorgis <[email protected] > > >wrote: > > > > > > > Hi, > > > > > > > > since device UUIDs have been merged upstream, I think we could > > > > reheat the conversation about hotplug. This is the updated > > > > design doc. The implementation is ready but it is wiser to > > > > agree on the design first and make any changes if needed. > > > > > > > > Looking forward to your feedback. > > > > > > > > Thanks in advance, > > > > dimara > > > > > > > > Signed-off-by: Dimitris Aragiorgis <[email protected]> > > > > --- > > > > Makefile.am | 1 + > > > > doc/design-draft.rst | 1 + > > > > doc/design-hotplug.rst | 220 > > > > ++++++++++++++++++++++++++++++++++++++++++++++++ > > > > 3 files changed, 222 insertions(+) > > > > create mode 100644 doc/design-hotplug.rst > > > > > > > > diff --git a/Makefile.am b/Makefile.am > > > > index 037cf53..155577d 100644 > > > > --- a/Makefile.am > > > > +++ b/Makefile.am > > > > @@ -407,6 +407,7 @@ docinput = \ > > > > doc/design-cpu-pinning.rst \ > > > > doc/design-device-uuid-name.rst \ > > > > doc/design-draft.rst \ > > > > + doc/design-hotplug.rst \ > > > > doc/design-htools-2.3.rst \ > > > > doc/design-http-server.rst \ > > > > doc/design-impexp2.rst \ > > > > diff --git a/doc/design-draft.rst b/doc/design-draft.rst > > > > index ccb2f93..3ec9c46 100644 > > > > --- a/doc/design-draft.rst > > > > +++ b/doc/design-draft.rst > > > > @@ -19,6 +19,7 @@ Design document drafts > > > > design-storagetypes.rst > > > > design-reason-trail.rst > > > > design-device-uuid-name.rst > > > > + design-hotplug.rst > > > > > > > > .. vim: set textwidth=72 : > > > > .. 
> > > > Local Variables:
> > > > diff --git a/doc/design-hotplug.rst b/doc/design-hotplug.rst
> > > > new file mode 100644
> > > > index 0000000..d34179f
> > > > --- /dev/null
> > > > +++ b/doc/design-hotplug.rst
> > > > @@ -0,0 +1,220 @@
> > > > +=======
> > > > +Hotplug
> > > > +=======
> > > > +
> > > > +.. contents:: :depth: 4
> > > > +
> > > > +This is a design document detailing the implementation of device
> > > > +hotplugging in Ganeti. The logic used is hypervisor-agnostic, but
> > > > +the initial implementation will target the KVM hypervisor. The
> > > > +implementation adds ``python-fdsend`` as a new dependency.
> > > > +
> > > > +
> > > > +Current state and shortcomings
> > > > +==============================
> > > > +
> > > > +Currently, Ganeti supports addition/removal/modification of
> > > > +devices (NICs, disks), but the actual modification takes place
> > > > +only after rebooting the instance. As a result, an instance
> > > > +cannot change network, get a new disk, etc. without a hard
> > > > +reboot.
> > > > +
> > > > +Until now, in the case of the KVM hypervisor, the code neither
> > > > +names devices nor places them in specific PCI slots. Devices are
> > > > +appended to the KVM command and Ganeti lets KVM decide where to
> > > > +place them. This means that a device residing in PCI slot 5 may,
> > > > +after a reboot (due to another device's removal), be moved to
> > > > +another PCI slot and probably be renamed too (due to udev rules,
> > > > +etc.).
> > > > +
> > > > +In order for migration to succeed, the process on the target
> > > > +node must be started with exactly the same machine version, CPU
> > > > +architecture and PCI configuration as the running process.
> > > > +During instance creation/startup Ganeti creates a KVM runtime
> > > > +file with all the necessary information to generate the KVM
> > > > +command. This runtime file is used during instance migration to
> > > > +start a new, identical KVM process.
> > > > +The current format includes the fixed part of the final KVM
> > > > +command, a list of NICs, and the hvparams dict. It does not
> > > > +allow easy manipulation of disks, because they are encapsulated
> > > > +in the fixed KVM command.
> > > > +
> > > > +Proposed changes
> > > > +================
> > > > +
> > > > +In the case of the KVM hypervisor, QEMU exposes 32 PCI slots to
> > > > +the instance. Disks and NICs occupy some of these slots. Recent
> > > > +versions of QEMU have introduced monitor commands that allow
> > > > +addition/removal of PCI devices. Devices are referenced based on
> > > > +their name or position on the virtual PCI bus. To be able to use
> > > > +these commands, we need to be able to assign each device a
> > > > +unique name. For that we build on top of devices' UUIDs.
> > > > +
> > > > +Finally, to keep track of where each device is plugged in, we
> > > > +add a ``pci`` slot to the Disk and NIC objects, but we save it
> > > > +only in runtime files, since it is hypervisor-specific info.
> > > > +
> > >
> > > What is the purpose of a "pci" slot to disk and nic *objects* if
> > > these are never saved in the config, but only kept in KVM specific
> > > runtime files?
> >
> > In order to easily create NIC and Disk objects from serialized
> > (json) data.
>
> Nack on this one. We already have horrible hacks there (e.g. the
> logical_id and physical_id that make sense only at different points)
> and this is going to make it worse, with something that doesn't make
> sense at all at the config/rpc level but just in the backend.
> We should just have the backend's own object that contains a disk and
> all the extra backend info, that can be serialized and deserialized
> just in noded without changing the frontend.
>
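Guido's backend-object suggestion could look roughly like this minimal sketch (the class and field names are hypothetical, not actual Ganeti code): noded serializes and deserializes this container itself, so the frontend Disk object never grows a ``pci`` attribute.

```python
# Hypothetical sketch of a backend-only container pairing a disk with
# its hypervisor-specific extras (here just the PCI slot). Names and
# fields are illustrative, not actual Ganeti code.
class BackendDisk:
    def __init__(self, disk_dict, pci=None):
        self.disk = disk_dict  # the frontend Disk, already serialized
        self.pci = pci         # KVM-specific; never stored in the config

    def ToDict(self):
        return {"disk": self.disk, "pci": self.pci}

    @classmethod
    def FromDict(cls, data):
        return cls(data["disk"], data.get("pci"))
```

The point of the design choice is that only noded ever sees the ``pci`` field; the config and RPC layers keep passing plain Disk dicts.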
Currently runtime files have [kvm_cmd, [n.ToDict() for n in nics],
hv_params]. I propose [kvm_cmd, [n.ToDict() for n in nics], hv_params,
[(b.ToDict(), link) for b in block_devices]]. In order for this to
happen we need the NIC and Disk objects to have a pci slot that exists
only in the backend. If not, we have to complicate things and save
tuples of (object.ToDict(), pci). I really cannot see why we don't go
for the simple way. The pci slot is not propagated back to the master.
>
> > > > +We propose to make use of QEMU 1.0 monitor commands so that
> > > > +modifications to devices take effect instantly without the need
> > > > +for a hard reboot. The only change exposed to the end-user will
> > > > +be the addition of a ``--hotplug`` option to the ``gnt-instance
> > > > +modify`` command.
> > > > +
> > > > +Upon hotplugging, the PCI configuration of an instance is
> > > > +changed. Runtime files should be updated correspondingly.
> > > > +Currently this is impossible in the case of disk hotplug,
> > > > +because disks are included in the command-line entry of the
> > > > +runtime file, contrary to NICs, which are correctly treated
> > > > +separately. We change the format of runtime files: we remove
> > > > +disks from the fixed KVM command and create a new entry
> > > > +containing only them. KVM options concerning disks are
> > > > +generated during ``_ExecuteKVMCommand()``, just like for NICs.
> > > > +
> > > > +Design decisions
> > > > +================
> > > > +
> > > > +What should each device's ID be? Currently KVM does not support
> > > > +arbitrary IDs for devices; only names starting with a letter,
> > > > +at most 32 chars long, and including only the '.' and '_'
> > > > +special chars are supported. To this end we use a helper
> > > > +function that converts a UUID to an accepted device id. For the
> > > > +sake of simplicity and readability we use the part of the UUID
> > > > +up to the first dash and insert an "x" at the beginning of the
> > > > +string.
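The four-entry runtime layout proposed above amounts to a JSON round-trip along these lines (a sketch under assumed names; the real Ganeti runtime-file code differs):

```python
import json

# Rough sketch of the proposed runtime-file layout: the fixed KVM
# command, the serialized NICs, the hvparams dict, and a new fourth
# entry holding (serialized disk, link) pairs. Function names are
# illustrative only.
def save_runtime(kvm_cmd, nic_dicts, hvparams, block_devices):
    # block_devices: list of (disk_dict, link) tuples
    return json.dumps([kvm_cmd, nic_dicts, hvparams,
                       [[d, link] for (d, link) in block_devices]])

def load_runtime(data):
    kvm_cmd, nic_dicts, hvparams, blockdevs = json.loads(data)
    return kvm_cmd, nic_dicts, hvparams, [tuple(b) for b in blockdevs]
```

With disks in their own entry, a hotplug only has to append to or delete from the fourth list and rewrite the file, instead of editing the fixed command line.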
> > > > +
> > >
> > > How do we guarantee that no two devices will have the same name?
> > > We should double-check this, to be sure there are no conflicts.
> >
> > Well, we actually don't. Trimming the system-generated uuid raises
> > the chance of duplicate IDs. If there were a way to generate a uuid
> > of at most 32 chars, starting with a letter, based on uuids of the
> > /proc/sys/kernel/random/uuid type, that would be great. Until we
> > find one, we could disable hotplug for devices that share the same
> > uuid prefix.
>
> That's going to be a problem, as it will generate errors for no
> reason. (sorry, you can't hotplug your device, because you were
> unlucky in the uuid roulette. please try again being luckier)
>
> How about using some information that is specific to the instance and
> can't change? eg.
>
> x<pci-id>_<start-of-uuid>
> or

Well, it seems that if we obtain the device's pci slot before naming
it, then we could definitely use it as an identifier. To this end,
UUIDs are not a prerequisite. And it seems to be doable. Good point.

> x<arbitrary-integer>_<start-of-uuid> (where the arbitrary integer is
> just used for deduplication)

> > > > +Who decides where to hotplug each device? Since this is a
> > > > +hypervisor-specific matter, there is no point in having the
> > > > +master node decide such a thing. The master node just has to
> > > > +request that noded hotplug a device. To this end,
> > > > +hypervisor-specific code should parse the current PCI
> > > > +configuration (i.e. the ``info pci`` QEMU monitor command),
> > > > +find the first available slot, and hotplug the device. By
> > > > +having noded decide where to hotplug a device, we ensure that
> > > > +no error will occur due to duplicate slot assignment (if
> > > > +masterd kept track of PCI reservations and noded failed to
> > > > +return the PCI slot that the device was plugged into, the next
> > > > +hotplug would fail).
> > > > +
> > >
> > > Ack.
> > > > +Where should we keep track of devices' PCI slots? As already
> > > > +mentioned, we must keep track of devices' PCI slots to
> > > > +successfully migrate instances. The first option is to save
> > > > +this info in the config data, which would allow us to place
> > > > +each device in the same PCI slot after a reboot. This would
> > > > +require making the hypervisor return the PCI slot chosen for
> > > > +each device and storing this information in the config data.
> > > > +Additionally, the whole instance configuration should be
> > > > +returned with the PCI slots filled in after instance start, and
> > > > +each instance should keep track of its current PCI
> > > > +reservations. We decide not to go in this direction, in order
> > > > +to keep things simple and avoid adding hypervisor-specific info
> > > > +to the configuration data (``pci_reservations`` at instance
> > > > +level and ``pci`` at device level). For the aforementioned
> > > > +reason, we decide to store this info only in KVM runtime files.
> > > > +
> > >
> > > Ack. qq: do we need to store them at all, or could we get them
> > > from kvm on the source machine, and use them on the destination
> > > machine, upon migration?
> >
> > Well, the pci_reservations info is actually needed upon instance
> > startup as well. Let's say we have an instance with pci slots 5, 6,
> > 8, 10 reserved and the corresponding device objects with their pci
> > slot up-to-date. We bring this machine down. We add another device.
> > Upon instance startup, how do we know where to place that device?
> > The QEMU monitor cannot be asked, because the instance is down.
> > Before hotplugging, devices were placed automatically by qemu:
> > first the disks, then the nics. This resulted in pci renumbering.
>
> Yes.
> But there are two different problems addressed by this design then:
> 1) hotplug
> 2) pci renumbering
>
> If you want to solve (2) across instance reboots you can't do it at
> the hypervisor level, as when the instance is down this data would not
> be saved. Even if it were, it would not survive a failover, because
> the original node would likely be down.
>
> So you have two options:
>
> 1) You accept that pci renumbering is going to happen at each
> instance start
> 2) You save & track pci numbers in the master
>
> At the hypervisor level you can only guarantee the data until the
> instance is up, nothing more.
>
You are right. Since we don't want to save hypervisor-specific info in
the master, pci renumbering is going to happen on hard reboots.

> > > > +Where to place the devices upon instance startup? QEMU has by
> > > > +default 4 pre-occupied PCI slots, so the hypervisor can use the
> > > > +remaining ones for disks and NICs. Currently, the PCI
> > > > +configuration is not preserved after reboot. Each time an
> > > > +instance starts, KVM assigns PCI slots to devices based on
> > > > +their ordering in the Ganeti configuration, i.e. the second
> > > > +disk will be placed after the first, the third NIC after the
> > > > +second, etc. Since we decided that there is no need to keep
> > > > +track of devices' PCI slots, there is no need to change the
> > > > +current functionality.
> > > > +
> > > > +How to deal with existing instances? Migration and hotplugging
> > > > +of existing instances in a cluster, after installing the
> > > > +version that supports hotplug, will not be supported until we
> > > > +run a specific script on all nodes that adds a new entry for
> > > > +the block devices (e.g. []) to existing runtime files, or until
> > > > +the instances undergo a hard reboot.
> > >
> > > Ouch. Can we support this better? :) If we go for "read current
> > > values from the qemu monitor" rather than "store this info
> > > outside" we get this magically working, no?
> > Got lost :) What is "this"? "read current values from the qemu
> > monitor" works only if the instance is running. See the previous
> > comment.
>
> Ok, let's have this discussion in that part of the thread then.

> > Existing instances do not have a pci configuration. Still, they
> > have the old runtime file format (which does not treat disks as a
> > separate entry). This is the reason migration will fail.
>
> Just saying that we should make this work somehow. But again, it
> doesn't make sense to discuss it until we have answers to that part,
> so let's pause this part of the conversation for a minute.
>
> > > > +Configuration changes
> > > > +---------------------
> > > > +
> > > > +The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``.
> > > > +It refers to the PCI slot that the device gets plugged into.
> > > > +
> > >
> > > Why? You seem not to be using this anyway.
> >
> > Explained before.
>
> > > > +In order to be able to live migrate successfully, runtime
> > > > +files should be updated every time a live modification
> > > > +(hotplug) takes place. To this end we change the format of
> > > > +runtime files. The KVM options referring to the instance's
> > > > +disks are no longer recorded as part of the KVM command line.
> > > > +Disks are treated separately, just as we treat NICs right now.
> > > > +We insert and remove entries to reflect the current PCI
> > > > +configuration.
> > >
> > > Ack.
> > >
> > > > +
> > > > +
> > > > +Backend changes
> > > > +---------------
> > > > +
> > > > +Introduce 4 new RPC calls:
> > > > +
> > > > +- hot_add_nic
> > > > +- hot_del_nic
> > > > +- hot_add_disk
> > > > +- hot_del_disk
> > > > +
> > >
> > > Does it make sense to have one call per device type? Could we not
> > > use one call for both?
> >
> > Well, I implemented it so that it is more straightforward.
> > We could have
> >
> > hotplug_device(device, action)
> >
> > and act depending on the device type (NIC, Disk) and the action
> > (ADD, REMOVE).
>
> Ack, let's go for this route please, at least for the device type.
> This will make life easier if we ever have to hotplug a cpu, e.g. :)

Right.

> > > As for del, do we have any concerns? (this can easily break an
> > > instance I suppose, am I wrong? can it fail if the device is in
> > > use? How do we handle these cases?)
> >
> > Unplugging a device has concerns, but there are cases where we want
> > this to happen: disabling an instance's networking, renumbering an
> > instance's IP, removing a disk (like unplugging a USB drive), etc.
> > With the existing implementation, when we remove a disk from a
> > running instance it fails because it is used by QEMU. With hotplug
> > we hot_del_disk, then _ShutdownInstanceDisks(), then
> > _AnnotateDiskParams(), and then blockdev_remove().
>
> ACK, as long as there are enough scary warnings.

I'll do my best :)

> > > > +
> > > > +Hypervisor changes
> > > > +------------------
> > > > +
> > > > +We implement hotplug on top of the KVM hypervisor. We take
> > > > +advantage of QEMU 1.0 monitor commands (``device_add``,
> > > > +``device_del``, ``drive_add``, ``drive_del``, ``netdev_add``,
> > > > +``netdev_del``). QEMU refers to devices based on their id. We
> > > > +use the ``uuid`` to name them properly. If a device is about to
> > > > +be hotplugged, we parse the output of ``info pci`` and find the
> > > > +occupied PCI slots. We choose the first available one, and the
> > > > +whole device object is appended to the corresponding entry in
> > > > +the runtime file.
> > > > +
> > > > +Concerning NIC handling, we build on top of the existing logic
> > > > +(first create a tap with _OpenTap() and then pass its file
> > > > +descriptor to the KVM process).
> > > > +To this end we need to pass access rights to the corresponding
> > > > +file descriptor over the monitor socket (UNIX domain socket).
> > > > +The open file is passed as a socket-level control message
> > > > +(SCM), using the ``fdsend`` python library.
> > > > +
> > >
> > > Is there any way to have this as a recommends but not depends
> > > (aka: hotplugging won't work if this is not present, but
> > > everything else will)?
> >
> > Sure.
>
> Thanks.

> > > > +
> > > > +User interface
> > > > +--------------
> > > > +
> > > > +The new ``--hotplug`` option to gnt-instance modify is
> > > > +introduced, which forces live modifications.
> > > > +
> > >
> > > What happens if they are unsupported (eg. other hypervisors)?
> > > Does the change fail?
> >
> > In cmdlib we get the instance's hypervisor before hotplugging
> > devices. There we could see whether the corresponding hypervisor
> > supports it or not.
>
> Yes, what the design should say is what happens by design. Do we
> proceed without hotplug, or do we fail the operation?

I don't know; both ways are reasonable and doable. I would suggest
printing a warning and proceeding without hotplug. Then the original
message, that modifications take place after reboot, will be shown.

> > > Should there be a way to use hotplug by default? (now, or later
> > > perhaps)?
> >
> > Yes. Switch the --hotplug option to --no-hotplug. :)
>
> Yes, but then perhaps there should be a cluster-level flag or an
> instance-level one, or a hypervisor parameter, saying "hotplug when
> possible"? Or just per-device? Anyway, no need to implement this in
> the first version.

If anything, I would go with an hv param.

> > > > +
> > > > +Enabling hotplug
> > > > +++++++++++++++++
> > > > +
> > > > +Hotplug will be optional during gnt-instance modify. For
> > > > +existing instances inside an old ganeti cluster we have two
> > > > +options after upgrading to a version that supports hotplugging:
> > > > +
> > > > +1.
> > > > +Instances in the cluster should be hard rebooted, or
> > > > +
> > > > +2. A specific script should change their runtime format before
> > > > +they can support hotplugging of devices.
> > > > +
> > > > +If neither happens, these instances will support neither
> > > > +migration nor hotplug.
> > > > +
> > >
> > > not supporting *migration*: definitely bad.
> >
> > Agree. The old format of the runtime files is to blame... :)
>
> > > > +
> > > > +NIC hotplug
> > > > ++++++++++++
> > > > +
> > > > +The user can add/modify/remove NICs with or without
> > > > +hotplugging. If a NIC is to be added, a tap is created first
> > > > +and configured properly with the kvm-vif-bridge script. Then
> > > > +the instance gets a new network interface. Since there is no
> > > > +QEMU monitor command to modify a NIC, we modify a NIC by
> > > > +temporarily removing the existing one and adding a new one with
> > > > +the new configuration. When removing a NIC, the corresponding
> > > > +tap gets removed as well.
> > > > +
> > > > +::
> > > > +
> > > > + gnt-instance modify --net add --hotplug test
> > > > + gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test
> > > > + gnt-instance modify --net 1:remove --hotplug test
> > > > +
> > >
> > > s/test/instance.example.com/ (so it's clear what it is)
> > > Also is it safe to "just" add and remove a nic from an instance?
> > > This is going to break things, I suppose.
> >
> > Not really. The instance will just see an interface disappear. If
> > this is the primary interface then it will lose internet connection
> > (if any).
>
> Sure, the problem here is that if you remove and re-add a NIC just
> because a parameter was changed, the instance risks not ever
> reconfiguring it, and as such never having connectivity again until
> the next reboot. As such, also here a pretty scary warning should be
> in place: this is not *guaranteed* to work as intended at all. :)

Agree.
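As an aside on the fd-passing mechanism mentioned earlier for NIC hotplug: the ``fdsend`` dependency wraps an SCM_RIGHTS ancillary message on the monitor's UNIX socket. A sketch using the modern stdlib equivalents (``socket.send_fds``/``recv_fds``, Python 3.9+; the monitor command string and function names are only illustrative):

```python
import socket

def send_tap_fd(monitor_sock, tap_fd, name):
    # Hand the open tap fd to the KVM process alongside a "getfd"
    # monitor command; the fd travels as SCM_RIGHTS ancillary data.
    socket.send_fds(monitor_sock, [("getfd %s\n" % name).encode()],
                    [tap_fd])

def recv_tap_fd(monitor_sock):
    # What the receiving side (QEMU, simulated here) would observe:
    # the command bytes plus a list of duplicated file descriptors.
    data, fds, _flags, _addr = socket.recv_fds(monitor_sock, 1024, 1)
    return data, fds
```

This is why hotplug can only be a soft dependency: without a library (or stdlib support) for SCM_RIGHTS, NIC hotplug is impossible, but everything else keeps working.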
I have tested it under debian (with and without network-manager) and
on windows, and it works. Warnings can be added for sure.

> > > > +
> > > > +Disk hotplug
> > > > +++++++++++++
> > > > +
> > > > +The user can add and remove disks with or without hotplugging.
> > > > +The QEMU monitor supports resizing of disks; however, the
> > > > +initial implementation will support only disk
> > > > +addition/deletion.
> > > > +
> > >
> > > It would definitely be nice to support resizing too, but ok, it
> > > can go in later.
> >
> > AGREE! But please later....
>
> Ok.

> > > > +::
> > > > +
> > > > + gnt-instance modify --disk add:size=1G --hotplug test
> > > > + gnt-instance modify --disk 1:remove --hotplug test
> > > > +
> > >
> > > Thanks,
> >
> > I'll try to implement something of the aforementioned and rebase it
> > to current master, you know, the one with NO cmdlib.py!
>
> Wait until we've ironed out all the design, please. As you can see,
> there are still a few concerns, I think.
> Also let's try to address Sasha's concerns above in the thread. In
> particular I share with him the view that hotplug with higher
> security shouldn't break (this is not mentioned in this design).
> Other points he makes are good, but indeed you're right that we have
> to start with a few things and then iterate and add. On the other
> hand let's design in a way that's extensible to those use cases. :)

OK. Sasha's feedback is more than helpful. I'll try to address the
issues that are raised, but I still think that those could extend the
initial implementation discussed here and do not affect the basic
design. To sum up:

1) The pci slot of each device must be saved in the backend for the
sake of migrations. Should this be a slot in the NIC and Disk objects,
or should it be saved as a tuple in serialized data?

2) If we want disk hotplug, the format of runtime files should change.
block_devices should be treated separately, just like NICs.
Migrating old instances without first rebooting them still remains an
issue.

3) Merge all rpc calls into one: hotplug_device(device, action).

4) Do not include disk resizing in the first implementation.

5) Add an hv param for enabling hotplug or not.

6) Warnings in case of live disk/NIC removal.

7) Warnings in case hotplug is not supported but the --hotplug option
is passed.

8) Address Sasha's concerns:
   * Parse the kvm help output to see if hotplug is supported.
   * Dig into the case where an instance does not run as root.

Please let me know if I missed something.

Thanks a lot for the discussion,
dimara