On Tue, Nov 12, 2013 at 2:13 PM, Guido Trotter <[email protected]> wrote: > On Tue, Nov 12, 2013 at 12:41 PM, Michele Tartara <[email protected]> wrote: >> Add the document describing a new design for the OS installation process for >> new instances. >> >> Signed-off-by: Michele Tartara <[email protected]> >> --- >> doc/design-draft.rst | 1 + >> doc/design-os.rst | 318 >> ++++++++++++++++++++++++++++++++++++++++++++++++++ >> 2 files changed, 319 insertions(+) >> create mode 100644 doc/design-os.rst >> >> diff --git a/doc/design-draft.rst b/doc/design-draft.rst >> index c821292..3ed3852 100644 >> --- a/doc/design-draft.rst >> +++ b/doc/design-draft.rst >> @@ -20,6 +20,7 @@ Design document drafts >> design-daemons.rst >> design-hsqueeze.rst >> design-ssh-ports.rst >> + design-os.rst >> >> .. vim: set textwidth=72 : >> .. Local Variables: >> diff --git a/doc/design-os.rst b/doc/design-os.rst >> new file mode 100644 >> index 0000000..7a42a7f >> --- /dev/null >> +++ b/doc/design-os.rst >> @@ -0,0 +1,318 @@ >> +=============================== >> +Ganeti OS installation redesign >> +=============================== >> + >> +.. contents:: :depth: 3 >> + >> +This is a design document detailing a new OS installation procedure, more >> +secure, able to provide more features and easier to use for many common >> tasks >> +w.r.t. the current one. >> + >> +Current state and shortcomings >> +============================== >> + >> +As of Ganeti 2.10, each instance is associated with an OS definition. An OS >> +definition is a set of scripts (``create``, ``export``, ``import``, >> ``rename``) >> +that are executed with root privileges on the primary host of the instance >> to >> +perform all the OS-related functionality (setting up an operating system >> inside >> +the disks of the instance being created, exporting/importing the instance, >> +renaming it). 
>> + >> +These scripts receive, as environment variables, a fixed set of parameters >> +describing the instance (such as the hypervisor, the name of the instance, >> the >> +number of disks, and their location) and a set of user defined parameters. >> Each >> +of these parameters is also written into the configuration file of Ganeti, >> to >> +allow for future reinstalls of the instance, and in various log files, >> namely: >> + >> +* node daemon log file: contains DEBUG strings of the ``/os_validate``, >> + ``/instance_os_add`` and ``/instance_start`` RPC calls. >> + >> +* master daemon log file: DEBUG strings related to the same RPC calls are >> stored >> + here as well. >> + >> +* commands log: the CLI commands that create a new instance, including their >> + parameters, are logged here. >> + >> +* RAPI log: the RAPI commands that create a new instances, including their >> + parameters, are logged here. >> + >> +* job logs: the job files stored in the job queue or in its archive contain >> the >> + parameters. >> + >> +The current situation presents a number of shortcomings: >> + >> +* Having the installation scripts run with root power on the nodes is a huge >> + security issue. >> + > > s/is a huge security issue/doesn't allow user-defined os scripts, as > they would pose a huge security issue/ > > Note that there's no security issue *per se* in the current situation, > if the OS scripts are trusted. > (except perhaps for export, if the os script mounts the instance disk, > which is also not necessarily the case)
Yes, that's what I meant. I'll reword it as you suggest. > > That said it could be a safety issue in the sense that an eventual > bug/error in the os script could risk disrupting the node. ACK > >> +* Ganeti cannot be used to create instances starting from user provided disk >> + images: even in the (hypothetical) case where the scripts are completely >> + secure and run not by root but by an unprivileged user with only the >> power to >> + mount arbitrary files as disk images, this is a security issue. It has >> been >> + proven that a carefully crafted file system might exploit kernel >> + vulnerabilities to gain control of the system. Therefore, directly >> mounting >> + images on the Ganeti nodes is not an option. >> + >> +* There is no way to inject files into an existing disk image. A common use >> case >> + is for the system administrator to provide a standard image of the >> system, to >> + be later personalized with the network configuration, private keys >> identifying >> + the machine, ssh keys of the users and so on. A possible workaround would >> be >> + for the scripts to mount the image (only if this is trusted!) and to >> receive >> + the configurations and ssh keys as user defined OS parameters. >> Unfortunately, >> + this is also not an option for security sensitive material (such as the >> ssh >> + keys) because the OS parameters are stored in many places on the system, >> as >> + already described above. >> + >> +* Most other virtualization software simply work with instance images, not >> with >> + installation scripts. This difference makes the interaction of Ganeti with >> + other softwares difficult. > > s/softwares/software/ ACK > >> + >> +Proposed changes >> +================ >> + >> +In order to fix the shortcomings of the current state, we plan to introduce >> the >> +following changes: >> + >> +* Change the OS parameters to have three categories: >> + >> + * ``public``: the current behavior. The parameter is logged and stored >> freely. 
>> + >> + * ``private``: the parameter is saved inside the Ganeti configuration (to >> allow >> + for instance reinstall) but it is not shown in logs, job logs, or passed >> back >> + via RAPI. >> + >> + * ``secret``: the parameter is not saved inside the Ganeti configuration. >> + Reinstall are impossible unless the data is passed again. The parameter >> will >> + not appear in any log file. In order to preserve the functionality of >> Ganeti, >> + the parameters will still need to be stored in the job files, but they >> will >> + be removed from there when the job has finished running (either >> successfully >> + or not). >> + > > Do we actually need to save them in the job files? > The job files could be saved (to disk) without, and in case the master > is failed over the job can be failed. > (this should make it a lot harder to access) Unfortunately, I think we need to save them. Currently the job is created by luxid, serialized, and then read from file and executed by masterd, as part of the ongoing migration of the job queue from masterd to luxid. >> +* A new OS installation procedure, based on a safe virtualized environment. >> + This virtualized environment will run with the same hardware parameter as >> the >> + actual instance being installed, as much as possible. This will also >> allow to >> + reduce the memory usage in the host (specifically, in Dom0 for Xen >> + installations). >> Each instance will have these possible execution modes: >> + >> + * ``run``: the default mode, used when the machine is running normally. >> + >> + * ``self_install``: Ganeti will start the instance with a different set of >> + user-specified parameters, therefore allowing to attach an installation >> + floppy/cdrom/network, change the boot device order, or specify an OS >> image >> + to be used. The instance will then be responsible to get the parameters >> for >> + configuring itself (its network interfaces, IP address, hostname, etc.) 
>> from
>> +    a set of metadata provided to it by Ganeti (e.g.: using an approach
>> +    comparable to the one of the ``cloud-init`` tool). When this
>> installation
>> +    mode is used, no OS installation script is required.
>> +    In order for installation of an OS from an image to be possible, a new
>> +    parameter ``--os-image`` will be added, allwoing to specify where to
>> take
>> +    the image from. It will have to be mutually exclusive with
>> ``--os-type``. If
>> +    ``--os-image`` is specified, ``--os-parameters`` can still be used, as
>> it
>> +    will be passed to the instance as part of the metadata.
>> +    The set of ``self_install`` parameters will be stored as part of the
>> +    instance configuration, so that they can be used to reinstall the
>> instance.
>> +    It will be the user's responsibility to ensure that the OS image or any
>> +    installation media is still available in the proper position when a
>> +    reinstall happens.
>> +
>
> Should we use --os-type image:<name> and/or have an image os provider
> that defines:
> 1) the actual parameters needed for installation
> 2) the image (eg. the verify script could double check that the image
> is available from the node or accessible via the network...)
>
> I think in particular it would be useful to still have the concept of
> an OS "provider" that tells ganeti how to install itself (which
> parameters to use). This of course could be overridable, but at least
> there would be a sane default without relying on the user to "get it
> right".

Regarding using --os-type image:<name>:
That was my initial thought too, and also my favorite choice. Still,
given that we usually want to keep backwards compatibility, this would
cause problems if somebody has an OS definition called "image".
Furthermore, that name would become reserved in the future.
If you think it is a small enough risk, and listing this in the
"incompatible changes" section of the NEWS file is enough, then I'm
absolutely in favor of doing it.
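
For comparison, this is roughly what the separate-option variant from the
patch looks like from the command line side. It is only a standalone
sketch using Python's argparse, not the actual gnt-instance option
handling; the parser and the example values are made up for
illustration.

```python
import argparse


def build_parser():
    """Build a toy parser with the options named in the design."""
    parser = argparse.ArgumentParser(prog="toy-instance-add")
    # --os-type and --os-image must not be given together.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--os-type", help="name of an OS definition")
    source.add_argument("--os-image", help="where to take the OS image from")
    # OS parameters stay usable in both cases; with --os-image they would
    # be handed to the instance as part of the metadata.
    parser.add_argument("--os-parameters", default="",
                        help="comma-separated key=value pairs")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout no OS name has to be reserved, at the price of one
more option.
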
Regarding the os provider: my idea here was to make it possible to use
Ganeti without having to supply a provider, with just an OS image plus
some "gnt-instance add" parameters. This gives a more standard approach,
similar to what other solutions do. Requiring an OS provider for this as
well would defeat that purpose. Moreover, providing an installation
script would still be an option, so whoever wants an OS provider can
have one.

>
>> +  * ``install``: Ganeti will start the instance using a virtual appliance
>> +    specifically made for installing Ganeti instances. Scripts analogous to
>> the
>> +    current ones will run inside this instance. The disks of the instance
>> being
>> +    installed will be connected to this virtual appliance, so that the
>> scripts
>> +    can mount them and modify them as needed, as currently happens, but
>> with the
>> +    additional protection given by this happening in a VM. The virtual
>> appliance
>> +    will be started in a clean state every time a new instance need to be
>> +    created, to further increase security. Metadata will be provided also to
>> +    this virtual applicance, that will take care of converting them to
>> +    environment variables for the installation scripts.
>> +
>
> Please specify better that by "will be started in a clean state" you
> actually mean "the disk will be reset to its pristine state and not
> reused between reinstallation" because it might be construed to mean
> just the "booting" (runtime info) which is sort of less strict.

ACK

>
>> +In order to allow for the metadata to be sent inside the instance, a
>> +communication mechanism between the instance and the host will be created.
>> This
>> +mechanism will be bidirectional (e.g.: to allow the setup process going on
>> +inside the instance to communicate its progress to the host). Each instance
>> will
>> +have access exclusively to its own metadata, and it will be only able to
>> +communicate with its host over this channel.
>> + > > Too vague :) It's intentionally vague: here it's just meant to state the problem. The actual description of the metadata and the communication mechanism is in the implementation section. I'll add a reference to that from here. > > >> +As part of the instance creation command it will be possible to indicate a >> URL >> +for a "personalization package", that is an archive containing a set of >> files >> +meant to be overlayed on top of the operating system file system at the end >> of >> +the setup process, before the VM is started for the first time in ``run`` >> mode. >> +Ganeti will provide a mechanism for receiving and unpacking this archive as >> part >> +of the ``install`` execution mode, whereas in ``self_install`` mode it will >> only >> +be provided as a metadata for the instance to use. >> +The archive will be in TAR-GZIP format (with extension ``.tar.gz`` or >> ``.tgz``) >> +and will contain the files according to the directory structure that will be >> +recreated on the installation disk. Files contained in this archive will >> +overwrite files with the same path created during the install procedure (if >> +any). >> +The URL of the "personalization package" will have to specify an extesion to >> +identify the file format (in order to allow for more formats to be >> supported in >> +the future). >> +The URL will be stored as part of the configuration of the instance >> (therefore, >> +the URL should not contain confidential information, but the file there >> +available can). It is up to the system administrator to ensure that a >> package >> +is actually available at that URL at install and reinstall time. >> +The content of the package is allowed to change. E.g.: a system >> administrator >> +might create a package containing the private keys of the instance being >> +created. 
When the instance is reinstalled, a new package with new keys can >> be >> +made available there, therefore allowing instance reinstall without the >> need to >> +store keys. >> + > > Add something about authentication perhaps (so that an admin can have > a file available only to the ganeti installer only for the time of the > installation) and also about the fact that we won't cache/keep the > file on the node OS. ACK > >> +Implementation >> +============== >> + >> +The implementation of this design will happen as an ordered sequence of >> steps, >> +of increasing impact on the system and, in some cases, dependent on each >> other: >> + >> +#. Private and secret instance parameters >> +#. Communication mechanism between host and instance >> +#. Metadata service >> +#. Personalization package >> +#. ``self_install`` mode >> +#. ``install`` mode (with virtualization environment) >> + >> +Some of these steps need to be more deeply specified w.r.t. what is already >> +written in the `Proposed changes`_ Section. Extra details will be provided >> in >> +the following Subsections. >> + >> +Communication mechanism and metadata service >> +++++++++++++++++++++++++++++++++++++++++++++ >> + >> +The communication mechanism and the metadata service are described together >> +because they are deeply tied. On the other hand, the communication mechanism >> +will need to be more generic because it can be used for other reasons in the >> +future (like allowing instances to esplicitly send commands to Ganeti, or >> to let > > explicitly ACK > >> +Ganeti control a helper instance, like the one hereby introduced for >> performing >> +OS installs inside a safe environment). 
>> + >> +The communication mechanism will be enabled automatically when the instance >> is >> +in ``self_install`` or ``install`` mode, but for backwards compatibility it >> will >> +be disabled when the instance is in ``run`` mode unless it is esplicitly > > ^ see above ACK > >> +requested at instance startup by using a new, ad-hoc, parameter >> +(``--communication``). > > Which parameter is this? An instance, hypervisor or backend parameter? And > why? > Also -C could do as well (if we go for instance level). Remember to > specify here as it has to be clear that an instance once configured > that way will be always started that way. > Yes, it's intended to be an instance level parameter. I'll specify that it is set at creation time, or modifiable with "gnt-instance modify", and then is automatically read from the config and used every time the instance is started. >> + >> +When the communication mechanism is enabled, Ganeti will create a new >> network >> +interface inside the instance. This extra network interface will be the >> last one >> +of the instance, after all the user defined ones. On the host side, this >> +interface will be only accessible to the host itself, and not be routed >> outside >> +the machine. > > Actually it would be great if we didn't even have to create the tap. Do you mean something like (for kvm): -net user,net=169.254.169.0/24,host=169.254.169.254 that starts a user network showing the host as reachable with address 169.254.169.254? > >> +On this network interface, the instance will connect using the IP: >> +169.254.169.1 and netmask 255.255.255.0. >> +The host will be on the same network, with the IP address: 169.254.169.254. >> +The instance will be able to connect to 169.254.169.254:80, and issue GET >> +requests to an HTTP server that will provide the instance metadata. 
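
To make the intended guest-side usage concrete, here is a minimal sketch
of how a tool inside the instance might build and fetch the metadata
URLs. It is not part of the patch: the function names are hypothetical,
the Ganeti path uses the /ganeti/<version>/ layout as corrected further
down in this thread, and the fetch only works from inside an instance
that has the communication mechanism enabled.

```python
import json
import urllib.request

# Address proposed in the design for the metadata server.
METADATA_HOST = "169.254.169.254"


def metadata_url(flavor, version="latest"):
    """Return the metadata URL for one of the layouts discussed here."""
    if flavor == "openstack":
        path = "/openstack/%s/meta_data.json" % version
    elif flavor == "ec2":
        path = "/%s/meta-data/" % version
    elif flavor == "ganeti":
        # Assumption: corrected layout, /ganeti/<version>/*
        path = "/ganeti/%s/meta_data.json" % version
    else:
        raise ValueError("unknown metadata flavor: %r" % (flavor,))
    return "http://%s%s" % (METADATA_HOST, path)


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Inside an instance, something like fetch_json(metadata_url("ganeti"))
would then return the Ganeti-specific metadata as a dictionary.
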
>> +
>> +The choice of this IP address and port is done for compatibility reasons
>> with
>> +OpenStack's and Amazon EC2's ways of providing metadata to the instance.
>> +
>> +Where possible, the metadata will be provided in a way compatible with
>> OpenStack
>> +at::
>> +
>> +  http://169.254.169.254/openstack/<version>/meta_data.json
>> +
>> +or with Amazon EC2, at::
>> +
>> +  http://169.254.169.254/<version>/meta-data/*
>> +
>> +If some metadata are Ganeti-specific and don't fit this structure, they
>> will be
>> +provided at::
>> +
>> +  http://169.254.169.254/<version>/ganeti/meta_data.json
>> +
>
> Not quite clear! :) How does the OS choose between those? How are they
> expected to differ?

The idea is to provide the data in both formats, so the OS can choose
based on its own preferences (there are some tools already getting the
data from those positions, such as cloud-init).

>
>> +``<version>`` is either a date in YYYY-MM-DD format, or ``latest`` to
>> indicate
>> +the most recent available protocol version.
>> +
>
> Is this what openstack and EC2 do?

Yes, I'm writing this here just as a clarification, but it's exactly
their format.

>
>> +A bi-directional, pipe-like communication channel will be provided. The
>> instance
>> +will be able to receive data from the host by a GET request at::
>> +
>> +  http://169.254.169.254/<version>/ganeti/pipe_in
>> +
>> +and to send data to the host by a POST request at::
>> +
>> +  http://169.254.169.254/<version>/ganeti/pipe_out
>> +
>
> Why is it /openstack/<version>
> but <version>/meta-data
> and <version>/ganeti ?
> Can we have it a bit more logical?
EC2 is:
/<version>/meta-data/*

OpenStack came later but wanted to keep compatibility, so they created
their own directory, including their own API version number:
/openstack/<version>/meta_data.json

And Ganeti is supposed to follow the same style as OpenStack, but I
wrote it wrong, sorry for the mistake:
/ganeti/<version>/*

>
>> +As in a pipe, once the data are read, they will not be in the buffer
>> anymore, so
>> +subsequent get request to ``pipe_in`` will not return the same data twice.
>> +Unlike a pipe, though, it will not be possible to perform blocking I/O
>> +operations.
>> +
>
> So maybe we should just call it read and write? :)

Perfectly fine for me.

>> +The OS parameters will be accessible through a GET
>> +request at::
>> +
>> +  http://169.254.169.254/<version>/ganeti/os/parameters/<visibility>.json
>> +
>> +as a JSON serialized dictionary. ``<visibility>`` will be either ``public``
>> or
>> +``private`` or ``secret``.
>> +
>
> Why does the instance care about the visibility, and why is this
> provided at the file level? Couldn't a single json contain all info,
> with also ancillary data to specify the level of confidentiality?

Yes, a single file is also possible.

>
>> +The installation scripts to be run inside the virtualized environment while
>> the
>> +instance is run in ``install`` mode will be available at::
>> +
>> +  http://169.254.169.254/<version>/ganeti/os/scripts/<script_name>
>> +
>> +where ``<script_name>`` is the name of the script.
>> +
>> +The host and the instances (as detailed in `Installation process in a
>> +virtualized environment`_) will be able to create other communication
>> channels
>> +on the other ports of the same IP address.
>> +
>
> Why not at other URLs?

In the design with an actual network interface, ports come "for free".
If we move towards a design with no TAP device, this is probably going
to be more difficult, and giving users some way to publish information
at other URLs in this hierarchy becomes more interesting.

>
>> +
>> +Rationale
>> +---------
>> +
>> +The choice of using a network interface for instance-host communication, as
>> +opposed to VirtIO, XenBus or other methods, is due to the will of having a
>> +generic, hypervisor-independent way of creating a communication channel,
>> that
>> +doesn't require unusual (para)virtualization drivers.
>> +At the same time, a network interface was preferred over solutions involving
>> +virtual floppy or USB devices because the latter tend to be detected and
>> +configured by the guest operating systems, sometimes even in prominent
>> positions
>> +in the user interface, whereas it is fairly common to have an unconfigured
>> +network interface in a system, usually without any negative side effects.
>> +
>> +
>> +Installation process in a virtualized environment
>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>> +
>> +In the new OS installation scenario, we distinguish between trusted and
>> +untrusted code.
>> +
>> +The trusted installation code maintains the behavior of the current one,
>> with
>> +the scripts running on the node the instance is being created on. The
>> untrusted
>> +code is stored in a subdirectory of the OS definition called ``untrusted``.
>> +This directory contains scripts that are equivalent to the already existing
>> +ones (``create``, ``export``, ``import``, ``rename``) but that will be run
>> +inside an virtualized environment, to protect the host from malicious
>> tampering.
>> +
>> +The ``untrusted`` code is meant to either be untrusted itself, or to be
>> trusted
>> +code running operations that might be dangerous (such as mounting a
>> +user-provided image).
>> +
>> +In order to allow for the highest flexibility, if both a trusted and an
>> +untrusted script are provided for the same operation (i.e. ``create``),
>> both of
>> +them will be executed at the same time, one on the host, and one inside the
>> +installation appliance. They will be allowed to communicate with each other
>> +through the already described communication mechanism, in order to
>> orchestrate
>> +their execution (e.g.: the untrusted code might execute the installation,
>> while
>> +the trusted one receives status updates from it and delivers them to a user
>> +interface).
>> +
>
> Sounds a bit clunky, and makes it hard to provide OS definitions from
> the user (as an admin I have to "open" them and check that the trusted
> scripts are empty or allowed... maybe this should be a new version and
> disallow the old way altogether.

For user-provided scripts, an administrator might simply decide that
they are always untrusted and allow only the untrusted part, which
requires only a very simple check.
I agree that having the new kind of scripts be completely untrusted and
always run inside the VM would be the simplest and cleanest solution.
I wrote the proposal this way to meet some explicit requests from the
open source community, looking for a way to have trusted and untrusted
code running together and synchronized through the communication
mechanism.
Maybe we can leave this in the design, marked as optional, and hope for
some code contribution?

>
>> +Ganeti will provide a script to be run at install time that can be used to
>> +create the virtualized environment that will perform the OS installation of
>> new
>> +instances.
>> +This script will build a debootstrapped basic debian system including
>> including
>
> s/including including/including/
>
>> +a software that will read the metadata, setup the environment variables and
>> +launch the installation scripts inside the virtualized environment.
The
>> script
>> +will also provide hooks for personalization.
>> +
>
>
>
>> +It will also be possible to use other self-made virtualized environment, as
>> long
>> +as they connect to ganeti over the described communication mechanism and
>> they
>> +know how to read and use the provided metadata to create a new instance.
>> +
>> +While performing an installation in the virtualized environment, a
>> +personalizable timeout will be used to detect possible problems with the
>> +installation process, and to kill the virtualized environment.
>> +
>
> Will the timeout be reset upon communication? Will there be a way to reset it?
> How will it be customizable? Who specifies where to customize it?

I think the timeout should be cluster-wide, set by the administrator of
the cluster, and should not be reset upon communication. It is meant to
prevent an installation VM from running freely and uncontrolled (mainly
in case it is taken over by malicious untrusted scripts), so a reset
upon communication would make it fairly useless.

Thanks,
Michele
