On Thu, Jan 9, 2014 at 11:51 AM, Jose A. Lopes <[email protected]> wrote:
> Design document for KVM daemon which is needed by the instance
> shutdown detection for KVM.
>
> Signed-off-by: Jose A. Lopes <[email protected]>
> ---
>  Makefile.am         |   1 +
>  doc/design-kvmd.rst | 216 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  doc/index.rst       |   1 +
>  3 files changed, 218 insertions(+)
>  create mode 100644 doc/design-kvmd.rst
>
> diff --git a/Makefile.am b/Makefile.am
> index 10e9962..1249bd5 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -520,6 +520,7 @@ docinput = \
>         doc/design-hugepages-support.rst \
>         doc/design-impexp2.rst \
>         doc/design-internal-shutdown.rst \
> +       doc/design-kvmd.rst \
>         doc/design-linuxha.rst \
>         doc/design-lu-generated-jobs.rst \
>         doc/design-monitoring-agent.rst \
> diff --git a/doc/design-kvmd.rst b/doc/design-kvmd.rst
> new file mode 100644
> index 0000000..b627b35
> --- /dev/null
> +++ b/doc/design-kvmd.rst
> @@ -0,0 +1,216 @@
> +==========
> +KVM daemon
> +==========
> +
> +.. toctree::
> +   :maxdepth: 2
> +
> +This design document describes the KVM daemon, which is responsible for
> +determining whether a given KVM instance was shut down by an
> +administrator or a user.
> +
> +
> +Current state and shortcomings
> +==============================
> +
> +This design document describes the KVM daemon which addresses the KVM
> +side of the user-initiated shutdown problem introduced in
> +:doc:`design-internal-shutdown`.  We are also interested in keeping this
> +functionality optional.  That is, an administrator does not necessarily
> +have to run the KVM daemon if he is running Xen or if, even though he
> +is running KVM, he is not interested in instance shutdown detection.
> +This requirement is important because it means the KVM daemon should
> +be a modular component in the overall Ganeti design, i.e., it should
> +be easy to enable and disable it.
> +
> +Proposed changes
> +================
> +
> +The instance shutdown feature for KVM requires listening for events from
> +the Qemu Machine Protocol (QMP) Unix socket, which is created together
> +with a KVM instance.  A QMP socket typically looks like
> +``/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp`` and implements
> +the QMP protocol.  This is a bidirectional protocol that allows Ganeti
> +to send commands, such as system powerdown, as well as receive events,
> +such as the powerdown and shutdown events.
> +
> +Listening in on these events allows Ganeti to determine whether a given
> +KVM instance was shut down by an administrator, either through
> +``gnt-instance stop|remove <instance>`` or ``kill -KILL
> +<instance-pid>``, or by a user, through ``poweroff`` from inside the
> +instance.  Upon an administrator powerdown, the QMP protocol sends two
> +events, namely, a powerdown event and a shutdown event, whereas upon a
> +user shutdown only the shutdown event is sent.  This is enough to
> +distinguish between an administrator and a user shutdown.  However,
> +there is one limitation, namely ``kill -TERM <instance-pid>``.  Even
> +though this is an action performed by the administrator, it will be
> +considered a user shutdown by the approach described in this document.
> +
> +Several design strategies were considered.  Most of these strategies
> +consisted of spawning some process listening on the QMP socket when a
> +KVM instance is created.  However, having a listener process per KVM
> +instance is not scalable.  Therefore, a different strategy is proposed,
> +namely, having a single process, called the KVM daemon, listening on the
> +QMP sockets of all KVM instances within a node.  That also means there
> +is an instance of the KVM daemon on each node.
> +
> +In order to implement the KVM daemon, two problems need to be addressed,
> +namely, how the KVM daemon knows when to open a connection to a given
> +QMP socket and how the KVM daemon communicates with Ganeti whether a
> +given instance was shut down by an administrator or a user.
> +
> +QMP connections management
> +--------------------------
> +
> +As mentioned before, the QMP sockets reside in the KVM control
> +directory, which is usually located under
> +``/var/run/ganeti/kvm-hypervisor/ctrl/``.  When a KVM instance is
> +created, a new QMP socket for this instance is also created in this
> +directory.
> +
> +In order to simplify the design of the KVM daemon, instead of having
> +Ganeti communicate to this daemon through a pipe or socket the creation
> +of a new KVM instance, and thus a new QMP socket, this daemon will
> +monitor the KVM control directory using ``inotify``.  As a result, the
> +daemon is not only able to deal with KVM instances being created and
> +removed, but also capable of overcoming other problematic situations
> +concerning the filesystem, such as the case when the KVM control
> +directory does not exist because, for example, Ganeti was not yet
> +started, or the KVM control directory was removed, for example, as a
> +result of a Ganeti reinstallation.
> +
> +Shutdown detection
> +------------------
> +
> +As mentioned before, the KVM daemon is responsible for opening a
> +connection to the QMP socket of a given instance and listening in on the
> +shutdown and powerdown events, which allow the KVM daemon to determine
> +whether the instance stopped because of an administrator or user
> +shutdown.  Once the instance is stopped, the KVM daemon needs to
> +communicate to Ganeti whether the user was responsible for shutting down
> +the instance.
> +
> +In order to achieve this, the KVM daemon writes an empty file, called
> +the shutdown file, in the KVM control directory with a name similar to
> +the QMP socket file but with the extension ``.qmp`` replaced with
> +``.shutdown``.  The presence of this file indicates that the shutdown
> +was initiated by a user, whereas the absence of this file indicates that
> +the shutdown was caused by an administrator.  This strategy also allows
> +crashes and signals, such as ``SIGKILL``, to be handled correctly,
> +given that in these cases the KVM daemon never receives the powerdown
> +and shutdown events and, therefore, never creates the shutdown file.
> +
> +KVM daemon launch
> +-----------------
> +
> +With the above issues addressed, a question remains as to when the KVM
> +daemon should be started.  The KVM daemon is different from other Ganeti
> +daemons, which start together with the Ganeti service, because the KVM
> +daemon is optional, given that it is specific to KVM and should not be
> +run on installations containing only Xen, and, even in a KVM
> +installation, the user might still choose not to enable it.  Moreover,
> +the KVM daemon is not really necessary until the first KVM
> +instance is started.  For these reasons, the KVM daemon is started from
> +within Ganeti when a KVM instance is started, and the job process
> +spawned by the node daemon is responsible for starting the KVM daemon.
> +
> +Given the current design of Ganeti, in which the node daemon spawns a
> +job process to handle the creation of the instance, when launching the
> +KVM daemon it is necessary to first check whether an instance of this
> +daemon is already running and, if this is not the case, then the KVM
> +daemon can be safely started.
> +
> +Design alternatives
> +===================
> +
> +At first, it might seem natural to include the instance shutdown
> +detection for KVM in the node daemon.  After all, the node daemon is
> +already responsible for managing instances, for example, starting and
> +stopping an instance.  Nevertheless, the node daemon is more complicated
> +than it might seem at first.
> +
> +The node daemon is composed of the main loop, which runs in the main
> +thread and is responsible for receiving requests and spawning jobs for
> +handling these requests, and the jobs, which are independent processes
> +spawned for executing the actual tasks, such as creating an instance.
> +
> +Including instance shutdown detection in the node daemon is not viable
> +because adding it to the main loop would cause KVM-specific code to
> +taint the generality of the node daemon.  In order to add it to the job
> +processes, it would be possible to spawn either a foreground or a
> +background process.  However, these options are also not viable because
> +they would lead to the situation described before where there would be a
> +monitoring process per instance, which is not scalable.  Moreover, the
> +foreground process has an additional disadvantage: it would require
> +modifying the node daemon so that it no longer expects jobs to
> +terminate, which is what the current node daemon design assumes.
> +
> +There is another design issue to have in mind.  We could reconsider the
> +place to write the data that tell Ganeti whether an instance was
> +shut down by an administrator or the user.  Instead of using the KVM
> +shutdown files presented above, in which the presence of the file
> +indicates a user shutdown and its absence an administrator shutdown, we
> +could store a value in the KVM runtime state file, which is where the
> +relevant KVM state information is.  The advantage of this approach is
> +that it would keep the KVM-related information in one place, thus making
> +it easier to manage.  However, it would lead to a more complex
> +implementation and, in the context of the general transition in Ganeti
> +from Python to Haskell, a simpler implementation is preferred.
> +
> +Finally, it should be noted that the KVM runtime state file benefits
> +from automatic migration.  That is, when an instance is migrated so is
> +the KVM state file.  However, the instance shutdown detection for KVM
> +does not require this feature and, in fact, migrating the instance
> +shutdown state would be incorrect.
> +
> +Further considerations
> +======================
> +
> +There are potential race conditions between Ganeti and the KVM daemon,
> +although in practice they seem unlikely.  For example, the KVM daemon
> +needs to add and remove watches to the parent directories of the KVM
> +control directory until this directory is finally created.  It is
> +possible that Ganeti creates this directory and a KVM instance before
> +the KVM daemon has a chance to add a watch to the KVM control directory,
> +thus causing this daemon to miss the ``inotify`` creation event for the
> +QMP socket.
> +
> +There are other problems which arise from the limitations of
> +``inotify``.  For example, if the KVM daemon is started after the first
> +Ganeti instance has been created, then ``inotify`` will not produce
> +any event for the creation of the QMP socket.  This can happen, for
> +example, if the KVM daemon needs to be restarted or upgraded.  As a
> +result, it might be necessary to have an additional mechanism that runs
> +at KVM daemon startup or at regular intervals to ensure that the current
> +KVM internal state is consistent with the actual contents of the KVM
> +control directory.
> +
> +Another race condition occurs when Ganeti shuts down a KVM instance
> +using force.  Ganeti uses ``TERM`` signals to stop KVM instances when
> +force is specified or ACPI is not enabled.  However, as mentioned
> +before, ``TERM`` signals are interpreted by the KVM daemon as a user
> +shutdown.  As a result, the KVM daemon creates a shutdown file which
> +then must be removed by Ganeti.  The race condition occurs because the
> +KVM daemon might create the shutdown file after the hypervisor code that
> +tries to remove this file has already run.  In practice, the race
> +condition seems unlikely because Ganeti stops the KVM instance in a
> +retry loop, which allows Ganeti to stop the instance and clean up its
> +runtime information.
> +
> +It is possible to determine if a process, in this particular case the
> +KVM process, was terminated by a ``TERM`` signal, using the `proc
> +connector and socket filters
> +<https://web.archive.org/web/20121025062848/http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/>`_.
> +The proc connector is a socket connected between a userspace process and
> +the kernel through the netlink protocol and can be used to receive
> +notifications of process events, and socket filters are a mechanism
> +for subscribing only to events that are relevant.  There are several
> +`process events <http://lwn.net/Articles/157150/>`_ which can be
> +subscribed to; however, in this case, we are interested only in the exit
> +event, which carries information about the exit signal.
> +
> +.. vim: set textwidth=72 :
> +.. Local Variables:
> +.. mode: rst
> +.. fill-column: 72
> +.. End:
> diff --git a/doc/index.rst b/doc/index.rst
> index 7ec8162..df443e0 100644
> --- a/doc/index.rst
> +++ b/doc/index.rst
> @@ -116,6 +116,7 @@ Draft designs
>     design-device-uuid-name.rst
>     design-hroller.rst
>     design-hotplug.rst
> +   design-kvmd.rst
>     design-linuxha.rst
>     design-lu-generated-jobs.rst
>     design-monitoring-agent.rst
> --
> 1.8.5.1
>

LGTM, thanks.

Michele

-- 
Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
