Re: [PATCH master] Design doc for internal shutdown detection

Guido Trotter Wed, 22 May 2013 01:34:24 -0700

On Tue, May 21, 2013 at 6:29 PM, Michele Tartara <[email protected]>wrote:


> On Tue, May 21, 2013 at 2:45 PM, Guido Trotter <[email protected]>wrote:
>
>>
>>
>>
>> On Tue, May 21, 2013 at 2:37 PM, Michele Tartara <[email protected]>wrote:
>>
>>> On Sat, May 18, 2013 at 10:07 AM, Michele Tartara 
>>> <[email protected]>wrote:
>>>
>>>> +list
>>>>
>>>>
>>>>
>>>> On Fri, May 17, 2013 at 4:26 PM, Michele Tartara 
>>>> <[email protected]>wrote:
>>>>
>>>>> On Fri, May 17, 2013 at 2:50 PM, Guido Trotter 
>>>>> <[email protected]>wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, May 17, 2013 at 2:30 PM, Michele Tartara <[email protected]
>>>>>> > wrote:
>>>>>>
>>>>>>> On Fri, May 17, 2013 at 10:54 AM, Guido Trotter <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 17, 2013 at 10:46 AM, Michele Tartara <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Ganeti is currently not able to detect a legit shutdown request
>>>>>>>>> performed by a
>>>>>>>>> user from inside a Xen domain.
>>>>>>>>>
>>>>>>>>> This patch provides a design document to implement a mechanism
>>>>>>>>> able to cope with
>>>>>>>>> such events.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Michele Tartara <[email protected]>
>>>>>>>>> ---
>>>>>>>>>  Makefile.am                      |  1 +
>>>>>>>>>  doc/design-draft.rst             |  1 +
>>>>>>>>>  doc/design-internal-shutdown.rst | 72
>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>  3 files changed, 74 insertions(+)
>>>>>>>>>  create mode 100644 doc/design-internal-shutdown.rst
>>>>>>>>>
>>>>>>>>> diff --git a/Makefile.am b/Makefile.am
>>>>>>>>> index 037cf53..f66624e 100644
>>>>>>>>> --- a/Makefile.am
>>>>>>>>> +++ b/Makefile.am
>>>>>>>>> @@ -410,6 +410,7 @@ docinput = \
>>>>>>>>>         doc/design-htools-2.3.rst \
>>>>>>>>>         doc/design-http-server.rst \
>>>>>>>>>         doc/design-impexp2.rst \
>>>>>>>>> +       doc/design-internal-shutdown.rst \
>>>>>>>>>         doc/design-lu-generated-jobs.rst \
>>>>>>>>>         doc/design-linuxha.rst \
>>>>>>>>>         doc/design-multi-reloc.rst \
>>>>>>>>> diff --git a/doc/design-draft.rst b/doc/design-draft.rst
>>>>>>>>> index ccb2f93..9a1d2b1 100644
>>>>>>>>> --- a/doc/design-draft.rst
>>>>>>>>> +++ b/doc/design-draft.rst
>>>>>>>>> @@ -19,6 +19,7 @@ Design document drafts
>>>>>>>>>     design-storagetypes.rst
>>>>>>>>>     design-reason-trail.rst
>>>>>>>>>     design-device-uuid-name.rst
>>>>>>>>> +   design-internal-shutdown.rst
>>>>>>>>>
>>>>>>>>>  .. vim: set textwidth=72 :
>>>>>>>>>  .. Local Variables:
>>>>>>>>> diff --git a/doc/design-internal-shutdown.rst
>>>>>>>>> b/doc/design-internal-shutdown.rst
>>>>>>>>> new file mode 100644
>>>>>>>>> index 0000000..836d00c
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/doc/design-internal-shutdown.rst
>>>>>>>>> @@ -0,0 +1,72 @@
>>>>>>>>> +============================================================
>>>>>>>>> +Detection of user-initiated shutdown from inside an instance
>>>>>>>>> +============================================================
>>>>>>>>> +
>>>>>>>>> +.. contents:: :depth: 2
>>>>>>>>> +
>>>>>>>>> +This is a design document detailing the implementation of a way
>>>>>>>>> for Ganeti to
>>>>>>>>> +detect whether a machine marked as up but not running was
>>>>>>>>> shutdown gracefully
>>>>>>>>> +by the user from inside the machine itself.
>>>>>>>>> +
>>>>>>>>> +Current state and shortcomings
>>>>>>>>> +==============================
>>>>>>>>> +
>>>>>>>>> +Ganeti keeps track of the desired status of instances in order to
>>>>>>>>> be able to
>>>>>>>>> +take proper actions (e.g.: reboot) on the ones that happen to
>>>>>>>>> crash.
>>>>>>>>> +Currently, the only way to properly shut down a machine is
>>>>>>>>> through Ganeti's own
>>>>>>>>> +commands, that will mark an instance as ``ADMIN_down``.
>>>>>>>>> +If a user shuts down an instance from inside, through the proper
>>>>>>>>> command of the
>>>>>>>>> +operating system it is running, the instance will be shutdown
>>>>>>>>> gracefully, but
>>>>>>>>> +Ganeti is not aware of that: the desired status of the instance
>>>>>>>>> will still be
>>>>>>>>> +marked as ``running``, so when the watcher realises that the
>>>>>>>>> instance is down,
>>>>>>>>> +it will restart it. This behaviour is usually not what the user
>>>>>>>>> expects.
>>>>>>>>> +
>>>>>>>>> +Proposed changes
>>>>>>>>> +================
>>>>>>>>> +
>>>>>>>>> +We propose to modify Ganeti in such a way that it will detect
>>>>>>>>> when an instance
>>>>>>>>> +was shutdown because of an explicit user request. When such a
>>>>>>>>> situation is
>>>>>>>>> +detected, the state of the instance will be set to ADMIN_down, as
>>>>>>>>> intended by
>>>>>>>>> +the user.
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>
>>>>>>>> Should we provide an option to just restart it in that case? (as an
>>>>>>>> hv parameter)
>>>>>>>> The default could still be to leave it down.
>>>>>>>>
>>>>>>>
>>>>>>> Do you mean to have an option to tell the system to behave as it
>>>>>>> does currently? Is that really useful?
>>>>>>>
>>>>>>>
>>>>>> Yes. One thing is "we don't want these to appear as errors, like if
>>>>>> the instance have crashed", and a different one is "we're ok if they 
>>>>>> don't
>>>>>> come back up". That said, perhaps if the user does "halt" it's ok to 
>>>>>> halt,
>>>>>> and if they do reboot to reboot. So that should be the default.
>>>>>>
>>>>>
>>>>> Ack.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> +This design document applies to the Xen backend of Ganeti,
>>>>>>>>> because it uses
>>>>>>>>> +features specific of such hypervisor.
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>
>>>>>>>> Is there any way to do something similar at least for kvm? I think
>>>>>>>> this is worth investigating.
>>>>>>>>
>>>>>>>
>>>>>>> I had a brief look, and unfortunately it seems like it's not
>>>>>>> possible. But I'll spend some more time trying to do it.
>>>>>>>
>>>>>>
>>>>>> I see a -no-shutdown option in qemu. Not sure what kind of
>>>>>> information the monitor gives, in that case.
>>>>>>
>>>>>
>>>>> Thanks for the hint, I'll have a look at it.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> +Implementation
>>>>>>>>> +==============
>>>>>>>>> +
>>>>>>>>> +Xen knows why a domain is being shut down (a crash or an explicit
>>>>>>>>> shutdown
>>>>>>>>> +or poweroff request), but such information is not usually readily
>>>>>>>>> available
>>>>>>>>> +externally, because all such cases lead to the virtual machine
>>>>>>>>> being destroyed
>>>>>>>>> +immediately after the event is detected.
>>>>>>>>> +
>>>>>>>>> +Still, Xen allows the instance configuration file to define what
>>>>>>>>> action to be
>>>>>>>>> +taken in all those cases through the ``on_poweroff``,
>>>>>>>>> ``on_shutdown`` and
>>>>>>>>> +``on_crash`` variables. By setting them to ``preserve``, Xen will
>>>>>>>>> avoid
>>>>>>>>> +destroying the domains automatically.
>>>>>>>>> +
>>>>>>>>> +When the domain is not destroyed, it can be viewed by using ``xm
>>>>>>>>> list`` (or ``xl
>>>>>>>>> +list`` in newer Xen versions), and the ``State`` field of the
>>>>>>>>> output will
>>>>>>>>> +provide useful information.
>>>>>>>>> +
>>>>>>>>> +If the state is ``----c-`` it means the instance has crashed.
>>>>>>>>> +
>>>>>>>>> +If the state is ``---s--`` it means the instance was properly
>>>>>>>>> shutdown.
>>>>>>>>> +
>>>>>>>>> +If the instance was properly shutdown and it is still marked as
>>>>>>>>> ``running`` by
>>>>>>>>> +Ganeti, it means that it was shutdown from inside by the user,
>>>>>>>>> and the ganeti
>>>>>>>>> +status of the instance needs to be changed to ``ADMIN_down``.
>>>>>>>>> +
>>>>>>>>> +This will be done at regular intervals by the group watcher, just
>>>>>>>>> before
>>>>>>>>> +deciding which instances to reboot.
>>>>>>>>
>>>>>>>> +
>>>>>>>>> +On top of that, at the same times, the watcher will also need to
>>>>>>>>> issue ``xm
>>>>>>>>> +destroy`` commands for all the domains that are in crashed or
>>>>>>>>> shutdown state,
>>>>>>>>> +since this will not be done automatically by Xen anymore because
>>>>>>>>> of the
>>>>>>>>> +``preserve`` setting in their config files.
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>
>>>>>>>> Does this mean the memory will be freed only 5 minutes later?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, that's the downside of this approach.
>>>>>>> Still, I think the impact is quite limited. First, because I don't
>>>>>>> think that shutting down VMs from inside happens so frequently (but we
>>>>>>> probably need some data to confirm this), and 5 minutes is the upper 
>>>>>>> limit
>>>>>>> anyway. Second, because if we see that it is required, we could have the
>>>>>>> cleanup function running not only in the watcher, but also "on 
>>>>>>> demand"every
>>>>>>> time it is needed, such as before creating a new instance, so that all 
>>>>>>> the
>>>>>>> operations that require memory to be available are sure to be executed 
>>>>>>> in a
>>>>>>> clean state.
>>>>>>>
>>>>>>>
>>>>>> Ack.
>>>>>>
>>>>>
>>>>>>
>>>>>>>  Can you add a note saying that this will not impact the "ganeti
>>>>>>>> admin" issued path, as that can check "xm list" and will do "the right
>>>>>>>> thing" finishing with the destruction?
>>>>>>>>
>>>>>>>
>>>>>>> Sure, this is not a problem.
>>>>>>>
>>>>>>>  Also, what will happen when gnt-instance list is called beteween
>>>>>>>> the shutdown and the watcher run? Do we need to display a different 
>>>>>>>> state,
>>>>>>>> make the cleanup happen then (not sure I like this), or just display
>>>>>>>> ADMIN_Down "as if", although the domain is not destroyed yet?
>>>>>>>>
>>>>>>>
>>>>>>> Given that gnt-instance list shows the status of instances according
>>>>>>> to Ganeti, I think the right thing would be to show the exact situation:
>>>>>>> the expected status, the actual status, and a note saying that an 
>>>>>>> internal
>>>>>>> shutdown was detected.
>>>>>>>
>>>>>>>
>>>>>> Ack, but this needs to be explicit in this design. What exactly is
>>>>>> going to appear in the "status" field? What about in oper_state and
>>>>>> admin_state ? Is there any new field you're adding? (I assume not, but 
>>>>>> this
>>>>>> is important to specify).
>>>>>>
>>>>>
>>>>> Ok, I'll detail all of this.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Finally, can you comment on changes to gnt-instance
>>>>>>>> start/failover/migrate and other ops, when they encounter an instance 
>>>>>>>> in
>>>>>>>> this state? Will the RPC layer be changed to make this transparent to 
>>>>>>>> them
>>>>>>>> (how?) or will they need to deal with this explicitly?
>>>>>>>>
>>>>>>>
>>>>>>> My idea is to run the cleanup function for a single instance just
>>>>>>> before executing commands that would affect the running state of that
>>>>>>> instance. This way, we'd end up being sure we always are in a clean 
>>>>>>> state
>>>>>>> before performing an operation.
>>>>>>> I don't see a need to modify the RPC layer: it seems to me that this
>>>>>>> is an hypervisor specific think, and could therefore be implemented
>>>>>>> directly in the affected hypervisor-specific functions.
>>>>>>>
>>>>>>>
>>>>>> Ack, but please add this to the design doc, specifying that it is
>>>>>> only for functions that perform action on an instance, and not for ones 
>>>>>> who
>>>>>> just query it (if this is the case as I assume it is).
>>>>>>
>>>>>
>>>>> Ok. Yes, of course, that is the case.
>>>>>
>>>>> Thanks,
>>>>> Michele
>>>>>
>>>>
>>>>
>>> Interdiff:
>>> diff --git a/doc/design-internal-shutdown.rst
>>> b/doc/design-internal-shutdown.rst
>>> index 836d00c..a92d837 100644
>>> --- a/doc/design-internal-shutdown.rst
>>> +++ b/doc/design-internal-shutdown.rst
>>>  We propose to modify Ganeti in such a way that it will detect when an
>>> instance
>>>  was shutdown because of an explicit user request. When such a situation
>>> is
>>> -detected, the state of the instance will be set to ADMIN_down, as
>>> intended by
>>>  -the user.
>>> +detected, instead of presenting an error as it happens now, either the
>>> state
>>> +of the instance will be set to ADMIN_down, or the instance will be
>>> +automatically rebooted, depending on a instance-specific configuration
>>> value.
>>> +The default behavior in case no such parameter is found will be to
>>> follow
>>> +the apparent will of the user, and setting to ADMIN_down an instance
>>> that
>>> +was shut down correctly from inside.
>>>
>>>  This design document applies to the Xen backend of Ganeti, because it
>>> uses
>>> -features specific of such hypervisor.
>>> +features specific of such hypervisor. Initial analysis suggests that a
>>> similar
>>> +approach might be used for KVM as well, so this design document will be
>>> later
>>> +extended to add more details about it.
>>>
>>>  Implementation
>>>  ==============
>>> @@ -65,6 +71,40 @@ destroy`` commands for all the domains that are in
>>> crashed or shutdown state,
>>>   since this will not be done automatically by Xen anymore because of
>>> the
>>>  ``preserve`` setting in their config files.
>>>
>>> +This behavior will be limited to the domains shut down from inside,
>>> because it
>>> +will actually keep the resources of the domain busy until the watcher
>>> will do
>>> +the cleaning job (that, with the default setting, is up to every 5
>>> minutes).
>>> +Still, this is considered acceptable, because it is not frequent for a
>>> domain
>>> +to be shut down this way. The cleanup function will be also run
>>> +automatically just before performing any job that requires resources to
>>> be
>>> +available (such as when creating a new instance), in order to ensure
>>> that the
>>> +new resource allocation happens starting from a clean state.
>>> Functionalities
>>> +that only query the state of instances will not run the cleanup
>>> function.
>>> +
>>> +This changes are hypervisor-specific and will not affect the external
>>> RPC
>>> +interface.
>>>
>>
>> Here you're talking about the RAPI I believe. The RPC interface (the
>> internal one, between noded and masterd) will need to be changed to support
>> this operation. Also the Hypervisor API will need changes for this, I
>> believe.
>>
>> Finally, will there be a new LU, or how will the watcher perform the
>> check/cleanup?
>>
>>
> Interdiff to explain all this:
> diff --git a/doc/design-internal-shutdown.rst
> b/doc/design-internal-shutdown.rst
> index a92d837..8b5e3c3 100644
> --- a/doc/design-internal-shutdown.rst
> +++ b/doc/design-internal-shutdown.rst
> @@ -81,11 +81,28 @@ available (such as when creating a new instance), in
> order to en
>  new resource allocation happens starting from a clean state.
> Functionalities
>  that only query the state of instances will not run the cleanup function.
>
> -This changes are hypervisor-specific and will not affect the external RPC
> -interface.
> -
> -Side effects of the modification
> -++++++++++++++++++++++++++++++++
> +The cleanup operation includes both node-specific operations (the actual
> +destruction of the stopped domains) and configuration changes, to be
> performed
> +on the master node (marking as offline an instance that was shut down
> +internally). Therefore, it will be implemented by adding a LU in cmdlib
> +(``LUCleanupInstances``). A Job executing such an opcode will be
> submitted by
> +the watcher to perform the cleanup.
> +
>

I think we might be able to do this without a new LU: how about the watcher
detecting the condition from the "oper state" (which as we show above shows
it) and then calling "stop", or "start" depending on the situation
(crashed: start, stopped: stop, unless the parameter above is set to True).


> +The node daemon will have to be modified in order to support at least the
> +following RPC calls:
> + * A call to list all the instances that have been shutdown from inside
>

Why do we need this? Can't we use the normal instance list instead, which
has to report the condition?


> + * A call to destroy a domain
> +
>

Again, what's the difference between this one and "shutdown"? (consider
that disks will need to be closed too, etc).
(Yes, I know I said you'd need new RPC calls, but perhaps this is not the
case. Perhaps though we could have a cleanup rpc call (or maybe modify
shutdown) that performs the shutdown just in case the instance is in the
"preserved" state and leaves it alone otherwise, so it can be easily called
by all instance LUs (start, stop, migrate, failover, ...).

One more consideration to give is race conditions: inside the instance
there are no locks: so is there anything "bad" happening if "cleanup" is
called (and detects nothing) then the user shuts down the instance from
inside, and then the rest of an operation proceeds?


+The base hypervisor class (and all the deriving classes) will need two
> methods
> +for implementing such functionalities in a hypervisor-specific way.
> +
>

Please specify the method name, and whether it's optional or mandatory.


> +LUs performing operations other than an explicit cleanup will have to be
> +modified to perform the cleanup as well, either by submitting a job to
> perform
> +the cleanup (to be completed before actually performing the task at hand)
> or by
> +explicitly performing the cleanup themselves through the RPC calls.
> +
>

We currently have no way to "submit a job" then return to the current one.
Please don't depend on it.



> +Other required changes
> +++++++++++++++++++++++
>
>  The implementation of this design document will require some commands to
> be
>  changed in order to cope with the new shutdown procedure.
>
>
>
>
>>  +
>>> +Side effects of the modification
>>> +++++++++++++++++++++++++++++++++
>>> +
>>>
>>
>> Other changes required, I would say, rather than "side effects".
>>
>>
>>> +The implementation of this design document will require some commands
>>> to be
>>> +changed in order to cope with the new shutdown procedure.
>>>  +
>>> +With this modification, also the Ganeti command for shutting down
>>> instances
>>> +would leave them in a shutdown but preserved state. Therefore, it will
>>> be
>>> +changed in such a way to immediately perform the cleanup of the instance
>>> +after verifying its correct shutdown.
>>> +
>>> +The ``gnt-instance list`` command will need to be able to handle the
>>> situation
>>> +where an instance was shutdown internally but not yet cleaned up.
>>> +The admin_state field will maintain the current meaning unchanged. The
>>> +oper_state field will get a new possible state, ``S``, meaning that the
>>> instance
>>> +was shutdown internally.
>>> +
>>>
>> +The ``gnt-instance info`` command ``State`` field, in such case, will
>>> show a
>>> +message stating that the instance was supposed to be run but was shut
>>> down
>>> +internally.
>>> +
>>>
>>
>> ACK.
>>
>> Thanks,
>>
>> Guido
>>
>>
> Thanks,
> Michele
>

Thanks,

Guido

Re: [PATCH master] Design doc for internal shutdown detection

Reply via email to