Re: [openstack-dev] [heat] convergence cancel messages

Anant Patil Fri, 19 Aug 2016 07:13:19 -0700

On Tue, Apr 19, 2016 at 9:36 PM Zane Bitter <[email protected]> wrote:


> On 17/04/16 00:44, Anant Patil wrote:
> >         I think it is a good idea, but I see that a resource can be
> >         marked unhealthy only after it is done.
> >
> >
> >     Currently, yes. The idea would be to change that so that if it finds
> >     the resource IN_PROGRESS then it kills the thread and makes sure the
> >     resource is in a FAILED state. I
> >
> >
> > Move the resource to CHECK_FAILED?
>
> I'd say that if killing the thread gets it to UPDATE_FAILED then Mission
> Accomplished, but obviously we'd have to check for races and make sure
> we move it to CHECK_FAILED if the update completes successfully.
>
> >     The trick would be if the stack update is still running and the
> >     resource is currently IN_PROGRESS to make sure that we fail the
> >     whole stack update (rolling back if the user has enabled that).
> >
> >
> > IMO, we can probably use the cancel command to do this, because when you
> > are marking a resource as unhealthy, you are cancelling any action
> > running on that resource. Would the following be ok?
> > (1) stack-cancel-update <stack_id> will cancel the update, mark
> > cancelled resources failed and rollback (existing stuff)
> > (2) stack-cancel-update <stack_id> --no-rollback will just cancel the
> > update and mark cancelled resources as failed
> > (3) stack-cancel-update <stack_id> <resource_id> ... <resource_id> Just
> > stop the action on given resources, mark as CHECK_FAILED, don't do
> > anything else. The stack won't progress further. Other resources running
> > while cancel-update will complete.
>
> None of those solve the use case I actually care about, which is "don't
> start any more resource updates, but don't mark the ones currently
> in-progress as failed either, and don't roll back". That would be a huge
> help in TripleO. We need a way to be able to stop updates that
> guarantees not unnecessarily destroying any part of the existing stack,
> and we need that to be the default.
>
> (We sort-of have the rollback version of this; it's equivalent to a
> stack update with the previous template/environment. But we need to make
> it easier and decouple it from the rollback IMHO.)
>
> So one way to do this would be:
>
> (1) stack-cancel-update <stack_id> will start another update using the
> previous template/environment. We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update <stack_id> --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) stack-cancel-update <stack_id> --stop-in-progress will stop the
> traversal, kill any threads running an update (marking cancelled
> resources failed) and roll back
> (4) stack-cancel-update <stack_id> --stop-in-progress --no-rollback will
> just stop the traversal and kill any threads running an update (marking
> cancelled resources failed)
> (5) stack-cancel-update <stack_id> --stop-in-progress <resource_id> ...
> <resource_id> Just stop the action on given resources, mark as
> UPDATE_FAILED, don't do anything else. The stack won't progress further.
> Other resources running while cancel-update will complete.
>
> That would cover all the use cases. Some problems with it are:
> - It's way complicated. Lots of options.
> - Those options don't translate well to legacy (pre-convergence) stacks
> using the same client. e.g. there is now a non-default
> --stop-in-progress option, but on legacy stacks we always stop in-progress.
> - Options don't commute. When you specify resources with the
> --stop-in-progress flag it never rolls back, even though you haven't set
> the --no-rollback flag.
>
> An alternative would be to just drop (3) and (4), and maybe rename (5).
> I'd be OK with that:
>
> (1) stack-cancel-update <stack_id> will start another update using the
> previous template/environment. We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update <stack_id> --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) resource-stop-update <stack_id> <resource_id> ... <resource_id> Just
> stop the action on given resources, mark as UPDATE_FAILED, don't do
> anything else. The stack won't progress further. Other resources running
> while cancel-update will complete.
>
> That solves most of the issues, except that (3) has no real equivalent
> on legacy stacks (I guess we could just make it fail on the server side).
>
> What I'm suggesting is very close to that:
>
> (1) stack-cancel-update <stack_id> will start another update using the
> previous template/environment. We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update <stack_id> --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) resource-mark-unhealthy <stack_id> <resource_id> ... <resource_id>
> Kill any threads running a CREATE or UPDATE on the given resources, mark
> as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do
> anything else. If the resource was in progress, the stack won't progress
> further, other resources currently in-progress will complete, and if
> rollback is enabled and no other traversal has started then it will roll
> back to the previous template/environment.
>
I have started implementing the above three mechanisms. The first two
are implemented in https://review.openstack.org/#/c/357618
Note that (2) needs a change in the heat client (openstack client?) to add
a --no-rollback option.
(3) is a bit of a long haul, and needs:
https://review.openstack.org/343076 : Adds mechanism to interrupt
convergence worker threads
https://review.openstack.org/301483 : Mechanism to send cancel message and
cancel worker upon receiving messages
Apart from the above two, I am implementing the actual patch, which will
build on them to complete the resource-mark-unhealthy feature in
convergence.
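
To make the intended semantics of the three mechanisms easier to compare, here is a rough toy model in Python. It is only a sketch: ToyStack, cancel_update and mark_unhealthy are hypothetical names invented for illustration and are not Heat's classes or RPC methods. Killing the worker thread is not modelled, and clearing traversal_id when rollback is disabled is an assumption standing in for "the stack won't progress further".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

IN_PROGRESS = ("CREATE_IN_PROGRESS", "UPDATE_IN_PROGRESS")


@dataclass
class ToyStack:
    """Hypothetical stand-in for a stack; not Heat's real data model."""
    traversal_id: Optional[str]
    rollback_enabled: bool = True
    template: str = "new"
    resources: Dict[str, str] = field(default_factory=dict)  # name -> status


def cancel_update(stack: ToyStack, rollback: bool = True) -> None:
    """Mechanisms (1) and (2): in-progress resources are left alone."""
    if rollback:
        # (1) Start another update using the previous template/environment.
        stack.template = "previous"
        stack.traversal_id = "rollback"
    else:
        # (2) Clear the traversal so no further resources are updated.
        stack.traversal_id = None


def mark_unhealthy(stack: ToyStack, name: str) -> None:
    """Mechanism (3): the resource always ends up in a failed state."""
    was_in_progress = stack.resources[name] in IN_PROGRESS
    # Stopping the worker is not modelled; only the resulting status is.
    if stack.resources[name] != "UPDATE_FAILED":
        stack.resources[name] = "CHECK_FAILED"
    if not was_in_progress:
        return
    # The running update cannot progress further. The "no other traversal
    # has started" check is omitted in this single-threaded toy.
    if stack.rollback_enabled:
        cancel_update(stack, rollback=True)
    else:
        stack.traversal_id = None  # assumption: update simply stops
```

For example, cancel_update(stack, rollback=False) leaves a resource in UPDATE_IN_PROGRESS untouched, whereas mark_unhealthy on that same resource moves it to CHECK_FAILED and, with rollback enabled, switches the toy stack back to the previous template.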


> Basically this rolls the functionality of resource-stop-update into
> resource-mark-unhealthy instead of making a separate command for it. The
> only real difference is that the resource _always_ ends up in a failed
> state even if it had actually completed before the command was
> processed. (In practice this is likely to be irrelevant, because you'd
> use resource-stop-update only when something was stuck.) I like this
> because in each case under convergence the command acts like a
> convergified (yes, I just said that) version of the legacy behaviour:
>
> (1) In the legacy path we use stack-level locks, so to start a rollback
> we have to kill the current update. In convergence, we just start the
> rollback update.
> (2) There's no current equivalent of this, but it would be trivial (and
> useful) to add - the RPC API already supports it, so we just need to
> implement it in the ReST API and client. In both cases, it does exactly
> what it says on the tin: acts the same as (1) but without the rollback.
> (3) In the legacy path you can't issue this command during a stack
> update due to the stack-level lock, but in convergence without this lock
> you can do it any time. If a resource is in-progress when you mark it
> unhealthy then we just stop it because it's going to a FAILED state
> regardless. The stack update behaves normally - if a resource fails for
> any reason, roll back iff rollback is enabled.
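
The difference in lock scope behind (1) and (3) can be illustrated with a small sketch. This is not how Heat implements its locks (it does not use in-process thread locks for this); StackLevelLocking, ResourceLevelLocking and can_start_action are hypothetical names, and the example only shows why a stack-wide lock blocks every other action during an update while per-resource locks do not.

```python
import threading


class StackLevelLocking:
    """Legacy-style toy: one lock guards every resource in the stack."""

    def __init__(self):
        self._lock = threading.Lock()

    def lock_for(self, resource_name):
        return self._lock


class ResourceLevelLocking:
    """Convergence-style toy: each resource has its own lock."""

    def __init__(self):
        self._locks = {}

    def lock_for(self, resource_name):
        return self._locks.setdefault(resource_name, threading.Lock())


def can_start_action(locking, busy_resource, target_resource):
    """Return True if an action on target_resource could start while
    busy_resource is being updated (i.e. while its lock is held)."""
    held = locking.lock_for(busy_resource)
    held.acquire()
    try:
        wanted = locking.lock_for(target_resource)
        # Non-blocking attempt: fails if the wanted lock is already held.
        if wanted.acquire(blocking=False):
            wanted.release()
            return True
        return False
    finally:
        held.release()
```

With StackLevelLocking, can_start_action returns False for any pair of resources, which mirrors why the command cannot be issued during a legacy stack update. With ResourceLevelLocking it returns True for two different resources. For the busy resource itself it still returns False, which is the case where the proposal stops the in-progress worker instead of waiting.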
>
> One caveat is that my brain thinks of convergence phase 1 exclusively in
> terms of replacing stack-level locks with resource-level locks. It's
> likely users don't think about it this way. However, I still think it's
> a coherent design, and it avoids adding an extra command to the CLI that
> does almost the same thing as an existing one.
>
> Note that this is actually probably the behaviour we want for
> resource-mark-unhealthy anyway, because that is likely to be called in
> many cases by some external monitoring tool, so it would be better if it
> took effect regardless of what is happening in the stack at the time. We
> can kill two birds with one stone.
>
> cheers,
> Zane.
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Thanks,
Anant
