Re: [openstack-dev] [heat] convergence cancel messages

Zane Bitter Tue, 19 Apr 2016 09:05:16 -0700

On 17/04/16 00:44, Anant Patil wrote:

        I think it is a good idea, but I see that a resource can be marked
        unhealthy only after it is done.



    Currently, yes. The idea would be to change that so that if it finds
    the resource IN_PROGRESS then it kills the thread and makes sure the
    resource is in a FAILED state. I


Move the resource to CHECK_FAILED?

I'd say that if killing the thread gets it to UPDATE_FAILED then Mission Accomplished, but obviously we'd have to check for races and make sure we move it to CHECK_FAILED if the update completes successfully.

    The trick would be if the stack update is still running and the
    resource is currently IN_PROGRESS to make sure that we fail the
    whole stack update (rolling back if the user has enabled that).


IMO, we can probably use the cancel command to do this, because when you
are marking a resource as unhealthy, you are
cancelling any action running on that resource. Would the following be ok?
(1) stack-cancel-update <stack_id> will cancel the update, mark
cancelled resources failed and rollback (existing stuff)
(2) stack-cancel-update <stack_id> --no-rollback will just cancel the
update and mark cancelled resources as failed
(3) stack-cancel-update <stack_id> <resource_id> ... <resource_id> Just
stop the action on given resources, mark as CHECK_FAILED, don't do
anything else. The stack won't progress further. Other resources running
while cancel-update will complete.

None of those solve the use case I actually care about, which is "don't start any more resource updates, but don't mark the ones currently in-progress as failed either, and don't roll back". That would be a huge help in TripleO. We need a way to be able to stop updates that guarantees not unnecessarily destroying any part of the existing stack, and we need that to be the default.

(We sort-of have the rollback version of this; it's equivalent to a stack update with the previous template/environment. But we need to make it easier and decouple it from the rollback IMHO.)


So one way to do this would be:

(1) stack-cancel-update <stack_id> will start another update using the previous template/environment. We'll start rolling back; in-progress resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the traversal_id to None so no further resources will be updated; in-progress resources will be allowed to complete normally.
(3) stack-cancel-update <stack_id> --stop-in-progress will stop the traversal, kill any running update threads (marking cancelled resources failed) and roll back.
(4) stack-cancel-update <stack_id> --stop-in-progress --no-rollback will just stop the traversal and kill any running update threads (marking cancelled resources failed).
(5) stack-cancel-update <stack_id> --stop-in-progress <resource_id> ... <resource_id> Just stop the action on given resources, mark as UPDATE_FAILED, don't do anything else. The stack won't progress further. Other resources running while cancel-update will complete.
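As a rough illustration of variant (2), here is a toy Python model of cancelling by invalidating the traversal ID. The `Stack` class, `start_resource` and `cancel_without_rollback` are hypothetical names invented for this sketch and are not Heat's actual code; the only idea taken from the message is that clearing the traversal_id stops new resource updates while leaving in-progress ones alone.

```python
import uuid


class Stack:
    """Toy stand-in for a convergence stack (hypothetical, not Heat's class)."""

    def __init__(self):
        self.current_traversal = str(uuid.uuid4())
        self.updated = []

    def cancel_without_rollback(self):
        # Variant (2): clear the traversal so that no new work is started.
        self.current_traversal = None


def start_resource(stack, name, traversal_id):
    """Begin work on a resource only if its traversal is still current."""
    if traversal_id is None or stack.current_traversal != traversal_id:
        # Stale traversal: skip the resource, but do not mark it failed.
        return False
    stack.updated.append(name)
    return True


stack = Stack()
tid = stack.current_traversal
start_resource(stack, "server", tid)   # started; allowed to finish normally
stack.cancel_without_rollback()
start_resource(stack, "volume", tid)   # refused: traversal no longer current
print(stack.updated)
```

Nothing in this model touches a resource that has already started, which is the property the use case above asks for.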


That would cover all the use cases. Some problems with it are:
- It's way complicated. Lots of options.

- Those options don't translate well to legacy (pre-convergence) stacks using the same client. e.g. there is now a non-default --stop-in-progress option, but on legacy stacks we always stop in-progress.

- Options don't commute. When you specify resources with the --stop-in-progress flag it never rolls back, even though you haven't set the --no-rollback flag.
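The "options don't commute" problem is easier to see when the five variants are written down as a function of the flags. This is only a sketch: `Plan` and `cancel_plan` are made-up names, and rejecting resource IDs without --stop-in-progress is an assumption of the sketch, since that combination is not among the five listed variants.

```python
from collections import namedtuple

# kill: None (nothing), "all", or a tuple of resource IDs (illustrative model)
Plan = namedtuple("Plan", ["kill", "rollback"])


def cancel_plan(stop_in_progress=False, no_rollback=False, resources=()):
    """Model the five stack-cancel-update variants (hypothetical helper)."""
    if resources:
        if not stop_in_progress:
            # Assumption: this combination is not one of the listed variants.
            raise ValueError("combination not described in the proposal")
        # (5): stop only the named resources; never rolls back.
        return Plan(kill=tuple(resources), rollback=False)
    # (1)-(4): here the two flags act independently of each other.
    return Plan(kill="all" if stop_in_progress else None,
                rollback=not no_rollback)


print(cancel_plan())                                          # variant (1)
print(cancel_plan(stop_in_progress=True))                     # variant (3)
print(cancel_plan(stop_in_progress=True, resources=("r1",)))  # variant (5)
```

Variants (3) and (5) both leave --no-rollback unset, yet only (3) rolls back; adding resource IDs silently changes the meaning of the other flags.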

An alternative would be to just drop (3) and (4), and maybe rename (5). I'd be OK with that:

That solves most of the issues, except that (3) has no real equivalent on legacy stacks (I guess we could just make it fail on the server side).


What I'm suggesting is very close to that:

(1) stack-cancel-update <stack_id> will start another update using the previous template/environment. We'll start rolling back; in-progress resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the traversal_id to None so no further resources will be updated; in-progress resources will be allowed to complete normally.
(3) resource-mark-unhealthy <stack_id> <resource_id> ... <resource_id> Kill any threads running a CREATE or UPDATE on the given resources, mark as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do anything else. If the resource was in progress, the stack won't progress further, other resources currently in-progress will complete, and if rollback is enabled and no other traversal has started then it will roll back to the previous template/environment.

Basically this rolls the functionality of resource-stop-update into resource-mark-unhealthy instead of making a separate command for it. The only real difference is that the resource _always_ ends up in a failed state even if it had actually completed before the command was processed. (In practice this is likely to be irrelevant, because you'd use resource-stop-update only when something was stuck.) I like this because in each case under convergence the command acts like a convergified (yes, I just said that) version of the legacy behaviour:

(1) In the legacy path we use stack-level locks, so to start a rollback we have to kill the current update. In convergence, we just start the rollback update.
(2) There's no current equivalent of this, but it would be trivial (and useful) to add - the RPC API already supports it, so we just need to implement it in the ReST API and client. In both cases, it does exactly what it says on the tin: acts the same as (1) but without the rollback.
(3) In the legacy path you can't issue this command during a stack update due to the stack-level lock, but in convergence without this lock you can do it any time. If a resource is in-progress when you mark it unhealthy then we just stop it because it's going to a FAILED state regardless. The stack update behaves normally - if a resource fails for any reason, roll back iff rollback is enabled.
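The resource-mark-unhealthy semantics in (3) can be sketched as a small state transition. This is a toy model under stated assumptions: `Resource`, `mark_unhealthy` and the `finished_first` flag (which stands in for the race where the update completes before the thread is killed) are hypothetical, and the rule "CHECK_FAILED unless already UPDATE_FAILED" is applied literally as written in the proposal.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    action: str  # e.g. "CREATE", "UPDATE", "CHECK"
    status: str  # "IN_PROGRESS", "COMPLETE" or "FAILED"


def mark_unhealthy(res, rollback_enabled, newer_traversal=False,
                   finished_first=False):
    """Fail the resource; return True if the stack should then roll back."""
    was_in_progress = (res.status == "IN_PROGRESS"
                       and res.action in ("CREATE", "UPDATE"))
    if was_in_progress:
        # Killing the thread normally leaves <ACTION>_FAILED, unless the
        # action won the race and completed first (hypothetical flag).
        res.status = "COMPLETE" if finished_first else "FAILED"
    if (res.action, res.status) != ("UPDATE", "FAILED"):
        # Literal reading of the proposal: anything other than
        # UPDATE_FAILED is moved to CHECK_FAILED.
        res.action, res.status = "CHECK", "FAILED"
    # Roll back iff the resource was mid-update, rollback is enabled and
    # no other traversal has started in the meantime.
    return was_in_progress and rollback_enabled and not newer_traversal


stuck = Resource("UPDATE", "IN_PROGRESS")
print(mark_unhealthy(stuck, rollback_enabled=True), stuck)
```

Whichever way the race goes, the resource ends in a failed state, which is the "always ends up failed" difference from a separate stop command.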

One caveat is that my brain thinks of convergence phase 1 exclusively in terms of replacing stack-level locks with resource-level locks. It's likely users don't think about it this way. However, I still think it's a coherent design, and it avoids adding an extra command to the CLI that does almost the same thing as an existing one.

Note that this is actually probably the behaviour we want for resource-mark-unhealthy anyway, because that is likely to be called in many cases by some external monitoring tool, so it would be better if it took effect regardless of what is happening in the stack at the time. We can kill two birds with one stone.


cheers,
Zane.

