On 17/04/16 00:44, Anant Patil wrote:
        I think it is a good idea, but I see that a resource can be marked
        unhealthy only after it is done.


    Currently, yes. The idea would be to change that so that if it finds
    the resource IN_PROGRESS then it kills the thread and makes sure the
    resource is in a FAILED state. I


Move the resource to CHECK_FAILED?

I'd say that if killing the thread gets it to UPDATE_FAILED then Mission Accomplished, but obviously we'd have to check for races and make sure we move it to CHECK_FAILED if the update completes successfully.

    The trick would be if the stack update is still running and the
    resource is currently IN_PROGRESS to make sure that we fail the
    whole stack update (rolling back if the user has enabled that).


IMO, we can probably use the cancel command to do this, because when you
are marking a resource as unhealthy, you are
cancelling any action running on that resource. Would the following be ok?
(1) stack-cancel-update <stack_id> will cancel the update, mark
cancelled resources failed and rollback (existing stuff)
(2) stack-cancel-update <stack_id> --no-rollback will just cancel the
update and mark cancelled resources as failed
(3) stack-cancel-update <stack_id> <resource_id> ... <resource_id> Just
stop the action on given resources, mark as CHECK_FAILED, don't do
anything else. The stack won't progress further. Other resources running
while cancel-update will complete.

None of those solve the use case I actually care about, which is "don't start any more resource updates, but don't mark the ones currently in-progress as failed either, and don't roll back". That would be a huge help in TripleO. We need a way to be able to stop updates that guarantees not unnecessarily destroying any part of the existing stack, and we need that to be the default.

(We sort-of have the rollback version of this; it's equivalent to a stack update with the previous template/environment. But we need to make it easier and decouple it from the rollback IMHO.)

So one way to do this would be:

(1) stack-cancel-update <stack_id> will start another update using the
previous template/environment. We'll start rolling back; in-progress
resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the
traversal_id to None so no further resources will be updated;
in-progress resources will be allowed to complete normally.
(3) stack-cancel-update <stack_id> --stop-in-progress will stop the
traversal, kill any running update threads (marking cancelled resources
failed) and roll back.
(4) stack-cancel-update <stack_id> --stop-in-progress --no-rollback will
just stop the traversal and kill any running update threads (marking
cancelled resources failed).
(5) stack-cancel-update <stack_id> --stop-in-progress <resource_id> ...
<resource_id> Just stop the action on given resources, mark as
UPDATE_FAILED, don't do anything else. The stack won't progress further.
Other resources running while cancel-update will complete.

That would cover all the use cases. Some problems with it are:
- It's way complicated. Lots of options.
- Those options don't translate well to legacy (pre-convergence) stacks using the same client. e.g. there is now a non-default --stop-in-progress option, but on legacy stacks we always stop in-progress.
- Options don't commute. When you specify resources with the --stop-in-progress flag it never rolls back, even though you haven't set the --no-rollback flag.
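
For reference, the mechanism behind option (2), clearing the traversal ID so that no new resource work starts while in-progress work is left alone, can be sketched in a few lines of Python. This is a toy model and not Heat code: the `Stack` class, its `current_traversal` field and the helper functions are hypothetical names chosen only for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stack:
    # Hypothetical stand-in for a stack record; only the traversal ID
    # discussed in the thread is modelled here.
    current_traversal: Optional[str] = None
    started: List[str] = field(default_factory=list)


def start_update(stack: Stack) -> str:
    """Begin a new traversal and return its ID."""
    stack.current_traversal = str(uuid.uuid4())
    return stack.current_traversal


def cancel_update_no_rollback(stack: Stack) -> None:
    """Option (2): clear the traversal ID; nothing is killed or marked failed."""
    stack.current_traversal = None


def maybe_start_resource(stack: Stack, traversal_id: str, name: str) -> bool:
    """Start work on a resource only if its traversal is still current."""
    if stack.current_traversal != traversal_id:
        return False  # stale traversal: do not start any new work
    stack.started.append(name)
    return True


stack = Stack()
tid = start_update(stack)
first = maybe_start_resource(stack, tid, "server")   # starts normally
cancel_update_no_rollback(stack)
second = maybe_start_resource(stack, tid, "volume")  # skipped after cancel
```

In this model the work already started on "server" is untouched by the cancel; only the check made before starting "volume" changes its outcome.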

An alternative would be to just drop (3) and (4), and maybe rename (5). I'd be OK with that:

(1) stack-cancel-update <stack_id> will start another update using the
previous template/environment. We'll start rolling back; in-progress
resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the
traversal_id to None so no further resources will be updated;
in-progress resources will be allowed to complete normally.
(3) resource-stop-update <stack_id> <resource_id> ... <resource_id> Just
stop the action on given resources, mark as UPDATE_FAILED, don't do
anything else. The stack won't progress further. Other resources running
while cancel-update will complete.

That solves most of the issues, except that (3) has no real equivalent on legacy stacks (I guess we could just make it fail on the server side).

What I'm suggesting is very close to that:

(1) stack-cancel-update <stack_id> will start another update using the
previous template/environment. We'll start rolling back; in-progress
resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the
traversal_id to None so no further resources will be updated;
in-progress resources will be allowed to complete normally.
(3) resource-mark-unhealthy <stack_id> <resource_id> ... <resource_id>
Kill any threads running a CREATE or UPDATE on the given resources, mark
as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do
anything else. If the resource was in progress, the stack won't progress
further, other resources currently in-progress will complete, and if
rollback is enabled and no other traversal has started then it will roll
back to the previous template/environment.

Basically this rolls the functionality of resource-stop-update into resource-mark-unhealthy instead of making a separate command for it. The only real difference is that the resource _always_ ends up in a failed state even if it had actually completed before the command was processed. (In practice this is likely to be irrelevant, because you'd use resource-stop-update only when something was stuck.) I like this because in each case under convergence the command acts like a convergified (yes, I just said that) version of the legacy behaviour:

(1) In the legacy path we use stack-level locks, so to start a rollback
we have to kill the current update. In convergence, we just start the
rollback update.
(2) There's no current equivalent of this, but it would be trivial (and
useful) to add - the RPC API already supports it, so we just need to
implement it in the ReST API and client. In both cases, it does exactly
what it says on the tin: acts the same as (1) but without the rollback.
(3) In the legacy path you can't issue this command during a stack
update due to the stack-level lock, but in convergence without this lock
you can do it any time. If a resource is in-progress when you mark it
unhealthy then we just stop it because it's going to a FAILED state
regardless. The stack update behaves normally - if a resource fails for
any reason, roll back iff rollback is enabled.
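
The state handling proposed for resource-mark-unhealthy, including the race where the update finishes before the kill lands, could look roughly like the sketch below. It is a toy model under stated assumptions, not Heat's implementation: `Resource`, `mark_unhealthy` and the `cancel_worker` callback are hypothetical names, and the rule applied is the literal one from the proposal (leave UPDATE_FAILED alone, otherwise set CHECK_FAILED).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Resource:
    action: str  # e.g. "CREATE", "UPDATE", "CHECK"
    status: str  # "IN_PROGRESS", "COMPLETE" or "FAILED"


def mark_unhealthy(res: Resource,
                   cancel_worker: Callable[[Resource], None]) -> bool:
    """Force the resource into a failed state.

    Returns True if the resource was in progress, i.e. the caller should
    also fail the running stack update (rolling back if enabled).
    """
    was_in_progress = (res.status == "IN_PROGRESS"
                       and res.action in ("CREATE", "UPDATE"))
    if was_in_progress:
        # Hypothetical hook that kills the thread working on the resource.
        cancel_worker(res)
    # Whatever the kill achieved, the resource must end up failed.
    if (res.action, res.status) != ("UPDATE", "FAILED"):
        res.action, res.status = "CHECK", "FAILED"
    return was_in_progress


def kill_succeeds(res: Resource) -> None:
    res.status = "FAILED"      # the killed update ends in UPDATE_FAILED


def kill_loses_race(res: Resource) -> None:
    res.status = "COMPLETE"    # the update finished before the kill landed


stuck = Resource("UPDATE", "IN_PROGRESS")
stuck_hit = mark_unhealthy(stuck, kill_succeeds)

raced = Resource("UPDATE", "IN_PROGRESS")
raced_hit = mark_unhealthy(raced, kill_loses_race)

idle = Resource("CREATE", "COMPLETE")
idle_hit = mark_unhealthy(idle, kill_succeeds)
```

The boolean return value is one way to tell the caller whether the stack update itself needs to be failed; a real implementation would also need the state check and write to be atomic with respect to the worker.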

One caveat is that my brain thinks of convergence phase 1 exclusively in terms of replacing stack-level locks with resource-level locks. It's likely users don't think about it this way. However, I still think it's a coherent design, and it avoids adding an extra command to the CLI that does almost the same thing as an existing one.

Note that this is actually probably the behaviour we want for resource-mark-unhealthy anyway, because that is likely to be called in many cases by some external monitoring tool, so it would be better if it took effect regardless of what is happening in the stack at the time. We can kill two birds with one stone.

cheers,
Zane.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
