On Tue, Apr 19, 2016 at 9:36 PM Zane Bitter <zbit...@redhat.com> wrote:
> On 17/04/16 00:44, Anant Patil wrote:
> > I think it is a good idea, but I see that a resource can be marked
> > unhealthy only after it is done.
> >
> > Currently, yes. The idea would be to change that so that if it finds
> > the resource IN_PROGRESS then it kills the thread and makes sure the
> > resource is in a FAILED state. I
> >
> > Move the resource to CHECK_FAILED?
>
> I'd say that if killing the thread gets it to UPDATE_FAILED then Mission
> Accomplished, but obviously we'd have to check for races and make sure
> we move it to CHECK_FAILED if the update completes successfully.
>
> > The trick would be if the stack update is still running and the
> > resource is currently IN_PROGRESS to make sure that we fail the
> > whole stack update (rolling back if the user has enabled that).
> >
> > IMO, we can probably use the cancel command to do this, because when you
> > are marking a resource as unhealthy, you are cancelling any action
> > running on that resource. Would the following be ok?
> >
> > (1) stack-cancel-update <stack_id> will cancel the update, mark
> > cancelled resources failed and rollback (existing stuff)
> > (2) stack-cancel-update <stack_id> --no-rollback will just cancel the
> > update and mark cancelled resources as failed
> > (3) stack-cancel-update <stack_id> <resource_id> ... <resource_id> Just
> > stop the action on given resources, mark as CHECK_FAILED, don't do
> > anything else. The stack won't progress further. Other resources running
> > while cancel-update will complete.
>
> None of those solve the use case I actually care about, which is "don't
> start any more resource updates, but don't mark the ones currently
> in-progress as failed either, and don't roll back". That would be a huge
> help in TripleO. We need a way to be able to stop updates that
> guarantees not unnecessarily destroying any part of the existing stack,
> and we need that to be the default.
>
> (We sort-of have the rollback version of this; it's equivalent to a
> stack update with the previous template/environment. But we need to make
> it easier and decouple it from the rollback IMHO.)
>
> So one way to do this would be:
>
> (1) stack-cancel-update <stack_id> will start another update using the
> previous template/environment. We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update <stack_id> --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) stack-cancel-update <stack_id> --stop-in-progress will stop the
> traversal, kill any running threads update (marking cancelled resources
> failed) and rollback
> (4) stack-cancel-update <stack_id> --stop-in-progress --no-rollback will
> just stop the traversal, kill any running threads update (marking
> cancelled resources failed)
> (5) stack-cancel-update <stack_id> --stop-in-progress <resource_id> ...
> <resource_id> Just stop the action on given resources, mark as
> UPDATE_FAILED, don't do anything else. The stack won't progress further.
> Other resources running while cancel-update will complete.
>
> That would cover all the use cases. Some problems with it are:
> - It's way complicated. Lots of options.
> - Those options don't translate well to legacy (pre-convergence) stacks
> using the same client. e.g. there is now a non-default
> --stop-in-progress option, but on legacy stacks we always stop in-progress.
> - Options don't commute. When you specify resources with the
> --stop-in-progress flag it never rolls back, even though you haven't set
> the --no-rollback flag.
>
> An alternative would be to just drop (3) and (4), and maybe rename (5).
> I'd be OK with that:
>
> (1) stack-cancel-update <stack_id> will start another update using the
> previous template/environment.
> We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update <stack_id> --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) resource-stop-update <stack_id> <resource_id> ... <resource_id> Just
> stop the action on given resources, mark as UPDATE_FAILED, don't do
> anything else. The stack won't progress further. Other resources running
> while cancel-update will complete.
>
> That solves most of the issues, except that (3) has no real equivalent
> on legacy stacks (I guess we could just make it fail on the server side).
>
> What I'm suggesting is very close to that:
>
> (1) stack-cancel-update <stack_id> will start another update using the
> previous template/environment. We'll start rolling back; in-progress
> resources will be allowed to complete normally.
> (2) stack-cancel-update <stack_id> --no-rollback will set the
> traversal_id to None so no further resources will be updated;
> in-progress resources will be allowed to complete normally.
> (3) resource-mark-unhealthy <stack_id> <resource_id> ... <resource_id>
> Kill any threads running a CREATE or UPDATE on the given resources, mark
> as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do
> anything else. If the resource was in progress, the stack won't progress
> further, other resources currently in-progress will complete, and if
> rollback is enabled and no other traversal has started then it will roll
> back to the previous template/environment.
>

I have started implementation of the above three mechanisms. The first
two are implemented in https://review.openstack.org/#/c/357618

Note that (2) needs a change in heat client (openstack client?) to
have a --no-rollback option.
(3) is a bit of a long haul, and needs:
https://review.openstack.org/343076 : Adds mechanism to interrupt convergence worker threads
https://review.openstack.org/301483 : Mechanism to send cancel message and cancel worker upon receiving messages

Apart from the above two, I am implementing the actual patch which will
leverage them to complete the resource-mark-unhealthy feature in
convergence.

> Basically this rolls the functionality of resource-stop-update into
> resource-mark-unhealthy instead of making a separate command for it. The
> only real difference is that the resource _always_ ends up in a failed
> state even if it had actually completed before the command was
> processed. (In practice this is likely to be irrelevant, because you'd
> use resource-stop-update only when something was stuck.) I like this
> because in each case under convergence the command acts like a
> convergified (yes, I just said that) version of the legacy behaviour:
>
> (1) In the legacy path we use stack-level locks, so to start a rollback
> we have to kill the current update. In convergence, we just start the
> rollback update.
> (2) There's no current equivalent of this, but it would be trivial (and
> useful) to add - the RPC API already supports it, so we just need to
> implement it in the ReST API and client. In both cases, it does exactly
> what it says on the tin: acts the same as (1) but without the rollback.
> (3) In the legacy path you can't issue this command during a stack
> update due to the stack-level lock, but in convergence without this lock
> you can do it any time. If a resource is in-progress when you mark it
> unhealthy then we just stop it because it's going to a FAILED state
> regardless. The stack update behaves normally - if a resource fails for
> any reason, roll back iff rollback is enabled.
>
> One caveat is that my brain thinks of convergence phase 1 exclusively in
> terms of replacing stack-level locks with resource-level locks.
> It's likely users don't think about it this way. However, I still think
> it's a coherent design, and it avoids adding an extra command to the CLI
> that does almost the same thing as an existing one.
>
> Note that this is actually probably the behaviour we want for
> resource-mark-unhealthy anyway, because that is likely to be called in
> many cases by some external monitoring tool, so it would be better if it
> took effect regardless of what is happening in the stack at the time. We
> can kill two birds with one stone.
>
> cheers,
> Zane.
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Thanks,
Anant