On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgail...@redhat.com> wrote:
> A recent thread discussed a proposed new feature, a new environment
> variable that would be passed to resource agents, indicating whether a
> stop action was part of a recovery.
>
> Since that thread was long and covered a lot of topics, I'm starting a
> new one to focus on the core issue remaining:
>
> The original idea was to pass the number of restarts remaining before
> the resource will no longer tried to be started on the same node. This
> involves calculating (fail-count - migration-threshold), and that
> implies certain limitations: (1) it will only be set when the cluster
> checks migration-threshold; (2) it will only be set for the failed
> resource itself, not for other resources that may be recovered due to
> dependencies on it.
>
> Ulrich Windl proposed an alternative: setting a boolean value instead. I
> forgot to cc the list on my reply, so I'll summarize now: We would set a
> new variable like OCF_RESKEY_CRM_recovery=true

This concept worries me, especially when what we've implemented is
called OCF_RESKEY_CRM_restarting.

The name alone encourages people to "optimise" the agent to not actually
stop the service "because it's just going to start again shortly". I
know that's not what Adam would do, but not everyone understands how
clusters work.

There are any number of reasons why a cluster that intends to restart a
service may not do so. In such a scenario, a badly written agent would
cause the cluster to mistakenly believe that the service is stopped,
allowing it to start elsewhere.

It's true there are any number of ways to write bad agents, but I would
argue that we shouldn't be nudging people in that direction :)

> whenever a start is
> scheduled after a stop on the same node in the same transition. This
> would avoid the corner cases of the previous approach; instead of being
> tied to migration-threshold, it would be set whenever a recovery was
> being attempted, for any reason. And with this approach, it should be
> easier to set the variable for all actions on the resource
> (demote/stop/start/promote), rather than just the stop.
>
> I think the boolean approach fits all the envisioned use cases that have
> been discussed. Any objections to going that route instead of the count?
> --
> Ken Gaillot <kgail...@redhat.com>
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org