On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot <kgail...@redhat.com> wrote:
> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>> A recent thread discussed a proposed new feature, a new environment
>>> variable that would be passed to resource agents, indicating whether a
>>> stop action was part of a recovery.
>>>
>>> Since that thread was long and covered a lot of topics, I'm starting a
>>> new one to focus on the core issue remaining:
>>>
>>> The original idea was to pass the number of restarts remaining before
>>> the resource will no longer be started on the same node. This
>>> involves calculating (migration-threshold - fail-count), and that
>>> implies certain limitations: (1) it will only be set when the cluster
>>> checks migration-threshold; (2) it will only be set for the failed
>>> resource itself, not for other resources that may be recovered due to
>>> dependencies on it.
>>>
>>> Ulrich Windl proposed an alternative: setting a boolean value instead. I
>>> forgot to cc the list on my reply, so I'll summarize now: We would set a
>>> new variable like OCF_RESKEY_CRM_recovery=true
>>
>> This concept worries me, especially when what we've implemented is
>> called OCF_RESKEY_CRM_restarting.
>
> Agreed; I plan to rename it yet again, to OCF_RESKEY_CRM_start_expected.
>
>> The name alone encourages people to "optimise" the agent to not
>> actually stop the service "because it's just going to start again
>> shortly". I know that's not what Adam would do, but not everyone
>> understands how clusters work.
>>
>> There are any number of reasons why a cluster that intends to restart
>> a service may not do so. In such a scenario, a badly written agent
>> would cause the cluster to mistakenly believe that the service is
>> stopped - allowing it to start elsewhere.
>>
>> It's true there are any number of ways to write bad agents, but I would
>> argue that we shouldn't be nudging people in that direction :)
>
> I do have mixed feelings about that. I think if we name it
> start_expected, and document it carefully, we can avoid any casual mistakes.
>
> My main question is how useful it would actually be in the proposed use
> cases. Considering the possibility that the expected start might never
> happen (or fail), can an RA really do anything different if
> start_expected=true?

I would have thought not. Correctness should trump optimal.
But I'm prepared to be mistaken.

> If the use case is there, I have no problem with
> adding it, but I want to make sure it's worthwhile.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
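
To make the concern in this thread concrete, here is a rough sketch of how an agent's stop action could treat the proposed hint safely. It is illustrative only and rests on several assumptions: the variable name OCF_RESKEY_CRM_start_expected is the one proposed above and may not be what ships, the helper names (restarts_remaining, start_expected, stop) and the stop_service/is_running callbacks are hypothetical, and the restart arithmetic ignores special cases such as an unset or infinite migration-threshold. The point is that the hint may change what gets logged, but never whether the service is really stopped.

```python
import os

# Standard OCF exit codes used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1

# Assumption: the variable name proposed in this thread; it may be renamed.
HINT_VAR = "OCF_RESKEY_CRM_start_expected"


def restarts_remaining(fail_count, migration_threshold):
    """Simplified version of the original idea: restarts left on this node."""
    return max(migration_threshold - fail_count, 0)


def start_expected(env=None):
    """Read the boolean hint; a missing variable is treated as false."""
    env = os.environ if env is None else env
    return env.get(HINT_VAR, "false").strip().lower() in ("true", "1", "yes")


def stop(stop_service, is_running, env=None, log=print):
    """Stop action that never skips the real stop, whatever the hint says.

    stop_service and is_running are hypothetical callbacks supplied by
    the agent; they stand in for the service-specific commands.
    """
    if start_expected(env):
        # The hint is advisory only: the expected start may never happen.
        log("stop is part of a recovery; a start is expected afterwards")
    stop_service()
    if is_running():
        return OCF_ERR_GENERIC
    return OCF_SUCCESS
```

Written this way, a restart that never materialises still leaves the cluster with an accurate view: the stop either succeeded for real or reported an error.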