A recent thread discussed a proposed feature: a new environment variable, passed to resource agents, that would indicate whether a stop action was part of a recovery.
Since that thread was long and covered a lot of topics, I'm starting a new one to focus on the core issue remaining.

The original idea was to pass the number of restarts remaining before the resource will no longer be started on the same node. This involves calculating (migration-threshold - fail-count), and that implies certain limitations:

(1) it will only be set when the cluster checks migration-threshold;
(2) it will only be set for the failed resource itself, not for other resources that may be recovered due to dependencies on it.

Ulrich Windl proposed an alternative: setting a boolean value instead. I forgot to cc the list on my reply, so I'll summarize now. We would set a new variable like OCF_RESKEY_CRM_recovery=true whenever a start is scheduled after a stop on the same node in the same transition. This would avoid the corner cases of the previous approach; instead of being tied to migration-threshold, it would be set whenever a recovery was being attempted, for any reason. And with this approach, it should be easier to set the variable for all actions on the resource (demote/stop/start/promote), rather than just the stop.

I think the boolean approach fits all the envisioned use cases that have been discussed. Any objections to going that route instead of the count?

-- 
Ken Gaillot <kgail...@redhat.com>
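
As an illustration of the boolean rule described above (a start scheduled after a stop on the same node in the same transition), here is a rough sketch in Python. It is not Pacemaker code: the Action tuple, the function names, and the list-of-actions model of a transition are all hypothetical, and it assumes the variable is simply left unset when no recovery is happening, which the proposal does not specify.

```python
from collections import namedtuple

# Hypothetical model of one scheduled action; not a real Pacemaker type.
Action = namedtuple("Action", ["resource", "task", "node"])


def recovering_pairs(transition):
    """Return the (resource, node) pairs for which a start is scheduled
    after a stop on the same node within this one transition."""
    stopped = set()
    recovering = set()
    for action in transition:  # assumed to be listed in scheduled order
        key = (action.resource, action.node)
        if action.task == "stop":
            stopped.add(key)
        elif action.task == "start" and key in stopped:
            recovering.add(key)
    return recovering


def extra_environment(action, recovering):
    """Extra environment for an action on a resource being recovered.

    Assumption: the variable is omitted (not set to "false") otherwise.
    """
    if (action.resource, action.node) in recovering:
        return {"OCF_RESKEY_CRM_recovery": "true"}
    return {}


# Example transition: "db" failed and is restarted in place, "web" depends
# on it and is restarted too, while "ip" is moved to another node.
example_transition = [
    Action("web", "stop", "node1"),
    Action("db", "stop", "node1"),
    Action("ip", "stop", "node1"),
    Action("db", "start", "node1"),
    Action("web", "start", "node1"),
    Action("ip", "start", "node2"),
]
example_recovering = recovering_pairs(example_transition)
```

In this sketch both "db" and the dependent "web" are flagged, because the rule looks only at the stop/start pairing and not at migration-threshold, and every action on a flagged resource (including its stop) sees the variable. "ip" is not flagged, since its start lands on a different node.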