On 06/06/2016 03:30 PM, Vladislav Bogdanov wrote:
> 06.06.2016 22:43, Ken Gaillot wrote:
>> On 06/06/2016 12:25 PM, Vladislav Bogdanov wrote:
>>> 06.06.2016 19:39, Ken Gaillot wrote:
>>>> On 06/05/2016 07:27 PM, Andrew Beekhof wrote:
>>>>> On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>>>>> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>>>>>>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>>>>>>> A recent thread discussed a proposed new feature: a new environment
>>>>>>>> variable that would be passed to resource agents, indicating whether
>>>>>>>> a stop action was part of a recovery.
>>>>>>>>
>>>>>>>> Since that thread was long and covered a lot of topics, I'm starting
>>>>>>>> a new one to focus on the core issue remaining:
>>>>>>>>
>>>>>>>> The original idea was to pass the number of restarts remaining
>>>>>>>> before the cluster stops trying to start the resource on the same
>>>>>>>> node. This involves calculating (fail-count - migration-threshold),
>>>>>>>> and that implies certain limitations: (1) it will only be set when
>>>>>>>> the cluster checks migration-threshold; (2) it will only be set for
>>>>>>>> the failed resource itself, not for other resources that may be
>>>>>>>> recovered due to dependencies on it.
>>>>>>>>
>>>>>>>> Ulrich Windl proposed an alternative: setting a boolean value
>>>>>>>> instead. I forgot to cc the list on my reply, so I'll summarize now:
>>>>>>>> we would set a new variable like OCF_RESKEY_CRM_recovery=true
>>>>>>>
>>>>>>> This concept worries me, especially when what we've implemented is
>>>>>>> called OCF_RESKEY_CRM_restarting.
>>>>>>
>>>>>> Agreed; I plan to rename it yet again, to
>>>>>> OCF_RESKEY_CRM_start_expected.
>>>>>>
>>>>>>> The name alone encourages people to "optimise" the agent to not
>>>>>>> actually stop the service "because it's just going to start again
>>>>>>> shortly". I know that's not what Adam would do, but not everyone
>>>>>>> understands how clusters work.
>>>>>>>
>>>>>>> There are any number of reasons why a cluster that intends to
>>>>>>> restart a service may not do so. In such a scenario, a badly written
>>>>>>> agent would cause the cluster to mistakenly believe that the service
>>>>>>> is stopped - allowing it to start elsewhere.
>>>>>>>
>>>>>>> It's true there are any number of ways to write bad agents, but I
>>>>>>> would argue that we shouldn't be nudging people in that direction :)
>>>>>>
>>>>>> I do have mixed feelings about that. I think if we name it
>>>>>> start_expected, and document it carefully, we can avoid any casual
>>>>>> mistakes.
>>>>>>
>>>>>> My main question is how useful it would actually be in the proposed
>>>>>> use cases. Considering the possibility that the expected start might
>>>>>> never happen (or fail), can an RA really do anything different if
>>>>>> start_expected=true?
>>>>>
>>>>> I would have thought not. Correctness should trump optimality.
>>>>> But I'm prepared to be mistaken.
>>>>>
>>>>>> If the use case is there, I have no problem with adding it, but I
>>>>>> want to make sure it's worthwhile.
>>>>
>>>> Anyone have comments on this?
>>>>
>>>> A simple example: Pacemaker calls an RA stop with start_expected=true,
>>>> then before the start happens, someone disables the resource, so the
>>>> start is never called. Or the node is fenced before the start happens,
>>>> etc.
>>>>
>>>> Is there anything significant an RA can do differently based on
>>>> start_expected=true/false without causing problems if an expected
>>>> start never happens?
>>>
>>> Yep.
>>>
>>> It may request a stop of other resources:
>>> * on that node, by removing some node attributes which participate in
>>>   location constraints
>>> * or cluster-wide, by revoking (or putting into standby) a cluster
>>>   ticket that other resources depend on
>>>
>>> The latter case is why I asked about the possibility of passing the
>>> name of the node the resource is intended to be started on, instead
>>> of a boolean value (in comments to PR #1026) - I would use it to
>>> request a stop of Lustre MDTs and OSTs, by revoking the ticket they
>>> depend on, if the MGS (the primary Lustre component, which does all
>>> "request routing") fails to start anywhere in the cluster. That way,
>>> if the RA does not receive any node name,
>>
>> Why would ordering constraints be insufficient?
>
> They are in place, but advisory ones, to allow MGS fail/switch-over.
>
>> What happens if the MDTs/OSTs continue running because a start of the
>> MGS was expected, but something prevents the start from actually
>> happening?
>
> Nothing critical; Lustre clients won't be able to contact them without
> the MGS running, and will hang.
> But it is safer to shut them down if it is known that the MGS cannot
> be started right now - especially if geo-cluster failover is expected
> in that case (as the MGS can be local to a site, contrary to all other
> Lustre parts, which need to be replicated). Actually, that is the only
> part of the puzzle remaining to "solve" that big project, and IMHO it
> is enough to have the name of the node of an intended start, or
> nothing, in that attribute (nothing means stop everything and initiate
> geo-failover if needed). If, for example, fencing happens on a node
> intended to start the resource, then stop will be called again after
> the next start failure, once failure-timeout lapses. That would be
> much better than no information at all. A total stop or geo-failover
> would then happen with some (configurable) delay, instead of leaving
> the whole filesystem in an unusable state requiring manual
> intervention.
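To make that concrete, here is a minimal sketch of the stop path you
describe, assuming the proposed OCF_RESKEY_CRM_start_expected variable
(the ticket name "lustre-services" and the stop_mgs helper are
hypothetical):

    mgs_stop() {
        # The MGS must really be stopped regardless of the hint, since
        # an expected start may never actually happen.
        stop_mgs || return $OCF_ERR_GENERIC

        if [ "${OCF_RESKEY_CRM_start_expected:-true}" = "false" ]; then
            # No start is expected anywhere: take the dependent
            # MDTs/OSTs down too by revoking the ticket they require.
            crm_ticket --ticket lustre-services --revoke --force
        fi
        return $OCF_SUCCESS
    }
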
My gut feeling is that this is getting RAs a little too involved in the
cluster's inner workings. If I understand your idea correctly, it would
be sufficient for your needs to know whether a start is expected on any
node in the same transition. So maybe start_expected=no/local/peer would
cover both this use case and the original one.

>>
>>> then it can be "almost sure" Pacemaker does not intend to restart
>>> the resource (yet) and can request a stop of everything else
>>> (because the filesystem is not usable anyway). Later, if another
>>> start attempt (caused by failure-timeout expiration) succeeds, the
>>> RA may grant the ticket back, and all the other resources start
>>> again.
>>>
>>> Best,
>>> Vladislav

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
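A rough sketch of how an agent's stop action might branch on those
suggested values (no/local/peer is only the naming floated above, not an
implemented interface; the ticket name is again hypothetical):

    case "${OCF_RESKEY_CRM_start_expected:-no}" in
        local|peer)
            # A start is expected somewhere in this transition; leave
            # the ticket granted so dependent resources keep running.
            ;;
        no)
            # No start expected on any node: stop the dependent
            # resources as well by revoking their ticket.
            crm_ticket --ticket lustre-services --revoke --force
            ;;
    esac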