> On 16 May 2016, at 8:55 PM, Jehan-Guillaume de Rorthais <[email protected]>
> wrote:
>
> On Mon, 16 May 2016 13:15:11 +1000,
> Andrew Beekhof <[email protected]> wrote:
>
>>
>>> On 28 Apr 2016, at 7:26 PM, Jehan-Guillaume de Rorthais <[email protected]>
>>> wrote:
>>>
>>> Hello all,
>>>
>>> According to the developers guide, when demote is called on a stopped
>>> resource, the RA should return a soft error:
>>>
>>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>>
>>> «
>>> foobar_monitor
>>> rc=$?
>>> case "$rc" in
>>> [...]
>>>     "$OCF_NOT_RUNNING")
>>>         # Currently not running. Getting a demote action
>>>         # in this state is unexpected. Exit with an error
>>>         # and let the cluster manager recover.
>>>         ocf_log err "Resource is currently not running"
>>>         exit $OCF_ERR_GENERIC
>>>         ;;
>>> [...]
>>> »
>>>
>>> But to recover a master resource that is found not running, PEngine produces
>>> a transition with the following actions: demote -> stop -> start -> promote.
>>>
>>> If we follow the dev guide, the recover action is not possible on a
>>> stopped master, as the first action of the transition will always fail,
>>> leading to a migration and a -inf score on the old master node.
>>>
>>> My first thought was «why do a demote -> stop that breaks everything when
>>> it knows the resource is already stopped?!»
>>>
>>> If I understand correctly, I guess PEngine **must** produce such a
>>> transition so that the notify actions are triggered in case other living
>>> clones need to process them. Is that right?
>>
>> Yes, also because in theory there could be some cleanup that needs to happen.
>>
>>> If this is right, then maybe we should relax a bit what is
>>> written in the OCF dev guide?
>>
>> I would change that block to use
>>
>>     exit $OCF_NOT_RUNNING
>>
>> because we don’t know for sure that the stop will happen.
>
> I suppose returning OCF_NOT_RUNNING from the demote action would break the
> current transition, as the CRM is expecting an OCF_SUCCESS, wouldn't it?
Same as returning $OCF_ERR_GENERIC, yes.

> Or does the
> CRM conclude it does not need to run the next stop action?

I forget what the current semantics are; the PE may indeed decide not to schedule a stop action when it recomputes.

>
> I am worried about breaking a transition, as we rely on notify vars to detect
> the recovery of a slave, the recovery of a master, or a master move.

You can’t avoid it, unless you lie and return $OCF_SUCCESS.

>
> For a master or a slave recovery, we need to run some cleanup actions on the
> PostgreSQL side.

That would be an argument for changing the monitor action to return OCF_ERR_GENERIC if postgres isn’t running BUT cleanup IS needed, and reserving OCF_NOT_RUNNING for when everything is cleanly stopped.

> If we break the original transition, the new transition
> **might** (if the new transition is actually different) look like a normal
> master start->promote.

Not possible. There will be a failed action in there, so it won’t look normal.

>
> Regards,
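
A minimal sketch of the two suggestions made in this reply, for illustration only: demote reports $OCF_NOT_RUNNING when the resource is cleanly stopped, and monitor reports $OCF_ERR_GENERIC when the service is down but cleanup is still pending. The FOOBAR_STATE and FOOBAR_NEEDS_CLEANUP variables are hypothetical stand-ins for real probes of the service, the return codes are defined inline rather than sourced from ocf-shellfuncs, and the functions use "return" instead of "exit" so they can be called repeatedly. This is not taken from any existing resource agent.

```shell
#!/bin/sh
# Standard OCF return codes, defined here only to keep the sketch
# self-contained; a real agent gets them from ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Hypothetical stand-ins for real checks against the service.
: "${FOOBAR_STATE:=stopped}"      # master | slave | stopped
: "${FOOBAR_NEEDS_CLEANUP:=no}"   # yes | no

foobar_monitor() {
    case "$FOOBAR_STATE" in
        master) return $OCF_RUNNING_MASTER ;;
        slave)  return $OCF_SUCCESS ;;
        *)
            # Not running: report an error only if leftovers still
            # need cleaning; otherwise report a clean stop.
            if [ "$FOOBAR_NEEDS_CLEANUP" = yes ]; then
                return $OCF_ERR_GENERIC
            fi
            return $OCF_NOT_RUNNING
            ;;
    esac
}

foobar_demote() {
    rc=0
    foobar_monitor || rc=$?
    case "$rc" in
        "$OCF_RUNNING_MASTER")
            # Placeholder for the real demote work.
            FOOBAR_STATE=slave
            return $OCF_SUCCESS
            ;;
        "$OCF_SUCCESS")
            # Already a slave, nothing to do.
            return $OCF_SUCCESS
            ;;
        "$OCF_NOT_RUNNING")
            # Cleanly stopped: say so instead of a generic error.
            return $OCF_NOT_RUNNING
            ;;
        *)
            # Failed, or stopped with cleanup pending.
            return $OCF_ERR_GENERIC
            ;;
    esac
}
```

Either non-zero result from demote still counts as a failed action in the transition, as noted above; the sketch only changes which code is reported.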
_______________________________________________
Developers mailing list
[email protected]
http://clusterlabs.org/mailman/listinfo/developers
