On 09/22/2016 09:53 AM, Jan Pokorný wrote:
> On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>> Ken Gaillot <kgail...@redhat.com> writes:
>>
>>> I'm not saying it's a bad idea, just that it's more complicated than it
>>> first sounds, so it's worth thinking through the implications.
>>
>> Thinking about it and looking at how complicated it gets, maybe what
>> you'd really want, to make it clearer for the user, is the ability to
>> explicitly configure the behavior, either globally or per-resource. So
>> instead of having to tweak a set of variables that interact in complex
>> ways, you'd configure something like rule expressions,
>>
>> <on_fail>
>>   <restart repeat="3" />
>>   <migrate timeout="60s" />
>>   <fence/>
>> </on_fail>
>>
>> So, try to restart the service 3 times, if that fails migrate the
>> service, if it still fails, fence the node.
>>
>> (obviously the details and XML syntax are just an example)
>>
>> This would then replace on-fail, migration-threshold, etc.
>
> I must admit that in previous emails in this thread, I wasn't able to
> follow during the first pass, which is not the case with this procedural
> (sequence-ordered) approach. Though someone can argue it doesn't take
> type of operation into account, which might again open the door for
> non-obvious interactions.

"restart" is the only on-fail value that it makes sense to escalate.

block/stop/fence/standby are final. Block means "don't touch the
resource again", so there can't be any further response to failures.
Stop/fence/standby move the resource off the local node, so failure
handling is reset (there are 0 failures on the new node to begin with).

"Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
then migrate", but I can't think of a real-world situation where that
makes sense, and it would be a significant re-implementation of "ignore"
(which currently ignores the state of having failed, as opposed to a
particular instance of failure).

What the interface needs to express is: "If this operation fails,
optionally try a soft recovery [always stop+start], but if <N> failures
occur on the same node, proceed to a [configurable] hard recovery".

And of course the interface will need to be different depending on how
certain details are decided, e.g. whether any failures count toward <N>
or just failures of one particular operation type, and whether the hard
recovery type can vary depending on what operation failed.
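
To make the "soft recovery, then hard recovery after <N> failures on the
same node" rule concrete, here is a minimal sketch in Python. It is not
Pacemaker code or configuration: the class, method, and field names are
all made up for illustration. It also picks one answer to the open
questions (every failure counts toward <N> regardless of operation type,
and there is a single hard recovery type per policy), and it assumes the
<N>th failure itself is the one that triggers hard recovery.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical names throughout; this is an illustration, not Pacemaker's API.
# These are the on-fail values described as "final" (no further escalation).
FINAL_ACTIONS = {"block", "stop", "fence", "standby"}


@dataclass
class EscalationPolicy:
    threshold: int              # the <N> in the text
    hard_recovery: str          # configurable, must be a final action
    soft_recovery: bool = True  # soft recovery is optional
    # Failure counts keyed by (resource, node): a resource that lands on a
    # new node starts with 0 failures there.
    failures: dict = field(default_factory=lambda: defaultdict(int))

    def __post_init__(self):
        if self.hard_recovery not in FINAL_ACTIONS:
            raise ValueError(f"not a final action: {self.hard_recovery!r}")
        if self.threshold < 1:
            raise ValueError("threshold must be at least 1")

    def on_failure(self, resource: str, node: str) -> str:
        """Record one failure and return the recovery action to take."""
        key = (resource, node)
        self.failures[key] += 1
        # Assumption: any failed operation counts toward the threshold.
        if not self.soft_recovery or self.failures[key] >= self.threshold:
            return self.hard_recovery
        return "restart"  # soft recovery is always stop+start


policy = EscalationPolicy(threshold=3, hard_recovery="fence")
actions = [policy.on_failure("db", "node1") for _ in range(3)]
print(actions)                            # ['restart', 'restart', 'fence']
print(policy.on_failure("db", "node2"))   # restart
```

Counting only failures of one operation type would mean adding the
operation to the key, and a per-operation hard recovery would mean
turning hard_recovery into a mapping, which is where the interface
starts to differ depending on how those details are decided.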