On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>> then migrate", but I can't think of a real-world situation where that
>>> makes sense,
>>>
>>>
>>> really?
>>>
>>> it is not uncommon to hear "i know its failed, but i dont want the
>>> cluster to do anything until its _really_ failed"
>>
>> Hmm, I guess that would be similar to how monitoring systems such as
>> nagios can be configured to send an alert only if N checks in a row
>> fail. That's useful where transient outages (e.g. a webserver hitting
>> its request limit) are acceptable for a short time.
>>
>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>> is not "in a row" but "since the count was last cleared".
>
> It would be a major change, but perhaps it should be "in-a-row" and
> successfully performing the action clears the count.
> Its entirely possible that the current behaviour is like that because
> I wasn't smart enough to implement anything else at the time :-)

Or you were smart enough to realize what a can of worms it is. :) Take a
look at all of nagios' options for deciding when a failure becomes "real".

If you clear failures after a success, you can't detect/recover a
resource that is flapping.

>> "Ignore up to three monitor failures if they occur in a row [or, within
>> 10 minutes?], then try soft recovery for the next two monitor failures,
>> then ban this node for the next monitor failure." Not sure being able to
>> say that is worth the complexity.
>
> Not disagreeing

It only makes sense to escalate from ignore -> restart -> hard, so maybe
something like:

  op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban

To express current default behavior:

  op start ignore-fail=0 soft-fail=0 on-hard-fail=ban
  op stop ignore-fail=0 soft-fail=0 on-hard-fail=fence
  op * ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban

on-fail, migration-threshold, and start-failure-is-fatal would be
deprecated (and would be easy to map to the new parameters).

I'd avoid the hassles of counting failures "in a row", and stick with
counting failures since the last cleanup.

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
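
To make the proposed ignore -> restart -> hard escalation concrete, here is
a rough sketch in Python of how the three options might map a failure count
to a response. Nothing here is Pacemaker code: ignore-fail, soft-fail and
on-hard-fail are only proposed in this thread, and the FailPolicy class, its
action_for method and the "restart" label for soft recovery are hypothetical
names used for illustration. It assumes failures are counted since the last
cleanup, not "in a row".

```python
import math
from dataclasses import dataclass

# Stand-in for Pacemaker's INFINITY value (an assumption for this sketch).
INFINITY = math.inf


@dataclass(frozen=True)
class FailPolicy:
    """Hypothetical model of the proposed per-operation options."""
    ignore_fail: float  # number of failures to ignore
    soft_fail: float    # number of further failures answered with soft recovery
    on_hard_fail: str   # response once both budgets are used up

    def action_for(self, fail_count: int) -> str:
        """Response to the fail_count-th failure since the last cleanup."""
        if fail_count <= self.ignore_fail:
            return "ignore"
        if fail_count <= self.ignore_fail + self.soft_fail:
            return "restart"
        return self.on_hard_fail


# op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
monitor = FailPolicy(ignore_fail=3, soft_fail=2, on_hard_fail="ban")

# The "current default behavior" lines from the proposal
start = FailPolicy(ignore_fail=0, soft_fail=0, on_hard_fail="ban")
stop = FailPolicy(ignore_fail=0, soft_fail=0, on_hard_fail="fence")
other = FailPolicy(ignore_fail=0, soft_fail=INFINITY, on_hard_fail="ban")

if __name__ == "__main__":
    for n in range(1, 7):
        print(n, monitor.action_for(n))
```

Under these assumptions the monitor policy ignores failures 1 to 3, restarts
on failures 4 and 5, and bans the node on failure 6. The start and stop
policies go straight to the hard action on the first failure, and the
catch-all policy never leaves soft recovery because its soft-fail budget is
INFINITY.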