Hi Antony,

are you sure you are not mixing up the two: fail-counts and sticky failure constraints? I mean: once the fail-count has exceeded the migration-threshold, you get a sticky resource constraint (a candidate for "cleanup"). Even if the fail-count is reset after that, the constraint will still be there.
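For example, a rough sketch of how you could look at the two separately and then clean up (the resource name "MyResource" and node name "node1" are just placeholders, and the exact syntax may differ between crmsh/Pacemaker versions):

    # show the fail-counts the cluster still remembers (same idea as "crm status -f")
    crm_mon --failcounts

    # query the fail-count of one resource on one node (placeholder names)
    crm_failcount --query --resource MyResource --node node1

    # list the constraints currently stored in the CIB
    cibadmin --query --scope constraints

    # "cleanup" clears the failure history for that resource
    crm resource cleanup MyResource

If the resource is still kept away from a node after the fail-count has been reset, that is the kind of leftover I mean.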
Regards,
Ulrich

>>> Antony Stone <antony.st...@ha.open.source.it> wrote on 31.03.2021 at 16:48 in message
<202103311648.54643.antony.st...@ha.open.source.it>:
> On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:
>
>> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
>>
>> > So, what am I misunderstanding about "failure-timeout", and what
>> > configuration setting do I need to use to tell pacemaker that "provided
>> > the resource hasn't failed within the past X seconds, forget the fact
>> > that it failed more than X seconds ago"?
>>
>> Unfortunately, there is no way. failure-timeout expires *all* failures
>> once the *most recent* is that old. It's a bit counter-intuitive but
>> currently, Pacemaker only remembers a resource's most recent failure
>> and the total count of failures, and changing that would be a big
>> project.
>
> So, are you saying that if a resource failed last Friday, and then again on
> Saturday, but has been running perfectly happily ever since, a failure today
> will trigger "that's it, we're moving it, it doesn't work here"?
>
> That seems bizarre.
>
> Surely the length of time a resource has been running without problem should
> be taken into account when deciding whether the node it's running on is fit
> to handle it or not?
>
> My problem is also bigger than that - and I can't believe there isn't a way
> round the following, otherwise people couldn't use pacemaker:
>
> I have "migration-threshold=3" on most of my resources, and I have three
> nodes.
>
> If a resource fails for the third time (in any period of time) on a node, it
> gets moved (along with the rest in the group) to another node. The cluster
> does not forget that it failed and was moved away from the first node,
> though.
>
> "crm status -f" confirms that to me.
>
> If it then fails three times (in an hour, or a fortnight, whatever) on the
> second node, it gets moved to node 3, and from that point on the cluster
> thinks there's nowhere else to move it to, so another failure means a total
> failure of the cluster.
>
> There must be _something_ I'm doing wrong for the cluster to behave in this
> way? I can't believe it's by design.
>
>
> Regards,
>
>
> Antony.
>
> --
> Anyone that's normal doesn't really achieve much.
>
>  - Mark Blair, Australian rocket engineer
>
> Please reply to the list;
> please *don't* CC me.

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/