On Wed, 2021-03-31 at 17:38 +0200, Antony Stone wrote:
> On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:
>
> > I'm only interested in the most recent failure. I'm saying that
> > once that failure is more than "failure-timeout" seconds old, I
> > want the fact that the resource failed to be forgotten, so that it
> > can be restarted or moved between nodes as normal, and not either
> > be moved to another node

Ah, then yes, that's how it works. I thought you wanted older failures
to expire as they aged, reducing the total failure count.

> > just because (a) there were two failures last Friday and then one
> > today, or (b) get stuck and not run on any nodes at all because
> > all three nodes had three failures sometime in the past month.
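(For anyone following the thread, failure-timeout and
migration-threshold are resource meta-attributes. A minimal crm shell
sketch of the kind of configuration being discussed -- the resource
name "myrsc", the Dummy agent, and all of the values are placeholders,
not taken from Antony's actual cluster:

   # Move the resource off a node after 3 failures, and forget the
   # failures once the most recent one is more than 120s old
   crm configure primitive myrsc ocf:heartbeat:Dummy \
       op monitor interval=30s \
       meta migration-threshold=3 failure-timeout=120s

With something like that in place, the expectation is as described
above: once 120s pass with no new failure, the fail count should be
treated as expired.)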
> I've just confirmed that this is working as expected on pacemaker
> 1.1.16 (Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).
>
> I have one cluster of 3 machines running pacemaker 1.1.16 and I have
> another cluster of 3 machines running pacemaker 2.0.1.
>
> They are both running the same set of resources.
>
> I just deliberately killed the same resource on each cluster, and
> sure enough "crm status -f" on both told me it had a fail-count of 1,
> with a last-failure timestamp.
>
> I waited 5 minutes (well above my failure-timeout value) and asked
> for "crm status -f" again.
>
> On pacemaker 1.1.16 there was simply a list of resources; no mention
> of failures. Just what I want.
>
> On pacemaker 2.0.1 there was a list of resources plus a fail-count=1
> and a last-failure timestamp of 5 minutes earlier.

That sounds like a bug in the Debian port. I'm not aware of any
relevant bugs reported upstream.

> To be sure I'm not being impatient, I've left it an hour (I did this
> test earlier, while I was still trying to understand the timing
> interactions) and the fail-count does not go away.
>
> Does anyone have suggestions on how to debug this difference in
> behaviour between pacemaker 1.1.16 and 2.0.1, because at present it
> prevents me being able to upgrade an operational cluster, as the
> result is simply unusable.
>
> Thanks,
>
> Antony.
-- 
Ken Gaillot <kgail...@redhat.com>
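P.S. To debug this, I'd look at the stored failure state directly
rather than only through "crm status -f". A sketch using pacemaker's
own tools -- the resource and node names ("myrsc", "node1") are
placeholders, and the option spellings below are the pacemaker 2.x
ones, which may differ slightly on 1.1:

   # Show cluster status once, including resource fail counts
   crm_mon --one-shot --failcounts

   # Query the recorded fail count for one resource on one node
   crm_failcount --query -r myrsc -N node1

   # Clear the failure by hand if it never expires on its own
   crm_resource --cleanup -r myrsc -N node1

If crm_failcount still reports a nonzero count on 2.0.1 long after the
failure-timeout has passed, the value really is being kept rather than
just being displayed stale, which would help narrow down where the
expiry is going wrong.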