On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> Hi.
>
> I'm trying to understand what looks to me like incorrect behaviour between
> cluster-recheck-interval and failure-timeout, under pacemaker 2.0.1
>
> I have three machines in a corosync (3.0.1 if it matters) cluster, managing 12
> resources in a single group.
>
> I'm following documentation from:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
>
> and
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html
>
> I have set a cluster property:
>
> cluster-recheck-interval=60s
>
> I have set a resource property:
>
> failure-timeout=180
>
> The docs say failure-timeout is "How many seconds to wait before acting as if
> the failure had not occurred, and potentially allowing the resource back to
> the node on which it failed."
>
> I think this should mean that if the resource fails and gets restarted, the
> fact that it failed will be "forgotten" after 180 seconds (or maybe a little
> longer, depending on exactly when the next cluster recheck is done).
>
> However what I'm seeing is that if the resource fails and gets restarted, and
> this then happens an hour later, it's still counted as two failures. If it

That is exactly correct.

> fails and gets restarted another hour after that, it's recorded as three
> failures and (because I have "migration-threshold=3") it gets moved to another
> node (and therefore all the other resources in group are moved as well).
>
> So, what am I misunderstanding about "failure-timeout", and what configuration
> setting do I need to use to tell pacemaker that "provided the resource hasn't
> failed within the past X seconds, forget the fact that it failed more than X
> seconds ago"?

Unfortunately, there is no way. failure-timeout expires *all* failures
once the *most recent* is that old. It's a bit counter-intuitive but
currently, Pacemaker only remembers a resource's most recent failure
and the total count of failures, and changing that would be a big
project.

> Thanks,
>
>
> Antony.
>
-- 
Ken Gaillot <kgail...@redhat.com>
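
To make the semantics described above concrete, here is a rough toy model in Python. It is not Pacemaker code: the class and function names (FailureRecord, sliding_window_count) are made up for illustration, and it ignores cluster-recheck-interval entirely by assuming expiry is checked at exactly the moment asked. It only contrasts "remember the total count and the most recent failure" with the per-failure sliding window the question asks for.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureRecord:
    """Hypothetical model: keeps only a total count and the latest failure time."""
    failure_timeout: float
    fail_count: int = 0
    last_failure: Optional[float] = None

    def expire_if_due(self, now: float) -> None:
        # All failures are forgotten together once the most recent one is old enough.
        if self.last_failure is not None and now - self.last_failure >= self.failure_timeout:
            self.fail_count = 0
            self.last_failure = None

    def record_failure(self, now: float) -> int:
        self.expire_if_due(now)
        self.fail_count += 1
        self.last_failure = now
        return self.fail_count


def sliding_window_count(failure_times, now, window):
    """The behaviour asked for in the question: count only failures newer than `window`."""
    return sum(1 for t in failure_times if now - t < window)


# Failures 100 seconds apart, with failure-timeout=180.
times = [0, 100, 200]
rec = FailureRecord(failure_timeout=180)
counts = [rec.record_failure(t) for t in times]
window_count = sliding_window_count(times, now=200, window=180)

print(counts)        # [1, 2, 3]
print(window_count)  # 2

# 180 seconds after the most recent failure, the whole count is cleared at once.
rec.expire_if_due(380)
print(rec.fail_count)  # 0
```

In this sketch each new failure arrives before the previous one is 180 seconds old, so the count is never cleared and reaches 3, whereas a per-failure window would have dropped the failure at t=0 and reported 2.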