On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> Hi.
> I'm trying to understand what looks to me like incorrect behaviour
> between 
> cluster-recheck-interval and failure-timeout, under pacemaker 2.0.1.
> I have three machines in a corosync (3.0.1 if it matters) cluster,
> managing 12 
> resources in a single group.
> I'm following documentation from:
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> and
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html
> I have set a cluster property:
>       cluster-recheck-interval=60s
> I have set a resource property:
>       failure-timeout=180
> The docs say failure-timeout is "How many seconds to wait before
> acting as if 
> the failure had not occurred, and potentially allowing the resource
> back to 
> the node on which it failed."
> I think this should mean that if the resource fails and gets
> restarted, the 
> fact that it failed will be "forgotten" after 180 seconds (or maybe a
> little 
> longer, depending on exactly when the next cluster recheck is done).
> However what I'm seeing is that if the resource fails and gets
> restarted, and 
> this then happens an hour later, it's still counted as two
> failures.  If it 

That is exactly correct.

> fails and gets restarted another hour after that, it's recorded as
> three 
> failures and (because I have "migration-threshold=3") it gets moved
> to another 
> node (and therefore all the other resources in the group are moved as
> well).
> So, what am I misunderstanding about "failure-timeout", and what
> configuration 
> setting do I need to use to tell pacemaker that "provided the
> resource hasn't 
> failed within the past X seconds, forget the fact that it failed more
> than X 
> seconds ago"?

Unfortunately, there is no way. failure-timeout expires *all* failures
once the *most recent* is that old. It's a bit counter-intuitive but
currently, Pacemaker only remembers a resource's most recent failure
and the total count of failures, and changing that would be a big

> Thanks,
> Antony.
Ken Gaillot <kgail...@redhat.com>
