On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:

> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
>
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that "provided
> > the resource hasn't failed within the past X seconds, forget the fact
> > that it failed more than X seconds ago"?
>
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old. It's a bit counter-intuitive but
> currently, Pacemaker only remembers a resource's most recent failure
> and the total count of failures, and changing that would be a big
> project.
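[For context, a sketch of the setting under discussion - not from the thread itself. failure-timeout is a resource meta attribute; in crm shell syntax it might look like the following, where the resource name "my_ip" and its parameters are purely illustrative:]

```shell
# Hypothetical resource definition ("my_ip" and its params are placeholders).
# Per Ken's explanation: with this config, *all* recorded failures expire
# together once the *most recent* failure is more than 600 seconds old -
# older failures are not aged out individually.
primitive my_ip ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.10 cidr_netmask=24 \
    meta migration-threshold=3 failure-timeout=600s
```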
So, are you saying that if a resource failed last Friday, and then again on
Saturday, but has been running perfectly happily ever since, a failure today
will trigger "that's it, we're moving it, it doesn't work here"?

That seems bizarre. Surely the length of time a resource has been running
without problems should be taken into account when deciding whether the node
it's running on is fit to handle it or not?

My problem is also bigger than that - and I can't believe there isn't a way
round the following, otherwise people couldn't use Pacemaker:

I have "migration-threshold=3" on most of my resources, and I have three
nodes.

If a resource fails for the third time (in any period of time) on a node, it
gets moved (along with the rest of its group) to another node. The cluster
does not forget that it failed and was moved away from the first node, though.
"crm status -f" confirms that to me.

If it then fails three times (in an hour, or a fortnight, whatever) on the
second node, it gets moved to node 3, and from that point on the cluster
thinks there's nowhere else to move it to, so another failure means a total
failure of the cluster.

There must be _something_ I'm doing wrong for the cluster to behave in this
way? I can't believe it's by design.

Regards,

Antony.

-- 
Anyone that's normal doesn't really achieve much.

 - Mark Blair, Australian rocket engineer

Please reply to the list;
                          please *don't* CC me.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
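[Editorial note, not part of the thread: for anyone hitting the same wall, the accumulated fail counts that migration-threshold compares against can be inspected and cleared manually with the standard Pacemaker tools. A sketch, with "my_resource" as a placeholder name:]

```shell
# Query the per-node fail count that "crm status -f" also reports.
crm_failcount --query --resource my_resource

# Clear the fail count and failure history for the resource on all nodes,
# so migration-threshold starts counting from zero again.
crm_resource --cleanup --resource my_resource
```

[Clearing the history this way is a manual workaround, not a substitute for the per-failure aging the thread is asking for.]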