Re: [ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Ken Gaillot
On Wed, 2021-03-31 at 17:38 +0200, Antony Stone wrote:
> On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:
> 
> > I'm only interested in the most recent failure.  I'm saying that
> > once that failure is more than "failure-timeout" seconds old, I want
> > the fact that the resource failed to be forgotten, so that it can be
> > restarted or moved between nodes as normal, and not either be moved
> > to another node

Ah, then yes, that's how it works.

I thought you wanted older failures to expire as they aged, reducing
the total failure count.

> > just because (a) there were two failures last Friday and then one
> > today, or (b) get stuck and not run on any nodes at all because all
> > three nodes had three failures sometime in the past month.
> 
> I've just confirmed that this is working as expected on pacemaker
> 1.1.16 (Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).
> 
> I have one cluster of 3 machines running pacemaker 1.1.16 and I have
> another cluster of 3 machines running pacemaker 2.0.1
> 
> They are both running the same set of resources.
> 
> I just deliberately killed the same resource on each cluster, and sure
> enough "crm status -f" on both told me it had a fail-count of 1, with
> a last-failure timestamp.
> 
> I waited 5 minutes (well above my failure-timeout value) and asked for
> "crm status -f" again.
> 
> On pacemaker 1.1.16 there was simply a list of resources; no mention
> of failures.  Just what I want.
> 
> On pacemaker 2.0.1 there was a list of resources plus a fail-count=1
> and a last-failure timestamp of 5 minutes earlier.

That sounds like a bug in the Debian port. I'm not aware of any
relevant bugs reported upstream.

> To be sure I'm not being impatient, I've left it an hour (I did this
> test earlier, while I was still trying to understand the timing
> interactions) and the fail-count does not go away.
> 
> 
> Does anyone have suggestions on how to debug this difference in
> behaviour between pacemaker 1.1.16 and 2.0.1, because at present it
> prevents me from upgrading an operational cluster; the result is
> simply unusable.
> 
> 
> Thanks,
> 
> 
> Antony.
> 
-- 
Ken Gaillot 



Re: [ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:

> I'm only interested in the most recent failure.  I'm saying that once that
> failure is more than "failure-timeout" seconds old, I want the fact that
> the resource failed to be forgotten, so that it can be restarted or moved
> between nodes as normal, and not either be moved to another node just
> because (a) there were two failures last Friday and then one today, or (b)
> get stuck and not run on any nodes at all because all three nodes had
> three failures sometime in the past month.

I've just confirmed that this is working as expected on pacemaker 1.1.16 
(Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).

I have one cluster of 3 machines running pacemaker 1.1.16 and I have another 
cluster of 3 machines running pacemaker 2.0.1

They are both running the same set of resources.

I just deliberately killed the same resource on each cluster, and sure enough 
"crm status -f" on both told me it had a fail-count of 1, with a last-failure 
timestamp.

I waited 5 minutes (well above my failure-timeout value) and asked for "crm 
status -f" again.

On pacemaker 1.1.16 there was simply a list of resources; no mention of 
failures.  Just what I want.

On pacemaker 2.0.1 there was a list of resources plus a fail-count=1 and a 
last-failure timestamp of 5 minutes earlier.

To be sure I'm not being impatient, I've left it an hour (I did this test 
earlier, while I was still trying to understand the timing interactions) and 
the fail-count does not go away.
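
In case it helps anyone reproduce this, the check itself is nothing exotic;
I believe the same information can also be read directly with something
along these lines (the resource and node names here are placeholders for my
real ones):

  # show current fail counts for all resources
  crm_mon --one-shot --failcounts

  # query the stored fail count for one resource on one node
  crm_failcount --query --resource my_resource --node node1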


Does anyone have suggestions on how to debug this difference in behaviour 
between pacemaker 1.1.16 and 2.0.1, because at present it prevents me from 
upgrading an operational cluster; the result is simply unusable.


Thanks,


Antony.

-- 
Perfection in design is achieved not when there is nothing left to add, but 
rather when there is nothing left to take away.

 - Antoine de Saint-Exupery

   Please reply to the list;
 please *don't* CC me.


Re: [ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:

> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> > 
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that "provided the
> > resource hasn't failed within the past X seconds, forget the fact that it
> > failed more than X seconds ago"?
> 
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old.

I've re-read the above sentence, and in fact you seem to be agreeing with my 
expectation (which is not what happens).

> It's a bit counter-intuitive but currently, Pacemaker only remembers a
> resource's most recent failure and the total count of failures, and changing
> that would be a big project.

I'm only interested in the most recent failure.  I'm saying that once that 
failure is more than "failure-timeout" seconds old, I want the fact that the 
resource failed to be forgotten, so that it can be restarted or moved between 
nodes as normal, and not either be moved to another node just because (a) 
there were two failures last Friday and then one today, or (b) get stuck and 
not run on any nodes at all because all three nodes had three failures 
sometime in the past month.


Thanks,


Antony.


-- 
The Magic Words are Squeamish Ossifrage.

   Please reply to the list;
 please *don't* CC me.


Re: [ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Antony Stone
On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:

> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
>
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that "provided the
> > resource hasn't failed within the past X seconds, forget the fact that it
> > failed more than X seconds ago"?
> 
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old. It's a bit counter-intuitive but
> currently, Pacemaker only remembers a resource's most recent failure
> and the total count of failures, and changing that would be a big
> project.

So, are you saying that if a resource failed last Friday, and then again on 
Saturday, but has been running perfectly happily ever since, a failure today 
will trigger "that's it, we're moving it, it doesn't work here"?

That seems bizarre.

Surely the length of time a resource has been running without problem should 
be taken into account when deciding whether the node it's running on is fit to 
handle it or not?

My problem is also bigger than that - and I can't believe there isn't a way 
round the following, otherwise people couldn't use pacemaker:

I have "migration-threshold=3" on most of my resources, and I have three 
nodes.

If a resource fails for the third time (in any period of time) on a node, it 
gets moved (along with the rest in the group) to another node.  The cluster 
does not forget that it failed and was moved away from the first node, though.

"crm status -f" confirms that to me.

If it then fails three times (in an hour, or a fortnight, whatever) on the 
second node, it gets moved to node 3, and from that point on the cluster 
thinks there's nowhere else to move it to, so another failure means a total 
failure of the cluster.

There must be _something_ I'm doing wrong for the cluster to behave in this 
way?  I can't believe it's by design.
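
(For completeness: the fail counts can at least be cleared by hand with 
something like the command below, where "my_group" stands in for the real 
group name; but needing to do that manually is exactly what failure-timeout 
is supposed to make unnecessary.)

  # forget recorded failures (and reset the fail count) for a resource/group
  crm_resource --cleanup --resource my_group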


Regards,


Antony.

-- 
Anyone that's normal doesn't really achieve much.

 - Mark Blair, Australian rocket engineer

   Please reply to the list;
 please *don't* CC me.


Re: [ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Ken Gaillot
On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> Hi.
> 
> I'm trying to understand what looks to me like incorrect behaviour
> between cluster-recheck-interval and failure-timeout, under pacemaker
> 2.0.1
> 
> I have three machines in a corosync (3.0.1 if it matters) cluster,
> managing 12 resources in a single group.
> 
> I'm following documentation from:
> 
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> 
> and
> 
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html
> 
> I have set a cluster property:
> 
>   cluster-recheck-interval=60s
> 
> I have set a resource property:
> 
>   failure-timeout=180
> 
> The docs say failure-timeout is "How many seconds to wait before
> acting as if the failure had not occurred, and potentially allowing
> the resource back to the node on which it failed."
> 
> I think this should mean that if the resource fails and gets
> restarted, the fact that it failed will be "forgotten" after 180
> seconds (or maybe a little longer, depending on exactly when the next
> cluster recheck is done).
> 
> However what I'm seeing is that if the resource fails and gets
> restarted, and this then happens an hour later, it's still counted as
> two failures.

That is exactly correct.

> If it fails and gets restarted another hour after that, it's recorded
> as three failures and (because I have "migration-threshold=3") it gets
> moved to another node (and therefore all the other resources in the
> group are moved as well).
> 
> So, what am I misunderstanding about "failure-timeout", and what
> configuration setting do I need to use to tell pacemaker that
> "provided the resource hasn't failed within the past X seconds, forget
> the fact that it failed more than X seconds ago"?

Unfortunately, there is no way. failure-timeout expires *all* failures
once the *most recent* is that old. It's a bit counter-intuitive but
currently, Pacemaker only remembers a resource's most recent failure
and the total count of failures, and changing that would be a big
project.
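
To illustrate with a made-up timeline, assuming failure-timeout=180 and
cluster-recheck-interval=60s: failures at 10:00:00 and 10:02:00 give a
fail count of 2. The first failure does not expire on its own at 10:03:00;
instead, once the most recent failure is 180 seconds old (10:05:00), both
failures expire together at the next recheck and the count drops back to 0.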


> Thanks,
> 
> 
> Antony.
> 
-- 
Ken Gaillot 



[ClusterLabs] cluster-recheck-interval and failure-timeout

2021-03-31 Thread Antony Stone
Hi.

I'm trying to understand what looks to me like incorrect behaviour between 
cluster-recheck-interval and failure-timeout, under pacemaker 2.0.1

I have three machines in a corosync (3.0.1 if it matters) cluster, managing 12 
resources in a single group.

I'm following documentation from:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html

and

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html

I have set a cluster property:

cluster-recheck-interval=60s

I have set a resource property:

failure-timeout=180
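
For reference, in crm shell terms that's roughly the following (the
primitive here is just an illustrative dummy, not one of my real
resources):

  crm configure property cluster-recheck-interval=60s
  crm configure primitive test_rsc ocf:heartbeat:Dummy \
      op monitor interval=10s \
      meta failure-timeout=180 migration-threshold=3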

The docs say failure-timeout is "How many seconds to wait before acting as if 
the failure had not occurred, and potentially allowing the resource back to 
the node on which it failed."

I think this should mean that if the resource fails and gets restarted, the 
fact that it failed will be "forgotten" after 180 seconds (or maybe a little 
longer, depending on exactly when the next cluster recheck is done).

However what I'm seeing is that if the resource fails and gets restarted, and 
this then happens an hour later, it's still counted as two failures.  If it 
fails and gets restarted another hour after that, it's recorded as three 
failures and (because I have "migration-threshold=3") it gets moved to another 
node (and therefore all the other resources in the group are moved as well).

So, what am I misunderstanding about "failure-timeout", and what configuration 
setting do I need to use to tell pacemaker that "provided the resource hasn't 
failed within the past X seconds, forget the fact that it failed more than X 
seconds ago"?


Thanks,


Antony.

-- 
The first fifty percent of an engineering project takes ninety percent of the 
time, and the remaining fifty percent takes another ninety percent of the time.

   Please reply to the list;
 please *don't* CC me.