Re: [ClusterLabs] cluster-recheck-interval and failure-timeout
On Wed, 2021-03-31 at 17:38 +0200, Antony Stone wrote:
> On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:
> > I'm only interested in the most recent failure. I'm saying that once
> > that failure is more than "failure-timeout" seconds old, I want the
> > fact that the resource failed to be forgotten, so that it can be
> > restarted or moved between nodes as normal, and not either be moved
> > to another node

Ah, then yes, that's how it works. I thought you wanted older failures
to expire as they aged, reducing the total failure count.

> > just because (a) there were two failures last Friday and then one
> > today, or (b) get stuck and not run on any nodes at all because all
> > three nodes had three failures sometime in the past month.
>
> I've just confirmed that this is working as expected on pacemaker
> 1.1.16 (Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).
>
> I have one cluster of 3 machines running pacemaker 1.1.16 and I have
> another cluster of 3 machines running pacemaker 2.0.1
>
> They are both running the same set of resources.
>
> I just deliberately killed the same resource on each cluster, and
> sure enough "crm status -f" on both told me it had a fail-count of 1,
> with a last-failure timestamp.
>
> I waited 5 minutes (well above my failure-timeout value) and asked
> for "crm status -f" again.
>
> On pacemaker 1.1.16 there was simply a list of resources; no mention
> of failures. Just what I want.
>
> On pacemaker 2.0.1 there was a list of resources plus a fail-count=1
> and a last-failure timestamp of 5 minutes earlier.

That sounds like a bug in the Debian port. I'm not aware of any
relevant bugs reported upstream.

> To be sure I'm not being impatient, I've left it an hour (I did this
> test earlier, while I was still trying to understand the timing
> interactions) and the fail-count does not go away.
>
> Does anyone have suggestions on how to debug this difference in
> behaviour between pacemaker 1.1.16 and 2.0.1, because at present it
> prevents me being able to upgrade an operational cluster, as the
> result is simply unusable.
>
> Thanks,
>
> Antony.
--
Ken Gaillot

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] cluster-recheck-interval and failure-timeout
On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:
> I'm only interested in the most recent failure. I'm saying that once
> that failure is more than "failure-timeout" seconds old, I want the
> fact that the resource failed to be forgotten, so that it can be
> restarted or moved between nodes as normal, and not either be moved to
> another node just because (a) there were two failures last Friday and
> then one today, or (b) get stuck and not run on any nodes at all
> because all three nodes had three failures sometime in the past month.

I've just confirmed that this is working as expected on pacemaker 1.1.16
(Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).

I have one cluster of 3 machines running pacemaker 1.1.16 and I have
another cluster of 3 machines running pacemaker 2.0.1

They are both running the same set of resources.

I just deliberately killed the same resource on each cluster, and sure
enough "crm status -f" on both told me it had a fail-count of 1, with a
last-failure timestamp.

I waited 5 minutes (well above my failure-timeout value) and asked for
"crm status -f" again.

On pacemaker 1.1.16 there was simply a list of resources; no mention of
failures. Just what I want.

On pacemaker 2.0.1 there was a list of resources plus a fail-count=1 and
a last-failure timestamp of 5 minutes earlier.

To be sure I'm not being impatient, I've left it an hour (I did this
test earlier, while I was still trying to understand the timing
interactions) and the fail-count does not go away.

Does anyone have suggestions on how to debug this difference in
behaviour between pacemaker 1.1.16 and 2.0.1, because at present it
prevents me being able to upgrade an operational cluster, as the result
is simply unusable.

Thanks,

Antony.

--
Perfection in design is achieved not when there is nothing left to add,
but rather when there is nothing left to take away.

 - Antoine de Saint-Exupery

Please reply to the list; please *don't* CC me.
Re: [ClusterLabs] cluster-recheck-interval and failure-timeout
On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:
> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that
> > "provided the resource hasn't failed within the past X seconds,
> > forget the fact that it failed more than X seconds ago"?
>
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old.

I've re-read the above sentence, and in fact you seem to be agreeing
with my expectation (which is not what happens).

> It's a bit counter-intuitive but currently, Pacemaker only remembers a
> resource's most recent failure and the total count of failures, and
> changing that would be a big project.

I'm only interested in the most recent failure. I'm saying that once
that failure is more than "failure-timeout" seconds old, I want the fact
that the resource failed to be forgotten, so that it can be restarted or
moved between nodes as normal, and not either be moved to another node
just because (a) there were two failures last Friday and then one today,
or (b) get stuck and not run on any nodes at all because all three nodes
had three failures sometime in the past month.

Thanks,

Antony.

--
The Magic Words are Squeamish Ossifrage.

Please reply to the list; please *don't* CC me.
Re: [ClusterLabs] cluster-recheck-interval and failure-timeout
On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:
> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that
> > "provided the resource hasn't failed within the past X seconds,
> > forget the fact that it failed more than X seconds ago"?
>
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old. It's a bit counter-intuitive but
> currently, Pacemaker only remembers a resource's most recent failure
> and the total count of failures, and changing that would be a big
> project.

So, are you saying that if a resource failed last Friday, and then again
on Saturday, but has been running perfectly happily ever since, a
failure today will trigger "that's it, we're moving it, it doesn't work
here"?

That seems bizarre. Surely the length of time a resource has been
running without problem should be taken into account when deciding
whether the node it's running on is fit to handle it or not?

My problem is also bigger than that - and I can't believe there isn't a
way round the following, otherwise people couldn't use pacemaker:

I have "migration-threshold=3" on most of my resources, and I have
three nodes. If a resource fails for the third time (in any period of
time) on a node, it gets moved (along with the rest in the group) to
another node.

The cluster does not forget that it failed and was moved away from the
first node, though. "crm status -f" confirms that to me.

If it then fails three times (in an hour, or a fortnight, whatever) on
the second node, it gets moved to node 3, and from that point on the
cluster thinks there's nowhere else to move it to, so another failure
means a total failure of the cluster.

There must be _something_ I'm doing wrong for the cluster to behave in
this way? I can't believe it's by design.

Regards,

Antony.
--
Anyone that's normal doesn't really achieve much.

 - Mark Blair, Australian rocket engineer

Please reply to the list; please *don't* CC me.
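[Editor's note: the cascade Antony describes can be sketched with a toy model. This is an illustration of the arithmetic only, not Pacemaker source code; the node names and function names are invented for the example.]

```python
# Toy model: migration-threshold with no working failure expiry.
# Each node accumulates a per-resource fail count that is never reset;
# once a node's count reaches migration-threshold, the resource can no
# longer run there.

MIGRATION_THRESHOLD = 3
nodes = ["node1", "node2", "node3"]
fail_count = {n: 0 for n in nodes}

def eligible_nodes():
    """Nodes whose fail count is still below migration-threshold."""
    return [n for n in nodes if fail_count[n] < MIGRATION_THRESHOLD]

def record_failure(node):
    fail_count[node] += 1

# The resource fails three times on each node in turn -- over any
# period of time, since the counts never expire.
for node in nodes:
    for _ in range(MIGRATION_THRESHOLD):
        record_failure(node)

# Now no node is eligible: total failure of the cluster for this resource.
print(eligible_nodes())  # -> []
```

Whatever the intervals between failures, the counts only ever grow, which is exactly the "nowhere else to move it to" end state described above.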
Re: [ClusterLabs] cluster-recheck-interval and failure-timeout
On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> Hi.
>
> I'm trying to understand what looks to me like incorrect behaviour
> between cluster-recheck-interval and failure-timeout, under pacemaker
> 2.0.1
>
> I have three machines in a corosync (3.0.1 if it matters) cluster,
> managing 12 resources in a single group.
>
> I'm following documentation from:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
>
> and
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html
>
> I have set a cluster property:
>
> cluster-recheck-interval=60s
>
> I have set a resource property:
>
> failure-timeout=180
>
> The docs say failure-timeout is "How many seconds to wait before
> acting as if the failure had not occurred, and potentially allowing
> the resource back to the node on which it failed."
>
> I think this should mean that if the resource fails and gets
> restarted, the fact that it failed will be "forgotten" after 180
> seconds (or maybe a little longer, depending on exactly when the next
> cluster recheck is done).
>
> However what I'm seeing is that if the resource fails and gets
> restarted, and this then happens an hour later, it's still counted as
> two failures. If it

That is exactly correct.

> fails and gets restarted another hour after that, it's recorded as
> three failures and (because I have "migration-threshold=3") it gets
> moved to another node (and therefore all the other resources in the
> group are moved as well).
>
> So, what am I misunderstanding about "failure-timeout", and what
> configuration setting do I need to use to tell pacemaker that
> "provided the resource hasn't failed within the past X seconds,
> forget the fact that it failed more than X seconds ago"?

Unfortunately, there is no way. failure-timeout expires *all* failures
once the *most recent* is that old.

It's a bit counter-intuitive but currently, Pacemaker only remembers a
resource's most recent failure and the total count of failures, and
changing that would be a big project.

> Thanks,
>
> Antony.
--
Ken Gaillot
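[Editor's note: the semantics Ken describes can be sketched as a toy model. This is an illustration only, not Pacemaker code; the function names and timestamps are invented for the example.]

```python
# Toy model of failure-timeout as described above: all failures expire
# together, and only once the *most recent* failure is older than
# failure-timeout seconds. A new failure therefore "renews" the whole
# accumulated count.

FAILURE_TIMEOUT = 180  # seconds

fail_count = 0
last_failure = None  # timestamp of the most recent failure

def record_failure(now):
    global fail_count, last_failure
    fail_count += 1
    last_failure = now

def recheck(now):
    """Runs roughly every cluster-recheck-interval seconds."""
    global fail_count, last_failure
    if last_failure is not None and now - last_failure > FAILURE_TIMEOUT:
        fail_count = 0       # *all* failures forgotten at once
        last_failure = None

record_failure(0)       # fail_count == 1
record_failure(3600)    # an hour later: fail_count == 2, not 1
recheck(3700)           # most recent failure only 100s old -> nothing expires
recheck(3781)           # most recent failure now 181s old -> everything expires
print(fail_count)       # -> 0
```

Note that the first failure is already an hour old at the second failure, yet still counts: only the age of the most recent failure matters, which is the counter-intuitive part.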
[ClusterLabs] cluster-recheck-interval and failure-timeout
Hi.

I'm trying to understand what looks to me like incorrect behaviour
between cluster-recheck-interval and failure-timeout, under pacemaker
2.0.1

I have three machines in a corosync (3.0.1 if it matters) cluster,
managing 12 resources in a single group.

I'm following documentation from:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html

and

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html

I have set a cluster property:

cluster-recheck-interval=60s

I have set a resource property:

failure-timeout=180

The docs say failure-timeout is "How many seconds to wait before acting
as if the failure had not occurred, and potentially allowing the
resource back to the node on which it failed."

I think this should mean that if the resource fails and gets restarted,
the fact that it failed will be "forgotten" after 180 seconds (or maybe
a little longer, depending on exactly when the next cluster recheck is
done).

However what I'm seeing is that if the resource fails and gets
restarted, and this then happens an hour later, it's still counted as
two failures. If it fails and gets restarted another hour after that,
it's recorded as three failures and (because I have
"migration-threshold=3") it gets moved to another node (and therefore
all the other resources in the group are moved as well).

So, what am I misunderstanding about "failure-timeout", and what
configuration setting do I need to use to tell pacemaker that "provided
the resource hasn't failed within the past X seconds, forget the fact
that it failed more than X seconds ago"?

Thanks,

Antony.

--
The first fifty percent of an engineering project takes ninety percent
of the time, and the remaining fifty percent takes another ninety
percent of the time.

Please reply to the list; please *don't* CC me.
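[Editor's note: the settings described above can be expressed in the crm shell roughly as follows. This is a sketch only; "my_resource" and the Dummy agent are placeholders, not the poster's actual configuration.]

```shell
# Cluster-wide property: how often Pacemaker re-evaluates cluster state
# (and hence how promptly an expired failure can actually be forgotten).
crm configure property cluster-recheck-interval=60s

# Per-resource meta attributes: forget failures after 180 seconds, and
# move the resource away after 3 failures on a node.
# "my_resource" and ocf:heartbeat:Dummy are placeholders for illustration.
crm configure primitive my_resource ocf:heartbeat:Dummy \
    meta failure-timeout=180 migration-threshold=3
```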