On 04/04/2017 01:18 AM, Ulrich Windl wrote:
>>>> Ken Gaillot <kgail...@redhat.com> wrote on 03.04.2017 at 17:00 in
>>>> message <ae3a7cf4-2ef7-4c4f-ae3f-39f473ed6...@redhat.com>:
>> Hi all,
>>
>> Pacemaker 1.1.17 will have a significant change in how it tracks
>> resource failures, though the change will be mostly invisible to users.
>>
>> Previously, Pacemaker tracked a single count of failures per resource --
>> for example, start failures and monitor failures for a given resource
>> were added together.
>
> That is "per resource operation", not "per resource" ;-)

I mean that there was only a single number to count failures for a given
resource; before this change, failures were not remembered separately by
operation.

>> In a thread on this list last year[1], we discussed adding some new
>> failure handling options that would require tracking failures for each
>> operation type.
>
> So the existing set of operation failures was restricted to
> start/stop/monitor? How about master/slave featuring two monitor operations?

No, both previously and with the new changes, all operation failures are
counted (well, except metadata!). The only change is whether they are
remembered per resource or per operation.

>> Pacemaker 1.1.17 will include this tracking, in preparation for adding
>> the new options in a future release.
>>
>> Whereas previously, failure counts were stored in node attributes like
>> "fail-count-myrsc", they will now be stored in multiple node attributes
>> like "fail-count-myrsc#start_0" and "fail-count-myrsc#monitor_10000"
>> (the number distinguishes monitors with different intervals).
>
> Wouldn't it be conceivable to store it as a (transient) resource attribute,
> either local to a node (LRM) or including the node attribute (CRM)?

Failures are specific to the node the failure occurred on, so it makes
sense to store them as transient node attributes.

So, to be more precise, we previously recorded failures per
node+resource combination, and now we record them per
node+resource+operation+interval combination.

>> Actual cluster behavior will be unchanged in this release (and
>> backward-compatible); the cluster will sum the per-operation fail counts
>> when checking against options such as migration-threshold.
>>
>> The part that will be visible to the user in this release is that the
>> crm_failcount and crm_resource --cleanup tools will now be able to
>> handle individual per-operation fail counts if desired, though by
>> default they will still affect the total fail count for the resource.
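
To make the naming scheme and the summing concrete, here is a minimal
sketch in Python. It is not Pacemaker's implementation: the helper names
(parse_fail_count_name, total_fail_counts) are hypothetical, the node
attributes are modeled as a plain dict, and treating the trailing number
as an interval in milliseconds is an assumption.

```python
from collections import defaultdict

PREFIX = "fail-count-"


def parse_fail_count_name(name):
    """Split a fail-count attribute name into (resource, operation, interval).

    Old-style names such as "fail-count-myrsc" yield (resource, None, None).
    The interval is assumed to be in milliseconds.
    """
    if not name.startswith(PREFIX):
        raise ValueError(f"not a fail-count attribute: {name!r}")
    resource, sep, op_key = name[len(PREFIX):].partition("#")
    if not sep:
        return resource, None, None
    operation, underscore, interval = op_key.rpartition("_")
    if not underscore:
        raise ValueError(f"missing interval in {name!r}")
    return resource, operation, int(interval)


def total_fail_counts(node_attrs):
    """Sum the per-operation counts on one node into one total per resource."""
    totals = defaultdict(int)
    for name, value in node_attrs.items():
        if name.startswith(PREFIX):
            resource, _operation, _interval = parse_fail_count_name(name)
            totals[resource] += int(value)
    return dict(totals)
```

With one start failure and one monitor failure recorded for "myrsc" on a
node, total_fail_counts() returns {"myrsc": 2} for that node, which is the
kind of total that gets compared against migration-threshold.
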
>
> Another thing to think about would be "fail count" vs. "fail rate": Currently
> there is a fail count, and some reset interval, which allows one to derive
> some failure rate from it. Maybe many users just have the requirement that
> some resource shouldn't fail again and again, but with long uptimes (and then
> the operator forgets to reset fail counters), occasional failures (like once
> in two weeks) shouldn't prevent a resource from running.

Yes, we discussed that a bit in the earlier thread. It would be too much
of an incompatible change and add considerable complexity to start
tracking the failure rate.

Failure clearing hasn't changed -- failures can only be cleared by
manual commands, the failure-timeout option, or a restart of cluster
services on a node.

For the example you mentioned, a high failure-timeout is the best answer
we have. You could set a failure-timeout of 24 hours, and if the
resource went 24 hours without any failures, any older failures would be
forgotten.

>> As an example, if "myrsc" has one start failure and one monitor failure,
>> "crm_failcount -r myrsc --query" will still show 2, but now you can also
>> say "crm_failcount -r myrsc --query --operation start" which will show 1.
>
> Would accumulated monitor failures ever prevent a resource from starting, or
> will it force a stop of the resource?

As of this release, failure recovery behavior has not changed. All
operation failures are added together to produce a single fail count per
resource, as was recorded before. The only thing that changed is how
they're recorded.

Failure recovery is controlled by the resource's migration-threshold and
the operation's on-fail. By default, on-fail=restart and
migration-threshold=INFINITY, so repeated monitor failures would result
in 1,000,000 restarts before the resource was banned from the failing
node.
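
As a rough model of how failure-timeout and migration-threshold interact
with the summed fail count, here is a short Python sketch. It is a
simplification under stated assumptions, not the cluster's actual logic:
the function names are hypothetical, times are plain seconds, and the
">=" comparison against the threshold is assumed.

```python
INFINITY = 1_000_000  # numeric value of INFINITY, per the message above


def effective_fail_count(fail_count, last_failure, now, failure_timeout=None):
    """Return the fail count that still applies at time `now` (seconds).

    Assumption: once failure_timeout seconds have passed since the most
    recent failure, all older failures are forgotten. With no
    failure_timeout set, failures never expire on their own.
    """
    if failure_timeout is not None and now - last_failure >= failure_timeout:
        return 0
    return fail_count


def is_banned(fail_count, migration_threshold=INFINITY):
    """True once the summed fail count reaches the threshold (assumed >=)."""
    return fail_count >= migration_threshold
```

For example, with failure_timeout set to 24 * 3600, a resource whose last
failure was a full day ago has an effective fail count of 0 in this model,
while without failure-timeout the old count is kept until someone clears
it.
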

> Regards,
> Ulrich
>
>>
>> Additionally, crm_failcount --delete previously only reset the
>> resource's fail count, but it now behaves identically to crm_resource
>> --cleanup (resetting the fail count and clearing the failure history).
>>
>> Special note for pgsql users: Older versions of common pgsql resource
>> agents relied on a behavior of crm_failcount that is now rejected. While
>> the impact is limited, users are recommended to make sure they have the
>> latest version of their pgsql resource agent before upgrading to
>> pacemaker 1.1.17.
>>
>> [1] http://lists.clusterlabs.org/pipermail/users/2016-September/004096.html

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org