Re: [ClusterLabs] Clearing failed actions
On Mon, 2018-07-09 at 09:11 +0200, Jehan-Guillaume de Rorthais wrote:
> On Fri, 06 Jul 2018 10:15:08 -0600 Casey Allen Shobe wrote:
> > Hi,
> >
> > I found a web page which suggested using `crm_resource -P` to clear the
> > Failed Actions. Although this appears to work, it's not documented in the
> > man page at all. Is this deprecated, and is there a more correct way to
> > be doing this?
>
> -P means "reprobe", so I guess this is a side effect or a prerequisite,
> but not only to clean failcounts.

In the 1.1 series, -P is a deprecated synonym for --cleanup / -C. The options clear fail counts and resource operation history (for a specific resource and/or node if specified with -r and/or -N, otherwise all).

In the 2.0 series, -P is gone. --refresh / -R now does what cleanup used to; --cleanup / -C now cleans up only resources that have had failures. In other words, the old --cleanup and the new --refresh clean resource history, forcing a re-probe, regardless of whether a resource failed or not, whereas the new --cleanup will skip resources that didn't have failures.

> > Also, is there a way to clear one specific item from the list, or is
> > clearing all the only option?
>
> pcs failcount reset [node]

With the low-level tools, you can use -r / --resource and/or -N / --node with crm_resource to limit the clean-up.
--
Ken Gaillot

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
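The 1.1-vs-2.0 option split Ken describes can be sketched as follows (these commands need a live Pacemaker cluster; `myrsc` and `node1` are placeholder names, not from the thread):

```shell
# Pacemaker 1.1 series: -C / --cleanup (and its deprecated synonym -P)
# clears fail counts and operation history, forcing a re-probe.
crm_resource --cleanup -r myrsc -N node1   # limit to one resource on one node
crm_resource --cleanup                     # everything

# Pacemaker 2.0 series: --refresh is the old --cleanup,
# while --cleanup now only touches resources that actually failed.
crm_resource --refresh -r myrsc -N node1   # wipe history, force re-probe
crm_resource --cleanup -r myrsc            # no-op unless myrsc had failures
```

So on 2.0, use --refresh when you want the unconditional old behavior, and --cleanup when you only want failure records gone.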
Re: [ClusterLabs] Clearing failed actions
On Fri, 06 Jul 2018 10:15:08 -0600 Casey Allen Shobe wrote:
> Hi,
>
> I found a web page which suggested using `crm_resource -P` to clear the
> Failed Actions. Although this appears to work, it's not documented in the
> man page at all. Is this deprecated, and is there a more correct way to be
> doing this?

-P means "reprobe", so I guess this is a side effect or a prerequisite, but not only to clean failcounts.

> Also, is there a way to clear one specific item from the list, or is
> clearing all the only option?

pcs failcount reset [node]
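With pcs, the reset can likewise be scoped to a single resource and node (a sketch requiring a live cluster; `myrsc` and `node1` are placeholders, and subcommand wording varies slightly between pcs versions):

```shell
# Show the current fail counts for one resource
pcs resource failcount show myrsc

# Reset the fail count for myrsc on one node only
pcs resource failcount reset myrsc node1
```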
[ClusterLabs] Clearing failed actions
Hi,

I found a web page which suggested using `crm_resource -P` to clear the Failed Actions. Although this appears to work, it's not documented in the man page at all. Is this deprecated, and is there a more correct way to be doing this?

Also, is there a way to clear one specific item from the list, or is clearing all the only option?

Thank you in advance for any advice,
--
Casey
[ClusterLabs] Clearing "Failed Actions"
I found a random web page which suggested using `crm_resource -P` to clear the Failed Actions. Although this appears to work, it's not documented in the man page at all. Is this deprecated, and is there a more correct way to be doing this?

Cheers,
--
Casey
Re: [ClusterLabs] clearing failed actions
On 06/19/2017 04:54 PM, Attila Megyeri wrote:
> One more thing to add.
>
> Two almost identical clusters, with the identical asterisk primitive,
> produce different crm_verify output: on one cluster it returns no
> warnings, whereas the other one complains.
>
> On the problematic one:
>
> crm_verify --live-check -VV
> warning: get_failcount_full: Setting asterisk.failure_timeout=120 in
> asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
> Warnings found during check: config may not be valid
>
> The relevant primitive is the same in both clusters:
>
> primitive asterisk ocf:heartbeat:asterisk \
>     op monitor interval="10s" timeout="45s" on-fail="restart" \
>     op start interval="0" timeout="60s" on-fail="standby" \
>     op stop interval="0" timeout="60s" on-fail="block" \
>     meta migration-threshold="3" failure-timeout="2m"
>
> Why is the same configuration valid in one cluster but not in the other?
> Shall I simply omit the "op stop" line?
>
> thanks :)
> Attila

Ah, that could explain it. If a failure occurs when on-fail=block applies, the resource's failure timeout is disabled. This is partly because the point of on-fail=block is to allow the administrator to investigate and manually clear the error, and partly because blocking means nothing was done to recover the resource, so the failure is likely still present (clearing it would make on-fail=block similar to on-fail=ignore).

The failure timeout should be ignored only if there's an actual error to be handled by on-fail=block, which in this case would mean a stop failure. That could explain why it's valid in one cluster, if there are no stop failures there.

Stop failures default to block when fencing is disabled, because fencing is the only way to recover from a stop failure. Configuring fencing and using on-fail=fence for stop would avoid the issue.
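That suggestion could look like the following variant of the primitive above (a sketch only; it assumes a working STONITH device is configured and stonith-enabled=true, which the thread's test clusters did not have):

```
primitive asterisk ocf:heartbeat:asterisk \
    op monitor interval="10s" timeout="45s" on-fail="restart" \
    op start interval="0" timeout="60s" on-fail="standby" \
    op stop interval="0" timeout="60s" on-fail="fence" \
    meta migration-threshold="3" failure-timeout="2m"
```

With on-fail="fence", a failed stop fences the node and recovers the resource elsewhere, so the failure-timeout no longer conflicts with the stop operation.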
A future version of Pacemaker will allow specifying the failure timeout separately for different operations, which would allow you to set a failure timeout of 0 on stop and 1m on everything else. But that work hasn't started yet.

>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>> Sent: Monday, June 19, 2017 9:47 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> ; kgail...@redhat.com
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> I did another experiment, even simpler.
>>
>> Created one node, one resource, using pacemaker 1.1.14 on ubuntu.
>>
>> Configured failcount to 1, migration threshold to 2, failure timeout to 1 minute.
>>
>> crm_mon:
>>
>> Last updated: Mon Jun 19 19:43:41 2017    Last change: Mon Jun 19 19:37:09 2017 by root via cibadmin on test
>> Stack: corosync
>> Current DC: test (version 1.1.14-70404b0) - partition with quorum
>> 1 node and 1 resource configured
>>
>> Online: [ test ]
>>
>> db-ip-master (ocf::heartbeat:IPaddr2): Started test
>>
>> Node Attributes:
>> * Node test:
>>
>> Migration Summary:
>> * Node test:
>>    db-ip-master: migration-threshold=2 fail-count=1
>>
>> crm verify:
>>
>> crm_verify --live-check -
>> info: validate_with_relaxng: Creating RNG parser context
>> info: determine_online_status: Node test is online
>> info: get_failcount_full: db-ip-master has failed 1 times on test
>> info: get_failcount_full: db-ip-master has failed 1 times on test
>> info: get_failcount_full: db-ip-master has failed 1 times on test
>> info: get_failcount_full: db-ip-master has failed 1 times on test
>> info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
>> info: get_failcount_full: db-ip-master has failed 1 times on test
>> info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off
>> info: LogActions: Leave db-ip-master (Started test)
>>
>> crm configure is:
>>
>> node 168362242: test \
>>     attributes standby=off
>> primitive db-ip-master IPaddr2 \
>>     params lvs_support=true ip=10.9.1.10 cidr_netmask=24 broadcast=10.9.1.255 \
>>     op start interval=0 timeout=20s on-fail=restart \
>>     op monitor interval=20s timeout=20s \
>>     op stop interval=0 timeout=20s on-fail=block \
>>     meta migration-threshold=2 failure-timeout=1m target-role=Started
>> location loc1 db-ip-
Re: [ClusterLabs] clearing failed actions
One more thing to add.

Two almost identical clusters, with the identical asterisk primitive, produce different crm_verify output: on one cluster it returns no warnings, whereas the other one complains.

On the problematic one:

crm_verify --live-check -VV
warning: get_failcount_full: Setting asterisk.failure_timeout=120 in asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
Warnings found during check: config may not be valid

The relevant primitive is the same in both clusters:

primitive asterisk ocf:heartbeat:asterisk \
    op monitor interval="10s" timeout="45s" on-fail="restart" \
    op start interval="0" timeout="60s" on-fail="standby" \
    op stop interval="0" timeout="60s" on-fail="block" \
    meta migration-threshold="3" failure-timeout="2m"

Why is the same configuration valid in one cluster but not in the other? Shall I simply omit the "op stop" line?

thanks :)
Attila

> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Monday, June 19, 2017 9:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; kgail...@redhat.com
> Subject: Re: [ClusterLabs] clearing failed actions
>
> I did another experiment, even simpler.
>
> Created one node, one resource, using pacemaker 1.1.14 on ubuntu.
>
> Configured failcount to 1, migration threshold to 2, failure timeout to 1 minute.
>
> crm_mon:
>
> Last updated: Mon Jun 19 19:43:41 2017    Last change: Mon Jun 19 19:37:09 2017 by root via cibadmin on test
> Stack: corosync
> Current DC: test (version 1.1.14-70404b0) - partition with quorum
> 1 node and 1 resource configured
>
> Online: [ test ]
>
> db-ip-master (ocf::heartbeat:IPaddr2): Started test
>
> Node Attributes:
> * Node test:
>
> Migration Summary:
> * Node test:
>    db-ip-master: migration-threshold=2 fail-count=1
>
> crm verify:
>
> crm_verify --live-check -
> info: validate_with_relaxng: Creating RNG parser context
> info: determine_online_status: Node test is online
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
> info: get_failcount_full: db-ip-master has failed 1 times on test
> info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off
> info: LogActions: Leave db-ip-master (Started test)
>
> crm configure is:
>
> node 168362242: test \
>     attributes standby=off
> primitive db-ip-master IPaddr2 \
>     params lvs_support=true ip=10.9.1.10 cidr_netmask=24 broadcast=10.9.1.255 \
>     op start interval=0 timeout=20s on-fail=restart \
>     op monitor interval=20s timeout=20s \
>     op stop interval=0 timeout=20s on-fail=block \
>     meta migration-threshold=2 failure-timeout=1m target-role=Started
> location loc1 db-ip-master 0: test
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.14-70404b0 \
>     cluster-infrastructure=corosync \
>     stonith-enabled=false \
>     cluster-recheck-interval=30s \
>     symmetric-cluster=false
>
> Corosync log:
>
> Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 19 19:45:07 [330] test pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
> Jun 19 19:45:07 [330] test pengine: info: determine_online_status: Node test is online
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-ma
Re: [ClusterLabs] clearing failed actions
I did another experiment, even simpler.

Created one node, one resource, using pacemaker 1.1.14 on ubuntu.

Configured failcount to 1, migration threshold to 2, failure timeout to 1 minute.

crm_mon:

Last updated: Mon Jun 19 19:43:41 2017    Last change: Mon Jun 19 19:37:09 2017 by root via cibadmin on test
Stack: corosync
Current DC: test (version 1.1.14-70404b0) - partition with quorum
1 node and 1 resource configured

Online: [ test ]

db-ip-master (ocf::heartbeat:IPaddr2): Started test

Node Attributes:
* Node test:

Migration Summary:
* Node test:
   db-ip-master: migration-threshold=2 fail-count=1

crm verify:

crm_verify --live-check -
info: validate_with_relaxng: Creating RNG parser context
info: determine_online_status: Node test is online
info: get_failcount_full: db-ip-master has failed 1 times on test
info: get_failcount_full: db-ip-master has failed 1 times on test
info: get_failcount_full: db-ip-master has failed 1 times on test
info: get_failcount_full: db-ip-master has failed 1 times on test
info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
info: get_failcount_full: db-ip-master has failed 1 times on test
info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off
info: LogActions: Leave db-ip-master (Started test)

crm configure is:

node 168362242: test \
    attributes standby=off
primitive db-ip-master IPaddr2 \
    params lvs_support=true ip=10.9.1.10 cidr_netmask=24 broadcast=10.9.1.255 \
    op start interval=0 timeout=20s on-fail=restart \
    op monitor interval=20s timeout=20s \
    op stop interval=0 timeout=20s on-fail=block \
    meta migration-threshold=2 failure-timeout=1m target-role=Started
location loc1 db-ip-master 0: test
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=1.1.14-70404b0 \
    cluster-infrastructure=corosync \
    stonith-enabled=false \
    cluster-recheck-interval=30s \
    symmetric-cluster=false

Corosync log:

Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 19 19:45:07 [330] test pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Jun 19 19:45:07 [330] test pengine: info: determine_online_status: Node test is online
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] test pengine: info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off
Jun 19 19:45:07 [330] test pengine: info: LogActions: Leave db-ip-master (Started test)
Jun 19 19:45:07 [330] test pengine: notice: process_pe_message: Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-6.bz2
Jun 19 19:45:07 [331] test crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 19:45:07 [331] test crmd: notice: run_graph: Transition 34 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
Jun 19 19:45:07 [331] test crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

I hope someone can help me figure this out :)

Thanks!

> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Monday, June 19, 2017 7:45 PM
> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> clustering welcomed
> Subject: Re: [ClusterLabs] clearing failed actions
>
> Hi Ken,
>
> /sorry for the long text/
>
> I have created a relatively simple setup to localize the issue.
> Three nodes, no fencing, just a master/slave mysql with two virtual IPs.
> Just as a reminder, my primary issue is, that on clu
Re: [ClusterLabs] clearing failed actions
] ctmgr crmd: debug: do_state_transition: Starting PEngine Recheck Timer
Jun 19 17:37:06 [18998] ctmgr crmd: debug: crm_timer_start: Started PEngine Recheck Timer (I_PE_CALC:3ms), src=277

As you can see from the logs, pacemaker does not even try to re-monitor the resource that had a failure, or at least I'm not seeing it. The cluster recheck interval is set to 30 seconds for troubleshooting reasons.

If I execute a

crm resource cleanup db-ip-master

the failure is removed.

Now, am I getting something terribly wrong here? Or is this simply a bug in 1.1.10?

Thanks,
Attila

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Wednesday, June 7, 2017 10:14 PM
> To: Attila Megyeri ; Cluster Labs - All topics related to open-source
> clustering welcomed
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > Ken,
> >
> > I noticed something strange; this might be the issue.
> >
> > In some cases, even the manual cleanup does not work.
> >
> > I have a failed action of resource "A" on node "a". DC is node "b".
> >
> > e.g.
> > Failed actions:
> >     jboss_imssrv1_monitor_1 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017
> >
> > When I attempt to do a "crm resource cleanup A" from node "b", nothing
> > happens. Basically the lrmd on "a" is not notified that it should monitor
> > the resource.
> >
> > When I execute a "crm resource cleanup A" command on node "a" (where the
> > operation failed), the failed action is cleared properly.
> >
> > Why could this be happening?
> > Which component should be responsible for this? pengine, crmd, lrmd?
>
> The crm shell will send commands to attrd (to clear fail counts) and crmd
> (to clear the resource history), which in turn will record changes in the
> cib.
>
> I'm not sure how crm shell implements it, but crm_resource sends individual
> messages to each node when cleaning up a resource without specifying a
> particular node. You could check the pacemaker log on each node to see
> whether attrd and crmd are receiving those commands, and what they do in
> response.
>
> >> -----Original Message-----
> >> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> >> Sent: Thursday, June 1, 2017 6:57 PM
> >> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> >> clustering welcomed
> >> Subject: Re: [ClusterLabs] clearing failed actions
> >>
> >> thanks Ken,
> >>
> >>> -----Original Message-----
> >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>> Sent: Thursday, June 1, 2017 12:04 AM
> >>> To: users@clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >>>>> Hi Ken,
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>>>>> To: users@clusterlabs.org
> >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>>>>
> >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Shouldn't the
> >>>>>>>
> >>>>>>> cluster-recheck-interval="2m"
> >>>>>>>
> >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
> >>>>>>> and clean the failcounts?
> >>>>>>
> >>>>>> It instructs pacemaker to recalculate whether any actions need to be
> >>>>>> taken (including expiring any failcounts appropriately).
> >>>>>>
> >>>>>>> At the primitive level I also have a
> >>>>>>>
> >>>>
Re: [ClusterLabs] clearing failed actions
On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> Ken,
>
> I noticed something strange; this might be the issue.
>
> In some cases, even the manual cleanup does not work.
>
> I have a failed action of resource "A" on node "a". DC is node "b".
>
> e.g.
> Failed actions:
>     jboss_imssrv1_monitor_1 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017
>
> When I attempt to do a "crm resource cleanup A" from node "b", nothing
> happens. Basically the lrmd on "a" is not notified that it should monitor
> the resource.
>
> When I execute a "crm resource cleanup A" command on node "a" (where the
> operation failed), the failed action is cleared properly.
>
> Why could this be happening?
> Which component should be responsible for this? pengine, crmd, lrmd?

The crm shell will send commands to attrd (to clear fail counts) and crmd (to clear the resource history), which in turn will record changes in the cib.

I'm not sure how crm shell implements it, but crm_resource sends individual messages to each node when cleaning up a resource without specifying a particular node. You could check the pacemaker log on each node to see whether attrd and crmd are receiving those commands, and what they do in response.
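When cleanup only seems to work when run locally, one low-level workaround is to target the failing node explicitly (a sketch requiring a live cluster; the resource name "A" and node name "a" follow the example above, and crm_failcount option spellings vary between Pacemaker versions):

```shell
# Clear fail count and operation history for resource A on node a only,
# regardless of which node the command is run from.
crm_resource --cleanup -r A -N a

# Query the remaining fail count afterwards to confirm it was cleared.
crm_failcount -G -r A -N a
```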
>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>> Sent: Thursday, June 1, 2017 6:57 PM
>> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
>> clustering welcomed
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> thanks Ken,
>>
>>> -----Original Message-----
>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>>> Sent: Thursday, June 1, 2017 12:04 AM
>>> To: users@clusterlabs.org
>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>
>>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
>>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>>>>> Hi Ken,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>>>>> To: users@clusterlabs.org
>>>>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>>>>
>>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Shouldn't the
>>>>>>>
>>>>>>> cluster-recheck-interval="2m"
>>>>>>>
>>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
>>>>>>> and clean the failcounts?
>>>>>>
>>>>>> It instructs pacemaker to recalculate whether any actions need to be
>>>>>> taken (including expiring any failcounts appropriately).
>>>>>>
>>>>>>> At the primitive level I also have a
>>>>>>>
>>>>>>> migration-threshold="30" failure-timeout="2m"
>>>>>>>
>>>>>>> but whenever I have a failure, it remains there forever.
>>>>>>>
>>>>>>> What could be causing this?
>>>>>>>
>>>>>>> thanks,
>>>>>>> Attila
>>>>>>
>>>>>> Is it a single old failure, or a recurring failure? The failure timeout
>>>>>> works in a somewhat nonintuitive way. Old failures are not individually
>>>>>> expired. Instead, all failures of a resource are simultaneously cleared
>>>>>> if all of them are older than the failure-timeout. So if something keeps
>>>>>> failing repeatedly (more frequently than the failure-timeout), none of
>>>>>> the failures will be cleared.
>>>>>>
>>>>>> If it's not a repeating failure, something odd is going on.
>>>>>
Re: [ClusterLabs] clearing failed actions
Ken,

I noticed something strange; this might be the issue.

In some cases, even the manual cleanup does not work.

I have a failed action of resource "A" on node "a". DC is node "b".

e.g.
Failed actions:
    jboss_imssrv1_monitor_1 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017

When I attempt to do a "crm resource cleanup A" from node "b", nothing happens. Basically the lrmd on "a" is not notified that it should monitor the resource.

When I execute a "crm resource cleanup A" command on node "a" (where the operation failed), the failed action is cleared properly.

Why could this be happening?
Which component should be responsible for this? pengine, crmd, lrmd?

> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Thursday, June 1, 2017 6:57 PM
> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> clustering welcomed
> Subject: Re: [ClusterLabs] clearing failed actions
>
> thanks Ken,
>
>> -----Original Message-----
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Thursday, June 1, 2017 12:04 AM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>>>> Hi Ken,
>>>>
>>>>> -----Original Message-----
>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>>>> To: users@clusterlabs.org
>>>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>>>
>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Shouldn't the
>>>>>>
>>>>>> cluster-recheck-interval="2m"
>>>>>>
>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
>>>>>> and clean the failcounts?
>>>>>
>>>>> It instructs pacemaker to recalculate whether any actions need to be
>>>>> taken (including expiring any failcounts appropriately).
>>>>>
>>>>>> At the primitive level I also have a
>>>>>>
>>>>>> migration-threshold="30" failure-timeout="2m"
>>>>>>
>>>>>> but whenever I have a failure, it remains there forever.
>>>>>>
>>>>>> What could be causing this?
>>>>>>
>>>>>> thanks,
>>>>>> Attila
>>>>>
>>>>> Is it a single old failure, or a recurring failure? The failure timeout
>>>>> works in a somewhat nonintuitive way. Old failures are not individually
>>>>> expired. Instead, all failures of a resource are simultaneously cleared
>>>>> if all of them are older than the failure-timeout. So if something keeps
>>>>> failing repeatedly (more frequently than the failure-timeout), none of
>>>>> the failures will be cleared.
>>>>>
>>>>> If it's not a repeating failure, something odd is going on.
>>>>
>>>> It is not a repeating failure. Let's say that a resource fails for
>>>> whatever action; it will remain in the failed actions (crm_mon -Af)
>>>> until I issue a "crm resource cleanup". Even after days or weeks, even
>>>> though I see in the logs that the cluster is rechecked every 120 seconds.
>>>>
>>>> How could I troubleshoot this issue?
>>>>
>>>> thanks!
>>>
>>> Ah, I see what you're saying. That's expected behavior.
>>>
>>> The failure-timeout applies to the failure *count* (which is used for
>>> checking against migration-threshold), not the failure *history* (which
>>> is used
Re: [ClusterLabs] clearing failed actions
thanks Ken,

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, June 1, 2017 12:04 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>>> Hi Ken,
>>>
>>>> -----Original Message-----
>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>>> To: users@clusterlabs.org
>>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>>
>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>>>> Hi,
>>>>>
>>>>> Shouldn't the
>>>>>
>>>>> cluster-recheck-interval="2m"
>>>>>
>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
>>>>> and clean the failcounts?
>>>>
>>>> It instructs pacemaker to recalculate whether any actions need to be
>>>> taken (including expiring any failcounts appropriately).
>>>>
>>>>> At the primitive level I also have a
>>>>>
>>>>> migration-threshold="30" failure-timeout="2m"
>>>>>
>>>>> but whenever I have a failure, it remains there forever.
>>>>>
>>>>> What could be causing this?
>>>>>
>>>>> thanks,
>>>>> Attila
>>>>
>>>> Is it a single old failure, or a recurring failure? The failure timeout
>>>> works in a somewhat nonintuitive way. Old failures are not individually
>>>> expired. Instead, all failures of a resource are simultaneously cleared
>>>> if all of them are older than the failure-timeout. So if something keeps
>>>> failing repeatedly (more frequently than the failure-timeout), none of
>>>> the failures will be cleared.
>>>>
>>>> If it's not a repeating failure, something odd is going on.
>>>
>>> It is not a repeating failure. Let's say that a resource fails for
>>> whatever action; it will remain in the failed actions (crm_mon -Af)
>>> until I issue a "crm resource cleanup". Even after days or weeks, even
>>> though I see in the logs that the cluster is rechecked every 120 seconds.
>>>
>>> How could I troubleshoot this issue?
>>>
>>> thanks!
>>
>> Ah, I see what you're saying. That's expected behavior.
>>
>> The failure-timeout applies to the failure *count* (which is used for
>> checking against migration-threshold), not the failure *history* (which
>> is used for the status display).
>>
>> The idea is to have it no longer affect the cluster behavior, but still
>> allow an administrator to know that it happened. That's why a manual
>> cleanup is required to clear the history.
>
> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> used for status display.
>
> It works with the current versions. It's possible 1.1.10 had issues with
> that.

Well, if nothing helps, I will try to upgrade to a more recent version.

> Check the status to see which node is DC, and look at the pacemaker log
> there after the failure occurred. There should be a message about the
> failcount expiring. You can also look at the live CIB and search for
> last_failure to see what is used for the display.

[AM] In the pacemaker log I see the following lines at every recheck interval:

Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)

If I check the CIB for the failure I see:

Really have no clue why this isn't cleared...
Re: [ClusterLabs] clearing failed actions
On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
>> Hi Ken,
>>
>>> -----Original Message-----
>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>>> Sent: Tuesday, May 30, 2017 4:32 PM
>>> To: users@clusterlabs.org
>>> Subject: Re: [ClusterLabs] clearing failed actions
>>>
>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>>> Hi,
>>>>
>>>> Shouldn't the
>>>>
>>>> cluster-recheck-interval="2m"
>>>>
>>>> property instruct pacemaker to recheck the cluster every 2 minutes
>>>> and clean the failcounts?
>>>
>>> It instructs pacemaker to recalculate whether any actions need to be
>>> taken (including expiring any failcounts appropriately).
>>>
>>>> At the primitive level I also have a
>>>>
>>>> migration-threshold="30" failure-timeout="2m"
>>>>
>>>> but whenever I have a failure, it remains there forever.
>>>>
>>>> What could be causing this?
>>>>
>>>> thanks,
>>>> Attila
>>>
>>> Is it a single old failure, or a recurring failure? The failure
>>> timeout works in a somewhat nonintuitive way. Old failures are not
>>> individually expired. Instead, all failures of a resource are
>>> simultaneously cleared if all of them are older than the
>>> failure-timeout. So if something keeps failing repeatedly (more
>>> frequently than the failure-timeout), none of the failures will be
>>> cleared.
>>>
>>> If it's not a repeating failure, something odd is going on.
>>
>> It is not a repeating failure. Let's say that a resource fails for
>> whatever action, it will remain in the failed actions (crm_mon -Af)
>> until I issue a "crm resource cleanup ". Even after days or weeks,
>> even though I see in the logs that the cluster is rechecked every 120
>> seconds.
>>
>> How could I troubleshoot this issue?
>>
>> thanks!
>
> Ah, I see what you're saying. That's expected behavior.
>
> The failure-timeout applies to the failure *count* (which is used for
> checking against migration-threshold), not the failure *history*
> (which is used for the status display).
>
> The idea is to have it no longer affect the cluster behavior, but
> still allow an administrator to know that it happened. That's why a
> manual cleanup is required to clear the history.

Hmm, I'm wrong there ... failure-timeout does expire the failure history
used for status display.

It works with the current versions. It's possible 1.1.10 had issues with
that.

Check the status to see which node is DC, and look at the pacemaker log
there after the failure occurred. There should be a message about the
failcount expiring. You can also look at the live CIB and search for
last_failure to see what is used for the display.
Re: [ClusterLabs] clearing failed actions
On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> Hi Ken,
>
>> -----Original Message-----
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Tuesday, May 30, 2017 4:32 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] clearing failed actions
>>
>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>>> Hi,
>>>
>>> Shouldn't the
>>>
>>> cluster-recheck-interval="2m"
>>>
>>> property instruct pacemaker to recheck the cluster every 2 minutes
>>> and clean the failcounts?
>>
>> It instructs pacemaker to recalculate whether any actions need to be
>> taken (including expiring any failcounts appropriately).
>>
>>> At the primitive level I also have a
>>>
>>> migration-threshold="30" failure-timeout="2m"
>>>
>>> but whenever I have a failure, it remains there forever.
>>>
>>> What could be causing this?
>>>
>>> thanks,
>>> Attila
>>
>> Is it a single old failure, or a recurring failure? The failure
>> timeout works in a somewhat nonintuitive way. Old failures are not
>> individually expired. Instead, all failures of a resource are
>> simultaneously cleared if all of them are older than the
>> failure-timeout. So if something keeps failing repeatedly (more
>> frequently than the failure-timeout), none of the failures will be
>> cleared.
>>
>> If it's not a repeating failure, something odd is going on.
>
> It is not a repeating failure. Let's say that a resource fails for
> whatever action, it will remain in the failed actions (crm_mon -Af)
> until I issue a "crm resource cleanup ". Even after days or weeks,
> even though I see in the logs that the cluster is rechecked every 120
> seconds.
>
> How could I troubleshoot this issue?
>
> thanks!

Ah, I see what you're saying. That's expected behavior.

The failure-timeout applies to the failure *count* (which is used for
checking against migration-threshold), not the failure *history* (which
is used for the status display).

The idea is to have it no longer affect the cluster behavior, but still
allow an administrator to know that it happened. That's why a manual
cleanup is required to clear the history.
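The manual cleanup described here can be issued in several equivalent
ways; a sketch (the resource and node names "jboss_admin2" and
"ctadmin2" are taken from the log excerpt elsewhere in this thread):

```shell
# Clear failure history and fail counts for one resource on one node
crm_resource --cleanup -r jboss_admin2 -N ctadmin2

# Equivalent higher-level commands with crmsh or pcs
crm resource cleanup jboss_admin2
pcs resource cleanup jboss_admin2
```

Omitting -r and -N cleans up all resources on all nodes.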
Re: [ClusterLabs] clearing failed actions
Hi Ken,

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, May 30, 2017 4:32 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
>> Hi,
>>
>> Shouldn't the
>>
>> cluster-recheck-interval="2m"
>>
>> property instruct pacemaker to recheck the cluster every 2 minutes
>> and clean the failcounts?
>
> It instructs pacemaker to recalculate whether any actions need to be
> taken (including expiring any failcounts appropriately).
>
>> At the primitive level I also have a
>>
>> migration-threshold="30" failure-timeout="2m"
>>
>> but whenever I have a failure, it remains there forever.
>>
>> What could be causing this?
>>
>> thanks,
>> Attila
>
> Is it a single old failure, or a recurring failure? The failure
> timeout works in a somewhat nonintuitive way. Old failures are not
> individually expired. Instead, all failures of a resource are
> simultaneously cleared if all of them are older than the
> failure-timeout. So if something keeps failing repeatedly (more
> frequently than the failure-timeout), none of the failures will be
> cleared.
>
> If it's not a repeating failure, something odd is going on.

It is not a repeating failure. Let's say that a resource fails for
whatever action, it will remain in the failed actions (crm_mon -Af)
until I issue a "crm resource cleanup ". Even after days or weeks, even
though I see in the logs that the cluster is rechecked every 120 seconds.

How could I troubleshoot this issue?

thanks!
Re: [ClusterLabs] clearing failed actions
On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> Hi,
>
> Shouldn't the
>
> cluster-recheck-interval="2m"
>
> property instruct pacemaker to recheck the cluster every 2 minutes and
> clean the failcounts?

It instructs pacemaker to recalculate whether any actions need to be
taken (including expiring any failcounts appropriately).

> At the primitive level I also have a
>
> migration-threshold="30" failure-timeout="2m"
>
> but whenever I have a failure, it remains there forever.
>
> What could be causing this?
>
> thanks,
> Attila

Is it a single old failure, or a recurring failure? The failure timeout
works in a somewhat nonintuitive way. Old failures are not individually
expired. Instead, all failures of a resource are simultaneously cleared
if all of them are older than the failure-timeout. So if something keeps
failing repeatedly (more frequently than the failure-timeout), none of
the failures will be cleared.

If it's not a repeating failure, something odd is going on.
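The all-or-nothing expiry rule described above can be illustrated with a
small shell sketch. This is purely a model of the decision logic, not an
actual Pacemaker interface, and the timestamps are made-up numbers:

```shell
#!/bin/sh
# Model of failure-timeout expiry: the failure history of a resource is
# cleared only if EVERY recorded failure is older than failure-timeout.
# Usage: expires NOW TIMEOUT FAILURE_TS...  (all in seconds, hypothetical)
expires() {
    now=$1; timeout=$2; shift 2
    for ts in "$@"; do
        # A single failure newer than the timeout keeps the whole
        # history (and the status display) alive
        if [ $((now - ts)) -lt "$timeout" ]; then
            echo "kept"
            return 0
        fi
    done
    echo "cleared"
}

expires 1000 120 700 750 800   # all failures old -> cleared
expires 1000 120 700 950       # one recent failure -> kept
```

This is why a resource that keeps failing more often than its
failure-timeout never has its history expire.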
[ClusterLabs] clearing failed actions
Hi,

Shouldn't the

cluster-recheck-interval="2m"

property instruct pacemaker to recheck the cluster every 2 minutes and
clean the failcounts?

At the primitive level I also have a

migration-threshold="30" failure-timeout="2m"

but whenever I have a failure, it remains there forever.

What could be causing this?

thanks,
Attila
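For reference, properties like those in the question are typically set
with crmsh or pcs; a hedged sketch (the resource name "db" and the
Dummy agent are placeholders, not from this thread):

```shell
# Cluster-wide property: how often the policy engine re-evaluates state
crm configure property cluster-recheck-interval="2m"

# Per-resource meta attributes (resource name "db" is hypothetical)
crm configure primitive db ocf:heartbeat:Dummy \
    meta migration-threshold="30" failure-timeout="2m"
```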