Hi Ken, /sorry for the long text/
I have created a relatively simple setup to localize the issue: three nodes, no fencing, just a master/slave mysql with two virtual IPs. Just as a reminder, my primary issue is that the fail counts are not cleared at the cluster-recheck interval.

I simulated a failure with:

crm_failcount -N ctdb1 -r db-ip-master -v 1

crm_mon shows:

Last updated: Mon Jun 19 17:34:35 2017
Last change: Mon Jun 19 17:34:35 2017 via cibadmin on ctmgr
Stack: corosync
Current DC: ctmgr (168362243) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
4 Resources configured

Online: [ ctdb1 ctdb2 ctmgr ]

db-ip-master   (ocf::heartbeat:IPaddr2):   Started ctdb1
db-ip-slave    (ocf::heartbeat:IPaddr2):   Started ctdb2
Master/Slave Set: mysql [db-mysql]
    Masters: [ ctdb1 ]
    Slaves: [ ctdb2 ]

Node Attributes:
* Node ctdb1:
    + master-db-mysql : 3601
    + readable : 1
* Node ctdb2:
    + master-db-mysql : 3600
    + readable : 1
* Node ctmgr:

Migration summary:
* Node ctmgr:
* Node ctdb1:
    db-ip-master: migration-threshold=1000000 fail-count=1
* Node ctdb2:

When I check the pacemaker log on the DC, I see the following:

Jun 19 17:37:06 [18998] ctmgr crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_CALC: [ state=S_IDLE cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: All 3 cluster nodes are eligible to run resources.
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke: Query 231: Requesting the current CIB: S_POLICY_ENGINE
Jun 19 17:37:06 [18994] ctmgr cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/231, version=0.12.9)
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke_callback: Invoking the PE: query=231, ref=pe_calc-dc-1497893826-144, seq=21884, quorate=1
Jun 19 17:37:06 [18997] ctmgr pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH timeout: 60000
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH of failed nodes is disabled
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Stop all active resources: false
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Default stickiness: 0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_domains: Unpacking domains
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctmgr is online
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb1 is online
Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb2 is online
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctmgr to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb1 to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb1 returned 8 (master) instead of the expected value: 7 (not running)
Jun 19 17:37:06 [18997] ctmgr pengine: notice: unpack_rsc_op: Operation monitor found resource db-mysql:0 active in master mode on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb2 to db-mysql:1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb2 returned 0 (ok) instead of the expected value: 7 (not running)
Jun 19 17:37:06 [18997] ctmgr pengine: info: unpack_rsc_op: Operation monitor found resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: clone_print: Master/Slave Set: mysql [db-mysql]
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Masters: [ ctdb1 ]
Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Slaves: [ ctdb2 ]
Jun 19 17:37:06 [18997] ctmgr pengine: info: get_failcount_full: db-ip-master has failed 1 times on ctdb1
Jun 19 17:37:06 [18997] ctmgr pengine: info: common_apply_stickiness: db-ip-master can fail 999999 more times on ctdb1 before being forced off
Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:0: preferring current location (node=ctdb1, weight=1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:1: preferring current location (node=ctdb2, weight=1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-mysql:0
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-mysql:1
Jun 19 17:37:06 [18997] ctmgr pengine: debug: clone_color: Allocated 2 mysql instances of a possible 2
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:0 master score: 3601
Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: Promoting db-mysql:0 (Master ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:1 master score: 3600
Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: mysql: Promoted 1 instances of a possible 1 to master
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-ip-master
Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-ip-slave
Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_create_actions: Creating actions for mysql
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-master (Started ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-slave (Started ctdb2)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:0 (Master ctdb1)
Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:1 (Slave ctdb2)
Jun 19 17:37:06 [18997] ctmgr pengine: notice: process_pe_message: Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-16.bz2
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 17:37:06 [18998] ctmgr crmd: debug: unpack_graph: Unpacked transition 38: 0 actions in 0 synapses
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_te_invoke: Processing graph 38 (ref=pe_calc-dc-1497893826-144) derived from /var/lib/pacemaker/pengine/pe-input-16.bz2
Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
Jun 19 17:37:06 [18998] ctmgr crmd: notice: run_graph: Transition 38 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
Jun 19 17:37:06 [18998] ctmgr crmd: debug: te_graph_trigger: Transition 38 is now complete
Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Processing transition completion in state S_TRANSITION_ENGINE
Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Transition 38 status: done - <null>
Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 19 17:37:06 [18998] ctmgr crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: Starting PEngine Recheck Timer
Jun 19 17:37:06 [18998] ctmgr crmd: debug: crm_timer_start: Started PEngine Recheck Timer (I_PE_CALC:30000ms), src=277

As you can see from the logs, pacemaker does not even try to re-monitor the resource that had a failure, or at least I'm not seeing it. The cluster-recheck interval is set to 30 seconds for troubleshooting purposes.

If I execute a

crm resource cleanup db-ip-master

the failure is removed.

Now, am I getting something terribly wrong here? Or is this simply a bug in 1.1.10?

Thanks,
Attila

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Wednesday, June 7, 2017 10:14 PM
> To: Attila Megyeri <amegy...@minerva-soft.com>; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > Ken,
> >
> > I noticed something strange, this might be the issue.
> >
> > In some cases, even the manual cleanup does not work.
> >
> > I have a failed action of resource "A" on node "a". DC is node "b".
> >
> > e.g.
> > Failed actions:
> > jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017
> >
> > When I attempt to do a "crm resource cleanup A" from node "b", nothing happens. Basically the lrmd on "a" is not notified that it should monitor the resource.
> >
> > When I execute a "crm resource cleanup A" command on node "a" (where the operation failed), the failed action is cleared properly.
> >
> > Why could this be happening?
> > Which component should be responsible for this? pengine, crmd, lrmd?
>
> The crm shell will send commands to attrd (to clear fail counts) and crmd (to clear the resource history), which in turn will record changes in the cib.
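For reference, this is roughly what I run by hand to do the same thing at the lower level - only a sketch; the -G/-D switches of crm_failcount and the -C/-r/-N switches of crm_resource are the ones documented in the man pages on my install and may differ on other versions:

  # query the current fail count of db-ip-master on ctdb1
  crm_failcount -G -r db-ip-master -N ctdb1

  # clear only the fail count attribute (the attrd side)
  crm_failcount -D -r db-ip-master -N ctdb1

  # clear the fail count and the resource history on one node (the crmd side);
  # this is what removes the entry from "Failed actions" in crm_mon
  crm_resource -C -r db-ip-master -N ctdb1

Without -N, crm_resource should do the cleanup on every node.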
>
> I'm not sure how crm shell implements it, but crm_resource sends individual messages to each node when cleaning up a resource without specifying a particular node. You could check the pacemaker log on each node to see whether attrd and crmd are receiving those commands, and what they do in response.
>
> >> -----Original Message-----
> >> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> >> Sent: Thursday, June 1, 2017 6:57 PM
> >> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> >> Subject: Re: [ClusterLabs] clearing failed actions
> >>
> >> thanks Ken,
> >>
> >>> -----Original Message-----
> >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>> Sent: Thursday, June 1, 2017 12:04 AM
> >>> To: users@clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >>>>> Hi Ken,
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>>>>> To: users@clusterlabs.org
> >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>>>>
> >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Shouldn't the
> >>>>>>>
> >>>>>>> cluster-recheck-interval="2m"
> >>>>>>>
> >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes and clean the failcounts?
> >>>>>>
> >>>>>> It instructs pacemaker to recalculate whether any actions need to be taken (including expiring any failcounts appropriately).
> >>>>>>
> >>>>>>> At the primitive level I also have a
> >>>>>>>
> >>>>>>> migration-threshold="30" failure-timeout="2m"
> >>>>>>>
> >>>>>>> but whenever I have a failure, it remains there forever.
> >>>>>>>
> >>>>>>> What could be causing this?
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>>
> >>>>>>> Attila
> >>>>>>
> >>>>>> Is it a single old failure, or a recurring failure? The failure timeout works in a somewhat nonintuitive way. Old failures are not individually expired. Instead, all failures of a resource are simultaneously cleared if all of them are older than the failure-timeout. So if something keeps failing repeatedly (more frequently than the failure-timeout), none of the failures will be cleared.
> >>>>>>
> >>>>>> If it's not a repeating failure, something odd is going on.
> >>>>>
> >>>>> It is not a repeating failure. Let's say that a resource fails for whatever action, It will remain in the failed actions (crm_mon -Af) until I issue a "crm resource cleanup <resource name>". Even after days or weeks, even though I see in the logs that cluster is rechecked every 120 seconds.
> >>>>>
> >>>>> How could I troubleshoot this issue?
> >>>>>
> >>>>> thanks!
> >>>>
> >>>> Ah, I see what you're saying. That's expected behavior.
> >>>>
> >>>> The failure-timeout applies to the failure *count* (which is used for checking against migration-threshold), not the failure *history* (which is used for the status display).
> >>>>
> >>>> The idea is to have it no longer affect the cluster behavior, but still allow an administrator to know that it happened.
> >>>> That's why a manual cleanup is required to clear the history.
> >>>
> >>> Hmm, I'm wrong there ... failure-timeout does expire the failure history used for status display.
> >>>
> >>> It works with the current versions. It's possible 1.1.10 had issues with that.
> >>
> >> Well if nothing helps I will try to upgrade to a more recent version..
> >>
> >>> Check the status to see which node is DC, and look at the pacemaker log there after the failure occurred. There should be a message about the failcount expiring. You can also look at the live CIB and search for last_failure to see what is used for the display.
> >> [AM]
> >>
> >> In the pacemaker log I see at every recheck interval the following lines:
> >>
> >> Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
> >>
> >> If I check the CIB for the failure I see:
> >>
> >> <nvpair id="status-168362322-last-failure-jboss_admin2" name="last-failure-jboss_admin2" value="1496326649"/>
> >> <lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" call-id="114" rc-code="1" op-status="2" interval="0" last-run="1496326469" last-rc-change="1496326469" exec-time="180001" queue-time="0" op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
> >>
> >> Really have no clue why this isn't cleared...
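P.S. For anyone hitting the same thing, this is what I have been running on the DC to watch the pieces involved - a sketch only: the log file location depends on the distro/logging setup, and the attribute names (fail-count-<resource>, last-failure-<resource>) are simply what I see in my own CIB:

  # one-shot status including fail counts and node attributes
  crm_mon -1 -Af

  # transient attributes and operation history kept in the status section
  cibadmin -Q | grep -E 'fail-count|last.failure'

  # look in the DC's log around the recheck timer for failcount expiry messages
  # (adjust the path to wherever pacemaker logs on your system)
  grep -iE 'crm_timer_popped|fail.count|expir' /var/log/pacemaker.log

  # the three settings involved, as crm shell shows them
  crm configure show | grep -E 'cluster-recheck-interval|failure-timeout|migration-threshold'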