One more thing to add: two almost identical clusters with the identical asterisk primitive produce different crm_verify output. On one cluster it returns no warnings, whereas the other one complains:
On the problematic one:

crm_verify --live-check -VV
 warning: get_failcount_full: Setting asterisk.failure_timeout=120 in asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
Warnings found during check: config may not be valid

The relevant primitive is, in both clusters:

primitive asterisk ocf:heartbeat:asterisk \
        op monitor interval="10s" timeout="45s" on-fail="restart" \
        op start interval="0" timeout="60s" on-fail="standby" \
        op stop interval="0" timeout="60s" on-fail="block" \
        meta migration-threshold="3" failure-timeout="2m"

Why is the same configuration valid in one cluster, but not in the other? Shall I simply omit the "op stop" line?

thanks :)
Attila

> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Monday, June 19, 2017 9:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users@clusterlabs.org>; kgail...@redhat.com
> Subject: Re: [ClusterLabs] clearing failed actions
>
> I did another experiment, even simpler.
>
> Created one node, one resource, using pacemaker 1.1.14 on ubuntu.
>
> Configured failcount to 1, migration threshold to 2, failure timeout to 1 minute.
>
> crm_mon:
>
> Last updated: Mon Jun 19 19:43:41 2017    Last change: Mon Jun 19 19:37:09 2017 by root via cibadmin on test
> Stack: corosync
> Current DC: test (version 1.1.14-70404b0) - partition with quorum
> 1 node and 1 resource configured
>
> Online: [ test ]
>
> db-ip-master   (ocf::heartbeat:IPaddr2):       Started test
>
> Node Attributes:
> * Node test:
>
> Migration Summary:
> * Node test:
>    db-ip-master: migration-threshold=2 fail-count=1
>
> crm verify:
>
> crm_verify --live-check -VVVV
>    info: validate_with_relaxng: Creating RNG parser context
>    info: determine_online_status: Node test is online
>    info: get_failcount_full: db-ip-master has failed 1 times on test
>    info: get_failcount_full: db-ip-master has failed 1 times on test
>    info: get_failcount_full: db-ip-master has failed 1 times on test
>    info: get_failcount_full: db-ip-master has failed 1 times on test
>    info: native_print: db-ip-master   (ocf::heartbeat:IPaddr2):       Started test
>    info: get_failcount_full: db-ip-master has failed 1 times on test
>    info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off
>    info: LogActions: Leave   db-ip-master    (Started test)
>
>
> crm configure is:
>
> node 168362242: test \
>         attributes standby=off
> primitive db-ip-master IPaddr2 \
>         params lvs_support=true ip=10.9.1.10 cidr_netmask=24 broadcast=10.9.1.255 \
>         op start interval=0 timeout=20s on-fail=restart \
>         op monitor interval=20s timeout=20s \
>         op stop interval=0 timeout=20s on-fail=block \
>         meta migration-threshold=2 failure-timeout=1m target-role=Started
> location loc1 db-ip-master 0: test
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.14-70404b0 \
>         cluster-infrastructure=corosync \
>         stonith-enabled=false \
>         cluster-recheck-interval=30s \
>         symmetric-cluster=false
>
>
> Corosync log:
>
> Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 19 19:45:07 [330] test pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
> Jun 19 19:45:07 [330] test pengine: info: determine_online_status: Node test is online
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test
> Jun 19 19:45:07 [330] test pengine: info: get_failcount_full: db-ip-master has failed 1 times on test
> Jun 19 19:45:07 [330] test pengine: info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off
> Jun 19 19:45:07 [330] test pengine: info: LogActions: Leave db-ip-master (Started test)
> Jun 19 19:45:07 [330] test pengine: notice: process_pe_message: Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-6.bz2
> Jun 19 19:45:07 [331] test crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jun 19 19:45:07 [331] test crmd: notice: run_graph: Transition 34 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
> Jun 19 19:45:07 [331] test crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
>
> I hope someone can
> help me figure this out :)
>
> Thanks!
>
>
> > -----Original Message-----
> > From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > Sent: Monday, June 19, 2017 7:45 PM
> > To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> > clustering welcomed <users@clusterlabs.org>
> > Subject: Re: [ClusterLabs] clearing failed actions
> >
> > Hi Ken,
> >
> > /sorry for the long text/
> >
> > I have created a relatively simple setup to localize the issue.
> > Three nodes, no fencing, just a master/slave mysql with two virtual IPs.
> > Just as a reminder, my primary issue is that on cluster recheck intervals,
> > the failcounts are not cleared.
> >
> > I simulated a failure with:
> >
> > crm_failcount -N ctdb1 -r db-ip-master -v 1
> >
> >
> > crm_mon shows:
> >
> > Last updated: Mon Jun 19 17:34:35 2017
> > Last change: Mon Jun 19 17:34:35 2017 via cibadmin on ctmgr
> > Stack: corosync
> > Current DC: ctmgr (168362243) - partition with quorum
> > Version: 1.1.10-42f2063
> > 3 Nodes configured
> > 4 Resources configured
> >
> >
> > Online: [ ctdb1 ctdb2 ctmgr ]
> >
> > db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
> > db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
> > Master/Slave Set: mysql [db-mysql]
> >     Masters: [ ctdb1 ]
> >     Slaves: [ ctdb2 ]
> >
> > Node Attributes:
> > * Node ctdb1:
> >     + master-db-mysql : 3601
> >     + readable : 1
> > * Node ctdb2:
> >     + master-db-mysql : 3600
> >     + readable : 1
> > * Node ctmgr:
> >
> > Migration summary:
> > * Node ctmgr:
> > * Node ctdb1:
> >    db-ip-master: migration-threshold=1000000 fail-count=1
> > * Node ctdb2:
> >
> >
> >
> > When I check the pacemaker log on the DC, I see the following:
> >
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_CALC: [ state=S_IDLE cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: All 3 cluster nodes are eligible to run resources.
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke: Query 231: Requesting the current CIB: S_POLICY_ENGINE
> > Jun 19 17:37:06 [18994] ctmgr cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/231, version=0.12.9)
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_pe_invoke_callback: Invoking the PE: query=231, ref=pe_calc-dc-1497893826-144, seq=21884, quorate=1
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH timeout: 60000
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: STONITH of failed nodes is disabled
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Stop all active resources: false
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Default stickiness: 0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: On loss of CCM Quorum: Stop ALL resources
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_domains: Unpacking domains
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctmgr is online
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb1 is online
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: determine_online_status: Node ctdb2 is online
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctmgr to db-mysql:0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb1 to db-mysql:0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb1 returned 8 (master) instead of the expected value: 7 (not running)
> > Jun 19 17:37:06 [18997] ctmgr pengine: notice: unpack_rsc_op: Operation monitor found resource db-mysql:0 active in master mode on ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: find_anonymous_clone: Internally renamed db-mysql on ctdb2 to db-mysql:1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: unpack_rsc_op: db-mysql_last_failure_0 on ctdb2 returned 0 (ok) instead of the expected value: 7 (not running)
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: unpack_rsc_op: Operation monitor found resource db-mysql:1 active on ctdb2
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: native_print: db-ip-slave (ocf::heartbeat:IPaddr2): Started ctdb2
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: clone_print: Master/Slave Set: mysql [db-mysql]
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:0 active on ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_active: Resource db-mysql:1 active on ctdb2
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Masters: [ ctdb1 ]
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: short_print: Slaves: [ ctdb2 ]
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: get_failcount_full: db-ip-master has failed 1 times on ctdb1
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: common_apply_stickiness: db-ip-master can fail 999999 more times on ctdb1 before being forced off
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:0: preferring current location (node=ctdb1, weight=1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: common_apply_stickiness: Resource db-mysql:1: preferring current location (node=ctdb2, weight=1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-mysql:0
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-mysql:1
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: clone_color: Allocated 2 mysql instances of a possible 2
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:0 master score: 3601
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: Promoting db-mysql:0 (Master ctdb1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_color: db-mysql:1 master score: 3600
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: master_color: mysql: Promoted 1 instances of a possible 1 to master
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb1 to db-ip-master
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: native_assign_node: Assigning ctdb2 to db-ip-slave
> > Jun 19 17:37:06 [18997] ctmgr pengine: debug: master_create_actions: Creating actions for mysql
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-master (Started ctdb1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-ip-slave (Started ctdb2)
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:0 (Master ctdb1)
> > Jun 19 17:37:06 [18997] ctmgr pengine: info: LogActions: Leave db-mysql:1 (Slave ctdb2)
> > Jun 19 17:37:06 [18997] ctmgr pengine: notice: process_pe_message: Calculated Transition 38: /var/lib/pacemaker/pengine/pe-input-16.bz2
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_PE_SUCCESS: [ state=S_POLICY_ENGINE cause=C_IPC_MESSAGE origin=handle_response ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: unpack_graph: Unpacked transition 38: 0 actions in 0 synapses
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: do_te_invoke: Processing graph 38 (ref=pe_calc-dc-1497893826-144) derived from /var/lib/pacemaker/pengine/pe-input-16.bz2
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
> > Jun 19 17:37:06 [18998] ctmgr crmd: notice: run_graph: Transition 38 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: print_graph: Empty transition graph
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: te_graph_trigger: Transition 38 is now complete
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Processing transition completion in state S_TRANSITION_ENGINE
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: notify_crmd: Transition 38 status: done - <null>
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: s_crmd_fsa: Processing I_TE_SUCCESS: [ state=S_TRANSITION_ENGINE cause=C_FSA_INTERNAL origin=notify_crmd ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> > Jun 19 17:37:06 [18998] ctmgr crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: do_state_transition: Starting PEngine Recheck Timer
> > Jun 19 17:37:06 [18998] ctmgr crmd: debug: crm_timer_start: Started PEngine Recheck Timer (I_PE_CALC:30000ms), src=277
> >
> >
> >
> > As you can see from the logs, pacemaker does not even try to re-monitor the resource that had a failure, or at least I'm not seeing it.
> > Cluster recheck interval is set to 30 seconds for troubleshooting reasons.
> >
> > If I execute a
> >
> > crm resource cleanup db-ip-master
> >
> > the failure is removed.
> >
> > Now, am I getting something terribly wrong here?
> > Or is this simply a bug in 1.1.10?
> >
> >
> > Thanks,
> > Attila
> >
> >
> >
> > > -----Original Message-----
> > > From: Ken Gaillot [mailto:kgail...@redhat.com]
> > > Sent: Wednesday, June 7, 2017 10:14 PM
> > > To: Attila Megyeri <amegy...@minerva-soft.com>; Cluster Labs - All topics
> > > related to open-source clustering welcomed <users@clusterlabs.org>
> > > Subject: Re: [ClusterLabs] clearing failed actions
> > >
> > > On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > > > Ken,
> > > >
> > > > I noticed something strange, this might be the issue.
> > > >
> > > > In some cases, even the manual cleanup does not work.
> > > >
> > > > I have a failed action of resource "A" on node "a". DC is node "b".
> > > >
> > > > e.g.
> > > > Failed actions:
> > > >     jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017
> > > >
> > > >
> > > > When I attempt to do a "crm resource cleanup A" from node "b", nothing happens. Basically the lrmd on "a" is not notified that it should monitor the resource.
> > > >
> > > >
> > > > When I execute a "crm resource cleanup A" command on node "a" (where the operation failed), the failed action is cleared properly.
> > > >
> > > > Why could this be happening?
> > > > Which component should be responsible for this? pengine, crmd, lrmd?
> > >
> > > The crm shell will send commands to attrd (to clear fail counts) and
> > > crmd (to clear the resource history), which in turn will record changes
> > > in the cib.
> > >
> > > I'm not sure how crm shell implements it, but crm_resource sends
> > > individual messages to each node when cleaning up a resource without
> > > specifying a particular node. You could check the pacemaker log on each
> > > node to see whether attrd and crmd are receiving those commands, and
> > > what they do in response.
> > >
> > >
> > >> -----Original Message-----
> > >> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > >> Sent: Thursday, June 1, 2017 6:57 PM
> > >> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> > >> clustering welcomed <users@clusterlabs.org>
> > >> Subject: Re: [ClusterLabs] clearing failed actions
> > >>
> > >> thanks Ken,
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> > >>> Sent: Thursday, June 1, 2017 12:04 AM
> > >>> To: users@clusterlabs.org
> > >>> Subject: Re: [ClusterLabs] clearing failed actions
> > >>>
> > >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> > >>>>> Hi Ken,
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> > >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> > >>>>>> To: users@clusterlabs.org
> > >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> > >>>>>>
> > >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> Shouldn't the
> > >>>>>>>
> > >>>>>>> cluster-recheck-interval="2m"
> > >>>>>>>
> > >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes and
> > >>>>>>> clean the failcounts?
> > >>>>>>
> > >>>>>> It instructs pacemaker to recalculate whether any actions need to be
> > >>>>>> taken (including expiring any failcounts appropriately).
> > >>>>>>
> > >>>>>>> At the primitive level I also have a
> > >>>>>>>
> > >>>>>>> migration-threshold="30" failure-timeout="2m"
> > >>>>>>>
> > >>>>>>> but whenever I have a failure, it remains there forever.
> > >>>>>>>
> > >>>>>>> What could be causing this?
> > >>>>>>>
> > >>>>>>> thanks,
> > >>>>>>> Attila
> > >>>>>>
> > >>>>>> Is it a single old failure, or a recurring failure? The failure timeout
> > >>>>>> works in a somewhat nonintuitive way. Old failures are not individually
> > >>>>>> expired. Instead, all failures of a resource are simultaneously cleared
> > >>>>>> if all of them are older than the failure-timeout. So if something keeps
> > >>>>>> failing repeatedly (more frequently than the failure-timeout), none of
> > >>>>>> the failures will be cleared.
> > >>>>>>
> > >>>>>> If it's not a repeating failure, something odd is going on.
> > >>>>>
> > >>>>> It is not a repeating failure. Let's say that a resource fails for whatever
> > >>>>> action, it will remain in the failed actions (crm_mon -Af) until I issue a
> > >>>>> "crm resource cleanup <resource name>". Even after days or weeks, even
> > >>>>> though I see in the logs that the cluster is rechecked every 120 seconds.
> > >>>>>
> > >>>>> How could I troubleshoot this issue?
> > >>>>>
> > >>>>> thanks!
> > >>>>
> > >>>> Ah, I see what you're saying. That's expected behavior.
> > >>>>
> > >>>> The failure-timeout applies to the failure *count* (which is used for
> > >>>> checking against migration-threshold), not the failure *history* (which
> > >>>> is used for the status display).
> > >>>>
> > >>>> The idea is to have it no longer affect the cluster behavior, but still
> > >>>> allow an administrator to know that it happened. That's why a manual
> > >>>> cleanup is required to clear the history.
> > >>>
> > >>> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> > >>> used for status display.
> > >>>
> > >>> It works with the current versions. It's possible 1.1.10 had issues with
> > >>> that.
> > >>>
> > >>
> > >> Well, if nothing helps I will try to upgrade to a more recent version.
> > >>
> > >>> Check the status to see which node is DC, and look at the pacemaker log
> > >>> there after the failure occurred. There should be a message about the
> > >>> failcount expiring. You can also look at the live CIB and search for
> > >>> last_failure to see what is used for the display.
> > >> [AM]
> > >>
> > >> In the pacemaker log I see at every recheck interval the following lines:
> > >>
> > >> Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
> > >>
> > >> If I check the CIB for the failure I see:
> > >>
> > >> <nvpair id="status-168362322-last-failure-jboss_admin2" name="last-failure-jboss_admin2" value="1496326649"/>
> > >> <lrm_rsc_op id="jboss_admin2_last_failure_0" operation_key="jboss_admin2_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8" call-id="114" rc-code="1" op-status="2" interval="0" last-run="1496326469" last-rc-change="1496326469" exec-time="180001" queue-time="0" op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
> > >>
> > >> Really have no clue why this isn't cleared...
> > _______________________________________________
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
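A sketch of one way to make the configuration at the top pass crm_verify cleanly, assuming the on-fail=block behavior for the stop operation is intentional: drop the failure-timeout that the warning says is being ignored anyway, and clear failures by hand. This is untested and only illustrates the conflict the warning describes; note it also disables automatic failure expiry for the monitor and start operations.

```
# Hypothetical variant of the asterisk primitive: on-fail=block is kept,
# and failure-timeout is dropped so the two no longer conflict.
# Trade-off: monitor/start failures also lose automatic expiry and must
# be cleared manually (e.g. with "crm resource cleanup asterisk").
primitive asterisk ocf:heartbeat:asterisk \
        op monitor interval="10s" timeout="45s" on-fail="restart" \
        op start interval="0" timeout="60s" on-fail="standby" \
        op stop interval="0" timeout="60s" on-fail="block" \
        meta migration-threshold="3"
```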