Ken, I noticed something strange that might be the issue.
In some cases, even the manual cleanup does not work.

I have a failed action of resource "A" on node "a"; the DC is node "b". For example:

Failed actions:
    jboss_imssrv1_monitor_10000 (node=ctims1, call=108, rc=1, status=complete,
        last-rc-change=Thu Jun 1 14:13:36 2017)

When I attempt a "crm resource cleanup A" from node "b", nothing happens; basically, the lrmd on "a" is not notified that it should monitor the resource again. When I execute the same "crm resource cleanup A" command on node "a" (where the operation failed), the failed action is cleared properly.

Why could this be happening? Which component should be responsible for this: pengine, crmd, or lrmd?

> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Thursday, June 1, 2017 6:57 PM
> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> clustering welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] clearing failed actions
>
> thanks Ken,
>
> > -----Original Message-----
> > From: Ken Gaillot [mailto:kgail...@redhat.com]
> > Sent: Thursday, June 1, 2017 12:04 AM
> > To: users@clusterlabs.org
> > Subject: Re: [ClusterLabs] clearing failed actions
> >
> > On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > > On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> > >> Hi Ken,
> > >>
> > >>> -----Original Message-----
> > >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> > >>> Sent: Tuesday, May 30, 2017 4:32 PM
> > >>> To: users@clusterlabs.org
> > >>> Subject: Re: [ClusterLabs] clearing failed actions
> > >>>
> > >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > >>>> Hi,
> > >>>>
> > >>>> Shouldn't the
> > >>>>
> > >>>>     cluster-recheck-interval="2m"
> > >>>>
> > >>>> property instruct pacemaker to recheck the cluster every 2 minutes
> > >>>> and clean the failcounts?
> > >>>
> > >>> It instructs pacemaker to recalculate whether any actions need to be
> > >>> taken (including expiring any failcounts appropriately).
> > >>>
> > >>>> At the primitive level I also have
> > >>>>
> > >>>>     migration-threshold="30" failure-timeout="2m"
> > >>>>
> > >>>> but whenever I have a failure, it remains there forever.
> > >>>>
> > >>>> What could be causing this?
> > >>>>
> > >>>> thanks,
> > >>>> Attila
> > >>>
> > >>> Is it a single old failure, or a recurring failure? The failure-timeout
> > >>> works in a somewhat nonintuitive way. Old failures are not individually
> > >>> expired. Instead, all failures of a resource are simultaneously cleared
> > >>> if all of them are older than the failure-timeout. So if something keeps
> > >>> failing repeatedly (more frequently than the failure-timeout), none of
> > >>> the failures will be cleared.
> > >>>
> > >>> If it's not a repeating failure, something odd is going on.
> > >>
> > >> It is not a repeating failure. Let's say a resource fails for whatever
> > >> action: it will remain in the failed actions (crm_mon -Af) until I issue
> > >> a "crm resource cleanup <resource name>". It stays there even after days
> > >> or weeks, even though I see in the logs that the cluster is rechecked
> > >> every 120 seconds.
> > >>
> > >> How could I troubleshoot this issue?
> > >>
> > >> thanks!
> > >
> > > Ah, I see what you're saying. That's expected behavior.
> > >
> > > The failure-timeout applies to the failure *count* (which is used for
> > > checking against migration-threshold), not the failure *history* (which
> > > is used for the status display).
> > >
> > > The idea is to have it no longer affect the cluster behavior, but still
> > > allow an administrator to know that it happened. That's why a manual
> > > cleanup is required to clear the history.
> >
> > Hmm, I'm wrong there ...
> > failure-timeout does expire the failure history used for status display.
> >
> > It works with the current versions. It's possible 1.1.10 had issues with
> > that.
>
> Well, if nothing helps I will try to upgrade to a more recent version.
>
> > Check the status to see which node is DC, and look at the pacemaker log
> > there after the failure occurred. There should be a message about the
> > failcount expiring. You can also look at the live CIB and search for
> > last_failure to see what is used for the display.
>
> [AM]
> In the pacemaker log I see the following lines at every recheck interval:
>
> Jun 01 16:54:08 [8700] ctabsws2 pengine: warning: unpack_rsc_op:
> Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)
>
> If I check the CIB for the failure, I see:
>
> <nvpair id="status-168362322-last-failure-jboss_admin2"
>         name="last-failure-jboss_admin2" value="1496326649"/>
> <lrm_rsc_op id="jboss_admin2_last_failure_0"
>             operation_key="jboss_admin2_start_0" operation="start"
>             crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
>             transition-key="73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
>             transition-magic="2:1;73:4:0:0a88f6e6-4ed1-4b53-88ad-3c568ca3daa8"
>             call-id="114" rc-code="1" op-status="2" interval="0"
>             last-run="1496326469" last-rc-change="1496326469"
>             exec-time="180001" queue-time="0"
>             op-digest="8ec02bcea0bab86f4a7e9e27c23bc88b"/>
>
> I really have no clue why this isn't cleared...
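For anyone trying to read that lrm_rsc_op entry, a short script can decode the interesting attributes (a sketch; the attribute values are copied from the snippet above, and the op-status / rc-code mappings are my assumptions based on the OCF return codes and Pacemaker's operation status values, so double-check them against your Pacemaker version):

```python
# Sketch: decode the quoted failure record. Mappings are assumptions,
# not taken from Pacemaker source; verify against your version.
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

snippet = """<lrm_rsc_op id="jboss_admin2_last_failure_0"
  operation_key="jboss_admin2_start_0" operation="start"
  call-id="114" rc-code="1" op-status="2" interval="0"
  last-run="1496326469" last-rc-change="1496326469"
  exec-time="180001" queue-time="0"/>"""

# Assumed operation-status and OCF return-code meanings:
OP_STATUS = {0: "complete", 1: "cancelled", 2: "timed out", 4: "error"}
OCF_RC = {0: "ok", 1: "unknown error", 7: "not running"}

op = ET.fromstring(snippet)
when = datetime.fromtimestamp(int(op.get("last-rc-change")), tz=timezone.utc)
print(op.get("operation_key"),
      "status:", OP_STATUS[int(op.get("op-status"))],
      "rc:", OCF_RC[int(op.get("rc-code"))],
      "at", when.isoformat(),
      "exec-time:", op.get("exec-time") + "ms")
```

Read that way, the record says the start operation ran for 180 seconds (exec-time="180001") before failing, which is consistent with an operation timeout rather than an immediate failure.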
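As an aside, the expiry rule Ken described earlier in the thread (all failures of a resource are cleared together, and only once every one of them is older than the failure-timeout) can be sketched in a few lines. This is purely an illustration of the semantics, not Pacemaker's actual code:

```python
# Illustration only: failures of a resource expire together, and only
# when *every* recorded failure is older than failure-timeout. A resource
# that keeps failing more often than the timeout never gets cleared.

def failures_expired(failure_times, now, failure_timeout):
    """failure_times: epoch seconds of each recorded failure."""
    return all(now - t > failure_timeout for t in failure_times)

# One old failure, failure-timeout=120s: expired at the next recheck.
print(failures_expired([1000], now=2000, failure_timeout=120))        # True

# Repeating failure, most recent only 60s ago: nothing expires,
# even though the first failure is ancient.
print(failures_expired([1000, 1940], now=2000, failure_timeout=120))  # False
```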
>
> > _______________________________________________
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org