On 03/28/2016 11:44 AM, Sam Gardner wrote:
> I have a simple resource defined:
>
> [root@ha-d1 ~]# pcs resource show dmz1
>  Resource: dmz1 (class=ocf provider=internal type=ip-address)
>   Attributes: address=172.16.10.192 monitor_link=true
>   Meta Attrs: migration-threshold=3 failure-timeout=30s
>   Operations: monitor interval=7s (dmz1-monitor-interval-7s)
>
> This is a custom resource which provides an ethernet alias to one of the
> interfaces on our system.
>
> I can unplug the cable on either node and failover occurs as expected,
> and 30s after re-plugging it I can repeat the exercise on the opposite
> node and failover will happen as expected.
>
> However, if I unplug the cable from both nodes, the failcount goes up,
> and the 30s failure-timeout does not reset the failcounts, meaning that
> pacemaker never tries to start the failed resource again.
Apologies for the late response, but: Time-based actions in Pacemaker,
including failure-timeout, are not guaranteed to be checked more frequently
than the value of the cluster-recheck-interval cluster property, which
defaults to 15 minutes. If the cluster responds to an event (node
joining/leaving, monitor failure, etc.), it will check time-based actions at
that point, but otherwise it doesn't. So cluster-recheck-interval acts as a
maximum time between such checks.

Try lowering your cluster-recheck-interval (example pcs commands at the end
of this message). Personally, I would think 30s for a failure-timeout is
awfully quick; it would lead to continuous retries. And it would require
setting cluster-recheck-interval to something similar, which would add a lot
of computational overhead to the cluster.

I'm curious what values of cluster-recheck-interval and failure-timeout
people are commonly using "in the wild". On a small, underutilized cluster,
you could probably get away with setting them quite low, but on larger
clusters, I would expect it would be too much overhead.

> Full list of resources:
>
> Resource Group: network
>     inif       (ocf::internal:ip.sh):   Started ha-d1.dev.com
>     outif      (ocf::internal:ip.sh):   Started ha-d2.dev.com
>     dmz1       (ocf::internal:ip.sh):   Stopped
> Master/Slave Set: DRBDMaster [DRBDSlave]
>     Masters: [ ha-d1.dev.com ]
>     Slaves: [ ha-d2.dev.com ]
> Resource Group: filesystem
>     DRBDFS     (ocf::heartbeat:Filesystem):   Stopped
> Resource Group: application
>     service_failover   (ocf::internal:service_failover):   Stopped
>
> Failcounts for dmz1
>  ha-d1.dev.com: 4
>  ha-d2.dev.com: 4
>
> Is there any way to automatically recover from this scenario, other than
> setting an obnoxiously high migration-threshold?
>
> --
> Sam Gardner
> Software Engineer
> Trustwave | SMART SECURITY ON DEMAND
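For reference, here is roughly what that tuning and a manual recovery could
look like with pcs (a sketch, not tested against your configuration; the
interval and timeout values are placeholders to adjust for your environment):

    # check time-based actions such as failure-timeout at most every 2 minutes
    pcs property set cluster-recheck-interval=2min

    # give the resource a less aggressive failure-timeout
    pcs resource update dmz1 meta failure-timeout=120s

    # clear the failcounts that have already accumulated, so the cluster
    # retries starting dmz1 right away
    pcs resource cleanup dmz1

Keep in mind that cluster-recheck-interval is cluster-wide, so the usual
trade-off applies: the lower you set it, the more often the cluster
recalculates state even when nothing has changed.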