Re: [ClusterLabs] Resource failure-timeout does not reset when resource fails to connect to both nodes

2016-05-13 Thread Ken Gaillot
On 03/28/2016 11:44 AM, Sam Gardner wrote:
> I have a simple resource defined:
> 
> [root@ha-d1 ~]# pcs resource show dmz1
>  Resource: dmz1 (class=ocf provider=internal type=ip-address)
>   Attributes: address=172.16.10.192 monitor_link=true
>   Meta Attrs: migration-threshold=3 failure-timeout=30s
>   Operations: monitor interval=7s (dmz1-monitor-interval-7s)
> 
> This is a custom resource which provides an ethernet alias to one of the
> interfaces on our system.
> 
> I can unplug the cable on either node and failover occurs as expected,
> and 30s after re-plugging it I can repeat the exercise on the opposite
> node and failover will happen as expected.
> 
> However, if I unplug the cable from both nodes, the failcount goes up,
> and the 30s failure-timeout does not reset the failcounts, meaning that
> pacemaker never tries to start the failed resource again.

Apologies for the late response, but:

Time-based actions in Pacemaker, including failure-timeout, are not
guaranteed to be checked more frequently than the value of the
cluster-recheck-interval cluster property, which defaults to 15 minutes.

If the cluster responds to an event (node joining/leaving, monitor
failure, etc.), it will check time-based actions at that point, but
otherwise it doesn't. So cluster-recheck-interval acts as a maximum time
between such checks.

Try lowering your cluster-recheck-interval. Personally, I would think
30s for a failure-timeout is awfully quick; it would lead to continuous
retries. And it would require setting cluster-recheck-interval to
something similar, which would add a lot of computational overhead to
the cluster.
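
For example (untested, and the values here are only a guess at what might
suit this setup -- tune them to how quickly you actually need retries):

# let failures expire after 5 minutes instead of 30 seconds
[root@ha-d1 ~]# pcs resource update dmz1 meta failure-timeout=5min
# and make sure time-based rules are re-evaluated at least that often
[root@ha-d1 ~]# pcs property set cluster-recheck-interval=2min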

I'm curious what values of cluster-recheck-interval and failure-timeout
people are commonly using "in the wild". On a small, underutilized
cluster, you could probably get away with setting them quite low, but on
larger clusters, I would expect it would be too much overhead.
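
In the meantime, whenever the resource does get stuck like this, the fail
counts can also be cleared by hand with the standard tooling, which makes
the cluster eligible to start the resource again right away:

[root@ha-d1 ~]# pcs resource failcount show dmz1
[root@ha-d1 ~]# pcs resource cleanup dmz1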

> Full list of resources:
> 
>  Resource Group: network
>      inif       (ocf::internal:ip.sh):   Started ha-d1.dev.com
>      outif      (ocf::internal:ip.sh):   Started ha-d2.dev.com
>      dmz1       (ocf::internal:ip.sh):   Stopped
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>      Masters: [ ha-d1.dev.com ]
>      Slaves: [ ha-d2.dev.com ]
>  Resource Group: filesystem
>      DRBDFS     (ocf::heartbeat:Filesystem):    Stopped
>  Resource Group: application
>      service_failover   (ocf::internal:service_failover):    Stopped
> 
> Failcounts for dmz1
>  ha-d1.dev.com: 4
>  ha-d2.dev.com: 4
> 
> Is there any way to automatically recover from this scenario, other than
> setting an obnoxiously high migration-threshold? 
> 
> -- 
> 
> Sam Gardner
> Software Engineer
> Trustwave | SMART SECURITY ON DEMAND

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Resource failure-timeout does not reset when resource fails to connect to both nodes

2016-03-28 Thread Digimer
On 28/03/16 12:44 PM, Sam Gardner wrote:
> I have a simple resource defined:
> 
> [...]
> 
> However, if I unplug the cable from both nodes, the failcount goes up,
> and the 30s failure-timeout does not reset the failcounts, meaning that
> pacemaker never tries to start the failed resource again.
> 
> [...]
> 
> Is there any way to automatically recover from this scenario, other than
> setting an obnoxiously high migration-threshold?
> 
> -- 
> Sam Gardner
> Software Engineer
> Trustwave | SMART SECURITY ON DEMAND

Stonith?
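
If fencing isn't set up yet, a rough pcs sketch might look like the
following -- the fence agent, addresses and credentials below are
placeholders and depend entirely on your hardware:

[root@ha-d1 ~]# pcs stonith create fence-d1 fence_ipmilan \
    pcmk_host_list="ha-d1.dev.com" ipaddr=192.0.2.11 login=admin passwd=secret
[root@ha-d1 ~]# pcs stonith create fence-d2 fence_ipmilan \
    pcmk_host_list="ha-d2.dev.com" ipaddr=192.0.2.12 login=admin passwd=secret
[root@ha-d1 ~]# pcs property set stonith-enabled=true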

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
