Re: [ClusterLabs] Make sure either 0 or all resources in a group are running

2016-03-28 Thread Sam Gardner
Automatically un-standby-ing a node at some point after a failure on
certain resources actually fits our use case well, but the problem is
that the automatic unstandby does not put DRBD back into secondary mode
when it occurs.

A manual pcs cluster standby $(uname -n) followed by pcs cluster
unstandby $(uname -n) does reset the node's state properly, however.

--
Sam Gardner
Trustwave | SMART SECURITY ON DEMAND

On 3/28/16, 4:31 PM, "Sam Gardner"  wrote:

>'on-fail=standby' works well; however, setting a failure-timeout appears
>to bring the node out of standby automatically once the timeout expires.
>
>--
>Sam Gardner
>Trustwave | SMART SECURITY ON DEMAND
>
>On 3/28/16, 3:31 PM, "Ken Gaillot"  wrote:
>
>>On 03/28/2016 02:19 PM, Sam Gardner wrote:
>>> Is there any way to modify the behavior of a resource group N of A, B,
>>>and C so that either A, B, and C are running on the same node, or none
>>>of them are?
>>>
>>> With Pacemaker 1.1.12 and Corosync 1.4.8, if a group N is defined via:
>>> pcs resource group N A B C
>>>
>>> then if resource C cannot run, A and B still do.
>>>
>>> --
>>> Sam Gardner
>>> Trustwave | SMART SECURITY ON DEMAND
>>
>>The problem with that model is that none of the resources can be placed
>>or started, because each depends on the others being placed and started
>>already.
>>
>>I can think of two similar alternatives, though they would only work for
>>failures, not for any other reasons C might be stopped:
>>
>>* Use on-fail=standby, so that if any resource fails, all resources are
>>forced off that node. The node must be manually taken out of standby to
>>be used again.
>>
>>* Use rules to say that A cannot run on any node where fail-count-B gt 0
>>or fail-count-C gt 0, and B cannot run on any node where fail-count C gt
>>0. (The group should handle the rest of the dependencies.)
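
The two alternatives above can be sketched with pcs. This is illustrative
only: rule syntax varies across pcs versions, the monitor interval shown is a
placeholder that should match the operation already defined on each resource,
and the fail-count-B / fail-count-C attribute names are taken from the rule
expressions above, not verified against a live cluster.

```shell
# Alternative 1: on-fail=standby on each member's monitor operation, so any
# monitor failure puts the whole node into standby.
pcs resource update A op monitor interval=10s on-fail=standby
pcs resource update B op monitor interval=10s on-fail=standby
pcs resource update C op monitor interval=10s on-fail=standby

# After repairing the fault, bring the node back by hand:
pcs cluster unstandby ha-d1.dev.com

# Alternative 2: location rules keyed on the fail-count node attributes.
pcs constraint location A rule score=-INFINITY fail-count-B gt 0 or fail-count-C gt 0
pcs constraint location B rule score=-INFINITY fail-count-C gt 0
```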
>>
>>
>>___
>>Users mailing list: Users@clusterlabs.org
>>http://clusterlabs.org/mailman/listinfo/users
>>
>>Project Home: http://www.clusterlabs.org
>>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>Bugs: http://bugs.clusterlabs.org



Re: [ClusterLabs] Make sure either 0 or all resources in a group are running

2016-03-28 Thread Sam Gardner
'on-fail=standby' works well; however, setting a failure-timeout appears
to bring the node out of standby automatically once the timeout expires.

--
Sam Gardner
Trustwave | SMART SECURITY ON DEMAND

On 3/28/16, 3:31 PM, "Ken Gaillot"  wrote:

>On 03/28/2016 02:19 PM, Sam Gardner wrote:
>> Is there any way to modify the behavior of a resource group N of A, B,
>>and C so that either A, B, and C are running on the same node, or none
>>of them are?
>>
>> With Pacemaker 1.1.12 and Corosync 1.4.8, if a group N is defined via:
>> pcs resource group N A B C
>>
>> then if resource C cannot run, A and B still do.
>>
>> --
>> Sam Gardner
>> Trustwave | SMART SECURITY ON DEMAND
>
>The problem with that model is that none of the resources can be placed
>or started, because each depends on the others being placed and started
>already.
>
>I can think of two similar alternatives, though they would only work for
>failures, not for any other reasons C might be stopped:
>
>* Use on-fail=standby, so that if any resource fails, all resources are
>forced off that node. The node must be manually taken out of standby to
>be used again.
>
>* Use rules to say that A cannot run on any node where fail-count-B gt 0
>or fail-count-C gt 0, and B cannot run on any node where fail-count C gt
>0. (The group should handle the rest of the dependencies.)



Re: [ClusterLabs] Resource failure-timeout does not reset when resource fails to connect to both nodes

2016-03-28 Thread Digimer
On 28/03/16 12:44 PM, Sam Gardner wrote:
> I have a simple resource defined:
> 
> [root@ha-d1 ~]# pcs resource show dmz1
>  Resource: dmz1 (class=ocf provider=internal type=ip-address)
>   Attributes: address=172.16.10.192 monitor_link=true
>   Meta Attrs: migration-threshold=3 failure-timeout=30s
>   Operations: monitor interval=7s (dmz1-monitor-interval-7s)
> 
> This is a custom resource which provides an ethernet alias to one of the
> interfaces on our system.
> 
> I can unplug the cable on either node and failover occurs as expected,
> and 30s after re-plugging it I can repeat the exercise on the opposite
> node and failover will happen as expected.
> 
> However, if I unplug the cable from both nodes, the failcount goes up,
> and the 30s failure-timeout does not reset the failcounts, meaning that
> pacemaker never tries to start the failed resource again.
> 
> Full list of resources:
> 
>  Resource Group: network
>  inif   (ocf::internal:ip.sh):   Started ha-d1.dev.com
>  outif  (ocf::internal:ip.sh):   Started ha-d2.dev.com
>  dmz1   (ocf::internal:ip.sh):   Stopped
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>  Masters: [ ha-d1.dev.com ]
>  Slaves: [ ha-d2.dev.com ]
>  Resource Group: filesystem
>  DRBDFS (ocf::heartbeat:Filesystem):Stopped
>  Resource Group: application
>  service_failover   (ocf::internal:service_failover):Stopped
> 
> Failcounts for dmz1
>  ha-d1.dev.com: 4
>  ha-d2.dev.com: 4
> 
> Is there any way to automatically recover from this scenario, other than
> setting an obnoxiously high migration-threshold? 
> 
> -- 
> 
> Sam Gardner
> 
> Software Engineer
> 
> Trustwave | SMART SECURITY ON DEMAND

Stonith?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



[ClusterLabs] Resource failure-timeout does not reset when resource fails to connect to both nodes

2016-03-28 Thread Sam Gardner
I have a simple resource defined:

[root@ha-d1 ~]# pcs resource show dmz1
 Resource: dmz1 (class=ocf provider=internal type=ip-address)
  Attributes: address=172.16.10.192 monitor_link=true
  Meta Attrs: migration-threshold=3 failure-timeout=30s
  Operations: monitor interval=7s (dmz1-monitor-interval-7s)

This is a custom resource which provides an ethernet alias to one of the 
interfaces on our system.

I can unplug the cable on either node and failover occurs as expected, and 30s 
after re-plugging it I can repeat the exercise on the opposite node and 
failover will happen as expected.

However, if I unplug the cable from both nodes, the failcount goes up, and the 
30s failure-timeout does not reset the failcounts, meaning that pacemaker never 
tries to start the failed resource again.
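
(A side note, not established in the thread: failure-timeout expiry is only
acted on when the policy engine next recalculates, which on an otherwise quiet
cluster happens every cluster-recheck-interval, 15 minutes by default. With a
30s failure-timeout that delay can look like the timeout never firing. A
sketch, with an illustrative value:)

```shell
# Recheck more often so failure-timeout expiry takes effect sooner:
pcs property set cluster-recheck-interval=60s
```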

Full list of resources:

 Resource Group: network
 inif   (ocf::internal:ip.sh):   Started ha-d1.dev.com
 outif  (ocf::internal:ip.sh):   Started ha-d2.dev.com
 dmz1   (ocf::internal:ip.sh):   Stopped
 Master/Slave Set: DRBDMaster [DRBDSlave]
 Masters: [ ha-d1.dev.com ]
 Slaves: [ ha-d2.dev.com ]
 Resource Group: filesystem
 DRBDFS (ocf::heartbeat:Filesystem):Stopped
 Resource Group: application
 service_failover   (ocf::internal:service_failover):Stopped

Failcounts for dmz1
 ha-d1.dev.com: 4
 ha-d2.dev.com: 4

Is there any way to automatically recover from this scenario, other than 
setting an obnoxiously high migration-threshold?
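
For a one-off recovery from the both-nodes-failed state, clearing the recorded
failures makes Pacemaker eligible to try the start again; either of these
should work on a cluster of this vintage:

```shell
# Clear dmz1's failure history (and fail counts) on all nodes:
pcs resource cleanup dmz1

# Lower-level equivalent:
crm_resource --cleanup --resource dmz1
```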

--
Sam Gardner
Software Engineer
Trustwave | SMART SECURITY ON DEMAND
