Re: [ClusterLabs] Make sure either 0 or all resources in a group are running
Unstandby-ing a node automatically at some point after a failure on certain resources actually fits our use cases well, but the problem is that the automatic unstandby does not put DRBD back into secondary mode when it occurs. A manual pcs cluster standby $(uname -n) followed by pcs cluster unstandby $(uname -n) does reset the state of the node properly, however.

--
Sam Gardner
Trustwave | SMART SECURITY ON DEMAND

On 3/28/16, 4:31 PM, "Sam Gardner" wrote:

>'on-fail=standby' works well; however, setting a failure-timeout appears
>to automatically bring the node out of standby after it expires.
>
>--
>Sam Gardner
>Trustwave | SMART SECURITY ON DEMAND
>
>On 3/28/16, 3:31 PM, "Ken Gaillot" wrote:
>
>>On 03/28/2016 02:19 PM, Sam Gardner wrote:
>>> Is there any way to modify the behavior of a resource group N of A, B,
>>> and C so that either A, B, and C are running on the same node, or none
>>> of them are?
>>>
>>> With Pacemaker 1.1.12 and Corosync 1.4.8, if a group N is defined via
>>>   pcs resource group N A B C
>>> and resource C cannot run, A and B still do.
>>>
>>> --
>>> Sam Gardner
>>> Trustwave | SMART SECURITY ON DEMAND
>>
>>The problem with that model is that none of the resources can be placed
>>or started, because each depends on the others being placed and started
>>already.
>>
>>I can think of two similar alternatives, though they would only work for
>>failures, not for any other reason C might be stopped:
>>
>>* Use on-fail=standby, so that if any resource fails, all resources are
>>forced off that node. The node must be manually taken out of standby to
>>be used again.
>>
>>* Use rules to say that A cannot run on any node where fail-count-B gt 0
>>or fail-count-C gt 0, and B cannot run on any node where fail-count-C gt
>>0. (The group should handle the rest of the dependencies.)
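The manual recovery described above can be sketched as a small script. This is only an illustration of the workaround from the post, not a supported procedure: it assumes a running Pacemaker cluster with pcs available, and the fixed sleep is a crude placeholder for actually checking pcs status.

```shell
#!/bin/sh
# Sketch of the manual recovery described above: put the local node in
# standby (which stops its resources and demotes DRBD to secondary),
# then bring it back so resources can be placed there again.
node="$(uname -n)"          # local node name, as in the original post
pcs cluster standby "$node"
# Give resources time to stop/demote before unstandby-ing. A fixed
# sleep is a crude placeholder; polling "pcs status" would be safer.
sleep 10
pcs cluster unstandby "$node"
```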
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Make sure either 0 or all resources in a group are running
'on-fail=standby' works well; however, setting a failure-timeout appears to automatically bring the node out of standby after it expires.

--
Sam Gardner
Trustwave | SMART SECURITY ON DEMAND

On 3/28/16, 3:31 PM, "Ken Gaillot" wrote:

>On 03/28/2016 02:19 PM, Sam Gardner wrote:
>> Is there any way to modify the behavior of a resource group N of A, B,
>> and C so that either A, B, and C are running on the same node, or none
>> of them are?
>>
>> With Pacemaker 1.1.12 and Corosync 1.4.8, if a group N is defined via
>>   pcs resource group N A B C
>> and resource C cannot run, A and B still do.
>>
>> --
>> Sam Gardner
>> Trustwave | SMART SECURITY ON DEMAND
>
>The problem with that model is that none of the resources can be placed
>or started, because each depends on the others being placed and started
>already.
>
>I can think of two similar alternatives, though they would only work for
>failures, not for any other reason C might be stopped:
>
>* Use on-fail=standby, so that if any resource fails, all resources are
>forced off that node. The node must be manually taken out of standby to
>be used again.
>
>* Use rules to say that A cannot run on any node where fail-count-B gt 0
>or fail-count-C gt 0, and B cannot run on any node where fail-count-C gt
>0. (The group should handle the rest of the dependencies.)
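Ken's second alternative can be expressed as pcs location-constraint rules. The commands below are an untested sketch of that idea: they assume the pcs rule syntax of the Pacemaker 1.1.x era and that the cluster records failures in fail-count-<resource> node attributes, which is how that series exposed failcounts.

```shell
# Sketch of Ken's rule-based alternative: keep A off any node where
# B or C has failed, and B off any node where C has failed. The group
# ordering (A before B before C) handles the remaining dependencies.
pcs constraint location A rule score=-INFINITY \
    fail-count-B gt 0 or fail-count-C gt 0
pcs constraint location B rule score=-INFINITY \
    fail-count-C gt 0
```

Note this only covers failures: if C is stopped for some other reason (target-role, a ban), the rules do not fire, which is the limitation Ken points out.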
Re: [ClusterLabs] Resource failure-timeout does not reset when resource fails to connect to both nodes
On 28/03/16 12:44 PM, Sam Gardner wrote:
> I have a simple resource defined:
>
> [root@ha-d1 ~]# pcs resource show dmz1
>  Resource: dmz1 (class=ocf provider=internal type=ip-address)
>   Attributes: address=172.16.10.192 monitor_link=true
>   Meta Attrs: migration-threshold=3 failure-timeout=30s
>   Operations: monitor interval=7s (dmz1-monitor-interval-7s)
>
> This is a custom resource which provides an ethernet alias on one of
> the interfaces on our system.
>
> I can unplug the cable on either node and failover occurs as expected,
> and 30s after re-plugging it I can repeat the exercise on the opposite
> node and failover will happen as expected.
>
> However, if I unplug the cable from both nodes, the failcount goes up,
> and the 30s failure-timeout does not reset the failcounts, meaning that
> Pacemaker never tries to start the failed resource again.
>
> Full list of resources:
>
>  Resource Group: network
>      inif       (ocf::internal:ip.sh):  Started ha-d1.dev.com
>      outif      (ocf::internal:ip.sh):  Started ha-d2.dev.com
>      dmz1       (ocf::internal:ip.sh):  Stopped
>  Master/Slave Set: DRBDMaster [DRBDSlave]
>      Masters: [ ha-d1.dev.com ]
>      Slaves:  [ ha-d2.dev.com ]
>  Resource Group: filesystem
>      DRBDFS     (ocf::heartbeat:Filesystem):  Stopped
>  Resource Group: application
>      service_failover (ocf::internal:service_failover):  Stopped
>
> Failcounts for dmz1
>  ha-d1.dev.com: 4
>  ha-d2.dev.com: 4
>
> Is there any way to automatically recover from this scenario, other
> than setting an obnoxiously high migration-threshold?
>
> --
> Sam Gardner
> Software Engineer
> Trustwave | SMART SECURITY ON DEMAND

Stonith?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
[ClusterLabs] Resource failure-timeout does not reset when resource fails to connect to both nodes
I have a simple resource defined:

[root@ha-d1 ~]# pcs resource show dmz1
 Resource: dmz1 (class=ocf provider=internal type=ip-address)
  Attributes: address=172.16.10.192 monitor_link=true
  Meta Attrs: migration-threshold=3 failure-timeout=30s
  Operations: monitor interval=7s (dmz1-monitor-interval-7s)

This is a custom resource which provides an ethernet alias on one of the interfaces on our system.

I can unplug the cable on either node and failover occurs as expected, and 30s after re-plugging it I can repeat the exercise on the opposite node and failover will happen as expected.

However, if I unplug the cable from both nodes, the failcount goes up, and the 30s failure-timeout does not reset the failcounts, meaning that Pacemaker never tries to start the failed resource again.

Full list of resources:

 Resource Group: network
     inif       (ocf::internal:ip.sh):  Started ha-d1.dev.com
     outif      (ocf::internal:ip.sh):  Started ha-d2.dev.com
     dmz1       (ocf::internal:ip.sh):  Stopped
 Master/Slave Set: DRBDMaster [DRBDSlave]
     Masters: [ ha-d1.dev.com ]
     Slaves:  [ ha-d2.dev.com ]
 Resource Group: filesystem
     DRBDFS     (ocf::heartbeat:Filesystem):  Stopped
 Resource Group: application
     service_failover (ocf::internal:service_failover):  Stopped

Failcounts for dmz1
 ha-d1.dev.com: 4
 ha-d2.dev.com: 4

Is there any way to automatically recover from this scenario, other than setting an obnoxiously high migration-threshold?

--
Sam Gardner
Software Engineer
Trustwave | SMART SECURITY ON DEMAND
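Until the automatic-recovery question is answered, one manual way out of this stuck state is to clear dmz1's failcounts once the cables are plugged back in. The commands below are a sketch of that workaround, not the automatic recovery being asked about; the node names are the ones from the status output above.

```shell
# After restoring connectivity, clear dmz1's failcounts so Pacemaker
# will attempt to start the resource again despite migration-threshold.
pcs resource cleanup dmz1

# Roughly equivalent lower-level form, clearing the count per node:
crm_failcount -D -r dmz1 -N ha-d1.dev.com
crm_failcount -D -r dmz1 -N ha-d2.dev.com
```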