Hello,

>> 1. If a resource fails, node should reboot (through fencing mechanism)
>> and resources should re-start on the node.
>
> Why would you want that? This would increase the service downtime
> considerably. Why is a local restart not possible ... and even if there
> is a good reason for a reboot, why not start the resource on the other
> node?

- In our system, there are some primitive and clone resources along with 3
  different master-slave resources.
- All the masters and slaves of these resources are co-located, i.e. all 3
  masters run on one node and all 3 slaves on the other node.
- These 3 master-slave resources are tightly coupled. There is a requirement
  that a failure of even one of these resources restarts all the resources
  in the group.
- All these resources can be shifted to the other node, but they should then
  also be restarted, because a lot of data/control plane syncing is done
  between the two nodes. E.g. if one of the resources running on node1 as a
  master fails, then all 3 resources are shifted to the other node, node2
  (with the corresponding slave resources being promoted to master). On
  node1, these resources should get re-started as slaves.

We understand that a node restart will increase the downtime, but since we
could not find much on the option of a group restart of master-slave
resources, we are trying the node restart option.

Thanks and regards
Neha Chatrath
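One direction for the "group restart" requirement described above, short of
rebooting the whole node, could be to tie the three master/slave resources
directly to each other with colocation and ordering constraints, so that they
are promoted on the same node and a restart of one cascades to the others.
This is only a sketch against the ms_myapp* resources from the configuration
quoted further down in the thread; the constraint IDs are made up, and
mandatory ordering only cascades in one direction, so it would need testing
against the exact failure cases:

# keep all three master roles on the same node (hypothetical constraint IDs)
colocation masters_12 inf: ms_myapp1:Master ms_myapp2:Master
colocation masters_23 inf: ms_myapp2:Master ms_myapp3:Master
# mandatory ordering: restarting an earlier resource also restarts the later ones
order restart_12 inf: ms_myapp1:start ms_myapp2:start
order restart_23 inf: ms_myapp2:start ms_myapp3:start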
---------- Forwarded message ----------
From: Andreas Kurz <andr...@hastexo.com>
Date: Tue, Oct 18, 2011 at 1:55 PM
Subject: Re: [Pacemaker] Problem in Stonith configuration
To: pacemaker@oss.clusterlabs.org

Hello,

On 10/18/2011 09:00 AM, neha chatrath wrote:
> Hello,
>
> Minor updates in the first requirement.
> 1. If a resource fails, node should reboot (through fencing mechanism)
> and resources should re-start on the node.

Why would you want that? This would increase the service downtime
considerably. Why is a local restart not possible ... and even if there is
a good reason for a reboot, why not start the resource on the other node?

> 2. If the physical link between the nodes in a cluster fails then that
> node should be isolated (kind of a power down) and the resources should
> continue to run on the other nodes

That is how stonith works, yes.

crm ra list stonith

... gives you a list of all available stonith plugins.

crm ra info stonith:xxxx

... gives you the details for a specific plugin.

Using external/ipmi is often a good choice, because a lot of servers already
have a BMC with IPMI on board or are shipped with a management card
supporting IPMI.

Regards,
Andreas
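To make the external/ipmi suggestion a bit more concrete, a per-node fencing
resource could look roughly like the sketch below. The BMC address, user and
password are placeholders, and the exact parameter names should be verified
with "crm ra info stonith:external/ipmi" on the installed cluster-glue
version:

primitive fence_mcg2 stonith:external/ipmi \
        params hostname="mcg2" ipaddr="192.168.1.250" userid="admin" \
               passwd="secret" interface="lan" \
        op monitor interval="60s"
# never run the fencing resource on the node it is supposed to kill
location l_fence_mcg2 fence_mcg2 -inf: mcg2

A mirrored fence_mcg1 resource, banned from mcg1 in the same way, would be
needed so that each node is able to fence the other.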
On Tue, Oct 18, 2011 at 12:30 PM, neha chatrath <nehachatr...@gmail.com> wrote:
> Hello,
>
> Minor updates in the first requirement.
> 1. If a resource fails, node should reboot (through fencing mechanism) and
> resources should re-start on the node.
>
> 2. If the physical link between the nodes in a cluster fails then that node
> should be isolated (kind of a power down) and the resources should continue
> to run on the other nodes
>
> Apologies for the inconvenience.
>
> Thanks and regards
> Neha Chatrath
>
> On Tue, Oct 18, 2011 at 12:08 PM, neha chatrath <nehachatr...@gmail.com> wrote:
>
>> Hello Andreas,
>>
>> Thanks for the reply.
>>
>> So can you please suggest which stonith plugin I should use for the
>> production release of my software? I have the following system requirements:
>> 1. If a node in the cluster fails, it should be rebooted and resources
>> should re-start on the node.
>> 2. If the physical link between the nodes in a cluster fails then that
>> node should be isolated (kind of a power down) and the resources should
>> continue to run on the other nodes.
>>
>> I have different types of resources, e.g. primitive, master-slave and
>> clone, running on my system.
>>
>> Thanks and regards
>> Neha Chatrath
>> Date: Mon, 17 Oct 2011 15:08:16 +0200
>> From: Andreas Kurz <andr...@hastexo.com>
>> To: pacemaker@oss.clusterlabs.org
>> Subject: Re: [Pacemaker] Problem in Stonith configuration
>> Message-ID: <4e9c28c0.8070...@hastexo.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hello,
>>
>> On 10/17/2011 12:34 PM, neha chatrath wrote:
>> > Hello,
>> > I am configuring a 2 node cluster with the following configuration:
>> >
>> > [root@MCG1 init.d]# crm configure show
>> >
>> > node $id="16738ea4-adae-483f-9d79-b0ecce8050f4" mcg2 \
>> >     attributes standby="off"
>> > node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
>> >     attributes standby="off"
>> > primitive ClusterIP ocf:heartbeat:IPaddr \
>> >     params ip="192.168.1.204" cidr_netmask="255.255.255.0" nic="eth0:1" \
>> >     op monitor interval="40s" timeout="20s" \
>> >     meta target-role="Started"
>> > primitive app1_fencing stonith:suicide \
>> >     op monitor interval="90" \
>> >     meta target-role="Started"
>> > primitive myapp1 ocf:heartbeat:Redundancy \
>> >     op monitor interval="60s" role="Master" timeout="30s" on-fail="standby" \
>> >     op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
>> > primitive myapp2 ocf:mcg:Redundancy_myapp2 \
>> >     op monitor interval="60" role="Master" timeout="30" on-fail="standby" \
>> >     op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>> > primitive myapp3 ocf:mcg:red_app3 \
>> >     op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
>> >     op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>> > ms ms_myapp1 myapp1 \
>> >     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> > ms ms_myapp2 myapp2 \
>> >     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>> > ms ms_myapp3 myapp3 \
>> >     meta master-max="1" master-max-node="1" clone-max="2" clone-node-max="1" notify="true"
>> > colocation myapp1_col inf: ClusterIP ms_myapp1:Master
>> > colocation myapp2_col inf: ClusterIP ms_myapp2:Master
>> > colocation myapp3_col inf: ClusterIP ms_myapp3:Master
>> > order myapp1_order inf: ms_myapp1:promote ClusterIP:start
>> > order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
>> > order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
>> > property $id="cib-bootstrap-options" \
>> >     dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
>> >     cluster-infrastructure="Heartbeat" \
>> >     stonith-enabled="true" \
>> >     no-quorum-policy="ignore"
>> > rsc_defaults $id="rsc-options" \
>> >     resource-stickiness="100" \
>> >     migration-threshold="3"
>> >
>> > I start the Heartbeat daemon on only one of the nodes, e.g. mcg1. But none
>> > of the resources (myapp, myapp1 etc.) get started even on this node.
>> > Following is the output of the "crm_mon -f" command:
>> >
>> > Last updated: Mon Oct 17 10:19:22 2011
>> > Stack: Heartbeat
>> > Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd) - partition with quorum
>> > Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
>> > 2 Nodes configured, unknown expected votes
>> > 5 Resources configured.
>> > ============
>> > Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
>>
>> The cluster is waiting for a successful fencing event before starting
>> all resources ... the only way to be sure the second node runs no
>> resources.
>>
>> Since you are using the suicide plugin, this will never happen if Heartbeat
>> is not started on that node. If this is only a _test_ setup, go with the ssh
>> or even the null stonith plugin ... never use them on production systems!
>>
>> Regards,
>> Andreas
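For a throw-away test setup, the suicide resource from the configuration
below could be replaced with something along these lines (external/ssh needs
working passwordless ssh between the nodes; stonith:null fakes fencing
entirely; as stated above, neither belongs on a production system):

primitive test_fencing stonith:external/ssh \
        params hostlist="mcg1 mcg2" \
        op monitor interval="90s"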
>> On Mon, Oct 17, 2011 at 4:04 PM, neha chatrath <nehachatr...@gmail.com> wrote:
>>
>>> Hello,
>>> I am configuring a 2 node cluster with the following configuration:
>>>
>>> [root@MCG1 init.d]# crm configure show
>>>
>>> node $id="16738ea4-adae-483f-9d79-b0ecce8050f4" mcg2 \
>>>     attributes standby="off"
>>> node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
>>>     attributes standby="off"
>>> primitive ClusterIP ocf:heartbeat:IPaddr \
>>>     params ip="192.168.1.204" cidr_netmask="255.255.255.0" nic="eth0:1" \
>>>     op monitor interval="40s" timeout="20s" \
>>>     meta target-role="Started"
>>> primitive app1_fencing stonith:suicide \
>>>     op monitor interval="90" \
>>>     meta target-role="Started"
>>> primitive myapp1 ocf:heartbeat:Redundancy \
>>>     op monitor interval="60s" role="Master" timeout="30s" on-fail="standby" \
>>>     op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
>>> primitive myapp2 ocf:mcg:Redundancy_myapp2 \
>>>     op monitor interval="60" role="Master" timeout="30" on-fail="standby" \
>>>     op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>>> primitive myapp3 ocf:mcg:red_app3 \
>>>     op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
>>>     op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>>> ms ms_myapp1 myapp1 \
>>>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>> ms ms_myapp2 myapp2 \
>>>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>> ms ms_myapp3 myapp3 \
>>>     meta master-max="1" master-max-node="1" clone-max="2" clone-node-max="1" notify="true"
>>> colocation myapp1_col inf: ClusterIP ms_myapp1:Master
>>> colocation myapp2_col inf: ClusterIP ms_myapp2:Master
>>> colocation myapp3_col inf: ClusterIP ms_myapp3:Master
>>> order myapp1_order inf: ms_myapp1:promote ClusterIP:start
>>> order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
>>> order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
>>> property $id="cib-bootstrap-options" \
>>>     dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
>>>     cluster-infrastructure="Heartbeat" \
>>>     stonith-enabled="true" \
>>>     no-quorum-policy="ignore"
>>> rsc_defaults $id="rsc-options" \
>>>     resource-stickiness="100" \
>>>     migration-threshold="3"
>>>
>>> I start the Heartbeat daemon on only one of the nodes, e.g. mcg1. But none
>>> of the resources (myapp, myapp1 etc.) get started even on this node.
>>> Following is the output of the "crm_mon -f" command:
>>>
>>> Last updated: Mon Oct 17 10:19:22 2011
>>> Stack: Heartbeat
>>> Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd) - partition with quorum
>>> Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
>>> 2 Nodes configured, unknown expected votes
>>> 5 Resources configured.
>>> ============
>>> Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
>>> Online: [ mcg1 ]
>>>
>>> app1_fencing (stonith:suicide): Started mcg1
>>>
>>> Migration summary:
>>> * Node mcg1:
>>>
>>> When I set "stonith-enabled" to false, then all my resources come up.
>>>
>>> Can somebody help me with STONITH configuration?
>>>
>>> Cheers
>>> Neha Chatrath
>>> KEEP SMILING!!!!
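One detail in the configuration above that the thread does not touch on:
ms_myapp3 uses master-max-node="1", while the other two master/slave
resources use the actual meta attribute name, master-node-max. If that is
not just a transcription slip, the corrected definition would probably read:

ms ms_myapp3 myapp3 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" \
        notify="true"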
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker