[Pacemaker] Promote of one resource leads to restart of another resource in Heartbeat cluster
Hello, I have the following 2-node cluster configuration:

node $id=15f8a22d-9b1a-4ce3-bca2-05f654a9ed6a cps2 \
        attributes standby=off
node $id=d3088454-5ff3-4bcd-b94c-5a2567e2759b cps1 \
        attributes standby=off
primitive CPS ocf:heartbeat:jboss_cps \
        params jboss_home=/home/cluster/cps/jboss-5.1.0.GA/ java_home=/usr/ \
               run_opts="-c all -b 0.0.0.0 -g clusterCPS -Djboss.service.binding.set=ports-01 -Djboss.messaging.ServerPeerID=01" \
               statusurl=http://127.0.0.1:8180 shutdown_opts="-s 127.0.0.1:1199" pstring=clusterCPS \
        op start interval=0 timeout=150 \
        op stop interval=0 timeout=240 \
        op monitor interval=30s timeout=40s
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=192.168.114.150 cidr_netmask=32 nic=bond0:114:1 \
        op monitor interval=40 timeout=20 \
        meta target-role=Started
primitive EMS ocf:heartbeat:jboss \
        params jboss_home=/home/cluster/cps/Jboss_EMS/jboss-5.1.0.GA java_home=/usr/ \
               run_opts="-c all -b 0.0.0.0 -g clusterEMS" pstring=clusterEMS \
        op start interval=0 timeout=60 \
        op stop interval=0 timeout=240 \
        op monitor interval=30s timeout=40s
primitive LB ocf:ptt:lb_ptt \
        op monitor interval=40
primitive NDB_MGMT ocf:ptt:NDB_MGM_RA \
        op monitor interval=120 timeout=120
primitive NDB_VIP ocf:heartbeat:IPaddr2 \
        params ip=192.168.117.150 cidr_netmask=255.255.255.255 nic=bond0.117:4 \
        op monitor interval=30 timeout=25
primitive Rmgr ocf:ptt:RM_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart \
        op start interval=0 role=Master timeout=30 \
        op start interval=0 role=Slave timeout=35
primitive mysql ocf:ptt:MYSQLD_RA \
        op monitor interval=180 timeout=200 \
        op start interval=0 timeout=40
primitive ndbd ocf:ptt:NDBD_RA \
        op monitor interval=120 timeout=120
ms CPS_CLONE CPS \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 interleave=true notify=true
ms ms_Rmgr Rmgr \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 interleave=true notify=true target-role=Started
ms ms_mysqld mysql \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 interleave=true notify=true
clone EMS_CLONE EMS \
        meta globally-unique=false clone-max=2 clone-node-max=1
clone LB_CLONE LB \
        meta globally-unique=false clone-max=2 clone-node-max=1 target-role=Started
clone ndbdclone ndbd \
        meta globally-unique=false clone-max=2 clone-node-max=1
colocation RM_with_ip inf: ms_Rmgr:Master ClusterIP
colocation ndb_vip-with-ndb_mgm inf: NDB_MGMT NDB_VIP
order RM-after-ip inf: ClusterIP ms_Rmgr
order cps-after-mysqld inf: ms_mysqld CPS_CLONE
order ip-after-mysqld inf: ms_mysqld ClusterIP
order lb-after-cps inf: CPS_CLONE LB_CLONE
order mysqld-after-ndbd inf: ndbdclone ms_mysqld
order ndb_mgm-after-ndb_vip inf: NDB_VIP NDB_MGMT
order ndbd-after-ndb_mgm inf: NDB_MGMT ndbdclone
property $id=cib-bootstrap-options \
        dc-version=1.0.11-9af47ddebcad19e35a61b2a20301dc038018e8e8 \
        cluster-infrastructure=Heartbeat \
        no-quorum-policy=ignore \
        stonith-enabled=false
rsc_defaults $id=rsc-options \
        resource-stickiness=100 \
        migration-threshold=3

When I bring down the active node in the cluster, the ms_mysqld resource on the standby node is promoted, but another resource (ms_Rmgr) gets restarted. Following are excerpts from the logs:

Mar 19 18:09:58 CPS2 lrmd: [27576]: info: operation monitor[13] on NDB_VIP for client 27579: pid 29532 exited with return code 0
Mar 19 18:10:06 CPS2 heartbeat: [27565]: WARN: node cps1: is dead
Mar 19 18:10:06 CPS2 heartbeat: [27565]: info: Link cps1:bond0.115 dead.
Mar 19 18:10:06 CPS2 ccm: [27574]: debug: recv msg status from cps1, status:dead
Mar 19 18:10:06 CPS2 ccm: [27574]: debug: status of node cps1: active - dead
Mar 19 18:10:06 CPS2 ccm: [27574]: debug: recv msg CCM_TYPE_LEAVE from cps1, status:[null ptr]
Mar 19 18:10:06 CPS2 ccm: [27574]: debug: quorum plugin: majority
Mar 19 18:10:06 CPS2 crmd: [27579]: notice: crmd_ha_status_callback: Status update: Node cps1 now has status [dead] (DC=true)
Mar 19 18:10:06 CPS2 ccm: [27574]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
Mar 19 18:10:06 CPS2 crmd: [27579]: info: crm_update_peer_proc: cps1.ais is now offline
...
..
Mar 19 18:10:07 CPS2 pengine: [27584]: notice: LogActions: Start ClusterIP (cps2)
Mar 19 18:10:07 CPS2 pengine: [27584]: notice: LogActions: Leave resource NDB_VIP (Started cps2)
Mar 19 18:10:07 CPS2 pengine: [27584]: notice: LogActions: Leave resource NDB_MGMT (Started cps2)
Mar 19 18:10:07 CPS2 pengine: [27584]: notice: LogActions: Leave resource ndbd:0 (Stopped)
Mar 19 18:10:07 CPS2 pengine: [27584]: notice: LogActions: Leave resource ndbd:1 (Started cps2)
Mar 19 18:10:07 CPS2 pengine: [27584]: notice:
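For reference, a sketch of the change most often suggested for this kind of symptom, under the assumption (not confirmed for this exact report) that the restarts come from ordering constraints spanning non-interleaved clones: with interleave=false (the default), a clone instance depends on the whole peer set, so losing the instance on the failed node can force a stop/start of the copies on the survivor. The ms sets above already carry interleave=true, but the plain clones do not:

    # crm shell sketch: make the plain clones interleaved as well
    clone EMS_CLONE EMS \
            meta globally-unique=false clone-max=2 clone-node-max=1 interleave=true
    clone LB_CLONE LB \
            meta globally-unique=false clone-max=2 clone-node-max=1 interleave=true target-role=Started
    clone ndbdclone ndbd \
            meta globally-unique=false clone-max=2 clone-node-max=1 interleave=true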
Re: [Pacemaker] How to run heartbeat and pacemaker resources as a non-root user
Hello,

Thanks for the reply. I have been successfully using Heartbeat as the root user, but I have a system requirement for which I need to run my different custom applications (configured using crm) as a non-root user. Can this be done?

Regards
Neha Chatrath

Date: Mon, 20 Feb 2012 22:05:30 +1100
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] How to run heartbeat and pacemaker resources as a non-root user
Message-ID: caedlwg2ok25f4jrg8y0kwsgc6n35_bzzdy6np+egk0tutjg...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1

On Mon, Feb 20, 2012 at 2:39 PM, neha chatrath nehachatr...@gmail.com wrote:
> Hello, I need to run heartbeat and pacemaker resources as non-root users. When I try to run heartbeat as the hacluster user,

That probably won't work. We already try to drop as much privilege as we can, but some processes need to be root or they can't do anything at all - like add an IP address to a machine.

> it fails to run with the following error:
>
> Starting High-Availability services:
> chmod: changing permissions of `/var/run/heartbeat/rsctmp': Operation not permitted
> Done.
> touch: cannot touch `/var/lock/subsys/heartbeat': Permission denied
>
> I have tried changing ownership and permissions for the above directories and files, but still the same result. Can somebody help me with this?
>
> Thanks and regards
> Neha Chatrath
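Heartbeat's own daemons have to start as root, but an application managed by the cluster can still run under an unprivileged account if its resource agent starts it that way. A minimal sketch using ocf:heartbeat:anything, whose "user" parameter runs the managed binary as the given account (the binary path and user name here are hypothetical placeholders):

    # crm shell; /opt/myapp/bin/myapp and appuser are placeholders
    primitive myapp ocf:heartbeat:anything \
            params binfile=/opt/myapp/bin/myapp user=appuser \
            op monitor interval=30 timeout=20

A custom OCF agent can achieve the same effect by wrapping its start command in su (e.g. su appuser -c "...").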
[Pacemaker] How to run heartbeat and pacemaker resources as a non-root user
Hello,

I need to run heartbeat and pacemaker resources as non-root users. When I try to run heartbeat as the hacluster user, it fails with the following error:

Starting High-Availability services:
chmod: changing permissions of `/var/run/heartbeat/rsctmp': Operation not permitted
Done.
touch: cannot touch `/var/lock/subsys/heartbeat': Permission denied

I have tried changing ownership and permissions for the above directories and files, but still the same result. Can somebody help me with this?

Thanks and regards
Neha Chatrath
[Pacemaker] Stopping heartbeat service on one node leads to restart of resources on other node in cluster
pengine: [20534]: debug: native_assign_node: Could not allocate a node for Tmgr:1
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: clone_rsc_order_lh: Pairing Tmgr:0 with Rmgr:0
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: clone_rsc_order_lh: Pairing Tmgr:1 with Rmgr:1
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: clone_rsc_order_lh: Pairing Tmgr:0 with pimd:0
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: find_compatible_child: Can't pair pimd:1 with ms_Tmgr
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: clone_rsc_order_lh: No match found for pimd:1 (0)
Feb 07 11:06:31 MCG1 pengine: [20534]: info: clone_rsc_order_lh: Inhibiting pimd:1 from being active
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: native_assign_node: Could not allocate a node for pimd:1
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: clone_rsc_order_lh: Pairing pimd:0 with Tmgr:0
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: clone_rsc_order_lh: Pairing pimd:1 with Tmgr:1
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: clone_rsc_order_lh: Pairing Rmgr:0 with mysql:0
Feb 07 11:06:31 MCG1 pengine: [20534]: debug: clone_rsc_order_lh: Pairing Rmgr:1 with mysql:1
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Restart resource Rmgr:0 (Master mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Stop resource Rmgr:1 (mcg2)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Restart resource Tmgr:0 (Master mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Stop resource Tmgr:1 (mcg2)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Restart resource pimd:0 (Master mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Stop resource pimd:1 (mcg2)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Restart resource ClusterIP (Started mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Leave resource EMS:0 (Started mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Stop resource EMS:1 (mcg2)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Leave resource NDB_VIP (Started mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Leave resource NDB_MGMT (Started mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Restart resource mysql:0 (Started mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Stop resource mysql:1 (mcg2)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Leave resource ndbd:0 (Started mcg1)
Feb 07 11:06:31 MCG1 pengine: [20534]: notice: LogActions: Stop resource ndbd:1 (mcg2)

Thanks in advance.

Regards
Neha Chatrath
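When the policy engine schedules unexpected restarts like the ones above, it can help to replay the transition and inspect the allocation scores. A sketch using ptest from the Pacemaker 1.0 series (the pe-input file name is a placeholder; the real files are saved by the policy engine, typically under /var/lib/pengine):

    ptest -L -s -VV               # use the live CIB and print allocation scores
    ptest -x pe-input-42.bz2 -s   # or replay a saved policy-engine input file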
[Pacemaker] Invoking crm node standby command on active node leads to stopping of resources on both active and standby nodes
Hello,

I am using a cluster with the following configuration:

[root@MCG1 neha]# crm configure show
node $id=0686a4d1-c9de-4334-8d33-1a9f6f0755dd ggns2mexsatsdp22
node $id=76246d46-f0e4-4ba8-9179-d60aa7c697c8 ggns2mexsatsdp23
node $id=9d59c9e6-24e0-4684-94ab-c07af7e7a2f0 mcg1 \
        attributes standby=off
node $id=fb3f06f0-05bf-42ef-a312-c072f589918a mcg2 \
        attributes standby=off
primitive ClusterIP ocf:mcg:MCG_VIPaddr_RA \
        params ip=192.168.113.77 cidr_netmask=255.255.255.0 nic=eth0:1 \
        op monitor interval=40 timeout=20
primitive RM ocf:mcg:RM_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
primitive Tmgr ocf:mcg:TM_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
primitive pimd ocf:mcg:PIMD_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=standby \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
ms ms_RM RM \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ms_Tmgr Tmgr \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
ms ms_pimd pimd \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
colocation ip_with_RM inf: ClusterIP ms_RM:Master
colocation ip_with_Tmgr inf: ClusterIP ms_Tmgr:Master
colocation ip_with_pimd inf: ClusterIP ms_pimd:Master
order TM-after-RM inf: ms_RM:promote ms_Tmgr:start
order ip-after-pimd inf: ms_pimd:promote ClusterIP:start
order pimd-after-TM inf: ms_Tmgr:promote ms_pimd:start
property $id=cib-bootstrap-options \
        dc-version=1.0.11-55a5f5be61c367cbd676c2f0ec4f1c62b38223d7 \
        cluster-infrastructure=Heartbeat \
        no-quorum-policy=ignore \
        stonith-enabled=false
rsc_defaults $id=rsc-options \
        resource-stickiness=100 \
        migration-threshold=3

When I execute the crm node standby command on the active node, it leads to stopping of resources on both the active and the standby node. As per my understanding, this should stop resources only on the current active node, and all the master/slave resources on the standby node should be promoted. Please comment.

Thanks and regards
Neha
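For comparison, the expected sequence when standing a node down and bringing it back, using the node names from this configuration; crm_mon -1 gives a one-shot status dump between the steps:

    crm node standby mcg1   # resources should stop on mcg1 and be promoted on mcg2
    crm_mon -1              # verify: masters on mcg2, mcg1 shown as standby
    crm node online mcg1    # mcg1 should rejoin and start the slave instances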
[Pacemaker] Stopping heartbeat on active node leads to restart of resources on standby node
Hello,

I have a 2-node cluster with the following configuration:

crm configure show
node $id=16738ea4-adae-483f-9d79-b0ecce8050f4 mcg2
primitive ClusterIP ocf:mcg:MCG_VIPaddr_RA \
        params ip=192.168.113.67 cidr_netmask=255.255.255.0 nic=eth0:1 \
        op monitor interval=40 timeout=20
primitive Rmgr ocf:mcg:RM_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
primitive Tmgr ocf:mcg:TM_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
primitive pimd ocf:mcg:PIMD_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
ms ms_Rmgr Rmgr \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_Tmgr Tmgr \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_pimd pimd \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Stopped
colocation ip_with_Rmgr inf: ClusterIP ms_Rmgr:Master
colocation ip_with_Tmgr inf: ClusterIP ms_Tmgr:Master
colocation ip_with_pimd inf: ClusterIP ms_pimd:Master
order TM-after-RM inf: ms_Rmgr:promote ms_Tmgr:start
order ip-after-pimd inf: ms_pimd:promote ClusterIP:start
order pimd-after-TM inf: ms_Tmgr:promote ms_pimd:start
property $id=cib-bootstrap-options \
        dc-version=1.0.11-9af47ddebcad19e35a61b2a20301dc038018e8e8 \
        cluster-infrastructure=Heartbeat \
        no-quorum-policy=ignore \
        stonith-enabled=false
rsc_defaults $id=rsc-options \
        migration-threshold=3 \
        resource-stickiness=100

With both the active and standby nodes up and running, if I stop Heartbeat on the active node, all the resources on the standby node first receive a stop and then a start from Pacemaker. As per the idea behind clustering, all the master/slave resources on the standby node should simply receive a promote. Can somebody comment on this behavior?

Thanks and regards
Neha Chatrath
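As in the earlier thread with the same symptom, one commonly suggested change for stop/start cycles on the surviving node is interleave=true on the master/slave sets, so the ordering constraints are resolved per node rather than across the whole set. A sketch only, not a confirmed fix for this report:

    ms ms_Rmgr Rmgr \
            meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true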
[Pacemaker] How to serialize/control resource startup on Standby node
Hello,

I have a cluster with 2 nodes and multiple master/slave resources. The ordering of resources on the master node is achieved using the order option of crm. When the standby node is started, the processes are started one after another. Following is the configuration info:

primitive ClusterIP ocf:mcg:MCG_VIPaddr_RA \
        params ip=192.168.113.67 cidr_netmask=255.255.255.0 nic=eth0:1 \
        op monitor interval=40 timeout=20
primitive Rmgr ocf:mcg:RM_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
primitive Tmgr ocf:mcg:TM_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
primitive pimd ocf:mcg:PIMD_RA \
        op monitor interval=60 role=Master timeout=30 on-fail=restart \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
ms ms_Rmgr Rmgr \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_Tmgr Tmgr \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_pimd pimd \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation ip_with_Rmgr inf: ClusterIP ms_Rmgr:Master
colocation ip_with_Tmgr inf: ClusterIP ms_Tmgr:Master
colocation ip_with_pimd inf: ClusterIP ms_pimd:Master
order TM-after-RM inf: ms_Rmgr:promote ms_Tmgr:start
order ip-after-pimd inf: ms_pimd:promote ClusterIP:start
order pimd-after-TM inf: ms_Tmgr:promote ms_pimd:start
property $id=cib-bootstrap-options \
        dc-version=1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1 \
        cluster-infrastructure=Heartbeat \
        no-quorum-policy=ignore \
        stonith-enabled=false
rsc_defaults $id=rsc-options \
        migration-threshold=3 \
        resource-stickiness=100

I have a system requirement in which the start of one resource (e.g. pimd) depends on the successful start of another resource (e.g. Tmgr). Everything runs smoothly on the master node. This is due to the ordering and the few seconds of delay until a resource is promoted to Master. But on the standby node, since the resources are started one after another without any delay, the standby node behaves erratically. Is there a way I can serialize/control resource startup on the standby node?

Thanks and regards
Neha Chatrath
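One observation on the configuration above: the order constraints only relate promote to start (ms_X:promote before ms_Y:start), so on the standby node, where nothing is promoted, the slave starts are left unordered. A sketch of additional start-to-start constraints that would serialize startup on the standby as well (they apply on the master side too, which is assumed to be acceptable here):

    order TM-start-after-RM-start inf: ms_Rmgr:start ms_Tmgr:start
    order pimd-start-after-TM-start inf: ms_Tmgr:start ms_pimd:start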
Re: [Pacemaker] Regarding Stonith RAs
Hello Andreas,

> Pacemaker is not built with Heartbeat support on RHEL-6 and its derivatives.

How do I check this, and what steps do I need to take to resolve this issue?

Thanks and regards
Neha Chatrath

On Thu, Nov 24, 2011 at 5:38 PM, neha chatrath nehachatr...@gmail.com wrote:

Hello,
I could get the list of Stonith RAs by installing the cman, clvm, ricci, pacemaker and rgmanager RPMs provided by the CentOS 6 distribution. But unfortunately, after installing these packages, not all the processes related to Pacemaker come up when the Heartbeat daemon is started. When I start the Heartbeat daemon, only the following processes are started:

> Pacemaker is not built with Heartbeat support on RHEL-6 and its derivatives.

[root@p init.d]# ps -eaf | grep heartbeat
root      3522     1  0 17:26 ?        00:00:00 heartbeat: master control process
root      3525  3522  0 17:26 ?        00:00:00 heartbeat: FIFO reader
root      3526  3522  0 17:26 ?        00:00:00 heartbeat: write: bcast eth1
root      3527  3522  0 17:26 ?        00:00:00 heartbeat: read: bcast eth1
root      3538  3381  0 17:26 pts/3    00:00:00 grep heartbeat

In the log messages, the following error logs are observed:

Nov 24 17:26:19 p heartbeat: [3522]: debug: Signing on API client 3539 (ccm)
Nov 24 17:26:19 p ccm: [3539]: info: Hostname: p
Nov 24 17:26:19 p attrd: [3543]: info: Invoked: /usr/lib/heartbeat/attrd
Nov 24 17:26:19 p stonith-ng: [3542]: info: Invoked: /usr/lib/heartbeat/stonithd
Nov 24 17:26:19 p cib: [3540]: info: Invoked: /usr/lib/heartbeat/cib
Nov 24 17:26:19 p lrmd: [3541]: ERROR: socket_wait_conn_new: trying to create in /var/run/heartbeat/lrm_cmd_sock bind:: No such file or directory
Nov 24 17:26:19 p lrmd: [3541]: ERROR: main: can not create wait connection for command.
Nov 24 17:26:19 p lrmd: [3541]: ERROR: Startup aborted (can't create comm channel). Shutting down.
Nov 24 17:26:19 p heartbeat: [3522]: WARN: Managed /usr/lib/heartbeat/lrmd -r process 3541 exited with return code 100.
Nov 24 17:26:19 p heartbeat: [3522]: ERROR: Client /usr/lib/heartbeat/lrmd -r exited with return code 100.
Nov 24 17:26:19 p attrd: [3543]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Nov 24 17:26:19 p attrd: [3543]: info: main: Starting up
Nov 24 17:26:19 p stonith-ng: [3542]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
Nov 24 17:26:19 p cib: [3540]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Nov 24 17:26:19 p attrd: [3543]: CRIT: get_cluster_type: This installation of Pacemaker does not support the '(null)' cluster infrastructure. Terminating.
Nov 24 17:26:19 p stonith-ng: [3542]: CRIT: get_cluster_type: This installation of Pacemaker does not support the '(null)' cluster infrastructure. Terminating.
Nov 24 17:26:19 p heartbeat: [3522]: WARN: Managed /usr/lib/heartbeat/attrd process 3543 exited with return code 100.
Nov 24 17:26:19 p heartbeat: [3522]: ERROR: Client /usr/lib/heartbeat/attrd exited with return code 100.
Nov 24 17:26:19 p heartbeat: [3522]: info: the send queue length from heartbeat to client ccm is set to 1024
Nov 24 17:26:19 p heartbeat: [3522]: WARN: Managed /usr/lib/heartbeat/stonithd process 3542 exited with return code 100.
Nov 24 17:26:19 p heartbeat: [3522]: ERROR: Client /usr/lib/heartbeat/stonithd exited with return code 100.
Nov 24 17:26:19 p cib: [3540]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Nov 24 17:26:19 p cib: [3540]: debug: log_data_element: readCibXmlFile: [on-disk] cib epoch=0 num_updates=0 admin_epoch=0 validate-with=pacemaker-1.2 cib-last-written=Mon Nov 21 11:09:22 2011
...
Nov 24 17:26:19 p crmd: [3544]: info: crmd_init: Starting crmd
Nov 24 17:26:19 p crmd: [3544]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
Nov 24 17:26:19 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_LOG
Nov 24 17:26:19 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_STARTUP
Nov 24 17:26:19 p crmd: [3544]: debug: do_startup: Registering Signal Handlers
Nov 24 17:26:19 p crmd: [3544]: debug: do_startup: Creating CIB and LRM objects
Nov 24 17:26:19 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_CIB_START
Nov 24 17:26:19 p crmd: [3544]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
Nov 24 17:26:19 p crmd: [3544]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
Nov 24 17:26:19 p crmd: [3544]: debug: cib_native_signon_raw: Connection to command channel failed
Nov 24 17:26:19 p crmd: [3544]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
Nov 24 17:26:19 p crmd: [3544]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var
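The "does not support the '(null)' cluster infrastructure" errors above are consistent with the remark that the RHEL-6/CentOS-6 Pacemaker packages are built without Heartbeat support. A hedged sketch of rebuilding Pacemaker 1.0/1.1 from source with the Heartbeat stack enabled (exact configure flags can vary between releases; treat this as an assumption to verify against ./configure --help):

    ./autogen.sh
    ./configure --with-heartbeat --prefix=/usr --sysconfdir=/etc --localstatedir=/var
    make && sudo make install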
Re: [Pacemaker] Regarding Stonith RAs
cib: [3540]: debug: write_cib_contents: Writing CIB to disk
...
Nov 24 17:27:47 p crmd: [3544]: WARN: do_cib_control: Couldn't complete CIB registration 30 times... pause and retry
Nov 24 17:27:47 p crmd: [3544]: ERROR: do_cib_control: Could not complete CIB registration 30 times... hard error
Nov 24 17:27:47 p crmd: [3544]: debug: s_crmd_fsa: Processing I_ERROR: [ state=S_STARTING cause=C_FSA_INTERNAL origin=do_cib_control ]
Nov 24 17:27:47 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_ERROR
Nov 24 17:27:47 p crmd: [3544]: ERROR: do_log: FSA: Input I_ERROR from do_cib_control() received in state S_STARTING
Nov 24 17:27:47 p crmd: [3544]: info: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_cib_control ]
Nov 24 17:27:47 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_DC_TIMER_STOP
Nov 24 17:27:47 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_INTEGRATE_TIMER_STOP
Nov 24 17:27:47 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_FINALIZE_TIMER_STOP
Nov 24 17:27:47 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_RECOVER
Nov 24 17:27:47 p crmd: [3544]: ERROR: do_recover: Action A_RECOVER (0100) not supported
Nov 24 17:27:47 p crmd: [3544]: debug: do_fsa_action: actions:trace:// A_HA_CONNECT
Nov 24 17:27:47 p crmd: [3544]: CRIT: get_cluster_type: This installation of Pacemaker does not support the '(null)' cluster infrastructure. Terminating.
Nov 24 17:27:47 p heartbeat: [3522]: WARN: Managed /usr/lib/heartbeat/crmd process 3544 exited with return code 100.
Nov 24 17:27:47 p heartbeat: [3522]: ERROR: Client /usr/lib/heartbeat/crmd exited with return code 100.

It seems to be reading configuration info from the /var/run/heartbeat directory, but the info is actually present in /usr/var/run/heartbeat. Can somebody suggest how I should correct that path? The PATH environment variable has the following value:

[root@p init.d]# echo $PATH
/usr/lib/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin

Thanks and regards
Neha Chatrath

Date: Fri, 18 Nov 2011 10:22:22 +1100
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Regarding Stonith RAs
Message-ID: CAEDLWG2QO+-puhr2qOuvXSCRUcg2gXHE=i=1d3losfn_pcs...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Nov 17, 2011 at 1:28 AM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi,

On Wed, Nov 16, 2011 at 05:49:30PM +0530, neha chatrath wrote:
[...]
> Nov 14 13:16:57 ggns2mexsatsdp17.hsc.com lrmd: [3976]: notice: on_msg_get_rsc_types: can not find this RA class stonith

The PILS plugin handling stonith resources was not found. Strange, cannot recall seeing this before.

Could be a RHEL6 based distro.

It should be in /usr/lib/heartbeat/plugins/RAExec/stonith.so (or /usr/lib64, depending on your installation). Please check permissions and whether this file is really a valid .so object file. If everything's in order, no idea what else could be the reason. You could strace lrmd on startup and see what happens between lines 1137 and 1158.

Thanks,
Dejan

On Mon, Nov 14, 2011 at 2:05 PM, neha chatrath nehachatr...@gmail.com wrote:
Hello, I am facing an issue in configuring a Stonith resource in my 2-node cluster.
Whenever I try to give the following command:

crm configure primitive app_fence stonith::external/ipmi params hostname=ggns2mexsatsdp17.hsc.com ipaddr=192.168.113.17 userid=root passwd=pass@abc123

I get the following errors:

ERROR: stonith:external/ipmi: could not parse meta-data:
Traceback (most recent call last):
  File "/usr/sbin/crm", line 41, in <module>
    crm.main.run()
  File "/usr/lib/python2.6/site-packages/crm/main.py", line 249, in run
    if parse_line(levels, shlex.split(' '.join(args))):
  File "/usr/lib/python2.6/site-packages/crm/main.py", line 145, in parse_line
    lvl.release()
  File "/usr/lib/python2.6/site-packages/crm/levels.py", line 68, in release
    self.droplevel()
  File "/usr/lib/python2.6/site-packages/crm/levels.py", line 87, in droplevel
    self.current_level.end_game(self._in_transit)
  File "/usr/lib/python2.6/site-packages/crm/ui.py", line 1524, in end_game
    self.commit(commit)
  File "/usr/lib/python2.6/site-packages/crm/ui.py", line 1425, in commit
    self._verify(mkset_obj(xml, changed), mkset_obj(xml))
  File "/usr/lib/python2.6/site-packages/crm/ui.py", line 1324, in _verify
    rc2 = set_obj_semantic.semantic_check(set_obj_all)
  File "/usr/lib/python2.6/site-packages/crm/cibconfig.py", line 280, in semantic_check
    rc = self.__check_unique_clash(set_obj_all)
  File "/usr/lib/python2.6/site-packages/crm/cibconfig.py", line 260, in __check_unique_clash
    process_primitive(node, clash_dict)
  File "/usr/lib/python2.6/site-packages/crm/cibconfig.py", line 245, in process_primitive
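Regarding the /usr/var/run vs /var/run question earlier in this message: that mismatch is typical of a source build that left localstatedir at its autoconf default (${prefix}/var). It is fixed at build time, not via $PATH; a sketch, assuming heartbeat/cluster-glue were built from source:

    # rebuild with the runtime directory pinned to /var so daemons and
    # clients agree on socket and lock file paths
    ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
    make && sudo make install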
Re: [Pacemaker] Regarding Stonith RAs
Hello,

> Looks like a broken installation. I guess that metadata for other resource classes works fine. It could be some issue with stonith-ng. Did you notice any messages from stonith-ng?

Yes, metadata for other resource classes like ocf/heartbeat and ocf/linbit is working fine. The problem is seen only with the stonith resource class. No stonith-ng related errors are visible in the log file. Following are some excerpts from the log file:

Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: apiauth stonithd uid=root
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: uid=root, gid=null
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: apiauth stonith-ng uid=root
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: uid=root, gid=null
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: apiauth attrd uid=hacluster
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: uid=hacluster, gid=null
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: apiauth crmd uid=hacluster
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: uid=hacluster, gid=null
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: apiauth pingd uid=root
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: uid=root, gid=null
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: respawn hacluster /usr/lib/heartbeat/ccm
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: info: respawn directive: hacluster /usr/lib/heartbeat/ccm
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: respawn hacluster /usr/lib/heartbeat/cib
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: info: respawn directive: hacluster /usr/lib/heartbeat/cib
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: respawn root /usr/lib/heartbeat/lrmd -r
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: info: respawn directive: root /usr/lib/heartbeat/lrmd -r
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: respawn root /usr/lib/heartbeat/stonithd
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: info: respawn directive: root /usr/lib/heartbeat/stonithd
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: respawn hacluster /usr/lib/heartbeat/attrd
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: info: respawn directive: hacluster /usr/lib/heartbeat/attrd
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: Implicit directive: respawn hacluster /usr/lib/heartbeat/crmd
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: info: respawn directive: hacluster /usr/lib/heartbeat/crmd
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: uid=hacluster, gid=null
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: uid=hacluster, gid=null
Nov 14 11:54:30 ggns2mexsatsdp17.hsc.com heartbeat: [3659]: debug: uid=root, gid=null
...
Nov 14 11:54:45 ggns2mexsatsdp17.hsc.com heartbeat: [3661]: info: Starting child client /usr/lib/heartbeat/lrmd -r (0,0)
Nov 14 11:54:45 ggns2mexsatsdp17.hsc.com heartbeat: [3661]: info: Starting child client /usr/lib/heartbeat/stonithd (0,0)
Nov 14 11:54:45 ggns2mexsatsdp17.hsc.com heartbeat: [3976]: info: Starting /usr/lib/heartbeat/lrmd -r as uid 0 gid 0 (pid 3976)
Nov 14 11:54:45 ggns2mexsatsdp17.hsc.com heartbeat: [3977]: info: Starting /usr/lib/heartbeat/stonithd as uid 0 gid 0 (pid 3977)
Nov 14 11:54:45 ggns2mexsatsdp17.hsc.com heartbeat: [3661]: info: Starting child client /usr/lib/heartbeat/attrd (495,489)
Nov 14 11:54:45 ggns2mexsatsdp17.hsc.com heartbeat: [3661]: info: Starting child client /usr/lib/heartbeat/crmd (495,489)
Nov 14 11:54:46 ggns2mexsatsdp17.hsc.com stonithd: [3977]: debug: apichan=0x9a36368
Nov 14 11:54:46 ggns2mexsatsdp17.hsc.com heartbeat: [3661]: debug: create_seq_snapshot_table: no missing packets found for node ggns2mexsatsdp17.hsc.com
Nov 14 11:54:46 ggns2mexsatsdp17.hsc.com heartbeat: [3661]: debug: Signing on API client 3975 (cib)
Nov 14 11:54:46 ggns2mexsatsdp17.hsc.com stonithd: [3977]: debug: callback_chan=0x9a36210
Nov 14 11:54:46 ggns2mexsatsdp17.hsc.com stonithd: [3977]: notice: /usr/lib/heartbeat/stonithd start up successfully.
...
Nov 14 13:16:57 ggns2mexsatsdp17.hsc.com lrmd: [3976]: debug: unregister_client: client lrmadmin [pid:10061] is unregistered
Nov 14 13:16:57 ggns2mexsatsdp17.hsc.com lrmd: [3976]: debug: on_msg_register: client lrmadmin [10062] registered
Nov 14 13:16:57 ggns2mexsatsdp17.hsc.com lrmd: [3976]: notice: on_msg_get_rsc_types: can not find this RA class stonith

Thanks and regards
Neha Chatrath
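Following Dejan's hint earlier in the thread, a few checks that can narrow down why lrmd cannot find the stonith RA class (use /usr/lib64 on 64-bit installs; lrmadmin ships with cluster-glue):

    ls -l /usr/lib/heartbeat/plugins/RAExec/stonith.so   # the PILS plugin lrmd loads
    file /usr/lib/heartbeat/plugins/RAExec/stonith.so    # should be a valid shared object
    lrmadmin -C                                          # RA classes the running lrmd registered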
Re: [Pacemaker] Query regarding crm node standby/online command
Hello,

Thanks for the reply. Let me rephrase my query regarding interface monitoring. I have, say, 3 IP interfaces: eth0, eth1 and eth2, with Heartbeat running on eth0. I can monitor my eth0 link using Heartbeat, but is there a possibility of monitoring the eth1 and eth2 interfaces as well using the Heartbeat mechanism? I need this to detect scenarios where eth0 is working fine (thus no break in cluster communication via Heartbeat) but there is some issue with either eth1 or eth2, in which case I need to raise some alarms etc.

Thanks and regards
Neha Chatrath

Message: 4
Date: Tue, 08 Nov 2011 09:45:17 +0100
From: Florian Haas flor...@hastexo.com
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Query regarding crm node standby/online command
Message-ID: 4eb8ec1d.8050...@hastexo.com
Content-Type: text/plain; charset=ISO-8859-1

On 2011-11-08 06:22, neha chatrath wrote:
> Hello,
> I am running Heartbeat and Pacemaker in a cluster with 2 nodes. I also have a client registered with the Heartbeat daemon for any node/IF status changes.

Can you give more details as to the nature of that client?

> When I execute the crm node standby command on one of the nodes, there is no node status change info reported to the client. Is this the expected behavior?

I would say yes, as putting a node in standby mode does not change its status of being a fully-fledged member of the cluster. It still participates in all cluster communications, and it receives all configuration changes and status updates. It's merely ineligible for running any resources. So from the cluster communications layer point of view (i.e. from Heartbeat's or Corosync's perspective) nothing changes.

> Also, one more query about the Heartbeat daemon: In my system, I have multiple IP interfaces (each configured with a separate IP) with Heartbeat running on one of them. I have a requirement of monitoring all these IP interfaces and performing the necessary actions (like failover) in case of any interface failure.

Well, there is no reason to do this externally. You set up fencing using an out-of-band fencing method. When cluster communications break down, you fence one node off the cluster, so resources fail over to the other.

As a word of caution, it seems like you're at least headed in the direction of reinventing the wheel, and it also seems like you are trying to implement functionality that's already present in the stack. (This is just a hunch based on the limited information given, however.) If that is the case, I would strongly suggest you take a look at Clusters From Scratch and the Linux-HA User's Guide, and possibly also Pacemaker: Configuration Explained, to better familiarize yourself with the functionality of the stack.

Hope this helps.
Cheers,
Florian

--
Cheers
Neha Chatrath
KEEP SMILING
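The stack's usual answer to monitoring additional interfaces is a cloned ocf:pacemaker:ping resource that pings gateways reachable through eth1/eth2 and records the result in a node attribute, plus a location rule that reacts when connectivity is lost. A sketch; the gateway addresses and the constrained resource name are placeholders:

    primitive p_ping ocf:pacemaker:ping \
            params host_list="192.168.1.1 192.168.2.1" multiplier=100 \
            op monitor interval=15s timeout=20s
    clone cl_ping p_ping meta globally-unique=false
    location loc_need_connectivity myresource \
            rule -inf: not_defined pingd or pingd lte 0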
[Pacemaker] Regarding Stonith RAs
Hello,

I am facing an issue in configuring a stonith resource in my 2-node cluster. Whenever I try to give the following command:

crm configure primitive app_fence stonith::external/ipmi params hostname=ggns2mexsatsdp17.hsc.com ipaddr=192.168.113.17 userid=root passwd=pass@abc123

I get the following errors:

ERROR: stonith:external/ipmi: could not parse meta-data:
Traceback (most recent call last):
  File "/usr/sbin/crm", line 41, in <module>
    crm.main.run()
  File "/usr/lib/python2.6/site-packages/crm/main.py", line 249, in run
    if parse_line(levels, shlex.split(' '.join(args))):
  File "/usr/lib/python2.6/site-packages/crm/main.py", line 145, in parse_line
    lvl.release()
  File "/usr/lib/python2.6/site-packages/crm/levels.py", line 68, in release
    self.droplevel()
  File "/usr/lib/python2.6/site-packages/crm/levels.py", line 87, in droplevel
    self.current_level.end_game(self._in_transit)
  File "/usr/lib/python2.6/site-packages/crm/ui.py", line 1524, in end_game
    self.commit(commit)
  File "/usr/lib/python2.6/site-packages/crm/ui.py", line 1425, in commit
    self._verify(mkset_obj(xml, changed), mkset_obj(xml))
  File "/usr/lib/python2.6/site-packages/crm/ui.py", line 1324, in _verify
    rc2 = set_obj_semantic.semantic_check(set_obj_all)
  File "/usr/lib/python2.6/site-packages/crm/cibconfig.py", line 280, in semantic_check
    rc = self.__check_unique_clash(set_obj_all)
  File "/usr/lib/python2.6/site-packages/crm/cibconfig.py", line 260, in __check_unique_clash
    process_primitive(node, clash_dict)
  File "/usr/lib/python2.6/site-packages/crm/cibconfig.py", line 245, in process_primitive
    if ra_params[name].get("unique") == "1":
TypeError: 'NoneType' object is unsubscriptable

From /var/log/messages, the following error is being reported by lrmd:

notice: on_msg_get_metadata: can not find the class stonith

It seems that it is not able to find any RAs related to stonith. Following is the output of some crm commands:

crm(live)ra# classes
heartbeat
lsb
ocf / heartbeat linbit mcg pacemaker
stonith

crm(live)ra# list ocf heartbeat
AoEtarget         AudibleAlarm      CTDB              ClusterMon        Delay
Dummy             EvmsSCC           Evmsd             Filesystem        ICP
IPaddr            IPaddr2           IPsrcaddr         IPv6addr          LVM
LinuxSCSI         MailTo            ManageRAID        ManageVE          Pure-FTPd
Raid1             Route             SAPDatabase       SAPInstance       SendArp
ServeRAID         SphinxSearchDaemon Squid            Stateful          SysInfo
VIPArip           VirtualDomain     WAS               WAS6              WinPopup
Xen               Xinetd            anything          apache            conntrackd
db2               drbd              eDir88            exportfs          fio
iSCSILogicalUnit  iSCSITarget       ids               iscsi             jboss
ldirectord        mysql             mysql-proxy       nfsserver         nginx
oracle            oralsnr           pgsql             pingd             portblock
postfix           proftpd           rsyncd            scsi2reservation  sfex
syslog-ng         tomcat            vmware

crm(live)ra# list stonith
crm(live)ra#

All the stonith related RAs are present in /usr/lib/stonith/plugins/external. Following is the output of the ls command:

[root@ggns2mexsatsdp17 ~]# ls /usr/lib/stonith/plugins/external/
drac5          hetzner  ibmrsa         ipmi         kdumpcheck  nut    output   riloe  ssh      vmware  xen0-ha
dracmc-telnet  hmchttp  ibmrsa-telnet  ippower9258  libvirt     ouput  rackpdu  sbd    vcenter  xen0

Can somebody please help me with this?

Thanks and regards
Neha Chatrath
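Independent of crm, cluster-glue's stonith(8) command exercises the plugins directly, which helps separate "plugins missing or broken" from "lrmd cannot see the class":

    stonith -L                    # list all installed stonith plugin types
    stonith -t external/ipmi -n   # show the parameters the ipmi plugin expects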
Re: [Pacemaker] Regarding Stonith RAs
Hello,

I have tried with a single ":" in the crm configure command, but there is no change in the result. LRM and crm show the same errors.

Thanks and regards
Neha Chatrath

Date: Mon, 14 Nov 2011 09:49:52 +0100
From: Michael Schwartzkopff mi...@clusterbau.com
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Regarding Stonith RAs
Message-ID: 20140949.53351.mi...@clusterbau.com
Content-Type: text/plain; charset=iso-8859-1

> Hello, I am facing an issue in configuring a Stonith resource in my system of cluster with 2 nodes. Whenever I try to give the following command:
> crm configure primitive app_fence stonith::external/ipmi params hostname=ggns2mexsatsdp17.hsc.com ipaddr=192.168.113.17 userid=root passwd=pass@abc123

try

crm configure primitive app_fence stonith:external/ipmi (...)

Please note the ONE colon. Providers are only known in the OCF RA class.

--
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München
Tel: (0163) 172 50 98
Re: [Pacemaker] Regarding Stonith RAs
Hello Dejan,

I am using Cluster Glue version 1.0.7. Also, this does not seem to be a problem with a specific stonith agent like ipmi; I think it is more of an issue with all the stonith agents. I have tried configuring another test stonith agent, e.g. suicide, and I am facing exactly the same issue. Kindly suggest.

Thanks and regards
Neha Chatrath

Date: Mon, 14 Nov 2011 15:41:43 +0100
From: Dejan Muhamedagic deja...@fastmail.fm
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Regarding Stonith RAs
Message-ID: 2014144142.GA3735@squib
Content-Type: text/plain; charset=us-ascii

Hi,

On Mon, Nov 14, 2011 at 02:05:49PM +0530, neha chatrath wrote:
> Hello, I am facing an issue in configuring a Stonith resource in my system of cluster with 2 nodes. Whenever I try to give the following command:
> crm configure primitive app_fence stonith::external/ipmi params hostname=ggns2mexsatsdp17.hsc.com ipaddr=192.168.113.17 userid=root passwd=pass@abc123
> I get the following errors:
> ERROR: stonith:external/ipmi: could not parse meta-data:

Which version of cluster-glue do you have installed? There is a serious issue with external/ipmi in version 1.0.8; we'll make a new release ASAP.

Thanks,
Dejan
[Pacemaker] Query regarding crm node standby/online command
Hello,

I am running Heartbeat and Pacemaker in a cluster with 2 nodes. I also have a client registered with the Heartbeat daemon for any node/IF status changes. When I execute the crm node standby command on one of the nodes, there is no node status change info reported to the client. Is this the expected behavior?

Also, one more query about the Heartbeat daemon: In my system, I have multiple IP interfaces (each configured with a separate IP) with Heartbeat running on one of them. I have a requirement of monitoring all these IP interfaces and performing the necessary actions (like failover) in case of any interface failure. I am able to monitor the interface on which Heartbeat is running, but not the rest of them. Does Heartbeat allow monitoring of interfaces other than the one on which it is running?

Thanks and regards
Neha Chatrath
Re: [Pacemaker] Inter-cluster communication using Heartbeat and Pacemaker
Hello Andreas,

There is a system requirement according to which:
1. There are 2 independent clusters: one for the data plane and one for the control plane.
2. These clusters are connected to each other through IP/Ethernet connectivity for transmission and reception of control plane signalling only, i.e. user plane traffic does not go through the control plane cluster.
3. Nodes in the control plane cluster need to know the status of the nodes in the data plane, e.g. to apply load balancing algorithms at their end.

Thus, inter-cluster communication is required.

Thanks and regards
Neha Chatrath

Date: Tue, 1 Nov 2011 21:27:51 +1100
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Inter-cluster communication using Heartbeat and Pacemaker
Message-ID: CAEDLWG0ZAfUs_a=yje2pauixx5tita55xb+hu2ywngmkuev...@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1

On Fri, Oct 28, 2011 at 8:41 PM, neha chatrath nehachatr...@gmail.com wrote:
> Hello, Is there a way to do inter-cluster communication using the Heartbeat/Pacemaker framework?

Well, by definition, if the two nodes can talk to each other they're part of the same cluster. What are you trying to achieve?
[Pacemaker] Inter-cluster communication using Heartbeat and Pacemaker
Hello,

Is there a way to do inter-cluster communication using the Heartbeat/Pacemaker framework?

Thanks and regards
Neha Chatrath
Re: [Pacemaker] Problem in Stonith configuration
Hello,

1. How about using the integrated iLO device for fencing? I am using an HP ProLiant DL360 G7 server, which supports iLO3.
- Can the RILOE stonith plugin be used for this?
2. Can the meatware stonith plugin be used for production software?
3. One more issue I am facing is that when I try the crm ra list stonith command, there is no output, although the different RAs under the heartbeat class are visible.
- Also, the stonith class is visible in the output of the crm ra classes command.
- All the default stonith RAs like meatware, suicide, ibmrsa, ipmi etc. are present in the /usr/lib/stonith/plugins directory.
- Because of this, I am not able to configure stonith in my system.

Thanks and regards
Neha Chatrath

On Tue, Oct 18, 2011 at 2:51 PM, neha chatrath nehachatr...@gmail.com wrote:

Hello,

> > 1. If a resource fails, the node should reboot (through the fencing mechanism) and the resources should restart on the node.
>
> Why would you want that? This would increase the service downtime considerably. Why is a local restart not possible ... and even if there is a good reason for a reboot, why not start the resource on the other node?

- In our system, there are some primitive and clone resources along with 3 different master-slave resources.
- All the masters and slaves of these resources are co-located, i.e. all 3 masters are co-located on one node and the 3 slaves on the other node.
- These 3 master-slave resources are tightly coupled. There is a requirement that the failure of even one of these resources restarts all the resources in the group.
- All these resources can be shifted to the other node, but subsequently they should also be restarted, as a lot of data/control plane synching is done between the two nodes.

e.g. If one of the resources running on node1 as a Master fails, then all these 3 resources are shifted to the other node, i.e. node2 (with the corresponding slave resources being promoted to Master). On node1, these resources should get restarted as slaves.

We understand that a node restart will increase the downtime, but since we could not find much on the option of a group restart of master-slave resources, we are trying the node restart option.

Thanks and regards
Neha Chatrath

-- Forwarded message --
From: Andreas Kurz andr...@hastexo.com
Date: Tue, Oct 18, 2011 at 1:55 PM
Subject: Re: [Pacemaker] Problem in Stonith configuration
To: pacemaker@oss.clusterlabs.org

Hello,

On 10/18/2011 09:00 AM, neha chatrath wrote:
> Hello, Minor updates in the first requirement.
> 1. If a resource fails, the node should reboot (through the fencing mechanism) and the resources should restart on the node.

Why would you want that? This would increase the service downtime considerably. Why is a local restart not possible ... and even if there is a good reason for a reboot, why not start the resource on the other node?

> 2. If the physical link between the nodes in a cluster fails, then that node should be isolated (kind of a power down) and the resources should continue to run on the other nodes.

That is how stonith works, yes.

crm ra list stonith ... gives you a list of all available stonith plugins.
crm ra info stonith:... gives details for a specific plugin.

Using external/ipmi is often a good choice because a lot of servers already have a BMC with IPMI on board, or they are shipped with a management card supporting IPMI.

Regards,
Andreas

On Tue, Oct 18, 2011 at 12:30 PM, neha chatrath nehachatr...@gmail.com wrote:

Hello,
Minor updates in the first requirement:
1. If a resource fails, the node should reboot (through the fencing mechanism) and the resources should restart on the node.
2. If the physical link between the nodes in a cluster fails, then that node should be isolated (kind of a power down) and the resources should continue to run on the other nodes.

Apologies for the inconvenience.

Thanks and regards
Neha Chatrath
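On the iLO3 question above: iLO3 speaks IPMI over LAN, so external/ipmi with the lanplus interface is a commonly used route. A sketch with placeholder addresses and credentials (verify the parameter list with stonith -t external/ipmi -n):

    primitive fence-mcg1 stonith:external/ipmi \
            params hostname=mcg1 ipaddr=192.168.1.210 userid=Administrator \
                   passwd=secret interface=lanplus \
            op monitor interval=3600s
    location l-fence-mcg1 fence-mcg1 -inf: mcg1   # never run a node's fence device on itself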
Re: [Pacemaker] Problem in Stonith configuration
Hello Andreas,

Thanks for the reply. Can you please suggest which stonith plugin I should use for the production release of my software? I have the following system requirements:

1. If a node in the cluster fails, it should be rebooted and the resources should restart on the node.
2. If the physical link between the nodes in a cluster fails, then that node should be isolated (a kind of power-down) and the resources should continue to run on the other node.

I have different types of resources, e.g. primitive, master-slave and clone, running on my system.

Thanks and regards
Neha Chatrath

Date: Mon, 17 Oct 2011 15:08:16 +0200
From: Andreas Kurz andr...@hastexo.com
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Problem in Stonith configuration
Message-ID: 4e9c28c0.8070...@hastexo.com
Content-Type: text/plain; charset=iso-8859-1

Hello,

On 10/17/2011 12:34 PM, neha chatrath wrote:
[...]
Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)

The cluster is waiting for a successful fencing event before starting all resources ... the only way to be sure the second node runs no resources. Since you are using the suicide plugin, this will never happen if Heartbeat is not started on that node. If this is only a _test_ setup, go with the ssh or even the null stonith plugin ... never use them on production systems!

Regards,
Andreas
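For a test setup along the lines Andreas suggests, a minimal sketch of an ssh-based fencing resource might look like this (the resource and clone names are made up for the example; external/ssh requires passwordless root ssh between the nodes and, like the null plugin, must never be used in production):

    primitive ssh_fencing stonith:external/ssh \
            params hostlist="mcg1 mcg2" \
            op monitor interval=90
    clone fencing_clone ssh_fencing \
            meta globally-unique=false clone-max=2 clone-node-max=1

Cloning the fencing resource lets either node shoot the other, something a single suicide primitive can never do.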
Re: [Pacemaker] Problem in Stonith configuration
Hello,

Minor updates in the first requirement:

1. If a resource fails, the node should reboot (through the fencing mechanism) and the resources should restart on the node.
2. If the physical link between the nodes in a cluster fails, then that node should be isolated (a kind of power-down) and the resources should continue to run on the other node.

Apologies for the inconvenience.

Thanks and regards
Neha Chatrath
Re: [Pacemaker] Problem in Stonith configuration
Hello,

1. If a resource fails, node should reboot (through fencing mechanism) and resources should re-start on the node.

Why would you want that? This would increase the service downtime considerably. Why is a local restart not possible ... and even if there is a good reason for a reboot, why not start the resource on the other node?

- In our system there are some primitive and clone resources along with 3 different master-slave resources.
- All the masters and slaves of these resources are co-located, i.e. all 3 masters run on one node and the 3 slaves on the other node.
- These 3 master-slave resources are tightly coupled. There is a requirement that the failure of even one of these resources restarts all the resources in the group.
- All these resources can be shifted to the other node, but subsequently they should also be restarted, as a lot of data/control-plane syncing is done between the two nodes. E.g. if one of the resources running as a Master on node1 fails, then all 3 resources are shifted to the other node, node2 (with the corresponding slave resources being promoted to master). On node1, these resources should get restarted as slaves.

We understand that a node restart will increase the downtime, but since we could not find much on the option of a group restart of master-slave resources, we are trying the node-restart option.

Thanks and regards
Neha Chatrath

-- Forwarded message --
From: Andreas Kurz andr...@hastexo.com
Date: Tue, Oct 18, 2011 at 1:55 PM
Subject: Re: [Pacemaker] Problem in Stonith configuration
To: pacemaker@oss.clusterlabs.org

Hello,

On 10/18/2011 09:00 AM, neha chatrath wrote:

Hello, Minor updates in the first requirement.
1. If a resource fails, node should reboot (through fencing mechanism) and resources should re-start on the node.

Why would you want that? This would increase the service downtime considerably. Why is a local restart not possible ... and even if there is a good reason for a reboot, why not start the resource on the other node?

2. If the physical link between the nodes in a cluster fails then that node should be isolated (kind of a power down) and the resources should continue to run on the other nodes

That is how stonith works, yes.

crm ra list stonith ... gives you a list of all available stonith plugins.
crm ra info stonith:<plugin> ... gives you details for a specific plugin.

Using external/ipmi is often a good choice because a lot of servers already have a BMC with IPMI on board, or they are shipped with a management card supporting IPMI.

Regards,
Andreas
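As a concrete illustration of Andreas's external/ipmi suggestion, a minimal sketch for this two-node cluster might look like the following (the IP addresses and credentials are placeholders; iLO3 speaks IPMI 2.0, so interface=lanplus is usually needed, and a -inf location constraint keeps each fencing device off the node it is meant to kill):

    primitive fence_mcg1 stonith:external/ipmi \
            params hostname=mcg1 ipaddr=192.168.1.101 userid=admin passwd=secret interface=lanplus \
            op monitor interval=60
    primitive fence_mcg2 stonith:external/ipmi \
            params hostname=mcg2 ipaddr=192.168.1.102 userid=admin passwd=secret interface=lanplus \
            op monitor interval=60
    location l_fence_mcg1 fence_mcg1 -inf: mcg1
    location l_fence_mcg2 fence_mcg2 -inf: mcg2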
[Pacemaker] Problem in Stonith configuration
Hello,

I am configuring a 2 node cluster with the following configuration:

[root@MCG1 init.d]# crm configure show
node $id=16738ea4-adae-483f-9d79-b0ecce8050f4 mcg2 \
        attributes standby=off
node $id=3d507250-780f-414a-b674-8c8d84e345cd mcg1 \
        attributes standby=off
primitive ClusterIP ocf:heartbeat:IPaddr \
        params ip=192.168.1.204 cidr_netmask=255.255.255.0 nic=eth0:1 \
        op monitor interval=40s timeout=20s \
        meta target-role=Started
primitive app1_fencing stonith:suicide \
        op monitor interval=90 \
        meta target-role=Started
primitive myapp1 ocf:heartbeat:Redundancy \
        op monitor interval=60s role=Master timeout=30s on-fail=standby \
        op monitor interval=40s role=Slave timeout=40s on-fail=restart
primitive myapp2 ocf:mcg:Redundancy_myapp2 \
        op monitor interval=60 role=Master timeout=30 on-fail=standby \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
primitive myapp3 ocf:mcg:red_app3 \
        op monitor interval=60 role=Master timeout=30 on-fail=fence \
        op monitor interval=40 role=Slave timeout=40 on-fail=restart
ms ms_myapp1 myapp1 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_myapp2 myapp2 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_myapp3 myapp3 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation myapp1_col inf: ClusterIP ms_myapp1:Master
colocation myapp2_col inf: ClusterIP ms_myapp2:Master
colocation myapp3_col inf: ClusterIP ms_myapp3:Master
order myapp1_order inf: ms_myapp1:promote ClusterIP:start
order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
property $id=cib-bootstrap-options \
        dc-version=1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1 \
        cluster-infrastructure=Heartbeat \
        stonith-enabled=true \
        no-quorum-policy=ignore
rsc_defaults $id=rsc-options \
        resource-stickiness=100 \
        migration-threshold=3

I start the Heartbeat daemon on only one of the nodes, e.g. mcg1, but none of the resources (myapp1, myapp2 etc.) get started even on this node. Following is the output of the crm_mon -f command:

Last updated: Mon Oct 17 10:19:22 2011
Stack: Heartbeat
Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd) - partition with quorum
Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
2 Nodes configured, unknown expected votes
5 Resources configured.

Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
Online: [ mcg1 ]

app1_fencing    (stonith:suicide):      Started mcg1

Migration summary:
* Node mcg1:

When I set stonith-enabled to false, all my resources come up. Can somebody help me with the STONITH configuration?

Cheers
Neha Chatrath
KEEP SMILING
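As Andreas explains above, the pending fencing of mcg2 can never succeed with the suicide plugin while Heartbeat is down on that node, so the cluster keeps everything stopped. For a quick experiment, one option is to swap the suicide primitive for the null plugin; a sketch, strictly for test use, since the null plugin merely pretends to fence:

    crm configure delete app1_fencing
    crm configure primitive app1_fencing stonith:null \
            params hostlist="mcg1 mcg2" \
            op monitor interval=90

This only hides the problem for experiments; a production cluster needs a real fencing device such as the external/ipmi setup sketched earlier in the thread.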