Hi, I have set up a small 2-node cluster that we are using to provide HA for a Java app.
Basically the requirement is to provide HA now and load balancing later on. My initial plan was to use the Linux iptables cluster-IP module across the 2 nodes for the load balancing and cluster software for the failover. I have left the load balancing for now; HA has been given a higher priority.

So I am using CentOS 6.3 with the pacemaker 1.1.7 RPMs. I have 2 nodes and 1 VIP, and the VIP determines which node is the active one. The application is actually live on both nodes; it is really only the VIP that moves. I use pacemaker to ensure 1) the application is running and 2) the VIP is placed on the right node.

I have created my own resource script, /usr/lib/ocf/resource.d/yb/ybrp, using one of the other script files as a template. It tests 1) that the application is running, using ps, and 2) that the application is okay, by making a call to it and checking the result. Start and stop basically just touch a lock file, monitor does the tests, and status uses the lock file and does the tests as well. (I have pasted a simplified sketch of the agent at the bottom of this mail.)

Here is the output from crm configure show:

node dc1wwwrp01
node dc1wwwrp02
primitive ybrpip ocf:heartbeat:IPaddr2 \
        params ip="10.32.21.10" cidr_netmask="24" \
        op monitor interval="5s"
primitive ybrpstat ocf:yb:ybrp \
        op monitor interval="5s"
group ybrp ybrpip ybrpstat
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1369092192"

Is there anything I should be doing differently? I have seen the colocation option and something about affinity of resources, but I used a group; is that the best-practice way of doing it?

My next step is to add in the iptables cluster-IP module. It is controlled by a /proc/.... control file: basically you tell the OS how many nodes there are and which node number this machine is looking after. So I was going to make a resource per node number, i.e. the node 1 resource prefers node 1 and the node 2 resource prefers node 2, so that when one node goes down the surviving node takes over its resource. The takeover itself can be done by poking a number into the /proc file.
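For reference, this is roughly how I was planning to express those per-node preferences in crm; the ybrpnode1/ybrpnode2 resource names are just placeholders for whatever ends up driving the /proc file:

  crm configure location ybrpnode1-prefers-01 ybrpnode1 100: dc1wwwrp01
  crm configure location ybrpnode2-prefers-02 ybrpnode2 100: dc1wwwrp02

With a score of 100 each resource prefers its own node but can still fail over to the other one when its home node goes down.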
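Going back to the group question above: as far as I understand it, the group is shorthand for a colocation constraint plus an ordering constraint on the two resources, i.e. something roughly equivalent to (constraint names made up by me):

  crm configure colocation ybrpstat-with-ip inf: ybrpstat ybrpip
  crm configure order ybrpip-before-ybrpstat inf: ybrpip ybrpstat

If explicit constraints like these are considered better practice than the group, I am happy to switch.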
But I have seen some weird things happen that I can't explain or control. Sometimes things go a bit off: when I do a /usr/sbin/crm_mon -1 I can see the resources have errors next to them and a message along the lines of:

  operation monitor failed 'insufficient privileges' (rc=4)

I normally just do a crm resource cleanup ybrpstat and things come back to normal, but I need to understand how it gets into that state, why it happens, and what I can do to stop it.

This is from /var/log/messages on node1:
==========
May 21 09:02:35 dc1wwwrp01 cib[2351]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 21 09:09:28 dc1wwwrp01 pengine[2355]: error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: warning: unpack_rsc_op: Processing failed op ybrpstat_last_failure_0 on dc1wwwrp01: insufficient privileges (4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: warning: unpack_rsc_op: Processing failed op ybrpstat_last_failure_0 on dc1wwwrp02: insufficient privileges (4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: common_apply_stickiness: ybrpstat can fail 999999 more times on dc1wwwrp01 before being forced off
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: common_apply_stickiness: ybrpstat can fail 999999 more times on dc1wwwrp02 before being forced off
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: process_pe_message: Transition 5487: PEngine Input stored in: /var/lib/pengine/pe-input-1485.bz2
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: do_te_invoke: Processing graph 5487 (ref=pe_calc-dc-1369091368-5548) derived from /var/lib/pengine/pe-input-1485.bz2
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: run_graph: ==== Transition 5487 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-1485.bz2): Complete
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 21 09:12:35 dc1wwwrp01 cib[2351]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
May 21 09:23:12 dc1wwwrp01 crm_resource[5165]: error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:23:12 dc1wwwrp01 crm_resource[5165]: error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:23:12 dc1wwwrp01 cib[2351]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat'] (origin=local/crmd/5589, version=0.101.36): ok (rc=0)
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: delete_resource: Removing resource ybrpstat for 5165_crm_resource (internal) on dc1wwwrp01
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: notify_deleted: Notifying 5165_crm_resource on dc1wwwrp01 that ybrpstat was deleted
May 21 09:23:12 dc1wwwrp01 crmd[2356]: warning: decode_transition_key: Bad UUID (crm-resource-5165) in sscanf result (3) for 0:0:crm-resource-5165
May 21 09:23:12 dc1wwwrp01 attrd[2354]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (<null>)
May 21 09:23:12 dc1wwwrp01 cib[2351]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat'] (origin=local/crmd/5590, version=0.101.37): ok (rc=0)
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34, cib=0.101.37) : Resource op removal
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34, cib=0.101.37) : Resource op removal

From node2:
===========
May 21 09:20:03 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:16: monitor
May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: cancel_op: operation monitor[16] on ocf::IPaddr2::ybrpip for client 2048, its parameters: CRM_meta_name=[monitor] cidr_netmask=[24] crm_feature_set=[3.0.6] CRM_meta_timeout=[20000] CRM_meta_interval=[5000] ip=[10.32.21.10] cancelled
May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:20: stop
May 21 09:23:12 dc1wwwrp02 cib[2043]: info: apply_xml_diff: Digest mis-match: expected dcee73fe6518ac0d4b3429425d5dfc16, calculated 4a39d2ad25d50af2ec19b5b24252aef8
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_process_diff: Diff 0.101.36 -> 0.101.37 not applied to 0.101.36: Failed application of an update diff
May 21 09:23:12 dc1wwwrp02 cib[2043]: info: cib_server_process_diff: Requesting re-sync from peer
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.101.36 -> 0.101.37 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.101.37 -> 0.102.1 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.102.1 -> 0.102.2 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.102.2 -> 0.102.3 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.102.3 -> 0.102.4 (sync in progress)

Any help or suggestions much appreciated.

Thanks
Alex
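P.S. In case it is relevant, here is a much-simplified sketch of what the ybrp agent does. The lock file path, ps pattern and test-call helper are placeholders for the real ones, and the meta-data/validate actions are left out; the exit codes follow the OCF spec (0 = success, 1 = generic error, 3 = unimplemented, 7 = not running; the rc=4 in the logs above is OCF_ERR_PERM, "insufficient privileges").

#!/bin/sh
# Simplified sketch of /usr/lib/ocf/resource.d/yb/ybrp (details omitted).

LOCKFILE=/var/run/ybrp.lock                        # placeholder path

app_running() {
    # 1) is the java app actually running?
    ps -ef | grep -q '[j]ava.*ybrp'                # placeholder pattern
}

app_healthy() {
    # 2) does the app answer a test call correctly?
    /usr/local/bin/ybrp-testcall >/dev/null 2>&1   # placeholder helper
}

case "$1" in
    start)
        touch "$LOCKFILE"
        exit 0 ;;                                  # OCF_SUCCESS
    stop)
        rm -f "$LOCKFILE"
        exit 0 ;;
    monitor)
        app_running || exit 7                      # OCF_NOT_RUNNING
        app_healthy || exit 1                      # OCF_ERR_GENERIC
        exit 0 ;;
    status)
        [ -f "$LOCKFILE" ] || exit 7
        app_running || exit 7
        app_healthy || exit 1
        exit 0 ;;
    *)
        exit 3 ;;                                  # OCF_ERR_UNIMPLEMENTED
esac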