Hi, I have set up a small 2-node cluster that we are using to provide HA for a Java app.
Basically the requirement is to provide HA now and load balancing later on. My initial plan was to use the Linux iptables cluster-IP module across the 2 nodes for the load balancing and cluster software for the failover. I have left the load balancing for now; HA has been given a higher priority.

So I am using CentOS 6.3 with the pacemaker 1.1.7 RPMs. I have 2 nodes and 1 VIP, and the VIP determines which node is the active one. The application is actually live on both nodes; it is really only the VIP that moves. I use pacemaker to ensure 1) the application is running and 2) the VIP is placed on the right node.

I have created my own resource script, /usr/lib/ocf/resource.d/yb/ybrp, using one of the other script files as a template. It tests 1) that the application is running, using ps, and 2) that the application is okay, by making a call to it and checking the result. Start and stop basically just touch a lock file, monitor does the tests, and status uses the lock file and does the tests as well. (I have pasted a simplified sketch of the agent at the bottom of this mail.)

Here is the output from crm configure show:

node dc1wwwrp01
node dc1wwwrp02
primitive ybrpip ocf:heartbeat:IPaddr2 \
        params ip="10.32.21.10" cidr_netmask="24" \
        op monitor interval="5s"
primitive ybrpstat ocf:yb:ybrp \
        op monitor interval="5s"
group ybrp ybrpip ybrpstat
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1369092192"

Is there anything I should be doing differently? I have seen the colocation option and something about affinity of resources, but I used a group; is that the best-practice way of doing it?

My next step is to add in the iptables cluster-IP module. It is controlled by a /proc/.... control file: basically you tell the OS how many nodes there are and which node number this machine is looking after. So I was going to make a resource per node number, i.e. the node 1 resource prefers node 1 and the node 2 resource prefers node 2, so that when one node goes down the surviving node takes over its resource. The takeover itself can be done by poking a number into the /proc file.
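For reference, this is roughly how I was planning to express those per-node preferences in crm; the ybrpnode1/ybrpnode2 resource names are just placeholders for whatever ends up driving the /proc file:

  crm configure location ybrpnode1-prefers-01 ybrpnode1 100: dc1wwwrp01
  crm configure location ybrpnode2-prefers-02 ybrpnode2 100: dc1wwwrp02

With a score of 100 each resource prefers its own node but can still fail over to the other one when its home node goes down.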
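Going back to the group question above: as far as I understand it, the group is shorthand for a colocation constraint plus an ordering constraint on the two resources, i.e. something roughly equivalent to (constraint names made up by me):

  crm configure colocation ybrpstat-with-ip inf: ybrpstat ybrpip
  crm configure order ybrpip-before-ybrpstat inf: ybrpip ybrpstat

If explicit constraints like these are considered better practice than the group, I am happy to switch.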
But I have seen some weird things happen that I can't explain or control. Sometimes things go a bit off: when I do a /usr/sbin/crm_mon -1 I can see the resources have errors next to them and a message along the lines of:

  operation monitor failed 'insufficient privileges' (rc=4)

I normally just do a crm resource cleanup ybrpstat and things come back to normal, but I need to understand how it gets into that state, why it happens, and what I can do to stop it.

This is from /var/log/messages on node1:
==========
May 21 09:02:35 dc1wwwrp01 cib[2351]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 21 09:09:28 dc1wwwrp01 pengine[2355]: error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: warning: unpack_rsc_op: Processing failed op ybrpstat_last_failure_0 on dc1wwwrp01: insufficient privileges (4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: warning: unpack_rsc_op: Processing failed op ybrpstat_last_failure_0 on dc1wwwrp02: insufficient privileges (4)
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: common_apply_stickiness: ybrpstat can fail 999999 more times on dc1wwwrp01 before being forced off
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: common_apply_stickiness: ybrpstat can fail 999999 more times on dc1wwwrp02 before being forced off
May 21 09:09:28 dc1wwwrp01 pengine[2355]: notice: process_pe_message: Transition 5487: PEngine Input stored in: /var/lib/pengine/pe-input-1485.bz2
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 21 09:09:28 dc1wwwrp01 crmd[2356]: info: do_te_invoke: Processing graph 5487 (ref=pe_calc-dc-1369091368-5548) derived from /var/lib/pengine/pe-input-1485.bz2
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: run_graph: ==== Transition 5487 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-1485.bz2): Complete
May 21 09:09:28 dc1wwwrp01 crmd[2356]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 21 09:12:35 dc1wwwrp01 cib[2351]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
May 21 09:23:12 dc1wwwrp01 crm_resource[5165]: error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp01: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:23:12 dc1wwwrp01 crm_resource[5165]: error: unpack_rsc_op: Preventing ybrpstat from re-starting on dc1wwwrp02: operation monitor failed 'insufficient privileges' (rc=4)
May 21 09:23:12 dc1wwwrp01 cib[2351]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat'] (origin=local/crmd/5589, version=0.101.36): ok (rc=0)
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: delete_resource: Removing resource ybrpstat for 5165_crm_resource (internal) on dc1wwwrp01
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: notify_deleted: Notifying 5165_crm_resource on dc1wwwrp01 that ybrpstat was deleted
May 21 09:23:12 dc1wwwrp01 crmd[2356]: warning: decode_transition_key: Bad UUID (crm-resource-5165) in sscanf result (3) for 0:0:crm-resource-5165
May 21 09:23:12 dc1wwwrp01 attrd[2354]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (<null>)
May 21 09:23:12 dc1wwwrp01 cib[2351]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='dc1wwwrp01']//lrm_resource[@id='ybrpstat'] (origin=local/crmd/5590, version=0.101.37): ok (rc=0)
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34, cib=0.101.37) : Resource op removal
May 21 09:23:12 dc1wwwrp01 crmd[2356]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ybrpstat_last_0, magic=0:0;3:3572:0:c348b36c-f6dd-4a7d-ac5b-01a3b8ce3c34, cib=0.101.37) : Resource op removal

From node2:
===========
May 21 09:20:03 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:16: monitor
May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: cancel_op: operation monitor[16] on ocf::IPaddr2::ybrpip for client 2048, its parameters: CRM_meta_name=[monitor] cidr_netmask=[24] crm_feature_set=[3.0.6] CRM_meta_timeout=[20000] CRM_meta_interval=[5000] ip=[10.32.21.10] cancelled
May 21 09:23:12 dc1wwwrp02 lrmd: [2045]: info: rsc:ybrpip:20: stop
May 21 09:23:12 dc1wwwrp02 cib[2043]: info: apply_xml_diff: Digest mis-match: expected dcee73fe6518ac0d4b3429425d5dfc16, calculated 4a39d2ad25d50af2ec19b5b24252aef8
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_process_diff: Diff 0.101.36 -> 0.101.37 not applied to 0.101.36: Failed application of an update diff
May 21 09:23:12 dc1wwwrp02 cib[2043]: info: cib_server_process_diff: Requesting re-sync from peer
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.101.36 -> 0.101.37 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.101.37 -> 0.102.1 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.102.1 -> 0.102.2 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.102.2 -> 0.102.3 (sync in progress)
May 21 09:23:12 dc1wwwrp02 cib[2043]: notice: cib_server_process_diff: Not applying diff 0.102.3 -> 0.102.4 (sync in progress)

Any help or suggestions much appreciated.

Thanks
Alex
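P.S. In case it is relevant, here is a much-simplified sketch of what the ybrp agent does. The lock file path, ps pattern and test-call helper are placeholders for the real ones, and the meta-data/validate actions are left out; the exit codes follow the OCF spec (0 = success, 1 = generic error, 3 = unimplemented, 7 = not running; the rc=4 in the logs above is OCF_ERR_PERM, "insufficient privileges").

#!/bin/sh
# Simplified sketch of /usr/lib/ocf/resource.d/yb/ybrp (details omitted).

LOCKFILE=/var/run/ybrp.lock                        # placeholder path

app_running() {
    # 1) is the java app actually running?
    ps -ef | grep -q '[j]ava.*ybrp'                # placeholder pattern
}

app_healthy() {
    # 2) does the app answer a test call correctly?
    /usr/local/bin/ybrp-testcall >/dev/null 2>&1   # placeholder helper
}

case "$1" in
    start)
        touch "$LOCKFILE"
        exit 0 ;;                                  # OCF_SUCCESS
    stop)
        rm -f "$LOCKFILE"
        exit 0 ;;
    monitor)
        app_running || exit 7                      # OCF_NOT_RUNNING
        app_healthy || exit 1                      # OCF_ERR_GENERIC
        exit 0 ;;
    status)
        [ -f "$LOCKFILE" ] || exit 7
        app_running || exit 7
        app_healthy || exit 1
        exit 0 ;;
    *)
        exit 3 ;;                                  # OCF_ERR_UNIMPLEMENTED
esac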