Hey guys, I'm encountering a really strange problem testing failover of my ocf:heartbeat:nginx resource in my two-node cluster. I can manually migrate the resource between the nodes and that works fine, but I can't get the resource to recover on one node after it has failed on the other. The strange part is that this only happens when the failure is on node 1; if I reproduce the failure on node 2, the resource correctly fails over to node 1.
no-quorum-policy is set to ignore, so that doesn't seem to be the issue, and some similar threads mentioned that start-failure-is-fatal=false may help, but it doesn't resolve it either. My full configuration includes a virtual IP and ping clones; those parts seem to work fine, and nginx even fails over correctly when its host goes offline completely. I just can't get the same behaviour when only the resource fails.

My test case:

> vim /etc/nginx/nginx.conf
> Insert invalid jargon and save
> service nginx restart

Expected outcome: the resource fails over to the other node upon monitor failure, in either direction between my two nodes.

Actual: the resource fails over correctly from node 2 -> node 1, but not from node 1 -> node 2.

This is my stripped-down test configuration for reproducing the issue (to make sure my other resources aren't interfering):

-----------------------
node $id="724150464" lb01
node $id="740927680" lb02
primitive nginx ocf:heartbeat:nginx \
        params configfile="/etc/nginx/nginx.conf" \
        op monitor interval="10s" timeout="30s" depth="0" \
        op monitor interval="15s" timeout="30s" status10url="http://localhost/nginx_status" depth="10"
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-42f2063" \
        cluster-infrastructure="corosync" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        start-failure-is-fatal="false" \
        last-lrm-refresh="1382410708"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

This is what happens when I perform the test case on node lb02; it correctly migrates/restarts the resource on lb01.
-----------------------
Oct 22 11:58:12 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb02: not running (7)
Oct 22 11:58:12 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Started lb02 FAILED
Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (10s) for nginx on lb02
Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (15s) for nginx on lb02
Oct 22 11:58:12 [694] lb02 pengine: notice: LogActions: Recover nginx (Started lb02)
Oct 22 11:58:12 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1038, version=0.252.2)
Oct 22 11:58:12 [692] lb02 lrmd: info: cancel_recurring_action: Cancelling operation nginx_monitor_15000
Oct 22 11:58:12 [692] lb02 lrmd: info: cancel_recurring_action: Cancelling operation nginx_monitor_10000
Oct 22 11:58:12 [692] lb02 lrmd: info: log_execute: executing - rsc:nginx action:stop call_id:848
Oct 22 11:58:12 [695] lb02 crmd: info: process_lrm_event: LRM operation nginx_monitor_15000 (call=839, status=1, cib-update=0, confirmed=true) Cancelled
Oct 22 11:58:12 [695] lb02 crmd: info: process_lrm_event: LRM operation nginx_monitor_10000 (call=841, status=1, cib-update=0, confirmed=true) Cancelled
Oct 22 11:58:12 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']: OK (rc=0, origin=local/attrd/1041, version=0.252.3)
nginx[31237]: 2013/10/22_11:58:12 INFO: nginx is not running.
Oct 22 11:58:12 [692] lb02 lrmd: info: log_finished: finished - rsc:nginx action:stop call_id:848 pid:31237 exit-code:0 exec-time:155ms queue-time:0ms
Oct 22 11:58:12 [695] lb02 crmd: notice: process_lrm_event: LRM operation nginx_stop_0 (call=848, rc=0, cib-update=593, confirmed=true) ok
Oct 22 11:58:12 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb01
Oct 22 11:58:12 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb02: not running (7)
Oct 22 11:58:12 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Stopped
Oct 22 11:58:12 [694] lb02 pengine: info: get_failcount_full: nginx has failed 1 times on lb02
Oct 22 11:58:12 [694] lb02 pengine: info: common_apply_stickiness: nginx can fail 999999 more times on lb02 before being forced off
Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (10s) for nginx on lb01
Oct 22 11:58:12 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (15s) for nginx on lb01
Oct 22 11:58:12 [694] lb02 pengine: notice: LogActions: Start nginx (lb01)

This is what happens when I try to go from lb01 -> lb02.
-----------------------
Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb01: not running (7)
Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
Oct 22 12:00:25 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Started lb01 FAILED
Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (10s) for nginx on lb01
Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (15s) for nginx on lb01
Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: Recover nginx (Started lb01)
Oct 22 12:00:25 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1046, version=0.253.12)
Oct 22 12:00:25 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='last-failure-nginx']: OK (rc=0, origin=local/attrd/1047, version=0.253.12)
Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op monitor for nginx on lb01: not running (7)
Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
Oct 22 12:00:25 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Stopped
Oct 22 12:00:25 [694] lb02 pengine: info: get_failcount_full: nginx has failed 1 times on lb01
Oct 22 12:00:25 [694] lb02 pengine: info: common_apply_stickiness: nginx can fail 999999 more times on lb01 before being forced off
Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (10s) for nginx on lb01
Oct 22 12:00:25 [694] lb02 pengine: info: RecurringOp: Start recurring monitor (15s) for nginx on lb01
Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: Start nginx (lb01)
Oct 22 12:00:25 [694] lb02 pengine: error: unpack_rsc_op: Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)
Oct 22 12:00:25 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op start for nginx on lb01: not configured (6)
Oct 22 12:00:25 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
Oct 22 12:00:25 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Started lb01 FAILED
Oct 22 12:00:25 [694] lb02 pengine: info: get_failcount_full: nginx has failed 1 times on lb01
Oct 22 12:00:25 [694] lb02 pengine: info: common_apply_stickiness: nginx can fail 999999 more times on lb01 before being forced off
Oct 22 12:00:25 [694] lb02 pengine: info: native_color: Resource nginx cannot run anywhere
Oct 22 12:00:25 [694] lb02 pengine: notice: LogActions: Stop nginx (lb01)
Oct 22 12:00:26 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1049, version=0.253.15)
Oct 22 12:00:26 [690] lb02 cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='740927680']//transient_attributes//nvpair[@name='fail-count-nginx']: No such device or address (rc=-6, origin=local/attrd/1050, version=0.253.15)
Oct 22 12:00:26 [694] lb02 pengine: error: unpack_rsc_op: Preventing nginx from re-starting anywhere in the cluster : operation start failed 'not configured' (rc=6)
Oct 22 12:00:26 [694] lb02 pengine: warning: unpack_rsc_op: Processing failed op start for nginx on lb01: not configured (6)
Oct 22 12:00:26 [694] lb02 pengine: info: unpack_rsc_op: Operation monitor found resource nginx active on lb02
Oct 22 12:00:26 [694] lb02 pengine: info: native_print: nginx (ocf::heartbeat:nginx): Stopped
Oct 22 12:00:26 [694] lb02 pengine: info: get_failcount_full: nginx has failed INFINITY times on lb01
Oct 22 12:00:26 [694] lb02 pengine: warning: common_apply_stickiness: Forcing nginx away from lb01 after 1000000 failures (max=1000000)
Oct 22 12:00:26 [694] lb02 pengine: info: native_color: Resource nginx cannot run anywhere
Oct 22 12:00:26 [694] lb02 pengine: info: LogActions: Leave nginx (Stopped)

I can't for the life of me work out why this is happening. For whatever reason, in the node 1 -> node 2 direction it decides that the resource can no longer run anywhere. And yes, I am making sure everything is healthy before I start each test, so it's not a failure to run crm resource cleanup etc. I would really appreciate help on this, as I've been trying to debug it for a few days and have hit a wall.

Thanks,
Lucas
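For completeness, the between-test reset I mean by "making sure everything is healthy" is roughly the following sketch. The backup path is just something I picked for illustration; the crm/crm_mon invocations are the stock crmsh and Pacemaker tools from my setup above.

```shell
# Restore a known-good nginx config before the next test run
# (assumes a backup was taken before inserting the invalid jargon;
# the .bak path is illustrative, not a standard location).
cp /etc/nginx/nginx.conf.bak /etc/nginx/nginx.conf

# Sanity-check the config so the next failure is deliberate, not leftover.
nginx -t

# Clear the resource's fail counts and failed-op history on both nodes.
crm resource cleanup nginx

# One-shot cluster status including fail counts, to confirm the cluster
# is clean before breaking things again.
crm_mon -1f
```

After this, both monitors report the resource running and the fail counts are back to zero before I break the config again.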
_______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org