Cluster running CentOS 6.8 with Pacemaker/Corosync. This configuration had been running well for quite some time. All of a sudden we are seeing regular restarts of the monitored process: Pacemaker decides the monitor has failed (even though the process has not actually failed). The relevant logs and configuration are shown below. Any pointers appreciated, as it is not clear why this would occur.
Thanks,
Suresh

Logs:

Dec 28 13:18:20 [2198] a.b.com pengine: info: LogActions: Leave mycustomprog (Started a.b.com)
Dec 28 13:22:03 [2199] a.b.com crmd: info: process_lrm_event: Operation mycustomprog_monitor_10000: not running (node=a.b.com, call=29, rc=7, cib-update=1427, confirmed=false)
Dec 28 13:22:03 [2199] a.b.com crmd: notice: process_lrm_event: a.b.com-mycustomprog_monitor_10000:29 [ mycustomprogram (pid 15657) is running...\n ]
Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/lrm[@id='a.b.com']/lrm_resources/lrm_resource[@id='mycustomprog']: <lrm_rsc_op id="mycustomprog_last_failure_0" operation_key="mycustomprog_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7" transition-magic="0
Dec 28 13:22:03 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by mycustomprog_monitor_10000 'create' on a.b.com: Old event (magic=0:7;7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7, cib=0.48.2038786, source=process_graph_event:605, 1)
Dec 28 13:22:03 [2199] a.b.com crmd: info: update_failcount: Updating failcount for mycustomprog on a.b.com after failed monitor: rc=7 (update=value++, time=1482931323)
Dec 28 13:22:03 [2199] a.b.com crmd: info: process_graph_event: Detected action (462.7) mycustomprog_monitor_10000.29=not running: failed
Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mycustomprog (1)
Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_perform_update: Sent update 18: fail-count-mycustomprog=1
Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com']: <nvpair id="status-a.b.com-fail-count-mycustomprog" name="fail-count-mycustomprog" value="1"/>
Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mycustomprog (1482931323)
Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_perform_update: Sent update 20: last-failure-mycustomprog=1482931323
Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com']: <nvpair id="status-a.b.com-last-failure-mycustomprog" name="last-failure-mycustomprog" value="1482931323"/>
Dec 28 13:22:04 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by status-a.b.com-fail-count-mycustomprog, fail-count-mycustomprog=1: Transient attribute change (create cib=0.48.2038787, source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com
Dec 28 13:22:04 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by status-a.b.com-last-failure-mycustomprog, last-failure-mycustomprog=1482931323: Transient attribute change (create cib=0.48.2038788, source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-macshii00002-hva.gs.r11.
Dec 28 13:22:04 [2198] a.b.com pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for mycustomprog on a.b.com: not running (7)
Dec 28 13:22:04 [2198] a.b.com pengine: info: native_print: mycustomprog (lsb:mycustomprog): FAILED a.b.com
Dec 28 13:22:04 [2198] a.b.com pengine: info: get_failcount_full: mycustomprog has failed 1 times on a.b.com
Dec 28 13:22:04 [2198] a.b.com pengine: info: common_apply_stickiness: mycustomprog can fail 999999 more times on a.b.com before being forced off
Dec 28 13:22:04 [2198] a.b.com pengine: info: RecurringOp: Start recurring monitor (10s) for mycustomprog on a.b.com
Dec 28 13:22:04 [2198] a.b.com pengine: notice: LogActions: Recover mycustomprog (Started a.b.com)
Dec 28 13:22:04 [2199] a.b.com crmd: notice: te_rsc_command: Initiating action 5: stop mycustomprog_stop_0 on a.b.com (local)
Dec 28 13:22:04 [2196] a.b.com lrmd: info: cancel_recurring_action: Cancelling lsb operation mycustomprog_status_10000

pcs config:

source settings.rc
pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore
pcs resource create ClusterIP2 IPaddr2 ip=$MYVIP cidr_netmask=$NETMASKVIP1
pcs resource create ClusterIP3 IPaddr2 ip=$MYVIP2 cidr_netmask=$NETMASKVIP2
pcs resource create mycustomprog lsb:mycustomprog op monitor interval="10s"
pcs constraint colocation add ClusterIP3 with ClusterIP2 INFINITY
pcs constraint colocation add mycustomprog with ClusterIP2 INFINITY
pcs property set start-failure-is-fatal=false
pcs resource defaults resource-stickiness=100
pcs constraint colocation add chkhealth with ClusterIP2 INFINITY
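One detail that stands out in the log: the monitor returned rc=7 ("not running") even though the captured status output says the process is running. Since mycustomprog is an lsb: resource, Pacemaker trusts the init script's "status" exit code, not its text output (LSB status: 0 = running, 3 = stopped, which Pacemaker reports as OCF rc 7). The sketch below is only an illustration of that mapping plus a loop to catch an occasionally non-compliant exit code; the /etc/init.d/mycustomprog path and the interpret_lsb_status helper are my assumptions, not part of the config above.

```shell
# Show how Pacemaker interprets the LSB "status" exit code.
# 0 = running, 3 = stopped (reported as OCF rc 7 / "not running");
# anything else is treated as an error.
interpret_lsb_status() {
    case "$1" in
        0) echo "running" ;;
        3) echo "stopped" ;;
        *) echo "unexpected ($1)" ;;
    esac
}

# Run the status action repeatedly while the daemon is up; any line other
# than "running" reproduces the condition that trips the monitor.
for i in 1 2 3 4 5; do
    rc=0
    /etc/init.d/mycustomprog status >/dev/null 2>&1 || rc=$?
    echo "iteration $i: $(interpret_lsb_status "$rc")"
done
```

If the script ever exits non-zero while the process is alive (e.g. a stale pid file or a race in its pidof check), that alone explains the spurious rc=7 and the resulting recovery/restart.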
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org