Hi Suresh! Have you tried running the LSB script's status action (e.g. "service mycustomprog status") in a shell? Does it show anything interesting, or does it hang?
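A quick sketch of what to look for when you run it (the path /etc/init.d/mycustomprog is my assumption; adjust to your script): Pacemaker judges an LSB monitor by the exit code, not by the output text. "status" must exit 0 when the service is running and 3 when it is stopped; exit 3 is reported as OCF rc=7 ("not running"). A script that prints "is running" but exits non-zero produces exactly the symptom in your logs.

```shell
# Sketch: check whether the "status" action of an LSB init script exits with
# LSB-compliant codes. Pacemaker maps LSB status exit 3 to OCF rc=7
# ("not running"); any other non-zero exit code is a compliance problem.
# /etc/init.d/mycustomprog is an assumed path, not taken from your config.
check_lsb_status() {
    "$1" status >/dev/null 2>&1
    rc=$?
    case $rc in
        0) echo "running (LSB status exit 0)" ;;
        3) echo "stopped (LSB status exit 3 -> OCF rc=7)" ;;
        *) echo "non-compliant status exit code: $rc" ;;
    esac
}
# Example: check_lsb_status /etc/init.d/mycustomprog
```

If the exit code disagrees with the printed text (e.g. a stale pid file makes the check fail while the process is alive), that would explain why the cluster thinks the resource has failed even though it has not.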
Regards,
Klaus

On 12/30/2016 08:45 AM, Suresh Rajagopalan wrote:
> Cluster running CentOS 6.8 with pacemaker/corosync. This config was
> running well for quite some time. All of a sudden we see regular
> restarts of the monitored process, where corosync thinks it has
> failed (even though it really has not). I am showing the relevant
> logs and config below. Any pointers appreciated, as it is not clear
> why this would occur.
>
> Thanks
> Suresh
>
> Dec 28 13:18:20 [2198] a.b.com pengine: info: LogActions: Leave mycustomprog (Started a.b.com)
> Dec 28 13:22:03 [2199] a.b.com crmd: info: process_lrm_event: Operation mycustomprog_monitor_10000: not running (node=a.b.com, call=29, rc=7, cib-update=1427, confirmed=false)
> Dec 28 13:22:03 [2199] a.b.com crmd: notice: process_lrm_event: a.b.com-mycustomprog_monitor_10000:29 [ mycustomprogram (pid 15657) is running...\n ]
> Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/lrm[@id='a.b.com']/lrm_resources/lrm_resource[@id='mycustomprog']: <lrm_rsc_op id="mycustomprog_last_failure_0" operation_key="mycustomprog_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7" transition-magic="0
> Dec 28 13:22:03 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by mycustomprog_monitor_10000 'create' on a.b.com: Old event (magic=0:7;7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7, cib=0.48.2038786, source=process_graph_event:605, 1)
> Dec 28 13:22:03 [2199] a.b.com crmd: info: update_failcount: Updating failcount for mycustomprog on a.b.com after failed monitor: rc=7 (update=value++, time=1482931323)
> Dec 28 13:22:03 [2199] a.b.com crmd: info: process_graph_event: Detected action (462.7) mycustomprog_monitor_10000.29=not running: failed
> Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-mycustomprog (1)
> Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_perform_update: Sent update 18: fail-count-mycustomprog=1
> Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com']: <nvpair id="status-a.b.com-fail-count-mycustomprog" name="fail-count-mycustomprog" value="1"/>
> Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mycustomprog (1482931323)
> Dec 28 13:22:03 [2197] a.b.com attrd: notice: attrd_perform_update: Sent update 20: last-failure-mycustomprog=1482931323
> Dec 28 13:22:03 [2194] a.b.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com']: <nvpair id="status-a.b.com-last-failure-mycustomprog" name="last-failure-mycustomprog" value="1482931323"/>
> Dec 28 13:22:04 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by status-a.b.com-fail-count-mycustomprog, fail-count-mycustomprog=1: Transient attribute change (create cib=0.48.2038787, source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com
> Dec 28 13:22:04 [2199] a.b.com crmd: info: abort_transition_graph: Transition aborted by status-a.b.com-last-failure-mycustomprog, last-failure-mycustomprog=1482931323: Transient attribute change (create cib=0.48.2038788, source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-macshii00002-hva.gs.r11.
> Dec 28 13:22:04 [2198] a.b.com pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for mycustomprog on a.b.com: not running (7)
> Dec 28 13:22:04 [2198] a.b.com pengine: info: native_print: mycustomprog (lsb:mycustomprog): FAILED a.b.com
> Dec 28 13:22:04 [2198] a.b.com pengine: info: get_failcount_full: mycustomprog has failed 1 times on a.b.com
> Dec 28 13:22:04 [2198] a.b.com pengine: info: common_apply_stickiness: mycustomprog can fail 999999 more times on a.b.com before being forced off
> Dec 28 13:22:04 [2198] a.b.com pengine: info: RecurringOp: Start recurring monitor (10s) for mycustomprog on a.b.com
> Dec 28 13:22:04 [2198] a.b.com pengine: notice: LogActions: Recover mycustomprog (Started a.b.com)
> Dec 28 13:22:04 [2199] a.b.com crmd: notice: te_rsc_command: Initiating action 5: stop mycustomprog_stop_0 on a.b.com (local)
> Dec 28 13:22:04 [2196] a.b.com lrmd: info: cancel_recurring_action: Cancelling lsb operation mycustomprog_status_10000
>
>
> pcs config:
>
> source settings.rc
>
> pcs property set stonith-enabled=false
> pcs property set no-quorum-policy=ignore
> pcs resource create ClusterIP2 IPaddr2 ip=$MYVIP cidr_netmask=$NETMASKVIP1
> pcs resource create ClusterIP3 IPaddr2 ip=$MYVIP2 cidr_netmask=$NETMASKVIP2
> pcs resource create mycustomprog lsb:mycustomprog op monitor interval="10s"
> pcs constraint colocation add ClusterIP3 with ClusterIP2 INFINITY
> pcs constraint colocation add mycustomprog with ClusterIP2 INFINITY
> pcs property set start-failure-is-fatal=false
> pcs resource defaults resource-stickiness=100
> pcs constraint colocation add chkhealth with ClusterIP2 INFINITY
>
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org