Hi
rpm -qa | grep pace
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64

CentOS 6.5

I have a 2-node cluster.

pcs config
Cluster Name: ybrp
Corosync Nodes:

Pacemaker Nodes:
 wwwrp1 wwwrp2

Resources:
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
  Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s
  Operations: start on-fail=restart interval=0s timeout=60s (ybrpip-start-interval-0s)
              monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-interval-5s)
              stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1
  Resource: ybrpstat (class=ocf provider=yb type=proxy)
   Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-monitor-interval-5s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-ybrpip-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.3-368c726
 last-lrm-refresh: 1414629002
 no-quorum-policy: ignore
 stonith-enabled: false

Basically, one node (wwwrp1) died from a hardware failure. I can see in the logs that the cluster wants to bring the IP address over, and it seems to do so, but:

## this seems to be the ip address moving
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used

## this seems to be where it checks other stuff
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

## this is where the monitor times out, but it doesn't look like 20000ms; the initial try was at Oct 30 01:19:43
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)

I'm guessing that because it timed out, it went into failed mode. I need to know why it timed out. The script has never timed out before, including in testing.

Am I reading this right?
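
A rough way to sanity-check the timing, as a sketch only: the timed-out monitor has call ID 13473 and PID 25712, both lower than those of the ybrpip start (call=13476, PID 25778), so it was presumably launched before the excerpt above begins. Assuming the 20000ms timeout is counted from the moment the monitor was launched, and taking the epoch value from the update_failcount line as the moment of the timeout, the launch time can be estimated like this (GNU date is assumed; the variable names are only for illustration):

```shell
#!/bin/sh
# Epoch time of the timeout, copied from the update_failcount log line (time=1414592384).
timeout_epoch=1414592384
# Configured monitor timeout in seconds (timeout=20s in the pcs config).
monitor_timeout=20

# Assumption: the timeout clock starts when the monitor process is launched.
start_epoch=$((timeout_epoch - monitor_timeout))

# Print both moments in UTC (GNU date syntax for epoch input).
echo "timed out at (UTC): $(date -u -d "@$timeout_epoch" '+%Y-%m-%d %H:%M:%S')"
echo "launched at  (UTC): $(date -u -d "@$start_epoch" '+%Y-%m-%d %H:%M:%S')"
```

If that assumption holds, the monitor was started roughly 20 seconds before 01:19:44 local time, which is earlier than the first log line shown here, so the 20000ms would not be measured from 01:19:43.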
The reason the resource (the IP address) didn't fail over was that there was no ybrpstat running on wwwrp2, and the reason for that was that the monitor action failed/timed out.

Thanks
Alex

=== logs ==
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start ybrpip#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start ybrpip#011(wwwrp2)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Stop ybrpstat:0#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 7: start ybrpip_start_0 on wwwrp2 (local)
Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 1: stop ybrpstat_stop_0 on wwwrp1
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
Oct 30 01:19:44 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (1)
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update 12543: fail-count-ybrpstat=1
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592384)
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update 12545: last-failure-ybrpstat=1414592384
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org