Hi,

It seems I had the same problem on my prod cluster:
Oct 30 01:21:53 prodrp1 lrmd[2536]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 17918) timed out
Oct 30 01:21:53 prodrp1 lrmd[2536]: warning: operation_finished: ybrpstat_monitor_5000:17918 - timed out after 20000ms
Oct 30 01:21:53 prodrp1 crmd[2539]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (4384) Timed Out (timeout=20000ms)
Oct 30 01:21:53 prodrp1 attrd[2537]: notice: attrd_cs_dispatch: Update relayed from prodrp2

This is the first recording on prodrp1. The issue is that prodrp1 was the master for the IP. I am wondering if it has something to do with it being a cloned resource? So the cluster realises there is a problem, and in the middle of it realising that, the cluster check monitor fails because (assumption here) it does some sort of cluster-wide check!

A

> -----Original Message-----
> From: Alex Samad - Yieldbroker [mailto:alex.sa...@yieldbroker.com]
> Sent: Thursday, 30 October 2014 11:43 AM
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] Help understanding what went wrong
>
> Hi
>
> rpm -qa | grep pace
> pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> pacemaker-1.1.10-14.el6_5.3.x86_64
>
> centos 6.5
>
> I have a 2 node cluster
> pcs config
> Cluster Name: ybrp
> Corosync Nodes:
>
> Pacemaker Nodes:
>  wwwrp1 wwwrp2
>
> Resources:
>  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
>   Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s
>   Operations: start on-fail=restart interval=0s timeout=60s (ybrpip-start-interval-0s)
>               monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-interval-5s)
>               stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
>  Clone: ybrpstat-clone
>   Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1
>   Resource: ybrpstat (class=ocf provider=yb type=proxy)
>    Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-monitor-interval-5s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
>   start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-ybrpip-mandatory)
> Colocation Constraints:
>   ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-INFINITY)
>
> Cluster Properties:
>  cluster-infrastructure: cman
>  dc-version: 1.1.10-14.el6_5.3-368c726
>  last-lrm-refresh: 1414629002
>  no-quorum-policy: ignore
>  stonith-enabled: false
>
> Basically one node (wwwrp1) died: hardware failure.
>
> I can see in the logs the cluster wants to bring the IP address over, and it seems to do it, but:
>
> ## this seems to be the ip address moving
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
>
> ## this seems to be where it checks other stuff
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_TRANSITION_ENGINE ->
> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> # this is where the monitor times out, but it doesn't look like 20000ms ... the initial try was at Oct 30 01:19:43
> Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
> Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
> Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
> Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
>
> I'm guessing that because it timed out, it went into failed mode. I need to know why it timed out. The script has never timed out before, in production or in testing...
>
> Am I reading this right? The reason the resource (the IP address) didn't fail over was that there was no ybrpstat running on wwwrp2, and the reason for that was that the monitor action failed/timed out.
>
> Thanks
> Alex
>
> === logs ==
>
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start ybrpip#011(wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: warning:
> common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start ybrpip#011(wwwrp2)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Stop ybrpstat:0#011(wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
> Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 7: start ybrpip_start_0 on wwwrp2 (local)
> Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 1: stop ybrpstat_stop_0 on wwwrp1
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [
> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
> Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
> Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
> Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
> Oct 30 01:19:44 wwwrp2 crmd[2462]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (1)
> Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update 12543: fail-count-ybrpstat=1
> Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592384)
> Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update 12545: last-failure-ybrpstat=1414592384
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44
> wwwrp2 pengine[2461]: notice: LogActions: Restart ybrpip#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart ybrpip#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
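[Editor's note] The logs above show lrmd killing the `ybrpstat` monitor at its 20000ms op timeout, after which the fail-count is incremented and the resources are recovered. One common mitigation, since the thread does not show the internals of the `ocf:yb:proxy` agent, is for the agent's own check to be bounded more tightly than the cluster's op timeout so a hung cluster-wide check returns a plain monitor failure instead of being killed by lrmd. A minimal sketch, assuming a hypothetical `STATUS_CMD` standing in for the agent's real check:

```shell
#!/bin/sh
# Sketch only: NOT the actual ocf:yb:proxy agent from the thread.
# Bound the real status check with coreutils `timeout` (15s, i.e. below
# the configured 20s monitor op timeout) so a hang surfaces as a normal
# monitor failure rather than an lrmd-killed operation.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical stand-in for the agent's real (possibly cluster-wide) check.
STATUS_CMD="sleep 0"

if timeout 15 sh -c "$STATUS_CMD"; then
    echo "monitor: OK"
    exit $OCF_SUCCESS
else
    # GNU timeout exits 124 when the command is killed at the deadline.
    echo "monitor: check failed or hung (>15s)"
    exit $OCF_ERR_GENERIC
fi
```

With a guard like this, the `child_timeout_callback` / `operation_finished` warnings above would instead appear as ordinary `rc=1` monitor failures, which makes on-fail=restart and failure-timeout behave as configured.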