Hi Parshvi,

just a quick shot, without having analyzed your mail in detail: please find attached an edited version of the IPaddr2 RA.
I was trying to use the original script a while ago, and basically nothing worked: it did not recognize link failures (because of the way the test was implemented, it only works if you have no more than one IP per interface), there was no proper support for bonding, the IP addresses would not be shifted ... I made some (very minor) changes to get the script working for us. Give it a shot if you want; maybe replacing the RA will already solve your problem.

Cheers,
Mario

On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> Parshvi <parshvi.17@...> writes:
>
> > Hi,
> >
> > The monitor operation of IPaddr2 rsc agent is timing out.
> > Interval: 5s
> > Timeout: 60s
> > The timeout was increased from an earlier 20s to now 60s. Even then, there are
> > multiple logs of monitor op. timing out.
> >
> > 1) What can cause the monitor to take so long ?
> > 2) Looking at the pe-input, what contributes to the operation time ? Is it just
> > the exec-time or exec-time + queue-time ?
> > 3) Any solution proposed ?
> >
> > I have lrm pe-input when the timeout was configured at 20s:
> > Here, is pe-input snapshot where monitor op. timed out (with timeout=20s)
> >
> > <lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf" provider="heartbeat">
> >   <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor"
> >     crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
> >     transition-key="28:0:7:6b445452-980a-455f-8616-7bd12f20843e"
> >     transition-magic="0:7;28:0:7:6b445452-980a-455f-8616-7bd12f20843e"
> >     call-id="10" rc-code="7" op-status="0" interval="0"
> >     last-run="1343738096" last-rc-change="1343738096"
> >     exec-time="20" queue-time="30"
> >     op-digest="f22a042c86b227078b239707d4e4d4a2"/>
> >
> >   <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
> >     transition-key="87:27957:0:6b445452-980a-455f-8616-7bd12f20843e"
> >     transition-magic="0:0;87:27957:0:6b445452-980a-455f-8616-7bd12f20843e"
> >     call-id="83503" rc-code="0" op-status="0" interval="0"
> >     last-run="1343928908" last-rc-change="1343928908"
> >     exec-time="280" queue-time="20"
> >     op-digest="f22a042c86b227078b239707d4e4d4a2"/>
> >
> >   <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000" operation="monitor"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
> >     transition-key="12:27957:0:6b445452-980a-455f-8616-7bd12f20843e"
> >     transition-magic="2:-2;12:27957:0:6b445452-980a-455f-8616-7bd12f20843e"
> >     call-id="83504" rc-code="-2" op-status="2" interval="5000"
> >     last-rc-change="1343928921"
> >     exec-time="20000" queue-time="0"
> >     op-digest="79c3bdd01c6e0fd819484536a54bf7a2"/>
> >
> > (Please note exec-time=20000)
> >
>
> Following are the details of packages:
> cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1)
> resource-agents: # Build version: 7a11934b142d1daf42a04fbaa0391a3ac47cee4c
> CRM Version: 1.0.12 (unknown)
> pacemaker 1.0.12-1.el5.centos - (none) x86_64
> corosync 1.2.7-1.1.el5 - (none) x86_64
> resource-agents 1.0.4-1.1.el5 - (none) x86_64
>
> There are 4 virtual IP resources configured:
>
> Out of these, 3 recover after a monitor timeout but one Virtual IP rsc does not
> recover. Following are the logs that are observed:
>
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: run_graph: Transition 63579 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=8, Source=/var/lib/pengine/pe-input-1660.bz2): Terminated
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: ERROR: te_graph_trigger: Transition failed: terminated
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Graph 63579 (9 actions in 9 synapses): batch-limit=30 jobs, network-delay=60000ms
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 0 was confirmed (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 1 is pending (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 8]: Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 103]: Pending (id: Rsc2_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 2 is pending (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 97]: Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]: Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 3 is pending (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 98]: Pending (id: Rsc1_GroupClusterIP_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]: Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 4 is pending (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 99]: Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]: Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 5 is pending (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 100]: Pending (id: Rsc3_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 99]: Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 6 is pending (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 101]: Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]: Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 7 is pending (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 102]: Pending (id: Rsc4_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 101]: Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 8 is pending (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 36]: Pending (id: all_stopped, type: pseduo, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]: Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: te_graph_trigger: Transition 63579 is now complete
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: notify_crmd: Transition 63579 status: done - <null>
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> 1) Please help as to why a monitor is timing out ?
> 2) Why does one of the VIPs fail to recover after a timeout ?
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
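
To make the exec-time / queue-time numbers in the quoted snapshot easier to compare, here is a rough Python sketch that lists both values per operation and flags any operation whose exec-time reached the configured timeout. It is only an illustration: the function name `summarize_ops` is made up, the values are assumed to be milliseconds (which fits exec-time=20000 against the 20s timeout), and the fragment is trimmed to the attributes of interest, with a closing tag added so that it parses.

```python
import xml.etree.ElementTree as ET

# Attribute values copied from the lrm_rsc_op entries in the quoted mail.
# Unrelated attributes (transition keys, digests) are left out, and a closing
# </lrm_resource> tag is added here only so that the fragment parses.
SNAPSHOT = """
<lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf" provider="heartbeat">
  <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor"
      rc-code="7" op-status="0" interval="0" exec-time="20" queue-time="30"/>
  <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start"
      rc-code="0" op-status="0" interval="0" exec-time="280" queue-time="20"/>
  <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000" operation="monitor"
      rc-code="-2" op-status="2" interval="5000" exec-time="20000" queue-time="0"/>
</lrm_resource>
"""


def summarize_ops(xml_text, timeout_ms):
    """Return one dict per lrm_rsc_op, flagging ops whose exec-time reached timeout_ms.

    Assumption: exec-time and queue-time are given in milliseconds.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for op in root.iter("lrm_rsc_op"):
        exec_ms = int(op.get("exec-time"))
        queue_ms = int(op.get("queue-time"))
        rows.append({
            "id": op.get("id"),
            "exec_ms": exec_ms,
            "queue_ms": queue_ms,
            "total_ms": exec_ms + queue_ms,
            "hit_timeout": exec_ms >= timeout_ms,
        })
    return rows


if __name__ == "__main__":
    for row in summarize_ops(SNAPSHOT, timeout_ms=20000):
        print(row)
```

In this snapshot only the recurring monitor (interval=5000) is flagged, and its queue-time is 0, so here the whole 20s was spent executing the agent rather than waiting in the queue. That says nothing yet about why the agent took that long.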
IPaddr2a
Description: application/shellscript
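
As a rough illustration of the "more than one IP per interface" point: the sketch below is not the code from the attached RA, and it does not reproduce the test in the original script (which is not shown in this mail). It only shows the general idea of looking for one specific address among all addresses on an interface instead of assuming there is a single one. The input is assumed to be text in the one-line format printed by `ip -o -f inet addr show dev <if>`; the function names and the sample addresses are made up for the example.

```python
import ipaddress


def addresses_on_interface(ip_output):
    """Collect every IPv4 address found in `ip -o -f inet addr show` style text."""
    found = []
    for line in ip_output.splitlines():
        fields = line.split()
        # One-line format: "<index>: <ifname> inet <addr>/<prefix> ..."
        if "inet" in fields:
            pos = fields.index("inet") + 1
            if pos < len(fields):
                found.append(ipaddress.ip_interface(fields[pos]).ip)
    return found


def has_address(ip_output, wanted):
    """True if `wanted` is among the addresses, however many the interface carries."""
    return ipaddress.ip_address(wanted) in addresses_on_interface(ip_output)


# Illustrative sample only (documentation address range), not taken from any real host.
SAMPLE = (
    "2: eth0    inet 192.0.2.10/24 brd 192.0.2.255 scope global eth0\n"
    "2: eth0    inet 192.0.2.20/24 scope global secondary eth0\n"
)

if __name__ == "__main__":
    print(has_address(SAMPLE, "192.0.2.20"))
    print(has_address(SAMPLE, "192.0.2.30"))
```

Checking membership in the full address list keeps the result independent of the order in which primary and secondary addresses are listed.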