Andrew Beekhof <andrew@...> writes:

> On 10 Oct 2014, at 12:12 pm, Lax <lkota@...> wrote:
>
> > Hi All,
> >
> > I ran into a timeout issue while failing over from the master to the peer
> > server. I have a 2-node setup with 2 resources. It had been working all
> > along; this is the first time I have seen this issue.
> >
> > It fails with the following error: 'error: process_lrm_event: LRM operation
> > resourceB_stop_0 (40) Timed Out (timeout=20000ms)'.
>
> Have you considered making the timeout longer?
>
> > Here is the complete log snippet from pacemaker; I appreciate your help on this.
> >
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: Diff: +++ 0.3.1 4e9bfa03cf2fef61843c18e127044d81
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: -- <cib admin_epoch="0" epoch="2" num_updates="8" />
> > Oct 9 14:57:38 server1 crmd[373]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: ++ <instance_attributes id="nodes-server1" >
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: ++ <nvpair id="nodes-server1-standby" name="standby" value="true" />
> > Oct 9 14:57:38 server1 cib[368]: notice: cib:diff: ++ </instance_attributes>
> > Oct 9 14:57:38 server1 pengine[372]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Oct 9 14:57:38 server1 pengine[372]: notice: LogActions: Move ClusterIP#011(Started server1 -> 172.28.0.64)
> > Oct 9 14:57:38 server1 pengine[372]: notice: LogActions: Move resourceB#011(Started server1 -> 172.28.0.64)
> > Oct 9 14:57:38 server1 pengine[372]: notice: process_pe_message: Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-1710.bz2
> > Oct 9 14:57:58 server1 lrmd[370]: warning: child_timeout_callback: resourceB_stop_0 process (PID 17327) timed out
> > Oct 9 14:57:58 server1 lrmd[370]: warning: operation_finished: resourceB_stop_0:17327 - timed out after 20000ms
> > Oct 9 14:57:58 server1 lrmd[370]: notice: operation_finished: resourceB_stop_0:17327 [ % Total % Received % Xferd Average Speed Time Time Time Current ]
> > Oct 9 14:57:58 server1 lrmd[370]: notice: operation_finished: resourceB_stop_0:17327 [ Dload Upload Total Spent Left Speed ]
> > Oct 9 14:57:58 server1 lrmd[370]: notice: operation_finished: resourceB_stop_0:17327 [ #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0#015 0 0 0 0 0 0 0 0 --:--:-- 0:00:05 -
> > Oct 9 14:57:58 server1 crmd[373]: error: process_lrm_event: LRM operation resourceB_stop_0 (40) Timed Out (timeout=20000ms)
> > Oct 9 14:57:58 server1 crmd[373]: warning: status_from_rc: Action 10 (resourceB_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
> > Oct 9 14:57:58 server1 crmd[373]: warning: update_failcount: Updating failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY, time=1412891878)
> > Oct 9 14:57:58 server1 attrd[371]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-resourceB (INFINITY)
> > Oct 9 14:57:58 server1 crmd[373]: warning: update_failcount: Updating failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY, time=1412891878)
> > Oct 9 14:57:58 server1 crmd[373]: notice: run_graph: Transition 11 (Complete=2, Pending=0, Fired=0, Skipped=9, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1710.bz2): Stopped
> > Oct 9 14:57:58 server1 attrd[371]: notice: attrd_perform_update: Sent update 11: fail-count-resourceB=INFINITY
> >
> > Thanks
> > Lax
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@...
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org

Thanks for getting back, Andrew.

I have only set a timeout for monitor, not for stop, in the resource definition. Do you mean I should increase the timeout for stop?

Also, when I try to force-stop the pacemaker service, it keeps saying 'Waiting for shutdown of managed resources;' and does not stop. The Pacemaker log says the stop failed because of an unknown error:

Oct 10 00:36:25 server1 crmd[373]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 10 00:51:25 server1 crmd[373]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct 10 00:51:25 server1 pengine[372]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 10 00:51:25 server1 pengine[372]: warning: unpack_rsc_op: Processing failed op stop for resourceB on server1.cisco.com: unknown error (1)
Oct 10 00:51:25 server1 pengine[372]: warning: common_apply_stickiness: Forcing resourceB away from server1.cisco.com after 1000000 failures (max=1000000)
Oct 10 00:51:25 server1 pengine[372]: notice: LogActions: Stop ClusterIP#011(Started 172.28.0.64)
Oct 10 00:51:25 server1 pengine[372]: notice: process_pe_message: Calculated Transition 56: (null)
Oct 10 00:51:25 server1 crmd[373]: notice: run_graph: Transition 56 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=unknown): Complete
Oct 10 00:51:25 server1 crmd[373]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 10 00:53:03 server1 attrd[371]: notice: attrd_trigger_update: Sending flush op to all hosts for: standby (true)
Oct 10 00:53:03 server1 attrd[371]: notice: attrd_perform_update: Sent update 21: standby=true
Oct 10 00:53:03 server1 crmd[373]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 10 00:53:03 server1 pengine[372]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 10 00:53:03 server1 pengine[372]: warning: unpack_rsc_op: Processing failed op stop for resourceB on server1.cisco.com: unknown error (1)
Oct 10 00:53:03 server1 pengine[372]: warning: common_apply_stickiness: Forcing resourceB away from server1.cisco.com after 1000000 failures (max=1000000)
Oct 10 00:53:03 server1 pengine[372]: notice: LogActions: Stop ClusterIP#011(Started 172.28.0.64)
Oct 10 00:53:03 server1 pengine[372]: notice: process_pe_message: Calculated Transition 57: /var/lib/pacemaker/pengine/pe-input-1717.bz2
Oct 10 00:53:03 server1 crmd[373]: notice: run_graph: Transition 57 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1717.bz2): Complete
Oct 10 00:53:03 server1 crmd[373]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

Thanks
Lax
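
On the stop-timeout question above: in the CIB, a timeout is set per operation inside the resource's definition, so a stop timeout would sit alongside the existing monitor entry. The fragment below is only a sketch of what that might look like. The `id` value and the 60s figure are made up for illustration, and the right value depends on how long resourceB's stop action really takes (the log shows it still running a transfer when it was killed at 20000ms). The resource's own class, provider, and type are not shown in this thread, so they are left out.

```xml
<!-- Goes inside the existing <primitive id="resourceB" ...> element. -->
<operations>
  <!-- Keep the monitor op that is already defined here. -->
  <!-- Hypothetical stop op: id and timeout value are assumptions. -->
  <op id="resourceB-stop-0" name="stop" interval="0" timeout="60s"/>
</operations>
```

With no stop op defined, the 20000ms in the log is consistent with the cluster-wide default operation timeout being applied, though the thread does not confirm that.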