On 25 Mar 2014, at 1:03 am, Drapeau, Mathieu <mathieu.drap...@intel.com> wrote:
> Actually, I was wrong, the version used is 1.1.10.
> So, how can I know which process is taking so long?

top :)  It will tell you where all the CPU is going.

Do you have many resources configured?

>
> thanks
>
> On 3/23/14, 7:35 PM, "Andrew Beekhof" <and...@beekhof.net> wrote:
>
>> On 21 Mar 2014, at 3:57 am, Drapeau, Mathieu <mathieu.drap...@intel.com> wrote:
>>
>>> Hello,
>>> With pacemaker 1.1.8-7 on EL6, crmd died unexpectedly, generating these
>>> logs during a failover:
>>
>> Please update to 1.1.10 from the EL6 update channels:
>>
>> http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/
>>
>>> crmd[10419]: error: crmd_node_update_complete: Node update 79
>>> failed: Timer expired (-62)
>>
>> It looks like your hardware is overloaded and an operation that shouldn't
>> have taken very long has timed out.
>>
>>> crmd[10419]: error: do_log: FSA: Input I_ERROR from
>>> crmd_node_update_complete() received in state S_IDLE
>>> crmd[10419]: notice: do_state_transition: State transition S_IDLE ->
>>> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
>>> origin=crmd_node_update_complete ]
>>> crmd[10419]: warning: do_recover: Fast-tracking shutdown in response
>>> to errors
>>> crmd[10419]: warning: do_election_vote: Not voting in election, we're
>>> in state S_RECOVERY
>>> crmd[10419]: error: do_log: FSA: Input I_TERMINATE from do_recover()
>>> received in state S_RECOVERY
>>> crmd[10419]: notice: lrm_state_verify_stopped: Stopped 0 recurring
>>> operations at shutdown (2 ops remaining)
>>> crmd[10419]: notice: lrm_state_verify_stopped: Recurring action
>>> testfs-MDT0000_6cda68:21 (testfs-MDT0000_6cda68_monitor_5000) incomplete
>>> at shutdown
>>> crmd[10419]: notice: lrm_state_verify_stopped: Recurring action
>>> MGS_f055b7:30 (MGS_f055b7_monitor_5000) incomplete at shutdown
>>> crmd[10419]: error: lrm_state_verify_stopped: 3 resources were
>>> active at shutdown.
>>> crmd[10419]: notice: do_lrm_control: Disconnected from the LRM
>>> crmd[10419]: notice: terminate_cs_connection: Disconnecting from
>>> Corosync
>>> corosync[10370]: [pcmk ] info: pcmk_ipc_exit: Client crmd
>>> (conn=0x2589f40, async-conn=0x2589f40) left
>>> crmd[10419]: error: crmd_fast_exit: Could not recover from internal
>>> error
>>> pacemakerd[10408]: error: pcmk_child_exit: Child process crmd
>>> (10419) exited: Generic Pacemaker error (201)
>>> pacemakerd[10408]: notice: pcmk_process_exit: Respawning failed child
>>> process: crmd
>>>
>>> What could have happened, and how can we prevent crmd from dying?
>>>
>>> Thanks,
>>> Mat
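
A rough sketch of the checks suggested above, assuming a yum-managed EL6 install (the package and channel names here are assumptions, not something confirmed in this thread):

    # Snapshot of the busiest processes -- shows where the CPU is going
    top -b -n 1 | head -20

    # Confirm which Pacemaker version is actually installed
    rpm -q pacemaker

    # Pull the 1.1.10 packages from the EL6 update channel
    yum update pacemaker

    # Rough count of configured resources in the cluster
    crm_resource --list | wc -l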
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org