On Thu, Feb 28, 2008 at 5:25 AM, Dejan Muhamedagic <[EMAIL PROTECTED]>
wrote:

> Hi,
>
> On Wed, Feb 27, 2008 at 11:45:10AM -0500, Tao Yu wrote:
> > Running heartbeat 2.1.2 on CentOS 4.5.
> > The cluster has just two nodes, boxqaha1 and boxqaha2.
> >
> > After the HA resources had been running on boxqaha1 for some time,
> > boxqaha2 took them over for no apparent reason. Checking heartbeat on
> > boxqaha1 showed that only lrmd was still running there.
> >
> > The following is the section of the log file from when the error happened.
> >
> > cib[26922]: 2008/02/27_06:44:37 info: cib_ccm_msg_callback: PEER: boxqaha1.cybervisiontech.com.ua
> > cib[26922]: 2008/02/27_06:44:54 info: cib_stats: Processed 1 operations (100000.00us average, 0% utilization) in the last 10min
> > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 14130 ms (> 510 ms) before being called (GSource: 0x649118)
> > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274107
> > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node boxqaha1.cybervisiontech.com.ua: interval 15130 ms
> > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node boxqaha2.cybervisiontech.com.ua: interval 15280 ms
> > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 14130 ms (> 510 ms) before being called (GSource: 0x649f88)
> > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274107
> > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 14270 ms (> 5000 ms) before being called (GSource: 0x64a0b8)
> > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274093
> > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 11860 ms (> 5000 ms) before being called (GSource: 0x649bb8)
> > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274334
> > heartbeat[26913]: 2008/02/27_07:46:38 WARN: 413 lost packet(s) for [boxqaha2.cybervisiontech.com.ua] [142807:143221]
> > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> >       ... ... ...
> > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> > heartbeat[26913]: 2008/02/27_07:46:38 info: all clients are now paused
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->ackseq =142199
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->lowseq =142194, hist->hiseq=142300
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: At max 199 pkts missing from boxqaha2.cybervisiontech.com.ua
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: 0: missing pkt: 143208
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: 1: missing pkt: 143209
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: 2: missing pkt: 143210
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: 3: missing pkt: 143211
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: 4: missing pkt: 143212
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: 5: missing pkt: 143213
> >    ... ... ...
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: expecting from boxqaha2.cybervisiontech.com.ua
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: it's ackseq=142199
> > heartbeat[26913]: 2008/02/27_07:46:38 debug:
> > heartbeat[26913]: 2008/02/27_07:46:38 info: killing /usr/lib64/heartbeat/mgmtd -v process group 26927 with signal 15
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/ccm" (501,502) pid 26921
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/cib" (501,502) pid 26922
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/lrmd -r" (0,0) pid 26923
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/stonithd" (0,0) pid 26924
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/attrd" (501,502) pid 26925
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/crmd" (501,502) pid 26926
> > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/mgmtd -v" (0,0) pid 26927
> > mgmtd[26927]: 2008/02/27_07:46:38 info: mgmtd is shutting down
> > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown: Master Control process died.
> > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26913 with SIGTERM
> > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26917 with SIGTERM
> > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26918 with SIGTERM
> > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> > mgmtd[26927]: 2008/02/27_07:46:41 ERROR: Connection to the CIB terminated... exiting
> > ccm[26921]: 2008/02/27_07:46:41 ERROR: Lost connection to heartbeat service. Need to bail out.
> > stonithd[26924]: 2008/02/27_07:46:41 ERROR: Disconnected with heartbeat daemon
> > attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_dispatch: Lost connection to heartbeat service.
> > cib[26922]: 2008/02/27_07:46:41 ERROR: cib_ha_connection_destroy: Heartbeat connection lost!  Exiting.
> > crmd[26926]: 2008/02/27_07:46:41 CRIT: crmd_ha_msg_dispatch: Lost connection to heartbeat service.
> > stonithd[26924]: 2008/02/27_07:46:41 notice: /usr/lib64/heartbeat/stonithd normally quit.
> > attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_connection_destroy: Lost connection to heartbeat service!
> > cib[26922]: 2008/02/27_07:46:41 ERROR: crm_abort: main: Triggered non-fatal assert at main.c:213 : g_hash_table_size(client_list) == 0
> > crmd[26926]: 2008/02/27_07:46:41 info: mem_handle_func:IPC broken, ccm is dead before the client!
> > attrd[26925]: 2008/02/27_07:46:41 info: main: Exiting...
> > cib[26922]: 2008/02/27_07:46:41 WARN: main: Not all clients gone at exit
> > crmd[26926]: 2008/02/27_07:46:41 ERROR: ccm_dispatch: CCM connection appears to have failed: rc=-1.
> > attrd[26925]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug: 4294967393 bytes in 128 byte bucket #2
> > cib[26922]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug: 4294967391 bytes in 128 byte bucket #2
> > crmd[26926]: 2008/02/27_07:46:44 ERROR: do_log: [[FSA]] Input I_ERROR from ccm_dispatch() received in state (S_NOT_DC)
> > crmd[26926]: 2008/02/27_07:46:44 info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK origin=ccm_dispatch ]
> > crmd[26926]: 2008/02/27_07:46:44 ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> > crmd[26926]: 2008/02/27_07:46:45 ERROR: do_log: [[FSA]] Input I_STOP from do_recover() received in state (S_RECOVERY)
> > crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition S_RECOVERY -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL origin=do_recover ]
> > crmd[26926]: 2008/02/27_07:46:45 info: do_dc_release: DC role released
> > crmd[26926]: 2008/02/27_07:46:45 WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
> > crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition S_STOPPING -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_shutdown ]
> > crmd[26926]: 2008/02/27_07:46:45 info: verify_stopped: Checking for active resources before exit
> > crmd[26926]: 2008/02/27_07:46:46 info: verify_stopped: Checking for active resources before exit
> > crmd[26926]: 2008/02/27_07:46:46 info: do_lrm_control: Disconnected from the LRM
> > crmd[26926]: 2008/02/27_07:46:46 info: do_ha_control: Disconnected from Heartbeat
> > crmd[26926]: 2008/02/27_07:46:46 info: do_cib_control: Disconnecting CIB
> > crmd[26926]: 2008/02/27_07:46:46 ERROR: send_ipc_message: IPC Channel to 26922 is not connected
> > crmd[26926]: 2008/02/27_07:46:46 WARN: crm_log_message_adv: #========= IPC[outbound] message start ==========#
> > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG: Dumping message with 5 fields
> > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[0] : [__name__=cib_command]
> > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[1] : [t=cib]
> > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[2] : [cib_op=cib_slave]
> > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[3] : [cib_callid=149]
> > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[4] : [cib_callopt=256]
> > crmd[26926]: 2008/02/27_07:46:46 ERROR: cib_native_perform_op: Sending message to CIB service FAILED
> > crmd[26926]: 2008/02/27_07:46:46 info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> > crmd[26926]: 2008/02/27_07:46:46 info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> > crmd[26926]: 2008/02/27_07:46:46 ERROR: do_exit: Could not recover from internal error
> > crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> > crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> > crmd[26926]: 2008/02/27_07:46:46 info: do_exit: [crmd] stopped (2)
> >
> >
> > "ERROR: Cannot write to media pipe 0: Resource temporarily unavailable"
> > looks caused all this. What is this? the IP stack was messed up? But why
> > heartbeat dies here?
>
> The host was simply overwhelmed and couldn't keep up with the
> communication. There's a known bug in 2.1.2 which is exercised
> when there are many frequent monitor operations. Please
> upgrade to 2.1.3.
>
> Thanks,
>
> Dejan
>
>
Dejan, thanks a lot for the response!

A couple of questions:
  1. Does that mean the overload was caused by having too many monitor
operations running too often? Is there a threshold for the number of monitor
operations and their frequency? In my environment there are about 7 monitors
and each fires every 5 seconds. Is that too much?
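For reference, each of those monitors is declared in our CIB with an op entry
along these lines (the resource name, ids, and timeout below are made up for
illustration; only the 5-second interval matches what we actually run):

    <primitive id="rsc_example" class="ocf" provider="heartbeat" type="IPaddr">
      <operations>
        <!-- one of roughly 7 resources monitored like this; interval="5s"
             runs the resource agent's monitor action every 5 seconds -->
        <op id="rsc_example_mon" name="monitor" interval="5s" timeout="20s"/>
      </operations>
    </primitive>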

  2. It would be very helpful to know the bug number for this problem. Is it
possible to get that information?
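In the meantime I will confirm exactly what is installed on both nodes and
stage the 2.1.3 update, roughly like this (assuming the stock heartbeat RPM
packaging on CentOS):

    # on each node: confirm the currently installed heartbeat version
    rpm -q heartbeat
    # then pull 2.1.3 in from whichever repository carries it
    yum update heartbeat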

Thanks,
Tao
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
