Hi,

On Thu, Feb 28, 2008 at 01:53:08PM -0500, Tao Yu wrote:
> On Thu, Feb 28, 2008 at 5:25 AM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> 
> > Hi,
> >
> > On Wed, Feb 27, 2008 at 11:45:10AM -0500, Tao Yu wrote:
> > > Running heartbeat 2.1.2 on Centos4.5
> > > The cluster has just two nodes,  boxqaha1 and boxqaha2
> > >
> > > Leave the HA resources running boxqaha1 for some time and found out the
> > > boxqaha2 took over resources for no reason. Checked the heartbeat on
> > > boxqaha1, found out only lrmd was still running on boxqaha1.
> > >
> > > Following is the section of log file when error happened.
> > >
> > > cib[26922]: 2008/02/27_06:44:37 info: cib_ccm_msg_callback: PEER: boxqaha1.cybervisiontech.com.ua
> > > cib[26922]: 2008/02/27_06:44:54 info: cib_stats: Processed 1 operations ( 100000.00us average, 0% utilization) in the last 10min
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 14130 ms (> 510 ms) before being called (GSource: 0x649118)
> > > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274107
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node boxqaha1.cybervisiontech.com.ua: interval 15130 ms
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node boxqaha2.cybervisiontech.com.ua: interval 15280 ms
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 14130 ms (> 510 ms) before being called (GSource: 0x649f88)
> > > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274107
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 14270 ms (> 5000 ms) before being called (GSource: 0x64a0b8)
> > > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274093
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 11860 ms (> 5000 ms) before being called (GSource: 0x649bb8)
> > > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274334
> > > heartbeat[26913]: 2008/02/27_07:46:38 WARN: 413 lost packet(s) for [ boxqaha2.cybervisiontech.com.ua] [142807:143221]
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> > >       ... ... ...
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> > > heartbeat[26913]: 2008/02/27_07:46:38 info: all clients are now paused
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->ackseq =142199
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->lowseq =142194, hist->hiseq=142300
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: At max 199 pkts missing from boxqaha2.cybervisiontech.com.ua
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 0: missing pkt: 143208
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 1: missing pkt: 143209
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 2: missing pkt: 143210
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 3: missing pkt: 143211
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 4: missing pkt: 143212
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 5: missing pkt: 143213
> > >    ... ... ...
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: expecting from boxqaha2.cybervisiontech.com.ua
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: it's ackseq=142199
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug:
> > > heartbeat[26913]: 2008/02/27_07:46:38 info: killing /usr/lib64/heartbeat/mgmtd -v process group 26927 with signal 15
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/ccm" (501,502) pid 26921
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/cib" (501,502) pid 26922
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/lrmd -r" (0,0) pid 26923
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/stonithd" (0,0) pid 26924
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/attrd" (501,502) pid 26925
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/crmd" (501,502) pid 26926
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/mgmtd -v" (0,0) pid 26927
> > > mgmtd[26927]: 2008/02/27_07:46:38 info: mgmtd is shutting down
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown: Master Control process died.
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26913 with SIGTERM
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26917 with SIGTERM
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26918 with SIGTERM
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> > > mgmtd[26927]: 2008/02/27_07:46:41 ERROR: Connection to the CIB terminated... exiting
> > > ccm[26921]: 2008/02/27_07:46:41 ERROR: Lost connection to heartbeat service. Need to bail out.
> > > stonithd[26924]: 2008/02/27_07:46:41 ERROR: Disconnected with heartbeat daemon
> > > attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_dispatch: Lost connection to heartbeat service.
> > > cib[26922]: 2008/02/27_07:46:41 ERROR: cib_ha_connection_destroy: Heartbeat connection lost!  Exiting.
> > > crmd[26926]: 2008/02/27_07:46:41 CRIT: crmd_ha_msg_dispatch: Lost connection to heartbeat service.
> > > stonithd[26924]: 2008/02/27_07:46:41 notice: /usr/lib64/heartbeat/stonithd normally quit.
> > > attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_connection_destroy: Lost connection to heartbeat service!
> > > cib[26922]: 2008/02/27_07:46:41 ERROR: crm_abort: main: Triggered non-fatal assert at main.c:213 : g_hash_table_size(client_list) == 0
> > > crmd[26926]: 2008/02/27_07:46:41 info: mem_handle_func:IPC broken, ccm is dead before the client!
> > > attrd[26925]: 2008/02/27_07:46:41 info: main: Exiting...
> > > cib[26922]: 2008/02/27_07:46:41 WARN: main: Not all clients gone at exit
> > > crmd[26926]: 2008/02/27_07:46:41 ERROR: ccm_dispatch: CCM connection appears to have failed: rc=-1.
> > > attrd[26925]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug: 4294967393 bytes in 128 byte bucket #2
> > > cib[26922]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug: 4294967391 bytes in 128 byte bucket #2
> > > crmd[26926]: 2008/02/27_07:46:44 ERROR: do_log: [[FSA]] Input I_ERROR from ccm_dispatch() received in state (S_NOT_DC)
> > > crmd[26926]: 2008/02/27_07:46:44 info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK origin=ccm_dispatch ]
> > > crmd[26926]: 2008/02/27_07:46:44 ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> > > crmd[26926]: 2008/02/27_07:46:45 ERROR: do_log: [[FSA]] Input I_STOP from do_recover() received in state (S_RECOVERY)
> > > crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition S_RECOVERY -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL origin=do_recover ]
> > > crmd[26926]: 2008/02/27_07:46:45 info: do_dc_release: DC role released
> > > crmd[26926]: 2008/02/27_07:46:45 WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
> > > crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition S_STOPPING -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_shutdown ]
> > > crmd[26926]: 2008/02/27_07:46:45 info: verify_stopped: Checking for active resources before exit
> > > crmd[26926]: 2008/02/27_07:46:46 info: verify_stopped: Checking for active resources before exit
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_lrm_control: Disconnected from the LRM
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_ha_control: Disconnected from Heartbeat
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_cib_control: Disconnecting CIB
> > > crmd[26926]: 2008/02/27_07:46:46 ERROR: send_ipc_message: IPC Channel to 26922 is not connected
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: crm_log_message_adv: #========= IPC[outbound] message start ==========#
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG: Dumping message with 5 fields
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[0] : [__name__=cib_command]
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[1] : [t=cib]
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[2] : [cib_op=cib_slave]
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[3] : [cib_callid=149]
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[4] : [cib_callopt=256]
> > > crmd[26926]: 2008/02/27_07:46:46 ERROR: cib_native_perform_op: Sending message to CIB service FAILED
> > > crmd[26926]: 2008/02/27_07:46:46 info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> > > crmd[26926]: 2008/02/27_07:46:46 ERROR: do_exit: Could not recover from internal error
> > > crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> > > crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_exit: [crmd] stopped (2)
> > >
> > >
> > > "ERROR: Cannot write to media pipe 0: Resource temporarily unavailable"
> > > looks caused all this. What is this? the IP stack was messed up? But why
> > > heartbeat dies here?
> >
> > The host was simply overwhelmed and couldn't keep up with the
> > communication. There's a known bug in 2.1.2 which is exercised
> > when there are many frequent monitor operations. Please
> > upgrade to 2.1.3.
> >
> > Thanks,
> >
> > Dejan
> >
> >
> Dejan, thanks a lot for the response!
> 
> Couple of questions,
>   1. Does this mean the overload was caused by having too many frequent
> monitor operations? Is there a threshold for the number of monitor
> operations and how often they run? In my environment there are about 7
> monitors and each fires every 5 seconds. Is that too much?

No, that's not too many as far as performance goes, provided the host
is reasonably powerful. Note that every monitor operation forks a new
process and invokes an external shell script, i.e. the resource agent.
How often you run monitors is largely a matter of preference and
depends on the resource.
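
Just to make that knob concrete: the interval is simply an attribute of
the monitor <op> in the CIB. Below is a minimal sketch of a primitive
with such an operation; the resource name, address and ids are made up
for illustration, so adapt them to your configuration:

<primitive id="ip_virtual" class="ocf" provider="heartbeat" type="IPaddr2">
  <operations>
    <!-- each monitor run forks a process and invokes the IPaddr2 RA;
         raising the interval (say to 30s) directly reduces that load -->
    <op id="ip_virtual_mon" name="monitor" interval="5s" timeout="20s"/>
  </operations>
  <instance_attributes id="ip_virtual_ia">
    <attributes>
      <nvpair id="ip_virtual_addr" name="ip" value="192.168.1.100"/>
    </attributes>
  </instance_attributes>
</primitive>

If you do decide to relax some of the 5s intervals, changing that one
attribute and reloading the resource definition (e.g. with cibadmin or
the GUI) is all it takes.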

>   2. It would be very helpful to know the bug number for this problem. Is
> it possible to get this info?

The bug is described here:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1697

I'd suggest upgrading to 2.1.3. Alternatively, you can apply the
fix to 2.1.2.
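
In case it helps, here is a rough sketch of how I'd sanity-check things
around such an upgrade on CentOS; the package names below reflect how
the 2.1.x RPMs are usually split, so adjust them to however heartbeat
was installed on your nodes:

# confirm what is installed before and after the upgrade
rpm -q heartbeat heartbeat-pils heartbeat-stonith
# is heartbeat currently running on this node?
cl_status hbstatus
# after installing the 2.1.3 packages, restart one node at a time
/etc/init.d/heartbeat restart

Restarting one node at a time lets the other node keep running the
resources while you upgrade.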

Thanks,

Dejan


> Thanks,
> Tao
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
