Hi,

On Thu, Feb 28, 2008 at 01:53:08PM -0500, Tao Yu wrote:
> On Thu, Feb 28, 2008 at 5:25 AM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > On Wed, Feb 27, 2008 at 11:45:10AM -0500, Tao Yu wrote:
> > > Running heartbeat 2.1.2 on CentOS 4.5.
> > > The cluster has just two nodes, boxqaha1 and boxqaha2.
> > >
> > > The HA resources had been running on boxqaha1 for some time when
> > > boxqaha2 took them over for no apparent reason. Checking heartbeat
> > > on boxqaha1 showed that only lrmd was still running there.
> > >
> > > The following is the section of the log file from when the error
> > > happened:
> > >
> > > cib[26922]: 2008/02/27_06:44:37 info: cib_ccm_msg_callback: PEER: boxqaha1.cybervisiontech.com.ua
> > > cib[26922]: 2008/02/27_06:44:54 info: cib_stats: Processed 1 operations (100000.00us average, 0% utilization) in the last 10min
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 14130 ms (> 510 ms) before being called (GSource: 0x649118)
> > > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274107
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node boxqaha1.cybervisiontech.com.ua: interval 15130 ms
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Late heartbeat: Node boxqaha2.cybervisiontech.com.ua: interval 15280 ms
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 14130 ms (> 510 ms) before being called (GSource: 0x649f88)
> > > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274107
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 14270 ms (> 5000 ms) before being called (GSource: 0x64a0b8)
> > > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274093
> > > heartbeat[26913]: 2008/02/27_07:46:37 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 11860 ms (> 5000 ms) before being called (GSource: 0x649bb8)
> > > heartbeat[26913]: 2008/02/27_07:46:37 info: Gmain_timeout_dispatch: started at 654275520 should have started at 654274334
> > > heartbeat[26913]: 2008/02/27_07:46:38 WARN: 413 lost packet(s) for [boxqaha2.cybervisiontech.com.ua] [142807:143221]
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> > > ... ... ...
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Cannot write to media pipe 0: Resource temporarily unavailable
> > > heartbeat[26913]: 2008/02/27_07:46:38 ERROR: Shutting down.
> > > heartbeat[26913]: 2008/02/27_07:46:38 info: all clients are now paused
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->ackseq =142199
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: hist->lowseq =142194, hist->hiseq=142300
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: At max 199 pkts missing from boxqaha2.cybervisiontech.com.ua
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 0: missing pkt: 143208
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 1: missing pkt: 143209
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 2: missing pkt: 143210
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 3: missing pkt: 143211
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 4: missing pkt: 143212
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: 5: missing pkt: 143213
> > > ... ... ...
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: expecting from boxqaha2.cybervisiontech.com.ua
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: it's ackseq=142199
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug:
> > > heartbeat[26913]: 2008/02/27_07:46:38 info: killing /usr/lib64/heartbeat/mgmtd -v process group 26927 with signal 15
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/ccm" (501,502) pid 26921
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/cib" (501,502) pid 26922
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/lrmd -r" (0,0) pid 26923
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/stonithd" (0,0) pid 26924
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/attrd" (501,502) pid 26925
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/crmd" (501,502) pid 26926
> > > heartbeat[26913]: 2008/02/27_07:46:38 debug: RUNNING Child client "/usr/lib64/heartbeat/mgmtd -v" (0,0) pid 26927
> > > mgmtd[26927]: 2008/02/27_07:46:38 info: mgmtd is shutting down
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown: Master Control process died.
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26913 with SIGTERM
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26917 with SIGTERM
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Killing pid 26918 with SIGTERM
> > > heartbeat[26916]: 2008/02/27_07:46:38 CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> > > mgmtd[26927]: 2008/02/27_07:46:41 ERROR: Connection to the CIB terminated... exiting
> > > ccm[26921]: 2008/02/27_07:46:41 ERROR: Lost connection to heartbeat service. Need to bail out.
> > > stonithd[26924]: 2008/02/27_07:46:41 ERROR: Disconnected with heartbeat daemon
> > > attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_dispatch: Lost connection to heartbeat service.
> > > cib[26922]: 2008/02/27_07:46:41 ERROR: cib_ha_connection_destroy: Heartbeat connection lost! Exiting.
> > > crmd[26926]: 2008/02/27_07:46:41 CRIT: crmd_ha_msg_dispatch: Lost connection to heartbeat service.
> > > stonithd[26924]: 2008/02/27_07:46:41 notice: /usr/lib64/heartbeat/stonithd normally quit.
> > > attrd[26925]: 2008/02/27_07:46:41 CRIT: attrd_ha_connection_destroy: Lost connection to heartbeat service!
> > > cib[26922]: 2008/02/27_07:46:41 ERROR: crm_abort: main: Triggered non-fatal assert at main.c:213 : g_hash_table_size(client_list) == 0
> > > crmd[26926]: 2008/02/27_07:46:41 info: mem_handle_func:IPC broken, ccm is dead before the client!
> > > attrd[26925]: 2008/02/27_07:46:41 info: main: Exiting...
> > > cib[26922]: 2008/02/27_07:46:41 WARN: main: Not all clients gone at exit
> > > crmd[26926]: 2008/02/27_07:46:41 ERROR: ccm_dispatch: CCM connection appears to have failed: rc=-1.
> > > attrd[26925]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug: 4294967393 bytes in 128 byte bucket #2
> > > cib[26922]: 2008/02/27_07:46:41 ERROR: cl_malloc: bucket size bug: 4294967391 bytes in 128 byte bucket #2
> > > crmd[26926]: 2008/02/27_07:46:44 ERROR: do_log: [[FSA]] Input I_ERROR from ccm_dispatch() received in state (S_NOT_DC)
> > > crmd[26926]: 2008/02/27_07:46:44 info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_CCM_CALLBACK origin=ccm_dispatch ]
> > > crmd[26926]: 2008/02/27_07:46:44 ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> > > crmd[26926]: 2008/02/27_07:46:45 ERROR: do_log: [[FSA]] Input I_STOP from do_recover() received in state (S_RECOVERY)
> > > crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition S_RECOVERY -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL origin=do_recover ]
> > > crmd[26926]: 2008/02/27_07:46:45 info: do_dc_release: DC role released
> > > crmd[26926]: 2008/02/27_07:46:45 WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_STOPPING)
> > > crmd[26926]: 2008/02/27_07:46:45 info: do_state_transition: State transition S_STOPPING -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_shutdown ]
> > > crmd[26926]: 2008/02/27_07:46:45 info: verify_stopped: Checking for active resources before exit
> > > crmd[26926]: 2008/02/27_07:46:46 info: verify_stopped: Checking for active resources before exit
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_lrm_control: Disconnected from the LRM
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_ha_control: Disconnected from Heartbeat
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_cib_control: Disconnecting CIB
> > > crmd[26926]: 2008/02/27_07:46:46 ERROR: send_ipc_message: IPC Channel to 26922 is not connected
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: crm_log_message_adv: #========= IPC[outbound] message start ==========#
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG: Dumping message with 5 fields
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[0] : [__name__=cib_command]
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[1] : [t=cib]
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[2] : [cib_op=cib_slave]
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[3] : [cib_callid=149]
> > > crmd[26926]: 2008/02/27_07:46:46 WARN: MSG[4] : [cib_callopt=256]
> > > crmd[26926]: 2008/02/27_07:46:46 ERROR: cib_native_perform_op: Sending message to CIB service FAILED
> > > crmd[26926]: 2008/02/27_07:46:46 info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> > > crmd[26926]: 2008/02/27_07:46:46 ERROR: do_exit: Could not recover from internal error
> > > crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> > > crmd[26926]: 2008/02/27_07:46:46 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=verify_stopped ]
> > > crmd[26926]: 2008/02/27_07:46:46 info: do_exit: [crmd] stopped (2)
> > >
> > > "ERROR: Cannot write to media pipe 0: Resource temporarily unavailable"
> > > looks like what caused all this. What is this error? Was the IP stack
> > > messed up? And why does heartbeat die here?
> >
> > The host was simply overwhelmed and couldn't keep up with the
> > communication. There's a known bug in 2.1.2 which is exercised
> > when there are many frequent monitor operations. Please
> > upgrade to 2.1.3.
> >
> > Thanks,
> >
> > Dejan
>
> Dejan, thanks a lot for the response!
>
> A couple of questions:
>
> 1. Does this mean the overload was caused by having too many, too
> frequent monitor operations? Is there a threshold for the number of
> monitor operations and their frequency? In my environment there are
> about 7 monitors and each fires every 5 seconds. Is that too much?

No, that's not too many as far as performance goes, provided the
host is powerful enough. Note that every monitor operation consists
of forking a new process and invoking an external shell script,
i.e. a resource agent. How often you run monitors is largely a
matter of preference, and depends on the resource.
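To give you an idea of what each monitor amounts to: roughly speaking,
the LRM runs your agent with the action name as its argument and looks
at the exit code. A minimal sketch of the monitor path of such a script
(illustrative only; "mydaemon" is a placeholder for whatever your
resource runs, and real agents do rather more checking than this):

    #!/bin/sh
    # Minimal sketch of a resource agent's monitor action.
    # "mydaemon" is a placeholder for the process the resource manages.
    case "$1" in
    monitor)
        if pidof mydaemon >/dev/null 2>&1; then
            exit 0      # OCF_SUCCESS: the resource is running
        else
            exit 7      # OCF_NOT_RUNNING: the resource is stopped
        fi
        ;;
    start|stop|status|meta-data)
        # ... a real agent implements the other actions here ...
        exit 0
        ;;
    esac

At 7 monitors every 5 seconds that is one or two fork/execs per second
on average: negligible on a healthy host, but it does add to the load
once the machine is already struggling.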
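Until you upgrade, it may also be worth watching for the early symptoms
you already had: the delayed Gmain_timeout_dispatch and "Late heartbeat"
warnings show up well before the node shuts itself down. A rough sketch
(the log path is an assumption; use whatever the logfile directive in
your ha.cf points to):

    #!/bin/sh
    # Count overload symptoms in the heartbeat log. The default path is
    # an assumption -- see the "logfile" directive in your ha.cf.
    LOG=${1:-/var/log/ha-log}
    for pat in 'Late heartbeat' 'Gmain_timeout_dispatch' 'lost packet'; do
        printf '%-25s %s\n' "$pat:" "$(grep -c "$pat" "$LOG")"
    done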
> 2. It would be very helpful to know the bug number for this problem.
> Is it possible to get this info?

The bug is described here:

http://developerbugs.linux-foundation.org/show_bug.cgi?id=1697

I'd suggest upgrading to 2.1.3. Alternatively, you can apply the fix
to 2.1.2.

Thanks,

Dejan

> Thanks,
> Tao

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems