Good news: the fix was to source /etc/profile.local from heartbeat's startup script. /etc/profile.local mainly sets LD_LIBRARY_PATH to include some platform-specific shared object libraries:
--- ./heartbeat/init.d/heartbeat.in	2007-04-23 10:32:16.000000000 +0200
+++ ./heartbeat/init.d/heartbeat.in.patched	2007-06-05 15:56:06.000177000 +0200
@@ -43,6 +43,8 @@
 # Default-Stop: 0 6
 ### END INIT INFO
 
+# define LD_LIBRARY_PATH
+test -s /etc/profile.local && . /etc/profile.local
 [EMAIL PROTECTED]@/ha.d; export HA_DIR
 CONFIG=$HA_DIR/ha.cf

The crashes reported over the last week are all gone now! Thank you very much for your help.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of ext Andrew Beekhof
Sent: Tuesday, June 5, 2007 12:27
To: High-Availability Linux Development List
Subject: Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386

On 6/5/07, Otte, Joerg <[EMAIL PROTECTED]> wrote:
>
> > were there any errors before the crash?
> The only errors in the logs are several:
> heartbeat[536]: 2007/06/05_10:01:03 WARN: Exiting
> /usr/sfw/lib/python2.3/heartbeat/cib process 572 killed by signal 9.

signal 9? are you sending that signal?

> heartbeat[536]: 2007/06/05_10:01:03 ERROR: Respawning client
> "/usr/sfw/lib/python2.3/heartbeat/cib":

that's an odd place to put heartbeat :-)

> > in the meantime, you can probably just comment out that line as the
> > cib is seconds away from exiting and is "just" cleaning up.
> You mean to remove the statement "hb_conn->llc_ops->delete(hb_conn);" in
> main.c ?

right

> I already tried this. But now I get different cores.
> In the logs I see only cib crashing but the core is from crmd:
>
> bcm20-b:/ # pstack /var/ha/local/lib/heartbeat/cores/hacluster/core
> core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 576: /usr/sfw/lib/python2.3/heartbeat/crmd
>  080580e8 do_exit (40000000, 0, d, b, 17, 8091528) + ac
>  08055f1f do_fsa_action (40000000, 0, 805803c, 1, 600002a8, a) + c3
>  080561ea s_crmd_fsa_actions (8091528, 2000420, 0, 10000000, 4, 0) + 76
>  080579e1 s_crmd_fsa (d, 808ebf0, 8047738, 805fc47) + 225
>  0805fc69 crm_fsa_trigger (0, 0, 8047778, fef9380f) + 2d
>  fef93857 G_TRIG_dispatch (808ebf0, 0, 0, 0) + a7
>  fedbc77f g_main_context_dispatch (808b9f8, ffffff9c, 8089dd0, 0) + 1e7
>  fedbe065 g_main_context_iterate (1, 808c6c0, 8047858, fedbe141, 80554d7, 1) + 41d
>  fedbe2c0 g_main_loop_run (8089db8, 806f248, 806caa4, 806f21d, 0, 0) + 19c
>  080554d7 crmd_init (80478a0, 80553f1, 808900c, 808901c, 0, 806ca2e) + b3
>  08055748 main (1, 80478d0, 80478d8) + f0
>  080552cc _start (1, 8047a40, 0, 8047a66, 8047a78, 8047a87) + 80
> bcm20-b:/ # l /var/ha/local/lib/heartbeat/cores/hacluster/core
> -rw-------   1 hacluster haclient 5672714 Jun  5 10:01 /var/ha/local/lib/heartbeat/cores/hacluster/core
> bcm20-b:/ #
>
> GDB:
> Reading symbols from /lib/libavl.so.1...done.
> Loaded symbols for /lib/libavl.so.1
> #0  0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
> ) at control.c:188
> 188             fsa_cluster_conn->llc_ops->delete(fsa_cluster_conn);

does the memory pointed to by fsa_cluster_conn seem sane?
if so then it's probably the same bug (and the same workaround would apply)

> (gdb) where
> #0  0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
> ) at control.c:188
> #1  0x08055f1f in do_fsa_action (fsa_data=0x8091528, an_action=Unhandled dwarf expression opcode 0x93
> ) at fsa.c:177
> #2  0x080561ea in s_crmd_fsa_actions (fsa_data=0x8091528) at fsa.c:541
> #3  0x080579e1 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:314
> #4  0x0805fc69 in crm_fsa_trigger (user_data=0x0) at callbacks.c:661
> #5  0xfef93857 in G_TRIG_dispatch (source=0x808ebf0, callback=0, user_data=0x0) at GSource.c:1349
> #6  0xfedbc77f in g_main_context_dispatch () from /usr/local/lib/libglib-2.0.so.0
> #7  0xfedbe065 in g_main_context_iterate () from /usr/local/lib/libglib-2.0.so.0
> #8  0xfedbe2c0 in g_main_loop_run () from /usr/local/lib/libglib-2.0.so.0
> #9  0x080554d7 in crmd_init () at main.c:155
> #10 0x08055748 in main (argc=1, argv=0x80478d0) at main.c:122
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of ext Andrew Beekhof
> Sent: Monday, June 4, 2007 16:45
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386
>
> On 6/4/07, Otte, Joerg <[EMAIL PROTECTED]> wrote:
> > OK, the patch works.
> > But now I stumbled across the next crash.
> > I am now trying the 2.09 version from SuSE.
> > Cib crashes shortly after a reboot of the second node:
> >
> > #0  0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> > 2028            if (ch->ops->send(ch, imsg) != IPC_OK) {
> > (gdb) where
> > #0  0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> > #1  0xfedc51cb in hb_api_signoff (cinfo=0xffffffff, need_destroy_chan=1) at client_lib.c:470
> > #2  0xfedc5351 in hb_api_delete (ci=0x80a0dc0) at client_lib.c:501
> > #3  0x0805f1a2 in main (argc=1, argv=0x8047770) at main.c:216
> >
> > (gdb) p *ch
> > $2 = {ch_status = 134814800, farside_pid = -1, ch_private = 0x807b9e0, ops = 0xffffffff, msgpad = 0,
> >   bytes_remaining = 4294967295, should_send_block = 0, send_queue = 0xffffffff, recv_queue = 0xffffffff,
> >   pool = 0xffffffff, high_flow_mark = -1, low_flow_mark = -1, high_flow_userdata = 0xffffffff,
> >   low_flow_userdata = 0xffffffff, high_flow_callback = 0xffffffff, low_flow_callback = 0xffffffff, conntype = -1,
> >   failreason = '\000' <repeats 128 times>}
> >
> > (gdb) p *ch.ops
> > Cannot access memory at address 0xffffffff
> > (gdb)

ok, this is a little more serious.
looks like there is a problem in the heartbeat api :-(

can you create a bug for this please?  If you use the "other"
component it will get assigned to the right person (Alan).

in the meantime, you can probably just comment out that line as the
cib is seconds away from exiting and is "just" cleaning up.

were there any errors before the crash?
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
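For anyone hitting the same problem: the fix at the top of the thread works by sourcing /etc/profile.local so that LD_LIBRARY_PATH is exported before heartbeat's daemons start. A hypothetical sketch of such a file follows; the library paths are illustrative only, not the original poster's file:

```shell
#!/bin/sh
# Hypothetical /etc/profile.local -- real paths are site-specific.
# Prepend the platform-specific library directories, preserving any
# LD_LIBRARY_PATH already set in the environment.
LD_LIBRARY_PATH=/usr/local/lib:/usr/sfw/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH

# The init script then picks this up with the guarded source the patch adds:
#   test -s /etc/profile.local && . /etc/profile.local
```

The `test -s` guard in the patch matters: it makes the init script skip the source when the file is missing or empty instead of failing at boot.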