Good news:
I had to include /etc/profile.local in heartbeat's startup script.
/etc/profile.local mainly defines LD_LIBRARY_PATH with some platform-specific
shared object libraries:

--- ./heartbeat/init.d/heartbeat.in     2007-04-23 10:32:16.000000000 +0200
+++ ./heartbeat/init.d/heartbeat.in.patched     2007-06-05 15:56:06.000177000 +0200
@@ -43,6 +43,8 @@
 # Default-Stop: 0 6
 ### END INIT INFO
 
+# define LD_LIBRARY_PATH
+test -s /etc/profile.local && . /etc/profile.local
 
 [EMAIL PROTECTED]@/ha.d; export HA_DIR
 CONFIG=$HA_DIR/ha.cf

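For illustration only (the actual file is not shown in this thread), an /etc/profile.local along these lines is what the patched init script would source; the specific Solaris library paths here are assumptions, not taken from the reporter's system:

```shell
# Hypothetical /etc/profile.local -- contents assumed, not from the thread.
# Prepend platform-specific library directories so heartbeat's processes
# can resolve their shared objects at startup.
LD_LIBRARY_PATH=/usr/sfw/lib:/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
```

The `test -s ... && . ...` guard in the patch means a missing or empty file is silently skipped, so the init script still works on hosts without it.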

The crashes reported during the last week are all gone now!
Thank you very much for your help.





-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of ext Andrew Beekhof
Sent: Tuesday, June 5, 2007 12:27
To: High-Availability Linux Development List
Subject: Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386

On 6/5/07, Otte, Joerg <[EMAIL PROTECTED]> wrote:
>
> > were there any errors before the crash?
> The only error in the logs are several:
> heartbeat[536]: 2007/06/05_10:01:03 WARN: Exiting 
> /usr/sfw/lib/python2.3/heartbeat/cib process 572 killed by signal 9.

signal 9?  are you sending that signal?
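(As background to the question above: signal 9 is SIGKILL, which a process cannot catch, block, or ignore, so the "killed by signal 9" warning means something external terminated the cib. A generic POSIX shell demonstration, not specific to heartbeat, of how a SIGKILL'd process reports its exit status:)

```shell
# Start a throwaway process, kill it with SIGKILL, and inspect the status.
sleep 60 &
pid=$!
kill -9 "$pid"        # SIGKILL: cannot be trapped by the target process
wait "$pid"           # reap it; the shell reports 128 + signal number
status=$?
echo "exit status: $status"   # 137 = 128 + 9, i.e. death by SIGKILL
```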

> heartbeat[536]: 2007/06/05_10:01:03 ERROR: Respawning client 
> "/usr/sfw/lib/python2.3/heartbeat/cib":

that's an odd place to put heartbeat :-)

>
>
> > in the meantime, you can probably just comment out that line as the
> > cib is seconds away from exiting and is "just" cleaning up.
> You mean to remove the statement "hb_conn->llc_ops->delete(hb_conn);" in 
> main.c ?

right

> I already tried this. But now I get different cores.
> In the logs I see only cib crashing but the core is from crmd:
>
>
> bcm20-b:/ # pstack /var/ha/local/lib/heartbeat/cores/hacluster/core
> core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 576: 
> /usr/sfw/lib/python2.3/heartbeat/crmd
>  080580e8 do_exit  (40000000, 0, d, b, 17, 8091528) + ac
>  08055f1f do_fsa_action (40000000, 0, 805803c, 1, 600002a8, a) + c3
>  080561ea s_crmd_fsa_actions (8091528, 2000420, 0, 10000000, 4, 0) + 76
>  080579e1 s_crmd_fsa (d, 808ebf0, 8047738, 805fc47) + 225
>  0805fc69 crm_fsa_trigger (0, 0, 8047778, fef9380f) + 2d
>  fef93857 G_TRIG_dispatch (808ebf0, 0, 0, 0) + a7
>  fedbc77f g_main_context_dispatch (808b9f8, ffffff9c, 8089dd0, 0) + 1e7
>  fedbe065 g_main_context_iterate (1, 808c6c0, 8047858, fedbe141, 80554d7, 1) + 41d
>  fedbe2c0 g_main_loop_run (8089db8, 806f248, 806caa4, 806f21d, 0, 0) + 19c
>  080554d7 crmd_init (80478a0, 80553f1, 808900c, 808901c, 0, 806ca2e) + b3
>  08055748 main     (1, 80478d0, 80478d8) + f0
>  080552cc _start   (1, 8047a40, 0, 8047a66, 8047a78, 8047a87) + 80
> bcm20-b:/ # l /var/ha/local/lib/heartbeat/cores/hacluster/core
> -rw-------   1 hacluster haclient 5672714 Jun  5 10:01 
> /var/ha/local/lib/heartbeat/cores/hacluster/core
> bcm20-b:/ #
>
> GDB:
> Reading symbols from /lib/libavl.so.1...done.
> Loaded symbols for /lib/libavl.so.1
> #0  0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
> ) at control.c:188
> 188             fsa_cluster_conn->llc_ops->delete(fsa_cluster_conn);

does the memory pointed to by fsa_cluster_conn seem sane?

if so, then it's probably the same bug (and the same workaround would apply)

> (gdb) where
> #0  0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
> ) at control.c:188
> #1  0x08055f1f in do_fsa_action (fsa_data=0x8091528, an_action=Unhandled 
> dwarf expression opcode 0x93
> ) at fsa.c:177
> #2  0x080561ea in s_crmd_fsa_actions (fsa_data=0x8091528) at fsa.c:541
> #3  0x080579e1 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:314
> #4  0x0805fc69 in crm_fsa_trigger (user_data=0x0) at callbacks.c:661
> #5  0xfef93857 in G_TRIG_dispatch (source=0x808ebf0, callback=0, 
> user_data=0x0) at GSource.c:1349
> #6  0xfedbc77f in g_main_context_dispatch () from 
> /usr/local/lib/libglib-2.0.so.0
> #7  0xfedbe065 in g_main_context_iterate () from 
> /usr/local/lib/libglib-2.0.so.0
> #8  0xfedbe2c0 in g_main_loop_run () from /usr/local/lib/libglib-2.0.so.0
> #9  0x080554d7 in crmd_init () at main.c:155
> #10 0x08055748 in main (argc=1, argv=0x80478d0) at main.c:122
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of ext Andrew Beekhof
> Sent: Monday, June 4, 2007 16:45
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386
>
> On 6/4/07, Otte, Joerg <[EMAIL PROTECTED]> wrote:
> > OK, the patch works.
> > But now I stumbled across the next crash.
> > I am now trying the 2.09 Version from SuSE.
> >
> > Cib crashes shortly after a reboot of the second node:
> >
> >
> > #0  0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> > 2028            if (ch->ops->send(ch, imsg) != IPC_OK) {
> > (gdb) where
> > #0  0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> > #1  0xfedc51cb in hb_api_signoff (cinfo=0xffffffff, need_destroy_chan=1) at 
> > client_lib.c:470
> > #2  0xfedc5351 in hb_api_delete (ci=0x80a0dc0) at client_lib.c:501
> > #3  0x0805f1a2 in main (argc=1, argv=0x8047770) at main.c:216
> >
> > (gdb) p *ch
> > $2 = {ch_status = 134814800, farside_pid = -1, ch_private = 0x807b9e0, ops 
> > = 0xffffffff, msgpad = 0,
> >   bytes_remaining = 4294967295, should_send_block = 0, send_queue = 
> > 0xffffffff, recv_queue = 0xffffffff,
> >   pool = 0xffffffff, high_flow_mark = -1, low_flow_mark = -1, 
> > high_flow_userdata = 0xffffffff,
> >   low_flow_userdata = 0xffffffff, high_flow_callback = 0xffffffff, 
> > low_flow_callback = 0xffffffff, conntype = -1,
> >   failreason = '' <repeats 128 times>}
> >
> > (gdb) p *ch.ops
> > Cannot access memory at address 0xffffffff
> > (gdb)
>
> ok, this is a little more serious.
> looks like there is a problem in the heartbeat api :-(
>
> can you create a bug for this please?  If you use the "other"
> component it will get assigned to the right person (Alan).
>
> in the meantime, you can probably just comment out that line as the
> cib is seconds away from exiting and is "just" cleaning up.
>
> were there any errors before the crash?
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>