On 6/5/07, Otte, Joerg <[EMAIL PROTECTED]> wrote:

> were there any errors before the crash?
The only errors in the logs are several of these:
heartbeat[536]: 2007/06/05_10:01:03 WARN: Exiting /usr/sfw/lib/python2.3/heartbeat/cib process 572 killed by signal 9.

signal 9?  are you sending that signal?

heartbeat[536]: 2007/06/05_10:01:03 ERROR: Respawning client "/usr/sfw/lib/python2.3/heartbeat/cib":

that's an odd place to put heartbeat :-)



> in the meantime, you can probably just comment out that line as the
> cib is seconds away from exiting and is "just" cleaning up.
You mean to remove the statement "hb_conn->llc_ops->delete(hb_conn);" in main.c?

right
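
For readers following along, the workaround can be sketched as a switch around the teardown call instead of a commented-out line. This is only an illustration, not the heartbeat source: the types mock_conn and mock_ops, the counting mock_delete, and the function shutdown_connection are all hypothetical stand-ins, and only the conn->llc_ops->delete(conn) call shape is taken from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: only the conn->llc_ops->delete(conn) shape
 * comes from the thread; the real heartbeat types are not reproduced. */
struct mock_conn;

struct mock_ops {
    int (*delete)(struct mock_conn *conn);
};

struct mock_conn {
    struct mock_ops *llc_ops;
};

/* Counts teardown calls so the effect of the switch is observable. */
int mock_delete_calls = 0;

int mock_delete(struct mock_conn *conn)
{
    (void)conn;
    mock_delete_calls++;
    return 0;
}

/* Shutdown path with the teardown made optional.
 * Returns 1 if delete() was called, 0 if it was skipped. */
int shutdown_connection(struct mock_conn *conn, int skip_delete)
{
    assert(conn != NULL);
    if (skip_delete) {
        /* Workaround: the process is about to exit anyway, so the
         * connection is left for the OS to reclaim. */
        return 0;
    }
    conn->llc_ops->delete(conn);
    return 1;
}
```

Skipping the call trades a clean sign-off for not touching the suspect memory, which is probably acceptable only because the process exits immediately afterwards.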

I already tried this. But now I get different cores.
In the logs I see only cib crashing but the core is from crmd:


bcm20-b:/ # pstack /var/ha/local/lib/heartbeat/cores/hacluster/core
core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 576: /usr/sfw/lib/python2.3/heartbeat/crmd
 080580e8 do_exit  (40000000, 0, d, b, 17, 8091528) + ac
 08055f1f do_fsa_action (40000000, 0, 805803c, 1, 600002a8, a) + c3
 080561ea s_crmd_fsa_actions (8091528, 2000420, 0, 10000000, 4, 0) + 76
 080579e1 s_crmd_fsa (d, 808ebf0, 8047738, 805fc47) + 225
 0805fc69 crm_fsa_trigger (0, 0, 8047778, fef9380f) + 2d
 fef93857 G_TRIG_dispatch (808ebf0, 0, 0, 0) + a7
 fedbc77f g_main_context_dispatch (808b9f8, ffffff9c, 8089dd0, 0) + 1e7
 fedbe065 g_main_context_iterate (1, 808c6c0, 8047858, fedbe141, 80554d7, 1) + 41d
 fedbe2c0 g_main_loop_run (8089db8, 806f248, 806caa4, 806f21d, 0, 0) + 19c
 080554d7 crmd_init (80478a0, 80553f1, 808900c, 808901c, 0, 806ca2e) + b3
 08055748 main     (1, 80478d0, 80478d8) + f0
 080552cc _start   (1, 8047a40, 0, 8047a66, 8047a78, 8047a87) + 80
bcm20-b:/ # l /var/ha/local/lib/heartbeat/cores/hacluster/core
-rw-------   1 hacluster haclient 5672714 Jun  5 10:01 /var/ha/local/lib/heartbeat/cores/hacluster/core
bcm20-b:/ #

GDB:
Reading symbols from /lib/libavl.so.1...done.
Loaded symbols for /lib/libavl.so.1
#0  0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
) at control.c:188
188             fsa_cluster_conn->llc_ops->delete(fsa_cluster_conn);

does the memory pointed to by fsa_cluster_conn seem sane?

if so then it's probably the same bug (and the same workaround would apply)
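
One rough way to express "does the memory seem sane" in code, going by the all-ones values (0xffffffff, -1) in the earlier cib dump: treat a NULL or all-ones pointer as a sign the structure has been freed or overwritten. This is a heuristic sketch under assumptions, not a heartbeat API. The struct mock_channel is a trimmed, hypothetical stand-in for the channel printed by gdb, and the helper names are made up for illustration. A pointer that passes the check can still be invalid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed stand-in for the channel struct printed by gdb. */
struct mock_channel {
    void *ops;
    void *send_queue;
    void *recv_queue;
};

/* Heuristic only: NULL or an all-ones bit pattern is never usable.
 * On a 32-bit build, all-ones is the 0xffffffff seen in the dump. */
static int ptr_looks_sane(const void *p)
{
    return p != NULL && (uintptr_t)p != UINTPTR_MAX;
}

/* Returns 1 if the channel and its pointer fields pass the heuristic. */
int channel_looks_sane(const struct mock_channel *ch)
{
    return ptr_looks_sane(ch)
        && ptr_looks_sane(ch->ops)
        && ptr_looks_sane(ch->send_queue)
        && ptr_looks_sane(ch->recv_queue);
}

/* Debug aid: abort with a clear assertion failure rather than crash
 * later inside a call through a bad ops pointer. */
void require_sane_channel(const struct mock_channel *ch)
{
    assert(channel_looks_sane(ch));
}
```

In gdb the equivalent check is simply printing the structure (p *fsa_cluster_conn) and looking for fields full of 0xffffffff.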

(gdb) where
#0  0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
) at control.c:188
#1  0x08055f1f in do_fsa_action (fsa_data=0x8091528, an_action=Unhandled dwarf expression opcode 0x93
) at fsa.c:177
#2  0x080561ea in s_crmd_fsa_actions (fsa_data=0x8091528) at fsa.c:541
#3  0x080579e1 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:314
#4  0x0805fc69 in crm_fsa_trigger (user_data=0x0) at callbacks.c:661
#5  0xfef93857 in G_TRIG_dispatch (source=0x808ebf0, callback=0, user_data=0x0) at GSource.c:1349
#6  0xfedbc77f in g_main_context_dispatch () from /usr/local/lib/libglib-2.0.so.0
#7  0xfedbe065 in g_main_context_iterate () from /usr/local/lib/libglib-2.0.so.0
#8  0xfedbe2c0 in g_main_loop_run () from /usr/local/lib/libglib-2.0.so.0
#9  0x080554d7 in crmd_init () at main.c:155
#10 0x08055748 in main (argc=1, argv=0x80478d0) at main.c:122



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of ext Andrew Beekhof
Sent: Monday, 4 June 2007 16:45
To: High-Availability Linux Development List
Subject: Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386

On 6/4/07, Otte, Joerg <[EMAIL PROTECTED]> wrote:
> OK, the patch works.
> But now I stumbled across the next crash.
> I am now trying the 2.09 Version from SuSE.
>
> Cib crashes shortly after a reboot of the second node:
>
>
> #0  0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> 2028            if (ch->ops->send(ch, imsg) != IPC_OK) {
> (gdb) where
> #0  0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> #1  0xfedc51cb in hb_api_signoff (cinfo=0xffffffff, need_destroy_chan=1) at client_lib.c:470
> #2  0xfedc5351 in hb_api_delete (ci=0x80a0dc0) at client_lib.c:501
> #3  0x0805f1a2 in main (argc=1, argv=0x8047770) at main.c:216
>
> (gdb) p *ch
> $2 = {ch_status = 134814800, farside_pid = -1, ch_private = 0x807b9e0, ops = 0xffffffff, msgpad = 0,
>   bytes_remaining = 4294967295, should_send_block = 0, send_queue = 0xffffffff, recv_queue = 0xffffffff,
>   pool = 0xffffffff, high_flow_mark = -1, low_flow_mark = -1, high_flow_userdata = 0xffffffff,
>   low_flow_userdata = 0xffffffff, high_flow_callback = 0xffffffff, low_flow_callback = 0xffffffff, conntype = -1,
>   failreason = '' <repeats 128 times>}
>
> (gdb) p *ch.ops
> Cannot access memory at address 0xffffffff
> (gdb)

ok, this is a little more serious.
looks like there is a problem in the heartbeat api :-(

can you create a bug for this please?  If you use the "other"
component it will get assigned to the right person (Alan).

in the meantime, you can probably just comment out that line as the
cib is seconds away from exiting and is "just" cleaning up.

were there any errors before the crash?
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/