Hi all,

Corosync Cluster Engine, version '2.3.4'
Copyright (c) 2006-2009 Red Hat, Inc.

Today I found corosync consuming 100% CPU. strace showed the following, repeating continuously:

write(7, "\v\0\0\0", 4)                 = -1 EAGAIN (Resource temporarily unavailable)
write(7, "\v\0\0\0", 4)                 = -1 EAGAIN (Resource temporarily unavailable)

Then I used gcore to get a core dump of the running process.

(gdb) bt
#0  0x00007f038b74b1cd in write () from /lib64/libpthread.so.0
#1  0x00007f038b9656ed in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:474
#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x00007f038c220a3d in schedwrk_processor (context=<optimized out>) at sync.c:551
#5  0x00007f038c23042b in schedwrk_do (type=<optimized out>, context=0x6a12d56300000001) at schedwrk.c:77
#6  0x00007f038bdd49f7 in token_callbacks_execute (type=TOTEM_CALLBACK_TOKEN_SENT, instance=<optimized out>) at totemsrp.c:3493
#7  message_handler_orf_token (instance=<optimized out>, msg=<optimized out>, endian_conversion_needed=<optimized out>, msg_len=<optimized out>) at totemsrp.c:3894
#8  0x00007f038bdd65a5 in message_handler_orf_token (instance=<optimized out>, msg=<optimized out>, msg_len=<optimized out>, endian_conversion_needed=<optimized out>) at totemsrp.c:3609
#9  0x00007f038bdcdfb9 in rrp_deliver_fn (context=0x7f038d541840, msg=0x7f038d541af8, msg_len=70) at totemrrp.c:1941
#10 0x00007f038bdca01e in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7f038d541a90) at totemudpu.c:499
#11 0x00007f038b96576f in _poll_dispatch_and_take_back_ (item=0x7f038d4fe168, p=<optimized out>) at loop_poll.c:108
#12 0x00007f038b965300 in qb_loop_run_level (level=0x7f038d4fde08) at loop.c:43
#13 qb_loop_run (lp=<optimized out>) at loop.c:210
#14 0x00007f038c21b6d0 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1383

(gdb) f 1
#1  0x00007f038b9656ed in _handle_real_signal_ (signal_num=<optimized out>, si=<optimized out>, context=<optimized out>) at loop_poll.c:474
474                     res = write(pipe_fds[1], &sig, sizeof(int32_t));
(gdb) info locals
sig = 11
res = <optimized out>
__func__ = "_handle_real_signal_"
(gdb) f 4
#4  0x00007f038c220a3d in schedwrk_processor (context=<optimized out>) at sync.c:551
551                             my_service_list[my_processing_idx].sync_init (my_trans_list,
(gdb) p my_processing_idx
$31 = 3
(gdb) p my_service_list[3]
$32 = {service_id = 0, sync_init = 0x0, sync_abort = 0x0, sync_process = 0x0, sync_activate = 0x0, state = PROCESS, name = '\000' <repeats 127 times>}
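
To make the gdb output above concrete: entry 3 has sync_init = 0x0, so the call at sync.c:551 jumps to address 0, which matches frame #3 and sig = 11. The snippet below is only a minimal sketch of a defensive check around such a call. The struct, the function names and the return convention are all hypothetical and simplified; this is not corosync's code and not a proposed fix, since it would only hide whatever zeroed the entry or moved the index past the valid services.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one slot of a service table. */
struct service_entry {
    int (*sync_init)(void);   /* NULL if the slot was never filled in */
};

/* Example callback used only to exercise the guard. */
static int sample_init(void)
{
    return 0;
}

/*
 * Call sync_init for slot idx. Returns -1 instead of calling through a
 * NULL pointer or indexing past the end of the table.
 */
static int run_sync_init(const struct service_entry *list, size_t count,
                         size_t idx)
{
    if (idx >= count || list[idx].sync_init == NULL)
        return -1;
    return list[idx].sync_init();
}
```

With a two-slot table whose second slot is zeroed, run_sync_init(list, 2, 1) returns -1 rather than faulting.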

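So it seems corosync is looping endlessly in the segfault handler: sync_init for service index 3 is NULL, the call in frame #4 faults (frame #3 is at address 0x0, sig = 11), and the handler's write() keeps failing with EAGAIN.
I have not found any related entry in the changelogs or release notes after 2.3.4.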

Can anyone help please?

Yep. It looks like (for some reason) the signal pipe was not being processed, so libqb's _handle_real_signal_ is looping. Corosync itself really cannot do anything about that. It looks like a regular libqb bug, so you can't do much about it either. CCing Chrissie so she is aware.
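
The behaviour described here can be reproduced in isolation. The following is a minimal sketch, not libqb's implementation: it assumes the signal handler reports signals over a non-blocking pipe (the EAGAIN in the strace output is what write() returns on a non-blocking descriptor with no room left). All function names are hypothetical. If nothing drains the read end, every retry fails the same way, so an unbounded retry loop spins at 100% CPU; the sketch bounds the retries to show the difference.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Queue one signal number on the pipe, retrying on EAGAIN/EINTR at most
 * max_retries times. Returns 0 on success, -1 if the pipe stayed full or
 * another error occurred. Only write() is used, which is
 * async-signal-safe. Without the bound, a full pipe that is never read
 * would keep this loop running forever.
 */
static int notify_signal_bounded(int wfd, int32_t sig, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        ssize_t res = write(wfd, &sig, sizeof(sig));
        if (res == (ssize_t)sizeof(sig))
            return 0;
        if (res == -1 && errno != EAGAIN && errno != EINTR)
            return -1;
    }
    return -1;
}

/*
 * Write signal numbers until the pipe reports EAGAIN. Returns the number
 * of successful writes, or -1 on an unexpected error.
 */
static long fill_pipe(int wfd, int32_t sig)
{
    long count = 0;
    for (;;) {
        ssize_t res = write(wfd, &sig, sizeof(sig));
        if (res == (ssize_t)sizeof(sig)) {
            count++;
            continue;
        }
        if (res == -1 && errno == EAGAIN)
            return count;
        if (res == -1 && errno == EINTR)
            continue;
        return -1;
    }
}
```

To try it, create a pipe with pipe(), call set_nonblocking() on the write end, fill it with fill_pipe(), and then call notify_signal_bounded(): it gives up with -1 while the read end is left undrained, which is the situation the strace output suggests.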

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


