I am also seeing crashes of stonithd, twice yesterday alone, always
on both nodes at the same time. Here is the stack trace:
root@kjp03:/var/crash# apport-retrace -Rs _usr_lib_pacemaker_stonithd.0.crash
E: Can not find version '1.1.10+git20130802-1ubuntu2.2' of package 'pacemaker'
E: Unable to find a source package for pacemaker
--- stack trace ---
#0 0x00007ffa6f17abb9 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
resultvar = 0
pid = 40008
selftid = 40008
#1 0x00007ffa6f17dfc8 in __GI_abort () at abort.c:89
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0},
sa_mask = {__val = {0, 17179869185, 140713634797360, 140713634496512, 0,
140734633026224, 140713582943175, 140713586093704, 140734633026160, 397168, 32,
140713586088608, 0, 140713586088608, 140713582942786, 140713579551566}},
sa_flags = 1876903824, sa_restorer = 0x3f}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007ffa6fdcf6c9 in crm_abort (file=0x7ffa6fdf34bb "logging.c",
function=0x7ffa6fdf4790 <__PRETTY_FUNCTION__.22958> "crm_glib_handler",
line=63, assert_condition=0x7ffa72376ce0 "Source ID 541 was not found when
attempting to remove it", do_core=<optimized out>, do_fork=<optimized out>) at
utils.c:1118
rc = 0
pid = <optimized out>
status = 0
__func__ = "crm_abort"
#3 0x00007ffa6ee8bae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#4 0x00007ffa6ee8bd72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#5 0x00007ffa6ee83c5c in g_source_remove () from
/lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#6 0x00007ffa6f999ef5 in stonith_action_clear_tracking_data
(action=action@entry=0x7ffa723350b0) at st_client.c:536
No locals.
#7 0x00007ffa6f999f2d in stonith_action_destroy (action=0x7ffa723350b0) at
st_client.c:557
No locals.
#8 0x00007ffa6fde7cd9 in child_waitpid (child=child@entry=0x7ffa7236bb20,
flags=flags@entry=1) at mainloop.c:948
rc = <optimized out>
core = <optimized out>
signo = 0
status = 0
exitcode = 0
__func__ = "child_waitpid"
#9 0x00007ffa6fde7fce in child_death_dispatch (signal=<optimized out>) at
mainloop.c:962
saved = 0x0
child = 0x7ffa7236bb20
iter = 0x7ffa7222d200
exited = <optimized out>
__func__ = "child_death_dispatch"
#10 0x00007ffa6fde6de7 in crm_signal_dispatch (source=0x7ffa7236ba50,
callback=<optimized out>, userdata=<optimized out>) at mainloop.c:275
__func__ = "crm_signal_dispatch"
#11 0x00007ffa6ee84e04 in g_main_context_dispatch () from
/lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#12 0x00007ffa6ee85048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#13 0x00007ffa6ee8530a in g_main_loop_run () from
/lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#14 0x00007ffa702282a9 in main (argc=<optimized out>, argv=<optimized out>) at
main.c:1136
flag = <optimized out>
lpc = 0
argerr = <optimized out>
option_index = 0
cluster = {uuid = 0x7ffa7222fba0 "167772162", uname = 0x7ffa72230280
"kjp03", nodeid = 167772162, destroy = 0x7ffa70229b40
<stonith_peer_cs_destroy>, hb_conn = 0x0, hb_dispatch = 0x7ffa702299c0
<stonith_peer_hb_callback>, group = {length = 128, value = "stonith-ng", '\000'
<repeats 117 times>}, cpg = {cpg_deliver_fn = 0x7ffa702298e0
<stonith_peer_ais_callback>, cpg_confchg_fn = 0x7ffa6fbb04a0
<pcmk_cpg_membership>}, cpg_handle = 7749363892505018368}
actions = {0x7ffa70236d7d "reboot", 0x7ffa70236d84 "off",
0x7ffa7023893f "list", 0x7ffa70236d88 "monitor", 0x7ffa70236d90 "status"}
__func__ = "main"
I am also attaching the crash report.
Peter
** Attachment added: "_usr_lib_pacemaker_stonithd.0.crash"
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/+attachment/4293914/+files/_usr_lib_pacemaker_stonithd.0.crash
--
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1368737
Title:
Pacemaker can seg fault on crm node online/standby
Status in pacemaker package in Ubuntu:
Fix Released
Status in pacemaker source package in Trusty:
Fix Committed
Status in pacemaker source package in Utopic:
Fix Committed
Status in pacemaker source package in Vivid:
Fix Released
Bug description:
[IMPACT]
- Pacemaker can segfault on repeated crm node online/standby because:
- Newer GLib versions use a hash table to find GSources
- GLib can raise an assertion when a source is removed multiple times
[TEST CASE]
- Using same configuration as attached cib.xml :
#!/bin/bash
while true; do
crm node standby clustertrusty01
sleep 7
crm node online clustertrusty01
sleep 7
crm node standby clustertrusty02
sleep 7
crm node online clustertrusty02
sleep 7
crm node standby clustertrusty03
sleep 7
crm node online clustertrusty03
sleep 7
done
[REGRESSION POTENTIAL]
- Based on upstream commit 568e41d
- Test case ran for more than 7 hours with no problems
[OTHER INFO]
The following situation was brought to my attention:
"""
[Issue]
lrmd process crashed when repeating "crm node standby" and "crm node
online"
----------------
# grep pacemakerd ha-log.k1pm101 | grep core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 49275 (lrmd) dumped core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=49275, core=1)
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 1471 (lrmd) dumped core
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=1471, core=1)
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 35771 (lrmd) dumped core
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35771, core=1)
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 60709 (lrmd) dumped core
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=60709, core=1)
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 35838 (lrmd) dumped core
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35838, core=1)
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 49249 (lrmd) dumped core
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=49249, core=1)
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 65358 (lrmd) dumped core
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=65358, core=1)
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed
process 22693 (lrmd) dumped core
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=22693, core=1)
----------------
----------------
# grep pacemakerd ha-log.k1pm102 | grep core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 5812 (lrmd) dumped core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=5812, core=1)
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 35781 (lrmd) dumped core
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=35781, core=1)
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed
process 51984 (lrmd) dumped core
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child
process lrmd terminated with signal 11 (pid=51984, core=1)
"""
Analyzing the core file with debug symbols (dbgsyms), I could see that
the following frame is responsible for the core:
#0 0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at
services.c:434
434 crm_trace(" > stdout: %s", op->stdout_data);
I've checked upstream code and there might be 2 important commits that
could be cherry-picked to fix this behavior:
commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
Author: Andrew Beekhof <[email protected]>
Date: Fri Sep 20 12:20:36 2013 +1000
Fix: services: Prevent use-of-NULL when executing service actions
commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
Author: Gao,Yan <[email protected]>
Date: Sun Sep 29 12:40:18 2013 +0800
Fix: services: Fix the executing of synchronous actions
The core can be caused by things such as this missing code:
if (op == NULL) {
    crm_trace("No operation to execute");
    return FALSE;
}
at the beginning of the "services_action_sync(svc_action_t * op)"
function. Commit 11473a5 improves this further.