Brian Murray or James Pages,

I verified this fix against the test case in the description and it worked fine. Meanwhile, I received some complaints from Peter about crashes he was seeing in his installation. I opened the following bug:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1412962

There I provided a fix first for the stonith cores and then for the lrmd cores. I'm attaching to that bug the debdiffs that fixed all the issues Peter was seeing. I'll also ask for sponsorship.

Thank you

Rafael Tinoco

-- 
You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1368737

Title:
  Pacemaker (lrmd) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it

Status in pacemaker package in Ubuntu:
  Fix Released
Status in pacemaker source package in Trusty:
  Fix Committed
Status in pacemaker source package in Utopic:
  Fix Committed
Status in pacemaker source package in Vivid:
  Fix Released

Bug description:
  [IMPACT]

  - Pacemaker seg fault on repeated crm node online/standby because:
  - Newer glib versions use hash_table to find GSources
  - Glib can try to assert source being removed multiple times

  [TEST CASE]

  - Using same configuration as attached cib.xml:

  #!/bin/bash
  while true; do
      crm node standby clustertrusty01
      sleep 7
      crm node online clustertrusty01
      sleep 7
      crm node standby clustertrusty02
      sleep 7
      crm node online clustertrusty02
      sleep 7
      crm node standby clustertrusty03
      sleep 7
      crm node online clustertrusty03
      sleep 7
  done

  [REGRESSION POTENTIAL]

  - Based on upstream commit 568e41d
  - Test case ran for more than 7 hours with no problems

  [OTHER INFO]

  The following situation was brought to my attention:

  """
  [Issue]

  lrmd process crashed when repeating "crm node standby" and "crm node online"

  ----------------
  # grep pacemakerd ha-log.k1pm101 | grep core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49275 (lrmd) dumped core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49275, core=1)
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 1471 (lrmd) dumped core
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=1471, core=1)
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35771 (lrmd) dumped core
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35771, core=1)
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 60709 (lrmd) dumped core
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=60709, core=1)
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35838 (lrmd) dumped core
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35838, core=1)
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49249 (lrmd) dumped core
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49249, core=1)
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 65358 (lrmd) dumped core
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=65358, core=1)
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 22693 (lrmd) dumped core
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=22693, core=1)
  ----------------

  ----------------
  # grep pacemakerd ha-log.k1pm102 | grep core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 5812 (lrmd) dumped core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=5812, core=1)
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 35781 (lrmd) dumped core
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35781, core=1)
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 51984 (lrmd) dumped core
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=51984, core=1)
  """

  Analyzing the core file with dbgsyms, I could see that:

  #0  0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at services.c:434
  434         crm_trace(" > stdout: %s", op->stdout_data);

  is responsible for the core.

  I've checked the upstream code and there might be 2 important commits that could be cherry-picked to fix this behavior:

  commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
  Author: Andrew Beekhof <[email protected]>
  Date: Fri Sep 20 12:20:36 2013 +1000

      Fix: services: Prevent use-of-NULL when executing service actions

  commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
  Author: Gao,Yan <[email protected]>
  Date: Sun Sep 29 12:40:18 2013 +0800

      Fix: services: Fix the executing of synchronous actions

  The core can be caused by things such as this missing code:

      if (op == NULL) {
          crm_trace("No operation to execute");
          return FALSE;
      }

  at the beginning of the "services_action_sync(svc_action_t * op)" function. This is further improved by commit 11473a5.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/+subscriptions

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp

