Brian Murray or James Pages,

I verified this fix against the test case in the description and it worked
fine. Meanwhile, I received some complaints from Peter about crashes in
his installation, so I opened the following bug:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1412962

There I provided a fix first for the stonith cores and then for the lrmd
cores. I'm attaching to that bug the debdiffs that fixed all the issues
Peter was seeing, and I'll also ask for sponsorship.

Thank you

Rafael Tinoco

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1368737

Title:
  Pacemaker (lrmd) can seg fault in Trusty and Utopic after following
  message: Source ID XX was not found when attempting to remove it

Status in pacemaker package in Ubuntu:
  Fix Released
Status in pacemaker source package in Trusty:
  Fix Committed
Status in pacemaker source package in Utopic:
  Fix Committed
Status in pacemaker source package in Vivid:
  Fix Released

Bug description:
  [IMPACT]

    - Pacemaker segfaults on repeated crm node online/standby because:
        - Newer GLib versions use a hash table to find GSources
        - GLib can hit an assertion when a source is removed multiple times
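
  To illustrate the double-removal pattern described above, here is a
  minimal, self-contained sketch. It does not use GLib and is not the
  upstream patch: the source table and every demo_* name are hypothetical
  stand-ins, assuming the caller stores a numeric source ID. The idea is to
  clear the stored ID after the first removal so that a second stop becomes
  a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_SOURCES 8

/* Hypothetical stand-in for the main loop's source table; slot i holds ID i+1. */
static bool live_sources[MAX_SOURCES];
static unsigned int warnings; /* number of removals of unknown IDs */

/* Register a source and return its non-zero ID, or 0 if the table is full. */
unsigned int demo_source_add(void)
{
    for (unsigned int i = 0; i < MAX_SOURCES; i++) {
        if (!live_sources[i]) {
            live_sources[i] = true;
            return i + 1;
        }
    }
    return 0;
}

/* Remove a source by ID; complain if the ID is not registered. */
bool demo_source_remove(unsigned int id)
{
    if (id == 0 || id > MAX_SOURCES || !live_sources[id - 1]) {
        warnings++;
        fprintf(stderr,
                "Source ID %u was not found when attempting to remove it\n", id);
        return false;
    }
    live_sources[id - 1] = false;
    return true;
}

/* Hypothetical owner of a source, e.g. a timer. */
typedef struct {
    unsigned int source_id; /* 0 means "no source registered" */
} demo_timer_t;

/* Stop the timer at most once: forget the ID right after removing it. */
void demo_timer_stop(demo_timer_t *t)
{
    if (t != NULL && t->source_id != 0) {
        demo_source_remove(t->source_id);
        t->source_id = 0;
    }
}

unsigned int demo_warning_count(void)
{
    return warnings;
}

/* Stop the same timer twice and return the number of bogus removals so far. */
unsigned int demo_double_stop(void)
{
    demo_timer_t t = { demo_source_add() };

    assert(t.source_id != 0);
    demo_timer_stop(&t);
    demo_timer_stop(&t); /* no-op: the ID was cleared by the first stop */
    return warnings;
}
```

  In real code the removal call would be GLib's g_source_remove(); the
  zeroing of the stored ID is the part this sketch is meant to show.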

  [TEST CASE]

    - Using the same configuration as the attached cib.xml:

          #!/bin/bash

          while true; do
              crm node standby clustertrusty01
              sleep 7
              crm node online clustertrusty01
              sleep 7
              crm node standby clustertrusty02
              sleep 7
              crm node online clustertrusty02
              sleep 7
              crm node standby clustertrusty03
              sleep 7
              crm node online clustertrusty03
              sleep 7
          done

  [REGRESSION POTENTIAL]

    - Based on upstream commit 568e41d
    - Test case ran for more than 7 hours with no problems

  [OTHER INFO]

  The following situation was brought to my attention:

  """
  [Issue]

  lrmd process crashed when repeating "crm node standby" and "crm node
  online"

  ----------------
  # grep pacemakerd ha-log.k1pm101 | grep core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49275 (lrmd) dumped core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49275, core=1)
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 1471 (lrmd) dumped core
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=1471, core=1)
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35771 (lrmd) dumped core
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35771, core=1)
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 60709 (lrmd) dumped core
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=60709, core=1)
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35838 (lrmd) dumped core
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35838, core=1)
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49249 (lrmd) dumped core
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49249, core=1)
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 65358 (lrmd) dumped core
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=65358, core=1)
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 22693 (lrmd) dumped core
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=22693, core=1)
  ----------------

  ----------------
  # grep pacemakerd ha-log.k1pm102 | grep core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 5812 (lrmd) dumped core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=5812, core=1)
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 35781 (lrmd) dumped core
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35781, core=1)
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 51984 (lrmd) dumped core
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=51984, core=1)
  """

  Analyzing the core file with debug symbols (dbgsym), I could see that the
  following frame is responsible for the core:

  #0  0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at services.c:434
  434           crm_trace(" >  stdout: %s", op->stdout_data);

  I've checked the upstream code, and there appear to be two important
  commits that could be cherry-picked to fix this behavior:

  commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
  Author: Andrew Beekhof <[email protected]>
  Date:   Fri Sep 20 12:20:36 2013 +1000

      Fix: services: Prevent use-of-NULL when executing service actions

  commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
  Author: Gao,Yan <[email protected]>
  Date:   Sun Sep 29 12:40:18 2013 +0800

      Fix: services: Fix the executing of synchronous actions

  The core can be caused by things such as this missing code:

  if (op == NULL) {
      crm_trace("No operation to execute");
      return FALSE;
  }

  at the beginning of the "services_action_sync(svc_action_t * op)"
  function.

  This is further improved by commit 11473a5.
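
  The early-return guard mentioned above can be sketched in isolation as
  follows. This is only an illustration under stated assumptions, not
  Pacemaker code: demo_action_t is a hypothetical, trimmed-down stand-in
  for svc_action_t, fprintf replaces crm_trace, and the action itself is
  not actually executed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, trimmed-down stand-in for Pacemaker's svc_action_t. */
typedef struct {
    const char *stdout_data; /* captured output, may be NULL */
    int rc;                  /* result code of the action */
} demo_action_t;

/* Run an action synchronously; refuse to touch a NULL operation. */
bool demo_action_sync(demo_action_t *op)
{
    if (op == NULL) {
        fprintf(stderr, "No operation to execute\n");
        return false;
    }

    /* The real function would execute the action here; we only mark success. */
    op->rc = 0;

    /* Also avoid handing a NULL string to the %s conversion. */
    fprintf(stderr, " >  stdout: %s\n",
            op->stdout_data != NULL ? op->stdout_data : "(none)");
    return true;
}

/* Exercise both the guarded path and the normal path. */
void demo_self_check(void)
{
    demo_action_t op = { "hello", -1 };

    assert(!demo_action_sync(NULL));
    assert(demo_action_sync(&op));
    assert(op.rc == 0);
}
```

  With the guard in place, a NULL operation is reported and rejected
  instead of being dereferenced later in the function.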

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/+subscriptions

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp
