Hi,

We run multiple deployments of corosync+pacemaker on Debian "wheezy" for high availability of various resources. The configurations have not changed and had run without any issues for many months. However, since we applied the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting resource agent errors on rare occasions, with error messages that are clearly incorrect.

There have been four incidents so far, on two unrelated clusters:

* Our cluster hosts "talos" and "pomona" use pacemaker to manage a few virtual IP addresses using the ocf:heartbeat:IPaddr2 resource agent. This cluster has had two incidents. The first incident began with this error:

Jun 2 17:30:16 pomona lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1: /usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied

The second incident began with this error:

Jul 12 08:36:15 talos IPaddr2[21294]: ERROR: Setup problem: couldn't find command: ip

I can confidently say that neither the permissions of IPaddr2 nor the location of the "ip" command changed at any point!
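For anyone who wants to rule out the obvious on their own nodes, here is a minimal sketch of the two checks implied by the errors above: that the agent file is executable, and that a command resolves in PATH. The function name check_agent_env is made up for this example and is not part of pacemaker or resource-agents.

```shell
#!/bin/sh
# Hypothetical helper (not part of pacemaker or resource-agents):
# report whether a file is executable and whether a command resolves in PATH.
check_agent_env() {
    agent_path=$1
    cmd_name=$2
    status=0

    # Covers the "Permission denied" case: the file must exist and be executable.
    if [ -x "$agent_path" ]; then
        echo "OK: $agent_path is executable"
    else
        echo "FAIL: $agent_path is missing or not executable"
        status=1
    fi

    # Covers the "couldn't find command" case: the command must resolve in PATH.
    if cmd_path=$(command -v "$cmd_name"); then
        echo "OK: $cmd_name resolves to $cmd_path"
    else
        echo "FAIL: $cmd_name not found in PATH ($PATH)"
        status=1
    fi

    return $status
}
```

On our nodes this would be called as "check_agent_env /usr/lib/ocf/resource.d/heartbeat/IPaddr2 ip". Note that it only tests the environment of the shell it runs in, which is not necessarily the environment lrmd gives the agent.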

* Our cluster hosts "aries" and "taurus" use pacemaker in a more complicated setup, managing Xen virtual machines on shared storage utilizing DRBD and CLVM, using the resource agents ocf:pacemaker:controld, ocf:gleim:clvmd (which is the stock clvmd resource agent from a later pacemaker version than is included in wheezy), ocf:heartbeat:LVM, ocf:linbit:drbd, and ocf:gleim:Xen (which is the stock Xen resource agent with a trivial one-line change for a local workaround).

This cluster has also had two incidents:

* The first began with:

Jun 16 10:38:15 aries lrmd: [3646]: info: RA output: (jabber:monitor:stderr) /usr/lib/ocf/resource.d//gleim/Xen: 71: local: en-list: bad variable name

There is no variable "en-list" in the resource agent; the closest string in the file is "xen-list", which is a binary, not a variable, and is used like this:

  ...
  if have_binary xen-list; then
     xen-list $1 2>/dev/null | grep -qs "State.*[-r][-b][-p]--" 2>/dev/null
     ...

* The second began with:

Jun 21 11:58:58 taurus Xen[9052]: ERROR: Setup problem: couldn't find command: awk

Again, the location of "awk" has not changed.
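Regarding the "bad variable name" message in the first incident on this cluster: assuming /bin/sh is dash (the Debian default), that is the wording the shell uses when "local" is handed a word that is not a valid variable name. The sketch below reproduces the message form only; the function name is made up, and it says nothing about how the word "en-list" ended up in front of "local" in our case.

```shell
#!/bin/sh
# Illustration only: hand `local` a word that is not a valid variable name.
# The function name is made up for this example.
demo_bad_local() {
    local en-list
}

# Run inside a command substitution (a subshell), because dash treats this
# as a fatal error and would otherwise abort the calling script.
output=$(demo_bad_local 2>&1) && rc=0 || rc=$?

echo "exit status: $rc"
echo "message: $output"
```

Under dash the message ends in "local: en-list: bad variable name"; bash rejects the same word but phrases the error differently.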


We have no reason to suspect the kernel update other than timing, and the fact that the incidents occur on unrelated clusters. We have since upgraded to Debian's next update, 3.2.57-3+deb7u2, but the most recent incident occurred after that. The original update included fixes for these issues:

CVE-2014-0196

    Jiri Slaby discovered a race condition in the pty layer, which could
    lead to denial of service or privilege escalation.

CVE-2014-1737 / CVE-2014-1738

    Matthew Daley discovered that missing input sanitising in the
    FDRAWCMD ioctl and an information leak could result in privilege
    escalation.

CVE-2014-2851

    Incorrect reference counting in the ping_init_sock() function allows
    denial of service or privilege escalation.

CVE-2014-3122

    Incorrect locking of memory can result in local denial of service.


Given the odd error messages from the resource agents, I suspect memory corruption of some sort. We've been unable to find anything else useful in the logs, and we'll probably end up reverting to the prior kernel version. But given how rare the incidents are, it would be a long while before we could be confident that the revert had fixed the problem.

Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel or later? Has anyone had any similar issues?
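If it helps when comparing notes, here is a small sketch for pulling the Debian kernel package version out of "uname -v" output. It assumes the string has the form "#1 SMP Debian 3.2.57-3+deb7u2", as it does on our wheezy nodes; the function name is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: extract the Debian package version from a
# `uname -v` style string. Assumes the version is the word that
# follows "Debian "; prints nothing if that pattern is absent.
debian_kernel_version() {
    printf '%s\n' "$1" | sed -n 's/.*Debian \([^ ]*\).*/\1/p'
}

# Example with a fixed string; on a live system pass "$(uname -v)" instead.
sample_version=$(debian_kernel_version "#1 SMP Debian 3.2.57-3+deb7u2")
echo "kernel package version: $sample_version"
```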

-- Ken Gaillot <kjgai...@gleim.com>
   Gleim NOC

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
