Re: [Pacemaker] When will pacemaker 1.1.14 be released?
On 01/04/2016 03:08 AM, Kang Kai wrote:
> Hi Ken,
>
> Is there any schedule when pacemaker 1.1.14 will be released? And would
> it be soon?
>
> Thanks a lot.

Hi,

Yes, the goal is to release it by the middle of this month.

Thanks,
--
Ken Gaillot
Re: [Pacemaker] Reinstall Pacemaker/Corosync.
On 11/24/2015 04:28 AM, emmanuel segura wrote:
> I don't remember well, but I think on RHEL 6.5 you need to use
> cman+pacemaker. Please post your config, and make sure you have
> fencing configured.

Yes, the versions in 6.5 are quite old; 6.7 has recent versions, so if
you can upgrade, that would help. Even 6.6 is significantly newer and
has important bugfixes.

RHEL 6 does use corosync 1, but via CMAN rather than directly. You can
use the pcs command to configure and deconfigure the cluster (pcs
cluster node add/remove for one node, or pcs cluster setup/destroy for
the entire cluster).

> 2015-11-24 11:18 GMT+01:00 Cayab, Jefrey E. :
>> Hi all,
>>
>> I searched online but couldn't find a detailed answer. OS is RHEL 6.5.
>>
>> Problem:
>> I have 2 servers which were set up fine (MySQL cluster is on them,
>> DRBD for the data disk on local disk) and which needed to be migrated
>> to another location. During the migration, DRBD had to be changed
>> from local disk to a SAN LUN, which went OK, but the cluster began
>> showing weird behavior. When the 2 nodes are shut down and booted
>> together, each server can see the other as online via "crm_mon -1",
>> but when one node's pacemaker process is restarted, the status of
>> that node as seen from the other node stays offline/stopped; even if
>> I reboot that node, it doesn't rejoin the cluster.
>>
>> Other observation - if these 2 servers boot up together, both show
>> online as above, and when I stop the pacemaker process on the active
>> node, the other node takes over the resources, which is good. But
>> even if I start the pacemaker process back up on the first node, it's
>> not able to take back the resources. It's as if only one failover can
>> happen, with no failback.
>>
>> What I did:
>> - Removed Pacemaker and Corosync via yum
>> - Rebooted the OS
>> - Verified no more Pacemaker/Corosync packages
>> - Installed Pacemaker and Corosync again via yum
>> - When I ran "crm_mon -1", I was surprised to see that the
>>   configuration was still there.
>>
>> After the reinstallation, I'm still seeing the same behavior, and I
>> noticed that DRBD reports a failed disk - only a reboot of the node
>> brings it back to UpToDate.
>>
>> Please advise on the correct procedure to wipe out the configuration
>> and reinstall.
>>
>> I will share the logs shortly.
>>
>> Thanks,
>> Jef
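Following up on the pcs commands mentioned above, a minimal sketch of a
full teardown and rebuild; the cluster name "mycluster" and the node
names are placeholders. Note that a package reinstall alone won't help,
because the saved CIB survives on disk (under /var/lib/pacemaker on
recent builds), which is why the old configuration reappeared:

    # Stop the cluster and wipe its configuration on every node; this
    # removes the cluster config and the saved CIB.
    pcs cluster destroy --all

    # Recreate the cluster from scratch (CMAN-based on RHEL 6).
    pcs cluster setup --name mycluster node1 node2
    pcs cluster start --all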
Re: [Pacemaker] move service basing on both connection status and hostname
On 11/09/2015 10:00 AM, Stefano Sasso wrote:
> Hi Guys,
> I am having some trouble with a location constraint.
>
> In particular, what I want to achieve is to run my service on one
> host; if the IP interconnection fails I want to migrate it to another
> host, but on IP connectivity restoration the resource should move back
> to the primary node.
>
> So, I have this configuration:
>
> primitive vfy_ether ocf:pacemaker:l2check \
>     params nic_list="eth1 eth2" debug="false" dampen="1s" \
>     op monitor interval="2s"
> clone ck_ether vfy_ether
> location cli-ethercheck MCluster \
>     rule $id="cli-prefer-rule-ethercheck" -inf: not_defined l2ckd or l2ckd lt 2
> location cli-prefer-masterIP MCluster \
>     rule $id="cli-prefer-rule-masterIP" 50: #uname eq GHA-MO-1
>
> When the connectivity fails on the primary node, the resource is
> correctly moved to the secondary one. But on IP connectivity
> restoration, the resource stays on the secondary node (and does not
> move back to the primary one).
>
> How can I solve that? Any hint? :-)
>
> thanks,
> stefano

Most likely, you have a default resource-stickiness set. That tells
Pacemaker to keep services where they are if possible. You can either
delete the stickiness setting or make sure it has a lower score than
your location preference.
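For example (a sketch in crm shell syntax; the stickiness values are
illustrative, while the 50-point location preference comes from the
configuration above):

    # Check whether a default stickiness is set.
    crm configure show | grep resource-stickiness

    # Either zero it out...
    crm configure rsc_defaults resource-stickiness=0

    # ...or keep some stickiness but below the location score of 50,
    # so the preference for GHA-MO-1 wins once connectivity returns.
    crm configure rsc_defaults resource-stickiness=25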
Re: [Pacemaker] Pacemaker’s built-in resource isolation support
Hi Alex,

Please repost on us...@clusterlabs.org and I'll respond there. (This
list is dead ...)

On 10/16/2015 10:40 PM, Alex Litvak wrote:
> Dear Pacemaker experts,
>
> I am trying to build an HA cluster with LXC guests and I am reading
> the latest pacemaker remoted document.
>
> At some point it says that pacemaker remote shouldn't be used for LXC,
> due to Pacemaker's new built-in resource isolation support. At the
> bottom of the page it says there is no documentation available yet.
>
> Well, is it supported or not? Can anyone at least provide a clue on
> how to set it up? A small example perhaps? After some googling I hit
> the wall.
>
> Thank you
Re: [Pacemaker] pacemaker-remote debian wheezy
On 01/15/2015 08:18 AM, Kristoffer Grönlund wrote:
> Thomas Manninger writes:
>> Hi,
>> I compiled the latest libqb, corosync and pacemaker from source.
>> Now there is no crm command available? Is there another standard
>> shell? Should i use crmadmin?
>> Thanks!
>> Regards, Thomas
>
> You can get crmsh and build from source at crmsh.github.io, or try the
> .rpm packages for various distributions here:
> https://build.opensuse.org/package/show/network:ha-clustering:Stable/crmsh
>
> Congratulations on getting that far, that's probably the hardest part :-)

The crm shell was part of the pacemaker packages in Debian squeeze. It
was going to be separated into its own package for jessie, but that
hasn't made it out of sid/unstable yet, so it might not make it into
the final release.

Since you've built everything else from source, that's probably the
easiest route here too, but if you want to try alternatives:

For the rpm mentioned above, have a look at alien
(https://wiki.debian.org/Alien). crmsh is a standalone package, so
hopefully it would work; I wouldn't try alien for something as
complicated as all the rpm's that go into a pacemaker install.

You could try backporting the sid package
(https://packages.debian.org/source/sid/crmsh), but I suspect the
dependencies would get you.

In theory the crm binary from the squeeze packages should work with
the newer pacemaker, if you can straighten out the library
dependencies.

Or you can use the crm*/cib* command-line tools that come with
pacemaker, if you don't mind the lower-level approach.
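If you try the alien route, it would look roughly like this (a sketch;
the .rpm filename is a placeholder for whatever the OBS repository
currently provides):

    # Convert the crmsh rpm to a .deb and install it.
    alien --to-deb crmsh-2.x.rpm
    dpkg -i crmsh_2.x_all.deb

    # Then let apt resolve anything dpkg complained about.
    apt-get -f install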
Re: [Pacemaker] pacemaker-remote debian wheezy
On 01/13/2015 03:55 AM, Thomas Manninger wrote:
> Hi,
> http://clusterlabs.org/wiki/SourceInstall
> Can i use libqb and corosync from the debian repo, and only compile
> pacemaker?

Corosync should be fine (but be aware wheezy has 1.x and not 2.x when
reading how-to's); libqb is iffy, you're probably better off compiling
it too.

Wheezy's pacemaker 1.1.7 does not support pacemaker-remote; jessie's
1.1.10 should work in a jessie VM, but be aware pacemaker-remote has
received improvements and bugfixes since then. Of course you can
compile 1.1.12 yourself (and optionally use checkinstall to make
.deb's, see https://wiki.debian.org/CheckInstall).

Unfortunately you can't backport the 1.1.10 jessie packages (which
normally would be pretty easy) because the dependencies get too hairy
(in particular you wind up needing a newer version of gcc than is in
wheezy).
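The compile-and-checkinstall route mentioned above would look roughly
like this (a sketch; the build-dependency list is approximate, and the
configure flags are the usual ones for matching distro paths):

    # Build dependencies (approximate list for wheezy).
    apt-get install build-essential automake autoconf libtool \
        pkg-config libglib2.0-dev libxml2-dev libxslt1-dev \
        libbz2-dev uuid-dev

    # From the pacemaker 1.1.12 source tree:
    ./autogen.sh
    ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
    make
    # checkinstall runs "make install" but wraps the result in a .deb,
    # so the install can be tracked and removed with dpkg.
    checkinstall --pkgname pacemaker --pkgversion 1.1.12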
Re: [Pacemaker] pacemaker-remote debian wheezy
On 01/13/2015 03:26 AM, Thomas Manninger wrote:
> Hi,
> thanks for the answer!
> I'll try to build my own dpkg package with the newest source.
> Is pacemaker-remote stable for production use?

Yes
Re: [Pacemaker] pacemaker-remote debian wheezy
On 01/12/2015 12:34 PM, David Vossel wrote:
> ----- Original Message -----
>> What is the best way to install the package "pacemaker-remote" in a
>> debian wheezy vm? This package is not available in the debian
>> repository.
>
> I have no clue. I just want to point out, if your host OS is debian
> wheezy and the pacemaker-remote package is in fact unavailable, it is
> possible the version of pacemaker shipped with wheezy doesn't even
> have the capability of managing pacemaker_remote nodes.
>
> -- Vossel

Wheezy's pacemaker 1.1.7 does not support pacemaker-remote; jessie's
1.1.10 should work in a jessie VM, but be aware pacemaker-remote has
received improvements and bugfixes since then. Of course you can
compile 1.1.12 yourself (and optionally use checkinstall to make
.deb's, see https://wiki.debian.org/CheckInstall).

Unfortunately you can't backport the 1.1.10 jessie packages (which
normally would be pretty easy) because the dependencies get too hairy
(in particular you wind up needing a newer version of gcc than is in
wheezy).
Re: [Pacemaker] best Way to build a 2 Node Cluster with cold Standby ?
On 12/12/2014 03:22 AM, Hauke Bruno Wollentin wrote:
> Hi Hauke,
>
> personally I wouldn't use DRBD for a case like that because of the
> _missing_ replication here. Imho your idea will work, but it would be
> easier to manage with some kind of file synchronisation like rsync,
> unison etc. when the cold standby node comes up.

Agreed. I would say that cold standby isn't high availability, so I
don't think any HA software would be the right tool for the job.

If you also (separately) do a full backup of the main server for data
backup purposes, more frequently than every two weeks, I would
recommend rsync'ing the standby server from the backups. That gives
you a more recent copy of the data in case you have to fail over when
the main server has completely died, and it keeps the additional sync
traffic off the production server.

--
Ken Gaillot

> --- original message ---
> timestamp: Wednesday, December 10, 2014 09:17:31 PM
> from: Hauke Homburg
> to: pacemaker@oss.clusterlabs.org
> subject: [Pacemaker] best Way to build a 2 Node Cluster with cold Standby ?
> message id: <5488aa5b.4070...@w3-creative.de>
>
> Hello,
>
> I want to build a 2-node KVM cluster with the following features:
> node 1 is the primary node for some virtual machines with Linux, and
> node 2 I want to install as a second KVM server too, with the same
> virtual machines on DRBD devices. I want to boot the second node
> every 2 weeks to sync the data and then shut it down. So in case of
> failure of the first node I have a backup server.
>
> What is the best way to do this? I think I install both nodes with
> DRBD and switch the primary to master, the second machine to DRBD
> slave. Does DRBD develop problems when I shut down the slave device
> for such a long time?
>
> greetings
>
> Hauke
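The rsync-from-backups approach suggested above might look like this
(a sketch; the host names and paths are placeholders, and the exact
flags depend on what metadata needs to be preserved):

    # Run on the standby node after it boots, pulling from the backup
    # host rather than the production server, to keep the sync load
    # off production.
    rsync -aHAX --delete \
        backuphost:/backups/mainserver/vm-images/ \
        /var/lib/libvirt/images/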
Re: [Pacemaker] Occasional nonsensical resource agent errors, redux
On 11/04/2014 11:02 AM, Dejan Muhamedagic wrote:
>>> On 1 Nov 2014, at 11:03 pm, Patrick Kane wrote:
>>> Hi all:
>>>
>>> In July, list member Ken Gaillot reported occasional nonsensical
>>> resource agent errors using Pacemaker
>>> (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
>>
>> I was hoping to have something useful before posting another update,
>> but since it's come up again, here's what we've found so far:
>>
>> * The most common manifestation is the "couldn't find command" error.
>> In various instances it "couldn't find" xm, ip or awk. However, we've
>> seen two other variations:
>>
>> lrmd: [3363]: info: RA output: (pan:monitor:stderr) en-destroy: bad variable name
>>
>> and
>>
>> lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1: /usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied
>>
>> The RA in the first case does not use the string "en-destroy" at all;
>> it does call a command "xen-destroy". That, to me, is a strong
>> suggestion of memory corruption somewhere, whether in the RA, the
>> shell, lrmd or a library used by one of those.
>
> Scary. Shell and lrmd are two obvious candidates. I assume that none
> of them would cause a segfault if trampling through the memory where
> a copy of the running script resides.
>
>> * I have not found any bugs in the RA or its included files.
>>
>> * I tried setting "debug: on" in corosync.conf, but that did not give
>> any additional useful information. The resource agent error is still
>> the first unusual message in the sequence. Here is an example, giving
>> one successful monitor run and then an occurrence of the issue (the
>> nodes are a pair of Xen dom0s including pisces, running two Xen domU
>> resources pan and nemesis):
>>
>> Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 372)
>> Sep 13 20:16:56 pisces lrmd: [372]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
>> Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:nemesis monitor[32] (pid 409)
>> Sep 13 20:16:56 pisces lrmd: [409]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
>> Sep 13 20:16:56 pisces lrmd: [3509]: info: operation monitor[21] on pan for client 3512: pid 372 exited with return code 0
>> Sep 13 20:16:57 pisces lrmd: [3509]: info: operation monitor[32] on nemesis for client 3512: pid 409 exited with return code 0
>> Sep 13 20:17:06 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 455)
>> Sep 13 20:17:06 pisces lrmd: [455]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
>
> This "local" seems to be from ocf-binaries:have_binary():
>
>     71     local bin=`echo $1 | sed -e 's/ -.*//'`

Agreed, nothing unusual there; it reinforces the suspicion of memory
corruption.

>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr) en-destroy: bad variable name
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr)
>> Sep 13 20:17:07 pisces lrmd: [3509]: info: operation monitor[21] on pan for client 3512: pid 455 exited with return code 2
>>
>> * I tried reverting several security updates applied in the month or
>> so before we first saw the issue. Reverting the Debian kernel
>> packages to 3.2.57-3 and then 3.2.54-2 did not help, nor did
>> reverting libxml2 to libxml2 2.8.0+dfsg1-7+nmu2.
>
> I suppose that you restarted the cluster stack after the update :)

Yep, with a reboot for the kernel reverts. Some of the libxml2 reverts
got full reboots as well, because they were done with other
maintenance.

>> None of the other updates from that time look like they could have
>> any effect.
>>
>> * Regarding libxml2, I did find that Debian had backported an
>> upstream patch into its 2.8.0+dfsg1-7+nmu3 that introduced a memory
>> corruption bug, which upstream later corrected (the bug never made it
>> into an upstream release, but Debian had backported a specific
>> changeset). I submitted that as Debian Bug #765770, which was just
>> fixed last week. I haven't had a chance to apply that to the affected
>> servers yet, but as mentioned above, reverting to the libxml2 before
>> the introduced bug did not fix the issue.
>>
>> * I have not found a way to intentionally reproduce the issue. :-(
>> We have had 10 occurrences across 3 two-node clusters in five months.
>> Some of the nodes have had only one occurrence during that time, but
>> one pair gets the most of them. With the time between occurrences,
>> it's hard to do something like strace on lrmd, though that's probably
>> a good way forward, scripting something to deal with the output
>> reasonably.
>
> Perhaps dumping core of both lrmd and the shell when this happens
> would help. Are the most affected nodes in any way significantly
> different from the others? By CIB size perhaps?

It's actually simpler. An overview of our setup is:

Cluster #1 (with 6 of the 10 failures): Xen dom0s as nodes, two Xen
domUs as
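On the strace idea discussed above, a rough sketch of what scripting
it might look like (the paths and the rotation scheme are purely
illustrative):

    # Attach to the running lrmd, following forked children (the RA
    # shells), timestamping each call; -ff splits output per pid.
    strace -ff -tt -s 200 -o /var/log/lrmd-trace -p "$(pidof lrmd)" &

    # A cron job could prune old traces so the rare event doesn't
    # fill the disk, e.g. keep only the last day's worth:
    find /var/log -name 'lrmd-trace.*' -mtime +1 -delete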
Re: [Pacemaker] Occasional nonsensical resource agent errors, redux
On 11/03/2014 09:26 AM, Dejan Muhamedagic wrote:
> On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:
>> On Mon, 3 Nov 2014 13:32:45 +1100, Andrew Beekhof wrote:
>>> On 1 Nov 2014, at 11:03 pm, Patrick Kane wrote:
>>>> Hi all:
>>>>
>>>> In July, list member Ken Gaillot reported occasional nonsensical
>>>> resource agent errors using Pacemaker
>>>> (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
>>>> We're seeing similar issues with our install. We have a 2-node
>>>> corosync/pacemaker failover configuration that uses the
>>>> ocf:heartbeat:IPaddr2 resource agent extensively. About once a
>>>> week, we'll get an error like this, out of the blue:
>>>>
>>>> Nov 1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
>>>>
>>>> It goes without saying that the ip command hasn't gone anywhere
>>>> and all the paths are configured correctly. We're currently
>>>> running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside of
>>>> a xen container. Any thoughts from folks on what might be
>>>> happening or how we can get additional debug information to help
>>>> figure out what's triggering this?
>>>
>>> its pretty much in the hands of the agent.
>>
>> Actually the message seems to be output by the check_binary()
>> function, which is part of the framework.
>
> Someone complained on IRC about this issue (another resource agent
> though, I think Xen) and they said that which(1) was not able to find
> the program. I'd suggest to do strace (or ltrace) of which(1) at that
> point (it's in ocf-shellfuncs). The which(1) utility is a simple
> tool: it splits the PATH environment variable and stats the program
> name appended to each of the paths. PATH somehow corrupted or
> filesystem misbehaving? My guess is that it's the former. BTW, was
> there an upgrade of some kind before this started happening?

I was hoping to have something useful before posting another update,
but since it's come up again, here's what we've found so far:

* The most common manifestation is the "couldn't find command" error.
In various instances it "couldn't find" xm, ip or awk. However, we've
seen two other variations:

lrmd: [3363]: info: RA output: (pan:monitor:stderr) en-destroy: bad variable name

and

lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1: /usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied

The RA in the first case does not use the string "en-destroy" at all;
it does call a command "xen-destroy". That, to me, is a strong
suggestion of memory corruption somewhere, whether in the RA, the
shell, lrmd or a library used by one of those.

* I have not found any bugs in the RA or its included files.

* I tried setting "debug: on" in corosync.conf, but that did not give
any additional useful information. The resource agent error is still
the first unusual message in the sequence.

Here is an example, giving one successful monitor run and then an
occurrence of the issue (the nodes are a pair of Xen dom0s including
pisces, running two Xen domU resources pan and nemesis):

Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 372)
Sep 13 20:16:56 pisces lrmd: [372]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:nemesis monitor[32] (pid 409)
Sep 13 20:16:56 pisces lrmd: [409]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Sep 13 20:16:56 pisces lrmd: [3509]: info: operation monitor[21] on pan for client 3512: pid 372 exited with return code 0
Sep 13 20:16:57 pisces lrmd: [3509]: info: operation monitor[32] on nemesis for client 3512: pid 409 exited with return code 0
Sep 13 20:17:06 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 455)
Sep 13 20:17:06 pisces lrmd: [455]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr) en-destroy: bad variable name
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr)
Sep 13 20:17:07 pisces lrmd: [3509]: info: operation monitor[21] on pan for client 3512: pid 455 exited with return code 2

* I tried reverting several security updates applied in the month or
so before we first saw the issue. Reverting the Debian kernel packages
to 3.2.57-3 and then 3.2.54-2 did not help, nor did reverting libxml2
to libxml2 2.8.0+dfsg1-7+nmu2. None of the other updates from that
time look like they could have any effect.

* Regarding libxml2, I did find that Debian had backported an upstream
patch into its 2.8.0+dfsg1-7+nmu3 that introduced a memory corruption
bug, which upstream later corrected (the bug never made it into an
upstream release, but Debian had backported a specific cha
Re: [Pacemaker] MySQL, Percona replication manager - split brain
On 10/25/2014 03:32 PM, Andrew wrote:
> 2) How to resolve split brain state? Is it enough just to wait for
> failure, then restart mysql by hand, clean the row with the duplicate
> index in the slave db, and then run the resource again? Or is there
> some automation for such cases?

Regarding mysql cleanup, it is usually NOT sufficient to fix the one
row with the duplicate key. The duplicate key is a symptom of prior
data inconsistency, and if that isn't cleaned up, at best you'll have
inconsistent data in a few rows, and at worst, replication will keep
breaking at seemingly random times.

You can manually compare the rows immediately prior to the duplicate
ID value to figure out where it started, or use a special-purpose tool
for checking consistency, such as pt-table-checksum from the Percona
toolkit.

--
Ken Gaillot
Network Operations Center, Gleim Publications
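Typical usage of pt-table-checksum looks like this (a sketch; the
host, user, and password in the DSN are placeholders):

    # Run on the master: checksums each table chunk by chunk and
    # records results on every replica, so drift shows up as DIFFS
    # in the report.
    pt-table-checksum --replicate percona.checksums h=master-host,u=checkuser,p=secret

    # pt-table-sync can then print the statements needed to bring a
    # replica back in line (review before applying):
    pt-table-sync --print --replicate percona.checksums h=master-host,u=checkuser,p=secret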
Re: [Pacemaker] runing abitrary script when resource fails
On 10/06/2014 06:20 AM, Alex Samad - Yieldbroker wrote:
> Is it possible to do this? On any major fail, I would like to send a
> signal to my zabbix server.
>
> Alex

Hi Alex,

This sort of thing has been discussed before; for example, see
http://oss.clusterlabs.org/pipermail/pacemaker/2014-August/022418.html

At Gleim, we use an active monitoring approach -- instead of waiting
for a notification, our monitor polls the cluster regularly. In our
case, we're using the check_crm nagios plugin available at
https://github.com/dnsmichi/icinga-plugins/blob/master/scripts/check_crm.
It's a fairly simple Perl script utilizing crm_mon, so you could
probably tweak the output to fit something zabbix expects, if there
isn't an equivalent for zabbix already.

And of course you can configure zabbix to monitor the services running
on the cluster as well.

--
Ken Gaillot
Network Operations Center, Gleim Publications
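The plugin is essentially a wrapper around crm_mon's one-shot output,
so you can see the data it parses directly (standard crm_mon flags;
how you feed the result to zabbix is up to you):

    # One-shot cluster status, including inactive resources and
    # failcounts -- the same information check_crm evaluates.
    crm_mon -1 -r -f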
Re: [Pacemaker] Reload named upon failover to 2nd Pacemaker node
On 09/22/2014 12:06 PM, Martin Thorpe wrote:
> I've successfully set up a cluster IP (floating IP) using "crm
> configure" but I am struggling with how to define the named service
> and tell it to reload on the second node upon failure of the primary
> node of the two-node test cluster.

That was discussed recently on this list, and two approaches were
mentioned:

1. Make a custom resource agent based on ocf:heartbeat:IPaddr2 that
reloads named when it adds or removes an IP address;

2. Our approach at Gleim:

   order dns-restart +inf: dns-ip:start dns-daemon-clone:start

dns-ip is our IPaddr2 resource, and dns-daemon-clone is the clone of
the named resource. So that tells pacemaker to start dns-daemon-clone
after dns-ip, which in practice makes it restart named if dns-ip moves
to the host. This does a full restart rather than a reload, but if
that is acceptable in your situation, it's easy.

BTW, if your apache instance binds to the wildcard address, it won't
need a reload or restart when the IP moves. BIND has the issue because
it can only bind to specific IPs.

--
Ken Gaillot
Network Operations Center, Gleim Publications
Re: [Pacemaker] Notification when a node is down
On 09/12/2014 02:30 AM, Sihan Goi wrote:
> Is there any way for a Pacemaker/Corosync/PCS setup to send a
> notification when it detects that a node in a cluster is down? I read
> that Pacemaker and Corosync log events to syslog, but where is the
> syslog file in CentOS? Do they log events such as a failover
> occurrence?

Pacemaker/corosync do extensive logging, even more so if debug is set
to on in corosync.conf. Syslog is configurable to log the messages
however you want; the default file locations vary from OS to OS.

Monitoring and notification are usually handled by a dedicated package
for that purpose, such as nagios, icinga, monit or zabbix. These
packages can monitor services on the nodes directly, as well as the
health of pacemaker itself. Here, we use icinga with Phil Garner's
check_crm plugin:

https://www.icinga.org/
https://github.com/dnsmichi/icinga-plugins/blob/master/scripts/check_crm

--
Ken Gaillot
Network Operations Center, Gleim Publications
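On CentOS, rsyslog writes these messages to /var/log/messages by
default. Corosync's logging section can also be tuned directly; a
sketch using standard corosync.conf directives (the logfile path is
illustrative):

    logging {
        to_syslog: yes
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
    }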
Re: [Pacemaker] two active-active service processes, but only one vIP
On 09/08/2014 11:02 AM, David Magda wrote:
> On Sat, Sep 06, 2014 at 02:14:29PM -0400, David Magda wrote:
>> On Sep 6, 2014, at 12:00, Ken Gaillot wrote:
>>> # This allows slapd to run on both hosts.
>>> clone ldap-daemon-clone ldap-daemon meta globally-unique="false" interleave="true"
>> [...]
>> I'll give it a go.
>
> This worked.
>
> primitive dmn_slapd ocf:work:slapd \
>     params config="/etc/ldap/slapd.d" user="openldap" \
>     services="ldap:/// ldapi:///" group="openldap" \
>     op monitor interval="20s" timeout="5s"
> clone dmn_slapd_clone dmn_slapd \
>     meta globally-unique="false" interleave="true" target-role="Started"
> colocation colo-vip_ldap2-with-dmn_slapd +inf: vip_ldap2 dmn_slapd_clone
>
> Since the daemon is managed by Pacemaker on each node, is there a way
> to start and stop it on a per-node basis? Doing a "crm resource
> dmn_slapd" seems to affect both systems. What if I want to stop it on
> only one?

I find the easiest way is to put the node in standby and then online
again. This does mean any other resources on the node get restarted,
but that's acceptable in our setup. I am curious whether there is a
cleaner way to restart a cloned resource on a single node only.

--
Ken Gaillot
Network Operations Center, Gleim Publications
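The standby/online round trip looks like this in crm shell syntax (the
node name is a placeholder):

    # Stops every resource on node1, including that node's clone
    # instance, then brings them all back.
    crm node standby node1
    crm node online node1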
Re: [Pacemaker] two active-active service processes, but only one vIP
On 9/5/14 11:23 PM, David Magda wrote:
> I have two nodes that I wish to run OpenLDAP slapd on. I want
> Pacemaker / CRM to check the health of the OpenLDAP daemon, and if
> it's healthy, I want that node to be a candidate for having a vIP
> live on it. If OpenLDAP's slapd is not healthy (process is down,
> incorrect query results, etc.) then I want the vIP to fail over to
> the other (presumably healthy) node. (I also want to do something
> similar with BIND named, but we'll use OpenLDAP as the working case
> for now.)
>
> The main thing is that I want the daemon to run on each node in
> active-active configuration (so Nagios can keep tabs on things), and
> only have the vIP for the LDAP service fail over.
>
> The vIP is straightforward enough:
>
> sudo crm configure primitive vip_ldap2 \
>     ocf:heartbeat:IPaddr2 params ip="10.0.0.89" cidr_netmask="32"
>
> The following line creates a resource where slapd only runs on one of
> the nodes at a time, but I want it running on both:
>
> sudo crm configure primitive srv_slapd \
>     ocf:heartbeat:slapd op monitor interval="30s"
>
> I'm using Debian 7 with the default pacemaker 1.1.7-1 package, with
> the following resource agent:
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/slapd
>
> The slapd process can be either managed or unmanaged, but I think I
> would prefer unmanaged so that we can fiddle with it using the
> regular OS-level service commands. We don't use HA / clustering in a
> lot of places, so it will probably be easy to forget that CRM is
> there, which could lead to frustration if it's doing things behind
> our backs.
>
> From what I could tell, I want to create a primitive
> (is-managed=false) and make an anonymous clone, which can then be run
> on multiple nodes. Somehow? Maybe?

Hi David,

We do something very similar: two nodes running stock wheezy, bind and
slapd on both, and two virtual IPs (one for DNS and one for LDAP) that
can bounce back and forth between the nodes. This type of setup allows
for DNS/LDAP resolution lists of the form virtual-ip, node1-ip,
node2-ip.

If you're really set on not having slapd managed, Alex Samad's
solution of customizing the IPaddr2 resource agent will likely perform
better than trying to have pacemaker monitor an unmanaged resource.

We have bind and slapd as managed resources. You do have to remember
not to use the init script for restarts, but other than that, all the
usual commands work fine. (Even "rndc reload" doesn't bother
pacemaker.)

The LDAP portion of our crm config (with additional comments) is:

# ocf:gleim:slapd is the unmodified slapd resource agent,
# from a later version of resource-agents than is available
# with wheezy
primitive ldap-daemon ocf:gleim:slapd \
    params config="/etc/ldap/slapd.d" \
    user="openldap" group="openldap" \
    services="ldap:/// ldapi:///" \
    op monitor interval="60" timeout="20" \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20"

# This allows slapd to run on both hosts.
clone ldap-daemon-clone ldap-daemon \
    meta globally-unique="false" interleave="true"

# Bring up the virtual IP for LDAP resolution on one node.
# Replace xxx's with your virtual IP and mask.
primitive ldap-ip ocf:heartbeat:IPaddr2 \
    params ip="xxx.xxx.xxx.xxx" cidr_netmask="xxx" \
    op monitor depth="0" timeout="20s" interval="5s" \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20"

# Bring up the virtual IP only on a host with a working slapd.
colocation ldap-ip-with-daemon +inf: ldap-ip ldap-daemon-clone

--
Ken Gaillot
Gleim NOC
Re: [Pacemaker] Resource Agent Restrictions on opening TCP Connection
On 8/26/14 6:32 AM, N, Ravikiran wrote:
> Are there any restrictions on a resource agent script such that it
> cannot open a TCP connection? I have a script as below:
>
> executecmd="$FD<>/dev/tcp/localhost/$PORT"   # FD is the lowest available FD; PORT is a hardcoded port
> eval "exec $executecmd" 2> /dev/null
> retval=$?
> echo $retval   # retval is always 1, irrespective of the TCP server running on localhost:$PORT
>
> Although I can connect to the TCP server running on localhost:$PORT
> using other scripts with the same statements, I cannot connect from
> an OCF RA. So I wanted to know whether there are any restrictions on
> my RA script.

Hello Ravikiran,

I can't speak to whether there are limitations on resource agent
scripts, but one gotcha I've seen when using eval/exec is that it will
likely use the system-wide default shell (e.g. /bin/sh) even if the RA
script itself uses a different shell (e.g. /bin/bash). But when
running from the command line under your own user account, it will use
your account's default shell. So you can get different behaviors
running interactively vs. called from a daemon.

I'd recommend making sure your exec syntax works in the default system
shell, and if that's not it, try replacing your "2>/dev/null" with
"2>/tmp/ra.err" and see if it's generating any interesting output.

--
Ken Gaillot
Gleim NOC
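A quick way to see the gotcha in action: /dev/tcp redirection is a
bash feature, not POSIX sh, so the same line behaves differently under
dash (Debian's /bin/sh). Port 22 here is just an example of a
listening port:

    bash -c 'exec 3<>/dev/tcp/localhost/22 && echo bash: connected'
    # dash has no /dev/tcp handling, so this fails with
    # "No such file or directory":
    sh -c 'exec 3<>/dev/tcp/localhost/22'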
Re: [Pacemaker] Building pacemaker without gnutls
On 8/10/14 7:24 PM, Andrew Beekhof wrote:
> On 10 Aug 2014, at 7:10 pm, Oren wrote:
>> Hi,
>> Can you support pacemaker without gnutls, as it is not FIPS
>> compliant?
>
> Its not?
>
>> This dependency may be replaced by openssl, with a configure flag to
>> control this.
>
> We'll certainly consider a patch that did this. I don't know enough
> about openSSL to create it though.

FYI, this is nontrivial. The FIPS-certified OpenSSL is not the one
normally distributed; applications (pacemaker in this case) have to be
able to use a special, source-only OpenSSL component as-is, with not
the slightest modification to the source or its build process. Woe
unto them who need to change a single character:

"New FIPS 140-2 validations (of any type) are slow (6-12 months is
typical), expensive (US$50,000 is probably typical for an
uncomplicated validation), and unpredictable (completion dates are not
only uncertain when first beginning a validation, but remain so during
the process)."
https://www.openssl.org/docs/fips/fipsnotes.html

The payoff is access to U.S. government contracts, if you're into that
sort of thing.

Ironically, the FIPS-certified OpenSSL can be considered less secure
than the uncertified version, because due to the nature of
certification, bugs and holes get patched much more slowly:
https://blog.bit9.com/2012/04/23/fips-compliance-may-actually-make-openssl-less-secure/

--
Ken Gaillot
Gleim NOC
Re: [Pacemaker] run script when vip failover happens
On 08/08/2014 01:49 AM, Alex Samad - Yieldbroker wrote:
> Hi
>
> So I have taken a slightly different approach. I have taken
> ocf::heartbeat:IPaddr2, copied it to ocf::yb:namedVIP, and added in
> some code to
> 1) check to see if named is running
> 2) reload when it adds/removes an ip address
>
> All seems good except

Except what?!? Don't leave us hanging! ;-)

This can be (acceptably) handled by the stock agents. We have a
similar setup, and do this:

order dns-restart +inf: dns-ip:start dns-daemon-clone:start

dns-ip is our IPaddr2 resource, and dns-daemon-clone is the clone of
the named resource. So the above tells pacemaker to always start
dns-daemon-clone after dns-ip starts, which in practice makes it do a
full restart of named if dns-ip moves to the host. Your approach has
the advantage of being able to do a lighter-weight reload, but the
above is easy. Once or twice in our setup, named did not get restarted
as expected, but that was so rare it wasn't worth trying to track
down.

I had also considered this approach:

primitive dns-rebind ocf:heartbeat:anything \
    params binfile="/usr/sbin/rndc" cmdline_options="reconfig"
group dns-cluster dns-ip dns-rebind meta target-role="Started"

which also has the advantage of doing a reload instead of a full
restart, but it abuses the anything resource to do something it wasn't
meant to do.

> -----Original Message-----
> From: Alex Samad - Yieldbroker [mailto:alex.sa...@yieldbroker.com]
> Sent: Friday, 8 August 2014 2:44 PM
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] run script when vip failover happens
>
> Hi
>
> I have a 2 node cluster. I want to run named on both and have 2 vips
> that are distributed, but I don't want to stop and start named when
> the vip moves. But named doesn't automatically start to listen on the
> new IP address; you have to do a reload.
>
> So how can I attach a script to run when the vip moves to a node?
>
> Alex

--
Ken Gaillot
Network Operations Center, Gleim Publications
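For the copied-agent approach Alex describes, the reload hook might
look something like this (a sketch only; the function name and the
exact call sites inside the copied IPaddr2 are illustrative, not part
of the stock agent):

    reload_named() {
        # Only reload if named is actually running; rndc reconfig
        # makes it rescan interfaces and pick up the newly
        # added/removed listen address.
        if pgrep -x named >/dev/null 2>&1; then
            rndc reconfig || return $OCF_ERR_GENERIC
        fi
        return $OCF_SUCCESS
    }
    # ...called right after the agent adds or removes the address.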
Re: [Pacemaker] Understanding virtual IP migration
On 07/30/2014 03:15 AM, Arjun Pandey wrote:
> Apologies for the delayed response. Well, I am using the IPaddr2
> resource. I don't configure the addresses statically; the address
> gets configured when pacemaker starts this resource. Failure handling
> of this link (when I manually bring the link down) also works. It's
> just that even though the IP has moved, if you bring the failed link
> back up again it still has that IP address. I have a 2 node
> (Active-Standby) cluster with 2 virtual IPs configured. Hence stonith
> is also disabled.

There's not enough information here to guess what's wrong. I suggest
going line-by-line through your "crm configure show" and comparing
against the documentation and examples to make sure you understand how
to make a resource run on one node or multiple nodes.

Also, it's possible that when you brought the link back up, pacemaker
moved the IP back to it.

If you don't have any stonith, and you disconnect the only network
connection between the nodes, then the expected behavior is a
split-brain situation, where both nodes bring up the IPs. The purpose
of stonith is to prevent such a situation by having one of the nodes
shut down the other if communication fails. If you're sure you don't
want to use stonith, then you need to maintain network communication
between your nodes as reliably as possible.

> On Thu, Jul 24, 2014 at 12:00 AM, Ken Gaillot wrote:
>> [...]

--
Ken Gaillot
Network Operations Center, Gleim Publications
Re: [Pacemaker] crm resourse (lsb:apache2) not starting
>> Script is LSB compatible (see
>> http://www.linux-ha.org/wiki/LSB_Resource_Agents). All sequences
>> tested are ok. How can I find out why crm is not starting Apache?
>
> Most likely the status url is not set up/configured. Have you checked
> the apache logs?

On 08.07.2014 16:15, W Forum W wrote:
> Hi,
>
> I have a two node cluster with DRBD, heartbeat and pacemaker (on
> Debian Wheezy). The cluster is working fine: 2 DRBD resources, a
> shared IP, 2 file systems and a postgresql database start, stop,
> migrate, ... correctly.
>
> Now the problem is with the lsb:apache2 resource agent. When I try to
> start it (crm resource start p_ps_apache), I immediately get an error
> like:
>
> p_ps_apache_monitor_6 (node=wegc203136, call=653, rc=7, status=complete): not running
>
> When I start Apache from the console (service apache2 start), it
> works fine. I have checked that the init script is LSB compatible
> (see http://www.linux-ha.org/wiki/LSB_Resource_Agents). All sequences
> tested are ok. How can I find out why crm is not starting Apache?

Is it really not started, or is it just not configured enough to be
successfully monitored, so the monitor op fails? What do your apache
logs say?

--
Ken Gaillot
Network Operations Center, Gleim Publications
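If the failure does turn out to be status-related: the OCF apache
agent (and some init scripts' status actions) probe
http://localhost/server-status, which requires mod_status. A sketch
for Apache 2.2 on wheezy; enable the module with a2enmod and restrict
access as appropriate:

    # a2enmod status   (then place something like this in the config)
    <Location /server-status>
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Location>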
Re: [Pacemaker] Understanding virtual IP migration
On 07/23/2014 12:15 AM, Arjun Pandey wrote:
> I am using a virtual IP resource on a 2 node (Active-Passive)
> cluster. I was testing the migration of the IP address. Bringing the
> link down moves the IP over to the other node. However if I bring the
> interface up on the node, the VIP is still associated with that
> interface. Shouldn't we have removed this when we decided to migrate
> the IP in the first place?
>
> Also, on a related note, plugging out the cable doesn't lead to IP
> migration. I checked the IPaddr monitor logic, which simply checks
> whether the address is still associated with the interface. However,
> shouldn't we be checking link state as well, using ethtool?

Hello Arjun,

Pacemaker can certainly do what you want, but the configuration has to
be exactly right. Can you post what configuration you're using?

Based on the information provided, one guess is that you might have
the IP statically configured on the interface (outside pacemaker), so
that when you bring the interface up, the static configuration is
taking effect. When a resource is managed by pacemaker, it should not
be configured to start or stop by any other means.

Regarding the pull-the-cable test, what is your networking setup? Does
each cluster node have a single network connection, or do you have a
dedicated link for clustering traffic? Do you have any sort of STONITH
configured?

--
Ken Gaillot
Network Operations Center, Gleim Publications
Re: [Pacemaker] Up-To-Date How To (Not Jaking "Clusters on Virtualized Platforms")
On 07/17/2014 02:01 PM, Nick Cameo wrote:
> For the sake of not hijacking a previous post, I am reaching out to
> the community for an up-to-date Pacemaker, OpenAIS, DRBD, GFS2/OCFS2
> tutorial. We went down this avenue before and got everything working;
> however, at the time the dlm_controld and o2cb related "stuff" was
> partially taken care of by cman.

Hi Nick,

Our setup isn't exactly what you're looking for, but we have a cluster
using Debian+Pacemaker+Corosync+DRBD+CLVM to share storage between two
Xen dom0s. It manages DLM itself, so this configuration excerpt might
be helpful. I'm omitting the CLVM and volume group resource config
since you're not interested in that, but I'm guessing your GFS2/OCFS2
resources would take their place here.

# The Distributed Lock Manager is needed by CLVM and corosync.
primitive dlm ocf:pacemaker:controld \
    op monitor interval="120" timeout="30" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100"

# Put DLM, CLVM and the volume group into a cloned group,
# so they are started and stopped together, in proper order.
group cluster-storage-group dlm clvm vg1
clone cluster-storage-clone cluster-storage-group \
    meta globally-unique="false" interleave="true"

# DRBD cannot be in cluster-storage-group because it is already a
# master-slave clone, so instead group and order it using colocation.
colocation colo-drbd-lock inf: cluster-storage-clone ms-drbd-clvm:Master
order ord-drbd-lock inf: ms-drbd-clvm:promote cluster-storage-clone:start

--
Ken Gaillot
Network Operations Center, Gleim Publications
Re: [Pacemaker] Occasional nonsensical resource agent errors
On 07/15/2014 02:31 PM, Andrew Daugherity wrote:
>> Message: 1
>> Date: Sat, 12 Jul 2014 09:42:57 -0400
>> From: Ken Gaillot
>> To: pacemaker@oss.clusterlabs.org
>> Subject: [Pacemaker] Occasional nonsensical resource agent errors
>> since Debian 3.2.57-3+deb7u1 kernel update
>>
>> Hi,
>>
>> We run multiple deployments of corosync+pacemaker on Debian "wheezy"
>> for high-availability of various resources. The configurations are
>> unchanged and ran without any issues for many months. However, since
>> we applied the Debian 3.2.57-3+deb7u1 kernel update in May, we have
>> been getting resource agent errors on rare occasions, with error
>> messages that are clearly incorrect.
>> [...]
>> Given the odd error messages from the resource agent, I suspect it's
>> a memory corruption error of some sort. We've been unable to find
>> anything else useful in the logs, and we'll probably end up
>> reverting to the prior kernel version. But given the rarity of the
>> issue, it would be a long while before we could be confident that
>> fixed it.
>>
>> Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1
>> kernel or later? Has anyone had any similar issues?
>
> Just curious, I see you're running Xen; are you setting dom0_mem? I
> had similar issues with SLES 11 SP2 and SP3 (but not <= SP1) that
> were apparently random memory corruption due to a kernel bug. It was
> mostly random, but I did eventually find a repeatable test case:
> checksum verification of a kernel build tree with mtree; on affected
> systems there would usually be a few files that failed to verify.
>
> I had been setting dom0_mem=768M, as that was a good balance between
> maximizing memory available for VMs while keeping enough for services
> in Dom0 (including pacemaker/corosync), and I set node attributes for
> pacemaker utilization to 1GB less than physical RAM, leaving 256M
> available for Xen overhead, etc.
>
> Raising it to 2048M (or not setting it at all) was a sufficient
> workaround to avoid the bug, but I have finally received a fixed
> kernel from Novell support. Note: this fix has not yet made it into
> any official updates for SLES 11 -- Novell/SUSE say it will be in the
> next kernel version, whenever that happens. Recent openSUSE kernels
> are also affected (and have yet to be fixed).
>
> -Andrew

Hi Andrew,

Thanks for the feedback!

Our "aries/taurus" cluster are Xen dom0s, and we pin dom0_mem so
there's at least 1GB RAM reported in the dom0 OS. (The version of
Xen+Linux kernel in wheezy has an issue where the reported RAM is less
than the dom0_mem value, so dom0_mem is actually higher.) However, we
are also seeing the issue on our "talos/pomona" cluster, which are not
dom0s, so I don't suspect Xen itself. But it could be the same kernel
issue.

mtree isn't packaged for Debian, and I'm not familiar with it,
although I did see a Linux port on Google Code. How do you use it for
your test case? What do the detected differences signify?

Do you know what kernel and Xen versions were in SP2/3, and what
specifically was fixed in the kernel they gave you?

--
Ken Gaillot
Network Operations Center, Gleim Publications
[Pacemaker] Occasional nonsensical resource agent errors since Debian 3.2.57-3+deb7u1 kernel update
Hi,

We run multiple deployments of corosync+pacemaker on Debian "wheezy"
for high-availability of various resources. The configurations are
unchanged and ran without any issues for many months. However, since
we applied the Debian 3.2.57-3+deb7u1 kernel update in May, we have
been getting resource agent errors on rare occasions, with error
messages that are clearly incorrect.

The incidents have happened four times on two unrelated clusters:

* Our cluster hosts "talos" and "pomona" use pacemaker to manage a few
virtual IP addresses using the ocf:heartbeat:IPaddr2 resource agent.
This one has had two incidents. The first incident began with this
error:

Jun 2 17:30:16 pomona lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1: /usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied

The second incident began with this error:

Jul 12 08:36:15 talos IPaddr2[21294]: ERROR: Setup problem: couldn't find command: ip

I can confidently say the permissions of IPaddr2 and the location of
the "ip" command did not change at any point!

* Our cluster hosts "aries" and "taurus" use pacemaker in a more
complicated setup, managing Xen virtual machines on shared storage
utilizing DRBD and CLVM, using the resource agents
ocf:pacemaker:controld, ocf:gleim:clvmd (which is the stock clvmd
resource agent from a later pacemaker version than is included in
wheezy), ocf:heartbeat:LVM, ocf:linbit:drbd, and ocf:gleim:Xen (which
is the stock Xen resource agent with a trivial one-line change for a
local workaround). This cluster has also had two incidents.

The first began with:

Jun 16 10:38:15 aries lrmd: [3646]: info: RA output: (jabber:monitor:stderr) /usr/lib/ocf/resource.d//gleim/Xen: 71: local: en-list: bad variable name

There is no variable "en-list" in the resource agent; the closest
string in the file is "xen-list", which is a binary, not a variable,
used like this:

...
if have_binary xen-list; then
    xen-list $1 2>/dev/null | grep -qs "State.*[-r][-b][-p]--" 2>/dev/null
...

The second began with:

Jun 21 11:58:58 taurus Xen[9052]: ERROR: Setup problem: couldn't find command: awk

Again, the location of "awk" has not changed.

We have no reason to suspect the kernel update other than the timing,
and the fact that the incidents occur on unrelated clusters. We have
since upgraded to Debian's next update, 3.2.57-3+deb7u2, but the most
recent incident occurred after that. The original update included
fixes for these issues:

CVE-2014-0196
    Jiri Slaby discovered a race condition in the pty layer, which
    could lead to denial of service or privilege escalation.

CVE-2014-1737 / CVE-2014-1738
    Matthew Daley discovered that missing input sanitising in the
    FDRAWCMD ioctl and an information leak could result in privilege
    escalation.

CVE-2014-2851
    Incorrect reference counting in the ping_init_sock() function
    allows denial of service or privilege escalation.

CVE-2014-3122
    Incorrect locking of memory can result in local denial of service.

Given the odd error messages from the resource agent, I suspect it's a
memory corruption error of some sort. We've been unable to find
anything else useful in the logs, and we'll probably end up reverting
to the prior kernel version. But given the rarity of the issue, it
would be a long while before we could be confident that fixed it.

Is anyone else running pacemaker on Debian with the 3.2.57-3+deb7u1
kernel or later? Has anyone had any similar issues?

--
Ken Gaillot
Gleim NOC