Re: [Pacemaker] Occasional nonsensical resource agent errors, redux

Ken Gaillot Mon, 03 Nov 2014 08:41:40 -0800

On 11/03/2014 09:26 AM, Dejan Muhamedagic wrote:

On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:

On Mon, 3 Nov 2014 13:32:45 +1100
Andrew Beekhof <and...@beekhof.net> wrote:

On 1 Nov 2014, at 11:03 pm, Patrick Kane <p...@wawd.com> wrote:

Hi all:

In July, list member Ken Gaillot reported occasional nonsensical resource agent 
errors using Pacemaker 
(http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).

We're seeing similar issues with our install.  We have a 2 node 
corosync/pacemaker failover configuration that is using the 
ocf:heartbeat:IPaddr2 resource agent extensively.  About once a week, we'll get 
an error like this, out of the blue:

   Nov  1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip

It goes without saying that the ip command hasn't gone anywhere and all the 
paths are configured correctly.

We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 inside 
of a xen container.

Any thoughts from folks on what might be happening or how we can get additional 
debug information to help figure out what's triggering this?


It's pretty much in the hands of the agent.


Actually, the message seems to be output by the check_binary() function,
which is part of the framework.


Someone complained on IRC about this issue (another resource
agent though, I think Xen) and they said that which(1) was not
able to find the program. I'd suggest running strace (or ltrace)
on which(1) at that point (it's called in ocf-shellfuncs).

The which(1) utility is a simple tool: it splits the PATH
environment variable and stats the program name appended to each
of the paths. Is PATH somehow corrupted, or is the filesystem
misbehaving? My guess is that it's the former.
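
To tell the two possibilities apart, one could log what a which-style lookup sees at the moment of failure. The following is a minimal Python sketch of the lookup described above, not the actual which(1) or ocf-shellfuncs code; the function name explain_lookup is made up for illustration. It splits PATH, stats the command under each directory, and records why each candidate was accepted or rejected.

```python
import os
import stat


def explain_lookup(command, path=None):
    """Return one (directory, verdict) pair per PATH entry, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    results = []
    for directory in path.split(os.pathsep):
        # An empty PATH entry means the current directory in POSIX shells.
        candidate = os.path.join(directory or ".", command)
        try:
            info = os.stat(candidate)
        except OSError as exc:
            # ENOENT is expected for most entries; EACCES or EIO would
            # point at the filesystem rather than at PATH.
            results.append((directory, "stat failed: " + str(exc)))
            continue
        if not stat.S_ISREG(info.st_mode):
            verdict = "not a regular file"
        elif not os.access(candidate, os.X_OK):
            verdict = "not executable"
        else:
            verdict = "found"
        results.append((directory, verdict))
    return results


if __name__ == "__main__":
    # Printing the raw PATH with repr() makes stray or corrupted bytes visible.
    print("PATH=%r" % os.environ.get("PATH"))
    for directory, verdict in explain_lookup("ip"):
        print("%-30r %s" % (directory, verdict))
```

If PATH itself is mangled, the repr() output would show it directly; if PATH looks fine but stat fails with something other than "No such file or directory" in the directory that holds the binary, the filesystem is the more likely suspect.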

BTW, was there an upgrade of some kind before this started
happening?

I was hoping to have something useful before posting another update, but since it's come up again, here's what we've found so far:

* The most common manifestation is the "couldn't find command" error. In various instances it "couldn't find" xm, ip or awk. However, we've seen two other variations:

lrmd: [3363]: info: RA output: (pan:monitor:stderr) en-destroy: bad variable name

and

lrmd: [2145]: info: RA output: (ldap-ip:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1: /usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied

The RA in the first case does not use the string "en-destroy" at all; it does call a command "xen-destroy". That, to me, is a strong suggestion of memory corruption somewhere, whether in the RA, the shell, lrmd or a library used by one of those.


* I have not found any bugs in the RA or its included files.

* I tried setting "debug: on" in corosync.conf, but that did not give any additional useful information. The resource agent error is still the first unusual message in the sequence. Here is an example, giving one successful monitor run and then an occurrence of the issue (the nodes are a pair of Xen dom0s including pisces, running two Xen domU resources pan and nemesis):


Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 372)
Sep 13 20:16:56 pisces lrmd: [372]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Sep 13 20:16:56 pisces lrmd: [3509]: debug: rsc:nemesis monitor[32] (pid 409)
Sep 13 20:16:56 pisces lrmd: [409]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Sep 13 20:16:56 pisces lrmd: [3509]: info: operation monitor[21] on pan for client 3512: pid 372 exited with return code 0
Sep 13 20:16:57 pisces lrmd: [3509]: info: operation monitor[32] on nemesis for client 3512: pid 409 exited with return code 0

Sep 13 20:17:06 pisces lrmd: [3509]: debug: rsc:pan monitor[21] (pid 455)
Sep 13 20:17:06 pisces lrmd: [455]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local:
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr) en-destroy: bad variable name
Sep 13 20:17:07 pisces lrmd: [3509]: info: RA output: (pan:monitor:stderr)
Sep 13 20:17:07 pisces lrmd: [3509]: info: operation monitor[21] on pan for client 3512: pid 455 exited with return code 2

* I tried reverting several security updates applied in the month or so before we first saw the issue. Reverting the Debian kernel packages to 3.2.57-3 and then 3.2.54-2 did not help, nor did reverting libxml2 to libxml2 2.8.0+dfsg1-7+nmu2. None of the other updates from that time look like they could have any effect.

* Regarding libxml2, I did find that Debian had backported an upstream patch into its 2.8.0+dfsg1-7+nmu3 that introduced a memory corruption bug, which upstream later corrected (the bug never made it into an upstream release, but Debian had backported a specific changeset). I submitted that as Debian Bug #765770 which was just fixed last week. I haven't had a chance to apply that to the affected servers yet, but as mentioned above, reverting to the libxml2 before the introduced bug did not fix the issue.

* I have not found a way to intentionally reproduce the issue. :-( We have had 10 occurrences across 3 two-node clusters in five months. Some of the nodes have had only one occurrence during that time, but one pair gets the most of them. With the time between occurrences, it's hard to do something like strace on lrmd, though that's probably a good way forward, scripting something to deal with the output reasonably.

* There does not seem to be any correlation with how long the node has been up. Checking RAM usage of corosync and lrmd on all nodes over about two weeks shows little to no change, so I don't suspect a leak. Most of our errors have occurred in the Xen RA, but probably only because that's the RA we use most; we've also seen it in IPaddr2.

* My next idea would be to compile/install the latest versions of at least pacemaker and the resource agents. However I am in the middle of changing jobs, and unfortunately do not have much time left for this. My new job will have plenty of time to spend on pacemaker ;-) so I may be able to give updates later. Debian's "jessie" release freezes this week, so I'm hoping that I will have time to at least get a test cluster up running the somewhat newer versions in that (pacemaker 1.1.10, corosync 1.4.6).
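
Since occurrences are weeks apart, one way to catch them without watching the logs by hand would be a small filter over syslog. The following Python sketch is only an illustration: the function name failed_operations and both regular expressions are assumptions written against the lrmd log lines quoted in this thread, and other lrmd or pacemaker versions may log differently. It groups "RA output" stderr lines by resource and operation, then reports them whenever that operation exits with a non-zero return code.

```python
import re

# Patterns assume the lrmd message format quoted in this thread.
RA_OUTPUT = re.compile(
    r"RA output: \((?P<rsc>[^:]+):(?P<op>[^:]+):stderr\) ?(?P<text>.*)$"
)
OP_DONE = re.compile(
    r"operation (?P<op>\w+)\[\d+\] on (?P<rsc>\S+) for client \d+: "
    r"pid (?P<pid>\d+) exited with return code (?P<rc>\d+)"
)


def failed_operations(lines):
    """Yield (resource, operation, pid, rc, stderr_lines) for each non-zero exit."""
    pending = {}  # (resource, operation) -> stderr text seen since the last exit
    for line in lines:
        out = RA_OUTPUT.search(line)
        if out:
            key = (out.group("rsc"), out.group("op"))
            pending.setdefault(key, []).append(out.group("text"))
            continue
        done = OP_DONE.search(line)
        if done:
            key = (done.group("rsc"), done.group("op"))
            # Discard collected stderr either way so it is not
            # attributed to a later run of the same operation.
            stderr_lines = pending.pop(key, [])
            rc = int(done.group("rc"))
            if rc != 0:
                yield key[0], key[1], int(done.group("pid")), rc, stderr_lines
```

Feeding it an open log file, for example failed_operations(open(logfile)) where logfile is whatever file syslog writes lrmd messages to on the node, yields one tuple per failed operation. The reported pid could then be matched against strace output collected with -f on lrmd, if that is running at the time.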


-- Ken Gaillot <kjgai...@gleim.com>
Network Operations Center, Gleim Publications

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
