31.12.2015 12:57:45 CET, Bogdan Dobrelya <bdobre...@mirantis.com> wrote:
>Hello.
>I've been hopelessly fighting a bug [0] in the custom OCF agent of the Fuel
>for OpenStack project. It shows up in a destructive test case where
>one node out of 3 or 5 goes down and then comes back. The bug itself is
>tricky (rarely reproduced), tl;dr, and has many duplicates, so I only
>link the latest comment here.
>
>As it says, at some point, after the rabbit OCF monitor reported an error
>followed by several "not running" reports (see crmd log snippet [1]),
>pacemaker starts "thinking" everything is fine with the resource and shows
>it as "running", while in fact it is completely dead, and a manually
>triggered OCF monitor action confirms that (not running). But *why* does
>pacemaker show the resource as running and never call the monitor action
>again? I have no idea how to get to the root cause of this pacemaker
>behaviour.
>
>So I'm asking for any recommendations on how to debug and troubleshoot
>this strange situation, and on which log patterns to look for (and
>where).
>Thank you in advance!
>
>PS. This is Pacemaker 1.1.12, Corosync 2.3.4, libqb0 0.17.0 from Ubuntu
>vivid. But the Corosync & Pacemaker cluster looks healthy and I can find
>no log records saying otherwise.
>
>[0] https://bugs.launchpad.net/fuel/+bug/1472230/comments/32
>[1] http://pastebin.com/0UuBvzzz

Hi.
First, could you paste your CIB, preferably in crmsh format rather than XML?
Just to check that everything is fine with the resource and fencing
configuration.
Then, you may enable blackbox tracing inside pacemaker. It is controlled by
the USR1, USR2 and TRAP signals iirc (enable, disable and dump the blackbox,
respectively); a quick Google search should point you to Andrew's blog with
all the information about that feature.
Next, if you use ocf-shellfuncs in your RA, you could enable tracing for the
resource agent itself: just add 'trace_ra=1' to every operation config (start
and monitor).
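
As a rough sketch of the 'trace_ra=1' suggestion, a primitive in crmsh format
could look something like the fragment below. The resource name, agent
provider, intervals and timeouts are placeholders for illustration only, not
taken from the actual CIB; the only relevant part is the extra 'trace_ra=1'
attribute on each op.

```
# Placeholder resource name, agent and timings (assumptions);
# only the trace_ra=1 attribute on each op matters here.
primitive p_rabbitmq-server ocf:fuel:rabbitmq-server \
    op start interval=0 timeout=120s trace_ra=1 \
    op monitor interval=30s timeout=60s trace_ra=1
```

With ocf-shellfuncs the traces typically end up under
/var/lib/heartbeat/trace_ra/, one file per invoked operation, so the last
monitor runs before pacemaker stops calling them should be visible there.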

All that may give you some additional hints on what's going on.

Also, you may think about upgrading pacemaker to 1.1.14-rcX, together with
libqb 0.17.2 (and rebuilding corosync against that libqb).

Best,
Vladislav



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
