Hi,

I translated a PostgreSQL multi-state RA (https://github.com/dalibo/PAF) to Python (https://github.com/ulodciv/deploy_cluster), and I have been editing it heavily.
In parallel, I am writing unit tests and functional tests. I am having an issue with a functional test that abruptly powers off a slave, say "test3" (a hot standby PG instance). Later on, I power the slave back on. Once it is up, I run "pcs cluster start test3". And this is where my problem starts: every second, I check the output of "pcs status xml" until test3 is reported ready as a slave again. In the following extract, I take test3 to be ready as a slave:

<nodes>
  <node name="test1" id="1" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="false" resources_running="2" type="member" />
  <node name="test2" id="2" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="true" resources_running="1" type="member" />
  <node name="test3" id="3" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="false" resources_running="1" type="member" />
</nodes>
<resources>
  <clone id="pgsql-ha" multi_state="true" unique="false" managed="true" failed="false" failure_ignored="false" >
    <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
      <node name="test3" id="3" cached="false"/>
    </resource>
    <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
      <node name="test1" id="1" cached="false"/>
    </resource>
    <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
      <node name="test2" id="2" cached="false"/>
    </resource>
  </clone>

By "ready" I mean that upon running "pcs cluster start test3", the following occurs before test3 appears ready in the XML:

pcs cluster start test3
    monitor         -> RA returns unknown error (1)
    notify/pre-stop -> RA returns ok (0)
    stop            -> RA returns ok (0)
    start           -> RA returns ok (0)

The problem is that between "pcs cluster start test3" and the "monitor" action, the XML returned by "pcs status xml" already reports test3 as ready (the extract above is exactly what I get at that moment). Once "monitor" runs, the returned XML shows test3 as offline, and only after the start has finished is test3 shown as ready again.

Am I getting anything wrong? Is there a simpler or better way to check that test3 is fully functional again, i.e. that the OCF start was successful?

Thanks,
Ludovic
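P.S. For concreteness, here is a stripped-down sketch of the kind of check I am describing. The helper names (is_active_slave, wait_until_slave), the timeout handling, and the XPath details are illustrative rather than my exact test code; only the "pcs status xml" call, the node/resource names, and the attributes checked come from the output above.

    import subprocess
    import time
    import xml.etree.ElementTree as ET

    def is_active_slave(status_xml, node, resource_id="pgsqld"):
        """True if 'node' is online and runs 'resource_id' as an active, unfailed Slave."""
        root = ET.fromstring(status_xml)
        # The node itself must be reported online in the <nodes> section.
        node_el = root.find(f".//nodes/node[@name='{node}']")
        if node_el is None or node_el.get("online") != "true":
            return False
        # Look for a clone instance of the resource in role Slave on that node.
        for res in root.iter("resource"):
            if (res.get("id") == resource_id
                    and res.get("role") == "Slave"
                    and res.get("active") == "true"
                    and res.get("failed") == "false"
                    and res.find(f"node[@name='{node}']") is not None):
                return True
        return False

    def wait_until_slave(node, timeout=120):
        """Poll 'pcs status xml' once per second until 'node' is a working slave."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(["pcs", "status", "xml"])
            if is_active_slave(out.decode(), node):
                return True
            time.sleep(1)
        return False

As described above, this check returns True in the window between "pcs cluster start test3" and the first "monitor", even though the start has not actually happened yet.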