I will look into adding alerts, thanks for the info. For now I have introduced a 5-second sleep after "pcs cluster start ...". It seems to be enough time for the monitor to run.
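In test code that shells out to pcs, that workaround might look something like the following sketch (the helper name is made up):

    # Sketch of the interim workaround: start the cluster stack on the node
    # that was powered off, then give Pacemaker a few seconds so the first
    # monitor on it has had a chance to run before the test reads status.
    import subprocess
    import time

    def start_node_and_settle(node="test3", settle_seconds=5):
        subprocess.check_call(["pcs", "cluster", "start", node])
        time.sleep(settle_seconds)  # crude; see the alert/node_state ideas below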
On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot <kgail...@redhat.com> wrote:
> Another possibility you might want to look into is alerts. Pacemaker can
> call a script of your choosing whenever a resource is started or
> stopped. See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296
>
> for the concepts, and the pcs man page for the "pcs alert" interface.
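To make that concrete, here is a minimal sketch of such an alert agent in Python (the script path, log file and message format are only illustrative; the CRM_alert_* variables are the ones documented in Pacemaker Explained):

    #!/usr/bin/env python
    # Minimal Pacemaker alert agent sketch: Pacemaker runs it with the event
    # described in CRM_alert_* environment variables.
    import os
    import time

    LOG = "/var/log/pgha_alerts.log"  # hypothetical destination

    def main():
        if os.environ.get("CRM_alert_kind") != "resource":
            return  # ignore node/fencing/attribute events in this sketch
        task = os.environ.get("CRM_alert_task", "")
        if task in ("start", "stop", "promote", "demote"):
            with open(LOG, "a") as f:
                f.write("%s %s %s on %s rc=%s\n" % (
                    time.strftime("%Y-%m-%d %H:%M:%S"),
                    task,
                    os.environ.get("CRM_alert_rsc", ""),
                    os.environ.get("CRM_alert_node", ""),
                    os.environ.get("CRM_alert_rc", "")))

    if __name__ == "__main__":
        main()

If I read the pcs man page correctly, something along the lines of "pcs alert create path=/usr/local/bin/pgha_alert.py" registers the agent, though the exact recipient/option syntax is worth checking against the installed pcs version.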
> On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> > I checked the node_state of the node that is killed and brought back
> > (test3). in_ccm == true and crmd == online for a second or two between
> > "pcs cluster start test3" and "monitor":
> >
> > <node_state id="3" uname="test3" in_ccm="true" crmd="online"
> >     crm-debug-origin="peer_update_callback" join="member" expected="member">
> >
> > On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> > <ludovi...@gmail.com> wrote:
> >
> > Yes I haven't been using the "nodes" element in the XML, only the
> > "resources" element. I couldn't find "node_state" elements or
> > attributes in the XML, so after some searching I found that it is in
> > the CIB that can be gotten with "pcs cluster cib foo.xml". I will
> > start exploring this as an alternative to crm_mon/"pcs status".
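A rough sketch of what that node_state check could look like from test code (the helper name is made up, and it assumes "pcs cluster cib" with no filename prints the CIB on stdout):

    # Sketch: decide whether a node is back by looking at <node_state> in the
    # CIB rather than at the <nodes> summary of "pcs status xml".
    import subprocess
    import xml.etree.ElementTree as ET

    def node_is_up(node_name):
        cib = ET.fromstring(subprocess.check_output(["pcs", "cluster", "cib"]))
        for state in cib.iter("node_state"):
            if state.get("uname") == node_name:
                # in_ccm: corosync membership; crmd: pacemaker level
                return (state.get("in_ccm") == "true"
                        and state.get("crmd") == "online")
        return False

As the snapshots below show, though, in_ccm/crmd can already read true/online before the first monitor has run on the rejoining node, so this alone does not answer "has the resource actually started".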
> >
> > However I still find what happens to be confusing, so below I try to
> > better explain what I see:
> >
> > Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> > shutdown a minute ago):
> >
> > crm_mon -1:
> >
> >     Stack: corosync
> >     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> >     Last updated: Fri May 12 10:45:36 2017
> >     Last change: Fri May 12 09:18:13 2017 by root via crm_attribute on test1
> >
> >     3 nodes and 4 resources configured
> >
> >     Online: [ test1 test2 ]
> >     OFFLINE: [ test3 ]
> >
> >     Active resources:
> >
> >     Master/Slave Set: pgsql-ha [pgsqld]
> >         Masters: [ test1 ]
> >         Slaves: [ test2 ]
> >     pgsql-master-ip    (ocf::heartbeat:IPaddr2):    Started test1
> >
> > crm_mon -X:
> >
> >     <resources>
> >         <clone id="pgsql-ha" multi_state="true" unique="false" managed="true"
> >             failed="false" failure_ignored="false" >
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Master"
> >                 active="true" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="1" >
> >                 <node name="test1" id="1" cached="false"/>
> >             </resource>
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave"
> >                 active="true" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="1" >
> >                 <node name="test2" id="2" cached="false"/>
> >             </resource>
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Stopped"
> >                 active="false" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="0" />
> >         </clone>
> >         <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2"
> >             role="Started" active="true" orphaned="false" managed="true"
> >             failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node name="test1" id="1" cached="false"/>
> >         </resource>
> >     </resources>
> >
> > At 10:45:39.440, after "pcs cluster start test3", before first
> > "monitor" on test3 (this is where I can't seem to know that
> > resources on test3 are down):
> >
> > crm_mon -1:
> >
> >     Stack: corosync
> >     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> >     Last updated: Fri May 12 10:45:39 2017
> >     Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
> >
> >     3 nodes and 4 resources configured
> >
> >     Online: [ test1 test2 test3 ]
> >
> >     Active resources:
> >
> >     Master/Slave Set: pgsql-ha [pgsqld]
> >         Masters: [ test1 ]
> >         Slaves: [ test2 test3 ]
> >     pgsql-master-ip    (ocf::heartbeat:IPaddr2):    Started test1
> >
> > crm_mon -X:
> >
> >     <resources>
> >         <clone id="pgsql-ha" multi_state="true" unique="false" managed="true"
> >             failed="false" failure_ignored="false" >
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Master"
> >                 active="true" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="1" >
> >                 <node name="test1" id="1" cached="false"/>
> >             </resource>
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave"
> >                 active="true" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="1" >
> >                 <node name="test2" id="2" cached="false"/>
> >             </resource>
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave"
> >                 active="true" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="1" >
> >                 <node name="test3" id="3" cached="false"/>
> >             </resource>
> >         </clone>
> >         <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2"
> >             role="Started" active="true" orphaned="false" managed="true"
> >             failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node name="test1" id="1" cached="false"/>
> >         </resource>
> >     </resources>
> >
> > At 10:45:41.606, after first "monitor" on test3 (I can now tell the
> > resources on test3 are not ready):
> >
> > crm_mon -1:
> >
> >     Stack: corosync
> >     Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> >     Last updated: Fri May 12 10:45:41 2017
> >     Last change: Fri May 12 10:45:39 2017 by root via crm_attribute on test1
> >
> >     3 nodes and 4 resources configured
> >
> >     Online: [ test1 test2 test3 ]
> >
> >     Active resources:
> >
> >     Master/Slave Set: pgsql-ha [pgsqld]
> >         Masters: [ test1 ]
> >         Slaves: [ test2 ]
> >     pgsql-master-ip    (ocf::heartbeat:IPaddr2):    Started test1
> >
> > crm_mon -X:
> >
> >     <resources>
> >         <clone id="pgsql-ha" multi_state="true" unique="false" managed="true"
> >             failed="false" failure_ignored="false" >
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Master"
> >                 active="true" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="1" >
> >                 <node name="test1" id="1" cached="false"/>
> >             </resource>
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave"
> >                 active="true" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="1" >
> >                 <node name="test2" id="2" cached="false"/>
> >             </resource>
> >             <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Stopped"
> >                 active="false" orphaned="false" managed="true" failed="false"
> >                 failure_ignored="false" nodes_running_on="0" />
> >         </clone>
> >         <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2"
> >             role="Started" active="true" orphaned="false" managed="true"
> >             failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node name="test1" id="1" cached="false"/>
> >         </resource>
> >     </resources>
> >
> > On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot <kgail...@redhat.com> wrote:
> >
> > On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> > > Hi
> > > I translated a Postgresql multi-state RA (https://github.com/dalibo/PAF)
> > > in Python (https://github.com/ulodciv/deploy_cluster), and I have been
> > > editing it heavily.
> > >
> > > In parallel I am writing unit tests and functional tests.
> > >
> > > I am having an issue with a functional test that abruptly powers off a
> > > slave named, say, "host3" (hot standby PG instance). Later on I start the
> > > slave back. Once it is started, I run "pcs cluster start host3". And
> > > this is where I start having a problem.
> > >
> > > I check every second the output of "pcs status xml" until host3 is said
> > > to be ready as a slave again. In the following I assume that test3 is
> > > ready as a slave:
> > >
> > > <nodes>
> > >     <node name="test1" id="1" online="true" standby="false"
> > >         standby_onfail="false" maintenance="false" pending="false"
> > >         unclean="false" shutdown="false" expected_up="true" is_dc="false"
> > >         resources_running="2" type="member" />
> > >     <node name="test2" id="2" online="true" standby="false"
> > >         standby_onfail="false" maintenance="false" pending="false"
> > >         unclean="false" shutdown="false" expected_up="true" is_dc="true"
> > >         resources_running="1" type="member" />
> > >     <node name="test3" id="3" online="true" standby="false"
> > >         standby_onfail="false" maintenance="false" pending="false"
> > >         unclean="false" shutdown="false" expected_up="true" is_dc="false"
> > >         resources_running="1" type="member" />
> > > </nodes>
> >
> > The <nodes> section says nothing about the current state of the nodes.
> > Look at the <node_state> entries for that. in_ccm means the cluster
> > stack level, and crmd means the pacemaker level -- both need to be up.
> >
> > > <resources>
> > >     <clone id="pgsql-ha" multi_state="true" unique="false" managed="true"
> > >         failed="false" failure_ignored="false" >
> > >         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave"
> > >             active="true" orphaned="false" managed="true" failed="false"
> > >             failure_ignored="false" nodes_running_on="1" >
> > >             <node name="test3" id="3" cached="false"/>
> > >         </resource>
> > >         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Master"
> > >             active="true" orphaned="false" managed="true" failed="false"
> > >             failure_ignored="false" nodes_running_on="1" >
> > >             <node name="test1" id="1" cached="false"/>
> > >         </resource>
> > >         <resource id="pgsqld" resource_agent="ocf::heartbeat:pgha" role="Slave"
> > >             active="true" orphaned="false" managed="true" failed="false"
> > >             failure_ignored="false" nodes_running_on="1" >
> > >             <node name="test2" id="2" cached="false"/>
> > >         </resource>
> > >     </clone>
> > >
> > > By ready to go I mean that upon running "pcs cluster start test3", the
> > > following occurs before test3 appears ready in the XML:
> > >
> > > pcs cluster start test3
> > > monitor         -> RA returns unknown error (1)
> > > notify/pre-stop -> RA returns ok (0)
> > > stop            -> RA returns ok (0)
> > > start           -> RA returns ok (0)
> > >
> > > The problem I have is that between "pcs cluster start test3" and
> > > "monitor", it seems that the XML returned by "pcs status xml" says test3
> > > is ready (the XML extract above is what I get at that moment). Once
> > > "monitor" occurs, the returned XML shows test3 to be offline, and not
> > > until the start is finished do I once again have test3 shown as ready.
> > >
> > > Am I getting anything wrong? Is there a simpler or better way to check
> > > if test3 is fully functional again, i.e. the OCF start was successful?
> > >
> > > Thanks
> > >
> > > Ludovic
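For reference, a rough sketch of the per-second polling on "pcs status xml" described in the quoted question (resource and node names are the ones from this thread; the attribute names follow the crm_mon XML shown above):

    # Sketch: poll "pcs status xml" once a second until pgsqld is reported as
    # a Slave on the given node. As discussed above, this can report success
    # before the first monitor has actually run on the rejoining node.
    import subprocess
    import time
    import xml.etree.ElementTree as ET

    def wait_for_slave(node, rsc_id="pgsqld", timeout=60):
        deadline = time.time() + timeout
        while time.time() < deadline:
            status = ET.fromstring(subprocess.check_output(["pcs", "status", "xml"]))
            for rsc in status.iter("resource"):
                if (rsc.get("id") == rsc_id
                        and rsc.get("role") == "Slave"
                        and rsc.get("active") == "true"
                        and any(n.get("name") == node for n in rsc.findall("node"))):
                    return True
            time.sleep(1)
        return False

Combining this with the node_state check sketched earlier, or with an alert fired on the start operation, might be more robust than a fixed sleep.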
--
Ludovic Vaugeois-Pepin

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org