Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-15 Thread Ludovic Vaugeois-Pepin
I will look into adding alerts, thanks for the info.

For now I introduced a 5-second sleep after "pcs cluster start ...". It
seems to be enough time for the monitor operation to run.
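
For illustration, a fixed sleep could also be replaced by a bounded poll of
"crm_mon -X" along the lines of the sketch below. This is only a sketch: the
resource id "pgsqld", node "test3" and expected role "Slave" are hard-coded
here, and, as discussed further down the thread, the reported role can briefly
look correct before the first monitor has actually run, so this narrows but
does not fully close that window.

#!/usr/bin/env python3
"""Sketch: wait for a resource role instead of sleeping a fixed 5 seconds.

Assumptions to adapt: resource id "pgsqld", node "test3", expected role
"Slave". The reported role may still be briefly optimistic before the
first monitor runs, as described later in this thread.
"""
import subprocess
import time
import xml.etree.ElementTree as ET


def resource_role_on(node, rsc_id):
    """Return the role crm_mon reports for rsc_id on the given node, or None."""
    xml_out = subprocess.check_output(["crm_mon", "-X"])
    root = ET.fromstring(xml_out)
    for rsc in root.iter("resource"):
        if rsc.get("id") != rsc_id:
            continue
        for n in rsc.iter("node"):
            if n.get("name") == node:
                return rsc.get("role")
    return None


def wait_for_role(node, rsc_id, role, timeout=60):
    """Poll once per second until the role matches or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if resource_role_on(node, rsc_id) == role:
            return True
        time.sleep(1)
    return False


if __name__ == "__main__":
    subprocess.check_call(["pcs", "cluster", "start", "test3"])
    if not wait_for_role("test3", "pgsqld", "Slave"):
        raise SystemExit("pgsqld did not come up as Slave on test3 in time")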

On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot  wrote:

> Another possibility you might want to look into is alerts. Pacemaker can
> call a script of your choosing whenever a resource is started or
> stopped. See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296
>
> for the concepts, and the pcs man page for the "pcs alert" interface.
>
> On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> > I checked the node_state of the node that is killed and brought back
> > (test3). in_ccm == true and crmd == online for a second or two between
> > "pcs cluster start test3" "monitor":
> >
> > <node_state ... uname="test3" in_ccm="true" crmd="online"
> > crm-debug-origin="peer_update_callback" join="member" expected="member">
> >
> >
> >
> > On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin wrote:
> >
> > Yes, I haven't been using the "nodes" element in the XML, only the
> > "resources" element. I couldn't find "node_state" elements or
> > attributes in the XML, so after some searching I found that it is in
> > the CIB, which can be retrieved with "pcs cluster cib foo.xml". I will
> > start exploring this as an alternative to crm_mon/"pcs status".
> >
> >
> > However, I still find the behavior confusing, so below I try to
> > explain in more detail what I see:
> >
> >
> > Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> > shutdown a minute ago):
> >
> > crm_mon -1:
> >
> > Stack: corosync
> > Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> > partition with quorum
> > Last updated: Fri May 12 10:45:36 2017  Last change: Fri
> > May 12 09:18:13 2017 by root via crm_attribute on test1
> >
> > 3 nodes and 4 resources configured
> >
> > Online: [ test1 test2 ]
> > OFFLINE: [ test3 ]
> >
> > Active resources:
> >
> >  Master/Slave Set: pgsql-ha [pgsqld]
> >  Masters: [ test1 ]
> >  Slaves: [ test2 ]
> >  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started
> > test1
> >
> >
> > crm_mon -X:
> >
> > <resources>
> >     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
> >         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node name="test1" ... />
> >         </resource>
> >         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node name="test2" ... />
> >         </resource>
> >         <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0" />
> >     </clone>
> >     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >         <node name="test1" ... />
> >     </resource>
> > </resources>
> >
> >
> >
> > At 10:45:39.440, after "pcs cluster start test3" and before the first
> > "monitor" on test3 (this is where I can't tell that
> > resources on test3 are down):
> >
> > crm_mon -1:
> >
> > Stack: corosync
> > Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> > partition with quorum
> > Last updated: Fri May 12 10:45:39 2017  Last change: Fri
> > May 12 10:45:39 2017 by root via crm_attribute on test1
> >
> > 3 nodes and 4 resources configured
> >
> > Online: [ test1 test2 test3 ]
> >
> > Active resources:
> >
> >  Master/Slave Set: pgsql-ha [pgsqld]
> >  Masters: [ test1 ]
> >  Slaves: [ test2 test3 ]
> >  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started
> > test1
> >
> >
> > crm_mon -X:
> >
> > <resources>
> >     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
> >         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node name="test1" ... />
> >         </resource>
> >         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node name="test2" ... />
> >         </resource>
> >         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node name="test3" ... />
> >         </resource>
> >     </clone>
> >     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false"

Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Ken Gaillot
Another possibility you might want to look into is alerts. Pacemaker can
call a script of your choosing whenever a resource is started or
stopped. See:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296

for the concepts, and the pcs man page for the "pcs alert" interface.
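
As an illustration of the mechanism: an alert agent can be any executable that
reads the CRM_alert_* environment variables Pacemaker sets when it fires an
alert. A minimal sketch in Python might look like the following (the log path
is an arbitrary example); the agent would then be registered with "pcs alert
create" and given a recipient with "pcs alert recipient add", as described in
that man page.

#!/usr/bin/env python3
"""Minimal Pacemaker alert agent sketch.

Pacemaker runs the agent with CRM_alert_* variables in the environment.
The log file path below is an arbitrary example; adjust as needed.
"""
import os
from datetime import datetime

LOG = "/var/log/pacemaker_alerts.log"  # example path


def main():
    env = os.environ
    kind = env.get("CRM_alert_kind", "")  # "node", "fencing" or "resource"
    if kind != "resource":
        return
    line = "{} node={} rsc={} task={} rc={} desc={}\n".format(
        datetime.now().isoformat(),
        env.get("CRM_alert_node", ""),
        env.get("CRM_alert_rsc", ""),
        env.get("CRM_alert_task", ""),  # e.g. start, stop, monitor
        env.get("CRM_alert_rc", ""),    # agent return code
        env.get("CRM_alert_desc", ""),
    )
    with open(LOG, "a") as f:
        f.write(line)


if __name__ == "__main__":
    main()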

On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> I checked the node_state of the node that is killed and brought back
> (test3). in_ccm == true and crmd == online for a second or two between
> "pcs cluster start test3" "monitor":
> 
> <node_state ... uname="test3" in_ccm="true" crmd="online"
> crm-debug-origin="peer_update_callback" join="member" expected="member">
> 
> 
> 
> On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin wrote:
> 
> Yes, I haven't been using the "nodes" element in the XML, only the
> "resources" element. I couldn't find "node_state" elements or
> attributes in the XML, so after some searching I found that it is in
> the CIB, which can be retrieved with "pcs cluster cib foo.xml". I will
> start exploring this as an alternative to crm_mon/"pcs status".
> 
> 
> However, I still find the behavior confusing, so below I try to
> explain in more detail what I see:
> 
> 
> Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> shutdown a minute ago):
> 
> crm_mon -1:
> 
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> partition with quorum
> Last updated: Fri May 12 10:45:36 2017  Last change: Fri
> May 12 09:18:13 2017 by root via crm_attribute on test1
> 
> 3 nodes and 4 resources configured
> 
> Online: [ test1 test2 ]
> OFFLINE: [ test3 ]
> 
> Active resources:
> 
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started
> test1
> 
>  
> crm_mon -X:
> 
> <resources>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test2" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0" />
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>         <node name="test1" ... />
>     </resource>
> </resources>
> 
> 
> 
> At 10:45:39.440, after "pcs cluster start test3" and before the first
> "monitor" on test3 (this is where I can't tell that
> resources on test3 are down):
> 
> crm_mon -1:
> 
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> partition with quorum
> Last updated: Fri May 12 10:45:39 2017  Last change: Fri
> May 12 10:45:39 2017 by root via crm_attribute on test1
> 
> 3 nodes and 4 resources configured
> 
> Online: [ test1 test2 test3 ]
> 
> Active resources:
> 
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 test3 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started
> test1
> 
> 
> crm_mon -X:
> 
> <resources>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test2" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test3" ... />
>         </resource>
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>         <node name="test1" ... />
>     </resource>
> </resources>
> 
> 
> 
> At 10:45:41.606, after the first "monitor" on test3 (I can now tell that
> the resources on test3 are not ready):
> 
> crm_mon -1:
> 
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> partition with quorum
> Last updated: Fri May 12 10:45:41 2017  Last change: Fri
> May 12 10:45:39 2017 by root via crm_attribute on test1
> 
> 

Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Ludovic Vaugeois-Pepin
Hi Jehan-Guillaume,

I would be glad to discuss my motivations and findings with you, by mail or
in person, even.

Let's just say that I originally wanted to create something that would
allow deploying a PG cluster in a matter of minutes (yes, using Python). From
there I tried to understand how PAF works, and at some point I wanted to
start changing it, but not being too good with Perl, I chose to translate
it. This kinda became a pet project.

Ludovic



On Fri, May 12, 2017 at 2:01 PM, Jehan-Guillaume de Rorthais <
j...@dalibo.com> wrote:

> Hi Ludovic,
>
> On Thu, 11 May 2017 22:00:12 +0200
> Ludovic Vaugeois-Pepin  wrote:
>
> > I translated a PostgreSQL multi-state RA (
> > https://github.com/dalibo/PAF)
> > in Python (https://github.com/ulodciv/deploy_cluster), and I have been
> > editing it heavily.
>
> Could you please provide feedback to the upstream project (or here :))?
>
> * what did you improve in PAF?
> * what did you change in PAF?
> * why did you translate PAF to Python? Any advantages?
>
> A lot of time and research has been dedicated to this project. PAF is a
> pure
> open source project. We would love some feedback and contributors to keep
> improving it. Do not hesitate to open issues on PAF project if you need to
> discuss improvements.
>
> Regards,
> --
> Jehan-Guillaume de Rorthais
> Dalibo
>



-- 
Ludovic Vaugeois-Pepin
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Jehan-Guillaume de Rorthais
Hi Ludovic,

On Thu, 11 May 2017 22:00:12 +0200
Ludovic Vaugeois-Pepin  wrote:

> I translated a PostgreSQL multi-state RA (https://github.com/dalibo/PAF)
> in Python (https://github.com/ulodciv/deploy_cluster), and I have been
> editing it heavily.

Could you please provide feedback to the upstream project (or here :))?

* what did you improve in PAF?
* what did you change in PAF?
* why did you translate PAF to Python? Any advantages?

A lot of time and research has been dedicated to this project. PAF is a pure
open source project. We would love some feedback and contributors to keep
improving it. Do not hesitate to open issues on PAF project if you need to
discuss improvements.

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Ludovic Vaugeois-Pepin
I checked the node_state of the node that is killed and brought back
(test3). in_ccm == true and crmd == online for a second or two between "pcs
cluster start test3" and "monitor":

<node_state ... uname="test3" in_ccm="true" crmd="online"
crm-debug-origin="peer_update_callback" join="member" expected="member">

On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin <
ludovi...@gmail.com> wrote:

> Yes, I haven't been using the "nodes" element in the XML, only the
> "resources" element. I couldn't find "node_state" elements or attributes
> in the XML, so after some searching I found that it is in the CIB, which
> can be retrieved with "pcs cluster cib foo.xml". I will start exploring this
> as an alternative to crm_mon/"pcs status".
>
>
> However, I still find the behavior confusing, so below I try to explain in
> more detail what I see:
>
>
> Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
> shutdown a minute ago):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
> Last updated: Fri May 12 10:45:36 2017  Last change: Fri May
> 12 09:18:13 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 ]
> OFFLINE: [ test3 ]
>
> Active resources:
>
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1
>
>
> crm_mon -X:
>
> <resources>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test2" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0" />
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>         <node name="test1" ... />
>     </resource>
> </resources>
>
>
>
> At 10:45:39.440, after "pcs cluster start test3" and before the first
> "monitor" on test3 (this is where I can't tell that resources on test3 are
> down):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
> Last updated: Fri May 12 10:45:39 2017  Last change: Fri May
> 12 10:45:39 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 test3 ]
>
> Active resources:
>
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 test3 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1
>
>
> crm_mon -X:
>
> <resources>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test2" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test3" ... />
>         </resource>
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>         <node name="test1" ... />
>     </resource>
> </resources>
>
>
>
> At 10:45:41.606, after the first "monitor" on test3 (I can now tell that
> the resources on test3 are not ready):
>
> crm_mon -1:
>
> Stack: corosync
> Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
> quorum
> Last updated: Fri May 12 10:45:41 2017  Last change: Fri May
> 12 10:45:39 2017 by root via crm_attribute on test1
>
> 3 nodes and 4 resources configured
>
> Online: [ test1 test2 test3 ]
>
> Active resources:
>
>  Master/Slave Set: pgsql-ha [pgsqld]
>  Masters: [ test1 ]
>  Slaves: [ test2 ]
>  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1
>
>
> crm_mon -X:
>
> <resources>
>     <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
>         <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test1" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node name="test2" ... />
>         </resource>
>         <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0" />
>     </clone>
>     <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>         <node name="test1" ... />
>     </resource>
> </resources>
>
> On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot  wrote:
>
>> On 05/11/2017 03:00 PM, Ludovic 

Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-12 Thread Ludovic Vaugeois-Pepin
Yes, I haven't been using the "nodes" element in the XML, only the
"resources" element. I couldn't find "node_state" elements or attributes in
the XML, so after some searching I found that it is in the CIB, which can be
retrieved with "pcs cluster cib foo.xml". I will start exploring this as an
alternative to crm_mon/"pcs status".
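
For example, a first pass at that exploration could be a small sketch like the
one below, which prints the node_state attributes discussed in this thread
from a CIB saved with "pcs cluster cib foo.xml" (the attribute names are the
ones that appear in the dump; adjust as needed):

#!/usr/bin/env python3
"""Sketch: list node_state entries from a CIB dump made with
"pcs cluster cib foo.xml"."""
import xml.etree.ElementTree as ET

cib = ET.parse("foo.xml").getroot()
for ns in cib.iter("node_state"):
    print(
        ns.get("uname"),
        "in_ccm=" + str(ns.get("in_ccm")),
        "crmd=" + str(ns.get("crmd")),
        "join=" + str(ns.get("join")),
        "expected=" + str(ns.get("expected")),
    )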


However, I still find the behavior confusing, so below I try to explain in
more detail what I see:


Before "pcs cluster start test3" at 10:45:36.362 (test3 has been HW
shutdown a minute ago):

crm_mon -1:

Stack: corosync
Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
quorum
Last updated: Fri May 12 10:45:36 2017  Last change: Fri May 12
09:18:13 2017 by root via crm_attribute on test1

3 nodes and 4 resources configured

Online: [ test1 test2 ]
OFFLINE: [ test3 ]

Active resources:

 Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ test1 ]
 Slaves: [ test2 ]
 pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1


crm_mon -X:

<resources>
    <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
        <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node name="test1" ... />
        </resource>
        <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node name="test2" ... />
        </resource>
        <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0" />
    </clone>
    <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
        <node name="test1" ... />
    </resource>
</resources>

At 10:45:39.440, after "pcs cluster start test3" and before the first "monitor"
on test3 (this is where I can't tell that resources on test3 are down):

crm_mon -1:

Stack: corosync
Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
quorum
Last updated: Fri May 12 10:45:39 2017  Last change: Fri May 12
10:45:39 2017 by root via crm_attribute on test1

3 nodes and 4 resources configured

Online: [ test1 test2 test3 ]

Active resources:

 Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ test1 ]
 Slaves: [ test2 test3 ]
 pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1


crm_mon -X:

<resources>
    <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
        <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node name="test1" ... />
        </resource>
        <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node name="test2" ... />
        </resource>
        <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node name="test3" ... />
        </resource>
    </clone>
    <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
        <node name="test1" ... />
    </resource>
</resources>

At 10:45:41.606, after the first "monitor" on test3 (I can now tell that the
resources on test3 are not ready):

crm_mon -1:

Stack: corosync
Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
quorum
Last updated: Fri May 12 10:45:41 2017  Last change: Fri May 12
10:45:39 2017 by root via crm_attribute on test1

3 nodes and 4 resources configured

Online: [ test1 test2 test3 ]

Active resources:

 Master/Slave Set: pgsql-ha [pgsqld]
 Masters: [ test1 ]
 Slaves: [ test2 ]
 pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started test1


crm_mon -X:

<resources>
    <clone id="pgsql-ha" ... managed="true" failed="false" failure_ignored="false" >
        <resource id="pgsqld" ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node name="test1" ... />
        </resource>
        <resource id="pgsqld" ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node name="test2" ... />
        </resource>
        <resource id="pgsqld" ... role="Stopped" active="false" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0" />
    </clone>
    <resource id="pgsql-master-ip" resource_agent="ocf::heartbeat:IPaddr2" role="Started" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
        <node name="test1" ... />
    </resource>
</resources>

On Fri, May 12, 2017 at 12:45 AM, Ken Gaillot  wrote:

> On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> > Hi
> > I translated a PostgreSQL multi-state RA
> > (https://github.com/dalibo/PAF) in Python
> > (https://github.com/ulodciv/deploy_cluster), and I have been editing it
> > heavily.
> >
> > In parallel I am writing unit tests and functional tests.
> >
> > I am having an issue with a functional test that abruptly powers off a
> > slave named says "host3" (hot standby PG instance). Later on I start the
> > slave back. Once it is started, I run "pcs cluster start host3". And
> > this is where I start having a problem.
> >
> > I check every second the output of "pcs status xml" until host3 is said
> > to be ready as a slave again. In the following I assume that test3 is
> > ready as a slave:
> >
> > <nodes>
> >     <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="false" resources_running="2" type="member" />
> >     <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="true" resources_running="1" type="member" />
> >     <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="false" resources_running="1" type="member" />
> > </nodes>
>
> The <nodes> section says nothing about the current state of the nodes.
> Look at the <node_state> entries for that. in_ccm means the cluster
> stack level, and crmd means the pacemaker level -- both need to be up.
>
> > <resources>
> >     <clone ... managed="true" failed="false" failure_ignored="false" >
> >         <resource ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node ... />
> >         </resource>
> >         <resource ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node ... />
> >         </resource>
> >         <resource ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
> >             <node ... />
> >         </resource>
> >     </clone>
> > </resources>

Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-11 Thread Ken Gaillot
On 05/11/2017 03:00 PM, Ludovic Vaugeois-Pepin wrote:
> Hi
> I translated a PostgreSQL multi-state RA
> (https://github.com/dalibo/PAF) in Python
> (https://github.com/ulodciv/deploy_cluster), and I have been editing it
> heavily.
> 
> In parallel I am writing unit tests and functional tests.
> 
> I am having an issue with a functional test that abruptly powers off a
> slave named says "host3" (hot standby PG instance). Later on I start the
> slave back. Once it is started, I run "pcs cluster start host3". And
> this is where I start having a problem.
> 
> I check every second the output of "pcs status xml" until host3 is said
> to be ready as a slave again. In the following I assume that test3 is
> ready as a slave:
> 
> <nodes>
>     <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="false" resources_running="2" type="member" />
>     <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="true" resources_running="1" type="member" />
>     <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="false" resources_running="1" type="member" />
> </nodes>

The <nodes> section says nothing about the current state of the nodes.
Look at the <node_state> entries for that. in_ccm means the cluster
stack level, and crmd means the pacemaker level -- both need to be up.
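
To illustrate, a minimal check along those lines could look like the sketch
below. It is only a sketch, assuming the CIB is read with "pcs cluster cib"
(which prints it when no filename is given; "pcs cluster cib foo.xml" as used
elsewhere in the thread works the same way) and that a node counts as up only
when in_ccm="true" and crmd="online":

#!/usr/bin/env python3
"""Sketch: decide whether a node is up from its <node_state> entry."""
import subprocess
import xml.etree.ElementTree as ET


def node_is_up(uname):
    # Read the live CIB and look for the node's <node_state> element.
    cib = ET.fromstring(subprocess.check_output(["pcs", "cluster", "cib"]))
    for ns in cib.iter("node_state"):
        if ns.get("uname") == uname:
            return ns.get("in_ccm") == "true" and ns.get("crmd") == "online"
    return False


if __name__ == "__main__":
    print("test3 up:", node_is_up("test3"))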

> <resources>
>     <clone ... managed="true" failed="false" failure_ignored="false" >
>         <resource ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node ... />
>         </resource>
>         <resource ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node ... />
>         </resource>
>         <resource ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
>             <node ... />
>         </resource>
>     </clone>
> </resources>
> By ready to go I mean that upon running "pcs cluster start test3", the
> following occurs before test3 appears ready in the XML:
> 
> pcs cluster start test3
> monitor -> RA returns unknown error (1)
> notify/pre-stop -> RA returns ok (0)
> stop -> RA returns ok (0)
> start -> RA returns ok (0)
> 
> The problem I have is that between "pcs cluster start test3" and
> "monitor", it seems that the XML returned by "pcs status xml" says test3
> is ready (the XML extract above is what I get at that moment). Once
> "monitor" occurs, the returned XML shows test3 to be offline, and not
> until the start is finished do I once again have test3 shown as ready.
> 
> Am I getting anything wrong? Is there a simpler or better way to check
> if test3 is fully functional again, i.e. that the OCF start was successful?
> 
> Thanks
> 
> Ludovic

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-11 Thread Ludovic Vaugeois-Pepin
Hi
I translated a PostgreSQL multi-state RA (https://github.com/dalibo/PAF)
in Python (https://github.com/ulodciv/deploy_cluster), and I have been
editing it heavily.

In parallel I am writing unit tests and functional tests.

I am having an issue with a functional test that abruptly powers off a
slave named says "host3" (hot standby PG instance). Later on I start the
slave back. Once it is started, I run "pcs cluster start host3". And this
is where I start having a problem.

I check every second the output of "pcs status xml" until host3 is said to
be ready as a slave again. In the following I assume that test3 is ready as
a slave:

<nodes>
    <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="false" resources_running="2" type="member" />
    <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="true" resources_running="1" type="member" />
    <node ... standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="true" is_dc="false" resources_running="1" type="member" />
</nodes>

<resources>
    <clone ... managed="true" failed="false" failure_ignored="false" >
        <resource ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node ... />
        </resource>
        <resource ... role="Master" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node ... />
        </resource>
        <resource ... role="Slave" active="true" orphaned="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1" >
            <node ... />
        </resource>
    </clone>
</resources>

By ready to go I mean that upon running "pcs cluster start test3", the
following occurs before test3 appears ready in the XML:

pcs cluster start test3
monitor -> RA returns unknown error (1)
notify/pre-stop -> RA returns ok (0)
stop -> RA returns ok (0)
start -> RA returns ok (0)

The problem I have is that between "pcs cluster start test3" and "monitor",
it seems that the XML returned by "pcs status xml" says test3 is ready (the
XML extract above is what I get at that moment). Once "monitor" occurs, the
returned XML shows test3 to be offline, and not until the start is finished
do I once again have test3 shown as ready.

Am I getting anything wrong? Is there a simpler or better way to check if
test3 is fully functional again, i.e. that the OCF start was successful?
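
One possible way to tie "fully functional again" to the start result, sketched
here under assumptions only (resource id "pgsqld", node "test3", CIB read via
"pcs cluster cib"): look at the per-node operation history kept in the CIB
status section (lrm_rsc_op entries) rather than at the summarized resource
roles, and require a start operation with rc-code 0 for that node:

#!/usr/bin/env python3
"""Sketch: check the CIB operation history for a successful start.

Assumptions: resource id "pgsqld", node "test3", CIB read with
"pcs cluster cib" printing to stdout. The status section keeps per-node
<lrm_resource>/<lrm_rsc_op> entries with the operation name and rc-code.
"""
import subprocess
import xml.etree.ElementTree as ET


def start_succeeded(uname, rsc_id):
    cib = ET.fromstring(subprocess.check_output(["pcs", "cluster", "cib"]))
    for ns in cib.iter("node_state"):
        if ns.get("uname") != uname:
            continue
        for lrm_rsc in ns.iter("lrm_resource"):
            if lrm_rsc.get("id") != rsc_id:
                continue
            for op in lrm_rsc.iter("lrm_rsc_op"):
                if op.get("operation") == "start" and op.get("rc-code") == "0":
                    return True
    return False


if __name__ == "__main__":
    print("pgsqld started OK on test3:", start_succeeded("test3", "pgsqld"))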

Thanks

Ludovic
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org