Re: [ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

2017-05-15 Thread Ken Gaillot
On 05/15/2017 06:59 AM, Klaus Wenninger wrote:
> On 05/15/2017 12:25 PM, Anu Pillai wrote:
>> Hi Klaus,
>>
>> Please find attached cib.xml as well as corosync.conf.

Maybe you're only setting this while testing, but having
stonith-enabled=false and no-quorum-policy=ignore is highly dangerous in
any kind of network split.

FYI, default-action-timeout is deprecated in favor of setting a timeout
in op_defaults, but it doesn't hurt anything.
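
For example, the op_defaults equivalent can be set with pcs (a sketch; the
120s value is only an illustration, not taken from the attached config):

    pcs resource op defaults timeout=120s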

> Why wouldn't you keep placement-strategy at its default
> to keep things simple? You aren't using any load-balancing
> anyway, as far as I understood it.

It looks like the intent is to use placement-strategy to limit each node
to 1 resource. The configuration looks good for that.

> I haven't used resource-stickiness=INF, so I have no idea what strange
> behavior that triggers. Try to have it just higher than what
> the other scores might sum up to.

Either way would be fine. Using INFINITY ensures that no other
combination of scores will override it.
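
For reference, a utilization-based setup that caps each node at one resource
would look something like this (a sketch only; the attribute name "capacity"
and the node/resource names are placeholders, and the attached cib.xml may
already carry equivalents):

    pcs property set placement-strategy=utilization
    pcs node utilization node1 capacity=1
    pcs resource utilization res_A capacity=1
    pcs resource defaults resource-stickiness=INFINITY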

> I might have overlooked something in your scores, but otherwise
> there is nothing obvious to me.
> 
> Regards,
> Klaus

I don't see anything obvious either. If you have logs around the time of
the incident, that might help.

>> Regards,
>> Aswathi
>>
>> On Mon, May 15, 2017 at 2:46 PM, Klaus Wenninger wrote:
>>
>> On 05/15/2017 09:36 AM, Anu Pillai wrote:
>> > Hi,
>> >
>> > We are running a Pacemaker cluster for managing our resources. We have 6
>> > systems running 5 resources, and one is acting as standby. We have a
>> > restriction that only one resource can run on one node. But our
>> > observation is that whenever we add or delete a resource from the cluster,
>> > all the remaining resources in the cluster are stopped and started back.
>> >
>> > Can you please guide us on whether this is normal behavior or whether we
>> > are missing any configuration that is leading to this issue.
>>
>> It should definitely be possible to prevent this behavior.
>> If you share your config with us we might be able to
>> track that down.
>>
>> Regards,
>> Klaus
>>
>> >
>> > Regards
>> > Aswathi

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemaker remote node offline after reboot

2017-05-15 Thread Ignazio Cassano
Hello, the following is the /etc/hosts for all my controllers:

127.0.0.1 localhost
10.102.184.70 tst-controller-01
10.102.184.71 tst-controller-02
10.102.184.72 tst-controller-03
10.102.119.223 iapi-tst-controller-01
10.102.119.224 iapi-tst-controller-02
10.102.119.225 iapi-tst-controller-03
10.102.184.90 compute-0 computenode0
10.102.184.91 compute-1 computenode1
10.102.184.102 tst-rabbit01
10.102.184.103 tst-rabbit02
10.102.184.104 tst-rabbit03
10.102.184.109 tst-swift01
10.102.184.110 tst-swift02
10.102.119.140 tst-mongo-primary
10.102.119.141 tst-mongo-repl1
10.102.119.142 tst-mongo-repl2
10.102.119.143 tst-mongo-repl3
10.102.119.144 tst-mongo-arbiter
10.102.184.96 tst-open-graphite


2017-05-15 14:02 GMT+02:00 Klaus Wenninger:

> On 05/15/2017 01:16 PM, Ignazio Cassano wrote:
> > Hello, cluster-recheck-interval=1min.
> >
> >
> > When I use the syntax:
> > "  pcs resource create computenode1 ocf:pacemaker:remote"
> >
> > the name is resolved in /etc/hosts
>
> Just wanted to know if you have it in /etc/hosts ...
>
> >
> > When I use the syntax:
> >
> >  pcs resource create computenode1 remote  server=10.102.184.91
>
> You can still use the name instead of the resolved IP for server ...
>
> Regards,
> Klaus
>
> >
> > I cannot avoid resolving it.
> >
> > Regards
> > Ignazio
>
>
>


Re: [ClusterLabs] pacemaker remote node offline after reboot

2017-05-15 Thread Klaus Wenninger
On 05/15/2017 01:16 PM, Ignazio Cassano wrote:
> Hello, cluster-recheck-interval=1min.
>
>
> When I use the syntax:
> "  pcs resource create computenode1 ocf:pacemaker:remote"
>
> the name is resolved in /etc/hosts

Just wanted to know if you have it in /etc/hosts ...

>
> When I use the syntax:
>
>  pcs resource create computenode1 remote  server=10.102.184.91

You can still use the name instead of the resolved IP for server ...
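
For example (a sketch only, assuming "computenode1" resolves consistently on
all cluster nodes; adjust the intervals to taste):

    pcs resource create computenode1 ocf:pacemaker:remote server=computenode1 \
        reconnect_interval=60 op monitor interval=20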

Regards,
Klaus

>
> I cannot avoid resolving it.
>
> Regards
> Ignazio





Re: [ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

2017-05-15 Thread Klaus Wenninger
On 05/15/2017 12:25 PM, Anu Pillai wrote:
> Hi Klaus,
>
> Please find attached cib.xml as well as corosync.conf.

Why wouldn't you keep placement-strategy at its default
to keep things simple? You aren't using any load-balancing
anyway, as far as I understood it.
I haven't used resource-stickiness=INF, so I have no idea what strange
behavior that triggers. Try to have it just higher than what
the other scores might sum up to.
I might have overlooked something in your scores, but otherwise
there is nothing obvious to me.

Regards,
Klaus
 
>
>
> Regards,
> Aswathi
>
> On Mon, May 15, 2017 at 2:46 PM, Klaus Wenninger wrote:
>
> On 05/15/2017 09:36 AM, Anu Pillai wrote:
> > Hi,
> >
> > We are running a Pacemaker cluster for managing our resources. We have 6
> > systems running 5 resources, and one is acting as standby. We have a
> > restriction that only one resource can run on one node. But our
> > observation is that whenever we add or delete a resource from the cluster,
> > all the remaining resources in the cluster are stopped and started back.
> >
> > Can you please guide us on whether this is normal behavior or whether we
> > are missing any configuration that is leading to this issue.
>
> It should definitely be possible to prevent this behavior.
> If you share your config with us we might be able to
> track that down.
>
> Regards,
> Klaus
>
> >
> > Regards
> > Aswathi




Re: [ClusterLabs] Antw: Re: pacemaker remote node offline after reboot

2017-05-15 Thread Klaus Wenninger
On 05/15/2017 01:46 PM, Ulrich Windl wrote:
> >>> Klaus Wenninger wrote on 15.05.2017 at 09:12 in message:
>
> [...]
>> Did you set the cluster-recheck-interval to a reasonably short value
>> (needed for connect failures to time out)?
> How short is "reasonable"? ;-)

In that case it basically boils down to the question:
did you wait as long as the recheck-interval takes to expire? ;-)

>
> We set the interval somewhat longer, because it gives the administrator
> the chance to clean up after mistakes before the cluster does ;-)
> Meaning: when the admin does it, it mostly works without a reboot; if the
> cluster does it, it will fence the node most of the time...
>
> Regards,
> Ulrich
>
>


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenn...@redhat.com   




[ClusterLabs] Antw: Re: pacemaker remote node offline after reboot

2017-05-15 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 15.05.2017 at 09:12 in message:

[...]
> Did you set the cluster-recheck-interval to a reasonably short value
> (needed for connect failures to time out)?

How short is "reasonable"? ;-)

We set the interval somewhat longer, because it gives the administrator the
chance to clean up after mistakes before the cluster does ;-)
Meaning: when the admin does it, it mostly works without a reboot; if the
cluster does it, it will fence the node most of the time...

Regards,
Ulrich





Re: [ClusterLabs] pacemaker remote node offline after reboot

2017-05-15 Thread Ignazio Cassano
Hello, cluster-recheck-interval=1min.


When I use the syntax:
"  pcs resource create computenode1 ocf:pacemaker:remote"

the name is resolved in /etc/hosts

When I use the syntax:

 pcs resource create computenode1 remote  server=10.102.184.91

I cannot avoid resolving it.

Regards
Ignazio


Re: [ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

2017-05-15 Thread Anu Pillai
Hi Klaus,

Please find attached cib.xml as well as corosync.conf.


Regards,
Aswathi

On Mon, May 15, 2017 at 2:46 PM, Klaus Wenninger 
wrote:

> On 05/15/2017 09:36 AM, Anu Pillai wrote:
> > Hi,
> >
> > We are running a Pacemaker cluster for managing our resources. We have 6
> > systems running 5 resources, and one is acting as standby. We have a
> > restriction that only one resource can run on one node. But our
> > observation is that whenever we add or delete a resource from the cluster,
> > all the remaining resources in the cluster are stopped and started back.
> >
> > Can you please guide us on whether this is normal behavior or whether we
> > are missing any configuration that is leading to this issue.
>
> It should definitely be possible to prevent this behavior.
> If you share your config with us we might be able to
> track that down.
>
> Regards,
> Klaus
>
> >
> > Regards
> > Aswathi

[Attachment: cib.xml (XML content stripped by the archive)]

corosync.conf
Description: Binary data


Re: [ClusterLabs] How to check if a resource on a cluster node is really back on after a crash

2017-05-15 Thread Ludovic Vaugeois-Pepin
I will look into adding alerts, thanks for the info.

For now I introduced a 5-second sleep after "pcs cluster start ...". It
seems to be enough for the monitor to run.
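
For reference, a minimal alert setup with pcs might look like this (a sketch
only; the sample agent path and the log file are assumptions, and older pcs
versions take the recipient value positionally rather than as value=):

    pcs alert create path=/usr/share/pacemaker/alerts/alert_file.sh.sample id=log_alert
    pcs alert recipient add log_alert value=/var/log/pacemaker_alerts.log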

On Fri, May 12, 2017 at 9:22 PM, Ken Gaillot  wrote:

> Another possibility you might want to look into is alerts. Pacemaker can
> call a script of your choosing whenever a resource is started or
> stopped. See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm139683940283296
>
> for the concepts, and the pcs man page for the "pcs alert" interface.
>
> On 05/12/2017 06:17 AM, Ludovic Vaugeois-Pepin wrote:
> > I checked the node_state of the node that is killed and brought back
> > (test3): in_ccm == true and crmd == online for a second or two between
> > "pcs cluster start test3" and the first "monitor":
> > <node_state uname="test3" in_ccm="true" crmd="online" crm-debug-origin="peer_update_callback" join="member" expected="member">
> >  > crm-debug-origin="peer_update_callback" join="member" expected="member">
> >
> >
> >
> > On Fri, May 12, 2017 at 11:27 AM, Ludovic Vaugeois-Pepin
> > <ludovi...@gmail.com> wrote:
> >
> > Yes, I haven't been using the "nodes" element in the XML, only the
> > "resources" element. I couldn't find "node_state" elements or
> > attributes in the XML, so after some searching I found that they are in
> > the CIB, which can be obtained with "pcs cluster cib foo.xml". I will
> > start exploring this as an alternative to crm_mon/"pcs status".
> >
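
A quick way to pull just that node_state entry from the live CIB (a sketch;
it assumes xmllint is available):

    pcs cluster cib | xmllint --xpath '//node_state[@uname="test3"]' -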
> >
> > However I still find what happens to be confusing, so below I try to
> > better explain what I see:
> >
> >
> > Before "pcs cluster start test3" at 10:45:36.362 (test3 was HW shut
> > down a minute earlier):
> >
> > crm_mon -1:
> >
> > Stack: corosync
> > Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> > partition with quorum
> > Last updated: Fri May 12 10:45:36 2017  Last change: Fri
> > May 12 09:18:13 2017 by root via crm_attribute on test1
> >
> > 3 nodes and 4 resources configured
> >
> > Online: [ test1 test2 ]
> > OFFLINE: [ test3 ]
> >
> > Active resources:
> >
> >  Master/Slave Set: pgsql-ha [pgsqld]
> >  Masters: [ test1 ]
> >  Slaves: [ test2 ]
> >  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started
> > test1
> >
> >
> > crm_mon -X:
> >
> > [XML output mangled by the archive. It showed the pgsql-ha clone set
> > (managed, not failed) with three pgsqld instances: one role="Master"
> > with nodes_running_on="1", one role="Slave" with nodes_running_on="1",
> > and one role="Stopped" with nodes_running_on="0", plus pgsql-master-ip
> > (ocf::heartbeat:IPaddr2) role="Started" with nodes_running_on="1".]
> >
> >
> >
> > At 10:45:39.440, after "pcs cluster start test3", before the first
> > "monitor" on test3 (this is where I can't seem to tell that
> > resources on test3 are down):
> >
> > crm_mon -1:
> >
> > Stack: corosync
> > Current DC: test1 (version 1.1.15-11.el7_3.4-e174ec8) -
> > partition with quorum
> > Last updated: Fri May 12 10:45:39 2017  Last change: Fri
> > May 12 10:45:39 2017 by root via crm_attribute on test1
> >
> > 3 nodes and 4 resources configured
> >
> > Online: [ test1 test2 test3 ]
> >
> > Active resources:
> >
> >  Master/Slave Set: pgsql-ha [pgsqld]
> >  Masters: [ test1 ]
> >  Slaves: [ test2 test3 ]
> >  pgsql-master-ip(ocf::heartbeat:IPaddr2):   Started
> > test1
> >
> >
> > crm_mon -X:
> >
> > [XML output mangled by the archive. It showed the pgsql-ha clone set
> > (managed, not failed) with three pgsqld instances, one role="Master"
> > and two role="Slave", each with nodes_running_on="1", plus
> > pgsql-master-ip (ocf::heartbeat:IPaddr2) role="Started" with
> > nodes_running_on="1".]
> >

Re: [ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

2017-05-15 Thread Klaus Wenninger
On 05/15/2017 09:36 AM, Anu Pillai wrote:
> Hi,   
>
> We are running a Pacemaker cluster for managing our resources. We have 6
> systems running 5 resources, and one is acting as standby. We have a
> restriction that only one resource can run on one node. But our
> observation is that whenever we add or delete a resource from the cluster,
> all the remaining resources in the cluster are stopped and started back.
>
> Can you please guide us on whether this is normal behavior or whether we
> are missing any configuration that is leading to this issue.

It should definitely be possible to prevent this behavior.
If you share your config with us we might be able to
track that down.

Regards,
Klaus

>
> Regards
> Aswathi
>




[ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

2017-05-15 Thread Anu Pillai
Hi,

We are running a Pacemaker cluster for managing our resources. We have 6
systems running 5 resources, and one is acting as standby. We have a
restriction that only one resource can run on one node. But our observation
is that whenever we add or delete a resource from the cluster, all the
remaining resources in the cluster are stopped and started back.

Can you please guide us on whether this is normal behavior or whether we are
missing any configuration that is leading to this issue.

Regards
Aswathi


Re: [ClusterLabs] pacemaker remote node offline after reboot

2017-05-15 Thread Klaus Wenninger
On 05/15/2017 08:43 AM, Ignazio Cassano wrote:
> Hello,
> adding remote compute with:
> pcs resource create computenode1 remote  server=10.102.184.91
>
> instead of:
> pcs resource create computenode1 ocf:pacemaker:remote
> reconnect_interval=60 op monitor interval=20

I have never tried it without giving the server.
Does computenode1 in your case resolve to the host address?
It might as well be that the reconnect_interval mechanism doesn't
work as it should (/ as I think it should ;-) ).
Did you set the cluster-recheck-interval to a reasonably short value
(needed for connect failures to time out)?
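
For example (a sketch; the 60s value is only an illustration):

    pcs property set cluster-recheck-interval=60s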

Regards,
Klaus

>
> SOLVES the issue when an unexpected compute node reboot happens.
> It comes back online and works fine.
> Regards
> Ignazio

