Re: [ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

2017-05-23 Thread Nikhil Utane
Thanks Aswathi.

(My account had stopped working due to mail bounces; I had never seen that
happen with a Gmail account.)

Ken,

Answers to your questions are below:

*1. Using force option*
A) During our testing we had observed that in some instances the resource
deletion would fail and that's why we added the force option. With the
force option we never saw the problem again.

*2. "Maybe in this particular instance, you actually did "crm_resource
-C"?"*
A) This step is done through code, so there is no human involvement. We
print the full command and always see that the resource name is included,
so this cannot happen.

*3.  $ crm_node -R 0005B94238BC --force*
A) Yes, we want to remove the node completely. We are not specifying the
node information in corosync.conf, so there is nothing to be removed there.
I need to go back and check, but I vaguely remember that because of some
issue we had switched from the "pcs cluster node remove" command to the
crm_node -R command. Perhaps because it gave us the option to use force.

*4. "No STONITH and QUORUM"*
A) As I have mentioned earlier, split-brain doesn't pose a problem for us
since we have a second line of defense based on our architecture. Hence we
have made a conscious decision to disable it. The config IS for production.

BTW, we also issue a "pcs resource disable" command before doing a "pcs
resource delete". Not sure if that makes any difference.

We will play around with those 4-5 commands that we execute to see whether
the resource restart happens as a reaction to any of those commands.

-Thanks & Regards
Nikhil

On Wed, May 24, 2017 at 11:28 AM, Anu Pillai 
wrote:

> Blank response so the thread appears in my mailbox; please ignore.
>
> On Tue, May 23, 2017 at 4:21 AM, Ken Gaillot  wrote:
>
>> On 05/16/2017 04:34 AM, Anu Pillai wrote:
>> > Hi,
>> >
>> > Please find attached debug logs for the stated problem as well as
>> > crm_mon command outputs.
>> > In this case we are trying to remove/delete res3 and system/node
>> > (0005B94238BC) from the cluster.
>> >
>> > *_Test reproduction steps_*
>> >
>> > Current Configuration of the cluster:
>> >  0005B9423910  - res2
>> >  0005B9427C5A - res1
>> >  0005B94238BC - res3
>> >
>> > *crm_mon output:*
>> >
>> > Defaulting to one-shot mode
>> > You need to have curses available at compile time to enable console mode
>> > Stack: corosync
>> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with
>> quorum
>> > Last updated: Tue May 16 12:21:23 2017  Last change: Tue May 16
>> > 12:13:40 2017 by root via crm_attribute on 0005B9423910
>> >
>> > 3 nodes and 3 resources configured
>> >
>> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
>> >
>> >  res2   (ocf::redundancy:RedundancyRA): Started 0005B9423910
>> >  res1   (ocf::redundancy:RedundancyRA): Started 0005B9427C5A
>> >  res3   (ocf::redundancy:RedundancyRA): Started 0005B94238BC
>> >
>> >
>> > Trigger the delete operation for res3 and node 0005B94238BC.
>> >
>> > Following commands applied from node 0005B94238BC
>> > $ pcs resource delete res3 --force
>> > $ crm_resource -C res3
>> > $ pcs cluster stop --force
>>
>> I don't think "pcs resource delete" or "pcs cluster stop" does anything
>> with the --force option. In any case, --force shouldn't be needed here.
>>
>> The crm_mon output you see is actually not what it appears. It starts
>> with:
>>
>> May 16 12:21:27 [4661] 0005B9423910   crmd:   notice: do_lrm_invoke:
>>Forcing the status of all resources to be redetected
>>
>> This is usually the result of a "cleanup all" command. It works by
>> erasing the resource history, causing pacemaker to re-probe all nodes to
>> get the current state. The history erasure makes it appear to crm_mon
>> that the resources are stopped, but they actually are not.
>>
>> In this case, I'm not sure why it's doing a "cleanup all", since you
>> only asked it to cleanup res3. Maybe in this particular instance, you
>> actually did "crm_resource -C"?
>>
>> > Following command applied from DC(0005B9423910)
>> > $ crm_node -R 0005B94238BC --force
>>
>> This can cause problems. This command shouldn't be run unless the node
>> is removed from both pacemaker's and corosync's configuration. If you
>> actually are trying to remove the node completely, a better alternative
>> would be "pcs cluster node remove 0005B94238BC", which will handle all
>> of that for you. If you're not trying to remove the node completely,
>> then you shouldn't need this command at all.
>>
>> >
>> >
>> > *crm_mon output:*
>> > *
>> > *
>> > Defaulting to one-shot mode
>> > You need to have curses available at compile time to enable console mode
>> > Stack: corosync
>> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with
>> quorum
>> > Last updated: Tue May 16 12:21:27 2017  Last change: Tue May 16
>> > 12:21:26 2017 by root via cibadmin on 0005B94238BC
>> >
>> > 3 nodes and 2 resources configured
>> >
>> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
>> >
>> >
>> > Obs

Re: [ClusterLabs] Syncing data and reducing CPU utilization of cib process

2017-04-03 Thread Nikhil Utane
Thank you Lars. Yes, subscribing will be better. Will look into it.
We have already started working on reducing the data that goes into the
CIB file.
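As a stopgap we may also compress the blob ourselves before stuffing it
into a single attribute, roughly along these lines (an untested sketch;
the resource, attribute and file names below are just placeholders):

$ blob=$(gzip -c /etc/myapp/config | base64 -w0)   # compress and make it CIB-safe
$ pcs resource update res1 config_blob="$blob"     # one attribute instead of ~300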

-Regards
Nikhil

On Mon, Apr 3, 2017 at 6:41 PM, Lars Ellenberg 
wrote:

> On Mon, Apr 03, 2017 at 03:44:21PM +0530, Nikhil Utane wrote:
> > Here's the snapshot. As seen below, the messages are coming at more than
> a
> > second frequency.
> > I checked that the cib.xml file was not updated (no change to timestamp
> of
> > file)
> > Then i took tcpdump and did not see any message other than keep-alives.
> > Is the cib process looping incorrectly?
> > Can share strace output if required.
> >
> > Apr 03 14:48:28 [6372] 0005B932ED72cib: info:
> > crm_compress_string:  Compressed 427943 bytes into 13559 (ratio 31:1ms
> > Apr 03 14:48:29 [6372] 0005B932ED72cib: info:
> > crm_compress_string:  Compressed 427943 bytes into 13536 (ratio 31:1ms
> > Apr 03 14:48:29 [6372] 0005B932ED72cib: info:
> > crm_compress_string:  Compressed 427943 bytes into 13551 (ratio 31:1ms
> > Apr 03 14:48:30 [6372] 0005B932ED72cib: info:
> > crm_compress_string:  Compressed 427943 bytes into 13552 (ratio 31:1ms
> > Apr 03 14:48:31 [6372] 0005B932ED72cib: info:
> > crm_compress_string:  Compressed 427943 bytes into 13537 (ratio 31:1ms
> > Apr 03 14:48:32 [6372] 0005B932ED72cib: info:
> > crm_compress_string:  Compressed 427943 bytes into 13534 (ratio 31:1ms
> > Apr 03 14:48:32 [6372] 0005B932ED72cib: info:
> > crm_compress_string:  Compressed 427943 bytes into 13546 (ratio 31:1ms
>
> Each and every "cibadmin -Q" (or equivalent) will trigger that,
> also for local IPC.
>
> Stop polling the cib several times per second.
>
> If you have to, "subscribe" to cib updates, using the API.
>
> And stop pushing that much data into the cib.
> Maybe, as a stop gap, compress it yourself,
> before you stuff it into the cib.
>
>
> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
> : R&D, Integration, Ops, Consulting, Support
>
> DRBD® and LINBIT® are registered trademarks of LINBIT
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Syncing data and reducing CPU utilization of cib process

2017-04-03 Thread Nikhil Utane
Here's the snapshot. As seen below, the messages are coming more than once
per second.
I checked that the cib.xml file was not updated (no change to the file's
timestamp).
Then I took a tcpdump and did not see any messages other than keep-alives.
Is the cib process looping incorrectly?
I can share strace output if required.

Apr 03 14:48:28 [6372] 0005B932ED72cib: info:
crm_compress_string:  Compressed 427943 bytes into 13559 (ratio 31:1ms
Apr 03 14:48:29 [6372] 0005B932ED72cib: info:
crm_compress_string:  Compressed 427943 bytes into 13536 (ratio 31:1ms
Apr 03 14:48:29 [6372] 0005B932ED72cib: info:
crm_compress_string:  Compressed 427943 bytes into 13551 (ratio 31:1ms
Apr 03 14:48:30 [6372] 0005B932ED72cib: info:
crm_compress_string:  Compressed 427943 bytes into 13552 (ratio 31:1ms
Apr 03 14:48:31 [6372] 0005B932ED72cib: info:
crm_compress_string:  Compressed 427943 bytes into 13537 (ratio 31:1ms
Apr 03 14:48:32 [6372] 0005B932ED72cib: info:
crm_compress_string:  Compressed 427943 bytes into 13534 (ratio 31:1ms
Apr 03 14:48:32 [6372] 0005B932ED72cib: info:
crm_compress_string:  Compressed 427943 bytes into 13546 (ratio 31:1ms

-Regards
Nikhil

On Mon, Apr 3, 2017 at 11:38 AM, Nikhil Utane 
wrote:

> Ken,
>
> The CIB file is not being updated that often.
> I took a packet capture and don't see the node sending any message to
> other nodes (other than keep-alives).
> What then explains these messages coming every second?
>
> -Regards
> Nikhil
>
> On Sat, Apr 1, 2017 at 1:37 AM, Ken Gaillot  wrote:
>
>> On 03/31/2017 06:44 AM, Nikhil Utane wrote:
>> > We are seeing this log in pacemaker.log continuously.
>> >
>> > Mar 31 17:13:01 [6372] 0005B932ED72cib: info:
>> > crm_compress_string:  Compressed 436756 bytes into 14635 (ratio 29:1) in
>> > 284ms
>> >
>> > This looks to be the reason for high CPU. What does this log indicate?
>>
>> If a cluster message is larger than 128KB, pacemaker will compress it
>> (using BZ2) before transmitting it across the network to the other
>> nodes. This can hit the CPU significantly. Having a large resource
>> definition makes such messages more common.
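>>
>> A quick way to gauge how big the CIB being shipped around is, for
>> example:
>>
>>   $ cibadmin -Q | wc -c    # size in bytes of the full CIB XML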
>>
>> There are many ways to sync a configuration file between nodes. If the
>> configuration rarely changes, a simple rsync cron could do it.
>> Specialized tools like lsyncd are more responsive while still having a
>> minimal footprint. DRBD or shared storage would be more powerful and
>> real-time. If it's a custom app, you could even modify it to use
>> something like etcd or a NoSQL db.
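>>
>> For the rsync route, a minimal sketch (the path, host name and schedule
>> are placeholders, and it assumes passwordless ssh between the nodes):
>>
>>   # crontab on the active node: push the config to the standby every minute
>>   * * * * * rsync -a /etc/myapp/config standby-node:/etc/myapp/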
>>
>> >
>> > -Regards
>> > Nikhil
>> >
>> >
>> > On Fri, Mar 31, 2017 at 12:08 PM, Nikhil Utane
>> > <nikhil.subscri...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > In our current design (which we plan to improve upon) we are using
>> > the CIB file to synchronize information across active and standby
>> nodes.
>> > Basically we want the standby node to take the configuration that
>> > was used by the active node so we are adding those as resource
>> > attributes. This ensures that when the standby node takes over, it
>> > can read all the configuration which will be passed to it as
>> > environment variables.
>> > Initially we thought the list of configuration parameters will be
>> > less and we did some prototyping and saw that there wasn't much of
>> > an issue. But now the list has grown it has become close to 300
>> > attributes. (I know this is like abusing the feature and we are
>> > looking towards doing it the right way).
>> >
>> > So I have two questions:
>> > 1) What is the best way to synchronize such kind of information
>> > across nodes in the cluster? DRBD? Anything else that is simpler?
>> > For e.g. instead of syncing 300 attributes i could just sync up the
>> > path to a file.
>> >
>> > 2) In the current design, is there anything that I can do to reduce
>> > the CPU utilization of cib process? Currently it regularly takes
>> > 30-50% of the CPU.
>> > Any quick fix that I can do which will bring it down? For e.g.
>> > configure how often it synchronizes etc?
>> >
>> > -Thanks
>> > Nikhil
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Syncing data and reducing CPU utilization of cib process

2017-04-02 Thread Nikhil Utane
Ken,

The CIB file is not being updated that often.
I took a packet capture and don't see the node sending any message to other
nodes (other than keep-alives).
What then explains these messages coming every second?

-Regards
Nikhil

On Sat, Apr 1, 2017 at 1:37 AM, Ken Gaillot  wrote:

> On 03/31/2017 06:44 AM, Nikhil Utane wrote:
> > We are seeing this log in pacemaker.log continuously.
> >
> > Mar 31 17:13:01 [6372] 0005B932ED72cib: info:
> > crm_compress_string:  Compressed 436756 bytes into 14635 (ratio 29:1) in
> > 284ms
> >
> > This looks to be the reason for high CPU. What does this log indicate?
>
> If a cluster message is larger than 128KB, pacemaker will compress it
> (using BZ2) before transmitting it across the network to the other
> nodes. This can hit the CPU significantly. Having a large resource
> definition makes such messages more common.
>
> There are many ways to sync a configuration file between nodes. If the
> configuration rarely changes, a simple rsync cron could do it.
> Specialized tools like lsyncd are more responsive while still having a
> minimal footprint. DRBD or shared storage would be more powerful and
> real-time. If it's a custom app, you could even modify it to use
> something like etcd or a NoSQL db.
>
> >
> > -Regards
> > Nikhil
> >
> >
> > On Fri, Mar 31, 2017 at 12:08 PM, Nikhil Utane
> > <nikhil.subscri...@gmail.com> wrote:
> >
> > Hi,
> >
> > In our current design (which we plan to improve upon) we are using
> > the CIB file to synchronize information across active and standby
> nodes.
> > Basically we want the standby node to take the configuration that
> > was used by the active node so we are adding those as resource
> > attributes. This ensures that when the standby node takes over, it
> > can read all the configuration which will be passed to it as
> > environment variables.
> > Initially we thought the list of configuration parameters will be
> > less and we did some prototyping and saw that there wasn't much of
> > an issue. But now the list has grown it has become close to 300
> > attributes. (I know this is like abusing the feature and we are
> > looking towards doing it the right way).
> >
> > So I have two questions:
> > 1) What is the best way to synchronize such kind of information
> > across nodes in the cluster? DRBD? Anything else that is simpler?
> > For e.g. instead of syncing 300 attributes i could just sync up the
> > path to a file.
> >
> > 2) In the current design, is there anything that I can do to reduce
> > the CPU utilization of cib process? Currently it regularly takes
> > 30-50% of the CPU.
> > Any quick fix that I can do which will bring it down? For e.g.
> > configure how often it synchronizes etc?
> >
> > -Thanks
> > Nikhil
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Syncing data and reducing CPU utilization of cib process

2017-03-31 Thread Nikhil Utane
We are seeing this log in pacemaker.log continuously.

Mar 31 17:13:01 [6372] 0005B932ED72cib: info:
crm_compress_string:  Compressed 436756 bytes into 14635 (ratio 29:1) in
284ms

This looks to be the reason for high CPU. What does this log indicate?

-Regards
Nikhil


On Fri, Mar 31, 2017 at 12:08 PM, Nikhil Utane 
wrote:

> Hi,
>
> In our current design (which we plan to improve upon) we are using the CIB
> file to synchronize information across active and standby nodes.
> Basically we want the standby node to take the configuration that was used
> by the active node so we are adding those as resource attributes. This
> ensures that when the standby node takes over, it can read all the
> configuration which will be passed to it as environment variables.
> Initially we thought the list of configuration parameters will be less and
> we did some prototyping and saw that there wasn't much of an issue. But now
> the list has grown it has become close to 300 attributes. (I know this is
> like abusing the feature and we are looking towards doing it the right way).
>
> So I have two questions:
> 1) What is the best way to synchronize such kind of information across
> nodes in the cluster? DRBD? Anything else that is simpler? For e.g. instead
> of syncing 300 attributes i could just sync up the path to a file.
>
> 2) In the current design, is there anything that I can do to reduce the
> CPU utilization of cib process? Currently it regularly takes 30-50% of the
> CPU.
> Any quick fix that I can do which will bring it down? For e.g. configure
> how often it synchronizes etc?
>
> -Thanks
> Nikhil
>
>
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Syncing data and reducing CPU utilization of cib process

2017-03-30 Thread Nikhil Utane
Hi,

In our current design (which we plan to improve upon) we are using the CIB
file to synchronize information across active and standby nodes.
Basically we want the standby node to take the configuration that was used
by the active node so we are adding those as resource attributes. This
ensures that when the standby node takes over, it can read all the
configuration which will be passed to it as environment variables.
Initially we thought the list of configuration parameters would be small;
we did some prototyping and saw that there wasn't much of an issue. But now
the list has grown to close to 300 attributes. (I know this is like abusing
the feature, and we are looking at doing it the right way.)

So I have two questions:
1) What is the best way to synchronize such kind of information across
nodes in the cluster? DRBD? Anything else that is simpler? For e.g. instead
of syncing 300 attributes i could just sync up the path to a file.

2) In the current design, is there anything that I can do to reduce the CPU
utilization of cib process? Currently it regularly takes 30-50% of the CPU.
Any quick fix that I can do which will bring it down? For e.g. configure
how often it synchronizes etc?

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Running two independent clusters

2017-03-29 Thread Nikhil Utane
"*Coincidentally, I am about to announce enhanced container support in*
*pacemaker. I should have a post with more details later today or tomorrow.*
"

Ken: Where you able to get to it?

-Thanks
Nikhil

On Thu, Mar 23, 2017 at 7:35 PM, Ken Gaillot  wrote:

> On 03/22/2017 11:08 PM, Nikhil Utane wrote:
> > I simplified when I called it as a service. Essentially it is a complete
> > system.
> > It is an LTE eNB solution. It provides LTE service (service A) and now
> > we need to provide redundancy for another different but related service
> > (service B). The catch being, the LTE redundancy solution will be tied
> > to one operator whereas the other service can span across multiple
> > operators. Therefore ideally we want two completely independent clusters
> > since different set of nodes will form the two clusters.
> > Now what I am thinking is, to run additional instance of Pacemaker +
> > Corosync in a container which can then notify the service B on host
> > machine to start or stop it's service. That way my CIB file will be
> > independent and I can run corosync on different interfaces.
> >
> > Workable right?
> >
> > -Regards
> > Nikhil
>
> It's not well-tested, but in theory it should work, as long as the
> container is privileged.
>
> I still think virtualizing the services would be more resilient. It
> makes sense to have a single determination of quorum and fencing for the
> same real hosts. I'd think of it like a cloud provider -- the cloud
> instances are segregated by customer, but the underlying hosts are the
> same.
>
> You could configure your cluster as asymmetric, and enable each VM only
> on the nodes it's allowed on, so you get the two separate "clusters"
> that way. You could set up the VMs as guest nodes if you want to monitor
> and manage multiple services within them. If your services require
> hardware access that's not easily passed to a VM, containerizing the
> services might be a better option.
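>
> A sketch of the asymmetric approach (resource and node names are
> placeholders, assuming the services are wrapped in VM or container
> resources):
>
>   $ pcs property set symmetric-cluster=false
>   # service A's VM is only allowed on its set of nodes
>   $ pcs constraint location vm_serviceA prefers nodeA1=INFINITY nodeA2=INFINITY
>   # likewise for service B's VM
>   $ pcs constraint location vm_serviceB prefers nodeB1=INFINITY nodeB2=INFINITY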
>
> > On Wed, Mar 22, 2017 at 8:06 PM, Ken Gaillot <kgail...@redhat.com> wrote:
> >
> > On 03/22/2017 05:23 AM, Nikhil Utane wrote:
> > > Hi Ulrich,
> > >
> > > It's not an option unfortunately.
> > > Our product runs on a specialized hardware and provides both the
> > > services (A & B) that I am referring to. Hence I cannot have
> service A
> > > running on some nodes as cluster A and service B running on other
> nodes
> > > as cluster B.
> > > The two services HAVE to run on same node. The catch being service
> A and
> > > service B have to be independent of each other.
> > >
> > > Hence looking at Container option since we are using that for some
> other
> > > product (but not for Pacemaker/Corosync).
> > >
> > > -Regards
> > > Nikhil
> >
> > Instead of containerizing pacemaker, why don't you containerize or
> > virtualize the services, and have pacemaker manage the
> containers/VMs?
> >
> > Coincidentally, I am about to announce enhanced container support in
> > pacemaker. I should have a post with more details later today or
> > tomorrow.
> >
> > >
> > > On Wed, Mar 22, 2017 at 12:41 PM, Ulrich Windl
> > > <ulrich.wi...@rz.uni-regensburg.de> wrote:
> > >
> > > >>> Nikhil Utane <nikhil.subscri...@gmail.com> wrote on 22.03.2017 at 07:48 in message:
> > > > Hi All,
> > > >
> > > > First of all, let me thank everyone here for providing
> > excellent support
> > > > from the time I started evaluating this tool about a year
> > ago. It has
> > > > helped me to make a timely and good quality release of our
> > Redundancy
> > > > solution using Pacemaker & Corosync. (Three cheers :))
> > > >
> > > > Now for our next release we have a slightly different ask.
> > > > We want to provide 

Re: [ClusterLabs] Antw: Running two independent clusters

2017-03-22 Thread Nikhil Utane
I simplified when I called it a service. Essentially it is a complete
system.
It is an LTE eNB solution. It provides LTE service (service A), and now we
need to provide redundancy for another, different but related, service
(service B). The catch is that the LTE redundancy solution will be tied to
one operator, whereas the other service can span multiple operators.
Therefore we would ideally want two completely independent clusters, since
different sets of nodes will form the two clusters.
What I am now thinking is to run an additional instance of Pacemaker +
Corosync in a container, which can then notify service B on the host
machine to start or stop its service. That way my CIB file will be
independent and I can run corosync on different interfaces.

Workable right?

-Regards
Nikhil


On Wed, Mar 22, 2017 at 8:06 PM, Ken Gaillot  wrote:

> On 03/22/2017 05:23 AM, Nikhil Utane wrote:
> > Hi Ulrich,
> >
> > It's not an option unfortunately.
> > Our product runs on a specialized hardware and provides both the
> > services (A & B) that I am referring to. Hence I cannot have service A
> > running on some nodes as cluster A and service B running on other nodes
> > as cluster B.
> > The two services HAVE to run on same node. The catch being service A and
> > service B have to be independent of each other.
> >
> > Hence looking at Container option since we are using that for some other
> > product (but not for Pacemaker/Corosync).
> >
> > -Regards
> > Nikhil
>
> Instead of containerizing pacemaker, why don't you containerize or
> virtualize the services, and have pacemaker manage the containers/VMs?
>
> Coincidentally, I am about to announce enhanced container support in
> pacemaker. I should have a post with more details later today or tomorrow.
>
> >
> > On Wed, Mar 22, 2017 at 12:41 PM, Ulrich Windl
> > <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >
> > >>> Nikhil Utane <nikhil.subscri...@gmail.com> wrote on 22.03.2017 at 07:48 in message:
> > > Hi All,
> > >
> > > First of all, let me thank everyone here for providing excellent
> support
> > > from the time I started evaluating this tool about a year ago. It
> has
> > > helped me to make a timely and good quality release of our
> Redundancy
> > > solution using Pacemaker & Corosync. (Three cheers :))
> > >
> > > Now for our next release we have a slightly different ask.
> > > We want to provide Redundancy to two different types of services
> (we can
> > > call them Service A and Service B) such that all cluster
> communication for
> > > Service A happens on one network/interface (say VLAN A) and for
> service B
> > > happens on a different network/interface (say VLAN B). Moreover we
> do not
> > > want the details of Service A (resource attributes etc) to be seen
> by
> > > Service B and vice-versa.
> > >
> > > So essentially we want to be able to run two independent clusters.
> From
> > > what I gathered, we cannot run multiple instances of Pacemaker and
> Corosync
> > > on same node. I was thinking if we can use Containers and run two
> isolated
> >
> > You conclude from two services that should not see each other that
> > you need two instances of pacemaker on one node. Why?
> > If you want true separation, drop the VLANs, make real networks and
> > two independent clusters.
> > Even if two pacemakers on one node would work, you have the problem
> > of fencing, where at least one pacemaker instance will always be
> > badly surprised if fencing takes place. I cannot imagine you want
> > that!
> >
> > > instances of Pacemaker + Corosync on same node.
> > > As per https://github.com/davidvossel/pacemaker_docker
> > <https://github.com/davidvossel/pacemaker_docker> it looks do-able.
> > > I wanted to get an opinion on this forum before I can commit that
> it can be
> > > done.
> >
> > Why are you designing it to be more complicated than necessary?
> >
> > >
> > > Please share your views if you have already done this and if there
> are any
> > > known challenges that I should be familiar with.
> > >
> > > -Thanks
> > > Nikhil
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Running two independent clusters

2017-03-22 Thread Nikhil Utane
I need two clusters running independently of each other on the same node.

-Nikhil

On Wed, Mar 22, 2017 at 6:36 PM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Nikhil Utane wrote on 22.03.2017 at 11:23 in message:
> > Hi Ulrich,
> >
> > It's not an option unfortunately.
> > Our product runs on a specialized hardware and provides both the services
> > (A & B) that I am referring to. Hence I cannot have service A running on
> > some nodes as cluster A and service B running on other nodes as cluster
> B.
> > The two services HAVE to run on same node. The catch being service A and
> > service B have to be independent of each other.
> >
> > Hence looking at Container option since we are using that for some other
> > product (but not for Pacemaker/Corosync).
>
> But why do you need two pacemakers then?
>
> >
> > -Regards
> > Nikhil
> >
> >
> > On Wed, Mar 22, 2017 at 12:41 PM, Ulrich Windl <
> > ulrich.wi...@rz.uni-regensburg.de> wrote:
> >
> >> >>> Nikhil Utane wrote on 22.03.2017 at 07:48 in message:
> >> > Hi All,
> >> >
> >> > First of all, let me thank everyone here for providing excellent
> support
> >> > from the time I started evaluating this tool about a year ago. It has
> >> > helped me to make a timely and good quality release of our Redundancy
> >> > solution using Pacemaker & Corosync. (Three cheers :))
> >> >
> >> > Now for our next release we have a slightly different ask.
> >> > We want to provide Redundancy to two different types of services (we
> can
> >> > call them Service A and Service B) such that all cluster communication
> >> for
> >> > Service A happens on one network/interface (say VLAN A) and for
> service B
> >> > happens on a different network/interface (say VLAN B). Moreover we do
> not
> >> > want the details of Service A (resource attributes etc) to be seen by
> >> > Service B and vice-versa.
> >> >
> >> > So essentially we want to be able to run two independent clusters.
> From
> >> > what I gathered, we cannot run multiple instances of Pacemaker and
> >> Corosync
> >> > on same node. I was thinking if we can use Containers and run two
> >> isolated
> >>
> >> You conclude from two services that should not see each other that you
> >> need two instances of pacemaker on one node. Why?
> >> If you want true separation, drop the VLANs, make real networks and two
> >> independent clusters.
> >> Even if two pacemakers on one node would work, you have the problem of
> >> fencing, where at least one pacemaker instance will always be badly
> >> surprised if fencing takes place. I cannot imagine you want that!
> >>
> >> > instances of Pacemaker + Corosync on same node.
> >> > As per https://github.com/davidvossel/pacemaker_docker it looks
> do-able.
> >> > I wanted to get an opinion on this forum before I can commit that it
> can
> >> be
> >> > done.
> >>
> >> Why are you designing it to be more complicated than necessary?
> >>
> >> >
> >> > Please share your views if you have already done this and if there are
> >> any
> >> > known challenges that I should be familiar with.
> >> >
> >> > -Thanks
> >> > Nikhil
> >>
> >>
> >>
> >>
> >>
> >> ___
> >> Users mailing list: Users@clusterlabs.org
> >> http://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started: http://www.clusterlabs.org/
> doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >>
>
>
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Running two independent clusters

2017-03-22 Thread Nikhil Utane
Hi Ulrich,

It's not an option unfortunately.
Our product runs on a specialized hardware and provides both the services
(A & B) that I am referring to. Hence I cannot have service A running on
some nodes as cluster A and service B running on other nodes as cluster B.
The two services HAVE to run on same node. The catch being service A and
service B have to be independent of each other.

Hence looking at Container option since we are using that for some other
product (but not for Pacemaker/Corosync).

-Regards
Nikhil


On Wed, Mar 22, 2017 at 12:41 PM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Nikhil Utane wrote on 22.03.2017 at 07:48 in message:
> > Hi All,
> >
> > First of all, let me thank everyone here for providing excellent support
> > from the time I started evaluating this tool about a year ago. It has
> > helped me to make a timely and good quality release of our Redundancy
> > solution using Pacemaker & Corosync. (Three cheers :))
> >
> > Now for our next release we have a slightly different ask.
> > We want to provide Redundancy to two different types of services (we can
> > call them Service A and Service B) such that all cluster communication
> for
> > Service A happens on one network/interface (say VLAN A) and for service B
> > happens on a different network/interface (say VLAN B). Moreover we do not
> > want the details of Service A (resource attributes etc) to be seen by
> > Service B and vice-versa.
> >
> > So essentially we want to be able to run two independent clusters. From
> > what I gathered, we cannot run multiple instances of Pacemaker and
> Corosync
> > on same node. I was thinking if we can use Containers and run two
> isolated
>
> You conclude from two services that should not see each other that you
> need two instances of pacemaker on one node. Why?
> If you want true separation, drop the VLANs, make real networks and two
> independent clusters.
> Even if two pacemakers on one node would work, you have the problem of
> fencing, where at least one pacemaker instance will always be badly
> surprised if fencing takes place. I cannot imagine you want that!
>
> > instances of Pacemaker + Corosync on same node.
> > As per https://github.com/davidvossel/pacemaker_docker it looks do-able.
> > I wanted to get an opinion on this forum before I can commit that it can
> be
> > done.
>
> Why are you designing it to be more complicated than necessary?
>
> >
> > Please share your views if you have already done this and if there are
> any
> > known challenges that I should be familiar with.
> >
> > -Thanks
> > Nikhil
>
>
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Running two independent clusters

2017-03-21 Thread Nikhil Utane
Hi All,

First of all, let me thank everyone here for providing excellent support
from the time I started evaluating this tool about a year ago. It has
helped me to make a timely and good quality release of our Redundancy
solution using Pacemaker & Corosync. (Three cheers :))

Now for our next release we have a slightly different ask.
We want to provide Redundancy to two different types of services (we can
call them Service A and Service B) such that all cluster communication for
Service A happens on one network/interface (say VLAN A) and for service B
happens on a different network/interface (say VLAN B). Moreover we do not
want the details of Service A (resource attributes etc) to be seen by
Service B and vice-versa.

So essentially we want to be able to run two independent clusters. From
what I gathered, we cannot run multiple instances of Pacemaker and Corosync
on same node. I was thinking if we can use Containers and run two isolated
instances of Pacemaker + Corosync on same node.
As per https://github.com/davidvossel/pacemaker_docker it looks do-able.
I wanted to get an opinion on this forum before I can commit that it can be
done.

Please share your views if you have already done this and if there are any
known challenges that I should be familiar with.

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Get rid of reload altogether

2016-12-13 Thread Nikhil Utane
Jan,

Could you please elaborate?
Currently we are thinking of running a script that will generate the list
of attributes after reading from another file.
But these run into 3000+ parameters. :(
It will be a huge effort to maintain.

All I want is for Pacemaker not to do a stop/start when resource attributes
change.
Would it be easier to modify the pacemaker source code and ignore this
change of value?

-Regards
Nikhil

On Wed, Nov 30, 2016 at 8:46 PM, Jan Pokorný  wrote:

> On 28/11/16 09:44 +0530, Nikhil Utane wrote:
> > I understand the whole concept of reload and how to define parameters
> with
> > unique=0 so that pacemaker can call the reload operation of the OCF
> script
> > instead of stopping and starting the resource.
> > Now my problem is that I have 100s of parameters and I don't want to
> > specify each of those with unique=0.
>
> Would it be doable that your agent, when asked for metadata, will
> produce them as usual, but in addition runs the XML through XSL
> template that will add these unique=0 declarations for you
> (except perhaps some whitelist)?
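>
> For instance, an xmlstarlet one-liner in the same spirit (a sketch only;
> it assumes xmlstarlet is installed on the node, and the agent path is
> just an example -- a real XSL template would give you finer control,
> such as a whitelist):
>
>   /usr/lib/ocf/resource.d/redundancy/RedundancyRA meta-data \
>     | xmlstarlet ed -i "//parameter[not(@unique)]" -t attr -n unique -v 0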
>
> --
> Jan (Poki)
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Get rid of reload altogether

2016-11-27 Thread Nikhil Utane
Hi,

I understand the whole concept of reload and how to define parameters with
unique=0 so that pacemaker can call the reload operation of the OCF script
instead of stopping and starting the resource.
Now my problem is that I have 100s of parameters and I don't want to
specify each of those with unique=0.

Is there any other way to completely stop the whole business of reload?
Basically if any of the resource parameters are changed, don't do anything.

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-25 Thread Nikhil Utane
I think it was a silly mistake. "placement-strategy" was not enabled.
We have enabled it now and are testing it out.
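For the record, enabling it amounts to something like this (assuming the
balanced strategy is the one we want):

$ pcs property set placement-strategy=balanced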

Thanks

On Mon, Oct 24, 2016 at 7:35 PM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Nikhil Utane wrote on 24.10.2016 at 13:22 in message:
> > I had set resource utilization to 1. Even then it scheduled 2 resources.
> > Doesn't it honor utilization resources if it doesn't find a free node?
>
> Show us the config and the logs, please!
>
>
> >
> > -Nikhil
> >
> > On Mon, Oct 24, 2016 at 4:43 PM, Vladislav Bogdanov <
> bub...@hoster-ok.com>
> > wrote:
> >
> >> 24.10.2016 14:04, Nikhil Utane wrote:
> >>
> >>> That is what happened here :(.
> >>> When 2 nodes went down, two resources got scheduled on single node.
> >>> Isn't there any way to stop this from happening. Colocation constraint
> >>> is not helping.
> >>>
> >>
> >> If it is ok to have some instances not running in such outage cases, you
> >> can limit them to 1-per-node with utilization attributes (as was
> suggested
> >> earlier). Then, when nodes return, resource instances will return with
> (and
> >> on!) them.
> >>
> >>
> >>
> >>> -Regards
> >>> Nikhil
> >>>
> >>> On Sat, Oct 22, 2016 at 12:57 AM, Vladislav Bogdanov
> >>> <bub...@hoster-ok.com> wrote:
> >>>
> >>> 21.10.2016 19:34, Andrei Borzenkov wrote:
> >>>
> >>> 14.10.2016 10:39, Vladislav Bogdanov пишет:
> >>>
> >>>
> >>> use of utilization (balanced strategy) has one caveat:
> >>> resources are
> >>> not moved just because of utilization of one node is less,
> >>> when nodes
> >>> have the same allocation score for the resource. So, after
> the
> >>> simultaneus outage of two nodes in a 5-node cluster, it may
> >>> appear
> >>> that one node runs two resources and two recovered nodes
> run
> >>> nothing.
> >>>
> >>>
> >>> I call this a feature. Every resource move potentially means
> >>> service
> >>> outage, so it should not happen without explicit action.
> >>>
> >>>
> >>> In a case I describe that moves could be easily prevented by using
> >>> stickiness (it increases allocation score on a current node).
> >>> The issue is that it is impossible to "re-balance" resources in
> >>> time-frames when stickiness is zero (over-night maintenance
> window).
> >>>
> >>>
> >>>
> >>> Original 'utilization' strategy only limits resource
> >>> placement, it is
> >>> not considered when choosing a node for a resource.
> >>>
> >>>
> >>>
> >>> ___
> >>> Users mailing list: Users@clusterlabs.org
> >>> <mailto:Users@clusterlabs.org>
> >>> http://clusterlabs.org/mailman/listinfo/users
> >>> <http://clusterlabs.org/mailman/listinfo/users>
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started:
> >>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> >>> Bugs: http://bugs.clusterlabs.org
> >>>
> >>>
> >>>
> >>> ___
> >>> Users mailing list: Users@clusterlabs.org  >>> Users@clusterlabs.org>
> >>> http://clusterlabs.org/mailman/listinfo/users
> >>> <http://clusterlabs.org/mailman/listinfo/users>
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started:
> >>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> >>> Bugs: http://bugs.clusterlabs.org
> >>>
> >>>
> >>>
> >>>
> >>> ___
>

Re: [ClusterLabs] Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-24 Thread Nikhil Utane
I had set resource utilization to 1. Even then it scheduled 2 resources.
Doesn't it honor utilization attributes if it doesn't find a free node?

-Nikhil

On Mon, Oct 24, 2016 at 4:43 PM, Vladislav Bogdanov 
wrote:

> 24.10.2016 14:04, Nikhil Utane wrote:
>
>> That is what happened here :(.
>> When 2 nodes went down, two resources got scheduled on single node.
>> Isn't there any way to stop this from happening. Colocation constraint
>> is not helping.
>>
>
> If it is ok to have some instances not running in such outage cases, you
> can limit them to 1-per-node with utilization attributes (as was suggested
> earlier). Then, when nodes return, resource instances will return with (and
> on!) them.
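>
> For example (a sketch; the attribute name is arbitrary, and it assumes a
> pcs version with the utilization commands):
>
>   $ pcs property set placement-strategy=utilization
>   # each node offers capacity for exactly one instance...
>   $ pcs node utilization Redund_CU1_WB30 capacity=1
>   # ...and each resource consumes one unit of it
>   $ pcs resource utilization cu_2 capacity=1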
>
>
>
>> -Regards
>> Nikhil
>>
>> On Sat, Oct 22, 2016 at 12:57 AM, Vladislav Bogdanov
>> <bub...@hoster-ok.com> wrote:
>>
>> 21.10.2016 19:34, Andrei Borzenkov wrote:
>>
>> 14.10.2016 10:39, Vladislav Bogdanov пишет:
>>
>>
>> use of utilization (balanced strategy) has one caveat:
>> resources are
>> not moved just because of utilization of one node is less,
>> when nodes
>> have the same allocation score for the resource. So, after the
>> simultaneus outage of two nodes in a 5-node cluster, it may
>> appear
>> that one node runs two resources and two recovered nodes run
>> nothing.
>>
>>
>> I call this a feature. Every resource move potentially means
>> service
>> outage, so it should not happen without explicit action.
>>
>>
>> In a case I describe that moves could be easily prevented by using
>> stickiness (it increases allocation score on a current node).
>> The issue is that it is impossible to "re-balance" resources in
>> time-frames when stickiness is zero (over-night maintenance window).
>>
>>
>>
>> Original 'utilization' strategy only limits resource
>> placement, it is
>> not considered when choosing a node for a resource.
>>
>>
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> <mailto:Users@clusterlabs.org>
>> http://clusterlabs.org/mailman/listinfo/users
>> <http://clusterlabs.org/mailman/listinfo/users>
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> ___
>> Users mailing list: Users@clusterlabs.org > Users@clusterlabs.org>
>> http://clusterlabs.org/mailman/listinfo/users
>> <http://clusterlabs.org/mailman/listinfo/users>
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-24 Thread Nikhil Utane
That is what happened here :(.
When 2 nodes went down, two resources got scheduled on a single node.
Isn't there any way to stop this from happening? The colocation constraint
is not helping.

-Regards
Nikhil

On Sat, Oct 22, 2016 at 12:57 AM, Vladislav Bogdanov 
wrote:

> 21.10.2016 19:34, Andrei Borzenkov wrote:
>
>> 14.10.2016 10:39, Vladislav Bogdanov пишет:
>>
>>>
>>> use of utilization (balanced strategy) has one caveat: resources are
>>> not moved just because the utilization of one node is lower, when nodes
>>> have the same allocation score for the resource. So, after the
>>> simultaneous outage of two nodes in a 5-node cluster, it may appear
>>> that one node runs two resources and two recovered nodes run
>>> nothing.
>>>
>>>
>> I call this a feature. Every resource move potentially means service
>> outage, so it should not happen without explicit action.
>>
>>
> In the case I describe, those moves could easily be prevented by using
> stickiness (it increases the allocation score on the current node).
> The issue is that it is impossible to "re-balance" resources in
> time-frames when stickiness is zero (an over-night maintenance window).
>
>
>
> Original 'utilization' strategy only limits resource placement, it is
>>> not considered when choosing a node for a resource.
>>>
>>>
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-17 Thread Nikhil Utane
Yes Ulrich, somehow I missed pursuing that.
I will do both: configure stickiness to INFINITY and use utilization
attributes.
This should probably take care of it.
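Concretely, something along these lines (a sketch; the node/resource lines
are repeated for every node and resource, and the utilization attribute
name is arbitrary):

$ pcs resource defaults resource-stickiness=INFINITY
$ pcs property set placement-strategy=balanced
$ pcs node utilization Redund_CU1_WB30 capacity=1
$ pcs resource utilization cu_2 capacity=1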

Thanks
Nikhil

On Tue, Oct 18, 2016 at 11:45 AM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Nikhil Utane wrote on 17.10.2016 at 16:46 in message:
> > This is driving me insane.
>
> Why don't you try the utilization approach?
>
> >
> > This is how the resources were started. Redund_CU1_WB30  was the DC
> which I
> > rebooted.
> >  cu_4 (ocf::redundancy:RedundancyRA): Started Redund_CU1_WB30
> >  cu_2 (ocf::redundancy:RedundancyRA): Started Redund_CU5_WB30
> >  cu_3 (ocf::redundancy:RedundancyRA): Started Redun_CU4_Wb30
> >
> > Since the standby node was not UP. I was expecting resource cu_4 to be
> > waiting to be scheduled.
> > But then it re-arranged everything as below.
> >  cu_4 (ocf::redundancy:RedundancyRA): Started Redun_CU4_Wb30
> >  cu_2 (ocf::redundancy:RedundancyRA): Stopped
> >  cu_3 (ocf::redundancy:RedundancyRA): Started Redund_CU5_WB30
> >
> > There is not much information available in the logs on new DC. It just
> > shows what it has decided to do but nothing to suggest why it did it that
> > way.
> >
> > notice: Start   cu_4 (Redun_CU4_Wb30)
> > notice: Stopcu_2 (Redund_CU5_WB30)
> > notice: Movecu_3 (Started Redun_CU4_Wb30 -> Redund_CU5_WB30)
> >
> > I have default stickiness set to 100 which is higher than any score that
> I
> > have configured.
> > I have migration_threshold set to 1. Should I bump that up instead?
> >
> > -Thanks
> > Nikhil
> >
> > On Sat, Oct 15, 2016 at 12:36 AM, Ken Gaillot 
> wrote:
> >
> >> On 10/14/2016 06:56 AM, Nikhil Utane wrote:
> >> > Hi,
> >> >
> >> > Thank you for the responses so far.
> >> > I added reverse colocation as well. However seeing some other issue in
> >> > resource movement that I am analyzing.
> >> >
> >> > Thinking further on this, why doesn't "/a not with b" does not imply
> "b
> >> > not with a"?/
> >> > Coz wouldn't putting "b with a" violate "a not with b"?
> >> >
> >> > Can someone confirm that colocation is required to be configured both
> >> ways?
> >>
> >> The anti-colocation should only be defined one-way. Otherwise, you get a
> >> dependency loop (as seen in logs you showed elsewhere).
> >>
> >> The one-way constraint is enough to keep the resources apart. However,
> >> the question is whether the cluster might move resources around
> >> unnecessarily.
> >>
> >> For example, "A not with B" means that the cluster will place B first,
> >> then place A somewhere else. So, if B's node fails, can the cluster
> >> decide that A's node is now the best place for B, and move A to a free
> >> node, rather than simply start B on the free node?
> >>
> >> The cluster does take dependencies into account when placing a resource,
> >> so I would hope that wouldn't happen. But I'm not sure. Having some
> >> stickiness might help, so that A has some preference against moving.
> >>
> >> > -Thanks
> >> > Nikhil
> >> >
> >> > /
> >> > /
> >> >
> >> > On Fri, Oct 14, 2016 at 1:09 PM, Vladislav Bogdanov
> >> > <bub...@hoster-ok.com> wrote:
> >> >
> >> > On October 14, 2016 10:13:17 AM GMT+03:00, Ulrich Windl
> >> > <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >> > >>>> Nikhil Utane <nikhil.subscri...@gmail.com> wrote on 13.10.2016 at 16:43 in message:
> >> > >> Ulrich,
> >> > >>
> >> > >> I have 4 resources only (not 5, nodes are 5). So then I only
> need
> >> 6
> >> > >> constraints, right?
> >> > >>
> >> > >>  [,1]   [,2]   [,3]   [,4]   [,5]  [,6]
> >> > >> [1,] "A"  "A"  "A""B"   "B""C"
> >> > 

Re: [ClusterLabs] Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-17 Thread Nikhil Utane
Thanks Ken.
I will give it a shot.

http://oss.clusterlabs.org/pipermail/pacemaker/2011-August/011271.html
On this thread, if I interpret it correctly, his problem was solved when he
swapped the anti-colocation constraints

From (mapping to my example)
cu_2 with cu_4 (score:-INFINITY)
cu_3 with cu_4 (score:-INFINITY)
cu_2 with cu_3 (score:-INFINITY)

To
cu_2 with cu_4 (score:-INFINITY)
cu_4 with cu_3 (score:-INFINITY)
cu_3 with cu_2 (score:-INFINITY)

Do you think that would make any difference? The way you explained it,
sounds to me it might.
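If it does, the swap itself should just be a matter of removing and
re-adding two constraints (a sketch; the constraint IDs come from
"pcs constraint show --full" and ours will differ, and the exact pcs
syntax may vary by version):

$ pcs constraint show --full                     # note the colocation constraint IDs
$ pcs constraint remove <id-of-cu_3-with-cu_4>
$ pcs constraint remove <id-of-cu_2-with-cu_3>
$ pcs constraint colocation add cu_4 with cu_3 -INFINITY
$ pcs constraint colocation add cu_3 with cu_2 -INFINITY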

-Regards
Nikhil

On Mon, Oct 17, 2016 at 11:36 PM, Ken Gaillot  wrote:

> On 10/17/2016 09:55 AM, Nikhil Utane wrote:
> > I see these prints.
> >
> > pengine: info: rsc_merge_weights:cu_4: Rolling back scores from cu_3
> > pengine:debug: native_assign_node:Assigning Redun_CU4_Wb30 to cu_4
> > pengine: info: rsc_merge_weights:cu_3: Rolling back scores from cu_2
> > pengine:debug: native_assign_node:Assigning Redund_CU5_WB30 to cu_3
> >
> > Looks like rolling back the scores is causing the new decision to
> > relocate the resources.
> > Am I using the scores incorrectly?
>
> No, I think this is expected.
>
> Your anti-colocation constraints place cu_2 and cu_3 relative to cu_4,
> so that means the cluster will place cu_4 first if possible, before
> deciding where the others should go. Similarly, cu_2 has a constraint
> relative to cu_3, so cu_3 gets placed next, and cu_2 is the one left out.
>
> The anti-colocation scores of -INFINITY outweigh the stickiness of 100.
> I'm not sure whether setting stickiness to INFINITY would change
> anything; hopefully, it would stop cu_3 from moving, but cu_2 would
> still be stopped.
>
> I don't see a good way around this. The cluster has to place some
> resource first, in order to know not to place some other resource on the
> same node. I don't think there's a way to make them "equal", because
> then none of them could be placed to begin with -- unless you went with
> utilization attributes, as someone else suggested, with
> placement-strategy=balanced:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-
> single/Pacemaker_Explained/index.html#idm140521708557280
>
> >
> > [root@Redund_CU5_WB30 root]# pcs constraint
> > Location Constraints:
> >   Resource: cu_2
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> >   Resource: cu_3
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> >   Resource: cu_4
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> > Ordering Constraints:
> > Colocation Constraints:
> >   cu_2 with cu_4 (score:-INFINITY)
> >   cu_3 with cu_4 (score:-INFINITY)
> >   cu_2 with cu_3 (score:-INFINITY)
> >
> >
> > On Mon, Oct 17, 2016 at 8:16 PM, Nikhil Utane
> > <nikhil.subscri...@gmail.com> wrote:
> >
> > This is driving me insane.
> >
> > This is how the resources were started. Redund_CU1_WB30  was the DC
> > which I rebooted.
> >  cu_4(ocf::redundancy:RedundancyRA):Started Redund_CU1_WB30
> >  cu_2(ocf::redundancy:RedundancyRA):Started Redund_CU5_WB30
> >  cu_3(ocf::redundancy:RedundancyRA):Started Redun_CU4_Wb30
> >
> > Since the standby node was not UP. I was expecting resource cu_4 to
> > be waiting to be scheduled.
> > But then it re-arranged everything as below.
> >  cu_4(ocf::redundancy:RedundancyRA):Started Redun_CU4_Wb30
> >  cu_2(ocf::redundancy:RedundancyRA):Stopped
> >  cu_3(ocf::redundancy:RedundancyRA):Started Redund_CU5_WB30
> >
> > There is not much information available in the logs on new DC. It
> > just shows what it has decided to do but nothing to suggest why it
> > did it that way.
> >
> > notice: Start   cu_4(Redun_CU4_Wb30)
> > notice: Stop    cu_2(Redund_CU5_WB30)
> > notice: Movecu_3(Started Redun_CU4_Wb30 -> Redund_CU5_WB30)
> >
> > I have default stickiness set to 100 which is higher than any score
> > that I have configured.
> > I have migration_threshold set to 1. Should I bump that up instead?
> >
> > -Thanks
> > Nikhil
> >
> > On Sat,

Re: [ClusterLabs] Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-17 Thread Nikhil Utane
I see these prints.

pengine: info: rsc_merge_weights: cu_4: Rolling back scores from cu_3
pengine: debug: native_assign_node: Assigning Redun_CU4_Wb30 to cu_4
pengine: info: rsc_merge_weights: cu_3: Rolling back scores from cu_2
pengine: debug: native_assign_node: Assigning Redund_CU5_WB30 to cu_3

Looks like rolling back the scores is causing the new decision to relocate
the resources.
Am I using the scores incorrectly?

[root@Redund_CU5_WB30 root]# pcs constraint
Location Constraints:
  Resource: cu_2
Enabled on: Redun_CU4_Wb30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
  Resource: cu_3
Enabled on: Redun_CU4_Wb30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
  Resource: cu_4
Enabled on: Redun_CU4_Wb30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
Ordering Constraints:
Colocation Constraints:
  cu_2 with cu_4 (score:-INFINITY)
  cu_3 with cu_4 (score:-INFINITY)
  cu_2 with cu_3 (score:-INFINITY)


On Mon, Oct 17, 2016 at 8:16 PM, Nikhil Utane 
wrote:

> This is driving me insane.
>
> This is how the resources were started. Redund_CU1_WB30  was the DC which
> I rebooted.
>  cu_4 (ocf::redundancy:RedundancyRA): Started Redund_CU1_WB30
>  cu_2 (ocf::redundancy:RedundancyRA): Started Redund_CU5_WB30
>  cu_3 (ocf::redundancy:RedundancyRA): Started Redun_CU4_Wb30
>
> Since the standby node was not UP, I was expecting resource cu_4 to be
> waiting to be scheduled.
> But then it re-arranged everything as below.
>  cu_4 (ocf::redundancy:RedundancyRA): Started Redun_CU4_Wb30
>  cu_2 (ocf::redundancy:RedundancyRA): Stopped
>  cu_3 (ocf::redundancy:RedundancyRA): Started Redund_CU5_WB30
>
> There is not much information available in the logs on new DC. It just
> shows what it has decided to do but nothing to suggest why it did it that
> way.
>
> notice: Start   cu_4 (Redun_CU4_Wb30)
> notice: Stop    cu_2 (Redund_CU5_WB30)
> notice: Move    cu_3 (Started Redun_CU4_Wb30 -> Redund_CU5_WB30)
>
> I have default stickiness set to 100 which is higher than any score that I
> have configured.
> I have migration_threshold set to 1. Should I bump that up instead?
>
> -Thanks
> Nikhil
>
> On Sat, Oct 15, 2016 at 12:36 AM, Ken Gaillot  wrote:
>
>> On 10/14/2016 06:56 AM, Nikhil Utane wrote:
>> > Hi,
>> >
>> > Thank you for the responses so far.
>> > I added reverse colocation as well. However seeing some other issue in
>> > resource movement that I am analyzing.
>> >
> >> > Thinking further on this, why doesn't "a not with b" imply "b not with
> >> > a"?
>> > Coz wouldn't putting "b with a" violate "a not with b"?
>> >
>> > Can someone confirm that colocation is required to be configured both
>> ways?
>>
>> The anti-colocation should only be defined one-way. Otherwise, you get a
>> dependency loop (as seen in logs you showed elsewhere).
>>
>> The one-way constraint is enough to keep the resources apart. However,
>> the question is whether the cluster might move resources around
>> unnecessarily.
>>
>> For example, "A not with B" means that the cluster will place B first,
>> then place A somewhere else. So, if B's node fails, can the cluster
>> decide that A's node is now the best place for B, and move A to a free
>> node, rather than simply start B on the free node?
>>
>> The cluster does take dependencies into account when placing a resource,
>> so I would hope that wouldn't happen. But I'm not sure. Having some
>> stickiness might help, so that A has some preference against moving.
>>
>> > -Thanks
>> > Nikhil
>> >
>> > /
>> > /
>> >
>> > On Fri, Oct 14, 2016 at 1:09 PM, Vladislav Bogdanov
> >> > <bub...@hoster-ok.com> wrote:
>> >
> >> > On October 14, 2016 10:13:17 AM GMT+03:00, Ulrich Windl
> >> > <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >> > >>>> Nikhil Utane <nikhil.subscri...@gmail.com> schrieb am 13.10.2016 um
> >> > >16:43 in Nachricht
> >> > ><CAGNWmJUbPucnBGXroHkHSbQ0LXovwsLFPkUPg1R8gJqRFqM9Dg@mail.gmail.com>:
>> > >> Ulrich,
>> > >>
>> > >> I have 4 resources only (n

Re: [ClusterLabs] Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-17 Thread Nikhil Utane
This is driving me insane.

This is how the resources were started. Redund_CU1_WB30  was the DC which I
rebooted.
 cu_4 (ocf::redundancy:RedundancyRA): Started Redund_CU1_WB30
 cu_2 (ocf::redundancy:RedundancyRA): Started Redund_CU5_WB30
 cu_3 (ocf::redundancy:RedundancyRA): Started Redun_CU4_Wb30

Since the standby node was not UP, I was expecting resource cu_4 to be
waiting to be scheduled.
But then it re-arranged everything as below.
 cu_4 (ocf::redundancy:RedundancyRA): Started Redun_CU4_Wb30
 cu_2 (ocf::redundancy:RedundancyRA): Stopped
 cu_3 (ocf::redundancy:RedundancyRA): Started Redund_CU5_WB30

There is not much information available in the logs on new DC. It just
shows what it has decided to do but nothing to suggest why it did it that
way.

notice: Start   cu_4 (Redun_CU4_Wb30)
notice: Stop    cu_2 (Redund_CU5_WB30)
notice: Move    cu_3 (Started Redun_CU4_Wb30 -> Redund_CU5_WB30)

I have default stickiness set to 100 which is higher than any score that I
have configured.
I have migration_threshold set to 1. Should I bump that up instead?
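
If bumping stickiness rather than migration-threshold is the way to go, I
assume it would just be something like this (a sketch only, not something I
have tried yet):

  pcs resource defaults resource-stickiness=INFINITY
  # or per resource:
  pcs resource meta cu_3 resource-stickiness=INFINITY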

-Thanks
Nikhil

On Sat, Oct 15, 2016 at 12:36 AM, Ken Gaillot  wrote:

> On 10/14/2016 06:56 AM, Nikhil Utane wrote:
> > Hi,
> >
> > Thank you for the responses so far.
> > I added reverse colocation as well. However seeing some other issue in
> > resource movement that I am analyzing.
> >
> > Thinking further on this, why doesn't "a not with b" imply "b not with
> > a"?
> > Coz wouldn't putting "b with a" violate "a not with b"?
> >
> > Can someone confirm that colocation is required to be configured both
> ways?
>
> The anti-colocation should only be defined one-way. Otherwise, you get a
> dependency loop (as seen in logs you showed elsewhere).
>
> The one-way constraint is enough to keep the resources apart. However,
> the question is whether the cluster might move resources around
> unnecessarily.
>
> For example, "A not with B" means that the cluster will place B first,
> then place A somewhere else. So, if B's node fails, can the cluster
> decide that A's node is now the best place for B, and move A to a free
> node, rather than simply start B on the free node?
>
> The cluster does take dependencies into account when placing a resource,
> so I would hope that wouldn't happen. But I'm not sure. Having some
> stickiness might help, so that A has some preference against moving.
>
> > -Thanks
> > Nikhil
> >
> >
> > On Fri, Oct 14, 2016 at 1:09 PM, Vladislav Bogdanov
> > <bub...@hoster-ok.com> wrote:
> >
> > On October 14, 2016 10:13:17 AM GMT+03:00, Ulrich Windl
> > <ulrich.wi...@rz.uni-regensburg.de> wrote:
> > >>>> Nikhil Utane <nikhil.subscri...@gmail.com> schrieb am 13.10.2016 um
> > >16:43 in Nachricht
> > ><CAGNWmJUbPucnBGXroHkHSbQ0LXovwsLFPkUPg1R8gJqRFqM9Dg@mail.gmail.com>:
> > >> Ulrich,
> > >>
> > >> I have 4 resources only (not 5, nodes are 5). So then I only need
> 6
> > >> constraints, right?
> > >>
> > >>  [,1]   [,2]   [,3]   [,4]   [,5]  [,6]
> > >> [1,] "A"  "A"  "A""B"   "B""C"
> > >> [2,] "B"  "C"  "D"   "C"  "D""D"
> > >
> > >Sorry for my confusion. As Andrei Borzenkov said in
> > ><CAA91j0W%2BepAHFLg9u6VX_X8LgFkf9Rp55g3nocY4oZNA9BbZ%2...@mail.gmail.com>
> > >you probably have to add (A, B) _and_ (B, A)! Thinking about it, I
> > >wonder whether an easier solution would be using "utilization": If
> > >every node has one token to give, and every resource needs one token, no
> > >two resources will run on one node. Sounds like an easier solution to
> > >me.
> > >
> > >Regards,
> > >Ulrich
> > >
> > >
> > >>
> > >> I understand that if I configure constraint of R1 with R2 score as
> > >> -infinity, then the same applies for R2 with R1 score as -infinity
> > >(don't
> > >> have to configure it explicitly).
> > >> I am not having a problem of multiple resources getting scheduled on the
> > >> same node. Rather, one working resource is unnecessarily getting
> > >

Re: [ClusterLabs] Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-14 Thread Nikhil Utane
u_4
*Oct 14 16:30:53 [7366] Redund_CU1_WB30pengine: info: RecurringOp:
Start recurring monitor (30s) for cu_2 on Redund_CU2_WB30*
Oct 14 16:30:53 [7366] Redund_CU1_WB30pengine: info: LogActions: Leave
  cu_5 (Started Redund_CU1_WB30)
Oct 14 16:30:53 [7366] Redund_CU1_WB30pengine: info: LogActions: Leave
  cu_4 (Started Redund_CU2_WB30)
Oct 14 16:30:53 [7366] Redund_CU1_WB30pengine: info: LogActions: Leave
  cu_3 (Started Redund_CU3_WB30)
*Oct 14 16:30:53 [7366] Redund_CU1_WB30pengine:   notice: LogActions:
Movecu_2 (Started Redund_CU5_WB30 -> Redund_CU2_WB30)*
Oct 14 16:30:53 [7362] Redund_CU1_WB30cib: info:
cib_file_write_with_digest: Wrote version 0.344.0 of the CIB to disk
(digest: c0090fdd0254bfc0cd81d0bbc8bc0a72)
Oct 14 16:30:53 [7362] Redund_CU1_WB30cib: info:
cib_file_write_with_digest: Reading cluster configuration file
/dev/shm/lib/pacemaker/cib/cib.znavqE (digest:
/dev/shm/lib/pacemaker/cib/cib.eXT5e7)
Oct 14 16:30:53 [7367] Redund_CU1_WB30   crmd: info:
do_state_transition: State transition S_POLICY_ENGINE ->
S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
origin=handle_response ]
Oct 14 16:30:53 [7367] Redund_CU1_WB30   crmd: info:
do_te_invoke: Processing
graph 303 (ref=pe_calc-dc-1476462653-377) derived from
/dev/shm/lib/pacemaker/pengine/pe-input-303.bz2
Oct 14 16:30:53 [7367] Redund_CU1_WB30   crmd:   notice:
te_rsc_command: Initiating
action 12: stop cu_2_stop_0 on Redund_CU5_WB30
Oct 14 16:30:53 [7366] Redund_CU1_WB30pengine:   notice:
process_pe_message: Calculated Transition 303:
/dev/shm/lib/pacemaker/pengine/pe-input-303.bz2
Oct 14 16:30:53 [7362] Redund_CU1_WB30cib: info:
cib_perform_op: Diff:
--- 0.344.0 2
Oct 14 16:30:53 [7362] Redund_CU1_WB30cib: info:
cib_perform_op: Diff:
+++ 0.344.1 (null)
Oct 14 16:30:53 [7362] Redund_CU1_WB30cib: info: cib_perform_op: +
 /cib:  @num_updates=1
Oct 14 16:30:53 [7362] Redund_CU1_WB30cib: info: cib_perform_op: +
 
/cib/status/node_state[@id='181462533']/lrm[@id='181462533']/lrm_resources/lrm_resource[@id='cu_2']/lrm_rsc_op[@id='cu_2_last_0']:
 @operation_key=cu_2_stop_0, @operation=stop,
@transition-key=12:303:0:07413883-c6c4-41b8-a68e-8ba4832aa4f8,
@transition-magic=0:0;12:303:0:07413883-c6c4-41b8-a68e-8ba4832aa4f8,
@call-id=21, @last-run=1476462653, @last-rc-change=1476462653,
@exec-time=237
*Oct 14 16:30:53 [7367] Redund_CU1_WB30   crmd: info:
match_graph_event: Action cu_2_stop_0 (12) confirmed on Redund_CU5_WB30
(rc=0)*
*Oct 14 16:30:53 [7367] Redund_CU1_WB30   crmd:   notice:
te_rsc_command: Initiating action 13: start cu_2_start_0 on Redund_CU2_WB30*


[root@Redund_CU2_WB30 root]# pcs constraint
Location Constraints:
  Resource: cu_2
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU2_WB30 (score:0)
  Resource: cu_3
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU2_WB30 (score:0)
  Resource: cu_4
Enabled on: Redund_CU2_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
  Resource: cu_5
Enabled on: Redund_CU1_WB30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU2_WB30 (score:0)
Ordering Constraints:
Colocation Constraints:
  cu_2 with cu_3 (score:-INFINITY)
  cu_3 with cu_2 (score:-INFINITY)
  cu_2 with cu_5 (score:-INFINITY)
  cu_5 with cu_2 (score:-INFINITY)
  cu_3 with cu_5 (score:-INFINITY)
  cu_5 with cu_3 (score:-INFINITY)
  cu_4 with cu_3 (score:-INFINITY)
  cu_3 with cu_4 (score:-INFINITY)
  cu_4 with cu_2 (score:-INFINITY)
  cu_2 with cu_4 (score:-INFINITY)
  cu_4 with cu_5 (score:-INFINITY)
  cu_5 with cu_4 (score:-INFINITY)
Ticket Constraints:
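
If the reverse duplicates above ever need to be dropped, my understanding is
that it is done by constraint id, roughly like this (the id shown is only
illustrative; the real ids come from the --full listing):

  pcs constraint --full
  pcs constraint remove colocation-cu_3-cu_2--INFINITY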

-Thanks
Nikhil



On Fri, Oct 14, 2016 at 5:26 PM, Nikhil Utane 
wrote:

> Hi,
>
> Thank you for the responses so far.
> I added reverse colocation as well. However seeing some other issue in
> resource movement that I am analyzing.
>
> Thinking further on this, why doesn't "*a not with b" imply "b not with
> a"?*
> Coz wouldn't putting "b with a" violate "a not with b"?
>
> Can someone confirm that colocation is required to be configured both ways?
>
> -Thanks
> Nikhil
>
>
>
> On Fri, Oct 14, 2016 at 1:09 PM, Vladislav Bogdanov 
> wrote:
>
>> On October 14, 2016 10:13:17 AM GMT+03:00, Ulrich Windl <
>> ulrich.wi...@rz.uni-regensburg.de> wrote:
>> >>>> Nikhil Utane  schrieb am 13.10.2016 um
>> >16:43 in
>> >Nachricht
>> >:
>> >> Ulric

Re: [ClusterLabs] Antw: Re: Antw: Unexpected Resource movement after failover

2016-10-14 Thread Nikhil Utane
Hi,

Thank you for the responses so far.
I added reverse colocation as well. However seeing some other issue in
resource movement that I am analyzing.

Thinking further on this, why doesn't "*a not with b" imply "b not with a"?*
Coz wouldn't putting "b with a" violate "a not with b"?

Can someone confirm that colocation is required to be configured both ways?
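
Put differently: for my four resources, would defining each pair just once,
one-way, be sufficient? Something like the following (assuming the pcs 0.9
positional-score syntax; the exact syntax may differ between pcs versions):

  pcs constraint colocation add cu_3 with cu_2 -INFINITY
  pcs constraint colocation add cu_4 with cu_2 -INFINITY
  pcs constraint colocation add cu_4 with cu_3 -INFINITY
  pcs constraint colocation add cu_5 with cu_2 -INFINITY
  pcs constraint colocation add cu_5 with cu_3 -INFINITY
  pcs constraint colocation add cu_5 with cu_4 -INFINITY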

-Thanks
Nikhil



On Fri, Oct 14, 2016 at 1:09 PM, Vladislav Bogdanov 
wrote:

> On October 14, 2016 10:13:17 AM GMT+03:00, Ulrich Windl <
> ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>>> Nikhil Utane  schrieb am 13.10.2016 um
> >16:43 in
> >Nachricht
> >:
> >> Ulrich,
> >>
> >> I have 4 resources only (not 5, nodes are 5). So then I only need 6
> >> constraints, right?
> >>
> >>  [,1]   [,2]   [,3]   [,4]   [,5]  [,6]
> >> [1,] "A"  "A"  "A""B"   "B""C"
> >> [2,] "B"  "C"  "D"   "C"  "D""D"
> >
> >Sorry for my confusion. As Andrei Borzenkov said in
> >
> >you probably have to add (A, B) _and_ (B, A)! Thinking about it, I
> >wonder whether an easier solution would be using "utilization": If
> >every node has one token to give, and every resource needs one token, no
> >two resources will run on one node. Sounds like an easier solution to
> >me.
> >
> >Regards,
> >Ulrich
> >
> >
> >>
> >> I understand that if I configure constraint of R1 with R2 score as
> >> -infinity, then the same applies for R2 with R1 score as -infinity
> >(don't
> >> have to configure it explicitly).
> >> I am not having a problem of multiple resources getting scheduled on
> >the
> >> same node. Rather, one working resource is unnecessarily getting
> >relocated.
> >>
> >> -Thanks
> >> Nikhil
> >>
> >>
> >> On Thu, Oct 13, 2016 at 7:45 PM, Ulrich Windl <
> >> ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>
> >>> Hi!
> >>>
> >>> Don't you need 10 constraints, excluding every possible pair of your
> >5
> >>> resources (named A-E here), like in this table (produced with R):
> >>>
> >>>  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >>> [1,] "A"  "A"  "A"  "A"  "B"  "B"  "B"  "C"  "C"  "D"
> >>> [2,] "B"  "C"  "D"  "E"  "C"  "D"  "E"  "D"  "E"  "E"
> >>>
> >>> Ulrich
> >>>
> >>> >>> Nikhil Utane  schrieb am 13.10.2016
> >um
> >>> 15:59 in
> >>> Nachricht
> >>>
> >:
> >>> > Hi,
> >>> >
> >>> > I have 5 nodes and 4 resources configured.
> >>> > I have configured constraint such that no two resources can be
> >>> co-located.
> >>> > I brought down a node (which happened to be DC). I was expecting
> >the
> >>> > resource on the failed node would be migrated to the 5th waiting
> >node
> >>> (that
> >>> > is not running any resource).
> >>> > However what happened was the failed node resource was started on
> >another
> >>> > active node (after stopping it's existing resource) and that
> >node's
> >>> > resource was moved to the waiting node.
> >>> >
> >>> > What could I be doing wrong?
> >>> >
> >>> >  >>> > name="have-watchdog"/>
> >>> >  >value="1.1.14-5a6cdd1"
> >>> > name="dc-version"/>
> >>> >  >>> value="corosync"
> >>> > name="cluster-infrastructure"/>
> >>> >  >>> > name="stonith-enabled"/>
> >>> >  >>> > name="no-quorum-policy"/>
> >>> >  >value="240"
> >>> > name="default-action-timeout"/>
> >>> >  >>> > name="symmetric-cluster"/>
> >>> >
> >>> > # pcs constraint
> >>> > Location Constraints:
> >>> >   Resource: cu_2
> >>> > Enabled on: Redun_CU4_Wb30 (

Re: [ClusterLabs] Unexpected Resource movement after failover

2016-10-13 Thread Nikhil Utane
Andrei,

*"It would help if you told which node and which resources, so
your configuration could be interpreted in context. "*

Any resource can run on any node as long as it is not running any other
resource.

*"so "a not with b" does not imply "b not with a". So first pacemaker
decided where to place "b" and then it had to move "a" because it cannot
colocate with "b"."*

Hmm. I used to think "a not with b" means "b not with a" as well. Looks
like that's not the case. That should be it then.

Thanks for the quick answer, guys.

-Nikhil



On Thu, Oct 13, 2016 at 7:59 PM, Andrei Borzenkov 
wrote:

> On Thu, Oct 13, 2016 at 4:59 PM, Nikhil Utane
>  wrote:
> > Hi,
> >
> > I have 5 nodes and 4 resources configured.
> > I have configured constraint such that no two resources can be
> co-located.
> > I brought down a node (which happened to be DC). I was expecting the
> > resource on the failed node would be migrated to the 5th waiting node
> (that
> > is not running any resource).
> > However what happened was the failed node resource was started on another
> > active node (after stopping it's existing resource) and that node's
> resource
> > was moved to the waiting node.
> >
> > What could I be doing wrong?
> >
>
> It would help if you told which node and which resources, so your
> configuration could be interpreted in context. But I guess Ulrich is
> correct - your constraints are asymmetrical (I assume, I am not
> familiar with PCS), so "a not with b" does not imply "b not with a".
> So first pacemaker decided where to place "b" and then it had to move
> "a" because it cannot colocate with "b".
>
> >  > name="have-watchdog"/>
> >  > name="dc-version"/>
> >  value="corosync"
> > name="cluster-infrastructure"/>
> >  > name="stonith-enabled"/>
> >  > name="no-quorum-policy"/>
> >  > name="default-action-timeout"/>
> >  > name="symmetric-cluster"/>
> >
> > # pcs constraint
> > Location Constraints:
> >   Resource: cu_2
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU2_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> >   Resource: cu_3
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU2_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> >   Resource: cu_4
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU2_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> >   Resource: cu_5
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU2_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> > Ordering Constraints:
> > Colocation Constraints:
> >   cu_3 with cu_2 (score:-INFINITY)
> >   cu_4 with cu_2 (score:-INFINITY)
> >   cu_4 with cu_3 (score:-INFINITY)
> >   cu_5 with cu_2 (score:-INFINITY)
> >   cu_5 with cu_3 (score:-INFINITY)
> >   cu_5 with cu_4 (score:-INFINITY)
> >
> > -Thanks
> > Nikhil
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Unexpected Resource movement after failover

2016-10-13 Thread Nikhil Utane
Ulrich,

I have 4 resources only (not 5, nodes are 5). So then I only need 6
constraints, right?

 [,1]   [,2]   [,3]   [,4]   [,5]  [,6]
[1,] "A"  "A"  "A""B"   "B""C"
[2,] "B"  "C"  "D"   "C"  "D""D"

I understand that if I configure constraint of R1 with R2 score as
-infinity, then the same applies for R2 with R1 score as -infinity (don't
have to configure it explicitly).
I am not having a problem of multiple resources getting scheduled on the
same node. Rather, one working resource is unnecessarily getting relocated.

-Thanks
Nikhil


On Thu, Oct 13, 2016 at 7:45 PM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> Hi!
>
> Don't you need 10 constraints, excluding every possible pair of your 5
> resources (named A-E here), like in this table (produced with R):
>
>  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,] "A"  "A"  "A"  "A"  "B"  "B"  "B"  "C"  "C"  "D"
> [2,] "B"  "C"  "D"  "E"  "C"  "D"  "E"  "D"  "E"  "E"
>
> Ulrich
>
> >>> Nikhil Utane  schrieb am 13.10.2016 um
> 15:59 in
> Nachricht
> :
> > Hi,
> >
> > I have 5 nodes and 4 resources configured.
> > I have configured constraint such that no two resources can be
> co-located.
> > I brought down a node (which happened to be DC). I was expecting the
> > resource on the failed node would be migrated to the 5th waiting node
> (that
> > is not running any resource).
> > However what happened was the failed node resource was started on another
> > active node (after stopping it's existing resource) and that node's
> > resource was moved to the waiting node.
> >
> > What could I be doing wrong?
> >
> >  > name="have-watchdog"/>
> >  > name="dc-version"/>
> >  value="corosync"
> > name="cluster-infrastructure"/>
> >  > name="stonith-enabled"/>
> >  > name="no-quorum-policy"/>
> >  > name="default-action-timeout"/>
> >  > name="symmetric-cluster"/>
> >
> > # pcs constraint
> > Location Constraints:
> >   Resource: cu_2
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU2_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> >   Resource: cu_3
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU2_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> >   Resource: cu_4
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU2_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> >   Resource: cu_5
> > Enabled on: Redun_CU4_Wb30 (score:0)
> > Enabled on: Redund_CU2_WB30 (score:0)
> > Enabled on: Redund_CU3_WB30 (score:0)
> > Enabled on: Redund_CU5_WB30 (score:0)
> > Enabled on: Redund_CU1_WB30 (score:0)
> > Ordering Constraints:
> > Colocation Constraints:
> >   cu_3 with cu_2 (score:-INFINITY)
> >   cu_4 with cu_2 (score:-INFINITY)
> >   cu_4 with cu_3 (score:-INFINITY)
> >   cu_5 with cu_2 (score:-INFINITY)
> >   cu_5 with cu_3 (score:-INFINITY)
> >   cu_5 with cu_4 (score:-INFINITY)
> >
> > -Thanks
> > Nikhil
>
>
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Unexpected Resource movement after failover

2016-10-13 Thread Nikhil Utane
Additional info,




-Nikhil

On Thu, Oct 13, 2016 at 7:29 PM, Nikhil Utane 
wrote:

> Hi,
>
> I have 5 nodes and 4 resources configured.
> I have configured constraint such that no two resources can be co-located.
> I brought down a node (which happened to be DC). I was expecting the
> resource on the failed node would be migrated to the 5th waiting node (that
> is not running any resource).
> However what happened was the failed node resource was started on another
> active node (after stopping it's existing resource) and that node's
> resource was moved to the waiting node.
>
> What could I be doing wrong?
>
>  name="have-watchdog"/>
>  name="dc-version"/>
>  value="corosync" name="cluster-infrastructure"/>
>  name="stonith-enabled"/>
>  name="no-quorum-policy"/>
>  name="default-action-timeout"/>
>  name="symmetric-cluster"/>
>
> # pcs constraint
> Location Constraints:
>   Resource: cu_2
> Enabled on: Redun_CU4_Wb30 (score:0)
> Enabled on: Redund_CU2_WB30 (score:0)
> Enabled on: Redund_CU3_WB30 (score:0)
> Enabled on: Redund_CU5_WB30 (score:0)
> Enabled on: Redund_CU1_WB30 (score:0)
>   Resource: cu_3
> Enabled on: Redun_CU4_Wb30 (score:0)
> Enabled on: Redund_CU2_WB30 (score:0)
> Enabled on: Redund_CU3_WB30 (score:0)
> Enabled on: Redund_CU5_WB30 (score:0)
> Enabled on: Redund_CU1_WB30 (score:0)
>   Resource: cu_4
> Enabled on: Redun_CU4_Wb30 (score:0)
> Enabled on: Redund_CU2_WB30 (score:0)
> Enabled on: Redund_CU3_WB30 (score:0)
> Enabled on: Redund_CU5_WB30 (score:0)
> Enabled on: Redund_CU1_WB30 (score:0)
>   Resource: cu_5
> Enabled on: Redun_CU4_Wb30 (score:0)
> Enabled on: Redund_CU2_WB30 (score:0)
> Enabled on: Redund_CU3_WB30 (score:0)
> Enabled on: Redund_CU5_WB30 (score:0)
> Enabled on: Redund_CU1_WB30 (score:0)
> Ordering Constraints:
> Colocation Constraints:
>   cu_3 with cu_2 (score:-INFINITY)
>   cu_4 with cu_2 (score:-INFINITY)
>   cu_4 with cu_3 (score:-INFINITY)
>   cu_5 with cu_2 (score:-INFINITY)
>   cu_5 with cu_3 (score:-INFINITY)
>   cu_5 with cu_4 (score:-INFINITY)
>
> -Thanks
> Nikhil
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Unexpected Resource movement after failover

2016-10-13 Thread Nikhil Utane
Hi,

I have 5 nodes and 4 resources configured.
I have configured constraint such that no two resources can be co-located.
I brought down a node (which happened to be DC). I was expecting the
resource on the failed node would be migrated to the 5th waiting node (that
is not running any resource).
However what happened was the failed node resource was started on another
active node (after stopping it's existing resource) and that node's
resource was moved to the waiting node.

What could I be doing wrong?









# pcs constraint
Location Constraints:
  Resource: cu_2
Enabled on: Redun_CU4_Wb30 (score:0)
Enabled on: Redund_CU2_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
  Resource: cu_3
Enabled on: Redun_CU4_Wb30 (score:0)
Enabled on: Redund_CU2_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
  Resource: cu_4
Enabled on: Redun_CU4_Wb30 (score:0)
Enabled on: Redund_CU2_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
  Resource: cu_5
Enabled on: Redun_CU4_Wb30 (score:0)
Enabled on: Redund_CU2_WB30 (score:0)
Enabled on: Redund_CU3_WB30 (score:0)
Enabled on: Redund_CU5_WB30 (score:0)
Enabled on: Redund_CU1_WB30 (score:0)
Ordering Constraints:
Colocation Constraints:
  cu_3 with cu_2 (score:-INFINITY)
  cu_4 with cu_2 (score:-INFINITY)
  cu_4 with cu_3 (score:-INFINITY)
  cu_5 with cu_2 (score:-INFINITY)
  cu_5 with cu_3 (score:-INFINITY)
  cu_5 with cu_4 (score:-INFINITY)

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Limiting number of nodes that can join into a cluster

2016-06-28 Thread Nikhil Utane
Hmm. That is also a possibility. Thanks Chrissie.

On Tue, Jun 28, 2016 at 6:13 PM, Christine Caulfield 
wrote:

> On 28/06/16 13:27, Nikhil Utane wrote:
> > Hi Klaus,
> >
> > I am using multicast to avoid having to configure the host names.
> >
>
> To be honest, if you're serious about keeping the number of nodes down,
> then careful management is the way to do it, looking for a technical fix
> is not the answer. Yes, you could reduce the #define in corosync, but
> you risk hitting previously unknown bugs that might be caused by that
> number being hit for real.
>
>
> Chrissie
>
> > -Thanks
> > Nikhil
> >
> > On Tue, Jun 28, 2016 at 5:53 PM, Klaus Wenninger
> > <kwenn...@redhat.com> wrote:
> >
> > On 06/28/2016 01:18 PM, Nikhil Utane wrote:
> > > Hi,
> > >
> > > I want to limit the number of nodes that can form a cluster to
> single
> > > digit (say 6). I can do it using application-level logic but would
> > > like to know if there is any option in Corosync that would do it
> for
> > > me. (Didn't find one).
> > Going unicast you have full control over who is in your cluster and
> who
> > is not...
> > Not the direct answer to your question but maybe it solves your
> problem
> > though.
> > > If there is some #define, can I change it without any side-effect?
> > >
> > > -Thanks
> > > Nikhil
> > >
> > >
> > > ___
> > > Users mailing list: Users@clusterlabs.org
> > <mailto:Users@clusterlabs.org>
> > > http://clusterlabs.org/mailman/listinfo/users
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> >
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Limiting number of nodes that can join into a cluster

2016-06-28 Thread Nikhil Utane
Hi Klaus,

I am using multicast to avoid having to configure the host names.
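
For reference, my understanding of the unicast alternative is a corosync.conf
roughly like the one below, with every member listed explicitly (addresses
here are placeholders), which is exactly the per-node configuration I was
trying to avoid:

  totem {
    version: 2
    transport: udpu
  }

  nodelist {
    node {
      ring0_addr: 10.0.0.1
      nodeid: 1
    }
    node {
      ring0_addr: 10.0.0.2
      nodeid: 2
    }
  }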

-Thanks
Nikhil

On Tue, Jun 28, 2016 at 5:53 PM, Klaus Wenninger 
wrote:

> On 06/28/2016 01:18 PM, Nikhil Utane wrote:
> > Hi,
> >
> > I want to limit the number of nodes that can form a cluster to single
> > digit (say 6). I can do it using application-level logic but would
> > like to know if there is any option in Corosync that would do it for
> > me. (Didn't find one).
> Going unicast you have full control over who is in your cluster and who
> is not...
> Not the direct answer to your question but maybe it solves your problem
> though.
> > If there is some #define, can I change it without any side-effect?
> >
> > -Thanks
> > Nikhil
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Limiting number of nodes that can join into a cluster

2016-06-28 Thread Nikhil Utane
Hi,

I want to limit the number of nodes that can form a cluster to a single digit
(say 6). I can do it using application-level logic but would like to know
if there is any option in Corosync that would do it for me. (Didn't find
one).
If there is some #define, can I change it without any side-effect?

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Recovering after split-brain

2016-06-21 Thread Nikhil Utane
Hmm. I will then work towards bringing this in. Thanks for your input.

On Wed, Jun 22, 2016 at 10:44 AM, Digimer  wrote:

> On 22/06/16 01:07 AM, Nikhil Utane wrote:
> > I don't get it.  Pacemaker + Corosync is providing me so much of
> > functionality.
> > For e.g. if we leave out the condition of split-brain for a while, then
> > it provides:
> > 1) Discovery and cluster formation
> > 2) Synchronization of data
> > 3) Heartbeat mechanism
> > 4) Swift failover of the resource
> > 5) Guarantee that one resource will be started on only 1 node
> >
> > So in case of normal fail-over we need the basic functionality of
> > resource being migrated to a standby node.
> > And it is giving me all that.
> > So I don't agree that it needs to be as black and white as you say. Our
> > solution has different requirements than a typical HA solution. But that
> > is only now. In the future we might have to implement all the things. So
> > in that sense Pacemaker gives us a good framework that we can extend.
> >
> > BTW, we are not even using a virtual IP resource which again I believe
> > is something that everyone employs.
> > Because of the nature of the service a small glitch is going to happen.
> > Using virtual IPs is not giving any real benefit for us.
> > And with regard to the question, why even have a standby and let it be
> > active all the time, two-node cluster is one of the possible
> > configuration, but main requirement is to support N + 1. So standby node
> > doesn't know which active it has to take over until a failover occurs.
> >
> > Your comments however has made me re-consider using fencing. It was not
> > that we didn't want to do it.
> > Just that I felt it may not be needed. So I'll definitely explore this
> > further.
>
> It is needed, and it is that black and white. Ask yourself, for your
> particular installation; Can I run X in two places at the same time
> without coordination?
>
> If the answer is "yes", then just do that and be done with it.
>
> If the answer is "no", then you need fencing to allow pacemaker to know
> the state of all nodes (otherwise, the ability to coordinate is lost).
>
> I've never once seen a valid HA setup where fencing was not needed. I
> don't claim to be the best by any means, but I've been around long
> enough to say this with some confidence.
>
> digimer
>
> > Thanks everyone for the comments.
> >
> > -Regards
> > Nikhil
> >
> > On Tue, Jun 21, 2016 at 10:17 PM, Digimer <li...@alteeve.ca> wrote:
> >
> > On 21/06/16 10:57 AM, Dmitri Maziuk wrote:
> > > On 2016-06-20 17:19, Digimer wrote:
> > >
> > >> Nikhil indicated that they could switch where traffic went
> up-stream
> > >> without issue, if I understood properly.
> > >
> > > They have some interesting setup, but that notwithstanding: if
> split
> > > brain happens some clients will connect to "old master" and some:
> to
> > > "new master", dep. on arp update. If there's a shared resource
> > > unavailable on one node, clients going there will error out. The
> other
> > > ones will not. It will work for some clients.
> > >
> > > Cf. both nodes going into stonith deathmatch and killing each
> other: the
> > > service now is not available for all clients. What I don't get is
> the
> > > blanket assertion that this "more highly" available that option #1.
> > >
> > > Dimitri
> >
> > As I've explained many times (here and on IRC);
> >
> > If you don't need to coordinate services/access, you don't need HA.
> >
> > If you do need to coordinate services/access, you need fencing.
> >
> > So if Nikhil really believes s/he doesn't need fencing and that
> > split-brains are OK, then drop HA. If that is not the case, then s/he
> > needs to implement fencing in pacemaker. It's pretty much that
> simple.
> >
> > --
> > Digimer
> > Papers and Projects: https://alteeve.ca/w/
> > What if the cure for cancer is trapped in the mind of a person
> without
> > access to education?
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Projec

Re: [ClusterLabs] Recovering after split-brain

2016-06-21 Thread Nikhil Utane
We are not using virtual IP. There is a separate discovery mechanism
between the server and client. The client will reach out to the new server only
if it is incommunicado with the old one.

On Tue, Jun 21, 2016 at 8:27 PM, Dmitri Maziuk 
wrote:

> On 2016-06-20 17:19, Digimer wrote:
>
> Nikhil indicated that they could switch where traffic went up-stream
>> without issue, if I understood properly.
>>
>
> They have some interesting setup, but that notwithstanding: if split brain
> happens some clients will connect to "old master" and some: to "new
> master", dep. on arp update. If there's a shared resource unavailable on
> one node, clients going there will error out. The other ones will not. It
> will work for some clients.
>
> Cf. both nodes going into stonith deathmatch and killing each other: the
> service now is not available for all clients. What I don't get is the
> blanket assertion that this "more highly" available that option #1.
>
> Dimitri
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Recovering after split-brain

2016-06-21 Thread Nikhil Utane
I don't get it.  Pacemaker + Corosync is providing me so much
functionality.
For e.g. if we leave out the condition of split-brain for a while, then it
provides:
1) Discovery and cluster formation
2) Synchronization of data
3) Heartbeat mechanism
4) Swift failover of the resource
5) Guarantee that one resource will be started on only 1 node

So in case of normal fail-over we need the basic functionality of resource
being migrated to a standby node.
And it is giving me all that.
So I don't agree that it needs to be as black and white as you say. Our
solution has different requirements than a typical HA solution. But that is
only now. In the future we might have to implement all the things. So in
that sense Pacemaker gives us a good framework that we can extend.

BTW, we are not even using a virtual IP resource which again I believe is
something that everyone employs.
Because of the nature of the service a small glitch is going to happen.
Using virtual IPs is not giving any real benefit for us.
And with regard to the question, why even have a standby and let it be
active all the time: a two-node cluster is one of the possible configurations,
but the main requirement is to support N + 1. So the standby node doesn't know
which active it has to take over until a failover occurs.

Your comments however have made me re-consider using fencing. It was not
that we didn't want to do it.
Just that I felt it may not be needed. So I'll definitely explore this
further.

Thanks everyone for the comments.

-Regards
Nikhil

On Tue, Jun 21, 2016 at 10:17 PM, Digimer  wrote:

> On 21/06/16 10:57 AM, Dmitri Maziuk wrote:
> > On 2016-06-20 17:19, Digimer wrote:
> >
> >> Nikhil indicated that they could switch where traffic went up-stream
> >> without issue, if I understood properly.
> >
> > They have some interesting setup, but that notwithstanding: if split
> > brain happens some clients will connect to "old master" and some: to
> > "new master", dep. on arp update. If there's a shared resource
> > unavailable on one node, clients going there will error out. The other
> > ones will not. It will work for some clients.
> >
> > Cf. both nodes going into stonith deathmatch and killing each other: the
> > service now is not available for all clients. What I don't get is the
> > blanket assertion that this "more highly" available that option #1.
> >
> > Dimitri
>
> As I've explained many times (here and on IRC);
>
> If you don't need to coordinate services/access, you don't need HA.
>
> If you do need to coordinate services/access, you need fencing.
>
> So if Nikhil really believes s/he doesn't need fencing and that
> split-brains are OK, then drop HA. If that is not the case, then s/he
> needs to implement fencing in pacemaker. It's pretty much that simple.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Recovering after split-brain

2016-06-20 Thread Nikhil Utane
Let me give the full picture about our solution. It will then make it easy
to have the discussion.

We are looking at providing N + 1 Redundancy to our application servers,
i.e. 1 standby for up to N active (currently N<=5). Each server will have
some unique configuration. The standby will store the configuration of all
the active servers such that whichever server goes down, the standby loads
that particular configuration and becomes active. The server that went down
will now become standby.
We have bundled all the configuration that every server has into a resource
such that during failover the resource is moved to the newly active server,
and that way it takes up the personality of the server that went down. To
put it differently, every active server has a 'unique' resource that is
started by Pacemaker whereas standby has none.

Our servers do not write anything to an external database, all the writing
is done to the CIB file under the resource that it is currently managing.
We also have some clients that connect to the active servers (1 client can
connect to only 1 server, 1 server can have multiple clients) and provide
service to end-users. Now the reason I say that split-brain is not an issue
for us, is coz the clients can only connect to 1 of the active servers at
any given time (we have to handle the case that all clients move together
and do not get distributed). So even if two servers become active with same
personality, the clients can only connect to 1 of them. (Initial plan was
to go configure quorum but later I was told that service availability is of
utmost importance and since impact of split-brain is limited, we are
thinking of doing away with it).

Now the concern I have is, once the split is resolved, I would have 2
actives, each having its own view of the resource, trying to synchronize
the CIB. At this point I want the one that has the clients attached to it to
win.
I am thinking I can implement a monitor function that can bring down the
resource if it doesn't find any clients attached to it within a given
period of time. But to understand the Pacemaker behavior, what exactly
would happen if the same resource is found to be active on two nodes after
recovery?
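
Roughly what I have in mind for the RA's monitor is sketched below;
count_attached_clients, redundancy_is_running and the grace-period handling
are placeholders that don't exist yet, and the OCF_* codes come from the
usual ocf-shellfuncs:

  redundancy_monitor() {
      # resource not running locally at all
      redundancy_is_running || return $OCF_NOT_RUNNING

      # running, but no clients attached after the grace period:
      # report a failure so Pacemaker stops this copy and the instance
      # the clients actually followed is the one that survives
      if [ "$(count_attached_clients)" -eq 0 ]; then
          return $OCF_ERR_GENERIC
      fi

      return $OCF_SUCCESS
  }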

-Thanks
Nikhil



On Tue, Jun 21, 2016 at 3:49 AM, Digimer  wrote:

> On 20/06/16 05:58 PM, Dimitri Maziuk wrote:
> > On 06/20/2016 03:58 PM, Digimer wrote:
> >
> >> Then wouldn't it be a lot better to just run your services on both nodes
> >> all the time and take HA out of the picture? Availability is predicated
> >> on building the simplest system possible. If you have no concerns about
> >> uncoordinated access, then make like simpler and remove pacemaker
> entirely.
> >
> > Obviously you'd have to remove the other node as well since you now
> > can't have the single service access point anymore.
>
> Nikhil indicated that they could switch where traffic went up-stream
> without issue, if I understood properly.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Recovering after split-brain

2016-06-20 Thread Nikhil Utane
Hi,

For our solution we are making a conscious choice to not use quorum/fencing
as for us service availability is more important than having 2 nodes take
up the same active role. Split-brain is not an issue for us (at least i
think that way) since we have a second line of defense. We have clients who
can connect to only one of the two active nodes. So in that sense, even if
we end up with 2 nodes becoming active, since the clients can connect to
only 1 of the active node, we should not have any issue.

Now my question is what happens after recovering from split-brain since the
resource will be active on both the nodes. From application point of view
we want to be able to find out which node is servicing the clients and keep
that operational and make the other one as standby.

Does Pacemaker make it easy to do this kind of thing through some means?
Are there any issues that I am completely unaware due to letting
split-brain occur?

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Error "xml does not conform to the schema" upon "pcs cluster standby" command

2016-06-10 Thread Nikhil Utane
This solved the problem. Thanks. :)

On Mon, Jun 6, 2016 at 9:49 AM, Nikhil Utane 
wrote:

> Yes, everything is (almost) latest version. Will give this a try. Thanks
> Jan.
>
> On Fri, Jun 3, 2016 at 9:04 PM, Jan Pokorný  wrote:
>
>> Hello Nikhil,
>>
>> On 03/06/16 16:33 +0530, Nikhil Utane wrote:
>> > The node is up alright.
>> >
>> > [root@airv_cu pcs]# pcs cluster status
>> > Cluster Status:
>> >  Stack: corosync
>> >  Current DC: airv_cu (version 1.1.14-5a6cdd1) - partition WITHOUT quorum
>> >  Last updated: Fri Jun  3 11:01:32 2016 Last change: Fri Jun  3
>> > 09:57:52 2016 by hacluster via crmd on airv_cu
>> >  2 nodes and 0 resources configured
>> >
>> > Upon entering command "pcs cluster standby airv_cu" getting below error.
>> > Error: cannot load cluster status, xml does not conform to the schema.
>> >
>> > What could be wrong?
>>
>> if you have a decently recent versions of both pacemaker and pcs (ca 3
>> months old or newer) it's entirely possible that this commit will
>> resolve it for you on the pacemaker side:
>>
>>
>> https://github.com/ClusterLabs/pacemaker/pull/1040/commits/87a82a165ccacaf1a0c48b5e1fad684a8dd2d8c9
>>
>> I'm just about to provide update to the expected test results and then
>> it (the whole pull request) is expected to land soon after that.
>>
>> --
>> Jan (Poki)
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Minimum configuration for dynamically adding a node to a cluster

2016-06-09 Thread Nikhil Utane
OK. The reason I got confused is coz all tutorials have a prior step of
creating user hacluster and then using pcs commands. Hence I thought it
relies on ssh. Thanks for the clarification.
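
For anyone else who has to open ports between the nodes instead: my
understanding is that on a firewalld-based system the stock
"high-availability" service already covers 2224/tcp for pcsd along with the
corosync/pacemaker ports, so it should just be:

  firewall-cmd --permanent --add-service=high-availability
  firewall-cmd --reload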

On Wed, Jun 8, 2016 at 7:51 PM, Ken Gaillot  wrote:

> On 06/08/2016 06:54 AM, Jehan-Guillaume de Rorthais wrote:
> >
> >
> > Le 8 juin 2016 13:36:03 GMT+02:00, Nikhil Utane <
> nikhil.subscri...@gmail.com> a écrit :
> >> Hi,
> >>
> >> Would like to know the best and easiest way to add a new node to an
> >> already
> >> running cluster.
> >>
> >> Our limitation:
> >> 1) pcsd cannot be used since (as per my understanding) it communicates
> >> over
> >> ssh which is prevented.
> >
> > As far as i remember,  pcsd deamons use their own tcp port (not the ssh
> one) and communicate with each others using http queries (over ssl i
> suppose).
>
> Correct, pcsd uses port 2224. It encrypts all traffic. If you can get
> that allowed through your firewall between cluster nodes, that will be
> the easiest way.
>
> corosync.conf does need to be kept the same on all nodes, and corosync
> needs to be reloaded after any changes. pcs will handle this
> automatically when adding/removing nodes. Alternatively, it is possible
> to use corosync.conf with multicast, without explicitly listing
> individual nodes.
>
> > As far as i understand, crmsh uses ssh, not pcsd.
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Minimum configuration for dynamically adding a node to a cluster

2016-06-09 Thread Nikhil Utane
Thank you for all the information.

Yes, I am using multicast. Actually I had tried without nodelist but was
hasty in reading the error message.
Saw the "'corosync_quorum' failed to load for reason configuration error:
nodelist " and didn't read the second part properly about expected_votes.
My bad.

I don't want to configure quorum since keeping the service up is of utmost
importance and the split-brain problem is indirectly getting handled
through other means.
In this case, should I be configuring expected_votes as 1?

Currently my two nodes in the cluster discovered each other without any
nodelist and expected_votes as 1. Which is what I always wanted.
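
For completeness, my understanding is that if I ever do enable a quorum
provider with two nodes, the section would look roughly like this (an
illustration only, not my current config):

  quorum {
    provider: corosync_votequorum
    expected_votes: 2
    # or, for exactly two nodes:
    # two_node: 1
  }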

-Thanks
Nikhil

On Wed, Jun 8, 2016 at 9:53 PM, Ferenc Wágner  wrote:

> Nikhil Utane  writes:
>
> > Would like to know the best and easiest way to add a new node to an
> already
> > running cluster.
> >
> > Our limitation:
> > 1) pcsd cannot be used since (as per my understanding) it communicates
> over
> > ssh which is prevented.
> > 2) No manual editing of corosync.conf
>
> If you use IPv4 multicast for Corosync 2 communication, then you needn't
> have a nodelist in corosync.conf.  However, if you want a quorum
> provider, then expected_votes must be set correctly, otherwise a small
> partition booting up could mistakenly assume it has quorum.  In a live
> system all corosync daemons will recognize new nodes and increase their
> "live" expected_votes accordingly.  But they won't write this back to
> the config file, leading to lack of information on reboot if they can't
> learn better from their peers.
>
> > So what I am thinking is, the first node will add nodelist with nodeid: 1
> > into its corosync.conf file.
> >
> > nodelist {
> > node {
> >   ring0_addr: node1
> >   nodeid: 1
> > }
> > }
> >
> > The second node to be added will get this information through some other
> > means and add itself with nodeid: 2 into it's corosync file.
> > Now the question I have is, does node1 also need to be updated with
> > information about node 2?
>
> It'd better, at least to exclude any possibility of clashing nodeids.
>
> > When i tested it locally, the cluster was up even without node1 having
> > node2 in its corosync.conf. Node2's corosync had both. If node1 doesn't
> > need to be told about node2, is there a way where we don't configure the
> > nodes but let them discover each other through the multicast IP (best
> > option).
>
> If you use IPv4 multicast and don't specify otherwise, the node IDs are
> assigned according to the ring0 addresses (IPv4 addresses are 32 bit
> integers after all).  But you still have to update expected_votes.
>
> > Assuming we should add it to keep the files in sync, what's the best way
> to
> > add the node information (either itself or other) preferably through some
> > CLI command?
>
> There's no corosync tool to update the config file.  An Augeas lens is
> provided for corosync.conf though, which should help with the task (I
> myself never tried it).  Then corosync-cfgtool -R makes all daemons in
> the cluster reload their config files.
> --
> Feri
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Minimum configuration for dynamically adding a node to a cluster

2016-06-08 Thread Nikhil Utane
Hi,

Would like to know the best and easiest way to add a new node to an already
running cluster.

Our limitation:
1) pcsd cannot be used since (as per my understanding) it communicates over
ssh which is prevented.
2) No manual editing of corosync.conf

So what I am thinking is, the first node will add nodelist with nodeid: 1
into its corosync.conf file.

nodelist {
node {
  ring0_addr: node1
  nodeid: 1
}
}

The second node to be added will get this information through some other
means and add itself with nodeid: 2 into it's corosync file.
Now the question I have is, does node1 also need to be updated with
information about node 2?
When I tested it locally, the cluster was up even without node1 having
node2 in its corosync.conf. Node2's corosync had both. If node1 doesn't
need to be told about node2, is there a way where we don't configure the
nodes but let them discover each other through the multicast IP (best
option).
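
In other words, ideally nothing more than a totem section along these lines
and no nodelist at all (the addresses below are placeholders):

  totem {
    version: 2
    cluster_name: mycluster
    transport: udp
    interface {
      ringnumber: 0
      bindnetaddr: 10.0.0.0
      mcastaddr: 239.255.1.1
      mcastport: 5405
    }
  }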

Assuming we should add it to keep the files in sync, what's the best way to
add the node information (either itself or other) preferably through some
CLI command?

-Regards
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Few questions regarding corosync authkey

2016-06-05 Thread Nikhil Utane
Hi,

Would like to understand how secure the corosync authkey is.
As the authkey is a binary file, how is the private key stored inside the
authkey?
What safeguard mechanisms are in place if the private key is compromised?
For example, I don't think it uses any temporary session key that refreshes
periodically.
Is it possible to dynamically update the key without causing any outage?
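(For context, the key I am asking about is the one generated and copied
around roughly like this; the scp destination is just an example:)

corosync-keygen                        # writes /etc/corosync/authkey, root-only
scp /etc/corosync/authkey node2:/etc/corosync/authkey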

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Error "xml does not conform to the schema" upon "pcs cluster standby" command

2016-06-05 Thread Nikhil Utane
Yes, everything is (almost) latest version. Will give this a try. Thanks
Jan.

On Fri, Jun 3, 2016 at 9:04 PM, Jan Pokorný  wrote:

> Hello Nikhil,
>
> On 03/06/16 16:33 +0530, Nikhil Utane wrote:
> > The node is up alright.
> >
> > [root@airv_cu pcs]# pcs cluster status
> > Cluster Status:
> >  Stack: corosync
> >  Current DC: airv_cu (version 1.1.14-5a6cdd1) - partition WITHOUT quorum
> >  Last updated: Fri Jun  3 11:01:32 2016 Last change: Fri Jun  3
> > 09:57:52 2016 by hacluster via crmd on airv_cu
> >  2 nodes and 0 resources configured
> >
> > Upon entering command "pcs cluster standby airv_cu" getting below error.
> > Error: cannot load cluster status, xml does not conform to the schema.
> >
> > What could be wrong?
>
> if you have decently recent versions of both pacemaker and pcs (ca. 3
> months old or newer) it's entirely possible that this commit will
> resolve it for you on the pacemaker side:
>
>
> https://github.com/ClusterLabs/pacemaker/pull/1040/commits/87a82a165ccacaf1a0c48b5e1fad684a8dd2d8c9
>
> I'm just about to provide update to the expected test results and then
> it (the whole pull request) is expected to land soon after that.
>
> --
> Jan (Poki)
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Error "xml does not conform to the schema" upon "pcs cluster standby" command

2016-06-03 Thread Nikhil Utane
Thanks for your response Klaus.
Is there a command to add the cluster name?
All the examples use the 'pcs cluster setup' command. But if my cluster is
already running, how do I update it?
I tried with crm_attribute:
crm_attribute -t crm_config -n cluster-name -v mycluster

It has been updated in the CIB, but pcs status still doesn't show the cluster name.

[CIB snippet showing the new cluster-name nvpair under crm_config; XML stripped by the archive]

[root@airv_cu root]# pcs status
Cluster name:
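
(One guess, purely an assumption on my part and not verified here: pcs may be
printing the name it reads from corosync.conf rather than from the CIB, in
which case something like this in the totem section, plus a corosync restart,
would be what it expects:)

totem {
    cluster_name: mycluster
    ...
}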

-Regards
Nikhil


On Fri, Jun 3, 2016 at 4:46 PM, Klaus Wenninger  wrote:

> On 06/03/2016 01:03 PM, Nikhil Utane wrote:
> > Hi,
> >
> > The node is up alright.
> >
> > [root@airv_cu pcs]# pcs cluster status
> > Cluster Status:
> >  Stack: corosync
> >  Current DC: airv_cu (version 1.1.14-5a6cdd1) - partition WITHOUT quorum
> >  Last updated: Fri Jun  3 11:01:32 2016 Last change: Fri Jun
> >  3 09:57:52 2016 by hacluster via crmd on airv_cu
> >  2 nodes and 0 resources configured
> >
> > Upon entering command "pcs cluster standby airv_cu" getting below error.
> > Error: cannot load cluster status, xml does not conform to the schema.
> >
> > What could be wrong?
> >
> > [root@airv_cu pcs]# pcs cluster cib
> > <cib ... num_updates="5" admin_epoch="0" cib-last-written="Fri Jun  3 09:57:52
> > 2016" update-origin="airv_cu" update-client="crmd"
> > update-user="hacluster" have-quorum="0" dc-uuid="1">
> >   <configuration>
> >     <crm_config>
> >       <cluster_property_set ...>
> >         <nvpair ... name="have-watchdog" value="true"/>
> >         <nvpair ... name="dc-version" value="1.1.14-5a6cdd1"/>
> >         <nvpair ... name="cluster-infrastructure" value="corosync"/>
> Your cluster doesn't have a name. iirc pcs (at least I've seen that in a
> version I was working with) doesn't like that.
>
> Something like:
> <nvpair id="..." name="cluster-name" value="mycluster"/>
>
> > [remainder of the CIB XML stripped by the archive; the status section
> > shows the node as a member with join="member" expected="member"]
> >
> > -Thanks
> > Nikhil
> >
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Error "xml does not conform to the schema" upon "pcs cluster standby" command

2016-06-03 Thread Nikhil Utane
Hi,

The node is up alright.

[root@airv_cu pcs]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: airv_cu (version 1.1.14-5a6cdd1) - partition WITHOUT quorum
 Last updated: Fri Jun  3 11:01:32 2016 Last change: Fri Jun  3
09:57:52 2016 by hacluster via crmd on airv_cu
 2 nodes and 0 resources configured

Upon entering the command "pcs cluster standby airv_cu" I get the error below.
Error: cannot load cluster status, xml does not conform to the schema.

What could be wrong?

[root@airv_cu pcs]# pcs cluster cib
[CIB XML output stripped by the archive]

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Using different folder for /var/lib/pacemaker and usage of /dev/shm files

2016-05-17 Thread Nikhil Utane
Yes, we do have our own application using shared memory, which is what we see
when the cluster is down.
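
(Given Ken's point below that leftover /dev/shm/qb-* files are only safe to
remove once nothing using libqb is running, the kind of clean-up I have in
mind is roughly this sketch:)

# assumes our own shared-memory application is stopped as well
if ! pgrep -x corosync >/dev/null && ! pgrep -x pacemakerd >/dev/null; then
    rm -f /dev/shm/qb-*
fi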

On Tue, May 17, 2016 at 10:53 PM, Ken Gaillot  wrote:

> On 05/17/2016 12:02 PM, Nikhil Utane wrote:
> > OK. Will do that.
> >
> > Actually I gave the /dev/shm usage when the cluster wasn't up.
> > When it is up, I see it occupies close to 300 MB (it's also the DC).
>
> Hmmm, there should be no usage if the cluster is stopped. Any memory
> used by the cluster will start with "qb-", so anything else is from
> something else.
>
> If all executables using libqb (including corosync and pacemaker) are
> stopped, it's safe to remove any /dev/shm/qb-* files that remain. That
> should be rare, probably only after a core dump or such.
>
> > tmpfs   500.0M329.4M170.6M  66% /dev/shm
> >
> > On another node the same is 115 MB.
> >
> > Anyways, I'll monitor the usage to know what size is needed.
> >
> > Thank you Ken and Ulrich.
> >
> > On Tue, May 17, 2016 at 8:23 PM, Ken Gaillot  > <mailto:kgail...@redhat.com>> wrote:
> >
> > On 05/17/2016 04:07 AM, Nikhil Utane wrote:
> > > What I would like to understand is how much total shared memory
> > > (approximately) would Pacemaker need so that accordingly I can
> define
> > > the partition size. Currently it is 300 MB in our system. I
> recently ran
> > > into insufficient shared memory issue because of improper
> clean-up. So
> > > would like to understand how much Pacemaker would need for a 6-node
> > > cluster so that accordingly I can increase it.
> >
> > I have no idea :-)
> >
> > I don't think there's any way to pre-calculate it. The libqb library
> is
> > the part of the software stack that actually manages the shared
> memory,
> > but it's used by everything -- corosync (including its cpg and
> > votequorum components) and each pacemaker daemon.
> >
> > The size depends directly on the amount of communication activity in
> the
> > cluster, which is only indirectly related to the number of
> > nodes/resources/etc., the size of the CIB, etc. A cluster with nodes
> > joining/leaving frequently and resources moving around a lot will use
> > more shared memory than a cluster of the same size that's quiet.
> Cluster
> > options such as cluster-recheck-interval would also matter.
> >
> > Practically, I think all you can do is simulate expected cluster
> > configurations and loads, and see what it comes out to be.
> >
> > > # df -kh
> > > tmpfs   300.0M 27.5M272.5M   9% /dev/shm
> > >
> > > Thanks
> > > Nikhil
> > >
> > > On Tue, May 17, 2016 at 12:09 PM, Ulrich Windl
> > >  > <mailto:ulrich.wi...@rz.uni-regensburg.de>
> > > <mailto:ulrich.wi...@rz.uni-regensburg.de
> > <mailto:ulrich.wi...@rz.uni-regensburg.de>>> wrote:
> > >
> > > Hi!
> > >
> > > One of the main problems I identified with POSIX shared memory
> > > (/dev/shm) in Linux is that changes to the shared memory don't
> > > affect the i-node, so you cannot tell from a "ls -rtl" which
> > > segments are still active and which are not. You can only see
> the
> > > creation time.
> > >
> > > Maybe there should be a tool that identifies and cleans up
> obsolete
> > > shared memory.
> > > I don't understand the part talking about the size of
> /dev/shm: It's
> > > shared memory. See "kernel.shmmax" and "kernel.shmall" in your
> sysctl
> > > settings (/etc/sysctl.conf).
> > >
> > > Regards,
> > > Ulrich
> > >
> > > >>> Nikhil Utane  nikhil.subscri...@gmail.com>
> > > <mailto:nikhil.subscri...@gmail.com
> > <mailto:nikhil.subscri...@gmail.com>>> schrieb am 16.05.2016 um
> 14:31 in
> > > Nachricht
> > >
> >   > <mailto:2fs1c%2b0rgnqs994vv...@mail.gmail.com>
> > > <mailto:2fs1c%2b0rgnqs994vv...@mail.gmail.com
> > <mailto:2fs1c%252b0rgnqs994vv...@mail.gmail.com>>>:
> > > > Thanks Ken.
> > > >
> > 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-17 Thread Nikhil Utane
Hi Honza,

Just checking if you have the official patch available for this issue.

As far as I am concerned, barring a couple of issues, everything seems to be
working fine on our big-endian system. Much relieved. :)

-Thanks
Nikhil


On Thu, May 5, 2016 at 3:24 PM, Nikhil Utane 
wrote:

> It worked for me. :)
> I'll wait for your formal patch but until then I am able to proceed
> further. (Don't know if I'll run into something else)
>
> However now encountering issue in pacemaker.
>
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The cib process (15224) can no longer be respawned, shutting the cluster
> down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The stonith-ng process (15225) can no longer be respawned, shutting the
> cluster down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The lrmd process (15226) can no longer be respawned, shutting the cluster
> down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The crmd process (15229) can no longer be respawned, shutting the cluster
> down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The pengine process (15228) can no longer be respawned, shutting the
> cluster down.
> May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:
>  The attrd process (15227) can no longer be respawned, shutting the cluster
> down.
>
> Looking into it.
>
> -Thanks
> Nikhil
>
> On Thu, May 5, 2016 at 2:58 PM, Jan Friesse  wrote:
>
>> Nikhil
>>
>> Found the root-cause.
>>> In file schedwrk.c, the function handle2void() uses a union which was not
>>> initialized.
>>> Because of that the handle value was computed incorrectly (lower half was
>>> garbage).
>>>
>>>   56 static hdb_handle_t
>>>   57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
>>>   58 static const void *
>>>   59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }
>>>
>>> After initializing (as highlighted), the corosync initialization seems to
>>> be going through fine. Will check other things.
>>>
>>
>> Your patch is incorrect and actually doesn't work. As I said (when
>> pointing you to schedwrk.c), I will send you proper patch, but fix that
>> issue correctly is not easy.
>>
>> Regards,
>>   Honza
>>
>>
>>> -Regards
>>> Nikhil
>>>
>>> On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane <
>>> nikhil.subscri...@gmail.com>
>>> wrote:
>>>
>>> Thanks for your response Dejan.
>>>>
>>>> I do not know yet whether this has anything to do with endianness.
>>>> FWIW, there could be something quirky with the system so keeping all
>>>> options open. :)
>>>>
>>>> I added some debug prints to understand what's happening under the hood.
>>>>
>>>> *Success case: (on x86 machine): *
>>>> [TOTEM ] entering OPERATIONAL state.
>>>> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members
>>>> joined:
>>>> 181272839
>>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>>>> my_high_delivered=0
>>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>>> my_high_delivered=0
>>>> [TOTEM ] Delivering 0 to 1
>>>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>>>> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
>>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
>>>> my_high_delivered=1
>>>> [TOTEM ] Delivering 1 to 2
>>>> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
>>>> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
>>>> [SYNC  ] Nikhil: Entering sync_barrier_handler
>>>> [SYNC  ] Committing synchronization for corosync configuration map
>>>> access
>>>> .
>>>> [TOTEM ] Delivering 2 to 4
>>>> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
>>>> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
>>>> [CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
>>>> [CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
>>>> left:0)
>>>> [SYNC  ] Committing synchronization for corosync cluster closed process
>>>> group service v1.01
>>>> *[MAIN  ] 

Re: [ClusterLabs] Antw: Re: Using different folder for /var/lib/pacemaker and usage of /dev/shm files

2016-05-17 Thread Nikhil Utane
OK. Will do that.

Actually I gave the /dev/shm usage when the cluster wasn't up.
When it is up, I see it occupies close to 300 MB (it's also the DC).

tmpfs  500.0M  329.4M  170.6M  66% /dev/shm

On another node the same is 115 MB.

Anyways, I'll monitor the usage to know what size is needed.

Thank you Ken and Ulrich.
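
(The monitoring itself will be something simple along these lines; the
interval and log path are arbitrary:)

while true; do
    date
    df -k /dev/shm
    ls /dev/shm/qb-* 2>/dev/null | wc -l   # number of libqb segments
    sleep 60
done >> /tmp/shm-usage.log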

On Tue, May 17, 2016 at 8:23 PM, Ken Gaillot  wrote:

> On 05/17/2016 04:07 AM, Nikhil Utane wrote:
> > What I would like to understand is how much total shared memory
> > (approximately) would Pacemaker need so that accordingly I can define
> > the partition size. Currently it is 300 MB in our system. I recently ran
> > into insufficient shared memory issue because of improper clean-up. So
> > would like to understand how much Pacemaker would need for a 6-node
> > cluster so that accordingly I can increase it.
>
> I have no idea :-)
>
> I don't think there's any way to pre-calculate it. The libqb library is
> the part of the software stack that actually manages the shared memory,
> but it's used by everything -- corosync (including its cpg and
> votequorum components) and each pacemaker daemon.
>
> The size depends directly on the amount of communication activity in the
> cluster, which is only indirectly related to the number of
> nodes/resources/etc., the size of the CIB, etc. A cluster with nodes
> joining/leaving frequently and resources moving around a lot will use
> more shared memory than a cluster of the same size that's quiet. Cluster
> options such as cluster-recheck-interval would also matter.
>
> Practically, I think all you can do is simulate expected cluster
> configurations and loads, and see what it comes out to be.
>
> > # df -kh
> > tmpfs   300.0M 27.5M272.5M   9% /dev/shm
> >
> > Thanks
> > Nikhil
> >
> > On Tue, May 17, 2016 at 12:09 PM, Ulrich Windl
> >  > <mailto:ulrich.wi...@rz.uni-regensburg.de>> wrote:
> >
> > Hi!
> >
> > One of the main problems I identified with POSIX shared memory
> > (/dev/shm) in Linux is that changes to the shared memory don't
> > affect the i-node, so you cannot tell from a "ls -rtl" which
> > segments are still active and which are not. You can only see the
> > creation time.
> >
> > Maybe there should be a tool that identifies and cleans up obsolete
> > shared memory.
> > I don't understand the part talking about the size of /dev/shm: It's
> > shared memory. See "kernel.shmmax" and "kernel.shmall" in your sysctl
> > settings (/etc/sysctl.conf).
> >
> > Regards,
> > Ulrich
> >
> > >>> Nikhil Utane  > <mailto:nikhil.subscri...@gmail.com>> schrieb am 16.05.2016 um
> 14:31 in
> > Nachricht
> >  > <mailto:2fs1c%2b0rgnqs994vv...@mail.gmail.com>>:
> > > Thanks Ken.
> > >
> > > Could you also respond on the second question?
> > >
> > >> Also, in /dev/shm I see that it created around 300+ files of
> > around
> > >> 250 MB.
> > >>
> > >> For e.g.
> > >> -rw-rw1 hacluste hacluste  8232 May  6 13:03
> > >> qb-cib_rw-response-25035-25038-10-header
> > >> -rw-rw1 hacluste hacluste540672 May  6 13:03
> > >> qb-cib_rw-response-25035-25038-10-data
> > >> -rw---1 hacluste hacluste  8232 May  6 13:03
> > >> qb-cib_rw-response-25035-25036-12-header
> > >> -rw---1 hacluste hacluste540672 May  6 13:03
> > >> qb-cib_rw-response-25035-25036-12-data
> > >> And many more..
> > >>
> > >> We have limited space in /dev/shm and all these files are
> > filling it
> > >> up. Are these all needed? Any way to limit? Do we need to do
> any
> > >> clean-up if pacemaker termination was not graceful? What's the
> > > recommended size for this folder for Pacemaker? Our cluster will
> have
> > > maximum 6 nodes.
> > >
> > > -Regards
> > > Nikhil
> > >
> > > On Sat, May 14, 2016 at 3:11 AM, Ken Gaillot  > <mailto:kgail...@redhat.com>> wrote:
> > >
> > >> On 05/08/2016 11:19 PM, Nikhil Utane wrote:
> > >> > Moving these questions to a different thread.
> > >&

Re: [ClusterLabs] Antw: Re: Using different folder for /var/lib/pacemaker and usage of /dev/shm files

2016-05-17 Thread Nikhil Utane
What I would like to understand is approximately how much total shared
memory Pacemaker would need, so that I can size the partition accordingly.
Currently it is 300 MB in our system. I recently ran into an insufficient
shared memory issue because of improper clean-up, so I would like to
understand how much Pacemaker would need for a 6-node cluster and increase
it accordingly.

# df -kh
tmpfs  300.0M   27.5M  272.5M   9% /dev/shm

Thanks
Nikhil

On Tue, May 17, 2016 at 12:09 PM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> Hi!
>
> One of the main problems I identified with POSIX shared memory (/dev/shm)
> in Linux is that changes to the shared memory don't affect the i-node, so
> you cannot tell from a "ls -rtl" which segments are still active and which
> are not. You can only see the creation time.
>
> Maybe there should be a tool that identifies and cleans up obsolete shared
> memory.
> I don't understand the part talking about the size of /dev/shm: It's
> shared memory. See "kernel.shmmax" and "kernel.shmall" in your sysctl
> settings (/etc/sysctl.conf).
>
> Regards,
> Ulrich
>
> >>> Nikhil Utane  schrieb am 16.05.2016 um
> 14:31 in
> Nachricht
> :
> > Thanks Ken.
> >
> > Could you also respond on the second question?
> >
> >> Also, in /dev/shm I see that it created around 300+ files of around
> >> 250 MB.
> >>
> >> For e.g.
> >> -rw-rw1 hacluste hacluste  8232 May  6 13:03
> >> qb-cib_rw-response-25035-25038-10-header
> >> -rw-rw1 hacluste hacluste540672 May  6 13:03
> >> qb-cib_rw-response-25035-25038-10-data
> >> -rw---1 hacluste hacluste  8232 May  6 13:03
> >> qb-cib_rw-response-25035-25036-12-header
> >> -rw---1 hacluste hacluste540672 May  6 13:03
> >> qb-cib_rw-response-25035-25036-12-data
> >> And many more..
> >>
> >> We have limited space in /dev/shm and all these files are filling it
> >> up. Are these all needed? Any way to limit? Do we need to do any
> >>     clean-up if pacemaker termination was not graceful? What's the
> > recommended size for this folder for Pacemaker? Our cluster will have
> > maximum 6 nodes.
> >
> > -Regards
> > Nikhil
> >
> > On Sat, May 14, 2016 at 3:11 AM, Ken Gaillot 
> wrote:
> >
> >> On 05/08/2016 11:19 PM, Nikhil Utane wrote:
> >> > Moving these questions to a different thread.
> >> >
> >> > Hi,
> >> >
> >> > We have limited storage capacity in our system for different
> folders.
> >> > How can I configure to use a different folder for
> /var/lib/pacemaker?
> >>
> >> ./configure --localstatedir=/wherever (defaults to /var or
> ${prefix}/var)
> >>
> >> That will change everything that normally is placed or looked for under
> >> /var (/var/lib/pacemaker, /var/lib/heartbeat, /var/run, etc.).
> >>
> >> Note that while ./configure lets you change the location of nearly
> >> everything, /usr/lib/ocf/resource.d is an exception, because it is
> >> specified in the OCF standard.
> >>
> >> >
> >> >
> >> > Also, in /dev/shm I see that it created around 300+ files of
> around
> >> > 250 MB.
> >> >
> >> > For e.g.
> >> > -rw-rw1 hacluste hacluste  8232 May  6 13:03
> >> > qb-cib_rw-response-25035-25038-10-header
> >> > -rw-rw1 hacluste hacluste540672 May  6 13:03
> >> > qb-cib_rw-response-25035-25038-10-data
> >> > -rw---1 hacluste hacluste  8232 May  6 13:03
> >> > qb-cib_rw-response-25035-25036-12-header
> >> > -rw---1 hacluste hacluste540672 May  6 13:03
> >> > qb-cib_rw-response-25035-25036-12-data
> >> > And many more..
> >> >
> >> > We have limited space in /dev/shm and all these files are filling
> it
> >> > up. Are these all needed? Any way to limit? Do we need to do any
> >> > clean-up if pacemaker termination was not graceful?
> >> >
> >> > -Thanks
> >> > Nikhil
> >> >
> >> >
> >> >
> >> >
> >> > ___
> >> > Users mailing list: Users@clusterlabs.org
> &

Re: [ClusterLabs] Using different folder for /var/lib/pacemaker and usage of /dev/shm files

2016-05-16 Thread Nikhil Utane
Thanks Ken.

Could you also respond to the second question?

> Also, in /dev/shm I see that it created around 300+ files of around
> 250 MB.
>
> For e.g.
> -rw-rw1 hacluste hacluste  8232 May  6 13:03
> qb-cib_rw-response-25035-25038-10-header
> -rw-rw1 hacluste hacluste540672 May  6 13:03
> qb-cib_rw-response-25035-25038-10-data
> -rw---1 hacluste hacluste  8232 May  6 13:03
> qb-cib_rw-response-25035-25036-12-header
> -rw---1 hacluste hacluste540672 May  6 13:03
> qb-cib_rw-response-25035-25036-12-data
> And many more..
>
> We have limited space in /dev/shm and all these files are filling it
> up. Are these all needed? Any way to limit? Do we need to do any
> clean-up if pacemaker termination was not graceful? What's the
recommended size for this folder for Pacemaker? Our cluster will have
maximum 6 nodes.

-Regards
Nikhil

On Sat, May 14, 2016 at 3:11 AM, Ken Gaillot  wrote:

> On 05/08/2016 11:19 PM, Nikhil Utane wrote:
> > Moving these questions to a different thread.
> >
> > Hi,
> >
> > We have limited storage capacity in our system for different folders.
> > How can I configure to use a different folder for /var/lib/pacemaker?
>
> ./configure --localstatedir=/wherever (defaults to /var or ${prefix}/var)
>
> That will change everything that normally is placed or looked for under
> /var (/var/lib/pacemaker, /var/lib/heartbeat, /var/run, etc.).
>
> Note that while ./configure lets you change the location of nearly
> everything, /usr/lib/ocf/resource.d is an exception, because it is
> specified in the OCF standard.
>
> >
> >
> > Also, in /dev/shm I see that it created around 300+ files of around
> > 250 MB.
> >
> > For e.g.
> > -rw-rw1 hacluste hacluste  8232 May  6 13:03
> > qb-cib_rw-response-25035-25038-10-header
> > -rw-rw1 hacluste hacluste540672 May  6 13:03
> > qb-cib_rw-response-25035-25038-10-data
> > -rw---1 hacluste hacluste  8232 May  6 13:03
> > qb-cib_rw-response-25035-25036-12-header
> > -rw---1 hacluste hacluste540672 May  6 13:03
> > qb-cib_rw-response-25035-25036-12-data
> > And many more..
> >
> > We have limited space in /dev/shm and all these files are filling it
> > up. Are these all needed? Any way to limit? Do we need to do any
> > clean-up if pacemaker termination was not graceful?
> >
> > -Thanks
> > Nikhil
> >
> >
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Using different folder for /var/lib/pacemaker and usage of /dev/shm files

2016-05-08 Thread Nikhil Utane
Moving these questions to a different thread.

Hi,
>
> We have limited storage capacity in our system for different folders.
How can I configure to use a different folder for /var/lib/pacemaker?

>
> Also, in /dev/shm I see that it created around 300+ files of around 250 MB.
>
> For e.g.
> -rw-rw1 hacluste hacluste  8232 May  6 13:03
> qb-cib_rw-response-25035-25038-10-header
> -rw-rw1 hacluste hacluste540672 May  6 13:03
> qb-cib_rw-response-25035-25038-10-data
> -rw---1 hacluste hacluste  8232 May  6 13:03
> qb-cib_rw-response-25035-25036-12-header
> -rw---1 hacluste hacluste540672 May  6 13:03
> qb-cib_rw-response-25035-25036-12-data
> And many more..
>
> We have limited space in /dev/shm and all these files are filling it up.
> Are these all needed? Any way to limit? Do we need to do any clean-up if
> pacemaker termination was not graceful?
>
> -Thanks
> Nikhil
>
>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

2016-05-06 Thread Nikhil Utane
Hi,

QQ, how can I configure to use a different folder other than
/var/lib/pacemaker?

Also, in /dev/shm I see that it created around 300+ files of around 250 MB.

For e.g.
-rw-rw1 hacluste hacluste  8232 May  6 13:03
qb-cib_rw-response-25035-25038-10-header
-rw-rw1 hacluste hacluste540672 May  6 13:03
qb-cib_rw-response-25035-25038-10-data
-rw---1 hacluste hacluste  8232 May  6 13:03
qb-cib_rw-response-25035-25036-12-header
-rw---1 hacluste hacluste540672 May  6 13:03
qb-cib_rw-response-25035-25036-12-data
And many more..

We have limited space in /dev/shm and all these files are filling it up.
Are these all needed? Any way to limit?

-Regards
Nikhil


On Fri, May 6, 2016 at 6:16 PM, Nikhil Utane 
wrote:

> Thanks for the details, Jan.
> We have cross-compiled the same way.
> However because of space constraints on target we copied selected folders.
>
> For this issue, I made a softlink for pacemaker-1.0.rng to point to 
> pacemaker-next.rng
> and that did the trick. :)
> I am not getting the error and the node information is getting added now.
> Still would like to understand why it picked up pacemaker-1.0.rng instead
> of the latest one.
>
> There is no error now in the corosync.log file. Everything seems to be up
> and running.
> But still the output of 'pcs cluster start' command gave failure.
>
> [root@airv_cu xml]# pcs cluster start
> Starting Cluster...
> Starting Pacemaker Cluster Manager[FAILED]
>
> Error: unable to start pacemaker
>
> [root@airv_cu xml]# pcs cluster status
> Cluster Status:
>  Stack: corosync
>  Current DC: airv_cu (version 1.1.14-5a6cdd1) - partition with quorum
>  Last updated: Fri May  6 12:44:33 2016 Last change: Fri May  6
> 12:08:23 2016
>  1 node and 0 resources configured
>
> Let me check whether qb-blackbox gives any details.
>
> -Thanks
> Nikhil
>
> On Fri, May 6, 2016 at 5:46 PM, Jan Pokorný  wrote:
>
>> On 06/05/16 16:59 +0530, Nikhil Utane wrote:
>> >
>> > [...]
>> >
>> >> On 05/06/2016 12:40 PM, Nikhil Utane wrote:
>> >>> As I am cross-compiling pacemaker on a build machine and later moving
>> >>> the binaries to the target, few binaries were missing. After fixing
>> >>> that and bunch of other errors/warning, I am able to get pacemaker
>> >>> started though not completely running fine.
>>
>> > As I mentioned, I am cross-compiling and copying the relevant files
>> > on target platform.
>>
>> I am afraid you are doing the "install" step of deploying from sources
>> across the machines utterly wrong.
>>
>> > In one of the earlier run pacemaker cribbed out not finding
>> > /usr/share/pacemaker/pacemaker-1.0.rng.
>>
>> What more to expect if you believe you can do with moving binaries
>> and getting relevant files OK by hand.  That doesn't really scale
>> and is error-prone, leading to more time spent on guesstimating
>> authoritative installation recipe that's already there (see below).
>>
>> > I found this file under xml folder in the build folder, so I copied all
>> the
>> > files under xml folder onto the target.
>> > Did that screw it up?
>> >
>> > This is the content of the folder:
>> > [root@airv_cu pacemaker]# ls /usr/share/pacemaker/
>> > Makefile  constraints-2.1.rng   nodes-1.0.rng
>> > pacemaker-2.1.rng rule.rng
>> > Makefile.am   constraints-2.2.rng   nodes-1.2.rng
>> > pacemaker-2.2.rng score.rng
>> > Makefile.in   constraints-2.3.rng   nodes-1.3.rng
>> > pacemaker-2.3.rng status-1.0.rng
>> > Readme.md constraints-next.rng  nvset-1.3.rng
>> > pacemaker-2.4.rng tags-1.3.rng
>> > acls-1.2.rng  context-of.xslnvset.rng
>> > pacemaker-next.rngupgrade-1.3.xsl
>> > acls-2.0.rng  crm-transitional.dtd  ocf-meta2man.xsl
>> >  pacemaker.rng upgrade06.xsl
>> > best-match.sh crm.dtd   options-1.0.rng
>> > regression.core.shversions.rng
>> > cib-1.0.rng   crm.xsl   pacemaker-1.0.rng
>> > regression.sh
>> > cib-1.2.rng   crm_mon.rng   pacemaker-1.2.rng
>> > resources-1.0.rng
>> > constraints-1.0.rng   fencing-1.2.rng   pacemaker-1.3.rng
>> > resources-1.2.rng
>> > constraints-1.2.rng   fencing-2.4.rng   pacemaker-2.0.rng
>> > resources-1.3.rng
>>
>> Now, you got overapproximation of what you really need
>> (e.g., context-of.xsl and best-match

Re: [ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

2016-05-06 Thread Nikhil Utane
Thanks for the details, Jan.
We have cross-compiled the same way.
However, because of space constraints on the target, we copied only selected folders.
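
(A staged-install variant of Jan's recipe quoted below would look roughly
like this; the host triplet is a placeholder for our toolchain:)

./configure --host=<target-triplet> ...
make && make install DESTDIR=$(pwd)/pcmk-tree
# prune inside pcmk-tree whatever the target really cannot afford, then:
tar czpf pcmk-tree.tar.gz pcmk-tree
# and on the target:
tar xzpf pcmk-tree.tar.gz -k --strip-components=1 -C /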

For this issue, I made a symlink for pacemaker-1.0.rng pointing to
pacemaker-next.rng, and that did the trick. :)
I am no longer getting the error, and the node information is getting added now.
I would still like to understand why it picked up pacemaker-1.0.rng instead
of the latest one.

There is no error now in the corosync.log file. Everything seems to be up
and running.
But the 'pcs cluster start' command still reported failure.

[root@airv_cu xml]# pcs cluster start
Starting Cluster...
Starting Pacemaker Cluster Manager[FAILED]

Error: unable to start pacemaker

[root@airv_cu xml]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: airv_cu (version 1.1.14-5a6cdd1) - partition with quorum
 Last updated: Fri May  6 12:44:33 2016 Last change: Fri May  6
12:08:23 2016
 1 node and 0 resources configured

Let me check whether qb-blackbox gives any details.

-Thanks
Nikhil

On Fri, May 6, 2016 at 5:46 PM, Jan Pokorný  wrote:

> On 06/05/16 16:59 +0530, Nikhil Utane wrote:
> >
> > [...]
> >
> >> On 05/06/2016 12:40 PM, Nikhil Utane wrote:
> >>> As I am cross-compiling pacemaker on a build machine and later moving
> >>> the binaries to the target, few binaries were missing. After fixing
> >>> that and bunch of other errors/warning, I am able to get pacemaker
> >>> started though not completely running fine.
>
> > As I mentioned, I am cross-compiling and copying the relevant files
> > on target platform.
>
> I am afraid you are doing the "install" step of deploying from sources
> across the machines utterly wrong.
>
> > In one of the earlier run pacemaker cribbed out not finding
> > /usr/share/pacemaker/pacemaker-1.0.rng.
>
> What more to expect if you believe you can do with moving binaries
> and getting relevant files OK by hand.  That doesn't really scale
> and is error-prone, leading to more time spent on guesstimating
> authoritative installation recipe that's already there (see below).
>
> > I found this file under xml folder in the build folder, so I copied all
> the
> > files under xml folder onto the target.
> > Did that screw it up?
> >
> > This is the content of the folder:
> > [root@airv_cu pacemaker]# ls /usr/share/pacemaker/
> > Makefile  constraints-2.1.rng   nodes-1.0.rng
> > pacemaker-2.1.rng rule.rng
> > Makefile.am   constraints-2.2.rng   nodes-1.2.rng
> > pacemaker-2.2.rng score.rng
> > Makefile.in   constraints-2.3.rng   nodes-1.3.rng
> > pacemaker-2.3.rng status-1.0.rng
> > Readme.md constraints-next.rng  nvset-1.3.rng
> > pacemaker-2.4.rng tags-1.3.rng
> > acls-1.2.rng  context-of.xslnvset.rng
> > pacemaker-next.rngupgrade-1.3.xsl
> > acls-2.0.rng  crm-transitional.dtd  ocf-meta2man.xsl
> >  pacemaker.rng upgrade06.xsl
> > best-match.sh crm.dtd   options-1.0.rng
> > regression.core.shversions.rng
> > cib-1.0.rng   crm.xsl   pacemaker-1.0.rng
> > regression.sh
> > cib-1.2.rng   crm_mon.rng   pacemaker-1.2.rng
> > resources-1.0.rng
> > constraints-1.0.rng   fencing-1.2.rng   pacemaker-1.3.rng
> > resources-1.2.rng
> > constraints-1.2.rng   fencing-2.4.rng   pacemaker-2.0.rng
> > resources-1.3.rng
>
> Now, you got overapproximation of what you really need
> (e.g., context-of.xsl and best-match.sh are just helpers for developers
> and make sense only from within the source tree, just as Makefile etc.
> does), which is what you want to avoid, especially in case of the
> embedded board.
>
> So now, what you should do instead is along these lines:
>
> $ mkdir pcmk-tree
> $ export CFLAGS=... CC=... # what you need for cross-compilation
> $ ./configure ...
> $ make && make install DESTDIR=$(pwd)/pcmk-tree
> $ tar czpf pcmk-tree.tar.gz pcmk-tree
>
> and now, distribute pcmk-tree.tar.gz to you target, untar it with
> something like "-k --strip-components=1" in the / dir.
>
> Or better yet, go a proper package management route, best using
> "make rpm" target (you'll have to edit pacemaker.spec or RPM macros
> on your system so as to pass the cross-compilation flags across)
> and then just install the package at the target if that's doable
> in your environment.
>
> Hope this helps.
>
> --
> Jan (Poki)
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

2016-05-06 Thread Nikhil Utane
I suppose the failure is because I do not have a DC yet.

[root@airv_cu xml]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: NONE

Can I bring it up when I have just 1 node?

On Fri, May 6, 2016 at 4:59 PM, Nikhil Utane 
wrote:

> The command failed.
> [root@airv_cu pacemaker]# cibadmin --upgrade --force
> Call cib_upgrade failed (-62): Timer expired
>
> I did not do any tooling. (Not even aware how to)
>
> As I mentioned, I am cross-compiling and copying the relevant files on
> target platform.
> In one of the earlier run pacemaker cribbed out not finding
> /usr/share/pacemaker/pacemaker-1.0.rng.
>
> I found this file under xml folder in the build folder, so I copied all
> the files under xml folder onto the target.
> Did that screw it up?
>
> This is the content of the folder:
> [root@airv_cu pacemaker]# ls /usr/share/pacemaker/
> Makefile  constraints-2.1.rng   nodes-1.0.rng
> pacemaker-2.1.rng rule.rng
> Makefile.am   constraints-2.2.rng   nodes-1.2.rng
> pacemaker-2.2.rng score.rng
> Makefile.in   constraints-2.3.rng   nodes-1.3.rng
> pacemaker-2.3.rng status-1.0.rng
> Readme.md constraints-next.rng  nvset-1.3.rng
> pacemaker-2.4.rng tags-1.3.rng
> acls-1.2.rng  context-of.xslnvset.rng
> pacemaker-next.rngupgrade-1.3.xsl
> acls-2.0.rng  crm-transitional.dtd  ocf-meta2man.xsl
>  pacemaker.rng upgrade06.xsl
> best-match.sh crm.dtd   options-1.0.rng
> regression.core.shversions.rng
> cib-1.0.rng   crm.xsl   pacemaker-1.0.rng
> regression.sh
> cib-1.2.rng   crm_mon.rng   pacemaker-1.2.rng
> resources-1.0.rng
> constraints-1.0.rng   fencing-1.2.rng   pacemaker-1.3.rng
> resources-1.2.rng
> constraints-1.2.rng   fencing-2.4.rng   pacemaker-2.0.rng
> resources-1.3.rng
>
> -Regards
> Nikhil
>
> On Fri, May 6, 2016 at 4:41 PM, Klaus Wenninger 
> wrote:
>
>> On 05/06/2016 12:40 PM, Nikhil Utane wrote:
>> > Hi,
>> >
>> > I used the blackbox feature which showed the reason for failure.
>> > As I am cross-compiling pacemaker on a build machine and later moving
>> > the binaries to the target, few binaries were missing. After fixing
>> > that and bunch of other errors/warning, I am able to get pacemaker
>> > started though not completely running fine.
>> >
>> > The node is not getting added:
>> > airv_cucib:error: xml_log:Element node failed to validate
>> > attributes
>> >
>> > I suppose it is because of this error:
>> > crmd:error: node_list_update_callback:Node update 4 failed: Update
>> > does not conform to the configured schema (-203)
>> >
>> > I am suspecting this is caused because of
>> > validate-with="pacemaker-0.7" in the cib. In another installation this
>> > is being set to '"pacemaker-2.0"'
>> >
>> > [root@airv_cu pacemaker]# pcs cluster cib
>> > <cib validate-with="pacemaker-0.7" ... num_updates="0" admin_epoch="0"
>> > cib-last-written="Fri May  6 09:28:10 2016" have-quorum="1">
>> >   <configuration>
>> >     <crm_config>
>> >       <cluster_property_set ...>
>> >         <nvpair ... name="have-watchdog" value="true"/>
>> >         <nvpair ... name="dc-version" value="1.1.14-5a6cdd1"/>
>> >         <nvpair ... name="cluster-infrastructure" value="corosync"/>
>> > [remainder of the CIB XML stripped by the archive]
>> >
>> > Any idea why/where this is being set to 0.7. I am using latest
>> > pacemaker from GitHub.
>>
>> What kind of tooling did you use to create the cib?
>> Try 'cibadmin --upgrade'. That should set the cib-version to what your
>> pacemaker-version supports.
>>
>> >
>> > [root@airv_cu pacemaker]# pacemakerd --version
>> > Pacemaker 1.1.14
>> > Written by Andrew Beekhof
>> >
>> > Attaching the corosync.log and corosync.conf file.
>> >
>> > -Thanks
>> > Nikhil
>> >
>> >
>> > On Thu, May 5, 2016 at 10:21 PM, Ken Gaillot > > <mailto:kgail...@redhat.com>> wrote:
>> >
>> > On 05/05/2016 11:25 AM, Nikhil Utane wrote:
>> > > Thanks Ken for your quick response as always.
>> > >
>> > > But what if I don't want to use quorum? I just want to bring up
>> > > pacemaker + co

Re: [ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

2016-05-06 Thread Nikhil Utane
The command failed.
[root@airv_cu pacemaker]# cibadmin --upgrade --force
Call cib_upgrade failed (-62): Timer expired
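
(Side note, purely an assumption on my part that I have not verified on this
setup: the validate-with attribute can reportedly also be set directly on the
cib element, in the same way the documentation shows for admin_epoch:)

cibadmin --modify --xml-text '<cib validate-with="pacemaker-2.0"/>'   # untested here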

I did not do any tooling. (Not even aware how to)

As I mentioned, I am cross-compiling and copying the relevant files onto the
target platform.
In one of the earlier runs, pacemaker complained about not finding
/usr/share/pacemaker/pacemaker-1.0.rng.

I found this file under the xml folder in the build tree, so I copied all the
files under the xml folder onto the target.
Did that screw it up?

This is the content of the folder:
[root@airv_cu pacemaker]# ls /usr/share/pacemaker/
Makefile  constraints-2.1.rng   nodes-1.0.rng
pacemaker-2.1.rng rule.rng
Makefile.am   constraints-2.2.rng   nodes-1.2.rng
pacemaker-2.2.rng score.rng
Makefile.in   constraints-2.3.rng   nodes-1.3.rng
pacemaker-2.3.rng status-1.0.rng
Readme.md constraints-next.rng  nvset-1.3.rng
pacemaker-2.4.rng tags-1.3.rng
acls-1.2.rng  context-of.xslnvset.rng
pacemaker-next.rngupgrade-1.3.xsl
acls-2.0.rng  crm-transitional.dtd  ocf-meta2man.xsl
 pacemaker.rng upgrade06.xsl
best-match.sh crm.dtd   options-1.0.rng
regression.core.shversions.rng
cib-1.0.rng   crm.xsl   pacemaker-1.0.rng
regression.sh
cib-1.2.rng   crm_mon.rng   pacemaker-1.2.rng
resources-1.0.rng
constraints-1.0.rng   fencing-1.2.rng   pacemaker-1.3.rng
resources-1.2.rng
constraints-1.2.rng   fencing-2.4.rng   pacemaker-2.0.rng
resources-1.3.rng

-Regards
Nikhil

On Fri, May 6, 2016 at 4:41 PM, Klaus Wenninger  wrote:

> On 05/06/2016 12:40 PM, Nikhil Utane wrote:
> > Hi,
> >
> > I used the blackbox feature which showed the reason for failure.
> > As I am cross-compiling pacemaker on a build machine and later moving
> > the binaries to the target, few binaries were missing. After fixing
> > that and bunch of other errors/warning, I am able to get pacemaker
> > started though not completely running fine.
> >
> > The node is not getting added:
> > airv_cu cib: error: xml_log: Element node failed to validate
> > attributes
> >
> > I suppose it is because of this error:
> > crmd: error: node_list_update_callback: Node update 4 failed: Update
> > does not conform to the configured schema (-203)
> >
> > I am suspecting this is caused because of
> > validate-with="pacemaker-0.7" in the cib. In another installation this
> > is being set to '"pacemaker-2.0"'
> >
> > [root@airv_cu pacemaker]# pcs cluster cib
> > <cib validate-with="pacemaker-0.7" ... num_updates="0" admin_epoch="0"
> > cib-last-written="Fri May  6 09:28:10 2016" have-quorum="1">
> >   <configuration>
> >     <crm_config>
> >       <cluster_property_set ...>
> >         <nvpair ... name="have-watchdog" value="true"/>
> >         <nvpair ... name="dc-version" value="1.1.14-5a6cdd1"/>
> >         <nvpair ... name="cluster-infrastructure" value="corosync"/>
> > [remainder of the CIB XML stripped by the archive]
> >
> > Any idea why/where this is being set to 0.7. I am using latest
> > pacemaker from GitHub.
>
> What kind of tooling did you use to create the cib?
> Try 'cibadmin --upgrade'. That should set the cib-version to what your
> pacemaker-version supports.
>
> >
> > [root@airv_cu pacemaker]# pacemakerd --version
> > Pacemaker 1.1.14
> > Written by Andrew Beekhof
> >
> > Attaching the corosync.log and corosync.conf file.
> >
> > -Thanks
> > Nikhil
> >
> >
> > On Thu, May 5, 2016 at 10:21 PM, Ken Gaillot  > <mailto:kgail...@redhat.com>> wrote:
> >
> > On 05/05/2016 11:25 AM, Nikhil Utane wrote:
> > > Thanks Ken for your quick response as always.
> > >
> > > But what if I don't want to use quorum? I just want to bring up
> > > pacemaker + corosync on 1 node to check that it all comes up fine.
> > > I added corosync_votequorum as you suggested. Additionally I
> > also added
> > > these 2 lines:
> > >
> > > expected_votes: 2
> > > two_node: 1
> >
> > There's actually nothing wrong with configuring a single-node
> cluster.
> > You can list just one node in corosync.conf and leave off the above.
> >
> > > However still pacemaker is not able to run.
> >
> > There must be other issues involved. Even if pacemaker doesn't have
> > quorum, it will still run, it just won't start resources.
> >
> > > [root@airv_cu root]# pc

Re: [ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

2016-05-05 Thread Nikhil Utane
Thanks Ken for your quick response as always.

But what if I don't want to use quorum? I just want to bring up pacemaker +
corosync on 1 node to check that it all comes up fine.
I added corosync_votequorum as you suggested. Additionally I also added
these 2 lines:

expected_votes: 2
two_node: 1
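
(i.e. the quorum section in my corosync.conf now reads, roughly:)

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}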

However, pacemaker is still not able to run.

[root@airv_cu root]# pcs cluster start
Starting Cluster...
Starting Pacemaker Cluster Manager[FAILED]

Error: unable to start pacemaker

Corosync.log:
*May 05 16:15:20 [16294] airv_cu pacemakerd: info:
pcmk_quorum_notification: Membership 240: quorum still lost (1)*
May 05 16:15:20 [16259] airv_cu corosync debug   [QB] Free'ing
ringbuffer: /dev/shm/qb-cmap-request-16259-16294-21-header
May 05 16:15:20 [16294] airv_cu pacemakerd:   notice:
crm_update_peer_state_iter:   pcmk_quorum_notification: Node
airv_cu[181344357] - state is now member (was (null))
May 05 16:15:20 [16294] airv_cu pacemakerd: info: pcmk_cpg_membership:
 Node 181344357 joined group pacemakerd (counter=0.0)
May 05 16:15:20 [16294] airv_cu pacemakerd: info: pcmk_cpg_membership:
 Node 181344357 still member of group pacemakerd (peer=airv_cu,
counter=0.0)
May 05 16:15:20 [16294] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
cib process (16353) can no longer be respawned, shutting the cluster down.
May 05 16:15:20 [16294] airv_cu pacemakerd:   notice: pcmk_shutdown_worker:
Shutting down Pacemaker

The log and conf file is attached.

-Regards
Nikhil

On Thu, May 5, 2016 at 8:04 PM, Ken Gaillot  wrote:

> On 05/05/2016 08:36 AM, Nikhil Utane wrote:
> > Hi,
> >
> > Continuing with my adventure to run Pacemaker & Corosync on our
> > big-endian system, I managed to get past the corosync issue for now. But
> > facing an issue in running Pacemaker.
> >
> > Seeing following messages in corosync.log.
> >  pacemakerd:  warning: pcmk_child_exit:  The cib process (2) can no
> > longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The stonith-ng process (20001)
> > can no longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The lrmd process (20002) can no
> > longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The attrd process (20003) can
> > no longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The pengine process (20004) can
> > no longer be respawned, shutting the cluster down.
> >  pacemakerd:  warning: pcmk_child_exit:  The crmd process (20005) can no
> > longer be respawned, shutting the cluster down.
> >
> > I see following error before these messages. Not sure if this is the
> cause.
> > May 05 11:26:24 [19998] airv_cu pacemakerd:error:
> > cluster_connect_quorum:   Corosync quorum is not configured
> >
> > I tried removing the quorum block (which is anyways blank) from the conf
> > file but still had the same error.
>
> Yes, that is the issue. Pacemaker can't do anything if it can't ask
> corosync about quorum. I don't know what the issue is at the corosync
> level, but your corosync.conf should have:
>
> quorum {
> provider: corosync_votequorum
> }
>
>
> > Attaching the log and conf files. Please let me know if there is any
> > obvious mistake or how to investigate it further.
> >
> > I am using pcs cluster start command to start the cluster
> >
> > -Thanks
> > Nikhil
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


corosync_with_quorum.log
Description: Binary data
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

2016-05-05 Thread Nikhil Utane
Hi,

Continuing with my adventure to run Pacemaker & Corosync on our big-endian
system, I managed to get past the corosync issue for now. But I am now facing
an issue running Pacemaker.

Seeing following messages in corosync.log.
 pacemakerd:  warning: pcmk_child_exit:  The cib process (2) can no
longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The stonith-ng process (20001) can
no longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The lrmd process (20002) can no
longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The attrd process (20003) can no
longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The pengine process (20004) can no
longer be respawned, shutting the cluster down.
 pacemakerd:  warning: pcmk_child_exit:  The crmd process (20005) can no
longer be respawned, shutting the cluster down.

I see the following error before these messages. Not sure if this is the cause.
May 05 11:26:24 [19998] airv_cu pacemakerd:error:
cluster_connect_quorum:   Corosync quorum is not configured

I tried removing the quorum block (which is anyway blank) from the conf
file, but still had the same error.

Attaching the log and conf files. Please let me know if there is any
obvious mistake or how to investigate it further.

I am using the 'pcs cluster start' command to start the cluster.

-Thanks
Nikhil


corosync.log
Description: Binary data


pacemaker.log
Description: Binary data


corosync.conf
Description: Binary data
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-05 Thread Nikhil Utane
It worked for me. :)
I'll wait for your formal patch but until then I am able to proceed
further. (Don't know if I'll run into something else)

However now encountering issue in pacemaker.

May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
cib process (15224) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
stonith-ng process (15225) can no longer be respawned, shutting the cluster
down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
lrmd process (15226) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
crmd process (15229) can no longer be respawned, shutting the cluster down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
pengine process (15228) can no longer be respawned, shutting the cluster
down.
May 05 09:35:53 [15184] airv_cu pacemakerd:  warning: pcmk_child_exit:  The
attrd process (15227) can no longer be respawned, shutting the cluster down.

Looking into it.

-Thanks
Nikhil

On Thu, May 5, 2016 at 2:58 PM, Jan Friesse  wrote:

> Nikhil
>
> Found the root-cause.
>> In file schedwrk.c, the function handle2void() uses a union which was not
>> initialized.
>> Because of that the handle value was computed incorrectly (lower half was
>> garbage).
>>
>>   56 static hdb_handle_t
>>   57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
>>   58 static const void *
>>   59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }
>>
>> After initializing (as highlighted), the corosync initialization seems to
>> be going through fine. Will check other things.
>>
>
> Your patch is incorrect and actually doesn't work. As I said (when
> pointing you to schedwrk.c), I will send you proper patch, but fix that
> issue correctly is not easy.
>
> Regards,
>   Honza
>
>
>> -Regards
>> Nikhil
>>
>> On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane > >
>> wrote:
>>
>> Thanks for your response Dejan.
>>>
>>> I do not know yet whether this has anything to do with endianness.
>>> FWIW, there could be something quirky with the system so keeping all
>>> options open. :)
>>>
>>> I added some debug prints to understand what's happening under the hood.
>>>
>>> *Success case: (on x86 machine): *
>>> [TOTEM ] entering OPERATIONAL state.
>>> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members
>>> joined:
>>> 181272839
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>>> my_high_delivered=0
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=0
>>> [TOTEM ] Delivering 0 to 1
>>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
>>> my_high_delivered=1
>>> [TOTEM ] Delivering 1 to 2
>>> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
>>> [SYNC  ] Nikhil: Entering sync_barrier_handler
>>> [SYNC  ] Committing synchronization for corosync configuration map access
>>> .
>>> [TOTEM ] Delivering 2 to 4
>>> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
>>> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
>>> [CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
>>> [CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
>>> left:0)
>>> [SYNC  ] Committing synchronization for corosync cluster closed process
>>> group service v1.01
>>> *[MAIN  ] Completed service synchronization, ready to provide service.*
>>>
>>>
>>> *Failure case: (on ppc)*:
>>>
>>> [TOTEM ] entering OPERATIONAL state.
>>> [TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
>>> 181344357
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>>> my_high_delivered=0
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>>> my_high_delivered=0
>>> [TOTEM ] Delivering 0 to 1
>>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>>> [SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
>>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-05 Thread Nikhil Utane
Found the root cause.
In the file schedwrk.c, the function handle2void() uses a union that was not
initialized.
Because of that, the handle value was computed incorrectly (the lower half was
garbage).

 56 static hdb_handle_t
 57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
 58 static const void *
 59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }

After zero-initializing the unions (the u={} initializers above), corosync
initialization seems to go through fine. I will check other things.
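
(To illustrate the failure mode for anyone hitting this later: the following
is only an illustrative sketch with made-up type widths, not the corosync
code and not Honza's eventual patch.)

/* A 32-bit pointer stored into a 64-bit handle through an uninitialized
 * union leaves the other half of the handle undefined. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t hdb_handle_t;
union u { hdb_handle_t h; const void *v; };

static hdb_handle_t void2handle (const void *v) { union u u = {0}; u.v = v; return u.h; }
static const void *handle2void (hdb_handle_t h) { union u u = {0}; u.h = h; return u.v; }

int main (void)
{
    int x;
    hdb_handle_t h = void2handle (&x);
    /* Without the = {0} initializers, the part of 'h' not overlapped by the
     * pointer is stack garbage on a 32-bit target, so the round trip below
     * can hand back a corrupted pointer; with them it is stable. */
    printf ("round-trip ok: %d\n", handle2void (h) == (const void *) &x);
    return 0;
}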

-Regards
Nikhil

On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane 
wrote:

> Thanks for your response Dejan.
>
> I do not know yet whether this has anything to do with endianness.
> FWIW, there could be something quirky with the system so keeping all
> options open. :)
>
> I added some debug prints to understand what's happening under the hood.
>
> *Success case: (on x86 machine): *
> [TOTEM ] entering OPERATIONAL state.
> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
> 181272839
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
> my_high_delivered=0
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=0
> [TOTEM ] Delivering 0 to 1
> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
> my_high_delivered=1
> [TOTEM ] Delivering 1 to 2
> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
> [SYNC  ] Nikhil: Entering sync_barrier_handler
> [SYNC  ] Committing synchronization for corosync configuration map access
> .
> [TOTEM ] Delivering 2 to 4
> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
> [CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
> [CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
> left:0)
> [SYNC  ] Committing synchronization for corosync cluster closed process
> group service v1.01
> *[MAIN  ] Completed service synchronization, ready to provide service.*
>
>
> *Failure case: (on ppc)*:
> [TOTEM ] entering OPERATIONAL state.
> [TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
> 181344357
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
> my_high_delivered=0
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=0
> [TOTEM ] Delivering 0 to 1
> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
> [SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=1
> Above message repeats continuously.
>
> So it appears that in failure case I do not receive messages with sequence
> number 2-4.
> If somebody can throw some ideas that'll help a lot.
>
> -Thanks
> Nikhil
>
> On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic 
> wrote:
>
>> Hi,
>>
>> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
>> > >As your hardware is probably capable of running ppcle and if you have
>> an
>> > >environment
>> > >at hand without too much effort it might pay off to try that.
>> > >There are of course distributions out there support corosync on
>> > >big-endian architectures
>> > >but I don't know if there is an automatized regression for corosync on
>> > >big-endian that
>> > >would catch big-endian-issues right away with something as current as
>> > >your 2.3.5.
>> >
>> > No we are not testing big-endian.
>> >
>> > So totally agree with Klaus. Give a try to ppcle. Also make sure all
>> > nodes are little-endian. Corosync should work in mixed BE/LE
>> > environment but because it's not tested, it may not work (and it's a
>> > bug, so if ppcle works I will try to fix BE).
>>
>> I tested a cluster consisting of big endian/little endian nodes
>> (s390 and x86-64), but that was a while ago. IIRC, all relevant
>> bugs in corosync got fixed at that time. Don't know what is the
>> situation with the latest version.
>>
>> Thanks,
>>
>> Dejan
>>
>> > Regards,
>> >   Honza
>> >
>> > >
>> > >Regards,
>> > >Klaus
>> > >
>> > >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>> > >

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-03 Thread Nikhil Utane
Thanks for your response Dejan.

I do not know yet whether this has anything to do with endianness.
FWIW, there could be something quirky with the system, so I'm keeping all
options open. :)

I added some debug prints to understand what's happening under the hood.

*Success case: (on x86 machine): *
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
181272839
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
my_high_delivered=1
[TOTEM ] Delivering 1 to 2
[TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
[SYNC  ] Nikhil: Entering sync_barrier_handler
[SYNC  ] Committing synchronization for corosync configuration map access
.
[TOTEM ] Delivering 2 to 4
[TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
[TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
[CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[SYNC  ] Committing synchronization for corosync cluster closed process
group service v1.01
*[MAIN  ] Completed service synchronization, ready to provide service.*


*Failure case: (on ppc)*:
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
181344357
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
The above message repeats continuously.

So it appears that in the failure case I do not receive the messages with
sequence numbers 2-4.
If somebody can throw out some ideas, that would help a lot.

-Thanks
Nikhil

On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic 
wrote:

> Hi,
>
> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
> > >As your hardware is probably capable of running ppcle and if you have an
> > >environment
> > >at hand without too much effort it might pay off to try that.
> > >There are of course distributions out there support corosync on
> > >big-endian architectures
> > >but I don't know if there is an automatized regression for corosync on
> > >big-endian that
> > >would catch big-endian-issues right away with something as current as
> > >your 2.3.5.
> >
> > No we are not testing big-endian.
> >
> > So totally agree with Klaus. Give a try to ppcle. Also make sure all
> > nodes are little-endian. Corosync should work in mixed BE/LE
> > environment but because it's not tested, it may not work (and it's a
> > bug, so if ppcle works I will try to fix BE).
>
> I tested a cluster consisting of big endian/little endian nodes
> (s390 and x86-64), but that was a while ago. IIRC, all relevant
> bugs in corosync got fixed at that time. Don't know what is the
> situation with the latest version.
>
> Thanks,
>
> Dejan
>
> > Regards,
> >   Honza
> >
> > >
> > >Regards,
> > >Klaus
> > >
> > >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
> > >>Re-sending as I don't see my post on the thread.
> > >>
> > >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> > >>mailto:nikhil.subscri...@gmail.com>>
> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Looking for some guidance here as we are completely blocked
> > >> otherwise :(.
> > >>
> > >> -Regards
> > >> Nikhil
> > >>
> > >> On Fri, Apr 29, 2016 at 6:11 PM, Sriram  > >> <mailto:sriram...@gmail.com>> wrote:
> > >>
> > >> Corrected the subject.
> > >>
> > >> We went ahead and captured corosync debug logs for our ppc
> board.
> > >> After log analysis and comparison with the sucessful logs(
> > >> from x86 machine) ,
> > >> we didnt find *"[ MAIN  ] Completed service synchronization,
> > >>

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
It is a Freescale e6500 processor. Nobody here has tried running it in LE
mode, so it is going to take some doing.
We are going to add some debug logs to figure out where the corosync
initialization gets stalled.
If you have any suggestions, please let us know.

-Thanks
Nikhil


On Mon, May 2, 2016 at 1:00 PM, Nikhil Utane 
wrote:

> So what I understand what you are saying is, if the HW is bi-endian, then
> enable LE on PPC. Is that right?
> Need to check on that.
>
> On Mon, May 2, 2016 at 12:49 PM, Nikhil Utane  > wrote:
>
>> Sorry about my ignorance but could you pls elaborate what do you mean by
>> "try to ppcle"?
>>
>> Our target platform is ppc so it is BE. We have to get it running only on
>> that.
>> How do we know this is LE/BE issue and nothing else?
>>
>> -Thanks
>> Nikhil
>>
>>
>> On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:
>>
>>> As your hardware is probably capable of running ppcle and if you have an
>>>> environment
>>>> at hand without too much effort it might pay off to try that.
>>>> There are of course distributions out there support corosync on
>>>> big-endian architectures
>>>> but I don't know if there is an automatized regression for corosync on
>>>> big-endian that
>>>> would catch big-endian-issues right away with something as current as
>>>> your 2.3.5.
>>>>
>>>
>>> No we are not testing big-endian.
>>>
>>> So totally agree with Klaus. Give a try to ppcle. Also make sure all
>>> nodes are little-endian. Corosync should work in mixed BE/LE environment
>>> but because it's not tested, it may not work (and it's a bug, so if ppcle
>>> works I will try to fix BE).
>>>
>>> Regards,
>>>   Honza
>>>
>>>
>>>
>>>> Regards,
>>>> Klaus
>>>>
>>>> On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>>>>
>>>>> Re-sending as I don't see my post on the thread.
>>>>>
>>>>> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
>>>>> mailto:nikhil.subscri...@gmail.com>>
>>>>> wrote:
>>>>>
>>>>>  Hi,
>>>>>
>>>>>  Looking for some guidance here as we are completely blocked
>>>>>  otherwise :(.
>>>>>
>>>>>  -Regards
>>>>>  Nikhil
>>>>>
>>>>>  On Fri, Apr 29, 2016 at 6:11 PM, Sriram >>>>  <mailto:sriram...@gmail.com>> wrote:
>>>>>
>>>>>  Corrected the subject.
>>>>>
>>>>>  We went ahead and captured corosync debug logs for our ppc
>>>>> board.
>>>>>  After log analysis and comparison with the sucessful logs(
>>>>>  from x86 machine) ,
>>>>>  we didnt find *"[ MAIN  ] Completed service synchronization,
>>>>>  ready to provide service.*" in ppc logs.
>>>>>  So, looks like corosync is not in a position to accept
>>>>>  connection from Pacemaker.
>>>>>  Even I tried with the new corosync.conf with no success.
>>>>>
>>>>>  Any hints on this issue would be really helpful.
>>>>>
>>>>>  Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>>>>>
>>>>>  Regards,
>>>>>  Sriram
>>>>>
>>>>>
>>>>>
>>>>>  On Fri, Apr 29, 2016 at 2:44 PM, Sriram >>>>  <mailto:sriram...@gmail.com>> wrote:
>>>>>
>>>>>  Hi,
>>>>>
>>>>>  I went ahead and made some changes in file system(Like I
>>>>>  brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
>>>>>  /etc/sysconfig ), After that I was able to run  "pcs
>>>>>  cluster start".
>>>>>  But it failed with the following error
>>>>>   # pcs cluster start
>>>>>  Starting Cluster...
>>>>>  Starting Pacemaker Cluster Manager[FAILED]
>>>>>  Error: unable to start pacemaker
>>>>>
>>>>>  And in the /var/log/pacemaker.log, I saw these errors
>>>>> 

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
So what I understand you are saying is: if the HW is bi-endian, then enable
LE mode on the PPC. Is that right?
Need to check on that.

On Mon, May 2, 2016 at 12:49 PM, Nikhil Utane 
wrote:

> Sorry about my ignorance but could you pls elaborate what do you mean by
> "try to ppcle"?
>
> Our target platform is ppc so it is BE. We have to get it running only on
> that.
> How do we know this is LE/BE issue and nothing else?
>
> -Thanks
> Nikhil
>
>
> On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:
>
>> As your hardware is probably capable of running ppcle and if you have an
>>> environment
>>> at hand without too much effort it might pay off to try that.
>>> There are of course distributions out there support corosync on
>>> big-endian architectures
>>> but I don't know if there is an automatized regression for corosync on
>>> big-endian that
>>> would catch big-endian-issues right away with something as current as
>>> your 2.3.5.
>>>
>>
>> No we are not testing big-endian.
>>
>> So totally agree with Klaus. Give a try to ppcle. Also make sure all
>> nodes are little-endian. Corosync should work in mixed BE/LE environment
>> but because it's not tested, it may not work (and it's a bug, so if ppcle
>> works I will try to fix BE).
>>
>> Regards,
>>   Honza
>>
>>
>>
>>> Regards,
>>> Klaus
>>>
>>> On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>>>
>>>> Re-sending as I don't see my post on the thread.
>>>>
>>>> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
>>>> mailto:nikhil.subscri...@gmail.com>>
>>>> wrote:
>>>>
>>>>  Hi,
>>>>
>>>>  Looking for some guidance here as we are completely blocked
>>>>  otherwise :(.
>>>>
>>>>  -Regards
>>>>  Nikhil
>>>>
>>>>  On Fri, Apr 29, 2016 at 6:11 PM, Sriram >>>  <mailto:sriram...@gmail.com>> wrote:
>>>>
>>>>  Corrected the subject.
>>>>
>>>>  We went ahead and captured corosync debug logs for our ppc
>>>> board.
>>>>  After log analysis and comparison with the sucessful logs(
>>>>  from x86 machine) ,
>>>>  we didnt find *"[ MAIN  ] Completed service synchronization,
>>>>  ready to provide service.*" in ppc logs.
>>>>  So, looks like corosync is not in a position to accept
>>>>  connection from Pacemaker.
>>>>  Even I tried with the new corosync.conf with no success.
>>>>
>>>>  Any hints on this issue would be really helpful.
>>>>
>>>>  Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>>>>
>>>>  Regards,
>>>>  Sriram
>>>>
>>>>
>>>>
>>>>  On Fri, Apr 29, 2016 at 2:44 PM, Sriram >>>  <mailto:sriram...@gmail.com>> wrote:
>>>>
>>>>  Hi,
>>>>
>>>>  I went ahead and made some changes in file system(Like I
>>>>  brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
>>>>  /etc/sysconfig ), After that I was able to run  "pcs
>>>>  cluster start".
>>>>  But it failed with the following error
>>>>   # pcs cluster start
>>>>  Starting Cluster...
>>>>  Starting Pacemaker Cluster Manager[FAILED]
>>>>  Error: unable to start pacemaker
>>>>
>>>>  And in the /var/log/pacemaker.log, I saw these errors
>>>>  pacemakerd: info: mcp_read_config:  cmap connection
>>>>  setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
>>>>  Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
>>>>  mcp_read_config:  cmap connection setup failed:
>>>>  CS_ERR_TRY_AGAIN.  Retrying in 5s
>>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
>>>>  mcp_read_config:  Could not connect to Cluster
>>>>  Configuration Database API, error 6
>>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
>>>>  main: Could not ob

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-02 Thread Nikhil Utane
Sorry about my ignorance, but could you please elaborate on what you mean by
"try to ppcle"?

Our target platform is ppc, so it is BE. We have to get it running on that
platform only.
How do we know this is an LE/BE issue and nothing else?

-Thanks
Nikhil


On Mon, May 2, 2016 at 12:24 PM, Jan Friesse  wrote:

> As your hardware is probably capable of running ppcle and if you have an
>> environment
>> at hand without too much effort it might pay off to try that.
>> There are of course distributions out there support corosync on
>> big-endian architectures
>> but I don't know if there is an automatized regression for corosync on
>> big-endian that
>> would catch big-endian-issues right away with something as current as
>> your 2.3.5.
>>
>
> No we are not testing big-endian.
>
> So totally agree with Klaus. Give a try to ppcle. Also make sure all nodes
> are little-endian. Corosync should work in mixed BE/LE environment but
> because it's not tested, it may not work (and it's a bug, so if ppcle works
> I will try to fix BE).
>
> Regards,
>   Honza
>
>
>
>> Regards,
>> Klaus
>>
>> On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>>
>>> Re-sending as I don't see my post on the thread.
>>>
>>> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
>>> mailto:nikhil.subscri...@gmail.com>>
>>> wrote:
>>>
>>>  Hi,
>>>
>>>  Looking for some guidance here as we are completely blocked
>>>  otherwise :(.
>>>
>>>  -Regards
>>>  Nikhil
>>>
>>>  On Fri, Apr 29, 2016 at 6:11 PM, Sriram >>  <mailto:sriram...@gmail.com>> wrote:
>>>
>>>  Corrected the subject.
>>>
>>>  We went ahead and captured corosync debug logs for our ppc
>>> board.
>>>  After log analysis and comparison with the sucessful logs(
>>>  from x86 machine) ,
>>>  we didnt find *"[ MAIN  ] Completed service synchronization,
>>>  ready to provide service.*" in ppc logs.
>>>  So, looks like corosync is not in a position to accept
>>>  connection from Pacemaker.
>>>  Even I tried with the new corosync.conf with no success.
>>>
>>>  Any hints on this issue would be really helpful.
>>>
>>>  Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>>>
>>>  Regards,
>>>  Sriram
>>>
>>>
>>>
>>>  On Fri, Apr 29, 2016 at 2:44 PM, Sriram >>  <mailto:sriram...@gmail.com>> wrote:
>>>
>>>  Hi,
>>>
>>>  I went ahead and made some changes in file system(Like I
>>>  brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
>>>  /etc/sysconfig ), After that I was able to run  "pcs
>>>  cluster start".
>>>  But it failed with the following error
>>>   # pcs cluster start
>>>  Starting Cluster...
>>>  Starting Pacemaker Cluster Manager[FAILED]
>>>  Error: unable to start pacemaker
>>>
>>>  And in the /var/log/pacemaker.log, I saw these errors
>>>  pacemakerd: info: mcp_read_config:  cmap connection
>>>  setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
>>>  Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
>>>  mcp_read_config:  cmap connection setup failed:
>>>  CS_ERR_TRY_AGAIN.  Retrying in 5s
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
>>>  mcp_read_config:  Could not connect to Cluster
>>>  Configuration Database API, error 6
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
>>>  main: Could not obtain corosync config data, exiting
>>>  Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
>>>  crm_xml_cleanup:  Cleaning up memory from libxml2
>>>
>>>
>>>  And in the /var/log/Debuglog, I saw these errors coming
>>>  from corosync
>>>  20160429 085347.487050  airv_cu
>>>  daemon.warn corosync[12857]:   [QB] Denied connection,
>>>  is not ready (12857-15863-14)
>>>  20160429 085347.487067  airv_cu
>>>  

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-01 Thread Nikhil Utane
Re-sending as I don't see my post on the thread.

On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane 
wrote:

> Hi,
>
> Looking for some guidance here as we are completely blocked otherwise :(.
>
> -Regards
> Nikhil
>
> On Fri, Apr 29, 2016 at 6:11 PM, Sriram  wrote:
>
>> Corrected the subject.
>>
>> We went ahead and captured corosync debug logs for our ppc board.
>> After log analysis and comparison with the sucessful logs( from x86
>> machine) ,
>> we didnt find * "[ MAIN  ] Completed service synchronization, ready to
>> provide service.*" in ppc logs.
>> So, looks like corosync is not in a position to accept connection from
>> Pacemaker.
>> Even I tried with the new corosync.conf with no success.
>>
>> Any hints on this issue would be really helpful.
>>
>> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>>
>> Regards,
>> Sriram
>>
>>
>>
>> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  wrote:
>>
>>> Hi,
>>>
>>> I went ahead and made some changes in file system(Like I brought in
>>> /etc/init.d/corosync and /etc/init.d/pacemaker, /etc/sysconfig ), After
>>> that I was able to run  "pcs cluster start".
>>> But it failed with the following error
>>>  # pcs cluster start
>>> Starting Cluster...
>>> Starting Pacemaker Cluster Manager[FAILED]
>>> Error: unable to start pacemaker
>>>
>>> And in the /var/log/pacemaker.log, I saw these errors
>>> pacemakerd: info: mcp_read_config:  cmap connection setup failed:
>>> CS_ERR_TRY_AGAIN.  Retrying in 4s
>>> Apr 29 08:53:47 [15863] node_cu pacemakerd: info: mcp_read_config:
>>> cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 5s
>>> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning: mcp_read_config:
>>> Could not connect to Cluster Configuration Database API, error 6
>>> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice: main: Could
>>> not obtain corosync config data, exiting
>>> Apr 29 08:53:52 [15863] node_cu pacemakerd: info: crm_xml_cleanup:
>>> Cleaning up memory from libxml2
>>>
>>>
>>> And in the /var/log/Debuglog, I saw these errors coming from corosync
>>> 20160429 085347.487050 airv_cu daemon.warn corosync[12857]:   [QB]
>>> Denied connection, is not ready (12857-15863-14)
>>> 20160429 085347.487067 airv_cu daemon.info corosync[12857]:   [QB]
>>> Denied connection, is not ready (12857-15863-14)
>>>
>>>
>>> I browsed the code of libqb to find that it is failing in
>>>
>>> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>>>
>>> Line 600 :
>>> handle_new_connection function
>>>
>>> Line 637:
>>> if (auth_result == 0 && c->service->serv_fns.connection_accept) {
>>> res = c->service->serv_fns.connection_accept(c,
>>>  c->euid, c->egid);
>>> }
>>> if (res != 0) {
>>> goto send_response;
>>> }
>>>
>>> Any hints on this issue would be really helpful for me to go ahead.
>>> Please let me know if any logs are required,
>>>
>>> Regards,
>>> Sriram
>>>
>>> On Thu, Apr 28, 2016 at 2:42 PM, Sriram  wrote:
>>>
>>>> Thanks Ken and Emmanuel.
>>>> Its a big endian machine. I will try with running "pcs cluster setup"
>>>> and "pcs cluster start"
>>>> Inside cluster.py, "service pacemaker start" and "service corosync
>>>> start" are executed to bring up pacemaker and corosync.
>>>> Those service scripts and the infrastructure needed to bring up the
>>>> processes in the above said manner doesn't exist in my board.
>>>> As it is a embedded board with the limited memory, full fledged linux
>>>> is not installed.
>>>> Just curious to know, what could be reason the pacemaker throws that
>>>> error.
>>>>
>>>>
>>>>
>>>> *"cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 1s"*
>>>> Thanks for response.
>>>>
>>>> Regards,
>>>> Sriram.
>>>>
>>>> On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot 
>>>> wrote:
>>>>
>>>>> On 04/27/2016 11:25 AM, emmanuel segura wrote:
>>>>> > you need to use pcs to do everything,

Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-01 Thread Nikhil Utane
Hi,

Looking for some guidance here as we are completely blocked otherwise :(.

-Regards
Nikhil

On Fri, Apr 29, 2016 at 6:11 PM, Sriram  wrote:

> Corrected the subject.
>
> We went ahead and captured corosync debug logs for our ppc board.
> After log analysis and comparison with the sucessful logs( from x86
> machine) ,
> we didnt find * "[ MAIN  ] Completed service synchronization, ready to
> provide service.*" in ppc logs.
> So, looks like corosync is not in a position to accept connection from
> Pacemaker.
> Even I tried with the new corosync.conf with no success.
>
> Any hints on this issue would be really helpful.
>
> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>
> Regards,
> Sriram
>
>
>
> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  wrote:
>
>> Hi,
>>
>> I went ahead and made some changes in file system(Like I brought in
>> /etc/init.d/corosync and /etc/init.d/pacemaker, /etc/sysconfig ), After
>> that I was able to run  "pcs cluster start".
>> But it failed with the following error
>>  # pcs cluster start
>> Starting Cluster...
>> Starting Pacemaker Cluster Manager[FAILED]
>> Error: unable to start pacemaker
>>
>> And in the /var/log/pacemaker.log, I saw these errors
>> pacemakerd: info: mcp_read_config:  cmap connection setup failed:
>> CS_ERR_TRY_AGAIN.  Retrying in 4s
>> Apr 29 08:53:47 [15863] node_cu pacemakerd: info: mcp_read_config:
>> cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 5s
>> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning: mcp_read_config:
>> Could not connect to Cluster Configuration Database API, error 6
>> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice: main: Could not
>> obtain corosync config data, exiting
>> Apr 29 08:53:52 [15863] node_cu pacemakerd: info: crm_xml_cleanup:
>> Cleaning up memory from libxml2
>>
>>
>> And in the /var/log/Debuglog, I saw these errors coming from corosync
>> 20160429 085347.487050 airv_cu daemon.warn corosync[12857]:   [QB]
>> Denied connection, is not ready (12857-15863-14)
>> 20160429 085347.487067 airv_cu daemon.info corosync[12857]:   [QB]
>> Denied connection, is not ready (12857-15863-14)
>>
>>
>> I browsed the code of libqb to find that it is failing in
>>
>> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>>
>> Line 600 :
>> handle_new_connection function
>>
>> Line 637:
>> if (auth_result == 0 && c->service->serv_fns.connection_accept) {
>> res = c->service->serv_fns.connection_accept(c,
>>  c->euid, c->egid);
>> }
>> if (res != 0) {
>> goto send_response;
>> }
>>
>> Any hints on this issue would be really helpful for me to go ahead.
>> Please let me know if any logs are required,
>>
>> Regards,
>> Sriram
>>
>> On Thu, Apr 28, 2016 at 2:42 PM, Sriram  wrote:
>>
>>> Thanks Ken and Emmanuel.
>>> Its a big endian machine. I will try with running "pcs cluster setup"
>>> and "pcs cluster start"
>>> Inside cluster.py, "service pacemaker start" and "service corosync
>>> start" are executed to bring up pacemaker and corosync.
>>> Those service scripts and the infrastructure needed to bring up the
>>> processes in the above said manner doesn't exist in my board.
>>> As it is a embedded board with the limited memory, full fledged linux is
>>> not installed.
>>> Just curious to know, what could be reason the pacemaker throws that
>>> error.
>>>
>>>
>>>
>>> *"cmap connection setup failed: CS_ERR_TRY_AGAIN.  Retrying in 1s"*
>>> Thanks for response.
>>>
>>> Regards,
>>> Sriram.
>>>
>>> On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot 
>>> wrote:
>>>
 On 04/27/2016 11:25 AM, emmanuel segura wrote:
 > you need to use pcs to do everything, pcs cluster setup and pcs
 > cluster start, try to use the redhat docs for more information.

 Agreed -- pcs cluster setup will create a proper corosync.conf for you.
 Your corosync.conf below uses corosync 1 syntax, and there were
 significant changes in corosync 2. In particular, you don't need the
 file created in step 4, because pacemaker is no longer launched via a
 corosync plugin.
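
(A minimal sketch of that suggestion, assuming pcs 0.9.x and made-up node names;
"pcs cluster setup" regenerates corosync.conf in corosync 2 syntax:)

$ pcs cluster setup --name mycluster node_cu1 node_cu2
$ pcs cluster start --all
$ corosync-cmapctl | grep member    # corosync 2.x: confirms the cmap IPC pacemaker needs is up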

 > 2016-04-27 17:28 GMT+02:00 Sriram :
 >> Dear All,
 >>
 >> I m trying to use pacemaker and corosync for the clustering
 requirement that
 >> came up recently.
 >> We have cross compiled corosync, pacemaker and pcs(python) for ppc
 >> environment (Target board where pacemaker and corosync are supposed
 to run)
 >> I m having trouble bringing up pacemaker in that environment, though
 I could
 >> successfully bring up corosync.
 >> Any help is welcome.
 >>
 >> I m using these versions of pacemaker and corosync
 >> [root@node_cu pacemaker]# corosync -v
 >> Corosync Cluster Engine, version '2.3.5'
 >> Copyright (c) 2006-2009 Red Hat, Inc.
 >> [root@node_cu pacemaker]# pacemakerd -$
 >> Pacemaker 1.1.14
 >> Written by Andrew Beekhof
 >>
 >> For running corosync, I did 

Re: [ClusterLabs] Help required for N+1 redundancy setup

2016-03-19 Thread Nikhil Utane
Thanks Ken for the detailed response.
I suppose I could even use some of the pcs/crm CLI commands then.
Cheers.
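
(A rough sketch of those CLI checks, assuming the cluster logs to /var/log/messages
and using a hypothetical node name node1; stonith_admin is the pacemaker tool
behind the fence-history suggestion below:)

$ crm_mon -1 -r                                    # failed resources/actions name the node involved
$ grep -E "state is now lost|left us" /var/log/messages
$ stonith_admin --history node1                    # fence history, if node1 was fenced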

On Wed, Mar 16, 2016 at 8:27 PM, Ken Gaillot  wrote:

> On 03/16/2016 05:22 AM, Nikhil Utane wrote:
> > I see following info gets updated in CIB. Can I use this or there is
> better
> > way?
> >
> >  > crm-debug-origin="peer_update_callback" join="*down*" expected="member">
>
> in_ccm/crmd/join reflect the current state of the node (as known by the
> partition that you're looking at the CIB on), so if the node went down
> and came back up, it won't tell you anything about being down.
>
> - in_ccm indicates that the node is part of the underlying cluster layer
> (heartbeat/cman/corosync)
>
> - crmd indicates that the node is communicating at the pacemaker layer
>
> - join indicates what phase of the join process the node is at
>
> There's not a direct way to see what node went down after the fact.
> There are ways however:
>
> - if the node was running resources, those will be failed, and those
> failures (including node) will be shown in the cluster status
>
> - the logs show all node membership events; you can search for logs such
> as "state is now lost" and "left us"
>
> - "stonith -H $NODE_NAME" will show the fence history for a given node,
> so if the node went down due to fencing, it will show up there
>
> - you can configure an ocf:pacemaker:ClusterMon resource to run crm_mon
> periodically and run a script for node events, and you can write the
> script to do whatever you want (email you, etc.) (in the upcoming 1.1.15
> release, built-in notifications will make this more reliable and easier,
> but any script you use with ClusterMon will still be usable with the new
> method)
>
> > On Wed, Mar 16, 2016 at 12:40 PM, Nikhil Utane <
> nikhil.subscri...@gmail.com>
> > wrote:
> >
> >> Hi Ken,
> >>
> >> Sorry about the long delay. This activity was de-focussed but now it's
> >> back on track.
> >>
> >> One part of question that is still not answered is on the newly active
> >> node, how to find out which was the node that went down?
> >> Anything that gets updated in the status section that can be read and
> >> figured out?
> >>
> >> Thanks.
> >> Nikhil
> >>
> >> On Sat, Jan 9, 2016 at 3:31 AM, Ken Gaillot 
> wrote:
> >>
> >>> On 01/08/2016 11:13 AM, Nikhil Utane wrote:
> >>>>> I think stickiness will do what you want here. Set a stickiness
> higher
> >>>>> than the original node's preference, and the resource will want to
> stay
> >>>>> where it is.
> >>>>
> >>>> Not sure I understand this. Stickiness will ensure that resources
> don't
> >>>> move back when original node comes back up, isn't it?
> >>>> But in my case, I want the newly standby node to become the backup
> node
> >>> for
> >>>> all other nodes. i.e. it should now be able to run all my resource
> >>> groups
> >>>> albeit with a lower score. How do I achieve that?
> >>>
> >>> Oh right. I forgot to ask whether you had an opt-out
> >>> (symmetric-cluster=true, the default) or opt-in
> >>> (symmetric-cluster=false) cluster. If you're opt-out, every node can
> run
> >>> every resource unless you give it a negative preference.
> >>>
> >>> Partly it depends on whether there is a good reason to give each
> >>> instance a "home" node. Often, there's not. If you just want to balance
> >>> resources across nodes, the cluster will do that by default.
> >>>
> >>> If you prefer to put certain resources on certain nodes because the
> >>> resources require more physical resources (RAM/CPU/whatever), you can
> >>> set node attributes for that and use rules to set node preferences.
> >>>
> >>> Either way, you can decide whether you want stickiness with it.
> >>>
> >>>> Also can you answer, how to get the values of node that goes active
> and
> >>> the
> >>>> node that goes down inside the OCF agent?  Do I need to use
> >>> notification or
> >>>> some simpler alternative is available?
> >>>> Thanks.
> >>>>
> >>>>
> >>>> On Fri, Jan 8, 2016 at 9:30 PM, Ken Gaillot 
> >>> wrote:
> >>>>

Re: [ClusterLabs] Security with Corosync

2016-03-19 Thread Nikhil Utane
Honza,

In my CIB I see the infrastructure being set to cman. pcs status is
reporting the same.

  <nvpair ... name="cluster-infrastructure" value="cman"/>

[root@node3 corosync]# pcs status
Cluster name: mycluster
Last updated: Wed Mar 16 16:57:46 2016
Last change: Wed Mar 16 16:56:23 2016
Stack: *cman*

But corosync also is running fine.

[root@node2 nikhil]# pcs status nodes corosync
Corosync Nodes:
 Online: node2 node3
 Offline: node1

I did a cibadmin query and replace to change cman to corosync, but the value
doesn't change (even though the replace operation succeeds).
I read that CMAN internally uses corosync but in corosync 2 CMAN support is
removed.
Totally confused. Please help.

-Thanks
Nikhil
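
(A minimal way to pin down which stack is really in use, assuming an RPM-based
distro; as Honza explains further down, with corosync 1.x + cman it is
/etc/cluster/cluster.conf that counts and corosync.conf is ignored:)

$ corosync -v                           # 1.x here means the cman-based stack
$ rpm -q cman corosync pacemaker pcs
$ crm_attribute --query --name cluster-infrastructure
$ cat /etc/cluster/cluster.conf         # the file cman actually reads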

On Mon, Mar 14, 2016 at 1:19 PM, Jan Friesse  wrote:

> Nikhil Utane napsal(a):
>
>> Follow-up question.
>> I noticed that secauth was turned off in my corosync.conf file. I enabled
>> it on all 3 nodes and restarted the cluster. Everything was working fine.
>> However I just noticed that I had forgotten to copy the authkey to one of
>> the node. It is present on 2 nodes but not the third. And I did a failover
>> and the third node took over without any issue.
>> How is the 3rd node participating in the cluster if it doesn't have the
>> authkey?
>>
>
> It's just not possible. If you would enabled secauth correctly and you
> didn't have /etc/corosync/authkey, message like "Could not open
> /etc/corosync/authkey: No such file or directory" would show up. There are
> few exceptions:
> - you have changed totem.keyfile with file existing on all nodes
> - you are using totem.key then everything works as expected (it has
> priority over default authkey file but not over totem.keyfile)
> - you are using COROSYNC_TOTEM_AUTHKEY_FILE env with file existing on all
> nodes
>
> Regards,
>   Honza
>
>
>
>> On Fri, Mar 11, 2016 at 4:15 PM, Nikhil Utane <
>> nikhil.subscri...@gmail.com>
>> wrote:
>>
>> Perfect. Thanks for the quick response Honza.
>>>
>>> Cheers
>>> Nikhil
>>>
>>> On Fri, Mar 11, 2016 at 4:10 PM, Jan Friesse 
>>> wrote:
>>>
>>> Nikhil,
>>>>
>>>> Nikhil Utane napsal(a):
>>>>
>>>> Hi,
>>>>>
>>>>> I changed some configuration and captured packets. I can see that the
>>>>> data
>>>>> is already garbled and not in the clear.
>>>>> So does corosync already have this built-in?
>>>>> Can somebody provide more details as to what all security features are
>>>>> incorporated?
>>>>>
>>>>>
>>>> See man page corosync.conf(5) options crypto_hash, crypto_cipher (for
>>>> corosync 2.x) and potentially secauth (for coorsync 1.x and 2.x).
>>>>
>>>> Basically corosync by default uses aes256 for encryption and sha1 for
>>>> hmac authentication.
>>>>
>>>> Pacemaker uses corosync cpg API so as long as encryption is enabled in
>>>> the corosync.conf, messages interchanged between nodes are encrypted.
>>>>
>>>> Regards,
>>>>Honza
>>>>
>>>>
>>>> -Thanks
>>>>> Nikhil
>>>>>
>>>>> On Fri, Mar 11, 2016 at 11:38 AM, Nikhil Utane <
>>>>> nikhil.subscri...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>>
>>>>>> Does corosync provide mechanism to secure the communication path
>>>>>> between
>>>>>> nodes of a cluster?
>>>>>> I would like all the data that gets exchanged between all nodes to be
>>>>>> encrypted.
>>>>>>
>>>>>> A quick google threw up this link:
>>>>>> https://github.com/corosync/corosync/blob/master/SECURITY
>>>>>>
>>>>>> Can I make use of it with pacemaker?
>>>>>>
>>>>>> -Thanks
>>>>>> Nikhil
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>

Re: [ClusterLabs] Security with Corosync

2016-03-19 Thread Nikhil Utane
Honza,

Actually this is only for a PoC (Proof of Concept) setup.
The next step is to move it to a different platform where we are
cross-compiling from the sources. I'd like the PoC setup to have the same
version as the final one.

Thanks.

On Thu, Mar 17, 2016 at 1:07 PM, Jan Friesse  wrote:

> Nikhil Utane napsal(a):
>
>> [root@node3 corosync]# corosync -v
>> Corosync Cluster Engine, version '1.4.7'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>> So it is 1.x :(
>> When I begun I was following multiple tutorials and ended up installing
>> multiple packages. Let me try moving to corosync 2.0.
>> I suppose it should be as easy as doing yum install.
>>
>
> It depends of what distribution are you using (for example RHEL/CentOS has
> only 1.x + cman in 6.x and 2.x in 7.x). But main question is, why you want
> to upgrade? 1.x is fully supported and if it works for you there is no
> reason to upgrade to 2.x. It's best to stay with whatever your distro ships.
>
> Honza
>
>
>
>
>> On Wed, Mar 16, 2016 at 10:29 PM, Jan Friesse 
>> wrote:
>>
>> Nikhil Utane napsal(a):
>>>
>>> Honza,
>>>>
>>>> In my CIB I see the infrastructure being set to cman. pcs status is
>>>> reporting the same.
>>>>
>>>> >>> name="cluster-infrastructure" value="*cman*"/>
>>>>
>>>> [root@node3 corosync]# pcs status
>>>> Cluster name: mycluster
>>>> Last updated: Wed Mar 16 16:57:46 2016
>>>> Last change: Wed Mar 16 16:56:23 2016
>>>> Stack: *cman*
>>>>
>>>> But corosync also is running fine.
>>>>
>>>> [root@node2 nikhil]# pcs status nodes corosync
>>>> Corosync Nodes:
>>>>Online: node2 node3
>>>>Offline: node1
>>>>
>>>> I did a cibadmin query and replace from cman to corosync but it doesn't
>>>> change (even though replace operation succeeds)
>>>> I read that CMAN internally uses corosync but in corosync 2 CMAN support
>>>> is
>>>> removed.
>>>> Totally confused. Please help.
>>>>
>>>>
>>> Best start is to find out what versions you are using? If you have
>>> corosync 1.x and really using cman (what is highly probable),
>>> corosync.conf
>>> is completely ignored and instead cluster.conf
>>> (/etc/cluster/cluster.conf)
>>> is used. cluster.conf uses cman keyfile and if this is not provided,
>>> encryption key is simply cluster name. This is probably reason why
>>> everything worked when you haven't had authkey on one of nodes.
>>>
>>> Honza
>>>
>>>
>>>
>>> -Thanks
>>>> Nikhil
>>>>
>>>> On Mon, Mar 14, 2016 at 1:19 PM, Jan Friesse 
>>>> wrote:
>>>>
>>>> Nikhil Utane napsal(a):
>>>>
>>>>>
>>>>> Follow-up question.
>>>>>
>>>>>> I noticed that secauth was turned off in my corosync.conf file. I
>>>>>> enabled
>>>>>> it on all 3 nodes and restarted the cluster. Everything was working
>>>>>> fine.
>>>>>> However I just noticed that I had forgotten to copy the authkey to one
>>>>>> of
>>>>>> the node. It is present on 2 nodes but not the third. And I did a
>>>>>> failover
>>>>>> and the third node took over without any issue.
>>>>>> How is the 3rd node participating in the cluster if it doesn't have
>>>>>> the
>>>>>> authkey?
>>>>>>
>>>>>>
>>>>>> It's just not possible. If you would enabled secauth correctly and you
>>>>> didn't have /etc/corosync/authkey, message like "Could not open
>>>>> /etc/corosync/authkey: No such file or directory" would show up. There
>>>>> are
>>>>> few exceptions:
>>>>> - you have changed totem.keyfile with file existing on all nodes
>>>>> - you are using totem.key then everything works as expected (it has
>>>>> priority over default authkey file but not over totem.keyfile)
>>>>> - you are using COROSYNC_TOTEM_AUTHKEY_FILE env with file existing on
>>>>> all
>>>>> nodes
>>>>>
>>>>> Regards,
>>>>> Honza
>>>>>
>>>>>
>>>>>
>>

Re: [ClusterLabs] Security with Corosync

2016-03-19 Thread Nikhil Utane
[root@node3 corosync]# corosync -v
Corosync Cluster Engine, version '1.4.7'
Copyright (c) 2006-2009 Red Hat, Inc.

So it is 1.x :(
When I began, I was following multiple tutorials and ended up installing
multiple packages. Let me try moving to corosync 2.0.
I suppose it should be as easy as doing a yum install.
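
(A quick pre-check before attempting that, assuming a yum-based distro; as Honza
notes elsewhere in the thread, RHEL/CentOS 6.x only ships corosync 1.x + cman,
while 7.x ships 2.x:)

$ cat /etc/redhat-release               # 6.x => corosync 1.x + cman; 7.x => corosync 2.x
$ yum list corosync pacemaker pcs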

On Wed, Mar 16, 2016 at 10:29 PM, Jan Friesse  wrote:

> Nikhil Utane napsal(a):
>
>> Honza,
>>
>> In my CIB I see the infrastructure being set to cman. pcs status is
>> reporting the same.
>>
>> > name="cluster-infrastructure" value="*cman*"/>
>>
>> [root@node3 corosync]# pcs status
>> Cluster name: mycluster
>> Last updated: Wed Mar 16 16:57:46 2016
>> Last change: Wed Mar 16 16:56:23 2016
>> Stack: *cman*
>>
>> But corosync also is running fine.
>>
>> [root@node2 nikhil]# pcs status nodes corosync
>> Corosync Nodes:
>>   Online: node2 node3
>>   Offline: node1
>>
>> I did a cibadmin query and replace from cman to corosync but it doesn't
>> change (even though replace operation succeeds)
>> I read that CMAN internally uses corosync but in corosync 2 CMAN support
>> is
>> removed.
>> Totally confused. Please help.
>>
>
> Best start is to find out what versions you are using? If you have
> corosync 1.x and really using cman (what is highly probable), corosync.conf
> is completely ignored and instead cluster.conf (/etc/cluster/cluster.conf)
> is used. cluster.conf uses cman keyfile and if this is not provided,
> encryption key is simply cluster name. This is probably reason why
> everything worked when you haven't had authkey on one of nodes.
>
> Honza
>
>
>
>> -Thanks
>> Nikhil
>>
>> On Mon, Mar 14, 2016 at 1:19 PM, Jan Friesse  wrote:
>>
>> Nikhil Utane napsal(a):
>>>
>>> Follow-up question.
>>>> I noticed that secauth was turned off in my corosync.conf file. I
>>>> enabled
>>>> it on all 3 nodes and restarted the cluster. Everything was working
>>>> fine.
>>>> However I just noticed that I had forgotten to copy the authkey to one
>>>> of
>>>> the node. It is present on 2 nodes but not the third. And I did a
>>>> failover
>>>> and the third node took over without any issue.
>>>> How is the 3rd node participating in the cluster if it doesn't have the
>>>> authkey?
>>>>
>>>>
>>> It's just not possible. If you would enabled secauth correctly and you
>>> didn't have /etc/corosync/authkey, message like "Could not open
>>> /etc/corosync/authkey: No such file or directory" would show up. There
>>> are
>>> few exceptions:
>>> - you have changed totem.keyfile with file existing on all nodes
>>> - you are using totem.key then everything works as expected (it has
>>> priority over default authkey file but not over totem.keyfile)
>>> - you are using COROSYNC_TOTEM_AUTHKEY_FILE env with file existing on all
>>> nodes
>>>
>>> Regards,
>>>Honza
>>>
>>>
>>>
>>> On Fri, Mar 11, 2016 at 4:15 PM, Nikhil Utane <
>>>> nikhil.subscri...@gmail.com>
>>>> wrote:
>>>>
>>>> Perfect. Thanks for the quick response Honza.
>>>>
>>>>>
>>>>> Cheers
>>>>> Nikhil
>>>>>
>>>>> On Fri, Mar 11, 2016 at 4:10 PM, Jan Friesse 
>>>>> wrote:
>>>>>
>>>>> Nikhil,
>>>>>
>>>>>>
>>>>>> Nikhil Utane napsal(a):
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>>
>>>>>>> I changed some configuration and captured packets. I can see that the
>>>>>>> data
>>>>>>> is already garbled and not in the clear.
>>>>>>> So does corosync already have this built-in?
>>>>>>> Can somebody provide more details as to what all security features
>>>>>>> are
>>>>>>> incorporated?
>>>>>>>
>>>>>>>
>>>>>>> See man page corosync.conf(5) options crypto_hash, crypto_cipher (for
>>>>>> corosync 2.x) and potentially secauth (for coorsync 1.x and 2.x).
>>>>>>
>>>>>> Basically corosync by default uses aes256 for encryption and sha1 for
>>>>>> hmac authentication.
>>>

Re: [ClusterLabs] Help required for N+1 redundancy setup

2016-03-16 Thread Nikhil Utane
I see the following info gets updated in the CIB. Can I use this, or is there a
better way?

<node_state ... crm-debug-origin="peer_update_callback" join="down" expected="member">

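(A minimal sketch for pulling that status entry out of the CIB, using a
hypothetical node name node1; the in_ccm/crmd/join status attributes live on the
node_state element:)

$ cibadmin --query --xpath '//node_state[@uname="node1"]'
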
On Wed, Mar 16, 2016 at 12:40 PM, Nikhil Utane 
wrote:

> Hi Ken,
>
> Sorry about the long delay. This activity was de-focussed but now it's
> back on track.
>
> One part of question that is still not answered is on the newly active
> node, how to find out which was the node that went down?
> Anything that gets updated in the status section that can be read and
> figured out?
>
> Thanks.
> Nikhil
>
> On Sat, Jan 9, 2016 at 3:31 AM, Ken Gaillot  wrote:
>
>> On 01/08/2016 11:13 AM, Nikhil Utane wrote:
>> >> I think stickiness will do what you want here. Set a stickiness higher
>> >> than the original node's preference, and the resource will want to stay
>> >> where it is.
>> >
>> > Not sure I understand this. Stickiness will ensure that resources don't
>> > move back when original node comes back up, isn't it?
>> > But in my case, I want the newly standby node to become the backup node
>> for
>> > all other nodes. i.e. it should now be able to run all my resource
>> groups
>> > albeit with a lower score. How do I achieve that?
>>
>> Oh right. I forgot to ask whether you had an opt-out
>> (symmetric-cluster=true, the default) or opt-in
>> (symmetric-cluster=false) cluster. If you're opt-out, every node can run
>> every resource unless you give it a negative preference.
>>
>> Partly it depends on whether there is a good reason to give each
>> instance a "home" node. Often, there's not. If you just want to balance
>> resources across nodes, the cluster will do that by default.
>>
>> If you prefer to put certain resources on certain nodes because the
>> resources require more physical resources (RAM/CPU/whatever), you can
>> set node attributes for that and use rules to set node preferences.
>>
>> Either way, you can decide whether you want stickiness with it.
>>
>> > Also can you answer, how to get the values of node that goes active and
>> the
>> > node that goes down inside the OCF agent?  Do I need to use
>> notification or
>> > some simpler alternative is available?
>> > Thanks.
>> >
>> >
>> > On Fri, Jan 8, 2016 at 9:30 PM, Ken Gaillot 
>> wrote:
>> >
>> >> On 01/08/2016 06:55 AM, Nikhil Utane wrote:
>> >>> Would like to validate my final config.
>> >>>
>> >>> As I mentioned earlier, I will be having (upto) 5 active servers and 1
>> >>> standby server.
>> >>> The standby server should take up the role of active that went down.
>> Each
>> >>> active has some unique configuration that needs to be preserved.
>> >>>
>> >>> 1) So I will create total 5 groups. Each group has a
>> "heartbeat::IPaddr2
>> >>> resource (for virtual IP) and my custom resource.
>> >>> 2) The virtual IP needs to be read inside my custom OCF agent, so I
>> will
>> >>> make use of attribute reference and point to the value of IPaddr2
>> inside
>> >> my
>> >>> custom resource to avoid duplication.
>> >>> 3) I will then configure location constraint to run the group resource
>> >> on 5
>> >>> active nodes with higher score and lesser score on standby.
>> >>> For e.g.
>> >>> Group  NodeScore
>> >>> -
>> >>> MyGroup1node1   500
>> >>> MyGroup1node6   0
>> >>>
>> >>> MyGroup2node2   500
>> >>> MyGroup2node6   0
>> >>> ..
>> >>> MyGroup5node5   500
>> >>> MyGroup5node6   0
>> >>>
>> >>> 4) Now if say node1 were to go down, then stop action on node1 will
>> first
>> >>> get called. Haven't decided if I need to do anything specific here.
>> >>
>> >> To clarify, if node1 goes down intentionally (e.g. standby or stop),
>> >> then all resources on it will be stopped first. But if node1 becomes
>> >> unavailable (e.g. crash or communication outage), it will get fenced.
>> >>
>> >>> 5) But when the start action of node 6 gets called, then using crm
>> >> command
>> >>> line interface, I will modify the above config to swap node 1 and
>> node 

Re: [ClusterLabs] Help required for N+1 redundancy setup

2016-03-16 Thread Nikhil Utane
Hi Ken,

Sorry about the long delay. This activity was de-focussed but now it's back
on track.

One part of the question that is still not answered: on the newly active
node, how do I find out which node went down?
Is there anything that gets updated in the status section that can be read to
figure that out?

Thanks.
Nikhil

On Sat, Jan 9, 2016 at 3:31 AM, Ken Gaillot  wrote:

> On 01/08/2016 11:13 AM, Nikhil Utane wrote:
> >> I think stickiness will do what you want here. Set a stickiness higher
> >> than the original node's preference, and the resource will want to stay
> >> where it is.
> >
> > Not sure I understand this. Stickiness will ensure that resources don't
> > move back when original node comes back up, isn't it?
> > But in my case, I want the newly standby node to become the backup node
> for
> > all other nodes. i.e. it should now be able to run all my resource groups
> > albeit with a lower score. How do I achieve that?
>
> Oh right. I forgot to ask whether you had an opt-out
> (symmetric-cluster=true, the default) or opt-in
> (symmetric-cluster=false) cluster. If you're opt-out, every node can run
> every resource unless you give it a negative preference.
>
> Partly it depends on whether there is a good reason to give each
> instance a "home" node. Often, there's not. If you just want to balance
> resources across nodes, the cluster will do that by default.
>
> If you prefer to put certain resources on certain nodes because the
> resources require more physical resources (RAM/CPU/whatever), you can
> set node attributes for that and use rules to set node preferences.
>
> Either way, you can decide whether you want stickiness with it.
>
> > Also can you answer, how to get the values of node that goes active and
> the
> > node that goes down inside the OCF agent?  Do I need to use notification
> or
> > some simpler alternative is available?
> > Thanks.
> >
> >
> > On Fri, Jan 8, 2016 at 9:30 PM, Ken Gaillot  wrote:
> >
> >> On 01/08/2016 06:55 AM, Nikhil Utane wrote:
> >>> Would like to validate my final config.
> >>>
> >>> As I mentioned earlier, I will be having (upto) 5 active servers and 1
> >>> standby server.
> >>> The standby server should take up the role of active that went down.
> Each
> >>> active has some unique configuration that needs to be preserved.
> >>>
> >>> 1) So I will create total 5 groups. Each group has a
> "heartbeat::IPaddr2
> >>> resource (for virtual IP) and my custom resource.
> >>> 2) The virtual IP needs to be read inside my custom OCF agent, so I
> will
> >>> make use of attribute reference and point to the value of IPaddr2
> inside
> >> my
> >>> custom resource to avoid duplication.
> >>> 3) I will then configure location constraint to run the group resource
> >> on 5
> >>> active nodes with higher score and lesser score on standby.
> >>> For e.g.
> >>> Group  NodeScore
> >>> -
> >>> MyGroup1node1   500
> >>> MyGroup1node6   0
> >>>
> >>> MyGroup2node2   500
> >>> MyGroup2node6   0
> >>> ..
> >>> MyGroup5node5   500
> >>> MyGroup5node6   0
> >>>
> >>> 4) Now if say node1 were to go down, then stop action on node1 will
> first
> >>> get called. Haven't decided if I need to do anything specific here.
> >>
> >> To clarify, if node1 goes down intentionally (e.g. standby or stop),
> >> then all resources on it will be stopped first. But if node1 becomes
> >> unavailable (e.g. crash or communication outage), it will get fenced.
> >>
> >>> 5) But when the start action of node 6 gets called, then using crm
> >> command
> >>> line interface, I will modify the above config to swap node 1 and node
> 6.
> >>> i.e.
> >>> MyGroup1node6   500
> >>> MyGroup1node1   0
> >>>
> >>> MyGroup2node2   500
> >>> MyGroup2node1   0
> >>>
> >>> 6) To do the above, I need the newly active and newly standby node
> names
> >> to
> >>> be passed to my start action. What's the best way to get this
> information
> >>> inside my OCF agent?
> >>
>

Re: [ClusterLabs] Security with Corosync

2016-03-12 Thread Nikhil Utane
Follow-up question.
I noticed that secauth was turned off in my corosync.conf file. I enabled
it on all 3 nodes and restarted the cluster. Everything was working fine.
However, I just noticed that I had forgotten to copy the authkey to one of
the nodes. It is present on 2 nodes but not on the third. And I did a failover
and the third node took over without any issue.
How is the 3rd node participating in the cluster if it doesn't have the
authkey?

On Fri, Mar 11, 2016 at 4:15 PM, Nikhil Utane 
wrote:

> Perfect. Thanks for the quick response Honza.
>
> Cheers
> Nikhil
>
> On Fri, Mar 11, 2016 at 4:10 PM, Jan Friesse  wrote:
>
>> Nikhil,
>>
>> Nikhil Utane napsal(a):
>>
>>> Hi,
>>>
>>> I changed some configuration and captured packets. I can see that the
>>> data
>>> is already garbled and not in the clear.
>>> So does corosync already have this built-in?
>>> Can somebody provide more details as to what all security features are
>>> incorporated?
>>>
>>
>> See man page corosync.conf(5) options crypto_hash, crypto_cipher (for
>> corosync 2.x) and potentially secauth (for coorsync 1.x and 2.x).
>>
>> Basically corosync by default uses aes256 for encryption and sha1 for
>> hmac authentication.
>>
>> Pacemaker uses corosync cpg API so as long as encryption is enabled in
>> the corosync.conf, messages interchanged between nodes are encrypted.
>>
>> Regards,
>>   Honza
>>
>>
>>> -Thanks
>>> Nikhil
>>>
>>> On Fri, Mar 11, 2016 at 11:38 AM, Nikhil Utane <
>>> nikhil.subscri...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>>
>>>> Does corosync provide mechanism to secure the communication path between
>>>> nodes of a cluster?
>>>> I would like all the data that gets exchanged between all nodes to be
>>>> encrypted.
>>>>
>>>> A quick google threw up this link:
>>>> https://github.com/corosync/corosync/blob/master/SECURITY
>>>>
>>>> Can I make use of it with pacemaker?
>>>>
>>>> -Thanks
>>>> Nikhil
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Security with Corosync

2016-03-11 Thread Nikhil Utane
Perfect. Thanks for the quick response Honza.

Cheers
Nikhil
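
(A minimal sketch of enabling this, following Honza's notes quoted below; it
assumes corosync 2.x with the default /etc/corosync/authkey path, and node2 is a
made-up peer name:)

$ corosync-keygen                                   # writes /etc/corosync/authkey
$ scp /etc/corosync/authkey node2:/etc/corosync/    # the same key must be on every node
# then in the totem section of corosync.conf:
#     crypto_cipher: aes256
#     crypto_hash: sha1
# (on corosync 1.x the single switch is "secauth: on")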

On Fri, Mar 11, 2016 at 4:10 PM, Jan Friesse  wrote:

> Nikhil,
>
> Nikhil Utane napsal(a):
>
>> Hi,
>>
>> I changed some configuration and captured packets. I can see that the data
>> is already garbled and not in the clear.
>> So does corosync already have this built-in?
>> Can somebody provide more details as to what all security features are
>> incorporated?
>>
>
> See man page corosync.conf(5) options crypto_hash, crypto_cipher (for
> corosync 2.x) and potentially secauth (for coorsync 1.x and 2.x).
>
> Basically corosync by default uses aes256 for encryption and sha1 for hmac
> authentication.
>
> Pacemaker uses corosync cpg API so as long as encryption is enabled in the
> corosync.conf, messages interchanged between nodes are encrypted.
>
> Regards,
>   Honza
>
>
>> -Thanks
>> Nikhil
>>
>> On Fri, Mar 11, 2016 at 11:38 AM, Nikhil Utane <
>> nikhil.subscri...@gmail.com>
>> wrote:
>>
>> Hi,
>>>
>>> Does corosync provide mechanism to secure the communication path between
>>> nodes of a cluster?
>>> I would like all the data that gets exchanged between all nodes to be
>>> encrypted.
>>>
>>> A quick google threw up this link:
>>> https://github.com/corosync/corosync/blob/master/SECURITY
>>>
>>> Can I make use of it with pacemaker?
>>>
>>> -Thanks
>>> Nikhil
>>>
>>>
>>>
>>
>>
>>
>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Security with Corosync

2016-03-11 Thread Nikhil Utane
Hi,

I changed some configuration and captured packets. I can see that the data
is already garbled and not in the clear.
So does corosync already have this built-in?
Can somebody provide more details as to what all security features are
incorporated?

-Thanks
Nikhil
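
(For reference, a quick way to eyeball that on the wire, assuming the default
corosync multicast port 5405 and interface eth0:)

$ tcpdump -ni eth0 -X udp port 5405    # payload should look like random bytes once crypto is on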

On Fri, Mar 11, 2016 at 11:38 AM, Nikhil Utane 
wrote:

> Hi,
>
> Does corosync provide mechanism to secure the communication path between
> nodes of a cluster?
> I would like all the data that gets exchanged between all nodes to be
> encrypted.
>
> A quick google threw up this link:
> https://github.com/corosync/corosync/blob/master/SECURITY
>
> Can I make use of it with pacemaker?
>
> -Thanks
> Nikhil
>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Security with Corosync

2016-03-10 Thread Nikhil Utane
Hi,

Does corosync provide a mechanism to secure the communication path between
nodes of a cluster?
I would like all the data that gets exchanged between all nodes to be
encrypted.

A quick google threw up this link:
https://github.com/corosync/corosync/blob/master/SECURITY

Can I make use of it with pacemaker?

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Help required for N+1 redundancy setup

2016-01-08 Thread Nikhil Utane
> I think stickiness will do what you want here. Set a stickiness higher
> than the original node's preference, and the resource will want to stay
> where it is.

Not sure I understand this. Stickiness will ensure that resources don't
move back when the original node comes back up, won't it?
But in my case, I want the node that has just become standby to act as the
backup for all other nodes, i.e. it should now be able to run all my resource
groups, albeit with a lower score. How do I achieve that?
Also, how do I get the names of the node that goes active and of the node
that went down inside the OCF agent? Do I need to use notifications, or is
some simpler alternative available?
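
(For reference, node attributes can be read from inside the agent with the
low-level tools Ken mentions below; a sketch, where "my_config" is a
hypothetical attribute name:)

$ crm_attribute --type nodes --node $(crm_node -n) --name my_config --query
$ attrd_updater --name my_config --query    # for transient attributes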
Thanks.


On Fri, Jan 8, 2016 at 9:30 PM, Ken Gaillot  wrote:

> On 01/08/2016 06:55 AM, Nikhil Utane wrote:
> > Would like to validate my final config.
> >
> > As I mentioned earlier, I will be having (upto) 5 active servers and 1
> > standby server.
> > The standby server should take up the role of active that went down. Each
> > active has some unique configuration that needs to be preserved.
> >
> > 1) So I will create total 5 groups. Each group has a "heartbeat::IPaddr2
> > resource (for virtual IP) and my custom resource.
> > 2) The virtual IP needs to be read inside my custom OCF agent, so I will
> > make use of attribute reference and point to the value of IPaddr2 inside
> my
> > custom resource to avoid duplication.
> > 3) I will then configure location constraint to run the group resource
> on 5
> > active nodes with higher score and lesser score on standby.
> > For e.g.
> > Group      Node    Score
> > ------------------------
> > MyGroup1   node1   500
> > MyGroup1   node6   0
> >
> > MyGroup2   node2   500
> > MyGroup2   node6   0
> > ..
> > MyGroup5   node5   500
> > MyGroup5   node6   0
> >
> > 4) Now if say node1 were to go down, then stop action on node1 will first
> > get called. Haven't decided if I need to do anything specific here.
>
> To clarify, if node1 goes down intentionally (e.g. standby or stop),
> then all resources on it will be stopped first. But if node1 becomes
> unavailable (e.g. crash or communication outage), it will get fenced.
>
> > 5) But when the start action of node 6 gets called, then using crm
> command
> > line interface, I will modify the above config to swap node 1 and node 6.
> > i.e.
> > MyGroup1   node6   500
> > MyGroup1   node1   0
> >
> > MyGroup2   node2   500
> > MyGroup2   node1   0
> >
> > 6) To do the above, I need the newly active and newly standby node names
> to
> > be passed to my start action. What's the best way to get this information
> > inside my OCF agent?
>
> Modifying the configuration from within an agent is dangerous -- too
> much potential for feedback loops between pacemaker and the agent.
>
> I think stickiness will do what you want here. Set a stickiness higher
> than the original node's preference, and the resource will want to stay
> where it is.
>
> > 7) Apart from node name, there will be other information which I plan to
> > pass by making use of node attributes. What's the best way to get this
> > information inside my OCF agent? Use crm command to query?
>
> Any of the command-line interfaces for doing so should be fine, but I'd
> recommend using one of the lower-level tools (crm_attribute or
> attrd_updater) so you don't have a dependency on a higher-level tool
> that may not always be installed.
>
> > Thank You.
> >
> > On Tue, Dec 22, 2015 at 9:44 PM, Nikhil Utane <
> nikhil.subscri...@gmail.com>
> > wrote:
> >
> >> Thanks to you Ken for giving all the pointers.
> >> Yes, I can use service start/stop which should be a lot simpler. Thanks
> >> again. :)
> >>
> >> On Tue, Dec 22, 2015 at 9:29 PM, Ken Gaillot 
> wrote:
> >>
> >>> On 12/22/2015 12:17 AM, Nikhil Utane wrote:
> >>>> I have prepared a write-up explaining my requirements and current
> >>> solution
> >>>> that I am proposing based on my understanding so far.
> >>>> Kindly let me know if what I am proposing is good or there is a better
> >>> way
> >>>> to achieve the same.
> >>>>
> >>>>
> >>>
> https://drive.google.com/file/d/0B0zPvL-Tp-JSTEJpcUFTanhsNzQ/view?usp=sharing
> >>>>
> >>>> Le

Re: [ClusterLabs] Help required for N+1 redundancy setup

2016-01-08 Thread Nikhil Utane
Would like to validate my final config.

As I mentioned earlier, I will have (up to) 5 active servers and 1
standby server.
The standby server should take up the role of the active that went down. Each
active has some unique configuration that needs to be preserved.

1) So I will create a total of 5 groups. Each group has an ocf::heartbeat:IPaddr2
resource (for the virtual IP) and my custom resource.
2) The virtual IP needs to be read inside my custom OCF agent, so I will
make use of an attribute reference and point to the IPaddr2 value from inside
my custom resource to avoid duplication.
3) I will then configure location constraints to run each group on its
active node with a higher score and on the standby with a lower score (see
the sketch after this list).
For e.g.
Group      Node    Score
------------------------
MyGroup1   node1   500
MyGroup1   node6   0

MyGroup2   node2   500
MyGroup2   node6   0
..
MyGroup5   node5   500
MyGroup5   node6   0

4) Now if, say, node1 were to go down, the stop action on node1 will first
get called. I haven't decided if I need to do anything specific here.
5) But when the start action on node 6 gets called, I will use the crm
command-line interface to modify the above config to swap node 1 and node 6,
i.e.
MyGroup1   node6   500
MyGroup1   node1   0

MyGroup2   node2   500
MyGroup2   node1   0

6) To do the above, I need the newly active and newly standby node names to
be passed to my start action. What's the best way to get this information
inside my OCF agent?
7) Apart from node name, there will be other information which I plan to
pass by making use of node attributes. What's the best way to get this
information inside my OCF agent? Use crm command to query?
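
A sketch of 2) and 3) above: the constraints via pcs (resource and node names
as in the table), and the shared IP value via a CIB id-ref, where the
attribute-set id and the address are hypothetical:

$ pcs constraint location MyGroup1 prefers node1=500
$ pcs constraint location MyGroup1 prefers node6=0
$ pcs constraint location MyGroup2 prefers node2=500
$ pcs constraint location MyGroup2 prefers node6=0

<!-- inside the IPaddr2 primitive -->
<instance_attributes id="grp1-vip-attrs">
  <nvpair id="grp1-vip-ip" name="ip" value="10.0.0.11"/>
</instance_attributes>
<!-- inside the custom primitive: reuse the same set, so "ip" is defined once -->
<instance_attributes id-ref="grp1-vip-attrs"/>

(This assumes the custom agent also takes its address via a parameter named
"ip".)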

Thank You.

On Tue, Dec 22, 2015 at 9:44 PM, Nikhil Utane 
wrote:

> Thanks to you Ken for giving all the pointers.
> Yes, I can use service start/stop which should be a lot simpler. Thanks
> again. :)
>
> On Tue, Dec 22, 2015 at 9:29 PM, Ken Gaillot  wrote:
>
>> On 12/22/2015 12:17 AM, Nikhil Utane wrote:
>> > I have prepared a write-up explaining my requirements and current
>> solution
>> > that I am proposing based on my understanding so far.
>> > Kindly let me know if what I am proposing is good or there is a better
>> way
>> > to achieve the same.
>> >
>> >
>> https://drive.google.com/file/d/0B0zPvL-Tp-JSTEJpcUFTanhsNzQ/view?usp=sharing
>> >
>> > Let me know if you face any issue in accessing the above link. Thanks.
>>
>> This looks great. Very well thought-out.
>>
>> One comment:
>>
>> "8. In the event of any failover, the standby node will get notified
>> through an event and it will execute a script that will read the
>> configuration specific to the node that went down (again using
>> crm_attribute) and become active."
>>
>> It may not be necessary to use the notifications for this. Pacemaker
>> will call your resource agent with the "start" action on the standby
>> node, after ensuring it is stopped on the previous node. Hopefully the
>> resource agent's start action has (or can have, with configuration
>> options) all the information you need.
>>
>> If you do end up needing notifications, be aware that the feature will
>> be disabled by default in the 1.1.14 release, because changes in syntax
>> are expected in further development. You can define a compile-time
>> constant to enable them.
>>
>>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Passing and binding to virtual IP in my service

2016-01-08 Thread Nikhil Utane
What I had posted earlier was an approach to do N+1 redundancy for my
use-case (which could be different from yours).
I am attaching it, along with the cib XML, to this thread (I don't know if
attachments are allowed).
There are some follow-up questions that I am posting on my other thread.
Please check that.

On Fri, Jan 8, 2016 at 1:41 PM, Solutions Solutions 
wrote:

> hi Nikhil,
>  can you send me the N+1 redundancy configuration file,which you posted
> earlier.
>
> On Thu, Jan 7, 2016 at 2:58 PM, Nikhil Utane 
> wrote:
>
>> Hi,
>>
>> I have my cluster up and running just fine. I have a dummy service that
>> sends UDP packets out to another host.
>>
>>  Resource Group: MyGroup
>>  ClusterIP  (ocf::heartbeat:IPaddr2):   Started node1
>>  UDPSend(ocf::nikhil:UDPSend):  Started node1
>>
>> If I ping to the virtual IP from outside, the response goes via virtual
>> IP.
>> But if I initiate ping from node1, then it takes the actual (non-virtual
>> IP). This is expected since I am not binding to the vip. (ping -I vip works
>> fine).
>> So my question is, how to pass the virtual IP to my UDPSend OCF agent so
>> that it can then bind to the vip? This will ensure that all messages
>> initiated by my UDPSend goes from vip.
>>
>> Out of curiosity, where is this virtual IP stored in the kernel?
>> I expected to see a secondary interface ( for e.g. eth0:1) with the vip
>> but it isn't there.
>>
>> -Thanks
>> Nikhil
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


Redundancy using Pacemaker & Corosync-External.docx
Description: MS-Word 2007 document

[cib.xml attachment: the XML markup was stripped by the mail archive; the
configuration itself is not recoverable here.]

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Passing and binding to virtual IP in my service

2016-01-07 Thread Nikhil Utane
Aah. Got it. I forgot about the ip addr command.
Cool. So now I have the option to bind as well as to use the route command.
Kristoffer answered about using attribute references, so I should be all set.
Thanks, guys.

On Thu, Jan 7, 2016 at 4:12 PM, Jorge Fábregas 
wrote:

> On 01/07/2016 05:28 AM, Nikhil Utane wrote:
> > So my question is, how to pass the virtual IP to my UDPSend OCF agent so
> > that it can then bind to the vip? This will ensure that all messages
> > initiated by my UDPSend goes from vip.
>
> Hi,
>
> I don't know how ping -I does it (what system call it uses) but I think
> you'll have to implement that if you want your program to source
> connection from a particular virtual IP.
>
> As far as I know, the way this is usually done these days is by creating
> routes.  Something like:
>
> ip route change 192.168.14.0/24 dev eth0 src 192.168.14.4
>
>
> > Out of curiosity, where is this virtual IP stored in the kernel?
> > I expected to see a secondary interface ( for e.g. eth0:1) with the vip
> > but it isn't there.
>
> Well, in the old days we used to have a "virtual interface" (eth0:1,
> eth0:2 etc) but the proper modern way is to use "virtual addresses"
> within a single interface.  The caveat is that you need to use the ip
> command to show these virtual addresses (ifconfig is not aware of them):
>
> ip addr show
>
> You'll see there the notion of a primary IP and secondaries.  The system
> will initiate connection from the primary by default (unless you specify
> a route like the one above).
>
> HTH,
> Jorge
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Passing and binding to virtual IP in my service

2016-01-07 Thread Nikhil Utane
Hi,

I have my cluster up and running just fine. I have a dummy service that
sends UDP packets out to another host.

 Resource Group: MyGroup
     ClusterIP  (ocf::heartbeat:IPaddr2):   Started node1
     UDPSend    (ocf::nikhil:UDPSend):      Started node1

If I ping the virtual IP from outside, the response goes via the virtual IP.
But if I initiate a ping from node1, it uses the actual (non-virtual)
IP. This is expected since I am not binding to the vip (ping -I vip works
fine).
So my question is: how do I pass the virtual IP to my UDPSend OCF agent so
that it can then bind to the vip? This will ensure that all messages
initiated by my UDPSend go out from the vip.
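
For what it's worth, the usual pattern is to expose the address as a resource
parameter and read it in the agent through the OCF_RESKEY_ environment; a
sketch, where "bind_ip", the address and the udpsend binary path are
hypothetical:

$ pcs resource create UDPSend ocf:nikhil:UDPSend bind_ip=10.0.0.11

# inside the UDPSend agent's start action
# (pacemaker exports each resource parameter as OCF_RESKEY_<name>)
/usr/local/bin/udpsend --bind "${OCF_RESKEY_bind_ip}" &

The parameter also needs to be declared in the agent's meta-data.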

Out of curiosity, where is this virtual IP stored in the kernel?
I expected to see a secondary interface (e.g. eth0:1) with the vip, but
it isn't there.

-Thanks
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Help required for N+1 redundancy setup

2015-12-22 Thread Nikhil Utane
Thanks to you Ken for giving all the pointers.
Yes, I can use service start/stop which should be a lot simpler. Thanks
again. :)

On Tue, Dec 22, 2015 at 9:29 PM, Ken Gaillot  wrote:

> On 12/22/2015 12:17 AM, Nikhil Utane wrote:
> > I have prepared a write-up explaining my requirements and current
> solution
> > that I am proposing based on my understanding so far.
> > Kindly let me know if what I am proposing is good or there is a better
> way
> > to achieve the same.
> >
> >
> https://drive.google.com/file/d/0B0zPvL-Tp-JSTEJpcUFTanhsNzQ/view?usp=sharing
> >
> > Let me know if you face any issue in accessing the above link. Thanks.
>
> This looks great. Very well thought-out.
>
> One comment:
>
> "8. In the event of any failover, the standby node will get notified
> through an event and it will execute a script that will read the
> configuration specific to the node that went down (again using
> crm_attribute) and become active."
>
> It may not be necessary to use the notifications for this. Pacemaker
> will call your resource agent with the "start" action on the standby
> node, after ensuring it is stopped on the previous node. Hopefully the
> resource agent's start action has (or can have, with configuration
> options) all the information you need.
>
> If you do end up needing notifications, be aware that the feature will
> be disabled by default in the 1.1.14 release, because changes in syntax
> are expected in further development. You can define a compile-time
> constant to enable them.
>
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Help required for N+1 redundancy setup

2015-12-21 Thread Nikhil Utane
I have prepared a write-up explaining my requirements and current solution
that I am proposing based on my understanding so far.
Kindly let me know if what I am proposing is good, or if there is a better
way to achieve the same.

https://drive.google.com/file/d/0B0zPvL-Tp-JSTEJpcUFTanhsNzQ/view?usp=sharing

Let me know if you face any issue in accessing the above link. Thanks.

On Thu, Dec 3, 2015 at 11:34 PM, Ken Gaillot  wrote:

> On 12/03/2015 05:23 AM, Nikhil Utane wrote:
> > Ken,
> >
> > One more question, if i have to propagate configuration changes between
> the
> > nodes then is cpg (closed process group) the right way?
> > For e.g.
> > Active Node1 has config A=1, B=2
> > Active Node2 has config A=3, B=4
> > Standby Node needs to have configuration for all the nodes such that
> > whichever goes down, it comes up with those values.
> > Here configuration is not static but can be updated at run-time.
>
> Being unfamiliar with the specifics of your case, I can't say what the
> best approach is, but it sounds like you will need to write a custom OCF
> resource agent to manage your service.
>
> A resource agent is similar to an init script:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf
>
> The RA will start the service with the appropriate configuration. It can
> use per-resource options configured in pacemaker or external information
> to do that.
>
> How does your service get its configuration currently?
>
> > BTW, I'm little confused between OpenAIS and Corosync. For my purpose I
> > should be able to use either, right?
>
> Corosync started out as a subset of OpenAIS, optimized for use with
> Pacemaker. Corosync 2 is now the preferred membership layer for
> Pacemaker for most uses, though other layers are still supported.
>
> > Thanks.
> >
> > On Tue, Dec 1, 2015 at 9:04 PM, Ken Gaillot  wrote:
> >
> >> On 12/01/2015 05:31 AM, Nikhil Utane wrote:
> >>> Hi,
> >>>
> >>> I am evaluating whether it is feasible to use Pacemaker + Corosync to
> add
> >>> support for clustering/redundancy into our product.
> >>
> >> Most definitely
> >>
> >>> Our objectives:
> >>> 1) Support N+1 redundancy. i,e. N Active and (up to) 1 Standby.
> >>
> >> You can do this with location constraints and scores. See:
> >>
> >>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on
> >>
> >> Basically, you give the standby node a lower score than the other nodes.
> >>
> >>> 2) Each node has some different configuration parameters.
> >>> 3) Whenever any active node goes down, the standby node comes up with
> the
> >>> same configuration that the active had.
> >>
> >> How you solve this requirement depends on the specifics of your
> >> situation. Ideally, you can use OCF resource agents that take the
> >> configuration location as a parameter. You may have to write your own,
> >> if none is available for your services.
> >>
> >>> 4) There is no one single process/service for which we need redundancy,
> >>> rather it is the entire system (multiple processes running together).
> >>
> >> This is trivially implemented using either groups or ordering and
> >> colocation constraints.
> >>
> >> Order constraint = start service A before starting service B (and stop
> >> in reverse order)
> >>
> >> Colocation constraint = keep services A and B on the same node
> >>
> >> Group = shortcut to specify several services that need to start/stop in
> >> order and be kept together
> >>
> >>
> >>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm231363875392
> >>
> >>
> >>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#group-resources
> >>
> >>
> >>> 5) I would also want to be notified when any active<->standby state
> >>> transition happens as I would want to take some steps at the
> application
> >>> level.
> >>
> >> There are multiple approaches.
> >>
> >> If you don't mind compiling your own packages, the latest master branch
> >> (which will be part of the upcoming 1.1.14 release) has built-in
> >> notification capability. See:
> >> http://blog.clusterlabs.or

Re: [ClusterLabs] Help required for N+1 redundancy setup

2015-12-03 Thread Nikhil Utane
Ken,

One more question: if I have to propagate configuration changes between the
nodes, is CPG (closed process group) the right way?
For e.g.
Active Node1 has config A=1, B=2
Active Node2 has config A=3, B=4
Standby Node needs to have configuration for all the nodes such that
whichever goes down, it comes up with those values.
Here configuration is not static but can be updated at run-time.
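
One option, independent of CPG: keep each active's run-time values as
permanent node attributes in the CIB, which any node (including the standby)
can read back later. A sketch, where cfg_A/cfg_B are hypothetical attribute
names:

$ crm_attribute --type nodes --node node1 --name cfg_A --update 1
$ crm_attribute --type nodes --node node1 --name cfg_B --update 2
# on the standby, while taking over for node1:
$ crm_attribute --type nodes --node node1 --name cfg_A --query
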
BTW, I'm a little confused about the difference between OpenAIS and Corosync.
For my purpose I should be able to use either, right?
Thanks.

On Tue, Dec 1, 2015 at 9:04 PM, Ken Gaillot  wrote:

> On 12/01/2015 05:31 AM, Nikhil Utane wrote:
> > Hi,
> >
> > I am evaluating whether it is feasible to use Pacemaker + Corosync to add
> > support for clustering/redundancy into our product.
>
> Most definitely
>
> > Our objectives:
> > 1) Support N+1 redundancy. i,e. N Active and (up to) 1 Standby.
>
> You can do this with location constraints and scores. See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on
>
> Basically, you give the standby node a lower score than the other nodes.
>
> > 2) Each node has some different configuration parameters.
> > 3) Whenever any active node goes down, the standby node comes up with the
> > same configuration that the active had.
>
> How you solve this requirement depends on the specifics of your
> situation. Ideally, you can use OCF resource agents that take the
> configuration location as a parameter. You may have to write your own,
> if none is available for your services.
>
> > 4) There is no one single process/service for which we need redundancy,
> > rather it is the entire system (multiple processes running together).
>
> This is trivially implemented using either groups or ordering and
> colocation constraints.
>
> Order constraint = start service A before starting service B (and stop
> in reverse order)
>
> Colocation constraint = keep services A and B on the same node
>
> Group = shortcut to specify several services that need to start/stop in
> order and be kept together
>
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm231363875392
>
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#group-resources
>
>
> > 5) I would also want to be notified when any active<->standby state
> > transition happens as I would want to take some steps at the application
> > level.
>
> There are multiple approaches.
>
> If you don't mind compiling your own packages, the latest master branch
> (which will be part of the upcoming 1.1.14 release) has built-in
> notification capability. See:
> http://blog.clusterlabs.org/blog/2015/reliable-notifications/
>
> Otherwise, you can use SNMP or e-mail if your packages were compiled
> with those options, or you can use the ocf:pacemaker:ClusterMon resource
> agent:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm231308442928
>
> > I went through the documents/blogs but all had example for 1 active and 1
> > standby use-case and that too for some standard service like httpd.
>
> Pacemaker is incredibly versatile, and the use cases are far too varied
> to cover more than a small subset. Those simple examples show the basic
> building blocks, and can usually point you to the specific features you
> need to investigate further.
>
> > One additional question, If I am having multiple actives, then Virtual IP
> > configuration cannot be used? Is it possible such that N actives have
> > different IP addresses but whenever standby becomes active it uses the IP
> > address of the failed node?
>
> Yes, there are a few approaches here, too.
>
> The simplest is to assign a virtual IP to each active, and include it in
> your group of resources. The whole group will fail over to the standby
> node if the original goes down.
>
> If you want a single virtual IP that is used by all your actives, one
> alternative is to clone the ocf:heartbeat:IPaddr2 resource. When cloned,
> that resource agent will use iptables' CLUSTERIP functionality, which
> relies on multicast Ethernet addresses (not to be confused with
> multicast IP). Since multicast Ethernet has limitations, this is not
> often used in production.
>
> A more complicated method is to use a virtual IP in combination with a
> load-balancer such as haproxy. Pacemaker can manage haproxy and the real
> services, and haproxy manages distributing requests to the real services.
>
> > Thanking in advance.
> > Nikhil
>
> A last word

Re: [ClusterLabs] Help required for N+1 redundancy setup

2015-12-01 Thread Nikhil Utane
Thank You Ken for such a detailed response. Truly appreciate it. Cheers.


On Tue, Dec 1, 2015 at 9:04 PM, Ken Gaillot  wrote:

> On 12/01/2015 05:31 AM, Nikhil Utane wrote:
> > Hi,
> >
> > I am evaluating whether it is feasible to use Pacemaker + Corosync to add
> > support for clustering/redundancy into our product.
>
> Most definitely
>
> > Our objectives:
> > 1) Support N+1 redundancy. i,e. N Active and (up to) 1 Standby.
>
> You can do this with location constraints and scores. See:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on
>
> Basically, you give the standby node a lower score than the other nodes.
>
> > 2) Each node has some different configuration parameters.
> > 3) Whenever any active node goes down, the standby node comes up with the
> > same configuration that the active had.
>
> How you solve this requirement depends on the specifics of your
> situation. Ideally, you can use OCF resource agents that take the
> configuration location as a parameter. You may have to write your own,
> if none is available for your services.
>
> > 4) There is no one single process/service for which we need redundancy,
> > rather it is the entire system (multiple processes running together).
>
> This is trivially implemented using either groups or ordering and
> colocation constraints.
>
> Order constraint = start service A before starting service B (and stop
> in reverse order)
>
> Colocation constraint = keep services A and B on the same node
>
> Group = shortcut to specify several services that need to start/stop in
> order and be kept together
>
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm231363875392
>
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#group-resources
>
>
> > 5) I would also want to be notified when any active<->standby state
> > transition happens as I would want to take some steps at the application
> > level.
>
> There are multiple approaches.
>
> If you don't mind compiling your own packages, the latest master branch
> (which will be part of the upcoming 1.1.14 release) has built-in
> notification capability. See:
> http://blog.clusterlabs.org/blog/2015/reliable-notifications/
>
> Otherwise, you can use SNMP or e-mail if your packages were compiled
> with those options, or you can use the ocf:pacemaker:ClusterMon resource
> agent:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm231308442928
>
> > I went through the documents/blogs but all had example for 1 active and 1
> > standby use-case and that too for some standard service like httpd.
>
> Pacemaker is incredibly versatile, and the use cases are far too varied
> to cover more than a small subset. Those simple examples show the basic
> building blocks, and can usually point you to the specific features you
> need to investigate further.
>
> > One additional question, If I am having multiple actives, then Virtual IP
> > configuration cannot be used? Is it possible such that N actives have
> > different IP addresses but whenever standby becomes active it uses the IP
> > address of the failed node?
>
> Yes, there are a few approaches here, too.
>
> The simplest is to assign a virtual IP to each active, and include it in
> your group of resources. The whole group will fail over to the standby
> node if the original goes down.
>
> If you want a single virtual IP that is used by all your actives, one
> alternative is to clone the ocf:heartbeat:IPaddr2 resource. When cloned,
> that resource agent will use iptables' CLUSTERIP functionality, which
> relies on multicast Ethernet addresses (not to be confused with
> multicast IP). Since multicast Ethernet has limitations, this is not
> often used in production.
>
> A more complicated method is to use a virtual IP in combination with a
> load-balancer such as haproxy. Pacemaker can manage haproxy and the real
> services, and haproxy manages distributing requests to the real services.
>
> > Thanking in advance.
> > Nikhil
>
> A last word of advice: Fencing (aka STONITH) is important for proper
> recovery from difficult failure conditions. Without it, it is possible
> to have data loss or corruption in a split-brain situation.
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Help required for N+1 redundancy setup

2015-12-01 Thread Nikhil Utane
Hi,

I am evaluating whether it is feasible to use Pacemaker + Corosync to add
support for clustering/redundancy into our product.

Our objectives:
1) Support N+1 redundancy, i.e. N active and (up to) 1 standby.
2) Each node has some different configuration parameters.
3) Whenever any active node goes down, the standby node comes up with the
same configuration that the active had.
4) There is no one single process/service for which we need redundancy,
rather it is the entire system (multiple processes running together).
5) I would also want to be notified when any active<->standby state
transition happens as I would want to take some steps at the application
level.

I went through the documents/blogs, but all of them had examples for a 1
active and 1 standby use-case, and only for some standard service like httpd.

One additional question: if I have multiple actives, can a virtual IP
configuration still be used? Is it possible for the N actives to have
different IP addresses, with the standby taking over the IP address of the
failed node whenever it becomes active?
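
A sketch of the per-active virtual IP approach (one virtual IP per active,
grouped with that node's services and failing over with them), with
hypothetical resource, node and address names:

$ pcs resource create vip1 ocf:heartbeat:IPaddr2 ip=192.168.1.101 cidr_netmask=24
$ pcs resource create app1 ocf:myvendor:MyApp
$ pcs resource group add Group1 vip1 app1
$ pcs constraint location Group1 prefers active1=500
$ pcs constraint location Group1 prefers standby1=100

If active1 fails, the whole group, including vip1, moves to standby1, so the
standby comes up with the failed node's address.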

Thanking in advance.
Nikhil
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org