[ClusterLabs] [Question:pacemaker_remote] About limitation of the placement of the resource to remote node.

2015-08-12 Thread renayama19661014
Hi All,

We have been testing pacemaker_remote
(version: pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977).

The cluster consists of the following nodes:
 * sl7-01(KVM host)
 * snmp1(Guest on the sl7-01 host)
 * snmp2(Guest on the sl7-01 host)

We prepared the following CLI file to test resource placement on the remote
nodes.

--
property no-quorum-policy=ignore \
  stonith-enabled=false \
  startup-fencing=false

rsc_defaults resource-stickiness=INFINITY \
  migration-threshold=1

primitive remote-vm2 ocf:pacemaker:remote \
  params server=snmp1 \
  op monitor interval=3 timeout=15

primitive remote-vm3 ocf:pacemaker:remote \
  params server=snmp2 \
  op monitor interval=3 timeout=15

primitive dummy-remote-A Dummy \
  op start interval=0s timeout=60s \
  op monitor interval=30s timeout=60s \
  op stop interval=0s timeout=60s

primitive dummy-remote-B Dummy \
  op start interval=0s timeout=60s \
  op monitor interval=30s timeout=60s \
  op stop interval=0s timeout=60s

location loc1 dummy-remote-A \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #uname eq sl7-01
location loc2 dummy-remote-B \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #uname eq sl7-01
--

Case 1) When we load the CLI file above, the resources are placed as follows.
 However, the placement of the dummy-remote resources does not satisfy the
 constraints: dummy-remote-A starts on remote-vm2 instead of remote-vm3.

[root@sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Aug 13 08:49:09 2015          Last change: Thu Aug 13 08:41:14 2015 by root via cibadmin on sl7-01
Stack: corosync
Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
3 nodes and 4 resources configured

Online: [ sl7-01 ]
RemoteOnline: [ remote-vm2 remote-vm3 ]

 dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm2
 dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
 remote-vm2     (ocf::pacemaker:remote):        Started sl7-01
 remote-vm3     (ocf::pacemaker:remote):        Started sl7-01

(snip)

Case 2) When we modify the CLI file as follows and load it, the resources are
placed correctly:
 dummy-remote-A starts on remote-vm3.
 dummy-remote-B starts on remote-vm3.


(snip)
location loc1 dummy-remote-A \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
  rule -inf: #uname eq sl7-01
location loc2 dummy-remote-B \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
  rule -inf: #uname eq sl7-01
(snip)


[root@sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Aug 13 08:55:28 2015          Last change: Thu Aug 13 08:55:22 2015 by root via cibadmin on sl7-01
Stack: corosync
Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
3 nodes and 4 resources configured

Online: [ sl7-01 ]
RemoteOnline: [ remote-vm2 remote-vm3 ]

 dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm3
 dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
 remote-vm2     (ocf::pacemaker:remote):        Started sl7-01
 remote-vm3     (ocf::pacemaker:remote):        Started sl7-01

(snip)

With the first CLI file the placement is wrong; it appears that location
constraints referring to a remote node are not evaluated until the
corresponding remote resource has started.

With the revised CLI file the placement is correct, but describing the
constraints this way becomes very cumbersome when composing a cluster with
more nodes.

Should the evaluation of location constraints for a remote node be delayed
until the remote node has started?

Is there an easier way to describe placement constraints for resources on
remote nodes?

 * As one workaround, we know that placement works correctly if we divide the
first CLI file into two.
   * First we load the CLI that starts the remote nodes; then we load the CLI
that starts the other resources.
 * However, we would prefer not to divide the CLI file into two if possible.
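For reference, the two-step workaround described above might look roughly like
this with crmsh (the file names here are hypothetical, not from the original
setup):

```shell
# Step 1: load only the properties, defaults, and the ocf:pacemaker:remote
# primitives (hypothetical file remote-nodes.cli), so the remote nodes start.
crm configure load update remote-nodes.cli

# Check that the remote nodes have come online before continuing.
crm_mon -1 | grep RemoteOnline

# Step 2: load the Dummy primitives and their location constraints
# (hypothetical file resources.cli). By now the remote nodes exist,
# so the location rules are evaluated against them.
crm configure load update resources.cli
```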

Best Regards,
Hideo Yamauchi.


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] cib state is now lost

2015-08-12 Thread Ken Gaillot
On 08/12/2015 05:29 AM, David Neudorfer wrote:
 Thanks Ken,
 
 We're currently using Pacemaker 1.1.11, and at the moment it's not an option
 to upgrade.
 I've spun these boxes up and down on AWS and even tried different sizes. I
 think a recent upgrade broke this deployment.

What OS distribution/version are you using?

If you have the option of switching from corosync 1+plugin to either
corosync 1+CMAN or corosync 2, that should avoid the issue, and put you
in a better supported position going forward. The plugin code has known
memory issues when nodes come and go, and the effects can be unpredictable.

 This is the output from dmesg:
 
 cib[16656] general protection ip:7f45391e9545 sp:7ffddf16c8b8 error:0 in
 libc-2.12.so[7f45390be000+18a000]
 cib[16659] general protection ip:7fa36fa89545 sp:7ffe28416288 error:0 in
 libc-2.12.so[7fa36f95e000+18a000]
 cib[16663] general protection ip:7fa3defce545 sp:7ffeb5b29c58 error:0 in
 libc-2.12.so[7fa3deea3000+18a000]
 cib[1] general protection ip:7fa1cefe4545 sp:7ffcc4b9c778 error:0 in
 libc-2.12.so[7fa1ceeb9000+18a000]
 cib[16669] general protection ip:7f4b3900f545 sp:7ffdcd65aaf8 error:0 in
 libc-2.12.so[7f4b38ee4000+18a000]
 cib[16672] general protection ip:7fc38be2b545 sp:7fffbc7e1598 error:0 in
 libc-2.12.so[7fc38bd0+18a000]
 cib[16675] general protection ip:7f9c6890c545 sp:7ffca09539f8 error:0 in
 libc-2.12.so[7f9c687e1000+18a000]
 cib[16678] general protection ip:7f1c636ad545 sp:7ffc677d2008 error:0 in
 libc-2.12.so[7f1c63582000+18a000]
 cib[16681] general protection ip:7fed0b47e545 sp:7ffd051f0618 error:0 in
 libc-2.12.so[7fed0b353000+18a000]
 cib[16684] general protection ip:7f2ee87cd545 sp:7fff8d9ae288 error:0 in
 libc-2.12.so[7f2ee86a2000+18a000]
 cib[16687] general protection ip:7f41c3789545 sp:7fff9f005848 error:0 in
 libc-2.12.so[7f41c365e000+18a000]
 
 
 
 On Mon, Aug 10, 2015 at 9:54 AM, Ken Gaillot kgail...@redhat.com wrote:
 
 On 08/09/2015 02:27 PM, David Neudorfer wrote:
 Where can I dig deeper to figure out why cib keeps terminating? selinux
 and
 iptables are both disabled and I've have debug enabled. Google hasn't
 been
 able to help me thus far.

 Aug 09 18:54:29 [12526] ip-172-20-16-5cib:debug:
 get_local_nodeid: Local nodeid is 84939948
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 plugin_get_details:   Server details: id=84939948 uname=ip-172-20-16-5
 cname=pcmk
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 crm_get_peer: Created entry
 c1f204b2-c994-48d9-81b6-87e1a7fc1ee7/0xa2c460 for node
 ip-172-20-16-5/84939948 (1 total)
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 crm_get_peer: Node 84939948 is now known as ip-172-20-16-5
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 crm_get_peer: Node 84939948 has uuid ip-172-20-16-5
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 crm_update_peer_proc: init_cs_connection_classic: Node
 ip-172-20-16-5[84939948] - unknown is now online
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 init_cs_connection_once:  Connection to 'classic openais (with
 plugin)': established
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib:   notice:
 get_node_name:Defaulting to uname -n for the local classic
 openais
 (with plugin) node name
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 qb_ipcs_us_publish:   server name: cib_ro
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 qb_ipcs_us_publish:   server name: cib_rw
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 qb_ipcs_us_publish:   server name: cib_shm
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info: cib_init:
   Starting cib mainloop
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib:   notice:
 plugin_handle_membership: Membership 104: quorum acquired
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 crm_update_peer_proc: plugin_handle_membership: Node
 ip-172-20-16-5[84939948] - unknown is now member
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib:   notice:
 crm_update_peer_state:cib_peer_update_callback: Node
 ip-172-20-16-5[84939948] - state is now lost (was (null))
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib:   notice:
 crm_reap_dead_member: Removing ip-172-20-16-5/84939948 from the
 membership list
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib:   notice:
 reap_crm_member:  Purged 1 peers with id=84939948 and/or uname=(null)
 from the membership cache
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib:   notice:
 crm_update_peer_state:plugin_handle_membership: Node
 ��[2077843320]
 - state is now member (was member)
 Aug 09 18:54:29 [12526] ip-172-20-16-5cib: info:
 crm_update_peer:  plugin_handle_membership: Node ��: id=2077843320
 state=r(0) ip(172.20.16.5)  addr=r(0) ip(172.20.16.5)  (new) votes=1
 (new) born=104 seen=104 

Re: [ClusterLabs] circumstances under which resources become unmanaged

2015-08-12 Thread Andrei Borzenkov



On 12.08.2015 20:46, N, Ravikiran wrote:

Hi All,

I have a resource added to Pacemaker called 'cmsd' whose state is getting to
the 'unmanaged FAILED' state.

Apart from manually changing the resource to unmanaged using "pcs resource
unmanage cmsd", I'm trying to understand under what circumstances a resource
can become unmanaged.
I have not set any value for the multiple-active field, which means by default
it is set to stop_start, and hence I believe the resource can never go to
unmanaged if it is found active on more than one node.



"unmanaged FAILED" means Pacemaker (or rather, the resource agent) failed to
stop the resource. At that point the resource state is undefined, so Pacemaker
won't do anything further with it.
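Assuming the underlying stop failure has been investigated and fixed by hand
(for example, the stuck service killed manually), recovery typically involves
clearing the failure so Pacemaker takes the resource back; a rough sketch
using the resource name from this thread:

```shell
# Inspect the failure: failed actions and fail counts per resource.
pcs status --full

# After fixing the underlying problem outside the cluster, clear the
# recorded failure so Pacemaker resumes managing the resource.
pcs resource cleanup cmsd
```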



Also, it would be very helpful if anyone could point to the specific sections
of the Pacemaker manuals that answer this.

Regards,
Ravikiran







Re: [ClusterLabs] Ordering constraint restart second resource group

2015-08-12 Thread Andrei Borzenkov



On 12.08.2015 19:35, John Gogu wrote:

Hello,
in my cluster configuration I have following situation:

resource_group_A
ip1
ip2
resource_group_B
apache1

ordering constraint: resource_group_A then resource_group_B, symmetrical=true

When I add a new resource to resource_group_A, the resources from
resource_group_B are restarted. If I remove the constraint everything is OK,
but I need to keep this ordering constraint.



Did you try adding the resource as unmanaged, starting it manually, and then 
changing it to managed?
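That suggestion could be sketched roughly like this with pcs (the resource
name, agent, IP address, and interface below are hypothetical examples, not
from the original configuration):

```shell
# Create the new group member directly in unmanaged state, so the cluster
# records it but takes no action on it (and the order constraint on the
# group does not trigger a restart of resource_group_B).
pcs resource create ip3 ocf:heartbeat:IPaddr2 ip=192.0.2.30 \
    meta is-managed=false --group resource_group_A

# Bring the service up by hand (here: plumb the address manually) ...
ip addr add 192.0.2.30/24 dev eth0

# ... then hand control back to the cluster.
pcs resource manage ip3
```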




Re: [ClusterLabs] circumstances under which resources become unmanaged

2015-08-12 Thread N, Ravikiran
Thanks for the reply, Andrei. What happens to resources tied to this unmanaged
FAILED resource by a COLOCATION or ORDER constraint? Will the constraints be
removed?

Also, please point me to any documentation to understand this in detail.

Regards
Ravikiran

-Original Message-
From: Andrei Borzenkov [mailto:arvidj...@gmail.com] 
Sent: Thursday, August 13, 2015 9:33 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] circumstances under which resources become unmanaged



On 12.08.2015 20:46, N, Ravikiran wrote:
 Hi All,

 I have a resource added to pacemaker called 'cmsd' whose state is getting to 
 'unmanaged FAILED' state.

 Apart from manually changing the resource to unmanaged using pcs resource 
 unmanage cmsd , I'm trying to understand under what all circumstances a 
 resource can become unmanaged.. ?
 I have not set any value for the multiple-active field, which means by default 
 it is set to stop_start, and hence I believe the resource can never go to 
 unmanaged if it is found active on more than one node.


"unmanaged FAILED" means Pacemaker (or rather, the resource agent) failed to
stop the resource. At that point the resource state is undefined, so Pacemaker
won't do anything further with it.

 Also, it would be more helpful if anyone can point out to specific sections 
 of the pacemaker manuals for the answer.

 Regards,
 Ravikiran









Re: [ClusterLabs] Antw: CentOS 7 - Pacemaker - Problem with nfs-server and system

2015-08-12 Thread Stefan Bauer
Hi,

thank you for your reply. It seems to be a problem with the systemd unit files
for nfs-server - specifically a timing issue.

[root@centos7-n1 ~]# systemctl list-unit-files --type=service | grep rpcbind
rpcbind.service static

rpcbind is set to static - it should be started on demand by other units.

Invoking "systemctl start nfs-server" pulls in rpcbind and nfs-lock.

rpcbind is started - but nfs-lock may be trying to start too early:

Invoking "systemctl start rpcbind" manually and then "systemctl start nfs-lock"
works within a second.

Invoking "systemctl start rpcbind" manually and then "systemctl start
nfs-server" works within a few seconds as well.

Invoking "systemctl start nfs-server" manually only works sporadically, due to
some timing issue.

My current workaround is to have the cluster also start rpcbind - just before
nfsserver.

I also tried /usr/lib/ocf/resource.d/heartbeat/nfsserver - it is capable of
handling systemd systems, but it starts nfs-lock and nfs-server manually -
hence it hits the same problem in my case.
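The workaround described above (a cluster-managed rpcbind ordered before the
NFS server) might be expressed with pcs roughly as follows; the resource names
here are assumptions, not taken from the original configuration:

```shell
# Let the cluster start rpcbind itself via its systemd unit.
pcs resource create rpcbind systemd:rpcbind

# Start rpcbind strictly before the NFS server resource, and keep
# them on the same node.
pcs constraint order rpcbind then nfsserver
pcs constraint colocation add nfsserver with rpcbind INFINITY
```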



Cheers,



Stefan





-Original Message-
From: Ulrich Windl ulrich.wi...@rz.uni-regensburg.de

 [root@centos7-n1 ~]# time systemctl start nfs-server
 
 real    1m0.480s

Probably time to look into syslog. I suspect a name/address resolution
issue...



Re: [ClusterLabs] starting of resources

2015-08-12 Thread Jan Pokorný
On 11/08/15 09:14 -0500, Ken Gaillot wrote:
 On 08/11/2015 02:12 AM, Vijay Partha wrote:
 After you start Pacemaker and then type "pcs status", the output shows that
 the nodes are online and the list of resources is empty. We then add
 resources to the nodes. What I want is: after starting Pacemaker, can I
 get some resources started without adding them by making use of pcs?
 
 You only need to add resources once. "pcs status" takes a little time to
 show them when a cluster first starts up; just wait a while and type
 "pcs status" again. 

On a related note, one could be spared this manual busy-waiting if there
were support for it:
https://bugzilla.redhat.com/show_bug.cgi?id=1229822
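As a stopgap until such support lands, recent Pacemaker builds provide
`crm_resource --wait`, which blocks until the cluster has no pending actions;
a rough sketch, assuming it is available in your version:

```shell
# Block until the cluster settles (no actions left to schedule),
# then show the resulting state.
crm_resource --wait
pcs status
```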

 The resources themselves will be started as soon as the cluster
 determines they safely can be.
 
 On Tue, Aug 11, 2015 at 12:39 PM, Andrei Borzenkov arvidj...@gmail.com
 wrote:
 
 On Tue, Aug 11, 2015 at 9:44 AM, Vijay Partha vijaysarath...@gmail.com
 wrote:
 
 Can we statically add resources to the nodes? I mean, before Pacemaker
 is started, can we add resources to the nodes without having to use
 "pcs resource create"? Is this possible?
 
 You better explain what you are trying to achieve. Otherwise exactly
 this question was discussed just recently, search archives of this
 list.
 
 If there are archives for this list, could you help me out by sending the
 link?

In general, primary archive of the list can be reached at
http://clusterlabs.org/pipermail/users/, with other semi-endorsed 
(having their own merits) mirrors being Gmane:
http://dir.gmane.org/gmane.comp.clustering.clusterlabs.user
and The Mail Archive:
https://www.mail-archive.com/users@clusterlabs.org/

Andrei likely referred to this thread that should cover what, I also
think, you want to achieve:
http://clusterlabs.org/pipermail/users/2015-August/000913.html

Hope this helps.

-- 
Jan (Poki)




Re: [ClusterLabs] Delayed first monitoring

2015-08-12 Thread Ken Gaillot
On 08/12/2015 10:45 AM, Miloš Kozák wrote:
 Thank you for your answer, but:
 
 1) This sounds OK, but in other words it means a delayed first check
 is not possible.
 
 2) Start of an init script? I use the LSB scripts from the distribution,
 so there is no way to change them (I can change them, but a package
 upgrade will wipe the changes out). This is quite a typical approach;
 how can I do HA for Atlassian, for example? Jira takes 5 minutes to load.

I think your situation involves multiple issues which are worth
separating for clarity:

1. As Alexander mentioned, Pacemaker will do a monitor BEFORE trying to
start a service, to make sure it's not already running. So these don't
need any delay and are expected to fail.

2. Resource agents MUST NOT return success for start until the service
is fully up and running, so the next monitor should succeed, again
without needing any delay. If that's not the case, it's a bug in the agent.

3. It's generally better to use OCF resource agents whenever available,
as they have better integration with pacemaker than lsb/systemd/upstart.
In this case, take a look at ocf:heartbeat:apache.

4. You can configure the timeout used with each action (stop, start,
monitor, restart) on a given resource. The default is 20 seconds. For
example, if a start action is expected to take 5 minutes, you would
define a start operation on the resource with timeout=300s. How you do
that depends on your management tool (pcs, crmsh, or cibadmin).

Bottom line, you should never need a delay on the monitor, instead set
appropriate timeouts for each action, and make sure that the agent does
not return from start until the service is fully up.
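For example, points 3 and 4 combined might look like this with pcs (the 300s
figure matches the 5-minute startup mentioned above; the other timeouts are
illustrative):

```shell
# Use the OCF apache agent and give the start action a 5-minute timeout,
# so Pacemaker waits out a slow startup instead of declaring failure.
pcs resource create httpd ocf:heartbeat:apache \
    op start timeout=300s \
    op monitor interval=30s timeout=60s \
    op stop timeout=60s
```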

 Dne 12.8.2015 v 16:14 Nekrasov, Alexander napsal(a):
 1. Pacemaker will/may call a monitor before starting a resource, in
 which case it expects a NOT_RUNNING response. It's just checking
 assumptions at that point.

 2. A resource::start must only return when resource::monitor is
 successful. Basically the logic of a start() must follow this:

 start() {
start_daemon()
while ! monitor() ; do
sleep some
done
return $OCF_SUCCESS
 }

 -Original Message-
 From: Miloš Kozák [mailto:milos.ko...@lejmr.com]
 Sent: Wednesday, August 12, 2015 10:03 AM
 To: users@clusterlabs.org
 Subject: [ClusterLabs] Delayed first monitoring

 Hi,

 I have set up Corosync+CMAN+Pacemaker on CentOS 6.5 in order to
 provide high availability for OpenNebula. However, I am facing a
 strange problem which arises from my lack of knowledge.

 In the log I can see that when I create a resource based on an init
 script, typically:

 pcs resource create httpd lsb:httpd

 the httpd daemon gets started, but a monitor is initiated at the same time
 and the resource is identified as not running. This behaviour makes
 sense once we realize that starting the daemon takes some time. In this
 particular case, I get error code 2, which means the process is running
 but the environment is not locked. The effect is that the httpd resource
 gets restarted.

 My workaround is an extra sleep in the status function of the init script,
 but I don't like this solution at all! Do you have an idea how to tackle
 this problem properly? I expected an op attribute which would specify a
 delay between service start and the first monitor, but I could not find
 one.

 Thank you, Milos




[ClusterLabs] circumstances under which resources become unmanaged

2015-08-12 Thread N, Ravikiran
Hi All,

I have a resource added to Pacemaker called 'cmsd' whose state is getting to 
the 'unmanaged FAILED' state.

Apart from manually changing the resource to unmanaged using "pcs resource 
unmanage cmsd", I'm trying to understand under what circumstances a resource 
can become unmanaged.
I have not set any value for the multiple-active field, which means by default 
it is set to stop_start, and hence I believe the resource can never go to 
unmanaged if it is found active on more than one node.

Also, it would be very helpful if anyone could point to the specific sections
of the Pacemaker manuals that answer this.

Regards,
Ravikiran

