[ClusterLabs] [Question:pacemaker_remote] About limitation of the placement of the resource to remote node.
Hi All,

We have been testing the behaviour of pacemaker_remote (version: pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977) on the following cluster:

* sl7-01 (KVM host)
* snmp1 (guest on the sl7-01 host)
* snmp2 (guest on the sl7-01 host)

We prepared the following CLI file to confirm resource placement on the remote nodes.

--
property no-quorum-policy=ignore \
    stonith-enabled=false \
    startup-fencing=false
rsc_defaults resource-stickiness=INFINITY \
    migration-threshold=1
primitive remote-vm2 ocf:pacemaker:remote \
    params server=snmp1 \
    op monitor interval=3 timeout=15
primitive remote-vm3 ocf:pacemaker:remote \
    params server=snmp2 \
    op monitor interval=3 timeout=15
primitive dummy-remote-A Dummy \
    op start interval=0s timeout=60s \
    op monitor interval=30s timeout=60s \
    op stop interval=0s timeout=60s
primitive dummy-remote-B Dummy \
    op start interval=0s timeout=60s \
    op monitor interval=30s timeout=60s \
    op stop interval=0s timeout=60s
location loc1 dummy-remote-A \
    rule 200: #uname eq remote-vm3 \
    rule 100: #uname eq remote-vm2 \
    rule -inf: #uname eq sl7-01
location loc2 dummy-remote-B \
    rule 200: #uname eq remote-vm3 \
    rule 100: #uname eq remote-vm2 \
    rule -inf: #uname eq sl7-01
--

Case 1) When we load the CLI file above, the resources are placed as follows. However, the placement of the dummy-remote resources does not meet the constraints: dummy-remote-A starts on remote-vm2 (instead of remote-vm3).
[root@sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Aug 13 08:49:09 2015
Last change: Thu Aug 13 08:41:14 2015 by root via cibadmin on sl7-01
Stack: corosync
Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
3 nodes and 4 resources configured

Online: [ sl7-01 ]
RemoteOnline: [ remote-vm2 remote-vm3 ]

 dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm2
 dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
 remote-vm2 (ocf::pacemaker:remote): Started sl7-01
 remote-vm3 (ocf::pacemaker:remote): Started sl7-01
(snip)

Case 2) When we revise the CLI file as follows and load it, the resources are placed correctly: dummy-remote-A starts on remote-vm3, and dummy-remote-B starts on remote-vm3.

(snip)
location loc1 dummy-remote-A \
    rule 200: #uname eq remote-vm3 \
    rule 100: #uname eq remote-vm2 \
    rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
    rule -inf: #uname eq sl7-01
location loc2 dummy-remote-B \
    rule 200: #uname eq remote-vm3 \
    rule 100: #uname eq remote-vm2 \
    rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
    rule -inf: #uname eq sl7-01
(snip)

[root@sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Aug 13 08:55:28 2015
Last change: Thu Aug 13 08:55:22 2015 by root via cibadmin on sl7-01
Stack: corosync
Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
3 nodes and 4 resources configured

Online: [ sl7-01 ]
RemoteOnline: [ remote-vm2 remote-vm3 ]

 dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm3
 dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
 remote-vm2 (ocf::pacemaker:remote): Started sl7-01
 remote-vm3 (ocf::pacemaker:remote): Started sl7-01
(snip)

The wrong placement with the first CLI file looks as though location constraints that refer to remote nodes are not evaluated until the remote resources themselves have started.
The placement becomes correct with the revised CLI file, but describing the constraints this way is very cumbersome when composing a cluster with more nodes.

Shouldn't remote nodes delay the evaluation of placement constraints until they have started? Is there an easier way to describe constraints that place resources on remote nodes?

* As one workaround, we know that placement works correctly if we split the first CLI file in two:
  first send the CLI that starts the remote nodes, then send the CLI that starts the resources.
* However, we would prefer not to split the CLI file in two if possible.

Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
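One possible way to shorten such constraint sets - a sketch only, not something tested in this thread - is to make the cluster opt-in by setting symmetric-cluster=false. Resources may then run only where a positive score places them, so the -inf rules become unnecessary:

```
property symmetric-cluster=false
location loc1 dummy-remote-A \
    rule 200: #uname eq remote-vm3 \
    rule 100: #uname eq remote-vm2
location loc2 dummy-remote-B \
    rule 200: #uname eq remote-vm3 \
    rule 100: #uname eq remote-vm2
```

Note that in an opt-in cluster the remote-vm2 and remote-vm3 connection resources themselves would also need a positive location score on sl7-01, or they would not start anywhere.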
Re: [ClusterLabs] cib state is now lost
On 08/12/2015 05:29 AM, David Neudorfer wrote:
> Thanks Ken,
>
> We're currently using Pacemaker 1.1.11 and at the moment it's not an option to upgrade. I've spun these boxes up and down on AWS and even tried different sizes. I think a recent upgrade broke this deploy.

What OS distribution/version are you using?

If you have the option of switching from corosync 1 + plugin to either corosync 1 + CMAN or corosync 2, that should avoid the issue and put you in a better supported position going forward. The plugin code has known memory issues when nodes come and go, and the effects can be unpredictable.

> This is the output from dmesg:
>
> cib[16656] general protection ip:7f45391e9545 sp:7ffddf16c8b8 error:0 in libc-2.12.so[7f45390be000+18a000]
> cib[16659] general protection ip:7fa36fa89545 sp:7ffe28416288 error:0 in libc-2.12.so[7fa36f95e000+18a000]
> cib[16663] general protection ip:7fa3defce545 sp:7ffeb5b29c58 error:0 in libc-2.12.so[7fa3deea3000+18a000]
> cib[1] general protection ip:7fa1cefe4545 sp:7ffcc4b9c778 error:0 in libc-2.12.so[7fa1ceeb9000+18a000]
> cib[16669] general protection ip:7f4b3900f545 sp:7ffdcd65aaf8 error:0 in libc-2.12.so[7f4b38ee4000+18a000]
> cib[16672] general protection ip:7fc38be2b545 sp:7fffbc7e1598 error:0 in libc-2.12.so[7fc38bd0+18a000]
> cib[16675] general protection ip:7f9c6890c545 sp:7ffca09539f8 error:0 in libc-2.12.so[7f9c687e1000+18a000]
> cib[16678] general protection ip:7f1c636ad545 sp:7ffc677d2008 error:0 in libc-2.12.so[7f1c63582000+18a000]
> cib[16681] general protection ip:7fed0b47e545 sp:7ffd051f0618 error:0 in libc-2.12.so[7fed0b353000+18a000]
> cib[16684] general protection ip:7f2ee87cd545 sp:7fff8d9ae288 error:0 in libc-2.12.so[7f2ee86a2000+18a000]
> cib[16687] general protection ip:7f41c3789545 sp:7fff9f005848 error:0 in libc-2.12.so[7f41c365e000+18a000]

On Mon, Aug 10, 2015 at 9:54 AM, Ken Gaillot kgail...@redhat.com wrote:
On 08/09/2015 02:27 PM, David Neudorfer wrote:
> Where can I dig deeper to figure out why cib keeps terminating?
> selinux and iptables are both disabled and I have debug enabled. Google hasn't been able to help me thus far.
>
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: debug: get_local_nodeid: Local nodeid is 84939948
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: plugin_get_details: Server details: id=84939948 uname=ip-172-20-16-5 cname=pcmk
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: crm_get_peer: Created entry c1f204b2-c994-48d9-81b6-87e1a7fc1ee7/0xa2c460 for node ip-172-20-16-5/84939948 (1 total)
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: crm_get_peer: Node 84939948 is now known as ip-172-20-16-5
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: crm_get_peer: Node 84939948 has uuid ip-172-20-16-5
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: crm_update_peer_proc: init_cs_connection_classic: Node ip-172-20-16-5[84939948] - unknown is now online
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: init_cs_connection_once: Connection to 'classic openais (with plugin)': established
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: qb_ipcs_us_publish: server name: cib_ro
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: qb_ipcs_us_publish: server name: cib_rw
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: qb_ipcs_us_publish: server name: cib_shm
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: cib_init: Starting cib mainloop
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice: plugin_handle_membership: Membership 104: quorum acquired
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: crm_update_peer_proc: plugin_handle_membership: Node ip-172-20-16-5[84939948] - unknown is now member
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice: crm_update_peer_state: cib_peer_update_callback: Node ip-172-20-16-5[84939948] - state is now lost (was (null))
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice: crm_reap_dead_member: Removing ip-172-20-16-5/84939948 from the membership list
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice: reap_crm_member: Purged 1 peers with id=84939948 and/or uname=(null) from the membership cache
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice: crm_update_peer_state: plugin_handle_membership: Node ��[2077843320] - state is now member (was member)
> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: crm_update_peer: plugin_handle_membership: Node ��: id=2077843320 state=r(0) ip(172.20.16.5) addr=r(0) ip(172.20.16.5) (new) votes=1 (new) born=104 seen=104
Re: [ClusterLabs] circumstances under which resources become unmanaged
On 12.08.2015 20:46, N, Ravikiran wrote:
> Hi All,
>
> I have a resource added to pacemaker called 'cmsd' whose state is getting to "unmanaged FAILED". Apart from manually changing the resource to unmanaged using "pcs resource unmanage cmsd", I'm trying to understand under what circumstances a resource can become unmanaged. I have not set any value for the multiple-active field, which means by default it is set to stop_start, and hence I believe the resource can never become unmanaged because it is found active on more than one node.

"unmanaged FAILED" means pacemaker (or rather the resource agent) failed to stop the resource. At this point the resource state is undefined, so pacemaker won't do anything with it.

> Also, it would be more helpful if anyone can point out specific sections of the pacemaker manuals for the answer.
>
> Regards,
> Ravikiran
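For reference - the commands below are generic pcs/crmsh usage, not something taken from this thread - once the underlying cause of the failed stop is fixed, the failure state can usually be cleared so that Pacemaker manages the resource again:

```
pcs resource cleanup cmsd      # clear cmsd's failed-operation history
# or, with crmsh:
crm resource cleanup cmsd
```

Note also that in a cluster with fencing (STONITH) enabled, a failed stop would normally cause the node to be fenced rather than leaving the resource in an unmanaged FAILED state.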
Re: [ClusterLabs] Ordering constraint restart second resource group
On 12.08.2015 19:35, John Gogu wrote:
> Hello, in my cluster configuration I have the following situation:
>
> resource_group_A: ip1, ip2
> resource_group_B: apache1
> ordering constraint: resource_group_A then resource_group_B, symmetrical=true
>
> When I add a new resource to group_A, the resources from group_B are restarted. If I remove the constraint everything is OK, but I need to keep this ordering constraint.

Did you try adding the resource as unmanaged, starting it manually, and then changing it to managed?
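The suggestion above could look roughly like this with pcs (a sketch; the resource name, agent, and address are made up for illustration, not taken from the thread):

```
# Create the new resource already unmanaged so the cluster does not act on it
pcs resource create ip3 ocf:heartbeat:IPaddr2 ip=192.0.2.103 cidr_netmask=24 \
    meta is-managed=false --group resource_group_A
# bring the address up manually outside the cluster, verify it, then:
pcs resource manage ip3
```

While the resource is unmanaged, the ordering constraint cannot trigger a stop/start cycle of resource_group_B.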
Re: [ClusterLabs] circumstances under which resources become unmanaged
Thanks for the reply, Andrei.

What happens to resources tied to this unmanaged FAILED resource by a COLOCATION or ORDER constraint? Will the constraints be removed? Also, please point me to any documentation that explains this in detail.

Regards,
Ravikiran

-----Original Message-----
From: Andrei Borzenkov [mailto:arvidj...@gmail.com]
Sent: Thursday, August 13, 2015 9:33 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] circumstances under which resources become unmanaged

On 12.08.2015 20:46, N, Ravikiran wrote:
> Hi All,
>
> I have a resource added to pacemaker called 'cmsd' whose state is getting to "unmanaged FAILED". Apart from manually changing the resource to unmanaged using "pcs resource unmanage cmsd", I'm trying to understand under what circumstances a resource can become unmanaged. I have not set any value for the multiple-active field, which means by default it is set to stop_start, and hence I believe the resource can never become unmanaged because it is found active on more than one node.

"unmanaged FAILED" means pacemaker (or rather the resource agent) failed to stop the resource. At this point the resource state is undefined, so pacemaker won't do anything with it.

> Also, it would be more helpful if anyone can point out specific sections of the pacemaker manuals for the answer.
> Regards,
> Ravikiran
Re: [ClusterLabs] Antw: CentOS 7 - Pacemaker - Problem with nfs-server and system
Hi,

thank you for your reply. It seems to be a problem with the systemd unit files for nfs-server - specifically a timing issue.

[root@centos7-n1 ~]# systemctl list-unit-files --type=service | grep rpcbind
rpcbind.service    static

rpcbind is set to static - it should be started on demand by other units. Invoking "systemctl start nfs-server" pulls in rpcbind and nfs-lock; rpcbind is started, but nfs-lock may be trying to start too early:

* Manually invoking "systemctl start rpcbind" and then "systemctl start nfs-lock" works within a second.
* Manually invoking "systemctl start rpcbind" and then "systemctl start nfs-server" works within a few seconds as well.
* Manually invoking "systemctl start nfs-server" alone only works sporadically, due to some timing issue.

My current workaround is to also start rpcbind from the cluster, just before nfsserver. I also tried /usr/lib/ocf/resource.d/heartbeat/nfsserver - it is capable of handling systemd systems, but it starts nfs-lock and nfs-server itself and hence hits the same problem in my case.

Cheers,
Stefan

-----Original Message-----
From: Ulrich Windl ulrich.wi...@rz.uni-regensburg.de

> [root@centos7-n1 ~]# time systemctl start nfs-server
> real    1m0.480s

Probably time to look into syslog. I suspect a name/address resolving issue...
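The workaround described above might be expressed with pcs roughly as follows (a sketch under assumptions: the group name and the shared-info directory are illustrative, and systemd:rpcbind presumes a systemd-class cluster resource is acceptable here):

```
# Resources in a group start in order, so rpcbind comes up before nfsserver
pcs resource create rpcbind systemd:rpcbind --group nfs-group
pcs resource create nfsserver ocf:heartbeat:nfsserver \
    nfs_shared_infodir=/var/lib/nfs --group nfs-group
```

Putting both in one group gives ordering and colocation in a single step; separate order/colocation constraints would work as well.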
Re: [ClusterLabs] starting of resources
On 11/08/15 09:14 -0500, Ken Gaillot wrote:
> On 08/11/2015 02:12 AM, Vijay Partha wrote:
>> After you start pacemaker and then type "pcs status", we get output showing that nodes are online and the list of resources is empty. We then add resources to the nodes. Now what I want is: after starting pacemaker, can I get some resources to be started without adding them by making use of pcs?
>
> You only need to add resources once. "pcs status" takes a little time to show them when a cluster first starts up; just wait a while and type "pcs status" again.

On a related note, one could be spared this manual busy waiting if there were support for that:
https://bugzilla.redhat.com/show_bug.cgi?id=1229822

> The resources themselves will be started as soon as the cluster determines they safely can be.
>
>> On Tue, Aug 11, 2015 at 12:39 PM, Andrei Borzenkov arvidj...@gmail.com wrote:
>>> On Tue, Aug 11, 2015 at 9:44 AM, Vijay Partha vijaysarath...@gmail.com wrote:
>>>> Can we statically add resources to the nodes? I mean, can we add resources to the nodes before pacemaker is started, without needing to use "pcs resource create"? Is this possible?
>>>
>>> You had better explain what you are trying to achieve. Otherwise: exactly this question was discussed just recently; search the archives of this list.
>>
>> If there are archives for this list, could you help me out by sending the link?

In general, the primary archive of the list can be reached at http://clusterlabs.org/pipermail/users/, with other semi-endorsed (having their own merits) mirrors being Gmane (http://dir.gmane.org/gmane.comp.clustering.clusterlabs.user) and The Mail Archive (https://www.mail-archive.com/users@clusterlabs.org/).

Andrei likely referred to this thread, which should cover what (I also think) you want to achieve:
http://clusterlabs.org/pipermail/users/2015-August/000913.html

Hope this helps.
--
Jan (Poki)
Re: [ClusterLabs] Delayed first monitoring
On 08/12/2015 10:45 AM, Miloš Kozák wrote:
> Thank you for your answer, but:
>
> 1) This sounds OK, but in other words it means the first delayed check cannot be done.
>
> 2) Start of an init script? I use the LSB scripts from the distribution, so there is no way to change them (I can change them, but the changes will be lost on package upgrades). This is a quite typical approach - how can I do HA for Atlassian, for example? Jira takes 5 minutes to load.

I think your situation involves multiple issues which are worth separating for clarity:

1. As Alexander mentioned, Pacemaker will do a monitor BEFORE trying to start a service, to make sure it's not already running. These monitors don't need any delay and are expected to fail.

2. Resource agents MUST NOT return success for start until the service is fully up and running, so the next monitor should succeed, again without needing any delay. If that's not the case, it's a bug in the agent.

3. It's generally better to use OCF resource agents whenever available, as they have better integration with Pacemaker than lsb/systemd/upstart. In this case, take a look at ocf:heartbeat:apache.

4. You can configure the timeout used with each action (stop, start, monitor, restart) on a given resource. The default is 20 seconds. For example, if a start action is expected to take 5 minutes, you would define a start operation on the resource with timeout=300s. How you do that depends on your management tool (pcs, crmsh, or cibadmin).

Bottom line: you should never need a delay on the monitor. Instead, set appropriate timeouts for each action, and make sure that the agent does not return from start until the service is fully up.

On 12.8.2015 at 16:14, Nekrasov, Alexander wrote:
> 1. Pacemaker will/may call a monitor before starting a resource, in which case it expects a NOT_RUNNING response. It's just checking assumptions at that point.
>
> 2. A resource's start must only return when the resource's monitor is successful.
Basically, the logic of start() must follow this:

    start() {
        start_daemon
        while ! monitor ; do
            sleep <some interval>
        done
        return $OCF_SUCCESS
    }

-----Original Message-----
From: Miloš Kozák [mailto:milos.ko...@lejmr.com]
Sent: Wednesday, August 12, 2015 10:03 AM
To: users@clusterlabs.org
Subject: [ClusterLabs] Delayed first monitoring

Hi,

I have set up CoroSync + CMAN + Pacemaker on CentOS 6.5 in order to provide high availability of OpenNebula. However, I am facing a strange problem which arises from my lack of knowledge. In the log I can see that when I create a resource based on an init script, typically:

    pcs resource create httpd lsb:httpd

the httpd daemon gets started, but a monitor is initiated at the same time and the resource is identified as not running. This behaviour makes sense once we realize that starting the daemon takes some time. In this particular case, I get error code 2, which means the process is running but the environment is not locked. The effect of this is that the httpd resource gets restarted.

My workaround is an extra sleep in the status function of the init script, but I don't like this solution at all! Do you have an idea how to tackle this problem in a proper way? I expected an "op" attribute which would specify a delay between the service start and the first monitor, but I could not find one.

Thank you, Milos
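Alexander's skeleton can be made concrete in plain shell. The sketch below is only an illustration, not a real OCF agent: the "daemon" is simulated by a background subshell that creates a PID file after about a second, and monitor() merely checks that the file exists.

```shell
# Simulated OCF-style start: do not return until monitor() succeeds.
OCF_SUCCESS=0
PIDFILE=/tmp/fake-daemon.$$.pid

monitor() {
    # A real agent would check the process and service health;
    # here we only check that the pid file exists.
    [ -f "$PIDFILE" ]
}

start_daemon() {
    # Simulate a daemon that takes about a second to come up.
    ( sleep 1; echo started > "$PIDFILE" ) &
}

start() {
    start_daemon
    while ! monitor; do
        sleep 0.2   # poll until the service is really up
    done
    return $OCF_SUCCESS
}

start
START_RC=$?
echo "start returned $START_RC"   # prints "start returned 0"
rm -f "$PIDFILE"
```

A real agent would also bound the loop with a timeout and do a genuine health check in monitor(); Pacemaker will in any case kill the start action once the configured start timeout expires.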
[ClusterLabs] circumstances under which resources become unmanaged
Hi All,

I have a resource added to pacemaker called 'cmsd' whose state is getting to "unmanaged FAILED". Apart from manually changing the resource to unmanaged using "pcs resource unmanage cmsd", I'm trying to understand under what circumstances a resource can become unmanaged. I have not set any value for the multiple-active field, which means by default it is set to stop_start, and hence I believe the resource can never become unmanaged because it is found active on more than one node.

Also, it would be helpful if anyone could point out the specific sections of the pacemaker manuals that answer this.

Regards,
Ravikiran