Re: [Pacemaker] Suggestions for managing HA of containers from within a Pacemaker container?
Hello Steve, Are you sure that Pacemaker is the right product for your project? Have you checked Mesos/Marathon or Kubernetes? Those are frameworks being developed for managing containers. On Sat Feb 07 2015 at 1:19:15 PM Steven Dake (stdake) wrote: > Hi, > > I am working on containerizing OpenStack in the Kolla project ( > http://launchpad.net/kolla). One of the key things we want to do over > the next few months is add H/A support to our container tech. David Vossel > had suggested using systemctl to monitor the containers themselves by > running health-checking scripts within the containers. That idea is sound. > > There is another technology called “super-privileged containers”. > Essentially it allows more host access for the container, allowing the > treatment of Pacemaker as a container rather than an RPM or DEB file. I’d > like corosync to run in a separate container. These containers will > communicate using their normal mechanisms in a super-privileged mode. We > will implement this in Kolla. > > Where I am stuck is how Pacemaker within a container can control other > containers in the host OS. One way I have considered is using the docker > --pid=host flag, allowing Pacemaker to communicate directly with the host > systemctl process. The complication is that our containers don’t run via > systemctl, but instead via shell scripts that are executed by third-party > deployment software. > > An example: > Let’s say a rabbitmq container wants to run. > > The user would run > kolla-mgr deploy messaging > > This would run a small bit of code to launch the docker container set > for messaging. > > Could Pacemaker run something like > > kolla-mgr status messaging > > to control the lifecycle of the processes? > > Or would we be better off with some systemd integration with kolla-mgr? 
> > Thoughts welcome > > Regards, > -steve > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org
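One possible shape for that integration is an OCF-style agent whose monitor action wraps kolla-mgr status. The sketch below is hypothetical: kolla-mgr's exit codes are an assumption (the source only names the deploy and status subcommands), so the command is routed through $KOLLA_MGR to make the exit-code mapping easy to stub and verify:

```shell
#!/bin/sh
# Hypothetical sketch: an OCF-style monitor action driving a Kolla
# container set through kolla-mgr instead of systemctl.
# ASSUMPTION: kolla-mgr exits 0 when the service set is healthy.

KOLLA_MGR="${KOLLA_MGR:-kolla-mgr}"   # overridable so the mapping can be stubbed

OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

messaging_start() {
    "$KOLLA_MGR" deploy messaging && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}

messaging_monitor() {
    # Map "service set not running" to OCF_NOT_RUNNING so Pacemaker
    # can tell a cleanly stopped resource from a failed one.
    if "$KOLLA_MGR" status messaging >/dev/null 2>&1; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

# Demonstrate the exit-code mapping with stubs standing in for kolla-mgr:
KOLLA_MGR=true  messaging_monitor; rc_up=$?
KOLLA_MGR=false messaging_monitor; rc_down=$?
echo "monitor rc while up: $rc_up, while stopped: $rc_down"
```

The monitor mapping is the part that matters most: Pacemaker treats rc 7 (OCF_NOT_RUNNING) as "stopped" and rc 1 as a failure requiring recovery, so whatever kolla-mgr actually returns would need to be translated into those two cases.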
Re: [Pacemaker] Basic Clone IP configuration not starting automatically
I solved the issue: I had to set the resource-stickiness option to 0, and now whichever host I reboot, the resources come back and stay like this: ClusterVIP:0 (ocf::heartbeat:IPaddr2): Started haproxy01 ClusterVIP:1 (ocf::heartbeat:IPaddr2): Started haproxy02 Regards. Thanks for your time and support. 2015-02-06 15:04 GMT-03:00 Michael Schwartzkopff : > On Friday, 6 February 2015, 08:14:22 you wrote: > > Hi Michael. > > > > I mean this: when I reboot the server, Pacemaker starts automatically > but > > I do not see the VIP on eth0; sorry, I used the word alias, which may > have > > led to confusion. > > > > ip addr show > > 1: lo: mtu 16436 qdisc noqueue state UNKNOWN > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > inet 127.0.0.1/8 scope host lo > > inet6 ::1/128 scope host > >valid_lft forever preferred_lft forever > > > > 2: eth0: mtu 1500 qdisc pfifo_fast > state > > UP qlen 1000 > > link/ether 00:21:f6:00:00:25 brd ff:ff:ff:ff:ff:ff > > inet 1.1.1.2/16 brd 1.1.255.255 scope global eth0 ( example server > > service IP ) > > inet 1.1.1.1/16 brd 1.1.255.255 scope global secondary eth0 ( THIS IS > > THE VIP ), which is not created/configured automatically. > > > > So I have to start it with > > *:/usr/sbin/crm_resource --meta -t primitive -r res_IPaddr2_virtualip -p > > target-role -v started* > > or, from the LCMC Java frontend, I have to perform a stop/start. > > > > The output of crm_mon looks like this in both situations; the resources > > are shown as started, but the VIP is not brought up > > automatically when the server boots up, even though Pacemaker starts > > automatically. > > > > Clone Set: cl_IPaddr2_1 [res_IPaddr2_virtualip] (unique) > > res_IPaddr2_virtualip:0(ocf::heartbeat:IPaddr2): Started > > haproxy02 > > res_IPaddr2_virtualip:1(ocf::heartbeat:IPaddr2): Started > > haproxy02 > Are you sure that the second IP address res_IPaddr2_virtualip:1 is not > started? > Perhaps it is only started on the second node? 
As you see above, both > instances > of the clone ...:0 and ...:1 are started on the same node haproxy02. > > Kind regards, > > Michael Schwartzkopff > > -- > [*] sys4 AG > > http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044 > Franziskanerstraße 15, 81669 München > > Registered office: München, Amtsgericht München: HRB 199263 > Board: Patrick Ben Koetter, Marc Schiffbauer > Supervisory board chairman: Florian Kirstein
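For reference, the working setup described above can be sketched in crmsh roughly as follows. This is a reconstruction, not the poster's actual configuration: the IP, netmask, and clone meta attributes are illustrative assumptions inferred from the crm_mon output:

```
primitive res_IPaddr2_virtualip ocf:heartbeat:IPaddr2 \
    params ip="1.1.1.1" cidr_netmask="16" \
    op monitor interval="10s"
clone cl_IPaddr2_1 res_IPaddr2_virtualip \
    meta globally-unique="true" clone-max="2" clone-node-max="2"
rsc_defaults resource-stickiness="0"
```

The `(unique)` flag in the crm_mon output corresponds to globally-unique=true, and clone-node-max=2 is what permits both instances to land on one node; with stickiness 0 the cluster is free to rebalance the instances back onto a node after it reboots.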
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
On Fri, Feb 06, 2015 at 04:15:44PM +0100, Dejan Muhamedagic wrote: > Hi, > > On Thu, Feb 05, 2015 at 09:18:50AM +0100, Digimer wrote: > > That is the problem that makes geo-clustering very hard to nearly > > impossible. You can look at the Booth option for pacemaker, but that > > requires two (or more) full clusters, plus an arbitrator 3rd > > A full cluster can consist of one node only. Hence, it is > possible to have a kind of stretch two-node [multi-site] cluster > based on tickets and managed by booth. In theory. In practice, we rely on "proper behaviour" of "the other site", in case a ticket is revoked, or cannot be renewed. Relying on a single node for "proper behaviour" does not inspire as much confidence as relying on a multi-node HA-cluster at each site, which we can expect to ensure internal fencing. With reliable hardware watchdogs, it still should be ok to do "stretched two node HA clusters" in a reliable way. Be generous with timeouts. And document which failure modes you expect to handle, and how to deal with the worst-case scenarios if you end up with some failure case that you are not equipped to handle properly. There are deployments which favor "rather online with _potential_ split brain" over "rather offline just in case". Document this, print it out on paper, "I am aware that this may lead to lost transactions, data divergence, data corruption, or data loss. I am personally willing to take the blame, and live with the consequences." Have some "boss" sign that ^^^ in the real world using a real pen. Lars -- : Lars Ellenberg : http://www.LINBIT.com | Your Way to High Availability : DRBD, Linux-HA and Pacemaker support and consulting DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. 
Re: [Pacemaker] Booth ticket renewal timeout
On Mon, Feb 09, 2015 at 10:53:00AM +, Jorge Lopes wrote: > Hi Dejan, > Thanks for the tip. > > Concerning the timeout values, what would be typical ticket renewal > values for a production environment? We have two parameters: expire and renewal. The latter used to be set to half of the former, and due to a user request it is now configurable. If not configured, the expire time is set to 10 minutes, which yields a renewal time of 5 minutes. Those are only defaults; ultimately the right values depend on your business needs and site failover/disaster recovery procedures, as well as the connection stability and packet loss rates. I doubt that an expiry time of less than 1 minute is practical, though testing could be done with times of less than 10 seconds. The README contains a description of booth operation which you may find useful: https://github.com/ClusterLabs/booth/blob/master/README Thanks, Dejan > Thanks, > Jorge > > > > >On Sun, Feb 08, 2015 at 07:06:13PM +, Jorge Lopes wrote: > >> Hi all, > >> > >> I'm performing a lab test where I have a geo cluster and an arbitrator, > in a > >> configuration for disaster recovery with failover. There are two main > >> sites (primary and disaster recovery) and a third site for the arbitrator. > >> > >> I have defined a ticket named "Primary", which will define which is the > >> primary site and which is the recovery site. > >> In my first configuration I had in the booth.conf a value of 60 for the > >> ticket renewal. After I assigned the ticket to the primary site, when the > >> renewal time was reached, the ticket was not renewed and it ended up > not > >> assigned to any of the sites. > >> > >> So, I increased the value to 120 and now the ticket gets renewed > correctly. > >> > >> I am interested to know if there are any kinds of constraints on the > >> minimum value for the ticket renewal. Is there any design aspect that > would > >> recommend higher values? 
And what about in a production environment, > where > >> time lags might be larger, would such a situation occur? What would be a > >> typical set of timeout values (please note the CIB timeout values)? > >> > >> My configurations are as follows. > > > >It seems like you're running the older version of booth, which > >has been deprecated and is effectively unmaintained. The newer > >version is available at > >https://github.com/ClusterLabs/booth/releases/tag/v0.2.0 > > > >Thanks, > > > >Dejan > > > >> Thanks in advance, > >> Jorge > >> > >> > >> /etc/booth/booth.conf: > >> > >> transport="UDP" > >> port="" > >> site="192.168.180.211" > >> site="192.168.190.211" > >> arbitrator="192.168.200.211" > >> ticket="primary;120" > >> > >> > >> crm configure show: > >> node $id="1084798152" cluster1-node1 > >> primitive booth ocf:pacemaker:booth-site \ > >> meta resource-stickiness="INFINITY" \ > >> op monitor interval="10s" timeout="20s" > >> primitive booth-ip ocf:heartbeat:IPaddr2 \ > >> params ip="192.168.180.211" > >> primitive dummy-pgsql ocf:pacemaker:Stateful \ > >> op monitor interval="15" role="Slave" timeout="60s" \ > >> op monitor interval="30" role="Master" timeout="60s" > >> primitive oversee-ip ocf:heartbeat:IPaddr2 \ > >> params ip="192.168.180.210" > >> group g-booth booth-ip booth > >> ms ms_dummy_pqsql dummy-pgsql \ > >> meta target-role="Master" clone-max="1" > >> order order-booth-oversee-ip inf: g-booth oversee-ip > >> rsc_ticket ms_dummy_pgsql_primary primary: ms_dummy_pqsql:Master > >> loss-policy=demote > >> rsc_ticket oversee-ip-req-primary primary: oversee-ip loss-policy=stop > >> property $id="cib-bootstrap-options" \ > >> dc-version="1.1.10-42f2063" \ > >> cluster-infrastructure="corosync" \ > >> stonith-enabled="false" > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org
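In the newer booth (v0.2.x) syntax that Dejan points to, expire and the renewal frequency are set per ticket rather than with the old `ticket="name;seconds"` shorthand. A sketch along the lines of the setup above might look as follows; treat it as an illustration only, since the exact parameter names (in particular renewal-freq) and the default port should be double-checked against the README of the installed version:

```
transport = UDP
port = 9929
site = 192.168.180.211
site = 192.168.190.211
arbitrator = 192.168.200.211
ticket = "primary"
    expire = 600        # seconds; matches the default Dejan describes (10 minutes)
    renewal-freq = 300  # seconds; defaults to half of expire if unset
```

This reflects Dejan's description directly: expire governs how long a granted ticket remains valid, and the renewal frequency, formerly fixed at half of expire, is now independently configurable.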
Re: [Pacemaker] Booth ticket renewal timeout
Hi Dejan, Thanks for the tip. Concerning the timeout values, what would be typical ticket renewal values for a production environment? Thanks, Jorge > >On Sun, Feb 08, 2015 at 07:06:13PM +, Jorge Lopes wrote: >> Hi all, >> >> I'm performing a lab test where I have a geo cluster and an arbitrator, in a >> configuration for disaster recovery with failover. There are two main >> sites (primary and disaster recovery) and a third site for the arbitrator. >> >> I have defined a ticket named "Primary", which will define which is the >> primary site and which is the recovery site. >> In my first configuration I had in the booth.conf a value of 60 for the >> ticket renewal. After I assigned the ticket to the primary site, when the >> renewal time was reached, the ticket was not renewed and it ended up not >> assigned to any of the sites. >> >> So, I increased the value to 120 and now the ticket gets renewed correctly. >> >> I am interested to know if there are any kinds of constraints on the >> minimum value for the ticket renewal. Is there any design aspect that would >> recommend higher values? And what about in a production environment, where >> time lags might be larger, would such a situation occur? What would be a >> typical set of timeout values (please note the CIB timeout values)? >> >> My configurations are as follows. > >It seems like you're running the older version of booth, which >has been deprecated and is effectively unmaintained. 
The newer >version is available at >https://github.com/ClusterLabs/booth/releases/tag/v0.2.0 > >Thanks, > >Dejan > >> Thanks in advance, >> Jorge >> >> >> /etc/booth/booth.conf: >> >> transport="UDP" >> port="" >> site="192.168.180.211" >> site="192.168.190.211" >> arbitrator="192.168.200.211" >> ticket="primary;120" >> >> >> crm configure show: >> node $id="1084798152" cluster1-node1 >> primitive booth ocf:pacemaker:booth-site \ >> meta resource-stickiness="INFINITY" \ >> op monitor interval="10s" timeout="20s" >> primitive booth-ip ocf:heartbeat:IPaddr2 \ >> params ip="192.168.180.211" >> primitive dummy-pgsql ocf:pacemaker:Stateful \ >> op monitor interval="15" role="Slave" timeout="60s" \ >> op monitor interval="30" role="Master" timeout="60s" >> primitive oversee-ip ocf:heartbeat:IPaddr2 \ >> params ip="192.168.180.210" >> group g-booth booth-ip booth >> ms ms_dummy_pqsql dummy-pgsql \ >> meta target-role="Master" clone-max="1" >> order order-booth-oversee-ip inf: g-booth oversee-ip >> rsc_ticket ms_dummy_pgsql_primary primary: ms_dummy_pqsql:Master >> loss-policy=demote >> rsc_ticket oversee-ip-req-primary primary: oversee-ip loss-policy=stop >> property $id="cib-bootstrap-options" \ >> dc-version="1.1.10-42f2063" \ >> cluster-infrastructure="corosync" \ >> stonith-enabled="false" >
Re: [Pacemaker] Booth ticket renewal timeout
Hi, On Sun, Feb 08, 2015 at 07:06:13PM +, Jorge Lopes wrote: > Hi all, > > I'm performing a lab test where I have a geo cluster and an arbitrator, in a > configuration for disaster recovery with failover. There are two main > sites (primary and disaster recovery) and a third site for the arbitrator. > > I have defined a ticket named "Primary", which will define which is the > primary site and which is the recovery site. > In my first configuration I had in the booth.conf a value of 60 for the > ticket renewal. After I assigned the ticket to the primary site, when the > renewal time was reached, the ticket was not renewed and it ended up not > assigned to any of the sites. > > So, I increased the value to 120 and now the ticket gets renewed correctly. > > I am interested to know if there are any kinds of constraints on the > minimum value for the ticket renewal. Is there any design aspect that would > recommend higher values? And what about in a production environment, where > time lags might be larger, would such a situation occur? What would be a > typical set of timeout values (please note the CIB timeout values)? > > My configurations are as follows. It seems like you're running the older version of booth, which has been deprecated and is effectively unmaintained. 
The newer version is available at https://github.com/ClusterLabs/booth/releases/tag/v0.2.0 Thanks, Dejan > Thanks in advance, > Jorge > > > /etc/booth/booth.conf: > > transport="UDP" > port="" > site="192.168.180.211" > site="192.168.190.211" > arbitrator="192.168.200.211" > ticket="primary;120" > > > crm configure show: > node $id="1084798152" cluster1-node1 > primitive booth ocf:pacemaker:booth-site \ > meta resource-stickiness="INFINITY" \ > op monitor interval="10s" timeout="20s" > primitive booth-ip ocf:heartbeat:IPaddr2 \ > params ip="192.168.180.211" > primitive dummy-pgsql ocf:pacemaker:Stateful \ > op monitor interval="15" role="Slave" timeout="60s" \ > op monitor interval="30" role="Master" timeout="60s" > primitive oversee-ip ocf:heartbeat:IPaddr2 \ > params ip="192.168.180.210" > group g-booth booth-ip booth > ms ms_dummy_pqsql dummy-pgsql \ > meta target-role="Master" clone-max="1" > order order-booth-oversee-ip inf: g-booth oversee-ip > rsc_ticket ms_dummy_pgsql_primary primary: ms_dummy_pqsql:Master > loss-policy=demote > rsc_ticket oversee-ip-req-primary primary: oversee-ip loss-policy=stop > property $id="cib-bootstrap-options" \ > dc-version="1.1.10-42f2063" \ > cluster-infrastructure="corosync" \ > stonith-enabled="false" > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org
[Pacemaker] pacemaker-1.1.12 - lots of Could not establish cib_ro connection: Resource temporarily unavailable (11) errors
Hello, I'd like to ask about the following problem, which has troubled me for some time and for which I wasn't able to find a solution: I've got a cluster with quite a lot of resources, and when I try to do multiple operations at a time, I get a lot of resource failures (i.e. failed starts). The only related information I was able to find is the following snippet of the log: crmd: notice: process_lrm_event: Operation vmtnv03_start_0: unknown error (node=v1b, call=748, rc=1, cib-update=211, confirmed=true) crmd: notice: process_lrm_event: v1b-vmtnv03_start_0:748 [ Error: 'Could not establish cib_ro connection: Resource temporarily unavailable (11)'\n ] The OCF script does not do anything special; the start action basically just runs some Python command. Does somebody have a tip on what to do about this problem, or how to debug it further? My system is the latest CentOS 6, pacemaker-1.1.12-4.el6, cman-3.0.12.1-68.el6, resource-agents-3.9.5-12. Thanks a lot in advance! With best regards, nik -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax: +420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz -
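Errno 11 is EAGAIN, i.e. the cib daemon's IPC endpoint turned the connection away under load, which suggests these failures are transient rather than a logic bug in the agent. Pending a real fix, one workaround idea, offered as a sketch rather than an official recommendation, is to wrap the cib-touching calls inside the agent in a small retry helper:

```shell
#!/bin/sh
# Sketch: retry a command a few times before giving up, to ride out
# transient "Resource temporarily unavailable (11)" (EAGAIN) errors
# while the cib daemon is overloaded. A generic workaround, not a fix.

retry() {
    max=$1; shift
    n=1
    while ! "$@"; do
        [ "$n" -ge "$max" ] && return 1
        n=$((n + 1))
        sleep 1   # a real agent might prefer exponential backoff
    done
    return 0
}

# Inside a resource agent one might write, for example:
#   retry 5 crm_attribute --query --name some-attr
# Demonstration of the return-code behaviour with trivial commands:
retry 3 true;  ok=$?
retry 2 false; failed=$?
echo "ok=$ok failed=$failed"
```

The helper returns 0 as soon as the wrapped command succeeds and 1 only after max attempts fail, so a start action that occasionally hits EAGAIN gets a second chance instead of immediately reporting an unknown error to the crmd.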
Re: [Pacemaker] Call cib_apply_diff failed (-205): Update was older than existing configuration
Hi Kristoffer, Thank you for the help. I will let you know if I see the same with the latest version. On Feb 9, 2015 10:07 AM, "Kristoffer Grönlund" wrote: > Hi, > > Kostiantyn Ponomarenko writes: > > > Hi guys, > > > > I saw this during applying the configuration using a script with crmsh > > commands: > > > The CIB patching performed by crmsh has been a bit too sensitive to > CIB version mismatches which can cause errors like the one you are > seeing. This should be fixed in the latest released version of crmsh > (2.1.2), and I would recommend upgrading to that version if you can. > > If this problem still occurs with 2.1.2, please let me know [1]. > > Thanks! > > [1]: http://github.com/crmsh/crmsh/issues > > > > > > + crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw > > + crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw params > > delay=10 > > + crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1 > > -inf: node-1 > > + crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0 > > -inf: node-0 > > Call cib_apply_diff failed (-205): Update was older than existing > > configuration > > ERROR: could not patch cib (rc=205) > > INFO: offending xml diff: > > > > > > > > > > > > > > > > > > > >epoch="27" > > num_updates="3" admin_epoch="0" cib-last-written="Thu Feb 5 14:56:09 > 2015" > > have-quorum="1" dc-uuid="1"/> > > > > > >> position="1"> > > > rsc="STONITH_node-0" score="-INFINITY" node="node-0"/> > > > > > > > > After that pacemaker stopped on the node on which the script was run. 
> > > > Thank you, > > Kostya > > ___ > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > > Project Home: http://www.clusterlabs.org > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > > Bugs: http://bugs.clusterlabs.org > > -- > // Kristoffer Grönlund > // kgronl...@suse.com >
Re: [Pacemaker] Call cib_apply_diff failed (-205): Update was older than existing configuration
Hi, Kostiantyn Ponomarenko writes: > Hi guys, > > I saw this during applying the configuration using a script with crmsh > commands: The CIB patching performed by crmsh has been a bit too sensitive to CIB version mismatches, which can cause errors like the one you are seeing. This should be fixed in the latest released version of crmsh (2.1.2), and I would recommend upgrading to that version if you can. If this problem still occurs with 2.1.2, please let me know [1]. Thanks! [1]: http://github.com/crmsh/crmsh/issues > > + crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw > + crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw params > delay=10 > + crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1 > -inf: node-1 > + crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0 > -inf: node-0 > Call cib_apply_diff failed (-205): Update was older than existing > configuration > ERROR: could not patch cib (rc=205) > INFO: offending xml diff: > > > > > > > > > >num_updates="3" admin_epoch="0" cib-last-written="Thu Feb 5 14:56:09 2015" > have-quorum="1" dc-uuid="1"/> > > >position="1"> > rsc="STONITH_node-0" score="-INFINITY" node="node-0"/> > > > > After that pacemaker stopped on the node on which the script was run. > > Thank you, > Kostya > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org -- // Kristoffer Grönlund // kgronl...@suse.com
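For context on the "-205: Update was older than existing configuration" error itself: CIB versions order lexicographically as the tuple (admin_epoch, epoch, num_updates), and a patch whose base version is older than the live CIB is rejected as stale. A small illustration, where the live version comes from the diff quoted above and the patch base is a hypothetical value for demonstration:

```python
# CIB versions compare lexicographically: admin_epoch first, then
# epoch, then num_updates. A patch generated against an older base
# than the live CIB is rejected as stale (the -205 error above).

def is_stale(live, patch_base):
    """Return True if the patch base is older than the live CIB version."""
    return patch_base < live  # plain tuple comparison does the right thing

live = (0, 27, 3)        # admin_epoch=0, epoch=27, num_updates=3 (from the log)
patch_base = (0, 26, 9)  # hypothetical: crmsh's stale view before the update

print(is_stale(live, patch_base))   # an older epoch loses even with more updates
print(is_stale(live, live))         # an identical version is not stale
```

This is why the crmsh fix matters: 2.1.2 refreshes its view of the CIB between commands, so the patches it generates are based on the current version tuple instead of a stale one.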