Re: [Pacemaker] Node recover causes resource to migrate
Hello,

As suggested, I'm trying to add the nodes via corosync-objctl. My current config file, which uses 5001 as the nodeid on the second node, is this one: https://gist.github.com/therobot/4327cd0a2598d1d6bb93

Then I try to add the nodes with the following commands:

    corosync-objctl -n totem.interface.member.memberaddr=10.34.191.212
    corosync-objctl -n runtime.totem.pg.mrp.srp.members.5001.join_count=1
    corosync-objctl -n runtime.totem.pg.mrp.srp.members.5000.status=joined
    corosync-objctl -n runtime.totem.pg.mrp.srp.members.5001.ip=r(0) ip(10.34.191.212)

(I have a problem with the last one: the space between r(0) and ip is not respected by corosync-objctl.)

Am I on the right track? Should I abandon my quest to have an HA solution on the EC2 public network?

Thanks,

Jacobo García López de Araujo
http://thebourbaki.com | http://twitter.com/clapkent

On Wed, Jul 24, 2013 at 2:43 PM, Andrew Beekhof wrote:

> On 24/07/2013, at 10:09 PM, Lars Marowsky-Bree wrote:
>
> > On 2013-07-24T21:40:40, Andrew Beekhof wrote:
> >
> >>> Statically assigned nodeids?
> >> Wouldn't hurt, but you still need to bring down the still-active node to get it to talk to the new node.
> >> Which sucks
> >
> > Hm. But ... corosync/pacemaker ought to identify the node via the
> > nodeid. If it comes back with a different IP address, that shouldn't be
> > a problem.
> >
> > Oh. *thud* Just realized that it's bound to be one for unicast
> > communications, not so much mcast.
>
> Exactly.
>
> > Seems we may need some corosync magic
> > commands to edit the nodelist at runtime. (Or is that already possible
> > and I just don't know how? ;-)
>
> I believe it might be possible - I just don't know it.
> Might even be better to have it happen automagically - after all, the new
> node knows the existing node's address.
>
> But good luck getting that one through.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
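On the space problem mentioned in the message above: the space between `r(0)` and `ip(...)` is most likely consumed by the shell before corosync-objctl ever sees it, because an unquoted space splits the text into two arguments (and unquoted parentheses are shell syntax too). Here is a minimal sketch of the difference, using a hypothetical helper `count_args` that only counts the arguments it receives. Whether corosync-objctl accepts the quoted value, and whether writing `runtime.*` keys changes the membership at all, is not confirmed anywhere in this thread.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() {
    echo "$#"
}

key='runtime.totem.pg.mrp.srp.members.5001.ip'

# Escaped parentheses but an unquoted space: the shell passes two arguments.
count_args $key=r\(0\) ip\(10.34.191.212\)    # prints 2

# Single-quoted value: one argument, with the space and parentheses preserved.
count_args "$key"='r(0) ip(10.34.191.212)'    # prints 1
```

Applied to the command from the message, that would mean wrapping the whole `key=value` argument in single quotes: `corosync-objctl -n 'runtime.totem.pg.mrp.srp.members.5001.ip=r(0) ip(10.34.191.212)'`.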
Re: [Pacemaker] Node recover causes resource to migrate
Thanks for your answers.

Is it possible to configure corosync/pacemaker in this scenario so that, if a node goes down, I can build a new one and add it to the cluster instead of bringing the old node back? Building new nodes is almost "free" for me.

Also, what's the difference between bringing up a new node from scratch and restarting the node that went down? How is this treated in a normal scenario (i.e. you want to add a third node to a cluster)?

Side note: I have decided not to use Amazon VPC for several reasons. In summary, it is too much hassle to set up, since even the simplest configuration would involve an HA solution for the router between the two networks.

Jacobo García López de Araujo
http://thebourbaki.com | http://twitter.com/clapkent

On Wed, Jul 24, 2013 at 11:26 AM, Lars Marowsky-Bree wrote:

> On 2013-07-24T09:00:23, Andrew Beekhof wrote:
>
> > > 4. Node A is back with a different internal ip address.
> >
> > This is your basic problem.
> >
> > I don't believe there is any cluster software that is designed to support this sort of scenario.
> > Even at the corosync level, it has no knowledge that this is the same machine that left.
>
> Statically assigned nodeids?
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
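Since new nodes are cheap to build here, the corosync configuration for a replacement node could be generated rather than edited by hand. The sketch below assumes the udpu layout posted elsewhere in this thread and a `nodeid` option in the `totem` section for the statically assigned nodeid. `render_totem` is a hypothetical helper, and the nodeid, port and addresses are placeholders taken from this thread. It does not solve the problem discussed above: the surviving node still has to learn the new address.

```shell
# render_totem NODEID BINDADDR MEMBER_IP...
# Hypothetical helper: prints a totem section for a udpu cluster with a
# statically assigned nodeid. Other options (token, ttl) are left at defaults.
render_totem() {
    nodeid=$1
    bindaddr=$2
    shift 2
    printf 'totem {\n'
    printf '    version: 2\n'
    printf '    secauth: on\n'
    printf '    nodeid: %s\n' "$nodeid"
    printf '    transport: udpu\n'
    printf '    interface {\n'
    # One member block per cluster node, including this one.
    for ip in "$@"; do
        printf '        member {\n'
        printf '            memberaddr: %s\n' "$ip"
        printf '        }\n'
    done
    printf '        ringnumber: 0\n'
    printf '        bindnetaddr: %s\n' "$bindaddr"
    printf '        mcastport: 694\n'
    printf '    }\n'
    printf '}\n'
}

# Placeholder values: nodeid 5001, binding to the rebuilt node's own address.
render_totem 5001 10.34.191.212 10.35.147.209 10.34.191.212
```

The output would still need the `logging` and `service` sections from the original configuration before it could be used as a corosync.conf.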
[Pacemaker] Node recover causes resource to migrate
Hello,

I'm trying to debug the following scenario. My cluster setup is two AWS instances providing an Elastic IP as an HA resource. I have written a custom resource script for managing it; the script passes all the tests specified in ocf-tester and behaves properly in other test scenarios.

In this link[1] you can find the corosync configuration file of node A and the pacemaker configuration as well. The corosync configuration of node B is reciprocal.

But I am finding a lot of behavior that I am not able to understand in the following situation:

1. Both nodes are up and node A has the Elastic IP.
2. I turn off node A through the AWS API.
3. After some time (around 2-3 mins) the Elastic IP moves to node B.

The strange behavior starts when I try to bring node A back into the cluster:

4. Node A is back with a different internal ip address.
5. I update the ip address in the A node: `crm configure edit`
6. I add the ip address of node A in node B: `crm configure edit` (deleting the old ip of node B is not possible).
7. I update the ip address in the corosync configuration of node A.
8. I restart corosync in node A.
9. The IP migrates from node B to node A because node B appears offline to node A. This is because node B has not had its corosync config updated: I don't want to restart corosync on the node that is providing the ip.
10. At this point node A only sees itself and node B only sees itself.

Here are the log files of both nodes[2].

How can I bring a node back without having it move the IP address back?

Thanks in advance.
[1] https://gist.github.com/therobot/67bae18b5acb0ba78925
[2] https://gist.github.com/therobot/ef85646d6e0092e494c4

Jacobo García López de Araujo
http://thebourbaki.com | http://twitter.com/clapkent
Re: [Pacemaker] Simulating that a node is down.
Thanks, Andreas, for your kind answer. I'll add this to my test battery.

Also, regarding my other question: is it a good idea to close the corosync port? Should corosync behave in an expected way? I am getting odd behavior with this one, but I'm not sure where to put the blame.

Thanks in advance.

Jacobo García López de Araujo
http://thebourbaki.com | http://twitter.com/clapkent

On Thu, Jul 11, 2013 at 8:39 PM, Andreas Mock wrote:

> Hi Jacobo,
>
> one very interesting thing is missing.
> Overload the node. Make a program/script which generates
> many IO operations, many flushes and meanwhile requests
> more and more memory from the OS until swapping begins.
> Ohhh, yes, swapping and IO is nice…
>
> …then you can prove your monitor and stop action timeouts… ;-)
>
> Best regards
> Andreas Mock
>
> From: Jacobo García [mailto:jacobo.gar...@gmail.com]
> Sent: Thursday, 11 July 2013 19:14
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] Simulating that a node is down.
>
> Hello,
>
> I am looking for different ways of testing that a node is down. I am
> finding a strange behavior with one of them (closing with iptables the UDP
> communication port). I would like to know if closing the port is a
> recommended way of achieving my testing purposes.
>
> Also I would like to know other ways of testing apart from the ones
> compiled in the list below:
>
> 1. Stopping corosync.
> 2. Shutting down the node.
> 3. Shutting down the eth0 interface.
> 4. Killing corosync process.
> 5. Closing the corosync communication port.
[Pacemaker] Simulating that a node is down.
Hello,

I am looking for different ways of testing that a node is down. I am finding a strange behavior with one of them (closing with iptables the UDP communication port). I would like to know if closing the port is a recommended way of achieving my testing purposes.

Also I would like to know other ways of testing apart from the ones compiled in the list below:

1. Stopping corosync.
2. Shutting down the node.
3. Shutting down the eth0 interface.
4. Killing corosync process.
5. Closing the corosync communication port.

Thanks,

Jacobo García López de Araujo
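One possible source of odd behavior with method 5: according to the corosync.conf(5) man page, corosync uses two UDP ports, `mcastport` for receiving and `mcastport - 1` for sending. Dropping only one port in one direction can therefore produce a one-sided failure instead of a clean "node down". A minimal sketch that prints (but does not run) iptables rules covering both ports in both directions follows; `corosync_block_rules` is a hypothetical helper, and 694 is the port from the configuration posted elsewhere in this thread.

```shell
# corosync_block_rules PORT
# Hypothetical helper: prints iptables commands that drop UDP traffic on PORT
# and PORT-1, inbound and outbound. Nothing is executed here.
corosync_block_rules() {
    port=$1
    for p in "$port" "$((port - 1))"; do
        for chain in INPUT OUTPUT; do
            echo "iptables -A $chain -p udp --dport $p -j DROP"
        done
    done
}

corosync_block_rules 694
```

To apply the rules on a test node, the output could be piped to `sh` as root. The same rules with `-D` in place of `-A` remove them again.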
Re: [Pacemaker] Problem configuring a simple ping resource
Indeed, Lars, just editing the symmetric-cluster attribute and setting it to true made the resource work. This has also helped me to clarify the concepts of asymmetric and symmetric clusters.

Thanks,

Jacobo García

On Thu, Jun 27, 2013 at 5:10 PM, Lars Marowsky-Bree wrote:

> On 2013-06-27T17:01:37, Jacobo García wrote:
>
>> Enable asymmetric clustering
>> crm_attribute --attr-name symmetric-cluster --attr-value false
>>
>> Then I configure the resource:
>> crm configure primitive ping ocf:pacemaker:ping params host_list="10.34.151.73" op monitor interval=15s timeout=5s
>> WARNING: ping: default timeout 20s for start is smaller than the advised 60
>> WARNING: ping: specified timeout 5s for monitor is smaller than the advised 60
>>
>> This is what I get now:
>>
>> crm status
>>
>> Last updated: Wed Jun 26 16:49:20 2013
>> Last change: Wed Jun 26 16:48:01 2013 via cibadmin on ip-10-35-147-209
>> Stack: openais
>> Current DC: ip-10-35-147-209 - partition with quorum
>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>> 2 Nodes configured, 2 expected votes
>> 1 Resources configured.
>>
>> Online: [ ip-10-34-151-73 ip-10-35-147-209 ]
>>
>> crm resource list
>> ping (ocf::pacemaker:ping) Stopped
>
> You've configured the cluster so that no node is eligible to run
> resources unless you allow it (when you turned off symmetric
> clustering). So you need to provide explicit location constraints to
> allow the ping resource to run. (And you probably want to clone it.)
>
> Or, easiest, don't switch to asymmetric clustering. I don't think that's
> what you want.
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
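For completeness, the other option Lars describes (staying asymmetric, cloning the resource and allowing it explicitly on each node) could look roughly like the sketch below. It only prints crm shell commands instead of running them. `print_opt_in` is a hypothetical helper, the constraint and clone IDs are made-up names, and the score of 100 is an arbitrary positive value. None of this was tested against the Pacemaker 1.1.6 cluster in this thread.

```shell
# print_opt_in RESOURCE NODE...
# Hypothetical helper: prints crm shell commands that clone RESOURCE and add
# one positive location constraint per node, so the clone may run there.
print_opt_in() {
    rsc=$1
    shift
    echo "crm configure clone cl-$rsc $rsc"
    for node in "$@"; do
        echo "crm configure location allow-cl-$rsc-on-$node cl-$rsc 100: $node"
    done
}

print_opt_in ping ip-10-34-151-73 ip-10-35-147-209
```

The constraints reference the clone rather than the primitive, since that is the object being placed once the resource is cloned.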
[Pacemaker] Problem configuring a simple ping resource
Hello,

I am trying to configure a simple ping resource in order to start understanding the mechanics of pacemaker.

I have successfully configured corosync between 2 EC2 nodes in the same region, with the following configuration:

    logging {
        to_logfile: yes
        logfile: /var/log/corosync.log
    }

    totem {
        version: 2
        secauth: on
        interface {
            member {
                memberaddr: 10.35.147.209
            }
            member {
                memberaddr: 10.34.151.73
            }
            ringnumber: 0
            bindnetaddr: 10.35.147.209
            mcastport: 694
            ttl: 1
        }
        transport: udpu
        token: 1
    }

    service {
        # Load the Pacemaker Cluster Resource Manager
        ver: 0
        name: pacemaker
    }

The other node has the opposite ip addresses on memberaddr and bindnetaddr, and UDP port 694 is open in the Amazon Security Groups. Everything looks fine:

    crm status

    Last updated: Wed Jun 26 16:44:40 2013
    Last change: Wed Jun 26 16:44:32 2013 via crmd on ip-10-35-147-209
    Stack: openais
    Current DC: ip-10-35-147-209 - partition with quorum
    Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
    2 Nodes configured, 2 expected votes
    0 Resources configured.

    Online: [ ip-10-34-151-73 ip-10-35-147-209 ]

Afterwards:

Disable STONITH

    crm_attribute -t crm_config -n stonith-enabled -v false

Disable quorum

    crm_attribute -n no-quorum-policy -v ignore

Enable asymmetric clustering

    crm_attribute --attr-name symmetric-cluster --attr-value false

Then I configure the resource:

    crm configure primitive ping ocf:pacemaker:ping params host_list="10.34.151.73" op monitor interval=15s timeout=5s
    WARNING: ping: default timeout 20s for start is smaller than the advised 60
    WARNING: ping: specified timeout 5s for monitor is smaller than the advised 60

This is what I get now:

    crm status

    Last updated: Wed Jun 26 16:49:20 2013
    Last change: Wed Jun 26 16:48:01 2013 via cibadmin on ip-10-35-147-209
    Stack: openais
    Current DC: ip-10-35-147-209 - partition with quorum
    Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
    2 Nodes configured, 2 expected votes
    1 Resources configured.

    Online: [ ip-10-34-151-73 ip-10-35-147-209 ]

    crm resource list
    ping (ocf::pacemaker:ping) Stopped

    crm configure show
    node ip-10-34-151-73
    node ip-10-35-147-209
    primitive ping ocf:pacemaker:ping \
        params host_list="10.34.151.73" \
        op monitor interval="15s" timeout="5s"
    property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        symmetric-cluster="false"

In this link I have pasted the whole contents of the log of both servers: https://gist.github.com/therobot/c650149aa52e36a29ba6

And in this one I have pasted the output of running `cibadmin -Q`: https://gist.github.com/therobot/39e36bfb07839086c3db

Is there anything else needed to have this configured? Browsing the log, it seems that the resource pingd cannot run anywhere. Am I missing something?

Thanks in advance for the help,

Jacobo García