Re: [Pacemaker] Node recover causes resource to migrate

2013-07-25 Thread Jacobo García
Hello,

As suggested I'm trying to add the nodes via corosync-objctl.

My current config file is this one:
https://gist.github.com/therobot/4327cd0a2598d1d6bb93 using 5001 as the
nodeid on the second node.

Then I try to add the nodes with the following commands:

corosync-objctl -n totem.interface.member.memberaddr=10.34.191.212
corosync-objctl -n runtime.totem.pg.mrp.srp.members.5001.join_count=1
corosync-objctl -n runtime.totem.pg.mrp.srp.members.5000.status=joined
corosync-objctl -n runtime.totem.pg.mrp.srp.members.5001.ip=r(0) ip(10.34.191.212)


(I have a problem because the space between r(0) and ip is not respected by
corosync-objctl)
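
On the space problem: the shell splits an unquoted argument at whitespace before corosync-objctl ever sees it, so the value probably has to be quoted. A minimal sketch of the quoting follows. show_args is a hypothetical stand-in for corosync-objctl that only prints the arguments it receives, and the -n flag is simply the one used in the commands above. Whether corosync-objctl then accepts a write to a runtime.* key is a separate question that this does not answer.

```shell
# Hypothetical stand-in for corosync-objctl: prints each argument it
# receives, so the effect of quoting can be seen without touching a cluster.
show_args() {
  for arg in "$@"; do printf '[%s]\n' "$arg"; done
}

key='runtime.totem.pg.mrp.srp.members.5001.ip'
value='r(0) ip(10.34.191.212)'

# Quoted: the whole key=value pair arrives as one argument, space included.
show_args -n "${key}=${value}"
```

If the second printed line shows the full key=value pair inside one pair of brackets, the same quoting can be tried on the real command.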

Am I on the right track? Should I abandon my quest to have an HA solution
on EC2 public network?

Thanks,



Jacobo García López de Araujo
http://thebourbaki.com | http://twitter.com/clapkent


On Wed, Jul 24, 2013 at 2:43 PM, Andrew Beekhof wrote:

>
> On 24/07/2013, at 10:09 PM, Lars Marowsky-Bree wrote:
>
> > On 2013-07-24T21:40:40, Andrew Beekhof wrote:
> >
> >>> Statically assigned nodeids?
> >> Wouldn't hurt, but you still need to bring down the still-active node
> >> to get it to talk to the new node.
> >> Which sucks
> >
> > Hm. But ... corosync/pacemaker ought to identify the node via the
> > nodeid. If it comes back with a different IP address, that shouldn't be
> > a problem.
> >
> > Oh. *thud* Just realized that it's bound to be one for unicast
> > communications, not so much mcast.
>
> Exactly.
>
> > Seems we may need some corosync magic
> > commands to edit the nodelist at runtime. (Or is that already possible
> > and I just don't know how? ;-)
>
> I believe it might be possible - I just don't know it.
> Might even be better to have it happen automagically - after all, the new
> node knows the existing node's address.
>
> But good luck getting that one through.
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

Re: [Pacemaker] Node recover causes resource to migrate

2013-07-24 Thread Jacobo García
Thanks for your answers.

Is it possible to configure corosync/pacemaker in this scenario so that if a
node goes down, instead of bringing that node back, I can build a new one and
add it to the cluster? Building new nodes is almost "free" for me. Also,
what's the difference between bringing up a new node from scratch and
restarting the node that went down? How is this treated in a normal scenario
(i.e. you want to add a third node to a cluster)?

Side note: I have decided not to use Amazon VPC for several reasons, which
can be summarized as too much hassle to set up. Even the simplest
configuration would involve an HA solution for the router between the
two networks.

Jacobo García López de Araujo
http://thebourbaki.com | http://twitter.com/clapkent


On Wed, Jul 24, 2013 at 11:26 AM, Lars Marowsky-Bree wrote:

> On 2013-07-24T09:00:23, Andrew Beekhof wrote:
>
> > > 4. Node A is back with a different internal ip address.
> >
> > This is your basic problem.
> >
> > I don't believe there is any cluster software that is designed to
> > support this sort of scenario.
> > Even at the corosync level, it has no knowledge that this is the same
> > machine that left.
>
> Statically assigned nodeids?
>
>
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
> Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
>


[Pacemaker] Node recover causes resource to migrate

2013-07-22 Thread Jacobo García
Hello,

I'm trying to debug the following scenario:

My cluster setup is two AWS instances providing an Elastic IP as an HA
resource. I have written a custom resource script to manage it; the
script passes all the tests specified in ocf-tester and behaves properly
in other test scenarios.

In this link[1] you can find the corosync configuration file of node A and
the pacemaker configuration as well. The corosync configuration of node B
is reciprocal.

But I am finding a lot of behavior that I am not able to understand in the
following situation:

1. Both nodes are up and node A has the Elastic IP
2. I turn off node A through AWS API.
3. After some time (around 2-3 mins) the Elastic IP moves to node B.

The strange behavior starts when I try to bring node A back into the
cluster:

4. Node A is back with a different internal IP address.
5. I update the IP address on node A: `crm configure edit`
6. I add the IP address of node A on node B: `crm configure edit`
(deleting the old ip of node B is not possible).
7. I update the IP address in the corosync configuration of node A.
8. I restart corosync on node A.
9. The IP migrates from node B to node A because node B appears offline to
node A. This is because node B has not had its corosync config updated, as I
don't want to restart corosync on the node that is providing the IP.
10. At this point node A only sees itself and node B only sees itself.

Here are the log files of both nodes[2].

How can I bring a node back without it moving the IP address back?

Thanks in advance.

[1] https://gist.github.com/therobot/67bae18b5acb0ba78925
[2] https://gist.github.com/therobot/ef85646d6e0092e494c4



Jacobo García López de Araujo
http://thebourbaki.com | http://twitter.com/clapkent


Re: [Pacemaker] Simulating that a node is down.

2013-07-12 Thread Jacobo García
Thanks Andreas for your kind answer, I'll add this to my test battery.

Also, my other question: is it a good idea to close the corosync port?
Should corosync behave in an expected way? I am getting odd behavior with
this one, but I am not sure where to put the blame.

Thanks in advance.

Jacobo García López de Araujo
http://thebourbaki.com | http://twitter.com/clapkent


On Thu, Jul 11, 2013 at 8:39 PM, Andreas Mock wrote:

> Hi Jacobo,
>
> One very interesting thing is missing: overload the node. Write a
> program or script that generates many I/O operations and many flushes
> while requesting more and more memory from the OS until swapping begins.
> Ohhh, yes, swapping and I/O is nice…
>
> …then you can prove your monitor and stop action timeouts…  ;-)
>
> Best regards
> Andreas Mock
>
> From: Jacobo García [mailto:jacobo.gar...@gmail.com]
> Sent: Thursday, 11 July 2013 19:14
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] Simulating that a node is down.
>
> Hello,
>
> I am looking for different ways of testing that a node is down. I am
> seeing strange behavior with one of them (closing the UDP communication
> port with iptables). I would like to know if closing the port is a
> recommended way of achieving my testing purposes.
>
> Also I would like to know other ways of testing apart from the ones
> compiled in the list below:
>
> 1. Stopping corosync.
> 2. Shutting down the node.
> 3. Shutting down the eth0 interface.
> 4. Killing the corosync process.
> 5. Closing the corosync communication port.
>
> Thanks,
>
> Jacobo García López de Araujo
>


[Pacemaker] Simulating that a node is down.

2013-07-11 Thread Jacobo García
Hello,

I am looking for different ways of testing that a node is down. I am
seeing strange behavior with one of them (closing the UDP communication
port with iptables). I would like to know if closing the port is a
recommended way of achieving my testing purposes.

Also I would like to know other ways of testing apart from the ones
compiled in the list below:


   1. Stopping corosync.
   2. Shutting down the node.
   3. Shutting down the eth0 interface.
   4. Killing corosync process.
   5. Closing the corosync communication port.
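
For item 5, this is roughly what I mean by closing the port, as a minimal sketch. The port number 694 is an assumption (it should match whatever mcastport is set to in corosync.conf), and run is a hypothetical dry-run helper that only prints the commands unless DRY_RUN=0 is set. Dropping traffic in both directions is a choice made for this sketch, not a recommendation.

```shell
# Sketch only: PORT is an assumption; match it to mcastport in corosync.conf.
PORT="${PORT:-694}"

# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Drop corosync UDP traffic in both directions to simulate a network failure.
block_corosync() {
  run iptables -A INPUT  -p udp --dport "$PORT" -j DROP
  run iptables -A OUTPUT -p udp --dport "$PORT" -j DROP
}

# Delete the same two rules again to end the test.
unblock_corosync() {
  run iptables -D INPUT  -p udp --dport "$PORT" -j DROP
  run iptables -D OUTPUT -p udp --dport "$PORT" -j DROP
}

block_corosync
```

Running it as is only prints the two iptables commands; with DRY_RUN=0 and root privileges it would apply them, and unblock_corosync removes them afterwards.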

Thanks,

Jacobo García López de Araujo


Re: [Pacemaker] Problem configuring a simple ping resource

2013-06-27 Thread Jacobo García
Indeed, Lars: just setting the symmetric-cluster attribute to true made
the resource work. This has also helped me clarify the concepts of
asymmetric and symmetric clusters.
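
For the archive, a minimal sketch of the two options discussed in this thread. The first command is the one from my original mail with the value flipped. The second option follows the suggestion to stay asymmetric, clone the resource and add explicit location constraints; the ids ping-clone, ping-on-73 and ping-on-209 and the score of 100 are made up for illustration, and I have not run that variant. run is a hypothetical dry-run helper that only prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Option 1 (what worked here): return to a symmetric cluster,
# where every node is eligible to run resources.
make_symmetric() {
  run crm_attribute --attr-name symmetric-cluster --attr-value true
}

# Option 2 (untested assumption): stay asymmetric, clone the ping resource
# and explicitly allow the clone on each node. Ids and score are made up.
allow_ping_on_nodes() {
  run crm configure clone ping-clone ping
  run crm configure location ping-on-73 ping-clone 100: ip-10-34-151-73
  run crm configure location ping-on-209 ping-clone 100: ip-10-35-147-209
}

make_symmetric
```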

Thanks,

Jacobo García

On Thu, Jun 27, 2013 at 5:10 PM, Lars Marowsky-Bree wrote:
> On 2013-06-27T17:01:37, Jacobo García wrote:
>
>> Enable asymmetric clustering
>> crm_attribute --attr-name symmetric-cluster --attr-value false
>>
>> Then I configure the resource:
>> crm configure primitive ping ocf:pacemaker:ping params \
>>     host_list="10.34.151.73" op monitor interval=15s timeout=5s
>> WARNING: ping: default timeout 20s for start is smaller than the advised 60
>> WARNING: ping: specified timeout 5s for monitor is smaller than the advised 60
>>
>> This is what I get now:
>>
>> crm status
>> 
>> Last updated: Wed Jun 26 16:49:20 2013
>> Last change: Wed Jun 26 16:48:01 2013 via cibadmin on ip-10-35-147-209
>> Stack: openais
>> Current DC: ip-10-35-147-209 - partition with quorum
>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>> 2 Nodes configured, 2 expected votes
>> 1 Resources configured.
>> 
>>
>> Online: [ ip-10-34-151-73 ip-10-35-147-209 ]
>>
>> crm resource list
>> ping (ocf::pacemaker:ping) Stopped
>
> You've configured the cluster so that no node is eligible to run
> resources unless you allow it (when you turned off symmetric
> clustering). So you need to provide explicit location constraints to
> allow the ping resource to run. (And you probably want to clone it.)
>
> Or, easiest, don't switch to asymmetric clustering. I don't think that's
> what you want.
>
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
>


[Pacemaker] Problem configuring a simple ping resource

2013-06-27 Thread Jacobo García
Hello

I am trying to configure a simple ping resource in order to start
understanding the mechanics of pacemaker.

I have successfully configured corosync between 2 EC2 nodes in the same
region, with the following configuration:

logging {
  to_logfile: yes
  logfile: /var/log/corosync.log
}

totem {
  version: 2
  secauth: on
  interface {
    member {
      memberaddr: 10.35.147.209
    }
    member {
      memberaddr: 10.34.151.73
    }
    ringnumber: 0
    bindnetaddr: 10.35.147.209
    mcastport: 694
    ttl: 1
  }
  transport: udpu
  token: 1
}

service {
  # Load the Pacemaker Cluster Resource Manager
  ver:   0
  name:  pacemaker
}

The other node has the mirrored IP addresses in memberaddr and
bindnetaddr. UDP port 694 is open in the Amazon security groups, and
everything looks fine:

crm status

Last updated: Wed Jun 26 16:44:40 2013
Last change: Wed Jun 26 16:44:32 2013 via crmd on ip-10-35-147-209
Stack: openais
Current DC: ip-10-35-147-209 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
0 Resources configured.


Online: [ ip-10-34-151-73 ip-10-35-147-209 ]

Afterwards:

Disable STONITH
crm_attribute -t crm_config -n stonith-enabled -v false

Disable quorum
crm_attribute -n no-quorum-policy -v ignore

Enable asymmetric clustering
crm_attribute --attr-name symmetric-cluster --attr-value false

Then I configure the resource:
crm configure primitive ping ocf:pacemaker:ping params \
    host_list="10.34.151.73" op monitor interval=15s timeout=5s
WARNING: ping: default timeout 20s for start is smaller than the advised 60
WARNING: ping: specified timeout 5s for monitor is smaller than the advised 60

This is what I get now:

crm status

Last updated: Wed Jun 26 16:49:20 2013
Last change: Wed Jun 26 16:48:01 2013 via cibadmin on ip-10-35-147-209
Stack: openais
Current DC: ip-10-35-147-209 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ ip-10-34-151-73 ip-10-35-147-209 ]

crm resource list
ping (ocf::pacemaker:ping) Stopped

crm configure show
node ip-10-34-151-73
node ip-10-35-147-209
primitive ping ocf:pacemaker:ping \
        params host_list="10.34.151.73" \
        op monitor interval="15s" timeout="5s"
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        symmetric-cluster="false"


In this link I have pasted the whole contents of the log of both
servers: https://gist.github.com/therobot/c650149aa52e36a29ba6
And in this one I have pasted the output of running `cibadmin -Q`:
https://gist.github.com/therobot/39e36bfb07839086c3db

Is there anything else I need to configure? Browsing the log, it
seems that the ping resource cannot run anywhere. Am I missing
something?

Thanks in advance for the help,

Jacobo García
