Re: [Pacemaker] Pacemaker and LDAP (389 Directory Service)

2011-06-27 Thread veghead
Serge Dubrouski writes:
> On Mon, Jun 27, 2011 at 3:33 PM, veghead wrote:
> If I remove the co-location, won't the elastic_ip resource just stay where it
> is? Regardless of what happens to LDAP?
> 
> Right. That's why I think that you don't really want to do it. You have 
> to make sure that your IP is up where your LDAP is up. 

Okay. So I took a step back and revamped the configuration to monitor the elastic_ip 
resource less frequently and with a longer timeout. I committed the changes, but 
"crm status" doesn't show the resources in question at all.

Here's the new config:

---snip---
# crm configure show
node $id="d2b294cf-328f-4481-aa2f-cc7b553e6cde" ldap1.example.ec2
node $id="e2a2e42e-1644-4f7d-8e54-71e1f7531e08" ldap2.example.ec2
primitive elastic_ip lsb:elastic-ip \
        op monitor interval="30" timeout="300" on-fail="ignore" requires="nothing"
primitive ldap lsb:dirsrv \
        op monitor interval="15s" on-fail="standby" requires="nothing"
clone ldap-clone ldap
colocation ldap-with-eip inf: elastic_ip ldap-clone
order ldap-after-eip inf: elastic_ip ldap-clone
property $id="cib-bootstrap-options" \
dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
stop-all-resources="true"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
---snip---

And here's the output from "crm status":

---snip---
# crm status

Last updated: Mon Jun 27 18:50:14 2011
Stack: Heartbeat
Current DC: ldap2.example.ec2 (e2a2e42e-1644-4f7d-8e54-71e1f7531e08) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
2 Resources configured.


Online: [ ldap1.example.ec2 ldap2.example.ec2 ]
---snip---
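
Side note: the config above still carries stop-all-resources="true", which on its own keeps every resource stopped regardless of the constraints. Assuming that flag was only left over from testing, a sketch of clearing it and re-probing with the stock crm shell:

---snip---
crm configure property stop-all-resources=false
crm resource cleanup elastic_ip
crm resource cleanup ldap-clone
crm_mon -1
---snip---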

I restarted the nodes one at a time - first I restarted ldap2, then I restarted ldap1. 
When ldap1 went down, ldap2 stopped the ldap resource and didn't make any attempt to 
start the elastic_ip resource:

---snip---
pengine: [12910]: notice: unpack_config: On loss of CCM Quorum: Ignore
pengine: [12910]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
pengine: [12910]: info: determine_online_status: Node ldap2.example.ec2 is online
pengine: [12910]: notice: native_print: elastic_ip   (lsb:elastic-ip):   Stopped
pengine: [12910]: notice: clone_print:  Clone Set: ldap-clone
pengine: [12910]: notice: short_print:  Stopped: [ ldap:0 ldap:1 ]
pengine: [12910]: notice: LogActions: Leave   resource elastic_ip (Stopped)
pengine: [12910]: notice: LogActions: Leave   resource ldap:0 (Stopped)
pengine: [12910]: notice: LogActions: Leave   resource ldap:1 (Stopped)
---snip---

After heartbeat/pacemaker came back up on ldap1, it terminated the ldap service 
on ldap1. Now I'm just confused.




Re: [Pacemaker] Pacemaker and LDAP (389 Directory Service)

2011-06-27 Thread veghead
Sorry for the questions. Some days my brain is just slow. :)

Serge Dubrouski writes:
> If you want to make your LDAP independent from IP just remove your
> colocation:
> colocation ldap-with-eip inf: elastic_ip ldap-clone

Is that really what I want to do? I mean, I need the elastic ip assigned to 
~one~ of the machines... And if LDAP fails on that machine, I need Pacemaker to 
start the Elastic IP on the other machine.

If I remove the co-location, won't the elastic_ip resource just stay where it 
is? Regardless of what happens to LDAP?
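
For reference, the colocation constraint is directional: in "colocation ldap-with-eip inf: elastic_ip ldap-clone" the first resource (elastic_ip) is the one being placed, and it may only be placed on a node where the second (an instance of ldap-clone) is running. A minimal sketch of what keeping or dropping it means, using the names from the config already posted in this thread:

---snip---
# keep: elastic_ip may only run where a ldap-clone instance is running,
# so losing LDAP on one node pushes the IP to the other node
colocation ldap-with-eip inf: elastic_ip ldap-clone
# drop it, and elastic_ip is placed independently of where LDAP is healthy
---snip---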

> But I'd rather try to find out why monitoring for IP fails. Maybe
> it just needs an increased timeout on the monitor operation, though it
> looks like you've already increased it. What's in your log files
> when that monitor fails?

Originally, I had the monitor on the elastic_ip resource set to 10 seconds. The 
error in the logs was:

---snip---
pengine: [16980]: notice: unpack_rsc_op: Operation elastic_ip_monitor_0 found resource elastic_ip active on ldap1.example.ec2
pengine: [16980]: WARN: unpack_rsc_op: Processing failed op elastic_ip_monitor_1 on ldap1.example.ec2: unknown exec error (-2)
pengine: [16980]: WARN: unpack_rsc_op: Processing failed op elastic_ip_stop_0 on ldap1.example.ec2: unknown exec error (-2)
pengine: [16980]: info: native_add_running: resource elastic_ip isn't managed
pengine: [16980]: notice: unpack_rsc_op: Operation ldap:1_monitor_0 found resource ldap:1 active on ldap2.example.ec2
pengine: [16980]: WARN: unpack_rsc_op: Processing failed op elastic_ip_start_0 on ldap2.example.ec2: unknown exec error (-2)
pengine: [16980]: notice: native_print: elastic_ip   (lsb:elastic-ip):   Started ldap1.example.ec2 (unmanaged) FAILED
pengine: [16980]: notice: clone_print:  Clone Set: ldap-clone
pengine: [16980]: notice: short_print:  Stopped: [ ldap:0 ldap:1 ]
pengine: [16980]: info: get_failcount: elastic_ip has failed INFINITY times on ldap1.example.ec2
pengine: [16980]: WARN: common_apply_stickiness: Forcing elastic_ip away from ldap1.example.ec2 after 100 failures (max=100)
pengine: [16980]: info: get_failcount: elastic_ip has failed INFINITY times on ldap2.example.ec2
pengine: [16980]: WARN: common_apply_stickiness: Forcing elastic_ip away from ldap2.example.ec2 after 100 failures (max=100)
pengine: [16980]: info: native_color: Unmanaged resource elastic_ip allocated to 'nowhere': failed
pengine: [16980]: notice: RecurringOp:  Start recurring monitor (15s) for ldap:0 on ldap1.example.ec2
pengine: [16980]: notice: RecurringOp:  Start recurring monitor (15s) for ldap:1 on ldap2.example.ec2
pengine: [16980]: notice: LogActions: Leave   resource elastic_ip (Started unmanaged)
pengine: [16980]: notice: LogActions: Start   ldap:0 (ldap1.example.ec2)
pengine: [16980]: notice: LogActions: Start   ldap:1 (ldap2.example.ec2)
---snip---
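
As an aside, once a resource has an INFINITY failcount like the one above, it stays banned from that node until the failcount is cleared; a sketch of clearing it, assuming the standard crm shell on this 1.0 cluster:

---snip---
crm resource cleanup elastic_ip
# or scoped to a single node:
crm resource cleanup elastic_ip ldap1.example.ec2
---snip---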

Now that I have set the monitor interval for the elastic_ip resource to "0", it 
keeps thinking everything is either stopped or should be stopped:

---snip---
pengine: [7287]: notice: unpack_rsc_op: Operation elastic_ip_monitor_0 found resource elastic_ip active on ldap1.example.ec2
pengine: [7287]: notice: unpack_rsc_op: Operation ldap:0_monitor_0 found resource ldap:0 active on ldap2.example.ec2
pengine: [7287]: notice: native_print: elastic_ip (lsb:elastic-ip):   Stopped
pengine: [7287]: notice: clone_print:  Clone Set: ldap-clone
pengine: [7287]: notice: short_print:  Stopped: [ ldap:0 ldap:1 ]
pengine: [7287]: notice: LogActions: Leave   resource elastic_ip  (Stopped)
pengine: [7287]: notice: LogActions: Leave   resource ldap:0  (Stopped)
pengine: [7287]: notice: LogActions: Leave   resource ldap:1  (Stopped)
---snip---

Very strange.




Re: [Pacemaker] Pacemaker and LDAP (389 Directory Service)

2011-06-27 Thread veghead
veghead writes:
> Pair of LDAP servers running 389 (formerly Fedora DS) in 
> high availability using Pacemaker with a floating IP.
> In addition, 389 supports multi-master replication, 
> where all changes on one node are automatically 
> replicated on one or more other nodes.

I'm so close, but I'm still having issues. I'm running these on EC2 using an 
ElasticIP as the "floating" IP. Unfortunately, I have found that requests for 
the status of the ElasticIP occasionally fail for no apparent reason, even 
though the ElasticIP is actually working fine. Once they fail, that triggers a 
failover and creates a mess.

What I'd like to do is:

* Run LDAP service on both nodes
* Ignore the status of the ElasticIP resource and only trigger a failover when 
the LDAP service fails.

I feel like my config is close, but the cluster keeps wanting to stop the 
resources.

Here's my current config:

---snip---
primitive elastic_ip lsb:elastic-ip \
        op monitor interval="0" timeout="300" on-fail="ignore" requires="nothing"
primitive ldap lsb:dirsrv \
        op monitor interval="15s" on-fail="standby" requires="nothing"
clone ldap-clone ldap
colocation ldap-with-eip inf: elastic_ip ldap-clone
order ldap-after-eip inf: elastic_ip ldap-clone
property $id="cib-bootstrap-options" \
dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
stop-all-resources="true"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
---snip---
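
For comparison, a variant that keeps a recurring monitor but tells Pacemaker not to act on its failures (essentially what the revised config quoted at the top of this thread does) would look something like:

---snip---
primitive elastic_ip lsb:elastic-ip \
        op monitor interval="30" timeout="300" on-fail="ignore" requires="nothing"
---snip---

whereas interval="0", as above, defines no recurring monitor at all, so only the initial probe ever checks the resource.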

Any suggestions as to what I'm doing wrong?





Re: [Pacemaker] Pacemaker and LDAP (389 Directory Service)

2011-06-08 Thread veghead
Dejan Muhamedagic writes:
> lsb:dirsrv doesn't understand master/slave. That's OK, none of the
> LSB agents do. You can only try to use clones (clone ldap-clone
> ldap ...).

That worked perfectly. I was getting master/slave and basic 
clone stuff mixed up.  Thanks!





[Pacemaker] Pacemaker and LDAP (389 Directory Service)

2011-06-07 Thread veghead
I'm trying to setup a pair of LDAP servers running 389 (formerly Fedora DS) in 
high availability using Pacemaker with a floating IP. In addition, 389 supports 
multi-master replication, where all changes on one node are automatically 
replicated on one or more other nodes.

I'm fairly close to having everything working. Failover works just fine, and 
multi-master replication works fine. However, my current Pacemaker config stops 
the directory service on the non-active node, which means that the backup node 
is not receiving replication data from the other node.

What is the right way to setup Pacemaker so that:

1) LDAP directory services are always running on both nodes
2) Floating IP is assigned to one of the nodes
3) Failover occurs if the master node dies or LDAP service stops running on the 
master

Initially, my Pacemaker config looked like the following:

---snip---
property stonith-enabled=false
property no-quorum-policy=ignore

rsc_defaults resource-stickiness=100

primitive elastic_ip lsb:elastic-ip op monitor interval="10s"
primitive dirsrv lsb:dirsrv op monitor interval="10s"
order dirsrv-after-eip inf: elastic_ip dirsrv
colocation dirsrv-with-eip inf: dirsrv elastic_ip
---snip---

I then explored using Pacemaker clones:

---snip---
property stonith-enabled=false
property no-quorum-policy=ignore

rsc_defaults resource-stickiness=100

primitive elastic_ip lsb:elastic-ip op monitor interval="10s"
primitive ldap lsb:dirsrv op monitor interval="15s" role="Slave" timeout="10s" \
        op monitor interval="16s" role="Master" timeout="10s"

ms ldap-clone ldap meta master-max=1 master-node-max=1 clone-max=3 \
        clone-node-max=1 notify=true

colocation ldap-with-eip inf: elastic_ip ldap-clone:Master
order eip-after-promote inf: ldap-clone:promote elastic_ip:start
order ldap-after-eip inf: elastic_ip ldap-clone
---snip---

Unfortunately, that doesn't quite work. pengine complains that "ldap-clone: 
Promoted 0 instances of a possible 1 to master" and then stops the LDAP service. 
I'm sure I'm missing something simple... any suggestions would be greatly 
appreciated.
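
For what it's worth, the follow-ups in this thread converge on dropping the master/slave layer entirely and using a plain clone, since lsb:dirsrv has no notion of promotion. A minimal sketch of that variant, matching the config shown in the later messages:

---snip---
primitive elastic_ip lsb:elastic-ip op monitor interval="10s"
primitive ldap lsb:dirsrv op monitor interval="15s"
clone ldap-clone ldap
colocation ldap-with-eip inf: elastic_ip ldap-clone
order ldap-after-eip inf: elastic_ip ldap-clone
---snip---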




Re: [Pacemaker] Automating Pacemaker Setup

2011-05-31 Thread veghead
Dejan Muhamedagic writes:
> On Fri, May 27, 2011 at 08:21:08PM +0000, veghead wrote:
> > 1) Is there a way to force crm to accept my configuration request 
> > ~before~ starting the second node?
> 
> Not before the DC is elected. There are two settings, dc-deadtime
> and startup-fencing, which can reduce the time for DC election.
> Note that disabling startup fencing is not recommended. But I
> don't know what your use case is. YMMV.

Well, I'm probably not quite the typical use case. We're using Amazon EC2 to 
set up and tear down testing environments. I have automated the entire process 
except for setting up Pacemaker. Beyond testing environments, I'd like to 
automate Pacemaker setup to cover the scenario where all nodes in a Pacemaker 
cluster crash and the entire configuration is lost.

Obviously, once one node is running, setting up additional nodes becomes easy. 
It's just the bootstrap phase that's a challenge to automate.

> > 2) Is there a way to tell Pacemaker to ignore quorum requirements
> > ~before~ starting additional nodes?
> > 
> > 3) Is there an alternate way to configure Pacemaker?
> 
> Yes, you can modify the CIB _before_ starting pacemaker. Something
> like:
> 
> CIB_file=/var/lib/heartbeat/crm/cib.xml crm configure ...
> 
> But in that case you need to remove cib.xml.sig. Then you have to
> make sure that pacemaker starts first on this node. Consider this
> only if everything else fails.
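
Spelled out, that offline edit might look like the following sketch, assuming Heartbeat's default CIB location and that heartbeat/pacemaker is not yet running on the node:

# with the cluster stack stopped on this node
rm -f /var/lib/heartbeat/crm/cib.xml.sig
CIB_file=/var/lib/heartbeat/crm/cib.xml crm configure < myconfigure.txt
# then make sure heartbeat starts first on this node, per the advice above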

I'll give that a shot.

Thanks.

-S




Re: [Pacemaker] Automating Pacemaker Setup

2011-05-27 Thread veghead
veghead writes:
> Todd Nine writes:
> Wow. The example pacemaker config and the trick of starting
> heartbeat before using crm configure were the last steps I needed.

Bah. So close. But I still don't have it completely automated.

If I start heartbeat on the first node and then run:

crm configure < myconfigure.txt

That fails. If I start heartbeat on the second node and wait for the 
two nodes to connect to each other (so that we have a quorum), then I 
can run "crm configure" and it works.
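
In script form, the sequence that does work ends up roughly like this sketch (the wait loop is the crude part and assumes crm_mon only reports a real "Current DC" once one has been elected):

# on both nodes: start the stack
/etc/init.d/heartbeat start

# on either node: wait for a DC to be elected, then load the config
while ! crm_mon -1 2>/dev/null | grep 'Current DC:' | grep -qv NONE; do
    sleep 5
done
crm configure < myconfigure.txt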

So that leaves me with a couple of questions:

1) Is there a way to force crm to accept my configuration request 
~before~ starting the second node?

2) Is there a way to tell Pacemaker to ignore quorum requirements
~before~ starting additional nodes?

3) Is there an alternate way to configure Pacemaker?

-Sean




Re: [Pacemaker] Automating Pacemaker Setup

2011-05-27 Thread veghead
Todd Nine writes:

>   I have a setup nearly working.  Would you be willing to share recipes?
...
> It's not quite working yet, but it's close.  Since you've managed to get
> this working, you may be able to finish these off.  I have everything
> working except the init start/stop hooks for pacemaker to set the
> Elastic IP automatically and then run chef-client to reconfigure
> everything on all the other nodes.

Wow. The example pacemaker config and the trick of starting heartbeat before 
using crm configure were the last steps I needed. Thanks!

So, here's how I got Elastic IP failover working. I can't claim credit for the 
idea... I found a basic example here: 
https://forums.aws.amazon.com/thread.jspa?messageID=195373. That didn't quite 
work for me, so I rewrote the LSB script in pure Ruby and leveraged the 
amazon-ec2 gem (https://github.com/grempe/amazon-ec2) to handle associating the 
EIP with the current instance. I have included my script below. A couple of key 
things:

First, I found that when an instance loses its Elastic IP (whether through 
"disassociate" or because another instance grabbed the EIP), it loses public 
internet connectivity for 1-3 minutes. Apparently this is expected, according to 
AWS Support: https://forums.aws.amazon.com/message.jspa?messageID=250571#250571. 
As a result, I decided it didn't make any sense to have the "stop" method for 
the EIP LSB script do anything.

Second, my Pacemaker config is pretty close to yours. I set up the nodes with 
ucast in almost the exact same manner. The key differences are all in setting up 
the primitives with the correct order and colocation:

primitive elastic_ip lsb:elastic-ip op monitor interval="10s"
primitive haproxy lsb:haproxy op monitor interval="10s"
order haproxy-after-eip inf: elastic_ip haproxy
colocation haproxy-with-eip inf: haproxy elastic_ip

Third, here's my elastic-ip.rb LSB script that handles the Elastic IP. Since LSB 
scripts can't take any parameters other than the usual start/stop/status/etc., I 
treat the script as a Chef template and inject the desired EIP into the 
template. The other secret is that I created a special user using AWS IAM with a 
policy that only allows the user to associate/disassociate EIP addresses. I 
store the AccessKey and SecretAccessKey in a file in /etc/aws/pacemaker_keys. 
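
For completeness, that keys file is just plain KEY=value lines of the kind the script below parses (the values here are obviously placeholders, not real credentials):

# /etc/aws/pacemaker_keys -- readable by root only
AWS_ACCESS_KEY="AKIAXXXXXXXXXXXXXXXX"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"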

Let me know if you have any questions. And thanks for the tip on using crm to 
initialize pacemaker.

#!/usr/bin/ruby

# Follows the LSB Spec: http://refspecs.freestandards.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

require 'rubygems'
require 'AWS'

ELASTIC_IP="<%= @elastic_ip %>"
EC2_INSTANCE_ID=`wget -T 5 -q -O - http://169.254.169.254/latest/meta-data/instance-id`

# Load the AWS access keys
properties = {}
File.open("/etc/aws/pacemaker_keys", 'r') do |file|
  file.read.each_line do |line|
    line.strip!
    if (line[0] != ?# and line[0] != ?=)
      i = line.index('=')
      if (i)
        properties[line[0..i - 1].strip] = line[i + 1..-1].strip
      else
        properties[line] = ''
      end
    end
  end
end
AWS_ACCESS_KEY = properties["AWS_ACCESS_KEY"].delete "\""
AWS_SECRET_ACCESS_KEY = properties["AWS_SECRET_ACCESS_KEY"].strip.delete "\""

[ ELASTIC_IP, EC2_INSTANCE_ID, AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY ].each do |value|
  if value.nil? || value.length == 0
    exit case ARGV[0]
      when "status" then 4
      else 1
    end
  end
end

def status(ec2)
  # Typical responses look like the following:
  #   {"requestId"=>"065d1661-31b1-455d-8f63-ba086b8104de", "addressesSet"=>{"item"=>[{"instanceId"=>"i-22e93a4d", "publicIp"=>"50.19.93.215"}]}, "xmlns"=>"http://ec2.amazonaws.com/doc/2010-08-31/"}
  # or
  #   {"requestId"=>"9cd3ab7e-1c03-4821-9565-1791dd1bb0fc", "addressesSet"=>{"item"=>[{"instanceId"=>nil, "publicIp"=>"174.129.34.161"}]}, "xmlns"=>"http://ec2.amazonaws.com/doc/2010-08-31/"}
  response = ec2.describe_addresses({:public_ip => ELASTIC_IP})
  retval = 4
  if ! response.nil?
    if ! response["addressesSet"].nil?
      if ! response["addressesSet"]["item"].nil? && response["addressesSet"]["item"].length >= 1
        if response["addressesSet"]["item"][0]["instanceId"] == EC2_INSTANCE_ID
          retval = 0
        else
          retval = 3
        end
      end
    end
  end
  retval
end

def start(ec2)
  # Throws an exception if the instance does not exist or the address does not belong to us
  retval = 1
  begin
    response = ec2.associate_address({ :public_ip => ELASTIC_IP, :instance_id => EC2_INSTANCE_ID })
    retval = 0
  rescue => e
    puts "Error attempting to associate address: " + e.message
  end
  retval
end

def stop(ec2)
  0
end

def reload(ec2)
  start(ec2)
end

def force_reload(ec2)
  reload(ec2)
end

def restart(ec2)
  start(ec2)
end

def try_restart(ec2)
  start(ec2)
end

ec2 = AWS::EC2::Base.new(:access_key_id => AWS_ACCESS_KEY, :secret_access_key => AWS_SECRET_ACCESS_KEY)

retval = case ARGV[0]
  when "status" then status(ec2)
  when "start" then start(ec2)
  when "stop" then stop(ec2)
  when "restart" then restart(ec2)
  when "try-restart" then try_restart(ec2)
  when "reload" then reload(ec2)
  when "force-reload" then force_reload(ec2)
  else 2  # LSB: invalid or excess arguments
end
exit retval

[Pacemaker] Automating Pacemaker Setup

2011-05-26 Thread veghead
Does anyone have any links to documentation on automating Pacemaker setup with 
tools like Chef or Puppet?

I have a working two node cluster for HAProxy on EC2, but I'd like to fully 
automate the setup process for future nodes, specifically so I can spin up a 
new instance without any manual intervention.

Thanks.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker