On Sat, Nov 20, 2010 at 8:02 AM, Andrew Miklas <and...@pagerduty.com> wrote:
> Hi all,
>
> I'm trying to use Pacemaker on a Amazon Web Services' EC2 to
> automatically reassign elastic IPs (Amazon's equivalent to floating or
> virtual IPs) in the event of a node failure. The setup I'm testing
> with is two elastic IPs which will be assigned to any pair of hosts in
> a three node cluster. The full config is below for reference.
>
> Things usually work very well -- the IP of a failed machine is usually
> automatically reassigned to another host. However, sometimes the
> cluster seems to get stuck trying to "start" the IP resource on the
> other host.
>
> When this happens, crm_mon will show a line something like this:
> elastic-ip-A (ocf::ec2:elastic-ip): Started test1 (unmanaged) FAILED
>
> After a bit of time, it will change to:
> elastic-ip-A (ocf::ec2:elastic-ip): Stopped
>
> Once the system hits this state, the failed resource will remain
> stopped until I do a "crm_resource -P", at which point the elastic IP
> will be assigned to a node and started.
>
>
> I suspect what's happening here is that the start action in the
> elastic IP resource agent sometimes times out. Rather than
> arbitrarily increasing the timeout value, though, I'd rather Pacemaker
> simply retry failed start requests.

What do you think you gain by not increasing the timeout?
The timeout is only an upper limit: we don't sit around doing nothing
if the action completes in only a fraction of the allocated time.

In any case, if you really want retries, check out the
start-failure-is-fatal option (man pengine).

> I've tried setting monitor operations and the "failure-timeout"
> parameter on the elastic IP resources, but this doesn't seem to fix
> the problem.

Did you set cluster-recheck-interval appropriately? (man crmd)

> Any ideas? Has anyone had success getting Pacemaker to reliably
> switch elastic IPs on EC2?
>
> Thanks,
>
> Andrew
>
> =========
>
> crm configure show
> node $id="8780681b-f79f-47c5-9f92-ad3fc7a3584e" test3 \
>     attributes class="www"
> node $id="8be1d887-d333-4a15-b72d-ac1950973e2c" test2 \
>     attributes class="www"
> node $id="b3a4cc48-7707-4e47-aa10-3cd34230cebc" test1 \
>     attributes class="www"
> primitive elastic-ip-A ocf:ec2:elastic-ip \
>     op monitor on-fail="restart" interval="30" \
>     params ip="184.73.193.93" \
>     meta is-managed="true" resource-stickiness="10" failure-timeout="90"
> primitive elastic-ip-A-email ocf:heartbeat:MailTo \
>     params email="ad...@pagerduty.com" subject="TEST Elastic IP A flip!"
> primitive elastic-ip-B ocf:ec2:elastic-ip \
>     op monitor on-fail="restart" interval="30" \
>     params ip="184.72.236.247" \
>     meta is-managed="true" resource-stickiness="10" failure-timeout="90"
> primitive elastic-ip-B-email ocf:heartbeat:MailTo \
>     params email="ad...@pagerduty.com" subject="TEST Elastic IP B flip!"
> location location-elastic-ip-A elastic-ip-A \
>     rule $id="elastic-ip-A-only-on-www" -inf: class ne www
> location location-elastic-ip-B elastic-ip-B \
>     rule $id="elastic-ip-B-only-on-www" -inf: class ne www
> colocation elastic-ip-A-email-colo 10: elastic-ip-A-email elastic-ip-A
> colocation elastic-ip-B-email-colo 10: elastic-ip-B-email elastic-ip-B
> colocation elastic-ip-colo-1 -inf: elastic-ip-B elastic-ip-A
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
>     cluster-infrastructure="Heartbeat" \
>     stonith-enabled="false" \
>     last-lrm-refresh="1290062815"
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
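
As a rough sketch, the options mentioned in the reply could be combined
along these lines in crm configure syntax, using elastic-ip-A from the
quoted config. The 120s start timeout, the 60s recheck interval and the
migration-threshold of 3 are illustrative assumptions, not values taken
from this thread or tested on EC2:

```
# Assumed values throughout; adjust to how long the EC2 API call really takes.
primitive elastic-ip-A ocf:ec2:elastic-ip \
    params ip="184.73.193.93" \
    op start interval="0" timeout="120s" \
    op monitor interval="30" on-fail="restart" \
    meta is-managed="true" resource-stickiness="10" \
        failure-timeout="90" migration-threshold="3"
property $id="cib-bootstrap-options" \
    start-failure-is-fatal="false" \
    cluster-recheck-interval="60s"
```

With start-failure-is-fatal set to false, a failed start counts toward
migration-threshold instead of banning the resource from that node
outright. Expired failures (failure-timeout) are only noticed when the
cluster re-evaluates its state, so a cluster-recheck-interval shorter
than the default is what lets a 90 second failure-timeout take effect
promptly.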