Paco NATHAN wrote:
Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups
to keep track of different clusters.

Effectively, that requires a new security group for every cluster --
so just allocate a bunch of different ones in a config file, then have
the launch scripts draw from those. We also use EC2 static IP
addresses and then have a DNS entry named similarly to each security
group, associated with a static IP once that cluster is launched.
It's relatively simple to query the running instances and collect them
according to security groups.
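
As a minimal sketch of that grouping step: assume the running instances have already been fetched from EC2 and reduced to plain dicts. The "id" and "groups" field names and the cluster names below are hypothetical, chosen only for illustration, and do not come from any particular EC2 client library.

```python
from collections import defaultdict


def group_by_security_group(instances):
    """Map each security group name to the ids of the instances in it.

    `instances` is an iterable of dicts with hypothetical keys:
    "id" (instance id) and "groups" (list of security group names).
    """
    clusters = defaultdict(list)
    for inst in instances:
        for group in inst["groups"]:
            clusters[group].append(inst["id"])
    return dict(clusters)


if __name__ == "__main__":
    # Made-up sample data standing in for a real "describe instances" call.
    running = [
        {"id": "i-001", "groups": ["cluster-a"]},
        {"id": "i-002", "groups": ["cluster-a"]},
        {"id": "i-003", "groups": ["cluster-b"]},
    ]
    for name, ids in sorted(group_by_security_group(running).items()):
        print(name, ids)
```

With one security group per cluster, each key in the result corresponds to one cluster.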

One way to handle detecting failures is just to attempt SSH in a loop.
Our rough estimate is that approximately 2% of the attempted EC2 nodes
fail at launch. So we allocate more than enough, given that rate.
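
A rough sketch of both ideas, under stated assumptions: the request size is padded so that the expected number of survivors covers what is needed, and the liveness probe is only a TCP connect to port 22 rather than a full SSH login. The function names, retry counts, and delays are illustrative and not taken from our launch scripts.

```python
import math
import socket
import time


def nodes_to_request(needed, failure_rate=0.02):
    """Instances to request so the expected survivors cover `needed`."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return math.ceil(needed / (1.0 - failure_rate))


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to the SSH port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_node(host, attempts=10, delay=5.0, probe=ssh_port_open):
    """Probe `host` in a loop; give up after `attempts` failed tries."""
    for attempt in range(attempts):
        if probe(host):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    print(nodes_to_request(100))
```

Padding by the expected failure rate alone only covers the average case, so a real launcher would likely add a small extra margin on top.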


We have a patch that adds a ping() operation to all the services: namenode, datanode, task tracker, job tracker. There is also a proposal to make that remotely visible: https://issues.apache.org/jira/browse/HADOOP-3969 . With that in place, you could hit every URL with an appropriately authenticated GET and see if it is live.
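
A hedged sketch of what the polling side of such a check might look like, using only the Python standard library. The URLs and the `headers` argument (where an authentication header would go) are placeholders; the actual endpoint path and authentication scheme are not defined here and would depend on how the proposal is implemented.

```python
import urllib.request


def is_live(url, timeout=5.0, headers=None, opener=urllib.request.urlopen):
    """Return True if a GET on `url` answers with a 2xx status.

    `headers` is where an authentication header would be supplied;
    `opener` is injectable so the check can be exercised without a network.
    """
    request = urllib.request.Request(url, headers=headers or {}, method="GET")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are socket
        # timeouts and refused connections.
        return False


def live_report(urls, **kwargs):
    """Check every URL and return a {url: bool} liveness map."""
    return {url: is_live(url, **kwargs) for url in urls}
```

A node that refuses the connection, times out, or returns an error status is simply reported as not live; distinguishing those cases is left out of this sketch.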


Another trick I like for long-haul health checks is to use Google Talk and give every machine a login underneath (you can have them all in your own domain via Google Apps). Then you can use XMPP to monitor system health. A more advanced variant is to use it as the command interface, which is very similar to how botnets work with IRC, though in that case the botnet herders are trying to manage 500,000 0wned boxes and don't want a reply.
