Paco NATHAN wrote:
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To
keep it simple, we pull those from S3 buckets. That provides a more
flexible pattern for managing the frameworks involved than re-doing
the EC2 image whenever you want to apply a patch to Hadoop.

Given that approach, you can add your Hadoop application code
similarly. Just upload the current stable build out of SVN, Git,
whatever, to an S3 bucket.
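
To make that concrete, here is a minimal sketch of the boot-time side
of that pattern, assuming the boto3 library and made-up bucket and key
names; the real buckets, package versions, and install steps would be
your own:

    # Pull the framework tarballs from S3 at first boot and unpack them.
    # Bucket and key names below are illustrative placeholders.
    import subprocess
    import boto3

    BUCKET = "my-cluster-artifacts"
    PACKAGES = ["jdk.tar.gz", "ant.tar.gz", "hadoop-0.19.0.tar.gz"]

    s3 = boto3.client("s3")
    for key in PACKAGES:
        local = "/tmp/" + key
        s3.download_file(BUCKET, key, local)
        subprocess.run(["tar", "xzf", local, "-C", "/opt"], check=True)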

Nice. Your CI tool could upload the latest release tagged as good and the machines could pull it down.
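
One hedged sketch of that CI side, again assuming boto3 and
illustrative names: after a green build, upload the artifact, then
overwrite a well-known "current stable" key that the nodes always
fetch at boot.

    # Upload the tagged release, then repoint the stable key at it.
    # Bucket, key, and artifact names are placeholders, not a real setup.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file("build/myapp-1.4.2.jar",
                   "my-cluster-artifacts", "releases/myapp-1.4.2.jar")
    s3.copy_object(Bucket="my-cluster-artifacts",
                   CopySource={"Bucket": "my-cluster-artifacts",
                               "Key": "releases/myapp-1.4.2.jar"},
                   Key="current/myapp.jar")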

The goal of cluster management is to make adding or removing a node an O(1) problem: you edit one entry in one place to increment or decrement the number of machines, and that's it.
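
For example, that "one entry" could be nothing more than a node count
in a single config file that the launch script reads. This sketch
assumes boto3 and invented names, and it elides reconciling against
instances that are already running.

    # Grow or shrink the cluster by editing node_count in cluster.ini only.
    # Expected contents of cluster.ini (illustrative):
    #   [cluster]
    #   ami = ami-12345678
    #   instance_type = m1.large
    #   node_count = 100
    import configparser
    import boto3

    cfg = configparser.ConfigParser()
    cfg.read("cluster.ini")
    count = cfg.getint("cluster", "node_count")

    ec2 = boto3.client("ec2")
    ec2.run_instances(ImageId=cfg.get("cluster", "ami"),
                      InstanceType=cfg.get("cluster", "instance_type"),
                      MinCount=count, MaxCount=count)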

If you find you have lots of images to keep alive, then your maintenance costs go up. Keep the number of images down to one and you will stay in control.


We use a set of Python scripts to manage a daily, (mostly) automated
launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
on a local server, so that the Hadoop job can send a notification when
it completes and the local server can initiate the download of the
results.  Overall, that minimizes the need for a sysadmin dedicated to
the Hadoop jobs -- a small dev team can handle it while focusing on
algorithm development and testing.
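
A bare-bones sketch of what such a listener might look like, assuming
Hadoop's job-end notification URL (job.end.notification.url) is pointed
at this host with $jobId/$jobStatus substituted into the query string;
the port, parameter names, and download script are illustrative only.

    # Tiny HTTP listener: Hadoop hits this URL when a job finishes,
    # and a successful status kicks off the results download locally.
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    class JobEndHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            params = parse_qs(urlparse(self.path).query)
            job_id = params.get("jobid", ["unknown"])[0]
            status = params.get("jobstatus", ["UNKNOWN"])[0]
            self.send_response(200)
            self.end_headers()
            if status == "SUCCEEDED":
                # download_results.py is a placeholder for your own fetch logic.
                subprocess.Popen(["python", "download_results.py", job_id])

    HTTPServer(("", 8080), JobEndHandler).serve_forever()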

1. We have some components that use Google Talk to relay messages to local boxes behind the firewall. I could imagine hooking up Hadoop status events to that too (a rough sketch follows below).

2. There's an old paper of mine, "Making Web Services that Work", in which I talk about deployment-centric development:
http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html

The idea is that, right from the outset, the dev team works on a cluster that resembles production; the CI server builds to it automatically, and changes get pushed out to production semi-automatically (you tag the version you want pushed out in SVN, and the CI server does the release). The article is focused on services exported to third parties, not back-end work, so it may not all apply to Hadoop deployments.
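
Regarding the Google Talk relay in point 1, here is a rough sketch of
sending a job-status line over XMPP, assuming the slixmpp library; the
JIDs, credentials, and message text are placeholders.

    # Send one Hadoop status message to a box behind the firewall via XMPP.
    from slixmpp import ClientXMPP

    class StatusNotifier(ClientXMPP):
        def __init__(self, jid, password, recipient, message):
            super().__init__(jid, password)
            self.recipient = recipient
            self.message = message
            self.add_event_handler("session_start", self.start)

        async def start(self, event):
            self.send_presence()
            await self.get_roster()
            self.send_message(mto=self.recipient, mbody=self.message,
                              mtype="chat")
            self.disconnect()

    xmpp = StatusNotifier("notifier@example.com", "secret",
                          "ops@example.com", "Hadoop job complete: SUCCEEDED")
    xmpp.connect()
    xmpp.process(forever=False)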

-steve


